mailing list of musl libc
* Transition path for removing lazy init of thread pointer
@ 2014-03-24 17:49 Rich Felker
  2014-03-24 23:04 ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2014-03-24 17:49 UTC (permalink / raw)
  To: musl

I've begun looking at removing lazy-init of the thread pointer, one of
the big post-1.0 development items. Since I'm still a bit worried
about whether we can ensure that there's always a valid thread pointer
without dropping support for some old kernels (which were not
officially supported ever, but which de-facto worked for some apps),
I've come up with the following conservative roadmap:

Phase 1: Initialize the thread pointer at startup, but do not make
anything in musl assume that the thread pointer is valid unless the
affected code was already assuming the "or" statement: either the
thread pointer is already set up, or pthread_self() will successfully
set it up. In particular, this phase will leave __errno_location using
its own static errno location for the main thread rather than always
accessing errno via the thread pointer. This is helpful for the
dynamic linker anyway, since we have code that accesses errno early,
and otherwise the dynamic linker would need to first initialize a
potentially-fake thread pointer at startup, then switch to a new one
once the initial TLS size is known. During this phase, we should also
try to make sure there is a new mechanism to ensure that
pthread_create fails (rather than making a corrupt program state) when
the kernel does not support threads; currently, that's done by
checking for failure of pthread_self(), which will no longer be the
correct way.
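
A minimal sketch of the errno part of phase 1, not musl's actual code:
the names struct thread_desc, thread_pointer, main_thread_errno,
errno_location() and install_main_thread() are all hypothetical, and a
plain global stands in for the hardware thread register. The point it
illustrates is that errno stays reachable before the thread pointer is
set up and does not move once it is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thread descriptor; only the errno slot is modelled. */
struct thread_desc {
    int *errno_ptr;
};

/* Stand-in for the hardware thread register; NULL until startup sets it. */
static struct thread_desc *thread_pointer;

/* Static errno storage for the main thread, usable before any setup. */
static int main_thread_errno;

/* Works whether or not the thread pointer has been set up. */
int *errno_location(void)
{
    if (thread_pointer && thread_pointer->errno_ptr)
        return thread_pointer->errno_ptr;
    return &main_thread_errno;
}

/* Startup installs the main thread's descriptor, keeping errno where it was. */
void install_main_thread(struct thread_desc *t)
{
    assert(t != NULL);
    t->errno_ptr = &main_thread_errno;
    thread_pointer = t;
}
```

Early code (such as a dynamic linker) could call errno_location()
before install_main_thread() runs and still see the same storage
afterwards.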

Phase 2: Add workarounds to set up a valid thread pointer on older
kernels that don't directly support it. On i386 and perhaps x86_64,
this could be done with modify_ldt, which has been around since
forever. (modify_ldt can't, AFAIK, create a thread-local %fs/%gs
mapping, but it can make a valid segment for just the main thread to
use.) I'm not sure about other archs though, and whether there are any
that matter where we can't set the thread register from userspace.

Phase 3: If we can ensure to a satisfactory degree that thread pointer
setup never fails, optimize musl-internal things (like errno) to
assume the thread pointer is available. This may require special care
for the early stages of the dynamic linker. Assuming thread pointer
validity is necessary for stack-protector anyway on many archs, and it
would allow us to support building libc itself with stack-protector.

I'm mostly done with phase 1, but right now it includes an ugly mix of
some elements from phases 2 and 3 that shouldn't be touched yet, and
it's missing error checks and code for dealing with what happens when
the thread pointer setup fails (whether this is fatal should, at phase
1, depend on whether the program/libs need TLS).

Rich



* Re: Transition path for removing lazy init of thread pointer
  2014-03-24 17:49 Transition path for removing lazy init of thread pointer Rich Felker
@ 2014-03-24 23:04 ` Rich Felker
  2014-03-24 23:44   ` Justin Cormack
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2014-03-24 23:04 UTC (permalink / raw)
  To: musl

On Mon, Mar 24, 2014 at 01:49:15PM -0400, Rich Felker wrote:
> Phase 1: Initialize the thread pointer at startup, but do not make
> anything in musl assume that the thread pointer is valid unless the
> [...]

Phase 1 is now complete. The intent is that it not break any usage
(even on ancient kernels that are unsupported) that was not already
broken before, so regression reports would be much appreciated if I'm
wrong about that.

At this point we now have two mandatory syscalls at startup on most
archs (just one where setting the thread pointer is entirely a
userspace operation). The second syscall is set_tid_address, and it
seems like it should only be needed if pthread_create is being used
(so that pthread_join will work), but it serves a second purpose of
standing in for gettid() too. We could, however, eliminate it in some
cases:

- In static-linked programs that never need to know their own tid and
  never create threads, it can be skipped. This could be achieved with
  some weak symbol magic.

- Even in dynamic-linked programs, we could defer the tid lookup until
  it's needed by adding a __gettid() function/macro that looks in the
  thread structure, and if it finds zero, calls set_tid_address. This
  might add a few cycles to some synchronization primitives, but we're
  shaving a good number of cycles now anyway since lazy thread-pointer
  initialization is gone and we can inline the thread-pointer access
  in a lot more places.
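
The deferred lookup in the second item could look roughly like the
sketch below. It is only an illustration under stated assumptions:
struct thread_info, lazy_gettid() and lookup_count are hypothetical
names, and the struct stands in for the real thread structure. The
only real interface used is the Linux set_tid_address syscall, which
returns the caller's thread ID. Note that running this under another
libc redirects the kernel's clear-on-exit address for the calling
thread, which should be harmless for a short single-threaded demo.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-thread structure. */
struct thread_info {
    int tid;                /* 0 means "not looked up yet" */
};

static struct thread_info self_thread;
static int lookup_count;    /* number of times the syscall was issued */

/* Return the cached tid, asking the kernel only on first use. */
int lazy_gettid(struct thread_info *t)
{
    assert(t != NULL);
    if (t->tid == 0) {
        /* set_tid_address registers the address the kernel clears at
         * thread exit and returns the caller's thread ID. */
        t->tid = (int)syscall(SYS_set_tid_address, &t->tid);
        lookup_count++;
    }
    return t->tid;
}
```

The extra cost on the fast path is one load and one compare against
zero, which is the "few cycles" tradeoff mentioned above.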

I'd welcome feedback on whether these sorts of optimizations (well,
more like tradeoffs than outright optimizations) are desirable.

Rich



* Re: Transition path for removing lazy init of thread pointer
  2014-03-24 23:04 ` Rich Felker
@ 2014-03-24 23:44   ` Justin Cormack
  2014-03-25  0:01     ` Laurent Bercot
  2014-03-25  0:58     ` Rich Felker
  0 siblings, 2 replies; 8+ messages in thread
From: Justin Cormack @ 2014-03-24 23:44 UTC (permalink / raw)
  To: musl

On Mon, Mar 24, 2014 at 11:04 PM, Rich Felker <dalias@aerifal.cx> wrote:
>
> At this point we now have two mandatory syscalls at startup on most
> archs (just one where setting the thread pointer is entirely a
> userspace operation). The second syscall is set_tid_address, and it
> seems like it should only be needed if pthread_create is being used
> (so that pthread_join will work), but it serves a second purpose of
> standing in for gettid() too. We could, however, eliminate it in some
> cases:

Which archs is it userspace only?

Any idea what the performance benefit of the optimisations might be?
Static linked non threaded programs sounds like a good target.

Justin



* Re: Transition path for removing lazy init of thread pointer
  2014-03-24 23:44   ` Justin Cormack
@ 2014-03-25  0:01     ` Laurent Bercot
  2014-03-25  1:55       ` Rich Felker
  2014-03-25  0:58     ` Rich Felker
  1 sibling, 1 reply; 8+ messages in thread
From: Laurent Bercot @ 2014-03-25  0:01 UTC (permalink / raw)
  To: musl

> Static linked non threaded programs sounds like a good target.

  +1.
  Most of my userspace is made of statically linked, non-threaded
programs, and I would love it if musl were optimal for them.
  Also, what is the mandatory first syscall on startup? With
musl-0.9.15, on kernel 3.2.something, there is no syscall at all
(except execve(), of course) when starting a non-threaded program.

-- 
  Laurent




* Re: Transition path for removing lazy init of thread pointer
  2014-03-24 23:44   ` Justin Cormack
  2014-03-25  0:01     ` Laurent Bercot
@ 2014-03-25  0:58     ` Rich Felker
  1 sibling, 0 replies; 8+ messages in thread
From: Rich Felker @ 2014-03-25  0:58 UTC (permalink / raw)
  To: musl

On Mon, Mar 24, 2014 at 11:44:57PM +0000, Justin Cormack wrote:
> On Mon, Mar 24, 2014 at 11:04 PM, Rich Felker <dalias@aerifal.cx> wrote:
> >
> > At this point we now have two mandatory syscalls at startup on most
> > archs (just one where setting the thread pointer is entirely a
> > userspace operation). The second syscall is set_tid_address, and it
> > seems like it should only be needed if pthread_create is being used
> > (so that pthread_join will work), but it serves a second purpose of
> > standing in for gettid() too. We could, however, eliminate it in some
> > cases:
> 
> Which archs is it userspace only?

Setting the thread pointer is a fully-userspace operation on
microblaze, sh, and powerpc. On the rest it requires a syscall.

Note that one negative side-effect of it being fully userspace is that
we get no indication of whether the kernel is new enough to support
threads properly (whereas normally, a successful syscall to set the
thread pointer means the kernel supports threads). So on these archs,
pthread_create really needs to be doing some extra work to make sure
threads are supported and the clone flags it's using will actually
have an effect; otherwise it could end up creating threads that don't
behave at all like threads... This is an issue we've ignored until
now, but it's fairly minor, especially since microblaze didn't even
exist during 2.4 and sh has been broken in the kernel up until now and
until at least the next kernel release.

> Any idea what the performance benefit of the optimisations might be?
> Static linked non threaded programs sounds like a good target.

Basically just cutting off one syscall that takes ~1000-3000 cycles
and that's always accompanied by a syscall that takes over 100000
cycles (execve). That is at most roughly 1-3% of the exec cost, so not
much, I think.

Rich



* Re: Transition path for removing lazy init of thread pointer
  2014-03-25  0:01     ` Laurent Bercot
@ 2014-03-25  1:55       ` Rich Felker
  2014-03-25  6:35         ` Laurent Bercot
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2014-03-25  1:55 UTC (permalink / raw)
  To: musl

On Tue, Mar 25, 2014 at 12:01:45AM +0000, Laurent Bercot wrote:
> >Static linked non threaded programs sounds like a good target.
> 
>  +1.
>  Most of my userspace is made of statically linked, non-threaded
> programs, and I would love if musl was optimal for it.
>  Also, what is the mandatory first syscall on startup ? With
> musl-0.9.15, on kernel 3.2.something, there is no syscall at all
> (except execve(), of course) when starting a non-threaded program.

The mandatory syscall is set_thread_area or equivalent, e.g.
arch_prctl on x86_64. It's there because most archs need a syscall to
set the thread pointer used for accessing TLS. Even in single-threaded
programs, there are reasons one may want to have it.

The big reason is that, on most archs, stack protector's canary value
is stored at a fixed offset from the thread pointer rather than in a
global, so stack protector can't work without the thread pointer being
initialized. Up to now we've tried to detect whether stack protector
is used based on symbol references to __stack_chk_fail, but this check
gives a false negative (and thus crashing programs) if gcc optimizes
out the check to __stack_chk_fail but not the load of the canary, e.g.
in the program: int main() { exit(0); }
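
To make the "fixed offset from the thread pointer" point concrete,
here is a hedged sketch for x86_64 Linux only. It assumes the usual
GCC convention there of loading the canary from %fs:0x28, and it uses
arch_prctl with ARCH_GET_FS to read the thread pointer the host libc
already set rather than setting one itself. The function name and
CANARY_OFFSET are made up for the example; on any other target it just
reports failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_GET_FS
#define ARCH_GET_FS 0x1003      /* value from the kernel's <asm/prctl.h> */
#endif

/* Assumption: x86_64 Linux, where the canary is read from %fs:0x28. */
#define CANARY_OFFSET 0x28

/* Returns 0 and fills *tp and *canary on x86_64 Linux, -1 elsewhere. */
int read_thread_pointer_and_canary(uintptr_t *tp, uintptr_t *canary)
{
    assert(tp != NULL && canary != NULL);
#if defined(__x86_64__) && defined(__linux__) && !defined(__ILP32__)
    unsigned long base = 0;

    /* Ask the kernel for the current %fs base, i.e. the thread pointer. */
    if (syscall(SYS_arch_prctl, ARCH_GET_FS, &base) != 0 || base == 0)
        return -1;

    *tp = (uintptr_t)base;
    /* Same word that compiler-generated prologues load via %fs:0x28. */
    *canary = *(const uintptr_t *)(base + CANARY_OFFSET);
    return 0;
#else
    return -1;
#endif
}
```

If the thread pointer were never initialized, the equivalent load in a
compiler-generated function prologue would fault, which is the crash
described above.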

The other main reason is that lazy initialization is a lot more
expensive at runtime. All sorts of functions that need the thread
pointer (mainly synchronization primitives, like pthread_mutex_lock
when working with a recursive, error-checking, or robust mutex)
previously had to call a fairly heavy pthread_self() function that
performed the lazy initialization. Now they can just use an inline
implementation (usually asm) that obtains the thread pointer, and
for some functions having this inline means they're leaf functions and
the compiler can eliminate a lot of ugly prologue/epilogue/spilling
needed for non-leaf functions.
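
A rough model of the difference, with globals standing in for the
thread register. Everything here (struct tcb, self_lazy(),
startup_init(), self_eager(), the counters) is a hypothetical name for
illustration and not musl's real code, which uses per-arch asm to read
the register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thread control block. */
struct tcb {
    int id;
};

static struct tcb main_tcb = { 1 };

/* Two stand-ins for the thread register, one per scheme. */
static struct tcb *lazy_tp;
static struct tcb *eager_tp;
static int lazy_init_calls;

/* Old scheme: out-of-line call that checks for, and may perform, setup. */
struct tcb *self_lazy(void)
{
    if (!lazy_tp) {
        lazy_tp = &main_tcb;    /* real code would issue a syscall here */
        lazy_init_calls++;
    }
    return lazy_tp;
}

/* New scheme: startup code sets the pointer exactly once... */
void startup_init(void)
{
    eager_tp = &main_tcb;
}

/* ...so the accessor is a plain load that the compiler can inline. */
static inline struct tcb *self_eager(void)
{
    assert(eager_tp != NULL);
    return eager_tp;
}
```

With the eager version, a caller such as a mutex lock routine makes no
function call to find its own thread, so it can remain a leaf function.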

So despite always initializing the thread pointer kinda looking like
"bloat" from a minimal-program standpoint, it's really a major step
forward in debloating and simplifying lots of code.

Rich



* Re: Transition path for removing lazy init of thread pointer
  2014-03-25  1:55       ` Rich Felker
@ 2014-03-25  6:35         ` Laurent Bercot
  2014-03-25  7:11           ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Laurent Bercot @ 2014-03-25  6:35 UTC (permalink / raw)
  To: musl

On 25/03/2014 01:55, Rich Felker wrote:
> The mandatory syscall is set_thread_area or equivalent, e.g.
> arch_prctl on x86_64. It's there because most archs need a syscall to
> set the thread pointer used for accessing TLS. Even in single-threaded
> programs, there are reasons one may want to have it.
>
> The big reason is that, on most archs, stack protector's canary value
> is stored at a fixed offset from the thread pointer rather than in a
> global, so stack protector can't work without the thread pointer being
> initialized. Up to now we've tried to detect whether stack protector
> is used based on symbol references to __stack_chk_fail, but this check
> gives a false negative (and thus crashing programs) if gcc optimizes
> out the check to __stack_chk_fail but not the load of the canary, e.g.
> in the program: int main() { exit(0); }

  That's a good reason indeed.
  I take it you're still hell-bent against compile-time options ? Because
a musl compile-time option "I don't want this musl to support stack
protector, yes I know it will crash programs compiled with it, but I'm
a big boy and know what I'm doing" would be great for OCD people like
me who like their strace clean. :)


> The other main reason is that lazy initialization is a lot more
> expensive at runtime.

  That's not a good reason for single-threaded programs.


> So despite always initializing the thread pointer kinda looking like
> "bloat" from a minimal-program standpoint, it's really a major step
> forward in debloating and simplifying lots of code.

  I totally understand and approve for multi-threaded programs and
programs using stack protection. I just wish there were a special
optimization for "int main() { return 0; }".

-- 
  Laurent




* Re: Transition path for removing lazy init of thread pointer
  2014-03-25  6:35         ` Laurent Bercot
@ 2014-03-25  7:11           ` Rich Felker
  0 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2014-03-25  7:11 UTC (permalink / raw)
  To: musl

On Tue, Mar 25, 2014 at 06:35:02AM +0000, Laurent Bercot wrote:
> On 25/03/2014 01:55, Rich Felker wrote:
> >The mandatory syscall is set_thread_area or equivalent, e.g.
> >arch_prctl on x86_64. It's there because most archs need a syscall to
> >set the thread pointer used for accessing TLS. Even in single-threaded
> >programs, there are reasons one may want to have it.
> >
> >The big reason is that, on most archs, stack protector's canary value
> >is stored at a fixed offset from the thread pointer rather than in a
> >global, so stack protector can't work without the thread pointer being
> >initialized. Up to now we've tried to detect whether stack protector
> >is used based on symbol references to __stack_chk_fail, but this check
> >gives a false negative (and thus crashing programs) if gcc optimizes
> >out the check to __stack_chk_fail but not the load of the canary, e.g.
> >in the program: int main() { exit(0); }
> 
>  That's a good reason indeed.
>  I take it you're still hell-bent against compile-time options ? Because

In general, no. I'll probably eventually accept compile-time options
for things like iconv charset selection.

But gratuitous ones, yes. Especially if supporting the compile-time
option significantly complicates the code and forces us to have
multiple #ifdef/#else cases, à la uClibc. Making the thread pointer
optional would, at least in the long term, be one of those, since it
either precludes all optimization and simplification that assumes the
thread pointer is available, or forces us to have multiple versions of
the same code for with/without it.

> a musl compile-time option "I don't want this musl to support stack
> protector, yes I know it will crash programs compiled with it, but I'm
> a big boy and know what I'm doing" would be great for OCD people like
> me who like their strace clean. :)

Yeah, this is really just a case of appealing to OCD, so thanks for
acknowledging that. :-)

I think we could still consider making the second syscall
(set_tid_address) get optimized out in static binaries that don't need
it, but it's enough of a complexity burden that I'd like to see what
others have to say about it, and at least wait to see how hard it
would be, once other cleanups related to this change are made.

> >The other main reason is that lazy initialization is a lot more
> >expensive at runtime.
> 
>  That's not a good reason for single-threaded programs.

Well, there are a lot of mostly-useless micro-optimizations you could
do that, theoretically, improve single-threaded programs. Like
accessing errno directly. The problem is that these preclude doing
major systemic simplifications that have much greater debloating
effects (even on single-threaded programs!) unless we make a whole
separate single-thread-only libc.

For example, __stdio_read and __stdio_write just got simpler because
they no longer have to special-case the threaded/non-threaded cases to
avoid gratuitous thread-pointer loads and possible crashes.

And pthread_setcancelstate, which is used in various functions which
need to avoid triggering cancellation, is now simpler since it knows
the absence/presence of a thread pointer will be constant (before, it
had to be able to get/set state before the thread pointer was loaded
for consistency in case it's loaded later).

Right now that's about it for code that gets linked in NON-threaded
programs, but there will probably be more that gets simplified later,
and a lot more if you count code for programs using threads.

> >So despite always initializing the thread pointer kinda looking like
> >"bloat" from a minimal-program standpoint, it's really a major step
> >forward in debloating and simplifying lots of code.
> 
>  I totally understand and approve for multi-threaded programs and
> programs using stack protection. I just wish there were a special
> optimization for "int main() { return 0; }".

Yes, I miss the extreme-minimal strace too, but it's still pretty damn
minimal and not going to get any bigger anytime soon. What I don't
miss is the messy undocumented logic for lazy initialization.

Rich



