mailing list of musl libc
* tough choice on thread pointer initialization issue
@ 2012-02-10  2:58 Rich Felker
  2012-02-10  7:40 ` Solar Designer
  2012-02-25  6:56 ` Rich Felker
  0 siblings, 2 replies; 9+ messages in thread
From: Rich Felker @ 2012-02-10  2:58 UTC (permalink / raw)
  To: musl

hi all,

due to some recent changes in musl, i've detected a long-standing bug
that went unnoticed until now. the way musl's pthread implementation
works, the thread pointer register (and kernel-side support) for the
initial thread is subject to "lazy initialization" - no syscalls are
made to set it up until the first time it's used. this allows us to
keep an absolutely minimal set of syscalls for super-short/trivial
programs, which looks really impressive in the libc comparison charts
and strace runs. however...

the bug i've found occurs when the thread pointer happens to get
initialized in code that runs from a signal handler. this is a rare
situation that will only happen if the program is avoiding
async-signal-unsafe functions in the main flow of execution so that
it's free to use them in a signal handler, but despite being rare,
it's perfectly legal, and right now musl crashes on such programs, for
example:

#include <signal.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>	/* alarm, sleep */
void evil(int sig)
{
	/* first use of the thread pointer happens here, inside the handler */
	printf("hello, world %lx\n", (unsigned long)pthread_self());
}
int main(void)
{
	signal(SIGALRM, evil);
	alarm(1);
	sleep(2);
}

the issue is that when a signal handler returns, all registers,
including the thread-pointer register (%gs on x86, %fs on x86_64), are
reset to the values they had in the code the signal interrupted. thus,
musl thinks the thread pointer is valid at this point, but it's
actually null.

i see 3 possible fixes, none of which are ideal:


approach 1: hack the signal-return "restore" function to save the
current thread register value into the struct sigcontext before
calling SYS_sigreturn, so that it will be preserved when the
interrupted code resumes.

pros: minimal costs, never adds any syscalls versus current musl.

cons: ugly hack, and gdb does not like non-canonical sigreturn
functions (it refuses to work when the instruction pointer is at
them).
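
for concreteness, a rough sketch of what this could look like on i386,
assuming the non-RT signal frame, where struct sigcontext sits at the
top of the stack once the signal number is popped and gs is its first
field (the rt_sigreturn case would need a different offset):

__restore:
	popl %eax		# discard signal number; sigcontext at (%esp)
	movw %gs, (%esp)	# stash the live %gs into sc->gs
	movl $119, %eax		# SYS_sigreturn on i386
	int $0x80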


approach 2: call pthread_self() from sigaction(). this will ensure
that a signal handler never runs prior to the thread pointer being
initialized.

pros: minimal code changes, and avoids adding syscalls except for
programs that use signals but not threads.

cons: adds a syscall, and links unnecessary thread code when static
linking, in any program that uses signal handlers.
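
sketch, with __libc_sigaction as a stand-in name for whatever internal
function actually makes the syscall:

int sigaction(int sig, const struct sigaction *sa, struct sigaction *old)
{
	pthread_self(); /* forces thread-pointer init before any handler runs */
	return __libc_sigaction(sig, sa, old);
}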


approach 3: always initialize the thread pointer from
__libc_start_main (before main runs). (this is the glibc approach)

pros: simple, and allows all the lazy-initialization logic to be
removed, moderately debloating and speeding up lots of thread-related
functions that will be able to use the thread pointer without making a
function call that checks whether it's initialized. would also make it
easier for us to support stack-protector, vsyscall/sysenter syscalls,
and thread-local storage in the future.

cons: constant additional 2-syscall overhead at startup (though it
could be optimized out when static-linking programs that don't use any
thread-related functions). the two syscalls' run times are ~1010ns and
~890ns on my machine, compared to ~260000ns for the exec syscall. one
other possible issue is that we'd need to make sure non-threaded
programs, which would otherwise work on old kernels without thread
support, don't crash due to assuming the thread pointer
is valid in places where they shouldn't need it.
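
conceptually, the change would look like this (illustrative names, not
the actual musl internals):

int __libc_start_main(int (*main_fn)(int, char **, char **),
                      int argc, char **argv)
{
	char **envp = argv + argc + 1;
	__set_thread_area(&__main_pthread); /* eager init: the ~2 syscalls */
	exit(main_fn(argc, argv, envp));
}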


before i make a decision, i'd like to hear if anyone from the
community has strong opinions one way or the other. i've almost ruled
out approach #1 and i'm leaning towards #3, with the idea that
simplicity is worth more than a couple trivial syscalls.

rich



* Re: tough choice on thread pointer initialization issue
  2012-02-10  2:58 tough choice on thread pointer initialization issue Rich Felker
@ 2012-02-10  7:40 ` Solar Designer
  2012-02-10  7:58   ` Rich Felker
  2012-02-25  6:56 ` Rich Felker
  1 sibling, 1 reply; 9+ messages in thread
From: Solar Designer @ 2012-02-10  7:40 UTC (permalink / raw)
  To: musl

Hi Rich,

Thank you for posting this!

On Thu, Feb 09, 2012 at 09:58:25PM -0500, Rich Felker wrote:
> approach 1: hack the signal-return "restore" function to save the
> current thread register value into the struct sigcontext before
> calling SYS_sigreturn, so that it will be preserved when the
> interrupted code resumes.
> 
> pros: minimal costs, never adds any syscalls versus current musl.
> 
> cons: ugly hack, and gdb does not like non-canonical sigreturn
> functions (it refuses to work when the instruction pointer is at
> them).
> 
> 
> approach 2: call pthread_self() from sigaction(). this will ensure
> that a signal handler never runs prior to the thread pointer being
> initialized.
> 
> pros: minimal code changes, and avoids adding syscalls except for
> programs that use signals but not threads.
> 
> cons: adds a syscall, and links unnecessary thread code when static
> linking, in any program that uses signal handlers.

I think another con of the two approaches above is that they'll fail if
a program sets up a signal handler in a way that bypasses musl (and the
other prerequisites of the problem, as you described them, are met as
well).  Indeed, this makes it even more of a special case, but it's
still legal (or not? that's a musl policy question, I guess).

> approach 3: always initialize the thread pointer from
> __libc_start_main (before main runs). (this is the glibc approach)
...

> before i make a decision, i'd like to hear if anyone from the
> community has strong opinions one way or the other. i've almost ruled
> out approach #1 and i'm leaning towards #3, with the idea that
> simplicity is worth more than a couple trivial syscalls.

Not a strong opinion, but how about:

approach 4: initialize the thread pointer register to zero at program
startup.  Then before its use, just check it for non-zero instead of
checking a global flag.  (I presume that you're doing the latter now,
based on your description of the problem.)
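
Roughly, that would mean replacing something like

	if (!tp_init_done) init_tp();

with

	if (!get_thread_reg()) init_tp();

where get_thread_reg() reads %gs or %fs directly.  (Names illustrative;
I am only guessing at the actual musl internals.)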

Alexander



* Re: tough choice on thread pointer initialization issue
  2012-02-10  7:40 ` Solar Designer
@ 2012-02-10  7:58   ` Rich Felker
  2012-02-10 10:42     ` Solar Designer
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2012-02-10  7:58 UTC (permalink / raw)
  To: musl

On Fri, Feb 10, 2012 at 11:40:02AM +0400, Solar Designer wrote:
> > [approaches 1 & 2]
> I think another con of the two approaches above is that they'll fail if
> a program sets up a signal handler in a way that bypasses musl (and the
> other prerequisites of the problem, as you described them, are met as
> well).  Indeed, this makes it even more of a special case, but it's
> still legal (or not? that's a musl policy question, I guess).

Well, the only ways to install a signal handler, short of making the
syscall yourself and bypassing libc, are to call sigaction() or
signal(), so I don't think this is too much of an issue... At that
point, all or at least most bets are off.

> Not a strong opinion, but how about:
> 
> approach 4: initialize the thread pointer register to zero at program
> startup.  Then before its use, just check it for non-zero instead of
> checking a global flag.  (I presume that you're doing the latter now,
> based on your description of the problem.)

Well this is definitely feasible but it makes more arch-specific code,
since examining the thread register needs asm. I suspect "mov %gs,%ax"
might be slow on some machines, too, since the segment registers
aren't intended to be read/written often. I just don't know enough to
know if it's a good approach; if you really think it's potentially the
best, perhaps you could provide more info.

As for approach #3, I've found that, at least on x86, it's possible to
set up an LDT entry for %gs for the main thread's thread pointer even
on ancient kernels, going all the way back to 1.2.x, using the
modify_ldt syscall. This would just be a little bit of fallback code
for when the set_thread_area syscall fails with ENOSYS, and it would
make it so musl binaries using a %gs-based thread pointer would work on
any Linux, not just 2.6+ (musl requires 2.6+ for thread support and
lots of newer features, but simple programs should run on old kernels).
I'm not sure to what extent the same would be possible on other archs.
Anyway, the point is that if we can set up the thread pointer for at
least the main thread even on old kernels, we could mostly (though
possibly arch-dependent) avoid the "con" of having to tiptoe around the
thread pointer in code that might be used in single-threaded apps.
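
Untested sketch of the fallback (i386; struct user_desc is from
<asm/ldt.h>, and very old kernels use a slightly smaller ldt struct, so
the size passed to modify_ldt might need care):

struct user_desc d = {
	.entry_number = -1, .base_addr = (unsigned long)tp,
	.limit = 0xfffff, .seg_32bit = 1, .limit_in_pages = 1, .useable = 1,
};
if (__syscall(SYS_set_thread_area, &d) == -ENOSYS) {
	d.entry_number = 0; /* claim a fixed LDT slot instead */
	__syscall(SYS_modify_ldt, 1, &d, sizeof d);
	__asm__ __volatile__ ("movw %w0,%%gs" : : "r"(0x7)); /* idx 0, TI=1, RPL=3 */
}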

Rich



* Re: tough choice on thread pointer initialization issue
  2012-02-10  7:58   ` Rich Felker
@ 2012-02-10 10:42     ` Solar Designer
  2012-02-10 11:22       ` Solar Designer
  2012-02-10 17:00       ` Rich Felker
  0 siblings, 2 replies; 9+ messages in thread
From: Solar Designer @ 2012-02-10 10:42 UTC (permalink / raw)
  To: musl

On Fri, Feb 10, 2012 at 02:58:18AM -0500, Rich Felker wrote:
> On Fri, Feb 10, 2012 at 11:40:02AM +0400, Solar Designer wrote:
> > approach 4: initialize the thread pointer register to zero at program
> > startup.  Then before its use, just check it for non-zero instead of
> > checking a global flag.  (I presume that you're doing the latter now,
> > based on your description of the problem.)
> 
> Well this is definitely feasible but it makes more arch-specific code,
> since examining the thread register needs asm.

But accessing it needs asm anyway, no?

> I suspect "mov %gs,%ax" might be slow on some machines, too,

It might be.  On the other hand, segment registers are read (saved) on
context switch, so this is not something CPU makers would disregard.
I wouldn't be surprised if some CPU has faster push than mov, though.

> since the segment registers
> aren't intended to be read/written often. I just don't know enough to
> know if it's a good approach; if you really think it's potentially the
> best, perhaps you could provide more info.

"Potentially" is the keyword here, but anyway I just did some testing on
a Core 2'ish CPU (E5420), and the instruction is very fast.  I put 1000
instances of:

movl %ss,%eax
testl %eax,%eax
jz 0

one after another, and this runs in just 1 cycle per the three
instructions above (so 1000 cycles for the 1000 instances total).
Of course, there are data dependencies here, so there's probably a
bypass involved and the instructions are grouped differently, like:

Cycle N:
testl %eax,%eax; jz 0; movl %ss,%eax

Cycle N+1:
testl %eax,%eax; jz 0; movl %ss,%eax

with a bypass between testl and jz.  (I am just guessing, though.)

In actual code (where you won't have repeats like this), you might want
to insert some instructions between the mov and the test (if you can).

I also tested 1000 instances of:

movl %gs,%eax

and:

movw %gs,%ax

and:

movw %gs,%ax
testw %ax,%ax

All of these execute in 1000 cycles total as well.  The "w" forms of
the instructions carry an extra operand-size prefix, so I think they
are better avoided, even though there's no slowdown from them on this CPU.

Alexander



* Re: tough choice on thread pointer initialization issue
  2012-02-10 10:42     ` Solar Designer
@ 2012-02-10 11:22       ` Solar Designer
  2012-02-10 17:00       ` Rich Felker
  1 sibling, 0 replies; 9+ messages in thread
From: Solar Designer @ 2012-02-10 11:22 UTC (permalink / raw)
  To: musl

On Fri, Feb 10, 2012 at 02:42:52PM +0400, Solar Designer wrote:
> [...] I just did some testing on
> a Core 2'ish CPU (E5420), and the instruction is very fast.  I put 1000
> instances of:
> 
> movl %ss,%eax
> testl %eax,%eax
> jz 0
> 
> one after another, and this runs in just 1 cycle per the three
> instructions above (so 1000 cycles for the 1000 instances total).

To sanity-check my testing approach, I tried using:

movw %ss,%ax

followed by 1000 instances of:

movl %eax,%gs

or:

movw %ax,%gs

This executes in about 16.5 cycles per instruction.  Indeed, setting a
segment register is slow - but reading it apparently is not, at least
not on this CPU.

Alexander



* Re: tough choice on thread pointer initialization issue
  2012-02-10 10:42     ` Solar Designer
  2012-02-10 11:22       ` Solar Designer
@ 2012-02-10 17:00       ` Rich Felker
  2012-02-10 19:12         ` Solar Designer
  1 sibling, 1 reply; 9+ messages in thread
From: Rich Felker @ 2012-02-10 17:00 UTC (permalink / raw)
  To: musl

On Fri, Feb 10, 2012 at 02:42:52PM +0400, Solar Designer wrote:
> On Fri, Feb 10, 2012 at 02:58:18AM -0500, Rich Felker wrote:
> > On Fri, Feb 10, 2012 at 11:40:02AM +0400, Solar Designer wrote:
> > > approach 4: initialize the thread pointer register to zero at program
> > > startup.  Then before its use, just check it for non-zero instead of
> > > checking a global flag.  (I presume that you're doing the latter now,
> > > based on your description of the problem.)
> > 
> > Well this is definitely feasible but it makes more arch-specific code,
> > since examining the thread register needs asm.
> 
> But accessing it needs asm anyway, no?

Well, it's a little bit more complicated, but yes. The code that would
need fixing to use this approach is the "lazy init already completed"
version (the __pthread_self() macro/inline), which intentionally has no
references to the init code, to avoid pulling it into static bins that
don't want the extra code and data. This code would have to check that
the register is valid, and if not, back up the registers on the stack
(since the asm is specified not to clobber any in the normal case) and
make a function call to a function that would get the proper value for
the thread register and fix it. This could be done without too much
bloat using some nonstandard calling-convention hacks (putting the
register save/restore code in the callee rather than the caller), I
suppose. I don't claim any of this is impossible or even necessarily
too difficult to make approach 4 feasible, but it does at least
require some thought.
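
The shape I have in mind, for i386 (illustrative; __fixup_tp would use a
nonstandard ABI in which the callee preserves every register):

static inline struct pthread *__pthread_self(void)
{
	struct pthread *self;
	__asm__ (
		"movl %%gs,%0 ; testl %0,%0 ; jnz 1f\n\t"
		"call __fixup_tp\n"	/* clobbers nothing, by contract */
		"1:	movl %%gs:0,%0"
		: "=r" (self) );
	return self;
}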

> "Potentially" is the keyword here, but anyway I just did some testing on
> a Core 2'ish CPU (E5420), and the instruction is very fast.  I put 1000
> instances of:
> 
> movl %ss,%eax
> testl %eax,%eax
> jz 0
> 
> one after another, and this runs in just 1 cycle per the three
> instructions above (so 1000 cycles for the 1000 instances total).
> Of course, there are data dependencies here, so there's probably a
> bypass involved and the instructions are grouped differently, like:

OK, looks like performance is not an issue, at least on your hardware.

> In actual code (where you won't have repeats like this), you might want
> to insert some instructions between the mov and the test (if you can).

Unfortunately the only useful thing would be dereferencing %gs:0,
which is not a good idea if the test/jz hasn't completed... In pure
asm more could be done, but this code is only used in inline asm in C,
not in pure asm functions. After all the ugly arch-specific bugs in
glibc's asm thread primitives, I really don't want to go there.. :-)

> All of these execute in 1000 cycles total as well.  The "w" forms of
> the instructions carry an extra operand-size prefix, so I think they
> are better avoided, even though there's no slowdown from them on this CPU.

I had no idea it was even valid to use the 32-bit (non-"w") forms with
segment registers. Learned something new. Are the high bits just
discarded (when writing) and zeroed (when reading)?

Rich



* Re: tough choice on thread pointer initialization issue
  2012-02-10 17:00       ` Rich Felker
@ 2012-02-10 19:12         ` Solar Designer
  0 siblings, 0 replies; 9+ messages in thread
From: Solar Designer @ 2012-02-10 19:12 UTC (permalink / raw)
  To: musl

On Fri, Feb 10, 2012 at 12:00:15PM -0500, Rich Felker wrote:
> On Fri, Feb 10, 2012 at 02:42:52PM +0400, Solar Designer wrote:
> > All of these execute in 1000 cycles total as well.  The "w" forms of
> > the instructions carry an extra operand-size prefix, so I think they
> > are better avoided, even though there's no slowdown from them on this CPU.
> 
> I had no idea it was even valid to use the 32-bit (non-"w") forms with
> segment registers. Learned something new. Are the high bits just
> discarded (when writing) and zeroed (when reading)?

Yes.

Frankly, I am not sure how portable this is exactly.  The Linux kernel
uses these short forms unconditionally (although there are also a few
instances of "movw"), whereas glibc somehow uses them on 686+ only - see
nptl/sysdeps/i386/tls.h vs. nptl/sysdeps/i386/i686/tls.h.  The latter
has this comment:

/* Macros to load from and store into segment registers.  We can use
   the 32-bit instructions.  */

My guess is that this is one of those things that was always this way,
but was only documented much later - starting with Pentium Pro maybe?
We can download some PDFs from Intel to confirm when this appeared
officially.  I'd start with those for PPro.

There were several things documented at about this time - e.g., the SALC
instruction that was available since 8086 (including clones), but was
only documented by Intel starting with Pentium Pro.

glibc is probably too careful in limiting this to 686+.

Well, I wouldn't be too surprised if some pre-686 CPU did not zero out
the high bits on read from segment registers into 32-bit registers.
If we're merely copying the value into another segment register, this
does not matter, but for the non-zero test it does...
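
One way to sidestep that, at the cost of one extra instruction, is to
zero the full register first and then use the w form, so the test never
depends on what the CPU puts in the high bits:

	xorl %eax,%eax
	movw %gs,%ax
	testl %eax,%eax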

Alexander



* Re: tough choice on thread pointer initialization issue
  2012-02-10  2:58 tough choice on thread pointer initialization issue Rich Felker
  2012-02-10  7:40 ` Solar Designer
@ 2012-02-25  6:56 ` Rich Felker
  2012-02-25 13:32   ` Rich Felker
  1 sibling, 1 reply; 9+ messages in thread
From: Rich Felker @ 2012-02-25  6:56 UTC (permalink / raw)
  To: musl

On Thu, Feb 09, 2012 at 09:58:25PM -0500, Rich Felker wrote:
> the bug i've found occurs when the thread pointer happens to get
> initialized in code that runs from a signal handler. this is a rare
> situation that will only happen if the program is avoiding
> async-signal-unsafe functions in the main flow of execution so that
> it's free to use them in a signal handler, but despite being rare,
> it's perfectly legal, and right now musl crashes on such programs, for
> example:
> [...]
> the issue is that when a signal handler returns, all registers,
> including the thread-pointer register (%gs on x86, %fs on x86_64), are
> reset to the values they had in the code the signal interrupted. thus,
> musl thinks the thread pointer is valid at this point, but it's
> actually null.
> [...]
> before i make a decision, i'd like to hear if anyone from the
> community has strong opinions one way or the other. i've almost ruled
> out approach #1 and i'm leaning towards #3, with the idea that
> simplicity is worth more than a couple trivial syscalls.

PING.

Any more thoughts on this issue?

I've looked into the "solution 4" that Solar proposed, which, simply
put, is making __pthread_self() check for a clobbered thread register
and restore it. While this sounds simple, I've run into a few issues
that make it start to get ugly...

1. The proper value of the %gs register (for i386) or %fs register
(for x86_64) needs to be saved somewhere global so that __pthread_self
can restore it. This seems to be solvable but requires putting a
global 16-bit var in __set_thread_area.s and getting the asm right to
make it PIC-safe, and might call for some extra weak symbol hackery to
avoid pulling in excess code.

2. On x86_64, it seems that 0 is a valid value for %fs, so we can't
really tell if it's uninitialized! I don't have a machine to test
with, but from reading the kernel sources, it looks like %fs is 0 and
a hidden 64-bit offset is stored in a privileged register accessible
only by the kernel (one which hopefully would not be clobbered by
sigreturn, but I'm not sure...) when the thread pointer does not fit
in 32 bits, and the old i386 method (LDT entry and non-zero %fs
selector value) is used when the thread pointer fits in 32 bits. This
might be a show-stopper, because if we can't tell if the thread
pointer is valid, we can't restore it (and the value of %fs might
actually need to differ per-thread if some threads are below the 4gb
boundary and others are located above).

3. The code in __pthread_self() to test %gs and call out to external
code when it needs to be restored is larger than I would like,
especially since the inline asm block uses no clobbers, thus requiring
all registers to be saved across a call from the asm.
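
Apropos of issue 2, for reference: the x86_64 setup path is the
arch_prctl syscall (ARCH_SET_FS, which is 0x1002 in asm/prctl.h).
A sketch of how the main thread's pointer gets installed, and why there
may be nothing observable in %fs afterwards:

	syscall(SYS_arch_prctl, ARCH_SET_FS, tp);
	/* from my reading, the kernel either writes the fs-base MSR and
	   leaves the %fs selector at 0, or uses a GDT TLS slot and a
	   non-zero selector for pointers below 4GB */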

With these issues in mind, Solar's solution is seeming less appealing
and maybe not even feasible. Any ideas for salvaging it, or should I
fall back to one of the other options? (My original leaning was
towards always setting up the thread pointer prior to program entry.)

Rich



* Re: tough choice on thread pointer initialization issue
  2012-02-25  6:56 ` Rich Felker
@ 2012-02-25 13:32   ` Rich Felker
  0 siblings, 0 replies; 9+ messages in thread
From: Rich Felker @ 2012-02-25 13:32 UTC (permalink / raw)
  To: musl

On Sat, Feb 25, 2012 at 01:56:13AM -0500, Rich Felker wrote:
> I've looked into the "solution 4" that Solar proposed, which, simply
> put, is making __pthread_self() check for a clobbered thread register
> and restore it. While this sounds simple, I've run into a few issues
> that make it start to get ugly...

Some revisions... it's not quite as bad as I said, but still not good
either...

> with, but from reading the kernel sources, it looks like %fs is 0 and
> a hidden 64-bit offset is stored in a privileged register accessible
> only by the kernel (one which hopefully would not be clobbered by
> sigreturn, but I'm not sure...) when the thread pointer does not fit
> in 32 bits, and the old i386 method (LDT entry and non-zero %fs
> selector value) is used when the thread pointer fits in 32 bits. This

I think I got the cases reversed, but the issue still stands.

> pointer is valid, we can't restore it (and the value of %fs might
> actually need to differ per-thread if some threads are below the 4gb
> boundary and others are located above).

Per-thread difference is irrelevant since only the main thread can
ever experience this problem; all new threads start with a valid
thread pointer.

Rich


