mailing list of musl libc
* Possible design for global thread list
@ 2018-10-10 16:33 Rich Felker
  2018-10-12 15:56 ` Rich Felker
  2018-10-14 17:32 ` Florian Weimer
From: Rich Felker @ 2018-10-10 16:33 UTC
  To: musl

This came up as part of the tlsdesc discussion, and I'm documenting it
here in case we end up needing it. Doing this would be a radical
departure from musl's early intentional design of not tracking threads
globally, which theoretically has nice scalability properties but
which all break down when you consider that the kernel has to track
them anyway. It also involves a good bit of work, and possibly some
performance or realtime-property hits in some places (see below), so I
don't necessarily think it's a good idea. But in some ways it would be
a lot cleaner/simpler and would facilitate making TLS access somewhat
faster and making __synccall work independently of /proc and other
hacks.

At the time of the last __synccall overhaul in commit
78a8ef47c4d92b7680c52a85f80a81e29da86bb9, I believed it was impossible
to guarantee a consistent view of the list of still-live threads at
the kernel task level. As a result, __synccall depends on availability
of (and ability to open) /proc/self/task to ensure that it has
actually gotten them all, or waited for them to finish exiting.

It turns out that it's not actually impossible to keep a list, and
depending on your perspective, the way to keep such a list is either
incredibly beautiful or incredibly hideous. :-)

What makes it difficult is that an exiting thread, having removed
itself from the list, cannot unlock the list itself; otherwise another
thread could obtain the lock and see the list with the exiting thread
removed before it has actually exited. The solution is to let the
kernel do the unlocking. Instead of having the
CLONE_CHILD_CLEARTID/set_tid_address for each thread point to
something in its own thread structure, have it point to the global
thread-list lock. Then the kernel unlocks the list (and futex-wakes
any waiters on it) atomically with respect to kernel task
termination.
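
For concreteness, here is a sketch of what the creation side might
look like; the lock word, its encoding, and the spawn/start helpers
are made up for illustration and this is not the actual pthread_create
code:

	#define _GNU_SOURCE
	#include <sched.h>       /* clone(), CLONE_* flags */
	#include <sys/types.h>

	/* hypothetical global lock word protecting the thread list:
	 * 0 = unlocked, nonzero = locked (e.g. the holder's tid) */
	int __thread_list_lock;

	static int start(void *arg) { (void)arg; return 0; } /* thread body stub */

	/* Create the kernel task so that, on its exit, the kernel stores 0
	 * to the child-tid address and FUTEX_WAKEs a waiter there. Pointing
	 * that address at __thread_list_lock makes the kernel release the
	 * list lock atomically with kernel task termination, instead of
	 * clearing a tid slot in the thread's own structure. */
	static int spawn(void *arg, void *stack, pid_t *ptid, void *tls)
	{
		int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND
		          | CLONE_THREAD | CLONE_SYSVSEM | CLONE_SETTLS
		          | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;

		/* caller holds __thread_list_lock (with signals blocked) and
		 * links the new thread into the list if this succeeds */
		return clone(start, stack, flags, arg, ptid, tls,
		             (pid_t *)&__thread_list_lock);
	}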

Of course, this futex wake is already used for pthread_join, which
would need another mechanism. This is solved simply: pthread_exit can
FUTEX_REQUEUE a waiting joiner to the thread-list lock. pthread_join
then has to wait on (but need not acquire) the thread-list lock after
waiting on the thread's own exit futex in order to ensure the exit has
actually finished. This is potentially subject to long waits if the
lock is under contention (lots of threads exiting or being created)
and retaken before pthread_join gets to run, but the probability of
collision can be made negligible (only possible under extremely rapid
tid reuse) by using the tid of the exiting thread as the wait value.
Alternatively, the tid of the joiner could be used, making collisions
impossible, but setting up to do this is more complex.
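
A sketch of that handshake using raw futex calls; the helper names,
the exit_futex argument, and the convention that the lock word holds
the owner's tid are all assumptions for illustration, not the actual
implementation:

	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	extern int __thread_list_lock;   /* hypothetical global list lock word */

	/* in pthread_exit, with the list lock held and exit imminent:
	 * move a waiting joiner (wake 0, requeue up to 1) from the thread's
	 * own exit futex over to the list lock, which the kernel will clear
	 * and futex-wake atomically with task exit */
	static void requeue_joiner(int *exit_futex)
	{
		syscall(SYS_futex, exit_futex, FUTEX_REQUEUE, 0, 1,
		        &__thread_list_lock, 0);
	}

	/* in pthread_join, after the wait on the thread's own exit futex:
	 * wait on (without acquiring) the list lock until it no longer
	 * holds the exiting thread's tid, i.e. the kernel has finished
	 * tearing the task down; a real version would use atomic loads */
	static void wait_for_exit(int tid)
	{
		while (__thread_list_lock == tid)
			syscall(SYS_futex, &__thread_list_lock, FUTEX_WAIT,
			        tid, 0, 0, 0);
	}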

Such a lock is already used in the fallback/nommu implementation of
__unmapself. If we used a global thread list lock, no such lock in
__unmapself would be needed.

In order for __synccall to use a global thread list lock, the lock
would need to be async-signal-safe (since __synccall is used to
implement AS-safe functions like setuid). This means taking the lock
would require signals to be blocked, which imposes new requirements on
the implementation of pthread_create and dlopen. In some ways this is
nice; right now pthread_create has special cases for when the new
thread has to manipulate its signal mask after starting, and it would
be simpler/cleaner to just always do it. This would also allow working
around buggy kernels where the clone flags to set the thread pointer,
or to set/clear the tid address, don't work (this was a historic issue
on some archs; we just don't support those broken kernels now): simply
don't use those flags, and instead fix things up in the new thread
before unmasking signals. It would have a slight performance cost,
though.
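
Roughly, the ordering would be as follows; the helper names and
structure are illustrative only:

	#include <signal.h>

	/* parent side of pthread_create: block all signals before taking
	 * the list lock and calling clone, so the child starts with
	 * everything blocked; restore the parent's mask once the child is
	 * linked into the list */
	static void create_blocked(sigset_t *saved)
	{
		sigset_t all;
		sigfillset(&all);
		pthread_sigmask(SIG_BLOCK, &all, saved);
		/* ... take thread-list lock, clone, link new thread ... */
		pthread_sigmask(SIG_SETMASK, saved, 0);
	}

	/* entry point run in the new thread: do any fixups the clone flags
	 * could not be trusted to do (thread pointer, tid bookkeeping), and
	 * only then install the signal mask the application asked for */
	static void new_thread_entry(const sigset_t *wanted)
	{
		/* ... fix up thread pointer / tid if needed ... */
		pthread_sigmask(SIG_SETMASK, wanted, 0);
		/* ... run the application's start function ... */
	}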

Blocking signals in dlopen might be uglier; it's a *long* operation,
and would break any sort of realtime signal handling going on.
However, I think the allocation of new TLS memory (which is all the
lock is needed for) could be moved to *after* all new DSOs are loaded
and relocated, as the last step before committing to success. That
works with the current design, where dlopen only allocates the memory
and __tls_get_new is responsible for claiming it later in the thread
that needs it. But if we switched to having dlopen actually dole out
the allocated memory to all live threads, there would be a
non-constant-time operation that had to take place with the
thread-list lock held and with signals blocked.
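
The non-constant-time part would be something like the following loop
over the (hypothetical) list, ignoring DTV resizing and initialization
of the TLS image for brevity; all of the types and names are made up:

	#include <stddef.h>

	struct tcb { struct tcb *next; void **dtv; /* ... */ };
	extern struct tcb *__thread_list_head;   /* hypothetical list head */

	/* dlopen, with the allocation already done and the DSO relocated,
	 * walks every live thread and hands it its slice of the new TLS
	 * block -- O(number of threads), performed with the thread-list
	 * lock held and signals blocked */
	static void install_new_tls(size_t modid, char *block, size_t per_thread)
	{
		for (struct tcb *t = __thread_list_head; t; t = t->next) {
			t->dtv[modid] = block;
			block += per_thread;
		}
	}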

One potential solution is making the list lock an AS-safe rwlock,
where only operations that will modify the list (thread creation and
exit) need to take the write lock, and only the write lock requires
signals to be blocked. dlopen and __synccall would then only be
readers. (This is an inversion of the roles with the current
__acquire_ptc/__release_ptc/__inhibit_ptc lock, where dlopen has the
"writer" role and thread creation/exit has the "reader" role.)

In order to implement an AS-safe rwlock without unbounded-time
blocking of signals while waiting for the lock, the wrlock logic would
have to go something like:

	for (;;) {
		futex wait for unlocked state;
		block signals;
		cas lock from unlocked to wr-locked state
		if (success) break;
		unblock signals;
	}

This would be unfortunate if rdlocks were a common operation, since
the "block signals" step gives a large window for an arriving reader
to steal the lock. Fortunately, however, rdlocks are very uncommon
operations that only make sense a small, finite number of times in a
process lifetime, so I don't think this matters.
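
In C, that loop might look something like this; the lock encoding
(0 = unlocked, -1 = write-locked, positive values could count readers)
and the names are again just for illustration, and real code would use
proper atomic loads:

	#include <signal.h>
	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	extern int __thread_list_lock;   /* 0 = unlocked; nonzero = held */

	static void tl_wrlock(sigset_t *saved)
	{
		sigset_t all;
		sigfillset(&all);
		for (;;) {
			int v = __thread_list_lock;
			if (v) {
				/* futex wait for unlocked state */
				syscall(SYS_futex, &__thread_list_lock,
				        FUTEX_WAIT, v, 0, 0, 0);
				continue;
			}
			/* block signals before attempting to take the lock,
			 * so the write lock is never held with signals
			 * deliverable */
			pthread_sigmask(SIG_BLOCK, &all, saved);
			if (__sync_bool_compare_and_swap(&__thread_list_lock, 0, -1))
				break;  /* wr-locked; signals stay blocked */
			pthread_sigmask(SIG_SETMASK, saved, 0);  /* lost the race */
		}
	}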

The biggest performance concern here is probably how long the lock
needs to be held for thread creation/exit and whether that interferes
with concurrent creation/exit. Naively I expect the answer is a big
yes. If a separate lock were used (just as now) to prevent dlopen
while the new thread's memory is being allocated, the critical section
for creation could shrink to just the call to __clone plus linking the
new thread into the list on success. From what I remember, that can
easily be 15k cycles or so. At exit the critical section needs to
contain the __unmapself for a detached thread or the FUTEX_REQUEUE for
a joinable thread with a waiting joiner, but these are all fairly
quick syscalls.

So it's probably very questionable whether any of this makes sense to
pursue, but at least it's all now published in case we want/need to at
some point.

Rich


* Re: Possible design for global thread list
  2018-10-10 16:33 Possible design for global thread list Rich Felker
@ 2018-10-12 15:56 ` Rich Felker
  2018-10-14 17:32 ` Florian Weimer
From: Rich Felker @ 2018-10-12 15:56 UTC
  To: musl

On Wed, Oct 10, 2018 at 12:33:06PM -0400, Rich Felker wrote:
> [...]
> 
> So it's probably very questionable whether any of this makes sense to
> pursue, but at least it's all now published in case we want/need to at
> some point.

It should be noted that all the high costs come from making the list
(1) AS-safe, and (2) inclusive of threads that no longer exist in the
abstract machine, but still have running kernel tasks. These
conditions are necessary and sufficient for __synccall to use it for
enumerating threads, but not at all necessary if we just wanted to be
able to install new dynamic TLS synchronously with dlopen.

It would be kinda disappointing to go to the trouble of adding a
global thread list without getting to use it for fixing __synccall,
but I think it is a legitimate option if installing new TLS at dlopen
time looks like the right thing to do but the AS-safe lock looks too
expensive or runs into trouble.

Rich


* Re: Possible design for global thread list
  2018-10-10 16:33 Possible design for global thread list Rich Felker
  2018-10-12 15:56 ` Rich Felker
@ 2018-10-14 17:32 ` Florian Weimer
  2018-10-14 23:40   ` Rich Felker
From: Florian Weimer @ 2018-10-14 17:32 UTC
  To: Rich Felker; +Cc: musl

* Rich Felker:


> Of course, this futex wake is already used for pthread_join, which
> would need another mechanism. This is solved simply: pthread_exit can
> FUTEX_REQUEUE a waiting joiner to the thread-list lock. pthread_join
> then has to wait on (but need not acquire) the thread-list lock after
> waiting on the thread's own exit futex in order to ensure the exit has
> actually finished. This is potentially subject to long waits if the
> lock is under contention (lots of threads exiting or being created)
> and retaken before pthread_join gets to run, but the probability of
> collision can be made negligible (only possible under extremely rapid
> tid reuse) by using the tid of the exiting thread as the wait value.
> Alternatively, the tid of the joiner could be used, making collisions
> impossible, but setting up to do this is more complex.

I'm not sure if this is compatible with existing software which
rapidly joins and creates many threads in succession because it looks
to me that the pthread_join operation can return before the kernel
resources are freed.  As a result, applications will get impossible
EAGAIN failures, even though the application never exceeds the thread
limit.

Depending on kernel version and cgroups configuration, this race can
even be observed with the more usual join sequence because the kernel
signals thread exit too early to user space.


* Re: Possible design for global thread list
  2018-10-14 17:32 ` Florian Weimer
@ 2018-10-14 23:40   ` Rich Felker
  2018-10-17 21:54     ` Florian Weimer
From: Rich Felker @ 2018-10-14 23:40 UTC
  To: Florian Weimer; +Cc: musl

On Sun, Oct 14, 2018 at 07:32:54PM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> 
> > [...]
> 
> I'm not sure if this is compatible with existing software which
> rapidly joins and creates many threads in succession because it looks
> to me that the pthread_join operation can return before the kernel
> resources are freed.  As a result, applications will get impossible
> EAGAIN failures, even though the application never exceeds the thread
> limit.
> 
> Depending on kernel version and cgroups configuration, this race can
> even be observed with the more usual join sequence because the kernel
> signals thread exit too early to user space.

I think you must be confused about something, because either way, what
pthread_join is waiting for is the same kind of futex wake by the
kernel, and happens at the same point during kernel task exit. The
only difference is what address it's at.

Rich


* Re: Possible design for global thread list
  2018-10-14 23:40   ` Rich Felker
@ 2018-10-17 21:54     ` Florian Weimer
From: Florian Weimer @ 2018-10-17 21:54 UTC
  To: Rich Felker; +Cc: musl

* Rich Felker:

> On Sun, Oct 14, 2018 at 07:32:54PM +0200, Florian Weimer wrote:
>> [...]
>
> I think you must be confused about something, because either way, what
> pthread_join is waiting for is the same kind of futex wake by the
> kernel, and happens at the same point during kernel task exit. The
> only difference is what address it's at.

Right, I was confused.

