mailing list of musl libc
* Draft outline of thread-list design
From: Rich Felker @ 2019-02-12 18:26 UTC (permalink / raw)
  To: musl

Here's a draft of the thread-list design, proposed previously as a
better way to do dynamic TLS installation, and now as a solution to
the problem of __synccall's use of /proc/self/task being (apparently
hopelessly) broken:



Goal of simplicity and correctness, not micro-optimizing.

List lock is fully AS-safe. Taking lock requires signals be blocked.
Could be an rwlock, where only thread creation and exit require the
write lock, but this is not necessary for correctness; it is only a
possible optimization if other operations that need access under high
concurrency would benefit.
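
A minimal standalone sketch of the lock protocol this implies (the
struct, field, and function names are placeholders, not actual musl
internals):

#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static struct {
    volatile int lock;  /* futex word: 0 = unlocked, else owner's tid */
    /* head of the list of live threads would live here too */
} tl;

static void tl_lock(sigset_t *old)
{
    sigset_t set;
    sigfillset(&set);
    /* signals stay blocked for the whole time the lock is held */
    pthread_sigmask(SIG_BLOCK, &set, old);
    int self = syscall(SYS_gettid), v;
    while ((v = __sync_val_compare_and_swap(&tl.lock, 0, self)))
        syscall(SYS_futex, &tl.lock, FUTEX_WAIT, v, 0, 0, 0);
}

static void tl_unlock(const sigset_t *old)
{
    __sync_lock_release(&tl.lock);  /* store 0 */
    syscall(SYS_futex, &tl.lock, FUTEX_WAKE, 1, 0, 0, 0);
    pthread_sigmask(SIG_SETMASK, old, 0);
}

Keeping the owner's tid in the lock word is what makes the join-side
elision described below possible.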


pthread_create:

Take lock, create new thread, on success add to list, unlock. New
thread has new responsibility of unblocking signals, since it inherits
a fully-blocked signal mask from the parent holding the lock. New
thread should be created with its tid address equal to the thread list
lock's address, so that set_tid_address never needs to be called
later. This simplifies logic that previously had to be aware of detach
state and adjust the exit futex address accordingly to be safe against
clobbering freed memory.
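
Roughly, as a sketch (building on the lock sketch above; __clone is
musl's internal clone wrapper, while the struct pthread fields,
start(), and list_add() are placeholders):

static int create_thread(struct pthread *new)
{
    sigset_t old;
    int ret;

    tl_lock(&old);

    /* ctid is &tl.lock: the kernel's exit-time clear and futex wake on
     * that address doubles as the "kernel task is gone" notification,
     * so set_tid_address never has to be called later regardless of
     * detach state. */
    ret = __clone(start, new->stack,
                  CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|
                  CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|
                  CLONE_CHILD_CLEARTID,
                  new, &new->tid, new->tp, &tl.lock);

    if (ret >= 0) list_add(new);    /* linked while the lock is still held */

    tl_unlock(&old);

    /* The child starts with the parent's fully blocked mask and is
     * responsible for setting its own signal mask in start(). */
    return ret < 0 ? -ret : 0;
}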

pthread_exit:

Take lock. If this is the last thread, unlock and call exit(0).
Otherwise, do cleanup work, set state to exiting, remove self from
list. List will be unlocked when the kernel task exits. Unfortunately
there can be a nontrivial (non-constant) amount of cleanup work to do
if the thread left locks held, but since this should not happen in
correct code, it probably doesn't matter.
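
And the corresponding exit path, as a sketch (same placeholder names;
DT_EXITING stands in for whatever the exiting detach_state value ends
up being):

static _Noreturn void exit_thread(struct pthread *self)
{
    sigset_t old;

    tl_lock(&old);

    if (self->next == self) {       /* last thread on the (circular) list */
        tl_unlock(&old);
        exit(0);
    }

    /* cleanup work goes here; non-constant only if the exiting thread
     * left locks held, which only happens in buggy programs */

    self->detach_state = DT_EXITING;    /* joiners observe this first */
    list_remove(self);

    /* No explicit unlock: since ctid == &tl.lock, the kernel clears the
     * lock word and futex-wakes it when this task actually exits, and
     * that is what releases the list lock. */
    for (;;) syscall(SYS_exit, 0);
}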

pthread_kill, pthread_[gs]etsched(param|prio):

These could remain as they are (would require keeping the kill lock
separate in pthread_exit, not described above), or could be modified
to use the global thread list lock. The former optimizes these
functions slightly; the latter optimizes thread exit (by reducing
the number of locks involved).

pthread_join:

A joiner can no longer see the exit of the individual kernel thread
via the exit futex (detach_state), so after seeing it in an exiting
state, it must instead use the thread list to confirm completion of
exit. The obvious way to do this is by taking a lock on the list and
immediately releasing it, but the actual taking of the lock can be
elided by simply doing a futex wait on the lock owner being equal to
the tid (or an exit sequence number if we prefer that) of the exiting
thread. In the case of tid reuse collisions, at worst this reverts to
the cost of waiting for the lock to be released.
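
On the join side, the wait might look something like this (sketch
only; detach_state futex handling is simplified, placeholder names as
above):

static void join_wait_for_exit(struct pthread *t)
{
    int tid = t->tid, v;

    /* 1. Wait, as now, for the target to reach the exiting state. */
    while ((v = t->detach_state) != DT_EXITING)
        syscall(SYS_futex, &t->detach_state, FUTEX_WAIT, v, 0, 0, 0);

    /* 2. Confirm the kernel task is really gone.  The exiting thread
     * still owns the list lock, and the kernel clears that word (its
     * ctid) and wakes it at task exit, so waiting for owner != tid is
     * enough; actually taking and releasing the lock is not needed. */
    while ((v = tl.lock) == tid)
        syscall(SYS_futex, &tl.lock, FUTEX_WAIT, v, 0, 0, 0);
}

A tid-reuse collision only means the second loop degenerates into
waiting for the current lock holder to release the lock, which is the
worst case described above.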

dlopen:

Take thread list lock in place of __inhibit_ptc. Thread list can
subsequently be used to install new DTLS in all existing threads, and
__tls_get_addr/tlsdesc functions can be streamlined.
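
For illustration only (dtv layout and where the new storage comes from
are glossed over; all names are placeholders):

/* With the thread list lock held in place of __inhibit_ptc, walk the
 * list and install the new module's TLS in every existing thread. */
for (struct pthread *t = all_threads; ; t = t->next) {
    t->dtv[new_tls_id] = new_dtls_storage_for(t);   /* placeholder */
    if (t->next == all_threads) break;
}
/* the idea being that __tls_get_addr/tlsdesc can then rely on the
 * entry already being present instead of needing a slow path */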

__synccall:

Take thread list lock. Signal each thread individually with tkill.
Signaled threads no longer need to enqueue themselves on a list; they
only need to wait until the signaling thread tells them to run the
callback, and report back when they have finished it, which can be
done via a single futex indicating whose turn it is to run.
(Conceptually, this should not even be needed, since the signaling
thread can just signal in sequence, but the intent is to be robust
against spurious signals arriving from outside sources.) The idea is,
for each thread: (1) set futex value to its tid, (2) send signal, (3)
wait on futex to become 0 again. Signal handler simply returns if
futex value != its tid, then runs the callback, then zeros the futex
and performs a futex wake. Code should be tiny compared to now, and
need not pull in any dependency on semaphores, PI futexes, etc.
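
Sketch of the handshake (SIGSYNCCALL stands in for whatever internal
signal number is used; error handling and the caller's own callback
invocation are simplified, placeholder names as before):

static volatile int turn;          /* tid whose turn it is, or 0 */
static void (*cb)(void *);
static void *cb_arg;

static void synccall_handler(int sig)
{
    if (turn != syscall(SYS_gettid)) return;   /* spurious/outside signal */
    cb(cb_arg);
    turn = 0;
    syscall(SYS_futex, &turn, FUTEX_WAKE, 1, 0, 0, 0);
}

/* caller, with the thread list lock held: */
static void signal_all(struct pthread *head, struct pthread *self)
{
    int v;
    for (struct pthread *t = head; ; t = t->next) {
        if (t != self) {
            turn = t->tid;                             /* (1) */
            syscall(SYS_tkill, t->tid, SIGSYNCCALL);   /* (2) */
            while ((v = turn) != 0)                    /* (3) */
                syscall(SYS_futex, &turn, FUTEX_WAIT, v, 0, 0, 0);
        }
        if (t->next == head) break;
    }
    cb(cb_arg);    /* and run it in the calling thread as well */
}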






* Re: Draft outline of thread-list design
From: Rich Felker @ 2019-02-12 20:23 UTC (permalink / raw)
  To: musl

On Tue, Feb 12, 2019 at 01:26:25PM -0500, Rich Felker wrote:
> Here's a draft of the thread-list design, proposed previously as a
> better way to do dynamic TLS installation, and now as a solution to
> the problem of __synccall's use of /proc/self/task being (apparently
> hopelessly) broken:
> 
> 
> 
> Goal of simplicity and correctness, not micro-optimizing.
> 
> List lock is fully AS-safe. Taking lock requires signals be blocked.
> Could be an rwlock, where only thread creation and exit require the
> write lock, but this is not necessary for correctness; it is only a
> possible optimization if other operations that need access under high
> concurrency would benefit.
> 
> 
> pthread_create:
> 
> Take lock, create new thread, on success add to list, unlock. New
> thread has new responsibility of unblocking signals, since it inherits
> a fully-blocked signal mask from the parent holding the lock. New
> thread should be created with its tid address equal to the thread list
> lock's address, so that set_tid_address never needs to be called
> later. This simplifies logic that previously had to be aware of detach
> state and adjust the exit futex address accordingly to be safe against
> clobbering freed memory.
> 
> pthread_exit:
> 
> Take lock. If this is the last thread, unlock and call exit(0).
> Otherwise, do cleanup work, set state to exiting, remove self from
> list. List will be unlocked when the kernel task exits. Unfortunately
> there can be a nontrivial (non-constant) amount of cleanup work to do
> if the thread left locks held, but since this should not happen in
> correct code, it probably doesn't matter.

It should be possible to eliminate the unbounded time the lock is held
and a lot of the serializing effects (but not all) by switching to
having two lists each with their own lock: live threads and exiting
threads. pthread_exit would start by moving the caller from the live
list to the exiting list, holding both locks. After that, the rest of
the function could run without any lock held until just before exit,
when the exiting thread list lock would need to be taken for the
thread to change its state to exited and remove itself from the list.

With this change, pthread_create and dlopen would only need to
synchronize against the live threads list, and pthread_join would only
need to synchronize against the exiting threads list. Only
pthread_exit and __synccall would need to synchronize with both.

I doubt this change makes sense though, at least not actually moving
the exiting thread to a new list. The expected amount of work between
the unlock and re-lock is much less than the cost of a lock cycle. We
could still use a separate exit lock to reduce the window during which
the live thread list lock has to be held, lowering the serializing
pressure on pthread_create, but I don't think there's actually any
advantage to having lower serialization pressure on create than on
exit, since they come in pairs.

Rich



* Re: Draft outline of thread-list design
From: Alexey Izbyshev @ 2019-02-14 21:16 UTC (permalink / raw)
  To: musl

On 2019-02-12 21:26, Rich Felker wrote:
> pthread_join:
> 
> A joiner can no longer see the exit of the individual kernel thread
> via the exit futex (detach_state), so after seeing it in an exiting
> state, it must instead use the thread list to confirm completion of
> exit. The obvious way to do this is by taking a lock on the list and
> immediately releasing it, but the actual taking of the lock can be
> elided by simply doing a futex wait on the lock owner being equal to
> the tid (or an exit sequence number if we prefer that) of the exiting
> thread. In the case of tid reuse collisions, at worst this reverts to
> the cost of waiting for the lock to be released.
> 
Since the kernel wakes only a single thread waiting on ctid address, 
wouldn't the joiner still need to do a futex wake to unblock other 
potential waiters even if it doesn't actually take the lock by changing 
*ctid?

> __synccall:
> 
> Take thread list lock. Signal each thread individually with tkill.
> Signaled threads no longer need to enqueue themselves on a list; they
> only need to wait until the signaling thread tells them to run the
> callback, and report back when they have finished it, which can be
> done via a single futex indicating whose turn it is to run.
> (Conceptually, this should not even be needed, since the signaling
> thread can just signal in sequence, but the intent is to be robust
> against spurious signals arriving from outside sources.) The idea is,
> for each thread: (1) set futex value to its tid, (2) send signal, (3)
> wait on futex to become 0 again. Signal handler simply returns if
> futex value != its tid, then runs the callback, then zeros the futex
> and performs a futex wake. Code should be tiny compared to now, and
> need not pull in any dependency on semaphores, PI futexes, etc.

Wouldn't the handler also need to wait until *all* threads run the 
callback? Otherwise, a thread might continue execution while its uid 
still differs from uids of some other threads.

In general, to my limited expertise, the design looks simple and clean. 
I'm not sure whether it's worth optimizing to reduce serialization 
pressure on pthread_create()/pthread_exit() because creating a large 
amount of short-lived threads doesn't look like a good idea anyway.

Alexey




* Re: Draft outline of thread-list design
From: Rich Felker @ 2019-02-14 22:32 UTC (permalink / raw)
  To: Alexey Izbyshev; +Cc: musl

On Fri, Feb 15, 2019 at 12:16:39AM +0300, Alexey Izbyshev wrote:
> On 2019-02-12 21:26, Rich Felker wrote:
> >pthread_join:
> >
> >A joiner can no longer see the exit of the individual kernel thread
> >via the exit futex (detach_state), so after seeing it in an exiting
> >state, it must instead use the thread list to confirm completion of
> >exit. The obvious way to do this is by taking a lock on the list and
> >immediately releasing it, but the actual taking of the lock can be
> >elided by simply doing a futex wait on the lock owner being equal to
> >the tid (or an exit sequence number if we prefer that) of the exiting
> >thread. In the case of tid reuse collisions, at worst this reverts to
> >the cost of waiting for the lock to be released.
> >
> Since the kernel wakes only a single thread waiting on ctid address,
> wouldn't the joiner still need to do a futex wake to unblock other
> potential waiters even if it doesn't actually take the lock by
> changing *ctid?

I'm not sure. If it's just a single wake rather than a broadcast then
yes, but only if it waited. If it observed the lock word already
unequal to the exiting thread's tid without performing a futex wait,
then it doesn't have to do a futex wake.
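
i.e., roughly (reusing the placeholder names from the earlier
sketches):

int waited = 0, v;
while ((v = tl.lock) == tid) {
    waited = 1;
    syscall(SYS_futex, &tl.lock, FUTEX_WAIT, v, 0, 0, 0);
}
/* the kernel's CHILD_CLEARTID wake is single-count, so pass it along
 * only if we might have consumed it */
if (waited) syscall(SYS_futex, &tl.lock, FUTEX_WAKE, 1, 0, 0, 0);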

> 
> >__synccall:
> >
> >Take thread list lock. Signal each thread individually with tkill.
> >Signaled threads no longer need to enqueue themselves on a list; they
> >only need to wait until the signaling thread tells them to run the
> >callback, and report back when they have finished it, which can be
> >done via a single futex indicating whose turn it is to run.
> >(Conceptually, this should not even be needed, since the signaling
> >thread can just signal in sequence, but the intent is to be robust
> >against spurious signals arriving from outside sources.) The idea is,
> >for each thread: (1) set futex value to its tid, (2) send signal, (3)
> >wait on futex to become 0 again. Signal handler simply returns if
> >futex value != its tid, then runs the callback, then zeros the futex
> >and performs a futex wake. Code should be tiny compared to now, and
> >need not pull in any dependency on semaphores, PI futexes, etc.
> 
> Wouldn't the handler also need to wait until *all* threads run the
> callback? Otherwise, a thread might continue execution while its uid
> still differs from uids of some other threads.

Yes, that's correct. We actually do need the current approach of first
capturing all the threads in a signal handler, then making them run
the callback, then releasing them to return, in three rounds. No
application code should be able to run with the process in a
partially-mutated state.
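
A rough sketch of the handler side under that refinement (placeholder
names again, reusing cb/cb_arg from the earlier sketch; the per-thread
turn check from the original outline and the caller's phase driving
are omitted for brevity):

static volatile int phase;       /* 0 = capture, 1 = run callback, 2 = release */
static volatile int done_count;  /* threads finished with the current phase */

static void synccall_handler(int sig)
{
    /* round 1: report capture, then park until everyone is captured */
    __sync_fetch_and_add(&done_count, 1);
    syscall(SYS_futex, &done_count, FUTEX_WAKE, 1, 0, 0, 0);
    while (phase < 1)
        syscall(SYS_futex, &phase, FUTEX_WAIT, 0, 0, 0, 0);

    /* round 2: every captured thread runs the callback */
    cb(cb_arg);
    __sync_fetch_and_add(&done_count, 1);
    syscall(SYS_futex, &done_count, FUTEX_WAKE, 1, 0, 0, 0);
    while (phase < 2)
        syscall(SYS_futex, &phase, FUTEX_WAIT, 1, 0, 0, 0);

    /* round 3: released; application code resumes only after the
     * callback has run in every thread */
}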

> In general, to my limited expertise, the design looks simple and
> clean. I'm not sure whether it's worth optimizing to reduce
> serialization pressure on pthread_create()/pthread_exit() because
> creating a large amount of short-lived threads doesn't look like a
> good idea anyway.

Yes. One thing I did notice is that the window where pthread_create
has to hold a lock to prevent new dlopen from happening is a lot
larger than the window where the thread list needs to be locked, and
contains mmap/mprotect. I think we should add a new "DTLS lock" here
that's held for the whole time, with a protocol that if you need both
the DTLS lock and the thread list lock, you take them in that order
(dlopen would also need them both). This reduces the thread list lock
window to just the __clone call and list update.
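
In outline (placeholder names; dtls_lock/dtls_unlock stand for the
proposed new lock):

void dtls_lock(void), dtls_unlock(void);   /* the proposed new lock */

static int create_thread_outline(struct pthread *new)
{
    sigset_t old;

    dtls_lock();        /* excludes dlopen for the whole setup ... */
    /* ... allocate stack, TLS, guard pages (mmap/mprotect) here ... */

    tl_lock(&old);      /* ... but the list lock only covers this part */
    /* __clone(...) and list_add(new) */
    tl_unlock(&old);

    dtls_unlock();
    return 0;
}

/* dlopen takes them in the same order: dtls_lock() first, then the
 * thread list lock when it walks the list to install new DTLS. */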

Rich



* Re: Draft outline of thread-list design
From: Alexey Izbyshev @ 2019-02-14 22:54 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On 2019-02-15 01:32, Rich Felker wrote:
> On Fri, Feb 15, 2019 at 12:16:39AM +0300, Alexey Izbyshev wrote:
>> On 2019-02-12 21:26, Rich Felker wrote:
>> >pthread_join:
>> >
>> >A joiner can no longer see the exit of the individual kernel thread
>> >via the exit futex (detach_state), so after seeing it in an exiting
>> >state, it must instead use the thread list to confirm completion of
>> >exit. The obvious way to do this is by taking a lock on the list and
>> >immediately releasing it, but the actual taking of the lock can be
>> >elided by simply doing a futex wait on the lock owner being equal to
>> >the tid (or an exit sequence number if we prefer that) of the exiting
>> >thread. In the case of tid reuse collisions, at worst this reverts to
>> >the cost of waiting for the lock to be released.
>> >
>> Since the kernel wakes only a single thread waiting on ctid address,
>> wouldn't the joiner still need to do a futex wake to unblock other
>> potential waiters even if it doesn't actually take the lock by
>> changing *ctid?
> 
> I'm not sure. If it's just a single wake rather than a broadcast then
> yes, but only if it waited. If it observed the lock word already
> unequal to the exiting thread's tid without performing a futex wait,
> then it doesn't have to do a futex wake.
> 
Yes, it's a single wake: 
<http://man7.org/linux/man-pages/man2/set_tid_address.2.html>, 
<https://elixir.bootlin.com/linux/v4.20.8/source/kernel/fork.c#L1292>.
> 
>> In general, to my limited expertise, the design looks simple and
>> clean. I'm not sure whether it's worth optimizing to reduce
>> serialization pressure on pthread_create()/pthread_exit() because
>> creating a large amount of short-lived threads doesn't look like a
>> good idea anyway.
> 
> Yes. One thing I did notice is that the window where pthread_create
> has to hold a lock to prevent new dlopen from happening is a lot
> larger than the window where the thread list needs to be locked, and
> contains mmap/mprotect. I think we should add a new "DTLS lock" here
> that's held for the whole time, with a protocol that if you need both
> the DTLS lock and the thread list lock, you take them in that order
> (dlopen would also need them both). This reduces the thread list lock
> window to just the __clone call and list update.
> 
Looks good.

Alexey




* Re: Draft outline of thread-list design
From: Rich Felker @ 2019-02-14 23:19 UTC (permalink / raw)
  To: Alexey Izbyshev; +Cc: musl

On Thu, Feb 14, 2019 at 05:32:24PM -0500, Rich Felker wrote:
> On Fri, Feb 15, 2019 at 12:16:39AM +0300, Alexey Izbyshev wrote:
> > In general, to my limited expertise, the design looks simple and
> > clean. I'm not sure whether it's worth optimizing to reduce
> > serialization pressure on pthread_create()/pthread_exit() because
> > creating a large amount of short-lived threads doesn't look like a
> > good idea anyway.
> 
> Yes. One thing I did notice is that the window where pthread_create
> has to hold a lock to prevent new dlopen from happening is a lot
> larger than the window where the thread list needs to be locked, and
> contains mmap/mprotect. I think we should add a new "DTLS lock" here
> that's held for the whole time, with a protocol that if you need both
> the DTLS lock and the thread list lock, you take them in that order
> (dlopen would also need them both). This reduces the thread list lock
> window to just the __clone call and list update.

Also: the DTLS lock function can have a weak dummy, so that
static-linked programs don't even perform any locking.
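
E.g. with the usual weak_alias idiom (the function names here are made
up):

/* in the file with the call sites (shared by static and dynamic libc): */
static void dummy(void) {}
weak_alias(dummy, __dtls_lock);
weak_alias(dummy, __dtls_unlock);

/* the strong definitions live in the dynamic linker code, which
 * static-linked programs never pull in, so they get the no-ops */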

Rich



* Re: Draft outline of thread-list design
From: Rich Felker @ 2019-02-15  0:20 UTC (permalink / raw)
  To: musl

On Tue, Feb 12, 2019 at 01:26:25PM -0500, Rich Felker wrote:
> Here's a draft of the thread-list design, proposed previously as a
> better way to do dynamic TLS installation, and now as a solution to
> the problem of __synccall's use of /proc/self/task being (apparently
> hopelessly) broken:
> 
> 
> 
> Goal of simplicity and correctness, not micro-optimizing.
> 
> List lock is fully AS-safe. Taking lock requires signals be blocked.

To elaborate on this: application signals must be blocked before the
lock is taken, but implementation signals (particularly the synccall
signal) must not be blocked. Otherwise there is a deadlock: it's
possible that thread A is waiting for the thread list lock, and thread
B holds the thread list lock and is waiting for thread A to respond to
a synccall signal before it can make forward progress.

If we want to block *all* signals, which is needed at detached thread
exit to prevent delivery in the absence of a stack, they must be
blocked after obtaining the thread list lock.
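
Concretely, something like (SIGSYNCCALL standing in for the internal
signal number, which is not a public macro):

sigset_t mask, old;

/* ordinary lock path: block application signals but leave the
 * implementation's synccall signal open, so a concurrent __synccall
 * holding the lock can still capture us while we wait for it */
sigfillset(&mask);
sigdelset(&mask, SIGSYNCCALL);
pthread_sigmask(SIG_BLOCK, &mask, &old);
/* ... take the thread list lock ... */

/* detached-exit path: everything must be blocked, since no stack will
 * remain to deliver on, but only once the lock is already held */
sigfillset(&mask);
pthread_sigmask(SIG_BLOCK, &mask, 0);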

Rich


