mailing list of musl libc
* [musl] Illegal killlock skipping when transitioning to single-threaded state
@ 2022-09-07  0:46 Alexey Izbyshev
  2022-09-19 15:29 ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Alexey Izbyshev @ 2022-09-07  0:46 UTC (permalink / raw)
  To: musl

Hi,

While reading the pthread_exit() implementation I noticed that it can set 
"libc.need_locks" to -1 while still holding the killlock of the exiting 
thread:

     if (!--libc.threads_minus_1) libc.need_locks = -1;

If the remaining thread attempts to acquire the same killlock 
concurrently (which is valid for joinable threads), it works fine 
because LOCK() resets "libc.need_locks" only after a_cas():

     int need_locks = libc.need_locks;
     if (!need_locks) return;
     /* fast path: INT_MIN for the lock, +1 for the congestion */
     int current = a_cas(l, 0, INT_MIN + 1);
     if (need_locks < 0) libc.need_locks = 0;
     if (!current) return;

However, because "libc.need_locks" is reset when using LOCK() for any 
other lock too, the following could happen (E is the exiting thread, R 
is the remaining thread):

E: libc.need_locks = -1
R: LOCK(unrelated_lock)
R:   libc.need_locks = 0
R: UNLOCK(unrelated_lock)
R: LOCK(E->killlock) // skips locking, now both R and E think they are 
holding the lock
R: UNLOCK(E->killlock)
E: UNLOCK(E->killlock)

The lack of mutual exclusion means that the tid reuse problem that 
killlock is supposed to prevent might occur.
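
For reference, the main consumer of the killlock is pthread_kill(),
which looks roughly like this (a simplified sketch from memory; signal
masking and the exact return value for an already-exited thread are
omitted, and details may differ between musl versions):

     int pthread_kill(pthread_t t, int sig)
     {
         int r;
         LOCK(t->killlock);
         /* The exiting thread clears t->tid and only then drops the
          * killlock, so a nonzero tid observed under the lock cannot
          * yet have been reused for an unrelated thread or process. */
         r = t->tid ? -__syscall(SYS_tkill, t->tid, sig) : 0;
         UNLOCK(t->killlock);
         return r;
     }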

Moreover, when UNLOCK(E->killlock) is called concurrently by R and E, 
a_fetch_add() could be done twice if timing is such that both threads 
notice that "l[0] < 0":

    /* Check l[0] to see if we are multi-threaded. */
    if (l[0] < 0) {
        if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
            __wake(l, 1, 1);
        }
    }

In that case E->killlock will end up at INT_MAX (the first a_fetch_add() 
takes the lock value from INT_MIN + 1 to 0, and the second from 0 to 
-(INT_MIN + 1), i.e. INT_MAX, meaning "unlocked with INT_MAX waiters"). 
This is a bad state because the next LOCK() will wrap it to the 
"unlocked" state instead of locking. Also, if more than two threads 
attempt to use it, a deadlock can occur: if two supposed owners execute 
a_fetch_add() concurrently in UNLOCK() after a third thread has 
registered itself as a waiter, the lock will wrap to INT_MIN.

Reordering the "libc.need_locks = -1" assignment and UNLOCK(E->killlock) 
and providing a store barrier between them should fix the issue.
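
Concretely, something along these lines at the end of pthread_exit()
(a sketch of the idea only, not a tested patch; a_barrier() stands for
whatever store barrier is chosen, and was_last_thread is a stand-in for
the threads_minus_1 check, which in the real code happens earlier under
the thread list lock):

     /* exiting thread E, proposed order: */
     ...                          /* tid teardown, still under E->killlock */
     UNLOCK(E->killlock);
     if (was_last_thread) {
         a_barrier();             /* keep the -1 store from being hoisted
                                     back above UNLOCK(E->killlock) */
         libc.need_locks = -1;
     }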

Thanks,
Alexey


* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-09-07  0:46 [musl] Illegal killlock skipping when transitioning to single-threaded state Alexey Izbyshev
@ 2022-09-19 15:29 ` Rich Felker
  2022-10-03  6:16   ` Alexey Izbyshev
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2022-09-19 15:29 UTC (permalink / raw)
  To: musl

On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> Hi,
> 
> While reading pthread_exit() implementation I noticed that it can
> set "libc.need_locks" to -1 while still holding the killlock of the
> exiting thread:
> 
>     if (!--libc.threads_minus_1) libc.need_locks = -1;
> 
> If the remaining thread attempts to acquire the same killlock
> concurrently (which is valid for joinable threads), it works fine
> because LOCK() resets "libc.need_locks" only after a_cas():
> 
>     int need_locks = libc.need_locks;
>     if (!need_locks) return;
>     /* fast path: INT_MIN for the lock, +1 for the congestion */
>     int current = a_cas(l, 0, INT_MIN + 1);
>     if (need_locks < 0) libc.need_locks = 0;
>     if (!current) return;
> 
> However, because "libc.need_locks" is reset when using LOCK() for
> any other lock too, the following could happen (E is the exiting
> thread, R is the remaining thread):
> 
> E: libc.need_locks = -1
> R: LOCK(unrelated_lock)
> R:   libc.need_locks = 0
> R: UNLOCK(unrelated_lock)
> R: LOCK(E->killlock) // skips locking, now both R and E think they
> are holding the lock
> R: UNLOCK(E->killlock)
> E: UNLOCK(E->killlock)
> 
> The lack of mutual exclusion means that the tid reuse problem that
> killlock is supposed to prevent might occur.
> 
> Moreover, when UNLOCK(E->killlock) is called concurrently by R and
> E, a_fetch_add() could be done twice if timing is such that both
> threads notice that "l[0] < 0":
> 
>    /* Check l[0] to see if we are multi-threaded. */
>    if (l[0] < 0) {
>        if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
>            __wake(l, 1, 1);
>        }
>    }
> 
> In that case E->killlock will be assigned to INT_MAX (i.e. "unlocked
> with INT_MAX waiters"). This is a bad state because the next LOCK()
> will wrap it to "unlocked" state instead of locking. Also, if more
> than two threads attempt to use it, a deadlock will occur if two
> supposedly-owners execute a_fetch_add() concurrently in UNLOCK()
> after a third thread registered itself as a waiter because the lock
> will wrap to INT_MIN.
> 
> Reordering the "libc.need_locks = -1" assignment and
> UNLOCK(E->killlock) and providing a store barrier between them
> should fix the issue.

I think this all sounds correct. I'm not sure what you mean by a store
barrier between them, since all lock and unlock operations are already
full barriers.

I'm still a little bit concerned, though, that there's no real model for
synchronization of the accesses that modify need_locks.
Conceptually, the store of -1 from the exiting thread and the store of
0 in the surviving thread are not ordered with respect to each other
except by an unsynchronized causality relationship.

I suspect what "should be" happening is that, if we observe
need_locks==-1 (as a relaxed atomic load in our memory model), we take
the thread list lock (as the lock controlling write access to these
values) which ensures that the exiting thread has exited, then confirm
that we are the only thread left (are there async-signal ways without
UB that another thread could be created in the mean time?) before
setting it to 0. But this is a rather heavy operation. If there's an
assurance that no other threads can be created in the middle of LOCK
(which I think there should be), we could just use __tl_sync(), which
is much lighter, to conclude that we've synchronized with being the
only remaining thread and that unsynchronized access is okay.
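
Roughly, I mean something like this at the top of LOCK (a sketch only;
whether the re-check and the clearing of the flag are exactly right
would need more thought):

    int need_locks = libc.need_locks;
    if (!need_locks) return;
    if (need_locks < 0) {
        /* heavy variant: the thread list lock is what orders thread
         * exit, so taking it synchronizes with the exiting thread's
         * teardown before we trust that we're alone */
        __tl_lock();
        if (!libc.threads_minus_1) libc.need_locks = 0;
        __tl_unlock();
        /* lighter variant: a __tl_sync() call in place of the
         * lock/unlock pair, if thread creation can't race with LOCK */
    }
    /* ...fall through to the normal locking path... */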

Rich


* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-09-19 15:29 ` Rich Felker
@ 2022-10-03  6:16   ` Alexey Izbyshev
  2022-10-03 12:33     ` Rich Felker
  2022-10-03 13:26     ` Szabolcs Nagy
  0 siblings, 2 replies; 8+ messages in thread
From: Alexey Izbyshev @ 2022-10-03  6:16 UTC (permalink / raw)
  To: musl

On 2022-09-19 18:29, Rich Felker wrote:
> On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> Hi,
>> 
>> While reading pthread_exit() implementation I noticed that it can
>> set "libc.need_locks" to -1 while still holding the killlock of the
>> exiting thread:
>> 
>>     if (!--libc.threads_minus_1) libc.need_locks = -1;
>> 
>> If the remaining thread attempts to acquire the same killlock
>> concurrently (which is valid for joinable threads), it works fine
>> because LOCK() resets "libc.need_locks" only after a_cas():
>> 
>>     int need_locks = libc.need_locks;
>>     if (!need_locks) return;
>>     /* fast path: INT_MIN for the lock, +1 for the congestion */
>>     int current = a_cas(l, 0, INT_MIN + 1);
>>     if (need_locks < 0) libc.need_locks = 0;
>>     if (!current) return;
>> 
>> However, because "libc.need_locks" is reset when using LOCK() for
>> any other lock too, the following could happen (E is the exiting
>> thread, R is the remaining thread):
>> 
>> E: libc.need_locks = -1
>> R: LOCK(unrelated_lock)
>> R:   libc.need_locks = 0
>> R: UNLOCK(unrelated_lock)
>> R: LOCK(E->killlock) // skips locking, now both R and E think they
>> are holding the lock
>> R: UNLOCK(E->killlock)
>> E: UNLOCK(E->killlock)
>> 
>> The lack of mutual exclusion means that the tid reuse problem that
>> killlock is supposed to prevent might occur.
>> 
>> Moreover, when UNLOCK(E->killlock) is called concurrently by R and
>> E, a_fetch_add() could be done twice if timing is such that both
>> threads notice that "l[0] < 0":
>> 
>>    /* Check l[0] to see if we are multi-threaded. */
>>    if (l[0] < 0) {
>>        if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
>>            __wake(l, 1, 1);
>>        }
>>    }
>> 
>> In that case E->killlock will be assigned to INT_MAX (i.e. "unlocked
>> with INT_MAX waiters"). This is a bad state because the next LOCK()
>> will wrap it to "unlocked" state instead of locking. Also, if more
>> than two threads attempt to use it, a deadlock will occur if two
>> supposedly-owners execute a_fetch_add() concurrently in UNLOCK()
>> after a third thread registered itself as a waiter because the lock
>> will wrap to INT_MIN.
>> 
>> Reordering the "libc.need_locks = -1" assignment and
>> UNLOCK(E->killlock) and providing a store barrier between them
>> should fix the issue.
> 
> I think this all sounds correct. I'm not sure what you mean by a store
> barrier between them, since all lock and unlock operations are already
> full barriers.
> 

Before sending the report I tried to infer the intended ordering 
semantics of LOCK/UNLOCK by looking at their implementations. For 
AArch64, I didn't see why they would provide a full barrier (my 
reasoning is below), so I concluded that probably acquire/release 
semantics was intended in general and suggested an extra store barrier 
to prevent hoisting of "libc.need_locks = -1" store spelled after 
UNLOCK(E->killlock) back into the critical section.

UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple 
a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are 
implemented via load-acquire/store-release instructions. Therefore, if 
we consider a LOCK/UNLOCK critical section containing only plain loads 
and stores, (a) any such memory access can be reordered with the initial 
ldaxr in UNLOCK, and (b) any plain load following UNLOCK can be 
reordered with stlxr (assuming the processor predicts that stlxr 
succeeds), and further, due to (a), with any memory access inside the 
critical section. Therefore, UNLOCK is not a full barrier. Is this right?

As for a store following UNLOCK, I'm not sure. A plain store following 
stlxr can be reordered with it, but here that store is conditional on 
stlxr success. So even if the processor predicts stlxr success, it can't 
make the following store visible before it's sure that stlxr succeeded. 
But I don't know whether the events "the processor knows that stlxr 
succeeded" and "the result of stlxr is globally visible" are separate or 
not, and if they are separate, what comes first. Depending on the 
answer, UNLOCK acts as a store barrier or not.

> I'm still a little bit concerned though that there's no real model for
> synchronization of the access to modify need_locks though.
> Conceptually, the store of -1 from the exiting thread and the store of
> 0 in the surviving thread are not ordered with respect to each other
> except by an unsynchronized causality relationship.
> 
I don't understand the implications of the lack of (stronger) ordering of 
these two stores. Could you clarify?

> I suspect what "should be" happening is that, if we observe
> need_locks==-1 (as a relaxed atomic load in our memory model), we take
> the thread list lock (as the lock controlling write access to these
> values) which ensures that the exiting thread has exited, then confirm
> that we are the only thread left (are there async-signal ways without
> UB that another thread could be created in the mean time?) before
> setting it to 0. But this is a rather heavy operation. If there's an
> assurance that no other threads can be created in the middle of LOCK
> (which I think there should be), we could just use __tl_sync(), which
> is much lighter, to conclude that we've synchronized with being the
> only remaining thread and that unsynchronized access is okay.
> 

LOCK is called either with application signals blocked or from 
async-signal-unsafe functions, so interrupting it with a signal handler 
that calls pthread_create() is not possible without UB. Therefore, I 
don't see any issues with using __tl_sync() for synchronization with the 
exiting thread (although, as I said above, I don't understand what 
problem is being solved).

By the way, I noticed a suspicious usage of "libc.need_locks" in fork():

     int need_locks = libc.need_locks > 0;

This is different from how LOCK behaves: LOCK will still perform normal 
locking one last time if it sees -1, but here locking will be skipped 
entirely. Is this intentional?
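
That is, with libc.need_locks == -1 the two paths diverge (illustrative
trace only):

     /* libc.need_locks == -1 (single-threaded transition pending):
      *   LOCK(l):  need_locks = libc.need_locks;          // -1, nonzero
      *             a_cas(l, 0, INT_MIN + 1);              // lock IS taken once
      *             if (need_locks < 0) libc.need_locks = 0;
      *   fork():   int need_locks = libc.need_locks > 0;  // 0, locking skipped
      */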

Thanks,
Alexey


* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03  6:16   ` Alexey Izbyshev
@ 2022-10-03 12:33     ` Rich Felker
  2022-10-03 13:26     ` Szabolcs Nagy
  1 sibling, 0 replies; 8+ messages in thread
From: Rich Felker @ 2022-10-03 12:33 UTC (permalink / raw)
  To: musl

On Mon, Oct 03, 2022 at 09:16:03AM +0300, Alexey Izbyshev wrote:
> On 2022-09-19 18:29, Rich Felker wrote:
> >On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>Hi,
> >>
> >>While reading pthread_exit() implementation I noticed that it can
> >>set "libc.need_locks" to -1 while still holding the killlock of the
> >>exiting thread:
> >>
> >>    if (!--libc.threads_minus_1) libc.need_locks = -1;
> >>
> >>If the remaining thread attempts to acquire the same killlock
> >>concurrently (which is valid for joinable threads), it works fine
> >>because LOCK() resets "libc.need_locks" only after a_cas():
> >>
> >>    int need_locks = libc.need_locks;
> >>    if (!need_locks) return;
> >>    /* fast path: INT_MIN for the lock, +1 for the congestion */
> >>    int current = a_cas(l, 0, INT_MIN + 1);
> >>    if (need_locks < 0) libc.need_locks = 0;
> >>    if (!current) return;
> >>
> >>However, because "libc.need_locks" is reset when using LOCK() for
> >>any other lock too, the following could happen (E is the exiting
> >>thread, R is the remaining thread):
> >>
> >>E: libc.need_locks = -1
> >>R: LOCK(unrelated_lock)
> >>R:   libc.need_locks = 0
> >>R: UNLOCK(unrelated_lock)
> >>R: LOCK(E->killlock) // skips locking, now both R and E think they
> >>are holding the lock
> >>R: UNLOCK(E->killlock)
> >>E: UNLOCK(E->killlock)
> >>
> >>The lack of mutual exclusion means that the tid reuse problem that
> >>killlock is supposed to prevent might occur.
> >>
> >>Moreover, when UNLOCK(E->killlock) is called concurrently by R and
> >>E, a_fetch_add() could be done twice if timing is such that both
> >>threads notice that "l[0] < 0":
> >>
> >>   /* Check l[0] to see if we are multi-threaded. */
> >>   if (l[0] < 0) {
> >>       if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
> >>           __wake(l, 1, 1);
> >>       }
> >>   }
> >>
> >>In that case E->killlock will be assigned to INT_MAX (i.e. "unlocked
> >>with INT_MAX waiters"). This is a bad state because the next LOCK()
> >>will wrap it to "unlocked" state instead of locking. Also, if more
> >>than two threads attempt to use it, a deadlock will occur if two
> >>supposedly-owners execute a_fetch_add() concurrently in UNLOCK()
> >>after a third thread registered itself as a waiter because the lock
> >>will wrap to INT_MIN.
> >>
> >>Reordering the "libc.need_locks = -1" assignment and
> >>UNLOCK(E->killlock) and providing a store barrier between them
> >>should fix the issue.
> >
> >I think this all sounds correct. I'm not sure what you mean by a store
> >barrier between them, since all lock and unlock operations are already
> >full barriers.
> >
> 
> Before sending the report I tried to infer the intended ordering
> semantics of LOCK/UNLOCK by looking at their implementations. For
> AArch64, I didn't see why they would provide a full barrier (my
> reasoning is below), so I concluded that probably acquire/release
> semantics was intended in general and suggested an extra store
> barrier to prevent hoisting of "libc.need_locks = -1" store spelled
> after UNLOCK(E->killlock) back into the critical section.
> 
> UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are
> implemented via load-acquire/store-release instructions. Therefore,
> if we consider a LOCK/UNLOCK critical section containing only plain
> loads and stores, (a) any such memory access can be reordered with
> the initial ldaxr in UNLOCK, and (b) any plain load following UNLOCK
> can be reordered with stlxr (assuming the processor predicts that
> stlxr succeeds), and further, due to (a), with any memory access
> inside the critical section. Therefore, UNLOCK is not full barrier.
> Is this right?

If so, this is a bug in our implementation of aarch64 atomics. The
intent is that a_* all be seq_cst (practically, full acq+rel probably
suffices). Maybe someone more familiar with aarch64 can chime in.
I thought this was discussed and determined to be okay when the
aarch64 port was being reviewed, but now I'm starting to doubt that.

> As for a store following UNLOCK, I'm not sure. A plain store
> following stlxr can be reordered with it, but here that store is
> conditional on stlxr success. So even if the processor predicts
> stlxr success, it can't make the following store visible before it's
> sure that stlxr succeeded. But I don't know whether the events "the
> processor knows that stlxr succeeded" and "the result of stlxr is
> globally visible" are separate or not, and if they are separate,
> what comes first. Depending on the answer, UNLOCK acts as a store
> barrier or not.
> 
> >I'm still a little bit concerned though that there's no real model for
> >synchronization of the access to modify need_locks though.
> >Conceptually, the store of -1 from the exiting thread and the store of
> >0 in the surviving thread are not ordered with respect to each other
> >except by an unsynchronized causality relationship.
> >
> I don't understand implications of the lack of (stronger) ordering
> of these two stores. Could you clarify?

I'm talking at the level of the memory model, not any
implementation details of the cpu. The store of 0 in the surviving
thread is not an atomic store, and I was thinking this is possibly not
ordered with respect to the store of the -1 in the exiting thread
unless there's an intervening barrier, *after* the -1 is observed.

However, looking back at the code, I was actually careful to ensure
this already. need_locks is loaded only once, before the a_cas, and
the a_cas provides the barrier. If we were instead re-loading
need_locks after the a_cas before testing <0 and assigning 0, there
would be a problem, at least at the abstract memory model level.
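
In code terms, the distinction is roughly this (sketch; the second form
is hypothetical, not actual musl code):

    /* current form: need_locks is read once, before the a_cas, and the
     * a_cas (a full barrier) separates that read from the store of 0 */
    int need_locks = libc.need_locks;
    if (!need_locks) return;
    int current = a_cas(l, 0, INT_MIN + 1);
    if (need_locks < 0) libc.need_locks = 0;

    /* hypothetical broken form: re-loading need_locks after the a_cas
     * leaves the load of -1 and the store of 0 with no intervening
     * barrier at the memory model level */
    int current = a_cas(l, 0, INT_MIN + 1);
    if (libc.need_locks < 0) libc.need_locks = 0;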

> >I suspect what "should be" happening is that, if we observe
> >need_locks==-1 (as a relaxed atomic load in our memory model), we take
> >the thread list lock (as the lock controlling write access to these
> >values) which ensures that the exiting thread has exited, then confirm
> >that we are the only thread left (are there async-signal ways without
> >UB that another thread could be created in the mean time?) before
> >setting it to 0. But this is a rather heavy operation. If there's an
> >assurance that no other threads can be created in the middle of LOCK
> >(which I think there should be), we could just use __tl_sync(), which
> >is much lighter, to conclude that we've synchronized with being the
> >only remaining thread and that unsynchronized access is okay.
> >
> 
> LOCK is called either with application signals blocked or from
> async-signal-unsafe functions, so interrupting it with a signal
> handler that calls pthread_create() is not possible without UB.
> Therefore, I don't see any issues with using __tl_sync() for
> synchronization with the exiting thread (although, as I said above,
> I don't understand what problem is being solved).

Based on the above, I agree there doesn't seem to be an actual problem
here. The reason it works is subtle, but present.

> By the way, I noticed a suspicious usage of "libc.need_locks" in fork():
> 
>     int need_locks = libc.need_locks > 0;
> 
> This is different from how LOCK behaves: LOCK will still perform
> normal locking one last time if it sees -1, but here locking will be
> skipped entirely. Is this intentional?

This is probably subtly incorrect. I think it requires more analysis
to determine if there's anything that can actually go wrong, but it
should probably just be changed anyway.

Rich


* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03  6:16   ` Alexey Izbyshev
  2022-10-03 12:33     ` Rich Felker
@ 2022-10-03 13:26     ` Szabolcs Nagy
  2022-10-03 21:27       ` Szabolcs Nagy
  1 sibling, 1 reply; 8+ messages in thread
From: Szabolcs Nagy @ 2022-10-03 13:26 UTC (permalink / raw)
  To: musl; +Cc: Alexey Izbyshev

* Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> On 2022-09-19 18:29, Rich Felker wrote:
> > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
...
> > > Reordering the "libc.need_locks = -1" assignment and
> > > UNLOCK(E->killlock) and providing a store barrier between them
> > > should fix the issue.
> > 
> > I think this all sounds correct. I'm not sure what you mean by a store
> > barrier between them, since all lock and unlock operations are already
> > full barriers.
> > 
> 
> Before sending the report I tried to infer the intended ordering semantics
> of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> see why they would provide a full barrier (my reasoning is below), so I
> concluded that probably acquire/release semantics was intended in general
> and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> = -1" store spelled after UNLOCK(E->killlock) back into the critical
> section.
> 
> UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> via load-acquire/store-release instructions. Therefore, if we consider a
> LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> such memory access can be reordered with the initial ldaxr in UNLOCK, and
> (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> the processor predicts that stlxr succeeds), and further, due to (a), with
> any memory access inside the critical section. Therefore, UNLOCK is not full
> barrier. Is this right?

i dont think this is right.

memory operations in the critical section cannot visibly happen after the
final stlxr.

memory operations after the critical section cannot visibly happen before
the ldaxr of UNLOCK.

the only issue can be inside the ll/sc loop in UNLOCK if there are independent
memory operations there, but there arent.

> 
> As for a store following UNLOCK, I'm not sure. A plain store following stlxr
> can be reordered with it, but here that store is conditional on stlxr
> success. So even if the processor predicts stlxr success, it can't make the
> following store visible before it's sure that stlxr succeeded. But I don't
> know whether the events "the processor knows that stlxr succeeded" and "the
> result of stlxr is globally visible" are separate or not, and if they are
> separate, what comes first. Depending on the answer, UNLOCK acts as a store
> barrier or not.
> 

UNLOCK on aarch64 acts as a full seqcst barrier as far as i can see.

but looking into the arch implementations needs a detailed understanding
of the arch memory model (eg aarch64 stlxr is RCsc not RCpc like iso c
release store), but there is no need for that: the musl model is
simple seqcst synchronization everywhere.


* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 13:26     ` Szabolcs Nagy
@ 2022-10-03 21:27       ` Szabolcs Nagy
  2022-10-03 22:54         ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Szabolcs Nagy @ 2022-10-03 21:27 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

* Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:

> * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > On 2022-09-19 18:29, Rich Felker wrote:
> > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> ...
> > > > Reordering the "libc.need_locks = -1" assignment and
> > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > should fix the issue.
> > > 
> > > I think this all sounds correct. I'm not sure what you mean by a store
> > > barrier between them, since all lock and unlock operations are already
> > > full barriers.
> > > 
> > 
> > Before sending the report I tried to infer the intended ordering semantics
> > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > see why they would provide a full barrier (my reasoning is below), so I
> > concluded that probably acquire/release semantics was intended in general
> > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > section.
> > 
> > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > via load-acquire/store-release instructions. Therefore, if we consider a
> > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > the processor predicts that stlxr succeeds), and further, due to (a), with
> > any memory access inside the critical section. Therefore, UNLOCK is not full
> > barrier. Is this right?
> 
> i dont think this is right.


i think i was wrong and you are right.

so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
starting with 'something == 0' the exiting E and remaining R threads:

E:something=1      // protected by killlock
E:UNLOCK(killlock)
E:need_locks=-1

R:LOCK(unrelated)  // reads need_locks == -1
R:need_locks=0
R:UNLOCK(unrelated)
R:LOCK(killlock)   // does not lock
R:read something   // can it be 0 ?

and here something can be 0 (ie. not protected by killlock) on aarch64
because

T1
	something=1
	ldaxr ... killlock
	stlxr ... killlock
	need_locks=-1

T2
	x=need_locks
	ldaxr ... unrelated
	stlxr ... unrelated
	y=something

can end with x==-1 and y==0.

and to fix it, both a_fetch_add and a_cas need an a_barrier.

i need to think how to support such lock usage on aarch64
without adding too many dmb.
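
one possible shape, sketched (not a tested patch), with a trailing
a_barrier in the ll/sc based rmw ops:

	static inline int a_cas(volatile int *p, int t, int s)
	{
		int old;
		do {
			old = a_ll(p);       /* ldaxr */
			if (old != t) break;
		} while (!a_sc(p, s));       /* stlxr */
		a_barrier();                 /* dmb ish */
		return old;
	}

and similarly a trailing a_barrier in a_fetch_add. whether every a_*
op needs it or a cheaper scheme is possible is the open question.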



> 
> memory operations in the critical section cannot visibly happen after the
> final stlxr.
> 
> memory operations after the critical section cannot visibly happen before
> the ldaxr of UNLOCK.
> 
> the only issue can be inside the ll/sc loop in UNLOCK if there are independent
> memory operations there, but there arent.
> 
> > 
> > As for a store following UNLOCK, I'm not sure. A plain store following stlxr
> > can be reordered with it, but here that store is conditional on stlxr
> > success. So even if the processor predicts stlxr success, it can't make the
> > following store visible before it's sure that stlxr succeeded. But I don't
> > know whether the events "the processor knows that stlxr succeeded" and "the
> > result of stlxr is globally visible" are separate or not, and if they are
> > separate, what comes first. Depending on the answer, UNLOCK acts as a store
> > barrier or not.
> > 
> 
> UNLOCK on aarch64 acts as a full seqcst barrier as far as i can see.
> 
> but looking into the arch implementations needs a detailed understanding
> of the arch memory model (eg aarch64 stlxr is RCsc not RCpc like iso c
> release store), but there is no need for that: the musl model is
> simple seqcst synchronization everywhere.






* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 21:27       ` Szabolcs Nagy
@ 2022-10-03 22:54         ` Rich Felker
  2022-10-03 23:05           ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2022-10-03 22:54 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> 
> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > ...
> > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > should fix the issue.
> > > > 
> > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > barrier between them, since all lock and unlock operations are already
> > > > full barriers.
> > > > 
> > > 
> > > Before sending the report I tried to infer the intended ordering semantics
> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > see why they would provide a full barrier (my reasoning is below), so I
> > > concluded that probably acquire/release semantics was intended in general
> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > section.
> > > 
> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > > barrier. Is this right?
> > 
> > i dont think this is right.
> 
> 
> i think i was wrong and you are right.
> 
> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> starting with 'something == 0' the exiting E and remaining R threads:
> 
> E:something=1      // protected by killlock
> E:UNLOCK(killlock)
> E:need_locks=-1
> 
> R:LOCK(unrelated)  // reads need_locks == -1
> R:need_locks=0
> R:UNLOCK(unrelated)
> R:LOCK(killlock)   // does not lock
> R:read something   // can it be 0 ?
> 
> and here something can be 0 (ie. not protected by killlock) on aarch64
> because
> 
> T1
> 	something=1
> 	ldaxr ... killlock
> 	stlxr ... killlock
> 	need_locks=-1
> 
> T2
> 	x=need_locks
> 	ldaxr ... unrelated
> 	stlxr ... unrelated
> 	y=something
> 
> can end with x==-1 and y==0.
> 
> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> 
> i need to think how to support such lock usage on aarch64
> without adding too many dmb.

I don't really understand this, but FWIW gcc emits 

    ldxr
    ...
    stlxr
    ...
    dmb ish

for __sync_val_compare_and_swap. So this is probably the right thing
we should have. And it seems to match what the kernel folks discussed
here:

http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html

I wondered if there are similar issues for any other archs which need
review, but it looks like all the other llsc archs have explicit
pre/post barriers defined.

Rich


* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 22:54         ` Rich Felker
@ 2022-10-03 23:05           ` Rich Felker
  0 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2022-10-03 23:05 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> > 
> > > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > > ...
> > > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > > should fix the issue.
> > > > > 
> > > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > > barrier between them, since all lock and unlock operations are already
> > > > > full barriers.
> > > > > 
> > > > 
> > > > Before sending the report I tried to infer the intended ordering semantics
> > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > > see why they would provide a full barrier (my reasoning is below), so I
> > > > concluded that probably acquire/release semantics was intended in general
> > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > > section.
> > > > 
> > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > > > barrier. Is this right?
> > > 
> > > i dont think this is right.
> > 
> > 
> > i think i was wrong and you are right.
> > 
> > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > starting with 'something == 0' the exiting E and remaining R threads:
> > 
> > E:something=1      // protected by killlock
> > E:UNLOCK(killlock)
> > E:need_locks=-1
> > 
> > R:LOCK(unrelated)  // reads need_locks == -1
> > R:need_locks=0
> > R:UNLOCK(unrelated)
> > R:LOCK(killlock)   // does not lock
> > R:read something   // can it be 0 ?
> > 
> > and here something can be 0 (ie. not protected by killlock) on aarch64
> > because
> > 
> > T1
> > 	something=1
> > 	ldaxr ... killlock
> > 	stlxr ... killlock
> > 	need_locks=-1
> > 
> > T2
> > 	x=need_locks
> > 	ldaxr ... unrelated
> > 	stlxr ... unrelated
> > 	y=something
> > 
> > can end with x==-1 and y==0.
> > 
> > and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > 
> > i need to think how to support such lock usage on aarch64
> > without adding too many dmb.
> 
> I don't really understand this, but FWIW gcc emits 
> 
>     ldxr
>     ...
>     stlxr
>     ...
>     dmb ish
> 
> for __sync_val_compare_and_swap. So this is probably the right thing
> we should have. And it seems to match what the kernel folks discussed
> here:
> 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> 
> I wondered if there are similar issues for any others archs which need
> review, but it looks like all the other llsc archs have explicit
> pre/post barriers defined.

Actually I don't understand what's going on with cmpxchg there. The
patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
some follow-up in the thread I don't understand, about the case where
the cas fails, but we already handle that by doing an explicit barrier
in that case.

This stuff is a poorly-documented mess. >_<

Rich

