[musl] Illegal killlock skipping when transitioning to single-threaded state

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [musl] Illegal killlock skipping when transitioning to single-threaded state
@ 2022-09-07  0:46 Alexey Izbyshev
  2022-09-19 15:29 ` Rich Felker
  2022-10-05  1:00 ` Rich Felker
  0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-09-07  0:46 UTC (permalink / raw)
  To: musl

Hi,

While reading pthread_exit() implementation I noticed that it can set 
"libc.need_locks" to -1 while still holding the killlock of the exiting 
thread:

     if (!--libc.threads_minus_1) libc.need_locks = -1;

If the remaining thread attempts to acquire the same killlock 
concurrently (which is valid for joinable threads), it works fine 
because LOCK() resets "libc.need_locks" only after a_cas():

     int need_locks = libc.need_locks;
     if (!need_locks) return;
     /* fast path: INT_MIN for the lock, +1 for the congestion */
     int current = a_cas(l, 0, INT_MIN + 1);
     if (need_locks < 0) libc.need_locks = 0;
     if (!current) return;

However, because "libc.need_locks" is reset when using LOCK() for any 
other lock too, the following could happen (E is the exiting thread, R 
is the remaining thread):

E: libc.need_locks = -1
R: LOCK(unrelated_lock)
R:   libc.need_locks = 0
R: UNLOCK(unrelated_lock)
R: LOCK(E->killlock) // skips locking, now both R and E think they are 
holding the lock
R: UNLOCK(E->killlock)
E: UNLOCK(E->killlock)

The lack of mutual exclusion means that the tid reuse problem that 
killlock is supposed to prevent might occur.

Moreover, when UNLOCK(E->killlock) is called concurrently by R and E, 
a_fetch_add() could be done twice if timing is such that both threads 
notice that "l[0] < 0":

    /* Check l[0] to see if we are multi-threaded. */
    if (l[0] < 0) {
        if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
            __wake(l, 1, 1);
        }
    }

In that case E->killlock will be assigned to INT_MAX (i.e. "unlocked 
with INT_MAX waiters"). This is a bad state because the next LOCK() will 
wrap it to "unlocked" state instead of locking. Also, if more than two 
threads attempt to use it, a deadlock will occur if two 
supposedly-owners execute a_fetch_add() concurrently in UNLOCK() after a 
third thread registered itself as a waiter because the lock will wrap to 
INT_MIN.

Reordering the "libc.need_locks = -1" assignment and UNLOCK(E->killlock) 
and providing a store barrier between them should fix the issue.

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-09-07  0:46 [musl] Illegal killlock skipping when transitioning to single-threaded state Alexey Izbyshev
@ 2022-09-19 15:29 ` Rich Felker
  2022-10-03  6:16   ` Alexey Izbyshev
  2022-10-05  1:00 ` Rich Felker
  1 sibling, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-09-19 15:29 UTC (permalink / raw)
  To: musl

On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> Hi,
> 
> While reading pthread_exit() implementation I noticed that it can
> set "libc.need_locks" to -1 while still holding the killlock of the
> exiting thread:
> 
>     if (!--libc.threads_minus_1) libc.need_locks = -1;
> 
> If the remaining thread attempts to acquire the same killlock
> concurrently (which is valid for joinable threads), it works fine
> because LOCK() resets "libc.need_locks" only after a_cas():
> 
>     int need_locks = libc.need_locks;
>     if (!need_locks) return;
>     /* fast path: INT_MIN for the lock, +1 for the congestion */
>     int current = a_cas(l, 0, INT_MIN + 1);
>     if (need_locks < 0) libc.need_locks = 0;
>     if (!current) return;
> 
> However, because "libc.need_locks" is reset when using LOCK() for
> any other lock too, the following could happen (E is the exiting
> thread, R is the remaining thread):
> 
> E: libc.need_locks = -1
> R: LOCK(unrelated_lock)
> R:   libc.need_locks = 0
> R: UNLOCK(unrelated_lock)
> R: LOCK(E->killlock) // skips locking, now both R and E think they
> are holding the lock
> R: UNLOCK(E->killlock)
> E: UNLOCK(E->killlock)
> 
> The lack of mutual exclusion means that the tid reuse problem that
> killlock is supposed to prevent might occur.
> 
> Moreover, when UNLOCK(E->killlock) is called concurrently by R and
> E, a_fetch_add() could be done twice if timing is such that both
> threads notice that "l[0] < 0":
> 
>    /* Check l[0] to see if we are multi-threaded. */
>    if (l[0] < 0) {
>        if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
>            __wake(l, 1, 1);
>        }
>    }
> 
> In that case E->killlock will be assigned to INT_MAX (i.e. "unlocked
> with INT_MAX waiters"). This is a bad state because the next LOCK()
> will wrap it to "unlocked" state instead of locking. Also, if more
> than two threads attempt to use it, a deadlock will occur if two
> supposedly-owners execute a_fetch_add() concurrently in UNLOCK()
> after a third thread registered itself as a waiter because the lock
> will wrap to INT_MIN.
> 
> Reordering the "libc.need_locks = -1" assignment and
> UNLOCK(E->killlock) and providing a store barrier between them
> should fix the issue.

I think this all sounds correct. I'm not sure what you mean by a store
barrier between them, since all lock and unlock operations are already
full barriers.

I'm still a little bit concerned though that there's no real model for
synchronization of the access to modify need_locks though.
Conceptually, the store of -1 from the exiting thread and the store of
0 in the surviving thread are not ordered with respect to each other
except by an unsynchronized causality relationship.

I suspect what "should be" happening is that, if we observe
need_locks==-1 (as a relaxed atomic load in our memory model), we take
the thread list lock (as the lock controlling write access to these
values) which ensures that the exiting thread has exited, then confirm
that we are the only thread left (are there async-signal ways without
UB that another thread could be created in the mean time?) before
setting it to 0. But this is a rather heavy operation. If there's an
assurance that no other threads can be created in the middle of LOCK
(which I think there should be), we could just use __tl_sync(), which
is much lighter, to conclude that we've synchronized with being the
only remaining thread and that unsynchronized access is okay.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-09-19 15:29 ` Rich Felker
@ 2022-10-03  6:16   ` Alexey Izbyshev
  2022-10-03 12:33     ` Rich Felker
  2022-10-03 13:26     ` Szabolcs Nagy
  0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-03  6:16 UTC (permalink / raw)
  To: musl

On 2022-09-19 18:29, Rich Felker wrote:
> On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> Hi,
>> 
>> While reading pthread_exit() implementation I noticed that it can
>> set "libc.need_locks" to -1 while still holding the killlock of the
>> exiting thread:
>> 
>>     if (!--libc.threads_minus_1) libc.need_locks = -1;
>> 
>> If the remaining thread attempts to acquire the same killlock
>> concurrently (which is valid for joinable threads), it works fine
>> because LOCK() resets "libc.need_locks" only after a_cas():
>> 
>>     int need_locks = libc.need_locks;
>>     if (!need_locks) return;
>>     /* fast path: INT_MIN for the lock, +1 for the congestion */
>>     int current = a_cas(l, 0, INT_MIN + 1);
>>     if (need_locks < 0) libc.need_locks = 0;
>>     if (!current) return;
>> 
>> However, because "libc.need_locks" is reset when using LOCK() for
>> any other lock too, the following could happen (E is the exiting
>> thread, R is the remaining thread):
>> 
>> E: libc.need_locks = -1
>> R: LOCK(unrelated_lock)
>> R:   libc.need_locks = 0
>> R: UNLOCK(unrelated_lock)
>> R: LOCK(E->killlock) // skips locking, now both R and E think they
>> are holding the lock
>> R: UNLOCK(E->killlock)
>> E: UNLOCK(E->killlock)
>> 
>> The lack of mutual exclusion means that the tid reuse problem that
>> killlock is supposed to prevent might occur.
>> 
>> Moreover, when UNLOCK(E->killlock) is called concurrently by R and
>> E, a_fetch_add() could be done twice if timing is such that both
>> threads notice that "l[0] < 0":
>> 
>>    /* Check l[0] to see if we are multi-threaded. */
>>    if (l[0] < 0) {
>>        if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
>>            __wake(l, 1, 1);
>>        }
>>    }
>> 
>> In that case E->killlock will be assigned to INT_MAX (i.e. "unlocked
>> with INT_MAX waiters"). This is a bad state because the next LOCK()
>> will wrap it to "unlocked" state instead of locking. Also, if more
>> than two threads attempt to use it, a deadlock will occur if two
>> supposedly-owners execute a_fetch_add() concurrently in UNLOCK()
>> after a third thread registered itself as a waiter because the lock
>> will wrap to INT_MIN.
>> 
>> Reordering the "libc.need_locks = -1" assignment and
>> UNLOCK(E->killlock) and providing a store barrier between them
>> should fix the issue.
> 
> I think this all sounds correct. I'm not sure what you mean by a store
> barrier between them, since all lock and unlock operations are already
> full barriers.
> 

Before sending the report I tried to infer the intended ordering 
semantics of LOCK/UNLOCK by looking at their implementations. For 
AArch64, I didn't see why they would provide a full barrier (my 
reasoning is below), so I concluded that probably acquire/release 
semantics was intended in general and suggested an extra store barrier 
to prevent hoisting of "libc.need_locks = -1" store spelled after 
UNLOCK(E->killlock) back into the critical section.

UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple 
a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are 
implemented via load-acquire/store-release instructions. Therefore, if 
we consider a LOCK/UNLOCK critical section containing only plain loads 
and stores, (a) any such memory access can be reordered with the initial 
ldaxr in UNLOCK, and (b) any plain load following UNLOCK can be 
reordered with stlxr (assuming the processor predicts that stlxr 
succeeds), and further, due to (a), with any memory access inside the 
critical section. Therefore, UNLOCK is not full barrier. Is this right?

As for a store following UNLOCK, I'm not sure. A plain store following 
stlxr can be reordered with it, but here that store is conditional on 
stlxr success. So even if the processor predicts stlxr success, it can't 
make the following store visible before it's sure that stlxr succeeded. 
But I don't know whether the events "the processor knows that stlxr 
succeeded" and "the result of stlxr is globally visible" are separate or 
not, and if they are separate, what comes first. Depending on the 
answer, UNLOCK acts as a store barrier or not.

> I'm still a little bit concerned though that there's no real model for
> synchronization of the access to modify need_locks though.
> Conceptually, the store of -1 from the exiting thread and the store of
> 0 in the surviving thread are not ordered with respect to each other
> except by an unsynchronized causality relationship.
> 
I don't understand implications of the lack of (stronger) ordering of 
these two stores. Could you clarify?

> I suspect what "should be" happening is that, if we observe
> need_locks==-1 (as a relaxed atomic load in our memory model), we take
> the thread list lock (as the lock controlling write access to these
> values) which ensures that the exiting thread has exited, then confirm
> that we are the only thread left (are there async-signal ways without
> UB that another thread could be created in the mean time?) before
> setting it to 0. But this is a rather heavy operation. If there's an
> assurance that no other threads can be created in the middle of LOCK
> (which I think there should be), we could just use __tl_sync(), which
> is much lighter, to conclude that we've synchronized with being the
> only remaining thread and that unsynchronized access is okay.
> 

LOCK is called either with application signals blocked or from 
async-signal-unsafe functions, so interrupting it with a signal handler 
that calls pthread_create() is not possible without UB. Therefore, I 
don't see any issues with using __tl_sync() for synchronization with the 
exiting thread (although, as I said above, I don't understand what 
problem is being solved).

By the way, I noticed a suspicious usage of "libc.need_locks" in fork():

     int need_locks = libc.need_locks > 0;

This is different from how LOCK behaves: LOCK will still perform normal 
locking one last time if it sees -1, but here locking will be skipped 
entirely. Is this intentional?

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03  6:16   ` Alexey Izbyshev
@ 2022-10-03 12:33     ` Rich Felker
  2022-10-03 13:26     ` Szabolcs Nagy
  1 sibling, 0 replies; 32+ messages in thread
From: Rich Felker @ 2022-10-03 12:33 UTC (permalink / raw)
  To: musl

On Mon, Oct 03, 2022 at 09:16:03AM +0300, Alexey Izbyshev wrote:
> On 2022-09-19 18:29, Rich Felker wrote:
> >On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>Hi,
> >>
> >>While reading pthread_exit() implementation I noticed that it can
> >>set "libc.need_locks" to -1 while still holding the killlock of the
> >>exiting thread:
> >>
> >>    if (!--libc.threads_minus_1) libc.need_locks = -1;
> >>
> >>If the remaining thread attempts to acquire the same killlock
> >>concurrently (which is valid for joinable threads), it works fine
> >>because LOCK() resets "libc.need_locks" only after a_cas():
> >>
> >>    int need_locks = libc.need_locks;
> >>    if (!need_locks) return;
> >>    /* fast path: INT_MIN for the lock, +1 for the congestion */
> >>    int current = a_cas(l, 0, INT_MIN + 1);
> >>    if (need_locks < 0) libc.need_locks = 0;
> >>    if (!current) return;
> >>
> >>However, because "libc.need_locks" is reset when using LOCK() for
> >>any other lock too, the following could happen (E is the exiting
> >>thread, R is the remaining thread):
> >>
> >>E: libc.need_locks = -1
> >>R: LOCK(unrelated_lock)
> >>R:   libc.need_locks = 0
> >>R: UNLOCK(unrelated_lock)
> >>R: LOCK(E->killlock) // skips locking, now both R and E think they
> >>are holding the lock
> >>R: UNLOCK(E->killlock)
> >>E: UNLOCK(E->killlock)
> >>
> >>The lack of mutual exclusion means that the tid reuse problem that
> >>killlock is supposed to prevent might occur.
> >>
> >>Moreover, when UNLOCK(E->killlock) is called concurrently by R and
> >>E, a_fetch_add() could be done twice if timing is such that both
> >>threads notice that "l[0] < 0":
> >>
> >>   /* Check l[0] to see if we are multi-threaded. */
> >>   if (l[0] < 0) {
> >>       if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
> >>           __wake(l, 1, 1);
> >>       }
> >>   }
> >>
> >>In that case E->killlock will be assigned to INT_MAX (i.e. "unlocked
> >>with INT_MAX waiters"). This is a bad state because the next LOCK()
> >>will wrap it to "unlocked" state instead of locking. Also, if more
> >>than two threads attempt to use it, a deadlock will occur if two
> >>supposedly-owners execute a_fetch_add() concurrently in UNLOCK()
> >>after a third thread registered itself as a waiter because the lock
> >>will wrap to INT_MIN.
> >>
> >>Reordering the "libc.need_locks = -1" assignment and
> >>UNLOCK(E->killlock) and providing a store barrier between them
> >>should fix the issue.
> >
> >I think this all sounds correct. I'm not sure what you mean by a store
> >barrier between them, since all lock and unlock operations are already
> >full barriers.
> >
> 
> Before sending the report I tried to infer the intended ordering
> semantics of LOCK/UNLOCK by looking at their implementations. For
> AArch64, I didn't see why they would provide a full barrier (my
> reasoning is below), so I concluded that probably acquire/release
> semantics was intended in general and suggested an extra store
> barrier to prevent hoisting of "libc.need_locks = -1" store spelled
> after UNLOCK(E->killlock) back into the critical section.
> 
> UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are
> implemented via load-acquire/store-release instructions. Therefore,
> if we consider a LOCK/UNLOCK critical section containing only plain
> loads and stores, (a) any such memory access can be reordered with
> the initial ldaxr in UNLOCK, and (b) any plain load following UNLOCK
> can be reordered with stlxr (assuming the processor predicts that
> stlxr succeeds), and further, due to (a), with any memory access
> inside the critical section. Therefore, UNLOCK is not full barrier.
> Is this right?

If so, this is a bug in our implementation of aarch64 atomics. The
intent is that a_* all be seq_cst (practically, full acq+rel probably
suffices). Maybe someone more familiar with aarch64 can chime in.
I thought this was discussed and determined to be okay when the
aarch64 port was being reviewed, but now I'm starting to doubt that.

> As for a store following UNLOCK, I'm not sure. A plain store
> following stlxr can be reordered with it, but here that store is
> conditional on stlxr success. So even if the processor predicts
> stlxr success, it can't make the following store visible before it's
> sure that stlxr succeeded. But I don't know whether the events "the
> processor knows that stlxr succeeded" and "the result of stlxr is
> globally visible" are separate or not, and if they are separate,
> what comes first. Depending on the answer, UNLOCK acts as a store
> barrier or not.
> 
> >I'm still a little bit concerned though that there's no real model for
> >synchronization of the access to modify need_locks though.
> >Conceptually, the store of -1 from the exiting thread and the store of
> >0 in the surviving thread are not ordered with respect to each other
> >except by an unsynchronized causality relationship.
> >
> I don't understand implications of the lack of (stronger) ordering
> of these two stores. Could you clarify?

I'm talking about at the level of the memory model, not any
implementation details of the cpu. The store of 0 in the surviving
thread is not an atomic, and I was thinking this is possibly not
ordered with respect to the store of the -1 in the exiting thread
unless there's an intervening barrier, *after* the -1 is observed

However, looking back at the code, I was actually careful to ensure
this already. need_locks is loaded only once, before the a_cas, and
the a_cas provides the barrier. If we were instead re-loading
need_locks after the a_cas before testing <0 and assigning 0, there
would be a problem, at least at the abstract memory model level.

> >I suspect what "should be" happening is that, if we observe
> >need_locks==-1 (as a relaxed atomic load in our memory model), we take
> >the thread list lock (as the lock controlling write access to these
> >values) which ensures that the exiting thread has exited, then confirm
> >that we are the only thread left (are there async-signal ways without
> >UB that another thread could be created in the mean time?) before
> >setting it to 0. But this is a rather heavy operation. If there's an
> >assurance that no other threads can be created in the middle of LOCK
> >(which I think there should be), we could just use __tl_sync(), which
> >is much lighter, to conclude that we've synchronized with being the
> >only remaining thread and that unsynchronized access is okay.
> >
> 
> LOCK is called either with application signals blocked or from
> async-signal-unsafe functions, so interrupting it with a signal
> handler that calls pthread_create() is not possible without UB.
> Therefore, I don't see any issues with using __tl_sync() for
> synchronization with the exiting thread (although, as I said above,
> I don't understand what problem is being solved).

Based on the above, I agree there doesn't seem to be an actual problem
here. The reason it works is subtle, but present.

> By the way, I noticed a suspicious usage of "libc.need_locks" in fork():
> 
>     int need_locks = libc.need_locks > 0;
> 
> This is different from how LOCK behaves: LOCK will still perform
> normal locking one last time if it sees -1, but here locking will be
> skipped entirely. Is this intentional?

This is probably subtly incorrect. I think it requires more analysis
to determine if there's anything that can actually go wrong, but it
should probably just be changed anyway.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03  6:16   ` Alexey Izbyshev
  2022-10-03 12:33     ` Rich Felker
@ 2022-10-03 13:26     ` Szabolcs Nagy
  2022-10-03 21:27       ` Szabolcs Nagy
  1 sibling, 1 reply; 32+ messages in thread
From: Szabolcs Nagy @ 2022-10-03 13:26 UTC (permalink / raw)
  To: musl; +Cc: Alexey Izbyshev

* Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> On 2022-09-19 18:29, Rich Felker wrote:
> > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
...
> > > Reordering the "libc.need_locks = -1" assignment and
> > > UNLOCK(E->killlock) and providing a store barrier between them
> > > should fix the issue.
> > 
> > I think this all sounds correct. I'm not sure what you mean by a store
> > barrier between them, since all lock and unlock operations are already
> > full barriers.
> > 
> 
> Before sending the report I tried to infer the intended ordering semantics
> of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> see why they would provide a full barrier (my reasoning is below), so I
> concluded that probably acquire/release semantics was intended in general
> and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> = -1" store spelled after UNLOCK(E->killlock) back into the critical
> section.
> 
> UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> via load-acquire/store-release instructions. Therefore, if we consider a
> LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> such memory access can be reordered with the initial ldaxr in UNLOCK, and
> (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> the processor predicts that stlxr succeeds), and further, due to (a), with
> any memory access inside the critical section. Therefore, UNLOCK is not full
> barrier. Is this right?

i dont think this is right.

memory operations in the critical section cannot visibly happen after the
final stlxr.

memory operations after the critical section cannot visibly happen before
the ldaxr of UNLOCK.

the only issue can be inside the ll/sc loop in UNLOCK if there are independent
memory operations there, but there arent.

> 
> As for a store following UNLOCK, I'm not sure. A plain store following stlxr
> can be reordered with it, but here that store is conditional on stlxr
> success. So even if the processor predicts stlxr success, it can't make the
> following store visible before it's sure that stlxr succeeded. But I don't
> know whether the events "the processor knows that stlxr succeeded" and "the
> result of stlxr is globally visible" are separate or not, and if they are
> separate, what comes first. Depending on the answer, UNLOCK acts as a store
> barrier or not.
> 

UNLOCK on aarch64 acts as a full seqcst barrier as far as i can see.

but looking into the arch implementations needs a detailed understanding
of the arch memory model (eg aarch64 stlxr is RCsc not RCpc like iso c
release store), but there is no need for that: the musl model is
simple seqcst synchronization everywhere.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 13:26     ` Szabolcs Nagy
@ 2022-10-03 21:27       ` Szabolcs Nagy
  2022-10-03 22:54         ` Rich Felker
                           ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Szabolcs Nagy @ 2022-10-03 21:27 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

* Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:

> * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > On 2022-09-19 18:29, Rich Felker wrote:
> > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> ...
> > > > Reordering the "libc.need_locks = -1" assignment and
> > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > should fix the issue.
> > > 
> > > I think this all sounds correct. I'm not sure what you mean by a store
> > > barrier between them, since all lock and unlock operations are already
> > > full barriers.
> > > 
> > 
> > Before sending the report I tried to infer the intended ordering semantics
> > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > see why they would provide a full barrier (my reasoning is below), so I
> > concluded that probably acquire/release semantics was intended in general
> > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > section.
> > 
> > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > via load-acquire/store-release instructions. Therefore, if we consider a
> > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > the processor predicts that stlxr succeeds), and further, due to (a), with
> > any memory access inside the critical section. Therefore, UNLOCK is not full
> > barrier. Is this right?
> 
> i dont think this is right.


i think i was wrong and you are right.

so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
starting with 'something == 0' the exiting E and remaining R threads:

E:something=1      // protected by killlock
E:UNLOCK(killlock)
E:need_locks=-1

R:LOCK(unrelated)  // reads need_locks == -1
R:need_locks=0
R:UNLOCK(unrelated)
R:LOCK(killlock)   // does not lock
R:read something   // can it be 0 ?

and here something can be 0 (ie. not protected by killlock) on aarch64
because

T1
	something=1
	ldaxr ... killlock
	stlxr ... killlock
	need_locks=-1

T2
	x=need_locks
	ldaxr ... unrelated
	stlxr ... unrelated
	y=something

can end with x==-1 and y==0.

and to fix it, both a_fetch_add and a_cas need an a_barrier.

i need to think how to support such lock usage on aarch64
without adding too many dmb.



> 
> memory operations in the critical section cannot visibly happen after the
> final stlxr.
> 
> memory operations after the critical section cannot visibly happen before
> the ldaxr of UNLOCK.
> 
> the only issue can be inside the ll/sc loop in UNLOCK if there are independent
> memory operations there, but there arent.
> 
> > 
> > As for a store following UNLOCK, I'm not sure. A plain store following stlxr
> > can be reordered with it, but here that store is conditional on stlxr
> > success. So even if the processor predicts stlxr success, it can't make the
> > following store visible before it's sure that stlxr succeeded. But I don't
> > know whether the events "the processor knows that stlxr succeeded" and "the
> > result of stlxr is globally visible" are separate or not, and if they are
> > separate, what comes first. Depending on the answer, UNLOCK acts as a store
> > barrier or not.
> > 
> 
> UNLOCK on aarch64 acts as a full seqcst barrier as far as i can see.
> 
> but looking into the arch implementations needs a detailed understanding
> of the arch memory model (eg aarch64 stlxr is RCsc not RCpc like iso c
> release store), but there is no need for that: the musl model is
> simple seqcst synchronization everywhere.





^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 21:27       ` Szabolcs Nagy
@ 2022-10-03 22:54         ` Rich Felker
  2022-10-03 23:05           ` Rich Felker
  2022-10-04  2:58         ` Rich Felker
  2022-10-04  5:16         ` Alexey Izbyshev
  2 siblings, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-03 22:54 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> 
> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > ...
> > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > should fix the issue.
> > > > 
> > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > barrier between them, since all lock and unlock operations are already
> > > > full barriers.
> > > > 
> > > 
> > > Before sending the report I tried to infer the intended ordering semantics
> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > see why they would provide a full barrier (my reasoning is below), so I
> > > concluded that probably acquire/release semantics was intended in general
> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > section.
> > > 
> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > > barrier. Is this right?
> > 
> > i dont think this is right.
> 
> 
> i think i was wrong and you are right.
> 
> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> starting with 'something == 0' the exiting E and remaining R threads:
> 
> E:something=1      // protected by killlock
> E:UNLOCK(killlock)
> E:need_locks=-1
> 
> R:LOCK(unrelated)  // reads need_locks == -1
> R:need_locks=0
> R:UNLOCK(unrelated)
> R:LOCK(killlock)   // does not lock
> R:read something   // can it be 0 ?
> 
> and here something can be 0 (ie. not protected by killlock) on aarch64
> because
> 
> T1
> 	something=1
> 	ldaxr ... killlock
> 	stlxr ... killlock
> 	need_locks=-1
> 
> T2
> 	x=need_locks
> 	ldaxr ... unrelated
> 	stlxr ... unrelated
> 	y=something
> 
> can end with x==-1 and y==0.
> 
> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> 
> i need to think how to support such lock usage on aarch64
> without adding too many dmb.

I don't really understand this, but FWIW gcc emits 

    ldxr
    ...
    stlxr
    ...
    dmb ish

for __sync_val_compare_and_swap. So this is probably the right thing
we should have. And it seems to match what the kernel folks discussed
here:

http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html

I wondered if there are similar issues for any others archs which need
review, but it looks like all the other llsc archs have explicit
pre/post barriers defined.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 22:54         ` Rich Felker
@ 2022-10-03 23:05           ` Rich Felker
  2022-10-04 13:50             ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-03 23:05 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> > 
> > > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > > ...
> > > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > > should fix the issue.
> > > > > 
> > > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > > barrier between them, since all lock and unlock operations are already
> > > > > full barriers.
> > > > > 
> > > > 
> > > > Before sending the report I tried to infer the intended ordering semantics
> > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > > see why they would provide a full barrier (my reasoning is below), so I
> > > > concluded that probably acquire/release semantics was intended in general
> > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > > section.
> > > > 
> > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > > > barrier. Is this right?
> > > 
> > > i dont think this is right.
> > 
> > 
> > i think i was wrong and you are right.
> > 
> > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > starting with 'something == 0' the exiting E and remaining R threads:
> > 
> > E:something=1      // protected by killlock
> > E:UNLOCK(killlock)
> > E:need_locks=-1
> > 
> > R:LOCK(unrelated)  // reads need_locks == -1
> > R:need_locks=0
> > R:UNLOCK(unrelated)
> > R:LOCK(killlock)   // does not lock
> > R:read something   // can it be 0 ?
> > 
> > and here something can be 0 (ie. not protected by killlock) on aarch64
> > because
> > 
> > T1
> > 	something=1
> > 	ldaxr ... killlock
> > 	stlxr ... killlock
> > 	need_locks=-1
> > 
> > T2
> > 	x=need_locks
> > 	ldaxr ... unrelated
> > 	stlxr ... unrelated
> > 	y=something
> > 
> > can end with x==-1 and y==0.
> > 
> > and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > 
> > i need to think how to support such lock usage on aarch64
> > without adding too many dmb.
> 
> I don't really understand this, but FWIW gcc emits 
> 
>     ldxr
>     ...
>     stlxr
>     ...
>     dmb ish
> 
> for __sync_val_compare_and_swap. So this is probably the right thing
> we should have. And it seems to match what the kernel folks discussed
> here:
> 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> 
> I wondered if there are similar issues for any others archs which need
> review, but it looks like all the other llsc archs have explicit
> pre/post barriers defined.

Actually I don't understand what's going on with cmpxchg there. The
patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
some follow-up in the thread I don't understand, about the case where
the cas fails, but we already handle that by doing an explicit barrier
in that case.

This stuff is a poorly-documented mess. >_<

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 21:27       ` Szabolcs Nagy
  2022-10-03 22:54         ` Rich Felker
@ 2022-10-04  2:58         ` Rich Felker
  2022-10-04  3:00           ` Rich Felker
  2022-10-04  5:16         ` Alexey Izbyshev
  2 siblings, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-04  2:58 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> 
> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > ...
> > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > should fix the issue.
> > > > 
> > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > barrier between them, since all lock and unlock operations are already
> > > > full barriers.
> > > > 
> > > 
> > > Before sending the report I tried to infer the intended ordering semantics
> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > see why they would provide a full barrier (my reasoning is below), so I
> > > concluded that probably acquire/release semantics was intended in general
> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > section.
> > > 
> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > > barrier. Is this right?
> > 
> > i dont think this is right.
> 
> 
> i think i was wrong and you are right.
> 
> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> starting with 'something == 0' the exiting E and remaining R threads:
> 
> E:something=1      // protected by killlock
> E:UNLOCK(killlock)
> E:need_locks=-1
> 
> R:LOCK(unrelated)  // reads need_locks == -1
> R:need_locks=0
> R:UNLOCK(unrelated)
> R:LOCK(killlock)   // does not lock
> R:read something   // can it be 0 ?
> 
> and here something can be 0 (ie. not protected by killlock) on aarch64
> because
> 
> T1
> 	something=1
> 	ldaxr ... killlock
> 	stlxr ... killlock
> 	need_locks=-1
> 
> T2
> 	x=need_locks
> 	ldaxr ... unrelated
> 	stlxr ... unrelated
> 	y=something
> 
> can end with x==-1 and y==0.
> 
> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> 
> i need to think how to support such lock usage on aarch64
> without adding too many dmb.

OK, after reading a lot more, I think I'm starting to get what you're
saying. Am I correct in my understanding that the problem is that the
"R:LOCK(unrelated)" as implemented does not synchronize with the
"E:UNLOCK(killlock)" because they're different objects?

If so, I think this would be fully solved by using __tl_sync in the
code path that resets need_locks to 0 after observing -1, by virtue of
providing a common object (the thread list lock) to synchronize on.
This is the "weaker memory model friendly" approach we should probably
strive to achieve some day.

However, all existing code in musl is written assuming what I call the
"POSIX memory model" where the only operation is "synchronizes memory"
and that underspecified phrase has to be interpreted as "is a full
barrier" to admit any consistent model. Said differently, the code was
written assuming every a_* synchronizes with every other a_*, without
any regard for whether they act on the same objects. This likely even
matters for how our waiter accounting works (which is probably a good
argument for removing it and switching to waiter flags). So I think,
if the issue as I understand it now exists, we do need to fix it. Then
we can revisit this at some later time as part of a big project.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04  2:58         ` Rich Felker
@ 2022-10-04  3:00           ` Rich Felker
  2022-10-04  4:59             ` Rich Felker
  0 siblings, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-04  3:00 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Mon, Oct 03, 2022 at 10:58:32PM -0400, Rich Felker wrote:
> On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> > 
> > > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > > ...
> > > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > > should fix the issue.
> > > > > 
> > > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > > barrier between them, since all lock and unlock operations are already
> > > > > full barriers.
> > > > > 
> > > > 
> > > > Before sending the report I tried to infer the intended ordering semantics
> > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > > see why they would provide a full barrier (my reasoning is below), so I
> > > > concluded that probably acquire/release semantics was intended in general
> > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > > section.
> > > > 
> > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > > > barrier. Is this right?
> > > 
> > > i dont think this is right.
> > 
> > 
> > i think i was wrong and you are right.
> > 
> > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > starting with 'something == 0' the exiting E and remaining R threads:
> > 
> > E:something=1      // protected by killlock
> > E:UNLOCK(killlock)
> > E:need_locks=-1
> > 
> > R:LOCK(unrelated)  // reads need_locks == -1
> > R:need_locks=0
> > R:UNLOCK(unrelated)
> > R:LOCK(killlock)   // does not lock
> > R:read something   // can it be 0 ?
> > 
> > and here something can be 0 (ie. not protected by killlock) on aarch64
> > because
> > 
> > T1
> > 	something=1
> > 	ldaxr ... killlock
> > 	stlxr ... killlock
> > 	need_locks=-1
> > 
> > T2
> > 	x=need_locks
> > 	ldaxr ... unrelated
> > 	stlxr ... unrelated
> > 	y=something
> > 
> > can end with x==-1 and y==0.
> > 
> > and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > 
> > i need to think how to support such lock usage on aarch64
> > without adding too many dmb.
> 
> OK, after reading a lot more, I think I'm starting to get what you're
> saying. Am I correct in my understanding that the problem is that the
> "R:LOCK(unrelated)" as implemented does not synchronize with the
> "E:UNLOCK(killlock)" because they're different objects?
> 
> If so, I think this would be fully solved by using __tl_sync in the
> code path that resets need_locks to 0 after observing -1, by virtue of
> providing a common object (the thread list lock) to synchronize on.
> This is the "weaker memory model friendly" approach we should probably
> strive to achieve some day.
> 
> However, all existing code in musl is written assuming what I call the
> "POSIX memory model" where the only operation is "synchronizes memory"
> and that underspecified phrase has to be interpreted as "is a full
> barrier" to admit any consistent model. Said differently, the code was
> written assuming every a_* synchronizes with every other a_*, without
> any regard for whether they act on the same objects. This likely even
> matters for how our waiter accounting works (which is probably a good
> argument for removing it and switching to waiter flags). So I think,
> if the issue as I understand it now exists, we do need to fix it. Then
> we can revisit this at some later time as part of a big project.

Forgot, I should include links to the material I've been reading:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=f70fb3b635f9618c6d2ee3848ba836914f7951c2
https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=ab876106eb689947cdd8203f8ecc6e8ac38bf5ba

which is where the GCC folks seem to have encountered and fixed their
corresponding issue.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04  3:00           ` Rich Felker
@ 2022-10-04  4:59             ` Rich Felker
  2022-10-04  8:16               ` Szabolcs Nagy
  2022-10-04 10:18               ` Alexey Izbyshev
  0 siblings, 2 replies; 32+ messages in thread
From: Rich Felker @ 2022-10-04  4:59 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

[-- Attachment #1: Type: text/plain, Size: 5018 bytes --]

On Mon, Oct 03, 2022 at 11:00:21PM -0400, Rich Felker wrote:
> On Mon, Oct 03, 2022 at 10:58:32PM -0400, Rich Felker wrote:
> > On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> > > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> > > 
> > > > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > > > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > > > ...
> > > > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > > > should fix the issue.
> > > > > > 
> > > > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > > > barrier between them, since all lock and unlock operations are already
> > > > > > full barriers.
> > > > > > 
> > > > > 
> > > > > Before sending the report I tried to infer the intended ordering semantics
> > > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > > > see why they would provide a full barrier (my reasoning is below), so I
> > > > > concluded that probably acquire/release semantics was intended in general
> > > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > > > section.
> > > > > 
> > > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > > > > barrier. Is this right?
> > > > 
> > > > i dont think this is right.
> > > 
> > > 
> > > i think i was wrong and you are right.
> > > 
> > > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > > starting with 'something == 0' the exiting E and remaining R threads:
> > > 
> > > E:something=1      // protected by killlock
> > > E:UNLOCK(killlock)
> > > E:need_locks=-1
> > > 
> > > R:LOCK(unrelated)  // reads need_locks == -1
> > > R:need_locks=0
> > > R:UNLOCK(unrelated)
> > > R:LOCK(killlock)   // does not lock
> > > R:read something   // can it be 0 ?
> > > 
> > > and here something can be 0 (ie. not protected by killlock) on aarch64
> > > because
> > > 
> > > T1
> > > 	something=1
> > > 	ldaxr ... killlock
> > > 	stlxr ... killlock
> > > 	need_locks=-1
> > > 
> > > T2
> > > 	x=need_locks
> > > 	ldaxr ... unrelated
> > > 	stlxr ... unrelated
> > > 	y=something
> > > 
> > > can end with x==-1 and y==0.
> > > 
> > > and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > > 
> > > i need to think how to support such lock usage on aarch64
> > > without adding too many dmb.
> > 
> > OK, after reading a lot more, I think I'm starting to get what you're
> > saying. Am I correct in my understanding that the problem is that the
> > "R:LOCK(unrelated)" as implemented does not synchronize with the
> > "E:UNLOCK(killlock)" because they're different objects?
> > 
> > If so, I think this would be fully solved by using __tl_sync in the
> > code path that resets need_locks to 0 after observing -1, by virtue of
> > providing a common object (the thread list lock) to synchronize on.
> > This is the "weaker memory model friendly" approach we should probably
> > strive to achieve some day.
> > 
> > However, all existing code in musl is written assuming what I call the
> > "POSIX memory model" where the only operation is "synchronizes memory"
> > and that underspecified phrase has to be interpreted as "is a full
> > barrier" to admit any consistent model. Said differently, the code was
> > written assuming every a_* synchronizes with every other a_*, without
> > any regard for whether they act on the same objects. This likely even
> > matters for how our waiter accounting works (which is probably a good
> > argument for removing it and switching to waiter flags). So I think,
> > if the issue as I understand it now exists, we do need to fix it. Then
> > we can revisit this at some later time as part of a big project.
> 
> Forgot, I should include links to the material I've been reading:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
> https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=f70fb3b635f9618c6d2ee3848ba836914f7951c2
> https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=ab876106eb689947cdd8203f8ecc6e8ac38bf5ba
> 
> which is where the GCC folks seem to have encountered and fixed their
> corresponding issue.

Proposed patch attached.

Rich

[-- Attachment #2: aarch64-barrier.diff --]
[-- Type: text/plain, Size: 1370 bytes --]

diff --git a/arch/aarch64/atomic_arch.h b/arch/aarch64/atomic_arch.h
index 40fefc25..c01a3ab3 100644
--- a/arch/aarch64/atomic_arch.h
+++ b/arch/aarch64/atomic_arch.h
@@ -2,7 +2,7 @@
 static inline int a_ll(volatile int *p)
 {
 	int v;
-	__asm__ __volatile__ ("ldaxr %w0,%1" : "=r"(v) : "Q"(*p));
+	__asm__ __volatile__ ("ldxr %w0,%1" : "=r"(v) : "Q"(*p));
 	return v;
 }
 
@@ -20,25 +20,13 @@ static inline void a_barrier()
 	__asm__ __volatile__ ("dmb ish" : : : "memory");
 }
 
-#define a_cas a_cas
-static inline int a_cas(volatile int *p, int t, int s)
-{
-	int old;
-	do {
-		old = a_ll(p);
-		if (old != t) {
-			a_barrier();
-			break;
-		}
-	} while (!a_sc(p, s));
-	return old;
-}
+#define a_post_llsc a_barrier
 
 #define a_ll_p a_ll_p
 static inline void *a_ll_p(volatile void *p)
 {
 	void *v;
-	__asm__ __volatile__ ("ldaxr %0, %1" : "=r"(v) : "Q"(*(void *volatile *)p));
+	__asm__ __volatile__ ("ldxr %0, %1" : "=r"(v) : "Q"(*(void *volatile *)p));
 	return v;
 }
 
@@ -50,20 +38,6 @@ static inline int a_sc_p(volatile int *p, void *v)
 	return !r;
 }
 
-#define a_cas_p a_cas_p
-static inline void *a_cas_p(volatile void *p, void *t, void *s)
-{
-	void *old;
-	do {
-		old = a_ll_p(p);
-		if (old != t) {
-			a_barrier();
-			break;
-		}
-	} while (!a_sc_p(p, s));
-	return old;
-}
-
 #define a_ctz_64 a_ctz_64
 static inline int a_ctz_64(uint64_t x)
 {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 21:27       ` Szabolcs Nagy
  2022-10-03 22:54         ` Rich Felker
  2022-10-04  2:58         ` Rich Felker
@ 2022-10-04  5:16         ` Alexey Izbyshev
  2022-10-04  8:31           ` Szabolcs Nagy
  2 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-04  5:16 UTC (permalink / raw)
  To: musl

On 2022-10-04 00:27, Szabolcs Nagy wrote:
> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> 
>> * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
>> > On 2022-09-19 18:29, Rich Felker wrote:
>> > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> ...
>> > > > Reordering the "libc.need_locks = -1" assignment and
>> > > > UNLOCK(E->killlock) and providing a store barrier between them
>> > > > should fix the issue.
>> > >
>> > > I think this all sounds correct. I'm not sure what you mean by a store
>> > > barrier between them, since all lock and unlock operations are already
>> > > full barriers.
>> > >
>> >
>> > Before sending the report I tried to infer the intended ordering semantics
>> > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> > see why they would provide a full barrier (my reasoning is below), so I
>> > concluded that probably acquire/release semantics was intended in general
>> > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> > section.
>> >
>> > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> > via load-acquire/store-release instructions. Therefore, if we consider a
>> > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> > the processor predicts that stlxr succeeds), and further, due to (a), with
>> > any memory access inside the critical section. Therefore, UNLOCK is not full
>> > barrier. Is this right?
>> 
>> i dont think this is right.
> 
> 
> i think i was wrong and you are right.
> 
> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> starting with 'something == 0' the exiting E and remaining R threads:
> 
> E:something=1      // protected by killlock
> E:UNLOCK(killlock)
> E:need_locks=-1
> 
> R:LOCK(unrelated)  // reads need_locks == -1
> R:need_locks=0
> R:UNLOCK(unrelated)
> R:LOCK(killlock)   // does not lock
> R:read something   // can it be 0 ?
> 
> and here something can be 0 (ie. not protected by killlock) on aarch64
> because
> 
> T1
> 	something=1
> 	ldaxr ... killlock
> 	stlxr ... killlock
> 	need_locks=-1
> 
> T2
> 	x=need_locks
> 	ldaxr ... unrelated
> 	stlxr ... unrelated
> 	y=something
> 
> can end with x==-1 and y==0.
> 
Yes, overall this matches my understanding. However, your UNLOCK 
expansion (in T1/T2) omits the branch instruction between stlxr and the 
following store, and, as I mentioned, I'm not sufficiently knowledgeable 
to understand the effects of this branch on the order of visibility of 
"stlxr killlock" (and preceding stores) and "need_locks=-1".

> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> 
> i need to think how to support such lock usage on aarch64
> without adding too many dmb.
> 

I'd also like to point out that our discussion so far has been focused 
on reordering of "something" protected by the critical section and 
"need_locks=-1", but my original bug report also mentioned the 
possibility of lock corruption, and for the latter we're also interested 
in reordering of "stlxr" and "need_locks=-1". It's possible that some 
UNLOCK implementations can provide the first guarantee, but not the 
second.

> 
> 
>> 
>> memory operations in the critical section cannot visibly happen after 
>> the
>> final stlxr.
>> 
>> memory operations after the critical section cannot visibly happen 
>> before
>> the ldaxr of UNLOCK.
>> 
>> the only issue can be inside the ll/sc loop in UNLOCK if there are 
>> independent
>> memory operations there, but there arent.
>> 
You've already said this is wrong, but I want to spell out why it's so 
for any interested parties to hopefully clear some confusion. Both 
premises (about stlxr and ldaxr) are correct, but the third point 
doesn't follow and is wrong. Because ldaxr/stlxr act as one-way 
barriers, the following pseudo-code

r1 = X
ldaxr LOCK
stlxr LOCK
r2 = Y

can be observed, in my understanding of AArch64 memory model, as

ldaxr LOCK
r1 = X
r2 = Y
stlxr LOCK

and even as

ldaxr LOCK
r2 = Y
r1 = X
stlxr LOCK

The same is true for stores as well.

So effectively the loop in UNLOCK can absorb memory operations from both 
directions, which then can be reordered with each other (with a possible 
exception of stores that follow UNLOCK, as I said above).

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04  4:59             ` Rich Felker
@ 2022-10-04  8:16               ` Szabolcs Nagy
  2022-10-04 10:18               ` Alexey Izbyshev
  1 sibling, 0 replies; 32+ messages in thread
From: Szabolcs Nagy @ 2022-10-04  8:16 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, Alexey Izbyshev

* Rich Felker <dalias@libc.org> [2022-10-04 00:59:51 -0400]:
> On Mon, Oct 03, 2022 at 11:00:21PM -0400, Rich Felker wrote:
> > On Mon, Oct 03, 2022 at 10:58:32PM -0400, Rich Felker wrote:
> > > On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> > > > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> > > > 
> > > > > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > > > > > On 2022-09-19 18:29, Rich Felker wrote:
> > > > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > > > > ...
> > > > > > > > Reordering the "libc.need_locks = -1" assignment and
> > > > > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > > > > > > > should fix the issue.
> > > > > > > 
> > > > > > > I think this all sounds correct. I'm not sure what you mean by a store
> > > > > > > barrier between them, since all lock and unlock operations are already
> > > > > > > full barriers.
> > > > > > > 
> > > > > > 
> > > > > > Before sending the report I tried to infer the intended ordering semantics
> > > > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > > > > > see why they would provide a full barrier (my reasoning is below), so I
> > > > > > concluded that probably acquire/release semantics was intended in general
> > > > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > > > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > > > > > section.
> > > > > > 
> > > > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > > > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > > > > > via load-acquire/store-release instructions. Therefore, if we consider a
> > > > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > > > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > > > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > > > > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > > > > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > > > > > barrier. Is this right?
> > > > > 
> > > > > i dont think this is right.
> > > > 
> > > > 
> > > > i think i was wrong and you are right.
> > > > 
> > > > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > > > starting with 'something == 0' the exiting E and remaining R threads:
> > > > 
> > > > E:something=1      // protected by killlock
> > > > E:UNLOCK(killlock)
> > > > E:need_locks=-1
> > > > 
> > > > R:LOCK(unrelated)  // reads need_locks == -1
> > > > R:need_locks=0
> > > > R:UNLOCK(unrelated)
> > > > R:LOCK(killlock)   // does not lock
> > > > R:read something   // can it be 0 ?
> > > > 
> > > > and here something can be 0 (ie. not protected by killlock) on aarch64
> > > > because
> > > > 
> > > > T1
> > > > 	something=1
> > > > 	ldaxr ... killlock
> > > > 	stlxr ... killlock
> > > > 	need_locks=-1
> > > > 
> > > > T2
> > > > 	x=need_locks
> > > > 	ldaxr ... unrelated
> > > > 	stlxr ... unrelated
> > > > 	y=something
> > > > 
> > > > can end with x==-1 and y==0.
> > > > 
> > > > and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > > > 
> > > > i need to think how to support such lock usage on aarch64
> > > > without adding too many dmb.
> > > 
> > > OK, after reading a lot more, I think I'm starting to get what you're
> > > saying. Am I correct in my understanding that the problem is that the
> > > "R:LOCK(unrelated)" as implemented does not synchronize with the
> > > "E:UNLOCK(killlock)" because they're different objects?


yes if the same lock is used that solves this reordering.
btw there is a tool where you can check this (but bit tricky to use).
the online version is http://diy.inria.fr/www/?record=aarch64
with some explanation at
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/how-to-use-the-memory-model-tool
e.g. try

AArch64 samelock
{
0:X0=x; 0:X1=y; 0:X2=lock; 0:X9=l2;
1:X0=x; 1:X1=y; 1:X2=lock; 1:X9=l2;
}
P0                  | P1                ;
MOV W3, #1          | MOV W3,#1         ;
STR W3,[X0]         | LDR W4,[X1]       ;
LDAXR W4,[X2]       | LDAXR W5,[X2]     ;
STLXR W5,W3,[X2]    | STLXR W6,W3,[X2]  ;
STR W3,[X1]         | LDR W7,[X0]       ;
exists(0:X5=0 /\ 1:X6=0 /\ 1:X4=1 /\ 1:X7=0)

if you change X2 to X9 on one side then the reordering is observable.

however note that independent ldaxr cannot move across an stlxr
(in either direction) so normal loads in independent critical
sections won't be observed mixed. here the issue is that we
expect normal access after unlock outside the critical section
to be ordered.

> > > If so, I think this would be fully solved by using __tl_sync in the
> > > code path that resets need_locks to 0 after observing -1, by virtue of
> > > providing a common object (the thread list lock) to synchronize on.
> > > This is the "weaker memory model friendly" approach we should probably
> > > strive to achieve some day.
> > > 
> > > However, all existing code in musl is written assuming what I call the
> > > "POSIX memory model" where the only operation is "synchronizes memory"
> > > and that underspecified phrase has to be interpreted as "is a full
> > > barrier" to admit any consistent model. Said differently, the code was
> > > written assuming every a_* synchronizes with every other a_*, without
> > > any regard for whether they act on the same objects. This likely even
> > > matters for how our waiter accounting works (which is probably a good
> > > argument for removing it and switching to waiter flags). So I think,
> > > if the issue as I understand it now exists, we do need to fix it. Then
> > > we can revisit this at some later time as part of a big project.
> > 
> > Forgot, I should include links to the material I've been reading:
> > 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
> > https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=f70fb3b635f9618c6d2ee3848ba836914f7951c2
> > https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=ab876106eb689947cdd8203f8ecc6e8ac38bf5ba
> > 
> > which is where the GCC folks seem to have encountered and fixed their
> > corresponding issue.
> 
> Proposed patch attached.
> 
> Rich

the patch looks good to me.
thanks.

> diff --git a/arch/aarch64/atomic_arch.h b/arch/aarch64/atomic_arch.h
> index 40fefc25..c01a3ab3 100644
> --- a/arch/aarch64/atomic_arch.h
> +++ b/arch/aarch64/atomic_arch.h
> @@ -2,7 +2,7 @@
>  static inline int a_ll(volatile int *p)
>  {
>  	int v;
> -	__asm__ __volatile__ ("ldaxr %w0,%1" : "=r"(v) : "Q"(*p));
> +	__asm__ __volatile__ ("ldxr %w0,%1" : "=r"(v) : "Q"(*p));
>  	return v;
>  }
>  
> @@ -20,25 +20,13 @@ static inline void a_barrier()
>  	__asm__ __volatile__ ("dmb ish" : : : "memory");
>  }
>  
> -#define a_cas a_cas
> -static inline int a_cas(volatile int *p, int t, int s)
> -{
> -	int old;
> -	do {
> -		old = a_ll(p);
> -		if (old != t) {
> -			a_barrier();
> -			break;
> -		}
> -	} while (!a_sc(p, s));
> -	return old;
> -}
> +#define a_post_llsc a_barrier
>  
>  #define a_ll_p a_ll_p
>  static inline void *a_ll_p(volatile void *p)
>  {
>  	void *v;
> -	__asm__ __volatile__ ("ldaxr %0, %1" : "=r"(v) : "Q"(*(void *volatile *)p));
> +	__asm__ __volatile__ ("ldxr %0, %1" : "=r"(v) : "Q"(*(void *volatile *)p));
>  	return v;
>  }
>  
> @@ -50,20 +38,6 @@ static inline int a_sc_p(volatile int *p, void *v)
>  	return !r;
>  }
>  
> -#define a_cas_p a_cas_p
> -static inline void *a_cas_p(volatile void *p, void *t, void *s)
> -{
> -	void *old;
> -	do {
> -		old = a_ll_p(p);
> -		if (old != t) {
> -			a_barrier();
> -			break;
> -		}
> -	} while (!a_sc_p(p, s));
> -	return old;
> -}
> -
>  #define a_ctz_64 a_ctz_64
>  static inline int a_ctz_64(uint64_t x)
>  {


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04  5:16         ` Alexey Izbyshev
@ 2022-10-04  8:31           ` Szabolcs Nagy
  2022-10-04 10:28             ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Szabolcs Nagy @ 2022-10-04  8:31 UTC (permalink / raw)
  To: musl

* Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-04 08:16:16 +0300]:

> On 2022-10-04 00:27, Szabolcs Nagy wrote:
> > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> > i think i was wrong and you are right.
> > 
> > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > starting with 'something == 0' the exiting E and remaining R threads:
> > 
> > E:something=1      // protected by killlock
> > E:UNLOCK(killlock)
> > E:need_locks=-1
> > 
> > R:LOCK(unrelated)  // reads need_locks == -1
> > R:need_locks=0
> > R:UNLOCK(unrelated)
> > R:LOCK(killlock)   // does not lock
> > R:read something   // can it be 0 ?
> > 
> > and here something can be 0 (ie. not protected by killlock) on aarch64
> > because
> > 
> > T1
> > 	something=1
> > 	ldaxr ... killlock
> > 	stlxr ... killlock
> > 	need_locks=-1
> > 
> > T2
> > 	x=need_locks
> > 	ldaxr ... unrelated
> > 	stlxr ... unrelated
> > 	y=something
> > 
> > can end with x==-1 and y==0.
> > 
> Yes, overall this matches my understanding. However, your UNLOCK expansion
> (in T1/T2) omits the branch instruction between stlxr and the following
> store, and, as I mentioned, I'm not sufficiently knowledgeable to understand
> the effects of this branch on the order of visibility of "stlxr killlock"
> (and preceding stores) and "need_locks=-1".

i don't know the answer, but i think in musl we don't want to rely on
control dependcies in the architectural memory model anyway (in some
cases the compiler can break it and it's hard to reason about).

> 
> > and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > 
> > i need to think how to support such lock usage on aarch64
> > without adding too many dmb.
> > 
> 
> I'd also like to point out that our discussion so far has been focused on
> reordering of "something" protected by the critical section and
> "need_locks=-1", but my original bug report also mentioned the possibility
> of lock corruption, and for the latter we're also interested in reordering
> of "stlxr" and "need_locks=-1". It's possible that some UNLOCK
> implementations can provide the first guarantee, but not the second.

i skipped that because i did not fully understand it and thought
it's fixed once we make the atomics primitives full barriers.
thanks.

> 
> > 
> > 
> > > 
> > > memory operations in the critical section cannot visibly happen
> > > after the
> > > final stlxr.
> > > 
> > > memory operations after the critical section cannot visibly happen
> > > before
> > > the ldaxr of UNLOCK.
> > > 
> > > the only issue can be inside the ll/sc loop in UNLOCK if there are
> > > independent
> > > memory operations there, but there arent.
> > > 
> You've already said this is wrong, but I want to spell out why it's so for
> any interested parties to hopefully clear some confusion. Both premises
> (about stlxr and ldaxr) are correct, but the third point doesn't follow and
> is wrong. Because ldaxr/stlxr act as one-way barriers, the following
> pseudo-code
> 
> r1 = X
> ldaxr LOCK
> stlxr LOCK
> r2 = Y
> 
> can be observed, in my understanding of AArch64 memory model, as
> 
> ldaxr LOCK
> r1 = X
> r2 = Y
> stlxr LOCK
> 
> and even as
> 
> ldaxr LOCK
> r2 = Y
> r1 = X
> stlxr LOCK
> 
> The same is true for stores as well.
> 
> So effectively the loop in UNLOCK can absorb memory operations from both
> directions, which then can be reordered with each other (with a possible
> exception of stores that follow UNLOCK, as I said above).
> 
> Thanks,
> Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04  4:59             ` Rich Felker
  2022-10-04  8:16               ` Szabolcs Nagy
@ 2022-10-04 10:18               ` Alexey Izbyshev
  1 sibling, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-04 10:18 UTC (permalink / raw)
  To: musl; +Cc: musl

On 2022-10-04 07:59, Rich Felker wrote:
> On Mon, Oct 03, 2022 at 11:00:21PM -0400, Rich Felker wrote:
>> On Mon, Oct 03, 2022 at 10:58:32PM -0400, Rich Felker wrote:
>> > On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>> > > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
>> > >
>> > > > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
>> > > > > On 2022-09-19 18:29, Rich Felker wrote:
>> > > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> > > > ...
>> > > > > > > Reordering the "libc.need_locks = -1" assignment and
>> > > > > > > UNLOCK(E->killlock) and providing a store barrier between them
>> > > > > > > should fix the issue.
>> > > > > >
>> > > > > > I think this all sounds correct. I'm not sure what you mean by a store
>> > > > > > barrier between them, since all lock and unlock operations are already
>> > > > > > full barriers.
>> > > > > >
>> > > > >
>> > > > > Before sending the report I tried to infer the intended ordering semantics
>> > > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> > > > > see why they would provide a full barrier (my reasoning is below), so I
>> > > > > concluded that probably acquire/release semantics was intended in general
>> > > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> > > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> > > > > section.
>> > > > >
>> > > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> > > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> > > > > via load-acquire/store-release instructions. Therefore, if we consider a
>> > > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> > > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> > > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> > > > > the processor predicts that stlxr succeeds), and further, due to (a), with
>> > > > > any memory access inside the critical section. Therefore, UNLOCK is not full
>> > > > > barrier. Is this right?
>> > > >
>> > > > i dont think this is right.
>> > >
>> > >
>> > > i think i was wrong and you are right.
>> > >
>> > > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> > > starting with 'something == 0' the exiting E and remaining R threads:
>> > >
>> > > E:something=1      // protected by killlock
>> > > E:UNLOCK(killlock)
>> > > E:need_locks=-1
>> > >
>> > > R:LOCK(unrelated)  // reads need_locks == -1
>> > > R:need_locks=0
>> > > R:UNLOCK(unrelated)
>> > > R:LOCK(killlock)   // does not lock
>> > > R:read something   // can it be 0 ?
>> > >
>> > > and here something can be 0 (ie. not protected by killlock) on aarch64
>> > > because
>> > >
>> > > T1
>> > > 	something=1
>> > > 	ldaxr ... killlock
>> > > 	stlxr ... killlock
>> > > 	need_locks=-1
>> > >
>> > > T2
>> > > 	x=need_locks
>> > > 	ldaxr ... unrelated
>> > > 	stlxr ... unrelated
>> > > 	y=something
>> > >
>> > > can end with x==-1 and y==0.
>> > >
>> > > and to fix it, both a_fetch_add and a_cas need an a_barrier.
>> > >
>> > > i need to think how to support such lock usage on aarch64
>> > > without adding too many dmb.
>> >
>> > OK, after reading a lot more, I think I'm starting to get what you're
>> > saying. Am I correct in my understanding that the problem is that the
>> > "R:LOCK(unrelated)" as implemented does not synchronize with the
>> > "E:UNLOCK(killlock)" because they're different objects?
>> >
>> > If so, I think this would be fully solved by using __tl_sync in the
>> > code path that resets need_locks to 0 after observing -1, by virtue of
>> > providing a common object (the thread list lock) to synchronize on.
>> > This is the "weaker memory model friendly" approach we should probably
>> > strive to achieve some day.
>> >
>> > However, all existing code in musl is written assuming what I call the
>> > "POSIX memory model" where the only operation is "synchronizes memory"
>> > and that underspecified phrase has to be interpreted as "is a full
>> > barrier" to admit any consistent model. Said differently, the code was
>> > written assuming every a_* synchronizes with every other a_*, without
>> > any regard for whether they act on the same objects. This likely even
>> > matters for how our waiter accounting works (which is probably a good
>> > argument for removing it and switching to waiter flags). So I think,
>> > if the issue as I understand it now exists, we do need to fix it. Then
>> > we can revisit this at some later time as part of a big project.
>> 
>> Forgot, I should include links to the material I've been reading:
>> 
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
>> https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=f70fb3b635f9618c6d2ee3848ba836914f7951c2
>> https://gcc.gnu.org/git?p=gcc.git;a=commitdiff;h=ab876106eb689947cdd8203f8ecc6e8ac38bf5ba
>> 
>> which is where the GCC folks seem to have encountered and fixed their
>> corresponding issue.

Thank you for the links.

This comment[1] addresses my doubts on whether a store following (the 
current implementation of) UNLOCK can become visible before stlxr 
despite the conditional branch instruction in between (yes, it can).

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c23
> 
> Proposed patch attached.
> 
The patch looks good to me.

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04  8:31           ` Szabolcs Nagy
@ 2022-10-04 10:28             ` Alexey Izbyshev
  0 siblings, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-04 10:28 UTC (permalink / raw)
  To: musl

On 2022-10-04 11:31, Szabolcs Nagy wrote:
> * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-04 08:16:16 +0300]:
> 
>> On 2022-10-04 00:27, Szabolcs Nagy wrote:
>> > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
>> > i think i was wrong and you are right.
>> >
>> > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> > starting with 'something == 0' the exiting E and remaining R threads:
>> >
>> > E:something=1      // protected by killlock
>> > E:UNLOCK(killlock)
>> > E:need_locks=-1
>> >
>> > R:LOCK(unrelated)  // reads need_locks == -1
>> > R:need_locks=0
>> > R:UNLOCK(unrelated)
>> > R:LOCK(killlock)   // does not lock
>> > R:read something   // can it be 0 ?
>> >
>> > and here something can be 0 (ie. not protected by killlock) on aarch64
>> > because
>> >
>> > T1
>> > 	something=1
>> > 	ldaxr ... killlock
>> > 	stlxr ... killlock
>> > 	need_locks=-1
>> >
>> > T2
>> > 	x=need_locks
>> > 	ldaxr ... unrelated
>> > 	stlxr ... unrelated
>> > 	y=something
>> >
>> > can end with x==-1 and y==0.
>> >
>> Yes, overall this matches my understanding. However, your UNLOCK 
>> expansion
>> (in T1/T2) omits the branch instruction between stlxr and the 
>> following
>> store, and, as I mentioned, I'm not sufficiently knowledgeable to 
>> understand
>> the effects of this branch on the order of visibility of "stlxr 
>> killlock"
>> (and preceding stores) and "need_locks=-1".
> 
> i don't know the answer, but i think in musl we don't want to rely on
> control dependcies in the architectural memory model anyway (in some
> cases the compiler can break it and it's hard to reason about).
> 
I agree. The answer appears to be that the reordering can occur even in 
the presence of the branch instruction[1], though I haven't confirmed it 
with tools that show valid executions yet.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c23

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-03 23:05           ` Rich Felker
@ 2022-10-04 13:50             ` Alexey Izbyshev
  2022-10-04 14:12               ` Rich Felker
  2022-10-04 16:01               ` Alexey Izbyshev
  0 siblings, 2 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-04 13:50 UTC (permalink / raw)
  To: musl

On 2022-10-04 02:05, Rich Felker wrote:
> On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
>> On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>> > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
>> >
>> > > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
>> > > > On 2022-09-19 18:29, Rich Felker wrote:
>> > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> > > ...
>> > > > > > Reordering the "libc.need_locks = -1" assignment and
>> > > > > > UNLOCK(E->killlock) and providing a store barrier between them
>> > > > > > should fix the issue.
>> > > > >
>> > > > > I think this all sounds correct. I'm not sure what you mean by a store
>> > > > > barrier between them, since all lock and unlock operations are already
>> > > > > full barriers.
>> > > > >
>> > > >
>> > > > Before sending the report I tried to infer the intended ordering semantics
>> > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> > > > see why they would provide a full barrier (my reasoning is below), so I
>> > > > concluded that probably acquire/release semantics was intended in general
>> > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> > > > section.
>> > > >
>> > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> > > > via load-acquire/store-release instructions. Therefore, if we consider a
>> > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> > > > the processor predicts that stlxr succeeds), and further, due to (a), with
>> > > > any memory access inside the critical section. Therefore, UNLOCK is not full
>> > > > barrier. Is this right?
>> > >
>> > > i dont think this is right.
>> >
>> >
>> > i think i was wrong and you are right.
>> >
>> > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> > starting with 'something == 0' the exiting E and remaining R threads:
>> >
>> > E:something=1      // protected by killlock
>> > E:UNLOCK(killlock)
>> > E:need_locks=-1
>> >
>> > R:LOCK(unrelated)  // reads need_locks == -1
>> > R:need_locks=0
>> > R:UNLOCK(unrelated)
>> > R:LOCK(killlock)   // does not lock
>> > R:read something   // can it be 0 ?
>> >
>> > and here something can be 0 (ie. not protected by killlock) on aarch64
>> > because
>> >
>> > T1
>> > 	something=1
>> > 	ldaxr ... killlock
>> > 	stlxr ... killlock
>> > 	need_locks=-1
>> >
>> > T2
>> > 	x=need_locks
>> > 	ldaxr ... unrelated
>> > 	stlxr ... unrelated
>> > 	y=something
>> >
>> > can end with x==-1 and y==0.
>> >
>> > and to fix it, both a_fetch_add and a_cas need an a_barrier.
>> >
>> > i need to think how to support such lock usage on aarch64
>> > without adding too many dmb.
>> 
>> I don't really understand this, but FWIW gcc emits
>> 
>>     ldxr
>>     ...
>>     stlxr
>>     ...
>>     dmb ish
>> 
>> for __sync_val_compare_and_swap. So this is probably the right thing
>> we should have. And it seems to match what the kernel folks discussed
>> here:
>> 
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
>> 
>> I wondered if there are similar issues for any others archs which need
>> review, but it looks like all the other llsc archs have explicit
>> pre/post barriers defined.
> 
> Actually I don't understand what's going on with cmpxchg there. The
> patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
> some follow-up in the thread I don't understand, about the case where
> the cas fails, but we already handle that by doing an explicit barrier
> in that case.
> 
I think in that follow-up[1] they mean the following case (in musl 
terms):

volatile int x, flag;

T1:
     x = 1;
     a_store(&flag, 1);

T2:
     while (!flag);
     a_cas(&x, 0, 1); // can this fail?

They want it to never fail. But if a_cas() is implemented as 
ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered 
with the load of flag.

Note that musl does *not* handle this now, because a_barrier() in the 
failure path is after a_ll().

[1] 
https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 13:50             ` Alexey Izbyshev
@ 2022-10-04 14:12               ` Rich Felker
  2022-10-04 14:19                 ` Rich Felker
  2022-10-04 16:24                 ` James Y Knight
  2022-10-04 16:01               ` Alexey Izbyshev
  1 sibling, 2 replies; 32+ messages in thread
From: Rich Felker @ 2022-10-04 14:12 UTC (permalink / raw)
  To: musl

On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
> On 2022-10-04 02:05, Rich Felker wrote:
> >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> >>> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> >>>
> >>> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> >>> > > On 2022-09-19 18:29, Rich Felker wrote:
> >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>> > ...
> >>> > > > > Reordering the "libc.need_locks = -1" assignment and
> >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> >>> > > > > should fix the issue.
> >>> > > >
> >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
> >>> > > > barrier between them, since all lock and unlock operations are already
> >>> > > > full barriers.
> >>> > > >
> >>> > >
> >>> > > Before sending the report I tried to infer the intended ordering semantics
> >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> >>> > > see why they would provide a full barrier (my reasoning is below), so I
> >>> > > concluded that probably acquire/release semantics was intended in general
> >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> >>> > > section.
> >>> > >
> >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
> >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> >>> > > barrier. Is this right?
> >>> >
> >>> > i dont think this is right.
> >>>
> >>>
> >>> i think i was wrong and you are right.
> >>>
> >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> >>> starting with 'something == 0' the exiting E and remaining R threads:
> >>>
> >>> E:something=1      // protected by killlock
> >>> E:UNLOCK(killlock)
> >>> E:need_locks=-1
> >>>
> >>> R:LOCK(unrelated)  // reads need_locks == -1
> >>> R:need_locks=0
> >>> R:UNLOCK(unrelated)
> >>> R:LOCK(killlock)   // does not lock
> >>> R:read something   // can it be 0 ?
> >>>
> >>> and here something can be 0 (ie. not protected by killlock) on aarch64
> >>> because
> >>>
> >>> T1
> >>> 	something=1
> >>> 	ldaxr ... killlock
> >>> 	stlxr ... killlock
> >>> 	need_locks=-1
> >>>
> >>> T2
> >>> 	x=need_locks
> >>> 	ldaxr ... unrelated
> >>> 	stlxr ... unrelated
> >>> 	y=something
> >>>
> >>> can end with x==-1 and y==0.
> >>>
> >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> >>>
> >>> i need to think how to support such lock usage on aarch64
> >>> without adding too many dmb.
> >>
> >>I don't really understand this, but FWIW gcc emits
> >>
> >>    ldxr
> >>    ...
> >>    stlxr
> >>    ...
> >>    dmb ish
> >>
> >>for __sync_val_compare_and_swap. So this is probably the right thing
> >>we should have. And it seems to match what the kernel folks discussed
> >>here:
> >>
> >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> >>
> >>I wondered if there are similar issues for any others archs which need
> >>review, but it looks like all the other llsc archs have explicit
> >>pre/post barriers defined.
> >
> >Actually I don't understand what's going on with cmpxchg there. The
> >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
> >some follow-up in the thread I don't understand, about the case where
> >the cas fails, but we already handle that by doing an explicit barrier
> >in that case.
> >
> I think in that follow-up[1] they mean the following case (in musl
> terms):
> 
> volatile int x, flag;
> 
> T1:
>     x = 1;
>     a_store(&flag, 1);
> 
> T2:
>     while (!flag);
>     a_cas(&x, 0, 1); // can this fail?
> 
> They want it to never fail. But if a_cas() is implemented as
> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
> with the load of flag.
> 
> Note that musl does *not* handle this now, because a_barrier() in
> the failure path is after a_ll().
> 
> [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html

OK, then indeed this too needs to be fixed -- the a_cas is failing to
synchronize with the a_store. How do we do that? Based on my
understanding, my proposed patch doesn't fix it.

Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
synchronization to a retry in the case where the comparison fails?

If this is actually the case, it's disturbing that GCC does not seem
to be getting it right either...

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 14:12               ` Rich Felker
@ 2022-10-04 14:19                 ` Rich Felker
  2022-10-04 15:43                   ` Alexey Izbyshev
  2022-10-04 16:24                 ` James Y Knight
  1 sibling, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-04 14:19 UTC (permalink / raw)
  To: musl

On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
> > On 2022-10-04 02:05, Rich Felker wrote:
> > >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> > >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> > >>> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> > >>>
> > >>> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> > >>> > > On 2022-09-19 18:29, Rich Felker wrote:
> > >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > >>> > ...
> > >>> > > > > Reordering the "libc.need_locks = -1" assignment and
> > >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> > >>> > > > > should fix the issue.
> > >>> > > >
> > >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
> > >>> > > > barrier between them, since all lock and unlock operations are already
> > >>> > > > full barriers.
> > >>> > > >
> > >>> > >
> > >>> > > Before sending the report I tried to infer the intended ordering semantics
> > >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> > >>> > > see why they would provide a full barrier (my reasoning is below), so I
> > >>> > > concluded that probably acquire/release semantics was intended in general
> > >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> > >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> > >>> > > section.
> > >>> > >
> > >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> > >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> > >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
> > >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> > >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> > >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> > >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> > >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> > >>> > > barrier. Is this right?
> > >>> >
> > >>> > i dont think this is right.
> > >>>
> > >>>
> > >>> i think i was wrong and you are right.
> > >>>
> > >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> > >>> starting with 'something == 0' the exiting E and remaining R threads:
> > >>>
> > >>> E:something=1      // protected by killlock
> > >>> E:UNLOCK(killlock)
> > >>> E:need_locks=-1
> > >>>
> > >>> R:LOCK(unrelated)  // reads need_locks == -1
> > >>> R:need_locks=0
> > >>> R:UNLOCK(unrelated)
> > >>> R:LOCK(killlock)   // does not lock
> > >>> R:read something   // can it be 0 ?
> > >>>
> > >>> and here something can be 0 (ie. not protected by killlock) on aarch64
> > >>> because
> > >>>
> > >>> T1
> > >>> 	something=1
> > >>> 	ldaxr ... killlock
> > >>> 	stlxr ... killlock
> > >>> 	need_locks=-1
> > >>>
> > >>> T2
> > >>> 	x=need_locks
> > >>> 	ldaxr ... unrelated
> > >>> 	stlxr ... unrelated
> > >>> 	y=something
> > >>>
> > >>> can end with x==-1 and y==0.
> > >>>
> > >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> > >>>
> > >>> i need to think how to support such lock usage on aarch64
> > >>> without adding too many dmb.
> > >>
> > >>I don't really understand this, but FWIW gcc emits
> > >>
> > >>    ldxr
> > >>    ...
> > >>    stlxr
> > >>    ...
> > >>    dmb ish
> > >>
> > >>for __sync_val_compare_and_swap. So this is probably the right thing
> > >>we should have. And it seems to match what the kernel folks discussed
> > >>here:
> > >>
> > >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> > >>
> > >>I wondered if there are similar issues for any others archs which need
> > >>review, but it looks like all the other llsc archs have explicit
> > >>pre/post barriers defined.
> > >
> > >Actually I don't understand what's going on with cmpxchg there. The
> > >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
> > >some follow-up in the thread I don't understand, about the case where
> > >the cas fails, but we already handle that by doing an explicit barrier
> > >in that case.
> > >
> > I think in that follow-up[1] they mean the following case (in musl
> > terms):
> > 
> > volatile int x, flag;
> > 
> > T1:
> >     x = 1;
> >     a_store(&flag, 1);
> > 
> > T2:
> >     while (!flag);
> >     a_cas(&x, 0, 1); // can this fail?
> > 
> > They want it to never fail. But if a_cas() is implemented as
> > ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
> > with the load of flag.
> > 
> > Note that musl does *not* handle this now, because a_barrier() in
> > the failure path is after a_ll().
> > 
> > [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
> 
> OK, then indeed this too needs to be fixed -- the a_cas is failing to
> synchronize with the a_store. How do we do that? Based on my
> understanding, my proposed patch doesn't fix it.
> 
> Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
> synchronization to a retry in the case where the comparison fails?
> 
> If this is actually the case, it's disturbing that GCC does not seem
> to be getting it right either...

One stupid (or maybe not stupid) option would be to always stlrx back
the old value on comparison failure, so we can rely on the stlrx to
fail if anything was reordered wrong. I think this is actually kinda
how the x86 CAS works, or how it historically worked, so maybe it's
not a bad choice. It's also analogous to how the non-CAS operations
work: no branching out of the ll/sc loop, and just defining the result
as old == t ? s : old:

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	a_pre_llsc();
	do old = a_ll(p);
	while (!a_sc(p, old == t ? s : old));
	a_post_llsc();
	return old;
}

In fact if we use this as the general pattern for a_cas (in
src/internal/atomic.h) I'm not even sure we'd need a_pre_llsc() at
all, on any archs -- but that probably depends on the ISA's SC
semantics...

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 14:19                 ` Rich Felker
@ 2022-10-04 15:43                   ` Alexey Izbyshev
  2022-10-04 15:57                     ` Rich Felker
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-04 15:43 UTC (permalink / raw)
  To: musl

On 2022-10-04 17:19, Rich Felker wrote:
> On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
>> On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
>> > On 2022-10-04 02:05, Rich Felker wrote:
>> > >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
>> > >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>> > >>> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
>> > >>>
>> > >>> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
>> > >>> > > On 2022-09-19 18:29, Rich Felker wrote:
>> > >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> > >>> > ...
>> > >>> > > > > Reordering the "libc.need_locks = -1" assignment and
>> > >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
>> > >>> > > > > should fix the issue.
>> > >>> > > >
>> > >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
>> > >>> > > > barrier between them, since all lock and unlock operations are already
>> > >>> > > > full barriers.
>> > >>> > > >
>> > >>> > >
>> > >>> > > Before sending the report I tried to infer the intended ordering semantics
>> > >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> > >>> > > see why they would provide a full barrier (my reasoning is below), so I
>> > >>> > > concluded that probably acquire/release semantics was intended in general
>> > >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> > >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> > >>> > > section.
>> > >>> > >
>> > >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> > >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> > >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
>> > >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> > >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> > >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> > >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
>> > >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
>> > >>> > > barrier. Is this right?
>> > >>> >
>> > >>> > i dont think this is right.
>> > >>>
>> > >>>
>> > >>> i think i was wrong and you are right.
>> > >>>
>> > >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> > >>> starting with 'something == 0' the exiting E and remaining R threads:
>> > >>>
>> > >>> E:something=1      // protected by killlock
>> > >>> E:UNLOCK(killlock)
>> > >>> E:need_locks=-1
>> > >>>
>> > >>> R:LOCK(unrelated)  // reads need_locks == -1
>> > >>> R:need_locks=0
>> > >>> R:UNLOCK(unrelated)
>> > >>> R:LOCK(killlock)   // does not lock
>> > >>> R:read something   // can it be 0 ?
>> > >>>
>> > >>> and here something can be 0 (ie. not protected by killlock) on aarch64
>> > >>> because
>> > >>>
>> > >>> T1
>> > >>> 	something=1
>> > >>> 	ldaxr ... killlock
>> > >>> 	stlxr ... killlock
>> > >>> 	need_locks=-1
>> > >>>
>> > >>> T2
>> > >>> 	x=need_locks
>> > >>> 	ldaxr ... unrelated
>> > >>> 	stlxr ... unrelated
>> > >>> 	y=something
>> > >>>
>> > >>> can end with x==-1 and y==0.
>> > >>>
>> > >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
>> > >>>
>> > >>> i need to think how to support such lock usage on aarch64
>> > >>> without adding too many dmb.
>> > >>
>> > >>I don't really understand this, but FWIW gcc emits
>> > >>
>> > >>    ldxr
>> > >>    ...
>> > >>    stlxr
>> > >>    ...
>> > >>    dmb ish
>> > >>
>> > >>for __sync_val_compare_and_swap. So this is probably the right thing
>> > >>we should have. And it seems to match what the kernel folks discussed
>> > >>here:
>> > >>
>> > >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
>> > >>
>> > >>I wondered if there are similar issues for any others archs which need
>> > >>review, but it looks like all the other llsc archs have explicit
>> > >>pre/post barriers defined.
>> > >
>> > >Actually I don't understand what's going on with cmpxchg there. The
>> > >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
>> > >some follow-up in the thread I don't understand, about the case where
>> > >the cas fails, but we already handle that by doing an explicit barrier
>> > >in that case.
>> > >
>> > I think in that follow-up[1] they mean the following case (in musl
>> > terms):
>> >
>> > volatile int x, flag;
>> >
>> > T1:
>> >     x = 1;
>> >     a_store(&flag, 1);
>> >
>> > T2:
>> >     while (!flag);
>> >     a_cas(&x, 0, 1); // can this fail?
>> >
>> > They want it to never fail. But if a_cas() is implemented as
>> > ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
>> > with the load of flag.
>> >
>> > Note that musl does *not* handle this now, because a_barrier() in
>> > the failure path is after a_ll().
>> >
>> > [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
>> 
>> OK, then indeed this too needs to be fixed -- the a_cas is failing to
>> synchronize with the a_store. How do we do that? Based on my
>> understanding, my proposed patch doesn't fix it.
>> 
>> Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
>> synchronization to a retry in the case where the comparison fails?
>> 
ldarx will not work for the same reason as ldrx doesn't work (it can 
still be reordered with the load of flag).

>> If this is actually the case, it's disturbing that GCC does not seem
>> to be getting it right either...
> 
This is indeed disturbing, considering that comment[1] claims that 
__sync RMWs were once defined as:

   __sync_synchronize();
   operation...
   __sync_synchronize();

which clearly doesn't hold for the current implementation of 
__sync_val_compare_and_swap() for AArch64.

I can speculate that some of confusion is due to the fact that 
apparently each __sync builtin originally mapped to a single Itanium 
instruction, so the issues with ordering of "internal" LL/SC accesses 
simply didn't exist.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24

> One stupid (or maybe not stupid) option would be to always stlrx back
> the old value on comparison failure, so we can rely on the stlrx to
> fail if anything was reordered wrong. I think this is actually kinda
> how the x86 CAS works, or how it historically worked, so maybe it's
> not a bad choice. It's also analogous to how the non-CAS operations
> work: no branching out of the ll/sc loop, and just defining the result
> as old == t ? s : old:
> 
> static inline int a_cas(volatile int *p, int t, int s)
> {
> 	int old;
> 	a_pre_llsc();
> 	do old = a_ll(p);
> 	while (!a_sc(p, old == t ? s : old));
> 	a_post_llsc();
> 	return old;
> }
> 
I can't comment on whether always doing a store (or even many stores, 
until "stabilization") on a failed CAS is better than an extra dmb. 
Maybe if CAS failure normally happens only on contention, and we 
optimize for the non-contended case, it's indeed so. As for correctness, 
I agree that it should work.

> In fact if we use this as the general pattern for a_cas (in
> src/internal/atomic.h) I'm not even sure we'd need a_pre_llsc() at
> all, on any archs -- but that probably depends on the ISA's SC
> semantics...
> 
This pattern works on AArch64 without a_pre_llsc() only because a_sc() 
is a store-release. If it were a plain store-exclusive, preceding memory 
accesses could be reordered with it.

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 15:43                   ` Alexey Izbyshev
@ 2022-10-04 15:57                     ` Rich Felker
  2022-10-04 18:15                       ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-04 15:57 UTC (permalink / raw)
  To: musl

On Tue, Oct 04, 2022 at 06:43:49PM +0300, Alexey Izbyshev wrote:
> On 2022-10-04 17:19, Rich Felker wrote:
> >On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
> >>On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
> >>> On 2022-10-04 02:05, Rich Felker wrote:
> >>> >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> >>> >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> >>> >>> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> >>> >>>
> >>> >>> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> >>> >>> > > On 2022-09-19 18:29, Rich Felker wrote:
> >>> >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>> >>> > ...
> >>> >>> > > > > Reordering the "libc.need_locks = -1" assignment and
> >>> >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> >>> >>> > > > > should fix the issue.
> >>> >>> > > >
> >>> >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
> >>> >>> > > > barrier between them, since all lock and unlock operations are already
> >>> >>> > > > full barriers.
> >>> >>> > > >
> >>> >>> > >
> >>> >>> > > Before sending the report I tried to infer the intended ordering semantics
> >>> >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> >>> >>> > > see why they would provide a full barrier (my reasoning is below), so I
> >>> >>> > > concluded that probably acquire/release semantics was intended in general
> >>> >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> >>> >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> >>> >>> > > section.
> >>> >>> > >
> >>> >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> >>> >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> >>> >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
> >>> >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> >>> >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> >>> >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> >>> >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> >>> >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> >>> >>> > > barrier. Is this right?
> >>> >>> >
> >>> >>> > i dont think this is right.
> >>> >>>
> >>> >>>
> >>> >>> i think i was wrong and you are right.
> >>> >>>
> >>> >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> >>> >>> starting with 'something == 0' the exiting E and remaining R threads:
> >>> >>>
> >>> >>> E:something=1      // protected by killlock
> >>> >>> E:UNLOCK(killlock)
> >>> >>> E:need_locks=-1
> >>> >>>
> >>> >>> R:LOCK(unrelated)  // reads need_locks == -1
> >>> >>> R:need_locks=0
> >>> >>> R:UNLOCK(unrelated)
> >>> >>> R:LOCK(killlock)   // does not lock
> >>> >>> R:read something   // can it be 0 ?
> >>> >>>
> >>> >>> and here something can be 0 (ie. not protected by killlock) on aarch64
> >>> >>> because
> >>> >>>
> >>> >>> T1
> >>> >>> 	something=1
> >>> >>> 	ldaxr ... killlock
> >>> >>> 	stlxr ... killlock
> >>> >>> 	need_locks=-1
> >>> >>>
> >>> >>> T2
> >>> >>> 	x=need_locks
> >>> >>> 	ldaxr ... unrelated
> >>> >>> 	stlxr ... unrelated
> >>> >>> 	y=something
> >>> >>>
> >>> >>> can end with x==-1 and y==0.
> >>> >>>
> >>> >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> >>> >>>
> >>> >>> i need to think how to support such lock usage on aarch64
> >>> >>> without adding too many dmb.
> >>> >>
> >>> >>I don't really understand this, but FWIW gcc emits
> >>> >>
> >>> >>    ldxr
> >>> >>    ...
> >>> >>    stlxr
> >>> >>    ...
> >>> >>    dmb ish
> >>> >>
> >>> >>for __sync_val_compare_and_swap. So this is probably the right thing
> >>> >>we should have. And it seems to match what the kernel folks discussed
> >>> >>here:
> >>> >>
> >>> >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> >>> >>
> >>> >>I wondered if there are similar issues for any others archs which need
> >>> >>review, but it looks like all the other llsc archs have explicit
> >>> >>pre/post barriers defined.
> >>> >
> >>> >Actually I don't understand what's going on with cmpxchg there. The
> >>> >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
> >>> >some follow-up in the thread I don't understand, about the case where
> >>> >the cas fails, but we already handle that by doing an explicit barrier
> >>> >in that case.
> >>> >
> >>> I think in that follow-up[1] they mean the following case (in musl
> >>> terms):
> >>>
> >>> volatile int x, flag;
> >>>
> >>> T1:
> >>>     x = 1;
> >>>     a_store(&flag, 1);
> >>>
> >>> T2:
> >>>     while (!flag);
> >>>     a_cas(&x, 0, 1); // can this fail?
> >>>
> >>> They want it to never fail. But if a_cas() is implemented as
> >>> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
> >>> with the load of flag.
> >>>
> >>> Note that musl does *not* handle this now, because a_barrier() in
> >>> the failure path is after a_ll().
> >>>
> >>> [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
> >>
> >>OK, then indeed this too needs to be fixed -- the a_cas is failing to
> >>synchronize with the a_store. How do we do that? Based on my
> >>understanding, my proposed patch doesn't fix it.
> >>
> >>Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
> >>synchronization to a retry in the case where the comparison fails?
> >>
> ldarx will not work for the same reason as ldrx doesn't work (it can
> still be reordered with the load of flag).
> 
> >>If this is actually the case, it's disturbing that GCC does not seem
> >>to be getting it right either...
> >
> This is indeed disturbing, considering that comment[1] claims that
> __sync RMWs were once defined as:
> 
>   __sync_synchronize();
>   operation...
>   __sync_synchronize();
> 
> which clearly doesn't hold for the current implementation of
> __sync_val_compare_and_swap() for AArch64.
> 
> I can speculate that some of confusion is due to the fact that
> apparently each __sync builtin originally mapped to a single Itanium
> instruction, so the issues with ordering of "internal" LL/SC
> accesses simply didn't exist.
> 
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24
> 
> >One stupid (or maybe not stupid) option would be to always stlrx back
> >the old value on comparison failure, so we can rely on the stlrx to
> >fail if anything was reordered wrong. I think this is actually kinda
> >how the x86 CAS works, or how it historically worked, so maybe it's
> >not a bad choice. It's also analogous to how the non-CAS operations
> >work: no branching out of the ll/sc loop, and just defining the result
> >as old == t ? s : old:
> >
> >static inline int a_cas(volatile int *p, int t, int s)
> >{
> >	int old;
> >	a_pre_llsc();
> >	do old = a_ll(p);
> >	while (!a_sc(p, old == t ? s : old));
> >	a_post_llsc();
> >	return old;
> >}
> >
> I can't comment on whether always doing a store (or even many
> stores, until "stabilization") on a failed CAS is better than an
> extra dmb. Maybe if CAS failure normally happens only on contention,
> and we optimize for the non-contended case, it's indeed so. As for
> correctness, I agree that it should work.

OK. I think the alternative is something like:

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	a_pre_llsc();
	do {
		old = a_ll(p);
		if (old != t) {
			a_barrier();
			if (*p != old) continue;
			break;
		}
	} while (!a_sc(p, s));
	a_post_llsc();
	return old;
}

This is more/messier code, and still has the possibility of unbounded
looping waiting for consistency under contention. It doesn't perform a
store to the atomic object during such contention, which might be less
of a burden on other threads, but on the other hand it performs full
barriers during this contention, which might be more of a burden.

Conceptually I think I like the version that always stores for
symmetry with the other atomic operations that makes it "obviously
correct if they're correct". But I'm not sure if there are cost
reasons we should prefer a different choice. At least it "shouldn't be
worse" than the other atomic ops are if we do it that way.

> >In fact if we use this as the general pattern for a_cas (in
> >src/internal/atomic.h) I'm not even sure we'd need a_pre_llsc() at
> >all, on any archs -- but that probably depends on the ISA's SC
> >semantics...
> >
> This pattern works on AArch64 without a_pre_llsc() only because
> a_sc() is a store-release. If it were a plain store-exclusive,
> preceding memory accesses could be reordered with it.

Yes, that sounds right.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 13:50             ` Alexey Izbyshev
  2022-10-04 14:12               ` Rich Felker
@ 2022-10-04 16:01               ` Alexey Izbyshev
  1 sibling, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-04 16:01 UTC (permalink / raw)
  To: musl

On 2022-10-04 16:50, Alexey Izbyshev wrote:
> On 2022-10-04 02:05, Rich Felker wrote:
>> On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
>>> On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>>> > * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
>>> >
>>> > > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
>>> > > > On 2022-09-19 18:29, Rich Felker wrote:
>>> > > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>>> > > ...
>>> > > > > > Reordering the "libc.need_locks = -1" assignment and
>>> > > > > > UNLOCK(E->killlock) and providing a store barrier between them
>>> > > > > > should fix the issue.
>>> > > > >
>>> > > > > I think this all sounds correct. I'm not sure what you mean by a store
>>> > > > > barrier between them, since all lock and unlock operations are already
>>> > > > > full barriers.
>>> > > > >
>>> > > >
>>> > > > Before sending the report I tried to infer the intended ordering semantics
>>> > > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>>> > > > see why they would provide a full barrier (my reasoning is below), so I
>>> > > > concluded that probably acquire/release semantics was intended in general
>>> > > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>>> > > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>>> > > > section.
>>> > > >
>>> > > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>>> > > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>>> > > > via load-acquire/store-release instructions. Therefore, if we consider a
>>> > > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>>> > > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>>> > > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>>> > > > the processor predicts that stlxr succeeds), and further, due to (a), with
>>> > > > any memory access inside the critical section. Therefore, UNLOCK is not full
>>> > > > barrier. Is this right?
>>> > >
>>> > > i dont think this is right.
>>> >
>>> >
>>> > i think i was wrong and you are right.
>>> >
>>> > so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>>> > starting with 'something == 0' the exiting E and remaining R threads:
>>> >
>>> > E:something=1      // protected by killlock
>>> > E:UNLOCK(killlock)
>>> > E:need_locks=-1
>>> >
>>> > R:LOCK(unrelated)  // reads need_locks == -1
>>> > R:need_locks=0
>>> > R:UNLOCK(unrelated)
>>> > R:LOCK(killlock)   // does not lock
>>> > R:read something   // can it be 0 ?
>>> >
>>> > and here something can be 0 (ie. not protected by killlock) on aarch64
>>> > because
>>> >
>>> > T1
>>> > 	something=1
>>> > 	ldaxr ... killlock
>>> > 	stlxr ... killlock
>>> > 	need_locks=-1
>>> >
>>> > T2
>>> > 	x=need_locks
>>> > 	ldaxr ... unrelated
>>> > 	stlxr ... unrelated
>>> > 	y=something
>>> >
>>> > can end with x==-1 and y==0.
>>> >
>>> > and to fix it, both a_fetch_add and a_cas need an a_barrier.
>>> >
>>> > i need to think how to support such lock usage on aarch64
>>> > without adding too many dmb.
>>> 
>>> I don't really understand this, but FWIW gcc emits
>>> 
>>>     ldxr
>>>     ...
>>>     stlxr
>>>     ...
>>>     dmb ish
>>> 
>>> for __sync_val_compare_and_swap. So this is probably the right thing
>>> we should have. And it seems to match what the kernel folks discussed
>>> here:
>>> 
>>> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
>>> 
>>> I wondered if there are similar issues for any others archs which 
>>> need
>>> review, but it looks like all the other llsc archs have explicit
>>> pre/post barriers defined.
>> 
>> Actually I don't understand what's going on with cmpxchg there. The
>> patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
>> some follow-up in the thread I don't understand, about the case where
>> the cas fails, but we already handle that by doing an explicit barrier
>> in that case.
>> 
> I think in that follow-up[1] they mean the following case (in musl 
> terms):
> 
> volatile int x, flag;
> 
> T1:
>     x = 1;
>     a_store(&flag, 1);
> 
> T2:
>     while (!flag);
>     a_cas(&x, 0, 1); // can this fail?
> 
I made a mistake here, this should be a_cas(&x, 1, 2). Everything else 
stands.

> They want it to never fail. But if a_cas() is implemented as
> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
> with the load of flag.
> 
> Note that musl does *not* handle this now, because a_barrier() in the
> failure path is after a_ll().
> 
> [1] 
> https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
> 
> Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 14:12               ` Rich Felker
  2022-10-04 14:19                 ` Rich Felker
@ 2022-10-04 16:24                 ` James Y Knight
  2022-10-04 16:45                   ` Rich Felker
  1 sibling, 1 reply; 32+ messages in thread
From: James Y Knight @ 2022-10-04 16:24 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 685 bytes --]

On Tue, Oct 4, 2022 at 10:13 AM Rich Felker <dalias@libc.org> wrote:

> If this is actually the case, it's disturbing that GCC does not seem
> to be getting it right either...
>

The __sync_* builtins are legacy and were never particularly well-defined
-- especially for non-x86 platforms. (Note that they don't include atomic
load/store operations, which are effectively unnecessary on x86, but vital
on most other architectures).

I would suggest that musl (and anyone else) really ought to migrate from
its homegrown atomics support to the standard C11 atomic memory model,
which _is_ well-defined and extensively studied. Such a migration will
certainly be a Project, of course...

[-- Attachment #2: Type: text/html, Size: 1027 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 16:24                 ` James Y Knight
@ 2022-10-04 16:45                   ` Rich Felker
  2022-10-05 13:52                     ` James Y Knight
  0 siblings, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-04 16:45 UTC (permalink / raw)
  To: James Y Knight; +Cc: musl

On Tue, Oct 04, 2022 at 12:24:14PM -0400, James Y Knight wrote:
> On Tue, Oct 4, 2022 at 10:13 AM Rich Felker <dalias@libc.org> wrote:
> 
> > If this is actually the case, it's disturbing that GCC does not seem
> > to be getting it right either...
> >
> 
> The __sync_* builtins are legacy and were never particularly well-defined
> -- especially for non-x86 platforms. (Note that they don't include atomic
> load/store operations, which are effectively unnecessary on x86, but vital
> on most other architectures).
> 
> I would suggest that musl (and anyone else) really ought to migrate from
> its homegrown atomics support to the standard C11 atomic memory model,
> which _is_ well-defined and extensively studied. Such a migration will
> certainly be a Project, of course...

We do not use the __sync builtins. Atomics in musl are implemented
entirely in asm, because the compilers do not get theirs right and do
not support the runtime selection of methods necessary for some of the
archs we support (especially 32-bit arm and sh).

The atomics in musl implement the "POSIX memory model" which is much
simpler to understand and less error-prone than the C11 one (with the
tradeoff being that it admits a lot less optimization for
performance), and is a valid implementation choice for the C11 one. It
has only one relationship, "synchronizes memory", that all
synchronization primitives and atomics entail.

The migration that might happen at some point is using a weaker model
for the C11 synchronization primitives and possibly for the POSIX ones
in the future if POSIX adopts a weaker model. Of course we could also
do this for our own implementation-internal locks, independent of
whether POSIX makes any changes, but among those the only ones that
have any significant effect on performance are the ones in malloc.
It's likely that mallocng could benefit a lot from relaxed-order
atomics on the free bitmasks and weaker acquire/release semantics on
the malloc lock. But this would only help archs where weaker forms are
available.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 15:57                     ` Rich Felker
@ 2022-10-04 18:15                       ` Alexey Izbyshev
  2022-10-04 23:21                         ` Rich Felker
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-04 18:15 UTC (permalink / raw)
  To: musl

On 2022-10-04 18:57, Rich Felker wrote:
> On Tue, Oct 04, 2022 at 06:43:49PM +0300, Alexey Izbyshev wrote:
>> On 2022-10-04 17:19, Rich Felker wrote:
>> >On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
>> >>On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
>> >>> On 2022-10-04 02:05, Rich Felker wrote:
>> >>> >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
>> >>> >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>> >>> >>> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
>> >>> >>>
>> >>> >>> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
>> >>> >>> > > On 2022-09-19 18:29, Rich Felker wrote:
>> >>> >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> >>> >>> > ...
>> >>> >>> > > > > Reordering the "libc.need_locks = -1" assignment and
>> >>> >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
>> >>> >>> > > > > should fix the issue.
>> >>> >>> > > >
>> >>> >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
>> >>> >>> > > > barrier between them, since all lock and unlock operations are already
>> >>> >>> > > > full barriers.
>> >>> >>> > > >
>> >>> >>> > >
>> >>> >>> > > Before sending the report I tried to infer the intended ordering semantics
>> >>> >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> >>> >>> > > see why they would provide a full barrier (my reasoning is below), so I
>> >>> >>> > > concluded that probably acquire/release semantics was intended in general
>> >>> >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> >>> >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> >>> >>> > > section.
>> >>> >>> > >
>> >>> >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> >>> >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> >>> >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
>> >>> >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> >>> >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> >>> >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> >>> >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
>> >>> >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
>> >>> >>> > > barrier. Is this right?
>> >>> >>> >
>> >>> >>> > i dont think this is right.
>> >>> >>>
>> >>> >>>
>> >>> >>> i think i was wrong and you are right.
>> >>> >>>
>> >>> >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> >>> >>> starting with 'something == 0' the exiting E and remaining R threads:
>> >>> >>>
>> >>> >>> E:something=1      // protected by killlock
>> >>> >>> E:UNLOCK(killlock)
>> >>> >>> E:need_locks=-1
>> >>> >>>
>> >>> >>> R:LOCK(unrelated)  // reads need_locks == -1
>> >>> >>> R:need_locks=0
>> >>> >>> R:UNLOCK(unrelated)
>> >>> >>> R:LOCK(killlock)   // does not lock
>> >>> >>> R:read something   // can it be 0 ?
>> >>> >>>
>> >>> >>> and here something can be 0 (ie. not protected by killlock) on aarch64
>> >>> >>> because
>> >>> >>>
>> >>> >>> T1
>> >>> >>> 	something=1
>> >>> >>> 	ldaxr ... killlock
>> >>> >>> 	stlxr ... killlock
>> >>> >>> 	need_locks=-1
>> >>> >>>
>> >>> >>> T2
>> >>> >>> 	x=need_locks
>> >>> >>> 	ldaxr ... unrelated
>> >>> >>> 	stlxr ... unrelated
>> >>> >>> 	y=something
>> >>> >>>
>> >>> >>> can end with x==-1 and y==0.
>> >>> >>>
>> >>> >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
>> >>> >>>
>> >>> >>> i need to think how to support such lock usage on aarch64
>> >>> >>> without adding too many dmb.
>> >>> >>
>> >>> >>I don't really understand this, but FWIW gcc emits
>> >>> >>
>> >>> >>    ldxr
>> >>> >>    ...
>> >>> >>    stlxr
>> >>> >>    ...
>> >>> >>    dmb ish
>> >>> >>
>> >>> >>for __sync_val_compare_and_swap. So this is probably the right thing
>> >>> >>we should have. And it seems to match what the kernel folks discussed
>> >>> >>here:
>> >>> >>
>> >>> >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
>> >>> >>
>> >>> >>I wondered if there are similar issues for any others archs which need
>> >>> >>review, but it looks like all the other llsc archs have explicit
>> >>> >>pre/post barriers defined.
>> >>> >
>> >>> >Actually I don't understand what's going on with cmpxchg there. The
>> >>> >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
>> >>> >some follow-up in the thread I don't understand, about the case where
>> >>> >the cas fails, but we already handle that by doing an explicit barrier
>> >>> >in that case.
>> >>> >
>> >>> I think in that follow-up[1] they mean the following case (in musl
>> >>> terms):
>> >>>
>> >>> volatile int x, flag;
>> >>>
>> >>> T1:
>> >>>     x = 1;
>> >>>     a_store(&flag, 1);
>> >>>
>> >>> T2:
>> >>>     while (!flag);
>> >>>     a_cas(&x, 0, 1); // can this fail?
>> >>>
>> >>> They want it to never fail. But if a_cas() is implemented as
>> >>> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
>> >>> with the load of flag.
>> >>>
>> >>> Note that musl does *not* handle this now, because a_barrier() in
>> >>> the failure path is after a_ll().
>> >>>
>> >>> [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
>> >>
>> >>OK, then indeed this too needs to be fixed -- the a_cas is failing to
>> >>synchronize with the a_store. How do we do that? Based on my
>> >>understanding, my proposed patch doesn't fix it.
>> >>
>> >>Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
>> >>synchronization to a retry in the case where the comparison fails?
>> >>
>> ldarx will not work for the same reason as ldrx doesn't work (it can
>> still be reordered with the load of flag).
>> 
>> >>If this is actually the case, it's disturbing that GCC does not seem
>> >>to be getting it right either...
>> >
>> This is indeed disturbing, considering that comment[1] claims that
>> __sync RMWs were once defined as:
>> 
>>   __sync_synchronize();
>>   operation...
>>   __sync_synchronize();
>> 
>> which clearly doesn't hold for the current implementation of
>> __sync_val_compare_and_swap() for AArch64.
>> 
>> I can speculate that some of confusion is due to the fact that
>> apparently each __sync builtin originally mapped to a single Itanium
>> instruction, so the issues with ordering of "internal" LL/SC
>> accesses simply didn't exist.
>> 
>> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24
>> 
>> >One stupid (or maybe not stupid) option would be to always stlrx back
>> >the old value on comparison failure, so we can rely on the stlrx to
>> >fail if anything was reordered wrong. I think this is actually kinda
>> >how the x86 CAS works, or how it historically worked, so maybe it's
>> >not a bad choice. It's also analogous to how the non-CAS operations
>> >work: no branching out of the ll/sc loop, and just defining the result
>> >as old == t ? s : old:
>> >
>> >static inline int a_cas(volatile int *p, int t, int s)
>> >{
>> >	int old;
>> >	a_pre_llsc();
>> >	do old = a_ll(p);
>> >	while (!a_sc(p, old == t ? s : old));
>> >	a_post_llsc();
>> >	return old;
>> >}
>> >
>> I can't comment on whether always doing a store (or even many
>> stores, until "stabilization") on a failed CAS is better than an
>> extra dmb. Maybe if CAS failure normally happens only on contention,
>> and we optimize for the non-contended case, it's indeed so. As for
>> correctness, I agree that it should work.
> 
> OK. I think the alternative is something like:
> 
> static inline int a_cas(volatile int *p, int t, int s)
> {
> 	int old;
> 	a_pre_llsc();
> 	do {
> 		old = a_ll(p);
> 		if (old != t) {
> 			a_barrier();
> 			if (*p != old) continue;
> 			break;
> 		}
> 	} while (!a_sc(p, s));
> 	a_post_llsc();
> 	return old;
> }
> 
This also looks correct to me.

> This is more/messier code, and still has the possibility of unbounded
> looping waiting for consistency under contention. It doesn't perform a
> store to the atomic object during such contention, which might be less
> of a burden on other threads, but on the other hand it performs full
> barriers during this contention, which might be more of a burden.
> 
> Conceptually I think I like the version that always stores for
> symmetry with the other atomic operations that makes it "obviously
> correct if they're correct". But I'm not sure if there are cost
> reasons we should prefer a different choice. At least it "shouldn't be
> worse" than the other atomic ops are if we do it that way.
> 
Symmetry is indeed nice, but I'm not sure it's really appropriate 
because at least in my mind a_cas() is different from other atomic RMW 
ops by making the "W" part conditional. I do agree on the correctness 
part though.

Interestingly, the kernel actually removed the full barrier semantics 
from a failed atomic_cmpxchg(), with the reasoning that "nobody needs 
it": 
https://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355981.html

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 18:15                       ` Alexey Izbyshev
@ 2022-10-04 23:21                         ` Rich Felker
  0 siblings, 0 replies; 32+ messages in thread
From: Rich Felker @ 2022-10-04 23:21 UTC (permalink / raw)
  To: musl

On Tue, Oct 04, 2022 at 09:15:12PM +0300, Alexey Izbyshev wrote:
> On 2022-10-04 18:57, Rich Felker wrote:
> >On Tue, Oct 04, 2022 at 06:43:49PM +0300, Alexey Izbyshev wrote:
> >>On 2022-10-04 17:19, Rich Felker wrote:
> >>>On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
> >>>>On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
> >>>>> On 2022-10-04 02:05, Rich Felker wrote:
> >>>>> >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> >>>>> >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> >>>>> >>> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> >>>>> >>>
> >>>>> >>> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> >>>>> >>> > > On 2022-09-19 18:29, Rich Felker wrote:
> >>>>> >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>>>> >>> > ...
> >>>>> >>> > > > > Reordering the "libc.need_locks = -1" assignment and
> >>>>> >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> >>>>> >>> > > > > should fix the issue.
> >>>>> >>> > > >
> >>>>> >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
> >>>>> >>> > > > barrier between them, since all lock and unlock operations are already
> >>>>> >>> > > > full barriers.
> >>>>> >>> > > >
> >>>>> >>> > >
> >>>>> >>> > > Before sending the report I tried to infer the intended ordering semantics
> >>>>> >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> >>>>> >>> > > see why they would provide a full barrier (my reasoning is below), so I
> >>>>> >>> > > concluded that probably acquire/release semantics was intended in general
> >>>>> >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> >>>>> >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> >>>>> >>> > > section.
> >>>>> >>> > >
> >>>>> >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> >>>>> >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> >>>>> >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
> >>>>> >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> >>>>> >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> >>>>> >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> >>>>> >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> >>>>> >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> >>>>> >>> > > barrier. Is this right?
> >>>>> >>> >
> >>>>> >>> > i dont think this is right.
> >>>>> >>>
> >>>>> >>>
> >>>>> >>> i think i was wrong and you are right.
> >>>>> >>>
> >>>>> >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> >>>>> >>> starting with 'something == 0' the exiting E and remaining R threads:
> >>>>> >>>
> >>>>> >>> E:something=1      // protected by killlock
> >>>>> >>> E:UNLOCK(killlock)
> >>>>> >>> E:need_locks=-1
> >>>>> >>>
> >>>>> >>> R:LOCK(unrelated)  // reads need_locks == -1
> >>>>> >>> R:need_locks=0
> >>>>> >>> R:UNLOCK(unrelated)
> >>>>> >>> R:LOCK(killlock)   // does not lock
> >>>>> >>> R:read something   // can it be 0 ?
> >>>>> >>>
> >>>>> >>> and here something can be 0 (ie. not protected by killlock) on aarch64
> >>>>> >>> because
> >>>>> >>>
> >>>>> >>> T1
> >>>>> >>> 	something=1
> >>>>> >>> 	ldaxr ... killlock
> >>>>> >>> 	stlxr ... killlock
> >>>>> >>> 	need_locks=-1
> >>>>> >>>
> >>>>> >>> T2
> >>>>> >>> 	x=need_locks
> >>>>> >>> 	ldaxr ... unrelated
> >>>>> >>> 	stlxr ... unrelated
> >>>>> >>> 	y=something
> >>>>> >>>
> >>>>> >>> can end with x==-1 and y==0.
> >>>>> >>>
> >>>>> >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> >>>>> >>>
> >>>>> >>> i need to think how to support such lock usage on aarch64
> >>>>> >>> without adding too many dmb.
> >>>>> >>
> >>>>> >>I don't really understand this, but FWIW gcc emits
> >>>>> >>
> >>>>> >>    ldxr
> >>>>> >>    ...
> >>>>> >>    stlxr
> >>>>> >>    ...
> >>>>> >>    dmb ish
> >>>>> >>
> >>>>> >>for __sync_val_compare_and_swap. So this is probably the right thing
> >>>>> >>we should have. And it seems to match what the kernel folks discussed
> >>>>> >>here:
> >>>>> >>
> >>>>> >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> >>>>> >>
> >>>>> >>I wondered if there are similar issues for any others archs which need
> >>>>> >>review, but it looks like all the other llsc archs have explicit
> >>>>> >>pre/post barriers defined.
> >>>>> >
> >>>>> >Actually I don't understand what's going on with cmpxchg there. The
> >>>>> >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
> >>>>> >some follow-up in the thread I don't understand, about the case where
> >>>>> >the cas fails, but we already handle that by doing an explicit barrier
> >>>>> >in that case.
> >>>>> >
> >>>>> I think in that follow-up[1] they mean the following case (in musl
> >>>>> terms):
> >>>>>
> >>>>> volatile int x, flag;
> >>>>>
> >>>>> T1:
> >>>>>     x = 1;
> >>>>>     a_store(&flag, 1);
> >>>>>
> >>>>> T2:
> >>>>>     while (!flag);
> >>>>>     a_cas(&x, 0, 1); // can this fail?
> >>>>>
> >>>>> They want it to never fail. But if a_cas() is implemented as
> >>>>> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
> >>>>> with the load of flag.
> >>>>>
> >>>>> Note that musl does *not* handle this now, because a_barrier() in
> >>>>> the failure path is after a_ll().
> >>>>>
> >>>>> [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
> >>>>
> >>>>OK, then indeed this too needs to be fixed -- the a_cas is failing to
> >>>>synchronize with the a_store. How do we do that? Based on my
> >>>>understanding, my proposed patch doesn't fix it.
> >>>>
> >>>>Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
> >>>>synchronization to a retry in the case where the comparison fails?
> >>>>
> >>ldarx will not work for the same reason as ldrx doesn't work (it can
> >>still be reordered with the load of flag).
> >>
> >>>>If this is actually the case, it's disturbing that GCC does not seem
> >>>>to be getting it right either...
> >>>
> >>This is indeed disturbing, considering that comment[1] claims that
> >>__sync RMWs were once defined as:
> >>
> >>  __sync_synchronize();
> >>  operation...
> >>  __sync_synchronize();
> >>
> >>which clearly doesn't hold for the current implementation of
> >>__sync_val_compare_and_swap() for AArch64.
> >>
> >>I can speculate that some of confusion is due to the fact that
> >>apparently each __sync builtin originally mapped to a single Itanium
> >>instruction, so the issues with ordering of "internal" LL/SC
> >>accesses simply didn't exist.
> >>
> >>[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24
> >>
> >>>One stupid (or maybe not stupid) option would be to always stlrx back
> >>>the old value on comparison failure, so we can rely on the stlrx to
> >>>fail if anything was reordered wrong. I think this is actually kinda
> >>>how the x86 CAS works, or how it historically worked, so maybe it's
> >>>not a bad choice. It's also analogous to how the non-CAS operations
> >>>work: no branching out of the ll/sc loop, and just defining the result
> >>>as old == t ? s : old:
> >>>
> >>>static inline int a_cas(volatile int *p, int t, int s)
> >>>{
> >>>	int old;
> >>>	a_pre_llsc();
> >>>	do old = a_ll(p);
> >>>	while (!a_sc(p, old == t ? s : old));
> >>>	a_post_llsc();
> >>>	return old;
> >>>}
> >>>
> >>I can't comment on whether always doing a store (or even many
> >>stores, until "stabilization") on a failed CAS is better than an
> >>extra dmb. Maybe if CAS failure normally happens only on contention,
> >>and we optimize for the non-contended case, it's indeed so. As for
> >>correctness, I agree that it should work.
> >
> >OK. I think the alternative is something like:
> >
> >static inline int a_cas(volatile int *p, int t, int s)
> >{
> >	int old;
> >	a_pre_llsc();
> >	do {
> >		old = a_ll(p);
> >		if (old != t) {
> >			a_barrier();
> >			if (*p != old) continue;
> >			break;
> >		}
> >	} while (!a_sc(p, s));
> >	a_post_llsc();
> >	return old;
> >}
> >
> This also looks correct to me.

Let's compare the code produced for a simple usage,
pthread_spin_trylock. The above:

0000000000000000 <pthread_spin_trylock>:
   0:   52800202        mov     w2, #0x10                       // #16
   4:   885f7c01        ldxr    w1, [x0]
   8:   340000a1        cbz     w1, 1c <pthread_spin_trylock+0x1c>
   c:   d5033bbf        dmb     ish
  10:   b9400003        ldr     w3, [x0]
  14:   6b03003f        cmp     w1, w3
  18:   540000a0        b.eq    2c <pthread_spin_trylock+0x2c>
  1c:   8803fc02        stlxr   w3, w2, [x0]
  20:   35ffff23        cbnz    w3, 4 <pthread_spin_trylock+0x4>
  24:   d5033bbf        dmb     ish
  28:   2a0103e3        mov     w3, w1
  2c:   2a0303e0        mov     w0, w3
  30:   d65f03c0        ret

vs unconditional writeback:

0000000000000000 <pthread_spin_trylock>:
   0:   aa0003e2        mov     x2, x0
   4:   52800204        mov     w4, #0x10                       // #16
   8:   885f7c40        ldxr    w0, [x2]
   c:   7100001f        cmp     w0, #0x0
  10:   1a841001        csel    w1, w0, w4, ne
  14:   8803fc41        stlxr   w3, w1, [x2]
  18:   35ffff83        cbnz    w3, 8 <pthread_spin_trylock+0x8>
  1c:   d5033bbf        dmb     ish
  20:   d65f03c0        ret

The latter is preferable on size and number of branches, and possibly
also on confidence that the compiler won't do something wacky
separating the ldxr and stlxr far from each other. But of course the
question of actual performance remains.

One other idea -- I'm not sure if there's a way to make use of this,
but maybe there's a way to use a dummy atomic to preclude reordering.
See how commit 5a9c8c05a5a0cdced4122589184fd795b761bb4a uses "lock ;
orl $0,(%rsp)" on x86_64 to replace "mfence". Could such an
intervening operation on aarch64 preclude reordering of accesses
across the ldxr on the real atomic?

> >This is more/messier code, and still has the possibility of unbounded
> >looping waiting for consistency under contention. It doesn't perform a
> >store to the atomic object during such contention, which might be less
> >of a burden on other threads, but on the other hand it performs full
> >barriers during this contention, which might be more of a burden.
> >
> >Conceptually I think I like the version that always stores for
> >symmetry with the other atomic operations that makes it "obviously
> >correct if they're correct". But I'm not sure if there are cost
> >reasons we should prefer a different choice. At least it "shouldn't be
> >worse" than the other atomic ops are if we do it that way.
> >
> Symmetry is indeed nice, but I'm not sure it's really appropriate
> because at least in my mind a_cas() is different from other atomic
> RMW ops by making the "W" part conditional. I do agree on the
> correctness part though.
> 
> Interestingly, the kernel actually removed the full barrier
> semantics from a failed atomic_cmpxchg(), with the reasoning that
> "nobody needs it": https://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355981.html

I disagree with this reasoning. Even if you don't care about the
consistency of other memory when a CAS fails, the problem with the
missing barrier is that it can *wrongly fail* when it causally must
have succeeded. As one particular example, this causes "trylock"
operations to sometimes fail when they should have succeeded.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-09-07  0:46 [musl] Illegal killlock skipping when transitioning to single-threaded state Alexey Izbyshev
  2022-09-19 15:29 ` Rich Felker
@ 2022-10-05  1:00 ` Rich Felker
  2022-10-05 12:10   ` Alexey Izbyshev
  1 sibling, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-05  1:00 UTC (permalink / raw)
  To: musl

On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> Reordering the "libc.need_locks = -1" assignment and
> UNLOCK(E->killlock) and providing a store barrier between them
> should fix the issue.

Back to this, because it's immediately actionable without resolving
the aarch64 atomics issue:

Do you have something in mind for how this reordering is done, since
there are other intervening steps that are potentially ordered with
respect to either or both? I don't think there is actually any
ordering constraint at all on the unlocking of killlock (with the
accompanying assignment self->tid=0 kept with it) except that it be
past the point where we are committed to the thread terminating
without executing any more application code. So my leaning would be to
move this block from the end of pthread_exit up to right after the
point-of-no-return comment.

Unfortunately while reading this I found another bug, this time a lock
order one. __dl_thread_cleanup() takes a lock while the thread list
lock is already held, but fork takes these in the opposite order. I
think the lock here could be dropped and replaced with an atomic-cas
list head, but that's rather messy and I'm open to other ideas.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-05  1:00 ` Rich Felker
@ 2022-10-05 12:10   ` Alexey Izbyshev
  2022-10-05 14:03     ` Rich Felker
  0 siblings, 1 reply; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-05 12:10 UTC (permalink / raw)
  To: musl

On 2022-10-05 04:00, Rich Felker wrote:
> On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> Reordering the "libc.need_locks = -1" assignment and
>> UNLOCK(E->killlock) and providing a store barrier between them
>> should fix the issue.
> 
> Back to this, because it's immediately actionable without resolving
> the aarch64 atomics issue:
> 
> Do you have something in mind for how this reordering is done, since
> there are other intervening steps that are potentially ordered with
> respect to either or both? I don't think there is actually any
> ordering constraint at all on the unlocking of killlock (with the
> accompanying assignment self->tid=0 kept with it) except that it be
> past the point where we are committed to the thread terminating
> without executing any more application code. So my leaning would be to
> move this block from the end of pthread_exit up to right after the
> point-of-no-return comment.
> 
This was my conclusion as well back when I looked at it before sending 
the report.

I was initially concerned about whether reordering with 
a_store(&self->detach_state, DT_EXITED) could cause an unwanted 
observable change (pthread_tryjoin_np() returning EBUSY after a pthread 
function acting on tid like pthread_getschedparam() returns ESRCH), but 
no, pthread_tryjoin_np() will block/trap if the thread is not 
DT_JOINABLE.

> Unfortunately while reading this I found another bug, this time a lock
> order one. __dl_thread_cleanup() takes a lock while the thread list
> lock is already held, but fork takes these in the opposite order. I
> think the lock here could be dropped and replaced with an atomic-cas
> list head, but that's rather messy and I'm open to other ideas.
> 
I'm not sure why using a lock-free list is messy, it seems like a 
perfect fit here to me.

However, doesn't __dl_vseterr() use the libc-internal allocator after  
34952fe5de44a833370cbe87b63fb8eec61466d7? If so, the problem that 
freebuf_queue was originally solving doesn't exist anymore. We still 
can't call the allocator after __tl_lock(), but maybe this whole free 
deferral approach can be reconsidered?

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-04 16:45                   ` Rich Felker
@ 2022-10-05 13:52                     ` James Y Knight
  0 siblings, 0 replies; 32+ messages in thread
From: James Y Knight @ 2022-10-05 13:52 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 2536 bytes --]

On Tue, Oct 4, 2022 at 12:46 PM Rich Felker <dalias@libc.org> wrote:

> The atomics in musl implement the "POSIX memory model" which is much
> simpler to understand and less error-prone than the C11 one (with the
> tradeoff being that it admits a lot less optimization for
> performance), and is a valid implementation choice for the C11 one. It
> has only one relationship, "synchronizes memory", that all
> synchronization primitives and atomics entail.

Mmmm, maybe I'm weird, but I find it significantly easier to understand
when code uses the standard atomics, because there is copious information
available about that model -- what it means, the real-world implications of
those semantics, and the correct instruction sequences to properly
implement them on various architectures. Memory and concurrency models are
_really_ _hard_ no matter what (as I think this thread demonstrates), and
having a standardized model to base things on is a huge advantage. If
musl's model was "C11 atomics, but we only use seq_cst operations", that
would be wonderful...but it's not. It's something different -- with
different guarantees, and different implications, and thus requires
developers to do unique analysis.

Atomics in musl are implemented
> entirely in asm, because the compilers do not get theirs right and do
> not support the runtime selection of methods necessary for some of the
> archs we support (especially 32-bit arm and sh).

Even if you need to provide a custom implementation to workaround compiler
issues on some platforms, IMO it'd still be an improvement to mirror the
standard API/semantics -- and to use the compiler support on all the
platforms where it does work.

Though, I do believe it ought to DTRT on ARM32 Linux targets. When
targeting older CPUs that don't guarantee LLSC availability, the compiler
will generate a function call to a libgcc function. That library function
then calls the kernel-provided kuser_helper cmpxchg and barrier functions.
(gcc/libgcc/config/arm/linux-atomic.c for the libgcc side). Then, which
instruction sequence is used to implement the atomics is handled purely by
the kernel helper. This design _should_ be correct for all ARM CPUs, but
with a bit of overhead if running on a modern CPU (because operations like
fetch_add get implemented on top of cmpxchg). But, I dunno, perhaps there's
bugs.

I've never looked at the situation on SuperH...but going by the GCC
manual's description of -matomic-model...yikes...that does look like a
complete mess of a situation all around.

[-- Attachment #2: Type: text/html, Size: 3207 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-05 12:10   ` Alexey Izbyshev
@ 2022-10-05 14:03     ` Rich Felker
  2022-10-05 14:37       ` Rich Felker
  0 siblings, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-05 14:03 UTC (permalink / raw)
  To: musl

On Wed, Oct 05, 2022 at 03:10:09PM +0300, Alexey Izbyshev wrote:
> On 2022-10-05 04:00, Rich Felker wrote:
> >On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>Reordering the "libc.need_locks = -1" assignment and
> >>UNLOCK(E->killlock) and providing a store barrier between them
> >>should fix the issue.
> >
> >Back to this, because it's immediately actionable without resolving
> >the aarch64 atomics issue:
> >
> >Do you have something in mind for how this reordering is done, since
> >there are other intervening steps that are potentially ordered with
> >respect to either or both? I don't think there is actually any
> >ordering constraint at all on the unlocking of killlock (with the
> >accompanying assignment self->tid=0 kept with it) except that it be
> >past the point where we are committed to the thread terminating
> >without executing any more application code. So my leaning would be to
> >move this block from the end of pthread_exit up to right after the
> >point-of-no-return comment.
> >
> This was my conclusion as well back when I looked at it before
> sending the report.
> 
> I was initially concerned about whether reordering with
> a_store(&self->detach_state, DT_EXITED) could cause an unwanted
> observable change (pthread_tryjoin_np() returning EBUSY after a
> pthread function acting on tid like pthread_getschedparam() returns
> ESRCH), but no, pthread_tryjoin_np() will block/trap if the thread
> is not DT_JOINABLE.
> 
> >Unfortunately while reading this I found another bug, this time a lock
> >order one. __dl_thread_cleanup() takes a lock while the thread list
> >lock is already held, but fork takes these in the opposite order. I
> >think the lock here could be dropped and replaced with an atomic-cas
> >list head, but that's rather messy and I'm open to other ideas.
> >
> I'm not sure why using a lock-free list is messy, it seems like a
> perfect fit here to me.

Just in general I've tried to reduce the direct use of atomics and use
high-level primitives, because (as this thread is evidence of) I find
the reasoning about direct use of atomics and their correctness to be
difficult and inaccessible to a lot of people who would otherwise be
successful readers of the code. But you're right that it's a "good
match" for the problem at hand.

> However, doesn't __dl_vseterr() use the libc-internal allocator
> after  34952fe5de44a833370cbe87b63fb8eec61466d7? If so, the problem
> that freebuf_queue was originally solving doesn't exist anymore. We
> still can't call the allocator after __tl_lock(), but maybe this
> whole free deferral approach can be reconsidered?

I almost made that change when the MT-fork changes were done, but
didn't because it was wrong. I'm not sure if I documented this
anywhere (it might be in mail threads related to that or IRC) but it
was probably because it would need to take malloc locks with the
thread list lock held, which isn't allowed.

It would be nice if we could get rid of the deferred freeing here, but
I don't see a good way. The reason we can't free the buffer until
after the thread list lock is taken is that it's only freeable if this
isn't the last exiting thread. If it is the last exiting thread, the
buffer contents still need to be present for the atexit handlers to
see. And whether this is the last exiting thread is only
stable/determinate as long as the thread list lock is held.

Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-05 14:03     ` Rich Felker
@ 2022-10-05 14:37       ` Rich Felker
  2022-10-05 16:23         ` Alexey Izbyshev
  0 siblings, 1 reply; 32+ messages in thread
From: Rich Felker @ 2022-10-05 14:37 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3732 bytes --]

On Wed, Oct 05, 2022 at 10:03:03AM -0400, Rich Felker wrote:
> On Wed, Oct 05, 2022 at 03:10:09PM +0300, Alexey Izbyshev wrote:
> > On 2022-10-05 04:00, Rich Felker wrote:
> > >On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> > >>Reordering the "libc.need_locks = -1" assignment and
> > >>UNLOCK(E->killlock) and providing a store barrier between them
> > >>should fix the issue.
> > >
> > >Back to this, because it's immediately actionable without resolving
> > >the aarch64 atomics issue:
> > >
> > >Do you have something in mind for how this reordering is done, since
> > >there are other intervening steps that are potentially ordered with
> > >respect to either or both? I don't think there is actually any
> > >ordering constraint at all on the unlocking of killlock (with the
> > >accompanying assignment self->tid=0 kept with it) except that it be
> > >past the point where we are committed to the thread terminating
> > >without executing any more application code. So my leaning would be to
> > >move this block from the end of pthread_exit up to right after the
> > >point-of-no-return comment.
> > >
> > This was my conclusion as well back when I looked at it before
> > sending the report.
> > 
> > I was initially concerned about whether reordering with
> > a_store(&self->detach_state, DT_EXITED) could cause an unwanted
> > observable change (pthread_tryjoin_np() returning EBUSY after a
> > pthread function acting on tid like pthread_getschedparam() returns
> > ESRCH), but no, pthread_tryjoin_np() will block/trap if the thread
> > is not DT_JOINABLE.
> > 
> > >Unfortunately while reading this I found another bug, this time a lock
> > >order one. __dl_thread_cleanup() takes a lock while the thread list
> > >lock is already held, but fork takes these in the opposite order. I
> > >think the lock here could be dropped and replaced with an atomic-cas
> > >list head, but that's rather messy and I'm open to other ideas.
> > >
> > I'm not sure why using a lock-free list is messy, it seems like a
> > perfect fit here to me.
> 
> Just in general I've tried to reduce the direct use of atomics and use
> high-level primitives, because (as this thread is evidence of) I find
> the reasoning about direct use of atomics and their correctness to be
> difficult and inaccessible to a lot of people who would otherwise be
> successful readers of the code. But you're right that it's a "good
> match" for the problem at hand.
> 
> > However, doesn't __dl_vseterr() use the libc-internal allocator
> > after  34952fe5de44a833370cbe87b63fb8eec61466d7? If so, the problem
> > that freebuf_queue was originally solving doesn't exist anymore. We
> > still can't call the allocator after __tl_lock(), but maybe this
> > whole free deferral approach can be reconsidered?
> 
> I almost made that change when the MT-fork changes were done, but
> didn't because it was wrong. I'm not sure if I documented this
> anywhere (it might be in mail threads related to that or IRC) but it
> was probably because it would need to take malloc locks with the
> thread list lock held, which isn't allowed.
> 
> It would be nice if we could get rid of the deferred freeing here, but
> I don't see a good way. The reason we can't free the buffer until
> after the thread list lock is taken is that it's only freeable if this
> isn't the last exiting thread. If it is the last exiting thread, the
> buffer contents still need to be present for the atexit handlers to
> see. And whether this is the last exiting thread is only
> stable/determinate as long as the thread list lock is held.

Proposed patch with atomic list attached, along with a stupid test
program (to be run under a debugger to see anything happening).

Rich

[-- Attachment #2: dlerror_free.c --]
[-- Type: text/plain, Size: 646 bytes --]

#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>
#include <semaphore.h>

void *thread_func(void *p)
{
	sem_t *sem = p;
	char buf[100];
	sem_wait(&sem[0]);
	snprintf(buf, sizeof buf, "/dev/null/noexist_%p", (void *)buf);
	dlopen(buf, RTLD_NOW|RTLD_LOCAL);
	sem_post(&sem[1]);
}

#define NT 30

int main()
{
	sem_t sem[2];
	int i;
	sem_init(&sem[0], 0, 0);
	sem_init(&sem[1], 0, 0);
	for (i=0; i<NT; i++) {
		pthread_t td;
		pthread_create(&td, 0, thread_func, sem);
	}
	for (i=0; i<NT; i++) sem_post(&sem[0]);
	for (i=0; i<NT; i++) sem_wait(&sem[1]);
	dlopen("/dev/null/noexist_main", RTLD_NOW|RTLD_LOCAL);
	printf("%s\n", dlerror());
}

[-- Attachment #3: dlerror_free.diff --]
[-- Type: text/plain, Size: 2760 bytes --]

diff --git a/src/internal/fork_impl.h b/src/internal/fork_impl.h
index 5892c13b..ae3a79e5 100644
--- a/src/internal/fork_impl.h
+++ b/src/internal/fork_impl.h
@@ -2,7 +2,6 @@
 
 extern hidden volatile int *const __at_quick_exit_lockptr;
 extern hidden volatile int *const __atexit_lockptr;
-extern hidden volatile int *const __dlerror_lockptr;
 extern hidden volatile int *const __gettext_lockptr;
 extern hidden volatile int *const __locale_lockptr;
 extern hidden volatile int *const __random_lockptr;
diff --git a/src/ldso/dlerror.c b/src/ldso/dlerror.c
index afe59253..30a33f33 100644
--- a/src/ldso/dlerror.c
+++ b/src/ldso/dlerror.c
@@ -23,28 +23,31 @@ char *dlerror()
 		return s;
 }
 
-static volatile int freebuf_queue_lock[1];
-static void **freebuf_queue;
-volatile int *const __dlerror_lockptr = freebuf_queue_lock;
+/* Atomic singly-linked list, used to store list of thread-local dlerror
+ * buffers for deferred free. They cannot be freed at thread exit time
+ * because, by the time it's known they can be freed, the exiting thread
+ * is in a highly restrictive context where it cannot call (even the
+ * libc-internal) free. It also can't take locks; thus the atomic list. */
+
+static void *volatile freebuf_queue;
 
 void __dl_thread_cleanup(void)
 {
 	pthread_t self = __pthread_self();
-	if (self->dlerror_buf && self->dlerror_buf != (void *)-1) {
-		LOCK(freebuf_queue_lock);
-		void **p = (void **)self->dlerror_buf;
-		*p = freebuf_queue;
-		freebuf_queue = p;
-		UNLOCK(freebuf_queue_lock);
-	}
+	if (!self->dlerror_buf || self->dlerror_buf == (void *)-1)
+		return;
+	void *h;
+	do {
+		h = freebuf_queue;
+		*(void **)self->dlerror_buf = h;
+	} while (a_cas_p(&freebuf_queue, h, self->dlerror_buf) != h);
 }
 
 hidden void __dl_vseterr(const char *fmt, va_list ap)
 {
-	LOCK(freebuf_queue_lock);
-	void **q = freebuf_queue;
-	freebuf_queue = 0;
-	UNLOCK(freebuf_queue_lock);
+	void **q;
+	do q = freebuf_queue;
+	while (q && a_cas_p(&freebuf_queue, q, 0) != q);
 
 	while (q) {
 		void **p = *q;
diff --git a/src/process/fork.c b/src/process/fork.c
index 54bc2892..ff71845c 100644
--- a/src/process/fork.c
+++ b/src/process/fork.c
@@ -9,7 +9,6 @@ static volatile int *const dummy_lockptr = 0;
 
 weak_alias(dummy_lockptr, __at_quick_exit_lockptr);
 weak_alias(dummy_lockptr, __atexit_lockptr);
-weak_alias(dummy_lockptr, __dlerror_lockptr);
 weak_alias(dummy_lockptr, __gettext_lockptr);
 weak_alias(dummy_lockptr, __locale_lockptr);
 weak_alias(dummy_lockptr, __random_lockptr);
@@ -24,7 +23,6 @@ weak_alias(dummy_lockptr, __vmlock_lockptr);
 static volatile int *const *const atfork_locks[] = {
 	&__at_quick_exit_lockptr,
 	&__atexit_lockptr,
-	&__dlerror_lockptr,
 	&__gettext_lockptr,
 	&__locale_lockptr,
 	&__random_lockptr,

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
  2022-10-05 14:37       ` Rich Felker
@ 2022-10-05 16:23         ` Alexey Izbyshev
  0 siblings, 0 replies; 32+ messages in thread
From: Alexey Izbyshev @ 2022-10-05 16:23 UTC (permalink / raw)
  To: musl

On 2022-10-05 17:37, Rich Felker wrote:
> On Wed, Oct 05, 2022 at 10:03:03AM -0400, Rich Felker wrote:
>> On Wed, Oct 05, 2022 at 03:10:09PM +0300, Alexey Izbyshev wrote:
>> > On 2022-10-05 04:00, Rich Felker wrote:
>> > >On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> > >>Reordering the "libc.need_locks = -1" assignment and
>> > >>UNLOCK(E->killlock) and providing a store barrier between them
>> > >>should fix the issue.
>> > >
>> > >Back to this, because it's immediately actionable without resolving
>> > >the aarch64 atomics issue:
>> > >
>> > >Do you have something in mind for how this reordering is done, since
>> > >there are other intervening steps that are potentially ordered with
>> > >respect to either or both? I don't think there is actually any
>> > >ordering constraint at all on the unlocking of killlock (with the
>> > >accompanying assignment self->tid=0 kept with it) except that it be
>> > >past the point where we are committed to the thread terminating
>> > >without executing any more application code. So my leaning would be to
>> > >move this block from the end of pthread_exit up to right after the
>> > >point-of-no-return comment.
>> > >
>> > This was my conclusion as well back when I looked at it before
>> > sending the report.
>> >
>> > I was initially concerned about whether reordering with
>> > a_store(&self->detach_state, DT_EXITED) could cause an unwanted
>> > observable change (pthread_tryjoin_np() returning EBUSY after a
>> > pthread function acting on tid like pthread_getschedparam() returns
>> > ESRCH), but no, pthread_tryjoin_np() will block/trap if the thread
>> > is not DT_JOINABLE.
>> >
>> > >Unfortunately while reading this I found another bug, this time a lock
>> > >order one. __dl_thread_cleanup() takes a lock while the thread list
>> > >lock is already held, but fork takes these in the opposite order. I
>> > >think the lock here could be dropped and replaced with an atomic-cas
>> > >list head, but that's rather messy and I'm open to other ideas.
>> > >
>> > I'm not sure why using a lock-free list is messy, it seems like a
>> > perfect fit here to me.
>> 
>> Just in general I've tried to reduce the direct use of atomics and use
>> high-level primitives, because (as this thread is evidence of) I find
>> the reasoning about direct use of atomics and their correctness to be
>> difficult and inaccessible to a lot of people who would otherwise be
>> successful readers of the code. But you're right that it's a "good
>> match" for the problem at hand.
>> 
>> > However, doesn't __dl_vseterr() use the libc-internal allocator
>> > after  34952fe5de44a833370cbe87b63fb8eec61466d7? If so, the problem
>> > that freebuf_queue was originally solving doesn't exist anymore. We
>> > still can't call the allocator after __tl_lock(), but maybe this
>> > whole free deferral approach can be reconsidered?
>> 
>> I almost made that change when the MT-fork changes were done, but
>> didn't because it was wrong. I'm not sure if I documented this
>> anywhere (it might be in mail threads related to that or IRC) but it
>> was probably because it would need to take malloc locks with the
>> thread list lock held, which isn't allowed.
>> 
>> It would be nice if we could get rid of the deferred freeing here, but
>> I don't see a good way. The reason we can't free the buffer until
>> after the thread list lock is taken is that it's only freeable if this
>> isn't the last exiting thread. If it is the last exiting thread, the
>> buffer contents still need to be present for the atexit handlers to
>> see. And whether this is the last exiting thread is only
>> stable/determinate as long as the thread list lock is held.
> 
> Proposed patch with atomic list attached, along with a stupid test
> program (to be run under a debugger to see anything happening).
> 
The patch looks good to me, and the program does the expected thing for 
me when linked with the patched musl.

Inclusion of "lock.h" and "fork_impl.h" can also be removed from 
dlerror.c.

Alexey

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-10-05 16:24 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-07  0:46 [musl] Illegal killlock skipping when transitioning to single-threaded state Alexey Izbyshev
2022-09-19 15:29 ` Rich Felker
2022-10-03  6:16   ` Alexey Izbyshev
2022-10-03 12:33     ` Rich Felker
2022-10-03 13:26     ` Szabolcs Nagy
2022-10-03 21:27       ` Szabolcs Nagy
2022-10-03 22:54         ` Rich Felker
2022-10-03 23:05           ` Rich Felker
2022-10-04 13:50             ` Alexey Izbyshev
2022-10-04 14:12               ` Rich Felker
2022-10-04 14:19                 ` Rich Felker
2022-10-04 15:43                   ` Alexey Izbyshev
2022-10-04 15:57                     ` Rich Felker
2022-10-04 18:15                       ` Alexey Izbyshev
2022-10-04 23:21                         ` Rich Felker
2022-10-04 16:24                 ` James Y Knight
2022-10-04 16:45                   ` Rich Felker
2022-10-05 13:52                     ` James Y Knight
2022-10-04 16:01               ` Alexey Izbyshev
2022-10-04  2:58         ` Rich Felker
2022-10-04  3:00           ` Rich Felker
2022-10-04  4:59             ` Rich Felker
2022-10-04  8:16               ` Szabolcs Nagy
2022-10-04 10:18               ` Alexey Izbyshev
2022-10-04  5:16         ` Alexey Izbyshev
2022-10-04  8:31           ` Szabolcs Nagy
2022-10-04 10:28             ` Alexey Izbyshev
2022-10-05  1:00 ` Rich Felker
2022-10-05 12:10   ` Alexey Izbyshev
2022-10-05 14:03     ` Rich Felker
2022-10-05 14:37       ` Rich Felker
2022-10-05 16:23         ` Alexey Izbyshev

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).