From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.0 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no version=3.4.4
Received: (qmail 20535 invoked from network); 4 Oct 2022 23:21:23 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 4 Oct 2022 23:21:23 -0000
Received: (qmail 15363 invoked by uid 550); 4 Oct 2022 23:21:19 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 14301 invoked from network); 4 Oct 2022 23:21:18 -0000
Date: Tue, 4 Oct 2022 19:21:03 -0400
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Message-ID: <20221004232103.GP29905@brightrain.aerifal.cx>
References: <20221003132615.GF2158779@port70.net>
 <20221003212705.GG2158779@port70.net>
 <20221003225416.GG29905@brightrain.aerifal.cx>
 <20221003230505.GH29905@brightrain.aerifal.cx>
 <2e77a700561a059e85daad5311306cfb@ispras.ru>
 <20221004141242.GL29905@brightrain.aerifal.cx>
 <20221004141937.GM29905@brightrain.aerifal.cx>
 <e90e65c83e96fce54e188fbe6e0413cd@ispras.ru>
 <20221004155713.GN29905@brightrain.aerifal.cx>
 <e4b4f0fbb0bb7d168a4c9c2f50fdb48a@ispras.ru>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <e4b4f0fbb0bb7d168a4c9c2f50fdb48a@ispras.ru>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] Illegal killlock skipping when transitioning to
 single-threaded state

On Tue, Oct 04, 2022 at 09:15:12PM +0300, Alexey Izbyshev wrote:
> On 2022-10-04 18:57, Rich Felker wrote:
> >On Tue, Oct 04, 2022 at 06:43:49PM +0300, Alexey Izbyshev wrote:
> >>On 2022-10-04 17:19, Rich Felker wrote:
> >>>On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
> >>>>On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
> >>>>> On 2022-10-04 02:05, Rich Felker wrote:
> >>>>> >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> >>>>> >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> >>>>> >>> * Szabolcs Nagy <nsz@port70.net> [2022-10-03 15:26:15 +0200]:
> >>>>> >>>
> >>>>> >>> > * Alexey Izbyshev <izbyshev@ispras.ru> [2022-10-03 09:16:03 +0300]:
> >>>>> >>> > > On 2022-09-19 18:29, Rich Felker wrote:
> >>>>> >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>>>> >>> > ...
> >>>>> >>> > > > > Reordering the "libc.need_locks = -1" assignment and
> >>>>> >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> >>>>> >>> > > > > should fix the issue.
> >>>>> >>> > > >
> >>>>> >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
> >>>>> >>> > > > barrier between them, since all lock and unlock operations are already
> >>>>> >>> > > > full barriers.
> >>>>> >>> > > >
> >>>>> >>> > >
> >>>>> >>> > > Before sending the report I tried to infer the intended ordering semantics
> >>>>> >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> >>>>> >>> > > see why they would provide a full barrier (my reasoning is below), so I
> >>>>> >>> > > concluded that probably acquire/release semantics was intended in general
> >>>>> >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> >>>>> >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> >>>>> >>> > > section.
> >>>>> >>> > >
> >>>>> >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> >>>>> >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> >>>>> >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
> >>>>> >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> >>>>> >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> >>>>> >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> >>>>> >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> >>>>> >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> >>>>> >>> > > barrier. Is this right?
> >>>>> >>> >
> >>>>> >>> > i dont think this is right.
> >>>>> >>>
> >>>>> >>>
> >>>>> >>> i think i was wrong and you are right.
> >>>>> >>>
> >>>>> >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> >>>>> >>> starting with 'something == 0' the exiting E and remaining R threads:
> >>>>> >>>
> >>>>> >>> E:something=1      // protected by killlock
> >>>>> >>> E:UNLOCK(killlock)
> >>>>> >>> E:need_locks=-1
> >>>>> >>>
> >>>>> >>> R:LOCK(unrelated)  // reads need_locks == -1
> >>>>> >>> R:need_locks=0
> >>>>> >>> R:UNLOCK(unrelated)
> >>>>> >>> R:LOCK(killlock)   // does not lock
> >>>>> >>> R:read something   // can it be 0 ?
> >>>>> >>>
> >>>>> >>> and here something can be 0 (ie. not protected by killlock) on aarch64
> >>>>> >>> because
> >>>>> >>>
> >>>>> >>> T1
> >>>>> >>> 	something=1
> >>>>> >>> 	ldaxr ... killlock
> >>>>> >>> 	stlxr ... killlock
> >>>>> >>> 	need_locks=-1
> >>>>> >>>
> >>>>> >>> T2
> >>>>> >>> 	x=need_locks
> >>>>> >>> 	ldaxr ... unrelated
> >>>>> >>> 	stlxr ... unrelated
> >>>>> >>> 	y=something
> >>>>> >>>
> >>>>> >>> can end with x==-1 and y==0.
> >>>>> >>>
> >>>>> >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> >>>>> >>>
> >>>>> >>> i need to think how to support such lock usage on aarch64
> >>>>> >>> without adding too many dmb.
> >>>>> >>
> >>>>> >>I don't really understand this, but FWIW gcc emits
> >>>>> >>
> >>>>> >>    ldxr
> >>>>> >>    ...
> >>>>> >>    stlxr
> >>>>> >>    ...
> >>>>> >>    dmb ish
> >>>>> >>
> >>>>> >>for __sync_val_compare_and_swap. So this is probably the right thing
> >>>>> >>we should have. And it seems to match what the kernel folks discussed
> >>>>> >>here:
> >>>>> >>
> >>>>> >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> >>>>> >>
> >>>>> >>I wondered if there are similar issues for any others archs which need
> >>>>> >>review, but it looks like all the other llsc archs have explicit
> >>>>> >>pre/post barriers defined.
> >>>>> >
> >>>>> >Actually I don't understand what's going on with cmpxchg there. The
> >>>>> >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
> >>>>> >some follow-up in the thread I don't understand, about the case where
> >>>>> >the cas fails, but we already handle that by doing an explicit barrier
> >>>>> >in that case.
> >>>>> >
> >>>>> I think in that follow-up[1] they mean the following case (in musl
> >>>>> terms):
> >>>>>
> >>>>> volatile int x, flag;
> >>>>>
> >>>>> T1:
> >>>>>     x = 1;
> >>>>>     a_store(&flag, 1);
> >>>>>
> >>>>> T2:
> >>>>>     while (!flag);
> >>>>>     a_cas(&x, 0, 1); // can this fail?
> >>>>>
> >>>>> They want it to never fail. But if a_cas() is implemented as
> >>>>> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
> >>>>> with the load of flag.
> >>>>>
> >>>>> Note that musl does *not* handle this now, because a_barrier() in
> >>>>> the failure path is after a_ll().
> >>>>>
> >>>>> [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
> >>>>
> >>>>OK, then indeed this too needs to be fixed -- the a_cas is failing to
> >>>>synchronize with the a_store. How do we do that? Based on my
> >>>>understanding, my proposed patch doesn't fix it.
> >>>>
> >>>>Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
> >>>>synchronization to a retry in the case where the comparison fails?
> >>>>
> >>ldarx will not work for the same reason as ldrx doesn't work (it can
> >>still be reordered with the load of flag).
> >>
> >>>>If this is actually the case, it's disturbing that GCC does not seem
> >>>>to be getting it right either...
> >>>
> >>This is indeed disturbing, considering that comment[1] claims that
> >>__sync RMWs were once defined as:
> >>
> >>  __sync_synchronize();
> >>  operation...
> >>  __sync_synchronize();
> >>
> >>which clearly doesn't hold for the current implementation of
> >>__sync_val_compare_and_swap() for AArch64.
> >>
> >>I can speculate that some of confusion is due to the fact that
> >>apparently each __sync builtin originally mapped to a single Itanium
> >>instruction, so the issues with ordering of "internal" LL/SC
> >>accesses simply didn't exist.
> >>
> >>[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24
> >>
> >>>One stupid (or maybe not stupid) option would be to always stlrx back
> >>>the old value on comparison failure, so we can rely on the stlrx to
> >>>fail if anything was reordered wrong. I think this is actually kinda
> >>>how the x86 CAS works, or how it historically worked, so maybe it's
> >>>not a bad choice. It's also analogous to how the non-CAS operations
> >>>work: no branching out of the ll/sc loop, and just defining the result
> >>>as old == t ? s : old:
> >>>
> >>>static inline int a_cas(volatile int *p, int t, int s)
> >>>{
> >>>	int old;
> >>>	a_pre_llsc();
> >>>	do old = a_ll(p);
> >>>	while (!a_sc(p, old == t ? s : old));
> >>>	a_post_llsc();
> >>>	return old;
> >>>}
> >>>
> >>I can't comment on whether always doing a store (or even many
> >>stores, until "stabilization") on a failed CAS is better than an
> >>extra dmb. Maybe if CAS failure normally happens only on contention,
> >>and we optimize for the non-contended case, it's indeed so. As for
> >>correctness, I agree that it should work.
> >
> >OK. I think the alternative is something like:
> >
> >static inline int a_cas(volatile int *p, int t, int s)
> >{
> >	int old;
> >	a_pre_llsc();
> >	do {
> >		old = a_ll(p);
> >		if (old != t) {
> >			a_barrier();
> >			if (*p != old) continue;
> >			break;
> >		}
> >	} while (!a_sc(p, s));
> >	a_post_llsc();
> >	return old;
> >}
> >
> This also looks correct to me.

Let's compare the code produced for a simple usage,
pthread_spin_trylock. The above:

0000000000000000 <pthread_spin_trylock>:
   0:   52800202        mov     w2, #0x10                       // #16
   4:   885f7c01        ldxr    w1, [x0]
   8:   340000a1        cbz     w1, 1c <pthread_spin_trylock+0x1c>
   c:   d5033bbf        dmb     ish
  10:   b9400003        ldr     w3, [x0]
  14:   6b03003f        cmp     w1, w3
  18:   540000a0        b.eq    2c <pthread_spin_trylock+0x2c>
  1c:   8803fc02        stlxr   w3, w2, [x0]
  20:   35ffff23        cbnz    w3, 4 <pthread_spin_trylock+0x4>
  24:   d5033bbf        dmb     ish
  28:   2a0103e3        mov     w3, w1
  2c:   2a0303e0        mov     w0, w3
  30:   d65f03c0        ret

vs unconditional writeback:

0000000000000000 <pthread_spin_trylock>:
   0:   aa0003e2        mov     x2, x0
   4:   52800204        mov     w4, #0x10                       // #16
   8:   885f7c40        ldxr    w0, [x2]
   c:   7100001f        cmp     w0, #0x0
  10:   1a841001        csel    w1, w0, w4, ne
  14:   8803fc41        stlxr   w3, w1, [x2]
  18:   35ffff83        cbnz    w3, 8 <pthread_spin_trylock+0x8>
  1c:   d5033bbf        dmb     ish
  20:   d65f03c0        ret

The latter is preferable on size and number of branches, and possibly
also on confidence that the compiler won't do something wacky
separating the ldxr and stlxr far from each other. But of course the
question of actual performance remains.

One other idea -- I'm not sure if there's a way to make use of this,
but maybe there's a way to use a dummy atomic to preclude reordering.
See how commit 5a9c8c05a5a0cdced4122589184fd795b761bb4a uses "lock ;
orl $0,(%rsp)" on x86_64 to replace "mfence". Could such an
intervening operation on aarch64 preclude reordering of accesses
across the ldxr on the real atomic?

> >This is more/messier code, and still has the possibility of unbounded
> >looping waiting for consistency under contention. It doesn't perform a
> >store to the atomic object during such contention, which might be less
> >of a burden on other threads, but on the other hand it performs full
> >barriers during this contention, which might be more of a burden.
> >
> >Conceptually I think I like the version that always stores for
> >symmetry with the other atomic operations that makes it "obviously
> >correct if they're correct". But I'm not sure if there are cost
> >reasons we should prefer a different choice. At least it "shouldn't be
> >worse" than the other atomic ops are if we do it that way.
> >
> Symmetry is indeed nice, but I'm not sure it's really appropriate
> because at least in my mind a_cas() is different from other atomic
> RMW ops by making the "W" part conditional. I do agree on the
> correctness part though.
> 
> Interestingly, the kernel actually removed the full barrier
> semantics from a failed atomic_cmpxchg(), with the reasoning that
> "nobody needs it": https://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355981.html

I disagree with this reasoning. Even if you don't care about the
consistency of other memory when a CAS fails, the problem with the
missing barrier is that it can *wrongly fail* when it causally must
have succeeded. As one particular example, this causes "trylock"
operations to sometimes fail when they should have succeeded.

Rich