Date: Tue, 04 Oct 2022 18:43:49 +0300
From: Alexey Izbyshev <izbyshev@ispras.ru>
To: musl@lists.openwall.com
In-Reply-To: <20221004141937.GM29905@brightrain.aerifal.cx>
Subject: Re: [musl] Illegal killlock skipping when transitioning to single-threaded state

On 2022-10-04 17:19, Rich Felker wrote:
> On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
>> On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
>> > On 2022-10-04 02:05, Rich Felker wrote:
>> > >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
>> > >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>> > >>> * Szabolcs Nagy [2022-10-03 15:26:15 +0200]:
>> > >>>
>> > >>> > * Alexey Izbyshev [2022-10-03 09:16:03 +0300]:
>> > >>> > > On 2022-09-19 18:29, Rich Felker wrote:
>> > >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> > >>> > ...
>> > >>> > > > > Reordering the "libc.need_locks = -1" assignment and
>> > >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
>> > >>> > > > > should fix the issue.
>> > >>> > > >
>> > >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
>> > >>> > > > barrier between them, since all lock and unlock operations are already
>> > >>> > > > full barriers.
>> > >>> > > >
>> > >>> > >
>> > >>> > > Before sending the report I tried to infer the intended ordering semantics
>> > >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> > >>> > > see why they would provide a full barrier (my reasoning is below), so I
>> > >>> > > concluded that probably acquire/release semantics was intended in general
>> > >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> > >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> > >>> > > section.
>> > >>> > >
>> > >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> > >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> > >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
>> > >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> > >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> > >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> > >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
>> > >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
>> > >>> > > barrier. Is this right?
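
(For reference, the a_ll()/a_sc() referred to above look roughly like
this on AArch64. This is a simplified sketch of the inline asm in
musl's arch/aarch64/atomic_arch.h, written from memory; the exact
constraints may differ:

static inline int a_ll(volatile int *p)
{
	int v;
	/* load-acquire exclusive */
	__asm__ __volatile__ ("ldaxr %w0,%1" : "=r"(v) : "Q"(*p));
	return v;
}

static inline int a_sc(volatile int *p, int v)
{
	int r;
	/* store-release exclusive; r is 0 on success */
	__asm__ __volatile__ ("stlxr %w0,%w2,%1"
		: "=&r"(r), "=Q"(*p) : "r"(v) : "memory");
	return !r;
}

Note that there is no dmb anywhere in the LL/SC loop itself: ldaxr
gives only acquire ordering and stlxr only release ordering, which is
what permits the reorderings described above.)
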
>> > >>> >
>> > >>> > i dont think this is right.
>> > >>>
>> > >>>
>> > >>> i think i was wrong and you are right.
>> > >>>
>> > >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> > >>> starting with 'something == 0' the exiting E and remaining R threads:
>> > >>>
>> > >>> E:something=1      // protected by killlock
>> > >>> E:UNLOCK(killlock)
>> > >>> E:need_locks=-1
>> > >>>
>> > >>> R:LOCK(unrelated)  // reads need_locks == -1
>> > >>> R:need_locks=0
>> > >>> R:UNLOCK(unrelated)
>> > >>> R:LOCK(killlock)   // does not lock
>> > >>> R:read something   // can it be 0 ?
>> > >>>
>> > >>> and here something can be 0 (ie. not protected by killlock) on aarch64
>> > >>> because
>> > >>>
>> > >>> T1
>> > >>> 	something=1
>> > >>> 	ldaxr ... killlock
>> > >>> 	stlxr ... killlock
>> > >>> 	need_locks=-1
>> > >>>
>> > >>> T2
>> > >>> 	x=need_locks
>> > >>> 	ldaxr ... unrelated
>> > >>> 	stlxr ... unrelated
>> > >>> 	y=something
>> > >>>
>> > >>> can end with x==-1 and y==0.
>> > >>>
>> > >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
>> > >>>
>> > >>> i need to think how to support such lock usage on aarch64
>> > >>> without adding too many dmb.
>> > >>
>> > >>I don't really understand this, but FWIW gcc emits
>> > >>
>> > >>	ldxr
>> > >>	...
>> > >>	stlxr
>> > >>	...
>> > >>	dmb ish
>> > >>
>> > >>for __sync_val_compare_and_swap. So this is probably the right thing
>> > >>we should have. And it seems to match what the kernel folks discussed
>> > >>here:
>> > >>
>> > >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
>> > >>
>> > >>I wondered if there are similar issues for any others archs which need
>> > >>review, but it looks like all the other llsc archs have explicit
>> > >>pre/post barriers defined.
>> > >
>> > >Actually I don't understand what's going on with cmpxchg there. The
>> > >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
>> > >some follow-up in the thread I don't understand, about the case where
>> > >the cas fails, but we already handle that by doing an explicit barrier
>> > >in that case.
>> > >
>> > I think in that follow-up[1] they mean the following case (in musl
>> > terms):
>> >
>> > volatile int x, flag;
>> >
>> > T1:
>> > x = 1;
>> > a_store(&flag, 1);
>> >
>> > T2:
>> > while (!flag);
>> > a_cas(&x, 0, 1); // can this fail?
>> >
>> > They want it to never fail. But if a_cas() is implemented as
>> > ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
>> > with the load of flag.
>> >
>> > Note that musl does *not* handle this now, because a_barrier() in
>> > the failure path is after a_ll().
>> >
>> > [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
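
(To make this concrete, the generic LL/SC-based a_cas() being discussed
has roughly the following shape. This is a simplified sketch, not the
exact code from src/internal/atomic.h:

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	a_pre_llsc();
	do {
		old = a_ll(p);
		if (old != t) {
			/* barrier on the failure path, after the load */
			a_barrier();
			break;
		}
	} while (!a_sc(p, s));
	a_post_llsc();
	return old;
}

The a_barrier() on the failure path is executed only after a_ll() has
already read *p, so it cannot prevent that read from being reordered
with the earlier load of flag in T2.)
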
>>
>> OK, then indeed this too needs to be fixed -- the a_cas is failing to
>> synchronize with the a_store. How do we do that? Based on my
>> understanding, my proposed patch doesn't fix it.
>>
>> Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
>> synchronization to a retry in the case where the comparison fails?
>>
ldarx will not work for the same reason as ldrx doesn't work (it can
still be reordered with the load of flag).

>> If this is actually the case, it's disturbing that GCC does not seem
>> to be getting it right either...
>
This is indeed disturbing, considering that comment[1] claims that
__sync RMWs were once defined as:

__sync_synchronize();
operation...
__sync_synchronize();

which clearly doesn't hold for the current implementation of
__sync_val_compare_and_swap() for AArch64. I can speculate that some of
the confusion is due to the fact that apparently each __sync builtin
originally mapped to a single Itanium instruction, so the issues with
ordering of "internal" LL/SC accesses simply didn't exist.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24

> One stupid (or maybe not stupid) option would be to always stlrx back
> the old value on comparison failure, so we can rely on the stlrx to
> fail if anything was reordered wrong. I think this is actually kinda
> how the x86 CAS works, or how it historically worked, so maybe it's
> not a bad choice. It's also analogous to how the non-CAS operations
> work: no branching out of the ll/sc loop, and just defining the result
> as old == t ? s : old:
>
> static inline int a_cas(volatile int *p, int t, int s)
> {
> 	int old;
> 	a_pre_llsc();
> 	do old = a_ll(p);
> 	while (!a_sc(p, old == t ? s : old));
> 	a_post_llsc();
> 	return old;
> }
>
I can't comment on whether always doing a store (or even many stores,
until "stabilization") on a failed CAS is better than an extra dmb.
Maybe if CAS failure normally happens only on contention, and we
optimize for the non-contended case, it's indeed so. As for
correctness, I agree that it should work.

> In fact if we use this as the general pattern for a_cas (in
> src/internal/atomic.h) I'm not even sure we'd need a_pre_llsc() at
> all, on any archs -- but that probably depends on the ISA's SC
> semantics...
>
This pattern works on AArch64 without a_pre_llsc() only because a_sc()
is a store-release. If it were a plain store-exclusive, preceding
memory accesses could be reordered with it.

Alexey