Date: Mon, 3 Oct 2022 08:33:02 -0400
From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: [musl] Illegal killlock skipping when transitioning to single-threaded state
Message-ID: <20221003123302.GF29905@brightrain.aerifal.cx>
References: <20220919152937.GQ9709@brightrain.aerifal.cx>

On Mon, Oct 03, 2022 at 09:16:03AM +0300, Alexey Izbyshev wrote:
> On 2022-09-19 18:29, Rich Felker wrote:
> >On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>Hi,
> >>
> >>While reading the pthread_exit() implementation I noticed that it
> >>can set "libc.need_locks" to -1 while still holding the killlock of
> >>the exiting thread:
> >>
> >>    if (!--libc.threads_minus_1) libc.need_locks = -1;
> >>
> >>If the remaining thread attempts to acquire the same killlock
> >>concurrently (which is valid for joinable threads), it works fine
> >>because LOCK() resets "libc.need_locks" only after a_cas():
> >>
> >>    int need_locks = libc.need_locks;
> >>    if (!need_locks) return;
> >>    /* fast path: INT_MIN for the lock, +1 for the congestion */
> >>    int current = a_cas(l, 0, INT_MIN + 1);
> >>    if (need_locks < 0) libc.need_locks = 0;
> >>    if (!current) return;
> >>
> >>However, because "libc.need_locks" is reset when using LOCK() for
> >>any other lock too, the following could happen (E is the exiting
> >>thread, R is the remaining thread):
> >>
> >>E: libc.need_locks = -1
> >>R: LOCK(unrelated_lock)
> >>R: libc.need_locks = 0
> >>R: UNLOCK(unrelated_lock)
> >>R: LOCK(E->killlock) // skips locking, now both R and E think they
> >>   are holding the lock
> >>R: UNLOCK(E->killlock)
> >>E: UNLOCK(E->killlock)
> >>
> >>The lack of mutual exclusion means that the tid reuse problem that
> >>killlock is supposed to prevent might occur.
> >>
> >>Moreover, when UNLOCK(E->killlock) is called concurrently by R and
> >>E, a_fetch_add() could be done twice if timing is such that both
> >>threads notice that "l[0] < 0":
> >>
> >>    /* Check l[0] to see if we are multi-threaded. */
> >>    if (l[0] < 0) {
> >>        if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
> >>            __wake(l, 1, 1);
> >>        }
> >>    }
> >>
> >>In that case E->killlock will be assigned INT_MAX (i.e. "unlocked
> >>with INT_MAX waiters"). This is a bad state because the next LOCK()
> >>will wrap it to the "unlocked" state instead of locking. Also, if
> >>more than two threads attempt to use it, a deadlock will occur if
> >>two supposed owners execute a_fetch_add() concurrently in UNLOCK()
> >>after a third thread has registered itself as a waiter, because the
> >>lock will wrap to INT_MIN.
> >>
> >>Reordering the "libc.need_locks = -1" assignment and
> >>UNLOCK(E->killlock) and providing a store barrier between them
> >>should fix the issue.
> >
> >I think this all sounds correct. I'm not sure what you mean by a store
> >barrier between them, since all lock and unlock operations are already
> >full barriers.
> >
> 
> Before sending the report I tried to infer the intended ordering
> semantics of LOCK/UNLOCK by looking at their implementations. For
> AArch64, I didn't see why they would provide a full barrier (my
> reasoning is below), so I concluded that acquire/release semantics
> were probably intended in general, and suggested an extra store
> barrier to prevent hoisting of the "libc.need_locks = -1" store
> spelled after UNLOCK(E->killlock) back into the critical section.
> 
> UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are
> implemented via load-acquire/store-release instructions. Therefore,
> if we consider a LOCK/UNLOCK critical section containing only plain
> loads and stores, (a) any such memory access can be reordered with
> the initial ldaxr in UNLOCK, and (b) any plain load following UNLOCK
> can be reordered with stlxr (assuming the processor predicts that
> stlxr succeeds), and further, due to (a), with any memory access
> inside the critical section. Therefore, UNLOCK is not a full
> barrier. Is this right?

If so, this is a bug in our implementation of aarch64 atomics. The
intent is that a_* all be seq_cst (practically, full acq+rel probably
suffices). Maybe someone more familiar with aarch64 can chime in. I
thought this was discussed and determined to be okay when the aarch64
port was being reviewed, but now I'm starting to doubt that.

> As for a store following UNLOCK, I'm not sure. A plain store
> following stlxr can be reordered with it, but here that store is
> conditional on stlxr success. So even if the processor predicts
> stlxr success, it can't make the following store visible before it's
> sure that stlxr succeeded. But I don't know whether the events "the
> processor knows that stlxr succeeded" and "the result of stlxr is
> globally visible" are separate or not, and if they are separate,
> which comes first. Depending on the answer, UNLOCK acts as a store
> barrier or not.
> 
> >I'm still a little bit concerned though that there's no real model for
> >synchronization of the access to modify need_locks.
> >Conceptually, the store of -1 from the exiting thread and the store of
> >0 in the surviving thread are not ordered with respect to each other
> >except by an unsynchronized causality relationship.
> >
> 
> I don't understand the implications of the lack of (stronger)
> ordering of these two stores. Could you clarify?

I'm talking about this at the level of the memory model, not any
implementation details of the cpu. The store of 0 in the surviving
thread is not an atomic, and I was thinking that it is possibly not
ordered with respect to the store of the -1 in the exiting thread
unless there's an intervening barrier *after* the -1 is observed.
However, looking back at the code, I was actually careful to ensure
this already: need_locks is loaded only once, before the a_cas, and
the a_cas provides the barrier. If we were instead re-loading
need_locks after the a_cas, before testing for <0 and assigning 0,
there would be a problem, at least at the abstract memory model
level.
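To spell out that distinction, here is a sketch (illustration only,
not a patch; it assumes the musl-internal declarations of libc and
a_cas from pthread_impl.h, and the function names are made up). The
first variant mirrors the LOCK() fast path quoted earlier; the second
is the hypothetical re-loading variant described above:

    #include <limits.h>
    #include "pthread_impl.h"

    static void lock_as_written(volatile int *l)
    {
        /* need_locks is loaded once, before the a_cas; the a_cas is
         * the barrier that comes *after* the -1 was observed and
         * before the plain store of 0. */
        int need_locks = libc.need_locks;
        if (!need_locks) return;
        int current = a_cas(l, 0, INT_MIN + 1);
        if (need_locks < 0) libc.need_locks = 0;
        if (!current) return;
        /* ... slow path ... */
    }

    static void lock_reload_after_cas(volatile int *l)
    {
        /* Hypothetical: here the -1 would only be observed after the
         * a_cas, so no barrier separates observing the -1 from the
         * plain store of 0, which is the abstract-memory-model
         * problem mentioned above. */
        int current = a_cas(l, 0, INT_MIN + 1);
        if (libc.need_locks < 0) libc.need_locks = 0;
        if (!current) return;
        /* ... slow path ... */
    }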
> >I suspect what "should be" happening is that, if we observe
> >need_locks==-1 (as a relaxed atomic load in our memory model), we take
> >the thread list lock (as the lock controlling write access to these
> >values), which ensures that the exiting thread has exited, then confirm
> >that we are the only thread left (are there async-signal ways without
> >UB that another thread could be created in the meantime?) before
> >setting it to 0. But this is a rather heavy operation. If there's an
> >assurance that no other threads can be created in the middle of LOCK
> >(which I think there should be), we could just use __tl_sync(), which
> >is much lighter, to conclude that we've synchronized with being the
> >only remaining thread and that unsynchronized access is okay.
> >
> 
> LOCK is called either with application signals blocked or from
> async-signal-unsafe functions, so interrupting it with a signal
> handler that calls pthread_create() is not possible without UB.
> Therefore, I don't see any issues with using __tl_sync() for
> synchronization with the exiting thread (although, as I said above,
> I don't understand what problem is being solved).

Based on the above, I agree there doesn't seem to be an actual
problem here. The reason it works is subtle, but present.

> By the way, I noticed a suspicious usage of "libc.need_locks" in
> fork():
> 
>     int need_locks = libc.need_locks > 0;
> 
> This is different from how LOCK behaves: LOCK will still perform
> normal locking one last time if it sees -1, but here locking will be
> skipped entirely. Is this intentional?

This is probably subtly incorrect. I think it requires more analysis
to determine whether anything can actually go wrong, but it should
probably just be changed anyway.

Rich
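P.S. A trivial standalone illustration of the LOCK()-vs-fork()
difference discussed above (hypothetical helper names, not musl
code):

    #include <stdio.h>

    /* LOCK() skips its locking only when need_locks is exactly 0;
     * a value of -1 still takes the lock one last time. */
    static int lock_skips(int need_locks)
    {
        return !need_locks;
    }

    /* The fork() check quoted above folds -1 into "skip" via the
     * "> 0" comparison, so the internal locking is skipped entirely
     * on that final transition. */
    static int fork_skips(int need_locks)
    {
        return !(need_locks > 0);
    }

    int main(void)
    {
        /* Prints "need_locks == -1: LOCK skips? 0, fork skips? 1" */
        printf("need_locks == -1: LOCK skips? %d, fork skips? %d\n",
               lock_skips(-1), fork_skips(-1));
        return 0;
    }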