Date: Mon, 03 Oct 2022 09:16:03 +0300
From: Alexey Izbyshev
To: musl@lists.openwall.com
In-Reply-To: <20220919152937.GQ9709@brightrain.aerifal.cx>
Subject: Re: [musl] Illegal killlock skipping when transitioning to single-threaded state

On 2022-09-19 18:29, Rich Felker wrote:
> On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> Hi,
>>
>> While reading the pthread_exit() implementation I noticed that it can
>> set "libc.need_locks" to -1 while still holding the killlock of the
>> exiting thread:
>>
>> 	if (!--libc.threads_minus_1) libc.need_locks = -1;
>>
>> If the remaining thread attempts to acquire the same killlock
>> concurrently (which is valid for joinable threads), it works fine
>> because LOCK() resets "libc.need_locks" only after a_cas():
>>
>> 	int need_locks = libc.need_locks;
>> 	if (!need_locks) return;
>> 	/* fast path: INT_MIN for the lock, +1 for the congestion */
>> 	int current = a_cas(l, 0, INT_MIN + 1);
>> 	if (need_locks < 0) libc.need_locks = 0;
>> 	if (!current) return;
>>
>> However, because "libc.need_locks" is reset when using LOCK() for
>> any other lock too, the following could happen (E is the exiting
>> thread, R is the remaining thread):
>>
>> E: libc.need_locks = -1
>> R: LOCK(unrelated_lock)
>> R: libc.need_locks = 0
>> R: UNLOCK(unrelated_lock)
>> R: LOCK(E->killlock) // skips locking, now both R and E think they
>>    are holding the lock
>> R: UNLOCK(E->killlock)
>> E: UNLOCK(E->killlock)
>>
>> The lack of mutual exclusion means that the tid reuse problem that
>> killlock is supposed to prevent might occur.
>>
>> Moreover, when UNLOCK(E->killlock) is called concurrently by R and
>> E, a_fetch_add() could be done twice if the timing is such that both
>> threads notice that "l[0] < 0":
>>
>> 	/* Check l[0] to see if we are multi-threaded. */
>> 	if (l[0] < 0) {
>> 		if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
>> 			__wake(l, 1, 1);
>> 		}
>> 	}
>>
>> In that case E->killlock will be assigned INT_MAX (i.e. "unlocked
>> with INT_MAX waiters"). This is a bad state because the next LOCK()
>> will wrap it to the "unlocked" state instead of locking. Also, if
>> more than two threads attempt to use it, a deadlock will occur if
>> two supposed owners execute a_fetch_add() concurrently in UNLOCK()
>> after a third thread has registered itself as a waiter, because the
>> lock will wrap to INT_MIN.
>>
>> Reordering the "libc.need_locks = -1" assignment and
>> UNLOCK(E->killlock) and providing a store barrier between them
>> should fix the issue.

> I think this all sounds correct. I'm not sure what you mean by a store
> barrier between them, since all lock and unlock operations are already
> full barriers.
>

Before sending the report I tried to infer the intended ordering
semantics of LOCK/UNLOCK by looking at their implementations. For
AArch64, I couldn't see why they would provide a full barrier (my
reasoning is below), so I concluded that acquire/release semantics was
probably intended in general, and suggested an extra store barrier to
prevent hoisting of the "libc.need_locks = -1" store, spelled after
UNLOCK(E->killlock), back into the critical section.

UNLOCK is implemented via a_fetch_add(). On AArch64, that is a simple
a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are
implemented via load-acquire/store-release instructions. Therefore, if
we consider a LOCK/UNLOCK critical section containing only plain loads
and stores, (a) any such memory access can be reordered with the
initial ldaxr in UNLOCK, and (b) any plain load following UNLOCK can be
reordered with the stlxr (assuming the processor predicts that the
stlxr succeeds), and further, due to (a), with any memory access inside
the critical section. Therefore, UNLOCK is not a full barrier. Is this
right?

As for a store following UNLOCK, I'm not sure. A plain store following
an stlxr can be reordered with it, but here that store is conditional
on the stlxr's success. So even if the processor predicts that the
stlxr succeeds, it can't make the following store visible before it is
sure that the stlxr did succeed. But I don't know whether the events
"the processor knows that the stlxr succeeded" and "the result of the
stlxr is globally visible" are separate, and if they are, which comes
first. Depending on the answer, UNLOCK either acts as a store barrier
or it doesn't.
> I'm still a little bit concerned though that there's no real model for
> synchronization of the access to modify need_locks though.
> Conceptually, the store of -1 from the exiting thread and the store of
> 0 in the surviving thread are not ordered with respect to each other
> except by an unsynchronized causality relationship.
>

I don't understand the implications of the lack of (stronger) ordering
of these two stores. Could you clarify?

> I suspect what "should be" happening is that, if we observe
> need_locks==-1 (as a relaxed atomic load in our memory model), we take
> the thread list lock (as the lock controlling write access to these
> values) which ensures that the exiting thread has exited, then confirm
> that we are the only thread left (are there async-signal ways without
> UB that another thread could be created in the mean time?) before
> setting it to 0. But this is a rather heavy operation. If there's an
> assurance that no other threads can be created in the middle of LOCK
> (which I think there should be), we could just use __tl_sync(), which
> is much lighter, to conclude that we've synchronized with being the
> only remaining thread and that unsynchronized access is okay.
>

LOCK is called either with application signals blocked or from
async-signal-unsafe functions, so interrupting it with a signal handler
that calls pthread_create() is not possible without UB. Therefore, I
don't see any issue with using __tl_sync() for synchronization with the
exiting thread (although, as I said above, I don't understand what
problem is being solved).

By the way, I noticed a suspicious use of "libc.need_locks" in fork():

	int need_locks = libc.need_locks > 0;

This is different from how LOCK() behaves: LOCK() will still perform
normal locking one last time if it sees -1, but here locking will be
skipped entirely. Is this intentional?

Thanks,
Alexey