From: Rich Felker
To: musl@lists.openwall.com
Subject: New optimized normal-type mutex?
Date: Thu, 21 May 2015 19:44:02 -0400

I realized (thanks to conversations with wm4) that it's possible to
optimize normal-type mutexes a lot, eliminating spurious futex wakes
and reducing the number of atomic ops/barriers under contention.

The concept is to store the full mutex state in a single int:

bits 0-29: waiter count
bit 30: waiter flag
bit 31: lock bit

Then unlock is:

old = a_fetch_and(p, 0x3fffffff); // new op; clears lock bit and waiter flag
if (old & 0x40000000) wake(p,1);  // a waiter was present; wake one

Trylock is:

old = *p;
if (old < 0) {
    // lock bit was set; barrier and reread in case it was just released
    a_barrier();
    old = *p;
}
if (old >= 0 && a_cas(p, old, old|INT_MIN)==old) return 0;

Lock is:

while (spins--) {
    if (!trylock()) return 0;
    a_spin();
}
while ((old = *p) < 0) {
    // register as a waiter: bump the count and set the waiter flag
    new = (old+1) | 0x40000000;
    if (a_cas(p, old, new)==old) break;
}
for (;;) {
    while ((old = *p) < 0) {
        futex_wait(p, old);
    }
    new = old-1;
    if (new) new |= 0x40000000; // keep waiter flag set if other waiters remain
    new |= INT_MIN;             // take the lock
    if (a_cas(p, old, new)==old) return 0;
}

The waiter flag marks that unlocking needs to send a wake. It is set
whenever a new waiter arrives, and when a waiter that consumed a wake
observes that there are more waiters left to be woken. The flag is
notably clear when the same thread which just released the mutex
happens to obtain it again before an existing waiter does (when
hammering the mutex); this is where we should see huge savings, since
the subsequent unlock(s) will not produce additional futex wakes, which
(1) incur syscall overhead and (2) cause waiters which cannot yet get
the mutex to wake up, try to lock it, and go back to waiting again.
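Spelling the bit assignments out as named constants may make the
pseudocode easier to follow; this is just a sketch, the macro names
are arbitrary, and nothing above depends on them. The unlock step is
restated in terms of them as a usage example:

#include <limits.h>

/* hypothetical names for the proposed state-word layout */
#define MTX_WAITER_MASK 0x3fffffff /* bits 0-29: waiter count */
#define MTX_WAITER_FLAG 0x40000000 /* bit 30: unlock must send a wake */
#define MTX_LOCK_BIT    INT_MIN    /* bit 31: lock held; word reads negative */

/* unlock, restated: keep only the waiter count, then wake one waiter
   if the flag was set */
old = a_fetch_and(p, MTX_WAITER_MASK);
if (old & MTX_WAITER_FLAG) wake(p, 1);

Putting the lock bit in the sign bit is what lets trylock and lock
test "is it locked?" with a plain old < 0 comparison.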
Further:

The code path for unlock has exactly one atomic.

The code path for trylock has at least one barrier and at most one
atomic (only on success), but may incur a spurious extra barrier if a
stale value was read first and the operation subsequently succeeds.

The code path for lock under contention (trylock failed, and thus
performed no atomic writes, only barriers/spins) contains one atomic
cas to set the waiter state (versus 2 in the current musl
implementation), whatever the kernel does in futex, and one more
atomic cas to finally take the lock (versus 2 in current musl).

Are these changes worthwhile, and worth having additional custom code
for the normal-type mutex? I'm not sure. We should probably do an
out-of-tree test implementation to measure the differences, then
consider integrating it into musl if it helps.

If it does look helpful for musl, it may make sense to use this code
as the implementation for __lock/__unlock (the implementation-internal
locks), thereby shrinking them all from int[2] to int[1], and then
have the pthread mutex code tail-call to these functions for the
normal-mutex case.

Even if it's not a performance win for musl, the above design may be
very nice as a third-party mutex library, since it handles
self-synchronized destruction, minimizes spurious futex wakes, and
fits the whole lock in a single int.

Rich
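As a starting point for such an out-of-tree test, here is a minimal
sketch of the scaffolding it might use, assuming Linux futexes and
GCC-style __atomic builtins standing in for musl's a_* primitives;
the wrapper names are illustrative, not anything that exists in musl:

#define _GNU_SOURCE
#include <limits.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* musl-style atomics: a_cas returns the old value of *p */
static inline int a_cas(volatile int *p, int t, int s)
{
    __atomic_compare_exchange_n(p, &t, s, 0,
        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return t;
}

static inline int a_fetch_and(volatile int *p, int v)
{
    return __atomic_fetch_and(p, v, __ATOMIC_SEQ_CST);
}

static inline void a_barrier(void)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

static inline void futex_wait(volatile int *p, int val)
{
    syscall(SYS_futex, p, FUTEX_WAIT_PRIVATE, val, 0, 0, 0);
}

static inline void wake(volatile int *p, int n)
{
    syscall(SYS_futex, p, FUTEX_WAKE_PRIVATE, n, 0, 0, 0);
}

/* unlock and trylock from the sketch above, written out as functions;
   lock would be filled in from the pseudocode the same way */
void mtx_unlock(volatile int *p)
{
    int old = a_fetch_and(p, 0x3fffffff); /* clear lock bit and waiter flag */
    if (old & 0x40000000) wake(p, 1);
}

int mtx_trylock(volatile int *p)
{
    int old = *p;
    if (old < 0) {
        a_barrier();
        old = *p; /* reread; the lock may have just been released */
    }
    if (old >= 0 && a_cas(p, old, old|INT_MIN) == old) return 0; /* got it */
    return 1; /* busy */
}

Keeping a_cas's return-the-old-value convention means the ==old checks
in the pseudocode carry over unchanged.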