From: Rich Felker
Date: Tue, 26 Aug 2014 13:32:18 -0400
To: musl@lists.openwall.com
Subject: Re: Multi-threaded performance progress
Message-ID: <20140826173218.GB12888@brightrain.aerifal.cx>
In-Reply-To: <1409070919.8054.47.camel@eris.loria.fr>

On Tue, Aug 26, 2014 at 06:35:19PM +0200, Jens Gustedt wrote:
> On Tuesday, 26.08.2014, 09:04 +0200, Jens Gustedt wrote:
> > On Monday, 25.08.2014, 23:43 -0400, Rich Felker wrote:
> > > This release cycle looks like it's going to be huge for
> > > multi-threaded performance issues. So far the cumulative
> > > improvement on my main development system, as measured by
> > > cond_bench.c by Timo Teräs, is from ~250k signals in 2 seconds up
> > > to ~3.7M signals in 2 seconds. That's comparable to what glibc
> > > gets on similar hardware with a cond var implementation that's
> > > much less correct. The improvements are a result of adding
> > > private futex support, redesigning the cond var implementation,
> > > and improvements to the spin-before-futex-wait behavior.
> >
> > Very impressive!
>
> I reviewed the new pthread_cond code closely and found it to be
> really rock solid.
>
> I have some minor things that might still improve things (or not).
> They make the code a bit longer, but they attempt to gain things
> here and there:
>
> - Tighten the lock on _c_lock such that the critical section
>   contains the least necessary.

Do you see any major opportunities for this? For the critical section
in pthread_cond_timedwait, a few operations could be moved out before
it, but they're all trivial assignments.

As for __private_cond_signal, it's currently necessary that all its
modifications to the list be made before either the cv lock or the
in-waiter-node barrier lock is released, because any waiters which win
the race and enter LEAVING status rather than SIGNALED status use the
cv lock to proceed. Perhaps this could be changed, though...
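For concreteness, the race described above looks roughly like the
sketch below. This is a simplified illustration, not musl's actual
code: the WAITING/SIGNALED/LEAVING states and the a_cas semantics
follow the description in this thread, but the struct layout, the
function names, and the a_cas stand-in are assumptions made for the
example.

/* Stand-in for musl's a_cas: atomic compare-and-swap returning the
 * old value. */
static inline int a_cas(volatile int *p, int t, int s)
{
        __atomic_compare_exchange_n(p, &t, s, 0,
                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        return t;
}

enum { WAITING, SIGNALED, LEAVING };

struct waiter {
        struct waiter *prev, *next;
        volatile int state;
        /* barrier, notify pointer, etc. omitted */
};

/* Waiter side, on timeout or cancellation: try to claim the node for
 * the "leaving" path. */
static int unwait_on_timeout(struct waiter *w)
{
        if (a_cas(&w->state, WAITING, LEAVING) == WAITING) {
                /* Won the race: this thread must take the cv lock and
                 * unlink its own node, which is why the signaler
                 * cannot release the lock while the list is still
                 * inconsistent. */
                return LEAVING;
        }
        /* Lost the race: the signaler set SIGNALED and owns the
         * unlinking; this waiter synchronizes on the per-node barrier
         * instead of the cv lock. */
        return SIGNALED;
}

/* Signaler side, per node it tries to claim: */
static int claim_for_signal(struct waiter *w)
{
        return a_cas(&w->state, WAITING, SIGNALED) == WAITING;
}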
> - Have all the update of the list of waiters done by the signaling
>   or broadcasting thread. This work is serialized by the lock on the
>   cv, anyhow, so let the main work be done by a thread that already
>   holds the lock and is scheduled.

The problem I ran into was that the unwait operation can be from
cancellation or timeout, in which case the waiter has to remove itself
from the list, and needs to obtain the cv lock to do this. And it's
not clear to me how the waiter can know that the signaling thread is
taking responsibility for removing it from the list without
synchronizing with the signaling thread like it does now. In any case
the costly synchronization here only happens on hopefully-very-rare
races.

> - In case of broadcast, work on head and tail of the list first.
>   These are the only ones that would change the _c_head and _c_tail
>   entries of the cv.

But we can't release the lock anyway until all waiter states have been
atomically cas'd, or at least doing so doesn't seem safe to me.

> - Try to reduce the number of futex calls. Threads that are leaving
>   don't have to regain the lock when there is known contention with
>   a signaler, now that the signaler is doing the main work in that
>   case.

How do they detect this contention? If they won the race and changed
state to LEAVING, they don't see the contention. If they lose the
race, they become SIGNALED, and thus take the barrier path rather than
the cv-lock path.

> Also only wake up the signaling thread at the end when it is known
> to be inside a futex call.

I think this could be achieved trivially by having ref start at 1
rather than 0, and having the signaling thread a_dec ref just before
going into the maybe-wait loop. Then the waiters won't send the futex
wake unless the signaler has already reached the a_dec, since they
won't see the hitting-zero step. (A rough sketch of this is at the end
of this message.)

> There are perhaps other possibilities, like doing some spinning in
> "lock" before going into __wait.

The __wait function has a built-in spin.

Rich
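Sketch referenced above, for the ref-counting idea: ref starts at 1
(the signaler's own reference), the signaler drops that reference just
before the maybe-wait loop, and a waiter only issues a futex wake when
its decrement is the one that hits zero. This is a hypothetical
illustration, not musl's code: a_fetch_add, futex_wait, futex_wake,
and the function names are stand-ins, and the counter is a global only
to keep the sketch short (in the real code it would live with the
signal operation and be reached through the waiter node).

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal stand-ins for the atomics and futex wrappers. */
static int a_fetch_add(volatile int *p, int v)
{
        return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}
static void futex_wait(volatile int *addr, int val)
{
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, 0, 0, 0);
}
static void futex_wake(volatile int *addr, int cnt)
{
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, cnt, 0, 0, 0);
}

static volatile int ref;

/* Signaler side: one reference per waiter it must wait for, plus one
 * reference of its own. */
static void signaler_finish(int nwaiters_to_wait_for)
{
        int cur;
        ref = 1 + nwaiters_to_wait_for;

        /* ... unlink/requeue/wake the waiters it claimed ... */

        /* Drop the signaler's own reference just before the maybe-wait
         * loop. A waiter's decrement can only hit zero after this
         * point, so no futex wake is sent unless the signaler may
         * actually be about to wait. */
        if (a_fetch_add(&ref, -1) == 1) return; /* hit zero, no wait */

        while ((cur = ref)) futex_wait(&ref, cur);
}

/* Waiter side: drop the reference once the signaler no longer needs
 * this waiter's node; only the decrement that hits zero wakes. */
static void waiter_drop_ref(void)
{
        if (a_fetch_add(&ref, -1) == 1)
                futex_wake(&ref, 1);
}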