From: Rich Felker
Date: Tue, 26 Aug 2014 13:32:18 -0400
To: musl@lists.openwall.com
Subject: Re: Multi-threaded performance progress
Message-ID: <20140826173218.GB12888@brightrain.aerifal.cx>
In-Reply-To: <1409070919.8054.47.camel@eris.loria.fr>

On Tue, Aug 26, 2014 at 06:35:19PM +0200, Jens Gustedt wrote:
> On Tuesday, 26.08.2014, 09:04 +0200, Jens Gustedt wrote:
> > On Monday, 25.08.2014, 23:43 -0400, Rich Felker wrote:
> > > This release cycle looks like it's going to be huge for
> > > multi-threaded performance issues. So far the cumulative
> > > improvement on my main development system, as measured by
> > > cond_bench.c by Timo Teräs, is from ~250k signals in 2 seconds up
> > > to ~3.7M signals in 2 seconds. That's comparable to what glibc
> > > gets on similar hardware with a cond var implementation that's
> > > much less correct. The improvements are a result of adding
> > > private futex support, redesigning the cond var implementation,
> > > and improvements to the spin-before-futex-wait behavior.
> >
> > Very impressive!
>
> I reviewed the new pthread_cond code closely and found it to be
> really rock solid.
>
> I have some minor things that might still improve things (or not).
> They make the code a bit longer, but they attempt to gain things
> here and there:
>
> - Tighten the lock on _c_lock such that the critical section
>   contains the least necessary.

Do you see any major opportunities for this? For the critical section
in pthread_cond_timedwait, a few operations could be moved out before
it, but they're all trivial assignments.

As for __private_cond_signal, it's currently necessary that all its
modifications to the list be made before either the cv lock or the
in-waiter-node barrier lock is released, because any waiters which win
the race and enter LEAVING status rather than SIGNALED status use the
cv lock to proceed. Perhaps this could be changed, though...
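For concreteness, the race described above looks roughly like the
sketch below. This is a simplified illustration, not musl's actual
code: the WAITING/SIGNALED/LEAVING states and the a_cas semantics
follow the description in this thread, but the struct layout, the
function names, and the a_cas stand-in are assumptions made for the
example.

/* Stand-in for musl's a_cas: atomic compare-and-swap returning the
 * old value. */
static inline int a_cas(volatile int *p, int t, int s)
{
        __atomic_compare_exchange_n(p, &t, s, 0,
                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        return t;
}

enum { WAITING, SIGNALED, LEAVING };

struct waiter {
        struct waiter *prev, *next;
        volatile int state;
        /* barrier, notify pointer, etc. omitted */
};

/* Waiter side, on timeout or cancellation: try to claim the node for
 * the "leaving" path. */
static int unwait_on_timeout(struct waiter *w)
{
        if (a_cas(&w->state, WAITING, LEAVING) == WAITING) {
                /* Won the race: this thread must take the cv lock and
                 * unlink its own node, which is why the signaler
                 * cannot release the lock while the list is still
                 * inconsistent. */
                return LEAVING;
        }
        /* Lost the race: the signaler set SIGNALED and owns the
         * unlinking; this waiter synchronizes on the per-node barrier
         * instead of the cv lock. */
        return SIGNALED;
}

/* Signaler side, per node it tries to claim: */
static int claim_for_signal(struct waiter *w)
{
        return a_cas(&w->state, WAITING, SIGNALED) == WAITING;
}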
> - Have all the update of the list of waiters done by the signaling
>   or broadcasting thread. This work is serialized by the lock on the
>   cv, anyhow, so let the main work be done by a thread that already
>   holds the lock and is scheduled.

The problem I ran into was that the unwait operation can be from
cancellation or timeout, in which case the waiter has to remove itself
from the list, and needs to obtain the cv lock to do this. And it's
not clear to me how the waiter can know that the signaling thread is
taking responsibility for removing it from the list without
synchronizing with the signaling thread like it does now. In any case
the costly synchronization here only happens on hopefully-very-rare
races.

> - In case of broadcast, work on head and tail of the list first.
>   These are the only ones that would change the _c_head and _c_tail
>   entries of the cv.

But we can't release the lock anyway until all waiter states have been
atomically cas'd, or at least doing so doesn't seem safe to me.

> - Try to reduce the number of futex calls. Threads that are leaving
>   don't have to regain the lock when there is known contention with
>   a signaler, now that the signaler is doing the main work in that
>   case.

How do they detect this contention? If they won the race and changed
state to LEAVING, they don't see the contention. If they lose the
race, they become SIGNALED, and thus take the barrier path rather than
the cv-lock path.

> Also only wake up the signaling thread at the end when it is known
> to be inside a futex call.

I think this could be achieved trivially by having ref start at 1
rather than 0, and having the signaling thread a_dec ref just before
going into the maybe-wait loop. Then the waiters won't send the futex
wake unless the signaler has already reached the a_dec, since they
won't see the hitting-zero step. (A rough sketch of this is at the end
of this message.)

> There are perhaps other possibilities, like doing some spinning in
> "lock" before going into __wait.

The __wait function has a built-in spin.

Rich
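Sketch referenced above, for the ref-counting idea: ref starts at 1
(the signaler's own reference), the signaler drops that reference just
before the maybe-wait loop, and a waiter only issues a futex wake when
its decrement is the one that hits zero. This is a hypothetical
illustration, not musl's code: a_fetch_add, futex_wait, futex_wake,
and the function names are stand-ins, and the counter is a global only
to keep the sketch short (in the real code it would live with the
signal operation and be reached through the waiter node).

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal stand-ins for the atomics and futex wrappers. */
static int a_fetch_add(volatile int *p, int v)
{
        return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}
static void futex_wait(volatile int *addr, int val)
{
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, 0, 0, 0);
}
static void futex_wake(volatile int *addr, int cnt)
{
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, cnt, 0, 0, 0);
}

static volatile int ref;

/* Signaler side: one reference per waiter it must wait for, plus one
 * reference of its own. */
static void signaler_finish(int nwaiters_to_wait_for)
{
        int cur;
        ref = 1 + nwaiters_to_wait_for;

        /* ... unlink/requeue/wake the waiters it claimed ... */

        /* Drop the signaler's own reference just before the maybe-wait
         * loop. A waiter's decrement can only hit zero after this
         * point, so no futex wake is sent unless the signaler may
         * actually be about to wait. */
        if (a_fetch_add(&ref, -1) == 1) return; /* hit zero, no wait */

        while ((cur = ref)) futex_wait(&ref, cur);
}

/* Waiter side: drop the reference once the signaler no longer needs
 * this waiter's node; only the decrement that hits zero wakes. */
static void waiter_drop_ref(void)
{
        if (a_fetch_add(&ref, -1) == 1)
                futex_wake(&ref, 1);
}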