From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: New optimized normal-type mutex?
Date: Thu, 30 Jul 2015 09:45:19 -0400
Message-ID: <20150730134519.GB16376@brightrain.aerifal.cx>
In-Reply-To: <1438243654.10742.9.camel@inria.fr>
To: musl@lists.openwall.com
Reply-To: musl@lists.openwall.com

On Thu, Jul 30, 2015 at 10:07:34AM +0200, Jens Gustedt wrote:
> On Wednesday, 29.07.2015, 20:10 -0400, Rich Felker wrote:
> > On Thu, Jul 30, 2015 at 01:49:20AM
> > +0200, Jens Gustedt wrote:
> > > Hm, could you be more specific about where this hurts?
> > >
> > > In the code I have there is
> > >
> > >     for (;val & lockbit;) {
> > >         __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
> > >         val = atomic_load_explicit(loc, memory_order_consume);
> > >     }
> > >
> > > so this should be robust against spurious wakeups, no?
> >
> > The problem is that futex_wait returns immediately with EAGAIN if
> > *loc!=val, which happens very often if *loc is incremented or
> > otherwise changed on each arriving waiter.
>
> Yes, sure, it may change. Whether or not this happens often may
> depend; I don't think we can easily make a quantitative statement
> here.

The same thing happened with the obvious implementation of cond vars:
having each waiter store its tid, then wait on *val==tid with
futex_wait. The ratio of spurious wakes to intended wakes was something
like 10:1, so this is not a purely theoretical problem.

> In the case of atomics the critical section is extremely short, and
> the count, if it varies so much, should have stabilized a bit during
> the spinlock phase before coming to the futex part. That futex part
> is really only a last resort for the rare case that the thread that
> is holding the lock has been descheduled in the middle.

Spinning is near-useless whenever you have more contenders than cores.
It's an optimization for a special case (fewer contenders than cores),
but you don't want to rely on it.

> My current test case is having X threads hammer on one single
> location, X being up to some hundred. On my 2x2 hyperthreaded CPU,
> for a reasonable number of threads (X = 16 or 32) I see an overall
> performance improvement of about 30% when using my version of the
> lock instead of the original musl one. The point of inversion, where
> the original musl lock becomes better, is at about 200 threads.

Interesting. I suspect this would change quickly if you increased the
amount of work done with the lock held (e.g. to more closely mimic
malloc).
The nasty thing is that, as soon as you _do_ cause futex_wait to start
happening rather than just spinning, performance blows up, because each
thread that needs to wait calls futex_wait several times before it
succeeds, taking up cpu time that the thread holding the lock could
have used to run.

Rich