From: Rich Felker
To: musl@lists.openwall.com
Subject: Re: spin strategy in __wait
Date: Sat, 29 Aug 2015 20:57:45 -0400

On Sat, Aug 29, 2015 at 09:14:17PM +0200, Jens Gustedt wrote:
> The question that I asked myself is how much spinning do we need in
> such lock/wait primitives. In the first attached graphic, the three
> bottom curves show what difference three spinning strategies can make
> for this test on a 2x2 hyperthreaded core machine. The bottom is no
> spinning at all, the next is the current strategy implemented in
> musl, and the third is a modification of that. As you can see, for a
> high load of threads they can make a substantial difference, but you
> also see that musl's actual strategy is not very different from doing
> no spinning at all.

That's intentional when you have waiters. Notice in your graph that
for the range 2-4 threads, performance in musl is comparable to your
"no-shortcut" approach. This shows that the spinning is working as
intended.

BTW I think the names you chose are misleading. :-) Refraining from
spinning is not a "shortcut" musl is taking. It's not trying to be
quicker, but to refrain from aggressive behavior that can be expected
to lead to pathological lock unfairness. I would call the approaches
aggressive-spin, opportunistic-spin, and no-spin.

> The "new" strategy is simply to avoid taking the shortcut that the
> current code takes when spinning. Currently, when we detect that
> there are already waiters, we stop spinning and try to go into
> futex_wait immediately. While this sounds nice at first glance, the
> figures seem to indicate that it is not a good idea and that we would
> be better off without it. I'll send a patch for that in a following
> mail, and also one that lets you modify the number of spins easily.

I don't think you can say we'd be better off in general based on a
single test. In areas like this you need to be able to demonstrate
that there are no cases where the expected pathologically bad behavior
can happen.
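For reference, the loop in __wait currently looks roughly like this
(a simplified sketch; the real code also falls back to a non-private
futex call on old kernels):

	void __wait(volatile int *addr, volatile int *waiters,
	            int val, int priv)
	{
		int spins = 100;
		if (priv) priv = FUTEX_PRIVATE;
		/* Spin only while nobody is already waiting; this
		 * is the "shortcut" under discussion. */
		while (spins-- && (!waiters || !*waiters)) {
			if (*addr==val) a_spin();
			else return;
		}
		if (waiters) a_inc(waiters);
		while (*addr==val)
			__syscall(SYS_futex, addr,
			          FUTEX_WAIT|priv, val, 0);
		if (waiters) a_dec(waiters);
	}

Your change amounts to dropping the (!waiters || !*waiters) condition,
so that the full spin count is always used even when other threads are
already queued in the kernel.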
I suspect you get lucky with malloc because there's a fair amount of
code being executed between successive attempts on the lock rather
than intense hammering. On the other hand, are you even counting the
number of times each individual thread performs a malloc/free cycle,
or only the total across all threads? Is it possible that, among your
256 threads, the vast majority of malloc/free cycles are being
performed by a small number of threads while the rest are stuck
repeating futex_wait over and over? (A sketch of the kind of
per-thread accounting I mean is appended after my signature.)

> Now the only situation I thought of where this could be important is
> monoprocessors, where spinning might not be so good and aborting it
> early could be necessary. So I ran the same test with taskset to
> nail the process to just one core. The results of that are the top
> two curves. As you can see, here the spinning strategy has almost no
> influence, so I think we are safe to apply this patch.

This is very surprising to me, which further makes me doubt the test
methodology. The current number of spins is tuned to be comparable in
magnitude to the syscall cost, at least on x86. Adding this extra cost
to each contended lock operation (with the shortcut removed, you
basically never stop spinning early on single-core) should produce a
measurable difference.

> Now all of this can also be read as a performance test of the malloc
> subsystem, and here my feeling goes in the direction that Rich
> recently indicated. The performance of the "application" is much
> better if I eliminate all parallelism. As an additional indication
> there are two additional curves that fix the process to one core and
> its hyperthreaded sibling.
>
> So maybe we might be better off with a simpler malloc strategy that
> serializes all requests.

I tried this a few weeks ago and it performed significantly worse on
my malloc stress tests. And it will get even worse as we try to scale
to larger numbers of cores, or worse yet NUMA...

Rich
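P.S. Something along these lines (a hypothetical harness, not your
actual test code) would show whether the work is evenly distributed
across threads or concentrated in a few:

	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define NTHREADS 256

	/* One counter per thread; false sharing will perturb the
	 * absolute numbers, but the distribution is what matters. */
	static long counts[NTHREADS];
	static volatile int done;

	static void *worker(void *arg)
	{
		long i = (long)arg;
		while (!done) {
			free(malloc(64));
			counts[i]++;
		}
		return 0;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];
		long i;
		for (i=0; i<NTHREADS; i++)
			pthread_create(t+i, 0, worker, (void *)i);
		sleep(5);
		done = 1;
		for (i=0; i<NTHREADS; i++)
			pthread_join(t[i], 0);
		for (i=0; i<NTHREADS; i++)
			printf("thread %ld: %ld cycles\n", i, counts[i]);
		return 0;
	}

If the per-thread counts are wildly skewed, the aggregate throughput
says more about lock unfairness than about the spin strategy.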