mailing list of musl libc
* [PATCH 1/2] let them spin
@ 2015-08-29  8:50 Jens Gustedt
  2015-08-29 17:16 ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Gustedt @ 2015-08-29  8:50 UTC (permalink / raw)
  To: musl

Remove a test in __wait that checked whether other threads had already
attempted to go to sleep in futex_wait.

This has no impact on the fast path. But, contrary to what one might
think at first glance, the test slows things down when there is
congestion.

Applying this patch shows no difference in behavior in a mono-core
setting, so this shortcut seems to be simply superfluous.
---
 src/thread/__wait.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/thread/__wait.c b/src/thread/__wait.c
index dc33c1a..a93d41d 100644
--- a/src/thread/__wait.c
+++ b/src/thread/__wait.c
@@ -4,7 +4,7 @@ void __wait(volatile int *addr, volatile int *waiters, int val, int priv)
 {
 	int spins=100;
 	if (priv) priv = FUTEX_PRIVATE;
-	while (spins-- && (!waiters || !*waiters)) {
+	while (spins--) {
 		if (*addr==val) a_spin();
 		else return;
 	}
-- 
2.1.4
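
For reference, with the patch applied the whole of __wait reads roughly
as follows. The spin phase is taken from the hunk above; the waiter
bookkeeping and the futex retry loop are reconstructed from the
discussion in this thread and may differ in detail from the actual
src/thread/__wait.c:

	#include "pthread_impl.h"

	void __wait(volatile int *addr, volatile int *waiters, int val, int priv)
	{
		int spins=100;
		if (priv) priv = FUTEX_PRIVATE;
		/* Spin phase: with this patch, spin regardless of whether
		   other threads are already sleeping in futex_wait. */
		while (spins--) {
			if (*addr==val) a_spin();
			else return;
		}
		/* Sleep phase (reconstructed): register as a waiter, then
		   keep calling futex_wait until *addr changes. */
		if (waiters) a_inc(waiters);
		while (*addr==val) {
			__syscall(SYS_futex, addr, FUTEX_WAIT|priv, val, 0) != -ENOSYS
			|| __syscall(SYS_futex, addr, FUTEX_WAIT, val, 0);
		}
		if (waiters) a_dec(waiters);
	}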




* Re: [PATCH 1/2] let them spin
  2015-08-29  8:50 [PATCH 1/2] let them spin Jens Gustedt
@ 2015-08-29 17:16 ` Rich Felker
  2015-08-29 19:38   ` Jens Gustedt
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2015-08-29 17:16 UTC (permalink / raw)
  To: musl

On Sat, Aug 29, 2015 at 10:50:44AM +0200, Jens Gustedt wrote:
> Remove a test in __wait that checked whether other threads had already
> attempted to go to sleep in futex_wait.
> 
> This has no impact on the fast path. But, contrary to what one might
> think at first glance, the test slows things down when there is
> congestion.
> 
> Applying this patch shows no difference in behavior in a mono-core
> setting, so this shortcut seems to be simply superfluous.

The purpose of this code is twofold: improving fairness of the lock
and avoiding burning cpu time that's _known_ to be a waste.

If you spin on a lock that already has waiters, the thread that spins
has a much better chance of getting the lock than any of the existing
waiters, which are in futex_wait. Assuming there are enough cores that
no thread that isn't sleeping gets preempted, the spinning thread is
basically guaranteed to get the lock unless it spins so long that it
ends up in futex_wait itself. This is simply because returning from
futex_wait after being woken (which all the other waiters have to do)
takes a lot more time than one spin. I suspect there are common loads
under which many of the waiters will NEVER get the lock.

The other issue is simply wasted cpu time. Unlike in the no-waiters
case where spinning _can_ save a lot of cpu time (no futex_wait or
futex_wake syscall), spinning in the with-waiters case always wastes
cpu time. If the spinning thread gets the lock when the old owner
unlocks it, the old owner has already determined that it has to send a
futex_wake. So now you have (1) the wasted futex_wake syscall in the
old owner thread and (2) a "spurious" wake in one of the waiters,
which wakes up only to find that the lock was already stolen, and
therefore has to call futex_wait again and start the whole process
over.
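
Concretely, the unlock path in question looks roughly like this (a
sketch of the 2015-era internal lock from memory, not a verbatim copy
of src/thread/__lock.c; l[0] is the lock word and l[1] is the waiter
count that __wait maintains):

	void __unlock(volatile int *l)
	{
		if (l[0]) {
			a_store(l, 0);
			/* If any thread is recorded as sleeping, a futex_wake
			   syscall is unavoidable -- even when a spinning thread
			   then steals the lock before the woken waiter runs. */
			if (l[1]) __wake(l, 1, 1);
		}
	}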

Rich



* Re: [PATCH 1/2] let them spin
  2015-08-29 17:16 ` Rich Felker
@ 2015-08-29 19:38   ` Jens Gustedt
  2015-08-30  0:39     ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Gustedt @ 2015-08-29 19:38 UTC (permalink / raw)
  To: musl


Rich,
I hope that with the graphics at hand you'll understand a bit better
what I was up to. Again, my apologies.

Still, let me try to answer some of the arguments you make below:

On Saturday, 29.08.2015, 13:16 -0400, Rich Felker wrote:
> On Sat, Aug 29, 2015 at 10:50:44AM +0200, Jens Gustedt wrote:
> > Remove a test in __wait that checked whether other threads had already
> > attempted to go to sleep in futex_wait.
> > 
> > This has no impact on the fast path. But, contrary to what one might
> > think at first glance, the test slows things down when there is
> > congestion.
> > 
> > Applying this patch shows no difference in behavior in a mono-core
> > setting, so this shortcut seems to be simply superfluous.
> 
> The purpose of this code is twofold: improving fairness of the lock
> and avoiding burning cpu time that's _known_ to be a waste.
> 
> If you spin on a lock that already has waiters, the thread that spins
> has a much better chance of getting the lock than any of the existing
> waiters, which are in futex_wait. Assuming there are enough cores that
> no thread that isn't sleeping gets preempted, the spinning thread is
> basically guaranteed to get the lock unless it spins so long that it
> ends up in futex_wait itself. This is simply because returning from
> futex_wait after being woken (which all the other waiters have to do)
> takes a lot more time than one spin. I suspect there are common loads
> under which many of the waiters will NEVER get the lock.

Yes and no. I ran some benchmarks to learn a bit more. On my machine
one iteration of a spin loop is just about 10 times faster than a
failed call to futex_wait.
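
That comparison can be reproduced with a microbenchmark along the
following lines (a hypothetical harness, not the one actually used for
these numbers; the failed futex_wait is provoked by passing an expected
value that never matches, and the spin body is approximated by a
volatile re-read plus a compiler barrier rather than musl's a_spin):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/futex.h>

	static long long ns(void)
	{
		struct timespec ts;
		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec*1000000000LL + ts.tv_nsec;
	}

	int main(void)
	{
		volatile int word = 0;
		enum { N = 1000000 };
		long long t0, t1;

		/* One spin iteration: re-read the lock word, nothing else. */
		t0 = ns();
		for (int i = 0; i < N; i++)
			if (word == 0) __asm__ __volatile__ ("" ::: "memory");
		t1 = ns();
		printf("spin iteration:    %6.1f ns\n", (double)(t1-t0)/N);

		/* One failed futex_wait: the expected value 1 never matches
		   the word, so the kernel returns EAGAIN without sleeping. */
		t0 = ns();
		for (int i = 0; i < N; i++)
			syscall(SYS_futex, &word, FUTEX_WAIT_PRIVATE, 1, (void *)0);
		t1 = ns();
		printf("failed futex_wait: %6.1f ns\n", (double)(t1-t0)/N);
		return 0;
	}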

Even for the current strategy, one of the futex_waiting threads gets
woken up and gets his chance with a new spinning phase.

So the difference isn't dramatic, just one order of magnitude and
everybody gets his chance. These chances are not equal, sure, but
NEVER in capitals is certainly a big word.

On the other hand, the difference in throughput between the different
spin versions in the multi-core setting is dramatic for malloc, I
find.

> The other issue is simply wasted cpu time. Unlike in the no-waiters
> case where spinning _can_ save a lot of cpu time (no futex_wait or
> futex_wake syscall), spinning in the with-waiters case always wastes
> cpu time. If the spinning thread gets the lock when the old owner
> unlocks it, the old owner has already determined that it has to send a
> futex_wake. So now you have (1) the wasted futex_wake syscall in the
> old owner thread and (2) a "spurious" wake in one of the waiters,
> which wakes up only to find that the lock was already stolen, and
> therefore has to call futex_wait again and start the whole process
> over.

This wasted CPU time was one of my worries at first, which is why I
made the mono-core test. Basically, on a mono-core machine spinning is
*always* a waste, and going into futex_wait immediately should always
be a win. (BTW, the current implementation should have a performance
gap for exactly 2 threads that alternate on the lock.)

But there is almost no difference between the two strategies; even
here the bias goes a tiny bit towards doing some spinning, even under
very high load. So, no, I don't think there is much wasted CPU time.
At least, performance seems to be largely dominated by other things,
so "spinning" shouldn't be a worry in the current state of things.


Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::







* Re: [PATCH 1/2] let them spin
  2015-08-29 19:38   ` Jens Gustedt
@ 2015-08-30  0:39     ` Rich Felker
  2015-08-30  8:41       ` Jens Gustedt
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2015-08-30  0:39 UTC (permalink / raw)
  To: musl

On Sat, Aug 29, 2015 at 09:38:30PM +0200, Jens Gustedt wrote:
> On Saturday, 29.08.2015, 13:16 -0400, Rich Felker wrote:
> > On Sat, Aug 29, 2015 at 10:50:44AM +0200, Jens Gustedt wrote:
> > > Remove a test in __wait that checked whether other threads had already
> > > attempted to go to sleep in futex_wait.
> > > 
> > > This has no impact on the fast path. But, contrary to what one might
> > > think at first glance, the test slows things down when there is
> > > congestion.
> > > 
> > > Applying this patch shows no difference in behavior in a mono-core
> > > setting, so this shortcut seems to be simply superfluous.
> > 
> > The purpose of this code is twofold: improving fairness of the lock
> > and avoiding burning cpu time that's _known_ to be a waste.
> > 
> > If you spin on a lock that already has waiters, the thread that spins
> > has a much better chance of getting the lock than any of the existing
> > waiters, which are in futex_wait. Assuming there are enough cores that
> > no thread that isn't sleeping gets preempted, the spinning thread is
> > basically guaranteed to get the lock unless it spins so long that it
> > ends up in futex_wait itself. This is simply because returning from
> > futex_wait after being woken (which all the other waiters have to do)
> > takes a lot more time than one spin. I suspect there are common loads
> > under which many of the waiters will NEVER get the lock.
> 
> Yes and no. I ran some benchmarks to learn a bit more. On my machine
> one iteration of a spin loop is just about 10 times faster than a
> failed call to futex_wait.

So this means that a non-preempted spinning thread will always get the
lock before a thread returning from futex_wait, as I predicted.

> Even for the current strategy, one of the futex_waiting threads gets
> woken up and gets his chance with a new spinning phase.

In the current musl code, threads that futex_wait do not go back to
spinning except in a rare race. __wait keeps repeating the futex_wait
syscall as long as *addr==val. Repeating the spinning would let them
be more aggressive about getting the lock against other threads
contending for it, but it burns a lot of cpu and gives up all the
fairness that you would otherwise get from the futex queue.

> So the difference isn't dramatic, just one order of magnitude and
> everybody gets his chance. These chances are not equal, sure, but
> NEVER in capitals is certainly a big word.

Try this: on a machine with at least 3 physical cores, 3 threads
hammer on the same lock, counting the number of times they succeed in
taking it. Once any one thread has taken it at least 10 million times
or so, stop and print the counts. With your spin strategy I would
expect to see 2 threads with counts near 10 million and one thread
with a count in the hundreds or less, maybe even a single-digit count.
With the current behavior (never spinning if there's a waiter) I would
expect all 3 counts to be similar.
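
A sketch of such a test could look like the following; pthread_mutex_t
stands in here for musl's internal lock (an assumption made for the
sake of a self-contained example -- the real experiment would hammer
__lock/__unlock or a lock built on __wait directly):

	#include <pthread.h>
	#include <stdio.h>
	#include <stdatomic.h>

	#define NTHREADS 3
	#define TARGET   10000000L

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static long counts[NTHREADS];
	static atomic_int done;

	/* Each thread takes the lock as fast as it can and counts its
	   successful acquisitions; the first to reach TARGET stops the run. */
	static void *hammer(void *arg)
	{
		long *count = arg;
		while (!atomic_load_explicit(&done, memory_order_relaxed)) {
			pthread_mutex_lock(&lock);
			if (++*count == TARGET)
				atomic_store_explicit(&done, 1, memory_order_relaxed);
			pthread_mutex_unlock(&lock);
		}
		return 0;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];
		for (int i = 0; i < NTHREADS; i++)
			pthread_create(t+i, 0, hammer, counts+i);
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(t[i], 0);
		for (int i = 0; i < NTHREADS; i++)
			printf("thread %d: %ld acquisitions\n", i, counts[i]);
		return 0;
	}

Comparing the three printed counts then shows how evenly (or unevenly)
the lock was handed out under the two spin strategies.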

> On the other hand, the difference in throughput between the different
> spin versions in the multi-core setting is dramatic for malloc, I
> find.

The problem we're dealing with is pathological bad cases, not
throughput in the lucky cases.

Rich



* Re: [PATCH 1/2] let them spin
  2015-08-30  0:39     ` Rich Felker
@ 2015-08-30  8:41       ` Jens Gustedt
  2015-08-30 17:02         ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Jens Gustedt @ 2015-08-30  8:41 UTC (permalink / raw)
  To: musl


On Saturday, 29.08.2015, 20:39 -0400, Rich Felker wrote:
> On Sat, Aug 29, 2015 at 09:38:30PM +0200, Jens Gustedt wrote:
> > On Saturday, 29.08.2015, 13:16 -0400, Rich Felker wrote:
> > > On Sat, Aug 29, 2015 at 10:50:44AM +0200, Jens Gustedt wrote:
> > > > Remove a test in __wait that checked whether other threads had already
> > > > attempted to go to sleep in futex_wait.
> > > > 
> > > > This has no impact on the fast path. But, contrary to what one might
> > > > think at first glance, the test slows things down when there is
> > > > congestion.
> > > > 
> > > > Applying this patch shows no difference in behavior in a mono-core
> > > > setting, so this shortcut seems to be simply superfluous.
> > > 
> > > The purpose of this code is twofold: improving fairness of the lock
> > > and avoiding burning cpu time that's _known_ to be a waste.
> > > 
> > > If you spin on a lock that already has waiters, the thread that spins
> > > has a much better chance of getting the lock than any of the existing
> > > waiters, which are in futex_wait. Assuming there are enough cores that
> > > no thread that isn't sleeping gets preempted, the spinning thread is
> > > basically guaranteed to get the lock unless it spins so long that it
> > > ends up in futex_wait itself. This is simply because returning from
> > > futex_wait after being woken (which all the other waiters have to do)
> > > takes a lot more time than one spin. I suspect there are common loads
> > > under which many of the waiters will NEVER get the lock.
> > 
> > Yes and no. I ran some benchmarks to learn a bit more. On my machine
> > one iteration of a spin loop is just about 10 times faster than a
> > failed call to futex_wait.
> 
> So this means that a non-preempted spinning thread will always get the
> lock before a thread returning from futex_wait, as I predicted.

Again, yes and no. You are neglecting the fact that in the current
implementation there is a "trylock" in __lock before entering __wait.
So even today, when a lot of threads arrive, there is a fair
probability that a thread that has just been woken up will not be able
to obtain its "dedicated" slot, the one that gave rise to its wakeup.

With musl's current strategy, such a woken-up thread goes back into
futex_wait; with the changed strategy it would get a chance at the
next slot. So I would actually expect the current waiting-time
distribution to have a very long tail, even longer than with my
modified strategy.

It is very difficult to get a handle on these probabilities; they are
hard to measure without influencing the measurement. And I find
neither your arguments nor mine completely convincing :)
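
For reference, the "trylock" I mean is the atomic swap in the lock
loop; roughly (sketched from memory, so details may differ from the
actual src/thread/__lock.c):

	/* Every iteration first tries to grab the lock with a_swap (the
	   "trylock") and only enters __wait -- spin, then futex_wait --
	   if that fails.  A thread woken from futex_wait re-runs the
	   a_swap and can lose it to a freshly arriving or spinning
	   thread. */
	void __lock(volatile int *l)
	{
		if (libc.threads_minus_1)
			while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
	}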

> > Even for the current strategy, one of the futex_waiting threads gets
> > woken up and gets his chance with a new spinning phase.
> 
> In the current musl code, threads that futex_wait do not go back to
> spinning except in a rare race. __wait keeps repeating the futex_wait
> syscall as long as *addr==val. Repeating the spinning would let them
> be more aggressive about getting the lock against other threads
> contending for it, but it burns a lot of cpu and gives up all the
> fairness that you would otherwise get from the futex queue.

In a situation where the lock changes hands often, the futex_wait loop
also burns CPU. One failed futex_wait costs as much as ten spin
iterations in my measurements and setup.

> > So the difference isn't dramatic, just one order of magnitude and
> > everybody gets his chance. These chances are not equal, sure, but
> > NEVER in capitals is certainly a big word.
> 
> Try this: on a machine with at least 3 physical cores, 3 threads
> hammer on the same lock, counting the number of times they succeed in
> taking it. Once any one thread has taken it at least 10 million times
> or so, stop and print the counts. With your spin strategy I would
> expect to see 2 threads with counts near 10 million and one thread
> with a count in the hundreds or less, maybe even a single-digit count.
> With the current behavior (never spinning if there's a waiter) I would
> expect all 3 counts to be similar.

The setting that you describe is really a pathological one, where the
threads don't do any work between taking the lock and releasing it. Do
I understand that correctly?

I think a test where there is at least some "work" done in the
critical section would be more reasonable. And for such a setting I
doubt that we would observe such a behavior.

I am currently on the road so I don't have such a machine at hand. I
will try next week. I think it should be relatively simple for my test
to also compute statistics comparing the threads of each run, getting
min, max and standard deviation, so I'll do that.

> > On the other hand, the difference in throughput between the different
> > spin versions in the multi-core setting is dramatic for malloc, I
> > find.
> 
> The problem we're dealing with is pathological bad cases, not
> throughput in the lucky cases.

Bad cases, but not only pathological bad cases. There is already a
real difference for 8 threads on my machine, which I think is bad, but
not pathological.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::






* Re: [PATCH 1/2] let them spin
  2015-08-30  8:41       ` Jens Gustedt
@ 2015-08-30 17:02         ` Rich Felker
  2015-08-30 21:48           ` Jens Gustedt
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2015-08-30 17:02 UTC (permalink / raw)
  To: musl

On Sun, Aug 30, 2015 at 10:41:53AM +0200, Jens Gustedt wrote:
> > > So the difference isn't dramatic, just one order of magnitude and
> > > everybody gets his chance. These chances are not equal, sure, but
> > > NEVER in capitals is certainly a big word.
> > 
> > Try this: on a machine with at least 3 physical cores, 3 threads
> > hammer on the same lock, counting the number of times they succeed in
> > taking it. Once any one thread has taken it at least 10 million times
> > or so, stop and print the counts. With your spin strategy I would
> > expect to see 2 threads with counts near 10 million and one thread
> > with a count in the hundreds or less, maybe even a single-digit count.
> > With the current behavior (never spinning if there's a waiter) I would
> > expect all 3 counts to be similar.
> 
> The setting that you describe is really a pathological one, where the
> threads don't do any work between taking the lock and releasing it. Do
> I understand that correctly?

If you're using locks to implement fake greater-than-wordsize atomics
then it's the normal case, not a pathological one. You have
(effectively) things like:

	_Atomic long double x;
	__lock(global_lock);
	x++;
	__unlock(global_lock);

For a more realistic example, consider atomic CAS on a linked-list
prev/next pointer pair.
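
Spelled out, such an emulated CAS is roughly the following (a sketch;
the pthread mutex and the names cas_link_pair/global_lock are
placeholders for illustration, not actual musl internals):

	#include <pthread.h>
	#include <stdbool.h>

	struct link { struct link *prev, *next; };

	static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Atomically replace the prev/next pair of *p if it still equals
	   the expected pair; all such "atomics" share one global lock. */
	static bool cas_link_pair(struct link *p, struct link expect, struct link desire)
	{
		bool ok;
		pthread_mutex_lock(&global_lock);
		ok = p->prev == expect.prev && p->next == expect.next;
		if (ok) {
			p->prev = desire.prev;
			p->next = desire.next;
		}
		pthread_mutex_unlock(&global_lock);
		return ok;
	}

The critical section is only a couple of loads and stores, so under
contention the cost is dominated by the lock itself.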

Rich



* Re: [PATCH 1/2] let them spin
  2015-08-30 17:02         ` Rich Felker
@ 2015-08-30 21:48           ` Jens Gustedt
  0 siblings, 0 replies; 7+ messages in thread
From: Jens Gustedt @ 2015-08-30 21:48 UTC (permalink / raw)
  To: musl


On Sunday, 30.08.2015, 13:02 -0400, Rich Felker wrote:
> On Sun, Aug 30, 2015 at 10:41:53AM +0200, Jens Gustedt wrote:
> > > > So the difference isn't dramatic, just one order of magnitude and
> > > > everybody gets his chance. These chances are not equal, sure, but
> > > > NEVER in capitals is certainly a big word.
> > > 
> > > Try this: on a machine with at least 3 physical cores, 3 threads
> > > hammer on the same lock, counting the number of times they succeed in
> > > taking it. Once any one thread has taken it at least 10 million times
> > > or so, stop and print the counts. With your spin strategy I would
> > > expect to see 2 threads with counts near 10 million and one thread
> > > with a count in the hundreds or less, maybe even a single-digit count.
> > > With the current behavior (never spinning if there's a waiter) I would
> > > expect all 3 counts to be similar.
> > 
> > The setting that you describe is really a pathological one, where the
> > threads don't do any work between taking the lock and releasing it. Do
> > I understand that correctly?
> 
> If you're using locks to implement fake greater-than-wordsize atomics
> then it's the normal case, not a pathological one. You have
> (effectively) things like:
> 
> 	_Atomic long double x;
> 	__lock(global_lock);
> 	x++;
> 	__unlock(global_lock);

You probably just mean "volatile" instead of "_Atomic"?

In any case, this already has a memory write inside the critical
section, which is not nothing compared to the fast path of the
__lock/__unlock operation.

> For a more realistic example, consider atomic CAS on a linked-list
> prev/next pointer pair.

So that one would be similar to the various tests I did for
<stdatomic.h>. So far I haven't been able to observe your "NEVER"
case, and I have done a substantial number of runs over the last few
weeks.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::






end of thread, other threads:[~2015-08-30 21:48 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-29  8:50 [PATCH 1/2] let them spin Jens Gustedt
2015-08-29 17:16 ` Rich Felker
2015-08-29 19:38   ` Jens Gustedt
2015-08-30  0:39     ` Rich Felker
2015-08-30  8:41       ` Jens Gustedt
2015-08-30 17:02         ` Rich Felker
2015-08-30 21:48           ` Jens Gustedt

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
