On Saturday, 29.08.2015, at 20:39 -0400, Rich Felker wrote:
> On Sat, Aug 29, 2015 at 09:38:30PM +0200, Jens Gustedt wrote:
> > On Saturday, 29.08.2015, at 13:16 -0400, Rich Felker wrote:
> > > On Sat, Aug 29, 2015 at 10:50:44AM +0200, Jens Gustedt wrote:
> > > > Remove a test in __wait that checked whether other threads had
> > > > already attempted to go to sleep in futex_wait.
> > > >
> > > > This has no impact on the fast path. But contrary to what one might
> > > > think at first glance, it slows things down when there is congestion.
> > > >
> > > > Applying this patch shows no difference in behavior in a mono-core
> > > > setting, so this shortcut seems to be superfluous.
> > >
> > > The purpose of this code is twofold: improving fairness of the lock
> > > and avoiding burning cpu time that's _known_ to be a waste.
> > >
> > > If you spin on a lock that already has waiters, the thread that spins
> > > has a much better chance to get the lock than any of the existing
> > > waiters which are in futex_wait. Assuming sufficiently many cores that
> > > all threads that are not sleeping don't get preempted, the spinning
> > > thread is basically guaranteed to get the lock unless it spins so long
> > > that it goes to futex_wait itself. This is simply because returning
> > > from futex_wake (which all the other waiters have to do) takes a lot
> > > more time than one spin. I suspect there are common loads under which
> > > many of the waiters will NEVER get the lock.
> >
> > Yes and no. I benchmarked things to learn a bit more. On my machine one
> > iteration of a spin loop is about 10 times faster than a failed call
> > to futex_wait.
>
> So this means that a non-preempted spinning thread will always get the
> lock before a thread returning from futex_wait, as I predicted.

Again, yes and no. You are neglecting the fact that in the current
implementation, __lock does a "trylock" before entering __wait. So even
today, when a lot of threads arrive, there is a considerable probability
that a thread that has just been woken up will not be able to obtain its
"dedicated" slot, the one that gave rise to its wakeup. With musl's
current strategy, such a woken-up thread goes back into futex_wait; with
the changed strategy it would have a chance at the next slot.

So I would actually expect the distribution of waiting times to have a
very long tail, even longer than with my modified strategy.

It is very difficult to get a handle on these probabilities; they are
hard to measure without influencing the measurement. And I find neither
your arguments nor mine completely convincing :)

> > Even for the current strategy, one of the futex_waiting threads gets
> > woken up and gets its chance with a new spinning phase.
>
> In the current musl code, threads that futex_wait do not go back to
> spinning except in a rare race. __wait keeps repeating the futex_wait
> syscall as long as *addr==val. Repeating the spinning would let them
> be more aggressive about getting the lock against other threads
> contending for it, but it burns a lot of cpu and gives up all the
> fairness that you would otherwise get from the futex queue.

In a situation where the lock value changes often, the wait loop also
burns CPU: one failed futex_wait costs about as much as ten spin
iterations in my measurements and setting.
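Just so we are sure to be talking about the same mechanism, here is a
rough, self-contained sketch of the spin-then-futex pattern as I picture
it. This is deliberately not musl's actual __lock/__wait code: the names
(sketch_lock, futex_wait, ...), the spin count, and the use of C11
atomics plus the raw Linux futex syscall are all mine. The line marked
"the contested test" corresponds to the kind of shortcut my patch
removes, namely giving up the spin phase as soon as somebody is already
registered as a waiter.

/* Hypothetical, simplified sketch of the spin-then-futex pattern under
 * discussion -- NOT musl's actual __lock/__wait code.  Linux only. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(atomic_int *addr, int val)
{
        /* Sleep only as long as *addr still holds the expected value. */
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *addr, int n)
{
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, n, NULL, NULL, 0);
}

struct sketch_lock {
        atomic_int locked;      /* 0 = free, 1 = held */
        atomic_int waiters;     /* threads sleeping (or about to sleep) */
};

void sketch_lock_take(struct sketch_lock *l)
{
        int expected = 0;

        /* Fast path: uncontended acquisition. */
        if (atomic_compare_exchange_strong(&l->locked, &expected, 1))
                return;

        /* Spin phase.  The test of l->waiters is the shortcut in
         * question: with it, a thread stops spinning as soon as
         * somebody else is already sleeping in futex_wait. */
        for (int spins = 100; spins > 0; spins--) {
                if (atomic_load(&l->waiters) != 0)
                        break;                  /* the contested test */
                expected = 0;
                if (atomic_compare_exchange_weak(&l->locked, &expected, 1))
                        return;
        }

        /* Slow path: register as a waiter and sleep until woken. */
        atomic_fetch_add(&l->waiters, 1);
        expected = 0;
        while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
                futex_wait(&l->locked, 1);      /* may return spuriously */
                expected = 0;
        }
        atomic_fetch_sub(&l->waiters, 1);
}

void sketch_lock_release(struct sketch_lock *l)
{
        atomic_store(&l->locked, 0);
        if (atomic_load(&l->waiters) != 0)
                futex_wake(&l->locked, 1);
}

Note that even though the unlock wakes exactly one waiter, the woken
thread still has to win the compare-and-swap against any thread that
happens to be spinning or just arriving; that is exactly the race whose
probabilities we are arguing about.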
> > So the difference isn't dramatic, just one order of magnitude, and
> > everybody gets his chance. These chances are not equal, sure, but
> > NEVER in capitals is certainly a big word.
>
> Try this: on a machine with at least 3 physical cores, 3 threads
> hammer on the same lock, counting the number of times they succeed in
> taking it. Once any one thread has taken it at least 10 million times
> or so, stop and print the counts. With your spin strategy I would
> expect to see 2 threads with counts near 10 million and one thread
> with a count in the hundreds or less, maybe even a single-digit count.
> With the current behavior (never spinning if there's a waiter) I would
> expect all 3 counts to be similar.

The setting you describe is a really pathological one, where the
threads do no work at all between taking the lock and releasing it. Do
I understand that correctly? I think a test where at least some "work"
is done inside the critical section would be more reasonable, and for
such a setting I doubt that we would observe such behavior.

I am currently on the road, so I don't have such a machine at hand; I
will try it next week. It should be relatively simple for my test to
also compute statistics comparing the threads of each run (min, max and
standard deviation), so I'll do that. A minimal sketch of such a hammer
test is appended below the signature.

> > On the other hand, the difference in throughput in the multi-core
> > setting between the different spin versions is dramatic for malloc, I
> > find.
>
> The problem we're dealing with is pathological bad cases, not
> throughput in the lucky cases.

Bad cases, yes, but not only pathological bad cases. There is already a
real difference for 8 threads on my machine, which I consider bad, but
not pathological.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536 ::
:: :::::::::::::::::::::: gsm France : +33 651400183 ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
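P.S.: for reference, here is a minimal sketch of the kind of hammer test
described above. It is only an illustration: it uses a plain
pthread_mutex_t as a stand-in, whereas a real measurement would have to
exercise musl's internal lock (for example indirectly through malloc),
and the thread count and iteration target are simply the numbers from
Rich's proposal.

/* Hypothetical sketch of the 3-thread "hammer" benchmark discussed
 * above.  A plain pthread_mutex_t stands in for musl's internal lock;
 * the per-thread acquisition counts expose how (un)fair the lock is. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 3
#define TARGET   10000000       /* stop once one thread reaches this */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counts[NTHREADS];
static atomic_int done;

static void *hammer(void *arg)
{
        int id = (int)(long)arg;
        while (!atomic_load(&done)) {
                pthread_mutex_lock(&lock);
                /* Empty critical section: the pathological case.  Some
                 * real "work" here would model the setting I consider
                 * more reasonable. */
                counts[id]++;
                pthread_mutex_unlock(&lock);
                if (counts[id] >= TARGET)
                        atomic_store(&done, 1);
        }
        return 0;
}

int main(void)
{
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], 0, hammer, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], 0);
        for (int i = 0; i < NTHREADS; i++)
                printf("thread %d: %ld acquisitions\n", i, counts[i]);
        return 0;
}

Collecting min, max and the standard deviation of the per-thread counts
over several runs, as mentioned above, is then only a few extra lines.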