On Saturday, 29.08.2015, at 20:39 -0400, Rich Felker wrote:
> On Sat, Aug 29, 2015 at 09:38:30PM +0200, Jens Gustedt wrote:
> > On Saturday, 29.08.2015, at 13:16 -0400, Rich Felker wrote:
> > > On Sat, Aug 29, 2015 at 10:50:44AM +0200, Jens Gustedt wrote:
> > > > Remove a test in __wait that checked whether other threads had
> > > > already attempted to go to sleep in futex_wait.
> > > >
> > > > This has no impact on the fast path. But contrary to what one might
> > > > think at first glance, it slows things down when there is congestion.
> > > >
> > > > Applying this patch shows no difference in behavior in a mono-core
> > > > setting, so this shortcut seems to be superfluous.
> > >
> > > The purpose of this code is twofold: improving fairness of the lock
> > > and avoiding burning cpu time that's _known_ to be a waste.
> > >
> > > If you spin on a lock that already has waiters, the thread that spins
> > > has a much better chance to get the lock than any of the existing
> > > waiters which are in futex_wait. Assuming sufficiently many cores that
> > > all threads that are not sleeping don't get preempted, the spinning
> > > thread is basically guaranteed to get the lock unless it spins so long
> > > that it goes to futex_wait itself. This is simply because returning
> > > from futex_wake (which all the other waiters have to do) takes a lot
> > > more time than one spin. I suspect there are common loads under which
> > > many of the waiters will NEVER get the lock.
> >
> > Yes and no. I benchmarked things to learn a bit more. On my machine one
> > iteration of a spin loop is about 10 times faster than a failed call
> > to futex_wait.
>
> So this means that a non-preempted spinning thread will always get the
> lock before a thread returning from futex_wait, as I predicted.

Again, yes and no. You are neglecting the fact that in the current
implementation, __lock does a "trylock" before entering __wait. So even
today, when a lot of threads arrive, there is a considerable probability
that a thread that has just been woken up will not be able to obtain its
"dedicated" slot, the one that gave rise to its wakeup. With musl's
current strategy, such a woken-up thread goes back into futex_wait; with
the changed strategy it would have a chance at the next slot.

So I would actually expect the distribution of waiting times to have a
very long tail, even longer than with my modified strategy.

It is very difficult to get a handle on these probabilities; they are
hard to measure without influencing the measurement. And I find neither
your arguments nor mine completely convincing :)

> > Even for the current strategy, one of the futex_waiting threads gets
> > woken up and gets its chance with a new spinning phase.
>
> In the current musl code, threads that futex_wait do not go back to
> spinning except in a rare race. __wait keeps repeating the futex_wait
> syscall as long as *addr==val. Repeating the spinning would let them
> be more aggressive about getting the lock against other threads
> contending for it, but it burns a lot of cpu and gives up all the
> fairness that you would otherwise get from the futex queue.

In a situation where the lock value changes often, the wait loop also
burns CPU: one failed futex_wait costs about as much as ten spin
iterations in my measurements and setting.
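Just so we are sure to be talking about the same mechanism, here is a
rough, self-contained sketch of the spin-then-futex pattern as I picture
it. This is deliberately not musl's actual __lock/__wait code: the names
(sketch_lock, futex_wait, ...), the spin count, and the use of C11
atomics plus the raw Linux futex syscall are all mine. The line marked
"the contested test" corresponds to the kind of shortcut my patch
removes, namely giving up the spin phase as soon as somebody is already
registered as a waiter.

/* Hypothetical, simplified sketch of the spin-then-futex pattern under
 * discussion -- NOT musl's actual __lock/__wait code.  Linux only. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(atomic_int *addr, int val)
{
        /* Sleep only as long as *addr still holds the expected value. */
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *addr, int n)
{
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, n, NULL, NULL, 0);
}

struct sketch_lock {
        atomic_int locked;      /* 0 = free, 1 = held */
        atomic_int waiters;     /* threads sleeping (or about to sleep) */
};

void sketch_lock_take(struct sketch_lock *l)
{
        int expected = 0;

        /* Fast path: uncontended acquisition. */
        if (atomic_compare_exchange_strong(&l->locked, &expected, 1))
                return;

        /* Spin phase.  The test of l->waiters is the shortcut in
         * question: with it, a thread stops spinning as soon as
         * somebody else is already sleeping in futex_wait. */
        for (int spins = 100; spins > 0; spins--) {
                if (atomic_load(&l->waiters) != 0)
                        break;                  /* the contested test */
                expected = 0;
                if (atomic_compare_exchange_weak(&l->locked, &expected, 1))
                        return;
        }

        /* Slow path: register as a waiter and sleep until woken. */
        atomic_fetch_add(&l->waiters, 1);
        expected = 0;
        while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
                futex_wait(&l->locked, 1);      /* may return spuriously */
                expected = 0;
        }
        atomic_fetch_sub(&l->waiters, 1);
}

void sketch_lock_release(struct sketch_lock *l)
{
        atomic_store(&l->locked, 0);
        if (atomic_load(&l->waiters) != 0)
                futex_wake(&l->locked, 1);
}

Note that even though the unlock wakes exactly one waiter, the woken
thread still has to win the compare-and-swap against any thread that
happens to be spinning or just arriving; that is exactly the race whose
probabilities we are arguing about.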
> > So the difference isn't dramatic, just one order of magnitude, and
> > everybody gets his chance. These chances are not equal, sure, but
> > NEVER in capitals is certainly a big word.
>
> Try this: on a machine with at least 3 physical cores, 3 threads
> hammer on the same lock, counting the number of times they succeed in
> taking it. Once any one thread has taken it at least 10 million times
> or so, stop and print the counts. With your spin strategy I would
> expect to see 2 threads with counts near 10 million and one thread
> with a count in the hundreds or less, maybe even a single-digit count.
> With the current behavior (never spinning if there's a waiter) I would
> expect all 3 counts to be similar.

The setting you describe is a really pathological one, where the
threads do no work at all between taking the lock and releasing it. Do
I understand that correctly? I think a test where at least some "work"
is done inside the critical section would be more reasonable, and for
such a setting I doubt that we would observe such behavior.

I am currently on the road, so I don't have such a machine at hand; I
will try it next week. It should be relatively simple for my test to
also compute statistics comparing the threads of each run (min, max and
standard deviation), so I'll do that. A minimal sketch of such a hammer
test is appended below the signature.

> > On the other hand, the difference in throughput in the multi-core
> > setting between the different spin versions is dramatic for malloc, I
> > find.
>
> The problem we're dealing with is pathological bad cases, not
> throughput in the lucky cases.

Bad cases, yes, but not only pathological bad cases. There is already a
real difference for 8 threads on my machine, which I consider bad, but
not pathological.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536 ::
:: :::::::::::::::::::::: gsm France : +33 651400183 ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::
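P.S.: for reference, here is a minimal sketch of the kind of hammer test
described above. It is only an illustration: it uses a plain
pthread_mutex_t as a stand-in, whereas a real measurement would have to
exercise musl's internal lock (for example indirectly through malloc),
and the thread count and iteration target are simply the numbers from
Rich's proposal.

/* Hypothetical sketch of the 3-thread "hammer" benchmark discussed
 * above.  A plain pthread_mutex_t stands in for musl's internal lock;
 * the per-thread acquisition counts expose how (un)fair the lock is. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 3
#define TARGET   10000000       /* stop once one thread reaches this */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counts[NTHREADS];
static atomic_int done;

static void *hammer(void *arg)
{
        int id = (int)(long)arg;
        while (!atomic_load(&done)) {
                pthread_mutex_lock(&lock);
                /* Empty critical section: the pathological case.  Some
                 * real "work" here would model the setting I consider
                 * more reasonable. */
                counts[id]++;
                pthread_mutex_unlock(&lock);
                if (counts[id] >= TARGET)
                        atomic_store(&done, 1);
        }
        return 0;
}

int main(void)
{
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], 0, hammer, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], 0);
        for (int i = 0; i < NTHREADS; i++)
                printf("thread %d: %ld acquisitions\n", i, counts[i]);
        return 0;
}

Collecting min, max and the standard deviation of the per-thread counts
over several runs, as mentioned above, is then only a few extra lines.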