On Monday, 03.08.2015, 19:36 +0300, Alexander Monakov wrote:
> > Now let us try to figure out what happens in the case that started
> > this new discussion, namely that there is so much contention that
> > the probability of an EAGAIN-failure of the mutex call increases.
>
> As long as we're talking in terms of futexes, it's EWOULDBLOCK, not
> EAGAIN.

Hm, yes, I meant futex, but it really is EAGAIN that I observe. So the
man page seems out of sync with reality. Be that as it may, I think it
is clear that we are talking about a situation where the futex value
is too noisy, so the threads have trouble entering FUTEX_WAIT.

> > In fact even if inside the critical section the application itself
> > does nothing, the lock code first does the spinning. So regardless
> > of the application, any thread does 100 spins (so some work) before
> > it starts fighting to be put to sleep on the futex. None of the
> > threads reaching that point changes the value of the counter, so
> > all threads that reach the point and call futex_wait
> > "simultaneously" know about the same actual value.
> >
> > The value can only be changed by new threads entering the
> > acquisition loop. There are never more such threads at a time than
> > can effectively be scheduled. These will all spend some time
> > spinning, so there can't be a bad cascade of such threads that
> > change the counter.
>
> OK -- initially I missed that spinning a bit before *each* futex_wait
> is going to work well for you.
>
> How do you imagine it to scale when the number of CPUs increases?
> Naively, it appears that the more CPUs are in contention, the more
> you would need to spin?

Yes, but at least in my case the average number of iterations in the
spin loop was 1.1 or so, so there seems to be a lot of leeway.

> > In fact, there will certainly be a tradeoff between the number of
> > spins and the overall waste in CPU time and bandwidth. I will
> > experiment a bit with that.
>
> Curious to hear what you find.

So here is my view now, after experimenting with and improving my
specialized lock for the atomics, for the situation where the critical
section (CS) is extremely short (copying some cache lines). There are
three very distinct situations if a lot of threads compete for the
lock (a rough code sketch follows after the list):

(1) No contention, the fast path. Since the CS is so short, this is
still the large majority of the outcomes, about 80 % in my
experiments, even when 100 threads hammer on my lock. If taking the
lock in this situation is itself extremely fast, the probability of
hitting it increases even further. So there is a strong interest in
getting this done as quickly as possible. In my implementation this is
basically a cmpxchg instruction plus a conditional jump; using __asm__
based atomics adds a comparison to that. The time frame for this
situation should be a few cycles, up to the memory latency.

(2) Mild contention. This happens if one thread asks for the lock
while another one is inside the CS. The probability that even more
threads are in the same situation decreases rapidly. After a very
limited number of spins (with a call to a_spin()) things are done and
the lock is acquired. As said previously, the number of spins could
theoretically depend on the number of real processors in the system,
but as long as we don't have too fancy hardware with hundreds of
cores, a good upper bound should do the trick. The time frame here is
perhaps some hundred cycles.

(3) Descheduling. This is the worst scenario (as Rich described
above). The thread T0 holding the lock is descheduled and the others
are running wild. If literally hundreds of threads are waiting to be
scheduled to compete for the lock, things go down the drain. For this
worst-case scenario to happen we don't need priorities or anything
like that, it also happens by plain bad luck. In my humble tests the
probability was about 0.1 % of the cases, so this is real and has to
be taken care of. (In the presence of priorities and forced
descheduling the probability could go up.) For this situation we
definitely need non-active sleep in some form, and I think futex_wait
is the ideal tool for it.
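To make this concrete, here is a rough standalone sketch of the kind
of lock I have in mind. This is not my actual code: the names
(sketch_lock, sketch_unlock, cas, cpu_relax, futex), the use of C11
atomics and the raw Linux futex syscall instead of our internal
a_cas/a_spin, and the exact encoding of the lock word (sign bit for
"locked", low bits counting the threads in the acquisition path) are
only illustrative assumptions; what matters is the structure of the
three code paths.

#define _GNU_SOURCE
#include <limits.h>
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Illustrative stand-ins; these are not musl's internal atomics. */
static int cas(_Atomic int *p, int expect, int desired)
{
    /* returns the previous value, whether or not the exchange hit */
    atomic_compare_exchange_strong(p, &expect, desired);
    return expect;
}

static void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__ ("pause");
#endif
}

static long futex(_Atomic int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, 0, 0, 0);
}

/* Lock word: 0 when free and nobody competes.  Otherwise the sign
 * bit (INT_MIN) marks "locked" and the rest counts the threads that
 * are currently in the acquisition path, including the holder.
 * Going to sleep in futex_wait never modifies the value. */

void sketch_lock(_Atomic int *l)
{
    /* (1) fast path: one CAS plus a conditional jump */
    int cur = cas(l, 0, INT_MIN + 1);
    if (!cur) return;

    /* (2) mild contention: a small, bounded spin loop */
    for (int i = 0; i < 100; ++i) {
        /* if the lock is held, predict the value it will have once
         * the holder releases it */
        if (cur < 0) cur -= INT_MIN + 1;
        int old = cas(l, cur, INT_MIN + cur + 1);
        if (old == cur) return;
        cur = old;
        cpu_relax();
    }

    /* (3) heavy contention or descheduled holder: register once in
     * the count, then never touch the value except to take the lock */
    cur = atomic_fetch_add(l, 1) + 1;
    for (;;) {
        if (cur < 0) {
            /* fails with EWOULDBLOCK/EAGAIN if the value changed in
             * the meantime; then we simply re-read and try again */
            futex(l, FUTEX_WAIT_PRIVATE, cur);
            cur = atomic_load(l);
        } else {
            int old = cas(l, cur, INT_MIN + cur);
            if (old == cur) return;
            cur = old;
        }
    }
}

void sketch_unlock(_Atomic int *l)
{
    /* drop the lock bit and our slot in the count; wake one waiter
     * only if somebody else is still registered */
    if (atomic_fetch_add(l, -(INT_MIN + 1)) != INT_MIN + 1)
        futex(l, FUTEX_WAKE_PRIVATE, 1);
}

The bound of 100 spins is just what I currently use; as said above,
in my tests the spin loop exits after about one iteration on average.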
For the probability that a futex_wait succeeds, the argument is
similar to case (2). Only scheduled threads compete for it. In my
model they only change the futex value when they enter the CS, not
when they queue up for going into futex_wait. So once P threads (for
P processors) are inside the acquisition loop, one of them will
succeed with its futex_wait. Now another thread may be scheduled,
enter the CS and thus spoil the futex value, so we may observe some
failures of the other threads to enter the wait, but eventually one
will make it again. This can go on as long as there are threads
arriving at the CS. Once no more threads arrive, things calm down
quickly. And as soon as T0 is rescheduled and finishes the CS, this
herd of threads is nicely woken up, one after another. The time frame
here is some thousands of cycles, or a scheduling time slice.

All of this is to argue that situations (1)-(3) have very different
time frames and probabilities. The real figures surely depend on a
lot of factors (number of processors, relative length of the CS
compared to a time slice, memory latency) and are difficult to
observe; here, observation really changes the observed. But for a lot
of situations where we can limit the CS to a few instructions, the
figures should be orders of magnitude apart, so some heuristic to
distinguish the three cases should be good enough.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::