On Monday, 03.08.2015, 19:36 +0300, Alexander Monakov wrote:
> > Now let us try to figure out what happens in the case that started
> > this new discussion, namely that there is so much contention that
> > the probability of an EAGAIN-failure of the mutex call increases.
>
> As long as we're talking in terms of futexes, it's EWOULDBLOCK, not
> EAGAIN.

Hm, yes, I meant futex, but it really is EAGAIN that I observe. So the
man page seems out of sync with reality. Be that as it may, I think it
is clear that we are talking about a situation where the futex value
is too noisy, so the threads have trouble entering FUTEX_WAIT.

> > In fact even if inside the critical section the application itself
> > does nothing, the lock code first does the spinning. So regardless
> > of the application, any thread does 100 spins (so some work) before
> > it starts fighting to be put to sleep on the futex. None of the
> > threads reaching that point changes the value of the counter, so
> > all threads that reach the point and call futex_wait
> > "simultaneously" know about the same actual value.
> >
> > The value can only be changed by new threads entering the
> > acquisition loop. There are never more such threads at a time than
> > can effectively be scheduled. These will all spend some time
> > spinning, so there can't be a bad cascade of such threads that
> > change the counter.
>
> OK -- initially I missed that spinning a bit before *each* futex_wait
> is going to work well for you.
>
> How do you imagine it to scale when the number of CPUs increases?
> Naively, it appears that the more CPUs are in contention, the more
> you would need to spin?

Yes, but at least in my case the average number of iterations in the
spin loop was 1.1 or so, so there seems to be a lot of leeway.

> > In fact, there will certainly be a tradeoff between the number of
> > spins and the overall waste in CPU time and bandwidth. I will
> > experiment a bit with that.
>
> Curious to hear what you find.

So here is my view now, after experimenting with and improving my
specialized lock for the atomics, for the situation where the critical
section (CS) is extremely short (copying some cache lines). There are
three very distinct situations if a lot of threads compete for the
lock (a rough code sketch follows after the list):

(1) No contention, the fast path. Since the CS is so short, this is
still the large majority of the outcomes, about 80 % in my
experiments, even when 100 threads hammer on my lock. If taking the
lock in this situation is itself extremely fast, the probability of
hitting it increases even further. So there is a strong interest in
getting this done as quickly as possible. In my implementation this is
basically a cmpxchg instruction plus a conditional jump; using __asm__
based atomics adds a comparison to that. The time frame for this
situation should be a few cycles, up to the memory latency.

(2) Mild contention. This happens if one thread asks for the lock
while another one is inside the CS. The probability that even more
threads are in the same situation decreases rapidly. After a very
limited number of spins (with a call to a_spin()) things are done and
the lock is acquired. As said previously, the number of spins could
theoretically depend on the number of real processors in the system,
but as long as we don't have too fancy hardware with hundreds of
cores, a good upper bound should do the trick. The time frame here is
perhaps some hundred cycles.

(3) Descheduling. This is the worst scenario (as Rich described
above). The thread T0 holding the lock is descheduled and the others
are running wild. If literally hundreds of threads are waiting to be
scheduled to compete for the lock, things go down the drain. For this
worst-case scenario to happen we don't need priorities or anything
like that, it also happens by plain bad luck. In my humble tests the
probability was about 0.1 % of the cases, so this is real and has to
be taken care of. (In the presence of priorities and forced
descheduling the probability could go up.) For this situation we
definitely need non-active sleep in some form, and I think futex_wait
is the ideal tool for it.
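To make this concrete, here is a rough standalone sketch of the kind
of lock I have in mind. This is not my actual code: the names
(sketch_lock, sketch_unlock, cas, cpu_relax, futex), the use of C11
atomics and the raw Linux futex syscall instead of our internal
a_cas/a_spin, and the exact encoding of the lock word (sign bit for
"locked", low bits counting the threads in the acquisition path) are
only illustrative assumptions; what matters is the structure of the
three code paths.

#define _GNU_SOURCE
#include <limits.h>
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* Illustrative stand-ins; these are not musl's internal atomics. */
static int cas(_Atomic int *p, int expect, int desired)
{
    /* returns the previous value, whether or not the exchange hit */
    atomic_compare_exchange_strong(p, &expect, desired);
    return expect;
}

static void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__ ("pause");
#endif
}

static long futex(_Atomic int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, 0, 0, 0);
}

/* Lock word: 0 when free and nobody competes.  Otherwise the sign
 * bit (INT_MIN) marks "locked" and the rest counts the threads that
 * are currently in the acquisition path, including the holder.
 * Going to sleep in futex_wait never modifies the value. */

void sketch_lock(_Atomic int *l)
{
    /* (1) fast path: one CAS plus a conditional jump */
    int cur = cas(l, 0, INT_MIN + 1);
    if (!cur) return;

    /* (2) mild contention: a small, bounded spin loop */
    for (int i = 0; i < 100; ++i) {
        /* if the lock is held, predict the value it will have once
         * the holder releases it */
        if (cur < 0) cur -= INT_MIN + 1;
        int old = cas(l, cur, INT_MIN + cur + 1);
        if (old == cur) return;
        cur = old;
        cpu_relax();
    }

    /* (3) heavy contention or descheduled holder: register once in
     * the count, then never touch the value except to take the lock */
    cur = atomic_fetch_add(l, 1) + 1;
    for (;;) {
        if (cur < 0) {
            /* fails with EWOULDBLOCK/EAGAIN if the value changed in
             * the meantime; then we simply re-read and try again */
            futex(l, FUTEX_WAIT_PRIVATE, cur);
            cur = atomic_load(l);
        } else {
            int old = cas(l, cur, INT_MIN + cur);
            if (old == cur) return;
            cur = old;
        }
    }
}

void sketch_unlock(_Atomic int *l)
{
    /* drop the lock bit and our slot in the count; wake one waiter
     * only if somebody else is still registered */
    if (atomic_fetch_add(l, -(INT_MIN + 1)) != INT_MIN + 1)
        futex(l, FUTEX_WAKE_PRIVATE, 1);
}

The bound of 100 spins is just what I currently use; as said above,
in my tests the spin loop exits after about one iteration on average.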
For the probability that a futex_wait succeeds, the argument is
similar to case (2). Only scheduled threads compete for it. In my
model they only change the futex value when they enter the CS, not
when they queue up for going into futex_wait. So once P threads (for
P processors) are inside the acquisition loop, one of them will
succeed with its futex_wait. Now another thread may be scheduled,
enter the CS and thus spoil the futex value, so we may observe some
failures of the other threads to enter the wait, but eventually one
will make it again. This can go on as long as there are threads
arriving at the CS. Once no more threads arrive, things calm down
quickly. And as soon as T0 is rescheduled and finishes the CS, this
herd of threads is nicely woken up, one after another. The time frame
here is some thousands of cycles, or a scheduling time slice.

All of this is to argue that situations (1)-(3) have very different
time frames and probabilities. The real figures surely depend on a
lot of factors (number of processors, relative length of the CS
compared to a time slice, memory latency) and are difficult to
observe; here, observation really changes the observed. But for a lot
of situations where we can limit the CS to a few instructions, the
figures should be orders of magnitude apart, so some heuristic to
distinguish the three cases should be good enough.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::