On Thursday, 30.07.2015 at 09:46 -0400, Rich Felker wrote:
> On Thu, Jul 30, 2015 at 02:37:13PM +0300, Alexander Monakov wrote:
> > On Thu, 30 Jul 2015, Jens Gustedt wrote:
> > > On Thursday, 30.07.2015 at 12:36 +0300, Alexander Monakov wrote:
> > > > That sounds like your testcase simulates a load where you'd be
> > > > better off with a spinlock in the first place, no?
> > >
> > > Hm, this is not a "testcase" in the sense that this is the real code
> > > that I'd like to use for the generic atomic lock-full stuff. My test
> > > is just using this atomic lock-full thing, with a lot of threads that
> > > use the same head of a "lock-free" FIFO implementation. There the
> > > inner part in the critical section is just a memcpy of some bytes. For
> > > reasonable uses of atomics this should be about 16 to 32 bytes that
> > > are copied.
> > >
> > > So this is really a use case that I consider important, and that I
> > > would like to see implemented with similar performance.
> >
> > I acknowledge that that seems like an important case, but you have not
> > addressed my main point.

I will try to discuss this below.

> > With so little work in the critical section, it does not make sense to
> > me that you would use something like a normal-type futex-y mutex. Even
> > a call/return to grab it gives you some overhead. I'd expect you would
> > use a fully inlined spinlock acquisition/release around the memory copy.
>
> No, spinlocks are completely unusable in a POSIX libc that implements
> priorities. They will deadlock whenever a lower-priority thread gets
> preempted by a higher-priority one while holding the lock.

First, you may not have recognized this: what I use is *exactly* the
LOCK/UNLOCK strategy that musl already implements. If you don't obtain
the lock, spin for a hundred rounds, and then try to go into a futex
wait. (I copied the code from musl and did just some transformations to
fit the case.)

The difference is that I avoid incrementing and decrementing the counter
at each iteration (this could perhaps be done with the actual version,
too), and that combining the flag and the counter in one word lets both
fast paths, lock and unlock, get away with a single CAS each. (A rough
sketch of the scheme is appended at the end of this mail.)

Now let us try to figure out what happens in the case that started this
new discussion, namely that there is so much contention that the
probability of an EAGAIN failure of the futex call increases.

Even if the application itself does nothing inside the critical section,
the lock code first does the spinning. So regardless of the application,
any thread does 100 spins (so some work) before it starts fighting to be
put to sleep on the futex.

None of the threads reaching that point changes the value of the counter,
so all threads that reach it and call futex_wait "simultaneously" know
about the same actual value. The value can only be changed by new threads
entering the acquisition loop, and there are never more such threads at a
time than can effectively be scheduled. These will all spend some time
spinning, so there can't be a bad cascade of threads that change the
counter.

There will certainly be a tradeoff between the number of spins and the
overall waste in CPU time and bandwidth. I will experiment a bit with
that.

As I said, for other functions such as timed locks and certainly for
condition variables we'd have to look very carefully. I haven't really
thought about them, and I make no claim about them.
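To make the structure concrete, here is a rough sketch of the kind of
lock I mean. Take it only as an illustration: it uses GCC's __atomic
builtins and a bare futex syscall wrapper instead of musl's internal
atomics, and the names (lock, unlock, futexwait, futexwake) and the spin
count are placeholders, not the actual code from my patch.

#include <limits.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* The lock word is INT_MIN (lock held) plus the number of threads that
 * are currently inside the acquisition protocol, the holder included.
 * A negative value therefore means "locked". */
#define SPINS 100

static void futexwait(volatile int *a, int expect)
{
    syscall(SYS_futex, a, FUTEX_WAIT_PRIVATE, expect, 0, 0, 0);
}

static void futexwake(volatile int *a)
{
    syscall(SYS_futex, a, FUTEX_WAKE_PRIVATE, 1, 0, 0, 0);
}

/* CAS that returns the previous value */
static int cas(volatile int *p, int expect, int want)
{
    __atomic_compare_exchange_n(p, &expect, want, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expect;
}

void lock(volatile int *l)
{
    /* fast path: one CAS takes the lock and counts us in */
    int cur = cas(l, 0, INT_MIN + 1);
    if (!cur) return;

    /* spin for a while before trying to go to sleep */
    for (int i = 0; i < SPINS; ++i) {
        if (cur >= 0) {
            int old = cas(l, cur, INT_MIN + cur + 1);
            if (old == cur) return;
            cur = old;
        } else {
            cur = *l;
        }
    }

    /* count ourselves exactly once; from here on the only write we do is
     * the CAS that actually takes the lock, so the value that other
     * sleepers passed to futexwait is not disturbed by us */
    cur = __atomic_add_fetch(l, 1, __ATOMIC_SEQ_CST);
    for (;;) {
        if (cur < 0) {
            futexwait(l, cur);
            cur = *l;
        } else {
            int old = cas(l, cur, INT_MIN + cur);
            if (old == cur) return;
            cur = old;
        }
    }
}

void unlock(volatile int *l)
{
    /* drop the lock bit and our own count in one atomic add;
     * if other threads are still counted, wake one of them */
    if (__atomic_fetch_add(l, -(INT_MIN + 1), __ATOMIC_SEQ_CST) != INT_MIN + 1)
        futexwake(l);
}

The point is just the shape of it: the counter half of the word is
touched once per thread on the slow path, and both fast paths are a
single CAS.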
Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::