On Thursday, 30.07.2015 at 09:46 -0400, Rich Felker wrote:
> On Thu, Jul 30, 2015 at 02:37:13PM +0300, Alexander Monakov wrote:
> > On Thu, 30 Jul 2015, Jens Gustedt wrote:
> > > On Thursday, 30.07.2015 at 12:36 +0300, Alexander Monakov wrote:
> > > > That sounds like your testcase simulates a load where you'd be
> > > > better off with a spinlock in the first place, no?
> > >
> > > Hm, this is not a "testcase" in the sense that this is the real code
> > > that I'd like to use for the generic atomic lock-full stuff. My test
> > > is just using this atomic lock-full thing, with a lot of threads that
> > > use the same head of a "lock-free" FIFO implementation. There the
> > > inner part in the critical section is just a memcpy of some bytes. For
> > > reasonable uses of atomics this should be about 16 to 32 bytes that
> > > are copied.
> > >
> > > So this is really a use case that I consider important, and that I
> > > would like to see implemented with similar performance.
> >
> > I acknowledge that that seems like an important case, but you have not
> > addressed my main point.

I will try to discuss this below.

> > With so little work in the critical section, it does not make sense to
> > me that you would use something like a normal-type futex-y mutex. Even
> > a call/return to grab it gives you some overhead. I'd expect you would
> > use a fully inlined spinlock acquisition/release around the memory copy.
>
> No, spinlocks are completely unusable in a POSIX libc that implements
> priorities. They will deadlock whenever a lower-priority thread gets
> preempted by a higher-priority one while holding the lock.

First, you may not have recognized this: what I use is *exactly* the
LOCK/UNLOCK strategy that musl already implements. If you don't obtain
the lock, spin for a hundred rounds, and then try to go into a futex
wait. (I copied the code from musl and did just some transformations to
fit the case.)

The difference is that I avoid incrementing and decrementing the counter
at each iteration (this could perhaps be done with the actual version,
too), and that combining the flag and the counter in one word lets both
fast paths, lock and unlock, get away with a single CAS each. (A rough
sketch of the scheme is appended at the end of this mail.)

Now let us try to figure out what happens in the case that started this
new discussion, namely that there is so much contention that the
probability of an EAGAIN failure of the futex call increases.

Even if the application itself does nothing inside the critical section,
the lock code first does the spinning. So regardless of the application,
any thread does 100 spins (so some work) before it starts fighting to be
put to sleep on the futex.

None of the threads reaching that point changes the value of the counter,
so all threads that reach it and call futex_wait "simultaneously" know
about the same actual value. The value can only be changed by new threads
entering the acquisition loop, and there are never more such threads at a
time than can effectively be scheduled. These will all spend some time
spinning, so there can't be a bad cascade of threads that change the
counter.

There will certainly be a tradeoff between the number of spins and the
overall waste in CPU time and bandwidth. I will experiment a bit with
that.

As I said, for other functions such as timed locks and certainly for
condition variables we'd have to look very carefully. I haven't really
thought about them, and I make no claim about them.
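To make the structure concrete, here is a rough sketch of the kind of
lock I mean. Take it only as an illustration: it uses GCC's __atomic
builtins and a bare futex syscall wrapper instead of musl's internal
atomics, and the names (lock, unlock, futexwait, futexwake) and the spin
count are placeholders, not the actual code from my patch.

#include <limits.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

/* The lock word is INT_MIN (lock held) plus the number of threads that
 * are currently inside the acquisition protocol, the holder included.
 * A negative value therefore means "locked". */
#define SPINS 100

static void futexwait(volatile int *a, int expect)
{
    syscall(SYS_futex, a, FUTEX_WAIT_PRIVATE, expect, 0, 0, 0);
}

static void futexwake(volatile int *a)
{
    syscall(SYS_futex, a, FUTEX_WAKE_PRIVATE, 1, 0, 0, 0);
}

/* CAS that returns the previous value */
static int cas(volatile int *p, int expect, int want)
{
    __atomic_compare_exchange_n(p, &expect, want, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expect;
}

void lock(volatile int *l)
{
    /* fast path: one CAS takes the lock and counts us in */
    int cur = cas(l, 0, INT_MIN + 1);
    if (!cur) return;

    /* spin for a while before trying to go to sleep */
    for (int i = 0; i < SPINS; ++i) {
        if (cur >= 0) {
            int old = cas(l, cur, INT_MIN + cur + 1);
            if (old == cur) return;
            cur = old;
        } else {
            cur = *l;
        }
    }

    /* count ourselves exactly once; from here on the only write we do is
     * the CAS that actually takes the lock, so the value that other
     * sleepers passed to futexwait is not disturbed by us */
    cur = __atomic_add_fetch(l, 1, __ATOMIC_SEQ_CST);
    for (;;) {
        if (cur < 0) {
            futexwait(l, cur);
            cur = *l;
        } else {
            int old = cas(l, cur, INT_MIN + cur);
            if (old == cur) return;
            cur = old;
        }
    }
}

void unlock(volatile int *l)
{
    /* drop the lock bit and our own count in one atomic add;
     * if other threads are still counted, wake one of them */
    if (__atomic_fetch_add(l, -(INT_MIN + 1), __ATOMIC_SEQ_CST) != INT_MIN + 1)
        futexwake(l);
}

The point is just the shape of it: the counter half of the word is
touched once per thread on the slow path, and both fast paths are a
single CAS.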
Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::