New optimized normal-type mutex?

mailing list of musl libc
 help / color / mirror / code / Atom feed

* New optimized normal-type mutex?
@ 2015-05-21 23:44 Rich Felker
  2015-05-22  7:30 ` Jens Gustedt
  2015-07-29 12:09 ` Joakim Sindholt
  0 siblings, 2 replies; 20+ messages in thread
From: Rich Felker @ 2015-05-21 23:44 UTC (permalink / raw)
  To: musl

I realized (thanks to conversations with wm4) that it's possible to
optimize normal-type mutexes a lot to eliminate spurious futex wakes
and reduce the number of atomic ops/barriers under contention.

The concept is to store the full mutex state in a single int:

bits 0-29: waiter count
bit 30: waiter flag
bit 31: lock bit

Then unlock is:

old = a_fetch_and(p, 0x3fffffff); // new op
if (old & 0x40000000) wake(p,1);

Trylock is:

old = *p;
if (old < 0) {
	a_barrier();
	old = *p;
}
if (old >= 0 && a_cas(p, old, old|INT_MIN)) return 0;

Lock is:

while (spins--) {
	if (!trylock()) return 0;
	a_spin();
}
while ((old = *p) < 0) {
	new = old+1 | 0x40000000;
	if (a_cas(p, old, new)==old) break;
}
for(;;) {
	while ((old = *p) < 0) {
		futex_wait(p, old);
	}
	new = old-1;
	if (new) new |= 0x40000000; // set waiters flag if any other waiters
	new |= INT_MIN;
	if (a_cas(p, old, new)==old) return 0;
}

The waiter flag marks that unlocking needs to send a wake. It is set
whenever a new waiter arrives, and when a waiter that consumed a wake
observes that there are more waiters left to be woken. The flag is
notably clear when the same thread which just released the mutex
happens to obtain it again before an existing waiter does (when
hammering the mutex); this is where we should see huge savings, since
the subsequent unlock(s) will not produce additional futex wakes which
(1) incur syscall overhead, and (2) cause additional waiters which
cannot yet get the mutex to wake up, try to lock it, and go back to
waiting again. Further:

The code path for unlock has exactly one atomic.

The code path for trylock has at least one barrier and at most (only
on success) one atomic, but may have a spurious barrier if a stale
value was initially read and the operation subsequently succeeds.

The code path for lock on contention (trylock failed and thus produced
no atomic changes, only barriers/spins), contains one atomic cas to
set the waiter state (versus 2 in the current musl implementation),
whatever the kernel does in futex, and one more atomic cas to finally
take the lock (versus 2 in current musl).

Are these changes worthwhile and worth having additional custom code
for the normal-type mutex? I'm not sure. We should probably do an
out-of-tree test implementation to measure differences then consider
integrating in musl if it helps.

If it does look helpful to musl, it may make sense to use this code as
the implementation for __lock/__unlock (implementation-internal locks)
and thereby shrink them all from int[2] to int[1], and then have the
pthread mutex code tail-call to these functions for the normal-mutex
case.

Even if it's not a performance help for musl, the above design may be
very nice as a third-party mutex library since it handles
self-synchronized destruction, minimizes spurious futex wakes, and
fits the whole lock in a single int.

Rich

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-05-21 23:44 New optimized normal-type mutex? Rich Felker
@ 2015-05-22  7:30 ` Jens Gustedt
  2015-05-22  7:51   ` Rich Felker
  2015-07-29 12:09 ` Joakim Sindholt
  1 sibling, 1 reply; 20+ messages in thread
From: Jens Gustedt @ 2015-05-22  7:30 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 4513 bytes --]

Hello,

Am Donnerstag, den 21.05.2015, 19:44 -0400 schrieb Rich Felker:
> I realized (thanks to conversations with wm4) that it's possible to
> optimize normal-type mutexes a lot to eliminate spurious futex wakes
> and reduce the number of atomic ops/barriers under contention.
> 
> The concept is to store the full mutex state in a single int:
> 
> bits 0-29: waiter count
> bit 30: waiter flag
> bit 31: lock bit
> 
> Then unlock is:
> 
> old = a_fetch_and(p, 0x3fffffff); // new op
> if (old & 0x40000000) wake(p,1);
> 
> Trylock is:
> 
> old = *p;
> if (old < 0) {
> 	a_barrier();
> 	old = *p;
> }
> if (old >= 0 && a_cas(p, old, old|INT_MIN)) return 0;
> 
> Lock is:
> 
> while (spins--) {
> 	if (!trylock()) return 0;
> 	a_spin();
> }
> while ((old = *p) < 0) {
> 	new = old+1 | 0x40000000;
> 	if (a_cas(p, old, new)==old) break;
> }
> for(;;) {
> 	while ((old = *p) < 0) {
> 		futex_wait(p, old);
> 	}
> 	new = old-1;
> 	if (new) new |= 0x40000000; // set waiters flag if any other waiters
> 	new |= INT_MIN;
> 	if (a_cas(p, old, new)==old) return 0;
> }
> 
> The waiter flag marks that unlocking needs to send a wake. It is set
> whenever a new waiter arrives, and when a waiter that consumed a wake
> observes that there are more waiters left to be woken. The flag is
> notably clear when the same thread which just released the mutex
> happens to obtain it again before an existing waiter does (when
> hammering the mutex); this is where we should see huge savings, since
> the subsequent unlock(s) will not produce additional futex wakes which
> (1) incur syscall overhead, and (2) cause additional waiters which
> cannot yet get the mutex to wake up, try to lock it, and go back to
> waiting again. Further:
> 
> The code path for unlock has exactly one atomic.
> 
> The code path for trylock has at least one barrier and at most (only
> on success) one atomic, but may have a spurious barrier if a stale
> value was initially read and the operation subsequently succeeds.
> 
> The code path for lock on contention (trylock failed and thus produced
> no atomic changes, only barriers/spins), contains one atomic cas to
> set the waiter state (versus 2 in the current musl implementation),
> whatever the kernel does in futex, and one more atomic cas to finally
> take the lock (versus 2 in current musl).
> 
> Are these changes worthwhile and worth having additional custom code
> for the normal-type mutex? I'm not sure. We should probably do an
> out-of-tree test implementation to measure differences then consider
> integrating in musl if it helps.

I think the gain in maintainability and readability would be very
interesting, so I would even be in favor to apply it if it doesn't
bring performance gain. I would at least expect it not to have
performance loss.  Even though this might be a medium sized patch, I
think it is worth a try. (But perhaps you could go for it one arch at
a time, to have things go more smoothly?)

But I personally expect it to be a win, in particular for mtx_t, where
it probably covers 99% of the use cases.

> If it does look helpful to musl, it may make sense to use this code as
> the implementation for __lock/__unlock (implementation-internal locks)
> and thereby shrink them all from int[2] to int[1], and then have the
> pthread mutex code tail-call to these functions for the normal-mutex
> case.

Actually from a code migration POV I would merely start from this
end. Implement internal locks that use that strategy, and use them
internally.

Then, in a second phase start using this lock in data structures that
are exposed to the user. There will be problems with dynamic linking
between executables and libc that differ on these strategies, so we'd
have to be careful.

> Even if it's not a performance help for musl, the above design may be
> very nice as a third-party mutex library since it handles
> self-synchronized destruction, minimizes spurious futex wakes, and
> fits the whole lock in a single int.

"third-party" library is already all the internal stuff where we use
lock features that are not exposed to the user.

Jens


-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-05-22  7:30 ` Jens Gustedt
@ 2015-05-22  7:51   ` Rich Felker
  0 siblings, 0 replies; 20+ messages in thread
From: Rich Felker @ 2015-05-22  7:51 UTC (permalink / raw)
  To: musl

On Fri, May 22, 2015 at 09:30:48AM +0200, Jens Gustedt wrote:
> > Are these changes worthwhile and worth having additional custom code
> > for the normal-type mutex? I'm not sure. We should probably do an
> > out-of-tree test implementation to measure differences then consider
> > integrating in musl if it helps.
> 
> I think the gain in maintainability and readability would be very
> interesting, so I would even be in favor to apply it if it doesn't
> bring performance gain. I would at least expect it not to have
> performance loss.  Even though this might be a medium sized patch, I
> think it is worth a try. (But perhaps you could go for it one arch at
> a time, to have things go more smoothly?)

There is no per-arch code in the musl pthread implementation (only in
the atomics and syscalls and TLS ABI it relies on), and I don't intend
to add any. Changing it would affect all archs. There shouldn't be any
way it would break on some archs but work on others; if there is, it's
a bug in the atomics.

> But I personally expect it to be a win, in particular for mtx_t, where
> it probably covers 99% of the use cases.

I would expect it to be, but sometimes things that look like wins turn
out not to make any difference. Sharing code paths with owner-tracking
mutex types like we're doing now is probably less code, but on the
other hand, getting rid of the conditionals for type!=normal in those
code paths would make them simpler to understand and more streamlined,
so even from a simplicity standpoint it might be nice to have them
separate.

> > If it does look helpful to musl, it may make sense to use this code as
> > the implementation for __lock/__unlock (implementation-internal locks)
> > and thereby shrink them all from int[2] to int[1], and then have the
> > pthread mutex code tail-call to these functions for the normal-mutex
> > case.
> 
> Actually from a code migration POV I would merely start from this
> end. Implement internal locks that use that strategy, and use them
> internally.

Yes, I agree that's a good approach, especially if it's practical for
the pthread code to tail-call, which I think it is.

> Then, in a second phase start using this lock in data structures that
> are exposed to the user. There will be problems with dynamic linking
> between executables and libc that differ on these strategies, so we'd
> have to be careful.

I'm not sure what you mean. All of the code is in libc.so and all of
the affected structures are opaque and only accessed within libc.so.

> > Even if it's not a performance help for musl, the above design may be
> > very nice as a third-party mutex library since it handles
> > self-synchronized destruction, minimizes spurious futex wakes, and
> > fits the whole lock in a single int.
> 
> "third-party" library is already all the internal stuff where we use
> lock features that are not exposed to the user.

Well, the internal locks in musl are mostly non-interesting to
performance, except for malloc. And they're mostly static-storage, so
don't benefit from the self-synchronized-destruction-safety, except
for the locks in stdio FILEs which are presently not SSD-safe and
relying on malloc behavior to make the possible read-after-fclose
non-dangerous. Fixing those would be nice, but they're owner-tracked
so can't use this.

Rich

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-05-21 23:44 New optimized normal-type mutex? Rich Felker
  2015-05-22  7:30 ` Jens Gustedt
@ 2015-07-29 12:09 ` Joakim Sindholt
  2015-07-29 22:11   ` Jens Gustedt
  1 sibling, 1 reply; 20+ messages in thread
From: Joakim Sindholt @ 2015-07-29 12:09 UTC (permalink / raw)
  To: musl

On Fri, May 22, 2015 at 1:44 AM, Rich Felker <dalias@libc.org> wrote:
> I realized (thanks to conversations with wm4) that it's possible to
> optimize normal-type mutexes a lot to eliminate spurious futex wakes
> and reduce the number of atomic ops/barriers under contention.
> 
> The concept is to store the full mutex state in a single int:
> 
> bits 0-29: waiter count
> bit 30: waiter flag
> bit 31: lock bit
> 
> Then unlock is:
> 
> old = a_fetch_and(p, 0x3fffffff); // new op
> if (old & 0x40000000) wake(p,1);
> 
> Trylock is:
> 
> old = *p;
> if (old < 0) {
> 	a_barrier();
> 	old = *p;
> }
> if (old >= 0 && a_cas(p, old, old|INT_MIN)) return 0;
> 
> Lock is:
> 
> while (spins--) {
> 	if (!trylock()) return 0;
> 	a_spin();
> }
> while ((old = *p) < 0) {
> 	new = old+1 | 0x40000000;
> 	if (a_cas(p, old, new)==old) break;
> }
> for(;;) {
> 	while ((old = *p) < 0) {
> 		futex_wait(p, old);
> 	}
> 	new = old-1;
> 	if (new) new |= 0x40000000; // set waiters flag if any other waiters
> 	new |= INT_MIN;
> 	if (a_cas(p, old, new)==old) return 0;
> }

I implemented this lock in my own library and I even did some
preliminary race tests. I couldn't find fault with it. However amonakov
did and made me aware of it on IRC.

Thread A takes the lock
Thread B tries to take the lock but ends up in futex_wait
Thread A unlocks the lock
Thread A futex_wakes Thread B
Thread A steals the lock
Thread B goes back into futex_wait
Thread A unlocks the lock
Thread B hangs
Because:
The first unlock clears the 0x40000000 flag and it does not get set
again at any point.

So he went on and suggested that a cas-less lock was possible with
a_fetch_add however I can't make it work and I don't think he can
either. His idea however is sound: the one who flips the sign bit takes
the lock. Based on that I've cobbled together a different lock that will
probably perform worse than this approach but none-the-less be correct
as far as I can tell.

The difference is that we consider the lock owner a waiter as well, thus
requiring a cas loop in the unlock function to remove itself, so to
speak, from the waiter count. a_fetch_and also turns into a cas loop so
I consider this fairly minor.
This makes the wait loop a little simpler while still maintaining a
waiter count and still only using one int.

void lock(volatile int *p)
{
    int val, spins = 200;

    while (spins--) {
        if (trylock(p)) { return; }
        spin();
    }

    while (1) {
        if ((val = a_load(p)) < 0) {
            if (a_cas(p, val, val - 1) == val) { break; }
        } else {
            if (a_cas(p, val, -val - 1) == val) { return; }
        }
    }

    while (1) {
        while ((val = a_load(p)) < 0) {
            futex_wait(p, val);
        }
        if (a_cas(p, val, -val) == val) { return; }
    }
}

int trylock(volatile int *p)
{
    int val = a_load(p);
    if (val < 0) {
        a_barrier();
        val = a_load(p);
    }
    return (val >= 0 && a_cas(p, val, -val - 1) == val);
}

void unlock(volatile int *p)
{
    int new, val = a_load(p);
    while ((new = a_cas(p, val, -(val + 1))) != val) { val = new; }
    if (val + 1 != 0) { futex_wake(p, 1); }
}

-- Joakim

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-29 12:09 ` Joakim Sindholt
@ 2015-07-29 22:11   ` Jens Gustedt
  2015-07-29 23:30     ` Rich Felker
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Gustedt @ 2015-07-29 22:11 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3926 bytes --]

Hello,

Am Mittwoch, den 29.07.2015, 14:09 +0200 schrieb Joakim Sindholt:
> So he went on and suggested that a cas-less lock was possible with
> a_fetch_add however I can't make it work and I don't think he can
> either. His idea however is sound: the one who flips the sign bit takes
> the lock. Based on that I've cobbled together a different lock that will
> probably perform worse than this approach but none-the-less be correct
> as far as I can tell.
> 
> The difference is that we consider the lock owner a waiter as well, thus
> requiring a cas loop in the unlock function to remove itself, so to
> speak, from the waiter count. a_fetch_and also turns into a cas loop so
> I consider this fairly minor.
> This makes the wait loop a little simpler while still maintaining a
> waiter count and still only using one int.

Nice ideas!

After the recent discussion about the problems on x86_64 I was trying
to come up with a simple lock for the atomics, and I came thinking
along the same lines.

I use "unsigned" as a base type and C11 atomics notation, makes the
arguing for me easier. The idea is that the HO bit is the lock bit
("lockbit" is MAX_INT+1u). As for your ideas above, the counter part
of the total value counts the number of threads inside the critical
section (so number of threads that find themselves between the start
of lock and the end of unlock).

If we count it like that, we know exactly the contribution that a
successful locker has to the total value, namely lockbit + 1. So an
unlock operation consists in an atomic_fetch_sub of that value. From
the return of that fetch it is then easy to see if someone has to be
woken up.

The lock procedure has a simple fast path: if the total value is 0 at
the start, set the value to lockbit+1 and you are done. This can be
done with one CAS, and could also be used for a trylock.

The slow path can be a little bit more complicated. First, account for
the thread being in the critical section by adding 1. Then just as the
current musl __wait function, first spin and then FUTEX_WAIT until the
lockbit is found to be cleared.

(Also for the atomics we don't need to work with shared locks, so the
futex syscalls are straight forward.)

Jens

/* The HO bit. */
unsigned const lockbit = INT_MAX+1u;
/* The HO and LO bit. */
unsigned const contrib = INT_MAX+2u;

void __impl_mut_lock(_Atomic(unsigned)* loc)
{
  if (libc.threads_minus_1) {
    int spins;
    int val = 0;
    /* fast path */
    if (atomic_compare_exchange_strong_explicit(loc, &val, contrib, memory_order_acq_rel, memory_order_consume))
      return;
    val = 1+atomic_fetch_add_explicit(loc, 1, memory_order_relaxed);
    switch (val & lockbit) {
      for (;;) {
        /* The lock bit isn't yet set. */
      case 0:
        val = atomic_fetch_or_explicit(loc, lockbit, memory_order_acq_rel);
        if (!(val & lockbit)) return;
        /* The lock bit is set by someone else. */
        val |= lockbit;
      default:
        for (spins = 100; spins-- && (val & lockbit);) {
          a_spin();
          val = atomic_load_explicit(loc, memory_order_consume);
        }
        for (;val & lockbit;) {
          __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
          val = atomic_load_explicit(loc, memory_order_consume);
        }
      }
    }
  }
}

void __impl_mut_unlock(_Atomic(unsigned)* loc)
{
  if (libc.threads_minus_1) {
    if (contrib != atomic_fetch_sub_explicit(loc, contrib, memory_order_relaxed))
      __syscall(SYS_futex, loc, FUTEX_WAKE, 1);
  }
}

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-29 22:11   ` Jens Gustedt
@ 2015-07-29 23:30     ` Rich Felker
  2015-07-29 23:49       ` Jens Gustedt
  0 siblings, 1 reply; 20+ messages in thread
From: Rich Felker @ 2015-07-29 23:30 UTC (permalink / raw)
  To: musl

On Thu, Jul 30, 2015 at 12:11:15AM +0200, Jens Gustedt wrote:
> Hello,
> 
> Am Mittwoch, den 29.07.2015, 14:09 +0200 schrieb Joakim Sindholt:
> > So he went on and suggested that a cas-less lock was possible with
> > a_fetch_add however I can't make it work and I don't think he can
> > either. His idea however is sound: the one who flips the sign bit takes
> > the lock. Based on that I've cobbled together a different lock that will
> > probably perform worse than this approach but none-the-less be correct
> > as far as I can tell.
> > 
> > The difference is that we consider the lock owner a waiter as well, thus
> > requiring a cas loop in the unlock function to remove itself, so to
> > speak, from the waiter count. a_fetch_and also turns into a cas loop so
> > I consider this fairly minor.
> > This makes the wait loop a little simpler while still maintaining a
> > waiter count and still only using one int.
> 
> Nice ideas!
> 
> After the recent discussion about the problems on x86_64 I was trying
> to come up with a simple lock for the atomics, and I came thinking
> along the same lines.

Unfortunately, discussion on IRC has revealed a potentially
show-stopping issue for merging the waiter count into the futex word:
arrival of new waiters causes EAGAIN from futex_wait. I don't know any
good way around this, but it's probably the reason designs like this
have not been popular before.

Rich


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-29 23:30     ` Rich Felker
@ 2015-07-29 23:49       ` Jens Gustedt
  2015-07-30  0:10         ` Rich Felker
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Gustedt @ 2015-07-29 23:49 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2189 bytes --]

Am Mittwoch, den 29.07.2015, 19:30 -0400 schrieb Rich Felker:
> On Thu, Jul 30, 2015 at 12:11:15AM +0200, Jens Gustedt wrote:
> > Hello,
> > 
> > Am Mittwoch, den 29.07.2015, 14:09 +0200 schrieb Joakim Sindholt:
> > > So he went on and suggested that a cas-less lock was possible with
> > > a_fetch_add however I can't make it work and I don't think he can
> > > either. His idea however is sound: the one who flips the sign bit takes
> > > the lock. Based on that I've cobbled together a different lock that will
> > > probably perform worse than this approach but none-the-less be correct
> > > as far as I can tell.
> > > 
> > > The difference is that we consider the lock owner a waiter as well, thus
> > > requiring a cas loop in the unlock function to remove itself, so to
> > > speak, from the waiter count. a_fetch_and also turns into a cas loop so
> > > I consider this fairly minor.
> > > This makes the wait loop a little simpler while still maintaining a
> > > waiter count and still only using one int.
> > 
> > Nice ideas!
> > 
> > After the recent discussion about the problems on x86_64 I was trying
> > to come up with a simple lock for the atomics, and I came thinking
> > along the same lines.
> 
> Unfortunately, discussion on IRC has revealed a potentially
> show-stopping issue for merging the waiter count into the futex word:
> arrival of new waiters causes EAGAIN from futex_wait. I don't know any
> good way around this, but it's probably the reason designs like this
> have not been popular before.

Hm, could you be more specific about where this hurts?

In the code I have there is

        for (;val & lockbit;) {
          __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
          val = atomic_load_explicit(loc, memory_order_consume);
        }

so this should be robust against spurious wakeups, no?

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-29 23:49       ` Jens Gustedt
@ 2015-07-30  0:10         ` Rich Felker
  2015-07-30  8:07           ` Jens Gustedt
  0 siblings, 1 reply; 20+ messages in thread
From: Rich Felker @ 2015-07-30  0:10 UTC (permalink / raw)
  To: musl

On Thu, Jul 30, 2015 at 01:49:20AM +0200, Jens Gustedt wrote:
> Am Mittwoch, den 29.07.2015, 19:30 -0400 schrieb Rich Felker:
> > On Thu, Jul 30, 2015 at 12:11:15AM +0200, Jens Gustedt wrote:
> > > Hello,
> > > 
> > > Am Mittwoch, den 29.07.2015, 14:09 +0200 schrieb Joakim Sindholt:
> > > > So he went on and suggested that a cas-less lock was possible with
> > > > a_fetch_add however I can't make it work and I don't think he can
> > > > either. His idea however is sound: the one who flips the sign bit takes
> > > > the lock. Based on that I've cobbled together a different lock that will
> > > > probably perform worse than this approach but none-the-less be correct
> > > > as far as I can tell.
> > > > 
> > > > The difference is that we consider the lock owner a waiter as well, thus
> > > > requiring a cas loop in the unlock function to remove itself, so to
> > > > speak, from the waiter count. a_fetch_and also turns into a cas loop so
> > > > I consider this fairly minor.
> > > > This makes the wait loop a little simpler while still maintaining a
> > > > waiter count and still only using one int.
> > > 
> > > Nice ideas!
> > > 
> > > After the recent discussion about the problems on x86_64 I was trying
> > > to come up with a simple lock for the atomics, and I came thinking
> > > along the same lines.
> > 
> > Unfortunately, discussion on IRC has revealed a potentially
> > show-stopping issue for merging the waiter count into the futex word:
> > arrival of new waiters causes EAGAIN from futex_wait. I don't know any
> > good way around this, but it's probably the reason designs like this
> > have not been popular before.
> 
> Hm, could you be more specific about where this hurts?
> 
> In the code I have there is
> 
>         for (;val & lockbit;) {
>           __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
>           val = atomic_load_explicit(loc, memory_order_consume);
>         }
> 
> so this should be robust against spurious wakeups, no?

The problem is that futex_wait returns immediately with EAGAIN if
*loc!=val, which happens very often if *loc is incremented or
otherwise changed on each arriving waiter.

Rich


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30  0:10         ` Rich Felker
@ 2015-07-30  8:07           ` Jens Gustedt
  2015-07-30  9:10             ` Jens Gustedt
  2015-07-30 13:45             ` Rich Felker
  0 siblings, 2 replies; 20+ messages in thread
From: Jens Gustedt @ 2015-07-30  8:07 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2000 bytes --]

Am Mittwoch, den 29.07.2015, 20:10 -0400 schrieb Rich Felker:
> On Thu, Jul 30, 2015 at 01:49:20AM +0200, Jens Gustedt wrote:
> > Hm, could you be more specific about where this hurts?
> > 
> > In the code I have there is
> > 
> >         for (;val & lockbit;) {
> >           __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
> >           val = atomic_load_explicit(loc, memory_order_consume);
> >         }
> > 
> > so this should be robust against spurious wakeups, no?
> 
> The problem is that futex_wait returns immediately with EAGAIN if
> *loc!=val, which happens very often if *loc is incremented or
> otherwise changed on each arriving waiter.

Yes, sure, it may change. Whether or not this is often may depend, I
don't think we can easily make a quantitative statement, here.

In the case of atomics the critical section is extremely short, and
the count, if it varies so much, should have a bit stabilized during
the spinlock phase before coming to the futex part. That futex part is
really only a last resort for the rare case that the thread that is
holding the lock has been descheduled in the middle.

My current test case is having X threads hammer on one single
location, X being up to some hundred. On my 2x2 hyperthreaded CPU for
a reasonable number of threads (X = 16 or 32) I have an overall
performance improvement of 30%, say, when using my version of the lock
instead of the original musl one. The point of inversion where the
original musl lock is better is at about 200 threads.

I'll see how I can get hold on occurrence statistics of the different
phases without being too intrusive (which would change the
scheduling).

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30  8:07           ` Jens Gustedt
@ 2015-07-30  9:10             ` Jens Gustedt
  2015-07-30  9:36               ` Alexander Monakov
  2015-07-30 13:45             ` Rich Felker
  1 sibling, 1 reply; 20+ messages in thread
From: Jens Gustedt @ 2015-07-30  9:10 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2610 bytes --]

Am Donnerstag, den 30.07.2015, 10:07 +0200 schrieb Jens Gustedt:
> Am Mittwoch, den 29.07.2015, 20:10 -0400 schrieb Rich Felker:
> > On Thu, Jul 30, 2015 at 01:49:20AM +0200, Jens Gustedt wrote:
> > > Hm, could you be more specific about where this hurts?
> > > 
> > > In the code I have there is
> > > 
> > >         for (;val & lockbit;) {
> > >           __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
> > >           val = atomic_load_explicit(loc, memory_order_consume);
> > >         }
> > > 
> > > so this should be robust against spurious wakeups, no?
> > 
> > The problem is that futex_wait returns immediately with EAGAIN if
> > *loc!=val, which happens very often if *loc is incremented or
> > otherwise changed on each arriving waiter.
> 
> Yes, sure, it may change. Whether or not this is often may depend, I
> don't think we can easily make a quantitative statement, here.
> 
> In the case of atomics the critical section is extremely short, and
> the count, if it varies so much, should have a bit stabilized during
> the spinlock phase before coming to the futex part. That futex part is
> really only a last resort for the rare case that the thread that is
> holding the lock has been descheduled in the middle.
> 
> My current test case is having X threads hammer on one single
> location, X being up to some hundred. On my 2x2 hyperthreaded CPU for
> a reasonable number of threads (X = 16 or 32) I have an overall
> performance improvement of 30%, say, when using my version of the lock
> instead of the original musl one. The point of inversion where the
> original musl lock is better is at about 200 threads.
> 
> I'll see how I can get hold on occurrence statistics of the different
> phases without being too intrusive (which would change the
> scheduling).

So I tested briefly varying the number of threads from 2 up to 2048.

Out of the loop iterations on the slow path, less than 0.1 % try to go
into futex wait, and out of these about 20 % come back with EGAIN.

In particular the figure of only 20-30 % of the futex calls failing
with EAGAIN, is quite stable.

For me these figures show that the futex phase is really neglectable
for performance and only serves as a last resort that protects us from
an attack.

Jens


-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30  9:10             ` Jens Gustedt
@ 2015-07-30  9:36               ` Alexander Monakov
  2015-07-30 10:00                 ` Jens Gustedt
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Monakov @ 2015-07-30  9:36 UTC (permalink / raw)
  To: musl

On Thu, 30 Jul 2015, Jens Gustedt wrote:

> Am Donnerstag, den 30.07.2015, 10:07 +0200 schrieb Jens Gustedt:
> > Am Mittwoch, den 29.07.2015, 20:10 -0400 schrieb Rich Felker:
> > > On Thu, Jul 30, 2015 at 01:49:20AM +0200, Jens Gustedt wrote:
> > > > Hm, could you be more specific about where this hurts?
> > > > 
> > > > In the code I have there is
> > > > 
> > > >         for (;val & lockbit;) {
> > > >           __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
> > > >           val = atomic_load_explicit(loc, memory_order_consume);
> > > >         }
> > > > 
> > > > so this should be robust against spurious wakeups, no?
> > > 
> > > The problem is that futex_wait returns immediately with EAGAIN if
> > > *loc!=val, which happens very often if *loc is incremented or
> > > otherwise changed on each arriving waiter.
> > 
> > Yes, sure, it may change. Whether or not this is often may depend, I
> > don't think we can easily make a quantitative statement, here.
> > 
> > In the case of atomics the critical section is extremely short, and
> > the count, if it varies so much, should have a bit stabilized during
> > the spinlock phase before coming to the futex part. That futex part is
> > really only a last resort for the rare case that the thread that is
> > holding the lock has been descheduled in the middle.
> > 
> > My current test case is having X threads hammer on one single
> > location, X being up to some hundred. On my 2x2 hyperthreaded CPU for
> > a reasonable number of threads (X = 16 or 32) I have an overall
> > performance improvement of 30%, say, when using my version of the lock
> > instead of the original musl one. The point of inversion where the
> > original musl lock is better is at about 200 threads.
> > 
> > I'll see how I can get hold on occurrence statistics of the different
> > phases without being too intrusive (which would change the
> > scheduling).
> 
> So I tested briefly varying the number of threads from 2 up to 2048.
> 
> Out of the loop iterations on the slow path, less than 0.1 % try to go
> into futex wait, and out of these about 20 % come back with EGAIN.

That sounds like your testcase simulates a load where you'd be better off with
a spinlock in the first place, no?

Have you tried simulating a load that does some non-trivial work between
lock/unlock, making a spinlock a poor fit?

Alexander


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30  9:36               ` Alexander Monakov
@ 2015-07-30 10:00                 ` Jens Gustedt
  2015-07-30 11:37                   ` Alexander Monakov
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Gustedt @ 2015-07-30 10:00 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 4104 bytes --]

Am Donnerstag, den 30.07.2015, 12:36 +0300 schrieb Alexander Monakov:
> On Thu, 30 Jul 2015, Jens Gustedt wrote:
> 
> > Am Donnerstag, den 30.07.2015, 10:07 +0200 schrieb Jens Gustedt:
> > > Am Mittwoch, den 29.07.2015, 20:10 -0400 schrieb Rich Felker:
> > > > On Thu, Jul 30, 2015 at 01:49:20AM +0200, Jens Gustedt wrote:
> > > > > Hm, could you be more specific about where this hurts?
> > > > > 
> > > > > In the code I have there is
> > > > > 
> > > > >         for (;val & lockbit;) {
> > > > >           __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
> > > > >           val = atomic_load_explicit(loc, memory_order_consume);
> > > > >         }
> > > > > 
> > > > > so this should be robust against spurious wakeups, no?
> > > > 
> > > > The problem is that futex_wait returns immediately with EAGAIN if
> > > > *loc!=val, which happens very often if *loc is incremented or
> > > > otherwise changed on each arriving waiter.
> > > 
> > > Yes, sure, it may change. Whether or not this is often may depend, I
> > > don't think we can easily make a quantitative statement, here.
> > > 
> > > In the case of atomics the critical section is extremely short, and
> > > the count, if it varies so much, should have a bit stabilized during
> > > the spinlock phase before coming to the futex part. That futex part is
> > > really only a last resort for the rare case that the thread that is
> > > holding the lock has been descheduled in the middle.
> > > 
> > > My current test case is having X threads hammer on one single
> > > location, X being up to some hundred. On my 2x2 hyperthreaded CPU for
> > > a reasonable number of threads (X = 16 or 32) I have an overall
> > > performance improvement of 30%, say, when using my version of the lock
> > > instead of the original musl one. The point of inversion where the
> > > original musl lock is better is at about 200 threads.
> > > 
> > > I'll see how I can get hold on occurrence statistics of the different
> > > phases without being too intrusive (which would change the
> > > scheduling).
> > 
> > So I tested briefly varying the number of threads from 2 up to 2048.
> > 
> > Out of the loop iterations on the slow path, less than 0.1 % try to go
> > into futex wait, and out of these about 20 % come back with EGAIN.
> 
> That sounds like your testcase simulates a load where you'd be better off with
> a spinlock in the first place, no?

Hm, this is not a "testcase" in the sense that this is the real code
that I'd like to use for the generic atomic lock-full stuff. My test
is just using this atomic lock-full thing, with a lot of threads that
use the same head of a "lock-free" FIFO implementation. There the
inner part in the critical section is just memcpy of some bytes. For
reasonable uses of atomics this should be about 16 to 32 bytes that
are copied.

So this is really a use case that I consider important, and that I
would like to see implemented with similar performance.

(I didn't yet think of making this into a fullfledged mutex,
implementing timed versions certainly needs some thinking.)

> Have you tried simulating a load that does some non-trivial work between
> lock/unlock, making a spinlock a poor fit?

No. But I am not sure that there is such a case :)

With this idea that the counter doesn't change once the thread is
inside the lock-acquisition loop, there is much less noise on the lock
value. This has two benefits. First the accesses in the loop are
mainly reads, to see if there has been a change, no writes. So the bus
pressure should be reduced. And second, because there are less writes
in total, other threads that are inside the same loop perceive less
perturbation, and the futex as a good chance to succeed.

Jens


-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30 10:00                 ` Jens Gustedt
@ 2015-07-30 11:37                   ` Alexander Monakov
  2015-07-30 13:46                     ` Rich Felker
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Monakov @ 2015-07-30 11:37 UTC (permalink / raw)
  To: musl

On Thu, 30 Jul 2015, Jens Gustedt wrote:
> Am Donnerstag, den 30.07.2015, 12:36 +0300 schrieb Alexander Monakov:
> > That sounds like your testcase simulates a load where you'd be better off with
> > a spinlock in the first place, no?
> 
> Hm, this is not a "testcase" in the sense that this is the real code
> that I'd like to use for the generic atomic lock-full stuff. My test
> is just using this atomic lock-full thing, with a lot of threads that
> use the same head of a "lock-free" FIFO implementation. There the
> inner part in the critical section is just memcpy of some bytes. For
> reasonable uses of atomics this should be about 16 to 32 bytes that
> are copied.
> 
> So this is really a use case that I consider important, and that I
> would like to see implemented with similar performance.

I acknowledge that that seems like an important case, but you have not
addressed my main point.  With so little work in the critical section, it does
not make sense to me that you would use something like a normal-type futex-y
mutex.  Even a call/return to grab it gives you some overhead.  I'd expect you
would use a fully inlined spinlock acquisition/release around the memory copy.

> 
> (I didn't yet think of making this into a fullfledged mutex,
> implementing timed versions certainly needs some thinking.)
> 
> > Have you tried simulating a load that does some non-trivial work between
> > lock/unlock, making a spinlock a poor fit?
> 
> No. But I am not sure that there is such a case :)

There appears to be some miscommunication here, and the smiley does not help.
"such a case" would be copying 32KB in the critical section, for example.

> With this idea that the counter doesn't change once the thread is
> inside the lock-acquisition loop, there is much less noise on the lock
> value. This has two benefits. First the accesses in the loop are
> mainly reads, to see if there has been a change, no writes. So the bus
> pressure should be reduced. And second, because there are less writes
> in total, other threads that are inside the same loop perceive less
> perturbation, and the futex as a good chance to succeed.

I think spinning every time you're about to enter futex_wait helps if you
expect critical sections to be as small as your spin period.  Otherwise, it's
not obviously an improvement.  I think normally you spin prior to the very
first atomic operation, in anticipation that you can proceed via the fast
path.  Your spin scheme is different.

Alexander

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30 11:37                   ` Alexander Monakov
@ 2015-07-30 13:46                     ` Rich Felker
  2015-07-30 16:07                       ` Jens Gustedt
  0 siblings, 1 reply; 20+ messages in thread
From: Rich Felker @ 2015-07-30 13:46 UTC (permalink / raw)
  To: musl

On Thu, Jul 30, 2015 at 02:37:13PM +0300, Alexander Monakov wrote:
> On Thu, 30 Jul 2015, Jens Gustedt wrote:
> > Am Donnerstag, den 30.07.2015, 12:36 +0300 schrieb Alexander Monakov:
> > > That sounds like your testcase simulates a load where you'd be better off with
> > > a spinlock in the first place, no?
> > 
> > Hm, this is not a "testcase" in the sense that this is the real code
> > that I'd like to use for the generic atomic lock-full stuff. My test
> > is just using this atomic lock-full thing, with a lot of threads that
> > use the same head of a "lock-free" FIFO implementation. There the
> > inner part in the critical section is just memcpy of some bytes. For
> > reasonable uses of atomics this should be about 16 to 32 bytes that
> > are copied.
> > 
> > So this is really a use case that I consider important, and that I
> > would like to see implemented with similar performance.
> 
> I acknowledge that that seems like an important case, but you have not
> addressed my main point.  With so little work in the critical section, it does
> not make sense to me that you would use something like a normal-type futex-y
> mutex.  Even a call/return to grab it gives you some overhead.  I'd expect you
> would use a fully inlined spinlock acquisition/release around the memory copy.

No, spinlocks are completely unusable in a POSIX libc that implements
priorities. They will deadlock whenever a lower-priority thread gets
preempted by a higher-priority one while holding the lock.

Rich


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30 13:46                     ` Rich Felker
@ 2015-07-30 16:07                       ` Jens Gustedt
  2015-08-03 16:36                         ` Alexander Monakov
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Gustedt @ 2015-07-30 16:07 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3900 bytes --]

Am Donnerstag, den 30.07.2015, 09:46 -0400 schrieb Rich Felker:
> On Thu, Jul 30, 2015 at 02:37:13PM +0300, Alexander Monakov wrote:
> > On Thu, 30 Jul 2015, Jens Gustedt wrote:
> > > Am Donnerstag, den 30.07.2015, 12:36 +0300 schrieb Alexander Monakov:
> > > > That sounds like your testcase simulates a load where you'd be better off with
> > > > a spinlock in the first place, no?
> > > 
> > > Hm, this is not a "testcase" in the sense that this is the real code
> > > that I'd like to use for the generic atomic lock-full stuff. My test
> > > is just using this atomic lock-full thing, with a lot of threads that
> > > use the same head of a "lock-free" FIFO implementation. There the
> > > inner part in the critical section is just memcpy of some bytes. For
> > > reasonable uses of atomics this should be about 16 to 32 bytes that
> > > are copied.
> > > 
> > > So this is really a use case that I consider important, and that I
> > > would like to see implemented with similar performance.
> > 
> > I acknowledge that that seems like an important case, but you have not
> > addressed my main point.

I will try to discuss this below.

> > With so little work in the critical section, it does
> > not make sense to me that you would use something like a normal-type futex-y
> > mutex.  Even a call/return to grab it gives you some overhead.  I'd expect you
> > would use a fully inlined spinlock acquisition/release around the memory copy.
> 
> No, spinlocks are completely unusable in a POSIX libc that implements
> priorities. They will deadlock whenever a lower-priority thread gets
> preempted by a higher-priority one while holding the lock.

First, you may not have recognized this, what I use is *exactly* the
actual strategy of LOCK/UNLOCK that musl implements. If you don't
obtain the lock, spin for one hundred rounds, and then try to go into
a futex wait. (I copied the code from musl and did just some
transformations to fit the case.)

The difference is that I avoid incrementing and decrementing the
counter at each iteration (this could perhaps be done with the actual
version, too), and that the combination of flag and counter in one
word helps that both the fast-path for lock and then unlock, only need
a single CAS, each.

Now let us try to figure out what happens in the case that started
this new discussion, namely that there is so much contention such that
the probability of an EAGAIN-failure of the mutex call increases.

In fact even if inside the critical section the application itself
does nothing, the lock code first does the spinning. So regardless of
the application, any thread does 100 spins (so some work) before it
starts fighting to be put to sleep on the futex. None of the threads
reaching that point, changes the value of of the counter, so all
threads that reach the point and call futex_wait "simultaneously" know
about the same actual value.

The value can only change by new threads entering the acquisition
loop. There are no more such threads at a time than can be effectively
be scheduled. These will all spend some time spinning, so there can't
be a bad cascade of such threads that change the counter.

In fact, there will certainly be a tradeoff between the number of
spins and the overall waste in CPU time and bandwidth. I will
experiment a bit with that.

As I said, for other functions such as timed locks and certainly for
condition variables we'd have to look very carefully. I haven't really
thought of them, and I don't have any claim about them.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30 16:07                       ` Jens Gustedt
@ 2015-08-03 16:36                         ` Alexander Monakov
  2015-08-03 19:43                           ` Jens Gustedt
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Monakov @ 2015-08-03 16:36 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2160 bytes --]

> I will try to discuss this below.

Thanks for the detailed response.  Misc. comments below.
 
> > > With so little work in the critical section, it does
> > > not make sense to me that you would use something like a normal-type futex-y
> > > mutex.  Even a call/return to grab it gives you some overhead.  I'd expect you
> > > would use a fully inlined spinlock acquisition/release around the memory copy.
> > 
> > No, spinlocks are completely unusable in a POSIX libc that implements
> > priorities. They will deadlock whenever a lower-priority thread gets
> > preempted by a higher-priority one while holding the lock.

Thanks Rich for pointing that out.

[...]
> Now let us try to figure out what happens in the case that started
> this new discussion, namely that there is so much contention such that
> the probability of an EAGAIN-failure of the mutex call increases.

As long as we're talking in terms of futexes, it's EWOULDBLOCK, not EAGAIN.
 
> In fact even if inside the critical section the application itself
> does nothing, the lock code first does the spinning. So regardless of
> the application, any thread does 100 spins (so some work) before it
> starts fighting to be put to sleep on the futex. None of the threads
> reaching that point, changes the value of of the counter, so all
> threads that reach the point and call futex_wait "simultaneously" know
> about the same actual value.
> 
> The value can only change by new threads entering the acquisition
> loop. There are no more such threads at a time than can be effectively
> be scheduled. These will all spend some time spinning, so there can't
> be a bad cascade of such threads that change the counter.

OK — initially I missed that spinning a bit before *each* futex_wait is going
to work well for you.

How do you imagine it to scale when the number of CPUs increases?  Naively, it
appears that the more CPUs are in contention, the more you would need to spin?
 
> In fact, there will certainly be a tradeoff between the number of
> spins and the overall waste in CPU time and bandwidth. I will
> experiment a bit with that.

Curious to hear what you find.

Thanks!
Alexander

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-08-03 16:36                         ` Alexander Monakov
@ 2015-08-03 19:43                           ` Jens Gustedt
  2015-08-03 20:05                             ` Isaac Dunham
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Gustedt @ 2015-08-03 19:43 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 5921 bytes --]

Am Montag, den 03.08.2015, 19:36 +0300 schrieb Alexander Monakov:
> > Now let us try to figure out what happens in the case that started
> > this new discussion, namely that there is so much contention such that
> > the probability of an EAGAIN-failure of the mutex call increases.
> 
> As long as we're talking in terms of futexes, it's EWOULDBLOCK, not EAGAIN.

Hm, yes I meant futex, but it really is EAGAIN that I observe. So the
man page seems out of sync with reality.

But be it, I think it is clear that we are talking about a situation
where the futex value is too noisy, so the threads have problems to
enter FUTEX_WAIT.

> > In fact even if inside the critical section the application itself
> > does nothing, the lock code first does the spinning. So regardless of
> > the application, any thread does 100 spins (so some work) before it
> > starts fighting to be put to sleep on the futex. None of the threads
> > reaching that point, changes the value of of the counter, so all
> > threads that reach the point and call futex_wait "simultaneously" know
> > about the same actual value.
> > 
> > The value can only change by new threads entering the acquisition
> > loop. There are no more such threads at a time than can be effectively
> > be scheduled. These will all spend some time spinning, so there can't
> > be a bad cascade of such threads that change the counter.
> 
> OK — initially I missed that spinning a bit before *each* futex_wait is going
> to work well for you.
> 
> How do you imagine it to scale when the number of CPUs increases?  Naively, it
> appears that the more CPUs are in contention, the more you would need to spin?

yes, but at least in my case the average number of iterations in the
spin loop was 1.1 or so, so there seems to be a lot of leeway.

> > In fact, there will certainly be a tradeoff between the number of
> > spins and the overall waste in CPU time and bandwidth. I will
> > experiment a bit with that.
> 
> Curious to hear what you find.

So here is my view now, after experimenting and improving my
specialized lock for the atomics, for the situation where the critical
section (CS) is extremely short (copy some cachelines). There are
three very distinct situations if there are a lot of threads running
for the lock:

(1) No contention, the fast path. Since the CS is so short, this is
still the large majority of the outcomes, about 80 % in my
experiments, even when a 100 threads hammer on my lock. If taking the
lock in this situation is extremely short, too, the probability of
this situation increases even. So there is really high interest to get
this done as quickly as possible. In my implementation this is
basically a cmpxchg instruction plus a conditional jump. Using __asm__
based atomics adds a comparison to that.

The time frame for that situation should be some cycles up to the
memory latency.

(2) Mild contention. This happens if one thread happens to ask for the
lock while another one is inside the CS. The probability that more
threads are in the same situation decreases rapidly. After a very
limited number of spins (with a call to a_spin()) things are done and
the lock is achieved. As said previously the number of spins could
theoretically depend on the number of real processors in the
system. But if we don't have too fancy hardware with hundreds of
cores, a good upper bound should do the trick.

The time frame here is perhaps some hundred cycles.

(3) Descheduling. This is the worst scenario (as Rich described
above). The thread T0 holding the lock is descheduled and the others
are running wild. If literally hundreds of threads are waiting to be
scheduled to compete for the lock, things go down the drain. For this
worst case scenario to happen, we don't need priorities or something
like that, it also happens by bad luck. In my humble tests the
probability was about 0.1 % of the cases, so this is real and has to
be taken care of. (In the presence of priorities and forced
descheduling the probability could go up.) For this situation we
definitively need non-active sleep in some form and I think futex_wait
is the ideal tool for it.

For the probability that a futex_wait succeeds, the argument is
similar to case (2). Only scheduled threads compete for that. In my
model they only change the futex value when they enter the CS, not
when they queue for going into futex_wait. So once P threads (for P
processors) are inside the CS, one of them will succeed the
futex_wait. Now another thread may be scheduled, enter the CS and thus
spoil the futex value, so we may observe some failures for the other
threads to achieve the wait, but eventually one will make it,
again. This could go on as long as there are threads arriving in the
CS. Once no more threads arrive, things calm down quickly. And as soon
as T0 is rescheduled and finishes the CS, this herd of threads is
nicely woken up, one after another.

Time frame here are some thousand cycles or a schedule time slice.

All of that to argue that situations (1) - (3) have very different
time frames and probabilities. The real figures surely depend on a lot
of factors (number of processors, relative length of the CS compared
to a time slice, memory latency) and are difficult to observe, the
observation here really changes the observed. But for a lot of
situations where we can limit the CS to some instructions, the figures
should be orders of magnitude apart, and so some heuristic to
distinguish the three cases should be good enough.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-08-03 19:43                           ` Jens Gustedt
@ 2015-08-03 20:05                             ` Isaac Dunham
  2015-08-04  5:49                               ` Jens Gustedt
  0 siblings, 1 reply; 20+ messages in thread
From: Isaac Dunham @ 2015-08-03 20:05 UTC (permalink / raw)
  To: musl

On Mon, Aug 03, 2015 at 09:43:27PM +0200, Jens Gustedt wrote:
> Am Montag, den 03.08.2015, 19:36 +0300 schrieb Alexander Monakov:
> > > Now let us try to figure out what happens in the case that started
> > > this new discussion, namely that there is so much contention such that
> > > the probability of an EAGAIN-failure of the mutex call increases.
> > 
> > As long as we're talking in terms of futexes, it's EWOULDBLOCK, not EAGAIN.
> 
> Hm, yes I meant futex, but it really is EAGAIN that I observe. So the
> man page seems out of sync with reality.

EWOULDBLOCK and EAGAIN are the same value (11), as specifically allowed
by POSIX.
However, there are certain uses where one is specified by the standard,
and others where the other is specified.

Thanks,
Isaac Dunham


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-08-03 20:05                             ` Isaac Dunham
@ 2015-08-04  5:49                               ` Jens Gustedt
  0 siblings, 0 replies; 20+ messages in thread
From: Jens Gustedt @ 2015-08-04  5:49 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1090 bytes --]

Am Montag, den 03.08.2015, 13:05 -0700 schrieb Isaac Dunham:
> On Mon, Aug 03, 2015 at 09:43:27PM +0200, Jens Gustedt wrote:
> > Am Montag, den 03.08.2015, 19:36 +0300 schrieb Alexander Monakov:
> > > > Now let us try to figure out what happens in the case that started
> > > > this new discussion, namely that there is so much contention such that
> > > > the probability of an EAGAIN-failure of the mutex call increases.
> > > 
> > > As long as we're talking in terms of futexes, it's EWOULDBLOCK, not EAGAIN.
> > 
> > Hm, yes I meant futex, but it really is EAGAIN that I observe. So the
> > man page seems out of sync with reality.
> 
> EWOULDBLOCK and EAGAIN are the same value (11), as specifically allowed
> by POSIX.

right, that explains it, thanks

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: New optimized normal-type mutex?
  2015-07-30  8:07           ` Jens Gustedt
  2015-07-30  9:10             ` Jens Gustedt
@ 2015-07-30 13:45             ` Rich Felker
  1 sibling, 0 replies; 20+ messages in thread
From: Rich Felker @ 2015-07-30 13:45 UTC (permalink / raw)
  To: musl

On Thu, Jul 30, 2015 at 10:07:34AM +0200, Jens Gustedt wrote:
> Am Mittwoch, den 29.07.2015, 20:10 -0400 schrieb Rich Felker:
> > On Thu, Jul 30, 2015 at 01:49:20AM +0200, Jens Gustedt wrote:
> > > Hm, could you be more specific about where this hurts?
> > > 
> > > In the code I have there is
> > > 
> > >         for (;val & lockbit;) {
> > >           __syscall(SYS_futex, loc, FUTEX_WAIT, val, 0);
> > >           val = atomic_load_explicit(loc, memory_order_consume);
> > >         }
> > > 
> > > so this should be robust against spurious wakeups, no?
> > 
> > The problem is that futex_wait returns immediately with EAGAIN if
> > *loc!=val, which happens very often if *loc is incremented or
> > otherwise changed on each arriving waiter.
> 
> Yes, sure, it may change. Whether or not this is often may depend, I
> don't think we can easily make a quantitative statement, here.

The same happened with the obvious implementation of cond vars, having
a waiter store its tid then wait on *val==tid with futex_wait. The
ratio of spurious wakes to intended wakes was something like 10:1. So
this is not a purely theoretical problem.

> In the case of atomics the critical section is extremely short, and
> the count, if it varies so much, should have a bit stabilized during
> the spinlock phase before coming to the futex part. That futex part is
> really only a last resort for the rare case that the thread that is
> holding the lock has been descheduled in the middle.

Spinning is near-useless whenever you have more contenders than cores.
It's an optimization for a special case (fewer contenders than cores)
but you don't want to rely on it.

> My current test case is having X threads hammer on one single
> location, X being up to some hundred. On my 2x2 hyperthreaded CPU for
> a reasonable number of threads (X = 16 or 32) I have an overall
> performance improvement of 30%, say, when using my version of the lock
> instead of the original musl one. The point of inversion where the
> original musl lock is better is at about 200 threads.

Interesting. I suspect this would change quickly if you increased the
amount of work done with the lock held (e.g. to more closely mimic
malloc). The nasty thing is that, as soon as you _do_ cause futex_wait
to start happening rather than just spinning, performance blows up
because each thread that needs to wait calls futex_wait several times
before it succeeds, taking up cpu time where the thread that holds the
lock could be running.

Rich

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-08-04  5:49 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-21 23:44 New optimized normal-type mutex? Rich Felker
2015-05-22  7:30 ` Jens Gustedt
2015-05-22  7:51   ` Rich Felker
2015-07-29 12:09 ` Joakim Sindholt
2015-07-29 22:11   ` Jens Gustedt
2015-07-29 23:30     ` Rich Felker
2015-07-29 23:49       ` Jens Gustedt
2015-07-30  0:10         ` Rich Felker
2015-07-30  8:07           ` Jens Gustedt
2015-07-30  9:10             ` Jens Gustedt
2015-07-30  9:36               ` Alexander Monakov
2015-07-30 10:00                 ` Jens Gustedt
2015-07-30 11:37                   ` Alexander Monakov
2015-07-30 13:46                     ` Rich Felker
2015-07-30 16:07                       ` Jens Gustedt
2015-08-03 16:36                         ` Alexander Monakov
2015-08-03 19:43                           ` Jens Gustedt
2015-08-03 20:05                             ` Isaac Dunham
2015-08-04  5:49                               ` Jens Gustedt
2015-07-30 13:45             ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).