mailing list of musl libc
* [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
@ 2017-06-16  7:11 Jens Gustedt
  2017-06-18 10:01 ` Alexander Monakov
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Jens Gustedt @ 2017-06-16  7:11 UTC (permalink / raw)
  To: musl


A variant of this new lock algorithm has been presented at SAC'16, see
https://hal.inria.fr/hal-01304108. A full version of that paper is
available at https://hal.inria.fr/hal-01236734.

The main motivation of this is to improve on the safety of the basic lock
implementation in musl. This is achieved by squeezing lock value and
"waiter" count into a single int. Thereby an unlock operation does
exactly one memory transfer (a_fetch_add) and never touches the value
again, but still detects if a waiter has to be woken up.

Another effect could be improved performance, but don't take this too
seriously: these lock/unlock functions are probably never used in
performance-critical parts of libc.

Depending on the architecture, the performance of the previous lock
strategy was not always so great under no or medium congestion. Now, in
the case of no congestion, there are exactly two memory operations, one to
lock and one to unlock. So if we would start to use this same new strategy
also for user locks such as mutexes, semaphores or the locks used for
lockfull atomics, we might see a performance improvement.

Also, the previous algorithm could be terrible under high load. Key to
the better performance is, again, the combination of lock value and
"waiter" count into one atomic entity. This drastically reduces the
memory transfers that have to be performed under load. In particular our
main acquisition loop changes this atomic int exactly once, namely when
the lock is acquired, and so the mutual disturbance between acquiring
threads is reduced. Again, this is probably not an issue, because this
lock/unlock algorithm is *not* used in places that arbitrarily hammer
thousands of requests on the same poor lock. (A throughput test in thread
creation shows about 50000 threads created per second on my machine, not
enough that all of this could make a difference.)

The main price for the improved safety is a little bit larger code.

Also under high congestion, the scheduling behavior will be different
compared to the previous algorithm. In that case, a successful
put-to-sleep may appear out of order compared to the arrival in the
critical section. I am not sure that this is very different from the
previous behavior, nor that scheduling order for these lock primitives is
anything that an application should ever count on.

For the patch itself, I found only one other place where knowledge of the
internals of the lock algorithm is used. Here I patch this usage with the
appropriate CAS, but this should perhaps be extracted as a __trylock
function or, maybe even better, macro.
---
 src/thread/__lock.c         | 50 ++++++++++++++++++++++++++++++++++++++++-----
 src/thread/pthread_detach.c |  2 +-
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/src/thread/__lock.c b/src/thread/__lock.c
index 0874c04a..e970a98c 100644
--- a/src/thread/__lock.c
+++ b/src/thread/__lock.c
@@ -1,15 +1,55 @@
 #include "pthread_impl.h"
 
+
+// INT_MIN for the lock, +1 for the count of this thread in the
+// critical section, has it that the difference between locked and
+// unlocked is ±INT_MAX.
+#if (INT_MIN+1) != -INT_MAX
+#error we need -INT_MAX to be INT_MIN+1
+#endif
+
 void __lock(volatile int *l)
 {
-	if (libc.threads_minus_1)
-		while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
+	/* This test is mostly useless, now. Leaving it to the first CAS is
+	   probably just as efficient. */
+	if (libc.threads_minus_1) {
+		int current = 0;
+		/* A first spin lock acquisition loop, for the case of
+		   no or medium congestion. */
+		for (unsigned i = 0; i < 10; ++i) {
+			/* This is purely speculative. */
+			if (current < 0) current += INT_MAX;
+			// assertion: current >= 0
+			int val = a_cas(l, current, current - INT_MAX);
+			if (val == current) return;
+			current = val;
+		}
+		// Spinning failed, so mark ourselves as being inside the CS.
+		current = a_fetch_add(l, 1) + 1;
+		/* The main lock acquisition loop for heavy congestion. The only
+		   change to the value performed inside that loop is a successful
+		   lock via the CAS that acquires the lock. */
+		for (;;) {
+			/* We can only go into wait, if we know that somebody holds the
+			   lock and will eventually wake us up, again. */
+			if (current < 0) {
+				__syscall(SYS_futex, l, FUTEX_WAIT|FUTEX_PRIVATE, current, 0) != -ENOSYS
+					|| __syscall(SYS_futex, l, FUTEX_WAIT, current, 0);
+				/* This is purely speculative. */
+				current += INT_MAX;
+			}
+			/* assertion: current > 0, because the count
+			   includes us already. */
+			int val = a_cas(l, current, INT_MIN + current);
+			if (val == current) return;
+			current = val;
+		}
+	}
 }
 
 void __unlock(volatile int *l)
 {
-	if (l[0]) {
-		a_store(l, 0);
-		if (l[1]) __wake(l, 1, 1);
+	if (a_fetch_add(l, INT_MAX) != -INT_MAX) {
+		__syscall(SYS_futex, l, FUTEX_WAKE|FUTEX_PRIVATE, 1);
 	}
 }
diff --git a/src/thread/pthread_detach.c b/src/thread/pthread_detach.c
index ed77f74d..818626ed 100644
--- a/src/thread/pthread_detach.c
+++ b/src/thread/pthread_detach.c
@@ -6,7 +6,7 @@ int __pthread_join(pthread_t, void **);
 static int __pthread_detach(pthread_t t)
 {
 	/* Cannot detach a thread that's already exiting */
-	if (a_swap(t->exitlock, 1))
+	if (a_cas(t->exitlock, 0, -INT_MAX))
 		return __pthread_join(t, 0);
 	t->detached = 2;
 	__unlock(t->exitlock);
-- 
2.11.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-16  7:11 [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int Jens Gustedt
@ 2017-06-18 10:01 ` Alexander Monakov
  2017-06-18 11:10   ` Jens Gustedt
  2017-06-18 13:40 ` Alexander Monakov
  2017-06-18 15:01 ` Rich Felker
  2 siblings, 1 reply; 16+ messages in thread
From: Alexander Monakov @ 2017-06-18 10:01 UTC (permalink / raw)
  To: musl; +Cc: Jens Gustedt

On Fri, 16 Jun 2017, Jens Gustedt wrote:
>  void __lock(volatile int *l)
>  {
> -	if (libc.threads_minus_1)
> -		while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
> +	/* This test is mostly useless, now. Leaving it to the first CAS is
> +	   probably just as efficient. */
> +	if (libc.threads_minus_1) {
[...]
>  void __unlock(volatile int *l)
>  {
> -	if (l[0]) {
> -		a_store(l, 0);
> -		if (l[1]) __wake(l, 1, 1);
> +	if (a_fetch_add(l, INT_MAX) != -INT_MAX) {
> +		__syscall(SYS_futex, l, FUTEX_WAKE|FUTEX_PRIVATE, 1);
>  	}
>  }

This looks wrong in the single-threaded case: __lock doesn't touch the
lock, but __unlock modifies it unconditionally.

Alexander



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 10:01 ` Alexander Monakov
@ 2017-06-18 11:10   ` Jens Gustedt
  2017-06-18 11:49     ` Alexander Monakov
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Gustedt @ 2017-06-18 11:10 UTC (permalink / raw)
  To: musl


Hello Alexander,

On Sun, 18 Jun 2017 13:01:18 +0300 (MSK) Alexander Monakov
<amonakov@ispras.ru> wrote:

> On Fri, 16 Jun 2017, Jens Gustedt wrote:
> >  void __lock(volatile int *l)
> >  {
> > -	if (libc.threads_minus_1)
> > -		while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
> > +	/* This test is mostly useless, now. Leaving it to the
> > first CAS is
> > +	   probably just as efficient. */
> > +	if (libc.threads_minus_1) {  
> [...]
> >  void __unlock(volatile int *l)
> >  {
> > -	if (l[0]) {
> > -		a_store(l, 0);
> > -		if (l[1]) __wake(l, 1, 1);
> > +	if (a_fetch_add(l, INT_MAX) != -INT_MAX) {
> > +		__syscall(SYS_futex, l, FUTEX_WAKE|FUTEX_PRIVATE,
> > 1); }
> >  }  
> 
> This looks wrong in single-threaded case, __lock doesn't touch the
> lock, but __unlock modifies it unconditionally.

Right, there should probably be the same test as in the lock case. Or
we should drop that test altogether; I don't think it still serves
much purpose here. It just trades one memory access against
another.


Thanks
Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 11:10   ` Jens Gustedt
@ 2017-06-18 11:49     ` Alexander Monakov
  2017-06-18 14:45       ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread
From: Alexander Monakov @ 2017-06-18 11:49 UTC (permalink / raw)
  To: musl

On Sun, 18 Jun 2017, Jens Gustedt wrote:
> > This looks wrong in single-threaded case, __lock doesn't touch the
> > lock, but __unlock modifies it unconditionally.
> 
> Right, probably there should be the same test as for the lock case.

Checking libc.threads_minus_1 on the unlock path won't work: as threads
exit, it may become zero even though it wasn't when the lock was acquired.

> Or we should drop that test all along. I don't think that it still serves
> much purpose here. This is just trading one memory access against another.

It trades a simple read access against an atomic modification, though.

I think the fastpath in __unlock can check the value of the lock against 0,
exiting immediately if equal.

Alexander



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-16  7:11 [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int Jens Gustedt
  2017-06-18 10:01 ` Alexander Monakov
@ 2017-06-18 13:40 ` Alexander Monakov
  2017-06-18 14:53   ` Rich Felker
  2017-06-18 15:01 ` Rich Felker
  2 siblings, 1 reply; 16+ messages in thread
From: Alexander Monakov @ 2017-06-18 13:40 UTC (permalink / raw)
  To: musl; +Cc: Jens Gustedt

Some style suggestions below.

On Fri, 16 Jun 2017, Jens Gustedt wrote:
> --- a/src/thread/__lock.c
> +++ b/src/thread/__lock.c
> @@ -1,15 +1,55 @@
>  #include "pthread_impl.h"
>  
> +
> +// INT_MIN for the lock, +1 for the count of this thread in the
> +// critical section, has it that the difference between locked and
> +// unlocked is ±INT_MAX.
> +#if (INT_MIN+1) != -INT_MAX
> +#error we need -INT_MAX to be INT_MIN+1
> +#endif

I suggest dropping this and spelling 'INT_MIN + current' as 'current - 1 -
INT_MAX' below.  (all you need is that INT_MIN <= -INT_MAX)

> +
>  void __lock(volatile int *l)
>  {
> -	if (libc.threads_minus_1)
> -		while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
> +	/* This test is mostly useless, now. Leaving it to the first CAS is
> +	   probably just as efficient. */
> +	if (libc.threads_minus_1) {

I suggest 'if (!libc.threads_minus_1) return;' and unindenting the rest of the
function.

> +		int current = 0;
> +		/* A first spin lock acquisition loop, for the case of
> +		   no or medium congestion. */
> +		for (unsigned i = 0; i < 10; ++i) {
> +			/* This is purely speculative. */

This comment, also duplicated below, doesn't look helpful.

> +			if (current < 0) current += INT_MAX;
> +			// assertion: current >= 0
> +			int val = a_cas(l, current, current - INT_MAX);
> +			if (val == current) return;
> +			current = val;

Might be nice to use a_spin here.

> +		}
> +		// Spinning failed, so mark ourselves as being inside the CS.
> +		current = a_fetch_add(l, 1) + 1;
> +		/* The main lock acquisition loop for heavy congestion. The only
> +		   change to the value performed inside that loop is a successful
> +		   lock via the CAS that acquires the lock. */
> +		for (;;) {
> +			/* We can only go into wait, if we know that somebody holds the
> +			   lock and will eventually wake us up, again. */
> +			if (current < 0) {
> +				__syscall(SYS_futex, l, FUTEX_WAIT|FUTEX_PRIVATE, current, 0) != -ENOSYS
> +					|| __syscall(SYS_futex, l, FUTEX_WAIT, current, 0);

It's probably better to put this into a new static inline function in
pthread_impl.h next to __wake (e.g. __futexwait) and use it here and in __wait.

> +				/* This is purely speculative. */
> +				current += INT_MAX;
> +			}
> +			/* assertion: current > 0, because the count
> +			   includes us already. */
> +			int val = a_cas(l, current, INT_MIN + current);
> +			if (val == current) return;
> +			current = val;
> +		}
> +	}
>  }
>  
>  void __unlock(volatile int *l)
>  {
> -	if (l[0]) {
> -		a_store(l, 0);
> -		if (l[1]) __wake(l, 1, 1);
> +	if (a_fetch_add(l, INT_MAX) != -INT_MAX) {
> +		__syscall(SYS_futex, l, FUTEX_WAKE|FUTEX_PRIVATE, 1);

This change lost FUTEX_PRIVATE fallback handling; I don't see any reason not to
continue using __wake here.

> --- a/src/thread/pthread_detach.c
> +++ b/src/thread/pthread_detach.c
> @@ -6,7 +6,7 @@ int __pthread_join(pthread_t, void **);
>  static int __pthread_detach(pthread_t t)
>  {
>  	/* Cannot detach a thread that's already exiting */
> -	if (a_swap(t->exitlock, 1))
> +	if (a_cas(t->exitlock, 0, -INT_MAX))
>  		return __pthread_join(t, 0);
>  	t->detached = 2;
>  	__unlock(t->exitlock);

A good way to catch this would be to introduce a struct for the lock (out of
scope of this patch, of course).

Alexander



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 11:49     ` Alexander Monakov
@ 2017-06-18 14:45       ` Rich Felker
  2017-06-18 14:58         ` Alexander Monakov
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2017-06-18 14:45 UTC (permalink / raw)
  To: musl

On Sun, Jun 18, 2017 at 02:49:54PM +0300, Alexander Monakov wrote:
> On Sun, 18 Jun 2017, Jens Gustedt wrote:
> > > This looks wrong in single-threaded case, __lock doesn't touch the
> > > lock, but __unlock modifies it unconditionally.
> > 
> > Right, probably there should be the same test as for the lock case.
> 
> Checking libc.threads_minus_1 on the unlock path won't work: as threads
> exit, it may become zero even though it wasn't when the lock was acquired.
> 
> > Or we should drop that test all along. I don't think that it still serves
> > much purpose here. This is just trading one memory access against another.
> 
> It trades a simple read access against an atomic modification, though.

Indeed. This will be the difference between 1 cycle and 25-100 cycles
on many archs, and much worse on old mips where ll/sc work by
trap-and-emulate.

> I think the fastpath in __unlock can check the value of the lock against 0,
> exiting immediately if equal.

Do you mean that would indicate that __lock was a nop because
libc.threads_minus_1 was 0 at lock time? That's how it works in the
existing implementation, so unless I missed some difference in the
logic I think it's safe.

Note that this assumes lock/unlock pairs always take place within a
single non-AS-safe libc function call, or if in an AS-safe function,
in a critical section where all signals are blocked. Otherwise it's
possible for the process to become MT between lock and unlock time.
Note that it's always possible to go the other direction (MT to
non-MT) while the lock is held, which is why you have to test
something that reflects the status at lock time, not at unlock time.

Rich



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 13:40 ` Alexander Monakov
@ 2017-06-18 14:53   ` Rich Felker
  2017-06-18 15:05     ` Alexander Monakov
  2017-06-18 17:40     ` Jens Gustedt
  0 siblings, 2 replies; 16+ messages in thread
From: Rich Felker @ 2017-06-18 14:53 UTC (permalink / raw)
  To: musl

On Sun, Jun 18, 2017 at 04:40:56PM +0300, Alexander Monakov wrote:
> Some style suggestions below.
> 
> On Fri, 16 Jun 2017, Jens Gustedt wrote:
> > --- a/src/thread/__lock.c
> > +++ b/src/thread/__lock.c
> > @@ -1,15 +1,55 @@
> >  #include "pthread_impl.h"
> >  
> > +
> > +// INT_MIN for the lock, +1 for the count of this thread in the
> > +// critical section, has it that the difference between locked and
> > +// unlocked is ±INT_MAX.
> > +#if (INT_MIN+1) != -INT_MAX
> > +#error we need -INT_MAX to be INT_MIN+1
> > +#endif
> 
> I suggest dropping this and spelling 'INT_MIN + current' as 'current - 1 -
> INT_MAX' below.  (all you need is that INT_MIN <= -INT_MAX)

I think this can be dropped entirely for musl's purposes and
essentially for all practical purposes.

> > +
> >  void __lock(volatile int *l)
> >  {
> > -	if (libc.threads_minus_1)
> > -		while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
> > +	/* This test is mostly useless, now. Leaving it to the first CAS is
> > +	   probably just as efficient. */
> > +	if (libc.threads_minus_1) {
> 
> I suggest 'if (!libc.threads_minus_1) return;' and unindenting the rest of the
> function.

Agree.

> > +		int current = 0;
> > +		/* A first spin lock acquisition loop, for the case of
> > +		   no or medium congestion. */
> > +		for (unsigned i = 0; i < 10; ++i) {
> > +			/* This is purely speculative. */
> 
> This comment, also duplicated below, doesn't look helpful.
> 
> > +			if (current < 0) current += INT_MAX;
> > +			// assertion: current >= 0
> > +			int val = a_cas(l, current, current - INT_MAX);
> > +			if (val == current) return;
> > +			current = val;
> 
> Might be nice to use a_spin here.

I think so. At one point (before we used volatile to model atomics) it
was probably necessary to keep the spin loop from being optimized out
entirely; now it's just likely to be lighter on cpu/bus.

> > +		}
> > +		// Spinning failed, so mark ourselves as being inside the CS.
> > +		current = a_fetch_add(l, 1) + 1;
> > +		/* The main lock acquisition loop for heavy congestion. The only
> > +		   change to the value performed inside that loop is a successful
> > +		   lock via the CAS that acquires the lock. */
> > +		for (;;) {
> > +			/* We can only go into wait, if we know that somebody holds the
> > +			   lock and will eventually wake us up, again. */
> > +			if (current < 0) {
> > +				__syscall(SYS_futex, l, FUTEX_WAIT|FUTEX_PRIVATE, current, 0) != -ENOSYS
> > +					|| __syscall(SYS_futex, l, FUTEX_WAIT, current, 0);
> 
> It's probably better to put this into a new static inline function in
> pthread_impl.h next to __wake (e.g. __futexwait) and use it here and in __wait.

Is there a reason __wait doesn't work?

> > +				/* This is purely speculative. */
> > +				current += INT_MAX;
> > +			}
> > +			/* assertion: current > 0, because the count
> > +			   includes us already. */
> > +			int val = a_cas(l, current, INT_MIN + current);
> > +			if (val == current) return;
> > +			current = val;
> > +		}
> > +	}
> >  }
> >  
> >  void __unlock(volatile int *l)
> >  {
> > -	if (l[0]) {
> > -		a_store(l, 0);
> > -		if (l[1]) __wake(l, 1, 1);
> > +	if (a_fetch_add(l, INT_MAX) != -INT_MAX) {
> > +		__syscall(SYS_futex, l, FUTEX_WAKE|FUTEX_PRIVATE, 1);
> 
> This change lost FUTEX_PRIVATE fallback handling; I don't see any reason not to
> continue using __wake here.

Agreed.

> > --- a/src/thread/pthread_detach.c
> > +++ b/src/thread/pthread_detach.c
> > @@ -6,7 +6,7 @@ int __pthread_join(pthread_t, void **);
> >  static int __pthread_detach(pthread_t t)
> >  {
> >  	/* Cannot detach a thread that's already exiting */
> > -	if (a_swap(t->exitlock, 1))
> > +	if (a_cas(t->exitlock, 0, -INT_MAX))
> >  		return __pthread_join(t, 0);
> >  	t->detached = 2;
> >  	__unlock(t->exitlock);
> 
> A good way to catch this would be to introduce a struct for the lock (out of
> scope of this patch, of course).

I see reasons for and against that. I'm not strongly against it but
just making a policy not to poke directly at these locks outside of
"lock implementation" files (perhaps adding a macro or inline function
to libc.h to do this instead?) seems like it would work just as well.
It was a poor choice to write the above direct a_cas to begin with, I
think.

Rich



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 14:45       ` Rich Felker
@ 2017-06-18 14:58         ` Alexander Monakov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexander Monakov @ 2017-06-18 14:58 UTC (permalink / raw)
  To: musl

On Sun, 18 Jun 2017, Rich Felker wrote:
> > I think the fastpath in __unlock can check the value of the lock against 0,
> > exiting immediately if equal.
> 
> Do you mean that would indicate that __lock was a nop because
> libc.threads_minus_1 was 0 at lock time?

Yes, if we didn't elide in __lock, we must observe a negative value in __unlock.

Alexander



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-16  7:11 [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int Jens Gustedt
  2017-06-18 10:01 ` Alexander Monakov
  2017-06-18 13:40 ` Alexander Monakov
@ 2017-06-18 15:01 ` Rich Felker
  2 siblings, 0 replies; 16+ messages in thread
From: Rich Felker @ 2017-06-18 15:01 UTC (permalink / raw)
  To: musl

On Fri, Jun 16, 2017 at 09:11:35AM +0200, Jens Gustedt wrote:
> A variant of this new lock algorithm has been presented at SAC'16, see
> https://hal.inria.fr/hal-01304108. A full version of that paper is
> available at https://hal.inria.fr/hal-01236734.
> 
> The main motivation of this is to improve on the safety of the basic lock
> implementation in musl. This is achieved by squeezing lock value and
> "waiter" count into a single int. Thereby an unlock operation does
> exactly one memory transfer (a_fetch_add) and never touches the value
> again, but still detects if a waiter has to be woken up.

Thanks! In the process of reviewing.

> Another effect could be improved performance, but don't take this too
> seriously, these lock/unlock functions are probably never used in
> performance critical parts of libc.

An equivalent inlined/open-coded version of the lock is used in
malloc, which is the only place I see where it's performance-critical.
If it really does perform significantly better it would make sense to
see if we can get the benefit in malloc too; that may be harder
because it would sacrifice the inlining unless we wanted to try to get
the larger new code inlined (which the compiler may not want to do, or
which may affect other levels of inlining in the same file).

> Depending on the architecture, the performance of the previous lock
> strategy was not always so great under no or medium congestion. Now, in
> case of no congestion, there are exactly two memory operations, one to
> lock and one to unlock. So if we would start to use this same new strategy
> also for user locks such as mutexes, semaphores or the locks used for
> lockfull atomics, we might see a performance improvement.

It definitely can't be used for mutexes that have to record an owner
(recursive, error-checking, and/or robust), or for semaphores, unless
we wanted to depend on double-word atomics. There's just too much
state in all these cases to fit in 32 bits. Normal mutexes could do it
though. Another case that would be very nice to optimize, but where I
don't see a way, is stdio FILE locks. They're recursive and thus need
to record an owner. The current implementation works more like an
errorchecking mutex abused as a recursive one (see also: an old SO
question on that topic), having the caller be responsible for knowing
whether it needs to unlock from whether it was already owned by the
calling thread at lock time.

Rich



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 14:53   ` Rich Felker
@ 2017-06-18 15:05     ` Alexander Monakov
  2017-06-18 16:04       ` Rich Felker
  2017-06-18 17:40     ` Jens Gustedt
  1 sibling, 1 reply; 16+ messages in thread
From: Alexander Monakov @ 2017-06-18 15:05 UTC (permalink / raw)
  To: musl

On Sun, 18 Jun 2017, Rich Felker wrote:
> > I suggest dropping this and spelling 'INT_MIN + current' as 'current - 1 -
> > INT_MAX' below.  (all you need is that INT_MIN <= -INT_MAX)
> 
> I think this can be dropped entirely for musl's purposes and
> essentially for all practical purposes.

In any case, the second half of that suggestion stands: using INT_MAX
consistently everywhere should be better for clarity.

> > > +				__syscall(SYS_futex, l, FUTEX_WAIT|FUTEX_PRIVATE, current, 0) != -ENOSYS
> > > +					|| __syscall(SYS_futex, l, FUTEX_WAIT, current, 0);
> > 
> > It's probably better to put this into a new static inline function in
> > pthread_impl.h next to __wake (e.g. __futexwait) and use it here and in __wait.
> 
> Is there a reason __wait doesn't work?

__wait doesn't fit here at all, for instance it manipulates a separate int with
waiters count.

Alexander



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 15:05     ` Alexander Monakov
@ 2017-06-18 16:04       ` Rich Felker
  2017-06-18 17:31         ` Jens Gustedt
  2017-06-18 19:32         ` Jens Gustedt
  0 siblings, 2 replies; 16+ messages in thread
From: Rich Felker @ 2017-06-18 16:04 UTC (permalink / raw)
  To: musl

On Sun, Jun 18, 2017 at 06:05:55PM +0300, Alexander Monakov wrote:
> On Sun, 18 Jun 2017, Rich Felker wrote:
> > > I suggest dropping this and spelling 'INT_MIN + current' as 'current - 1 -
> > > INT_MAX' below.  (all you need is that INT_MIN <= -INT_MAX)
> > 
> > I think this can be dropped entirely for musl's purposes and
> > essentially for all practical purposes.
> 
> In any case, the second half of that suggestion stands: using INT_MAX
> consistently everywhere should be better for clarity.

Yes, that seems better.

> > > > +				__syscall(SYS_futex, l, FUTEX_WAIT|FUTEX_PRIVATE, current, 0) != -ENOSYS
> > > > +					|| __syscall(SYS_futex, l, FUTEX_WAIT, current, 0);
> > > 
> > > It's probably better to put this into a new static inline function in
> > > pthread_impl.h next to __wake (e.g. __futexwait) and use it here and in __wait.
> > 
> > Is there a reason __wait doesn't work?
> 
> __wait doesn't fit here at all, for instance it manipulates a separate int with
> waiters count.

It accepts a null pointer for the waiters count, so that it can be
used in contexts with or without one. Any cost of the additional
conditional branches is dominated by syscall time.

Rich



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 16:04       ` Rich Felker
@ 2017-06-18 17:31         ` Jens Gustedt
  2017-06-18 19:32         ` Jens Gustedt
  1 sibling, 0 replies; 16+ messages in thread
From: Jens Gustedt @ 2017-06-18 17:31 UTC (permalink / raw)
  To: musl


Hello,

On Sun, 18 Jun 2017 12:04:59 -0400 Rich Felker <dalias@libc.org> wrote:

> On Sun, Jun 18, 2017 at 06:05:55PM +0300, Alexander Monakov wrote:
> > On Sun, 18 Jun 2017, Rich Felker wrote:  
>  [...]  
> > > 
> > > I think this can be dropped entirely for musl's purposes and
> > > essentially for all practical purposes.  
> > 
> > In any case, the second half of that suggestion stands: using
> > INT_MAX consistently everywhere should be better for clarity.  
> 
> Yes, that seems better.

I would slightly prefer to go with a version that uses INT_MIN
everywhere. This would make it clearer that the value is composed as
"INT_MIN + CScount" if the lock is held, and just "CScount" if it
isn't. A sort of sign-and-magnitude representation.

> > > Is there a reason __wait doesn't work?  
> > 
> > __wait doesn't fit here at all, for instance it manipulates a
> > separate int with waiters count.  
> 
> It accepts a null pointer for the waiters count, so that it can be
> used in contexts with or without one. Any cost of the additional
> conditional branches is dominated by syscall time.

Ah, ok, I'll look into that.

Thanks
Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 14:53   ` Rich Felker
  2017-06-18 15:05     ` Alexander Monakov
@ 2017-06-18 17:40     ` Jens Gustedt
  1 sibling, 0 replies; 16+ messages in thread
From: Jens Gustedt @ 2017-06-18 17:40 UTC (permalink / raw)
  To: musl



On Sun, 18 Jun 2017 10:53:39 -0400 Rich Felker <dalias@libc.org> wrote:

> > A good way to catch this would be to introduce a struct for the
> > lock (out of scope of this patch, of course).  
> 
> I see reasons for and against that. I'm not strongly against it but
> just making a policy not to poke directly at these locks outside of
> "lock implementation" files (perhaps adding a macro or inline function
> to libc.h to do this instead?) seems like it would work just as well.

I would be in favor of having a struct. This would make reviewing this
code much easier. I would also be much in favor of inline functions or
macros that capture the fastpath of these functions. In the original
implementation for stdatomic I had that, and it worked quite well:
just an atomic instruction for the fastpath, and the function call
overhead only when there is not much to lose, performance-wise.

> It was a poor choice to write the above direct a_cas to begin with, I
> think.

I am not sure that I get what you want to say. You mean you would like
to abstract that into something like a __trylock function?

Thanks
Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 16:04       ` Rich Felker
  2017-06-18 17:31         ` Jens Gustedt
@ 2017-06-18 19:32         ` Jens Gustedt
  2017-06-18 20:20           ` Rich Felker
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Gustedt @ 2017-06-18 19:32 UTC (permalink / raw)
  To: musl


Hello Rich,

On Sun, 18 Jun 2017 12:04:59 -0400 Rich Felker <dalias@libc.org> wrote:

> > > Is there a reason __wait doesn't work?  
> > 
> > __wait doesn't fit here at all, for instance it manipulates a
> > separate int with waiters count.  
> 
> It accepts a null pointer for the waiters count, so that it can be
> used in contexts with or without one. Any cost of the additional
> conditional branches is dominated by syscall time.

Looking into it, I don't agree with you, Rich. Even with waiters set
to 0 it would do 100 spins before going into the syscall. That is
quite a waste here, because we are either just coming out of a spin
(at the first iteration) or we already spun for as long as the value
was positive.

I don't see why the cost of 100 spins would be dominated by the
syscall. If I remember correctly, my benchmarks showed that an
unsuccessful syscall costs about 10 memory operations. This is why the
magic number for the initial spin is set to 10.

It might be beneficial to do a_spin for a while if we know that the
critical sections protected by this lock are really short, just a few
cycles. But 100 is far too big a number, and in my benchmarks I found
little indication of a benefit for it.
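
Sketched in code, the bounded initial spin I am describing looks
something like this (self-contained with C11 atomics; the name and the
surrounding loop are illustrative, not the patch itself):

```c
#include <stdatomic.h>

/* Spin a small, fixed number of times -- roughly the measured cost of
 * a failed futex syscall in memory operations -- before giving up and
 * sleeping in the kernel.  Returns 1 if the lock word became 0 (free)
 * while spinning, 0 if the caller should proceed to the futex wait. */
static int spin_before_wait(atomic_int *l)
{
    for (int i = 0; i < 10; ++i)
        if (atomic_load_explicit(l, memory_order_relaxed) == 0)
            return 1;   /* free: go back and retry the CAS */
    return 0;           /* still held: fall through to the syscall */
}
```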

If we want code sharing with the rest of musl (which we should) I like
Alexander's idea of a __futexwait inline function much better.


Thanks
Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 19:32         ` Jens Gustedt
@ 2017-06-18 20:20           ` Rich Felker
  2017-06-18 20:38             ` Alexander Monakov
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2017-06-18 20:20 UTC (permalink / raw)
  To: musl

On Sun, Jun 18, 2017 at 09:32:09PM +0200, Jens Gustedt wrote:
> Hello Rich,
> 
> On Sun, 18 Jun 2017 12:04:59 -0400 Rich Felker <dalias@libc.org> wrote:
> 
> > > > Is there a reason __wait doesn't work?  
> > > 
> > > __wait doesn't fit here at all, for instance it manipulates a
> > > separate int with waiters count.  
> > 
> > It accepts a null pointer for the waiters count, so that it can be
> > used in contexts with or without one. Any cost of the additional
> > conditional branches is dominated by syscall time.
> 
> Looking into it, I don't agree with you, Rich. Even with waiters set
> to 0 it would do 100 spins before going into the syscall. That is
> quite a waste here, because we are either just coming out of a spin
> (at the first iteration) or we already spun for as long as the value
> was positive.

I haven't reviewed that logic yet, so perhaps that's what I'm missing.
I'll follow up with more after I understand that code better.

> I don't see why the cost of 100 spins would be dominated by the
> syscall. If I remember correctly, my benchmarks showed that an
> unsuccessful syscall costs about 10 memory operations. This is why
> the magic number for the initial spin is set to 10.

A syscall takes at least 500 cycles on an extremely fast system, and
more like 1500-2000 on many. 100 spins was determined empirically and
is probably near the break-even point. If it's a bad choice, it's
probably also bad for other users of __wait.

> It might be beneficial to do a_spin for a while if we know that the
> critical sections protected by this lock are really short, just a few
> cycles. But 100 is far too big a number, and in my benchmarks I found
> little indication of a benefit for it.
> 
> If we want code sharing with the rest of musl (which we should) I like
> Alexander's idea of a __futexwait inline function much better.

I don't think there's any value to making it inline. If it could be a
single syscall, that would be one thing, but with the fallback for old
systems that lack private futex, it's just a gratuitously large inline
chunk that's likely to interfere with inlining/optimization of the
caller, and certainly has no potential to improve performance (since
there's a syscall involved).

The original intent was that __wait be the "__futexwait" primitive
Alexander had in mind. If it's too heavy as it is now, perhaps that's
a mistake that affects other usage too. I was never very happy with
the shaky theoretical foundations of the spinning, but I did feel like
I at least got it to the point where it didn't have pathologically bad
cases.

Rich


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-18 20:20           ` Rich Felker
@ 2017-06-18 20:38             ` Alexander Monakov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexander Monakov @ 2017-06-18 20:38 UTC (permalink / raw)
  To: musl

On Sun, 18 Jun 2017, Rich Felker wrote:
> > If we want code sharing with the rest of musl (which we should) I like
> > Alexander's idea of a __futexwait inline function much better.
> 
> I don't think there's any value to making it inline. If it could be a
> single syscall, that would be one thing, but with the fallback for old
> systems that lack private futex, it's just a gratuitously large inline
> chunk that's likely to interfere with inlining/optimization of the
> caller, and certainly has no potential to improve performance (since
> there's a syscall involved).

The original suggestion was to move two syscalls into a static inline function,
with contents mirroring those of __wake. If that's too large, then so is __wake
(and all you've said applies equally to __wake, which is static inline).
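
For concreteness, a static inline wait primitive mirroring the structure
of __wake might look like the sketch below; it uses the public syscall()
wrapper and <linux/futex.h> constants rather than musl's internal
__syscall, so the name and the exact fallback test are assumptions:

```c
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wait on *addr while it still holds val.  Try the private-futex op
 * first and fall back to the shared op on old kernels that lack
 * FUTEX_PRIVATE support, mirroring musl's __wake. */
static inline void futexwait(volatile int *addr, int val, int priv)
{
    int op = priv ? (FUTEX_WAIT | FUTEX_PRIVATE_FLAG) : FUTEX_WAIT;
    if (syscall(SYS_futex, addr, op, val, (void *)0) == -1 &&
        errno == ENOSYS)
        syscall(SYS_futex, addr, FUTEX_WAIT, val, (void *)0);
}
```

If *addr no longer equals val, the kernel returns immediately with
EAGAIN, so the caller never blocks on a stale value.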

Alexander


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-06-18 20:38 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-16  7:11 [PATCH] a new lock algorithm with lock value and CS counts in the same atomic int Jens Gustedt
2017-06-18 10:01 ` Alexander Monakov
2017-06-18 11:10   ` Jens Gustedt
2017-06-18 11:49     ` Alexander Monakov
2017-06-18 14:45       ` Rich Felker
2017-06-18 14:58         ` Alexander Monakov
2017-06-18 13:40 ` Alexander Monakov
2017-06-18 14:53   ` Rich Felker
2017-06-18 15:05     ` Alexander Monakov
2017-06-18 16:04       ` Rich Felker
2017-06-18 17:31         ` Jens Gustedt
2017-06-18 19:32         ` Jens Gustedt
2017-06-18 20:20           ` Rich Felker
2017-06-18 20:38             ` Alexander Monakov
2017-06-18 17:40     ` Jens Gustedt
2017-06-18 15:01 ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
