mailing list of musl libc
* [PATCH] (V2) a new lock algorithm with lock value and CS counts in the same atomic int
@ 2017-06-16  7:11 Jens Gustedt
  2017-06-20 14:41 ` Alexander Monakov
  0 siblings, 1 reply; 3+ messages in thread
From: Jens Gustedt @ 2017-06-16  7:11 UTC (permalink / raw)
  To: musl

A variant of this new lock algorithm has been presented at SAC'16, see
https://hal.inria.fr/hal-01304108. A full version of that paper is
available at https://hal.inria.fr/hal-01236734.

The main motivation for this is to improve the safety of the basic lock
implementation in musl. This is achieved by squeezing the lock value and
the "waiter" count into a single int. An unlock operation then performs
exactly one memory transfer (a_fetch_add) and never touches the value
again, yet still detects whether a waiter has to be woken up.
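
To illustrate, the encoding of the two pieces of state in one int is
roughly the following (an explanatory sketch, not part of the patch;
the helper names are made up):

	#include <limits.h>

	/* v == 0            unlocked, no thread in the critical section (CS)
	 * v == n, n > 0     unlocked, n threads contending for the lock
	 * v == INT_MIN + n  locked, n threads counted in the CS,
	 *                   including the holder */
	static inline int is_locked(int v) { return v < 0; }
	static inline int cs_count(int v)  { return v < 0 ? v - INT_MIN : v; }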

Another effect could be improved performance, but don't take this too
seriously: these lock/unlock functions are probably never used in
performance-critical parts of libc.

Depending on the architecture, the performance of the previous lock
strategy was not always great under no or medium congestion. Now, in
the case of no congestion, there are exactly two memory operations, one
to lock and one to unlock. So if we were to use this same new strategy
also for user locks such as mutexes, semaphores or the locks used for
lock-full atomics, we might see a performance improvement.
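
For the uncontended case, the sequence of atomic operations boils down
to the following (an illustrative trace of the code in the patch below):

	/* uncontended trace, illustration only */
	__lock:   a_cas(l, 0, INT_MIN + 1)        /* succeeds: one atomic op */
	__unlock: a_fetch_add(l, -(INT_MIN + 1))  /* returns INT_MIN + 1,    */
	                                          /* so no __wake is needed  */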

Also, the previous algorithm could be terrible under high load. Key to
the better performance is, again, the combination of lock value and
"waiter" count in one atomic entity: this drastically reduces the
memory transfers that have to be performed under load. In particular,
our main acquisition loop changes this atomic int exactly once, namely
when the lock is acquired, so the mutual disturbance between acquiring
threads is reduced. Again, this is probably not an issue, because this
lock/unlock algorithm is *not* used in places that arbitrarily hammer
thousands of requests on the same poor lock. (A throughput test of
thread creation shows about 50000 threads created per second on my
machine, not enough for any of this to make a difference.)

The main price for the improved safety is a little bit larger code.

Also, under high congestion the scheduling behavior will differ from
the previous algorithm: a successful put-to-sleep may occur out of
order with respect to the arrival at the critical section. I am not
sure that this is very different from the previous behavior, nor that
the scheduling order of these lock primitives is anything an
application should ever count on.

For the patch itself, I found only one other place where knowledge of
the internals of the lock algorithm is used. Here I patch this usage
with the appropriate CAS, but this should perhaps be extracted as a
__trylock function or, maybe even better, a macro.
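
Such a helper might look like the following (a hypothetical sketch, not
part of the patch; the name __trylock is made up):

	/* Hypothetical sketch: try to take an uncontended lock in the
	   new encoding; returns 0 on success and the value that was
	   witnessed otherwise. This mirrors the a_cas now used in
	   pthread_detach.c. */
	static inline int __trylock(volatile int *l)
	{
		return a_cas(l, 0, INT_MIN + 1);
	}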

This is version 2 of the patch; it takes some of the remarks by
Alexander and Rich into account:

 - all lock values are now formulated with respect to INT_MIN
 - unlock now uses __wake to wake up waiters
 - a new inline function __futexwait, with the same complexity as
   __wake, is used to put threads to sleep
 - __unlock covers the case that no lock had been recorded
 - the fast path of __lock is moved outside the loop, in a prominent
   position
---
 src/internal/pthread_impl.h |  6 ++++++
 src/thread/__lock.c         | 42 +++++++++++++++++++++++++++++++++++++-----
 src/thread/pthread_detach.c |  2 +-
 3 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/src/internal/pthread_impl.h b/src/internal/pthread_impl.h
index 757b86ad..0622ad52 100644
--- a/src/internal/pthread_impl.h
+++ b/src/internal/pthread_impl.h
@@ -132,6 +132,12 @@ static inline void __wake(volatile void *addr, int cnt, int priv)
 	__syscall(SYS_futex, addr, FUTEX_WAKE|priv, cnt) != -ENOSYS ||
 	__syscall(SYS_futex, addr, FUTEX_WAKE, cnt);
 }
+static inline void __futexwait(volatile void *addr, int val, int priv)
+{
+	if (priv) priv = 128;
+	__syscall(SYS_futex, addr, FUTEX_WAIT|priv, val, 0) != -ENOSYS ||
+	__syscall(SYS_futex, addr, FUTEX_WAIT, val, 0);
+}
 
 void __acquire_ptc(void);
 void __release_ptc(void);
diff --git a/src/thread/__lock.c b/src/thread/__lock.c
index 0874c04a..56092240 100644
--- a/src/thread/__lock.c
+++ b/src/thread/__lock.c
@@ -2,14 +2,46 @@
 
 void __lock(volatile int *l)
 {
-	if (libc.threads_minus_1)
-		while (a_swap(l, 1)) __wait(l, l+1, 1, 1);
+	if (!libc.threads_minus_1) return;
+	/* fast path: INT_MIN for holding the lock, +1 to count this
+	   thread in the critical section. */
+	int current = a_cas(l, 0, INT_MIN + 1);
+	if (!current) return;
+	/* A first spin lock acquisition loop, for the case of
+	   medium congestion. */
+	for (unsigned i = 0; i < 10; ++i) {
+		if (current < 0) current -= INT_MIN + 1;
+		// assertion: current >= 0
+		int val = a_cas(l, current, INT_MIN + (current + 1));
+		if (val == current) return;
+		current = val;
+	}
+	// Spinning failed, so mark ourselves as being inside the CS.
+	current = a_fetch_add(l, 1) + 1;
+	/* The main lock acquisition loop for heavy congestion. The only
+	   change to the value performed inside that loop is a successful
+	   lock via the CAS that acquires the lock. */
+	for (;;) {
+		/* We can only go into wait if we know that somebody holds the
+		   lock and will eventually wake us up again. */
+		if (current < 0) {
+			__futexwait(l, current, 1);
+			current -= INT_MIN + 1;
+		}
+		/* assertion: current > 0, because the count
+		   includes us already. */
+		int val = a_cas(l, current, INT_MIN + current);
+		if (val == current) return;
+		current = val;
+	}
 }
 
 void __unlock(volatile int *l)
 {
-	if (l[0]) {
-		a_store(l, 0);
-		if (l[1]) __wake(l, 1, 1);
+	/* We have to check if l[0] had been touched at all. */
+	if (l[0] < 0) {
+		if (a_fetch_add(l, -(INT_MIN + 1)) != (INT_MIN + 1)) {
+			__wake(l, 1, 1);
+		}
 	}
 }
diff --git a/src/thread/pthread_detach.c b/src/thread/pthread_detach.c
index ed77f74d..818626ed 100644
--- a/src/thread/pthread_detach.c
+++ b/src/thread/pthread_detach.c
@@ -6,7 +6,7 @@ int __pthread_join(pthread_t, void **);
 static int __pthread_detach(pthread_t t)
 {
 	/* Cannot detach a thread that's already exiting */
-	if (a_swap(t->exitlock, 1))
+	if (a_cas(t->exitlock, 0, -INT_MAX))
 		return __pthread_join(t, 0);
 	t->detached = 2;
 	__unlock(t->exitlock);
-- 
2.11.0



* Re: [PATCH] (V2) a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-16  7:11 [PATCH] (V2) a new lock algorithm with lock value and CS counts in the same atomic int Jens Gustedt
@ 2017-06-20 14:41 ` Alexander Monakov
  2017-06-20 15:36   ` Jens Gustedt
  0 siblings, 1 reply; 3+ messages in thread
From: Alexander Monakov @ 2017-06-20 14:41 UTC (permalink / raw)
  To: musl; +Cc: Jens Gustedt

On Fri, 16 Jun 2017, Jens Gustedt wrote:
> --- a/src/thread/__lock.c
> +++ b/src/thread/__lock.c
> @@ -2,14 +2,46 @@
>  
>  void __lock(volatile int *l)
>  {
[...]
> +	// Spinning failed, so mark ourselves as being inside the CS.
> +	current = a_fetch_add(l, 1) + 1;
> +	/* The main lock acquisition loop for heavy congestion. The only
> +	   change to the value performed inside that loop is a successful
> +	   lock via the CAS that acquires the lock. */
> +	for (;;) {
> +		/* We can only go into wait if we know that somebody holds the
> +		   lock and will eventually wake us up again. */
> +		if (current < 0) {
> +			__futexwait(l, current, 1);

Here you reuse the value of 'current' after resuming from a futex-wait.
But if the lock is under contention, with high probability other
threads have modified the lock while we were sleeping, so the value of
'current' is likely stale, and the following a_cas is pretty much
guaranteed to fail. Wouldn't it be better to reload current = *l after
resuming?
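
Concretely, something like this (a sketch of the suggested change,
untested):

		if (current < 0) {
			__futexwait(l, current, 1);
			current = *l;	/* reload instead of guessing */
		}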

Alexander

> +			current -= INT_MIN + 1;
> +		}
> +		/* assertion: current > 0, because the count
> +		   includes us already. */
> +		int val = a_cas(l, current, INT_MIN + current);
> +		if (val == current) return;
> +		current = val;
> +	}
>  }



* Re: [PATCH] (V2) a new lock algorithm with lock value and CS counts in the same atomic int
  2017-06-20 14:41 ` Alexander Monakov
@ 2017-06-20 15:36   ` Jens Gustedt
  0 siblings, 0 replies; 3+ messages in thread
From: Jens Gustedt @ 2017-06-20 15:36 UTC (permalink / raw)
  To: musl


Hello Alexander,

On Tue, 20 Jun 2017 17:41:35 +0300 (MSK) Alexander Monakov
<amonakov@ispras.ru> wrote:

> > +		/* We can only go into wait if we know that somebody holds the
> > +		   lock and will eventually wake us up again. */
> > +		if (current < 0) {
> > +			__futexwait(l, current, 1);
> 
> Here you reuse the value of 'current' after resuming after a
> futex-wait. But if the lock is under contention, with high
> probability other threads have modified the lock while we were
> sleeping, so the value of 'current' is likely stale, and the
> following a_cas is pretty much guaranteed to fail. Wouldn't it be
> better to reload current = *l after resuming?

Not in my experience. The benchmarks that you can find in the
references I have given clearly show that this strategy is better than
the current lock strategy (and than a lot of variants that I tried).
These benchmarks were performed on two architectures that I think are
representative: x86 as one that locks the bus, and ARM as one that
uses a monitor. (And the gain on ARM is more important.)

From all I found, the real cost here is the bus transfer (caches are
probably not up to date anyhow), so doing a CAS compared to a load is
not much more expensive. On the other hand, if we guess "current"
correctly, we are done. Also, this guess is not as bad as it might
appear. Coming out of the wait should usually happen after a wakeup,
where just the current lock holder left the CS. So the educated guess
for that situation, that the lock is free and there is one thread less
in the CS, works quite well. And if it doesn't, and all other threads
in the CS are in kernel sleep, the next iteration does the trick.
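
To make the arithmetic concrete (a worked example with made-up
numbers):

	/* Suppose we went to sleep with current == INT_MIN + 2: the lock
	   is held and two threads are counted in the CS (the holder and
	   us). The holder unlocks: a_fetch_add(l, -(INT_MIN + 1)) leaves
	   *l == 1. Our guess after waking, current -= INT_MIN + 1, also
	   yields 1, so the CAS a_cas(l, 1, INT_MIN + 1) succeeds without
	   an extra load. */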

So in the usual case you'd pay one forced load plus one CAS, against a
single CAS in the good case and two CAS in the bad one. And after
coming back from a kernel sleep, you have other things to worry about
(e.g. reloading all your data).

Again, I'd like to emphasize that the performance of this part is not
very important. I don't know of a program that could hammer on a lock
that uses this code inside musl hard enough that many threads get
stuck in that congestion loop.

Thanks
Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::


