mailing list of musl libc
* Multi-threaded performance progress
@ 2014-08-26  3:43 Rich Felker
  2014-08-26  7:04 ` Jens Gustedt
  0 siblings, 1 reply; 15+ messages in thread
From: Rich Felker @ 2014-08-26  3:43 UTC (permalink / raw)
  To: musl

This release cycle looks like it's going to be huge for multi-threaded
performance issues. So far the cumulative improvement on my main
development system, as measured by the cond_bench.c by Timo Teräs, is
from ~250k signals in 2 seconds up to ~3.7M signals in 2 seconds.
That's comparable to what glibc gets on similar hardware with a cond
var implementation that's much less correct. The improvements are a
result of adding private futex support, redesigning the cond var
implementation, and improvements to the spin-before-futex-wait
behavior.

Semaphore performance has also improved, up from fewer than 500k
wait/post operations to ~12M, mostly due to spin-before-futex-wait.
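
For reference, spin-before-futex-wait is roughly the following shape (a
simplified sketch, not musl's actual __wait, which also tracks a waiter
count and the private-futex flag):

/* Spin briefly on the futex word in userspace and only fall back to
 * the FUTEX_WAIT syscall if the value still has not changed; this
 * avoids a syscall whenever the wake arrives within the spin. */
static void wait_on(volatile int *addr, int val)
{
	int spins = 100;
	while (spins-- > 0)
		if (*addr != val) return;
	while (*addr == val)
		__syscall(SYS_futex, addr, FUTEX_WAIT, val, 0);
}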

The above results are all based on micro-benchmarks which are
potentially meaningless to real-world applications, so I'd be
interested in seeing any higher-level or real-application-based
comparisons of the old and new code.

There is one remaining performance issue I still want to look into
fixing, possibly during this release cycle: when a thread repeatedly
takes and releases a lock on which other threads are waiting, it makes
a futex wake syscall on each unlock, despite only the first one being
necessary. I have a design for avoiding this on internal locks, but
it's less obvious how to do it for mutexes where storage is tight and
self-synchronized destruction is possible.
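
For reference, the unlock path in question has roughly this shape (the
same helper appears in the cond var patch later in this thread):

static void unlock(volatile int *l)
{
	/* 2 means "locked with possible waiters"; as long as waiters keep
	 * re-marking the lock as contended, every unlock pays for a
	 * FUTEX_WAKE, even when the waiter woken by the previous unlock
	 * is already runnable and about to retake the lock. */
	if (a_swap(l, 0) == 2)
		__wake(l, 1, 1);
}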

We're near the end of my planned time frame for this release cycle,
but I'm still interested in working with Jens to get C11 threads into
this release if possible, so I'll probably extend it for a while
still.

Rich



* Re: Multi-threaded performance progress
  2014-08-26  3:43 Multi-threaded performance progress Rich Felker
@ 2014-08-26  7:04 ` Jens Gustedt
  2014-08-26 10:44   ` Szabolcs Nagy
  2014-08-26 16:35   ` Jens Gustedt
  0 siblings, 2 replies; 15+ messages in thread
From: Jens Gustedt @ 2014-08-26  7:04 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1745 bytes --]

On Monday, 25.08.2014, 23:43 -0400, Rich Felker wrote:
> This release cycle looks like it's going to be huge for multi-threaded
> performance issues. So far the cumulative improvement on my main
> development system, as measured by the cond_bench.c by Timo Teräs, is
> from ~250k signals in 2 seconds up to ~3.7M signals in 2 seconds.
> That's comparable to what glibc gets on similar hardware with a cond
> var implementation that's much less correct. The improvements are a
> result of adding private futex support, redesigning the cond var
> implementation, and improvements to the spin-before-futex-wait
> behavior.

Very impressive!

> We're near the end of my planned time frame for this release cycle,
> but I'm still interested in working with Jens to get C11 threads into
> this release if possible, so I'll probably extend it for a while
> still.

I would be very much in favor of getting C11 in one version or another
into the current release, as you said. In any case, it would be good
if we could claim support for C11 for that release. That would be
another item where musl could claim to be first, at least before
glibc. There is not much missing, I think. Things that come to mind:

 - one or two simple functions, such as the timespec_get that I posted
 - if we don't have C11 threads, the feature test macro __STDC_NO_THREADS__
 - perhaps some other feature test macros for unsupported features
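
For context, the consumer side of that macro is simply something like
the following (a sketch; where the macro is best defined, compiler or
libc headers, is a separate question):

#ifdef __STDC_NO_THREADS__
#include <pthread.h>	/* no <threads.h>; fall back to POSIX threads */
#else
#include <threads.h>
#endif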

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: Multi-threaded performance progress
  2014-08-26  7:04 ` Jens Gustedt
@ 2014-08-26 10:44   ` Szabolcs Nagy
  2014-08-26 11:09     ` Jens Gustedt
  2014-08-26 16:35   ` Jens Gustedt
  1 sibling, 1 reply; 15+ messages in thread
From: Szabolcs Nagy @ 2014-08-26 10:44 UTC (permalink / raw)
  To: musl

* Jens Gustedt <jens.gustedt@inria.fr> [2014-08-26 09:04:14 +0200]:
> I would be very much in favor of getting C11 in one version or another
> into the current release, as you said. In any case, it would be good
> if we could claim support for C11 for that release. That would be
> another item where musl could claim to be first, at least before
> glibc. There is not much missing, I think. Things that come to mind:
> 
>  - one or two simple functions, such as the timespec_get that I posted
>  - if we don't have C11 threads, the feature test macro __STDC_NO_THREADS__
>  - perhaps some other feature test macros for unsupported features

further missing c11 things:

time.h needs TIME_UTC for timespec_get base parameter
stdalign.h should only define alignas and alignof ifndef c++
assert.h needs static_assert ifndef c++
float.h needs *_DECIMAL_DIG and *_HAS_SUBNORM (i have a patch for this)
uchar.h
stdatomic.h (or __STDC_NO_ATOMICS__)

(thread_local, alignas, alignof, static_assert, noreturn are keywords
in c++, stdnoreturn.h is not specified by c++14 so that probably does
not need the ifndef but the others do)
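
to illustrate the c++ caveat, the guards would be roughly of this shape
(a sketch, not the actual musl headers):

/* assert.h: static_assert is a C11 macro but a C++ keyword */
#ifndef __cplusplus
#define static_assert _Static_assert
#endif

/* stdalign.h: likewise for alignas/alignof */
#ifndef __cplusplus
#define alignas _Alignas
#define alignof _Alignof
#endif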



* Re: Multi-threaded performance progress
  2014-08-26 10:44   ` Szabolcs Nagy
@ 2014-08-26 11:09     ` Jens Gustedt
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Gustedt @ 2014-08-26 11:09 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2321 bytes --]

Hello,

On Tuesday, 26.08.2014, 12:44 +0200, Szabolcs Nagy wrote:
> * Jens Gustedt <jens.gustedt@inria.fr> [2014-08-26 09:04:14 +0200]:
> >  - one or two simple functions, such as the timespec_get that I posted
> >  - if we don't have C11 threads, the feature test macro __STDC_NO_THREADS__
> >  - perhaps some other feature test macros for unsupported features
> 
> further missing c11 things:
> 
> time.h needs TIME_UTC for timespec_get base parameter

I had that in my patch for timespec_get, too. This would perhaps need
some short discussion for the policy to apply.

It would be nice to propose all the POSIX clocks that musl implements
as extensions. That is explicitly allowed by C. Unfortunately the
constraint from C11 is that the TIME_XXX values can't be 0, so the
current CLOCK_XXX values don't work and we'd have to translate.

I had chosen the easy way: any TIME_XXX value is interpreted as
CLOCK_REALTIME, and that value is just echoed back as the return value
of the function.
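
A minimal sketch of that easy way (not necessarily what the posted
patch does verbatim) would be:

#include <time.h>

/* C11: return base on success, 0 on failure. Here every nonzero base
 * is backed by CLOCK_REALTIME; extension bases for the other POSIX
 * clocks would need their own nonzero TIME_XXX values translated to
 * the corresponding CLOCK_XXX. */
int timespec_get(struct timespec *ts, int base)
{
	if (!base) return 0;
	if (clock_gettime(CLOCK_REALTIME, ts)) return 0;
	return base;
}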

> stdalign.h should only define alignas and alignof ifndef c++
> assert.h needs static_assert ifndef c++
> float.h needs *_DECIMAL_DIG and *_HAS_SUBNORM (i have a patch for this)
> uchar.h

ah, yes, this is probably still a bit of work to do

> stdatomic.h (or __STDC_NO_ATOMICS__)

Not necessarily, this is primarily a compiler feature and not a
library feature. With gcc 4.9 this works. (Well, with some hiccups,
because they use POSIX lock features under the hood in libatomic,
instead of implementing them properly with atomic_flag.)
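
For reference, "properly with atomic_flag" means roughly the following
(a sketch in plain C11, not gcc's actual libatomic code):

#include <stdatomic.h>

/* atomic_flag is the one primitive C11 guarantees to be lock-free, so
 * it can back the fallback locks without pulling in POSIX mutexes. */
static atomic_flag guard = ATOMIC_FLAG_INIT;

static void guard_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&guard, memory_order_acquire))
		; /* spin */
}

static void guard_unlock(void)
{
	atomic_flag_clear_explicit(&guard, memory_order_release);
}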

gcc versions before that don't have sufficient C11 support, anyhow. In
particular, I think _Generic is only implemented from 4.9 onward.

On a longer term, we could think of replacing libatomic calls, too,
but this isn't urgent at all.

> (thread_local, alignas, alignof, static_assert, noreturn are keywords
> in c++, stdnoreturn.h is not specified by c++14 so that probably does
> not need the ifndef but the others do)

right

Jens


-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: Multi-threaded performance progress
  2014-08-26  7:04 ` Jens Gustedt
  2014-08-26 10:44   ` Szabolcs Nagy
@ 2014-08-26 16:35   ` Jens Gustedt
  2014-08-26 17:32     ` Rich Felker
  2014-08-26 17:53     ` Rich Felker
  1 sibling, 2 replies; 15+ messages in thread
From: Jens Gustedt @ 2014-08-26 16:35 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 2241 bytes --]

On Tuesday, 26.08.2014, 09:04 +0200, Jens Gustedt wrote:
> On Monday, 25.08.2014, 23:43 -0400, Rich Felker wrote:
> > This release cycle looks like it's going to be huge for multi-threaded
> > performance issues. So far the cumulative improvement on my main
> > development system, as measured by the cond_bench.c by Timo Teräs, is
> > from ~250k signals in 2 seconds up to ~3.7M signals in 2 seconds.
> > That's comparable to what glibc gets on similar hardware with a cond
> > var implementation that's much less correct. The improvements are a
> > result of adding private futex support, redesigning the cond var
> > implementation, and improvements to the spin-before-futex-wait
> > behavior.
> 
> Very impressive!

I reviewed the new pthread_cond code closely and found it to be really
rock solid.

I have some minor suggestions that might still improve things (or
not). They make the code a bit longer, but they attempt to gain a bit
here and there:

 - Tighten the lock on _c_lock such that the critical section
   contains the least necessary.

 - Have all updates to the list of waiters done by the signaling
   or broadcasting thread. This work is serialized by the lock on the
   cv, anyhow, so let the main work be done by a thread that already
   holds the lock and is scheduled.

 - In case of broadcast, work on head and tail of the list
   first. These are the only ones that would change the _c_head and _c_tail
   entries of the cv.

 - Try to reduce the number of futex calls. Threads that are leaving
   don't have to regain the lock when there is known contention with a
   signaler, now that the signaler is doing the main work in that
   case.

   Also, only wake up the signaling thread at the end when it is known
   to be inside a futex call.

There are perhaps other possibilities, like doing some spinning in
"lock" before going into __wait.

Jens


-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #1.2: pthread_cond_timedwait.patch --]
[-- Type: text/x-patch, Size: 7675 bytes --]

diff --git a/src/thread/pthread_cond_timedwait.c b/src/thread/pthread_cond_timedwait.c
index 2d192b0..136fa6a 100644
--- a/src/thread/pthread_cond_timedwait.c
+++ b/src/thread/pthread_cond_timedwait.c
@@ -42,12 +42,32 @@ static inline void lock(volatile int *l)
 	}
 }
 
+/* Avoid taking the lock if we know it isn't necessary. */
+static inline int lockRace(volatile int *l, int*volatile* notifier)
+{
+	int ret = 1;
+	if (!*notifier && (ret = a_cas(l, 0, 1))) {
+		a_cas(l, 1, 2);
+		do __wait(l, 0, 2, 1);
+		while (!*notifier && (ret = a_cas(l, 0, 2)));
+	}
+	return ret;
+}
+
 static inline void unlock(volatile int *l)
 {
 	if (a_swap(l, 0)==2)
 		__wake(l, 1, 1);
 }
 
+static inline void unlockRace(volatile int *l, int known)
+{
+	int found = a_swap(l, 0);
+	known += (found == 2);
+	if (known > 0)
+		__wake(l, known, 1);
+}
+
 static inline void unlock_requeue(volatile int *l, volatile int *r, int w)
 {
 	a_store(l, 0);
@@ -56,6 +76,53 @@ static inline void unlock_requeue(volatile int *l, volatile int *r, int w)
 		|| __syscall(SYS_futex, l, FUTEX_REQUEUE, 0, 1, r);
 }
 
+/* Splice the node out of the current list of c and notify a signaling
+   thread with whom there was contention. */
+static inline void leave(struct waiter* node) {
+	/* Access to cv object is valid because this waiter was not
+	 * yet signaled and a new signal/broadcast cannot return
+	 * after seeing a LEAVING waiter without getting notified
+	 * via the futex notify below. */
+	pthread_cond_t *c = node->cond;
+	int locked = lockRace(&c->_c_lock, &node->notify);
+	/* node->notify will only be changed while node is
+	 * still in the list.*/
+	int * ref = node->notify;
+
+	if (!ref) {
+		if (node->prev) node->prev->next = node->next;
+		else if (c->_c_head == node) c->_c_head = node->next;
+		if (node->next) node->next->prev = node->prev;
+		else if (c->_c_tail == node) c->_c_tail = node->prev;
+
+		unlock(&c->_c_lock);
+	} else {
+		/* A race occurred with a signaling or broadcasting thread. The call
+		 * to unlockRace, there, ensures that sufficiently many waiters on _c_lock
+		 * are woken up. */
+		if (!locked) unlock(&c->_c_lock);
+
+		/* There will be at most one signaling or broadcasting thread waiting on ref[0].
+		 * Make sure that we don't waste a futex wake, if that thread isn't yet in futex wait. */
+		if (a_fetch_add(&ref[0], -1)==1 && ref[1])
+			__wake(&ref[0], 1, 1);
+	}
+
+	node->mutex_ret = pthread_mutex_lock(node->mutex);
+}
+
+static inline void enqueue(pthread_cond_t * c, struct waiter* node) {
+	lock(&c->_c_lock);
+
+	struct waiter* ohead = c->_c_head;
+	node->next = ohead;
+	if (ohead) ohead->prev = node;
+	else c->_c_tail = node;
+	c->_c_head = node;
+
+	unlock(&c->_c_lock);
+}
+
 enum {
 	WAITING,
 	SIGNALED,
@@ -78,33 +145,14 @@ static void unwait(void *arg)
 	int oldstate = a_cas(&node->state, WAITING, LEAVING);
 
 	if (oldstate == WAITING) {
-		/* Access to cv object is valid because this waiter was not
-		 * yet signaled and a new signal/broadcast cannot return
-		 * after seeing a LEAVING waiter without getting notified
-		 * via the futex notify below. */
-
-		pthread_cond_t *c = node->cond;
-		lock(&c->_c_lock);
-		
-		if (c->_c_head == node) c->_c_head = node->next;
-		else if (node->prev) node->prev->next = node->next;
-		if (c->_c_tail == node) c->_c_tail = node->prev;
-		else if (node->next) node->next->prev = node->prev;
-		
-		unlock(&c->_c_lock);
-
-		if (node->notify) {
-			if (a_fetch_add(node->notify, -1)==1)
-				__wake(node->notify, 1, 1);
-		}
-	} else {
-		/* Lock barrier first to control wake order. */
-		lock(&node->barrier);
+		leave(node);
+		return;
 	}
 
-	node->mutex_ret = pthread_mutex_lock(node->mutex);
+	/* Lock barrier first to control wake order. */
+	lock(&node->barrier);
 
-	if (oldstate == WAITING) return;
+	node->mutex_ret = pthread_mutex_lock(node->mutex);
 
 	if (!node->next) a_inc(&node->mutex->_m_waiters);
 
@@ -121,7 +169,7 @@ static void unwait(void *arg)
 
 int pthread_cond_timedwait(pthread_cond_t *restrict c, pthread_mutex_t *restrict m, const struct timespec *restrict ts)
 {
-	struct waiter node = { .cond = c, .mutex = m };
+	struct waiter node = { .cond = c, .mutex = m, .state = WAITING, .barrier = 2 };
 	int e, seq, *fut, clock = c->_c_clock;
 
 	if ((m->_m_type&15) && (m->_m_lock&INT_MAX) != __pthread_self()->tid)
@@ -138,17 +186,10 @@ int pthread_cond_timedwait(pthread_cond_t *restrict c, pthread_mutex_t *restrict
 		seq = c->_c_seq;
 		a_inc(&c->_c_waiters);
 	} else {
-		lock(&c->_c_lock);
-
-		seq = node.barrier = 2;
+		seq = node.barrier;
 		fut = &node.barrier;
-		node.state = WAITING;
-		node.next = c->_c_head;
-		c->_c_head = &node;
-		if (!c->_c_tail) c->_c_tail = &node;
-		else node.next->prev = &node;
 
-		unlock(&c->_c_lock);
+		enqueue(c, &node);
 	}
 
 	pthread_mutex_unlock(m);
@@ -162,35 +203,85 @@ int pthread_cond_timedwait(pthread_cond_t *restrict c, pthread_mutex_t *restrict
 	return node.mutex_ret ? node.mutex_ret : e;
 }
 
+static inline int cond_signal (struct waiter * p, int* ref)
+{
+	int ret = a_cas(&p->state, WAITING, SIGNALED);
+	if (ret != WAITING) {
+		ref[0]++;
+		p->notify = ref;
+		if (p->prev) p->prev->next = p->next;
+		if (p->next) p->next->prev = p->prev;
+		p->next = 0;
+		p->prev = 0;
+	}
+	return ret;
+}
+
 int __private_cond_signal(pthread_cond_t *c, int n)
 {
-	struct waiter *p, *first=0;
-	int ref = 0, cur;
+	struct waiter *p, *prev, *first=0;
+	int ref[2] = { 0 }, cur;
 
-	lock(&c->_c_lock);
-	for (p=c->_c_tail; n && p; p=p->prev) {
-		if (a_cas(&p->state, WAITING, SIGNALED) != WAITING) {
-			ref++;
-			p->notify = &ref;
+	if (n == 1) {
+		lock(&c->_c_lock);
+		for (p=c->_c_tail; p; p=prev) {
+			prev = p->prev;
+			if (!cond_signal(p, ref)) {
+				first=p;
+				p=prev;
+				first->prev = 0;
+				break;
+			}
+		}
+		/* Split the list, leaving any remainder on the cv. */
+		if (p) {
+			p->next = 0;
 		} else {
-			n--;
-			if (!first) first=p;
+			c->_c_head = 0;
 		}
-	}
-	/* Split the list, leaving any remainder on the cv. */
-	if (p) {
-		if (p->next) p->next->prev = 0;
-		p->next = 0;
+		c->_c_tail = p;
+		unlockRace(&c->_c_lock, ref[0]);
 	} else {
-		c->_c_head = 0;
+		lock(&c->_c_lock);
+                struct waiter * head = c->_c_head;
+		if (head) {
+			/* Signal head and tail first to reduce possible
+			 * races for the cv to the beginning of the
+			 * processing. */
+			int headrace = cond_signal(head, ref);
+			struct waiter * tail = c->_c_tail;
+			p=tail->prev;
+			if (tail != head) {
+				if (!cond_signal(tail, ref)) first=tail;
+				else while (p != head) {
+					prev = p->prev;
+					if (!cond_signal(p, ref)) {
+						first=p;
+						p=prev;
+						break;
+					}
+					p=prev;
+				}
+			}
+			if (!first && !headrace) first = head;
+			c->_c_head = 0;
+			c->_c_tail = 0;
+			/* Now process the inner part of the list. */
+			if (p) {
+				while (p != head) {
+					prev = p->prev;
+					cond_signal(p, ref);
+					p=prev;
+				}
+			}
+		}
+		unlockRace(&c->_c_lock, ref[0]);
 	}
-	c->_c_tail = p;
-	unlock(&c->_c_lock);
 
 	/* Wait for any waiters in the LEAVING state to remove
 	 * themselves from the list before returning or allowing
 	 * signaled threads to proceed. */
-	while ((cur = ref)) __wait(&ref, 0, cur, 1);
+	while ((cur = ref[0])) __wait(&ref[0], &ref[1], cur, 1);
 
 	/* Allow first signaled waiter, if any, to proceed. */
 	if (first) unlock(&first->barrier);



* Re: Multi-threaded performance progress
  2014-08-26 16:35   ` Jens Gustedt
@ 2014-08-26 17:32     ` Rich Felker
  2014-08-26 17:53     ` Rich Felker
  1 sibling, 0 replies; 15+ messages in thread
From: Rich Felker @ 2014-08-26 17:32 UTC (permalink / raw)
  To: musl

On Tue, Aug 26, 2014 at 06:35:19PM +0200, Jens Gustedt wrote:
> On Tuesday, 26.08.2014, 09:04 +0200, Jens Gustedt wrote:
> > On Monday, 25.08.2014, 23:43 -0400, Rich Felker wrote:
> > > This release cycle looks like it's going to be huge for multi-threaded
> > > performance issues. So far the cumulative improvement on my main
> > > development system, as measured by the cond_bench.c by Timo Teräs, is
> > > from ~250k signals in 2 seconds up to ~3.7M signals in 2 seconds.
> > > That's comparable to what glibc gets on similar hardware with a cond
> > > var implementation that's much less correct. The improvements are a
> > > result of adding private futex support, redesigning the cond var
> > > implementation, and improvements to the spin-before-futex-wait
> > > behavior.
> > 
> > Very impressive!
> 
> I reviewed the new pthread_cond code closely and found it to be really
> rock solid.
> 
> I have some minor suggestions that might still improve things (or
> not). They make the code a bit longer, but they attempt to gain a bit
> here and there:
> 
>  - Tighten the lock on _c_lock such that the critical section
>    contains the least necessary.

Do you see any major opportunities for this? For the critical section
in pthread_cond_timedwait, a few operations could be moved out before
it, but they're all trivial assignments.

As for __private_cond_signal, it's currently necessary that all its
modifications to the list be made before either the cv lock or the
in-waiter-node barrier lock is released, because any waiters which win
the race and enter LEAVING status rather than SIGNALED status use the
cv lock to proceed. Perhaps this could be changed, though...

>  - Have all updates to the list of waiters done by the signaling
>    or broadcasting thread. This work is serialized by the lock on the
>    cv, anyhow, so let the main work be done by a thread that already
>    holds the lock and is scheduled.

The problem I ran into was that the unwait operation can be from
cancellation or timeout, in which case the waiter has to remove itself
from the list, and needs to obtain the cv lock to do this. And it's
not clear to me how the waiter can know that the signaling thread is
taking responsibility for removing it from the list without
synchronizing with the signaling thread like it does now. In any case
the costly synchronization here only happens on hopefully-very-rare
races.

>  - In case of broadcast, work on head and tail of the list
>    first. These are the only ones that would change the _c_head and _c_tail
>    entries of the cv.

But we can't release the lock anyway until all waiter states have been
atomic cas'd, or at least doing so doesn't seem safe to me.

>  - Try to reduce the number of futex calls. Threads that are leaving
>    don't have to regain the lock when there is known contention with a
>    signaler, now that the signaler is doing the main work in that
>    case.

How do they detect this contention? If they won the race and changed
state to LEAVING, they don't see the contention. If they lose the
race, they become SIGNALED, and thus take the barrier path rather than
the cv-lock path.

>    Also, only wake up the signaling thread at the end when it is known
>    to be inside a futex call.

I think this could be achieved trivially by having ref start at 1
rather than 0, and having the signaling thread a_dec ref just before
going into the maybe-wait loop. Then the waiters won't send the futex
wake unless the signaler reached the a_dec already, since they won't
see the hitting-zero step.
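
Roughly, in terms of the names used in the patch (an untested sketch of
the idea, not actual code):

/* signaler side, in __private_cond_signal */
int ref = 1, cur;	/* the +1 is the signaler's own reference */
/* ... under _c_lock: for each waiter that already raced to LEAVING,
 *     do ref++ and p->notify = &ref ... */
a_dec(&ref);		/* drop the signaler's reference */
while ((cur = ref)) __wait(&ref, 0, cur, 1);

/* leaving waiter side: ref can only hit zero after the signaler's
 * a_dec, so the wake fires only once the signaler is already at (or
 * in) its wait loop, and the ref[1] "signaler is waiting" flag from
 * the patch becomes unnecessary */
if (a_fetch_add(node->notify, -1) == 1)
	__wake(node->notify, 1, 1);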

> There are perhaps other possibilities, like doing some spinning in
> "lock" before going into __wait.

The __wait function has a built-in spin.

Rich



* Re: Multi-threaded performance progress
  2014-08-26 16:35   ` Jens Gustedt
  2014-08-26 17:32     ` Rich Felker
@ 2014-08-26 17:53     ` Rich Felker
  2014-08-26 18:30       ` Jens Gustedt
  1 sibling, 1 reply; 15+ messages in thread
From: Rich Felker @ 2014-08-26 17:53 UTC (permalink / raw)
  To: musl

I should have bothered to check and see that there was a patch before
responding...

On Tue, Aug 26, 2014 at 06:35:19PM +0200, Jens Gustedt wrote:
> diff --git a/src/thread/pthread_cond_timedwait.c b/src/thread/pthread_cond_timedwait.c
> index 2d192b0..136fa6a 100644
> --- a/src/thread/pthread_cond_timedwait.c
> +++ b/src/thread/pthread_cond_timedwait.c
> @@ -42,12 +42,32 @@ static inline void lock(volatile int *l)
>  	}
>  }
>  
> +/* Avoid taking the lock if we know it isn't necessary. */
> +static inline int lockRace(volatile int *l, int*volatile* notifier)
> +{
> +	int ret = 1;
> +	if (!*notifier && (ret = a_cas(l, 0, 1))) {
> +		a_cas(l, 1, 2);
> +		do __wait(l, 0, 2, 1);
> +		while (!*notifier && (ret = a_cas(l, 0, 2)));
> +	}
> +	return ret;
> +}

This code was confusing at first, and technically *notifier is
accessed without synchronization -- it may be written by one thread
and read by another without any intervening barrier. I suspect it
works, but I'd like to avoid it. But it does answer my question about
how you were detecting the case of "don't need to lock".

Note that the performance of the code in which you're trying to avoid
the lock does not matter in the slightest except when a race happens
between a thread acting on cancellation or timeout and a signaler
(since that's the only time it runs). I expect this is extremely rare,
so unless we determine otherwise, I'd rather not add complexity here.

>  int pthread_cond_timedwait(pthread_cond_t *restrict c, pthread_mutex_t *restrict m, const struct timespec *restrict ts)
>  {
> -	struct waiter node = { .cond = c, .mutex = m };
> +	struct waiter node = { .cond = c, .mutex = m, .state = WAITING, .barrier = 2 };

This may slightly pessimize process-shared cond vars, but seems ok.

> +static inline int cond_signal (struct waiter * p, int* ref)
> +{
> +	int ret = a_cas(&p->state, WAITING, SIGNALED);
> +	if (ret != WAITING) {
> +		ref[0]++;
> +		p->notify = ref;
> +		if (p->prev) p->prev->next = p->next;
> +		if (p->next) p->next->prev = p->prev;
> +		p->next = 0;
> +		p->prev = 0;
> +	}
> +	return ret;
> +}
> +
>  int __private_cond_signal(pthread_cond_t *c, int n)
>  {
> -	struct waiter *p, *first=0;
> -	int ref = 0, cur;
> +	struct waiter *p, *prev, *first=0;
> +	int ref[2] = { 0 }, cur;
>  
> -	lock(&c->_c_lock);
> -	for (p=c->_c_tail; n && p; p=p->prev) {
> -		if (a_cas(&p->state, WAITING, SIGNALED) != WAITING) {
> -			ref++;
> -			p->notify = &ref;
> +	if (n == 1) {
> +		lock(&c->_c_lock);
> +		for (p=c->_c_tail; p; p=prev) {
> +			prev = p->prev;
> +			if (!cond_signal(p, ref)) {
> +				first=p;
> +				p=prev;
> +				first->prev = 0;
> +				break;
> +			}
> +		}
> +		/* Split the list, leaving any remainder on the cv. */
> +		if (p) {
> +			p->next = 0;
>  		} else {
> -			n--;
> -			if (!first) first=p;
> +			c->_c_head = 0;
>  		}
> -	}
> -	/* Split the list, leaving any remainder on the cv. */
> -	if (p) {
> -		if (p->next) p->next->prev = 0;
> -		p->next = 0;
> +		c->_c_tail = p;
> +		unlockRace(&c->_c_lock, ref[0]);
>  	} else {
> -		c->_c_head = 0;
> +		lock(&c->_c_lock);
> +                struct waiter * head = c->_c_head;
> +		if (head) {
> +			/* Signal head and tail first to reduce possible
> +			 * races for the cv to the beginning of the
> +			 * processing. */
> +			int headrace = cond_signal(head, ref);
> +			struct waiter * tail = c->_c_tail;
> +			p=tail->prev;
> +			if (tail != head) {
> +				if (!cond_signal(tail, ref)) first=tail;
> +				else while (p != head) {
> +					prev = p->prev;
> +					if (!cond_signal(p, ref)) {
> +						first=p;
> +						p=prev;
> +						break;
> +					}
> +					p=prev;
> +				}
> +			}
> +			if (!first && !headrace) first = head;
> +			c->_c_head = 0;
> +			c->_c_tail = 0;
> +			/* Now process the inner part of the list. */
> +			if (p) {
> +				while (p != head) {
> +					prev = p->prev;
> +					cond_signal(p, ref);
> +					p=prev;
> +				}
> +			}
> +		}
> +		unlockRace(&c->_c_lock, ref[0]);
>  	}
> -	c->_c_tail = p;
> -	unlock(&c->_c_lock);
>  
>  	/* Wait for any waiters in the LEAVING state to remove
>  	 * themselves from the list before returning or allowing
>  	 * signaled threads to proceed. */
> -	while ((cur = ref)) __wait(&ref, 0, cur, 1);
> +	while ((cur = ref[0])) __wait(&ref[0], &ref[1], cur, 1);
>  
>  	/* Allow first signaled waiter, if any, to proceed. */
>  	if (first) unlock(&first->barrier);

This is sufficiently more complex that I think we'd need evidence that
the races actually adversely affect performance for it to be
justified. And if it is justified, the code should probably be moved
to pthread_cond_signal and pthread_cond_broadcast rather than 2 cases
in one function here. The only reason both cases are covered in one
function now is that they're naturally trivial special cases of the
exact same code.

Rich



* Re: Multi-threaded performance progress
  2014-08-26 17:53     ` Rich Felker
@ 2014-08-26 18:30       ` Jens Gustedt
  2014-08-26 19:05         ` Rich Felker
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Gustedt @ 2014-08-26 18:30 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 6804 bytes --]

On Tuesday, 26.08.2014, 13:53 -0400, Rich Felker wrote:
> I should have bothered to check and see that there was a patch before
> responding...
> 
> On Tue, Aug 26, 2014 at 06:35:19PM +0200, Jens Gustedt wrote:
> > diff --git a/src/thread/pthread_cond_timedwait.c b/src/thread/pthread_cond_timedwait.c
> > index 2d192b0..136fa6a 100644
> > --- a/src/thread/pthread_cond_timedwait.c
> > +++ b/src/thread/pthread_cond_timedwait.c
> > @@ -42,12 +42,32 @@ static inline void lock(volatile int *l)
> >  	}
> >  }
> >  
> > +/* Avoid taking the lock if we know it isn't necessary. */
> > +static inline int lockRace(volatile int *l, int*volatile* notifier)
> > +{
> > +	int ret = 1;
> > +	if (!*notifier && (ret = a_cas(l, 0, 1))) {
> > +		a_cas(l, 1, 2);
> > +		do __wait(l, 0, 2, 1);
> > +		while (!*notifier && (ret = a_cas(l, 0, 2)));
> > +	}
> > +	return ret;
> > +}
> 
> This code was confusing at first, and technically *notifier is
> accessed without synchronization -- it may be written by one thread
> and read by another without any intervening barrier. I suspect it
> works, but I'd like to avoid it. But it does answer my question about
> how you were detecting the case of "don't need to lock".

Hm, I don't see the point of your remark, musl is doing such "atomic"
reads all over, no? There is no macro for atomic load, only for atomic
store, otherwise I would have used that. And notice that *notifier is
volatile, so the load can't be optimized out.

Or do you mean that I should use an atomic store at the other end?

> Note that the performance of the code in which you're trying to avoid
> the lock does not matter in the slightest except when a race happens
> between a thread acting on cancellation or timeout and a signaler
> (since that's the only time it runs). I expect this is extremely rare,
> so unless we determine otherwise, I'd rather not add complexity here.

If we have a broadcaster working a long list of waiters, this might
still happen sufficiently often. And the "complexity" is hidden in the
execution pattern of the current version, where control and handling
of the list alternates between different threads, potentially as many
times as there are waiters in the list.

> >  int pthread_cond_timedwait(pthread_cond_t *restrict c, pthread_mutex_t *restrict m, const struct timespec *restrict ts)
> >  {
> > -	struct waiter node = { .cond = c, .mutex = m };
> > +	struct waiter node = { .cond = c, .mutex = m, .state = WAITING, .barrier = 2 };
> 
> This may slightly pessimize process-shared cond vars, but seems ok.
> 
> > +static inline int cond_signal (struct waiter * p, int* ref)
> > +{
> > +	int ret = a_cas(&p->state, WAITING, SIGNALED);
> > +	if (ret != WAITING) {
> > +		ref[0]++;
> > +		p->notify = ref;
> > +		if (p->prev) p->prev->next = p->next;
> > +		if (p->next) p->next->prev = p->prev;
> > +		p->next = 0;
> > +		p->prev = 0;
> > +	}
> > +	return ret;
> > +}
> > +
> >  int __private_cond_signal(pthread_cond_t *c, int n)
> >  {
> > -	struct waiter *p, *first=0;
> > -	int ref = 0, cur;
> > +	struct waiter *p, *prev, *first=0;
> > +	int ref[2] = { 0 }, cur;
> >  
> > -	lock(&c->_c_lock);
> > -	for (p=c->_c_tail; n && p; p=p->prev) {
> > -		if (a_cas(&p->state, WAITING, SIGNALED) != WAITING) {
> > -			ref++;
> > -			p->notify = &ref;
> > +	if (n == 1) {
> > +		lock(&c->_c_lock);
> > +		for (p=c->_c_tail; p; p=prev) {
> > +			prev = p->prev;
> > +			if (!cond_signal(p, ref)) {
> > +				first=p;
> > +				p=prev;
> > +				first->prev = 0;
> > +				break;
> > +			}
> > +		}
> > +		/* Split the list, leaving any remainder on the cv. */
> > +		if (p) {
> > +			p->next = 0;
> >  		} else {
> > -			n--;
> > -			if (!first) first=p;
> > +			c->_c_head = 0;
> >  		}
> > -	}
> > -	/* Split the list, leaving any remainder on the cv. */
> > -	if (p) {
> > -		if (p->next) p->next->prev = 0;
> > -		p->next = 0;
> > +		c->_c_tail = p;
> > +		unlockRace(&c->_c_lock, ref[0]);
> >  	} else {
> > -		c->_c_head = 0;
> > +		lock(&c->_c_lock);
> > +                struct waiter * head = c->_c_head;
> > +		if (head) {
> > +			/* Signal head and tail first to reduce possible
> > +			 * races for the cv to the beginning of the
> > +			 * processing. */
> > +			int headrace = cond_signal(head, ref);
> > +			struct waiter * tail = c->_c_tail;
> > +			p=tail->prev;
> > +			if (tail != head) {
> > +				if (!cond_signal(tail, ref)) first=tail;
> > +				else while (p != head) {
> > +					prev = p->prev;
> > +					if (!cond_signal(p, ref)) {
> > +						first=p;
> > +						p=prev;
> > +						break;
> > +					}
> > +					p=prev;
> > +				}
> > +			}
> > +			if (!first && !headrace) first = head;
> > +			c->_c_head = 0;
> > +			c->_c_tail = 0;
> > +			/* Now process the inner part of the list. */
> > +			if (p) {
> > +				while (p != head) {
> > +					prev = p->prev;
> > +					cond_signal(p, ref);
> > +					p=prev;
> > +				}
> > +			}
> > +		}
> > +		unlockRace(&c->_c_lock, ref[0]);
> >  	}
> > -	c->_c_tail = p;
> > -	unlock(&c->_c_lock);
> >  
> >  	/* Wait for any waiters in the LEAVING state to remove
> >  	 * themselves from the list before returning or allowing
> >  	 * signaled threads to proceed. */
> > -	while ((cur = ref)) __wait(&ref, 0, cur, 1);
> > +	while ((cur = ref[0])) __wait(&ref[0], &ref[1], cur, 1);
> >  
> >  	/* Allow first signaled waiter, if any, to proceed. */
> >  	if (first) unlock(&first->barrier);
> 
> This is sufficiently more complex that I think we'd need evidence that
> the races actually adversely affect performance for it to be
> justified.

I can imagine that seen as it is here, this looks invasive, but it
isn't really. This is just the merge of about 10 patches that more or
less rewrite to something functionally equivalent. (well, with some
obvious exeptions.) I found posting all those patches too much for the
list, was I wrong?

> And if it is justified, the code should probably be moved
> to pthread_cond_signal and pthread_cond_broadcast rather than 2 cases
> in one function here. The only reason both cases are covered in one
> function now is that they're naturally trivial special cases of the
> exact same code.

Sure, that could and should perhaps be done, although working on just
one file and being able to have a bunch of helper functions as static
inline was quite convenient.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: Multi-threaded performance progress
  2014-08-26 18:30       ` Jens Gustedt
@ 2014-08-26 19:05         ` Rich Felker
  2014-08-26 19:34           ` Jens Gustedt
  0 siblings, 1 reply; 15+ messages in thread
From: Rich Felker @ 2014-08-26 19:05 UTC (permalink / raw)
  To: musl

On Tue, Aug 26, 2014 at 08:30:39PM +0200, Jens Gustedt wrote:
> On Tuesday, 26.08.2014, 13:53 -0400, Rich Felker wrote:
> > I should have bothered to check and see that there was a patch before
> > responding...
> > 
> > On Tue, Aug 26, 2014 at 06:35:19PM +0200, Jens Gustedt wrote:
> > > diff --git a/src/thread/pthread_cond_timedwait.c b/src/thread/pthread_cond_timedwait.c
> > > index 2d192b0..136fa6a 100644
> > > --- a/src/thread/pthread_cond_timedwait.c
> > > +++ b/src/thread/pthread_cond_timedwait.c
> > > @@ -42,12 +42,32 @@ static inline void lock(volatile int *l)
> > >  	}
> > >  }
> > >  
> > > +/* Avoid taking the lock if we know it isn't necessary. */
> > > +static inline int lockRace(volatile int *l, int*volatile* notifier)
> > > +{
> > > +	int ret = 1;
> > > +	if (!*notifier && (ret = a_cas(l, 0, 1))) {
> > > +		a_cas(l, 1, 2);
> > > +		do __wait(l, 0, 2, 1);
> > > +		while (!*notifier && (ret = a_cas(l, 0, 2)));
> > > +	}
> > > +	return ret;
> > > +}
> > 
> > This code was confusing at first, and technically *notifier is
> > accessed without synchronization -- it may be written by one thread
> > and read by another without any intervening barrier. I suspect it
> > works, but I'd like to avoid it. But it does answer my question about
> > how you were detecting the case of "don't need to lock".
> 
> Hm, I don't see the point of your remark, musl is doing such "atomic"
> reads all over, no?

Only when the write is atomic. Here, it's not. Aside from theoretical
issues with this being UB, there's nothing to order the store to
notify with respect to other stores/loads done in the signaling
thread. For example, ref[0]++ might be reordered after p->notify=ref,
or the writes to modify the list might be reordered before it. I
haven't analyzed all the possible consequences of such reordering, but
it's unlikely to be safe. (Note that volatile only precludes
reordering between multiple volatile accesses, not between volatile
and non-volatile. But even if it did, it wouldn't enforce an ordering
on cache synchronization between cores.)
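
In plain C11-atomics terms (purely illustrative; musl uses its own a_*
primitives and volatile accesses rather than <stdatomic.h>), the
missing ordering would be expressed as:

#include <stdatomic.h>

/* Publishing the notify pointer needs release semantics so that the
 * preceding ref[0]++ and list edits are visible to a waiter that reads
 * the pointer with acquire semantics; plain volatile stores and loads
 * guarantee neither compiler nor hardware ordering between them and
 * the surrounding non-volatile accesses. */
static _Atomic(int *) notify;

void publish(int *ref)
{
	atomic_store_explicit(&notify, ref, memory_order_release);
}

int *observe(void)
{
	return atomic_load_explicit(&notify, memory_order_acquire);
}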

> There is no macro for atomic load, only for atomic
> store, otherwise I would have used that. And notice that *notifier is
> volatile, so the load can't be optimized out.
> 
> Or do you mean that I should use an atomic store at the other end?

Yes. With an atomic store at the other end, I think it could be
correct, but I'd need to review it further to be sure.

> > Note that the performance of the code in which you're trying to avoid
> > the lock does not matter in the slightest except when a race happens
> > between a thread acting on cancellation or timeout and a signaler
> > (since that's the only time it runs). I expect this is extremely rare,
> > so unless we determine otherwise, I'd rather not add complexity here.
> 
> If we have a broadcaster working a long list of waiters, this might
> still happen sufficiently often. And the "complexity" is hidden in the
> execution pattern of the current version, where control and handling
> of the list alternates between different threads, potentially as many
> times as there are waiters in the list.

Does your code eliminate that all the time? If so it's more
interesting. I'll look at it more.

> > This is sufficiently more complex that I think we'd need evidence that
> > the races actually adversely affect performance for it to be
> > justified.
> 
> I can imagine that seen as it is here, this looks invasive, but it
> isn't really.

It changes the code from one code path with a parameter that can be
read and understood almost immediately to a number of special cases
(one-element list, etc.) and it has additional confusing
synchronization logic with the notify pointer being used to convey
information outside of the lock.

> This is just the merge of about 10 patches that more or
> less rewrite to something functionally equivalent. (well, with some
> obvious exceptions.) I found posting all those patches too much for the
> list, was I wrong?

Well the mix of refactoring (arguably gratuitous) with functional
changes may be making it look more complex than it is, so I'll see if,
on further reading, it's simpler than it looks (or easily simplified).

> > And if it is justified, the code should probably be moved
> > to pthread_cond_signal and pthread_cond_broadcast rather than 2 cases
> > in one function here. The only reason both cases are covered in one
> > function now is that they're naturally trivial special cases of the
> > exact same code.
> 
> Sure, that could and should perhaps be done, although working on just
> one file and being able to have a bunch of helper functions as static
> inline was quite convenient.

Yes, for working on tweaking it, that probably makes things easier.

Rich



* Re: Multi-threaded performance progress
  2014-08-26 19:05         ` Rich Felker
@ 2014-08-26 19:34           ` Jens Gustedt
  2014-08-26 20:26             ` Rich Felker
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Gustedt @ 2014-08-26 19:34 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2489 bytes --]

On Tuesday, 26.08.2014, 15:05 -0400, Rich Felker wrote:
> On Tue, Aug 26, 2014 at 08:30:39PM +0200, Jens Gustedt wrote:
> > Or do you mean that I should use an atomic store at the other end?
>
> Yes. With an atomic store at the other end, I think it could be
> correct, but I'd need to review it further to be sure.

ok, it shouldn't be difficult to use atomic ops, then.

> > > Note that the performance of the code in which you're trying to avoid
> > > the lock does not matter in the slightest except when a race happens
> > > between a thread acting on cancellation or timeout and a signaler
> > > (since that's the only time it runs). I expect this is extremely rare,
> > > so unless we determine otherwise, I'd rather not add complexity here.
> > 
> > If we have a broadcaster working a long list of waiters, this might
> > still happen sufficiently often. And the "complexity" is hidden in the
> > execution pattern of the current version, where control and handling
> > of the list alternates between different threads, potentially as many
> > times as there are waiters in the list.
> 
> Does your code eliminate that all the time? If so it's more
> interesting. I'll look at it more.

Yes, it will be the signaling or broadcasting thread that will be
working on the integrity of the list while it is holding the lock. At
the end those that it detected to be leaving will be isolated list
items, those that are to be signaled form one consecutive list that is
detached from the cv, and the ones that remain (in the signaling case)
form a valid cv-with-list.

The only small gap that remains (and that annoys me) is the leaving
thread that sneaks in

 - marks itself as leaving before the end of the CS
 - only asks for _c_lock *after* the signaling thread has left its CS

This is all our problem of late access to the cv variable revisited,
but here it is condensed in a very narrow time frame. Both threads
must be active for this to happen, so my hope is that when both are
spinning for some time on the _c_lock for the waiter and on ref for
the signaler, none of them will "ever" be forced into a futex wait.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: Multi-threaded performance progress
  2014-08-26 19:34           ` Jens Gustedt
@ 2014-08-26 20:26             ` Rich Felker
  2014-08-26 21:15               ` Jens Gustedt
  0 siblings, 1 reply; 15+ messages in thread
From: Rich Felker @ 2014-08-26 20:26 UTC (permalink / raw)
  To: Jens Gustedt; +Cc: musl

On Tue, Aug 26, 2014 at 09:34:13PM +0200, Jens Gustedt wrote:
> On Tuesday, 26.08.2014, 15:05 -0400, Rich Felker wrote:
> > On Tue, Aug 26, 2014 at 08:30:39PM +0200, Jens Gustedt wrote:
> > > Or do you mean that I should use an atomic store at the other end?
> >
> > Yes. With an atomic store at the other end, I think it could be
> > correct, but I'd need to review it further to be sure.
> 
> ok, it shouldn't be difficult to use atomic ops, then.

Based on what you've said below, though, I think there's still a big
problem..

> > > > Note that the performance of the code in which you're trying to avoid
> > > > the lock does not matter in the slightest except when a race happens
> > > > between a thread acting on cancellation or timeout and a signaler
> > > > (since that's the only time it runs). I expect this is extremely rare,
> > > > so unless we determine otherwise, I'd rather not add complexity here.
> > > 
> > > If we have a broadcaster working a long list of waiters, this might
> > > still happen sufficiently often. And the "complexity" is hidden in the
> > > execution pattern of the current version, where control and handling
> > > of the list alternates between different threads, potentially as many
> > > times as there are waiters in the list.
> > 
> > Does your code eliminate that all the time? If so it's more
> > interesting. I'll look at it more.
> 
> Yes, it will be the signaling or broadcasting thread that will be
> working on the integrity of the list while it is holding the lock. At
> the end those that it detected to be leaving will be isolated list
> items, those that are to be signaled form one consecutive list that is
> detached from the cv, and the ones that remain (in the signaling case)
> form a valid cv-with-list.
> 
> The only small gap that remains (and that annoys me) is the leaving
> thread that sneaks in
> 
>  - marks itself as leaving before the end of the CS
>  - only asks for _c_lock *after* the signaling thread has left its CS
> 
> This is all our problem of late access to the cv variable revisited,
> but here it is condensed in a very narrow time frame. Both threads
> must be active for this to happen, so my hope is that when both are
> spinning for some time on the _c_lock for the waiter and on ref for
> the signaler, none of them will "ever" be forced into a futex wait.

That's a bug that needs to be fixed to go forward with this, since
it's essentially a use-after-free. Now that you mention it, avoiding
use-after-free was one of my main motivations for having such waiters
synchronize with the signaling thread. That should have been
documented in a comment somewhere, but the point seems to have slipped
my mind sometime between the design phase and writing the code and
comments.

Do you see any solution whereby a waiter that's waking up can know
reliably, without accessing the cv, whether a signaling thread is
there to take responsibility for removing it from the list? I'm not
seeing any solution to that problem.

I'm also still skeptical that there's a problem to be solved here; for
it to matter, the incidence of such races needs to be pretty high, I
think. Perhaps if you think it's going to matter you could work on a
test case that shows performance problems under load with lots of
timedwait expirations (or cancellations, but I think worry about
cancellation performance is somewhat silly to begin with). Or, if you
don't have time to spend on side projects like test cases, maybe
someone else could test it?

Rich



* Re: Multi-threaded performance progress
  2014-08-26 20:26             ` Rich Felker
@ 2014-08-26 21:15               ` Jens Gustedt
  2014-08-26 21:36                 ` Rich Felker
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Gustedt @ 2014-08-26 21:15 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 6130 bytes --]

On Tuesday, 26.08.2014, 16:26 -0400, Rich Felker wrote:
> On Tue, Aug 26, 2014 at 09:34:13PM +0200, Jens Gustedt wrote:
> > On Tuesday, 26.08.2014, 15:05 -0400, Rich Felker wrote:
> > > On Tue, Aug 26, 2014 at 08:30:39PM +0200, Jens Gustedt wrote:
> > > > Or do you mean that I should use an atomic store at the other end?
> > >
> > > Yes. With an atomic store at the other end, I think it could be
> > > correct, but I'd need to review it further to be sure.
> > 
> > ok, it shouldn't be difficult to use atomic ops, then.
> 
> Based on what you've said below, though, I think there's still a big
> problem..

I am not sure that I follow.

> > > > > Note that the performance of the code in which you're trying to avoid
> > > > > the lock does not matter in the slightest except when a race happens
> > > > > between a thread acting on cancellation or timeout and a signaler
> > > > > (since that's the only time it runs). I expect this is extremely rare,
> > > > > so unless we determine otherwise, I'd rather not add complexity here.
> > > > 
> > > > If we have a broadcaster working a long list of waiters, this might
> > > > still happen sufficiently often. And the "complexity" is hidden in the
> > > > execution pattern of the current version, where control and handling
> > > > of the list alternates between different threads, potentially as many
> > > > times as there are waiters in the list.
> > > 
> > > Does your code eliminate that all the time? If so it's more
> > > interesting. I'll look at it more.
> > 
> > Yes, it will be the signaling or broadcasting thread that will be
> > working on the integrity of the list while it is holding the lock. At
> > the end those that it detected to be leaving will be isolated list
> > items, those that are to be signaled form one consecutive list that is
> > detached from the cv, and the ones that remain (in the signaling case)
> > form a valid cv-with-list.
> > 
> > The only small gap that remains (and that annoys me) is the leaving
> > thread that sneaks in
> > 
> >  - marks itself as leaving before the end of the CS
> >  - only asks for _c_lock *after* the signaling thread has left its CS
> > 
> > This is all our problem of late access to the cv variable revisited,
> > but here it is condensed in a very narrow time frame. Both threads
> > must be active for this to happen, so my hope is that when both are
> > spinning for some time on the _c_lock for the waiter and on ref for
> > the signaler, none of them will "ever" be forced into a futex wait.
> 
> That's a bug that needs to be fixed to go forward with this, since
> it's essentially a use-after-free. Now that you mention it, avoiding
> use-after-free was one of my main motivations for having such waiters
> synchronize with the signaling thread. That should have been
> documented in a comment somewhere, but the point seems to have slipped
> my mind sometime between the design phase and writing the code and
> comments.

here I don't follow either.

What I describe is an inconvenience that forces all these mechanics
with the signaller still waiting for the ref count to fall to 0 before
he can leave and abandon the cv. As far as I can see that should still
be working with both versions.

> Do you see any solution whereby a waiter that's waking up can know
> reliably, without accessing the cv, whether a signaling thread is
> there to take responsibility for removing it from the list?

No. My intent with the description above was to illustrate where that
problem resides in the version that I proposed.

> I'm not seeing any solution to that problem.

You mean in the code that I proposed? No there is no other solution,
than what you have in your original version: have the signalling
thread wait for all the leaving waiters to come in. In your version
that write to node->notify is protected by _c_lock, so this is not
subject to races.

In my version this is the same, unless the lockRace variant gives the
leaving thread some additional information. If this information about
ref arrives in time before the _c_lock is taken it is used, and the
leaving thread never touches the cv, again. If it doesn't arrive in
time the _c_lock is waited for, then taken, the notifier is
re-evaluated, and all is fine. (__wait should be a memory barrier,
shouldn't it?)

But you are right I should add atomicity, there.

> I'm also still skeptical that there's a problem to be solved here; for
> it to matter, the incidence of such races needs to be pretty high, I
> think.

"matter" can mean different things. I don't think that it my version
penalizes performance (but sure a proof would be better), it could
also give some small benefits concerning execution time, it could be
mildly better by avoiding cache-ping-pong and by avoiding some futex
calls.

For me, it matters because the handling is "simpler" (a proof would be
much, much better, here :) in that there is one single thread
manipulating the list. So I should work more on proving that point by
keeping it much closer to your original version.

Ok, I'll try to switch to one-step-at-a-time-mode.

> Perhaps if you think it's going to matter you could work on a
> test case that shows performance problems under load with lots of
> timedwait expirations (or cancellations, but I think worry about
> cancellation performance is somewhat silly to begin with).

Yes, since we have eliminated spurious wakeups ;) expirations should
be the thing to worry about.

> Or, if you don't have time to spend on side projects like test
> cases, maybe someone else could test it?

Not at the moment. I would prefer to have a C thread implementation by
the end of the week that we could agree upon.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::







* Re: Multi-threaded performance progress
  2014-08-26 21:15               ` Jens Gustedt
@ 2014-08-26 21:36                 ` Rich Felker
  2014-08-27  9:53                   ` Jens Gustedt
  0 siblings, 1 reply; 15+ messages in thread
From: Rich Felker @ 2014-08-26 21:36 UTC (permalink / raw)
  To: Jens Gustedt; +Cc: musl

On Tue, Aug 26, 2014 at 11:15:56PM +0200, Jens Gustedt wrote:
> > > ok, it shouldn't be difficult to use atomic ops, then.
> > 
> > Based on what you've said below, though, I think there's still a big
> > problem..
> 
> I am not sure that I follow.

OK, I think I misunderstood things you said before.

> > > > Does your code eliminate that all the time? If so it's more
> > > > interesting. I'll look at it more.
> > > 
> > > Yes, it will be the signaling or broadcasting thread that will be
> > > working on the integrity of the list while it is holding the lock. At
> > > the end those that it detected to be leaving will be isolated list
> > > items, those that are to be signaled form one consecutive list that is
> > > detached from the cv, and the ones that remain (in the signaling case)
> > > form a valid cv-with-list.
> > > 
> > > The only small gap that remains (and that annoys me) is the leaving
> > > thread that sneaks in
> > > 
> > >  - marks itself as leaving before the end of the CS
> > >  - only asks for _c_lock *after* the signaling thread has left its CS
> > > 
> > > This is all our problem of late access to the cv variable revisited,
> > > but here it is condensed in a very narrow time frame. Both threads
> > > must be active for this to happen, so my hope is that when both are
> > > spinning for some time on the _c_lock for the waiter and on ref for
> > > the signaler, none of them will "ever" be forced into a futex wait.
> > 
> > That's a bug that needs to be fixed to go forward with this, since
> > it's essentially a use-after-free. Now that you mention it, avoiding
> > use-after-free was one of my main motivations for having such waiters
> > synchronize with the signaling thread. That should have been
> > documented in a comment somewhere, but the point seems to have slipped
> > my mind sometime between the design phase and writing the code and
> > comments.
> 
> here I don't follow either.
> 
> What I describe is an inconvenience that forces all these mechanics
> with the signaller still waiting for the ref count to fall to 0 before
> he can leave and abandon the cv. As far as I can see that should still
> be working with both versions.

OK, I misinterpreted what you were saying as a statement that you had
somehow optimized out this waiting, while in reality you seem to have
been saying that there's still this window where the wait is required.
Is this correct?

> > I'm not seeing any solution to that problem.
> 
> You mean in the code that I proposed? No there is no other solution,
> than what you have in your original version: have the signalling
> thread wait for all the leaving waiters to come in. In your version
> that write to node->notify is protected by _c_lock, so this is not
> subject to races.
> 
> In my version this is the same, unless the lockRace variant gives the
> leaving thread some additional information. If this information about
> ref arrives in time before the _c_lock is taken it is used, and the
> leaving thread never touches the cv again. If it doesn't arrive in
> time the _c_lock is waited for, then taken, the notifier is
> re-evaluated, and all is fine. (__wait should be a memory barrier,
> shouldn't it?)

OK, it sounds plausible that this works. But it is just moving (and
hopefully shortening) the race window, not eliminating it.

> > I'm also still skeptical that there's a problem to be solved here; for
> > it to matter, the incidence of such races needs to be pretty high, I
> > think.
> 
> "matter" can mean different things. I don't think that my version
> penalizes performance (but sure, a proof would be better); it could
> even give some small benefits in execution time, being mildly better
> by avoiding cache ping-pong and some futex calls.
> 
> For me, it matters because the handling is "simpler" (a proof would be
> much, much better here :) in that there is one single thread
> manipulating the list. So I should work more on proving that point by
> keeping it much closer to your original version.

As for whether it's simpler, I'm still not convinced, but let's see.

BTW I just found one bug:

+               if (a_fetch_add(&ref[0], -1)==1 && ref[1])
+                       __wake(&ref[0], 1, 1);

This accesses the signaling thread's stack (ref[1]) after releasing
the signaling thread to return.

Fixing it should be trivial via the design I mentioned earlier: don't
use a waiter flag like this, but instead offset the initial value of
ref by +1 and a_dec it just before waiting. As in other places, of
course, a wake to an invalid address is possible either way; this is
"fixable" if necessary via FUTEX_WAKE_OP (having the kernel do the
atomic dec after acquiring the futex bin locks).
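
For concreteness, a rough sketch of the shape I have in mind (the
names are made up; "nwaiters" stands for whatever count the cv
bookkeeping already provides, and this is not the actual patch):

	/* signaling thread: bias the count by one for ourselves, release
	 * the waiters, then drop our own reference and wait for the rest
	 * to drop theirs. */
	volatile int ref = nwaiters + 1;
	int cur;
	/* ... wake/requeue the waiters ... */
	if (a_fetch_add(&ref, -1) != 1)
		while ((cur = ref)) __wait(&ref, 0, cur, 1);

	/* each leaving waiter: the decrement that takes ref to zero is
	 * the only one that needs to issue a wake, and nothing like the
	 * old ref[1] flag is read after the signaler may have returned. */
	if (a_fetch_add(&ref, -1) == 1)
		__wake(&ref, 1, 1);

The wake itself can still hit an address the signaler has already
abandoned; that is the part FUTEX_WAKE_OP would close.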

> Ok, I'll try to switch to one-step-at-a-time-mode.

OK.

> > Or, if you don't have time to spend on side projects like test
> > cases, maybe someone else could test it?
> 
> Not at the moment. I would prefer to have a C thread implementation by
> the end of the week that we could agree upon.

Yes, that would be very nice!

Rich


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Multi-threaded performance progress
  2014-08-26 21:36                 ` Rich Felker
@ 2014-08-27  9:53                   ` Jens Gustedt
  2014-08-27 16:47                     ` Rich Felker
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Gustedt @ 2014-08-27  9:53 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 1877 bytes --]

Am Dienstag, den 26.08.2014, 17:36 -0400 schrieb Rich Felker:
> Fixing it should be trivial via the design I mentioned earlier: don't
> use a waiter flag like this, but instead offset the initial value of
> ref by +1 and a_dec it just before waiting. As in other places, of
> course, a wake to an invalid address is possible either way; this is
> "fixable" if necessary via FUTEX_WAKE_OP (having the kernel do the
> atomic dec after acquiring the futex bin locks).

Generally it would be nice to have such lock functionality that takes
care of the waiters inside the int itself; this could perhaps be used
in other places.
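
Something like the classical three-state futex lock would be the
natural shape for that -- just a sketch with musl's usual primitives,
not tested against the existing callers:

	/* 0 = unlocked, 1 = locked, 2 = locked with possible waiters;
	 * all the waiter information lives in the lock word itself. */
	static void lock(volatile int *l)
	{
		int c = a_cas(l, 0, 1);
		if (!c) return;                 /* uncontended fast path */
		if (c != 2) c = a_swap(l, 2);   /* mark the lock contended */
		while (c) {
			__wait(l, 0, 2, 1);     /* sleep while it reads 2 */
			c = a_swap(l, 2);       /* retry, keeping the mark */
		}
	}

	static void unlock(volatile int *l)
	{
		if (a_swap(l, 0) == 2)
			__wake(l, 1, 1);        /* wake only if contended */
	}

The price is an occasional spurious wake after contention, but no
second word is needed for a waiter count.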

> > Ok, I'll try to switch to one-step-at-a-time-mode.
> 
> OK.

ok, so in a follow-up I will send two patches that are the essence of
that. They don't do the bookkeeping from above; let's keep that for
later.

Jens Gustedt (2):
  Let the signaler do all work concerning consistency of the list
  avoid taking _c_lock if we know it isn't necessary

 src/thread/pthread_cond_timedwait.c |   62 +++++++++++++++++++++++++++--------
 1 file changed, 48 insertions(+), 14 deletions(-)


> > > Or, if you don't have time to spend on side projects like test
> > > cases, maybe someone else could test it?
> > 
> > Not at the moment. I would prefer to have a C thread implementation by
> > the end of the week that we could agree upon.
> 
> Yes, that would be very nice!

ok, so I will be turning back to that, and as said, that will be a
version of C threads that uses the POSIX lock structures transparently.

Jens


-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Multi-threaded performance progress
  2014-08-27  9:53                   ` Jens Gustedt
@ 2014-08-27 16:47                     ` Rich Felker
  0 siblings, 0 replies; 15+ messages in thread
From: Rich Felker @ 2014-08-27 16:47 UTC (permalink / raw)
  To: musl

On Wed, Aug 27, 2014 at 11:53:10AM +0200, Jens Gustedt wrote:
> Am Dienstag, den 26.08.2014, 17:36 -0400 schrieb Rich Felker:
> > Fixing it should be trivial via the design I mentioned earlier: don't
> > use a waiter flag like this, but instead offset the initial value of
> > ref by +1 and a_dec it just before waiting. As in other places, of
> > course, a wake to an invalid address is possible either way; this is
> > "fixable" if necessary via FUTEX_WAKE_OP (having the kernel do the
> > atomic dec after acquiring the futex bin locks).
> 
> Generally it would be nice to have such lock functionality that takes
> care of the waiters inside the int itself; this could perhaps be used
> in other places.

Yes. My idea was to have a "__wake_store" function or similar that
wraps FUTEX_WAKE_OP and does a fallback to FUTEX_WAKE if the kernel
lacks FUTEX_WAKE_OP (if there are older kernels that lack it; not
sure). Note that this is less powerful than what I mentioned above,
but you can know before the above a_dec if it will write zero since,
if the value is 1, you're the last thread to modify it. I think most
places where FUTEX_WAKE_OP would be used fit this pattern -- you know
in advance that you're not racing with other atomic writers (or if you
are, they're using a_cas and their cas would fail with the value
you're writing and with the value you're overwriting).
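
Roughly, the shape would be something like this (untested; the name
and signature are made up, FUTEX_OP and friends are the encodings
from linux/futex.h, the a_* and __syscall bits are the usual
pthread_impl.h internals, and it assumes the kernel is happy with the
same address in both futex slots):

	/* store val to *addr inside the kernel, after the futex bin
	 * locks are taken, then wake one waiter on addr; fall back to
	 * a plain store plus FUTEX_WAKE where FUTEX_WAKE_OP is not
	 * available.  The op encoding only carries a 12-bit operand,
	 * which is enough for the 0 we'd be storing here. */
	static void __wake_store(volatile int *addr, int val, int priv)
	{
		priv = priv ? FUTEX_PRIVATE : 0;
		if (__syscall(SYS_futex, addr, FUTEX_WAKE_OP|priv, 1, 0, addr,
		    FUTEX_OP(FUTEX_OP_SET, val, FUTEX_OP_CMP_EQ, 0)) != -ENOSYS)
			return;
		a_store(addr, val);
		__syscall(SYS_futex, addr, FUTEX_WAKE|priv, 1);
	}

The unconditional wake count (1) applies to addr itself; the second,
conditional wake is left at 0, so the comparison half of the op
encoding is a don't-care.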

Rich


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-08-27 16:47 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-26  3:43 Multi-threaded performance progress Rich Felker
2014-08-26  7:04 ` Jens Gustedt
2014-08-26 10:44   ` Szabolcs Nagy
2014-08-26 11:09     ` Jens Gustedt
2014-08-26 16:35   ` Jens Gustedt
2014-08-26 17:32     ` Rich Felker
2014-08-26 17:53     ` Rich Felker
2014-08-26 18:30       ` Jens Gustedt
2014-08-26 19:05         ` Rich Felker
2014-08-26 19:34           ` Jens Gustedt
2014-08-26 20:26             ` Rich Felker
2014-08-26 21:15               ` Jens Gustedt
2014-08-26 21:36                 ` Rich Felker
2014-08-27  9:53                   ` Jens Gustedt
2014-08-27 16:47                     ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).