mailing list of musl libc
* C threads, v3.0
@ 2014-08-04  9:30 Jens Gustedt
  2014-08-04  9:33 ` Jens Gustedt
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Jens Gustedt @ 2014-08-04  9:30 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 3231 bytes --]

Hi everybody,
 please find a new version of the C11 threads attached. This new
version reimplements the functions for the control structures mtx_t
and cnd_t. (Only the functions; the types underneath are unchanged
for the moment.)

This brings some interesting code simplifications, because

 - we don't have to care about process-shared constructs
 - there are no attributes, so (almost) all code branches that concern
   optional features can be cut off
 - in particular, there are no robust mutexes

In any case, all of this should still have the same observable
behavior as the pthread versions. All I intended to do so far is
basically a textual rewrite under the assumptions of C11.

I'd like to discuss two issues before going further. The first is of
minor concern. In many places that code tests a flag in _m_lock,
namely 0x40000000. I only found places where this bit is read, but
none where it is set. Did I overlook something, or is this a leftover?

The second issue concerns tsd. C threads can't be fine-tuned with
attributes; in particular, they receive exactly as much stack space
as we give them. They are supposed to implement an economical thread
model that uses as few resources as possible. In the current
implementation, tsd takes up a medium-sized part of the stack
(__pthread_tsd_size) if the application or any linked library uses
pthread_key_create somewhere.

My impression is that pthread_[gs]etspecific is already rarely used
and that tss_[gs]et will be even less. C threads also introduce
_Thread_local, a much more convenient interface as long as you don't
have 

I don't think that this policy is ideal for C threads, but perhaps it
could be revised for pthreads, too. My idea is the following (this
version targets C threads with minimal impact on pthreads):

 - don't assign a tsd in thrd_create

 - change pthread_getspecific, pthread_setspecific, tss_get, and
   __pthread_tsd_run_dtors such that they check the tsd pointer for
   null. This is a trivial and inexpensive change.
   pthread_setspecific may return ENOMEM or EINVAL in such cases; the
   getters should just return 0, and __pthread_tsd_run_dtors would
   obviously just do nothing.

 - change tss_set to mmap the table if it is absent and the call has
   a non-null pointer argument (i.e., if it really has to store
   something). C11 allows this function to fail, so we could return
   thrd_error if the mmap fails.

 - change thrd_exit to check for tsd and to unmap it at the end if
   necessary

For thrd_create, one consequence would be that main no longer has to
be treated specially in that respect. The additional mmap and munmap
in tss_set would only ever concern entire pages, so the runtime
overhead should be minimal, and it would only be paid by threads that
actually call tss_set with a non-null pointer to store.
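
To make this concrete, here is a rough sketch of the getter and the
setter (illustration only, not part of the attached patch; it reuses
the internal names __mmap and __pthread_tsd_size from the current
tree, and the null-tsd convention is exactly the proposed extension):

#include "pthread_impl.h"
#include <sys/mman.h>
#include <threads.h>

void *__mmap(void *, size_t, int, int, int, off_t);

void *tss_get(tss_t k)
{
	struct pthread *self = __pthread_self();
	/* A C thread that never stored anything has no table. */
	if (!self->tsd) return 0;
	return self->tsd[k];
}

int tss_set(tss_t k, void *x)
{
	struct pthread *self = __pthread_self();
	if (!self->tsd) {
		/* Nothing to store, so nothing to allocate. */
		if (!x) return thrd_success;
		/* Allocate the table lazily; the length rounds up to
		   whole pages, and this is the only point where
		   tss_set can fail. */
		self->tsd = __mmap(0, __pthread_tsd_size,
			PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
		if (self->tsd == MAP_FAILED) {
			self->tsd = 0;
			return thrd_error;
		}
	}
	/* Avoid unnecessary COW, as in pthread_setspecific. */
	if (self->tsd[k] != x) {
		self->tsd[k] = x;
		self->tsd_used = 1;
	}
	return thrd_success;
}

thrd_exit would then run the dtors only when self->tsd is non-null,
and __munmap the table afterwards.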

Please let me know what you think.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #1.2: thrd11-v3.0.patch --]
[-- Type: text/x-patch, Size: 32207 bytes --]

diff --git a/include/alltypes.h.in b/include/alltypes.h.in
index 8478fb4..4cd2cb1 100644
--- a/include/alltypes.h.in
+++ b/include/alltypes.h.in
@@ -58,6 +58,10 @@ TYPEDEF struct { unsigned __attr; } pthread_mutexattr_t;
 TYPEDEF struct { unsigned __attr; } pthread_condattr_t;
 TYPEDEF struct { unsigned __attr; } pthread_barrierattr_t;
 TYPEDEF struct { unsigned __attr[2]; } pthread_rwlockattr_t;
+TYPEDEF pthread_cond_t cnd_t;
+TYPEDEF pthread_mutex_t mtx_t;
+
+
 
 TYPEDEF struct _IO_FILE FILE;
 
diff --git a/include/threads.h b/include/threads.h
new file mode 100644
index 0000000..e131781
--- /dev/null
+++ b/include/threads.h
@@ -0,0 +1,169 @@
+#ifndef _THREADS_H
+#define _THREADS_H
+
+/* This one is explicitly allowed to be included. */
+#include <time.h>
+
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define __NEED_cnd_t
+#define __NEED_mtx_t
+/* Until we come up with a better naming scheme, we need to expose
+   some pthread types. */
+#define __NEED_pthread_cond_t
+#define __NEED_pthread_mutex_t
+#define __NEED_pthread_once_t
+#define __NEED_pthread_t
+#define __NEED_pthread_key_t
+
+#include <bits/alltypes.h>
+#include <bits/errno.h>
+
+typedef pthread_once_t once_flag;
+typedef pthread_t thrd_t;
+typedef pthread_key_t tss_t;
+typedef int (*thrd_start_t)(void*);
+typedef void (*tss_dtor_t)(void*);
+
+
+#ifndef __THRD_EXPERIMENTAL
+# define __THRD_EXPERIMENTAL 1
+#endif
+
+  /* The following list of 9 integer constants determines the binary
+     compatibility of this C thread implementation. You must never
+     link code against versions of the C library that do not agree
+     upon these ABI parameters.
+
+     Additionally this implementation assumes that the 5 types have
+     the same size across C libraries and that these types can be
+     initialized by the default initializer.
+
+     The values for the 9 parameters are not fixed for now. Depending
+     on the choices of other implementations and the evolution of the
+     C standard they may still change. This should happen rarely, but
+     you have to consider the C thread feature to be experimental
+     until then, and be prepared to recompile your binary when linking
+     against a different implementation or a new version.
+
+     The macro __THRD_EXPERIMENTAL will be defined as long as we
+     consider this ABI to be unstable. This comes with some link time
+     checks and an overhead of some bytes. At your own risk you may
+     switch this check off by defining the __THRD_EXPERIMENTAL macro
+     to 0. */
+
+#define __THRD_SUCCESS      0
+#define __THRD_BUSY         EBUSY
+#define __THRD_ERROR        EINVAL
+#define __THRD_NOMEM        ENOMEM
+#define __THRD_TIMEDOUT     ETIMEDOUT
+#define __MTX_PLAIN         0
+#define __MTX_RECURSIVE     1
+#define __MTX_TIMED         2
+#define TSS_DTOR_ITERATIONS 4
+
+#define __THRD_CONCAT10_(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9)         \
+_0 ## _ ## _1 ## _ ## _2 ## _ ## _3 ## _ ## _4 ## _ ## _5 ## _ ## _6 ## _ ## _7 ## _ ## _8 ## _ ## _9
+
+#define __THRD_CONCAT10(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9) \
+  __THRD_CONCAT10_(_0, _1, _2, _3, _4, _5, _6, _7, _8, _9)
+
+
+#define __THRD_ABI                              \
+__THRD_CONCAT10(__thrd_abi,                     \
+                __THRD_SUCCESS,                 \
+                __THRD_BUSY,                    \
+                __THRD_ERROR,                   \
+                __THRD_NOMEM,                   \
+                __THRD_TIMEDOUT,                \
+                __MTX_PLAIN,                    \
+                __MTX_RECURSIVE,                \
+                __MTX_TIMED,                    \
+                TSS_DTOR_ITERATIONS             \
+                )
+
+#define __THRD_SHFT(X, Y) (((X) << 3) ^ (Y))
+
+enum {
+  __thrd_abi =
+  __THRD_SHFT(sizeof(cnd_t),
+              __THRD_SHFT(sizeof(mtx_t),
+                          __THRD_SHFT(sizeof(once_flag),
+                                      __THRD_SHFT(sizeof(thrd_t),
+                                                  sizeof(tss_t))))),
+};
+
+extern unsigned const __THRD_ABI;
+
+#define __THRD_ABI_CHECK (1/(__THRD_ABI == __thrd_abi))
+
+#if __THRD_EXPERIMENTAL
+# define __THRD_ABI_MARK __attribute__((used)) static unsigned const*const __thrd_abi_mark = &__THRD_ABI
+#else
+# define __THRD_ABI_MARK typedef unsigned __thrd_abi_dummy_type
+#endif
+
+enum {
+  thrd_success = __THRD_SUCCESS,
+  thrd_busy = __THRD_BUSY,
+  thrd_error = __THRD_ERROR,
+  thrd_nomem = __THRD_NOMEM,
+  thrd_timedout = __THRD_TIMEDOUT,
+};
+
+enum {
+  mtx_plain = __MTX_PLAIN,
+  mtx_recursive = __MTX_RECURSIVE,
+  // all mutexes are timed, here. so this is a no-op
+  mtx_timed = __MTX_TIMED,
+};
+
+#define ONCE_FLAG_INIT { 0 }
+#define thread_local _Thread_local
+
+int thrd_create(thrd_t *, thrd_start_t, void *);
+_Noreturn void thrd_exit(int);
+
+int thrd_detach(thrd_t);
+int thrd_join(thrd_t, int *);
+
+int thrd_sleep(const struct timespec *, struct timespec *);
+void thrd_yield(void);
+
+thrd_t thrd_current(void);
+int thrd_equal(thrd_t, thrd_t);
+#define thrd_equal(A, B) ((A) == (B))
+
+void call_once(once_flag *, void (*)(void));
+
+int mtx_init(mtx_t *, int);
+void mtx_destroy(mtx_t *);
+
+int mtx_lock(mtx_t *);
+int mtx_timedlock(mtx_t *restrict, const struct timespec *restrict);
+int mtx_trylock(mtx_t *);
+int mtx_unlock(mtx_t *);
+
+int cnd_init(cnd_t *);
+void cnd_destroy(cnd_t *);
+
+int cnd_broadcast(cnd_t *);
+int cnd_signal(cnd_t *);
+
+int cnd_timedwait(cnd_t *restrict, mtx_t *restrict, const struct timespec *restrict);
+int cnd_wait(cnd_t *, mtx_t *);
+
+int tss_create(tss_t *, tss_dtor_t);
+void tss_delete(tss_t key);
+
+int tss_set(tss_t, void *);
+void *tss_get(tss_t);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/src/mman/mprotect.c b/src/mman/mprotect.c
index f488486..535787b 100644
--- a/src/mman/mprotect.c
+++ b/src/mman/mprotect.c
@@ -2,10 +2,12 @@
 #include "libc.h"
 #include "syscall.h"
 
-int mprotect(void *addr, size_t len, int prot)
+int __mprotect(void *addr, size_t len, int prot)
 {
 	size_t start, end;
 	start = (size_t)addr & -PAGE_SIZE;
 	end = (size_t)((char *)addr + len + PAGE_SIZE-1) & -PAGE_SIZE;
 	return syscall(SYS_mprotect, start, end-start, prot);
 }
+
+weak_alias(__mprotect, mprotect);
diff --git a/src/sched/thrd_yield.c b/src/sched/thrd_yield.c
new file mode 100644
index 0000000..301e953
--- /dev/null
+++ b/src/sched/thrd_yield.c
@@ -0,0 +1,7 @@
+#include <sched.h>
+#include "syscall.h"
+
+void thrd_yield()
+{
+	(void)syscall(SYS_sched_yield);
+}
diff --git a/src/thread/__thrd_abi.c b/src/thread/__thrd_abi.c
new file mode 100644
index 0000000..e5674e6
--- /dev/null
+++ b/src/thread/__thrd_abi.c
@@ -0,0 +1,3 @@
+#include <threads.h>
+
+unsigned const __THRD_ABI =  __thrd_abi;
diff --git a/src/thread/__timedwait.c b/src/thread/__timedwait.c
index 302273a..3aed2bd 100644
--- a/src/thread/__timedwait.c
+++ b/src/thread/__timedwait.c
@@ -4,6 +4,9 @@
 #include "futex.h"
 #include "syscall.h"
 
+int __pthread_setcancelstate(int new, int *old);
+int __clock_gettime(clockid_t clk, struct timespec *ts);
+
 static int do_wait(volatile int *addr, int val,
 	clockid_t clk, const struct timespec *at, int priv)
 {
@@ -12,7 +15,7 @@ static int do_wait(volatile int *addr, int val,
 
 	if (at) {
 		if (at->tv_nsec >= 1000000000UL) return EINVAL;
-		if (clock_gettime(clk, &to)) return EINVAL;
+		if (__clock_gettime(clk, &to)) return EINVAL;
 		to.tv_sec = at->tv_sec - to.tv_sec;
 		if ((to.tv_nsec = at->tv_nsec - to.tv_nsec) < 0) {
 			to.tv_sec--;
@@ -33,13 +36,13 @@ int __timedwait(volatile int *addr, int val,
 {
 	int r, cs;
 
-	if (!cleanup) pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs);
+	if (!cleanup) __pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs);
 	pthread_cleanup_push(cleanup, arg);
 
 	r = do_wait(addr, val, clk, at, priv);
 
 	pthread_cleanup_pop(0);
-	if (!cleanup) pthread_setcancelstate(cs, 0);
+	if (!cleanup) __pthread_setcancelstate(cs, 0);
 
 	return r;
 }
diff --git a/src/thread/call_once.c b/src/thread/call_once.c
new file mode 100644
index 0000000..742a707
--- /dev/null
+++ b/src/thread/call_once.c
@@ -0,0 +1,9 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int __pthread_once(once_flag *, void (*)(void));
+
+void (call_once)(once_flag *flag, void (*func)(void)) {
+	__THRD_ABI_MARK;
+	(void)__pthread_once(flag, func);
+}
diff --git a/src/thread/cnd_broadcast.c b/src/thread/cnd_broadcast.c
new file mode 100644
index 0000000..70fb7a7
--- /dev/null
+++ b/src/thread/cnd_broadcast.c
@@ -0,0 +1,34 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int cnd_broadcast(cnd_t *c)
+{
+	mtx_t *m;
+
+	if (!c->_c_waiters) return thrd_success;
+
+	a_inc(&c->_c_seq);
+
+	/* Block waiters from returning so we can use the mutex. */
+	while (a_swap(&c->_c_lock, 1))
+		__wait(&c->_c_lock, &c->_c_lockwait, 1, 1);
+	if (!c->_c_waiters)
+		goto out;
+	m = c->_c_mutex;
+
+	/* Move waiter count to the mutex */
+	a_fetch_add(&m->_m_waiters, c->_c_waiters2);
+	c->_c_waiters2 = 0;
+
+	/* Perform the futex requeue, waking one waiter unless we know
+	 * that the calling thread holds the mutex. */
+	__syscall(SYS_futex, &c->_c_seq, FUTEX_REQUEUE,
+		!m->_m_type || (m->_m_lock&INT_MAX)!=__pthread_self()->tid,
+		INT_MAX, &m->_m_lock);
+
+out:
+	a_store(&c->_c_lock, 0);
+	if (c->_c_lockwait) __wake(&c->_c_lock, 1, 0);
+
+	return thrd_success;
+}
diff --git a/src/thread/cnd_destroy.c b/src/thread/cnd_destroy.c
new file mode 100644
index 0000000..7d6f3a1
--- /dev/null
+++ b/src/thread/cnd_destroy.c
@@ -0,0 +1,21 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+/* The behavior of cnd_destroy is undefined if cnd is still in
+   use. The choice for pthread_cond_destroy in that situation is to
+   wake up all users before destroying. I am not sure that we should
+   do it like that here, too. Alternatives would be:
+   - complain by using perror or equivalent
+   - assert that there is no waiter
+   - abort when there is a waiter
+   - do nothing
+   */
+void cnd_destroy(cnd_t *c)
+{
+	int cnt;
+	c->_c_destroy = 1;
+	if (c->_c_waiters)
+		__wake(&c->_c_seq, -1, 0);
+	while ((cnt = c->_c_waiters))
+		__wait(&c->_c_waiters, 0, cnt, 0);
+}
diff --git a/src/thread/cnd_init.c b/src/thread/cnd_init.c
new file mode 100644
index 0000000..c7f1edf
--- /dev/null
+++ b/src/thread/cnd_init.c
@@ -0,0 +1,10 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int cnd_init(cnd_t * c)
+{
+	*c = (cnd_t) {
+		0
+	};
+	return thrd_success;
+}
diff --git a/src/thread/cnd_signal.c b/src/thread/cnd_signal.c
new file mode 100644
index 0000000..3967687
--- /dev/null
+++ b/src/thread/cnd_signal.c
@@ -0,0 +1,10 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int cnd_signal(cnd_t * cnd) {
+	if (cnd->_c_waiters) {
+		a_inc(&cnd->_c_seq);
+		if (cnd->_c_waiters) __wake(&cnd->_c_seq, 1, 0);
+ 	}
+	return thrd_success;
+}
diff --git a/src/thread/cnd_timedwait.c b/src/thread/cnd_timedwait.c
new file mode 100644
index 0000000..18db3ef
--- /dev/null
+++ b/src/thread/cnd_timedwait.c
@@ -0,0 +1,81 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+void __pthread_testcancel(void);
+
+struct cm {
+	cnd_t *c;
+	mtx_t *m;
+};
+
+static void unwait(cnd_t *c, mtx_t *m)
+{
+	__THRD_ABI_MARK;
+	/* Removing a waiter is non-trivial if we could be using requeue
+	 * based broadcast signals, due to mutex access issues, etc. */
+
+	while (a_swap(&c->_c_lock, 1))
+		__wait(&c->_c_lock, &c->_c_lockwait, 1, 1);
+
+	if (c->_c_waiters2) c->_c_waiters2--;
+	else a_dec(&m->_m_waiters);
+
+	a_store(&c->_c_lock, 0);
+	if (c->_c_lockwait) __wake(&c->_c_lock, 1, 1);
+
+	a_dec(&c->_c_waiters);
+	if (c->_c_destroy) __wake(&c->_c_waiters, 1, 1);
+}
+
+static void cleanup(void *p)
+{
+	struct cm *cm = p;
+	unwait(cm->c, cm->m);
+	mtx_lock(cm->m);
+}
+
+int cnd_timedwait(cnd_t *restrict c, mtx_t *restrict m, const struct timespec *restrict ts)
+{
+	struct cm cm = { .c=c, .m=m };
+	int r, e=0, seq;
+
+	if (m->_m_type && (m->_m_lock&INT_MAX) != __pthread_self()->tid)
+		return thrd_error;
+
+	if (ts && ts->tv_nsec >= 1000000000UL)
+		return thrd_error;
+
+	__pthread_testcancel();
+
+	a_inc(&c->_c_waiters);
+
+	c->_c_mutex = m;
+	while (a_swap(&c->_c_lock, 1))
+		__wait(&c->_c_lock, &c->_c_lockwait, 1, 1);
+	c->_c_waiters2++;
+	a_store(&c->_c_lock, 0);
+	if (c->_c_lockwait) __wake(&c->_c_lock, 1, 1);
+
+	seq = c->_c_seq;
+
+	mtx_unlock(m);
+
+	do e = __timedwait(&c->_c_seq, seq, CLOCK_REALTIME, ts, cleanup, &cm, 0);
+	while (c->_c_seq == seq && (!e || e==EINTR));
+	if (e == EINTR) e = 0;
+
+	unwait(c, m);
+
+	if (mtx_lock(m) != thrd_success)
+		return thrd_error;
+
+	switch (e) {
+	case 0:
+		if (thrd_success == 0) break;
+		else return thrd_success;
+	case ETIMEDOUT:
+		if (thrd_timedout == ETIMEDOUT) break;
+		else return thrd_timedout;
+	}
+	return e;
+}
diff --git a/src/thread/cnd_wait.c b/src/thread/cnd_wait.c
new file mode 100644
index 0000000..91e89db
--- /dev/null
+++ b/src/thread/cnd_wait.c
@@ -0,0 +1,10 @@
+#include <threads.h>
+
+int cnd_wait(cnd_t *cond, mtx_t *mutex)
+{
+	/* Calling cnd_timedwait with a null pointer is an
+	   extension. Such a call is convenient here, since it avoids
+	   repeating the case analysis that is already done for
+	   cnd_timedwait. */
+	return cnd_timedwait(cond, mutex, 0);
+}
diff --git a/src/thread/mtx_destroy.c b/src/thread/mtx_destroy.c
new file mode 100644
index 0000000..482b781
--- /dev/null
+++ b/src/thread/mtx_destroy.c
@@ -0,0 +1,5 @@
+#include <threads.h>
+
+void (mtx_destroy)(mtx_t *mtx) {
+	/* empty */
+}
diff --git a/src/thread/mtx_init.c b/src/thread/mtx_init.c
new file mode 100644
index 0000000..b86afb0
--- /dev/null
+++ b/src/thread/mtx_init.c
@@ -0,0 +1,10 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int mtx_init(mtx_t * m, int type)
+{
+	*m = (mtx_t) {
+		._m_type = ((type&mtx_recursive) ? PTHREAD_MUTEX_RECURSIVE : 0),
+	};
+	return thrd_success;
+}
diff --git a/src/thread/mtx_lock.c b/src/thread/mtx_lock.c
new file mode 100644
index 0000000..aa3d031
--- /dev/null
+++ b/src/thread/mtx_lock.c
@@ -0,0 +1,14 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int mtx_lock(mtx_t *m)
+{
+	__THRD_ABI_MARK;
+	if (m->_m_type == PTHREAD_MUTEX_NORMAL && !a_cas(&m->_m_lock, 0, thrd_busy))
+		return thrd_success;
+	/* Calling mtx_timedlock with a null pointer is an
+	   extension. Such a call is convenient here, since it avoids
+	   repeating the case analysis that is already done for
+	   mtx_timedlock. */
+	return mtx_timedlock(m, 0);
+}
diff --git a/src/thread/mtx_timedlock.c b/src/thread/mtx_timedlock.c
new file mode 100644
index 0000000..2169c4a
--- /dev/null
+++ b/src/thread/mtx_timedlock.c
@@ -0,0 +1,33 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int mtx_timedlock(mtx_t *restrict m, const struct timespec *restrict at)
+{
+	int r, t;
+
+	if (m->_m_type == PTHREAD_MUTEX_NORMAL && !a_cas(&m->_m_lock, 0, thrd_busy))
+		return thrd_success;
+
+	for (;;) {
+		r=mtx_trylock(m);
+		if (r != thrd_busy) return r;
+		else {
+			if (!(r=m->_m_lock) || (r&0x40000000)) continue;
+			a_inc(&m->_m_waiters);
+			t = r | 0x80000000;
+			a_cas(&m->_m_lock, r, t);
+			r = __timedwait(&m->_m_lock, t, CLOCK_REALTIME, at, 0, 0, 0);
+			a_dec(&m->_m_waiters);
+			switch (r) {
+			case 0:
+				break;
+			case EINTR:
+				break;
+			case ETIMEDOUT:
+				return thrd_timedout;
+			default:
+				return thrd_error;
+			}
+		}
+	}
+}
diff --git a/src/thread/mtx_trylock.c b/src/thread/mtx_trylock.c
new file mode 100644
index 0000000..9e3e0a5
--- /dev/null
+++ b/src/thread/mtx_trylock.c
@@ -0,0 +1,25 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int mtx_trylock(mtx_t *m)
+{
+	int tid, old, own, ret = 0;
+	pthread_t self;
+
+	if (m->_m_type == PTHREAD_MUTEX_NORMAL) {
+		return a_cas(&m->_m_lock, 0, thrd_busy) & thrd_busy;
+	} else {
+		self = __pthread_self();
+		tid = self->tid;
+		old = m->_m_lock;
+		own = old & 0x7fffffff;
+		if (own == tid) {
+			if ((unsigned)m->_m_count >= INT_MAX) return thrd_error;
+			else m->_m_count++;
+		} else {
+			if ((own && !(own & 0x40000000)) || a_cas(&m->_m_lock, old, tid)!=old)
+				return thrd_busy;
+		}
+	}
+	return thrd_success;
+}
diff --git a/src/thread/mtx_unlock.c b/src/thread/mtx_unlock.c
new file mode 100644
index 0000000..30aa7f3
--- /dev/null
+++ b/src/thread/mtx_unlock.c
@@ -0,0 +1,16 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int mtx_unlock(mtx_t *m)
+{
+	if ((m->_m_type&3) == PTHREAD_MUTEX_RECURSIVE && m->_m_count) {
+		if ((m->_m_lock&0x1fffffff) != __pthread_self()->tid)
+			return thrd_error;
+		/* _m_count is the count of additional locks; no need to really unlock */
+		--m->_m_count;
+	} else {
+		if (a_swap(&m->_m_lock, 0)<0 || m->_m_waiters)
+			__wake(&m->_m_lock, 1, 0);
+	}
+	return thrd_success;
+}
diff --git a/src/thread/pthread_cond_timedwait.c b/src/thread/pthread_cond_timedwait.c
index 99d62cc..f43ca9e 100644
--- a/src/thread/pthread_cond_timedwait.c
+++ b/src/thread/pthread_cond_timedwait.c
@@ -1,5 +1,7 @@
 #include "pthread_impl.h"
 
+void __pthread_testcancel(void);
+
 struct cm {
 	pthread_cond_t *c;
 	pthread_mutex_t *m;
@@ -47,7 +49,7 @@ int pthread_cond_timedwait(pthread_cond_t *restrict c, pthread_mutex_t *restrict
 	if (ts && ts->tv_nsec >= 1000000000UL)
 		return EINVAL;
 
-	pthread_testcancel();
+	__pthread_testcancel();
 
 	a_inc(&c->_c_waiters);
 
diff --git a/src/thread/pthread_create.c b/src/thread/pthread_create.c
index e77e54a..fa73826 100644
--- a/src/thread/pthread_create.c
+++ b/src/thread/pthread_create.c
@@ -3,6 +3,13 @@
 #include "stdio_impl.h"
 #include "libc.h"
 #include <sys/mman.h>
+#include <threads.h>
+
+void *__mmap(void *, size_t, int, int, int, off_t);
+int __munmap(void *, size_t);
+int __mprotect(void *, size_t, int);
+
+#define __THRD_C11 ((void*)(uintptr_t)-1)
 
 static void dummy_0()
 {
@@ -11,7 +18,7 @@ weak_alias(dummy_0, __acquire_ptc);
 weak_alias(dummy_0, __release_ptc);
 weak_alias(dummy_0, __pthread_tsd_run_dtors);
 
-_Noreturn void pthread_exit(void *result)
+_Noreturn void __pthread_exit(void *result)
 {
 	pthread_t self = __pthread_self();
 	sigset_t set;
@@ -113,6 +120,19 @@ static int start(void *p)
 	return 0;
 }
 
+static _Noreturn void __thrd_exit(int result) {
+	__pthread_exit((void*)(intptr_t)result);
+}
+
+
+static int start_c11(void *p)
+{
+	thrd_t self = p;
+	int (*start)(void*) = (int(*)(void*)) self->start;
+	__thrd_exit(start(self->start_arg));
+	return 0;
+}
+
 #define ROUND(x) (((x)+PAGE_SIZE-1)&-PAGE_SIZE)
 
 /* pthread_key_create.c overrides this */
@@ -133,10 +153,10 @@ static void init_file_lock(FILE *f)
 
 void *__copy_tls(unsigned char *);
 
-int pthread_create(pthread_t *restrict res, const pthread_attr_t *restrict attrp, void *(*entry)(void *), void *restrict arg)
+static int __pthread_create(pthread_t *restrict res, const pthread_attr_t *restrict attrp, void *(*entry)(void *), void *restrict arg)
 {
-	int ret;
-	size_t size, guard;
+	int ret, c11 = (attrp == __THRD_C11);
+	size_t size, guard = 0;
 	struct pthread *self, *new;
 	unsigned char *map = 0, *stack = 0, *tsd = 0, *stack_limit;
 	unsigned flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND
@@ -157,6 +177,9 @@ int pthread_create(pthread_t *restrict res, const pthread_attr_t *restrict attrp
 		self->tsd = (void **)__pthread_tsd_main;
 		libc.threaded = 1;
 	}
+	if (c11) {
+		attrp = 0;
+	}
 	if (attrp) attr = *attrp;
 
 	__acquire_ptc();
@@ -184,14 +207,14 @@ int pthread_create(pthread_t *restrict res, const pthread_attr_t *restrict attrp
 
 	if (!tsd) {
 		if (guard) {
-			map = mmap(0, size, PROT_NONE, MAP_PRIVATE|MAP_ANON, -1, 0);
+			map = __mmap(0, size, PROT_NONE, MAP_PRIVATE|MAP_ANON, -1, 0);
 			if (map == MAP_FAILED) goto fail;
-			if (mprotect(map+guard, size-guard, PROT_READ|PROT_WRITE)) {
-				munmap(map, size);
+			if (__mprotect(map+guard, size-guard, PROT_READ|PROT_WRITE)) {
+				__munmap(map, size);
 				goto fail;
 			}
 		} else {
-			map = mmap(0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
+			map = __mmap(0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
 			if (map == MAP_FAILED) goto fail;
 		}
 		tsd = map + size - __pthread_tsd_size;
@@ -223,7 +246,10 @@ int pthread_create(pthread_t *restrict res, const pthread_attr_t *restrict attrp
 	new->canary = self->canary;
 
 	a_inc(&libc.threads_minus_1);
-	ret = __clone(start, stack, flags, new, &new->tid, TP_ADJ(new), &new->tid);
+	if (c11)
+		ret = __clone(start_c11, stack, flags, new, &new->tid, TP_ADJ(new), &new->tid);
+	else
+		ret = __clone(start, stack, flags, new, &new->tid, TP_ADJ(new), &new->tid);
 
 	__release_ptc();
 
@@ -233,7 +259,7 @@ int pthread_create(pthread_t *restrict res, const pthread_attr_t *restrict attrp
 
 	if (ret < 0) {
 		a_dec(&libc.threads_minus_1);
-		if (map) munmap(map, size);
+		if (map) __munmap(map, size);
 		return EAGAIN;
 	}
 
@@ -251,3 +277,19 @@ fail:
 	__release_ptc();
 	return EAGAIN;
 }
+
+static int __thrd_create(thrd_t *thr,
+                         thrd_start_t func,
+                         void *arg) {
+	/* In the best of all worlds this is a tail call. */
+	int ret = __pthread_create(thr, __THRD_C11, (void * (*)(void *))func, arg);
+	if (thrd_success)
+		return ret ? thrd_error : thrd_success;
+	/* In case of UB may also return ENOSYS or EAGAIN. */
+	return ret;
+}
+
+weak_alias(__pthread_exit, pthread_exit);
+weak_alias(__pthread_create, pthread_create);
+weak_alias(__thrd_create, thrd_create);
+weak_alias(__thrd_exit, thrd_exit);
diff --git a/src/thread/pthread_detach.c b/src/thread/pthread_detach.c
index 651c38e..5ca7855 100644
--- a/src/thread/pthread_detach.c
+++ b/src/thread/pthread_detach.c
@@ -1,11 +1,15 @@
 #include "pthread_impl.h"
 
-int pthread_detach(pthread_t t)
+int __pthread_join(pthread_t t, void **res);
+
+int __pthread_detach(pthread_t t)
 {
 	/* Cannot detach a thread that's already exiting */
 	if (a_swap(t->exitlock, 1))
-		return pthread_join(t, 0);
+		return __pthread_join(t, 0);
 	t->detached = 2;
 	__unlock(t->exitlock);
 	return 0;
 }
+
+weak_alias(__pthread_detach, pthread_detach);
diff --git a/src/thread/pthread_getspecific.c b/src/thread/pthread_getspecific.c
index b2a282c..bfc4294 100644
--- a/src/thread/pthread_getspecific.c
+++ b/src/thread/pthread_getspecific.c
@@ -1,7 +1,10 @@
 #include "pthread_impl.h"
 
-void *pthread_getspecific(pthread_key_t k)
+static void *__pthread_getspecific(pthread_key_t k)
 {
 	struct pthread *self = __pthread_self();
 	return self->tsd[k];
 }
+
+weak_alias(__pthread_getspecific, pthread_getspecific);
+weak_alias(__pthread_getspecific, tss_get);
diff --git a/src/thread/pthread_join.c b/src/thread/pthread_join.c
index 719c91c..1b59489 100644
--- a/src/thread/pthread_join.c
+++ b/src/thread/pthread_join.c
@@ -5,7 +5,7 @@ static void dummy(void *p)
 {
 }
 
-int pthread_join(pthread_t t, void **res)
+int __pthread_join(pthread_t t, void **res)
 {
 	int tmp;
 	while ((tmp = t->tid)) __timedwait(&t->tid, tmp, 0, 0, dummy, 0, 0);
@@ -13,3 +13,5 @@ int pthread_join(pthread_t t, void **res)
 	if (t->map_base) munmap(t->map_base, t->map_size);
 	return 0;
 }
+
+weak_alias(__pthread_join, pthread_join);
diff --git a/src/thread/pthread_key_create.c b/src/thread/pthread_key_create.c
index a9187f7..bfcd597 100644
--- a/src/thread/pthread_key_create.c
+++ b/src/thread/pthread_key_create.c
@@ -9,7 +9,7 @@ static void nodtor(void *dummy)
 {
 }
 
-int pthread_key_create(pthread_key_t *k, void (*dtor)(void *))
+int __pthread_key_create(pthread_key_t *k, void (*dtor)(void *))
 {
 	unsigned i = (uintptr_t)&k / 16 % PTHREAD_KEYS_MAX;
 	unsigned j = i;
@@ -31,7 +31,7 @@ int pthread_key_create(pthread_key_t *k, void (*dtor)(void *))
 	return EAGAIN;
 }
 
-int pthread_key_delete(pthread_key_t k)
+int __pthread_key_delete(pthread_key_t k)
 {
 	keys[k] = 0;
 	return 0;
@@ -53,3 +53,6 @@ void __pthread_tsd_run_dtors()
 		}
 	}
 }
+
+weak_alias(__pthread_key_delete, pthread_key_delete);
+weak_alias(__pthread_key_create, pthread_key_create);
diff --git a/src/thread/pthread_once.c b/src/thread/pthread_once.c
index e01f6d4..a678597 100644
--- a/src/thread/pthread_once.c
+++ b/src/thread/pthread_once.c
@@ -6,7 +6,7 @@ static void undo(void *control)
 	__wake(control, 1, 0);
 }
 
-int pthread_once(pthread_once_t *control, void (*init)(void))
+int __pthread_once(pthread_once_t *control, void (*init)(void))
 {
 	static int waiters;
 
@@ -34,3 +34,5 @@ int pthread_once(pthread_once_t *control, void (*init)(void))
 		return 0;
 	}
 }
+
+weak_alias(__pthread_once, pthread_once);
diff --git a/src/thread/pthread_setcancelstate.c b/src/thread/pthread_setcancelstate.c
index 060bcdc..2268217 100644
--- a/src/thread/pthread_setcancelstate.c
+++ b/src/thread/pthread_setcancelstate.c
@@ -1,6 +1,6 @@
 #include "pthread_impl.h"
 
-int pthread_setcancelstate(int new, int *old)
+int __pthread_setcancelstate(int new, int *old)
 {
 	if (new > 1U) return EINVAL;
 	if (!libc.has_thread_pointer) return ENOSYS;
@@ -9,3 +9,5 @@ int pthread_setcancelstate(int new, int *old)
 	self->canceldisable = new;
 	return 0;
 }
+
+weak_alias(__pthread_setcancelstate, pthread_setcancelstate);
diff --git a/src/thread/pthread_setcanceltype.c b/src/thread/pthread_setcanceltype.c
index bf0a3f3..d493c1b 100644
--- a/src/thread/pthread_setcanceltype.c
+++ b/src/thread/pthread_setcanceltype.c
@@ -1,11 +1,13 @@
 #include "pthread_impl.h"
 
+void __pthread_testcancel(void);
+
 int pthread_setcanceltype(int new, int *old)
 {
 	struct pthread *self = __pthread_self();
 	if (new > 1U) return EINVAL;
 	if (old) *old = self->cancelasync;
 	self->cancelasync = new;
-	if (new) pthread_testcancel();
+	if (new) __pthread_testcancel();
 	return 0;
 }
diff --git a/src/thread/pthread_setspecific.c b/src/thread/pthread_setspecific.c
index 55e46a8..dd18b96 100644
--- a/src/thread/pthread_setspecific.c
+++ b/src/thread/pthread_setspecific.c
@@ -1,6 +1,6 @@
 #include "pthread_impl.h"
 
-int pthread_setspecific(pthread_key_t k, const void *x)
+int __pthread_setspecific(pthread_key_t k, const void *x)
 {
 	struct pthread *self = __pthread_self();
 	/* Avoid unnecessary COW */
@@ -10,3 +10,5 @@ int pthread_setspecific(pthread_key_t k, const void *x)
 	}
 	return 0;
 }
+
+weak_alias(__pthread_setspecific, pthread_setspecific);
diff --git a/src/thread/pthread_testcancel.c b/src/thread/pthread_testcancel.c
index ba5f7c6..ee48e6d 100644
--- a/src/thread/pthread_testcancel.c
+++ b/src/thread/pthread_testcancel.c
@@ -7,7 +7,9 @@ static void dummy()
 
 weak_alias(dummy, __testcancel);
 
-void pthread_testcancel()
+void __pthread_testcancel()
 {
 	__testcancel();
 }
+
+weak_alias(__pthread_testcancel, pthread_testcancel);
diff --git a/src/thread/thrd_current.c b/src/thread/thrd_current.c
new file mode 100644
index 0000000..1728535
--- /dev/null
+++ b/src/thread/thrd_current.c
@@ -0,0 +1,11 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+/* Not all archs have __pthread_self as a symbol. On some it expands
+   to inline magic, so this "call" to __pthread_self is not necessarily
+   a tail call. In particular, despite appearances, thrd_current
+   cannot be implemented as a weak symbol. */
+pthread_t thrd_current()
+{
+	return __pthread_self();
+}
diff --git a/src/thread/thrd_detach.c b/src/thread/thrd_detach.c
new file mode 100644
index 0000000..12dff14
--- /dev/null
+++ b/src/thread/thrd_detach.c
@@ -0,0 +1,13 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int __pthread_detach(pthread_t t);
+
+int thrd_detach(thrd_t t)
+{
+	/* In the best of all worlds this is a tail call. */
+	int ret = __pthread_detach(t);
+	if (thrd_success)
+		return ret ? thrd_error : thrd_success;
+	return ret;
+}
diff --git a/src/thread/thrd_equal.c b/src/thread/thrd_equal.c
new file mode 100644
index 0000000..ac49a44
--- /dev/null
+++ b/src/thread/thrd_equal.c
@@ -0,0 +1,6 @@
+#include <threads.h>
+
+int (thrd_equal)(thrd_t a, thrd_t b)
+{
+	return a==b;
+}
diff --git a/src/thread/thrd_join.c b/src/thread/thrd_join.c
new file mode 100644
index 0000000..0446975
--- /dev/null
+++ b/src/thread/thrd_join.c
@@ -0,0 +1,14 @@
+#include "pthread_impl.h"
+#include <sys/mman.h>
+#include <threads.h>
+
+/* C11 threads cannot be canceled, so there is no need for a
+   cancellation function pointer here. */
+int thrd_join(thrd_t t, int *res)
+{
+	int tmp;
+	while ((tmp = t->tid)) __timedwait(&t->tid, tmp, 0, 0, 0, 0, 0);
+	if (res) *res = (int)(intptr_t)t->result;
+	if (t->map_base) munmap(t->map_base, t->map_size);
+	return thrd_success;
+}
diff --git a/src/thread/tss_create.c b/src/thread/tss_create.c
new file mode 100644
index 0000000..0cbadc8
--- /dev/null
+++ b/src/thread/tss_create.c
@@ -0,0 +1,13 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int __pthread_key_create(pthread_key_t *k, void (*dtor)(void *));
+
+int tss_create(tss_t *tss, tss_dtor_t dtor)
+{
+	__THRD_ABI_MARK;
+	/* Different error returns are possible; C11 glues them together
+	   into a single failure notification. This can't be optimized to
+	   a tail call unless thrd_error equals EAGAIN. */
+	return __pthread_key_create(tss, dtor) ? thrd_error : thrd_success;
+}
diff --git a/src/thread/tss_delete.c b/src/thread/tss_delete.c
new file mode 100644
index 0000000..d50731f
--- /dev/null
+++ b/src/thread/tss_delete.c
@@ -0,0 +1,8 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int __pthread_key_delete(pthread_key_t k);
+
+void (tss_delete)(tss_t key) {
+	(void)__pthread_key_delete(key);
+}
diff --git a/src/thread/tss_set.c b/src/thread/tss_set.c
new file mode 100644
index 0000000..70c4fb7
--- /dev/null
+++ b/src/thread/tss_set.c
@@ -0,0 +1,13 @@
+#include "pthread_impl.h"
+#include <threads.h>
+
+int tss_set(tss_t k, void *x)
+{
+	struct pthread *self = __pthread_self();
+	/* Avoid unnecessary COW */
+	if (self->tsd[k] != x) {
+		self->tsd[k] = x;
+		self->tsd_used = 1;
+	}
+	return thrd_success;
+}
diff --git a/src/time/thrd_sleep.c b/src/time/thrd_sleep.c
new file mode 100644
index 0000000..6570b0c
--- /dev/null
+++ b/src/time/thrd_sleep.c
@@ -0,0 +1,31 @@
+#include <time.h>
+#include <errno.h>
+#include "syscall.h"
+#include "libc.h"
+#include "threads.h"
+
+/* Roughly, __syscall already returns the right thing: 0 if all went
+   well, or a negative error indication otherwise. */
+int thrd_sleep(const struct timespec *req, struct timespec *rem)
+{
+  int ret = __syscall(SYS_nanosleep, req, rem);
+  // compile time comparison, constant propagated
+  if (EINTR != 1) {
+    switch (ret) {
+    case 0: return 0;
+      /* error described by POSIX:                                    */
+      /* EINTR  The nanosleep() function was interrupted by a signal. */
+      /* The C11 wording is:                                          */
+      /* -1 if it has been interrupted by a signal                    */
+    case -EINTR:  return -1;
+      /* error described by POSIX */
+    case -EINVAL: return -2;
+      /* described for linux */
+    case -EFAULT: return -3;
+      /* No other error returns are specified. */
+    default: return INT_MIN;
+    }
+  }
+  // potentially a tail call
+  return ret;
+}



* Re: C threads, v3.0
  2014-08-04  9:30 C threads, v3.0 Jens Gustedt
@ 2014-08-04  9:33 ` Jens Gustedt
  2014-08-04 14:50 ` Rich Felker
  2014-08-06  3:52 ` Explaining cond var destroy [Re: [musl] C threads, v3.0] Rich Felker
  2 siblings, 0 replies; 38+ messages in thread
From: Jens Gustedt @ 2014-08-04  9:33 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 753 bytes --]

Sorry,
I didn't finish one sentence

Am Montag, den 04.08.2014, 11:30 +0200 schrieb Jens Gustedt:
> My impression is that pthread_[gs]etspecific is already rarely used
> and that tss_[gs]et will be even less. C threads also introduce
> _Thread_local, a much more convenient interface as long as you don't
> have 

... as long as you know your thread shared variables at compile time
and you don't need destructors to be called.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: C threads, v3.0
  2014-08-04  9:30 C threads, v3.0 Jens Gustedt
  2014-08-04  9:33 ` Jens Gustedt
@ 2014-08-04 14:50 ` Rich Felker
  2014-08-04 16:48   ` Jens Gustedt
  2014-08-06  3:52 ` Explaining cond var destroy [Re: [musl] C threads, v3.0] Rich Felker
  2 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-04 14:50 UTC (permalink / raw)
  To: musl

On Mon, Aug 04, 2014 at 11:30:03AM +0200, Jens Gustedt wrote:
> I'd like to discuss two issues before going further. The first is of
> minor concern. In many places of that code _m_lock uses a flag, namely
> 0x40000000. I only found places where this bit is read, but not where
> it is set. Did I overlook something or is this a leftover?

It's set by the kernel when a thread dies while owning a robust mutex.
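
(For reference, these are the standard Linux futex owner-field bits,
from <linux/futex.h>; the values are kernel ABI:

#define FUTEX_WAITERS		0x80000000	/* there are waiters to wake */
#define FUTEX_OWNER_DIED	0x40000000	/* owner died holding the lock */
#define FUTEX_TID_MASK		0x3fffffff	/* the owner's tid */

so the 0x40000000 tests in the mutex code are checks for
FUTEX_OWNER_DIED, and the 0x80000000 bit used alongside it is
FUTEX_WAITERS.)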

> The second issue concerns tsd. C threads can't be fine tuned with
> attributes, in particular they receive exactly as much stack space as
> we give them. They are supposed to implement an economic thread model
> that uses up as few resources as possible. In the current
> implementation tsd takes up a medium sized part of the stack
> (__pthread_tsd_size) if any of the application or linked libraries use
> pthread_key_create somewhere.

512 bytes on 32-bit archs, and 1k on 64-bit archs, is certainly
nonzero but I wouldn't call it "a medium sized part of the stack" when
the default stack is something like 80k.
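
(For the arithmetic: the table is PTHREAD_KEYS_MAX slots of one
pointer each, and musl defines PTHREAD_KEYS_MAX as 128, so

	128 * sizeof(void *) = 128 * 4 =  512 bytes on 32-bit archs
	128 * sizeof(void *) = 128 * 8 = 1024 bytes on 64-bit archs

against a default thread stack on the order of 80k.)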

> My impression is that pthread_[gs]etspecific is already rarely used
> and that tss_[gs]et will be even less. C threads also introduce
> _Thread_local, a much more convenient interface as long as you don't
> have 
> 
> I don't think that this policy is ideal for C threads, but it could
> perhaps be revised for pthreads, too. My idea is the
> following. (version for C threads with minimal impact on pthreads)
> 
>  - don't assign a tsd in thrd_create
> 
>  - change pthread_getspecific, pthread_setspecific, tss_get, and
>    __pthread_tsd_run_dtors such that they check for nullness of
>    tsd. this is a trivial and non-expensive change.
>    pthread_setspecific may return ENOMEM or EINVAL in such cases. The
>    getters should just return 0. __pthread_tsd_run_dtors obviously
>    would just do nothing, then.

EINVAL is definitely not permitted here since ENOMEM is required by
POSIX for this case, if it happens.

>  - change tss_set, to mmap the table if it is absent and if it is
>    called with a non-null pointer argument. (I.e if it really has to
>    store something). C11 allows this function to fail, so we could
>    return thrd_error if mmap fails.
> 
>  - change thrd_exit to check for tsd and to unmap it at the end if
>    necessary
> 
> For thrd_create one of the consequences would be that main hasn't to
> be treated special with that respect. The additional mmap and unmap in
> tss_set should only concern entire pages. This should only have a
> minimal runtime overhead, but which only would occur for threads that
> call tss_set with a non-null pointer to be stored.
> 
> Please let me know what you think.

This is not an acceptable implementation (at least from the standpoint
of musl's design principles); it sacrifices fail-safe behavior
(guaranteeing non-failure of an interface that's usually impossible to
deal with failure from, and for which there's no fundamental reason it
should be able to fail) to save a tiny amount of memory.

For static linking, I think it's completely acceptable to assume that,
if tsd functions are linked, they're going to be used. Unfortunately
this isn't possible for dynamic linking, and I had considered an
alternate implementation for dynamic linking, based on having the
first call to pthread_key_create simulate loading of a shared library
containing a TLS array of PTHREAD_KEYS_MAX pointers. This would make
the success/failure detectable early at a time where there's
inherently a possibility of failure (too many keys) rather than where
failure is just a consequence of a bad implementation. But the
complexity of doing this, and the added unnecessary failure case
(failure before PTHREAD_KEYS_MAX is hit, which really shouldn't
happen), seemed unjustified for a 0.5-1k savings.

Rich



* Re: C threads, v3.0
  2014-08-04 14:50 ` Rich Felker
@ 2014-08-04 16:48   ` Jens Gustedt
  2014-08-04 17:06     ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-04 16:48 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 5552 bytes --]

Am Montag, den 04.08.2014, 10:50 -0400 schrieb Rich Felker:
> On Mon, Aug 04, 2014 at 11:30:03AM +0200, Jens Gustedt wrote:
> > I'd like to discuss two issues before going further. The first is of
> > minor concern. In many places of that code _m_lock uses a flag, namely
> > 0x40000000. I only found places where this bit is read, but not where
> > it is set. Did I overlook something or is this a leftover?
> 
> It's set by the kernel when a thread dies while owning a robust mutex.

ah, ok, good to know, this was not obvious at all to me.

so I will leave this in, although C mutexes are not supposed to be
robust.

> > The second issue concerns tsd...

> > I don't think that this policy is ideal for C threads, but it could
> > perhaps be revised for pthreads, too. My idea is the
> > following. (version for C threads with minimal impact on pthreads)
> > 
> >  - don't assign a tsd in thrd_create
> > 
> >  - change pthread_getspecific, pthread_setspecific, tss_get, and
> >    __pthread_tsd_run_dtors such that they check for nullness of
> >    tsd. this is a trivial and non-expensive change.
> >    pthread_setspecific may return ENOMEM or EINVAL in such cases. The
> >    getters should just return 0. __pthread_tsd_run_dtors obviously
> >    would just do nothing, then.
> 
> EINVAL is definitely not permitted here since ENOMEM is required by
> POSIX for this case, if it happens.

My idea is that ENOMEM doesn't fit too well here, since it is not this
call to pthread_setspecific that is out of memory, but that the key is
an invalid key for this specific thread. (It would only happen to C
threads that use pthread_setspecific.)

> >  - change tss_set, to mmap the table if it is absent and if it is
> >    called with a non-null pointer argument. (I.e if it really has to
> >    store something). C11 allows this function to fail, so we could
> >    return thrd_error if mmap fails.
> > 
> >  - change thrd_exit to check for tsd and to unmap it at the end if
> >    necessary
> > 
> > For thrd_create one of the consequences would be that main hasn't to
> > be treated special with that respect. The additional mmap and unmap in
> > tss_set should only concern entire pages. This should only have a
> > minimal runtime overhead, but which only would occur for threads that
> > call tss_set with a non-null pointer to be stored.
> > 
> > Please let me know what you think.
> 
> This is not an acceptable implementation (at least from the standpoint
> of musl's design principles); it sacrifices fail-safe behavior
> (guaranteeing non-failure of an interface that's usually impossible to
> deal with failure from, and for which there's no fundamental reason it
> should be able to fail) to safe a tiny amount of memory.

I see. (Also it is not *only* to save memory, it is also to simplify
the implementation and to put the burden on the real user of a
feature.)

I am not completely convinced of your analysis. pthread_key_create and
pthread_setspecific have failure paths that are defined by
POSIX.

pthread_getspecific would never fail. It would always return a null
pointer, which would be the correct result since no value had ever
been stored for that thread.

The tss functions also clearly have failure paths in their definition.

> For static linking, I think it's completely acceptable to assume that,
> if tsd functions are linked, they're going to be used.

As I said, I think this is a feature that is rarely used anyhow.  For
the code that actually links pthread_key_create, it is not at all
clear to me that pthread_setspecific is then used by a lot of
threads. I don't have stats for this, obviously, but the tss stuff
will be used even less.

> Unfortunately
> this isn't possible for dynamic linking, and I had considered an
> alternate implementation for dynamic linking, based on having the
> first call to pthread_key_create simulate loading of a shared library
> containing a TLS array of PTHREAD_KEYS_MAX pointers. This would make
> the success/failure detectable early at a time where there's
> inherently a possibility of failure (too many keys) rather than where
> failure is just a consequence of a bad implementation.

I am not sure what you want to imply by /bad/ implementation :)

Just to clear that out, the way that I propose would activate a
potential failure path if the process ran out of memory, correct. But
the real problem on failure here would be the lack of memory and not
the failure of the pthread_setspecific call itself. In the current
model the corresponding call to pthread_create or thrd_create would
have failed for ENOMEM.

> But the
> complexity of doing this, and the added unnecessary failure case
> (failure before PTHREAD_KEYS_MAX is hit, which really shouldn't
> happen), seemed unjustified for a 0.5-1k savings.

I completely agree that such faking of a shared library load would be
an overkill.

For C threads, there would also be an option to avoid messing around
with the pthread key system and keep that intact. We don't know what
POSIX will decide on how they should interact, anyhow.

We could implement tss separately. That's quite simple to do.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: C threads, v3.0
  2014-08-04 16:48   ` Jens Gustedt
@ 2014-08-04 17:06     ` Rich Felker
  2014-08-04 22:16       ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-04 17:06 UTC (permalink / raw)
  To: musl

On Mon, Aug 04, 2014 at 06:48:59PM +0200, Jens Gustedt wrote:
> Am Montag, den 04.08.2014, 10:50 -0400 schrieb Rich Felker:
> > On Mon, Aug 04, 2014 at 11:30:03AM +0200, Jens Gustedt wrote:
> > > I'd like to discuss two issues before going further. The first is of
> > > minor concern. In many places of that code _m_lock uses a flag, namely
> > > 0x40000000. I only found places where this bit is read, but not where
> > > it is set. Did I overlook something or is this a leftover?
> > 
> > It's set by the kernel when a thread dies while owning a robust mutex.
> 
> ah, ok, good to know, this was not obvious at all to me.
> 
> so I will leave this in, although C mutexes are not supposed to be
> robust.

If the C committee decides that the case of exiting with a mutex
locked needs to be handled (i.e. the mutex must be permanently locked,
not falsely owned by a new thread that happens to get the same id)
then we need to track such mutexes in the robust list. Technically we
wouldn't have to rely on the kernel to do anything with them (that
only matters for process-shared case; for process-local, we control
thread exit and can handle them ourselves) but letting the kernel do
it does reduce the amount of code in pthread_exit, so it might be
desirable still.
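
(For background, a rough sketch of the mechanism; the structures are
the kernel's, from <linux/futex.h>, and the snippet is illustrative
only:

struct robust_list { struct robust_list *next; };
struct robust_list_head {
	struct robust_list list;	/* per-thread list of held robust mutexes */
	long futex_offset;		/* offset from a list entry to its lock word */
	struct robust_list *list_op_pending;	/* entry being (un)linked, for crash races */
};

/* done once per thread: */
syscall(SYS_set_robust_list, &head, sizeof head);

Lock and unlock link and unlink entries purely in userspace; the
kernel walks the list only when the thread dies, setting
FUTEX_OWNER_DIED in each lock word and waking one waiter.)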

> > > The second issue concerns tsd...
> 
> > > I don't think that this policy is ideal for C threads, but it could
> > > perhaps be revised for pthreads, too. My idea is the
> > > following. (version for C threads with minimal impact on pthreads)
> > > 
> > >  - don't assign a tsd in thrd_create
> > > 
> > >  - change pthread_getspecific, pthread_setspecific, tss_get, and
> > >    __pthread_tsd_run_dtors such that they check for nullness of
> > >    tsd. this is a trivial and non-expensive change.
> > >    pthread_setspecific may return ENOMEM or EINVAL in such cases. The
> > >    getters should just return 0. __pthread_tsd_run_dtors obviously
> > >    would just do nothing, then.
> > 
> > EINVAL is definitely not permitted here since ENOMEM is required by
> > POSIX for this case, if it happens.
> 
> My idea is that ENOMEM doesn't fit too well here, since it is not this
> call to pthread_setspecific that is out of memory, but that the key is
> an invalid key for this specific thread. (It would only happen to C
> threads that use pthread_setspecific.)

But the key isn't invalid; it was a valid key obtained by
pthread_key_create or the C11 version. Rather, the failure is a
failure to obtain storage to store the value.

> > >  - change tss_set, to mmap the table if it is absent and if it is
> > >    called with a non-null pointer argument. (I.e if it really has to
> > >    store something). C11 allows this function to fail, so we could
> > >    return thrd_error if mmap fails.
> > > 
> > >  - change thrd_exit to check for tsd and to unmap it at the end if
> > >    necessary
> > > 
> > > For thrd_create one of the consequences would be that main hasn't to
> > > be treated special with that respect. The additional mmap and unmap in
> > > tss_set should only concern entire pages. This should only have a
> > > minimal runtime overhead, but which only would occur for threads that
> > > call tss_set with a non-null pointer to be stored.
> > > 
> > > Please let me know what you think.
> > 
> > This is not an acceptable implementation (at least from the standpoint
> > of musl's design principles); it sacrifices fail-safe behavior
> > (guaranteeing non-failure of an interface that's usually impossible to
> > deal with failure from, and for which there's no fundamental reason it
> > should be able to fail) to safe a tiny amount of memory.
> 
> I see. (Also it is not *only* to save memory, it is also to simplify
> the implementation and to put the burden on the real user of a
> feature.)
> 
> I am not completely convinced of your analysis. pthread_key_create and
> pthread_setspecific have failure paths that are defined by
> POSIX.

POSIX also allows unspecified failures for pthread_mutex_unlock, but
obviously if unlock fails when it should not, there's no way the
program can proceed.

In general programs are very bad about handling resource allocation
failures. The typical behavior, even in library code, is to abort the
whole program, which is obviously bad. So whenever we can guarantee
that a failure will not happen, even if POSIX doesn't mandate that it
can't happen, this is a big win for improving the overall fail-safety
of the system as a whole.

> > For static linking, I think it's completely acceptable to assume that,
> > if tsd functions are linked, they're going to be used.
> 
> As I said, I think this is a feature that is rarely used anyhow.  For
> the code that actually links pthread_key_create, it is not at all
> clear to me that pthread_setspecific is then used by a lot of
> threads. I don't have stats for this, obviously, but the tss stuff
> will even be used less.

It's not that rare, and it's the only way in portable pre-C11 POSIX
programs to use thread-local storage. Presumably most such programs
check for the availability of gcc-style __thread and use it if
possible, only falling back to the old pthread TSD if it's not.
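
(The usual dance, for illustration -- HAVE_TLS standing in for
whatever configure-time check the program uses:

#ifdef HAVE_TLS
static __thread void *buf;	/* fast path: real TLS */
#else
static pthread_key_t buf_key;	/* fallback: pthread TSD */
#endif
)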

However there's also another BIG use case that's not obsolete:
POSIX/C11 TSD keys are the only way to store thread-local data that
needs a destructor to be called when the thread exits. Even if gcc
provides an extension to do dtors on TLS objects in C like C++11 TLS
(I'm not sure if it does), gcc's implementation of C++11 TLS
ctors/dtors is not fail-safe or AS-safe; it requires gratuitous
allocation of a structure to store the dtor, and aborts the program if
this allocation fails.
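
(For illustration, the pattern in question -- the names are mine, the
API is plain POSIX TSD:

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t buf_key;
static pthread_once_t buf_once = PTHREAD_ONCE_INIT;

static void buf_dtor(void *p) { free(p); }
static void buf_init(void) { pthread_key_create(&buf_key, buf_dtor); }

static void *get_buf(void)
{
	void *p;
	pthread_once(&buf_once, buf_init);
	p = pthread_getspecific(buf_key);
	if (!p && (p = malloc(1024)))
		pthread_setspecific(buf_key, p);
	return p;
}

buf_dtor runs at the exit of every thread that stored a non-null
value -- exactly what plain __thread/_Thread_local cannot express in
C.)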

> > Unfortunately
> > this isn't possible for dynamic linking, and I had considered an
> > alternate implementation for dynamic linking, based on having the
> > first call to pthread_key_create simulate loading of a shared library
> > containing a TLS array of PTHREAD_KEYS_MAX pointers. This would make
> > the success/failure detectable early at a time where there's
> > inherently a possibility of failure (too many keys) rather than where
> > failure is just a consequence of a bad implementation.
> 
> I am not sure what you want to imply by /bad/ implementation :)
> 
> Just to clear that out, the way that I propose would activate a
> potential failure path if the process ran out of memory, correct. But
> the real problem on failure here would be the lack of memory and not
> the failure of the pthread_setspecific call itself. In the current
> model the corresponding call to pthread_create or thrd_create would
> have failed for ENOMEM.

Right. It's much better for pthread_create (which is fundamentally a
resource allocation action) to fail than for assignment to an
allocated resource (already conceptually allocated by
pthread_key_create) to fail. Although it's not as bad as real TLS
failure like glibc has (since there's no way to report the failure of
the latter; it has to crash), it's still a situation that's difficult
to handle at run-time and where applications and even libraries are
likely to handle it just by aborting the program.

> > But the
> > complexity of doing this, and the added unnecessary failure case
> > (failure before PTHREAD_KEYS_MAX is hit, which really shouldn't
> > happen), seemed unjustified for a 0.5-1k savings.
> 
> I completely agree that such faking of a shared library load would be
> an overkill.
> 
> For C threads, there would also be an option to avoid messing around
> with the pthread key system and keep that intact. We don't know what
> POSIX will decide on how they should interact, anyhow.

The only possible interaction we could care about would be if POSIX
specified that PTHREAD_KEYS_MAX keys must be allocatable even if some
C11 keys have already been allocated, and/or vice versa. In this case
we could need to separately track the number of each and allow a
higher total number. But I don't see how this would otherwise affect
the requirements on, or implementation of, either one.

> We could implement tss separately. That's quite simple to do.

Yes but that of course increases the cost of a largely-useless (except
in the case of needing a dtor) feature, especially in per-thread
memory needed, which I would like to avoid unless it turns out to be
needed.

Rich



* Re: C threads, v3.0
  2014-08-04 17:06     ` Rich Felker
@ 2014-08-04 22:16       ` Jens Gustedt
  2014-08-04 22:36         ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-04 22:16 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 4553 bytes --]

Am Montag, den 04.08.2014, 13:06 -0400 schrieb Rich Felker:
> On Mon, Aug 04, 2014 at 06:48:59PM +0200, Jens Gustedt wrote:
> > Am Montag, den 04.08.2014, 10:50 -0400 schrieb Rich Felker:
> > > On Mon, Aug 04, 2014 at 11:30:03AM +0200, Jens Gustedt wrote:
> > > > I'd like to discuss two issues before going further. The first is of
> > > > minor concern. In many places of that code _m_lock uses a flag, namely
> > > > 0x40000000. I only found places where this bit is read, but not where
> > > > it is set. Did I overlook something or is this a leftover?
> > > 
> > > It's set by the kernel when a thread dies while owning a robust mutex.
> > 
> > ah, ok, good to know, this was not obvious at all to me.
> > 
> > so I will leave this in, although C mutexes are not supposed to be
> > robust.
> 
> If the C committee decides that the case of exiting with a mutex
> locked needs to be handled (i.e. the mutex must be permanently locked,
> not falsely owned by a new thread that happens to get the same id)
> then we need to track such mutexes in the robust list.

I think it is very unlikely that such a decision would be taken. This
stuff is supposed to run on just about any OS, not only POSIX ones,
and imposing such restrictions on other OSes is unlikely to happen.
For the same reason there is strong resistance against static
initialization of mutexes, e.g., simply because one well-known OS
seems not to be able to deal with it.

> Technically we
> wouldn't have to rely on the kernel to do anything with them (that
> only matters for process-shared case; for process-local, we control
> thread exit and can handle them ourselves) but letting the kernel do
> it does reduce the amount of code in pthread_exit, so it might be
> desirable still.

ok, we definitely should leave it in, then.

> > As I said, I think this is a feature that is rarely used anyhow.  For
> > the code that actually links pthread_key_create, it is not at all
> > clear to me that pthread_setspecific is then used by a lot of
> > threads. I don't have stats for this, obviously, but the tss stuff
> > will be used even less.
> 
> It's not that rare, and it's the only way in portable pre-C11 POSIX
> programs to use thread-local storage. Presumably most such programs
> check for the availability of gcc-style __thread and use it if
> possible, only falling back to the old pthread TSD if it's not.

With C11 threads they are guaranteed to have _Thread_local. So the
check against __thread is then obsolete (and trivial).

> However there's also another BIG use case that's not obsolete:
> POSIX/C11 TSD keys are the only way to store thread-local data that
> needs a destructor to be called when the thread exits.

sure

> The only possible interaction we could care about would be if POSIX
> specified that PTHREAD_KEYS_MAX keys must be allocatable even if some
> C11 keys have already been allocated, and/or vice versa. In this case
> we could need to separately track the number of each and allow a
> higher total number. But I don't see how this would otherwise affect
> the requirements on, or implementation of, either one.

The rules for calling destructors also might differ. For the moment C
is sufficiently vague that almost any strategy would do, but who knows
where this will drift.

BTW, I think that the current implementation of
__pthread_tsd_run_dtors is not completely in line with the text. As I
understand it, each iteration should treat the keys in one batch,
consisting of those keys that have values at the beginning of that
iteration. But that is hopefully completely irrelevant in the real
world.
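
A minimal sketch of that batched reading (self->tsd and the keys[]
dtor table are assumed musl internals; the batching logic itself is
only an illustration):

	int i, j, any;
	for (j=0; j<PTHREAD_DESTRUCTOR_ITERATIONS; j++) {
		void *val[PTHREAD_KEYS_MAX];
		any = 0;
		for (i=0; i<PTHREAD_KEYS_MAX; i++) {
			val[i] = self->tsd[i];  /* the batch: values now */
			self->tsd[i] = 0;
			if (val[i]) any = 1;
		}
		if (!any) break;
		for (i=0; i<PTHREAD_KEYS_MAX; i++)
			if (val[i] && keys[i]) keys[i](val[i]);
	}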

> > We could implement tss separately. That's quite simple to do.
> 
> Yes but that of course increases the cost of a largely-useless (except
> in the case of needing a dtor) feature, especially in per-thread
> memory needed, which I would like to avoid unless it turns out to be
> needed.

I just implemented it, and whether I use that new implementation or
the one with pthread_key_create underneath, the difference in the
executable is 8 bytes. The difference would only appear for an
application that uses both features.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: C threads, v3.0
  2014-08-04 22:16       ` Jens Gustedt
@ 2014-08-04 22:36         ` Rich Felker
  0 siblings, 0 replies; 38+ messages in thread
From: Rich Felker @ 2014-08-04 22:36 UTC (permalink / raw)
  To: musl

On Tue, Aug 05, 2014 at 12:16:58AM +0200, Jens Gustedt wrote:
> Am Montag, den 04.08.2014, 13:06 -0400 schrieb Rich Felker:
> > On Mon, Aug 04, 2014 at 06:48:59PM +0200, Jens Gustedt wrote:
> > > Am Montag, den 04.08.2014, 10:50 -0400 schrieb Rich Felker:
> > > > On Mon, Aug 04, 2014 at 11:30:03AM +0200, Jens Gustedt wrote:
> > > > > I'd like to discuss two issues before going further. The first is of
> > > > > minor concern. In many places of that code _m_lock uses a flag, namely
> > > > > 0x40000000. I only found places where this bit is read, but not where
> > > > > it is set. Did I overlook something or is this a leftover?
> > > > 
> > > > It's set by the kernel when a thread dies while owning a robust mutex.
> > > 
> > > ah, ok, good to know, this was not obvious at all to me.
> > > 
> > > so I will leave this in, although C mutexes are not supposed to be
> > > robust.
> > 
> > If the C committee decides that the case of exiting with a mutex
> > locked needs to be handled (i.e. the mutex must be permanently locked,
> > not falsely owned by a new thread that happens to get the same id)
> > then we need to track such mutexes in the robust list.
> 
> I think it is very unlikely that such a decision would be taken. This
> stuff is supposed to run on just about any OS, not only POSIX
> ones. Such restrictions are unlikely to be imposed on other OSes. For
> the same reason there is strong resistance against static
> initialization of mutexes, simply because one well-known OS seems
> unable to deal with it.

I'm less confident. In principle any OS can have each thread put the
mutexes it holds in a linked list and mark them all permanently-locked
when the thread exits. Robust mutexes are hard on POSIX only because
they have to support process-shared semantics where a process can be
killed asynchronously by an uncatchable signal.
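
A sketch of such a scheme, with everything here (types, the sentinel
value) invented for illustration:

	#include <limits.h>
	#define PERMANENTLY_LOCKED INT_MIN

	struct mtx {
		int lock;               /* 0 free, owner tid, or sentinel */
		struct mtx *held_next;  /* link in the owner's held list */
	};
	static _Thread_local struct mtx *held_list;

	/* run from the thread-exit path */
	static void mark_held_mutexes(void)
	{
		for (struct mtx *m = held_list; m; m = m->held_next)
			m->lock = PERMANENTLY_LOCKED;  /* never owned again */
	}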

> > Technically we
> > wouldn't have to rely on the kernel to do anything with them (that
> > only matters for process-shared case; for process-local, we control
> > thread exit and can handle them ourselves) but letting the kernel do
> > it does reduce the amount of code in pthread_exit, so it might be
> > desirable still.
> 
> ok, we definitely should leave it in, then.

I haven't reviewed the actual code yet but I'll take a look and see if
it looks like it would work. Actually making recursive and
error-checking POSIX mutexes "always robust" (not exactly, but that's
roughly how it would work) is a change I should make soon anyway. And
I also have the private-futex patch pending, awaiting evidence that
it's actually useful before applying it.

> > > As I said, I think this is a feature that is rarely used anyhow.  For
> > > the code that actually links pthread_key_create, it is not at all
> > > clear to me that pthread_setspecific is then used by a lot of
> > > threads. I don't have stats for this, obviously, but the tss stuff
> > > will be used even less.
> > 
> > It's not that rare, and it's the only way in portable pre-C11 POSIX
> > programs to use thread-local storage. Presumably most such programs
> > check for the availability of gcc-style __thread and use it if
> > possible, only falling back to the old pthread TSD if it's not.
> 
> With C11 threads they are guaranteed to have _Thread_local. So the
> check against __thread is then obsolete (and trivial).

Yes. That's why I said pre-C11 POSIX programs.

> > The only possible interaction we could care about would be if POSIX
> > specified that PTHREAD_KEYS_MAX keys must be allocatable even if some
> > C11 keys have already been allocated, and/or vice versa. In this case
> > we could need to separately track the number of each and allow a
> > higher total number. But I don't see how this would otherwise affect
> > the requirements on, or implementation of, either one.
> 
> The rules for calling destructors also might differ. For the moment C
> is sufficiently vague that almost any strategy would do, but who knows
> where this will drift.
> 
> BTW, I think that the current implementation of
> __pthread_tsd_run_dtors is not completely in line with the text. As I
> understand it, each iteration should treat the keys in one batch,
> consisting of those keys that have values at the beginning of that
> iteration. But that is hopefully completely irrelevant in the real
> world.

If I'm not mistaken, the iterations macro is the minimum number of
passes in which dtors will be called; it seems permissible for the
implementation to make more passes.

> > > We could implement tss separately. That's quite simple to do.
> > 
> > Yes but that of course increases the cost of a largely-useless (except
> > in the case of needing a dtor) feature, especially in per-thread
> > memory needed, which I would like to avoid unless it turns out to be
> > needed.
> 
> I just implemented it, and whether I use that new implementation or
> the one with pthread_key_create underneath, the difference in the
> executable is 8 bytes. The difference would only appear for an
> application that uses both features.

OK I'll take a look.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-04  9:30 C threads, v3.0 Jens Gustedt
  2014-08-04  9:33 ` Jens Gustedt
  2014-08-04 14:50 ` Rich Felker
@ 2014-08-06  3:52 ` Rich Felker
  2014-08-06  8:43   ` Jens Gustedt
  2 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-06  3:52 UTC (permalink / raw)
  To: musl

On Mon, Aug 04, 2014 at 11:30:03AM +0200, Jens Gustedt wrote:
> diff --git a/src/thread/cnd_destroy.c b/src/thread/cnd_destroy.c
> new file mode 100644
> index 0000000..7d6f3a1
> --- /dev/null
> +++ b/src/thread/cnd_destroy.c
> @@ -0,0 +1,21 @@
> +#include "pthread_impl.h"
> +#include <threads.h>
> +
> +/* The behavior of cnd_destroy is undefined if cnd is still in
> +   use. The choice for pthread_cond_destroy in that situation is to
> +   wake up all users before destroying. I am not sure that we should
> +   do it like that here, too. Alternatives would be:
> +   - complain by using perror or equivalent
> +   - assert that there is no waiter
> +   - abort when there is a waiter
> +   - do nothing
> +   */
> +void cnd_destroy(cnd_t *c)
> +{
> +	int cnt;
> +	c->_c_destroy = 1;
> +	if (c->_c_waiters)
> +		__wake(&c->_c_seq, -1, 0);
> +	while ((cnt = c->_c_waiters))
> +		__wait(&c->_c_waiters, 0, cnt, 0);
> +}

The above comment is incorrect; I'll try to explain. At least for
POSIX cond vars, per POSIX, at least one thread that has called
pthread_cond_[timed]wait ceases to be a waiter as soon as
pthread_cond_signal is called, and all threads which have called
pthread_cond_[timed]wait cease to be waiters as soon as
pthread_cond_broadcast is called. This means that, if a thread calling
pthread_cond_signal or pthread_cond_broadcast has updated the
predicate such that no threads will retry pthread_cond_[timed]wait or
remain as waiters, it may IMMEDIATELY call pthread_cond_destroy
without violating the constraint that pthread_cond_destroy can only be
called when there are no waiters.
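
Concretely, the pattern this is meant to permit looks like this
(names illustrative):

	pthread_mutex_lock(&obj->lock);
	obj->done = 1;                       /* predicate permanently true */
	pthread_cond_broadcast(&obj->cond);  /* all waiters cease to be waiters */
	pthread_mutex_unlock(&obj->lock);
	pthread_cond_destroy(&obj->cond);    /* valid immediately */
	free(obj);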

Since waiters have additional work to do on the memory associated with
the pthread_cond_t object after the futex wait completes, and since we
do not want to force them to wake and finish this work as part of
pthread_cond_signal/broadcast (this would be expensive on every
signal), I've put the code to finish waiting for the waiters to wake
up in pthread_cond_destroy.

If you think this is a bad idea, I'd be willing to hear alternate
ideas. I'm not really happy with the cond var implementation (if
nothing else, the sequence number thing is an ugly hack and not 100%
robust, I think) and at some point I'd like to redesign it.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06  3:52 ` Explaining cond var destroy [Re: [musl] C threads, v3.0] Rich Felker
@ 2014-08-06  8:43   ` Jens Gustedt
  2014-08-06  9:41     ` Jens Gustedt
  2014-08-06  9:50     ` Rich Felker
  0 siblings, 2 replies; 38+ messages in thread
From: Jens Gustedt @ 2014-08-06  8:43 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3390 bytes --]

Am Dienstag, den 05.08.2014, 23:52 -0400 schrieb Rich Felker:
> On Mon, Aug 04, 2014 at 11:30:03AM +0200, Jens Gustedt wrote:
> > +/* The behavior of cnd_destroy is undefined if cnd is still in
> > +   use. The choice for pthread_cond_destroy in that situation is to
> > +   wake up all users before destroying. I am not sure that we should
> > +   do it like that here, too. Alternatives would be:
> > +   - complain by using perror or equivalent
> > +   - assert that there is no waiter
> > +   - abort when there is a waiter
> > +   - do nothing
> > +   */
> 
> The above comment is incorrect; I'll try to explain. At least for
> POSIX cond vars, per POSIX, at least one thread that has called
> pthread_cond_[timed]wait ceases to be a waiter as soon as
> pthread_cond_signal is called, and all threads which have called
> pthread_cond_[timed]wait cease to be waiters as soon as
> pthread_cond_broadcast is called. This means that, if a thread calling
> pthread_cond_signal or pthread_cond_broadcast has updated the
> predicate such that no threads will retry pthread_cond_[timed]wait or
> remain as waiters, it may IMMEDIATELY call pthread_cond_destroy
> without violating the constraint that pthread_cond_destroy can only be
> called when there are no waiters.
> 
> Since waiters have additional work to do on the memory associated with
> the pthread_cond_t object after the futex wait completes, and since we
> do not want to force them to wake and finish this work as part of
> pthread_cond_signal/broadcast (this would be expensive on every
> signal), I've put the code to finish waiting for the waiters to wake
> up in pthread_cond_destroy.

ok, I now understand the motivation behind that and will withdraw or
amend that comment.

> If you think this is a bad idea, I'd be willing to hear alternate
> ideas. I'm not really happy with the cond var implementation (if
> nothing else, the sequence number thing is an ugly hack and not 100%
> robust, I think) and at some point I'd like to redesign it.

As far as I can see the _c_destroy flag is used for no other purpose
than this synchronization between the destroying thread and potential
latecomers.

Technically the real problem is not pthread_cond_destroy (or
cnd_destroy). This could just be a noop as it is for mutexes. Using a
condition after it is destroyed is UB in terms of the standards, but
nothing hinders us from defining a behavior for musl's specific
implementation.

It is rather that the thread that calls destroy() might also
deallocate the object directly afterwards, and that the latecomers
then crash because we pulled their object out from under their
feet. So this effectively introduces a deallocation barrier, which is
nice, but clearly an extension.

I am not sure how to deal with this. The idea that destroy may block
or even do a context switch to the kernel came as a surprise to
me. Maybe I would be happier if _c_destroy were used as a usage count
and destroy just spun on it; that would be more straightforward.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06  8:43   ` Jens Gustedt
@ 2014-08-06  9:41     ` Jens Gustedt
  2014-08-06 10:03       ` Rich Felker
  2014-08-06  9:50     ` Rich Felker
  1 sibling, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-06  9:41 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1711 bytes --]

Am Mittwoch, den 06.08.2014, 10:43 +0200 schrieb Jens Gustedt:
> Am Dienstag, den 05.08.2014, 23:52 -0400 schrieb Rich Felker:
> > If you think this is a bad idea, I'd be willing to hear alternate
> > ideas. I'm not really happy with the cond var implementation (if
> > nothing else, the sequence number thing is an ugly hack and not 100%
> > robust, I think) and at some point I'd like to redesign it.
> 
> I am not sure how to deal with this. The idea that destroy may block
> or even do a context switch to the kernel came as a surprise to
> me. Maybe I would be happier if _c_destroy were used as a usage count
> and destroy just spun on it; that would be more straightforward.

After digging a bit deeper I think I would favor a solution where no
waiter has to touch the condition object at all when coming back from
the wait. Currently this is not the case; on the contrary, it reads
_c_seq and then does heavy maintenance in the unwait() call.

The return from a wait should just run into the call to
pthread_mutex_lock, which would be safe since the call has access to
that mutex from its arguments. All the bookkeeping should be done by
pthread_cond_signal and pthread_cond_broadcast before they wake up
anybody.

I am not sure that this would easily be possible with the current
implementation, but I think that this would be ideal.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06  8:43   ` Jens Gustedt
  2014-08-06  9:41     ` Jens Gustedt
@ 2014-08-06  9:50     ` Rich Felker
  1 sibling, 0 replies; 38+ messages in thread
From: Rich Felker @ 2014-08-06  9:50 UTC (permalink / raw)
  To: musl

On Wed, Aug 06, 2014 at 10:43:15AM +0200, Jens Gustedt wrote:
> > If you think this is a bad idea, I'd be willing to hear alternate
> > ideas. I'm not really happy with the cond var implementation (if
> > nothing else, the sequence number thing is an ugly hack and not 100%
> > robust, I think) and at some point I'd like to redesign it.
> 
> As far as I can see the _c_destroy flag is used for no other purpose
> than this synchronization between the destroying thread and potential
> latecomers.

Yes, and I'm not even 100% sure the usage is correct and race-free. At
the very least the write should probably be a_store rather than simple
assignment.

> Technically the real problem is not pthread_cond_destroy (or
> cnd_destroy). This could just be a noop as it is for mutexes.

For mutexes, extreme care is taken that the implementation not use the
mutex object memory after the formal use (ownership) of the mutex
ends. This is why a mix of a waiter count and new-waiter flag on the
futex int itself is used (so that the waiter count does not need to be
read after the futex int is set to zero) and why the __vm_lock_* stuff
exists (to prevent munmap of the mutex while it's still in the pending
slot of the robust list; see glibc bug 14485).
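
Roughly, the unlock path looks like this (a simplified sketch, not
the literal code; the sign bit stands in for the new-waiter flag):

	int waiters = m->_m_waiters;        /* read BEFORE the release */
	int cont = a_swap(&m->_m_lock, 0);  /* the atomic release */
	if (waiters || cont < 0)            /* sign bit: new-waiter flag */
		__wake(&m->_m_lock, 1, 0);  /* touches only the address */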

> Using a
> condition after it is destroyed is UB in terms of the standards, but
> nothing hinders us from defining a behavior for musl's specific
> implementation.
> 
> It is rather that the thread that calls destroy() might also
> deallocate the object directly afterwards, and that the latecomers
> then crash because we pulled their object out from under their
> feet. So this effectively introduces a deallocation barrier, which is
> nice, but clearly an extension.

It's not an extension. Such deallocation after destroy is explicitly
permitted and is actually the standard usage case, not the exception
(the natural time to destroy and free a cond var is right after you
put the predicate it's associated with into a permanently-true state
and signal all waiters; with cond vars in dynamically allocated
objects there may not even be a later time to do so).

The reason I went to all the trouble making a "deallocation barrier"
is because it's required.

> I am not sure how to deal with this. The idea that destroy may block
> or even do a context switch to the kernel came as a surprise to
> me. Maybe I would be happier if _c_destroy were used as a usage count
> and destroy just spun on it; that would be more straightforward.

A spinlock will deadlock if realtime priorities are used and the
destroying thread has higher priority than at least one of the
former-waiters that needs to return before destroy can proceed.

Fundamentally there's no reason context switch on destroy is odd.
POSIX (and C11) permits implementations where init/destroy for
synchronization objects requires some sort of kernel or kernel-like
allocation (e.g. in special memory). Light or zero-cost init/destroy
is a side effect of having a good underlying system, not a requirement
(though I would suspect many/most applications are not prepared for
systems where mutex init could fail, probably since good systems
guarantee it doesn't).

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06  9:41     ` Jens Gustedt
@ 2014-08-06 10:03       ` Rich Felker
  2014-08-06 10:32         ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-06 10:03 UTC (permalink / raw)
  To: musl

On Wed, Aug 06, 2014 at 11:41:04AM +0200, Jens Gustedt wrote:
> Am Mittwoch, den 06.08.2014, 10:43 +0200 schrieb Jens Gustedt:
> > Am Dienstag, den 05.08.2014, 23:52 -0400 schrieb Rich Felker:
> > > If you think this is a bad idea, I'd be willing to hear alternate
> > > ideas. I'm not really happy with the cond var implementation (if
> > > nothing else, the sequence number thing is an ugly hack and not 100%
> > > robust, I think) and at some point I'd like to redesign it.
> > 
> > I am not sure how to deal with this. The idea that destroy may block
> > or even do a context switch to the kernel came as a surprise to
> > me. Maybe I would be happier if _c_destroy were used as a usage count
> > and destroy just spun on it; that would be more straightforward.
> 
> After digging a bit deeper I think I would favor a solution where no
> waiter has to touch the condition object at all when coming back from
> the wait. Currently this is not the case; on the contrary, it reads
> _c_seq and then does heavy maintenance in the unwait() call.
> 
> The return from a wait should just run into the call to
> pthread_mutex_lock, which would be safe since the call has access to
> that mutex from its arguments. All the bookkeeping should be done by
> pthread_cond_signal and pthread_cond_broadcast before they wake up
> anybody.
> 
> I am not sure that this would easily be possible with the current
> implementation, but I think that this would be ideal.

Yes, that would be my ideal too, but I'm not sure it's possible. Well,
not with an efficient implementation anyway. There's a trivial
implementation where pthread_cond_timedwait is basically just:

a_store(&c->tid, self->tid);
pthread_mutex_unlock(mtx);
__timedwait(&c->tid, self->tid, clock, ts, cleanup, mtx, 0);
pthread_mutex_lock(mtx);

This makes signal and broadcast operations utterly trivial (just set
c->tid to 0 and call __wake with 1 or INT_MAX as the argument), but it
has lots of spurious wakes under contention, I think, and perhaps
other serious drawbacks too. If I remember correctly, musl's original
implementation looked something like the above, but I was also having
a hard time figuring out how to work requeue-to-mutex (which makes a
big deal to performance on big broadcasts) into it.
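
Spelled out, the signal/broadcast side of that trivial scheme would
just be (a sketch matching the snippet above):

	a_store(&c->tid, 0);
	__wake(&c->tid, 1, 0);  /* signal; pass INT_MAX instead of 1
	                           for broadcast */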

One idea I've been playing with is possibly eliminating most or all
use of waiter counts in musl and using a pure potential-waiters-flag
approach. That's a whole separate topic I should start a thread on,
but to be brief, it's motivated by some pathological behavior of
waiter counts under rapid lock/unlock by the same thread where a
useless futex wake happens after each unlock if the waiter has not yet
gotten a chance to run. If eliminating waiter counts would make it
easier to do a good cond var implementation that would be another
reason to consider it.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 10:03       ` Rich Felker
@ 2014-08-06 10:32         ` Jens Gustedt
  2014-08-06 16:15           ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-06 10:32 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1996 bytes --]

Am Mittwoch, den 06.08.2014, 06:03 -0400 schrieb Rich Felker:
> On Wed, Aug 06, 2014 at 11:41:04AM +0200, Jens Gustedt wrote:
> > I am not sure that this would easily be possible with the current
> > implementation, but I think that this would be ideal.
> 
> Yes, that would be my ideal too, but I'm not sure it's possible. Well,
> not with an efficient implementation anyway. There's a trivial
> implementation where pthread_cond_timedwait is basically just:
> 
> a_store(&c->tid, self->tid);
> pthread_mutex_unlock(mtx);
> __timedwait(&c->tid, self->tid, clock, ts, cleanup, mtx, 0);
> pthread_mutex_lock(mtx);
> 
> This makes signal and broadcast operations utterly trivial (just set
> c->tid to 0 and call __wake with 1 or INT_MAX as the argument), but it
> has lots of spurious wakes under contention,

yes

> I think, and perhaps
> other serious drawbacks too. If I remember correctly, musl's original
> implementation looked something like the above, but I was also having
> a hard time figuring out how to work requeue-to-mutex (which makes a
> big deal to performance on big broadcasts) into it.

Look into pthread_cond_broadcast. I think this could do better
bookkeeping. Currently it does its case analysis with a mildly
complicated expression when preparing the arguments to the
requeue syscall. I think we would be better off doing the case
analysis beforehand and updating the different "waiters" variables
accordingly *before* the syscall. So the targeted threads wouldn't
have to do any bookkeeping when coming back from wait.

Maybe this would even help to reduce the condition state from two
"waiters" fields to a single one.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 10:32         ` Jens Gustedt
@ 2014-08-06 16:15           ` Rich Felker
  2014-08-06 16:56             ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-06 16:15 UTC (permalink / raw)
  To: musl

On Wed, Aug 06, 2014 at 12:32:52PM +0200, Jens Gustedt wrote:
> > I think, and perhaps
> > other serious drawbacks too. If I remember correctly, musl's original
> > implementation looked something like the above, but I was also having
> > a hard time figuring out how to work requeue-to-mutex (which makes a
> > big deal to performance on big broadcasts) into it.
> 
> Look into pthread_cond_broadcast. I think this could do better
> bookkeeping. Currently it does its case analysis with a mildly
> complicated expression when preparing the arguments to the
> requeue syscall.

I don't think that expression is the problem. What it's doing is very
simple: it's useless to wake any threads if the thread calling
pthread_cond_broadcast currently holds the mutex, since the woken
thread would immediately block again on the mutex. Instead, one thread
will be woken when the mutex is unlocked. This is ensured by "moving"
any remaining wait count from the cond var waiters to the mutex
waiters (so that the mutex unlock will cause a futex wake).
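
In futex terms that "move" is a requeue; schematically it amounts to
this (a sketch, not musl's exact call sites, with the waiter
accounting simplified):

	/* broadcaster holds the mutex: wake nobody now, requeue all
	   cond-var waiters onto the mutex futex, and credit them to the
	   mutex waiter count so the eventual unlock issues a wake */
	a_fetch_add(&m->_m_waiters, w);  /* w = waiters being moved */
	__syscall(SYS_futex, &c->_c_seq, FUTEX_REQUEUE,
	          0, INT_MAX, &m->_m_lock);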

> I think we would be better off doing the case
> analysis beforehand and updating the different "waiters" variables
> accordingly *before* the syscall. So the targeted threads wouldn't
> have to do any bookkeeping when coming back from wait.

One problem with this is that being woken by a signal isn't the only
way a cond var wait can end. It can also end due to timeout or
cancellation, and in that case, there's no signaling thread to be
responsible for the bookkeeping, but if another thread calls
pthread_cond_broadcast at this point, it can legally destroy and free
the cond var, despite the timed-out/cancelled wait still having its
own bookkeeping to do.

The best possible solution would be if we could eliminate the need for
this bookkeeping entirely, but I don't see any easy way to do that.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 16:15           ` Rich Felker
@ 2014-08-06 16:56             ` Jens Gustedt
  2014-08-06 17:32               ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-06 16:56 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1743 bytes --]

Am Mittwoch, den 06.08.2014, 12:15 -0400 schrieb Rich Felker:
> One problem with this is that being woken by a signal isn't the only
> way a cond var wait can end. It can also end due to timeout or
> cancellation, and in that case, there's no signaling thread to be
> responsible for the bookkeeping, but if another thread calls
> pthread_cond_broadcast at this point, it can legally destroy and free
> the cond var, despite the timed-out/cancelled wait still having its
> own bookkeeping to do.
> 
> The best possible solution would be if we could eliminate the need for
> this bookkeeping entirely, but I don't see any easy way to do that.

Easy perhaps not. But the futex syscall returns some of the
information that is currently thrown away.

 - when returning with 0 from do_wait we can't distinguish whether we
   were woken up by a FUTEX_WAKE (good) or returned with EWOULDBLOCK
   (bad, retry)

 - the call with FUTEX_REQUEUE drops the information whether some
   thread was woken up; the same goes for the __wake in
   pthread_cond_signal.

 - pthread_cond_broadcast uses FUTEX_REQUEUE and not
   FUTEX_CMP_REQUEUE.  The latter could help to ensure that we don't
   have a race on _c_seq. If so, we perhaps wouldn't have to hold a
   lock when going into the syscall (see the sketch below).

I don't know yet whether this lost information would help in all
cases. But maybe it would in some, and that could open a path to
simplification. I'll dig a bit more.
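
For the last point, a sketch of what the FUTEX_CMP_REQUEUE variant
could look like (argument order per futex(2); "seq" stands for the
value of _c_seq read before sleeping, all names illustrative):

	/* wake one waiter and requeue the rest onto the mutex futex,
	   but only if the sequence word still holds the value we saw;
	   otherwise the kernel returns -EAGAIN instead of requeueing
	   against a stale sequence */
	int r = __syscall(SYS_futex, &c->_c_seq, FUTEX_CMP_REQUEUE,
	                  1, INT_MAX, &m->_m_lock, seq);
	/* r >= 0 is the number of threads woken plus requeued */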

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 16:56             ` Jens Gustedt
@ 2014-08-06 17:32               ` Rich Felker
  2014-08-06 20:55                 ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-06 17:32 UTC (permalink / raw)
  To: musl

On Wed, Aug 06, 2014 at 06:56:56PM +0200, Jens Gustedt wrote:
> Am Mittwoch, den 06.08.2014, 12:15 -0400 schrieb Rich Felker:
> > One problem with this is that being woken by a signal isn't the only
> > way a cond var wait can end. It can also end due to timeout or
> > cancellation, and in that case, there's no signaling thread to be
> > responsible for the bookkeeping, but if another thread calls
> > pthread_cond_broadcast at this point, it can legally destroy and free
> > the cond var, despite the timed-out/cancelled wait still having its
> > own bookkeeping to do.
> > 
> > The best possible solution would be if we could eliminate the need for
> > this bookkeeping entirely, but I don't see any easy way to do that.
> 
> Easy perhaps not. But the futex syscall returns some of the
> information that is currently thrown away.

Sadly it seems like a bad idea to rely on any of the information
returned by the futex syscall, due to the possibility of spurious
wakes coming from reuse of memory. This is discussed somewhat on glibc
issue 13690:

https://sourceware.org/bugzilla/show_bug.cgi?id=13690

The core issue is that, when performing operations like a mutex
unlock, we can only know if futex wake is needed _after_ the atomic
release action has been performed, and at that point, the address on
which the futex wake is performed may no longer be valid, or may have
been reallocated for another use. Thus, futex wait for a new
synchronization object can spuriously return 0.

I think trying to preclude the possibility of such spurious wakes
would require a lot more unlock/destroy synchronization than what we
have now in cond vars.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 17:32               ` Rich Felker
@ 2014-08-06 20:55                 ` Jens Gustedt
  2014-08-06 22:04                   ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-06 20:55 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1847 bytes --]

Am Mittwoch, den 06.08.2014, 13:32 -0400 schrieb Rich Felker:
> On Wed, Aug 06, 2014 at 06:56:56PM +0200, Jens Gustedt wrote:
> > Easy perhaps not. But the futex syscall returns some of the
> > information that is currently thrown away.
> 
> Sadly it seems like a bad idea to rely on any of the information
> returned by the futex syscall, due to the possibility of spurious
> wakes coming from reuse of memory. This is discussed somewhat on glibc
> issue 13690:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=13690

that is a complicated thread; I am not sure that I captured all its
aspects

> The core issue is that, when performing operations like a mutex
> unlock, we can only know if futex wake is needed _after_ the atomic
> release action has been performed, and at that point, the address on
> which the futex wake is performed may no longer be valid, or may have
> been reallocated for another use. Thus, futex wait for a new
> synchronization object can spuriously return 0.

*If* we suppose that your destroy-wait trick works (and I think it
does), such a reuse can never happen for a pthread_cond_t, and thus
the futex syscall should return the correct values. I'll try to work
out some simplification from there.

> I think trying to preclude the possibility of such spurious wakes
> would require a lot more unlock/destroy synchronization than what we
> have now in cond vars.

Perhaps one could just let some spurious wakeups in some weird
situations escape to user space.

We'll see.

Jens


-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 20:55                 ` Jens Gustedt
@ 2014-08-06 22:04                   ` Rich Felker
  2014-08-06 22:43                     ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-06 22:04 UTC (permalink / raw)
  To: musl

On Wed, Aug 06, 2014 at 10:55:10PM +0200, Jens Gustedt wrote:
> Am Mittwoch, den 06.08.2014, 13:32 -0400 schrieb Rich Felker:
> > On Wed, Aug 06, 2014 at 06:56:56PM +0200, Jens Gustedt wrote:
> > > Easy perhaps not. But the futex syscall returns some of the
> > > information that is currently thrown away.
> > 
> > Sadly it seems like a bad idea to rely on any of the information
> > returned by the futex syscall, due to the possibility of spurious
> > wakes coming from reuse of memory. This is discussed somewhat on glibc
> > issue 13690:
> > 
> > https://sourceware.org/bugzilla/show_bug.cgi?id=13690
> 
> that is a complicated thread; I am not sure that I captured all its
> aspects

Yes, I think not.. :)

The issue I'm talking about now doesn't come up until near the end of
the thread, and it concerns whether it's safe to call futex wake
on an address that's no longer owned at the time of the wake.

> > The core issue is that, when performing operations like a mutex
> > unlock, we can only know if futex wake is needed _after_ the atomic
> > release action has been performed, and at that point, the address on
> > which the futex wake is performed may no longer be valid, or may have
> > been reallocated for another use. Thus, futex wait for a new
> > synchronization object can spuriously return 0.
> 
> *If* we suppose that your destroy-wait trick works (and I think it
> does), such a reuse can never happen for a pthread_cond_t, and thus
> the futex syscall should return the correct values. I'll try to work
> out some simplification from there.

No. The problem is that the memory in which the cond var resides could
previously have been used for a mutex, semaphore, or other object for
which a futex wake arrives "very late" due to scheduling. It has
nothing to do with the destruction of cond vars.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 22:04                   ` Rich Felker
@ 2014-08-06 22:43                     ` Jens Gustedt
  2014-08-06 23:15                       ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-06 22:43 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 985 bytes --]

Am Mittwoch, den 06.08.2014, 18:04 -0400 schrieb Rich Felker:
> No. The problem is that the memory in which the cond var resides could
> previously have been used for a mutex, semaphore, or other object for
> which a futex wake arrives "very late" due to scheduling. It has
> nothing to do with the destruction of cond vars.

Hm, but then we have a problem on the side that does this late wait:
that is an access to a dead object. So the UB is there, in the thread
that is doing the wait. (Supposing it is just threads and not
processes; the latter would be even more complicated.) The spurious
wakeup that you are talking about is a consequence of UB.

good night

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 22:43                     ` Jens Gustedt
@ 2014-08-06 23:15                       ` Rich Felker
  2014-08-07  7:50                         ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-06 23:15 UTC (permalink / raw)
  To: musl

On Thu, Aug 07, 2014 at 12:43:35AM +0200, Jens Gustedt wrote:
> Am Mittwoch, den 06.08.2014, 18:04 -0400 schrieb Rich Felker:
> > No. The problem is that the memory in which the cond var resides could
> > previously have been used for a mutex, semaphore, or other object for
> > which a futex wake arrives "very late" due to scheduling. It has
> > nothing to do with the destruction of cond vars.
> 
> Hm, but then we have a problem on the side that does this late wait:
> that is an access to a dead object. So the UB is there, in the thread
> that is doing the wait. (Supposing it is just threads and not
> processes; the latter would be even more complicated.) The spurious
> wakeup that you are talking about is a consequence of UB.

No, it's not. The wait happens prior to the deallocation, in the same
thread that performs the deallocation. The interleaving looks like
this:

Thread A                        Thread B
--------                        --------
waiters++
                                save waiters count
                                atomic unlock
futex wait fails with EAGAIN
cas succeeds & gets lock
waiters--
[unlock operation]
[free operation]
                                futex wake to freed address

The free operation in thread A is valid since A knows it is the last
user of the mutex and thread B's use/ownership of the mutex formally
ends with the atomic unlock.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-06 23:15                       ` Rich Felker
@ 2014-08-07  7:50                         ` Jens Gustedt
  2014-08-07 10:52                           ` Szabolcs Nagy
  2014-08-07 16:13                           ` Rich Felker
  0 siblings, 2 replies; 38+ messages in thread
From: Jens Gustedt @ 2014-08-07  7:50 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3831 bytes --]

Am Mittwoch, den 06.08.2014, 19:15 -0400 schrieb Rich Felker:
> No, it's not. The wait happens prior to the deallocation, in the same
> thread that performs the deallocation. The interleaving looks like
> this:
> 
> Thread A                        Thread B
> --------                        --------
> waiters++
>                                 save waiters count
>                                 atomic unlock
> futex wait fails with EAGAIN
> cas succeeds & gets lock
> waiters--
> [unlock operation]
> [free operation]
>                                 futex wake to freed address
> 
> The free operation in thread A is valid since A knows it is the last
> user of the mutex and thread B's use/ownership of the mutex formally
> ends with the atomic unlock.

No, operating on an object that has been freed is UB.  This is
independent of whether the object is a mutex or not. This must never
happen. So the free is making a wrong assumption.

I think the fundamental flaw with this approach is that it mixes two
different concepts, a waiters count and a reference count; these are
not the same thing.

With a reference count, the schema looks like this.

Initially the "obj->refs" counter is at least 2 because both threads
hold references on the object.

Thread A                                   Thread B
--------                                   --------
waiters++
                                           save waiters count
                                           atomic unlock
futex wait fails with EAGAIN
cas succeeds & gets lock
waiters--
[unlock operation]
if (atomic_fetch_sub(&obj->refs,1) == 1)
  [free operation on obj]

                                           futex wake to freed address
                                           if (atomic_fetch_sub(&obj->refs,1) == 1)
                                              [free operation on obj]

Which thread does the free operation (if any) only depends on the
order in which the atomic_fetch_sub operations take effect. (And
musl doesn't seem to have the primitives to do atomic_fetch_sub?)
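
Spelled out with C11 <stdatomic.h> (assuming "refs" is an _Atomic
int), the release step on each side would just be:

	if (atomic_fetch_sub(&obj->refs, 1) == 1)
		free(obj);  /* we held the last reference */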

Now I am aware that such a scheme is difficult to establish in a
setting where obj may or may not be malloced. This scenario supposes
that both threads *know* that the allocation of obj has been done with
malloc.

The easiest way to assure that would be to impose that the "real"
data object used by the thread lock, unlock, wait, etc. operations
always be malloced.

For C threads this can be done by mtx_init and cnd_init.  They would
allocate the dynamic object, set "refs" to 1, and set a link to that
object. For mtx_t and cnd_t, dynamic initialization is imperative.

For pthread unshared mutexes and conditions that are initialized by
an initializer (and not by the corresponding init function), one can
certainly get away with delaying that part of the dynamic
initialization to the first usage. This could also be done with mtx_t
and cnd_t as an extension to the C standard.

For process-shared mutexes and conditions I have no clue how I would
do that.

This uses dynamic allocation under the hood, and adds possible failure
paths to the game. But as always for allocations these failures occur
in situations where the application is pretty much screwed.

I don't like going dynamic too much, and I strongly suspect that you
don't like such an approach for this reason, either :)

But for C threads this would be a way to go.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-07  7:50                         ` Jens Gustedt
@ 2014-08-07 10:52                           ` Szabolcs Nagy
  2014-08-07 11:03                             ` Jens Gustedt
  2014-08-07 16:13                           ` Rich Felker
  1 sibling, 1 reply; 38+ messages in thread
From: Szabolcs Nagy @ 2014-08-07 10:52 UTC (permalink / raw)
  To: musl

* Jens Gustedt <jens.gustedt@inria.fr> [2014-08-07 09:50:51 +0200]:
> Am Mittwoch, den 06.08.2014, 19:15 -0400 schrieb Rich Felker:
> > No, it's not. The wait happens prior to the deallocation, in the same
> > thread that performs the deallocation. The interleaving looks like
> > this:
> > 
> > Thread A                        Thread B
> > --------                        --------
> > waiters++
> >                                 save waiters count
> >                                 atomic unlock
> > futex wait fails with EAGAIN
> > cas succeeds & gets lock
> > waiters--
> > [unlock operation]
> > [free operation]
> >                                 futex wake to freed address
> > 
> > The free operation in thread A is valid since A knows it is the last
> > user of the mutex and thread B's use/ownership of the mutex formally
> > ends with the atomic unlock.
> 
> No, operating on an object that has been freed is UB.  This is
> independent of whether the object is a mutex or not. This must never
> happen. So the free is making a wrong assumption.
> 
> I think the fundamental flaw with this approach is that it mixes two
> different concepts, a waiters count and a reference count; these are
> not the same thing.
> 
> With a reference count, the schema looks like this.
> 
> Initially the "obj->refs" counter is at least 2 because both threads
> hold references on the object.
> 
> Thread A                                   Thread B
> --------                                   --------
> waiters++
>                                            save waiters count
>                                            atomic unlock
> futex wait fails with EAGAIN
> cas succeeds & gets lock
> waiters--
> [unlock operation]
> if (atomic_fetch_sub(&obj->refs,1) == 1)
>   [free operation on obj]
> 
>                                            futex wake to freed address
>                                            if (atomic_fetch_sub(&obj->refs,1) == 1)
>                                               [free operation on obj]
> 
> Which thread does the free operation (if any) only depends on the
> order in which the atomic_fetch_sub operations take effect. (And
> musl doesn't seem to have the primitives to do atomic_fetch_sub?)
> 
> Now I am aware that such a scheme is difficult to establish in a
> setting where obj may or may not be malloced. This scenario supposes
> that both threads *know* that the allocation of obj has been done
> with malloc.
> 

i havent followed all the discussions so i dont know if
this is libc internal only or exposed to the user (who
does the free)

but the locking behaviour exposed to the user should
be such that this pattern works:

	mutex_lock(obj->lock);
	c = --obj->refs;
	mutex_unlock(obj->lock);
	if (!c)
		free(obj);

if it does not work then there will be hard-to-debug bugs

https://sourceware.org/bugzilla/show_bug.cgi?id=13690
http://austingroupbugs.net/view.php?id=811
http://lwn.net/Articles/575477/
https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg04583.html

> The easiest way to assure that would be to impose that the "real"
> data object used by the thread lock, unlock, wait, etc. operations
> always be malloced.
> 
> For C threads this can be done by mtx_init and cnd_init.  They would
> allocate the dynamic object, set "refs" to 1, and set a link to that
> object. For mtx_t and cnd_t, dynamic initialization is imperative.
> 

dynamic allocation per mutex or cond var sounds bad


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-07 10:52                           ` Szabolcs Nagy
@ 2014-08-07 11:03                             ` Jens Gustedt
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Gustedt @ 2014-08-07 11:03 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1568 bytes --]

Am Donnerstag, den 07.08.2014, 12:52 +0200 schrieb Szabolcs Nagy:
> * Jens Gustedt <jens.gustedt@inria.fr> [2014-08-07 09:50:51 +0200]:
> i havent followed all the discussions so i dont know if
> this is libc internal only or exposed to the user (who
> does the free)
> 
> but the locking behaviour exposed to the user should
> be such that this pattern works:
> 
> 	mutex_lock(obj->lock);
> 	c = --obj->refs;
> 	mutex_unlock(obj->lock);
> 	if (!c)
> 		free(obj);

yes, but these were not the assumptions that were made upthread. The
assumption was that a "waiters" counter had fallen to 0, not a
reference count.

> if it does not work then there will be hard-to-debug bugs

absolutely

> > The easiest way to assure that would be to impose that the "real"
> > data object used by the thread lock, unlock, wait, etc. operations
> > always be malloced.
> > 
> > For C threads this can be done by mtx_init and cnd_init.  They would
> > allocate the dynamic object, set "refs" to 1, and set a link to that
> > object. For mtx_t and cnd_t, dynamic initialization is imperative.
> > 
> 
> dynamic allocation per mutex or cond var sounds bad

sure, but for the moment I am out of ideas, unfortunately

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-07  7:50                         ` Jens Gustedt
  2014-08-07 10:52                           ` Szabolcs Nagy
@ 2014-08-07 16:13                           ` Rich Felker
  2014-08-07 16:47                             ` Jens Gustedt
  1 sibling, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-07 16:13 UTC (permalink / raw)
  To: musl

On Thu, Aug 07, 2014 at 09:50:51AM +0200, Jens Gustedt wrote:
> Am Mittwoch, den 06.08.2014, 19:15 -0400 schrieb Rich Felker:
> > No, it's not. The wait happens prior to the deallocation, in the same
> > thread that performs the deallocation. The interleaving looks like
> > this:
> > 
> > Thread A                        Thread B
> > --------                        --------
> > waiters++
> >                                 save waiters count
> >                                 atomic unlock
> > futex wait fails with EAGAIN
> > cas succeeds & gets lock
> > waiters--
> > [unlock operation]
> > [free operation]
> >                                 futex wake to freed address
> > 
> > The free operation in thread A is valid since A knows it is the last
> > user of the mutex and thread B's use/ownership of the mutex formally
> > ends with the atomic unlock.
> 
> No, operating on an object that has been freed is UB.  This is

No operation is performed on an object which has been freed. The futex
wake is performed on the _address_, not the object, requesting that
the kernel use the address as a key into a hash table and wake any
threads which are blocked waiting on the futex object associated with
that address. The address is never dereferenced. This is the whole
point of the current design.

> independent of whether the object is a mutex or not. This must never
> happen. So the free is making a wrong assumption.

You should clarify whether you mean internal UB in the implementation,
or UB in the application. My understanding is that you meant in the
implementation, but I claim that's wrong.

> I think the fundamental flaw with this approach is that it mixes two
> different concepts, a waiters count and a reference count; these are
> not the same thing.

No, the waiters count is not used as a reference count. Only the state
of the atomic int is used as a reference; once it's set to zero the
implementation can no longer access the object (since another thread
in the application is free to lock, unlock, destroy, and free it).

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-07 16:13                           ` Rich Felker
@ 2014-08-07 16:47                             ` Jens Gustedt
  2014-08-07 17:25                               ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-07 16:47 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2747 bytes --]

Am Donnerstag, den 07.08.2014, 12:13 -0400 schrieb Rich Felker:
> On Thu, Aug 07, 2014 at 09:50:51AM +0200, Jens Gustedt wrote:
> > Am Mittwoch, den 06.08.2014, 19:15 -0400 schrieb Rich Felker:
> > > The free operation in thread A is valid since A knows it is the last
> > > user of the mutex and thread B's use/ownership of the mutex formally
> > > ends with the atomic unlock.
> > 
> > No, operating on an object that has been freed is UB.  This is
> 
> No operation is performed on an object which has been freed.

ok, let me rephrase

passing an invalid userspace address to the kernel for a futex
operation is UB

> The futex
> wake is performed on the _address_, not the object, requesting that
> the kernel use the address as a key into a hash table and wake any
> threads which are blocked waiting on the futex object associated with
> that address. The address is never dereferenced. This is the whole
> point of the current design.

Yes, this is the whole point. But it will not work if you use an
invalid address for that. The kernel does address translation (in the
case of a shared futex operation), and for that alone you are supposed
to pass in a valid address, I think.

And more generally, some of the futex operations are even designed to
compare the actual value to "val" or "val3". I don't think that the
kernel guys would give you any guarantee that the kernel would not
touch your object, for any of the operations.

> > independent of whether the object is a mutex or not. This must never
> > happen. So the free is making a wrong assumption.
> 
> You should clarify whether you mean internal UB in the implementation,
> or UB in the application. My understanding is that you meant in the
> implementation, but I claim that's wrong.

yes, I mean the implementation, and I am still convinced of it.

> > I think the fundamental flaw with this approach is that it mixes two
> > different concepts, a waiters count and a reference count; these are
> > not the same thing.
> 
> No, the waiters count is not used as a reference count. Only the state
> of the atomic int is used as a reference; once it's set to zero the
> implementation can no longer access the object (since another thread
> in the application is free to lock, unlock, destroy, and free it).

the application cannot directly, but pending application calls into
the library can, as we have seen.

Jens
-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-07 16:47                             ` Jens Gustedt
@ 2014-08-07 17:25                               ` Rich Felker
  2014-08-08  9:20                                 ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-07 17:25 UTC (permalink / raw)
  To: musl

On Thu, Aug 07, 2014 at 06:47:04PM +0200, Jens Gustedt wrote:
> Am Donnerstag, den 07.08.2014, 12:13 -0400 schrieb Rich Felker:
> > On Thu, Aug 07, 2014 at 09:50:51AM +0200, Jens Gustedt wrote:
> > > Am Mittwoch, den 06.08.2014, 19:15 -0400 schrieb Rich Felker:
> > > > The free operation in thread A is valid since A knows it is the last
> > > > user of the mutex and thread B's use/ownership of the mutex formally
> > > > ends with the atomic unlock.
> > > 
> > > No, operating on an object that has been freed is UB.  This is
> > 
> > No operation is performed on an object which has been freed.
> 
> OK, let me rephrase:
> 
> passing an invalid userspace address to the kernel for a futex
> operation is UB

Says who? The syscall API does not have UB. All inputs are defined,
and cannot cause UB in the userspace process unless the action
requested from the kernel is to modify userspace state in a way that
would cause UB.

> > The futex
> > wake is performed on the _address_, not the object, requesting that
> > the kernel use the address as a key into a hash table and wake any
> > threads which are blocked waiting on the futex object associated with
> > that address. The address is never dereferenced. This is the whole
> > point of the current design.
> 
> Yes, this is the whole point. But it will not work if you use an
> invalid address for that. The kernel does address translation (in the
> case of a shared futex operation), and for that alone you are supposed
> to pass in a valid address, I think.

No. Invalid addresses are valid arguments to syscalls. They may
produce EFAULT in some cases, but they do not cause UB in the calling
program.

> And more generally, some of the futex operations are designed to
> compare the actual value at the address to "val" or "val3". I don't
> think the kernel developers would guarantee, for any of the
> operations, that the kernel will not touch your object.

Most futex operations (including wait) access an object for read. Some
rarely-used ones write to an object too. Wake does neither. Wake is
purely an operation on the address/futex-waitqueue-object. This
property is not just a side effect but absolutely necessary in order
for futex to be useful, since you _always_ perform the futex wake
after releasing (and thereby potentially allowing destruction) of the
object.
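
To make that concrete, here is a minimal sketch of the pattern (the
state encoding and names are hypothetical, not musl's actual
internals):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdatomic.h>

    #define CONTENDED 0x80000000u   /* hypothetical waiters flag */

    static void unlock(atomic_uint *m)
    {
        /* Release first: from this store on, another thread may lock,
         * unlock, destroy, and free the mutex. */
        unsigned old = atomic_exchange(m, 0);
        /* Wake after: still safe even if the memory is gone, because
         * FUTEX_WAKE only uses the address as a hash key and never
         * dereferences it. */
        if (old & CONTENDED)
            syscall(SYS_futex, m, FUTEX_WAKE, 1, 0, 0, 0);
    }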

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-07 17:25                               ` Rich Felker
@ 2014-08-08  9:20                                 ` Jens Gustedt
  2014-08-08 16:53                                   ` Rich Felker
  2014-08-08 19:14                                   ` Rich Felker
  0 siblings, 2 replies; 38+ messages in thread
From: Jens Gustedt @ 2014-08-08  9:20 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2034 bytes --]

Hello,

Am Donnerstag, den 07.08.2014, 13:25 -0400 schrieb Rich Felker:
> Most futex operations (including wait) access an object for read. Some
> rarely-used ones write to an object too. Wake does neither. Wake is
> purely an operation on the address/futex-waitqueue-object. This
> property is not just a side effect but absolutely necessary in order
> for futex to be useful, since you _always_ perform the futex wake
> after releasing (and thereby potentially allowing destruction) of the
> object.

I agree with almost all of what you are saying, in particular the
facts ;) and your statement is a good summary of what I learned from
our discussion.

I don't agree with your assessment of these facts, though.

In particular the "(and thereby potentially allowing destruction)"
bothers me a lot, and I don't think that *doing* this with control
structures is desirable.

I think it is possible to have control structures that guarantee that
no wake (or any other futex operation) is done on an address that
isn't a valid address of a live object. I also think that, if only done
for non-shared structures (such as for C threads), such a design can
be relatively simple, with much less implicit state than what the
current musl implementation has, in particular for pthread_cond_t.

Such an implementation should by design avoid late wakes from the
kernel for addresses that are recycled for new control objects. (It
can't prevent them entirely, since other userspace tools might still
have the late-wake problem.)

I am currently debugging such a C thread implementation, and I'll come
back to you once I have something to show.

Thanks in particular to you, Rich, for your patience.

Jens


-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-08  9:20                                 ` Jens Gustedt
@ 2014-08-08 16:53                                   ` Rich Felker
  2014-08-08 19:14                                   ` Rich Felker
  1 sibling, 0 replies; 38+ messages in thread
From: Rich Felker @ 2014-08-08 16:53 UTC (permalink / raw)
  To: musl

On Fri, Aug 08, 2014 at 11:20:27AM +0200, Jens Gustedt wrote:
> Hello,
> 
> Am Donnerstag, den 07.08.2014, 13:25 -0400 schrieb Rich Felker:
> > Most futex operations (including wait) access an object for read. Some
> > rarely-used ones write to an object too. Wake does neither. Wake is
> > purely an operation on the address/futex-waitqueue-object. This
> > property is not just a side effect but absolutely necessary in order
> > for futex to be useful, since you _always_ perform the futex wake
> > after releasing (and thereby potentially allowing destruction) of the
> > object.
> 
> I agree with almost all of what you are saying, in particular the
> facts ;) and your statement is a good summary of what I learned from
> our discussion.
> 
> I don't agree with your assessment of these facts, though.
> 
> In particular the "(and thereby potentially allowing destruction)"
> bothers me a lot, and I don't think that *doing* this with control
> structures is desirable.

Do you have a reason it's undesirable? The only reason I've seen so
far is that it causes spurious wakes and makes the return values of
futex calls unusable. I agree it would be nice if the return values
were usable, but the benefit to having them seems to be limited to
making small optimizations in the implementation of synchronization
primitives. On the other hand, the cost to make such optimizations
(added synchronization to preclude sending a wake to an invalid
address) seems to be much higher, and essentially infinite (i.e.
fundamentally possible without moving the whole lock to kernelspace
and incurring a syscall for each operation) for process-shared
objects. So even if it is desirable, buying it seems like a HUGE net
loss.

> I think it is possible to have control structures that guarantee that
> no wake (or any other futex operation) is done on an address that
> isn't a valid address of a live object.

So far the way you've proposed to do this is using dynamic allocation.
This incurs additional synchronization cost (in malloc) for
creation/destruction, incurs a complexity cost (in no longer being
fail-safe) for both the implementation and the application, and it's
not even possible for process-shared objects because there is
fundamentally no way to allocate shared memory which is accessible
only by the exact set of processes which have access to the original
object. Any attempt to do so would either have false negatives (some
processes that nominally have access to the object would fail to
operate on it) or false positives (malicious processes would
potentially have access to operate on the object).

> I also think that, if only done
> for non-shared structures (such as for C threads), such a design can
> be relatively simple, with much less implicit state than what the
> current musl implementation has, in particular for pthread_cond_t.

Making mutexes heavy to make cond vars lighter seems like a net loss.
Cond vars are supposed to be a heavy primitive (typical case is
waiting) and mutexes a light one (typical case is immediately
acquiring the lock).

I would still like to explore ways to improve cond vars without
resorting to allocation. One potential idea (but I don't see a way to
make it work with cond vars yet) is to try something like the barriers
implementation, where the actual synchronization object lies on the
stack of the first waiter. Obviously this is only for
non-process-shared objects (barriers do something else more costly in
the process-shared case). The need for both signal and broadcast
semantics complicates it somewhat though; with such a design it might
be necessary for broadcast to simply keep signaling until the original
list is empty rather than using a single broadcast wake.

> Such an implementation should by design avoid late wakes from the
> kernel for addresses that are recycled for new control objects. (It
> can't inhibit them since other userspace tools might still have the
> late wake problem.)
> 
> I am currently debugging such a C thread implementation, and I'll come
> back to you once I have something to show.

I'm happy to see what you come up with, but I doubt it's right for
musl.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-08  9:20                                 ` Jens Gustedt
  2014-08-08 16:53                                   ` Rich Felker
@ 2014-08-08 19:14                                   ` Rich Felker
  2014-08-08 20:48                                     ` Rich Felker
  1 sibling, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-08 19:14 UTC (permalink / raw)
  To: musl

On Fri, Aug 08, 2014 at 11:20:27AM +0200, Jens Gustedt wrote:
> Hello,
> 
> Am Donnerstag, den 07.08.2014, 13:25 -0400 schrieb Rich Felker:
> > Most futex operations (including wait) access an object for read. Some
> > rarely-used ones write to an object too. Wake does neither. Wake is
> > purely an operation on the address/futex-waitqueue-object. This
> > property is not just a side effect but absolutely necessary in order
> > for futex to be useful, since you _always_ perform the futex wake
> > after releasing (and thereby potentially allowing destruction) of the
> > object.
> 
> I agree with almost all what you are saying, in particular the
> facts, ;) and your statement is a good summary of what I learned with
> our discussion.

I think I may have a solution you'll like:

We can perform the release of the lock via a compare-and-swap rather
than a simple swap. In this way, we can know before releasing the lock
whether it's going to require a wake or not:

- If waiters was zero and the cas from owned/uncontended to zero
  succeeds, no futex wake operation is needed.

- If waiters was nonzero, or if the cas fails (thereby instead
  requiring a cas from owned/contended to zero), we can do the
  following:

Don't use a userspace CAS to release; this would allow the lock to be
acquired by another thread, released, destroyed, and freed before the
futex wake is performed. Instead, use FUTEX_WAKE_OP to atomically
perform the atomic assignment and futex wake.
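
In code, the proposal would look something like this (a sketch only;
the state encoding and names are hypothetical, and the separate
waiters-count check is omitted):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdatomic.h>

    /* Hypothetical encoding: 0 = unlocked, OWNED = owned/uncontended,
     * OWNED|WAITBIT = owned/contended. */
    #define OWNED   1u
    #define WAITBIT 0x80000000u

    static void unlock(atomic_uint *m)
    {
        unsigned expect = OWNED;
        /* Uncontended: cas from owned/uncontended to zero, no wake. */
        if (atomic_compare_exchange_strong(m, &expect, 0)) return;
        /* Contended (the cas failed): have the kernel perform the
         * store of zero and the wake atomically, so the lock cannot
         * be acquired, destroyed, and freed between release and wake.
         * cmparg -1 masks to 0xfff, a value the lock word never
         * holds, so the comparison is always false and only m
         * (uaddr1) is woken; val2 is 0 anyway. */
        syscall(SYS_futex, m, FUTEX_WAKE_OP, 1, 0, m,
                FUTEX_OP(FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, -1));
    }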

Do you see any problems with this approach? I think it fully solves
the problem. If there are old kernels that lack FUTEX_WAKE_OP, we
could fallback to emulating it in userspace for those (with the
spurious wake issue). I think this approach may also also allow me to
resolve an issue that's preventing me from eliminating a lot of
useless FUTEX_WAKE syscalls under certain contention patterns, but I'm
not sure yet.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-08 19:14                                   ` Rich Felker
@ 2014-08-08 20:48                                     ` Rich Felker
  2014-08-09  6:47                                       ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-08 20:48 UTC (permalink / raw)
  To: musl

On Fri, Aug 08, 2014 at 03:14:06PM -0400, Rich Felker wrote:
> I think I may have a solution you'll like:
> 
> We can perform the release of the lock via a compare-and-swap rather
> than a simple swap. In this way, we can know before releasing the lock
> whether it's going to require a wake or not:
> 
> - If waiters was zero and the cas from owned/uncontended to zero
>   succeeds, no futex wake operation is needed.
> 
> - If waiters was nonzero, or if the cas fails (thereby instead
>   requiring a cas from owned/contended to zero), we can do the
>   following:
> 
> Don't use a userspace CAS to release; this would allow the lock to be
> acquired by another thread, released, destroyed, and freed before the
> futex wake is performed. Instead, use FUTEX_WAKE_OP to atomically
> perform the atomic assignment and futex wake.

FUTEX_WAKE_OP is highly under-documented, and I'm worried it might be
unsupported on some archs (since the atomics for it have to be
implemented on a per-arch basis in the kernel), but of course we can
just fall back on archs where it's not supported yet.

Anyway, the behavior seems to be:

- Futex acquisition for uaddr1 and uaddr2 both happens prior to the
  atomic operation, and this holds locks that seem to prevent new
  waiters on the futex(es). This should preclude any risk of waking a
  new waiter that arrives after the atomic operation, as desired.

- Both uaddr1 and uaddr2 are hashed, with no check for equality. This
  is a fairly costly, wasteful operation, but it could be fixed on the
  kernel side. At present I suspect they don't care because
  FUTEX_WAKE_OP is considered unnecessary, but if I raise it on the
  glibc bug tracker thread for issue 13690 as a solution to the
  problem, I think there would be a lot more interest in optimizing
  this kernel path.

- After the atomic operation is performed, a wake is always performed
  on uaddr1 (based on the previous acquisition); this fact is omitted
  from all the documentation, but it's obviously intentional since
  otherwise the uaddr1 argument would not be used for anything but
  wasting time. The wake on uaddr2 is conditional on a comparison.

- No allocation is required anywhere in the operation, so we don't
  have to worry about lost actions on OOM. For plain FUTEX_WAKE this
  would not have been an issue (if acquiring the futex required memory,
  then failure for FUTEX_WAKE to acquire it would mean there was no
  FUTEX_WAIT taking place anyway), but for FUTEX_WAKE_OP, failure
  would omit the atomic operation, which must take place even if there
  are no current FUTEX_WAIT waiters (e.g. if the FUTEX_WAIT was
  interrupted by a signal handler).

Based on the above, I think it's safe to move forward with using
FUTEX_WAKE_OP. It seems optimal to me to use uaddr1==uaddr2 and a
comparison that always yields false, so that the wake only goes to
uaddr1. This will allow the kernel to optimize out double-hashing in
the future by checking for uaddr1==uaddr2, and already optimizes out
the double-iteration of the hash bucket for waking purposes.
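
Concretely, the call I have in mind is something along these lines
(sketch; it assumes the lock word never holds the 12-bit cmparg value
0xfff, so the comparison always fails):

    /* Needs <linux/futex.h>, <sys/syscall.h>, <unistd.h>.
     * Atomically sets *m = 0, wakes one waiter on m, and skips the
     * conditional second wake (comparison false, and val2 == 0). */
    syscall(SYS_futex, m, FUTEX_WAKE_OP, 1, 0, m,
            FUTEX_OP(FUTEX_OP_SET, 0, FUTEX_OP_CMP_EQ, -1));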

Any further thoughts on the matter? I think we should finish the
private futex support task before starting on this, so that we don't
do new work that's going to conflict with a pending patch.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-08 20:48                                     ` Rich Felker
@ 2014-08-09  6:47                                       ` Jens Gustedt
  2014-08-12  2:50                                         ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-09  6:47 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 4333 bytes --]

Hello,

Am Freitag, den 08.08.2014, 16:48 -0400 schrieb Rich Felker:
> On Fri, Aug 08, 2014 at 03:14:06PM -0400, Rich Felker wrote:
> > I think I may have a solution you'll like:
> > 
> > We can perform the release of the lock via a compare-and-swap rather
> > than a simple swap. In this way, we can know before releasing the lock
> > whether it's going to require a wake or not:
> > 
> > - If waiters was zero and the cas from owned/uncontended to zero
> >   succeeds, no futex wake operation is needed.
> > 
> > - If waiters was nonzero, or if the cas fails (thereby instead
> >   requiring a cas from owned/contended to zero), we can do the
> >   following:
> > 
> > Don't use a userspace CAS to release; this would allow the lock to be
> > acquired by another thread, released, destroyed, and freed before the
> > futex wake is performed. Instead, use FUTEX_WAKE_OP to atomically
> > perform the atomic assignment and futex wake.
> 
> FUTEX_WAKE_OP is highly under-documented, and I'm worried it might be
> unsupported on some archs (since the atomics for it have to be
> implemented on a per-arch basis in the kernel), but of course we can
> just fall back on archs where it's not supported yet.
> 
> Anyway, the behavior seems to be:
> 
> - Futex acquisition for uaddr1 and uaddr2 both happens prior to the
>   atomic operation, and this holds locks that seem to prevent new
>   waiters on the futex(es). This should preclude any risk of waking a
>   new waiter that arrives after the atomic operation, as desired.
> 
> - Both uaddr1 and uaddr2 are hashed, with no check for equality. This
>   is a fairly costly, wasteful operation, but it could be fixed on the
>   kernel side. At present I suspect they don't care because
>   FUTEX_WAKE_OP is considered unnecessary, but if I raise it on the
>   glibc bug tracker thread for issue 13690 as a solution to the
>   problem, I think there would be a lot more interest in optimizing
>   this kernel path.
> 
> - After the atomic operation is performed, a wake is always performed
>   on uaddr1 (based on the previous acquisition); this fact is omitted
>   from all the documentation, but it's obviously intentional since
>   otherwise the uaddr1 argument would not be used for anything but
>   wasting time. The wake on uaddr2 is conditional on a comparison.
> 
> - No allocation is required anywhere in the operation, so we don't
>   have to worry about lost actions on OOM. For plain FUTEX_WAKE this
>   would not have been an issue (if acquiring the futex required memory,
>   then failure for FUTEX_WAKE to acquire it would mean there was no
>   FUTEX_WAIT taking place anyway), but for FUTEX_WAKE_OP, failure
>   would omit the atomic operation, which must take place even if there
>   are no current FUTEX_WAIT waiters (e.g. if the FUTEX_WAIT was
>   interrupted by a signal handler).
> 
> Based on the above, I think it's safe to move forward with using
> FUTEX_WAKE_OP. It seems optimal to me to use uaddr1==uaddr2 and a
> comparison that always yields false, so that the wake only goes to
> uaddr1. This will allow the kernel to optimize out double-hashing in
> the future by checking for uaddr1==uaddr2, and already optimizes out
> the double-iteration of the hash bucket for waking purposes.
> 
> Any further thoughts on the matter? I think we should finish the
> private futex support task before starting on this, so that we don't
> do new work that's going to conflict with a pending patch.

This looks promising, but I don't yet know enough about these less
common futex operations to comment further on it.

Generally I think that the control structures should be as tight as
possible and give provable properties in the mathematical sense. The
interaction between userland and the kernel should be minimal, and we
shouldn't provoke reactions in the kernel that concern threads (or
even processes) that are not really targeted.

Jens


PS: I will be a bit less available in the next days.


-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-09  6:47                                       ` Jens Gustedt
@ 2014-08-12  2:50                                         ` Rich Felker
  2014-08-12  7:04                                           ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-12  2:50 UTC (permalink / raw)
  To: musl

On Sat, Aug 09, 2014 at 08:47:34AM +0200, Jens Gustedt wrote:
> > Any further thoughts on the matter? I think we should finish the
> > private futex support task before starting on this, so that we don't
> > do new work that's going to conflict with a pending patch.
> 
> This looks promising, but I don't yet know enough about these less
> common futex operations to comment further on it.

You may want to see my comments here which relate to it:

https://sourceware.org/bugzilla/show_bug.cgi?id=13690

> Generally I think that the control structures should be as tight as
> possible and give provable properties in the mathematical sense. The
> interaction between userland and the kernel should be minimal, and we
> shouldn't provoke reactions in the kernel that concern threads (or
> even processes) that are not really targeted.

The former (provable properties) is definitely a goal we should not
deviate from. But I don't think the current spurious futex wakes
conflict with that goal.

The latter (not provoking reactions in untargeted threads) is a
desirable goal, but not if it conflicts with more important goals like
avoiding unnecessary allocation (actually, I don't think it's possible
to solve the problem with allocation; I think an additional layer of
allocation just makes it worse), fail-safety, performance, etc.

On the other hand, I think it's going to be possible to get both
without sacrificing anything, and moreover I think we can even, if we
want to, provide guaranteed mutex acquisition order (whatever order
the kernel gives, which is probably fifo or fifo within priority
levels). I'll write up the concept for the latter in case there's
interest in doing it. It might avoid the problem even without using
FUTEX_WAKE_OP.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-12  2:50                                         ` Rich Felker
@ 2014-08-12  7:04                                           ` Jens Gustedt
  2014-08-12 16:01                                             ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-12  7:04 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2932 bytes --]

Hello,

Am Montag, den 11.08.2014, 22:50 -0400 schrieb Rich Felker:
> On Sat, Aug 09, 2014 at 08:47:34AM +0200, Jens Gustedt wrote:
> > > Any further thoughts on the matter? I think we should finish the
> > > private futex support task before starting on this, so that we don't
> > > do new work that's going to conflict with a pending patch.
> > 
> > This looks promising, but I don't yet know enough about these less
> > common futex operations to comment further on it.
> 
> You may want to see my comments here which relate to it:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=13690

thanks for the pointer, I also read up on the subject a bit in the
mean time.

> > Generally I think that the control structures should be as tight as
> > possible and give provable properties in the mathematical sense. The
> > interaction between userland and the kernel should be minimal, and we
> > shouldn't provoke reactions in the kernel that concern threads (or
> > even processes) that are not really targeted.
> 
> The former (provable properties) is definitely a goal we should not
> deviate from. But I don't think the current spurious futex wakes
> conflict with that goal.
> 
> The latter (not provoking reactions in untargetted threads) is a
> desirable goal, but not if it conflicts with more important goals like
> avoiding unnecessary allocation (actually, I don't think it's possible
> to solve the problem with allocation; I think an additional layer of
> allocation just makes it worse), fail-safety, performance, etc.

Did you have a chance to look into my recent implementation of C
threads that I attached to my last post? In particular in
cnd_broadcast you see the advantages, I think:

 - cnd doesn't have to do bookkeeping for the threads waiting on the
   condition; the kernel's bookkeeping is used for that

 - threads that had to go into futex wait only touch the temporary
   structure, and only for the reference count

 - a tight spinlock clearly defines the ordering of the cnd_t
   operations

> On the other hand, I think it's going to be possible to get both
> without sacrificing anything, and moreover I think we can even, if we
> want to, provide guaranteed mutex acquisition order (whatever order
> the kernel gives, which is probably fifo or fifo within priority
> levels). I'll write up the concept for the latter in case there's
> interest in doing it. It might avoid the problem even without using
> FUTEX_WAKE_OP.

I think so, too. Perhaps we should work it from both ends. I will now
try to avoid the need for allocation.

Jens


-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-12  7:04                                           ` Jens Gustedt
@ 2014-08-12 16:01                                             ` Rich Felker
  2014-08-12 19:09                                               ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-12 16:01 UTC (permalink / raw)
  To: musl

On Tue, Aug 12, 2014 at 09:04:39AM +0200, Jens Gustedt wrote:
> > > Generally I think that the control structures should be as tight as
> > > possible and give provable properties in the mathematical sense. The
> > > interaction between userland and the kernel should be minimal, and we
> > > shouldn't provoke reactions in the kernel that concern threads (or
> > > even processes) that are not really targeted.
> > 
> > The former (provable properties) is definitely a goal we should not
> > deviate from. But I don't think the current spurious futex wakes
> > conflict with that goal.
> > 
> > The latter (not provoking reactions in untargeted threads) is a
> > desirable goal, but not if it conflicts with more important goals like
> > avoiding unnecessary allocation (actually, I don't think it's possible
> > to solve the problem with allocation; I think an additional layer of
> > allocation just makes it worse), fail-safety, performance, etc.
> 
> Did you have a chance to look into my recent implementation of C
> threads that I attached to my last post? In particular in
> cnd_broadcast you see the advantages, I think:
> 
>  - cnd doesn't have to do bookkeeping for the threads waiting on the
>    condition; the kernel's bookkeeping is used for that
> 
>  - threads that had to go into futex wait only touch the temporary
>    structure, and only for the reference count
> 
>  - a tight spinlock clearly defines the ordering of the cnd_t
>    operations

Yes, I've spent a total of about 30-45 min reading it, so I have a
fairly good idea what it's doing, but my understanding surely has
gaps.

As far as I can tell, the only thing that's saving you from sending
futex wakes after free is that you're just using spinlocks. This is an
extremely expensive solution: While contention is rare, as soon as you
do hit contention, if there are many threads they all pile on and
start spinning, and the time to obtain a lock (and cpu time/energy
spent waiting) grows extremely high. And of course it becomes infinite
if you have any threads of differing priorities and the low-priority
thread has the lock...

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-12 16:01                                             ` Rich Felker
@ 2014-08-12 19:09                                               ` Jens Gustedt
  2014-08-12 21:18                                                 ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-12 19:09 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2735 bytes --]

Rich,
thanks a lot for looking into the code.

Am Dienstag, den 12.08.2014, 12:01 -0400 schrieb Rich Felker:
> As far as I can tell, the only thing that's saving you from sending
> futex wakes after free is that you're just using spinlocks.

No, I don't think so. These protect critical sections at the beginning
of the cnd_t calls. The cnd_*wait calls hold the mutex at that time
anyhow, so even if these would be implemented with mutexes (an extra
one per cnd_t to protect the critical section) this wouldn't cause
late wakes, I think.

> This is an
> extremely expensive solution: While contention is rare, as soon as you
> do hit contention, if there are many threads they all pile on and
> start spinning, and the time to obtain a lock (and cpu time/energy
> spent waiting) grows extremely high. And of course it becomes infinite
> if you have any threads of differing priorities and the low-priority
> thread has the lock...

I think you dramatize a bit :)

First, you overlooked that all the cnd_*wait calls are made while, in
addition, the mutex in question is held. So no two waiters will fight
for the critical section at the same time.

In addition this protects against the signal and broadcast calls. I
don't think that a thundering herd of "wakers" (and not waiters) is a
very common scenario.

Second, the idea here is that inside these critical sections only a
constant amount of work is performed, and in particular there are no
waits or other system calls. E.g. the malloc call is done outside, the
data is prepared outside, and it is then only linked in inside the
critical section, as sketched below.
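
(A hypothetical shape of that code, not a quote from my patch:)

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <threads.h>

    /* hypothetical structures and names, for illustration only */
    struct waiter { struct waiter *next; int refs; int futex; };
    struct cnd_impl { atomic_flag spin; struct waiter *waiters; };

    /* the "tight spinlock" in its most minimal form */
    static void lock(atomic_flag *s)   { while (atomic_flag_test_and_set(s)) ; }
    static void unlock(atomic_flag *s) { atomic_flag_clear(s); }

    static int enqueue(struct cnd_impl *c)
    {
        /* malloc and all preparation happen outside the lock... */
        struct waiter *w = malloc(sizeof *w);
        if (!w) return thrd_error;
        w->refs = 2;         /* e.g. one for the waiter, one for a waker */
        w->futex = 0;
        /* ...only the constant-time linking happens inside. */
        lock(&c->spin);
        w->next = c->waiters;
        c->waiters = w;
        unlock(&c->spin);
        return thrd_success;
    }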

I checked the assembly, and this produces only a few instructions
inside the critical sections. They last at most a few memory cycles;
the worst-case scenario is that the data used is not cached. So such a
"regular" critical section lasts only a very small fraction of a
scheduling interval.

It is very unlikely that a thread that reaches the critical section is
unscheduled *during* that critical section. If it is unscheduled, you
are right, the wait can be long. But that event is very unlikely, so
the average time inside the critical section is still short, with a
probability distribution that is a bit skewed because of the
outliers.

(And then there is no concept of different scheduling priorities for C
threads, all of them are equal.)

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-12 19:09                                               ` Jens Gustedt
@ 2014-08-12 21:18                                                 ` Rich Felker
  2014-08-13  6:43                                                   ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2014-08-12 21:18 UTC (permalink / raw)
  To: musl

On Tue, Aug 12, 2014 at 09:09:21PM +0200, Jens Gustedt wrote:
> Rich,
> thanks a lot for looking into the code.
> 
> Am Dienstag, den 12.08.2014, 12:01 -0400 schrieb Rich Felker:
> > As far as I can tell, the only thing that's saving you from sending
> > futex wakes after free is that you're just using spinlocks.
> 
> No, I don't think so. These protect critical sections at the beginning
> of the cnd_t calls. The cnd_*wait calls hold the mutex at that time
> anyhow, so even if these would be implemented with mutexes (an extra
> one per cnd_t to protect the critical section) this wouldn't cause
> late wakes, I think.

I was talking about the unref-and-free code that's using spinlocks. If
it were using mutexes that don't protect against making futex wake
calls after the atomic unlock, a previous unref could send the wake
after the final one freed the object. So in effect, if you use a mutex
here, I think the wake-after-free issue has just been moved to a
different object, not solved.

> > This is an
> > extremely expensive solution: While contention is rare, as soon as you
> > do hit contention, if there are many threads they all pile on and
> > start spinning, and the time to obtain a lock (and cpu time/energy
> > spent waiting) grows extremely high. And of course it becomes infinite
> > if you have any threads of differing priorities and the low-priority
> > thread has the lock...
> 
> I think you dramatize a bit :)

Perhaps. :)

> It is very unlikely that a thread that reaches the critical section is
> unscheduled *during* that critical section. If it is unscheduled, you
> are right, the wait can be long. But that event is very unlikely, so
> the average time inside the critical section is still short, with a
> probability distribution that is a bit skewed because of the
> outliers.

Yes. The general pathology of spinlocks is that they give extremely
high latency and cpu load in an extremely low probability worst-case.

> (And then there is no concept of different scheduling priorities for C
> threads, all of them are equal.)

Indeed, but there's no reason these functions couldn't end up getting
called from a POSIX program using a C11 library. This is the normal
expected usage for mutexes (i.e. you're writing a library that needs
to be thread-safe but don't want to depend on POSIX -- in practice
the calling application is unlikely to be using C11 thrd_create,
because it sucks :), and it's perhaps less likely but definitely not
impossible for cond vars.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-12 21:18                                                 ` Rich Felker
@ 2014-08-13  6:43                                                   ` Jens Gustedt
  2014-08-13  7:19                                                     ` Jens Gustedt
  0 siblings, 1 reply; 38+ messages in thread
From: Jens Gustedt @ 2014-08-13  6:43 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3712 bytes --]

Am Dienstag, den 12.08.2014, 17:18 -0400 schrieb Rich Felker:
> On Tue, Aug 12, 2014 at 09:09:21PM +0200, Jens Gustedt wrote:
> > Rich,
> > thanks a lot for looking into the code.
> > 
> > Am Dienstag, den 12.08.2014, 12:01 -0400 schrieb Rich Felker:
> > > As far as I can tell, the only thing that's saving you from sending
> > > futex wakes after free is that you're just using spinlocks.
> > 
> > No, I don't think so. These protect critical sections at the beginning
> > of the cnd_t calls. The cnd_*wait calls hold the mutex at that time
> > anyhow, so even if these would be implemented with mutexes (an extra
> > one per cnd_t to protect the critical section) this wouldn't cause
> > late wakes, I think.
> 
> I was talking about the unref-and-free code that's using spinlocks.

I don't understand -- the unref-and-free code doesn't use spinlocks. It
just uses an atomic_fetch_sub of 1 to determine whether it is the last
user. If it is not, it does nothing and returns.

(This is indeed supposing that atomic_fetch_sub is "lock-free" in the
sense of the C standard, which basically means that it has no
observable intermediate state, even from a signal handler. All
architectures I know of have that property, but my knowledge is
limited.)
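
In other words, something of this shape (a sketch using C11 atomics;
the names are made up, not the actual patch code):

    #include <stdatomic.h>
    #include <stdlib.h>

    struct blob { atomic_uint refs; /* ... futex word, links ... */ };

    static void unref(struct blob *b)
    {
        /* fetch_sub returns the previous value, so exactly one
         * caller -- the last user -- observes 1 and frees. */
        if (atomic_fetch_sub(&b->refs, 1) == 1)
            free(b);
    }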

> If
> it were using mutexes that don't protect against making futex wake
> calls after the atomic unlock, a previous unref could send the wake
> after the final one freed the object. So in effect, if you use a mutex
> here, I think the wake-after-free issue has just been moved to a
> different object, not solved.
> 
> > > This is an
> > > extremely expensive solution: While contention is rare, as soon as you
> > > do hit contention, if there are many threads they all pile on and
> > > start spinning, and the time to obtain a lock (and cpu time/energy
> > > spent waiting) grows extremely high. And of course it becomes infinite
> > > if you have any threads of differing priorities and the low-priority
> > > thread has the lock...
> > 
> > I think you dramatize a bit :)
> 
> Perhaps. :)
> 
> > It is very unlikely that a thread that reaches the critical section is
> > unscheduled *during* that critical section. If it is unscheduled, you
> > are right, the wait can be long. But that event is very unlikely, so
> > the average time inside the critical section is still short, with a
> > probability distribution that is a bit skewed because of the
> > outliers.
> 
> Yes. The general pathology of spinlocks is that they give extremely
> high latency and cpu load in an extremely low probability worst-case.
> 
> > (And then there is no concept of different scheduling priorities for C
> > threads, all of them are equal.)
> 
> Indeed, but there's no reason these functions couldn't end up getting
> called from a POSIX program using a C11 library. This is the normal
> expected usage for mutexes (i.e. you're writing a library that needs
> to be thread-safe but you don't want to depend on POSIX -- in practice
> the calling application is unlikely to be using C11 thrd_create
> because it sucks :) and perhaps less likely but definitely not
> impossible for cond vars.

Hm, C threads are meant primarily as a portable and simple user space
tool and as a means to provide a model of parallelism as far as the C
standard is concerned. And that model is flat and has no hierarchy
among threads.

Jens


-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Explaining cond var destroy [Re: [musl] C threads, v3.0]
  2014-08-13  6:43                                                   ` Jens Gustedt
@ 2014-08-13  7:19                                                     ` Jens Gustedt
  0 siblings, 0 replies; 38+ messages in thread
From: Jens Gustedt @ 2014-08-13  7:19 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2154 bytes --]

Am Mittwoch, den 13.08.2014, 08:43 +0200 schrieb Jens Gustedt:
> Am Dienstag, den 12.08.2014, 17:18 -0400 schrieb Rich Felker:
> > On Tue, Aug 12, 2014 at 09:09:21PM +0200, Jens Gustedt wrote:
> > > Rich,
> > > thanks a lot for looking into the code.
> > > 
> > > Am Dienstag, den 12.08.2014, 12:01 -0400 schrieb Rich Felker:
> > > > As far as I can tell, the only thing that's saving you from sending
> > > > futex wakes after free is that you're just using spinlocks.
> > > 
> > > No, I don't think so. These protect critical sections at the beginning
> > > of the cnd_t calls. The cnd_*wait calls hold the mutex at that time
> > > anyhow, so even if these would be implemented with mutexes (an extra
> > > one per cnd_t to protect the critical section) this wouldn't cause
> > > late wakes, I think.
> > 
> > I was talking about the unref-and-free code that's using spinlocks.
> 
> I don't understand -- the unref-and-free code doesn't use spinlocks. It
> just uses an atomic_fetch_sub of 1 to determine whether it is the last
> user. If it is not, it does nothing and returns.
> 
> (This is indeed supposing that atomic_fetch_sub is "lock-free" in the
> sense of the C standard, which basically means that it has no
> observable intermediate state, even from a signal handler. All
> architectures I know of have that property, but my knowledge is
> limited.)

Ah, perhaps you meant that the unref-and-free was done while still
holding the spinlock? Maybe that was the case in the version that I
posted. That was a bug that I have now corrected.

I have also added an initial check so that wakers see whether they
have anything to do before they take the spinlock. This should
basically solve the herd-doing-wakeup problem, since as soon as all
waiters are considered to be unblocked, wakers will return immediately
without spinning.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2014-08-13  7:19 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-04  9:30 C threads, v3.0 Jens Gustedt
2014-08-04  9:33 ` Jens Gustedt
2014-08-04 14:50 ` Rich Felker
2014-08-04 16:48   ` Jens Gustedt
2014-08-04 17:06     ` Rich Felker
2014-08-04 22:16       ` Jens Gustedt
2014-08-04 22:36         ` Rich Felker
2014-08-06  3:52 ` Explaining cond var destroy [Re: [musl] C threads, v3.0] Rich Felker
2014-08-06  8:43   ` Jens Gustedt
2014-08-06  9:41     ` Jens Gustedt
2014-08-06 10:03       ` Rich Felker
2014-08-06 10:32         ` Jens Gustedt
2014-08-06 16:15           ` Rich Felker
2014-08-06 16:56             ` Jens Gustedt
2014-08-06 17:32               ` Rich Felker
2014-08-06 20:55                 ` Jens Gustedt
2014-08-06 22:04                   ` Rich Felker
2014-08-06 22:43                     ` Jens Gustedt
2014-08-06 23:15                       ` Rich Felker
2014-08-07  7:50                         ` Jens Gustedt
2014-08-07 10:52                           ` Szabolcs Nagy
2014-08-07 11:03                             ` Jens Gustedt
2014-08-07 16:13                           ` Rich Felker
2014-08-07 16:47                             ` Jens Gustedt
2014-08-07 17:25                               ` Rich Felker
2014-08-08  9:20                                 ` Jens Gustedt
2014-08-08 16:53                                   ` Rich Felker
2014-08-08 19:14                                   ` Rich Felker
2014-08-08 20:48                                     ` Rich Felker
2014-08-09  6:47                                       ` Jens Gustedt
2014-08-12  2:50                                         ` Rich Felker
2014-08-12  7:04                                           ` Jens Gustedt
2014-08-12 16:01                                             ` Rich Felker
2014-08-12 19:09                                               ` Jens Gustedt
2014-08-12 21:18                                                 ` Rich Felker
2014-08-13  6:43                                                   ` Jens Gustedt
2014-08-13  7:19                                                     ` Jens Gustedt
2014-08-06  9:50     ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).