[musl] sem_post() can miss waiters

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [musl] sem_post() can miss waiters
@ 2022-11-21  5:14 Alexey Izbyshev
  2022-11-22  0:09 ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Izbyshev @ 2022-11-21  5:14 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 4403 bytes --]

Hi,

While reading musl's POSIX semaphore implementation I realized that its 
accounting of waiters is incorrect (after commit 
88c4e720317845a8e01aee03f142ba82674cd23d) due to combination of the 
following factors:

* sem_post relies on a potentially stale waiters count when deciding 
whether to wake a waiter.
* Presence of waiters is not reflected in the semaphore value when the 
semaphore is unlocked.

Here is an example scenario with 4 threads:

* initial state
   - val = 1, waiters = 0
* T1
   - enters sem_post
   - reads val = 1, waiters = 0, is preempted
   - val = 1, waiters = 0
* T2
   - sem_wait
   - another sem_wait (blocks)
   - val = -1, waiters = 1
* T3
   - sem_wait (blocks)
   - val = -1, waiters = 2
* T4
   - sem_post (wakes T2)
   - val = 1, waiters = 2
* T1
   - continues in sem_post
   - a_cas(1, 2) succeeds
   - T3 is NOT woken because sem_post thinks there are no waiters
   - val = 2, waiters = 1
* T2
   - wakes and completes sem_wait
   - val = 1, waiters = 1

So in the end T3 remains blocked despite that the semaphore is unlocked.

Here is two ideas how this can be fixed without reintroducing the 
problem with accessing a potentially destroyed semaphore after it's 
unlocked.

* Idea 1
   - use a proper WAITERS_BIT in the semaphore state instead of just a 
single value and preserve it in sem_post by default
   - in sem_post, when we notice that waiters count is zero but 
WAITERS_BIT is set, reset WAITERS_BIT, but wake all potential waiters 
afterwards.

This ensures that any new waiters that could arrive after we read 
waiters count are taken into account (and they will proceed to lock the 
semaphore and potentially set WAITERS_BIT again after they are woken).

The sem_post body would look like this (the full patch is attached):

int val, new, priv = sem->__val[2], sync = 0;
do {
     val = sem->__val[0];
     waiters = sem->__val[1];
     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
         errno = EOVERFLOW;
         return -1;
     }
     new = val + 1;
     if (!waiters)
         new &= ~0x80000000;
} while (a_cas(sem->__val, val, new) != val);
if (val<0) __wake(sem->__val, waiters ? 1 : -1, priv);

The main downside of this approach is that a FUTEX_WAKE system call will 
be performed each time the semaphore transitions from "maybe have 
waiters" to "no waiters" state, and in practice it will be useless in 
most cases.

* Idea 2
   - use a proper WAITERS_BIT as above
   - also add a new DETECT_NEW_WAITERS_BIT
   - in sem_post, when we notice that waiters count is zero, WAITERS_BIT 
is set and DETECT_NEW_WAITERS_BIT is not set, atomically transition into 
"waiters detection" state (DETECT_NEW_WAITERS_BIT is set, WAITERS_BIT is 
unset) *without unlocking the semaphore*. Then recheck waiters count 
again, and if it's still zero, try to atomically transition to "no 
waiters" state (both bits are unset) while also unlocking the semaphore. 
If this succeeds, we know that no new waiters arrived before we unlocked 
the semaphore and hence no wake up is needed. Regardless of whether we 
succeeded, we always leave "waiters detection" state before exiting the 
CAS loop.
   - in sem_post, wake a single waiter if at least one of WAITERS_BIT or 
DETECT_NEW_WAITERS_BIT is set. This ensures no waiters are missed in 
sem_post calls racing with the sem_post currently responsible for 
waiters detection.

The sem_post body would look like this (the full patch is attached):

int val, new, priv = sem->__val[2], sync = 0;
for (;;) {
     int cnt;
     val = sem->__val[0];
     cnt = val & SEM_VALUE_MAX;
     if (cnt < SEM_VALUE_MAX) {
         if (!sync && val < 0 && !(val & 0x40000000) && !sem->__val[1]) {
             new = cnt | 0x40000000;
             if (a_cas(sem->__val, val, new) != val)
                 continue;
             val = new;
             sync = 1;
         }
         new = val + 1;
     } else {
         new = val;
     }
     if (sync) {
         if (sem->__val[1])
             new |= 0x80000000;
         new &= ~0x40000000;
     }
     if (a_cas(sem->__val, val, new) == val)
         break;
}
if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
     errno = EOVERFLOW;
     return -1;
}
if (new < 0 || (new & 0x40000000)) __wake(sem->__val, 1, priv);

Compared to the first approach, there is just an extra a_cas instead of 
a system call, though we lose one bit in SEM_VALUE_MAX.

Hope this helps,
Alexey

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: sem-post-detect-new-waiters.diff --]
[-- Type: text/x-diff; name=sem-post-detect-new-waiters.diff, Size: 3680 bytes --]

diff --git a/include/limits.h b/include/limits.h
index 53a27b9d..c4d3f42b 100644
--- a/include/limits.h
+++ b/include/limits.h
@@ -66,7 +66,7 @@
 #define PTHREAD_KEYS_MAX 128
 #define PTHREAD_STACK_MIN 2048
 #define PTHREAD_DESTRUCTOR_ITERATIONS 4
-#define SEM_VALUE_MAX 0x7fffffff
+#define SEM_VALUE_MAX 0x3fffffff
 #define SEM_NSEMS_MAX 256
 #define DELAYTIMER_MAX 0x7fffffff
 #define MQ_PRIO_MAX 32768
diff --git a/src/thread/sem_getvalue.c b/src/thread/sem_getvalue.c
index d9d83071..c0b7762d 100644
--- a/src/thread/sem_getvalue.c
+++ b/src/thread/sem_getvalue.c
@@ -1,8 +1,9 @@
 #include <semaphore.h>
+#include <limits.h>
 
 int sem_getvalue(sem_t *restrict sem, int *restrict valp)
 {
 	int val = sem->__val[0];
-	*valp = val < 0 ? 0 : val;
+	*valp = val & SEM_VALUE_MAX;
 	return 0;
 }
diff --git a/src/thread/sem_post.c b/src/thread/sem_post.c
index 31e3293d..5051ec97 100644
--- a/src/thread/sem_post.c
+++ b/src/thread/sem_post.c
@@ -1,17 +1,38 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 int sem_post(sem_t *sem)
 {
-	int val, waiters, priv = sem->__val[2];
-	do {
+	int val, new, priv = sem->__val[2], sync = 0;
+	for (;;) {
+		int cnt;
 		val = sem->__val[0];
-		waiters = sem->__val[1];
-		if (val == SEM_VALUE_MAX) {
-			errno = EOVERFLOW;
-			return -1;
+		cnt = val & SEM_VALUE_MAX;
+		if (cnt < SEM_VALUE_MAX) {
+			if (!sync && val < 0 && !(val & 0x40000000) && !sem->__val[1]) {
+				new = cnt | 0x40000000;
+				if (a_cas(sem->__val, val, new) != val)
+					continue;
+				val = new;
+				sync = 1;
+			}
+			new = val + 1;
+		} else {
+			new = val;
 		}
-	} while (a_cas(sem->__val, val, val+1+(val<0)) != val);
-	if (val<0 || waiters) __wake(sem->__val, 1, priv);
+		if (sync) {
+			if (sem->__val[1])
+				new |= 0x80000000;
+			new &= ~0x40000000;
+		}
+		if (a_cas(sem->__val, val, new) == val)
+			break;
+	}
+	if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
+		errno = EOVERFLOW;
+		return -1;
+	}
+	if (new < 0 || (new & 0x40000000)) __wake(sem->__val, 1, priv);
 	return 0;
 }
diff --git a/src/thread/sem_timedwait.c b/src/thread/sem_timedwait.c
index 58d3ebfe..51f6a474 100644
--- a/src/thread/sem_timedwait.c
+++ b/src/thread/sem_timedwait.c
@@ -1,4 +1,5 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 static void cleanup(void *p)
@@ -13,14 +14,18 @@ int sem_timedwait(sem_t *restrict sem, const struct timespec *restrict at)
 	if (!sem_trywait(sem)) return 0;
 
 	int spins = 100;
-	while (spins-- && sem->__val[0] <= 0 && !sem->__val[1]) a_spin();
+	while (spins-- && !(sem->__val[0] & SEM_VALUE_MAX) && !sem->__val[1])
+		a_spin();
 
 	while (sem_trywait(sem)) {
-		int r;
+		int r, t, val = sem->__val[0];
+		if (val & SEM_VALUE_MAX)
+			continue;
 		a_inc(sem->__val+1);
-		a_cas(sem->__val, 0, -1);
+		t = val | 0x80000000;
+		a_cas(sem->__val, val, t);
 		pthread_cleanup_push(cleanup, (void *)(sem->__val+1));
-		r = __timedwait_cp(sem->__val, -1, CLOCK_REALTIME, at, sem->__val[2]);
+		r = __timedwait_cp(sem->__val, t, CLOCK_REALTIME, at, sem->__val[2]);
 		pthread_cleanup_pop(1);
 		if (r) {
 			errno = r;
diff --git a/src/thread/sem_trywait.c b/src/thread/sem_trywait.c
index 04edf46b..beb435da 100644
--- a/src/thread/sem_trywait.c
+++ b/src/thread/sem_trywait.c
@@ -1,12 +1,12 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 int sem_trywait(sem_t *sem)
 {
 	int val;
-	while ((val=sem->__val[0]) > 0) {
-		int new = val-1-(val==1 && sem->__val[1]);
-		if (a_cas(sem->__val, val, new)==val) return 0;
+	while ((val=sem->__val[0]) & SEM_VALUE_MAX) {
+		if (a_cas(sem->__val, val, val-1)==val) return 0;
 	}
 	errno = EAGAIN;
 	return -1;

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: sem-post-wake-all.diff --]
[-- Type: text/x-diff; name=sem-post-wake-all.diff, Size: 2795 bytes --]

diff --git a/src/thread/sem_getvalue.c b/src/thread/sem_getvalue.c
index d9d83071..c0b7762d 100644
--- a/src/thread/sem_getvalue.c
+++ b/src/thread/sem_getvalue.c
@@ -1,8 +1,9 @@
 #include <semaphore.h>
+#include <limits.h>
 
 int sem_getvalue(sem_t *restrict sem, int *restrict valp)
 {
 	int val = sem->__val[0];
-	*valp = val < 0 ? 0 : val;
+	*valp = val & SEM_VALUE_MAX;
 	return 0;
 }
diff --git a/src/thread/sem_post.c b/src/thread/sem_post.c
index 31e3293d..ca8d663c 100644
--- a/src/thread/sem_post.c
+++ b/src/thread/sem_post.c
@@ -1,17 +1,21 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 int sem_post(sem_t *sem)
 {
-	int val, waiters, priv = sem->__val[2];
+	int val, new, waiters, priv = sem->__val[2];
 	do {
 		val = sem->__val[0];
 		waiters = sem->__val[1];
-		if (val == SEM_VALUE_MAX) {
+		if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
 			errno = EOVERFLOW;
 			return -1;
 		}
-	} while (a_cas(sem->__val, val, val+1+(val<0)) != val);
-	if (val<0 || waiters) __wake(sem->__val, 1, priv);
+		new = val + 1;
+		if (!waiters)
+			new &= ~0x80000000;
+	} while (a_cas(sem->__val, val, new) != val);
+	if (val<0) __wake(sem->__val, waiters ? 1 : -1, priv);
 	return 0;
 }
diff --git a/src/thread/sem_timedwait.c b/src/thread/sem_timedwait.c
index 58d3ebfe..aa67376c 100644
--- a/src/thread/sem_timedwait.c
+++ b/src/thread/sem_timedwait.c
@@ -1,4 +1,5 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 static void cleanup(void *p)
@@ -13,14 +14,15 @@ int sem_timedwait(sem_t *restrict sem, const struct timespec *restrict at)
 	if (!sem_trywait(sem)) return 0;
 
 	int spins = 100;
-	while (spins-- && sem->__val[0] <= 0 && !sem->__val[1]) a_spin();
+	while (spins-- && !(sem->__val[0] & SEM_VALUE_MAX) && !sem->__val[1])
+		a_spin();
 
 	while (sem_trywait(sem)) {
-		int r;
+		int r, priv = sem->__val[2];
 		a_inc(sem->__val+1);
-		a_cas(sem->__val, 0, -1);
+		a_cas(sem->__val, 0, 0x80000000);
 		pthread_cleanup_push(cleanup, (void *)(sem->__val+1));
-		r = __timedwait_cp(sem->__val, -1, CLOCK_REALTIME, at, sem->__val[2]);
+		r = __timedwait_cp(sem->__val, 0x80000000, CLOCK_REALTIME, at, priv);
 		pthread_cleanup_pop(1);
 		if (r) {
 			errno = r;
diff --git a/src/thread/sem_trywait.c b/src/thread/sem_trywait.c
index 04edf46b..beb435da 100644
--- a/src/thread/sem_trywait.c
+++ b/src/thread/sem_trywait.c
@@ -1,12 +1,12 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 int sem_trywait(sem_t *sem)
 {
 	int val;
-	while ((val=sem->__val[0]) > 0) {
-		int new = val-1-(val==1 && sem->__val[1]);
-		if (a_cas(sem->__val, val, new)==val) return 0;
+	while ((val=sem->__val[0]) & SEM_VALUE_MAX) {
+		if (a_cas(sem->__val, val, val-1)==val) return 0;
 	}
 	errno = EAGAIN;
 	return -1;

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [musl] sem_post() can miss waiters
  2022-11-21  5:14 [musl] sem_post() can miss waiters Alexey Izbyshev
@ 2022-11-22  0:09 ` Rich Felker
  2022-11-22  5:41   ` A. Wilcox
  2022-11-22  5:41   ` Alexey Izbyshev
  0 siblings, 2 replies; 7+ messages in thread
From: Rich Felker @ 2022-11-22  0:09 UTC (permalink / raw)
  To: musl

On Mon, Nov 21, 2022 at 08:14:50AM +0300, Alexey Izbyshev wrote:
> Hi,
> 
> While reading musl's POSIX semaphore implementation I realized that
> its accounting of waiters is incorrect (after commit
> 88c4e720317845a8e01aee03f142ba82674cd23d) due to combination of the
> following factors:
> 
> * sem_post relies on a potentially stale waiters count when deciding
> whether to wake a waiter.
> * Presence of waiters is not reflected in the semaphore value when
> the semaphore is unlocked.
> 
> Here is an example scenario with 4 threads:
> 
> * initial state
>   - val = 1, waiters = 0
> * T1
>   - enters sem_post
>   - reads val = 1, waiters = 0, is preempted
>   - val = 1, waiters = 0
> * T2
>   - sem_wait
>   - another sem_wait (blocks)
>   - val = -1, waiters = 1
> * T3
>   - sem_wait (blocks)
>   - val = -1, waiters = 2
> * T4
>   - sem_post (wakes T2)
>   - val = 1, waiters = 2
> * T1
>   - continues in sem_post
>   - a_cas(1, 2) succeeds
>   - T3 is NOT woken because sem_post thinks there are no waiters
>   - val = 2, waiters = 1
> * T2
>   - wakes and completes sem_wait
>   - val = 1, waiters = 1
> 
> So in the end T3 remains blocked despite that the semaphore is unlocked.
> 
> Here is two ideas how this can be fixed without reintroducing the
> problem with accessing a potentially destroyed semaphore after it's
> unlocked.
> 
> * Idea 1
>   - use a proper WAITERS_BIT in the semaphore state instead of just
> a single value and preserve it in sem_post by default
>   - in sem_post, when we notice that waiters count is zero but
> WAITERS_BIT is set, reset WAITERS_BIT, but wake all potential
> waiters afterwards.
> 
> This ensures that any new waiters that could arrive after we read
> waiters count are taken into account (and they will proceed to lock
> the semaphore and potentially set WAITERS_BIT again after they are
> woken).
> 
> The sem_post body would look like this (the full patch is attached):
> 
> int val, new, priv = sem->__val[2], sync = 0;
> do {
>     val = sem->__val[0];
>     waiters = sem->__val[1];
>     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>         errno = EOVERFLOW;
>         return -1;
>     }
>     new = val + 1;
>     if (!waiters)
>         new &= ~0x80000000;
> } while (a_cas(sem->__val, val, new) != val);
> if (val<0) __wake(sem->__val, waiters ? 1 : -1, priv);
> 
> The main downside of this approach is that a FUTEX_WAKE system call
> will be performed each time the semaphore transitions from "maybe
> have waiters" to "no waiters" state, and in practice it will be
> useless in most cases.
> 
> * Idea 2
>   - use a proper WAITERS_BIT as above
>   - also add a new DETECT_NEW_WAITERS_BIT
>   - in sem_post, when we notice that waiters count is zero,
> WAITERS_BIT is set and DETECT_NEW_WAITERS_BIT is not set, atomically
> transition into "waiters detection" state (DETECT_NEW_WAITERS_BIT is
> set, WAITERS_BIT is unset) *without unlocking the semaphore*. Then
> recheck waiters count again, and if it's still zero, try to
> atomically transition to "no waiters" state (both bits are unset)
> while also unlocking the semaphore. If this succeeds, we know that
> no new waiters arrived before we unlocked the semaphore and hence no
> wake up is needed. Regardless of whether we succeeded, we always
> leave "waiters detection" state before exiting the CAS loop.
>   - in sem_post, wake a single waiter if at least one of WAITERS_BIT
> or DETECT_NEW_WAITERS_BIT is set. This ensures no waiters are missed
> in sem_post calls racing with the sem_post currently responsible for
> waiters detection.
> 
> The sem_post body would look like this (the full patch is attached):
> 
> int val, new, priv = sem->__val[2], sync = 0;
> for (;;) {
>     int cnt;
>     val = sem->__val[0];
>     cnt = val & SEM_VALUE_MAX;
>     if (cnt < SEM_VALUE_MAX) {
>         if (!sync && val < 0 && !(val & 0x40000000) && !sem->__val[1]) {
>             new = cnt | 0x40000000;
>             if (a_cas(sem->__val, val, new) != val)
>                 continue;
>             val = new;
>             sync = 1;
>         }
>         new = val + 1;
>     } else {
>         new = val;
>     }
>     if (sync) {
>         if (sem->__val[1])
>             new |= 0x80000000;
>         new &= ~0x40000000;
>     }
>     if (a_cas(sem->__val, val, new) == val)
>         break;
> }
> if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>     errno = EOVERFLOW;
>     return -1;
> }
> if (new < 0 || (new & 0x40000000)) __wake(sem->__val, 1, priv);
> 
> Compared to the first approach, there is just an extra a_cas instead
> of a system call, though we lose one bit in SEM_VALUE_MAX.
> 
> Hope this helps,

Thanks for the detailed analysis and work on fixes. I'm not sure how
this was overlooked at the time of design.

If I understand correctly, the cost of the first option is generally
an extra "last" broadcast wake that shouldn't be needed. Is that
right?

For example, if sem starts with count 0 and thread A calls wait, then
thread B calls post twice, both posts make a syscall even though only
the first one should have.

What if you instead perform the broadcast wake when the observed
waiters count is <=1 rather than ==0? This should have no cost in the
common case, but in the race case, I think it forces any hiding
(just-arrived) extra waiters to wake, fail, and re-publish their
waiting status to the waiters bit.

Regarding your second solution, I think it's more elegant and
efficient and would be preferable if we were doing this from scratch,
but changing SEM_VALUE_MAX is arguably an ABI change we should not
make.

I think there's another really stupid solution too that I would even
consider, partly motivated by the fact that, with long-lived
process-shared semaphores, the waiters count can become stale if
waiters are killed. (Note: semaphores aren't required to be robust
against this, but it's ugly that they're not.) This solution is just
to keep the current code, but drop the waiters count entirely, and use
broadcast wakes for all wakes. Then, any still-live waiters are
*always* responsible for re-asserting their waiting status when all
but one fail to acquire the semaphore after a wake. Of course this is
a thundering herd, which is arguably something we should not want, but
we accept it for correctness in several other places like condvars...

Rich

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [musl] sem_post() can miss waiters
  2022-11-22  0:09 ` Rich Felker
@ 2022-11-22  5:41   ` A. Wilcox
  2022-11-22  5:41   ` Alexey Izbyshev
  1 sibling, 0 replies; 7+ messages in thread
From: A. Wilcox @ 2022-11-22  5:41 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 726 bytes --]

On Nov 21, 2022, at 6:09 PM, Rich Felker <dalias@libc.org> wrote:
> 
> Regarding your second solution, I think it's more elegant and
> efficient and would be preferable if we were doing this from scratch,
> but changing SEM_VALUE_MAX is arguably an ABI change we should not
> make.
> 

With my distro hat on: this isn’t used by too many programs.

Debian Code Search returned a module of Boost.  Otherwise, all apps
correctly use sysconf(_SC_SEM_VALUE_MAX) in my cursory glance.

With my gcompat hat on: yes, please don’t change it.

The current value matches glibc and while it doesn’t seem many things
use it, I would hate to have to debug those kind of corner cases in
binary blob code.

Best,
-A.

[-- Attachment #2: Type: text/html, Size: 4177 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [musl] sem_post() can miss waiters
  2022-11-22  0:09 ` Rich Felker
  2022-11-22  5:41   ` A. Wilcox
@ 2022-11-22  5:41   ` Alexey Izbyshev
  2022-12-14  1:48     ` Rich Felker
  1 sibling, 1 reply; 7+ messages in thread
From: Alexey Izbyshev @ 2022-11-22  5:41 UTC (permalink / raw)
  To: musl

On 2022-11-22 03:09, Rich Felker wrote:
> On Mon, Nov 21, 2022 at 08:14:50AM +0300, Alexey Izbyshev wrote:
>> Hi,
>> 
>> While reading musl's POSIX semaphore implementation I realized that
>> its accounting of waiters is incorrect (after commit
>> 88c4e720317845a8e01aee03f142ba82674cd23d) due to combination of the
>> following factors:
>> 
>> * sem_post relies on a potentially stale waiters count when deciding
>> whether to wake a waiter.
>> * Presence of waiters is not reflected in the semaphore value when
>> the semaphore is unlocked.
>> 
>> Here is an example scenario with 4 threads:
>> 
>> * initial state
>>   - val = 1, waiters = 0
>> * T1
>>   - enters sem_post
>>   - reads val = 1, waiters = 0, is preempted
>>   - val = 1, waiters = 0
>> * T2
>>   - sem_wait
>>   - another sem_wait (blocks)
>>   - val = -1, waiters = 1
>> * T3
>>   - sem_wait (blocks)
>>   - val = -1, waiters = 2
>> * T4
>>   - sem_post (wakes T2)
>>   - val = 1, waiters = 2
>> * T1
>>   - continues in sem_post
>>   - a_cas(1, 2) succeeds
>>   - T3 is NOT woken because sem_post thinks there are no waiters
>>   - val = 2, waiters = 1
>> * T2
>>   - wakes and completes sem_wait
>>   - val = 1, waiters = 1
>> 
>> So in the end T3 remains blocked despite that the semaphore is 
>> unlocked.
>> 
>> Here is two ideas how this can be fixed without reintroducing the
>> problem with accessing a potentially destroyed semaphore after it's
>> unlocked.
>> 
>> * Idea 1
>>   - use a proper WAITERS_BIT in the semaphore state instead of just
>> a single value and preserve it in sem_post by default
>>   - in sem_post, when we notice that waiters count is zero but
>> WAITERS_BIT is set, reset WAITERS_BIT, but wake all potential
>> waiters afterwards.
>> 
>> This ensures that any new waiters that could arrive after we read
>> waiters count are taken into account (and they will proceed to lock
>> the semaphore and potentially set WAITERS_BIT again after they are
>> woken).
>> 
>> The sem_post body would look like this (the full patch is attached):
>> 
>> int val, new, priv = sem->__val[2], sync = 0;
>> do {
>>     val = sem->__val[0];
>>     waiters = sem->__val[1];
>>     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>>         errno = EOVERFLOW;
>>         return -1;
>>     }
>>     new = val + 1;
>>     if (!waiters)
>>         new &= ~0x80000000;
>> } while (a_cas(sem->__val, val, new) != val);
>> if (val<0) __wake(sem->__val, waiters ? 1 : -1, priv);
>> 
>> The main downside of this approach is that a FUTEX_WAKE system call
>> will be performed each time the semaphore transitions from "maybe
>> have waiters" to "no waiters" state, and in practice it will be
>> useless in most cases.
>> 
>> * Idea 2
>>   - use a proper WAITERS_BIT as above
>>   - also add a new DETECT_NEW_WAITERS_BIT
>>   - in sem_post, when we notice that waiters count is zero,
>> WAITERS_BIT is set and DETECT_NEW_WAITERS_BIT is not set, atomically
>> transition into "waiters detection" state (DETECT_NEW_WAITERS_BIT is
>> set, WAITERS_BIT is unset) *without unlocking the semaphore*. Then
>> recheck waiters count again, and if it's still zero, try to
>> atomically transition to "no waiters" state (both bits are unset)
>> while also unlocking the semaphore. If this succeeds, we know that
>> no new waiters arrived before we unlocked the semaphore and hence no
>> wake up is needed. Regardless of whether we succeeded, we always
>> leave "waiters detection" state before exiting the CAS loop.
>>   - in sem_post, wake a single waiter if at least one of WAITERS_BIT
>> or DETECT_NEW_WAITERS_BIT is set. This ensures no waiters are missed
>> in sem_post calls racing with the sem_post currently responsible for
>> waiters detection.
>> 
>> The sem_post body would look like this (the full patch is attached):
>> 
>> int val, new, priv = sem->__val[2], sync = 0;
>> for (;;) {
>>     int cnt;
>>     val = sem->__val[0];
>>     cnt = val & SEM_VALUE_MAX;
>>     if (cnt < SEM_VALUE_MAX) {
>>         if (!sync && val < 0 && !(val & 0x40000000) && !sem->__val[1]) 
>> {
>>             new = cnt | 0x40000000;
>>             if (a_cas(sem->__val, val, new) != val)
>>                 continue;
>>             val = new;
>>             sync = 1;
>>         }
>>         new = val + 1;
>>     } else {
>>         new = val;
>>     }
>>     if (sync) {
>>         if (sem->__val[1])
>>             new |= 0x80000000;
>>         new &= ~0x40000000;
>>     }
>>     if (a_cas(sem->__val, val, new) == val)
>>         break;
>> }
>> if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>>     errno = EOVERFLOW;
>>     return -1;
>> }
>> if (new < 0 || (new & 0x40000000)) __wake(sem->__val, 1, priv);
>> 
>> Compared to the first approach, there is just an extra a_cas instead
>> of a system call, though we lose one bit in SEM_VALUE_MAX.
>> 
>> Hope this helps,
> 
> Thanks for the detailed analysis and work on fixes. I'm not sure how
> this was overlooked at the time of design.
> 
> If I understand correctly, the cost of the first option is generally
> an extra "last" broadcast wake that shouldn't be needed. Is that
> right?
> 
> For example, if sem starts with count 0 and thread A calls wait, then
> thread B calls post twice, both posts make a syscall even though only
> the first one should have.
> 
Yes, exactly.

> What if you instead perform the broadcast wake when the observed
> waiters count is <=1 rather than ==0? This should have no cost in the
> common case, but in the race case, I think it forces any hiding
> (just-arrived) extra waiters to wake, fail, and re-publish their
> waiting status to the waiters bit.
> 
Indeed, I think this solves the overhead issue quite nicely, thanks. So 
sem_post wake up logic would basically boil down to this:

* WAITERS_BIT is set and waiters > 1: don't reset WAITERS_BIT since it's 
likely that some waiters will remain (and it's always fine to err on the 
side of preserving the WAITERS_BIT); wake a single waiter.

* WAITERS_BIT is set and waiters <= 1: reset WAITERS_BIT and wake all 
waiters. In non-racy cases only a single waiter will be woken.

* WAITERS_BIT is unset: don't wake anybody. Even if there are some 
waiters, another sem_post (that reset the WAITERS_BIT) is responsible 
for waking them.

In code:

int val, new, waiters, priv = sem->__val[2];
do {
     val = sem->__val[0];
     waiters = sem->__val[1];
     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
         errno = EOVERFLOW;
         return -1;
     }
     new = val + 1;
     if (waiters <= 1)
         new &= ~0x80000000;
} while (a_cas(sem->__val, val, new) != val);
if (val<0) __wake(sem->__val, waiters <= 1 ? -1 : 1, priv);

> Regarding your second solution, I think it's more elegant and
> efficient and would be preferable if we were doing this from scratch,
> but changing SEM_VALUE_MAX is arguably an ABI change we should not
> make.
> 
With your suggested improvement, the first solution is probably even 
more efficient in common cases, assuming there are no hidden costs in 
the kernel for waking a single waiter with "wake all" operation instead 
of "wake single".

> I think there's another really stupid solution too that I would even
> consider, partly motivated by the fact that, with long-lived
> process-shared semaphores, the waiters count can become stale if
> waiters are killed. (Note: semaphores aren't required to be robust
> against this, but it's ugly that they're not.) This solution is just
> to keep the current code, but drop the waiters count entirely, and use
> broadcast wakes for all wakes. Then, any still-live waiters are
> *always* responsible for re-asserting their waiting status when all
> but one fail to acquire the semaphore after a wake. Of course this is
> a thundering herd, which is arguably something we should not want, but
> we accept it for correctness in several other places like condvars...
> 
I'm not sure that the kind of "partial robustness" that could be 
achieved by always waking all waiters is worth punishing normal cases. 
Also, I don't think waiters count being stale is an issue per se because 
it can be wrong only in one way (greater than the real count of 
waiters), assuming you don't mean overflow (but overflow could be 
handled by simply pinning it at INT_MAX permanently). But many other 
issues are possible if we allow killing processes at arbitrary points, 
including sem_post not sending a wake up notification after updating the 
semaphore value, or sem_wait consuming a notification but not updating 
the value. Granted, some of these issues may "self-heal" on a future 
sem_wait/sem_post (in the sense that the semaphore will leave a 
forbidden state), but they might still interfere with the program 
shutdown logic (e.g. if the program expects to claim the semaphore N 
times at the end without being aware that some consumers are dead), and, 
of course, with any logic outside of semaphore manipulation. So it seems 
to me that either the program already watches for death of processes it 
cares about, or it's not even clear that allowing further progress (due 
to a broadcast) is always more desirable than blocking.

Alexey

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [musl] sem_post() can miss waiters
  2022-11-22  5:41   ` Alexey Izbyshev
@ 2022-12-14  1:48     ` Rich Felker
  2022-12-14 10:29       ` Alexey Izbyshev
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2022-12-14  1:48 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 10122 bytes --]

On Tue, Nov 22, 2022 at 08:41:44AM +0300, Alexey Izbyshev wrote:
> On 2022-11-22 03:09, Rich Felker wrote:
> >On Mon, Nov 21, 2022 at 08:14:50AM +0300, Alexey Izbyshev wrote:
> >>Hi,
> >>
> >>While reading musl's POSIX semaphore implementation I realized that
> >>its accounting of waiters is incorrect (after commit
> >>88c4e720317845a8e01aee03f142ba82674cd23d) due to combination of the
> >>following factors:
> >>
> >>* sem_post relies on a potentially stale waiters count when deciding
> >>whether to wake a waiter.
> >>* Presence of waiters is not reflected in the semaphore value when
> >>the semaphore is unlocked.
> >>
> >>Here is an example scenario with 4 threads:
> >>
> >>* initial state
> >>  - val = 1, waiters = 0
> >>* T1
> >>  - enters sem_post
> >>  - reads val = 1, waiters = 0, is preempted
> >>  - val = 1, waiters = 0
> >>* T2
> >>  - sem_wait
> >>  - another sem_wait (blocks)
> >>  - val = -1, waiters = 1
> >>* T3
> >>  - sem_wait (blocks)
> >>  - val = -1, waiters = 2
> >>* T4
> >>  - sem_post (wakes T2)
> >>  - val = 1, waiters = 2
> >>* T1
> >>  - continues in sem_post
> >>  - a_cas(1, 2) succeeds
> >>  - T3 is NOT woken because sem_post thinks there are no waiters
> >>  - val = 2, waiters = 1
> >>* T2
> >>  - wakes and completes sem_wait
> >>  - val = 1, waiters = 1
> >>
> >>So in the end T3 remains blocked despite that the semaphore is
> >>unlocked.
> >>
> >>Here is two ideas how this can be fixed without reintroducing the
> >>problem with accessing a potentially destroyed semaphore after it's
> >>unlocked.
> >>
> >>* Idea 1
> >>  - use a proper WAITERS_BIT in the semaphore state instead of just
> >>a single value and preserve it in sem_post by default
> >>  - in sem_post, when we notice that waiters count is zero but
> >>WAITERS_BIT is set, reset WAITERS_BIT, but wake all potential
> >>waiters afterwards.
> >>
> >>This ensures that any new waiters that could arrive after we read
> >>waiters count are taken into account (and they will proceed to lock
> >>the semaphore and potentially set WAITERS_BIT again after they are
> >>woken).
> >>
> >>The sem_post body would look like this (the full patch is attached):
> >>
> >>int val, new, priv = sem->__val[2], sync = 0;
> >>do {
> >>    val = sem->__val[0];
> >>    waiters = sem->__val[1];
> >>    if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
> >>        errno = EOVERFLOW;
> >>        return -1;
> >>    }
> >>    new = val + 1;
> >>    if (!waiters)
> >>        new &= ~0x80000000;
> >>} while (a_cas(sem->__val, val, new) != val);
> >>if (val<0) __wake(sem->__val, waiters ? 1 : -1, priv);
> >>
> >>The main downside of this approach is that a FUTEX_WAKE system call
> >>will be performed each time the semaphore transitions from "maybe
> >>have waiters" to "no waiters" state, and in practice it will be
> >>useless in most cases.
> >>
> >>* Idea 2
> >>  - use a proper WAITERS_BIT as above
> >>  - also add a new DETECT_NEW_WAITERS_BIT
> >>  - in sem_post, when we notice that waiters count is zero,
> >>WAITERS_BIT is set and DETECT_NEW_WAITERS_BIT is not set, atomically
> >>transition into "waiters detection" state (DETECT_NEW_WAITERS_BIT is
> >>set, WAITERS_BIT is unset) *without unlocking the semaphore*. Then
> >>recheck waiters count again, and if it's still zero, try to
> >>atomically transition to "no waiters" state (both bits are unset)
> >>while also unlocking the semaphore. If this succeeds, we know that
> >>no new waiters arrived before we unlocked the semaphore and hence no
> >>wake up is needed. Regardless of whether we succeeded, we always
> >>leave "waiters detection" state before exiting the CAS loop.
> >>  - in sem_post, wake a single waiter if at least one of WAITERS_BIT
> >>or DETECT_NEW_WAITERS_BIT is set. This ensures no waiters are missed
> >>in sem_post calls racing with the sem_post currently responsible for
> >>waiters detection.
> >>
> >>The sem_post body would look like this (the full patch is attached):
> >>
> >>int val, new, priv = sem->__val[2], sync = 0;
> >>for (;;) {
> >>    int cnt;
> >>    val = sem->__val[0];
> >>    cnt = val & SEM_VALUE_MAX;
> >>    if (cnt < SEM_VALUE_MAX) {
> >>        if (!sync && val < 0 && !(val & 0x40000000) &&
> >>!sem->__val[1]) {
> >>            new = cnt | 0x40000000;
> >>            if (a_cas(sem->__val, val, new) != val)
> >>                continue;
> >>            val = new;
> >>            sync = 1;
> >>        }
> >>        new = val + 1;
> >>    } else {
> >>        new = val;
> >>    }
> >>    if (sync) {
> >>        if (sem->__val[1])
> >>            new |= 0x80000000;
> >>        new &= ~0x40000000;
> >>    }
> >>    if (a_cas(sem->__val, val, new) == val)
> >>        break;
> >>}
> >>if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
> >>    errno = EOVERFLOW;
> >>    return -1;
> >>}
> >>if (new < 0 || (new & 0x40000000)) __wake(sem->__val, 1, priv);
> >>
> >>Compared to the first approach, there is just an extra a_cas instead
> >>of a system call, though we lose one bit in SEM_VALUE_MAX.
> >>
> >>Hope this helps,
> >
> >Thanks for the detailed analysis and work on fixes. I'm not sure how
> >this was overlooked at the time of design.
> >
> >If I understand correctly, the cost of the first option is generally
> >an extra "last" broadcast wake that shouldn't be needed. Is that
> >right?
> >
> >For example, if sem starts with count 0 and thread A calls wait, then
> >thread B calls post twice, both posts make a syscall even though only
> >the first one should have.
> >
> Yes, exactly.
> 
> >What if you instead perform the broadcast wake when the observed
> >waiters count is <=1 rather than ==0? This should have no cost in the
> >common case, but in the race case, I think it forces any hiding
> >(just-arrived) extra waiters to wake, fail, and re-publish their
> >waiting status to the waiters bit.
> >
> Indeed, I think this solves the overhead issue quite nicely, thanks.
> So sem_post wake up logic would basically boil down to this:
> 
> * WAITERS_BIT is set and waiters > 1: don't reset WAITERS_BIT since
> it's likely that some waiters will remain (and it's always fine to
> err on the side of preserving the WAITERS_BIT); wake a single
> waiter.
> 
> * WAITERS_BIT is set and waiters <= 1: reset WAITERS_BIT and wake
> all waiters. In non-racy cases only a single waiter will be woken.
> 
> * WAITERS_BIT is unset: don't wake anybody. Even if there are some
> waiters, another sem_post (that reset the WAITERS_BIT) is
> responsible for waking them.
> 
> In code:
> 
> int val, new, waiters, priv = sem->__val[2];
> do {
>     val = sem->__val[0];
>     waiters = sem->__val[1];
>     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>         errno = EOVERFLOW;
>         return -1;
>     }
>     new = val + 1;
>     if (waiters <= 1)
>         new &= ~0x80000000;
> } while (a_cas(sem->__val, val, new) != val);
> if (val<0) __wake(sem->__val, waiters <= 1 ? -1 : 1, priv);

Yes, this looks good to me. How is the attached patch?

> >Regarding your second solution, I think it's more elegant and
> >efficient and would be preferable if we were doing this from scratch,
> >but changing SEM_VALUE_MAX is arguably an ABI change we should not
> >make.
> >
> With your suggested improvement, the first solution is probably even
> more efficient in common cases, assuming there are no hidden costs
> in the kernel for waking a single waiter with "wake all" operation
> instead of "wake single".

Agreed.

> >I think there's another really stupid solution too that I would even
> >consider, partly motivated by the fact that, with long-lived
> >process-shared semaphores, the waiters count can become stale if
> >waiters are killed. (Note: semaphores aren't required to be robust
> >against this, but it's ugly that they're not.) This solution is just
> >to keep the current code, but drop the waiters count entirely, and use
> >broadcast wakes for all wakes. Then, any still-live waiters are
> >*always* responsible for re-asserting their waiting status when all
> >but one fail to acquire the semaphore after a wake. Of course this is
> >a thundering herd, which is arguably something we should not want, but
> >we accept it for correctness in several other places like condvars...
> >
> I'm not sure that the kind of "partial robustness" that could be
> achieved by always waking all waiters is worth punishing normal
> cases. Also, I don't think waiters count being stale is an issue per
> se because it can be wrong only in one way (greater than the real
> count of waiters), assuming you don't mean overflow (but overflow
> could be handled by simply pinning it at INT_MAX permanently). But
> many other issues are possible if we allow killing processes at
> arbitrary points, including sem_post not sending a wake up
> notification after updating the semaphore value, or sem_wait
> consuming a notification but not updating the value. Granted, some
> of these issues may "self-heal" on a future sem_wait/sem_post (in
> the sense that the semaphore will leave a forbidden state), but they
> might still interfere with the program shutdown logic (e.g. if the
> program expects to claim the semaphore N times at the end without
> being aware that some consumers are dead), and, of course, with any
> logic outside of semaphore manipulation. So it seems to me that
> either the program already watches for death of processes it cares
> about, or it's not even clear that allowing further progress (due to
> a broadcast) is always more desirable than blocking.

Indeed, this seems like it's just a bad direction to go for the
reasons you described. One thing I'd like to note here though, but not
act on at this time:

When we were designing the semaphore long ago, there was a vague
proposal to make the post operation responsible for removing waiter
counts, rather than the waiter. It was thrown out because it didn't
seem to work. But if we ever did have a reason to want to do this, I
think it's possible now, since it's essentially just a "countdown to
broadcast wake" with no constraint that it be an accurate count, and
the new waiters bit is what really controls whether wakes happen.

Rich

[-- Attachment #2: 0001-semaphores-fix-missed-wakes-from-ABA-bug-in-waiter-c.patch --]
[-- Type: text/plain, Size: 4659 bytes --]

From 159d1f6c02569091c7a48bdb2e2e824b844a1902 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Tue, 13 Dec 2022 18:39:44 -0500
Subject: [PATCH] semaphores: fix missed wakes from ABA bug in waiter count
 logic

because the has-waiters state in the semaphore value futex word is
only representable when the value is zero (the special value -1
represents "0 with potential new waiters"), it's lost if intervening
operations make the semaphore value positive again. this creates an
ABA issue in sem_post, whereby the post uses a stale waiters count
rather than re-evaluating it, skipping the futex wake if the stale
count was zero.

the fix here is based on a proposal by Alexey Izbyshev, with minor
changes to eliminate costly new spurious wake syscalls.

the basic idea is to replace the special value -1 with a sticky
waiters bit (repurposing the sign bit) preserved under both wait and
post. any post that takes place with the waiters bit set will perform
a futex wake.

to be useful, the waiters bit needs to be removable, and to remove it
safely, we perform a broadcast wake instead of a normal single-task
wake whenever removing the bit. this lets any un-accounted-for waiters
wake and re-add the waiters bit if they still need it.

there are multiple possible choices for when to perform this
broadcast, but the optimal choice seems to be doing it whenever the
observed waiters count is less than two (semantically, this means
exactly one, but we might see a stale count of zero). in this case,
the expected number of threads to be woken is one, with exactly the
same cost as a non-broadcast wake.
---
 src/thread/sem_getvalue.c  |  3 ++-
 src/thread/sem_post.c      | 12 ++++++++----
 src/thread/sem_timedwait.c | 10 ++++++----
 src/thread/sem_trywait.c   |  6 +++---
 4 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/src/thread/sem_getvalue.c b/src/thread/sem_getvalue.c
index d9d83071..c0b7762d 100644
--- a/src/thread/sem_getvalue.c
+++ b/src/thread/sem_getvalue.c
@@ -1,8 +1,9 @@
 #include <semaphore.h>
+#include <limits.h>
 
 int sem_getvalue(sem_t *restrict sem, int *restrict valp)
 {
 	int val = sem->__val[0];
-	*valp = val < 0 ? 0 : val;
+	*valp = val & SEM_VALUE_MAX;
 	return 0;
 }
diff --git a/src/thread/sem_post.c b/src/thread/sem_post.c
index 31e3293d..5c2517f2 100644
--- a/src/thread/sem_post.c
+++ b/src/thread/sem_post.c
@@ -1,17 +1,21 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 int sem_post(sem_t *sem)
 {
-	int val, waiters, priv = sem->__val[2];
+	int val, new, waiters, priv = sem->__val[2];
 	do {
 		val = sem->__val[0];
 		waiters = sem->__val[1];
-		if (val == SEM_VALUE_MAX) {
+		if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
 			errno = EOVERFLOW;
 			return -1;
 		}
-	} while (a_cas(sem->__val, val, val+1+(val<0)) != val);
-	if (val<0 || waiters) __wake(sem->__val, 1, priv);
+		new = val + 1;
+		if (waiters <= 1)
+			new &= ~0x80000000;
+	} while (a_cas(sem->__val, val, new) != val);
+	if (val<0) __wake(sem->__val, waiters>1 ? 1 : -1, priv);
 	return 0;
 }
diff --git a/src/thread/sem_timedwait.c b/src/thread/sem_timedwait.c
index 58d3ebfe..aa67376c 100644
--- a/src/thread/sem_timedwait.c
+++ b/src/thread/sem_timedwait.c
@@ -1,4 +1,5 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 static void cleanup(void *p)
@@ -13,14 +14,15 @@ int sem_timedwait(sem_t *restrict sem, const struct timespec *restrict at)
 	if (!sem_trywait(sem)) return 0;
 
 	int spins = 100;
-	while (spins-- && sem->__val[0] <= 0 && !sem->__val[1]) a_spin();
+	while (spins-- && !(sem->__val[0] & SEM_VALUE_MAX) && !sem->__val[1])
+		a_spin();
 
 	while (sem_trywait(sem)) {
-		int r;
+		int r, priv = sem->__val[2];
 		a_inc(sem->__val+1);
-		a_cas(sem->__val, 0, -1);
+		a_cas(sem->__val, 0, 0x80000000);
 		pthread_cleanup_push(cleanup, (void *)(sem->__val+1));
-		r = __timedwait_cp(sem->__val, -1, CLOCK_REALTIME, at, sem->__val[2]);
+		r = __timedwait_cp(sem->__val, 0x80000000, CLOCK_REALTIME, at, priv);
 		pthread_cleanup_pop(1);
 		if (r) {
 			errno = r;
diff --git a/src/thread/sem_trywait.c b/src/thread/sem_trywait.c
index 04edf46b..beb435da 100644
--- a/src/thread/sem_trywait.c
+++ b/src/thread/sem_trywait.c
@@ -1,12 +1,12 @@
 #include <semaphore.h>
+#include <limits.h>
 #include "pthread_impl.h"
 
 int sem_trywait(sem_t *sem)
 {
 	int val;
-	while ((val=sem->__val[0]) > 0) {
-		int new = val-1-(val==1 && sem->__val[1]);
-		if (a_cas(sem->__val, val, new)==val) return 0;
+	while ((val=sem->__val[0]) & SEM_VALUE_MAX) {
+		if (a_cas(sem->__val, val, val-1)==val) return 0;
 	}
 	errno = EAGAIN;
 	return -1;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [musl] sem_post() can miss waiters
  2022-12-14  1:48     ` Rich Felker
@ 2022-12-14 10:29       ` Alexey Izbyshev
  2023-02-22 18:12         ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Izbyshev @ 2022-12-14 10:29 UTC (permalink / raw)
  To: musl

On 2022-12-14 04:48, Rich Felker wrote:
> On Tue, Nov 22, 2022 at 08:41:44AM +0300, Alexey Izbyshev wrote:
>> On 2022-11-22 03:09, Rich Felker wrote:
>> >If I understand correctly, the cost of the first option is generally
>> >an extra "last" broadcast wake that shouldn't be needed. Is that
>> >right?
>> >
>> >For example, if sem starts with count 0 and thread A calls wait, then
>> >thread B calls post twice, both posts make a syscall even though only
>> >the first one should have.
>> >
>> Yes, exactly.
>> 
>> >What if you instead perform the broadcast wake when the observed
>> >waiters count is <=1 rather than ==0? This should have no cost in the
>> >common case, but in the race case, I think it forces any hiding
>> >(just-arrived) extra waiters to wake, fail, and re-publish their
>> >waiting status to the waiters bit.
>> >
>> Indeed, I think this solves the overhead issue quite nicely, thanks.
>> So sem_post wake up logic would basically boil down to this:
>> 
>> * WAITERS_BIT is set and waiters > 1: don't reset WAITERS_BIT since
>> it's likely that some waiters will remain (and it's always fine to
>> err on the side of preserving the WAITERS_BIT); wake a single
>> waiter.
>> 
>> * WAITERS_BIT is set and waiters <= 1: reset WAITERS_BIT and wake
>> all waiters. In non-racy cases only a single waiter will be woken.
>> 
>> * WAITERS_BIT is unset: don't wake anybody. Even if there are some
>> waiters, another sem_post (that reset the WAITERS_BIT) is
>> responsible for waking them.
>> 
>> In code:
>> 
>> int val, new, waiters, priv = sem->__val[2];
>> do {
>>     val = sem->__val[0];
>>     waiters = sem->__val[1];
>>     if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
>>         errno = EOVERFLOW;
>>         return -1;
>>     }
>>     new = val + 1;
>>     if (waiters <= 1)
>>         new &= ~0x80000000;
>> } while (a_cas(sem->__val, val, new) != val);
>> if (val<0) __wake(sem->__val, waiters <= 1 ? -1 : 1, priv);
> 
> Yes, this looks good to me. How is the attached patch?
> 
The patch looks good to me.

>> >I think there's another really stupid solution too that I would even
>> >consider, partly motivated by the fact that, with long-lived
>> >process-shared semaphores, the waiters count can become stale if
>> >waiters are killed. (Note: semaphores aren't required to be robust
>> >against this, but it's ugly that they're not.) This solution is just
>> >to keep the current code, but drop the waiters count entirely, and use
>> >broadcast wakes for all wakes. Then, any still-live waiters are
>> >*always* responsible for re-asserting their waiting status when all
>> >but one fail to acquire the semaphore after a wake. Of course this is
>> >a thundering herd, which is arguably something we should not want, but
>> >we accept it for correctness in several other places like condvars...
>> >
>> I'm not sure that the kind of "partial robustness" that could be
>> achieved by always waking all waiters is worth punishing normal
>> cases. Also, I don't think waiters count being stale is an issue per
>> se because it can be wrong only in one way (greater than the real
>> count of waiters), assuming you don't mean overflow (but overflow
>> could be handled by simply pinning it at INT_MAX permanently). But
>> many other issues are possible if we allow killing processes at
>> arbitrary points, including sem_post not sending a wake up
>> notification after updating the semaphore value, or sem_wait
>> consuming a notification but not updating the value. Granted, some
>> of these issues may "self-heal" on a future sem_wait/sem_post (in
>> the sense that the semaphore will leave a forbidden state), but they
>> might still interfere with the program shutdown logic (e.g. if the
>> program expects to claim the semaphore N times at the end without
>> being aware that some consumers are dead), and, of course, with any
>> logic outside of semaphore manipulation. So it seems to me that
>> either the program already watches for death of processes it cares
>> about, or it's not even clear that allowing further progress (due to
>> a broadcast) is always more desirable than blocking.
> 
> Indeed, this seems like it's just a bad direction to go for the
> reasons you described. One thing I'd like to note here though, but not
> act on at this time:
> 
> When we were designing the semaphore long ago, there was a vague
> proposal to make the post operation responsible for removing waiter
> counts, rather than the waiter. It was thrown out because it didn't
> seem to work. But if we ever did have a reason to want to do this, I
> think it's possible now, since it's essentially just a "countdown to
> broadcast wake" with no constraint that it be an accurate count, and
> the new waiters bit is what really controls whether wakes happen.
> 
Indeed, now the waiters count is just a hint used for resetting the 
waiters bit.
However, its increments/decrements must still be balanced for it to 
remain useful, and I don't see an easy way to achieve that if sem_post 
is responsible for decrementing it.

One further note: after this bug report I'd been pointed to a semaphore 
redesign proposal from 2015 (starts in 
https://www.openwall.com/lists/musl/2015/02/27/1 after some off-list 
discussion, ends in April). In that design, the waiters count is stored 
in the semaphore word, and sem_post is indeed responsible for adjusting 
it. However, another counter ("wake count") is still needed to make the 
semaphore work.

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [musl] sem_post() can miss waiters
  2022-12-14 10:29       ` Alexey Izbyshev
@ 2023-02-22 18:12         ` Rich Felker
  0 siblings, 0 replies; 7+ messages in thread
From: Rich Felker @ 2023-02-22 18:12 UTC (permalink / raw)
  To: Alexey Izbyshev; +Cc: musl

On Wed, Dec 14, 2022 at 01:29:54PM +0300, Alexey Izbyshev wrote:
> On 2022-12-14 04:48, Rich Felker wrote:
> >On Tue, Nov 22, 2022 at 08:41:44AM +0300, Alexey Izbyshev wrote:
> >>On 2022-11-22 03:09, Rich Felker wrote:
> >>>If I understand correctly, the cost of the first option is generally
> >>>an extra "last" broadcast wake that shouldn't be needed. Is that
> >>>right?
> >>>
> >>>For example, if sem starts with count 0 and thread A calls wait, then
> >>>thread B calls post twice, both posts make a syscall even though only
> >>>the first one should have.
> >>>
> >>Yes, exactly.
> >>
> >>>What if you instead perform the broadcast wake when the observed
> >>>waiters count is <=1 rather than ==0? This should have no cost in the
> >>>common case, but in the race case, I think it forces any hiding
> >>>(just-arrived) extra waiters to wake, fail, and re-publish their
> >>>waiting status to the waiters bit.
> >>>
> >>Indeed, I think this solves the overhead issue quite nicely, thanks.
> >>So sem_post wake up logic would basically boil down to this:
> >>
> >>* WAITERS_BIT is set and waiters > 1: don't reset WAITERS_BIT since
> >>it's likely that some waiters will remain (and it's always fine to
> >>err on the side of preserving the WAITERS_BIT); wake a single
> >>waiter.
> >>
> >>* WAITERS_BIT is set and waiters <= 1: reset WAITERS_BIT and wake
> >>all waiters. In non-racy cases only a single waiter will be woken.
> >>
> >>* WAITERS_BIT is unset: don't wake anybody. Even if there are some
> >>waiters, another sem_post (that reset the WAITERS_BIT) is
> >>responsible for waking them.
> >>
> >>In code:
> >>
> >>int val, new, waiters, priv = sem->__val[2];
> >>do {
> >>    val = sem->__val[0];
> >>    waiters = sem->__val[1];
> >>    if ((val & SEM_VALUE_MAX) == SEM_VALUE_MAX) {
> >>        errno = EOVERFLOW;
> >>        return -1;
> >>    }
> >>    new = val + 1;
> >>    if (waiters <= 1)
> >>        new &= ~0x80000000;
> >>} while (a_cas(sem->__val, val, new) != val);
> >>if (val<0) __wake(sem->__val, waiters <= 1 ? -1 : 1, priv);
> >
> >Yes, this looks good to me. How is the attached patch?
> >
> The patch looks good to me.

Just a heads-up: this patch is in my big queue of stuff to push, and
should be upstream soon now.

Rich

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-02-22 18:12 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-21  5:14 [musl] sem_post() can miss waiters Alexey Izbyshev
2022-11-22  0:09 ` Rich Felker
2022-11-22  5:41   ` A. Wilcox
2022-11-22  5:41   ` Alexey Izbyshev
2022-12-14  1:48     ` Rich Felker
2022-12-14 10:29       ` Alexey Izbyshev
2023-02-22 18:12         ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).