* [musl] Memory lock needed for semaphores?
From: Markus Wichmann @ 2024-08-04  7:29 UTC
To: musl

Hi all,

in light of the recent thread about barriers, I looked at the
semaphore implementation again with an eye to the sudden-unmap issue.
To recap, the issue is that multiple threads may interact with the
same pshared semaphore, and in particular one thread might unmap a
semaphore that another thread is currently still interacting with.

Is that an issue with semaphores? I'm thinking of the following
scenario:

Start: One thread of another process is blocked at the semaphore
(sem->__val = {INT_MIN, 1, 0}).

Thread 1: Calls sem_post(), runs until just after the a_cas() before
being preempted (sem->__val = {1, 1, 0}, but this thread saw val < 0
and waiters <= 1, so a broadcast wake will issue once the thread
resumes).

Thread 2: Calls sem_trywait(), sets sem->__val = {0, 1, 0}.
Thread 2: Calls sem_post(), sets sem->__val = {1, 1, 0}, does not wake.
Thread 2: Unmaps the semaphore.
Thread 1: Resumes. SYS_futex returns EFAULT. The other process remains
sleeping.

Am I missing something that makes this impossible? If not, do we maybe
need a vmlock for posting pshared semaphores? That would probably be
the easiest fix.

Ciao,
Markus
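For concreteness, here is a minimal sketch of the kind of program the
scenario assumes: a pshared semaphore in a shared mapping, with a
waiter in a forked process. The calls are shown sequentially from one
thread, so this replay is benign as written; the race needs the first
sem_post() to run on a separate thread and stall between its a_cas()
and its futex wake, which cannot be forced from portable userspace.

	/* Illustrative sketch only. The state comments describe the racy
	 * interleaving from the scenario above, not this sequential replay. */
	#include <semaphore.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		sem_t *sem = mmap(0, sizeof *sem, PROT_READ|PROT_WRITE,
		                  MAP_SHARED|MAP_ANONYMOUS, -1, 0);
		if (sem == MAP_FAILED) return 1;
		sem_init(sem, 1, 0);       /* pshared, initial value 0 */

		if (fork() == 0) {
			sem_wait(sem);     /* blocks: __val = {INT_MIN, 1, 0} */
			_exit(0);
		}
		sleep(1);                  /* crude: let the child block */
		sem_post(sem);             /* "thread 1": wake still pending */
		sem_trywait(sem);          /* "thread 2": __val = {0, 1, 0} */
		sem_post(sem);             /* "thread 2": issues no wake */
		munmap(sem, sizeof *sem);  /* unmap while a wake is pending */
		return 0;
	}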
* Re: [musl] Memory lock needed for semaphores?
From: Rich Felker @ 2024-08-04 13:31 UTC
To: Markus Wichmann; +Cc: musl, Alexey Izbyshev

On Sun, Aug 04, 2024 at 09:29:03AM +0200, Markus Wichmann wrote:
> Hi all,
>
> in light of the recent thread about barriers, I looked at the
> semaphore implementation again with an eye to the sudden-unmap
> issue. To recap, the issue is that multiple threads may interact
> with the same pshared semaphore, and in particular one thread might
> unmap a semaphore that another thread is currently still interacting
> with.
>
> Is that an issue with semaphores? I'm thinking of the following
> scenario:
>
> Start: One thread of another process is blocked at the semaphore
> (sem->__val = {INT_MIN, 1, 0}).
>
> Thread 1: Calls sem_post(), runs until just after the a_cas() before
> being preempted (sem->__val = {1, 1, 0}, but this thread saw val < 0
> and waiters <= 1, so a broadcast wake will issue once the thread
> resumes).
>
> Thread 2: Calls sem_trywait(), sets sem->__val = {0, 1, 0}.
> Thread 2: Calls sem_post(), sets sem->__val = {1, 1, 0}, does not
> wake.
> Thread 2: Unmaps the semaphore.
> Thread 1: Resumes. SYS_futex returns EFAULT. The other process
> remains sleeping.
>
> Am I missing something that makes this impossible? If not, do we
> maybe need a vmlock for posting pshared semaphores? That would
> probably be the easiest fix.

This is not possible and would be a misuse of it even if it were.
sem_post is required to be AS-safe, so it cannot take any sort of
lock.

I think the issue you report is real, but provided that's the case,
it's a flaw in the logic to be relying on the first sem_post to
perform the wake the second logically needs to perform. I haven't yet
looked into what the cost of leaving the waiters flag set would be.

There are kinda 3 approaches I can see exploring to fix this, and I'm
not sure which are low-cost or even valid yet:

1. Have post only clear the waiters flag if the count is <1 (not <=1).
2. Have trywait re-add a waiters flag if it sees waiters>0.
3. Have post perform the wake if it saw waiters>0 even if the waiters
   bit was not set.

I'll look at this more later. I've CC'd Alexey Izbyshev, who
introduced the current waiter accounting.

Rich
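For readers without the musl source at hand, the sem_post() logic
under discussion looks roughly as follows. This is an abridged sketch
reconstructed from the patch quoted later in the thread (the overflow
check is omitted, and a_cas() and __wake() are musl internals), not
the verbatim source:

	/* __val[0]: value, with the sign bit doubling as the waiters flag;
	 * __val[1]: waiter count; __val[2]: pshared flag. Abridged sketch. */
	int sem_post(sem_t *sem)
	{
		int val, new, waiters, priv = sem->__val[2];
		do {
			val = sem->__val[0];
			waiters = sem->__val[1];
			new = val + 1;              /* bump value, keep waiters bit */
			if (waiters <= 1)
				new &= ~0x80000000; /* clear waiters bit; approach 1 would use <1 */
		} while (a_cas(sem->__val, val, new) != val);
		/* preemption or a signal here is the window the scenario exploits */
		if (val<0) __wake(sem->__val, waiters>1 ? 1 : -1, priv);
		return 0;
	}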
* Re: [musl] Memory lock needed for semaphores?
From: Rich Felker @ 2024-08-04 22:40 UTC
To: Markus Wichmann; +Cc: musl, Alexey Izbyshev

On Sun, Aug 04, 2024 at 09:31:10AM -0400, Rich Felker wrote:
> On Sun, Aug 04, 2024 at 09:29:03AM +0200, Markus Wichmann wrote:
> > Hi all,
> >
> > in light of the recent thread about barriers, I looked at the
> > semaphore implementation again with an eye to the sudden-unmap
> > issue. To recap, the issue is that multiple threads may interact
> > with the same pshared semaphore, and in particular one thread
> > might unmap a semaphore that another thread is currently still
> > interacting with.
> >
> > Is that an issue with semaphores? I'm thinking of the following
> > scenario:
> >
> > Start: One thread of another process is blocked at the semaphore
> > (sem->__val = {INT_MIN, 1, 0}).
> >
> > Thread 1: Calls sem_post(), runs until just after the a_cas()
> > before being preempted (sem->__val = {1, 1, 0}, but this thread
> > saw val < 0 and waiters <= 1, so a broadcast wake will issue once
> > the thread resumes).
> >
> > Thread 2: Calls sem_trywait(), sets sem->__val = {0, 1, 0}.
> > Thread 2: Calls sem_post(), sets sem->__val = {1, 1, 0}, does not
> > wake.
> > Thread 2: Unmaps the semaphore.
> > Thread 1: Resumes. SYS_futex returns EFAULT. The other process
> > remains sleeping.
> >
> > Am I missing something that makes this impossible? If not, do we
> > maybe need a vmlock for posting pshared semaphores? That would
> > probably be the easiest fix.
>
> This is not possible and would be a misuse of it even if it were.
> sem_post is required to be AS-safe, so it cannot take any sort of
> lock.
>
> I think the issue you report is real, but provided that's the case,
> it's a flaw in the logic to be relying on the first sem_post to
> perform the wake the second logically needs to perform. I haven't
> yet looked into what the cost of leaving the waiters flag set would
> be.
>
> There are kinda 3 approaches I can see exploring to fix this, and
> I'm not sure which are low-cost or even valid yet:
>
> 1. Have post only clear the waiters flag if the count is <1 (not
>    <=1).
> 2. Have trywait re-add a waiters flag if it sees waiters>0.
> 3. Have post perform the wake if it saw waiters>0 even if the
>    waiters bit was not set.
>
> I'll look at this more later. I've CC'd Alexey Izbyshev, who
> introduced the current waiter accounting.

From the original thread:

https://www.openwall.com/lists/musl/2022/11/22/3

On Tue, Nov 22, 2022 at 08:41:44AM +0300, Alexey Izbyshev wrote:
> ...
> * WAITERS_BIT is set and waiters > 1: don't reset WAITERS_BIT since
>   it's likely that some waiters will remain (and it's always fine to
>   err on the side of preserving the WAITERS_BIT); wake a single
>   waiter.
>
> * WAITERS_BIT is set and waiters <= 1: reset WAITERS_BIT and wake
>   all waiters. In non-racy cases only a single waiter will be woken.
>
> * WAITERS_BIT is unset: don't wake anybody. Even if there are some
>   waiters, another sem_post (that reset the WAITERS_BIT) is
>   responsible for waking them.

The third case here is the one we have a problem with: it's possible
that the "another sem_post" responsible for waking them can't do so
because the object is unmapped before it gets a chance to.

I don't see any harm in breaking it down into 2 cases:

3a. WAITERS_BIT is unset and waiter count is zero: don't wake anybody.

3b. WAITERS_BIT is unset and waiter count is nonzero: since newly
arriving waiters always set the waiters bit, this can only happen if
another sem_post cleared the waiters bit after the waiter arrived.
That sem_post should perform a broadcast wake as part of clearing the
waiters bit, so either it has not yet made forward progress to the
wake, or the wake was sent but the waiter has not yet woken to
decrement itself from the waiter count and attempt to decrement the
semaphore value. Re-issuing the broadcast wake will ensure, in the
former case, that the wake actually gets sent, and that it happens
before sem_post returns. I think that solves the problem.

The only harm here is that, for patterns like repeatedly calling
sem_post to increment the semaphore value by more than 1, it's quite
plausible that a waiter has not yet taken itself off the waiter count,
and that two or more wake syscalls will be made where one would have
sufficed. This happened prior to commit
159d1f6c02569091c7a48bdb2e2e824b844a1902 anyway, where the old
condition to wake was:

	if (val<0 || waiters) __wake(sem->__val, 1, priv);

so I don't think we need to be concerned about it.

We could *probably* get by with skipping the waiters>0 condition for
non-pshared semaphores:

	if (val<0 || (!priv && waiters)) __wake(sem->__val, 1, priv);

however, I think there's a variant of the bug reported here in regards
to AS-safety, where a sem_post called from a signal handler that
executes between the atomic and the futex wake won't perform another
wake, even though it may need to in order to trigger forward progress.

Even with the extra wake, there's still an AS-safety issue of sorts
where it's possible for the post to happen (value to be updated)
without a wake, if a long-running signal handler interrupts sem_post.
However, aside from the missed wake in the case of recursive sem_post,
the situation of a signal handler interrupting sem_post doesn't seem
to be a significant behavioral difference: it's hard to construct
cases where you notice the posted value without sem_getvalue, which
has some allowance to be racy. I don't see a way to fully fix this
without FUTEX_WAKE_OP having a CAS operation (which it sadly omitted)
or just blocking signals in the cases where we're going to issue a
wake (which I guess may be an option, since it's already making a
syscall anyway).

Rich
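As a rough illustration of that last option, signal blocking could
look like the hypothetical wrapper below. This is not the fix that was
ultimately adopted, and inside libc the masking would only be done on
the paths that actually issue a wake, rather than unconditionally as
here:

	/* Hypothetical sketch: with handlers deferred, no signal handler
	 * can run between the atomic post and the futex wake, so a
	 * recursive sem_post cannot observe the updated value while the
	 * wake is still pending. */
	#include <pthread.h>
	#include <semaphore.h>
	#include <signal.h>

	int sem_post_masked(sem_t *sem)
	{
		sigset_t all, old;
		int r;
		sigfillset(&all);
		pthread_sigmask(SIG_SETMASK, &all, &old); /* defer handlers */
		r = sem_post(sem);                        /* update + wake as one unit */
		pthread_sigmask(SIG_SETMASK, &old, 0);    /* handlers run after wake */
		return r;
	}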
* Re: [musl] Memory lock needed for semaphores?
From: Markus Wichmann @ 2024-08-07  7:15 UTC
To: musl; +Cc: Alexey Izbyshev

Hi all,

an idea I had in the shower just now: Why don't we set the number of
threads to awaken equal to the new semaphore value? I mean, once we
determine there may be waiters and we have to wake some of them, isn't
that the natural number of threads to wake up?

The thing is, my example doesn't stop at two threads. There can be any
number of threads stuck between the CAS and the wake in sem_post().
And yet, it might be only one sem_post() that gets to make a wake
call, so that one call has to wake up all threads that might get the
semaphore.

So if there are 100 waiters, and we just set the semaphore value to 3
(OK, INT_MIN + 3), then we wake up 3 threads. Because we can get in
this state only if there are another 2 threads that haven't awoken any
of the waiters, or the waiters haven't decremented the semaphore yet.
Of course, there might be another thread just chomping at the bit to
wake up another 2 threads, but that thread also might not get to
perform the wake at all.

I know that currently we only set the val argument to __futexwake to 1
or -1, but nothing in the interface says we have to.

Ciao,
Markus
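Concretely, the proposal would amount to something like the following
change to the tail of sem_post(), shown purely for illustration (new
still carries the waiters bit at this point, so it would need masking,
here with SEM_VALUE_MAX, before being used as a wake count):

	-	if (val<0) __wake(sem->__val, waiters>1 ? 1 : -1, priv);
	+	if (val<0 || waiters) __wake(sem->__val, new & SEM_VALUE_MAX, priv);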
* Re: [musl] Memory lock needed for semaphores?
From: Rich Felker @ 2024-08-10  3:02 UTC
To: Markus Wichmann; +Cc: musl, Alexey Izbyshev

On Wed, Aug 07, 2024 at 09:15:55AM +0200, Markus Wichmann wrote:
> Hi all,
>
> an idea I had in the shower just now: Why don't we set the number of
> threads to awaken equal to the new semaphore value? I mean, once we
> determine there may be waiters and we have to wake some of them,
> isn't that the natural number of threads to wake up?

If you do that, then performing N posts in a row will yield N*(N+1)/2
wakes, when you only wanted N wakes. Imagine a producer-consumer setup
with 100 consumers and a producer posting 10 new things to be
consumed. You get a thundering herd of waiters waking up vying for
significantly more slots than are actually available.

> The thing is, my example doesn't stop at two threads. There can be
> any number of threads stuck between the CAS and the wake in
> sem_post(). And yet, it might be only one sem_post() that gets to
> make a wake call, so that one call has to wake up all threads that
> might get the semaphore.

At least in the trivial extension of your example case to N threads
stuck posting, the unmapping is invalid and the whole example breaks
down. This is because the only way the unmapping is valid is by the
unmapping thread being able to know it's the last user of the mapped
semaphore. If you have other concurrent post operations, then you
haven't synchronized being the last user, and thus can't unmap.

In your original example, the unmapping thread synchronized knowing it
was the last user by virtue of its own wait succeeding, as a result of
post modifying the semaphore value. You can probably modify the
example by waiting N times rather than just 1 time, so that you've
consumed all N posts to conclude that no additional users of the
semaphore exist. But then the semaphore value is back at 0, and with
the proposed changes we discussed, you would issue a futex wake
correctly when posting it again.

> So if there are 100 waiters, and we just set the semaphore value to
> 3 (OK, INT_MIN + 3), then we wake up 3 threads. Because we can get
> in this state only if there are another 2 threads that haven't
> awoken any of the waiters, or the waiters haven't decremented the
> semaphore yet. Of course, there might be another thread just
> chomping at the bit to wake up another 2 threads, but that thread
> also might not get to perform the wake at all.
>
> I know that currently we only set the val argument to __futexwake to
> 1 or -1, but nothing in the interface says we have to.

Indeed, it just doesn't seem to be useful to set it to a different
value.

Rich
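To spell out the arithmetic: if no waiter manages to decrement the
semaphore between posts, the k-th of N consecutive posts sees the
value reach k and requests k wakes, so the posts collectively request
1 + 2 + ... + N = N*(N+1)/2 wake-ups. For the example above (10 posts,
100 consumers), that is 55 wake-ups contending for 10 available slots,
versus the 10 that a one-wake-per-post scheme issues.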
* Re: [musl] Memory lock needed for semaphores?
From: Rich Felker @ 2024-08-10 20:47 UTC
To: Markus Wichmann; +Cc: musl, Alexey Izbyshev

[-- Attachment #1: Type: text/plain, Size: 5958 bytes --]

On Sun, Aug 04, 2024 at 06:40:24PM -0400, Rich Felker wrote:
> On Sun, Aug 04, 2024 at 09:31:10AM -0400, Rich Felker wrote:
> > On Sun, Aug 04, 2024 at 09:29:03AM +0200, Markus Wichmann wrote:
> > > Hi all,
> > >
> > > in light of the recent thread about barriers, I looked at the
> > > semaphore implementation again with an eye to the sudden-unmap
> > > issue. To recap, the issue is that multiple threads may interact
> > > with the same pshared semaphore, and in particular one thread
> > > might unmap a semaphore that another thread is currently still
> > > interacting with.
> > >
> > > Is that an issue with semaphores? I'm thinking of the following
> > > scenario:
> > >
> > > Start: One thread of another process is blocked at the semaphore
> > > (sem->__val = {INT_MIN, 1, 0}).
> > >
> > > Thread 1: Calls sem_post(), runs until just after the a_cas()
> > > before being preempted (sem->__val = {1, 1, 0}, but this thread
> > > saw val < 0 and waiters <= 1, so a broadcast wake will issue
> > > once the thread resumes).
> > >
> > > Thread 2: Calls sem_trywait(), sets sem->__val = {0, 1, 0}.
> > > Thread 2: Calls sem_post(), sets sem->__val = {1, 1, 0}, does
> > > not wake.
> > > Thread 2: Unmaps the semaphore.
> > > Thread 1: Resumes. SYS_futex returns EFAULT. The other process
> > > remains sleeping.
> > >
> > > Am I missing something that makes this impossible? If not, do we
> > > maybe need a vmlock for posting pshared semaphores? That would
> > > probably be the easiest fix.
> >
> > This is not possible and would be a misuse of it even if it were.
> > sem_post is required to be AS-safe, so it cannot take any sort of
> > lock.
> >
> > I think the issue you report is real, but provided that's the
> > case, it's a flaw in the logic to be relying on the first sem_post
> > to perform the wake the second logically needs to perform. I
> > haven't yet looked into what the cost of leaving the waiters flag
> > set would be.
> >
> > There are kinda 3 approaches I can see exploring to fix this, and
> > I'm not sure which are low-cost or even valid yet:
> >
> > 1. Have post only clear the waiters flag if the count is <1 (not
> >    <=1).
> > 2. Have trywait re-add a waiters flag if it sees waiters>0.
> > 3. Have post perform the wake if it saw waiters>0 even if the
> >    waiters bit was not set.
> >
> > I'll look at this more later. I've CC'd Alexey Izbyshev, who
> > introduced the current waiter accounting.
>
> From the original thread:
>
> https://www.openwall.com/lists/musl/2022/11/22/3
>
> On Tue, Nov 22, 2022 at 08:41:44AM +0300, Alexey Izbyshev wrote:
> > ...
> > * WAITERS_BIT is set and waiters > 1: don't reset WAITERS_BIT
> >   since it's likely that some waiters will remain (and it's always
> >   fine to err on the side of preserving the WAITERS_BIT); wake a
> >   single waiter.
> >
> > * WAITERS_BIT is set and waiters <= 1: reset WAITERS_BIT and wake
> >   all waiters. In non-racy cases only a single waiter will be
> >   woken.
> >
> > * WAITERS_BIT is unset: don't wake anybody. Even if there are some
> >   waiters, another sem_post (that reset the WAITERS_BIT) is
> >   responsible for waking them.
>
> The third case here is the one we have a problem with: it's possible
> that the "another sem_post" responsible for waking them can't do so
> because the object is unmapped before it gets a chance to.
>
> I don't see any harm in breaking it down into 2 cases:
>
> 3a. WAITERS_BIT is unset and waiter count is zero: don't wake
> anybody.
>
> 3b. WAITERS_BIT is unset and waiter count is nonzero: since newly
> arriving waiters always set the waiters bit, this can only happen if
> another sem_post cleared the waiters bit after the waiter arrived.
> That sem_post should perform a broadcast wake as part of clearing
> the waiters bit, so either it has not yet made forward progress to
> the wake, or the wake was sent but the waiter has not yet woken to
> decrement itself from the waiter count and attempt to decrement the
> semaphore value. Re-issuing the broadcast wake will ensure, in the
> former case, that the wake actually gets sent, and that it happens
> before sem_post returns. I think that solves the problem.
>
> The only harm here is that, for patterns like repeatedly calling
> sem_post to increment the semaphore value by more than 1, it's quite
> plausible that a waiter has not yet taken itself off the waiter
> count, and that two or more wake syscalls will be made where one
> would have sufficed. This happened prior to commit
> 159d1f6c02569091c7a48bdb2e2e824b844a1902 anyway, where the old
> condition to wake was:
>
> 	if (val<0 || waiters) __wake(sem->__val, 1, priv);
>
> so I don't think we need to be concerned about it.
>
> We could *probably* get by with skipping the waiters>0 condition for
> non-pshared semaphores:
>
> 	if (val<0 || (!priv && waiters)) __wake(sem->__val, 1, priv);
>
> however, I think there's a variant of the bug reported here in
> regards to AS-safety, where a sem_post called from a signal handler
> that executes between the atomic and the futex wake won't perform
> another wake, even though it may need to in order to trigger forward
> progress.
>
> Even with the extra wake, there's still an AS-safety issue of sorts
> where it's possible for the post to happen (value to be updated)
> without a wake, if a long-running signal handler interrupts
> sem_post. However, aside from the missed wake in the case of
> recursive sem_post, the situation of a signal handler interrupting
> sem_post doesn't seem to be a significant behavioral difference:
> it's hard to construct cases where you notice the posted value
> without sem_getvalue, which has some allowance to be racy. I don't
> see a way to fully fix this without FUTEX_WAKE_OP having a CAS
> operation (which it sadly omitted) or just blocking signals in the
> cases where we're going to issue a wake (which I guess may be an
> option, since it's already making a syscall anyway).

How does the attached look?

Rich

[-- Attachment #2: 0001-fix-lost-or-delayed-wakes-in-sem_post-under-certain-.patch --]
[-- Type: text/plain, Size: 2103 bytes --]

From 882aedf6a13891f887d20f6a4184a13e94793b84 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Sat, 10 Aug 2024 16:30:28 -0400
Subject: [PATCH] fix lost or delayed wakes in sem_post under certain race
 conditions

if sem_post is interrupted between clearing the waiters bit from the
semaphore value and performing the futex wake operation, subsequent
calls to sem_post will not perform a wake operation unless a new
waiter has arrived. usually, this is at most a minor nuisance, since
the original wake operation will eventually happen.

however, it's possible that the wake is delayed indefinitely if
interrupted by a signal handler, or that the address the wake needs
to be performed on is no longer mapped if the semaphore was a
process-shared one that has since been unmapped but has a waiter on a
different mapping of the same semaphore. this can happen when another
thread using the same mapping "steals the post" atomically before
actually becoming a second waiter, deduces from success that it was
the last user of the semaphore mapping, then re-posts and unmaps the
semaphore mapping. this scenario was described in a report by Markus
Wichmann.

instead of checking only the waiters bit, also check the waiter count
that was sampled before the atomic post operation, and perform the
wake if it's nonzero. this will not produce any additional wakes
under non-race conditions, since the waiters bit only becomes zero
when targeting a single waiter for wake. checking both was already
the behavior prior to commit 159d1f6c02569091c7a48bdb2e2e824b844a1902.
---
 src/thread/sem_post.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/thread/sem_post.c b/src/thread/sem_post.c
index 5c2517f2..1c8f763c 100644
--- a/src/thread/sem_post.c
+++ b/src/thread/sem_post.c
@@ -16,6 +16,6 @@ int sem_post(sem_t *sem)
 		if (waiters <= 1)
 			new &= ~0x80000000;
 	} while (a_cas(sem->__val, val, new) != val);
-	if (val<0) __wake(sem->__val, waiters>1 ? 1 : -1, priv);
+	if (val<0 || waiters) __wake(sem->__val, waiters>1 ? 1 : -1, priv);
 	return 0;
 }
-- 
2.21.0
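Replaying the original scenario against this patch, for illustration:
thread 2's second sem_post() reads sem->__val = {0, 1, 0}, so val is
not negative but the sampled waiter count is 1. The new condition
therefore fires, and since waiters>1 is false, thread 2 itself issues
the broadcast wake before it can reach the munmap. Thread 1's later
futex call may still return EFAULT, but the wake it was racing to
deliver has already been sent, and the blocked process wakes.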
* Re: [musl] Memory lock needed for semaphores?
From: Markus Wichmann @ 2024-08-11  4:40 UTC
To: musl; +Cc: Alexey Izbyshev

On Sat, Aug 10, 2024 at 04:47:20PM -0400, Rich Felker wrote:
> How does the attached look?

Looks good to me. I had imagined an extension of the scenario to more
threads, but thinking about it further, I noticed that in that
scenario the semaphore value is never more than 1. And in the scenario
where it does get larger, the unmap is invalid, as you said in the
other mail.

So yeah, looks good to me.

Ciao,
Markus