* Resuming work on new semaphore
From: Rich Felker @ 2015-04-02  1:30 UTC
To: musl

I have Alexander Monakov's new semaphore design on the agenda for this
release cycle (1.1.9). If I remember correctly, we were looking at some
issues in the last posted version with barriers and how to write the
code in a way that's both clean and avoids compiler stupidity.

One other thing I've been thinking about is how the current and new
semaphore designs differ when a waiter on a process-shared semaphore is
asynchronously killed (while waiting or during acceptance of the post).

With the current design:

If just waiting, the waiters count remains permanently elevated (and
will eventually overflow if this happens billions of times), and any
posts will make the value positive so that another waiter can take the
post.

If accepting (at least one atomic operation on the sem already
performed), then the waiters count is left valid and either a post is
consumed or left for another waiter to take.

With the new design:

If just waiting, the negative semaphore value persists after the waiter
is killed. Subsequent posts will produce a wake for a waiter that
doesn't exist, and will thereby allow future waiters that arrive when
the semaphore value is zero to proceed immediately (leaving the value
negative) by consuming this wake. There are usage patterns where
trywait would never succeed again, but wait would succeed trivially.

If accepting, the post is consumed. Accepting only involves one atomic
operation, so there is no other possibility.

It seems to me the behavior for async-kill-while-waiting is somewhat
worse with the new design, but this isn't really a usage pattern that
should have been considered "supported" before, anyway. If there are
ideas to make it better, we could explore them, but I don't think it's
a show-stopper in any case.

Rich
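For orientation, here is a minimal sketch of the two-word design under
discussion, reconstructed from the code posted later in this thread. It
uses only musl-internal helpers that appear verbatim in that code
(a_fetch_add, a_cas, a_inc, __wake, __timedwait_cp), so it is not
compilable outside musl's source tree; the simplified wait also omits
timeouts, cancellation, spinning, and overflow handling. It illustrates
the state machine, not the real implementation.

/* val[0] holds the semaphore value and goes negative while tasks wait;
   val[1] counts pending wakes posted for sleeping waiters;
   val[2] holds the pshared flag passed to the futex calls. */

static void sketch_wait(volatile int *val)
{
	if (a_fetch_add(val, -1) > 0) return;  /* fast path: downed a positive value */
	for (;;) {
		/* sleep until a pending wake appears in val[1] (null timeout) */
		__timedwait_cp(val+1, 0, CLOCK_REALTIME, 0, val[2]);
		int wak = val[1];
		if (wak > 0 && a_cas(val+1, wak, wak-1) == wak)
			return;  /* consumed one pending wake */
	}
}

static void sketch_post(volatile int *val)
{
	int v;
	do v = val[0];
	while (a_cas(val, v, v+1) != v);  /* up the value */
	if (v < 0) {                      /* waiters present (or dead): */
		a_inc(val+1);                 /* publish a wake... */
		__wake(val+1, 1, val[2]);     /* ...and wake one sleeper */
	}
}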
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-02  7:42 UTC
To: musl

On Wed, 1 Apr 2015, Rich Felker wrote:
> If just waiting, the negative semaphore value persists after the
> waiter is killed. Subsequent posts will produce a wake for a waiter
> that doesn't exist, and will thereby allow future waiters that arrive
> when the semaphore value is zero to proceed immediately (leaving the
> value negative) by consuming this wake. There are usage patterns where
> trywait would never succeed again, but wait would succeed trivially.

Interesting. To examine the issue under a different light, consider
that from the perspective of the semaphore implementation, waiters that
were killed, stopped, or pre-empted forever in the middle of sem_wait
are indistinguishable. Thus, subsequent sem_wait succeeds by
effectively stealing a post, and to make things consistent you can
teach sem_trywait to steal posts too (i.e. try
atomic-decrement-if-positive on val[1] just before returning EAGAIN,
and return 0 if that succeeds).

Alexander
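A sketch of the steal just proposed, using the a_cas helper that
appears in the code later in this thread; the function name is invented
here for illustration:

/* atomically decrement val[1] only if it is positive; return 1 if a
   pending wake was stolen, 0 if there was none to steal */
static int steal_pending_wake(volatile int *val)
{
	int wak;
	while ((wak = val[1]) > 0)
		if (a_cas(val+1, wak, wak-1) == wak)
			return 1;
	return 0;
}

sem_trywait would call this just before returning EAGAIN and report
success instead when it returns 1. (A correction appears a few messages
down: val[0] must also be decremented when the steal succeeds.)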
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-02 15:26 UTC
To: musl

On Thu, Apr 02, 2015 at 10:42:16AM +0300, Alexander Monakov wrote:
> On Wed, 1 Apr 2015, Rich Felker wrote:
> > If just waiting, the negative semaphore value persists after the
> > waiter is killed. Subsequent posts will produce a wake for a waiter
> > that doesn't exist, and will thereby allow future waiters that arrive
> > when the semaphore value is zero to proceed immediately (leaving the
> > value negative) by consuming this wake. There are usage patterns where
> > trywait would never succeed again, but wait would succeed trivially.
>
> Interesting. To examine the issue under a different light, consider
> that from the perspective of the semaphore implementation, waiters
> that were killed, stopped, or pre-empted forever in the middle of
> sem_wait are indistinguishable.

Yes, I noticed this too. In that sense, theoretically there should be
no harm (aside from eventual overflow of the pending-wake counter) from
having asynchronously-killed waiters, assuming the implementation is
bug-free in the absence of async killing of waiters.

> Thus, subsequent sem_wait succeeds by effectively stealing a post,
> and to make things consistent you can teach sem_trywait to steal
> posts too (i.e. try atomic-decrement-if-positive on val[1] just
> before returning EAGAIN, and return 0 if that succeeds).

Hmm, perhaps that is valid. I'll have to think about it again. I was
thinking of having sem_trywait unconditionally down the value (val[0])
and then imitate the exit path of sem_timedwait, but that's not valid
because another waiter could race and prevent sem_trywait from ever
being able to exit. But if it only does the down as a dec-if-positive,
then it seems it can safely dec-if-positive the wake count before
reporting failure.

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-02 21:39 UTC
To: musl

On Thu, 2 Apr 2015, Rich Felker wrote:
> > Interesting. To examine the issue under a different light, consider
> > that from the perspective of the semaphore implementation, waiters
> > that were killed, stopped, or pre-empted forever in the middle of
> > sem_wait are indistinguishable.
>
> Yes, I noticed this too. In that sense, theoretically there should be
> no harm (aside from eventual overflow of the pending-wake counter)
> from having asynchronously-killed waiters, assuming the
> implementation is bug-free in the absence of async killing of
> waiters.

Did you mean "presence"? I'm having trouble understanding your phrase,
especially after "assuming ..."; can you elaborate or rephrase?

That waiters can die breaks an assumption that operations on val[0] and
val[1] do not under/overflow due to their range exceeding the number of
simultaneously live tasks.

> > Thus, subsequent sem_wait succeeds by effectively stealing a post,
> > and to make things consistent you can teach sem_trywait to steal
> > posts too (i.e. try atomic-decrement-if-positive on val[1] just
> > before returning EAGAIN, and return 0 if that succeeds).
>
> Hmm, perhaps that is valid. I'll have to think about it again. I was
> thinking of having sem_trywait unconditionally down the value (val[0])
> and then imitate the exit path of sem_timedwait, but that's not valid
> because another waiter could race and prevent sem_trywait from ever
> being able to exit. But if it only does the down as a dec-if-positive,
> then it seems it can safely dec-if-positive the wake count before
> reporting failure.

I think my proposition above needs at least the following correction:
when trywait succeeds in stealing a post by dec-if-positive(val[1]), it
should also decrement val[0] before returning.

Are you sure your proposition is invalid? I don't think so. How is
trywait different from a timedwait with a timeout that immediately
expires? That is basically what your scheme should do.

Alexander
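A literal transcription of this correction, as a sketch under the same
assumptions as before (a_fetch_add and a_cas are the helpers used by
the code posted later in the thread; the eventual code below structures
the accounting differently):

/* steal a pending wake, and also down val[0] to account for the
   consumed post, per the correction above */
static int steal_pending_wake_corrected(volatile int *val)
{
	int wak;
	while ((wak = val[1]) > 0)
		if (a_cas(val+1, wak, wak-1) == wak) {
			a_fetch_add(val, -1);
			return 1;
		}
	return 0;
}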
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-02 23:14 UTC
To: musl

On Fri, Apr 03, 2015 at 12:39:10AM +0300, Alexander Monakov wrote:
> On Thu, 2 Apr 2015, Rich Felker wrote:
> > > Interesting. To examine the issue under a different light,
> > > consider that from the perspective of the semaphore
> > > implementation, waiters that were killed, stopped, or pre-empted
> > > forever in the middle of sem_wait are indistinguishable.
> >
> > Yes, I noticed this too. In that sense, theoretically there should
> > be no harm (aside from eventual overflow of the pending-wake
> > counter) from having asynchronously-killed waiters, assuming the
> > implementation is bug-free in the absence of async killing of
> > waiters.
>
> Did you mean "presence"? I'm having trouble understanding your
> phrase, especially after "assuming ..."; can you elaborate or
> rephrase?

I meant to say that, assuming there aren't already any bugs, then by
your reasoning adding async killing of waiters cannot add bugs (except
the overflow), since killed waiters are equivalent to a situation that
already arises without async killing.

> That waiters can die breaks an assumption that operations on val[0]
> and val[1] do not under/overflow due to their range exceeding the
> number of simultaneously live tasks.

Right. I'm ignoring that one. The current implementation likewise has
that issue for the waiter count (but it could avoid it by saturating
the waiter count at INT_MAX, I suppose, or by throwing away the waiter
count and just using a potential-waiters flag).

> I think my proposition above needs at least the following correction:
> when trywait succeeds in stealing a post by dec-if-positive(val[1]),
> it should also decrement val[0] before returning.

Yes, that seems right.

> Are you sure your proposition is invalid? I don't think so. How is
> trywait different from a timedwait with a timeout that immediately
> expires? That is basically what your scheme should do.

Indeed, I think you're right. Conceptually trywait and timedwait with a
zero timeout should be identical modulo error value and cancellation.

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-05 14:07 UTC
To: musl

So, your solution results in a simpler execution path for successful
trywait, but needs to undo the semaphore adjustment in order to return
EAGAIN (which I believe to be a relatively common outcome). Also, if a
concurrent poster sees the transient negative val[0], it will proceed
to the slow sem_post path (val[1] increment and futex wake), so if
there's high activity on the semaphore, this solution may slow down
posters.

My solution keeps the successful trywait path as-is, and adds a test on
val[1] in the EAGAIN case with a conditional branch that is normally
not executed. The semaphore's contents are not changed when returning
EAGAIN. I was about to say that this leads to more preferable cache
traffic on a highly contended semaphore (all calls to trywait returning
EAGAIN simultaneously can work on shared rather than exclusive cache
lines), but unfortunately we need a read memory barrier on the
semaphore, and the solution to that was to perform a CAS on val[0]
unconditionally, which would cause each caller to make the cache line
with the semaphore exclusive anyway, IIUC.

Regarding the problem raised on IRC: with 1 waiter,
post-getvalue-trywait can now result in 0-0 return values rather than
0-EAGAIN, while previously it could result in 1-EAGAIN rather than 1-0.
The old behavior was explained by saying that the waiter did not
"become a waiter" so that post had to wake it, but instead suspended
execution and resumed only after the semaphore value became positive;
the new surprising behavior needs a different explanation. I think in
the absence of a mechanism to detect whether current waiters are still
alive, that's the way it has to be. Either you pretend there are no
waiters at any time, like today, or you count dead waiters the same as
live waiters and report a 0 semaphore value to the caller; but then
what do you do when the caller invokes sem_wait and all other waiters
are dead? Suspend the caller indefinitely, or let it proceed by
consuming a pending post and thus revealing an inconsistency?

Alexander
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-05 14:17 UTC
To: musl

In a similar vein, if the caller explicitly kills other waiters before
invoking sem_post and sem_getvalue, it will also reveal an
inconsistency with the new implementation: a returned value of 0 will
look as if sem_post has woken some waiters, even though the caller
could know for certain that no tasks were waiting on the semaphore.

Well, we can make sem_getvalue return val[0]+val[1] instead... ;)

Alexander
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-05 19:02 UTC
To: musl

On Sun, Apr 05, 2015 at 05:17:35PM +0300, Alexander Monakov wrote:
> In a similar vein, if the caller explicitly kills other waiters
> before invoking sem_post and sem_getvalue, it will also reveal an
> inconsistency with the new implementation: a returned value of 0 will
> look as if sem_post has woken some waiters, even though the caller
> could know for certain that no tasks were waiting on the semaphore.

I'm not concerned about async killing (which is something of a broken
usage case anyway; it certainly doesn't work for other sync objects
except robust mutexes), but I am somewhat concerned about other
equivalent cases. For example:

1. Thread A enters sem_wait.
2. Thread B observes thread A in sem_wait via failed sem_trywait.
3. Thread B sends a signal to thread A.
4. Thread A blocks in the signal handler waiting for some other event
   under the control of thread B.
5. Thread B calls sem_post.
6. Thread B calls sem_wait and succeeds.
7. Thread B unblocks thread A.
8. Thread A remains blocked on the sem since B consumed the wake.

Note that the same thing happens with the old or new implementation;
the only difference is the _values_ observed. I believe this behavior
is forbidden by the following:

    "If the semaphore value resulting from this operation is positive,
    then no threads were blocked waiting for the semaphore to become
    unlocked; the semaphore value is simply incremented. If the value
    of the semaphore resulting from this operation is zero, then one
    of the threads blocked waiting for the semaphore shall be allowed
    to return successfully from its call to sem_wait()."

The old (current) implementation wrongly yields a positive value
despite there being waiters. The new implementation wrongly allows
sem_wait to succeed (and steal a wake) when the value is zero.

> Well, we can make sem_getvalue return val[0]+val[1] instead... ;)

That just makes the new implementation look like the old one, no? :-)

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-05 20:03 UTC
To: musl

On Sun, 5 Apr 2015, Rich Felker wrote:
> 1. Thread A enters sem_wait.
> 2. Thread B observes thread A in sem_wait via failed sem_trywait.

Hm, I don't see how that can be achieved. As a result I'm afraid I
didn't fully understand your example.

> > Well, we can make sem_getvalue return val[0]+val[1] instead... ;)
>
> That just makes the new implementation look like the old one, no? :-)

Can't be bad if it behaves the same but works a bit faster.

Apropos, like I've said on IRC, it looks like there's a "semaphore
uncertainty principle": the formal semaphore value is somewhere between
val[0] and val[0] +/- val[1] (clamped to 0 as needed). It seems you can
either do your hack and pretend that there are never any waiters, or
try to faithfully count waiters in sem_getvalue, but then also reveal
that sometimes the implementation works by stealing a post. I believe
you could argue that the latter is explicitly disallowed by the spec.

By the way, I think there's an interesting interplay with cancellation.
Consider the following. Thread B does "return sem_wait(sem);". Thread A
does:

	pthread_cancel(thread_B);
	sem_post(sem);
	sem_getvalue(sem, &val);

If it observes the semaphore value as 1, it follows that thread B has
not become a waiter yet, and since it must have cancellation already
pending, it may not consume the post. And yet if thread B is already
futex-waiting in sem_wait, consuming the post takes priority over
acting on cancellation. So if thread A then does

	pthread_join(thread_B, 0);
	sem_getvalue(sem, &val);

and gets a value of 0, it sees a contradiction. And the return value
from pthread_join will indicate that thread_B exited normally rather
than was cancelled.

And on the contrary, if you make acting on cancellation/timeout take
priority, you can observe the semaphore value increasing when waiters
leave the wait on the error path without consuming the post.

Alexander
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-05 20:23 UTC
To: musl

On Sun, Apr 05, 2015 at 11:03:34PM +0300, Alexander Monakov wrote:
> On Sun, 5 Apr 2015, Rich Felker wrote:
> > 1. Thread A enters sem_wait.
> > 2. Thread B observes thread A in sem_wait via failed sem_trywait.
>
> Hm, I don't see how that can be achieved. As a result I'm afraid I
> didn't fully understand your example.

Indeed, I was wrong about that, so I agree the whole scenario may fall
apart. Only sem_getvalue could show this, and only if it returns -1
rather than 0. So returning negative values from sem_getvalue seems
like a very bad idea -- it puts difficult- or impossible-to-satisfy
additional constraints on the implementation.

> Apropos, like I've said on IRC, it looks like there's a "semaphore
> uncertainty principle": the formal semaphore value is somewhere
> between val[0] and val[0] +/- val[1] (clamped to 0 as needed). It
> seems you can either do your hack and pretend that there are never
> any waiters, or try to faithfully count waiters in sem_getvalue, but
> then also reveal that sometimes the implementation works by stealing
> a post. I believe you could argue that the latter is explicitly
> disallowed by the spec.

Yes, I think I agree.

> By the way, I think there's an interesting interplay with
> cancellation. Consider the following. Thread B does
> "return sem_wait(sem);". Thread A does:
>
> 	pthread_cancel(thread_B);
> 	sem_post(sem);
> 	sem_getvalue(sem, &val);
>
> If it observes the semaphore value as 1, it follows that thread B has
> not become a waiter yet, and since it must have cancellation already
> pending, it may not consume the post. And yet if thread B is already
> futex-waiting in sem_wait, consuming the post takes priority over
> acting on cancellation. So if thread A then does
>
> 	pthread_join(thread_B, 0);
> 	sem_getvalue(sem, &val);
>
> and gets a value of 0, it sees a contradiction. And the return value
> from pthread_join will indicate that thread_B exited normally rather
> than was cancelled.

So the contradiction you claim exists is that cancellation happened
before the post, and thus thread B can't act on the post when it didn't
act on cancellation? I don't think that follows from the rules of
cancellation. The relevant text is:

    "Whenever a thread has cancelability enabled and a cancellation
    request has been made with that thread as the target, and the
    thread then calls any function that is a cancellation point (such
    as pthread_testcancel() or read()), the cancellation request shall
    be acted upon before the function returns."

So if cancellation was pending _before_ the call to sem_wait, then
sem_wait has to honor it. But there is no requirement that entry to the
sem_wait function be "atomic" with becoming a waiter on the semaphore,
and of course this is impossible to satisfy or even specify. So it's
totally legal to have the sequence:

1. Thread B enters sem_wait.
2. Thread B observes that cancellation was not already pending.
3. Thread A sends the cancellation request.
4. Thread A sends the post.
5. Thread B receives both, and chooses to act on the post per this
   text:

    "It is unspecified whether the cancellation request is acted upon
    or whether the cancellation request remains pending and the thread
    resumes normal execution if:

    - The thread is suspended at a cancellation point and the event
      for which it is waiting occurs

    - A specified timeout expired before the cancellation request is
      acted upon."

Here, the event for which it was waiting (the post) clearly occurs.

> And on the contrary, if you make acting on cancellation/timeout take
> priority, you can observe the semaphore value increasing when waiters
> leave the wait on the error path without consuming the post.

Yes, and obviously that is not possible.

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-05 21:07 UTC
To: musl

On Sun, 5 Apr 2015, Rich Felker wrote:
> So if cancellation was pending _before_ the call to sem_wait, then
> sem_wait has to honor it. But there is no requirement that entry to
> the sem_wait function be "atomic" with becoming a waiter on the
> semaphore, and of course this is impossible to satisfy or even
> specify.

Thanks!

One other thing to consider. In the absence of concurrent operations on
the semaphore, the return value of sem_getvalue should be equal to the
number of times sem_trywait will indicate success when called
repeatedly. So if the implementation performs post-stealing in trywait,
it should return the higher bound as the semaphore value. Likewise for
timedwait.

Alexander
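The invariant stated above can be written down as a check; this is a
sketch using only the public POSIX API, and it is valid only when
nothing else touches the semaphore concurrently:

#include <assert.h>
#include <semaphore.h>

static void check_getvalue_invariant(sem_t *sem)
{
	int val, n = 0;
	sem_getvalue(sem, &val);
	while (sem_trywait(sem) == 0)
		n++;               /* drain every available post */
	assert(n == val);      /* holds only without concurrency */
}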
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-11 22:22 UTC
To: musl

On Mon, 6 Apr 2015, Alexander Monakov wrote:
> One other thing to consider. In the absence of concurrent operations
> on the semaphore, the return value of sem_getvalue should be equal to
> the number of times sem_trywait will indicate success when called
> repeatedly. So if the implementation performs post-stealing in
> trywait, it should return the higher bound as the semaphore value.
> Likewise for timedwait.

If we accept the above, it follows that in the new implementation
getvalue should return not max(0, val[0] + val[1]), but rather
max(0, val[0]) + val[1].

So incorporating the above into getvalue, and Rich's scheme of
mirroring timedwait into trywait (as he pointed out on IRC, my approach
does not properly work with two concurrent posters), we get:

int sem_getvalue(sem_t *restrict sem, int *restrict valp)
{
	a_barrier(); // Memory holding the semaphore should not be stale
	int val = sem->__val[0];
	val = val < 0 ? 0 : val;
	*valp = val + sem->__val[1];
	return 0;
}

int sem_trywait(sem_t *sem)
{
	if (a_fetch_add(sem->__val, -1) > 0) return 0;
	for (;;) {
		int wak = sem->__val[1];
		if (wak > 0 && wak == a_cas(sem->__val+1, wak, wak-1))
			return 0;
		int val = sem->__val[0];
		if (val < 0 && val == a_cas(sem->__val, val, val+1))
			break;
		a_spin();
	}
	errno = EAGAIN;
	return -1;
}

int sem_post(sem_t *sem)
{
	int val;
	do val = sem->__val[0];
	while (val != a_cas(sem->__val, val, val+!!(val<SEM_VALUE_MAX)));
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	if (val < SEM_VALUE_MAX) return 0;
	errno = EOVERFLOW;
	return -1;
}

int sem_timedwait(sem_t *restrict sem, const struct timespec *restrict at)
{
	pthread_testcancel();
	// Do not spin if already contended (val[0] is negative)
	for (int spins = 1000; spins && sem->__val[0] == 0; spins--)
		a_spin();
	if (a_fetch_add(sem->__val, -1) > 0) return 0;
	int priv = sem->__val[2], e = 0, ee, cs;
	pthread_setcancelstate(PTHREAD_CANCEL_MASKED, &cs);
	do {
		ee = __timedwait_cp(sem->__val+1, 0, CLOCK_REALTIME, at, priv);
		int wak = sem->__val[1];
		if (wak > 0 && wak == a_cas(sem->__val+1, wak, wak-1))
			goto done;
	} while (!ee || ee == EINTR);
	// Upon returning from the wait with an error, either discount
	// ourselves as a waiter by incrementing the negative val[0] and
	// propagate the error, or consume a racing post if val[0] is
	// non-negative and suppress the error.
	for (;;) {
		int val = sem->__val[0];
		if (val < 0 && val == a_cas(sem->__val, val, val+1))
			break;
		int wak = sem->__val[1];
		if (wak > 0 && wak == a_cas(sem->__val+1, wak, wak-1))
			goto done;
		a_spin();
	}
	e = ee;
done:
	pthread_setcancelstate(cs, 0);
	if (!e) return 0;
	if (e == ECANCELED) {
		pthread_testcancel();
		pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, 0);
	}
	errno = e;
	return -1;
}
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-23 16:06 UTC
To: musl

I'm going to try to summarize some of the issues that have been
discussed on IRC since this.

On Sun, Apr 12, 2015 at 01:22:34AM +0300, Alexander Monakov wrote:
> On Mon, 6 Apr 2015, Alexander Monakov wrote:
> > One other thing to consider. In the absence of concurrent
> > operations on the semaphore, the return value of sem_getvalue
> > should be equal to the number of times sem_trywait will indicate
> > success when called repeatedly. So if the implementation performs
> > post-stealing in trywait, it should return the higher bound as the
> > semaphore value. Likewise for timedwait.
>
> If we accept the above, it follows that in the new implementation
> getvalue should return not max(0, val[0] + val[1]), but rather
> max(0, val[0]) + val[1].

Indeed. But then max(0, val[0]) + val[1] can overflow SEM_VALUE_MAX
unless we prevent it, which takes some work, but I think it's possible.

> int sem_post(sem_t *sem)
> {
> 	int val;
> 	do val = sem->__val[0];
> 	while (val != a_cas(sem->__val, val, val+!!(val<SEM_VALUE_MAX)));
> 	if (val < 0) {
> 		int priv = sem->__val[2];
> 		a_inc(sem->__val+1);
> 		__wake(sem->__val+1, 1, priv);
> 	}
> 	if (val < SEM_VALUE_MAX) return 0;
> 	errno = EOVERFLOW;
> 	return -1;
> }

The first observation we made was that this checks val<SEM_VALUE_MAX
twice in the success path, for an extra useless branch. It can be fixed
by something like this (my1):

int sem_post(sem_t *sem)
{
	int val = sem->__val[0];
	val -= val==SEM_VALUE_MAX;
	while (a_cas(sem->__val, val, val+1) != val) {
		if ((val = sem->__val[0]) == SEM_VALUE_MAX) {
			errno = EOVERFLOW;
			return -1;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

or this (my1b):

int sem_post(sem_t *sem)
{
	int old, val = sem->__val[0];
	val -= val==SEM_VALUE_MAX;
	while ((old = a_cas(sem->__val, val, val+1)) != val) {
		if ((val = old) == SEM_VALUE_MAX) {
			errno = EOVERFLOW;
			return -1;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

The latter saves the result of a_cas to prevent an extra load, but I
don't think it makes any significant difference, and it might be seen
as uglier.

However, neither of those addresses the overflow issue, which I've
tried to address here:

#define VAL0_MAX ((SEM_VALUE_MAX+1)/2)
#define VAL1_MAX (SEM_VALUE_MAX/2)

int sem_post(sem_t *sem)
{
	int val = sem->__val[0];
	val -= val==VAL0_MAX;
	while (a_cas(sem->__val, val, val+1) != val) {
		if ((val = sem->__val[0]) == VAL0_MAX) {
			int tmp = sem->__val[1];
			if (tmp >= VAL1_MAX) {
				errno = EOVERFLOW;
				return -1;
			}
			if (a_cas(sem->__val+1, tmp, tmp+1) == tmp) {
				return 0;
			}
			val--;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

This is code whose idea was discussed on IRC but not yet presented, so
it may have significant bugs. The idea is to limit the main sem value
component and the wake count separately to half the max. Once val[0]
hits VAL0_MAX, further posts will be in the form of wakes for
nonexistent waiters (which are ok but more costly). This allows the
total observed value to reach all the way up to SEM_VALUE_MAX. If this
happens, waiters will consume all of val[0] first, and the wakes will
all remain pending until val[0] reaches 0. At that point, new waiters
will decrement val[0] to a negative value (indicating a waiter),
attempt a futex wait, fail because there are wakes pending, consume one
of the wakes, and exit. (Note: this useless futex wait can be optimized
out by reordering the do-while loop body in sem_timedwait.)

During this state, there is a race window where val[1] can exceed
VAL1_MAX: if a post happens after a new waiter decrements val[0] but
before it consumes a wake from val[1], a concurrent post will increment
val[0] back to 0 and increment val[1] unconditionally. However, the
magnitude of such overshoot is bounded by the number of tasks, which is
necessarily bounded by INT_MAX/4, which is less than VAL1_MAX, so no
integer overflow can happen here (except in the case of async-killed
waiters).

Does this all sound correct?

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-23 18:24 UTC
To: musl

> The latter saves the result of a_cas to prevent an extra load, but I
> don't think it makes any significant difference, and it might be seen
> as uglier.

I think we should use the result of a_cas here: it's part of the
sem_post "fast path", and doing so is not too difficult. I'm using a
slightly different version below.

> However, neither of those addresses the overflow issue, which I've
> tried to address here:
>
> #define VAL0_MAX ((SEM_VALUE_MAX+1)/2)

Signed integer overflow here -- using a corrected version below.

> Does this all sound correct?

I'm afraid not. We must always do a futex wake when incrementing
val[1]. Otherwise wake loss is possible:

1. Semaphore initialized to VAL0_MAX
2. Thread A enters sem_post, observes saturated val[0]
3. Thread B downs val[0] to 0 by calling sem_wait VAL0_MAX times
4. Thread B calls sem_wait again and enters futex wait
5. Thread A ups val[1]

At this point thread A must futex-wake val[1].

My version:

#define VAL0_MAX (SEM_VALUE_MAX/2+1)
#define VAL1_MAX (SEM_VALUE_MAX/2)

int sem_post(sem_t *sem)
{
	int old, val = sem->__val[0];
	val -= val == VAL0_MAX;
	while (old = val, (val = a_cas(sem->__val, val, val+1)) != old)
		if (val == VAL0_MAX) goto wake;
	if (val < 0) {
wake:;
		int priv = sem->__val[2];
		do if ((val = sem->__val[1]) == VAL1_MAX) {
			errno = EOVERFLOW;
			return -1;
		} while (val != a_cas(sem->__val+1, val, val+1));
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

After sufficiently many waiters have been killed, val[1] can reach
VAL1_MAX without val[0] also reaching VAL0_MAX, in which case sem_post
can report EOVERFLOW prematurely. From previous emails it seems that's
not a big concern.

It is also possible that EOVERFLOW will be reported prematurely in race
windows when a waiter returning from futex wait with EWOULDBLOCK has
not yet decremented val[1] of a recently saturated semaphore. Example:

1. Semaphore initialized to SEM_VALUE_MAX
2. Thread A downs val[0] to 0 by calling sem_wait VAL0_MAX times;
   val[1] remains at VAL1_MAX
3. Thread B calls sem_wait and enters futex wait
4. Thread A calls sem_post, observes val[0] < 0 && val[1] == VAL1_MAX

It's possible to make the window smaller by reordering the futex-wait
loop, but it will remain. At the moment I don't have a good way out.

Thanks.
Alexander
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-23 20:01 UTC
To: musl

I was over-eager in size-optimizing, and at first didn't notice that we
may not report EOVERFLOW after successfully incrementing val[0];
therefore we can reuse only the very end of the futex-wake path:

#define VAL0_MAX (SEM_VALUE_MAX/2+1)
#define VAL1_MAX (SEM_VALUE_MAX/2)

int sem_post(sem_t *sem)
{
	int priv, old, val = sem->__val[0];
	val -= val == VAL0_MAX;
	while (old = val, (val = a_cas(sem->__val, val, val+1)) != old)
		if (val == VAL0_MAX) {
			priv = sem->__val[2];
			do {
				if ((val = sem->__val[1]) >= VAL1_MAX) {
					errno = EOVERFLOW;
					return -1;
				}
			} while (val != a_cas(sem->__val+1, val, val+1));
			goto wake;
		}
	if (val < 0) {
		priv = sem->__val[2];
		a_inc(sem->__val+1);
wake:
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

Now instead of the 'premature EOVERFLOW' problem we have the 'val[1]
overshoot' problem. It can lead to getvalue overflow:

1. Semaphore initialized to SEM_VALUE_MAX
2. Thread A downs val[0] to 0
3. Thread B downs val[0] to -1
4. Thread A calls sem_post: val[0] == 0, val[1] == VAL1_MAX+1
   ... (thread B does not consume the post yet)
5. Thread A ups val[0] to VAL0_MAX
   ... now getvalue returns INT_MIN

Alexander
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-24  2:46 UTC
To: musl

On Thu, Apr 23, 2015 at 11:01:19PM +0300, Alexander Monakov wrote:
> I was over-eager in size-optimizing, and at first didn't notice that
> we may not report EOVERFLOW after successfully incrementing val[0];
> therefore we can reuse only the very end of the futex-wake path:
>
> #define VAL0_MAX (SEM_VALUE_MAX/2+1)
> #define VAL1_MAX (SEM_VALUE_MAX/2)
>
> int sem_post(sem_t *sem)
> {
> 	int priv, old, val = sem->__val[0];
> 	val -= val == VAL0_MAX;
> 	while (old = val, (val = a_cas(sem->__val, val, val+1)) != old)
> 		if (val == VAL0_MAX) {
> 			priv = sem->__val[2];
> 			do {
> 				if ((val = sem->__val[1]) >= VAL1_MAX) {
> 					errno = EOVERFLOW;
> 					return -1;
> 				}
> 			} while (val != a_cas(sem->__val+1, val, val+1));
> 			goto wake;
> 		}
> 	if (val < 0) {
> 		priv = sem->__val[2];
> 		a_inc(sem->__val+1);
> wake:
> 		__wake(sem->__val+1, 1, priv);
> 	}
> 	return 0;
> }
>
> Now instead of the 'premature EOVERFLOW' problem we have the 'val[1]
> overshoot' problem. It can lead to getvalue overflow:
>
> 1. Semaphore initialized to SEM_VALUE_MAX
> 2. Thread A downs val[0] to 0
> 3. Thread B downs val[0] to -1
> 4. Thread A calls sem_post: val[0] == 0, val[1] == VAL1_MAX+1
>    ... (thread B does not consume the post yet)
> 5. Thread A ups val[0] to VAL0_MAX
>    ... now getvalue returns INT_MIN

Perhaps this can be patched up by saturating sem_getvalue's result? In
the case where the overflow happens it's transient, right? I think that
means discounting the overflow would be valid. But I'll need to think
about it more...

With that said, my inclination right now is that we should hold off on
trying to commit the new semaphore for this release cycle. I've been
aiming for this month and just about everything else is in order for
release, but the semaphore rabbit-hole keeps going deeper and I think
we need to work through this properly. I hope that's not too much of a
disappointment.

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-24 10:23 UTC
To: musl

On Thu, 23 Apr 2015, Rich Felker wrote:
> Perhaps this can be patched up by saturating sem_getvalue's result?
> In the case where the overflow happens it's transient, right? I think
> that means discounting the overflow would be valid. But I'll need to
> think about it more...

Hm, can't agree here. This whole line of discussion stems from an
attempt to align timedwait/trywait/getvalue behavior in light of dead
waiters, which are indistinguishable from preempted waiters. If the
"it's transient" claim can be made, it can also be used as a reason not
to modify getvalue to look at val[1] in the first place.

> With that said, my inclination right now is that we should hold off
> on trying to commit the new semaphore for this release cycle. I've
> been aiming for this month and just about everything else is in order
> for release, but the semaphore rabbit-hole keeps going deeper and I
> think we need to work through this properly. I hope that's not too
> much of a disappointment.

Ack; thankfully I don't feel disappointment in this case, as this
discussion has been quite entertaining. When I proposed my modification
I felt it was very intuitive. What I did not grasp back then is that
the definition of a waiter is not solid.

How do you interpret the following?

1. Semaphore initialized to 0. There's only one thread.
2. alarm(1)
3. sem_wait()
   ... (in the SIGALRM handler)
4. sem_post()
5. sem_getvalue()

May getvalue be 0 here? At step 4, can the thread possibly "be a
waiter" on the semaphore?

Thanks.
Alexander
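A literal transcription of this scenario as a sketch (the reply below
points out that steps 4 and 5, performed inside the signal handler, are
formally problematic, so this is for illustration only; the bare
signal() call and the unused val are simplifications):

#include <semaphore.h>
#include <signal.h>
#include <unistd.h>

static sem_t sem;

static void on_alarm(int sig)
{
	int val;
	(void)sig;
	sem_post(&sem);            /* step 4 */
	sem_getvalue(&sem, &val);  /* step 5: may val be 0 here? */
}

int main(void)
{
	sem_init(&sem, 0, 0);      /* step 1: value 0, single thread */
	signal(SIGALRM, on_alarm);
	alarm(1);                  /* step 2 */
	sem_wait(&sem);            /* step 3: interrupted by SIGALRM */
	return 0;
}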
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-24 15:03 UTC
To: musl

On Fri, Apr 24, 2015 at 01:23:27PM +0300, Alexander Monakov wrote:
> Hm, can't agree here. This whole line of discussion stems from an
> attempt to align timedwait/trywait/getvalue behavior in light of dead
> waiters, which are indistinguishable from preempted waiters.

I don't think dead waiters are a solvable problem with this design, but
they're a minor problem until you hit overflow.

> If the "it's transient" claim can be made, it can also be used as a
> reason not to modify getvalue to look at val[1] in the first place.

No, because you can interrupt a waiter with a signal handler, and then
the "transient" state becomes something you can synchronize with and
observe, and thus it is no longer transient. That was the motivation
for needing to count the pending wakes.

> How do you interpret the following?
>
> 1. Semaphore initialized to 0. There's only one thread.
> 2. alarm(1)
> 3. sem_wait()
>    ... (in the SIGALRM handler)
> 4. sem_post()
> 5. sem_getvalue()
>
> May getvalue be 0 here? At step 4, can the thread possibly "be a
> waiter" on the semaphore?

Here steps 4 and 5 are UB (calling AS-unsafe functions from an
async-signal context). But you can achieve the same with another thread
observing entry to the signal handler in a valid way (e.g. via posting
of a second sem from the signal handler).

With that problem solved, I think it's valid at this point to observe a
value of 0 or 1. But if 0 is observed, sem_trywait would have to fail,
and sem_wait or sem_timedwait could return only in the case of an
error. This is why returning 0 does not seem to be practical -- I don't
know a way to let the existing suspended waiter take the wake without
allowing new waiters to steal it (and thus expose inconsistency).

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-24 15:47 UTC
To: musl

One more thought. In the original proposal, val[1] was a "wake count".
By now we have drifted away a bit from that concept, with val[1] being
used not only in post/timedwait, but also in trywait/getvalue, with
trywait explicitly performing post-stealing and getvalue accounting for
that.

So how about we try to stick with the original vision, and say that
val[1] is strictly for waiters leaving the futex wait via the non-error
path? The kernel tells the poster, via the return value of futex wake,
whether the wake has caused any waiter to resume execution. The kernel
already knows if any waiter was there to accept a wake, and lets us
know. Can we somehow use that to implement non-stealing wakeups?

Alexander
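A minimal sketch of the kernel feedback being referred to: FUTEX_WAKE
returns the number of waiters actually woken, so a poster can in
principle tell whether its wake reached anyone. (Illustration only; the
reply below explains why building a protocol on this is impractical.)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* wake at most one waiter; the return value is how many were woken */
static long futex_wake_one(volatile int *addr)
{
	return syscall(SYS_futex, addr, FUTEX_WAKE, 1, 0, 0, 0);
}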
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-24 15:59 UTC
To: musl

On Fri, Apr 24, 2015 at 06:47:45PM +0300, Alexander Monakov wrote:
> So how about we try to stick with the original vision, and say that
> val[1] is strictly for waiters leaving the futex wait via the
> non-error path? The kernel tells the poster, via the return value of
> futex wake, whether the wake has caused any waiter to resume
> execution. The kernel already knows if any waiter was there to accept
> a wake, and lets us know. Can we somehow use that to implement
> non-stealing wakeups?

This is the famous problem discussed on libc-alpha in regard to
documenting the futex syscall. If you could reliably assume that any
zero return from the futex wait syscall resulted from an intentional
futex wake, this kind of protocol would be possible. Unfortunately,
almost every existing use of futex wake sends the wake after the memory
could already be reused for something new. Fixing this is possible with
clever use of FUTEX_WAKE_OP, possibly with a performance hit (but
conversely, improving some scheduling behavior), but you'd need a
contract that everything agrees to use this protocol. Even one piece of
code failing to do so would horribly break code depending on the
assumption that "zero means intentional wake". So I don't think it's
practical.

Also, there's the issue that the target may not actually be in the
syscall when the wake arrives -- it may be suspended just before it, or
in a signal handler. I'm not sure whether those issues are solvable or
not.

Rich