mailing list of musl libc
* sem_getvalue conformance considerations
@ 2014-08-27  2:33 Rich Felker
  2014-08-27  7:05 ` Jens Gustedt
  0 siblings, 1 reply; 19+ messages in thread
From: Rich Felker @ 2014-08-27  2:33 UTC (permalink / raw)
  To: musl

POSIX says:

"The sem_getvalue() function shall update the location referenced by
the sval argument to have the value of the semaphore referenced by sem
without affecting the state of the semaphore. The updated value
represents an actual semaphore value that occurred at some unspecified
time during the call, but it need not be the actual value of the
semaphore when it is returned to the calling process."

musl currently implements this as:

	int val = sem->__val[0];
	*valp = val < 0 ? 0 : val;

However, I don't think this is correct. The "semaphore value" read
from __val[0] is not the formal semaphore value, but an implementation
detail: while there are waiters, after a post there is a window where
__val[0] is positive (to allow a waiter to take the semaphore) despite
the formal value never becoming positive. sem_post is documented as:

"If the semaphore value resulting from this operation is positive,
then no threads were blocked waiting for the semaphore to become
unlocked; the semaphore value is simply incremented."

"If the value of the semaphore resulting from this operation is zero,
then one of the threads blocked waiting for the semaphore shall be
allowed to return successfully from its call to sem_wait()."

So the formal "semaphore value" should always be zero while there are
more waiters than posts, even if those waiters have not yet woken to
take the semaphore.

It seems at least semi-possible to observe incorrect behavior with the
current implementation:

Example 1: Initially two waiters. One call to sem_post followed by
sem_getvalue should read a value of 0, not 1.
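
A minimal test sketch for Example 1 (it relies on sleep() to give the waiters
time to block, so it is only a best-effort demonstration, not a proof):

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t s;

static void *waiter(void *arg)
{
	sem_wait(&s);		/* blocks: initial value is 0 */
	return 0;
}

int main(void)
{
	pthread_t t1, t2;
	int val;

	sem_init(&s, 0, 0);
	pthread_create(&t1, 0, waiter, 0);
	pthread_create(&t2, 0, waiter, 0);
	sleep(1);		/* best effort: let both waiters block */

	sem_post(&s);		/* one post, two waiters */
	sem_getvalue(&s, &val);
	/* Formally this should print 0: the post is absorbed by a waiter.
	 * Reading __val[0] directly can yield 1 in the window before the
	 * woken waiter takes the semaphore. */
	printf("value after 1 post, 2 waiters: %d\n", val);

	sem_post(&s);		/* let both waiters finish */
	pthread_join(t1, 0);
	pthread_join(t2, 0);
	return 0;
}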

In principle we could compensate for this by declaring the value zero
if there are any waiters. But then:

Example 2: Initially one waiter. Two calls to sem_post followed by
sem_getvalue should read a value of 1, not 0.

What if we try to get fancy and subtract waiters from __val[0]?
Unfortunately we can't necessarily read __val[0] and waiters
(__val[1]) atomically together, so it's possible that one is outdated
by the time we read the other, such that the resulting difference is
not the correct formal semaphore value at any time during the
sem_getvalue call.

The one easy cop-out: We could claim the current behavior is
conforming anyway, since there is no way to prove that a thread is
actually blocked on a semaphore. (This is in contrast to cond vars,
where having unlocked the mutex proves they've formally become
waiters.) The state "blocked on semaphore" is fundamentally
indistinguishable from the state of "about to enter sem_wait". Thus,
we could say that, formally, there are never any waiters; sem_post
always makes the semaphore value positive, and a would-be waiter
formally arrives and decrements the semaphore value at some time after
the post makes the value positive.

Of course this is a really pathetic solution, but is there any way we
can provide real, consistent, meaningful results from sem_getvalue? Or
is the whole concept of sem_getvalue just so messed up that this is
the best possible way to handle it?

Rich



* Re: sem_getvalue conformance considerations
  2014-08-27  2:33 sem_getvalue conformance considerations Rich Felker
@ 2014-08-27  7:05 ` Jens Gustedt
  2014-08-27  7:43   ` Rich Felker
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Gustedt @ 2014-08-27  7:05 UTC (permalink / raw)
  To: musl


On Tuesday, 26.08.2014 at 22:33 -0400, Rich Felker wrote:
> What if we try to get fancy and subtract waiters from __val[0]?
> Unfortunately we can't necessarily read __val[0] and waiters
> (__val[1]) atomically together,

Doing the correct thing is always fancy :)
Sure, this depends on the architecture, but where it is possible we
should just do that; this is the semantically correct value.

On i386 and its successors a 64-bit atomic read should always be
possible, and if I remember correctly the ARM arch that I touched once
had such a thing, too.

For once, the gcc builtin extensions have easy-to-use feature test
macros. If you changed sem_t to a union with an alternative __vall
array of 64-bit elements, something along the lines of:

#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8) && defined(__ATOMIC_ACQUIRE)
uint64_t combined = __atomic_load_n(&s->__vall[0], __ATOMIC_ACQUIRE);
#else
... fallback ...
#endif
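
and then the formal value could be derived from the combined word; as a sketch
only (which half holds __val[0] and which holds the waiter count depends on
the union layout and on endianness, so the shifts below are assumptions):

/* Assuming a little-endian layout with __val[0] in the low half and the
 * waiter count (__val[1]) in the high half; sval is sem_getvalue's
 * out-parameter. */
int val     = (int)(combined & 0xffffffff);
int waiters = (int)(combined >> 32);
*sval = val - waiters < 0 ? 0 : val - waiters;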


> so it's possible that one is outdated
> by the time we read the other, such that the resulting difference is
> not the correct formal semaphore value at any time during the
> sem_getvalue call.

On archs where an atomic read of these two values together is not
possible, this is the best approximation you can get. On those archs
there is simply no precise moment in time for this feature, because
sequence points are not synchronized between the different threads.
Nobody can ask you to return an exact value for a concept that is not
well defined.

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::





* Re: sem_getvalue conformance considerations
  2014-08-27  7:05 ` Jens Gustedt
@ 2014-08-27  7:43   ` Rich Felker
  2014-08-27 10:43     ` Alexander Monakov
  0 siblings, 1 reply; 19+ messages in thread
From: Rich Felker @ 2014-08-27  7:43 UTC (permalink / raw)
  To: musl

On Wed, Aug 27, 2014 at 09:05:41AM +0200, Jens Gustedt wrote:
> On Tuesday, 26.08.2014 at 22:33 -0400, Rich Felker wrote:
> > What if we try to get fancy and subtract waiters from __val[0]?
> > Unfortunately we can't necessarily read __val[0] and waiters
> > (__val[1]) atomically together,
> 
> Doing the correct thing is always fancy :)
> Sure, this depends on the architecture, but where it is possible we
> should just do that; this is the semantically correct value.
> 
> On i386 and its successors a 64-bit atomic read should always be
> possible, and if I remember correctly the ARM arch that I touched once
> had such a thing, too.

Yes, I'm aware that a 64-bit atomic read may exist on some archs (note:
this does not include i386; an 8-byte atomic read was not possible until
at least the i586 generation, and our "i386" baseline is really "i486",
the first model with cmpxchg, which is mandatory for working pthread
primitives). But since one of musl's big general principles is providing
uniform behavior across archs, I'd rather not implement something where
the behavior differs like that based on an optional feature.

> > so it's possible that one is outdated
> > by the time we read the other, such that the resulting difference is
> > not the correct formal semaphore value at any time during the
> > sem_getvalue call.
> 
> On archs where an atomic read of these two values together is not
> possible, this is the best approximation you can get. On those archs
> there is simply no precise moment in time for this feature, because
> sequence points are not synchronized between the different threads.
> Nobody can ask you to return an exact value for a concept that is not
> well defined.

I'm not entirely convinced there's not a solution. There may be
sufficient information to determine whether or not there are waiters
without a 64-bit atomic read.

Let V be the implementation semaphore value (__val[0]) and W the
waiter count (__val[1]).

After observing a nonzero V, W cannot increase without V first
reaching zero. So if we read V first, then W, the value of W read will
be less than or equal to the value of W at the time V was read. This
seems to be sufficient for the semantics I thought were right.
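
Concretely, such an ordering-based sem_getvalue might look like this (a sketch
only, to make the reasoning concrete; the ordering of the two plain reads
would still have to be enforced):

	int v = sem->__val[0];	/* implementation value, read first */
	int w = sem->__val[1];	/* waiter count, read second */
	int formal = v - w;
	*valp = formal < 0 ? 0 : formal;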

However, I'm doubtful of them too. :-)

Even if we know the number of waiters exactly at the time the value is
read, that's not sufficient to assign a formal value to the semaphore,
because these waiters could race to return EINTR or ETIMEDOUT, or act
upon cancellation, before they consume the post. In this case,
sem_getvalue would have reported an observably incorrect value:

Example: Initially 2 waiters, posting thread posts 3 times, calls
sem_getvalue and sees a value of 1, calls pthread_cancel on both
waiters, then calls sem_getvalue again and sees a value of 3, despite
no additional posts having happened.

The only easy way around this problem is the current behavior: having
sem_getvalue treat waiters as not-having-arrived-yet.

The other solution I see, which would allow sem_getvalue to report
waiters, would be to ensure that waiters always do a final sem_trywait
after observing an error, and ignore the error if the trywait
succeeds. However, doing this with cancellation is not easy; it would
require a longjmp, which would mean adding setjmp overhead to each
sem_wait. Of course, if __timedwait could return ECANCELED rather than
invoking cancellation handlers, that would make things a lot nicer, and
it's something I've wanted to be able to do for a long time, so perhaps
we can revisit this issue once that's implemented... :)

Rich



* Re: sem_getvalue conformance considerations
  2014-08-27  7:43   ` Rich Felker
@ 2014-08-27 10:43     ` Alexander Monakov
  2014-08-27 13:32       ` Alexander Monakov
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Monakov @ 2014-08-27 10:43 UTC (permalink / raw)
  To: musl

Why wouldn't the following design work?

val[0] is sem value if >= 0, negated waiters count otherwise
val[1] is wakeup count, incremented before futex wake, tested and decremented
by waiters returning from futex wait

Alexander



* Re: sem_getvalue conformance considerations
  2014-08-27 10:43     ` Alexander Monakov
@ 2014-08-27 13:32       ` Alexander Monakov
  2014-08-27 19:06         ` Alexander Monakov
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Monakov @ 2014-08-27 13:32 UTC (permalink / raw)
  To: musl

On Wed, 27 Aug 2014, Alexander Monakov wrote:

> Why wouldn't the following design work?
> 
> val[0] is sem value if >= 0, negated waiters count otherwise
> val[1] is wakeup count, incremented before futex wake, tested and decremented
> by waiters returning from futex wait

Unless I'm missing something, the above can simplify the sem ops
(sorry, I'm eliding some details in the following pseudocode):

trywait:

  val = sem->val[0]
  while (val > 0) {
    oldval = val;
    if ((val = a_cas(sem->val, val, val-1)) == oldval)
      return 0;
  }
  errno = EAGAIN;
  return -1;


wait:

  if (atomic_fetch_and_decrement(sem->val) > 0)
    return 0;
  while (!(futex_wait(sem->val+1, 0) && errno == EINTR)) {
    wakecnt = sem->val[1];
    while (wakecnt > 0) {
      oldwcnt = wakecnt;
      if ((wakecnt = a_cas(sem->val+1, wakecnt, wakecnt-1)) == oldwcnt)
	return 0;
    }
  }
  return -1;

post:

  val = sem->val[0];
  do {
    if (val == SEM_VALUE_MAX) {
      errno = EOVERFLOW;
      return -1;
    }
    oldval = val;
  } while ((val = a_cas(sem->val, val, val+1)) != oldval);
  if (val < 0) {
    a_inc(sem->val+1);
    futex_wake(sem->val+1, 1);
  }
  return 0;

Alexander



* Re: sem_getvalue conformance considerations
  2014-08-27 13:32       ` Alexander Monakov
@ 2014-08-27 19:06         ` Alexander Monakov
  2014-08-27 21:06           ` Alexander Monakov
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Monakov @ 2014-08-27 19:06 UTC (permalink / raw)
  To: musl

On IRC Rich noted that I too conveniently elided cancellation, so here's how
I think the cancellation handler should look:

  val = sem->val[0];
  while (val < 0) {
    /* We believe we are a waiter that no sem_post has "seen".  */
    oldval = val;
    if ((val = a_cas(sem->val, val, val+1)) == oldval)
      return;
  }
  /* We are a waiter, and a non-negative val[0] indicates that one sem_post
   * noticed us.  Consume the corresponding wake count increment.  */
  a_dec(sem->val+1);

Alexander



* Re: sem_getvalue conformance considerations
  2014-08-27 19:06         ` Alexander Monakov
@ 2014-08-27 21:06           ` Alexander Monakov
  2014-08-28 20:47             ` Alexander Monakov
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Monakov @ 2014-08-27 21:06 UTC (permalink / raw)
  To: musl



On Wed, 27 Aug 2014, Alexander Monakov wrote:

> On IRC Rich noted that I too conveniently elided cancellation, so here's how
> I think the cancellation handler should look:

s/cancellation handler/"unwait" procedure/: returning with EINTR is botched in
my earlier mail and should employ the same procedure to undo the state change.

>   val = sem->val[0];
>   while (val < 0) {
>     /* We believe we are a waiter that no sem_post has "seen".  */
>     oldval = val;
>     if ((val = a_cas(sem->val, val, val+1)) == oldval)
>       return;
>   }
>   /* We are a waiter, and a non-negative val[0] indicates that one sem_post
>    * noticed us.  Consume the corresponding wake count increment.  */
>   a_dec(sem->val+1);

A plain atomic decrement on the wake count like the above is not correct.  However:

1. I think it's correct as long as no new waiters appear while we are
performing the unwait.  If new waiters are matched by new posts it's ok; we
only care if the number of new waiters is higher than the number of new
posts, since then we risk causing a missed wake for one of them, or setting
the wake count to -1 if they all consume wakes before we do the increment.

2. Thus it needs a fancy atomic op: decrement val[1] if val[0] is still equal
to the "val" we retrieved earlier.  Otherwise, retry from the beginning.

Furthermore, doing the above in a truly atomic fashion might not be required!
Isn't it okay to decrement the wake count, and if we observe new waiters,
increment it back and cause a futex wake?  Thus:

retry:
  val = sem->val[0];
  while (val < 0) {
    /* We believe we are a waiter that no sem_post has "seen".  */
    oldval = val;
    if ((val = a_cas(sem->val, val, val+1)) == oldval)
      return;
  }
  /* We are a waiter, and a non-negative val[0] indicates that one sem_post
   * noticed us.  Consume the corresponding wake count increment.  */
  a_dec(sem->val+1);
  if (val > sem->val[0]) {
    /* New waiters appeared.  Avoid causing a missed wake, and restart.  We
     * don't enter here if more new posts arrive than new waiters. */
    a_inc(sem->val+1);
    futex_wake(sem->val+1, 1);
    goto retry;
  }


Alexander



* Re: sem_getvalue conformance considerations
  2014-08-27 21:06           ` Alexander Monakov
@ 2014-08-28 20:47             ` Alexander Monakov
  2014-08-29 22:51               ` Alexander Monakov
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Monakov @ 2014-08-28 20:47 UTC (permalink / raw)
  To: musl

Thanks (again) to Rich for showing on IRC that this still was not correct,
with at least two issues:

1. With one waiter, unwait causes an in-progress post to be lost (perhaps
workaroundable by consuming the post from unwait).

2. It is invalid to update val[1] non-atomically with val[0] in sem_post, as
the semaphore may be destroyed in between by a waiter that noticed the post
via unwait.

Alexander




* Re: sem_getvalue conformance considerations
  2014-08-28 20:47             ` Alexander Monakov
@ 2014-08-29 22:51               ` Alexander Monakov
  2014-08-30  5:12                 ` Rich Felker
  2014-09-01 17:50                 ` Alexander Monakov
  0 siblings, 2 replies; 19+ messages in thread
From: Alexander Monakov @ 2014-08-29 22:51 UTC (permalink / raw)
  To: musl

Hi,

To begin, I have a few questions regarding cancellation (please excuse
deviating from topic for a while):

1.  It looks non-conforming not to act on pending cancellation in uncontended
sem_wait (or any cancellation point that does not invoke a syscall, for that
matter; nsz pointed out that many instances should already be handled, but
pthread_join for instance seems to miss explicit timedwait).  A bug?

2.  I had trouble understanding cancel_handler.  What is re-raising
SIGCANCEL at the end for?

3.  My understanding is that for deferred cancellation, ideally you'd want
SIGCANCEL to be delivered only as interruptions to syscalls --- that would
eliminate some scaffolding around syscall_cp including IP range check in
cancel_handler, correct?

4.  If the above could be provided, it would be possible to guarantee that
pthread cancellation cleanup is executing in a signal handler context only for
asynchronous cancellation.  I'd like that for my sem unwait.  Too bad?

======

Trying to approach a formal description of my proposed sem ops
implementation, here's what I can do now:

1.  Sem ops begin execution by performing an atomic read or read-modify-write
on val[0].  The sequence of atomic accesses establishes an order between
threads doing the ops (with respect to a given semaphore).

2.  Returning from futex wait with timeout or interrupt also accesses
val[0] and is included in that order

3.  The following invariant is maintained: val[0] is equal to the semaphore
value minus the number of current waiters (as given by the linear sequence of
events established above)

4. sem_post begins by performing an atomic increment on val[0]; if the
previous value was negative, there was at least one waiter, which is now
discounted from val[0]; we must arrange that one waiter eventually wakes up
and proceeds

4.1. Note: when sem_post increments a negative val[0], we need that one waiter
to return successfully from sem_wait; if it were to exit exceptionally (eintr,
timeout, cancel), it would need to increment val[0] yet again, but that would
allow other threads to observe sem_getvalue increasing despite no new posts,
as Rich noted in his email.

4.2. We arrange that exactly one waiter proceeds by atomically incrementing
val[1] and futex-waking it; since we do that non-atomically with the val[0]
update, we must guarantee that no waiter can proceed to destroy the semaphore
between the val[0] and val[1] updates in sem_post.

5.  sem_wait begins by atomically decrementing val[0]; if the previous value
was positive, we have simply successfully decremented the semaphore;
otherwise, we have become a waiter and must suspend

5.1.  sem_wait futex-waits on val[1]; on normal wakeups, it tries to
atomically-decrement-if-positive val[1] and leaves if successful

FIXME: we are reading from val[1] even though val[0] was increased long ago
and the semaphore may now be positive; is there a justification for that?
Why can't other threads assume the semaphore is free to be destroyed?

5.2.  On eintr/timeout/cancel, a waiter must decide whether it may simply
discount itself as a waiter, or must deal with a racing post; in case there is
a racing post, the waiter must either consume it and return normally, or
discount itself as a waiter nevertheless and ensure that, rather than causing
a wake, the post increases the semaphore value beyond zero.

It seems there are two ways of doing it: we either prefer to consume a
pending post if available (a), or to propagate eintr/cancel (b).

5.2.a. Begin by incrementing-if-negative val[0].  If successful, we have
discounted ourselves as a waiter, so we don't need to care about concurrent
posts (there are fewer pending posts than waiters) and may propagate
eintr/timeout/cancel.  If unsuccessful, there is a pending post: consume it
and cause a normal return from sem_wait (as if the interruption did not
happen).  We can't easily cause a normal return from sem_wait if acting upon
cancellation, though.
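
As a sketch in the style of the earlier pseudocode:

  /* 5.2.a: on eintr/timeout, either discount ourselves as a waiter
   * (increment negative val[0]) and propagate the error, or, if a post
   * has already counted us, consume it and report success. */
  for (;;) {
    int val = sem->val[0];
    if (val < 0 && val == a_cas(sem->val, val, val+1))
      return -1;	/* discounted; propagate eintr/timeout */
    int wak = sem->val[1];
    if (wak > 0 && wak == a_cas(sem->val+1, wak, wak-1))
      return 0;		/* consumed the racing post */
  }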

5.2.b.  Begin by incrementing val[0] unconditionally.  If it was negative, we
have discounted ourselves as a waiter; otherwise, there is a pending post, and
our increment makes it look as if the post were increasing an uncontended
semaphore (out of sequence, so other threads may observe an oddly increasing
value); however there is then a pending increase of val[1] that we must undo:
wait for val[1] to become positive by futex-waiting, then decrement it.  This
looks hairier and causes non-conforming behavior, but is easier to execute
from a signal handler context for cancellation.

Does that help?  Thoughts?

Alexander



* Re: sem_getvalue conformance considerations
  2014-08-29 22:51               ` Alexander Monakov
@ 2014-08-30  5:12                 ` Rich Felker
  2014-09-01 17:50                 ` Alexander Monakov
  1 sibling, 0 replies; 19+ messages in thread
From: Rich Felker @ 2014-08-30  5:12 UTC (permalink / raw)
  To: musl

On Sat, Aug 30, 2014 at 02:51:35AM +0400, Alexander Monakov wrote:
> Hi,
> 
> To begin, I have a few questions regarding cancellation (please excuse
> deviating from topic for a while):
> 
> 1.  It looks non-conforming not to act on pending cancellation in uncontended
> sem_wait (or any cancellation point that does not invoke a syscall, for that
> matter; nsz pointed out that many instances should already be handled, but
> pthread_join for instance seems to miss explicit timedwait).  A bug?

Probably so.

> 2.  I had trouble understanding cancel_handler.  What is re-raising
> SIGCANCEL at the end for?

It's for the case where the signal arrives during a signal handler
that does not itself contain cancellation points, but which has
interrupted a cancellation point. In this case, we want cancellation
to re-trigger after the signal handler returns. This is achieved by
re-raising the signal, but blocking the signal in the ucontext that
cancel_handler is about to return to (if it weren't blocked there,
re-raising it would make an infinite loop). When the next-level-up
signal handler returns, it unblocks the signal again and it gets
delivered, and cancel_handler runs again and can evaluate whether it's
at a cancellation point.
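
An illustrative sketch of that pattern (not the actual musl code; the
SIGCANCEL value and the surrounding checks are only stand-ins):

#define _GNU_SOURCE
#include <signal.h>

#define SIGCANCEL 33	/* stand-in; musl defines this internally */

static void cancel_handler(int sig, siginfo_t *si, void *ctx)
{
	ucontext_t *uc = ctx;

	/* ... we were delivered inside a signal handler that itself
	 * interrupted a cancellation point ... */

	/* Block SIGCANCEL in the context we are about to return to, so that
	 * re-raising it does not loop; it stays pending until the interrupted
	 * handler returns and its original, unblocking mask is restored. */
	sigaddset(&uc->uc_sigmask, SIGCANCEL);
	raise(SIGCANCEL);
}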

> 3.  My understanding is that for deferred cancellation, ideally you'd want
> SIGCANCEL to be delivered only as interruptions to syscalls --- that would
> eliminate some scaffolding around syscall_cp including IP range check in
> cancel_handler, correct?

No, that's impossible to eliminate. Any solution without it has a race
condition where cancellation that arrives after the check-before-syscall
but before actually entering the syscall gets lost. And that's even
assuming an interrupting signal could be used for cancellation, which is
false: an interrupting signal would also interrupt blocking syscalls
which are not cancellation points, thereby messing up program semantics.

> 4.  If the above could be provided, it would be possible to guarantee that
> pthread cancellation cleanup is executing in a signal handler context only for
> asynchronous cancellation.  I'd like that for my sem unwait.  Too bad?

Not possible. But I've already described the cancellation mode I want
to add whose observable behavior is like this.

I'll address the actual semaphore ideas later.

Rich



* Re: sem_getvalue conformance considerations
  2014-08-29 22:51               ` Alexander Monakov
  2014-08-30  5:12                 ` Rich Felker
@ 2014-09-01 17:50                 ` Alexander Monakov
  2015-02-27 23:21                   ` semaphore redesign Alexander Monakov
  1 sibling, 1 reply; 19+ messages in thread
From: Alexander Monakov @ 2014-09-01 17:50 UTC (permalink / raw)
  To: musl

Hi,

If there's interest, my basic model file for semaphores in Promela/Spin is
pasted below.  Perhaps if I had started doing this earlier it would have
helped to avoid some mistakes.

// spin -a sem.pml && gcc -O2 -DSAFETY pan.c && ./a.out

typedef sem_t {
  int value, waiters;  // Behavior
  int val0, val1;      // Implementation
};

sem_t sem;

#define sem_invariants (                  \
  (sem.value >= 0 && sem.waiters >= 0) && \
  (sem.value == 0 || sem.waiters == 0) && \
  sem.val0 == sem.value - sem.waiters  && \
  sem.val1 >= 0)

bool done;

active proctype monitor()
{
  do
  :: d_step { !sem_invariants -> assert(0); }
  :: done -> break;
  od
}

inline sem_trywait(retval)
{
  // d_step sequences are not preemptible
  if 
  :: d_step { sem.val0 > 0 -> sem.val0--; sem.value--; retval = 0; }
  :: else   { retval = -1; }
  fi
}

inline sem_post()
{
  int v;
  d_step {
    v = sem.val0;
    sem.val0++;
    if
    :: v >= 0 -> sem.value++;
    :: v <  0 -> sem.waiters--;
    fi
  }
  if
  :: v < 0 -> sem.val1++;
  :: else
  fi
}

inline sem_wait(interruptible, retval)
{
  int v;
  d_step {
    retval = 0;
    v = sem.val0;
    sem.val0--;
    if
    :: v >  0 -> sem.value--;
    :: v <= 0 -> sem.waiters++;
    fi
  }
  if
  :: v <= 0  ->
    if // non-deterministic if interruptible && sem.val1 > 0
    :: d_step {sem.val1 > 0 -> sem.val1--;}
    :: interruptible -> 
      d_step {
	v = sem.val0;
	if
	:: v < 0 -> sem.val0++; sem.waiters--; retval = -1;
	:: else
	fi
      }
      if
      :: v >= 0 -> d_step {sem.val1 > 0; sem.val1--;}
      :: else
      fi
    fi
  :: else
  fi
}

int n_posts, n_waits, n_waitfails;

proctype waiter(bool interruptible)
{
  int retval;
  n_waits++;
  sem_wait(interruptible, retval);
  n_waitfails = n_waitfails - retval;
}

proctype poster()
{
  n_posts++;
  sem_post();
}

#define NPROCMAX 4

init
{
  int n_procs = NPROCMAX;
  do // start a non-deterministic amount of posters
  :: n_procs > 0 -> run poster(); n_procs--;
  :: 1 -> break;
  od;
  do // ditto for waiters
  :: n_procs > 0 -> run waiter(false); n_procs--;
  :: 1 -> break;
  od;
  do
  :: n_procs > 0 -> run waiter(true); n_procs--;
  :: 1 -> break;
  od;
  timeout; // wait until quiescent state
  assert(sem.val1 == 0 && sem.val0 == n_posts + n_waitfails - n_waits);
  n_procs = sem.waiters;
  do
  :: n_procs > 0 -> run poster(); n_procs--;
  :: else -> break;
  od;
  timeout; // wait; there should be no processes except "monitor"
  assert(sem.val1 == 0 && sem.val0 == n_posts + n_waitfails - n_waits);
  assert(sem.waiters == 0);
  done = true;
}




* semaphore redesign
  2014-09-01 17:50                 ` Alexander Monakov
@ 2015-02-27 23:21                   ` Alexander Monakov
  2015-02-28 15:42                     ` Rich Felker
  2015-03-01 17:30                     ` Szabolcs Nagy
  0 siblings, 2 replies; 19+ messages in thread
From: Alexander Monakov @ 2015-02-27 23:21 UTC (permalink / raw)
  To: musl

Hello,

As the new cancellation has landed (except that __timedwait fails to propagate
ECANCELED in addition to ETIMEDOUT and EINTR), I'm happy to post the semaphore
redesign.  Rich and I discussed this implementation on IRC once, and presently
I'm not aware of any issues (finally!), but testing and performance comparison
still need to be done.

This implementation aims to solve the problem with sem_getvalue that started
this thread, while also making fast (uncontended) paths leaner.  There were
some explanations scattered in the old thread; I'll try to provide a summary.

Conceptually, a semaphore has a non-negative value and, when the value is
zero, a non-negative waiter count.  The new implementation stores the value
minus the number of waiters in val[0], and in val[1] it stores the "wake
count": the number of waiters that have been discounted from val[0] in
sem_post; val[1] is futex-woken in sem_post, and futex-waited on in sem_wait.

A thread that entered sem_wait and became a waiter by decrementing a
non-positive val[0] may leave either by consuming a post (decrementing a
positive val[1]) or, if an error condition such as timeout or cancellation
has been propagated from the futex wait, by discounting itself as a waiter
(incrementing a negative val[0]).  Conversely, if after encountering an error
a waiter observes a non-negative val[0], it means that another thread already
discounted it from the waiters when doing a sem_post, so the waiter proceeds
to consume the post and suppress the error.

Since any leaving waiter must either decrement a positive val[1] or increment
a negative val[0], accessing val[1] non-atomically with val[0] in sem_post
does not lead to potential use of a destroyed semaphore.

At Rich's request, I'm posting this as plain C source code rather than a diff.
The implementation of sem_getvalue stays the same.

Alexander

#include <semaphore.h>
#include "pthread_impl.h"

int sem_post(sem_t *sem)
{
	int val;
	do {
		val = sem->__val[0];
		if (val == SEM_VALUE_MAX) {
			errno = EOVERFLOW;
			return -1;
		}
	} while (val != a_cas(sem->__val, val, val+1));
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

int sem_trywait(sem_t *sem)
{
	int val;
	do {
		val = sem->__val[0];
		if (val <= 0) {
			errno = EAGAIN;
			return -1;
		}
	} while (val != a_cas(sem->__val, val, val-1));
	return 0;
}

static void dummy(void *arg)
{
}

int sem_timedwait(sem_t *restrict sem, const struct timespec *restrict at)
{
	pthread_testcancel();

	// Do not spin if already contended (val0 is negative)
	for (int spins = 100; spins && sem->__val[0] == 0; spins--)
		a_spin();

	if (a_fetch_add(sem->__val, -1) > 0)
		return 0;

	int priv = sem->__val[2], e = 0, ee, cs;
	pthread_setcancelstate(PTHREAD_CANCEL_MASKED, &cs);

	do {
		ee = __timedwait(sem->__val+1, 0, CLOCK_REALTIME, at, dummy, 0, priv);
		int wak = sem->__val[1];
		if (wak > 0 && wak == a_cas(sem->__val+1, wak, wak-1))
			goto done;
	} while (!ee || ee == EINTR);

	// Upon returning from wait with an error, either discount ourselves as a waiter
	// by incrementing negative val0, and propagate error, or consume a racing post
	// if val0 is non-negative, and suppress error.
	for (;;) {  
		int val = sem->__val[0];
		if (val < 0 && val == a_cas(sem->__val, val, val+1))
			break;
		int wak = sem->__val[1];
		if (wak > 0 && wak == a_cas(sem->__val+1, wak, wak-1))
			goto done;
		a_spin();
	}
	e = ee;
done:
	pthread_setcancelstate(cs, 0);

	if (!e) return 0;

	if (e == ECANCELED) {
		pthread_testcancel();
		pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, 0);
	}

	errno = e;
	return -1;
}
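
For reference, the sem_getvalue that stays the same is the code quoted at the
start of the thread; under the new representation (val[0] holds the value
minus the number of waiters) its clamping now yields the formal value: with
waiters present the formal value is zero and val[0] is negative, otherwise
val[0] is the value itself.

int sem_getvalue(sem_t *restrict sem, int *restrict valp)
{
	int val = sem->__val[0];
	*valp = val < 0 ? 0 : val;
	return 0;
}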



* Re: semaphore redesign
  2015-02-27 23:21                   ` semaphore redesign Alexander Monakov
@ 2015-02-28 15:42                     ` Rich Felker
  2015-03-01 18:54                       ` Alexander Monakov
  2015-03-01 17:30                     ` Szabolcs Nagy
  1 sibling, 1 reply; 19+ messages in thread
From: Rich Felker @ 2015-02-28 15:42 UTC (permalink / raw)
  To: musl


On Sat, Feb 28, 2015 at 02:21:22AM +0300, Alexander Monakov wrote:
> Hello,
> 
> As the new cancellation has landed (except that __timedwait fails to propagate
> ECANCELED in addition to ETIMEDOUT and EINTR), I'm happy to post the semaphore
> redesign.  Rich and I discussed this implementation on IRC once, and presently
> I'm not aware of any issues (finally!), but testing and performance comparison
> still need to be done.

Thanks! I'm attaching a performance testing program I used in the
past. Its time measurement is not very precise, but it should show
large-scale differences. It measures the number of messages that can
be sent back and forth between two threads in 2 seconds, cancelling
the two threads after sleep(2).

Some other things we should test for performance:

- Do something like the attached but with multiple threads contending
  on semaphores rather than uncontended message passing.

- Time/cycles per call for sem_post and sem_trywait and uncontended
  sem_wait -- this looks like it should be an obvious win though.

- Perhaps something to hammer posts and trywait -- it's not clear that
  this would model any real-world behavior, but it could show
  something interesting anyway.

I doubt we'll see any measurable differences in usage cases where the
futex syscall is expected; it should dominate there.
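
A rough sketch of the first variation (contended; the thread count and the
initial semaphore value below are arbitrary picks, not anything I've
measured):

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>

#define NTHREADS 8

static sem_t sem;
static long count[NTHREADS];

static void *func(void *arg)
{
	long my_id = (long) arg;

	for (;;) {
		sem_wait(&sem);		/* contend for one of the few slots */
		count[my_id]++;
		sem_post(&sem);
	}

	return NULL;
}

int main(void)
{
	void *ret;
	pthread_t t[NTHREADS];
	long total = 0;

	sem_init(&sem, 0, 2);		/* 2 slots shared by 8 threads */

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, func, (void *) i);

	sleep(2);

	for (int i = 0; i < NTHREADS; i++)
		pthread_cancel(t[i]);
	for (int i = 0; i < NTHREADS; i++) {
		pthread_join(t[i], &ret);
		total += count[i];
	}

	printf("count=%ld\n", total);

	return 0;
}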

Rich

[-- Attachment #2: sem_bench.c --]
[-- Type: text/plain, Size: 647 bytes --]

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t sem[2];
static long count[2];

static void *func(void *arg)
{
	long my_id = (long) arg;

	for (;;) {
		sem_wait(&sem[my_id]);
		count[my_id]++;
		sem_post(&sem[1-my_id]);
	}

	return NULL;
}

int main(void)
{
	void *ret;
	pthread_t t1, t2;

	sem_init(&sem[0], 0, 1);
	sem_init(&sem[1], 0, 0);

	pthread_create(&t1, NULL, func, (void*) 0);
	pthread_create(&t2, NULL, func, (void*) 1);

	sleep(2);

	pthread_cancel(t1);
	pthread_cancel(t2);

	pthread_join(t1, &ret);
	pthread_join(t2, &ret);

	printf("count=%ld\n", count[0] + count[1]);

	return 0;
}


* Re: semaphore redesign
  2015-02-27 23:21                   ` semaphore redesign Alexander Monakov
  2015-02-28 15:42                     ` Rich Felker
@ 2015-03-01 17:30                     ` Szabolcs Nagy
  2015-03-01 17:50                       ` Szabolcs Nagy
  2015-03-01 18:24                       ` Alexander Monakov
  1 sibling, 2 replies; 19+ messages in thread
From: Szabolcs Nagy @ 2015-03-01 17:30 UTC (permalink / raw)
  To: musl

* Alexander Monakov <amonakov@ispras.ru> [2015-02-28 02:21:22 +0300]:
> int sem_post(sem_t *sem)
> {
> 	int val;
> 	do {
> 		val = sem->__val[0];
> 		if (val == SEM_VALUE_MAX) {
> 			errno = EOVERFLOW;
> 			return -1;

as discussed on irc early return here without a barrier is not ok
(it is a hard to observe corner case, i add the comment here so
it does not get forgotten)

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11

> 		}
> 	} while (val != a_cas(sem->__val, val, val+1));
> 	if (val < 0) {
> 		int priv = sem->__val[2];
> 		a_inc(sem->__val+1);
> 		__wake(sem->__val+1, 1, priv);
> 	}
> 	return 0;
> }
> 
> int sem_trywait(sem_t *sem)
> {
> 	int val;
> 	do {
> 		val = sem->__val[0];
> 		if (val <= 0) {
> 			errno = EAGAIN;
> 			return -1;

likewise

> 		}
> 	} while (val != a_cas(sem->__val, val, val-1));
> 	return 0;
> }
> 



* Re: semaphore redesign
  2015-03-01 17:30                     ` Szabolcs Nagy
@ 2015-03-01 17:50                       ` Szabolcs Nagy
  2015-03-02 22:40                         ` Alexander Monakov
  2015-03-01 18:24                       ` Alexander Monakov
  1 sibling, 1 reply; 19+ messages in thread
From: Szabolcs Nagy @ 2015-03-01 17:50 UTC (permalink / raw)
  To: musl

* Szabolcs Nagy <nsz@port70.net> [2015-03-01 18:30:49 +0100]:
> * Alexander Monakov <amonakov@ispras.ru> [2015-02-28 02:21:22 +0300]:
> > int sem_post(sem_t *sem)
> > {
> > 	int val;
> > 	do {
> > 		val = sem->__val[0];
> > 		if (val == SEM_VALUE_MAX) {
> > 			errno = EOVERFLOW;
> > 			return -1;
> 
> as discussed on irc early return here without a barrier is not ok
> (it is a hard to observe corner case, i add the comment here so
> it does not get forgotten)
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11
> 

sorry

the code is ok (applications cannot rely on the barrier in case of
failure); it can lead to surprising results if the application
uses relaxed atomics, but it's not a conformance issue

> > 		}
> > 	} while (val != a_cas(sem->__val, val, val+1));
> > 	if (val < 0) {
> > 		int priv = sem->__val[2];
> > 		a_inc(sem->__val+1);
> > 		__wake(sem->__val+1, 1, priv);
> > 	}
> > 	return 0;
> > }
> > 
> > int sem_trywait(sem_t *sem)
> > {
> > 	int val;
> > 	do {
> > 		val = sem->__val[0];
> > 		if (val <= 0) {
> > 			errno = EAGAIN;
> > 			return -1;
> 
> likewise
> 
> > 		}
> > 	} while (val != a_cas(sem->__val, val, val-1));
> > 	return 0;
> > }
> > 



* Re: semaphore redesign
  2015-03-01 17:30                     ` Szabolcs Nagy
  2015-03-01 17:50                       ` Szabolcs Nagy
@ 2015-03-01 18:24                       ` Alexander Monakov
  1 sibling, 0 replies; 19+ messages in thread
From: Alexander Monakov @ 2015-03-01 18:24 UTC (permalink / raw)
  To: musl



On Sun, 1 Mar 2015, Szabolcs Nagy wrote:

> * Alexander Monakov <amonakov@ispras.ru> [2015-02-28 02:21:22 +0300]:
> > int sem_post(sem_t *sem)
> > {
> > 	int val;
> > 	do {
> > 		val = sem->__val[0];
> > 		if (val == SEM_VALUE_MAX) {
> > 			errno = EOVERFLOW;
> > 			return -1;
> 
> as discussed on irc early return here without a barrier is not ok
> (it is a hard to observe corner case, i add the comment here so
> it does not get forgotten)

We further discussed that to fix it, one can either recheck the value after a
barrier in the error path and restart from the beginning if it changed, or
always proceed to the CAS (with the value changed only if not leading to an
error) and handle errors after the cas-retry loop.  The following code
implements the latter approach.  I strongly prefer it for sem_trywait, where I
think EAGAIN is relatively common.  For sem_post it's not so clear-cut for me,
as EOVERFLOW should be extremely rare, but it still helps to get good code
layout from the compiler (otherwise GCC lays out the error-return path inside
the cas loop).

int sem_trywait(sem_t *sem)
{
	int val;
	do val = sem->__val[0];
	while (val != a_cas(sem->__val, val, val-!!(val>0)));
	if (val > 0) return 0;
	errno = EAGAIN;
	return -1;
}

int sem_post(sem_t *sem)
{
	int val;
	do val = sem->__val[0];
	while (val != a_cas(sem->__val, val, val+!!(val<SEM_VALUE_MAX)));
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	if (val < SEM_VALUE_MAX) return 0;
	errno = EOVERFLOW;
	return -1;
}



* Re: semaphore redesign
  2015-02-28 15:42                     ` Rich Felker
@ 2015-03-01 18:54                       ` Alexander Monakov
  0 siblings, 0 replies; 19+ messages in thread
From: Alexander Monakov @ 2015-03-01 18:54 UTC (permalink / raw)
  To: musl

A few more things from the IRC discussion.

Fields in the semaphore struct should be made volatile.

The new implementation eliminates the last call site that passes a non-dummy
cleanup argument to __timedwait.

The new sem_timedwait shows worse performance in Rich's sem_bench on my
desktop, I think primarily because it no longer has a sem_trywait preceding
the spin loop, plus the loop itself is also a bit leaner.  My proposal is to
use a different spin count, somewhere in the 150-200 range, instead of 100.

Alexander



* Re: semaphore redesign
  2015-03-01 17:50                       ` Szabolcs Nagy
@ 2015-03-02 22:40                         ` Alexander Monakov
  2015-03-02 22:45                           ` Rich Felker
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Monakov @ 2015-03-02 22:40 UTC (permalink / raw)
  To: musl

On Sun, 1 Mar 2015, Szabolcs Nagy wrote:
> sorry
> 
> the code is ok (applications cannot rely on the barrier in case of
> failure); it can lead to surprising results if the application
> uses relaxed atomics, but it's not a conformance issue

There was some follow up on IRC with the conclusion, as I understood, that
even though the rest of memory may be unsynchronized after returning an error,
the memory holding the semaphore itself needs to be synchronized (otherwise
the decision to return an error might have been based on stale memory).

Does sem_getvalue need to synchronize memory as well?  I think it should, but
current implementation does not.

Alexander



* Re: semaphore redesign
  2015-03-02 22:40                         ` Alexander Monakov
@ 2015-03-02 22:45                           ` Rich Felker
  0 siblings, 0 replies; 19+ messages in thread
From: Rich Felker @ 2015-03-02 22:45 UTC (permalink / raw)
  To: musl

On Tue, Mar 03, 2015 at 01:40:37AM +0300, Alexander Monakov wrote:
> On Sun, 1 Mar 2015, Szabolcs Nagy wrote:
> > sorry
> > 
> > the code is ok (applications cannot rely on the barrier in case of
> > failure), it can lead to surprising results if the application
> > uses relaxed atomics, but it's not a conformance issue
> 
> There was some follow up on IRC with the conclusion, as I understood, that
> even though the rest of memory may be unsynchronized after returning an error,
> the memory holding the semaphore itself needs to be synchronized (otherwise
> the decision to return an error might have been based on stale memory).

Right. The exemption from formally synchronizing memory in the case of
errors does not give license to falsely report errors due to failure
of the implementation to detect that it's in a non-error state. That
would just be an implementation bug.

> Does sem_getvalue need to synchronize memory as well?  I think it should, but
> current implementation does not.

sem_getvalue is required to return a value that was valid at some time
during the call. There's no formal requirement to "synchronize memory"
in the POSIX sense AFAIK, but if a barrier is required to meet the
requirement for the value, it should use one.
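
For illustration only, such a barrier-before-read variant might look like this
(assuming an a_barrier()-style full-barrier primitive is available from
atomic.h; whether the value requirement actually demands it is the open
question):

int sem_getvalue(sem_t *restrict sem, int *restrict valp)
{
	a_barrier();	/* assumed full barrier, so the read below isn't stale */
	int val = sem->__val[0];
	*valp = val < 0 ? 0 : val;
	return 0;
}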

Rich


