* Resuming work on new semaphore
From: Rich Felker @ 2015-04-02  1:30 UTC
To: musl

I have Alexander Monakov's new semaphore design on the agenda for this
release cycle (1.1.9). If I remember correctly, we were looking at some
issues in the last posted version with barriers and how to write the
code in a way that's both clean and avoids compiler stupidity.

One other thing I've been thinking about is how the current and new
semaphore designs differ when a waiter on a process-shared semaphore is
asynchronously killed (while waiting or during acceptance of the post).

With the current design:

If just waiting, the waiters count remains permanently elevated (and
will eventually overflow if this happens billions of times), and any
posts will make the value positive so that another waiter can take the
post.

If accepting (at least one atomic operation on the sem already
performed), then the waiters count is left valid and either a post is
consumed or left for another waiter to take.

With the new design:

If just waiting, the negative semaphore value persists after the waiter
is killed. Subsequent posts will produce a wake for a waiter that
doesn't exist, and will thereby allow future waiters that arrive when
the semaphore value is zero to proceed immediately (leaving the value
negative) by consuming this wake. There are usage patterns where
trywait would never succeed again, but wait would succeed trivially.

If accepting, the post is consumed. Accepting only involves one atomic
operation, so there is no other possibility.

It seems to me the behavior for async-kill-while-waiting is somewhat
worse with the new design, but this isn't really a usage pattern that
should have been considered "supported" before, anyway. If there are
ideas to make it better, we could explore them, but I don't think it's
a show-stopper in any case.

Rich
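For orientation, here is a minimal sketch of the two-word design under
discussion, reconstructed from the code posted later in this thread. It
uses only musl-internal helpers that appear verbatim in that code
(a_fetch_add, a_cas, a_inc, __wake, __timedwait_cp), so it is not
compilable outside musl's source tree; the simplified wait also omits
timeouts, cancellation, spinning, and overflow handling. It illustrates
the state machine, not the real implementation.

/* val[0] holds the semaphore value and goes negative while tasks wait;
   val[1] counts pending wakes posted for sleeping waiters;
   val[2] holds the pshared flag passed to the futex calls. */

static void sketch_wait(volatile int *val)
{
	if (a_fetch_add(val, -1) > 0) return;  /* fast path: downed a positive value */
	for (;;) {
		/* sleep until a pending wake appears in val[1] (null timeout) */
		__timedwait_cp(val+1, 0, CLOCK_REALTIME, 0, val[2]);
		int wak = val[1];
		if (wak > 0 && a_cas(val+1, wak, wak-1) == wak)
			return;  /* consumed one pending wake */
	}
}

static void sketch_post(volatile int *val)
{
	int v;
	do v = val[0];
	while (a_cas(val, v, v+1) != v);  /* up the value */
	if (v < 0) {                      /* waiters present (or dead): */
		a_inc(val+1);                 /* publish a wake... */
		__wake(val+1, 1, val[2]);     /* ...and wake one sleeper */
	}
}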
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-02  7:42 UTC
To: musl

On Wed, 1 Apr 2015, Rich Felker wrote:
> If just waiting, the negative semaphore value persists after the
> waiter is killed. Subsequent posts will produce a wake for a waiter
> that doesn't exist, and will thereby allow future waiters that arrive
> when the semaphore value is zero to proceed immediately (leaving the
> value negative) by consuming this wake. There are usage patterns where
> trywait would never succeed again, but wait would succeed trivially.

Interesting. To examine the issue under a different light, consider
that from the perspective of the semaphore implementation, waiters that
were killed, stopped, or pre-empted forever in the middle of sem_wait
are indistinguishable. Thus, subsequent sem_wait succeeds by
effectively stealing a post, and to make things consistent you can
teach sem_trywait to steal posts too (i.e. try
atomic-decrement-if-positive on val[1] just before returning EAGAIN,
and return 0 if that succeeds).

Alexander
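A sketch of the steal just proposed, using the a_cas helper that
appears in the code later in this thread; the function name is invented
here for illustration:

/* atomically decrement val[1] only if it is positive; return 1 if a
   pending wake was stolen, 0 if there was none to steal */
static int steal_pending_wake(volatile int *val)
{
	int wak;
	while ((wak = val[1]) > 0)
		if (a_cas(val+1, wak, wak-1) == wak)
			return 1;
	return 0;
}

sem_trywait would call this just before returning EAGAIN and report
success instead when it returns 1. (A correction appears a few messages
down: val[0] must also be decremented when the steal succeeds.)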
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-02 15:26 UTC
To: musl

On Thu, Apr 02, 2015 at 10:42:16AM +0300, Alexander Monakov wrote:
> On Wed, 1 Apr 2015, Rich Felker wrote:
> > If just waiting, the negative semaphore value persists after the
> > waiter is killed. Subsequent posts will produce a wake for a waiter
> > that doesn't exist, and will thereby allow future waiters that arrive
> > when the semaphore value is zero to proceed immediately (leaving the
> > value negative) by consuming this wake. There are usage patterns where
> > trywait would never succeed again, but wait would succeed trivially.
>
> Interesting. To examine the issue under a different light, consider
> that from the perspective of the semaphore implementation, waiters
> that were killed, stopped, or pre-empted forever in the middle of
> sem_wait are indistinguishable.

Yes, I noticed this too. In that sense, theoretically there should be
no harm (aside from eventual overflow of the pending-wake counter) from
having asynchronously-killed waiters, assuming the implementation is
bug-free in the absence of async killing of waiters.

> Thus, subsequent sem_wait succeeds by effectively stealing a post,
> and to make things consistent you can teach sem_trywait to steal
> posts too (i.e. try atomic-decrement-if-positive on val[1] just
> before returning EAGAIN, and return 0 if that succeeds).

Hmm, perhaps that is valid. I'll have to think about it again. I was
thinking of having sem_trywait unconditionally down the value (val[0])
and then imitate the exit path of sem_timedwait, but that's not valid
because another waiter could race and prevent sem_trywait from ever
being able to exit. But if it only does the down as a dec-if-positive,
then it seems it can safely dec-if-positive the wake count before
reporting failure.

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-02 21:39 UTC
To: musl

On Thu, 2 Apr 2015, Rich Felker wrote:
> > Interesting. To examine the issue under a different light, consider
> > that from the perspective of the semaphore implementation, waiters
> > that were killed, stopped, or pre-empted forever in the middle of
> > sem_wait are indistinguishable.
>
> Yes, I noticed this too. In that sense, theoretically there should be
> no harm (aside from eventual overflow of the pending-wake counter)
> from having asynchronously-killed waiters, assuming the
> implementation is bug-free in the absence of async killing of
> waiters.

Did you mean "presence"? I'm having trouble understanding your phrase,
especially after "assuming ..."; can you elaborate or rephrase?

That waiters can die breaks an assumption that operations on val[0] and
val[1] do not under/overflow due to their range exceeding the number of
simultaneously live tasks.

> > Thus, subsequent sem_wait succeeds by effectively stealing a post,
> > and to make things consistent you can teach sem_trywait to steal
> > posts too (i.e. try atomic-decrement-if-positive on val[1] just
> > before returning EAGAIN, and return 0 if that succeeds).
>
> Hmm, perhaps that is valid. I'll have to think about it again. I was
> thinking of having sem_trywait unconditionally down the value (val[0])
> and then imitate the exit path of sem_timedwait, but that's not valid
> because another waiter could race and prevent sem_trywait from ever
> being able to exit. But if it only does the down as a dec-if-positive,
> then it seems it can safely dec-if-positive the wake count before
> reporting failure.

I think my proposition above needs at least the following correction:
when trywait succeeds in stealing a post by dec-if-positive(val[1]), it
should also decrement val[0] before returning.

Are you sure your proposition is invalid? I don't think so. How is
trywait different from a timedwait with a timeout that immediately
expires? That is basically what your scheme should do.

Alexander
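A literal transcription of this correction, as a sketch under the same
assumptions as before (a_fetch_add and a_cas are the helpers used by
the code posted later in the thread; the eventual code below structures
the accounting differently):

/* steal a pending wake, and also down val[0] to account for the
   consumed post, per the correction above */
static int steal_pending_wake_corrected(volatile int *val)
{
	int wak;
	while ((wak = val[1]) > 0)
		if (a_cas(val+1, wak, wak-1) == wak) {
			a_fetch_add(val, -1);
			return 1;
		}
	return 0;
}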
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-02 23:14 UTC
To: musl

On Fri, Apr 03, 2015 at 12:39:10AM +0300, Alexander Monakov wrote:
> On Thu, 2 Apr 2015, Rich Felker wrote:
> > > Interesting. To examine the issue under a different light,
> > > consider that from the perspective of the semaphore
> > > implementation, waiters that were killed, stopped, or pre-empted
> > > forever in the middle of sem_wait are indistinguishable.
> >
> > Yes, I noticed this too. In that sense, theoretically there should
> > be no harm (aside from eventual overflow of the pending-wake
> > counter) from having asynchronously-killed waiters, assuming the
> > implementation is bug-free in the absence of async killing of
> > waiters.
>
> Did you mean "presence"? I'm having trouble understanding your
> phrase, especially after "assuming ..."; can you elaborate or
> rephrase?

I meant to say that, assuming there aren't already any bugs, then by
your reasoning adding async killing of waiters cannot add bugs (except
the overflow), since killed waiters are equivalent to a situation that
already arises without async killing.

> That waiters can die breaks an assumption that operations on val[0]
> and val[1] do not under/overflow due to their range exceeding the
> number of simultaneously live tasks.

Right. I'm ignoring that one. The current implementation likewise has
that issue for the waiter count (but it could avoid it by saturating
the waiter count at INT_MAX, I suppose, or by throwing away the waiter
count and just using a potential-waiters flag).

> I think my proposition above needs at least the following correction:
> when trywait succeeds in stealing a post by dec-if-positive(val[1]),
> it should also decrement val[0] before returning.

Yes, that seems right.

> Are you sure your proposition is invalid? I don't think so. How is
> trywait different from a timedwait with a timeout that immediately
> expires? That is basically what your scheme should do.

Indeed, I think you're right. Conceptually trywait and timedwait with a
zero timeout should be identical modulo error value and cancellation.

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-05 14:07 UTC
To: musl

So, your solution results in a simpler execution path for successful
trywait, but needs to undo the semaphore adjustment in order to return
EAGAIN (which I believe to be a relatively common outcome). Also, if a
concurrent poster sees the transient negative val[0], it will proceed
to the slow sem_post path (val[1] increment and futex wake), so if
there's high activity on the semaphore, this solution may slow down
posters.

My solution keeps the successful trywait path as-is, and adds a test on
val[1] in the EAGAIN case with a conditional branch that is normally
not executed. The semaphore's contents are not changed when returning
EAGAIN. I was about to say that this leads to more preferable cache
traffic on a highly contended semaphore (all calls to trywait returning
EAGAIN simultaneously can work on shared rather than exclusive cache
lines), but unfortunately we need a read memory barrier on the
semaphore, and the solution to that was to perform a CAS on val[0]
unconditionally, which would cause each caller to make the cache line
with the semaphore exclusive anyway, IIUC.

Regarding the problem raised on IRC: with 1 waiter,
post-getvalue-trywait can now result in 0-0 return values rather than
0-EAGAIN, while previously it could result in 1-EAGAIN rather than 1-0.
The old behavior was explained by saying that the waiter did not
"become a waiter" so that post had to wake it, but instead suspended
execution and resumed only after the semaphore value became positive;
the new surprising behavior needs a different explanation. I think in
the absence of a mechanism to detect whether current waiters are still
alive, that's the way it has to be. Either you pretend there are no
waiters at any time, like today, or you count dead waiters the same as
live waiters and report a 0 semaphore value to the caller; but then
what do you do when the caller invokes sem_wait and all other waiters
are dead? Suspend the caller indefinitely, or let it proceed by
consuming a pending post and thus revealing an inconsistency?

Alexander
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-05 14:17 UTC
To: musl

In a similar vein, if the caller explicitly kills other waiters before
invoking sem_post and sem_getvalue, it will also reveal an
inconsistency with the new implementation: a returned value of 0 will
look as if sem_post has woken some waiters, even though the caller
could know for certain that no tasks were waiting on the semaphore.

Well, we can make sem_getvalue return val[0]+val[1] instead... ;)

Alexander
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-05 19:02 UTC
To: musl

On Sun, Apr 05, 2015 at 05:17:35PM +0300, Alexander Monakov wrote:
> In a similar vein, if the caller explicitly kills other waiters
> before invoking sem_post and sem_getvalue, it will also reveal an
> inconsistency with the new implementation: a returned value of 0 will
> look as if sem_post has woken some waiters, even though the caller
> could know for certain that no tasks were waiting on the semaphore.

I'm not concerned about async killing (which is something of a broken
usage case anyway; it certainly doesn't work for other sync objects
except robust mutexes), but I am somewhat concerned about other
equivalent cases. For example:

1. Thread A enters sem_wait.
2. Thread B observes thread A in sem_wait via failed sem_trywait.
3. Thread B sends a signal to thread A.
4. Thread A blocks in the signal handler waiting for some other event
   under the control of thread B.
5. Thread B calls sem_post.
6. Thread B calls sem_wait and succeeds.
7. Thread B unblocks thread A.
8. Thread A remains blocked on the sem since B consumed the wake.

Note that the same thing happens with the old or new implementation;
the only difference is the _values_ observed. I believe this behavior
is forbidden by the following:

    "If the semaphore value resulting from this operation is positive,
    then no threads were blocked waiting for the semaphore to become
    unlocked; the semaphore value is simply incremented. If the value
    of the semaphore resulting from this operation is zero, then one
    of the threads blocked waiting for the semaphore shall be allowed
    to return successfully from its call to sem_wait()."

The old (current) implementation wrongly yields a positive value
despite there being waiters. The new implementation wrongly allows
sem_wait to succeed (and steal a wake) when the value is zero.

> Well, we can make sem_getvalue return val[0]+val[1] instead... ;)

That just makes the new implementation look like the old one, no? :-)

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-05 20:03 UTC
To: musl

On Sun, 5 Apr 2015, Rich Felker wrote:
> 1. Thread A enters sem_wait.
> 2. Thread B observes thread A in sem_wait via failed sem_trywait.

Hm, I don't see how that can be achieved. As a result I'm afraid I
didn't fully understand your example.

> > Well, we can make sem_getvalue return val[0]+val[1] instead... ;)
>
> That just makes the new implementation look like the old one, no? :-)

Can't be bad if it behaves the same but works a bit faster.

Apropos, like I've said on IRC, it looks like there's a "semaphore
uncertainty principle": the formal semaphore value is somewhere between
val[0] and val[0] +/- val[1] (clamped to 0 as needed). It seems you can
either do your hack and pretend that there are never any waiters, or
try to faithfully count waiters in sem_getvalue, but then also reveal
that sometimes the implementation works by stealing a post. I believe
you could argue that the latter is explicitly disallowed by the spec.

By the way, I think there's an interesting interplay with cancellation.
Consider the following. Thread B does "return sem_wait(sem);". Thread A
does:

	pthread_cancel(thread_B);
	sem_post(sem);
	sem_getvalue(sem, &val);

If it observes the semaphore value as 1, it follows that thread B has
not become a waiter yet, and since it must have cancellation already
pending, it may not consume the post. And yet if thread B is already
futex-waiting in sem_wait, consuming the post takes priority over
acting on cancellation. So if thread A then does

	pthread_join(thread_B, 0);
	sem_getvalue(sem, &val);

and gets a value of 0, it sees a contradiction. And the return value
from pthread_join will indicate that thread_B exited normally rather
than was cancelled.

And on the contrary, if you make acting on cancellation/timeout take
priority, you can observe the semaphore value increasing when waiters
leave the wait on the error path without consuming the post.

Alexander
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-05 20:23 UTC
To: musl

On Sun, Apr 05, 2015 at 11:03:34PM +0300, Alexander Monakov wrote:
> On Sun, 5 Apr 2015, Rich Felker wrote:
> > 1. Thread A enters sem_wait.
> > 2. Thread B observes thread A in sem_wait via failed sem_trywait.
>
> Hm, I don't see how that can be achieved. As a result I'm afraid I
> didn't fully understand your example.

Indeed, I was wrong about that, so I agree the whole scenario may fall
apart. Only sem_getvalue could show this, and only if it returns -1
rather than 0. So returning negative values from sem_getvalue seems
like a very bad idea -- it puts difficult- or impossible-to-satisfy
additional constraints on the implementation.

> Apropos, like I've said on IRC, it looks like there's a "semaphore
> uncertainty principle": the formal semaphore value is somewhere
> between val[0] and val[0] +/- val[1] (clamped to 0 as needed). It
> seems you can either do your hack and pretend that there are never
> any waiters, or try to faithfully count waiters in sem_getvalue, but
> then also reveal that sometimes the implementation works by stealing
> a post. I believe you could argue that the latter is explicitly
> disallowed by the spec.

Yes, I think I agree.

> By the way, I think there's an interesting interplay with
> cancellation. Consider the following. Thread B does
> "return sem_wait(sem);". Thread A does:
>
> 	pthread_cancel(thread_B);
> 	sem_post(sem);
> 	sem_getvalue(sem, &val);
>
> If it observes the semaphore value as 1, it follows that thread B has
> not become a waiter yet, and since it must have cancellation already
> pending, it may not consume the post. And yet if thread B is already
> futex-waiting in sem_wait, consuming the post takes priority over
> acting on cancellation. So if thread A then does
>
> 	pthread_join(thread_B, 0);
> 	sem_getvalue(sem, &val);
>
> and gets a value of 0, it sees a contradiction. And the return value
> from pthread_join will indicate that thread_B exited normally rather
> than was cancelled.

So the contradiction you claim exists is that cancellation happened
before the post, and thus thread B can't act on the post when it didn't
act on cancellation? I don't think that follows from the rules of
cancellation. The relevant text is:

    "Whenever a thread has cancelability enabled and a cancellation
    request has been made with that thread as the target, and the
    thread then calls any function that is a cancellation point (such
    as pthread_testcancel() or read()), the cancellation request shall
    be acted upon before the function returns."

So if cancellation was pending _before_ the call to sem_wait, then
sem_wait has to honor it. But there is no requirement that entry to the
sem_wait function be "atomic" with becoming a waiter on the semaphore,
and of course this is impossible to satisfy or even specify. So it's
totally legal to have the sequence:

1. Thread B enters sem_wait.
2. Thread B observes that cancellation was not already pending.
3. Thread A sends the cancellation request.
4. Thread A sends the post.
5. Thread B receives both, and chooses to act on the post per this
   text:

    "It is unspecified whether the cancellation request is acted upon
    or whether the cancellation request remains pending and the thread
    resumes normal execution if:

    - The thread is suspended at a cancellation point and the event
      for which it is waiting occurs

    - A specified timeout expired before the cancellation request is
      acted upon."

Here, the event for which it was waiting (the post) clearly occurs.

> And on the contrary, if you make acting on cancellation/timeout take
> priority, you can observe the semaphore value increasing when waiters
> leave the wait on the error path without consuming the post.

Yes, and obviously that is not possible.

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-05 21:07 UTC
To: musl

On Sun, 5 Apr 2015, Rich Felker wrote:
> So if cancellation was pending _before_ the call to sem_wait, then
> sem_wait has to honor it. But there is no requirement that entry to
> the sem_wait function be "atomic" with becoming a waiter on the
> semaphore, and of course this is impossible to satisfy or even
> specify.

Thanks!

One other thing to consider. In the absence of concurrent operations on
the semaphore, the return value of sem_getvalue should be equal to the
number of times sem_trywait will indicate success when called
repeatedly. So if the implementation performs post-stealing in trywait,
it should return the higher bound as the semaphore value. Likewise for
timedwait.

Alexander
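The invariant stated above can be written down as a check; this is a
sketch using only the public POSIX API, and it is valid only when
nothing else touches the semaphore concurrently:

#include <assert.h>
#include <semaphore.h>

static void check_getvalue_invariant(sem_t *sem)
{
	int val, n = 0;
	sem_getvalue(sem, &val);
	while (sem_trywait(sem) == 0)
		n++;               /* drain every available post */
	assert(n == val);      /* holds only without concurrency */
}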
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-11 22:22 UTC
To: musl

On Mon, 6 Apr 2015, Alexander Monakov wrote:
> One other thing to consider. In the absence of concurrent operations
> on the semaphore, the return value of sem_getvalue should be equal to
> the number of times sem_trywait will indicate success when called
> repeatedly. So if the implementation performs post-stealing in
> trywait, it should return the higher bound as the semaphore value.
> Likewise for timedwait.

If we accept the above, it follows that in the new implementation
getvalue should return not max(0, val[0] + val[1]), but rather
max(0, val[0]) + val[1].

So incorporating the above into getvalue, and Rich's scheme of
mirroring timedwait into trywait (as he pointed out on IRC, my approach
does not properly work with two concurrent posters), we get:

int sem_getvalue(sem_t *restrict sem, int *restrict valp)
{
	a_barrier(); // Memory holding the semaphore should not be stale
	int val = sem->__val[0];
	val = val < 0 ? 0 : val;
	*valp = val + sem->__val[1];
	return 0;
}

int sem_trywait(sem_t *sem)
{
	if (a_fetch_add(sem->__val, -1) > 0) return 0;
	for (;;) {
		int wak = sem->__val[1];
		if (wak > 0 && wak == a_cas(sem->__val+1, wak, wak-1))
			return 0;
		int val = sem->__val[0];
		if (val < 0 && val == a_cas(sem->__val, val, val+1))
			break;
		a_spin();
	}
	errno = EAGAIN;
	return -1;
}

int sem_post(sem_t *sem)
{
	int val;
	do val = sem->__val[0];
	while (val != a_cas(sem->__val, val, val+!!(val<SEM_VALUE_MAX)));
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	if (val < SEM_VALUE_MAX) return 0;
	errno = EOVERFLOW;
	return -1;
}

int sem_timedwait(sem_t *restrict sem, const struct timespec *restrict at)
{
	pthread_testcancel();
	// Do not spin if already contended (val[0] is negative)
	for (int spins = 1000; spins && sem->__val[0] == 0; spins--)
		a_spin();
	if (a_fetch_add(sem->__val, -1) > 0) return 0;
	int priv = sem->__val[2], e = 0, ee, cs;
	pthread_setcancelstate(PTHREAD_CANCEL_MASKED, &cs);
	do {
		ee = __timedwait_cp(sem->__val+1, 0, CLOCK_REALTIME, at, priv);
		int wak = sem->__val[1];
		if (wak > 0 && wak == a_cas(sem->__val+1, wak, wak-1))
			goto done;
	} while (!ee || ee == EINTR);
	// Upon returning from the wait with an error, either discount
	// ourselves as a waiter by incrementing the negative val[0] and
	// propagate the error, or consume a racing post if val[0] is
	// non-negative and suppress the error.
	for (;;) {
		int val = sem->__val[0];
		if (val < 0 && val == a_cas(sem->__val, val, val+1))
			break;
		int wak = sem->__val[1];
		if (wak > 0 && wak == a_cas(sem->__val+1, wak, wak-1))
			goto done;
		a_spin();
	}
	e = ee;
done:
	pthread_setcancelstate(cs, 0);
	if (!e) return 0;
	if (e == ECANCELED) {
		pthread_testcancel();
		pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, 0);
	}
	errno = e;
	return -1;
}
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-23 16:06 UTC
To: musl

I'm going to try to summarize some of the issues that have been
discussed on IRC since this.

On Sun, Apr 12, 2015 at 01:22:34AM +0300, Alexander Monakov wrote:
> On Mon, 6 Apr 2015, Alexander Monakov wrote:
> > One other thing to consider. In the absence of concurrent
> > operations on the semaphore, the return value of sem_getvalue
> > should be equal to the number of times sem_trywait will indicate
> > success when called repeatedly. So if the implementation performs
> > post-stealing in trywait, it should return the higher bound as the
> > semaphore value. Likewise for timedwait.
>
> If we accept the above, it follows that in the new implementation
> getvalue should return not max(0, val[0] + val[1]), but rather
> max(0, val[0]) + val[1].

Indeed. But then max(0, val[0]) + val[1] can overflow SEM_VALUE_MAX
unless we prevent it, which takes some work, but I think it's possible.

> int sem_post(sem_t *sem)
> {
> 	int val;
> 	do val = sem->__val[0];
> 	while (val != a_cas(sem->__val, val, val+!!(val<SEM_VALUE_MAX)));
> 	if (val < 0) {
> 		int priv = sem->__val[2];
> 		a_inc(sem->__val+1);
> 		__wake(sem->__val+1, 1, priv);
> 	}
> 	if (val < SEM_VALUE_MAX) return 0;
> 	errno = EOVERFLOW;
> 	return -1;
> }

The first observation we made was that this checks val<SEM_VALUE_MAX
twice in the success path, for an extra useless branch. It can be fixed
by something like this (my1):

int sem_post(sem_t *sem)
{
	int val = sem->__val[0];
	val -= val==SEM_VALUE_MAX;
	while (a_cas(sem->__val, val, val+1) != val) {
		if ((val = sem->__val[0]) == SEM_VALUE_MAX) {
			errno = EOVERFLOW;
			return -1;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

or this (my1b):

int sem_post(sem_t *sem)
{
	int old, val = sem->__val[0];
	val -= val==SEM_VALUE_MAX;
	while ((old = a_cas(sem->__val, val, val+1)) != val) {
		if ((val = old) == SEM_VALUE_MAX) {
			errno = EOVERFLOW;
			return -1;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

The latter saves the result of a_cas to prevent an extra load, but I
don't think it makes any significant difference, and it might be seen
as uglier.

However, neither of those addresses the overflow issue, which I've
tried to address here:

#define VAL0_MAX ((SEM_VALUE_MAX+1)/2)
#define VAL1_MAX (SEM_VALUE_MAX/2)

int sem_post(sem_t *sem)
{
	int val = sem->__val[0];
	val -= val==VAL0_MAX;
	while (a_cas(sem->__val, val, val+1) != val) {
		if ((val = sem->__val[0]) == VAL0_MAX) {
			int tmp = sem->__val[1];
			if (tmp >= VAL1_MAX) {
				errno = EOVERFLOW;
				return -1;
			}
			if (a_cas(sem->__val+1, tmp, tmp+1) == tmp) {
				return 0;
			}
			val--;
		}
	}
	if (val < 0) {
		int priv = sem->__val[2];
		a_inc(sem->__val+1);
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

This is code whose idea was discussed on IRC but not yet presented, so
it may have significant bugs. The idea is to limit the main sem value
component and the wake count separately to half the max. Once val[0]
hits VAL0_MAX, further posts will be in the form of wakes for
nonexistent waiters (which are ok but more costly). This allows the
total observed value to reach all the way up to SEM_VALUE_MAX. If this
happens, waiters will consume all of val[0] first, and the wakes will
all remain pending until val[0] reaches 0. At that point, new waiters
will decrement val[0] to a negative value (indicating a waiter),
attempt a futex wait, fail because there are wakes pending, consume one
of the wakes, and exit. (Note: this useless futex wait can be optimized
out by reordering the do-while loop body in sem_timedwait.)

During this state, there is a race window where val[1] can exceed
VAL1_MAX: if a post happens after a new waiter decrements val[0] but
before it consumes a wake from val[1], a concurrent post will increment
val[0] back to 0 and increment val[1] unconditionally. However, the
magnitude of such overshoot is bounded by the number of tasks, which is
necessarily bounded by INT_MAX/4, which is less than VAL1_MAX, so no
integer overflow can happen here (except in the case of async-killed
waiters).

Does this all sound correct?

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-23 18:24 UTC
To: musl

> The latter saves the result of a_cas to prevent an extra load, but I
> don't think it makes any significant difference, and it might be seen
> as uglier.

I think we should use the result of a_cas here: it's part of the
sem_post "fast path", and doing so is not too difficult. I'm using a
slightly different version below.

> However, neither of those addresses the overflow issue, which I've
> tried to address here:
>
> #define VAL0_MAX ((SEM_VALUE_MAX+1)/2)

Signed integer overflow here -- using a corrected version below.

> Does this all sound correct?

I'm afraid not. We must always do a futex wake when incrementing
val[1]. Otherwise wake loss is possible:

1. Semaphore initialized to VAL0_MAX
2. Thread A enters sem_post, observes saturated val[0]
3. Thread B downs val[0] to 0 by calling sem_wait VAL0_MAX times
4. Thread B calls sem_wait again and enters futex wait
5. Thread A ups val[1]

At this point thread A must futex-wake val[1].

My version:

#define VAL0_MAX (SEM_VALUE_MAX/2+1)
#define VAL1_MAX (SEM_VALUE_MAX/2)

int sem_post(sem_t *sem)
{
	int old, val = sem->__val[0];
	val -= val == VAL0_MAX;
	while (old = val, (val = a_cas(sem->__val, val, val+1)) != old)
		if (val == VAL0_MAX) goto wake;
	if (val < 0) {
wake:;
		int priv = sem->__val[2];
		do if ((val = sem->__val[1]) == VAL1_MAX) {
			errno = EOVERFLOW;
			return -1;
		} while (val != a_cas(sem->__val+1, val, val+1));
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

After sufficiently many waiters have been killed, val[1] can reach
VAL1_MAX without val[0] also reaching VAL0_MAX, in which case sem_post
can report EOVERFLOW prematurely. From previous emails it seems that's
not a big concern.

It is also possible that EOVERFLOW will be reported prematurely in race
windows when a waiter returning from futex wait with EWOULDBLOCK has
not yet decremented val[1] of a recently saturated semaphore. Example:

1. Semaphore initialized to SEM_VALUE_MAX
2. Thread A downs val[0] to 0 by calling sem_wait VAL0_MAX times;
   val[1] remains at VAL1_MAX
3. Thread B calls sem_wait and enters futex wait
4. Thread A calls sem_post, observes val[0] < 0 && val[1] == VAL1_MAX

It's possible to make the window smaller by reordering the futex-wait
loop, but it will remain. At the moment I don't have a good way out.

Thanks.
Alexander
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-23 20:01 UTC
To: musl

I was over-eager in size-optimizing, and at first didn't notice that we
may not report EOVERFLOW after successfully incrementing val[0];
therefore we can reuse only the very end of the futex-wake path:

#define VAL0_MAX (SEM_VALUE_MAX/2+1)
#define VAL1_MAX (SEM_VALUE_MAX/2)

int sem_post(sem_t *sem)
{
	int priv, old, val = sem->__val[0];
	val -= val == VAL0_MAX;
	while (old = val, (val = a_cas(sem->__val, val, val+1)) != old)
		if (val == VAL0_MAX) {
			priv = sem->__val[2];
			do {
				if ((val = sem->__val[1]) >= VAL1_MAX) {
					errno = EOVERFLOW;
					return -1;
				}
			} while (val != a_cas(sem->__val+1, val, val+1));
			goto wake;
		}
	if (val < 0) {
		priv = sem->__val[2];
		a_inc(sem->__val+1);
wake:
		__wake(sem->__val+1, 1, priv);
	}
	return 0;
}

Now instead of the 'premature EOVERFLOW' problem we have the 'val[1]
overshoot' problem. It can lead to getvalue overflow:

1. Semaphore initialized to SEM_VALUE_MAX
2. Thread A downs val[0] to 0
3. Thread B downs val[0] to -1
4. Thread A calls sem_post: val[0] == 0, val[1] == VAL1_MAX+1
   ... (thread B does not consume the post yet)
5. Thread A ups val[0] to VAL0_MAX
   ... now getvalue returns INT_MIN

Alexander
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-24  2:46 UTC
To: musl

On Thu, Apr 23, 2015 at 11:01:19PM +0300, Alexander Monakov wrote:
> I was over-eager in size-optimizing, and at first didn't notice that
> we may not report EOVERFLOW after successfully incrementing val[0];
> therefore we can reuse only the very end of the futex-wake path:
>
> #define VAL0_MAX (SEM_VALUE_MAX/2+1)
> #define VAL1_MAX (SEM_VALUE_MAX/2)
>
> int sem_post(sem_t *sem)
> {
> 	int priv, old, val = sem->__val[0];
> 	val -= val == VAL0_MAX;
> 	while (old = val, (val = a_cas(sem->__val, val, val+1)) != old)
> 		if (val == VAL0_MAX) {
> 			priv = sem->__val[2];
> 			do {
> 				if ((val = sem->__val[1]) >= VAL1_MAX) {
> 					errno = EOVERFLOW;
> 					return -1;
> 				}
> 			} while (val != a_cas(sem->__val+1, val, val+1));
> 			goto wake;
> 		}
> 	if (val < 0) {
> 		priv = sem->__val[2];
> 		a_inc(sem->__val+1);
> wake:
> 		__wake(sem->__val+1, 1, priv);
> 	}
> 	return 0;
> }
>
> Now instead of the 'premature EOVERFLOW' problem we have the 'val[1]
> overshoot' problem. It can lead to getvalue overflow:
>
> 1. Semaphore initialized to SEM_VALUE_MAX
> 2. Thread A downs val[0] to 0
> 3. Thread B downs val[0] to -1
> 4. Thread A calls sem_post: val[0] == 0, val[1] == VAL1_MAX+1
>    ... (thread B does not consume the post yet)
> 5. Thread A ups val[0] to VAL0_MAX
>    ... now getvalue returns INT_MIN

Perhaps this can be patched up by saturating sem_getvalue's result? In
the case where the overflow happens it's transient, right? I think that
means discounting the overflow would be valid. But I'll need to think
about it more...

With that said, my inclination right now is that we should hold off on
trying to commit the new semaphore for this release cycle. I've been
aiming for this month and just about everything else is in order for
release, but the semaphore rabbit-hole keeps going deeper and I think
we need to work through this properly. I hope that's not too much of a
disappointment.

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-24 10:23 UTC
To: musl

On Thu, 23 Apr 2015, Rich Felker wrote:
> Perhaps this can be patched up by saturating sem_getvalue's result?
> In the case where the overflow happens it's transient, right? I think
> that means discounting the overflow would be valid. But I'll need to
> think about it more...

Hm, can't agree here. This whole line of discussion stems from an
attempt to align timedwait/trywait/getvalue behavior in light of dead
waiters, which are indistinguishable from preempted waiters. If the
"it's transient" claim can be made, it can also be used as a reason not
to modify getvalue to look at val[1] in the first place.

> With that said, my inclination right now is that we should hold off
> on trying to commit the new semaphore for this release cycle. I've
> been aiming for this month and just about everything else is in order
> for release, but the semaphore rabbit-hole keeps going deeper and I
> think we need to work through this properly. I hope that's not too
> much of a disappointment.

Ack; thankfully I don't feel disappointment in this case, as this
discussion has been quite entertaining. When I proposed my modification
I felt it was very intuitive. What I did not grasp back then is that
the definition of a waiter is not solid.

How do you interpret the following?

1. Semaphore initialized to 0. There's only one thread.
2. alarm(1)
3. sem_wait()
   ... (in the SIGALRM handler)
4. sem_post()
5. sem_getvalue()

May getvalue be 0 here? At step 4, can the thread possibly "be a
waiter" on the semaphore?

Thanks.
Alexander
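A literal transcription of this scenario as a sketch (the reply below
points out that steps 4 and 5, performed inside the signal handler, are
formally problematic, so this is for illustration only; the bare
signal() call and the unused val are simplifications):

#include <semaphore.h>
#include <signal.h>
#include <unistd.h>

static sem_t sem;

static void on_alarm(int sig)
{
	int val;
	(void)sig;
	sem_post(&sem);            /* step 4 */
	sem_getvalue(&sem, &val);  /* step 5: may val be 0 here? */
}

int main(void)
{
	sem_init(&sem, 0, 0);      /* step 1: value 0, single thread */
	signal(SIGALRM, on_alarm);
	alarm(1);                  /* step 2 */
	sem_wait(&sem);            /* step 3: interrupted by SIGALRM */
	return 0;
}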
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-24 15:03 UTC
To: musl

On Fri, Apr 24, 2015 at 01:23:27PM +0300, Alexander Monakov wrote:
> Hm, can't agree here. This whole line of discussion stems from an
> attempt to align timedwait/trywait/getvalue behavior in light of dead
> waiters, which are indistinguishable from preempted waiters.

I don't think dead waiters are a solvable problem with this design, but
they're a minor problem until you hit overflow.

> If the "it's transient" claim can be made, it can also be used as a
> reason not to modify getvalue to look at val[1] in the first place.

No, because you can interrupt a waiter with a signal handler, and then
the "transient" state becomes something you can synchronize with and
observe, and thus it is no longer transient. That was the motivation
for needing to count the pending wakes.

> How do you interpret the following?
>
> 1. Semaphore initialized to 0. There's only one thread.
> 2. alarm(1)
> 3. sem_wait()
>    ... (in the SIGALRM handler)
> 4. sem_post()
> 5. sem_getvalue()
>
> May getvalue be 0 here? At step 4, can the thread possibly "be a
> waiter" on the semaphore?

Here steps 4 and 5 are UB (calling AS-unsafe functions from an
async-signal context). But you can achieve the same with another thread
observing entry to the signal handler in a valid way (e.g. via posting
of a second sem from the signal handler).

With that problem solved, I think it's valid at this point to observe a
value of 0 or 1. But if 0 is observed, sem_trywait would have to fail,
and sem_wait or sem_timedwait could return only in the case of an
error. This is why returning 0 does not seem to be practical -- I don't
know a way to let the existing suspended waiter take the wake without
allowing new waiters to steal it (and thus expose inconsistency).

Rich
* Re: Resuming work on new semaphore
From: Alexander Monakov @ 2015-04-24 15:47 UTC
To: musl

One more thought. In the original proposal, val[1] was a "wake count".
By now we have drifted away a bit from that concept, with val[1] being
used not only in post/timedwait, but also in trywait/getvalue, with
trywait explicitly performing post-stealing and getvalue accounting for
that.

So how about we try to stick with the original vision, and say that
val[1] is strictly for waiters leaving the futex wait via the non-error
path? The kernel tells the poster, via the return value of futex wake,
whether the wake has caused any waiter to resume execution. The kernel
already knows if any waiter was there to accept a wake, and lets us
know. Can we somehow use that to implement non-stealing wakeups?

Alexander
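A minimal sketch of the kernel feedback being referred to: FUTEX_WAKE
returns the number of waiters actually woken, so a poster can in
principle tell whether its wake reached anyone. (Illustration only; the
reply below explains why building a protocol on this is impractical.)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* wake at most one waiter; the return value is how many were woken */
static long futex_wake_one(volatile int *addr)
{
	return syscall(SYS_futex, addr, FUTEX_WAKE, 1, 0, 0, 0);
}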
* Re: Resuming work on new semaphore
From: Rich Felker @ 2015-04-24 15:59 UTC
To: musl

On Fri, Apr 24, 2015 at 06:47:45PM +0300, Alexander Monakov wrote:
> So how about we try to stick with the original vision, and say that
> val[1] is strictly for waiters leaving the futex wait via the
> non-error path? The kernel tells the poster, via the return value of
> futex wake, whether the wake has caused any waiter to resume
> execution. The kernel already knows if any waiter was there to accept
> a wake, and lets us know. Can we somehow use that to implement
> non-stealing wakeups?

This is the famous problem discussed on libc-alpha in regard to
documenting the futex syscall. If you could reliably assume that any
zero return from the futex wait syscall resulted from an intentional
futex wake, this kind of protocol would be possible. Unfortunately,
almost every existing use of futex wake sends the wake after the memory
could already be reused for something new. Fixing this is possible with
clever use of FUTEX_WAKE_OP, possibly with a performance hit (but
conversely, improving some scheduling behavior), but you'd need a
contract that everything agrees to use this protocol. Even one piece of
code failing to do so would horribly break code depending on the
assumption that "zero means intentional wake". So I don't think it's
practical.

Also, there's the issue that the target may not actually be in the
syscall when the wake arrives -- it may be suspended just before it, or
in a signal handler. I'm not sure whether those issues are solvable or
not.

Rich