mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] Bug with priority inheritance and condition variables
@ 2020-09-24 14:58 Edward Scott
  2020-09-24 16:14 ` Rich Felker
  0 siblings, 1 reply; 3+ messages in thread
From: Edward Scott @ 2020-09-24 14:58 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1419 bytes --]

Hello,

There appears to be a bug when using priority inheritance in combination
with condition variables. I have some code that reproduces the bug:

https://github.com/edward-scott/musl-prio-inherit-cv-bug

Using git bisect I traced the origin of the bug to this commit:

https://git.musl-libc.org/cgit/musl/commit/?id=54ca677983d47529bab8752315ac1a2b49888870

which is the commit that is described as "implement priority inheritance
mutexes".

From my analysis it appears that _m_waiters is used by the
priority inheritance logic to maintain some state (as described in the
commit message) but that conflicts with some use of _m_waiters in the
condition variable implementation.

The consequence is that pthread_mutex_lock erroneously returns EDEADLK.

I don't understand the code well enough to produce a fix.

The demo code (a cut version of some production code) will reproduce the
failure. Commenting out the pthread_mutexattr_setprotocol call in
the iot_mutex_init function at the end of the thread.c file will cause the
code to work as intended (without priority inheritance). The code works
fine either way with the GNU lib.

BTW can I recommend that the "magic numbers"  used to represent mutex modes
be replaced at some point with defined constants as it would make the code
much easier to follow.

This is my first post to this list so I hope this message is on the right
list and is helpful.

Edward Scott

[-- Attachment #2: Type: text/html, Size: 1959 bytes --]

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [musl] Bug with priority inheritance and condition variables
  2020-09-24 14:58 [musl] Bug with priority inheritance and condition variables Edward Scott
@ 2020-09-24 16:14 ` Rich Felker
  2020-10-26 19:49   ` Rich Felker
  0 siblings, 1 reply; 3+ messages in thread
From: Rich Felker @ 2020-09-24 16:14 UTC (permalink / raw)
  To: musl

On Thu, Sep 24, 2020 at 03:58:17PM +0100, Edward Scott wrote:
> Hello,
> 
> There appears to be a bug when using priority inheritance in combination
> with condition variables. I have some code that reproduces the bug:
> 
> https://github.com/edward-scott/musl-prio-inherit-cv-bug
> 
> Using git bisect I traced the origin of the bug to this commit:
> 
> https://git.musl-libc.org/cgit/musl/commit/?id=54ca677983d47529bab8752315ac1a2b49888870
> 
> which is the commit that is described as "implement priority inheritance
> mutexes".
> 
> From my analysis it appears that _m_waiters is used by the
> priority inheritance logic to maintain some state (as described in the
> commit message) but that conflicts with some use of _m_waiters in the
> condition variable implementation.

I think this is entirely correct analysis. Thanks for catching this!

> The consequence is that pthread_mutex_lock erroneously returns EDEADLK.

OK, it took me a second to understand this part, because I thought it
would be ENOTRECOVERABLE, but that's only for robust+PI mutexes.
EDEADLK seems to be a consequence of succeeding but returning EBUSY,
which is "wrong" but should only be able to happen with inconsistent
state, as produced by pthread_cond_timedwait.

> I don't understand the code well enough to produce a fix.

I'll take a look. I'd like to just drop adusting the waiters count
here and instead set the bit-31 may-have-waiters flag here, but I'm
not sure that's right for all mutex types. It certainly can be made to
do that just on PI mutexes if needed but having fewer special cases is
preferable.

> The demo code (a cut version of some production code) will reproduce the
> failure. Commenting out the pthread_mutexattr_setprotocol call in
> the iot_mutex_init function at the end of the thread.c file will cause the
> code to work as intended (without priority inheritance). The code works
> fine either way with the GNU lib.
> 
> BTW can I recommend that the "magic numbers"  used to represent mutex modes
> be replaced at some point with defined constants as it would make the code
> much easier to follow.

Yes, it's been something I kinda wanted to do, but that would have
obfuscated and cluttered the actual changes in development when it was
being done. It might be time to go back and add some now that this
code is mature.

> This is my first post to this list so I hope this message is on the right
> list and is helpful.

Yep, this is fine. Thanks again!

Rich

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [musl] Bug with priority inheritance and condition variables
  2020-09-24 16:14 ` Rich Felker
@ 2020-10-26 19:49   ` Rich Felker
  0 siblings, 0 replies; 3+ messages in thread
From: Rich Felker @ 2020-10-26 19:49 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 3258 bytes --]

On Thu, Sep 24, 2020 at 12:14:07PM -0400, Rich Felker wrote:
> On Thu, Sep 24, 2020 at 03:58:17PM +0100, Edward Scott wrote:
> > Hello,
> > 
> > There appears to be a bug when using priority inheritance in combination
> > with condition variables. I have some code that reproduces the bug:
> > 
> > https://github.com/edward-scott/musl-prio-inherit-cv-bug
> > 
> > Using git bisect I traced the origin of the bug to this commit:
> > 
> > https://git.musl-libc.org/cgit/musl/commit/?id=54ca677983d47529bab8752315ac1a2b49888870
> > 
> > which is the commit that is described as "implement priority inheritance
> > mutexes".
> > 
> > From my analysis it appears that _m_waiters is used by the
> > priority inheritance logic to maintain some state (as described in the
> > commit message) but that conflicts with some use of _m_waiters in the
> > condition variable implementation.
> 
> I think this is entirely correct analysis. Thanks for catching this!
> 
> > The consequence is that pthread_mutex_lock erroneously returns EDEADLK.
> 
> OK, it took me a second to understand this part, because I thought it
> would be ENOTRECOVERABLE, but that's only for robust+PI mutexes.
> EDEADLK seems to be a consequence of succeeding but returning EBUSY,
> which is "wrong" but should only be able to happen with inconsistent
> state, as produced by pthread_cond_timedwait.
> 
> > I don't understand the code well enough to produce a fix.
> 
> I'll take a look. I'd like to just drop adusting the waiters count
> here and instead set the bit-31 may-have-waiters flag here, but I'm
> not sure that's right for all mutex types. It certainly can be made to
> do that just on PI mutexes if needed but having fewer special cases is
> preferable.
> 
> > The demo code (a cut version of some production code) will reproduce the
> > failure. Commenting out the pthread_mutexattr_setprotocol call in
> > the iot_mutex_init function at the end of the thread.c file will cause the
> > code to work as intended (without priority inheritance). The code works
> > fine either way with the GNU lib.
> > 
> > BTW can I recommend that the "magic numbers"  used to represent mutex modes
> > be replaced at some point with defined constants as it would make the code
> > much easier to follow.
> 
> Yes, it's been something I kinda wanted to do, but that would have
> obfuscated and cluttered the actual changes in development when it was
> being done. It might be time to go back and add some now that this
> code is mature.
> 
> > This is my first post to this list so I hope this message is on the right
> > list and is helpful.
> 
> Yep, this is fine. Thanks again!

This took a while to get to, but here's my proposed patch. It drops
all waiters modification in favor of setting the "may have waiters"
flag whenever there's another waiter to be woken. At the time this is
done, the calling thread holds the mutex (except on error re-locking
it, but then the mutex is non-recoverable or else UB occurred), and
setting the flag guarantees it will perform a wake when it eventually
unlocks it.

With the patch applied, your test program gets further along but still
hangs. I think the problem is the #if 0 block in threadpool.c; with
that changed to #if 1, it runs to completion.

Rich

[-- Attachment #2: cv_with_pi_mutex.diff --]
[-- Type: text/plain, Size: 757 bytes --]

diff --git a/src/thread/pthread_cond_timedwait.c b/src/thread/pthread_cond_timedwait.c
index d1501240..02858f7d 100644
--- a/src/thread/pthread_cond_timedwait.c
+++ b/src/thread/pthread_cond_timedwait.c
@@ -146,14 +146,13 @@ relock:
 
 	if (oldstate == WAITING) goto done;
 
-	if (!node.next) a_inc(&m->_m_waiters);
-
 	/* Unlock the barrier that's holding back the next waiter, and
 	 * either wake it or requeue it to the mutex. */
-	if (node.prev)
+	if (node.prev) {
+		int val = m->_m_lock;
+		if (val>0) a_cas(&m->_m_lock, val, val|0x80000000);
 		unlock_requeue(&node.prev->barrier, &m->_m_lock, m->_m_type & 128);
-	else
-		a_dec(&m->_m_waiters);
+	}
 
 	/* Since a signal was consumed, cancellation is not permitted. */
 	if (e == ECANCELED) e = 0;

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-10-26 19:49 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-24 14:58 [musl] Bug with priority inheritance and condition variables Edward Scott
2020-09-24 16:14 ` Rich Felker
2020-10-26 19:49   ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).