mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait
@ 2024-07-10 19:11 Tavian Barnes
  2024-07-11 14:56 ` 回复: " Li JinCheng
  0 siblings, 1 reply; 6+ messages in thread
From: Tavian Barnes @ 2024-07-10 19:11 UTC (permalink / raw)
  To: musl; +Cc: 250200715

(I cleaned up the HTML entities in this email, but please use plain text
mode next time)

> Hello:
>
> I had a low-probability crash in the child thread when using the
> timer_create interface. After debug, I found that the crash occured
> when the sub-thread accessed in code "if (b->_b_waiters)" which is a
> stack variable created in the main thread and passed to child thread
> by args. It looks like the main thread's timer_create has finished
> executing at this point, so the variables (start_args) on the stack
> have been cleaned up. I take a look at the pthread_barrier_wait code
> and I think it should be a scheduling problem in pthread_barrier_wait.
>
> Take the timer_create as an example, when the child thread is the
> first thread for "pthread_barrier_wait" and it is suspened after it
> executes the code "a_store(&b->_b_lock, 0)", then the main thread in
> timer_create will arrive as the last thread, it will nerver wait for
> the child thread to be rescheduled, the main thread can pass the
> barrier and continue execution, the args created in timer_create will
> be cleaned up. when the child thread is finally rescheduled, it access
> the "b->_b_waiters" which has already been cleaned up by main thread
> and the crash will occur.
>
> Is there a bug here? Looking forward to your reply.
>
>     /* First thread to enter the barrier becomes the "instance owner" */
>     if (!inst) {
>         struct instance new_inst = { 0 };
>         int spins = 200;
>         b->_b_inst = inst = &new_inst;
>         a_store(&b->_b_lock, 0);
>         if (b->_b_waiters) __wake(&b->_b_lock, 1, 1);  // crash here b->_b_waiters
>         while (spins-- && !inst->finished)
>
>     /* First thread to enter the barrier becomes the "instance owner" */
>     if (!inst) {
>         struct instance new_inst = { 0 };
>         int spins = 200;
>         b->_b_inst = inst = &new_inst;
>         a_store(&b->_b_lock, 0);
>         // when the child thread is the first thread and is scheduled out here
>
>         if (b->_b_waiters) __wake(&b->_b_lock, 1, 1);
>         while (spins-- && !inst->finished)

This looks like a real bug in timer_create() to me.  Here's an untested
fix:

From: Tavian Barnes <tavianator@tavianator.com>
Date: Wed, 10 Jul 2024 14:27:21 -0400
Subject: [PATCH] timer_create: destroy the barrier before returning

pthread_barrier_destroy() waits for all threads to be done using the
barrier before destroying it.  Without it, the storage for the barrier
can be deallocated when timer_create() returns while the other thread is
still using it inside the pthread_barrier_wait() implementation.

Link: https://www.openwall.com/lists/musl/2024/07/08/1
---
 src/time/timer_create.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/time/timer_create.c b/src/time/timer_create.c
index 9216b3ab..42c69848 100644
--- a/src/time/timer_create.c
+++ b/src/time/timer_create.c
@@ -121,6 +121,7 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict
 		}
 		td->timer_id = timerid;
 		pthread_barrier_wait(&args.b);
+		pthread_barrier_destroy(&args.b);
 		if (timerid < 0) return -1;
 		*res = (void *)(INTPTR_MIN | (uintptr_t)td>>1);
 		break;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* 回复: [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait
  2024-07-10 19:11 [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait Tavian Barnes
@ 2024-07-11 14:56 ` Li JinCheng
  2024-07-23 23:51   ` [musl] " Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Li JinCheng @ 2024-07-11 14:56 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 4719 bytes --]

Hi
>pthread_barrier_destroy() waits for all threads to be done using the
>barrier before destroying it.  Without it, the storage for the barrier
>can be deallocated when timer_create() returns while the other thread is
>still using it inside the pthread_barrier_wait() implementation.

Sorry, I have changed to use outlook to send emails. Back to the original topic, I don't think add pthread_barrier_destroy in timer_create can solve this bug. In timer_create, b->_b_limit should be 1, and it will never meet the waiting condition (b->_b_limit < 0) in pthread_barrier_destroy.
And I think add additional synchronization to timer_create  is a temporary method. Maybe some bugfix is needed in pthread_barrier_wait.

int pthread_barrier_destroy(pthread_barrier_t *b)
{
    if (b->_b_limit < 0) {
        if (b->_b_lock) {
            int v;
            a_or(&b->_b_lock, INT_MIN);
            while ((v = b->_b_lock) & INT_MAX)
                __wait(&b->_b_lock, 0, v, 0);
        }
        __vm_wait();
    }
    return 0;
}


Li
________________________________
发件人: Tavian Barnes <tavianator@tavianator.com>
发送时间: 2024年7月11日 3:11
收件人: musl@lists.openwall.com <musl@lists.openwall.com>
抄送: 250200715@qq.com <250200715@qq.com>
主题: [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait

(I cleaned up the HTML entities in this email, but please use plain text
mode next time)

> Hello:
>
> I had a low-probability crash in the child thread when using the
> timer_create interface. After debug, I found that the crash occured
> when the sub-thread accessed in code "if (b->_b_waiters)" which is a
> stack variable created in the main thread and passed to child thread
> by args. It looks like the main thread's timer_create has finished
> executing at this point, so the variables (start_args) on the stack
> have been cleaned up. I take a look at the pthread_barrier_wait code
> and I think it should be a scheduling problem in pthread_barrier_wait.
>
> Take the timer_create as an example, when the child thread is the
> first thread for "pthread_barrier_wait" and it is suspened after it
> executes the code "a_store(&b->_b_lock, 0)", then the main thread in
> timer_create will arrive as the last thread, it will nerver wait for
> the child thread to be rescheduled, the main thread can pass the
> barrier and continue execution, the args created in timer_create will
> be cleaned up. when the child thread is finally rescheduled, it access
> the "b->_b_waiters" which has already been cleaned up by main thread
> and the crash will occur.
>
> Is there a bug here? Looking forward to your reply.
>
>     /* First thread to enter the barrier becomes the "instance owner" */
>     if (!inst) {
>         struct instance new_inst = { 0 };
>         int spins = 200;
>         b->_b_inst = inst = &new_inst;
>         a_store(&b->_b_lock, 0);
>         if (b->_b_waiters) __wake(&b->_b_lock, 1, 1);  // crash here b->_b_waiters
>         while (spins-- && !inst->finished)
>
>     /* First thread to enter the barrier becomes the "instance owner" */
>     if (!inst) {
>         struct instance new_inst = { 0 };
>         int spins = 200;
>         b->_b_inst = inst = &new_inst;
>         a_store(&b->_b_lock, 0);
>         // when the child thread is the first thread and is scheduled out here
>
>         if (b->_b_waiters) __wake(&b->_b_lock, 1, 1);
>         while (spins-- && !inst->finished)

This looks like a real bug in timer_create() to me.  Here's an untested
fix:

From: Tavian Barnes <tavianator@tavianator.com>
Date: Wed, 10 Jul 2024 14:27:21 -0400
Subject: [PATCH] timer_create: destroy the barrier before returning

pthread_barrier_destroy() waits for all threads to be done using the
barrier before destroying it.  Without it, the storage for the barrier
can be deallocated when timer_create() returns while the other thread is
still using it inside the pthread_barrier_wait() implementation.

Link: https://www.openwall.com/lists/musl/2024/07/08/1
---
 src/time/timer_create.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/time/timer_create.c b/src/time/timer_create.c
index 9216b3ab..42c69848 100644
--- a/src/time/timer_create.c
+++ b/src/time/timer_create.c
@@ -121,6 +121,7 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict
                 }
                 td->timer_id = timerid;
                 pthread_barrier_wait(&args.b);
+               pthread_barrier_destroy(&args.b);
                 if (timerid < 0) return -1;
                 *res = (void *)(INTPTR_MIN | (uintptr_t)td>>1);
                 break;

[-- Attachment #2: Type: text/html, Size: 10930 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [musl] Re: 回复: [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait
  2024-07-11 14:56 ` 回复: " Li JinCheng
@ 2024-07-23 23:51   ` Rich Felker
  2024-07-24  0:11     ` [musl] " Thorsten Glaser
  2024-07-24 16:59     ` [musl] Re: 回复: [musl] " Rich Felker
  0 siblings, 2 replies; 6+ messages in thread
From: Rich Felker @ 2024-07-23 23:51 UTC (permalink / raw)
  To: Li JinCheng; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 1366 bytes --]

On Thu, Jul 11, 2024 at 02:56:55PM +0000, Li JinCheng wrote:
> Hi
> >pthread_barrier_destroy() waits for all threads to be done using the
> >barrier before destroying it.  Without it, the storage for the barrier
> >can be deallocated when timer_create() returns while the other thread is
> >still using it inside the pthread_barrier_wait() implementation.
> 
> Sorry, I have changed to use outlook to send emails. Back to the
> original topic, I don't think add pthread_barrier_destroy in
> timer_create can solve this bug. In timer_create, b->_b_limit should
> be 1, and it will never meet the waiting condition (b->_b_limit < 0)
> in pthread_barrier_destroy.
> And I think add additional synchronization to timer_create is a
> temporary method. Maybe some bugfix is needed in
> pthread_barrier_wait.

There's an old unverified suspicion that our pthread barrier
implementation has serious flaws. Unfortunately I hadn't gotten around
to looking into it, partly because it's such an obscure sync primitive
that almost nothing uses. We really DO need to fix it, but I don't
want fixing timer_create to be blocked on that. Attached is a patch
I'll probably apply (not yet tested) that just uses semaphores instead
of a barrier. I think this will be an improvement even after barriers
are fixed, since it will avoid pulling in the barrier code in static
linking.

Rich

[-- Attachment #2: timer_create_avoid_barrier.diff --]
[-- Type: text/plain, Size: 1432 bytes --]

diff --git a/src/time/timer_create.c b/src/time/timer_create.c
index 9216b3ab..9dc0fb05 100644
--- a/src/time/timer_create.c
+++ b/src/time/timer_create.c
@@ -1,6 +1,7 @@
 #include <time.h>
 #include <setjmp.h>
 #include <limits.h>
+#include <semaphore.h>
 #include "pthread_impl.h"
 #include "atomic.h"
 
@@ -12,7 +13,7 @@ struct ksigevent {
 };
 
 struct start_args {
-	pthread_barrier_t b;
+	sem_t sem1, sem2;
 	struct sigevent *sev;
 };
 
@@ -42,7 +43,8 @@ static void *start(void *arg)
 	void (*notify)(union sigval) = args->sev->sigev_notify_function;
 	union sigval val = args->sev->sigev_value;
 
-	pthread_barrier_wait(&args->b);
+	sem_post(&args->sem1);
+	while (sem_wait(&args->sem2));
 	if (self->cancel)
 		return 0;
 	for (;;) {
@@ -99,7 +101,8 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict
 		else
 			pthread_attr_init(&attr);
 		pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
-		pthread_barrier_init(&args.b, 0, 2);
+		sem_init(&args.sem1, 0, 0);
+		sem_init(&args.sem2, 0, 0);
 		args.sev = evp;
 
 		__block_app_sigs(&set);
@@ -120,7 +123,8 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict
 			td->cancel = 1;
 		}
 		td->timer_id = timerid;
-		pthread_barrier_wait(&args.b);
+		sem_post(&args.sem2);
+		while (sem_wait(&args.sem1));
 		if (timerid < 0) return -1;
 		*res = (void *)(INTPTR_MIN | (uintptr_t)td>>1);
 		break;

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [musl] Re: Maybe A Bug about timer_create and pthread_barrier_wait
  2024-07-23 23:51   ` [musl] " Rich Felker
@ 2024-07-24  0:11     ` Thorsten Glaser
  2024-07-24  0:25       ` Rich Felker
  2024-07-24 16:59     ` [musl] Re: 回复: [musl] " Rich Felker
  1 sibling, 1 reply; 6+ messages in thread
From: Thorsten Glaser @ 2024-07-24  0:11 UTC (permalink / raw)
  To: musl

Rich Felker dixit:

>There's an old unverified suspicion that our pthread barrier
>implementation has serious flaws. Unfortunately I hadn't gotten around

From having looked at the documentation and the OP’s description
of their understanding of the issue (I don’t understand enough of
the actual code), I got the idea that maybe the musl implementation
accesses the pthread_barrier_t object after at least one of the
threads has returned (which can lead to freeing the object by the
caller). I don’t know enough about parallel programming to be any
more help in debugging that though, or even in checking whether
that’s indeed the case.

bye,
//mirabilos
-- 
<igli> exceptions: a truly awful implementation of quite a nice idea.
<igli> just about the worst way you could do something like that, afaic.
<igli> it's like anti-design.  <mirabilos> that too… may I quote you on that?
<igli> sure, tho i doubt anyone will listen ;)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] Re: Maybe A Bug about timer_create and pthread_barrier_wait
  2024-07-24  0:11     ` [musl] " Thorsten Glaser
@ 2024-07-24  0:25       ` Rich Felker
  0 siblings, 0 replies; 6+ messages in thread
From: Rich Felker @ 2024-07-24  0:25 UTC (permalink / raw)
  To: Thorsten Glaser; +Cc: musl

On Wed, Jul 24, 2024 at 12:11:28AM +0000, Thorsten Glaser wrote:
> Rich Felker dixit:
> 
> >There's an old unverified suspicion that our pthread barrier
> >implementation has serious flaws. Unfortunately I hadn't gotten around
> 
> From having looked at the documentation and the OP’s description
> of their understanding of the issue (I don’t understand enough of
> the actual code), I got the idea that maybe the musl implementation
> accesses the pthread_barrier_t object after at least one of the
> threads has returned (which can lead to freeing the object by the
> caller). I don’t know enough about parallel programming to be any
> more help in debugging that though, or even in checking whether
> that’s indeed the case.

The contract of pthread barriers is that the barrier object is no
longer in use (and thus fair game to free/clobber) as soon as any
waiter has returned. This is an extremely strong constraint on the
implementation making for an extremely easy-to-use primitive for the
application, assuming it actually works.

I wrote this code a long, long time ago, but with the intent that it
works, based on doing the actual synchronization in the non-pshared
case with objects on the waiters' stacks rather than the barrier
object, and just using the barrier object to mediate setup of a wait
instance. But there's probably something wrong. I just have to go back
and understand it well enough to figure out what...

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] Re: 回复: [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait
  2024-07-23 23:51   ` [musl] " Rich Felker
  2024-07-24  0:11     ` [musl] " Thorsten Glaser
@ 2024-07-24 16:59     ` Rich Felker
  1 sibling, 0 replies; 6+ messages in thread
From: Rich Felker @ 2024-07-24 16:59 UTC (permalink / raw)
  To: Li JinCheng; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 2436 bytes --]

On Tue, Jul 23, 2024 at 07:51:00PM -0400, Rich Felker wrote:
> On Thu, Jul 11, 2024 at 02:56:55PM +0000, Li JinCheng wrote:
> > Hi
> > >pthread_barrier_destroy() waits for all threads to be done using the
> > >barrier before destroying it.  Without it, the storage for the barrier
> > >can be deallocated when timer_create() returns while the other thread is
> > >still using it inside the pthread_barrier_wait() implementation.
> > 
> > Sorry, I have changed to use outlook to send emails. Back to the
> > original topic, I don't think add pthread_barrier_destroy in
> > timer_create can solve this bug. In timer_create, b->_b_limit should
> > be 1, and it will never meet the waiting condition (b->_b_limit < 0)
> > in pthread_barrier_destroy.
> > And I think add additional synchronization to timer_create is a
> > temporary method. Maybe some bugfix is needed in
> > pthread_barrier_wait.
> 
> There's an old unverified suspicion that our pthread barrier
> implementation has serious flaws. Unfortunately I hadn't gotten around
> to looking into it, partly because it's such an obscure sync primitive
> that almost nothing uses. We really DO need to fix it, but I don't
> want fixing timer_create to be blocked on that. Attached is a patch
> I'll probably apply (not yet tested) that just uses semaphores instead
> of a barrier. I think this will be an improvement even after barriers
> are fixed, since it will avoid pulling in the barrier code in static
> linking.
> 
> Rich

> diff --git a/src/time/timer_create.c b/src/time/timer_create.c
> index 9216b3ab..9dc0fb05 100644
> --- a/src/time/timer_create.c
> +++ b/src/time/timer_create.c
> @@ -1,6 +1,7 @@
>  #include <time.h>
>  #include <setjmp.h>
>  #include <limits.h>
> +#include <semaphore.h>
>  #include "pthread_impl.h"
>  #include "atomic.h"
>  
> @@ -12,7 +13,7 @@ struct ksigevent {
>  };
>  
>  struct start_args {
> -	pthread_barrier_t b;
> +	sem_t sem1, sem2;
>  	struct sigevent *sev;
>  };
>  
> @@ -42,7 +43,8 @@ static void *start(void *arg)
>  	void (*notify)(union sigval) = args->sev->sigev_notify_function;
>  	union sigval val = args->sev->sigev_value;
>  
> -	pthread_barrier_wait(&args->b);
> +	sem_post(&args->sem1);
> +	while (sem_wait(&args->sem2));

This code as written is wrong. Since the parent owns the object
lifetimes, the last operation in the child (timer) must be posting to
the semaphore the parent is waiting on. New draft attached.

Rich

[-- Attachment #2: 0001-timer_create-replace-pthread-barrier-with-semaphores.patch --]
[-- Type: text/plain, Size: 2887 bytes --]

From cde213f9c3ac1aa168581222edee6a6642113323 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Wed, 24 Jul 2024 12:41:04 -0400
Subject: [PATCH] timer_create: replace pthread barrier with semaphores for
 thread start

our pthread barrier implementation reportedly has bugs that are could
lead to malfunction or crash in timer_create. while this has not been
reviewed to confirm, there have been past reports of pthread barrier
bugs, and it seems likely that something is actually wrong.

pthread barriers are an obscure primitive, and timer_create is the
only place we are using them internally at present. even if they were
working correctly, this means we are imposing linking of otherwise
likely-dead code whenever timer_create is used.

a pair of semaphores functions identically to a 2-waiter barrier
except for destruction order properties. since the parent is
responsible for the argument structure (including semaphores)
lifetimes, the last operation on them in the timer thread must be
posting to the parent.
---
 src/time/timer_create.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/time/timer_create.c b/src/time/timer_create.c
index 9216b3ab..424c70b7 100644
--- a/src/time/timer_create.c
+++ b/src/time/timer_create.c
@@ -1,6 +1,7 @@
 #include <time.h>
 #include <setjmp.h>
 #include <limits.h>
+#include <semaphore.h>
 #include "pthread_impl.h"
 #include "atomic.h"
 
@@ -12,7 +13,7 @@ struct ksigevent {
 };
 
 struct start_args {
-	pthread_barrier_t b;
+	sem_t sem1, sem2;
 	struct sigevent *sev;
 };
 
@@ -42,7 +43,14 @@ static void *start(void *arg)
 	void (*notify)(union sigval) = args->sev->sigev_notify_function;
 	union sigval val = args->sev->sigev_value;
 
-	pthread_barrier_wait(&args->b);
+	/* The two-way semaphore synchronization ensures that we see
+	 * self->cancel set by the parent if timer creation failed or
+	 * self->timer_id if it succeeded, and informs the parent that
+	 * we are done accessing the arguments so that the parent can
+	 * proceed past their block lifetime. */
+	while (sem_wait(&args->sem1));
+	sem_post(&args->sem2);
+
 	if (self->cancel)
 		return 0;
 	for (;;) {
@@ -99,7 +107,8 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict
 		else
 			pthread_attr_init(&attr);
 		pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
-		pthread_barrier_init(&args.b, 0, 2);
+		sem_init(&args.sem1, 0, 0);
+		sem_init(&args.sem2, 0, 0);
 		args.sev = evp;
 
 		__block_app_sigs(&set);
@@ -120,7 +129,8 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict
 			td->cancel = 1;
 		}
 		td->timer_id = timerid;
-		pthread_barrier_wait(&args.b);
+		sem_post(&args.sem1);
+		while (sem_wait(&args.sem2));
 		if (timerid < 0) return -1;
 		*res = (void *)(INTPTR_MIN | (uintptr_t)td>>1);
 		break;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-07-24 17:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-10 19:11 [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait Tavian Barnes
2024-07-11 14:56 ` 回复: " Li JinCheng
2024-07-23 23:51   ` [musl] " Rich Felker
2024-07-24  0:11     ` [musl] " Thorsten Glaser
2024-07-24  0:25       ` Rich Felker
2024-07-24 16:59     ` [musl] Re: 回复: [musl] " Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).