* [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait @ 2024-07-10 19:11 Tavian Barnes 2024-07-11 14:56 ` 回复: " Li JinCheng 0 siblings, 1 reply; 6+ messages in thread From: Tavian Barnes @ 2024-07-10 19:11 UTC (permalink / raw) To: musl; +Cc: 250200715 (I cleaned up the HTML entities in this email, but please use plain text mode next time) > Hello: > > I had a low-probability crash in the child thread when using the > timer_create interface. After debug, I found that the crash occured > when the sub-thread accessed in code "if (b->_b_waiters)" which is a > stack variable created in the main thread and passed to child thread > by args. It looks like the main thread's timer_create has finished > executing at this point, so the variables (start_args) on the stack > have been cleaned up. I take a look at the pthread_barrier_wait code > and I think it should be a scheduling problem in pthread_barrier_wait. > > Take the timer_create as an example, when the child thread is the > first thread for "pthread_barrier_wait" and it is suspened after it > executes the code "a_store(&b->_b_lock, 0)", then the main thread in > timer_create will arrive as the last thread, it will nerver wait for > the child thread to be rescheduled, the main thread can pass the > barrier and continue execution, the args created in timer_create will > be cleaned up. when the child thread is finally rescheduled, it access > the "b->_b_waiters" which has already been cleaned up by main thread > and the crash will occur. > > Is there a bug here? Looking forward to your reply. > > /* First thread to enter the barrier becomes the "instance owner" */ > if (!inst) { > struct instance new_inst = { 0 }; > int spins = 200; > b->_b_inst = inst = &new_inst; > a_store(&b->_b_lock, 0); > if (b->_b_waiters) __wake(&b->_b_lock, 1, 1); // crash here b->_b_waiters > while (spins-- && !inst->finished) > > /* First thread to enter the barrier becomes the "instance owner" */ > if (!inst) { > struct instance new_inst = { 0 }; > int spins = 200; > b->_b_inst = inst = &new_inst; > a_store(&b->_b_lock, 0); > // when the child thread is the first thread and is scheduled out here > > if (b->_b_waiters) __wake(&b->_b_lock, 1, 1); > while (spins-- && !inst->finished) This looks like a real bug in timer_create() to me. Here's an untested fix: From: Tavian Barnes <tavianator@tavianator.com> Date: Wed, 10 Jul 2024 14:27:21 -0400 Subject: [PATCH] timer_create: destroy the barrier before returning pthread_barrier_destroy() waits for all threads to be done using the barrier before destroying it. Without it, the storage for the barrier can be deallocated when timer_create() returns while the other thread is still using it inside the pthread_barrier_wait() implementation. Link: https://www.openwall.com/lists/musl/2024/07/08/1 --- src/time/timer_create.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/time/timer_create.c b/src/time/timer_create.c index 9216b3ab..42c69848 100644 --- a/src/time/timer_create.c +++ b/src/time/timer_create.c @@ -121,6 +121,7 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict } td->timer_id = timerid; pthread_barrier_wait(&args.b); + pthread_barrier_destroy(&args.b); if (timerid < 0) return -1; *res = (void *)(INTPTR_MIN | (uintptr_t)td>>1); break; ^ permalink raw reply [flat|nested] 6+ messages in thread
* 回复: [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait 2024-07-10 19:11 [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait Tavian Barnes @ 2024-07-11 14:56 ` Li JinCheng 2024-07-23 23:51 ` [musl] " Rich Felker 0 siblings, 1 reply; 6+ messages in thread From: Li JinCheng @ 2024-07-11 14:56 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 4719 bytes --] Hi >pthread_barrier_destroy() waits for all threads to be done using the >barrier before destroying it. Without it, the storage for the barrier >can be deallocated when timer_create() returns while the other thread is >still using it inside the pthread_barrier_wait() implementation. Sorry, I have changed to use outlook to send emails. Back to the original topic, I don't think add pthread_barrier_destroy in timer_create can solve this bug. In timer_create, b->_b_limit should be 1, and it will never meet the waiting condition (b->_b_limit < 0) in pthread_barrier_destroy. And I think add additional synchronization to timer_create is a temporary method. Maybe some bugfix is needed in pthread_barrier_wait. int pthread_barrier_destroy(pthread_barrier_t *b) { if (b->_b_limit < 0) { if (b->_b_lock) { int v; a_or(&b->_b_lock, INT_MIN); while ((v = b->_b_lock) & INT_MAX) __wait(&b->_b_lock, 0, v, 0); } __vm_wait(); } return 0; } Li ________________________________ 发件人: Tavian Barnes <tavianator@tavianator.com> 发送时间: 2024年7月11日 3:11 收件人: musl@lists.openwall.com <musl@lists.openwall.com> 抄送: 250200715@qq.com <250200715@qq.com> 主题: [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait (I cleaned up the HTML entities in this email, but please use plain text mode next time) > Hello: > > I had a low-probability crash in the child thread when using the > timer_create interface. After debug, I found that the crash occured > when the sub-thread accessed in code "if (b->_b_waiters)" which is a > stack variable created in the main thread and passed to child thread > by args. It looks like the main thread's timer_create has finished > executing at this point, so the variables (start_args) on the stack > have been cleaned up. I take a look at the pthread_barrier_wait code > and I think it should be a scheduling problem in pthread_barrier_wait. > > Take the timer_create as an example, when the child thread is the > first thread for "pthread_barrier_wait" and it is suspened after it > executes the code "a_store(&b->_b_lock, 0)", then the main thread in > timer_create will arrive as the last thread, it will nerver wait for > the child thread to be rescheduled, the main thread can pass the > barrier and continue execution, the args created in timer_create will > be cleaned up. when the child thread is finally rescheduled, it access > the "b->_b_waiters" which has already been cleaned up by main thread > and the crash will occur. > > Is there a bug here? Looking forward to your reply. > > /* First thread to enter the barrier becomes the "instance owner" */ > if (!inst) { > struct instance new_inst = { 0 }; > int spins = 200; > b->_b_inst = inst = &new_inst; > a_store(&b->_b_lock, 0); > if (b->_b_waiters) __wake(&b->_b_lock, 1, 1); // crash here b->_b_waiters > while (spins-- && !inst->finished) > > /* First thread to enter the barrier becomes the "instance owner" */ > if (!inst) { > struct instance new_inst = { 0 }; > int spins = 200; > b->_b_inst = inst = &new_inst; > a_store(&b->_b_lock, 0); > // when the child thread is the first thread and is scheduled out here > > if (b->_b_waiters) __wake(&b->_b_lock, 1, 1); > while (spins-- && !inst->finished) This looks like a real bug in timer_create() to me. Here's an untested fix: From: Tavian Barnes <tavianator@tavianator.com> Date: Wed, 10 Jul 2024 14:27:21 -0400 Subject: [PATCH] timer_create: destroy the barrier before returning pthread_barrier_destroy() waits for all threads to be done using the barrier before destroying it. Without it, the storage for the barrier can be deallocated when timer_create() returns while the other thread is still using it inside the pthread_barrier_wait() implementation. Link: https://www.openwall.com/lists/musl/2024/07/08/1 --- src/time/timer_create.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/time/timer_create.c b/src/time/timer_create.c index 9216b3ab..42c69848 100644 --- a/src/time/timer_create.c +++ b/src/time/timer_create.c @@ -121,6 +121,7 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict } td->timer_id = timerid; pthread_barrier_wait(&args.b); + pthread_barrier_destroy(&args.b); if (timerid < 0) return -1; *res = (void *)(INTPTR_MIN | (uintptr_t)td>>1); break; [-- Attachment #2: Type: text/html, Size: 10930 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* [musl] Re: 回复: [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait 2024-07-11 14:56 ` 回复: " Li JinCheng @ 2024-07-23 23:51 ` Rich Felker 2024-07-24 0:11 ` [musl] " Thorsten Glaser 2024-07-24 16:59 ` [musl] Re: 回复: [musl] " Rich Felker 0 siblings, 2 replies; 6+ messages in thread From: Rich Felker @ 2024-07-23 23:51 UTC (permalink / raw) To: Li JinCheng; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 1366 bytes --] On Thu, Jul 11, 2024 at 02:56:55PM +0000, Li JinCheng wrote: > Hi > >pthread_barrier_destroy() waits for all threads to be done using the > >barrier before destroying it. Without it, the storage for the barrier > >can be deallocated when timer_create() returns while the other thread is > >still using it inside the pthread_barrier_wait() implementation. > > Sorry, I have changed to use outlook to send emails. Back to the > original topic, I don't think add pthread_barrier_destroy in > timer_create can solve this bug. In timer_create, b->_b_limit should > be 1, and it will never meet the waiting condition (b->_b_limit < 0) > in pthread_barrier_destroy. > And I think add additional synchronization to timer_create is a > temporary method. Maybe some bugfix is needed in > pthread_barrier_wait. There's an old unverified suspicion that our pthread barrier implementation has serious flaws. Unfortunately I hadn't gotten around to looking into it, partly because it's such an obscure sync primitive that almost nothing uses. We really DO need to fix it, but I don't want fixing timer_create to be blocked on that. Attached is a patch I'll probably apply (not yet tested) that just uses semaphores instead of a barrier. I think this will be an improvement even after barriers are fixed, since it will avoid pulling in the barrier code in static linking. Rich [-- Attachment #2: timer_create_avoid_barrier.diff --] [-- Type: text/plain, Size: 1432 bytes --] diff --git a/src/time/timer_create.c b/src/time/timer_create.c index 9216b3ab..9dc0fb05 100644 --- a/src/time/timer_create.c +++ b/src/time/timer_create.c @@ -1,6 +1,7 @@ #include <time.h> #include <setjmp.h> #include <limits.h> +#include <semaphore.h> #include "pthread_impl.h" #include "atomic.h" @@ -12,7 +13,7 @@ struct ksigevent { }; struct start_args { - pthread_barrier_t b; + sem_t sem1, sem2; struct sigevent *sev; }; @@ -42,7 +43,8 @@ static void *start(void *arg) void (*notify)(union sigval) = args->sev->sigev_notify_function; union sigval val = args->sev->sigev_value; - pthread_barrier_wait(&args->b); + sem_post(&args->sem1); + while (sem_wait(&args->sem2)); if (self->cancel) return 0; for (;;) { @@ -99,7 +101,8 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict else pthread_attr_init(&attr); pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); - pthread_barrier_init(&args.b, 0, 2); + sem_init(&args.sem1, 0, 0); + sem_init(&args.sem2, 0, 0); args.sev = evp; __block_app_sigs(&set); @@ -120,7 +123,8 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict td->cancel = 1; } td->timer_id = timerid; - pthread_barrier_wait(&args.b); + sem_post(&args.sem2); + while (sem_wait(&args.sem1)); if (timerid < 0) return -1; *res = (void *)(INTPTR_MIN | (uintptr_t)td>>1); break; ^ permalink raw reply [flat|nested] 6+ messages in thread
* [musl] Re: Maybe A Bug about timer_create and pthread_barrier_wait 2024-07-23 23:51 ` [musl] " Rich Felker @ 2024-07-24 0:11 ` Thorsten Glaser 2024-07-24 0:25 ` Rich Felker 2024-07-24 16:59 ` [musl] Re: 回复: [musl] " Rich Felker 1 sibling, 1 reply; 6+ messages in thread From: Thorsten Glaser @ 2024-07-24 0:11 UTC (permalink / raw) To: musl Rich Felker dixit: >There's an old unverified suspicion that our pthread barrier >implementation has serious flaws. Unfortunately I hadn't gotten around From having looked at the documentation and the OP’s description of their understanding of the issue (I don’t understand enough of the actual code), I got the idea that maybe the musl implementation accesses the pthread_barrier_t object after at least one of the threads has returned (which can lead to freeing the object by the caller). I don’t know enough about parallel programming to be any more help in debugging that though, or even in checking whether that’s indeed the case. bye, //mirabilos -- <igli> exceptions: a truly awful implementation of quite a nice idea. <igli> just about the worst way you could do something like that, afaic. <igli> it's like anti-design. <mirabilos> that too… may I quote you on that? <igli> sure, tho i doubt anyone will listen ;) ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [musl] Re: Maybe A Bug about timer_create and pthread_barrier_wait 2024-07-24 0:11 ` [musl] " Thorsten Glaser @ 2024-07-24 0:25 ` Rich Felker 0 siblings, 0 replies; 6+ messages in thread From: Rich Felker @ 2024-07-24 0:25 UTC (permalink / raw) To: Thorsten Glaser; +Cc: musl On Wed, Jul 24, 2024 at 12:11:28AM +0000, Thorsten Glaser wrote: > Rich Felker dixit: > > >There's an old unverified suspicion that our pthread barrier > >implementation has serious flaws. Unfortunately I hadn't gotten around > > From having looked at the documentation and the OP’s description > of their understanding of the issue (I don’t understand enough of > the actual code), I got the idea that maybe the musl implementation > accesses the pthread_barrier_t object after at least one of the > threads has returned (which can lead to freeing the object by the > caller). I don’t know enough about parallel programming to be any > more help in debugging that though, or even in checking whether > that’s indeed the case. The contract of pthread barriers is that the barrier object is no longer in use (and thus fair game to free/clobber) as soon as any waiter has returned. This is an extremely strong constraint on the implementation making for an extremely easy-to-use primitive for the application, assuming it actually works. I wrote this code a long, long time ago, but with the intent that it works, based on doing the actual synchronization in the non-pshared case with objects on the waiters' stacks rather than the barrier object, and just using the barrier object to mediate setup of a wait instance. But there's probably something wrong. I just have to go back and understand it well enough to figure out what... Rich ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [musl] Re: 回复: [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait 2024-07-23 23:51 ` [musl] " Rich Felker 2024-07-24 0:11 ` [musl] " Thorsten Glaser @ 2024-07-24 16:59 ` Rich Felker 1 sibling, 0 replies; 6+ messages in thread From: Rich Felker @ 2024-07-24 16:59 UTC (permalink / raw) To: Li JinCheng; +Cc: musl [-- Attachment #1: Type: text/plain, Size: 2436 bytes --] On Tue, Jul 23, 2024 at 07:51:00PM -0400, Rich Felker wrote: > On Thu, Jul 11, 2024 at 02:56:55PM +0000, Li JinCheng wrote: > > Hi > > >pthread_barrier_destroy() waits for all threads to be done using the > > >barrier before destroying it. Without it, the storage for the barrier > > >can be deallocated when timer_create() returns while the other thread is > > >still using it inside the pthread_barrier_wait() implementation. > > > > Sorry, I have changed to use outlook to send emails. Back to the > > original topic, I don't think add pthread_barrier_destroy in > > timer_create can solve this bug. In timer_create, b->_b_limit should > > be 1, and it will never meet the waiting condition (b->_b_limit < 0) > > in pthread_barrier_destroy. > > And I think add additional synchronization to timer_create is a > > temporary method. Maybe some bugfix is needed in > > pthread_barrier_wait. > > There's an old unverified suspicion that our pthread barrier > implementation has serious flaws. Unfortunately I hadn't gotten around > to looking into it, partly because it's such an obscure sync primitive > that almost nothing uses. We really DO need to fix it, but I don't > want fixing timer_create to be blocked on that. Attached is a patch > I'll probably apply (not yet tested) that just uses semaphores instead > of a barrier. I think this will be an improvement even after barriers > are fixed, since it will avoid pulling in the barrier code in static > linking. > > Rich > diff --git a/src/time/timer_create.c b/src/time/timer_create.c > index 9216b3ab..9dc0fb05 100644 > --- a/src/time/timer_create.c > +++ b/src/time/timer_create.c > @@ -1,6 +1,7 @@ > #include <time.h> > #include <setjmp.h> > #include <limits.h> > +#include <semaphore.h> > #include "pthread_impl.h" > #include "atomic.h" > > @@ -12,7 +13,7 @@ struct ksigevent { > }; > > struct start_args { > - pthread_barrier_t b; > + sem_t sem1, sem2; > struct sigevent *sev; > }; > > @@ -42,7 +43,8 @@ static void *start(void *arg) > void (*notify)(union sigval) = args->sev->sigev_notify_function; > union sigval val = args->sev->sigev_value; > > - pthread_barrier_wait(&args->b); > + sem_post(&args->sem1); > + while (sem_wait(&args->sem2)); This code as written is wrong. Since the parent owns the object lifetimes, the last operation in the child (timer) must be posting to the semaphore the parent is waiting on. New draft attached. Rich [-- Attachment #2: 0001-timer_create-replace-pthread-barrier-with-semaphores.patch --] [-- Type: text/plain, Size: 2887 bytes --] From cde213f9c3ac1aa168581222edee6a6642113323 Mon Sep 17 00:00:00 2001 From: Rich Felker <dalias@aerifal.cx> Date: Wed, 24 Jul 2024 12:41:04 -0400 Subject: [PATCH] timer_create: replace pthread barrier with semaphores for thread start our pthread barrier implementation reportedly has bugs that are could lead to malfunction or crash in timer_create. while this has not been reviewed to confirm, there have been past reports of pthread barrier bugs, and it seems likely that something is actually wrong. pthread barriers are an obscure primitive, and timer_create is the only place we are using them internally at present. even if they were working correctly, this means we are imposing linking of otherwise likely-dead code whenever timer_create is used. a pair of semaphores functions identically to a 2-waiter barrier except for destruction order properties. since the parent is responsible for the argument structure (including semaphores) lifetimes, the last operation on them in the timer thread must be posting to the parent. --- src/time/timer_create.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/src/time/timer_create.c b/src/time/timer_create.c index 9216b3ab..424c70b7 100644 --- a/src/time/timer_create.c +++ b/src/time/timer_create.c @@ -1,6 +1,7 @@ #include <time.h> #include <setjmp.h> #include <limits.h> +#include <semaphore.h> #include "pthread_impl.h" #include "atomic.h" @@ -12,7 +13,7 @@ struct ksigevent { }; struct start_args { - pthread_barrier_t b; + sem_t sem1, sem2; struct sigevent *sev; }; @@ -42,7 +43,14 @@ static void *start(void *arg) void (*notify)(union sigval) = args->sev->sigev_notify_function; union sigval val = args->sev->sigev_value; - pthread_barrier_wait(&args->b); + /* The two-way semaphore synchronization ensures that we see + * self->cancel set by the parent if timer creation failed or + * self->timer_id if it succeeded, and informs the parent that + * we are done accessing the arguments so that the parent can + * proceed past their block lifetime. */ + while (sem_wait(&args->sem1)); + sem_post(&args->sem2); + if (self->cancel) return 0; for (;;) { @@ -99,7 +107,8 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict else pthread_attr_init(&attr); pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); - pthread_barrier_init(&args.b, 0, 2); + sem_init(&args.sem1, 0, 0); + sem_init(&args.sem2, 0, 0); args.sev = evp; __block_app_sigs(&set); @@ -120,7 +129,8 @@ int timer_create(clockid_t clk, struct sigevent *restrict evp, timer_t *restrict td->cancel = 1; } td->timer_id = timerid; - pthread_barrier_wait(&args.b); + sem_post(&args.sem1); + while (sem_wait(&args.sem2)); if (timerid < 0) return -1; *res = (void *)(INTPTR_MIN | (uintptr_t)td>>1); break; -- 2.21.0 ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2024-07-24 17:00 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-07-10 19:11 [musl] RE: Maybe A Bug about timer_create and pthread_barrier_wait Tavian Barnes 2024-07-11 14:56 ` 回复: " Li JinCheng 2024-07-23 23:51 ` [musl] " Rich Felker 2024-07-24 0:11 ` [musl] " Thorsten Glaser 2024-07-24 0:25 ` Rich Felker 2024-07-24 16:59 ` [musl] Re: 回复: [musl] " Rich Felker
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).