mailing list of musl libc
* [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
@ 2022-03-11 19:41 Gavin Howard
  2022-03-11 20:21 ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Howard @ 2022-03-11 19:41 UTC (permalink / raw)
  To: musl

Hello,

If this email seems rushed, it's because I was told musl is close to
being released and that you all would want to know about this bug
quickly.

All of this was tested on musl 1.2.2.

I'm writing a multi-threaded build system, and I've run into an
interesting error when I compile with musl (the musl-gcc generated by
building musl from source).

Background:

I have two main threads and as many worker threads as desired.

Worker Threads: Responsible for actually building targets. Once they are
spawned, they wait on a semaphore for targets and grab them (with a
mutex) off of a queue. When worker threads run a child process, they do
the following:

* Spawn the child process (whether through posix_spawn() or fork()+exec()).
* Check the return value to make sure the child process was created.
* Take a mutex.
* Add the child pid, plus an fd for communicating with the thread that
  spawned the child, to a map.
* Post a semaphore.
* Unlock the mutex.
* Continue executing the target and other targets.

Main Thread #1: Spawns all of the other threads, waits on the worker
threads to join, and once they do, does the following:

* Locks the mutex mentioned above.
* Sets a flag.
* Posts the semaphore mentioned above.
* Unlocks the mutex.
* Calls pthread_join() on Main Thread #2.

Main Thread #2: Responsible for sending info about all reaped children
to the threads that spawned them. Once it is started, it does the
following:

* Waits on the semaphore mentioned above.
* Locks the mutex mentioned above.
* Checks the flag mentioned above. If it is set, it unlocks the mutex
  and does a pthread_exit().
* If the flag is not set, it continues and:
* Unlocks the mutex.
* Does a waitpid().
* Checks the result of waitpid(). If it failed with ECHILD, it causes
  an abort() (for the purposes of debugging this issue).
* Locks the mutex again.
* Sends the child status info to the right thread using the map
  mentioned above.
* Unlocks the mutex.
* Repeat the process.

When running my code under both musl and glibc, the abort() mentioned
above was triggered every so often at the end. At the time, I was using
an atomic flag with the semaphore rather than the mutex-protected flag
described above. I could trigger the problem every
10-20 minutes by continually bootstrapping the build system, and it
always happened at the end of the build, when the semaphore was posted
to have Main Thread #2 exit.

The reason the atomic variable did not work is (I suspect) because there
is no guarantee of compiler or processor ordering with pure,
memory_order_relaxed atomic variables. I suspect that the processor was
reordering the instructions such that the semaphore was signalled by
Main Thread #1 before it set the atomic variable, even though the atomic
store to the variable was before the semaphore post call in the source.
The setting of the flag could then come *after* Main Thread #2 did an
atomic load on it.

So I replaced it with the scheme I laid out above, and on glibc, the
issue still triggers; it just takes about 4 hours to do it.

However, on musl, I can still consistently trigger the issue in less
than 10 minutes.

Nevertheless, I still think the new scheme is more correct, and here's why: the
mutex should enforce ordering in the processor. By holding the mutex
when the semaphore is posted, the post cannot be reordered to before the
taking of the mutex lock, or after releasing it. Then, because the flag
is also set while holding the mutex lock, this means that setting the
flag cannot be reordered to before the lock. This means that the setting
of the flag always comes before the checking of the flag, even if Main
Thread #2's wait on the semaphore ends *before* the lock is released.

This, in turn, means that when Main Thread #2 takes the lock after
waiting on the semaphore, the flag *will* be set if it has been set at
all.

Likewise, since a thread adding a process to the map holds the lock
while adding the process to the map, and posts the semaphore *while
holding the lock*, then if Main Thread #2's wait on the semaphore ends
before the lock is released, it is still forced to wait to take the
lock. This taking of the lock also forces the waitpid() call to happen
*after* the releasing of the lock by the thread that is adding a process
to the map.

In like manner, the taking and releasing of the lock by the worker
thread means that the creation of the child process must happen before
the release of the lock, so I believe that there should *never* be a
situation where waitpid() is called before a child process is properly
created.

I believe all of that together means that there should *never* be a
situation where waitpid() returns ECHILD. Yet musl does, consistently.

I've checked the musl source for waitpid(), and it's just a syscall, as
it should be, so I don't think the bug is in waitpid().

So after eliminating other possibilities, it seems to me there may be a
bug in either musl's semaphores, its mutexes, or both.

Now, I fully admit that I could have done my analysis wrong and still
have a bug in my code. I admit that this could still be PEBKAC. But even
in that case, I need the help of the musl community to figure out where
I did go wrong.

Thank you for your time.

Gavin Howard

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-11 19:41 [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes Gavin Howard
@ 2022-03-11 20:21 ` Rich Felker
  2022-03-12  5:45   ` Gavin Howard
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2022-03-11 20:21 UTC (permalink / raw)
  To: Gavin Howard; +Cc: musl

On Fri, Mar 11, 2022 at 12:41:05PM -0700, Gavin Howard wrote:
> Hello,
> 
> If this email seems rushed, it's because I was told musl is close to
> being released and that you all would want to know about this bug
> quickly.
> 
> All of this was tested on musl 1.2.2.
> 
> I'm writing a multi-threaded build system, and I've run into an
> interesting error when I compile with musl (the musl-gcc generated by
> building musl from source).
> 
> Background:
> 
> I have two main threads and as many worker threads as desired.
> 
> Worker Threads: Responsible for actually building targets. Once they are
> spawned, they wait on a semaphore for targets and grab them (with a
> mutex) off of a queue. When worker threads run a child process, they do
> the following:
> 
> * Run the process (whether through posix_spawn or fork()+exec()).
> * Check the return value to make sure the child process was created.
> * Take a mutex.
> * Add the child pid and an fd to communicate with the thread that is the
>   parent of the child to a map.
> * Post a semaphore.
> * Unlock the mutex.
> * Continue executing the target and other targets.
> 
> Main Thread #1: Spawns all of the other threads, waits on the worker
> threads to join, and once they do, does the following:
> 
> * Locks the mutex mentioned above.
> * Sets a flag.
> * Posts the semaphore mentioned above.
> * Unlocks the mutex.
> * Calls pthread_join() on Main Thread #2.
> 
> Main Thread #2: Responsible for sending info about all reaped children
> to the threads that spawned them. Once it is started, it does the
> following:
> 
> * Waits on the semaphore mentioned above.
> * Locks the mutex mentioned above.
> * Checks the flag mentioned above. If it is set, it unlocks the mutex
>   and does a pthread_exit().
> * If the flag is not set, it continues and:
> * Unlocks the mutex.
> * Does a waitpid().
> * Checks the return value of waitpid(). If it's ECHILD, it causes an
>   abort() (for the purposes of debugging this issue).
> * Locks the mutex again.
> * Sends the child status info to the right thread using the map
>   mentioned above.
> * Unlocks the mutex.
> * Repeat the process.
> 
> When running my code under musl and glibc, the abort() mentioned above
> is triggered every so often at the end. This was because I was using an
> atomic variable with the semaphore. I could trigger the problem every
> 10-20 minutes by continually bootstrapping the build system, and it
> always happened at the end of the build, when the semaphore was posted
> to have Main Thread #2 exit.
> 
> The reason the atomic variable did not work is (I suspect) because there
> is no guarantee of compiler or processor ordering with pure,
> memory_order_relaxed atomic variables. I suspect that the processor was
> reordering the instructions such that the semaphore was signalled by
> Main Thread #1 before it set the atomic variable, even though the atomic
> store to the variable was before the semaphore post call in the source.
> The setting of the flag could then come *after* Main Thread #2 did an
> atomic load on it.
> 
> So I replaced it with the scheme I laid out above, and on glibc, the
> issue still triggers; it just takes about 4 hours to do it.
> 
> However, on musl, I can still consistently trigger the issue in less
> than 10 minutes.
> 
> Nevertheless, I still think it is more correct, and here's why: the
> mutex should enforce ordering in the processor. By holding the mutex
> when the semaphore is posted, the post cannot be reordered to before the
> taking of the mutex lock, or after releasing it. Then, because the flag
> is also set while holding the mutex lock, this means that setting the
> flag cannot be reordered to before the lock. This means that the setting
> of the flag always comes before the checking of the flag, even if Main
> Thread #2's wait on the semaphore ends *before* the lock is released.
> 
> This, in turn, means that when Main Thread #2 takes the lock after
> waiting on the semaphore, the flag *will* be set if it has been set at
> all.
> 
> Likewise, since a thread adding a process to the map holds the lock
> while adding the process to the map, and posts the semaphore *while
> holding the lock*, then if Main Thread #2's wait on the semaphore ends
> before the lock is released, it is still forced to wait to take the
> lock. This taking of the lock also forces the waitpid() call to happen
> *after* the releasing of the lock by the thread that is adding a process
> to the map.
> 
> In like manner, the taking and releasing of the lock by the worker
> thread means that the creation of the child process must happen before
> the release of the lock, so I believe that there should *never* be a
> situation where waitpid() is called before a child process is properly
> created.
> 
> I believe all of that together means that there should *never* be a
> situation where waitpid() returns ECHILD. Yet musl does, consistently.
> 
> I've checked the musl source for waitpid(), and it's just a syscall, as
> it should be, so I don't think the bug is in waitpid().
> 
> So after eliminating other possibilities, it seems to me there may be a
> bug in either musl's semaphores, its mutexes, or both.
> 
> Now, I fully admit that I could have done my analysis wrong and still
> have a bug in my code. I admit that this could still be PEBKAC. But even
> in that case, I need the help of the musl community to figure out where
> I did go wrong.
> 
> Thank you for your time.

I don't think there's enough detail to debug your issue based on the
high-level description you gave. The bug is almost surely in something
we can't see. As long as all modification to and reading from the flag
is done with the mutex held, you don't need to worry about data races
on it. 

You should probably look at the strace of your program to see what's
going on with the ECHILD. It's not clear whether you're calling
waitpid with an argument of -1 or a specific pid, but in the latter
case you could look for whether the child was already reaped and
whether the pid was ever valid (from the strace log). Also, be aware
that ignoring SIGCHLD may break waiting for it, in case you're doing
that.

As an aside, overall, the whole setup looks like you were looking for
condition variables but skipped over them for some reason.

Rich


* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-11 20:21 ` Rich Felker
@ 2022-03-12  5:45   ` Gavin Howard
  2022-03-12 15:01     ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Howard @ 2022-03-12  5:45 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

> I don't think there's enough detail to debug your issue based on the
> high-level description you gave. The bug is almost surely in something
> we can't see. As long as all modification to and reading from the flag
> is done with the mutex held, you don't need to worry about data races
> on it.

While I can't show a lot of the code due to licensing issues, I will
show what I can.

Main Thread #1 (after joining all other threads):

```
s = y_strucon_mutex_lock(&procs_lock);
if (y_err(s != y_STATUS_SUCCESS)) return;
done = 1;
y_strucon_sem_post(&procs_sem, 1);
y_strucon_mutex_unlock(&procs_lock);

r = pthread_join(childthread->sysid, &data);
if (y_err(r != 0))
{
    y_panica("Deadlock in thread join");
}
```

When you see a `y_` prefix, that is my own code.

`y_strucon_mutex_*` functions are wrappers around the `pthread_mutex_*`
functions, and same with `y_strucon_sem_*` for semaphores. The variable
`s` is the standard return type of most functions; it's a status code.

Main Thread #2:

```
do
{
    int status;

    s = y_strucon_sem_wait(&procs_sem);
    if (y_err(s != y_STATUS_SUCCESS)) return s;

    s = y_strucon_mutex_lock(&procs_lock);
    if (y_err(s != y_STATUS_SUCCESS)) return s;

    if (y_unlikely(done != 0))
    {
        y_strucon_mutex_unlock(&procs_lock);
        pthread_exit((void*) (uintptr_t) s);
    }

    y_strucon_mutex_unlock(&procs_lock);

    do
    {
        r = waitpid(-1, &status, 0);
    }
    while (r < 0 && errno == EINTR);

    y_assert(r > 0, "Child process does not exist");

    y_procchildinfo info;

    info.pid = r;
    info.code = status;

    s = y_strucon_mutex_lock(&procs_lock);
    if (y_err(s != y_STATUS_SUCCESS)) return s;

    y_fd* fdptr;

    // This function returns a pointer to the fd in fdptr, if the
    // process has been registered before.
    if (y_map_exists_v(procs, &info.pid, &fdptr))
    {
        yc_assert(fdptr != NULL, YC_ASSERT_NULL);

        s = y_io_bareio_write(*fdptr, (y_uchar*) &info,
                              sizeof(y_procchildinfo));
        if (y_err(s != y_STATUS_SUCCESS))
        {
            y_panicva("Could not generate child status notification "
                      "using fd %d", *fdptr);
        }

        s = y_map_remove(procs, &info.pid);
        if (y_err(s != y_STATUS_SUCCESS)) return s;
    }
    else
    {
        s = y_vec_push(&unregprocs, &info);
        if (y_err(s != y_STATUS_SUCCESS)) return s;
    }

    y_strucon_mutex_unlock(&procs_lock);
}
while (s == y_STATUS_SUCCESS);
```

What a worker thread does to start and register a process:

```
pid_t chpid = fork();

if (chpid < 0) return y_STATUS_CHILD_ERR;

if (chpid == 0)
{
    // Setup and exec() child. Abort if exec() fails.
    abort();
}

s = y_strucon_mutex_lock(&procs_lock);
if (y_err(s != y_STATUS_SUCCESS)) return s;

size_t i;
size_t len = y_vec_len(&unregprocs);

for (i = 0; i < len; ++i)
{
    y_procchildinfo* ptr = (y_procchildinfo*) y_vec_get(&unregprocs, i);

    if (ptr->pid == chpid)
    {
        s = y_io_bareio_write(fd, (y_uchar*) ptr, sizeof(y_procchildinfo));
        if (y_err(s != y_STATUS_SUCCESS))
        {
            y_panica("Could not generate child status notification");
        }

        s = y_vec_popAt(&unregprocs, i);

        y_strucon_sem_post(&procs_sem, 1);

        y_strucon_mutex_unlock(&procs_lock);

        return s;
    }
}

s = y_map_insert(procs, &chpid, &fd);

y_strucon_sem_post(&procs_sem, 1);

y_strucon_mutex_unlock(&procs_lock);
```

As you can see, except for the vector (resizable array) that handles
the possibility of waitpid() returning a child process before it has
been "registered," the code matches the high-level description I gave
pretty closely.

Notice that all accesses of the map and vector are protected by the
mutex, so there's no data race on them. (I have another problem with
musl's rwlocks on another map, but I'll send another message to the
mailing list about that.)

> You should probably look at the strace of your program to see what's
> going on with the ECHILD. It's not clear whether you're calling
> waidpid with an argument of -1 or a specific pid, but in the latter
> case you could look for whether the child was already reaped and
> whether the pid was ever valid (from the strace log). Also, be aware
> that ignoring SIGCHLD may break waiting for it, in case you're doing
> that.

As shown above, I'm always calling it with waitpid(-1, ...), mostly
because I don't know what order child processes will finish in. And this
thread is the only one calling waitpid(), so it couldn't have been
reaped by another thread by accident.

SIGCHLD is not blocked. In fact, a signal handler is explicitly
registered for it in Main Thread #2, which, if it gets a SIGCHLD, does
nothing. I do this because I want to be sure that the problem you
mention does not happen, but also because after attempting to make
SIGCHLD work for this same purpose, I found out that Linux will drop
SIGCHLD's without a second thought. I know this because the code using
them would deadlock nearly every time.

> As an aside, overall, the whole setup looks like you were looking for
> condition variables but skipped over them for some reason.

I skipped over condition variables for a *very* important reason: they
would cause deadlock.

When a condition variable is signalled, it will do nothing if there are
no threads waiting on it. At least, that's what POSIX condition
variables are guaranteed to do. ("The pthread_cond_broadcast() and
pthread_cond_signal() functions shall have no effect if there are no
threads currently blocked on cond.")

What this means is that if Main Thread #2 is blocked on waitpid(), then
if another thread creates a child and signals the condition variable,
then after Main Thread #2 returns from waitpid(), it will block on the
condition variable. If another thread creates another child, sure, it
will unblock, but eventually, when the build is near done, there will be
no more child processes created, and Main Thread #2 will permanently
block on the condition variable. In turn, this will cause a deadlock
because the worker threads all poll() on the read end of the pipe (the
write end was put into the map for Main Thread #2), so if Main Thread #2
blocks on the condition variable, they will all block waiting for their
child(ren) to be reaped, and the build system will deadlock.

Likewise, having Main Thread #2 hold a mutex except when on a condition
variable means that worker threads won't be able to register children
except when Main Thread #2 is blocked on the condition variable. If Main
Thread #2 holds a lock on a mutex while in waitpid(), this means that
the build system can only run one command at a time. If you release the
lock just before calling waitpid(), allowing worker threads to make
progress, you still have the problem I mentioned above should the
condition variable be signalled during that call.

A semaphore is the right choice because it can be signalled while Main
Thread #2 is not blocked on it, and Main Thread #2 will get that signal
when it tries to wait on it. And it can be signalled many times, which
will tell Main Thread #2 how many children it needs to reap.

There is one thing I can think of that might cause the problem: the
processor is reordering the syscall to be before the if (done != 0)
check.

This doesn't make sense to me because 1) the arch I'm on, x86_64, is
mostly TSO, and 2) while a mutex unlock may allow some memory
operations from after the unlock to be reordered before it, I don't know
that a syscall would be reordered. In fact, I don't know how syscalls
are treated in such cases; are they treated as memory operations at all?
I have no clue.

Thank you for your time.

Gavin Howard


* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-12  5:45   ` Gavin Howard
@ 2022-03-12 15:01     ` Rich Felker
  2022-03-12 17:10       ` Gavin Howard
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2022-03-12 15:01 UTC (permalink / raw)
  To: Gavin Howard; +Cc: musl

On Fri, Mar 11, 2022 at 10:45:52PM -0700, Gavin Howard wrote:
> > I don't think there's enough detail to debug your issue based on the
> > high-level description you gave. The bug is almost surely in something
> > we can't see. As long as all modification to and reading from the flag
> > is done with the mutex held, you don't need to worry about data races
> > on it.
> 
> While I can't show a lot of the code due to licensing issues, I will
> show what I can.
> 
> Main Thread #1 (after joining all other threads:
> 
> ```
> s = y_strucon_mutex_lock(&procs_lock);
> if (y_err(s != y_STATUS_SUCCESS)) return;
> done = 1;
> y_strucon_sem_post(&procs_sem, 1);
> y_strucon_mutex_unlock(&procs_lock);
> 
> r = pthread_join(childthread->sysid, &data);
> if (y_err(r != 0))
> {
>     y_panica("Deadlock in thread join");
> }
> ```
> 
> When you see a `y_` prefix, that is my own code.
> 
> `y_strucon_mutex_*` functions are wrappers around the `pthread_mutex_*`
> functions, and same with `y_strucon_sem_*` for semaphores. The variable
> `s` is the standard return type of most functions; it's a status code.
> 
> Main Thread #2:
> 
> ```
> do
> {
>     int status;
> 
>     s = y_strucon_sem_wait(&procs_sem);
>     if (y_err(s != y_STATUS_SUCCESS)) return s;
> 
>     s = y_strucon_mutex_lock(&procs_lock);
>     if (y_err(s != y_STATUS_SUCCESS)) return s;
> 
>     if (y_unlikely(done != 0))
>     {
>         y_strucon_mutex_unlock(&procs_lock);
>         pthread_exit((void*) (uintptr_t) s);
>     }
> 
>     y_strucon_mutex_unlock(&procs_lock);
> 
>     do
>     {
>         r = waitpid(-1, &status, 0);
>     }
>     while (r < 0 && errno == EINTR);
> 
>     y_assert(r > 0, "Child process does not exist");
> 
>     y_procchildinfo info;
> 
>     info.pid = r;
>     info.code = status;
> 
>     s = y_strucon_mutex_lock(&procs_lock);
>     if (y_err(s != y_STATUS_SUCCESS)) return s;
> 
>     y_fd* fdptr;
> 
>     // This function returns a pointer to the fd in fdptr, if the
>     // process has been registered before.
>     if (y_map_exists_v(procs, &info.pid, &fdptr))
>     {
>         yc_assert(fdptr != NULL, YC_ASSERT_NULL);
> 
>         s = y_io_bareio_write(*fdptr, (y_uchar*) &info,
>                               sizeof(y_procchildinfo));
>         if (y_err(s != y_STATUS_SUCCESS))
>         {
>             y_panicva("Could not generate child status notification "
>                       "using fd %d", *fdptr);
>         }
> 
>         s = y_map_remove(procs, &info.pid);
>         if (y_err(s != y_STATUS_SUCCESS)) return s;
>     }
>     else
>     {
>         s = y_vec_push(&unregprocs, &info);
>         if (y_err(s != y_STATUS_SUCCESS)) return s;
>     }
> 
>     y_strucon_mutex_unlock(&procs_lock);
> }
> while (s == y_STATUS_SUCCESS);
> ```
> 
> What a worker thread does to start and register a process:
> 
> ```
> pid_t chpid = fork();
> 
> if (chpid < 0) return y_STATUS_CHILD_ERR;
> 
> if (chpid == 0)
> {
>     // Setup and exec() child. Abort if exec() fails.
>     abort();
> }
> 
> s = y_strucon_mutex_lock(&procs_lock);
> if (y_err(s != y_STATUS_SUCCESS)) return s;
> 
> size_t i;
> size_t len = y_vec_len(&unregprocs);
> 
> for (i = 0; i < len; ++i)
> {
>     y_procchildinfo* ptr = (y_procchildinfo*) y_vec_get(&unregprocs, i);
> 
>     if (ptr->pid == chpid)
>     {
>         s = y_io_bareio_write(fd, (y_uchar*) ptr, sizeof(y_procchildinfo));
>         if (y_err(s != y_STATUS_SUCCESS))
>         {
>             y_panica("Could not generate child status notification");
>         }
> 
>         s = y_vec_popAt(&unregprocs, i);
> 
>         y_strucon_sem_post(&procs_sem, 1);
> 
>         y_strucon_mutex_unlock(&procs_lock);
> 
>         return s;
>     }
> }
> 
> s = y_map_insert(procs, &chpid, &fd);
> 
> y_strucon_sem_post(&procs_sem, 1);
> 
> y_strucon_mutex_unlock(&procs_lock);
> ```
> 
> As you can see, except for the presence of a vector (resizable array)
> for the possibility of a waitpid() returning a child process before it
> is "registered," the high-level description I gave wasn't very
> high-level but matches pretty well.
> 
> Notice that all accesses of the map and vector are protected by the
> mutex, so there's no data race on them. (I have another problem with
> musl's rwlocks on another map, but I'll send another message to the
> mailing list about that.)
> 
> > You should probably look at the strace of your program to see what's
> > going on with the ECHILD. It's not clear whether you're calling
> > waitpid with an argument of -1 or a specific pid, but in the latter
> > case you could look for whether the child was already reaped and
> > whether the pid was ever valid (from the strace log). Also, be aware
> > that ignoring SIGCHLD may break waiting for it, in case you're doing
> > that.
> 
> As shown above, I'm always calling it with waitpid(-1, ...), mostly
> because I don't know what order child processes will finish in. And this
> thread is the only one calling waitpid(), so it couldn't have been
> reaped by another thread by accident.
> 
> SIGCHLD is not blocked. In fact, a signal handler is explicitly
> registered for it in Main Thread #2, which, if it gets a SIGCHLD, does
> nothing. I do this because I want to be sure that the problem you
> mention does not happen, but also because after attempting to make
> SIGCHLD work for this same purpose, I found out that Linux will drop
> SIGCHLD's without a second thought. I know this because the code using
> them would deadlock nearly every time.
> 
> > As an aside, overall, the whole setup looks like you were looking for
> > condition variables but skipped over them for some reason.
> 
> I skipped over condition variables for a *very* important reason: they
> would cause deadlock.
> 
> When a condition variable is signalled, it will do nothing if there are
> no threads waiting on it. At least, that's what POSIX condition
> variables are guaranteed to do. ("The pthread_cond_broadcast() and
> pthread_cond_signal() functions shall have no effect if there are no
> threads currently blocked on cond.")

This is why condition variables necessarily have an associated
predicate (in your case, involving your flag and possibly other
state). You can *never* just do pthread_cond_wait. The only correct
usage is:

	while (!predicate(...)) pthread_cond_wait(&cnd, &mtx);

Correct use of condition variables ensures that you see any relevant
changes to the state the predicate is evaluated on, without requiring
you to explicitly work out a way to do the same thing with semaphores
or other lower-level primitives. It does not matter whatsoever whether
there are already waiters when you call signal. If not, they'll
necessarily see the state change *before* they wait. The state change
cannot happen between evaluation of the predicate and the wait because
the calling thread holds the mutex.

> What this means is that if Main Thread #2 is blocked on waitpid(), then
> if another thread creates a child and signals the condition variable,
> then after Main Thread #2 returns from waitpid(), it will block on the
                                                    ^^^^^^^^^^^^^^^^^^^^
> condition variable. If another thread creates another child, sure, it
  ^^^^^^^^^^^^^^^^^^

No it won't, because it evaluates the predicate that tells it there's
something to do before calling wait. If you're not doing that, you're
using cond vars in a fundamentally wrong way.

> [...]
> There is one thing I can think of that might cause the problem: the
> processor is reordering the syscall to be before the if (done != 0)
> check.

No, this is absolutely not what's happening. Neither the processor nor
the compiler (in general the latter is at least as much of a concern
if not more when you have missing/incorrect synchronization) can
reorder things like that.

> This doesn't make sense to me because 1) the arch I'm on is the mostly
> TSO x86_64, and 2) while a mutex unlock may allow some memory
> operations to be reordered from after to before the lock, I don't know
> that a syscall would be reordered. In fact, I don't know how syscalls
> are treated in such cases; are they treated as memory operations at all?
> I have no clue.

On some level they're like any other external function call to a
function whose definition the layer optimizing the caller can't see:
they must be assumed to be able to store to or load from any memory
whose address has been exposed, and thus cannot be reordered
whatsoever.

Your problem is not mysterious reordering. Your problem is in your
program's logic somewhere. Please use the debugging tools at your
disposal, especially strace which will probably reveal the problem to
you right away (by letting you see the exact sequence of events that
happened and trace through it to figure out where it mismatches your
expectation).

Rich


* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-12 15:01     ` Rich Felker
@ 2022-03-12 17:10       ` Gavin Howard
  2022-03-14 16:09         ` enh
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Howard @ 2022-03-12 17:10 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

> This is why condition variables necessarily have an associated
> predicate (in your case, involving your flag and possibly other
> state). You can *never* just do pthread_cond_wait. The only correct
> usage is:
>
>         while (!predicate(...)) pthread_cond_wait(&cnd, &mtx);
>
> Correct use of condition variables ensures that you see any relevant
> changes to the state the predicate is evaluated on, without requiring
> you to explicitly work out a way to do the same thing with semaphores
> or other lower-level primitives. It does not matter whatsoever whether
> there are already waiters when you call signal. If not, they'll
> necessarily see the state change *before* they wait. The state change
> cannot happen between evaluation of the predicate and the wait because
> the calling thread holds the mutex.

Ah, I misunderstood.

> > What this means is that if Main Thread #2 is blocked on waitpid(), then
> > if another thread creates a child and signals the condition variable,
> > then after Main Thread #2 returns from waitpid(), it will block on the
>                                                     ^^^^^^^^^^^^^^^^^^^^
> > condition variable. If another thread creates another child, sure, it
>   ^^^^^^^^^^^^^^^^^^
>
> No it won't, because it evaluates the predicate that tells it there's
> something to do before calling wait. If you're not doing that, you're
> using cond vars in a fundamentally wrong way.

I have now implemented the system using the mutex and changing the
semaphore to a condition variable with the flag and two other variables
for how many children there are versus how many children have been
reaped. We'll see if anything shows up while I run it over and over
again.

> No, this is absolutely not what's happening. Neither the processor nor
> the compiler (in general the latter is at least as much of a concern
> if not more when you have missing/incorrect synchronization) can
> reorder things like that.

Good to know, thank you.

> On some level they're like any other external function call to a
> function whose definition the layer optimizing the caller can't see:
> they must be assumed to be able to store to or load from any memory
> whose address has been exposed, and thus cannot be reordered
> whatsoever.

Okay, I wondered if that might be the case.

> Your problem is not mysterious reordering. Your problem is in your
> program's logic somewhere. Please use the debugging tools at your
> disposal, especially strace which will probably reveal the problem to
> you right away (by letting you see the exact sequence of events that
> happened and trace through it to figure out where it mismatches your
> expectation).

I did use strace. It revealed nothing out of the ordinary. That's why I
did not mention it, but I probably should have. I've also used GDB to
inspect the core dumps. I did try.

Perhaps while I'm learning and making a fool of myself, I'll mention my
problem with rwlocks.

The relevant code is:

```
do
{
    bool rel = (strchr((char*) cmd->a, '/') != NULL);

    cont = false;

    // We only need to do a PATH search when the cmd has no '/'.
    if (!rel)
    {
        s = y_strucon_rwlock_rdlock(&r->env.lock);
        if (y_err(s != y_STATUS_SUCCESS)) goto err;

        exists = y_map_existsStrings_v(&r->env.exec_map, (char*) cmd->a,
                                       &res);

        // We have to hold the lock until we have copied the result because it
        // could be moved by a write to the map.
        if (exists)
        {
            // Just move the value from res to cmd. I can do this because
            // the string in res is heap allocated and is not affected by
            // edits to the map.
            cmd->len = res->len;
            cmd->a = res->a;

            // Release the lock as soon as possible.
            y_strucon_rwlock_rdunlock(&r->env.lock);
        }
        else
        {
            // Release the lock as soon as possible.
            y_strucon_rwlock_rdunlock(&r->env.lock);

            <Find executable, error if non-existent, and prepare entry>

            s = y_strucon_rwlock_wrlock(&r->env.lock);
            if (y_err(s != y_STATUS_SUCCESS))
            {
                y_str_free(&str);
                goto err;
            }

            // Make sure someone didn't get there first.
            if (!y_map_existsStrings(&r->env.exec_map, (char*) cmd->a))
            {
                s = y_map_insertStrings(&r->env.exec_map, (char*) cmd->a,
                                        (char*) str.a);
                if (s == y_STATUS_ELEM_EXISTS)
                {
                    y_panica("Element already exists");
                }
            }

            y_strucon_rwlock_wrunlock(&r->env.lock);

            // Always free first.
            y_str_free(&str);
            y_stackpool_free(pool);

            if (y_err(s != y_STATUS_SUCCESS)) goto err;

            cont = true;
        }
    }
}
while (cont);

err:
    <Do some error handling>
```

Besides initialization (before any other threads are created) and
destruction (after all other threads are joined), these are the only
references to the map in question.

`y_strucon_rwlock_*` functions are just wrappers around POSIX rwlocks.

I'm not doing anything fancy here; it's just rwlocks. However, the panic
about the element already existing triggers consistently within about 5
minutes, on both musl and glibc, so it is probably PEBKAC.

If I replace the read lock near the beginning with a write lock, I still
get the same issue. However, if I swap the rwlock for a mutex, and
update all locking and unlocking to match, the issue goes away.

In this case, the strace shows nothing out of the ordinary that I can
see.

Yet I can't see how I am doing anything wrong. I've double checked that
y_map_existsStrings{,_v} do not edit the map at all.

So where's my stupidity on this one?

Gavin Howard

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-12 17:10       ` Gavin Howard
@ 2022-03-14 16:09         ` enh
  2022-03-14 16:12           ` Gavin Howard
  0 siblings, 1 reply; 11+ messages in thread
From: enh @ 2022-03-14 16:09 UTC (permalink / raw)
  To: musl; +Cc: Rich Felker

On Sat, Mar 12, 2022 at 9:11 AM Gavin Howard <gavin.d.howard@gmail.com>
wrote:

> [...]
> I did use strace. It revealed nothing out of the ordinary. That's why I
> did not mention it, but I probably should have. I've also used GDB to
> inspect the core dumps. I did try.
>

you might find https://clang.llvm.org/docs/ThreadSanitizer.html useful for
bugs like this.


* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-14 16:09         ` enh
@ 2022-03-14 16:12           ` Gavin Howard
  2022-03-14 16:23             ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Howard @ 2022-03-14 16:12 UTC (permalink / raw)
  To: musl; +Cc: Rich Felker

I used it, but it gave me too many false positives to be useful, mostly
on data that was initialized before any other threads existed, and
therefore without holding a lock.

Gavin Howard

* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-14 16:12           ` Gavin Howard
@ 2022-03-14 16:23             ` Rich Felker
  2022-03-15 18:55               ` Gavin Howard
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2022-03-14 16:23 UTC (permalink / raw)
  To: Gavin Howard; +Cc: musl

On Mon, Mar 14, 2022 at 10:12:07AM -0600, Gavin Howard wrote:
> I used it, but it was giving me too many false positives to be useful,
> mostly on data that was initialized when only one thread existed, so it
> was initialized without holding a lock.

I might suggest a programming questions forum like Stack Overflow as a
more productive place to pursue this than the musl mailing list, since
this is pretty clearly not an actionable bug report (which would need
a complete test case you *can* publish that reproduces the purported
bug) and almost surely not related to any bug in musl.

Rich

* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-14 16:23             ` Rich Felker
@ 2022-03-15 18:55               ` Gavin Howard
  2022-03-15 19:50                 ` Markus Wichmann
  0 siblings, 1 reply; 11+ messages in thread
From: Gavin Howard @ 2022-03-15 18:55 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

> I might suggest a programming questions forum like Stack Overflow as a
> more productive place to pursue this than the musl mailing list, since
> this is pretty clearly not an actionable bug report (which would need
> a complete test case you *can* publish that reproduces the purported
> bug) and almost surely not related to any bug in musl.

I am pretty sure that this is the right place and that it is a bug in
musl. I took the time to make the code in question public and to create
an easy reproducer, so that you now have an actionable bug report.

The code is at https://git.yzena.com/Yzena/Yc . Make sure you have
CMake and Clang installed (I use Clang to test bootstrap on other
platforms). Also be sure you have musl-gcc in /usr/local/musl/bin.

Once you do, just run the following:

```
git clone https://git.yzena.com/Yzena/Yc.git
cd Yc
./tools/rwlock_repro.sh -m
```

The rwlock_repro.sh script with the -m option will initialize the repo,
build the build system with musl-gcc, and then repeatedly run the build
system on itself until an error occurs.

With this setup, I have been able to show that more than one thread is
allowed into the write-lock critical section simultaneously; this
usually happens within two minutes, averaging about one minute.

If it does happen, you should see a message like this:

```
Panic: More than one thread in the critical section
    Source:    /home/gavin/Yc2/src/rig/build.c:555
    Function:  rig_searchPath()
```

This happens because there is a global variable that is incremented by
threads when they enter that critical section and decremented when they
leave, and that is the only place the global is used. (It's just for
testing this.) Yet, sometimes, threads will find that it is greater than
1, which means that more than one thread is in that critical section.

Even if there are bugs in my code, I'm pretty sure that is not supposed
to happen with musl's read/write locks. Then again, if I am wrong about
how musl's rwlocks are supposed to behave, please let me know.

I tested this with musl 1.2.2 and the latest master as of last night.
And this is on an AMD Ryzen Threadripper 1900X x86_64, if that helps.

Gavin Howard

* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-15 18:55               ` Gavin Howard
@ 2022-03-15 19:50                 ` Markus Wichmann
  2022-03-15 20:35                   ` Gavin Howard
  0 siblings, 1 reply; 11+ messages in thread
From: Markus Wichmann @ 2022-03-15 19:50 UTC (permalink / raw)
  To: musl

On Tue, Mar 15, 2022 at 12:55:38PM -0600, Gavin Howard wrote:
> I am pretty sure that this is the right place and that it is a bug in
> musl.

Once more we find someone blaming the implementation for their own
mistake. From your repository:

| y_Status
| y_strucon_rwlock_wrlock(y_strucon_rwlock* l)
| {
| 	y_ThreadData d = y_strucon_thread_os_get();
|
| 	yc_assert(l != NULL, YC_ASSERT_NULL);
| 	yc_assert(d != NULL, YC_ASSERT_NULL);
|
| 	y_THREAD_TRYLOCK1(d);
|
| 	y_Status s = y_strucon_rwlock_os_rdlock(l);
|
| 	return s;
| }

The mistake is in this function. See what the function is called and
what subfunction it calls? No wonder multiple threads can take the lock
simultaneously.

Also, while reading I found a glaring mistake in y_THREAD_TRYUNLOCK1(),
which I shall leave as homework to the reader.

Ciao,
Markus

* Re: [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes
  2022-03-15 19:50                 ` Markus Wichmann
@ 2022-03-15 20:35                   ` Gavin Howard
  0 siblings, 0 replies; 11+ messages in thread
From: Gavin Howard @ 2022-03-15 20:35 UTC (permalink / raw)
  To: musl

Well, I am a fool. I must have been too tired when I checked that.

Gavin Howard

On Tue, Mar 15, 2022, 13:51 Markus Wichmann <nullplan@gmx.net> wrote:

> [...]

end of thread, other threads:[~2022-03-15 20:36 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-11 19:41 [musl] Possible PEBKAC or Bug in musl Semaphores and/or Mutexes Gavin Howard
2022-03-11 20:21 ` Rich Felker
2022-03-12  5:45   ` Gavin Howard
2022-03-12 15:01     ` Rich Felker
2022-03-12 17:10       ` Gavin Howard
2022-03-14 16:09         ` enh
2022-03-14 16:12           ` Gavin Howard
2022-03-14 16:23             ` Rich Felker
2022-03-15 18:55               ` Gavin Howard
2022-03-15 19:50                 ` Markus Wichmann
2022-03-15 20:35                   ` Gavin Howard

Code repositories for project(s) associated with this inbox:

	https://git.vuxu.org/mirror/musl/
