mailing list of musl libc
* __synccall: deadlock and reliance on racy /proc/self/task
@ 2019-02-02 21:40 Alexey Izbyshev
  2019-02-07 18:36 ` Rich Felker
  2019-02-12 18:48 ` Rich Felker
  0 siblings, 2 replies; 19+ messages in thread
From: Alexey Izbyshev @ 2019-02-02 21:40 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 4474 bytes --]

Hello!

I've discovered that setuid() deadlocks on a simple stress test 
(attached: test-setuid.c) that creates threads concurrently with 
setuid(). (Tested on 1.1.21 on x86_64, kernel 4.15.x and 4.4.x). The gdb 
output:

(gdb) info thr
   Id   Target Id         Frame
* 1    LWP 23555 "a.out" __synccall (func=func@entry=0x402a7d 
<do_setxid>, ctx=ctx@entry=0x7fffea85b17c)
     at ../../musl/src/thread/synccall.c:144
   2    LWP 23566 "a.out" __syscall () at 
../../musl/src/internal/x86_64/syscall.s:13
(gdb) bt
#0  __synccall (func=func@entry=0x402a7d <do_setxid>, 
ctx=ctx@entry=0x7fffea85b17c) at ../../musl/src/thread/synccall.c:144
#1  0x0000000000402af9 in __setxid (nr=<optimized out>, id=<optimized 
out>, eid=<optimized out>, sid=<optimized out>)
     at ../../musl/src/unistd/setxid.c:33
#2  0x00000000004001c8 in main ()
(gdb) thr 2
(gdb) bt
#0  __syscall () at ../../musl/src/internal/x86_64/syscall.s:13
#1  0x00000000004046b7 in __timedwait_cp 
(addr=addr@entry=0x7fe99023475c, val=val@entry=-1, clk=clk@entry=0, 
at=at@entry=0x0,
     priv=<optimized out>) at ../../musl/src/thread/__timedwait.c:31
#2  0x0000000000404591 in sem_timedwait (sem=sem@entry=0x7fe99023475c, 
at=at@entry=0x0) at ../../musl/src/thread/sem_timedwait.c:23
#3  0x00000000004044e1 in sem_wait (sem=sem@entry=0x7fe99023475c) at 
../../musl/src/thread/sem_wait.c:5
#4  0x00000000004037ae in handler (sig=<optimized out>) at 
../../musl/src/thread/synccall.c:43
#5  <signal handler called>
#6  __clone () at ../../musl/src/thread/x86_64/clone.s:17
#7  0x00000000004028ec in __pthread_create (res=0x7fe990234eb8, 
attrp=0x606260 <attr>, entry=0x400300 <thr>, arg=0x0)
     at ../../musl/src/thread/pthread_create.c:286
#8  0x0000000000400323 in thr ()

The main thread spins in __synccall with futex() always returning 
ETIMEDOUT (line 139) and "head" is NULL, while handler() in the second 
thread is blocked on sem_wait() (line 40). So it looks like handler() 
updated the linked list, but the main thread doesn't see the update.

For some reason __synccall accesses the list without a barrier (line 
120), though I don't see why one wouldn't be necessary for correct 
observability of head->next. However, I'm testing on x86_64, so 
acquire/release semantics works without barriers.

I thought that a possible explanation is that handler() got blocked in a 
*previous* setuid() call, but we didn't notice its list entry at that 
time and then overwrote "head" with NULL on the current call to 
setuid(). This seems to be possible because of the following.

1) There is a "presignalling" phase, where we may send a signal to *any* 
thread. Moreover, the number of signals we sent may be *more* than the 
number of threads because some threads may exit while we're in the loop. 
As a result, SIGSYNCCALL may be pending after this loop.

     /* Initially send one signal per counted thread. But since we can't
      * synchronize with thread creation/exit here, there could be too
      * few signals. This initial signaling is just an optimization, not
      * part of the logic. */
     for (i=libc.threads_minus_1; i; i--)
         __syscall(SYS_kill, pid, SIGSYNCCALL);

2) __synccall relies on /proc/self/task to get the list of *all* 
threads. However, since new threads can be created concurrently while we 
read /proc (if some threads were in pthread_create() after 
the __block_new_threads check when we set it to 1), I thought that /proc may 
miss some threads (that's actually why I started the whole exercise in 
the first place).

So, if we miss a thread in (2) but it's created and signalled with a 
pending SIGSYNCCALL shortly after we exit /proc loop (but before we 
reset the signal handler), "handler()" will run in that thread 
concurrently with us, and we may miss its list entry if the timing is 
right.

I've checked that if I remove the "presignalling" loop, the deadlock 
disappears (at least, I could run the test for several minutes without 
any problem).

Of course, the larger problem remains: if we may miss some threads 
because of /proc, we may fail to call setuid() syscall in those threads. 
And that indeed happens easily in my second test (attached: 
test-setuid-mismatch.c; expected to be run as a suid binary; note that I 
tested both with and without "presignalling").

Both tests run on glibc (2.27) without any problem. Would it be possible 
to fix __synccall in musl? Thanks!

(Please CC me on replies; I'm not subscribed to the list.)

Alexey

[-- Attachment #2: test-setuid.c --]
[-- Type: text/x-c; name=test-setuid.c, Size: 656 bytes --]

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NUM_THREADS 1

static pthread_attr_t attr;

/* Each thread immediately spawns a detached replacement and returns,
 * so threads are continuously created and destroyed. */
void *thr(void *arg) {
  if (pthread_create(&(pthread_t){0}, &attr, thr, 0))
    abort();
  return 0;
}

int main(void) {
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
  for (int i = 0; i < NUM_THREADS; i++)
    if (pthread_create(&(pthread_t){0}, &attr, thr, 0))
      abort();

  /* Each seteuid() call goes through __setxid()/__synccall(), which must
   * signal every thread, racing against the thread churn above. */
  int euid = geteuid();
  for (int it = 0;; it++) {
    printf("%d\n", it);
    if (seteuid(euid)) {
      perror("seteuid");
      abort();
    }
  }
}

[-- Attachment #3: test-setuid-mismatch.c --]
[-- Type: text/x-c; name=test-setuid-mismatch.c, Size: 1162 bytes --]

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NUM_THREADS 1

static pthread_attr_t attr;
/* euid: expected effective uid published by main() (-1 = not yet set);
 * gen: generation toggle used to release workers after each check;
 * num: number of workers that have checked the current generation. */
static volatile int euid = -1, gen, num;

void *thr(void *arg) {
  /* Wait for main() to publish the expected effective uid. */
  while (euid == -1)
    ;
  int cur_gen = gen;
  /* Compare the kernel's euid for this thread with the expected value;
   * a mismatch means the uid change was not propagated to this thread. */
  int teuid = syscall(SYS_geteuid);
  if (teuid != euid) {
    printf("mismatch: %d != %d\n", teuid, euid);
    sleep(1000000);
  }
  __sync_fetch_and_add(&num, 1);
  /* Wait for the next generation, then hand off to a fresh detached
   * thread and exit. */
  while (cur_gen == gen)
    ;

  if (pthread_create(&(pthread_t){0}, &attr, thr, 0))
    abort();
  return 0;
}

int main(void) {
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
  for (int i = 0; i < NUM_THREADS; i++)
    if (pthread_create(&(pthread_t){0}, &attr, thr, 0))
      abort();

  printf("uid = %d euid = %d\n", getuid(), geteuid());
  int ruid = getuid();
  int cur_euid = 0;
  for (int it = 0;; it++) {
    printf("%d\n", it);
    /* Alternate the effective uid between the real uid and 0; each
     * seteuid() must be propagated to every thread by __synccall. */
    cur_euid = cur_euid ? 0 : ruid;
    if (seteuid(cur_euid)) {
      perror("seteuid");
      abort();
    }
    /* Publish the expected euid, wait until every worker of this
     * generation has verified it, then start the next generation. */
    num = 0;
    euid = cur_euid;
    while (num < NUM_THREADS)
      ;
    euid = -1;
    gen = 1 - gen;
  }
}


* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-02 21:40 __synccall: deadlock and reliance on racy /proc/self/task Alexey Izbyshev
@ 2019-02-07 18:36 ` Rich Felker
  2019-02-08 18:14   ` Alexey Izbyshev
  2019-02-12 18:48 ` Rich Felker
  1 sibling, 1 reply; 19+ messages in thread
From: Rich Felker @ 2019-02-07 18:36 UTC (permalink / raw)
  To: musl; +Cc: Alexey Izbyshev

On Sun, Feb 03, 2019 at 12:40:39AM +0300, Alexey Izbyshev wrote:
> Hello!
> 
> I've discovered that setuid() deadlocks on a simple stress test
> (attached: test-setuid.c) that creates threads concurrently with
> setuid(). (Tested on 1.1.21 on x86_64, kernel 4.15.x and 4.4.x). The
> gdb output:
> 
> (gdb) info thr
>   Id   Target Id         Frame
> * 1    LWP 23555 "a.out" __synccall (func=func@entry=0x402a7d
> <do_setxid>, ctx=ctx@entry=0x7fffea85b17c)
>     at ../../musl/src/thread/synccall.c:144
>   2    LWP 23566 "a.out" __syscall () at
> ../../musl/src/internal/x86_64/syscall.s:13
> (gdb) bt
> #0  __synccall (func=func@entry=0x402a7d <do_setxid>,
> ctx=ctx@entry=0x7fffea85b17c) at
> ../../musl/src/thread/synccall.c:144
> #1  0x0000000000402af9 in __setxid (nr=<optimized out>,
> id=<optimized out>, eid=<optimized out>, sid=<optimized out>)
>     at ../../musl/src/unistd/setxid.c:33
> #2  0x00000000004001c8 in main ()
> (gdb) thr 2
> (gdb) bt
> #0  __syscall () at ../../musl/src/internal/x86_64/syscall.s:13
> #1  0x00000000004046b7 in __timedwait_cp
> (addr=addr@entry=0x7fe99023475c, val=val@entry=-1, clk=clk@entry=0,
> at=at@entry=0x0,
>     priv=<optimized out>) at ../../musl/src/thread/__timedwait.c:31
> #2  0x0000000000404591 in sem_timedwait
> (sem=sem@entry=0x7fe99023475c, at=at@entry=0x0) at
> ../../musl/src/thread/sem_timedwait.c:23
> #3  0x00000000004044e1 in sem_wait (sem=sem@entry=0x7fe99023475c) at
> ../../musl/src/thread/sem_wait.c:5
> #4  0x00000000004037ae in handler (sig=<optimized out>) at
> ../../musl/src/thread/synccall.c:43
> #5  <signal handler called>
> #6  __clone () at ../../musl/src/thread/x86_64/clone.s:17
> #7  0x00000000004028ec in __pthread_create (res=0x7fe990234eb8,
> attrp=0x606260 <attr>, entry=0x400300 <thr>, arg=0x0)
>     at ../../musl/src/thread/pthread_create.c:286
> #8  0x0000000000400323 in thr ()
> 
> The main thread spins in __synccall with futex() always returning
> ETIMEDOUT (line 139) and "head" is NULL, while handler() in the
> second thread is blocked on sem_wait() (line 40). So it looks like
> handler() updated the linked list, but the main thread doesn't see
> the update.

I don't understand how this state is reachable. Even if it were
reached once by failing to see the update to head, the update should
eventually be seen when retrying after timeout.

> For some reason __synccall accesses the list without a barrier (line
> 120), though I don't see why one wouldn't be necessary for correct
> observability of head->next. However, I'm testing on x86_64, so
> acquire/release semantics works without barriers.

The formal intent in musl is that all a_* are full seq_cst barriers.
On x86[_64] this used to not be the case; we just used a normal store,
but that turned out to be broken because in some places (and
apparently here in __synccall) there was code that depended on a_store
having acquire semantics too. See commit
3c43c0761e1725fd5f89a9c028cbf43250abb913 and
5a9c8c05a5a0cdced4122589184fd795b761bb4a.

If not for this fix, I could see this being related (but again, it
should see it after timeout anyway). But since the barrier is there
now, it shouldn't happen.

> I thought that a possible explanation is that handler() got blocked
> in a *previous* setuid() call, but we didn't notice its list entry
> at that time and then overwrote "head" with NULL on the current call
> to setuid(). This seems to be possible because of the following.
> 
> 1) There is a "presignalling" phase, where we may send a signal to
> *any* thread. Moreover, the number of signals we sent may be *more*
> than the number of threads because some threads may exit while we're
> in the loop. As a result, SIGSYNCCALL may be pending after this
> loop.
> 
>     /* Initially send one signal per counted thread. But since we can't
>      * synchronize with thread creation/exit here, there could be too
>      * few signals. This initial signaling is just an optimization, not
>      * part of the logic. */
>     for (i=libc.threads_minus_1; i; i--)
>         __syscall(SYS_kill, pid, SIGSYNCCALL);
> 
> 2) __synccall relies on /proc/self/task to get the list of *all*
> threads. However, since new threads can be created concurrently
> while we read /proc (if some threads were in pthread_create() after
> __block_new_threads check when we set it to 1), I thought that /proc
> may miss some threads (that's actually why I started the whole
> exercise in the first place).
> 
> So, if we miss a thread in (2) but it's created and signalled with a
> pending SIGSYNCCALL shortly after we exit /proc loop (but before we
> reset the signal handler), "handler()" will run in that thread
> concurrently with us, and we may miss its list entry if the timing
> is right.

This seems plausible.

> I've checked that if I remove the "presignalling" loop, the deadlock
> disappears (at least, I could run the test for several minutes
> without any problem).
> 
> Of course, the larger problem remains: if we may miss some threads
> because of /proc, we may fail to call setuid() syscall in those
> threads. And that's indeed easily happens in my second test
> (attached: test-setuid-mismatch.c; expected to be run as a suid
> binary; note that I tested both with and without "presignalling").

Does it work if we force two iterations of the readdir loop with no
tasks missed, rather than just one, to catch the case of missed
concurrent additions? I'm not sure. But all this makes me really
uncomfortable with the current approach.

> Both tests run on glibc (2.27) without any problem.

I think glibc has a different problem: there's a window at thread exit
where setxid can return success without having run the id change in
the exiting thread. In this case, assuming an attacker has code
execution in the process after dropping root, they can mmap malicious
code over top of the thread exit code and obtain code execution as
root.

> Would it be
> possible to fix __synccall in musl? Thanks!

Yes, but I don't think it's easy.

I think we might really need to adopt the design I proposed a while
back with a global thread list that's unlocked atomically with respect
to thread exit, using the exit futex. This is somewhat
expensive/synchronizing, but it makes a lot of things easier/safer,
and optimizes access to dynamic TLS.
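
Roughly, the unlock-atomic-with-exit part could look something like the
sketch below (hypothetical standalone code, not musl source; the lock
protocol is deliberately simplistic):

#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

static _Atomic int tl_lock;	/* thread-list lock: 0 = free, 1 = held */

static void tl_acquire(void)
{
	int expect = 0;
	while (!atomic_compare_exchange_weak(&tl_lock, &expect, 1)) {
		/* Sleep until the word changes; the waker is either another
		 * thread's tl_release() or the kernel's exit-time wake. */
		syscall(SYS_futex, &tl_lock, FUTEX_WAIT, 1, 0, 0, 0);
		expect = 0;
	}
}

static void tl_release(void)
{
	atomic_store(&tl_lock, 0);
	syscall(SYS_futex, &tl_lock, FUTEX_WAKE, 1, 0, 0, 0);
}

/* Called by an exiting thread that holds tl_lock and has already
 * unlinked itself from the list.  There is no userspace unlock:
 * repointing the exit futex (clear_child_tid) at the lock word makes
 * the kernel store 0 there and FUTEX_WAKE it only after the thread can
 * no longer run any userspace code. */
static _Noreturn void tl_unlock_via_exit(void)
{
	syscall(SYS_set_tid_address, &tl_lock);
	for (;;) syscall(SYS_exit, 0);
}

A real version needs a more careful waiter protocol, but the point is
that the kernel's CLEARTID store and wake can double as the unlock, so
anyone who subsequently acquires the lock knows the exiting thread is
really gone.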

Rich



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-07 18:36 ` Rich Felker
@ 2019-02-08 18:14   ` Alexey Izbyshev
  2019-02-08 18:33     ` Rich Felker
  0 siblings, 1 reply; 19+ messages in thread
From: Alexey Izbyshev @ 2019-02-08 18:14 UTC (permalink / raw)
  To: Rich Felker, musl

On 2/7/19 9:36 PM, Rich Felker wrote:
>> For some reason __synccall accesses the list without a barrier (line
>> 120), though I don't see why one wouldn't be necessary for correct
>> observability of head->next. However, I'm testing on x86_64, so
>> acquire/release semantics works without barriers.
> 
> The formal intent in musl is that all a_* are full seq_cst barriers.
> On x86[_64] this used to not be the case; we just used a normal store,
> but that turned out to be broken because in some places (and
> apparently here in __synccall) there was code that depended on a_store
> having acquire semantics too. See commit
> 3c43c0761e1725fd5f89a9c028cbf43250abb913 and
> 5a9c8c05a5a0cdced4122589184fd795b761bb4a.
> 
> If not for this fix, I could see this being related (but again, it
> should see it after timeout anyway). But since the barrier is there
> now, it shouldn't happen.

Thanks for the explanation about a_store(). I didn't know that it has 
seq_cst semantics. However, I was talking about a barrier between loads 
of head and cp->tid/cp->next:

for (cp = head; cp && cp->tid != tid; cp=cp->next);

In my understanding, we need consume semantics to observe correct values 
of tid and next after we load head. If we don't take Alpha into account,
it probably works without barriers on most current architectures; 
however, I don't know what policy musl has for such cases.

>> Of course, the larger problem remains: if we may miss some threads
>> because of /proc, we may fail to call setuid() syscall in those
>> threads. And that's indeed easily happens in my second test
>> (attached: test-setuid-mismatch.c; expected to be run as a suid
>> binary; note that I tested both with and without "presignalling").
> 
> Does it work if we force two iterations of the readdir loop with no
> tasks missed, rather than just one, to catch the case of missed
> concurrent additions? I'm not sure. But all this makes me really
> uncomfortable with the current approach.

I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt == 
0. The test eventually failed in all cases, with 0 retries requiring 
only a handful of iterations, 1 -- on the order of 100, 2 -- on the 
order of 10000 and 3 -- on the order of 100000.
> 
>> Both tests run on glibc (2.27) without any problem.
> 
> I think glibc has a different problem: there's a window at thread exit
> where setxid can return success without having run the id change in
> the exiting thread. In this case, assuming an attacker has code
> execution in the process after dropping root, they can mmap malicious
> code over top of the thread exit code and obtain code execution as
> root.

Yes, it appears to be true: glibc won't signal a thread if it's marked 
as exiting [1, 2].

[1] 
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/allocatestack.c;h=d8e8570a7d9b9622309555b03cc98b3dd22e11c9;hb=HEAD#l1035
[2] 
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/allocatestack.c;h=d8e8570a7d9b9622309555b03cc98b3dd22e11c9;hb=HEAD#l1075

Alexey



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-08 18:14   ` Alexey Izbyshev
@ 2019-02-08 18:33     ` Rich Felker
  2019-02-09 16:21       ` Szabolcs Nagy
  2019-02-10 21:04       ` Alexey Izbyshev
  0 siblings, 2 replies; 19+ messages in thread
From: Rich Felker @ 2019-02-08 18:33 UTC (permalink / raw)
  To: Alexey Izbyshev; +Cc: musl

On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
> On 2/7/19 9:36 PM, Rich Felker wrote:
> >>For some reason __synccall accesses the list without a barrier (line
> >>120), though I don't see why one wouldn't be necessary for correct
> >>observability of head->next. However, I'm testing on x86_64, so
> >>acquire/release semantics works without barriers.
> >
> >The formal intent in musl is that all a_* are full seq_cst barriers.
> >On x86[_64] this used to not be the case; we just used a normal store,
> >but that turned out to be broken because in some places (and
> >apparently here in __synccall) there was code that depended on a_store
> >having acquire semantics too. See commit
> >3c43c0761e1725fd5f89a9c028cbf43250abb913 and
> >5a9c8c05a5a0cdced4122589184fd795b761bb4a.
> >
> >If not for this fix, I could see this being related (but again, it
> >should see it after timeout anyway). But since the barrier is there
> >now, it shouldn't happen.
> 
> Thanks for the explanation about a_store(). I didn't know that it
> has seq_cst semantics. However, I was talking about a barrier
> between loads of head and cp->tid/cp->next:
> 
> for (cp = head; cp && cp->tid != tid; cp=cp->next);
> 
> In my understanding, we need consume semantics to observe correct
> values of tid and next after we load head. If we don't take Alpha
> into account,
> it probably works without barriers on most current architectures,
> however, I don't know what policy musl has for such cases.

I don't see how that's the case. The only stores to members of ch are
made before the a_cas_p (which has seq_cst order, but just release
would suffice) storing &ch into head and making it visible.
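
To illustrate with plain C11 atomics rather than musl's a_* primitives
(the types below are simplified stand-ins for struct chain and head,
not the synccall.c code):

#include <stdatomic.h>

struct node {
	int tid;
	struct node *next;
};

static _Atomic(struct node *) head;

/* Handler side: fully initialize the node, then publish it.  The
 * release CAS orders the tid/next stores before the node becomes
 * reachable through head. */
static void publish(struct node *n, int tid)
{
	n->tid = tid;
	struct node *old = atomic_load_explicit(&head, memory_order_relaxed);
	do n->next = old;
	while (!atomic_compare_exchange_weak_explicit(&head, &old, n,
			memory_order_release, memory_order_relaxed));
}

/* Caller side: the traversal only dereferences pointers obtained from
 * head, so the loads are dependency-ordered on everything but Alpha;
 * formally, consume (or acquire) ordering on the head load covers it. */
static struct node *find(int tid)
{
	struct node *cp = atomic_load_explicit(&head, memory_order_consume);
	for (; cp && cp->tid != tid; cp = cp->next);
	return cp;
}

a_cas_p is seq_cst, which is strictly stronger than the release
ordering this pattern needs.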

> >>Of course, the larger problem remains: if we may miss some threads
> >>because of /proc, we may fail to call setuid() syscall in those
> >>threads. And that's indeed easily happens in my second test
> >>(attached: test-setuid-mismatch.c; expected to be run as a suid
> >>binary; note that I tested both with and without "presignalling").
> >
> >Does it work if we force two iterations of the readdir loop with no
> >tasks missed, rather than just one, to catch the case of missed
> >concurrent additions? I'm not sure. But all this makes me really
> >uncomfortable with the current approach.
> 
> I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
> == 0. The test eventually failed in all cases, with 0 retries
> requiring only a handful of iterations, 1 -- on the order of 100, 2
> -- on the order of 10000 and 3 -- on the order of 100000.

Do you have a theory on the mechanism of failure here? I'm guessing
it's something like this: there's a thread that goes unseen in the
first round, and during the second round, it creates a new thread and
exits itself. The exit gets seen (again, it doesn't show up in the
dirents) but the new thread it created still doesn't. Is that right?

In any case, it looks like the whole mechanism we're using is
unreliable, so something needs to be done. My leaning is to go with
the global thread list and atomicity of list-unlock with exit.

Rich



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-08 18:33     ` Rich Felker
@ 2019-02-09 16:21       ` Szabolcs Nagy
  2019-02-09 18:33         ` Alexey Izbyshev
  2019-02-10 21:04       ` Alexey Izbyshev
  1 sibling, 1 reply; 19+ messages in thread
From: Szabolcs Nagy @ 2019-02-09 16:21 UTC (permalink / raw)
  To: musl; +Cc: Alexey Izbyshev

* Rich Felker <dalias@libc.org> [2019-02-08 13:33:57 -0500]:
> On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
> > On 2/7/19 9:36 PM, Rich Felker wrote:
> > >>Of course, the larger problem remains: if we may miss some threads
> > >>because of /proc, we may fail to call setuid() syscall in those
> > >>threads. And that's indeed easily happens in my second test
> > >>(attached: test-setuid-mismatch.c; expected to be run as a suid
> > >>binary; note that I tested both with and without "presignalling").
> > >
> > >Does it work if we force two iterations of the readdir loop with no
> > >tasks missed, rather than just one, to catch the case of missed
> > >concurrent additions? I'm not sure. But all this makes me really
> > >uncomfortable with the current approach.
> > 
> > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
> > == 0. The test eventually failed in all cases, with 0 retries
> > requiring only a handful of iterations, 1 -- on the order of 100, 2
> > -- on the order of 10000 and 3 -- on the order of 100000.
> 
> Do you have a theory on the mechanism of failure here? I'm guessing
> it's something like this: there's a thread that goes unseen in the
> first round, and during the second round, it creates a new thread and
> exits itself. The exit gets seen (again, it doesn't show up in the
> dirents) but the new thread it created still doesn't. Is that right?
> 
> In any case, it looks like the whole mechanism we're using is
> unreliable, so something needs to be done. My leaning is to go with
> the global thread list and atomicity of list-unlock with exit.

yes that sounds possible, i added some instrumentation to musl
and the trace shows situations like that before the deadlock,
exiting threads can even cause old (previously seen) entries to
disappear from the dir.

tm1 is libc.threads_minus_1, tt is target_tid,
r==3 is ESRCH, r==110 is ETIMEDOUT
my comments are indented:

trace 1:

16:01:45.473004716 [10825] __pthread_exit tm1 2 -> 1
16:01:45.473007835 [10827] start tm1 1 pid 2922 tid 10827
	current threads: 2922, 10827, (10825 is exiting)
	new iter
16:01:45.473012387 [2922] __synccall begin tm1 1
16:01:45.473022059 [2922] __synccall begin dir
16:01:45.473030024 [2922] __synccall tgkill 10825 r 3 tm1 1
16:01:45.473033342 [2922] __synccall end dir cnt 1 miss 0 r 3 tt 10825 (10825) tm1 1
	bad: dir had 10825, but not 10827 (which already made a write syscall before signals were sent)
16:01:45.473037483 [2922] __synccall end loop tm1 1 timedout 0 head 0
16:01:45.473040958 [2922] __synccall single_threaded tm1 1 head 0x7ffff7f19310
16:01:45.473040736 [10827] handler new head 0x7ffff7f19310 tt 10825 (10825) chain node: tid 10827 target_sem 0 0 caller_sem 0 0
	bad: late signal
16:01:45.473046943 [2922] __synccall finish chain node tid 10827 target_sem 0 0 caller_sem 0 0
	new iter
16:01:45.473063328 [2922] __synccall begin tm1 1
16:01:45.473067031 [2922] __synccall begin dir
16:01:45.473073274 [2922] __synccall tgkill 10827 r 0 tm1 1
16:01:45.483124243 [2922] __synccall end dir cnt 1 miss 1 r 110 tt -2147472821 (10827) tm1 1
16:01:45.483137678 [2922] __synccall begin dir
16:01:45.483148355 [2922] __synccall tgkill 10827 r 0 tm1 1
16:01:45.493243161 [2922] __synccall end dir cnt 1 miss 1 r 110 tt -2147472821 (10827) tm1 1

trace 2:

15:09:57.926599655 [19672] __pthread_create tm1 2 -> 3 flags 0x5d0f00 detached 
	printed right before __clone, current threads: 19667, 19670, 19672
	new iter:
15:09:57.926603357 [19667] __synccall begin tm1 3
15:09:57.926610882 [19667] __synccall begin dir
15:09:57.926618254 [19667] __synccall tgkill 19670 r 0 tm1 3
15:09:57.926617745 [19672] handler new head 0x7ffff7f5f310 tt 19670 (19670) chain node: tid 19672 target_sem 0 0 caller_sem 0 0
15:09:57.926589725 [19670] handler done
15:09:57.926633870 [19670] handler new head 0x7ffff7f3bcd0 tt -2147463981 (19667) chain node: tid 19670 target_sem 0 0 caller_sem 0 0
15:09:57.926637148 [19667] __synccall already caught tid 19672 tm1 3
15:09:57.926642781 [19667] __synccall end dir cnt 2 miss 0 r 0 tt 19672 (19672) tm1 3
	ok: dir had 19670, 19672 (new child of 19672 may not be created yet)
15:09:57.926649375 [19667] __synccall end loop tm1 3 timedout 0 head 0x7ffff7f3bcd0
15:09:57.926653791 [19667] __synccall call chain node tid 19670 target_sem -1 1 caller_sem 0 0
15:09:57.926668175 [19667] __synccall call chain node tid 19672 target_sem -1 1 caller_sem 0 0
15:09:57.926683042 [19667] __synccall single_threaded tm1 3 head 0x7ffff7f3bcd0
15:09:57.926688375 [19667] __synccall finish chain node tid 19670 target_sem -1 1 caller_sem 0 0
15:09:57.926695762 [19667] __synccall finish chain node tid 19672 target_sem -1 1 caller_sem 0 0
15:09:57.926698084 [19670] handler done
15:09:57.926476014 [19670] __pthread_create clone ret 19672
	non-monotonic timestamp: the instrumentation code got interrupted
15:09:57.926707218 [19670] __pthread_exit tm1 3 -> 2
	new iter:
15:09:57.926721373 [19667] __synccall begin tm1 2
15:09:57.926727938 [19667] __synccall begin dir
15:09:57.926738572 [19667] __synccall tgkill 19670 r 3 tm1 2
15:09:57.926743677 [19667] __synccall end dir cnt 1 miss 0 r 3 tt 19670 (19670) tm1 2
	bad: dir had 19670 (which exited) but 19672 is missing ???
15:09:57.926748951 [19672] handler done
15:09:57.926749685 [19667] __synccall end loop tm1 2 timedout 0 head 0
15:09:57.926756402 [19667] __synccall single_threaded tm1 2 head 0
15:09:57.926757551 [19672] handler new head 0x7ffff7f5f310 tt 19670 (19670) chain node: tid 19672 target_sem 0 0 caller_sem 0 0
	bad: late signal
15:09:57.926761019 [19667] __synccall finish chain node tid 19672 target_sem 0 0 caller_sem 0 0
	bad: got one sem_post, will still wait on another
	new iter:
15:09:57.926781060 [19667] __synccall begin tm1 2
15:09:57.926786600 [19667] __synccall begin dir
15:09:57.926793026 [19667] __synccall tgkill 19672 r 0 tm1 2
15:09:57.936881342 [19667] __synccall end dir cnt 1 miss 1 r 110 tt -2147463976 (19672) tm1 2
	now dir has 19672, but it's already waiting in the handler..
15:09:57.936902178 [19667] __synccall begin dir
15:09:57.936921680 [19667] __synccall tgkill 19672 r 0 tm1 2
15:09:57.947024323 [19667] __synccall end dir cnt 1 miss 1 r 110 tt -2147463976 (19672) tm1 2
15:09:57.947045505 [19667] __synccall begin dir
15:09:57.947064716 [19667] __synccall tgkill 19672 r 0 tm1 2
15:09:57.957161709 [19667] __synccall end dir cnt 1 miss 1 r 110 tt -2147463976 (19672) tm1 2
15:09:57.957196705 [19667] __synccall end loop tm1 2 timedout 3 head 0




* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-09 16:21       ` Szabolcs Nagy
@ 2019-02-09 18:33         ` Alexey Izbyshev
  2019-02-09 21:40           ` Szabolcs Nagy
  0 siblings, 1 reply; 19+ messages in thread
From: Alexey Izbyshev @ 2019-02-09 18:33 UTC (permalink / raw)
  To: musl, Szabolcs Nagy

On 2019-02-09 19:21, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2019-02-08 13:33:57 -0500]:
>> On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
>> > On 2/7/19 9:36 PM, Rich Felker wrote:
>> > >Does it work if we force two iterations of the readdir loop with no
>> > >tasks missed, rather than just one, to catch the case of missed
>> > >concurrent additions? I'm not sure. But all this makes me really
>> > >uncomfortable with the current approach.
>> >
>> > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
>> > == 0. The test eventually failed in all cases, with 0 retries
>> > requiring only a handful of iterations, 1 -- on the order of 100, 2
>> > -- on the order of 10000 and 3 -- on the order of 100000.
>> 
>> Do you have a theory on the mechanism of failure here? I'm guessing
>> it's something like this: there's a thread that goes unseen in the
>> first round, and during the second round, it creates a new thread and
>> exits itself. The exit gets seen (again, it doesn't show up in the
>> dirents) but the new thread it created still doesn't. Is that right?
>> 
>> In any case, it looks like the whole mechanism we're using is
>> unreliable, so something needs to be done. My leaning is to go with
>> the global thread list and atomicity of list-unlock with exit.
> 
> yes that sounds possible, i added some instrumentation to musl
> and the trace shows situations like that before the deadlock,
> exiting threads can even cause old (previously seen) entries to
> disappear from the dir.
> 
Thanks for the thorough instrumentation! Your traces confirm both my 
theory about the deadlock and unreliability of /proc/self/task.

I'd also done a very light instrumentation just before I got your email, 
but it took me a while to understand the output I got (see below).

I've used test-setuid-mismatch.c from my first email, with musl modified 
to avoid the deadlock (by removing the kill() loop from __synccall) and 
to retry the readdir() loop several times in case miss_cnt == 0. I've also 
added tracing at several points:
* pthread_exit (just before __unmapself)
* __synccall: before the readdir() loop (prints the retry count)
* __synccall: tid of the current dentry (only if tid != pid and we've 
not seen it in the chain)
* __synccall: errors of tgkill() and futex() in the readdir() loop

A couple of traces with max retry == 1:

--iter: 43
retry 0
tid 21089
tid 21091
retry 1
exit 21091
--iter: 44
retry 0
tid 21091
futex: ESRCH
tid 21093
exit 21093
retry 1
mismatch: tid 21094: 0 != 23517
------------------
--iter: 3
retry 0
tid 15974
futex: ESRCH
tid 15977
retry 1
tid 15978
--iter: 4
exit 15977
retry 0
tid 15977
tid 15978
exit 15978
retry 1
tid 15978
tgkill: ESRCH
mismatch: tid 15979: 0 != 23517

And with max retry == 2:

--iter: 19812
retry 0
tid 29606
retry 1
retry 2
exit 29606
--iter: 19813
retry 0
tid 29606
futex: ESRCH
tid 29609
exit 29609
retry 1
tid 29609
tgkill: ESRCH
retry 2
mismatch: tid 29610: 23517 != 0
------------------
--iter: 9762
retry 0
tid 14859
tid 14860
retry 1
retry 2
--iter: 9763
exit 14859
retry 0
tid 14859
exit 14860
tid 14860
retry 1
retry 2
mismatch: tid 14862: 23517 != 0

So, it's clear that /proc/self/task easily misses concurrently created 
threads, at least if other threads exit at the same time. And Szabolcs's 
traces indicate that even threads already running in userspace can be 
missed.

Now, about the strange output I mentioned. Consider one of the above 
fragments:
--iter: 4
exit 15977
retry 0
tid 15977
tid 15978
exit 15978
retry 1
tid 15978
tgkill: ESRCH
mismatch: tid 15979: 0 != 23517

Note that "tid 15978" is printed two times. Recall that it's printed 
only if we haven't seen it in the chain. But how it's possible that we 
haven't seen it *two* times? Since we waited on the futex the first time 
and we got the lock, the signal handler must have unlocked it. There is 
even a comment before futex() call:

/* Obtaining the lock means the thread responded. ESRCH
  * means the target thread exited, which is okay too. */

If the signal handler reached the futex unlock code, it must have updated 
the chain, so we should see the tid in the chain on the next retry and 
not print it.

Apparently, there is another reason for futex(FUTEX_LOCK_PI) success: the 
owner is exiting concurrently (which is also indicated by the subsequent 
failure of tgkill() with ESRCH). So obtaining the lock doesn't necessarily 
mean that the owner responded: it may also mean that the owner is (about 
to be?) dead.

Alexey



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-09 18:33         ` Alexey Izbyshev
@ 2019-02-09 21:40           ` Szabolcs Nagy
  2019-02-09 22:29             ` Alexey Izbyshev
  2019-02-10  0:52             ` Rich Felker
  0 siblings, 2 replies; 19+ messages in thread
From: Szabolcs Nagy @ 2019-02-09 21:40 UTC (permalink / raw)
  To: musl; +Cc: Alexey Izbyshev

[-- Attachment #1: Type: text/plain, Size: 3700 bytes --]

* Alexey Izbyshev <izbyshev@ispras.ru> [2019-02-09 21:33:32 +0300]:
> On 2019-02-09 19:21, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2019-02-08 13:33:57 -0500]:
> > > On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
> > > > On 2/7/19 9:36 PM, Rich Felker wrote:
> > > > >Does it work if we force two iterations of the readdir loop with no
> > > > >tasks missed, rather than just one, to catch the case of missed
> > > > >concurrent additions? I'm not sure. But all this makes me really
> > > > >uncomfortable with the current approach.
> > > >
> > > > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
> > > > == 0. The test eventually failed in all cases, with 0 retries
> > > > requiring only a handful of iterations, 1 -- on the order of 100, 2
> > > > -- on the order of 10000 and 3 -- on the order of 100000.
> > > 
> > > Do you have a theory on the mechanism of failure here? I'm guessing
> > > it's something like this: there's a thread that goes unseen in the
> > > first round, and during the second round, it creates a new thread and
> > > exits itself. The exit gets seen (again, it doesn't show up in the
> > > dirents) but the new thread it created still doesn't. Is that right?
> > > 
> > > In any case, it looks like the whole mechanism we're using is
> > > unreliable, so something needs to be done. My leaning is to go with
> > > the global thread list and atomicity of list-unlock with exit.
> > 
> > yes that sounds possible, i added some instrumentation to musl
> > and the trace shows situations like that before the deadlock,
> > exiting threads can even cause old (previously seen) entries to
> > disappear from the dir.
> > 
> Thanks for the thorough instrumentation! Your traces confirm both my theory
> about the deadlock and unreliability of /proc/self/task.
> 
> I'd also done a very light instrumentation just before I got your email, but
> it took me a while to understand the output I got (see below).

the attached patch fixes the issue on my machine.
i don't know if this is just luck.

the assumption is that if /proc/self/task is read twice such that
all tids in it seem to be active and caught, then all the active
threads of the process are caught (no new threads that are already
started but not visible there yet)

> Now, about the strange output I mentioned. Consider one of the above
> fragments:
> --iter: 4
> exit 15977
> retry 0
> tid 15977
> tid 15978
> exit 15978
> retry 1
> tid 15978
> tgkill: ESRCH
> mismatch: tid 15979: 0 != 23517
> 
> Note that "tid 15978" is printed two times. Recall that it's printed only if
> we haven't seen it in the chain. But how it's possible that we haven't seen
> it *two* times? Since we waited on the futex the first time and we got the
> lock, the signal handler must have unlocked it. There is even a comment
> before futex() call:
> 
> /* Obtaining the lock means the thread responded. ESRCH
>  * means the target thread exited, which is okay too. */
> 
> If it the signal handler reached futex unlock code, it must have updated the
> chain, and we must see the tid in the chain on the next retry and don't
> print it.
> 
> Apparently, there is another reason of futex(FUTEX_LOCK_PI) success: the
> owner is exiting concurrently (which is also indicated by the subsequent
> failure of tgkill with ESRCH). So obtaining the lock doesn't necessarily
> mean that the owner responded: it may also mean that the owner is (about to
> be?) dead.

so tgkill succeeds but the target exits before handling the signal.
i'd expect ESRCH then, not success from the futex.
interesting.

anyway i had to retry until there are no exiting threads in dir to
reliably fix the deadlock.

[-- Attachment #2: 0001-more-robust-synccall.patch --]
[-- Type: text/x-diff, Size: 1636 bytes --]

From ed101ece64b645865779293eb48109cad03e9c35 Mon Sep 17 00:00:00 2001
From: Szabolcs Nagy <nsz@port70.net>
Date: Sat, 9 Feb 2019 21:13:35 +0000
Subject: [PATCH] more robust synccall

---
 src/thread/synccall.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/src/thread/synccall.c b/src/thread/synccall.c
index cc66bd24..7f275114 100644
--- a/src/thread/synccall.c
+++ b/src/thread/synccall.c
@@ -102,6 +102,7 @@ void __synccall(void (*func)(void *), void *ctx)
 
 	/* Loop scanning the kernel-provided thread list until it shows no
 	 * threads that have not already replied to the signal. */
+	int all_threads_caught = 0;
 	for (;;) {
 		int miss_cnt = 0;
 		while ((de = readdir(&dir))) {
@@ -120,6 +121,7 @@ void __synccall(void (*func)(void *), void *ctx)
 			for (cp = head; cp && cp->tid != tid; cp=cp->next);
 			if (cp) continue;
 
+			miss_cnt++;
 			r = -__syscall(SYS_tgkill, pid, tid, SIGSYNCCALL);
 
 			/* Target thread exit is a success condition. */
@@ -142,10 +144,16 @@ void __synccall(void (*func)(void *), void *ctx)
 			/* Obtaining the lock means the thread responded. ESRCH
 			 * means the target thread exited, which is okay too. */
 			if (!r || r == ESRCH) continue;
-
-			miss_cnt++;
 		}
-		if (!miss_cnt) break;
+		if (miss_cnt)
+			all_threads_caught = 0;
+		else
+			all_threads_caught++;
+		/* when all visible threads are stopped there may be newly
+		 * created threads that are not in dir yet, so only assume
+		 * we are done when we see no running threads twice. */
+		if (all_threads_caught > 1)
+			break;
 		rewinddir(&dir);
 	}
 	close(dir.fd);
-- 
2.19.1



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-09 21:40           ` Szabolcs Nagy
@ 2019-02-09 22:29             ` Alexey Izbyshev
  2019-02-10  0:52             ` Rich Felker
  1 sibling, 0 replies; 19+ messages in thread
From: Alexey Izbyshev @ 2019-02-09 22:29 UTC (permalink / raw)
  To: musl; +Cc: Szabolcs Nagy

On 2019-02-10 00:40, Szabolcs Nagy wrote:
> the attached patch fixes the issue on my machine.
> i don't know if this is just luck.
> 
> the assumption is that if /proc/self/task is read twice such that
> all tids in it seem to be active and caught, then all the active
> threads of the process are caught (no new threads that are already
> started but not visible there yet)
> 
> anyway i had to retry until there are no exiting threads in dir to
> reliably fix the deadlock.

Unfortunately, on a 4.15.x kernel, I got both the deadlock (~23000 
iterations) and the mismatch (after I removed the kill() loop; ~19000 
iterations).

On 4.4.x, it took ~30 million iterations to get the mismatch (on the 
deadlock-free version):

--iter: 30198000
--iter: 30199000
mismatch: tid 539: 1000 != 0

Alexey



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-09 21:40           ` Szabolcs Nagy
  2019-02-09 22:29             ` Alexey Izbyshev
@ 2019-02-10  0:52             ` Rich Felker
  2019-02-10  1:16               ` Szabolcs Nagy
  1 sibling, 1 reply; 19+ messages in thread
From: Rich Felker @ 2019-02-10  0:52 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> * Alexey Izbyshev <izbyshev@ispras.ru> [2019-02-09 21:33:32 +0300]:
> > On 2019-02-09 19:21, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@libc.org> [2019-02-08 13:33:57 -0500]:
> > > > On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
> > > > > On 2/7/19 9:36 PM, Rich Felker wrote:
> > > > > >Does it work if we force two iterations of the readdir loop with no
> > > > > >tasks missed, rather than just one, to catch the case of missed
> > > > > >concurrent additions? I'm not sure. But all this makes me really
> > > > > >uncomfortable with the current approach.
> > > > >
> > > > > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
> > > > > == 0. The test eventually failed in all cases, with 0 retries
> > > > > requiring only a handful of iterations, 1 -- on the order of 100, 2
> > > > > -- on the order of 10000 and 3 -- on the order of 100000.
> > > > 
> > > > Do you have a theory on the mechanism of failure here? I'm guessing
> > > > it's something like this: there's a thread that goes unseen in the
> > > > first round, and during the second round, it creates a new thread and
> > > > exits itself. The exit gets seen (again, it doesn't show up in the
> > > > dirents) but the new thread it created still doesn't. Is that right?
> > > > 
> > > > In any case, it looks like the whole mechanism we're using is
> > > > unreliable, so something needs to be done. My leaning is to go with
> > > > the global thread list and atomicity of list-unlock with exit.
> > > 
> > > yes that sounds possible, i added some instrumentation to musl
> > > and the trace shows situations like that before the deadlock,
> > > exiting threads can even cause old (previously seen) entries to
> > > disappear from the dir.
> > > 
> > Thanks for the thorough instrumentation! Your traces confirm both my theory
> > about the deadlock and unreliability of /proc/self/task.
> > 
> > I'd also done a very light instrumentation just before I got your email, but
> > it took me a while to understand the output I got (see below).
> 
> the attached patch fixes the issue on my machine.
> i don't know if this is just luck.
> 
> the assumption is that if /proc/self/task is read twice such that
> all tids in it seem to be active and caught, then all the active
> threads of the process are caught (no new threads that are already
> started but not visible there yet)

I'm skeptical of whether this should work in principle. If the first
scan of /proc/self/task misses tid J, and during the next scan, tid J
creates tid K then exits, it seems like we could see the same set of
tids on both scans.

Maybe it's salvageable though. Since __block_new_threads is true, in
order for this to happen, tid J must have been between the
__block_new_threads check in pthread_create and the clone syscall at
the time __synccall started. The number of threads in such a state
seems to be bounded by some small constant (like 2) times
libc.threads_minus_1+1, computed at any point after
__block_new_threads is set to true, so sufficiently heavy presignaling
(heavier than we have now) might suffice to guarantee that all are
captured. 

Rich



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-10  0:52             ` Rich Felker
@ 2019-02-10  1:16               ` Szabolcs Nagy
  2019-02-10  1:20                 ` Rich Felker
  0 siblings, 1 reply; 19+ messages in thread
From: Szabolcs Nagy @ 2019-02-10  1:16 UTC (permalink / raw)
  To: musl; +Cc: Alexey Izbyshev

* Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > the assumption is that if /proc/self/task is read twice such that
> > all tids in it seem to be active and caught, then all the active
> > threads of the process are caught (no new threads that are already
> > started but not visible there yet)
> 
> I'm skeptical of whether this should work in principle. If the first
> scan of /proc/self/task misses tid J, and during the next scan, tid J
> creates tid K then exits, it seems like we could see the same set of
> tids on both scans.
> 
> Maybe it's salvagable though. Since __block_new_threads is true, in
> order for this to happen, tid J must have been between the
> __block_new_threads check in pthread_create and the clone syscall at
> the time __synccall started. The number of threads in such a state
> seems to be bounded by some small constant (like 2) times
> libc.threads_minus_1+1, computed at any point after
> __block_new_threads is set to true, so sufficiently heavy presignaling
> (heavier than we have now) might suffice to guarantee that all are
> captured. 

heavier presignaling may catch more threads, but we don't
know how long we should wait until all signal handlers are
invoked (to ensure that all tasks are enqueued on the call
serializer chain before we start walking that list)



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-10  1:16               ` Szabolcs Nagy
@ 2019-02-10  1:20                 ` Rich Felker
  2019-02-10  4:01                   ` Rich Felker
  2019-02-10 12:15                   ` Alexey Izbyshev
  0 siblings, 2 replies; 19+ messages in thread
From: Rich Felker @ 2019-02-10  1:20 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> > On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > > the assumption is that if /proc/self/task is read twice such that
> > > all tids in it seem to be active and caught, then all the active
> > > threads of the process are caught (no new threads that are already
> > > started but not visible there yet)
> > 
> > I'm skeptical of whether this should work in principle. If the first
> > scan of /proc/self/task misses tid J, and during the next scan, tid J
> > creates tid K then exits, it seems like we could see the same set of
> > tids on both scans.
> > 
> > Maybe it's salvagable though. Since __block_new_threads is true, in
> > order for this to happen, tid J must have been between the
> > __block_new_threads check in pthread_create and the clone syscall at
> > the time __synccall started. The number of threads in such a state
> > seems to be bounded by some small constant (like 2) times
> > libc.threads_minus_1+1, computed at any point after
> > __block_new_threads is set to true, so sufficiently heavy presignaling
> > (heavier than we have now) might suffice to guarantee that all are
> > captured. 
> 
> heavier presignaling may catch more threads, but we don't
> know how long should we wait until all signal handlers are
> invoked (to ensure that all tasks are enqueued on the call
> serializer chain before we start walking that list)

That's why reading /proc/self/task is still necessary. However, it
seems useful to be able to prove you've queued enough signals that at
least as many threads as could possibly exist are already in a state
where they cannot return from a syscall with signals unblocked without
entering the signal handler. In that case you would know there's no
more racing going on to create new threads, so reading /proc/self/task
is purely to get the list of threads you're waiting to enqueue
themselves on the chain, not to find new threads you need to signal.

Rich



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-10  1:20                 ` Rich Felker
@ 2019-02-10  4:01                   ` Rich Felker
  2019-02-10 12:32                     ` Szabolcs Nagy
  2019-02-10 12:15                   ` Alexey Izbyshev
  1 sibling, 1 reply; 19+ messages in thread
From: Rich Felker @ 2019-02-10  4:01 UTC (permalink / raw)
  To: musl; +Cc: Alexey Izbyshev

On Sat, Feb 09, 2019 at 08:20:32PM -0500, Rich Felker wrote:
> On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> > > On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > > > the assumption is that if /proc/self/task is read twice such that
> > > > all tids in it seem to be active and caught, then all the active
> > > > threads of the process are caught (no new threads that are already
> > > > started but not visible there yet)
> > > 
> > > I'm skeptical of whether this should work in principle. If the first
> > > scan of /proc/self/task misses tid J, and during the next scan, tid J
> > > creates tid K then exits, it seems like we could see the same set of
> > > tids on both scans.
> > > 
> > > Maybe it's salvagable though. Since __block_new_threads is true, in
> > > order for this to happen, tid J must have been between the
> > > __block_new_threads check in pthread_create and the clone syscall at
> > > the time __synccall started. The number of threads in such a state
> > > seems to be bounded by some small constant (like 2) times
> > > libc.threads_minus_1+1, computed at any point after
> > > __block_new_threads is set to true, so sufficiently heavy presignaling
> > > (heavier than we have now) might suffice to guarantee that all are
> > > captured. 
> > 
> > heavier presignaling may catch more threads, but we don't
> > know how long should we wait until all signal handlers are
> > invoked (to ensure that all tasks are enqueued on the call
> > serializer chain before we start walking that list)
> 
> That's why reading /proc/self/task is still necessary. However, it
> seems useful to be able to prove you've queued enough signals that at
> least as many threads as could possibly exist are already in a state
> where they cannot return from a syscall with signals unblocked without
> entering the signal handler. In that case you would know there's no
> more racing going on to create new threads, so reading /proc/self/task
> is purely to get the list of threads you're waiting to enqueue
> themselves on the chain, not to find new threads you need to signal.

One thing to note: SYS_kill is not required to queue an unlimited
number of signals, and might not report failure to do so. We should
probably be using SYS_rt_sigqueueinfo, counting the number of signals
successfully queued, and continuing to send them during the loop that
monitors progress building the chain until the necessary number have
been successfully sent, if we're going to rely on the above properties
to guarantee that we've caught every thread.
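
For illustration, a rough standalone sketch of the counting idea (a
hypothetical helper using the raw syscall; it assumes a real-time
signal so instances actually queue, and the EAGAIN handling is
simplistic, so it's not drop-in __synccall code):

#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Try to queue "want" instances of sig to the process and return how
 * many were actually queued.  EAGAIN means the signal queue is full;
 * a real implementation would need a smarter policy than retrying. */
static int queue_signals(pid_t pid, int sig, int want)
{
	siginfo_t si;
	memset(&si, 0, sizeof si);
	si.si_signo = sig;
	si.si_code = SI_QUEUE;
	int sent = 0;
	while (sent < want) {
		if (syscall(SYS_rt_sigqueueinfo, pid, sig, &si) == 0)
			sent++;
		else if (errno != EAGAIN)
			break;
	}
	return sent;
}

The caller would keep queuing, possibly interleaved with the loop that
monitors the chain, until the count reaches the bound derived from
libc.threads_minus_1.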

Rich



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-10  1:20                 ` Rich Felker
  2019-02-10  4:01                   ` Rich Felker
@ 2019-02-10 12:15                   ` Alexey Izbyshev
  2019-02-10 14:57                     ` Rich Felker
  1 sibling, 1 reply; 19+ messages in thread
From: Alexey Izbyshev @ 2019-02-10 12:15 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On 2019-02-10 04:20, Rich Felker wrote:
> On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
>> * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
>> > Maybe it's salvagable though. Since __block_new_threads is true, in
>> > order for this to happen, tid J must have been between the
>> > __block_new_threads check in pthread_create and the clone syscall at
>> > the time __synccall started. The number of threads in such a state
>> > seems to be bounded by some small constant (like 2) times
>> > libc.threads_minus_1+1, computed at any point after
>> > __block_new_threads is set to true, so sufficiently heavy presignaling
>> > (heavier than we have now) might suffice to guarantee that all are
>> > captured.
>> 
>> heavier presignaling may catch more threads, but we don't
>> know how long should we wait until all signal handlers are
>> invoked (to ensure that all tasks are enqueued on the call
>> serializer chain before we start walking that list)
> 
> That's why reading /proc/self/task is still necessary. However, it
> seems useful to be able to prove you've queued enough signals that at
> least as many threads as could possibly exist are already in a state
> where they cannot return from a syscall with signals unblocked without
> entering the signal handler. In that case you would know there's no
> more racing going on to create new threads, so reading /proc/self/task
> is purely to get the list of threads you're waiting to enqueue
> themselves on the chain, not to find new threads you need to signal.

Similar to Szabolcs, I fail to see how heavier presignaling would help. 
Even if we're sure that we'll *eventually* catch all threads (including 
their future children) that were between the __block_new_threads check in 
pthread_create and the clone syscall at the time we set 
__block_new_threads to 1, we still have no way to know whether we've 
reached a stable state. In other words, we don't know when we should 
stop spinning in the /proc/self/task loop because we may miss threads that 
are currently being created.

Also, note that __pthread_exit() blocks all signals and decrements 
libc.threads_minus_1 before exiting, so an arbitrary number of threads 
may be exiting while we're in /proc/self/task loop, and we know that 
concurrently exiting threads are related to misses.

Alexey



* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-10  4:01                   ` Rich Felker
@ 2019-02-10 12:32                     ` Szabolcs Nagy
  2019-02-10 15:05                       ` Rich Felker
  0 siblings, 1 reply; 19+ messages in thread
From: Szabolcs Nagy @ 2019-02-10 12:32 UTC (permalink / raw)
  To: musl; +Cc: Alexey Izbyshev

* Rich Felker <dalias@libc.org> [2019-02-09 23:01:50 -0500]:
> On Sat, Feb 09, 2019 at 08:20:32PM -0500, Rich Felker wrote:
> > On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> > > > On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > > > > the assumption is that if /proc/self/task is read twice such that
> > > > > all tids in it seem to be active and caught, then all the active
> > > > > threads of the process are caught (no new threads that are already
> > > > > started but not visible there yet)

it seems that if the main thread exits, it is still listed in /proc/self/task
and has zombie status for the lifetime of the process, so the futex lock
always fails with ESRCH.

so my logic of waiting for all exiting threads to exit does not work (at
least the main thread needs to be special-cased).

> > > > 
> > > > I'm skeptical of whether this should work in principle. If the first
> > > > scan of /proc/self/task misses tid J, and during the next scan, tid J
> > > > creates tid K then exits, it seems like we could see the same set of
> > > > tids on both scans.
> > > > 
> > > > Maybe it's salvagable though. Since __block_new_threads is true, in
> > > > order for this to happen, tid J must have been between the
> > > > __block_new_threads check in pthread_create and the clone syscall at
> > > > the time __synccall started. The number of threads in such a state
> > > > seems to be bounded by some small constant (like 2) times
> > > > libc.threads_minus_1+1, computed at any point after
> > > > __block_new_threads is set to true, so sufficiently heavy presignaling
> > > > (heavier than we have now) might suffice to guarantee that all are
> > > > captured. 
> > > 
> > > heavier presignaling may catch more threads, but we don't
> > > know how long should we wait until all signal handlers are
> > > invoked (to ensure that all tasks are enqueued on the call
> > > serializer chain before we start walking that list)
> > 
> > That's why reading /proc/self/task is still necessary. However, it
> > seems useful to be able to prove you've queued enough signals that at
> > least as many threads as could possibly exist are already in a state
> > where they cannot return from a syscall with signals unblocked without
> > entering the signal handler. In that case you would know there's no
> > more racing going on to create new threads, so reading /proc/self/task
> > is purely to get the list of threads you're waiting to enqueue
> > themselves on the chain, not to find new threads you need to signal.
> 
> One thing to note: SYS_kill is not required to queue an unlimited
> number of signals, and might not report failure to do so. We should
> probably be using SYS_rt_sigqueue, counting the number of signals
> successfully queued, and continue sending them during the loop that
> monitors progress building the chain until the necessary number have
> been successfully sent, if we're going to rely on the above properties
> to guarantee that we've caught every thread.
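
(just to make the counting idea quoted above concrete, a minimal sketch
using the rt_sigqueueinfo syscall; the signal number and target count are
parameters here, and this only illustrates the counting, not a fix:)

    #include <signal.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* queue up to `want` copies of `sig` to our own process and return
     * how many the kernel actually accepted; unlike kill(), the
     * rt_sigqueueinfo syscall fails with EAGAIN when the pending-signal
     * limit is reached, so the caller can keep retrying from the
     * progress-monitoring loop until enough have been queued */
    static int queue_counted_signals(int sig, int want)
    {
        siginfo_t si;
        int sent = 0;
        memset(&si, 0, sizeof si);
        si.si_code = SI_QUEUE;   /* application-queued signal */
        while (sent < want
               && syscall(SYS_rt_sigqueueinfo, getpid(), sig, &si) == 0)
            sent++;
        return sent;
    }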

yes, but even if we send enough signals that cannot be dropped,
and see that all tasks in /proc/self/task are caught in the handler,
there might be tasks that haven't reached the handler yet and are
not visible in /proc/self/task yet. if they add themselves to the
chain after we start processing it, they will wait forever.

as a duct-tape solution we could sleep a bit after all visible tasks
are stopped, to give the not-yet-visible ones a chance to run
(or to show up in /proc/self/task).

but ideally we would handle non-libc-created threads too, so using
libc.threads_minus_1 and __block_new_threads is already suboptimal;
a mechanism like ptrace or SIGSTOP that affects all tasks is needed.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-10 12:15                   ` Alexey Izbyshev
@ 2019-02-10 14:57                     ` Rich Felker
  0 siblings, 0 replies; 19+ messages in thread
From: Rich Felker @ 2019-02-10 14:57 UTC (permalink / raw)
  To: Alexey Izbyshev; +Cc: musl

On Sun, Feb 10, 2019 at 03:15:55PM +0300, Alexey Izbyshev wrote:
> On 2019-02-10 04:20, Rich Felker wrote:
> >On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> >>* Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> >>> Maybe it's salvageable though. Since __block_new_threads is true, in
> >>> order for this to happen, tid J must have been between the
> >>> __block_new_threads check in pthread_create and the clone syscall at
> >>> the time __synccall started. The number of threads in such a state
> >>> seems to be bounded by some small constant (like 2) times
> >>> libc.threads_minus_1+1, computed at any point after
> >>> __block_new_threads is set to true, so sufficiently heavy presignaling
> >>> (heavier than we have now) might suffice to guarantee that all are
> >>> captured.
> >>
> >>heavier presignaling may catch more threads, but we don't
> >>know how long we should wait until all signal handlers are
> >>invoked (to ensure that all tasks are enqueued on the call
> >>serializer chain before we start walking that list)
> >
> >That's why reading /proc/self/task is still necessary. However, it
> >seems useful to be able to prove you've queued enough signals that at
> >least as many threads as could possibly exist are already in a state
> >where they cannot return from a syscall with signals unblocked without
> >entering the signal handler. In that case you would know there's no
> >more racing going on to create new threads, so reading /proc/self/task
> >is purely to get the list of threads you're waiting to enqueue
> >themselves on the chain, not to find new threads you need to signal.
> 
> Similar to Szabolcs, I fail to see how heavier presignaling would
> help. Even if we're sure that we'll *eventually* catch all threads
> (including their future children) that were between the
> __block_new_threads check in pthread_create and the clone syscall at
> the time we set __block_new_threads to 1, we still have no means to
> know whether we have reached a stable state. In other words, we
> don't know when we should stop spinning in the /proc/self/task loop
> because we may miss threads that are currently being created.

This seems correct.

> Also, note that __pthread_exit() blocks all signals and decrements
> libc.threads_minus_1 before exiting, so an arbitrary number of
> threads may be exiting while we're in the /proc/self/task loop, and
> we know that concurrently exiting threads are related to the misses.

This too -- there could in theory be unboundedly many threads that
have already decremented threads_minus_1 but haven't yet exited, and
this approach has no way to ensure that we wait for them to exit
before returning from __synccall.

I'm thinking that the problems here are unrecoverable and that we need
the thread list.

Rich


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-10 12:32                     ` Szabolcs Nagy
@ 2019-02-10 15:05                       ` Rich Felker
  0 siblings, 0 replies; 19+ messages in thread
From: Rich Felker @ 2019-02-10 15:05 UTC (permalink / raw)
  To: musl, Alexey Izbyshev

On Sun, Feb 10, 2019 at 01:32:14PM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2019-02-09 23:01:50 -0500]:
> > On Sat, Feb 09, 2019 at 08:20:32PM -0500, Rich Felker wrote:
> > > On Sun, Feb 10, 2019 at 02:16:23AM +0100, Szabolcs Nagy wrote:
> > > > * Rich Felker <dalias@libc.org> [2019-02-09 19:52:50 -0500]:
> > > > > On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> > > > > > the assumption is that if /proc/self/task is read twice such that
> > > > > > all tids in it seem to be active and caught, then all the active
> > > > > > threads of the process are caught (no new threads that are already
> > > > > > started but not visible there yet)
> 
> it seems that if the main thread exits, it is still listed in
> /proc/self/task and has zombie status for the lifetime of the
> process, so the futex lock always fails with ESRCH.
> 
> so my logic of waiting for all exiting threads to finish does not
> work (at least the main thread needs to be special-cased).
> 
> > > > > 
> > > > > I'm skeptical of whether this should work in principle. If the first
> > > > > scan of /proc/self/task misses tid J, and during the next scan, tid J
> > > > > creates tid K then exits, it seems like we could see the same set of
> > > > > tids on both scans.
> > > > > 
> > > > > Maybe it's salvageable though. Since __block_new_threads is true, in
> > > > > order for this to happen, tid J must have been between the
> > > > > __block_new_threads check in pthread_create and the clone syscall at
> > > > > the time __synccall started. The number of threads in such a state
> > > > > seems to be bounded by some small constant (like 2) times
> > > > > libc.threads_minus_1+1, computed at any point after
> > > > > __block_new_threads is set to true, so sufficiently heavy presignaling
> > > > > (heavier than we have now) might suffice to guarantee that all are
> > > > > captured. 
> > > > 
> > > > heavier presignaling may catch more threads, but we don't
> > > > know how long we should wait until all signal handlers are
> > > > invoked (to ensure that all tasks are enqueued on the call
> > > > serializer chain before we start walking that list)
> > > 
> > > That's why reading /proc/self/task is still necessary. However, it
> > > seems useful to be able to prove you've queued enough signals that at
> > > least as many threads as could possibly exist are already in a state
> > > where they cannot return from a syscall with signals unblocked without
> > > entering the signal handler. In that case you would know there's no
> > > more racing going on to create new threads, so reading /proc/self/task
> > > is purely to get the list of threads you're waiting to enqueue
> > > themselves on the chain, not to find new threads you need to signal.
> > 
> > One thing to note: SYS_kill is not required to queue an unlimited
> > number of signals, and might not report failure to do so. We should
> > probably be using SYS_rt_sigqueueinfo, counting the number of signals
> > successfully queued, and continue sending them during the loop that
> > monitors progress building the chain until the necessary number have
> > been successfully sent, if we're going to rely on the above properties
> > to guarantee that we've caught every thread.
> 
> yes, but even if we send enough signals that cannot be dropped,
> and see that all tasks in /proc/self/task are caught in the handler,
> there might be tasks that haven't reached the handler yet and are
> not visible in /proc/self/task yet. if they add themselves to the
> chain after we start processing it, they will wait forever.
> 
> as a duct-tape solution we could sleep a bit after all visible tasks
> are stopped, to give the not-yet-visible ones a chance to run
> (or to show up in /proc/self/task).

This is not going to help on a box that's swapping to hell where one
of the threads takes 30 seconds to run again (which is a real
possibility!)

> but ideally we would handle non-libc-created threads too, so using
> libc.threads_minus_1 and __block_new_threads is already suboptimal;

Non-libc-created threads just can't be supported; they break in all
sorts of ways and have to just be considered totally undefined. The
synccall signal handler couldn't even perform any of the operations it
does, since libc functions all (by contract, if not in practice) rely
on having a valid thread pointer. We bend this rule slightly (and very
carefully) in posix_spawn to make syscalls with a context shared with
the thread in the parent process, but allowing it to be broken in
arbitrary ways by application code is just not practical.

> a mechanism like ptrace or SIGSTOP that affects all tasks is needed.

Yes, that would work, but is incompatible with running in an
already-traced task as far as I know.

Rich


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-08 18:33     ` Rich Felker
  2019-02-09 16:21       ` Szabolcs Nagy
@ 2019-02-10 21:04       ` Alexey Izbyshev
  1 sibling, 0 replies; 19+ messages in thread
From: Alexey Izbyshev @ 2019-02-10 21:04 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On 2019-02-08 21:33, Rich Felker wrote:
> On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
>> ...I was talking about a barrier
>> between loads of head and cp->tid/cp->next:
>> 
>> for (cp = head; cp && cp->tid != tid; cp=cp->next);
>> 
>> In my understanding, we need consume semantics to observe correct
>> values of tid and next after we load head. If we don't take Alpha
>> into account,
>> it probably works without barriers on most current architectures,
>> however, I don't know what policy musl has for such cases.
> 
> I don't see how that's the case. The only stores to members of ch are
> made before the a_cas_p (which has seq_cst order, but just release
> would suffice) storing &ch into head and making it visible.
> 
This is probably off-topic because we know that the issue is not
related to barriers, but I was referring to what is described in the
DATA DEPENDENCY BARRIERS section at
<https://www.kernel.org/doc/Documentation/memory-barriers.txt>. Since
the last time I looked at it, the kernel devs have marked it
"historical" and added the following note:

> As of v4.15 of the Linux kernel, an smp_read_barrier_depends() was
> added to READ_ONCE(), which means that about the only people who
> need to pay attention to this section are those working on DEC Alpha
> architecture-specific code and those working on READ_ONCE() itself.
> For those who need it, and for those who are interested in the history,
> here is the story of data-dependency barriers.

So musl probably has no need to care.
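
For illustration, the pattern I had in mind looks roughly like this with
C11 atomics (a sketch only; musl uses its own a_cas_p and volatile
accesses rather than <stdatomic.h>):

    #include <stdatomic.h>

    struct chain { int tid; struct chain *next; };

    static _Atomic(struct chain *) head;  /* stands in for the list head in synccall.c */

    /* handler side: initialize the node first, then publish it; release
     * ordering on the successful CAS makes the stores to cp->tid and
     * cp->next visible to anyone who observes the new head */
    static void publish(struct chain *cp, int tid)
    {
        struct chain *old = atomic_load_explicit(&head, memory_order_relaxed);
        cp->tid = tid;
        do cp->next = old;
        while (!atomic_compare_exchange_weak_explicit(&head, &old, cp,
                   memory_order_release, memory_order_relaxed));
    }

    /* caller side: the dependency-ordered loads in question; consume
     * (promoted to acquire by current compilers) guarantees cp->tid and
     * cp->next are not observed older than the published node */
    static struct chain *find(int tid)
    {
        struct chain *cp = atomic_load_explicit(&head, memory_order_consume);
        while (cp && cp->tid != tid) cp = cp->next;
        return cp;
    }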

Alexey


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-02 21:40 __synccall: deadlock and reliance on racy /proc/self/task Alexey Izbyshev
  2019-02-07 18:36 ` Rich Felker
@ 2019-02-12 18:48 ` Rich Felker
  2019-02-21  0:41   ` Rich Felker
  1 sibling, 1 reply; 19+ messages in thread
From: Rich Felker @ 2019-02-12 18:48 UTC (permalink / raw)
  To: Alexey Izbyshev; +Cc: musl

On Sun, Feb 03, 2019 at 12:40:39AM +0300, Alexey Izbyshev wrote:
> Hello!
> 
> I've discovered that setuid() deadlocks on a simple stress test
> (attached: test-setuid.c) that creates threads concurrently with
> setuid(). (Tested on 1.1.21 on x86_64, kernel 4.15.x and 4.4.x). The
> gdb output:

FYI I just posted the start of a new thread describing the possible
thread list design:

https://www.openwall.com/lists/musl/2019/02/12/14

If you want to reply into it, the message-id for replies is,

Message-ID: <20190212182625.GA24199@brightrain.aerifal.cx>

(still wish the openwall archive showed those...)

Rich


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: __synccall: deadlock and reliance on racy /proc/self/task
  2019-02-12 18:48 ` Rich Felker
@ 2019-02-21  0:41   ` Rich Felker
  0 siblings, 0 replies; 19+ messages in thread
From: Rich Felker @ 2019-02-21  0:41 UTC (permalink / raw)
  To: Alexey Izbyshev; +Cc: musl

On Tue, Feb 12, 2019 at 01:48:49PM -0500, Rich Felker wrote:
> On Sun, Feb 03, 2019 at 12:40:39AM +0300, Alexey Izbyshev wrote:
> > Hello!
> > 
> > I've discovered that setuid() deadlocks on a simple stress test
> > (attached: test-setuid.c) that creates threads concurrently with
> > setuid(). (Tested on 1.1.21 on x86_64, kernel 4.15.x and 4.4.x). The
> > gdb output:
> 
> FYI I just posted the start of a new thread describing the possible
> thread list design:
> 
> https://www.openwall.com/lists/musl/2019/02/12/14
> 
> If you want to reply into it, the message-id for replies is,
> 
> Message-ID: <20190212182625.GA24199@brightrain.aerifal.cx>
> 
> (still wish the openwall archive showed those...)

I believe this issue should be entirely fixed now in git master.

When you have time to test, the relevant commits are
e4235d70672d9751d7718ddc2b52d0b426430768 and the few leading up to it.

Rich


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-02-21  0:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-02 21:40 __synccall: deadlock and reliance on racy /proc/self/task Alexey Izbyshev
2019-02-07 18:36 ` Rich Felker
2019-02-08 18:14   ` Alexey Izbyshev
2019-02-08 18:33     ` Rich Felker
2019-02-09 16:21       ` Szabolcs Nagy
2019-02-09 18:33         ` Alexey Izbyshev
2019-02-09 21:40           ` Szabolcs Nagy
2019-02-09 22:29             ` Alexey Izbyshev
2019-02-10  0:52             ` Rich Felker
2019-02-10  1:16               ` Szabolcs Nagy
2019-02-10  1:20                 ` Rich Felker
2019-02-10  4:01                   ` Rich Felker
2019-02-10 12:32                     ` Szabolcs Nagy
2019-02-10 15:05                       ` Rich Felker
2019-02-10 12:15                   ` Alexey Izbyshev
2019-02-10 14:57                     ` Rich Felker
2019-02-10 21:04       ` Alexey Izbyshev
2019-02-12 18:48 ` Rich Felker
2019-02-21  0:41   ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).