mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] [PATCH] [RFC] make clone() usable
@ 2023-05-30 23:35 Rich Felker
  2023-06-01 10:12 ` Alexey Izbyshev
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2023-05-30 23:35 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 852 bytes --]

As discussed before (see the 2021 thread "Incorrect thread TID
caching") clone() has been effectively unusable because it produces a
child process in invalid state, with wrong tid in its thread
structure, among other problems.

The attached proposed patch attempts to make clone() usable by having
it share the _Fork logic for establishing a consistent process state
after forking, and also blocks use of flags which produce invalid
state.

With CLONE_VM, the old behavior is kept, with a caveat that the child
context is extremely restrictive, ala vfork.

It was raised previously (pardon the pun) that raise() should perhaps
be modified to make a SYS_gettid syscall rather than using the thread
structure tid, so that it's possible to call raise() in a vfork
context without signaling the wrong process. It might make sense to do
that now too.

Rich

[-- Attachment #2: clone.diff --]
[-- Type: text/plain, Size: 3062 bytes --]

diff --git a/src/internal/fork_impl.h b/src/internal/fork_impl.h
index 354e733b..f995fce2 100644
--- a/src/internal/fork_impl.h
+++ b/src/internal/fork_impl.h
@@ -17,3 +17,5 @@ extern hidden volatile int *const __vmlock_lockptr;
 hidden void __malloc_atfork(int);
 hidden void __ldso_atfork(int);
 hidden void __pthread_key_atfork(int);
+
+hidden void __post_Fork(int);
diff --git a/src/linux/clone.c b/src/linux/clone.c
index 8c1af7d3..e8f76bbc 100644
--- a/src/linux/clone.c
+++ b/src/linux/clone.c
@@ -4,18 +4,55 @@
 #include <sched.h>
 #include "pthread_impl.h"
 #include "syscall.h"
+#include "lock.h"
+#include "fork_impl.h"
+
+struct clone_start_args {
+	int (*func)(void *);
+	void *arg;
+	sigset_t sigmask;
+};
+
+static int clone_start(void *arg)
+{
+	struct clone_start_args *csa = arg;
+	__post_Fork(0);
+	__restore_sigs(&csa->sigmask);
+	return csa->func(csa->arg);
+}
 
 int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
 {
+	struct clone_start_args csa;
 	va_list ap;
-	pid_t *ptid, *ctid;
-	void  *tls;
+	pid_t *ptid = 0, *ctid = 0;
+	void  *tls = 0;
+
+	/* Flags that produce an invalid thread/TLS state are disallowed. */
+	int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
+	if (flags & badflags)
+		return __syscall_ret(-EINVAL);
 
 	va_start(ap, arg);
-	ptid = va_arg(ap, pid_t *);
-	tls  = va_arg(ap, void *);
-	ctid = va_arg(ap, pid_t *);
+	if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
+	 	ptid = va_arg(ap, pid_t *);
+	if (flags & CLONE_CHILD_SETTID) {
+		tls = va_arg(ap, void *);
+		ctid = va_arg(ap, pid_t *);
+	}
 	va_end(ap);
 
-	return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls, ctid));
+	if (flags & CLONE_VM)
+		return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls, ctid));
+
+	__block_all_sigs(&csa.sigmask);
+	LOCK(__abort_lock);
+
+	csa.func = func;
+	csa.arg = arg;
+	int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
+
+	__post_Fork(ret);
+	__restore_sigs(&csa.sigmask);
+	return __syscall_ret(ret);
 }
diff --git a/src/process/_Fork.c b/src/process/_Fork.c
index fb0fdc2c..94906103 100644
--- a/src/process/_Fork.c
+++ b/src/process/_Fork.c
@@ -5,21 +5,13 @@
 #include "lock.h"
 #include "pthread_impl.h"
 #include "aio_impl.h"
+#include "fork_impl.h"
 
 static void dummy(int x) { }
 weak_alias(dummy, __aio_atfork);
 
-pid_t _Fork(void)
+void __post_Fork(int ret)
 {
-	pid_t ret;
-	sigset_t set;
-	__block_all_sigs(&set);
-	LOCK(__abort_lock);
-#ifdef SYS_fork
-	ret = __syscall(SYS_fork);
-#else
-	ret = __syscall(SYS_clone, SIGCHLD, 0);
-#endif
 	if (!ret) {
 		pthread_t self = __pthread_self();
 		self->tid = __syscall(SYS_gettid);
@@ -32,6 +24,20 @@ pid_t _Fork(void)
 	}
 	UNLOCK(__abort_lock);
 	if (!ret) __aio_atfork(1);
+}
+
+pid_t _Fork(void)
+{
+	pid_t ret;
+	sigset_t set;
+	__block_all_sigs(&set);
+	LOCK(__abort_lock);
+#ifdef SYS_fork
+	ret = __syscall(SYS_fork);
+#else
+	ret = __syscall(SYS_clone, SIGCHLD, 0);
+#endif
+	__post_Fork(ret);
 	__restore_sigs(&set);
 	return __syscall_ret(ret);
 }

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH] [RFC] make clone() usable
  2023-05-30 23:35 [musl] [PATCH] [RFC] make clone() usable Rich Felker
@ 2023-06-01 10:12 ` Alexey Izbyshev
  2023-06-01 20:08   ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Alexey Izbyshev @ 2023-06-01 10:12 UTC (permalink / raw)
  To: musl

On 2023-05-31 02:35, Rich Felker wrote:
> As discussed before (see the 2021 thread "Incorrect thread TID
> caching") clone() has been effectively unusable because it produces a
> child process in invalid state, with wrong tid in its thread
> structure, among other problems.
> 
> The attached proposed patch attempts to make clone() usable by having
> it share the _Fork logic for establishing a consistent process state
> after forking, and also blocks use of flags which produce invalid
> state.
> 
Tangentially, while thinking about this, I realized that because the 
kernel clears the address of the exit futex on clone and _Fork doesn't 
re-establish it, the following blocks forever:

void *thr(void *arg) {
         // Blocks in __tl_sync because __thread_list_lock is never 
unlocked
         pthread_join(*(pthread_t *)arg, NULL));
         return NULL;
}

int main() {
         if (!fork()) {
                 static pthread_t pt;
                 pt = pthread_self();
                 pthread_create(&(pthread_t){0}, NULL, thr, &pt);
                 pthread_exit(NULL);
         }
         wait(NULL);
}

This could be fixed by the following change in _Fork.c:

-               self->tid = __syscall(SYS_gettid);
+               self->tid = __syscall(SYS_set_tid_address, 
&__thread_list_lock);

> With CLONE_VM, the old behavior is kept, with a caveat that the child
> context is extremely restrictive, ala vfork.
> 
> It was raised previously (pardon the pun) that raise() should perhaps
> be modified to make a SYS_gettid syscall rather than using the thread
> structure tid, so that it's possible to call raise() in a vfork
> context without signaling the wrong process. It might make sense to do
> that now too.
> 
Yes, it would be nice to have raise and abort (which also reads the 
cached tid separately) working in a vfork child.

> diff --git a/src/linux/clone.c b/src/linux/clone.c
> index 8c1af7d3..e8f76bbc 100644
> --- a/src/linux/clone.c
> +++ b/src/linux/clone.c
> @@ -4,18 +4,55 @@
>  #include <sched.h>
>  #include "pthread_impl.h"
>  #include "syscall.h"
> +#include "lock.h"
> +#include "fork_impl.h"
> +
> +struct clone_start_args {
> +	int (*func)(void *);
> +	void *arg;
> +	sigset_t sigmask;
> +};
> +
> +static int clone_start(void *arg)
> +{
> +	struct clone_start_args *csa = arg;
> +	__post_Fork(0);
> +	__restore_sigs(&csa->sigmask);
> +	return csa->func(csa->arg);
> +}
> 
>  int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
>  {
> +	struct clone_start_args csa;
>  	va_list ap;
> -	pid_t *ptid, *ctid;
> -	void  *tls;
> +	pid_t *ptid = 0, *ctid = 0;
> +	void  *tls = 0;
> +
> +	/* Flags that produce an invalid thread/TLS state are disallowed. */
> +	int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
> +	if (flags & badflags)
> +		return __syscall_ret(-EINVAL);
> 
Maybe also check for NULL stack, as proposed in 
https://www.openwall.com/lists/musl/2022/08/02/1 ?

>  	va_start(ap, arg);
> -	ptid = va_arg(ap, pid_t *);
> -	tls  = va_arg(ap, void *);
> -	ctid = va_arg(ap, pid_t *);
> +	if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
> +	 	ptid = va_arg(ap, pid_t *);
> +	if (flags & CLONE_CHILD_SETTID) {
> +		tls = va_arg(ap, void *);
> +		ctid = va_arg(ap, pid_t *);
> +	}
>  	va_end(ap);
> 
> -	return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls, 
> ctid));
> +	if (flags & CLONE_VM)
> +		return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls, 
> ctid));
> +
> +	__block_all_sigs(&csa.sigmask);
> +	LOCK(__abort_lock);
> +
> +	csa.func = func;
> +	csa.arg = arg;
> +	int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
> +
> +	__post_Fork(ret);
> +	__restore_sigs(&csa.sigmask);
> +	return __syscall_ret(ret);
>  }

Another thing that might be somewhat relevant if we expect to see wider 
usage of clone is that syscall (the public function) currently blindly 
extracts six arguments via va_arg, which on some architectures requires 
sufficient stack space not to crash. So on i386 the following silly 
function will crash if passed to clone, provided that "stack" points to 
the beginning of unmapped space past the stack page end.

int f(void *arg) {
         return syscall(SYS_gettid);
}

And it will crash even on x86-64 if tail call optimization triggers, 
which doesn't happen with current GCC due to long/int return type 
mismatch.

(This test crashes with glibc 2.37 as well.)

pthread_create masks this issue because it always consumes some stack 
space with struct start_args.

Thanks,
Alexey

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH] [RFC] make clone() usable
  2023-06-01 10:12 ` Alexey Izbyshev
@ 2023-06-01 20:08   ` Rich Felker
  2023-06-01 20:17     ` [musl] [PATCH] [v2] " Rich Felker
                       ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Rich Felker @ 2023-06-01 20:08 UTC (permalink / raw)
  To: musl

On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
> On 2023-05-31 02:35, Rich Felker wrote:
> >As discussed before (see the 2021 thread "Incorrect thread TID
> >caching") clone() has been effectively unusable because it produces a
> >child process in invalid state, with wrong tid in its thread
> >structure, among other problems.
> >
> >The attached proposed patch attempts to make clone() usable by having
> >it share the _Fork logic for establishing a consistent process state
> >after forking, and also blocks use of flags which produce invalid
> >state.
> >
> Tangentially, while thinking about this, I realized that because the
> kernel clears the address of the exit futex on clone and _Fork
> doesn't re-establish it, the following blocks forever:
> 
> void *thr(void *arg) {
>         // Blocks in __tl_sync because __thread_list_lock is never
> unlocked
>         pthread_join(*(pthread_t *)arg, NULL));
>         return NULL;
> }
> 
> int main() {
>         if (!fork()) {
>                 static pthread_t pt;
>                 pt = pthread_self();
>                 pthread_create(&(pthread_t){0}, NULL, thr, &pt);
>                 pthread_exit(NULL);
>         }
>         wait(NULL);
> }
> 
> This could be fixed by the following change in _Fork.c:
> 
> -               self->tid = __syscall(SYS_gettid);
> +               self->tid = __syscall(SYS_set_tid_address,
> &__thread_list_lock);

Wow, I had no idea Linux's fork cleared the exit futex address. That's
a big bug we've had around for nobody to have noticed...

> >With CLONE_VM, the old behavior is kept, with a caveat that the child
> >context is extremely restrictive, ala vfork.
> >
> >It was raised previously (pardon the pun) that raise() should perhaps
> >be modified to make a SYS_gettid syscall rather than using the thread
> >structure tid, so that it's possible to call raise() in a vfork
> >context without signaling the wrong process. It might make sense to do
> >that now too.
> >
> Yes, it would be nice to have raise and abort (which also reads the
> cached tid separately) working in a vfork child.
> 
> >diff --git a/src/linux/clone.c b/src/linux/clone.c
> >index 8c1af7d3..e8f76bbc 100644
> >--- a/src/linux/clone.c
> >+++ b/src/linux/clone.c
> >@@ -4,18 +4,55 @@
> > #include <sched.h>
> > #include "pthread_impl.h"
> > #include "syscall.h"
> >+#include "lock.h"
> >+#include "fork_impl.h"
> >+
> >+struct clone_start_args {
> >+	int (*func)(void *);
> >+	void *arg;
> >+	sigset_t sigmask;
> >+};
> >+
> >+static int clone_start(void *arg)
> >+{
> >+	struct clone_start_args *csa = arg;
> >+	__post_Fork(0);
> >+	__restore_sigs(&csa->sigmask);
> >+	return csa->func(csa->arg);
> >+}
> >
> > int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
> > {
> >+	struct clone_start_args csa;
> > 	va_list ap;
> >-	pid_t *ptid, *ctid;
> >-	void  *tls;
> >+	pid_t *ptid = 0, *ctid = 0;
> >+	void  *tls = 0;
> >+
> >+	/* Flags that produce an invalid thread/TLS state are disallowed. */
> >+	int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
> >+	if (flags & badflags)
> >+		return __syscall_ret(-EINVAL);
> >
> Maybe also check for NULL stack, as proposed in
> https://www.openwall.com/lists/musl/2022/08/02/1 ?

I guess that can be done. Normally we don't check for null pointers,
but I think the syscall would silently keep the existing stack pointer
if passed null rather than blowing up, which we probably don't want --
even without CLONE_VM, where it'd blow up, we'd be giving
appears-to-work behavior without a contract for it to work.

> > 	va_start(ap, arg);
> >-	ptid = va_arg(ap, pid_t *);
> >-	tls  = va_arg(ap, void *);
> >-	ctid = va_arg(ap, pid_t *);
> >+	if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
> >+	 	ptid = va_arg(ap, pid_t *);
> >+	if (flags & CLONE_CHILD_SETTID) {
> >+		tls = va_arg(ap, void *);
> >+		ctid = va_arg(ap, pid_t *);
> >+	}
> > 	va_end(ap);
> >
> >-	return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls,
> >ctid));
> >+	if (flags & CLONE_VM)
> >+		return __syscall_ret(__clone(func, stack, flags, arg, ptid,
> >tls, ctid));
> >+
> >+	__block_all_sigs(&csa.sigmask);
> >+	LOCK(__abort_lock);
> >+
> >+	csa.func = func;
> >+	csa.arg = arg;
> >+	int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
> >+
> >+	__post_Fork(ret);
> >+	__restore_sigs(&csa.sigmask);
> >+	return __syscall_ret(ret);
> > }
> 
> Another thing that might be somewhat relevant if we expect to see
> wider usage of clone is that syscall (the public function) currently
> blindly extracts six arguments via va_arg, which on some
> architectures requires sufficient stack space not to crash. So on
> i386 the following silly function will crash if passed to clone,
> provided that "stack" points to the beginning of unmapped space past
> the stack page end.

That's changed in the patch. It only calls va_arg for arguments whose
existence the flags implies.

Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH] [v2] make clone() usable
  2023-06-01 20:08   ` Rich Felker
@ 2023-06-01 20:17     ` Rich Felker
  2023-06-02  7:28     ` [musl] [PATCH] [RFC] " Alexey Izbyshev
  2023-06-02 14:19     ` Markus Wichmann
  2 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2023-06-01 20:17 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1756 bytes --]

On Thu, Jun 01, 2023 at 04:08:07PM -0400, Rich Felker wrote:
> On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
> > On 2023-05-31 02:35, Rich Felker wrote:
> > >As discussed before (see the 2021 thread "Incorrect thread TID
> > >caching") clone() has been effectively unusable because it produces a
> > >child process in invalid state, with wrong tid in its thread
> > >structure, among other problems.
> > >
> > >The attached proposed patch attempts to make clone() usable by having
> > >it share the _Fork logic for establishing a consistent process state
> > >after forking, and also blocks use of flags which produce invalid
> > >state.
> > >
> > Tangentially, while thinking about this, I realized that because the
> > kernel clears the address of the exit futex on clone and _Fork
> > doesn't re-establish it, the following blocks forever:
> > 
> > void *thr(void *arg) {
> >         // Blocks in __tl_sync because __thread_list_lock is never
> > unlocked
> >         pthread_join(*(pthread_t *)arg, NULL));
> >         return NULL;
> > }
> > 
> > int main() {
> >         if (!fork()) {
> >                 static pthread_t pt;
> >                 pt = pthread_self();
> >                 pthread_create(&(pthread_t){0}, NULL, thr, &pt);
> >                 pthread_exit(NULL);
> >         }
> >         wait(NULL);
> > }
> > 
> > This could be fixed by the following change in _Fork.c:
> > 
> > -               self->tid = __syscall(SYS_gettid);
> > +               self->tid = __syscall(SYS_set_tid_address,
> > &__thread_list_lock);
> 
> Wow, I had no idea Linux's fork cleared the exit futex address. That's
> a big bug we've had around for nobody to have noticed...

Updated version with this fixed and checking for null stack.

[-- Attachment #2: 0001-fix-broken-thread-list-unlocking-after-fork.patch --]
[-- Type: text/plain, Size: 1106 bytes --]

From 0c277ff156749628c678257f878d3a6edb5868de Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Thu, 1 Jun 2023 16:09:32 -0400
Subject: [PATCH 1/2] fix broken thread list unlocking after fork

apparently Linux clears the registered exit futex address on fork.
this means that, if after forking the child process becomes
multithreaded and the original thread exits, the thread list will
never be unlocked, and future attempts to use the thread list will
deadlock.

re-register the exit futex address after _Fork in the child to ensure
that it's preserved.
---
 src/process/_Fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/process/_Fork.c b/src/process/_Fork.c
index fb0fdc2c..e7650863 100644
--- a/src/process/_Fork.c
+++ b/src/process/_Fork.c
@@ -22,7 +22,7 @@ pid_t _Fork(void)
 #endif
 	if (!ret) {
 		pthread_t self = __pthread_self();
-		self->tid = __syscall(SYS_gettid);
+		self->tid = __syscall(SYS_set_tid_address, &__thread_list_lock);
 		self->robust_list.off = 0;
 		self->robust_list.pending = 0;
 		self->next = self->prev = self;
-- 
2.21.0


[-- Attachment #3: 0002-fix-public-clone-function-to-be-safe-and-usable-by-a.patch --]
[-- Type: text/plain, Size: 5779 bytes --]

From fa4a8abd06a401822cc8ba4e352a219544c0118d Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Wed, 31 May 2023 12:04:06 -0400
Subject: [PATCH 2/2] fix public clone function to be safe and usable by
 applications

the clone() function has been effectively unusable since it was added,
due to producing a child process with inconsistent state. in
particular, the child process's thread structure still contains the
tid, thread list pointers, thread count, and robust list for the
parent. this will cause malfunction in interfaces that attempt to use
the tid or thread list, some of which are specified to be
async-signal-safe.

this patch attempts to make clone() consistent in a _Fork-like sense.
as in _Fork, when the parent process is multi-threaded, the child
process inherits an async-signal context where it cannot call
AS-unsafe functions, but its context is now intended to be safe for
calling AS-safe functions. making clone fork-like would also be a
future option, if it turns out that this is what makes sense to
applications, but it's not done at this time because the changes would
be more invasive.

in the case where the CLONE_VM flag is used, clone is only vfork-like,
not _Fork-like. in particular, the child will see itself as having the
parent's tid, and cannot safely call any libc functions but one of the
exec family or _exit.

handling of flags and variadic arguments is also changed so that
arguments are only consumed with flags that indicate their presence,
and so that flags which produce an inconsistent state are disallowed
(reported as EINVAL). in particular, all libc functions carry a
contract that they are only callable with ABI requirements met, which
includes having a valid thread pointer to a thread structure that's
unique within the process, and whose contents are opaque and only able
to be setup internally by the implementation. the only way for an
application to use flags that violate these requirements without
executing any libc code is to perform the syscall from
application-provided asm.
---
 src/internal/fork_impl.h |  2 ++
 src/linux/clone.c        | 56 +++++++++++++++++++++++++++++++++++-----
 src/process/_Fork.c      | 26 ++++++++++++-------
 3 files changed, 68 insertions(+), 16 deletions(-)

diff --git a/src/internal/fork_impl.h b/src/internal/fork_impl.h
index 354e733b..f995fce2 100644
--- a/src/internal/fork_impl.h
+++ b/src/internal/fork_impl.h
@@ -17,3 +17,5 @@ extern hidden volatile int *const __vmlock_lockptr;
 hidden void __malloc_atfork(int);
 hidden void __ldso_atfork(int);
 hidden void __pthread_key_atfork(int);
+
+hidden void __post_Fork(int);
diff --git a/src/linux/clone.c b/src/linux/clone.c
index 8c1af7d3..257c1cec 100644
--- a/src/linux/clone.c
+++ b/src/linux/clone.c
@@ -4,18 +4,62 @@
 #include <sched.h>
 #include "pthread_impl.h"
 #include "syscall.h"
+#include "lock.h"
+#include "fork_impl.h"
+
+struct clone_start_args {
+	int (*func)(void *);
+	void *arg;
+	sigset_t sigmask;
+};
+
+static int clone_start(void *arg)
+{
+	struct clone_start_args *csa = arg;
+	__post_Fork(0);
+	__restore_sigs(&csa->sigmask);
+	return csa->func(csa->arg);
+}
 
 int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
 {
+	struct clone_start_args csa;
 	va_list ap;
-	pid_t *ptid, *ctid;
-	void  *tls;
+	pid_t *ptid = 0, *ctid = 0;
+	void  *tls = 0;
+
+	/* Flags that produce an invalid thread/TLS state are disallowed. */
+	int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
+
+	if ((flags & badflags) || !stack)
+		return __syscall_ret(-EINVAL);
 
 	va_start(ap, arg);
-	ptid = va_arg(ap, pid_t *);
-	tls  = va_arg(ap, void *);
-	ctid = va_arg(ap, pid_t *);
+	if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
+	 	ptid = va_arg(ap, pid_t *);
+	if (flags & CLONE_CHILD_SETTID) {
+		tls = va_arg(ap, void *);
+		ctid = va_arg(ap, pid_t *);
+	}
 	va_end(ap);
 
-	return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls, ctid));
+	/* If CLONE_VM is used, it's impossible to give the child a consistent
+	 * thread structure. In this case, the best we can do is assume the
+	 * caller is content with an extremely restrictive execution context
+	 * like the one vfork() would provide. */
+	if (flags & CLONE_VM) return __syscall_ret(
+		__clone(func, stack, flags, arg, ptid, tls, ctid));
+
+	__block_all_sigs(&csa.sigmask);
+	LOCK(__abort_lock);
+
+	/* Setup the a wrapper start function for the child process to do
+	 * mimic _Fork in producing a consistent execution state. */
+	csa.func = func;
+	csa.arg = arg;
+	int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
+
+	__post_Fork(ret);
+	__restore_sigs(&csa.sigmask);
+	return __syscall_ret(ret);
 }
diff --git a/src/process/_Fork.c b/src/process/_Fork.c
index e7650863..9c07792d 100644
--- a/src/process/_Fork.c
+++ b/src/process/_Fork.c
@@ -5,21 +5,13 @@
 #include "lock.h"
 #include "pthread_impl.h"
 #include "aio_impl.h"
+#include "fork_impl.h"
 
 static void dummy(int x) { }
 weak_alias(dummy, __aio_atfork);
 
-pid_t _Fork(void)
+void __post_Fork(int ret)
 {
-	pid_t ret;
-	sigset_t set;
-	__block_all_sigs(&set);
-	LOCK(__abort_lock);
-#ifdef SYS_fork
-	ret = __syscall(SYS_fork);
-#else
-	ret = __syscall(SYS_clone, SIGCHLD, 0);
-#endif
 	if (!ret) {
 		pthread_t self = __pthread_self();
 		self->tid = __syscall(SYS_set_tid_address, &__thread_list_lock);
@@ -32,6 +24,20 @@ pid_t _Fork(void)
 	}
 	UNLOCK(__abort_lock);
 	if (!ret) __aio_atfork(1);
+}
+
+pid_t _Fork(void)
+{
+	pid_t ret;
+	sigset_t set;
+	__block_all_sigs(&set);
+	LOCK(__abort_lock);
+#ifdef SYS_fork
+	ret = __syscall(SYS_fork);
+#else
+	ret = __syscall(SYS_clone, SIGCHLD, 0);
+#endif
+	__post_Fork(ret);
 	__restore_sigs(&set);
 	return __syscall_ret(ret);
 }
-- 
2.21.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH] [RFC] make clone() usable
  2023-06-01 20:08   ` Rich Felker
  2023-06-01 20:17     ` [musl] [PATCH] [v2] " Rich Felker
@ 2023-06-02  7:28     ` Alexey Izbyshev
  2023-06-02 14:16       ` Rich Felker
  2023-06-02 14:19     ` Markus Wichmann
  2 siblings, 1 reply; 8+ messages in thread
From: Alexey Izbyshev @ 2023-06-02  7:28 UTC (permalink / raw)
  To: musl

On 2023-06-01 23:08, Rich Felker wrote:
> On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
>> On 2023-05-31 02:35, Rich Felker wrote:
>> >As discussed before (see the 2021 thread "Incorrect thread TID
>> >caching") clone() has been effectively unusable because it produces a
>> >child process in invalid state, with wrong tid in its thread
>> >structure, among other problems.
>> >
>> >The attached proposed patch attempts to make clone() usable by having
>> >it share the _Fork logic for establishing a consistent process state
>> >after forking, and also blocks use of flags which produce invalid
>> >state.
>> >
>> Tangentially, while thinking about this, I realized that because the
>> kernel clears the address of the exit futex on clone and _Fork
>> doesn't re-establish it, the following blocks forever:
>> 
>> void *thr(void *arg) {
>>         // Blocks in __tl_sync because __thread_list_lock is never
>> unlocked
>>         pthread_join(*(pthread_t *)arg, NULL));
>>         return NULL;
>> }
>> 
>> int main() {
>>         if (!fork()) {
>>                 static pthread_t pt;
>>                 pt = pthread_self();
>>                 pthread_create(&(pthread_t){0}, NULL, thr, &pt);
>>                 pthread_exit(NULL);
>>         }
>>         wait(NULL);
>> }
>> 
>> This could be fixed by the following change in _Fork.c:
>> 
>> -               self->tid = __syscall(SYS_gettid);
>> +               self->tid = __syscall(SYS_set_tid_address,
>> &__thread_list_lock);
> 
> Wow, I had no idea Linux's fork cleared the exit futex address. That's
> a big bug we've had around for nobody to have noticed...
> 
>> >With CLONE_VM, the old behavior is kept, with a caveat that the child
>> >context is extremely restrictive, ala vfork.
>> >
>> >It was raised previously (pardon the pun) that raise() should perhaps
>> >be modified to make a SYS_gettid syscall rather than using the thread
>> >structure tid, so that it's possible to call raise() in a vfork
>> >context without signaling the wrong process. It might make sense to do
>> >that now too.
>> >
>> Yes, it would be nice to have raise and abort (which also reads the
>> cached tid separately) working in a vfork child.
>> 
>> >diff --git a/src/linux/clone.c b/src/linux/clone.c
>> >index 8c1af7d3..e8f76bbc 100644
>> >--- a/src/linux/clone.c
>> >+++ b/src/linux/clone.c
>> >@@ -4,18 +4,55 @@
>> > #include <sched.h>
>> > #include "pthread_impl.h"
>> > #include "syscall.h"
>> >+#include "lock.h"
>> >+#include "fork_impl.h"
>> >+
>> >+struct clone_start_args {
>> >+	int (*func)(void *);
>> >+	void *arg;
>> >+	sigset_t sigmask;
>> >+};
>> >+
>> >+static int clone_start(void *arg)
>> >+{
>> >+	struct clone_start_args *csa = arg;
>> >+	__post_Fork(0);
>> >+	__restore_sigs(&csa->sigmask);
>> >+	return csa->func(csa->arg);
>> >+}
>> >
>> > int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
>> > {
>> >+	struct clone_start_args csa;
>> > 	va_list ap;
>> >-	pid_t *ptid, *ctid;
>> >-	void  *tls;
>> >+	pid_t *ptid = 0, *ctid = 0;
>> >+	void  *tls = 0;
>> >+
>> >+	/* Flags that produce an invalid thread/TLS state are disallowed. */
>> >+	int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
>> >+	if (flags & badflags)
>> >+		return __syscall_ret(-EINVAL);
>> >
>> Maybe also check for NULL stack, as proposed in
>> https://www.openwall.com/lists/musl/2022/08/02/1 ?
> 
> I guess that can be done. Normally we don't check for null pointers,
> but I think the syscall would silently keep the existing stack pointer
> if passed null rather than blowing up, which we probably don't want --
> even without CLONE_VM, where it'd blow up, we'd be giving
> appears-to-work behavior without a contract for it to work.
> 
Yes, and this is not only about CLONE_VM. On some arches (e.g. i386, 
x86_64, aarch64) musl's __clone stores things on "stack" and hence 
segfaults if "stack" is NULL, but on other arches it doesn't (e.g. arm). 
Exposing this is inconsistent and confusing. If support for passing NULL 
"stack" is ever desired, __clone should be changed first.

>> > 	va_start(ap, arg);
>> >-	ptid = va_arg(ap, pid_t *);
>> >-	tls  = va_arg(ap, void *);
>> >-	ctid = va_arg(ap, pid_t *);
>> >+	if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
>> >+	 	ptid = va_arg(ap, pid_t *);
>> >+	if (flags & CLONE_CHILD_SETTID) {
>> >+		tls = va_arg(ap, void *);
>> >+		ctid = va_arg(ap, pid_t *);
>> >+	}
>> > 	va_end(ap);
>> >
>> >-	return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls,
>> >ctid));
>> >+	if (flags & CLONE_VM)
>> >+		return __syscall_ret(__clone(func, stack, flags, arg, ptid,
>> >tls, ctid));
>> >+
>> >+	__block_all_sigs(&csa.sigmask);
>> >+	LOCK(__abort_lock);
>> >+
>> >+	csa.func = func;
>> >+	csa.arg = arg;
>> >+	int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
>> >+
>> >+	__post_Fork(ret);
>> >+	__restore_sigs(&csa.sigmask);
>> >+	return __syscall_ret(ret);
>> > }
>> 
>> Another thing that might be somewhat relevant if we expect to see
>> wider usage of clone is that syscall (the public function) currently
>> blindly extracts six arguments via va_arg, which on some
>> architectures requires sufficient stack space not to crash. So on
>> i386 the following silly function will crash if passed to clone,
>> provided that "stack" points to the beginning of unmapped space past
>> the stack page end.
> 
> That's changed in the patch. It only calls va_arg for arguments whose
> existence the flags implies.
> 
I was talking about va_arg usage in syscall() (in misc/syscall.c). 
__clone only aligns stack on 16 bytes (at least on arches that I 
checked), but doesn't reserve space. So a sufficiently unlucky 
user-provided function will crash when called from __clone in the child, 
as in my example.

Alexey

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH] [RFC] make clone() usable
  2023-06-02  7:28     ` [musl] [PATCH] [RFC] " Alexey Izbyshev
@ 2023-06-02 14:16       ` Rich Felker
  0 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2023-06-02 14:16 UTC (permalink / raw)
  To: musl

On Fri, Jun 02, 2023 at 10:28:09AM +0300, Alexey Izbyshev wrote:
> On 2023-06-01 23:08, Rich Felker wrote:
> >On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
> >>On 2023-05-31 02:35, Rich Felker wrote:
> >>>As discussed before (see the 2021 thread "Incorrect thread TID
> >>>caching") clone() has been effectively unusable because it produces a
> >>>child process in invalid state, with wrong tid in its thread
> >>>structure, among other problems.
> >>>
> >>>The attached proposed patch attempts to make clone() usable by having
> >>>it share the _Fork logic for establishing a consistent process state
> >>>after forking, and also blocks use of flags which produce invalid
> >>>state.
> >>>
> >>Tangentially, while thinking about this, I realized that because the
> >>kernel clears the address of the exit futex on clone and _Fork
> >>doesn't re-establish it, the following blocks forever:
> >>
> >>void *thr(void *arg) {
> >>        // Blocks in __tl_sync because __thread_list_lock is never
> >>unlocked
> >>        pthread_join(*(pthread_t *)arg, NULL));
> >>        return NULL;
> >>}
> >>
> >>int main() {
> >>        if (!fork()) {
> >>                static pthread_t pt;
> >>                pt = pthread_self();
> >>                pthread_create(&(pthread_t){0}, NULL, thr, &pt);
> >>                pthread_exit(NULL);
> >>        }
> >>        wait(NULL);
> >>}
> >>
> >>This could be fixed by the following change in _Fork.c:
> >>
> >>-               self->tid = __syscall(SYS_gettid);
> >>+               self->tid = __syscall(SYS_set_tid_address,
> >>&__thread_list_lock);
> >
> >Wow, I had no idea Linux's fork cleared the exit futex address. That's
> >a big bug we've had around for nobody to have noticed...
> >
> >>>With CLONE_VM, the old behavior is kept, with a caveat that the child
> >>>context is extremely restrictive, ala vfork.
> >>>
> >>>It was raised previously (pardon the pun) that raise() should perhaps
> >>>be modified to make a SYS_gettid syscall rather than using the thread
> >>>structure tid, so that it's possible to call raise() in a vfork
> >>>context without signaling the wrong process. It might make sense to do
> >>>that now too.
> >>>
> >>Yes, it would be nice to have raise and abort (which also reads the
> >>cached tid separately) working in a vfork child.
> >>
> >>>diff --git a/src/linux/clone.c b/src/linux/clone.c
> >>>index 8c1af7d3..e8f76bbc 100644
> >>>--- a/src/linux/clone.c
> >>>+++ b/src/linux/clone.c
> >>>@@ -4,18 +4,55 @@
> >>> #include <sched.h>
> >>> #include "pthread_impl.h"
> >>> #include "syscall.h"
> >>>+#include "lock.h"
> >>>+#include "fork_impl.h"
> >>>+
> >>>+struct clone_start_args {
> >>>+	int (*func)(void *);
> >>>+	void *arg;
> >>>+	sigset_t sigmask;
> >>>+};
> >>>+
> >>>+static int clone_start(void *arg)
> >>>+{
> >>>+	struct clone_start_args *csa = arg;
> >>>+	__post_Fork(0);
> >>>+	__restore_sigs(&csa->sigmask);
> >>>+	return csa->func(csa->arg);
> >>>+}
> >>>
> >>> int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
> >>> {
> >>>+	struct clone_start_args csa;
> >>> 	va_list ap;
> >>>-	pid_t *ptid, *ctid;
> >>>-	void  *tls;
> >>>+	pid_t *ptid = 0, *ctid = 0;
> >>>+	void  *tls = 0;
> >>>+
> >>>+	/* Flags that produce an invalid thread/TLS state are disallowed. */
> >>>+	int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
> >>>+	if (flags & badflags)
> >>>+		return __syscall_ret(-EINVAL);
> >>>
> >>Maybe also check for NULL stack, as proposed in
> >>https://www.openwall.com/lists/musl/2022/08/02/1 ?
> >
> >I guess that can be done. Normally we don't check for null pointers,
> >but I think the syscall would silently keep the existing stack pointer
> >if passed null rather than blowing up, which we probably don't want --
> >even without CLONE_VM, where it'd blow up, we'd be giving
> >appears-to-work behavior without a contract for it to work.
> >
> Yes, and this is not only about CLONE_VM. On some arches (e.g. i386,
> x86_64, aarch64) musl's __clone stores things on "stack" and hence
> segfaults if "stack" is NULL, but on other arches it doesn't (e.g.
> arm). Exposing this is inconsistent and confusing. If support for
> passing NULL "stack" is ever desired, __clone should be changed
> first.

OK. Crashing here would actually be okay, but having it appear to work
on some archs while crashing on others is not a good situation. So I
think just reporting it as an error is okay for now. We can revisit
this later if it would be useful.

> >>> 	va_start(ap, arg);
> >>>-	ptid = va_arg(ap, pid_t *);
> >>>-	tls  = va_arg(ap, void *);
> >>>-	ctid = va_arg(ap, pid_t *);
> >>>+	if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
> >>>+	 	ptid = va_arg(ap, pid_t *);
> >>>+	if (flags & CLONE_CHILD_SETTID) {
> >>>+		tls = va_arg(ap, void *);
> >>>+		ctid = va_arg(ap, pid_t *);
> >>>+	}
> >>> 	va_end(ap);
> >>>
> >>>-	return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls,
> >>>ctid));
> >>>+	if (flags & CLONE_VM)
> >>>+		return __syscall_ret(__clone(func, stack, flags, arg, ptid,
> >>>tls, ctid));
> >>>+
> >>>+	__block_all_sigs(&csa.sigmask);
> >>>+	LOCK(__abort_lock);
> >>>+
> >>>+	csa.func = func;
> >>>+	csa.arg = arg;
> >>>+	int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
> >>>+
> >>>+	__post_Fork(ret);
> >>>+	__restore_sigs(&csa.sigmask);
> >>>+	return __syscall_ret(ret);
> >>> }
> >>
> >>Another thing that might be somewhat relevant if we expect to see
> >>wider usage of clone is that syscall (the public function) currently
> >>blindly extracts six arguments via va_arg, which on some
> >>architectures requires sufficient stack space not to crash. So on
> >>i386 the following silly function will crash if passed to clone,
> >>provided that "stack" points to the beginning of unmapped space past
> >>the stack page end.
> >
> >That's changed in the patch. It only calls va_arg for arguments whose
> >existence the flags implies.
> >
> I was talking about va_arg usage in syscall() (in misc/syscall.c).
> __clone only aligns stack on 16 bytes (at least on arches that I
> checked), but doesn't reserve space. So a sufficiently unlucky
> user-provided function will crash when called from __clone in the
> child, as in my example.

OK, this is really a bug in the syscall() function, albeit one that
doesn't seem to admit any reasonable fix. Note that this function is
not used in libc; it's only present for use by applications.

Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH] [RFC] make clone() usable
  2023-06-01 20:08   ` Rich Felker
  2023-06-01 20:17     ` [musl] [PATCH] [v2] " Rich Felker
  2023-06-02  7:28     ` [musl] [PATCH] [RFC] " Alexey Izbyshev
@ 2023-06-02 14:19     ` Markus Wichmann
  2023-06-02 15:30       ` Rich Felker
  2 siblings, 1 reply; 8+ messages in thread
From: Markus Wichmann @ 2023-06-02 14:19 UTC (permalink / raw)
  To: musl

Am Thu, Jun 01, 2023 at 04:08:07PM -0400 schrieb Rich Felker:
> On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
> > Another thing that might be somewhat relevant if we expect to see
> > wider usage of clone is that syscall (the public function) currently
> > blindly extracts six arguments via va_arg, which on some
> > architectures requires sufficient stack space not to crash. So on
> > i386 the following silly function will crash if passed to clone,
> > provided that "stack" points to the beginning of unmapped space past
> > the stack page end.
>
> That's changed in the patch. It only calls va_arg for arguments whose
> existence the flags implies.
>
> Rich

clone() does, yes, but syscall() doesn't. So if the function handed to
clone() calls the syscall() function at the top of the stack, it crashes
on architectures that pass parameters on stack primarily.

Quick fix would be to make __clone() on those architectures reserve
sufficient stack space for six parameters before calling the client
function. E.g. on i386, one parameter is four bytes, so six parameters
are twenty-four bytes, and I believe the stack alignment was sixteen
bytes, right? So __clone() could just reduce %esp by thirty-two before
calling the function pointer.

The proper fix would be to have syscall() only read as many arguments as
were given. Since syscall() is an extension function, nobody can rely on
it being defined as

long syscall(int, ...);

You might use similar macros to the ones in syscall.h to add another
argument for the number of arguments. That would avoid the crash.

Of course, you have to keep the above function around for ABI reasons.
Nothing's ever easy, right?

Ciao,
Markus

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH] [RFC] make clone() usable
  2023-06-02 14:19     ` Markus Wichmann
@ 2023-06-02 15:30       ` Rich Felker
  0 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2023-06-02 15:30 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl

On Fri, Jun 02, 2023 at 04:19:47PM +0200, Markus Wichmann wrote:
> Am Thu, Jun 01, 2023 at 04:08:07PM -0400 schrieb Rich Felker:
> > On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
> > > Another thing that might be somewhat relevant if we expect to see
> > > wider usage of clone is that syscall (the public function) currently
> > > blindly extracts six arguments via va_arg, which on some
> > > architectures requires sufficient stack space not to crash. So on
> > > i386 the following silly function will crash if passed to clone,
> > > provided that "stack" points to the beginning of unmapped space past
> > > the stack page end.
> >
> > That's changed in the patch. It only calls va_arg for arguments whose
> > existence the flags implies.
> >
> > Rich
> 
> clone() does, yes, but syscall() doesn't. So if the function handed to
> clone() calls the syscall() function at the top of the stack, it crashes
> on architectures that pass parameters on stack primarily.
> 
> Quick fix would be to make __clone() on those architectures reserve
> sufficient stack space for six parameters before calling the client
> function. E.g. on i386, one parameter is four bytes, so six parameters
> are twenty-four bytes, and I believe the stack alignment was sixteen
> bytes, right? So __clone() could just reduce %esp by thirty-two before
> calling the function pointer.

It's a hack that might be worthwhile.

> The proper fix would be to have syscall() only read as many arguments as
> were given. Since syscall() is an extension function, nobody can rely on
> it being defined as

That requires knowing all current and future syscalls... :(

> long syscall(int, ...);
> 
> You might use similar macros to the ones in syscall.h to add another
> argument for the number of arguments. That would avoid the crash.

Yes, that's perhaps a good idea. Though it does potentially make a
mess with how macros affect namespace...

> Of course, you have to keep the above function around for ABI reasons.
> Nothing's ever easy, right?

Yes, though it's no worse than now if existing binaries that were
crashing still crash...

Rich



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-06-02 15:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-30 23:35 [musl] [PATCH] [RFC] make clone() usable Rich Felker
2023-06-01 10:12 ` Alexey Izbyshev
2023-06-01 20:08   ` Rich Felker
2023-06-01 20:17     ` [musl] [PATCH] [v2] " Rich Felker
2023-06-02  7:28     ` [musl] [PATCH] [RFC] " Alexey Izbyshev
2023-06-02 14:16       ` Rich Felker
2023-06-02 14:19     ` Markus Wichmann
2023-06-02 15:30       ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).