From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-1.7 required=5.0 tests=MAILING_LIST_MULTI, RCVD_IN_DNSWL_LOW,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 20515 invoked from network); 2 Jun 2023 14:17:12 -0000 Received: from second.openwall.net (193.110.157.125) by inbox.vuxu.org with ESMTPUTF8; 2 Jun 2023 14:17:12 -0000 Received: (qmail 3919 invoked by uid 550); 2 Jun 2023 14:17:09 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Reply-To: musl@lists.openwall.com Received: (qmail 3884 invoked from network); 2 Jun 2023 14:17:08 -0000 Date: Fri, 2 Jun 2023 10:16:55 -0400 From: Rich Felker To: musl@lists.openwall.com Message-ID: <20230602141655.GO4163@brightrain.aerifal.cx> References: <20230530233529.GZ4163@brightrain.aerifal.cx> <20230601200806.GM4163@brightrain.aerifal.cx> <67a2fd2f637d0800ff48d27588c00384@ispras.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <67a2fd2f637d0800ff48d27588c00384@ispras.ru> User-Agent: Mutt/1.5.21 (2010-09-15) Subject: Re: [musl] [PATCH] [RFC] make clone() usable On Fri, Jun 02, 2023 at 10:28:09AM +0300, Alexey Izbyshev wrote: > On 2023-06-01 23:08, Rich Felker wrote: > >On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote: > >>On 2023-05-31 02:35, Rich Felker wrote: > >>>As discussed before (see the 2021 thread "Incorrect thread TID > >>>caching") clone() has been effectively unusable because it produces a > >>>child process in invalid state, with wrong tid in its thread > >>>structure, among other problems. > >>> > >>>The attached proposed patch attempts to make clone() usable by having > >>>it share the _Fork logic for establishing a consistent process state > >>>after forking, and also blocks use of flags which produce invalid > >>>state. > >>> > >>Tangentially, while thinking about this, I realized that because the > >>kernel clears the address of the exit futex on clone and _Fork > >>doesn't re-establish it, the following blocks forever: > >> > >>void *thr(void *arg) { > >> // Blocks in __tl_sync because __thread_list_lock is never > >>unlocked > >> pthread_join(*(pthread_t *)arg, NULL)); > >> return NULL; > >>} > >> > >>int main() { > >> if (!fork()) { > >> static pthread_t pt; > >> pt = pthread_self(); > >> pthread_create(&(pthread_t){0}, NULL, thr, &pt); > >> pthread_exit(NULL); > >> } > >> wait(NULL); > >>} > >> > >>This could be fixed by the following change in _Fork.c: > >> > >>- self->tid = __syscall(SYS_gettid); > >>+ self->tid = __syscall(SYS_set_tid_address, > >>&__thread_list_lock); > > > >Wow, I had no idea Linux's fork cleared the exit futex address. That's > >a big bug we've had around for nobody to have noticed... > > > >>>With CLONE_VM, the old behavior is kept, with a caveat that the child > >>>context is extremely restrictive, ala vfork. > >>> > >>>It was raised previously (pardon the pun) that raise() should perhaps > >>>be modified to make a SYS_gettid syscall rather than using the thread > >>>structure tid, so that it's possible to call raise() in a vfork > >>>context without signaling the wrong process. It might make sense to do > >>>that now too. > >>> > >>Yes, it would be nice to have raise and abort (which also reads the > >>cached tid separately) working in a vfork child. > >> > >>>diff --git a/src/linux/clone.c b/src/linux/clone.c > >>>index 8c1af7d3..e8f76bbc 100644 > >>>--- a/src/linux/clone.c > >>>+++ b/src/linux/clone.c > >>>@@ -4,18 +4,55 @@ > >>> #include > >>> #include "pthread_impl.h" > >>> #include "syscall.h" > >>>+#include "lock.h" > >>>+#include "fork_impl.h" > >>>+ > >>>+struct clone_start_args { > >>>+ int (*func)(void *); > >>>+ void *arg; > >>>+ sigset_t sigmask; > >>>+}; > >>>+ > >>>+static int clone_start(void *arg) > >>>+{ > >>>+ struct clone_start_args *csa = arg; > >>>+ __post_Fork(0); > >>>+ __restore_sigs(&csa->sigmask); > >>>+ return csa->func(csa->arg); > >>>+} > >>> > >>> int clone(int (*func)(void *), void *stack, int flags, void *arg, ...) > >>> { > >>>+ struct clone_start_args csa; > >>> va_list ap; > >>>- pid_t *ptid, *ctid; > >>>- void *tls; > >>>+ pid_t *ptid = 0, *ctid = 0; > >>>+ void *tls = 0; > >>>+ > >>>+ /* Flags that produce an invalid thread/TLS state are disallowed. */ > >>>+ int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID; > >>>+ if (flags & badflags) > >>>+ return __syscall_ret(-EINVAL); > >>> > >>Maybe also check for NULL stack, as proposed in > >>https://www.openwall.com/lists/musl/2022/08/02/1 ? > > > >I guess that can be done. Normally we don't check for null pointers, > >but I think the syscall would silently keep the existing stack pointer > >if passed null rather than blowing up, which we probably don't want -- > >even without CLONE_VM, where it'd blow up, we'd be giving > >appears-to-work behavior without a contract for it to work. > > > Yes, and this is not only about CLONE_VM. On some arches (e.g. i386, > x86_64, aarch64) musl's __clone stores things on "stack" and hence > segfaults if "stack" is NULL, but on other arches it doesn't (e.g. > arm). Exposing this is inconsistent and confusing. If support for > passing NULL "stack" is ever desired, __clone should be changed > first. OK. Crashing here would actually be okay, but having it appear to work on some archs while crashing on others is not a good situation. So I think just reporting it as an error is okay for now. We can revisit this later if it would be useful. > >>> va_start(ap, arg); > >>>- ptid = va_arg(ap, pid_t *); > >>>- tls = va_arg(ap, void *); > >>>- ctid = va_arg(ap, pid_t *); > >>>+ if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID)) > >>>+ ptid = va_arg(ap, pid_t *); > >>>+ if (flags & CLONE_CHILD_SETTID) { > >>>+ tls = va_arg(ap, void *); > >>>+ ctid = va_arg(ap, pid_t *); > >>>+ } > >>> va_end(ap); > >>> > >>>- return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls, > >>>ctid)); > >>>+ if (flags & CLONE_VM) > >>>+ return __syscall_ret(__clone(func, stack, flags, arg, ptid, > >>>tls, ctid)); > >>>+ > >>>+ __block_all_sigs(&csa.sigmask); > >>>+ LOCK(__abort_lock); > >>>+ > >>>+ csa.func = func; > >>>+ csa.arg = arg; > >>>+ int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid); > >>>+ > >>>+ __post_Fork(ret); > >>>+ __restore_sigs(&csa.sigmask); > >>>+ return __syscall_ret(ret); > >>> } > >> > >>Another thing that might be somewhat relevant if we expect to see > >>wider usage of clone is that syscall (the public function) currently > >>blindly extracts six arguments via va_arg, which on some > >>architectures requires sufficient stack space not to crash. So on > >>i386 the following silly function will crash if passed to clone, > >>provided that "stack" points to the beginning of unmapped space past > >>the stack page end. > > > >That's changed in the patch. It only calls va_arg for arguments whose > >existence the flags implies. > > > I was talking about va_arg usage in syscall() (in misc/syscall.c). > __clone only aligns stack on 16 bytes (at least on arches that I > checked), but doesn't reserve space. So a sufficiently unlucky > user-provided function will crash when called from __clone in the > child, as in my example. OK, this is really a bug in the syscall() function, albeit one that doesn't seem to admit any reasonable fix. Note that this function is not used in libc; it's only present for use by applications. Rich