* [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one @ 2022-05-02 19:26 Alexey Izbyshev 2022-05-02 20:49 ` Carlos O'Donell 2022-05-02 21:18 ` Rich Felker 0 siblings, 2 replies; 9+ messages in thread From: Alexey Izbyshev @ 2022-05-02 19:26 UTC (permalink / raw) To: musl Hi, I was recently made aware via [1] that vfork() can have more failure modes than fork() on Linux. The only case I know about is due to Linux not allowing processes in different time namespaces to share address space, but probably there are or will be more. An example is below (requires Linux >= 5.6). $ cat test.c #include <stdio.h> #include <stdlib.h> #include <string.h> #include <spawn.h> #include <sys/wait.h> #include <unistd.h> int main(int argc, char *argv[], char *envp[]) { if (getenv("TEST_FORK")) { pid_t pid = fork(); if (pid < 0) { perror("fork"); return 127; } if (pid == 0) { execve(argv[1], argv + 1, envp); _exit(127); } } else { int err = posix_spawn(0, argv[1], 0, 0, argv + 1, envp); if (err) { printf("posix_spawn: %s\n", strerror(err)); return 127; } } wait(NULL); return 0; } $ musl-gcc test.c $ unshare -UrT ./a.out /bin/echo OK posix_spawn: Invalid argument $ TEST_FORK=1 unshare -UrT ./a.out /bin/echo OK OK A common expectation from applications is that they can use posix_spawn() as a drop-in replacement for fork()/exec() (when its child-tweaking features are sufficient), but this case breaks the expectation. Do you think it would make sense for musl to fallback to fork() in case vfork() fails in posix_spawn()? I've also opened a bug about this in glibc[2]. Maybe libcs could coordinate in how they handle this case. Alexey [1] https://github.com/python/cpython/issues/91307 [2] https://sourceware.org/bugzilla/show_bug.cgi?id=29115 ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one 2022-05-02 19:26 [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one Alexey Izbyshev @ 2022-05-02 20:49 ` Carlos O'Donell 2022-05-02 21:18 ` Rich Felker 1 sibling, 0 replies; 9+ messages in thread From: Carlos O'Donell @ 2022-05-02 20:49 UTC (permalink / raw) To: musl, Alexey Izbyshev On 5/2/22 15:26, Alexey Izbyshev wrote: > I've also opened a bug about this in glibc[2]. Maybe libcs could > coordinate in how they handle this case. Speaking as a glibc maintainer. Either the kernel supports vfork or it doesn't. A time namespace, or a seccomp filter are all the same problems, and we should return the error to userspace. Adding code which will only be exercised in the event that a time namespace is in use is going to result in increased long-term maintenance costs. It also results in unexpected surprise behaviour when the developer runs under a time namespace e.g. more memory usage, different code paths taken etc. Rather than add long-term maintenance and surprise developers my suggestion is to fail the posix_spawn. Users can take this up with the kernel to add support for vfork in time namespaces, or with their sandboxing technology provider. There might be exceptional cases where we need to add fallbacks, but I can't see that this is one of those cases. For example clone3 to clone2 fallback is sensible. My impression was that time namespaces as they are today were added primarily to support CRIU process freeze and restore with different monotonic clock offsets[1]. If CRIU needs this to work then it would be more effective to fix the kernel, than to fix every userspace you need to freeze to support fork fallback. -- Cheers, Carlos. [1] https://lore.kernel.org/lkml/158016896588.31887.14143226032971732742.tglx@nanos.tec.linutronix.de/ ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one 2022-05-02 19:26 [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one Alexey Izbyshev 2022-05-02 20:49 ` Carlos O'Donell @ 2022-05-02 21:18 ` Rich Felker 2022-05-02 21:25 ` Florian Weimer 1 sibling, 1 reply; 9+ messages in thread From: Rich Felker @ 2022-05-02 21:18 UTC (permalink / raw) To: Alexey Izbyshev; +Cc: musl On Mon, May 02, 2022 at 10:26:36PM +0300, Alexey Izbyshev wrote: > Hi, > > I was recently made aware via [1] that vfork() can have more failure > modes than fork() on Linux. The only case I know about is due to > Linux not allowing processes in different time namespaces to share > address space, but probably there are or will be more. An example is > below (requires Linux >= 5.6). > > $ cat test.c > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <spawn.h> > #include <sys/wait.h> > #include <unistd.h> > > int main(int argc, char *argv[], char *envp[]) { > if (getenv("TEST_FORK")) { > pid_t pid = fork(); > if (pid < 0) { > perror("fork"); > return 127; > } > if (pid == 0) { > execve(argv[1], argv + 1, envp); > _exit(127); > } > } else { > int err = posix_spawn(0, argv[1], 0, 0, argv + 1, envp); > if (err) { > printf("posix_spawn: %s\n", strerror(err)); > return 127; > } > } > wait(NULL); > return 0; > } > > $ musl-gcc test.c > $ unshare -UrT ./a.out /bin/echo OK > posix_spawn: Invalid argument > $ TEST_FORK=1 unshare -UrT ./a.out /bin/echo OK > OK > > A common expectation from applications is that they can use > posix_spawn() as a drop-in replacement for fork()/exec() (when its > child-tweaking features are sufficient), but this case breaks the > expectation. Do you think it would make sense for musl to fallback > to fork() in case vfork() fails in posix_spawn()? > > I've also opened a bug about this in glibc[2]. Maybe libcs could > coordinate in how they handle this case. > > Alexey > > [1] https://github.com/python/cpython/issues/91307 > [2] https://sourceware.org/bugzilla/show_bug.cgi?id=29115 I'm trying to understand how this comes to be. The child should inherit the namespaces of the parent and thus should not be in a different namespace that precludes spawn. I'm guessing this is some oddity where unshare doesn't affect the process itself, only its children? If so, it seems like a bug that it doesn't affect the process itself after execve (after unshare(1) runs your test program), but that probably can't be fixed now on the Linux side for stability reasons. :( For what it's worth, I feel like the answer here is really that you can't expect everything (or anything) to work after you've created a bad or inconsistent process state, which can be done in various ways like using unshare(2) in certain ways a multithreaded process, certain manual uses of clone(2), etc. Apparently unsharing time ns is one of those things too, and if it behaves the way it seems to, I don't think you can use it at all without an extra fork (adding -f to the unshare(1) command line). Otherwise the top-level process in your "container" and its children will be in different time namespaces, which is not at all what you would want anyway. We probably could make posix_spawn retry __clone without CLONE_VM if if fails with certain errors, as long as those errors are non-ambiguous about indicating a need for retry. I don't see EINVAL documented as being possible for any cases that would need to be treated as errors, but then again it doesn't seem to be documented for this corner case you found either. Rich ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one 2022-05-02 21:18 ` Rich Felker @ 2022-05-02 21:25 ` Florian Weimer 2022-05-02 21:31 ` Rich Felker 2022-05-02 21:31 ` Carlos O'Donell 0 siblings, 2 replies; 9+ messages in thread From: Florian Weimer @ 2022-05-02 21:25 UTC (permalink / raw) To: Rich Felker; +Cc: Alexey Izbyshev, musl * Rich Felker: > I'm trying to understand how this comes to be. The child should > inherit the namespaces of the parent and thus should not be in a > different namespace that precludes spawn. I'm guessing this is some > oddity where unshare doesn't affect the process itself, only its > children? If so, it seems like a bug that it doesn't affect the > process itself after execve (after unshare(1) runs your test program), > but that probably can't be fixed now on the Linux side for stability > reasons. :( It's about fundamentally conflicting requirements. The vDSO data mapping needs to store the time offset, so it has to be distinct from the original namespace. vfork preserves the VM sharing. It's not possible to do both things at the same time. unshare(CLONE_NEWTIME) should have been specified to only take effect after execve, when the vDSO is remapped anyway. Thanks, Florian ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one 2022-05-02 21:25 ` Florian Weimer @ 2022-05-02 21:31 ` Rich Felker 2023-02-22 22:04 ` Alexey Izbyshev 2022-05-02 21:31 ` Carlos O'Donell 1 sibling, 1 reply; 9+ messages in thread From: Rich Felker @ 2022-05-02 21:31 UTC (permalink / raw) To: Florian Weimer; +Cc: Alexey Izbyshev, musl On Mon, May 02, 2022 at 11:25:07PM +0200, Florian Weimer wrote: > * Rich Felker: > > > I'm trying to understand how this comes to be. The child should > > inherit the namespaces of the parent and thus should not be in a > > different namespace that precludes spawn. I'm guessing this is some > > oddity where unshare doesn't affect the process itself, only its > > children? If so, it seems like a bug that it doesn't affect the > > process itself after execve (after unshare(1) runs your test program), > > but that probably can't be fixed now on the Linux side for stability > > reasons. :( > > It's about fundamentally conflicting requirements. > > The vDSO data mapping needs to store the time offset, so it has to be > distinct from the original namespace. vfork preserves the VM sharing. > It's not possible to do both things at the same time. > > unshare(CLONE_NEWTIME) should have been specified to only take effect > after execve, when the vDSO is remapped anyway. Yes, exactly. The bug is that someone confused processes and process images (fork and exec) when coming up with the constraint and now we're probably stuck with it. *sigh* Rich ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one 2022-05-02 21:31 ` Rich Felker @ 2023-02-22 22:04 ` Alexey Izbyshev 0 siblings, 0 replies; 9+ messages in thread From: Alexey Izbyshev @ 2023-02-22 22:04 UTC (permalink / raw) To: musl; +Cc: Florian Weimer On 2022-05-03 00:31, Rich Felker wrote: > On Mon, May 02, 2022 at 11:25:07PM +0200, Florian Weimer wrote: >> * Rich Felker: >> >> > I'm trying to understand how this comes to be. The child should >> > inherit the namespaces of the parent and thus should not be in a >> > different namespace that precludes spawn. I'm guessing this is some >> > oddity where unshare doesn't affect the process itself, only its >> > children? If so, it seems like a bug that it doesn't affect the >> > process itself after execve (after unshare(1) runs your test program), >> > but that probably can't be fixed now on the Linux side for stability >> > reasons. :( >> >> It's about fundamentally conflicting requirements. >> >> The vDSO data mapping needs to store the time offset, so it has to be >> distinct from the original namespace. vfork preserves the VM sharing. >> It's not possible to do both things at the same time. >> >> unshare(CLONE_NEWTIME) should have been specified to only take effect >> after execve, when the vDSO is remapped anyway. > > Yes, exactly. The bug is that someone confused processes and process > images (fork and exec) when coming up with the constraint and now > we're probably stuck with it. *sigh* > Status update: the recently released kernel 6.2 contains the fix[1], so we're not stuck with this bug after all! [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2b5f9dad32ed19e8db3b0f10a84aa824a219803b Alexey ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one 2022-05-02 21:25 ` Florian Weimer 2022-05-02 21:31 ` Rich Felker @ 2022-05-02 21:31 ` Carlos O'Donell 2022-05-02 21:49 ` Alexey Izbyshev 1 sibling, 1 reply; 9+ messages in thread From: Carlos O'Donell @ 2022-05-02 21:31 UTC (permalink / raw) To: musl, Florian Weimer, Rich Felker; +Cc: Alexey Izbyshev On 5/2/22 17:25, Florian Weimer wrote: > * Rich Felker: > >> I'm trying to understand how this comes to be. The child should >> inherit the namespaces of the parent and thus should not be in a >> different namespace that precludes spawn. I'm guessing this is some >> oddity where unshare doesn't affect the process itself, only its >> children? If so, it seems like a bug that it doesn't affect the >> process itself after execve (after unshare(1) runs your test program), >> but that probably can't be fixed now on the Linux side for stability >> reasons. :( > > It's about fundamentally conflicting requirements. > > The vDSO data mapping needs to store the time offset, so it has to be > distinct from the original namespace. vfork preserves the VM sharing. > It's not possible to do both things at the same time. > > unshare(CLONE_NEWTIME) should have been specified to only take effect > after execve, when the vDSO is remapped anyway. Can we ask some kernel developers for an opinion? -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one 2022-05-02 21:31 ` Carlos O'Donell @ 2022-05-02 21:49 ` Alexey Izbyshev 2022-05-02 21:56 ` Alexey Izbyshev 0 siblings, 1 reply; 9+ messages in thread From: Alexey Izbyshev @ 2022-05-02 21:49 UTC (permalink / raw) To: musl; +Cc: Florian Weimer, Rich Felker, Andrei Vagin, Christian Brauner On 2022-05-03 00:31, Carlos O'Donell wrote: > On 5/2/22 17:25, Florian Weimer wrote: >> * Rich Felker: >> >>> I'm trying to understand how this comes to be. The child should >>> inherit the namespaces of the parent and thus should not be in a >>> different namespace that precludes spawn. I'm guessing this is some >>> oddity where unshare doesn't affect the process itself, only its >>> children? If so, it seems like a bug that it doesn't affect the >>> process itself after execve (after unshare(1) runs your test >>> program), >>> but that probably can't be fixed now on the Linux side for stability >>> reasons. :( >> >> It's about fundamentally conflicting requirements. >> >> The vDSO data mapping needs to store the time offset, so it has to be >> distinct from the original namespace. vfork preserves the VM sharing. >> It's not possible to do both things at the same time. >> >> unshare(CLONE_NEWTIME) should have been specified to only take effect >> after execve, when the vDSO is remapped anyway. > > Can we ask some kernel developers for an opinion? Christian Brauner had some comments [1,2] on this. Time namespaces were added in [3] by Andrei Vagin. Adding both to CC. Alexey [1] https://bugzilla.kernel.org/show_bug.cgi?id=215769#c6 [2] https://bugzilla.kernel.org/show_bug.cgi?id=215769#c10 [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=769071ac9f20b6a447410c7eaa55d1a5233ef40c ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one 2022-05-02 21:49 ` Alexey Izbyshev @ 2022-05-02 21:56 ` Alexey Izbyshev 0 siblings, 0 replies; 9+ messages in thread From: Alexey Izbyshev @ 2022-05-02 21:56 UTC (permalink / raw) To: musl; +Cc: Florian Weimer, Rich Felker, Andrei Vagin, Christian Brauner On 2022-05-03 00:49, Alexey Izbyshev wrote: > On 2022-05-03 00:31, Carlos O'Donell wrote: >> On 5/2/22 17:25, Florian Weimer wrote: >>> * Rich Felker: >>> >>>> I'm trying to understand how this comes to be. The child should >>>> inherit the namespaces of the parent and thus should not be in a >>>> different namespace that precludes spawn. I'm guessing this is some >>>> oddity where unshare doesn't affect the process itself, only its >>>> children? If so, it seems like a bug that it doesn't affect the >>>> process itself after execve (after unshare(1) runs your test >>>> program), >>>> but that probably can't be fixed now on the Linux side for stability >>>> reasons. :( >>> >>> It's about fundamentally conflicting requirements. >>> >>> The vDSO data mapping needs to store the time offset, so it has to be >>> distinct from the original namespace. vfork preserves the VM >>> sharing. >>> It's not possible to do both things at the same time. >>> >>> unshare(CLONE_NEWTIME) should have been specified to only take effect >>> after execve, when the vDSO is remapped anyway. >> >> Can we ask some kernel developers for an opinion? > > Christian Brauner had some comments [1,2] on this. Time namespaces > were added in [3] by Andrei Vagin. Adding both to CC. > (Trying a different email for Andrei Vagin) > Alexey > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=215769#c6 > [2] https://bugzilla.kernel.org/show_bug.cgi?id=215769#c10 > [3] > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=769071ac9f20b6a447410c7eaa55d1a5233ef40c ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-02-22 22:04 UTC | newest] Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-05-02 19:26 [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one Alexey Izbyshev 2022-05-02 20:49 ` Carlos O'Donell 2022-05-02 21:18 ` Rich Felker 2022-05-02 21:25 ` Florian Weimer 2022-05-02 21:31 ` Rich Felker 2023-02-22 22:04 ` Alexey Izbyshev 2022-05-02 21:31 ` Carlos O'Donell 2022-05-02 21:49 ` Alexey Izbyshev 2022-05-02 21:56 ` Alexey Izbyshev
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).