* [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
@ 2022-05-02 19:26 Alexey Izbyshev
2022-05-02 20:49 ` Carlos O'Donell
2022-05-02 21:18 ` Rich Felker
0 siblings, 2 replies; 9+ messages in thread
From: Alexey Izbyshev @ 2022-05-02 19:26 UTC (permalink / raw)
To: musl
Hi,
I was recently made aware via [1] that vfork() can have more failure
modes than fork() on Linux. The only case I know about is due to Linux
not allowing processes in different time namespaces to share address
space, but probably there are or will be more. An example is below
(requires Linux >= 5.6).
$ cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>
int main(int argc, char *argv[], char *envp[]) {
if (getenv("TEST_FORK")) {
pid_t pid = fork();
if (pid < 0) {
perror("fork");
return 127;
}
if (pid == 0) {
execve(argv[1], argv + 1, envp);
_exit(127);
}
} else {
int err = posix_spawn(0, argv[1], 0, 0, argv + 1, envp);
if (err) {
printf("posix_spawn: %s\n", strerror(err));
return 127;
}
}
wait(NULL);
return 0;
}
$ musl-gcc test.c
$ unshare -UrT ./a.out /bin/echo OK
posix_spawn: Invalid argument
$ TEST_FORK=1 unshare -UrT ./a.out /bin/echo OK
OK
A common expectation from applications is that they can use
posix_spawn() as a drop-in replacement for fork()/exec() (when its
child-tweaking features are sufficient), but this case breaks the
expectation. Do you think it would make sense for musl to fallback to
fork() in case vfork() fails in posix_spawn()?
I've also opened a bug about this in glibc[2]. Maybe libcs could
coordinate in how they handle this case.
Alexey
[1] https://github.com/python/cpython/issues/91307
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=29115
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
2022-05-02 19:26 [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one Alexey Izbyshev
@ 2022-05-02 20:49 ` Carlos O'Donell
2022-05-02 21:18 ` Rich Felker
1 sibling, 0 replies; 9+ messages in thread
From: Carlos O'Donell @ 2022-05-02 20:49 UTC (permalink / raw)
To: musl, Alexey Izbyshev
On 5/2/22 15:26, Alexey Izbyshev wrote:
> I've also opened a bug about this in glibc[2]. Maybe libcs could
> coordinate in how they handle this case.
Speaking as a glibc maintainer.
Either the kernel supports vfork or it doesn't.
A time namespace, or a seccomp filter are all the same problems,
and we should return the error to userspace.
Adding code which will only be exercised in the event that a time
namespace is in use is going to result in increased long-term
maintenance costs.
It also results in unexpected surprise behaviour when the developer
runs under a time namespace e.g. more memory usage, different code
paths taken etc.
Rather than add long-term maintenance and surprise developers my
suggestion is to fail the posix_spawn.
Users can take this up with the kernel to add support for vfork in
time namespaces, or with their sandboxing technology provider.
There might be exceptional cases where we need to add fallbacks,
but I can't see that this is one of those cases. For example clone3
to clone2 fallback is sensible.
My impression was that time namespaces as they are today were added
primarily to support CRIU process freeze and restore with different
monotonic clock offsets[1]. If CRIU needs this to work then it would
be more effective to fix the kernel, than to fix every userspace
you need to freeze to support fork fallback.
--
Cheers,
Carlos.
[1] https://lore.kernel.org/lkml/158016896588.31887.14143226032971732742.tglx@nanos.tec.linutronix.de/
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
2022-05-02 19:26 [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one Alexey Izbyshev
2022-05-02 20:49 ` Carlos O'Donell
@ 2022-05-02 21:18 ` Rich Felker
2022-05-02 21:25 ` Florian Weimer
1 sibling, 1 reply; 9+ messages in thread
From: Rich Felker @ 2022-05-02 21:18 UTC (permalink / raw)
To: Alexey Izbyshev; +Cc: musl
On Mon, May 02, 2022 at 10:26:36PM +0300, Alexey Izbyshev wrote:
> Hi,
>
> I was recently made aware via [1] that vfork() can have more failure
> modes than fork() on Linux. The only case I know about is due to
> Linux not allowing processes in different time namespaces to share
> address space, but probably there are or will be more. An example is
> below (requires Linux >= 5.6).
>
> $ cat test.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <spawn.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[], char *envp[]) {
> if (getenv("TEST_FORK")) {
> pid_t pid = fork();
> if (pid < 0) {
> perror("fork");
> return 127;
> }
> if (pid == 0) {
> execve(argv[1], argv + 1, envp);
> _exit(127);
> }
> } else {
> int err = posix_spawn(0, argv[1], 0, 0, argv + 1, envp);
> if (err) {
> printf("posix_spawn: %s\n", strerror(err));
> return 127;
> }
> }
> wait(NULL);
> return 0;
> }
>
> $ musl-gcc test.c
> $ unshare -UrT ./a.out /bin/echo OK
> posix_spawn: Invalid argument
> $ TEST_FORK=1 unshare -UrT ./a.out /bin/echo OK
> OK
>
> A common expectation from applications is that they can use
> posix_spawn() as a drop-in replacement for fork()/exec() (when its
> child-tweaking features are sufficient), but this case breaks the
> expectation. Do you think it would make sense for musl to fallback
> to fork() in case vfork() fails in posix_spawn()?
>
> I've also opened a bug about this in glibc[2]. Maybe libcs could
> coordinate in how they handle this case.
>
> Alexey
>
> [1] https://github.com/python/cpython/issues/91307
> [2] https://sourceware.org/bugzilla/show_bug.cgi?id=29115
I'm trying to understand how this comes to be. The child should
inherit the namespaces of the parent and thus should not be in a
different namespace that precludes spawn. I'm guessing this is some
oddity where unshare doesn't affect the process itself, only its
children? If so, it seems like a bug that it doesn't affect the
process itself after execve (after unshare(1) runs your test program),
but that probably can't be fixed now on the Linux side for stability
reasons. :(
For what it's worth, I feel like the answer here is really that you
can't expect everything (or anything) to work after you've created a
bad or inconsistent process state, which can be done in various ways
like using unshare(2) in certain ways a multithreaded process, certain
manual uses of clone(2), etc. Apparently unsharing time ns is one of
those things too, and if it behaves the way it seems to, I don't think
you can use it at all without an extra fork (adding -f to the
unshare(1) command line). Otherwise the top-level process in your
"container" and its children will be in different time namespaces,
which is not at all what you would want anyway.
We probably could make posix_spawn retry __clone without CLONE_VM if
if fails with certain errors, as long as those errors are
non-ambiguous about indicating a need for retry. I don't see EINVAL
documented as being possible for any cases that would need to be
treated as errors, but then again it doesn't seem to be documented for
this corner case you found either.
Rich
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
2022-05-02 21:18 ` Rich Felker
@ 2022-05-02 21:25 ` Florian Weimer
2022-05-02 21:31 ` Rich Felker
2022-05-02 21:31 ` Carlos O'Donell
0 siblings, 2 replies; 9+ messages in thread
From: Florian Weimer @ 2022-05-02 21:25 UTC (permalink / raw)
To: Rich Felker; +Cc: Alexey Izbyshev, musl
* Rich Felker:
> I'm trying to understand how this comes to be. The child should
> inherit the namespaces of the parent and thus should not be in a
> different namespace that precludes spawn. I'm guessing this is some
> oddity where unshare doesn't affect the process itself, only its
> children? If so, it seems like a bug that it doesn't affect the
> process itself after execve (after unshare(1) runs your test program),
> but that probably can't be fixed now on the Linux side for stability
> reasons. :(
It's about fundamentally conflicting requirements.
The vDSO data mapping needs to store the time offset, so it has to be
distinct from the original namespace. vfork preserves the VM sharing.
It's not possible to do both things at the same time.
unshare(CLONE_NEWTIME) should have been specified to only take effect
after execve, when the vDSO is remapped anyway.
Thanks,
Florian
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
2022-05-02 21:25 ` Florian Weimer
@ 2022-05-02 21:31 ` Rich Felker
2023-02-22 22:04 ` Alexey Izbyshev
2022-05-02 21:31 ` Carlos O'Donell
1 sibling, 1 reply; 9+ messages in thread
From: Rich Felker @ 2022-05-02 21:31 UTC (permalink / raw)
To: Florian Weimer; +Cc: Alexey Izbyshev, musl
On Mon, May 02, 2022 at 11:25:07PM +0200, Florian Weimer wrote:
> * Rich Felker:
>
> > I'm trying to understand how this comes to be. The child should
> > inherit the namespaces of the parent and thus should not be in a
> > different namespace that precludes spawn. I'm guessing this is some
> > oddity where unshare doesn't affect the process itself, only its
> > children? If so, it seems like a bug that it doesn't affect the
> > process itself after execve (after unshare(1) runs your test program),
> > but that probably can't be fixed now on the Linux side for stability
> > reasons. :(
>
> It's about fundamentally conflicting requirements.
>
> The vDSO data mapping needs to store the time offset, so it has to be
> distinct from the original namespace. vfork preserves the VM sharing.
> It's not possible to do both things at the same time.
>
> unshare(CLONE_NEWTIME) should have been specified to only take effect
> after execve, when the vDSO is remapped anyway.
Yes, exactly. The bug is that someone confused processes and process
images (fork and exec) when coming up with the constraint and now
we're probably stuck with it. *sigh*
Rich
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
2022-05-02 21:25 ` Florian Weimer
2022-05-02 21:31 ` Rich Felker
@ 2022-05-02 21:31 ` Carlos O'Donell
2022-05-02 21:49 ` Alexey Izbyshev
1 sibling, 1 reply; 9+ messages in thread
From: Carlos O'Donell @ 2022-05-02 21:31 UTC (permalink / raw)
To: musl, Florian Weimer, Rich Felker; +Cc: Alexey Izbyshev
On 5/2/22 17:25, Florian Weimer wrote:
> * Rich Felker:
>
>> I'm trying to understand how this comes to be. The child should
>> inherit the namespaces of the parent and thus should not be in a
>> different namespace that precludes spawn. I'm guessing this is some
>> oddity where unshare doesn't affect the process itself, only its
>> children? If so, it seems like a bug that it doesn't affect the
>> process itself after execve (after unshare(1) runs your test program),
>> but that probably can't be fixed now on the Linux side for stability
>> reasons. :(
>
> It's about fundamentally conflicting requirements.
>
> The vDSO data mapping needs to store the time offset, so it has to be
> distinct from the original namespace. vfork preserves the VM sharing.
> It's not possible to do both things at the same time.
>
> unshare(CLONE_NEWTIME) should have been specified to only take effect
> after execve, when the vDSO is remapped anyway.
Can we ask some kernel developers for an opinion?
--
Cheers,
Carlos.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
2022-05-02 21:31 ` Carlos O'Donell
@ 2022-05-02 21:49 ` Alexey Izbyshev
2022-05-02 21:56 ` Alexey Izbyshev
0 siblings, 1 reply; 9+ messages in thread
From: Alexey Izbyshev @ 2022-05-02 21:49 UTC (permalink / raw)
To: musl; +Cc: Florian Weimer, Rich Felker, Andrei Vagin, Christian Brauner
On 2022-05-03 00:31, Carlos O'Donell wrote:
> On 5/2/22 17:25, Florian Weimer wrote:
>> * Rich Felker:
>>
>>> I'm trying to understand how this comes to be. The child should
>>> inherit the namespaces of the parent and thus should not be in a
>>> different namespace that precludes spawn. I'm guessing this is some
>>> oddity where unshare doesn't affect the process itself, only its
>>> children? If so, it seems like a bug that it doesn't affect the
>>> process itself after execve (after unshare(1) runs your test
>>> program),
>>> but that probably can't be fixed now on the Linux side for stability
>>> reasons. :(
>>
>> It's about fundamentally conflicting requirements.
>>
>> The vDSO data mapping needs to store the time offset, so it has to be
>> distinct from the original namespace. vfork preserves the VM sharing.
>> It's not possible to do both things at the same time.
>>
>> unshare(CLONE_NEWTIME) should have been specified to only take effect
>> after execve, when the vDSO is remapped anyway.
>
> Can we ask some kernel developers for an opinion?
Christian Brauner had some comments [1,2] on this. Time namespaces were
added in [3] by Andrei Vagin. Adding both to CC.
Alexey
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215769#c6
[2] https://bugzilla.kernel.org/show_bug.cgi?id=215769#c10
[3]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=769071ac9f20b6a447410c7eaa55d1a5233ef40c
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
2022-05-02 21:49 ` Alexey Izbyshev
@ 2022-05-02 21:56 ` Alexey Izbyshev
0 siblings, 0 replies; 9+ messages in thread
From: Alexey Izbyshev @ 2022-05-02 21:56 UTC (permalink / raw)
To: musl; +Cc: Florian Weimer, Rich Felker, Andrei Vagin, Christian Brauner
On 2022-05-03 00:49, Alexey Izbyshev wrote:
> On 2022-05-03 00:31, Carlos O'Donell wrote:
>> On 5/2/22 17:25, Florian Weimer wrote:
>>> * Rich Felker:
>>>
>>>> I'm trying to understand how this comes to be. The child should
>>>> inherit the namespaces of the parent and thus should not be in a
>>>> different namespace that precludes spawn. I'm guessing this is some
>>>> oddity where unshare doesn't affect the process itself, only its
>>>> children? If so, it seems like a bug that it doesn't affect the
>>>> process itself after execve (after unshare(1) runs your test
>>>> program),
>>>> but that probably can't be fixed now on the Linux side for stability
>>>> reasons. :(
>>>
>>> It's about fundamentally conflicting requirements.
>>>
>>> The vDSO data mapping needs to store the time offset, so it has to be
>>> distinct from the original namespace. vfork preserves the VM
>>> sharing.
>>> It's not possible to do both things at the same time.
>>>
>>> unshare(CLONE_NEWTIME) should have been specified to only take effect
>>> after execve, when the vDSO is remapped anyway.
>>
>> Can we ask some kernel developers for an opinion?
>
> Christian Brauner had some comments [1,2] on this. Time namespaces
> were added in [3] by Andrei Vagin. Adding both to CC.
>
(Trying a different email for Andrei Vagin)
> Alexey
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215769#c6
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=215769#c10
> [3]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=769071ac9f20b6a447410c7eaa55d1a5233ef40c
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one
2022-05-02 21:31 ` Rich Felker
@ 2023-02-22 22:04 ` Alexey Izbyshev
0 siblings, 0 replies; 9+ messages in thread
From: Alexey Izbyshev @ 2023-02-22 22:04 UTC (permalink / raw)
To: musl; +Cc: Florian Weimer
On 2022-05-03 00:31, Rich Felker wrote:
> On Mon, May 02, 2022 at 11:25:07PM +0200, Florian Weimer wrote:
>> * Rich Felker:
>>
>> > I'm trying to understand how this comes to be. The child should
>> > inherit the namespaces of the parent and thus should not be in a
>> > different namespace that precludes spawn. I'm guessing this is some
>> > oddity where unshare doesn't affect the process itself, only its
>> > children? If so, it seems like a bug that it doesn't affect the
>> > process itself after execve (after unshare(1) runs your test program),
>> > but that probably can't be fixed now on the Linux side for stability
>> > reasons. :(
>>
>> It's about fundamentally conflicting requirements.
>>
>> The vDSO data mapping needs to store the time offset, so it has to be
>> distinct from the original namespace. vfork preserves the VM sharing.
>> It's not possible to do both things at the same time.
>>
>> unshare(CLONE_NEWTIME) should have been specified to only take effect
>> after execve, when the vDSO is remapped anyway.
>
> Yes, exactly. The bug is that someone confused processes and process
> images (fork and exec) when coming up with the constraint and now
> we're probably stuck with it. *sigh*
>
Status update: the recently released kernel 6.2 contains the fix[1], so
we're not stuck with this bug after all!
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2b5f9dad32ed19e8db3b0f10a84aa824a219803b
Alexey
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2023-02-22 22:04 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-02 19:26 [musl] vfork()-based posix_spawn() has more failure modes than fork()-based one Alexey Izbyshev
2022-05-02 20:49 ` Carlos O'Donell
2022-05-02 21:18 ` Rich Felker
2022-05-02 21:25 ` Florian Weimer
2022-05-02 21:31 ` Rich Felker
2023-02-22 22:04 ` Alexey Izbyshev
2022-05-02 21:31 ` Carlos O'Donell
2022-05-02 21:49 ` Alexey Izbyshev
2022-05-02 21:56 ` Alexey Izbyshev
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).