[TUHS] "Fork considered harmful"

The Unix Heritage Society mailing list

* [TUHS] "Fork considered harmful"
@ 2019-04-10 23:06 Richard Salz
  2019-04-10 23:24 ` Bakul Shah
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Richard Salz @ 2019-04-10 23:06 UTC (permalink / raw)
  To: tuhs

Any view on this?
https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/

* Re: [TUHS] "Fork considered harmful"
  2019-04-10 23:06 [TUHS] "Fork considered harmful" Richard Salz
@ 2019-04-10 23:24 ` Bakul Shah
  2019-04-10 23:37   ` George Michaelson
  2019-04-11 23:37 ` Chris Hanson
  2019-04-12 16:11 ` Jim Capp
  2 siblings, 1 reply; 10+ messages in thread
From: Bakul Shah @ 2019-04-10 23:24 UTC (permalink / raw)
  To: Richard Salz; +Cc: tuhs

On Apr 10, 2019, at 4:06 PM, Richard Salz <rich.salz@gmail.com> wrote:
> 
> Any view on this? https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/
> 

FWIW, my view is that any unix evolution that complicates
fork() is/has probably going/gone in the wrong direction.



* Re: [TUHS] "Fork considered harmful"
  2019-04-10 23:24 ` Bakul Shah
@ 2019-04-10 23:37   ` George Michaelson
  2019-04-11 11:38     ` Tony Finch
  0 siblings, 1 reply; 10+ messages in thread
From: George Michaelson @ 2019-04-10 23:37 UTC (permalink / raw)
  To: The Eunuchs Hysterical Society

I don't pay much attention to this stuff any more but I do recall
being absolutely *astonished* how complex the re-binding of process to
inherited I/O was in a lot of systems in the 1980s.

fork() and the various exec() flavours had this single compelling
story for me: the stdin/stdout/stderr *and all other open I/O* was
inherited across the process boundary. This alone made writing code
significantly easier.

I did briefly find myself in a world where an X.25 binding (this was
Yorkbox, and then the UCL-CS version of the same idea, and the
per-ISODE OSI stack work) was on an FD, but it wasn't immediately
clear which.

We wound up passing a binary-encoded text blob in the execve() info to
"inform" the child what it was meant to look for in file descriptors,
to (re)bind as the X.25 FD to work with. It felt really creepy to do
this, but we simply couldn't find a way round it.
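
(A minimal sketch of that trick, in Python only for brevity; the
original would have been C. The variable name XFD, the pipe standing in
for the X.25 descriptor, and the helper name are all made up for
illustration. The parent leaves a descriptor open across exec and tells
the child its number through the environment. Python marks new
descriptors non-inheritable by default, so the sketch has to opt in.)

```python
import os
import sys

# Child program: looks up the fd number it was told about and writes to it.
CHILD = "import os; os.write(int(os.environ['XFD']), b'hello from child')"

def run_child_with_fd():
    r, w = os.pipe()
    os.set_inheritable(w, True)        # let w survive the exec in the child
    pid = os.fork()
    if pid == 0:
        os.close(r)
        env = dict(os.environ, XFD=str(w))   # "inform" the child which fd to use
        try:
            os.execve(sys.executable, [sys.executable, "-c", CHILD], env)
        finally:
            os._exit(127)              # only reached if the exec failed
    os.close(w)
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:                  # EOF: the child has exited or closed w
            break
        chunks.append(chunk)
    os.close(r)
    os.waitpid(pid, 0)
    return b"".join(chunks)
```

The descriptor itself crosses the exec untouched; only its meaning has
to be passed out of band.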

Other systems "for your convenience" felt it was far, far kinder to
either completely obscure the binding of I/O or even terminate it.
Truly bizarre.

Actual runtime cost to mechanise the COW state, and the other bits of
kernel state, and instantiate the process, and fiddly bits were stuff
I simply didn't think about much. It felt like some giant bcopy() call
in the kernel did something in VM aware memory addressing to make a
"copy" of something, which then ran, because it existed. As if you
could clone a vertical slice of the layers of the onion and simply
carry on, irrespective.

But if somebody said VMS or the nascent Microsoft kernel was "better"
I would (and probably still would) be skeptical.

Hard to beat simple.

On Thu, Apr 11, 2019 at 9:25 AM Bakul Shah <bakul@bitblocks.com> wrote:
>
> On Apr 10, 2019, at 4:06 PM, Richard Salz <rich.salz@gmail.com> wrote:
> >
> > Any view on this? https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/
> >
>
> FWIW, my view is that any unix evolution that complicates
> fork() is/has probably going/gone in the wrong direction.
>
>

* Re: [TUHS] "Fork considered harmful"
  2019-04-10 23:37   ` George Michaelson
@ 2019-04-11 11:38     ` Tony Finch
  0 siblings, 0 replies; 10+ messages in thread
From: Tony Finch @ 2019-04-11 11:38 UTC (permalink / raw)
  To: George Michaelson; +Cc: The Eunuchs Hysterical Society

George Michaelson <ggm@algebras.org> wrote:
> On Apr 10, 2019, at 4:06 PM, Richard Salz <rich.salz@gmail.com> wrote:
> >
> > Any view on this? https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/
>
> fork() and the various exec() flavours had this single compelling
> story for me: the stdin/stdout/stderr *and all other open I/O* was
> inherited across the process boundary. This alone made writing code
> significantly easier.

Mark Wooding had an insightful observation in response to that paper: it's
relatively common in Unix to have oblivious intermediaries where it is
important that they pass on things like file descriptors and signal
dispositions. How would you implement nohup() in a spawn-based system?
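
(A minimal sketch of the oblivious-intermediary point, in Python for
brevity. The helper name is made up, and it assumes /bin/sh exists and,
as POSIX shells do, leaves a signal that was ignored on entry alone.
The wrapper ignores SIGHUP and execs a shell that knows nothing about
it; the program the shell runs still finds SIGHUP ignored.)

```python
import os
import signal
import sys

# Final program: exit 0 if SIGHUP arrived already ignored, 1 otherwise.
CHILD = (
    "import signal, sys; "
    "sys.exit(0 if signal.getsignal(signal.SIGHUP) == signal.SIG_IGN else 1)"
)

def nohup_like():
    pid = os.fork()
    if pid == 0:
        try:
            signal.signal(signal.SIGHUP, signal.SIG_IGN)   # what nohup does
            # sh is the oblivious intermediary: $0 is the interpreter, $1 the code.
            os.execv("/bin/sh",
                     ["sh", "-c", '"$0" -c "$1"', sys.executable, CHILD])
        finally:
            os._exit(127)              # only reached if the exec failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

A return value of 0 means the disposition passed through the shell
without the shell having to do anything about it.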

[ Yorkbox-related tangent: the other day I was trying to find out more
about the JANET NRS (the hosts.txt-alike with names in uk.ac.cam format)
and all I could find out is that it was hosted at Salford on Pr1me
computers https://www.uknof.org.uk/uknof7/Reid-History.pdf ... the reason
for looking because I'm no longer providing secondary DNS for Salford. ]

Tony.
-- 
f.anthony.n.finch  <dot@dotat.at>  http://dotat.at/
Fair Isle: East 2 or 3, veering south 5 or 6. Slight, becoming moderate.
Occasional rain or drizzle. Moderate or good.

* Re: [TUHS] "Fork considered harmful"
  2019-04-10 23:06 [TUHS] "Fork considered harmful" Richard Salz
  2019-04-10 23:24 ` Bakul Shah
@ 2019-04-11 23:37 ` Chris Hanson
  2019-04-12  0:12   ` Derek Fawcus
  2019-04-12 16:11 ` Jim Capp
  2 siblings, 1 reply; 10+ messages in thread
From: Chris Hanson @ 2019-04-11 23:37 UTC (permalink / raw)
  To: Richard Salz; +Cc: tuhs

On Apr 10, 2019, at 4:06 PM, Richard Salz <rich.salz@gmail.com> wrote:
> 
> Any view on this? https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/

Quite correct in my experience.

The posix_spawn() API isn’t a panacea but (especially with a few *_np extensions) it’s much saner for large real-world applications that run a ton of subprocesses. I work on a large IDE which invokes compilers and such, and it makes a huge difference.

The biggest need for *_np extensions has been control over fd inheritance and chdir behavior. Otherwise a subprocess can wind up inheriting random or no fds from a multithreaded process (thanks to the race between setting up and finishing the call) and you need to use pthread_{chdir,fchdir}() or openat() to set the working directory in which the subprocess is launched.
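
(A minimal sketch of spawn with explicit descriptor setup, using Python's os.posix_spawn() wrapper rather than the C API; the function name and the child program are made up for illustration. The only descriptor deliberately handed to the child is the pipe's write end, mapped onto its stdout by a file action. Python creates pipe descriptors close-on-exec by default, so nothing else from the pipe leaks through.)

```python
import os
import sys

CHILD = "import sys; sys.stdout.write('spawned'); sys.stdout.flush()"

def spawn_with_explicit_stdout():
    r, w = os.pipe()                       # both ends are close-on-exec
    file_actions = [
        (os.POSIX_SPAWN_DUP2, w, 1),       # child's stdout := pipe write end
    ]
    pid = os.posix_spawn(
        sys.executable,
        [sys.executable, "-c", CHILD],
        dict(os.environ),
        file_actions=file_actions,
    )
    os.close(w)
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:                      # EOF once the child has exited
            break
        chunks.append(chunk)
    os.close(r)
    os.waitpid(pid, 0)
    return b"".join(chunks)
```

The working-directory half of the problem is not covered here; Python's wrapper exposes no chdir file action.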

  -- Chris

* Re: [TUHS] "Fork considered harmful"
  2019-04-11 23:37 ` Chris Hanson
@ 2019-04-12  0:12   ` Derek Fawcus
  0 siblings, 0 replies; 10+ messages in thread
From: Derek Fawcus @ 2019-04-12  0:12 UTC (permalink / raw)
  To: tuhs

On Thu, Apr 11, 2019 at 04:37:52PM -0700, Chris Hanson wrote:
> On Apr 10, 2019, at 4:06 PM, Richard Salz <rich.salz@gmail.com> wrote:
> > Any view on this? https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/
> Quite correct in my experience.
> 
> The posix_spawn() API isn’t a panacea but (especially with a few *_np extensions) it’s much saner for large real-world applications that run a ton of subprocesses. I work on a large IDE which invokes compilers and such, and it makes a huge difference.

What I ended up doing for children of GUI apps (on OSX) is to fork very early on,
before the GUI starts or the process becomes multi-threaded.  Then that child does
all of the real spawning, using fd passing and messages over a pipe (actually a
unix domain socket) to drive it, so usually no need for _np stuff.
For cases where posix_spawn() is viable, as I recall it was faster than fork+exec.
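
(A minimal sketch of the fork-early helper, in Python for brevity. The
function names and the one-JSON-line-per-request protocol are made up,
and fd passing is left out. The helper is forked while the parent is
still single-threaded, then runs commands on request over a Unix domain
socket pair and reports each exit status.)

```python
import json
import os
import socket
import subprocess

def _serve(sock):
    """Helper side: read one argv per line, run it, reply with its status."""
    stream = sock.makefile("rw")
    sock.close()
    for line in stream:                    # ends when the parent closes its end
        argv = json.loads(line)
        status = subprocess.call(argv)
        stream.write(json.dumps(status) + "\n")
        stream.flush()

def start_spawn_helper():
    """Call this before starting threads. Returns (stream, helper pid)."""
    parent_sock, helper_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        parent_sock.close()
        try:
            _serve(helper_sock)
        finally:
            os._exit(0)
    helper_sock.close()
    stream = parent_sock.makefile("rw")
    parent_sock.close()                    # the stream keeps the fd alive
    return stream, pid

def spawn_via_helper(stream, argv):
    """Parent side: ask the helper to run argv and wait for its exit status."""
    stream.write(json.dumps(argv) + "\n")
    stream.flush()
    return json.loads(stream.readline())
```

Closing the stream in the parent ends the helper's loop, after which
the parent can reap it with os.waitpid().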

DF

* Re: [TUHS] "Fork considered harmful"
  2019-04-10 23:06 [TUHS] "Fork considered harmful" Richard Salz
  2019-04-10 23:24 ` Bakul Shah
  2019-04-11 23:37 ` Chris Hanson
@ 2019-04-12 16:11 ` Jim Capp
  2 siblings, 0 replies; 10+ messages in thread
From: Jim Capp @ 2019-04-12 16:11 UTC (permalink / raw)
  To: Richard Salz; +Cc: tuhs

I think the problem with fork() is that its elegance invites people to use it outside its sweet spot. 

I always thought (perhaps wrongly) that fork() didn't have to copy all the memory, just the stack and user areas, and that VM page table entries were cloned to share the code segment. 

fork() is beautiful for what it is and what it does. Being able to create a mirror image of the current process, and to be able to share all the I/O and signal states is cool, if that's what you want. I think the author of the micro$oft article is complaining that fork() shares too much, and therefore to use it is a security risk. 

If you don't want to share all that stuff, maybe you shouldn't be using fork(), or, you should fork() early, sharing EXACTLY what you want to share and nothing more, and then differentiate with exec(). 

C is elegant. C can also be dangerous if you don't use it wisely. 

I think the author should take a lesson from "The Kings Toaster". 

http://www.ee.ryerson.ca:8080/~elf/hack/ktoast.html 

Cheers, 

Jim 

From: "Richard Salz" <rich.salz@gmail.com> 
To: tuhs@tuhs.org 
Sent: Wednesday, April 10, 2019 7:06:23 PM 
Subject: [TUHS] "Fork considered harmful" 

Any view on this? https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/ 

* Re: [TUHS] "Fork considered harmful"
  2019-04-12 14:51 Noel Chiappa
  2019-04-12 15:33 ` Warner Losh
@ 2019-04-12 19:55 ` Dan Cross
  1 sibling, 0 replies; 10+ messages in thread
From: Dan Cross @ 2019-04-12 19:55 UTC (permalink / raw)
  To: The Eunuchs Hysterical Society; +Cc: Noel Chiappa

From: Richard Salz

> Any view on this?
> https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/

Yes.

First, I dislike the presentation of the paper. From the pithy title to
snarky section headings to the overly informal register of the writing, I
think the authors did themselves a disservice writing in a style that is
far too colloquial. The argument is presented with too much editorial
comment, as if it's trying to be "edgy" or something. I find that
off-putting and annoying at best.

But stylistic issues aside, the substance of the paper is largely on point.
The fact is that, for better or worse, fork() has not aged gracefully into
the modern age of multi-threaded programs, big graphical applications, and
the like, and the problems the authors point out are very, very real. An
argument can be made along the lines of, "well, just don't write programs
that way..." but the universe of interesting and useful userspace programs
is now large enough that I'm afraid we're well inside the event horizon of
increasing entropy.

It's interesting that they make an oblique reference to Austin's
dissertation, or at least one of the papers that came out of his work
(Clements et al) but they don't go the full way in exploring the
implications.

On Fri, Apr 12, 2019 at 10:51 AM Noel Chiappa <jnc@mercury.lcs.mit.edu>
wrote:

> Having read this, and seen the subsequent discussion, I think both sides
> have
> good points.
>
> What I perceive to be happening is something I've described previously, but
> never named, which is that as a system scales up, it can be necessary to
> take
> one subsystem which did two things, and split it up so there's a custom
> subsystem for each.
>
> I've seen this a lot in networking; I've been trying to remember some of
> the
> examples I've seen, and here's the best one I can come up with at the
> moment:
> having the routing track 'unique-ID network interface names' (i.e.
> interface
> 'addresses') - think 48-bit IEEE interface IDs' - directly. In a small
> network, this works fine for routing traffic, and as a side-benefit, gives
> you
> mobility. Doesn't scale, though - you have to build an 'interface ID to
> location name mapping system', and use 'location names' (i.e. 'addresses')
> in
> the routing.
>
> So classic Unix 'fork' does two things: i) creates a new process, and ii)
> replicates
> the environment/etc of an existing process. (In early Unix, the latter was
> pretty
> simple, but as the paper points out, it has now become a) complex and b)
> expensive.)
>
> I think the answer has to include decomposing the functionality of old
> fork()
> into several separate sub-primitives (albeit not all necessarily directly
> accessible to the user): a new-process primitive, which can be bundled
> with a
> number of different alternatives (e.g. i) exec(), ii) environment
> replication,
> iii) address-space replication, etc) - perhaps more than one at once.
>

This is approximately what was done in Akaros. The observation, in the MSR
paper and elsewhere, that fork() is a poor abstraction because it tries to
do too much and doesn't interact well with threads (which Akaros was all
about) is essentially correct and created undue burdens on the system's
implementors. The solution there was to split process creation into two
steps: creation proper, and marking a process runnable for the first time.
This gave a parent an opportunity to create a child process and then
populate its file descriptors, environment variables, and so forth before
setting it loose on the system: one got much of the elegance of the fork()
model, but without the bad behavior.

The critical observation is that a hard distinction between fork()/exec()
on one side and spawn() on the other as the only two possible designs is
unnecessary; an intermediate step in the spawn() model preserving the
two-step nature of the fork()/exec() model yields many of the benefits of
both, without many of the problems of one or the other.
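
(A rough sketch of the two-step shape only, in Python. This is not the
Akaros interface; the class and method names are made up, and the
"created but not yet runnable" state lives in a Python object rather
than in the kernel. The parent populates environment and descriptors
first, and nothing runs until run() is called. It assumes /bin/sh is
present when used as shown in the note below.)

```python
import os

class PendingProcess:
    """Hypothetical two-step child: configure first, then make it runnable."""

    def __init__(self, path, argv):
        self.path = path
        self.argv = list(argv)
        self.env = {}              # child starts with an empty environment
        self.file_actions = []     # fd setup applied in the child before exec
        self.pid = None            # no process exists until run()

    def setenv(self, key, value):
        self.env[key] = value

    def map_fd(self, parent_fd, child_fd):
        # Make parent_fd appear as child_fd in the child.
        self.file_actions.append((os.POSIX_SPAWN_DUP2, parent_fd, child_fd))

    def run(self):
        # Second step: only now does the child start executing.
        self.pid = os.posix_spawn(
            self.path, self.argv, self.env, file_actions=self.file_actions
        )
        return self.pid
```

For instance, a PendingProcess for "/bin/sh" can be given one
environment variable and a pipe as stdout before run(); descriptors
that are already inheritable (normally 0, 1 and 2) still come along
unless mapped over.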

> So that shell would want a form of fork() which bundled in i) and ii), but
> large applications might want something else. And there might be several
> variants of ii), e.g. one might replicate only environment variables,
> another
> might add I/O channels, etc.
>
> In a larger system, there's just no 'one size fits all' answer, I think.
>

This is, I think, the crux of the argument: Unix fork, as introduced on the
PDP-7, wasn't designed for "large" systems. I'd be curious to know how much
intention was behind it, and how many of the consequences of fork()'s
behavior were known in advance. As Dennis Ritchie's document on early Unix history (well known to
most of the denizens of this list) pointed out, the implementation was an
expedient, and the construct was loosely based on prior art (Genie etc).
Was the idea of the implicit dup()'ing of open file descriptors something
that people thought about consciously at the time? Certainly, once it was
there, people found out that they could exploit it in clever ways for e.g.
IO redirection, pipes (which came later, of course), etc, but I wonder to
what extent that was discovered as opposed to being an explicit design
objective.

Put another way, the thought process may have been along the lines of,
"look at the neat properties that fall out of this thing we've already
got..." as opposed to, "we designed this thing to have all these neat
properties..." Much of the current literature and extant course material
seems to take the latter tack, but it's not at all clear (to me, anyway)
that that's an accurate reflection of the historical reality.

I briefly mentioned Clements's dissertation above. In essence, the
scalability commutativity rule says that interfaces that commute scale
better than those that do not because they do not _require_ serialization,
so they can be parallelized etc. Many of the algorithms in early Unix do
not commute: the behavior of returning the lowest available number when
allocating a file descriptor is an example (consider IO redirection here
and the familiar sequence, "close(1), open("output", O_WRONLY)": this
doesn't work if one does the close after the open, etc). But fork() and
exec() might be the ultimate example. Obviously if I exec() before I fork()
my semantics are very, very different, but more generally the commutativity
must be considered in some context: operations on file descriptors don't
commute with one another, but they do commute with, say, memory writes. On
the other hand, fork() doesn't commute with much of anything at all.
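
(A minimal sketch of the close/open ordering, in Python for brevity;
the function name is made up, and O_CREAT is added so the example runs
without a pre-existing file. The two calls are run in a forked child in
either order, and the child reports which descriptor number open()
returned.)

```python
import os

def stdout_fd_after(path, close_first):
    """Run close(1) and open(path) in a child, in the given order.

    Returns the descriptor number that open() handed back.
    """
    pid = os.fork()
    if pid == 0:
        status = 255
        try:
            # Fill any of 0, 1, 2 that are closed, so the low slots are
            # in a known state (this relies on the same lowest-free rule).
            for fd in (0, 1, 2):
                try:
                    os.fstat(fd)
                except OSError:
                    os.open(os.devnull, os.O_RDWR)
            flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
            if close_first:
                os.close(1)
                status = os.open(path, flags, 0o600)   # lowest free slot is 1
            else:
                status = os.open(path, flags, 0o600)   # 0, 1, 2 are taken
                os.close(1)
        finally:
            os._exit(min(status, 255))
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)
```

With close first the file lands on descriptor 1 and the redirection
works; with open first it lands on 3 or higher and stdout is simply
gone.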

The early Unix algorithms are so incredibly simple (one can easily imagine
the loop over a process's file descriptor table, for example), but one
can't help but wonder if there was any sense of what the consequences of
the details of those algorithms leaking out to user programs would be 50
years later. Surely not?

        - Dan C.

* Re: [TUHS] "Fork considered harmful"
  2019-04-12 14:51 Noel Chiappa
@ 2019-04-12 15:33 ` Warner Losh
  2019-04-12 19:55 ` Dan Cross
  1 sibling, 0 replies; 10+ messages in thread
From: Warner Losh @ 2019-04-12 15:33 UTC (permalink / raw)
  To: Noel Chiappa; +Cc: The Eunuchs Hysterical Society

On Fri, Apr 12, 2019 at 8:51 AM Noel Chiappa <jnc@mercury.lcs.mit.edu>
wrote:

>     > From: Richard Salz
>
>     > Any view on this?
>     >
> https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/
>
> Having read this, and seen the subsequent discussion, I think both sides
> have
> good points.
>
> What I perceive to be happening is something I've described previously, but
> never named, which is that as a system scales up, it can be necessary to
> take
> one subsystem which did two things, and split it up so there's a custom
> subsystem for each.
>
> I've seen this a lot in networking; I've been trying to remember some of
> the
> examples I've seen, and here's the best one I can come up with at the
> moment:
> having the routing track 'unique-ID network interface names' (i.e.
> interface
> 'addresses') - think 48-bit IEEE interface IDs' - directly. In a small
> network, this works fine for routing traffic, and as a side-benefit, gives
> you
> mobility. Doesn't scale, though - you have to build an 'interface ID to
> location name mapping system', and use 'location names' (i.e. 'addresses')
> in
> the routing.
>
> So classic Unix 'fork' does two things: i) creates a new process, and ii)
> replicates
> the environment/etc of an existing process. (In early Unix, the latter was
> pretty
> simple, but as the paper points out, it has now become a) complex and b)
> expensive.)
>

Signals, fds, address space, copy vs share, COW vs copy now, etc are all
things. Also I'd split hairs on (i): you need some way to create a new
thread of execution within a process, which is where a lot of the
criticism of fork has focused in the past.

> I think the answer has to include decomposing the functionality of old
> fork()
> into several separate sub-primitives (albeit not all necessarily directly
> accessible to the user): a new-process primitive, which can be bundled
> with a
> number of different alternatives (e.g. i) exec(), ii) environment
> replication,
> iii) address-space replication, etc) - perhaps more than one at once.
>
> So that shell would want a form of fork() which bundled in i) and ii), but
> large applications might want something else. And there might be several
> variants of ii), e.g. one might replicate only environment variables,
> another
> might add I/O channels, etc.
>
> In a larger system, there's just no 'one size fits all' answer, I think.
>

Agreed. We've already seen that happening; some examples are quite old. We
had vfork() (dating back to 3BSD) which tried to optimize the duplication
stuff. More recently, rfork() (plan9 and later BSD) and clone() (Linux) [*]
have been used to specify what parts of a process are copied and/or shared to
allow, among other things, light weight threads to be one of the possible
answers, to allow the fork to happen asynchronously, etc. Linux has a bunch
of other variants as well.
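
(A minimal sketch of the two ends of the copy-vs-share spectrum, in
Python for brevity. Python exposes neither rfork() nor clone(), so this
only contrasts a thread, which shares the address space, with a plain
fork(), which copies it; the names are made up for illustration.)

```python
import os
import threading

counter = [0]            # one shared cell in the parent's address space

def bump():
    counter[0] += 1

def bump_in_thread():
    """Shared address space: the parent sees the thread's write."""
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    return counter[0]

def bump_in_fork():
    """Copied address space: the child's write stays in the child."""
    pid = os.fork()
    if pid == 0:
        bump()
        os._exit(0)
    os.waitpid(pid, 0)
    return counter[0]
```

Starting from zero, bump_in_thread() leaves the parent's counter at 1,
and a following bump_in_fork() leaves it at 1 as well, since the
increment happened in the child's copy.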

fork as a bogeyman is a well known trope, honestly. Criticism of it, and
solutions for its all-or-nothing approach have been proffered for a long
time. These solutions range from having a helper child process spawn the
other things a more complex process wants, to specialized ways to create
threads (which are process-like things that share an address space and
benefit from special handling in the kernel), to things like rfork or clone
that try to pick-and-choose what aspects of process duplication are needed.
There's a reason that the clone man page is maybe 10x longer than the
classic fork man page.

Warner

[*] This doesn't even begin to look at things like what Solaris, Irix, or a
dozen other unix derivatives did to create threads and/or optimize
different use cases of fork.

* Re: [TUHS] "Fork considered harmful"
@ 2019-04-12 14:51 Noel Chiappa
  2019-04-12 15:33 ` Warner Losh
  2019-04-12 19:55 ` Dan Cross
  0 siblings, 2 replies; 10+ messages in thread
From: Noel Chiappa @ 2019-04-12 14:51 UTC (permalink / raw)
  To: tuhs; +Cc: jnc

    > From: Richard Salz

    > Any view on this?
    > https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/

Having read this, and seen the subsequent discussion, I think both sides have
good points.

What I perceive to be happening is something I've described previously, but
never named, which is that as a system scales up, it can be necessary to take
one subsystem which did two things, and split it up so there's a custom
subsystem for each.

I've seen this a lot in networking; I've been trying to remember some of the
examples I've seen, and here's the best one I can come up with at the moment:
having the routing track 'unique-ID network interface names' (i.e. interface
'addresses') - think 48-bit IEEE interface IDs - directly. In a small
network, this works fine for routing traffic, and as a side-benefit, gives you
mobility. Doesn't scale, though - you have to build an 'interface ID to
location name mapping system', and use 'location names' (i.e. 'addresses') in
the routing.

So classic Unix 'fork' does two things: i) creates a new process, and ii) replicates
the environment/etc of an existing process. (In early Unix, the latter was pretty
simple, but as the paper points out, it has now become a) complex and b) expensive.)

I think the answer has to include decomposing the functionality of old fork()
into several separate sub-primitives (albeit not all necessarily directly
accessible to the user): a new-process primitive, which can be bundled with a
number of different alternatives (e.g. i) exec(), ii) environment replication,
iii) address-space replication, etc) - perhaps more than one at once.

So that shell would want a form of fork() which bundled in i) and ii), but
large applications might want something else. And there might be several
variants of ii), e.g. one might replicate only environment variables, another
might add I/O channels, etc.

In a larger system, there's just no 'one size fits all' answer, I think.

	  Noel
