Hi Dan,

Thanks for the considered response.  I was beginning to fear that my
musing was of moronically minimal merit.

At 2024-05-15T10:42:33-0400, Dan Cross wrote:
> On Tue, May 14, 2024 at 7:10 AM G. Branden Robinson wrote:
> > [snip]
> > Viewpoint 1: Perspective from Pike's Peak
>
> Clever.

If Rob's never heard _that_ one before, I am deeply disappointed.

> > Elementary Unix commands should be elementary.  Unix is a kernel.
> > Programs that do simple things with system calls should remain
> > simple.  This practice makes the system (the kernel interface)
> > easier to learn, and to motivate and justify to others.  Programs
> > therefore test the simplicity and utility of, and can reveal flaws
> > in, the set of primitives that the kernel exposes.  This is
> > valuable stuff for a research organization.  "Research" was right
> > there in the CSRC's name.
>
> I believe this is at once making a more complex argument than was
> proffered, and at the same time misses the contextual essence that
> Unix was created in.

My understanding of that context is, "a pleasant environment for
software development" (McIlroy)[0].  My notion of software development
entails (when not under managerial pressure to bang something together
for the exploitation of "market advantage") analysis and reanalysis of
software components to make them more efficient and more composable.

As a response to the perceived bloat of Multics, the development of
the Unix kernel absolutely involved much critical reappraisal of what
_needed_ to be in a kernel, and of which services were so essential
that they must be offered.  As a microkernel Kool-Aid drinker, I tend
to view Unix's origin in that light, a view reinforced by the severe
limitations of the PDP-7 where it was born.  Possibly many of the
decisions about where to draw the kernel-service/userspace-service
line were made by instinct or seasoned judgment, but the CSRC being a
research organization, I'd be surprised if matters of empirical
measurement were far from top of mind.

It's a shame we don't have more insight into Thompson's development
process, especially in those early days.  I think we have a tendency
to conceive of Unix as having sprung from his fingers already
crystallized, like a mineral Athena from the forehead of Zeus.  I
would wager (and welcome correction if he has the patience) that he
made and reversed decisions based on the experience of using the
system.  Some episodes in McIlroy's "A Research Unix Reader"
illustrate that this was a recurring feature of its _later_
development, so why not in the incubation period?  That, too, is
empirical measurement, even if informal.  Many revisions are made in
software because we find in testing that something is "too damn slow",
or runs the system out of memory too often.

So, to summarize, I want to push back on your counter here.  Making
little things to measure system features is a salutary practice in OS
development.  Stevens's _Advanced Programming in the Unix Environment_
is, shall we say, tricked out with exhibits along these lines.  The
author's dedication to _measurement_, as opposed to partisan opinion,
is, I think, a major factor in the book's status as a landmark work
and as nigh-essential reading for the serious Unix developer to this
day.

Put differently, why would anyone _care_ about making cat(1) simple if
one didn't have these objectives in mind?
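To make the point concrete, here's the flavor of throwaway program I
have in mind--loosely in the spirit of the buffer-size exhibits in
APUE, though this toy is mine, not Stevens's.  It times sequential
read(2)s of a file at several buffer sizes:

    /* bufsize.c: a toy measurement, invented for this message.
     * Time sequential read(2)s of a file at several buffer sizes.
     * After the first pass the file is likely warm in the buffer
     * cache, so compare warm runs with warm runs. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        for (size_t bs = 512; bs <= 1 << 20; bs <<= 2) {
            int fd = open(argv[1], O_RDONLY);
            char *buf = malloc(bs);
            if (fd < 0 || buf == NULL) {
                perror(argv[1]);
                return 1;
            }
            struct timeval t0, t1;
            long long total = 0;
            ssize_t n;
            gettimeofday(&t0, NULL);
            while ((n = read(fd, buf, bs)) > 0)
                total += n;
            gettimeofday(&t1, NULL);
            printf("%7zu-byte reads: %lld bytes in %.3f s\n", bs,
                total, (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_usec - t0.tv_usec) / 1e6);
            free(buf);
            close(fd);
        }
        return 0;
    }

Nothing profound; the point is that such a thing takes ten minutes to
write and immediately tells you something true about your system that
partisan opinion cannot.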
> > Viewpoint 2: "I Just Want to Serve 5 Terabytes"[1]
> >
> > cat(1)'s man page did not advertise the traits in the foregoing
> > viewpoint as objectives, and never did.[2]  Its avowed purpose was
> > to copy, without interruption or separation, 1..n files from
> > storage to an output channel or stream (which might be
> > redirected).
> >
> > I don't need to convince you that this is a worthwhile
> > application.  But when we think about the many possible ways--and
> > destinations--a person might have in mind for that I/O channel, we
> > have to face the necessity of buffering or performance goes
> > through the floor.
> >
> > It is 1978.  Some VMS
>
> I don't know about that; VMS IO is notably slower than Unix IO by
> default.  Unlike VMS, Unix uses the buffer cache to serialize access
> to the underlying storage device(s).

I must confess I have little experience with VMS (and none more recent
than 30 years ago); I offered it as an example mainly because it was
actually around in 1978 (if still fresh from the foundry).  My
personal backstory is much more along the lines of my other example,
CP/M on toy computers (8-bit data bus, pffffffft, right?).

> Ironically, caching here is a major win, not just for speed, but to
> make it relatively easy to reason about the state of a block, since
> that state is removed from the minutiae of the underlying storage
> device and instead handled in the bio layer.  Treating the block
> cache as a fixed-size pool yields a relatively simple state machine
> for synchronizing between the in-memory and on-disk representations
> of data.

I entirely agree with this.  I contemplated following up Bakul Shah's
post with a mention of Jim Gettys's work on bufferbloat.[1]  So let me
do that here, and venture the opinion that a "buffer" as popularly
conceived and implemented (more or less just a hunk of memory to house
data) is too damn dumb a data structure for many of the uses to which
it is put.  If and when people address these problems, they do what
the Unix buffer cache did: they elaborate the buffer with state.  This
is a repeated design pattern; see SIGURG, for example.

Off the top of my head, I perceive three circumstances that buffers
often need to manage (a sketch in code follows below).

1.  Avoidance of underrun.  Such were the joys of CD-R burning.  But
    this also matters in streaming and other real-time applications,
    to avoid interruption.  Essentially you want to be able to say,
    "I'm running out of data at the current rate; please supply more
    ASAP".

2.  Avoidance of overrun.  The problems of modem-like flow control
    are familiar to most.  An important insight here, reinforced if
    not pioneered by Gettys, is that "just making the buffer bigger",
    the brogrammer solution, is not always the wise choice.

3.  Cancellation.  Familiar to all as SIGPIPE.  Sometimes all of the
    data in the buffer is invalidated.  The sender needs to stop
    transmitting ASAP, and the receiver can discard whatever it has.

I apologize for the armchair approach.  I have no doubt that much
literature exists that has covered this stuff far more rigorously.
And yet much of that knowledge has not made its way down the mountain
into practice.  That, I think, was at least part of Doug's point:
academics may have considered the topic adequately, but practitioners
are too often solving problems as if it's 1972.
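To put some flesh on those three cases, here is roughly the shape of
"buffer elaborated with state" I have in mind.  An armchair sketch,
let me stress: the names and the watermark policy are invented for
illustration and describe no real system's API.

    /* A ring buffer that is not "too damn dumb": it carries enough
     * state to express the three conditions above.  All invented. */
    #include <stdbool.h>
    #include <stddef.h>

    struct smartbuf {
        char   data[4096];
        size_t head, tail;   /* ring bookkeeping (logic omitted)    */
        size_t count;        /* bytes currently buffered            */
        size_t lowwater;     /* below this: underrun looms (1)      */
        size_t highwater;    /* above this: throttle the sender (2) */
        bool   cancelled;    /* contents invalidated (3)            */
    };

    enum advice { ADV_OK, ADV_SEND_MORE, ADV_BACK_OFF, ADV_DISCARD };

    /* What should the producer do next?  The point is only that the
     * buffer itself can answer, because it is more than a hunk of
     * memory. */
    enum advice smartbuf_advise(const struct smartbuf *b)
    {
        if (b->cancelled)            return ADV_DISCARD;   /* 3 */
        if (b->count < b->lowwater)  return ADV_SEND_MORE; /* 1 */
        if (b->count > b->highwater) return ADV_BACK_OFF;  /* 2 */
        return ADV_OK;
    }

A real implementation would hang flow-control machinery off those
transitions; the sketch argues only that such state belongs _in_ the
buffer rather than scattered around it.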
> > [snip]
> >
> > And this, as we all know, is one of the reasons the standard I/O
> > library came into existence.  Mike Lesk, I surmise, understood
> > that the "applications programmer" having knowledge of kernel
> > internals was in general neither necessary nor desirable.
>
> I'm not sure about that.  I suspect that the justification _may_
> have been more along the lines of noting that many programs
> implemented their own, largely similar buffering strategies, and
> that it was preferable to centralize those into a single library,
> and also noting that building some kinds of programs was
> inconvenient using raw system calls.  For instance, something like
> `gets` is handy,

An interesting choice, given its notoriety as a nuclear landmine of
insecurity.  ;-)

> but is _annoying_ to write using just read(2).  It can obviously be
> done, but if I don't have to, I'd prefer not to.

I think you are justifying why stdio was written _as a library_; your
points seem to be pretty typical examples of why we move code thither
from applications.  My emphasis is a little different: why was
buffered I/O in particular (when it could so easily have been string
handling) the nucleus of what would become a large standard library
with its toes in many waters--so huge that projects like uclibc and
musl arose for the purpose of (in part) chopping back out the stuff
they felt they didn't need?

My _claim_ is that stdio.h was the first piece of the library to walk
upright because the need for it was most intense.  More so than with
strings; in fact, we've learned that Nelson's original C string
library was tricky to use well and was often elaborated by others in
unfortunate ways.[7]  But there was no I/O at all without going
through the kernel, and while there were many ways to get that job
done, the best leveraged knowledge of what the kernel had to work
with.  And yet the kernel might get redesigned.  Could stdio itself
have been done better?  Korn and Vo tried.[8]

> Here's where I think this misses the mark: this focuses too much on
> the idea that simple programs exist to be tests for, and exemplars
> of, the kernel system call interface, but what evidence do you have
> for that?

A little bit of experience, long after the 1970s, of working with
automated tests for the seL4 microkernel.

> A simpler explanation is that simple programs are easier to write,
> easier to read, easier to reason about, test, and examine for
> correctness.

All certainly true.  But these things are just as true of programs
that don't directly make system calls at all.  cat(1), as ideally
envisioned by Pike (if I understand the Platonic ideal of his position
correctly), not only makes system calls, but dirties its hands with
the standard library as little as possible (if you recognize no
options, you need neither call nor reimplement getopt(3)), and
certainly not for the central task.  Again, I think we are not so much
disagreeing as I'm finding out that I didn't adequately emphasize the
distinctions I was making.

> Unix amplified this with Doug's "garden hoses of data" idea and the
> advent of pipes; here, it was found that small, simple programs
> could be combined in often surprisingly unanticipated ways.

Agreed; but given that pipes-as-a-service are supplied by the
_kernel_, we are once again talking about system calls.
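To drive that home, here is "ls | wc -l" with the shell stripped away.
Every step of the plumbing is a kernel service: pipe(2), fork(2),
dup2(2), close(2), and execve(2) beneath the execlp(3) wrapper.

    /* A pipeline by hand: ls | wc -l.  Nothing but system calls
     * moves the data; stdio appears only to report errors. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        int fds[2];

        if (pipe(fds) < 0) {
            perror("pipe");
            return 1;
        }
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {                   /* child: the "ls" end  */
            dup2(fds[1], STDOUT_FILENO);  /* stdout -> write side */
            close(fds[0]);
            close(fds[1]);
            execlp("ls", "ls", (char *)NULL);
            perror("execlp ls");
            _exit(127);
        }
        dup2(fds[0], STDIN_FILENO);       /* parent: the "wc" end */
        close(fds[0]);
        close(fds[1]);
        execlp("wc", "wc", "-l", (char *)NULL);
        perror("execlp wc");
        return 127;
    }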
One of the projects I never got off the ground with seL4 was a
reconsideration from first principles of what sorts of more or less
POSIXish buffering and piping mechanisms should be offered (in
userland, of course).  For those who are scandalized that a
microkernel doesn't offer pipes itself, see this Heiser piece on "IPC"
in that system.[2]

> Unix built up a philosophy about _how_ to write programs that was
> rooted in the problems that were interesting when Unix was first
> created.  Something we often forget is that research systems are
> built to address problems that are interesting _to the researchers
> who build them_.

I agree.

> This context can shape a system, and we see that with Unix: a
> highly synchronous system call interface, because overly elaborate
> async interfaces were hard to program;

And still are, apparently, even without the qualifier "overly
elaborate".

...though Go (and JavaScript?) fans may disagree.

> a simple file abstraction that was easy to use
> (open/creat/read/write/close/seek/stat) because files on other
> contemporary systems were baroque things that were difficult to use;

Absolutely.  It's a truism in the Unix community that it's possible to
simulate record-oriented storage and retrieval on top of a byte
stream, but hard to do the converse.  (I sketch the easy direction in
code below.)  Though, being a truism, it might be worthwhile to
critically reconsider it and more rigorously establish how we know
what we think we know.  That's another reason I endorse the
microkernel mission: let's lower the cost of experimentation on parts
of the system that of themselves don't demand privilege.  It's a
highly concurrent, NUMA world out there.

> a simple primitive for the creation of processes because, again, on
> other systems processes were very heavy, complicated things that
> were difficult to use.

It is with some dismay that I look at what they are, _on Unix_, today.

https://github.com/torvalds/linux/blob/1b294a1f35616977caddaddf3e9d28e576a1adbc/include/linux/sched.h#L748
https://github.com/openbsd/src/blob/master/sys/sys/proc.h#L138

Contrast:

https://github.com/jeffallen/xv6/blob/master/proc.h#L65

> Unix took problems related to IO and processes and made them easy.
> By the 80s, these were pretty well understood, so focus shifted to
> other things (languages, networking, etc).

True, but beside my point.  Pike's point about cat and its flags was,
I think, a call to reconsider more fundamental things--to question
what we thought we knew about how best to design core components of
the system, for example.  Do we really need the efflorescence of
options that perfuses not simply the GNU versions of such components
(a popular sink for abuse), but the Busybox and *BSD implementations
as well?  Every developer of such a component should consider the
cost/benefit ratio of its flags, and then RE-consider them at
intervals--even at the cost of backward compatibility.  (Deprecation
cycles and mitigation/migration plans are good.)

> Unix is one of those rare beasts that escaped the lab and made it
> out there in the wild.  It became the workhorse that begat a whole
> two or three generations of commercial work; it's unsurprising that
> when the web explosion happened, Unix became the basis for it: it
> was there, it was familiar, and by then it wasn't a research project
> anymore, but a basis for serious commercial work.

Yes, and in a sense this success has cost all of us.[3][4][5]

> That it has retained the original system call interface is almost
> incidental;

In _structure_, sure; in detail, I'm not sure this claim withstands
scrutiny.  Just _count_ the system calls we have today vs. V6 or V7.
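Apropos of the byte-stream truism a few paragraphs up, here is the
easy direction, sketched.  The 4-byte little-endian length prefix and
the function names are mine, purely for illustration:

    /* Record-oriented I/O layered on a byte stream. */
    #include <stdint.h>
    #include <unistd.h>

    static int read_full(int fd, void *buf, size_t len)
    {
        char *p = buf;
        while (len > 0) {
            ssize_t n = read(fd, p, len);
            if (n <= 0)
                return -1;        /* error, or EOF mid-record */
            p += n;
            len -= (size_t)n;
        }
        return 0;
    }

    /* Emit one record: length prefix, then payload. */
    int put_record(int fd, const void *rec, uint32_t len)
    {
        unsigned char hdr[4] = {
            len & 0xff, (len >> 8) & 0xff,
            (len >> 16) & 0xff, (len >> 24) & 0xff
        };
        if (write(fd, hdr, sizeof hdr) != (ssize_t)sizeof hdr)
            return -1;
        return write(fd, rec, len) == (ssize_t)len ? 0 : -1;
    }

    /* Fetch one record into buf (capacity cap); return its length. */
    ssize_t get_record(int fd, void *buf, size_t cap)
    {
        unsigned char hdr[4];

        if (read_full(fd, hdr, sizeof hdr) < 0)
            return -1;
        uint32_t len = (uint32_t)hdr[0] | (uint32_t)hdr[1] << 8
            | (uint32_t)hdr[2] << 16 | (uint32_t)hdr[3] << 24;
        if (len > cap)
            return -1;            /* caller's buffer too small */
        return read_full(fd, buf, len) < 0 ? -1 : (ssize_t)len;
    }

The converse--conjuring a faithful byte stream out of fixed-size
records--is where the bodies are buried, which is rather the truism's
point.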
> perhaps that fits with your broccoli-man analogy.

I'm unfamiliar with this metaphor.  It makes me wonder how to place it
in company with the requirements documents that led to the Ada
language: Strawman, Woodenman, Ironman, and Steelman.  At least it's
likely better eating than any of those.  ;-)

Since no one else ever says it on this list, let me point out what a
terrific and unfairly maligned language Ada is.  In reading the
minutes of the latest WG14 meeting[6], I marvel anew at how C has over
time slowly, slowly accreted type- and memory-safety features that Ada
had in 1983 (or even in 1980, before its formal standardization).

Regards,
Branden

[0] https://www.gnu.org/software/groff/manual/groff.html.node/Background.html
[1] https://gettys.wordpress.com/category/bufferbloat/
[2] https://microkerneldude.org/2019/03/07/how-to-and-how-not-to-use-sel4-ipc/
[3] https://tianyin.github.io/misc/irrelevant.pdf (guess who)
[4] https://www.youtube.com/watch?v=36myc8wQhLo (Timothy Roscoe)
[5] https://queue.acm.org/detail.cfm?id=3212479 (David Chisnall)
[6] https://www.open-std.org/JTC1/sc22/wg14/www/docs/n3227.htm
    (Skip down to section 5.  Note particularly `_Optional`.)
[7] https://www.symas.com/post/the-sad-state-of-c-strings
[8] https://www.semanticscholar.org/paper/SFIO%3A-Safe-Fast-String-File-IO-Korn-Vo/8014266693afda38a0a177a9b434fedce98eb7de