From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <6c7d84f203a2cc4f3427e177e34fa9d9@brasstown.quanstro.net>
References: <68ce90976b22bdb0929ccccafef4b7d0@kw.quanstro.net>
	<3330200.XJjoRb8JbZ@blitz>
	<5538fcd345a73fc294c6ee568f2fcdb4@kw.quanstro.net>
	<8ee568439e1855124564ecf0e83ac2b3@kw.quanstro.net>
	<6c7d84f203a2cc4f3427e177e34fa9d9@brasstown.quanstro.net>
From: Dan Cross
Date: Thu, 30 Aug 2012 15:35:47 +0530
Message-ID:
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=UTF-8
Subject: Re: [9fans] rc's shortcomings (new subject line)
Topicbox-Message-UUID: b3e1321c-ead7-11e9-9d60-3106f5b1d025

On Wed, Aug 29, 2012 at 7:27 PM, erik quanstrom wrote:
>> > rc already has non-linear pipelines. but they're not very convienient.
>>
>> And somewhat limited. There's no real concept of 'fanout' of output,
>> for instance (though that's a fairly trivial command, so probably
>> doesn't count), or multiplexing input from various sources that would
>> be needed to implement something like a shell-level data flow network.
>>
>> Muxing input from multiple sources is hard when the data isn't somehow
>> self-delimited.
>> [...]
>> There may be other ways to achieve the same thing; I remember that the
>> boundaries of individual writes used to be preserved on read, but I
>> think that behavior changed somewhere along the way; maybe with the
>> move away from streams? Or perhaps I'm misremembering?
>
> pipes still preserve write boundaries, as does il. (even the 0-byte write) but tcp of course by
> definition does not. but either way, the protocol would need to be
> self-framed to be transported on tcp. and even then, there are protocols
> that are essentially serial, like tls.

Right. I think this is the reason for Bakul's question about
s-expressions or JSON or a similar format; those formats are inherently
self-delimiting. The problem is that, for passing such things around to
work without some kind of reverse 'tee'-like intermediary, the system
has to understand that the things being transferred are s-expressions
or JSON records or whatever, not just streams of uninterpreted bytes.
We've steadfastly rejected having the system impose that kind of
structure on files in Unix-y environments since 1969.

But conceptually, these IPC mechanisms are similar to channels in
CSP-style languages. A natural question, then, is how CSP-style
languages handle the issue. Channels sidestep the muxing problem by
being typed: elements placed onto a channel are indivisible objects of
that type, so one doesn't need to worry about interference from other
objects simultaneously placed onto the same channel by other threads of
execution.

Could we do something similar with pipes? I don't know that anyone
wants typed file descriptors; that would open a whole new can of worms.
Maybe the building blocks are all there already: one could imagine some
kind of 'splitter' program that takes input and rebroadcasts it across
multiple output descriptors. Coupled with some kind of 'merge' program
that takes multiple input streams and muxes them onto a single output,
one could build nearly arbitrarily complicated networks of computations
connected by pipes; maybe, for simplicity, constrained to be DAGs. With
a notation to describe these computation graphs, one could do a
topological sort of the graph, create pipes in all the appropriate
places, and go from there. Is the shell an appropriate place for such a
thing?
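
To make the channel comparison concrete, here is a rough sketch in Go
of a typed 'merge' and 'splitter' (purely illustrative; the names and
the string element type are arbitrary). Because the channel carries
whole values, merging never has to think about framing:

    // Sketch only: typed channels make each value indivisible, so a
    // merge never sees a partial record and needs no self-framing.
    package main

    import (
        "fmt"
        "sync"
    )

    // merge muxes several typed input channels onto one output.
    func merge(inputs ...<-chan string) <-chan string {
        out := make(chan string)
        var wg sync.WaitGroup
        for _, in := range inputs {
            wg.Add(1)
            go func(in <-chan string) {
                defer wg.Done()
                for v := range in {
                    out <- v
                }
            }(in)
        }
        go func() {
            wg.Wait()
            close(out)
        }()
        return out
    }

    // split broadcasts each value from one input to n outputs.
    func split(in <-chan string, n int) []chan string {
        outs := make([]chan string, n)
        for i := range outs {
            outs[i] = make(chan string)
        }
        go func() {
            for v := range in {
                for _, out := range outs {
                    out <- v
                }
            }
            for _, out := range outs {
                close(out)
            }
        }()
        return outs
    }

    func main() {
        a, b := make(chan string), make(chan string)
        go func() { a <- "record from a"; close(a) }()
        go func() { b <- "record from b"; close(b) }()
        for v := range merge(a, b) {
            fmt.Println(v)
        }
    }

A pipe-level equivalent would need some agreed-upon framing (preserved
write boundaries, length prefixes, or a self-delimiting format) to get
the same guarantee.
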
Forsyth's link looks interesting; I haven't read through the paper in
detail yet, but it reminded me of LabVIEW in a way (where
non-programmers wire together data flows using boxes and arrows and
such).

>> > i suppose i'm stepping close to sawzall now.
>>
>> Actually, I think you're stepping closer to the reducers stuff Rich
>> Hickey has done recently in Clojure, though there's admittedly a lot
>> of overlap with the sawzall way of looking at things.
>
> my knowledge of both is weak. :-)

The Clojure reducers stuff is kind of slick. Consider a simple
reduction in Lisp; say, summing up a list of numbers. In Common Lisp,
we may write this as:

    (reduce #'+ '(1 2 3 4 5))

In Clojure, the same thing would be written as:

    (reduce + [1 2 3 4 5])

The question is how the computation is performed. To illustrate, here's
a simple definition of 'reduce' written in Scheme (R5RS doesn't have a
standard 'reduce' function, but it is most commonly written to take an
initial element, so I do that here):

    (define (reduce binop a bs)
      (if (null? bs)
          a
          (reduce binop (binop a (car bs)) (cdr bs))))

Notice how the reduction is strictly sequential: the number of
reduction steps is linear in the length of the list, and each step
depends on the result of the one before it. But if one thinks about
what I'm doing here (just addition of simple numbers), there's no
reason this can't be done in parallel. In particular, if I can split
the list into evenly sized parts and recurse, I can limit the depth of
the computation to O(lg n). Something more like:

    (define (reduce binop a bs)
      (cond ((null? bs) a)
            ((null? (cdr bs)) (binop a (car bs)))
            (else
             (let ((halves (split-into-halves bs)))
               (binop (reduce binop a (car halves))
                      (reduce binop a (cadr halves)))))))

If I can exploit parallelism to execute the calls in the recursion tree
simultaneously, I can really cut down on execution time. The
requirement is that binop, together with 'a' and the elements of 'bs',
forms a monoid; that is, binop is associative over the set from which
'a' and the elements of 'bs' are drawn, and 'a' is an identity element.

This sounds wonderful, of course, but in Lisp and Scheme, lists are
built from cons cells, and even if I have some magic
'split-into-halves' function that satisfies the requirements of reduce,
splitting a list in half is still necessarily a linear-time operation,
so I don't gain much. Besides, having to pass around the identity all
the time is a bummer.

But in Clojure, the Lisp concept of a list (composed of cons cells) is
generalized into the concept of a 'seq'. A seq is just a sequence of
things; it could be a list, a vector, some other container (say, a
sequence of key/value pairs derived from some kind of associative
structure), or a stream of data being read from a file or network
connection.

What's the *real* problem here? The issue is that reduce "knows" too
much about the things it is reducing over. Doing things sequentially is
easy, but slow; doing things in parallel requires that reduce know a
lot about the type of thing it's reducing over (e.g., this magic
'split-into-halves' function), and that might not be appropriate for
*all* sequence types, e.g., files or lists made from cons cells.

The insight of the reducers framework is that one can just ask the
container to reduce itself. Basically, pass it a function and say,
"here, reduce yourself with this function however you see fit." Then
random-access containers can do things in parallel, lists and files and
the like can do things sequentially, associative containers can do
whatever they want, and so on.
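
Here is a rough sketch of the "reduce yourself" idea in Go (this is not
Clojure's implementation or API; the types and names are invented for
illustration). The container implements its own reduce, so a
random-access container can halve itself and evaluate the halves
concurrently, while a stream just folds sequentially:

    package main

    import "fmt"

    // Reducible is a container that knows how to reduce itself with a
    // binary operation and its identity element.
    type Reducible interface {
        Reduce(binop func(a, b int) int, id int) int
    }

    // Slice has random access, so it reduces itself by halving; the
    // two halves are evaluated concurrently.
    type Slice []int

    func (s Slice) Reduce(binop func(a, b int) int, id int) int {
        switch len(s) {
        case 0:
            return id
        case 1:
            return binop(id, s[0])
        }
        mid := len(s) / 2
        left := make(chan int)
        go func() { left <- s[:mid].Reduce(binop, id) }()
        right := s[mid:].Reduce(binop, id)
        return binop(<-left, right)
    }

    // Stream has no random access, so it reduces itself sequentially.
    type Stream <-chan int

    func (st Stream) Reduce(binop func(a, b int) int, id int) int {
        acc := id
        for v := range st {
            acc = binop(acc, v)
        }
        return acc
    }

    func main() {
        add := func(a, b int) int { return a + b }
        fmt.Println(Slice{1, 2, 3, 4, 5}.Reduce(add, 0)) // 15

        ch := make(chan int)
        go func() {
            for _, v := range []int{1, 2, 3, 4, 5} {
                ch <- v
            }
            close(ch)
        }()
        fmt.Println(Stream(ch).Reduce(add, 0)) // 15
    }

The caller only supplies the binary operation and its identity; how the
reduction is carried out is entirely the container's business.
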
The implementation is kind of interesting; more information is here:
http://clojure.com/blog/2012/05/08/reducers-a-library-and-model-for-collection-processing.html

Your example of running multiple 'grep's in parallel sort of reminded
me of this, though it occurs to me that this can probably be done with
a command: a sort of 'parallel apply' thing that can run a command
multiple times concurrently, each invocation on a range of the
arguments. But making it simple and elegant is likely to be tricky.

        - Dan C.
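
P.S. Just to make the 'parallel apply' idea concrete, here's a rough
sketch in Go (the name 'papply', its flag, and its behavior are all
invented; it isn't an existing tool). It splits the argument list into
chunks and runs one instance of the command per chunk, concurrently.
Note that the children all write to the same stdout, so their output
interleaves at write granularity, which is the muxing problem from
earlier all over again:

    // papply: run a command concurrently over chunks of its arguments.
    // papply -n 100 cmd arg1 ... argN runs 'cmd' once per 100
    // arguments, with all of the invocations going at the same time.
    package main

    import (
        "flag"
        "log"
        "os"
        "os/exec"
        "sync"
    )

    func main() {
        n := flag.Int("n", 10, "maximum arguments per invocation")
        flag.Parse()
        args := flag.Args()
        if len(args) < 2 {
            log.Fatal("usage: papply [-n max] cmd args ...")
        }
        cmd, rest := args[0], args[1:]

        var wg sync.WaitGroup
        for len(rest) > 0 {
            chunk := rest
            if len(chunk) > *n {
                chunk = rest[:*n]
            }
            rest = rest[len(chunk):]

            wg.Add(1)
            go func(chunk []string) {
                defer wg.Done()
                // Children share stdout/stderr, so their output
                // interleaves at write granularity.
                c := exec.Command(cmd, chunk...)
                c.Stdout = os.Stdout
                c.Stderr = os.Stderr
                if err := c.Run(); err != nil {
                    log.Print(err)
                }
            }(chunk)
        }
        wg.Wait()
    }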