From: Paul Winalski <paul.winalski@gmail.com>
To: TUHS main list <tuhs@minnie.tuhs.org>
Subject: [TUHS] Pipes and PRISM
Date: Tue, 1 Mar 2022 11:56:37 -0500
Message-ID: <CABH=_VRc29kPrUKUnLyT0P5LUmE0maokp_G9oF9TCNMEHqLa+g@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 3002 bytes --]
Last week there was a bit of discussion about the different shells
that would eventually lead to srb writing the shell that took his name
and the command syntax and semantics that most modern shells use
today. Some of you may remember, VMS had a command interpreter
called DCL (Digital Command Language), part of an attempt to make
command syntax uniform across DEC's different operating systems
(TOPS-20 also used DCL). As DEC started to recognize the value of the
Unix marketplace, a project was born in DEC's Commercial Languages and
Tools group to bring the Unix Bourne shell to VMS and to sell it as a
product they called DEC Shell.
I had been part of that effort, and one of the issues we had to solve
was providing formal UNIX pipe semantics: DEC Shell of course needed to
implement UNIX-style process pipelines somehow. VMS from the
beginning has had an interprocess communications pseudo-device called
the mailbox that can be written to and read from via the usual I/O
mechanism (the QIO system service). A large problem with them is that
it is not possible to detect the "broken pipe" condition with a
mailbox and that feature deficiency made them unsuitable for use with
DEC Shell. So the team had me write a new device driver, based
closely on the mailbox driver, but that could detect broken pipes
UNIX-style.
Shortly after I finished the VMS pipe driver, the team at DECwest had
started work on the MICA project, which was Dave Cutler's proposed OS
unification. Dave's team had developed a machine architecture called
PRISM (Proposed RISC Machine) to be the VAX follow-on. For forward
compatibility purposes, PRISM would have to support both Ultrix and
VMS. Dave and team had already written a microkernel-based,
lightweight OS for VAX called VAXeln that was intended for real-time
applications. His new idea was to have a MACH-like microkernel OS
which he called MICA and then to put three user mode personality
modules on top of that:
P.VMS, implementing the VMS system services and ABI
P.Ultrix, implementing the Unix system calls and ABI
P.TBD, a new OS API and ABI intended to supersede VMS
So I wrote the attached "why pipes" memo to explain to Cutler's team
why it was important to implement pipes natively in P.TBD if they
wanted that OS to be a viable follow-on to VMS and Ultrix.
In the end, Dick Sites's 64-bit RISC machine architecture proposal,
which was called Alpha, won out over PRISM. Cutler and a bunch of his
DECwest engineering team went off to Microsoft. Dave's idea of a
microkernel-based OS with multiple personalities of course first saw the
light of day as NT OS/2. Because of the multiple-personality design,
when Microsoft and IBM divorced, Dave was able to pivot quickly to the
now-infamous Win32 personality in what would be called Windows NT. It
was also easy for Softway Systems to later
complete the NT POSIX layer for their Interix product, which now a few
generations later is called WSL by Microsoft.
-Paul W.
[-- Attachment #2: why_pipes.txt --]
[-- Type: text/plain, Size: 17845 bytes --]
The prime functional characteristics of pipes on Unix are:
1) they are a pseudo-device that can be created on demand by programs
2) communication is one-way, from one or more write channels to one or
more read channels
3) if either all read channels or all write channels are deassigned,
the other communicating partners are notified via an error condition
(Unix calls this "broken pipe").
The last characteristic, detection of broken pipes, is the key. The restriction
of pipes to one-way communication is required for broken pipe detection.
If there were exactly one reader and one writer, then it would be easy to
detect a broken pipe. If the channel count drops to 1, then the pipe is broken.
This is how DECnet-VAX detects "network partner exited" conditions. However,
because fork(2) can cause the cloning of open I/O channels, this isn't
sufficient in the Unix pipe case. One can have multiple readers and multiple
writers. It therefore is necessary to restrict each individual I/O channel to
be either read-only or write-only. With that restriction in place, the I/O
system can detect "broken pipe"--the condition exists when there are readers
but no writers, or writers but no readers.
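The read-only/write-only rule can be observed directly on a present-day
Unix system. The following is a minimal sketch in Python, assuming a POSIX
host. os.dup() stands in for the descriptor cloning that fork(2) performs,
and the two function names are made up for this illustration. Python
ignores SIGPIPE by default, so a write to a pipe with no readers shows up
as BrokenPipeError (EPIPE) instead of terminating the process.

```python
import os

def demo_reader_sees_eof():
    """A reader gets EOF only after the last write descriptor is closed."""
    r, w = os.pipe()
    w2 = os.dup(w)            # second write channel, as fork(2) would leave
    os.write(w, b"data")
    os.close(w)               # one writer gone, w2 still open: not broken
    first = os.read(r, 4)     # buffered data is still delivered
    os.close(w2)              # last writer gone
    second = os.read(r, 4)    # empty result signals end-of-file
    os.close(r)
    return first, second

def demo_writer_sees_broken_pipe():
    """A writer gets EPIPE only after the last read descriptor is closed."""
    r, w = os.pipe()
    r2 = os.dup(r)            # second read channel
    os.close(r)
    os.write(w, b"ok")        # r2 is still a reader, so this succeeds
    os.close(r2)              # last reader gone
    try:
        os.write(w, b"x")
    except BrokenPipeError:
        broken = True
    else:
        broken = False
    os.close(w)
    return broken
```

In both functions the pipe only counts as broken once every descriptor on
the opposite side is closed, which is why each descriptor has to be
exclusively a reader or a writer for the count to mean anything.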
VMS mailboxes have most of the characteristics required for pipes. There is
a service to create the pseudo-devices from user-mode code. You can have
multiple read and write channels. The only problem is that you cannot detect
the broken pipe condition. One cannot do this because channels assigned to
a mailbox can be used for both reading and writing. There is no way to tell
which is which, so one cannot tell when all readers have gone away or when all
writers have gone away.
This is why I had to write a new driver to support pipes on VMS. The driver
was written from scratch, but follows the design of the mailbox driver very
closely. Here is a functional summary of the VMS pipe driver.
- You create a pipe by assigning a channel to the template device PIPE0:.
You get back a single I/O channel to a newly-cloned pipe pseudo-device.
You can use $GETDVI to find out the name of the new device so that you can
assign other channels to it. The device is created with the characteristics
bits MBX, REC, SHR, IDV, ODV, device class DC$_MAILBOX, device type
DC$_PIPE. RMS treats it as a mailbox.
I opted for dynamic UCB cloning rather than building a $CREPIPE system
service to do the cloning. Since I was not in the VMS group, I could not
add a system service easily. Also, using dynamic cloning is more flexible.
It makes it easy to create pipes from DCL level, for example, whereas it is
nearly impossible to create and use mailboxes from DCL level because there
is no way to get at $CREMBX from command level.
This does mean, though, that one cannot control buffer quota, max. message
size, device protection, or assignment of a logical name as part of the
device creation call. Max. message size (the UCB device buffer size) and
device protection can be set by IO$_SETMODE calls. The pipe driver does not
allow for user control of buffer quota (more about that later, though).
The device protection set upon cloning is (S,O:RWLP,G,W). This is what you
want for communication among child processes in the same job tree, which is
the usual application of pipes.
- When one initially assigns a channel to a pipe, it is "untyped"--the driver
does not know if it is a read or a write channel. The first I/O operation
to the channel determines which type of channel it is. If the first
operation is IO$_READxBLK (virtual, logical, and physical are all the same
for pipes) or IO$_SETMODE!IO$M_WRTATTN, then the channel is a read-only
channel. If the first operation is IO$_WRITExBLK or IO$_SETMODE!IO$M_READATTN
then the channel becomes a write-only channel. Once the type of a channel
has been set, attempts to do the opposite operation (a write to a read
channel or a read to a write channel) generate SS$_ILLIOFUNC errors.
Sometimes it is desirable to declare the type of a channel before you
actually do any I/O to it. The I/O functions IO$_SETMODE!IO$M_READCHAN and
IO$_SETMODE!IO$M_WRITECHAN exist for this purpose.
- I/O to a pipe is done via the usual IO$_READxBLK, IO$_WRITExBLK, and
IO$_WRITEOF function codes. These behave the same way that they do with
mailboxes, except that there is no IO$M_NOW modifier (see the next item).
The IO$M_READATTN and IO$M_WRTATTN modifiers to IO$_SETMODE are available
and function identically with the mailbox driver. There is a IO$M_STREAM
modifier to IO$_READxBLK and IO$_WRITExBLK that implements true Unix-style
stream-mode I/O (this is the only device in VMS to do this on the $QIO
level, in fact). The default is record mode. More about stream mode later.
- BUFFERING: Both mailboxes and pipes buffer all in-transit records in non-
paged pool. The drivers differ in the quota bookkeeping, however.
The mailbox driver grabs a fixed amount of pool quota from the creator of
the device at $CREMBX time. The buffer quota is set in the
$CREMBX call or defaulted from a SYSGEN parameter. When a program writes to
a mailbox, the data are copied into non-paged pool and the buffer quota is
decremented accordingly. If the quota reaches zero, then the writing process
is put into RWMBX state (if system service resource wait mode is on), or
the write fails with SS$_MBFULL status (if system service resource wait mode
is off). RWMBX state is quite nasty because it prevents the process from
being deleted until and unless somebody empties the mailbox enough to let
the write operation complete. However, it does mean that writes to a
mailbox never use up process BYTLM quota (the quota having been previously
deducted from the creator of the mailbox).
I asked several old-time VMS developers and was never able to get a
satisfactory explanation or rationalization for the existence of RWMBX state
in the system. It seems to cause more harm than good, so I left it out of
the pipe driver design. Pipes have a fixed 4096-byte lien on non-paged
pool. When a process writes to a pipe, the driver allocates a buffer of the
appropriate size from non-paged pool and moves the data into it. If the
4096-byte lien has not been fully used up, then the driver does not deduct
any process BYTLM quota from the writer. If the 4096-byte lien has been
used up, then the pipe driver deducts the difference from process BYTLM
quota--in other words, it is like any run-of-the-mill buffered I/O
operation. There is no use of RWMBX state for buffer quota control.
- I/O COMPLETION: The normal operation of the mailbox driver is for write
operations not to complete until the record has been read from the mailbox.
One must specify IO$M_NOW if one requires the operation to complete as soon
as the data are moved into the mailbox buffer in non-paged pool. Likewise,
the mailbox driver provides a IO$M_NOW function for reads to allow a reader
to poll the mailbox for the presence of messages.
The pipe driver does not provide a IO$M_NOW function. A write operation
always completes immediately if the message is covered by the 4096-byte
lien on non-paged pool. If user process BYTLM was required to cover all or
part of the message, then the write operation does not complete until all
of the message is covered by the lien, or until the message is read from
the pipe. Thus, a write of a message longer than 4096 bytes will not
complete until all but 4096 bytes of it have been read (if mixed record
and stream mode I/O is being done, it is possible for messages to be
partially read--more about this below). Reads never complete until
something has been read.
I could have implemented IO$M_NOW, but I chose not to do so. The chosen
implementation offers proper quota management (without RWMBX state), and
allows for asynchronous operation of the readers and writers in the default
I/O case (e.g., RMS I/O), something that doesn't happen with mailboxes.
A writer doing RMS I/O to a mailbox stalls until the message is read. With
pipes, he does not stall until he gets over 4096 bytes ahead of the reader.
The pipe driver design allows efficient processing overlap in pipelines
without the need for any special programming.
- MAILBOX MODE VS. ULTRIX MODE: In a Unix-style pipeline, several images may
be run in succession with their output going down the pipe. RMS thinks the
pipe is a mailbox, and so it writes an EOF record to the pipe every time one
of the images closes its file on the pipe. In this case, we want to ignore
the EOF records and treat the breakage of the pipe at the end of the whole
I/O sequence as the EOF condition. However, in "normal" VMS usage, one
wants to pass EOF records just as the mailbox driver does.
I invented the concept of "Ultrix mode" versus "mailbox mode" to handle this.
Pipes are created in mailbox mode. The $QIO call IO$_SETMODE!IO$M_ULTRIX places
the pipe device in Ultrix mode. IO$_SETMODE!IO$M_MAILBOX will put it back.
There are two differences between the modes:
a) In mailbox mode, a IO$_WRITEOF operation puts a EOF record in
the pipe. In Ultrix mode, IO$_WRITEOF completes successfully but
is a no-op.
b) In mailbox mode, reads to a broken pipe terminate with SS$_LINKDISCON
status. In Ultrix mode, reads to a broken pipe terminate with
SS$_ENDOFFILE status.
- BROKEN PIPE NOTIFICATION: The pipe driver keeps counts of the numbers of
read and write channels assigned, and two additional state bits: readers-
have-existed and writers-have-existed. The readers-have-existed bit is
set when the first read channel is assigned. The writers-have-existed bit is
set when the first write channel is assigned. A broken pipe condition
exists whenever:
a) a write operation is pending on the pipe, readers-have-existed is
set, but the current count of read channels is zero
b) a read operation is pending on the pipe, the pipe is empty,
writers-have-existed is set, but the current count of write
channels is zero
The two "have-existed" bits exist to coordinate startup of the pipe
communication. Without those bits, there would be a race condition between
the first write to the pipe and the first read to the pipe. It is not an
error to be writing to a pipe that has no readers and has never had readers,
or to be reading from a pipe that has no writers and has never had writers.
There is the potential for a hang condition here if, for example, the
reader process dies before it ever gets a chance to open its channel to the
pipe. The same potential exists on Unix. In practice, it is not a
problem, especially since writes to a pipe cannot put you in a resource-wait
state from which there is no exit (RWMBX state).
If all writers have exited, readers can continue to read from the pipe
without error until they have emptied it. This is necessary so that writers
don't have to wait around for all of their data to be read.
Reads to a broken pipe complete with SS$_ENDOFFILE status if the pipe has
been set in Ultrix mode, or with SS$_LINKDISCON (network partner
disconnected logical link) status if the pipe is in mailbox mode (the
default).
Writes to a broken pipe complete with SS$_LINKDISCON status regardless of
mode.
- STREAM MODE: The pipe driver provides a modifier, IO$M_STREAM, for both
IO$_READxBLK and IO$_WRITExBLK. The presence of the modifier indicates
that the I/O operation is to be performed in stream mode rather than in
record mode.
A stream mode read operation always reads the requested number of bytes
from the pipe. It ignores record boundaries. For example, if there are
three 10-byte records in the pipe, and a $QIO specifies a 15-byte stream
mode read, the first record and the first 5 bytes of the second record
will be read and put in the user's buffer as one chunk of data. That
will leave the remaining 5 bytes of record 2 and all of record 3 in the
pipe. A subsequent $QIO read in record mode will read the 5 bytes of
record 2.
There is one case where a stream mode read doesn't read exactly the number
of bytes that the user specified. That is if end-of-pipe is detected.
End-of-pipe is either a EOF record in the pipe, or a broken pipe. In both
of these cases, the read in progress terminates with a short byte count.
The next read issued picks up the EOF or LINKDISCON condition. This is
exactly the Unix semantics for reads from pipes.
A stream mode write operation is the same as a record mode write operation
except that the write doesn't imply a record boundary. For example, suppose
there are three $QIOs specifying stream mode write, each for 5 bytes,
followed by two record mode writes, each for 10 bytes. A record mode reader
will see two records. The first is 25 bytes long, the second is 10 bytes
long.
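The two stream mode examples above can be checked against a small model.
This is a sketch in Python, not the driver's actual data structures: the
class name PipeModel and its methods are hypothetical, a record mode read
of an unfinished record raises BlockingIOError where a real read would
wait, and a stream mode read returns whatever is buffered where a real
read would wait for more data or for end-of-pipe.

```python
from collections import deque

class PipeModel:
    """Toy model of record and stream mode I/O on one pipe."""

    def __init__(self):
        # Each entry is [data, complete]; complete is True once a record
        # mode write has closed the record with a boundary.
        self._records = deque()

    def write(self, data, stream=False):
        if self._records and not self._records[-1][1]:
            self._records[-1][0].extend(data)      # continue the open record
        else:
            self._records.append([bytearray(data), False])
        if not stream:
            self._records[-1][1] = True            # record write ends the record

    def read_record(self):
        if not self._records or not self._records[0][1]:
            raise BlockingIOError("no complete record; a real read would wait")
        data, _ = self._records.popleft()
        return bytes(data)

    def read_stream(self, nbytes):
        out = bytearray()
        while self._records and len(out) < nbytes:
            data, complete = self._records[0]
            take = nbytes - len(out)
            out += data[:take]
            del data[:take]                        # consume what was copied
            if data:
                break                              # request satisfied mid-record
            if not complete:
                break                              # open record drained
            self._records.popleft()                # finished record fully read
        return bytes(out)
```

With three 10-byte record writes, read_stream(15) returns 15 bytes and the
next read_record() returns the 5 bytes left in the second record. With
three 5-byte stream writes followed by two 10-byte record writes,
read_record() returns 25 bytes and then 10 bytes.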
I will send you a copy of the complete pipe driver specification so you can
see how this looks in toto.
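As a recap of the channel typing and broken pipe notification rules in the
summary above, here is a toy model in Python. It is a sketch under stated
assumptions, not the driver's code: the class and method names are
invented, the three exception classes merely stand in for the VMS status
codes, and a read that would remain pending in the real driver returns
None here.

```python
class IllegalIOFunction(Exception):
    """Stands in for SS$_ILLIOFUNC."""

class LinkDisconnected(Exception):
    """Stands in for SS$_LINKDISCON."""

class EndOfFile(Exception):
    """Stands in for SS$_ENDOFFILE."""

class Pipe:
    def __init__(self):
        self.records = []                          # queued records, oldest first
        self.count = {"read": 0, "write": 0}       # currently assigned channels
        self.have_existed = {"read": False, "write": False}
        self.ultrix_mode = False                   # mailbox mode is the default

class Channel:
    def __init__(self, pipe):
        self.pipe = pipe
        self.kind = None                           # untyped until first operation

    def _require(self, kind):
        if self.kind is None:                      # first operation sets the type
            self.kind = kind
            self.pipe.count[kind] += 1
            self.pipe.have_existed[kind] = True
        elif self.kind != kind:
            raise IllegalIOFunction(kind)

    def write(self, record):
        self._require("write")
        p = self.pipe
        if p.have_existed["read"] and p.count["read"] == 0:
            raise LinkDisconnected("no readers left")
        p.records.append(record)

    def read(self):
        self._require("read")
        p = self.pipe
        if p.records:                              # drain before reporting breakage
            return p.records.pop(0)
        if p.have_existed["write"] and p.count["write"] == 0:
            raise EndOfFile() if p.ultrix_mode else LinkDisconnected()
        return None                                # a real read would stay pending

    def deassign(self):
        if self.kind in ("read", "write"):
            self.pipe.count[self.kind] -= 1
        self.kind = "deassigned"
```

A write to a pipe that has never had a reader succeeds, which is the
startup case the two "have-existed" bits are there to allow. Once a reader
has existed and the reader count is back to zero, the same write fails.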
IMPLICATIONS FOR P.TBD INTERPROCESS COMMUNICATION
Clearly P.Ultrix requires Unix-compatible pipes. The driver design outlined
above accomplishes this in a way that is compatible with simultaneous use by
a record-oriented access method such as RMS. IO$M_STREAM probably isn't
necessary: the Ultrix read(2) and write(2) facilities, which are what present
a stream mode interface to programs, have to deal with block-oriented devices
such as disks and tapes anyway, so they are capable of doing the necessary
record blocking and deblocking to make anything appear to be stream-oriented
regardless of its underlying record-oriented characteristics. One thing
that you get for free with the VMS pipe driver design is that pipes have
names. "Named pipes" are a relatively recent innovation in the Unix world
and are all the rage these days.
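The deblocking argument can be made concrete with a short sketch in
Python. It assumes a record source exposed as a callable that returns one
record per call and None at end of data; the class name and that interface
are hypothetical. Note that Unix read(2) may return fewer bytes than
requested as soon as some data is available, whereas this sketch keeps
collecting records until the request is filled or the source ends.

```python
class StreamOverRecords:
    """Present byte-stream reads on top of a record-at-a-time source."""

    def __init__(self, read_record):
        self._read_record = read_record    # returns bytes, or None at end
        self._pending = b""                # bytes deblocked but not yet consumed

    def read(self, nbytes):
        # Pull whole records until the request can be satisfied or the
        # source is exhausted.
        while len(self._pending) < nbytes:
            record = self._read_record()
            if record is None:
                break
            self._pending += record
        out = self._pending[:nbytes]
        self._pending = self._pending[nbytes:]
        return out
```

For example, with a source that yields three 10-byte records, two calls to
read(15) return 15 bytes each and a third call returns an empty result.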
It is not as clear that P.VMS needs VAX/VMS-compatible mailboxes. In the
vast majority of cases, channels assigned to mailboxes are used either
exclusively for reading or exclusively for writing, and therefore pipes would
suffice. In fact, use of pipes in place of mailboxes would relieve
implementors of all the defensive programming you need with mailboxes to get
around the fact that with a mailbox there's no way to tell that the other
end of the communications link has gone away. To be conservative, though,
it's probably a good idea to provide a mailbox-compatible facility.
I think that all of the needs in this space could be addressed by a single
I/O facility for interprocess communication. The desirable characteristics
are:
1) The pseudo-device object is created by the P.TBD equivalent of dynamic
UCB cloning on VAX/VMS. That is, the object is created when an I/O
channel is assigned to a template object. It should be possible to get
one of these devices without doing an explicit system service call such
as SYS$CREMBX or pipe(2). Using dynamic UCB cloning allows the
pseudo-devices to be created from command level without any special
support (such as a lexical function).
2) The object has two major modes of operation: mailbox mode and pipe mode.
Upon creation, the object is in pipe mode. The P.TBD equivalent of
a IO$_SETMODE $QIO operation switches the device between mailbox mode
and pipe mode.
a) In mailbox mode, the device acts like a VAX/VMS mailbox. Channels can
be used for either reading or writing. You get no "broken pipe"
notification. Operations stall unless a IO$M_NOW modifier is present.
b) In pipe mode, the device acts like a Unix pipe. Channels can be
used only for reading or writing, but not both. The first operation
to a channel determines its type (read-only or write-only). Type
can be explicitly declared via a IO$_SETMODE-like call. Broken pipe
notification semantics are as for the VAX/VMS pipe driver when
operating in Ultrix mode. IO$_WRITEOF is a no-op. IO$M_NOW is
ignored--the device always behaves like the VAX/VMS pipe driver
as regards stalling.
c) The equivalent of IO$M_READATTN and IO$M_WRTATTN routines should
operate the way they do for VAX/VMS mailboxes regardless of mode.
These two operations set the type (read-only or write-only) of the I/O channel.
3) The VMS compatibility library would supply a SYS$CREMBX call that would
assign a channel to the pseudo-device template object, put the object in
mailbox mode, do IO$_SETMODE-equivalent calls to set the protection and
buffer size characteristics the way the user wanted them, then return
the assigned channel to the caller.
4) The P.Ultrix library would supply a pipe(2) call that would assign a
channel to the pseudo-device template object, assign another channel to
the cloned pipe object, then do a IO$_SETMODE!IO$M_READCHAN-equivalent
I/O call to set the first channel read-only, and a IO$M_WRITECHAN-
equivalent call to set the second channel write-only. Then it would
return both channels to the caller.
5) No equivalent of the RWMBX resource wait state should be provided.
If a pipe or mailbox fills, the current I/O operation merely should be
stalled.
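Items 1, 3, and 4 above can be sketched as a toy model in Python. All of
the names here are hypothetical: assign() stands in for channel assignment
with template cloning, the "PIPEn:" unit naming is an assumption, and
reading the device name off the channel stands in for a $GETDVI-style
lookup. Protection and buffer size settings are left out.

```python
import itertools

TEMPLATE = "PIPE0:"                  # template device name, as in the VMS driver

class Device:
    def __init__(self, name):
        self.name = name
        self.mode = "pipe"           # item 2: objects start out in pipe mode

class Channel:
    def __init__(self, device):
        self.device = device
        self.kind = None             # untyped until declared or first used

_unit_numbers = itertools.count(1)   # assumed naming scheme for cloned units
_devices = {}

def assign(name):
    """Assigning to the template clones a new device; other names look one up."""
    if name == TEMPLATE:
        device = Device(f"PIPE{next(_unit_numbers)}:")
        _devices[device.name] = device
    else:
        device = _devices[name]
    return Channel(device)

def pipe():
    """pipe(2) as item 4 describes it: two typed channels to one new device."""
    read_end = assign(TEMPLATE)
    write_end = assign(read_end.device.name)
    read_end.kind = "read"           # READCHAN-equivalent call
    write_end.kind = "write"         # WRITECHAN-equivalent call
    return read_end, write_end

def crembx():
    """SYS$CREMBX as item 3 describes it: one channel, device in mailbox mode."""
    channel = assign(TEMPLATE)
    channel.device.mode = "mailbox"
    return channel
```

Both library calls reduce to the same primitive, assigning a channel to
the template object, which is the point of providing a single facility.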
I hope all this stuff is helpful. If there are any specific questions about
pipes or mailboxes that I can answer, just ask.
--PSW