From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mike Haertel <mike@ducky.net>
Message-Id: <200101300921.f0U9LpW00742@ducky.net>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] 9P2000
Cc: mike@ducky.net, rob@plan9.bell-labs.com
In-Reply-To: <20010127215828.D7A06199D5@mail.cse.psu.edu>
Date: Tue, 30 Jan 2001 01:21:51 -0800
Topicbox-Message-UUID: 5468ddac-eac9-11e9-9e20-41e7f4b1d025

Here are some reactions.

They mostly boil down to suggested changes that will make the
specification of the protocol both simpler and more bulletproof.

Here is a concise summary of the proposed protocol changes:

	* restrict allowable contents of file, owner, and group names
	  at the protocol level to be equivalent to the restrictions
	  imposed at the Plan 9 kernel level.

	* eliminate the useless special case of ~0 tags.

	* eliminate multiple Tsessions from the protocol; require
	  that each connection begin with exactly one Tversion,
	  exactly one Tsession, and disallow any further occurrences
	  of Tsession and Tversion in the conversation.  then the
	  funny "aborts all transactions" semantics of Tsession can
	  also be eliminated.

	* specify a "minimum maximum" msize that a client can request
	  in such a way that the client can always read() any stat
	  structure that the server might need to return for any
	  possible directory entry.

	* expand time stamps to 64 bits for posterity.

	* forbid attempts in wstat to alter the length of a directory.

	* remove discussion of Plan 9 group leader semantics and
	  other weird stuff from the protocol specification.
	  similarly remove the claim that wstat cannot change file
	  ownership from the specification.  instead say that
	  allowable owner, group, and permission changes are
	  determined at the discretion of whatever security policy
	  the server chooses to implement.  (the discussion of Plan 9
	  group semantics would presumably migrate to the man page
	  for the specific file server.)

	* ensure that walks to .. are reliable by explicitly requiring
	  at the protocol level that the hierarchy is always a strict tree.

	* disallow walks to "" (the zero length name) in addition to the
	  already-disallowed walks to "."

	* in walk operations that fail, newfid should be implicitly clunked
	  unless it was equal to fid.

Long discussion follows...

(There are also a few stylistic comments.)

>     INTRO(5)                                                 INTRO(5)

>               Tversion  size[4] tag[2] msize[4] version[s]
>               Rversion  size[4] tag[2] msize[4] version[s]
>		[...]

Consider changing the format of this table to read:

	size[4] Tversion[1] tag[2] msize[4] version[s]
	size[4] Rversion[1] tag[2] msize[4] version[s]
	[...]

to explicitly show the placement of the type byte in each message.

The old format was appropriate when the type byte was the first
byte of the message, but with the new protocol the table was
confusing until I reread the preceding paragraph that described
the placement of the message type byte after the size[4].  This
proposed format is self documenting at a glance.
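
As a rough illustration of the layout the proposed table describes,
here is a Python sketch that builds a Tversion message with the type
byte placed right after size[4].  Several details are assumptions
rather than facts from the draft: the numeric type value, the
little-endian byte order, the 2-byte length prefix on strings, and
size[4] counting itself.  The names pack_string and pack_tversion
are made up for the example.

```python
import struct

# Assumption: the draft quoted above does not list message type
# numbers, so 100 is only an illustrative value for Tversion.
TVERSION = 100

def pack_string(s):
    """Encode a string as a 2-byte little-endian length plus UTF-8 bytes.
    (Assumed encoding for the [s] fields.)"""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def pack_tversion(tag, msize, version):
    """Build size[4] Tversion[1] tag[2] msize[4] version[s]."""
    body = struct.pack("<BHI", TVERSION, tag, msize) + pack_string(version)
    # Assumption: size[4] counts the whole message, itself included.
    return struct.pack("<I", 4 + len(body)) + body

msg = pack_tversion(tag=0, msize=8192, version="9P2000")
print(len(msg), msg[4])   # total length, then the type byte at offset 4
```

Reading the fields off in table order is exactly the order of the
struct.pack calls, which is the point of the proposed format.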

By the way, one thing I really like about the new encoding is that
emulating the old "fcall" streams module becomes trivial.

>[...]
>          (Systems may choose
>          to reduce the set of legal characters to reduce syntactic
>          problems, for example to remove slashes from name compo-
>          nents, but the protocol has no such restriction.  Plan 9
>          names may contain any printable character (that is, any
>          character outside hexadecimal 00-1F and 80-9F) except
>          slash.)

I think it is a huge mistake to say "the protocol has no
such restriction".

One of the big problems with Unix was that you could have nearly
arbitrary characters in filenames, but a lot of programs (notably
things like cpio, xargs, and the shell itself) did not take this
possibility seriously.  It takes a lot more experience than it
should to write reliable scripts for dealing with files in Unix.

Admittedly rc has much cleaner quoting than the Bourne shell, and
Plan 9 has helped by outlawing newlines in file names.  However,
why reincarnate the same problem in a different guise?  If 9P2000
servers can export arbitrary strings as file name components, but
the clients (e.g. the Plan 9 kernel) can't handle some of those
strings, then it will be impossible to write reliable client
programs.  Consider u9fs.  Since Unix allows nearly arbitrary file
names, it is quite easy now to create Unix files that you can't access
from a Plan 9 client.  If the protocol explicitly forbids funny
characters, then u9fs will have to be fixed to map those characters
in some way, or it can't claim to be a 9P implementation.  I think
that would be a desirable state of affairs.

Another way of putting this: in theory the protocol has no such
restriction, but in practice it does, and always will; therefore,
why not fix the theory to admit the practical restrictions?
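
A protocol-level check could be as small as the following Python
sketch.  The character ranges come straight from the quoted intro(5)
text (nothing in hexadecimal 00-1F or 80-9F, and no slash).
Rejecting the empty name is an extra assumption of this sketch, and
valid_name is a made-up function name.

```python
def valid_name(name):
    """Return True if name is acceptable as a single file name element."""
    if not name:
        # Assumption of this sketch: the empty name is never legal.
        return False
    for ch in name:
        c = ord(ch)
        # Reject slash and the two control ranges named in intro(5).
        if ch == "/" or c <= 0x1F or 0x80 <= c <= 0x9F:
            return False
    return True

for candidate in ["foo.c", "a/b", "new\nline", "caf\u00e9", ""]:
    print(repr(candidate), valid_name(candidate))
```

A server like u9fs would apply such a test to every name it is
about to export and map or refuse the ones that fail.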

>          An exception is the tag ~0, meaning `no tag': the
>          client can use it, when establishing a connection, to over-
>          ride tag matching in version and session messages.

This is poorly worded.  The Tsession page states that the tag must
be ~0; saying here that the client "can" use it makes it sound
optional.  Also, can a client use a tag of ~0 on a Tversion
transaction that is not the first transaction on a new connection?

This whole ~0 feature is undesirable and adds useless complexity.
Tsession could just as well require a tag of 0, since it flushes
all tags.  It could guarantee that the reply tag is 0.  And there
is no reason that Tversion can't or shouldn't be required to just
have a normal tag.  Then you can completely eliminate special cases
in servers (treatment of ~0 tags) and clients (the need to avoid
accidentally generating ~0 tags).

>          The version message identifies the version of the protocol
>          and indicates the maximum message size the system is pre-
>          pared to handle.  A session request initializes a connection
>          and aborts all outstanding I/O on the connection.  The set
>          of messages between session requests is called a session.

I've always wondered why the session transaction has these abort
semantics.  It is easy to see the exchange of authentication data
as a justification for a required Tsession style message at the
beginning of a session, and it is also easy to see different protocol
versions might require different session messages, hence justifying
the existence of Tversion as a separate transaction required before
Tsession.

But I don't understand the point of the abort semantics.  My guess
is that it is intended to support some kind of persistent channel
to a server, analogous to a hard-wired serial port, where there is
no out-of-band concept of channel setup or teardown.  This is unlike
TCP, where connection setup and teardown can be detected independently
of the bytes transmitted on the virtual circuit.  So, for example,
a client reboot could result in a new Tsession on such an imaginary
hardwired connection.

The problem with this idea is that it is insufficient to give
reliable behavior in the face of arbitrary client or server crashes
or misbehavior, since the 9P encoding (and the proposed 9P2000
encoding) provides no easy way to resynchronize with the byte stream
if you somehow lose track of transaction boundaries.
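
To make the resynchronization problem concrete, here is a small
Python sketch of a reader for size-prefixed messages.  It assumes a
4-byte little-endian size that counts itself; the payloads are
arbitrary filler, and frame and split_messages are made-up names.

```python
import struct

def frame(payload):
    """Prefix payload with size[4]; the size is assumed to include itself."""
    return struct.pack("<I", 4 + len(payload)) + payload

def split_messages(stream):
    """Split a byte string into complete size-prefixed messages.
    Returns (messages, leftover_bytes)."""
    msgs = []
    pos = 0
    while len(stream) - pos >= 4:
        (size,) = struct.unpack_from("<I", stream, pos)
        if size < 4 or len(stream) - pos < size:
            break   # bogus size, or the message is not all here yet
        msgs.append(stream[pos:pos + size])
        pos += size
    return msgs, stream[pos:]

good = frame(b"AAAA") + frame(b"BB") + frame(b"CCCCCC")
msgs, rest = split_messages(good)
print(len(msgs), len(rest))          # all three messages recovered

# Lose a single byte and the reader misreads the first size field.
bad_msgs, bad_rest = split_messages(good[1:])
print(len(bad_msgs), len(bad_rest))  # nothing recovered
```

After the one-byte slip the reader takes 00 00 00 41 as a size of
about 1 GB and waits for data that will never arrive; nothing in the
stream tells it where the next real message starts.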

(Ok, can you tell that I've been implementing SONET recently? :-)

I would like to suggest that the "abort" semantics be removed from
Tsession.  Admit that an underlying transport protocol will always
need to exist, specify that only one Tsession message can ever be
sent during the lifetime of a connection, and specify that outstanding
transactions are aborted when the underlying protocol's connection
is shut down.

If I have misunderstood the point of the "abort" semantics of
Tsession, please explain why it's there.

>          The stat transaction retrieves information about the file.
>          The stat field in the reply includes the file's name, access
>          permissions (read, write and execute for owner, group and
>          public), access and modification times, and owner and group
>          identifications (see stat(2)). The owner and group identifi-
>          cations are textual names.  The wstat transaction allows
>          some of a file's properties to be changed.

Again, I would like to lobby for protocol-imposed restrictions on the
legal contents of owner and group names.

>     DIRECTORIES
>          Directories are created by create with DMDIR set in the per-
>          missions argument (see stat(5)). The members of a directory
>          can be found with read(5). All directories must support
>          walks to the directory .. (dot-dot) meaning parent direc-
>          tory, although by convention directories contain no explicit
>          entry for .. or . (dot).  The parent of the root directory
>          of a server's tree is itself.

If I walk to foo/bar/.. does the protocol require that I return to foo?
I.e. is the file hierarchy required to be strictly a tree?  It looks to me
like another one of those restrictions the protocol should impose for the
sanity of the client: without such a restriction, the "lexical names"
feature in the Plan 9 kernel could get hopelessly confused.

>     CLUNK(5)                                                 CLUNK(5)
>
>          Even if the clunk returns an error, the fid is no longer
>          valid.

What are plausible errors associated with Tclunk (other than
attempting to clunk an invalid fid)?  The only thing I could think
of would be deferred errors associated with earlier transactions
that were not detected until later, like media errors associated
with deferred writes.

I assume the intent of allowing errors on Tclunk is that the error
is passed back as the result of the close() system call?
(Of course, not every Tclunk corresponds to a close()...)

>     ERROR(5)                                                 ERROR(5)
>
>          By convention, clients may truncate error messages after 255
>          bytes, defined as ERRMAX in <libc.h>.

Translation: the server ought to make sure the meat of the error message
fits into the first 255 bytes, otherwise the user of the client might
not see it.

>     READ(5)                                                   READ(5)

>          For directories, read returns an integral number of direc-
>          tory entries exactly as in stat (see stat(5)), one for each
>          member of the directory.  The read request message must have
>          offset equal to zero or the value of offset in the previous
>          read on the directory, plus the number of bytes returned in
>          the previous read.  In other words, seeking other than to
>          the beginning is illegal in a directory (see seek(2)).

What happens if I have a directory entry SomeReallyLongStupidFileNameFromJava
and I attempt to read() fewer than the bytes required for the
associated stat structure?  This could happen two ways: (A) the
client application might have just issued a really small read
request, or (B) the byte count of the directory entry might result
in the required size of the Rread message exceeding the negotiated
maximum transaction size between the 9P client and server.

Scenario (A) can always be handled at the client application level
by executing a seek to the beginning of the directory and rescanning
with a larger buffer.  (Or by just always using a 64K+1 read in the
first place, darnit.)  So scenario (A) is not a serious threat to
the integrity of the underlying protocol design.
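
A toy model of the server side of a directory read may help with
scenario (A).  This Python sketch treats each directory entry as an
opaque byte string standing in for an encoded stat record (the real
layout is not reproduced here) and returns only whole entries, as
read(5) requires.  read_dir is a made-up name.

```python
def read_dir(entries, offset, count):
    """Return as many whole entries as fit in count bytes, starting at
    byte offset, which must fall on an entry boundary."""
    pos = 0
    i = 0
    while i < len(entries) and pos < offset:   # skip what was already read
        pos += len(entries[i])
        i += 1
    if pos != offset:
        raise ValueError("offset is not on an entry boundary")
    out = b""
    while i < len(entries) and len(out) + len(entries[i]) <= count:
        out += entries[i]
        i += 1
    return out

# Stand-ins for three stat records of 60, 300 and 50 bytes.
entries = [b"a" * 60, b"b" * 300, b"c" * 50]

print(len(read_dir(entries, 0, 128)))    # first entry only
print(len(read_dir(entries, 60, 128)))   # the long entry does not fit
print(len(read_dir(entries, 60, 512)))   # a bigger buffer gets the rest
```

The middle call returns zero bytes, which a client normally takes to
mean end of directory, so the third entry is invisible until the
read is retried with a larger count.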

Scenario (B) is bad.  There is no easy way for the client to recover.
Certainly the client application can do nothing about it: the protocol
connection is already established and the msize is fixed in stone.

At the protocol level one hypothetical solution might be for the server
to return some kind of error cookie that:

	1) Indicates there was a really long directory entry.

	2) Returns a new offset that the client can use to read beyond
	   the directory entry that didn't fit.  This is important--we
	   wouldn't want it to be possible to "hide" files behind
	   ReallyLongDirectoryEntries.

Another hypothetical solution: the server could have a notion of
a "truncated stat structure" that returns as much as will fit, plus
the real offset to the next directory entry.

Both of these possibilities are needlessly complex.  Better if
scenario (B) could never happen.

It would be easy for the server to ensure this by preventing such
files from ever being created in the first place -- except for one
tiny hitch: the server cannot exceed the msize that the client
previously requested in Tversion.  So a
client that negotiates a too-small msize can make scenario (B)
possible.

Rather than adding a complex special case response to the server's
repertoire that all clients would have to know about, I'd prefer
to legislate this situation out of existence: add a "minimum maximum"
to Tversion: require that the smallest allowable msize that a client
can request is 64K + some slop, enough to hold an Rread containing
one worst-case stat structure.  Then the need for a way to recover
from scenario (B) is removed from the protocol.

If 64K+slop is unpalatably large, consider specifying a smaller
maximum possible stat record, say 8K-slop, so that the minimum msize
becomes 8K exactly.
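
The arithmetic behind "64K + some slop" can be sketched in a few
lines of Python.  The 65535-byte limit on a stat record is the one
stated in the stat(5) text; the Rread header layout (size[4],
type[1], tag[2], count[4]) is an assumption, since the draft table
is elided above.

```python
STAT_MAX = 65535               # largest stat record, per the stat(5) text
RREAD_HEADER = 4 + 1 + 2 + 4   # size[4] type[1] tag[2] count[4] (assumed)

def min_msize(stat_max=STAT_MAX):
    """Smallest msize whose Rread can carry one worst-case stat record."""
    return RREAD_HEADER + stat_max

def max_stat(msize):
    """Largest stat record that fits in one Rread for a given msize."""
    return msize - RREAD_HEADER

print(min_msize())      # the "minimum maximum" under these assumptions
print(max_stat(8192))   # stat size cap if the minimum msize were 8K
```

Under these assumptions the slop is 11 bytes either way: a floor of
65546 on msize, or a cap of 8181 on stat records for an 8K floor.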

>     STAT(5)                                                   STAT(5)

>          name[ s ]
>               file name; must be / if the file is the root directory
>               of the server

Not to beat on a dead horse, but other than this one exception, *please*
outlaw /'s in file names throughout the protocol.

>          Servers may implement a time-
>          out on the lock on an exclusive use file: if the fid holding
>          the file open has been unused for an extended period (of
>          order at least minutes), it is reasonable to break the lock
>          and deny the initial fid further I/O.

Consider an allowable minimum and a required maximum timeout?  This is
one of those situations where you know that whatever you specify will
be wrong, but it's still better to have a specification so that all
implementations will be broken in exactly the same way.

>          The two time fields are measured in seconds since the epoch
>          (Jan 1 00:00 1970 GMT).  The mtime field reflects the time
>          of the last change of content (except when later changed by
>          wstat).  For a plain file, mtime is the time of the most
>          recent create, open with truncation, or write; for a direc-
>          tory it is the time of the most recent remove, create, or
>          wstat of a file in the directory.  Similarly, the atime
>          field records the last read of the contents; also it is set
>          whenever mtime is set.  In addition, for a directory, it is
>          set by an attach, walk, or create, all whether successful or
>          not.

Consider changing the time fields to 64 bits.  2038 is not so far away.
Also for the benefit of programs like mk it would arguably be desirable
for timestamps to have finer granularity than 1 second in today's world
of very fast computers (although I suppose mk could detect "instantaneous"
commands by looking for changed qid.versions).  Say 1 microsecond?
64 bits offers a lot of room...
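
For a sense of scale, this Python sketch works out when a 32-bit
seconds counter runs out and how far a 64-bit microsecond counter
would reach.  Whether the 9P time fields are treated as signed or
unsigned is an assumption either way, so both cases are shown; the
year length is the Gregorian average.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def from_epoch(seconds):
    """Convert seconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)

last_signed_32 = from_epoch(2**31 - 1)     # if the field is read as signed
last_unsigned_32 = from_epoch(2**32 - 1)   # if it is read as unsigned

SECONDS_PER_YEAR = 365.2425 * 86400
years_in_64_bit_usec = (2**63 / 1_000_000) / SECONDS_PER_YEAR

print(last_signed_32.isoformat())     # 2038-01-19T03:14:07+00:00
print(last_unsigned_32.isoformat())   # 2106-02-07T06:28:15+00:00
print(round(years_in_64_bit_usec))    # on the order of 292 thousand years
```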

>          The wstat request can change some of the file status infor-
>          mation.  [...]  The length can be
>          changed (affecting the actual length of the file) by anyone
>          with write permission on the file.  It is an error to
>          attempt to set the length of a directory to a non-zero
>          value, and servers may decide to reject length changes for
>          other reasons.

Assuming the server does *not* reject truncation of a directory to
length 0, should a client assume that all files under the directory
have been removed? This is another one of those possible complications
that I think should be eliminated by specifying them out of the
protocol: always reject attempts by wstat to change the length of
a directory.

>	   None
>          of the other data can be altered by a wstat.  In particular,
>          there is no way to change the owner of a file.

This is not true in existing implementations: for example, with
"disk/kfscmd allow", I can change file ownership.  Moreover this
is a necessary feature for system administration to ensure that
system files have the right owners.  I would argue that the protocol
allows you to request a change of ownership, and that it is at the
server's discretion whether to allow or reject, according to the
security policy of the server, which should not be considered part
of the protocol.

In fact, I would go a bit further: the whole concept of "group leaders"
is a weird Plan 9 thing that is not true on, say, a Unix based server.
So it should also be at the server's discretion whether to accept
or reject group changes, again according to a security policy that
is considered outside the scope of the protocol.

Changes in ownership, group, or permissions that are refused should
always result in an Rerror.  (Alright, I see you've covered that
later in the "all or nothing" clause...)

(And the discussion of the main Plan 9 file server's security policy
should really be on some other manual pages than the definition of 9P.)

Now at this point I suppose you'll jump on me and argue that here
I am arguing for server-dependent variations in behavior, whereas
above (on file names, owner/group names, and the meaning of ..) I
was arguing for required uniform behavior across all servers.  The
reason is that here I consider implementation-dependent variations
less harmful, since relatively few programs normally want to mess
with file ownership, and those that do have a reasonable expectation
of the operations failing anyway.  In contrast, non-uniform rules
for allowable file, owner, and group names or the meaning of ..
would pervasively break a whole lot of programs, like any script
that wants to parse the output of "ls -l" or expects "cd .." to go
somewhere reliable.

>          Note that since the stat information is sent as a 9P
>          variable-length datum, it is limited to a maximum of 65535
>          bytes.

So what should happen if I use Tcreate to create a file name that
is so long that the stat structure associated with the file would
exceed 64k-1 bytes?

I would argue that the Tcreate man page should explicitly say such
requests must always fail.

>     VERSION(5)                                             VERSION(5)
>
>     NAME
>          version - negotiate protocol version
>
>     SYNOPSIS
>          Tversion size[4] tag[2] msize[4] version[s]
>          Rversion size[4] tag[2] msize[4] version[s]
>
>     DESCRIPTION
>          The version request negotiates the protocol version and mes-
>          sage size to be used on the connection.  Tversion must be
>          the first message sent on the 9P connection, and the client
>          cannot issue any further requests until it has received the
>          Rversion reply.

Can you issue another Tversion later?  I would argue that it should
be explicitly prohibited, even more strongly than I previously argued
that multiple Tsessions should be prohibited.

>          The client suggests a maximum message size, msize, that is
>          the maximum length, in bytes, it will ever generate or
>          expect to receive in a single 9P message.

As previously mentioned, please specify a minimum msize that a client
is allowed to request, and make the largest possible stat record
consistent with the value of this minimal msize.

>     WALK(5)                                                   WALK(5)

Interesting: this subsumes the old "clwalk", and also subsumes the old
"clone" via the subterfuge of zero-element walks.

>          The element ``..''  (dot-dot) represents the parent direc-
>          tory.  The name ``.''  (dot), meaning the current directory,
>          is not used in the protocol.
>
>          It is legal for nwname to be zero, in which case newfid will
>          represent the same file as fid and the walk will usually
>          succeed; this is equivalent to walking to dot.  The rest of
>          this discussion assumes nwname is greater than zero.

Taking these two paragraphs together: when the mnt(3) device sees
the name "foo/./bar", is it expected to generate
walk("foo", "", "bar"), or is it expected to generate walk("foo", "bar")?

I would argue that walk("") should be simply disallowed: if the
mnt(3) device needs to elide walks to ".", it might as well
elide walks to "" too; that way you can eliminate a special
case that would otherwise need to be explicitly coded in all servers.
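
On the client side the elision amounts to one filter.  The Python
sketch below is only an illustration of the idea, not the actual
mnt(3) code, and walk_elements is a made-up name.

```python
def walk_elements(path):
    """Split a slash-separated path into walk elements, dropping
    "." and "" so that neither ever reaches the server."""
    return [e for e in path.split("/") if e not in ("", ".")]

print(walk_elements("foo/./bar"))    # ['foo', 'bar']
print(walk_elements("foo//bar"))     # ['foo', 'bar']
print(walk_elements("a/../b"))       # ['a', '..', 'b']
print(walk_elements("./"))           # []
```

An empty result corresponds to the zero-element walk, which the
quoted text already defines as equivalent to walking to dot.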

>          If the first element cannot be walked for any reason, Rerror
>          is returned.  Otherwise, the walk will return an Rwalk mes-
>          sage containing nqid qids corresponding, in order, to the
>          files that are visited by the nqid successful elementwise
>          walks; nqid is therefore either nwname or the index of the
>          first elementwise walk that failed.  The value of nqid can-
>          not be zero unless nwname is zero.  Also, nqid will always
>          be less than or equal to nwname.  Only if it is equal, how-
>          ever, will newfid be affected, in which case it will repre-
>          sent the file reached by the final elementwise walk
>          requested in the message.

If the walk operation fails, does newfid exist (and point to the
same qid as fid), or is it implicitly clunked?

My suggestion: If the walk fails, newfid should be implicitly
clunked unless it was equal to fid.
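
Here is a toy Python model of a server following that suggestion.
Everything in it is hypothetical: the ToyServer class, the nested
dict standing in for a file tree, and fids stored as paths from the
root.  It also treats "" and "." as illegal elements, as proposed
earlier, and returns only the count of successful steps where a real
Rwalk would carry qids.

```python
class ToyServer:
    def __init__(self, tree):
        self.tree = tree   # nested dicts are directories; None is a file
        self.fids = {}     # fid -> tuple of names from the root

    def attach(self, fid):
        self.fids[fid] = ()

    def _lookup(self, path):
        node = self.tree
        for name in path:
            node = node[name]
        return node

    def walk(self, fid, newfid, names):
        """Walk names from fid; return how many elements succeeded."""
        path = self.fids[fid]
        walked = 0
        for name in names:
            if name in ("", "."):
                break                    # illegal element under the proposal
            if name == "..":
                path = path[:-1]         # the parent of the root is the root
            else:
                node = self._lookup(path)
                if not isinstance(node, dict) or name not in node:
                    break
                path = path + (name,)
            walked += 1
        if walked == len(names):
            self.fids[newfid] = path         # full success: newfid is set
        elif newfid != fid:
            self.fids.pop(newfid, None)      # failure: newfid does not survive
        return walked

srv = ToyServer({"usr": {"mike": {"notes": None}}})
srv.attach(0)
print(srv.walk(0, 1, ["usr", "mike"]), srv.fids.get(1))
print(srv.walk(0, 2, ["usr", "nobody", "x"]), 2 in srv.fids)
print(srv.walk(1, 1, ["missing"]), srv.fids.get(1))
```

With this rule the client never has to guess: after a failed walk,
newfid is gone unless it was fid itself, in which case fid still
refers to the file it referred to before.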