* Re: [9fans] 9P2000
@ 2001-01-31 2:16 rob pike
2001-01-31 8:56 ` Mike Haertel
0 siblings, 1 reply; 10+ messages in thread
From: rob pike @ 2001-01-31 2:16 UTC (permalink / raw)
To: 9fans
Thanks for your thorough reading and commentary. As you can
imagine, there were endless weeks of debate over many of the
issues in the protocol.
* restrict allowable contents of file, owner, and group names
at the protocol level to be equivalent to the restrictions
imposed at the Plan 9 kernel level.
It's unwise to prohibit things in a protocol when you don't know for
certain that they aren't useful. We have changed the legal character
set for file names in Plan 9 several times, but have not yet had to
make any change to the character set for 9P. This is evidence that
we're making prohibitions in the right places.
* eliminate the useless special case of ~0 tags.
Only useless if you don't allow Tsession to reset a connection.
* eliminate multiple Tsessions from the protocol; require
that each connection begin with exactly one Tversion,
exactly one Tsession, and disallow any further occurrences
of Tsession and Tversion in the conversation. then the
funny "aborts all transactions" semantics of Tsession can
also be eliminated.
The requirement for Tsession to reset a connection is there so 9P can
be used on point-to-point wired networks such as the old BIT32 or
other back-to-back bus devices. If every connection to the server
begins as an IP call, I admit Tsession is less useful, but there are
setups for which Tsession provides the necessary reset capability. ~0
is a reserved tag so one may always issue a `reset' this way.
* specify a "minimum maximum" msize that a client can request
in such a way that the client can always read() any stat
structure that the server might need to return for any
possible directory entry.
The issue of going variable-sized in this protocol is by far the most
subtle, difficult, and pervasive issue. There are still rough edges
around all this area. The new kernel does only very preliminary stuff
here. There's no question we need to be careful, but it's all so new
(and hard! dirread was a bear!) I want practice to guide our design.
The basics are in the protocol spec, but lots of details remain to be
filled in.
The particular case you raise here is nasty, but only superficially.
You get an error back on the stat or read; that can happen anyway.
The problem with setting a minimum is that it cascades into other size
issues. How big is the biggest stat? How does the server know a
priori? Etc. Etc. Also, if you set a minimum, it means you can't use
that connection to encapsulate. That feels ugly.
Again, I want to see how this plays out before defining more
explicitly.
* expand time stamps to 64 bits for posterity.
2038 is coming, but the clock actually overflows in 2106. This is
one we talked to death. The earliest drafts had 64-bit clocks but
we backed down for several reasons:
1) Clocks are used everywhere and vlongs are large, slow, and
painful to work with.
2) Too many other fields have doubled in size; we wanted to
keep the overall size of the protocol small. Much of the
purpose of this revision is performance. (For instance,
we kept Qid.version at 32 bits.)
3) All the existing interface software uses 32 bit clocks.
4) 2106 is a long time from now.
5) Mk's clock resolution may be an issue, but mk is enough of
a crock we'd rather force people to think about how to
build software than make everything slower to support
mk.
6) Retrofitting existing timestamps into 64-bit resolution is
a nasty issue for servers.
7) Finally, no design for how the 64-bit clock should be set
up looked good in practice. Nanoseconds? Microseconds?
Milliseconds? If we choose (say) microseconds, what does
it mean to say a file has that time stamp? The bits may
be there but they're meaningless in practice. And whatever
you choose, it's wrong for some future technology if you
depend on the precision of the clock to make critical
decisions, as does mk. Better to face the real issue some
other way and keep times around mostly for humans,
as in ls -l output.
In short, leaving clocks alone, as 32 bits of seconds from 1970.0, is
compatible with every other system out there, including our own.
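The two dates above follow from the width and signedness of the counter: a signed 32-bit count of seconds since 1970 runs out in 2038, an unsigned one in 2106. A small check of that arithmetic, as an illustration only (the function name is made up for this sketch and is not part of any 9P code):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_second(bits: int, signed: bool) -> datetime:
    """Last instant representable by a seconds-since-1970 counter of this width."""
    max_value = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return EPOCH + timedelta(seconds=max_value)

signed_limit = last_second(32, signed=True)     # falls in January 2038
unsigned_limit = last_second(32, signed=False)  # falls in February 2106
```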
* forbid attempts in wstat to alter the length of a directory.
This may make sense, but I hardly think it's worth the time to
specify. And again, that issue about forbidding things too soon...
* remove discussion of Plan 9 group leader semantics and
other weird stuff from the protocol specification.
similarly remove the claim that wstat cannot change file
ownership from the specification. instead say that
allowable owner, group, and permission changes are
determined at the discretion of whatever security policy
the server chooses to implement. (the discussion of Plan 9
group semantics would presumably migrate to the man page
for the specific file server.)
I have some sympathy with this one. The protocol is a peculiar
place to write down all this permission stuff.
* ensure that walks to .. are reliable by explicitly requiring
at the protocol level that the hierarchy is always a strict tree.
The hierarchy is not a strict tree in many of our existing servers.
* disallow walks to "" (the zero length name) in addition to the
already-disallowed walks to "."
Existing practice depends on "" meaning ".", particularly within
file names such as "#e". I'm not sure empty strings go across the wire,
but I'm also not sure they don't. This one may be worth clarifying.
* in walk operations that fail, newfid should be implicitly clunked
unless it was equal to fid.
The manual says that newfid is unaffected in that case, and that
newfid must not be in use. I think this is correct and sufficient.
-rob
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] 9P2000
2001-01-31 2:16 [9fans] 9P2000 rob pike
@ 2001-01-31 8:56 ` Mike Haertel
0 siblings, 0 replies; 10+ messages in thread
From: Mike Haertel @ 2001-01-31 8:56 UTC (permalink / raw)
To: 9fans
> * eliminate multiple Tsessions from the protocol; require
> that each connection begin with exactly one Tversion,
> exactly one Tsession, and disallow any further occurrences
> of Tsession and Tversion in the conversation. then the
> funny "aborts all transactions" semantics of Tsession can
> also be eliminated.
>
>The requirement for Tsession to reset a connection is there so 9P can
>be used on point-to-point wired networks such as the old BIT32 or
>other back-to-back bus devices. If every connection to the server
>begins as an IP call, I admit Tsession is less useful, but there are
>setups for which Tsession provides the necessary reset capability. ~0
>is a reserved tag so one may always issue a `reset' this way.
Ok, that's what I thought. Did you take the time to read my
argument (in the longer discussion) that this is a bogus
consideration?
Here's why: imagine you have a hard-wired connection. Suppose
moreover that your client crashes *in the middle* of sending
a message. Then the client reboots and starts by sending a
new Tversion message, but the server still thinks it is in
the middle of whatever message the client was previously
sending. So the server never resynchronizes with the client,
and the client ends up thinking the server is being stubborn
and never responding, or else sending back garbage.
In order to avoid this scenario you need some kind of markers
that the server can look for once it realizes it has become
desynchronized. One simple approach might be simply to prefix
every 9P message with a particular magic byte that the server
can look for. As long as that magic byte is seen whenever
the server is about to begin reading a new message, it knows
it is (probably) synchronized. If it becomes desynchronized
it can hunt for the magic byte in an attempt to become
resynchronized.
This is the sort of service an underlying transport protocol
provides robustly. You could include this functionality in
9P, but why? The Tsession stuff has to be one of the most
non-robust ways of doing this that I have ever seen. If you've
had no problem with unencapsulated 9P on hard-wired links so
far, it's only because you've been lucky. Better by far to
assume a real underlying transport layer. It could be as
simple as a trivial wrapper that puts delimiter bytes on
messages before sending them on your hardwired connection.
Even something as simple as that will do a better job of crash
recovery than 9P by itself.
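A minimal sketch of such a wrapper, under stated assumptions: it borrows SLIP-style byte stuffing (the END/ESC values are those of SLIP, chosen here for illustration, not anything 9P specifies) so the delimiter never appears inside a message, and it uses the 9P size[4] field, which counts the whole message, to discard a frame that was cut short by a crash. The function names are hypothetical.

```python
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD  # SLIP-style values (assumption)

def frame(msg: bytes) -> bytes:
    """Wrap one 9P message in END delimiters, escaping END/ESC inside it."""
    out = bytearray([END])
    for b in msg:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)

def is_whole(msg: bytes) -> bool:
    """A 9P message starts with size[4], little-endian, counting itself."""
    return len(msg) >= 4 and int.from_bytes(msg[:4], "little") == len(msg)

def deframe(stream: bytes) -> list:
    """Recover whole messages from a byte stream, skipping garbage and partial frames."""
    messages = []
    current = None      # None: not yet synchronized, hunting for a delimiter
    escaped = False
    for b in stream:
        if b == END:
            if current is not None and is_whole(bytes(current)):
                messages.append(bytes(current))
            current = bytearray()   # a delimiter always starts a fresh frame
            escaped = False
        elif current is None:
            continue                # still hunting for the first delimiter
        elif escaped:
            current.append(END if b == ESC_END else ESC if b == ESC_ESC else b)
            escaped = False
        elif b == ESC:
            escaped = True
        else:
            current.append(b)
    return messages
```

If a sender dies mid-message and restarts, the next delimiter closes the truncated frame, the size check rejects it, and the receiver is back in step with the following message.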
* Re: [9fans] 9P2000
2001-01-31 9:31 Russ Cox
@ 2001-01-31 17:46 ` Mike Haertel
0 siblings, 0 replies; 10+ messages in thread
From: Mike Haertel @ 2001-01-31 17:46 UTC (permalink / raw)
To: 9fans
>Aren't there two issues here? One is resynchronizing
>the message stream, so that both sides agree on the
>message boundaries. The other is resynchronizing
>the 9P conversation state, so that both sides agree
>on which tags and fids are in use and what they mean.
Yes.
>Something (an underlying transport protocol, say) needs
>to provide the first capability, but without the second
>you're still hosed. In an IP environment, you can drop
>and redial the connection,
Yes, in an IP environment, the connection gets closed, and the
other end of the conversation detects that *independently of
the 9P byte stream*, when attempting to read or write the connection
returns some kind of "EOF" indication.
>but if you've got a hard-wired
>link, you need an explicit restart within the protocol,
>hence Tsession, no?
Nope. Because we've already established that in a hard-wired
environment 9P cannot reliably be the lowest level protocol.
Therefore, we know we already NEED a lower level below 9P,
just to delimit message boundaries. Why not just make that
lower level also know how to "return EOF"?
Then, to the higher 9P level, the hard wired link would look
*just like* an IP connection. So the higher level would have
only one execution environment to cope with, instead of two
subtly different ones.
Let
A = total_complexity_of(9P + Tsession abort and ~0 tags)
B = total_complexity_of(encapsulation layer that doesn't "return EOF")
C = total_complexity_of(9P with those features removed)
D = total_complexity_of(encapsulation layer that does return EOF)
My argument is simply that
A + B > C + D
But, if you guys aren't comfortable with this, I guess it's
not worth arguing about further.
* Re: [9fans] 9P2000
@ 2001-01-31 9:31 Russ Cox
2001-01-31 17:46 ` Mike Haertel
0 siblings, 1 reply; 10+ messages in thread
From: Russ Cox @ 2001-01-31 9:31 UTC (permalink / raw)
To: 9fans
[I'm just confused trying to follow the argument. Feel free to ignore.]
Aren't there two issues here? One is resynchronizing
the message stream, so that both sides agree on the
message boundaries. The other is resynchronizing
the 9P conversation state, so that both sides agree
on which tags and fids are in use and what they mean.
Something (an underlying transport protocol, say) needs
to provide the first capability, but without the second
you're still hosed. In an IP environment, you can drop
and redial the connection, but if you've got a hard-wired
link, you need an explicit restart within the protocol,
hence Tsession, no?
Russ
* Re: [9fans] 9P2000
@ 2001-01-31 2:18 rob pike
0 siblings, 0 replies; 10+ messages in thread
From: rob pike @ 2001-01-31 2:18 UTC (permalink / raw)
To: 9fans
Consider changing the format of this table to read:
size[4] Tversion[1] tag[2] msize[4] version[s]
Right. This was just a mistake that others have pointed out
and has been fixed.
By the way, one thing I really like about the new encoding is that
emulating the old "fcall" streams module becomes trivial.
There is in fact no fcall any more; the mount driver gets it right.
But this only supports your observation.
-rob
* Re: [9fans] 9P2000
2001-01-30 12:09 rog
@ 2001-01-30 18:04 ` Mike Haertel
0 siblings, 0 replies; 10+ messages in thread
From: Mike Haertel @ 2001-01-30 18:04 UTC (permalink / raw)
To: 9fans; +Cc: rog
>there's one case where the client has to be a bit careful about the
>size of messages it generates. a Twalk message can itself fit inside
>the negotiated msize, but require an Rwalk that will not do so. (e.g.
>walking down several short pathname elements). it might be worth
>requiring a minimum message size of 1+4+2+2+MAXWELEM*13 = 217 which
>would avoid this problem.
This is not nearly as bad as the directory entry situation. A
client that specified a very small msize could be held responsible
for not producing Twalks whose corresponding Rwalks would exceed
the msize; this would be under control of the client and so is
an avoidable situation.
But the client has no control whatever over the size of a directory
entry it is about to read. It is helpless.
* Re: [9fans] 9P2000
@ 2001-01-30 12:09 rog
2001-01-30 18:04 ` Mike Haertel
0 siblings, 1 reply; 10+ messages in thread
From: rog @ 2001-01-30 12:09 UTC (permalink / raw)
To: 9fans
ducky.net!mike wrote:
> (B) the byte count of the directory entry might result
> in the required size of the Rread message exceeding the negotiated
> maximum transaction size between the 9P client and server.
[...]
> Scenario (B) is bad. There is no easy way for the client to recover.
> Certainly the client application can do nothing about it: the protocol
> connection is already established and the msize is fixed in stone.
my understanding was that if a client tried to negotiate a message size
that was too small for the server's maximum filename size, the server
would yield an Rerror "message size too small" or somesuch.
this, perhaps, would be one reason to allow multiple Tversions - if a
Tversion has resulted in an Rerror, then surely it should be possible
to negotiate another version or msize.
there's one case where the client has to be a bit careful about the
size of messages it generates. a Twalk message can itself fit inside
the negotiated msize, but require an Rwalk that will not do so. (e.g.
walking down several short pathname elements). it might be worth
requiring a minimum message size of 1+4+2+2+MAXWELEM*13 = 217 which
would avoid this problem.
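The 217 figure can be checked directly. The sketch below assumes MAXWELEM is 16 (the value the arithmetic above implies) and that a qid occupies 13 bytes; the helper names are invented for illustration.

```python
MAXWELEM = 16   # maximum path elements per walk (implied by the 217 figure)
QIDSZ = 13      # bytes per qid in an Rwalk

def rwalk_size(nwqid: int) -> int:
    """size[4] type[1] tag[2] nwqid[2], then nwqid qids."""
    return 4 + 1 + 2 + 2 + nwqid * QIDSZ

def twalk_size(names) -> int:
    """size[4] type[1] tag[2] fid[4] newfid[4] nwname[2], then each name as len[2] + bytes."""
    return 4 + 1 + 2 + 4 + 4 + 2 + sum(2 + len(n.encode("utf-8")) for n in names)

request = twalk_size(["a"] * MAXWELEM)   # a small request...
reply = rwalk_size(MAXWELEM)             # ...whose full reply is much larger
```

With sixteen one-character names the Twalk is 65 bytes while a complete Rwalk is 217, so a Twalk can fit within a small msize when its reply does not.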
> Assuming the server does *not* reject truncation of a directory to
> length 0, should a client assume that all files under the directory
> have been removed? This is another one of those possible complications
> that I think should be eliminated by specifying them out of the
> protocol: always reject attempts by wstat to change the length of
> a directory.
a directory has a conventional length of 0 anyway, so it would make
sense if setting the length of a directory to zero was a no-op.
> If the walk operation fails, does newfid exist (and point to the
> same qid as fid), or is it implicitly clunked?
and quoted previously:
> > Also, nqid will always
> > be less than or equal to nwname. Only if it is equal, how-
> > ever, will newfid be affected, in which case it will repre-
> > sent the file reached by the final elementwise walk
> > requested in the message.
i.e. if the walk operation fails, newfid is not affected (created or
walked).
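A toy model of that rule, assuming a server that keeps a table from fid to node and represents directories as nested dicts (both assumptions made purely for this sketch): newfid is bound only when every requested element was walked.

```python
def walk(fids: dict, fid: int, newfid: int, names: list) -> list:
    """Walk names from fid; return the nodes reached.

    newfid is bound only if all elements were walked (nwqid == nwname).
    A real server would also return Rerror when the first element fails.
    """
    node = fids[fid]
    reached = []
    for name in names:
        child = node.get(name) if isinstance(node, dict) else None
        if child is None:
            break                   # stop at the first element that cannot be walked
        node = child
        reached.append(node)
    if len(reached) == len(names):
        fids[newfid] = node         # complete walk: newfid now names the final file
    return reached

root = {"usr": {"rob": {"profile": "contents"}}}
fids = {0: root}
ok = walk(fids, 0, 1, ["usr", "rob"])                    # complete: fid 1 is created
partial = walk(fids, 0, 2, ["usr", "nobody", "profile"]) # fails at "nobody": fid 2 untouched
```

A zero-element walk falls out of the same rule: nothing is requested, so the walk is trivially complete and newfid becomes a copy of fid.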
cheers,
rog.
* Re: [9fans] 9P2000
2001-01-27 21:58 rob pike
2001-01-28 16:29 ` Sam Ducksworth
@ 2001-01-30 9:21 ` Mike Haertel
1 sibling, 0 replies; 10+ messages in thread
From: Mike Haertel @ 2001-01-30 9:21 UTC (permalink / raw)
To: 9fans; +Cc: mike, rob
Here are some reactions.
They mostly boil down to suggested changes that will make the
specification of the protocol both simpler and more bulletproof.
Here is a concise summary of the proposed protocol changes:
* restrict allowable contents of file, owner, and group names
at the protocol level to be equivalent to the restrictions
imposed at the Plan 9 kernel level.
* eliminate the useless special case of ~0 tags.
* eliminate multiple Tsessions from the protocol; require
that each connection begin with exactly one Tversion,
exactly one Tsession, and disallow any further occurrences
of Tsession and Tversion in the conversation. then the
funny "aborts all transactions" semantics of Tsession can
also be eliminated.
* specify a "minimum maximum" msize that a client can request
in such a way that the client can always read() any stat
structure that the server might need to return for any
possible directory entry.
* expand time stamps to 64 bits for posterity.
* forbid attempts in wstat to alter the length of a directory.
* remove discussion of Plan 9 group leader semantics and
other weird stuff from the protocol specification.
similarly remove the claim that wstat cannot change file
ownership from the specification. instead say that
allowable owner, group, and permission changes are
determined at the discretion of whatever security policy
the server chooses to implement. (the discussion of Plan 9
group semantics would presumably migrate to the man page
for the specific file server.)
* ensure that walks to .. are reliable by explicitly requiring
at the protocol level that the hierarchy is always a strict tree.
* disallow walks to "" (the zero length name) in addition to the
already-disallowed walks to "."
* in walk operations that fail, newfid should be implicitly clunked
unless it was equal to fid.
Long discussion follows...
(There are also a few stylistic comments.)
> INTRO(5) INTRO(5)
> Tversion size[4] tag[2] msize[4] version[s]
> Rversion size[4] tag[2] msize[4] version[s]
> [...]
Consider changing the format of this table to read:
size[4] Tversion[1] tag[2] msize[4] version[s]
size[4] Rversion[1] tag[2] msize[4] version[s]
[...]
to explicitly show the placement of the type byte in each message.
The old format was appropriate when the type byte was the first
byte of the message, but with the new protocol the table was
confusing until I reread the preceding paragraph that described
the placement of the message type byte after the size[4]. This
proposed format is self documenting at a glance.
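Read as a byte layout, the proposed table line maps straight onto an encoder. The following is a sketch under assumptions: integers are little-endian, size counts the whole message including itself, a string [s] is a two-byte length followed by UTF-8 bytes, and the type value 100 for Tversion is the one used in the final 9P2000 headers (the draft under discussion may have differed). The helper names are hypothetical.

```python
import struct

TVERSION = 100   # type byte, taken from the final 9P2000 definition (assumption for the draft)
NOTAG = 0xFFFF   # the ~0 tag as a 16-bit value

def pack_string(s: str) -> bytes:
    """Encode a 9P string: length[2] followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def pack_tversion(msize: int, version: str, tag: int = NOTAG) -> bytes:
    """Encode size[4] Tversion[1] tag[2] msize[4] version[s]."""
    body = struct.pack("<BHI", TVERSION, tag, msize) + pack_string(version)
    return struct.pack("<I", len(body) + 4) + body   # size includes its own 4 bytes

msg = pack_tversion(8192, "9P2000")
```

The type byte sits at offset 4, right after size[4], which is exactly what the proposed table format makes visible.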
By the way, one thing I really like about the new encoding is that
emulating the old "fcall" streams module becomes trivial.
>[...]
> (Systems may choose
> to reduce the set of legal characters to reduce syntactic
> problems, for example to remove slashes from name compo-
> nents, but the protocol has no such restriction. Plan 9
> names may contain any printable character (that is, any
> character outside hexadecimal 00-1F and 80-9F) except
> slash.)
I think it is a huge mistake to say "the protocol has no
such restriction".
One of the big problems with Unix was that you could have nearly
arbitrary characters in filenames, but a lot of programs (notably
things like cpio, xargs, and the shell itself) did not take this
possibility seriously. It takes a lot more experience than it
should to write reliable scripts for dealing with files in Unix.
Admittedly rc has much cleaner quoting than the Bourne shell, and
Plan 9 has helped by outlawing newlines in file names. However,
why reincarnate the same problem in a different guise? If 9P2000
servers can export arbitrary strings as file name components, but
the clients (e.g. the Plan 9 kernel) can't handle some of those
strings, then it will be impossible to write reliable client
programs. Consider u9fs. Since Unix allows nearly arbitrary file
names, it is quite easy now to create Unix files that you can't access
from a Plan 9 client. If the protocol explicitly forbids funny
characters, then u9fs will have to be fixed to map those characters
in some way, or it can't claim to be a 9P implementation. I think
that would be a desirable state of affairs.
Another way of putting this: in theory the protocol has no such
restriction, but in practice it does, and always will; therefore,
why not fix the theory to admit the practical restrictions?
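For concreteness, here is the quoted Plan 9 rule written as a check that a server such as u9fs could apply before exporting a name. This is only a sketch of the rule as quoted (characters outside hexadecimal 00-1F and 80-9F, and no slash); the function name is invented, and empty names and the special names "." and ".." are deliberately left out of scope.

```python
def is_legal_name(name: str) -> bool:
    """True if every character is allowed by the quoted Plan 9 rule."""
    for ch in name:
        cp = ord(ch)
        if cp <= 0x1F or 0x80 <= cp <= 0x9F:
            return False    # control characters in the two forbidden ranges
        if ch == "/":
            return False    # slash separates path elements
    return True
```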
> An exception is the tag ~0, meaning `no tag': the
> client can use it, when establishing a connection, to over-
> ride tag matching in version and session messages.
This is poorly worded. The Tsession page states that the tag must
be ~0; saying here that the client "can" use it makes it sound
optional. Also, can a client use a tag of ~0 on a Tversion
transaction that is not the first transaction on a new connection?
This whole ~0 feature is undesirable and adds useless complexity.
Tsession could just as well require a tag of 0, since it flushes
all tags. It could guarantee that the reply tag is 0. And there
is no reason that Tversion can't or shouldn't be required to just
have a normal tag. Then you can completely eliminate special cases
in servers (treatment of ~0 tags) and clients (the need to avoid
accidentally generating ~0 tags).
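The client-side special case is small but real. A sketch of a tag allocator that never hands out the reserved value, assuming 16-bit tags so that ~0 is 0xFFFF (the class and method names are hypothetical):

```python
NOTAG = 0xFFFF   # ~0 as a 16-bit tag, reserved

class TagPool:
    """Hands out tags in 0 .. NOTAG-1, never NOTAG itself."""

    def __init__(self):
        self.in_use = set()
        self.next = 0

    def alloc(self) -> int:
        for _ in range(NOTAG):                      # at most NOTAG candidate values
            tag = self.next
            self.next = (self.next + 1) % NOTAG     # wraps to 0 before reaching NOTAG
            if tag not in self.in_use:
                self.in_use.add(tag)
                return tag
        raise RuntimeError("all tags in use")

    def free(self, tag: int) -> None:
        self.in_use.discard(tag)
```

If ~0 were an ordinary tag, the modulus would simply be 0x10000 and the reserved constant would disappear.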
> The version message identifies the version of the protocol
> and indicates the maximum message size the system is pre-
> pared to handle. A session request initializes a connection
> and aborts all outstanding I/O on the connection. The set
> of messages between session requests is called a session.
I've always wondered why the session transaction has these abort
semantics. It is easy to see the exchange of authentication data
as a justification for a required Tsession style message at the
beginning of a session, and it is also easy to see different protocol
versions might require different session messages, hence justifying
the existence of Tversion as a separate transaction required before
Tsession.
But I don't understand the point of the abort semantics. My guess
is that it is intended to support some kind of persistent channel
to a server, analogous to a hard-wired serial port, where there is
no out-of-band concept of channel setup or teardown, unlike TCP,
where connection setup and teardown can be detected independently
of the bytes transmitted on the virtual circuit. So, for example,
a client reboot could result in a new Tsession on such an imaginary
hardwired connection.
The problem with this idea is that it is insufficient to give
reliable behavior in the face of arbitrary client or server crashes
or misbehavior, since the 9P encoding (and the proposed 9P2000
encoding) provides no easy way to resynchronize with the byte stream
if you somehow lose track of transaction boundaries.
(Ok, can you tell that I've been implementing SONET recently? :-)
I would like to suggest that the "abort" semantics be removed from
Tsession. Admit that an underlying transport protocol will always
need to exist, specify that only one Tsession message can ever be
sent during the lifetime of a connection, and specify that outstanding
transactions are aborted when the underlying protocol's connection
is shut down.
If I have misunderstood the point of the "abort" semantics of
Tsession, please explain why it's there.
> The stat transaction retrieves information about the file.
> The stat field in the reply includes the file's name, access
> permissions (read, write and execute for owner, group and
> public), access and modification times, and owner and group
> identifications (see stat(2)). The owner and group identifi-
> cations are textual names. The wstat transaction allows
> some of a file's properties to be changed.
Again, I would like to lobby for protocol-imposed restrictions on the
legal contents of owner and group names.
> DIRECTORIES
> Directories are created by create with DMDIR set in the per-
> missions argument (see stat(5)). The members of a directory
> can be found with read(5). All directories must support
> walks to the directory .. (dot-dot) meaning parent direc-
> tory, although by convention directories contain no explicit
> entry for .. or . (dot). The parent of the root directory
> of a server's tree is itself.
If I walk to foo/bar/.. does the protocol require that I return to bar?
I.e. is the file hierarchy required to be strictly a tree? It looks to me
like another one of those restrictions the protocol should impose for the
sanity of the client: without such a restriction, the "lexical names"
feature in the Plan 9 kernel could get hopelessly confused.
> CLUNK(5) CLUNK(5)
>
> Even if the clunk returns an error, the fid is no longer
> valid.
What are plausible errors associated with Tclunk (other than
attempting to clunk an invalid fid)? The only thing I could think
of would be deferred errors associated with earlier transactions
that were not detected until later, like media errors associated
with deferred writes.
I assume the intent of allowing errors on Tclunk is that the error
returned is returned as the result of the close() system call?
(Of course, not every Tclunk corresponds to a close()...)
> ERROR(5) ERROR(5)
>
> By convention, clients may truncate error messages after 255
> bytes, defined as ERRMAX in <libc.h>.
Translation: the server ought to make sure the meat of the error message
fits into the first 255 bytes, otherwise the user of the client might
not see it.
> READ(5) READ(5)
> For directories, read returns an integral number of direc-
> tory entries exactly as in stat (see stat(5)), one for each
> member of the directory. The read request message must have
> offset equal to zero or the value of offset in the previous
> read on the directory, plus the number of bytes returned in
> the previous read. In other words, seeking other than to
> the beginning is illegal in a directory (see seek(2)).
What happens if I have a directory entry SomeReallyLongStupidFileNameFromJava
and I attempt to read() fewer than the bytes required for the
associated stat structure? This could happen two ways: (A) the
client application might have just issued a really small read
request, or (B) the byte count of the directory entry might result
in the required size of the Rread message exceeding the negotiated
maximum transaction size between the 9P client and server.
Scenario (A) can always be handled at the client application level
by executing a seek to the beginning of the directory and rescanning
with a larger buffer. (Or by just always using a 64K+1 read in the
first place, darnit.) So scenario (A) is not a serious threat to
the integrity of the underlying protocol design.
Scenario (B) is bad. There is no easy way for the client to recover.
Certainly the client application can do nothing about it: the protocol
connection is already established and the msize is fixed in stone.
At the protocol level one hypothetical solution might be for the server
to return some kind of error cookie that:
1) Indicates there was a really long directory entry.
2) Returns a new offset that the client can use to read beyond
the directory entry that didn't fit. This is important--we
wouldn't want it to be possible to "hide" files behind
ReallyLongDirectoryEntries.
Another hypothetical solution: the server could have a notion of
a "truncated stat structure" that returns as much as will fit, plus
the real offset to the next directory entry.
Both of these possibilities are needlessly complex. Better if
scenario (B) could never happen.
It would be easy for the server to ensure this by preventing such
files from ever being created in the first place -- except for one
tiny hitch. That is that the server cannot exceed the client's
requested msize that was previously specified in Tversion. So a
client that negotiates a too-small msize can make scenario (B)
possible.
Rather than adding a complex special case response to the server's
repertoire that all clients would have to know about, I'd prefer
to legislate this situation out of existence: add a "minimum maximum"
to Tversion: require that the smallest allowable msize that a client
can request is 64K + some slop, enough to hold an Rread containing
one worst-case stat structure. Then the need for a way to recover
from scenario (B) is removed from the protocol.
If 64K+slop is unpalatably large, consider specifying a smaller
maximum possible stat record, say 8K-slop, so that the minimum msize
becomes 8K exactly.
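The relationship between msize and the largest directory entry a read can return is simple arithmetic. The sketch below assumes an Rread is laid out as size[4] type[1] tag[2] count[4] followed by the data, so 11 bytes of each message are overhead; the function names are invented for illustration, and "entry length" means the full encoded entry as it appears in the read data.

```python
RREAD_HEADER = 4 + 1 + 2 + 4   # size[4] type[1] tag[2] count[4]

def can_return_entry(msize: int, entry_len: int) -> bool:
    """True if one directory entry of entry_len bytes fits in a single Rread."""
    return entry_len <= msize - RREAD_HEADER

def minimum_msize(max_entry_len: int) -> int:
    """Smallest msize that can always carry one worst-case entry."""
    return max_entry_len + RREAD_HEADER

small = minimum_msize(8192 - RREAD_HEADER)   # an "8K-slop" entry limit gives 8K exactly
large = minimum_msize(65535)                 # a 65535-byte entry needs 64K plus slop
```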
> STAT(5) STAT(5)
> name[ s ]
> file name; must be / if the file is the root directory
> of the server
Not to beat on a dead horse, but other than this one exception, *please*
outlaw /'s in file names throughout the protocol.
> Servers may implement a time-
> out on the lock on an exclusive use file: if the fid holding
> the file open has been unused for an extended period (of
> order at least minutes), it is reasonable to break the lock
> and deny the initial fid further I/O.
Consider an allowable minimum and a required maximum timeout? This is
one of those situations where you know that whatever you specify will
be wrong, but it's still better to have a specification so that all
implementations will be broken in exactly the same way.
> The two time fields are measured in seconds since the epoch
> (Jan 1 00:00 1970 GMT). The mtime field reflects the time
> of the last change of content (except when later changed by
> wstat). For a plain file, mtime is the time of the most
> recent create, open with truncation, or write; for a direc-
> tory it is the time of the most recent remove, create, or
> wstat of a file in the directory. Similarly, the atime
> field records the last read of the contents; also it is set
> whenever mtime is set. In addition, for a directory, it is
> set by an attach, walk, or create, all whether successful or
> not.
Consider changing the time fields to 64 bits. 2038 is not so far away.
Also for the benefit of programs like mk it would arguably be desirable
for timestamps to have finer granularity than 1 second in today's world
of very fast computers (although I suppose mk could detect "instantaneous"
commands by looking for changed qid.versions). Say 1 microsecond?
64 bits offers a lot of room...
> The wstat request can change some of the file status infor-
> mation. [...] The length can be
> changed (affecting the actual length of the file) by anyone
> with write permission on the file. It is an error to
> attempt to set the length of a directory to a non-zero
> value, and servers may decide to reject length changes for
> other reasons.
Assuming the server does *not* reject truncation of a directory to
length 0, should a client assume that all files under the directory
have been removed? This is another one of those possible complications
that I think should be eliminated by specifying them out of the
protocol: always reject attempts by wstat to change the length of
a directory.
> None
> of the other data can be altered by a wstat. In particular,
> there is no way to change the owner of a file.
This is not true in existing implementations: for example, with
"disk/kfscmd allow", I can change file ownership. Moreover this
is a necessary feature for system administration to ensure that
system files have the right owners. I would argue that the protocol
allows you to request a change of ownership, and that it is at the
server's discretion whether to allow or reject, according to the
security policy of the server, which should not be considered part
of the protocol.
In fact, I would go a bit further: the whole concept of "group leaders"
is a weird Plan 9 thing that is not true on, say, a Unix based server.
So it should also be at the server's discretion whether to accept
or reject group changes, again according to a security policy that
is considered outside the scope of the protocol.
Changes in ownership, group, or permissions that are refused should
always result in an Rerror. (Alright, I see you've covered that
later in the "all or nothing" clause...)
(And the discussion of the main Plan 9 file server's security policy
should really be on some other manual pages than the definition of 9P.)
Now at this point I suppose you'll jump on me and argue that here
I am arguing for server-dependent variations in behavior, whereas
above (on file names, owner/group names, and the meaning of ..) I
was arguing for required uniform behavior across all servers. The
reason is that here I consider implementation-dependent variations
less harmful, since relatively few programs normally want to mess
with file ownership, and those that do have a reasonable expectation
of the operations failing anyway. In contrast, non-uniform rules
for allowable file, owner, and group names or the meaning of ..
would pervasively break a whole lot of programs, like any script
that wants to parse the output of "ls -l" or expects "cd .." to go
somewhere reliable.
> Note that since the stat information is sent as a 9P
> variable-length datum, it is limited to a maximum of 65535
> bytes.
So what should happen if I use Tcreate to create a file name that
is so long that the stat structure associated with the file would
exceed 64k-1 bytes?
I would argue that the Tcreate man page should explicitly say such
requests must always fail.
> VERSION(5) VERSION(5)
>
> NAME
> version - negotiate protocol version
>
> SYNOPSIS
> Tversion size[4] tag[2] msize[4] version[s]
> Rversion size[4] tag[2] msize[4] version[s]
>
> DESCRIPTION
> The version request negotiates the protocol version and mes-
> sage size to be used on the connection. Tversion must be
> the first message sent on the 9P connection, and the client
> cannot issue any further requests until it has received the
> Rversion reply.
Can you issue another Tversion later? I would argue that it should
be explicitly prohibited, even more strongly than I previously argued
that multiple Tsessions should be prohibited.
> The client suggests a maximum message size, msize, that is
> the maximum length, in bytes, it will ever generate or
> expect to receive in a single 9P message.
As previously mentioned, please specify a minimum msize that a client
is allowed to request, and make the largest possible stat record
consistent with the value of this minimal msize.
> WALK(5) WALK(5)
Interesting: this subsumes the old "clwalk", and also subsumes the old
"clone" via the subterfuge of zero-element walks.
> The element ``..'' (dot-dot) represents the parent direc-
> tory. The name ``.'' (dot), meaning the current directory,
> is not used in the protocol.
>
> It is legal for nwname to be zero, in which case newfid will
> represent the same file as fid and the walk will usually
> succeed; this is equivalent to walking to dot. The rest of
> this discussion assumes nwname is greater than zero.
Do these two paragraphs taken together mean that when the mnt(3) device
sees the name "foo/./bar", it is expected to generate
walk("foo", "", "bar")? Or is it expected to generate walk("foo", "bar")?
I would argue that walk("") should be simply disallowed: if the
mnt(3) device needs to elide walks to ".", it might as well elide
walks to "" too; that way you can eliminate a special case that
would otherwise need to be explicitly coded in all servers.
> If the first element cannot be walked for any reason, Rerror
> is returned. Otherwise, the walk will return an Rwalk mes-
> sage containing nqid qids corresponding, in order, to the
> files that are visited by the nqid successful elementwise
> walks; nqid is therefore either nwname or the index of the
> first elementwise walk that failed. The value of nqid can-
> not be zero unless nwname is zero. Also, nqid will always
> be less than or equal to nwname. Only if it is equal, how-
> ever, will newfid be affected, in which case it will repre-
> sent the file reached by the final elementwise walk
> requested in the message.
If the walk operation fails, does newfid exist (and point to the
same qid as fid), or is it implicitly clunked?
My suggestion: If the walk fails, newfid should be implicitly
clunked unless it was equal to fid.
* Re: [9fans] 9P2000
2001-01-27 21:58 rob pike
@ 2001-01-28 16:29 ` Sam Ducksworth
2001-01-30 9:21 ` Mike Haertel
1 sibling, 0 replies; 10+ messages in thread
From: Sam Ducksworth @ 2001-01-28 16:29 UTC (permalink / raw)
To: 9fans
rob pike wrote:
>
> distribution. That is the real reason things have seemed quiet
> lately, not the Lucent announcements.
>
> -rob
>
rob,
thanks for the update. sorry for being so paranoid.
sam
* [9fans] 9P2000
@ 2001-01-27 21:58 rob pike
2001-01-28 16:29 ` Sam Ducksworth
2001-01-30 9:21 ` Mike Haertel
0 siblings, 2 replies; 10+ messages in thread
From: rob pike @ 2001-01-27 21:58 UTC (permalink / raw)
To: 9fans
I've been thinking of sending this information to 9fans for a while.
Since the cat is out of the bag, now is as good a time as any. We
have reworked 9P to address many of its failings, most important:
1) Nesting and encapsulation: exportfs embeds 9P within 9P,
which can make reads and writes not fit within the 8K limit.
2) Walk performance: it takes too many walks to evaluate a
name.
3) Sizes fixed and too small: read/write sizes and, most
important, path name elements have limited, too-small sizes.
4) Authentication too rigid: the authentication protocols were
defined in the protocol and so impossible to change.
And a host of other lesser things.
We have a file server and kernel running this protocol now and have
adapted much but not all of our stuff; it's not yet the system we live
with. Comments on the following man pages are welcome. I've included
all of section 5 (9P itself, now 9P2000) and some relevant parts of
section 2. Directory handling is very different, for example.
There may be many errors in these pages and many details are sure to
change before we're done.
Until this stuff gets installed and a lot of shaking down has
happened, there won't be much in the way of updates to the existing
distribution. That is the real reason things have seemed quiet
lately, not the Lucent announcements.
-rob
INTRO(5) INTRO(5)
NAME
intro - introduction to the Plan 9 File Protocol, 9P
SYNOPSIS
#include <fcall.h>
DESCRIPTION
A Plan 9 server is an agent that provides one or more hier-
archical file systems - file trees - that may be accessed by
Plan 9 processes. A server responds to requests by clients
to navigate the hierarchy, and to create, remove, read, and
write files. The prototypical server is a separate machine
that stores large numbers of user files on permanent media;
such a machine is called, somewhat confusingly, a file
server. Another possibility for a server is to synthesize
files on demand, perhaps based on information on data struc-
tures inside the kernel; the proc(3) kernel device is a part
of the Plan 9 kernel that does this. User programs can also
act as servers.
A connection to a server is a bidirectional communication
path from the client to the server. There may be a single
client or multiple clients sharing the same connection. A
server's file tree is attached to a process group's name
space by bind(2) and mount calls; see intro(2). Processes in
the group are then clients of the server: system calls oper-
ating on files are translated into requests and responses
transmitted on the connection to the appropriate service.
The Plan 9 File Protocol, 9P, is used for messages between
clients and servers. A client transmits requests (T-
messages) to a server, which subsequently returns replies
(R-messages) to the client. The combined acts of transmit-
ting (receiving) a request of a particular type, and receiv-
ing (transmitting) its reply is called a transaction of that
type.
Each message consists of a sequence of bytes. Two-, four-,
and eight-byte fields hold unsigned integers represented in
little-endian order (least significant byte first). Data
items of larger or variable lengths are represented by a
two-byte field specifying a count, n, followed by n bytes of
data. Text strings are represented this way, with the text
itself stored as a UTF-8 encoded sequence of Unicode charac-
ters (see utf(6)). Text strings in 9P messages are not NUL-
terminated: n counts the bytes of UTF-8 data, which include
no final zero byte. The NUL character is illegal in all
text strings in 9P, and is therefore excluded from file
names, user names, and so on.
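A minimal sketch of the encoding rules above, in C. The helper names
(put16, put32, putstring) are made up for illustration and are not the
fcall(2) routines; the sketch only shows little-endian integers and
counted, unterminated strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 16-bit value least significant byte first. */
void put16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/* Store a 32-bit value least significant byte first. */
void put32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)(v >> 24);
}

/*
 * Encode a text string as a two-byte count followed by that many
 * bytes of UTF-8, with no terminating NUL.  Returns the number of
 * bytes written, or 0 if the string is too long for a two-byte
 * count or does not fit in buf.
 */
size_t putstring(uint8_t *buf, size_t bufsize, const char *s)
{
    size_t n;

    assert(buf != NULL && s != NULL);
    n = strlen(s);
    if (n > 0xffff || bufsize < n + 2)
        return 0;
    put16(buf, (uint16_t)n);
    memcpy(buf + 2, s, n);
    return n + 2;
}
```

Because a C string cannot hold an embedded NUL, this encoder can never
emit the one character the protocol forbids.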
Page 1 Plan 9 (printed 1/27/00)
Each 9P message begins with a four-byte size field specify-
ing the length in bytes of the complete message including
the four bytes of the size field itself. The next byte is
the message type, one of the constants in the enumeration in
the include file <fcall.h>. The remaining bytes are parame-
ters of different sizes. In the message descriptions below,
the number of bytes in a field is given in brackets after
the field name. The notation parameter[n] where n is not a
constant represents a variable-length parameter: n[2] fol-
lowed by n bytes of data forming the parameter. The notation
string[s] (using a literal s character) is shorthand for
s[2] followed by s bytes of UTF-8 text. (Systems may choose
to reduce the set of legal characters to reduce syntactic
problems, for example to remove slashes from name compo-
nents, but the protocol has no such restriction. Plan 9
names may contain any printable character (that is, any
character outside hexadecimal 00-1F and 80-9F) except
slash.) Messages are transported in byte form to allow for
machine independence; fcall(2) describes routines that con-
vert to and from this form into a machine-dependent C struc-
ture.
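As an illustration of the framing just described, here is a rough C
sketch that picks apart the start of a message: size[4], the type byte,
then tag[2]. The Hdr structure and the function names are assumptions
for this example only; the real conversion routines are the ones in
fcall(2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the fields common to every message. */
typedef struct Hdr {
    uint32_t size;  /* length of the whole message, size[4] included */
    uint8_t  type;  /* message type byte */
    uint16_t tag;   /* tag[2] */
} Hdr;

enum { HDRSIZE = 4 + 1 + 2 };

uint16_t get16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t get32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Parse the header of the message at buf, of which n bytes are
 * available.  Returns 0 on success, -1 if the buffer is too short
 * or the size field is implausible.
 */
int parsehdr(const uint8_t *buf, size_t n, Hdr *h)
{
    assert(buf != NULL && h != NULL);
    if (n < HDRSIZE)
        return -1;
    h->size = get32(buf);
    if (h->size < HDRSIZE || h->size > n)
        return -1;
    h->type = buf[4];
    h->tag = get16(buf + 5);
    return 0;
}
```

The size check rejects a message that claims to be shorter than its own
header or longer than the bytes actually received.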
MESSAGES
Tversion size[4] tag[2] msize[4] version[s]
Rversion size[4] tag[2] msize[4] version[s]
Tsession size[4] tag[2] chal[n]
Rsession size[4] tag[2] chal[n] authid[s] authdom[s]
Rerror size[4] tag[2] ename[s]
Tflush size[4] tag[2] oldtag[4]
Rflush size[4] tag[2]
Tattach size[4] tag[2] fid[4] uname[s] aname[s]
auth[n]
Rattach size[4] tag[2] qid[13] rauth[n]
Twalk size[4] tag[2] fid[4] newfid[4] nwname[2]
nwname*(wname[s])
Rwalk size[4] tag[2] nwqid[2] nwqid*(wqid[13])
Topen size[4] tag[2] fid[4] mode[1]
Ropen size[4] tag[2] qid[13] iounit[4]
Tcreate size[4] tag[2] fid[4] name[s] perm[4] mode[1]
Rcreate size[4] tag[2] qid[13] iounit[4]
Tread size[4] tag[2] fid[4] offset[8] count[4]
Rread size[4] tag[2] count[4] data[count]
Twrite size[4] tag[2] fid[4] offset[8] count[4]
data[count]
Rwrite size[4] tag[2] count[4]
Tclunk size[4] tag[2] fid[4]
Rclunk size[4] tag[2]
Tremove size[4] tag[2] fid[4]
Rremove size[4] tag[2]
Tstat size[4] tag[2] fid[4]
Rstat size[4] tag[2] stat[n]
Twstat size[4] tag[2] fid[4] stat[n]
Rwstat size[4] tag[2]
Each T-message has a tag field, chosen and used by the
client to identify the message. The reply to the message
will have the same tag. Clients must arrange that no two
outstanding messages on the same connection have the same
tag. An exception is the tag ~0, meaning `no tag': the
client can use it, when establishing a connection, to over-
ride tag matching in version and session messages.
The type of an R-message will either be one greater than the
type of the corresponding T-message or Rerror, indicating
that the request failed. In the latter case, the ename
field contains a string describing the reason for failure.
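The tag and type rules above amount to a small matching test on the
client side. The sketch below is one way to write it in C; the Msg
structure is hypothetical, and the numeric value of Rerror is passed in
as a parameter because it comes from <fcall.h> and is not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal view of a message: just type and tag. */
typedef struct Msg {
    uint8_t  type;
    uint16_t tag;
} Msg;

/*
 * Report whether r is an acceptable reply to request t: the tags
 * must match, and the reply type must be either one greater than
 * the request type or Rerror.
 */
int isreplyto(const Msg *t, const Msg *r, uint8_t rerror)
{
    assert(t != NULL && r != NULL);
    if (r->tag != t->tag)
        return 0;
    return r->type == rerror || r->type == t->type + 1;
}
```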
The version message identifies the version of the protocol
and indicates the maximum message size the system is pre-
pared to handle. A session request initializes a connection
and aborts all outstanding I/O on the connection. The set
of messages between session requests is called a session.
Most T-messages contain a fid, a 32-bit unsigned integer
that the client uses to identify a ``current file'' on the
server. Fids are somewhat like file descriptors in a user
process, but they are not restricted to files open for I/O:
directories being examined, files being accessed by stat(2)
calls, and so on - all files being manipulated by the oper-
ating system - are identified by fids. Fids are chosen by
the client. All requests on a connection share the same fid
space; when several clients share a connection, the agent
managing the sharing must arrange that no two clients choose
the same fid.
The first fid supplied (in an attach message) will be taken
by the server to refer to the root of the served file tree.
The attach identifies the user to the server and may specify
a particular file tree served by the server (for those that
supply more than one). A walk message causes the server to
change the current file associated with a fid to be a file
in the directory that is the old current file, or one of its
subdirectories. Walk returns a new fid that refers to the
resulting file. Usually, a client maintains a fid for the
root, and navigates by walks from the root fid.
A client can send multiple T-messages without waiting for
the corresponding R-messages, but all outstanding T-messages
must specify different tags. The server may delay the
response to a request on one fid and respond to later
requests on other fids; this is sometimes necessary, for
example when the client reads from a file that the server
synthesizes from external events such as keyboard charac-
ters.
Replies (R-messages) to attach, walk, open, and create
requests convey a qid field back to the client. The qid
represents the server's unique identification for the file
being accessed: two files on the same server hierarchy are
the same if and only if their qids are the same. (The
client may have multiple fids pointing to a single file on a
server and hence having a single qid.) The thirteen-byte
qid fields hold a one-byte type, specifying whether the file
is a directory, append-only file, etc., and two unsigned
integers: first the four-byte qid version, then the
eight-byte qid path.
The path is an integer unique among all files in the hierar-
chy. If a file is deleted and recreated with the same name
in the same directory, the old and new path components of
the qids should be different. The version is a version num-
ber for a file; typically, it is incremented every time the
file is modified.
An existing file can be opened, or a new file may be created
in the current (directory) file. I/O of a given number of
bytes at a given offset on an open file is done by read and
write.
A client should clunk any fid that is no longer needed. The
remove transaction deletes files.
The stat transaction retrieves information about the file.
The stat field in the reply includes the file's name, access
permissions (read, write and execute for owner, group and
public), access and modification times, and owner and group
identifications (see stat(2)). The owner and group identifi-
cations are textual names. The wstat transaction allows
some of a file's properties to be changed.
A request can be aborted with a Tflush request. When a
server receives a Tflush, it should not reply to the message
with tag oldtag (unless it has already replied), and it
should immediately send an Rflush. The client must wait
until it gets the Rflush (even if the reply to the original
message arrives in the interim), at which point oldtag may
be reused.
Most programs do not see the 9P protocol directly; instead
calls to library routines that access files are translated
by the mount driver, mnt(3), into 9P messages.
DIRECTORIES
Directories are created by create with DMDIR set in the per-
missions argument (see stat(5)). The members of a directory
can be found with read(5). All directories must support
walks to the directory .. (dot-dot) meaning parent direc-
tory, although by convention directories contain no explicit
entry for .. or . (dot). The parent of the root directory
of a server's tree is itself.
ACCESS PERMISSIONS
Each file server maintains a set of user and group names.
Each user can be a member of any number of groups. Each
group has a group leader who has special privileges (see
stat(5) and users(6)). Every file request has an implicit
user id (copied from the original attach) and an implicit
set of groups (every group of which the user is a member).
Each file has an associated owner and group id and three
sets of permissions: those of the owner, those of the group,
and those of ``other'' users. When the owner attempts to do
something to a file, the owner, group, and other permissions
are consulted, and if any of them grant the requested per-
mission, the operation is allowed. For someone who is not
the owner, but is a member of the file's group, the group
and other permissions are consulted. For everyone else, the
other permissions are used. Each set of permissions says
whether reading is allowed, whether writing is allowed, and
whether executing is allowed. A walk in a directory is
regarded as executing the directory, not reading it. Per-
missions are kept in the low-order bits of the file mode:
owner read/write/execute permission represented as 1 in bits
8, 7, and 6 respectively (using 0 to number the low order).
The group permissions are in bits 5, 4, and 3, and the other
permissions are in bits 2, 1, and 0.
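The permission check described above can be sketched in C roughly as
follows. The constant and function names are invented for this
example. For a request that combines several permissions the sketch
takes the union of the classes consulted, which is an assumption; the
text only spells out the single-permission case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the three permission bits of one class. */
enum { PEXEC = 1, PWRITE = 2, PREAD = 4 };

/*
 * Decide whether a request for the permissions in want is granted
 * by mode.  The owner is checked against the owner, group, and
 * other bits; a group member against group and other; everyone
 * else against other only.
 */
int allowed(uint32_t mode, int isowner, int ingroup, uint32_t want)
{
    uint32_t granted;

    assert((want & ~7u) == 0);
    granted = mode & 7;                 /* other: bits 2, 1, 0 */
    if (isowner || ingroup)
        granted |= (mode >> 3) & 7;     /* group: bits 5, 4, 3 */
    if (isowner)
        granted |= (mode >> 6) & 7;     /* owner: bits 8, 7, 6 */
    return (granted & want) == want;
}
```

Note the consequence of consulting all three sets for the owner: with
mode 0070 the owner may write the file even though the owner bits are
clear.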
The file mode contains some additional attributes besides
the permissions. If bit 31 is set, the file is a directory;
if bit 30 is set, the file is append-only (offset is ignored
in writes); if bit 29 is set, the file is exclusive-use
(only one client may have it open at a time). These bits
are reproduced, from the top bit down, in the type byte of
the Qid.
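To make the bit layout concrete, the following C sketch splits a mode
word into its attribute flags, its permission bits, and the
corresponding Qid type byte. DMDIR is the name used elsewhere in these
pages; DMAPPEND, DMEXCL, ModeInfo, and splitmode are names assumed for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMDIR    0x80000000u   /* bit 31: directory */
#define DMAPPEND 0x40000000u   /* bit 30: append-only (name assumed) */
#define DMEXCL   0x20000000u   /* bit 29: exclusive-use (name assumed) */

/* Hypothetical decoded view of a mode word. */
typedef struct ModeInfo {
    int      isdir;
    int      isappend;
    int      isexcl;
    uint8_t  qidtype;   /* top 8 bits of mode, as echoed in the Qid */
    uint32_t perm;      /* low 9 permission bits */
} ModeInfo;

void splitmode(uint32_t mode, ModeInfo *m)
{
    assert(m != NULL);
    m->isdir = (mode & DMDIR) != 0;
    m->isappend = (mode & DMAPPEND) != 0;
    m->isexcl = (mode & DMEXCL) != 0;
    m->qidtype = (uint8_t)(mode >> 24);
    m->perm = mode & 0777;
}
```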
ATTACH(5) ATTACH(5)
NAME
attach, session - messages to initiate activity
SYNOPSIS
Tsession size[4] tag[2] chal[n]
Rsession size[4] tag[2] chal[n] authid[s] authdom[s]
Tattach size[4] tag[2] fid[4] uid[s] aname[s] auth[n]
Rattach size[4] tag[2] qid[13] rauth[n]
DESCRIPTION
The session request initializes a connection between a
client and a server and exchanges authentication informa-
tion. All outstanding I/O on the connection is aborted.
The set of messages between session requests is called a
session. The host's user name (authid) and its authentica-
tion domain (authdom) identify the key to be used when
authenticating to this host. The exchanged challenges
(chal) are used in the authentication algorithm. If authid
is an empty string no authentication is performed in this
session.
The tag should be NOTAG (value ~0) for a session message.
The attach message serves as a fresh introduction from a
user on the client machine to the server. The message iden-
tifies the user (uid) and may select the file tree to access
(aname). The auth argument contains authorization data
derived from the exchanged challenges of the session mes-
sage; see auth(6).
As a result of the attach transaction, the client will have
a connection to the root directory of the desired file tree,
represented by fid. An error is returned if fid is already
in use. The server's idea of the root of the file tree is
represented by the returned qid.
ENTRY POINTS
An attach transaction will be generated for kernel devices
(see intro(3)) when a system call evaluates a file name
beginning with `#'. Pipe(2) generates an attach on the ker-
nel device pipe(3). The mount system call (see bind(2)) gen-
erates an attach message to the remote file server. When
the kernel boots, an attach is made to the root device,
root(3), and then an attach is made to the requested file
server machine.
SEE ALSO
version(5), auth(6)
CLUNK(5) CLUNK(5)
NAME
clunk - forget about a fid
SYNOPSIS
Tclunk size[4] tag[2] fid[4]
Rclunk size[4] tag[2]
DESCRIPTION
The clunk request informs the file server that the current
file represented by fid is no longer needed by the client.
The actual file is not removed on the server unless the fid
had been opened with ORCLOSE.
Once a fid has been clunked, the same fid can be reused in a
new walk or attach request.
Even if the clunk returns an error, the fid is no longer
valid.
ENTRY POINTS
A clunk message is generated by close and indirectly by
other actions such as failed open calls.
ERROR(5) ERROR(5)
NAME
error - return an error
SYNOPSIS
Rerror size[4] tag[2] ename[s]
DESCRIPTION
The Rerror message (there is no Terror) is used to return an
error string describing the failure of a transaction. It
replaces the corresponding reply message that would accom-
pany a successful call; its tag is that of the request.
By convention, clients may truncate error messages after 255
bytes, defined as ERRMAX in <libc.h>.
FLUSH(5) FLUSH(5)
NAME
flush - abort a message
SYNOPSIS
Tflush size[4] tag[2] oldtag[4]
Rflush size[4] tag[2]
DESCRIPTION
When the response to a request is no longer needed, such as
when a user interrupts a process doing a read(2), a Tflush
request is sent to the server to purge the pending response.
The message being flushed is identified by oldtag. The
semantics of flush depends on messages arriving in order.
The server must answer the flush message immediately. If it
recognizes oldtag as the tag of a pending transaction, it
should abort any pending response and discard that tag. In
either case, it should respond with an Rflush echoing the
tag (not oldtag) of the Tflush message. A Tflush can never
be responded to by an Rerror message.
When the client sends a Tflush, it must wait to receive the
corresponding Rflush before reusing oldtag for subsequent
messages. If a response to the flushed request is received
before the Rflush, the client must honor the response as if
it had not been flushed, since the completed request may
signify a state change in the server. For instance, Tcreate
will have created a file and Twalk may have allocated a fid.
If no response is received before the Rflush, the flushed
transaction is considered to have been canceled, and should
be treated as though it had never been sent.
Several exceptional conditions are handled correctly by the
above specification: sending multiple flushes for a single
tag, flushing after a transaction is completed, flushing a
Tflush, and flushing an invalid tag.
OPEN(5) OPEN(5)
NAME
open, create - prepare a fid for I/O on an existing or new
file
SYNOPSIS
Topen size[4] tag[2] fid[4] mode[1]
Ropen size[4] tag[2] qid[13] iounit[4]
Tcreate size[4] tag[2] fid[4] name[s] perm[4] mode[1]
Rcreate size[4] tag[2] qid[13] iounit[4]
DESCRIPTION
The open request asks the file server to check permissions
and prepare a fid for I/O with subsequent read and write
messages. The mode field determines the type of I/O: 0, 1,
2, and 3 mean read access, write access, read and write
access, and execute access, to be checked against the per-
missions for the file. In addition, if mode has the OTRUNC
(0x10) bit set, the file is to be truncated, which requires
write permission (if the file is append-only, and permission
is granted, the open succeeds but the file will not be trun-
cated); if the mode has the ORCLOSE (0x40) bit set, the file
is to be removed when the fid is clunked, which requires
permission to remove the file from its directory. If other
bits are set in mode they will be ignored. It is illegal to
write a directory, truncate it, or attempt to remove it on
close. If the file is marked for exclusive use (see
stat(5)), only one client can have the file open at any
time. That is, after such a file has been opened, further
opens will fail until fid has been clunked. All these per-
missions are checked at the time of the open request; subse-
quent changes to the permissions of files do not affect the
ability to read, write, or remove an open file.
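One way a server might turn the open mode into the permissions it has
to check is sketched below in C. All names are hypothetical. Treating
ORCLOSE as a need for write permission in the parent directory is this
sketch's reading of "permission to remove the file from its directory",
taken from the rule in remove(5).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the permission bits and mode flags. */
enum { NEEDEXEC = 1, NEEDWRITE = 2, NEEDREAD = 4 };
enum { TRUNCBIT = 0x10, RCLOSEBIT = 0x40 };   /* OTRUNC, ORCLOSE */

/*
 * Return the permissions that must be granted on the file for an
 * open with the given mode.  *needdirwrite is set if the open also
 * needs write permission in the parent directory (remove on close).
 */
int openneeds(uint8_t mode, int *needdirwrite)
{
    static const int need[4] = {
        NEEDREAD,               /* 0: read access */
        NEEDWRITE,              /* 1: write access */
        NEEDREAD | NEEDWRITE,   /* 2: read and write access */
        NEEDEXEC                /* 3: execute access */
    };
    int want;

    assert(needdirwrite != NULL);
    want = need[mode & 3];
    if (mode & TRUNCBIT)
        want |= NEEDWRITE;      /* truncation requires write permission */
    *needdirwrite = (mode & RCLOSEBIT) != 0;
    return want;
}
```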
The create request asks the file server to create a new file
with the name supplied, in the directory (dir) represented
by fid, and requires write permission in the directory. The
owner of the file is the implied user id of the request, the
group of the file is the same as dir, and the permissions
are the value of
perm & (~0666 | (dir.perm & 0666))
if a regular file is being created and
perm & (~0777 | (dir.perm & 0777))
if a directory is being created. This means, for example,
that if the create allows read permission to others, but the
containing directory does not, then the created file will
not allow others to read the file.
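The two masking expressions above can be folded into one small C
function. This is only a sketch; createperm is a name chosen for the
example.

```c
#include <assert.h>
#include <stdint.h>

#define DMDIR 0x80000000u   /* directory bit in perm and mode */

/*
 * Compute the permissions of a newly created file from the perm
 * field of the create request and the permissions of the
 * containing directory.  Regular files inherit through the 0666
 * bits, directories through the 0777 bits.
 */
uint32_t createperm(uint32_t perm, uint32_t dirperm)
{
    uint32_t mask;

    assert((dirperm & DMDIR) != 0);   /* files are created in directories */
    mask = (perm & DMDIR) ? 0777u : 0666u;
    return perm & (~mask | (dirperm & mask));
}
```

For instance, asking for 0644 inside a directory whose permissions are
0750 yields 0640: the request allowed others to read, the directory did
not, so the new file does not either.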
Finally, the newly created file is opened according to mode,
and fid will represent the newly opened file. Mode is not
checked against the permissions in perm. The qid for the new
file is returned with the create reply message.
Directories are created by setting the DMDIR bit
(0x80000000) in the perm.
The names . and .. are special; it is illegal to create
files with these names.
It is an error for either of these messages if the fid is
already the product of a successful open or create message.
An attempt to create a file in a directory where the given
name already exists will be rejected; in this case, the
create system call (see open(2)) uses open with truncation.
The algorithm used by the create system call is: first walk
to the directory to contain the file. If that fails, return
an error. Next walk to the specified file. If the walk
succeeds, send a request to open and truncate the file and
return the result, successful or not. If the walk fails,
send a create message. If that fails, it may be because the
file was created by another process after the previous walk
failed, so (once) try the walk and open again.
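The algorithm above, including the single retry, is sketched below in
C. The callback table and the in-memory FakeDir used to exercise it are
inventions for this example and do not correspond to any kernel
interface; each callback returns 0 on success and -1 on failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations the kernel would perform via 9P. */
typedef struct CreateOps {
    int (*walkdir)(void *ctx);    /* walk to the containing directory */
    int (*walkfile)(void *ctx);   /* walk to the file itself */
    int (*opentrunc)(void *ctx);  /* open with truncation */
    int (*create)(void *ctx);     /* send a create message */
} CreateOps;

int createcall(const CreateOps *ops, void *ctx)
{
    assert(ops != NULL);
    if (ops->walkdir(ctx) < 0)
        return -1;
    if (ops->walkfile(ctx) == 0)
        return ops->opentrunc(ctx);   /* result returned either way */
    if (ops->create(ctx) == 0)
        return 0;
    /* Perhaps another process created the file first: retry once. */
    if (ops->walkfile(ctx) == 0)
        return ops->opentrunc(ctx);
    return -1;
}

/* A toy directory for trying the algorithm out. */
typedef struct FakeDir {
    int hasdir;    /* containing directory exists */
    int hasfile;   /* file exists */
    int race;      /* pretend someone else creates the file first */
    int truncs;    /* number of truncating opens performed */
} FakeDir;

static int fake_walkdir(void *c) { return ((FakeDir *)c)->hasdir ? 0 : -1; }
static int fake_walkfile(void *c) { return ((FakeDir *)c)->hasfile ? 0 : -1; }
static int fake_opentrunc(void *c) { ((FakeDir *)c)->truncs++; return 0; }

static int fake_create(void *c)
{
    FakeDir *d = c;

    if (d->race) {          /* lose the race: the file now exists */
        d->race = 0;
        d->hasfile = 1;
        return -1;
    }
    if (d->hasfile)
        return -1;
    d->hasfile = 1;
    return 0;
}

const CreateOps fakeops = {
    fake_walkdir, fake_walkfile, fake_opentrunc, fake_create
};
```

With race set, the create fails, the second walk succeeds, and the call
ends in an open with truncation, which is the case the retry exists
for.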
For the behavior of create on a union directory, see
bind(2).
The iounit field returned by open and create may be zero.
If it is not, it is the maximum number of bytes that are
guaranteed to be read from or written to the file without
breaking the I/O transfer into multiple 9P messages; see
read(5).
ENTRY POINTS
Open and create both generate open messages; only create
generates a create message.
For programs that need atomic file creation, without the
race that exists in the open-create sequence described
above, the kernel does the following. If the OEXCL (0x1000)
bit is set in the mode for a create system call, the open
message is not sent; the kernel issues only the create.
Thus, if the file exists, create will draw an error, but if
it doesn't and the create system call succeeds, the process
issuing the create is guaranteed to be the one that created
the file.
READ(5) READ(5)
NAME
read, write - transfer data from and to a file
SYNOPSIS
Tread size[4] tag[2] fid[4] offset[8] count[4]
Rread size[4] tag[2] count[4] data[count]
Twrite size[4] tag[2] fid[4] offset[8] count[4] data[count]
Rwrite size[4] tag[2] count[4]
DESCRIPTION
The read request asks for count bytes of data from the file
identified by fid, which must be opened for reading, start-
ing offset bytes after the beginning of the file. The bytes
are returned with the read reply message.
The count field in the reply indicates the number of bytes
returned. This may be less than the requested amount. If
the offset field is greater than or equal to the number of
bytes in the file, a count of zero will be returned.
For directories, read returns an integral number of direc-
tory entries exactly as in stat (see stat(5)), one for each
member of the directory. The read request message must have
offset equal to zero or the value of offset in the previous
read on the directory, plus the number of bytes returned in
the previous read. In other words, seeking other than to
the beginning is illegal in a directory (see seek(2)).
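The offset rule for directory reads is easy to state as a check a
server could apply. The C sketch below assumes the server remembers the
offset and returned byte count of the previous read on the fid; the
DirCursor structure is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of the previous read on a directory fid. */
typedef struct DirCursor {
    uint64_t offset;   /* offset of the previous read */
    uint32_t count;    /* bytes returned by the previous read */
} DirCursor;

/*
 * A directory read is legal only at offset zero or exactly where
 * the previous read left off.
 */
int dirreadok(const DirCursor *prev, uint64_t offset)
{
    assert(prev != NULL);
    return offset == 0 || offset == prev->offset + prev->count;
}
```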
The write request asks that count bytes of data be recorded
in the file identified by fid, which must be opened for
writing, starting offset bytes after the beginning of the
file. If the file has been opened append only, the data
will be placed at the end of the file regardless of offset.
Directories may not be written.
The write reply records the number of bytes actually writ-
ten. It is usually an error if this is not the same as
requested.
Because 9P implementations may limit the size of individual
messages, more than one message may be produced by a single
read or write call. The iounit field returned by open(5),
if non-zero, reports the maximum size that is guaranteed to
be transferred atomically.
ENTRY POINTS
Read and write messages are generated by the corresponding
calls. Although seek(2) affects the offset, it does not
generate a message.
REMOVE(5) REMOVE(5)
NAME
remove - remove a file from a server
SYNOPSIS
Tremove size[4] tag[2] fid[4]
Rremove size[4] tag[2]
DESCRIPTION
The remove request asks the file server both to remove the
file represented by fid and to clunk the fid, even if the
remove fails. This request will fail if the client does not
have write permission in the parent directory.
It is correct to consider remove to be a clunk with the side
effect of removing the file if permissions allow.
ENTRY POINTS
Remove messages are generated by remove.
STAT(5) STAT(5)
NAME
stat, wstat - inquire or change file attributes
SYNOPSIS
Tstat size[4] tag[2] fid[4]
Rstat size[4] tag[2] stat[n]
Twstat size[4] tag[2] fid[4] stat[n]
Rwstat size[4] tag[2]
DESCRIPTION
The stat transaction inquires about the file identified by
fid. The reply will contain a machine-independent directory
entry, stat, laid out as follows:
type[2]
for kernel use
dev[4]
for kernel use
qid.type[1]
the type of the file (directory, etc.), represented as
a bit vector corresponding to the high 8 bits of the
file's mode word.
qid.vers[4]
version number for given path
qid.path[8]
the file server's unique identification for the file
mode[4]
permissions and flags
atime[4]
last access time
mtime[4]
last modification time
length[8]
length of file in bytes
name[ s ]
file name; must be / if the file is the root directory
of the server
uid[ s ]
owner name
gid[ s ]
group name
muid[ s ]
name of the user who last modified the file
Integers in this encoding are in little-endian order (least
significant byte first). The convM2D and convD2M routines
(see fcall(2)) convert between directory entries and C
structs.
This encoding may be turned into a machine dependent Dir
structure (see stat(2)) using routines defined in fcall(2).
The mode contains permission bits as described in intro(5)
and the following: 0x80000000 (this file is a directory),
0x40000000 (append only), 0x20000000 (exclusive use); these
are echoed in Qid.type. Writes to append-only files always
place their data at the end of the file; the offset in the
write message is ignored, as is the OTRUNC bit in an open.
Exclusive use files may be open for I/O by only one fid at a
time across all clients of the server. If a second open is
attempted, it draws an error. Servers may implement a time-
out on the lock on an exclusive use file: if the fid holding
the file open has been unused for an extended period (of
order at least minutes), it is reasonable to break the lock
and deny the initial fid further I/O.
The two time fields are measured in seconds since the epoch
(Jan 1 00:00 1970 GMT). The mtime field reflects the time
of the last change of content (except when later changed by
wstat). For a plain file, mtime is the time of the most
recent create, open with truncation, or write; for a direc-
tory it is the time of the most recent remove, create, or
wstat of a file in the directory. Similarly, the atime
field records the last read of the contents; also it is set
whenever mtime is set. In addition, for a directory, it is
set by an attach, walk, or create, all whether successful or
not.
The muid field names the user whose actions most recently
changed the mtime of the file.
The length records the number of bytes in the file. Direc-
tories and most files representing devices have a conven-
tional length of 0.
The stat request requires no special permissions.
The wstat request can change some of the file status infor-
mation. The name can be changed by anyone with write per-
mission in the parent directory; it is an error to change
the name to that of an existing file. The length can be
changed (affecting the actual length of the file) by anyone
with write permission on the file. It is an error to
attempt to set the length of a directory to a non-zero
value, and servers may decide to reject length changes for
other reasons. The mode and mtime can be changed by the
owner of the file or the group leader of the file's current
group. The directory bit cannot be changed by a wstat; the
other defined permission and mode bits can. The gid can be
changed: by the owner if also a member of the new group; or
by the group leader of the file's current group if also
leader of the new group (see intro(5) for more information
about permissions and users(6) for users and groups). None
of the other data can be altered by a wstat. In particular,
there is no way to change the owner of a file.
Either all the changes in a wstat request happen, or none of
them does: if the request succeeds, all changes were made;
if it fails, none were.
A wstat request can explicitly avoid modifying some proper-
ties of the file by providing explicit ``don't touch'' val-
ues in the stat data that is sent: zero-length strings for
text values and ~0 for integral values.
A read of a directory yields an integral number of directory
entries in the machine independent encoding given above (see
read(5)).
Note that since the stat information is sent as a 9P
variable-length datum, it is limited to a maximum of 65535
bytes.
ENTRY POINTS
Stat messages are generated by fstat and stat.
Wstat messages are generated by fwstat and wstat.
VERSION(5) VERSION(5)
NAME
version - negotiate protocol version
SYNOPSIS
Tversion size[4] tag[2] msize[4] version[s]
Rversion size[4] tag[2] msize[4] version[s]
DESCRIPTION
The version request negotiates the protocol version and mes-
sage size to be used on the connection. Tversion must be
the first message sent on the 9P connection, and the client
cannot issue any further requests until it has received the
Rversion reply.
The client suggests a maximum message size, msize, that is
the maximum length, in bytes, it will ever generate or
expect to receive in a single 9P message. This count
includes all 9P protocol data, starting from the size field
and extending through the message, but excludes enveloping
transport protocols. The server responds with its own maxi-
mum, msize, which must be less than or equal to the client's
value. Thenceforth, both sides of the connection must honor
this limit.
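The server's side of the msize rule above can be sketched in a few lines of portable C. This is only an illustration: the function name negotiate_msize and the parameter server_max are hypothetical, and the only rule taken from the text is that the reply must not exceed the client's value.

```c
#include <stdint.h>

/*
 * Hypothetical helper: choose the msize a server places in Rversion.
 * client_msize is the value received in Tversion; server_max is the
 * largest message this (assumed) server implementation can buffer.
 */
uint32_t
negotiate_msize(uint32_t client_msize, uint32_t server_max)
{
	/* the reply must be less than or equal to the client's value */
	return client_msize < server_max ? client_msize : server_max;
}
```

After the exchange both sides would size their buffers from the returned value.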
The version string identifies the level of the protocol.
The string must always begin with the two characters ``9P''.
If the server does not understand the client's version
string, it should respond with an Rversion message (not
Rerror) with the version string the 7 characters
``unknown''.
The server may respond with the client's version string, or
a version string identifying an earlier defined protocol
version. Currently, the only defined version is the 6 char-
acters ``9P2000''. Version strings will be defined such
that, if the client string contains one or more period char-
acters, the initial substring up to but not including any
single period in the version string defines a version of the
protocol. Other version strings may also be valid, however.
The client and server will use the protocol version defined
by the server's response for all subsequent communication on
the connection.
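A rough sketch of how a server that implements only ``9P2000'' might choose its reply string under the rules above. The function name reply_version is hypothetical. Two simplifications are assumed: only the substring before the first period is considered, and anything other than ``9P2000'' is treated as not understood.

```c
#include <string.h>

/*
 * Hypothetical helper: pick the version string for Rversion.
 * Returns a pointer to a static string.
 */
const char*
reply_version(const char *client)
{
	size_t n;

	/* every version string must begin with the two characters 9P */
	if(strncmp(client, "9P", 2) != 0)
		return "unknown";

	/* the part before a period names a version of the protocol */
	n = strcspn(client, ".");
	if(n == 6 && strncmp(client, "9P2000", 6) == 0)
		return "9P2000";

	/* simplification: anything else is not understood */
	return "unknown";
}
```

A server that knows several versions could instead answer with the latest defined version that is no later than the client's.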
ENTRY POINTS
The version message is generated by the kernel by the first
mount system call on the connection.
WALK(5) WALK(5)
NAME
walk - descend a directory hierarchy
SYNOPSIS
Twalk size[4] tag[2] fid[4] newfid[4] nwname[2]
nwname*(wname[s])
Rwalk size[4] tag[2] nqid[2] nqid*(qid[13])
DESCRIPTION
The walk request carries as arguments an existing fid, which
must represent a directory, and a proposed newfid (which
must not be in use unless it is the same as fid) that the
client wishes to associate with the result of descending the
directory hierarchy by `walking' the hierarchy using the
successive path name elements wname.
The fid must be valid in the current session and must not
have been opened for I/O by an open or create message. If
the full sequence of nwname elements is walked successfully,
newfid will represent the file that results. If not, newfid
(and fid) will be unaffected. However, if newfid is in use
or otherwise illegal, an Rerror is returned.
The element ``..'' (dot-dot) represents the parent direc-
tory. The name ``.'' (dot), meaning the current directory,
is not used in the protocol.
It is legal for nwname to be zero, in which case newfid will
represent the same file as fid and the walk will usually
succeed; this is equivalent to walking to dot. The rest of
this discussion assumes nwname is greater than zero.
The nwname path name elements wname are walked in order,
``elementwise''. For the first elementwise walk to succeed,
the file identified by fid must be a directory, and the
implied user of the request must have permission to search
the directory (see intro(5)). Subsequent elementwise walks
have equivalent restrictions applied to the implicit fid
that results from the preceding elementwise walk.
If the first element cannot be walked for any reason, Rerror
is returned. Otherwise, the walk will return an Rwalk mes-
sage containing nqid qids corresponding, in order, to the
files that are visited by the nqid successful elementwise
walks; nqid is therefore either nwname or the index of the
first elementwise walk that failed. The value of nqid can-
not be zero unless nwname is zero. Also, nqid will always
be less than or equal to nwname. Only if it is equal, how-
ever, will newfid be affected, in which case it will repre-
sent the file reached by the final elementwise walk
requested in the message.
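The three outcomes described above for nwname greater than zero can be summarized in a small sketch. The names walk_outcome, WalkError, WalkPartial, and WalkFull are hypothetical; nok stands for the number of leading elements that were walked successfully.

```c
enum { WalkError, WalkPartial, WalkFull };

/*
 * Hypothetical summary of the Rwalk rules for nwname > 0.
 * nok is the count of leading elementwise walks that succeeded.
 * On return *nqid holds the number of qids the reply would carry.
 */
int
walk_outcome(int nwname, int nok, int *nqid)
{
	if(nok == 0){
		*nqid = 0;
		return WalkError;	/* first element failed: Rerror */
	}
	*nqid = nok;
	if(nok == nwname)
		return WalkFull;	/* newfid now names the final file */
	return WalkPartial;		/* Rwalk sent, newfid unaffected */
}
```

A client would therefore compare nqid with nwname before using newfid.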
A walk of the name ``..'' in the root directory of a server
is equivalent to a walk with no name elements.
If newfid is the same as fid, the above discussion applies,
with the obvious difference that if the walk changes the
state of newfid, it also changes the state of fid; and if
newfid is unaffected, then fid is also unaffected.
To simplify the implementation of the servers, a maximum of
sixteen name elements or qids may be packed in a single mes-
sage. This constant is called MAXWELEM in fcall(2). Despite
this restriction, the system imposes no limit on the number
of elements in a file name, only the number that may be
transmitted in a single message.
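As an illustration of the MAXWELEM limit, a client walking a long path would split it across several walk messages. The helper names nwalkmsgs and chunklen are hypothetical; this is a sketch of the arithmetic only, not of the kernel's code.

```c
enum { MAXWELEM = 16 };

/* number of walk messages needed to carry nelem path elements */
int
nwalkmsgs(int nelem)
{
	if(nelem == 0)
		return 1;	/* a zero-element walk is still one message */
	return (nelem + MAXWELEM - 1) / MAXWELEM;
}

/* number of elements carried by message i (counting from 0) */
int
chunklen(int nelem, int i)
{
	int left;

	left = nelem - i*MAXWELEM;
	if(left <= 0)
		return 0;
	return left < MAXWELEM ? left : MAXWELEM;
}
```

For a path of 40 elements this gives three messages carrying 16, 16, and 8 elements.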
ENTRY POINTS
A call to chdir(2) causes a walk. One or more walk messages
may be generated by any of the following calls, which evalu-
ate file names: bind, create, exec, mount, open, remove,
stat, unmount, wstat. The file name element . (dot) is
interpreted locally and is not transmitted in walk messages.
DIRREAD(2) DIRREAD(2)
NAME
dirread, dirreadall - read directory
SYNOPSIS
#include <u.h>
#include <libc.h>
long dirread(int fd, Dir **buf)
long dirreadall(int fd, Dir **buf)
#define STATMAX 65535U
#define DIRMAX (sizeof(Dir)+STATMAX)
DESCRIPTION
The data returned by a read(2) on a directory is a set of
complete directory entries in a machine-independent format,
exactly equivalent to the result of a stat(2) on each file
or subdirectory in the directory. Dirread decodes the
directory entries into a machine-dependent form. It reads
from fd and unpacks the data into an array of Dir structures
whose address is returned in *buf (see stat(2) for the layout
of a Dir). The array is allocated with malloc(2) each
time dirread is called.
Dirreadall is like dirread, but reads in the entire directory;
by contrast, dirread steps through a directory one read(2)
at a time.
Directory entries have variable length. A successful read
of a directory always returns an integral number of complete
directory entries; dirread always returns complete Dir
structures. See read(5) for more information.
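Because each entry is variable-length, a reader has to step through the buffer using the leading 16-bit count of each entry. The sketch below is in portable C with the hypothetical name countentries. It assumes the count is little-endian, as 9P integers are, and that it excludes its own two bytes, as stat(2) describes.

```c
#include <stddef.h>

enum { BIT16SZ = 2 };

/*
 * Count the complete directory entries in buf[0..n).
 * Returns -1 if the buffer ends in the middle of an entry,
 * which a successful directory read should never produce.
 */
long
countentries(const unsigned char *buf, size_t n)
{
	size_t off, len;
	long count;

	off = 0;
	count = 0;
	while(off < n){
		if(n - off < BIT16SZ)
			return -1;
		/* 16-bit little-endian count of the bytes that follow */
		len = buf[off] | (buf[off+1] << 8);
		if(n - off - BIT16SZ < len)
			return -1;
		off += BIT16SZ + len;
		count++;
	}
	return count;
}
```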
The constant STATMAX is the maximum size that a directory
entry can occupy. The constant DIRMAX is an upper limit on
the size necessary to hold a Dir structure and all the asso-
ciated data.
Dirread returns the number of Dir structures filled in buf.
The file offset is advanced by the number of bytes actually
read.
SOURCE
/sys/src/libc/9sys/dirread.c
SEE ALSO
intro(2), open(2), read(2)
DIAGNOSTICS
Sets errstr.
FCALL(2) FCALL(2)
NAME
Fcall, convS2M, convD2M, convM2S, convM2D, getS, fcallconv,
dirconv, dirmodeconv, read9pmsg - interface to Plan 9 File
protocol
SYNOPSIS
#include <u.h>
#include <libc.h>
#include <auth.h>
#include <fcall.h>
uint convS2M(Fcall *f, uchar *ap, uint nap)
uint convD2M(Dir *d, uchar *ap, uint nap)
uint convM2S(uchar *ap, uint nap, Fcall *f)
uint convM2D(uchar *ap, uint nap, Dir *d, char *strs)
int dirconv(void *o, Fconv*)
int fcallconv(void *o, Fconv*)
int dirmodeconv(void *o, Fconv*)
int read9pmsg(int fd, uchar *buf, uint nbuf);
DESCRIPTION
These routines convert messages in the machine-independent
format of the Plan 9 file protocol, 9P, to and from a more
convenient form, an Fcall structure:
#define MAXWELEM 16
typedef
struct Fcall
{
    uchar   type;
    u32int  fid;
    ushort  tag;
    union {
        struct {
            u32int  msize;              /* Tversion, Rversion */
            char    *version;           /* Tversion, Rversion */
        };
        struct {
            u32int  oldtag;             /* Tflush */
        };
        struct {
            char    *ename;             /* Rerror */
        };
        struct {
            Qid     qid;                /* Rattach, Ropen, Rcreate */
            u32int  iounit;             /* Ropen, Rcreate */
            ushort  nrauth;             /* Rattach */
            uchar   *rauth;             /* Rattach */
        };
        struct {
            char    *uname;             /* Tattach */
            char    *aname;             /* Tattach */
            ushort  nauth;              /* Tattach */
            uchar   *auth;              /* Tattach */
        };
        struct {
            char    *authid;            /* Rsession */
            char    *authdom;           /* Rsession */
            ushort  nchal;              /* Tsession/Rsession */
            uchar   *chal;              /* Tsession/Rsession */
        };
        struct {
            u32int  perm;               /* Tcreate */
            char    *name;              /* Tcreate */
            uchar   mode;               /* Tcreate, Topen */
        };
        struct {
            u32int  newfid;             /* Twalk */
            ushort  nwname;             /* Twalk */
            char    *wname[MAXWELEM];   /* Twalk */
        };
        struct {
            ushort  nwqid;              /* Rwalk */
            Qid     wqid[MAXWELEM];     /* Rwalk */
        };
        struct {
            vlong   offset;             /* Tread, Twrite */
            u32int  count;              /* Tread, Twrite, Rread */
            char    *data;              /* Twrite, Rread */
        };
        struct {
            ushort  nstat;              /* Twstat, Rstat */
            uchar   *stat;              /* Twstat, Rstat */
        };
    };
} Fcall;
/* these are implemented as macros */
uchar GBIT8(uchar*)
ushort GBIT16(uchar*)
ulong GBIT32(uchar*)
vlong GBIT64(uchar*)
void PBIT8(uchar*, uchar)
void PBIT16(uchar*, ushort)
void PBIT32(uchar*, ulong)
void PBIT64(uchar*, vlong)
#define BIT8SZ 1
#define BIT16SZ 2
#define BIT32SZ 4
#define BIT64SZ 8
This structure is defined in <fcall.h>. See section 5 for a
full description of 9P messages and their encoding. For all
message types, the type field of an Fcall holds one of Tnop,
Rnop, Tsession, Rsession, etc. (defined in an enumerated
type in <fcall.h>). Fid is used by most messages, and tag
is used by all messages. The other fields are used selec-
tively by the message types given in comments.
ConvM2S takes a 9P message at ap of length nap, and uses it
to fill in Fcall structure f. If the passed message includ-
ing any data for Twrite and Rread messages is formatted
properly, the return value is the number of bytes the mes-
sage occupied in the buffer ap, which will always be less
than or equal to nap; otherwise it is 0. For Twrite and
Rread messages, data is set to a pointer into the argument
message, not a copy.
ConvS2M does the reverse conversion, turning f into a mes-
sage starting at ap. The length of the resulting message is
returned. For Twrite and Rread messages, count bytes start-
ing at data are copied into the message.
The constant IOHDRSZ is a suitable amount of buffer to
reserve for storing the 9P header; the data portion of a
Twrite or Rread will be no more than the buffer size
negotiated in the Tversion/Rversion exchange, minus IOHDRSZ.
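The arithmetic in the paragraph above, as a small sketch. The value 24 used for IOHDRSZ here is an assumption made for illustration; real code should take the constant from <fcall.h>. The function name maxiodata is hypothetical.

```c
#include <stdint.h>

enum { IOHDRSZ = 24 };	/* assumed value; use the one in <fcall.h> */

/*
 * Largest data payload a single Twrite or Rread can carry
 * once msize has been agreed in the version exchange.
 */
uint32_t
maxiodata(uint32_t msize)
{
	if(msize <= IOHDRSZ)
		return 0;	/* no room for any data */
	return msize - IOHDRSZ;
}
```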
Another structure is Dir, used by the routines described in
stat(2). ConvM2D converts the machine-independent form
starting at ap into d and returns the length of the
machine-independent, input encoding. The strings in the
returned Dir structure are stored at successive locations
starting at strs; if strs is nil they are ignored; however,
the return value still includes their length.
ConvD2M does the reverse translation, also returning the
length of the encoding. If the buffer is too short, the
return value will be BIT16SZ and the correct size will be
returned in the first BIT16SZ bytes. The macro GBIT16 can
be used to extract the correct value. The related macros
with different sizes retrieve the corresponding-sized quan-
tities. PBIT16 and its brethren place values in messages.
With the exception of handling short buffers in convD2M,
these macros are not usually needed except by internal rou-
tines.
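For readers without <fcall.h> at hand, the effect of the 16-bit and 32-bit macros can be approximated in portable C as below. This is a sketch, not the library's definition: the function names getle16, putle16, getle32, and putle32 are hypothetical, and the byte order is the little-endian order 9P uses for integers on the wire.

```c
#include <stdint.h>

/* read a 16-bit little-endian value, in the manner of GBIT16 */
uint16_t
getle16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* store a 16-bit little-endian value, in the manner of PBIT16 */
void
putle16(unsigned char *p, uint16_t v)
{
	p[0] = v & 0xFF;
	p[1] = v >> 8;
}

/* read a 32-bit little-endian value, in the manner of GBIT32 */
uint32_t
getle32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
		| ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* store a 32-bit little-endian value, in the manner of PBIT32 */
void
putle32(unsigned char *p, uint32_t v)
{
	p[0] = v & 0xFF;
	p[1] = (v >> 8) & 0xFF;
	p[2] = (v >> 16) & 0xFF;
	p[3] = (v >> 24) & 0xFF;
}
```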
GetS reads a message from file descriptor fd into ap and
converts the message using convM2S into the Fcall structure
f. The lp argument must point to a long holding the size of
the ap buffer. It is somewhat resilient to transient read
errors. If convM2S succeeds, its return value is stored in
*lp, and getS returns zero. Otherwise getS returns a string
identifying the error.
Dirconv, fcallconv, and dirmodeconv are formatting routines,
suitable for fmtinstall (see print(2)). They convert Dir*,
Fcall*, and long values into string representations of the
directory buffer, Fcall buffer, or file mode value.
Fcallconv assumes that dirconv has been installed with for-
mat letter `D' and dirmodeconv with format letter `M'.
Read9pmsg calls read(2) multiple times, if necessary, to
read an entire 9P message into buf. The return value is 0
for end of file, or -1 for error; it does not return partial
messages.
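The framing that lets a reader wait for a whole message is the leading size[4] field, which counts the entire message including itself. The sketch below shows one way to test whether a buffer already holds a complete message. The function name msgready and its return convention are hypothetical and differ from those of read9pmsg.

```c
#include <stddef.h>
#include <stdint.h>

enum { BIT32SZ = 4 };

/*
 * Inspect buf[0..n) for one complete 9P message.
 * Returns the message length if it is all present, 0 if more
 * bytes must be read first, and -1 if the size field cannot be
 * valid (smaller than itself or larger than the agreed msize).
 */
long
msgready(const unsigned char *buf, size_t n, uint32_t msize)
{
	uint32_t size;

	if(n < BIT32SZ)
		return 0;	/* size field itself is incomplete */
	size = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
		| ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
	if(size < BIT32SZ || size > msize)
		return -1;
	if(n < size)
		return 0;	/* wait for the rest */
	return (long)size;
}
```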
SOURCE
/sys/src/libc/9sys
SEE ALSO
intro(2), stat(2), intro(5)
STAT(2) STAT(2)
NAME
stat, fstat, wstat, fwstat, dirstat, dirfstat, dirwstat,
dirfwstat, nulldir - get and put file status
SYNOPSIS
#include <u.h>
#include <libc.h>
int stat(char *name, uchar *edir, int nedir)
int fstat(int fd, uchar *edir, int nedir)
int wstat(char *name, uchar *edir, int nedir)
int fwstat(int fd, uchar *edir, int nedir)
Dir* dirstat(char *name)
Dir* dirfstat(int fd)
int dirwstat(char *name, Dir *dir)
int dirfwstat(int fd, Dir *dir)
void nulldir(Dir *d)
DESCRIPTION
Given a file's name, or an open file descriptor fd, these
routines retrieve or modify file status information. Stat,
fstat, wstat, and fwstat are the system calls; they deal
with machine-independent directory entries. Their format is
defined by stat(5). Stat and fstat retrieve information
about name or fd into edir, a buffer of length nedir,
defined in <libc.h>. Wstat and fwstat write information
back, thus changing file attributes according to the con-
tents of edir. The data returned from the kernel includes
its leading 16-bit length field as described in intro(5).
For symmetry, this field must also be present when passing
data to the kernel in a call to wstat and fwstat, but its
value is ignored.
Dirstat, dirfstat, dirwstat, and dirfwstat are similar to
their counterparts, except that they operate on Dir struc-
tures:
typedef
struct Dir {
    /* system-modified data */
    uint    type;       /* server type */
    uint    dev;        /* server subtype */
    /* file data */
    Qid     qid;        /* unique id from server */
    ulong   mode;       /* permissions */
    ulong   atime;      /* last read time */
    ulong   mtime;      /* last write time */
    vlong   length;     /* file length: see <u.h> */
    char    *name;      /* last element of path */
    char    *uid;       /* owner name */
    char    *gid;       /* group name */
    char    *muid;      /* last modifier name */
} Dir;
The returned structure is allocated by malloc(2); freeing it
also frees the associated strings.
This structure and the Qid structure are defined in
<libc.h>. If the file resides on permanent storage and is
not a directory, the length returned by stat is the number
of bytes in the file. For directories, the length returned
is zero. For files that are streams (e.g., pipes and net-
work connections), the length is the number of bytes that
can be read without blocking.
Each file is the responsibility of some server: it could be
a file server, a kernel device, or a user process. Type
identifies the server type, and dev says which of a group of
servers of the same type is the one responsible for this
file. Qid is a structure containing path and vers fields:
path is guaranteed to be unique among all path names cur-
rently on the file server, and vers changes each time the
file is modified. The path is a long long (64 bits, vlong)
and the vers is an unsigned long (32 bits, ulong). Thus, if
two files have the same type, dev, and qid they are the same
file.
The bits in mode are defined by
0x80000000 directory
0x40000000 append only
0x20000000 exclusive use (locked)
0400 read permission by owner
0200 write permission by owner
0100 execute permission (search on directory) by owner
0070 read, write, execute (search) by group
0007 read, write, execute (search) by others
There are constants defined in <libc.h> for these bits:
DMDIR, DMAPPEND, and DMEXCL for the first three; and DMREAD,
DMWRITE, and DMEXEC for the read, write, and execute bits
for others.
The two time fields are measured in seconds since the epoch
(Jan 1 00:00 1970 GMT). Mtime is the time of the last
change of content. Similarly, atime is set whenever the
contents are accessed; also, it is set whenever mtime is
set.
Uid and gid are the names of the owner and group of the
file; muid is the name of the user that last modified the
file (setting mtime). Groups are also users, but each
server is free to associate a list of users with any user
name g, and that list is the set of users in the group g.
When an initial attachment is made to a server, the user
string in the process group is communicated to the server.
Thus, the server knows, for any given file access, whether
the accessing process is the owner of, or in the group of,
the file. This selects which set of three bits in mode is
used to check permissions.
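A sketch of the selection just described, in portable C. The names Owner, Group, Other, and permitted are hypothetical, and the DMREAD, DMWRITE, and DMEXEC values are taken from the ``others'' bits in the table above. A real server may apply a more generous rule, for example by also granting an owner whatever the group or others are allowed.

```c
#include <stdint.h>

enum { DMREAD = 4, DMWRITE = 2, DMEXEC = 1 };
enum { Owner, Group, Other };

/*
 * Return 1 if every bit in want (some combination of DMREAD,
 * DMWRITE, DMEXEC) is granted by the three mode bits selected
 * for this class of user, and 0 otherwise.
 */
int
permitted(uint32_t mode, int who, int want)
{
	int shift;

	/* owner bits are 0700, group bits 0070, other bits 0007 */
	shift = who == Owner ? 6 : who == Group ? 3 : 0;
	return (int)((mode >> shift) & 7 & (uint32_t)want) == want;
}
```

With mode 0640 the owner may read and write, the group may only read, and others get nothing.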
Only some of the fields may be changed with the wstat calls.
The name can be changed by anyone with write permission in
the parent directory. The mode and mtime can be changed by
the owner or the group leader of the file's current group.
The gid can be changed by the owner if he or she is a member
of the new group. The gid can be changed by the group
leader of the file's current group if he or she is the
leader of the new group. The length can be changed by any-
one with write permission, provided the operation is imple-
mented by the server. (See intro(5) for permission informa-
tion, and users(6) for user and group information).
Special values in the fields of the Dir passed to wstat
indicate that the field is not intended to be changed by the
call. The values are ~0 for integral values and the empty
string for string values. The routine nulldir initializes a
Dir to all `ignore' values. Thus one may change the mode,
for example, by using nulldir to initialize a Dir, then set-
ting the mode, and then doing wstat; it is not necessary to
use stat to retrieve the initial values first.
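The ``don't touch'' convention can be illustrated with a cut-down structure in portable C. MiniDir and mini_nulldir are hypothetical stand-ins for Dir and nulldir, kept small so the sketch compiles outside Plan 9; the real routines live in libc.

```c
#include <stdint.h>
#include <string.h>

/* hypothetical, reduced stand-in for the Dir structure */
typedef struct MiniDir MiniDir;
struct MiniDir {
	uint32_t	mode;
	uint32_t	atime;
	uint32_t	mtime;
	uint64_t	length;
	const char	*name;
	const char	*uid;
	const char	*gid;
	const char	*muid;
};

/* set every field to its `ignore' value, in the manner of nulldir */
void
mini_nulldir(MiniDir *d)
{
	memset(d, 0xFF, sizeof *d);	/* ~0 in every integral field */
	d->name = "";			/* empty string in every text field */
	d->uid = "";
	d->gid = "";
	d->muid = "";
}
```

A caller changing only the mode would call mini_nulldir, assign the mode field, and hand the result to the wstat step; every other field would be left alone by the server.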
SOURCE
/sys/src/libc/9syscall for the non-dir routines
/sys/src/libc/9sys for the routines prefixed dir
SEE ALSO
intro(2), fcall(2), dirread(2), stat(5)
DIAGNOSTICS
All these functions return the number of bytes copied on
success, -1 on error, and set errstr.
If the buffer for stat or fstat is too short for the
returned data, the return value will be BIT16SZ (see
fcall(2)) and the two bytes returned will contain the ini-
tial count field of the returned data; retrying with nedir
equal to that value plus BIT16SZ (for the count itself)
should succeed.
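The retry described above might look like the following sketch. To keep it runnable outside Plan 9, the system call is replaced by fakestat, a hypothetical stand-in that copies a pre-encoded entry and mimics the short-buffer behaviour. statgrow is likewise a hypothetical name. The count is assumed to be little-endian, as elsewhere in 9P.

```c
#include <stdlib.h>
#include <string.h>

enum { BIT16SZ = 2 };

/*
 * Stand-in for stat: src holds an encoded entry of srclen bytes.
 * If edir is too short, only the 2-byte count is copied and
 * BIT16SZ is returned, as the text describes.
 */
int
fakestat(const unsigned char *src, int srclen, unsigned char *edir, int nedir)
{
	if(nedir < BIT16SZ)
		return -1;
	if(nedir < srclen){
		memcpy(edir, src, BIT16SZ);
		return BIT16SZ;
	}
	memcpy(edir, src, srclen);
	return srclen;
}

/*
 * Fetch the entry into a malloc'ed buffer, retrying once with the
 * size reported by a short first attempt. The caller frees the result.
 */
unsigned char*
statgrow(const unsigned char *src, int srclen, int guess, int *np)
{
	unsigned char *buf;
	int n, need;

	if(guess < BIT16SZ)
		guess = BIT16SZ;
	buf = malloc(guess);
	if(buf == NULL)
		return NULL;
	n = fakestat(src, srclen, buf, guess);
	if(n == BIT16SZ){
		/* count field plus BIT16SZ for the count itself */
		need = (buf[0] | (buf[1] << 8)) + BIT16SZ;
		free(buf);
		buf = malloc(need);
		if(buf == NULL)
			return NULL;
		n = fakestat(src, srclen, buf, need);
	}
	if(n <= BIT16SZ){
		free(buf);
		return NULL;
	}
	*np = n;
	return buf;
}
```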