9front - general discussion about 9front
From: Jacob Moody <moody@mail.posixcafe.org>
To: 9front@9front.org
Subject: Re: [9front] hackathon writeup
Date: Fri, 19 Aug 2022 07:21:59 -0600
Message-ID: <5ae3c54f-4530-729a-a745-bd94b49d632b@posixcafe.org>
In-Reply-To: <CABB-WO-SWxHmD93kGdOhgwQNQa_NpT6PEj-dYgxJ46G1AoiO2Q@mail.gmail.com>

On 8/19/22 05:34, Stuart Morrow wrote:
> On 19/08/2022, Jacob Moody <moody@mail.posixcafe.org> wrote:
>> https://github.com/fjballest/nixMarkIV/blob/master/port/devmnt.c
>>
>> I think you're talking about this? Based on the context you seem
>> to be talking about it in.
> 
> Yeah, I guess so?  I said NxM rather than NIX Mark IV because I
> recall seeing Creepy[1] and the 9P.ix Fossil[2] in there.  But it seems
> there's no sign of support for it in the kernel (/sys/src/nxm).  Weird.
> 
> [1] I was right: https://github.com/rminnich/NxM/tree/master/sys/src/cmd/creepy
> [2] I was not right:
> https://github.com/rminnich/NxM/blob/master/sys/src/cmd/fossil/9p.c#L1133

Ah thanks. I took some time reading through the paper ori had posted.
There were quite a number of interesting ideas there; I appreciate you
mentioning this. I wanted to talk through some of these improvements,
both so my understanding could be corrected and for others who would
wish to discuss them.

Firstly, it seems namec in the kernel was adapted to issue
batches of requests in a single burst. At the 9p layer this
took the form of sending multiple dependent
requests to the server all under the same tag. For example,
a wstat involves the following requests:

walk from the rootfid to the file, getting a new fid
wstat that new fid with the user provided data
clunk the fid now that it is no longer being used.

These three requests would all be sent together to the server,
under the same tag to be labeled as dependent. In order to
facilitate this, devmnt was changed to allow each process to
have a number of RPCs in flight at once. The real change at
the protocol level was this batching, along with moving
open-with-truncation from open to create.
The latter seems to be a general (but good) bugfix.
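
To make the batching concrete, here is a minimal sketch in portable C of the three messages for a wstat sharing one tag. The Msg struct, the enum values and wstatbatch are hypothetical names for illustration only; they are not the NIX wire format or kernel code.

```c
#include <stddef.h>

/* Hypothetical message kinds; not the real 9P type codes. */
enum { Twalk, Twstat, Tclunk };

typedef struct Msg Msg;
struct Msg {
	int type;            /* Twalk, Twstat or Tclunk */
	unsigned short tag;  /* shared by every message in one batch */
	unsigned int fid;    /* fid the message operates on */
	unsigned int newfid; /* Twalk only: fid produced by the walk */
};

/*
 * Fill out[0..2] with the walk/wstat/clunk sequence for a wstat.
 * All three carry the same tag, which marks them as dependent.
 * Returns the number of messages written.
 */
size_t
wstatbatch(Msg out[3], unsigned short tag, unsigned int rootfid, unsigned int newfid)
{
	out[0] = (Msg){Twalk, tag, rootfid, newfid};
	out[1] = (Msg){Twstat, tag, newfid, 0};
	out[2] = (Msg){Tclunk, tag, newfid, 0};
	return 3;
}
```

The point of the shared tag is that the server can treat the group as one unit: if the walk fails, the wstat and clunk that depend on it have nothing to act on.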

As for the implementation, the new devmnt api was explained
in terms of promises. Each rpc request generates a "promise"
that can be blocked on later, once the result is needed.
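
A rough single-threaded sketch of the promise idea, in portable C. Everything here (Promise, Mnt, rpcstart, rpcwait, and the fake server that replies with req+1) is an assumption made up for illustration; the real devmnt interface is not shown in this message.

```c
#include <stddef.h>

enum { Maxrpc = 8 };

typedef struct Promise Promise;
struct Promise {
	int req;   /* request payload */
	int done;  /* set once the reply has arrived */
	int reply; /* valid only when done is set */
};

typedef struct Mnt Mnt;
struct Mnt {
	Promise rpc[Maxrpc]; /* RPCs issued by one process; slots are never reused in this sketch */
	int n;
};

/* Issue a request without blocking; returns NULL if too many are in flight. */
Promise*
rpcstart(Mnt *m, int req)
{
	Promise *p;

	if(m->n >= Maxrpc)
		return NULL;
	p = &m->rpc[m->n++];
	p->req = req;
	p->done = 0;
	p->reply = 0;
	return p;
}

/* Stand-in for the server: answers everything outstanding with req+1. */
static void
serverrun(Mnt *m)
{
	int i;

	for(i = 0; i < m->n; i++)
		if(!m->rpc[i].done){
			m->rpc[i].reply = m->rpc[i].req + 1;
			m->rpc[i].done = 1;
		}
}

/* Block until the promise is fulfilled, then return its reply. */
int
rpcwait(Mnt *m, Promise *p)
{
	while(!p->done)
		serverrun(m);
	return p->reply;
}
```

The caller can start several requests back to back and only pay for the round trip when it calls rpcwait on the one it needs.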

The batching on the kernel side was
accomplished using their 'later' device: essentially a lazy
evaluation of a channel to a mount point. The later device
hands its own channels back from walks, waiting for the real
use of the fid to be given by userspace. Once it is known
what the user is doing with the file (wstat, stat),
the walk, action, and (possibly) clunk are all sent together.
The later device keeps this transparent to the rest of the kernel.
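
The deferral can be pictured with a small portable C sketch: walks only record path elements, and nothing is described for sending until the operation is known. Later, laterwalk and laterflush are hypothetical names, and the output string merely stands in for the batch that would go to the server.

```c
#include <stdio.h>
#include <string.h>

enum { Maxpath = 128 };

typedef struct Later Later;
struct Later {
	char path[Maxpath]; /* walk elements collected so far; no RPC sent yet */
};

/* Record a walk element locally instead of sending a walk. Returns 0, or -1 if the path is too long. */
int
laterwalk(Later *l, const char *elem)
{
	size_t n;

	n = strlen(l->path);
	if(n + 1 + strlen(elem) + 1 > Maxpath)
		return -1;
	l->path[n] = '/';
	strcpy(l->path + n + 1, elem);
	return 0;
}

/* Once the operation is known, describe the whole batch in out. */
int
laterflush(Later *l, const char *op, char *out, size_t nout)
{
	return snprintf(out, nout, "walk %s; %s; clunk", l->path, op);
}
```

A Later must start zeroed so that path is the empty string.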

The last thing discussed is the read ahead and write behind. Now that
the kernel can submit dependent concurrent 9p messages, it can use them
for actual IO. The ability to read ahead on a mnt point was signaled
by the user through the MCACHE mount option, which allowed the kernel to
eagerly fill the cache with readahead calls. The cache tracked
the highest offset in the file known to contain data, and would attempt
to service requests under that offset from within the cache. It interpreted a short
read to indicate EOF.
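
A minimal sketch of that cache rule in portable C, assuming a single contiguous cached range starting at offset 0. Rcache, cachefill and cacheread are hypothetical names, and the fixed-size buffer is only there to keep the example short.

```c
#include <string.h>

enum { Cachesz = 64 };

typedef struct Rcache Rcache;
struct Rcache {
	char data[Cachesz];
	long valid; /* bytes [0, valid) are known to hold file data */
	int eof;    /* set once a short read has been seen */
};

/* Record a read-ahead reply: n bytes arrived at offset c->valid when want were requested. */
void
cachefill(Rcache *c, const char *buf, long n, long want)
{
	if(n < want)
		c->eof = 1; /* short read taken to mean end of file */
	if(n > Cachesz - c->valid)
		n = Cachesz - c->valid;
	memcpy(c->data + c->valid, buf, n);
	c->valid += n;
}

/*
 * Try to serve a read of n bytes at offset off from the cache.
 * Returns the bytes copied, or -1 when the server has to be asked.
 */
long
cacheread(Rcache *c, char *buf, long n, long off)
{
	if(off >= c->valid)
		return c->eof ? 0 : -1;
	if(off + n > c->valid){
		if(!c->eof)
			return -1; /* partly beyond the cached range */
		n = c->valid - off;
	}
	memcpy(buf, c->data + off, n);
	return n;
}
```

Note how much rests on the short-read rule: for a synthetic file a short read does not mean EOF, which is why the kernel needs to know what kind of file it is caching.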

The write behind is similar, but requested by the userspace program with a magic open
flag, OBEHIND. This allowed the kernel to submit write rpc requests without waiting for their responses before
returning to userspace. This was accompanied by a new fdflush system call, for a userspace program to sync
and block on outstanding write rpcs.
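
The shape of write behind can be modeled in a few lines of portable C. This is only a toy: Wfd, wbwrite and wbflush are hypothetical names (the actual fdflush signature is not given here), and the "server" is a byte limit after which writes fail.

```c
enum { Maxpend = 16 };

typedef struct Wfd Wfd;
struct Wfd {
	long pend[Maxpend]; /* sizes of writes sent but not yet acknowledged */
	int npend;
	long failafter;     /* stand-in server: refuses to go past this many bytes */
	long committed;     /* bytes the server has accepted */
};

/* Queue a write and return at once, as if it had succeeded. */
long
wbwrite(Wfd *f, long n)
{
	if(f->npend >= Maxpend)
		return -1;
	f->pend[f->npend++] = n;
	return n;
}

/* Wait for all outstanding writes. Returns 0, or -1 if any of them failed. */
int
wbflush(Wfd *f)
{
	int i, err;

	err = 0;
	for(i = 0; i < f->npend; i++){
		if(f->committed + f->pend[i] > f->failafter){
			err = -1; /* this write and everything after it is lost */
			break;
		}
		f->committed += f->pend[i];
	}
	f->npend = 0;
	return err;
}
```

In this model a write can report success and only the later flush reveals that it failed, by which point the caller may no longer hold the data.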

Not covered here (but detailed in the paper) is the rework that was done to the mount table and to how
chans keep their current path stored. In general it seemed like this was a simplification that followed from their removal of the
client submitting ".." walks. But it's a bit hard for me to fully conceptualize these changes without having dug
into the code.


Like I said, a lot of ideas in here. So let's start chewing through them.

The concept of the later device, and the rpc batching in general, mostly makes sense.
Specifically, this 'later' device seems like a great mechanism for doing this. There is a bit of a shortcoming
in how the later device has to handle channels that are part of a union. If a union is crossed then we can no longer
defer these walks; they have to be evaluated immediately. Of course the underlying channel used for the mnt point itself
could refer to a channel given by this 'later' device, but the details are fuzzy here.

Of course the batching itself is reliant on the devmnt changes to permit multiple in-flight rpc requests by a single
process, along with the server's ability to receive multiple dependent operations concurrently. As a first step,
I find these hard to disagree with.

The read ahead also sounds nice. When we were discussing this at the hackathon there was a desire for the kernel to know whether
a file was 'synthetic', à la /dev/kbd, or a 'disk' file, something exported from cwfs, hjfs, fossil, etc. We had discussed
the server providing this within the QID type; the approach here seems to infer that an entire mnt point contains 'disk' files
through the MCACHE option. The issue I take with this is that the 9p server isn't giving this information; it is being inferred
from how the user mounted it. This seems to put the information in the wrong place: I think it is a bit strange to expect a user to know
whether a file server is going to serve 'synthetic' files or not. I would much prefer the server giving this type of information itself.

For the write behind, I am not convinced this is the "write" direction. A magic open flag and magic system calls to sync seem too specific.
It seems it would be difficult for a program to know, given just a path, whether a write behind is appropriate or not. Not to mention that
using fdflush requires quite a few code changes. When ori and I were discussing these kinds of 'deferred' system calls, our benchmark
for design was: "What would it look like to support this in cat", and the error handling here gets ugly. If you are using write behind you can't
toss what you wrote away, because the server could error and you need that data back to submit the next request. Cat would have to be changed to
keep the data it read() around until it knows for sure that data has been committed by the file server. To me this just seems like a complicated
kernel dance for doing what a Bio writer can accomplish purely in userspace. The key difference here is that the Bio method only works for sequential
writes, not random writes. But I don't think we're missing much there.
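
For comparison, a userspace buffered writer in the spirit of Bio can be sketched in portable C as below. This is not the libbio API; Wbuf, wwrite and wflush are hypothetical names, and the out array stands in for the file so the example is self-contained.

```c
#include <string.h>

enum { Bufsz = 8 };

typedef struct Wbuf Wbuf;
struct Wbuf {
	char buf[Bufsz]; /* data not yet handed to the file */
	int n;           /* bytes currently buffered */
	char out[64];    /* stand-in for the file */
	int nout;        /* bytes written to the stand-in file */
	int nwrites;     /* number of underlying writes issued */
};

/* Push the buffered bytes out in one underlying write. On failure the data stays in buf. */
int
wflush(Wbuf *b)
{
	if(b->n == 0)
		return 0;
	if(b->nout + b->n > (int)sizeof b->out)
		return -1;
	memcpy(b->out + b->nout, b->buf, b->n);
	b->nout += b->n;
	b->nwrites++;
	b->n = 0;
	return 0;
}

/* Append n bytes sequentially, flushing only when the buffer is full. */
long
wwrite(Wbuf *b, const char *p, long n)
{
	long i;

	for(i = 0; i < n; i++){
		if(b->n == Bufsz && wflush(b) < 0)
			return -1;
		b->buf[b->n++] = p[i];
	}
	return n;
}
```

Several small writes become one large one, and the data is still in the program's own buffer when a flush fails, so the retry problem never reaches the kernel. As noted, this only helps sequential writes.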
