From: Jacob Moody <moody@posixcafe.org>
To: 9front@9front.org
Date: Fri, 19 Aug 2022 07:21:59 -0600
Subject: Re: [9front] hackathon writeup

On 8/19/22 05:34, Stuart Morrow wrote:
> On 19/08/2022, Jacob Moody wrote:
>> https://github.com/fjballest/nixMarkIV/blob/master/port/devmnt.c
>>
>> I think you're talking about this? Based on the context you seem
>> to be talking about it in.
>
> Yeah, I guess so? I said NxM rather than NIX Mark IV because I
> recall seeing Creepy[1] and the 9P.ix Fossil[2] in there. But it seems
> there's no sign of support for it in the kernel (/sys/src/nxm). Weird.
>
> [1] I was right: https://github.com/rminnich/NxM/tree/master/sys/src/cmd/creepy
> [2] I was not right:
> https://github.com/rminnich/NxM/blob/master/sys/src/cmd/fossil/9p.c#L1133

Ah, thanks. I took some time reading through the paper ori had posted.
There were quite a number of interesting ideas in it; I appreciate you
mentioning this. I wanted to talk through some of these improvements,
both so my understanding can be corrected and for others who wish to
discuss them.

Firstly, it seems namec in the kernel was adapted to produce batches of
requests in a single burst. At the 9p layer this manifested as sending
multiple dependent requests to the server under the same tag. For
example, a wstat involves the following requests:

	- walk from the rootfid to the file, getting a new fid
	- wstat that new fid with the user-provided data
	- clunk the fid now that it is no longer being used

These three requests would all be sent to the server together, under
the same tag, marking them as dependent.
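To make that concrete, the wire traffic for such a batch might look
something like this; this is my own illustration of the idea, not a
trace from their code, and the exact layout is assumed:

	Twalk  tag 1 fid 0 newfid 7 nwname 2 "usr" "glenda"
	Twstat tag 1 fid 7 stat ...
	Tclunk tag 1 fid 7
	Rwalk  tag 1 nwqid 2 ...
	Rwstat tag 1
	Rclunk tag 1

All three T-messages go out under one tag; a conventional client would
have had to wait for each R-message before sending the next request
under a fresh tag.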
In order to facilitate this batching, devmnt was changed to allow each
process to have a number of RPCs in flight at once. The real change at
the protocol level is the batching itself, plus a move of
open-with-truncation to being just a create instead of an open; the
latter seems like a general (but good) bugfix.

As an implementation detail, the new devmnt API is explained in terms
of promises: each RPC request generates a "promise" that can be blocked
on later, once the result is needed. On the kernel side, the logistics
of batching are handled by their 'later' device, essentially a lazy
evaluation of a channel to a mount point. The later device hands its
own Chans back from walks, waiting for the real use of the fid to
arrive from userspace. Once it is known what the user is doing with the
file (wstat, stat), the walk, the action, and (possibly) the clunk are
all sent together. The later device keeps all of this transparent to
the rest of the kernel.

The last things discussed are read ahead and write behind. Now that the
kernel can submit dependent, concurrent 9p messages, it can use them
for actual IO. Read ahead on a mount point is signaled by the user
through the MCACHE mount option, which allows the kernel to eagerly
fill the cache with read-ahead calls. The cache keeps track of the last
offset known to contain data and attempts to service requests under
that offset from the cache; a short read is interpreted as EOF. Write
behind is similar, but requested by the userspace program with a magic
open flag, OBEHIND. This lets the kernel submit write RPCs without
waiting for their responses before returning to userspace. It is
accompanied by a new fdflush system call, with which a userspace
program can sync and block on outstanding write RPCs.

Not specifically mentioned here (but detailed in the paper) is the
rework done to the mount table and to how Chans store their current
path. In general this seemed to be a simplification following from
their removal of client-submitted ".." walks, but it's a bit hard for
me to fully conceptualize these changes without having dug into the
code.

Like I said, there are a lot of ideas in here, so let's start chewing
through them. The concept of the later device, and the RPC batching in
general, mostly make sense; the 'later' device in particular seems like
a great mechanism for this. There is a bit of a shortcoming in how the
later device has to handle channels that are part of a union: if a
union is crossed, the walks can no longer be deferred and have to be
evaluated immediately. Of course, the underlying channel used for the
mount point itself could refer to a channel given out by the 'later'
device, but the details are fuzzy to me here. The batching itself
relies on the devmnt changes permitting multiple in-flight RPCs per
process, along with the server's ability to receive multiple dependent
operations concurrently. As a first step, these are hard to disagree
with.

The read ahead also sounds nice. When we were discussing this at the
hackathon, there was a desire for the kernel to know whether a file is
'synthetic', a la /dev/kbd, or a 'disk' file: something exported from
cwfs, hjfs, fossil, etc. We had discussed the server providing this
within the QID type; the approach here instead infers that an entire
mount point contains 'disk' files from the MCACHE option. The issue I
take with this is that the 9p server isn't giving this information; it
is being inferred from how the user mounted it. That places the
information in the wrong place: it is a bit strange to expect a user to
know whether a file server is going to serve 'synthetic' files or not.
I would much prefer the server provide this type of information itself.
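As a rough sketch of what the server-advertised alternative could look
like (purely my own illustration: QTCACHED is a made-up bit, and a real
value would have to be chosen to not clash with the existing QT*
constants in <libc.h>; QTDIR and QTAUTH are the real ones):

	#include <u.h>
	#include <libc.h>

	/*
	 * Sketch: the server sets a qid.type bit on 'disk' files whose
	 * contents are stable enough for the kernel to read ahead.
	 */
	enum {
		QTCACHED = 0x1,	/* hypothetical: safe to cache/read ahead */
	};

	static int
	cacheable(Qid q)
	{
		/* never read ahead directories or auth files */
		if(q.type & (QTDIR|QTAUTH))
			return 0;
		return (q.type & QTCACHED) != 0;
	}

The kernel could then decide per file rather than per mount, and a
synthetic file server simply never sets the bit.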
For the write behind, I am not convinced this is the "write" direction.
A magic open flag and a magic system call to sync seem too specific. It
would be difficult for a program to know, given just a path, whether
write behind is appropriate or not, not to mention that using fdflush
requires quite a few code changes. When ori and I were discussing this
kind of 'deferred' system call, our benchmark for the design was "what
would it look like to support this in cat?", and the error handling
here gets ugly. If you are using write behind you can't throw away what
you wrote, because the server could return an error and you would need
that data back to submit the next request. Cat would have to be changed
to keep the data it read() around until it knows for sure that data has
been committed by the file server (sketched below). To me this seems
like a complicated kernel dance for doing what a Bio writer can
accomplish purely in userspace. The key difference is that the Bio
method only works for sequential writes, not random writes, but I don't
think we're missing much there.
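For illustration, here is roughly what cat's loop turns into with write
behind. OBEHIND and fdflush are the paper's proposed interfaces as
described above; their exact semantics and the fdflush signature are my
own guesses:

	#include <u.h>
	#include <libc.h>

	extern int fdflush(int);	/* proposed syscall; not in stock Plan 9 */

	enum { Nbuf = 16, Bufsz = 8192 };

	/* out is assumed opened with OWRITE|OBEHIND */
	void
	catbehind(int in, int out)
	{
		char *kept[Nbuf];	/* data we may not discard yet */
		long n;
		int i;

		memset(kept, 0, sizeof kept);
		for(i = 0;; i = (i+1)%Nbuf){
			if(kept[i] == nil && (kept[i] = malloc(Bufsz)) == nil)
				sysfatal("malloc: %r");
			n = read(in, kept[i], Bufsz);
			if(n <= 0)
				break;
			/* write returns before the server has committed anything */
			if(write(out, kept[i], n) != n)
				sysfatal("write: %r");
			/* before reusing slot 0, make sure its data landed */
			if(i == Nbuf-1 && fdflush(out) < 0){
				/*
				 * the ugly part: some unknown subset of kept[]
				 * never reached the server; a real cat would
				 * have to work out which buffers to resubmit.
				 */
				sysfatal("fdflush: %r");
			}
		}
		if(fdflush(out) < 0)
			sysfatal("fdflush: %r");
	}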
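Compare the Bio version, which batches writes in userspace with the
stock bio(2) interface and needs none of the above:

	#include <u.h>
	#include <libc.h>
	#include <bio.h>

	/*
	 * Plain buffered cat: Bio batches the writes in userspace and
	 * they only hit the file server a buffer at a time.
	 */
	void
	catbio(int in, int out)
	{
		Biobuf bout;
		char buf[8192];
		long n;

		Binit(&bout, out, OWRITE);
		while((n = read(in, buf, sizeof buf)) > 0)
			if(Bwrite(&bout, buf, n) != n)
				sysfatal("write: %r");
		if(n < 0)
			sysfatal("read: %r");
		Bterm(&bout);	/* flushes any remaining buffered data */
	}

All the batching happens in the library, and the final flush is the one
place a deferred error can surface.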