From: Jacob Moody <moody@posixcafe.org>
To: 9front@9front.org
Date: Fri, 19 Aug 2022 07:21:59 -0600
Subject: Re: [9front] hackathon writeup

On 8/19/22 05:34, Stuart Morrow wrote:
> On 19/08/2022, Jacob Moody wrote:
>> https://github.com/fjballest/nixMarkIV/blob/master/port/devmnt.c
>>
>> I think you're talking about this? Based on the context you seem
>> to be talking about it in.
>
> Yeah, I guess so? I said NxM rather than NIX Mark IV because I
> recall seeing Creepy[1] and the 9P.ix Fossil[2] in there. But it seems
> there's no sign of support for it in the kernel (/sys/src/nxm). Weird.
>
> [1] I was right: https://github.com/rminnich/NxM/tree/master/sys/src/cmd/creepy
> [2] I was not right:
> https://github.com/rminnich/NxM/blob/master/sys/src/cmd/fossil/9p.c#L1133

Ah, thanks. I took some time reading through the paper ori had posted.
There were quite a number of interesting ideas in it; I appreciate you
mentioning this. I wanted to talk through some of these improvements,
both so my understanding can be corrected and for others who wish to
discuss them.

Firstly, it seems namec in the kernel was adapted to produce batches of
requests in a single burst. At the 9p layer this manifested as sending
multiple dependent requests to the server under the same tag. For
example, a wstat involves the following requests:

	- walk from the rootfid to the file, getting a new fid
	- wstat that new fid with the user-provided data
	- clunk the fid now that it is no longer being used

These three requests would all be sent to the server together, under
the same tag, marking them as dependent.
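To make that concrete, the wire traffic for such a batch might look
something like this; this is my own illustration of the idea, not a
trace from their code, and the exact layout is assumed:

	Twalk  tag 1 fid 0 newfid 7 nwname 2 "usr" "glenda"
	Twstat tag 1 fid 7 stat ...
	Tclunk tag 1 fid 7
	Rwalk  tag 1 nwqid 2 ...
	Rwstat tag 1
	Rclunk tag 1

All three T-messages go out under one tag; a conventional client would
have had to wait for each R-message before sending the next request
under a fresh tag.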
In order to facilitate this batching, devmnt was changed to allow each
process to have a number of RPCs in flight at once. The real change at
the protocol level is the batching itself, plus a move of
open-with-truncation to being just a create instead of an open; the
latter seems like a general (but good) bugfix.

As an implementation detail, the new devmnt API is explained in terms
of promises: each RPC request generates a "promise" that can be blocked
on later, once the result is needed. On the kernel side, the logistics
of batching are handled by their 'later' device, essentially a lazy
evaluation of a channel to a mount point. The later device hands its
own Chans back from walks, waiting for the real use of the fid to
arrive from userspace. Once it is known what the user is doing with the
file (wstat, stat), the walk, the action, and (possibly) the clunk are
all sent together. The later device keeps all of this transparent to
the rest of the kernel.

The last things discussed are read ahead and write behind. Now that the
kernel can submit dependent, concurrent 9p messages, it can use them
for actual IO. Read ahead on a mount point is signaled by the user
through the MCACHE mount option, which allows the kernel to eagerly
fill the cache with read-ahead calls. The cache keeps track of the last
offset known to contain data and attempts to service requests under
that offset from the cache; a short read is interpreted as EOF. Write
behind is similar, but requested by the userspace program with a magic
open flag, OBEHIND. This lets the kernel submit write RPCs without
waiting for their responses before returning to userspace. It is
accompanied by a new fdflush system call, with which a userspace
program can sync and block on outstanding write RPCs.

Not specifically mentioned here (but detailed in the paper) is the
rework done to the mount table and to how Chans store their current
path. In general this seemed to be a simplification following from
their removal of client-submitted ".." walks, but it's a bit hard for
me to fully conceptualize these changes without having dug into the
code.

Like I said, there are a lot of ideas in here, so let's start chewing
through them. The concept of the later device, and the RPC batching in
general, mostly make sense; the 'later' device in particular seems like
a great mechanism for this. There is a bit of a shortcoming in how the
later device has to handle channels that are part of a union: if a
union is crossed, the walks can no longer be deferred and have to be
evaluated immediately. Of course, the underlying channel used for the
mount point itself could refer to a channel given out by the 'later'
device, but the details are fuzzy to me here. The batching itself
relies on the devmnt changes permitting multiple in-flight RPCs per
process, along with the server's ability to receive multiple dependent
operations concurrently. As a first step, these are hard to disagree
with.

The read ahead also sounds nice. When we were discussing this at the
hackathon, there was a desire for the kernel to know whether a file is
'synthetic', a la /dev/kbd, or a 'disk' file: something exported from
cwfs, hjfs, fossil, etc. We had discussed the server providing this
within the QID type; the approach here instead infers that an entire
mount point contains 'disk' files from the MCACHE option. The issue I
take with this is that the 9p server isn't giving this information; it
is being inferred from how the user mounted it. That places the
information in the wrong place: it is a bit strange to expect a user to
know whether a file server is going to serve 'synthetic' files or not.
I would much prefer the server provide this type of information itself.
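As a rough sketch of what the server-advertised alternative could look
like (purely my own illustration: QTCACHED is a made-up bit, and a real
value would have to be chosen to not clash with the existing QT*
constants in <libc.h>; QTDIR and QTAUTH are the real ones):

	#include <u.h>
	#include <libc.h>

	/*
	 * Sketch: the server sets a qid.type bit on 'disk' files whose
	 * contents are stable enough for the kernel to read ahead.
	 */
	enum {
		QTCACHED = 0x1,	/* hypothetical: safe to cache/read ahead */
	};

	static int
	cacheable(Qid q)
	{
		/* never read ahead directories or auth files */
		if(q.type & (QTDIR|QTAUTH))
			return 0;
		return (q.type & QTCACHED) != 0;
	}

The kernel could then decide per file rather than per mount, and a
synthetic file server simply never sets the bit.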
For the write behind, I am not convinced this is the "write" direction.
A magic open flag and a magic system call to sync seem too specific. It
would be difficult for a program to know, given just a path, whether
write behind is appropriate or not, not to mention that using fdflush
requires quite a few code changes. When ori and I were discussing this
kind of 'deferred' system call, our benchmark for the design was "what
would it look like to support this in cat?", and the error handling
here gets ugly. If you are using write behind you can't throw away what
you wrote, because the server could return an error and you would need
that data back to submit the next request. Cat would have to be changed
to keep the data it read() around until it knows for sure that data has
been committed by the file server (sketched below). To me this seems
like a complicated kernel dance for doing what a Bio writer can
accomplish purely in userspace. The key difference is that the Bio
method only works for sequential writes, not random writes, but I don't
think we're missing much there.
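For illustration, here is roughly what cat's loop turns into with write
behind. OBEHIND and fdflush are the paper's proposed interfaces as
described above; their exact semantics and the fdflush signature are my
own guesses:

	#include <u.h>
	#include <libc.h>

	extern int fdflush(int);	/* proposed syscall; not in stock Plan 9 */

	enum { Nbuf = 16, Bufsz = 8192 };

	/* out is assumed opened with OWRITE|OBEHIND */
	void
	catbehind(int in, int out)
	{
		char *kept[Nbuf];	/* data we may not discard yet */
		long n;
		int i;

		memset(kept, 0, sizeof kept);
		for(i = 0;; i = (i+1)%Nbuf){
			if(kept[i] == nil && (kept[i] = malloc(Bufsz)) == nil)
				sysfatal("malloc: %r");
			n = read(in, kept[i], Bufsz);
			if(n <= 0)
				break;
			/* write returns before the server has committed anything */
			if(write(out, kept[i], n) != n)
				sysfatal("write: %r");
			/* before reusing slot 0, make sure its data landed */
			if(i == Nbuf-1 && fdflush(out) < 0){
				/*
				 * the ugly part: some unknown subset of kept[]
				 * never reached the server; a real cat would
				 * have to work out which buffers to resubmit.
				 */
				sysfatal("fdflush: %r");
			}
		}
		if(fdflush(out) < 0)
			sysfatal("fdflush: %r");
	}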
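Compare the Bio version, which batches writes in userspace with the
stock bio(2) interface and needs none of the above:

	#include <u.h>
	#include <libc.h>
	#include <bio.h>

	/*
	 * Plain buffered cat: Bio batches the writes in userspace and
	 * they only hit the file server a buffer at a time.
	 */
	void
	catbio(int in, int out)
	{
		Biobuf bout;
		char buf[8192];
		long n;

		Binit(&bout, out, OWRITE);
		while((n = read(in, buf, sizeof buf)) > 0)
			if(Bwrite(&bout, buf, n) != n)
				sysfatal("write: %r");
		if(n < 0)
			sysfatal("read: %r");
		Bterm(&bout);	/* flushes any remaining buffered data */
	}

All the batching happens in the library, and the final flush is the one
place a deferred error can surface.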