* [9fans] venti /plan9port mmapped
@ 2026-01-02 19:54 wb.kloke
2026-01-02 20:39 ` ori
0 siblings, 1 reply; 92+ messages in thread
From: wb.kloke @ 2026-01-02 19:54 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 935 bytes --]
Further trying to sanitize venti, I decided to convert my mventi (no isects) to mmap IO.
At a first stage I mmap the whole venti arena partition (current size 80GB, filled to 44GB). On a 64-bit processor this is not a big thing, and I replace the file operations (rwpart in part.c and readarena in arena.c) with memmove.
As it works on amd64 FreeBSD, I can now eliminate the block cache and its complications (IO over block boundaries) from the source, making the remaining code much clearer.
The price is that I have to rely on demand paging by the OS, so it is probably not easily portable to plan9 or 9front.
Can anybody give me a hint how to map a partition to a segment in 9front?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M6862409e4512b86c28587d83
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1570 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-02 19:54 [9fans] venti /plan9port mmapped wb.kloke
@ 2026-01-02 20:39 ` ori
2026-01-02 20:58 ` Bakul Shah via 9fans
2026-01-02 21:01 ` ori
0 siblings, 2 replies; 92+ messages in thread
From: ori @ 2026-01-02 20:39 UTC (permalink / raw)
To: 9fans
You may want to read this: https://db.cs.cmu.edu/mmap-cidr2022/
Quoth wb.kloke@gmail.com:
> Further trying to sanitize venti, I decided to change my mventi (no isects) for mmap IO.
>
> At a first stage I mmap the whole venti i arena partition (current size 80GB, filled to 44GB). On a 64bit processor this is not a big thing, and replace the file operations (rwpart in part.c and readarena in arena.c) by memmove.
>
> As it works on amd64 FreeBSD, I can now eliminate the block cache and its complications (IO over block boundaries) from the source making the remaining source much clearer.
>
> The price is, that I have to rely on demand paging of the OS. So it is probably not easily portable to plan9 or 9front.
>
> Can any body give ime a hint, how to map a partition to a segment in 9front?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M1c4328c5437366d50eb39ffd
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-02 20:39 ` ori
@ 2026-01-02 20:58 ` Bakul Shah via 9fans
2026-01-06 22:59 ` Ron Minnich
2026-01-02 21:01 ` ori
1 sibling, 1 reply; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-01-02 20:58 UTC (permalink / raw)
To: 9fans
Might be an opportunity for optimizing mmap + virtual memory architecture for database-specific applications... (should be worth at least a few papers for the academic crowd).
> On Jan 2, 2026, at 12:39 PM, ori@eigenstate.org wrote:
>
> You may want to read this: https://db.cs.cmu.edu/mmap-cidr2022/
>
> Quoth wb.kloke@gmail.com:
>> Further trying to sanitize venti, I decided to change my mventi (no isects) for mmap IO.
>>
>> At a first stage I mmap the whole venti i arena partition (current size 80GB, filled to 44GB). On a 64bit processor this is not a big thing, and replace the file operations (rwpart in part.c and readarena in arena.c) by memmove.
>>
>> As it works on amd64 FreeBSD, I can now eliminate the block cache and its complications (IO over block boundaries) from the source making the remaining source much clearer.
>>
>> The price is, that I have to rely on demand paging of the OS. So it is probably not easily portable to plan9 or 9front.
>>
>> Can any body give ime a hint, how to map a partition to a segment in 9front?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M227c416862f61d52b01bd98f
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-02 20:39 ` ori
2026-01-02 20:58 ` Bakul Shah via 9fans
@ 2026-01-02 21:01 ` ori
2026-01-08 15:59 ` wb.kloke
1 sibling, 1 reply; 92+ messages in thread
From: ori @ 2026-01-02 21:01 UTC (permalink / raw)
To: 9fans
Quoth ori@eigenstate.org:
> > Can any body give ime a hint, how to map a partition to a segment in 9front?
Using read and write; there is no memory mapping, and I don't think we want it.
It doesn't work very well when reading from a networked file system, it makes
error handling unnecessarily difficult, and if you're using it for writing, it's
very fiddly when it comes to committing data.
There may be an argument for read-only mapping of small files, demand paging
style, but things fall apart when the working set size gets larger than memory.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-Mc99c287a545807b040babaf5
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-02 20:58 ` Bakul Shah via 9fans
@ 2026-01-06 22:59 ` Ron Minnich
2026-01-07 4:27 ` Noam Preil
` (3 more replies)
0 siblings, 4 replies; 92+ messages in thread
From: Ron Minnich @ 2026-01-06 22:59 UTC (permalink / raw)
To: 9fans
back in the NIX days, when we had 32GiB of memory mapped with 32 1-G
PTEs, I wrote a trivial venti that ONLY used DRAM. That was easy.
Because you can keep a single machine with 32G up for an
arbitrarily long time, I did not bother with a disk. This work was
based on the fact that lsub had used very little of their 32G coraid
server (something like 3G? I forget) over ten years: venti dedups by
its nature and that's your friend.
So maybe a pure-ram venti, with a TiB or so of memory, could work?
"disks are the work of the devil" -- jmk
ron
On Fri, Jan 2, 2026 at 1:02 PM Bakul Shah via 9fans <9fans@9fans.net> wrote:
>
> Might be an opportunity for optimizing mmap + virtual memory architecture for database specific applications.... (should be worth at least a few papers for the academic crowd).
>
> > On Jan 2, 2026, at 12:39 PM, ori@eigenstate.org wrote:
> >
> > You may want to read this: https://db.cs.cmu.edu/mmap-cidr2022/
> >
> > Quoth wb.kloke@gmail.com:
> >> Further trying to sanitize venti, I decided to change my mventi (no isects) for mmap IO.
> >>
> >> At a first stage I mmap the whole venti i arena partition (current size 80GB, filled to 44GB). On a 64bit processor this is not a big thing, and replace the file operations (rwpart in part.c and readarena in arena.c) by memmove.
> >>
> >> As it works on amd64 FreeBSD, I can now eliminate the block cache and its complications (IO over block boundaries) from the source making the remaining source much clearer.
> >>
> >> The price is, that I have to rely on demand paging of the OS. So it is probably not easily portable to plan9 or 9front.
> >>
> >> Can any body give ime a hint, how to map a partition to a segment in 9front?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M44817a56f8c46bcdac466b1b
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-06 22:59 ` Ron Minnich
@ 2026-01-07 4:27 ` Noam Preil
2026-01-07 6:15 ` Shawn Rutledge
` (2 subsequent siblings)
3 siblings, 0 replies; 92+ messages in thread
From: Noam Preil @ 2026-01-07 4:27 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 757 bytes --]
If you want to buy 1TiB of RAM to play with, you can go right ahead :p
For a pooled-together project - a shared venti - that might still be practical, at least with slower memory, but - 1TiB of HDD space is like $20. I'm not going to bother checking the spot price of 1TiB of RAM because... it has a spot price now :/
And it's not terribly much code to do a better venti using disks.
Could definitely build the index entirely into memory on startup and have less disk code, too, but I'm not convinced it's worth it.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M9f9a4833f888ca1ec34f3c0f
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1662 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-06 22:59 ` Ron Minnich
2026-01-07 4:27 ` Noam Preil
@ 2026-01-07 6:15 ` Shawn Rutledge
2026-01-07 15:46 ` Persistent memory (was Re: [9fans] venti /plan9port mmapped) arnold
2026-01-07 8:52 ` [9fans] venti /plan9port mmapped wb.kloke
2026-01-07 14:57 ` Thaddeus Woskowiak
3 siblings, 1 reply; 92+ messages in thread
From: Shawn Rutledge @ 2026-01-07 6:15 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1979 bytes --]
> On Jan 6, 2026, at 23:59, Ron Minnich <rminnich@p9f.org> wrote:
>
> back in the NIX days, when we had 32GiB of memory mapped with 32 1-G
> PTEs, I wrote a trivial venti that ONLY used dram. That was easy.
> Because you can keep up a single machine with 32G up for an
> arbitrarily long time, I did not bother with a disk. This work was
> based on the fact that lsub had used very little of their 32G coraid
> server (something like 3G? I forget) over ten years: venti dedups by
> its nature and that's your friend.
>
> So maybe a pure-ram venti, with a TiB or so of memory, could work?
>
> "disks are the work of the devil" — jmk
I’ve been wondering why it’s still so rare to map persistent storage to memory addresses, in hardware. It seemed like Intel Optane was going to go there, for a while, then they just gave up on the idea. And core memory was already persistent, back in the day.
I think universal memory should happen eventually; and to prepare for that, software design should go towards organizing data the same in memory as on storage: better packing rather than lots of randomness in the heap, and memory-aligned structures. Local file I/O might become mostly unnecessary, but could continue as an abstraction to organize things in memory, at the cost of having to keep writing I/O code. So if that’s where we are going, mmap is a good thing to have. But yeah, maybe it’s more hassle as an abstraction for network-attached storage.
Wasn’t this sort of thing being done in the PDA era? I never developed for the Newton, but I think a “soup” is such a persistent structure. And maybe whatever Smalltalk does with its sandboxes. Is it just a persistent heap or is it organized better?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-Mbf113fadcd84d606d155d602
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 10619 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-06 22:59 ` Ron Minnich
2026-01-07 4:27 ` Noam Preil
2026-01-07 6:15 ` Shawn Rutledge
@ 2026-01-07 8:52 ` wb.kloke
2026-01-07 16:30 ` mmaping on plan9? (was " Bakul Shah via 9fans
2026-01-07 14:57 ` Thaddeus Woskowiak
3 siblings, 1 reply; 92+ messages in thread
From: wb.kloke @ 2026-01-07 8:52 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 633 bytes --]
On Tuesday, 6 January 2026, at 11:59 PM, Ron Minnich wrote:
> So maybe a pure-ram venti, with a TiB or so of memory, could work?
"disks are the work of the devil" -- jmk
Of course it would work, but imho it would be pointless. Why would anybody use content-addressed storage on data which is not in persistent memory?
Most of the code of a venti is concerned with making backups easy.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M5b0ec55532fb3ef9b2466ae8
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1215 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-06 22:59 ` Ron Minnich
` (2 preceding siblings ...)
2026-01-07 8:52 ` [9fans] venti /plan9port mmapped wb.kloke
@ 2026-01-07 14:57 ` Thaddeus Woskowiak
2026-01-07 16:07 ` Wes Kussmaul
2026-01-07 16:13 ` Noam Preil
3 siblings, 2 replies; 92+ messages in thread
From: Thaddeus Woskowiak @ 2026-01-07 14:57 UTC (permalink / raw)
To: 9fans
On Tue, Jan 6, 2026 at 7:03 PM Ron Minnich <rminnich@p9f.org> wrote:
>
> So maybe a pure-ram venti, with a TiB or so of memory, could work?
>
If you can afford it thanks to the AI surge. 1TB of DDR5 is around $25,000 USD.
I am quite thankful that my 8GB CPU server is "overkill" for my current needs.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M1bd713b8e014eb23e00b4928
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Persistent memory (was Re: [9fans] venti /plan9port mmapped)
2026-01-07 6:15 ` Shawn Rutledge
@ 2026-01-07 15:46 ` arnold
2026-01-07 16:11 ` Noam Preil
0 siblings, 1 reply; 92+ messages in thread
From: arnold @ 2026-01-07 15:46 UTC (permalink / raw)
To: 9fans
Shawn Rutledge <lists@ecloud.org> wrote:
> I’ve been wondering why it’s still so rare to map persistent storage
> to memory addresses, in hardware. It seemed like Intel Optane was
> going to go there, for a while, then they just gave up on the idea.
Because Intel doesn't understand any kind of product except CPUs. :-(
> I think universal memory should happen eventually; and to prepare for
> that, software design should go towards organizing data the same in
> memory as on storage: better packing rather than lots of randomness in
> the heap, and memory-aligned structures. Local file I/O might become
> mostly unnecessary, but could continue as an abstraction to organize
> things in memory, at the cost of having to keep writing I/O code. So if
> that’s where we are going, mmap is a good thing to have. But yeah,
> maybe it’s more hassle as an abstraction for network-attached storage.
A web search shows that there are several options for persistent memory
allocators, many of which I didn't know about. However, gawk has
been using one for a few years. See https://dl.acm.org/doi/10.1145/3643886
and https://web.eecs.umich.edu/~tpkelly/pma/. It's built on top of
mmap() and only for 64-bit *nix systems.
For the short instructions on using it, see
https://www.gnu.org/software/gawk/manual/html_node/Persistent-Memory.html.
For more details, see
https://www.gnu.org/software/gawk/manual/pm-gawk/pm-gawk.html.
Arnold
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T84cf4042bdd1a74b-M153c99a3b6b8ac59c01977f5
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-07 14:57 ` Thaddeus Woskowiak
@ 2026-01-07 16:07 ` Wes Kussmaul
2026-01-07 16:22 ` Noam Preil
2026-01-07 16:13 ` Noam Preil
1 sibling, 1 reply; 92+ messages in thread
From: Wes Kussmaul @ 2026-01-07 16:07 UTC (permalink / raw)
To: 9fans
On 1/7/26 9:57 AM, Thaddeus Woskowiak wrote:
> On Tue, Jan 6, 2026 at 7:03 PM Ron Minnich <rminnich@p9f.org> wrote:
>> So maybe a pure-ram venti, with a TiB or so of memory, could work?
>>
> If you can afford it thanks to the AI surge. 1TB of DDR5 is around $25,000 USD.
>
> I am quite thankful that my 8GB CPU server is "overkill" for my current needs.
And I'm thankful that demand spikes like this end up generating
overproduction, meaning this time next year DDR5 will be cheaper than ever.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M3bcd391a57c4cc6bd57edc97
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: Persistent memory (was Re: [9fans] venti /plan9port mmapped)
2026-01-07 15:46 ` Persistent memory (was Re: [9fans] venti /plan9port mmapped) arnold
@ 2026-01-07 16:11 ` Noam Preil
2026-01-07 17:26 ` Wes Kussmaul
0 siblings, 1 reply; 92+ messages in thread
From: Noam Preil @ 2026-01-07 16:11 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 331 bytes --]
> Because Intel doesn't understand any kind of product except CPUs. :-(
Intel understands CPUs? ;)
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T84cf4042bdd1a74b-M6098fa8b8affb66eaa74773e
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1025 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-07 14:57 ` Thaddeus Woskowiak
2026-01-07 16:07 ` Wes Kussmaul
@ 2026-01-07 16:13 ` Noam Preil
1 sibling, 0 replies; 92+ messages in thread
From: Noam Preil @ 2026-01-07 16:13 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 321 bytes --]
> 1TB of DDR5 is around $25,000 USD.
...I knew I didn't want to look it up, holy _crap_.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-Md6144477702f7007cc456012
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1015 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-07 16:07 ` Wes Kussmaul
@ 2026-01-07 16:22 ` Noam Preil
2026-01-07 17:31 ` Wes Kussmaul
0 siblings, 1 reply; 92+ messages in thread
From: Noam Preil @ 2026-01-07 16:22 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 851 bytes --]
> And I'm thankful that demand spikes like this end up generating overproduction, meaning this time next year DDR5 will be cheaper than ever.
The memory makers are on the record saying they're not going to increase production because, in essence, they know the bubble is going to pop and want to make sure that their prices stay high afterwards.
In other words, they know that demand spikes lead to overproduction and decided they'd rather avoid meeting the current demand than have oversupply later.
Of course, I'd say the same thing if I wanted people buying _now_ and assuming things won't be cheaper later...
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-Mfee471c8b0cadbeaac39a1bc
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1701 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 8:52 ` [9fans] venti /plan9port mmapped wb.kloke
@ 2026-01-07 16:30 ` Bakul Shah via 9fans
2026-01-07 16:40 ` Noam Preil
` (2 more replies)
0 siblings, 3 replies; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-01-07 16:30 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 2335 bytes --]
I have this idea that will horrify most of you!
1. Create an mmap device driver. You ask it for a new file handle which you use to communicate about memory mapping.
2. If you want to mmap some file, you open it and write its file descriptor along with other parameters (file offset, base addr, size, mode, flags) to your mmap file handle.
3. The mmap driver sets up necessary page table entries but doesn't actually fetch any data before returning from the write.
4. It can asynchronously kick off io requests on your behalf and fixup page table entries as needed.
5. Page faults in the mmapped area are serviced by making appropriate read/write calls.
6. Flags can be used to indicate read-ahead or write-behind for typical serial access.
7. Similarly msync, munmap etc. can be implemented.
In a sneaky way this avoids the need for adding any mmap specific syscalls! But the underlying work would be mostly similar in either case.
The main benefits of mmap are reduced initial latency, a "pay as you go" cost structure, and ease of use. It is certainly more expensive than reading/writing the same amount of data directly from a program.
No idea how horrible a hack is needed to implement such a thing or even if it is possible at all but I had to share this ;-)
> On Jan 7, 2026, at 12:52 AM, wb.kloke@gmail.com wrote:
>
> On Tuesday, 6 January 2026, at 11:59 PM, Ron Minnich wrote:
>> So maybe a pure-ram venti, with a TiB or so of memory, could work? "disks are the work of the devil" -- jmk
> Of course, it would work, but imho it would be pointless. Why would anybody use content-addressed storage on data, which are not on persistent memory?
>
> Most of the code of a venti is concerned with the task making a backup easy.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M79241f0580350c49710d5dde
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 3045 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 16:30 ` mmaping on plan9? (was " Bakul Shah via 9fans
@ 2026-01-07 16:40 ` Noam Preil
2026-01-07 16:41 ` ori
2026-01-07 16:52 ` ori
2 siblings, 0 replies; 92+ messages in thread
From: Noam Preil @ 2026-01-07 16:40 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 316 bytes --]
Honestly i think this would be a great experiment regardless of its practical usage :)
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M586930ee0b23d0b0562f1eba
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 944 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 16:30 ` mmaping on plan9? (was " Bakul Shah via 9fans
2026-01-07 16:40 ` Noam Preil
@ 2026-01-07 16:41 ` ori
2026-01-07 20:35 ` Bakul Shah via 9fans
2026-01-07 16:52 ` ori
2 siblings, 1 reply; 92+ messages in thread
From: ori @ 2026-01-07 16:41 UTC (permalink / raw)
To: 9fans
Quoth Bakul Shah via 9fans <9fans@9fans.net>:
> I have this idea that will horrify most of you!
>
> 1. Create an mmap device driver. You ask it to a new file handle which you use to communicate about memory mapping.
> 2. If you want to mmap some file, you open it and write its file descriptor along with other parameters (file offset, base addr, size, mode, flags) to your mmap file handle.
> 3. The mmap driver sets up necessary page table entries but doesn't actually fetch any data before returning from the write.
> 4. It can asynchronously kick off io requests on your behalf and fixup page table entries as needed.
> 5. Page faults in the mmapped area are serviced by making appropriate read/write calls.
> 6. Flags can be used to indicate read-ahead or write-behind for typical serial access.
> 7. Similarly msync, munmap etc. can be implemented.
>
> In a sneaky way this avoids the need for adding any mmap specific syscalls! But the underlying work would be mostly similar in either case.
>
> The main benefits of mmap are reduced initial latency , "pay as you go" cost structure and ease of use. It is certainly more expensive than reading/writing the same amount of data directly from a program.
>
> No idea how horrible a hack is needed to implement such a thing or even if it is possible at all but I had to share this ;-)
To what end? The problems with mmap have little to do with adding a syscall;
they're about how you do things like communicating I/O errors. Especially
when flushing the cache.
Imagine the following setup -- I've imported 9p.io:
9fs 9pio
and then I map a file from it:
mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
Now, I want to write something into the file:
*mapped = 1234;
The cached version of the page is dirty, so the OS will
eventually need to flush it back with a 9p Twrite; Let's
assume that before this happens, the network goes down.
How do you communicate the error with userspace?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M1ebb42ae226a10acb54e9127
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 16:30 ` mmaping on plan9? (was " Bakul Shah via 9fans
2026-01-07 16:40 ` Noam Preil
2026-01-07 16:41 ` ori
@ 2026-01-07 16:52 ` ori
2026-01-07 17:37 ` wb.kloke
2 siblings, 1 reply; 92+ messages in thread
From: ori @ 2026-01-07 16:52 UTC (permalink / raw)
To: 9fans
Quoth Bakul Shah via 9fans <9fans@9fans.net>:
>
> No idea how horrible a hack is needed to implement such a thing or even if it is possible at all but I had to share this ;-)
>
pretty horrible; so, we hard-code up to 10 segments in the kernel,
which allows things to be really simple; when N is 10, O(n) is fine.
We've got a ton of loops doing things like:
for(i = 0; i < NSEG; i++)
and it works great; the simplicity of our VM system also means that
our fork is something like 10 times faster than fork on Unix, and
If you want to start having a lot of small maps, you start to need
complex data structures to track them and look them up quickly on
page fault; things start to suck.
In addition to being superficially easy, mmap is hard to use correctly,
and very hard to use correctly when you try to write through it;
I think it's best to leave it behind.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M44cbcb5dccb0f6b946f78dfa
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: Persistent memory (was Re: [9fans] venti /plan9port mmapped)
2026-01-07 16:11 ` Noam Preil
@ 2026-01-07 17:26 ` Wes Kussmaul
0 siblings, 0 replies; 92+ messages in thread
From: Wes Kussmaul @ 2026-01-07 17:26 UTC (permalink / raw)
To: 9fans
On 1/7/26 11:11 AM, Noam Preil wrote:
> > Because Intel doesn't understand any kind of product except CPUs. :-(
>
> Intel understands CPUs? ;)
Intel understands what happens when you choose to divest your newer
product line (Xscale) that's built on the technology (ARM) that
challenges your cash cow product (x86) because "ARM is not how we do
things around here."
What they now understand about what happens when you do things like
that: smart people leave.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/T84cf4042bdd1a74b-Mfb6fd74ee170955a193b0ec4
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-07 16:22 ` Noam Preil
@ 2026-01-07 17:31 ` Wes Kussmaul
0 siblings, 0 replies; 92+ messages in thread
From: Wes Kussmaul @ 2026-01-07 17:31 UTC (permalink / raw)
To: 9fans
On 1/7/26 11:22 AM, Noam Preil wrote:
> > And I'm thankful that demand spikes like this end up generating
> overproduction, meaning this time next year DDR5 will be cheaper than
> ever.
>
> The memory makers are on the record saying they're not going to
> increase production because, in essence, they know the bubble is going
> to pop and want to make sure that their prices stay high afterwards.
>
> In other words, they know that demand spikes lead to overproduction
> and decided they'd rather avoid meeting the current demand than have
> oversupply later.
>
> Of coruse, I'd say the same thing if i wanted people buying _now_ and
> assuming things won't be cheaper later...
>
Regardless of management intent, they'll get sued by securities
ambulance chasers for taking the long view instead of optimizing
quarterly earnings.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-Mc30224b80c8601103548d87a
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 16:52 ` ori
@ 2026-01-07 17:37 ` wb.kloke
2026-01-07 17:46 ` Noam Preil
0 siblings, 1 reply; 92+ messages in thread
From: wb.kloke @ 2026-01-07 17:37 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 661 bytes --]
On Wednesday, 7 January 2026, at 5:52 PM, ori wrote:
> If you want to start having a lot of small maps, you start to need
> complex data structures to track them and look them up quickly on
> page fault; things start to suck.
Just let me remark that, in the venti use case, only 1 segment per arena partition would be needed.
Who uses more than 1 arena partition, anyway?
Remember, the isect partitions are gone already for mventi.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M34578a6b597ed109684fb875
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1233 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 17:37 ` wb.kloke
@ 2026-01-07 17:46 ` Noam Preil
2026-01-07 17:56 ` wb.kloke
0 siblings, 1 reply; 92+ messages in thread
From: Noam Preil @ 2026-01-07 17:46 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 319 bytes --]
I do 😅
Multiple disks.
Many cheap ones instead of a single big one, in my server
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M9ded26e484f52e1cb9643d3b
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1075 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 17:46 ` Noam Preil
@ 2026-01-07 17:56 ` wb.kloke
2026-01-07 18:07 ` Noam Preil
0 siblings, 1 reply; 92+ messages in thread
From: wb.kloke @ 2026-01-07 17:56 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 559 bytes --]
On Wednesday, 7 January 2026, at 6:46 PM, Noam Preil wrote:
> I do 😅
>
> Multiple disks.
>
> Many cheap ones instead of a single big one, in my server
Ok. A simple solution is: use 1 mventi process per partition, and a super process to delegate the real work, perhaps using Bloom filters to avoid spurious requests.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M33fbbcb7e39e81e7b696db1b
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1375 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 17:56 ` wb.kloke
@ 2026-01-07 18:07 ` Noam Preil
2026-01-07 18:58 ` wb.kloke
0 siblings, 1 reply; 92+ messages in thread
From: Noam Preil @ 2026-01-07 18:07 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 323 bytes --]
Which, if the whole point is to remove the complexity of the disk layer, seems a bit silly imo
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M85ea45143bf2cc5ddd8575f3
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 951 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 18:07 ` Noam Preil
@ 2026-01-07 18:58 ` wb.kloke
0 siblings, 0 replies; 92+ messages in thread
From: wb.kloke @ 2026-01-07 18:58 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 717 bytes --]
On Wednesday, 7 January 2026, at 7:08 PM, Noam Preil wrote:
> Which, if the whole point is to remove the complexity of the disk layer, seems a bit silly imo
It depends on the number of partitions.
The original venti uses one thread per arena partition, so the complexity is already catered for. Bloom filters are there, too.
A sane installation could perhaps use one or two read-only arena partitions and one writable partition, so we end up needing at most three segments. ymmv
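Mapping one segment per partition is then only a few lines (a sketch assuming POSIX mmap, with an ordinary file standing in for the partition; `mappart` and `readpart` are illustrative names, not the actual mventi source):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map one arena "partition" (an ordinary file here) read-only.
 * With the whole partition mapped, arena reads collapse to a
 * memmove out of the mapping: no block cache, no reads that
 * straddle block boundaries. */
static void *
mappart(const char *path, size_t *len)
{
	struct stat st;
	int fd = open(path, O_RDONLY);
	if(fd < 0)
		return NULL;
	if(fstat(fd, &st) < 0){
		close(fd);
		return NULL;
	}
	*len = st.st_size;
	void *m = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);                       /* the mapping survives the close */
	return m == MAP_FAILED ? NULL : m;
}

/* what a partition read becomes: a bounds check and a memmove */
static int
readpart(void *map, size_t maplen, uint64_t off, void *buf, size_t n)
{
	if(off + n > maplen)
		return -1;
	memmove(buf, (char *)map + off, n);
	return 0;
}
```

Demand paging by the OS then stands in for the eliminated block cache, exactly as described earlier in the thread.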
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mcffad8c0698cbd93526ad357
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1402 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 16:41 ` ori
@ 2026-01-07 20:35 ` Bakul Shah via 9fans
2026-01-07 21:31 ` ron minnich
2026-01-07 21:40 ` ori
0 siblings, 2 replies; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-01-07 20:35 UTC (permalink / raw)
To: 9fans
> On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
>
> Quoth Bakul Shah via 9fans <9fans@9fans.net>:
>> I have this idea that will horrify most of you!
>>
>> 1. Create an mmap device driver. You ask it to a new file handle which you use to communicate about memory mapping.
>> 2. If you want to mmap some file, you open it and write its file descriptor along with other parameters (file offset, base addr, size, mode, flags) to your mmap file handle.
>> 3. The mmap driver sets up necessary page table entries but doesn't actually fetch any data before returning from the write.
>> 4. It can asynchronously kick off io requests on your behalf and fixup page table entries as needed.
>> 5. Page faults in the mmapped area are serviced by making appropriate read/write calls.
>> 6. Flags can be used to indicate read-ahead or write-behind for typical serial access.
>> 7. Similarly msync, munmap etc. can be implemented.
>>
>> In a sneaky way this avoids the need for adding any mmap specific syscalls! But the underlying work would be mostly similar in either case.
>>
>> The main benefits of mmap are reduced initial latency , "pay as you go" cost structure and ease of use. It is certainly more expensive than reading/writing the same amount of data directly from a program.
>>
>> No idea how horrible a hack is needed to implement such a thing or even if it is possible at all but I had to share this ;-)
>
> To what end? The problems with mmap have little to do with adding a syscall;
> they're about how you do things like communicating I/O errors. Especially
> when flushing the cache.
>
> Imagine the following setup -- I've imported 9p.io:
>
> 9fs 9pio
>
> and then I map a file from it:
>
> mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
>
> Now, I want to write something into the file:
>
> *mapped = 1234;
>
> The cached version of the page is dirty, so the OS will
> eventually need to flush it back with a 9p Twrite; Let's
> assume that before this happens, the network goes down.
>
> How do you communicate the error with userspace?
This was just a brainwave but...
You have a (control) connection with the mmap device to
set up mmap so might as well use it to convey errors!
This device would be strictly local to where a program
runs.
I'd even consider allowing a separate process to mmap,
by making an address space a first class object. That'd
move more stuff out of the kernel and allow for more
interesting/esoteric uses.
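From the client side, the control-connection scheme (steps 1-7 quoted above, plus the error channel suggested here) might look like this; everything about it is hypothetical: the device name, the message format, and `fmtmapreq` are all invented for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One textual request per mapping, written to the (hypothetical)
 * mmap device's ctl file.  The driver would set up empty page
 * table entries and return; faults and I/O errors would then be
 * serviced over the same connection. */
static int
fmtmapreq(char *buf, size_t n, int fd, uint64_t off,
          uint64_t addr, uint64_t len, const char *mode)
{
	return snprintf(buf, n, "map fd %d off %llu addr %#llx len %llu mode %s",
	    fd, (unsigned long long)off, (unsigned long long)addr,
	    (unsigned long long)len, mode);
}

/* intended use (hypothetical device, so not executed here):
 *	ctl = open("/dev/mmapctl", ORDWR);
 *	fd = open("/n/9pio/plan9/lib/words", OREAD);
 *	fmtmapreq(buf, sizeof buf, fd, 0, 0x40000000, 1<<20, "r");
 *	write(ctl, buf, strlen(buf));
 *	...later reads on ctl report I/O errors for this mapping */
```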
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M6fb0ef830c8e525cb591fb48
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 20:35 ` Bakul Shah via 9fans
@ 2026-01-07 21:31 ` ron minnich
2026-01-08 7:56 ` arnold
` (2 more replies)
2026-01-07 21:40 ` ori
1 sibling, 3 replies; 92+ messages in thread
From: ron minnich @ 2026-01-07 21:31 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 3684 bytes --]
what we had planned for harvey was a good deal simpler: designate a part of
the address space as a "bounce fault to user" space area.
When a page fault in that area occurred, info about the fault was sent to
an fd (if it was opened) or a note handler.
user code could handle the fault or punt, as it saw fit. The fixup was
that user mode had to get the data to satisfy the fault, then tell the
kernel what to do.
This is much like the 35-years-ago work we did on AIX, called
external pagers at the time; or the more recent umap work,
https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
If you go this route, it's a bit less complex than what you are proposing.
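From the user side, the bounce-fault-to-user scheme might look roughly like this (entirely hypothetical: the fault-record format, the fd protocol, and the names are invented for illustration; umap and the AIX external pagers worked along similar lines):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record the kernel would write to the fault fd for
 * each page fault in the "bounce to user" region; the format and
 * the whole interface are invented here for illustration. */
typedef struct Fault Fault;
struct Fault {
	uint64_t addr;   /* faulting address */
	char     type;   /* 'r' or 'w' */
};

static int
parsefault(const char *line, Fault *f)
{
	unsigned long long a;
	char t;
	if(sscanf(line, "fault %llx %c", &a, &t) != 2)
		return -1;
	if(t != 'r' && t != 'w')
		return -1;
	f->addr = a;
	f->type = t;
	return 0;
}

/* the user-mode pager loop would then be roughly:
 *	while(read(faultfd, line, sizeof line) > 0){
 *		parsefault(line, &f);
 *		pread(datafd, page, PGSZ, f.addr - base);  // fetch the data
 *		write(faultfd, mapreply, n);               // tell the kernel
 *	}
 */
```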
On Wed, Jan 7, 2026 at 1:09 PM Bakul Shah via 9fans <9fans@9fans.net> wrote:
>
>
> > On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
> >
> > Quoth Bakul Shah via 9fans <9fans@9fans.net>:
> >> I have this idea that will horrify most of you!
> >>
> >> 1. Create an mmap device driver. You ask it to a new file handle which
> you use to communicate about memory mapping.
> >> 2. If you want to mmap some file, you open it and write its file
> descriptor along with other parameters (file offset, base addr, size, mode,
> flags) to your mmap file handle.
> >> 3. The mmap driver sets up necessary page table entries but doesn't
> actually fetch any data before returning from the write.
> >> 4. It can asynchronously kick off io requests on your behalf and fixup
> page table entries as needed.
> >> 5. Page faults in the mmapped area are serviced by making appropriate
> read/write calls.
> >> 6. Flags can be used to indicate read-ahead or write-behind for typical
> serial access.
> >> 7. Similarly msync, munmap etc. can be implemented.
> >>
> >> In a sneaky way this avoids the need for adding any mmap specific
> syscalls! But the underlying work would be mostly similar in either case.
> >>
> >> The main benefits of mmap are reduced initial latency , "pay as you go"
> cost structure and ease of use. It is certainly more expensive than
> reading/writing the same amount of data directly from a program.
> >>
> >> No idea how horrible a hack is needed to implement such a thing or even
> if it is possible at all but I had to share this ;-)
> >
> > To what end? The problems with mmap have little to do with adding a
> syscall;
> > they're about how you do things like communicating I/O errors. Especially
> > when flushing the cache.
> >
> > Imagine the following setup -- I've imported 9p.io:
> >
> > 9fs 9pio
> >
> > and then I map a file from it:
> >
> > mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
> >
> > Now, I want to write something into the file:
> >
> > *mapped = 1234;
> >
> > The cached version of the page is dirty, so the OS will
> > eventually need to flush it back with a 9p Twrite; Let's
> > assume that before this happens, the network goes down.
> >
> > How do you communicate the error with userspace?
>
> This was just a brainwave but...
>
> You have a (control) connection with the mmap device to
> set up mmap so might as well use it to convey errors!
> This device would be strictly local to where a program
> runs.
>
> I'd even consider allowing a separate process to mmap,
> by making an address space a first class object. That'd
> move more stuff out of the kernel and allow for more
> interesting/esoteric uses.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mae5eb9a90d72008533969f26
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 5772 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 20:35 ` Bakul Shah via 9fans
2026-01-07 21:31 ` ron minnich
@ 2026-01-07 21:40 ` ori
1 sibling, 0 replies; 92+ messages in thread
From: ori @ 2026-01-07 21:40 UTC (permalink / raw)
To: 9fans
Quoth Bakul Shah via 9fans <9fans@9fans.net>:
>
> You have a (control) connection with the mmap device to
> set up mmap so might as well use it to convey errors!
> This device would be strictly local to where a program
> runs.
I'm not sure how this is supposed to work; I assume
you'd need to fork a proc to handle errors, since
you would need to block the faulting process while
figuring out how to re-map the page so that the
memory access could be handled?
If the flush was done in the background due to cache
pressure, the code that had done the failing I/O
operation may well have moved on, so you'd need
a complicated mechanism coded into your software
to know how long ago your writes had failed.
Memory isn't a good abstraction for failable i/o.
It looks simple, but makes error handling impossibly
hard.
I think the only time it's acceptably behaved is when
it's read-only, you're willing to treat i/o errors as
irrecoverable, and you don't care about random pauses
in your program's performance profile. I don't think
that situation is all that common.
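For that read-only case, POSIX at least gives the fault a name: a failed page-in of a file-backed mapping is delivered as SIGBUS, and the only recovery is non-local. A sketch (using access beyond EOF to provoke the SIGBUS, since a real device error is hard to stage):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Touching a mapped page that can't be paged in raises SIGBUS,
 * the same signal a read-only mapping gets on a real I/O error.
 * The only sane recovery is to jump out and treat the data as
 * irrecoverable. */
static sigjmp_buf onerr;

static void
buserr(int sig)
{
	(void)sig;
	siglongjmp(onerr, 1);
}

int
readbyte(volatile char *p, char *out)
{
	struct sigaction sa;
	memset(&sa, 0, sizeof sa);
	sa.sa_handler = buserr;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);
	if(sigsetjmp(onerr, 1))
		return -1;       /* the page-in failed */
	*out = *p;
	return 0;
}
```

Note the "random pauses" remain: every first touch of a page is a synchronous disk or network wait, which is exactly the cost structure being discussed.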
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M3dc790fee41d21884de7bacd
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 21:31 ` ron minnich
@ 2026-01-08 7:56 ` arnold
2026-01-08 10:31 ` wb.kloke
2026-01-09 3:57 ` Paul Lalonde
2 siblings, 0 replies; 92+ messages in thread
From: arnold @ 2026-01-08 7:56 UTC (permalink / raw)
To: 9fans
ron minnich <rminnich@gmail.com> wrote:
> This is much like the 35-years-ago work we did on AIX, called
> external pagers at the time;
External pagers were a thing in Mach in the mid-80s, IIRC.
To me it seemed like overengineering. How many plain old users
want to write their own pager?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M8381942ed38c05822d84803f
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 21:31 ` ron minnich
2026-01-08 7:56 ` arnold
@ 2026-01-08 10:31 ` wb.kloke
2026-01-09 0:02 ` ron minnich
2026-01-09 3:57 ` Paul Lalonde
2 siblings, 1 reply; 92+ messages in thread
From: wb.kloke @ 2026-01-08 10:31 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1238 bytes --]
On Wednesday, 7 January 2026, at 10:31 PM, ron minnich wrote:
> what we had planned for harvey was a good deal simpler: designate a part of the address space as a "bounce fault to user" space area.
>
> When a page fault in that area occurred, info about the fault was sent to an fd (if it was opened) or a note handler.
>
> user code could handle the fault or punt, as it saw fit. The fixup was that user mode had to get the data to satisfy the fault, then tell the kernel what to do.
>
> This is much like the 35-years-ago work we did on AIX, called external pagers at the time; or the more recent umap work, https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
>
> If you go this route, it's a bit less complex than what you are proposing.
Thank you, this seems the closest answer to my original question. The bad thing about umap is, of course, that it depends on a Linuxism.
BTW, Harvey is gone. Is the work done to cross-compile plan9 with gcc accessible?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M11a11204a789ee5dac250c3a
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 2084 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-02 21:01 ` ori
@ 2026-01-08 15:59 ` wb.kloke
2026-02-11 23:19 ` red
0 siblings, 1 reply; 92+ messages in thread
From: wb.kloke @ 2026-01-08 15:59 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1917 bytes --]
I have committed the changes to make the mmapped venti server fully functional to
https://github.com/vestein463/plan9port/tree/master/src/cmd/venti/mappedsrv
This is not fully tested yet, especially for more than one arena partition.
I include data on the read performance vs. the previous traditional-io version.
The mmapped version runs on amd64/32GB FreeBSD, the traditional one on a WD My Cloud EX2 (ARMv7/1GB).
The arena partition resides on the My Cloud and is served read-only via NFS to the amd64 machine.
I repeated the test to see what effect the os caching of the 2 machines may have.
1st is the mmapped version, 2nd the traditional, 3rd mmapped again and 4th traditional.
The data is the image of an old ufs filesystem created by vbackup.
time venti=wbk1 vcat ffs:f10fb5797e5ea50d6bc567772f9794706548f568 > /dev/null
1540096 blocks total
582848 blocks in use, 2122946 file reads
real 6m53,762s
user 1m13,788s
sys 0m16,910s
[wb@wbk1 ~/plan9port/src/cmd/venti/mappedsrv]$ time venti=wdmc vcat ffs:f10fb5797e5ea50d6bc567772f9794706548f568 > /dev/null
1540096 blocks total
582848 blocks in use, 2122946 file reads
real 17m28,870s
user 1m26,011s
sys 0m42,485s
[wb@wbk1 ~/plan9port/src/cmd/venti/mappedsrv]$ time venti=wbk1 vcat ffs:f10fb5797e5ea50d6bc567772f9794706548f568 > /dev/null
1540096 blocks total
582848 blocks in use, 2122946 file reads
real 3m55,614s
user 1m9,911s
sys 0m14,804s
[wb@wbk1 ~/plan9port/src/cmd/venti/mappedsrv]$ time venti=wdmc vcat ffs:f10fb5797e5ea50d6bc567772f9794706548f568 > /dev/null
1540096 blocks total
582848 blocks in use, 2122946 file reads
real 16m41,033s
user 1m15,768s
sys 0m32,756s
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Tf991997f4e7bb37e-M0b9a9d657a4ab010b4c9d5d2
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 2882 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-08 10:31 ` wb.kloke
@ 2026-01-09 0:02 ` ron minnich
0 siblings, 0 replies; 92+ messages in thread
From: ron minnich @ 2026-01-09 0:02 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 7376 bytes --]
Harvey is not gone at all. It's all there in the repo.
git@github.com:Harvey-OS/harvey
git branch
* GPL-C11
The build tool is called build, written in Go. It is deliberately very
dumb. Andy Tanenbaum advised me to go with "standard C, not GCC like Linus
did, even though I told him not to!" [in his growly voice -- I miss that
guy] and another friend told me "C compilers are fast, just brute force it"
So instead of calling compilers for each C file, I call the compiler for
every file in a directory:
Building Libc
/usr/bin/clang -std=c11 -c -I /home/rminnich/harvey/harvey/amd64/include -I
/home/rminnich/harvey/harvey/sys/include -I . -fasm -ffreestanding
-fno-builtin -fno-omit-frame-pointer -fno-stack-protector -g -gdwarf-2
-ggdb -mcmodel=small -O0 -static -Wall -Werror -fcommon -mstack-alignment=4
9sys/abort.c 9sys/access.c 9sys/announce.c 9sys/convD2M.c 9sys/convM2D.c
9sys/convM2S.c 9sys/convS2M.c 9sys/cputime.c 9sys/ctime.c 9sys/dial.c
9sys/dirfstat.c 9sys/dirfwstat.c 9sys/dirmodefmt.c 9sys/dirread.c
9sys/dirstat.c 9sys/dirwstat.c 9sys/fcallfmt.c 9sys/fork.c
9sys/getnetconninfo.c 9sys/getenv.c 9sys/getpid.c 9sys/getppid.c
9sys/getwd.c 9sys/iounit.c 9sys/nulldir.c 9sys/postnote.c 9sys/privalloc.c
9sys/pushssl.c 9sys/pushtls.c 9sys/putenv.c 9sys/qlock.c 9sys/read9pmsg.c
9sys/read.c 9sys/readv.c 9sys/rerrstr.c 9sys/sbrk.c 9sys/setnetmtpt.c
9sys/sysfatal.c 9sys/syslog.c 9sys/sysname.c 9sys/time.c 9sys/times.c
9sys/tm2sec.c 9sys/truerand.c 9sys/wait.c 9sys/waitpid.c 9sys/werrstr.c
9sys/write.c 9sys/writev.c 9syscall/alarm.s 9syscall/await.s
9syscall/bind.s 9syscall/brk_.s 9syscall/chdir.s 9syscall/close.s
9syscall/create.s 9syscall/dup.s 9syscall/errstr.s 9syscall/exec.s
9syscall/_exits.s 9syscall/fauth.s 9syscall/fd2path.s 9syscall/fstat.s
9syscall/fversion.s 9syscall/fwstat.s 9syscall/mount.s 9syscall/noted.s
9syscall/notify.s 9syscall/nsec.s 9syscall/open.s 9syscall/pipe.s
9syscall/pread.s 9syscall/pwrite.s 9syscall/r0.s 9syscall/remove.s
9syscall/rendezvous.s 9syscall/rfork.s 9syscall/seek.s 9syscall/segattach.s
9syscall/segbrk.s 9syscall/segdetach.s 9syscall/segflush.s
9syscall/segfree.s 9syscall/semacquire.s 9syscall/semrelease.s
9syscall/sleep.s 9syscall/stat.s 9syscall/tsemacquire.s 9syscall/unmount.s
9syscall/wstat.s fmt/dofmt.c fmt/dorfmt.c fmt/errfmt.c fmt/fltfmt.c
fmt/fmt.c fmt/fmtfd.c fmt/fmtlock.c fmt/fmtprint.c fmt/fmtquote.c
fmt/fmtrune.c fmt/fmtstr.c fmt/fmtvprint.c fmt/fprint.c fmt/print.c
fmt/runefmtstr.c fmt/runeseprint.c fmt/runesmprint.c fmt/runesnprint.c
fmt/runesprint.c fmt/runevseprint.c fmt/runevsmprint.c fmt/runevsnprint.c
fmt/seprint.c fmt/smprint.c fmt/snprint.c fmt/sprint.c fmt/vfprint.c
fmt/vseprint.c fmt/vsmprint.c fmt/vsnprint.c port/_assert.c port/abs.c
port/asin.c port/atan.c port/atan2.c port/atexit.c port/atnotify.c
port/atof.c port/atol.c port/atoll.c port/cistrcmp.c port/cistrncmp.c
port/cistrstr.c port/charstod.c port/cleanname.c port/configstring.c
port/ctype.c port/encodefmt.c port/errno2str.c port/execl.c port/exp.c
port/fabs.c port/floor.c port/fmod.c port/frand.c port/frexp.c
port/getfields.c port/getuser.c port/hangup.c port/hashmap.c port/hypot.c
port/lnrand.c port/lock.c port/log.c port/lrand.c port/malloc.c
port/memccpy.c port/memchr.c port/memcmp.c port/memmove.c port/memset.c
port/mktemp.c port/muldiv.c port/nan.c port/needsrcquote.c port/netcrypt.c
port/netmkaddr.c port/nrand.c port/ntruerand.c port/perror.c port/pool.c
port/pow.c port/pow10.c port/qsort.c port/quote.c port/rand.c port/readn.c
port/reallocarray.c port/rijndael.c port/rune.c port/runebase.c
port/runebsearch.c port/runestrcat.c port/runestrchr.c port/runestrcmp.c
port/runestrcpy.c port/runestrecpy.c port/runestrdup.c port/runestrncat.c
port/runestrncmp.c port/runestrncpy.c port/runestrrchr.c port/runestrlen.c
port/runestrstr.c port/runetype.c port/sha2.c port/sin.c port/sinh.c
port/slice.c port/strcat.c port/strchr.c port/strcmp.c port/strcpy.c
port/strecpy.c port/strcspn.c port/strdup.c port/strlcat.c port/strlcpy.c
port/strlen.c port/strncat.c port/strncmp.c port/strncpy.c port/strpbrk.c
port/strrchr.c port/strspn.c port/strstr.c port/strtod.c port/strtok.c
port/strtol.c port/strtoll.c port/strtoul.c port/strtoull.c port/tan.c
port/tanh.c port/tokenize.c port/toupper.c port/utfecpy.c port/utflen.c
port/utfnlen.c port/utfrune.c port/utfrrune.c port/utfutf.c port/u16.c
port/u32.c port/u64.c amd64/notejmp.c amd64/cycles.c amd64/argv0.c
port/getcallstack.c amd64/rdpmc.c amd64/setjmp.s amd64/sqrt.s amd64/tas.s
amd64/atom.S amd64/main9.S
because clang is smart enough to only parse a .h file once.
There are a few other go tools. We got rid of all the random awk, rc, etc.
scripts, because we always forgot how they worked.
I just tried this:
go install harvey-os.org/cmd/decompress@latest
go install harvey-os.org/cmd/mksys@latest
go install harvey-os.org/cmd/build@latest
git clone git@github.com:Harvey-OS/harvey
cd harvey
ARCH=amd64 CC=gcc build
And, ironically, the one thing that fails is one of the go build steps,
since in the 10 years since we set this up, the Go build commands have
changed in incompatible ways.
The changes are a HUGE improvement, but it just means old projects like
this have an issue with go.
The C stuff all builds fine.
note also:
rminnich@pop-os:~/harvey/harvey$ build
You need to set the CC environment variable (e.g. gcc, clang, clang-3.6,
...)
rminnich@pop-os:~/harvey/harvey$ CC=clang build
You need to set the ARCH environment variable from: [amd64 riscv aarch64]
rminnich@pop-os:~/harvey/harvey$
You have lots of choices as to toolchain.
On Thu, Jan 8, 2026 at 4:45 AM <wb.kloke@gmail.com> wrote:
> On Wednesday, 7 January 2026, at 10:31 PM, ron minnich wrote:
>
> what we had planned for harvey was a good deal simpler: designate a part
> of the address space as a "bounce fault to user" space area.
>
> When a page fault in that area occurred, info about the fault was sent to
> an fd (if it was opened) or a note handler.
>
> user code could handle the fault or punt, as it saw fit. The fixup was
> that user mode had to get the data to satisfy the fault, then tell the
> kernel what to do.
>
> This is much like the 35-years-ago work we did on AIX, called
> external pagers at the time; or the more recent umap work,
> https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
>
> If you go this route, it's a bit less complex than what you are proposing.
>
>
> Thank you, this seems the nearest possible answer to my original question.
> The bad thing about umap is, of course, that it depends on a Linuxism.
>
> BTW, Harvey is gone. Is the work done to cross-compile plan9 with gcc
> accessible?
> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
> <https://9fans.topicbox.com/groups/9fans> + participants
> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
> <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M11a11204a789ee5dac250c3a>
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M2966fd4ffcd98eb245b77956
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 8682 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-07 21:31 ` ron minnich
2026-01-08 7:56 ` arnold
2026-01-08 10:31 ` wb.kloke
@ 2026-01-09 3:57 ` Paul Lalonde
2026-01-09 5:10 ` ron minnich
2 siblings, 1 reply; 92+ messages in thread
From: Paul Lalonde @ 2026-01-09 3:57 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 4386 bytes --]
Did the same on GPUs/Xeon Phi, including in the texture units. Very useful
mechanism for abstracting compute with random access characteristics.
Paul
On Wed, Jan 7, 2026, 1:35 p.m. ron minnich <rminnich@gmail.com> wrote:
> what we had planned for harvey was a good deal simpler: designate a part
> of the address space as a "bounce fault to user" space area.
>
> When a page fault in that area occurred, info about the fault was sent to
> an fd (if it was opened) or a note handler.
>
> user code could handle the fault or punt, as it saw fit. The fixup was
> that user mode had to get the data to satisfy the fault, then tell the
> kernel what to do.
>
> This is much like the 35-years-ago work we did on AIX, called
> external pagers at the time; or the more recent umap work,
> https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
>
> If you go this route, it's a bit less complex than what you are proposing.
>
> On Wed, Jan 7, 2026 at 1:09 PM Bakul Shah via 9fans <9fans@9fans.net>
> wrote:
>
>>
>>
>> > On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
>> >
>> > Quoth Bakul Shah via 9fans <9fans@9fans.net>:
>> >> I have this idea that will horrify most of you!
>> >>
>> >> 1. Create an mmap device driver. You ask it to a new file handle which
>> you use to communicate about memory mapping.
>> >> 2. If you want to mmap some file, you open it and write its file
>> descriptor along with other parameters (file offset, base addr, size, mode,
>> flags) to your mmap file handle.
>> >> 3. The mmap driver sets up necessary page table entries but doesn't
>> actually fetch any data before returning from the write.
>> >> 4. It can asynchronously kick off io requests on your behalf and fixup
>> page table entries as needed.
>> >> 5. Page faults in the mmapped area are serviced by making appropriate
>> read/write calls.
>> >> 6. Flags can be used to indicate read-ahead or write-behind for
>> typical serial access.
>> >> 7. Similarly msync, munmap etc. can be implemented.
>> >>
>> >> In a sneaky way this avoids the need for adding any mmap specific
>> syscalls! But the underlying work would be mostly similar in either case.
>> >>
>> >> The main benefits of mmap are reduced initial latency , "pay as you
>> go" cost structure and ease of use. It is certainly more expensive than
>> reading/writing the same amount of data directly from a program.
>> >>
>> >> No idea how horrible a hack is needed to implement such a thing or
>> even if it is possible at all but I had to share this ;-)
>> >
>> > To what end? The problems with mmap have little to do with adding a
>> syscall;
>> > they're about how you do things like communicating I/O errors.
>> Especially
>> > when flushing the cache.
>> >
>> > Imagine the following setup -- I've imported 9p.io:
>> >
>> > 9fs 9pio
>> >
>> > and then I map a file from it:
>> >
>> > mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
>> >
>> > Now, I want to write something into the file:
>> >
>> > *mapped = 1234;
>> >
>> > The cached version of the page is dirty, so the OS will
>> > eventually need to flush it back with a 9p Twrite; Let's
>> > assume that before this happens, the network goes down.
>> >
>> > How do you communicate the error with userspace?
>>
>> This was just a brainwave but...
>>
>> You have a (control) connection with the mmap device to
>> set up mmap so might as well use it to convey errors!
>> This device would be strictly local to where a program
>> runs.
>>
>> I'd even consider allowing a separate process to mmap,
>> by making an address space a first class object. That'd
>> move more stuff out of the kernel and allow for more
>> interesting/esoteric uses.
> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
> <https://9fans.topicbox.com/groups/9fans> + participants
> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
> <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mae5eb9a90d72008533969f26>
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mf3cfeeb18fd00292d3f9063f
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 6452 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-09 3:57 ` Paul Lalonde
@ 2026-01-09 5:10 ` ron minnich
2026-01-09 5:18 ` arnold
0 siblings, 1 reply; 92+ messages in thread
From: ron minnich @ 2026-01-09 5:10 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 4770 bytes --]
I would not tar the idea of external pagers with the Mach tarbrush. Mach
was pretty much inefficient at everything, including external pagers.
External pagers can work well, when implemented well.
On Thu, Jan 8, 2026 at 8:41 PM Paul Lalonde <paul.a.lalonde@gmail.com>
wrote:
> Did the same on GPUs/Xeon Phi, including in the texture units. Very
> useful mechanism for abstracting compute with random access characteristics.
>
> Paul
>
> On Wed, Jan 7, 2026, 1:35 p.m. ron minnich <rminnich@gmail.com> wrote:
>
>> what we had planned for harvey was a good deal simpler: designate a part
>> of the address space as a "bounce fault to user" space area.
>>
>> When a page fault in that area occurred, info about the fault was sent to
>> an fd (if it was opened) or a note handler.
>>
>> user code could handle the fault or punt, as it saw fit. The fixup was
>> that user mode had to get the data to satisfy the fault, then tell the
>> kernel what to do.
>>
>> This is much like the 35-years-ago work we did on AIX, called
>> external pagers at the time; or the more recent umap work,
>> https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
>>
>> If you go this route, it's a bit less complex than what you are proposing.
>>
>> On Wed, Jan 7, 2026 at 1:09 PM Bakul Shah via 9fans <9fans@9fans.net>
>> wrote:
>>
>>>
>>>
>>> > On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
>>> >
>>> > Quoth Bakul Shah via 9fans <9fans@9fans.net>:
>>> >> I have this idea that will horrify most of you!
>>> >>
>>> >> 1. Create an mmap device driver. You ask it to a new file handle
>>> which you use to communicate about memory mapping.
>>> >> 2. If you want to mmap some file, you open it and write its file
>>> descriptor along with other parameters (file offset, base addr, size, mode,
>>> flags) to your mmap file handle.
>>> >> 3. The mmap driver sets up necessary page table entries but doesn't
>>> actually fetch any data before returning from the write.
>>> >> 4. It can asynchronously kick off io requests on your behalf and
>>> fixup page table entries as needed.
>>> >> 5. Page faults in the mmapped area are serviced by making appropriate
>>> read/write calls.
>>> >> 6. Flags can be used to indicate read-ahead or write-behind for
>>> typical serial access.
>>> >> 7. Similarly msync, munmap etc. can be implemented.
>>> >>
>>> >> In a sneaky way this avoids the need for adding any mmap specific
>>> syscalls! But the underlying work would be mostly similar in either case.
>>> >>
>>> >> The main benefits of mmap are reduced initial latency , "pay as you
>>> go" cost structure and ease of use. It is certainly more expensive than
>>> reading/writing the same amount of data directly from a program.
>>> >>
>>> >> No idea how horrible a hack is needed to implement such a thing or
>>> even if it is possible at all but I had to share this ;-)
>>> >
>>> > To what end? The problems with mmap have little to do with adding a
>>> syscall;
>>> > they're about how you do things like communicating I/O errors.
>>> Especially
>>> > when flushing the cache.
>>> >
>>> > Imagine the following setup -- I've imported 9p.io:
>>> >
>>> > 9fs 9pio
>>> >
>>> > and then I map a file from it:
>>> >
>>> > mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
>>> >
>>> > Now, I want to write something into the file:
>>> >
>>> > *mapped = 1234;
>>> >
>>> > The cached version of the page is dirty, so the OS will
>>> > eventually need to flush it back with a 9p Twrite; Let's
>>> > assume that before this happens, the network goes down.
>>> >
>>> > How do you communicate the error with userspace?
>>>
>>> This was just a brainwave but...
>>>
>>> You have a (control) connection with the mmap device to
>>> set up mmap so might as well use it to convey errors!
>>> This device would be strictly local to where a program
>>> runs.
>>>
>>> I'd even consider allowing a separate process to mmap,
>>> by making an address space a first class object. That'd
>>> move more stuff out of the kernel and allow for more
>>> interesting/esoteric uses.
>> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
> <https://9fans.topicbox.com/groups/9fans> + participants
> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
> <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mf3cfeeb18fd00292d3f9063f>
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M56e476fb601ad6cca1a4d4fc
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 7141 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-09 5:10 ` ron minnich
@ 2026-01-09 5:18 ` arnold
2026-01-09 6:06 ` David Leimbach via 9fans
0 siblings, 1 reply; 92+ messages in thread
From: arnold @ 2026-01-09 5:18 UTC (permalink / raw)
To: 9fans
I vaguely remember someone being quoted as saying
Microkernels don't have to be small. They just have to
not do much.
:-)
ron minnich <rminnich@gmail.com> wrote:
> I would not tar the idea of external pagers with the Mach tarbrush. Mach
> was pretty much inefficient at everything, including external pagers.
> External pagers can work well, when implemented well.
>
> On Thu, Jan 8, 2026 at 8:41 PM Paul Lalonde <paul.a.lalonde@gmail.com>
> wrote:
>
> > Did the same on GPUs/Xeon Phi, including in the texture units. Very
> > useful mechanism for abstracting compute with random access characteristics.
> >
> > Paul
> >
> > On Wed, Jan 7, 2026, 1:35 p.m. ron minnich <rminnich@gmail.com> wrote:
> >
> >> what we had planned for harvey was a good deal simpler: designate a part
> >> of the address space as a "bounce fault to user" space area.
> >>
> >> When a page fault in that area occurred, info about the fault was sent to
> >> an fd (if it was opened) or a note handler.
> >>
> >> user could handle the fault or punt, as it saw fit. The fixup was
> >> that user mode had to get the data to satisfy the fault, then tell the
> >> kernel what to do.
> >>
> >> This is much like the 35-years-ago work we did on AIX, called
> >> external pagers at the time; or the more recent umap work,
> >> https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
> >>
> >> If you go this route, it's a bit less complex than what you are proposing.
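The user-mode half of the "bounce fault to user" scheme described above might look like the following. The Fault record layout and the kernel interface are hypothetical; only the page-aligning and copy logic is concrete, here serviced from an in-memory backing store rather than a real fd.

```c
/* Sketch of a user-level pager servicing a bounced fault.
 * The Fault layout and kernel handoff are invented for illustration. */
#include <stdint.h>
#include <string.h>

enum { Pgsize = 4096 };

typedef struct Fault Fault;
struct Fault {
	uint64_t addr;		/* faulting virtual address */
	int	iswrite;	/* nonzero for a write fault */
};

/* satisfy a fault: find the page of the backing store that covers
 * f->addr and copy it into page[]; a real pager would then tell the
 * kernel to install the page and resume the faulting process */
void
servefault(Fault *f, uint64_t base, unsigned char *store, unsigned char *page)
{
	uint64_t off;

	off = (f->addr - base) & ~(uint64_t)(Pgsize-1);	/* round down to page */
	memmove(page, store+off, Pgsize);
}
```

In the scheme described above, the loop around this would read Fault records from an fd (or take them in a note handler) and either service or punt each one.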
> >>
> >> On Wed, Jan 7, 2026 at 1:09 PM Bakul Shah via 9fans <9fans@9fans.net>
> >> wrote:
> >>
> >>>
> >>>
> >>> > On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
> >>> >
> >>> > Quoth Bakul Shah via 9fans <9fans@9fans.net>:
> >>> >> I have this idea that will horrify most of you!
> >>> >>
> >>> >> 1. Create an mmap device driver. You ask it for a new file handle
> >>> which you use to communicate about memory mapping.
> >>> >> 2. If you want to mmap some file, you open it and write its file
> >>> descriptor along with other parameters (file offset, base addr, size, mode,
> >>> flags) to your mmap file handle.
> >>> >> 3. The mmap driver sets up necessary page table entries but doesn't
> >>> actually fetch any data before returning from the write.
> >>> >> 4. It can asynchronously kick off io requests on your behalf and
> >>> fixup page table entries as needed.
> >>> >> 5. Page faults in the mmapped area are serviced by making appropriate
> >>> read/write calls.
> >>> >> 6. Flags can be used to indicate read-ahead or write-behind for
> >>> typical serial access.
> >>> >> 7. Similarly msync, munmap etc. can be implemented.
> >>> >>
> >>> >> In a sneaky way this avoids the need for adding any mmap specific
> >>> syscalls! But the underlying work would be mostly similar in either case.
> >>> >>
> >>> >> The main benefits of mmap are reduced initial latency, "pay as you
> >>> go" cost structure and ease of use. It is certainly more expensive than
> >>> reading/writing the same amount of data directly from a program.
> >>> >>
> >>> >> No idea how horrible a hack is needed to implement such a thing or
> >>> even if it is possible at all but I had to share this ;-)
> >>> >
> >>> > To what end? The problems with mmap have little to do with adding a
> >>> syscall;
> >>> > they're about how you do things like communicating I/O errors.
> >>> Especially
> >>> > when flushing the cache.
> >>> >
> >>> > Imagine the following setup -- I've imported 9p.io:
> >>> >
> >>> > 9fs 9pio
> >>> >
> >>> > and then I map a file from it:
> >>> >
> >>> > mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
> >>> >
> >>> > Now, I want to write something into the file:
> >>> >
> >>> > *mapped = 1234;
> >>> >
> >>> > The cached version of the page is dirty, so the OS will
> >>> > eventually need to flush it back with a 9p Twrite; Let's
> >>> > assume that before this happens, the network goes down.
> >>> >
> >>> > How do you communicate the error with userspace?
> >>>
> >>> This was just a brainwave but...
> >>>
> >>> You have a (control) connection with the mmap device to
> >>> set up mmap so might as well use it to convey errors!
> >>> This device would be strictly local to where a program
> >>> runs.
> >>>
> >>> I'd even consider allowing a separate process to mmap,
> >>> by making an address space a first class object. That'd
> >>> move more stuff out of the kernel and allow for more
> >>> interesting/esoteric uses.
> >> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
> > <https://9fans.topicbox.com/groups/9fans> + participants
> > <https://9fans.topicbox.com/groups/9fans/members> + delivery options
> > <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
> > <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mf3cfeeb18fd00292d3f9063f>
> >
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mb4abae3026a42ab768f9f6db
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-09 5:18 ` arnold
@ 2026-01-09 6:06 ` David Leimbach via 9fans
2026-01-09 17:13 ` ron minnich
2026-01-09 17:39 ` tlaronde
0 siblings, 2 replies; 92+ messages in thread
From: David Leimbach via 9fans @ 2026-01-09 6:06 UTC (permalink / raw)
To: 9fans; +Cc: 9fans
I’d been impressed by L4. It’s certainly been deployed pretty broadly.
And it has recursive pagers but … not sure how that’s used in practice.
And there are a bunch of variants.
Sent from my iPhone
> On Jan 8, 2026, at 9:23 PM, arnold@skeeve.com wrote:
>
> I vaguely remember someone being quoted as saying
>
> Microkernels don't have to be small. They just have to
> not do much.
>
> :-)
>
> ron minnich <rminnich@gmail.com> wrote:
>
>> I would not tar the idea of external pagers with the Mach tarbrush. Mach
>> was pretty much inefficient at everything, including external pagers.
>> External pagers can work well, when implemented well.
>>
>>> On Thu, Jan 8, 2026 at 8:41 PM Paul Lalonde <paul.a.lalonde@gmail.com>
>>> wrote:
>>>
>>> Did the same on GPUs/Xeon Phi, including in the texture units. Very
>>> useful mechanism for abstracting compute with random access characteristics.
>>>
>>> Paul
>>>
>>>> On Wed, Jan 7, 2026, 1:35 p.m. ron minnich <rminnich@gmail.com> wrote:
>>>> what we had planned for harvey was a good deal simpler: designate a part
>>>> of the address space as a "bounce fault to user" space area.
>>>>
>>>> When a page fault in that area occurred, info about the fault was sent to
>>>> an fd (if it was opened) or a note handler.
>>>>
>>>> user could handle the fault or punt, as it saw fit. The fixup was
>>>> that user mode had to get the data to satisfy the fault, then tell the
>>>> kernel what to do.
>>>>
>>>> This is much like the 35-years-ago work we did on AIX, called
>>>> external pagers at the time; or the more recent umap work,
>>>> https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
>>>>
>>>> If you go this route, it's a bit less complex than what you are proposing.
>>>>
>>>> On Wed, Jan 7, 2026 at 1:09 PM Bakul Shah via 9fans <9fans@9fans.net>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>>> On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
>>>>>>
>>>>>> Quoth Bakul Shah via 9fans <9fans@9fans.net>:
>>>>>>> I have this idea that will horrify most of you!
>>>>>>>
>>>>>>> 1. Create an mmap device driver. You ask it for a new file handle
>>>>> which you use to communicate about memory mapping.
>>>>>>> 2. If you want to mmap some file, you open it and write its file
>>>>> descriptor along with other parameters (file offset, base addr, size, mode,
>>>>> flags) to your mmap file handle.
>>>>>>> 3. The mmap driver sets up necessary page table entries but doesn't
>>>>> actually fetch any data before returning from the write.
>>>>>>> 4. It can asynchronously kick off io requests on your behalf and
>>>>> fixup page table entries as needed.
>>>>>>> 5. Page faults in the mmapped area are serviced by making appropriate
>>>>> read/write calls.
>>>>>>> 6. Flags can be used to indicate read-ahead or write-behind for
>>>>> typical serial access.
>>>>>>> 7. Similarly msync, munmap etc. can be implemented.
>>>>>>>
>>>>>>> In a sneaky way this avoids the need for adding any mmap specific
>>>>> syscalls! But the underlying work would be mostly similar in either case.
>>>>>>>
>>>>>>> The main benefits of mmap are reduced initial latency, "pay as you
>>>>> go" cost structure and ease of use. It is certainly more expensive than
>>>>> reading/writing the same amount of data directly from a program.
>>>>>>>
>>>>>>> No idea how horrible a hack is needed to implement such a thing or
>>>>> even if it is possible at all but I had to share this ;-)
>>>>>>
>>>>>> To what end? The problems with mmap have little to do with adding a
>>>>> syscall;
>>>>>> they're about how you do things like communicating I/O errors.
>>>>> Especially
>>>>>> when flushing the cache.
>>>>>>
>>>>>> Imagine the following setup -- I've imported 9p.io:
>>>>>>
>>>>>> 9fs 9pio
>>>>>>
>>>>>> and then I map a file from it:
>>>>>>
>>>>>> mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
>>>>>>
>>>>>> Now, I want to write something into the file:
>>>>>>
>>>>>> *mapped = 1234;
>>>>>>
>>>>>> The cached version of the page is dirty, so the OS will
>>>>>> eventually need to flush it back with a 9p Twrite; Let's
>>>>>> assume that before this happens, the network goes down.
>>>>>>
>>>>>> How do you communicate the error with userspace?
>>>>>
>>>>> This was just a brainwave but...
>>>>>
>>>>> You have a (control) connection with the mmap device to
>>>>> set up mmap so might as well use it to convey errors!
>>>>> This device would be strictly local to where a program
>>>>> runs.
>>>>>
>>>>> I'd even consider allowing a separate process to mmap,
>>>>> by making an address space a first class object. That'd
>>>>> move more stuff out of the kernel and allow for more
>>>>> interesting/esoteric uses.
>>>> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
>>> <https://9fans.topicbox.com/groups/9fans> + participants
>>> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
>>> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
>>> <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mf3cfeeb18fd00292d3f9063f>
>>>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M4dfecc367000953ec3cd500e
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-09 6:06 ` David Leimbach via 9fans
@ 2026-01-09 17:13 ` ron minnich
2026-01-09 17:39 ` tlaronde
1 sibling, 0 replies; 92+ messages in thread
From: ron minnich @ 2026-01-09 17:13 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 6015 bytes --]
Right, so, getting back to the original discussion, Bakul, I think the
right path forward is to implement a device that supports an external
pager, rather than mmap.
But code wins, so, quick, somebody, implement something :-)
On Thu, Jan 8, 2026 at 11:03 PM David Leimbach via 9fans <9fans@9fans.net>
wrote:
> I’d been impressed by L4. It’s certainly been deployed pretty broadly.
>
> And it has recursive pagers but … not sure how that’s used in practice.
>
> And there are a bunch of variants.
> Sent from my iPhone
>
> > On Jan 8, 2026, at 9:23 PM, arnold@skeeve.com wrote:
> >
> > I vaguely remember someone being quoted as saying
> >
> > Microkernels don't have to be small. They just have to
> > not do much.
> >
> > :-)
> >
> > ron minnich <rminnich@gmail.com> wrote:
> >
> >> I would not tar the idea of external pagers with the Mach tarbrush. Mach
> >> was pretty much inefficient at everything, including external pagers.
> >> External pagers can work well, when implemented well.
> >>
> >>> On Thu, Jan 8, 2026 at 8:41 PM Paul Lalonde <paul.a.lalonde@gmail.com>
> >>> wrote:
> >>>
> >>> Did the same on GPUs/Xeon Phi, including in the texture units. Very
> >>> useful mechanism for abstracting compute with random access
> characteristics.
> >>>
> >>> Paul
> >>>
> >>>> On Wed, Jan 7, 2026, 1:35 p.m. ron minnich <rminnich@gmail.com>
> wrote:
> >>>> what we had planned for harvey was a good deal simpler: designate a
> part
> >>>> of the address space as a "bounce fault to user" space area.
> >>>>
> >>>> When a page fault in that area occurred, info about the fault was
> sent to
> >>>> an fd (if it was opened) or a note handler.
> >>>>
> >>>> user could handle the fault or punt, as it saw fit. The fixup
> was
> >>>> that user mode had to get the data to satisfy the fault, then tell the
> >>>> kernel what to do.
> >>>>
> >>>> This is much like the 35-years-ago work we did on AIX, called
> >>>> external pagers at the time; or the more recent umap work,
> >>>> https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
> >>>>
> >>>> If you go this route, it's a bit less complex than what you are
> proposing.
> >>>>
> >>>> On Wed, Jan 7, 2026 at 1:09 PM Bakul Shah via 9fans <9fans@9fans.net>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>>> On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
> >>>>>>
> >>>>>> Quoth Bakul Shah via 9fans <9fans@9fans.net>:
> >>>>>>> I have this idea that will horrify most of you!
> >>>>>>>
> >>>>>>> 1. Create an mmap device driver. You ask it for a new file handle
> >>>>> which you use to communicate about memory mapping.
> >>>>>>> 2. If you want to mmap some file, you open it and write its file
> >>>>> descriptor along with other parameters (file offset, base addr,
> size, mode,
> >>>>> flags) to your mmap file handle.
> >>>>>>> 3. The mmap driver sets up necessary page table entries but doesn't
> >>>>> actually fetch any data before returning from the write.
> >>>>>>> 4. It can asynchronously kick off io requests on your behalf and
> >>>>> fixup page table entries as needed.
> >>>>>>> 5. Page faults in the mmapped area are serviced by making
> appropriate
> >>>>> read/write calls.
> >>>>>>> 6. Flags can be used to indicate read-ahead or write-behind for
> >>>>> typical serial access.
> >>>>>>> 7. Similarly msync, munmap etc. can be implemented.
> >>>>>>>
> >>>>>>> In a sneaky way this avoids the need for adding any mmap specific
> >>>>> syscalls! But the underlying work would be mostly similar in either
> case.
> >>>>>>>
> >>>>>>> The main benefits of mmap are reduced initial latency, "pay as you
> >>>>> go" cost structure and ease of use. It is certainly more expensive
> than
> >>>>> reading/writing the same amount of data directly from a program.
> >>>>>>>
> >>>>>>> No idea how horrible a hack is needed to implement such a thing or
> >>>>> even if it is possible at all but I had to share this ;-)
> >>>>>>
> >>>>>> To what end? The problems with mmap have little to do with adding a
> >>>>> syscall;
> >>>>>> they're about how you do things like communicating I/O errors.
> >>>>> Especially
> >>>>>> when flushing the cache.
> >>>>>>
> >>>>>> Imagine the following setup -- I've imported 9p.io:
> >>>>>>
> >>>>>> 9fs 9pio
> >>>>>>
> >>>>>> and then I map a file from it:
> >>>>>>
> >>>>>> mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
> >>>>>>
> >>>>>> Now, I want to write something into the file:
> >>>>>>
> >>>>>> *mapped = 1234;
> >>>>>>
> >>>>>> The cached version of the page is dirty, so the OS will
> >>>>>> eventually need to flush it back with a 9p Twrite; Let's
> >>>>>> assume that before this happens, the network goes down.
> >>>>>>
> >>>>>> How do you communicate the error with userspace?
> >>>>>
> >>>>> This was just a brainwave but...
> >>>>>
> >>>>> You have a (control) connection with the mmap device to
> >>>>> set up mmap so might as well use it to convey errors!
> >>>>> This device would be strictly local to where a program
> >>>>> runs.
> >>>>>
> >>>>> I'd even consider allowing a separate process to mmap,
> >>>>> by making an address space a first class object. That'd
> >>>>> move more stuff out of the kernel and allow for more
> >>>>> interesting/esoteric uses.
> >>>> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
> >>> <https://9fans.topicbox.com/groups/9fans> + participants
> >>> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
> >>> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
> >>> <
> https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mf3cfeeb18fd00292d3f9063f
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mfe8a1e4feddaad7bebb650ea
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 10724 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-09 6:06 ` David Leimbach via 9fans
2026-01-09 17:13 ` ron minnich
@ 2026-01-09 17:39 ` tlaronde
2026-01-09 19:48 ` David Leimbach via 9fans
1 sibling, 1 reply; 92+ messages in thread
From: tlaronde @ 2026-01-09 17:39 UTC (permalink / raw)
To: 9fans
On Thu, Jan 08, 2026 at 10:06:26PM -0800, David Leimbach via 9fans wrote:
> I'd been impressed by L4. It's certainly been deployed pretty broadly.
>
Was not L4 rewritten in assembly for performance purposes?
T. Laronde
> And it has recursive pagers but … not sure how that's used in practice.
>
> And there are a bunch of variants.
> Sent from my iPhone
>
> > On Jan 8, 2026, at 9:23 PM, arnold@skeeve.com wrote:
> >
> > I vaguely remember someone being quoted as saying
> >
> > Microkernels don't have to be small. They just have to
> > not do much.
> >
> > :-)
> >
> > ron minnich <rminnich@gmail.com> wrote:
> >
> >> I would not tar the idea of external pagers with the Mach tarbrush. Mach
> >> was pretty much inefficient at everything, including external pagers.
> >> External pagers can work well, when implemented well.
> >>
> >>> On Thu, Jan 8, 2026 at 8:41 PM Paul Lalonde <paul.a.lalonde@gmail.com>
> >>> wrote:
> >>>
> >>> Did the same on GPUs/Xeon Phi, including in the texture units. Very
> >>> useful mechanism for abstracting compute with random access characteristics.
> >>>
> >>> Paul
> >>>
> >>>> On Wed, Jan 7, 2026, 1:35 p.m. ron minnich <rminnich@gmail.com> wrote:
> >>>> what we had planned for harvey was a good deal simpler: designate a part
> >>>> of the address space as a "bounce fault to user" space area.
> >>>>
> >>>> When a page fault in that area occurred, info about the fault was sent to
> >>>> an fd (if it was opened) or a note handler.
> >>>>
> >>>> user could handle the fault or punt, as it saw fit. The fixup was
> >>>> that user mode had to get the data to satisfy the fault, then tell the
> >>>> kernel what to do.
> >>>>
> >>>> This is much like the 35-years-ago work we did on AIX, called
> >>>> external pagers at the time; or the more recent umap work,
> >>>> https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
> >>>>
> >>>> If you go this route, it's a bit less complex than what you are proposing.
> >>>>
> >>>> On Wed, Jan 7, 2026 at 1:09 PM Bakul Shah via 9fans <9fans@9fans.net>
> >>>> wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>>> On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
> >>>>>>
> >>>>>> Quoth Bakul Shah via 9fans <9fans@9fans.net>:
> >>>>>>> I have this idea that will horrify most of you!
> >>>>>>>
> >>>>>>> 1. Create an mmap device driver. You ask it for a new file handle
> >>>>> which you use to communicate about memory mapping.
> >>>>>>> 2. If you want to mmap some file, you open it and write its file
> >>>>> descriptor along with other parameters (file offset, base addr, size, mode,
> >>>>> flags) to your mmap file handle.
> >>>>>>> 3. The mmap driver sets up necessary page table entries but doesn't
> >>>>> actually fetch any data before returning from the write.
> >>>>>>> 4. It can asynchronously kick off io requests on your behalf and
> >>>>> fixup page table entries as needed.
> >>>>>>> 5. Page faults in the mmapped area are serviced by making appropriate
> >>>>> read/write calls.
> >>>>>>> 6. Flags can be used to indicate read-ahead or write-behind for
> >>>>> typical serial access.
> >>>>>>> 7. Similarly msync, munmap etc. can be implemented.
> >>>>>>>
> >>>>>>> In a sneaky way this avoids the need for adding any mmap specific
> >>>>> syscalls! But the underlying work would be mostly similar in either case.
> >>>>>>>
> >>>>> The main benefits of mmap are reduced initial latency, "pay as you
> >>>>> go" cost structure and ease of use. It is certainly more expensive than
> >>>>> reading/writing the same amount of data directly from a program.
> >>>>>>>
> >>>>>>> No idea how horrible a hack is needed to implement such a thing or
> >>>>> even if it is possible at all but I had to share this ;-)
> >>>>>>
> >>>>>> To what end? The problems with mmap have little to do with adding a
> >>>>> syscall;
> >>>>>> they're about how you do things like communicating I/O errors.
> >>>>> Especially
> >>>>>> when flushing the cache.
> >>>>>>
> >>>>>> Imagine the following setup -- I've imported 9p.io:
> >>>>>>
> >>>>>> 9fs 9pio
> >>>>>>
> >>>>>> and then I map a file from it:
> >>>>>>
> >>>>>> mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
> >>>>>>
> >>>>>> Now, I want to write something into the file:
> >>>>>>
> >>>>>> *mapped = 1234;
> >>>>>>
> >>>>>> The cached version of the page is dirty, so the OS will
> >>>>>> eventually need to flush it back with a 9p Twrite; Let's
> >>>>>> assume that before this happens, the network goes down.
> >>>>>>
> >>>>>> How do you communicate the error with userspace?
> >>>>>
> >>>>> This was just a brainwave but...
> >>>>>
> >>>>> You have a (control) connection with the mmap device to
> >>>>> set up mmap so might as well use it to convey errors!
> >>>>> This device would be strictly local to where a program
> >>>>> runs.
> >>>>>
> >>>>> I'd even consider allowing a separate process to mmap,
> >>>>> by making an address space a first class object. That'd
> >>>>> move more stuff out of the kernel and allow for more
> >>>>> interesting/esoteric uses.
> >>>> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
> >>> <https://9fans.topicbox.com/groups/9fans> + participants
> >>> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
> >>> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
> >>> <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mf3cfeeb18fd00292d3f9063f>
> >>>
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M8bc645f9949b2eb1466815cb
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-09 17:39 ` tlaronde
@ 2026-01-09 19:48 ` David Leimbach via 9fans
2026-02-05 21:30 ` Alyssa M via 9fans
2026-02-08 14:08 ` Ethan Azariah
0 siblings, 2 replies; 92+ messages in thread
From: David Leimbach via 9fans @ 2026-01-09 19:48 UTC (permalink / raw)
To: 9fans; +Cc: 9fans
Sent from my iPhone
> On Jan 9, 2026, at 9:57 AM, tlaronde@kergis.com wrote:
>
> On Thu, Jan 08, 2026 at 10:06:26PM -0800, David Leimbach via 9fans wrote:
>> I'd been impressed by L4. It's certainly been deployed pretty broadly.
>>
>
> Was not L4 rewritten in assembly for performance purposes?
>
> T. Laronde
The opposite. L3 was assembly. L4 is basically a spec now. Many implementations. C and C++ (and Haskell I believe)
My point was you can do recursive paging. It has been shown.
>> And it has recursive pagers but … not sure how that's used in practice.
>>
>> And there are a bunch of variants.
>> Sent from my iPhone
>>
>>>> On Jan 8, 2026, at 9:23 PM, arnold@skeeve.com wrote:
>>>
>>> I vaguely remember someone being quoted as saying
>>>
>>> Microkernels don't have to be small. They just have to
>>> not do much.
>>>
>>> :-)
>>>
>>> ron minnich <rminnich@gmail.com> wrote:
>>>
>>>> I would not tar the idea of external pagers with the Mach tarbrush. Mach
>>>> was pretty much inefficient at everything, including external pagers.
>>>> External pagers can work well, when implemented well.
>>>>
>>>>> On Thu, Jan 8, 2026 at 8:41 PM Paul Lalonde <paul.a.lalonde@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Did the same on GPUs/Xeon Phi, including in the texture units. Very
>>>>> useful mechanism for abstracting compute with random access characteristics.
>>>>>
>>>>> Paul
>>>>>
>>>>>> On Wed, Jan 7, 2026, 1:35 p.m. ron minnich <rminnich@gmail.com> wrote:
>>>>>> what we had planned for harvey was a good deal simpler: designate a part
>>>>>> of the address space as a "bounce fault to user" space area.
>>>>>>
>>>>>> When a page fault in that area occurred, info about the fault was sent to
>>>>>> an fd (if it was opened) or a note handler.
>>>>>>
>>>>>> user could handle the fault or punt, as it saw fit. The fixup was
>>>>>> that user mode had to get the data to satisfy the fault, then tell the
>>>>>> kernel what to do.
>>>>>>
>>>>>> This is much like the 35-years-ago work we did on AIX, called
>>>>>> external pagers at the time; or the more recent umap work,
>>>>>> https://computing.llnl.gov/projects/umap, used fairly widely in HPC.
>>>>>>
>>>>>> If you go this route, it's a bit less complex than what you are proposing.
>>>>>>
>>>>>> On Wed, Jan 7, 2026 at 1:09 PM Bakul Shah via 9fans <9fans@9fans.net>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Jan 7, 2026, at 8:41 AM, ori@eigenstate.org wrote:
>>>>>>>>
>>>>>>>> Quoth Bakul Shah via 9fans <9fans@9fans.net>:
>>>>>>>>> I have this idea that will horrify most of you!
>>>>>>>>>
>>>>>>>>> 1. Create an mmap device driver. You ask it for a new file handle
>>>>>>> which you use to communicate about memory mapping.
>>>>>>>>> 2. If you want to mmap some file, you open it and write its file
>>>>>>> descriptor along with other parameters (file offset, base addr, size, mode,
>>>>>>> flags) to your mmap file handle.
>>>>>>>>> 3. The mmap driver sets up necessary page table entries but doesn't
>>>>>>> actually fetch any data before returning from the write.
>>>>>>>>> 4. It can asynchronously kick off io requests on your behalf and
>>>>>>> fixup page table entries as needed.
>>>>>>>>> 5. Page faults in the mmapped area are serviced by making appropriate
>>>>>>> read/write calls.
>>>>>>>>> 6. Flags can be used to indicate read-ahead or write-behind for
>>>>>>> typical serial access.
>>>>>>>>> 7. Similarly msync, munmap etc. can be implemented.
>>>>>>>>>
>>>>>>>>> In a sneaky way this avoids the need for adding any mmap specific
>>>>>>> syscalls! But the underlying work would be mostly similar in either case.
>>>>>>>>>
>>>>>>> The main benefits of mmap are reduced initial latency, "pay as you
>>>>>>> go" cost structure and ease of use. It is certainly more expensive than
>>>>>>> reading/writing the same amount of data directly from a program.
>>>>>>>>>
>>>>>>>>> No idea how horrible a hack is needed to implement such a thing or
>>>>>>> even if it is possible at all but I had to share this ;-)
>>>>>>>>
>>>>>>>> To what end? The problems with mmap have little to do with adding a
>>>>>>> syscall;
>>>>>>>> they're about how you do things like communicating I/O errors.
>>>>>>> Especially
>>>>>>>> when flushing the cache.
>>>>>>>>
>>>>>>>> Imagine the following setup -- I've imported 9p.io:
>>>>>>>>
>>>>>>>> 9fs 9pio
>>>>>>>>
>>>>>>>> and then I map a file from it:
>>>>>>>>
>>>>>>>> mapped = mmap("/n/9pio/plan9/lib/words", OWRITE);
>>>>>>>>
>>>>>>>> Now, I want to write something into the file:
>>>>>>>>
>>>>>>>> *mapped = 1234;
>>>>>>>>
>>>>>>>> The cached version of the page is dirty, so the OS will
>>>>>>>> eventually need to flush it back with a 9p Twrite; Let's
>>>>>>>> assume that before this happens, the network goes down.
>>>>>>>>
>>>>>>>> How do you communicate the error with userspace?
>>>>>>>
>>>>>>> This was just a brainwave but...
>>>>>>>
>>>>>>> You have a (control) connection with the mmap device to
>>>>>>> set up mmap so might as well use it to convey errors!
>>>>>>> This device would be strictly local to where a program
>>>>>>> runs.
>>>>>>>
>>>>>>> I'd even consider allowing a separate process to mmap,
>>>>>>> by making an address space a first class object. That'd
>>>>>>> move more stuff out of the kernel and allow for more
>>>>>>> interesting/esoteric uses.
>>>>>> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
>>>>> <https://9fans.topicbox.com/groups/9fans> + participants
>>>>> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
>>>>> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
>>>>> <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mf3cfeeb18fd00292d3f9063f>
>>>>>
>
> --
> Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
> http://www.kergis.com/
> http://kertex.kergis.com/
> Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mae35de7f0d27935dac0dd51b
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-09 19:48 ` David Leimbach via 9fans
@ 2026-02-05 21:30 ` Alyssa M via 9fans
2026-02-08 14:18 ` Ethan Azariah
2026-02-08 14:08 ` Ethan Azariah
1 sibling, 1 reply; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-05 21:30 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 824 bytes --]
I tend to think about mmap the same way I do about vfork: with the right virtual memory system and file system it shouldn't be necessary. A combination of demand paging and copy-on-write should make read/write of large areas of files as efficient as memory mapping - the main difference being that, with read/write, memory is isolated from both the file and other processes, whereas memory mapping implies some degree of sharing. On the whole I think I'd rather have the isolation.
Failable I/O will clearly mix badly with demand paging, but I think the solution is to make the I/O not fail.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M4339bf8c643a58a78486b7e8
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1360 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-01-09 19:48 ` David Leimbach via 9fans
2026-02-05 21:30 ` Alyssa M via 9fans
@ 2026-02-08 14:08 ` Ethan Azariah
1 sibling, 0 replies; 92+ messages in thread
From: Ethan Azariah @ 2026-02-08 14:08 UTC (permalink / raw)
To: David Leimbach, 9fans
On Fri, Jan 9, 2026, at 7:48 PM, David Leimbach wrote:
>
> My point was you can do recursive paging. It has been shown.
Not only can it be done, the regulars on forum.osdev.org often recommend newbies implement recursive paging because it's easier than other paging schemes. ;) Admittedly, this advice pertains to developing an OS from scratch, typically when a newbie is wondering where to allocate memory for page tables.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M2daee3b4070d53387c1fbec7
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-05 21:30 ` Alyssa M via 9fans
@ 2026-02-08 14:18 ` Ethan Azariah
2026-02-08 15:10 ` Alyssa M via 9fans
0 siblings, 1 reply; 92+ messages in thread
From: Ethan Azariah @ 2026-02-08 14:18 UTC (permalink / raw)
To: 9fans
On Thu, Feb 5, 2026, at 9:30 PM, Alyssa M wrote:
>
> Failable I/O will clearly mix badly with demand paging, but I think the
> solution is to make the I/O not fail.
This brought on a thousand-yard stare for a moment. ;) Besides my experience of various ordinary failures over the last 30 years, my Internet connection in the last 2 weeks has been a disaster!
But then I realised Plan 9 expects reliability and it's not really so bad. If you know the network will be unreliable, you tunnel connections over `aan` (always available network). Provided the server IP address doesn't change, aan can reconnect after failures.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mde5c4563926e2a88a7b46201
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-08 14:18 ` Ethan Azariah
@ 2026-02-08 15:10 ` Alyssa M via 9fans
2026-02-08 20:43 ` Ethan Azariah
0 siblings, 1 reply; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-08 15:10 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 739 bytes --]
aan(8) was at the back of my mind, but I seem to recall NFS will even survive a server reboot by being stateless (not that I've actually tried that...). I was thinking about a local 9p file system filter that could reconnect to a failing remote server when it reboots, reopen any remote open files, and restore their seek pointers, so a remote server could come back from the dead and finish servicing a page fault, and file reads could continue on their way after a few minutes' pause...
Has someone built such a thing?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M4f90f7ffbcb4bbbf38ee4a33
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1252 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-08 15:10 ` Alyssa M via 9fans
@ 2026-02-08 20:43 ` Ethan Azariah
2026-02-09 1:35 ` ron minnich
0 siblings, 1 reply; 92+ messages in thread
From: Ethan Azariah @ 2026-02-08 20:43 UTC (permalink / raw)
To: 9fans
On Sun, Feb 8, 2026, at 3:10 PM, Alyssa M wrote:
> I seem to recall NFS will even
> survive a server reboot by being stateless (not that I've actually
> tried that...)
I forget exactly which NFS version I was using back in 2004, but programs with open files didn't survive me tripping over an ethernet cable despite the disconnect not lasting 10 seconds. Server & client were Linux. I remember wondering what NFS's statelessness was for, exactly, though I guess the failure to 'come back' might have been an implementation issue. Newer NFS versions aren't stateless.
Surviving a server reboot would be nice though. :)
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mf32de24f9513963fc05bf27e
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-08 20:43 ` Ethan Azariah
@ 2026-02-09 1:35 ` ron minnich
2026-02-09 15:23 ` ron minnich
0 siblings, 1 reply; 92+ messages in thread
From: ron minnich @ 2026-02-09 1:35 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1814 bytes --]
Suns would survive server outages, at least in the 90s. Linux NFS had its
own ideas about failure.
Statelessness, like everything, has its good and bad points. Note that NFS
was never truly stateless for v2 and later; servers had to have a dup
cache, for practical reasons.
Stateless is not cheap. NFS does not even have a mount rpc, for example, so
every packet carries with it authentication information and user identity.
Every. Single. One.
But you could reboot a server, and you'd see the infamous "nfs server not
responding still trying" on the client for hard mounts. For soft mounts,
you'd see data loss. For spongy mounts, well, some combination of the two
:-)
When all is said and done, like it or not, NFS has had greater success than
9p, for all kinds of reasons, some of which make sense, others which don't.
On Sun, Feb 8, 2026 at 1:01 PM Ethan Azariah <eekee57@fastmail.fm> wrote:
> On Sun, Feb 8, 2026, at 3:10 PM, Alyssa M wrote:
> > I seem to recall NFS will even
> > survive a server reboot by being stateless (not that I've actually
> > tried that...)
>
> I forget exactly which NFS version I was using back in 2004, but programs
> with open files didn't survive me tripping over an ethernet cable despite
> the disconnect not lasting 10 seconds. Server & client were Linux. I
> remember wondering what NFS's statelessness was for, exactly, though I
> guess the failure to 'come back' might have been an implementation issue.
> Newer NFS versions aren't stateless.
>
> Surviving a server reboot would be nice though. :)
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mbf4d68b3e4daf920dd4c2edb
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 3298 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-09 1:35 ` ron minnich
@ 2026-02-09 15:23 ` ron minnich
2026-02-09 17:13 ` Bakul Shah via 9fans
` (2 more replies)
0 siblings, 3 replies; 92+ messages in thread
From: ron minnich @ 2026-02-09 15:23 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 2429 bytes --]
as for mmap, there's already a defacto mmap happening for executables. They
are not read into memory. In fact, the first instruction you run in a
binary results in a page fault.
Consider a binary larger than your physical memory (this can happen).
Without the defacto mmap, you could not run it.
Similarly, in HPC, there are data sets far larger than physical memory.
mmap makes use of these data sets manageable. Nothing else has been
proposed which comes close.
ron
On Sun, Feb 8, 2026 at 5:35 PM ron minnich <rminnich@gmail.com> wrote:
> suns would survive server outages. At least in the 90s. Linux NFS had its
> own ideas for failure.
>
> Statelessness, like everything, has its good and bad points. Note that NFS
> was never truly stateless for v2 and later; servers had to have a dup
> cache, for practical reasons.
>
> Stateless is not cheap. NFS does not even have a mount rpc, for example,
> so every packet carries with it authentication information and user
> identity. Every. Single. One.
>
> But you could reboot a server, and you'd see the infamous "nfs server not
> responding still trying" on the client for hard mounts. For soft mounts,
> you'd see data loss. For spongy mounts, well, some combination of the two
> :-)
>
> When all is said and done, like it or not, NFS has had greater success
> than 9p, for all kinds of reasons, some of which make sense, others which
> don't.
>
>
>
> On Sun, Feb 8, 2026 at 1:01 PM Ethan Azariah <eekee57@fastmail.fm> wrote:
>
>> On Sun, Feb 8, 2026, at 3:10 PM, Alyssa M wrote:
>> > I seem to recall NFS will even
>> > survive a server reboot by being stateless (not that I've actually
>> > tried that...)
>>
>> I forget exactly which NFS version I was using back in 2004, but programs
>> with open files didn't survive me tripping over an ethernet cable despite
>> the disconnect not lasting 10 seconds. Server & client were Linux. I
>> remember wondering what NFS's statelessness was for, exactly, though I
>> guess the failure to 'come back' might have been an implementation issue.
>> Newer NFS versions aren't stateless.
>>
>> Surviving a server reboot would be nice though. :)
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M8987b510e5a3f447ba052749
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 4227 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-09 15:23 ` ron minnich
@ 2026-02-09 17:13 ` Bakul Shah via 9fans
2026-02-09 21:38 ` ron minnich
2026-02-10 10:13 ` Alyssa M via 9fans
2026-02-10 16:49 ` wb.kloke
2 siblings, 1 reply; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-02-09 17:13 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 3321 bytes --]
Exactly. And the same failure modes exist (if your swap device or exec file access suddenly fails). In general Unix-type OSes only handle the "happy path" well and do not expend heroic efforts to deal with errors.
Here by mmap Ron and I mean memory mapping and not linux/BSD specific mmap API. If any mmap API is added to plan9, it need not follow the example of linux/BSD but it should be well integrated.
> On Feb 9, 2026, at 7:23 AM, ron minnich <rminnich@gmail.com> wrote:
>
> as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
>
> Consider a binary larger than your physical memory (this can happen). Without the defacto mmap, you could not run it.
>
> Similarly, in HPC, there are data sets far larger than physical memory. mmap makes use of these data sets manageable. Nothing else has been proposed which comes close.
>
> ron
>
> On Sun, Feb 8, 2026 at 5:35 PM ron minnich <rminnich@gmail.com <mailto:rminnich@gmail.com>> wrote:
>> suns would survive server outages. At least in the 90s. Linux NFS had its own ideas for failure.
>>
>> Statelessness, like everything, has its good and bad points. Note that NFS was never truly stateless for v2 and later; servers had to have a dup cache, for practical reasons.
>>
>> Stateless is not cheap. NFS does not even have a mount rpc, for example, so every packet carries with it authentication information and user identity. Every. Single. One.
>>
>> But you could reboot a server, and you'd see the infamous "nfs server not responding still trying" on the client for hard mounts. For soft mounts, you'd see data loss. For spongy mounts, well, some combination of the two :-)
>>
>> When all is said and done, like it or not, NFS has had greater success than 9p, for all kinds of reasons, some of which make sense, others which don't.
>>
>>
>>
>> On Sun, Feb 8, 2026 at 1:01 PM Ethan Azariah <eekee57@fastmail.fm <mailto:eekee57@fastmail.fm>> wrote:
>>> On Sun, Feb 8, 2026, at 3:10 PM, Alyssa M wrote:
>>> > I seem to recall NFS will even
>>> > survive a server reboot by being stateless (not that I've actually
>>> > tried that...)
>>>
>>> I forget exactly which NFS version I was using back in 2004, but programs with open files didn't survive me tripping over an ethernet cable despite the disconnect not lasting 10 seconds. Server & client were Linux. I remember wondering what NFS's statelessness was for, exactly, though I guess the failure to 'come back' might have been an implementation issue. Newer NFS versions aren't stateless.
>>>
>>> Surviving a server reboot would be nice though. :)
>
> 9fans <https://9fans.topicbox.com/latest> / 9fans / see discussions <https://9fans.topicbox.com/groups/9fans> + participants <https://9fans.topicbox.com/groups/9fans/members> + delivery options <https://9fans.topicbox.com/groups/9fans/subscription>Permalink <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M8987b510e5a3f447ba052749>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Ma355f77548e1bce9f0f6683d
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 5133 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-09 17:13 ` Bakul Shah via 9fans
@ 2026-02-09 21:38 ` ron minnich
0 siblings, 0 replies; 92+ messages in thread
From: ron minnich @ 2026-02-09 21:38 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 3557 bytes --]
I still think an external pager interface would be both easier and more
useful than the unix mmap api.
On Mon, Feb 9, 2026 at 11:57 AM Bakul Shah via 9fans <9fans@9fans.net>
wrote:
> Exactly. And the same failure modes exist (if your swap device or exec
> file access suddenly fails). In general Unix-type OSes only handle the
> "happy path" well and do not expend heroic efforts to deal with errors.
>
> Here by mmap Ron and I mean memory mapping and not linux/BSD specific
> mmap *API*. If any mmap API is added to plan9, it need not follow the
> example of linux/BSD but it should be well integrated.
>
> On Feb 9, 2026, at 7:23 AM, ron minnich <rminnich@gmail.com> wrote:
>
> as for mmap, there's already a defacto mmap happening for executables.
> They are not read into memory. In fact, the first instruction you run in a
> binary results in a page fault.
>
> Consider a binary larger than your physical memory (this can happen).
> Without the defacto mmap, you could not run it.
>
> Similarly, in HPC, there are data sets far larger than physical memory.
> mmap makes use of these data sets manageable. Nothing else has been
> proposed which comes close.
>
> ron
>
> On Sun, Feb 8, 2026 at 5:35 PM ron minnich <rminnich@gmail.com> wrote:
>
>> suns would survive server outages. At least in the 90s. Linux NFS had its
>> own ideas for failure.
>>
>> Statelessness, like everything, has its good and bad points. Note that
>> NFS was never truly stateless for v2 and later; servers had to have a dup
>> cache, for practical reasons.
>>
>> Stateless is not cheap. NFS does not even have a mount rpc, for example,
>> so every packet carries with it authentication information and user
>> identity. Every. Single. One.
>>
>> But you could reboot a server, and you'd see the infamous "nfs server not
>> responding still trying" on the client for hard mounts. For soft mounts,
>> you'd see data loss. For spongy mounts, well, some combination of the two
>> :-)
>>
>> When all is said and done, like it or not, NFS has had greater success
>> than 9p, for all kinds of reasons, some of which make sense, others which
>> don't.
>>
>>
>>
>> On Sun, Feb 8, 2026 at 1:01 PM Ethan Azariah <eekee57@fastmail.fm> wrote:
>>
>>> On Sun, Feb 8, 2026, at 3:10 PM, Alyssa M wrote:
>>> > I seem to recall NFS will even
>>> > survive a server reboot by being stateless (not that I've actually
>>> > tried that...)
>>>
>>> I forget exactly which NFS version I was using back in 2004, but
>>> programs with open files didn't survive me tripping over an ethernet cable
>>> despite the disconnect not lasting 10 seconds. Server & client were Linux.
>>> I remember wondering what NFS's statelessness was for, exactly, though I
>>> guess the failure to 'come back' might have been an implementation issue.
>>> Newer NFS versions aren't stateless.
>>>
>>> Surviving a server reboot would be nice though. :)
> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
> <https://9fans.topicbox.com/groups/9fans> + participants
> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
> <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Ma355f77548e1bce9f0f6683d>
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M39dbdc7846dddfc837d17de5
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 5439 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-09 15:23 ` ron minnich
2026-02-09 17:13 ` Bakul Shah via 9fans
@ 2026-02-10 10:13 ` Alyssa M via 9fans
2026-02-11 1:43 ` Ron Minnich
` (2 more replies)
2026-02-10 16:49 ` wb.kloke
2 siblings, 3 replies; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-10 10:13 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 605 bytes --]
On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
I think one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M1301bada12dd4cab2ed1b120
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1190 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-09 15:23 ` ron minnich
2026-02-09 17:13 ` Bakul Shah via 9fans
2026-02-10 10:13 ` Alyssa M via 9fans
@ 2026-02-10 16:49 ` wb.kloke
2 siblings, 0 replies; 92+ messages in thread
From: wb.kloke @ 2026-02-10 16:49 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 875 bytes --]
On Monday, 9 February 2026, at 4:24 PM, ron minnich wrote:
> as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
>
Reading this, I got the idea that, to apply this to venti, one could put the text segment of the venti server into the unused part of the venti partition and load the data log as the data segment. Probably this would not work with production venti arena partitions, but it might for small (less than 2GB) ones.
What would happen to the data segment? It won't be written back to the program image, will it?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M8debf57b3aeb8038220b72e7
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1501 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-10 10:13 ` Alyssa M via 9fans
@ 2026-02-11 1:43 ` Ron Minnich
2026-02-11 2:19 ` Bakul Shah via 9fans
2026-02-11 3:21 ` Ori Bernstein
2 siblings, 0 replies; 92+ messages in thread
From: Ron Minnich @ 2026-02-11 1:43 UTC (permalink / raw)
To: 9fans
On Tue, Feb 10, 2026 at 9:31 AM Alyssa M via 9fans <9fans@9fans.net> wrote:
>
> On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
>
> as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
>
> I thinking one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
This is what SunOS did in 1988, and it proved tricky. It took a while
to iron it all out, and it was certainly a very complex thing to chase
down.
It's certainly not as simple as your one-liner above :-)
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mc3e17125b9c934e917096903
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-10 10:13 ` Alyssa M via 9fans
2026-02-11 1:43 ` Ron Minnich
@ 2026-02-11 2:19 ` Bakul Shah via 9fans
2026-02-11 3:21 ` Ori Bernstein
2 siblings, 0 replies; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-02-11 2:19 UTC (permalink / raw)
To: 9fans
On Feb 10, 2026, at 2:13 AM, Alyssa M via 9fans <9fans@9fans.net> wrote:
>
> On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
>> as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
> I thinking one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
If you mean transparently mapping page-aligned read/write, it can get messy. For reads you'd have to fix up page tables to fault on access for every page that is not yet read in. For writes you'd have to do copy-on-write in case the user tries to modify the buffer passed to write(), until the old data is written out. This buys you reduced latency to first use at the expense of more kernel complexity.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Me6107462bbd77fd08c2eb21e
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-10 10:13 ` Alyssa M via 9fans
2026-02-11 1:43 ` Ron Minnich
2026-02-11 2:19 ` Bakul Shah via 9fans
@ 2026-02-11 3:21 ` Ori Bernstein
2026-02-11 10:01 ` hiro
` (2 more replies)
2 siblings, 3 replies; 92+ messages in thread
From: Ori Bernstein @ 2026-02-11 3:21 UTC (permalink / raw)
To: 9fans
On Tue, 10 Feb 2026 05:13:47 -0500
"Alyssa M via 9fans" <9fans@9fans.net> wrote:
> On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
> I thinking one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
That gets... interesting, from an FS semantics point of view.
What does this code print? Does it change with buffer sizes?
char buf[4];
int fd;

fd = open("x", ORDWR);
pwrite(fd, "foo", 4, 0);
read(fd, buf, 4);
pwrite(fd, "bar", 4, 0);
print("%s\n", buf);
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M8b7083e721cfc7f9b15523ef
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-11 3:21 ` Ori Bernstein
@ 2026-02-11 10:01 ` hiro
2026-02-12 1:36 ` Dan Cross
2026-02-11 14:22 ` Dan Cross
2026-02-12 3:12 ` Alyssa M via 9fans
2 siblings, 1 reply; 92+ messages in thread
From: hiro @ 2026-02-11 10:01 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1677 bytes --]
i'm not sure i understand even the most abstract topic discussed here,
what's the advantage of logically organizing your data in size-constrained
unnamed page tables as opposed to files in a named tree?
in the end you are assuming your memloads are not gonna be translated into
some message passing protocol underneath? are you sure you can treat all
your big data like one continuous memory region?
"The main benefits of mmap are reduced initial latency" -> latency from
where to where? what kind of information are you transmitting in that
procedure?
what concrete problem are you trying to solve?
On Wed, Feb 11, 2026 at 4:34 AM Ori Bernstein <ori@eigenstate.org> wrote:
> On Tue, 10 Feb 2026 05:13:47 -0500
> "Alyssa M via 9fans" <9fans@9fans.net> wrote:
>
> > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > as for mmap, there's already a defacto mmap happening for executables.
> They are not read into memory. In fact, the first instruction you run in a
> binary results in a page fault.
> > I thinking one could bring the same transparent/defacto memory mapping
> to read(2) and write(2), so the API need not change at all.
>
> That gets... interesting, from an FS semantics point of view.
> What does this code print? Does it change with buffer sizes?
>
> fd = open("x", ORDWR);
> pwrite(fd, "foo", 4, 0);
> read(fd, buf, 4);
> pwrite(fd, "bar", 4, 0);
> print("%s\n", buf);
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M8565c9f999a17aefbdadb48b
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 3330 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-11 3:21 ` Ori Bernstein
2026-02-11 10:01 ` hiro
@ 2026-02-11 14:22 ` Dan Cross
2026-02-11 18:44 ` Ori Bernstein
2026-02-12 3:12 ` Alyssa M via 9fans
2 siblings, 1 reply; 92+ messages in thread
From: Dan Cross @ 2026-02-11 14:22 UTC (permalink / raw)
To: 9fans
On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <ori@eigenstate.org> wrote:
> On Tue, 10 Feb 2026 05:13:47 -0500
> "Alyssa M via 9fans" <9fans@9fans.net> wrote:
>
> > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
> > I thinking one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
>
> That gets... interesting, from an FS semantics point of view.
> What does this code print? Does it change with buffer sizes?
>
> fd = open("x", ORDWR);
> pwrite(fd, "foo", 4, 0);
> read(fd, buf, 4);
> pwrite(fd, "bar", 4, 0);
> print("%s\n", buf);
It depends. Is `buf` some buffer on your stack or something similar
(a global, static buffer, or heap-malloc'ed perhaps)? If so,
presumably it still prints "foo", since the `read` would have copied
the data out of any shared region and into process-private memory. Or,
is it a pointer to the start of some region that you mapped to "x"?
In that case, the whole program is suspect as it seems to operate well
outside of the assumptions of C, but on Plan 9, I'd kind of expect it
to print "bar".
Perhaps a better example to illustrate the challenge Ron was referring
to is to consider two processes, A and B: A opens a file for write, B
opens a file and then maps it read-only and shared. The sequence of
events is then that, B dereferences a pointer into the region it
mapped and reads the value there; then A seeks to that location, reads
the value, updates it in some way (say, increments an integer or
something), seeks back to the location in question and writes the new
value. B then reads through that pointer a second time; what value
does B see? Here, the answer depends on the implementation.
Early Unix synchronized file state between memory and disk by
channeling everything through the buffer cache; `write` actually
copied into the buffer cache, and dirty buffers were copied out to the
disk asynchronously; similarly, `read` copied data out of the cache,
and if a block was already in memory, it just copied it; but if the
block needed to be read in from disk to fulfill the `read`, a buffer
was allocated, the transfer from disk to buffer scheduled, and the
calling process was suspended until the transfer completed, at which
point the data was copied out of the buffer. For large reads and
writes, this process could repeat many times and the details could get
complex, especially as the number of buffers was fixed and relatively
small (especially on a PDP-11). But, with a little tuning and some
cooperation from programs, a lot of unnecessary IO could be avoided,
and it dramatically simplified synchronizing between processes. Unix
semantics said that reads and writes were "atomic" from the
perspective of other users of the filesystem: a partial write in
progress could not be observed by an outside reader, for example: this
was done by putting locks on inodes during read/write, and buffers
could be similarly locked while transferring from disk, and so on.
Similarly, Unix has used "virtual memory" almost from the start, even
on the PDP-11: the virtual address space was very small, but Unix
processes were mapped into virtual segments that started at address 0;
the kernel was mapped separately, and a region of memory shared
between the two (the "user area") was mapped at the top of data space
so that user processes could pass arguments into the kernel, and so
on. But notably, this was used mainly for protection and simplifying
user programs (which could use absolute addresses within their virtual
address space); the system was still firmly swap based: the address
space was too small to usefully support demand paging, and it's not
clear to me that the PDP-11 stored enough information when trapping to
restart an instruction after a page fault. Even on the PDP-11, there
was some sharing, though; cf the "sticky bit" for the text of
frequently-run executables.
When Unix was moved to the VAX, the address space was greatly
expanded, and demand paging was added in 3BSD and in Reiser's
elaboration of the work started with 32/V inside Bell Labs. To
accommodate sharing of demand paged segments, a separate "page cache"
was introduced, but this was essentially read-only, and was separate
from the buffer cache used for IO with open/close/read/write, etc.
Critically, entries in the page cache were page sized, while entries
in the buffer cache were block sized, and the two weren't necessarily
the same.
With larger address spaces, programmers wanted to use shared memory as
an alternative to traditional forms of IPC, like pipes, and the `mmap`
_design_, based on the `PMAP` call on TENEX/TOPS-20, was presented
with 4.2BSD, but not implemented. Sun did an independent
implementation for SunOS as did many of the commercial Unix vendors;
Berkeley eventually did an implementation in 4.4BSD (btw, there was
some talk that Sun would donate their VM system to Berkeley for 4.4,
but those talks fell through, so CSRG adopted the Mach VM, instead).
By then, Linux had done its own as well. System V had its own shared
memory stuff that was different, but the industry writ large more or
less settled on mmap by the mid- to late 1990s. `mmap` gave
programs a lot more control over the address space of the process they
run in, which led to things like supporting shared libraries and so
on. The Sun paper on doing that in SunOS4 is pretty interesting, btw;
shared library support is basically implemented outside of the kernel,
with a little cooperation from the C startup code in libc:
http://mcvoy.com/lm/papers/SunOS.shlib.pdf
But `mmap` worked in terms of the VM page cache, and with writable
mappings, the page cache necessarily became read/write. But
open/close/read/write continued to use the buffer cache, which was
separate, which meant that a region of any given file might be present
in both caches simultaneously, and with no interlock between the two,
a write to one wouldn't necessarily update the other. If programs
were using both for IO on a file simultaneously, the caches could
easily fall out of sync and the state of the file on disk would be
indeterminate. Unifying the two caches addressed this, and SunOS did
that in the 1980s, though it took a long time to get right and
ultimately wasn't done for everything (directories remained in the
buffer cache but not the page cache). Something that aggravates some
old time Sun engineers is that when ZFS was implemented in Solaris, it
used its own cache (the ARC) that wasn't synchronized with the page
cache, undoing much of the earlier work. The ZFS implementors mostly
don't care, and note that using `mmap` for output is fraught:
https://db.cs.cmu.edu/mmap-cidr2022/
When Plan 9 came along, the entire architecture changed. Sure,
executables are faulted in on demand and things like `segattach` exist
to support memory-mapped IO devices and the like, but there's no real
equivalent to the full generality of `mmap`, and no shared libraries.
There are legitimate questions about synchronization when mapping a
file served by a remote server somewhere, and error handling is a
challenge: how does one detect a write failure via a store into a
mapped region? Presumably, that's reflected into an exception that's
delivered to the process in the form of a note or something (if you
really want to twist your noggin around, look at how Multics did it.
Interestingly, the semantics of deleting a segment on Multics are
closer to how Plan 9 deals with deleting a currently open file than
unlink'ing a name on Unix, though of course that's dependent on the
backing file server). But if error handling for a _store_ involves a
round-trip to/from some distant machine, that can be painful.
`mmap` is a lot easier to get right on a single machine with local
storage and no networking at play; and yes, Sun did it for NFS (to
support running dynamically linked executables from NFS-mounted
filesystems), but it's a pain and the semantics evolved informally,
instead of being well-defined from the start.
Bottom line: it's not trivial. Fortunately, if someone wants to make a
go of it, there's close to 50 years worth of prior art to learn from.
- Dan C.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Me8f735d3c62aac435db5b793
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-11 14:22 ` Dan Cross
@ 2026-02-11 18:44 ` Ori Bernstein
2026-02-12 1:22 ` Dan Cross
0 siblings, 1 reply; 92+ messages in thread
From: Ori Bernstein @ 2026-02-11 18:44 UTC (permalink / raw)
To: 9fans; +Cc: Dan Cross
On Wed, 11 Feb 2026 09:22:06 -0500
Dan Cross <crossd@gmail.com> wrote:
> On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <ori@eigenstate.org> wrote:
> > On Tue, 10 Feb 2026 05:13:47 -0500
> > "Alyssa M via 9fans" <9fans@9fans.net> wrote:
> >
> > > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > > as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
> > > I thinking one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
> >
> > That gets... interesting, from an FS semantics point of view.
> > What does this code print? Does it change with buffer sizes?
> >
> > fd = open("x", ORDWR);
> > pwrite(fd, "foo", 4, 0);
> > read(fd, buf, 4);
> > pwrite(fd, "bar", 4, 0);
> > print("%s\n", buf);
>
> It depends. Is `buf` some buffer on your stack or something similar
> (a global, static buffer, or heap-malloc'ed perhaps)? If so,
> presumably it still prints "foo", since the `read` would have copied
> the data out of any shared region and into process-private memory. Or,
> is it a pointer to the start of some region that you mapped to "x"?
> In that case, the whole program is suspect as it seems to operate well
> outside of the assumptions of C, but on Plan 9, I'd kind of expect it
> to print "bar".
In this example, no trickery; single threaded code, nothing fancy.
Currently, it does what you suggest, but if you did transparent
mapping of the file when the alignment worked out, would that be guaranteed?
What if another process wrote to the file after the read? What if
a different machine did?
Right now, eagerly copying works with no surprises, but lazily
mmapping adds a lot of things to get right.
>
> Perhaps a better example to illustrate the challenge Ron was referring
> to is to consider two processes, A and B: A opens a file for write, B
> opens a file and then maps it read-only and shared. The sequence of
> events is then that, B dereferences a pointer into the region it
> mapped and reads the value there; then A seeks to that location, reads
> the value, updates it in some way (say, increments an integer or
> something), seeks back to the location in question and writes the new
> value. B then reads through that pointer a second time; what value
> does B see? Here, the answer depends on the implementation.
My point is that you don't even need to get very fancy to get
semantic difficulties with explicit mmap, you just need to realize
that reads can get delayed by arbitrary amounts of time, allowing
anyone to come in and modify things behind the program's back.
I don't think you can get it right without leaking a great deal
of how the file has been mapped back into the file server.
> Early Unix synchronized file state between memory and disk by
> channeling everything through the buffer cache; `write` actually
> copied into the buffer cache, and dirty buffers were copied out to the
> disk asynchronously; similarly, `read` copied data out of the cache,
> and if a block was already in memory, it just copied it; but if the
> block needed to be read in from disk to fulfill the `read`, a buffer
> was allocated, the transfer from disk to buffer scheduled, and the
> calling process was suspended until the transfer completed, at which
> point the data was copied out of the buffer. For large reads and
> writes, this process could repeat many times and the details could get
> complex, especially as the number of buffers was fixed and relatively
> small (especially on a PDP-11). But, with a little tuning and some
> cooperation from programs, a lot of unnecessary IO could be avoided,
> and it dramatically simplified synchronizing between processes. Unix
> semantics said that reads and writes were "atomic" from the
> perspective of other users of the filesystem: a partial write in
> progress could not be observed by an outside reader, for example: this
> was done by putting locks on inodes during read/write, and buffers
> could be similarly locked while transferring from disk, and so on.
>
> Similarly, Unix has used "virtual memory" almost from the start, even
> on the PDP-11: the virtual address space was very small, but Unix
> processes were mapped into virtual segments that started at address 0;
> the kernel was mapped separately, and a region of memory shared
> between the two (the "user area") was mapped at the top of data space
> so that user processes could pass arguments into the kernel, and so
> on. But notably, this was used mainly for protection and simplifying
> user programs (which could use absolute addresses within their virtual
> address space); the system was still firmly swap based: the address
> space was too small to usefully support demand paging, and it's not
> clear to me that the PDP-11 stored enough information when trapping to
> restart an instruction after a page fault. Even on the PDP-11, there
> was some sharing, though; cf the "sticky bit" for the text of
> frequently-run executables.
>
> When Unix was moved to the VAX, the address space was greatly
> expanded, and demand paging was added in 3BSD and in Reiser's
> elaboration of the work started with 32/V inside Bell Labs. To
> accommodate sharing of demand paged segments, a separate "page cache"
> was introduced, but this was essentially read-only, and was separate
> from the buffer cached, used for IO with open/close/read/write etc.
> Critically, entries in the page cache were page sized, while entries
> in the buffer cache were block sized, and the two weren't necessarily
> the same.
>
> With larger address spaces, programmers wanted to use shared memory as
> an alternative to traditional forms of IPC, like pipes, and the `mmap`
> _design_, based on the `PMAP` call on TENEX/TOPS-20, was presented
> with 4.2BSD, but not implemented. Sun did an independent
> implementation for SunOS as did many of the commercial Unix vendors;
> Berkeley eventually did an implementation in 4.4BSD (btw, there was
> some talk that Sun would donate their VM system to Berkeley for 4.4,
> but those talks fell through, so CSRG adopted the Mach VM, instead).
> By then, Linux had down their own as well. System V had its own shared
> memory stuff that was different, but the industry writ large more or
> less settled on mmap by the end of mid- to late-1990s. `mmap` gave
> programs a lot more control over the address space of the process they
> run in, which led to things like supporting shared libraries and so
> on. The Sun paper on doing that in SunOS4 is pretty interesting, btw;
> shared library support is basically implemented outside of the kernel,
> with a little cooperation from the C startup code in libc:
> http://mcvoy.com/lm/papers/SunOS.shlib.pdf
>
> But `mmap` worked in terms of the VM page cache, and with writable
> mappings, the page cache necessarily became read/write. But
> open/close/read/write continued to use the buffer cache, which was
> separate, which meant that a region of any given file might be present
> in both caches simultaneously, and with no interlock between the two,
> a write to one wouldn't necessarily update the other. If programs
> were using both for IO on a file simultaneously, the caches could
> easily fall out of sync and the state of the file on disk would be
> indeterminate. Unifying the two caches addressed this, and SunOS did
> that in the 1980s, but which took a long time to get right and
> ultimately wasn't done for everything (directories remained in the
> buffer cache but not the page cache). Something that aggravates some
> old time Sun engineers is that when ZFS was implemented in Solaris, it
> used its own cache (the ARC) that wasn't synchronized with the page
> cache, undoing much of the earlier work. The ZFS implementors mostly
> don't care, and note that using `mmap` for output is fraught:
> https://db.cs.cmu.edu/mmap-cidr2022/
>
> When Plan 9 came along, the entire architecture changed. Sure,
> executables are faulted in on demand and things like `segattach` exist
> to support memory-mapped IO devices and the like, but there's no real
> equivalent to the full generality of `mmap`, and no shared libraries.
> here are legitimate questions about synchronization when mapping a
> file served by a remote server somewhere, and error handling is a
> challenge: how does one detect a write failure via a store into a
> mapped region? Presumably, that's reflected into an exception that's
> delivered to the process in the form of a note or something (if you
> really want to twist your noggin around, look at how Multics did it.
> Interestingly, the semantics of deleting a segment on Multics are
> closer to how Plan 9 deals with deleting a currently open file than
> unlink'ing a name on Unix, though of course that's dependent on the
> backing file server). But if error handling for a _store_ involves a
> round-trip to/from some distant machine, that can be painful.
>
> `mmap` is a lot easier to get right on a single machine with local
> storage and no networking at play and yes, Sun did it for NFS (to
> support running dynamically linked executables from NFS-mounted
> filesystems), but it's a pain and the semantics evolved informally,
> instead of being well-defined from the start.
>
> Bottom line: it's not trivial. Fortunately, if someone wants to make a
> go of it, there's close to 50 years worth of prior art to learn from.
>
> - Dan C.
--
Ori Bernstein <ori@eigenstate.org>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mecd608135b0c0e70a20c7040
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [9fans] venti /plan9port mmapped
2026-01-08 15:59 ` wb.kloke
@ 2026-02-11 23:19 ` red
0 siblings, 0 replies; 92+ messages in thread
From: red @ 2026-02-11 23:19 UTC (permalink / raw)
To: wb.kloke; +Cc: 9fans
> I include data on the read performance vs. the
> previous traditional-io version. The mmapped
> version is on amd64/32GB freebsd, the traditional
> on WD mycloud Ex2 (arm7/1G).
>
> [...]
>
> The arena partition resides on the mycloud, and is
> served via nfs readonly to the amd.
what are you even testing here??
you're benchmarking venti between a local machine
and a remote one with completely different cpu,
ram and io paths, _while_ claiming that mmap is
providing a 4x speed increase?
this makes no sense! i dont understand this!
how is mmap related to any of this?
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-11 18:44 ` Ori Bernstein
@ 2026-02-12 1:22 ` Dan Cross
2026-02-12 4:26 ` Ori Bernstein
0 siblings, 1 reply; 92+ messages in thread
From: Dan Cross @ 2026-02-12 1:22 UTC (permalink / raw)
To: Ori Bernstein; +Cc: 9fans
On Wed, Feb 11, 2026 at 1:44 PM Ori Bernstein <ori@eigenstate.org> wrote:
> On Wed, 11 Feb 2026 09:22:06 -0500 Dan Cross <crossd@gmail.com> wrote:
> > On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <ori@eigenstate.org> wrote:
> > > On Tue, 10 Feb 2026 05:13:47 -0500
> > > "Alyssa M via 9fans" <9fans@9fans.net> wrote:
> > >
> > > > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > > > as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
> > > > I thinking one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
> > >
> > > That gets... interesting, from an FS semantics point of view.
> > > What does this code print? Does it change with buffer sizes?
> > >
> > > fd = open("x", ORDWR);
> > > pwrite(fd, "foo", 4, 0);
> > > read(fd, buf, 4);
> > > pwrite(fd, "bar", 4, 0);
> > > print("%s\n", buf);
> >
> > It depends. Is `buf` some buffer on your stack or something similar
> > (a global, static buffer, or heap-malloc'ed perhaps)? If so,
> > presumably it still prints "foo", since the `read` would have copied
> > the data out of any shared region and into process-private memory. Or,
> > is it a pointer to the start of some region that you mapped to "x"?
> > In that case, the whole program is suspect as it seems to operate well
> > outside of the assumptions of C, but on Plan 9, I'd kind of expect it
> > to print "bar".
>
> In this example, no trickery; single threaded code, nothing fancy.
Ok. Perhaps implicitly you also mean that there's no `mmap` involved?
> Currently, it does what you suggest, but if you did transparent
> mapping of the file if the alignment worked, would that be guaranteed?
I don't think that changes my earlier conclusion, which is it depends
entirely on where `buf` is.
> What if another process wrote to the file after the read? what if
> a different machine did?
All of these things are issues, but I don't see how that changes your
example, which I don't think illustrates the problems with mmap
specifically very convincingly.
1. You write a few bytes to some file referred to by `fd` at offset 0,
2. You then read from offset 0 in that same file into some buffer called `buf`,
3. You then write some bytes to that same file, again at offset 0,
4. Finally, you print the contents of `buf` and ask what it displays.
As before, the answer is entirely, "it depends." There are too many
unknowns, including, critically, that you didn't say where `buf` lives;
that's really the thing that matters vis-à-vis `mmap` in this example.
> Right now, eagerly copying works with no surprised, but lazily
> mmapping adds a lot of things to get right.
I don't see how it's any appreciably different from the program's
perspective, unless `buf` is somehow aliasing the beginning of a
region that was `mmap`'ed to the file you are mutating via `pwrite`.
But that's weird, and all bets are kind of off in that case; I view
that as not as bad, but sort of in the same villainous league, as
opening /proc/$pid/mem and grubbing around.
If `buf` is in your proc-local stack segment, however, then I don't
expect anything to change it once read, regardless of what you do to
the file, whether mmapped or manipulated via explicit write calls.
With `read`, the contents may not be the same as what you wrote, but
that's true today: did another process open it and write to it, after
your first `pwrite` but before your `read`? Maybe; depending on the
file and the way it was opened (does it have the `l` bit set?). If so,
it could print anything; it may print "baz" for all we know. For that
matter, you don't even know if the `buf` is nul terminated when you
pass it to `print("%s\n");`. Are you racing against another proc you
yourself `rfork`'ed? Maybe, but you said no, so I'll assume not.
Anyway, my point wasn't that there aren't issues; there demonstrably
are. But I think that if you want to provide a good example for the
dangers of `mmap` here, something like this is better:
int fd = open("something", OREAD|OWRITE);
char *p = mmap(fd, RD|WR, len, etc);
char a[4], b[4];
pwrite(fd, "foo", 4, 0);
pread(fd, a, 4, 0);
memmove(p, "bar", 4);
pread(fd, b, 4, 0);
if (memcmp(a, b, 4) == 0)
	print("same same\n");
else
	print("but different\n");
Let's assume the happy path where this isn't racing against anything
else mutating "something", nothing crashes, or anything of that
nature: in that case, I assert that it could print either string, but
that `a` unconditionally contains { 'f', 'o', 'o', '\0' } while the
contents of `b` might be either { 'f', 'o', 'o', '\0' } or they might
be { 'b', 'a', 'r', '\0' } or some combination---all depending on the
semantics of `mmap` and how it's implemented, and, perhaps
surprisingly, `memmove` and its implementation and the particular
properties of where `a` and `b` wind up in memory.
> > Perhaps a better example to illustrate the challenge Ron was referring
> > to is to consider two processes, A and B: A opens a file for write, B
> > opens a file and then maps it read-only and shared. The sequence of
> > events is then that, B dereferences a pointer into the region it
> > mapped and reads the value there; then A seeks to that location, reads
> > the value, updates it in some way (say, increments an integer or
> > something), seeks back to the location in question and writes the new
> > value. B then reads through that pointer a second time; what value
> > does B see? Here, the answer depends on the implementation.
>
> My point is that you don't even need to get very fancy to get
> semantic difficulties with explicit mmap, you just need to realize
> that reads can get delayed by arbitrary amounts of time, allowing
> anyone to come in and modify things behind the program's back.
Yes, but that's true today with read and write; I don't see what that
has to do with mmap specifically. Also, I think I said as much in the
remainder of my response.
> I don't think you can get it right without leaking a great deal
> of how the file has been mapped back into the file server.
It depends on how you define your semantics.
If you declare that writes are coherent with reads, then yeah, you
need some sort of callback mechanism to let a reader know that their
cached copy is now stale, which means you need to track a lot of state
on the server, and presumably you need to set things up to fault on
writes so that you can reflect that back to the server when a program
does a store; that is going to be bad even for file-backed regions,
where every store in a `memmove` is a fault and a round-trip, but it's
going to really suck for anonymous memory (...so don't do that
for anon mem, which by definition isn't going to some server somewhere
anyway). But maybe you rely on lending out leases with a known
expiration date to clients; if a write faults, the client has to go to
a server, get a lease, remap the region writeable, restart the
operation, and then schedule a callback to revoke write access to the
region and flush its contents back up to the server when the lease
expires (or right before that, maybe).
If you declare instead that all bets are off and you can't write-map
with sharing, so stores won't be reflected back to writes on the
server, and that it's up to programs to synchronize writes themselves
using some external means, then it's not all that different than
open/close/read/write today in the case where you open a file, read a
copy locally, and work from that copy. If someone else writes to that
file while you're memmove'ing out of the shared region, you could see
inconsistent data, but you've already said that all bets are off,
so...oh well. Besides, someone can be writing some largish amount of
data to that same file in a loop using plain ol' `write` a chunk at a
time, and you could see the result of one of those writes before the
whole thing is done.
Anyway. Yeah, it's a hard problem, but all of these techniques have
been implemented before, on a variety of different systems, with
greater or lesser degrees of success. It's all there in the literature
to see what has worked on systems in the past, and what has not.
- Dan C.
> > Early Unix synchronized file state between memory and disk by
> > channeling everything through the buffer cache; `write` actually
> > copied into the buffer cache, and dirty buffers were copied out to the
> > disk asynchronously; similarly, `read` copied data out of the cache,
> > and if a block was already in memory, it just copied it; but if the
> > block needed to be read in from disk to fulfill the `read`, a buffer
> > was allocated, the transfer from disk to buffer scheduled, and the
> > calling process was suspended until the transfer completed, at which
> > point the data was copied out of the buffer. For large reads and
> > writes, this process could repeat many times and the details could get
> > complex, especially as the number of buffers was fixed and relatively
> > small (especially on a PDP-11). But, with a little tuning and some
> > cooperation from programs, a lot of unnecessary IO could be avoided,
> > and it dramatically simplified synchronizing between processes. Unix
> > semantics said that reads and writes were "atomic" from the
> > perspective of other users of the filesystem: a partial write in
> > progress could not be observed by an outside reader, for example: this
> > was done by putting locks on inodes during read/write, and buffers
> > could be similarly locked while transferring from disk, and so on.
> >
> > Similarly, Unix has used "virtual memory" almost from the start, even
> > on the PDP-11: the virtual address space was very small, but Unix
> > processes were mapped into virtual segments that started at address 0;
> > the kernel was mapped separately, and a region of memory shared
> > between the two (the "user area") was mapped at the top of data space
> > so that user processes could pass arguments into the kernel, and so
> > on. But notably, this was used mainly for protection and simplifying
> > user programs (which could use absolute addresses within their virtual
> > address space); the system was still firmly swap based: the address
> > space was too small to usefully support demand paging, and it's not
> > clear to me that the PDP-11 stored enough information when trapping to
> > restart an instruction after a page fault. Even on the PDP-11, there
> > was some sharing, though; cf the "sticky bit" for the text of
> > frequently-run executables.
> >
> > When Unix was moved to the VAX, the address space was greatly
> > expanded, and demand paging was added in 3BSD and in Reiser's
> > elaboration of the work started with 32/V inside Bell Labs. To
> > accommodate sharing of demand paged segments, a separate "page cache"
> > was introduced, but this was essentially read-only, and was separate
> > from the buffer cached, used for IO with open/close/read/write etc.
> > Critically, entries in the page cache were page sized, while entries
> > in the buffer cache were block sized, and the two weren't necessarily
> > the same.
> >
> > With larger address spaces, programmers wanted to use shared memory as
> > an alternative to traditional forms of IPC, like pipes, and the `mmap`
> > _design_, based on the `PMAP` call on TENEX/TOPS-20, was presented
> > with 4.2BSD, but not implemented. Sun did an independent
> > implementation for SunOS as did many of the commercial Unix vendors;
> > Berkeley eventually did an implementation in 4.4BSD (btw, there was
> > some talk that Sun would donate their VM system to Berkeley for 4.4,
> > but those talks fell through, so CSRG adopted the Mach VM, instead).
> > By then, Linux had down their own as well. System V had its own shared
> > memory stuff that was different, but the industry writ large more or
> > less settled on mmap by the end of mid- to late-1990s. `mmap` gave
> > programs a lot more control over the address space of the process they
> > run in, which led to things like supporting shared libraries and so
> > on. The Sun paper on doing that in SunOS4 is pretty interesting, btw;
> > shared library support is basically implemented outside of the kernel,
> > with a little cooperation from the C startup code in libc:
> > http://mcvoy.com/lm/papers/SunOS.shlib.pdf
> >
> > But `mmap` worked in terms of the VM page cache, and with writable
> > mappings, the page cache necessarily became read/write. But
> > open/close/read/write continued to use the buffer cache, which was
> > separate, which meant that a region of any given file might be present
> > in both caches simultaneously, and with no interlock between the two,
> > a write to one wouldn't necessarily update the other. If programs
> > were using both for IO on a file simultaneously, the caches could
> > easily fall out of sync and the state of the file on disk would be
> > indeterminate. Unifying the two caches addressed this, and SunOS did
> > that in the 1980s, but which took a long time to get right and
> > ultimately wasn't done for everything (directories remained in the
> > buffer cache but not the page cache). Something that aggravates some
> > old time Sun engineers is that when ZFS was implemented in Solaris, it
> > used its own cache (the ARC) that wasn't synchronized with the page
> > cache, undoing much of the earlier work. The ZFS implementors mostly
> > don't care, and note that using `mmap` for output is fraught:
> > https://db.cs.cmu.edu/mmap-cidr2022/
> >
> > When Plan 9 came along, the entire architecture changed. Sure,
> > executables are faulted in on demand and things like `segattach` exist
> > to support memory-mapped IO devices and the like, but there's no real
> > equivalent to the full generality of `mmap`, and no shared libraries.
> > here are legitimate questions about synchronization when mapping a
> > file served by a remote server somewhere, and error handling is a
> > challenge: how does one detect a write failure via a store into a
> > mapped region? Presumably, that's reflected into an exception that's
> > delivered to the process in the form of a note or something (if you
> > really want to twist your noggin around, look at how Multics did it.
> > Interestingly, the semantics of deleting a segment on Multics are
> > closer to how Plan 9 deals with deleting a currently open file than
> > unlink'ing a name on Unix, though of course that's dependent on the
> > backing file server). But if error handling for a _store_ involves a
> > round-trip to/from some distant machine, that can be painful.
> >
> > `mmap` is a lot easier to get right on a single machine with local
> > storage and no networking at play and yes, Sun did it for NFS (to
> > support running dynamically linked executables from NFS-mounted
> > filesystems), but it's a pain and the semantics evolved informally,
> > instead of being well-defined from the start.
> >
> > Bottom line: it's not trivial. Fortunately, if someone wants to make a
> > go of it, there's close to 50 years worth of prior art to learn from.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M06d2c26f15f53a2c0240aff0
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-11 10:01 ` hiro
@ 2026-02-12 1:36 ` Dan Cross
2026-02-12 5:39 ` Alyssa M via 9fans
0 siblings, 1 reply; 92+ messages in thread
From: Dan Cross @ 2026-02-12 1:36 UTC (permalink / raw)
To: 9fans
On Wed, Feb 11, 2026 at 10:36 AM hiro <23hiro@gmail.com> wrote:
> i'm not sure i understand even the most abstract topic discussed here, what's the advantage of logically organizing your data in size constrained unnamed page tables as opposed to files in a named tree?
You don't. You're still using named files; this affects how you access
them: does that look like loads and stores from and to memory, or does
it look like explicit system calls a la `read` and `write`?
> in the end you are assuming your memloads are not gonna be translated into some message passing protocol underneath? are you sure you can treat all your big data like one continuous memory region?
Depends on the data. For the sorts of things we're talking about with
`mmap`, you open a file, and then you map some region within the file
to some part of your address space; you don't have to map the whole
thing, just the part you're interested in. If it doesn't fit, you get
an error.
> "The main benefits of mmap are reduced initial latency" -> latency from where to where? what kind information are you transmitting in that procedure?
Latency from time of `exec` to running the new program, usually.
There are kind of two ways to do it; when you load a binary, you could
read all of the bits of it out of the binary executable image and
eagerly copy them into memory, but that might take a while if the
executable is big. But once you're done, you start it running, and if
it ever faults that's probably an error, so you just kill it.
Or, you can read the relatively small headers at the front of the
binary, and use that information to reserve regions of address space
and their properties, and figure out where the program is supposed to
start executing. Then you just start trying to run, without reading
anything else. But, that's going to page fault pretty much
immediately, like on the first instruction. So you can arrange for the
fault handler to detect that the fault was in a mapped, but
unpopulated, region of memory, read the relevant page of memory from
the underlying executable file, patch that into the address space, and
then return to userspace and restart the faulting instruction, which
may well fault again (one of its operands might refer to still unmapped
memory). But the program will probably do something that will fault
again soon, at which point you just repeat the same process. You keep
doing that until the program exits or you detect a fault for something
outside of the program's mapped regions, or that is perhaps in one of
those regions but in violation of the region's permissions (a store to
a read-only segment or something similar). This is "demand paging" in
a nutshell.
> what concrete problem are you trying to solve?
I can't speak to that, I'm afraid.
- Dan C.
> On Wed, Feb 11, 2026 at 4:34 AM Ori Bernstein <ori@eigenstate.org> wrote:
>>
>> On Tue, 10 Feb 2026 05:13:47 -0500
>> "Alyssa M via 9fans" <9fans@9fans.net> wrote:
>>
>> > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
>> > > as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
>> > I think one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
>>
>> That gets... interesting, from an FS semantics point of view.
>> What does this code print? Does it change with buffer sizes?
>>
>> fd = open("x", ORDWR);
>> pwrite(fd, "foo", 4, 0);
>> read(fd, buf, 4);
>> pwrite(fd, "bar", 4, 0);
>> print("%s\n", buf);
>>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M1e5415ad20881dd5ef3e0d59
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-11 3:21 ` Ori Bernstein
2026-02-11 10:01 ` hiro
2026-02-11 14:22 ` Dan Cross
@ 2026-02-12 3:12 ` Alyssa M via 9fans
2026-02-12 4:52 ` Dan Cross
2 siblings, 1 reply; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-12 3:12 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 2683 bytes --]
On Wednesday, February 11, 2026, at 10:01 AM, hiro wrote:
> what concrete problem are you trying to solve?
Making software simpler to write, I think.
People have expressed an interest in bringing something like mmap to Plan 9. I'm wondering whether memory mapping could be done as an implementation matter, never visible through the API - much as executables have been demand-paged since antiquity. Uses of mmap(2) and msync(2) could be replaced with read(2) and write(2) that sometimes use memory mapping as part of their implementation. read would memory map a copy-on-write snapshot of the requested data. write would write back only the parts that have changed. It's not the same as mmap, but it might be preferable.
On Wednesday, February 11, 2026, at 1:44 AM, Ron Minnich wrote:
> It's certainly not as simple as your one liner above :-)
Certainly true! Well there's a lot of history to learn from. Plan 9 has the benefit of being simpler, so it might be easier.
(I'm trying to implement this on my hobby OS at the moment - hence my interest, but it's not derived from Plan 9 (or anything else for that matter), so I don't talk about it much.)
I think the use cases are perhaps relatively specialised, so doing this in a special segment type in an optional driver might be fine.
On Wednesday, February 11, 2026, at 3:21 AM, Ori Bernstein wrote:
> That gets... interesting, from an FS semantics point of view.
> What does this code print? Does it change with buffer sizes?
>
> fd = open("x", ORDWR);
> pwrite(fd, "foo", 4, 0);
> read(fd, buf, 4);
> pwrite(fd, "bar", 4, 0);
> print("%s\n", buf);
It must print "foo"!
In practice it wouldn't memory map a partial page like this anyway, but the result would be the same with a larger example.
On Wednesday, January 07, 2026, at 4:41 PM, ori wrote:
> The cached version of the page is dirty, so the OS will
> eventually need to flush it back with a 9p Twrite; Let's
> assume that before this happens, the network goes down.
> How do you communicate the error with userspace?
I think the I/O really needs to not fail:
I've recently been building aa9fs, which sits between a remote file server and a mount point.
When the remote server goes down aa9fs automatically reconnects, and reestablishes the fids. The clients are blocked until the server comes back up, so they're never aware of the outage. If I get it into a decent state I'll share it.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M62fa796f70106e96ce450b3c
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 1:22 ` Dan Cross
@ 2026-02-12 4:26 ` Ori Bernstein
2026-02-12 4:34 ` Dan Cross
0 siblings, 1 reply; 92+ messages in thread
From: Ori Bernstein @ 2026-02-12 4:26 UTC (permalink / raw)
To: 9fans; +Cc: Dan Cross
On Wed, 11 Feb 2026 20:22:03 -0500
Dan Cross <crossd@gmail.com> wrote:
> On Wed, Feb 11, 2026 at 1:44 PM Ori Bernstein <ori@eigenstate.org> wrote:
> > On Wed, 11 Feb 2026 09:22:06 -0500 Dan Cross <crossd@gmail.com> wrote:
> > > On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <ori@eigenstate.org> wrote:
> > > > On Tue, 10 Feb 2026 05:13:47 -0500
> > > > "Alyssa M via 9fans" <9fans@9fans.net> wrote:
> > > >
> > > > > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > > > > as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
> > > > > I think one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
> > > >
> > > > That gets... interesting, from an FS semantics point of view.
> > > > What does this code print? Does it change with buffer sizes?
> > > >
> > > > fd = open("x", ORDWR);
> > > > pwrite(fd, "foo", 4, 0);
> > > > read(fd, buf, 4);
> > > > pwrite(fd, "bar", 4, 0);
> > > > print("%s\n", buf);
> > >
> > > It depends. Is `buf` some buffer on your stack or something similar
> > > (a global, static buffer, or heap-malloc'ed perhaps)? If so,
> > > presumably it still prints "foo", since the `read` would have copied
> > > the data out of any shared region and into process-private memory. Or,
> > > is it a pointer to the start of some region that you mapped to "x"?
> > > In that case, the whole program is suspect as it seems to operate well
> > > outside of the assumptions of C, but on Plan 9, I'd kind of expect it
> > > to print "bar".
> >
> > In this example, no trickery; single threaded code, nothing fancy.
>
> Ok. Perhaps implicitly you also mean that there's no `mmap` involved?
The message I was responding to said:
"I think one could bring the same transparent/defacto
memory mapping to read(2) and write(2), so the API need
not change at all."
So, yes, I was talking about a hypothetical modification
to read/write.
--
Ori Bernstein <ori@eigenstate.org>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mc7cc7ad5c35d413ea7182980
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 4:26 ` Ori Bernstein
@ 2026-02-12 4:34 ` Dan Cross
0 siblings, 0 replies; 92+ messages in thread
From: Dan Cross @ 2026-02-12 4:34 UTC (permalink / raw)
To: Ori Bernstein; +Cc: 9fans
On Wed, Feb 11, 2026 at 11:26 PM Ori Bernstein <ori@eigenstate.org> wrote:
> On Wed, 11 Feb 2026 20:22:03 -0500 Dan Cross <crossd@gmail.com> wrote:
> > On Wed, Feb 11, 2026 at 1:44 PM Ori Bernstein <ori@eigenstate.org> wrote:
> > > On Wed, 11 Feb 2026 09:22:06 -0500 Dan Cross <crossd@gmail.com> wrote:
> > > > On Tue, Feb 10, 2026 at 10:34 PM Ori Bernstein <ori@eigenstate.org> wrote:
> > > > > On Tue, 10 Feb 2026 05:13:47 -0500
> > > > > "Alyssa M via 9fans" <9fans@9fans.net> wrote:
> > > > >
> > > > > > On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
> > > > > > > as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
> > > > > > I think one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
> > > > >
> > > > > That gets... interesting, from an FS semantics point of view.
> > > > > What does this code print? Does it change with buffer sizes?
> > > > >
> > > > > fd = open("x", ORDWR);
> > > > > pwrite(fd, "foo", 4, 0);
> > > > > read(fd, buf, 4);
> > > > > pwrite(fd, "bar", 4, 0);
> > > > > print("%s\n", buf);
> > > >
> > > > It depends. Is `buf` some buffer on your stack or something similar
> > > > (a global, static buffer, or heap-malloc'ed perhaps)? If so,
> > > > presumably it still prints "foo", since the `read` would have copied
> > > > the data out of any shared region and into process-private memory. Or,
> > > > is it a pointer to the start of some region that you mapped to "x"?
> > > > In that case, the whole program is suspect as it seems to operate well
> > > > outside of the assumptions of C, but on Plan 9, I'd kind of expect it
> > > > to print "bar".
> > >
> > > In this example, no trickery; single threaded code, nothing fancy.
> >
> > Ok. Perhaps implicitly you also mean that there's no `mmap` involved?
>
> The message I was responding to said:
>
> "I think one could bring the same transparent/defacto
> memory mapping to read(2) and write(2), so the API need
> not change at all."
>
> So, yes, I was talking about a hypothetical modification
> to read/write.
Ah, sorry; I missed that part of the context. In that case, yes, I
agree, and your example is apt.
- Dan C.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M4d996616564e9c612d59909a
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 3:12 ` Alyssa M via 9fans
@ 2026-02-12 4:52 ` Dan Cross
2026-02-12 8:37 ` Alyssa M via 9fans
0 siblings, 1 reply; 92+ messages in thread
From: Dan Cross @ 2026-02-12 4:52 UTC (permalink / raw)
To: 9fans
On Wed, Feb 11, 2026 at 11:08 PM Alyssa M via 9fans <9fans@9fans.net> wrote:
> On Wednesday, February 11, 2026, at 10:01 AM, hiro wrote:
> > what concrete problem are you trying to solve?
>
> Making software simpler to write, I think.
I don't understand that. If the interface doesn't change, how is it simpler?
> People have expressed an interest in bringing something like mmap to
> Plan 9. I'm wondering whether memory mapping could be done as an
> implementation matter, never visible through the API - much as executables
> have been demand-paged since antiquity. Uses of mmap(2) and msync(2)
> could be replaced with read(2) and write(2) that sometimes use memory
> mapping as part of their implementation. read would memory map a
> copy-on-write snapshot of the requested data. write would write back only
> the parts that have changed. It's not the same as mmap, but it might be
> preferable.
The problem is, those aren't the right analogues for the file
metaphor. `mmap` is closer to `open` than to `read`; `msync` is a
flush operation (that doesn't really exist as an independent syscall
on plan9, but is kinda sorta like close followed by an open in a way),
and `memmove` subsumes both `read` and `write`.
You _could_ build some kind of `read`/`write` looking interface that
wrapped around a region of mapped memory, but I don't see how that
would buy you very much; the reason one usually wants something like
`mmap` is one of two things. The first is to map resources from, say, a hardware
device into an address space; things like the frame buffer of a
display adapter, for example, or registers of some device in MMIO
space (whether that's a good idea or not is another matter). The other
is to map the contents of a file into an address space, so that you
can treat them like memory, without first reading them from an actual
file. This is useful for large but sparse read-only data files: I
don't need to read the entire thing into physical memory; for that
matter it may not even fit into physical memory. But if I mmap it, and
just copy the bits I need, then those can be faulted into the address
space on demand. An alternative might be allocating a buffer in my
program, seek'ing to wherever in the file I want to read the data,
reading, and then using. But let's be honest; that's a pain. It's
appropriate pain if I'm doing something like building a DBMS (see the
CMU paper I linked to before), but if my data set is basically
read-only, I don't care about the occasional latency spike as
something is faulted in, and my locality is sufficiently good that
those will be amortized into nothingness anyway, then `mmap` isn't a
bad way to do the thing (particularly with the external pager sort of
thing Ron was suggesting). As pointed out, Plan 9 _already_ uses
demand paging of mapped segments to load executables today: this isn't
particularly new.
But if I try to shoehorn all of that into open/close/read/write, I
don't see what that buys me. The whole point is to be able to avoid
that interface for things that I can reasonably treat like "memory";
if I'm just going to do that, but wrapped in open/close/read/write
anyway, then what does it buy me? Consider the "big data file" case:
I'm back to doing all of the manual buffer management and IO myself
anyway; I may as well just open/seek/read.
- Dan C.
> On Wednesday, February 11, 2026, at 1:44 AM, Ron Minnich wrote:
>
> It's certainly not as simple as your one liner above :-)
>
> Certainly true! Well there's a lot of history to learn from. Plan 9 has the benefit of being simpler, so it might be easier.
>
> (I'm trying to implement this on my hobby OS at the moment - hence my interest, but it's not derived from Plan 9 (or anything else for that matter), so I don't talk about it much.)
> I think the use cases are perhaps relatively specialised, so doing this in a special segment type in an optional driver might be fine.
>
> On Wednesday, February 11, 2026, at 3:21 AM, Ori Bernstein wrote:
>
> That gets... interesting, from an FS semantics point of view. What does this code print? Does it change with buffer sizes?
>
> fd = open("x", ORDWR);
> pwrite(fd, "foo", 4, 0);
> read(fd, buf, 4);
> pwrite(fd, "bar", 4, 0);
> print("%s\n", buf);
>
> It must print "foo"!
> In practice it wouldn't memory map a partial page like this anyway, but the result would be the same with a larger example.
>
> On Wednesday, January 07, 2026, at 4:41 PM, ori wrote:
>
> The cached version of the page is dirty, so the OS will eventually need to flush it back with a 9p Twrite; Let's assume that before this happens, the network goes down. How do you communicate the error with userspace?
>
> I think the I/O really needs to not fail:
> I've recently been building aa9fs, which sits between a remote file server and a mount point.
> When the remote server goes down aa9fs automatically reconnects, and reestablishes the fids. The clients are blocked until the server comes back up, so they're never aware of the outage. If I get it into a decent state I'll share it.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Md40d34fc37ba55c9a7c3a403
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 1:36 ` Dan Cross
@ 2026-02-12 5:39 ` Alyssa M via 9fans
2026-02-12 9:08 ` hiro via 9fans
2026-02-12 13:34 ` Alyssa M via 9fans
0 siblings, 2 replies; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-12 5:39 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 898 bytes --]
Ah oops - I forgot to repeat my mention of copy-on-write from my first post:
On Thursday, February 05, 2026, at 9:30 PM, Alyssa M wrote:
> I tend to think about mmap the same way I do about vfork: with the right virtual memory system and file system it shouldn't be necessary. A combination of demand paging* and copy-on-write* should make read/write of large areas of files as efficient as memory mapping - the main difference being that, with read/write, memory is isolated from both the file and other processes, whereas memory mapping implies some degree of sharing. On the whole I think I'd rather have the isolation.
Without that it makes no sense! Sorry!
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M718007f34efa7d6c391d0583
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 4:52 ` Dan Cross
@ 2026-02-12 8:37 ` Alyssa M via 9fans
2026-02-12 12:37 ` hiro via 9fans
` (2 more replies)
0 siblings, 3 replies; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-12 8:37 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 3900 bytes --]
On Thursday, February 12, 2026, at 4:53 AM, Dan Cross wrote:
> On Wed, Feb 11, 2026 at 11:08 PM Alyssa M via 9fans <9fans@9fans.net> wrote:
>> On Wednesday, February 11, 2026, at 10:01 AM, hiro wrote:
>>> what concrete problem are you trying to solve?
>> Making software simpler to write, I think.
> I don't understand that. If the interface doesn't change, how is it simpler?
Think of a program that reads a file completely into memory, pokes at it a bit sparsely then writes the whole file out again. This is simple if the file is small.
If the file gets big, you might start looking around for ways to not do all that I/O, and pretty soon you have a buffer cache implementation. So the program is now more complex. Not only is there a buffer cache implementation, but you have to use it everywhere, rather than just operating on memory.
This is when mmap starts to look appealing.
On Thursday, February 12, 2026, at 4:53 AM, Dan Cross wrote:
> The other
> [use] is to map the contents of a file into an address space, so that you
> can treat them like memory, without first reading them from an actual
> file. This is useful for large but sparse read-only data files: I
> don't need to read the entire thing into physical memory; for that
> matter it may not even fit into physical memory. But if I mmap it, and
> just copy the bits I need, then those can be faulted into the address
> space on demand.
So what I'm suggesting is that instead of the programmer making an mmap call, they should make a single read call to read the entire file into the address space - as they did before. The new read implementation would do this, but as a memory mapped snapshot. This looks no different to the programmer from how reads have always worked; it just happens very quickly, because no I/O actually happens.
The snapshot data is brought in by demand paging as it is touched, and pages may get dirtied.
When the programmer would otherwise call msync, they instead write out the entire file back where it came from - as they did before. The write implementation will recognise when it's overwriting the file where the snapshot came from and will only write the dirty pages - which is effectively what msync does.
So from the programmer's point of view this is exactly what they've always done. The implementation uses c-o-w snapshots and demand paging which have the performance of mmap, but provide the conventional semantics of read and write.
Programs can handle larger files faster without having to change.
It's just an optimisation in the read/write implementation.
So that's the idea. Is it practical? I don't know... It's certainly harder to do.
One difference with mmap is that dirty pages don't get back to the file by themselves. You have to do the writes. But I think there may be ways to address this.
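To make the proposal concrete, here is a rough pseudocode sketch of the hypothetical read/write path described above. Every identifier is invented for illustration; nothing like this exists in Plan 9 or any other system today:

```
/* hypothetical kernel-side pseudocode; all names invented */
long snapread(Chan *c, void *buf, long n, vlong off)
{
    if(n >= BIGREAD && pagealigned(buf, off, n)){
        Snap *s = snapshot(c, off, n);     /* c-o-w snapshot of file pages */
        mapcow(up->seg, buf, s);           /* map it into buf, copy-on-write */
        return n;                          /* no I/O yet: paged in on touch */
    }
    return ordinaryread(c, buf, n, off);   /* small reads copy as today */
}

long snapwrite(Chan *c, void *buf, long n, vlong off)
{
    Snap *s = findsnap(up->seg, buf);
    if(s != nil && s->chan == c && s->off == off)
        return flushdirty(s);              /* write back only dirtied pages */
    return ordinarywrite(c, buf, n, off);
}
```

The point of the sketch is only that the snapshot bookkeeping lives entirely below the read/write boundary, so callers see the conventional semantics.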
On Thursday, February 12, 2026, at 4:53 AM, Dan Cross wrote:
> The problem is, those aren't the right analogues for the file
> metaphor. `mmap` is closer to `open` than to `read`
In the sense that mmap creates an association between pages and the file and munmap undoes that, yes. With the idea above the page association is with snapshots and is a bit more ephemeral, and I don't know yet how much it matters if it persists after it's no longer needed. Pages are disassociated from snapshots naturally by being dirtied, by being associated with something else or perhaps by memory being deallocated. It may be somewhat like file deletion. Sometimes when it's 'gone' it's not really gone until the last user lets go. I don't think it's a problem for the process, but it may be for the file system in some situations.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M8b80dba1c12ac630dda63f5c
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 5:39 ` Alyssa M via 9fans
@ 2026-02-12 9:08 ` hiro via 9fans
2026-02-12 13:34 ` Alyssa M via 9fans
1 sibling, 0 replies; 92+ messages in thread
From: hiro via 9fans @ 2026-02-12 9:08 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 2054 bytes --]
you're saying you want the software to be more simple, but at the same time
you say you want to exploit techniques that use the hardware more
efficiently.
the first goal is commendable, but for the second you have to first
understand how the hardware actually works. i suggest you share the insight
that makes you believe mmap is "more efficient" than reading a large file.
i see a lot of talk about mapping executables. in our special case on plan9
the executables are so small they might fit in L1 cache.
The whole 9front distribution with all documentation and git history fits
in L3 cache of modern cpus.
Maybe just change iosize and be done?
On Thu, Feb 12, 2026 at 6:44 AM Alyssa M via 9fans <9fans@9fans.net> wrote:
> Ah oops - I forgot to repeat my mention of copy-on-write from my first
> post:
>
> On Thursday, February 05, 2026, at 9:30 PM, Alyssa M wrote:
>
> I tend to think about mmap the same way I do about vfork: with the right
> virtual memory system and file system it shouldn't be necessary. A
> combination of demand paging* and copy-on-write* should make read/write
> of large areas of files as efficient as memory mapping - the main
> difference being that, with read/write, memory is isolated from both the
> file and other processes, whereas memory mapping implies some degree of
> sharing. On the whole I think I'd rather have the isolation.
>
>
> Without that it makes no sense! Sorry!
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M21196aae4eff578310ffe2b1
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 8:37 ` Alyssa M via 9fans
@ 2026-02-12 12:37 ` hiro via 9fans
2026-02-13 1:36 ` Dan Cross
2026-02-15 4:34 ` Bakul Shah via 9fans
2 siblings, 0 replies; 92+ messages in thread
From: hiro via 9fans @ 2026-02-12 12:37 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 615 bytes --]
On Thu, Feb 12, 2026 at 12:44 PM Alyssa M via 9fans <9fans@9fans.net> wrote:
> Think of a program that reads a file completely into memory, pokes at it a
> bit sparsely then writes the whole file out again.
>
If you poke at it sparsely then reading it into memory is what you should
avoid, thus you use seek and normal read instead. that is more simple,
efficient, optimal.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M2f4485bceebbc3c28972f4c0
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 5:39 ` Alyssa M via 9fans
2026-02-12 9:08 ` hiro via 9fans
@ 2026-02-12 13:34 ` Alyssa M via 9fans
2026-02-13 13:48 ` hiro
2026-02-13 17:21 ` ron minnich
1 sibling, 2 replies; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-12 13:34 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1802 bytes --]
On Thursday, February 12, 2026, at 9:08 AM, hiro wrote:
> you're saying you want the software to be more simple, but at the same time you say you want to exploit some techniques that make the hardware use more efficient.
> the first goal is commendable, but for the second you have to first understand how the hardware actually works. i suggest you share the insight that makes you believe mmap is "more efficient" than reading a large file.
I don't think it is, and I don't think I ever said it was more efficient, did I?
I think having the whole file in your address space makes it easier to select the bits of the file you want to read - assuming you don't need to read all or even most of it.
Being that selective probably only makes a significant speed difference when the file is large, anyway.
I've built a couple of simple disk file systems. I'm thinking of taking the cache code out of one of them and mapping the whole file system image into the address space - to see how much it simplifies the code. I'm not expecting it will be faster.
On Thursday, February 12, 2026, at 12:38 PM, hiro wrote:
> If you poke at it sparsely then reading it into memory is what you should avoid, thus you use seek and normal read instead. that is more simple, efficient, optimal.
I'm sure that's been argued to death over the years. But in a lot of situations I'd agree with you. It's going to be a tradeoff.
My interest is in simpler software - so I'm exploring a way to get some of the effect of mmap without having actual mmap - by making it an implementation detail.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mdbd5febf25c74da357aee4b1
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 8:37 ` Alyssa M via 9fans
2026-02-12 12:37 ` hiro via 9fans
@ 2026-02-13 1:36 ` Dan Cross
2026-02-14 3:35 ` Alyssa M via 9fans
2026-02-15 4:34 ` Bakul Shah via 9fans
2 siblings, 1 reply; 92+ messages in thread
From: Dan Cross @ 2026-02-13 1:36 UTC (permalink / raw)
To: 9fans
On Thu, Feb 12, 2026 at 6:44 AM Alyssa M via 9fans <9fans@9fans.net> wrote:
> > On Thursday, February 12, 2026, at 4:53 AM, Dan Cross wrote:
> > > On Wed, Feb 11, 2026 at 11:08 PM Alyssa M via 9fans <9fans@9fans.net> wrote:
> > >
> > > > On Wednesday, February 11, 2026, at 10:01 AM, hiro wrote:
> > > > what concrete problem are you trying to solve?
> > >
> > > Making software simpler to write, I think.
> >
> > I don't understand that. If the interface doesn't change, how is it simpler?
>
> Think of a program that reads a file completely into memory, pokes at it a
> bit sparsely then writes the whole file out again. This is simple if the file is small.
> If the file gets big, you might start looking around for ways to not do all that I/O,
> and pretty soon you have a buffer cache implementation. So the program is now
> more complex. Not only is there a buffer cache implementation, but you have to
> use it everywhere, rather than just operating on memory.
I don't see how this really works; in particular, the semantics of
read/write are simply different from those of `mmap`. In the former
case, to read a file into memory, I have to know how big it is (I can
just stat it) and then I have to allocate memory to hold its contents,
and then I expect `read` to copy the contents of the file into the
memory region I just allocated. Note that there is a "happens before"
relationship between allocating memory and then reading the contents
of the file into that memory. With mapping a file into virtual
memory, I'm simultaneously allocating address space _and_ arranging
things so that accesses to that region of address space correspond to
parts of the mapped file.
You seem to be proposing a model that somehow pushes enough smarts
into `read` to combine the two, as in the `mmap` case; but how does
that work from a day-to-day programming perspective? Suppose I go and
allocate a bunch of memory, and then immediately stream a bunch of
data from /dev/random and write it into that memory; the contents of
each page are thus random, and now there's no good way for the VM
system to do anything clever like acknowledge success but not _really_
allocate until I demand fault it in by accessing it (I already did by
scribbling all over it), nor can it do something like say, "oh, these
bits are all zeros; I'll just map this to a single global zero page
and trap stores and CoW", since the contents are random, not uniform.
Now, with these preconditions set, I go to `read` a big file into that
memory: what should the system do?
_An_ argument is that it should just discard the prior contents, since
they are logically overwritten by the contents of the file, anyway.
But that's not general: you aren't guaranteed that the buffer
you're reading into is properly aligned to do a bunch of page mapping
shenanigans. Read doesn't care: it just copies bytes, but pages of
memory are both sized and aligned: `mmap` returns a pointer aligned to
a page boundary, and requests for fixed mappings enforce this and will
fail if given a non-aligned offset.
But also, suppose that instead of one big read, I do something like:
`loop ... { seek(fd, 1, 1); read(fd, p + 1, 4093); p += 4093; }` to
copy into this region of memory I've mangled. Now you've got to deal
with access patterns that mix pre-existing data with data newly copied
from the file. "Well, copy part of the file contents into a newly
allocated page..." might be an answer there, but that's not
substantially different than what `read` does today, so what's the
differentiator?
> This is when mmap starts to look appealing.
The key to making a good argument here is first acknowledging that the
whole model for working with data is just fundamentally different with
`mmap` than it is with `read`. You really can't treat them as the same.
Let me be blunt: the `mmap` interface, as specified in 4.2BSD and
implemented in a bunch of Unix and Unix-like systems, is atrocious.
Its roots come from a system that was radically different in design
than Unix, and its baroque design, with a bunch of operations
multiplexed onto a single call with 6 (!!) arguments, two of which are
bitmaps that interact in abstruse ways and one of which can radically
alter the semantics of the call, really shows. I believe that it _is_
possible to do better. But shoehorning the model of memory-mapped IO
into an overloaded `read` is not it.
- Dan C.
> On Thursday, February 12, 2026, at 4:53 AM, Dan Cross wrote:
>
> The other [use] is to map the contents of a file into an address space, so that you can treat them like memory, without first reading them from an actual file. This is useful for large but sparse read-only data files: I don't need to read the entire thing into physical memory; for that matter it may not even fit into physical memory. But if I mmap it, and just copy the bits I need, then those can be faulted into the address space on demand.
>
>
> So what I'm suggesting is that instead of the programmer making an mmap call, they should make a single read call to read the entire file into the address space - as they did before. The new read implementation would do this, but as a memory mapped snapshot. This looks no different to the programmer from how reads have always worked, it just happens very quickly, because no I/O actually happens.
> The snapshot data is brought in by demand paging as it is touched, and pages may get dirtied.
>
> When the programmer would otherwise call msync, they instead write out the entire file back where it came from - as they did before. The write implementation will recognise when it's overwriting the file where the snapshot came from and will only write the dirty pages - which is effectively what msync does.
>
> So from the programmer's point of view this is exactly what they've always done. The implementation uses c-o-w snapshots and demand paging which have the performance of mmap, but provide the conventional semantics of read and write.
>
> Programs can handle larger files faster without having to change.
> It's just an optimisation in the read/write implementation.
>
> So that's the idea. Is it practical? I don't know... It's certainly harder to do.
>
> One difference with mmap is that dirty pages don't get back to the file by themselves. You have to do the writes. But I think there may be ways to address this.
>
> On Thursday, February 12, 2026, at 4:53 AM, Dan Cross wrote:
>
> The problem is, those aren't the right analogues for the file metaphor. `mmap` is closer to `open` than to `read`
>
> In the sense that mmap creates an association between pages and the file and munmap undoes that, yes. With the idea above the page association is with snapshots and is a bit more ephemeral, and I don't know yet how much it matters if it persists after it's no longer needed. Pages are disassociated from snapshots naturally by being dirtied, by being associated with something else or perhaps by memory being deallocated. It may be somewhat like file deletion. Sometimes when it's 'gone' it's not really gone until the last user lets go. I don't think it's a problem for the process, but it may be for the file system in some situations.
>
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M67e7be4c741cd85745124418
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 13:34 ` Alyssa M via 9fans
@ 2026-02-13 13:48 ` hiro
2026-02-13 17:21 ` ron minnich
1 sibling, 0 replies; 92+ messages in thread
From: hiro @ 2026-02-13 13:48 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 1861 bytes --]
On Fri, Feb 13, 2026 at 5:17 AM Alyssa M via 9fans <9fans@9fans.net> wrote:
> I think having the whole file in your address space makes it easier to
> select the bits of the file you want to read - assuming you don't need to
> read all or even most of it.
>
>
Again, what is difficult about doing a seek?
Let's take the simplest possible example, e.g. say you have a 1TB file, and
you need one byte from the middle of the file. How is a 500GB seek and a
1-byte read complicated?
are you just missing some syntax sugar here, cause you don't like to run 2
system calls? what if you used a wrapper function called
seekandread(filename, seekoffset, size), so you'd write
char x = seekandread(file, 500G, 1); still a complexity issue?
i don't think mmap will make this any simpler
> My interest is in simpler software - so I'm exploring a way to get some of
> the effect of mmap without having actual mmap - by making it an
> implementation detail.
>
Again, what is the favourable effect of mmap? i'm seeing nothing special
that one should strive for here. I agree with Dan in that if it *was* easy
to map mmap to read, then the two would at best be equivalent, and thus
read would suffice and mmap isn't needed at all.
If on the other hand there was any advantage at all that read() + seek() is
missing out on then why bother mapping it to read() in the first place? why
overload the syscall and make it more complicated?
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M37ab029f6df3c52ad0979147
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 3192 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 13:34 ` Alyssa M via 9fans
2026-02-13 13:48 ` hiro
@ 2026-02-13 17:21 ` ron minnich
2026-02-15 16:12 ` Danny Wilkins via 9fans
2026-02-16 2:24 ` Alyssa M via 9fans
1 sibling, 2 replies; 92+ messages in thread
From: ron minnich @ 2026-02-13 17:21 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 3266 bytes --]
so, if you want to try the effect of just reading in text and data
segments, rather than faulting them in, this is actually pretty easy. It's
what we did in NIX. Look in the NIX source for nixprepage, and then see how
we applied it in sysexec; this is a pretty quick test you can make.
Have it enabled depending on a kernel variable, similarly pretty quick. Let
me know if you need more pointers.
Then you need to figure out how to measure the effect. A possible simple
way is to see the effect on building a kernel. It won't matter much at all
for long running processes, so you want to look for something that involves
lots of exec.
As for needing or not needing mmap, you need to consider how you support
efficient access for files 100x or greater in size than physical memory.
I'm not sure what your plan is there. It's a real world problem however.
Or, further, files that contain data which are pointers.
On Thu, Feb 12, 2026 at 8:17 PM Alyssa M via 9fans <9fans@9fans.net> wrote:
> On Thursday, February 12, 2026, at 9:08 AM, hiro wrote:
>
> you're saying you want the software to be more simple, but at the same
> time you say you want to exploit some techniques that make the hardware use
> more efficient.
> the first goal is commendable, but for the second you have to first
> understand how the hardware actually works. i suggest you share the insight
> that makes you believe mmap is "more efficient" than reading a large file.
>
> I don't think it is, and I don't think I ever said it was more efficient,
> did I?
> I think having the whole file in your address space makes it easier to
> select the bits of the file you want to read - assuming you don't need to
> read all or even most of it.
> Being that selective probably only makes a significant speed difference
> when the file is large, anyway.
>
> I've built a couple of simple disk file systems. I'm thinking of taking the
> cache code out of one of them and mapping the whole file system image into
> the address space - to see how much it simplifies the code. I'm not
> expecting it will be faster.
>
> On Thursday, February 12, 2026, at 12:38 PM, hiro wrote:
>
> If you poke at it sparsely then reading it into memory is what you should
> avoid, thus you use seek and normal read instead. that is more simple,
> efficient, optimal.
>
> I'm sure that's been argued to death over the years. But in a lot of
> situations I'd agree with you. It's going to be a tradeoff.
>
> My interest is in simpler software - so I'm exploring a way to get some of
> the effect of mmap without having actual mmap - by making it an
> implementation detail.
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M1b18a0b65f9e2407978304b0
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 4120 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-13 1:36 ` Dan Cross
@ 2026-02-14 3:35 ` Alyssa M via 9fans
2026-02-14 14:26 ` Dan Cross
0 siblings, 1 reply; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-14 3:35 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 2164 bytes --]
On Friday, February 13, 2026, at 1:36 AM, Dan Cross wrote:
> Let me be blunt: the `mmap` interface, as specified in 4.2BSD and
implemented in a bunch of Unix and Unix-like systems, is atrocious.
Its roots come from a system that was radically different in design
than Unix, and its baroque design, with a bunch of operations
multiplexed onto a single call with 6 (!!) arguments, two of which are
bitmaps that interact in abstruse ways and one of which can radically
alter the semantics of the call, really shows. I believe that it __is__ possible to do better.
I'm definitely with you there.
What I'm trying to do is explore whether the standard read/write calls with a different implementation could make mmap redundant by using the combination of demand paging and copy-on-write to defer or elide I/O. This is not mmap, and not really memory mapping in the conventional sense either. But I think it can do the things an mmap user is looking for (so long as they're not looking for frame buffers or shared memory.) A successful implementation will not be detectably different from the traditional read/write calls except with respect to deferred or elided I/O.
The key ideas are that the read (and write!) system calls associate buffer pages with a snapshot of the file data involved (which is then demand loaded - deferred), and that altering memory pages breaks that association and associates the altered pages with the swap file. Pages that are still associated with the file region they're being written to are not written (this is the eliding part). Fragments of pages are treated traditionally. The segment this happens in may be logically much larger even than the swap area, and is not pre-allocated, so pages are either allocated swap space on demand or are associated with snapshots.
There's much more I could say, but I don't have it all working yet, and making it work at scale is probably another story.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M2ad0abab705206e4d7351e45
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 2839 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-14 3:35 ` Alyssa M via 9fans
@ 2026-02-14 14:26 ` Dan Cross
0 siblings, 0 replies; 92+ messages in thread
From: Dan Cross @ 2026-02-14 14:26 UTC (permalink / raw)
To: 9fans
On Sat, Feb 14, 2026 at 2:15 AM Alyssa M via 9fans <9fans@9fans.net> wrote:
> On Friday, February 13, 2026, at 1:36 AM, Dan Cross wrote:
> > Let me be blunt: the `mmap` interface, as specified in 4.2BSD
> > and implemented in a bunch of Unix and Unix-like systems, is
> > atrocious. Its roots come from a system that was radically
> > different in design than Unix, and its baroque design, with a
> > bunch of operations multiplexed onto a single call with 6 (!!)
> > arguments, two of which are bitmaps that interact in abstruse
> > ways and one of which can radically alter the semantics of
> > the call, really shows. I believe that it _is_ possible to do better.
>
> I'm definitely with you there.
>
> What I'm trying to do is explore whether the standard read/write
> calls with a different implementation could make mmap redundant
> by using the combination of demand paging and copy-on-write to
> defer or elide I/O. This is not mmap, and not really memory
> mapping in the conventional sense either. But I think it can do the
> things an mmap user is looking for (so long as they're not looking
> for frame buffers or shared memory.) A successful implementation
> will not be detectably different from the traditional read/write calls
> except with respect to deferred or elided I/O.
But as before, I don't see how you intend to make that work in a way
that is transparent to actual programmers. The semantics of read/write
and memory mapping files are just too different.
> The key ideas are that the read (and write!) system calls associate
> buffer pages with a snapshot of the file data involved (which is then
> demand loaded - deferred), and altering memory pages breaks that
> association, and associates the altered pages with the swap file.
Except they don't.
It sounds like you're thinking in terms of pages, I get that; but
read/write work in terms of byte buffers that have no obligation to be
byte aligned. Put another way, read and write relate the contents of a
"file" with an arbitrarily sized and aligned byte-buffer in memory,
but there is no obligation that those byte buffers have the properties
required to be a "page" in the virtual memory sense.
As in my earlier email, consider the case of a `read` into some long
buffer that is offset relative to a page boundary. For example,
assuming a 4KiB page:
char *p = (char *)0x100000000; // Starts at 4GiB (page aligned).
read(fd, p + 2, 0x1000*1024*128);
Here, a program is reading 512MiB of data into a buffer, but that
buffer doesn't start on a page boundary; so the contents of each
logical "page" read from the file are offset from the start of a page
a la the virtual memory system. Because of that, the system can't play
clever games with page mapping the read contents anymore: the virtual
memory hardware enforces the property that pages are _aligned_ to the
page size, and the destination of this read is not aligned to a
page boundary.
> Pages that are still associated with the file region they're being written
> to are not written (this is the eliding part). Fragments of pages are
> treated traditionally. The segment this happens in may be logically
> much larger even than the swap area - and is not pre-allocated, so
> pages are either allocated swap space on demand, or are associated
> with snapshots.
This still doesn't make any sense to me. The _program_ has already
allocated the memory it reads into; how the system gets the data into
that memory in response to a read is sort of irrelevant. In any
event, you still have to deal with the problem of destination buffers
that aren't page aligned, which is already going to make this
unworkable in the general case. You could special-case requests for
page aligned reads, I guess, but there are a lot of corner cases: what
if I `malloc` a buffer, align a pointer to a page boundary at some
offset into the malloc'ed region, `read`, and then immediately `free`
the buffer? Now `free` needs to know to tell the system that this
region was dealloc'ed; if it doesn't, then a subsequent `malloc`
covering part of it could cause a lot of IO for no good reason, as I
go to write bits of that memory and they're CoW faulted into the
process's address space. `free` usually doesn't care; it's a
userspace library, and sure, some implementations do interact with the
system more than others, but now I'm essentially requiring that.
In short, I don't see how this decreases complexity; it seems to
objectively increase it.
> There's much more I could say, but I don't have it all working yet, and
> making it work at scale is probably another story.
Well, I wish you luck, and if you come up with something amazing, I'll
be the first to admit that I'm wrong. But it sure seems like the
practicality of the idea is predicated on assumptions that don't hold
in the vast majority of cases.
Again, I think there's space for something like memory-mapped files in
the wider universe of systems design, but I'm afraid you'll find that
trying to make that thing `read` is not workable.
- Dan C.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M399decded15386d15d15b6c7
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-12 8:37 ` Alyssa M via 9fans
2026-02-12 12:37 ` hiro via 9fans
2026-02-13 1:36 ` Dan Cross
@ 2026-02-15 4:34 ` Bakul Shah via 9fans
2026-02-15 10:19 ` hiro
2 siblings, 1 reply; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-02-15 4:34 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 2449 bytes --]
On Feb 12, 2026, at 12:37 AM, Alyssa M via 9fans <9fans@9fans.net> wrote:
>
> So what I'm suggesting is that instead of the programmer making an mmap call, they should make a single read call to read the entire file into the address space - as they did before. The new read implementation would do this, but as a memory mapped snapshot. This looks no different to the programmer from how reads have always worked, it just happens very quickly, because no I/O actually happens.
> The snapshot data is brought in by demand paging as it is touched, and pages may get dirtied.
The issue here is the read call is given a pre-allocated buffer. What you seem to want is the read call to return as soon as possible and pages get read in when accessed. For this to work, the kernel side read implementation has to free up (or set aside) these pages, invalidate pagetable entries & flush relevant TLB entries & later reverse this operation once a page is filled in.
It is better if you let the read call itself manage memory. This was explored in a 1994 paper: "The Alloc Stream Facility": https://www.eecg.toronto.edu/~stumm/Papers/Krieger-IEEEComputer94.pdf
For example,
stream = Sopen(filename, flags, ...)
...
addr = SallocAt(stream, mode, &length, offset, whence)
// can access memory in [addr..addr+length)
Sfree(stream, addr)
...
Sclose(stream)
The idea is that SallocAt() returns a pointer to the data for read mode (or to space for write mode), up to length bytes. The actual length is returned in the length variable if the file is too short. Sfree() frees up space or pushes out the data to the underlying file. The alloc stream facility can in theory also use mmap under the hood.
One can think of mmap() as something like SallocAt() for page-aligned data. An added benefit: the same underlying data page can be accessed from multiple address spaces in a similar way. Any attempt to overwrite would result in copy-on-write so that other clients see the old data snapshot. If you want to see any updates by other processes, you should use some higher level access protocol to do so safely.
Using mmap instead of read is likely less efficient but the idea is to factor out common IO patterns.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M5c4377f99d6ec38b299025f0
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 4622 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-15 4:34 ` Bakul Shah via 9fans
@ 2026-02-15 10:19 ` hiro
0 siblings, 0 replies; 92+ messages in thread
From: hiro @ 2026-02-15 10:19 UTC (permalink / raw)
To: 9fans
since you give no reasons yourself, let me try to hallucinate a reason
why you might be doing what you're doing here:
the basic assumption is that all your data is spread over multiple
pages, but there is data locality per page (i.e. read/write access in
one page will rarely depend on another page)
maybe you could have a /dev/mem/pages so that pages become easier to access...
still, no mmap needed. you just want pages to be less transparent, it seems.
On Sun, Feb 15, 2026 at 6:43 AM Bakul Shah via 9fans <9fans@9fans.net> wrote:
>
> On Feb 12, 2026, at 12:37 AM, Alyssa M via 9fans <9fans@9fans.net> wrote:
>
>
> So what I'm suggesting is that instead of the programmer making an mmap call, they should make a single read call to read the entire file into the address space - as they did before. The new read implementation would do this, but as a memory mapped snapshot. This looks no different to the programmer from how reads have always worked, it just happens very quickly, because no I/O actually happens.
> The snapshot data is brought in by demand paging as it is touched, and pages may get dirtied.
>
>
> The issue here is the read call is given a pre-allocated buffer. What you seem to want is the read call to return as soon as possible and pages get read in when accessed. For this to work, the kernel side read implementation has to free up (or set aside) these pages, invalidate pagetable entries & flush relevant TLB entries & later reverse this operation once a page is filled in.
>
> It is better if you let the read call itself manage memory. This was explored in a 1994 paper: "The Alloc Stream Facility": https://www.eecg.toronto.edu/~stumm/Papers/Krieger-IEEEComputer94.pdf
> For example,
>
> stream = Sopen(filename, flags, ...)
> ...
> addr = SallocAt(stream, mode, &length, offset, whence)
> // can access memory in [addr..addr+length)
> Sfree(stream, addr)
> ...
> Sclose(stream)
>
> The idea is SallocAt() returns a ptr to read in data for read mode (or space for write mode) for length amount. Actual length is returned in the length variable if the file is too short. Sfree() frees up space or pushes out the data to the underlying file. The alloc stream facility can in theory also use mmap under the hood.
>
> One can think of mmap() as something like SallocAt() for page-aligned data. An added benefit: the same underlying data page can be accessed from multiple address spaces in a similar way. Any attempt to overwrite would result in copy-on-write so that other clients see the old data snapshot. If you want to see any updates by other processes, you should use some higher level access protocol to do so safely.
>
> Using mmap instead of read is likely less efficient but the idea is to factor out common IO patterns.
>
>
>
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mee0c45deb26d8cc17994bfcc
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-13 17:21 ` ron minnich
@ 2026-02-15 16:12 ` Danny Wilkins via 9fans
2026-02-17 3:13 ` Alyssa M via 9fans
2026-02-16 2:24 ` Alyssa M via 9fans
1 sibling, 1 reply; 92+ messages in thread
From: Danny Wilkins via 9fans @ 2026-02-15 16:12 UTC (permalink / raw)
To: 9fans
On Fri, Feb 13, 2026 at 09:21:21AM -0800, ron minnich wrote:
> As for needing or not needing mmap, you need to consider how you support
> efficient access for files 100x or greater in size than physical memory.
> I'm not sure what your plan is there. It's a real world problem however.
> Or, further, files that contain data which are pointers.
>
Hi Ron,
Probably showing my ignorance but out of curiosity, what *is* the difference
here? I can certainly see the open/read method requiring some bikeshedding on
the buffer size, but you could jump to a pointer with seek. Is there some
nasty behavior on how much memory an open file will consume as you jump around
in it beyond the fixed info maintained for the fd and the buffer allocated
by the caller?
I want to say I've been in that situation, though in a weird way :)
If I recall correctly, I messed with some files in the GB size range on a box
with 16 MB of memory last year, but that was pretty much just dd so it may have
been an optimal usage pattern (linear read from start to end, blocksize buffer.)
Thanks,
Danny
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M5f14de1d894fe4bd646f64af
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-13 17:21 ` ron minnich
2026-02-15 16:12 ` Danny Wilkins via 9fans
@ 2026-02-16 2:24 ` Alyssa M via 9fans
2026-02-16 3:17 ` Ori Bernstein
` (3 more replies)
1 sibling, 4 replies; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-16 2:24 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 7290 bytes --]
I think the difficulty here is thinking about this as memory mapping. What I'm really doing is deferred I/O. By the time a read completes, the read has logically happened; it's just that not all of the data has been transferred yet.
That happens later as the buffer is examined, and if pages of the buffer are not examined, it doesn't happen in those pages at all.
My implementation (on my hobby OS) only does this in a custom segment type. A segment of this type can be of any size, but its pages are not pre-allocated in memory or the swap file - I do this to allow it to be very large, and because a read has to happen within the boundaries of a segment. I back it with a file system temporary file, so when pages migrate to the swap area the disk allocation can be sparse. You can load or store bytes anywhere in this segment. Touching pages allocates them, first in memory and eventually in the swap file as they get paged out.
On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
> but
read/write work in terms of byte buffers that have no obligation to be
byte aligned. Put another way, read and write relate the contents of a
"file" with an arbitrarily sized and aligned byte-buffer in memory,
but there is no obligation that those byte buffers have the properties
required to be a "page" in the virtual memory sense.
Understood. My current implementation does conventional I/O with any fragments of pages at the beginning and end of the read/write buffers. So small reads and writes happen traditionally. At the moment that's done before the read completes, so your example of doing lots of adjacent reads of small areas would work very badly (few pages would get the deferred loading), but I think I can do better by deferring the fragment I/O, so adjacent reads can coalesce the snapshots. My main scenario of interest though is for very large reads and writes, because that's where the sparse access has value.
Because reads are copies and not memory mapping, it doesn't matter if the reads are not page-aligned. The process's memory pages are not being shared with the cache of the file (snapshot), so if the data is not aligned then page faults will copy bytes from two cached file blocks (assuming they're the same size). In practice I'm expecting that large reads will be into large allocations, which will be aligned, so there's an opportunity to steal blocks from the file cache. But I'm not expecting to implement this. There's no coherence problem here because the snapshot is private to the process. And readonly.
When I do a read call into the segment, firstly a snapshot is made of the data to be read. This is functionally equivalent to making a temporary file and copying the data into it. Making this copy-on-write so the snapshot costs nothing is a key part of this without which there would be no point.
The pages of the read buffer in the segment are then associated with parts of the snapshot - rather than the swap file. So rather than zero filling (or reloading paged-out data) when a load instruction is executed, the memory pages are filled from the snapshot.
When a store instruction happens, the page becomes dirty, and loses its association with the snapshot. It's then backed by the swap file. If you alter all pages of the buffer, then all pages are disconnected from the snapshot, and the snapshot is deleted. At that point you can't tell that anything unconventional happened.
If I 'read over' a buffer with something else, the pages get associated with the new snapshot, and disassociated from the old one.
When I do a write call, the write call looks at each page, and decides whether it is part of a snapshot. If it is, and we're writing back to the same part of the same file (an update) and the corresponding block has not been changed in the file, then the write call can skip that page. In other cases it actually writes to the file. Any other writing to the file that we made a snapshot from invokes the copy-on-write mechanism, so the file changes, but the snapshot doesn't.
If you freed the read buffer memory, then parts of it might get demand loaded in the act of writing malloc's book-keeping information into it - depending on how the malloc works. If you later use calloc (or memset), it will zero the memory, which will detach it all from the snapshot, albeit loading every page from the snapshot as it goes...
One could change calloc to read from /dev/zero for allocations over a certain size, and special-case that to set up pages for zero-fill when it happens in this type of segment, which would disassociate the pages from the old snapshot without loading them, just as any other subsequent read does. A memset syscall might be better.
Practically, though, I think malloc and free are not likely to be used in this type of segment. You'd probably just detach the segment rather than free parts of it, but I've illustrated how you could drop the deferred snapshot if you needed to.
So this is not mmap by another name. It's an optimization of the standard read/write approach that has some of the desirable characteristics of mmap. In particular: it lets you do an arbitrarily large read call instantly, and fault in just the pages you actually need as you need them. So like demand-paging, but from a snapshot of a file. Similarly, if you're writing back to the same file region, write will only write the pages that have altered - either in memory or in the file. This is effectively an update, somewhat like msync.
It's different from mmap in some ways: the data read is always a copy of the file contents, so there's never any spooky changing of memory under your feet. The behaviour is not detectably different to the program from the traditional implementation - except for where and if the time is spent.
There's still more I could add, but if I'm still not making sense, perhaps I'd better stop there. I think I've ended up making it sound more complicated than it is.
On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
> since you give no reasons yourself, let me try to hallucinate a reason
why you might be doing what you're doing here:
Here was my example for you:
On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
> I've built a couple of simple disk file systems. I'm thinking of taking the cache code out of one of them and mapping the whole file system image into the address space - to see how much it simplifies the code. I'm not expecting it will be faster.
This is interesting because it's a large data structure that's very sparsely read or written. I'd read the entire file system image into the segment in one gulp, respond to some file protocol requests (e.g. over 9P) by treating the segment as a single data structure, and write the entire image out periodically to implement what we used to call 'sync'.
With traditional I/O that would be ridiculous. With the above mechanism it should work about as well as mmap would. And without all that cache code and block fetching. Which is the point of this.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M2de7c0bb4f1ac35c8f5e12e2
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 8645 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-16 2:24 ` Alyssa M via 9fans
@ 2026-02-16 3:17 ` Ori Bernstein
2026-02-16 10:55 ` Frank D. Engel, Jr.
2026-02-16 19:40 ` Bakul Shah via 9fans
2026-02-16 9:50 ` tlaronde
` (2 subsequent siblings)
3 siblings, 2 replies; 92+ messages in thread
From: Ori Bernstein @ 2026-02-16 3:17 UTC (permalink / raw)
To: 9fans
The difficulty here is that having read mark a region
as paged in "later" delays the actual I/O, by which
time the file contents may have changed, and your
read returns incorrect results.
This idea can work if your OS has a page cache, the
data is already in the page cache, and you eagerly
read the data that is not loaded -- but the delayed
i/o semantics otherwise simply break.
fixing this would need deep filesystem-level help,
where the filesystem would need to take a snapshot
when the read is invoked, in order to prevent any
subsequent mutations from being visible to the reader.
(on most Plan 9 file systems, this per-file snapshot
is fairly expensive; on gefs, for example, this would
snapshot all files within the mount)
On Sun, 15 Feb 2026 21:24:32 -0500
"Alyssa M via 9fans" <9fans@9fans.net> wrote:
> I think the difficulty here is thinking about this as memory mapping. What I'm really doing is deferred I/O. By the time a read completes, the read has logically happened, it's just that not all of the data has been transferred yet.
> That happens later as the buffer is examined, and if pages of the buffer are not examined, it doesn't happen in those pages at all.
>
> My implementation (on my hobby OS) only does this in a custom segment type. A segment of this type can be of any size, but is not pre-allocated pages in memory or the swap file - I do this to allow it to be very large, and because a read has to happen within the boundaries of a segment. I back it with a file system temporary file, so when pages migrate to the swap area the disk allocation can be sparse. You can load or store bytes anywhere in this segment. Touching pages allocates them, first in memory and eventually in the swap file as they get paged out.
>
> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
> > but
> read/write work in terms of byte buffers that have no obligation to be
> byte aligned. Put another way, read and write relate the contents of a
> "file" with an arbitrarily sized and aligned byte-buffer in memory,
> but there is no obligation that those byte buffers have the properties
> required to be a "page" in the virtual memory sense.
> Understood. My current implementation does conventional I/O with any fragments of pages at the beginning and end of the read/write buffers. So small reads and writes happen traditionally. At the moment that's done before the read completes, so your example of doing lots of adjacent reads of small areas would work very badly (few pages would get the deferred loading), but I think I can do better by deferring the fragment I/O, so adjacent reads can coalesce the snapshots. My main scenario of interest though is for very large reads and writes, because that's where the sparse access has value.
>
> Because reads are copies and not memory mapping, it doesn't matter if the reads are not page-aligned. The process's memory pages are not being shared with the cache of the file (snapshot), so if the data is not aligned then page faults will copy bytes from two cached file blocks (assuming they're the same size). In practice I'm expecting that large reads will be into large allocations, which will be aligned, so there's an opportunity to steal blocks from the file cache. But I'm not expecting to implement this. There's no coherence problem here because the snapshot is private to the process. And readonly.
>
> When I do a read call into the segment, firstly a snapshot is made of the data to be read. This is functionally equivalent to making a temporary file and copying the data into it. Making this copy-on-write so the snapshot costs nothing is a key part of this without which there would be no point.
> The pages of the read buffer in the segment are then associated with parts of the snapshot - rather than the swap file. So rather than zero filling (or reloading paged-out data) when a load instruction is executed, the memory pages are filled from the snapshot.
> When a store instruction happens, the page becomes dirty, and loses its association with the snapshot. It's then backed by the swap file. If you alter all pages of the buffer, then all pages are disconnected from the snapshot, and the snapshot is deleted. At that point you can't tell that anything unconventional happened.
> If I 'read over' a buffer with something else, the pages get associated with the new snapshot, and disassociated from the old one.
>
> When I do a write call, the write call looks at each page, and decides whether it is part of a snapshot. If it is, and we're writing back to the same part of the same file (an update) and the corresponding block has not been changed in the file, then the write call can skip that page. In other cases it actually writes to the file. Any other writing to the file that we made a snapshot from invokes the copy-on-write mechanism, so the file changes, but the snapshot doesn't.
>
> If you freed the read buffer memory, then parts of it might get demand loaded in the act of writing malloc's book-keeping information into it - depending on how the malloc works. If you later use calloc (or memset), it will zero the memory, which will detach it all from the snapshot, albeit loading every page from the snapshot as it goes...
> One could change calloc to read from /dev/zero for allocations over a certain size, and special-case that to set up pages for zero-fill when it happens in this type of segment, which would disassociate the pages from the old snapshot without loading them, just as any other subsequent read does. A memset syscall might be better.
> Practically, though, I think malloc and free are not likely to be used in this type of segment. You'd probably just detach the segment rather than free parts of it, but I've illustrated how you could drop the deferred snapshot if you needed to.
>
> So this is not mmap by another name. It's an optimization of the standard read/write approach that has some of the desirable characteristics of mmap. In particular: it lets you do an arbitrarily large read call instantly, and fault in just the pages you actually need as you need them. So like demand-paging, but from a snapshot of a file. Similarly, if you're writing back to the same file region, write will only write the pages that have altered - either in memory or in the file. This is effectively an update, somewhat like msync.
>
> It's different from mmap in some ways: the data read is always a copy of the file contents, so there's never any spooky changing of memory under your feet. The behaviour is not detectably different to the program from the traditional implementation - except for where and if the time is spent.
>
> There's still more I could add, but if I'm still not making sense, perhaps I'd better stop there. I think I've ended up making it sound more complicated than it is.
>
> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
> > since you give no reasons yourself, let me try to hallucinate a reason
> why you might be doing what you're doing here:
>
> Here was my example for you:
>
> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
> > I've built a couple of simple disk file systems. I'm thinking of taking the cache code out of one of them and mapping the whole file system image into the address space - to see how much it simplifies the code. I'm not expecting it will be faster.
>
> This is interesting because it's a large data structure that's very sparsely read or written. I'd read the entire file system image into the segment in one gulp, respond to some file protocol requests (e.g. over 9P) by treating the segment as a single data structure, and write the entire image out periodically to implement what we used to call 'sync'.
> With traditional I/O that would be ridiculous. With the above mechanism it should work about as well as mmap would. And without all that cache code and block fetching. Which is the point of this.
--
Ori Bernstein <ori@eigenstate.org>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M7294b53b05c66b76159a66b3
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-16 2:24 ` Alyssa M via 9fans
2026-02-16 3:17 ` Ori Bernstein
@ 2026-02-16 9:50 ` tlaronde
2026-02-16 12:24 ` hiro via 9fans
2026-02-16 12:33 ` hiro via 9fans
3 siblings, 0 replies; 92+ messages in thread
From: tlaronde @ 2026-02-16 9:50 UTC (permalink / raw)
To: 9fans
On Sun, Feb 15, 2026 at 09:24:32PM -0500, Alyssa M via 9fans wrote:
> I think the difficulty here is thinking about this as memory mapping. What I'm really doing is deferred I/O. By the time a read completes, the read has logically happened, it's just that not all of the data has been transferred yet.
> That happens later as the buffer is examined, and if pages of the buffer are not examined, it doesn't happen in those pages at all.
>
But isn't this what the buffer cache does? My concern with this scheme
is sharing (concurrent accesses, even with only one reader-writer and
several read-only readers: what are the read-only ones really reading,
and when?). If there is a kernel-side resource server, the data is
officially modified only when the writing process does "commit", and
with a central server the information can be propagated. With a
per-process mapping, by what criterion will the data be marked as
"officially" modified, with the concurrent accesses having to be
informed? If the process has (as it should) to commit modified data
explicitly, I don't see what is different from the traditional scheme,
and I don't get what it improves; my gut feeling is the reverse: it
would be trickier to get right, and the complexity will certainly not
improve performance.
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M0571d4715430928ca6543e92
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-16 3:17 ` Ori Bernstein
@ 2026-02-16 10:55 ` Frank D. Engel, Jr.
2026-02-16 13:49 ` Ori Bernstein
2026-02-16 19:40 ` Bakul Shah via 9fans
1 sibling, 1 reply; 92+ messages in thread
From: Frank D. Engel, Jr. @ 2026-02-16 10:55 UTC (permalink / raw)
To: 9fans
Technically it could also be worked around on a more traditionally
designed OS when working with strictly local filesystems - even if
shared. As long as the local system was in full control of the
filesystem, any writes would go through a common kernel where the
changes could be managed and either controlled or synced in some
acceptable manner if multiple processes were accessing the same file.
It breaks completely on Plan 9 systems due to the nature of all
filesystems - even local filesystems - being accessed from a server
process to which client processes cannot safely assume exclusive
access. As the server may or may not be remote, and may or may not be
in use by multiple consumers, the kernel can't safely assume exclusive
access, and this becomes something of a non-starter unless baked into
the protocol at a fundamental level.
Since 9p was never designed for this, you would be breaking
compatibility to get this to work safely in a sane manner with existing
filesystem implementations.
You could theoretically work around that by extending the protocol in a
detectable way to provide the required support, and only enabling this
feature for filesystems which declare correct implementation of the
extensions. But I am also of the school that holds it unclear how much
benefit this modification really provides, and whether it would be
worth putting in the effort.
Of course if you were creating something completely new without the need
to keep compatibility with existing 9p filesystems then you could
engineer your new system with this goal in mind from the beginning -
that would be an entirely different matter.
On 2/15/26 22:17, Ori Bernstein wrote:
> The difficulty here is that having read mark a region
> as paged in "later" delays the actual I/O, by which
> time the file contents may have changed, and your
> read returns incorrect results.
>
> This idea can work if your OS has a page cache, the
> data is already in the page cache, and you eagerly
> read the data that is not loaded -- but the delayed
> i/o semantics otherwise simply break.
>
> fixing this would need deep filesystem-level help,
> where the filesystem would need to take a snapshot
> when the read is invoked, in order to prevent any
> subsequent mutations from being visible to the reader.
>
> (on most Plan 9 file systems, this per-file snapshot
> is fairly expensive; on gefs, for example, this would
> snapshot all files within the mount)
>
> On Sun, 15 Feb 2026 21:24:32 -0500
> "Alyssa M via 9fans" <9fans@9fans.net> wrote:
>
>> I think the difficulty here is thinking about this as memory mapping. What I'm really doing is deferred I/O. By the time a read completes, the read has logically happened, it's just that not all of the data has been transferred yet.
>> That happens later as the buffer is examined, and if pages of the buffer are not examined, it doesn't happen in those pages at all.
>>
>> My implementation (on my hobby OS) only does this in a custom segment type. A segment of this type can be of any size, but is not pre-allocated pages in memory or the swap file - I do this to allow it to be very large, and because a read has to happen within the boundaries of a segment. I back it with a file system temporary file, so when pages migrate to the swap area the disk allocation can be sparse. You can load or store bytes anywhere in this segment. Touching pages allocates them, first in memory and eventually in the swap file as they get paged out.
>>
>> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
>>> but
>> read/write work in terms of byte buffers that have no obligation to be
>> byte aligned. Put another way, read and write relate the contents of a
>> "file" with an arbitrarily sized and aligned byte-buffer in memory,
>> but there is no obligation that those byte buffers have the properties
>> required to be a "page" in the virtual memory sense.
>> Understood. My current implementation does conventional I/O with any fragments of pages at the beginning and end of the read/write buffers. So small reads and writes happen traditionally. At the moment that's done before the read completes, so your example of doing lots of adjacent reads of small areas would work very badly (few pages would get the deferred loading), but I think I can do better by deferring the fragment I/O, so adjacent reads can coalesce the snapshots. My main scenario of interest though is for very large reads and writes, because that's where the sparse access has value.
>>
>> Because reads are copies and not memory mapping, it doesn't matter if the reads are not page-aligned. The process's memory pages are not being shared with the cache of the file (snapshot), so if the data is not aligned then page faults will copy bytes from two cached file blocks (assuming they're the same size). In practice I'm expecting that large reads will be into large allocations, which will be aligned, so there's an opportunity to steal blocks from the file cache. But I'm not expecting to implement this. There's no coherence problem here because the snapshot is private to the process. And readonly.
>>
>> When I do a read call into the segment, firstly a snapshot is made of the data to be read. This is functionally equivalent to making a temporary file and copying the data into it. Making this copy-on-write so the snapshot costs nothing is a key part of this without which there would be no point.
>> The pages of the read buffer in the segment are then associated with parts of the snapshot - rather than the swap file. So rather than zero filling (or reloading paged-out data) when a load instruction is executed, the memory pages are filled from the snapshot.
>> When a store instruction happens, the page becomes dirty, and loses its association with the snapshot. It's then backed by the swap file. If you alter all pages of the buffer, then all pages are disconnected from the snapshot, and the snapshot is deleted. At that point you can't tell that anything unconventional happened.
>> If I 'read over' a buffer with something else, the pages get associated with the new snapshot, and disassociated from the old one.
>>
>> When I do a write call, the write call looks at each page, and decides whether it is part of a snapshot. If it is, and we're writing back to the same part of the same file (an update) and the corresponding block has not been changed in the file, then the write call can skip that page. In other cases it actually writes to the file. Any other writing to the file that we made a snapshot from invokes the copy-on-write mechanism, so the file changes, but the snapshot doesn't.
>>
>> If you freed the read buffer memory, then parts of it might get demand loaded in the act of writing malloc's book-keeping information into it - depending on how the malloc works. If you later use calloc (or memset), it will zero the memory, which will detach it all from the snapshot, albeit loading every page from the snapshot as it goes...
>> One could change calloc to read from /dev/zero for allocations over a certain size, and special-case that to set up pages for zero-fill when it happens in this type of segment, which would disassociate the pages from the old snapshot without loading them, just as any other subsequent read does. A memset syscall might be better.
>> Practically, though, I think malloc and free are not likely to be used in this type of segment. You'd probably just detach the segment rather than free parts of it, but I've illustrated how you could drop the deferred snapshot if you needed to.
>>
>> So this is not mmap by another name. It's an optimization of the standard read/write approach that has some of the desirable characteristics of mmap. In particular: it lets you do an arbitrarily large read call instantly, and fault in just the pages you actually need as you need them. So like demand-paging, but from a snapshot of a file. Similarly, if you're writing back to the same file region, write will only write the pages that have altered - either in memory or in the file. This is effectively an update, somewhat like msync.
>>
>> It's different from mmap in some ways: the data read is always a copy of the file contents, so there's never any spooky changing of memory under your feet. The behaviour is not detectably different to the program from the traditional implementation - except for where and if the time is spent.
>>
>> There's still more I could add, but if I'm still not making sense, perhaps I'd better stop there. I think I've ended up making it sound more complicated than it is.
>>
>> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
>>> since you give no reasons yourself, let me try to hallucinate a reason
>> why you might be doing what you're doing here:
>>
>> Here was my example for you:
>>
>> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
>>> I've built a couple of simple disk file systems. I'm thinking of taking the cache code out of one of them and mapping the whole file system image into the address space - to see how much it simplifies the code. I'm not expecting it will be faster.
>> This is interesting because it's a large data structure that's very sparsely read or written. I'd read the entire file system image into the segment in one gulp, respond to some file protocol requests (e.g. over 9P) by treating the segment as a single data structure, and write the entire image out periodically to implement what we used to call 'sync'.
>> With traditional I/O that would be ridiculous. With the above mechanism it should work about as well as mmap would. And without all that cache code and block fetching. Which is the point of this.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mdd1b6431d6b56fc32ce1d597
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-16 2:24 ` Alyssa M via 9fans
2026-02-16 3:17 ` Ori Bernstein
2026-02-16 9:50 ` tlaronde
@ 2026-02-16 12:24 ` hiro via 9fans
2026-02-16 12:33 ` hiro via 9fans
3 siblings, 0 replies; 92+ messages in thread
From: hiro via 9fans @ 2026-02-16 12:24 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 953 bytes --]
On Mon, Feb 16, 2026 at 3:45 AM Alyssa M via 9fans <9fans@9fans.net> wrote:
> I think the difficulty here is thinking about this as memory mapping. What
> I'm really doing is deferred I/O.
>
yes, then stop speaking of memory mapping? read() seems to work just fine
for your purposes.
memory mapping doesn't cause I/O; actual access to the (memory-mapped)
region causes the I/O. this is how it has always worked, and i can't
believe you're trying to sell this to us as some kind of novelty now!
as you keep going in the same direction without responding to any of the
doubts, i can't help but suspect sycophantic AI; i no longer manage to
convince myself that you're actually engaging with us properly as a human.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M2513ee6a62f2ed87172aaaa9
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1812 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-16 2:24 ` Alyssa M via 9fans
` (2 preceding siblings ...)
2026-02-16 12:24 ` hiro via 9fans
@ 2026-02-16 12:33 ` hiro via 9fans
3 siblings, 0 replies; 92+ messages in thread
From: hiro via 9fans @ 2026-02-16 12:33 UTC (permalink / raw)
To: 9fans
On Mon, Feb 16, 2026 at 3:45 AM Alyssa M via 9fans <9fans@9fans.net> wrote:
>
> This is interesting because it's a large data structure that's very sparsely read or written. I'd read the entire file system image into the segment in one gulp, respond to some file protocol requests (e.g. over 9P) by treating the segment as a single data structure, and write the entire image out periodically to implement what we used to call 'sync'.
> With traditional I/O that would be ridiculous. With the above mechanism it should work about as well as mmap would. And without all that cache code and block fetching. Which is the point of this.
"I'd read the entire file system image into the segment in one gulp"
But you wouldn't, actually, as you already alluded to...
Are you aware of how read() is commonly implemented?
Maybe you're confusing kernelspace and userspace?
Where is this fileserver running for you? In the kernel or not?
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mb5da462dfb5824b832ee4d7f
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-16 10:55 ` Frank D. Engel, Jr.
@ 2026-02-16 13:49 ` Ori Bernstein
0 siblings, 0 replies; 92+ messages in thread
From: Ori Bernstein @ 2026-02-16 13:49 UTC (permalink / raw)
To: 9fans; +Cc: Frank D. Engel, Jr.
It breaks completely on any system whose file system cannot
take single-file snapshots.
Assume you have a machine with 1G of memory and a 16G file that
you want to read using this read-as-mmap optimization. Also assume
that you have an implementation of malloc that will allow you to
overcommit.
To set up, you obtain a file; for this example, let's generate it
with dd:
dd -if /dev/zero -of test -bs 16k -count 1kk
All the contents are zero initially. Then, you have a program that
reads it, touching the contents with some delay:
#include <u.h>
#include <libc.h>

/* 16GB; 16*1024*1024*1024 overflows int, so force a 64-bit constant.
   The single 16GB read and malloc are deliberately hypothetical -
   this is a thought experiment, not code expected to run as-is. */
#define N (16ULL*1024*1024*1024)

void
main(int argc, char **argv)
{
	char *buf;
	uvlong i;
	int fd;

	USED(argc);
	buf = malloc(N);
	fd = open(argv[1], OREAD);
	read(fd, buf, N);
	sleep(3600*1000);
	i = truerand() % N;
	print("buf[%llud] %d\n", i, buf[i]);
	exits(nil);
}
and then, in another terminal, while this process is sleeping,
you run this command:
dd -if /dev/random -of test -bs 16k -count 1kk
Please describe, in detail, short of a per-file snapshot being
supported by the fs, what management of changes would actually
lead to zeros being read in the test C code?
I think this idea just doesn't work; if you have a buffer cache,
you can save some memcpying by remapping the pages that are in
the cache, but it remains to be seen how large a read would have
to be before the contention and TLB flushes of remapping become
cheaper than just doing the copy.
On Mon, 16 Feb 2026 05:55:55 -0500
"Frank D. Engel, Jr." <fde101@fjrhome.net> wrote:
> Technically it could also be worked around on a more traditionally
> designed OS when working with strictly local filesystems - even if
> shared - as long as the local system was in full control of the
> filesystem any writes would go through a common kernel where the changes
> could be managed and either controlled or synced in some acceptable
> manner if multiple processes were accessing the same file.
>
> It breaks completely on Plan 9 systems due to the nature of all
> filesystems - even local filesystems - being accessed from a server
> process to which the client processes cannot safely assume to have
> exclusive access. As the server may or may not be remote and may or may
> not be in use by multiple consumers the kernel can't safely assume
> exclusive access and this becomes something of a non-starter unless
> baked into the protocol at a fundamental level.
>
> Since 9p was never designed for this, you would be breaking
> compatibility to get this to work safely in a sane manner with existing
> filesystem implementations.
>
> You could theoretically work around that by extending the protocol in a
> detectable way to provide the required support and only enabling this
> feature for filesystems which declare correct implementation of the
> extensions, but I am also of the school that it is not clear how much of
> a benefit this modification really provides and whether or not it would
> be worth putting in the effort.
>
> Of course if you were creating something completely new without the need
> to keep compatibility with existing 9p filesystems then you could
> engineer your new system with this goal in mind from the beginning -
> that would be an entirely different matter.
>
>
> On 2/15/26 22:17, Ori Bernstein wrote:
> > The difficulty here is that having read mark a region
> > as paged in "later" delays the actual I/O, by which
> > time the file contents may have changed, and your
> > read returns incorrect results.
> >
> > This idea can work if your OS has a page cache, the
> > data is already in the page cache, and you eagerly
> > read the data that is not loaded -- but the delayed
> > i/o semantics otherwise simply break.
> >
> > fixing this would need deep filesystem-level help,
> > where the filesystem would need to take a snapshot
> > when the read is invoked, in order to prevent any
> > subsequent mutations from being visible to the reader.
> >
> > (on most Plan 9 file systems, this per-file snapshot
> > is fairly expensive; on gefs, for example, this would
> > snapshot all files within the mount)
> >
> > On Sun, 15 Feb 2026 21:24:32 -0500
> > "Alyssa M via 9fans" <9fans@9fans.net> wrote:
> >
> >> I think the difficulty here is thinking about this as memory mapping. What I'm really doing is deferred I/O. By the time a read completes, the read has logically happened, it's just that not all of the data has been transferred yet.
> >> That happens later as the buffer is examined, and if pages of the buffer are not examined, it doesn't happen in those pages at all.
> >>
> >> My implementation (on my hobby OS) only does this in a custom segment type. A segment of this type can be of any size, but is not pre-allocated pages in memory or the swap file - I do this to allow it to be very large, and because a read has to happen within the boundaries of a segment. I back it with a file system temporary file, so when pages migrate to the swap area the disk allocation can be sparse. You can load or store bytes anywhere in this segment. Touching pages allocates them, first in memory and eventually in the swap file as they get paged out.
> >>
> >> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
> >>> but
> >> read/write work in terms of byte buffers that have no obligation to be
> >> byte aligned. Put another way, read and write relate the contents of a
> >> "file" with an arbitrarily sized and aligned byte-buffer in memory,
> >> but there is no obligation that those byte buffers have the properties
> >> required to be a "page" in the virtual memory sense.
> >> Understood. My current implementation does conventional I/O with any fragments of pages at the beginning and end of the read/write buffers. So small reads and writes happen traditionally. At the moment that's done before the read completes, so your example of doing lots of adjacent reads of small areas would work very badly (few pages would get the deferred loading), but I think I can do better by deferring the fragment I/O, so adjacent reads can coalesce the snapshots. My main scenario of interest though is for very large reads and writes, because that's where the sparse access has value.
> >>
> >> Because reads are copies and not memory mapping, it doesn't matter if the reads are not page-aligned. The process's memory pages are not being shared with the cache of the file (snapshot), so if the data is not aligned then page faults will copy bytes from two cached file blocks (assuming they're the same size). In practice I'm expecting that large reads will be into large allocations, which will be aligned, so there's an opportunity to steal blocks from the file cache. But I'm not expecting to implement this. There's no coherence problem here because the snapshot is private to the process. And readonly.
> >>
> >> When I do a read call into the segment, firstly a snapshot is made of the data to be read. This is functionally equivalent to making a temporary file and copying the data into it. Making this copy-on-write so the snapshot costs nothing is a key part of this without which there would be no point.
> >> The pages of the read buffer in the segment are then associated with parts of the snapshot - rather than the swap file. So rather than zero filling (or reloading paged-out data) when a load instruction is executed, the memory pages are filled from the snapshot.
> >> When a store instruction happens, the page becomes dirty, and loses its association with the snapshot. It's then backed by the swap file. If you alter all pages of the buffer, then all pages are disconnected from the snapshot, and the snapshot is deleted. At that point you can't tell that anything unconventional happened.
> >> If I 'read over' a buffer with something else, the pages get associated with the new snapshot, and disassociated from the old one.
> >>
> >> When I do a write call, the write call looks at each page, and decides whether it is part of a snapshot. If it is, and we're writing back to the same part of the same file (an update) and the corresponding block has not been changed in the file, then the write call can skip that page. In other cases it actually writes to the file. Any other writing to the file that we made a snapshot from invokes the copy-on-write mechanism, so the file changes, but the snapshot doesn't.
> >>
> >> If you freed the read buffer memory, then parts of it might get demand loaded in the act of writing malloc's book-keeping information into it - depending on how the malloc works. If you later use calloc (or memset), it will zero the memory, which will detach it all from the snapshot, albeit loading every page from the snapshot as it goes...
> >> One could change calloc to read from /dev/zero for allocations over a certain size, and special-case that to set up pages for zero-fill when it happens in this type of segment, which would disassociate the pages from the old snapshot without loading them, just as any other subsequent read does. A memset syscall might be better.
> >> Practically, though, I think malloc and free are not likely to be used in this type of segment. You'd probably just detach the segment rather than free parts of it, but I've illustrated how you could drop the deferred snapshot if you needed to.
> >>
> >> So this is not mmap by another name. It's an optimization of the standard read/write approach that has some of the desirable characteristics of mmap. In particular: it lets you do an arbitrarily large read call instantly, and fault in just the pages you actually need as you need them. So like demand-paging, but from a snapshot of a file. Similarly, if you're writing back to the same file region, write will only write the pages that have altered - either in memory or in the file. This is effectively an update, somewhat like msync.
> >>
> >> It's different from mmap in some ways: the data read is always a copy of the file contents, so there's never any spooky changing of memory under your feet. The behaviour is not detectably different to the program from the traditional implementation - except for where and if the time is spent.
> >>
> >> There's still more I could add, but if I'm still not making sense, perhaps I'd better stop there. I think I've ended up making it sound more complicated than it is.
> >>
> >> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
> >>> since you give no reasons yourself, let me try to hallucinate a reason
> >> why you might be doing what you're doing here:
> >>
> >> Here was my example for you:
> >>
> >> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
> >>> I've built a couple of simple disk file systems. I'm thinking of taking the cache code out of one of them and mapping the whole file system image into the address space - to see how much it simplifies the code. I'm not expecting it will be faster.
> >> This is interesting because it's a large data structure that's very sparsely read or written. I'd read the entire file system image into the segment in one gulp, respond to some file protocol requests (e.g. over 9P) by treating the segment as a single data structure, and write the entire image out periodically to implement what we used to call 'sync'.
> >> With traditional I/O that would be ridiculous. With the above mechanism it should work about as well as mmap would. And without all that cache code and block fetching. Which is the point of this.
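The deferred-read scheme described in the quoted message can be sketched as executable pseudocode. This is my own illustrative model, not anyone's actual implementation: `Segment`, `load`, `store`, and the eager `bytes()` copy (standing in for a free copy-on-write snapshot) are all hypothetical names.

```python
# Toy model of the deferred read: read() completes at once after taking
# a snapshot; pages fill from the snapshot on first touch; a store
# dirties a page and detaches it from the snapshot. Illustrative only.
PAGE = 4096

class Segment:
    def __init__(self):
        self.pages = {}      # pageno -> bytearray (resident, possibly dirty)
        self.backing = {}    # pageno -> (snapshot, offset): deferred fill source

    def read(self, fs_file, off, length, buf_off):
        # "read()" returns immediately: snapshot the source, then just
        # record which snapshot range backs each page of the buffer.
        snap = bytes(fs_file[off:off+length])  # stands in for a COW snapshot
        for i in range(0, length, PAGE):
            pageno = (buf_off + i) // PAGE
            self.backing[pageno] = (snap, i)
            self.pages.pop(pageno, None)       # 'read over' drops old contents

    def _fault(self, pageno):
        # page fault: fill from the snapshot instead of zero-fill/swap
        if pageno not in self.pages:
            snap, i = self.backing.get(pageno, (b"", 0))
            self.pages[pageno] = bytearray(snap[i:i+PAGE].ljust(PAGE, b"\0"))
        return self.pages[pageno]

    def load(self, addr):
        return self._fault(addr // PAGE)[addr % PAGE]

    def store(self, addr, val):
        page = self._fault(addr // PAGE)
        self.backing.pop(addr // PAGE, None)   # dirty page: detach from snapshot
        page[addr % PAGE] = val

fs_file = bytearray(b"A" * 2 * PAGE)
seg = Segment()
seg.read(fs_file, 0, 2 * PAGE, 0)   # returns at once; no page is filled yet
fs_file[:] = b"B" * len(fs_file)    # the file changes after read() returned
assert seg.load(0) == ord("A")      # first touch still faults in snapshot data
seg.store(PAGE, ord("C"))           # store dirties page 1, detaching it
assert seg.load(PAGE) == ord("C")
```

The key property the model shows: the read's semantics are fixed at call time (the snapshot), while the actual data transfer is deferred to page faults, and never happens at all for pages that are overwritten first.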
--
Ori Bernstein <ori@eigenstate.org>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M1acc51ce2e2564894a8bbd60
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-16 3:17 ` Ori Bernstein
2026-02-16 10:55 ` Frank D. Engel, Jr.
@ 2026-02-16 19:40 ` Bakul Shah via 9fans
2026-02-16 19:43 ` Bakul Shah via 9fans
1 sibling, 1 reply; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-02-16 19:40 UTC (permalink / raw)
To: 9fans
Note that neither 9p nor nfs provide this guarantee.
A large read() may be divided up in multiple 9p calls
and another client can certainly modify the read data
range in between 9p reads.
You are asking mmap to do more than what multiple reads
would do. But if anything, mmap can indeed be implemented
to provide a consistent snapshot (unlike using multiple
reads).
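The tearing Bakul describes can be seen in a toy simulation: a large read issued as msize-limited chunks, with another client writing between round trips. The msize value and all names here are illustrative, not real 9P code.

```python
# Toy simulation of one large read() split into msize-limited Tread
# calls, with another client writing between chunks. The reassembled
# buffer is a torn mixture of old and new data.
MSIZE = 8192                            # per-message payload limit, as in 9P

file_data = bytearray(b"old!" * 8192)   # 32 KiB shared file

def chunked_read(off, n):
    out = bytearray()
    while n > 0:
        cnt = min(n, MSIZE)
        out += file_data[off:off+cnt]   # one Tread/Rread round trip
        off += cnt
        n -= cnt
        if off == MSIZE:                # another client writes in between
            file_data[:] = b"new!" * 8192
    return bytes(out)

buf = chunked_read(0, 32768)
assert buf[:MSIZE] == b"old!" * (MSIZE // 4)    # first chunk saw old data
assert buf[MSIZE:] == (b"new!" * 8192)[MSIZE:]  # remaining chunks saw new data
```

The returned buffer corresponds to no state the file ever had, which is the guarantee a snapshot-backed read (or a snapshotting mmap) could add.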
> On Feb 15, 2026, at 7:17 PM, Ori Bernstein <ori@eigenstate.org> wrote:
>
> The difficulty here is that having read mark a region
> as paged in "later" delays the actual I/O, by which
> time the file contents may have changed, and your
> read returns incorrect results.
>
> This idea can work if your OS has a page cache, the
> data is already in the page cache, and you eagerly
> read the data that is not loaded -- but the delayed
> i/o semantics otherwise simply break.
>
> fixing this would need deep filesystem-level help,
> where the filesystem would need to take a snapshot
> when the read is invoked, in order to prevent any
> subsequent mutations from being visible to the reader.
>
> (on most Plan 9 file systems, this per-file snapshot
> is fairly expensive; on gefs, for example, this would
> snapshot all files within the mount)
>
> On Sun, 15 Feb 2026 21:24:32 -0500
> "Alyssa M via 9fans" <9fans@9fans.net> wrote:
>
>> I think the difficulty here is thinking about this as memory mapping. What I'm really doing is deferred I/O. By the time a read completes, the read has logically happened, it's just that not all of the data has been transferred yet.
>> That happens later as the buffer is examined, and if pages of the buffer are not examined, it doesn't happen in those pages at all.
>>
>> My implementation (on my hobby OS) only does this in a custom segment type. A segment of this type can be of any size, but no pages are pre-allocated for it in memory or the swap file - I do this to allow it to be very large, and because a read has to happen within the boundaries of a segment. I back it with a file system temporary file, so when pages migrate to the swap area the disk allocation can be sparse. You can load or store bytes anywhere in this segment. Touching pages allocates them, first in memory and eventually in the swap file as they get paged out.
>>
>> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
>>> but
>> read/write work in terms of byte buffers that have no obligation to be
>> byte aligned. Put another way, read and write relate the contents of a
>> "file" with an arbitrarily sized and aligned byte-buffer in memory,
>> but there is no obligation that those byte buffers have the properties
>> required to be a "page" in the virtual memory sense.
>> Understood. My current implementation does conventional I/O with any fragments of pages at the beginning and end of the read/write buffers. So small reads and writes happen traditionally. At the moment that's done before the read completes, so your example of doing lots of adjacent reads of small areas would work very badly (few pages would get the deferred loading), but I think I can do better by deferring the fragment I/O, so adjacent reads can coalesce the snapshots. My main scenario of interest though is for very large reads and writes, because that's where the sparse access has value.
>>
>> Because reads are copies and not memory mapping, it doesn't matter if the reads are not page-aligned. The process's memory pages are not being shared with the cache of the file (snapshot), so if the data is not aligned then page faults will copy bytes from two cached file blocks (assuming they're the same size). In practice I'm expecting that large reads will be into large allocations, which will be aligned, so there's an opportunity to steal blocks from the file cache. But I'm not expecting to implement this. There's no coherence problem here because the snapshot is private to the process. And readonly.
>>
>> When I do a read call into the segment, firstly a snapshot is made of the data to be read. This is functionally equivalent to making a temporary file and copying the data into it. Making this copy-on-write so the snapshot costs nothing is a key part of this without which there would be no point.
>> The pages of the read buffer in the segment are then associated with parts of the snapshot - rather than the swap file. So rather than zero filling (or reloading paged-out data) when a load instruction is executed, the memory pages are filled from the snapshot.
>> When a store instruction happens, the page becomes dirty, and loses its association with the snapshot. It's then backed by the swap file. If you alter all pages of the buffer, then all pages are disconnected from the snapshot, and the snapshot is deleted. At that point you can't tell that anything unconventional happened.
>> If I 'read over' a buffer with something else, the pages get associated with the new snapshot, and disassociated from the old one.
>>
>> When I do a write call, the write call looks at each page, and decides whether it is part of a snapshot. If it is, and we're writing back to the same part of the same file (an update) and the corresponding block has not been changed in the file, then the write call can skip that page. In other cases it actually writes to the file. Any other writing to the file that we made a snapshot from invokes the copy-on-write mechanism, so the file changes, but the snapshot doesn't.
>>
>> If you freed the read buffer memory, then parts of it might get demand loaded in the act of writing malloc's book-keeping information into it - depending on how the malloc works. If you later use calloc (or memset), it will zero the memory, which will detach it all from the snapshot, albeit loading every page from the snapshot as it goes...
>> One could change calloc to read from /dev/zero for allocations over a certain size, and special-case that to set up pages for zero-fill when it happens in this type of segment, which would disassociate the pages from the old snapshot without loading them, just as any other subsequent read does. A memset syscall might be better.
>> Practically, though, I think malloc and free are not likely to be used in this type of segment. You'd probably just detach the segment rather than free parts of it, but I've illustrated how you could drop the deferred snapshot if you needed to.
>>
>> So this is not mmap by another name. It's an optimization of the standard read/write approach that has some of the desirable characteristics of mmap. In particular: it lets you do an arbitrarily large read call instantly, and fault in just the pages you actually need as you need them. So like demand-paging, but from a snapshot of a file. Similarly, if you're writing back to the same file region, write will only write the pages that have altered - either in memory or in the file. This is effectively an update, somewhat like msync.
>>
>> It's different from mmap in some ways: the data read is always a copy of the file contents, so there's never any spooky changing of memory under your feet. The behaviour is not detectably different to the program from the traditional implementation - except for where and if the time is spent.
>>
>> There's still more I could add, but if I'm still not making sense, perhaps I'd better stop there. I think I've ended up making it sound more complicated than it is.
>>
>> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
>>> since you give no reasons yourself, let me try to hallucinate a reason
>> why you might be doing what you're doing here:
>>
>> Here was my example for you:
>>
>> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
>>> I've built a couple of simple disk file systems. I'm thinking of taking the cache code out of one of them and mapping the whole file system image into the address space - to see how much it simplifies the code. I'm not expecting it will be faster.
>>
>> This is interesting because it's a large data structure that's very sparsely read or written. I'd read the entire file system image into the segment in one gulp, respond to some file protocol requests (e.g. over 9P) by treating the segment as a single data structure, and write the entire image out periodically to implement what we used to call 'sync'.
>> With traditional I/O that would be ridiculous. With the above mechanism it should work about as well as mmap would. And without all that cache code and block fetching. Which is the point of this.
>
>
> --
> Ori Bernstein <ori@eigenstate.org>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mdf6d07958d01aede284f4c2f
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-16 19:40 ` Bakul Shah via 9fans
@ 2026-02-16 19:43 ` Bakul Shah via 9fans
0 siblings, 0 replies; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-02-16 19:43 UTC (permalink / raw)
To: 9fans
[consistent snapshot for local files. Not for remote]
> On Feb 16, 2026, at 11:40 AM, Bakul Shah <bakul@iitbombay.org> wrote:
>
> Note that neither 9p nor nfs provide this guarantee.
> A large read() may be divided up in multiple 9p calls
> and another client can certainly modify the read data
> range in between 9p reads.
>
> You are asking mmap to do more than what multiple reads
> would do. But if anything, mmap can indeed be implemented
> to provide a consistent snapshot (unlike using multiple
> reads).
>
>> On Feb 15, 2026, at 7:17 PM, Ori Bernstein <ori@eigenstate.org> wrote:
>>
>> The difficulty here is that having read mark a region
>> as paged in "later" delays the actual I/O, by which
>> time the file contents may have changed, and your
>> read returns incorrect results.
>>
>> This idea can work if your OS has a page cache, the
>> data is already in the page cache, and you eagerly
>> read the data that is not loaded -- but the delayed
>> i/o semantics otherwise simply break.
>>
>> fixing this would need deep filesystem-level help,
>> where the filesystem would need to take a snapshot
>> when the read is invoked, in order to prevent any
>> subsequent mutations from being visible to the reader.
>>
>> (on most Plan 9 file systems, this per-file snapshot
>> is fairly expensive; on gefs, for example, this would
>> snapshot all files within the mount)
>>
>> On Sun, 15 Feb 2026 21:24:32 -0500
>> "Alyssa M via 9fans" <9fans@9fans.net> wrote:
>>
>>> I think the difficulty here is thinking about this as memory mapping. What I'm really doing is deferred I/O. By the time a read completes, the read has logically happened, it's just that not all of the data has been transferred yet.
>>> That happens later as the buffer is examined, and if pages of the buffer are not examined, it doesn't happen in those pages at all.
>>>
>>> My implementation (on my hobby OS) only does this in a custom segment type. A segment of this type can be of any size, but no pages are pre-allocated for it in memory or the swap file - I do this to allow it to be very large, and because a read has to happen within the boundaries of a segment. I back it with a file system temporary file, so when pages migrate to the swap area the disk allocation can be sparse. You can load or store bytes anywhere in this segment. Touching pages allocates them, first in memory and eventually in the swap file as they get paged out.
>>>
>>> On Saturday, February 14, 2026, at 2:27 PM, Dan Cross wrote:
>>>> but
>>> read/write work in terms of byte buffers that have no obligation to be
>>> byte aligned. Put another way, read and write relate the contents of a
>>> "file" with an arbitrarily sized and aligned byte-buffer in memory,
>>> but there is no obligation that those byte buffers have the properties
>>> required to be a "page" in the virtual memory sense.
>>> Understood. My current implementation does conventional I/O with any fragments of pages at the beginning and end of the read/write buffers. So small reads and writes happen traditionally. At the moment that's done before the read completes, so your example of doing lots of adjacent reads of small areas would work very badly (few pages would get the deferred loading), but I think I can do better by deferring the fragment I/O, so adjacent reads can coalesce the snapshots. My main scenario of interest though is for very large reads and writes, because that's where the sparse access has value.
>>>
>>> Because reads are copies and not memory mapping, it doesn't matter if the reads are not page-aligned. The process's memory pages are not being shared with the cache of the file (snapshot), so if the data is not aligned then page faults will copy bytes from two cached file blocks (assuming they're the same size). In practice I'm expecting that large reads will be into large allocations, which will be aligned, so there's an opportunity to steal blocks from the file cache. But I'm not expecting to implement this. There's no coherence problem here because the snapshot is private to the process. And readonly.
>>>
>>> When I do a read call into the segment, firstly a snapshot is made of the data to be read. This is functionally equivalent to making a temporary file and copying the data into it. Making this copy-on-write so the snapshot costs nothing is a key part of this without which there would be no point.
>>> The pages of the read buffer in the segment are then associated with parts of the snapshot - rather than the swap file. So rather than zero filling (or reloading paged-out data) when a load instruction is executed, the memory pages are filled from the snapshot.
>>> When a store instruction happens, the page becomes dirty, and loses its association with the snapshot. It's then backed by the swap file. If you alter all pages of the buffer, then all pages are disconnected from the snapshot, and the snapshot is deleted. At that point you can't tell that anything unconventional happened.
>>> If I 'read over' a buffer with something else, the pages get associated with the new snapshot, and disassociated from the old one.
>>>
>>> When I do a write call, the write call looks at each page, and decides whether it is part of a snapshot. If it is, and we're writing back to the same part of the same file (an update) and the corresponding block has not been changed in the file, then the write call can skip that page. In other cases it actually writes to the file. Any other writing to the file that we made a snapshot from invokes the copy-on-write mechanism, so the file changes, but the snapshot doesn't.
>>>
>>> If you freed the read buffer memory, then parts of it might get demand loaded in the act of writing malloc's book-keeping information into it - depending on how the malloc works. If you later use calloc (or memset), it will zero the memory, which will detach it all from the snapshot, albeit loading every page from the snapshot as it goes...
>>> One could change calloc to read from /dev/zero for allocations over a certain size, and special-case that to set up pages for zero-fill when it happens in this type of segment, which would disassociate the pages from the old snapshot without loading them, just as any other subsequent read does. A memset syscall might be better.
>>> Practically, though, I think malloc and free are not likely to be used in this type of segment. You'd probably just detach the segment rather than free parts of it, but I've illustrated how you could drop the deferred snapshot if you needed to.
>>>
>>> So this is not mmap by another name. It's an optimization of the standard read/write approach that has some of the desirable characteristics of mmap. In particular: it lets you do an arbitrarily large read call instantly, and fault in just the pages you actually need as you need them. So like demand-paging, but from a snapshot of a file. Similarly, if you're writing back to the same file region, write will only write the pages that have altered - either in memory or in the file. This is effectively an update, somewhat like msync.
>>>
>>> It's different from mmap in some ways: the data read is always a copy of the file contents, so there's never any spooky changing of memory under your feet. The behaviour is not detectably different to the program from the traditional implementation - except for where and if the time is spent.
>>>
>>> There's still more I could add, but if I'm still not making sense, perhaps I'd better stop there. I think I've ended up making it sound more complicated than it is.
>>>
>>> On Sunday, February 15, 2026, at 10:19 AM, hiro wrote:
>>>> since you give no reasons yourself, let me try to hallucinate a reason
>>> why you might be doing what you're doing here:
>>>
>>> Here was my example for you:
>>>
>>> On Thursday, February 12, 2026, at 1:34 PM, Alyssa M wrote:
>>>> I've built a couple of simple disk file systems. I'm thinking of taking the cache code out of one of them and mapping the whole file system image into the address space - to see how much it simplifies the code. I'm not expecting it will be faster.
>>>
>>> This is interesting because it's a large data structure that's very sparsely read or written. I'd read the entire file system image into the segment in one gulp, respond to some file protocol requests (e.g. over 9P) by treating the segment as a single data structure, and write the entire image out periodically to implement what we used to call 'sync'.
>>> With traditional I/O that would be ridiculous. With the above mechanism it should work about as well as mmap would. And without all that cache code and block fetching. Which is the point of this.
>>
>>
>> --
>> Ori Bernstein <ori@eigenstate.org>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M6e3e365f5f5c394c0f463fd7
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-15 16:12 ` Danny Wilkins via 9fans
@ 2026-02-17 3:13 ` Alyssa M via 9fans
2026-02-17 13:02 ` Dan Cross
0 siblings, 1 reply; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-17 3:13 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 7059 bytes --]
Guys, I'm really sorry for the confusion.
A week ago I posted this, as a loose analogy, trying to relate what Ron said to my earlier idea:
On Tuesday, February 10, 2026, at 10:13 AM, Alyssa M wrote:
> On Monday, February 09, 2026, at 3:24 PM, ron minnich wrote:
>> as for mmap, there's already a defacto mmap happening for executables. They are not read into memory. In fact, the first instruction you run in a binary results in a page fault.
> I'm thinking one could bring the same transparent/defacto memory mapping to read(2) and write(2), so the API need not change at all.
Unfortunately it got taken literally and I've been fire-fighting that ever since:
On Thursday, February 12, 2026, at 3:12 AM, Alyssa M wrote:
> Uses of mmap(2) and msync(2) could be replaced with read(2) and write(2) that sometimes use memory mapping as part of their implementation.
On Thursday, February 12, 2026, at 8:37 AM, Alyssa M wrote:
> The new read implementation would do this, but as a memory mapped snapshot. This looks no different to the programmer from how reads have always worked,
On Saturday, February 14, 2026, at 3:35 AM, Alyssa M wrote:
> This is not mmap, and not really memory mapping in the conventional sense either. But I think it can do the things an mmap user is looking for
On Monday, February 16, 2026, at 2:24 AM, Alyssa M wrote:
> So this is not mmap by another name. It's an optimization of the standard read/write approach that has some of the desirable characteristics of mmap.
All of these statements are my clumsy attempts to explain the same thing. There is memory mapping happening, in the sense that parts of the address space are sometimes associated with parts of files. But that's where the similarity ends. None of that is apparent to the programmer at any time. If it were it would be broken. Obviously.
In some ways this is like lazy evaluation: if you were actually to use all the data it would be slower. But if you only need part of it you can get access to it in memory up front (with read) without knowing ahead of time which bits you need.
This is not about making I/O faster, it's about doing less of it.
With regard to the snapshots:
Snapshots are just there to ensure that the proper read semantics are preserved.
On Monday, February 16, 2026, at 1:49 PM, Ori Bernstein wrote:
> Please describe, in detail, short of a per-file snapshot being
supported by the fs, what management of changes would actually
lead to zeros being read in the test C code?
When you're short of a per-file snapshot being supported by the fs, you get the traditional code path in read:
On Monday, February 16, 2026, at 1:49 PM, Ori Bernstein wrote:
> in another terminal, while this process is sleeping,
you run this command:
dd -if /dev/random -of test -bs 16k -count 1kk
A normal read has happened because the kernel was unable to obtain a snapshot. The zeroes are already in memory (assuming you can actually read 16GB of zeroes from a file into the process within an hour!).
However:
It would be convenient for this if servers for disk file systems had the ability to create a snapshot of a range of bytes from a file. But they generally don't. So I'm building a file system wrapper layer (a file server that talks to another file server) that provides snapshot files to the kernel via an extension of the file server protocol, in addition to what the underlying file system already provides. My current implementation does this with temporary files. When a file is written, the temporary file gets any original data that's about to be overwritten. The snapshot provided to the kernel is a synthetic blend of the original file and any bytes that were rescued and put in the temporary file. In most uses the original file will never be touched by another process and the temporary file won't even be created.
The wrapper requires exclusive access to the file server underneath, and also requires the contents to be stable. It is the wrapper that is mounted in the namespace. So the wrapper sees all attempts to alter any file, and can ensure that any snapshot maintains the illusion of being a full prior copy when writes later happen to the file it came from.
So with the wrapper in place...
On Monday, February 16, 2026, at 1:49 PM, Ori Bernstein wrote:
> in another terminal, while this process is sleeping,
you run this command:
dd -if /dev/random -of test -bs 16k -count 1kk
The dd will cause the wrapper file system to copy the zeroes into a temporary file in the snapshot as it replaces them in the file with random data (which would presumably take three times as long as the earlier example), and the process will wake up after an hour and fault zeroes in from one page in the temporary file in the snapshot. The same happens if you delete or truncate the file: the wrapper will save the data in the temporary file first.
The kernel always sees a stable snapshot (if it sees one at all), which appears to be a readonly file the size of the read buffer. It can demand-load from that without interference for as long as it's needed, and it will be removed on close.
The wrapper can be local or remote - wherever the file system needs to be shared.
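The rescue-on-write blending the wrapper does can be sketched as follows. This is a minimal illustrative model under my own assumptions (`SnapshotWrapper`, a dict as the shadow store); the actual wrapper uses temporary files on the underlying file system, as described above.

```python
# Sketch of the wrapper's copy-on-write snapshot: before a write lands,
# the bytes it would destroy are rescued into a per-snapshot shadow
# store; a snapshot read blends rescued bytes over the current file.
class SnapshotWrapper:
    def __init__(self, data):
        self.data = bytearray(data)
        self.shadows = []                 # one dict per live snapshot:
                                          # offset -> rescued original byte

    def snapshot(self):
        shadow = {}                       # empty: costs nothing until a write
        self.shadows.append(shadow)
        return shadow

    def write(self, off, buf):
        for shadow in self.shadows:       # rescue bytes a snapshot still needs
            for i in range(off, off + len(buf)):
                shadow.setdefault(i, self.data[i])
        self.data[off:off+len(buf)] = buf

    def snap_read(self, shadow, off, n):
        # blend: rescued byte if present, else the (unchanged) current byte
        return bytes(shadow.get(i, self.data[i]) for i in range(off, off + n))

fs = SnapshotWrapper(b"zeros...")
snap = fs.snapshot()
fs.write(0, b"RAND")                      # the overwrite triggers the rescue
assert bytes(fs.data) == b"RANDs..."      # the file itself has changed
assert fs.snap_read(snap, 0, 8) == b"zeros..."   # the snapshot has not
```

In the common case (no writer touches the file while a snapshot is live) the shadow stays empty and the snapshot is pure bookkeeping, which matches the claim that the temporary file usually won't even be created.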
On Monday, February 16, 2026, at 10:56 AM, Frank D. Engel, Jr. wrote:
> You could theoretically work around that by extending the protocol in a
detectable way to provide the required support and only enabling this
feature for filesystems which declare correct implementation of the
extensions
Yes indeed.
On my hobby OS the file server protocol is not 9P, and adding a new request type to it is easy and expected.
For 9P it would probably be most sensible to do something like what I did for Linux last year, and use some kind of control file to communicate the request. It's kind of ugly, but can likely be done. Given how little this mechanism would be likely to be used in practice that's probably reasonable.
There's still more to say about what happens to memory pages after a write occurs, and about a really nice result that falls naturally out of all this when write knows where the data came from. But given the responses I've seen so far I'm not sure I dare...
On Monday, February 16, 2026, at 12:24 PM, hiro wrote:
> as you keep going in the same direction without responding to any of the doubt, i can't help but blame sycophantic AI, I don't manage to convince myself any more you're actually engaging with us properly as a human.
I don't post AI slop.
I hadn't actually intended to go into so much detail in this thread, particularly as I can't see myself doing this on Plan 9 - though I'm sure someone more expert than me could.
I'm dismayed by the responses so far, because I think this is potentially a lot better than mmap.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M160f64e4b54be3298727d64a
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 9199 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-17 3:13 ` Alyssa M via 9fans
@ 2026-02-17 13:02 ` Dan Cross
2026-02-17 16:00 ` ron minnich
2026-02-17 16:56 ` Bakul Shah via 9fans
0 siblings, 2 replies; 92+ messages in thread
From: Dan Cross @ 2026-02-17 13:02 UTC (permalink / raw)
To: 9fans
On Tue, Feb 17, 2026 at 3:54 AM Alyssa M via 9fans <9fans@9fans.net> wrote:
> Guys, I'm really sorry for the confusion.
> A week ago I posted this, as a loose analogy, trying to relate what Ron said to my earlier idea:
> [snip]
> > It would be convenient for this if servers for disk file systems had the ability to create a snapshot of a range of bytes from a file. But they generally don't. So I'm building a file system wrapper layer (a file server that talks to another file server) that provides snapshot files to the kernel via an extension of the file server protocol, in addition to what the underlying file system already provides. My current implementation does this with temporary files. When a file is written, the temporary file gets any original data that's about to be overwritten. The snapshot provided to the kernel is a synthetic blend of the original file and any bytes that were rescued and put in the temporary file. In most uses the original file will never be touched by another process and the temporary file won't even be created.
> >
> > The wrapper requires exclusive access to the file server underneath, and also requires the contents to be stable. It is the wrapper that is mounted in the namespace. So the wrapper sees all attempts to alter any file, and can ensure that that any snapshot maintains the illusion of being a full prior copy when writes later happen to the file it came from.
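[Editor's note: the copy-on-write scheme quoted above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the idea — rescue original bytes into a shadow store before they are overwritten, and serve snapshot reads as a blend of rescued and live bytes. None of these names come from the actual wrapper, which works at the file-server protocol level with temporary files, not in-memory buffers.]

```python
class SnapshotWrapper:
    """Illustrative copy-on-write snapshot over an in-memory 'file'."""

    def __init__(self, data: bytearray):
        self.live = data     # the underlying file contents, mutated by writes
        self.shadow = {}     # offset -> original byte, rescued on first overwrite

    def write(self, off: int, buf: bytes):
        # Rescue any original bytes that are about to be overwritten,
        # but only the first time each offset is touched.
        for i in range(off, min(off + len(buf), len(self.live))):
            if i not in self.shadow:
                self.shadow[i] = self.live[i]
        self.live[off:off + len(buf)] = buf

    def snapshot_read(self, off: int, n: int) -> bytes:
        # Synthetic blend: rescued original bytes where they exist,
        # live bytes everywhere else.
        return bytes(self.shadow.get(i, self.live[i])
                     for i in range(off, off + n))
```

In the common case described above — the file is never written after the snapshot is taken — the shadow store stays empty and snapshot reads pass straight through to the live data.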
This doesn't seem workable.
Consider a network with three machines: one serves a filesystem, two
mount it as clients. If I understand your description above, there is
one of these "wrappers" on each client. How, then, do they arrange for
exclusive access to the filesystem, which is on a completely separate
machine?
> [snip]
> I'm dismayed by the responses so far, because I think this is potentially a lot better than mmap.
No, this is nothing like mmap. mmap is an operation initiated by a
programmer; this is some totally separate thing.
My humble suggestion is that if you don't want people to conflate this
with the `mmap` call, you should stop referring to it as `mmap`.
- Dan C.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M270677fd2081a0f5169e9a9e
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-17 13:02 ` Dan Cross
@ 2026-02-17 16:00 ` ron minnich
2026-02-17 16:39 ` hiro
2026-02-17 16:56 ` Bakul Shah via 9fans
1 sibling, 1 reply; 92+ messages in thread
From: ron minnich @ 2026-02-17 16:00 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 2737 bytes --]
I suggest we start over, and, as Dan says, let's stop using the word mmap.
Can we look at this from the point of view of a problem you are trying to
solve? What is it you want to do?
On Tue, Feb 17, 2026 at 5:36 AM Dan Cross <crossd@gmail.com> wrote:
> On Tue, Feb 17, 2026 at 3:54 AM Alyssa M via 9fans <9fans@9fans.net>
> wrote:
> > Guys, I'm really sorry for the confusion.
> > A week ago I posted this, as a loose analogy, trying to relate what Ron
> said to my earlier idea:
> > [snip]
> > > It would be convenient for this if servers for disk file systems had
> the ability to create a snapshot of a range of bytes from a file. But they
> generally don't. So I'm building a file system wrapper layer (a file server
> that talks to another file server) that provides snapshot files to the
> kernel via an extension of the file server protocol, in addition to what
> the underlying file system already provides. My current implementation does
> this with temporary files. When a file is written, the temporary file gets
> any original data that's about to be overwritten. The snapshot provided to
> the kernel is a synthetic blend of the original file and any bytes that
> were rescued and put in the temporary file. In most uses the original file
> will never be touched by another process and the temporary file won't even
> be created.
> > >
> > > The wrapper requires exclusive access to the file server underneath,
> and also requires the contents to be stable. It is the wrapper that is
> mounted in the namespace. So the wrapper sees all attempts to alter any
> file, and can ensure that that any snapshot maintains the illusion of being
> a full prior copy when writes later happen to the file it came from.
>
> This doesn't seem workable.
>
> Consider a network with three machines: one serves a filesystem, two
> mount it as clients. If I understand your description above, there is
> one of these "wrappers" on each client. How, then, do they arrange for
> exclusive access to the filesystem, which is on a completely separate
> machine?
>
> > [snip]
> > I'm dismayed by the responses so far, because I think this is
> potentially a lot better than mmap.
>
> No, this is nothing like mmap. mmap is an operation initiated by a
> programmer; this is some totally separate thing.
>
> My humble suggestion is that if you don't want people to conflate this
> with the `mmap` call, you should stop referring to it as `mmap`.
>
> - Dan C.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Med980740c04af0912ebaf0f0
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 4245 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-17 16:00 ` ron minnich
@ 2026-02-17 16:39 ` hiro
0 siblings, 0 replies; 92+ messages in thread
From: hiro @ 2026-02-17 16:39 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 3614 bytes --]
Alyssa, thanks for restoring some order after everybody got so confused.
Like ron, I suggest setting the scope more clearly: what are the hardware
layers and driver interface abstractions involved? Is this below your
kernel, part of the kernel, or part of the ABI?
Remember we also do user-level filesystems on plan9, and our kernel
filesystems aren't so insanely different from those.
On Tuesday, February 17, 2026, ron minnich <rminnich@gmail.com> wrote:
> I suggest we start over, and, as Dan says, let's stop using the word mmap.
>
> Can we look at this from the point of view of a problem you are trying to
> solve? What is it you want to do?
>
> On Tue, Feb 17, 2026 at 5:36 AM Dan Cross <crossd@gmail.com> wrote:
>
>> On Tue, Feb 17, 2026 at 3:54 AM Alyssa M via 9fans <9fans@9fans.net>
>> wrote:
>> > Guys, I'm really sorry for the confusion.
>> > A week ago I posted this, as a loose analogy, trying to relate what Ron
>> said to my earlier idea:
>> > [snip]
>> > > It would be convenient for this if servers for disk file systems had
>> the ability to create a snapshot of a range of bytes from a file. But they
>> generally don't. So I'm building a file system wrapper layer (a file server
>> that talks to another file server) that provides snapshot files to the
>> kernel via an extension of the file server protocol, in addition to what
>> the underlying file system already provides. My current implementation does
>> this with temporary files. When a file is written, the temporary file gets
>> any original data that's about to be overwritten. The snapshot provided to
>> the kernel is a synthetic blend of the original file and any bytes that
>> were rescued and put in the temporary file. In most uses the original file
>> will never be touched by another process and the temporary file won't even
>> be created.
>> > >
>> > > The wrapper requires exclusive access to the file server underneath,
>> and also requires the contents to be stable. It is the wrapper that is
>> mounted in the namespace. So the wrapper sees all attempts to alter any
>> file, and can ensure that that any snapshot maintains the illusion of being
>> a full prior copy when writes later happen to the file it came from.
>>
>> This doesn't seem workable.
>>
>> Consider a network with three machines: one serves a filesystem, two
>> mount it as clients. If I understand your description above, there is
>> one of these "wrappers" on each client. How, then, do they arrange for
>> exclusive access to the filesystem, which is on a completely separate
>> machine?
>>
>> > [snip]
>> > I'm dismayed by the responses so far, because I think this is
>> potentially a lot better than mmap.
>>
>> No, this is nothing like mmap. mmap is an operation initiated by a
>> programmer; this is some totally separate thing.
>>
>> My humble suggestion is that if you don't want people to conflate this
>> with the `mmap` call, you should stop referring to it as `mmap`.
>>
>> - Dan C.
> *9fans <https://9fans.topicbox.com/latest>* / 9fans / see discussions
> <https://9fans.topicbox.com/groups/9fans> + participants
> <https://9fans.topicbox.com/groups/9fans/members> + delivery options
> <https://9fans.topicbox.com/groups/9fans/subscription> Permalink
> <https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Med980740c04af0912ebaf0f0>
>
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-Mafe57a1fd798d48194107130
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 5624 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-17 13:02 ` Dan Cross
2026-02-17 16:00 ` ron minnich
@ 2026-02-17 16:56 ` Bakul Shah via 9fans
2026-02-17 17:54 ` hiro
2026-02-17 22:21 ` Alyssa M via 9fans
1 sibling, 2 replies; 92+ messages in thread
From: Bakul Shah via 9fans @ 2026-02-17 16:56 UTC (permalink / raw)
To: 9fans
Maybe start a new thread with a better subject line?
> On Feb 17, 2026, at 5:02 AM, Dan Cross <crossd@gmail.com> wrote:
>
> My humble suggestion is that if you don't want people to conflate this
> with the `mmap` call, you should stop referring to it as `mmap`.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M1abd2bc4a6f7b0b5e577b7df
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-17 16:56 ` Bakul Shah via 9fans
@ 2026-02-17 17:54 ` hiro
2026-02-17 22:21 ` Alyssa M via 9fans
1 sibling, 0 replies; 92+ messages in thread
From: hiro @ 2026-02-17 17:54 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 740 bytes --]
Oh indeed. I also completely forgot the history of this, and that there was
a takeover of your original thread, which helps maximize the confusion ;)
On Tuesday, February 17, 2026, Bakul Shah via 9fans <9fans@9fans.net> wrote:
> May be start a new thread with a better subject line?
>
> > On Feb 17, 2026, at 5:02 AM, Dan Cross <crossd@gmail.com> wrote:
> >
> > My humble suggestion is that if you don't want people to conflate this
> > with the `mmap` call, you should stop referring to it as `mmap`.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M1f0de0620cf1ba91be774cc7
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1936 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: mmaping on plan9? (was Re: [9fans] venti /plan9port mmapped
2026-02-17 16:56 ` Bakul Shah via 9fans
2026-02-17 17:54 ` hiro
@ 2026-02-17 22:21 ` Alyssa M via 9fans
1 sibling, 0 replies; 92+ messages in thread
From: Alyssa M via 9fans @ 2026-02-17 22:21 UTC (permalink / raw)
To: 9fans
[-- Attachment #1: Type: text/plain, Size: 827 bytes --]
I'd like to apologize to Bakul for derailing this thread. That was really not my intention. (Incidentally the Alloc Stream paper was very interesting! Thanks for that.)
A new thread might be sensible for a general discussion of what else, if anything, anyone wants to do for Plan 9 in this area, but I think it should be led by someone who knows more about Plan 9 and how its kernel works than I do, and perhaps has more immediate practical use for it.
Let's leave discussion of what I'm doing until I have something to show/demonstrate. Communicating only the ideas has been unexpectedly hard.
------------------------------------------
9fans: 9fans
Permalink: https://9fans.topicbox.com/groups/9fans/Te8d7c6e48b5c075b-M49e249c9af5ebdc59dccd1d5
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
[-- Attachment #2: Type: text/html, Size: 1400 bytes --]
^ permalink raw reply [flat|nested] 92+ messages in thread
end of thread, other threads:[~2026-02-18 0:42 UTC | newest]
Thread overview: 92+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-02 19:54 [9fans] venti /plan9port mmapped wb.kloke
2026-01-02 20:39 ` ori
2026-01-02 20:58 ` Bakul Shah via 9fans
2026-01-06 22:59 ` Ron Minnich
2026-01-07 4:27 ` Noam Preil
2026-01-07 6:15 ` Shawn Rutledge
2026-01-07 15:46 ` Persistent memory (was Re: [9fans] venti /plan9port mmapped) arnold
2026-01-07 16:11 ` Noam Preil
2026-01-07 17:26 ` Wes Kussmaul
2026-01-07 8:52 ` [9fans] venti /plan9port mmapped wb.kloke
2026-01-07 16:30 ` mmaping on plan9? (was " Bakul Shah via 9fans
2026-01-07 16:40 ` Noam Preil
2026-01-07 16:41 ` ori
2026-01-07 20:35 ` Bakul Shah via 9fans
2026-01-07 21:31 ` ron minnich
2026-01-08 7:56 ` arnold
2026-01-08 10:31 ` wb.kloke
2026-01-09 0:02 ` ron minnich
2026-01-09 3:57 ` Paul Lalonde
2026-01-09 5:10 ` ron minnich
2026-01-09 5:18 ` arnold
2026-01-09 6:06 ` David Leimbach via 9fans
2026-01-09 17:13 ` ron minnich
2026-01-09 17:39 ` tlaronde
2026-01-09 19:48 ` David Leimbach via 9fans
2026-02-05 21:30 ` Alyssa M via 9fans
2026-02-08 14:18 ` Ethan Azariah
2026-02-08 15:10 ` Alyssa M via 9fans
2026-02-08 20:43 ` Ethan Azariah
2026-02-09 1:35 ` ron minnich
2026-02-09 15:23 ` ron minnich
2026-02-09 17:13 ` Bakul Shah via 9fans
2026-02-09 21:38 ` ron minnich
2026-02-10 10:13 ` Alyssa M via 9fans
2026-02-11 1:43 ` Ron Minnich
2026-02-11 2:19 ` Bakul Shah via 9fans
2026-02-11 3:21 ` Ori Bernstein
2026-02-11 10:01 ` hiro
2026-02-12 1:36 ` Dan Cross
2026-02-12 5:39 ` Alyssa M via 9fans
2026-02-12 9:08 ` hiro via 9fans
2026-02-12 13:34 ` Alyssa M via 9fans
2026-02-13 13:48 ` hiro
2026-02-13 17:21 ` ron minnich
2026-02-15 16:12 ` Danny Wilkins via 9fans
2026-02-17 3:13 ` Alyssa M via 9fans
2026-02-17 13:02 ` Dan Cross
2026-02-17 16:00 ` ron minnich
2026-02-17 16:39 ` hiro
2026-02-17 16:56 ` Bakul Shah via 9fans
2026-02-17 17:54 ` hiro
2026-02-17 22:21 ` Alyssa M via 9fans
2026-02-16 2:24 ` Alyssa M via 9fans
2026-02-16 3:17 ` Ori Bernstein
2026-02-16 10:55 ` Frank D. Engel, Jr.
2026-02-16 13:49 ` Ori Bernstein
2026-02-16 19:40 ` Bakul Shah via 9fans
2026-02-16 19:43 ` Bakul Shah via 9fans
2026-02-16 9:50 ` tlaronde
2026-02-16 12:24 ` hiro via 9fans
2026-02-16 12:33 ` hiro via 9fans
2026-02-11 14:22 ` Dan Cross
2026-02-11 18:44 ` Ori Bernstein
2026-02-12 1:22 ` Dan Cross
2026-02-12 4:26 ` Ori Bernstein
2026-02-12 4:34 ` Dan Cross
2026-02-12 3:12 ` Alyssa M via 9fans
2026-02-12 4:52 ` Dan Cross
2026-02-12 8:37 ` Alyssa M via 9fans
2026-02-12 12:37 ` hiro via 9fans
2026-02-13 1:36 ` Dan Cross
2026-02-14 3:35 ` Alyssa M via 9fans
2026-02-14 14:26 ` Dan Cross
2026-02-15 4:34 ` Bakul Shah via 9fans
2026-02-15 10:19 ` hiro
2026-02-10 16:49 ` wb.kloke
2026-02-08 14:08 ` Ethan Azariah
2026-01-07 21:40 ` ori
2026-01-07 16:52 ` ori
2026-01-07 17:37 ` wb.kloke
2026-01-07 17:46 ` Noam Preil
2026-01-07 17:56 ` wb.kloke
2026-01-07 18:07 ` Noam Preil
2026-01-07 18:58 ` wb.kloke
2026-01-07 14:57 ` Thaddeus Woskowiak
2026-01-07 16:07 ` Wes Kussmaul
2026-01-07 16:22 ` Noam Preil
2026-01-07 17:31 ` Wes Kussmaul
2026-01-07 16:13 ` Noam Preil
2026-01-02 21:01 ` ori
2026-01-08 15:59 ` wb.kloke
2026-02-11 23:19 ` red
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).