* Re: [9fans] PDP11 (Was: Re: what heavy negativity!)
@ 2018-10-10 17:34 cinap_lenrek
2018-10-10 21:54 ` Steven Stallion
0 siblings, 1 reply; 34+ messages in thread
From: cinap_lenrek @ 2018-10-10 17:34 UTC (permalink / raw)
To: 9fans
> But the reason I want this is to reduce latency to the first
> access, especially for very large files. With read() I have
> to wait until the read completes. With mmap() processing can
> start much earlier and can be interleaved with background
> data fetch or prefetch. With read() a lot more resources
> are tied down. If I need random access and don't need to
> read all of the data, the application has to do pread(),
> pwrite() a lot thus complicating it. With mmap() I can just
> map in the whole file and excess reading (beyond what the
> app needs) will not be a large fraction.
you think doing single 4K page sized reads in the pagefault
handler is better than doing precise >4K reads from your
application? possibly in a background thread so you can
overlap processing with data fetching?
the advantage of mmap is not prefetch. it's about not doing
any I/O when the data is already in the *SHARED* buffer cache!
which plan9 does not have (except the mntcache, but that is
optional and only works for the disk fileservers that maintain
their file qid version info consistently). it really *IS* a linux
thing where all block device i/o goes thru the buffer cache.
--
cinap
^ permalink raw reply [flat|nested] 34+ messages in thread
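cinap's point can be made concrete without mmap: a reader proc doing large,
precise preads and handing buffers to the consumer over a channel overlaps
fetching with processing. A minimal sketch using Plan 9's thread(2) library
follows; the Chunk type, chunk size, and queue depth are illustrative
assumptions, not code from this thread.

/*
 * sketch: overlap processing with data fetching using plain pread(2)
 * and a reader proc, instead of relying on 4K page faults.
 */
#include <u.h>
#include <libc.h>
#include <thread.h>

enum { Chunksize = 1024*1024 };

typedef struct Chunk Chunk;
struct Chunk {
	long	n;
	uchar	buf[Chunksize];
};

static Channel *chunkc;	/* of Chunk* */
static int fd;

static void
reader(void*)
{
	vlong off;
	Chunk *c;

	off = 0;
	for(;;){
		c = malloc(sizeof *c);
		if(c == nil)
			sysfatal("malloc: %r");
		c->n = pread(fd, c->buf, Chunksize, off);	/* one large, precise read */
		sendp(chunkc, c);
		if(c->n <= 0)
			break;
		off += c->n;
	}
}

void
threadmain(int argc, char **argv)
{
	Chunk *c;

	if(argc != 2)
		sysfatal("usage: prefetch file");
	fd = open(argv[1], OREAD);
	if(fd < 0)
		sysfatal("open: %r");
	chunkc = chancreate(sizeof(Chunk*), 4);	/* up to 4 chunks in flight */
	proccreate(reader, nil, 64*1024);
	while((c = recvp(chunkc)) != nil){
		if(c->n <= 0){
			free(c);
			break;
		}
		/* process c->buf[0..c->n) while the reader fetches the next chunk */
		free(c);
	}
	threadexits(nil);
}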
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!)
  2018-10-10 17:34 [9fans] PDP11 (Was: Re: what heavy negativity!) cinap_lenrek
@ 2018-10-10 21:54 ` Steven Stallion
  2018-10-10 22:26   ` [9fans] zero copy & 9p (was " Bakul Shah
                      ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Steven Stallion @ 2018-10-10 21:54 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

As the guy who wrote the majority of the code that pushed those 1M 4K
random IOPS erik mentioned, this thread annoys the shit out of me. You
don't get an award for writing a driver. In fact, it's probably better
not to be known at all considering the bloody murder one has to commit
to marry hardware and software together.

Let's be frank, the I/O handling in the kernel is anachronistic. To
hit those rates, I had to add support for asynchronous and vectored
I/O not to mention a sizable bit of work by a co-worker to properly
handle NUMA on our appliances to hit those speeds. As I recall, we had
to rewrite the scheduler and re-implement locking, which even Charles
Forsyth had a hand in. Had we the time and resources to implement
something like zero-copy we'd have done it in a heartbeat.

In the end, it doesn't matter how "fast" a storage driver is in Plan 9
- as soon as you put a 9P-based filesystem on it, it's going to be
limited to a single outstanding operation. This is the tyranny of 9P.
We (Coraid) got around this by avoiding filesystems altogether.

Go solve that problem first.

On Wed, Oct 10, 2018 at 12:36 PM <cinap_lenrek@felloff.net> wrote:
>
> > But the reason I want this is to reduce latency to the first
> > access, especially for very large files. With read() I have
> > to wait until the read completes. With mmap() processing can
> > start much earlier and can be interleaved with background
> > data fetch or prefetch. With read() a lot more resources
> > are tied down. If I need random access and don't need to
> > read all of the data, the application has to do pread(),
> > pwrite() a lot thus complicating it. With mmap() I can just
> > map in the whole file and excess reading (beyond what the
> > app needs) will not be a large fraction.
>
> you think doing single 4K page sized reads in the pagefault
> handler is better than doing precise >4K reads from your
> application? possibly in a background thread so you can
> overlap processing with data fetching?
>
> the advantage of mmap is not prefetch. its about not to do
> any I/O when data is already in the *SHARED* buffer cache!
> which plan9 does not have (except the mntcache, but that is
> optional and only works for the disk fileservers that maintain
> ther file qid ver info consistently). its *IS* really a linux
> thing where all block device i/o goes thru the buffer cache.
>
> --
> cinap
>

^ permalink raw reply	[flat|nested] 34+ messages in thread
* [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-10 21:54 ` Steven Stallion @ 2018-10-10 22:26 ` Bakul Shah 2018-10-10 22:52 ` Steven Stallion 2018-10-11 20:43 ` Lyndon Nerenberg 2018-10-10 22:29 ` [9fans] " Kurt H Maier ` (2 subsequent siblings) 3 siblings, 2 replies; 34+ messages in thread From: Bakul Shah @ 2018-10-10 22:26 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs Excellent response! Just what I was hoping for! On Oct 10, 2018, at 2:54 PM, Steven Stallion <sstallion@gmail.com> wrote: > > As the guy who wrote the majority of the code that pushed those 1M 4K > random IOPS erik mentioned, this thread annoys the shit out of me. You > don't get an award for writing a driver. In fact, it's probably better > not to be known at all considering the bloody murder one has to commit > to marry hardware and software together. > > Let's be frank, the I/O handling in the kernel is anachronistic. To > hit those rates, I had to add support for asynchronous and vectored > I/O not to mention a sizable bit of work by a co-worker to properly > handle NUMA on our appliances to hit those speeds. As I recall, we had > to rewrite the scheduler and re-implement locking, which even Charles > Forsyth had a hand in. Had we the time and resources to implement > something like zero-copy we'd have done it in a heartbeat. > > In the end, it doesn't matter how "fast" a storage driver is in Plan 9 > - as soon as you put a 9P-based filesystem on it, it's going to be > limited to a single outstanding operation. This is the tyranny of 9P. > We (Coraid) got around this by avoiding filesystems altogether. > > Go solve that problem first. You seem to be saying zero-copy wouldn't buy anything until these other problems are solved, right? Suppose you could replace 9p based FS with something of your choice. Would it have made your jobs easier? Code less grotty? In other words, is the complexity of the driver to achieve high throughput due to the complexity of hardware or is it due to 9p's RPC model? For streaming data you pretty much have to have some sort of windowing protocol (data prefetch or write behind with mmap is a similar thing). Looks like people who have worked on the plan9 kernel have learned a lot of lessons and have a lot of good advice to offer. I'd love to learn from that. Except usually I rarely see anyone criticizing plan9. > On Wed, Oct 10, 2018 at 12:36 PM <cinap_lenrek@felloff.net> wrote: >> >>> But the reason I want this is to reduce latency to the first >>> access, especially for very large files. With read() I have >>> to wait until the read completes. With mmap() processing can >>> start much earlier and can be interleaved with background >>> data fetch or prefetch. With read() a lot more resources >>> are tied down. If I need random access and don't need to >>> read all of the data, the application has to do pread(), >>> pwrite() a lot thus complicating it. With mmap() I can just >>> map in the whole file and excess reading (beyond what the >>> app needs) will not be a large fraction. >> >> you think doing single 4K page sized reads in the pagefault >> handler is better than doing precise >4K reads from your >> application? possibly in a background thread so you can >> overlap processing with data fetching? >> >> the advantage of mmap is not prefetch. its about not to do >> any I/O when data is already in the *SHARED* buffer cache! 
>> which plan9 does not have (except the mntcache, but that is >> optional and only works for the disk fileservers that maintain >> ther file qid ver info consistently). its *IS* really a linux >> thing where all block device i/o goes thru the buffer cache. >> >> -- >> cinap >> > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-10 22:26 ` [9fans] zero copy & 9p (was " Bakul Shah
@ 2018-10-10 22:52   ` Steven Stallion
  2018-10-11 20:43   ` Lyndon Nerenberg
  1 sibling, 0 replies; 34+ messages in thread
From: Steven Stallion @ 2018-10-10 22:52 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

> On Oct 10, 2018, at 2:54 PM, Steven Stallion <sstallion@gmail.com> wrote:
>
> You seem to be saying zero-copy wouldn't buy anything until these
> other problems are solved, right?

Fundamentally zero-copy requires that the kernel and user process
share the same virtual address space mapped for the given operation.
This can't always be done and the kernel will be forced to perform a
copy anyway. To wit, one of the things I added to the exynos kernel
early on was a 1:1 mapping of the virtual kernel address space such
that something like zero-copy could be possible in the future (it was
also very convenient to limit MMU swaps on the Cortex-A15). That said,
the problem gets harder when you're working on something more general
that can handle the entire address space. In the end, you trade the
complexity/performance hit of MMU management versus making a copy.
Believe it or not, sometimes copies can be faster, especially on
larger NUMA systems.

> Suppose you could replace 9p based FS with something of your choice.
> Would it have made your jobs easier? Code less grotty? In other
> words, is the complexity of the driver to achieve high throughput
> due to the complexity of hardware or is it due to 9p's RPC model?
> For streaming data you pretty much have to have some sort of
> windowing protocol (data prefetch or write behind with mmap is a
> similar thing).

This is one of those problems that afflicts storage more than any
other subsystem, but like most things it's a tradeoff. Having a
filesystem that doesn't support 9P doesn't seem to make much sense on
Plan 9 given the ubiquity of the protocol. Dealing with the multiple
outstanding issue does make filesystem support much more complex and
would have a far-reaching effect on existing code (not to mention the
kernel).

It's completely possible to support prefetch and/or streaming I/O
using existing kernel interfaces. cinap's comment about read not
returning until the entire buffer is read is an implementation detail
of the underlying device. A read call is free to return fewer bytes
than requested; it's not uncommon for a driver to return partial data
to favor latency over throughput. In other words, there's no magic
behind mmap - it's a convenience interface. If you look at how other
kernels tend to implement I/O, there are generally fundamental calls
to a read/write interface - there are no special provisions for mmap
beyond the syscall layer.

The beauty of 9P is you can wrap driver filesystems for added
functionality. Want a block caching interface? Great! Slap a kernel
device on top of a storage driver that handles caching and prefetch.
I'm sure you can see where this is going...

> Looks like people who have worked on the plan9 kernel have learned
> a lot of lessons and have a lot of good advice to offer. I'd love
> to learn from that. Except usually I rarely see anyone criticizing
> plan9.

Something, something, in polite company :-)

^ permalink raw reply	[flat|nested] 34+ messages in thread
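Because a read may legally return fewer bytes than requested, a caller that
needs a full buffer has to loop; Plan 9's libc already provides readn for
this. A sketch of the idea (equivalent in spirit to, not copied from, the
library routine; the function name is made up):

#include <u.h>
#include <libc.h>

/*
 * read exactly n bytes unless EOF or an error intervenes.
 * a driver favouring latency may return short reads, so loop.
 */
long
readfull(int fd, void *buf, long n)
{
	char *p;
	long m, t;

	p = buf;
	t = 0;
	while(t < n){
		m = read(fd, p+t, n-t);
		if(m < 0)
			return -1;	/* error */
		if(m == 0)
			break;		/* end of file */
		t += m;
	}
	return t;
}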
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-10 22:26 ` [9fans] zero copy & 9p (was " Bakul Shah
  2018-10-10 22:52   ` Steven Stallion
@ 2018-10-11 20:43   ` Lyndon Nerenberg
  2018-10-11 22:28     ` hiro
  2018-10-12  6:04     ` Ori Bernstein
  1 sibling, 2 replies; 34+ messages in thread
From: Lyndon Nerenberg @ 2018-10-11 20:43 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs; +Cc: Lyndon Nerenberg

Another case to ponder ... We're handling the incoming I/Q data
stream, but need to fan that out to many downstream consumers. If
we already read the data into a page, then flip it to the first
consumer, is there a benefit to adding a reference counter to that
read-only page and leaving the page live until the counter expires?

Hiro clamours for benchmarks. I agree. Some basic searches I've
done don't show anyone trying this out with P9 (and publishing
their results). Anybody have hints/references to prior work?

--lyndon

^ permalink raw reply	[flat|nested] 34+ messages in thread
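A reference-counted read-only page handed to several consumers is easy to
express with the incref(2)/decref(2) idiom. A hypothetical sketch; the Page
type, sharepage and releasepage are invented for illustration and are not
existing kernel code:

#include <u.h>
#include <libc.h>

/*
 * hypothetical fan-out of one read-only page to many consumers:
 * each consumer holds a reference; the page is reclaimed only
 * when the last reference is dropped.
 */
typedef struct Page Page;
struct Page {
	Ref;		/* embedded reference count, see incref(2) */
	uchar	*data;
	ulong	len;
};

Page*
sharepage(Page *p)
{
	incref(p);	/* new consumer: one more outstanding reference */
	return p;
}

void
releasepage(Page *p)
{
	if(decref(p) == 0){	/* last consumer gone: reclaim */
		free(p->data);
		free(p);
	}
}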
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-11 20:43 ` Lyndon Nerenberg @ 2018-10-11 22:28 ` hiro 2018-10-12 6:04 ` Ori Bernstein 1 sibling, 0 replies; 34+ messages in thread From: hiro @ 2018-10-11 22:28 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs i'm not saying you should measure a lot even. just trying to make you verify my point that this is not your bottleneck, just check if you hit a cpu limit already with that single processing stage (my guess was FFT). the reason why i think my guess is right is bec. of experience with the low bandwidth of SEQUENTIAL data you're claiming could create problems. in contrast i'm happy stallione at least brought up something more demanding earlier, like finding true limits during small block-size random access. On 10/11/18, Lyndon Nerenberg <lyndon@orthanc.ca> wrote: > Another case to ponder ... We're handling the incoming I/Q data > stream, but need to fan that out to many downstream consumers. If > we already read the data into a page, then flip it to the first > consumer, is there a benefit to adding a reference counter to that > read-only page and leaving the page live until the counter expires? > > Hiro clamours for benchmarks. I agree. Some basic searches I've > done don't show anyone trying this out with P9 (and publishing > their results). Anybody have hints/references to prior work? > > --lyndon > > ^ permalink raw reply [flat|nested] 34+ messages in thread
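Checking whether sequential read bandwidth, rather than the CPU in the
processing stage, is the bottleneck takes only a few lines. A rough sketch;
the 1 MB buffer size and the program name are arbitrary choices:

#include <u.h>
#include <libc.h>

/* crude sequential-read bandwidth check: reports MB/s for a file or device */
void
main(int argc, char **argv)
{
	static char buf[1024*1024];
	long n;
	vlong total, t0, t1;
	int fd;

	if(argc != 2)
		sysfatal("usage: readbw file");
	fd = open(argv[1], OREAD);
	if(fd < 0)
		sysfatal("open: %r");
	total = 0;
	t0 = nsec();
	while((n = read(fd, buf, sizeof buf)) > 0)
		total += n;
	t1 = nsec();
	if(t1 == t0)
		t1 = t0+1;	/* avoid division by zero on tiny files */
	print("%lld bytes in %lld ms: %lld MB/s\n",
		total, (t1-t0)/1000000, (total/1048576)*1000000000LL/(t1-t0));
	exits(nil);
}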
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-11 20:43 ` Lyndon Nerenberg 2018-10-11 22:28 ` hiro @ 2018-10-12 6:04 ` Ori Bernstein 2018-10-13 18:01 ` Charles Forsyth 1 sibling, 1 reply; 34+ messages in thread From: Ori Bernstein @ 2018-10-12 6:04 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs; +Cc: Lyndon Nerenberg On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg <lyndon@orthanc.ca> wrote: > Another case to ponder ... We're handling the incoming I/Q data > stream, but need to fan that out to many downstream consumers. If > we already read the data into a page, then flip it to the first > consumer, is there a benefit to adding a reference counter to that > read-only page and leaving the page live until the counter expires? > > Hiro clamours for benchmarks. I agree. Some basic searches I've > done don't show anyone trying this out with P9 (and publishing > their results). Anybody have hints/references to prior work? > > --lyndon > I don't believe anyone has done the work yet. I'd be interested to see what you come up with. -- Ori Bernstein ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-12 6:04 ` Ori Bernstein @ 2018-10-13 18:01 ` Charles Forsyth 2018-10-13 21:11 ` hiro 0 siblings, 1 reply; 34+ messages in thread From: Charles Forsyth @ 2018-10-13 18:01 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs; +Cc: Lyndon Nerenberg [-- Attachment #1: Type: text/plain, Size: 1156 bytes --] I did several versions of one part of zero copy, inspired by several things in x-kernel, replacing Blocks by another structure throughout the network stacks and kernel, then made messages visible to user level. Nemo did another part, on his way to Clive On Fri, 12 Oct 2018, 07:05 Ori Bernstein, <ori@eigenstate.org> wrote: > On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg <lyndon@orthanc.ca> > wrote: > > > Another case to ponder ... We're handling the incoming I/Q data > > stream, but need to fan that out to many downstream consumers. If > > we already read the data into a page, then flip it to the first > > consumer, is there a benefit to adding a reference counter to that > > read-only page and leaving the page live until the counter expires? > > > > Hiro clamours for benchmarks. I agree. Some basic searches I've > > done don't show anyone trying this out with P9 (and publishing > > their results). Anybody have hints/references to prior work? > > > > --lyndon > > > > I don't believe anyone has done the work yet. I'd be interested > to see what you come up with. > > > -- > Ori Bernstein > > [-- Attachment #2: Type: text/html, Size: 1581 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-13 18:01 ` Charles Forsyth @ 2018-10-13 21:11 ` hiro 2018-10-14 5:25 ` FJ Ballesteros 0 siblings, 1 reply; 34+ messages in thread From: hiro @ 2018-10-13 21:11 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs and, did it improve anything noticeably? On 10/13/18, Charles Forsyth <charles.forsyth@gmail.com> wrote: > I did several versions of one part of zero copy, inspired by several things > in x-kernel, replacing Blocks by another structure throughout the network > stacks and kernel, then made messages visible to user level. Nemo did > another part, on his way to Clive > > On Fri, 12 Oct 2018, 07:05 Ori Bernstein, <ori@eigenstate.org> wrote: > >> On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg <lyndon@orthanc.ca> >> wrote: >> >> > Another case to ponder ... We're handling the incoming I/Q data >> > stream, but need to fan that out to many downstream consumers. If >> > we already read the data into a page, then flip it to the first >> > consumer, is there a benefit to adding a reference counter to that >> > read-only page and leaving the page live until the counter expires? >> > >> > Hiro clamours for benchmarks. I agree. Some basic searches I've >> > done don't show anyone trying this out with P9 (and publishing >> > their results). Anybody have hints/references to prior work? >> > >> > --lyndon >> > >> >> I don't believe anyone has done the work yet. I'd be interested >> to see what you come up with. >> >> >> -- >> Ori Bernstein >> >> > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-13 21:11 ` hiro @ 2018-10-14 5:25 ` FJ Ballesteros 2018-10-14 7:34 ` hiro 0 siblings, 1 reply; 34+ messages in thread From: FJ Ballesteros @ 2018-10-14 5:25 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs yes. bugs, on my side at least. The copy isolates from others. But some experiments in nix and in a thing I wrote for leanxcale show that some things can be much faster. It’s fun either way. > El 13 oct 2018, a las 23:11, hiro <23hiro@gmail.com> escribió: > > and, did it improve anything noticeably? > >> On 10/13/18, Charles Forsyth <charles.forsyth@gmail.com> wrote: >> I did several versions of one part of zero copy, inspired by several things >> in x-kernel, replacing Blocks by another structure throughout the network >> stacks and kernel, then made messages visible to user level. Nemo did >> another part, on his way to Clive >> >>> On Fri, 12 Oct 2018, 07:05 Ori Bernstein, <ori@eigenstate.org> wrote: >>> >>> On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg <lyndon@orthanc.ca> >>> wrote: >>> >>>> Another case to ponder ... We're handling the incoming I/Q data >>>> stream, but need to fan that out to many downstream consumers. If >>>> we already read the data into a page, then flip it to the first >>>> consumer, is there a benefit to adding a reference counter to that >>>> read-only page and leaving the page live until the counter expires? >>>> >>>> Hiro clamours for benchmarks. I agree. Some basic searches I've >>>> done don't show anyone trying this out with P9 (and publishing >>>> their results). Anybody have hints/references to prior work? >>>> >>>> --lyndon >>>> >>> >>> I don't believe anyone has done the work yet. I'd be interested >>> to see what you come up with. >>> >>> >>> -- >>> Ori Bernstein >>> >>> >> > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-14 5:25 ` FJ Ballesteros @ 2018-10-14 7:34 ` hiro 2018-10-14 7:38 ` Francisco J Ballesteros 0 siblings, 1 reply; 34+ messages in thread From: hiro @ 2018-10-14 7:34 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs well, finding bugs is always good :) but since i got curious could you also tell which things exactly got much faster, so that we know what might be possible? On 10/14/18, FJ Ballesteros <nemo@lsub.org> wrote: > yes. bugs, on my side at least. > The copy isolates from others. > But some experiments in nix and in a thing I wrote for leanxcale show that > some things can be much faster. > It’s fun either way. > >> El 13 oct 2018, a las 23:11, hiro <23hiro@gmail.com> escribió: >> >> and, did it improve anything noticeably? >> >>> On 10/13/18, Charles Forsyth <charles.forsyth@gmail.com> wrote: >>> I did several versions of one part of zero copy, inspired by several >>> things >>> in x-kernel, replacing Blocks by another structure throughout the >>> network >>> stacks and kernel, then made messages visible to user level. Nemo did >>> another part, on his way to Clive >>> >>>> On Fri, 12 Oct 2018, 07:05 Ori Bernstein, <ori@eigenstate.org> wrote: >>>> >>>> On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg >>>> <lyndon@orthanc.ca> >>>> wrote: >>>> >>>>> Another case to ponder ... We're handling the incoming I/Q data >>>>> stream, but need to fan that out to many downstream consumers. If >>>>> we already read the data into a page, then flip it to the first >>>>> consumer, is there a benefit to adding a reference counter to that >>>>> read-only page and leaving the page live until the counter expires? >>>>> >>>>> Hiro clamours for benchmarks. I agree. Some basic searches I've >>>>> done don't show anyone trying this out with P9 (and publishing >>>>> their results). Anybody have hints/references to prior work? >>>>> >>>>> --lyndon >>>>> >>>> >>>> I don't believe anyone has done the work yet. I'd be interested >>>> to see what you come up with. >>>> >>>> >>>> -- >>>> Ori Bernstein >>>> >>>> >>> >> > > > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-14 7:34 ` hiro @ 2018-10-14 7:38 ` Francisco J Ballesteros 2018-10-14 8:00 ` hiro 0 siblings, 1 reply; 34+ messages in thread From: Francisco J Ballesteros @ 2018-10-14 7:38 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs Pure "producer/cosumer" stuff, like sending things through a pipe as long as the source didn't need to touch the data ever more. Regarding bugs, I meant "producing bugs" not "fixing bugs", btw. > On 14 Oct 2018, at 09:34, hiro <23hiro@gmail.com> wrote: > > well, finding bugs is always good :) > but since i got curious could you also tell which things exactly got > much faster, so that we know what might be possible? > > On 10/14/18, FJ Ballesteros <nemo@lsub.org> wrote: >> yes. bugs, on my side at least. >> The copy isolates from others. >> But some experiments in nix and in a thing I wrote for leanxcale show that >> some things can be much faster. >> It’s fun either way. >> >>> El 13 oct 2018, a las 23:11, hiro <23hiro@gmail.com> escribió: >>> >>> and, did it improve anything noticeably? >>> >>>> On 10/13/18, Charles Forsyth <charles.forsyth@gmail.com> wrote: >>>> I did several versions of one part of zero copy, inspired by several >>>> things >>>> in x-kernel, replacing Blocks by another structure throughout the >>>> network >>>> stacks and kernel, then made messages visible to user level. Nemo did >>>> another part, on his way to Clive >>>> >>>>> On Fri, 12 Oct 2018, 07:05 Ori Bernstein, <ori@eigenstate.org> wrote: >>>>> >>>>> On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg >>>>> <lyndon@orthanc.ca> >>>>> wrote: >>>>> >>>>>> Another case to ponder ... We're handling the incoming I/Q data >>>>>> stream, but need to fan that out to many downstream consumers. If >>>>>> we already read the data into a page, then flip it to the first >>>>>> consumer, is there a benefit to adding a reference counter to that >>>>>> read-only page and leaving the page live until the counter expires? >>>>>> >>>>>> Hiro clamours for benchmarks. I agree. Some basic searches I've >>>>>> done don't show anyone trying this out with P9 (and publishing >>>>>> their results). Anybody have hints/references to prior work? >>>>>> >>>>>> --lyndon >>>>>> >>>>> >>>>> I don't believe anyone has done the work yet. I'd be interested >>>>> to see what you come up with. >>>>> >>>>> >>>>> -- >>>>> Ori Bernstein >>>>> >>>>> >>>> >>> >> >> >> > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-14 7:38 ` Francisco J Ballesteros @ 2018-10-14 8:00 ` hiro 2018-10-15 16:48 ` Charles Forsyth 0 siblings, 1 reply; 34+ messages in thread From: hiro @ 2018-10-14 8:00 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs thanks, this will allow us to know where to look more closely. On 10/14/18, Francisco J Ballesteros <nemo@lsub.org> wrote: > Pure "producer/cosumer" stuff, like sending things through a pipe as long as > the source didn't need to touch the data ever more. > Regarding bugs, I meant "producing bugs" not "fixing bugs", btw. > >> On 14 Oct 2018, at 09:34, hiro <23hiro@gmail.com> wrote: >> >> well, finding bugs is always good :) >> but since i got curious could you also tell which things exactly got >> much faster, so that we know what might be possible? >> >> On 10/14/18, FJ Ballesteros <nemo@lsub.org> wrote: >>> yes. bugs, on my side at least. >>> The copy isolates from others. >>> But some experiments in nix and in a thing I wrote for leanxcale show >>> that >>> some things can be much faster. >>> It’s fun either way. >>> >>>> El 13 oct 2018, a las 23:11, hiro <23hiro@gmail.com> escribió: >>>> >>>> and, did it improve anything noticeably? >>>> >>>>> On 10/13/18, Charles Forsyth <charles.forsyth@gmail.com> wrote: >>>>> I did several versions of one part of zero copy, inspired by several >>>>> things >>>>> in x-kernel, replacing Blocks by another structure throughout the >>>>> network >>>>> stacks and kernel, then made messages visible to user level. Nemo did >>>>> another part, on his way to Clive >>>>> >>>>>> On Fri, 12 Oct 2018, 07:05 Ori Bernstein, <ori@eigenstate.org> wrote: >>>>>> >>>>>> On Thu, 11 Oct 2018 13:43:00 -0700, Lyndon Nerenberg >>>>>> <lyndon@orthanc.ca> >>>>>> wrote: >>>>>> >>>>>>> Another case to ponder ... We're handling the incoming I/Q data >>>>>>> stream, but need to fan that out to many downstream consumers. If >>>>>>> we already read the data into a page, then flip it to the first >>>>>>> consumer, is there a benefit to adding a reference counter to that >>>>>>> read-only page and leaving the page live until the counter expires? >>>>>>> >>>>>>> Hiro clamours for benchmarks. I agree. Some basic searches I've >>>>>>> done don't show anyone trying this out with P9 (and publishing >>>>>>> their results). Anybody have hints/references to prior work? >>>>>>> >>>>>>> --lyndon >>>>>>> >>>>>> >>>>>> I don't believe anyone has done the work yet. I'd be interested >>>>>> to see what you come up with. >>>>>> >>>>>> >>>>>> -- >>>>>> Ori Bernstein >>>>>> >>>>>> >>>>> >>>> >>> >>> >>> >> > > > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-14  8:00 ` hiro
@ 2018-10-15 16:48   ` Charles Forsyth
  2018-10-15 17:01     ` hiro
                       ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Charles Forsyth @ 2018-10-15 16:48 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

[-- Attachment #1: Type: text/plain, Size: 4108 bytes --]

It's useful internally in protocol implementation, specifically to avoid
copying in transport protocols (for later retransmission), and the
modifications aren't vast. A few changes were trickier, often because of
small bugs in the original code. icmp does some odd things i think.

Btw, "zero copy" isn't the right term and I preferred another term that
I've now forgotten. Minimal copying, perhaps. For one thing, messages can
eventually end up being copied to contiguous blocks for devices without
decent scatter-gather DMA.

Messages are a tuple (mutable header stack, immutable slices of immutable
data). Originally the data was organised as a tree, but nemo suggested
using just an array, so I changed it. It's important that it's (logically)
immutable. Headers are pushed onto and popped from the header stack, and
the current stack top is mutable.

There were new readmsg and writemsg system calls to carry message
structures between kernel and user level. The message was immutable on
writemsg. Between processes in the same program, message transfers could
be done by exchanging pointers into a shared region.

I'll see if I wrote up some of it. I think there were manual pages for the
Messages replacing Blocks.

My mcs lock implementation was probably more useful, and I use that in my
copy of the kernel known as 9k

Also, NUMA effects are more important in practice on big multicores. Some
of the off-chip delays are brutal.

On Sun, 14 Oct 2018 at 09:50, hiro <23hiro@gmail.com> wrote:

> thanks, this will allow us to know where to look more closely.
>
> [...]

[-- Attachment #2: Type: text/html, Size: 6062 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread
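One plausible shape for the structure Forsyth describes, a mutable header
stack over immutable data slices, is sketched below. The names (Msg, Frag,
pushhdr, addfrag) and sizes are invented for illustration and are not the
actual 9k Message/Fragment interface:

#include <u.h>
#include <libc.h>

/*
 * illustrative sketch of a message as (mutable header stack,
 * immutable slices of immutable data).  not the real 9k code.
 */
enum { Maxhdr = 128, Maxfrag = 16 };

typedef struct Frag Frag;
typedef struct Msg Msg;

struct Frag {
	uchar	*base;	/* immutable underlying segment */
	ulong	off;	/* slice start within segment */
	ulong	len;	/* slice length */
};

struct Msg {
	uchar	hdr[Maxhdr];	/* header stack, grows downward */
	uchar	*hp;		/* current top of header stack (mutable) */
	Frag	frag[Maxfrag];	/* immutable payload slices */
	int	nfrag;
};

void
msginit(Msg *m)
{
	m->hp = m->hdr + Maxhdr;	/* empty header stack */
	m->nfrag = 0;
}

/* push room for an n-byte protocol header; the new top is writable */
uchar*
pushhdr(Msg *m, int n)
{
	if(m->hp - m->hdr < n)
		return nil;	/* no room */
	m->hp -= n;
	return m->hp;
}

/* pop an n-byte header, e.g. as a packet moves up the stack */
uchar*
pophdr(Msg *m, int n)
{
	uchar *p;

	p = m->hp;
	m->hp += n;
	return p;
}

/* append a read-only slice of an existing segment; no data is copied */
int
addfrag(Msg *m, uchar *base, ulong off, ulong len)
{
	if(m->nfrag >= Maxfrag)
		return -1;
	m->frag[m->nfrag].base = base;
	m->frag[m->nfrag].off = off;
	m->frag[m->nfrag].len = len;
	m->nfrag++;
	return 0;
}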
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-15 16:48 ` Charles Forsyth
@ 2018-10-15 17:01   ` hiro
  2018-10-15 17:29   ` hiro
                      ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: hiro @ 2018-10-15 17:01 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

> Btw, "zero copy" isn't the right term and I preferred another term that
> I've now forgotten. Minimal copying, perhaps.

I like that, "zero-copy" makes me imply other linux-specifics, and
those are hard to not get emotional about.

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-15 16:48 ` Charles Forsyth
  2018-10-15 17:01   ` hiro
@ 2018-10-15 17:29   ` hiro
  2018-10-15 23:06     ` Charles Forsyth
  0 siblings, 1 reply; 34+ messages in thread
From: hiro @ 2018-10-15 17:29 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

> Also, NUMA effects are more important in practice on big multicores. Some
> of the off-chip delays are brutal.

yeah, we've been talking about this on #cat-v. even inside one CPU
package amd puts multiple dies nowadays, and the cross-die cpu cache
access delays are approaching the same dimensions as memory-access!

also on each die, they have what they call ccx (cpu complex),
groupings of 4 cores, which are connected much faster internally than
towards the other ccx

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!) 2018-10-15 17:29 ` hiro @ 2018-10-15 23:06 ` Charles Forsyth 0 siblings, 0 replies; 34+ messages in thread From: Charles Forsyth @ 2018-10-15 23:06 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs [-- Attachment #1: Type: text/plain, Size: 643 bytes --] They are machines designed to run programs most people do not write! On Mon, 15 Oct 2018 at 19:20, hiro <23hiro@gmail.com> wrote: > > Also, NUMA effects are more important in practice on big multicores. Some > > of the off-chip delays are brutal. > > yeah, we've been talking about this on #cat-v. even inside one CPU > package amd puts multiple dies nowadays, and the cross-die cpu cache > access delays are approaching the same dimensions as memory-access! > > also on each die, they have what they call ccx (cpu complex), > groupings of 4 cores, which are connected much faster internally than > towards the other ccx > > [-- Attachment #2: Type: text/html, Size: 909 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-15 16:48 ` Charles Forsyth
                     ` (2 preceding siblings ...)
@ 2018-10-16  0:09 ` erik quanstrom
  2018-10-17 18:14 ` Charles Forsyth
  3 siblings, 0 replies; 34+ messages in thread
From: erik quanstrom @ 2018-10-16  0:09 UTC (permalink / raw)
To: 9fans

> It's useful internally in protocol implementation, specifically to avoid
> copying in transport protocols (for later retransmission), and the
> modifications aren't vast.
> A few changes were trickier, often because of small bugs in the original
> code. icmp does some odd things i think.

that makes sense.  likewise, if it were essentially free to add file
systems in the i/o path, from user space, one could build micro file
systems that took care of small details without incurring much cost.
ramfs is enough of a file system if you have other programs to do other
things like dump.

> I'll see if I wrote up some of it. I think there were manual pages for the
> Messages replacing Blocks.

that would be great.  thanks.

> My mcs lock implementation was probably more useful, and I use that in my
> copy of the kernel known as 9k

indeed.  i've seen great performance with mcs in my kernel.

- erik

^ permalink raw reply	[flat|nested] 34+ messages in thread
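An MCS lock queues waiters so that each spins only on its own cache line,
which is what makes it attractive on big NUMA multicores. A minimal sketch
follows, written with C11 atomics rather than whatever primitives 9k uses,
so it is not Forsyth's or erik's implementation:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* MCS queue lock sketch: one qnode per acquiring CPU/thread */
typedef struct MCSNode MCSNode;
struct MCSNode {
	_Atomic(MCSNode*)	next;
	atomic_bool		locked;
};

typedef struct {
	_Atomic(MCSNode*)	tail;
} MCSLock;

void
mcs_lock(MCSLock *l, MCSNode *n)
{
	MCSNode *prev;

	atomic_store(&n->next, NULL);
	atomic_store(&n->locked, true);
	prev = atomic_exchange(&l->tail, n);	/* join the tail of the queue */
	if(prev != NULL){
		atomic_store(&prev->next, n);
		while(atomic_load(&n->locked))	/* spin on our own node only */
			;
	}
}

void
mcs_unlock(MCSLock *l, MCSNode *n)
{
	MCSNode *succ, *expect;

	succ = atomic_load(&n->next);
	if(succ == NULL){
		expect = n;
		/* no known successor: try to swing the tail back to empty */
		if(atomic_compare_exchange_strong(&l->tail, &expect, NULL))
			return;
		while((succ = atomic_load(&n->next)) == NULL)
			;	/* a successor is enqueueing; wait for its link */
	}
	atomic_store(&succ->locked, false);	/* hand the lock over */
}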
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-15 16:48 ` Charles Forsyth
                     ` (2 preceding siblings ...)
@ 2018-10-17 18:14 ` Charles Forsyth
  3 siblings, 0 replies; 34+ messages in thread
From: Charles Forsyth @ 2018-10-17 18:14 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

[-- Attachment #1: Type: text/plain, Size: 263 bytes --]

> I'll see if I wrote up some of it. I think there were manual pages for the
>> Messages replacing Blocks.
>

Here are the three manual pages https://goo.gl/Qykprf

It's not obvious from them, but internally a Fragment can represent a
slice of a Segment*

[-- Attachment #2: Type: text/html, Size: 764 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!) 2018-10-10 21:54 ` Steven Stallion 2018-10-10 22:26 ` [9fans] zero copy & 9p (was " Bakul Shah @ 2018-10-10 22:29 ` Kurt H Maier 2018-10-10 22:55 ` Steven Stallion 2018-10-11 0:26 ` Skip Tavakkolian 2018-10-14 9:46 ` Ole-Hjalmar Kristensen 3 siblings, 1 reply; 34+ messages in thread From: Kurt H Maier @ 2018-10-10 22:29 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs On Wed, Oct 10, 2018 at 04:54:22PM -0500, Steven Stallion wrote: > As the guy might be worth keeping in mind the current most common use case for nvme is laptop storage and not building jet engines in coraid's basement so the nvme driver that cinap wrote works on my thinkpad today and is about infinity times faster than the one you guys locked up in the warehouse at the end of raiders of the lost ark, because my laptop can't seem to boot off nostalgia. so no, nobody gets an award for writing a driver. but cinap won the 9front Order of Valorous Service (with bronze oak leaf cluster, signifying working code) for *releasing* one. I was there when field marshal aiju presented the award; it was a very nice ceremony. anyway, someone once said communication is not a zero-sum game. the hyperspecific use case you describe is fine but there are other reasons to care about how well this stuff works, you know? khm ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!) 2018-10-10 22:29 ` [9fans] " Kurt H Maier @ 2018-10-10 22:55 ` Steven Stallion 2018-10-11 11:19 ` Aram Hăvărneanu 0 siblings, 1 reply; 34+ messages in thread From: Steven Stallion @ 2018-10-10 22:55 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs Posted August 15th, 2013: https://9p.io/sources/contrib/stallion/src/sdmpt2.c Corresponding announcement: https://groups.google.com/forum/#!topic/comp.os.plan9/134-YyYnfbQ On Wed, Oct 10, 2018 at 5:31 PM Kurt H Maier <khm@sciops.net> wrote: > > On Wed, Oct 10, 2018 at 04:54:22PM -0500, Steven Stallion wrote: > > As the guy > > might be worth keeping in mind the current most common use case for nvme > is laptop storage and not building jet engines in coraid's basement > > so the nvme driver that cinap wrote works on my thinkpad today and is > about infinity times faster than the one you guys locked up in the > warehouse at the end of raiders of the lost ark, because my laptop can't > seem to boot off nostalgia. > > so no, nobody gets an award for writing a driver. but cinap won the > 9front Order of Valorous Service (with bronze oak leaf cluster, > signifying working code) for *releasing* one. I was there when field > marshal aiju presented the award; it was a very nice ceremony. > > anyway, someone once said communication is not a zero-sum game. the > hyperspecific use case you describe is fine but there are other reasons > to care about how well this stuff works, you know? > > khm > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!) 2018-10-10 22:55 ` Steven Stallion @ 2018-10-11 11:19 ` Aram Hăvărneanu 0 siblings, 0 replies; 34+ messages in thread From: Aram Hăvărneanu @ 2018-10-11 11:19 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs > Posted August 15th, 2013: > https://9p.io/sources/contrib/stallion/src/sdmpt2.c Corresponding > announcement: > https://groups.google.com/forum/#!topic/comp.os.plan9/134-YyYnfbQ This is not a NVMe driver. -- Aram Hăvărneanu ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!) 2018-10-10 21:54 ` Steven Stallion 2018-10-10 22:26 ` [9fans] zero copy & 9p (was " Bakul Shah 2018-10-10 22:29 ` [9fans] " Kurt H Maier @ 2018-10-11 0:26 ` Skip Tavakkolian 2018-10-11 1:03 ` Steven Stallion 2018-10-14 9:46 ` Ole-Hjalmar Kristensen 3 siblings, 1 reply; 34+ messages in thread From: Skip Tavakkolian @ 2018-10-11 0:26 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs [-- Attachment #1: Type: text/plain, Size: 2786 bytes --] For operations that matter in this context (read, write), there can be multiple outstanding tags. A while back rsc implemented fcp, partly to prove this point. On Wed, Oct 10, 2018 at 2:54 PM Steven Stallion <sstallion@gmail.com> wrote: > As the guy who wrote the majority of the code that pushed those 1M 4K > random IOPS erik mentioned, this thread annoys the shit out of me. You > don't get an award for writing a driver. In fact, it's probably better > not to be known at all considering the bloody murder one has to commit > to marry hardware and software together. > > Let's be frank, the I/O handling in the kernel is anachronistic. To > hit those rates, I had to add support for asynchronous and vectored > I/O not to mention a sizable bit of work by a co-worker to properly > handle NUMA on our appliances to hit those speeds. As I recall, we had > to rewrite the scheduler and re-implement locking, which even Charles > Forsyth had a hand in. Had we the time and resources to implement > something like zero-copy we'd have done it in a heartbeat. > > In the end, it doesn't matter how "fast" a storage driver is in Plan 9 > - as soon as you put a 9P-based filesystem on it, it's going to be > limited to a single outstanding operation. This is the tyranny of 9P. > We (Coraid) got around this by avoiding filesystems altogether. > > Go solve that problem first. > On Wed, Oct 10, 2018 at 12:36 PM <cinap_lenrek@felloff.net> wrote: > > > > > But the reason I want this is to reduce latency to the first > > > access, especially for very large files. With read() I have > > > to wait until the read completes. With mmap() processing can > > > start much earlier and can be interleaved with background > > > data fetch or prefetch. With read() a lot more resources > > > are tied down. If I need random access and don't need to > > > read all of the data, the application has to do pread(), > > > pwrite() a lot thus complicating it. With mmap() I can just > > > map in the whole file and excess reading (beyond what the > > > app needs) will not be a large fraction. > > > > you think doing single 4K page sized reads in the pagefault > > handler is better than doing precise >4K reads from your > > application? possibly in a background thread so you can > > overlap processing with data fetching? > > > > the advantage of mmap is not prefetch. its about not to do > > any I/O when data is already in the *SHARED* buffer cache! > > which plan9 does not have (except the mntcache, but that is > > optional and only works for the disk fileservers that maintain > > ther file qid ver info consistently). its *IS* really a linux > > thing where all block device i/o goes thru the buffer cache. > > > > -- > > cinap > > > > [-- Attachment #2: Type: text/html, Size: 3340 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
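fcp gets multiple 9P requests in flight by running several procs, each
doing pread(2) at a different offset on the same fd, so the mount driver
has several Treads outstanding at once. A rough sketch of that pattern;
the chunk size, proc count and program name are arbitrary, and this is not
fcp's actual source:

#include <u.h>
#include <libc.h>
#include <thread.h>

enum { Nproc = 4, Chunk = 128*1024 };

typedef struct Job Job;
struct Job {
	int	in;
	int	out;
	Channel	*offc;	/* of vlong: offsets to fetch; -1 means stop */
	Channel	*donec;	/* of ulong: worker completions */
};

static void
worker(void *v)
{
	Job *j;
	vlong off;
	long n;
	char *buf;

	j = v;
	buf = malloc(Chunk);
	if(buf == nil)
		sysfatal("malloc: %r");
	for(;;){
		recv(j->offc, &off);
		if(off < 0)
			break;
		n = pread(j->in, buf, Chunk, off);	/* one outstanding Tread per proc */
		if(n > 0)
			pwrite(j->out, buf, n, off);
	}
	sendul(j->donec, 0);
}

void
threadmain(int argc, char **argv)
{
	Job j;
	Dir *d;
	vlong off;
	int i;

	if(argc != 3)
		sysfatal("usage: pcp from to");
	j.in = open(argv[1], OREAD);
	j.out = create(argv[2], OWRITE, 0666);
	if(j.in < 0 || j.out < 0)
		sysfatal("open: %r");
	d = dirfstat(j.in);
	if(d == nil)
		sysfatal("dirfstat: %r");
	j.offc = chancreate(sizeof(vlong), Nproc);
	j.donec = chancreate(sizeof(ulong), Nproc);
	for(i = 0; i < Nproc; i++)
		proccreate(worker, &j, 32*1024);
	for(off = 0; off < d->length; off += Chunk)
		send(j.offc, &off);
	for(i = 0; i < Nproc; i++){
		off = -1;
		send(j.offc, &off);	/* tell each worker to stop */
	}
	for(i = 0; i < Nproc; i++)
		recvul(j.donec);
	threadexits(nil);
}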
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!) 2018-10-11 0:26 ` Skip Tavakkolian @ 2018-10-11 1:03 ` Steven Stallion 0 siblings, 0 replies; 34+ messages in thread From: Steven Stallion @ 2018-10-11 1:03 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs Interesting - was this ever generalized? It's been several years since I last looked, but I seem to recall that unless you went out of your way to write your own 9P implementation, you were limited to a single tag. On Wed, Oct 10, 2018 at 7:51 PM Skip Tavakkolian <skip.tavakkolian@gmail.com> wrote: > > For operations that matter in this context (read, write), there can be multiple outstanding tags. A while back rsc implemented fcp, partly to prove this point. > > On Wed, Oct 10, 2018 at 2:54 PM Steven Stallion <sstallion@gmail.com> wrote: >> >> As the guy who wrote the majority of the code that pushed those 1M 4K >> random IOPS erik mentioned, this thread annoys the shit out of me. You >> don't get an award for writing a driver. In fact, it's probably better >> not to be known at all considering the bloody murder one has to commit >> to marry hardware and software together. >> >> Let's be frank, the I/O handling in the kernel is anachronistic. To >> hit those rates, I had to add support for asynchronous and vectored >> I/O not to mention a sizable bit of work by a co-worker to properly >> handle NUMA on our appliances to hit those speeds. As I recall, we had >> to rewrite the scheduler and re-implement locking, which even Charles >> Forsyth had a hand in. Had we the time and resources to implement >> something like zero-copy we'd have done it in a heartbeat. >> >> In the end, it doesn't matter how "fast" a storage driver is in Plan 9 >> - as soon as you put a 9P-based filesystem on it, it's going to be >> limited to a single outstanding operation. This is the tyranny of 9P. >> We (Coraid) got around this by avoiding filesystems altogether. >> >> Go solve that problem first. >> On Wed, Oct 10, 2018 at 12:36 PM <cinap_lenrek@felloff.net> wrote: >> > >> > > But the reason I want this is to reduce latency to the first >> > > access, especially for very large files. With read() I have >> > > to wait until the read completes. With mmap() processing can >> > > start much earlier and can be interleaved with background >> > > data fetch or prefetch. With read() a lot more resources >> > > are tied down. If I need random access and don't need to >> > > read all of the data, the application has to do pread(), >> > > pwrite() a lot thus complicating it. With mmap() I can just >> > > map in the whole file and excess reading (beyond what the >> > > app needs) will not be a large fraction. >> > >> > you think doing single 4K page sized reads in the pagefault >> > handler is better than doing precise >4K reads from your >> > application? possibly in a background thread so you can >> > overlap processing with data fetching? >> > >> > the advantage of mmap is not prefetch. its about not to do >> > any I/O when data is already in the *SHARED* buffer cache! >> > which plan9 does not have (except the mntcache, but that is >> > optional and only works for the disk fileservers that maintain >> > ther file qid ver info consistently). its *IS* really a linux >> > thing where all block device i/o goes thru the buffer cache. >> > >> > -- >> > cinap >> > >> ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!) 2018-10-10 21:54 ` Steven Stallion ` (2 preceding siblings ...) 2018-10-11 0:26 ` Skip Tavakkolian @ 2018-10-14 9:46 ` Ole-Hjalmar Kristensen 2018-10-14 10:37 ` hiro 3 siblings, 1 reply; 34+ messages in thread From: Ole-Hjalmar Kristensen @ 2018-10-14 9:46 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs [-- Attachment #1: Type: text/plain, Size: 3416 bytes --] I'm not going to argue with someone who has got his hands dirty by actually doing this but I don't really get this about the tyranny of 9p. Isn't the point of the tag field to identify the request? What is stopping the client from issuing multiple requests and match the replies based on the tag? From the manual: Each T-message has a tag field, chosen and used by the client to identify the message. The reply to the message will have the same tag. Clients must arrange that no two outstanding messages on the same connection have the same tag. An exception is the tag NOTAG, defined as (ushort)~0 in <fcall.h>: the client can use it, when establishing a connection, to override tag matching in version messages. Den ons. 10. okt. 2018, 23.56 skrev Steven Stallion <sstallion@gmail.com>: > As the guy who wrote the majority of the code that pushed those 1M 4K > random IOPS erik mentioned, this thread annoys the shit out of me. You > don't get an award for writing a driver. In fact, it's probably better > not to be known at all considering the bloody murder one has to commit > to marry hardware and software together. > > Let's be frank, the I/O handling in the kernel is anachronistic. To > hit those rates, I had to add support for asynchronous and vectored > I/O not to mention a sizable bit of work by a co-worker to properly > handle NUMA on our appliances to hit those speeds. As I recall, we had > to rewrite the scheduler and re-implement locking, which even Charles > Forsyth had a hand in. Had we the time and resources to implement > something like zero-copy we'd have done it in a heartbeat. > > In the end, it doesn't matter how "fast" a storage driver is in Plan 9 > - as soon as you put a 9P-based filesystem on it, it's going to be > limited to a single outstanding operation. This is the tyranny of 9P. > We (Coraid) got around this by avoiding filesystems altogether. > > Go solve that problem first. > On Wed, Oct 10, 2018 at 12:36 PM <cinap_lenrek@felloff.net> wrote: > > > > > But the reason I want this is to reduce latency to the first > > > access, especially for very large files. With read() I have > > > to wait until the read completes. With mmap() processing can > > > start much earlier and can be interleaved with background > > > data fetch or prefetch. With read() a lot more resources > > > are tied down. If I need random access and don't need to > > > read all of the data, the application has to do pread(), > > > pwrite() a lot thus complicating it. With mmap() I can just > > > map in the whole file and excess reading (beyond what the > > > app needs) will not be a large fraction. > > > > you think doing single 4K page sized reads in the pagefault > > handler is better than doing precise >4K reads from your > > application? possibly in a background thread so you can > > overlap processing with data fetching? > > > > the advantage of mmap is not prefetch. its about not to do > > any I/O when data is already in the *SHARED* buffer cache! 
> > which plan9 does not have (except the mntcache, but that is > > optional and only works for the disk fileservers that maintain > > ther file qid ver info consistently). its *IS* really a linux > > thing where all block device i/o goes thru the buffer cache. > > > > -- > > cinap > > > > [-- Attachment #2: Type: text/html, Size: 4264 bytes --] ^ permalink raw reply [flat|nested] 34+ messages in thread
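The manual text quoted above is enough to write a client with more than one
request in flight. A bare-bones sketch using fcall(2) follows; the session
setup (version, attach, walk, open) is assumed to have happened already on
fd, and the function name, message size and offsets are made up:

#include <u.h>
#include <libc.h>
#include <fcall.h>

enum { Msize = 8192 };

/* two outstanding Treads on one 9P connection, matched by tag */
void
tworeads(int fd, u32int fid)
{
	uchar buf[Msize];
	Fcall t, r;
	uint n;
	int m, pending;

	/* first request: tag 1, offset 0 */
	memset(&t, 0, sizeof t);
	t.type = Tread;
	t.tag = 1;
	t.fid = fid;
	t.offset = 0;
	t.count = 4096;
	n = convS2M(&t, buf, sizeof buf);
	if(n == 0 || write(fd, buf, n) != n)
		sysfatal("Tread 1: %r");

	/* second request: tag 2, next 4K, sent before reply 1 arrives */
	t.tag = 2;
	t.offset = 4096;
	n = convS2M(&t, buf, sizeof buf);
	if(n == 0 || write(fd, buf, n) != n)
		sysfatal("Tread 2: %r");

	/* replies may arrive in either order; the tag says which is which */
	for(pending = 2; pending > 0; pending--){
		m = read9pmsg(fd, buf, sizeof buf);
		if(m <= 0)
			sysfatal("read9pmsg: %r");
		if(convM2S(buf, m, &r) == 0)
			sysfatal("bad 9P message");
		if(r.type == Rerror)
			sysfatal("9P error: %s", r.ename);
		print("reply for tag %d: %ud bytes\n", r.tag, r.count);
	}
}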
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!) 2018-10-14 9:46 ` Ole-Hjalmar Kristensen @ 2018-10-14 10:37 ` hiro 2018-10-14 17:34 ` Ole-Hjalmar Kristensen 0 siblings, 1 reply; 34+ messages in thread From: hiro @ 2018-10-14 10:37 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs there's no tyranny involved. a client that is fine with the *responses* coming in reordered could remember the tag obviously and do whatever you imagine. the problem is potential reordering of the messages in the kernel before responding, even if the 9p transport has guaranteed ordering. On 10/14/18, Ole-Hjalmar Kristensen <ole.hjalmar.kristensen@gmail.com> wrote: > I'm not going to argue with someone who has got his hands dirty by actually > doing this but I don't really get this about the tyranny of 9p. Isn't the > point of the tag field to identify the request? What is stopping the client > from issuing multiple requests and match the replies based on the tag? From > the manual: > > Each T-message has a tag field, chosen and used by the > client to identify the message. The reply to the message > will have the same tag. Clients must arrange that no two > outstanding messages on the same connection have the same > tag. An exception is the tag NOTAG, defined as (ushort)~0 > in <fcall.h>: the client can use it, when establishing a > connection, to override tag matching in version messages. > > > > Den ons. 10. okt. 2018, 23.56 skrev Steven Stallion <sstallion@gmail.com>: > >> As the guy who wrote the majority of the code that pushed those 1M 4K >> random IOPS erik mentioned, this thread annoys the shit out of me. You >> don't get an award for writing a driver. In fact, it's probably better >> not to be known at all considering the bloody murder one has to commit >> to marry hardware and software together. >> >> Let's be frank, the I/O handling in the kernel is anachronistic. To >> hit those rates, I had to add support for asynchronous and vectored >> I/O not to mention a sizable bit of work by a co-worker to properly >> handle NUMA on our appliances to hit those speeds. As I recall, we had >> to rewrite the scheduler and re-implement locking, which even Charles >> Forsyth had a hand in. Had we the time and resources to implement >> something like zero-copy we'd have done it in a heartbeat. >> >> In the end, it doesn't matter how "fast" a storage driver is in Plan 9 >> - as soon as you put a 9P-based filesystem on it, it's going to be >> limited to a single outstanding operation. This is the tyranny of 9P. >> We (Coraid) got around this by avoiding filesystems altogether. >> >> Go solve that problem first. >> On Wed, Oct 10, 2018 at 12:36 PM <cinap_lenrek@felloff.net> wrote: >> > >> > > But the reason I want this is to reduce latency to the first >> > > access, especially for very large files. With read() I have >> > > to wait until the read completes. With mmap() processing can >> > > start much earlier and can be interleaved with background >> > > data fetch or prefetch. With read() a lot more resources >> > > are tied down. If I need random access and don't need to >> > > read all of the data, the application has to do pread(), >> > > pwrite() a lot thus complicating it. With mmap() I can just >> > > map in the whole file and excess reading (beyond what the >> > > app needs) will not be a large fraction. >> > >> > you think doing single 4K page sized reads in the pagefault >> > handler is better than doing precise >4K reads from your >> > application? 
possibly in a background thread so you can >> > overlap processing with data fetching? >> > >> > the advantage of mmap is not prefetch. its about not to do >> > any I/O when data is already in the *SHARED* buffer cache! >> > which plan9 does not have (except the mntcache, but that is >> > optional and only works for the disk fileservers that maintain >> > ther file qid ver info consistently). its *IS* really a linux >> > thing where all block device i/o goes thru the buffer cache. >> > >> > -- >> > cinap >> > >> >> > ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!)
  2018-10-14 10:37   ` hiro
@ 2018-10-14 17:34     ` Ole-Hjalmar Kristensen
  2018-10-14 19:17       ` hiro
  2018-10-15  9:29       ` Giacomo Tesio
  0 siblings, 2 replies; 34+ messages in thread
From: Ole-Hjalmar Kristensen @ 2018-10-14 17:34 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

OK, that makes sense. So it would not stop a client from, for example,
first reading an index block in a B-tree, waiting for the result, and
then issuing read operations for all the data blocks in parallel.
That's exactly the same as any asynchronous disk subsystem I am
acquainted with. Reordering is the norm.

On Sun, Oct 14, 2018 at 1:21 PM hiro <23hiro@gmail.com> wrote:

> there's no tyranny involved.
>
> a client that is fine with the *responses* coming in reordered could
> remember the tag obviously and do whatever you imagine.
>
> the problem is potential reordering of the messages in the kernel
> before responding, even if the 9p transport has guaranteed ordering.
>
> On 10/14/18, Ole-Hjalmar Kristensen <ole.hjalmar.kristensen@gmail.com> wrote:
> > I'm not going to argue with someone who has got his hands dirty by
> > actually doing this, but I don't really get this about the tyranny
> > of 9p. Isn't the point of the tag field to identify the request?
> > What is stopping the client from issuing multiple requests and
> > matching the replies based on the tag? From the manual:
> >
> >          Each T-message has a tag field, chosen and used by the
> >          client to identify the message.  The reply to the message
> >          will have the same tag.  Clients must arrange that no two
> >          outstanding messages on the same connection have the same
> >          tag.  An exception is the tag NOTAG, defined as (ushort)~0
> >          in <fcall.h>: the client can use it, when establishing a
> >          connection, to override tag matching in version messages.
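As a concrete illustration of the tag mechanism quoted above, the sketch
below sends two Tread requests with distinct tags on an already
established 9P connection before collecting either reply. It is only a
sketch, not code from the thread or from fcp: the helper name and Msize
are invented, and the version/attach/walk/open setup and the convM2S()
reply loop are omitted.

/* sketch: two Tread requests with distinct tags, sent back to back
 * on an established 9P connection fd, before reading either reply */
#include <u.h>
#include <libc.h>
#include <fcall.h>

enum { Msize = 8192 };	/* assumed negotiated message size */

static void
sendread(int fd, u32int fid, ushort tag, vlong off, u32int cnt)
{
	uchar buf[Msize];
	Fcall t;
	uint n;

	memset(&t, 0, sizeof t);
	t.type = Tread;
	t.fid = fid;
	t.tag = tag;		/* distinct tag per outstanding request */
	t.offset = off;
	t.count = cnt;
	n = convS2M(&t, buf, sizeof buf);
	if(n == 0 || write(fd, buf, n) != n)
		sysfatal("send Tread: %r");
}

/* usage, after the usual setup on fd/fid:
 *	sendread(fd, fid, 1, 0, 4096);
 *	sendread(fd, fid, 2, 4096, 4096);
 * the two Rread replies may arrive in either order and are told
 * apart by the tag field after convM2S(). */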
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!)
  2018-10-14 17:34     ` Ole-Hjalmar Kristensen
@ 2018-10-14 19:17       ` hiro
  2018-10-15  9:29       ` Giacomo Tesio
  1 sibling, 0 replies; 34+ messages in thread
From: hiro @ 2018-10-14 19:17 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

also read what has been written before about fcp. and read the source
of fcp.

On 10/14/18, Ole-Hjalmar Kristensen <ole.hjalmar.kristensen@gmail.com> wrote:
> OK, that makes sense. So it would not stop a client from, for example,
> first reading an index block in a B-tree, waiting for the result, and
> then issuing read operations for all the data blocks in parallel.
> That's exactly the same as any asynchronous disk subsystem I am
> acquainted with. Reordering is the norm.
* Re: [9fans] PDP11 (Was: Re: what heavy negativity!)
  2018-10-14 17:34     ` Ole-Hjalmar Kristensen
  2018-10-14 19:17       ` hiro
@ 2018-10-15  9:29       ` Giacomo Tesio
  1 sibling, 0 replies; 34+ messages in thread
From: Giacomo Tesio @ 2018-10-15 9:29 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

On Sun, 14 Oct 2018 at 19:39, Ole-Hjalmar Kristensen
<ole.hjalmar.kristensen@gmail.com> wrote:
>
> OK, that makes sense. So it would not stop a client from, for example,
> first reading an index block in a B-tree, waiting for the result, and
> then issuing read operations for all the data blocks in parallel.

If the client is the kernel, that's true.
If the client is directly speaking 9P, that's true again.

But if the client is a userspace program using pread/pwrite, that
wouldn't work unless it forks a new process for each read, since the
syscalls block.

Which is what fcp does, actually:
https://github.com/brho/plan9/blob/master/sys/src/cmd/fcp.c


Giacomo
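For readers who do not want to dig through fcp right away, here is a
minimal sketch of the same fork-per-request idea. It is not the actual
fcp source; the worker count, chunk size and program name are invented.
Each child issues its own blocking preads over disjoint offsets, so
several 9P Tread requests can be in flight at once even though each
individual syscall blocks.

/* sketch: stripe blocking preads across a few procs, fcp-style */
#include <u.h>
#include <libc.h>

enum { Nproc = 4, Chunk = 128*1024 };

void
main(int argc, char **argv)
{
	int i, fd;
	vlong off, len;
	char *buf;
	Dir *d;

	if(argc != 2)
		sysfatal("usage: preadtest file");
	d = dirstat(argv[1]);
	if(d == nil)
		sysfatal("stat: %r");
	len = d->length;

	for(i = 0; i < Nproc; i++){
		switch(fork()){
		case -1:
			sysfatal("fork: %r");
		case 0:
			fd = open(argv[1], OREAD);	/* private fd per child */
			buf = malloc(Chunk);
			if(fd < 0 || buf == nil)
				sysfatal("open: %r");
			/* each child covers every Nproc'th chunk; its preads
			 * block this child only, so up to Nproc Treads are
			 * outstanding across the connection */
			for(off = (vlong)i*Chunk; off < len; off += (vlong)Nproc*Chunk)
				if(pread(fd, buf, Chunk, off) < 0)
					sysfatal("pread: %r");
			exits(nil);
		}
	}
	for(i = 0; i < Nproc; i++)
		waitpid();
	exits(nil);
}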
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
@ 2018-10-10 23:58 cinap_lenrek
  2018-10-11  0:56 ` Dan Cross
  0 siblings, 1 reply; 34+ messages in thread
From: cinap_lenrek @ 2018-10-10 23:58 UTC (permalink / raw)
To: 9fans

> Fundamentally zero-copy requires that the kernel and user process
> share the same virtual address space mapped for the given operation.

and it is. this doesn't make your point clear. the kernel is always
mapped. (you meant 1:1 identity mapping *PHYSICAL* pages to make the
lookup cheap?)

the difference is that *USER* pages are (unless you use special
segments) scattered randomly in physical memory, or not even realized,
and you need to look up the pages in the virtual page table to get to
the physical addresses needed to hand them to the hardware for DMA.

now the *INTERESTING* thing is what happens to the original virtual
address space that covered the I/O when someone touches into it while
the I/O is in flight. do we cut it out of the TLBs of ALL processes
*SHARING* the segment? and then have the pagefault handler wait until
the I/O is finished?

fuck your go routines... he wants the D.

> This can't always be done and the kernel will be forced to perform a
> copy anyway.

explain *WHEN*; that would be an insight into what you're trying to
explain.

> To wit, one of the things I added to the exynos kernel
> early on was a 1:1 mapping of the virtual kernel address space such
> that something like zero-copy could be possible in the future (it was
> also very convenient to limit MMU swaps on the Cortex-A15). That said,
> the problem gets harder when you're working on something more general
> that can handle the entire address space. In the end, you trade the
> complexity/performance hit of MMU management versus making a copy.

don't forget the code complexity of dealing with these scattered
pages in the *DRIVERS*.

> Believe it or not, sometimes copies can be faster, especially on
> larger NUMA systems.

--
cinap
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-10 23:58 [9fans] zero copy & 9p (was " cinap_lenrek
@ 2018-10-11  0:56 ` Dan Cross
  2018-10-11  2:26   ` Steven Stallion
  2018-10-11  2:30   ` Bakul Shah
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Cross @ 2018-10-11 0:56 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

On Wed, Oct 10, 2018 at 7:58 PM <cinap_lenrek@felloff.net> wrote:
> > Fundamentally zero-copy requires that the kernel and user process
> > share the same virtual address space mapped for the given operation.
>
> and it is. this doesn't make your point clear. the kernel is always mapped.

Meltdown has shown this to be a bad idea.

> (you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)

plan9 doesn't use an identity mapping; it uses an offset mapping for
most of the address space and, on 64-bit systems, a separate mapping
for the kernel. An identity mapping from P to V is a function f such
that f(a) = a. But on 32-bit plan9, VADDR(p) = p + KZERO and
PADDR(v) = v - KZERO. On 64-bit plan9 systems it's a little more
complex because of the two mappings, which vary between sub-projects:
9front appears to map the kernel into the top 2 gigs of the address
space, which means that, on large machines, the entire physical
address space can't fit into the kernel. Of course, in such situations
one maps the top part of the canonical address space for the exclusive
use of supervisor code, so in that way it's a distinction without a
difference.

Of course, there are tricks to make lookups of arbitrary addresses
relatively cheap by using the MMU hardware and dedicating part of the
address space to a recursive self-map. That is, if you don't want to
walk page tables yourself, or keep a more elaborate data structure to
describe the address space.

> the difference is that *USER* pages are (unless you use special segments)
> scattered randomly in physical memory or not even realized and you need
> to lookup the pages in the virtual page table to get to the physical
> addresses needed to hand them to the hardware for DMA.

So...walking page tables is hard? Ok....

> now the *INTERESTING* thing is what happens to the original virtual
> address space that covered the I/O when someone touches into it while
> the I/O is in flight. so do we cut it out of the TLB's of ALL processes
> *SHARING* the segment? and then have the pagefault handler wait until
> the I/O is finished?

You seem to be mixing multiple things here. The physical page has to
be pinned while the DMA operation is active (unless it can be reliably
canceled). This can be done any number of ways; but so what? It's not
new and it's not black magic. Who cares about the virtual address
space? If some other processor (nb, not process -- processes don't
have TLB entries, processors do) might have a TLB entry for that
mapping that you just changed, you need to shoot it down anyway:
what's that have to do with making things wait for page faulting?

The simplicity of the current scheme comes from the fact that the
kernel portion of the address *space* is effectively immutable once
the kernel gets going. That's easy, but it's not particularly
flexible, and other systems do things differently (not just Linux and
its ilk). I'm not saying you *should* do it in plan9, but it's not
like it hasn't been done elegantly before.

> don't forget the code complexity with dealing with these scattered
> pages in the *DRIVERS*.

It's really not that hard. The way Linux does it is pretty bad, but
it's not like that's the only way to do it.

Or don't.

        - Dan C.
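The offset mapping Dan describes can be written down in a couple of
lines. This is an illustration only: the function names are invented,
and the KZERO value shown is the traditional 32-bit pc one, so treat
the constant as an assumption rather than a statement about any
particular port.

/* sketch of an offset (not identity) mapping between kernel virtual
 * addresses and physical addresses; invented names, assumed KZERO */
typedef unsigned long ulong;

#define KZERO 0xF0000000UL	/* assumed 32-bit split; port-specific */

static void*
kvaddr(ulong pa)	/* physical -> kernel virtual: f(a) = a + KZERO */
{
	return (void*)(pa + KZERO);
}

static ulong
kpaddr(void *va)	/* kernel virtual -> physical: f(a) = a - KZERO */
{
	return (ulong)va - KZERO;
}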
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-11  0:56 ` Dan Cross
@ 2018-10-11  2:26   ` Steven Stallion
  2018-10-11  2:30   ` Bakul Shah
  1 sibling, 0 replies; 34+ messages in thread
From: Steven Stallion @ 2018-10-11 2:26 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

On Wed, Oct 10, 2018 at 8:20 PM Dan Cross <crossd@gmail.com> wrote:
>> don't forget the code complexity with dealing with these scattered
>> pages in the *DRIVERS*.
>
> It's really not that hard. The way Linux does it is pretty bad, but
> it's not like that's the only way to do it.

SunOS and Win32 (believe it or not) managed to get this "right";
dealing with zero-copy in those kernels was a non-event. I'm not sure
I understand the assertion that this would affect constituent drivers.
This sort of detail is handled at a higher level - the driver
generally operates on a buffer that gets jammed into a ring for DMA
transfer. Apart from grabbing the physical address, the worst you may
have to do is pin/unpin the block for the DMA operation. From the
driver's perspective, it's memory. It doesn't matter where it came
from (or who owns it, for that matter).

> Or don't.

There's a lot to be said for keeping it simple...
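A sketch of that driver-side view, with everything hypothetical: Desc,
pin() and vtophys() are invented stand-ins rather than a real Plan 9 or
Coraid interface, and the stubs only mark where a real kernel would
take a reference on the page and walk the page tables.

/* sketch: fill one descriptor in a DMA ring from an arbitrary buffer */
typedef unsigned long long uvlong;
typedef unsigned int uint;

typedef struct Desc Desc;
struct Desc {
	uvlong	addr;	/* physical address handed to the DMA engine */
	uint	len;
	uint	flags;
};

enum { Nring = 256, Owned = 1 };

static Desc ring[Nring];
static int head;

static void
pin(void *va, uint len)
{
	/* stub: keep the page(s) resident for the duration of the DMA */
	(void)va; (void)len;
}

static uvlong
vtophys(void *va)
{
	/* stub: identity for illustration; really a page-table walk */
	return (uvlong)(unsigned long)va;
}

static void
submit(void *buf, uint len)
{
	Desc *d;

	pin(buf, len);		/* buffer must not move or unmap during DMA */
	d = &ring[head];
	d->addr = vtophys(buf);
	d->len = len;
	d->flags = Owned;	/* hand the descriptor to the hardware */
	head = (head+1) % Nring;
}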
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-11  0:56 ` Dan Cross
  2018-10-11  2:26   ` Steven Stallion
@ 2018-10-11  2:30   ` Bakul Shah
  2018-10-11  3:20     ` Steven Stallion
  1 sibling, 1 reply; 34+ messages in thread
From: Bakul Shah @ 2018-10-11 2:30 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

On Wed, 10 Oct 2018 20:56:20 -0400 Dan Cross <crossd@gmail.com> wrote:
> On Wed, Oct 10, 2018 at 7:58 PM <cinap_lenrek@felloff.net> wrote:
> > > Fundamentally zero-copy requires that the kernel and user process
> > > share the same virtual address space mapped for the given operation.
> >
> > and it is. this doesn't make your point clear. the kernel is always mapped.
>
> Meltdown has shown this to be a bad idea.

People still do this.

> > (you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)

Steve wrote "1:1 mapping of the virtual kernel address space such
that something like zero-copy could be possible".

Not sure what he meant. For zero copy you need to *directly*
write to the memory allocated to a process. A 1:1 mapping is
really not needed.

> > the difference is that *USER* pages are (unless you use special segments)
> > scattered randomly in physical memory or not even realized and you need
> > to lookup the pages in the virtual page table to get to the physical
> > addresses needed to hand them to the hardware for DMA.

If you don't copy, you do need to find all the physical pages.
This is not really expensive and many OSes do precisely this.

If you copy, you can avoid walking the page table. But for that
to work, the kernel virtual space needs to be mapped 1:1 in *every*
process -- this is because any cached data will be in kernel space
and must be available in all processes. In fact the *main* reason
this was done was to facilitate such copying. Had we always done
zero-copy, we could've avoided Meltdown altogether.

copyin/copyout of syscall arguments shouldn't be expensive.

> > now the *INTERESTING* thing is what happens to the original virtual
> > address space that covered the I/O when someone touches into it while
> > the I/O is in flight. so do we cut it out of the TLB's of ALL processes
> > *SHARING* the segment? and then have the pagefault handler wait until
> > the I/O is finished?

In general, the way this works is a bit different.

In an mmap() scenario, the initial mapping simply allocates the
necessary PTEs and marks them so that *any* read/write access will
incur a page fault. At that point, if the underlying page is found
to be cached, it is linked to the PTE and the relevant access bit is
changed to allow the access; if not, the process has to wait until
the page is read in, at which time it will be linked with the
relevant PTE(s). Even if the same file page is mapped in N processes,
the same thing happens. The kernel does have to do some bookkeeping,
as the same file data may be referenced from multiple places.

> You seem to be mixing multiple things here. The physical page has to be
> pinned while the DMA operation is active (unless it can be reliably
> canceled). This can be done any number of ways; but so what? It's not new
> and it's not black magic. Who cares about the virtual address space? If
> some other processor (nb, not process -- processes don't have TLB entries,
> processors do) might have a TLB entry for that mapping that you just
> changed, you need to shoot it down anyway: what's that have to do with
> making things wait for page faulting?

Indeed.

> > fuck your go routines... he wants the D.

What?!

> > > This can't always be done and the kernel will be forced to perform a
> > > copy anyway.

In general this is wrong. None of this is new. By decades.

Theoretically even regular read/write can use mapping behind the
scenes. [Save the old V->P map to deal with any IO error, remove the
same from the caller's page table, read in the first (few) pages,
mark the rest as if newly allocated, commence a fetch in the
background and return.] But note that the io driver should *not* do
any prefetch -- that is left up to the caching or FS layer.

> It's really not that hard. The way Linux does it is pretty bad, but it's
> not like that's the only way to do it.
>
> Or don't.

People should think about how things were done prior to Linux so as
to avoid its reality distortion field.
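The fault path Bakul describes can be modelled in a few lines of
ordinary C. The types and the "read" below are invented, so this is
only a toy illustration of the control flow: a PTE starts unlinked,
and a fault either links an already-cached page (no I/O) or allocates
and fills a fresh one, after which every later mapping of the same
file page shares it.

/* toy model of the mmap fault path; all structures are invented */
#include <stdio.h>
#include <stdlib.h>

typedef struct Page Page;
struct Page {
	long	fileoff;	/* which file page this caches */
	Page	*next;
};

typedef struct Pte Pte;
struct Pte {
	Page	*pg;		/* NULL until the first fault */
};

static Page *cache;		/* shared (per-file) page cache */

static Page*
lookup(long off)
{
	Page *p;

	for(p = cache; p != NULL; p = p->next)
		if(p->fileoff == off)
			return p;
	return NULL;
}

/* called on the first access to a mapped page at file offset off */
static void
fault(Pte *pte, long off)
{
	Page *p;

	p = lookup(off);
	if(p == NULL){
		p = calloc(1, sizeof *p);
		p->fileoff = off;
		p->next = cache;
		cache = p;
		/* a real kernel would start the read here and sleep the
		 * faulting process until the page is filled */
	}
	pte->pg = p;		/* link the PTE to the shared cached page */
}

int
main(void)
{
	Pte a = {0}, b = {0};

	fault(&a, 0);		/* first access: page "read in" */
	fault(&b, 0);		/* second mapping of same page: cache hit */
	printf("shared: %d\n", a.pg == b.pg);	/* prints 1 */
	return 0;
}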
* Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
  2018-10-11  2:30 ` Bakul Shah
@ 2018-10-11  3:20   ` Steven Stallion
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Stallion @ 2018-10-11 3:20 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs

On Wed, Oct 10, 2018 at 9:32 PM Bakul Shah <bakul@bitblocks.com> wrote:
> Steve wrote "1:1 mapping of the virtual kernel address space such
> that something like zero-copy could be possible".
>
> Not sure what he meant. For zero copy you need to *directly*
> write to the memory allocated to a process. A 1:1 mapping is
> really not needed.

Ugh. I could have worded that better. That was a (very) clumsy attempt
at stating that the kernel would have to support remapping the user
buffer to virtual kernel space. Fortunately Plan 9 doesn't page out
kernel memory, so pinning wouldn't be required.

Cheers,
Steve
end of thread, other threads: [~2018-10-17 18:14 UTC | newest]

Thread overview: 34+ messages
2018-10-10 17:34 [9fans] PDP11 (Was: Re: what heavy negativity!) cinap_lenrek
2018-10-10 21:54 ` Steven Stallion
2018-10-10 22:26 ` [9fans] zero copy & 9p (was " Bakul Shah
2018-10-10 22:52 ` Steven Stallion
2018-10-11 20:43 ` Lyndon Nerenberg
2018-10-11 22:28 ` hiro
2018-10-12  6:04 ` Ori Bernstein
2018-10-13 18:01 ` Charles Forsyth
2018-10-13 21:11 ` hiro
2018-10-14  5:25 ` FJ Ballesteros
2018-10-14  7:34 ` hiro
2018-10-14  7:38 ` Francisco J Ballesteros
2018-10-14  8:00 ` hiro
2018-10-15 16:48 ` Charles Forsyth
2018-10-15 17:01 ` hiro
2018-10-15 17:29 ` hiro
2018-10-15 23:06 ` Charles Forsyth
2018-10-16  0:09 ` erik quanstrom
2018-10-17 18:14 ` Charles Forsyth
2018-10-10 22:29 ` [9fans] " Kurt H Maier
2018-10-10 22:55 ` Steven Stallion
2018-10-11 11:19 ` Aram Hăvărneanu
2018-10-11  0:26 ` Skip Tavakkolian
2018-10-11  1:03 ` Steven Stallion
2018-10-14  9:46 ` Ole-Hjalmar Kristensen
2018-10-14 10:37 ` hiro
2018-10-14 17:34 ` Ole-Hjalmar Kristensen
2018-10-14 19:17 ` hiro
2018-10-15  9:29 ` Giacomo Tesio
2018-10-10 23:58 [9fans] zero copy & 9p (was " cinap_lenrek
2018-10-11  0:56 ` Dan Cross
2018-10-11  2:26 ` Steven Stallion
2018-10-11  2:30 ` Bakul Shah
2018-10-11  3:20 ` Steven Stallion