From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bakul Shah
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
In-reply-to: Your message of "Wed, 10 Oct 2018 20:56:20 -0400."
References: <3C62CE67D1FF8260C5450B3D0AE7AA1A@felloff.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <2554.1539225046.1@bitblocks.com>
Date: Wed, 10 Oct 2018 19:30:46 -0700
Message-Id: <20181011023054.2E6AC156E40C@mail.bitblocks.com>
Subject: Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
Topicbox-Message-UUID: ebd74d76-ead9-11e9-9d60-3106f5b1d025

On Wed, 10 Oct 2018 20:56:20 -0400 Dan Cross wrote:
>
> On Wed, Oct 10, 2018 at 7:58 PM wrote:
>
> > > Fundamentally zero-copy requires that the kernel and user process
> > > share the same virtual address space mapped for the given operation.
> >
> > and it is. this doesn't make your point clear. the kernel is always
> > mapped.
>
> Meltdown has shown this to be a bad idea. People still do this.
>
> > (you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup
> > cheap?)

Steve wrote "1:1 mapping of the virtual kernel address space such that
something like zero-copy could be possible". Not sure what he meant.
For zero copy you need to *directly* write to the memory allocated to a
process; a 1:1 mapping is really not needed.

> plan9 doesn't use an identity mapping; it uses an offset mapping for
> most of the address space and, on 64-bit systems, a separate mapping
> for the kernel. An identity mapping from P to V is a function f such
> that f(a) = a, but on 32-bit plan9, VADDR(p) = p + KZERO and
> PADDR(v) = v - KZERO. On 64-bit plan9 systems it's a little more
> complex because of the two mappings, which vary between sub-projects:
> 9front appears to map the kernel into the top 2 gigs of the address
> space, which means that, on large machines, the entire physical
> address space can't fit into the kernel's mapping. Of course, in such
> situations one maps the top part of the canonical address space for
> the exclusive use of supervisor code, so in that way it's a
> distinction without a difference.
>
> Of course, there are tricks to make lookups of arbitrary addresses
> relatively cheap by using the MMU hardware and dedicating part of the
> address space to a recursive self-map. That is, if you don't want to
> walk page tables yourself, or keep a more elaborate data structure to
> describe the address space.
>
> > the difference is that *USER* pages are (unless you use special
> > segments) scattered randomly in physical memory, or not even
> > realized, and you need to look up the pages in the virtual page
> > table to get to the physical addresses needed to hand them to the
> > hardware for DMA.

If you don't copy, you do need to find all the physical pages. This is
not really expensive and many OSes do precisely this. If you copy, you
can avoid walking the page table, but for that to work the kernel
virtual space needs to be mapped 1:1 in *every* process -- this is
because any cached data will be in kernel space and must be available
in all processes. In fact, the *main* reason this was done was to
facilitate such copying. Had we always done zero-copy, we could've
avoided Meltdown altogether. copyin/copyout of syscall arguments
shouldn't be expensive.

> So... walking page tables is hard? Ok....
>
> > now the *INTERESTING* thing is what happens to the original virtual
> > address space that covered the I/O when someone touches into it
> > while the I/O is in flight. so do we cut it out of the TLBs of ALL
> > processes *SHARING* the segment? and then have the pagefault handler
> > wait until the I/O is finished?

In general, the way this works is a bit different. In an mmap()
scenario, the initial mapping simply allocates the necessary PTEs and
marks them so that *any* read/write access will incur a page fault. At
that time, if the underlying page is found in the cache, it is linked
to the PTE and the relevant access bit is changed to allow the access.
If not, the process has to wait until the page is read in, at which
time it is linked with the relevant PTE(s). Even if the same file page
is mapped in N processes, the same thing happens. The kernel does have
to do some bookkeeping, as the same file data may be referenced from
multiple places.
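To make that sequence concrete, here is a minimal sketch in C. To be
clear, every name in it (Page, Pte, pagecache_lookup, waitread,
mmapfault) is invented for illustration; this is not any particular
kernel's interface, just the shape of the fault path described above.

/* Hypothetical sketch of the fault path: fault -> page cache
   lookup -> link page to PTE. All names are made up. */

#include <stdlib.h>

typedef struct Page Page;
typedef struct Pte Pte;

struct Page {
	int	valid;	/* contents already read in from the file? */
	/* ... physical frame, reference count, etc. ... */
};

struct Pte {
	Page	*page;	/* cached page linked to this mapping, or NULL */
	int	access;	/* access bits currently allowed */
};

/* Stub standing in for a real page cache lookup: find (or allocate)
   the cache page for (fd, off). In a real kernel, all N processes
   mapping the same file page would get the same Page back. */
static Page *
pagecache_lookup(int fd, unsigned long off)
{
	(void)fd; (void)off;
	return calloc(1, sizeof(Page));
}

/* Stub: block the faulting process until the backing read finishes. */
static void
waitread(Page *pg)
{
	pg->valid = 1;
}

/* Called on a fault within an mmap()ed region: the PTE was allocated
   up front but marked so that any access faults. */
void
mmapfault(Pte *pte, int fd, unsigned long off, int want)
{
	Page *pg = pagecache_lookup(fd, off);

	if(!pg->valid)
		waitread(pg);	/* not yet cached: wait for the read */
	pte->page = pg;		/* link the (now cached) page to the PTE */
	pte->access |= want;	/* flip the access bit to allow the access */
}

Note that the same cached Page gets linked into every process that maps
the file; that sharing is the bookkeeping mentioned above.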
> You seem to be mixing multiple things here. The physical page has to
> be pinned while the DMA operation is active (unless it can be reliably
> canceled). This can be done any number of ways; but so what? It's not
> new and it's not black magic. Who cares about the virtual address
> space? If some other processor (nb, not process -- processes don't
> have TLB entries, processors do) might have a TLB entry for that
> mapping that you just changed, you need to shoot it down anyway:
> what's that have to do with making things wait for page faulting?

Indeed.

> The simplicity of the current scheme comes from the fact that the
> kernel portion of the address *space* is effectively immutable once
> the kernel gets going. That's easy, but it's not particularly
> flexible, and other systems do things differently (not just Linux and
> its ilk). I'm not saying you *should* do it in plan9, but it's not
> like it hasn't been done elegantly before.
>
> > > fuck your go routines... he wants the D.

What?!

> > > This can't always be done and the kernel will be forced to perform
> > > a copy anyway.

In general this is wrong. None of this is new -- by decades.
Theoretically, even regular read/write can use mapping behind the
scenes. [Save the old V->P map to deal with any I/O error, remove the
same from the caller's page table, read in the first (few) pages and
mark the rest as if newly allocated, commence a fetch in the background
and return; a sketch follows at the end of this message.] But note that
the I/O driver should *not* do any prefetch -- that is left up to the
caching or FS layer.

> > explain *WHEN*, that would be an insight into what you're trying to
> > explain.
>
> > > To wit, one of the things I added to the exynos kernel early on
> > > was a 1:1 mapping of the virtual kernel address space such that
> > > something like zero-copy could be possible in the future (it was
> > > also very convenient to limit MMU swaps on the Cortex-A15). That
> > > said, the problem gets harder when you're working on something
> > > more general that can handle the entire address space. In the end,
> > > you trade the complexity/performance hit of MMU management versus
> > > making a copy.
> >
> > don't forget the code complexity of dealing with these scattered
> > pages in the *DRIVERS*.
>
> It's really not that hard. The way Linux does it is pretty bad, but
> it's not like that's the only way to do it.
>
> Or don't.

People should think about how things were done prior to Linux so as to
avoid its reality distortion field.
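As promised, here is a rough sketch of that bracketed read-via-remapping
idea. Again, every name (savemap, restoremap, unmap, readpages, bgfetch,
mapread) is invented for illustration; the stubs merely stand in for
real VM and I/O machinery.

/* Sketch of "regular read() via remapping": save the old V->P map for
   error recovery, unmap the buffer, read the first pages synchronously,
   and let the rest fault in as the background fetch completes. */

#include <stddef.h>

typedef struct Map Map;		/* a saved set of V->P translations */
struct Map { int dummy; };

static Map saved;

/* Stubs standing in for real VM/I/O primitives: */
static Map *savemap(void *va, size_t len) { (void)va; (void)len; return &saved; }
static void restoremap(Map *m) { (void)m; }	/* undo on I/O error */
static void unmap(void *va, size_t len) { (void)va; (void)len; }	/* pull PTEs, shoot down TLBs */
static int readpages(int fd, void *va, int n) { (void)fd; (void)va; (void)n; return 0; }
static void bgfetch(int fd, void *va, size_t len) { (void)fd; (void)va; (void)len; }

/* read() replacement: map pages in instead of copying bytes. */
long
mapread(int fd, void *va, size_t len)
{
	Map *old = savemap(va, len);	/* keep the old V->P map around */

	unmap(va, len);			/* caller's accesses now fault */
	if(readpages(fd, va, 2) < 0){	/* read first couple of pages
					   synchronously */
		restoremap(old);	/* I/O error: put the old map back */
		return -1;
	}
	bgfetch(fd, va, len);		/* fetch the rest in the background;
					   a fault on a not-yet-fetched page
					   blocks until its data arrives */
	return (long)len;		/* return to the caller immediately */
}

The point is just the control flow: the cost moves from copying bytes
to editing mappings, which is where the TLB shootdown questions raised
earlier in the thread come in.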