9fans - fans of the OS Plan 9 from Bell Labs
From: Bakul Shah <bakul@bitblocks.com>
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] zero copy & 9p (was Re: PDP11 (Was: Re: what heavy negativity!)
Date: Wed, 10 Oct 2018 19:30:46 -0700
Message-ID: <20181011023054.2E6AC156E40C@mail.bitblocks.com>
In-Reply-To: Your message of "Wed, 10 Oct 2018 20:56:20 -0400." <CAEoi9W62qFgB2tj=aKEZ=PusHObnTwTDbGBqFBmzgmu=Sfepjg@mail.gmail.com>

On Wed, 10 Oct 2018 20:56:20 -0400 Dan Cross <crossd@gmail.com> wrote:
>
> On Wed, Oct 10, 2018 at 7:58 PM <cinap_lenrek@felloff.net> wrote:
>
> > > Fundamentally zero-copy requires that the kernel and user process
> > > share the same virtual address space mapped for the given operation.
> >
> > and it is. this doesnt make your point clear. the kernel is always mapped.
> >
>
> Meltdown has shown this to be a bad idea.

People still do this.

> > (you meant 1:1 identity mapping *PHYSICAL* pages to make the lookup cheap?)

Steve wrote "1:1 mapping of the virtual kernel address space such
that something like zero-copy could be possible"

Not sure what he meant. For zero copy the kernel (or the
device) needs to write *directly* into the memory allocated
to the process; a 1:1 mapping is really not needed for that.
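
To make that concrete: in a zero-copy read the driver hands the
device the *physical* addresses of the caller's own pages and lets
it DMA straight into them; no kernel alias of those pages is ever
needed. A minimal sketch -- the descriptor layout and the uva2pa()
helper are invented for illustration and nothing here comes from a
real driver:

#include <stdint.h>
#include <stddef.h>

enum { PGSIZE = 4096 };

/* Hypothetical DMA descriptor: one physically contiguous chunk per entry. */
struct dmadesc {
	uint64_t pa;	/* physical address the device writes to */
	uint32_t len;	/* length of this chunk in bytes */
};

/* Hypothetical helper: translate a user virtual address in the current
 * process to a physical address (i.e. walk that process's page table). */
uint64_t uva2pa(void *uva);

/* Fill a descriptor ring so the device DMAs directly into the user's
 * buffer, one page-sized (or smaller) chunk at a time.  The pages must
 * already be faulted in and pinned for the duration of the I/O. */
int
setupdma(struct dmadesc *ring, int nring, void *ubuf, size_t n)
{
	uintptr_t va = (uintptr_t)ubuf;
	int i = 0;

	while(n > 0 && i < nring){
		size_t chunk = PGSIZE - (va & (PGSIZE-1));	/* stay within one page */
		if(chunk > n)
			chunk = n;
		ring[i].pa = uva2pa((void*)va);
		ring[i].len = (uint32_t)chunk;
		va += chunk;
		n -= chunk;
		i++;
	}
	return n == 0 ? i : -1;		/* -1: ring too small for the buffer */
}

The only real requirements are that the pages be resident and
pinned until the transfer completes.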

> plan9 doesn't use an identity mapping; it uses an offset mapping for most
> of the address space and on 64-bit systems a separate mapping for the
> kernel. An identity mapping from P to V is a function f such that f(a) = a.
> But on 32-bit plan9, VADDR(p) = p + KZERO and PADDR(v) = v - KZERO. On
> 64-bit plan9 systems it's a little more complex because of the two
> mappings, which vary between sub-projects: 9front appears to map the kernel
> into the top 2 gigs of the address space which means that, on large
> machines, the entire physical address space can't fit into the kernel.  Of
> course in such situations one maps the top part of the canonical address
> space for the exclusive use of supervisor code, so in that way it's a
> distinction without a difference.
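
Spelled out, that offset mapping is just a pair of macros. A small
standalone sketch of the 32-bit case (the KZERO value below is
illustrative, not taken from any particular port):

#include <stdint.h>
#include <stdio.h>

/* Illustrative value only: base of the kernel's view of physical memory. */
#define KZERO	0xF0000000u

/* Offset (not identity) mapping between kernel virtual and physical
 * addresses: VADDR(p) = p + KZERO, PADDR(v) = v - KZERO. */
#define VADDR(p)	((void*)((uintptr_t)(p) + KZERO))
#define PADDR(v)	((uintptr_t)(v) - KZERO)

int
main(void)
{
	uintptr_t pa = 0x00123000;
	void *kva = VADDR(pa);

	/* An identity map would require kva == pa; here they differ by KZERO. */
	printf("pa %#lx -> kva %p -> pa %#lx\n",
		(unsigned long)pa, kva, (unsigned long)PADDR(kva));
	return 0;
}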
>
> Of course, there are tricks to make lookups of arbitrary addresses
> relatively cheap by using the MMU hardware and dedicating part of the
> address space to a recursive self-map. That is, if you don't want to walk
> page tables yourself, or keep a more elaborate data structure to describe
> the address space.
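
For what it's worth, the recursive self-map trick looks like this
on x86-64 with 4-level, 4KB paging: reserve one top-level (PML4)
slot that points back at the PML4 page itself, and the PTE for any
virtual address becomes computable instead of needing a software
walk. The slot number and the function below are arbitrary; this is
a generic sketch, not code from any Plan 9 kernel:

#include <stdint.h>

/* Arbitrary PML4 slot reserved for the self-map: PML4[SELFSLOT] holds
 * the physical address of the PML4 page itself. */
enum { SELFSLOT = 0x1FE };

/* Virtual address at which the 4KB-page PTE for va is visible through
 * the self-map (x86-64, 48-bit canonical addresses, 4-level paging).
 * Shifting va right by 9 turns each level's index into the next lower
 * level's index, and the self-map slot supplies the new top-level index. */
uint64_t
pteaddr(uint64_t va)
{
	uint64_t a;

	a = ((uint64_t)SELFSLOT << 39)
	  | ((va >> 9) & 0x0000007FFFFFFFF8ULL);	/* keep bits 3..38 */
	if(a & (1ULL << 47))
		a |= 0xFFFF000000000000ULL;	/* sign-extend: canonical form */
	return a;
}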
>
> > the difference is that *USER* pages are (unless you use special segments)
> > scattered randomly in physical memory or not even realized and you need
> > to lookup the pages in the virtual page table to get to the physical
> > addresses needed to hand them to the hardware for DMA.

If you don't copy, you do need to find all the physical pages.
This is not really expensive and many OSes do precisely this.
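
Finding them is a loop like the one below -- a software walk of an
x86-64-style 4-level page table, i.e. the kind of thing a helper
such as the uva2pa() in the earlier sketch would do. phys2virt() is
whatever the kernel uses to reach page-table pages (a KZERO-style
offset map, a direct map, ...); large pages and permission checks
are ignored to keep it short:

#include <stdint.h>

#define PTE_P		0x001ULL			/* present bit */
#define PTE_ADDR(e)	((e) & 0x000FFFFFFFFFF000ULL)	/* frame address */

/* Hypothetical: kernel-virtual view of a physical page-table page. */
uint64_t *phys2virt(uint64_t pa);

/* Return the physical address backing user virtual address va, or 0 if
 * some level is not present (page not realized yet).  4KB pages only. */
uint64_t
walk(uint64_t ptroot, uint64_t va)
{
	uint64_t *tab = phys2virt(ptroot);
	int shift;

	for(shift = 39; ; shift -= 9){
		uint64_t e = tab[(va >> shift) & 0x1FF];

		if(!(e & PTE_P))
			return 0;
		if(shift == 12)
			return PTE_ADDR(e) | (va & 0xFFF);
		tab = phys2virt(PTE_ADDR(e));
	}
}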

If you copy, you can avoid walking the page table. But for
that to work, the kernel virtual space needs to be mapped 1:1
in *every* process -- this is because any cached data will be
in kernel space and must be available in all processes.

In fact the *main* reason this was done was to facilitate such
copying. Had we always done zero-copy, we could've avoided
Meltdown altogether. The copyin/copyout of syscall arguments
shouldn't be expensive.
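
That is also the real payoff of keeping the kernel mapped in every
process: copyin/copyout is little more than an address check plus a
memmove. A minimal sketch, with a hypothetical okaddr() standing in
for the real segment/permission check and with the fault handling a
real kernel needs for bad user pointers left out:

#include <stddef.h>
#include <string.h>

/* Hypothetical check that [uva, uva+n) lies entirely inside the current
 * process's user segments with the required permissions. */
int okaddr(const void *uva, size_t n);

/* Copy n bytes of syscall arguments from user space into a kernel
 * buffer.  Because the kernel half of the address space is present in
 * every process's page table, both pointers are directly usable here;
 * no remapping or mapping window is needed. */
int
copyin(void *kdst, const void *usrc, size_t n)
{
	if(!okaddr(usrc, n))
		return -1;
	memmove(kdst, usrc, n);
	return 0;
}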

> So...walking page tables is hard? Ok....
>
> > now the *INTERESTING* thing is what happens to the original virtual
> > address space that covered the I/O when someone touches into it while
> > the I/O is in flight. so do we cut it out of the TLB's of ALL processes
> > *SHARING* the segment? and then have the pagefault handler wait until
> > the I/O is finished?

In general, the way this works is a bit different. In an
mmap() scenario, the initial mapping simply allocates the
necessary PTEs and marks them so that *any* read/write access
will incur a page fault.  At fault time, if the underlying
page is found in the cache, it is linked to the PTE and the
relevant access bits are changed to allow the access. If not,
the process has to wait until the page is read in, at which
time it is linked to the relevant PTE(s). Even if the same
file page is mapped in N processes, the same thing happens.
The kernel does have to do some bookkeeping, as the same file
data may be referenced from multiple places.
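
In rough code, with every name invented, the fault path described
above has this shape (it is not any particular kernel's code):

#include <stdint.h>
#include <stddef.h>

struct page;	/* one physical page frame */
struct file;	/* the mapped object */

/* Hypothetical stand-ins for the real machinery. */
struct page *cachelookup(struct file*, uint64_t off);	/* already cached?          */
struct page *pagein(struct file*, uint64_t off);	/* read + insert; may sleep */
void mappage(void *va, struct page*, int writable);	/* fill in the PTE          */

/* Handle a fault on an mmap'ed region: the PTE was left invalid at
 * mmap time, so the first touch lands here.  If the file page is
 * already cached it is simply wired up; otherwise the process sleeps
 * in pagein() until the read completes.  N processes mapping the same
 * file page all end up with PTEs pointing at the same struct page. */
void
mmapfault(struct file *f, uint64_t off, void *faultva, int write)
{
	struct page *p;

	p = cachelookup(f, off);
	if(p == NULL)
		p = pagein(f, off);	/* blocks until the I/O finishes */
	mappage(faultva, p, write);	/* access bits set; fault won't recur */
}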

> You seem to be mixing multiple things here. The physical page has to be
> pinned while the DMA operation is active (unless it can be reliably
> canceled). This can be done any number of ways; but so what? It's not new
> and it's not black magic. Who cares about the virtual address space? If
> some other processor (nb, not process -- processes don't have TLB entries,
> processors do) might have a TLB entry for that mapping that you just
> changed you need to shoot it down anyway: what's that have to do with
> making things wait for page faulting?

Indeed.

> The simplicity of the current scheme comes from the fact that the kernel
> portion of the address *space* is effectively immutable once the kernel
> gets going. That's easy, but it's not particularly flexible and other
> systems do things differently (not just Linux and its ilk). I'm not saying
> you *should* do it in plan9, but it's not like it hasn't been done
> elegantly before.
>
>
> > fuck your go routines... he wants the D.

What?!

> > > This can't always be done and the kernel will be forced to perform a
> > > copy anyway.

In general this is wrong. None of this is new; it has been
done for decades. Theoretically even regular read/write can
use mapping behind the scenes. [Save the old V->P map to deal
with any I/O error, remove those mappings from the caller's
page table, read in the first (few) pages and mark the rest as
if newly allocated, commence a fetch in the background and
return.]

But note that the I/O driver should *not* do any prefetch --
that is left up to the caching or FS layer.
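
Spelled out, that bracketed recipe looks roughly like the sketch
below. Every helper and constant is hypothetical; alignment, short
reads and the error path are elided. Note that the background fetch
is issued here, at the FS/cache layer, which is exactly why the
driver underneath need not guess about prefetch:

#include <stdint.h>
#include <stddef.h>

enum { FIRSTCHUNK = 4096 };	/* read this much synchronously; illustrative */

struct map;			/* saved V->P translations for a user range */

/* Hypothetical helpers, named only for this sketch. */
struct map *savemap(void *uva, size_t n);		/* remember old translations */
void unmap(void *uva, size_t n);			/* drop them + TLB shootdown */
void mapfilepages(int fd, uint64_t off, void *uva, size_t n);	/* map file pages, reading now */
void marklazy(int fd, uint64_t off, void *uva, size_t n);	/* map on first touch */
void startfetch(int fd, uint64_t off, size_t n);	/* background read of the rest */
void restoremap(struct map*);				/* undo everything on I/O error */

/* read(2) by remapping instead of copying: the pages behind the
 * caller's buffer are replaced with file pages.  The first chunk is
 * mapped (and read) before returning; the rest is marked to fault in
 * on demand while a background fetch fills the cache.  Assumes a
 * page-aligned buffer and n > FIRSTCHUNK. */
long
mapread(int fd, void *ubuf, size_t n, uint64_t off)
{
	struct map *old = savemap(ubuf, n);	/* kept for the error path */

	unmap(ubuf, n);
	mapfilepages(fd, off, ubuf, FIRSTCHUNK);
	marklazy(fd, off+FIRSTCHUNK, (char*)ubuf+FIRSTCHUNK, n-FIRSTCHUNK);
	startfetch(fd, off+FIRSTCHUNK, n-FIRSTCHUNK);
	(void)old;		/* restoremap(old) would run on failure */
	return (long)n;
}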

> > explain *WHEN*, that would be an insight in what you're trying to
> > explain.
> >
> > > To wit, one of the things I added to the exynos kernel
> > > early on was a 1:1 mapping of the virtual kernel address space such
> > > that something like zero-copy could be possible in the future (it was
> > > also very convenient to limit MMU swaps on the Cortex-A15). That said,
> > > the problem gets harder when you're working on something more general
> > > that can handle the entire address space. In the end, you trade the
> > > complexity/performance hit of MMU management versus making a copy.
> >
> > don't forget the code complexity with dealing with these scattered
> > pages in the *DRIVERS*.
> >
>
> It's really not that hard. The way Linux does it is pretty bad, but it's
> not like that's the only way to do it.
>
> Or don't.

People should think about how things were done prior to Linux
so as to avoid its reality distortion field.



Thread overview: 23+ messages
2018-10-10 23:58 cinap_lenrek
2018-10-11  0:56 ` Dan Cross
2018-10-11  2:26   ` Steven Stallion
2018-10-11  2:30   ` Bakul Shah [this message]
2018-10-11  3:20     ` Steven Stallion
2018-10-11 14:21       ` [9fans] ..... UNSUBSCRIBE_HELP NEEDED DHAN HURLEY
  -- strict thread matches above, loose matches on Subject: below --
2018-10-10 17:34 [9fans] PDP11 (Was: Re: what heavy negativity!) cinap_lenrek
2018-10-10 21:54 ` Steven Stallion
2018-10-10 22:26   ` [9fans] zero copy & 9p (was " Bakul Shah
2018-10-10 22:52     ` Steven Stallion
2018-10-11 20:43     ` Lyndon Nerenberg
2018-10-11 22:28       ` hiro
2018-10-12  6:04       ` Ori Bernstein
2018-10-13 18:01         ` Charles Forsyth
2018-10-13 21:11           ` hiro
2018-10-14  5:25             ` FJ Ballesteros
2018-10-14  7:34               ` hiro
2018-10-14  7:38                 ` Francisco J Ballesteros
2018-10-14  8:00                   ` hiro
2018-10-15 16:48                     ` Charles Forsyth
2018-10-15 17:01                       ` hiro
2018-10-15 17:29                       ` hiro
2018-10-15 23:06                         ` Charles Forsyth
2018-10-16  0:09                       ` erik quanstrom
2018-10-17 18:14                       ` Charles Forsyth
