The Unix Heritage Society mailing list
From: Larry McVoy <lm@mcvoy.com>
To: George Michaelson <ggm@algebras.org>
Cc: TUHS main list <tuhs@minnie.tuhs.org>
Subject: Re: [TUHS] FreeBSD behind the times? (was: Favorite unix design principles?)
Date: Thu, 4 Feb 2021 16:33:15 -0800
Message-ID: <20210205003315.GK13701@mcvoy.com>
In-Reply-To: <CAKr6gn3Mvzy0Nf2NHHXGf6eFdumZU8h8jUUiU4e4T8Y_7ofisA@mail.gmail.com>

On Fri, Feb 05, 2021 at 08:28:12AM +1000, George Michaelson wrote:
> What ZFS did, and what Docker does, and snap does, and flatpack does,
> is package things in a way which make modalities for use for  modern
> sysadmin "just work"

No argument there.

> Basically Larry, I think you are kindof wrong. These alumni of yours
> did what all kids should do: they ran ahead. Did they scrape their
> knees doing it? Sure. But if they don't try things their teachers say
> are bad, how do they advance the art? 

Before I show you I'm not wrong, if you are saying (and I think you
are) that you like ZFS and find it useful, I have no disagreement with
that.  I'm in no way arguing that ZFS isn't useful.  If you are just 
an end user and you don't run into any of the coherency problems, 
it's great.

I'm arguing from the point of view of how a kernel is supposed to work.
What ZFS did is a gross violation of how the kernel is supposed to work
and both Bonwick and Moore have admitted that; they just thought it was
too hard to do it right.  There is a body of code in BitKeeper that does
the exact part that they thought was too hard, a layer that takes a page
fault and fills in the page from a compressed and xor-ed data source.
Works great, one guy did it in a few months or so.  It's not that hard.
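
A toy sketch of that kind of layer, to make the idea concrete.  This
is not BitKeeper's code: the class and method names are hypothetical,
the xor step is left out, and a dict stands in for the page cache.  It
also assumes the data is compressed in page-sized units, so each fault
decompresses exactly one page and only one uncompressed copy exists.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size


class CompressedStore:
    """Hypothetical backing store: compressed blocks, pages filled on demand."""

    def __init__(self, data):
        # Compress each page-sized slice on its own so a fault never
        # has to touch more than one block.
        self.blocks = [
            zlib.compress(data[i:i + PAGE_SIZE])
            for i in range(0, len(data), PAGE_SIZE)
        ]
        self.pages = {}   # stands in for the page cache
        self.faults = 0

    def fault(self, page_no):
        # "Page fault": decompress the block and install the page.
        self.faults += 1
        self.pages[page_no] = bytearray(zlib.decompress(self.blocks[page_no]))

    def page(self, page_no):
        # Every reader and writer gets the same in-memory page object.
        if page_no not in self.pages:
            self.fault(page_no)
        return self.pages[page_no]


store = CompressedStore(b"a" * PAGE_SIZE + b"b" * PAGE_SIZE)
first = store.page(1)
again = store.page(1)
```

The second access returns the page that the first fault installed, so
there is never a second uncompressed copy to keep in sync.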

So why is what ZFS did so wrong?

Ignoring the page cache and making their own cache has big problems.
You can mmap() ZFS files and doing so means that when a page is referenced
it is copied from the ZFS cache to the page cache.  That creates a
coherency problem: I can write via the mapping and I can write via
write(2), and now you have two copies of the data that don't match.
That's pretty much OS no-no #1.
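
The property at stake can be checked from user space.  The sketch below
writes through a shared mapping and through write(2) on a temporary
file and compares what each path sees.  The function name views_agree
is made up for illustration, and the check only observes behaviour on
whatever filesystem holds the temp directory; it proves nothing about
how the kernel gets there.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def views_agree():
    """Write via a shared mapping and via write(2); compare both views."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, PAGE)
        with mmap.mmap(fd, PAGE, flags=mmap.MAP_SHARED) as m:
            # Store through the mapping, read back through the fd.
            m[0:5] = b"mmap!"
            via_read = os.pread(fd, 5, 0)
            # Store through the fd, read back through the mapping.
            os.pwrite(fd, b"write", 16)
            via_map = bytes(m[16:21])
        return via_read == b"mmap!" and via_map == b"write"
    finally:
        os.close(fd)
        os.unlink(path)
```

With a single copy of the data in the page cache, both paths touch the
same memory and the function should return True.  A filesystem with its
own cache has to do extra work to give the same answer.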

You can get around it, I know, because I wrote the coherency code
for SGI's IRIX when I did the bulk data server that went around the page
cache and made NFS go at 60MB/sec for a single stream (and many times
that for multiple streams).  So I'm not talking out of my ass, I know
what coherency means when you have the same data in two different places,
I know it is possible to make it work, I've done that, and I don't think
it is a good idea (it was OK in SGI's case, it was for O_DIRECT which
exists to completely bypass the page cache; so a special case that wasn't
so bad and wasn't general).

It's messy.  You could remove the data from the ZFS cache when you put
it in the page cache but ZFS is compressed so it's not going to line up on
page boundaries like you'd want it to.  That means you're removing more
than your page, which sort of sucks.
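
A rough sense of "more than your page", as arithmetic.  The numbers are
assumptions: 4K pages and a 128K record (ZFS's default recordsize), with
the whole record treated as the unit that gets evicted.  The function
name is hypothetical.

```python
PAGE_SIZE = 4096           # assumed page size
RECORD_SIZE = 128 * 1024   # assumed record size (ZFS default recordsize)


def pages_lost_on_evict(page_index, page_size=PAGE_SIZE,
                        record_size=RECORD_SIZE):
    """Pages dropped when the record holding page_index is evicted."""
    pages_per_record = record_size // page_size
    first = (page_index // pages_per_record) * pages_per_record
    return range(first, first + pages_per_record)


lost = pages_lost_on_evict(40)
print(len(lost), lost[0], lost[-1])   # prints: 32 32 63
```

Under those assumptions, handing one page to the page cache costs the
other cache 32 pages' worth of data.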

You could never map the pages writable, take a fault every time someone
wants to write the page and then do the write back to the ZFS cache.
That doesn't really work because you take the fault before the write
completes, not after.  You can make it work: the write fault has to
get an exclusive lock on the data in the ZFS cache, then return, then
the page gets modified, and now someone has to wake up and copy that data
from the page cache to ZFS.  It's messy and it performs really poorly;
nobody would do it this way.

You could lock the data in the ZFS cache, making it read only.  That
doesn't work because you can write via mmap() and read via ZFS and you
get old data.
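
The stale read is easy to see in a toy model.  Nothing here is ZFS or
kernel code: TwoCaches and its methods are hypothetical, and two
bytearrays stand in for the two caches holding the same block.

```python
class TwoCaches:
    """Toy model: one block held by a filesystem cache and the page cache."""

    def __init__(self, data):
        self.fs_cache = bytearray(data)   # stands in for the ZFS cache
        self.page_cache = None            # filled on first mapped access

    def map_page(self):
        # A fault on the mapping copies the block into the page cache.
        if self.page_cache is None:
            self.page_cache = bytearray(self.fs_cache)
        return self.page_cache

    def read(self, n):
        # read(2) is served from the filesystem's own cache.
        return bytes(self.fs_cache[:n])


c = TwoCaches(b"old data")
c.map_page()[0:3] = b"NEW"   # write through the mapping
stale = c.read(8)            # read through the filesystem cache
```

After the mapped write, the page cache holds b"NEW data" while read()
still returns b"old data".  Every fix amounts to code that copies or
invalidates between the two at the right moments.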

All of these sorts of problems, which are solvable (I've solved them,
Sun solved them), are why you don't really want what ZFS did.  It's a
never-ending case of whack-a-mole as the code evolves and someone slips in
something that makes the page cache and the ZFS cache incoherent again.
There isn't a pleasant way to make this stuff work; that's exactly
why Sun made everything live in the page cache, so there was only one
copy of any chunk of data.

Which makes it baffling to me that Sun would allow ZFS into the kernel
but I guess the benefits were perceived to outweigh the ongoing work to
make the caches coherent.  Personally, I think just doing it right is
way easier.

--lm
