9fans - fans of the OS Plan 9 from Bell Labs
From: "andrey mirtchovski" <mirtchovski@gmail.com>
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@9fans.net>
Subject: Re: [9fans] Changelogs & Patches?
Date: Tue,  6 Jan 2009 07:22:11 -0700	[thread overview]
Message-ID: <14ec7b180901060622q2a179705g53c7fe62e70ee90a@mail.gmail.com> (raw)
In-Reply-To: <BDF19F86-DF67-4B82-9EF1-16C4F46B5B28@sun.com>

i'm using zfs right now for a project storing a few terabytes worth of
data and vm images. i have two zfs servers and about 10 pools of
different sizes, with several hundred zfs filesystems and volumes of
raw disk exported via iscsi. clones play a vital part in the whole
setup (they number in the thousands). for what it's worth, zfs is the
best thing in linux-world (sorry, solaris and *bsd too) for that kind
of task. my comment is that, coming from fossil/venti, zfs feels just
a bit more convoluted, with more special cases that look like design
mishaps compared to what i'm used to.

i'll try to explain:

> Fair enough. But YourTextGoesHere then becomes a transient property
> of my namespace, where in case of ZFS it is truly a tag for a snapshot.

all snapshots have tags: their top-level sha1 score. what i supplied
was simply a way to translate that score into arbitrary text. you
don't have to do this. (by the way, do you get the irony of forcing
snapshots to contain the '@' character in their name? sounds a lot
like '#' to me ;))

snapshots are generally accessible via fossil as a directory with the
date of the snapshot as its name. this starts making more sense once
you take into account that snapshots are global per fossil, but you
can run several fossils without having them step on each other's toes
when it comes to venti. at least until you get a collision in block
hashes.

in fact, i'm so used to fossil's dated snapshots that in my setup i
have restricted 'YourTextGoesHere' to actually be a date. that gives
me much more context when something goes wrong and i have to walk back
through the snapshots of a filesystem or a volume to find the last
known good one.
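concretely, the date-restricted naming amounts to something like this
(the pool/fs name is made up, and the zfs command is only echoed, so
nothing here touches a real pool):

```shell
# hypothetical fs name; the zfs invocation is echoed, not executed
fs="tank/projects/foo"
snap="$fs@$(date +%Y-%m-%d-%H%M%S)"
echo zfs snapshot "$snap"
```

sorting such names lexically sorts them chronologically, which is
exactly what you want when hunting for the last known good one.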

> Well, strictly speaking Solaris does have a reasonable approximation
> of bind in a form of lofs -- so remapping default ZFS mount point to
> something else is not a big deal.

did not know that

>
>>>  $ zfs clone pool/projects/foo@YourTextGoesHere pool/projects/branch
>>
>> that's as simple as starting a new fossil with -f 'somehex', where
>> "somehex" is the score of the corresponding snap.
>>
>> this gives you both read-only snapshots,
>
> Meaning?

venti is write-once. if you instantiate a fossil from a venti score it
is, by definition, read-only: changes made to the current fossil will
not appear in another fossil instantiated from the same venti score.
changes are committed to venti once you do a fossil snap, but that
automatically generates a new snapshot score rather than modifying the
old one. it should be clear from the paper.
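a toy way to see the write-once property (this is just sha1 over some
bytes, not venti's actual protocol): the same data always hashes to
the same score, and changed data gets a fresh score, so an old root
score can never be modified out from under you.

```shell
# identical content -> identical score; new content -> brand new score
score1=$(printf 'root block, version 1' | sha1sum | awk '{print $1}')
score2=$(printf 'root block, version 1' | sha1sum | awk '{print $1}')
score3=$(printf 'root block, version 2' | sha1sum | awk '{print $1}')
echo "$score1 $score3"
```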

>> - snapshots are read only and generally unmountable (unless you go
>> through the effort of making them so by setting a special option,
>> which i'm not sure is per-snapshot)
>
> Huh? That's weird -- I routinely access them via
>     /<pool>/<fs>/.zfs/snapshot/<snapshot name>
> and I don't remember setting any kind of options. The visibility
> of .zfs can be tweaked, but all it really affects is Tab in bash ;-)
>
>> - clones can only be created off of snapshots
>
> But that does sound reasonable. What else there is except snapshots
> and an active tree? Or are you objecting to the extra step that is
> needed where you really want to clone the active tree?

i have .zfs exports turned off (it's off by default) because the
read-only snapshots are useless in my environment. instead i must
create clones off one or more snapshots, keep track of them, and
delete them when their tasks are done.

this is an example of the design difference between fossil/venti and
zfs: venti commits storage permanently and everything becomes a
snapshot, while the designers of zfs chose a two-stage process,
introducing a read-only intermediary between the original data and
read-write access to it that is independent of other clients.
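the two-stage dance looks like this (dataset names hypothetical;
commands are echoed rather than executed so this runs anywhere):

```shell
fs="tank/projects/foo"
echo zfs snapshot "$fs@2009-01-06"                    # 1. read-only intermediary
echo zfs clone "$fs@2009-01-06" tank/projects/branch  # 2. read-write clone
echo zfs destroy tank/projects/branch                 # 3. cleanup when the task is done
```

with fossil/venti, step 1 is the only step: the snap itself is the
thing you mount.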

where the second choice becomes a nuisance for me is when one has
thousands of clones and needs to keep track of thousands of names to
ensure that when the right task finishes, the right clone disappears.
it's good that zfs can handle so many, otherwise it would've been
useless.

note that other systems take the plan9 approach to heart: qemu, for
example, has the -snapshot argument, which lets me boot many VMs,
fossil-style, off a single vm image without worrying whether they'll
step on each other's toes. that way seems much simpler and more
natural to me, but then i'm jaded by venti :)
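(the image path below is made up and the commands are echoed, not run.
with -snapshot qemu sends all writes to a throwaway temp file, so any
number of guests can share one base image:)

```shell
# several VMs off a single base image, fossil-style (echoed, not executed)
img="/vm/base.img"
for n in 1 2 3; do
    echo qemu -snapshot -hda "$img" -name "vm$n"
done
```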

>> - clones are read-writable but they can only be mounted within the
>> /pool/fs/branch hierarchy. if you want to share them you need to
>> explicitly adjust a lot of zfs settings such as 'sharenfs' and so on;
>
> In general -- this is true :-( But I think there's a way now to do that.
> If you're really interested -- I can take a look and let you know.

my problem is with the local/remote duality of exports: if i create a
zfs cloned filesystem it's immediately available locally, and perhaps
(via 'sharenfs' inheritance from its parent) i can mount it via nfs
from a remote node. if i create a zfs cloned volume i need to arrange
iscsi access from a remote node. both nfs and iscsi have a host of
nasty settings that must be correct on both ends for things to work
right. i can never hope to export an nfs share outside my DMZ.
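to give a flavour of the knobs involved (dataset names hypothetical,
commands echoed rather than run; 'shareiscsi' is a property of the
solaris zfs of this vintage, i'm not sure about other ports):

```shell
echo zfs set sharenfs=on tank/projects/branch   # nfs export for a cloned filesystem
echo zfs set shareiscsi=on tank/vols/branchvol  # iscsi target for a cloned volume
```

and each of those pulls in its own pile of client-side configuration
before a remote node actually sees anything.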

i don't see a solution to this problem: the unix world is committed to
nfs and a bit less so to iscsi. i'm more of a 9p guy myself though, so
i listed it as a complaint.

>> - none of this can be done remotely
>
> Meaning?

from machine X in the datacentre i want to be able to say "please
create me a clone of the latest snapshot of this filesystem" without
having to ssh to the solaris node running zfs.

>> - libzfs has an unpublished interface, so if one wants to, say, write
>> a 9p server to expose zfs functionality to remote hosts they must
>> either reverse engineer libzfs or use other means.
>
>
> This one is a bit unfair. The interface is published alright. As much
> as anything in Open Source is. It is also documented at the level
> that would be considered reasonable for Linux. The fact that
> it is not *stable* makes the usual thorough Solaris documentation
> lacking.
>
> But all in all, following along doesn't require much more extra
> effort compared to following along any other evolving OS
> project.

i wanted to write a filesystem exporting zfs command functionality to
nodes within a datacentre (create/modify/delete/list
filesystems/volumes/snapshots). i looked for documentation on linking
against libzfs and couldn't find any. i couldn't find the source for
libzfs either without registering on the opensolaris developers'
site.

instead of reverse engineering a library i don't have much faith in, i
wrote a python 9p server that uses the local zfs/zpool commands to do
what i could've done with C and libzfs. it's a hack but it gets the
job done. now i can access block X of zfs volume Y remotely via 9p (at
one third the speed, to be fair).
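the 'local commands' part is nothing fancy: the server just shells out
to zfs and parses its machine-readable output. a minimal sketch, with
the zfs output canned so it runs without zfs installed (the real
invocation would be 'zfs list -H -t snapshot -o name'):

```shell
# canned 'zfs list -H -t snapshot -o name' output, one snapshot per line
sample='tank/projects/foo@2009-01-05
tank/projects/foo@2009-01-06'
# dated names sort lexically in chronological order,
# so the last line after sorting is the latest snapshot
latest=$(printf '%s\n' "$sample" | sort | tail -1)
echo "$latest"
```

this is also why restricting snapshot names to dates pays off: "give
me the latest" becomes a sort instead of a lookup table.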

and i think i'm using a pretty new version of zfs and my experiences
are, in fact, quite recent :)

i would be glad to help you understand the differences between zfs and
fossil/venti with my limited knowledge of both.

cheers: andrey

nb: please don't take this as a wholesale criticism of zfs. as stated
earlier, it is quite a sane system to work with. my gripes only appear
when one compares it to the "fossil/venti experience".



