Date: Mon, 19 Jan 2009 22:48:08 -0800
From: Roman Shaposhnik
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] Changelogs & Patches?

Hi Andrey!

Sorry, it took me longer to dig through the code than I had hoped.
So, if you're still game...

On Jan 6, 2009, at 6:22 AM, andrey mirtchovski wrote:

> i'm using zfs right now for a project storing a few terabytes worth
> of data and vm images.

Was it that way from the get-go, or did you use venti-based solutions
before?

> i have two zfs servers and about 10 pools of different sizes with
> several hundred different zfs filesystems and volumes of raw disk
> exported via iscsi.

What kind of clients are on the other side of iscsi?

> clones play a vital part in the whole set up (they number in the
> thousands).
> for what it's worth, zfs is the best thing in linux-world (sorry,
> solaris and *bsd too)

You're using it on Linux?

>> Fair enough. But YourTextGoesHere then becomes a transient property
>> of my namespace, whereas in the case of ZFS it is truly a tag for a
>> snapshot.
>
> all snapshots have tags: their top-level sha1 score. what i supplied
> was simply a way to translate that to any random text. you don't
> need to, nor do you have to do this (by the way, do you get the
> irony of forcing snapshots to contain the '@' character in their
> name? sounds a lot like '#' to me ;)

Ok, fair enough. I think I'm convinced on that point.

> snapshots are generally accessible via fossil as a directory with
> the date of the snapshot as its name. this starts making more sense
> when you take into consideration that snapshots are global per
> fossil, but then you can run several fossils without having them
> step on their toes when it comes to venti. at least until you get a
> collision in blocks' hashes.

Aha! And here are my first questions. You say that I can run multiple
fossils off of the same venti and thus have a setup that is very
close to zfs clones:

1. How do you do that, exactly? fossil -f doesn't work for me (nor
   should it, according to the docs).
2. How do you work around the fact that each fossil needs its own
   partition (unlike ZFS, where all the clones can share the same
   pool of blocks)?

> venti is write-once. if you instantiate a fossil from a venti score
> it is, by definition, read-only, as all changes to the current
> fossil will not appear to another fossil instantiated from the same
> venti score. changes are committed to venti once you do a fossil
> snap, however that automatically generates a new snapshot score (not
> modifying the old one). it should be clear from the paper.

I think I understand it now (except for the fossil -f part), but how
do you promote (zfs promote) such a clone?
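Just to make sure we mean the same operation, here is the zfs-side
lifecycle I have in mind (the pool and dataset names are made up,
obviously):

    # take a read-only snapshot, then spin a writable clone off of it
    zfs snapshot tank/images@golden
    zfs clone tank/images@golden tank/scratch

    # promote swaps the parent/child relationship: the clone takes
    # ownership of the snapshot it was created from, so the original
    # filesystem can later be destroyed if need be
    zfs promote tank/scratch

It's that last step that I can't map onto fossil/venti.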
>> where the second choice becomes a nuisance for me is in the case where
> one has thousands of clones and needs to keep track of thousands of
> names in order to ensure that when the right one has finished the
> right clone disappears.

I see what you mean, but in the case of venti nothing really
disappears. From that perspective you can sort of let those zfs
clones linger -- the storage consumption won't be any different,
right?

>>> - none of this can be done remotely
>>
>> Meaning?
>
> from machine X in the datacentre i want to be able to say "please
> create me a clone of the latest snapshot of this filesystem" without
> having to ssh to the solaris node running zfs.

Well, if it's the protocol you don't like, writing your own daemon
that responds to such requests sounds like a trivial task to me (see
the P.S. below for the kind of thing I have in mind).

> i couldn't find the source for libzfs either, without having to
> register to the opensolaris developers' site. [...]
> and i think i'm using a pretty new version of zfs and my experiences
> are, in fact, quite recent :)

Well, the fact that you had to register in order to access the code
suggests a pretty dated experience ;-)

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libzfs/

> instead of reverse engineering a library that i have not much faith
> in, i wrote a python 9p server that uses local zfs/zpool commands to
> do what i could've done with C and libzfs. it's a hack but it gets
> the job done. now i can access block X of zfs volume Y remotely via
> 9p (at one third the speed, to be fair).

Well, Solaris desperately wanted to enter the Open Source geekdom,
and from your experience it seems like it was a success ;-)

Seriously though, I personally found reading the source code of zdb
absolutely illuminating on all sorts of ZFS internals:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/cmd/zdb/zdb.c

But yes -- just like with any unruly OS project, you have to really
invest your time if you want to tag along. I think it was Russ who
commented that Free Software is only free if your time has no value
:-(

> i would be glad to help you understand the differences between zfs
> and fossil/venti with my limited knowledge of both.

Great! I tried to do as much homework as possible (hence the delay),
but I still have some questions left:

0. A dumb one: what's the proper way of cleanly shutting down fossil
   and venti?
1. What's the use of copying arenas to CD/DVD? Is it purely a backup,
   since they have to stay on-line forever?
2. Would fossil/venti notice silent data corruption in blocks?
3. Do you think it's a good idea to have volume management be part of
   the filesystem, since that way you can try to heal the data on the
   fly?
4. If I have a venti server and a bunch of sha1 scores, can I somehow
   instantiate a single fossil serving all of them under /archive?

Thanks,
Roman.
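P.S. Re: the remote cloning daemon -- hooked into inetd on the
solaris box, even a shell script along these lines would do (the
one-line "clone <filesystem>" request format and the naming scheme
are made up, naturally):

    #!/bin/sh
    # inetd-style handler: reads one "clone <filesystem>" request from
    # stdin and clones the most recent snapshot of that filesystem
    read cmd fs
    case "$cmd" in
    clone)
        # -r also picks up descendants' snapshots; close enough for a sketch
        snap=`zfs list -H -r -t snapshot -o name -s creation "$fs" | tail -1`
        clone="$fs-clone-`date +%s`"
        zfs clone "$snap" "$clone" && echo "$clone"
        ;;
    *)
        echo "unknown request: $cmd"
        ;;
    esac

From machine X it's then just a matter of poking the right port -- no
ssh involved. Your 9p server is obviously the nicer interface for the
same idea.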