* [9fans] Fossil+Venti on Linux

From: Enrico Weigelt
Date: 2008-05-25 7:56 UTC
To: Fans of the OS Plan 9 from Bell Labs

Hi folks,

is anyone already running venti+fossil on Linux? I've got several machines
where I need a remote backup (via a DSL link) with a few days of live
snapshots. Ideally the backup should be directly accessible, so that when
one machine goes down for a while, the backup machine can take over. The
ventis could even be clustered (each machine feeding the others), so the
remaining backup work would just be on metadata, right?

cu
--
Enrico Weigelt == metux IT service - http://www.metux.de/
Please visit the OpenSource QM Taskforce:
http://wiki.metux.de/public/OpenSource_QM_Taskforce
Patches / fixes for dozens of packages in dozens of versions:
http://patches.metux.de/
* Re: [9fans] Fossil+Venti on Linux

From: a@9srv.net
Date: 2008-05-25 14:59 UTC
To: weigelt, 9fans

I'm not running Linux, but I've run venti+fossil on Mac OS X for testing.
I intend to use venti there regularly once I figure out how to get OS X
to let me at a raw partition that isn't mounted (anyone?).

I don't think venti+fossil will do what you're looking for, however, at
least not without some additional machinery. Fossil doesn't do any
replication or fail-over: it talks to zero or one ventis. Venti doesn't
automatically replicate anything either, although that's pretty easy to
script if you're willing to accept the exposure of a cron job. It's true
you could run multiple fossils off one venti, but they'll be logically
distinct (just getting the block-aggregation benefits of sharing a venti
backing store).

I believe the Plan B folks did some work with fail-over (amongst other
things) that might be applicable. Beyond that, if you want to get what
you want from venti+fossil, you'll need to inject a filter in front of
one of those two to do the fail-over (and handle all the fun of tracking
writes and propagating them when the server comes back, and so on).

If you're looking to back up *existing* Linux boxes, then fossil might
not be what you want anyway. Take a look at vbackup(8) and friends (I'm
trying to convince it I'm on an HFS+ partition). You'll have to figure
out the correct procedures for your site, but the examples are pretty
useful. Still no automatic fail-over, but a cron job could probably get
you replication.

Anthony
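[Anthony's point that venti-to-venti replication "is pretty easy to
script" from a cron job boils down to a content-addressed sync: only
blocks whose scores the mirror lacks ever need to be sent. A minimal
sketch of that idea in Python -- the dicts stand in for real venti
servers, and none of this is the actual venti protocol:]

```python
import hashlib

def score(block: bytes) -> str:
    """Venti addresses each block by the SHA-1 hash of its contents
    (its 'score'); two identical blocks always share one score."""
    return hashlib.sha1(block).hexdigest()

def replicate(primary: dict, mirror: dict) -> int:
    """Copy every block the mirror is missing; return how many were sent.

    Because the store is content-addressed and blocks are immutable,
    the sync is idempotent: re-running it from cron sends nothing new
    and can never overwrite existing data.
    """
    sent = 0
    for s, block in primary.items():
        if s not in mirror:
            mirror[s] = block
            sent += 1
    return sent

# example: two syncs, the second finds nothing to do
primary = {score(b): b for b in (b"hello", b"world")}
mirror = {}
replicate(primary, mirror)   # sends 2 blocks
replicate(primary, mirror)   # sends 0 blocks
```

[The idempotence is what makes the "exposure of a cron job" tolerable:
a crashed or repeated run costs bandwidth, never correctness.]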
* Re: [9fans] Fossil+Venti on Linux

From: Francisco J Ballesteros
Date: 2008-05-25 15:48 UTC
To: Fans of the OS Plan 9 from Bell Labs

> I believe the Plan B folks did some work with fail-over (amongst other
> things) that might be applicable.

You could adapt Plan B's bns to fail over between different FSs. But...
we learned that although you can let the FS fail over nicely, many other
things stand in the way, making the fail-over of little use on its own.
For example, on Plan 9, cs and dns have problems after a fail-over, your
IP address may change, etc. All of which is to say that even once you
have tolerance to FS failures, you still face other things that do not
fail over.

To tolerate failures, what we do is run venti on a RAID. If fossil gets
corrupted somehow, we just reformat the partition using the last vac. To
survive crashes of the machine with the venti, we copy its arenas to
another machine, also kept on a RAID.

If you want clients to stay up during server crashes, you could use
either bns or recover to pretend the FS is still there (blocked, but
there) while you reboot (or replace) it.

hth
* Re: [9fans] Fossil+Venti on Linux

From: erik quanstrom
Date: 2008-05-25 20:24 UTC
To: 9fans

> To tolerate failures what we do is to run venti on a raid. If fossil
> gets corrupted somehow we'd just format the partition using the last
> vac. To survive crashes of the machine with the venti we copy its
> arenas to another machine, also kept on a raid.

forgive a bit of off-topicness. this is about ken's filesystem, not
venti or fossil.

the coraid fs maintains its cache on a local AoE-based raid10 and it
automatically mirrors its worm on two AoE-based raid5 targets. the
secondary worm target is in a separate building with a backup fs. since
reads always start with the first target, the slow offsite link is not
noticed. (we frequently exceed the bandwidth of the backup link -- now
100Mbps -- to the cache, so replicating the cache would be impractical.)

we can sustain the loss of a disk drive with only a small and temporary
performance hit. the storage targets may be rebooted with a small pause
in service. more severe machine failures can be recovered with varying
degrees of pain. only if both raid targets were lost simultaneously
would more than 24hrs of data be lost.

we don't do any failover. we try to keep the fs up instead. we have had
two unplanned fs outages in 2 years. one was due to a corrupt sector
leading to a bad tag. the other was a network problem due to an
electrical storm that could have been avoided if i'd been on the ball.

the "diskless fileserver" paper from iwp9 has the gory details.

- erik
* Re: [9fans] Fossil+Venti on Linux

From: Enrico Weigelt
Date: 2008-05-26 12:58 UTC
To: Fans of the OS Plan 9 from Bell Labs

* a@9srv.net <a@9srv.net> wrote:

> I don't think venti+fossil will do what you're looking for, however,
> at least not without some additional machinery. Fossil doesn't do any
> replication or fail-over: it must talk to zero or one ventis. Venti
> doesn't automatically replicate anything, either, although that's
> pretty easy to script if you're willing to accept the exposure of a
> cron job.

I intend to code a special cluster-venti, which automatically propagates
new blocks to its peers and asks around when it can't find some block.
This would be a somewhat lazy form of replication (depending on how long
propagation takes). As long as the critical data (metadata, ...) are
distributed fast enough, or otherwise backed up properly, so that at
least a reasonably recent snapshot can always be retrieved, this IMHO
would make it much easier/faster to get services up on another machine.
It doesn't need to be a true fail-over fs; I just want to start from the
last snapshot quickly (as I now would do with a tar'ed backup).

My first exercise will be a little code that dumps the new blocks out to
some directory, plus a separate daemon that sends them out to the peer
ventis. Then I only have to back up fossil's local metadata. Since the
machines tend to hold a lot of identical data, much space and traffic
can be saved.

As a more sophisticated approach, I'm planning a *real* clustered venti,
which also keeps track of block atimes and copy counters. This way,
seldom-used blocks can be removed from one node as long as there are
still enough copies in the cluster. (This probably requires a redesign
of the log architecture.)

This still isn't a replicated/distributed fs, but clustered block
storage, maybe even a basis for a truly replicated fs. BTW: with a bit
more logic, we could even build something like Amazon's S3 on that ;-)

Actually, I don't need a replicated fs at all, just a space- and
traffic-efficient backup mechanism which shares data with the local
fs'es.

cu
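[The spool-and-propagate scheme above can be sketched as a toy model.
All class and method names here are hypothetical -- this is an
illustration of the idea, not the real venti protocol or Enrico's
actual code:]

```python
import hashlib

class ClusterVenti:
    """Toy model of the proposed cluster-venti: writes are spooled and
    lazily propagated to peers; a read that misses locally asks around."""

    def __init__(self):
        self.blocks = {}   # score -> block: stand-in for the local arenas
        self.spool = []    # scores written but not yet propagated
        self.peers = []    # other ClusterVenti nodes

    def write(self, block):
        s = hashlib.sha1(block).hexdigest()
        if s not in self.blocks:
            self.blocks[s] = block
            self.spool.append(s)     # the daemon will ship this out later
        return s

    def read(self, s):
        if s in self.blocks:
            return self.blocks[s]
        for p in self.peers:         # lazy replication: ask the peers
            if s in p.blocks:
                return p.blocks[s]
        return None

    def propagate(self):
        """The separate daemon: push every spooled block to every peer."""
        for s in self.spool:
            for p in self.peers:
                p.blocks.setdefault(s, self.blocks[s])
        self.spool.clear()
```

[The window between write() and propagate() is exactly the "lazy
replication" exposure Enrico mentions: until the daemon runs, a block
exists on only one node, so metadata needs a faster path or a separate
backup.]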
* Re: [9fans] Fossil+Venti on Linux

From: erik quanstrom
Date: 2008-05-26 14:01 UTC
To: weigelt, 9fans

> As a more sophisticated approach, I'm planning a *real* clustered
> venti, which also keeps track of block atimes and copy counters.
> This way, seldom-used blocks can be removed from one node as long
> as there are still enough copies in the cluster. (This probably
> requires a redesign of the log architecture.)

one of venti's design goals was to structure the arenas so that filled
arenas are immutable. this is important for recoverability. if you know
the arena was filled and thus has not changed, any backup will do. put
simply, venti trades the ability to delete for reliability. since
storage is very cheap, i think this is a good tradeoff.

> This still isn't a replicated/distributed fs, but clustered block
> storage, maybe even a basis for a truly replicated fs. BTW: with a
> bit more logic, we could even build something like Amazon's S3 on
> that ;-)

what problem are you trying to solve? if you are trying to go for
reliability, i would think it would be easier to use raid+backups for
data stability. using a ups will do wonders for uptime.

if you're going to use a distributed block storage device to build a
distributed fs, then either your fs can't do any caching at all or
there needs to be a full cache-coherency protocol between the fs and
the block storage. consider this case: two fs want to add different
files to the same directory "at the same time". i don't see how block
storage can help you with any of the problems that arise from this
case.

- erik
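[erik's "filled arenas are immutable" point can be illustrated with a
toy append-only arena. This is a hypothetical model of the invariant,
not venti's actual on-disk format:]

```python
class Arena:
    """Toy append-only arena: once filled, it seals itself and never
    changes again, so any backup of a sealed arena stays valid forever --
    the recoverability property erik describes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.log = []
        self.sealed = False

    def append(self, block):
        if self.sealed:
            raise PermissionError("sealed: filled arenas are immutable")
        self.log.append(block)
        if len(self.log) >= self.capacity:
            self.sealed = True    # from here on, strictly read-only

    def delete(self, index):
        # the trade erik names: no deletion, in exchange for reliability
        raise PermissionError("venti-style storage never deletes blocks")
```

[A clustered venti that evicts seldom-used blocks, as proposed above,
breaks exactly this invariant -- which is why it would need the log
redesign Enrico anticipates.]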
* Re: [9fans] Fossil+Venti on Linux

From: Enrico Weigelt
Date: 2008-05-29 9:12 UTC
To: 9fans

* erik quanstrom <quanstro@quanstro.net> wrote:

> one of venti's design goals was to structure the arenas so that
> filled arenas are immutable. this is important for recoverability.
> if you know the arena was filled and thus has not changed, any
> backup will do. put simply, venti trades the ability to delete for
> reliability.

Right, my approach would be a paradigm change. But my venti-2 would be
used for completely different things: distributed data storage instead
of an eternal log ;P

> since storage is very cheap, i think this is a good tradeoff.

I'm thinking of a scale where storage isn't that cheap ...

> what problem are you trying to solve? if you are trying to go for
> reliability, i would think it would be easier to use raid+backups
> for data stability.

Easier, yes, but more expensive (at least the iron).

> consider this case. two fs want to add different files to the same
> directory "at the same time". i don't see how block storage can
> help you with any of the problems that arise from this case.

It shouldn't, just as a RAID can't help a local fs with multiple users
adding files to the same directory. In my concept, the distribution of
the block storage has nothing to do with the (eventual) distribution of
the fs. My venti-2 will be like a SAN, just with content addressing :)
So, instead of a SAN or a local RAID, you can simply use a venti-2
cloud. The venti clients (e.g. fossil, vac, ...) don't need any
knowledge of this fact.

A venti-based distributed filesystem is a completely different issue.
All nodes would store their (payload) data in one venti (-cloud). Of
course the nodes have to coordinate their actions (through a separate
channel), but this is only required for metadata, not payload.
Data-cache coherency isn't an issue anymore, since a data block itself
cannot change - only a file's data pointers, which belong to the
metadata, will change.

For example, if only one node can write to a file (and writes don't
have to appear simultaneously to others reading the same file, aka
transaction methodology ;-)), single files could be stored via vac, and
the fs cluster only has to manage directories. The directory server(s)
then manage the permissions and directory updates. Each commit of a new
file or file change triggers a directory update. This can be done
transactionally via an RDBMS.

The fine thing about this concept is that the venti cloud could even be
built of hosts which aren't completely trusted (as long as the data
itself is properly encrypted) - as long as there are enough copies and
you've got enough peerings in the cloud, single nodes can't harm your
data.
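[One way to get the "untrusted hosts" property without losing
content-addressed deduplication is convergent encryption: derive the
block key from the block's own hash, so identical plaintexts still
yield identical ciphertexts. A toy sketch -- the XOR keystream stands
in for a real cipher, the copy-count policy is hypothetical, and
convergent encryption famously leaks plaintext equality to the
storage nodes:]

```python
import hashlib

def convergent_encrypt(block):
    """Encrypt with a key derived from the plaintext's own hash, so the
    dedup property survives encryption. Illustrative only: the XOR
    keystream below is NOT a real cipher."""
    key = hashlib.sha256(block).digest()   # only data owners can derive it
    stream = b""
    counter = 0
    while len(stream) < len(block):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    cipher = bytes(p ^ s for p, s in zip(block, stream))
    score = hashlib.sha1(cipher).hexdigest()  # all the untrusted node sees
    return score, cipher

def can_evict(score, nodes, min_copies=2):
    """A node may drop a block only while enough other copies survive --
    the copy-counter policy sketched above (threshold is made up)."""
    return sum(1 for n in nodes if score in n) > min_copies
```

[The design choice worth noting: because the key comes from the content,
two nodes that independently store the same file converge on one
ciphertext and one score, so the cloud still deduplicates -- at the
price that anyone can test whether you store a *known* plaintext.]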
cu
--
Enrico Weigelt, metux IT service -- http://www.metux.de/
cellphone: +49 174 7066481  email: info@metux.de  skype: nekrad666
Embedded-Linux / Portierung / Opensource-QM / Verteilte Systeme
* Re: [9fans] Fossil+Venti on Linux

From: Christian Kellermann
Date: 2008-05-29 9:27 UTC
To: weigelt, Fans of the OS Plan 9 from Bell Labs

IIRC Russ et al. have written a paper on connecting a venti server to a
distributed hash table (like Chord). I think the words to google for
would be venti and dhash.

http://project-iris.net/isw-2003/papers/sit.pdf

HTH

Christian
--
You may use my gpg key for replies:
pub 1024D/47F79788 2005/02/02 Christian Kellermann (C-Keen)
* Re: [9fans] Fossil+Venti on Linux

From: Enrico Weigelt
Date: 2008-05-29 12:17 UTC
To: Fans of the OS Plan 9 from Bell Labs

* Christian Kellermann <Christian.Kellermann@nefkom.net> wrote:

> IIRC Russ et al. have written a paper on connecting a venti server
> to a distributed hash table (like Chord).
>
> http://project-iris.net/isw-2003/papers/sit.pdf

Sounds very interesting. Is there any source code available?

cu
* Re: [9fans] Fossil+Venti on Linux

From: Russ Cox
Date: 2008-05-29 13:51 UTC
To: weigelt, 9fans

>> http://project-iris.net/isw-2003/papers/sit.pdf
>
> Sounds very interesting. Is there any source code available?

Most of what is described in that paper is now libventi, vbackup, and
vnfs. There was some notion that it would be interesting to try storing
data in a peer-to-peer storage system, but when push came to shove we
just set up a well-equipped Venti server for our own backups. It's got
15TB of raw storage providing about 7TB of venti arenas (mirrored).

The only unreleased piece is a tiny protocol translator I wrote to
convert between the Venti protocol and the DHash protocol. DHash was and
still is a research prototype. You don't want to trust your data to it.

Russ
* Re: [9fans] Fossil+Venti on Linux

From: erik quanstrom
Date: 2008-05-29 12:26 UTC
To: weigelt, 9fans

>> since storage is very cheap, i think this is a good tradeoff.
>
> I'm thinking of a scale where storage isn't that cheap ...

what scale is that?

>> what problem are you trying to solve? if you are trying to go for
>> reliability, i would think it would be easier to use raid+backups
>> for data stability.
>
> Easier, yes, but more expensive (at least the iron).

not sure what you mean by this. suppose i have 10TB to keep in a
redundant fashion. with a two-machine solution, i need 20TB of disk,
since the only sensible way to keep a redundant copy on a second
machine is a full mirror. with a one-machine solution, i don't need any
more disks to have a full mirror, and i have the option of raid5, which
will reduce the number of disks i need to 10TB + 1 disk. since your
model is that the storage is a significant expense, a single raid5
machine would make more sense.

even if you are thinking of an enormous cloud with hundreds of
machines, you could halve the number of machines required by raiding
each node. if cost is an issue, reducing the number of machines is a
benefit. given constant data, fewer machines reduce the obvious --
power, chassis, etc. -- but another important reduction is network
ports. once you outgrow a single 24-port switch, network costs seem to
grow in a super-linear fashion.

- erik
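[erik's disk arithmetic, worked through with hypothetical 1 TB drives
(the drive size is an assumption for illustration; the comparison holds
for any size):]

```python
def disks_mirrored_machines(data_tb, disk_tb=1):
    """Full mirror on a second machine: every terabyte stored twice,
    so a two-machine setup needs double the disks."""
    per_machine = -(-data_tb // disk_tb)   # ceiling division
    return 2 * per_machine

def disks_raid5_single(data_tb, disk_tb=1):
    """Single-machine RAID 5: capacity for the data plus one parity
    disk, regardless of how many data disks there are."""
    return -(-data_tb // disk_tb) + 1

# erik's example: 10TB kept redundantly with 1TB drives gives
# 20 disks for the mirrored pair versus 11 for single-machine raid5.
print(disks_mirrored_machines(10), disks_raid5_single(10))
```

[This is the whole argument in two functions: when storage cost
dominates, the parity scheme nearly halves the disk count -- which is
also what prompts the striping-reliability worry in the next reply.]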
* Re: [9fans] Fossil+Venti on Linux

From: Wes Kussmaul
Date: 2008-05-29 13:33 UTC
To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:

> with a 1-machine solution, i don't need any more disks to have a full
> mirror and i have the option of raid5 which will reduce the number of
> disks i need to 10TB + 1 disk. since your model is that the storage
> is a significant expense, a single raid5 machine would make more
> sense.

Reliance on striping for redundancy frightens a number of us perhaps
uninformed folks. It just seems like too much could go wrong with such
a complex scheme. We sleep better knowing there's a mirrored drive in
another location.

As for cost, we just imagine it's 2003, a gigabyte costs five bucks,
but our astute purchasing skills got storage for less than a tenth of
that...