From: erik quanstrom
To: 9fans@9fans.net
Date: Sun, 25 May 2008 16:24:13 -0400
Subject: Re: [9fans] Fossil+Venti on Linux

> You could adapt Plan B's bns to fail over between different FSs. But...
> We learned that although you can let the FS fail over nicely, many other
> things stand in the way, making it unnecessary to fail over. For example,
> on Plan 9, cs and dns have problems after a failover, your IP address
> may change, etc. All that is to say that when you achieve tolerance
> to FS failures, you still face other things that do not fail over.
>
> To tolerate failures, what we do is to run venti on a raid. If fossil
> gets corrupted somehow, we'd just format the partition using the last
> vac. To survive crashes of the machine with the venti, we copy its
> arenas to another machine, also keeping a raid.

forgive a bit of off-topicness. this is about ken's filesystem, not
venti or fossil.

the coraid fs maintains its cache on a local AoE-based raid10 and it
automatically mirrors its worm on two AoE-based raid5 targets. the
secondary worm target is in a separate building with a backup fs.
since reads always start with the first target, the slow offsite link
is not noticed. (we frequently exceed the bandwidth of the backup
link -- now 100Mbps -- to the cache, so replicating the cache would
be impractical.)

we can sustain the loss of a disk drive with only a small and
temporary performance hit. the storage targets may be rebooted with a
small pause in service. more severe machine failures can be recovered
with varying degrees of pain. only if both raid targets were lost
simultaneously would more than 24 hours of data be lost.

we don't do any failover. we try to keep the fs up instead. we have
had two unplanned fs outages in two years. one was due to a corrupt
sector leading to a bad tag. the other was a network problem due to
an electrical storm that could have been avoided if i'd been on the
ball.

the "diskless fileserver" paper from iwp9 has the gory details.

- erik
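
p.s. a note on "format the partition using the last vac" above: with
fossil that is roughly the following. the device path, venti name, and
score variable are only illustrative -- see fossil(4) for the real
options.

	# reinitialize the fossil partition from the score of the
	# last archival snapshot (taken from the fossil console log)
	venti=yourventi fossil/flfmt -v $score /dev/sdC0/fossil

flfmt -v makes a fresh fossil whose active file system points at that
vac score, so the old tree comes back from venti on demand.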
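
and, for anyone unfamiliar with ken's fs configuration, "mirrors its
worm" looks roughly like this in the fsconfig(8) style, where w0 is
the cache device and the worm is a mirror of w1 and w2, read from w1
first. the device names are invented, not our actual config:

	service fs
	filsys main cw0{w1w2}
	filsys dump o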