To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] Recovering a venti from disk failure
In-reply-to: Your message of "Thu, 19 Apr 2007 08:05:08 EDT."
From: Bakul Shah
Date: Thu, 19 Apr 2007 13:15:57 -0700
Message-Id: <20070419201557.34C6B5B52@mail.bitblocks.com>

> >> If your old venti is intact, I don't see the need to copy it (or is it
> >> on the drive with the damaged fossil and you don't trust the drive?).
> >
> > If your venti disk is similar to the failed disk and of the same
> > vintage, it is also likely to fail and should be replaced
> > before it actually does.  Similarly, replace all disks in a
> > RAID if one of the disks dies (and the death was not in the
> > first few months of its life).
>
> our experience has been this is not a cost-effective way of dealing
> with disk failures as disks do not fail en masse.

Various studies seem to indicate failure rates are highly correlated
with drive model, vintage and manufacturer.  Assuming a RAID is built
from similar disks, when one fails the others are more likely to fail.

Google has a recent paper on disk failures that shows an 8% to 8.6%
annualized failure rate for 2 to 3 year old disks (measured across a
very large population of disks and not accounting for vendor/model
differences).  Things are likely to be worse in a typical home
environment -- dust, heat variation, poor or nonworking surge
protectors, more power cycles, computers that get moved around or
kicked more, poorer quality components, etc.

If you can afford it, replacing your most critical disks every three
years is a good rule of thumb -- the first disk failure is a strong
reminder of that :-)

> i think this correlation gives people the false impression that they do
> fail en masse, but that's really wrong.  the latent errors probably
> happened months ago.

Yes, but if there are many latent errors and/or the error rate is
going up, it is time to replace the disk.

> our solution to this problem (RaidShield™) is to preemptively scan
> disks while the raid is idle and rewrite these blocks.  this either
> (a) corrects the bit rot, (b) causes the drive to remap the sector, or
> (c) notifies the user that there's a real disk problem so it can be
> replaced before a second drive in the array fails.

This is a good idea.  We did this in 1983, back when disks were
simpler beasts.  No RAID then, of course.
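
To make the 8% figure above concrete, here is a quick back-of-the-envelope
calculation (just a sketch: it treats disk failures as independent, which
the vintage/model correlation argues is optimistic, so the real numbers
for an array of similar disks are likely worse):

#include <stdio.h>
#include <math.h>

/* Probability that at least one of n disks fails within a year,
 * given a per-disk annualized failure rate afr, assuming failures
 * are independent. */
static double
pfail(int n, double afr)
{
	return 1.0 - pow(1.0 - afr, n);
}

int
main(void)
{
	double afr = 0.08;	/* ~8% AFR for 2-3 year old disks (Google paper) */
	int n;

	for(n = 1; n <= 8; n *= 2)
		printf("%d disk(s): %.1f%% chance of at least one failure this year\n",
		    n, 100.0*pfail(n, afr));
	return 0;
}

At an 8% AFR a 4-disk array already has roughly a 28% chance of losing
at least one member within a year, which is why replacing a surviving
disk of the same vintage is not as paranoid as it sounds.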
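For anyone who wants to try the idle-scan idea by hand, here is a minimal
single-disk sketch.  This is not Coraid's RaidShield (which presumably
rebuilds a bad block from parity and rewrites it); without redundancy all
this skeleton can do is surface latent read errors early so the disk can
be rewritten from a mirror or backup, or replaced.  The device name and
block size below are placeholders.

/* scrub.c: read a disk sequentially and report unreadable blocks. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum { Blksz = 64*1024 };	/* placeholder scrub granularity */

int
main(int argc, char **argv)
{
	char *buf;
	int fd;
	off_t off;
	ssize_t n;
	long bad;

	if(argc != 2){
		fprintf(stderr, "usage: scrub /dev/diskX\n");
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if(fd < 0){
		fprintf(stderr, "scrub: open %s: %s\n", argv[1], strerror(errno));
		return 1;
	}
	buf = malloc(Blksz);
	if(buf == NULL){
		fprintf(stderr, "scrub: out of memory\n");
		return 1;
	}
	bad = 0;
	for(off = 0; ; off += Blksz){
		n = pread(fd, buf, Blksz, off);
		if(n == 0)
			break;		/* end of device */
		if(n < 0){
			fprintf(stderr, "scrub: read error at offset %lld: %s\n",
			    (long long)off, strerror(errno));
			bad++;		/* skip over the bad block and keep going */
		}
	}
	printf("scrub done: %ld unreadable block(s)\n", bad);
	free(buf);
	close(fd);
	return 0;
}

Run it against an idle disk; any reported offsets are candidates for
rewriting before a second drive in the array fails.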