To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] Recovering a venti from disk failure
In-reply-to: Your message of "Thu, 19 Apr 2007 08:05:08 EDT."
From: Bakul Shah
Date: Thu, 19 Apr 2007 13:15:57 -0700
Message-Id: <20070419201557.34C6B5B52@mail.bitblocks.com>

> >> If your old venti is intact, I don't see the need to copy it (or is it
> >> on the drive with the damaged fossil and you don't trust the drive?).
> >
> > If your venti disk is similar to the failed disk and of the same
> > vintage, it is also likely to fail and should be replaced
> > before it actually does.  Similarly, replace all disks in a
> > RAID if one of the disks dies (and the death was not in the
> > first few months of its life).
>
> our experience has been this is not a cost-effective way of dealing
> with disk failures as disks do not fail en masse.

Various studies seem to indicate failure rates are highly correlated
with drive model, vintage and manufacturer.  Assuming a RAID is built
from similar disks, when one fails the others are more likely to fail.

Google has a recent paper on disk failures that shows an 8% to 8.6%
annualized failure rate for 2 to 3 year old disks (measured across a
very large population of disks and not accounting for vendor/model
differences).  Things are likely to be worse in a typical home
environment -- dust, heat variation, poor or nonworking surge
protectors, more power cycles, computers that get moved around or
kicked more, poorer quality components, etc.

If you can afford it, replacing your most critical disks every three
years is a good rule of thumb -- the first disk failure is a strong
reminder of that :-)

> i think this correlation gives people the false impression that they do
> fail en masse, but that's really wrong.  the latent errors probably
> happened months ago.

Yes, but if there are many latent errors and/or the error rate is
going up, it is time to replace the disk.

> our solution to this problem (RaidShield™) is to preemptively scan
> disks while the raid is idle and rewrite these blocks.  this either
> (a) corrects the bit rot, (b) causes the drive to remap the sector, or
> (c) notifies the user that there's a real disk problem so it can be
> replaced before a second drive in the array fails.

This is a good idea.  We did this in 1983, back when disks were
simpler beasts.  No RAID then, of course.
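
To make the 8% figure above concrete, here is a quick back-of-the-envelope
calculation (just a sketch: it treats disk failures as independent, which
the vintage/model correlation argues is optimistic, so the real numbers
for an array of similar disks are likely worse):

#include <stdio.h>
#include <math.h>

/* Probability that at least one of n disks fails within a year,
 * given a per-disk annualized failure rate afr, assuming failures
 * are independent. */
static double
pfail(int n, double afr)
{
	return 1.0 - pow(1.0 - afr, n);
}

int
main(void)
{
	double afr = 0.08;	/* ~8% AFR for 2-3 year old disks (Google paper) */
	int n;

	for(n = 1; n <= 8; n *= 2)
		printf("%d disk(s): %.1f%% chance of at least one failure this year\n",
		    n, 100.0*pfail(n, afr));
	return 0;
}

At an 8% AFR a 4-disk array already has roughly a 28% chance of losing
at least one member within a year, which is why replacing a surviving
disk of the same vintage is not as paranoid as it sounds.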
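For anyone who wants to try the idle-scan idea by hand, here is a minimal
single-disk sketch.  This is not Coraid's RaidShield (which presumably
rebuilds a bad block from parity and rewrites it); without redundancy all
this skeleton can do is surface latent read errors early so the disk can
be rewritten from a mirror or backup, or replaced.  The device name and
block size below are placeholders.

/* scrub.c: read a disk sequentially and report unreadable blocks. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum { Blksz = 64*1024 };	/* placeholder scrub granularity */

int
main(int argc, char **argv)
{
	char *buf;
	int fd;
	off_t off;
	ssize_t n;
	long bad;

	if(argc != 2){
		fprintf(stderr, "usage: scrub /dev/diskX\n");
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if(fd < 0){
		fprintf(stderr, "scrub: open %s: %s\n", argv[1], strerror(errno));
		return 1;
	}
	buf = malloc(Blksz);
	if(buf == NULL){
		fprintf(stderr, "scrub: out of memory\n");
		return 1;
	}
	bad = 0;
	for(off = 0; ; off += Blksz){
		n = pread(fd, buf, Blksz, off);
		if(n == 0)
			break;		/* end of device */
		if(n < 0){
			fprintf(stderr, "scrub: read error at offset %lld: %s\n",
			    (long long)off, strerror(errno));
			bad++;		/* skip over the bad block and keep going */
		}
	}
	printf("scrub done: %ld unreadable block(s)\n", bad);
	free(buf);
	close(fd);
	return 0;
}

Run it against an idle disk; any reported offsets are candidates for
rewriting before a second drive in the array fails.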