9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] Recovering a venti from disk failure
@ 2007-04-19  4:05 Anthony Sorace
  2007-04-19  4:30 ` Russ Cox
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Anthony Sorace @ 2007-04-19  4:05 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I had a disk fail a few days ago, after a power outage here. Various
spots in the fossil partition generate IO errors. The Venti arenas
seem intact. I've bought a new (and larger) drive, and have done a
basic Plan 9 install onto that, and moved the old disk from sdC0 to
sdD0. I'd like to recover the data from my old venti. I think the
process looks like this:

1) Boot off new disk.
2) Recover last score from old fossil w/ fossil/last (this works).
3) Start a venti running off old disk (will require editing venti/conf?)
4) venti/copy old-venti new-venti score-from-step-2
5) Reboot off some other medium.
6) Reformat new fossil from new venti using score from step 2.
7) Reboot off new fossil+venti.

Does that sound like an accurate script? Can anyone who's done this
confirm? Is there, by chance, an easier way?
Anthony
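
[Editor's sketch] The steps above can be written down as concrete command
lines. The small Python helper below only assembles the strings for steps 2,
4 and 6; it runs nothing. The partition name "fossil" on each disk is an
assumption, "old-venti" and "new-venti" are the placeholders from the post,
and whether each tool wants the score with or without a "vac:" prefix should
be checked against fossil(4) and the venti/copy man page.

```python
import re

# A venti score is a SHA-1 hash: 40 hex digits, sometimes written "vac:<hex>".
SCORE_RE = re.compile(r"^(vac:)?[0-9a-f]{40}$")


def recovery_commands(score,
                      old_fossil="/dev/sdD0/fossil",   # assumed partition name
                      new_fossil="/dev/sdC0/fossil",   # assumed partition name
                      old_venti="old-venti",           # placeholder host
                      new_venti="new-venti"):          # placeholder host
    """Return the command lines for steps 2, 4 and 6 (nothing is executed)."""
    if not SCORE_RE.match(score):
        raise ValueError("not a 40-hex-digit score: %r" % score)
    return [
        "fossil/last %s" % old_fossil,                          # step 2
        "venti/copy %s %s %s" % (old_venti, new_venti, score),  # step 4
        "fossil/flfmt -v %s %s" % (score, new_fossil),          # step 6
    ]


if __name__ == "__main__":
    # Dummy all-zero score, purely for illustration.
    for line in recovery_commands("0" * 40):
        print(line)
```

The score is passed through verbatim, so whatever fossil/last printed in
step 2 is what ends up on the venti/copy and fossil/flfmt lines.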


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [9fans] Recovering a venti from disk failure
  2007-04-19  4:05 [9fans] Recovering a venti from disk failure Anthony Sorace
@ 2007-04-19  4:30 ` Russ Cox
  2007-04-19  4:48 ` Bakul Shah
  2007-04-19  5:43 ` geoff
  2 siblings, 0 replies; 14+ messages in thread
From: Russ Cox @ 2007-04-19  4:30 UTC (permalink / raw)
  To: 9fans

> 1) Boot off new disk.
> 2) Recover last score from old fossil w/ fossil/last (this works).
> 3) Start a venti running off old disk (will require editing venti/conf?)
> 4) venti/copy old-venti new-venti score-from-step-2
> 5) Reboot off some other medium.
> 6) Reformat new fossil from new venti using score from step 2.
> 7) Reboot off new fossil+venti.

Looks good to me.  You'll have to change the file
names in the venti config to get the old venti 
running again, but the rest can stay the same.
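
[Editor's sketch] Changing the file names in the venti config amounts to
pointing the isect, arenas and (if present) bloom lines at the disk's new
name. A minimal Python illustration follows; the sample config text and the
partition names are assumptions, not the poster's actual file.

```python
def retarget_venti_conf(text, old_disk="sdC0", new_disk="sdD0"):
    """Point the isect/arenas/bloom lines of a venti config at another disk.

    Every other line (index name, addr, memory settings) is left alone.
    """
    out = []
    for line in text.splitlines():
        words = line.split()
        if len(words) == 2 and words[0] in ("isect", "arenas", "bloom"):
            words[1] = words[1].replace("/dev/%s/" % old_disk,
                                        "/dev/%s/" % new_disk)
            line = " ".join(words)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical config as it might have looked when the disk was still sdC0.
SAMPLE = """index main
isect /dev/sdC0/isect
arenas /dev/sdC0/arenas
"""

if __name__ == "__main__":
    print(retarget_venti_conf(SAMPLE), end="")
```

Only the device paths change, which matches the advice that the rest of the
config can stay the same.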

Russ




* Re: [9fans] Recovering a venti from disk failure
  2007-04-19  4:05 [9fans] Recovering a venti from disk failure Anthony Sorace
  2007-04-19  4:30 ` Russ Cox
@ 2007-04-19  4:48 ` Bakul Shah
  2007-04-19  5:43 ` geoff
  2 siblings, 0 replies; 14+ messages in thread
From: Bakul Shah @ 2007-04-19  4:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> 1) Boot off new disk.
> 2) Recover last score from old fossil w/ fossil/last (this works).
> 3) Start a venti running off old disk (will require editing venti/conf?)
> 4) venti/copy old-venti new-venti score-from-step-2

Wouldn't this lose all old snapshots?



* Re: [9fans] Recovering a venti from disk failure
  2007-04-19  4:05 [9fans] Recovering a venti from disk failure Anthony Sorace
  2007-04-19  4:30 ` Russ Cox
  2007-04-19  4:48 ` Bakul Shah
@ 2007-04-19  5:43 ` geoff
  2007-04-19  6:07   ` Bakul Shah
  2007-04-19 11:25   ` Anthony Sorace
  2 siblings, 2 replies; 14+ messages in thread
From: geoff @ 2007-04-19  5:43 UTC (permalink / raw)
  To: 9fans

If your old venti is intact, I don't see the need to copy it (or is it
on the drive with the damaged fossil and you don't trust the drive?).
I would just format the new fossil partition using the last fossil
dump score.

If the arenas are okay but not the index, I'd rebuild the index.




* Re: [9fans] Recovering a venti from disk failure
  2007-04-19  5:43 ` geoff
@ 2007-04-19  6:07   ` Bakul Shah
  2007-04-19 12:05     ` erik quanstrom
  2007-04-19 11:25   ` Anthony Sorace
  1 sibling, 1 reply; 14+ messages in thread
From: Bakul Shah @ 2007-04-19  6:07 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> If your old venti is intact, I don't see the need to copy it (or is it
> on the drive with the damaged fossil and you don't trust the drive?).

If your venti disk is similar to the failed disk and of same
vintage, it is also likely to fail and should be replaced
before it actually does.  Similarly replace all disks in a
RAID if one of the disks dies (and the death was not in the
first few months of its life).



* Re: [9fans] Recovering a venti from disk failure
  2007-04-19  5:43 ` geoff
  2007-04-19  6:07   ` Bakul Shah
@ 2007-04-19 11:25   ` Anthony Sorace
  2007-04-19 12:01     ` Russ Cox
  1 sibling, 1 reply; 14+ messages in thread
From: Anthony Sorace @ 2007-04-19 11:25 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Geoff wrote:
// (or is it on the drive with the damaged fossil and you
// don't trust the drive?).

exactly. i'm also replacing it with something about five times the
size at the same cost. amazing what a few years will do. once the
transition is complete, i'll use most of the old disk as an "other". i'm
also considering options for a more disaster-tolerant setup.

Bakul wrote:
// Wouldn't this lose all old snapshots?

I believe the method described will take the root of fossil, before
/active, the normally visible part. I'm particularly encouraged by the
example involving a corrupted disk and fossil/flfmt in fossil(4). i'll
let you know if my experience says otherwise.



* Re: [9fans] Recovering a venti from disk failure
  2007-04-19 11:25   ` Anthony Sorace
@ 2007-04-19 12:01     ` Russ Cox
  0 siblings, 0 replies; 14+ messages in thread
From: Russ Cox @ 2007-04-19 12:01 UTC (permalink / raw)
  To: 9fans

> // Wouldn't this lose all old snapshots?
> 
> I believe the method described will take the root of fossil, before
> /active, the normally visible part. I'm particularly encouraged by the
> example involving a corrupted disk and fossil/flfmt in fossil(4). i'll
> let you know if my experience says otherwise.

You get to keep /n/dump but not /n/snap.

Russ




* Re: [9fans] Recovering a venti from disk failure
  2007-04-19  6:07   ` Bakul Shah
@ 2007-04-19 12:05     ` erik quanstrom
  2007-04-19 20:15       ` Bakul Shah
  0 siblings, 1 reply; 14+ messages in thread
From: erik quanstrom @ 2007-04-19 12:05 UTC (permalink / raw)
  To: 9fans

>> If your old venti is intact, I don't see the need to copy it (or is it
>> on the drive with the damaged fossil and you don't trust the drive?).
> 
> If your venti disk is similar to the failed disk and of same
> vintage, it is also likely to fail and should be replaced
> before it actually does.  Similarly replace all disks in a
> RAID if one of the disks dies (and the death was not in the
> first few months of its life).

our experience has been this is not a cost-effective way of dealing
with disk failures as disks do not fail en masse.

also, once a failure has occurred, it is too late.  if there are other
disks with latent errors, raid reconstruction will fail.  in a raid 5
array, a latent error + failed disk means that block can't be
reconstructed.
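
[Editor's sketch] The point about a latent error plus a failed disk can be
seen with plain XOR parity. This is a toy model in Python, assuming a single
stripe of three data blocks plus one parity block, with None standing for a
block that cannot be read; it is not any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def reconstruct(stripe):
    """Rebuild the one unreadable block (None) in a stripe from the others.

    Raises RuntimeError when more than one block is unreadable, because
    single parity carries only enough redundancy for one missing block.
    """
    missing = [i for i, b in enumerate(stripe) if b is None]
    if len(missing) > 1:
        raise RuntimeError("blocks %s unreadable, stripe cannot be rebuilt"
                           % missing)
    stripe = list(stripe)
    if missing:
        stripe[missing[0]] = xor_blocks([b for b in stripe if b is not None])
    return stripe


data = [b"plan", b"nine", b"fans"]
stripe = data + [xor_blocks(data)]        # three data blocks plus parity

if __name__ == "__main__":
    # Disk 1 has failed: its block comes back from the other three.
    print(reconstruct([stripe[0], None, stripe[2], stripe[3]])[1])

    # Disk 1 has failed and disk 2 has a latent bad sector in this stripe.
    try:
        reconstruct([stripe[0], None, None, stripe[3]])
    except RuntimeError as e:
        print("rebuild failed:", e)
```

The second case is the one described above: the latent error may have been
sitting there for months, but it only becomes visible during the rebuild.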

i think this correlation gives people the false impression that they do
fail en masse, but that's really wrong.  the latent errors probably
happened months ago.

our solution to this problem (RaidShield™) is to preemptively scan
disks while the raid is idle and rewrite these blocks.  this either
(a) corrects the bit rot (b) causes the drive to remap the sector or
(c) notifies the user that there's a real disk problem so it can be
replaced before a second drive in the array fails.
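
[Editor's sketch] The idea of scanning while idle can be illustrated with a
toy scrubber in Python. This is not RaidShield's implementation, which the
thread does not describe; it assumes XOR parity, marks an unreadable block
as None, and treats "rewrite the block" as a list assignment.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def scrub(stripes):
    """Scan every stripe while all disks are still present.

    A stripe with one unreadable block (None) is repaired in place from
    the remaining blocks. A stripe with more than one is only reported.
    Returns (repaired, unrecoverable) lists of stripe numbers.
    """
    repaired, unrecoverable = [], []
    for n, stripe in enumerate(stripes):
        bad = [i for i, b in enumerate(stripe) if b is None]
        if len(bad) == 1:
            stripe[bad[0]] = xor_blocks([b for b in stripe if b is not None])
            repaired.append(n)
        elif len(bad) > 1:
            unrecoverable.append(n)
    return repaired, unrecoverable


if __name__ == "__main__":
    array = [
        [b"ab", b"cd", xor_blocks([b"ab", b"cd"])],   # healthy stripe
        [b"ef", None, xor_blocks([b"ef", b"gh"])],    # one latent error
    ]
    print(scrub(array))
    print(array[1][1])
```

Because the scan runs before any disk has failed, the single latent error in
the second stripe is still repairable, which is the whole point of doing it
early.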

- erik




* Re: [9fans] Recovering a venti from disk failure
  2007-04-19 12:05     ` erik quanstrom
@ 2007-04-19 20:15       ` Bakul Shah
  2007-04-19 21:26         ` erik quanstrom
  2007-04-19 21:36         ` W B Hacker
  0 siblings, 2 replies; 14+ messages in thread
From: Bakul Shah @ 2007-04-19 20:15 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


> >> If your old venti is intact, I don't see the need to copy it (or is it
> >> on the drive with the damaged fossil and you don't trust the drive?).
> > 
> > If your venti disk is similar to the failed disk and of same
> > vintage, it is also likely to fail and should be replaced
> > before it actually does.  Similarly replace all disks in a
> > RAID if one of the disks dies (and the death was not in the
> > first few months of its life).
> 
> our experience has been this is not a cost-effective way of dealing
> with disk failures as disks do not fail en masse.

Various studies seem to indicate failure rates are highly
correlated with drive model, vintage and manufacturer.
Assuming a RAID is built from similar disks, when one fails
the others are more likely to fail.

Google has a recent paper on disk failures that shows 8% to
8.6% annualized failure rate for 2 to 3 year old disks
(measured across a very large population of disks and not
accounting for vendor/model differences).  Things are likely
to be worse in a typical home environment -- dust, heat
variation, poor or nonworking surge protectors, more power
cycles, computers get moved around or kicked more, poorer
quality components etc.
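
[Editor's sketch] To get a feel for what those annualized rates mean for one
disk, the per-year rates can be compounded. The Python below assumes the 8%
figure applies to one year and the 8.6% figure to the next, and that years
are independent; both are simplifications of the quoted range, not numbers
taken from the paper's tables.

```python
def survival(rates):
    """Probability that a disk survives every year in `rates`.

    `rates` holds one annualized failure rate per year, and the years are
    treated as independent.
    """
    p = 1.0
    for r in rates:
        p *= 1.0 - r
    return p


if __name__ == "__main__":
    rates = [0.08, 0.086]   # the two annualized rates quoted above
    print("P(fails during those two years) = %.3f" % (1 - survival(rates)))
```

Under those assumptions the chance of losing the disk somewhere in the two
years comes out at roughly 16%.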

If you can afford it, replacing your most critical disks
every three years is a good rule of thumb -- the first disk
failure is a strong reminder of that:-)

> i think this correlation gives people the false impression that they do
> fail en masse, but that's really wrong.  the latent errors probably
> happened months ago.

Yes but if there are many latent errors and/or the error rate
is going up it is time to replace it.

> our solution to this problem (RaidShield™) is to preemptively scan
> disks while the raid is idle and rewrite these blocks.  this either
> (a) corrects the bit rot (b) causes the drive to remap the sector or
> (c) notifies the user that there's a real disk problem so it can be
> replaced before a second drive in the array fails.

This is a good idea.  We did this in 1983, back when disks
were simpler beasts.  No RAID then of course.



* Re: [9fans] Recovering a venti from disk failure
  2007-04-19 20:15       ` Bakul Shah
@ 2007-04-19 21:26         ` erik quanstrom
  2007-04-19 21:39           ` W B Hacker
  2007-04-19 22:25           ` Bakul Shah
  2007-04-19 21:36         ` W B Hacker
  1 sibling, 2 replies; 14+ messages in thread
From: erik quanstrom @ 2007-04-19 21:26 UTC (permalink / raw)
  To: 9fans

> Various studies seem to indicate failure rates are highly
> correlated with drive model, vintage and manufacturer.
> Assuming a RAID is built from similar disks, when one fails
> the others are more likely to fail.

while it is true that some disk vintages are better than others, when
one drive fails, the probability of the other drives failing has not
changed.  this is the same as if you flip a coin ten times and get ten
heads, the probability of flipping the same coin and getting heads is
still 1/2.

>> i think this correlation gives people the false impression that they do
>> fail en masse, but that's really wrong.  the latent errors probably
>> happened months ago.
> 
> Yes but if there are many latent errors and/or the error rate
> is going up it is time to replace it.

maybe.  the google paper you cited didn't find a strong correlation
between smart errors (including block relocation) and failure.

> This is a good idea.  We did this in 1983, back when disks
> were simpler beasts.  No RAID then of course.

even a better idea back then.  disks didn't have 1/4 million
lines of firmware relocating blocks and doing other things to^w
i mean for you.

- erik




* Re: [9fans] Recovering a venti from disk failure
  2007-04-19 20:15       ` Bakul Shah
  2007-04-19 21:26         ` erik quanstrom
@ 2007-04-19 21:36         ` W B Hacker
  1 sibling, 0 replies; 14+ messages in thread
From: W B Hacker @ 2007-04-19 21:36 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Bakul Shah wrote:

*snip*

> 
> This is a good idea.  We did this in 1983, back when disks
> were simpler beasts.  No RAID then of course.
> 

Sure there were.  From SMD days in the '70's, even.

We called them 'mirrored' (2 drives, one controller and cable set)
or 'duplexed' (separate controllers and cables) - sometimes on separate hosts 
sharing common drives.

The other so-called 'level's and 'RAID' terminology took longer to catch on. Or 
become affordable enough to leave the mainframe arena anyway.

Drives - such as the ISS-80 - a few of whose remains litter my garage yet today, 
were hardly 'simpler'.

Quite the reverse! Built like a milling machine or Hardinge lathe to even 
*attempt* to keep heads aligned with (replaceable) media.

"Simplicity" - or decent and affordable reliability at least - arrived with 
IBM-UK 'Winchester' technology and the CDC-'birdnames' notably Lark and Wren.

Even with 'Winchester', 'Bernoulli' 8" cartridges were far more reliable than the 
comparable 'SeaGRRRRRATE' HDD of the same era. Twinned units were not 'RAID' but 
had a fast 'dd'-like imaging utility for manual duplication.

Seagate 5 1/4" drives used to be good for 3 to 6 months, Western Digital or 
Microscience 6 to 12, CDC 12 to 24 months.

Big improvement as first-generation NCR 'Century' series once lunched a set of 
platters as often as twice a week...

The reasons Fossil and Venti are as they are may no longer be as obvious or 
compelling as they once were, but how soon we forget how *seriously* fragile and 
failure-prone HDD once were.

Bill





* Re: [9fans] Recovering a venti from disk failure
  2007-04-19 21:26         ` erik quanstrom
@ 2007-04-19 21:39           ` W B Hacker
  2007-04-19 22:25           ` Bakul Shah
  1 sibling, 0 replies; 14+ messages in thread
From: W B Hacker @ 2007-04-19 21:39 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:
>> Various studies seem to indicate failure rates are highly
>> correlated with drive model, vintage and manufacturer.
>> Assuming a RAID is built from similar disks, when one fails
>> the others are more likely to fail.
> 
> while it is true that some disk vintages are better than others, when
> one drive fails, the probability of the other drives failing has not
> changed.  this is the same as if you flip a coin ten times and get ten
> heads, the probability of flipping the same coin and getting heads is
> still 1/2.
> 
>>> i think this correlation gives people the false impression that they do
>>> fail en masse, but that's really wrong.  the latent errors probably
>>> happened months ago.
>> Yes but if there are many latent errors and/or the error rate
>> is going up it is time to replace it.
> 
> maybe.  the google paper you cited didn't find a strong correlation
> between smart errors (including block relocation) and failure.
> 
>> This is a good idea.  We did this in 1983, back when disks
>> were simpler beasts.  No RAID then of course.
> 
> even a better idea back then.  disks didn't have 1/4 million
> lines of firmware relocating blocks and doing other things to^w
> i mean for you.
> 
> - erik
> 
> 

And - lest we forget - a RAID array actually has a higher statistical chance of 
failure, and a *lower* MTBF than a single drive. Simple math.

What we gain is a reduced risk of *unrecoverable* damage, not fewer failures, 
per se.
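
[Editor's sketch] The "simple math" can be spelled out in Python. This
assumes the disks fail independently at a constant rate, and borrows the 8%
annualized rate mentioned earlier in the thread purely as an example value.

```python
import math


def p_any_failure(afr, n):
    """P(at least one of n independent disks fails within a year)."""
    return 1.0 - (1.0 - afr) ** n


def mtbf_years(afr):
    """MTBF of one disk under a constant failure rate.

    With rate lam per year, AFR = 1 - exp(-lam), so MTBF = 1 / lam.
    """
    return -1.0 / math.log(1.0 - afr)


if __name__ == "__main__":
    afr = 0.08   # example value only
    for n in (1, 2, 5, 8):
        # Time to the FIRST failure among n disks has rate n * lam,
        # so the array's MTBF is the single-disk MTBF divided by n.
        print(n, round(p_any_failure(afr, n), 3),
              round(mtbf_years(afr) / n, 2))
```

With five such disks the chance of seeing at least one failure in a year is
about 0.34, against 0.08 for a single disk. The redundancy does not reduce
that number; it only decides whether a failure costs data.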

Bill






* Re: [9fans] Recovering a venti from disk failure
  2007-04-19 21:26         ` erik quanstrom
  2007-04-19 21:39           ` W B Hacker
@ 2007-04-19 22:25           ` Bakul Shah
  1 sibling, 0 replies; 14+ messages in thread
From: Bakul Shah @ 2007-04-19 22:25 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> while it is true that some disk vintages are better than others, when
> one drive fails, the probability of the other drives failing has not
> changed.  this is the same as if you flip a coin ten times and get ten
> heads, the probability of flipping the same coin and getting heads is
> still 1/2.

They are not independent events since they see similar wear
and tear and experience the same environmental stress.  In
fact by 2nd/3rd year they are starting to approach the other
end of the bathtub curve of failures.
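
[Editor's sketch] The disagreement here is about independence, and a small
two-batch model in Python shows where each side is right. Assume a batch of
disks is "bad" with probability q, that disks fail independently once the
batch quality is fixed, and that the numbers below are made up purely for
illustration.

```python
def p_fail(q, good, bad):
    """Failure probability of one disk when batch quality is unknown."""
    return q * bad + (1 - q) * good


def p_fail_given_sibling_failed(q, good, bad):
    """P(disk B fails | disk A from the same batch has failed)."""
    joint = q * bad * bad + (1 - q) * good * good
    return joint / p_fail(q, good, bad)


if __name__ == "__main__":
    q, good, bad = 0.1, 0.02, 0.3          # hypothetical numbers
    print("P(fail)                  = %.3f" % p_fail(q, good, bad))
    print("P(fail | sibling failed) = %.3f"
          % p_fail_given_sibling_failed(q, good, bad))
```

With these numbers the first figure is about 0.048 and the second about
0.195: one failure is evidence that the batch or environment is bad. If the
good and bad rates are equal, the two figures coincide, which is the
coin-flip case.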

The disk industry needs to hire real actuaries.  Or maybe
Google will!

> 
> >> i think this correlation gives people the false impression that they do
> >> fail en masse, but that's really wrong.  the latent errors probably
> >> happened months ago.
> > 
> > Yes but if there are many latent errors and/or the error rate
> > is going up it is time to replace it.
> 
> maybe.  the google paper you cited didn't find a strong correlation
> between smart errors (including block relocation) and failure.

Actually they did find strong correlation for some of the
SMART parameters.  I think what they said was that over 36%
of failed disks had no SMART errors of certain kinds (that
could be because the drive firmware didn't actually count
those errors but it seems they didn't investigate this).  If
you do see more soft errors there is definitely something to
worry about.

> > This is a good idea.  We did this in 1983, back when disks
> > were simpler beasts.  No RAID then of course.
> 
> even a better idea back then.  disks didn't have 1/4 million
> lines of firmware relocating blocks and doing other things to^w
> i mean for you.

I preferred doing bad block forwarding etc. in 100 or so
lines of C code in the OS but it was clear even then that
disk vendors were going to make their disks "smarter".



* Re: [9fans] Recovering a venti from disk failure
@ 2007-04-19  4:37 YAMANASHI Takeshi
  0 siblings, 0 replies; 14+ messages in thread
From: YAMANASHI Takeshi @ 2007-04-19  4:37 UTC (permalink / raw)
  To: 9fans

> > 4) venti/copy old-venti new-venti score-from-step-2

you might want to add the '-f' option to venti/copy, or copying
could take a really long time if you have a long history
of dumps.
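
[Editor's sketch] Why -f helps can be shown with a toy model in Python. As I
understand the venti/copy man page, fast mode assumes that if a block is
already on the destination, everything below it is there too. The sketch
models each server as a dict and uses short strings in place of real SHA-1
scores; none of this is venti's actual code.

```python
def copy_tree(src, dst, score, fast=False):
    """Copy the block named `score`, and everything below it, from src to dst.

    src and dst map score -> (data, [child scores]).
    Returns the number of blocks visited on the source.
    """
    if fast and score in dst:
        return 0                     # assume the whole subtree is present
    data, children = src[score]
    visited = 1
    for child in children:           # children first, then the parent
        visited += copy_tree(src, dst, child, fast)
    dst[score] = (data, children)
    return visited


# Yesterday's dump ("old") is already on the destination; today's dump
# ("new") shares it and adds one fresh block "c".
SRC = {
    "new": (b"root2", ["old", "c"]),
    "old": (b"root1", ["a", "b"]),
    "a": (b"A", []),
    "b": (b"B", []),
    "c": (b"C", []),
}
HAVE = {k: SRC[k] for k in ("old", "a", "b")}

if __name__ == "__main__":
    print(copy_tree(SRC, dict(HAVE), "new"))             # walks all 5 blocks
    print(copy_tree(SRC, dict(HAVE), "new", fast=True))  # walks only 2
```

On a fresh destination nothing is present yet, so the saving in this model
comes from blocks shared between dumps: once one dump has been copied, the
parts of older dumps that it shares are skipped rather than walked again.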
-- 




Thread overview: 14+ messages
2007-04-19  4:05 [9fans] Recovering a venti from disk failure Anthony Sorace
2007-04-19  4:30 ` Russ Cox
2007-04-19  4:48 ` Bakul Shah
2007-04-19  5:43 ` geoff
2007-04-19  6:07   ` Bakul Shah
2007-04-19 12:05     ` erik quanstrom
2007-04-19 20:15       ` Bakul Shah
2007-04-19 21:26         ` erik quanstrom
2007-04-19 21:39           ` W B Hacker
2007-04-19 22:25           ` Bakul Shah
2007-04-19 21:36         ` W B Hacker
2007-04-19 11:25   ` Anthony Sorace
2007-04-19 12:01     ` Russ Cox
2007-04-19  4:37 YAMANASHI Takeshi
