9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] ide file server with mirroring
@ 2001-09-05 14:54 jmk
  2001-09-05 15:23 ` Martin Harriss
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: jmk @ 2001-09-05 14:54 UTC (permalink / raw)
  To: 9fans

Following on from what Nemo and Geoff have said, the most common failure
now seems to be the power supply in the computer case. During the past year
we've had a couple dozen fail, a lot of them while on a UPS, so it's not
just the off/on/power-dip stress. That, coupled with the apparent fragility
of ATA drives, makes designing a reliable hardware+filesystem more of a
challenge.

In our computer room there are 2 or 3 IDE-raid boxes (not running a Plan 9
filesystem) and I believe in the 6 months or so they've been there at least
one drive has failed. However, the boxes have redundant power supplies and
the drives can be hot-swapped (at some performance cost during the update),
so getting the hardware reliability is possible at a reasonable monetary and
performance cost.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [9fans] ide file server with mirroring
  2001-09-05 14:54 [9fans] ide file server with mirroring jmk
@ 2001-09-05 15:23 ` Martin Harriss
  2001-09-05 19:05 ` Boyd Roberts
  2001-09-05 19:55 ` Paul O'Donnell
  2 siblings, 0 replies; 8+ messages in thread
From: Martin Harriss @ 2001-09-05 15:23 UTC (permalink / raw)
  To: 9fans

jmk@plan9.bell-labs.com wrote:
>
> Following on from what Nemo and Geoff have said, the most common failure
> now seems to be the power supply in the computer case. During the past year
> we've had a couple dozen fail, a lot of them while on a UPS, so it's not
> just the off/on/power-dip stress. That, coupled with the apparent fragility
> of ATA drives, makes designing a reliable hardware+filesystem more of a
> challenge.
>
> In our computer room there are 2 or 3 IDE-raid boxes (not running a Plan 9
> filesystem) and I believe in the 6 months or so they've been there at least
> one drive has failed. However, the boxes have redundant power supplies and
> the drives can be hot-swapped (at some performance cost during the update),
> so getting the hardware reliability is possible at a reasonable monetary and
> performance cost.

It's also worth noting that IDE drives are not all created equal.  IBM
seems to make the best ones.  Western Digital are pretty good, perhaps
not quite as good as IBM.  I've had problems with Maxtor and probably
wouldn't buy another one.

I don't think there's a significant difference in cost; in any case,
it's probably worth spending the few extra dollars to get a good drive,
if only to avoid the hassle of replacing the thing a year later.

Martin



* Re: [9fans] ide file server with mirroring
  2001-09-05 14:54 [9fans] ide file server with mirroring jmk
  2001-09-05 15:23 ` Martin Harriss
@ 2001-09-05 19:05 ` Boyd Roberts
  2001-09-05 19:55 ` Paul O'Donnell
  2 siblings, 0 replies; 8+ messages in thread
From: Boyd Roberts @ 2001-09-05 19:05 UTC (permalink / raw)
  To: 9fans

hmm, a few years ago i had a bunch of rz-29's [storageworks 4Gb]
mirrored or raid 5 and they just kept dying.  i was losing one
a month.  good thing they were mirrored/raided.  very strange.
i've rarely seen discs die.  the m/c [alpha 8400] would die
strangely too, but we had a theory on that.





* Re: [9fans] ide file server with mirroring
  2001-09-05 14:54 [9fans] ide file server with mirroring jmk
  2001-09-05 15:23 ` Martin Harriss
  2001-09-05 19:05 ` Boyd Roberts
@ 2001-09-05 19:55 ` Paul O'Donnell
  2 siblings, 0 replies; 8+ messages in thread
From: Paul O'Donnell @ 2001-09-05 19:55 UTC (permalink / raw)
  To: jmk; +Cc: 9fans

computers with dual power supplies are pretty common these days,
so any kind of power problem should be easy to deal with for
a server you care about.

in general, hardware failures are a solved problem in the storage
world.  storage is a commodity, with a few relatively well defined
standards for access.  you care about three variables (price,
performance and reliability) and you optimise according to your
needs.  note that a fast, reliable disk array can cost a couple
of orders of magnitude more per byte than a bare ide drive.

given a reliable array, you care about multiple paths to it (in case
you lose a controller, or someone trips over the cable), you care about
reliable code (array microcode, host device drivers and file systems)
and you want simple tools on the host.

in the commercial (unix based) world which i inhabit, we run a lot of
these systems.  we have remarkably few problems which result in a loss
of access to data or worse loss of data.  we do run into occasional
microcode bugs in the arrays, but the most challenging problem is the
complexity of the tools on the host.  with multiple levels of
indirection (multi pathing, volume managers, file systems) even
detecting a simple partition overlap can be a challenging task.  this
is where better (read simpler) os design can help.

On Wed, 5 Sep 2001 jmk@plan9.bell-labs.com wrote:

> Following on from what Nemo and Geoff have said, the most common failure
> now seems to be the power supply in the computer case. During the past year
> we've had a couple dozen fail, a lot of them while on a UPS, so it's not
> just the off/on/power-dip stress. That, coupled with the apparent fragility
> of ATA drives, makes designing a reliable hardware+filesystem more of a
> challenge.
>
> In our computer room there are 2 or 3 IDE-raid boxes (not running a Plan 9
> filesystem) and I believe in the 6 months or so they've been there at least
> one drive has failed. However, the boxes have redundant power supplies and
> the drives can be hot-swapped (at some performance cost during the update),
> so getting the hardware reliability is possible at a reasonable monetary and
> performance cost.
>
>




* Re: [9fans] ide file server with mirroring
@ 2001-09-05 22:11 geoff
  0 siblings, 0 replies; 8+ messages in thread
From: geoff @ 2001-09-05 22:11 UTC (permalink / raw)
  To: 9fans

No, a quick grep shows that only the console dump command and the
clock rolling past 5:00 trigger dumps.  I asked Ken about having a
(nearly) full cache trigger a dump, and he said he'd tried it and
couldn't get it to work right.




* Re: [9fans] ide file server with mirroring
@ 2001-09-05 17:08 Russ Cox
  0 siblings, 0 replies; 8+ messages in thread
From: Russ Cox @ 2001-09-05 17:08 UTC (permalink / raw)
  To: 9fans

> Note too that worm writes happen only during a dump, typically once
> per night.

Or when the cache gets full, I think (but could
definitely be wrong).




* Re: [9fans] ide file server with mirroring
@ 2001-09-05 11:06 Fco.J.Ballesteros
  0 siblings, 0 replies; 8+ messages in thread
From: Fco.J.Ballesteros @ 2001-09-05 11:06 UTC (permalink / raw)
  To: 9fans

: Any file server should be on a UPS.  This will let it ride out the most
: common power outages, which last at most a few seconds, and probably

Mine is. It was just that the power supply (within the cpu box) burned
out. This has only happened once to me, but my worm was gone (even though
the disks gave no i/o errors).

: worm become unusable.  I'd be curious to know more, in particular what
: caused the crash.  Obviously a real worm is yet more robust, but Ken

I think the crash happened while doing the dump, because the worm part
of the cached worm had some kind of problem with its superblock.
Perhaps the fs couldn't get to the superblock list at all, but I don't
really know.

I probably should have used cwcmd to get the worm working again, but
I just didn't know how to do that.




* Re: [9fans] ide file server with mirroring
@ 2001-09-05 10:13 geoff
  0 siblings, 0 replies; 8+ messages in thread
From: geoff @ 2001-09-05 10:13 UTC (permalink / raw)
  To: 9fans

Any file server should be on a UPS.  This will let it ride out the most
common power outages, which last at most a few seconds, and probably
provides slightly better power conditioning than a random power strip
or surge suppressor.  It won't really help when somebody turns off
power for a large part of California for a few hours, but nowadays
those are forecast at least the day before.  It might be worth
teaching the file server how to recognise that the UPS is on battery
power and running down, so it can shut down gracefully.
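A sketch of that watcher, with `on_battery` and `sync_and_halt` standing
in for whatever the UPS interface and the file server's halt path would
actually provide (both names are invented here):

```python
import time

def watch_ups(on_battery, sync_and_halt, grace=300, poll=10):
    # Poll the UPS; once it has been on battery for `grace` seconds,
    # assume the outage is real, sync the file system and halt
    # before the battery runs out.
    since = None
    while True:
        if on_battery():
            if since is None:
                since = time.time()
            elif time.time() - since >= grace:
                sync_and_halt()
                return
        else:
            since = None          # mains came back; reset the timer
        time.sleep(poll)
```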

In my experience at the Labs, power outages were the most common
reason for file server crashes.  Almost nothing else caused them; our
file servers were generally up for six months at a time and then were
usually shut down gracefully because we needed to move or alter the
hardware, or we wanted to boot a new kernel.  One exception was that
we got clobbered by a broken Lucent gigabit Ethernet switch (actually
acquired when Lucent bought some other company) that was installed by
the building networking people on a Friday at 16:30 and which then
proceeded to transmit bogus IP-broadcast Ethernet packets at full
speed all weekend.  The file servers ran out of network buffers and
didn't cope very well after that.  They wouldn't have been able to do
much even if they had been up, though.

The file server seems pretty robust against random crashes or power
outages in any case, and I'm surprised that nemo actually had a cached
worm become unusable.  I'd be curious to know more, in particular what
caused the crash.  Obviously a real worm is yet more robust, but Ken
observed that each dump seals off any file system (as opposed to
media) damage that could occur.  The file server can skip forward from
superblock to superblock to locate the last valid superblock if
something awful happens.
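The superblock-to-superblock skip might look like this; the on-disk
layout (each superblock recording the address of its successor) is
inferred from the description above, not taken from the real code:

```python
def last_valid_superblock(read_sb, start):
    # Walk the chain of superblocks starting at `start`; each valid
    # superblock records where the next one will be written.  Stop at
    # the first address that doesn't hold a valid superblock and
    # return the address of the last good one.
    addr, last = start, None
    while True:
        sb = read_sb(addr)      # returns None on a bad tag/checksum
        if sb is None:
            return last
        last, addr = addr, sb["next"]
```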

Anyway, pages 274-275 of the Plan 9 documents (in Ken's `The Plan 9
File Server' paper) answer many of the questions asked: each block in
the cache is tagged with its state, which can be manipulated by the
`cwcmd mvstate' console command in case of trouble, and the wcp
process copies Cdump blocks from the cache to the worm (and relabels
the blocks) as long as any remain (including following a crash and
restart, as I read the code).  In the case of worm write errors,
blocks are relabelled to Cdump1 and you have to intervene with cwcmd.
It's hard to see how one can produce an unusable cached worm.  A power
failure during a disk write of the Cache struct could invalidate the
cache device, but the recover console command will fix it, at the cost
of data written since the last dump.  (A trashed cache just requires
use of the recover console command.)  I think writing the Cache struct
to two different blocks and looking for the backup if the primary were
trashed would fix even that vulnerability.  (I suppose one could mirror
the cache as well as the fake worm, though I don't know what that would
do to performance; ignore my earlier estimate.)
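The wcp pass, as described, might be sketched like this (the state
names are from Ken's paper, but the loop itself is a guess at the
behaviour, not the file server's actual code):

```python
Cdump, Cclean, Cdump1 = "Cdump", "Cclean", "Cdump1"

def wcp(cache, worm):
    # Copy every block still labelled Cdump from the cache to the
    # worm, relabelling as we go -- so the pass can simply resume
    # after a crash and restart.  Blocks whose worm write fails are
    # relabelled Cdump1 for manual intervention via cwcmd.
    for addr, (state, data) in list(cache.items()):
        if state != Cdump:
            continue
        try:
            worm.write(addr, data)
            cache[addr] = (Cclean, data)
        except IOError:
            cache[addr] = (Cdump1, data)
```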

Note too that worm writes happen only during a dump, typically once
per night.

The {} mirror device works like this: given {D1...Dn}, writes are
performed to Dn back to D1, and reads are performed on D1; if a read
fails, D2 is read, etc. through to Dn.  If all the reads fail, {}
reports an error.  Because writes are done in the reverse order, when
the write to D1 completes, we know that all the mirrors have already
been written.  Thus checking the {} device effectively checks D1, with
changes propagating through to the mirrors.  If the file server were
to crash between writing Dn and writing D1, it would appear that the
write wasn't performed at all.
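In outline (illustrative code, not the fs kernel's):

```python
def mirror_write(disks, block, data):
    # Write Dn back to D1: by the time the D1 write completes, every
    # other mirror is known to hold the block already.
    for d in reversed(disks):
        d.write(block, data)

def mirror_read(disks, block):
    # Read D1 first; on error fall back to D2 .. Dn in turn.
    for d in disks:
        try:
            return d.read(block)
        except IOError:
            pass
    raise IOError("all mirrors failed reading block %d" % block)
```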

As for recovery, there's nothing very automatic.  Upon observing a
flurry of errors, one should reboot the file server, reconfigure it
without the bad disk(s), go to CompUSA and buy the current jumbo $100
IDE disk, replace the bad disk(s) with it, reconfigure the file server
with the new jumbo disk and load it.  Nemo has added a "dupfilsys"
configuration command that might do the job, though it would be nicer
to have a background process create a mirror on the new disk,
deferring any writes from the mirror device until the affected blocks
have been written by dupfilsys (or making dupfilsys skip over blocks
written by the mirror device).
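The skip-over variant could be as simple as this sketch, where `dirty`
is the set of blocks the live mirror device has written since the copy
started (all names invented):

```python
def resilver(src, dst, nblocks, dirty):
    # Background copy onto the replacement disk.  Blocks already
    # (re)written by the live mirror device are newer on `dst` than
    # the copy we would make from `src`, so skip them rather than
    # overwrite fresh data.
    for b in range(nblocks):
        if b not in dirty:
            dst.write(b, src.read(b))
```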

Clearly using a mirrored fake worm is not as good as an optical disk
jukebox, but it's a lot cheaper and provides protection against a
single disk failure.  Given how cheap and unreliable IDE disks are, it
might make sense to have more than one mirror disk (i.e., {h0h2h1}).
Taking occasional snapshots of important data on CD-R (attached to a
terminal or CPU server, not the file server) is also pretty cheap and
provides reliable, reasonably permanent (i.e., optical) backup.



end of thread, other threads:[~2001-09-05 22:11 UTC | newest]

Thread overview: 8+ messages
2001-09-05 14:54 [9fans] ide file server with mirroring jmk
2001-09-05 15:23 ` Martin Harriss
2001-09-05 19:05 ` Boyd Roberts
2001-09-05 19:55 ` Paul O'Donnell
  -- strict thread matches above, loose matches on Subject: below --
2001-09-05 22:11 geoff
2001-09-05 17:08 Russ Cox
2001-09-05 11:06 Fco.J.Ballesteros
2001-09-05 10:13 geoff

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).