9fans - fans of the OS Plan 9 from Bell Labs
From: ron minnich <rminnich@lanl.gov>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] Interesting in trying out Plan 9
Date: Wed, 18 Feb 2004 08:08:29 -0700
Message-ID: <Pine.LNX.4.44.0402180755420.19583-100000@maxroach.lanl.gov>
In-Reply-To: <0aa1b455d0a97802ab518aebbb724d94@vitanuova.com>

On Wed, 18 Feb 2004, C H Forsyth wrote:

> >>well I don't see that kind of stuff here on 1000s of linux nodes. That
>
> mainly computational?  how often are they rebooted (eg, between tasks)?
> just curious.

computational. We count the whole system as "up" if we have more than 1000
of 1024 nodes up -- this is ok on clusters, since the system is
usable even when some fraction is not there -- contrast with a Cray I used
to use that would be totally dead if one wire was flaky. Whole system
uptime is basically measured in units of quarters (3 months) and whole
system reboots generally happen due to hardware failure and the need to
power cycle everything. We lose about one node a week due to myrinet
hardware issues (usually lasers, sometimes cards). If our 1024-node
cluster were only Ethernet connected I don't think it would go down ever.

LLNL sees similar results with their MCR and other linux clusters, as does
industry. LLNL's MCR (diskful nodes) has about 4-5 nodes go out a month, I
believe usually due to disk failures. At a recent talk, PNNL, with 1024
nodes with 5 disks each, said they replace 5 disks A DAY. Pink, with zero
disks, only has a node die when Myrinet goes south -- as mentioned this
seems to be about one node a week.
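
A rough back-of-envelope sketch of the figures above, in Python. The
numbers are the ones quoted in this message; the function names, the
52-week and 365-day year, and the reading of Pink as the 1024-node
cluster described earlier are assumptions made for illustration only.

```python
# Back-of-envelope failure rates from the figures quoted in the message.
# Assumptions (not from the message): 52 weeks/year, 365 days/year,
# and that Pink is the 1024-node cluster with the ">1000 up" rule.
WEEKS_PER_YEAR = 52
DAYS_PER_YEAR = 365


def system_is_up(nodes_up: int, threshold: int = 1000) -> bool:
    """The whole system counts as "up" when more than `threshold` nodes are up."""
    return nodes_up > threshold


def annual_failure_fraction(failures_per_year: float, population: int) -> float:
    """Fraction of the population that fails (or is replaced) in a year."""
    return failures_per_year / population


# Pink: diskless, about one node lost per week out of 1024.
pink_nodes = 1024
pink_rate = annual_failure_fraction(1 * WEEKS_PER_YEAR, pink_nodes)

# PNNL: 1024 nodes with 5 disks each, about 5 disks replaced per day.
pnnl_disks = 1024 * 5
pnnl_rate = annual_failure_fraction(5 * DAYS_PER_YEAR, pnnl_disks)

print(f"Pink: {pink_rate:.1%} of nodes per year")
print(f"PNNL: {pnnl_rate:.1%} of disks per year")
```

Under those assumptions Pink loses roughly 5% of its nodes per year,
while PNNL replaces roughly a third of its disks per year.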

But note that all these failures are hardware.

The lesson is simple: pay the 20% increment over "cheap white boxes" and
get huge gains in hardware reliability, with consequent gains in uptime.
Pay 10-20x the cheap white box and you'll get ripped off, with reliability
typically LOWER for "Enterprise Class" systems (yes, there are people who
pay 20x cheap white box cost).  My friends who run big linux clusters tell
me that linux is not the component that drives failure -- it's hardware.
"Weird" errors as mentioned earlier in this thread can usually be traced
to dicey hardware -- there's a lot of that out there. SDRAM is a
particular culprit, since many vendors lie -- yes, I mean LIE -- about
what the timing on the SDRAM is.

I like Plan 9, but I think some of the Linux slamming that goes on in this
list is not justified based on measured systems.

ron


