From: ron minnich
To: 9fans@cse.psu.edu
Subject: Re: [9fans] Interesting in trying out Plan 9
In-Reply-To: <0aa1b455d0a97802ab518aebbb724d94@vitanuova.com>
Date: Wed, 18 Feb 2004 08:08:29 -0700

On Wed, 18 Feb 2004, C H Forsyth wrote:

> >> well I don't see that kind of stuff here on 1000s of linux nodes. That
>
> mainly computational? how often are they rebooted (eg, between tasks)?
> just curious.

computational. We count the whole system as "up" if we have more than
1000 of 1024 nodes up -- this is OK on clusters, since the system is
usable even when some fraction is not there. Contrast that with a Cray I
used to use that would be totally dead if one wire was flaky.
Whole-system uptime is basically measured in units of quarters (3
months), and whole-system reboots generally happen due to hardware
failure and the need to power-cycle everything.

We lose about one node a week due to Myrinet hardware issues (usually
lasers, sometimes cards). If our 1024-node cluster were only
Ethernet-connected, I don't think it would ever go down.

LLNL sees similar results with their MCR and other linux clusters, as
does industry. LLNL's MCR (diskful nodes) has about 4-5 nodes go out a
month, I believe usually due to disk failures. At a recent talk, PNNL,
with 1024 nodes of 5 disks each, said they replace 5 disks A DAY. Pink,
with zero disks, only has a node die when Myrinet goes south -- as
mentioned, this seems to be about one node a week.

But note that all these failures are hardware. The lesson is simple: pay
the 20% increment over "cheap white boxes" and get huge gains in
hardware reliability, with consequent gains in uptime. Pay 10-20x the
cheap-white-box price and you'll get ripped off, with reliability
typically LOWER for "Enterprise Class" systems (yes, there are people
who pay 20x the cheap-white-box cost).

My friends who run big linux clusters tell me that linux is not the
component that drives failure -- it's hardware. "Weird" errors like
those mentioned earlier in this thread can usually be traced to dicey
hardware -- there's a lot of that out there. SDRAM is a particular
culprit, since many vendors lie -- yes, I mean LIE -- about what the
timing on the SDRAM is.

I like Plan 9, but I think some of the Linux slamming that goes on on
this list is not justified based on measured systems.

ron
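
Purely as a back-of-envelope check, here is a short Python sketch that
works through the arithmetic implied by the figures quoted above (1024
nodes, a 1000-node "up" threshold, one Myrinet loss a week, and PNNL's
5 disk replacements a day across 1024 x 5 drives). The 13-weeks-per-quarter
conversion is an assumption for illustration, not a number from the mail.

    # Back-of-envelope check of the failure figures quoted above; the
    # weeks-per-quarter conversion is assumed, everything else is from the mail.

    nodes = 1024
    up_threshold = 1000                 # "up" means more than 1000 of 1024 nodes
    down_budget = nodes - up_threshold
    print("tolerated down nodes: %d (%.1f%% of the cluster)"
          % (down_budget, 100.0 * down_budget / nodes))

    myrinet_losses_per_week = 1         # "about one node a week"
    weeks_per_quarter = 13              # assumption: roughly 13 weeks in a quarter
    print("nodes lost per quarter: %d"
          % (myrinet_losses_per_week * weeks_per_quarter))

    # PNNL: 1024 nodes with 5 disks each, replacing 5 disks a day
    disks = 1024 * 5
    replaced_per_day = 5
    print("mean days between replacements per disk: %d"
          % (disks // replaced_per_day))
    print("implied annual replacement rate: %.0f%%"
          % (100.0 * replaced_per_day * 365 / disks))

On those numbers, one Myrinet loss a week accumulates to about 13 down
nodes in a quarter, which still fits inside the 24-node "up" budget --
consistent with whole-system uptime being measured in quarters.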