From: ron minnich
To: 9fans@cse.psu.edu
Subject: Re: [9fans] Interesting in trying out Plan 9
In-Reply-To: <0aa1b455d0a97802ab518aebbb724d94@vitanuova.com>
Date: Wed, 18 Feb 2004 08:08:29 -0700

On Wed, 18 Feb 2004, C H Forsyth wrote:

> >> well I don't see that kind of stuff here on 1000s of linux nodes. That
>
> mainly computational? how often are they rebooted (eg, between tasks)?
> just curious.

computational. We count the whole system as "up" if we have more than
1000 of 1024 nodes up -- this is OK on clusters, since the system is
usable even when some fraction is not there. Contrast that with a Cray I
used to use that would be totally dead if one wire was flaky.
Whole-system uptime is basically measured in units of quarters (3
months), and whole-system reboots generally happen due to hardware
failure and the need to power-cycle everything.

We lose about one node a week due to Myrinet hardware issues (usually
lasers, sometimes cards). If our 1024-node cluster were only
Ethernet-connected, I don't think it would ever go down.

LLNL sees similar results with their MCR and other linux clusters, as
does industry. LLNL's MCR (diskful nodes) has about 4-5 nodes go out a
month, I believe usually due to disk failures. At a recent talk, PNNL,
with 1024 nodes of 5 disks each, said they replace 5 disks A DAY. Pink,
with zero disks, only has a node die when Myrinet goes south -- as
mentioned, this seems to be about one node a week.

But note that all these failures are hardware. The lesson is simple: pay
the 20% increment over "cheap white boxes" and get huge gains in
hardware reliability, with consequent gains in uptime. Pay 10-20x the
cheap-white-box price and you'll get ripped off, with reliability
typically LOWER for "Enterprise Class" systems (yes, there are people
who pay 20x the cheap-white-box cost).

My friends who run big linux clusters tell me that linux is not the
component that drives failure -- it's hardware. "Weird" errors like
those mentioned earlier in this thread can usually be traced to dicey
hardware -- there's a lot of that out there. SDRAM is a particular
culprit, since many vendors lie -- yes, I mean LIE -- about what the
timing on the SDRAM is.

I like Plan 9, but I think some of the Linux slamming that goes on on
this list is not justified based on measured systems.

ron
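
Purely as a back-of-envelope check, here is a short Python sketch that
works through the arithmetic implied by the figures quoted above (1024
nodes, a 1000-node "up" threshold, one Myrinet loss a week, and PNNL's
5 disk replacements a day across 1024 x 5 drives). The 13-weeks-per-quarter
conversion is an assumption for illustration, not a number from the mail.

    # Back-of-envelope check of the failure figures quoted above; the
    # weeks-per-quarter conversion is assumed, everything else is from the mail.

    nodes = 1024
    up_threshold = 1000                 # "up" means more than 1000 of 1024 nodes
    down_budget = nodes - up_threshold
    print("tolerated down nodes: %d (%.1f%% of the cluster)"
          % (down_budget, 100.0 * down_budget / nodes))

    myrinet_losses_per_week = 1         # "about one node a week"
    weeks_per_quarter = 13              # assumption: roughly 13 weeks in a quarter
    print("nodes lost per quarter: %d"
          % (myrinet_losses_per_week * weeks_per_quarter))

    # PNNL: 1024 nodes with 5 disks each, replacing 5 disks a day
    disks = 1024 * 5
    replaced_per_day = 5
    print("mean days between replacements per disk: %d"
          % (disks // replaced_per_day))
    print("implied annual replacement rate: %.0f%%"
          % (100.0 * replaced_per_day * 365 / disks))

On those numbers, one Myrinet loss a week accumulates to about 13 down
nodes in a quarter, which still fits inside the 24-node "up" budget --
consistent with whole-system uptime being measured in quarters.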