From mboxrd@z Thu Jan 1 00:00:00 1970
To: 9fans@9fans.net
Date: Fri, 5 Mar 2010 10:01:27 +0000
From: "hugh@mimosa.com"
Message-ID:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
References: <8f6ef34730ac116e3d6a1d45ac557816@ladd.quanstro.net>
Subject: Re: [9fans] pineview atom
Topicbox-Message-UUID: e0918ad4-ead5-11e9-9d60-3106f5b1d025

On Feb 21, 2:48 pm, davide...@cs.cmu.edu (Dave Eckhardt) wrote:
> * Bits were flipping pretty often.  I think we got 10-ish events
> per day.

TLB bits are not like DRAM bits. They were surely static cells, built for speed and functionality (CAM) not density. The cells would be quite large. It is unlikely that this problem came from external radiation. Guess: the problem was a marginal design of the circuitry.

At about that time DRAM cells seemed to be suffering from radiation-induced bit flips. It was felt that 16Kbit chips would be the limit because of this (please realise that my own memory might be slightly faulty). It turned out that the radiation was actually coming from the chip packaging material. Once that was sorted, RAM density marched on to where we are now.

As cells shrink, and voltages shrink, I understand that radiation can have greater effects.

Eventually mainstream systems will have ECC. But I've been thinking this for as long as there have been personal computers built out of microprocessors.

Adding ECC to memory seems to me to be an easy no-brainer. Adding it "everywhere" in processors does not seem easy.

Actually, even adding it in memory isn't that easy. In the old days, a simple Hamming code was good enough because each bit in a word lived on a different chip. Now memory chips are wider and so the code has to account for multi-bit errors (flipping of bits is not independent).

Cray famously said "Parity is for farmers".
It was an obscure joke (referring to some US agricultural subsidy) but really he meant that he didn't want to waste circuitry on error checking (as I understand it). This was one of the things that made me averse to his systems.

It is really hard to guess what the conversion rate of bit flips into observed anomalies is on ordinary systems. I wonder if any research has been done on this. In the real world, software bugs surely take most of the blame.

Users seem to have been trained to accept lower reliability in computer systems. Apple seems to be one of the few vendors that might be able to market the idea of ECC to consumers.
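
To make the Hamming point above concrete, here is a minimal sketch in Python. It assumes a toy Hamming(7,4) code rather than the wider codes real ECC memory uses, and the function names (encode, decode, flip) are made up for illustration. It shows that one flipped bit is corrected, while two flipped bits in the same word (as could happen when one wide chip supplies several bits of the word) are silently "corrected" into the wrong data.

```python
def encode(nibble):
    """Encode 4 data bits as a Hamming(7,4) codeword (list of bits, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused, so index == position
    code[3], code[5], code[6], code[7] = d     # data bits sit at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions with bit 2 set
    return code[1:]


def syndrome(word):
    """XOR of the positions of all set bits; zero for a valid codeword."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def decode(word):
    """Assume at most one bit flipped: fix the position named by the syndrome."""
    word = list(word)
    s = syndrome(word)
    if s:
        word[s - 1] ^= 1
    data = [word[2], word[4], word[5], word[6]]  # positions 3, 5, 6, 7
    return sum(bit << i for i, bit in enumerate(data))


def flip(word, *positions):
    """Return a copy of word with the given 1-based positions inverted."""
    word = list(word)
    for pos in positions:
        word[pos - 1] ^= 1
    return word


word = encode(0b1011)
assert decode(flip(word, 3)) == 0b1011      # single flip: corrected
assert decode(flip(word, 3, 5)) != 0b1011   # double flip: miscorrected, no warning
```

With a double flip the syndrome is still non-zero, so the decoder flips a third, innocent bit and lands on a different valid codeword. Adding one overall parity bit would let the decoder detect (but not correct) that case; handling a whole failed multi-bit chip needs a stronger code than this sketch.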