From mboxrd@z Thu Jan  1 00:00:00 1970
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
In-reply-to: Your message of "Thu, 05 Jan 2012 13:43:49 EST."
	<0461bed1c9a218dc56f80ade41b34617@chula.quanstro.net>
References: <20120105124852.GA940@polynum.com>
	<20120105144810.182add7c@wks-ddc.exosec.local>
	<20120105151518.GB435@polynum.com>
	<CADSkJJXMBGnjtXbL5w+WMuxVbnW_+D6hQXYMN04wzYu+yXaCzA@mail.gmail.com>
	<20120105163907.GA761@polynum.com>
	<20120105175106.6FAD9B852@mail.bitblocks.com>
	<ea7157b338d53727cc50bd6fee724a5f@chula.quanstro.net>
	<20120105182546.D1AF3B858@mail.bitblocks.com>
	<0461bed1c9a218dc56f80ade41b34617@chula.quanstro.net>
Date: Thu,  5 Jan 2012 11:56:03 -0800
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20120105195603.EE148B852@mail.bitblocks.com>
Subject: Re: [9fans] venti and "contrib": RFC
Topicbox-Message-UUID: 55fcc92c-ead7-11e9-9d60-3106f5b1d025

On Thu, 05 Jan 2012 13:43:49 EST erik quanstrom <quanstro@quanstro.net>  wrote:
> On Thu Jan  5 13:26:16 EST 2012, bakul@bitblocks.com wrote:
> > On Thu, 05 Jan 2012 13:01:52 EST erik quanstrom <quanstro@quanstro.net> wrote:
> > > > if you read 1TB, you have 8% chance of a silent bad read
> > > > sector.  More important to worry about that in today's world
> > > > than optimizing disk space use.
> > >
> > > do you have a citation for this?  i know if you work out the
> > > numbers from the BER, this is about what you get, but in
> > > practice i do not see this 8%.  we do pattern writes all the
> > > time, and i can't recall the last time i saw a "silent" read error.
> >
> > Silent == unseen! Do you log RAID errors? Only way to catch them.
> >
> > That number is derived purely from a bit error rate (I think
> > vendors base that on the Reed-Solomon code used). No idea how
> > "uniformly random" the data (or medium) is in practice. I
> > thought the "practice" was worse!
>
> i thought your definition of silent was not caught by the on-drive
> ecc.  i think this is not very likely,   and we're explicitly checking for

Hmm.... You are right!  I meant *uncorrectable* read errors
(URE), which are not necessarily *undetectable* errors (where
a data pattern switches to another pattern mapping to the same
syndrome bits).  Clearly my memory has suffered far more
massive bit errors by now!  Still, a consumer disk URE rate of
10^-14 (one per 10^14 bits read), coupled with large disk
sizes, does mean RAID is essential.
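
The 8% figure quoted above can be checked with a few lines of
Python.  This is only a back-of-the-envelope sketch: it assumes
every bit fails independently at the spec-sheet rate of 10^-14
and that 1TB means 10^12 bytes.  The function name is made up
for illustration, and real disks need not behave this uniformly.

```python
import math

def p_at_least_one_ure(bytes_read, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error.

    Assumption: each bit read fails independently with
    probability ure_per_bit (the vendor's quoted URE rate).
    """
    bits = bytes_read * 8
    # P(no error) = (1 - r)^bits; log1p/expm1 keep precision
    # when r is tiny and bits is huge.
    return -math.expm1(bits * math.log1p(-ure_per_bit))

if __name__ == "__main__":
    for tb in (1, 2, 4):
        p = p_at_least_one_ure(tb * 1e12)
        print(f"{tb} TB read: {p:.1%} chance of at least one URE")
```

Under these assumptions a full read of 1TB hits at least one URE
a bit under 8% of the time, and the chance grows with the amount
read, which is what makes reading back a large disk in full risky.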

> this by running massive numbers of disks through pattern writes with
> verification, and don't see it.

Are these new disks?  The rate goes up with age.  Do SMART
stats show any new errors?  It is also possible vendors are
*conservatively* specifying 10^-14 (though I no longer know
how they arrive at the URE number!).  Can you share what you
did discover? [offline, if you don't want to broadcast]

You've probably read
http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.ps