From mboxrd@z Thu Jan 1 00:00:00 1970
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
In-reply-to: Your message of "Thu, 05 Jan 2012 13:43:49 EST."
 <0461bed1c9a218dc56f80ade41b34617@chula.quanstro.net>
References: <20120105124852.GA940@polynum.com>
 <20120105144810.182add7c@wks-ddc.exosec.local>
 <20120105151518.GB435@polynum.com>
 <20120105163907.GA761@polynum.com>
 <20120105175106.6FAD9B852@mail.bitblocks.com>
 <20120105182546.D1AF3B858@mail.bitblocks.com>
 <0461bed1c9a218dc56f80ade41b34617@chula.quanstro.net>
Date: Thu, 5 Jan 2012 11:56:03 -0800
From: Bakul Shah
Message-Id: <20120105195603.EE148B852@mail.bitblocks.com>
Subject: Re: [9fans] venti and "contrib": RFC
Topicbox-Message-UUID: 55fcc92c-ead7-11e9-9d60-3106f5b1d025

On Thu, 05 Jan 2012 13:43:49 EST erik quanstrom wrote:
> On Thu Jan 5 13:26:16 EST 2012, bakul@bitblocks.com wrote:
> > On Thu, 05 Jan 2012 13:01:52 EST erik quanstrom wrote:
> > > > if you read 1TB, you have 8% chance of a silent bad read
> > > > sector. More important to worry about that in today's world
> > > > than optimizing disk space use.
> > >
> > > do you have a citation for this? i know if you work out the
> > > numbers from the BER, this is about what you get, but in
> > > practice i do not see this 8%. we do pattern writes all the
> > > time, and i can't recall the last time i saw a "silent" read error.
> >
> > Silent == unseen! Do you log RAID errors? Only way to catch them.
> >
> > That number is derived purely on an bit error rate (I think
> > vendors base that on the Reed-Solomon code used). No idea how
> > "uniformly random" the data (or medium) is in practice. I
> > thought the "practice" was worse!
>
> i thought your definition of silent was not caught by the on-drive
> ecc. i think this is not very likely, and we're explicitly checking for

Hmm.... You are right!
I meant *uncorrectable* read errors (URE), which are not
necessarily *undetectable* errors (where a data pattern switches
to another pattern mapping to the same syndrome bits). Clearly my
memory by now has had much more massive bit-errors! Still, a
consumer disk URE rate of 10^-14 (one error per 10^14 bits read)
coupled with large disk sizes does mean RAID is essential.

> this by running massive numbers of disks through pattern writes with
> verification, and don't see it.

Are these new disks? The rate goes up with age. Do SMART stats
show any new errors? It is also possible vendors are
*conservatively* specifying 10^-14 (though I no longer know how
they arrive at the URE number!).

Can you share what you did discover? [offline, if you don't want
to broadcast]

You've probably read
http://research.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.ps
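
P.S. For anyone who wants to check the 8% figure, here is a rough
sketch of the arithmetic. It assumes 1TB means 10^12 bytes and that
bit errors are independent at the quoted rate of 10^-14 per bit read,
which real drives may well not obey. The function name is just
something I made up for the illustration.

```python
import math


def p_at_least_one_ure(bytes_read, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error.

    Assumes independent bit errors at a fixed per-bit rate (a
    simplification; real media errors tend to be correlated).
    """
    bits = bytes_read * 8
    # 1 - (1 - p)^n, written with log1p/expm1 so the tiny p is not
    # lost to floating-point rounding.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


TB = 10**12  # assumption: decimal terabyte, as disk vendors count
print(f"{p_at_least_one_ure(TB):.3f}")  # about 0.077, i.e. roughly 8%
```

Under the same assumptions, reading a multi-terabyte disk end to end
(say, during a RAID rebuild) pushes the probability up quickly, which
is the point about large disks above.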