To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
In-reply-to: Your message of "Sun, 20 Sep 2009 23:37:02 EDT." <4d29649e5c597cd8ebd627a2d65f2c9e@quanstro.net>
References: <4d29649e5c597cd8ebd627a2d65f2c9e@quanstro.net>
From: Bakul Shah
Date: Mon, 21 Sep 2009 10:43:10 -0700
Message-Id: <20090921174310.D5B4C5B55@mail.bitblocks.com>
Subject: Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS

> > > 8 bits/byte * 1e12 bytes / 1e14 bits/ure = 8%
> >
> > Isn't that the probability of getting a bad sector when you
> > read a terabyte? In other words, this is not related to the
> > disk size but to how much you read from the given disk. Granted
> > that when you "resilver" you have no choice but to read the
> > entire disk, and that is why just one redundant disk is not
> > good enough for TB size disks (if you lose a disk there is an 8%
> > chance you copied a bad block in resilvering a mirror).
>
> see below. i think you're confusing a single disk 8% chance
> of failure with a 3 disk tb array, with a 1e-7% chance of failure.

I was talking about the case where you replace a disk in a
mirror. To rebuild the mirror the new disk has to be
initialized from the remaining "good" disk (and there is no
redundancy left), so you have to read the whole disk. This
implies an 8% chance of hitting a bad sector. The situation is
worse in an N+1 disk RAID-5 when you lose a disk: now you have
an N*8% chance of a bad sector. And of course in real life
things are worse, because disks in a cheap array usually don't
have independent power supplies (and the shared one can be
underpowered or under-regulated).

> i would think this is acceptable. at these low levels, something
> else is going to get you -- like drives failing non-independently,
> say because of power problems.

An 8% failure rate for an array rebuild may or may not be
acceptable, depending on your application.

> > > i'm a little too lazy to calculate what the probability is that
> > > another sector in the row is also bad. (this depends on
> > > stripe size, the number of disks in the raid, etc.) but it's
> > > safe to say that it's pretty small. for a 3 disk raid 5 with
> > > 64k stripes it would be something like
> > >    8 bits/byte * 64k * 3 / 1e14 = 1e-8
> >
> > The read error prob. for a 64K byte stripe is 3*2^19/10^14 ~=
> > 3*0.5E-8, since three 64k byte blocks have to be read. The
> > unrecoverable case is two of them being bad at the same time.
> > The prob. of this is 3*0.25E-16 (not sure I did this right --
>
> thanks for noticing that. i think i didn't explain myself well.
> i was calculating the rough probability of a ure in reading the
> *whole array*, not just one stripe.
>
> to do this more methodically using your method, we need
> to count up all the possible ways of getting a double fail
> with 3 disks and multiply by the probability of getting that
> sort of failure and then add 'em up. if 0 is ok and 1 is fail,
> then i think there are these cases:
>
>    0 0 0
>    1 0 0
>    0 1 0
>    0 0 1
>    1 1 0
>    1 0 1
>    0 1 1
>    1 1 1
>
> so there are 4 ways to fail. the 3 double fails have a probability of
>    3*(2^9 bits * 1e-14 1/bit)^2

Why 2^9 bits? A sector is 2^9 bytes, or 2^12 bits. Note that
there is no recovery possible for fewer bits than a sector.
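
As an aside, here is the rebuild arithmetic from earlier in this
message as a minimal Python sketch. The 1 TB disk size and the
1e-14 per-bit unrecoverable read error rate are the figures quoted
above; N = 4 is only an illustrative RAID-5 width, not something
from this thread.

    # Back-of-the-envelope rebuild arithmetic, assuming 1 TB (1e12 byte)
    # disks and a 1e-14 per-bit unrecoverable read error (URE) rate.
    import math

    URE_PER_BIT = 1e-14
    BITS_PER_DISK = 8 * 1e12

    # expected UREs while reading one disk end to end -- the "8%" figure
    expected_ures = BITS_PER_DISK * URE_PER_BIT              # 0.08

    # probability of at least one URE while rebuilding a mirror,
    # i.e. while reading the lone surviving disk in full
    p_mirror = -math.expm1(-BITS_PER_DISK * URE_PER_BIT)     # ~7.7%

    # same for an N+1 disk RAID-5 rebuild: all N surviving disks must be
    # read in full; N = 4 is just an illustrative value
    N = 4
    p_raid5 = -math.expm1(-N * BITS_PER_DISK * URE_PER_BIT)  # ~27%

    print(expected_ures, p_mirror, p_raid5)
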
> and the triple fail has a probability of
>    (2^9 bits * 1e-14 1/bit)^3
> so we have
>    3*(2^9 bits * 1e-14 1/bit)^2 + (2^9 bits * 1e-14 1/bit)^3 ~=
>    3*(2^9 bits * 1e-14 1/bit)^2
>    = 8.24633720832e-17

With 2^12 bits per sector this becomes

    3*(2^12 bits * 1e-14 1/bit)^2 + (2^12 bits * 1e-14 1/bit)^3 ~=
    3*(2^12 bits * 1e-14 1/bit)^2 ~= 3*(1e-11)^2 = 3E-22

If per sector recovery is done, you have 3E-22*(64K/512) = 3.84E-20
per stripe.

> that's per stripe. if we multiply by 1e12/(64*1024) stripes/array,
> we have
>    = 1.2582912e-09

For the whole 2TB array you have 3E-22*(10^12/512) ~= 6E-13.

> which is remarkably close to my lousy first guess. so we went
> from 8e-2 to 1e-9 for an improvement of 7 orders of magnitude.
>
> > we have to consider the exact same sector # going bad in two
> > of the three disks and there are three such pairs).
>
> the exact sector doesn't matter. i don't know any
> implementations that try to do partial stripe recovery.

If by partial stripe recovery you mean that 2 of the stripes must
be entirely error free to recreate the third, your logic seems
wrong even after we replace 2^9 with 2^12 bits. When you have
only stripe level recovery you will throw away whole stripes
even where sector level recovery would've worked. If, for
example, each stripe has two sectors on a 3 disk raid5, and
sector 0 of disk 0's stripe and sector 1 of disk 1's stripe are
bad, then sector level recovery would work but stripe level
recovery would fail. For 2 sector stripes and 3 disks you have
64 possible outcomes, out of which 48 result in bad data with
sector level recovery and 54 with stripe level recovery (see the
table below; a short check of these counts follows it). And it
will get worse with larger stripes. [By "bad data" I mean we
throw away the whole stripe even if only one sector can't be
recovered.] I did some googling but didn't discover anything
that does a proper statistical analysis.

---------------------------------------------
disk0 stripe sectors (1 means read failed)
|  disk1 stripe sectors
|  |  disk2 stripe sectors
|  |  |  sector level recovery possible? (if so, the stripe can be recovered)
|  |  |  |  stripe level recovery possible?
|  |  |  |  |
00 00 00
00 00 01
00 00 10
00 00 11
00 01 00
00 01 01 N  N
00 01 10    N
00 01 11 N  N
00 10 00
00 10 01    N
00 10 10 N  N
00 10 11 N  N
00 11 00
00 11 01 N  N
00 11 10 N  N
00 11 11 N  N
01 00 00
01 00 01 N  N
01 00 10    N
01 00 11 N  N
01 01 00 N  N
01 01 01 N  N
01 01 10 N  N
01 01 11 N  N
01 10 00    N
01 10 01 N  N
01 10 10 N  N
01 10 11 N  N
01 11 00 N  N
01 11 01 N  N
01 11 10 N  N
01 11 11 N  N
10 00 00
10 00 01    N
10 00 10 N  N
10 00 11 N  N
10 01 00    N
10 01 01 N  N
10 01 10 N  N
10 01 11 N  N
10 10 00 N  N
10 10 01 N  N
10 10 10 N  N
10 10 11 N  N
10 11 00 N  N
10 11 01 N  N
10 11 10 N  N
10 11 11 N  N
11 00 00
11 00 01 N  N
11 00 10 N  N
11 00 11 N  N
11 01 00 N  N
11 01 01 N  N
11 01 10 N  N
11 01 11 N  N
11 10 00 N  N
11 10 01 N  N
11 10 10 N  N
11 10 11 N  N
11 11 00 N  N
11 11 01 N  N
11 11 10 N  N
11 11 11 N  N
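
To double check the counts in the table, here is a quick brute
force enumeration in Python (the function names are mine; it just
recounts the 64 cases above under the two recovery policies).

    # Brute-force check of the table: 3 disks, 2 sectors per stripe,
    # a 1 bit means that sector's read failed.
    from itertools import product

    outcomes = list(product(product((0, 1), repeat=2), repeat=3))  # 64 cases

    def sector_level_ok(disks):
        # recoverable if, per sector position, at most one disk failed there
        return all(sum(d[s] for d in disks) <= 1 for s in range(2))

    def stripe_level_ok(disks):
        # recoverable if at most one disk has any failed sector in its stripe
        return sum(1 for d in disks if any(d)) <= 1

    bad_sector_level = sum(1 for o in outcomes if not sector_level_ok(o))
    bad_stripe_level = sum(1 for o in outcomes if not stripe_level_ok(o))

    print(len(outcomes), bad_sector_level, bad_stripe_level)  # 64 48 54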