To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
In-reply-to: Your message of "Sun, 20 Sep 2009 23:37:02 EDT." <4d29649e5c597cd8ebd627a2d65f2c9e@quanstro.net>
References: <4d29649e5c597cd8ebd627a2d65f2c9e@quanstro.net>
From: Bakul Shah
Date: Mon, 21 Sep 2009 10:43:10 -0700
Message-Id: <20090921174310.D5B4C5B55@mail.bitblocks.com>
Subject: Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS

> > > 8 bits/byte * 1e12 bytes / 1e14 bits/ure = 8%
> >
> > Isn't that the probability of getting a bad sector when you
> > read a terabyte? In other words, this is not related to the
> > disk size but to how much you read from the given disk. Granted
> > that when you "resilver" you have no choice but to read the
> > entire disk, and that is why just one redundant disk is not
> > good enough for TB size disks (if you lose a disk there is an 8%
> > chance you copied a bad block in resilvering a mirror).
>
> see below. i think you're confusing a single disk 8% chance
> of failure with a 3 disk tb array, with a 1e-7% chance of failure.

I was talking about the case where you replace a disk in a
mirror. To rebuild the mirror the new disk has to be
initialized from the remaining "good" disk (and there is no
redundancy left), so you have to read the whole disk. This
implies an 8% chance of hitting a bad sector. The situation is
worse in an N+1 disk RAID-5 when you lose a disk: now you have
an N*8% chance of a bad sector. And of course in real life
things are worse, because disks in a cheap array usually don't
have independent power supplies (and the shared one can be
underpowered or under-regulated).

> i would think this is acceptable. at these low levels, something
> else is going to get you -- like drives failing non-independently,
> say because of power problems.

An 8% failure rate for an array rebuild may or may not be
acceptable, depending on your application.

> > > i'm a little too lazy to calculate what the probability is that
> > > another sector in the row is also bad. (this depends on
> > > stripe size, the number of disks in the raid, etc.) but it's
> > > safe to say that it's pretty small. for a 3 disk raid 5 with
> > > 64k stripes it would be something like
> > >    8 bits/byte * 64k * 3 / 1e14 = 1e-8
> >
> > The read error prob. for a 64K byte stripe is 3*2^19/10^14 ~=
> > 3*0.5E-8, since three 64k byte blocks have to be read. The
> > unrecoverable case is two of them being bad at the same time.
> > The prob. of this is 3*0.25E-16 (not sure I did this right --
>
> thanks for noticing that. i think i didn't explain myself well.
> i was calculating the rough probability of a ure in reading the
> *whole array*, not just one stripe.
>
> to do this more methodically using your method, we need
> to count up all the possible ways of getting a double fail
> with 3 disks and multiply by the probability of getting that
> sort of failure and then add 'em up. if 0 is ok and 1 is fail,
> then i think there are these cases:
>
>    0 0 0
>    1 0 0
>    0 1 0
>    0 0 1
>    1 1 0
>    1 0 1
>    0 1 1
>    1 1 1
>
> so there are 4 ways to fail. the 3 double fails have a probability of
>    3*(2^9 bits * 1e-14 1/bit)^2

Why 2^9 bits? A sector is 2^9 bytes, or 2^12 bits. Note that
there is no recovery possible for fewer bits than a sector.
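
As an aside, here is the rebuild arithmetic from earlier in this
message as a minimal Python sketch. The 1 TB disk size and the
1e-14 per-bit unrecoverable read error rate are the figures quoted
above; N = 4 is only an illustrative RAID-5 width, not something
from this thread.

    # Back-of-the-envelope rebuild arithmetic, assuming 1 TB (1e12 byte)
    # disks and a 1e-14 per-bit unrecoverable read error (URE) rate.
    import math

    URE_PER_BIT = 1e-14
    BITS_PER_DISK = 8 * 1e12

    # expected UREs while reading one disk end to end -- the "8%" figure
    expected_ures = BITS_PER_DISK * URE_PER_BIT              # 0.08

    # probability of at least one URE while rebuilding a mirror,
    # i.e. while reading the lone surviving disk in full
    p_mirror = -math.expm1(-BITS_PER_DISK * URE_PER_BIT)     # ~7.7%

    # same for an N+1 disk RAID-5 rebuild: all N surviving disks must be
    # read in full; N = 4 is just an illustrative value
    N = 4
    p_raid5 = -math.expm1(-N * BITS_PER_DISK * URE_PER_BIT)  # ~27%

    print(expected_ures, p_mirror, p_raid5)
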
> and the triple fail has a probability of
>    (2^9 bits * 1e-14 1/bit)^3
> so we have
>    3*(2^9 bits * 1e-14 1/bit)^2 + (2^9 bits * 1e-14 1/bit)^3 ~=
>    3*(2^9 bits * 1e-14 1/bit)^2
>    = 8.24633720832e-17

With 2^12 bits per sector this becomes

    3*(2^12 bits * 1e-14 1/bit)^2 + (2^12 bits * 1e-14 1/bit)^3 ~=
    3*(2^12 bits * 1e-14 1/bit)^2 ~= 3*(1e-11)^2 = 3E-22

If per sector recovery is done, you have 3E-22*(64K/512) = 3.84E-20
per stripe.

> that's per stripe. if we multiply by 1e12/(64*1024) stripes/array,
> we have
>    = 1.2582912e-09

For the whole 2TB array you have 3E-22*(10^12/512) ~= 6E-13.

> which is remarkably close to my lousy first guess. so we went
> from 8e-2 to 1e-9 for an improvement of 7 orders of magnitude.
>
> > we have to consider the exact same sector # going bad in two
> > of the three disks and there are three such pairs).
>
> the exact sector doesn't matter. i don't know any
> implementations that try to do partial stripe recovery.

If by partial stripe recovery you mean that 2 of the stripes must
be entirely error free to recreate the third, your logic seems
wrong even after we replace 2^9 with 2^12 bits. When you have
only stripe level recovery you will throw away whole stripes
even where sector level recovery would've worked. If, for
example, each stripe has two sectors on a 3 disk raid5, and
sector 0 of disk 0's stripe and sector 1 of disk 1's stripe are
bad, then sector level recovery would work but stripe level
recovery would fail. For 2 sector stripes and 3 disks you have
64 possible outcomes, out of which 48 result in bad data with
sector level recovery and 54 with stripe level recovery (see the
table below; a short check of these counts follows it). And it
will get worse with larger stripes. [By "bad data" I mean we
throw away the whole stripe even if only one sector can't be
recovered.] I did some googling but didn't discover anything
that does a proper statistical analysis.

---------------------------------------------
disk0 stripe sectors (1 means read failed)
|  disk1 stripe sectors
|  |  disk2 stripe sectors
|  |  |  sector level recovery possible? (if so, the stripe can be recovered)
|  |  |  |  stripe level recovery possible?
|  |  |  |  |
00 00 00
00 00 01
00 00 10
00 00 11
00 01 00
00 01 01 N  N
00 01 10    N
00 01 11 N  N
00 10 00
00 10 01    N
00 10 10 N  N
00 10 11 N  N
00 11 00
00 11 01 N  N
00 11 10 N  N
00 11 11 N  N
01 00 00
01 00 01 N  N
01 00 10    N
01 00 11 N  N
01 01 00 N  N
01 01 01 N  N
01 01 10 N  N
01 01 11 N  N
01 10 00    N
01 10 01 N  N
01 10 10 N  N
01 10 11 N  N
01 11 00 N  N
01 11 01 N  N
01 11 10 N  N
01 11 11 N  N
10 00 00
10 00 01    N
10 00 10 N  N
10 00 11 N  N
10 01 00    N
10 01 01 N  N
10 01 10 N  N
10 01 11 N  N
10 10 00 N  N
10 10 01 N  N
10 10 10 N  N
10 10 11 N  N
10 11 00 N  N
10 11 01 N  N
10 11 10 N  N
10 11 11 N  N
11 00 00
11 00 01 N  N
11 00 10 N  N
11 00 11 N  N
11 01 00 N  N
11 01 01 N  N
11 01 10 N  N
11 01 11 N  N
11 10 00 N  N
11 10 01 N  N
11 10 10 N  N
11 10 11 N  N
11 11 00 N  N
11 11 01 N  N
11 11 10 N  N
11 11 11 N  N
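
To double check the counts in the table, here is a quick brute
force enumeration in Python (the function names are mine; it just
recounts the 64 cases above under the two recovery policies).

    # Brute-force check of the table: 3 disks, 2 sectors per stripe,
    # a 1 bit means that sector's read failed.
    from itertools import product

    outcomes = list(product(product((0, 1), repeat=2), repeat=3))  # 64 cases

    def sector_level_ok(disks):
        # recoverable if, per sector position, at most one disk failed there
        return all(sum(d[s] for d in disks) <= 1 for s in range(2))

    def stripe_level_ok(disks):
        # recoverable if at most one disk has any failed sector in its stripe
        return sum(1 for d in disks if any(d)) <= 1

    bad_sector_level = sum(1 for o in outcomes if not sector_level_ok(o))
    bad_stripe_level = sum(1 for o in outcomes if not stripe_level_ok(o))

    print(len(outcomes), bad_sector_level, bad_stripe_level)  # 64 48 54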