From: Bakul Shah <bakul+plan9@bitblocks.com>
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
Date: Mon, 21 Sep 2009 10:43:10 -0700
Message-ID: <20090921174310.D5B4C5B55@mail.bitblocks.com>
In-Reply-To: Your message of "Sun, 20 Sep 2009 23:37:02 EDT." <4d29649e5c597cd8ebd627a2d65f2c9e@quanstro.net>

> > > 	8 bits/byte * 1e12 bytes / 1e14 bits/ure = 8%
> >
> > Isn't that the probability of getting a bad sector when you
> > read a terabyte? In other words, this is not related to the
> > disk size but how much you read from the given disk. Granted
> > that when you "resilver" you have no choice but to read the
> > entire disk and that is why just one redundant disk is not
> > good enough for TB size disks (if you lose a disk there is 8%
> > chance you copied a bad block in resilvering a mirror).
>
> see below.  i think you're confusing a single disk 8% chance
> of failure with a 3 disk tb array, with a 1e-7% chance of failure.

I was talking about the case where you replace a disk in a
mirror. To rebuild the mirror the new disk has to be
initialized from the remaining "good" disk (and there is no
redundancy left) so you have to read the whole disk. This
implies 8% chance of a bad sector. The situation is worse in
an N+1 disk RAID-5 when you lose a disk. Now you have N*8%
chance of a bad sector. And of course in real life things are
worse because usually disks in a cheap array don't have
independent power supplies (and the shared one can be
underpowered or under-regulated).
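
To put numbers on it, here is a rough sketch in python (just
assuming independent bit errors at the 1e-14/bit URE rate and
the 1TB disks used above; the N*8% in the text is the
small-probability approximation of this):

  # chance of at least one unrecoverable read error (URE) when
  # reading nbytes, assuming independent bit errors at 1e-14/bit
  def p_ure(nbytes, ure_per_bit=1e-14):
      return 1 - (1 - ure_per_bit) ** (8 * nbytes)

  print(p_ure(1e12))    # mirror rebuild, read 1 disk: ~0.077 (the ~8%)
  print(p_ure(4e12))    # 4+1 raid-5 rebuild, read 4 disks: ~0.27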

> i would think this is acceptable.  at these low levels, something
> else is going to get you, like drives failing non-independently,
> say because of power problems.

8% rate for an array rebuild may or may not be acceptable
depending on your application.

> > > i'm a little too lazy to calculate what the probability is that
> > > another sector in the row is also bad.  (this depends on
> > > stripe size, the number of disks in the raid, etc.)  but it's
> > > safe to say that it's pretty small.  for a 3 disk raid 5 with
> > > 64k stripes it would be something like
> > > 	8 bits/byte * 64k * 3 / 1e14 = 1e-8
> >
> > The read error prob. for a 64K byte stripe is 3*2^19/10^14 ~=
> > 3*0.5E-8, since three 64k byte blocks have to be read.  The
> > unrecoverable case is two of them being bad at the same time.
> > The prob. of this is 3*0.25E-16 (not sure I did this right --
>
> thanks for noticing that.  i think i didn't explain myself well
> i was calculating the rough probability of a ure in reading the
> *whole array*, not just one stripe.
>
> to do this more methodically using your method, we need
> to count up all the possible ways of getting a double fail
> with 3 disks and multiply by the probability of getting that
> sort of failure and then add 'em up.  if 0 is ok and 1 is fail,
> then i think there are these cases:
>
> 0 0 0
> 1 0 0
> 0 1 0
> 0 0 1
> 1 1 0
> 1 0 1
> 0 1 1
> 1 1 1
>
> so there are 4 ways to fail.  3 double fail have a probability of
> 3*(2^9 bits * 1e-14 1/ bit)^2

Why 2^9 bits? A sector is 2^9 bytes or 2^12 bits. Note that
there is no recovery possible for fewer bits than a sector.

> and the triple fail has a probability of
> (2^9 bits * 1e-14 1/ bit)^3
> so we have
> 3*(2^9 bits * 1e-14 1/ bit)^2 + (2^9 bits * 1e-14 1/ bit)^3 ~=
> 	3*(2^9 bits * 1e-14 1/ bit)^2
> 	= 8.24633720832e-17

3*(2^12 bits * 1e-14 1/ bit)^2 + (2^12 bits * 1e-14 1/ bit)^3 ~=
  	3*(2^12 bits * 1e-14 1/ bit)^2
	~= 3*(4.1e-11)^2 ~= 5E-21

If per sector recovery is done, you have
	5E-21*(64K/512) ~= 6.4E-19

> that's per stripe.  if we multiply by 1e12/(64*1024) stripes/array,
>
> we have
> 	= 1.2582912e-09

For the whole 2TB array you have

	5E-21*(10^12/512) ~= 1E-11
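
For what it's worth, grinding out the same numbers in python
(again just assuming independent bit errors at 1e-14/bit and
512 byte sectors):

  # a sector is 2^12 bits; a full sector read fails with prob ~4.1e-11
  p_sector = 2**12 * 1e-14
  # two of the three copies of one sector position bad at once;
  # 3 ways to choose the failing pair: ~5e-21
  p_double = 3 * p_sector**2
  print(p_double * (64 * 1024 // 512))  # per 64K stripe: ~6.4e-19
  print(p_double * (10**12 // 512))     # whole array, 1TB/disk: ~1e-11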

> which is remarkably close to my lousy first guess.  so we went
> from 8e-2 to 1e-9 for an improvement of 7 orders of magnitude.
>
> > we have to consider the exact same sector # going bad in two
> > of the three disks and there are three such pairs).
>
> the exact sector doesn't matter.  i don't know any
> implementations that try to do partial stripe recovery.

If by no partial stripe recovery you mean that 2 of the 3
stripe blocks must be entirely error free to recreate the
third, your logic seems wrong even after we replace 2^9 bits
with 2^12 bits.

When you have only stripe level recovery you will throw away
whole stripes even where sector level recovery would've
worked.  If, for example, each stripe has two sectors on a 3
disk raid5, and sector 0 of disk 0's stripe and sector 1 of
disk 1's stripe are bad, sector level recovery would work but
stripe level recovery would fail.  For 2 sector stripes and 3
disks you have 64 possible outcomes, out of which 48 result
in bad data for sector level recovery and 54 for stripe level
recovery (see the table and the sketch below it).  And it
will get worse with larger stripes.  [by bad data I mean we
throw away the whole stripe even if just one sector can't be
recovered]

I did some googling but didn't discover anything that does a
proper statistical analysis.

---------------------------------------------
disk0 stripe sectors (1 means read failed)
|  disk1 stripe sectors
|  |  disk2 stripe sectors
|  |  |  sector level recovery possible? (blank = yes; if so, the stripe can be recovered)
|  |  |  | stripe level recovery possible? (blank = yes)
|  |  |  | |
00 00 00
00 00 01
00 00 10
00 00 11
00 01 00
00 01 01 N N
00 01 10   N
00 01 11 N N
00 10 00
00 10 01   N
00 10 10 N N
00 10 11 N N
00 11 00
00 11 01 N N
00 11 10 N N
00 11 11 N N

01 00 00
01 00 01 N N
01 00 10   N
01 00 11 N N
01 01 00 N N
01 01 01 N N
01 01 10 N N
01 01 11 N N
01 10 00   N
01 10 01 N N
01 10 10 N N
01 10 11 N N
01 11 00 N N
01 11 01 N N
01 11 10 N N
01 11 11 N N

10 00 00
10 00 01   N
10 00 10 N N
10 00 11 N N
10 01 00   N
10 01 01 N N
10 01 10 N N
10 01 11 N N
10 10 00 N N
10 10 01 N N
10 10 10 N N
10 10 11 N N
10 11 00 N N
10 11 01 N N
10 11 10 N N
10 11 11 N N

11 00 00
11 00 01 N N
11 00 10 N N
11 00 11 N N
11 01 00 N N
11 01 01 N N
11 01 10 N N
11 01 11 N N
11 10 00 N N
11 10 01 N N
11 10 10 N N
11 10 11 N N
11 11 00 N N
11 11 01 N N
11 11 10 N N
11 11 11 N N
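
A brute force check in python (a sketch of the same
enumeration) agrees with the 48 and 54 counts above:

  from itertools import product

  sector_fail = stripe_fail = 0
  for bits in product((0, 1), repeat=6):  # 3 disks x 2 sectors, 1 = read failed
      disks = (bits[0:2], bits[2:4], bits[4:6])
      # sector level fails if some sector position is bad on 2+ disks
      if any(sum(d[i] for d in disks) > 1 for i in (0, 1)):
          sector_fail += 1
      # stripe level fails if 2+ disks have any bad sector in the stripe
      if sum(1 for d in disks if any(d)) > 1:
          stripe_fail += 1

  print(sector_fail, stripe_fail)  # 48 54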


