9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] Petabytes on a budget: JBODs + Linux + JFS
@ 2009-09-04  0:53 Roman V Shaposhnik
  2009-09-04  1:20 ` erik quanstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Roman V Shaposhnik @ 2009-09-04  0:53 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

"None of those technologies [NFS, iSCSI, FC] scales as cheaply,
reliably, goes as big, nor can be managed as easily as stand-alone pods
with their own IP address waiting for requests on HTTPS."
   http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

Apart from the obvious comment that I swear I used a quote like that
to justify 9P more than once, I'm very curious to know how Plan9
would perform on such a box.

Erik, do you have any comments?

Thanks,
Roman.





* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04  0:53 [9fans] Petabytes on a budget: JBODs + Linux + JFS Roman V Shaposhnik
@ 2009-09-04  1:20 ` erik quanstrom
  2009-09-04  9:37   ` matt
                     ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-04  1:20 UTC (permalink / raw)
  To: 9fans

On Thu Sep  3 20:53:13 EDT 2009, rvs@sun.com wrote:
> "None of those technologies [NFS, iSCSI, FC] scales as cheaply,
> reliably, goes as big, nor can be managed as easily as stand-alone pods
> with their own IP address waiting for requests on HTTPS."
>    http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
>
> Apart from the obvious comment that I swear I used a quote like that
> to justify 9P more than once, I'm very curious to know how Plan9
> would perform on such a box.
>
> Erik, do you have any comments?

i'm speaking for myself, and not for anybody else here.
i do work for coraid, and i do do what i believe.  so
caveat emptor.

i think coraid's cost/petabyte is pretty competitive.
they sell a 48TB 3u unit for about 20% more.  though
one could not build one of these machines, since the
case is not commercially available.

i see some warning signs about this setup.  it stands
out to me that they use desktop-class drives and the
drives appear hard to swap out.  the bandwidth out
of the box is 125MB/s max.

aside from that, here's what i see as what you get for
that extra 20%:
- fully-supported firmware,
- full bandwidth to the disk (no port multipliers)
- double the network bandwidth
- ecc memory,
- a hot swap case with ses-2 lights so the tech doesn't
grab the wrong drive,

oh, and the coraid unit works with plan 9.  :-)

- erik
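
to put the bandwidth numbers in perspective, here is a quick sketch.
it assumes the 125MB/s ceiling is simply one gigabit link's line rate,
and it borrows the 48TB figure above purely for scale:

# 125 MB/s is a single gigabit link's line rate; moving ~48TB
# through it takes days.  (48TB is the 3u unit mentioned above,
# used here only for scale.)
line_rate = 1e9 / 8                  # one gigabit link, in bytes/s
capacity = 48e12                     # bytes
for links in (1, 2):
    days = capacity / (links * line_rate) / 86400.0
    print("%d link(s): %.1f days to move 48TB" % (links, days))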




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04  1:20 ` erik quanstrom
@ 2009-09-04  9:37   ` matt
  2009-09-04 14:30     ` erik quanstrom
  2009-09-04 16:54     ` Roman Shaposhnik
  2009-09-04 12:24   ` Eris Discordia
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 42+ messages in thread
From: matt @ 2009-09-04  9:37 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


I concur with Erik.  I specced out a 20TB server earlier this year;
matching the throughput hits you in the wallet.

I'm amazed they are using PCIe x1; it's kind of naive.

see what the guy from Sun says:

http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-this-DIY-storage-server-mentioned-at-Storagemojo.html






* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04  1:20 ` erik quanstrom
  2009-09-04  9:37   ` matt
@ 2009-09-04 12:24   ` Eris Discordia
  2009-09-04 12:41     ` erik quanstrom
  2009-09-04 16:52   ` Roman Shaposhnik
  2009-09-04 23:25   ` James Tomaschke
  3 siblings, 1 reply; 42+ messages in thread
From: Eris Discordia @ 2009-09-04 12:24 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> - a hot swap case with ses-2 lights so the tech doesn't
> grab the wrong drive,

This caught my attention and you are the storage expert here. Is there an
equivalent technology on SATA disks for controlling enclosure facilities?
(Other than SMART, I mean, which seems to be only for monitoring and not
for control.)

I have this SATA backplane-inside-enclosure with 3x Barracuda 7200 series 1
TB disks attached. The enclosure lights for the two 7200.11's respond the
right way, but the one that ought to represent the 7200.12 freaks out
(goes multi-color). Have you experienced anything similar? The tech at the
enclosure vendor tells me some Seagate disks don't support control of
enclosure lights.

--On Thursday, September 03, 2009 21:20 -0400 erik quanstrom
<quanstro@quanstro.net> wrote:

> On Thu Sep  3 20:53:13 EDT 2009, rvs@sun.com wrote:
>> "None of those technologies [NFS, iSCSI, FC] scales as cheaply,
>> reliably, goes as big, nor can be managed as easily as stand-alone pods
>> with their own IP address waiting for requests on HTTPS."
>>    http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
>>
>> Apart from the obvious comment that I swear I used a quote like that
>> to justify 9P more than once, I'm very curious to know how Plan9
>> would perform on such a box.
>>
>> Erik, do you have any comments?
>
> i'm speaking for myself, and not for anybody else here.
> i do work for coraid, and i do do what i believe.  so
> caveat emptor.
>
> i think coraid's cost/petabyte is pretty competitive.
> they sell a 48TB 3u unit for about 20% more.  though
> one could not build one of these machines, since the
> case is not commercially available.
>
> i see some warning signs about this setup.  it stands
> out to me that they use desktop-class drives and the
> drives appear hard to swap out.  the bandwidth out
> of the box is 125MB/s max.
>
> aside from that, here's what i see as what you get for
> that extra 20%:
> - fully-supported firmware,
> - full bandwidth to the disk (no port multipliers)
> - double the network bandwidth
> - ecc memory,
> - a hot swap case with ses-2 lights so the tech doesn't
> grab the wrong drive,
>
> oh, and the coraid unit works with plan 9.  :-)
>
> - erik
>




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04 12:24   ` Eris Discordia
@ 2009-09-04 12:41     ` erik quanstrom
  2009-09-04 13:56       ` Eris Discordia
       [not found]       ` <48F03982350BA904DFFA266E@192.168.1.2>
  0 siblings, 2 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-04 12:41 UTC (permalink / raw)
  To: 9fans

> This caught my attention and you are the storage expert here. Is there an
> equivalent technology on SATA disks for controlling enclosure facilities?
> (Other than SMART, I mean, which seems to be only for monitoring and not
> for control.)

SES-2/SGPIO typically interact with the backplane, not the drive itself.
you can use either one with any type of disk you'd like.

> I have this SATA backplane-inside-enclosure with 3x Barracuda 7200 series 1
> TB disks attached. The enclosure lights for the two 7200.11's respond the
> right way but the one that ought to represent the 7200.12 freaks out
> (goes multi-color). Have you experienced anything similar? The tech at the
> enclosure vendor tells me some Seagate disks don't support control of
> enclosure lights.

not really.  the green (activity) light is drive-driven and sometimes
doesn't work due to different voltage / pull-up resistor conventions.
if there's a single dual-duty led, maybe this is the problem.  how
many separate led packages do you have?

the backplane chip could simply be misprogrammed.  do the lights
follow the drive?  have you tried resetting the lights?

if you have quanstro/sd installed, sdorion(3) discusses how it
controls the backplane lights.

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04 12:41     ` erik quanstrom
@ 2009-09-04 13:56       ` Eris Discordia
  2009-09-04 14:10         ` erik quanstrom
       [not found]       ` <48F03982350BA904DFFA266E@192.168.1.2>
  1 sibling, 1 reply; 42+ messages in thread
From: Eris Discordia @ 2009-09-04 13:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Many thanks for the info :-)

> if there's a single dual-duty led maybe this is the problem.  how
> many separate led packages do you have?

There's one multi-color (3-prong) LED responsible for this. Nominally,
green should mean the drive is running and okay, alternating red should
mean transfer, and orange (red + green) a disk failure. In the case of the
7200.11's this works as it should. In the case of the 7200.12 the light
goes orange when the disk spins up and remains so. At times of transfer it
goes red as it should but returns to orange instead of green when there's
no transfer. I feared the (new) disk was unhealthy and stressed it for
some time, but all seems to be fine except for that light.

I tried changing the bay in which the disk sits, and the anomaly follows
the disk, so I guess the backplane's okay. The tech specifically mentioned
the Seagate ES2 as a similar case and told me the disk was fine and just
lacked support for interacting with the light (directly or through the
backplane, I don't know which).

> if you have quanstro/sd installed, sdorion(3) discusses how it
> controls the backplane lights.

Um, I don't have that because I don't have any running Plan 9 instances,
but I'll try finding it on the web (if it's been through man2html at some
time).

--On Friday, September 04, 2009 08:41 -0400 erik quanstrom
<quanstro@quanstro.net> wrote:

>> This caught my attention and you are the storage expert here. Is there
>> an  equivalent technology on SATA disks for controlling enclosure
>> facilities?  (Other than SMART, I mean, which seems to be only for
>> monitoring and not  for control.)
>
> SES-2/SGPIO typically interact with the backplane, not the drive itself.
> you can use either one with any type of disk you'd like.
>
>> I have this SATA backplane-inside-enclosure with 3x Barracuda 7200 series
>> 1  TB disks attached. The enclosure lights for the two 7200.11's respond
>> the  right way but the one that ought to represent the 7200.12 freaks
>> out  (goes multi-color). Have you experienced anything similar? The tech
>> at the  enclosure vendor tells me some Seagate disks don't support
>> control of  enclosure lights.
>
> not really.  the green (activity) light is drive driven and sometimes
> doesn't work due to different voltage / pull up resistor conventions.
> if there's a single dual-duty led maybe this is the problem.  how
> many separate led packages do you have?
>
> the backplane chip could simply be misprogrammed.  do the lights
> follow the drive?  have you tried resetting the lights.
>
> if you have quanstro/sd installed, sdorion(3) discusses how it
> controls the backplane lights.
>
> - erik
>




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04 13:56       ` Eris Discordia
@ 2009-09-04 14:10         ` erik quanstrom
  2009-09-04 18:34           ` Eris Discordia
  0 siblings, 1 reply; 42+ messages in thread
From: erik quanstrom @ 2009-09-04 14:10 UTC (permalink / raw)
  To: 9fans

> There's one multi-color (3-prong) LED responsible for this. Nominally,
> green should mean drive running and okay, alternating red should mean
> transfer, and orange (red + green) a disk failure. In case of 7200.11's

there's a standard for this
red	fail
orange	locate
green	activity

maybe your enclosure's not standard.

> I tried changing the bay in which the disk sits and the anomaly follows the
> disk so I guess the backplane's okay.

since it's a single led and follows the drive, i think this is a voltage problem.
it just has to do with the fact that the voltage / pullup standard
changed.

> Um, I don't have that because I don't have any running Plan 9 instances,
> but I'll try finding it on the web (if it's been through man2html at some
> time).

http://sources.coraid.com/sources/contrib/quanstro/root/sys/man/3

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04  9:37   ` matt
@ 2009-09-04 14:30     ` erik quanstrom
  2009-09-04 16:54     ` Roman Shaposhnik
  1 sibling, 0 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-04 14:30 UTC (permalink / raw)
  To: 9fans

> I concur with Erik.  I specced out a 20TB server earlier this year;
> matching the throughput hits you in the wallet.

even if you're okay with low performance, please don't
set up a 20TB server without enterprise drives.  it's no
guarantee, but it's the closest you can come.  also,
the #1 predictor of drive reliability is that, given a
drive architecture, mtbf is proportional to 1/platters.

*blatant shill alert*

and, hey, my fileserver's aoe storage has finally become
a real product!

http://www.linuxfordevices.com/c/a/News/Coraid-EtherDrive-SR821T-Storage-Appliance/

prototype here:

http://www.quanstro.net/plan9/fs.html

and, yes, it works with plan 9.

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04  1:20 ` erik quanstrom
  2009-09-04  9:37   ` matt
  2009-09-04 12:24   ` Eris Discordia
@ 2009-09-04 16:52   ` Roman Shaposhnik
  2009-09-04 17:27     ` erik quanstrom
  2009-09-04 23:25   ` James Tomaschke
  3 siblings, 1 reply; 42+ messages in thread
From: Roman Shaposhnik @ 2009-09-04 16:52 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sep 3, 2009, at 6:20 PM, erik quanstrom wrote:
> On Thu Sep  3 20:53:13 EDT 2009, rvs@sun.com wrote:
>> "None of those technologies [NFS, iSCSI, FC] scales as cheaply,
>> reliably, goes as big, nor can be managed as easily as stand-alone
>> pods
>> with their own IP address waiting for requests on HTTPS."
>>   http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
>>
>> Apart from the obvious comment that I swear I used a quote like that
>> to justify 9P more than once, I'm very curious to know how Plan9
>> would perform on such a box.
>>
>> Erik, do you have any comments?
>
> i'm speaking for myself, and not for anybody else here.
> i do work for coraid, and i do do what i believe.  so
> caveat emptor.
>
> i think coraid's cost/petabyte is pretty competitive.
> they sell a 48TB 3u unit for about 20% more.  though
> one could not build one of these machines, since the
> case is not commercially available.
>
> i see some warning signs about this setup.  it stands
> out to me that they use desktop-class drives and the
> drives appear hard to swap out.  the bandwidth out
> of the box is 125MB/s max.
>
> aside from that, here's what i see as what you get for
> that extra 20%:
> - fully-supported firmware,
> - full bandwidth to the disk (no port multipliers)
> - double the network bandwidth
> - ecc memory,
> - a hot swap case with ses-2 lights so the tech doesn't
> grab the wrong drive,
>
> oh, and the coraid unit works with plan 9.  :-)

*with*, not *on* right?

Now, the information above is quite useful, yet my question
was more along the lines of -- if one was to build such
a box using Plan 9 as the software -- would it be:
     1. feasible
     2. have any advantages over Linux + JFS

Thanks,
Roman.




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04  9:37   ` matt
  2009-09-04 14:30     ` erik quanstrom
@ 2009-09-04 16:54     ` Roman Shaposhnik
  1 sibling, 0 replies; 42+ messages in thread
From: Roman Shaposhnik @ 2009-09-04 16:54 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Sep 4, 2009, at 2:37 AM, matt wrote:
> I concur with Erik.  I specced out a 20TB server earlier this year;
> matching the throughput hits you in the wallet.
>
> I'm amazed they are using PCIe x1; it's kind of naive.
>
> see what the guy from Sun says:
>
> http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-this-DIY-storage-server-mentioned-at-Storagemojo.html

Joerg is smart. But he feels defensive about ZFS. Yet, he is smart
enough to say YMMV at the very end of his post.

Come on, that's the same argument people were using trying to sell
Google big honking SPARC boxes way back in '99. They were missing the
point back then; I'm sure quite a few of them are going to miss the
point now.

Thanks,
Roman.





* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04 16:52   ` Roman Shaposhnik
@ 2009-09-04 17:27     ` erik quanstrom
  2009-09-04 17:37       ` Jack Norton
  0 siblings, 1 reply; 42+ messages in thread
From: erik quanstrom @ 2009-09-04 17:27 UTC (permalink / raw)
  To: 9fans

> *with*, not *on* right?

with.  it's an appliance.

> Now, the information above is quite useful, yet my question
> was more along the lines of -- if one was to build such
> a box using Plan 9 as the software -- would it be:
>      1. feasible
>      2. have any advantages over Linux + JFS

aoe is block storage so i guess i don't know how
to answer.

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04 17:27     ` erik quanstrom
@ 2009-09-04 17:37       ` Jack Norton
  2009-09-04 18:33         ` erik quanstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Jack Norton @ 2009-09-04 17:37 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:
>> *with*, not *on* right?
>>
>
> with.  it's an appliance.
>
>
>> Now, the information above is quite useful, yet my question
>> was more along the lines of -- if one was to build such
>> a box using Plan 9 as the software -- would it be:
>>      1. feasible
>>      2. have any advantages over Linux + JFS
>>
>
> aoe is block storage so i guess i don't know how
> to answer.
>
> - erik
>
>
I think what he means is:
You are given an inordinate number of hard drives and some computers to
house them.
If plan9 is your only software, how would it be configured overall,
given that it has to perform as well or better?

Or put another way: your boss wants you to compete with backblaze using
only plan9 and (let's say) a _large_ budget.  Go!

-Jack




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04 17:37       ` Jack Norton
@ 2009-09-04 18:33         ` erik quanstrom
  2009-09-08 16:53           ` Jack Norton
  0 siblings, 1 reply; 42+ messages in thread
From: erik quanstrom @ 2009-09-04 18:33 UTC (permalink / raw)
  To: 9fans

> I think what he means is:
> You are given an inordinate number of hard drives and some computers to
> house them.
> If plan9 is your only software, how would it be configured overall,
> given that it has to perform as well or better?
>
> Or put another way: your boss wants you to compete with backblaze using
> only plan9 and (let's say) a _large_ budget.  Go!

forgive me for thinking in ruts ...

i wouldn't ask the question just like that.  the original
plan 9 fileserver had a practically-infinite storage system.
it was a jukebox.  the jukebox ran some firmware that wasn't
plan 9.  (in fact the fileserver itself wasn't running plan 9.)

today, jukeboxes are still ideal in some ways, but they're too
expensive.  i personally think you  can replace the juke
with a set of aoe shelves.  you can treat the shelves as if
they were jukebox platters.  add as necessary.  this gives
you a solid, redundant foundation.

for a naive first implementation targeting plan 9 clients,
i would probably start with ken's fs.  for coraid's modest
requirements (10e2 users 10e2 terminals 10e1 cpu servers
10e2 mb/s), i built this http://www.quanstro.net/plan9/disklessfs.pdf

i don't see any fundamental reasons why it would not
scale up to petabytes.  i would put work into enabling
multiple cpus. i would imagine it wouldn't be hard to
saturate 2x10gbe with such a setup.  of course, there is
no reason one would need to limit oneself to a single
file server, other than simplicity.

of course this is all a bunch of hand waving without any
money or specific requirements.

- erik
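
a rough sketch of the arithmetic behind saturating 2x10gbe; the
per-drive throughput and shelf size below are illustrative assumptions,
not figures from this thread:

# how many commodity drives does it take to saturate 2x10gbe?
# assumed (not from this thread): ~100 MB/s sustained per drive,
# 16 drives per shelf.
target = 2 * 10e9 / 8                # 2x10gbe in bytes/s, ~2.5 GB/s
per_drive = 100e6                    # assumed sustained rate per drive
drives_per_shelf = 16                # assumed shelf size
drives = target / per_drive
print("need ~%.0f drives (~%.1f shelves) streaming in parallel"
      % (drives, drives / drives_per_shelf))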




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04 14:10         ` erik quanstrom
@ 2009-09-04 18:34           ` Eris Discordia
  0 siblings, 0 replies; 42+ messages in thread
From: Eris Discordia @ 2009-09-04 18:34 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> there's a standard for this
> red	fail
> orange	locate
> green	activity
>
> maybe your enclosure's not standard.

That may be the case as it's really sort of a cheap hack: Chieftec
SNT-2131. A 3-in-2 "solution" for use in 5.25" bays of desktop computer
cases. I hear ICY DOCK has better offers but didn't see those available
around here.

> since it's a single led and follows the drive, i think this is a voltage
> problem. it just has to do with the fact that the voltage / pullup
> standard changed.

Good enough explanation for me. One thing that gave me worries was the
negative reviews of some early 7200.12's (compared to 7200.11) circulating
around on the web. Apparently, earlier firmware versions on the series had
serious problems--serious enough to kill a drive, some reviews claimed.

> http://sources.coraid.com/sources/contrib/quanstro/root/sys/man/3

Upon reading the man page the line that relieved me was this:

> The LED state has no effect on drive function.

And thanks again for the kind counsel.



--On Friday, September 04, 2009 10:10 -0400 erik quanstrom
<quanstro@quanstro.net> wrote:

>> There's one multi-color (3-prong) LED responsible for this. Nominally,
>> green should mean drive running and okay, alternating red should mean
>> transfer, and orange (red + green) a disk failure. In case of 7200.11's
>
> there's a standard for this
> red	fail
> orange	locate
> green	activity
>
> maybe your enclosure's not standard.
>
>> I tried changing the bay in which the disk sits and the anomaly follows
>> the  disk so I guess the backplane's okay.
>
> since it's a single led and follows the drive, i think this is a voltage
> problem. it just has to do with the fact that the voltage / pullup
> standard changed.
>
>> Um, I don't have that because I don't have any running Plan 9 instances,
>> but I'll try finding it on the web (if it's been through man2html at
>> some  time).
>
> http://sources.coraid.com/sources/contrib/quanstro/root/sys/man/3
>
> - erik
>




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04  1:20 ` erik quanstrom
                     ` (2 preceding siblings ...)
  2009-09-04 16:52   ` Roman Shaposhnik
@ 2009-09-04 23:25   ` James Tomaschke
  3 siblings, 0 replies; 42+ messages in thread
From: James Tomaschke @ 2009-09-04 23:25 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:
> i'm speaking for myself, and not for anybody else here.
> i do work for coraid, and i do do what i believe.  so
> caveat emptor.
We have a 15TB unit, nice bit of hardware.

> oh, and the coraid unit works with plan 9.  :-)
You guys should get some Glenda-themed packing tape.




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
       [not found]       ` <48F03982350BA904DFFA266E@192.168.1.2>
@ 2009-09-07 20:02         ` Uriel
  2009-09-08 13:32           ` Eris Discordia
  0 siblings, 1 reply; 42+ messages in thread
From: Uriel @ 2009-09-07 20:02 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Sep 4, 2009 at 3:56 PM, Eris Discordia<eris.discordia@gmail.com> wrote:
>> if you have quanstro/sd installed, sdorion(3) discusses how it
>> controls the backplane lights.
>
> Um, I don't have that because I don't have any running Plan 9 instances, but
> I'll try finding it on the web (if it's been through man2html at some time).

Here you go: http://man.cat-v.org/plan_9_contrib/3/sdorion




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-07 20:02         ` Uriel
@ 2009-09-08 13:32           ` Eris Discordia
  0 siblings, 0 replies; 42+ messages in thread
From: Eris Discordia @ 2009-09-08 13:32 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Thanks.

Erik Quanstrom, too, posted a link to that page, although it wasn't in HTML.

--On Monday, September 07, 2009 22:02 +0200 Uriel <uriel99@gmail.com> wrote:

> On Fri, Sep 4, 2009 at 3:56 PM, Eris Discordia<eris.discordia@gmail.com>
> wrote:
>>> if you have quanstro/sd installed, sdorion(3) discusses how it
>>> controls the backplane lights.
>>
>> Um, I don't have that because I don't have any running Plan 9 instances,
>> but I'll try finding it on the web (if it's been through man2html at
>> some time).
>
> Here you go: http://man.cat-v.org/plan_9_contrib/3/sdorion
>




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-04 18:33         ` erik quanstrom
@ 2009-09-08 16:53           ` Jack Norton
  2009-09-08 17:16             ` erik quanstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Jack Norton @ 2009-09-08 16:53 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:
>> I think what he means is:
>> You are given an inordinate number of hard drives and some computers to
>> house them.
>> If plan9 is your only software, how would it be configured overall,
>> given that it has to perform as well or better?
>>
>> Or put another way: your boss wants you to compete with backblaze using
>> only plan9 and (let's say) a _large_ budget.  Go!
>>
>
> forgive me for thinking in ruts ...
>
> i wouldn't ask the question just like that.  the original
> plan 9 fileserver had a practically-infinite storage system.
> it was a jukebox.  the jukebox ran some firmware that wasn't
> plan 9.  (in fact the fileserver itself wasn't running plan 9.)
>
> today, jukeboxes are still ideal in some ways, but they're too
> expensive.  i personally think you  can replace the juke
> with a set of aoe shelves.  you can treat the shelves as if
> they were jukebox platters.  add as necessary.  this gives
> you a solid, redundant foundation.
>
> for a naive first implementation targeting plan 9 clients,
> i would probably start with ken's fs.  for coraid's modest
> requirements (10e2 users 10e2 terminals 10e1 cpu servers
> 10e2 mb/s), i built this http://www.quanstro.net/plan9/disklessfs.pdf
>
> i don't see any fundamental reasons why it would not
> scale up to petabytes.  i would put work into enabling
> multiple cpus. i would imagine it wouldn't be hard to
> saturate 2x10gbe with such a setup.  of course, there is
> no reason one would need to limit oneself to a single
> file server, other than simplicity.
>
> of course this is all a bunch of hand waving without any
> money or specific requirements.
>
> - erik
>
>
Erik,

I read the paper you wrote and I have some (probably naive) questions:
The section #6 labeled "core improvements" seems to suggest that the
fileserver is basically using the CPU/fileserver hybrid kernel (both
major changes are quoted as coming from the CPU kernel).  Is this just a
one-off adjustment made by yourself, or have these changes been made
permanent?
Also, about the coraid AoE unit: am I correct in assuming that it does
some sort of RAID functionality, and then presents the resulting
device(s) as an AoE device (and nothing more)?
Also, another probably dumb question: did the fileserver machine use
the AoE device as a kenfs volume or a fossil(+venti)?

The reason I am asking all of this is because I have a linux fileserver
machine that _just_ serves up storage, and I have an atom-based machine
not doing anything at the moment (with GbE).  I would love to have the
linux machine present its goods as an AoE device and have the atom-based
machine play the fileserver role.  That would be fun.
Thanks in advance for your patience with my questions :)

-Jack





* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-08 16:53           ` Jack Norton
@ 2009-09-08 17:16             ` erik quanstrom
  2009-09-08 18:17               ` Jack Norton
  0 siblings, 1 reply; 42+ messages in thread
From: erik quanstrom @ 2009-09-08 17:16 UTC (permalink / raw)
  To: 9fans

> I read the paper you wrote and I have some (probably naive) questions:
> The section #6 labeled "core improvements" seems to suggest that the
> fileserver is basically using the CPU/fileserver hybrid kernel (both
> major changes are quoted as coming from the CPU kernel).  Is this just a
> one-off adjustment made by yourself, or have these changes been made
> permanent?

it's not a hybrid kernel.  it is just a file server kernel.  of course
the same c library and whatnot are used.  so there are shared
bits, but not many.

it is a one-off thing.  i also run the same fileserver at home.
the source is on sources as contrib quanstro/fs.  i don't think
anybody else runs it.

> Also, about the coraid AoE unit: am I correct in assuming that it does
> some sort of RAID functionality, and then presents the resulting
> device(s) as an AoE device (and nothing more)?

exactly.

> Also, another probably dumb question: did the fileserver machine use
> the AoE device as a kenfs volume or a fossil(+venti)?

s/did/does/.  the fileserver is running today.

the fileserver provides the network with a regular 9p fileserver
with three attach points (main, dump, other) accessible via il/ip.
from a client's view of the 9p messages, fossil, fossil+venti and
ken's fs would be difficult to distinguish.

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-08 17:16             ` erik quanstrom
@ 2009-09-08 18:17               ` Jack Norton
  2009-09-08 18:54                 ` erik quanstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Jack Norton @ 2009-09-08 18:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:
>> Also, another probably dumb question: did the fileserver machine use
>> the AoE device as a kenfs volume or a fossil(+venti)?
>>
>
> s/did/does/.  the fileserver is running today.
>
> the fileserver provides the network with a regular 9p fileserver
> with three attach points (main, dump, other) accessible via il/ip.
> from a client's view of the 9p messages, fossil, fossil+venti and
> ken's fs would be difficult to distinguish.
>
> - erik
>
>

Very cool.
So what about having venti on an AoE device, and fossil on a local drive
(say an ssd even)?  How would you handle (or: how would venti handle) a
resize of the AoE device?  Let's say you add more active drives to the
RAID pool on the AoE machine (which on a linux fileserver would then
involve resizing the partition on the block device, followed by growing
the volume groups if using lvm, followed by growing the filesystem).
Sorry for thinking out loud... I should get back to work anyway.  Fun
thread though.

-Jack




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-08 18:17               ` Jack Norton
@ 2009-09-08 18:54                 ` erik quanstrom
  2009-09-14 15:50                   ` Jack Norton
  0 siblings, 1 reply; 42+ messages in thread
From: erik quanstrom @ 2009-09-08 18:54 UTC (permalink / raw)
  To: 9fans

> So what about having venti on an AoE device, and fossil on a local drive
> (say an ssd even)?

sure.  we keep the cache on the coraid sr1521 as well.

> How would you handle (or: how would venti handle), a
> resize of the AoE device?

that would depend on the device structure of ken's fs.
as long as you don't use the pseudo-worm device, it wouldn't
care.  the worm would simply grow.  if you use the pseudo-worm
device (f), changing the size of the device would fail since
the written bitmap is at a fixed offset from the end of the
device.  and if you try to read an unwritten block, the f
panics the file server.  i stopped using the f device.

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-08 18:54                 ` erik quanstrom
@ 2009-09-14 15:50                   ` Jack Norton
  2009-09-14 17:05                     ` Russ Cox
  0 siblings, 1 reply; 42+ messages in thread
From: Jack Norton @ 2009-09-14 15:50 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:
>> So what about having venti on an AoE device, and fossil on a local drive
>> (say an ssd even)?
>>
>
> sure.  we keep the cache on the coraid sr1521 as well.
>
>
>> How would you handle (or: how would venti handle), a
>> resize of the AoE device?
>>
>
> that would depend on the device structure of ken's fs.
> as long as you don't use the pseudo-worm device, it wouldn't
> care.  the worm would simply grow.  if you use the pseudo-worm
> device (f), changing the size of the device would fail since
> the written bitmap is at a fixed offset from the end of the
> device.  and if you try to read an unwritten block, the f
> panics the file server.  i stopped using the f device.
>
> - erik
>
>
I am going to try my hand at beating a dead horse :)
So when you create a Venti volume, it basically writes '0's to all the
blocks of the underlying device, right?  If I put a venti volume on an AoE
device which is a linux raid5, using normal desktop sata drives, what
are my chances of a successful completion of the venti formatting (let's
say 1TB raw size)?  Have you ever encountered such problems, or are you
using more robust hardware?
I ask because I have, in the past, failed a somewhat-used sata drive
while creating a venti volume (it subsequently created i/o errors on
every os I used on it, although it limped along).  I must say it is
quite a brutal process, creating a venti volume (at least I get that
impression from the times I have done it in the past).
If linux is my raid controller, I know that it is _very_ picky about how
long a drive takes to respond and will fail a drive if it has to wait
too long.

By the way, I am currently buying a few pieces of cheap hardware to
implement my own diskless fileserver.  It should be ready to go in a
couple of weeks.

-Jack




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-14 15:50                   ` Jack Norton
@ 2009-09-14 17:05                     ` Russ Cox
  2009-09-14 17:48                       ` Jack Norton
  0 siblings, 1 reply; 42+ messages in thread
From: Russ Cox @ 2009-09-14 17:05 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Sep 14, 2009 at 8:50 AM, Jack Norton <jack@0x6a.com> wrote:
> So when you create a Venti volume, it basically writes '0's to all the
> blocks of the underlying device, right?

In case anyone decides to try the experiment,
venti hasn't done this for a few years.  Better to try with dd.

Russ



* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-14 17:05                     ` Russ Cox
@ 2009-09-14 17:48                       ` Jack Norton
  0 siblings, 0 replies; 42+ messages in thread
From: Jack Norton @ 2009-09-14 17:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Russ Cox wrote:
> On Mon, Sep 14, 2009 at 8:50 AM, Jack Norton <jack@0x6a.com> wrote:
>
>> So when you create a Venti volume, it basically writes '0's to all the
>> blocks of the underlying device, right?
>>
>
> In case anyone decides to try the experiment,
> venti hasn't done this for a few years.  Better to try with dd.
>
> Russ
>
>
By a 'few years' do you mean >=2.5 yrs?  Because that's the last time I
had a venti setup.  I had to stop using plan 9 for a while so that I
could graduate on time (that thesis wasn't going to write itself,
apparently).  All work and no play makes Jack a dull boy.

-Jack




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 19:21           ` erik quanstrom
  2009-09-21 20:57             ` Wes Kussmaul
@ 2009-09-22 10:59             ` matt
  1 sibling, 0 replies; 42+ messages in thread
From: matt @ 2009-09-22 10:59 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs



>storage vendors have a credibility problem.  i think the big
>storage vendors, as referenced in the op, sell you on many
>things you don't need for much more than one has to spend.
>
>
I went to a product demo from http://www.isilon.com/

They make a filesystem that spans multiple machines. They split the data
into 128k blocks and then store duplicates of them across your RAIDs. They
claim over 80% utilisation rather than the more usual 50% of mirroring.

http://www.isilon.com/products/OneFS.php

I'm no expert, but after reading a bit of Shannon when I got back I found
it hard to believe their claims.  They have great market penetration,
though, because you just stick another multi-TB server in the rack and it
adds itself to the array.

All I kept thinking was "you want $100k for 25tb of aoe and cwfs, get
out of it!"






* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 23:35               ` Eris Discordia
@ 2009-09-22  0:45                 ` erik quanstrom
  0 siblings, 0 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-22  0:45 UTC (permalink / raw)
  To: 9fans

> Apparently, the distinction made between "consumer" and "enterprise" is
> actually between technology classes, i.e. SCSI/Fibre Channel vs. SATA,
> rather than between manufacturers' gradings, e.g. Seagate 7200 desktop
> series vs. Western Digital RE3/RE4 enterprise drives.

yes, this is very misleading.  the interface doesn't make a
drive enterprise or consumer grade.  it seems that the
terminology was different then.

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
       [not found]               ` <6DC61E4A6EC613C81AC1688E@192.168.1.2>
@ 2009-09-21 23:50                 ` Eris Discordia
  0 siblings, 0 replies; 42+ messages in thread
From: Eris Discordia @ 2009-09-21 23:50 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Upon reading more into that study it seems the Wikipedia editor has derived
a distorted conclusion:

> In our data sets, the replacement rates of SATA disks are not worse than
> the replacement rates of SCSI or FC disks. This may indicate that
> disk-independent factors, such as operating conditions, usage and
> environmental factors, affect replacement rates more than component
> specific factors. However, the only evidence we have of a bad batch of
> disks was found in a collection of SATA disks experiencing high media
> error rates. We have too little data on bad batches to estimate the
> relative frequency of bad batches by type of disk, although there is
> plenty of anecdotal evidence that bad batches are not unique to SATA
> disks.

-- the USENIX article

Apparently, the distinction made between "consumer" and "enterprise" is
actually between technology classes, i.e. SCSI/Fibre Channel vs. SATA,
rather than between manufacturers' gradings, e.g. Seagate 7200 desktop
series vs. Western Digital RE3/RE4 enterprise drives.

All SATA drives listed have an MTTF (== MTBF?) of > 1.0 million hours, which
is characteristic of enterprise drives, as Erik Quanstrom pointed out earlier
in this thread. The 7200s have an MTBF of around 0.75 million hours, in
contrast to RE4s with > 1.0-million-hour MTBF.
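
A quick way to read those MTTF/MTBF numbers as yearly odds; this assumes
a constant (exponential) failure rate, which is a simplification and not
something the study claims:

# convert a quoted mtbf into an annualized failure rate (afr),
# assuming a constant (exponential) failure rate.
import math

def afr(mtbf_hours, hours_per_year=8766):
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)

print("1.0M h mtbf  -> %.2f%%/yr" % (100 * afr(1.0e6)))     # ~0.87%/yr
print("0.75M h mtbf -> %.2f%%/yr" % (100 * afr(0.75e6)))    # ~1.16%/yr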



--On Tuesday, September 22, 2009 00:35 +0100 Eris Discordia
<eris.discordia@gmail.com> wrote:

>> What I haven't found is a decent, no frills, sata/e-sata enclosure for a
>> home system.
>
> Depending on where you are, where you can purchase from, and how much you
> want to pay you may be able to get yourself ICY DOCK or Chieftec
> enclosures that fit the description. ICY DOCK's 5-bay enclosure seemed a
> fine choice to me although somewhat expensive (slightly over 190 USD, I
> seem to remember).
>
> --------------------------------------------------------------------------------
>
> Related to the subject of drive reliability:
>
>> A common misconception is that "server-grade" drives fail less frequently
>> than consumer-grade drives. Two different, independent studies, by
>> Carnegie Mellon University and Google, have shown that failure rates are
>> largely independent of the supposed "grade" of the drive.
>
> -- <http://en.wikipedia.org/wiki/RAID>
>
> The paragraph cites this as its source:
>
> --
> <http://searchstorage.techtarget.com/magazineFeature/0,296894,sid5_gci1259075,00.html>
> (full text available only to registered users; registration is free,
> which begs the question of why they've decided to pester penniless
> readers with questions about their "corporation's" number of employees and IT
> expenses)
>
> which has derived its content from this study:
>
> <http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html>
>
> I couldn't find the other study, "independent" from this first.
>
>
>
> --On Monday, September 21, 2009 15:07 -0700 Bakul Shah
> <bakul+plan9@bitblocks.com> wrote:
>
>> On Mon, 21 Sep 2009 16:30:25 EDT erik quanstrom <quanstro@quanstro.net>
>> wrote:
>>> > > i think the lesson here is don't buy cheap drives; if you
>>> > > have enterprise drives at 1e-15 error rate, the fail rate
>>> > > will be 0.8%.  of course if you don't have a raid, the fail
>>> > > rate is 100%.
>>> > >
>>> > > if that's not acceptable, then use raid 6.
>>> >
>>> > Hopefully Raid 6 or zfs's raidz2 works well enough with cheap
>>> > drives!
>>>
>>> don't hope.  do the calculations.  or simulate it.
>>
>> The "hopefully" part was due to power supplies, fans, mobos.
>> I can't get hold of their reliability data (not that I have
>> tried very hard).  Ignoring that, raidz2 (+ venti) is good
>> enough for my use.
>>
>>> this is a pain in the neck as it's a function of ber,
>>> mtbf, rebuild window and number of drives.
>>>
>>> i found that not having a hot spare can increase
>>> your chances of a double failure by an order of
>>> magnitude.  the birthday paradox never ceases to
>>> amaze.
>>
>> I plan to replace one disk every 6 to 9 months or so. In a
>> 3+2 raidz2 array disks will be swapped out in 2.5 to 3.75
>> years in the worst case.  What I haven't found is a decent,
>> no frills, sata/e-sata enclosure for a home system.
>>
>
>
>
>








* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 20:57             ` Jack Norton
@ 2009-09-21 23:38               ` erik quanstrom
  0 siblings, 0 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-21 23:38 UTC (permalink / raw)
  To: 9fans

> At work, we recently had a massive failure of our RAID array.  After
> much brown-nosing, I came to find that, after many hard drives had been
> shipped to our IT guy and much head scratching on his part, it was in
> fact the RAID card itself that had failed (which takes out the whole
> array, plus can take out any new drives you throw at it, apparently).

i have never seen any controller fail in such a way that drives
were actually damaged.  and i would suspect serious design
issues if that is what happened.  that's like a bad ethernet
or usb controller frying your switch.

controller failure is not common for the types of controllers
i use.  for machines that are in service, controller failure
is no more common than cpu or motherboard failure.

> So I ask you all this (especially those in the 'biz): all this
> redundancy on the drive side, why no redundancy of controller cards (or
> should I say, the driver infrastructure needed)?

the high-end sas "solution" is to buy expensive dual-ported drives
and cross-connect controllers and drives.  this is very complicated
and requires twice the number of ports or sas expanders.  it also
requires quite a bit of driver-level code.  it is possible,
if the failure rates are low enough (and especially if cable failure
is more probable than port failure), that the extra bits and pieces
in this dual-ported setup are *less* reliable than a standard setup.
and it's all for naught if the cpu, mb or memory blow up.

i keep a cold spare controller, just in case.
(coraid sells a spares kit for the truly paranoid, like me,
and a mirroring appliance for those who are even more paranoid.
of course the mirroring appliance can be mirrored, which is great
until the switch blows up.  but naturally you can use multiple
switches.  alas, no protection from meteors.)

> It is appealing to me to try and get some plan 9 supported raid card and
> have plan 9 throughout (like the coraid setup as far as I can tell), but
> this little issue bothers me.

plan 9 doesn't support any raid cards per se.  (well, maybe the wonderful
but now ancient parallel scsi drivers might.)  theoretically, intel
matrix raid supports raid and is drivable with the ahci driver.  that would
limit you to the on-board ports.  i've never tried it.  as far as i can tell,
matrix raid uses smm mode + microcode on the southbridge to operate.
(anyone know better?)  and i want as little code sneaking around behind
my back as possible.

the annoying problem with "hardware" raid is that it takes real contortions
to make an array span controllers.  and you can't recruit a hot spare from
another controller.

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 22:07             ` Bakul Shah
@ 2009-09-21 23:35               ` Eris Discordia
  2009-09-22  0:45                 ` erik quanstrom
       [not found]               ` <6DC61E4A6EC613C81AC1688E@192.168.1.2>
  1 sibling, 1 reply; 42+ messages in thread
From: Eris Discordia @ 2009-09-21 23:35 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> What I haven't found is a decent, no frills, sata/e-sata enclosure for a
> home system.

Depending on where you are, where you can purchase from, and how much you
want to pay, you may be able to get yourself ICY DOCK or Chieftec enclosures
that fit the description. ICY DOCK's 5-bay enclosure seemed a fine choice
to me, although somewhat expensive (slightly over 190 USD, I seem to
remember).

--------------------------------------------------------------------------------

Related to the subject of drive reliability:

> A common misconception is that "server-grade" drives fail less frequently
> than consumer-grade drives. Two different, independent studies, by
> Carnegie Mellon University and Google, have shown that failure rates are
> largely independent of the supposed "grade" of the drive.

-- <http://en.wikipedia.org/wiki/RAID>

The paragraph cites this as its source:

--
<http://searchstorage.techtarget.com/magazineFeature/0,296894,sid5_gci1259075,00.html>
(full text available only to registered users; registration is free, which
begs the question of why they've decided to pester penniless readers with
questions about their "corporation's" number of employees and IT expenses)

which has derived its content from this study:

<http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html>

I couldn't find the other study, "independent" from this first.



--On Monday, September 21, 2009 15:07 -0700 Bakul Shah
<bakul+plan9@bitblocks.com> wrote:

> On Mon, 21 Sep 2009 16:30:25 EDT erik quanstrom <quanstro@quanstro.net>
> wrote:
>> > > i think the lesson here is don't buy cheap drives; if you
>> > > have enterprise drives at 1e-15 error rate, the fail rate
>> > > will be 0.8%.  of course if you don't have a raid, the fail
>> > > rate is 100%.
>> > >
>> > > if that's not acceptable, then use raid 6.
>> >
>> > Hopefully Raid 6 or zfs's raidz2 works well enough with cheap
>> > drives!
>>
>> don't hope.  do the calculations.  or simulate it.
>
> The "hopefully" part was due to power supplies, fans, mobos.
> I can't get hold of their reliability data (not that I have
> tried very hard).  Ignoring that, raidz2 (+ venti) is good
> enough for my use.
>
>> this is a pain in the neck as it's a function of ber,
>> mtbf, rebuild window and number of drives.
>>
>> i found that not having a hot spare can increase
>> your chances of a double failure by an order of
>> magnitude.  the birthday paradox never ceases to
>> amaze.
>
> I plan to replace one disk every 6 to 9 months or so. In a
> 3+2 raidz2 array disks will be swapped out in 2.5 to 3.75
> years in the worst case.  What I haven't found is a decent,
> no frills, sata/e-sata enclosure for a home system.
>








* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 20:57             ` Wes Kussmaul
@ 2009-09-21 22:42               ` erik quanstrom
  0 siblings, 0 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-21 22:42 UTC (permalink / raw)
  To: 9fans

> > storage vendors have a credibility problem.  i think the big
> > storage vendors, as referenced in the op, sell you on many
> > things you don't need for much more than one has to spend.
>
> Those of us who know something about Coraid understand that your company
>   doesn't engage in fudili practices.

thank you.

> > so i understand one is not inclined to believe a storage guy
> > about the need for something that's more expensive.
>
> Maybe the appliance example was misplaced; no skepticism of your take on
> drive quality was intended. After all, you don't make the drives. I'm
> genuinely interested in the view of a storage systems pro like yourself
> about how an outsider like me who buys drives at tigerdirect can discern
> an "enterprise" drive when I see one.

actually, i should thank you.  you brought
up a point that i've been thinking about
for some time.

i buy my drives online like everybody else.
my personal approach is to select a couple
of good performers in the capacity range
i'm interested in and sort by (a) mtbf,
(b) ure rate and (c) duty cycle.  i'd stay
away from drives with 4 or more platters
if possible.

finally, beware of tech sites with storage sponsors
that might compromise their recommendations.

- erik




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 20:30           ` erik quanstrom
  2009-09-21 20:57             ` Jack Norton
@ 2009-09-21 22:07             ` Bakul Shah
  2009-09-21 23:35               ` Eris Discordia
       [not found]               ` <6DC61E4A6EC613C81AC1688E@192.168.1.2>
  1 sibling, 2 replies; 42+ messages in thread
From: Bakul Shah @ 2009-09-21 22:07 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, 21 Sep 2009 16:30:25 EDT erik quanstrom <quanstro@quanstro.net>  wrote:
> > > i think the lesson here is don't buy cheap drives; if you
> > > have enterprise drives at 1e-15 error rate, the fail rate
> > > will be 0.8%.  of course if you don't have a raid, the fail
> > > rate is 100%.
> > >
> > > if that's not acceptable, then use raid 6.
> >
> > Hopefully Raid 6 or zfs's raidz2 works well enough with cheap
> > drives!
>
> don't hope.  do the calculations.  or simulate it.

The "hopefully" part was due to power supplies, fans, mobos.
I can't get hold of their reliability data (not that I have
tried very hard).  Ignoring that, raidz2 (+ venti) is good
enough for my use.

> this is a pain in the neck as it's a function of ber,
> mtbf, rebuild window and number of drives.
>
> i found that not having a hot spare can increase
> your chances of a double failure by an order of
> magnitude.  the birthday paradox never ceases to
> amaze.

I plan to replace one disk every 6 to 9 months or so.  In a
3+2 raidz2 array, disks will be swapped out in 2.5 to 3.75
years in the worst case.  What I haven't found is a decent,
no-frills, sata/e-sata enclosure for a home system.




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 20:30           ` erik quanstrom
@ 2009-09-21 20:57             ` Jack Norton
  2009-09-21 23:38               ` erik quanstrom
  2009-09-21 22:07             ` Bakul Shah
  1 sibling, 1 reply; 42+ messages in thread
From: Jack Norton @ 2009-09-21 20:57 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:
>>> i think the lesson here is don't buy cheap drives; if you
>>> have enterprise drives at 1e-15 error rate, the fail rate
>>> will be 0.8%.  of course if you don't have a raid, the fail
>>> rate is 100%.
>>>
>>> if that's not acceptable, then use raid 6.
>>>
>> Hopefully Raid 6 or zfs's raidz2 works well enough with cheap
>> drives!
>>
>
> don't hope.  do the calculations.  or simulate it.
>
> this is a pain in the neck as it's a function of ber,
> mtbf, rebuild window and number of drives.
>
> i found that not having a hot spare can increase
> your chances of a double failure by an order of
> magnitude.  the birthday paradox never ceases to
> amaze.
>
> - erik
>
>
While we are on the topic:
How many RAID cards have we failed lately?  I ask because I am about to
hit a fork in the road with my work-alike of your diskless fs.  I was
originally going to use linux soft raid and vblade, but I am considering
using some raid cards that just so happen to be included in the piece of
hardware I will be getting soon...
At work, we recently had a massive failure of our RAID array.  After
much brown-nosing, I came to find that, after many hard drives had been
shipped to our IT guy and much head scratching on his part, it was in
fact the RAID card itself that had failed (which takes out the whole
array, plus can take out any new drives you throw at it, apparently).

So I ask you all this (especially those in the 'biz): all this
redundancy on the drive side, why no redundancy of controller cards (or
should I say, the driver infrastructure needed)?

It is appealing to me to try and get some plan 9 supported raid card and
have plan 9 throughout (like the coraid setup as far as I can tell), but
this little issue bothers me.

Speaking of birthdays, I mentioned to our IT dept (all two people...) that
they should try and spread out the drives used among different mfg dates
and batches.  It shocked me to learn that this was news to them...

-Jack




* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 19:21           ` erik quanstrom
@ 2009-09-21 20:57             ` Wes Kussmaul
  2009-09-21 22:42               ` erik quanstrom
  2009-09-22 10:59             ` matt
  1 sibling, 1 reply; 42+ messages in thread
From: Wes Kussmaul @ 2009-09-21 20:57 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:

> storage vendors have a credibility problem.  i think the big
> storage vendors, as referenced in the op, sell you on many
> things you don't need for much more than one has to spend.

Those of us who know something about Coraid understand that your company
  doesn't engage in fudili practices.

> so i understand one is not inclined to believe a storage guy
> about the need for something that's more expensive.

Maybe the appliance example was misplaced; no skepticism of your take on
drive quality was intended. After all, you don't make the drives. I'm
genuinely interested in the view of a storage systems pro like yourself
about how an outsider like me who buys drives at tigerdirect can discern
an "enterprise" drive when I see one.

Wes





* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 19:10         ` Bakul Shah
@ 2009-09-21 20:30           ` erik quanstrom
  2009-09-21 20:57             ` Jack Norton
  2009-09-21 22:07             ` Bakul Shah
  0 siblings, 2 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-21 20:30 UTC (permalink / raw)
  To: 9fans

> > i think the lesson here is don't buy cheap drives; if you
> > have enterprise drives at 1e-15 error rate, the fail rate
> > will be 0.8%.  of course if you don't have a raid, the fail
> > rate is 100%.
> >
> > if that's not acceptable, then use raid 6.
>
> Hopefully Raid 6 or zfs's raidz2 works well enough with cheap
> drives!

don't hope.  do the calculations.  or simulate it.

this is a pain in the neck as it's a function of ber,
mtbf, rebuild window and number of drives.

i found that not having a hot spare can increase
your chances of a double failure by an order of
magnitude.  the birthday paradox never ceases to
amaze.

- erik



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 18:49         ` Wes Kussmaul
@ 2009-09-21 19:21           ` erik quanstrom
  2009-09-21 20:57             ` Wes Kussmaul
  2009-09-22 10:59             ` matt
  0 siblings, 2 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-21 19:21 UTC (permalink / raw)
  To: 9fans

On Mon Sep 21 14:51:07 EDT 2009, wes@authentrus.com wrote:
> erik quanstrom wrote:
> Our top-of-the-line Sub Zero and Thermidor kitchen appliances are pure
> junk. In fact, I can point to Consumer Reports data that shows an
> inverse relationship between appliance cost and reliability.

storage vendors have a credibility problem.  i think the big
storage vendors, as referenced in the op, sell you on many
things you don't need for much more than one has to spend.

so i understand one is not inclined to believe a storage guy
about the need for something that's more expensive.

however, unlike your refrigerator, the hard drive vendors
claim different reliability numbers for enterprise drives
than consumer models.  they claim (a) lower ure rates
(b) greater mtbf (c) 100% duty cycle.  typical enterprise
drives today have a 10x smaller claimed ure rate, 2-5x
greater mtbf and are warranted to run 100% of the time.

you may not believe them, but when we see drive failures,
chances are very good that it's a consumer drive.

from what i can tell without any nda access, there's different
hardware on enterprise drives.  there are different chips on
the pcbs of most enterprise drives, claimed extra accelerometers
and temperature sensors (to properly place the head).
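
to put the claimed mtbf numbers in more familiar terms, a quick
sketch.  the two mtbf figures below are typical datasheet-style values
picked for illustration, not any particular vendor's.

# turn a claimed mtbf into a yearly failure probability, assuming a
# constant failure rate: afr = 1 - exp(-hours_per_year / mtbf).
import math

HOURS_PER_YEAR = 8766
for label, mtbf_h in (("consumer,   0.6e6 h mtbf", 600_000),
                      ("enterprise, 1.2e6 h mtbf", 1_200_000)):
    afr = 1 - math.exp(-HOURS_PER_YEAR / mtbf_h)
    print(f"{label}: afr ~ {100 * afr:.2f}%/year")

a 2x mtbf claim roughly halves the expected yearly failure rate; the
claimed ure difference shows up separately, in the read-back
calculations elsewhere in the thread.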

- erik



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 18:02       ` erik quanstrom
  2009-09-21 18:49         ` Wes Kussmaul
@ 2009-09-21 19:10         ` Bakul Shah
  2009-09-21 20:30           ` erik quanstrom
  1 sibling, 1 reply; 42+ messages in thread
From: Bakul Shah @ 2009-09-21 19:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, 21 Sep 2009 14:02:40 EDT erik quanstrom <quanstro@quanstro.net>  wrote:
> > > i would think this is acceptable.  at these low levels, something
> > > else is going to get you -- like drives failing non-independently,
> > > say, because of power problems.
> >
> > 8% rate for an array rebuild may or may not be acceptable
> > depending on your application.
>
> > i think the lesson here is don't buy cheap drives; if you
> have enterprise drives at 1e-15 error rate, the fail rate
> will be 0.8%.  of course if you don't have a raid, the fail
> rate is 100%.
>
> if that's not acceptable, then use raid 6.

Hopefully Raid 6 or zfs's raidz2 works well enough with cheap
drives!

> > > so there are 4 ways to fail.  3 double fail have a probability of
> > > 3*(2^9 bits * 1e-14 1/ bit)^2
> >
> > Why 2^9 bits? A sector is 2^9 bytes or 2^12 bits.
>
>
> cut-and-paste error.  sorry that was 2^19 bits, i.e. 64k*8 bits/byte.
> the calculation is still correct, since it was done on that basis.

Ok.

> > If per sector recovery is done, you have
> > 	5E-21*(64K/512) ~= 6.4E-19
>
> i'd be interested to know if anyone does this.  it's not
> as easy as it would first appear.  do you know of any
> hardware or software that does sector-level recovery?

No idea -- I haven't really looked in this area in ages.  In
case of two stripes being bad it would make sense to me to
reread a stripe one sector at a time since chances of the
exact same sector being bad on two disks is much lower (about
2^14 times smaller for 64k stripes?).  I don't know if disk
drives return an error bit array along with the data of a
multisector read (nth bit is set if nth sector could not be
recovered).  If not, that would be a worthwhile addition.
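
For illustration, a toy sketch in python of the sector-at-a-time idea
on a 3-disk RAID-5 stripe.  read_sector() is a made-up stand-in for
whatever the real driver exposes, not an actual interface.

# retry a failed stripe one sector at a time; a single bad sector at
# any offset can be rebuilt by xor of the other two disks' sectors.
def recover_stripe(read_sector, lba0, sectors_per_stripe):
    """read_sector(disk, lba) -> 512 bytes, or None on a read error."""
    out = [bytearray() for _ in range(3)]
    for s in range(sectors_per_stripe):
        data = [read_sector(disk, lba0 + s) for disk in range(3)]
        bad = [i for i, d in enumerate(data) if d is None]
        if len(bad) > 1:
            return None              # same sector offset bad on two disks
        if bad:
            good = [d for d in data if d is not None]
            data[bad[0]] = bytes(a ^ b for a, b in zip(*good))
        for disk in range(3):
            out[disk] += data[disk]
    return out

The point of the probabilities above is just that the len(bad) > 1
branch is hit far less often than a whole-stripe retry would fail.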

> i don't have enough data to know how likely it is to
> have exactly 1 bad sector.  any references?

Not sure what you are asking.  Reed-Solomon codes are block codes,
applied to a whole sector, so the per sector error rate is
UER*512*8 where UER == uncorrectable error rate. [Early IDE
disks had 4 byte ECC per sector.  Now that bits are packed so
tight, S/N ratio is far worse and ECC is at least 40 bytes,
to keep UER to 1E-14 or whatever is the target].



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 18:02       ` erik quanstrom
@ 2009-09-21 18:49         ` Wes Kussmaul
  2009-09-21 19:21           ` erik quanstrom
  2009-09-21 19:10         ` Bakul Shah
  1 sibling, 1 reply; 42+ messages in thread
From: Wes Kussmaul @ 2009-09-21 18:49 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

erik quanstrom wrote:

> i think the lesson here is don't buy cheap drives;

Our top-of-the-line Sub Zero and Thermidor kitchen appliances are pure
junk. In fact, I can point to Consumer Reports data that shows an
inverse relationship between appliance cost and reliability.

One who works for Coraid is likely to know which drives are better than
others. Perhaps you can give us a heads up?

Wes




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21 17:43     ` Bakul Shah
@ 2009-09-21 18:02       ` erik quanstrom
  2009-09-21 18:49         ` Wes Kussmaul
  2009-09-21 19:10         ` Bakul Shah
  0 siblings, 2 replies; 42+ messages in thread
From: erik quanstrom @ 2009-09-21 18:02 UTC (permalink / raw)
  To: 9fans

> > i would think this is acceptable.  at these low levels, something
> > else is going to get you -- like drives failing non-independently,
> > say, because of power problems.
>
> 8% rate for an array rebuild may or may not be acceptable
> depending on your application.

i think the lesson here is don't buy cheap drives; if you
have enterprise drives at 1e-15 error rate, the fail rate
will be 0.8%.  of course if you don't have a raid, the fail
rate is 100%.

if that's not acceptable, then use raid 6.
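
a quick check of the 8% / 0.8% figures -- the linear estimate next to
1-(1-ber)^bits, assuming independent bit errors over a full 1TB read:

# chance of at least one unrecoverable read error (ure) while reading
# a 1TB drive end to end, for the two advertised error-rate classes.
bits = 8 * 1e12                          # 1 TB, in bits
for label, ber in (("consumer   1e-14", 1e-14),
                   ("enterprise 1e-15", 1e-15)):
    exact = 1 - (1 - ber) ** bits        # assumes independent bit errors
    print(f"{label}: linear {ber * bits:.1%}, exact {exact:.1%}")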

> > so there are 4 ways to fail.  3 double fail have a probability of
> > 3*(2^9 bits * 1e-14 1/ bit)^2
>
> Why 2^9 bits? A sector is 2^9 bytes or 2^12 bits.

cut-and-paste error.  sorry that was 2^19 bits, i.e. 64k*8 bits/byte.
the calculation is still correct, since it was done on that basis.

> If per sector recovery is done, you have
> 	5E-21*(64K/512) ~= 6.4E-19

i'd be interested to know if anyone does this.  it's not
as easy as it would first appear.  do you know of any
hardware or software that does sector-level recovery?

i don't have enough data to know how likely it is to
have exactly 1 bad sector.  any references?

- erik



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-21  3:37   ` erik quanstrom
@ 2009-09-21 17:43     ` Bakul Shah
  2009-09-21 18:02       ` erik quanstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Bakul Shah @ 2009-09-21 17:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


> > > 	8 bits/byte * 1e12 bytes / 1e14 bits/ure = 8%
> >
> > Isn't that the probability of getting a bad sector when you
> > read a terabyte? In other words, this is not related to the
> > disk size but how much you read from the given disk. Granted
> > that when you "resilver" you have no choice but to read the
> > entire disk and that is why just one redundant disk is not
> > good enough for TB size disks (if you lose a disk there is 8%
> > chance you copied a bad block in resilvering a mirror).
>
> see below.  i think you're confusing a single disk's 8% chance
> of failure with a 3 disk 1tb array's ~1e-7% chance of failure.

I was talking about the case where you replace a disk in a
mirror. To rebuild the mirror the new disk has to be
initialized from the remaining "good" disk (and there is no
redundancy left) so you have to read the whole disk. This
implies 8% chance of a bad sector. The situation is worse in
an N+1 disk RAID-5 when you lose a disk. Now you have N*8%
chance of a bad sector. And of course in real life things are
worse, because usually disks in a cheap array don't have
independent power supplies (and the shared one can be
underpowered or under-regulated).
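
A quick sketch of how the rebuild exposure scales with the number of
surviving 1TB disks that have to be read in full (1 for a mirror, N
for an N+1 RAID-5); desktop-class 1e-14 BER and independent errors
assumed:

# probability of hitting at least one ure during a rebuild, as a
# function of how many surviving 1TB disks must be read completely.
ber = 1e-14
bits_per_disk = 8 * 1e12
for surviving in (1, 2, 4, 8):
    p = 1 - (1 - ber) ** (bits_per_disk * surviving)
    print(f"{surviving} surviving disk(s) read in full: ~{p:.0%} chance of a ure")

The N*8% figure is the small-probability approximation; the exact form
saturates as N grows.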

> i would think this is acceptable.  at these low levels, something
> else is going to get you — like drives failing non-independently,
> say, because of power problems.

8% rate for an array rebuild may or may not be acceptable
depending on your application.

> > > i'm a little too lazy to calculate what the probability is that
> > > another sector in the row is also bad.  (this depends on
> > > stripe size, the number of disks in the raid, etc.)  but it's
> > > safe to say that it's pretty small.  for a 3 disk raid 5 with
> > > 64k stripes it would be something like
> > > 	8 bits/byte * 64k * 3 / 1e14 ~= 1e-8
> >
> > The read error prob. for a 64K byte stripe is 3*2^19/10^14 ~=
> > 3*0.5E-8, since three 64k byte blocks have to be read.  The
> > unrecoverable case is two of them being bad at the same time.
> > The prob. of this is 3*0.25E-16 (not sure I did this right --
>
> thanks for noticing that.  i think i didn't explain myself well
> i was calculating the rough probability of a ure in reading the
> *whole array*, not just one stripe.
>
> to do this more methodically using your method, we need
> to count up all the possible ways of getting a double fail
> with 3 disks and multiply by the probability of getting that
> sort of failure and then add 'em up.  if 0 is ok and 1 is fail,
> then i think there are these cases:
>
> 0 0 0
> 1 0 0
> 0 1 0
> 0 0 1
> 1 1 0
> 1 0 1
> 0 1 1
> 1 1 1
>
> so there are 4 ways to fail.  3 double fail have a probability of
> 3*(2^9 bits * 1e-14 1/ bit)^2

Why 2^9 bits? A sector is 2^9 bytes or 2^12 bits. Note that
there is no recovery possible for fewer bits than a sector.

> and the triple fail has a probability of
> (2^9 bits * 1e-14 1/ bit)^3
> so we have
> 3*(2^9 bits * 1e-14 1/ bit)^2 + (2^9 bits * 1e-14 1/ bit)^3 ~=
> 	3*(2^9 bits * 1e-14 1/ bit)^2
> 	= 8.24633720832e-17

3*(2^12 bits * 1e-14 1/ bit)^2 + (2^12 bits * 1e-14 1/ bit)^3 ~=
  	3*(2^12 bits * 1e-14 1/ bit)^2
	~= 3*(4.1e-11)^2 ~= 5E-21

If per sector recovery is done, you have
	5E-21*(64K/512) ~= 6.4E-19

> that's per stripe.  if we multiply by 1e12/(64*1024) stripes/array,
>
> we have
> 	= 1.2582912e-09

For the whole 2TB array you have

	5E-21*(10^12/512) ~= 1E-11

> which is remarkably close to my lousy first guess.  so we went
> from 8e-2 to 1e-9 for an improvement of 7 orders of magnitude.
>
> > we have to consider the exact same sector # going bad in two
> > of the three disks and there are three such pairs).
>
> the exact sector doesn't matter.  i don't know any
> implementations that try to do partial stripe recovery.

If by partial stripe recovery you mean 2 of the stripes must
be entirely error free to recreate the third, your logic
seems wrong even after we replace 2^9 with 2^12 bits.

When you have only stripe level recovery you will throw away
whole stripes even where sector level recovery would've
worked.  If, for example, each stripe has two sectors on a 3
disk raid5, and sector 0 of disk 0's stripe and sector 1 of
disk 1's stripe are bad, sector level recovery would work
but stripe level recovery would fail.  For 2 sector stripes
and 3 disks you have 64 possible outcomes, out of which 48
result in bad data for sector level recovery and 54 for
stripe level recovery (see the table below; the short
enumeration after the table reproduces these counts).  And
it will get worse with larger stripes.  [by bad data I mean
we throw away the whole stripe even if one sector can't be
recovered]

I did some googling but didn't discover anything that does a
proper statistical analysis.

---------------------------------------------
disk0 stripe sectors (1 means read failed)
|  disk1 stripe sectors
|  |  disk2 stripe sectors
|  |  |  sector level recovery possible? (if so, the stripe can be recovered)
|  |  |  | stripe level recovery possible?
|  |  |  | |
00 00 00
00 00 01
00 00 10
00 00 11
00 01 00
00 01 01 N N
00 01 10   N
00 01 11 N N
00 10 00
00 10 01   N
00 10 10 N N
00 10 11 N N
00 11 00
00 11 01 N N
00 11 10 N N
00 11 11 N N

01 00 00
01 00 01 N N
01 00 10   N
01 00 11 N N
01 01 00 N N
01 01 01 N N
01 01 10 N N
01 01 11 N N
01 10 00   N
01 10 01 N N
01 10 10 N N
01 10 11 N N
01 11 00 N N
01 11 01 N N
01 11 10 N N
01 11 11 N N

10 00 00
10 00 01   N
10 00 10 N N
10 00 11 N N
10 01 00   N
10 01 01 N N
10 01 10 N N
10 01 11 N N
10 10 00 N N
10 10 01 N N
10 10 10 N N
10 10 11 N N
10 11 00 N N
10 11 01 N N
10 11 10 N N
10 11 11 N N

11 00 00
11 00 01 N N
11 00 10 N N
11 00 11 N N
11 01 00 N N
11 01 01 N N
11 01 10 N N
11 01 11 N N
11 10 00 N N
11 10 01 N N
11 10 10 N N
11 10 11 N N
11 11 00 N N
11 11 01 N N
11 11 10 N N
11 11 11 N N
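
A short enumeration reproduces the two counts above (3 disks, 2
sectors per stripe; more than one bad copy per recovery unit is
fatal):

# count unrecoverable outcomes for sector-level vs stripe-level
# recovery on a 3-disk raid5 with 2-sector stripes (64 outcomes).
from itertools import product

sector_fail = stripe_fail = 0
for outcome in product(product((0, 1), repeat=2), repeat=3):
    # stripe-level: a disk's whole stripe is unusable if any sector in
    # it failed; more than one unusable stripe is unrecoverable.
    if sum(1 for disk in outcome if any(disk)) >= 2:
        stripe_fail += 1
    # sector-level: each sector offset tolerates at most one bad copy.
    if any(sum(disk[i] for disk in outcome) >= 2 for i in range(2)):
        sector_fail += 1

print("unrecoverable, sector-level recovery:", sector_fail)   # 48
print("unrecoverable, stripe-level recovery:", stripe_fail)   # 54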



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-20 20:13 ` Bakul Shah
@ 2009-09-21  3:37   ` erik quanstrom
  2009-09-21 17:43     ` Bakul Shah
  0 siblings, 1 reply; 42+ messages in thread
From: erik quanstrom @ 2009-09-21  3:37 UTC (permalink / raw)
  To: 9fans

> > drive mfgrs don't report write error rates.  i would consider any
> > drive with write errors to be dead as fried chicken.  a more
> > interesting question is what is the chance you can read the
> > written data back correctly.  in that case with desktop drives,
> > you have a
> > 	8 bits/byte * 1e12 bytes / 1e14 bits/ure = 8%
>
> Isn't that the probability of getting a bad sector when you
> read a terabyte? In other words, this is not related to the
> disk size but how much you read from the given disk. Granted
> that when you "resilver" you have no choice but to read the
> entire disk and that is why just one redundant disk is not
> good enough for TB size disks (if you lose a disk there is 8%
> chance you copied a bad block in resilvering a mirror).

see below.  i think you're confusing a single disk's 8% chance
of failure with a 3 disk 1tb array's ~1e-7% chance of failure.

i would think this is acceptable.  at these low levels, something
else is going to get you — like drives failing non-independently,
say, because of power problems.

> > i'm a little too lazy to calculate what the probability is that
> > another sector in the row is also bad.  (this depends on
> > stripe size, the number of disks in the raid, etc.)  but it's
> > safe to say that it's pretty small.  for a 3 disk raid 5 with
> > 64k stripes it would be something like
> > 	8 bits/byte * 64k * 3 / 1e14 ~= 1e-8
>
> The read error prob. for a 64K byte stripe is 3*2^19/10^14 ~=
> 3*0.5E-8, since three 64k byte blocks have to be read.  The
> unrecoverable case is two of them being bad at the same time.
> The prob. of this is 3*0.25E-16 (not sure I did this right --

thanks for noticing that.  i think i didn't explain myself well.
i was calculating the rough probability of a ure in reading the
*whole array*, not just one stripe.

to do this more methodically using your method, we need
to count up all the possible ways of getting a double fail
with 3 disks and multiply by the probability of getting that
sort of failure and then add 'em up.  if 0 is ok and 1 is fail,
then i think there are these cases:

0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1

so there are 4 ways to fail.  3 double fail have a probability of
3*(2^9 bits * 1e-14 1/ bit)^2
and the triple fail has a probability of
(2^9 bits * 1e-14 1/ bit)^3
so we have
3*(2^9 bits * 1e-14 1/ bit)^2 + (2^9 bits * 1e-14 1/ bit)^3 ~=
	3*(2^9 bits * 1e-14 1/ bit)^2
	= 8.24633720832e-17
that's per stripe.  if we multiply by 1e12/(64*1024) stripes/array,
we have
	= 1.2582912e-09
which is remarkably close to my lousy first guess.  so we went
from 8e-2 to 1e-9 for an improvement of 7 orders of magnitude.
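
spelling the same arithmetic out in python, using the corrected 2^19
bits per 64k chunk noted in the follow-up:

# per-stripe and per-array double/triple read failure probability for
# a 3 disk raid 5 with 64k stripes and a 1e-14 per-bit ure rate.
ber = 1e-14
chunk_bits = 64 * 1024 * 8                  # 2^19 bits per 64k chunk
p_chunk = chunk_bits * ber                  # one disk's chunk reads bad
p_stripe = 3 * p_chunk**2 + p_chunk**3      # 2 or 3 of the 3 chunks bad
stripes = 1e12 / (64 * 1024)                # stripes across a 1TB disk
print(f"per stripe: {p_stripe:.6e}")            # ~8.2463e-17
print(f"per array : {p_stripe * stripes:.6e}")  # ~1.2583e-09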

> we have to consider the exact same sector # going bad in two
> of the three disks and there are three such pairs).

the exact sector doesn't matter.  i don't know any
implementations that try to do partial stripe recovery.

- erik



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
  2009-09-14 16:43 erik quanstrom
@ 2009-09-20 20:13 ` Bakul Shah
  2009-09-21  3:37   ` erik quanstrom
  0 siblings, 1 reply; 42+ messages in thread
From: Bakul Shah @ 2009-09-20 20:13 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, 14 Sep 2009 12:43:42 EDT erik quanstrom <quanstro@quanstro.net>  wrote:
> > I am going to try my hands at beating a dead horse:)
> > So when you create a Venti volume, it basically writes '0's' to all the
> > blocks of the underlying device right?  If I put a venti volume on a AoE
> > device which is a linux raid5, using normal desktop sata drives, what
> > are my chances of a successful completion of the venti formating (let's
> > say 1TB raw size)?
>
> drive mfgrs don't report write error rates.  i would consider any
> drive with write errors to be dead as fried chicken.  a more
> interesting question is what is the chance you can read the
> written data back correctly.  in that case with desktop drives,
> you have a
> 	8 bits/byte * 1e12 bytes / 1e14 bits/ure = 8%

Isn't that the probability of getting a bad sector when you
read a terabyte? In other words, this is not related to the
disk size but how much you read from the given disk. Granted
that when you "resilver" you have no choice but to read the
entire disk and that is why just one redundant disk is not
good enough for TB size disks (if you lose a disk there is 8%
chance you copied a bad block in resilvering a mirror).

> i'm a little too lazy to calculate what the probability is that
> another sector in the row is also bad.  (this depends on
> stripe size, the number of disks in the raid, etc.)  but it's
> safe to say that it's pretty small.  for a 3 disk raid 5 with
> 64k stripes it would be something like
> 	8 bits/byte * 64k * 3 / 1e14 ~= 1e-8

The read error prob. for a 64K byte stripe is 3*2^19/10^14 ~=
3*0.5E-8, since three 64k byte blocks have to be read.  The
unrecoverable case is two of them being bad at the same time.
The prob. of this is 3*0.25E-16 (not sure I did this right --
we have to consider the exact same sector # going bad in two
of the three disks and there are three such pairs).



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [9fans] Petabytes on a budget: JBODs + Linux + JFS
@ 2009-09-14 16:43 erik quanstrom
  2009-09-20 20:13 ` Bakul Shah
  0 siblings, 1 reply; 42+ messages in thread
From: erik quanstrom @ 2009-09-14 16:43 UTC (permalink / raw)
  To: 9fans

> I am going to try my hands at beating a dead horse:)
> So when you create a Venti volume, it basically writes '0's' to all the
> blocks of the underlying device right?  If I put a venti volume on a AoE
> device which is a linux raid5, using normal desktop sata drives, what
> are my chances of a successful completion of the venti formating (let's
> say 1TB raw size)?

drive mfgrs don't report write error rates.  i would consider any
drive with write errors to be dead as fried chicken.  a more
interesting question is what is the chance you can read the
written data back correctly.  in that case with desktop drives,
you have a
	8 bits/byte * 1e12 bytes / 1e14 bits/ure = 8%
i'm a little too lazy to calculate what the probability is that
another sector in the row is also bad.  (this depends on
stripe size, the number of disks in the raid, etc.)  but it's
safe to say that it's pretty small.  for a 3 disk raid 5 with
64k stripes it would be something like
	8 bits/byte * 64k * 3 / 1e14 ~= 1e-8
i'm making the completely unwarranted assumption that
read errors are independent; see below as to why they may
not be.

> Have you ever encountered such problems, or are you
> using more robust hardware?

yes.  i have.  after unexpected power failure and apparently
a head crash, i have seen writes that appear to work but don't,
followed by write failure and smart "threshold exceeded".
smart here isn't diagnostic but it allows one to rma a drive
without booting the drive wiz-bang tool.

- erik



^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2009-09-22 10:59 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-04  0:53 [9fans] Petabytes on a budget: JBODs + Linux + JFS Roman V Shaposhnik
2009-09-04  1:20 ` erik quanstrom
2009-09-04  9:37   ` matt
2009-09-04 14:30     ` erik quanstrom
2009-09-04 16:54     ` Roman Shaposhnik
2009-09-04 12:24   ` Eris Discordia
2009-09-04 12:41     ` erik quanstrom
2009-09-04 13:56       ` Eris Discordia
2009-09-04 14:10         ` erik quanstrom
2009-09-04 18:34           ` Eris Discordia
     [not found]       ` <48F03982350BA904DFFA266E@192.168.1.2>
2009-09-07 20:02         ` Uriel
2009-09-08 13:32           ` Eris Discordia
2009-09-04 16:52   ` Roman Shaposhnik
2009-09-04 17:27     ` erik quanstrom
2009-09-04 17:37       ` Jack Norton
2009-09-04 18:33         ` erik quanstrom
2009-09-08 16:53           ` Jack Norton
2009-09-08 17:16             ` erik quanstrom
2009-09-08 18:17               ` Jack Norton
2009-09-08 18:54                 ` erik quanstrom
2009-09-14 15:50                   ` Jack Norton
2009-09-14 17:05                     ` Russ Cox
2009-09-14 17:48                       ` Jack Norton
2009-09-04 23:25   ` James Tomaschke
2009-09-14 16:43 erik quanstrom
2009-09-20 20:13 ` Bakul Shah
2009-09-21  3:37   ` erik quanstrom
2009-09-21 17:43     ` Bakul Shah
2009-09-21 18:02       ` erik quanstrom
2009-09-21 18:49         ` Wes Kussmaul
2009-09-21 19:21           ` erik quanstrom
2009-09-21 20:57             ` Wes Kussmaul
2009-09-21 22:42               ` erik quanstrom
2009-09-22 10:59             ` matt
2009-09-21 19:10         ` Bakul Shah
2009-09-21 20:30           ` erik quanstrom
2009-09-21 20:57             ` Jack Norton
2009-09-21 23:38               ` erik quanstrom
2009-09-21 22:07             ` Bakul Shah
2009-09-21 23:35               ` Eris Discordia
2009-09-22  0:45                 ` erik quanstrom
     [not found]               ` <6DC61E4A6EC613C81AC1688E@192.168.1.2>
2009-09-21 23:50                 ` Eris Discordia
