9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] ata drive capabilities
@ 2007-12-26  7:44 Joshua Wood
  2007-12-26 13:18 ` roger peppe
  0 siblings, 1 reply; 19+ messages in thread
From: Joshua Wood @ 2007-12-26  7:44 UTC (permalink / raw)
  To: 9fans

> From everything I've seen, SMART has zero correlation with real
> hardware issues -- confirmed by a discussion with someone at a big
> search company. SMART is dumb.

If it's everyone's favorite ``big search company'' in question, they  
have an [only moderately depressing] paper:
	http://209.85.163.132/papers/disk_failures.pdf

Turns out from their big sample that, nope, SMART isn't good at  
predicting failure; nor are temperature or activity levels. Instead  
it seems like almost entirely a manufacturing crapshoot.

SMART looks no smarter in CMU's study of the same topic, which nixes  
age as a good failure predictor, too:
	http://www.usenix.org/event/fast07/tech/schroeder/schroeder_html/index.html

--
Josh


* Re: [9fans] ata drive capabilities
@ 2007-12-27  9:01 Joshua Wood
  2007-12-27 15:15 ` Brantley Coile
  0 siblings, 1 reply; 19+ messages in thread
From: Joshua Wood @ 2007-12-27  9:01 UTC (permalink / raw)
  To: 9fans

> here's what we do using ken's fs, not venti.
>         http://cm.bell-labs.com/iwp9/papers/23.disklessfs.pdf
>

Oh right -- I'd forgotten about the wireless link out to another  
building that is described in the above. Good enough against the  
small meteorite. :-)

Viewed from a high enough level, I think I'm using venti and a backup
fileserver in a broadly similar way, albeit less `live' and less fully
integrated.

Thanks, Erik.

--
Josh




* Re: [9fans] ata drive capabilities
@ 2007-12-27  6:22 Joshua Wood
  2007-12-27  7:28 ` erik quanstrom
  0 siblings, 1 reply; 19+ messages in thread
From: Joshua Wood @ 2007-12-27  6:22 UTC (permalink / raw)
  To: 9fans


> i don't know.  if you lean that direction, then the only thing raid gives
> you is reduced downtime.

On reflection, it seems I do lean that direction, and generally use
raid mainly to dodge downtime. Our Plan 9 systems (and the `other' ones
alike) mostly have redundant disks (when they must have disks at all) --
but they also have regular offsite backups. I wonder if I'm being wasteful.

> i think of raid as reliable storage.  backups are for saving one's bacon in
> the face of other disasters.  you know, sysadmin mistakes, misconfiguration,
> code gone wild, building burns down,
>
meteorite! ;)

> (and if my experience with backups is any indication, it's best not to
> rely on them.)
>
Probably another discussion, but I try to deal with this by testing
the offsite backups (rdarena output) of the Plan 9 fileserver against
a similar system that's designated the second-string fileserver. I
haven't had to do it in production in a while (raid narrowed the
reasons I'd need to), so maybe I'm missing something and it would be
less successful than in the testing.

> but this thinking is probably specific to how i use raid.  i imagine the
> exact answer on what raid gives you should be worked out based on
> the application.
>
I probably veer toward mere semantics, but I'd still define your use
of raid to be uptime-protection. The list of exceptions you place
under ``backups are for...'' is essentially the same list that
motivates the offsite backups I mention -- the usual holes in the
raid prophylactic. I see how Plan 9 facilities (esp. dump) ameliorate
some of them: admin mistakes, for example. But they don't fireproof
the system.

Or, put another way, you're not asserting you have no backup beyond  
that fileserver raid, are you? Because if so, I want to learn how I  
can skip that step, too.

--
Josh





* Re: [9fans] ata drive capabilities
@ 2007-12-26 16:22 Joshua Wood
  2007-12-26 18:14 ` erik quanstrom
  0 siblings, 1 reply; 19+ messages in thread
From: Joshua Wood @ 2007-12-26 16:22 UTC (permalink / raw)
  To: 9fans

> the google paper shows a 40% afr for the first 6 months after some
> smart errors appear.  (unfortunately they don't do numbers for
> a simple smart status.)

Yes, and I rather mischaracterized the google paper's comments on  
SMART. A reread (I first read them a few months ago) shows the above.  
Further, the CMU paper even references the google study on the SMART  
subject:

``They find that [ ... ] the value of several SMART counters  
correlate highly with failures.''

So SMART appears a little less dumb. I'd say it meets the
better-than-nothing criterion.

> from my understanding of how google do things, losing a drive just
> means they need to replace it.  so it's cheaper to let drives fail.
> on the other hand, we have our main filesystem raided on an aoe
> appliance.  suppose that one of those raids has two disks showing
> a smart status of "will fail".  in this case i want to know the elevated
> risk and i will allocate a spare drive to replace at least one of the
> drives.
>
> i guess this is the long way of saying, it all depends on how painful
> losing your data might be.  if it's painful enough, even a poor tool
> like smart is better than nothing.
>
I agree (plus I was just wrong about SMART at first), though I do
think your example above is about preventing downtime, not so much
data loss. (Even with no SMART at all, and all the disks coming up
corrupt, we're all backed up within some acceptable window, right?)


> what a pity! it would have been so great to have had
> an objective assessment of reliability by manufacturer.
>
Since the CMU thing found no difference between disk *types*, I  
wonder if it might be that there's little difference between  
manufacturers either -- instead the difference is in manufacturing,  
i.e., `vintage' & the like.

> i've found it really quite hard to find useful data to
> indicate how reliable a drive might be.
>

I think Fig. 2, Sec. 4.2 of the CMU paper relates to that; the  
`infant mortality' of manufactured mechanical parts isn't captured in  
MTTF -- but IDEMA is apparently going to solve this by replacing the  
single MTTF number that I don't quite understand with 4 different  
MTTF numbers, one for each `phase' of a disk's life.
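
For anyone else who finds the single MTTF number opaque, here is a minimal
sketch of how it relates to an annualized failure rate (AFR). It assumes a
constant failure rate (an exponential lifetime model), which is exactly the
assumption that `infant mortality' breaks, so treat it as a rough conversion
only. The function names and the 1,000,000-hour figure are illustrative
assumptions, not taken from either paper's code or data.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365; assumes the drive is powered on all year


def afr_from_mttf(mttf_hours):
    """Probability that a drive fails within one year, given its MTTF.

    Assumes a constant failure rate of 1/MTTF per hour.
    """
    return -math.expm1(-HOURS_PER_YEAR / mttf_hours)


def mttf_from_afr(afr):
    """Inverse conversion: the MTTF implied by an observed AFR."""
    return -HOURS_PER_YEAR / math.log1p(-afr)


if __name__ == "__main__":
    # A hypothetical datasheet MTTF of one million hours.
    print(f"AFR for MTTF 1,000,000 h: {afr_from_mttf(1_000_000):.2%}")
    # The 40% AFR quoted above, expressed as an equivalent MTTF.
    print(f"MTTF for AFR 40%: {mttf_from_afr(0.40):.0f} h")
```

A per-phase MTTF would mean applying a conversion like this separately to
each stretch of a disk's life instead of once to the whole of it.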

--
Josh




* [9fans] ata drive capabilities
@ 2007-12-25 21:40 Christian Kellermann
  2007-12-25 21:48 ` Pietro Gagliardi
  2007-12-25 23:59 ` erik quanstrom
  0 siblings, 2 replies; 19+ messages in thread
From: Christian Kellermann @ 2007-12-25 21:40 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


Hi 9fans,

can someone on this list tell me how to interpret the config part of 
cpu% cat /dev/sdC0/ctl
inquiry WDC WD1600JB-00REA0                     
config 427A capabilities 2F00 dma 00550020 dmactl 00550020 rwm 16 rwmctl 0 lba48always off

I am trying to figure out whether the disk signals the implementation
of the SMART feature set.
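
As a rough illustration of the kind of decoding being asked about, the sketch
below unpacks the capabilities value from the ctl output above. It rests on
two assumptions that should be checked against the ATA specification and the
sdata driver source: that the capabilities field is IDENTIFY DEVICE word 49,
and that SMART support is reported in bit 0 of word 82, a word that does not
appear in this ctl line at all. The names used here are hypothetical.

```python
# Assumed bit meanings for IDENTIFY DEVICE word 49 ("capabilities").
CAPABILITY_BITS = {
    8: "DMA supported",
    9: "LBA supported",
    10: "IORDY may be disabled",
    11: "IORDY supported",
    13: "standby timer values as specified by the standard",
}


def decode_capabilities(word):
    """Return the names of the known capability bits set in word."""
    return [name for bit, name in sorted(CAPABILITY_BITS.items())
            if word & (1 << bit)]


def smart_supported(word82):
    """Assumed test: bit 0 of IDENTIFY DEVICE word 82 marks SMART support."""
    return bool(word82 & 1)


if __name__ == "__main__":
    # 0x2F00 is the capabilities value shown in the ctl output above.
    for name in decode_capabilities(0x2F00):
        print(name)
```

If those assumptions hold, neither the config nor the capabilities value
answers the SMART question by itself; the raw identify data would be needed.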

Kind regards,

Christian

-- 
You may use my gpg key for replies:
pub  1024D/47F79788 2005/02/02 Christian Kellermann (C-Keen)



Thread overview: 19+ messages
2007-12-26  7:44 [9fans] ata drive capabilities Joshua Wood
2007-12-26 13:18 ` roger peppe
2007-12-26 18:15   ` erik quanstrom
  -- strict thread matches above, loose matches on Subject: below --
2007-12-27  9:01 Joshua Wood
2007-12-27 15:15 ` Brantley Coile
2007-12-27  6:22 Joshua Wood
2007-12-27  7:28 ` erik quanstrom
2007-12-26 16:22 Joshua Wood
2007-12-26 18:14 ` erik quanstrom
2007-12-25 21:40 Christian Kellermann
2007-12-25 21:48 ` Pietro Gagliardi
2007-12-25 23:59 ` erik quanstrom
2007-12-26  6:31   ` ron minnich
2007-12-26 13:10     ` erik quanstrom
2007-12-26 19:52       ` Christian Kellermann
2007-12-26 20:13         ` andrey mirtchovski
2007-12-27 18:12           ` Christian Kellermann
2007-12-26 23:58         ` Robert William Fuller
2007-12-27  2:34         ` erik quanstrom
