The Unix Heritage Society mailing list
* [TUHS] Is it time to resurrect the original dsw (delete with switches)?
@ 2021-08-29 22:12 Jon Steinhart
  2021-08-29 23:09 ` Henry Bent
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Jon Steinhart @ 2021-08-29 22:12 UTC (permalink / raw)
  To: The Unix Heretics Society mailing list

I recently upgraded my machines to fc34.  I just did a stock
uncomplicated installation using the defaults and it failed miserably.

Fc34 uses btrfs as the default filesystem so I thought that I'd give it
a try.  I was especially interested in the automatic checksumming because
the majority of my storage is large media files and I worry about bit
rot in seldom used files.  I have been keeping a separate database of
file hashes and in theory btrfs would make that automatic and transparent.

I have 32T of disk on my system, so it took a long time to convert
everything over.  A few weeks after I did this I went to unload my
camera and couldn't because the filesystem that holds my photos was
mounted read-only.  WTF?  I didn't do that.

After a bit of poking around I discovered that btrfs SILENTLY remounted the
filesystem because it had errors.  Sure, it put something in a log file,
but I don't spend all day surfing logs for things that shouldn't be going
wrong.  Maybe my expectation that filesystems just work is antiquated.

This was on a brand new 16T drive, so I didn't think that it was worth
the month that it would take to run the badblocks program which doesn't
really scale to modern disk sizes.  Besides, SMART said that it was fine.

Although it's been discredited by some, I'm still a believer in "stop and
fsck" policing of disk drives.  Unmounted the filesystem and ran fsck to
discover that btrfs had to do its own thing.  No idea why; I guess some
think that incompatibility is a good thing.

Ran "btrfs check" which reported errors in the filesystem but was otherwise
useless BECAUSE IT DIDN'T FIX ANYTHING.  What good is knowing that the
filesystem has errors if you can't fix them?

Near the top of the manual page it says:

  Warning
    Do not use --repair unless you are advised to do so by a developer
    or an experienced user, and then only after having accepted that
    no fsck successfully repair all types of filesystem corruption. Eg.
    some other software or hardware bugs can fatally damage a volume.

Whoa!  I'm sure that operators are standing by, call 1-800-FIX-BTRFS.
Really?  Is this a ploy by the developers to form a support business?

Later on, the manual page says:

  DANGEROUS OPTIONS
    --repair
        enable the repair mode and attempt to fix problems where possible

	    Note there’s a warning and 10 second delay when this option
	    is run without --force to give users a chance to think twice
	    before running repair, the warnings in documentation have
	    shown to be insufficient

Since when is it dangerous to repair a filesystem?  That's a new one to me.

Having no option other than not being able to use the disk, I ran btrfs
check with the --repair option.  It crashed.  Lesson so far is that
trusting my data to an unreliable unrepairable filesystem is not a good
idea.  Since this was one of my media disks I just rebuilt it using ext4.
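
A sketch of the sequence described above (mount point and device name
are made up, not the real ones):

  umount /media/photos              # the filesystem had gone read-only
  btrfs check /dev/sdb1             # read-only check: reports errors, fixes nothing
  btrfs check --repair /dev/sdb1    # the "dangerous" option; crashed in this case
  mkfs.ext4 /dev/sdb1               # gave up and rebuilt the disk as ext4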

Last week I was working away and tried to write out a file to discover
that /home and /root had become read-only.  Charming.  Tried rebooting,
but couldn't since btrfs filesystems aren't checked and repaired.  Plugged
in a flash drive with a live version, managed to successfully run --repair,
and rebooted.  Lasted about 15 minutes before flipping back to read only
with the same error.

Time to suck it up and revert.	Started a clean reinstall.  Got stuck
because it crashed during disk setup with anaconda giving me a completely
useless big python stack trace.  Eventually figured out that it was
unable to delete the btrfs filesystem that had errors so it just crashed
instead.  Wiped it using dd; nice that some reliable tools still survive.
Finished the installation and am back up and running.
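
(The wipe itself is nothing exotic; something like the following, with
a hypothetical device name, clobbers the old signatures so the
installer stops tripping over them:)

  dd if=/dev/zero of=/dev/sdb bs=1M count=64   # zero the start of the device
  wipefs -a /dev/sdb                           # or erase just the signatures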

Any of the rest of you have any experiences with btrfs?  I'm sure that it
works fine at large companies that can afford a team of disk babysitters.
What benefits does btrfs provide that other filesystem formats such as
ext4 and ZFS don't?  Is it just a continuation of the "we have to do
everything ourselves and under no circumstances use anything that came
from the BSD world" mentality?

So what's the future for filesystem repair?  Does it look like the past?
Is Ken's original need for dsw going to rise from the dead?

In my limited experience btrfs is a BiTteR FileSystem to swallow.

Or, as Saturday Night Live might put it:  And now, linux, starring the
not ready for prime time filesystem.  Seems like something that's been
under development for around 15 years should be in better shape.

Jon

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 22:12 [TUHS] Is it time to resurrect the original dsw (delete with switches)? Jon Steinhart
@ 2021-08-29 23:09 ` Henry Bent
  2021-08-30  3:14   ` Theodore Ts'o
  2021-08-30  9:14   ` John Dow via TUHS
  2021-08-29 23:57 ` Larry McVoy
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Henry Bent @ 2021-08-29 23:09 UTC (permalink / raw)
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

[-- Attachment #1: Type: text/plain, Size: 811 bytes --]

On Sun, 29 Aug 2021 at 18:13, Jon Steinhart <jon@fourwinds.com> wrote:

> I recently upgraded my machines to fc34.  I just did a stock
> uncomplicated installation using the defaults and it failed miserably.
>
> Fc34 uses btrfs as the default filesystem so I thought that I'd give it
> a try.
>

... cut out a lot about how no sane person would want to use btrfs ...

>
> Or, as Saturday Night Live might put it:  And now, linux, starring the
> not ready for prime time filesystem.  Seems like something that's been
> under development for around 15 years should be in better shape.
>

To my way of thinking this isn't a Linux problem, or even a btrfs problem,
it's a Fedora problem.  They're the ones who decided to switch their
default filesystem to something that clearly isn't ready for prime time.

-Henry

[-- Attachment #2: Type: text/html, Size: 1333 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 22:12 [TUHS] Is it time to resurrect the original dsw (delete with switches)? Jon Steinhart
  2021-08-29 23:09 ` Henry Bent
@ 2021-08-29 23:57 ` Larry McVoy
  2021-08-30  1:21   ` Rob Pike
  2021-08-30  3:46   ` Theodore Ts'o
  2021-08-30  3:36 ` Bakul Shah
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Larry McVoy @ 2021-08-29 23:57 UTC (permalink / raw)
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

On Sun, Aug 29, 2021 at 03:12:16PM -0700, Jon Steinhart wrote:
> After a bit of poking around I discovered that btrfs SILENTLY remounted the
> filesystem because it had errors.  Sure, it put something in a log file,
> but I don't spend all day surfing logs for things that shouldn't be going
> wrong.  Maybe my expectation that filesystems just work is antiquated.

I give them credit for remounting read-only when seeing errors, they may
have gotten that from BitKeeper.  When we opened a history file, if we
encountered any errors we opened the history file in read only mode so
if it worked enough you could see your data, great, but don't write on
top of bad data.

> Although it's been discredited by some, I'm still a believer in "stop and
> fsck" policing of disk drives.  

Me too.  Though with a 32TB drive (I'm guessing rotating media), that's 
going to take a long time.   If I had a drive that big, I'd divide it
into manageable chunks and mount them all under /drive/{a,b,c,d,e...}
so that when something goes wrong you don't have to check the whole 
32TB.
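
Roughly like this, say (device names and sizes are hypothetical, and
the drive is assumed to be partitioned into four chunks already):

  for p in 1 2 3 4; do mkfs.ext4 /dev/sdb$p; done
  mkdir -p /drive/a /drive/b /drive/c /drive/d
  mount /dev/sdb1 /drive/a      # etc., or the matching /etc/fstab entries
  # after trouble, only the chunk that complained needs checking:
  fsck.ext4 -f /dev/sdb3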

> Near the top of the manual page it says:
> 
>   Warning
>     Do not use --repair unless you are advised to do so by a developer
>     or an experienced user, and then only after having accepted that
>     no fsck successfully repair all types of filesystem corruption. Eg.
>     some other software or hardware bugs can fatally damage a volume.
> 
> Whoa!  I'm sure that operators are standing by, call 1-800-FIX-BTRFS.
> Really?  Is a ploy by the developers to form a support business?

That's a stretch, they are just trying to not encourage you to make a 
mess.

I sent Linus an email to find out where btrfs is, I'll report back when
he replies.

--lm

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 23:57 ` Larry McVoy
@ 2021-08-30  1:21   ` Rob Pike
  2021-08-30  3:46   ` Theodore Ts'o
  1 sibling, 0 replies; 21+ messages in thread
From: Rob Pike @ 2021-08-30  1:21 UTC (permalink / raw)
  To: Larry McVoy; +Cc: The Unix Heretics Society mailing list

[-- Attachment #1: Type: text/plain, Size: 1968 bytes --]

Do you even have input switches and HALT/CONT switches? I don't think so....

Commiserations.

-rob




On Mon, Aug 30, 2021 at 9:58 AM Larry McVoy <lm@mcvoy.com> wrote:

> On Sun, Aug 29, 2021 at 03:12:16PM -0700, Jon Steinhart wrote:
> > After a bit of poking around I discovered that btrfs SILENTLY remounted
> the
> > filesystem because it had errors.  Sure, it put something in a log file,
> > but I don't spend all day surfing logs for things that shouldn't be going
> > wrong.  Maybe my expectation that filesystems just work is antiquated.
>
> I give them credit for remounting read-only when seeing errors, they may
> have gotten that from BitKeeper.  When we opened a history file, if we
> encountered any errors we opened the history file in read only mode so
> if it worked enough you could see your data, great, but don't write on
> top of bad data.
>
> > Although it's been discredited by some, I'm still a believer in "stop and
> > fsck" policing of disk drives.
>
> Me too.  Though with a 32TB drive (I'm guessing rotating media), that's
> going to take a long time.   If I had a drive that big, I'd divide it
> into managable chunks and mount them all under /drive/{a,b,c,d,e...}
> so that when something goes wrong you don't have to check the whole
> 32TB.
>
> > Near the top of the manual page it says:
> >
> >   Warning
> >     Do not use --repair unless you are advised to do so by a developer
> >     or an experienced user, and then only after having accepted that
> >     no fsck successfully repair all types of filesystem corruption. Eg.
> >     some other software or hardware bugs can fatally damage a volume.
> >
> > Whoa!  I'm sure that operators are standing by, call 1-800-FIX-BTRFS.
> > Really?  Is a ploy by the developers to form a support business?
>
> That's a stretch, they are just trying to not encourage you to make a
> mess.
>
> I sent Linus an email to find out where btrfs is, I'll report back when
> he replies.
>
> --lm
>

[-- Attachment #2: Type: text/html, Size: 2594 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 23:09 ` Henry Bent
@ 2021-08-30  3:14   ` Theodore Ts'o
  2021-08-30 13:55     ` Steffen Nurpmeso
  2021-08-30  9:14   ` John Dow via TUHS
  1 sibling, 1 reply; 21+ messages in thread
From: Theodore Ts'o @ 2021-08-30  3:14 UTC (permalink / raw)
  To: Henry Bent; +Cc: The Unix Heretics Society mailing list

On Sun, Aug 29, 2021 at 07:09:50PM -0400, Henry Bent wrote:
> On Sun, 29 Aug 2021 at 18:13, Jon Steinhart <jon@fourwinds.com> wrote:
> 
> > I recently upgraded my machines to fc34.  I just did a stock
> > uncomplicated installation using the defaults and it failed miserably.
> >
> > Fc34 uses btrfs as the default filesystem so I thought that I'd give it
> > a try.
> 
> ... cut out a lot about how no sane person would want to use btrfs ...

The ext2/ext3/ext4 file system utilities are, as far as I know, the
first fsck developed with a full regression test suite from the very
beginning and integrated into the sources.  (Just run "make check"
and you'll know if you've broken something --- or it's how I know the
person contributing code was sloppy and didn't bother to run "make
check" before sending me patches to review....)

What a lot of people don't seem to understand is that file system
utilities are *important*, and more work than you might think.  The
ext4 file system is roughly 71 kLOC (thousand lines of code) in the
kernel.  E2fsprogs is 340 kLOC.  In contrast, the btrfs kernel code is
145 kLOC (btrfs does have a lot more "sexy new features"), but its
btrfs-progs utilities are currently only 124 kLOC.

And the e2fsprogs line count doesn't include the library of 350+
corrupted file system images that are part of its regression test
suite.  Btrfs has a few unit tests (as does e2fsprogs), but it doesn't
have anything similar in terms of a library of corrupted file system
images to test its fsck functionality.  (Then again, neither do the
file system utilities for FFS, so a regression test suite is not
required to create a high quality fsck program.  In my opinion, it
very much helps, though!)

> > Or, as Saturday Night Live might put it:  And now, linux, starring the
> > not ready for prime time filesystem.  Seems like something that's been
> > under development for around 15 years should be in better shape.
> >
> 
> To my way of thinking this isn't a Linux problem, or even a btrfs problem,
> it's a Fedora problem.  They're the ones who decided to switch their
> default filesystem to something that clearly isn't ready for prime time.

I was present at the very beginning of btrfs.  In November, 2007,
various file system developers from a number of the big IT companies
got together (IBM, Intel, HP, Red Hat, etc.) and folks decided that
Linux "needed an answer to ZFS".  In preparation for that meeting, I
did some research asking various contacts I had at various companies
how much effort and how long it took to create a new file system from
scratch and make it be "enterprise ready".  I asked folks at Digital
how long it took for advfs, IBM for AIX and GPFS, etc., etc.  And the
answer I got back at that time was between 50 and 200 Person Years,
with the bulk of the answers being between 100-200 PY's (the single
50PY estimate was an outlier).  This was everything --- kernel and
userspace coding, testing and QA, performance tuning, documentation,
etc. etc.  The calendar-time estimates I was given were between 5-7
calendar years, and even then, users would take at least another 2-3
years minimum of "kicking the tires", before they would trust *their*
precious enterprise data on the file system.

There was an Intel engineer at that meeting, who shall remain
nameless, who said, "Don't tell the managers that or they'll never
greenlight the project!  Tell them 18 months...."

And so I and other developers at IBM continued working on ext4, which
we never expected would be able to compete with btrfs and ZFS in terms
of "sexy new features", but our focus was on performance, scalability,
and robustness.  

And it probably was about 2015 or so that btrfs finally became more or
less stable, but only if you restricted yourself to core
functionality.  (e.g., snapshots, file-system level RAID, etc., were
still dodgy at the time.)

I will say that at Google, ext4 is still our primary file system,
mainly because all of our expertise is currently focused there.  We
are starting to support XFS in "beta" ("Preview") for Cloud Optimized
OS, since there are some enterprise customers which are using XFS on
their systems, and they want to continue using XFS as they migrate
from on-prem to the Cloud.  We fully support XFS for Anthos Migrate
(which is a read-mostly workload), and we're still building our
expertise, working on getting bug fixes backported, etc., so we can
support XFS the way enterprises expect for Cloud Optimized OS, which
is our high-security, ChromeOS based Linux distribution with a
read-only, cryptographically signed root file system optimized for
Docker and Kubernetes workloads.

I'm not aware of any significant enterprise usage of btrfs, which is
why we're not bothering to support btrfs at $WORK.  The only big
company which is using btrfs in production that I know of is Facebook,
because they have a bunch of btrfs developers, but even there, they
aren't using btrfs exclusively for all of their workloads.

My understanding of why Fedora decided to make btrfs the default was
because they wanted to get more guinea pigs to flush out the bugs.
Note that Red Hat, which is responsible for Red Hat Enterprise Linux
(their paid product, where they make $$$) and Fedora, which is their
freebie "community distribution" --- Well, Red Hat does not currently
support btrfs for their RHEL product.

Make of that what you will....

						- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 22:12 [TUHS] Is it time to resurrect the original dsw (delete with switches)? Jon Steinhart
  2021-08-29 23:09 ` Henry Bent
  2021-08-29 23:57 ` Larry McVoy
@ 2021-08-30  3:36 ` Bakul Shah
  2021-08-30 11:56   ` Theodore Ts'o
  2021-08-30 15:05 ` Steffen Nurpmeso
  2021-08-30 21:38 ` Larry McVoy
  4 siblings, 1 reply; 21+ messages in thread
From: Bakul Shah @ 2021-08-30  3:36 UTC (permalink / raw)
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

Chances are your disk has a URE rate of 1 in 10^14 bits ("enterprise"
disks may have a URE rate of 1 in 10^15). 10^14 bits is about 12.5TB. For 16TB
disks you should use at least mirroring, provided some day you'd want
to fill up the disk. And a machine with ECC RAM (& trust but verify!).
I am no fan of btrfs but these are the things I'd consider for any FS.
Even if you have done all this, consider the fact that disk mortality
has a bathtub curve.
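
(A back-of-the-envelope check of that 12.5TB figure, taking the quoted
1-in-10^14-bit spec at face value:)

  # 10^14 bits / 8 bits per byte = 1.25e13 bytes, i.e. roughly 12.5 TB
  echo '10^14 / 8' | bc      # prints 12500000000000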

I use FreeBSD + ZFS so I'd recommend ZFS (on Linux).

ZFS scrub works in the background on an active system. Similarly resilvering
(though things slow down). On my original zfs filesystem I replaced
all the 4 disks twice. I have been using zfs since 2005 and it has rarely
required any babysitting. I reboot it when upgrading to a new release or
applying kernel patches. "backups" via zfs send/recv of snapshots.
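
A minimal sketch of that routine (pool, dataset and host names are
made up):

  zpool scrub tank            # background integrity check of live data
  zpool status -v tank        # scrub progress plus any checksum errors

  zfs snapshot tank/data@2021-08-30
  zfs send -i tank/data@2021-08-23 tank/data@2021-08-30 | \
      ssh backuphost zfs recv backup/data    # incremental snapshot "backup"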

> On Aug 29, 2021, at 3:12 PM, Jon Steinhart <jon@fourwinds.com> wrote:
> 
> I recently upgraded my machines to fc34.  I just did a stock
> uncomplicated installation using the defaults and it failed miserably.
> 
> Fc34 uses btrfs as the default filesystem so I thought that I'd give it
> a try.  I was especially interested in the automatic checksumming because
> the majority of my storage is large media files and I worry about bit
> rot in seldom used files.  I have been keeping a separate database of
> file hashes and in theory btrfs would make that automatic and transparent.
> 
> I have 32T of disk on my system, so it took a long time to convert
> everything over.  A few weeks after I did this I went to unload my
> camera and couldn't because the filesystem that holds my photos was
> mounted read-only.  WTF?  I didn't do that.
> 
> After a bit of poking around I discovered that btrfs SILENTLY remounted the
> filesystem because it had errors.  Sure, it put something in a log file,
> but I don't spend all day surfing logs for things that shouldn't be going
> wrong.  Maybe my expectation that filesystems just work is antiquated.
> 
> This was on a brand new 16T drive, so I didn't think that it was worth
> the month that it would take to run the badblocks program which doesn't
> really scale to modern disk sizes.  Besides, SMART said that it was fine.
> 
> Although it's been discredited by some, I'm still a believer in "stop and
> fsck" policing of disk drives.  Unmounted the filesystem and ran fsck to
> discover that btrfs had to do its own thing.  No idea why; I guess some
> think that incompatibility is a good thing.
> 
> Ran "btrfs check" which reported errors in the filesystem but was otherwise
> useless BECAUSE IT DIDN'T FIX ANYTHING.  What good is knowing that the
> filesystem has errors if you can't fix them?
> 
> Near the top of the manual page it says:
> 
> Warning
>   Do not use --repair unless you are advised to do so by a developer
>   or an experienced user, and then only after having accepted that
>   no fsck successfully repair all types of filesystem corruption. Eg.
>   some other software or hardware bugs can fatally damage a volume.
> 
> Whoa!  I'm sure that operators are standing by, call 1-800-FIX-BTRFS.
> Really?  Is a ploy by the developers to form a support business?
> 
> Later on, the manual page says:
> 
> DANGEROUS OPTIONS
>   --repair
>       enable the repair mode and attempt to fix problems where possible
> 
> 	    Note there’s a warning and 10 second delay when this option
> 	    is run without --force to give users a chance to think twice
> 	    before running repair, the warnings in documentation have
> 	    shown to be insufficient
> 
> Since when is it dangerous to repair a filesystem?  That's a new one to me.
> 
> Having no option other than not being able to use the disk, I ran btrfs
> check with the --repair option.  It crashed.  Lesson so far is that
> trusting my data to an unreliable unrepairable filesystem is not a good
> idea.  Since this was one of my media disks I just rebuilt it using ext4.
> 
> Last week I was working away and tried to write out a file to discover
> that /home and /root had become read-only.  Charming.  Tried rebooting,
> but couldn't since btrfs filesystems aren't checked and repaired.  Plugged
> in a flash drive with a live version, managed to successfully run --repair,
> and rebooted.  Lasted about 15 minutes before flipping back to read only
> with the same error.
> 
> Time to suck it up and revert.	Started a clean reinstall.  Got stuck
> because it crashed during disk setup with anaconda giving me a completely
> useless big python stack trace.  Eventually figured out that it was
> unable to delete the btrfs filesystem that had errors so it just crashed
> instead.  Wiped it using dd; nice that some reliable tools still survive.
> Finished the installation and am back up and running.
> 
> Any of the rest of you have any experiences with btrfs?  I'm sure that it
> works fine at large companies that can afford a team of disk babysitters.
> What benefits does btrfs provide that other filesystem formats such as
> ext4 and ZFS don't?  Is it just a continuation of the "we have to do
> everything ourselves and under no circumstances use anything that came
> from the BSD world" mentality?
> 
> So what's the future for filesystem repair?  Does it look like the past?
> Is Ken's original need for dsw going to rise from the dead?
> 
> In my limited experience btrfs is a BiTteR FileSystem to swallow.
> 
> Or, as Saturday Night Live might put it:  And now, linux, starring the
> not ready for prime time filesystem.  Seems like something that's been
> under development for around 15 years should be in better shape.
> 
> Jon



-- Bakul


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 23:57 ` Larry McVoy
  2021-08-30  1:21   ` Rob Pike
@ 2021-08-30  3:46   ` Theodore Ts'o
  2021-08-30 23:04     ` Bakul Shah
  1 sibling, 1 reply; 21+ messages in thread
From: Theodore Ts'o @ 2021-08-30  3:46 UTC (permalink / raw)
  To: Larry McVoy; +Cc: The Unix Heretics Society mailing list

On Sun, Aug 29, 2021 at 04:57:45PM -0700, Larry McVoy wrote:
> 
> I give them credit for remounting read-only when seeing errors, they may
> have gotten that from BitKeeper.

Actually, the btrfs folks got that from ext2/ext3/ext4.  The original
behavior was "don't worry, be happy" (log errors and continue), and I
added two additional options, "remount read-only", and "panic and
reboot the system".  I recommend the last especially for high
availability systems, since you can then fail over to the secondary
system, and fsck can repair the file system on the reboot path.
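
For ext4 those three behaviors are just mount options, or a default
recorded in the superblock (device and mount point below are made up):

  mount -o errors=continue   /dev/sda2 /home   # log and keep going
  mount -o errors=remount-ro /dev/sda2 /home   # flip to read-only on error
  mount -o errors=panic      /dev/sda2 /home   # panic; let the HA peer take over
  tune2fs -e panic /dev/sda2                   # make one of them the default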


The primary general-purpose file systems in Linux which are under
active development these days are btrfs, ext4, f2fs, and xfs.  They
all have slightly different focus areas.  For example, f2fs is best
for low-end flash, the kind that is found on $30 mobile handsets
on sale in countries like India (aka, "the next billion users").  It
has deep knowledge of "cost-optimized" flash where random writes are
to be avoided at all costs because write amplification is a terrible
thing with very primitive FTL's.

For very large file systems (e.g., large RAID arrays with petabytes of
data), XFS will probably do better than ext4 for many workloads.

Btrfs is the file systems for users who have ZFS envy.  I believe many
of those sexy new features are best done at other layers in the
storage stack, but if you *really* want file-system level snapshots
and rollback, btrfs is the only game in town for Linux.  (Unless of
course you don't mind using ZFS and hope that Larry Ellison won't sue
the bejesus out of you, and if you don't care about potential GPL
violations....)

Ext4 is still getting new features added; we recently added a
light-weight journaling mode (a simplified version of the 2017 Usenix
ATC iJournaling paper[1]), and just last week we added a parallelized
orphan list called Orphan File[2], which optimizes parallel truncate
and unlink workloads.  (Neither of these features is enabled by
default yet --- maybe in a few years, or earlier if community distros
want to volunteer their users to be guinea pigs.  :-)

[1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf
[2] https://www.spinics.net/lists/linux-ext4/msg79021.html

We currently aren't adding the "sexy new features" of btrfs or ZFS,
but that's mainly because there isn't a business justification to pay
for the engineering effort needed to add them.  I have some design
sketches of how we *could* add them to ext4, but most of the ext4
developers like food with our meals, and I'm still a working stiff so
I focus on work that adds value to my employer --- and, of course,
helping other ext4 developers working at other companies figure out
ways to justify new features that would add value to *their*
employers.

I might work on some sexy new features if I won the Powerball Lottery
and could retire rich, or I was working at company where engineers
could work on whatever technologies they wanted without getting
permission from the business types, but those companies tend not to
end well (especially after they get purchased by Oracle....)

						- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 23:09 ` Henry Bent
  2021-08-30  3:14   ` Theodore Ts'o
@ 2021-08-30  9:14   ` John Dow via TUHS
  1 sibling, 0 replies; 21+ messages in thread
From: John Dow via TUHS @ 2021-08-30  9:14 UTC (permalink / raw)
  To: tuhs

[-- Attachment #1: Type: text/plain, Size: 1169 bytes --]

On 30/08/2021 00:09, Henry Bent wrote:

> On Sun, 29 Aug 2021 at 18:13, Jon Steinhart <jon@fourwinds.com 
> <mailto:jon@fourwinds.com>> wrote:
>
>     I recently upgraded my machines to fc34.  I just did a stock
>     uncomplicated installation using the defaults and it failed miserably.
>
>     Fc34 uses btrfs as the default filesystem so I thought that I'd
>     give it
>     a try.
>
>
> ... cut out a lot about how no sane person would want to use btrfs ...
>
>
>     Or, as Saturday Night Live might put it:  And now, linux, starring the
>     not ready for prime time filesystem.  Seems like something that's been
>     under development for around 15 years should be in better shape.
>
>
> To my way of thinking this isn't a Linux problem, or even a btrfs 
> problem, it's a Fedora problem.  They're the ones who decided to 
> switch their default filesystem to something that clearly isn't ready 
> for prime time.

This. Even the Arch wiki makes it clear that btrfs isn't ready for prime 
time. It's still under very heavy development - not really something you 
want for a filesystem, particularly one storing a large amount of 
critical data.


John


[-- Attachment #2: Type: text/html, Size: 2428 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30  3:36 ` Bakul Shah
@ 2021-08-30 11:56   ` Theodore Ts'o
  2021-08-30 22:35     ` Bakul Shah
  0 siblings, 1 reply; 21+ messages in thread
From: Theodore Ts'o @ 2021-08-30 11:56 UTC (permalink / raw)
  To: Bakul Shah; +Cc: The Unix Heretics Society mailing list

On Sun, Aug 29, 2021 at 08:36:37PM -0700, Bakul Shah wrote:
> Chances are your disk has a URE 1 in 10^14 bits ("enterprise" disks
> may have a URE of 1 in 10^15). 10^14 bit is about 12.5TB. For 16TB
> disks you should use at least mirroring, provided some day you'd want
> to fill up the disk. And a machine with ECC RAM (& trust but verify!).
> I am no fan of btrfs but these are the things I'd consider for any FS.
> Even if you have done all this, consider the fact that disk mortality
> has a bathtub curve.

You may find this article interesting: "The case of the 12TB URE:
Explained and debunked"[1], and the following comment on a reddit
post[2] discussing this article:

   "Lol of course it's a myth.

   I don't know why or how anyone thought there would be a URE
   anywhere close to every 12TB read.

   Many of us have large pools that are dozens and sometimes hundreds of TB.

   I have 2 64TB pools and scrub them every month. I can go years
   without a checksum error during a scrub, which means that all my
   ~50TB of data was read correctly without any URE many times in a
   row which means that I have sometimes read 1PB (50TB x 2 pools x 10
   months) worth from my disks without any URE.

   Last I checked, the spec sheets say < 1 in 1x10^14 which means less
   than 1 in 12TB. 0 in 1PB is less than 1 in 12TB so it meets the
   spec."

[1] https://heremystuff.wordpress.com/2020/08/25/the-case-of-the-12tb-ure/
[2] https://www.reddit.com/r/DataHoarder/comments/igmab7/the_12tb_ure_myth_explained_and_debunked/

Of course, disks do die, and ECC and backups and checksums are good
things.  But the whole "read 12TB get an error", saying really
misunderstands how hdd failures work.  Losing an entire platter, or
maybe the entire 12TB disk die due to a head crash, adds a lot of
uncorrectable read errors to the numerator of the UER statistics.

It just goes to show that human intuition really sucks at statistics,
whether it's about vaccination side effects, nuclear power plants, or
the danger of flying in airplanes versus driving in cars.

	      	   	     	 	   - Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30  3:14   ` Theodore Ts'o
@ 2021-08-30 13:55     ` Steffen Nurpmeso
  0 siblings, 0 replies; 21+ messages in thread
From: Steffen Nurpmeso @ 2021-08-30 13:55 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: The Unix Heretics Society mailing list

Theodore Ts'o wrote in
 <YSxNFKq9r3dyHT7l@mit.edu>:
 ...
 |What a lot of people don't seem to understand is that file system
 |utilities are *important*, and more work than you might think.  The
 |ext4 file system is roughly 71 kLOC (thousand lines of code) in the
 |kernel.  E2fsprogs is 340 kLOC.  In contrast, the btrfs kernel code is
 |145 kLOC (btrfs does have a lot more "sexy new features"), but its
 |btrfs-progs utilities is currently only 124 kLOC.

To be "fair" it must be said that btrfs usage almost requires
installation of e2fsprogs because only that ships chattr(1).

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 22:12 [TUHS] Is it time to resurrect the original dsw (delete with switches)? Jon Steinhart
                   ` (2 preceding siblings ...)
  2021-08-30  3:36 ` Bakul Shah
@ 2021-08-30 15:05 ` Steffen Nurpmeso
  2021-08-31 13:18   ` Steffen Nurpmeso
  2021-08-30 21:38 ` Larry McVoy
  4 siblings, 1 reply; 21+ messages in thread
From: Steffen Nurpmeso @ 2021-08-30 15:05 UTC (permalink / raw)
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

Jon Steinhart wrote in
 <202108292212.17TMCGow1448973@darkstar.fourwinds.com>:
 |I recently upgraded my machines to fc34.  I just did a stock
 |uncomplicated installation using the defaults and it failed miserably.
 |
 |Fc34 uses btrfs as the default filesystem so I thought that I'd give it
 ...

 |Any of the rest of you have any experiences with btrfs?  I'm sure that it
 |works fine at large companies that can afford a team of disk babysitters.
 |What benefits does btrfs provide that other filesystem formats such as
 |ext4 and ZFS don't?  Is it just a continuation of the "we have to do
 |everything ourselves and under no circumstances use anything that came
 |from the BSD world" mentality?

From a small man perspective i can say i use btrfs.  I have seen
more problems with filesystems in these 29 months than in the ~36
years (well, 22 if you only count Unix, mostly FreeBSD) before.

I have learned that i have to chattr +C my vm/ directory in order
to avoid filesystem errors which can only be solved by deleting
the corrupted files (which were not easy to find out,
inspect-internal inode-resolve could have helped a bit better, but
.. why?).  I have seen (factual) snapshot receive errors that were
not indicated by the tool's exit status, with dangling files left
lying around.  I have seen +C attributes missing on the target
filesystem after snapshots were played in.  I have found it impossible
to mount external devices because "BTRFS warning (device
<unknown>): duplicate device /dev/sdc2 devid 1 generation 3977"
even though the former umount was successful (i must admit i had
used the lazytime mount option there, be sure not to do that for
BTRFS).  I know where you are coming from, i tried it all without
success, with all the little tools that do not do what you want.
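
For context, the whole incantation is just this (path is a guess, and
the flag only takes effect for files created after it is set):

  chattr +C ~/vm      # new files under vm/ skip copy-on-write
  lsattr -d ~/vm      # the 'C' attribute should now show up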

I found it quite ridiculous that i had to shrink-resize my
filesystem iteratively because the "minimum possible" size was
adjusted each and every time, until i finally ended up where the
actual filesystem usage had indicated from the beginning.
beginning.  (defrag did not prevent this.)

On the other hand i am still here, now on Luks2, now with xxhash
checksums and data and meta DUP (instead of RAID 1, i thought).
Same data, just moved over onto it.

I think i would try out ZFS if it would be in-kernel.  The FreeBSD
Handbook is possibly the best manual you can have for that.  But
i mean the ZFS code is much, much larger than BTRFS, BTRFS serves
me really, really good for what i want and need, and i am sailing
Linux stable/latest (4.19, now 5.10) here, on a free and small
Linux distribution without engineering power aka supervision, that
sails somewhat the edge of what all the many involved projects
produce, with problems all over the place, sometimes more,
sometimes less, some all the time.  That is nothing new, you have
to live with some deficiencies in the free software world; i just
find it sometimes hard to believe that this is still true for
Linux with the many many billions of Dollars that went in, and the
tens of thousands of people working on it.

I really hate that all the Linux kernel guys seem to look forward
only, mostly, you have to go and try out the newest thing, maybe
there all is good.  Of course .. i can understand this (a bit).
That is the good thing of using such an engineering-power
distribution, you likely get backports and have a large user base
with feedback.

 |In my limited experience btrfs is a BiTteR FileSystem to swallow.

Well often people say you need to know what you are doing.  That
is hard without proper documentation, and the www is a toilet.
And no one writes HOWTOs any more.  But, for example, if i do not
use an undocumented kernel parameter (rtw88_pci.disable_aspm=1),
and use wifi and bluetooth (audio) in conjunction, then i have to
boot into the Windows startup screen in order to properly
reinitialize my wifi/bluetooth chip.  Or else you are dead.  And
that even though it seems the Linux driver comes from the Chip
creator itself.  So i think a "chattr +C" here and there, it can
be directory-wide, it could be a mount or creation option, isn't
that bad.  Also it is just me having had a go (or julia or nim; or
perl) on using _one_ filesystem with several subvolumes for
anything; if i would have used my (well, security context only)
old-style way of doing things and used several hard partitions
with individual filesystems, i would have used +C from the
beginning for that one.  Especially if i would have read it in the
manual page.

I do believe today's journalled, checksummed, snapshot'able
copy-on-write filesystems are complex beasts.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-29 22:12 [TUHS] Is it time to resurrect the original dsw (delete with switches)? Jon Steinhart
                   ` (3 preceding siblings ...)
  2021-08-30 15:05 ` Steffen Nurpmeso
@ 2021-08-30 21:38 ` Larry McVoy
  4 siblings, 0 replies; 21+ messages in thread
From: Larry McVoy @ 2021-08-30 21:38 UTC (permalink / raw)
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

Linus replied:
> Anyhoo, what can you tell me about btrfs?  I haven't paid attention, when I
> last did, it seemed unfinished.  Fedora apparently uses it by default,
> should they be?  I've been pretty happy with ext{2,3,4}, I don't know
> if Ted is still behind those but I've been super happy with how they
> handled compat and other stuff.
>
> So did btrfs get done?

I'm not sure how much people use the fancier features, but lots of
people seem to use btrfs in production these days, and the maintenance
team seems responsive and stable. Of the local disk filesystems we
have, I'd pick any of ext4, btrfs and xfs as stable and
well-maintained.

Picking any of those is fine, and the choice would come down to which
distro you use, and _possibly_ whether you want any of the particular
features. But on the whole, most people don't seem to care deeply
about the more intricate features unless they have very specific
use-cases.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30 11:56   ` Theodore Ts'o
@ 2021-08-30 22:35     ` Bakul Shah
  0 siblings, 0 replies; 21+ messages in thread
From: Bakul Shah @ 2021-08-30 22:35 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: The Unix Heretics Society mailing list

On Aug 30, 2021, at 4:56 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Sun, Aug 29, 2021 at 08:36:37PM -0700, Bakul Shah wrote:
>> Chances are your disk has a URE 1 in 10^14 bits ("enterprise" disks
>> may have a URE of 1 in 10^15). 10^14 bit is about 12.5TB. For 16TB
>> disks you should use at least mirroring, provided some day you'd want
>> to fill up the disk. And a machine with ECC RAM (& trust but verify!).
>> I am no fan of btrfs but these are the things I'd consider for any FS.
>> Even if you have done all this, consider the fact that disk mortality
>> has a bathtub curve.
> 
> You may find this article interesting: "The case of the 12TB URE:
> Explained and debunked"[1], and the following commit on a reddit
> post[2] discussiong this article:
> 
>   "Lol of course it's a myth.
> 
>   I don't know why or how anyone thought there would be a URE
>   anywhere close to every 12TB read.
> 
>   Many of us have large pools that are dozens and sometimes hundreds of TB.
> 
>   I have 2 64TB pools and scrub them every month. I can go years
>   without a checksum error during a scrub, which means that all my
>   ~50TB of data was read correctly without any URE many times in a
>   row which means that I have sometimes read 1PB (50TB x 2 pools x 10
>   months) worth from my disks without any URE.
> 
>   Last I checked, the spec sheets say < 1 in 1x10^14 which means less
>   than 1 in 12TB. 0 in 1PB is less than 1 in 12TB so it meets the
>   spec."
> 
> [1] https://heremystuff.wordpress.com/2020/08/25/the-case-of-the-12tb-ure/
> [2] https://www.reddit.com/r/DataHoarder/comments/igmab7/the_12tb_ure_myth_explained_and_debunked/

It seems this guy doesn't understand statistics. He checked his 2 pools
and is confident that, from a sample of (likely) 4 disks, he knows that
URE specs are crap. Even from an economic PoV it doesn't make sense.
Why wouldn't the disk companies tout an even lower error rate if they
can get away with it? Presumably these rates are derived from reading
many many disks and averaged.

Here's what the author says on a serverfault thread:
https://serverfault.com/questions/812891/what-is-exactly-an-ure

  @DavidBalažic Evidently, your sample size of one invalidates the
  entirety of probability theory! I suggest you submit a paper to
  the Nobel Committee. – Ian Kemp Apr 16 '19 at 5:37

  @IanKemp If someone claims that all numbers are divisible by 7 and
  I find ONE that is not, then yes, a single find can invalidate an
  entire theory. BTW, still not a single person has confirmed the myth
  in practice (by experiment), did they? Why should they, when belief
  is more than knowledge...– David Balažic Apr 16 '19 at 12:22

Incidentally, it is hard to believe he scrubs his 2x64TB pools once a month.
Assuming 250MB/s sequential throughput and his scrubber can stream it at
that rate, it will take him close to 6 days (3 days if reading them in
parallel) to read every block. During this time these pools won't be
useful for anything else. Unclear if he is using any RAID or a filesystem
that does checksums. Without that he would be unable to detect hidden
data corruption.

In contrast, ZFS will only scrub *live* data. As more of the disks are 
filled up, scrub will take progressively more time. Similarly,
replacing a zfs mirror won't read the source disk in its entirety,
only the live data.

> Of course, disks do die, and ECC and backups and checksums are good
> things.  But the whole "read 12TB get an error", saying really
> misunderstands how hdd failures work.  Losing an entire platter, or
> maybe the entire 12TB disk die due to a head crash, adds a lot of
> uncorrectable read errors to the numerator of the UER statistics.

That is not how URE specs are derived.

> 
> It just goes to show that human intuition really sucks at statistics,

Indeed :-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30  3:46   ` Theodore Ts'o
@ 2021-08-30 23:04     ` Bakul Shah
  2021-09-02 15:52       ` Jon Steinhart
  0 siblings, 1 reply; 21+ messages in thread
From: Bakul Shah @ 2021-08-30 23:04 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: The Unix Heretics Society mailing list

On Aug 29, 2021, at 8:46 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> (Unless of
> course you don't mind using ZFS and hope that Larry Ellison won't sue
> the bejesus out of you, and if you don't care about potential GPL
> violations....)

This should not matter if you are adding zfs for your own use. Still,
if this is a concern, best switch over to FreeBSD :-) Actually FreeBSD
is now using OpenZFS so it is the same source code base as ZFS on
Linux. You should even be able to do zfs send/recv between the two OSes.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30 15:05 ` Steffen Nurpmeso
@ 2021-08-31 13:18   ` Steffen Nurpmeso
  0 siblings, 0 replies; 21+ messages in thread
From: Steffen Nurpmeso @ 2021-08-31 13:18 UTC (permalink / raw)
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

P.S.:

Steffen Nurpmeso wrote in
 <20210830150510.jWrRZ%steffen@sdaoden.eu>:
 |Jon Steinhart wrote in
 | <202108292212.17TMCGow1448973@darkstar.fourwinds.com>:
 ||I recently upgraded my machines to fc34.  I just did a stock
 ||uncomplicated installation using the defaults and it failed miserably.
 ||
 ||Fc34 uses btrfs as the default filesystem so I thought that I'd give it
 ...
 |I have learned that i have to chattr +C my vm/ directory in order
 |to avoid filesystem errors which can only be solved by deleting
 |the corrupted files (which were not easy to find out,
 |inspect-internal inode-resolve could have helped a bit better, but
 ...
 |creator itself.  So i think a "chattr +C" here and there, it can
 |be directory-wide, it could be a mount or creation option, isn't
 |that bad.  Also it is just me having had a go (or julia or nim; or
 ...

Only to add that this was effectively my fault, because of the
caching behaviour my box-vm.sh script chose for qemu.
In effect i think i could drop that +C again now that i have

  drivecache=,cache=writeback # ,cache=none XXX on ZFS!?

used like eg

  -drive index=0,if=ide$drivecache,file=img/$vmimg

It, however, took quite some research to get there.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30 23:04     ` Bakul Shah
@ 2021-09-02 15:52       ` Jon Steinhart
  2021-09-02 16:57         ` Theodore Ts'o
  0 siblings, 1 reply; 21+ messages in thread
From: Jon Steinhart @ 2021-09-02 15:52 UTC (permalink / raw)
  To: The Unix Heretics Society mailing list

Hey, wanted to thank everybody for the good information on this topic.
Was pleasantly surprised that we got through it without a flame war :-)

I have a related question in case any of you have actual factual knowledge
of disk drive internals.  A friend who used to be in charge of reliability
engineering at Seagate used to be a great resource but he's now retired.
For example, years ago someone ranted at me about disk manufacturers
sacrificing reliability to save a few pennies by removing the head ramp;
I asked my friend who explained to me how that ramp was removed to improve
reliability.

So my personal setup here bucks conventional wisdom in that I don't do RAID.
One reason is that I read the Google reliability studies years ago and
interpreted them to say that if I bought a stack of disks on the same day I
could expect them to fail at about the same time.  Another reason is that
24x7 spinning disk drives is the biggest power consumer in my office.  Last
reason is that my big drives hold media (music/photos/video), so if one dies
it's not going to be any sort of critical interruption.

My strategy is to keep three (+) copies.  I realized that I first came across
this wisdom while learning both code and spelunking as a teenager from Carl
Christensen and Heinz Lycklama in the guise of how many spare headlamps one
should have when spelunking.  There's the copy on my main machine, another in
a fire safe, and I rsync to another copy on a duplicate machine up at my ski
condo.  Plus, I keep lots of old fragments of stuff on retired small (<10T)
disks that are left over from past systems.  And, a lot of the music part of
my collection is backed up by proof-of-purchase CDs in the store room or
shared with many others so it's not too hard to recover.
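
(The moving parts are just rsync jobs; a sketch, with made-up paths
and host name:)

  # push the media tree to the duplicate machine at the condo
  rsync -aH --delete /media/ condo:/media/
  # the fire-safe copy gets refreshed the same way onto an external drive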

Long intro, on to the question.  Anyone know what it does to reliability to
spin disks up and down?  I don't really need the media disks to be constantly
spinning; when whatever I'm listening to in the evening finishes the disk
could spin down until morning to save energy.  Likewise the video disk drive
is at most used for a few hours a day.

My big disks (currently 16T and 12T) bake when they're spinning which can't
be great for them, but I don't know how that compares to mechanical stress
from spinning up and down from a reliability standpoint.  Anyone know?

Jon

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-09-02 15:52       ` Jon Steinhart
@ 2021-09-02 16:57         ` Theodore Ts'o
  0 siblings, 0 replies; 21+ messages in thread
From: Theodore Ts'o @ 2021-09-02 16:57 UTC (permalink / raw)
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

On Thu, Sep 02, 2021 at 08:52:23AM -0700, Jon Steinhart wrote:
> Long intro, on to the question.  Anyone know what it does to reliability to
> spin disks up and down.  I don't really need the media disks to be constantly
> spinning; when whatever I'm listening to in the evening finishes the disk
> could spin down until morning to save energy.  Likewise the video disk drive
> is at most used for a few hours a day.
> 
> My big disks (currently 16T and 12T) bake when they're spinning which can't
> be great for them, but I don't know how that compares to mechanical stress
> from spinning up and down from a reliability standpoint.  Anyone know?

First of all, I wouldn't worry too much about big disks "baking" while
they are spinning.  Google runs its disks hot, in data centers where
the ambient air temperature is at least 80 degrees Fahrenheit[1], and
it's not alone; Dell said in 2012 that it would honor warranties for
servers running in environments as hot as 115 degrees Fahrenheit[2].

[1] https://www.google.com/about/datacenters/efficiency/
[2] https://www.datacenterknowledge.com/archives/2012/03/23/too-hot-for-humans-but-google-servers-keep-humming

And of course, if the ambient *air* temperature is 80 degrees plus,
you can just imagine the temperature at the hard drive.

It's also true that a long time ago, disk drives had a limited number of
spin up/down cycles; this was in their spec sheets, and SMART would track
the number of disk spinups.  I had a laptop drive where I had
configured the OS so it would spin down the drive after 30 seconds of
idle, and I realized that after about 9 months, SMART stats had
reported that I had used up almost 50% of the rated spin up/down
cycles for a laptop drive.  Needless to say, I backed of my agressive
spindown policies.

That being said, more modern HDD's have been designed for better power
efficiency, with slower disk rotational speeds (which is probably fine
for media disks, unless you are serving a large number of different
video streams at the same time), and they are designed to allow for a
much larger number of spindown cycles.  Check your spec sheets, this
will be listed as load/unload cycles, and it will typically be a
number like 200,000, 300,000 or 600,000.
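
A way to keep an eye on this (device name is illustrative; attribute
names vary a bit by vendor):

  smartctl -A /dev/sda | egrep 'Load_Cycle|Start_Stop|Power_Cycle'
  hdparm -S 241 /dev/sda   # spin down after 30 min idle (241-251 = 30-min units)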

If you're only spinning down the disks a few times a day, I suspect you'll
be fine.  Especially since if the disk dies due to a head crash or
other head failure, it'll be a case of an obvious disk failure, not
silent data corruption, and you can just pull your backups out of your
fire safe.


I don't personally have a lot of knowledge of how modern HDD's
actually survive large numbers of load/unload cycles, because at $WORK
we keep the disks spinning at all times.  A disk provides value in two
ways: bytes of storage, and I/O operations.  And an idle disk means
we're wasting part of its value it could be providing, and if the goal
is to decrease the overall storage TCO, wasting IOPS is not a good
thing[3].  Hence, we try to organize our data to keep all of the hard
drives busy, by peanut-buttering the hot data across all of the disks
in the cluster[4].

[3] https://research.google/pubs/pub44830/
[4] http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf

Hence, a spun-down disk is a disk which is frittering away the CapEx
of the drive and a portion of the cost of the server to which it is
attached.  And if you can find useful work for that disk to do, it's
way more valuable to keep it spun up, even taking into account the
power and air-conditioning costs of the spinning drive.


It should also be noted that modern HDD's now also have *write*
limits[5], just like SSD's.  This is especially true for technologies
like HAMR --- where if you need to apply *heat* to write, that means
additional thermal stress on the drive head when you write to a disk,
but the write limits predate new technologies like HAMR and MAMR.

[5] https://www.theregister.com/2016/05/03/when_did_hard_drives_get_workload_rate_limits/

HDD write limits have implications for systems that are using log
structured storage, or other copy-on-write schemes, or systems that
are moving data around to balance hot and cold data as described in
the PDSW keynote.  This is probably not an issue for home systems, but
it's one of the things which keeps the storage space interesting.  :-)

					- Ted

P.S.  I have a Synology NAS box, and I *do* let the disks spin down.
Storage at the industrial scale is really different than storage at
the personal scale.  I do use RAID, but my backup strategy in extremis
is encrypted backups uploaded to cloud storage (where I can take
advantage of industrial-scale storage pricing).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30 14:42 ` Theodore Ts'o
@ 2021-08-30 18:08   ` Adam Thornton
  0 siblings, 0 replies; 21+ messages in thread
From: Adam Thornton @ 2021-08-30 18:08 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: tuhs



> On Aug 30, 2021, at 7:42 AM, Theodore Ts'o <tytso@mit.edu> wrote:

> In addition, the answer for hobbyists
> and academics might be quite different from a large company making
> lots of money and more likely to attract the attention of the
> leadership and lawyers at Oracle.

I mean, Oracle’s always been super well-behaved and not lawsuit-happy
_at all_!  You’re fine, at least until Larry needs a slightly larger
yacht or volcano lair or whatever it is he’s into these days.

Adam

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30 13:06 Norman Wilson
  2021-08-30 14:42 ` Theodore Ts'o
@ 2021-08-30 16:46 ` Arthur Krewat
  1 sibling, 0 replies; 21+ messages in thread
From: Arthur Krewat @ 2021-08-30 16:46 UTC (permalink / raw)
  To: tuhs

On 8/30/2021 9:06 AM, Norman Wilson wrote:
> A key point is that the character of the errors they
> found suggests it's not just the disks one ought to worry
> about, but all the hardware and software (much of the latter
> inside disks and storage controllers and the like) in the
> storage stack.
I had a pair of Dell MD1000's, full of SATA drives (28 total), with the 
SATA/SAS interposers on the back of the drive. Was getting checksum 
errors in ZFS on a handful of the drives. Took the time to build a new 
array, on a Supermicro backplane, and no more errors with the exact same 
drives.

I'm theorizing it was either the interposers, or the SAS 
backplane/controllers in the MD1000. Without ZFS, who knows how 
swiss-cheesy my data would be.
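
(Those errors show up in the CKSUM column; pool name below is made up:)

  zpool status -v tank    # per-device read/write/checksum error counters
  zpool clear tank        # reset them after swapping the suspect hardware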

Not to mention the time I set up a Solaris x86 cluster zoned to a 
Compellent and periodically would get one or two checksum errors in ZFS. 
This was the only cluster out of a handful that had issues, and only on 
that one filesystem. Of course, it was a production PeopleSoft Oracle 
database. I guess moving to a VMware Linux guest and XFS just swept the 
problem under the rug, but the hardware is not being reused so there's that.

> I had heard anecdotes long before (e.g. from Andrew Hume)
> suggesting silent data corruption had become prominent
> enough to matter, but this paper was the first real study
> I came across.
>
> I have used ZFS for my home file server for more than a
> decade; presently on an antique version of Solaris, but
> I hope to migrate to OpenZFS on a newer OS and hardware.
> So far as I can tell ZFS in old Solaris is quite stable
> and reliable.  As Ted has said, there are philosophical
> reasons why some prefer to avoid it, but if you don't
> subscribe to those it's a fine answer.
>
Been running Solaris 11.3 and ZFS for quite a few years now, at home. 
Before that, Solaris 10. I recently set up a home Redhat 8 server, w/ZoL 
(.8), earlier this year - so far, no issues, with 40+TB online. I have 
various test servers with ZoL 2.0 on them, too.

I have so much online data that I use as the "live copy" - going back to 
the early 80's copies of my TOPS-10 stuff. Even though I have copious 
amounts of LTO tape copies of this data, I won't go back to the "out of 
sight out of mind" mentality.

Trying to get customers to buy into that idea is another story.

art k.

PS: I refuse to use a workstation that doesn't use ECC RAM, either. I 
like swiss-cheese on a sandwich. I don't like my (or my customers') data 
emulating it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  2021-08-30 13:06 Norman Wilson
@ 2021-08-30 14:42 ` Theodore Ts'o
  2021-08-30 18:08   ` Adam Thornton
  2021-08-30 16:46 ` Arthur Krewat
  1 sibling, 1 reply; 21+ messages in thread
From: Theodore Ts'o @ 2021-08-30 14:42 UTC (permalink / raw)
  To: Norman Wilson; +Cc: tuhs

On Mon, Aug 30, 2021 at 09:06:03AM -0400, Norman Wilson wrote:
> Not to get into what is soemthing of a religious war,
> but this was the paper that convinced me that silent
> data corruption in storage is worth thinking about:
> 
> http://www.cs.toronto.edu/~bianca/papers/fast08.pdf
> 
> A key point is that the character of the errors they
> found suggests it's not just the disks one ought to worry
> about, but all the hardware and software (much of the latter
> inside disks and storage controllers and the like) in the
> storage stack.

There's nothing I'd disagree with in this paper.  I'll note though
that part of the paper's findings is that silent data corruption
occurred at a rate roughly 1.5 orders of magnitude less often than
latent sector errors (e.g., UER's).

The probability of at least one silent data corruption during the
study period (41 months, although not all disks would have been in
service during that entire time) was P = 0.0086 for nearline disks and
P = 0.00065 for enterprise disks.  And P(1st error) remained constant
over disk age, and per the authors' analysis, it was unclear whether
P(1st error) changed as disk size increased --- which is
representative of non media-related failures (not surprising given how
many checksum and ECC checks are done by the HDD's).

A likely supposition IMHO is that the use of more costly enterprise
disks correlated with higher quality hardware in the rest of the
storage stack --- so things like ECC memory really do matter.

> As Ted has said, there are philosophical reasons why some prefer to
> avoid it, but if you don't subscribe to those it's a fine answer.

WRT to running ZFS on Linux, I wouldn't call it philosophical reasons,
but rather legal risks.  Life is not perfect, so you can't drive any
kind of risk (including risks of hardware failure) down to zero.

Whether you should be comfortable with the legal risks in this case
very much depends on who you are and what your risk profile might be,
and you should contact a lawyer if you want legal advice.  Clearly the
lawyers at companies like Red Hat and SuSE have given very answers
from the lawyers at Canonical.  In addition, the answer for hobbyists
and academics might be quite different from a large company making
lots of money and more likely to attract the attention of the
leadership and lawyers at Oracle.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
@ 2021-08-30 13:06 Norman Wilson
  2021-08-30 14:42 ` Theodore Ts'o
  2021-08-30 16:46 ` Arthur Krewat
  0 siblings, 2 replies; 21+ messages in thread
From: Norman Wilson @ 2021-08-30 13:06 UTC (permalink / raw)
  To: tuhs

Not to get into what is something of a religious war,
but this was the paper that convinced me that silent
data corruption in storage is worth thinking about:

http://www.cs.toronto.edu/~bianca/papers/fast08.pdf

A key point is that the character of the errors they
found suggests it's not just the disks one ought to worry
about, but all the hardware and software (much of the latter
inside disks and storage controllers and the like) in the
storage stack.

I had heard anecdotes long before (e.g. from Andrew Hume)
suggesting silent data corruption had become prominent
enough to matter, but this paper was the first real study
I came across.

I have used ZFS for my home file server for more than a
decade; presently on an antique version of Solaris, but
I hope to migrate to OpenZFS on a newer OS and hardware.
So far as I can tell ZFS in old Solaris is quite stable
and reliable.  As Ted has said, there are philosophical
reasons why some prefer to avoid it, but if you don't
subscribe to those it's a fine answer.

I've been hearing anecdotes since forever about sharp
edges lurking here and there in BtrFS.  It does seem
to be eternally unready for production use if you really
care about your data.  It's all anecdotes so I don't know
how seriously to take it, but since I'm comfortable with
ZFS I don't worry about it.

Norman Wilson
Toronto ON

PS: Disclosure: I work in the same (large) CS department
as Bianca Schroeder, and admire her work in general,
though the paper cited above was my first taste of it.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-09-02 16:58 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-29 22:12 [TUHS] Is it time to resurrect the original dsw (delete with switches)? Jon Steinhart
2021-08-29 23:09 ` Henry Bent
2021-08-30  3:14   ` Theodore Ts'o
2021-08-30 13:55     ` Steffen Nurpmeso
2021-08-30  9:14   ` John Dow via TUHS
2021-08-29 23:57 ` Larry McVoy
2021-08-30  1:21   ` Rob Pike
2021-08-30  3:46   ` Theodore Ts'o
2021-08-30 23:04     ` Bakul Shah
2021-09-02 15:52       ` Jon Steinhart
2021-09-02 16:57         ` Theodore Ts'o
2021-08-30  3:36 ` Bakul Shah
2021-08-30 11:56   ` Theodore Ts'o
2021-08-30 22:35     ` Bakul Shah
2021-08-30 15:05 ` Steffen Nurpmeso
2021-08-31 13:18   ` Steffen Nurpmeso
2021-08-30 21:38 ` Larry McVoy
2021-08-30 13:06 Norman Wilson
2021-08-30 14:42 ` Theodore Ts'o
2021-08-30 18:08   ` Adam Thornton
2021-08-30 16:46 ` Arthur Krewat
