The Unix Heritage Society mailing list
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
@ 2021-08-30 13:06 Norman Wilson
From: Norman Wilson @ 2021-08-30 13:06 UTC (permalink / raw)
  To: tuhs

Not to get into what is something of a religious war,
but this was the paper that convinced me that silent
data corruption in storage is worth thinking about:

A key point is that the character of the errors they
found suggests it's not just the disks one ought to worry
about, but all the hardware and software (much of the latter
inside disks and storage controllers and the like) in the
storage stack.

I had heard anecdotes long before (e.g. from Andrew Hume)
suggesting silent data corruption had become prominent
enough to matter, but this paper was the first real study
I came across.

I have used ZFS for my home file server for more than a
decade; presently on an antique version of Solaris, but
I hope to migrate to OpenZFS on a newer OS and hardware.
So far as I can tell ZFS in old Solaris is quite stable
and reliable.  As Ted has said, there are philosophical
reasons why some prefer to avoid it, but if you don't
subscribe to those it's a fine answer.

I've been hearing anecdotes since forever about sharp
edges lurking here and there in BtrFS.  It does seem
to be eternally unready for production use if you really
care about your data.  It's all anecdotes so I don't know
how seriously to take it, but since I'm comfortable with
ZFS I don't worry about it.

Norman Wilson
Toronto ON

PS: Disclosure: I work in the same (large) CS department
as Bianca Schroeder, and admire her work in general,
though the paper cited above was my first taste of it.

* [TUHS] Is it time to resurrect the original dsw (delete with switches)?
@ 2021-08-29 22:12 Jon Steinhart
From: Jon Steinhart @ 2021-08-29 22:12 UTC (permalink / raw)
  To: The Unix Heretics Society mailing list

I recently upgraded my machines to fc34.  I just did a stock
uncomplicated installation using the defaults and it failed miserably.

Fc34 uses btrfs as the default filesystem so I thought that I'd give it
a try.  I was especially interested in the automatic checksumming because
the majority of my storage is large media files and I worry about bit
rot in seldom used files.  I have been keeping a separate database of
file hashes and in theory btrfs would make that automatic and transparent.
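
A separate hash database of the kind mentioned above might look roughly
like the following.  This is only an illustrative sketch, not the tooling
actually used here: the function names (file_sha256, build_manifest,
verify_manifest) and the choice of SHA-256 are assumptions made for the
example.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def verify_manifest(root, manifest):
    """Return (changed, missing) relative paths versus a saved manifest."""
    changed, missing = [], []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full):
            missing.append(rel)
        elif file_sha256(full) != expected:
            changed.append(rel)
    return changed, missing
```

A digest mismatch on its own cannot tell bit rot from a deliberate edit,
so a scheme like this suits files that are written once and rarely
touched, such as the media files described above.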

I have 32T of disk on my system, so it took a long time to convert
everything over.  A few weeks after I did this I went to unload my
camera and couldn't because the filesystem that holds my photos was
mounted read-only.  WTF?  I didn't do that.

After a bit of poking around I discovered that btrfs SILENTLY remounted the
filesystem because it had errors.  Sure, it put something in a log file,
but I don't spend all day surfing logs for things that shouldn't be going
wrong.  Maybe my expectation that filesystems just work is antiquated.
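
For anyone who would rather not surf logs, a read-only remount can be
spotted by scanning the mount table.  The sketch below assumes a Linux
system where /proc/mounts uses the usual "device mountpoint fstype
options ..." layout; the function names and the default list of
filesystem types are made up for the example.

```python
def read_only_mounts(mounts_text, fs_types=("btrfs", "ext4")):
    """Return mount points of the given types whose options include 'ro'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mount table line
        _device, mount_point, fs_type, options = fields[:4]
        # Options are comma-separated; match 'ro' exactly, not as a substring.
        if fs_type in fs_types and "ro" in options.split(","):
            hits.append(mount_point)
    return hits


def report(path="/proc/mounts"):
    """Print a warning for every read-only mount found in the mount table."""
    with open(path) as f:
        for mount_point in read_only_mounts(f.read()):
            print(f"WARNING: {mount_point} is mounted read-only")
```

Calling report() periodically, for example from a cron job that mails
its output, would turn a silent remount into a visible one.  It also
flags filesystems that were deliberately mounted read-only, so the type
list may need adjusting.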

This was on a brand new 16T drive, so I didn't think that it was worth
the month that it would take to run the badblocks program which doesn't
really scale to modern disk sizes.  Besides, SMART said that it was fine.

Although it's been discredited by some, I'm still a believer in "stop and
fsck" policing of disk drives.  Unmounted the filesystem and ran fsck to
discover that btrfs had to do its own thing.  No idea why; I guess some
think that incompatibility is a good thing.

Ran "btrfs check" which reported errors in the filesystem but was otherwise
useless BECAUSE IT DIDN'T FIX ANYTHING.  What good is knowing that the
filesystem has errors if you can't fix them?

Near the top of the manual page it says:

    Do not use --repair unless you are advised to do so by a developer
    or an experienced user, and then only after having accepted that
    no fsck can successfully repair all types of filesystem corruption. Eg.
    some other software or hardware bugs can fatally damage a volume.

Whoa!  I'm sure that operators are standing by, call 1-800-FIX-BTRFS.
Really?  Is this a ploy by the developers to form a support business?

Later on, the manual page says:

    enable the repair mode and attempt to fix problems where possible

        Note there’s a warning and 10 second delay when this option
        is run without --force to give users a chance to think twice
        before running repair, the warnings in documentation have
        shown to be insufficient

Since when is it dangerous to repair a filesystem?  That's a new one to me.

Having no option other than not being able to use the disk, I ran btrfs
check with the --repair option.  It crashed.  Lesson so far is that
trusting my data to an unreliable unrepairable filesystem is not a good
idea.  Since this was one of my media disks I just rebuilt it using ext4.

Last week I was working away and tried to write out a file to discover
that /home and /root had become read-only.  Charming.  Tried rebooting,
but couldn't since btrfs filesystems aren't checked and repaired.  Plugged
in a flash drive with a live version, managed to successfully run --repair,
and rebooted.  Lasted about 15 minutes before flipping back to read only
with the same error.

Time to suck it up and revert.  Started a clean reinstall.  Got stuck
because it crashed during disk setup with anaconda giving me a completely
useless big python stack trace.  Eventually figured out that it was
unable to delete the btrfs filesystem that had errors so it just crashed
instead.  Wiped it using dd; nice that some reliable tools still survive.
Finished the installation and am back up and running.

Any of the rest of you have any experiences with btrfs?  I'm sure that it
works fine at large companies that can afford a team of disk babysitters.
What benefits does btrfs provide that other filesystem formats such as
ext4 and ZFS don't?  Is it just a continuation of the "we have to do
everything ourselves and under no circumstances use anything that came
from the BSD world" mentality?

So what's the future for filesystem repair?  Does it look like the past?
Is Ken's original need for dsw going to rise from the dead?

In my limited experience btrfs is a BiTteR FileSystem to swallow.

Or, as Saturday Night Live might put it:  And now, Linux, starring the
not ready for prime time filesystem.  Seems like something that's been
under development for around 15 years should be in better shape.


