* [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Jon Steinhart @ 2021-08-29 22:12 UTC
  To: The Unix Heretics Society mailing list

I recently upgraded my machines to fc34.  I just did a stock, uncomplicated installation using the defaults and it failed miserably.

Fc34 uses btrfs as the default filesystem, so I thought that I'd give it a try.  I was especially interested in the automatic checksumming, because the majority of my storage is large media files and I worry about bit rot in seldom-used files.  I have been keeping a separate database of file hashes, and in theory btrfs would make that automatic and transparent.

I have 32T of disk on my system, so it took a long time to convert everything over.  A few weeks after I did this I went to unload my camera and couldn't, because the filesystem that holds my photos was mounted read-only.  WTF?  I didn't do that.

After a bit of poking around I discovered that btrfs SILENTLY remounted the filesystem because it had errors.  Sure, it put something in a log file, but I don't spend all day surfing logs for things that shouldn't be going wrong.  Maybe my expectation that filesystems just work is antiquated.

This was on a brand new 16T drive, so I didn't think that it was worth the month that it would take to run the badblocks program, which doesn't really scale to modern disk sizes.  Besides, SMART said that it was fine.

Although it's been discredited by some, I'm still a believer in "stop and fsck" policing of disk drives.  Unmounted the filesystem and ran fsck, only to discover that btrfs has to do its own thing.  No idea why; I guess some think that incompatibility is a good thing.

Ran "btrfs check", which reported errors in the filesystem but was otherwise useless BECAUSE IT DIDN'T FIX ANYTHING.  What good is knowing that the filesystem has errors if you can't fix them?

Near the top of the manual page it says:

    Warning
        Do not use --repair unless you are advised to do so by a developer
        or an experienced user, and then only after having accepted that
        no fsck successfully repair all types of filesystem corruption.  Eg.
        some other software or hardware bugs can fatally damage a volume.

Whoa!  I'm sure that operators are standing by, call 1-800-FIX-BTRFS.  Really?  Is this a ploy by the developers to form a support business?

Later on, the manual page says:

    DANGEROUS OPTIONS
        --repair
            enable the repair mode and attempt to fix problems where possible

            Note there's a warning and 10 second delay when this option
            is run without --force to give users a chance to think twice
            before running repair, the warnings in documentation have
            shown to be insufficient

Since when is it dangerous to repair a filesystem?  That's a new one to me.

Having no option other than not being able to use the disk, I ran btrfs check with the --repair option.  It crashed.  The lesson so far is that trusting my data to an unreliable, unrepairable filesystem is not a good idea.  Since this was one of my media disks, I just rebuilt it using ext4.

Last week I was working away and tried to write out a file, only to discover that /home and /root had become read-only.  Charming.  Tried rebooting, but couldn't, since btrfs filesystems aren't checked and repaired at boot.  Plugged in a flash drive with a live version, managed to successfully run --repair, and rebooted.  It lasted about 15 minutes before flipping back to read-only with the same error.

Time to suck it up and revert.
Started a clean reinstall.  Got stuck because it crashed during disk setup, with anaconda giving me a completely useless big Python stack trace.  Eventually figured out that it was unable to delete the btrfs filesystem that had errors, so it just crashed instead.  Wiped it using dd; nice that some reliable tools still survive.  Finished the installation and am back up and running.

Any of the rest of you have any experiences with btrfs?  I'm sure that it works fine at large companies that can afford a team of disk babysitters.  What benefits does btrfs provide that other filesystem formats such as ext4 and ZFS don't?  Is it just a continuation of the "we have to do everything ourselves and under no circumstances use anything that came from the BSD world" mentality?

So what's the future for filesystem repair?  Does it look like the past?  Is Ken's original need for dsw going to rise from the dead?

In my limited experience btrfs is a BiTteR FileSystem to swallow.

Or, as Saturday Night Live might put it: And now, linux, starring the not ready for prime time filesystem.  Seems like something that's been under development for around 15 years should be in better shape.

Jon
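For anyone who wants to retrace the sequence Jon describes, a minimal sketch follows.  The device and mount point are hypothetical, and the flags are the ones quoted from the btrfs-progs manual page excerpts above, not a recommendation:

    # See why the kernel remounted the filesystem read-only (btrfs logs to the kernel log).
    dmesg | grep -i btrfs

    # Read-only check of the unmounted filesystem; it reports problems but fixes nothing.
    umount /media/photos                 # hypothetical mount point
    btrfs check /dev/sdb1                # hypothetical device

    # The step the manual page warns about; --force skips the 10-second delay.
    btrfs check --repair --force /dev/sdb1

    # The fallback taken in the end: clobber the old filesystem and rebuild with ext4.
    dd if=/dev/zero of=/dev/sdb1 bs=1M count=64
    mkfs.ext4 /dev/sdb1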
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Henry Bent @ 2021-08-29 23:09 UTC
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

On Sun, 29 Aug 2021 at 18:13, Jon Steinhart <jon@fourwinds.com> wrote:

> I recently upgraded my machines to fc34.  I just did a stock
> uncomplicated installation using the defaults and it failed miserably.
>
> Fc34 uses btrfs as the default filesystem so I thought that I'd give it
> a try.

... cut out a lot about how no sane person would want to use btrfs ...

> Or, as Saturday Night Live might put it: And now, linux, starring the
> not ready for prime time filesystem.  Seems like something that's been
> under development for around 15 years should be in better shape.

To my way of thinking this isn't a Linux problem, or even a btrfs problem; it's a Fedora problem.  They're the ones who decided to switch their default filesystem to something that clearly isn't ready for prime time.

-Henry
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Theodore Ts'o @ 2021-08-30 3:14 UTC
  To: Henry Bent; +Cc: The Unix Heretics Society mailing list

On Sun, Aug 29, 2021 at 07:09:50PM -0400, Henry Bent wrote:
> On Sun, 29 Aug 2021 at 18:13, Jon Steinhart <jon@fourwinds.com> wrote:
>
> > I recently upgraded my machines to fc34.  I just did a stock
> > uncomplicated installation using the defaults and it failed miserably.
> >
> > Fc34 uses btrfs as the default filesystem so I thought that I'd give it
> > a try.
>
> ... cut out a lot about how no sane person would want to use btrfs ...

The ext2/ext3/ext4 file system utilities are, as far as I know, the first fsck that was developed with a full regression test suite from the very beginning and integrated into the sources.  (Just run "make check" and you'll know if you've broken something --- or it's how I know the person contributing code was sloppy and didn't bother to run "make check" before sending me patches to review....)

What a lot of people don't seem to understand is that file system utilities are *important*, and more work than you might think.  The ext4 file system is roughly 71 kLOC (thousand lines of code) in the kernel.  E2fsprogs is 340 kLOC.  In contrast, the btrfs kernel code is 145 kLOC (btrfs does have a lot more "sexy new features"), but its btrfs-progs utilities are currently only 124 kLOC.

And the e2fsprogs line count doesn't include the 350+ library of corrupted file system images that are part of its regression test suite.  Btrfs has a few unit tests (as does e2fsprogs), but it doesn't have anything similar in terms of a library of corrupted file system images to test its fsck functionality.  (Then again, neither do the file system utilities for FFS, so a regression test suite is not required to create a high quality fsck program.  In my opinion, it very much helps, though!)

> > Or, as Saturday Night Live might put it: And now, linux, starring the
> > not ready for prime time filesystem.  Seems like something that's been
> > under development for around 15 years should be in better shape.
>
> To my way of thinking this isn't a Linux problem, or even a btrfs problem,
> it's a Fedora problem.  They're the ones who decided to switch their
> default filesystem to something that clearly isn't ready for prime time.

I was present at the very beginning of btrfs.  In November 2007, various file system developers from a number of the big companies got together (IBM, Intel, HP, Red Hat, etc.) and folks decided that Linux "needed an answer to ZFS".

In preparation for that meeting, I did some research asking various contacts I had at various companies how much effort and how long it took to create a new file system from scratch and make it be "enterprise ready".  I asked folks at Digital how long it took for advfs, IBM for AIX and GPFS, etc., etc.  And the answer I got back at that time was between 50 and 200 Person Years, with the bulk of the answers being between 100-200 PYs (the single 50 PY estimate was an outlier).  This was everything --- kernel and userspace coding, testing and QA, performance tuning, documentation, etc., etc.
The calendar-time estimates I was given were between 5 and 7 calendar years, and even then, users would take at least another 2-3 years minimum of "kicking the tires" before they would trust *their* precious enterprise data to the file system.  There was an Intel engineer at that meeting, who shall remain nameless, who said, "Don't tell the managers that or they'll never greenlight the project!  Tell them 18 months...."

And so I and other developers at IBM continued working on ext4, which we never expected would be able to compete with btrfs and ZFS in terms of "sexy new features"; our focus was on performance, scalability, and robustness.  And it probably was about 2015 or so that btrfs finally became more or less stable, but only if you restricted yourself to core functionality.  (E.g., snapshots, file-system-level RAID, etc. were still dodgy at the time.)

I will say that at Google, ext4 is still our primary file system, mainly because all of our expertise is currently focused there.  We are starting to support XFS in "beta" ("Preview") for Cloud Optimized OS, since there are some enterprise customers which are using XFS on their systems, and they want to continue using XFS as they migrate from on-prem to the Cloud.  We fully support XFS for Anthos Migrate (which is a read-mostly workload), and we're still building our expertise, working on getting bug fixes backported, etc., so we can support XFS the way enterprises expect for Cloud Optimized OS, which is our high-security, ChromeOS-based Linux distribution with a read-only, cryptographically signed root file system optimized for Docker and Kubernetes workloads.

I'm not aware of any significant enterprise usage of btrfs, which is why we're not bothering to support btrfs at $WORK.  The only big company which is using btrfs in production that I know of is Facebook, because they have a bunch of btrfs developers, but even there, they aren't using btrfs exclusively for all of their workloads.

My understanding of why Fedora decided to make btrfs the default was that they wanted to get more guinea pigs to flush out the bugs.  Note that Red Hat is responsible for both Red Hat Enterprise Linux (their paid product, where they make $$$) and Fedora, their freebie "community distribution" --- and Red Hat does not currently support btrfs for their RHEL product.  Make of that what you will....

- Ted
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Steffen Nurpmeso @ 2021-08-30 13:55 UTC
  To: Theodore Ts'o; +Cc: The Unix Heretics Society mailing list

Theodore Ts'o wrote in <YSxNFKq9r3dyHT7l@mit.edu>:
  ...
 |What a lot of people don't seem to understand is that file system
 |utilities are *important*, and more work than you might think.  The
 |ext4 file system is roughly 71 kLOC (thousand lines of code) in the
 |kernel.  E2fsprogs is 340 kLOC.  In contrast, the btrfs kernel code is
 |145 kLOC (btrfs does have a lot more "sexy new features"), but its
 |btrfs-progs utilities are currently only 124 kLOC.

To be "fair" it must be said that btrfs usage almost requires installation of e2fsprogs, because only that ships chattr(1).

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: John Dow via TUHS @ 2021-08-30 9:14 UTC
  To: tuhs

On 30/08/2021 00:09, Henry Bent wrote:
> On Sun, 29 Aug 2021 at 18:13, Jon Steinhart <jon@fourwinds.com> wrote:
>
> > I recently upgraded my machines to fc34.  I just did a stock
> > uncomplicated installation using the defaults and it failed miserably.
> >
> > Fc34 uses btrfs as the default filesystem so I thought that I'd give it
> > a try.
>
> ... cut out a lot about how no sane person would want to use btrfs ...
>
> > Or, as Saturday Night Live might put it: And now, linux, starring the
> > not ready for prime time filesystem.  Seems like something that's been
> > under development for around 15 years should be in better shape.
>
> To my way of thinking this isn't a Linux problem, or even a btrfs
> problem, it's a Fedora problem.  They're the ones who decided to
> switch their default filesystem to something that clearly isn't ready
> for prime time.

This.  Even the Arch wiki makes it clear that btrfs isn't ready for prime time.  It's still under very heavy development - not really something you want for a filesystem, particularly one storing a large amount of critical data.

John
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Larry McVoy @ 2021-08-29 23:57 UTC
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

On Sun, Aug 29, 2021 at 03:12:16PM -0700, Jon Steinhart wrote:
> After a bit of poking around I discovered that btrfs SILENTLY remounted the
> filesystem because it had errors.  Sure, it put something in a log file,
> but I don't spend all day surfing logs for things that shouldn't be going
> wrong.  Maybe my expectation that filesystems just work is antiquated.

I give them credit for remounting read-only when seeing errors; they may have gotten that from BitKeeper.  When we opened a history file, if we encountered any errors we opened the history file in read-only mode, so if it worked enough that you could see your data, great, but don't write on top of bad data.

> Although it's been discredited by some, I'm still a believer in "stop and
> fsck" policing of disk drives.

Me too.  Though with a 32TB drive (I'm guessing rotating media), that's going to take a long time.  If I had a drive that big, I'd divide it into manageable chunks and mount them all under /drive/{a,b,c,d,e...} so that when something goes wrong you don't have to check the whole 32TB.

> Near the top of the manual page it says:
>
>     Warning
>         Do not use --repair unless you are advised to do so by a developer
>         or an experienced user, and then only after having accepted that
>         no fsck successfully repair all types of filesystem corruption.  Eg.
>         some other software or hardware bugs can fatally damage a volume.
>
> Whoa!  I'm sure that operators are standing by, call 1-800-FIX-BTRFS.
> Really?  Is this a ploy by the developers to form a support business?

That's a stretch; they are just trying to not encourage you to make a mess.

I sent Linus an email to find out where btrfs is; I'll report back when he replies.

--lm
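A sketch of the chunked layout Larry describes, assuming GPT partitioning and ext4; the device name, sizes, and mount points are made up:

    # Carve one large drive into independently mountable, independently checkable filesystems.
    parted -s /dev/sdc mklabel gpt
    parted -s /dev/sdc mkpart a ext4 0% 25%
    parted -s /dev/sdc mkpart b ext4 25% 50%
    parted -s /dev/sdc mkpart c ext4 50% 75%
    parted -s /dev/sdc mkpart d ext4 75% 100%
    for p in 1 2 3 4; do mkfs.ext4 "/dev/sdc$p"; done

    # /etc/fstab entries, so a post-crash fsck only has to walk the damaged chunk:
    # /dev/sdc1  /drive/a  ext4  defaults  0 2
    # /dev/sdc2  /drive/b  ext4  defaults  0 2
    # ... and so on for c and d.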
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Rob Pike @ 2021-08-30 1:21 UTC
  To: Larry McVoy; +Cc: The Unix Heretics Society mailing list

Do you even have input switches and HALT/CONT switches?  I don't think so....

Commiserations.

-rob

On Mon, Aug 30, 2021 at 9:58 AM Larry McVoy <lm@mcvoy.com> wrote:
> [... Larry's message quoted in full above ...]
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Theodore Ts'o @ 2021-08-30 3:46 UTC
  To: Larry McVoy; +Cc: The Unix Heretics Society mailing list

On Sun, Aug 29, 2021 at 04:57:45PM -0700, Larry McVoy wrote:
>
> I give them credit for remounting read-only when seeing errors; they may
> have gotten that from BitKeeper.

Actually, the btrfs folks got that from ext2/ext3/ext4.  The original behavior was "don't worry, be happy" (log errors and continue), and I added two additional options: "remount read-only", and "panic and reboot the system".  I recommend the last especially for high-availability systems, since you can then fail over to the secondary system, and fsck can repair the file system on the reboot path.

The primary general-purpose file systems in Linux which are under active development these days are btrfs, ext4, f2fs, and xfs.  They all have slightly different focus areas.  For example, f2fs is best for low-end flash, the kind that is found on $30 mobile handsets on sale in countries like India (aka "the next billion users").  It has deep knowledge of "cost-optimized" flash, where random writes are to be avoided at all costs because write amplification is a terrible thing with very primitive FTLs.  For very large file systems (e.g., large RAID arrays with petabytes of data), XFS will probably do better than ext4 for many workloads.

Btrfs is the file system for users who have ZFS envy.  I believe many of those sexy new features are best done at other layers in the storage stack, but if you *really* want file-system-level snapshots and rollback, btrfs is the only game in town for Linux.  (Unless of course you don't mind using ZFS and hope that Larry Ellison won't sue the bejesus out of you, and if you don't care about potential GPL violations....)

Ext4 is still getting new features added; we recently added light-weight journaling (a simplified version of the 2017 Usenix ATC iJournaling paper[1]), and just last week we added a parallelized orphan list, called Orphan File[2], which optimizes parallel truncate and unlink workloads.  (Neither of these features is enabled by default yet; maybe in a few years, or earlier if community distros want to volunteer their users to be guinea pigs.  :-)

[1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf
[2] https://www.spinics.net/lists/linux-ext4/msg79021.html

We currently aren't adding the "sexy new features" of btrfs or ZFS, but that's mainly because there isn't a business justification to pay for the engineering effort needed to add them.  I have some design sketches of how we *could* add them to ext4, but most of the ext4 developers like food with our meals, and I'm still a working stiff, so I focus on work that adds value to my employer --- and, of course, on helping other ext4 developers working at other companies figure out ways to justify new features that would add value to *their* employers.

I might work on some sexy new features if I won the Powerball Lottery and could retire rich, or if I were working at a company where engineers could work on whatever technologies they wanted without getting permission from the business types, but those companies tend not to end well (especially after they get purchased by Oracle....)

- Ted
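The three ext2/3/4 error behaviors Ted mentions are visible to administrators as the errors= mount option and a superblock default; a quick sketch with a hypothetical device:

    # Per mount: keep going, remount read-only, or panic (and, on an HA pair, fail over).
    mount -o errors=remount-ro /dev/sda2 /home
    mount -o errors=panic /dev/sda2 /home

    # Or set the default in the superblock so every mount inherits it:
    tune2fs -e panic /dev/sda2
    tune2fs -l /dev/sda2 | grep -i behavior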
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Bakul Shah @ 2021-08-30 23:04 UTC
  To: Theodore Ts'o; +Cc: The Unix Heretics Society mailing list

On Aug 29, 2021, at 8:46 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> (Unless of course you don't mind using ZFS and hope that Larry Ellison
> won't sue the bejesus out of you, and if you don't care about potential
> GPL violations....)

This should not matter if you are adding zfs for your own use.  Still, if this is a concern, best switch over to FreeBSD :-)

Actually FreeBSD is now using OpenZFS, so it is the same source code base as ZFS on Linux.  You should even be able to do zfs send/recv between the two OSes.
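A minimal sketch of the zfs send/recv flow Bakul refers to; the pool, dataset, and host names are invented:

    # Take a snapshot and ship it to another machine (FreeBSD or Linux with OpenZFS):
    zfs snapshot tank/media@2021-08-30
    zfs send tank/media@2021-08-30 | ssh backuphost zfs recv -u backup/media

    # Later backups send only the blocks changed since the previous snapshot:
    zfs snapshot tank/media@2021-09-06
    zfs send -i tank/media@2021-08-30 tank/media@2021-09-06 | \
        ssh backuphost zfs recv -u backup/media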
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Jon Steinhart @ 2021-09-02 15:52 UTC
  To: The Unix Heretics Society mailing list

Hey, wanted to thank everybody for the good information on this topic.  Was pleasantly surprised that we got through it without a flame war :-)

I have a related question in case any of you have actual factual knowledge of disk drive internals.  A friend who used to be in charge of reliability engineering at Seagate used to be a great resource, but he's now retired.  For example, years ago someone ranted at me about disk manufacturers sacrificing reliability to save a few pennies by removing the head ramp; I asked my friend, who explained to me how that ramp was removed to improve reliability.

So my personal setup here bucks conventional wisdom in that I don't do RAID.  One reason is that I read the Google reliability studies years ago and interpreted them to say that if I bought a stack of disks on the same day I could expect them to fail at about the same time.  Another reason is that 24x7 spinning disk drives are the biggest power consumer in my office.  The last reason is that my big drives hold media (music/photos/video), so if one dies it's not going to be any sort of critical interruption.

My strategy is to keep three (+) copies.  I realized that I first came across this wisdom while learning both code and spelunking as a teenager from Carl Christensen and Heinz Lycklama, in the guise of how many spare headlamps one should have when spelunking.  There's the copy on my main machine, another in a fire safe, and I rsync to another copy on a duplicate machine up at my ski condo.  Plus, I keep lots of old fragments of stuff on retired small (<10T) disks that are left over from past systems.  And a lot of the music part of my collection is backed up by proof-of-purchase CDs in the store room or shared with many others, so it's not too hard to recover.

Long intro, on to the question.  Anyone know what it does to reliability to spin disks up and down?  I don't really need the media disks to be constantly spinning; when whatever I'm listening to in the evening finishes, the disk could spin down until morning to save energy.  Likewise, the video disk drive is at most used for a few hours a day.

My big disks (currently 16T and 12T) bake when they're spinning, which can't be great for them, but I don't know how that compares to the mechanical stress from spinning up and down from a reliability standpoint.  Anyone know?

Jon
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Theodore Ts'o @ 2021-09-02 16:57 UTC
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

On Thu, Sep 02, 2021 at 08:52:23AM -0700, Jon Steinhart wrote:
> Long intro, on to the question.  Anyone know what it does to reliability to
> spin disks up and down?  I don't really need the media disks to be constantly
> spinning; when whatever I'm listening to in the evening finishes, the disk
> could spin down until morning to save energy.  Likewise, the video disk drive
> is at most used for a few hours a day.
>
> My big disks (currently 16T and 12T) bake when they're spinning, which can't
> be great for them, but I don't know how that compares to the mechanical stress
> from spinning up and down from a reliability standpoint.  Anyone know?

First of all, I wouldn't worry too much about big disks "baking" while they are spinning.  Google runs its disks hot, in data centers where the ambient air temperature is at least 80 degrees Fahrenheit[1], and it's not alone; Dell said in 2012 that it would honor warranties for servers running in environments as hot as 115 degrees Fahrenheit[2].

[1] https://www.google.com/about/datacenters/efficiency/
[2] https://www.datacenterknowledge.com/archives/2012/03/23/too-hot-for-humans-but-google-servers-keep-humming

And of course, if the ambient *air* temperature is 80 degrees plus, you can just imagine the temperature at the hard drive.

It's also true that a long time ago, disk drives had a limited number of spin up/down cycles; this was in the spec sheet, and SMART would track the number of disk spinups.  I had a laptop drive where I had configured the OS to spin down the drive after 30 seconds of idle, and after about 9 months the SMART stats reported that I had used up almost 50% of the rated spin up/down cycles for a laptop drive.  Needless to say, I backed off my aggressive spindown policies.

That being said, more modern HDDs have been designed for better power efficiency, with slower disk rotational speeds (which is probably fine for media disks, unless you are serving a large number of different video streams at the same time), and they are designed to allow for a much larger number of spindown cycles.  Check your spec sheets; this will be listed as load/unload cycles, and it will typically be a number like 200,000, 300,000 or 600,000.

If you're only spinning down a few times a day, I suspect you'll be fine.  Especially since if the disk dies due to a head crash or other head failure, it'll be a case of an obvious disk failure, not silent data corruption, and you can just pull your backups out of your fire safe.

I don't personally have a lot of knowledge of how modern HDDs actually survive large numbers of load/unload cycles, because at $WORK we keep the disks spinning at all times.  A disk provides value in two ways: bytes of storage, and I/O operations.  An idle disk means we're wasting part of the value it could be providing, and if the goal is to decrease the overall storage TCO, wasting IOPS is not a good thing[3].  Hence, we try to organize our data to keep all of the hard drives busy, by peanut-buttering the hot data across all of the disks in the cluster[4].
[3] https://research.google/pubs/pub44830/
[4] http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Keynote.pdf

Hence, a spun-down disk is a disk which is frittering away the CapEx of the drive and a portion of the cost of the server to which the disk is attached.  And if you can find useful work for that disk to do, it's way more valuable to keep it spun up, even taking into account the power and air-conditioning costs of the spinning drive.

It should also be noted that modern HDDs now also have *write* limits[5], just like SSDs.  This is especially true for technologies like HAMR, where needing to apply *heat* to write means additional thermal stress on the drive head when you write to a disk, but the write limits predate new technologies like HAMR and MAMR.

[5] https://www.theregister.com/2016/05/03/when_did_hard_drives_get_workload_rate_limits/

HDD write limits have implications for systems that use log-structured storage or other copy-on-write schemes, or systems that move data around to balance hot and cold data as described in the PDSW keynote.  This is probably not an issue for home systems, but it's one of the things which keeps the storage space interesting.  :-)

- Ted

P.S.  I have a Synology NAS box, and I *do* let the disks spin down.  Storage at the industrial scale is really different than storage at the personal scale.  I do use RAID, but my backup strategy in extremis is encrypted backups uploaded to cloud storage (where I can take advantage of industrial-scale storage pricing).
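One way to keep an eye on the counters Ted mentions, and to set a fairly gentle idle spin-down, might look like the following; the device path is hypothetical and SMART attribute names vary by vendor:

    # Head load/unload and start/stop counts, plus temperature, from SMART:
    smartctl -A /dev/sdb | grep -E -i 'load_cycle|start_stop|power_on|temperature'

    # Spin down after two hours of idle; for hdparm -S, values 241-251 count in
    # 30-minute units, so 244 = 4 x 30 minutes.
    hdparm -S 244 /dev/sdb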
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Bakul Shah @ 2021-08-30 3:36 UTC
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

Chances are your disk has a URE of 1 in 10^14 bits ("enterprise" disks may have a URE of 1 in 10^15).  10^14 bits is about 12.5TB.  For 16TB disks you should use at least mirroring, provided some day you'd want to fill up the disk.  And a machine with ECC RAM (& trust but verify!).  I am no fan of btrfs but these are the things I'd consider for any FS.  Even if you have done all this, consider the fact that disk mortality has a bathtub curve.

I use FreeBSD + ZFS, so I'd recommend ZFS (on Linux).  ZFS scrub works in the background on an active system.  Similarly resilvering (though things slow down).  On my original zfs filesystem I replaced all the 4 disks twice.  I have been using zfs since 2005 and it has rarely required any babysitting.  I reboot it when upgrading to a new release or applying kernel patches.  "Backups" are via zfs send/recv of snapshots.

> On Aug 29, 2021, at 3:12 PM, Jon Steinhart <jon@fourwinds.com> wrote:
>
> I recently upgraded my machines to fc34.  I just did a stock
> uncomplicated installation using the defaults and it failed miserably.
>
> [... remainder of the original message quoted in full ...]
--
Bakul
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Theodore Ts'o @ 2021-08-30 11:56 UTC
  To: Bakul Shah; +Cc: The Unix Heretics Society mailing list

On Sun, Aug 29, 2021 at 08:36:37PM -0700, Bakul Shah wrote:
> Chances are your disk has a URE of 1 in 10^14 bits ("enterprise" disks
> may have a URE of 1 in 10^15).  10^14 bits is about 12.5TB.  For 16TB
> disks you should use at least mirroring, provided some day you'd want
> to fill up the disk.  And a machine with ECC RAM (& trust but verify!).
> I am no fan of btrfs but these are the things I'd consider for any FS.
> Even if you have done all this, consider the fact that disk mortality
> has a bathtub curve.

You may find this article interesting: "The case of the 12TB URE: Explained and debunked"[1], and the following comment on a reddit post[2] discussing this article:

    "Lol of course it's a myth.

    I don't know why or how anyone thought there would be a URE
    anywhere close to every 12TB read.

    Many of us have large pools that are dozens and sometimes hundreds
    of TB.

    I have 2 64TB pools and scrub them every month.  I can go years
    without a checksum error during a scrub, which means that all my
    ~50TB of data was read correctly without any URE many times in a
    row, which means that I have sometimes read 1PB (50TB x 2 pools x 10
    months) worth from my disks without any URE.

    Last I checked, the spec sheets say < 1 in 1x10^14, which means less
    than 1 in 12TB.  0 in 1PB is less than 1 in 12TB, so it meets the
    spec."

[1] https://heremystuff.wordpress.com/2020/08/25/the-case-of-the-12tb-ure/
[2] https://www.reddit.com/r/DataHoarder/comments/igmab7/the_12tb_ure_myth_explained_and_debunked/

Of course, disks do die, and ECC and backups and checksums are good things.  But the whole "read 12TB, get an error" saying really misunderstands how HDD failures work.  Losing an entire platter, or the entire 12TB disk dying due to a head crash, adds a lot of uncorrectable read errors to the numerator of the UER statistics.

It just goes to show that human intuition really sucks at statistics, whether it's about vaccination side effects, nuclear power plants, or the danger of flying in airplanes versus driving in cars.

- Ted
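To make the disagreement concrete, here is the arithmetic under the naive reading of the spec, i.e., treating 1 in 10^14 as an independent per-bit error probability (the reading the article above argues against):

    awk 'BEGIN {
        p = 1e-14                 # quoted unrecoverable-read-error rate, per bit
        bits12tb = 12e12 * 8      # roughly 12 TB expressed in bits
        bits1pb  = 1e15 * 8       # roughly 1 PB expressed in bits
        printf "expected errors per 12 TB read: %.2f\n", p * bits12tb
        printf "P(at least one error in 12 TB): %.2f\n", 1 - exp(-p * bits12tb)
        printf "P(zero errors in 1 PB):         %.2e\n", exp(-p * bits1pb)
    }'

Under that model, reading a petabyte with zero errors would be essentially impossible, so the everyday experience reported in the reddit comment is itself evidence that the spec is a loose upper bound rather than an observed per-bit rate.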
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Bakul Shah @ 2021-08-30 22:35 UTC
  To: Theodore Ts'o; +Cc: The Unix Heretics Society mailing list

On Aug 30, 2021, at 4:56 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Sun, Aug 29, 2021 at 08:36:37PM -0700, Bakul Shah wrote:
>> Chances are your disk has a URE of 1 in 10^14 bits ("enterprise" disks
>> may have a URE of 1 in 10^15).  [...]
>
> You may find this article interesting: "The case of the 12TB URE:
> Explained and debunked"[1], and the following comment on a reddit
> post[2] discussing this article:
>
> [... reddit comment quoted in full above ...]
>
> [1] https://heremystuff.wordpress.com/2020/08/25/the-case-of-the-12tb-ure/
> [2] https://www.reddit.com/r/DataHoarder/comments/igmab7/the_12tb_ure_myth_explained_and_debunked/

It seems this guy doesn't understand statistics.  He checked his 2 pools and, from a sample of (likely) 4 disks, he "knows" that the URE specs are crap.  Even from an economic PoV it doesn't make sense: why wouldn't the disk companies tout an even lower error rate if they could get away with it?  Presumably these rates are derived from reading many, many disks and averaged.

Here's what the author says on a serverfault thread:
https://serverfault.com/questions/812891/what-is-exactly-an-ure

    @DavidBalažic Evidently, your sample size of one invalidates the
    entirety of probability theory!  I suggest you submit a paper to the
    Nobel Committee. – Ian Kemp, Apr 16 '19 at 5:37

    @IanKemp If someone claims that all numbers are divisible by 7 and I
    find ONE that is not, then yes, a single find can invalidate an
    entire theory.  BTW, still not a single person has confirmed the
    myth in practice (by experiment), did they?  Why should they, when
    belief is more than knowledge... – David Balažic, Apr 16 '19 at 12:22

Incidentally, it is hard to believe he scrubs his 2x64TB pools once a month.  Assuming 250MB/s sequential throughput, and that his scrubber can stream at that rate, it will take him close to 6 days (3 days if reading them in parallel) to read every block.  During this time these pools won't be useful for anything else.  It's unclear if he is using any RAID or a filesystem that does checksums; without that he would be unable to detect hidden data corruption.  In contrast, ZFS will only scrub *live* data.  As more of the disks are filled up, scrub will take progressively more time.
Similarly, replacing a zfs mirror won't read the source disk in its entirety, only the live data.

> Of course, disks do die, and ECC and backups and checksums are good
> things.  But the whole "read 12TB, get an error" saying really
> misunderstands how HDD failures work.  Losing an entire platter, or
> the entire 12TB disk dying due to a head crash, adds a lot of
> uncorrectable read errors to the numerator of the UER statistics.

That is not how URE specs are derived.

> It just goes to show that human intuition really sucks at statistics,

Indeed :-)
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Steffen Nurpmeso @ 2021-08-30 15:05 UTC
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

Jon Steinhart wrote in <202108292212.17TMCGow1448973@darkstar.fourwinds.com>:
 |I recently upgraded my machines to fc34.  I just did a stock
 |uncomplicated installation using the defaults and it failed miserably.
 |
 |Fc34 uses btrfs as the default filesystem so I thought that I'd give it
  ...
 |Any of the rest of you have any experiences with btrfs?  I'm sure that it
 |works fine at large companies that can afford a team of disk babysitters.
 |What benefits does btrfs provide that other filesystem formats such as
 |ext4 and ZFS don't?  Is it just a continuation of the "we have to do
 |everything ourselves and under no circumstances use anything that came
 |from the BSD world" mentality?

From a small man perspective i can say i use btrfs.  I have seen more problems with filesystems in these 29 months than in the ~36 years (well, 22 if you only count Unix, mostly FreeBSD) before.

I have learned that i have to chattr +C my vm/ directory in order to avoid filesystem errors which can only be solved by deleting the corrupted files (which were not easy to find out; inspect-internal inode-resolve could have helped a bit better, but .. why?).  I have seen (factual) snapshot receive errors that were not indicated by the tool's exit status, with dangling files laying around.  I have seen +C attributes missing on the target filesystem from played-in snapshots.  I have found it impossible to mount external devices because of "BTRFS warning (device <unknown>): duplicate device /dev/sdc2 devid 1 generation 3977", even though the former umount was successful (i must admit i had used the lazytime mount option there; be sure not to do that for BTRFS).

I know where you are coming from; i tried it all without success, with all the little tools that do not do what you want.  I found it quite ridiculous that i had to shrink-resize my filesystem in an iteration, because the "minimum possible" size was adjusted each and every time, until i finally was where the actual filesystem usage indicated you would end up from the beginning.  (defrag did not prevent this.)

On the other hand i am still here, now on Luks2, now with xxhash checksums and data and meta DUP (instead of RAID 1, i thought).  Same data, just moved over.

I think i would try out ZFS if it were in-kernel.  The FreeBSD Handbook is possibly the best manual you can have for that.  But i mean the ZFS code is much, much larger than BTRFS, BTRFS serves me really, really well for what i want and need, and i am sailing Linux stable/latest (4.19, now 5.10) here, on a free and small Linux distribution without engineering power aka supervision, that sails somewhat the edge of what all the many involved projects produce, with problems all over the place, sometimes more, sometimes less, some all the time.  That is nothing new, you have to live with some deficiencies in the free software world; i just find it sometimes hard to believe that this is still true for Linux with the many, many billions of dollars that went in, and the tens of thousands of people working on it.
I really hate that all the Linux kernel guys seem to look forward only, mostly: you have to go and try out the newest thing, maybe there all is good.  Of course .. i can understand this (a bit).  That is the good thing of using such an engineering-power distribution: you likely get backports and have a large user base with feedback.

 |In my limited experience btrfs is a BiTteR FileSystem to swallow.

Well, often people say you need to know what you are doing.  That is hard without proper documentation, and the www is a toilet.  And no one writes HOWTOs any more.  But, for example, if i do not use an undocumented kernel parameter (rtw88_pci.disable_aspm=1), and use wifi and bluetooth (audio) in conjunction, then i have to boot into the Windows startup screen in order to properly reinitialize my wifi/bluetooth chip.  Or else you are dead.  And that even though it seems the Linux driver comes from the chip creator itself.

So i think a "chattr +C" here and there (it can be directory-wide; it could be a mount or creation option) isn't that bad.  Also it is just me having had a go (or julia or nim; or perl) on using _one_ filesystem with several subvolumes for anything; if i would have used my (well, security context only) old-style way of doing things and used several hard partitions with individual filesystems, i would have used +C from the beginning for that one.  Especially if i would have read it in the manual page.

I do believe today's journalled, checksummed, snapshot'able copy-on-write filesystems are complex beasts.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Steffen Nurpmeso @ 2021-08-31 13:18 UTC
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

P.S.:

Steffen Nurpmeso wrote in <20210830150510.jWrRZ%steffen@sdaoden.eu>:
 |Jon Steinhart wrote in
 | <202108292212.17TMCGow1448973@darkstar.fourwinds.com>:
 ||I recently upgraded my machines to fc34.  I just did a stock
 ||uncomplicated installation using the defaults and it failed miserably.
 ||
 ||Fc34 uses btrfs as the default filesystem so I thought that I'd give it
  ...
 |I have learned that i have to chattr +C my vm/ directory in order
 |to avoid filesystem errors which can only be solved by deleting
 |the corrupted files (which were not easy to find out,
 |inspect-internal inode-resolve could have helped a bit better, but
  ...
 |creator itself.  So i think a "chattr +C" here and there, it can
 |be directory-wide, it could be a mount or creation option, isn't
 |that bad.  Also it is just me having had a go (or julia or nim; or
  ...

Only to add that this was effectively my fault, because of the caching behaviour my box-vm.sh script chose for qemu.  In effect i think that +C i could drop again, now that i have

  drivecache=,cache=writeback # ,cache=none XXX on ZFS!?

used like, e.g.,

  -drive index=0,if=ide$drivecache,file=img/$vmimg

It, however, took quite some research to get there.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
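A sketch of the two knobs Steffen describes; the paths and image name are hypothetical, and note that chattr +C (nodatacow) only affects files created after the flag is set on the directory:

    # Disable copy-on-write (and therefore data checksumming) for a VM image directory:
    mkdir -p ~/vm
    chattr +C ~/vm
    lsattr -d ~/vm          # should show the 'C' attribute

    # The qemu drive definition from box-vm.sh, with the cache mode spelled out:
    vmimg=disk.img
    drivecache=,cache=writeback       # ,cache=none was flagged "XXX on ZFS!?"
    qemu-system-x86_64 -enable-kvm -m 2G \
        -drive index=0,if=ide$drivecache,file=img/$vmimg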
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Larry McVoy @ 2021-08-30 21:38 UTC
  To: Jon Steinhart; +Cc: The Unix Heretics Society mailing list

Linus replied:

> Anyhoo, what can you tell me about btrfs?  I haven't paid attention; when I
> last did, it seemed unfinished.  Fedora apparently uses it by default,
> should they be?  I've been pretty happy with ext{2,3,4}, I don't know
> if Ted is still behind those but I've been super happy with how they
> handled compat and other stuff.
>
> So did btrfs get done?

I'm not sure how much people use the fancier features, but lots of people seem to use btrfs in production these days, and the maintenance team seems responsive and stable.

Of the local disk filesystems we have, I'd pick any of ext4, btrfs and xfs as stable and well-maintained.  Picking any of those is fine, and the choice would come down to which distro you use, and _possibly_ whether you want any of the particular features.  But on the whole, most people don't seem to care deeply about the more intricate features unless they have very specific use-cases.
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Norman Wilson @ 2021-08-30 13:06 UTC
  To: tuhs

Not to get into what is something of a religious war, but this was the paper that convinced me that silent data corruption in storage is worth thinking about:

http://www.cs.toronto.edu/~bianca/papers/fast08.pdf

A key point is that the character of the errors they found suggests it's not just the disks one ought to worry about, but all the hardware and software (much of the latter inside disks and storage controllers and the like) in the storage stack.

I had heard anecdotes long before (e.g. from Andrew Hume) suggesting silent data corruption had become prominent enough to matter, but this paper was the first real study I came across.

I have used ZFS for my home file server for more than a decade; presently on an antique version of Solaris, but I hope to migrate to OpenZFS on a newer OS and hardware.  So far as I can tell, ZFS in old Solaris is quite stable and reliable.  As Ted has said, there are philosophical reasons why some prefer to avoid it, but if you don't subscribe to those it's a fine answer.

I've been hearing anecdotes since forever about sharp edges lurking here and there in BtrFS.  It does seem to be eternally unready for production use if you really care about your data.  It's all anecdotes, so I don't know how seriously to take it, but since I'm comfortable with ZFS I don't worry about it.

Norman Wilson
Toronto ON

PS: Disclosure: I work in the same (large) CS department as Bianca Schroeder, and admire her work in general, though the paper cited above was my first taste of it.
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Theodore Ts'o @ 2021-08-30 14:42 UTC
  To: Norman Wilson; +Cc: tuhs

On Mon, Aug 30, 2021 at 09:06:03AM -0400, Norman Wilson wrote:
> Not to get into what is something of a religious war,
> but this was the paper that convinced me that silent
> data corruption in storage is worth thinking about:
>
> http://www.cs.toronto.edu/~bianca/papers/fast08.pdf
>
> A key point is that the character of the errors they
> found suggests it's not just the disks one ought to worry
> about, but all the hardware and software (much of the latter
> inside disks and storage controllers and the like) in the
> storage stack.

There's nothing I'd disagree with in this paper.  I'll note, though, that part of the paper's findings is that silent data corruption occurred at a rate roughly 1.5 orders of magnitude less often than latent sector errors (e.g., UREs).  The probability of at least one silent data corruption during the study period (41 months, although not all disks would have been in service during that entire time) was P = 0.0086 for nearline disks and P = 0.00065 for enterprise disks.  And P(1st error) remained constant over disk age, and per the authors' analysis, it was unclear whether P(1st error) changed as disk size increased --- which is representative of non-media-related failures (not surprising given how much checksum and ECC checking is done by the HDDs).

A likely supposition IMHO is that the use of more costly enterprise disks correlates with higher quality hardware in the rest of the storage stack --- so things like ECC memory really do matter.

> As Ted has said, there are philosophical reasons why some prefer to
> avoid it, but if you don't subscribe to those it's a fine answer.

WRT running ZFS on Linux, I wouldn't call it philosophical reasons, but rather legal risks.  Life is not perfect, so you can't drive any kind of risk (including the risk of hardware failure) down to zero.  Whether you should be comfortable with the legal risks in this case very much depends on who you are and what your risk profile might be, and you should contact a lawyer if you want legal advice.  Clearly the lawyers at companies like Red Hat and SuSE have given very different answers from the lawyers at Canonical.  In addition, the answer for hobbyists and academics might be quite different from a large company making lots of money and more likely to attract the attention of the leadership and lawyers at Oracle.

Cheers,

- Ted
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Adam Thornton @ 2021-08-30 18:08 UTC
  To: Theodore Ts'o; +Cc: tuhs

On Aug 30, 2021, at 7:42 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> In addition, the answer for hobbyists
> and academics might be quite different from a large company making
> lots of money and more likely to attract the attention of the
> leadership and lawyers at Oracle.

I mean, Oracle's always been super well-behaved and not lawsuit-happy _at all_!

You're fine, at least until Larry needs a slightly larger yacht or volcano lair or whatever it is he's into these days.

Adam
* Re: [TUHS] Is it time to resurrect the original dsw (delete with switches)?
  From: Arthur Krewat @ 2021-08-30 16:46 UTC
  To: tuhs

On 8/30/2021 9:06 AM, Norman Wilson wrote:
> A key point is that the character of the errors they
> found suggests it's not just the disks one ought to worry
> about, but all the hardware and software (much of the latter
> inside disks and storage controllers and the like) in the
> storage stack.

I had a pair of Dell MD1000s, full of SATA drives (28 total), with the SATA/SAS interposers on the back of the drives.  I was getting checksum errors in ZFS on a handful of the drives.  Took the time to build a new array on a Supermicro backplane, and there are no more errors with the exact same drives.  I'm theorizing it was either the interposers or the SAS backplane/controllers in the MD1000.  Without ZFS, who knows how swiss-cheesy my data would be.

Not to mention the time I set up a Solaris x86 cluster zoned to a Compellent and periodically would get one or two checksum errors in ZFS.  This was the only cluster out of a handful that had issues, and only on that one filesystem.  Of course, it was a production PeopleSoft Oracle database.  I guess moving to a VMware Linux guest and XFS just swept the problem under the rug, but the hardware is not being reused, so there's that.

> I had heard anecdotes long before (e.g. from Andrew Hume)
> suggesting silent data corruption had become prominent
> enough to matter, but this paper was the first real study
> I came across.
>
> I have used ZFS for my home file server for more than a
> decade; presently on an antique version of Solaris, but
> I hope to migrate to OpenZFS on a newer OS and hardware.
> So far as I can tell ZFS in old Solaris is quite stable
> and reliable.  As Ted has said, there are philosophical
> reasons why some prefer to avoid it, but if you don't
> subscribe to those it's a fine answer.

Been running Solaris 11.3 and ZFS for quite a few years now, at home.  Before that, Solaris 10.  I set up a home Redhat 8 server with ZoL (.8) earlier this year - so far, no issues, with 40+TB online.  I have various test servers with ZoL 2.0 on them, too.

I have so much online data that I use as the "live copy" - going back to copies of my TOPS-10 stuff from the early 80's.  Even though I have copious amounts of LTO tape copies of this data, I won't go back to the "out of sight, out of mind" mentality.  Trying to get customers to buy into that idea is another story.

art k.

PS: I refuse to use a workstation that doesn't use ECC RAM, either.  I like swiss cheese on a sandwich.  I don't like my (or my customers') data emulating it.