On Fri, Feb 5, 2021 at 12:51 PM Dave Horsfall wrote:
>
> [...]
>
> Thanks; I'd heard that ZFS was a compressed file system, so I stopped
> right there (I had lots of experience in recovering from corrupted RK05s,
> and didn't need any more trouble).
>

That's funny, because for me that is the main reason to use ZFS. What really sets ZFS apart from everything else is the lack of trouble and its resilience to failures. We used to have lots and lots of ZFS filesystems at work, and I've been using ZFS exclusively at home ever since. I have run into a non-importable ZFS file system (all drives present, but the pool was corrupt and wouldn't import) exactly once, and I was able to fix it with zdb.

ZFS compression is completely optional, and not even on by default. I've only tried it once and found it cost too much performance on something that isn't very fast to begin with, but I don't think it affects data recovery much (the way ZFS stripes data makes traditional data recovery tools pretty useless anyway).

I personally don't care about purity of implementation, because everything is a trade-off. The argument reminds me of Tanenbaum's criticism of the Linux monolithic kernel (was Tanenbaum right? Maybe, but who cares: Linux took over the world and Minix didn't, so from a practical point of view Linus was right). It also reminds me of the criticism of TCP's "blatant layering violation" (vs. OSI). But IMHO the critics were just jealous of the cool things they couldn't do because they had to respect the division of labor along those pesky layers. I remember reading on one of the Sun engineers' blogs (remember when Sun let their engineers keep blogs about Solaris development? Good times!) about the heated discussions they had over the ARC and bypassing the page cache. I don't remember the actual arguments for it, but it was certainly not a decision made out of laziness.

Performance-wise, ZFS is not the best, and if that's all you care about, there are better options. It needs a lot of tuning just to reach "acceptable", and it definitely does not play well with other work running on the same machine (it pretty much assumes your storage appliance is dedicated). It has particularly abysmal performance when you do lots of small random writes and then try to read them back in order. But if you care about not losing your data, it's second to none.

At $JOB-1 (almost 15 years ago), we spent a few weeks stress testing ZFS. The setup was 24 x 4TB SATA drives, divided into two 12-drive raidz2 vdevs, or something like that. All tests were done while the pool was busy reading and writing checksummed test files at full speed, 1 GB/s or so (see? Performance was not impressive; we definitely got a lot more out of that hardware with UFS). What was absolutely stunning was that in all our tests it never served a single bit of corrupted data: it either had the data, or it returned an error. We tortured the storage in every way we could imagine. Wiggled the cables, yanked out drives, used dd to overwrite random parts of drives or entire drives, smashed a drive with a hammer and put it back in, put in drives of the wrong size, put in known-bad drives, yanked out drives while it was resilvering, put drives back into a different slot, overwrote stuff while it was resilvering. Unplugged the entire storage, plugged it into another machine and imported the pool there, plugged the drives back into the first machine in a different order.
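
(In case anyone wants to picture that layout: a pool like that is created in one go, with both raidz2 vdevs named on the same command line. This is just a sketch from memory; the pool name and device names are made up, and the real boxes were Solaris with their own device naming:

    zpool create tank \
        raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
               c0t6d0 c0t7d0 c0t8d0 c0t9d0 c0t10d0 c0t11d0 \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
               c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0

And compression, if you want it, is just a per-dataset property you turn on explicitly, e.g. "zfs set compression=on tank/data"; nothing gets compressed unless you ask for it.)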
We even did things like "copy a drive onto a spare with dd, remove three drives, and then substitute the spare for one of the removed drives" (this led to some data loss, because making the copy was not atomic, but most of the data was recoverable). No matter what we did, it just kept going unless the data was simply not there, and even then it kept serving the files (or the parts of files) that were still available and indicated exactly which files were affected by the data loss. And when you put the drives back (or restored the overwritten parts), it would continue as if nothing had ever happened.

If you've ever wrestled with a hardware RAID controller, or with VxFS/JFS/HPFS, or with mdadm, you know that none of that can be taken for granted, and that doing any of the stupid things mentioned above would most likely lead to complete data loss and/or to serving lots of random corrupted data with no way of telling what had been corrupted.

I remember some performance issues with mmap, but I don't remember how we fixed them. Probably we just sucked it up; using ZFS was never about maximum performance.
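
(For the curious: the "tells you exactly which files" part isn't magic, it's just what zpool status reports. Roughly this, paraphrased from memory rather than copied off a real box:

    zpool scrub tank
    zpool status -v tank

The -v output ends with a list of the files that have permanent errors, so you know exactly what to restore from backup instead of guessing which blocks a RAID controller silently mangled.)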