From: Francisco J Ballesteros
Date: Sun, 20 May 2012 12:07:16 +0200
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)

On May 20, 2012, at 5:13 AM, Bakul Shah wrote:

> How often would you flush to disk? You still need to worry about the order
> of writing metadata.

That's the nice thing: it's so simple that I don't have to worry about order. You write the new blocks and, once all of them have reached the disk without errors, you write a new super with the address of the new root. If you fail before the super is written, you still have the old tree; otherwise, you get the new one. The super keeps two copies of its data in case you fail in the middle of writing it; if they don't match, you use the older one.

The tmr proc writes new versions of the tree on a periodic basis, and also when the number of dirty (or memory-addressed, actually) blocks exceeds some value. (A rough sketch of the commit order and of the tmr policy follows below.)

> You do have to keep track of free disk blocks. On disk. So a linked list
> would require you to visit every freed block.

There's no linked list either :)

There's a mark per block, an epoch number, so you have one block of marks for each block group. All blocks given addresses on disk are stamped with the current value of the mark, or epoch. When you want to collect more free blocks, you traverse the tree and update the marks of the blocks you reach. After that, you increment the value of the mark that is used to recognize free blocks. (A sketch of this sweep follows below as well.)

Of course you could fail at any time and, again, if that happens before the new super is written, the only thing that happens is that you'll have to mark the blocks again (the mark process is pruned when it finds that a visited block already has the correct mark value).

This was actually the code I had in place to double-check that the previous implementation of free block management was correct, but I noticed that it was both doing the job and not disturbing normal operation in the file system, so I removed the management code and declared this debug check the actual algorithm.

> I think an incore FS is the easy case but you will have to face
> the issues of corner/boundary cases, various error conditions
> and efficiency when dealing with real disks. These things are
> what introduce complexity and bugs. "Soft updates" in FFS took
> quite a while to shake out bugs. zfs took a long time. Hammer
> fs of DragonFly took a while. Pretty much every FS design has
> taken a while to be rock solid. Far longer than the original
> estimates of the designers, I think.

Yes, that's why I improved it mostly by removing features and simplifying algorithms instead of using clever ones. It was not easy to do that; it took three or four rewrites. Funny that it was a lot easier to write the first, much more complex, version of it. There are no soft updates now, because the log makes them unneeded. And the complex part of log management, which is reclaiming unused blocks, is also gone because of the naive, yet effective, algorithm used now.
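To make the commit order concrete, here is a rough sketch in portable C rather than the actual code. Every name in it (Super, commit, readsuper, tmrproc, the sizes) is made up for the example; it's one way to read the description above, not what the file server really does.

#include <stdint.h>
#include <string.h>
#include <unistd.h>

enum {
	BLKSZ     = 16*1024,	/* made-up block size */
	NDIRTYMAX = 2048,	/* flush when this many blocks are dirty */
	SYNCIVAL  = 60,		/* seconds between periodic flushes */
};

typedef struct Super Super;
struct Super {
	uint64_t version;	/* bumped on every commit */
	uint64_t root;		/* disk address (block number) of the root of the tree */
	uint64_t epoch;		/* free-block mark; see the next sketch */
};

/* the super on disk keeps two copies of the same data */
typedef struct Superblock Superblock;
struct Superblock {
	Super c[2];
};

/*
 * Commit: write every new block first, and only then a new super that
 * points at the new root.  Dying before the super write leaves the old
 * tree intact; dying in the middle of it leaves two copies that do not
 * match, and the older one is used at boot.
 * Block addresses start at 1; the super lives at offset 0.
 */
int
commit(int fd, Super *s, uint64_t newroot, uint8_t *blk[], uint64_t addr[], int nb)
{
	int i;
	Superblock sb;

	for(i = 0; i < nb; i++)
		if(pwrite(fd, blk[i], BLKSZ, addr[i]*BLKSZ) != BLKSZ)
			return -1;	/* super untouched: old tree still there */
	s->root = newroot;
	s->version++;
	sb.c[0] = *s;
	sb.c[1] = *s;
	if(pwrite(fd, &sb, sizeof sb, 0) != (ssize_t)sizeof sb)
		return -1;
	return 0;
}

/*
 * At boot: if the two copies disagree, the machine died while the super
 * was being written; keep the older copy.  (A real implementation would
 * also checksum each copy.)
 */
int
readsuper(int fd, Super *s)
{
	Superblock sb;

	if(pread(fd, &sb, sizeof sb, 0) != (ssize_t)sizeof sb)
		return -1;
	if(memcmp(&sb.c[0], &sb.c[1], sizeof(Super)) == 0)
		*s = sb.c[0];
	else
		*s = sb.c[0].version < sb.c[1].version ? sb.c[0] : sb.c[1];
	return 0;
}

/*
 * The tmr proc: flush on a timer, and earlier if too many blocks are
 * dirty.  ndirty() and flush() stand for whatever counts the dirty
 * blocks and gathers them for commit().
 */
void
tmrproc(int (*ndirty)(void), void (*flush)(void))
{
	int t;

	for(;;){
		for(t = 0; t < SYNCIVAL && ndirty() < NDIRTYMAX; t++)
			sleep(1);
		if(ndirty() > 0)
			flush();
	}
}

The only ordering constraint is that the data blocks reach the disk before the super does; anything lost before that point just means the old tree is what you see after a reboot.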
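And a sketch of the epoch marks used to find free blocks, under the same caveat: made-up names, and a flat in-core mark array standing in for the per-group mark blocks kept on disk.

#include <stdint.h>

enum { NBLK = 1<<20 };			/* made-up volume size, in blocks */

static uint64_t mark[NBLK];		/* epoch stamped on each block, indexed by block number */
static uint64_t epoch = 1;		/* current epoch, recorded in the super */

typedef struct Tblk Tblk;
struct Tblk {
	uint64_t addr;			/* block number of this tree block */
	int	 nchild;
	Tblk	 **child;		/* children in the on-disk tree */
};

/* blocks being given a disk address are stamped with the current epoch */
void
stampalloc(uint64_t addr)
{
	mark[addr] = epoch;
}

/*
 * Sweep: stamp everything reachable from b with the current epoch.
 * Prune where a block already carries it: either it was allocated in
 * this epoch or the subtree below it was already visited.
 */
static void
sweep(Tblk *b)
{
	int i;

	if(mark[b->addr] == epoch)
		return;
	mark[b->addr] = epoch;
	for(i = 0; i < b->nchild; i++)
		sweep(b->child[i]);
}

/*
 * Collect: sweep the whole tree, then bump the epoch (the new value
 * goes out with the next super).  A crash before that super is written
 * only means the marks get recomputed on the next run.
 */
void
collect(Tblk *root)
{
	sweep(root);
	epoch++;
}

/*
 * A block is free when its mark is older than the last completed
 * sweep, i.e. it was neither reached by that sweep nor allocated
 * since.
 */
int
isfree(uint64_t addr)
{
	return mark[addr]+1 < epoch;
}

A block dropped after a sweep just waits for the next one to show up as free, which is also why a crash at any point only costs marking again.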
But you are right. I spent most of the time fixing problems with disks that were full or had faults injected at the worst times. Operation in the non-boundary cases was always fine, even with the fancier implementation I used before.

Regarding memory and processors, the code is aggressive and tries to use everything, because the machine will be devoted just to serving files.