From: Francisco J Ballesteros
Date: Sun, 20 May 2012 12:07:16 +0200
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] The creepy WORM. (was: Re: Thinkpad T61 Installation Experience)

On May 20, 2012, at 5:13 AM, Bakul Shah wrote:

> How often would you flush to disk? You still need to worry about the order
> of writing metadata.

That's the nice thing: it's so simple that I don't have to worry about order. You write the new blocks and, once all of them have reached the disk without errors, you write a new super with the address of the new root. If you fail before the super is written, you still have the old tree; otherwise, you get the new one. The super keeps two copies of its data in case you fail in the middle of writing it; if they don't match, you use the older one.

The tmr proc writes new versions of the tree on a periodic basis, and also when the number of dirty (or memory-addressed, actually) blocks exceeds some value. (A rough sketch of the commit order and of the tmr policy follows below.)

> You do have to keep track of free disk blocks. On disk. So a linked list
> would require you to visit every freed block.

There's no linked list either :)

There's a mark per block, an epoch number, so you have one block of marks for each block group. All blocks given addresses on disk are stamped with the current value of the mark, or epoch. When you want to collect more free blocks, you traverse the tree and update the marks of the blocks you reach. After that, you increment the value of the mark that is used to recognize free blocks. (A sketch of this sweep follows below as well.)

Of course you could fail at any time and, again, if that happens before the new super is written, the only thing that happens is that you'll have to mark the blocks again (the mark process is pruned when it finds that a visited block already has the correct mark value).

This was actually the code I had in place to double-check that the previous implementation of free block management was correct, but I noticed that it was both doing the job and not disturbing normal operation in the file system, so I removed the management code and declared this debug check the actual algorithm.

> I think an incore FS is the easy case but you will have to face
> the issues of corner/boundary cases, various error conditions
> and efficiency when dealing with real disks. These things are
> what introduce complexity and bugs. "Soft updates" in FFS took
> quite a while to shake out bugs. zfs took a long time. Hammer
> fs of DragonFly took a while. Pretty much every FS design has
> taken a while to be rock solid. Far longer than the original
> estimates of the designers, I think.

Yes, that's why I improved it mostly by removing features and simplifying algorithms instead of using clever ones. It was not easy to do that; it took three or four rewrites. Funny that it was a lot easier to write the first, much more complex, version of it. There are no soft updates now, because the log makes them unneeded. And the complex part of log management, which is reclaiming unused blocks, is also gone because of the naive, yet effective, algorithm used now.
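To make the commit order concrete, here is a rough sketch in portable C rather than the actual code. Every name in it (Super, commit, readsuper, tmrproc, the sizes) is made up for the example; it's one way to read the description above, not what the file server really does.

#include <stdint.h>
#include <string.h>
#include <unistd.h>

enum {
	BLKSZ     = 16*1024,	/* made-up block size */
	NDIRTYMAX = 2048,	/* flush when this many blocks are dirty */
	SYNCIVAL  = 60,		/* seconds between periodic flushes */
};

typedef struct Super Super;
struct Super {
	uint64_t version;	/* bumped on every commit */
	uint64_t root;		/* disk address (block number) of the root of the tree */
	uint64_t epoch;		/* free-block mark; see the next sketch */
};

/* the super on disk keeps two copies of the same data */
typedef struct Superblock Superblock;
struct Superblock {
	Super c[2];
};

/*
 * Commit: write every new block first, and only then a new super that
 * points at the new root.  Dying before the super write leaves the old
 * tree intact; dying in the middle of it leaves two copies that do not
 * match, and the older one is used at boot.
 * Block addresses start at 1; the super lives at offset 0.
 */
int
commit(int fd, Super *s, uint64_t newroot, uint8_t *blk[], uint64_t addr[], int nb)
{
	int i;
	Superblock sb;

	for(i = 0; i < nb; i++)
		if(pwrite(fd, blk[i], BLKSZ, addr[i]*BLKSZ) != BLKSZ)
			return -1;	/* super untouched: old tree still there */
	s->root = newroot;
	s->version++;
	sb.c[0] = *s;
	sb.c[1] = *s;
	if(pwrite(fd, &sb, sizeof sb, 0) != (ssize_t)sizeof sb)
		return -1;
	return 0;
}

/*
 * At boot: if the two copies disagree, the machine died while the super
 * was being written; keep the older copy.  (A real implementation would
 * also checksum each copy.)
 */
int
readsuper(int fd, Super *s)
{
	Superblock sb;

	if(pread(fd, &sb, sizeof sb, 0) != (ssize_t)sizeof sb)
		return -1;
	if(memcmp(&sb.c[0], &sb.c[1], sizeof(Super)) == 0)
		*s = sb.c[0];
	else
		*s = sb.c[0].version < sb.c[1].version ? sb.c[0] : sb.c[1];
	return 0;
}

/*
 * The tmr proc: flush on a timer, and earlier if too many blocks are
 * dirty.  ndirty() and flush() stand for whatever counts the dirty
 * blocks and gathers them for commit().
 */
void
tmrproc(int (*ndirty)(void), void (*flush)(void))
{
	int t;

	for(;;){
		for(t = 0; t < SYNCIVAL && ndirty() < NDIRTYMAX; t++)
			sleep(1);
		if(ndirty() > 0)
			flush();
	}
}

The only ordering constraint is that the data blocks reach the disk before the super does; anything lost before that point just means the old tree is what you see after a reboot.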
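And a sketch of the epoch marks used to find free blocks, under the same caveat: made-up names, and a flat in-core mark array standing in for the per-group mark blocks kept on disk.

#include <stdint.h>

enum { NBLK = 1<<20 };			/* made-up volume size, in blocks */

static uint64_t mark[NBLK];		/* epoch stamped on each block, indexed by block number */
static uint64_t epoch = 1;		/* current epoch, recorded in the super */

typedef struct Tblk Tblk;
struct Tblk {
	uint64_t addr;			/* block number of this tree block */
	int	 nchild;
	Tblk	 **child;		/* children in the on-disk tree */
};

/* blocks being given a disk address are stamped with the current epoch */
void
stampalloc(uint64_t addr)
{
	mark[addr] = epoch;
}

/*
 * Sweep: stamp everything reachable from b with the current epoch.
 * Prune where a block already carries it: either it was allocated in
 * this epoch or the subtree below it was already visited.
 */
static void
sweep(Tblk *b)
{
	int i;

	if(mark[b->addr] == epoch)
		return;
	mark[b->addr] = epoch;
	for(i = 0; i < b->nchild; i++)
		sweep(b->child[i]);
}

/*
 * Collect: sweep the whole tree, then bump the epoch (the new value
 * goes out with the next super).  A crash before that super is written
 * only means the marks get recomputed on the next run.
 */
void
collect(Tblk *root)
{
	sweep(root);
	epoch++;
}

/*
 * A block is free when its mark is older than the last completed
 * sweep, i.e. it was neither reached by that sweep nor allocated
 * since.
 */
int
isfree(uint64_t addr)
{
	return mark[addr]+1 < epoch;
}

A block dropped after a sweep just waits for the next one to show up as free, which is also why a crash at any point only costs marking again.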
But you are right. I spent most of the time fixing problems with disks that were full or had faults injected at the worst times. Operation in the non-boundary cases was always fine, even with the fancier implementation I used before.

Regarding memory and processors, the code is aggressive and tries to use everything, because the machine will be devoted just to serving files.