caml-list - the Caml user's mailing list
* How to write a GC for size > memory
@ 2009-11-13  5:10 Goswin von Brederlow
  2009-11-13  8:38 ` [Caml-list] " Richard Jones
From: Goswin von Brederlow @ 2009-11-13  5:10 UTC (permalink / raw)
  To: caml-list

Hi,

as I might have mentioned in the past, I'm writing C bindings for
libaio and libfuse to write a filesystem in OCaml.

The problem I'm now facing is freeing blocks on the filesystem when
they are no longer used. The filesystem has a huge B-Tree containing
inodes, file attributes and directory entries (let's call them simple
values, as none of them are pointer-like) or references to data blocks
of a file (let's call them pointers). The filesystem is purely
copy-on-write except for the root of the B-Tree, which is stored
round-robin in a select few blocks at the beginning of the drive.

Simplified I have

type key = int
type block_t = int (* pointer to block *)
type entry = Data of int | Block of block_t
type leaf = (key * entry) array
type node = (key * block_t) array
type tree = Leaf of leaf | Node of node
type roots = tree array

To free unused blocks and to defragment the filesystem I thought I
would implement a moving GC. This actually seems perfect for a GC, as
I have no mutable structures (except the roots, which I can watch).

But here are my problems:

1) I can't put a bit into every (data) block.
   a) There is no space. A block has 4k of data.
   b) I can't modify the block anyway without risking damage in a
      crash.

2) I have a lot of blocks, 1-2 billion of them. Keeping even a single
   bit per block in memory exceeds what I can spare.

3) The B-Tree exceeds the spare memory too, so it needs to be swapped
   in and out with a minimum of thrashing the disk.


I'm thinking of using GC mixed with reference counting for this. The
available space is split into chunks of 65536 blocks. For each chunk
I add an entry in the B-Tree counting the number of blocks allocated
(0-65536 and monotonically rising) and the number of references to
blocks in that chunk. The number of references gives me a metric for
how urgent a GC run over the chunk is. Overall there will be 16k-64k
chunks, which doesn't cost much space (and therefore can fit in
memory). To optimize locality a chunk can only contain B-Tree nodes
(few chunks) or data (many many chunks), but never both.
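The per-chunk accounting could be sketched like this; the field and
function names are mine, not from any existing API, and the urgency
metric is just one plausible choice:

```ocaml
(* Per-chunk bookkeeping, assuming 65536 blocks per chunk.
   Hypothetical names; the real counters would live in the B-Tree. *)
type chunk_stats = {
  mutable allocated : int;  (* blocks ever allocated: 0-65536, monotonically rising *)
  mutable dropped   : int;  (* references dropped since the chunk was last collected *)
}

(* Urgency metric: a chunk is a good GC candidate when a large share
   of its allocated blocks have lost their references. *)
let urgency c =
  if c.allocated = 0 then 0.0
  else float_of_int c.dropped /. float_of_int c.allocated

(* Pick the chunk most worth collecting; with at most 64k chunks a
   linear scan over the in-memory stats is cheap. *)
let pick_worthy (chunks : chunk_stats array) =
  let best = ref 0 in
  Array.iteri
    (fun i c -> if urgency c > urgency chunks.(!best) then best := i)
    chunks;
  !best
```

Since the stats array fits in memory (16k-64k entries), the pick can
run on every GC cycle without touching the disk.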

The GC then has 3 modes:

1) defrag mode, lots of free space left

Go through the B-Tree and copy blocks from anywhere into a new chunk
so that data (or B-Tree nodes) becomes sequential. Keeping the B-Tree
nodes sequential will make sequential access to keys fast (sweeps in
the GC and certain FS operations: mostly readdir() and unlink()).

2) freeing-chunk mode, when space is not abundant or a chunk has a large drop count

Pick a worthy chunk. Sweep the B-Tree completely and copy used blocks
from that chunk somewhere else. After the sweep there can't be any
references to that chunk, so set the allocated and dropped counts to
0. The chunk is now free again. Since I sweep the B-Tree anyway I can
recalculate the used count for each chunk here, so the next run can
pick a worthy chunk.
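Using the types from the first mail, the evacuation sweep of mode 2
could look roughly like this. The I/O hooks (read, write, relocate)
are hypothetical stand-ins for the real libaio plumbing:

```ocaml
(* The types from the original post, repeated so this is self-contained. *)
type key = int
type block_t = int                       (* pointer to block *)
type entry = Data of int | Block of block_t
type leaf = (key * entry) array
type node = (key * block_t) array
type tree = Leaf of leaf | Node of node

let blocks_per_chunk = 65536
let chunk_of (b : block_t) : int = b / blocks_per_chunk

(* Sweep the tree rooted at [b], evacuating every block still living
   in chunk [victim]. [read] fetches a tree node, [write] stores a
   node outside the victim and returns its new block number, and
   [relocate] copies a data block out of the victim. Copy-on-write
   means moving a child rewrites its parent, up to the root. *)
let rec evacuate ~victim ~read ~write ~relocate (b : block_t) : block_t =
  match read b with
  | Leaf entries ->
      let entries' =
        Array.map (fun (k, e) ->
            match e with
            | Block d when chunk_of d = victim -> (k, Block (relocate d))
            | e -> (k, e))
          entries
      in
      if entries' = entries && chunk_of b <> victim then b
      else write (Leaf entries')
  | Node children ->
      let children' =
        Array.map (fun (k, c) ->
            (k, evacuate ~victim ~read ~write ~relocate c))
          children
      in
      if children' = children && chunk_of b <> victim then b
      else write (Node children')
```

After `evacuate` returns the new root, no reachable block number maps
to the victim chunk, so its counters can be reset to 0.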

3) emergency compaction mode, space is running out

Declare an emergency and suspend filesystem operations. This can't be
concurrent as the FS would run out of space.

Pick a worthy chunk and allocate a bitmap in memory for it. Sweep the
B-Tree completely and mark all used blocks in the bitmap. Compact the
chunk and update the B-Tree (another sweep). Set the allocated count
to the number of used blocks and the used count to the number of
references. The chunk is then partially allocated again. Since space
is tight, don't regenerate the other used counts, to avoid B-Tree
updates eating space. Ideally this should only use free blocks in the
chunk it compacts to record the changes to the B-Tree.
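The mark bitmap for one chunk is small: one bit per block, 65536
blocks, so 8 KiB of memory per compacted chunk. A minimal sketch on
top of Bytes:

```ocaml
(* One mark bit per block of a 65536-block chunk: 8 KiB total.
   Only one such bitmap exists at a time, during emergency compaction. *)
let blocks_per_chunk = 65536

type bitmap = Bytes.t

let create () : bitmap = Bytes.make (blocks_per_chunk / 8) '\000'

(* Mark block [i] (0 <= i < blocks_per_chunk) of the chunk as used. *)
let mark (bm : bitmap) i =
  let byte = Char.code (Bytes.get bm (i / 8)) in
  Bytes.set bm (i / 8) (Char.chr (byte lor (1 lsl (i mod 8))))

(* Test whether block [i] was marked during the sweep. *)
let is_marked (bm : bitmap) i =
  Char.code (Bytes.get bm (i / 8)) land (1 lsl (i mod 8)) <> 0
```

This is exactly the structure that cannot be afforded for all 1-2
billion blocks at once (128-256 MiB), but is cheap for a single chunk.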


Normally the GC would switch between defrag and freeing-chunk
mode. Both would be concurrent with normal filesystem operations,
possibly only running when the filesystem is idle. The compaction
mode would only happen in situations where the FS would otherwise
have to return ENOSPC.


To further improve this I would like to add a generational component
to the GC. I have no mutables (except the roots), so an old chunk (or
an old B-Tree node) can hold no references to a newer chunk. Also, new
files are far more likely to be deleted again than old files, and new
B-Tree nodes are more likely to be modified again than old B-Tree
nodes. This seems to scream for a generational approach, which should
allow the GC to only scan a fraction of the B-Tree for its sweep. But
I'm not quite sure how to go about that.
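One way the no-old-to-new invariant could be exploited: give each
chunk an age (an epoch stamp bumped when the chunk is opened for
allocation), and prune the sweep at any node older than the horizon,
since everything below it must be at least as old. A hypothetical
sketch, where age and read are assumed hooks:

```ocaml
type key = int
type block_t = int
type entry = Data of int | Block of block_t
type tree =
  | Leaf of (key * entry) array
  | Node of (key * block_t) array

let blocks_per_chunk = 65536
let chunk_of b = b / blocks_per_chunk

(* Copy-on-write means a node can only point at blocks written before
   it, so an old node never references a young chunk. A sweep that
   only cares about chunks of age >= [horizon] can therefore skip any
   subtree rooted in an older chunk. [age] is an assumed per-chunk
   epoch table; [visit] is called for each reachable young data block. *)
let rec sweep_young ~horizon ~age ~read ~visit (b : block_t) =
  if age (chunk_of b) < horizon then
    ()  (* old subtree: everything below is at least this old; prune *)
  else
    match read b with
    | Leaf entries ->
        Array.iter
          (fun (_, e) ->
            match e with
            | Block d when age (chunk_of d) >= horizon -> visit d
            | _ -> ())
          entries
    | Node children ->
        Array.iter
          (fun (_, c) -> sweep_young ~horizon ~age ~read ~visit c)
          children
```

With most of the tree old and stable, such a sweep only reads the
young fringe of the B-Tree instead of all of it.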

Comments, ideas, improvements, URLs welcome.

MfG
        Goswin



* Re: [Caml-list] How to write a GC for size > memory
  2009-11-13  5:10 How to write a GC for size > memory Goswin von Brederlow
@ 2009-11-13  8:38 ` Richard Jones
  2009-11-13 17:07   ` Goswin von Brederlow
From: Richard Jones @ 2009-11-13  8:38 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: caml-list

On Fri, Nov 13, 2009 at 06:10:03AM +0100, Goswin von Brederlow wrote:
> Normally the GC would switch between defrag and freeing-chunk
> mode. Both would be concurrent with normal filesystem operations,
> possibly only running when the filesystem is idle. The compaction
> mode would only happen in situations where the FS would otherwise
> have to return ENOSPC.
> 
> To further improve this I would like to add a generational component
> to the GC. I have no mutables (except the roots), so an old chunk (or
> an old B-Tree node) can hold no references to a newer chunk. Also, new
> files are far more likely to be deleted again than old files, and new
> B-Tree nodes are more likely to be modified again than old B-Tree
> nodes. This seems to scream for a generational approach, which should
> allow the GC to only scan a fraction of the B-Tree for its sweep. But
> I'm not quite sure how to go about that.
> 
> Comments, Ideas, Improvements, Urls welcome.

As I said on IRC, I still think ocaml-ancient is a good fit for this,
unless you are planning to modify the OCaml GC itself.

ocaml-ancient essentially converts parts of the heap to use manual
memory allocation, on top of which you can write a more suitable GC in
OCaml for your long-lived rarely-changing blocks (eg. one based on
reference counting).  The invariants you describe above are exactly
the ones which ocaml-ancient needs.

Rich.

-- 
Richard Jones
Red Hat



* Re: [Caml-list] How to write a GC for size > memory
  2009-11-13  8:38 ` [Caml-list] " Richard Jones
@ 2009-11-13 17:07   ` Goswin von Brederlow
From: Goswin von Brederlow @ 2009-11-13 17:07 UTC (permalink / raw)
  To: Richard Jones; +Cc: caml-list

Richard Jones <rich@annexia.org> writes:

> On Fri, Nov 13, 2009 at 06:10:03AM +0100, Goswin von Brederlow wrote:
>> Normally the GC would switch between defrag and freeing-chunk
>> mode. Both would be concurrent with normal filesystem operations,
>> possibly only running when the filesystem is idle. The compaction
>> mode would only happen in situations where the FS would otherwise
>> have to return ENOSPC.
>> 
>> To further improve this I would like to add a generational component
>> to the GC. I have no mutables (except the roots), so an old chunk (or
>> an old B-Tree node) can hold no references to a newer chunk. Also, new
>> files are far more likely to be deleted again than old files, and new
>> B-Tree nodes are more likely to be modified again than old B-Tree
>> nodes. This seems to scream for a generational approach, which should
>> allow the GC to only scan a fraction of the B-Tree for its sweep. But
>> I'm not quite sure how to go about that.
>> 
>> Comments, Ideas, Improvements, Urls welcome.
>
> As I said on IRC, I still think ocaml-ancient is a good fit for this,
> unless you are planning to modify the OCaml GC itself.

I'm not trying to use the OCaml GC. That is totally impossible, for
the simple reason that a 32-bit CPU cannot mmap 8 TiB of disk space
and use it as the OCaml heap. I could use ocaml-ancient to map
selected blocks of the disk into memory, but I'm already using libaio
and Bigarray for this. The OCaml GC itself can cope fine with mapped
blocks. What I need to do is garbage collect the disk blocks
themselves. And, as said, even using 1 bit of memory per disk block
is too much.

> ocaml-ancient essentially converts parts of the heap to use manual
> memory allocation, on top of which you can write a more suitable GC in
> OCaml for your long-lived rarely-changing blocks (eg. one based on
> reference counting).  The invariants you describe above are exactly
> the ones which ocaml-ancient needs.
>
> Rich.

Even if that were possible, it would change absolutely nothing. I
still need to write the GC. So the question of how best to do that
remains.

As a side note: I considered doing reference counting on the
blocks. Since I have no loops, that would be free of leaks. But it
would require very frequent changes to the reference counts, and
those counts would have to be stored in a different location than the
data itself. For every B-Tree block written I would have to increment
the counter for every block it points to, each of which can be in a
different block on disk. So instead of writing 4k I would need to
write maybe 500k, which is an unacceptable slowdown.
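The arithmetic behind a figure of that order, under assumed
parameters (a 4 KiB B-Tree node whose refcount updates scatter over
roughly 125 distinct 4 KiB counter blocks; both numbers are
illustrative, not measured):

```ocaml
(* Write amplification of per-block reference counting, with assumed
   parameters: one 4 KiB node write dirties ~125 distinct counter
   blocks elsewhere on disk. *)
let block_size = 4096
let counter_blocks_touched = 125        (* assumed degree of scattering *)
let bytes_written = block_size * (1 + counter_blocks_touched)
(* 516096 bytes, roughly 500k, to persist a single 4 KiB node:
   over 100x amplification. *)
```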

The reference counting described in my last mail would only be used
to prioritize the chunks that probably contain the fewest used blocks.

MfG
        Goswin


