[Caml-list] Mark-region GC

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

* [Caml-list] Mark-region GC
@ 2011-01-09 14:14 Jon Harrop
  0 siblings, 0 replies; only message in thread
From: Jon Harrop @ 2011-01-09 14:14 UTC (permalink / raw)
  To: caml-list

Thought you guys might appreciate my recent foray into GC design:

http://flyingfrogblog.blogspot.com/2011/01/importance-of-locality-and-sparsi
ty-in.html
  http://flyingfrogblog.blogspot.com/2011/01/sweeping-700-faster.html

In summary, I think I have found a way to make a simple GC for HLVM that is
almost as fast as OCaml's on serial functional code, significantly faster on
imperative code and can easily support both parallelism among mutators and
even parallel+concurrent collection.

Specifically, I noted that the advantage of OCaml's generational GC is
largely its ability to sweep large numbers of dead infants from the nursery
in constant time but the disadvantage is the overhead of evacuating
survivors. I was able to mimic the advantage without the disadvantage by
allocating regions and storing bitvectors representing the blocks within the
region that are allocated and marked. Hundreds of heap allocated blocks in a
region can then be swept simply by reading those two bitvectors, computing
the bitwise AND and writing the result as an updated allocated bitvector
(with everything unmarked zeroed out). Moreover, 512-bit bitvectors fill a
cache line and offer near-optimal granularity for the rest of the system.
Thus, the new collector is sweeping at a rate of 40Gb of heap blocks per
second (!), not only for the "nusery" (thread-local region) but potentially
also for the entire "major heap" (global queues of full and non-full
regions).

I was also able to mimic the advantage of Appel's semi-generational
collector with the added benefit of reducing synchronization by sweeping the
thread-local region when it fills. Many local sweeps are typically performed
before the local region fills so contention for obtaining new regions from
the global queue is greatly reduced.

However, my design does degrade the performance of allocation. Rather than
using a bump allocator, I search for the next non-255 byte in the allocated
bitvector and then use a 255-element lookup table to find the next zero bit,
set it, calculate and return the corresponding pointer into the region. I
haven't profiled the overhead of this allocator but my C++ prototype gets
within 10% of OCaml's performance on the list-based n-queens solver. I am
*extremely* happy with that result! :-)

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com

^ permalink raw reply	[flat|nested] only message in thread

only message in thread, other threads:[~2011-01-09 14:14 UTC | newest]

Thread overview: (only message) (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-09 14:14 [Caml-list] Mark-region GC Jon Harrop

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).