Some observations and measurements with using Weak Hashtable to annotate a tree (was: Re: [Caml-list] Weak hash table for attaching extra data to an object)

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Richard Jones <rich@annexia.org>
To: caml-list@inria.fr
Subject: Some observations and measurements with using Weak Hashtable to annotate a tree (was: Re: [Caml-list] Weak hash table for attaching extra data to an object)
Date: Thu, 16 Aug 2007 17:17:21 +0100	[thread overview]
Message-ID: <20070816161721.GA11440@furbychan.cocan.org> (raw)
In-Reply-To: <20070815190455.GA4568@furbychan.cocan.org>

On Wed, Aug 15, 2007 at 08:04:55PM +0100, Richard Jones wrote:
> I should have said :-) I'm trying to annotate the <img> & <applet>
> nodes in a typical HTML document, so they are both sparse (compared to
> all nodes), but there are many of them.

I did some measurements using my WeakMetadata module to annotate a
large random HTML document tree.  Here are some observations and
results, in no particular order.

Code:

http://annexia.org/tmp/weakMetadata.ml
http://annexia.org/tmp/weakMetadata.mli
http://annexia.org/tmp/test_weakmetadata.ml

Firstly, it works, but I found a few programming pitfalls.  In one
case I was accidentally copying my tree nodes when annotating them, so
I was annotating the copies rather than the nodes themselves.  The
code was along these lines:

    | Element (a,b,c,d) -> annotate (a,b,c,d)

instead of:

    | Element ((_,_,_,_) as e) -> annotate e

Obviously experienced OCaml programmers wouldn't make this mistake,
right?!

Another programming problem was that I ended up having to assign a
unique identifier to each node which was used to do the hash
(otherwise the hash function would have to do a costly recursive
descent to determine if two nodes are equal).  This is a peculiarity
of the Weak hash table itself, namely that physical equality cannot be
used.

Another programming problem is that the choice of the initial size of
the hash table determines performance factors in rather non-obvious
ways.  Allocating the hash table with 'create 13' causes buckets to
have long chains, but that creates much lower memory overhead.
Allocating the hash table with 'create 100003' gives short chains and
high memory overhead.  (Exact numbers below).

I wrote a program to create a document (tree) with approximately
560,000 elements and approximately 840,000 text nodes.

I annotated every element in the document to get some idea of the
overhead.  As mentioned before, choice of initial hash table size
affects the length of the bucket chain:

  Initial hash table size: 13
  Median chain length per bucket: 33
  Overhead per element annotated: 21 bytes  (on a 64 bit machine)

  Initial hash table size: 100003
  Median chain length per bucket: 6
  Overhead per element annotated: 40 bytes  (on a 64 bit machine)

Approximately 210,000 elements were <img> or <input> elements (15% of
the total nodes) and I am particularly interested in just annotating
these nodes.  The overhead for annotating sparse nodes like this is
similar (again with chain length dependent a lot on initial choice of
hash table size):

  Initial hash table size: 13
  Median chain length per bucket: 27
  Overhead per element annotated: 26 bytes  (on a 64 bit machine)

  Initial hash table size: 100003
  Median chain length per bucket: 3
  Overhead per element annotated: 70 bytes  (on a 64 bit machine)

You have to manually compact the hash table in order to actually
regain the memory used by your metadata annotations.  (Alternately,
wait for the hash table as a whole to become unreachable, or reuse the
hash table which causes old annotations to be overwritten).

Internally, the Weak module stores a linked list of weak arrays
(caml_weak_list_head).  The GC has a separate phase to deal with weak
arrays, and as far as I can tell from the code the amount of work it
will do in each slice is limited.  The GC also has special code to
remove unreachable weak arrays, so this list does not just grow
without bound.  However I'm still unclear on the exact O(..) for using
huge numbers of weak arrays, which is what the weak hash table does
once you start annotating large numbers of nodes.

Rich.

-- 
Richard Jones
Red Hat

next prev parent reply	other threads:[~2007-08-16 16:17 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-14 10:15 Weak hash table for attaching extra data to an object Richard Jones
2007-08-14 11:08 ` [Caml-list] " Thomas Fischbacher
2007-08-14 11:48   ` Till Varoquaux
2007-08-14 20:44   ` Jon Harrop
2007-08-14 23:09     ` Stefano Zacchiroli
2007-08-15  0:33       ` Jon Harrop
2007-08-15 12:33         ` Daniel Bünzli
2007-08-16 14:54         ` Markus Mottl
2007-08-15  1:28     ` skaller
2007-08-15 19:04     ` Richard Jones
2007-08-16 16:17       ` Richard Jones [this message]
2007-08-14 16:22 ` Richard Jones
2007-08-14 23:24   ` Till Varoquaux
2007-08-14 23:35   ` Fernando Alegre
2007-08-15  7:59     ` Richard Jones

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070816161721.GA11440@furbychan.cocan.org \
    --to=rich@annexia.org \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).