caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Zheng Li <li@pps.jussieu.fr>
To: caml-list@inria.fr
Subject: Re: [ANN] Weaktbl: a weak hash table library
Date: Thu, 04 Oct 2007 18:01:13 +0200	[thread overview]
Message-ID: <878x6iuawm.fsf@pps.jussieu.fr> (raw)
In-Reply-To: <87641q3ysp.fsf@pps.jussieu.fr>


Hi, 

Follow some advices, I updated Weaktbl to version 0.02. Notable improvements
include:
 
 - the implementation is improved based on the clarified semantics
 - a new weak stack implementation is used internally (I'm not sure whether
   it's useful to others so I decide not to expose it through interface)
 - add new sections to README

The interface remains the same. Anyone has downloaded the previous version is
strongly suggested to upgrade. Btw, the code is released under LGPL + linking
exception.


Here are the extra sections from the README:
-------------------------------------------

......

== Semantics ==

Weaktbl have all the same semantics as standard Hashtbl, with the only
exception: a binding will be automatically garbage collected, if *its key* is
no longer referenced by (hence unreachable from) the rest part your program. 
Note that

 - Here, "its key" indicates the exact (physical) key with which the binding
   was added, not any other keys of its equivalent class even if they may
   (or may not) be structurally equivalent to this key.
 - You may use noncollectable values as keys (such as int), but they won't be
   collected by the Weaktbl (this is the same convention all weak data
   structures follow). Since the keys won't be collected, according to our
   principle -- key's aliveness decide its' binding's aliveness, the bindings
   won't be collected too. You're actually using Weaktbl as a persistent
   Hashtbl, but that's still not too bad.

Finally, the polymorphic hash interface of Weaktbl takes ''equal x y = (compare
x y) = 0'' and ''hash = Hashtbl.hash'' as the default setting, the same as the
polymorphic interface of standard Hashtbl.


== Application ==

Weaktbl is useful for many kinds of applications, where standard Hashtbl's
persistent storage prevents the necessity of automatic garbage
collection. We'll illustrate this with a few typical examples:


=== Dict cache ===

Hash table is typically used for quick dictionary lookup. Suppose you're working
with a huge data set and key->value lookup is the most frequent operations. To
have all the data into a hash table in memory is out of consideration; but to
have everything external is way too expensive (e.g. looking up in a large file
or querying from a external database each time). Fortunately, you can work on
one small part of the data set at a time, and the range of this small part
evolving gradually, i.e. some data coming into the range of vision, others
going out of scope. A typical working mechanics can be written as

  let query key = 
    try lookup $key in $hashtable (* quick *)
    with Not_found ->
      let value = lookup $key in $external_storage in (* slow *)
      add $key $value to $hashtable; (* we may query it quite often recently *)
      value

You'll soon find the problem -- $hashtable here will growing forever! The
situation is, with the evolving of working range, some keys are out of reach
and should be collected by the GC, so should the bindings related to them.
However due to the Hashtbl's persistence, this won't happen. First, the key
itself is persistently referenced by Hashtbl, it wouldn't likely to be GCed,
not mention the bindings. For a weak hash table, this is not a problem:
keys themselves are *weakly* referenced by the table, besides when a key is
forgotten (GCed) by your program, its binding(s) will be forgotten (GCed) by
the table too [*].


=== User-level GC ===

Suppose you are working on graph-like data structure. Each node is represented
as a id -> {links: id list; info: huge_data} bindings in a hash table, where
the links is all the nodes directly reachable from node id, and the info is
large blocks of information affiliated to the node id. The graph structure
keeps evolving over time, i.e. new nodes being added, new links being
established, old links being broken and old nodes being dropped etc. So far
it's just normal. Image you break a link, i.e. remove some $id_b from the links
field of $id_a, it's possible this is the only bridge between the sub-graph
$id_b is located and the outside world, besides there's no active id directly
points to this sub-graph. In short, the sub-graph is disconnected, obsoleted
and unreachable. Because the graph data structure keeps evolving, it's very
important to GC those unreachable parts. How this is done with a standard
Hashtbl? Whenever we break a line $id_a -> $id_b, we check all the nodes' links
field to see whether this is the last link to $id_b. If so, we first remove
$id_b, break each outward link of $id_b and recursively check whether these
broken links produce more unreachable nodes... With weak hash table, you do
nothing. When the last link to a sub-graph is broken, it simply means the
sub-graph is no longer referenced by the rest program, so it's GCed
automatically.


Zheng Li <li@pps.jussieu.fr> writes:
> Hi,
>
> I remember weak hash table was discussed on the list not long ago. I once ran
> into a situation where weak data structure was desired, and came up with this
> small module. Though I didn't really get a chance to make use of it (I turned
> to another solution laterly), I'd like to share and hope it would be useful to
> others. 
>
> == Description ==
>
> Weaktbl is yet another weak hash table library for OCaml. Its main features
> include:
>
>  * Both keys and associated values are weakly stored. A binding exists until
>    the key is no longer referenced anywhere
>  * The implementation is built upon the hash table functor of Weak library
>    rather than implemented from scratch, so it's rather small
>  * The interface is fully compatible with the standard Hashtbl library instead
>    of the hash sub-module of the Weak library, so basically you can also use it
>    as an alternative of the standard Hashtbl
>  * Its behaviors also follow the standard Hashtbl library's conventions. 
>    E.g. the "binding orders" and the "current binding" concepts all make
>    sense here (with find/find_all/remove/replace/iter/fold etc.)
>
> Link: http://www.pps.jussieu.fr/~li/software/index.html#weaktbl

-- 
Zheng Li
http://www.pps.jussieu.fr/~li


      parent reply	other threads:[~2007-10-04 16:06 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-10-02  4:50 Zheng Li
2007-10-02 11:07 ` [Caml-list] " skaller
2007-10-03 10:43   ` Zheng Li
2007-10-04 16:01 ` Zheng Li [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=878x6iuawm.fsf@pps.jussieu.fr \
    --to=li@pps.jussieu.fr \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).