caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [ANN] Weaktbl: a weak hash table library
@ 2007-10-02  4:50 Zheng Li
  2007-10-02 11:07 ` [Caml-list] " skaller
  2007-10-04 16:01 ` Zheng Li
  0 siblings, 2 replies; 4+ messages in thread
From: Zheng Li @ 2007-10-02  4:50 UTC (permalink / raw)
  To: caml-list


Hi,

I remember weak hash table was discussed on the list not long ago. I once ran
into a situation where weak data structure was desired, and came up with this
small module. Though I didn't really get a chance to make use of it (I turned
to another solution laterly), I'd like to share and hope it would be useful to
others. 

== Description ==

Weaktbl is yet another weak hash table library for OCaml. Its main features
include:

 * Both keys and associated values are weakly stored. A binding exists until
   the key is no longer referenced anywhere
 * The implementation is built upon the hash table functor of Weak library
   rather than implemented from scratch, so it's rather small
 * The interface is fully compatible with the standard Hashtbl library instead
   of the hash sub-module of the Weak library, so basically you can also use it
   as an alternative of the standard Hashtbl
 * Its behaviors also follow the standard Hashtbl library's conventions. 
   E.g. the "binding orders" and the "current binding" concepts all make
   sense here (with find/find_all/remove/replace/iter/fold etc.)

Link: http://www.pps.jussieu.fr/~li/software/index.html#weaktbl

Cheers
-- 
Zheng Li
http://www.pps.jussieu.fr/~li


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] [ANN] Weaktbl: a weak hash table library
  2007-10-02  4:50 [ANN] Weaktbl: a weak hash table library Zheng Li
@ 2007-10-02 11:07 ` skaller
  2007-10-03 10:43   ` Zheng Li
  2007-10-04 16:01 ` Zheng Li
  1 sibling, 1 reply; 4+ messages in thread
From: skaller @ 2007-10-02 11:07 UTC (permalink / raw)
  To: Zheng Li; +Cc: caml-list

On Tue, 2007-10-02 at 06:50 +0200, Zheng Li wrote:
> Hi,

> Weaktbl is yet another weak hash table library for OCaml. Its main features
> include:
> 
>  * Both keys and associated values are weakly stored. A binding exists until
>    the key is no longer referenced anywhere

I'm a bit confused what it means..

If the key is weak, it is useless except for terms, i.e. variants
or records of another data structure, where you are placing a pointer
in the table as a key. You can't put values (int, float, etc) in the 
key field, because they're immediately unreferenced.

So you could use this thing as a cache, where for example you have
a tree and calculate some property of each node, and cache it
in the table.

The problem is .. physical comparison in Ocaml is unstable,
so you cannot hash pointers using polymorphic hash, 
but pointers are the only kind that persist. 

So I have to conclude you must use the Weaktbl with a functor,
instantiating the hash function to something which does a computation
on the value, and this claim:

 * The interface is fully compatible with the standard 
   Hashtbl library instead
   of the hash sub-module of the Weak library, so 
   basically you can also use it
   as an alternative of the standard Hashtbl

is misleading: I use Hashtbl extensively, but ALL my uses are
using the polymorphic interface with polymorphic hash,
so actually none of them will work with Weaktbl, which is
a pity.

Can you confirm or deny this analysis?

A weakly keyed table would be cool for term cache, since
it saves worrying about removing dangling references.


-- 
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [ANN] Weaktbl: a weak hash table library
  2007-10-02 11:07 ` [Caml-list] " skaller
@ 2007-10-03 10:43   ` Zheng Li
  0 siblings, 0 replies; 4+ messages in thread
From: Zheng Li @ 2007-10-03 10:43 UTC (permalink / raw)
  To: caml-list


Hi,

skaller <skaller@users.sourceforge.net> writes:
>>  * Both keys and associated values are weakly stored. A binding exists until
>>    the key is no longer referenced anywhere
>
> I'm a bit confused what it means..
>
> If the key is weak, it is useless except for terms, i.e. variants
> or records of another data structure, where you are placing a pointer
> in the table as a key. You can't put values (int, float, etc) in the 
> key field, because they're immediately unreferenced.
What happen if you put a noncollectable value (e.g. int) into a weak data
structure (e.g. standard weak array/hashtable)? It won't disappear from the
data structure unless you manually remove it. 

It's the same here, if you use noncollectable data as the key, the key won't
disappear, hence the binding won't disappear (according to our principle -- key
decide whether the binding disappear). So at least it's consistent with the
standard Weak module. If you're using noncollectable value as key, you're
actually using Weaktbl as normal Hashtbl, that's still not bad at all.

> So you could use this thing as a cache, where for example you have
> a tree and calculate some property of each node, and cache it
> in the table.
>
> The problem is .. physical comparison in Ocaml is unstable,
> so you cannot hash pointers using polymorphic hash, 
> but pointers are the only kind that persist. 
>
> So I have to conclude you must use the Weaktbl with a functor,
> instantiating the hash function to something which does a computation
> on the value, and this claim:

Weaktbl provides the same interface as Hashtbl, i.e. a polymorphic one and a
functorial one. The functorial signature is exactly:

  module Make (H : Hashtbl.HashedType) : Hashtbl.S with type key = H.t

The polymorphic one is equivalent to having "equal x y = (compare x y) = 0" and
"hash = Hashtbl.hash", just as the Hashtbl's polymorphic interface did.

You don't have to (shouldn't, as in standard Hashtbl) use a hash function
related to physical property of data, just a normal hash (as the standard one)
based on structural property is enough.

Actually, it's the "equal" matters to the semantic. Here I'd like to clarify it
a bit, and ask for some *advices*:

- If physical equivalence (==) provided as "equal", there should be nothing
  ambiguous here, because every key is unique. A key disappears, all its
  bindings disappear. Querying with a key (e.g. using find/find_all/remove
  etc.), only those bindings added with this exact key will be given. The
  semantic is clear, every key is only related to binding added by itself ---
  because in each equivalent class there is just one value.

- If other equivalent relation provided as "equal", e.g. "let equal = (=)" or
  "let equal x y = (fst x) = (fst y)". No doubt, when queering with a key , all
  bindings being added by its equivalent class should be given --- that's what
  we define the equivalent relation for. However, on the other problem: when a
  binding should disappear, we have two kinds of semantics as candidate:

  * The first one is used in the current library: a binding is physically
    related to the key with which it was added with. I.e., each key is still
    physically unique, and its presence decide the aliveness of all the
    bindings added with it. Still, a key disappear, the bindings added by it
    disappear. This has nothing to do with the equivalent relationship ---
    equivalent class relation is only used for querying and manipulation. E.g.,
    if we define "equal = (=)" and  $ka and $kb are keys structurally
    equivalent but physically nonequivalent, the presence of bindings added by
    $ka and bindings added by $kb is still independently decided by the
    presence of $ka and $kb. But when doing find_all/find/remove/replace etc.,
    using no matter $ka or $kb should give you the same result.

  * Another possibility is: a binding will exist until all keys that have
    contributed to this equivalent class disappear, or say as far as there is
    at least one key that ever contributed to this class, all bindings of this
    class will remain; when the last key of this class disappears, the bindings
    belong to this class disappear all together.

  The rational behind the second approach is: people can sometimes mix up
  structurally equivalent but physically nonequivalent keys. E.g. they can use
  key $ka to add some binding, but *accidentally* switch to $ka's structural
  equivalence $kb afterwards to add more bindings. (That's normal, for example,
  you filter $ka through a function which doesn't return $ka itself but its
  copy among the result, and you don't insist in using the original $ka, but
  instead using the one in the output.) After a while,  $ka is probably no
  longer used and hence GCed sometime later. In the first approach, those
  bindings added with $ka will just disappear which sounds mysterious; in the
  second approach, all bindings remain as far as one of $ka or $kb is alive.

  So people may think the second approach is usually convenient, because it's
  more conservative and safe, and usually people don't care much about the more
  memory occupation as far as it won't leak overall. But the second approach
  doesn't fully solve its problem and can produce even more mysterious
  situation: since its rational is assume people can make mistakes, then
  think about the following situation: after you using $kb as the key for a
  while, you accidentally (again) switch to $kc, however you hasn't got any
  chance to use $kc to manipulate bindings yet (so the weaktbl knows nothing
  about the existence of $kc), you just hold it because you believe it's $k(b)
  (and $k(a)). After a while, $ka and $kb disappear, all the bindings are
  gone...you're still holding $kc.

I actually implemented both of them, the switch is just one line -- the first
approach use a weak stack inside, the second use a common stack. However I
don't want to provide both to confuse people. I'd like to hear more suggestions
to decide which one is desired.

Thanks in advance.

-- 
Zheng Li
http://www.pps.jussieu.fr/~li


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [ANN] Weaktbl: a weak hash table library
  2007-10-02  4:50 [ANN] Weaktbl: a weak hash table library Zheng Li
  2007-10-02 11:07 ` [Caml-list] " skaller
@ 2007-10-04 16:01 ` Zheng Li
  1 sibling, 0 replies; 4+ messages in thread
From: Zheng Li @ 2007-10-04 16:01 UTC (permalink / raw)
  To: caml-list


Hi, 

Follow some advices, I updated Weaktbl to version 0.02. Notable improvements
include:
 
 - the implementation is improved based on the clarified semantics
 - a new weak stack implementation is used internally (I'm not sure whether
   it's useful to others so I decide not to expose it through interface)
 - add new sections to README

The interface remains the same. Anyone has downloaded the previous version is
strongly suggested to upgrade. Btw, the code is released under LGPL + linking
exception.


Here are the extra sections from the README:
-------------------------------------------

......

== Semantics ==

Weaktbl have all the same semantics as standard Hashtbl, with the only
exception: a binding will be automatically garbage collected, if *its key* is
no longer referenced by (hence unreachable from) the rest part your program. 
Note that

 - Here, "its key" indicates the exact (physical) key with which the binding
   was added, not any other keys of its equivalent class even if they may
   (or may not) be structurally equivalent to this key.
 - You may use noncollectable values as keys (such as int), but they won't be
   collected by the Weaktbl (this is the same convention all weak data
   structures follow). Since the keys won't be collected, according to our
   principle -- key's aliveness decide its' binding's aliveness, the bindings
   won't be collected too. You're actually using Weaktbl as a persistent
   Hashtbl, but that's still not too bad.

Finally, the polymorphic hash interface of Weaktbl takes ''equal x y = (compare
x y) = 0'' and ''hash = Hashtbl.hash'' as the default setting, the same as the
polymorphic interface of standard Hashtbl.


== Application ==

Weaktbl is useful for many kinds of applications, where standard Hashtbl's
persistent storage prevents the necessity of automatic garbage
collection. We'll illustrate this with a few typical examples:


=== Dict cache ===

Hash table is typically used for quick dictionary lookup. Suppose you're working
with a huge data set and key->value lookup is the most frequent operations. To
have all the data into a hash table in memory is out of consideration; but to
have everything external is way too expensive (e.g. looking up in a large file
or querying from a external database each time). Fortunately, you can work on
one small part of the data set at a time, and the range of this small part
evolving gradually, i.e. some data coming into the range of vision, others
going out of scope. A typical working mechanics can be written as

  let query key = 
    try lookup $key in $hashtable (* quick *)
    with Not_found ->
      let value = lookup $key in $external_storage in (* slow *)
      add $key $value to $hashtable; (* we may query it quite often recently *)
      value

You'll soon find the problem -- $hashtable here will growing forever! The
situation is, with the evolving of working range, some keys are out of reach
and should be collected by the GC, so should the bindings related to them.
However due to the Hashtbl's persistence, this won't happen. First, the key
itself is persistently referenced by Hashtbl, it wouldn't likely to be GCed,
not mention the bindings. For a weak hash table, this is not a problem:
keys themselves are *weakly* referenced by the table, besides when a key is
forgotten (GCed) by your program, its binding(s) will be forgotten (GCed) by
the table too [*].


=== User-level GC ===

Suppose you are working on graph-like data structure. Each node is represented
as a id -> {links: id list; info: huge_data} bindings in a hash table, where
the links is all the nodes directly reachable from node id, and the info is
large blocks of information affiliated to the node id. The graph structure
keeps evolving over time, i.e. new nodes being added, new links being
established, old links being broken and old nodes being dropped etc. So far
it's just normal. Image you break a link, i.e. remove some $id_b from the links
field of $id_a, it's possible this is the only bridge between the sub-graph
$id_b is located and the outside world, besides there's no active id directly
points to this sub-graph. In short, the sub-graph is disconnected, obsoleted
and unreachable. Because the graph data structure keeps evolving, it's very
important to GC those unreachable parts. How this is done with a standard
Hashtbl? Whenever we break a line $id_a -> $id_b, we check all the nodes' links
field to see whether this is the last link to $id_b. If so, we first remove
$id_b, break each outward link of $id_b and recursively check whether these
broken links produce more unreachable nodes... With weak hash table, you do
nothing. When the last link to a sub-graph is broken, it simply means the
sub-graph is no longer referenced by the rest program, so it's GCed
automatically.


Zheng Li <li@pps.jussieu.fr> writes:
> Hi,
>
> I remember weak hash table was discussed on the list not long ago. I once ran
> into a situation where weak data structure was desired, and came up with this
> small module. Though I didn't really get a chance to make use of it (I turned
> to another solution laterly), I'd like to share and hope it would be useful to
> others. 
>
> == Description ==
>
> Weaktbl is yet another weak hash table library for OCaml. Its main features
> include:
>
>  * Both keys and associated values are weakly stored. A binding exists until
>    the key is no longer referenced anywhere
>  * The implementation is built upon the hash table functor of Weak library
>    rather than implemented from scratch, so it's rather small
>  * The interface is fully compatible with the standard Hashtbl library instead
>    of the hash sub-module of the Weak library, so basically you can also use it
>    as an alternative of the standard Hashtbl
>  * Its behaviors also follow the standard Hashtbl library's conventions. 
>    E.g. the "binding orders" and the "current binding" concepts all make
>    sense here (with find/find_all/remove/replace/iter/fold etc.)
>
> Link: http://www.pps.jussieu.fr/~li/software/index.html#weaktbl

-- 
Zheng Li
http://www.pps.jussieu.fr/~li


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2007-10-04 16:06 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-10-02  4:50 [ANN] Weaktbl: a weak hash table library Zheng Li
2007-10-02 11:07 ` [Caml-list] " skaller
2007-10-03 10:43   ` Zheng Li
2007-10-04 16:01 ` Zheng Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).