caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Zheng Li <li@pps.jussieu.fr>
To: caml-list@inria.fr
Subject: Re: [ANN] Weaktbl: a weak hash table library
Date: Wed, 03 Oct 2007 12:43:13 +0200	[thread overview]
Message-ID: <873awsv5q6.fsf@pps.jussieu.fr> (raw)
In-Reply-To: <1191323231.28954.27.camel@rosella.wigram>


Hi,

skaller <skaller@users.sourceforge.net> writes:
>>  * Both keys and associated values are weakly stored. A binding exists until
>>    the key is no longer referenced anywhere
>
> I'm a bit confused what it means..
>
> If the key is weak, it is useless except for terms, i.e. variants
> or records of another data structure, where you are placing a pointer
> in the table as a key. You can't put values (int, float, etc) in the 
> key field, because they're immediately unreferenced.
What happen if you put a noncollectable value (e.g. int) into a weak data
structure (e.g. standard weak array/hashtable)? It won't disappear from the
data structure unless you manually remove it. 

It's the same here, if you use noncollectable data as the key, the key won't
disappear, hence the binding won't disappear (according to our principle -- key
decide whether the binding disappear). So at least it's consistent with the
standard Weak module. If you're using noncollectable value as key, you're
actually using Weaktbl as normal Hashtbl, that's still not bad at all.

> So you could use this thing as a cache, where for example you have
> a tree and calculate some property of each node, and cache it
> in the table.
>
> The problem is .. physical comparison in Ocaml is unstable,
> so you cannot hash pointers using polymorphic hash, 
> but pointers are the only kind that persist. 
>
> So I have to conclude you must use the Weaktbl with a functor,
> instantiating the hash function to something which does a computation
> on the value, and this claim:

Weaktbl provides the same interface as Hashtbl, i.e. a polymorphic one and a
functorial one. The functorial signature is exactly:

  module Make (H : Hashtbl.HashedType) : Hashtbl.S with type key = H.t

The polymorphic one is equivalent to having "equal x y = (compare x y) = 0" and
"hash = Hashtbl.hash", just as the Hashtbl's polymorphic interface did.

You don't have to (shouldn't, as in standard Hashtbl) use a hash function
related to physical property of data, just a normal hash (as the standard one)
based on structural property is enough.

Actually, it's the "equal" matters to the semantic. Here I'd like to clarify it
a bit, and ask for some *advices*:

- If physical equivalence (==) provided as "equal", there should be nothing
  ambiguous here, because every key is unique. A key disappears, all its
  bindings disappear. Querying with a key (e.g. using find/find_all/remove
  etc.), only those bindings added with this exact key will be given. The
  semantic is clear, every key is only related to binding added by itself ---
  because in each equivalent class there is just one value.

- If other equivalent relation provided as "equal", e.g. "let equal = (=)" or
  "let equal x y = (fst x) = (fst y)". No doubt, when queering with a key , all
  bindings being added by its equivalent class should be given --- that's what
  we define the equivalent relation for. However, on the other problem: when a
  binding should disappear, we have two kinds of semantics as candidate:

  * The first one is used in the current library: a binding is physically
    related to the key with which it was added with. I.e., each key is still
    physically unique, and its presence decide the aliveness of all the
    bindings added with it. Still, a key disappear, the bindings added by it
    disappear. This has nothing to do with the equivalent relationship ---
    equivalent class relation is only used for querying and manipulation. E.g.,
    if we define "equal = (=)" and  $ka and $kb are keys structurally
    equivalent but physically nonequivalent, the presence of bindings added by
    $ka and bindings added by $kb is still independently decided by the
    presence of $ka and $kb. But when doing find_all/find/remove/replace etc.,
    using no matter $ka or $kb should give you the same result.

  * Another possibility is: a binding will exist until all keys that have
    contributed to this equivalent class disappear, or say as far as there is
    at least one key that ever contributed to this class, all bindings of this
    class will remain; when the last key of this class disappears, the bindings
    belong to this class disappear all together.

  The rational behind the second approach is: people can sometimes mix up
  structurally equivalent but physically nonequivalent keys. E.g. they can use
  key $ka to add some binding, but *accidentally* switch to $ka's structural
  equivalence $kb afterwards to add more bindings. (That's normal, for example,
  you filter $ka through a function which doesn't return $ka itself but its
  copy among the result, and you don't insist in using the original $ka, but
  instead using the one in the output.) After a while,  $ka is probably no
  longer used and hence GCed sometime later. In the first approach, those
  bindings added with $ka will just disappear which sounds mysterious; in the
  second approach, all bindings remain as far as one of $ka or $kb is alive.

  So people may think the second approach is usually convenient, because it's
  more conservative and safe, and usually people don't care much about the more
  memory occupation as far as it won't leak overall. But the second approach
  doesn't fully solve its problem and can produce even more mysterious
  situation: since its rational is assume people can make mistakes, then
  think about the following situation: after you using $kb as the key for a
  while, you accidentally (again) switch to $kc, however you hasn't got any
  chance to use $kc to manipulate bindings yet (so the weaktbl knows nothing
  about the existence of $kc), you just hold it because you believe it's $k(b)
  (and $k(a)). After a while, $ka and $kb disappear, all the bindings are
  gone...you're still holding $kc.

I actually implemented both of them, the switch is just one line -- the first
approach use a weak stack inside, the second use a common stack. However I
don't want to provide both to confuse people. I'd like to hear more suggestions
to decide which one is desired.

Thanks in advance.

-- 
Zheng Li
http://www.pps.jussieu.fr/~li


  reply	other threads:[~2007-10-03 10:51 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-10-02  4:50 Zheng Li
2007-10-02 11:07 ` [Caml-list] " skaller
2007-10-03 10:43   ` Zheng Li [this message]
2007-10-04 16:01 ` Zheng Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=873awsv5q6.fsf@pps.jussieu.fr \
    --to=li@pps.jussieu.fr \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).