* [ANN] Weaktbl: a weak hash table library @ 2007-10-02 4:50 Zheng Li 2007-10-02 11:07 ` [Caml-list] " skaller 2007-10-04 16:01 ` Zheng Li 0 siblings, 2 replies; 4+ messages in thread From: Zheng Li @ 2007-10-02 4:50 UTC (permalink / raw) To: caml-list Hi, I remember weak hash table was discussed on the list not long ago. I once ran into a situation where weak data structure was desired, and came up with this small module. Though I didn't really get a chance to make use of it (I turned to another solution laterly), I'd like to share and hope it would be useful to others. == Description == Weaktbl is yet another weak hash table library for OCaml. Its main features include: * Both keys and associated values are weakly stored. A binding exists until the key is no longer referenced anywhere * The implementation is built upon the hash table functor of Weak library rather than implemented from scratch, so it's rather small * The interface is fully compatible with the standard Hashtbl library instead of the hash sub-module of the Weak library, so basically you can also use it as an alternative of the standard Hashtbl * Its behaviors also follow the standard Hashtbl library's conventions. E.g. the "binding orders" and the "current binding" concepts all make sense here (with find/find_all/remove/replace/iter/fold etc.) Link: http://www.pps.jussieu.fr/~li/software/index.html#weaktbl Cheers -- Zheng Li http://www.pps.jussieu.fr/~li ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Caml-list] [ANN] Weaktbl: a weak hash table library 2007-10-02 4:50 [ANN] Weaktbl: a weak hash table library Zheng Li @ 2007-10-02 11:07 ` skaller 2007-10-03 10:43 ` Zheng Li 2007-10-04 16:01 ` Zheng Li 1 sibling, 1 reply; 4+ messages in thread From: skaller @ 2007-10-02 11:07 UTC (permalink / raw) To: Zheng Li; +Cc: caml-list On Tue, 2007-10-02 at 06:50 +0200, Zheng Li wrote: > Hi, > Weaktbl is yet another weak hash table library for OCaml. Its main features > include: > > * Both keys and associated values are weakly stored. A binding exists until > the key is no longer referenced anywhere I'm a bit confused what it means.. If the key is weak, it is useless except for terms, i.e. variants or records of another data structure, where you are placing a pointer in the table as a key. You can't put values (int, float, etc) in the key field, because they're immediately unreferenced. So you could use this thing as a cache, where for example you have a tree and calculate some property of each node, and cache it in the table. The problem is .. physical comparison in Ocaml is unstable, so you cannot hash pointers using polymorphic hash, but pointers are the only kind that persist. So I have to conclude you must use the Weaktbl with a functor, instantiating the hash function to something which does a computation on the value, and this claim: * The interface is fully compatible with the standard Hashtbl library instead of the hash sub-module of the Weak library, so basically you can also use it as an alternative of the standard Hashtbl is misleading: I use Hashtbl extensively, but ALL my uses are using the polymorphic interface with polymorphic hash, so actually none of them will work with Weaktbl, which is a pity. Can you confirm or deny this analysis? A weakly keyed table would be cool for term cache, since it saves worrying about removing dangling references. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [ANN] Weaktbl: a weak hash table library 2007-10-02 11:07 ` [Caml-list] " skaller @ 2007-10-03 10:43 ` Zheng Li 0 siblings, 0 replies; 4+ messages in thread From: Zheng Li @ 2007-10-03 10:43 UTC (permalink / raw) To: caml-list Hi, skaller <skaller@users.sourceforge.net> writes: >> * Both keys and associated values are weakly stored. A binding exists until >> the key is no longer referenced anywhere > > I'm a bit confused what it means.. > > If the key is weak, it is useless except for terms, i.e. variants > or records of another data structure, where you are placing a pointer > in the table as a key. You can't put values (int, float, etc) in the > key field, because they're immediately unreferenced. What happen if you put a noncollectable value (e.g. int) into a weak data structure (e.g. standard weak array/hashtable)? It won't disappear from the data structure unless you manually remove it. It's the same here, if you use noncollectable data as the key, the key won't disappear, hence the binding won't disappear (according to our principle -- key decide whether the binding disappear). So at least it's consistent with the standard Weak module. If you're using noncollectable value as key, you're actually using Weaktbl as normal Hashtbl, that's still not bad at all. > So you could use this thing as a cache, where for example you have > a tree and calculate some property of each node, and cache it > in the table. > > The problem is .. physical comparison in Ocaml is unstable, > so you cannot hash pointers using polymorphic hash, > but pointers are the only kind that persist. > > So I have to conclude you must use the Weaktbl with a functor, > instantiating the hash function to something which does a computation > on the value, and this claim: Weaktbl provides the same interface as Hashtbl, i.e. a polymorphic one and a functorial one. The functorial signature is exactly: module Make (H : Hashtbl.HashedType) : Hashtbl.S with type key = H.t The polymorphic one is equivalent to having "equal x y = (compare x y) = 0" and "hash = Hashtbl.hash", just as the Hashtbl's polymorphic interface did. You don't have to (shouldn't, as in standard Hashtbl) use a hash function related to physical property of data, just a normal hash (as the standard one) based on structural property is enough. Actually, it's the "equal" matters to the semantic. Here I'd like to clarify it a bit, and ask for some *advices*: - If physical equivalence (==) provided as "equal", there should be nothing ambiguous here, because every key is unique. A key disappears, all its bindings disappear. Querying with a key (e.g. using find/find_all/remove etc.), only those bindings added with this exact key will be given. The semantic is clear, every key is only related to binding added by itself --- because in each equivalent class there is just one value. - If other equivalent relation provided as "equal", e.g. "let equal = (=)" or "let equal x y = (fst x) = (fst y)". No doubt, when queering with a key , all bindings being added by its equivalent class should be given --- that's what we define the equivalent relation for. However, on the other problem: when a binding should disappear, we have two kinds of semantics as candidate: * The first one is used in the current library: a binding is physically related to the key with which it was added with. I.e., each key is still physically unique, and its presence decide the aliveness of all the bindings added with it. Still, a key disappear, the bindings added by it disappear. This has nothing to do with the equivalent relationship --- equivalent class relation is only used for querying and manipulation. E.g., if we define "equal = (=)" and $ka and $kb are keys structurally equivalent but physically nonequivalent, the presence of bindings added by $ka and bindings added by $kb is still independently decided by the presence of $ka and $kb. But when doing find_all/find/remove/replace etc., using no matter $ka or $kb should give you the same result. * Another possibility is: a binding will exist until all keys that have contributed to this equivalent class disappear, or say as far as there is at least one key that ever contributed to this class, all bindings of this class will remain; when the last key of this class disappears, the bindings belong to this class disappear all together. The rational behind the second approach is: people can sometimes mix up structurally equivalent but physically nonequivalent keys. E.g. they can use key $ka to add some binding, but *accidentally* switch to $ka's structural equivalence $kb afterwards to add more bindings. (That's normal, for example, you filter $ka through a function which doesn't return $ka itself but its copy among the result, and you don't insist in using the original $ka, but instead using the one in the output.) After a while, $ka is probably no longer used and hence GCed sometime later. In the first approach, those bindings added with $ka will just disappear which sounds mysterious; in the second approach, all bindings remain as far as one of $ka or $kb is alive. So people may think the second approach is usually convenient, because it's more conservative and safe, and usually people don't care much about the more memory occupation as far as it won't leak overall. But the second approach doesn't fully solve its problem and can produce even more mysterious situation: since its rational is assume people can make mistakes, then think about the following situation: after you using $kb as the key for a while, you accidentally (again) switch to $kc, however you hasn't got any chance to use $kc to manipulate bindings yet (so the weaktbl knows nothing about the existence of $kc), you just hold it because you believe it's $k(b) (and $k(a)). After a while, $ka and $kb disappear, all the bindings are gone...you're still holding $kc. I actually implemented both of them, the switch is just one line -- the first approach use a weak stack inside, the second use a common stack. However I don't want to provide both to confuse people. I'd like to hear more suggestions to decide which one is desired. Thanks in advance. -- Zheng Li http://www.pps.jussieu.fr/~li ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [ANN] Weaktbl: a weak hash table library 2007-10-02 4:50 [ANN] Weaktbl: a weak hash table library Zheng Li 2007-10-02 11:07 ` [Caml-list] " skaller @ 2007-10-04 16:01 ` Zheng Li 1 sibling, 0 replies; 4+ messages in thread From: Zheng Li @ 2007-10-04 16:01 UTC (permalink / raw) To: caml-list Hi, Follow some advices, I updated Weaktbl to version 0.02. Notable improvements include: - the implementation is improved based on the clarified semantics - a new weak stack implementation is used internally (I'm not sure whether it's useful to others so I decide not to expose it through interface) - add new sections to README The interface remains the same. Anyone has downloaded the previous version is strongly suggested to upgrade. Btw, the code is released under LGPL + linking exception. Here are the extra sections from the README: ------------------------------------------- ...... == Semantics == Weaktbl have all the same semantics as standard Hashtbl, with the only exception: a binding will be automatically garbage collected, if *its key* is no longer referenced by (hence unreachable from) the rest part your program. Note that - Here, "its key" indicates the exact (physical) key with which the binding was added, not any other keys of its equivalent class even if they may (or may not) be structurally equivalent to this key. - You may use noncollectable values as keys (such as int), but they won't be collected by the Weaktbl (this is the same convention all weak data structures follow). Since the keys won't be collected, according to our principle -- key's aliveness decide its' binding's aliveness, the bindings won't be collected too. You're actually using Weaktbl as a persistent Hashtbl, but that's still not too bad. Finally, the polymorphic hash interface of Weaktbl takes ''equal x y = (compare x y) = 0'' and ''hash = Hashtbl.hash'' as the default setting, the same as the polymorphic interface of standard Hashtbl. == Application == Weaktbl is useful for many kinds of applications, where standard Hashtbl's persistent storage prevents the necessity of automatic garbage collection. We'll illustrate this with a few typical examples: === Dict cache === Hash table is typically used for quick dictionary lookup. Suppose you're working with a huge data set and key->value lookup is the most frequent operations. To have all the data into a hash table in memory is out of consideration; but to have everything external is way too expensive (e.g. looking up in a large file or querying from a external database each time). Fortunately, you can work on one small part of the data set at a time, and the range of this small part evolving gradually, i.e. some data coming into the range of vision, others going out of scope. A typical working mechanics can be written as let query key = try lookup $key in $hashtable (* quick *) with Not_found -> let value = lookup $key in $external_storage in (* slow *) add $key $value to $hashtable; (* we may query it quite often recently *) value You'll soon find the problem -- $hashtable here will growing forever! The situation is, with the evolving of working range, some keys are out of reach and should be collected by the GC, so should the bindings related to them. However due to the Hashtbl's persistence, this won't happen. First, the key itself is persistently referenced by Hashtbl, it wouldn't likely to be GCed, not mention the bindings. For a weak hash table, this is not a problem: keys themselves are *weakly* referenced by the table, besides when a key is forgotten (GCed) by your program, its binding(s) will be forgotten (GCed) by the table too [*]. === User-level GC === Suppose you are working on graph-like data structure. Each node is represented as a id -> {links: id list; info: huge_data} bindings in a hash table, where the links is all the nodes directly reachable from node id, and the info is large blocks of information affiliated to the node id. The graph structure keeps evolving over time, i.e. new nodes being added, new links being established, old links being broken and old nodes being dropped etc. So far it's just normal. Image you break a link, i.e. remove some $id_b from the links field of $id_a, it's possible this is the only bridge between the sub-graph $id_b is located and the outside world, besides there's no active id directly points to this sub-graph. In short, the sub-graph is disconnected, obsoleted and unreachable. Because the graph data structure keeps evolving, it's very important to GC those unreachable parts. How this is done with a standard Hashtbl? Whenever we break a line $id_a -> $id_b, we check all the nodes' links field to see whether this is the last link to $id_b. If so, we first remove $id_b, break each outward link of $id_b and recursively check whether these broken links produce more unreachable nodes... With weak hash table, you do nothing. When the last link to a sub-graph is broken, it simply means the sub-graph is no longer referenced by the rest program, so it's GCed automatically. Zheng Li <li@pps.jussieu.fr> writes: > Hi, > > I remember weak hash table was discussed on the list not long ago. I once ran > into a situation where weak data structure was desired, and came up with this > small module. Though I didn't really get a chance to make use of it (I turned > to another solution laterly), I'd like to share and hope it would be useful to > others. > > == Description == > > Weaktbl is yet another weak hash table library for OCaml. Its main features > include: > > * Both keys and associated values are weakly stored. A binding exists until > the key is no longer referenced anywhere > * The implementation is built upon the hash table functor of Weak library > rather than implemented from scratch, so it's rather small > * The interface is fully compatible with the standard Hashtbl library instead > of the hash sub-module of the Weak library, so basically you can also use it > as an alternative of the standard Hashtbl > * Its behaviors also follow the standard Hashtbl library's conventions. > E.g. the "binding orders" and the "current binding" concepts all make > sense here (with find/find_all/remove/replace/iter/fold etc.) > > Link: http://www.pps.jussieu.fr/~li/software/index.html#weaktbl -- Zheng Li http://www.pps.jussieu.fr/~li ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2007-10-04 16:06 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-10-02 4:50 [ANN] Weaktbl: a weak hash table library Zheng Li 2007-10-02 11:07 ` [Caml-list] " skaller 2007-10-03 10:43 ` Zheng Li 2007-10-04 16:01 ` Zheng Li
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).