caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Hashtbl and max_len
@ 2001-11-18 14:55 Marcin 'Qrczak' Kowalczyk
  2001-11-18 17:01 ` [Caml-list] " Marcin 'Qrczak' Kowalczyk
  2001-11-19 15:34 ` [Caml-list] " Xavier Leroy
  0 siblings, 2 replies; 4+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2001-11-18 14:55 UTC (permalink / raw)
  To: caml-list

let create initial_size =
  let s = if initial_size < 1 then 1 else initial_size in
  let s = if s > Sys.max_array_length then Sys.max_array_length else s in
  { max_len = 3; data = Array.make s Empty }

let resize hashfun tbl =
  let odata = tbl.data in
  let osize = Array.length odata in
  let nsize = min (2 * osize + 1) Sys.max_array_length in
  [...]
  tbl.max_len <- 2 * tbl.max_len

The initial_size argument of Hashtbl.create has much bigger impact on
efficiency than I thought. The documentation is misleading in saying
that "The table grows as needed, so n is just an initial guess",
thus suggesting that it shouldn't matter much what was given after
the table is filled.

The reality is that the initial_size establishes an approximate ratio
of bucket count to bucket size which persists for the whole life of
this table.

A table created by Hashtbl.create 10 and filled with 1000 entries is
approximately 10 times slower in element access than a table created
by Hashtbl.create 1000 with the same entries!

This is unfortunate when Hashtbl is wrapped in an API which
doesn't always provide initial_size. The wrapper doesn't know which
initial_size to set. It can't say: "let's make this table small
initially, because there might be thousands of such tables with 5
entries in each, but if thousands of entries are put into this table,
it shouldn't become much slower than it would be if it knew its size
from the beginning".

In my case equality and hashing is quite slow by nature and thus
having too big max_len or frequent resizing gives horrible performance.
OTOH always providing large initial_size wastes memory.

IMHO max_len shouldn't grow that much. Why does it grow at all when
the current size is not near Sys.max_array_length?

Anyway, besides considering including a tweaked hashtbl.ml, I seek for
a better general-purpose imperative dictionary which uses arbitrary
hashing and equality and behaves well when these functions are slow
and the size of a table is not known initially.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^
QRCZAK

-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Caml-list] Re: Hashtbl and max_len
  2001-11-18 14:55 [Caml-list] Hashtbl and max_len Marcin 'Qrczak' Kowalczyk
@ 2001-11-18 17:01 ` Marcin 'Qrczak' Kowalczyk
  2001-11-18 17:19   ` Marcin 'Qrczak' Kowalczyk
  2001-11-19 15:34 ` [Caml-list] " Xavier Leroy
  1 sibling, 1 reply; 4+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2001-11-18 17:01 UTC (permalink / raw)
  To: caml-list

Sun, 18 Nov 2001 14:55:17 +0000 (UTC), Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> pisze:

> The initial_size argument of Hashtbl.create has much bigger impact on
> efficiency than I thought. The documentation is misleading in saying
> that "The table grows as needed, so n is just an initial guess",
> thus suggesting that it shouldn't matter much what was given after
> the table is filled.

Ah, I see that the version in cvs has changed the policy of resizing
and it indeed works better when a good initial size is not known.
It has a bug though:

--- hashtbl.ml~	Sun Nov 18 17:44:57 2001
+++ hashtbl.ml	Sun Nov 18 17:57:42 2001
@@ -197,7 +197,7 @@
       let bucket = Cons(key, info, h.data.(i)) in
       h.data.(i) <- bucket;
       h.size <- succ h.size;
-      if h.size > Array.length h.data lsl 1 then resize hash h
+      if h.size > Array.length h.data lsl 1 then resize H.hash h
 
     let remove h key =
       let rec remove_bucket = function
@@ -255,7 +255,7 @@
       with Not_found ->
         h.data.(i) <- Cons(key, info, l);
         h.size <- succ h.size;
-        if h.size > Array.length h.data lsl 1 then resize hash h
+        if h.size > Array.length h.data lsl 1 then resize H.hash h
 
     let mem h key =
       let rec mem_in_bucket = function

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^
QRCZAK

-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [Caml-list] Re: Hashtbl and max_len
  2001-11-18 17:01 ` [Caml-list] " Marcin 'Qrczak' Kowalczyk
@ 2001-11-18 17:19   ` Marcin 'Qrczak' Kowalczyk
  0 siblings, 0 replies; 4+ messages in thread
From: Marcin 'Qrczak' Kowalczyk @ 2001-11-18 17:19 UTC (permalink / raw)
  To: caml-list

Sun, 18 Nov 2001 17:01:00 +0000 (UTC), Marcin 'Qrczak' Kowalczyk <qrczak@knm.org.pl> pisze:

> Ah, I see that the version in cvs has changed the policy of resizing
> and it indeed works better when a good initial size is not known.
> It has a bug though:

Which suggests that the functorial interface is rarely used. Maybe the
OCaml compiler should use it instead of generic Hashtbl when the key
is something like int? It would catch bugs like this by bootstrapping,
even if performance is good enough without it.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^
QRCZAK

-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Caml-list] Hashtbl and max_len
  2001-11-18 14:55 [Caml-list] Hashtbl and max_len Marcin 'Qrczak' Kowalczyk
  2001-11-18 17:01 ` [Caml-list] " Marcin 'Qrczak' Kowalczyk
@ 2001-11-19 15:34 ` Xavier Leroy
  1 sibling, 0 replies; 4+ messages in thread
From: Xavier Leroy @ 2001-11-19 15:34 UTC (permalink / raw)
  To: Marcin 'Qrczak' Kowalczyk; +Cc: caml-list

> The initial_size argument of Hashtbl.create has much bigger impact on
> efficiency than I thought.
> [...]
> IMHO max_len shouldn't grow that much. Why does it grow at all when
> the current size is not near Sys.max_array_length?

The original strategy was indeed to bound the length of buckets by a
constant, and resize when a bucket exceeds this length.  However, this
results in excessive resizing when the hashing causes collisions, or
(worse!) the same key is repeatedly inserted in the hash table.
Hence, as a stopgap, we decided to increase the maximal bucket length
at each resizing.

There are two problems with this strategy:
- as you noticed, this causes the initial size to impact performance
  significantly;
- checking for long buckets becomes linear in the size of the table,
  resulting in quadratic behavior for long sequences of insertions.

The latter problem was the last nail in the coffin of this
implementation.  In the working sources, we now keep track of the size
(total number of elements inserted in the table) and base resizing
decisions on this.  In case the hashing is good (all buckets have
approximately the same length), this results in constant length for
buckets.  If the hashing is not good, the performances degrade more
gracefully than with the old scheme.

- Xavier Leroy

PS.  Thanks for pointing out the typos in the working implementation.
The dangers of cut-and-paste programming...
-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2001-11-19 15:34 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-11-18 14:55 [Caml-list] Hashtbl and max_len Marcin 'Qrczak' Kowalczyk
2001-11-18 17:01 ` [Caml-list] " Marcin 'Qrczak' Kowalczyk
2001-11-18 17:19   ` Marcin 'Qrczak' Kowalczyk
2001-11-19 15:34 ` [Caml-list] " Xavier Leroy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).