caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
To: caml-list@inria.fr
Subject: [Caml-list] Hashtbl and max_len
Date: Sun, 18 Nov 2001 14:55:17 +0000 (UTC)	[thread overview]
Message-ID: <slrn9vfiul.rkm.qrczak@qrnik.zagroda> (raw)

let create initial_size =
  let s = if initial_size < 1 then 1 else initial_size in
  let s = if s > Sys.max_array_length then Sys.max_array_length else s in
  { max_len = 3; data = Array.make s Empty }

let resize hashfun tbl =
  let odata = tbl.data in
  let osize = Array.length odata in
  let nsize = min (2 * osize + 1) Sys.max_array_length in
  [...]
  tbl.max_len <- 2 * tbl.max_len

The initial_size argument of Hashtbl.create has much bigger impact on
efficiency than I thought. The documentation is misleading in saying
that "The table grows as needed, so n is just an initial guess",
thus suggesting that it shouldn't matter much what was given after
the table is filled.

The reality is that the initial_size establishes an approximate ratio
of bucket count to bucket size which persists for the whole life of
this table.

A table created by Hashtbl.create 10 and filled with 1000 entries is
approximately 10 times slower in element access than a table created
by Hashtbl.create 1000 with the same entries!

This is unfortunate when Hashtbl is wrapped in an API which
doesn't always provide initial_size. The wrapper doesn't know which
initial_size to set. It can't say: "let's make this table small
initially, because there might be thousands of such tables with 5
entries in each, but if thousands of entries are put into this table,
it shouldn't become much slower than it would be if it knew its size
from the beginning".

In my case equality and hashing is quite slow by nature and thus
having too big max_len or frequent resizing gives horrible performance.
OTOH always providing large initial_size wastes memory.

IMHO max_len shouldn't grow that much. Why does it grow at all when
the current size is not near Sys.max_array_length?

Anyway, besides considering including a tweaked hashtbl.ml, I seek for
a better general-purpose imperative dictionary which uses arbitrary
hashing and equality and behaves well when these functions are slow
and the size of a table is not known initially.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^
QRCZAK

-------------------
Bug reports: http://caml.inria.fr/bin/caml-bugs  FAQ: http://caml.inria.fr/FAQ/
To unsubscribe, mail caml-list-request@inria.fr  Archives: http://caml.inria.fr


             reply	other threads:[~2001-11-18 14:54 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2001-11-18 14:55 Marcin 'Qrczak' Kowalczyk [this message]
2001-11-18 17:01 ` [Caml-list] " Marcin 'Qrczak' Kowalczyk
2001-11-18 17:19   ` Marcin 'Qrczak' Kowalczyk
2001-11-19 15:34 ` [Caml-list] " Xavier Leroy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=slrn9vfiul.rkm.qrczak@qrnik.zagroda \
    --to=qrczak@knm.org.pl \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).