From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id PAA17722; Sun, 18 Nov 2001 15:54:35 +0100 (MET) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id PAA17547 for ; Sun, 18 Nov 2001 15:54:34 +0100 (MET) Received: from mail.mimuw.edu.pl (paf87.warszawa.sdi.tpnet.pl [217.96.225.87]) by nez-perce.inria.fr (8.11.1/8.10.0) with ESMTP id fAIEsT900583 for ; Sun, 18 Nov 2001 15:54:30 +0100 (MET) Received: (from news@localhost) by mail.mimuw.edu.pl (PLD/8.9.3) id PAA28918 for caml-list@inria.fr; Sun, 18 Nov 2001 15:55:21 +0100 X-Authentication-Warning: qrnik.zagroda: news set sender to Marcin 'Qrczak' Kowalczyk using -f From: "Marcin 'Qrczak' Kowalczyk" Subject: [Caml-list] Hashtbl and max_len Date: Sun, 18 Nov 2001 14:55:17 +0000 (UTC) Organization: Klub Nieszkodliwych =?iso-8859-2?Q?Manjak=F3w?= Message-ID: X-Trace: qrnik.zagroda 1006095317 28311 192.168.0.1 (18 Nov 2001 14:55:17 GMT) X-Complaints-To: abuse@localhost NNTP-Posting-Date: Sun, 18 Nov 2001 14:55:17 +0000 (UTC) User-Agent: slrn/0.9.7.2 (Linux) To: caml-list@inria.fr Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk let create initial_size = let s = if initial_size < 1 then 1 else initial_size in let s = if s > Sys.max_array_length then Sys.max_array_length else s in { max_len = 3; data = Array.make s Empty } let resize hashfun tbl = let odata = tbl.data in let osize = Array.length odata in let nsize = min (2 * osize + 1) Sys.max_array_length in [...] tbl.max_len <- 2 * tbl.max_len The initial_size argument of Hashtbl.create has much bigger impact on efficiency than I thought. The documentation is misleading in saying that "The table grows as needed, so n is just an initial guess", thus suggesting that it shouldn't matter much what was given after the table is filled. The reality is that the initial_size establishes an approximate ratio of bucket count to bucket size which persists for the whole life of this table. A table created by Hashtbl.create 10 and filled with 1000 entries is approximately 10 times slower in element access than a table created by Hashtbl.create 1000 with the same entries! This is unfortunate when Hashtbl is wrapped in an API which doesn't always provide initial_size. The wrapper doesn't know which initial_size to set. It can't say: "let's make this table small initially, because there might be thousands of such tables with 5 entries in each, but if thousands of entries are put into this table, it shouldn't become much slower than it would be if it knew its size from the beginning". In my case equality and hashing is quite slow by nature and thus having too big max_len or frequent resizing gives horrible performance. OTOH always providing large initial_size wastes memory. IMHO max_len shouldn't grow that much. Why does it grow at all when the current size is not near Sys.max_array_length? Anyway, besides considering including a tweaked hashtbl.ml, I seek for a better general-purpose imperative dictionary which uses arbitrary hashing and equality and behaves well when these functions are slow and the size of a table is not known initially. -- __("< Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/ \__/ ^^ QRCZAK ------------------- Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr