caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Hashtbl performance
@ 2011-07-27 21:07 Toby Kelsey
  2011-07-28  9:48 ` Fabrice Le Fessant
  0 siblings, 1 reply; 5+ messages in thread
From: Toby Kelsey @ 2011-07-27 21:07 UTC (permalink / raw)
  To: caml-list

I have written two versions of a small program, one using an (int list) data structure and the other an ((int*int) list); and I tested using Hashtbl, Set and (Jean-Christophe Filliatre's) Trie to cache these elements in each version.
The relative run times of the programs turns out to be:

                 Hashtbl  Set  Trie
(int list)         1      2.1  2.0
((int*int) list)  34.7    2.6  2.1


There is a slight inherent speed difference between the 2 versions but the major effect is the cache type (in fact the caching difference must be larger than these ratios due to non-cache computations). I expected Hashtbl might be a bit slower than more specialised data structures, but the large speed difference with different data structures was unexpected. Presumably the slow-down is due to excessive hash collisions.

I had expected that the generic Hashtbl would be written to give adequate speed for all types of data and when I look at OCaml code on the web Hashtbl is usually used, so it seems most OCaml programmers believe the standard Hashtbl is a reasonable choice for most data types as well.

I haven't tested the Batteries or OCaml-Core hash tables so these may be more consistent, but if not, my question is can you predict how well Hashtbl will work for different types of data and so what to use it with, or it is just not reliable enough for general-purpose caching/memoization?

If anyone wants to look at the code, the (int list) Hashtbl version is at <http://rosettacode.org/wiki/Sokoban#OCaml> and the other versions are in <http://pastebin.com/hGn1AL9L> temporarily. Apart from the caching and the int/int pair changes they should be identical.

Toby

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] Hashtbl performance
  2011-07-27 21:07 [Caml-list] Hashtbl performance Toby Kelsey
@ 2011-07-28  9:48 ` Fabrice Le Fessant
  2011-07-28 12:25   ` barnier
  0 siblings, 1 reply; 5+ messages in thread
From: Fabrice Le Fessant @ 2011-07-28  9:48 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 2518 bytes --]

I think it is a well known bug of the hash function:

let list = ref [];;
for i = 1 to 30 do
  list := i :: !list;
  Printf.printf "%d " (Hashtbl.hash !list)
done;;

1 65601 8392706 797307010 578301955 797307010 8392706 65601 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

As you can see, long lists all have the same hash value. It is due to
the fact that the hash function does a depth-first traversal of the
datastructure, starting with the last field, instead of a breadth-first
traversal, and, as it only visits 10 nodes, they are always the same
ones for all lists of length > 10.

Thus, you should consider using your own hash function, probably only
considering the ints in the list and its length. Then, use the functor
in the Hashtbl module.

--Fabrice

On 07/27/2011 11:07 PM, Toby Kelsey wrote:
> I have written two versions of a small program, one using an (int list) data structure and the other an ((int*int) list); and I tested using Hashtbl, Set and (Jean-Christophe Filliatre's) Trie to cache these elements in each version.
> The relative run times of the programs turns out to be:
> 
>                  Hashtbl  Set  Trie
> (int list)         1      2.1  2.0
> ((int*int) list)  34.7    2.6  2.1
> 
> 
> There is a slight inherent speed difference between the 2 versions but the major effect is the cache type (in fact the caching difference must be larger than these ratios due to non-cache computations). I expected Hashtbl might be a bit slower than more specialised data structures, but the large speed difference with different data structures was unexpected. Presumably the slow-down is due to excessive hash collisions.
> 
> I had expected that the generic Hashtbl would be written to give adequate speed for all types of data and when I look at OCaml code on the web Hashtbl is usually used, so it seems most OCaml programmers believe the standard Hashtbl is a reasonable choice for most data types as well.
> 
> I haven't tested the Batteries or OCaml-Core hash tables so these may be more consistent, but if not, my question is can you predict how well Hashtbl will work for different types of data and so what to use it with, or it is just not reliable enough for general-purpose caching/memoization?
> 
> If anyone wants to look at the code, the (int list) Hashtbl version is at <http://rosettacode.org/wiki/Sokoban#OCaml> and the other versions are in <http://pastebin.com/hGn1AL9L> temporarily. Apart from the caching and the int/int pair changes they should be identical.
> 
> Toby
> 

[-- Attachment #2: fabrice_le_fessant.vcf --]
[-- Type: text/x-vcard, Size: 380 bytes --]

begin:vcard
fn:Fabrice LE FESSANT
n:LE FESSANT;Fabrice
org:INRIA Saclay -- Ile-de-France;P2P & OCaml
adr;quoted-printable:;;Parc Orsay Universit=C3=A9 ;Orsay CEDEX;;91893;France
email;internet:fabrice.le_fessant@inria.fr
title;quoted-printable:Charg=C3=A9 de Recherche
tel;work:+33 1 74 85 42 14
tel;fax:+33 1 74 85 42 49 
url:http://fabrice.lefessant.net/
version:2.1
end:vcard


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] Hashtbl performance
  2011-07-28  9:48 ` Fabrice Le Fessant
@ 2011-07-28 12:25   ` barnier
  2011-07-28 14:18     ` Xavier Leroy
  0 siblings, 1 reply; 5+ messages in thread
From: barnier @ 2011-07-28 12:25 UTC (permalink / raw)
  To: caml-list

Quoting Fabrice Le Fessant <Fabrice.Le_fessant@inria.fr>:

> Thus, you should consider using your own hash function, probably only
> considering the ints in the list and its length. Then, use the functor
> in the Hashtbl module.

You could also use the hash_param : int -> int -> 'a -> int
function in the functor which allows to specify the number
of nodes traversed to compute the key :

# let list = ref [];;
   for i = 1 to 30 do
   list := i :: !list;
   Printf.printf "%d " (Hashtbl.hash_param 50 50 !list)
done  ;;
1 65601 8392706 797307010 578301955 731304131 182476804 222011652  
582000645 696017221 383840262 291574150 409554951 301052359 474071048  
875971080 459423753 1026998857 314757130 771437194 56324107 59478731  
841228300 921690892 921690892 841228300 59478731 56324107 771437194  
314757130 - : unit = ()

-- Nicolas Barnier


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] Hashtbl performance
  2011-07-28 12:25   ` barnier
@ 2011-07-28 14:18     ` Xavier Leroy
  2011-07-28 16:29       ` Toby Kelsey
  0 siblings, 1 reply; 5+ messages in thread
From: Xavier Leroy @ 2011-07-28 14:18 UTC (permalink / raw)
  To: caml-list

Fabrice Le Fessant <Fabrice.Le_fessant@inria.fr>:

> Thus, you should consider using your own hash function, probably only
> considering the ints in the list and its length. Then, use the functor
> in the Hashtbl module.

On 07/28/2011 02:25 PM, barnier@recherche.enac.fr wrote:

> You could also use the hash_param : int -> int -> 'a -> int
> function in the functor which allows to specify the number
> of nodes traversed to compute the key :

Both Fabrice's and Nicolas's suggestions are excellent.

Let me just add that this problem with lists as hashtable keys is one
of the known issues with OCaml's current generic hash function: the
other is that the mixing functions used are simplistic and exhibit
some statistical bias, even for strings.

The SVN trunk for OCaml contains a complete reimplementation of the
generic hash function that addresses both issues: lists and other
complex keys are traversed breadth-first in a more cautious manner
than before, and the mixing functions are based on MurmurHash 3, which
exhibits very good statistical properties.  All this should go in the
next major release 3.13.  So, everyone with an interest in efficient
hash tables is welcome to try the trunk sources and let me know of any
problems encountered.

- Xavier Leroy

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] Hashtbl performance
  2011-07-28 14:18     ` Xavier Leroy
@ 2011-07-28 16:29       ` Toby Kelsey
  0 siblings, 0 replies; 5+ messages in thread
From: Toby Kelsey @ 2011-07-28 16:29 UTC (permalink / raw)
  To: caml-list

On 28/07/11 15:18, Xavier Leroy wrote:
> Fabrice Le Fessant <Fabrice.Le_fessant@inria.fr>:
> 
>> Thus, you should consider using your own hash function, probably only
>> considering the ints in the list and its length. Then, use the functor
>> in the Hashtbl module.
> 
> On 07/28/2011 02:25 PM, barnier@recherche.enac.fr wrote:
> 
>> You could also use the hash_param : int -> int -> 'a -> int
>> function in the functor which allows to specify the number
>> of nodes traversed to compute the key :
> 
> Both Fabrice's and Nicolas's suggestions are excellent.
> 
> Let me just add that this problem with lists as hashtable keys is one
> of the known issues with OCaml's current generic hash function: the
> other is that the mixing functions used are simplistic and exhibit
> some statistical bias, even for strings.

I will try both suggestions and look at the SVN trunk. Let me just add that the lists being hashed are all of length 5, so the length should not be the issue.
Hopefully 3.13 will fix this problem.

Thanks everyone for the useful suggestions,
Toby


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-07-28 16:30 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-27 21:07 [Caml-list] Hashtbl performance Toby Kelsey
2011-07-28  9:48 ` Fabrice Le Fessant
2011-07-28 12:25   ` barnier
2011-07-28 14:18     ` Xavier Leroy
2011-07-28 16:29       ` Toby Kelsey

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).