Dear OCaml users,

 

to close this topic, here are some statistics for my use case with different hash functions. All hash methods I tried lead to identical statistics as random numbers. Simply using Hashtbl.hash on lists of hashes of substructures (Method 2 below) worked very well, is easy to write down and offers sufficient flexibility.

 

NItems   =  3277598

NBuckets = 16777216

 

        | Theoretical | Method1  |  Method2 |  Method3

--------+-------------+----------+----------+---------

Empty   |  13799905.0 | 13800062 | 13799864 | 13801452

Single  |   2695950.5 |  2695830 |  2696131 |  2693130

2 items |    263340.5 |   263101 |   263042 |   264318

3 items |     17148.7 |    17350 |    17368 |    17463

4 items |       837.5 |      849 |      779 |      822

5 items |        32.7 |       24 |       29 |       31

6 items |         1.1 |        0 |        3 |        3

--------+-------------+----------+----------+---------

Runtime |             |     119s |     122s |     110s

 

Note 1: the expected random deviation of the counts is sqrt(count).

 

Note 2: The speed differences between Method 1 and 2 are random – Method 2 should be faster.

 

Note 3: using Hashtbl.hash on the complete structure doesn't work for me for two reasons:

- parts of the structure should be ignored

- I have maps as substructures and require that logically equal maps result in the same hash

 

Theoretical:

collision numbers for a random process with same parameters (used exact Bernoulli statistics, not Poisson approximation)

 

Method1:

Hashtbl.hash used on structures recursively (create list of hash values for substructures and use Hashtbl.hash on list).

Maps (<10 elements) are converted to lists and the list is hashed with Hashtbl.hash.

Multiply each hash list entry with a distinct prime.

Use prime values for variants and prime multipliers for variant arguments.

 

Method2:

Hashtbl.hash used on structures recursively (create list of hash values for substructures and use Hashtbl.hash on list).

Maps (<10 elements) are converted to lists and the list is hashed with Hashtbl.hash.

Use prime values for variants and prime multipliers for variant arguments.

 

Method3:

( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4   

where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases

 

Btw.: I added a function “worst_bucketlist” to Hashtbl, which returns the keys of the largest bucket. This is useful for optimizing (or bug fixing) hash and equality functions – try to make a feature request for it.

 

Best regards,

 

Michael

Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928