Dear OCaml users,
to close this topic, here are some statistics for my use case with different hash functions. All hash methods I tried lead to identical statistics as random
numbers. Simply using Hashtbl.hash on lists of hashes of substructures (Method 2 below) worked very well, is easy to write down and offers sufficient flexibility.
NItems = 3277598
NBuckets = 16777216
| Theoretical | Method1 | Method2 | Method3
--------+-------------+----------+----------+---------
Empty | 13799905.0 | 13800062 | 13799864 | 13801452
Single | 2695950.5 | 2695830 | 2696131 | 2693130
2 items | 263340.5 | 263101 | 263042 | 264318
3 items | 17148.7 | 17350 | 17368 | 17463
4 items | 837.5 | 849 | 779 | 822
5 items | 32.7 | 24 | 29 | 31
6 items | 1.1 | 0 | 3 | 3
--------+-------------+----------+----------+---------
Runtime | | 119s | 122s | 110s
Note 1: the expected random deviation of the counts is sqrt(count).
Note 2: The speed differences between Method 1 and 2 are random – Method 2 should be faster.
Note 3: using Hashtbl.hash on the complete structure doesn't work for me for two reasons:
- parts of the structure should be ignored
- I have maps as substructures and require that logically equal maps result in the same hash
Theoretical:
collision numbers for a random process with same parameters (used exact Bernoulli statistics, not Poisson approximation)
Method1:
Hashtbl.hash used on structures recursively (create list of hash values for substructures and use Hashtbl.hash on list).
Maps (<10 elements) are converted to lists and the list is hashed with Hashtbl.hash.
Multiply each hash list entry with a distinct prime.
Use prime values for variants and prime multipliers for variant arguments.
Method2:
Hashtbl.hash used on structures recursively (create list of hash values for substructures and use Hashtbl.hash on list).
Maps (<10 elements) are converted to lists and the list is hashed with Hashtbl.hash.
Use prime values for variants and prime multipliers for variant arguments.
Method3:
( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases
Btw.: I added a function “worst_bucketlist” to Hashtbl, which returns the keys of the largest bucket. This is useful for optimizing (or bug fixing) hash and
equality functions – try to make a feature request for it.
Best regards,
Michael
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928