Dear Michael, 256 might be too small for your use case if you have big lists in your records. A way to workaround this limitation is to use the seeded_hash function to fold an hash accumulator through your record. That is, if your record has fields f1, ..., fn you might want to write let hash t = let x = seeded_hash (hash t.f1) (hash t.f2) in let x = seeded_hash x (hash t.f3) in ... seeded_hash x (hash t.fn) On 14 May 2016 00:46, "Soegtrop, Michael" wrote: > Dear Tom, > > thanks for the pointer to the manual. I googled, but couldn't find this > info - should have read the manual first. Yes, 10 is for sure not enough in > my case. I will try the hash function with a value adjusted to my case and > compare with mine. > > Best regards, > > Michael > > > -----Original Message----- > > From: Thomas Braibant [mailto:thomas.braibant@gmail.com] > > Sent: Friday, May 13, 2016 8:57 PM > > To: Soegtrop, Michael > > Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr > > Subject: Re: [Caml-list] Specify the default hash function for a type > > > > The OCaml hash function is not bad in itself (it's a slightly modified > version > > of Murmur 3, with a 32 bit state). There are a few difficult design > decisions > > that are made in the main hash loop which can impact users though. > > > > I wonder if the one that you are hitting is the fact that by default the > > polymorphic hash function only consider a (small) amount of meaningful > > value, in a breadth first manner. The following program is interesting > > > > ``` > > let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; > let r = ref 0 in > > while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do > incr r; > > done; !r;; > > - : int = 10 > > ``` > > > > The current default choice for the number of meaningful values is > described > > here: > > http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html > > and it is indeed 10. You could try to increase the number of meaningful > > nodes that you want to consider in your code, to the max of 256. > > > > There is really a balance to strike here between speed of hashing and > > collisions. > > > > Tom > > > > > > > > > > > > On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael > > wrote: > > > Dear Ocaml Users, > > > > > > a possibly interesting data point in the discussion on polymorphic > hashes: > > > > > > I replaced the polymorphic hash function for a record, which consists > of > > smallish integers, bools, lists of integers and short strings, with my > own > > hash function which looks like: > > > > > > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4 > > > > > > where vi are values and pi are distinct prime numbers. For lists I use > 4 > > primes alternatingly and 4 different primes for empty cases (implemented > > as 4 mutual recursive functions). The primes are random primes between > > 2^28 and 2^29, avoiding those primes where larger powers of 2 are close > to 0 > > modulo the prime. All arithmetic is done with plain ocaml ints and > > corresponding overflows. For strings I used the polymorphic hash > function. > > > > > > The hash histogram for this hash function looks like this: > > > > > > num_bindings=1803454 > > > num_buckets=16777216 > > > max_bucket_length=24 > > > (the histogram is "number of items in bucket=count of buckets with > > > this number of items") > > > 0=15356268 (empty buckets) > > > 1=1185329 (non-colliding buckets) > > > 2=135231 > > > 3=80194 > > > 4=11344 > > > 5=2367 > > > 6=2836 > > > 7=2630 > > > 8=244 > > > 9=49 > > > 10=20 > > > 11=25 > > > 12=64 > > > 13=39 > > > 14=100 > > > 15=75 > > > 16=33 > > > 17=5 > > > 18=149 > > > 19=15 > > > 20=3 > > > 21=180 > > > 22=14 > > > 24=2 > > > > > > This is roughly in line with random numbers: > > > > > > fill rate = 1803454/16777216 = 10.7% > > > 2 collisions = 0.8% ~ fill rate ^2 > > > 3 collisions = 0.48% ~ 4 * fill rate ^3 > > > 4 collisions = 0.068% ~ 5 * fill rate ^4 > > > > > > The hash histogram for the same data and the polymorphic hash function > > looks like this: > > > > > > num_bindings=1803454 > > > num_buckets=16777216 > > > max_bucket_length=3343 > > > 0=16730926 > > > 1=5520 > > > 2=3201 > > > 3=2779 > > > 4=1633 > > > 5=1079 > > > 6=3701 > > > 7=672 > > > 8=828 > > > 9=1584 > > > 10=600 > > > 11=384 > > > 12=2774 > > > 13=404 > > > 14=525 > > > 15=500 > > > 16=435 > > > 17=358 > > > 18=1406 > > > 19=244 > > > 20=504 > > > 21=427 > > > 22=316 > > > 23=369 > > > 24=837 > > > 25=153 > > > 26=250 > > > 27=417 > > > 28=191 > > > 29=222 > > > 30=501 > > > 31=76 > > > 32=196 > > > 33=530 > > > 34=142 > > > 35=153 > > > 36=859 > > > 37=88 > > > 38=178 > > > 39=310 > > > 40=147 > > > 41=173 > > > 42=313 > > > 43=102 > > > 44=235 > > > 45=123 > > > 46=149 > > > 47=152 > > > 48=244 > > > 49=107 > > > 50=173 > > > : > > > : > > > 1001=1 > > > 1002=1 > > > 1005=1 > > > 1006=3 > > > 1009=1 > > > 1010=1 > > > 1013=1 > > > 1014=2 > > > 1019=1 > > > 1020=1 > > > 1029=1 > > > 1031=1 > > > 1034=1 > > > 1035=1 > > > 1036=1 > > > 1064=1 > > > 1082=1 > > > 1184=1 > > > 1191=1 > > > 1196=1 > > > 1197=1 > > > 1200=1 > > > 1206=1 > > > 1207=1 > > > 1219=1 > > > 1225=1 > > > 1228=1 > > > 1232=1 > > > 1566=1 > > > 1581=1 > > > 1636=1 > > > 1698=1 > > > 1720=2 > > > 1737=2 > > > 1740=1 > > > 1744=1 > > > 1752=1 > > > 1762=1 > > > 1789=1 > > > 2255=1 > > > 2271=1 > > > 2287=1 > > > 2319=1 > > > 2332=1 > > > 2345=1 > > > 3296=1 > > > 3325=1 > > > 3343=1 > > > > > > So there are many buckets with 1000s of collisions and only a few 1000 > > collision free buckets. > > > > > > It might make sense to come up with a better polymorphic hashing > > function. > > > > > > Best regards, > > > > > > Michael > > > Intel Deutschland GmbH > > > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > > > Tel: +49 89 99 8853-0, www.intel.de > > > Managing Directors: Christin Eisenschmid, Christian Lamprechter > > > Chairperson of the Supervisory Board: Nicole Lau Registered Office: > > > Munich Commercial Register: Amtsgericht Muenchen HRB 186928 > > > > > > > > > -- > > > Caml-list mailing list. Subscription management and archives: > > > https://sympa.inria.fr/sympa/arc/caml-list > > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > > > Bug reports: http://caml.inria.fr/bin/caml-bugs > Intel Deutschland GmbH > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > Tel: +49 89 99 8853-0, www.intel.de > Managing Directors: Christin Eisenschmid, Christian Lamprechter > Chairperson of the Supervisory Board: Nicole Lau > Registered Office: Munich > Commercial Register: Amtsgericht Muenchen HRB 186928 >