my lists are typically 3..6 elements. I think in total 60 or so might be fine. I will give it a try.

Although, after I have written the hash and equality functions for the Hashtbl module interfce, I make use of the fact that I can easily deliberately ignore certain things – the starting point of this thread.

Maybe for larger structures it is just the right thing to do to write your own hash function. I wonder what would be an elegant way to do this? Collect the default-hash function values of the pieces in a list and compute the hash of this list?

Best regards,

Michael

From: Thomas Braibant [mailto:thomas.braibant@gmail.com]
Sent: Saturday, May 14, 2016 10:41 AM
To: Soegtrop, Michael <michael.soegtrop@intel.com>
Cc: Alain Frisch <alain.frisch@lexifi.com>; OCaML Mailing List <caml-list@inria.fr>; Pierre Chambart <pierre.chambart@ocamlpro.com>
Subject: RE: [Caml-list] Specify the default hash function for a type

Dear Michael,

256 might be too small for your use case if you have big lists in your records. A way to workaround this limitation is to use the seeded_hash function to fold an hash accumulator through your record.

That is, if your record has fields f1, ..., fn
you might want to write

let hash t =
   let x = seeded_hash (hash t.f1) (hash t.f2) in
   let x = seeded_hash x (hash t.f3) in
   ...
   seeded_hash x (hash t.fn)

On 14 May 2016 00:46, "Soegtrop, Michael" <michael.soegtrop@intel.com> wrote:

Dear Tom,

thanks for the pointer to the manual. I googled, but couldn't find this info - should have read the manual first. Yes, 10 is for sure not enough in my case. I will try the hash function with a value adjusted to my case and compare with mine.

Best regards,

Michael

> -----Original Message-----
> From: Thomas Braibant [mailto:thomas.braibant@gmail.com]
> Sent: Friday, May 13, 2016 8:57 PM
> To: Soegtrop, Michael
> Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr
> Subject: Re: [Caml-list] Specify the default hash function for a type
>
> The OCaml hash function is not bad in itself (it's a slightly modified version
> of Murmur 3, with a 32 bit state). There are a few difficult design decisions
> that are made in the main hash loop which can impact users though.
>
> I wonder if the one that you are hitting is the fact that by default the
> polymorphic hash function only consider a (small) amount of meaningful
> value, in a breadth first manner. The following program is interesting
>
> ```
> let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; let r = ref 0 in
> while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do incr r;
> done; !r;;
> - : int = 10
> ```
>
> The current default choice for the number of meaningful values is described
> here:
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
> and it is indeed 10. You could try to increase the number of meaningful
> nodes that you want to consider in your code, to the max of 256.
>
> There is really a balance to strike here between speed of hashing and
> collisions.
>
> Tom
>
>
>
>
>
> On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael
> <michael.soegtrop@intel.com> wrote:
> > Dear Ocaml Users,
> >
> > a possibly interesting data point in the discussion on polymorphic hashes:
> >
> > I replaced the polymorphic hash function for a record, which consists of
> smallish integers, bools, lists of integers and short strings, with my own
> hash function which looks like:
> >
> > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
> >
> > where vi are values and pi are distinct prime numbers. For lists I use 4
> primes alternatingly and 4 different primes for empty cases (implemented
> as 4 mutual recursive functions). The primes are random primes between
> 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0
> modulo the prime. All arithmetic is done with plain ocaml ints and
> corresponding overflows. For strings I used the polymorphic hash function.
> >
> > The hash histogram for this hash function looks like this:
> >
> > num_bindings=1803454
> > num_buckets=16777216
> > max_bucket_length=24
> > (the histogram is "number of items in bucket=count of buckets with
> > this number of items")
> > 0=15356268 (empty buckets)
> > 1=1185329 (non-colliding buckets)
> > 2=135231
> > 3=80194
> > 4=11344
> > 5=2367
> > 6=2836
> > 7=2630
> > 8=244
> > 9=49
> > 10=20
> > 11=25
> > 12=64
> > 13=39
> > 14=100
> > 15=75
> > 16=33
> > 17=5
> > 18=149
> > 19=15
> > 20=3
> > 21=180
> > 22=14
> > 24=2
> >
> > This is roughly in line with random numbers:
> >
> > fill rate = 1803454/16777216 = 10.7%
> > 2 collisions = 0.8% ~ fill rate ^2
> > 3 collisions = 0.48% ~ 4 * fill rate ^3
> > 4 collisions = 0.068% ~ 5 * fill rate ^4
> >
> > The hash histogram for the same data and the polymorphic hash function
> looks like this:
> >
> > num_bindings=1803454
> > num_buckets=16777216
> > max_bucket_length=3343
> > 0=16730926
> > 1=5520
> > 2=3201
> > 3=2779
> > 4=1633
> > 5=1079
> > 6=3701
> > 7=672
> > 8=828
> > 9=1584
> > 10=600
> > 11=384
> > 12=2774
> > 13=404
> > 14=525
> > 15=500
> > 16=435
> > 17=358
> > 18=1406
> > 19=244
> > 20=504
> > 21=427
> > 22=316
> > 23=369
> > 24=837
> > 25=153
> > 26=250
> > 27=417
> > 28=191
> > 29=222
> > 30=501
> > 31=76
> > 32=196
> > 33=530
> > 34=142
> > 35=153
> > 36=859
> > 37=88
> > 38=178
> > 39=310
> > 40=147
> > 41=173
> > 42=313
> > 43=102
> > 44=235
> > 45=123
> > 46=149
> > 47=152
> > 48=244
> > 49=107
> > 50=173
> > :
> > :
> > 1001=1
> > 1002=1
> > 1005=1
> > 1006=3
> > 1009=1
> > 1010=1
> > 1013=1
> > 1014=2
> > 1019=1
> > 1020=1
> > 1029=1
> > 1031=1
> > 1034=1
> > 1035=1
> > 1036=1
> > 1064=1
> > 1082=1
> > 1184=1
> > 1191=1
> > 1196=1
> > 1197=1
> > 1200=1
> > 1206=1
> > 1207=1
> > 1219=1
> > 1225=1
> > 1228=1
> > 1232=1
> > 1566=1
> > 1581=1
> > 1636=1
> > 1698=1
> > 1720=2
> > 1737=2
> > 1740=1
> > 1744=1
> > 1752=1
> > 1762=1
> > 1789=1
> > 2255=1
> > 2271=1
> > 2287=1
> > 2319=1
> > 2332=1
> > 2345=1
> > 3296=1
> > 3325=1
> > 3343=1
> >
> > So there are many buckets with 1000s of collisions and only a few 1000
> collision free buckets.
> >
> > It might make sense to come up with a better polymorphic hashing
> function.
> >
> > Best regards,
> >
> > Michael
> > Intel Deutschland GmbH
> > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> > Tel: +49 89 99 8853-0, www.intel.de
> > Managing Directors: Christin Eisenschmid, Christian Lamprechter
> > Chairperson of the Supervisory Board: Nicole Lau Registered Office:
> > Munich Commercial Register: Amtsgericht Muenchen HRB 186928
> >
> >
> > --
> > Caml-list mailing list. Subscription management and archives:
> > https://sympa.inria.fr/sympa/arc/caml-list
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928