caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Thomas Braibant <thomas.braibant@gmail.com>
To: "Soegtrop, Michael" <michael.soegtrop@intel.com>
Cc: Alain Frisch <alain.frisch@lexifi.com>,
	Pierre Chambart <pierre.chambart@ocamlpro.com>,
	 "caml-list@inria.fr" <caml-list@inria.fr>
Subject: Re: [Caml-list] Specify the default hash function for a type
Date: Fri, 13 May 2016 20:57:29 +0200	[thread overview]
Message-ID: <CAHR=Vkz0VQWMw6inxUv-ootPY6Qjr9nj3kfj9FQZaT8bPFQ7Xg@mail.gmail.com> (raw)
In-Reply-To: <0F7D3B1B3C4B894D824F5B822E3E5A172CEF2D2F@IRSMSX102.ger.corp.intel.com>

The OCaml hash function is not bad in itself (it's a slightly modified
version of Murmur 3, with a 32 bit state). There are a few difficult
design decisions that are made in the main hash loop which can impact
users though.

I wonder if the one that you are hitting is the fact that by default
the polymorphic hash function only consider a (small) amount of
meaningful value, in a breadth first manner. The following program is
interesting

```
let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);;
let r = ref 0 in while Hashtbl.hash (make !r []) <> Hashtbl.hash (make
(!r + 1) []) do incr r; done; !r;;
- : int = 10
```

The current default choice for the number of meaningful values is
described here:
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
and it is indeed 10. You could try to increase the number of
meaningful nodes that you want to consider in your code, to the max of
256.

There is really a balance to strike here between speed of hashing and
collisions.

Tom





On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael
<michael.soegtrop@intel.com> wrote:
> Dear Ocaml Users,
>
> a possibly interesting data point in the discussion on polymorphic hashes:
>
> I replaced the polymorphic hash function for a record, which consists of smallish integers, bools, lists of integers and short strings, with my own hash function which looks like:
>
> ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
>
> where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases (implemented as 4 mutual recursive functions). The primes are random primes between 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 modulo the prime. All arithmetic is done with plain ocaml ints and corresponding overflows. For strings I used the polymorphic hash function.
>
> The hash histogram for this hash function looks like this:
>
> num_bindings=1803454
> num_buckets=16777216
> max_bucket_length=24
> (the histogram is "number of items in bucket=count of buckets with this number of items")
> 0=15356268 (empty buckets)
> 1=1185329 (non-colliding buckets)
> 2=135231
> 3=80194
> 4=11344
> 5=2367
> 6=2836
> 7=2630
> 8=244
> 9=49
> 10=20
> 11=25
> 12=64
> 13=39
> 14=100
> 15=75
> 16=33
> 17=5
> 18=149
> 19=15
> 20=3
> 21=180
> 22=14
> 24=2
>
> This is roughly in line with random numbers:
>
> fill rate = 1803454/16777216 = 10.7%
> 2 collisions = 0.8% ~ fill rate ^2
> 3 collisions = 0.48% ~ 4 * fill rate ^3
> 4 collisions = 0.068% ~ 5 * fill rate ^4
>
> The hash histogram for the same data and the polymorphic hash function looks like this:
>
> num_bindings=1803454
> num_buckets=16777216
> max_bucket_length=3343
> 0=16730926
> 1=5520
> 2=3201
> 3=2779
> 4=1633
> 5=1079
> 6=3701
> 7=672
> 8=828
> 9=1584
> 10=600
> 11=384
> 12=2774
> 13=404
> 14=525
> 15=500
> 16=435
> 17=358
> 18=1406
> 19=244
> 20=504
> 21=427
> 22=316
> 23=369
> 24=837
> 25=153
> 26=250
> 27=417
> 28=191
> 29=222
> 30=501
> 31=76
> 32=196
> 33=530
> 34=142
> 35=153
> 36=859
> 37=88
> 38=178
> 39=310
> 40=147
> 41=173
> 42=313
> 43=102
> 44=235
> 45=123
> 46=149
> 47=152
> 48=244
> 49=107
> 50=173
> :
> :
> 1001=1
> 1002=1
> 1005=1
> 1006=3
> 1009=1
> 1010=1
> 1013=1
> 1014=2
> 1019=1
> 1020=1
> 1029=1
> 1031=1
> 1034=1
> 1035=1
> 1036=1
> 1064=1
> 1082=1
> 1184=1
> 1191=1
> 1196=1
> 1197=1
> 1200=1
> 1206=1
> 1207=1
> 1219=1
> 1225=1
> 1228=1
> 1232=1
> 1566=1
> 1581=1
> 1636=1
> 1698=1
> 1720=2
> 1737=2
> 1740=1
> 1744=1
> 1752=1
> 1762=1
> 1789=1
> 2255=1
> 2271=1
> 2287=1
> 2319=1
> 2332=1
> 2345=1
> 3296=1
> 3325=1
> 3343=1
>
> So there are many buckets with 1000s of collisions and only a few 1000 collision free buckets.
>
> It might make sense to come up with a better polymorphic hashing function.
>
> Best regards,
>
> Michael
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Christin Eisenschmid, Christian Lamprechter
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928
>
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs

  reply	other threads:[~2016-05-13 18:57 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-13  7:52 Soegtrop, Michael
2016-05-13  8:13 ` Ben Millwood
2016-05-13  8:40   ` Soegtrop, Michael
2016-05-13  9:06 ` Alain Frisch
2016-05-13  9:26   ` Soegtrop, Michael
2016-05-13 12:01     ` Gabriel Scherer
2016-05-13 12:23       ` Alain Frisch
2016-05-13 12:32       ` Soegtrop, Michael
2016-05-13 13:50   ` Pierre Chambart
2016-05-13 13:56     ` Alain Frisch
2016-05-13 16:17       ` Soegtrop, Michael
2016-05-13 18:57         ` Thomas Braibant [this message]
2016-05-13 22:45           ` Soegtrop, Michael
2016-05-14  8:41             ` Thomas Braibant
2016-05-14  9:06               ` Soegtrop, Michael
2016-05-17 13:03                 ` Soegtrop, Michael

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHR=Vkz0VQWMw6inxUv-ootPY6Qjr9nj3kfj9FQZaT8bPFQ7Xg@mail.gmail.com' \
    --to=thomas.braibant@gmail.com \
    --cc=alain.frisch@lexifi.com \
    --cc=caml-list@inria.fr \
    --cc=michael.soegtrop@intel.com \
    --cc=pierre.chambart@ocamlpro.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).