caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: "Soegtrop, Michael" <michael.soegtrop@intel.com>
To: Alain Frisch <alain.frisch@lexifi.com>,
	Pierre Chambart <pierre.chambart@ocamlpro.com>,
	"caml-list@inria.fr" <caml-list@inria.fr>
Subject: RE: [Caml-list] Specify the default hash function for a type
Date: Fri, 13 May 2016 16:17:47 +0000	[thread overview]
Message-ID: <0F7D3B1B3C4B894D824F5B822E3E5A172CEF2D2F@IRSMSX102.ger.corp.intel.com> (raw)
In-Reply-To: <44df5d5e-8ad9-1a86-7fda-bfc203bfb479@lexifi.com>

Dear Ocaml Users,

a possibly interesting data point in the discussion on polymorphic hashes:

I replaced the polymorphic hash function for a record, which consists of smallish integers, bools, lists of integers and short strings, with my own hash function which looks like:

( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4    

where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases (implemented as 4 mutual recursive functions). The primes are random primes between 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 modulo the prime. All arithmetic is done with plain ocaml ints and corresponding overflows. For strings I used the polymorphic hash function.

The hash histogram for this hash function looks like this:

num_bindings=1803454
num_buckets=16777216
max_bucket_length=24 
(the histogram is "number of items in bucket=count of buckets with this number of items")
0=15356268 (empty buckets)
1=1185329 (non-colliding buckets)
2=135231
3=80194
4=11344
5=2367
6=2836
7=2630
8=244
9=49
10=20
11=25
12=64
13=39 
14=100 
15=75 
16=33 
17=5 
18=149 
19=15 
20=3 
21=180 
22=14 
24=2

This is roughly in line with random numbers:

fill rate = 1803454/16777216 = 10.7%
2 collisions = 0.8% ~ fill rate ^2
3 collisions = 0.48% ~ 4 * fill rate ^3
4 collisions = 0.068% ~ 5 * fill rate ^4

The hash histogram for the same data and the polymorphic hash function looks like this:

num_bindings=1803454
num_buckets=16777216
max_bucket_length=3343
0=16730926
1=5520
2=3201
3=2779
4=1633
5=1079
6=3701
7=672
8=828
9=1584
10=600
11=384
12=2774
13=404
14=525
15=500
16=435
17=358
18=1406
19=244
20=504
21=427
22=316
23=369
24=837
25=153
26=250
27=417
28=191
29=222
30=501
31=76
32=196
33=530
34=142
35=153
36=859
37=88
38=178
39=310
40=147
41=173
42=313
43=102
44=235
45=123
46=149
47=152
48=244
49=107
50=173
:
:
1001=1
1002=1
1005=1
1006=3
1009=1
1010=1
1013=1
1014=2
1019=1
1020=1
1029=1
1031=1
1034=1
1035=1
1036=1
1064=1
1082=1
1184=1
1191=1
1196=1
1197=1
1200=1
1206=1
1207=1
1219=1
1225=1
1228=1
1232=1
1566=1
1581=1
1636=1
1698=1
1720=2
1737=2
1740=1
1744=1
1752=1
1762=1
1789=1
2255=1
2271=1
2287=1
2319=1
2332=1
2345=1
3296=1
3325=1
3343=1

So there are many buckets with 1000s of collisions and only a few 1000 collision free buckets.

It might make sense to come up with a better polymorphic hashing function.

Best regards,

Michael
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928


  reply	other threads:[~2016-05-13 16:17 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-13  7:52 Soegtrop, Michael
2016-05-13  8:13 ` Ben Millwood
2016-05-13  8:40   ` Soegtrop, Michael
2016-05-13  9:06 ` Alain Frisch
2016-05-13  9:26   ` Soegtrop, Michael
2016-05-13 12:01     ` Gabriel Scherer
2016-05-13 12:23       ` Alain Frisch
2016-05-13 12:32       ` Soegtrop, Michael
2016-05-13 13:50   ` Pierre Chambart
2016-05-13 13:56     ` Alain Frisch
2016-05-13 16:17       ` Soegtrop, Michael [this message]
2016-05-13 18:57         ` Thomas Braibant
2016-05-13 22:45           ` Soegtrop, Michael
2016-05-14  8:41             ` Thomas Braibant
2016-05-14  9:06               ` Soegtrop, Michael
2016-05-17 13:03                 ` Soegtrop, Michael

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0F7D3B1B3C4B894D824F5B822E3E5A172CEF2D2F@IRSMSX102.ger.corp.intel.com \
    --to=michael.soegtrop@intel.com \
    --cc=alain.frisch@lexifi.com \
    --cc=caml-list@inria.fr \
    --cc=pierre.chambart@ocamlpro.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).