caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: "Soegtrop, Michael" <michael.soegtrop@intel.com>
To: Thomas Braibant <thomas.braibant@gmail.com>
Cc: Alain Frisch <alain.frisch@lexifi.com>,
	Pierre Chambart <pierre.chambart@ocamlpro.com>,
	"caml-list@inria.fr" <caml-list@inria.fr>
Subject: RE: [Caml-list] Specify the default hash function for a type
Date: Fri, 13 May 2016 22:45:56 +0000	[thread overview]
Message-ID: <0F7D3B1B3C4B894D824F5B822E3E5A172CEF2DF6@IRSMSX102.ger.corp.intel.com> (raw)
In-Reply-To: <CAHR=Vkz0VQWMw6inxUv-ootPY6Qjr9nj3kfj9FQZaT8bPFQ7Xg@mail.gmail.com>

Dear Tom,

thanks for the pointer to the manual. I googled, but couldn't find this info - should have read the manual first. Yes, 10 is for sure not enough in my case. I will try the hash function with a value adjusted to my case and compare with mine.

Best regards,

Michael

> -----Original Message-----
> From: Thomas Braibant [mailto:thomas.braibant@gmail.com]
> Sent: Friday, May 13, 2016 8:57 PM
> To: Soegtrop, Michael
> Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr
> Subject: Re: [Caml-list] Specify the default hash function for a type
> 
> The OCaml hash function is not bad in itself (it's a slightly modified version
> of Murmur 3, with a 32 bit state). There are a few difficult design decisions
> that are made in the main hash loop which can impact users though.
> 
> I wonder if the one that you are hitting is the fact that by default the
> polymorphic hash function only consider a (small) amount of meaningful
> value, in a breadth first manner. The following program is interesting
> 
> ```
> let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; let r = ref 0 in
> while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do incr r;
> done; !r;;
> - : int = 10
> ```
> 
> The current default choice for the number of meaningful values is described
> here:
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
> and it is indeed 10. You could try to increase the number of meaningful
> nodes that you want to consider in your code, to the max of 256.
> 
> There is really a balance to strike here between speed of hashing and
> collisions.
> 
> Tom
> 
> 
> 
> 
> 
> On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael
> <michael.soegtrop@intel.com> wrote:
> > Dear Ocaml Users,
> >
> > a possibly interesting data point in the discussion on polymorphic hashes:
> >
> > I replaced the polymorphic hash function for a record, which consists of
> smallish integers, bools, lists of integers and short strings, with my own
> hash function which looks like:
> >
> > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
> >
> > where vi are values and pi are distinct prime numbers. For lists I use 4
> primes alternatingly and 4 different primes for empty cases (implemented
> as 4 mutual recursive functions). The primes are random primes between
> 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0
> modulo the prime. All arithmetic is done with plain ocaml ints and
> corresponding overflows. For strings I used the polymorphic hash function.
> >
> > The hash histogram for this hash function looks like this:
> >
> > num_bindings=1803454
> > num_buckets=16777216
> > max_bucket_length=24
> > (the histogram is "number of items in bucket=count of buckets with
> > this number of items")
> > 0=15356268 (empty buckets)
> > 1=1185329 (non-colliding buckets)
> > 2=135231
> > 3=80194
> > 4=11344
> > 5=2367
> > 6=2836
> > 7=2630
> > 8=244
> > 9=49
> > 10=20
> > 11=25
> > 12=64
> > 13=39
> > 14=100
> > 15=75
> > 16=33
> > 17=5
> > 18=149
> > 19=15
> > 20=3
> > 21=180
> > 22=14
> > 24=2
> >
> > This is roughly in line with random numbers:
> >
> > fill rate = 1803454/16777216 = 10.7%
> > 2 collisions = 0.8% ~ fill rate ^2
> > 3 collisions = 0.48% ~ 4 * fill rate ^3
> > 4 collisions = 0.068% ~ 5 * fill rate ^4
> >
> > The hash histogram for the same data and the polymorphic hash function
> looks like this:
> >
> > num_bindings=1803454
> > num_buckets=16777216
> > max_bucket_length=3343
> > 0=16730926
> > 1=5520
> > 2=3201
> > 3=2779
> > 4=1633
> > 5=1079
> > 6=3701
> > 7=672
> > 8=828
> > 9=1584
> > 10=600
> > 11=384
> > 12=2774
> > 13=404
> > 14=525
> > 15=500
> > 16=435
> > 17=358
> > 18=1406
> > 19=244
> > 20=504
> > 21=427
> > 22=316
> > 23=369
> > 24=837
> > 25=153
> > 26=250
> > 27=417
> > 28=191
> > 29=222
> > 30=501
> > 31=76
> > 32=196
> > 33=530
> > 34=142
> > 35=153
> > 36=859
> > 37=88
> > 38=178
> > 39=310
> > 40=147
> > 41=173
> > 42=313
> > 43=102
> > 44=235
> > 45=123
> > 46=149
> > 47=152
> > 48=244
> > 49=107
> > 50=173
> > :
> > :
> > 1001=1
> > 1002=1
> > 1005=1
> > 1006=3
> > 1009=1
> > 1010=1
> > 1013=1
> > 1014=2
> > 1019=1
> > 1020=1
> > 1029=1
> > 1031=1
> > 1034=1
> > 1035=1
> > 1036=1
> > 1064=1
> > 1082=1
> > 1184=1
> > 1191=1
> > 1196=1
> > 1197=1
> > 1200=1
> > 1206=1
> > 1207=1
> > 1219=1
> > 1225=1
> > 1228=1
> > 1232=1
> > 1566=1
> > 1581=1
> > 1636=1
> > 1698=1
> > 1720=2
> > 1737=2
> > 1740=1
> > 1744=1
> > 1752=1
> > 1762=1
> > 1789=1
> > 2255=1
> > 2271=1
> > 2287=1
> > 2319=1
> > 2332=1
> > 2345=1
> > 3296=1
> > 3325=1
> > 3343=1
> >
> > So there are many buckets with 1000s of collisions and only a few 1000
> collision free buckets.
> >
> > It might make sense to come up with a better polymorphic hashing
> function.
> >
> > Best regards,
> >
> > Michael
> > Intel Deutschland GmbH
> > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> > Tel: +49 89 99 8853-0, www.intel.de
> > Managing Directors: Christin Eisenschmid, Christian Lamprechter
> > Chairperson of the Supervisory Board: Nicole Lau Registered Office:
> > Munich Commercial Register: Amtsgericht Muenchen HRB 186928
> >
> >
> > --
> > Caml-list mailing list.  Subscription management and archives:
> > https://sympa.inria.fr/sympa/arc/caml-list
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

  reply	other threads:[~2016-05-13 22:46 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-13  7:52 Soegtrop, Michael
2016-05-13  8:13 ` Ben Millwood
2016-05-13  8:40   ` Soegtrop, Michael
2016-05-13  9:06 ` Alain Frisch
2016-05-13  9:26   ` Soegtrop, Michael
2016-05-13 12:01     ` Gabriel Scherer
2016-05-13 12:23       ` Alain Frisch
2016-05-13 12:32       ` Soegtrop, Michael
2016-05-13 13:50   ` Pierre Chambart
2016-05-13 13:56     ` Alain Frisch
2016-05-13 16:17       ` Soegtrop, Michael
2016-05-13 18:57         ` Thomas Braibant
2016-05-13 22:45           ` Soegtrop, Michael [this message]
2016-05-14  8:41             ` Thomas Braibant
2016-05-14  9:06               ` Soegtrop, Michael
2016-05-17 13:03                 ` Soegtrop, Michael

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0F7D3B1B3C4B894D824F5B822E3E5A172CEF2DF6@IRSMSX102.ger.corp.intel.com \
    --to=michael.soegtrop@intel.com \
    --cc=alain.frisch@lexifi.com \
    --cc=caml-list@inria.fr \
    --cc=pierre.chambart@ocamlpro.com \
    --cc=thomas.braibant@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).