caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Thomas Braibant <thomas.braibant@gmail.com>
To: "Soegtrop, Michael" <michael.soegtrop@intel.com>
Cc: Alain Frisch <alain.frisch@lexifi.com>,
	OCaML Mailing List <caml-list@inria.fr>,
	 Pierre Chambart <pierre.chambart@ocamlpro.com>
Subject: RE: [Caml-list] Specify the default hash function for a type
Date: Sat, 14 May 2016 09:41:24 +0100	[thread overview]
Message-ID: <CAHR=VkwXbrpCqDcJ=cTQLnVzaRmDnDX-UWoRqxBVaL6u_2Nh5Q@mail.gmail.com> (raw)
In-Reply-To: <0F7D3B1B3C4B894D824F5B822E3E5A172CEF2DF6@IRSMSX102.ger.corp.intel.com>

[-- Attachment #1: Type: text/plain, Size: 6714 bytes --]

Dear Michael,

256 might be too small for your use case if you have big lists in your
records. A way to workaround this limitation is to use the seeded_hash
function to fold an hash accumulator through your record.

That is, if your record has fields f1, ..., fn
you might want to write

let hash t =
   let x = seeded_hash (hash t.f1) (hash t.f2) in
   let x = seeded_hash x (hash t.f3) in
   ...
   seeded_hash x (hash t.fn)




On 14 May 2016 00:46, "Soegtrop, Michael" <michael.soegtrop@intel.com>
wrote:

> Dear Tom,
>
> thanks for the pointer to the manual. I googled, but couldn't find this
> info - should have read the manual first. Yes, 10 is for sure not enough in
> my case. I will try the hash function with a value adjusted to my case and
> compare with mine.
>
> Best regards,
>
> Michael
>
> > -----Original Message-----
> > From: Thomas Braibant [mailto:thomas.braibant@gmail.com]
> > Sent: Friday, May 13, 2016 8:57 PM
> > To: Soegtrop, Michael
> > Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr
> > Subject: Re: [Caml-list] Specify the default hash function for a type
> >
> > The OCaml hash function is not bad in itself (it's a slightly modified
> version
> > of Murmur 3, with a 32 bit state). There are a few difficult design
> decisions
> > that are made in the main hash loop which can impact users though.
> >
> > I wonder if the one that you are hitting is the fact that by default the
> > polymorphic hash function only consider a (small) amount of meaningful
> > value, in a breadth first manner. The following program is interesting
> >
> > ```
> > let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);;
> let r = ref 0 in
> > while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do
> incr r;
> > done; !r;;
> > - : int = 10
> > ```
> >
> > The current default choice for the number of meaningful values is
> described
> > here:
> > http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
> > and it is indeed 10. You could try to increase the number of meaningful
> > nodes that you want to consider in your code, to the max of 256.
> >
> > There is really a balance to strike here between speed of hashing and
> > collisions.
> >
> > Tom
> >
> >
> >
> >
> >
> > On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael
> > <michael.soegtrop@intel.com> wrote:
> > > Dear Ocaml Users,
> > >
> > > a possibly interesting data point in the discussion on polymorphic
> hashes:
> > >
> > > I replaced the polymorphic hash function for a record, which consists
> of
> > smallish integers, bools, lists of integers and short strings, with my
> own
> > hash function which looks like:
> > >
> > > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
> > >
> > > where vi are values and pi are distinct prime numbers. For lists I use
> 4
> > primes alternatingly and 4 different primes for empty cases (implemented
> > as 4 mutual recursive functions). The primes are random primes between
> > 2^28 and 2^29, avoiding those primes where larger powers of 2 are close
> to 0
> > modulo the prime. All arithmetic is done with plain ocaml ints and
> > corresponding overflows. For strings I used the polymorphic hash
> function.
> > >
> > > The hash histogram for this hash function looks like this:
> > >
> > > num_bindings=1803454
> > > num_buckets=16777216
> > > max_bucket_length=24
> > > (the histogram is "number of items in bucket=count of buckets with
> > > this number of items")
> > > 0=15356268 (empty buckets)
> > > 1=1185329 (non-colliding buckets)
> > > 2=135231
> > > 3=80194
> > > 4=11344
> > > 5=2367
> > > 6=2836
> > > 7=2630
> > > 8=244
> > > 9=49
> > > 10=20
> > > 11=25
> > > 12=64
> > > 13=39
> > > 14=100
> > > 15=75
> > > 16=33
> > > 17=5
> > > 18=149
> > > 19=15
> > > 20=3
> > > 21=180
> > > 22=14
> > > 24=2
> > >
> > > This is roughly in line with random numbers:
> > >
> > > fill rate = 1803454/16777216 = 10.7%
> > > 2 collisions = 0.8% ~ fill rate ^2
> > > 3 collisions = 0.48% ~ 4 * fill rate ^3
> > > 4 collisions = 0.068% ~ 5 * fill rate ^4
> > >
> > > The hash histogram for the same data and the polymorphic hash function
> > looks like this:
> > >
> > > num_bindings=1803454
> > > num_buckets=16777216
> > > max_bucket_length=3343
> > > 0=16730926
> > > 1=5520
> > > 2=3201
> > > 3=2779
> > > 4=1633
> > > 5=1079
> > > 6=3701
> > > 7=672
> > > 8=828
> > > 9=1584
> > > 10=600
> > > 11=384
> > > 12=2774
> > > 13=404
> > > 14=525
> > > 15=500
> > > 16=435
> > > 17=358
> > > 18=1406
> > > 19=244
> > > 20=504
> > > 21=427
> > > 22=316
> > > 23=369
> > > 24=837
> > > 25=153
> > > 26=250
> > > 27=417
> > > 28=191
> > > 29=222
> > > 30=501
> > > 31=76
> > > 32=196
> > > 33=530
> > > 34=142
> > > 35=153
> > > 36=859
> > > 37=88
> > > 38=178
> > > 39=310
> > > 40=147
> > > 41=173
> > > 42=313
> > > 43=102
> > > 44=235
> > > 45=123
> > > 46=149
> > > 47=152
> > > 48=244
> > > 49=107
> > > 50=173
> > > :
> > > :
> > > 1001=1
> > > 1002=1
> > > 1005=1
> > > 1006=3
> > > 1009=1
> > > 1010=1
> > > 1013=1
> > > 1014=2
> > > 1019=1
> > > 1020=1
> > > 1029=1
> > > 1031=1
> > > 1034=1
> > > 1035=1
> > > 1036=1
> > > 1064=1
> > > 1082=1
> > > 1184=1
> > > 1191=1
> > > 1196=1
> > > 1197=1
> > > 1200=1
> > > 1206=1
> > > 1207=1
> > > 1219=1
> > > 1225=1
> > > 1228=1
> > > 1232=1
> > > 1566=1
> > > 1581=1
> > > 1636=1
> > > 1698=1
> > > 1720=2
> > > 1737=2
> > > 1740=1
> > > 1744=1
> > > 1752=1
> > > 1762=1
> > > 1789=1
> > > 2255=1
> > > 2271=1
> > > 2287=1
> > > 2319=1
> > > 2332=1
> > > 2345=1
> > > 3296=1
> > > 3325=1
> > > 3343=1
> > >
> > > So there are many buckets with 1000s of collisions and only a few 1000
> > collision free buckets.
> > >
> > > It might make sense to come up with a better polymorphic hashing
> > function.
> > >
> > > Best regards,
> > >
> > > Michael
> > > Intel Deutschland GmbH
> > > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> > > Tel: +49 89 99 8853-0, www.intel.de
> > > Managing Directors: Christin Eisenschmid, Christian Lamprechter
> > > Chairperson of the Supervisory Board: Nicole Lau Registered Office:
> > > Munich Commercial Register: Amtsgericht Muenchen HRB 186928
> > >
> > >
> > > --
> > > Caml-list mailing list.  Subscription management and archives:
> > > https://sympa.inria.fr/sympa/arc/caml-list
> > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > > Bug reports: http://caml.inria.fr/bin/caml-bugs
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Christin Eisenschmid, Christian Lamprechter
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928
>

[-- Attachment #2: Type: text/html, Size: 9735 bytes --]

  reply	other threads:[~2016-05-14  8:41 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-13  7:52 Soegtrop, Michael
2016-05-13  8:13 ` Ben Millwood
2016-05-13  8:40   ` Soegtrop, Michael
2016-05-13  9:06 ` Alain Frisch
2016-05-13  9:26   ` Soegtrop, Michael
2016-05-13 12:01     ` Gabriel Scherer
2016-05-13 12:23       ` Alain Frisch
2016-05-13 12:32       ` Soegtrop, Michael
2016-05-13 13:50   ` Pierre Chambart
2016-05-13 13:56     ` Alain Frisch
2016-05-13 16:17       ` Soegtrop, Michael
2016-05-13 18:57         ` Thomas Braibant
2016-05-13 22:45           ` Soegtrop, Michael
2016-05-14  8:41             ` Thomas Braibant [this message]
2016-05-14  9:06               ` Soegtrop, Michael
2016-05-17 13:03                 ` Soegtrop, Michael

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHR=VkwXbrpCqDcJ=cTQLnVzaRmDnDX-UWoRqxBVaL6u_2Nh5Q@mail.gmail.com' \
    --to=thomas.braibant@gmail.com \
    --cc=alain.frisch@lexifi.com \
    --cc=caml-list@inria.fr \
    --cc=michael.soegtrop@intel.com \
    --cc=pierre.chambart@ocamlpro.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).