caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Specify the default hash function for a type
@ 2016-05-13  7:52 Soegtrop, Michael
  2016-05-13  8:13 ` Ben Millwood
  2016-05-13  9:06 ` Alain Frisch
  0 siblings, 2 replies; 16+ messages in thread
From: Soegtrop, Michael @ 2016-05-13  7:52 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 918 bytes --]

Dear Ocaml Users,

I wonder if there is a way to specify a default hash function for a type, which is then used by the automated hash functions e.g. created for records containing this type. I have e.g. a record with about 10 fields and I am happy with the default hash functions for most fields and for the record, but for just one field I would like to use a custom hash function, but I don't want to write a hash function for the record.

I am currently using the standard library, but if there is a solution with Batteries or Core, I would also be interested.

Best regards,

Michael
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

[-- Attachment #2: Type: text/html, Size: 2891 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Caml-list] Specify the default hash function for a type
  2016-05-13  7:52 [Caml-list] Specify the default hash function for a type Soegtrop, Michael
@ 2016-05-13  8:13 ` Ben Millwood
  2016-05-13  8:40   ` Soegtrop, Michael
  2016-05-13  9:06 ` Alain Frisch
  1 sibling, 1 reply; 16+ messages in thread
From: Ben Millwood @ 2016-05-13  8:13 UTC (permalink / raw)
  To: Soegtrop, Michael; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 1499 bytes --]

Polymorphic hash operates without type information, so it's not able to
tell that it's hashing your record type vs. some other type with the same
structure. So I would guess that what you want is not possible.

The Core team is working on a ppx extension analogous to ppx_sexp_conv and
ppx_compare which generates hash functions automatically, and would make
this kind of thing possible, but it's not done yet.

On 13 May 2016 at 15:52, Soegtrop, Michael <michael.soegtrop@intel.com>
wrote:

> Dear Ocaml Users,
>
>
>
> I wonder if there is a way to specify a default hash function for a type,
> which is then used by the automated hash functions e.g. created for records
> containing this type. I have e.g. a record with about 10 fields and I am
> happy with the default hash functions for most fields and for the record,
> but for just one field I would like to use a custom hash function, but I
> don’t want to write a hash function for the record.
>
>
>
> I am currently using the standard library, but if there is a solution with
> Batteries or Core, I would also be interested.
>
>
>
> Best regards,
>
>
>
> Michael
>
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Christin Eisenschmid, Christian Lamprechter
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928
>

[-- Attachment #2: Type: text/html, Size: 2316 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Caml-list] Specify the default hash function for a type
  2016-05-13  8:13 ` Ben Millwood
@ 2016-05-13  8:40   ` Soegtrop, Michael
  0 siblings, 0 replies; 16+ messages in thread
From: Soegtrop, Michael @ 2016-05-13  8:40 UTC (permalink / raw)
  To: Ben Millwood; +Cc: caml-list

[-- Attachment #1: Type: text/plain, Size: 1093 bytes --]

Dear Ben,

thanks a lot for the quick answer! I am looking forward to the PPX.

Best regards,

Michael

From: caml-list-request@inria.fr [mailto:caml-list-request@inria.fr] On Behalf Of Ben Millwood
Sent: Friday, May 13, 2016 10:14 AM
To: Soegtrop, Michael
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Specify the default hash function for a type

Polymorphic hash operates without type information, so it's not able to tell that it's hashing your record type vs. some other type with the same structure. So I would guess that what you want is not possible.

The Core team is working on a ppx extension analogous to ppx_sexp_conv and ppx_compare which generates hash functions automatically, and would make this kind of thing possible, but it's not done yet.
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

[-- Attachment #2: Type: text/html, Size: 4807 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Caml-list] Specify the default hash function for a type
  2016-05-13  7:52 [Caml-list] Specify the default hash function for a type Soegtrop, Michael
  2016-05-13  8:13 ` Ben Millwood
@ 2016-05-13  9:06 ` Alain Frisch
  2016-05-13  9:26   ` Soegtrop, Michael
  2016-05-13 13:50   ` Pierre Chambart
  1 sibling, 2 replies; 16+ messages in thread
From: Alain Frisch @ 2016-05-13  9:06 UTC (permalink / raw)
  To: Soegtrop, Michael, caml-list

Dear Michael,

Don't tell anybody (ahem), but the following might work depending on 
your exact use case...

If you define a type such as:

  type t = {
    foo: ....;
    id: int;
  }

and if you set the tag of values for this type to Obj.record_tag (= 248) 
with Obj.set_tag, then polymorphic hash *and* comparison functions will 
use the "id" field and ignore the "foo" field (and any other extra 
fields).  It is important to keep "id" as the second field to mimic the 
layout of real object values (and exception slots).

In particular, this lets you easily "forget" sub-parts of a value (by 
setting "id" to a fixed constant), or implement "references compared 
with physical equality" (use a counter to assign unique "id" to each 
such reference).


Best regards,

Alain



On 13/05/2016 09:52, Soegtrop, Michael wrote:
> Dear Ocaml Users,
>
>
>
> I wonder if there is a way to specify a default hash function for a
> type, which is then used by the automated hash functions e.g. created
> for records containing this type. I have e.g. a record with about 10
> fields and I am happy with the default hash functions for most fields
> and for the record, but for just one field I would like to use a custom
> hash function, but I don’t want to write a hash function for the record.
>
>
>
> I am currently using the standard library, but if there is a solution
> with Batteries or Core, I would also be interested.
>
>
>
> Best regards,
>
>
>
> Michael
>
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Christin Eisenschmid, Christian Lamprechter
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Caml-list] Specify the default hash function for a type
  2016-05-13  9:06 ` Alain Frisch
@ 2016-05-13  9:26   ` Soegtrop, Michael
  2016-05-13 12:01     ` Gabriel Scherer
  2016-05-13 13:50   ` Pierre Chambart
  1 sibling, 1 reply; 16+ messages in thread
From: Soegtrop, Michael @ 2016-05-13  9:26 UTC (permalink / raw)
  To: Alain Frisch, caml-list

Dear Alain,

this are temptations on Friday morning ... I guess as a beginner I should resist, not be lazy, and write the hash function by hand. But I put it into appendix D of my OCaml notes ;-)

Best regards,

Michael

> -----Original Message-----
> From: Alain Frisch [mailto:alain.frisch@lexifi.com]
> Sent: Friday, May 13, 2016 11:06 AM
> To: Soegtrop, Michael; caml-list@inria.fr
> Subject: Re: [Caml-list] Specify the default hash function for a type
> 
> Dear Michael,
> 
> Don't tell anybody (ahem), but the following might work depending on your
> exact use case...
> 
> If you define a type such as:
> 
>   type t = {
>     foo: ....;
>     id: int;
>   }
> 
> and if you set the tag of values for this type to Obj.record_tag (= 248) with
> Obj.set_tag, then polymorphic hash *and* comparison functions will use the
> "id" field and ignore the "foo" field (and any other extra fields).  It is
> important to keep "id" as the second field to mimic the layout of real object
> values (and exception slots).
> 
> In particular, this lets you easily "forget" sub-parts of a value (by setting "id"
> to a fixed constant), or implement "references compared with physical
> equality" (use a counter to assign unique "id" to each such reference).
> 
> 
> Best regards,
> 
> Alain
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Caml-list] Specify the default hash function for a type
  2016-05-13  9:26   ` Soegtrop, Michael
@ 2016-05-13 12:01     ` Gabriel Scherer
  2016-05-13 12:23       ` Alain Frisch
  2016-05-13 12:32       ` Soegtrop, Michael
  0 siblings, 2 replies; 16+ messages in thread
From: Gabriel Scherer @ 2016-05-13 12:01 UTC (permalink / raw)
  To: Soegtrop, Michael; +Cc: Alain Frisch, caml-list

From the C side it is possible to create "custom" values that come
with user-specified comparison and hashing functions. However, for now
the runtime does not allow creating them from OCaml -- this is indeed
a desirable feature, but it's not trivial to implement. I don't really
whether it's actually "really hard", or just that it requires some
effort that nobody has invested yet -- the latter would be my guess.

If you don't mind going through a C stubs to do this, you could thus
create the custom value in C land, and export that in OCaml land.
See the manual:
  Interfacing C with OCaml > custom blocks
  http://caml.inria.fr/pub/docs/manual-ocaml-4.03/intfc.html#sec442

On Fri, May 13, 2016 at 5:26 AM, Soegtrop, Michael
<michael.soegtrop@intel.com> wrote:
> Dear Alain,
>
> this are temptations on Friday morning ... I guess as a beginner I should resist, not be lazy, and write the hash function by hand. But I put it into appendix D of my OCaml notes ;-)
>
> Best regards,
>
> Michael
>
>> -----Original Message-----
>> From: Alain Frisch [mailto:alain.frisch@lexifi.com]
>> Sent: Friday, May 13, 2016 11:06 AM
>> To: Soegtrop, Michael; caml-list@inria.fr
>> Subject: Re: [Caml-list] Specify the default hash function for a type
>>
>> Dear Michael,
>>
>> Don't tell anybody (ahem), but the following might work depending on your
>> exact use case...
>>
>> If you define a type such as:
>>
>>   type t = {
>>     foo: ....;
>>     id: int;
>>   }
>>
>> and if you set the tag of values for this type to Obj.record_tag (= 248) with
>> Obj.set_tag, then polymorphic hash *and* comparison functions will use the
>> "id" field and ignore the "foo" field (and any other extra fields).  It is
>> important to keep "id" as the second field to mimic the layout of real object
>> values (and exception slots).
>>
>> In particular, this lets you easily "forget" sub-parts of a value (by setting "id"
>> to a fixed constant), or implement "references compared with physical
>> equality" (use a counter to assign unique "id" to each such reference).
>>
>>
>> Best regards,
>>
>> Alain
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Christin Eisenschmid, Christian Lamprechter
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928
>
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Caml-list] Specify the default hash function for a type
  2016-05-13 12:01     ` Gabriel Scherer
@ 2016-05-13 12:23       ` Alain Frisch
  2016-05-13 12:32       ` Soegtrop, Michael
  1 sibling, 0 replies; 16+ messages in thread
From: Alain Frisch @ 2016-05-13 12:23 UTC (permalink / raw)
  To: Gabriel Scherer, Soegtrop, Michael; +Cc: caml-list

On 13/05/2016 14:01, Gabriel Scherer wrote:
>>From the C side it is possible to create "custom" values that come
> with user-specified comparison and hashing functions. However, for now
> the runtime does not allow creating them from OCaml -- this is indeed
> a desirable feature, but it's not trivial to implement. I don't really
> whether it's actually "really hard", or just that it requires some
> effort that nobody has invested yet -- the latter would be my guess.

The problem is that the custom comparison function could move objects in 
the heap if it triggers a GC, and the polymorphic functions 
(compare/hash) would need to take this into account, i.e. register all 
"pending" nodes as GC roots.  This could slow them down quite a bit.

Alain

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Caml-list] Specify the default hash function for a type
  2016-05-13 12:01     ` Gabriel Scherer
  2016-05-13 12:23       ` Alain Frisch
@ 2016-05-13 12:32       ` Soegtrop, Michael
  1 sibling, 0 replies; 16+ messages in thread
From: Soegtrop, Michael @ 2016-05-13 12:32 UTC (permalink / raw)
  To: Gabriel Scherer; +Cc: Alain Frisch, caml-list

Dear Gabriel, Alain,

thanks for the hints!

After I applied my program to a larger data set, I found that, contrary to my initial assessment, the polymorphic hash function for the complete record is not good in my case. The hash stats show that the largest bucket has more items in it than I have single item buckets - not a good sign. So I anyway have to write a complete custom hash function.

Is there some documentation on what the polymorphic hash function does?

Best regards,

Michael
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Caml-list] Specify the default hash function for a type
  2016-05-13  9:06 ` Alain Frisch
  2016-05-13  9:26   ` Soegtrop, Michael
@ 2016-05-13 13:50   ` Pierre Chambart
  2016-05-13 13:56     ` Alain Frisch
  1 sibling, 1 reply; 16+ messages in thread
From: Pierre Chambart @ 2016-05-13 13:50 UTC (permalink / raw)
  To: Alain Frisch, Soegtrop, Michael, caml-list

For those who don't see that as a joke, please forget everything about that.
Obj.set_tag is an extremely dangerous function as the compiler assumes that
value's tag never change, even for completely unknown values. Hence, it
might
be legal for the compiler to do something that you might not expect on:

let f x =
  let x' = Obj.repr x in
  let tag = Obj.tag x' in
  Obj.set_tag x' 0
  let ... = match x with ... in
  Obj.set_tag x' tag;
  match x with ...

The access to the tag can be shared for instance. And lots of
'surprising' stuff might happen.
This is of course a quite brutal example but this could apply to a lot
of less obvious situation !

So Obj.magic and Obj.set_field : bad
Obj.set_tag: really really bad !
-- 
Pierre

On 13/05/2016 11:06, Alain Frisch wrote:
> Dear Michael,
>
> Don't tell anybody (ahem), but the following might work depending on
> your exact use case...
>
> If you define a type such as:
>
>  type t = {
>    foo: ....;
>    id: int;
>  }
>
> and if you set the tag of values for this type to Obj.record_tag (=
> 248) with Obj.set_tag, then polymorphic hash *and* comparison
> functions will use the "id" field and ignore the "foo" field (and any
> other extra fields).  It is important to keep "id" as the second field
> to mimic the layout of real object values (and exception slots).
>
> In particular, this lets you easily "forget" sub-parts of a value (by
> setting "id" to a fixed constant), or implement "references compared
> with physical equality" (use a counter to assign unique "id" to each
> such reference).
>
>
> Best regards,
>
> Alain
>
>
>
> On 13/05/2016 09:52, Soegtrop, Michael wrote:
>> Dear Ocaml Users,
>>
>>
>>
>> I wonder if there is a way to specify a default hash function for a
>> type, which is then used by the automated hash functions e.g. created
>> for records containing this type. I have e.g. a record with about 10
>> fields and I am happy with the default hash functions for most fields
>> and for the record, but for just one field I would like to use a custom
>> hash function, but I don’t want to write a hash function for the record.
>>
>>
>>
>> I am currently using the standard library, but if there is a solution
>> with Batteries or Core, I would also be interested.
>>
>>
>>
>> Best regards,
>>
>>
>>
>> Michael
>>
>> Intel Deutschland GmbH
>> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
>> Tel: +49 89 99 8853-0, www.intel.de
>> Managing Directors: Christin Eisenschmid, Christian Lamprechter
>> Chairperson of the Supervisory Board: Nicole Lau
>> Registered Office: Munich
>> Commercial Register: Amtsgericht Muenchen HRB 186928
>>
>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Caml-list] Specify the default hash function for a type
  2016-05-13 13:50   ` Pierre Chambart
@ 2016-05-13 13:56     ` Alain Frisch
  2016-05-13 16:17       ` Soegtrop, Michael
  0 siblings, 1 reply; 16+ messages in thread
From: Alain Frisch @ 2016-05-13 13:56 UTC (permalink / raw)
  To: Pierre Chambart, Soegtrop, Michael, caml-list

FWIW, I was not seriously suggesting this.  That said, ..., the case 
should be relatively safe, since the code generated by OCaml never reads 
the tag of record values (it is assumed to be 0).  The tag should only 
be read by polymorphic primitives, which don't receive any information 
from the OCaml optimizer.  So I don't see how the optimizer could break 
the (not-really-)suggested hack, except on purpose.

Alain



On 13/05/2016 15:50, Pierre Chambart wrote:
> For those who don't see that as a joke, please forget everything about that.
> Obj.set_tag is an extremely dangerous function as the compiler assumes that
> value's tag never change, even for completely unknown values. Hence, it
> might
> be legal for the compiler to do something that you might not expect on:
>
> let f x =
>   let x' = Obj.repr x in
>   let tag = Obj.tag x' in
>   Obj.set_tag x' 0
>   let ... = match x with ... in
>   Obj.set_tag x' tag;
>   match x with ...
>
> The access to the tag can be shared for instance. And lots of
> 'surprising' stuff might happen.
> This is of course a quite brutal example but this could apply to a lot
> of less obvious situation !
>
> So Obj.magic and Obj.set_field : bad
> Obj.set_tag: really really bad !
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Caml-list] Specify the default hash function for a type
  2016-05-13 13:56     ` Alain Frisch
@ 2016-05-13 16:17       ` Soegtrop, Michael
  2016-05-13 18:57         ` Thomas Braibant
  0 siblings, 1 reply; 16+ messages in thread
From: Soegtrop, Michael @ 2016-05-13 16:17 UTC (permalink / raw)
  To: Alain Frisch, Pierre Chambart, caml-list

Dear Ocaml Users,

a possibly interesting data point in the discussion on polymorphic hashes:

I replaced the polymorphic hash function for a record, which consists of smallish integers, bools, lists of integers and short strings, with my own hash function which looks like:

( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4    

where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases (implemented as 4 mutual recursive functions). The primes are random primes between 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 modulo the prime. All arithmetic is done with plain ocaml ints and corresponding overflows. For strings I used the polymorphic hash function.

The hash histogram for this hash function looks like this:

num_bindings=1803454
num_buckets=16777216
max_bucket_length=24 
(the histogram is "number of items in bucket=count of buckets with this number of items")
0=15356268 (empty buckets)
1=1185329 (non-colliding buckets)
2=135231
3=80194
4=11344
5=2367
6=2836
7=2630
8=244
9=49
10=20
11=25
12=64
13=39 
14=100 
15=75 
16=33 
17=5 
18=149 
19=15 
20=3 
21=180 
22=14 
24=2

This is roughly in line with random numbers:

fill rate = 1803454/16777216 = 10.7%
2 collisions = 0.8% ~ fill rate ^2
3 collisions = 0.48% ~ 4 * fill rate ^3
4 collisions = 0.068% ~ 5 * fill rate ^4

The hash histogram for the same data and the polymorphic hash function looks like this:

num_bindings=1803454
num_buckets=16777216
max_bucket_length=3343
0=16730926
1=5520
2=3201
3=2779
4=1633
5=1079
6=3701
7=672
8=828
9=1584
10=600
11=384
12=2774
13=404
14=525
15=500
16=435
17=358
18=1406
19=244
20=504
21=427
22=316
23=369
24=837
25=153
26=250
27=417
28=191
29=222
30=501
31=76
32=196
33=530
34=142
35=153
36=859
37=88
38=178
39=310
40=147
41=173
42=313
43=102
44=235
45=123
46=149
47=152
48=244
49=107
50=173
:
:
1001=1
1002=1
1005=1
1006=3
1009=1
1010=1
1013=1
1014=2
1019=1
1020=1
1029=1
1031=1
1034=1
1035=1
1036=1
1064=1
1082=1
1184=1
1191=1
1196=1
1197=1
1200=1
1206=1
1207=1
1219=1
1225=1
1228=1
1232=1
1566=1
1581=1
1636=1
1698=1
1720=2
1737=2
1740=1
1744=1
1752=1
1762=1
1789=1
2255=1
2271=1
2287=1
2319=1
2332=1
2345=1
3296=1
3325=1
3343=1

So there are many buckets with 1000s of collisions and only a few 1000 collision free buckets.

It might make sense to come up with a better polymorphic hashing function.

Best regards,

Michael
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Caml-list] Specify the default hash function for a type
  2016-05-13 16:17       ` Soegtrop, Michael
@ 2016-05-13 18:57         ` Thomas Braibant
  2016-05-13 22:45           ` Soegtrop, Michael
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Braibant @ 2016-05-13 18:57 UTC (permalink / raw)
  To: Soegtrop, Michael; +Cc: Alain Frisch, Pierre Chambart, caml-list

The OCaml hash function is not bad in itself (it's a slightly modified
version of Murmur 3, with a 32 bit state). There are a few difficult
design decisions that are made in the main hash loop which can impact
users though.

I wonder if the one that you are hitting is the fact that by default
the polymorphic hash function only consider a (small) amount of
meaningful value, in a breadth first manner. The following program is
interesting

```
let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);;
let r = ref 0 in while Hashtbl.hash (make !r []) <> Hashtbl.hash (make
(!r + 1) []) do incr r; done; !r;;
- : int = 10
```

The current default choice for the number of meaningful values is
described here:
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
and it is indeed 10. You could try to increase the number of
meaningful nodes that you want to consider in your code, to the max of
256.

There is really a balance to strike here between speed of hashing and
collisions.

Tom





On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael
<michael.soegtrop@intel.com> wrote:
> Dear Ocaml Users,
>
> a possibly interesting data point in the discussion on polymorphic hashes:
>
> I replaced the polymorphic hash function for a record, which consists of smallish integers, bools, lists of integers and short strings, with my own hash function which looks like:
>
> ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
>
> where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases (implemented as 4 mutual recursive functions). The primes are random primes between 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0 modulo the prime. All arithmetic is done with plain ocaml ints and corresponding overflows. For strings I used the polymorphic hash function.
>
> The hash histogram for this hash function looks like this:
>
> num_bindings=1803454
> num_buckets=16777216
> max_bucket_length=24
> (the histogram is "number of items in bucket=count of buckets with this number of items")
> 0=15356268 (empty buckets)
> 1=1185329 (non-colliding buckets)
> 2=135231
> 3=80194
> 4=11344
> 5=2367
> 6=2836
> 7=2630
> 8=244
> 9=49
> 10=20
> 11=25
> 12=64
> 13=39
> 14=100
> 15=75
> 16=33
> 17=5
> 18=149
> 19=15
> 20=3
> 21=180
> 22=14
> 24=2
>
> This is roughly in line with random numbers:
>
> fill rate = 1803454/16777216 = 10.7%
> 2 collisions = 0.8% ~ fill rate ^2
> 3 collisions = 0.48% ~ 4 * fill rate ^3
> 4 collisions = 0.068% ~ 5 * fill rate ^4
>
> The hash histogram for the same data and the polymorphic hash function looks like this:
>
> num_bindings=1803454
> num_buckets=16777216
> max_bucket_length=3343
> 0=16730926
> 1=5520
> 2=3201
> 3=2779
> 4=1633
> 5=1079
> 6=3701
> 7=672
> 8=828
> 9=1584
> 10=600
> 11=384
> 12=2774
> 13=404
> 14=525
> 15=500
> 16=435
> 17=358
> 18=1406
> 19=244
> 20=504
> 21=427
> 22=316
> 23=369
> 24=837
> 25=153
> 26=250
> 27=417
> 28=191
> 29=222
> 30=501
> 31=76
> 32=196
> 33=530
> 34=142
> 35=153
> 36=859
> 37=88
> 38=178
> 39=310
> 40=147
> 41=173
> 42=313
> 43=102
> 44=235
> 45=123
> 46=149
> 47=152
> 48=244
> 49=107
> 50=173
> :
> :
> 1001=1
> 1002=1
> 1005=1
> 1006=3
> 1009=1
> 1010=1
> 1013=1
> 1014=2
> 1019=1
> 1020=1
> 1029=1
> 1031=1
> 1034=1
> 1035=1
> 1036=1
> 1064=1
> 1082=1
> 1184=1
> 1191=1
> 1196=1
> 1197=1
> 1200=1
> 1206=1
> 1207=1
> 1219=1
> 1225=1
> 1228=1
> 1232=1
> 1566=1
> 1581=1
> 1636=1
> 1698=1
> 1720=2
> 1737=2
> 1740=1
> 1744=1
> 1752=1
> 1762=1
> 1789=1
> 2255=1
> 2271=1
> 2287=1
> 2319=1
> 2332=1
> 2345=1
> 3296=1
> 3325=1
> 3343=1
>
> So there are many buckets with 1000s of collisions and only a few 1000 collision free buckets.
>
> It might make sense to come up with a better polymorphic hashing function.
>
> Best regards,
>
> Michael
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Christin Eisenschmid, Christian Lamprechter
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928
>
>
> --
> Caml-list mailing list.  Subscription management and archives:
> https://sympa.inria.fr/sympa/arc/caml-list
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Caml-list] Specify the default hash function for a type
  2016-05-13 18:57         ` Thomas Braibant
@ 2016-05-13 22:45           ` Soegtrop, Michael
  2016-05-14  8:41             ` Thomas Braibant
  0 siblings, 1 reply; 16+ messages in thread
From: Soegtrop, Michael @ 2016-05-13 22:45 UTC (permalink / raw)
  To: Thomas Braibant; +Cc: Alain Frisch, Pierre Chambart, caml-list

Dear Tom,

thanks for the pointer to the manual. I googled, but couldn't find this info - should have read the manual first. Yes, 10 is for sure not enough in my case. I will try the hash function with a value adjusted to my case and compare with mine.

Best regards,

Michael

> -----Original Message-----
> From: Thomas Braibant [mailto:thomas.braibant@gmail.com]
> Sent: Friday, May 13, 2016 8:57 PM
> To: Soegtrop, Michael
> Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr
> Subject: Re: [Caml-list] Specify the default hash function for a type
> 
> The OCaml hash function is not bad in itself (it's a slightly modified version
> of Murmur 3, with a 32 bit state). There are a few difficult design decisions
> that are made in the main hash loop which can impact users though.
> 
> I wonder if the one that you are hitting is the fact that by default the
> polymorphic hash function only consider a (small) amount of meaningful
> value, in a breadth first manner. The following program is interesting
> 
> ```
> let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; let r = ref 0 in
> while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do incr r;
> done; !r;;
> - : int = 10
> ```
> 
> The current default choice for the number of meaningful values is described
> here:
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
> and it is indeed 10. You could try to increase the number of meaningful
> nodes that you want to consider in your code, to the max of 256.
> 
> There is really a balance to strike here between speed of hashing and
> collisions.
> 
> Tom
> 
> 
> 
> 
> 
> On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael
> <michael.soegtrop@intel.com> wrote:
> > Dear Ocaml Users,
> >
> > a possibly interesting data point in the discussion on polymorphic hashes:
> >
> > I replaced the polymorphic hash function for a record, which consists of
> smallish integers, bools, lists of integers and short strings, with my own
> hash function which looks like:
> >
> > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
> >
> > where vi are values and pi are distinct prime numbers. For lists I use 4
> primes alternatingly and 4 different primes for empty cases (implemented
> as 4 mutual recursive functions). The primes are random primes between
> 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0
> modulo the prime. All arithmetic is done with plain ocaml ints and
> corresponding overflows. For strings I used the polymorphic hash function.
> >
> > The hash histogram for this hash function looks like this:
> >
> > num_bindings=1803454
> > num_buckets=16777216
> > max_bucket_length=24
> > (the histogram is "number of items in bucket=count of buckets with
> > this number of items")
> > 0=15356268 (empty buckets)
> > 1=1185329 (non-colliding buckets)
> > 2=135231
> > 3=80194
> > 4=11344
> > 5=2367
> > 6=2836
> > 7=2630
> > 8=244
> > 9=49
> > 10=20
> > 11=25
> > 12=64
> > 13=39
> > 14=100
> > 15=75
> > 16=33
> > 17=5
> > 18=149
> > 19=15
> > 20=3
> > 21=180
> > 22=14
> > 24=2
> >
> > This is roughly in line with random numbers:
> >
> > fill rate = 1803454/16777216 = 10.7%
> > 2 collisions = 0.8% ~ fill rate ^2
> > 3 collisions = 0.48% ~ 4 * fill rate ^3
> > 4 collisions = 0.068% ~ 5 * fill rate ^4
> >
> > The hash histogram for the same data and the polymorphic hash function
> looks like this:
> >
> > num_bindings=1803454
> > num_buckets=16777216
> > max_bucket_length=3343
> > 0=16730926
> > 1=5520
> > 2=3201
> > 3=2779
> > 4=1633
> > 5=1079
> > 6=3701
> > 7=672
> > 8=828
> > 9=1584
> > 10=600
> > 11=384
> > 12=2774
> > 13=404
> > 14=525
> > 15=500
> > 16=435
> > 17=358
> > 18=1406
> > 19=244
> > 20=504
> > 21=427
> > 22=316
> > 23=369
> > 24=837
> > 25=153
> > 26=250
> > 27=417
> > 28=191
> > 29=222
> > 30=501
> > 31=76
> > 32=196
> > 33=530
> > 34=142
> > 35=153
> > 36=859
> > 37=88
> > 38=178
> > 39=310
> > 40=147
> > 41=173
> > 42=313
> > 43=102
> > 44=235
> > 45=123
> > 46=149
> > 47=152
> > 48=244
> > 49=107
> > 50=173
> > :
> > :
> > 1001=1
> > 1002=1
> > 1005=1
> > 1006=3
> > 1009=1
> > 1010=1
> > 1013=1
> > 1014=2
> > 1019=1
> > 1020=1
> > 1029=1
> > 1031=1
> > 1034=1
> > 1035=1
> > 1036=1
> > 1064=1
> > 1082=1
> > 1184=1
> > 1191=1
> > 1196=1
> > 1197=1
> > 1200=1
> > 1206=1
> > 1207=1
> > 1219=1
> > 1225=1
> > 1228=1
> > 1232=1
> > 1566=1
> > 1581=1
> > 1636=1
> > 1698=1
> > 1720=2
> > 1737=2
> > 1740=1
> > 1744=1
> > 1752=1
> > 1762=1
> > 1789=1
> > 2255=1
> > 2271=1
> > 2287=1
> > 2319=1
> > 2332=1
> > 2345=1
> > 3296=1
> > 3325=1
> > 3343=1
> >
> > So there are many buckets with 1000s of collisions and only a few 1000
> collision free buckets.
> >
> > It might make sense to come up with a better polymorphic hashing
> function.
> >
> > Best regards,
> >
> > Michael
> > Intel Deutschland GmbH
> > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> > Tel: +49 89 99 8853-0, www.intel.de
> > Managing Directors: Christin Eisenschmid, Christian Lamprechter
> > Chairperson of the Supervisory Board: Nicole Lau Registered Office:
> > Munich Commercial Register: Amtsgericht Muenchen HRB 186928
> >
> >
> > --
> > Caml-list mailing list.  Subscription management and archives:
> > https://sympa.inria.fr/sympa/arc/caml-list
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Caml-list] Specify the default hash function for a type
  2016-05-13 22:45           ` Soegtrop, Michael
@ 2016-05-14  8:41             ` Thomas Braibant
  2016-05-14  9:06               ` Soegtrop, Michael
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Braibant @ 2016-05-14  8:41 UTC (permalink / raw)
  To: Soegtrop, Michael; +Cc: Alain Frisch, OCaML Mailing List, Pierre Chambart

[-- Attachment #1: Type: text/plain, Size: 6714 bytes --]

Dear Michael,

256 might be too small for your use case if you have big lists in your
records. A way to workaround this limitation is to use the seeded_hash
function to fold an hash accumulator through your record.

That is, if your record has fields f1, ..., fn
you might want to write

let hash t =
   let x = seeded_hash (hash t.f1) (hash t.f2) in
   let x = seeded_hash x (hash t.f3) in
   ...
   seeded_hash x (hash t.fn)




On 14 May 2016 00:46, "Soegtrop, Michael" <michael.soegtrop@intel.com>
wrote:

> Dear Tom,
>
> thanks for the pointer to the manual. I googled, but couldn't find this
> info - should have read the manual first. Yes, 10 is for sure not enough in
> my case. I will try the hash function with a value adjusted to my case and
> compare with mine.
>
> Best regards,
>
> Michael
>
> > -----Original Message-----
> > From: Thomas Braibant [mailto:thomas.braibant@gmail.com]
> > Sent: Friday, May 13, 2016 8:57 PM
> > To: Soegtrop, Michael
> > Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr
> > Subject: Re: [Caml-list] Specify the default hash function for a type
> >
> > The OCaml hash function is not bad in itself (it's a slightly modified
> version
> > of Murmur 3, with a 32 bit state). There are a few difficult design
> decisions
> > that are made in the main hash loop which can impact users though.
> >
> > I wonder if the one that you are hitting is the fact that by default the
> > polymorphic hash function only consider a (small) amount of meaningful
> > value, in a breadth first manner. The following program is interesting
> >
> > ```
> > let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);;
> let r = ref 0 in
> > while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do
> incr r;
> > done; !r;;
> > - : int = 10
> > ```
> >
> > The current default choice for the number of meaningful values is
> described
> > here:
> > http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
> > and it is indeed 10. You could try to increase the number of meaningful
> > nodes that you want to consider in your code, to the max of 256.
> >
> > There is really a balance to strike here between speed of hashing and
> > collisions.
> >
> > Tom
> >
> >
> >
> >
> >
> > On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael
> > <michael.soegtrop@intel.com> wrote:
> > > Dear Ocaml Users,
> > >
> > > a possibly interesting data point in the discussion on polymorphic
> hashes:
> > >
> > > I replaced the polymorphic hash function for a record, which consists
> of
> > smallish integers, bools, lists of integers and short strings, with my
> own
> > hash function which looks like:
> > >
> > > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
> > >
> > > where vi are values and pi are distinct prime numbers. For lists I use
> 4
> > primes alternatingly and 4 different primes for empty cases (implemented
> > as 4 mutual recursive functions). The primes are random primes between
> > 2^28 and 2^29, avoiding those primes where larger powers of 2 are close
> to 0
> > modulo the prime. All arithmetic is done with plain ocaml ints and
> > corresponding overflows. For strings I used the polymorphic hash
> function.
> > >
> > > The hash histogram for this hash function looks like this:
> > >
> > > num_bindings=1803454
> > > num_buckets=16777216
> > > max_bucket_length=24
> > > (the histogram is "number of items in bucket=count of buckets with
> > > this number of items")
> > > 0=15356268 (empty buckets)
> > > 1=1185329 (non-colliding buckets)
> > > 2=135231
> > > 3=80194
> > > 4=11344
> > > 5=2367
> > > 6=2836
> > > 7=2630
> > > 8=244
> > > 9=49
> > > 10=20
> > > 11=25
> > > 12=64
> > > 13=39
> > > 14=100
> > > 15=75
> > > 16=33
> > > 17=5
> > > 18=149
> > > 19=15
> > > 20=3
> > > 21=180
> > > 22=14
> > > 24=2
> > >
> > > This is roughly in line with random numbers:
> > >
> > > fill rate = 1803454/16777216 = 10.7%
> > > 2 collisions = 0.8% ~ fill rate ^2
> > > 3 collisions = 0.48% ~ 4 * fill rate ^3
> > > 4 collisions = 0.068% ~ 5 * fill rate ^4
> > >
> > > The hash histogram for the same data and the polymorphic hash function
> > looks like this:
> > >
> > > num_bindings=1803454
> > > num_buckets=16777216
> > > max_bucket_length=3343
> > > 0=16730926
> > > 1=5520
> > > 2=3201
> > > 3=2779
> > > 4=1633
> > > 5=1079
> > > 6=3701
> > > 7=672
> > > 8=828
> > > 9=1584
> > > 10=600
> > > 11=384
> > > 12=2774
> > > 13=404
> > > 14=525
> > > 15=500
> > > 16=435
> > > 17=358
> > > 18=1406
> > > 19=244
> > > 20=504
> > > 21=427
> > > 22=316
> > > 23=369
> > > 24=837
> > > 25=153
> > > 26=250
> > > 27=417
> > > 28=191
> > > 29=222
> > > 30=501
> > > 31=76
> > > 32=196
> > > 33=530
> > > 34=142
> > > 35=153
> > > 36=859
> > > 37=88
> > > 38=178
> > > 39=310
> > > 40=147
> > > 41=173
> > > 42=313
> > > 43=102
> > > 44=235
> > > 45=123
> > > 46=149
> > > 47=152
> > > 48=244
> > > 49=107
> > > 50=173
> > > :
> > > :
> > > 1001=1
> > > 1002=1
> > > 1005=1
> > > 1006=3
> > > 1009=1
> > > 1010=1
> > > 1013=1
> > > 1014=2
> > > 1019=1
> > > 1020=1
> > > 1029=1
> > > 1031=1
> > > 1034=1
> > > 1035=1
> > > 1036=1
> > > 1064=1
> > > 1082=1
> > > 1184=1
> > > 1191=1
> > > 1196=1
> > > 1197=1
> > > 1200=1
> > > 1206=1
> > > 1207=1
> > > 1219=1
> > > 1225=1
> > > 1228=1
> > > 1232=1
> > > 1566=1
> > > 1581=1
> > > 1636=1
> > > 1698=1
> > > 1720=2
> > > 1737=2
> > > 1740=1
> > > 1744=1
> > > 1752=1
> > > 1762=1
> > > 1789=1
> > > 2255=1
> > > 2271=1
> > > 2287=1
> > > 2319=1
> > > 2332=1
> > > 2345=1
> > > 3296=1
> > > 3325=1
> > > 3343=1
> > >
> > > So there are many buckets with 1000s of collisions and only a few 1000
> > collision free buckets.
> > >
> > > It might make sense to come up with a better polymorphic hashing
> > function.
> > >
> > > Best regards,
> > >
> > > Michael
> > > Intel Deutschland GmbH
> > > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> > > Tel: +49 89 99 8853-0, www.intel.de
> > > Managing Directors: Christin Eisenschmid, Christian Lamprechter
> > > Chairperson of the Supervisory Board: Nicole Lau Registered Office:
> > > Munich Commercial Register: Amtsgericht Muenchen HRB 186928
> > >
> > >
> > > --
> > > Caml-list mailing list.  Subscription management and archives:
> > > https://sympa.inria.fr/sympa/arc/caml-list
> > > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > > Bug reports: http://caml.inria.fr/bin/caml-bugs
> Intel Deutschland GmbH
> Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> Tel: +49 89 99 8853-0, www.intel.de
> Managing Directors: Christin Eisenschmid, Christian Lamprechter
> Chairperson of the Supervisory Board: Nicole Lau
> Registered Office: Munich
> Commercial Register: Amtsgericht Muenchen HRB 186928
>

[-- Attachment #2: Type: text/html, Size: 9735 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Caml-list] Specify the default hash function for a type
  2016-05-14  8:41             ` Thomas Braibant
@ 2016-05-14  9:06               ` Soegtrop, Michael
  2016-05-17 13:03                 ` Soegtrop, Michael
  0 siblings, 1 reply; 16+ messages in thread
From: Soegtrop, Michael @ 2016-05-14  9:06 UTC (permalink / raw)
  To: Thomas Braibant; +Cc: Alain Frisch, OCaML Mailing List, Pierre Chambart

[-- Attachment #1: Type: text/plain, Size: 7944 bytes --]

Dear Tom,

my lists are typically 3..6 elements. I think in total 60 or so might be fine. I will give it a try.

Although, after I have written the hash and equality functions for the Hashtbl module interfce, I make use of the fact that I can easily deliberately ignore certain things – the starting point of this thread.

Maybe for larger structures it is just the right thing to do to write your own hash function. I wonder what would be an elegant way to do this? Collect the default-hash function values of the pieces in a list and compute the hash of this list?

Best regards,

Michael

From: Thomas Braibant [mailto:thomas.braibant@gmail.com]
Sent: Saturday, May 14, 2016 10:41 AM
To: Soegtrop, Michael <michael.soegtrop@intel.com>
Cc: Alain Frisch <alain.frisch@lexifi.com>; OCaML Mailing List <caml-list@inria.fr>; Pierre Chambart <pierre.chambart@ocamlpro.com>
Subject: RE: [Caml-list] Specify the default hash function for a type


Dear Michael,

256 might be too small for your use case if you have big lists in your records. A way to workaround this limitation is to use the seeded_hash function to fold an hash accumulator through your record.

That is, if your record has fields f1, ..., fn
you might want to write

let hash t =
   let x = seeded_hash (hash t.f1) (hash t.f2) in
   let x = seeded_hash x (hash t.f3) in
   ...
   seeded_hash x (hash t.fn)




On 14 May 2016 00:46, "Soegtrop, Michael" <michael.soegtrop@intel.com<mailto:michael.soegtrop@intel.com>> wrote:
Dear Tom,

thanks for the pointer to the manual. I googled, but couldn't find this info - should have read the manual first. Yes, 10 is for sure not enough in my case. I will try the hash function with a value adjusted to my case and compare with mine.

Best regards,

Michael

> -----Original Message-----
> From: Thomas Braibant [mailto:thomas.braibant@gmail.com<mailto:thomas.braibant@gmail.com>]
> Sent: Friday, May 13, 2016 8:57 PM
> To: Soegtrop, Michael
> Cc: Alain Frisch; Pierre Chambart; caml-list@inria.fr<mailto:caml-list@inria.fr>
> Subject: Re: [Caml-list] Specify the default hash function for a type
>
> The OCaml hash function is not bad in itself (it's a slightly modified version
> of Murmur 3, with a 32 bit state). There are a few difficult design decisions
> that are made in the main hash loop which can impact users though.
>
> I wonder if the one that you are hitting is the fact that by default the
> polymorphic hash function only consider a (small) amount of meaningful
> value, in a breadth first manner. The following program is interesting
>
> ```
> let rec make n acc = if n = 0 then acc else make (n - 1) (n :: acc);; let r = ref 0 in
> while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do incr r;
> done; !r;;
> - : int = 10
> ```
>
> The current default choice for the number of meaningful values is described
> here:
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html
> and it is indeed 10. You could try to increase the number of meaningful
> nodes that you want to consider in your code, to the max of 256.
>
> There is really a balance to strike here between speed of hashing and
> collisions.
>
> Tom
>
>
>
>
>
> On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael
> <michael.soegtrop@intel.com<mailto:michael.soegtrop@intel.com>> wrote:
> > Dear Ocaml Users,
> >
> > a possibly interesting data point in the discussion on polymorphic hashes:
> >
> > I replaced the polymorphic hash function for a record, which consists of
> smallish integers, bools, lists of integers and short strings, with my own
> hash function which looks like:
> >
> > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
> >
> > where vi are values and pi are distinct prime numbers. For lists I use 4
> primes alternatingly and 4 different primes for empty cases (implemented
> as 4 mutual recursive functions). The primes are random primes between
> 2^28 and 2^29, avoiding those primes where larger powers of 2 are close to 0
> modulo the prime. All arithmetic is done with plain ocaml ints and
> corresponding overflows. For strings I used the polymorphic hash function.
> >
> > The hash histogram for this hash function looks like this:
> >
> > num_bindings=1803454
> > num_buckets=16777216
> > max_bucket_length=24
> > (the histogram is "number of items in bucket=count of buckets with
> > this number of items")
> > 0=15356268 (empty buckets)
> > 1=1185329 (non-colliding buckets)
> > 2=135231
> > 3=80194
> > 4=11344
> > 5=2367
> > 6=2836
> > 7=2630
> > 8=244
> > 9=49
> > 10=20
> > 11=25
> > 12=64
> > 13=39
> > 14=100
> > 15=75
> > 16=33
> > 17=5
> > 18=149
> > 19=15
> > 20=3
> > 21=180
> > 22=14
> > 24=2
> >
> > This is roughly in line with random numbers:
> >
> > fill rate = 1803454/16777216 = 10.7%
> > 2 collisions = 0.8% ~ fill rate ^2
> > 3 collisions = 0.48% ~ 4 * fill rate ^3
> > 4 collisions = 0.068% ~ 5 * fill rate ^4
> >
> > The hash histogram for the same data and the polymorphic hash function
> looks like this:
> >
> > num_bindings=1803454
> > num_buckets=16777216
> > max_bucket_length=3343
> > 0=16730926
> > 1=5520
> > 2=3201
> > 3=2779
> > 4=1633
> > 5=1079
> > 6=3701
> > 7=672
> > 8=828
> > 9=1584
> > 10=600
> > 11=384
> > 12=2774
> > 13=404
> > 14=525
> > 15=500
> > 16=435
> > 17=358
> > 18=1406
> > 19=244
> > 20=504
> > 21=427
> > 22=316
> > 23=369
> > 24=837
> > 25=153
> > 26=250
> > 27=417
> > 28=191
> > 29=222
> > 30=501
> > 31=76
> > 32=196
> > 33=530
> > 34=142
> > 35=153
> > 36=859
> > 37=88
> > 38=178
> > 39=310
> > 40=147
> > 41=173
> > 42=313
> > 43=102
> > 44=235
> > 45=123
> > 46=149
> > 47=152
> > 48=244
> > 49=107
> > 50=173
> > :
> > :
> > 1001=1
> > 1002=1
> > 1005=1
> > 1006=3
> > 1009=1
> > 1010=1
> > 1013=1
> > 1014=2
> > 1019=1
> > 1020=1
> > 1029=1
> > 1031=1
> > 1034=1
> > 1035=1
> > 1036=1
> > 1064=1
> > 1082=1
> > 1184=1
> > 1191=1
> > 1196=1
> > 1197=1
> > 1200=1
> > 1206=1
> > 1207=1
> > 1219=1
> > 1225=1
> > 1228=1
> > 1232=1
> > 1566=1
> > 1581=1
> > 1636=1
> > 1698=1
> > 1720=2
> > 1737=2
> > 1740=1
> > 1744=1
> > 1752=1
> > 1762=1
> > 1789=1
> > 2255=1
> > 2271=1
> > 2287=1
> > 2319=1
> > 2332=1
> > 2345=1
> > 3296=1
> > 3325=1
> > 3343=1
> >
> > So there are many buckets with 1000s of collisions and only a few 1000
> collision free buckets.
> >
> > It might make sense to come up with a better polymorphic hashing
> function.
> >
> > Best regards,
> >
> > Michael
> > Intel Deutschland GmbH
> > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
> > Tel: +49 89 99 8853-0, www.intel.de<http://www.intel.de>
> > Managing Directors: Christin Eisenschmid, Christian Lamprechter
> > Chairperson of the Supervisory Board: Nicole Lau Registered Office:
> > Munich Commercial Register: Amtsgericht Muenchen HRB 186928
> >
> >
> > --
> > Caml-list mailing list.  Subscription management and archives:
> > https://sympa.inria.fr/sympa/arc/caml-list
> > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> > Bug reports: http://caml.inria.fr/bin/caml-bugs
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0<tel:%2B49%2089%2099%208853-0>, www.intel.de<http://www.intel.de>
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

[-- Attachment #2: Type: text/html, Size: 14866 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [Caml-list] Specify the default hash function for a type
  2016-05-14  9:06               ` Soegtrop, Michael
@ 2016-05-17 13:03                 ` Soegtrop, Michael
  0 siblings, 0 replies; 16+ messages in thread
From: Soegtrop, Michael @ 2016-05-17 13:03 UTC (permalink / raw)
  To: OCaML Mailing List

[-- Attachment #1: Type: text/plain, Size: 2951 bytes --]

Dear OCaml users,

to close this topic, here are some statistics for my use case with different hash functions. All hash methods I tried lead to identical statistics as random numbers. Simply using Hashtbl.hash on lists of hashes of substructures (Method 2 below) worked very well, is easy to write down and offers sufficient flexibility.

NItems   =  3277598
NBuckets = 16777216

        | Theoretical | Method1  |  Method2 |  Method3
--------+-------------+----------+----------+---------
Empty   |  13799905.0 | 13800062 | 13799864 | 13801452
Single  |   2695950.5 |  2695830 |  2696131 |  2693130
2 items |    263340.5 |   263101 |   263042 |   264318
3 items |     17148.7 |    17350 |    17368 |    17463
4 items |       837.5 |      849 |      779 |      822
5 items |        32.7 |       24 |       29 |       31
6 items |         1.1 |        0 |        3 |        3
--------+-------------+----------+----------+---------
Runtime |             |     119s |     122s |     110s

Note 1: the expected random deviation of the counts is sqrt(count).

Note 2: The speed differences between Method 1 and 2 are random – Method 2 should be faster.

Note 3: using Hashtbl.hash on the complete structure doesn't work for me for two reasons:
- parts of the structure should be ignored
- I have maps as substructures and require that logically equal maps result in the same hash

Theoretical:
collision numbers for a random process with same parameters (used exact Bernoulli statistics, not Poisson approximation)

Method1:
Hashtbl.hash used on structures recursively (create list of hash values for substructures and use Hashtbl.hash on list).
Maps (<10 elements) are converted to lists and the list is hashed with Hashtbl.hash.
Multiply each hash list entry with a distinct prime.
Use prime values for variants and prime multipliers for variant arguments.

Method2:
Hashtbl.hash used on structures recursively (create list of hash values for substructures and use Hashtbl.hash on list).
Maps (<10 elements) are converted to lists and the list is hashed with Hashtbl.hash.
Use prime values for variants and prime multipliers for variant arguments.

Method3:
( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4
where vi are values and pi are distinct prime numbers. For lists I use 4 primes alternatingly and 4 different primes for empty cases

Btw.: I added a function “worst_bucketlist” to Hashtbl, which returns the keys of the largest bucket. This is useful for optimizing (or bug fixing) hash and equality functions – try to make a feature request for it.

Best regards,

Michael
Intel Deutschland GmbH
Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, www.intel.de
Managing Directors: Christin Eisenschmid, Christian Lamprechter
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928

[-- Attachment #2: Type: text/html, Size: 13039 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-05-17 13:04 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-13  7:52 [Caml-list] Specify the default hash function for a type Soegtrop, Michael
2016-05-13  8:13 ` Ben Millwood
2016-05-13  8:40   ` Soegtrop, Michael
2016-05-13  9:06 ` Alain Frisch
2016-05-13  9:26   ` Soegtrop, Michael
2016-05-13 12:01     ` Gabriel Scherer
2016-05-13 12:23       ` Alain Frisch
2016-05-13 12:32       ` Soegtrop, Michael
2016-05-13 13:50   ` Pierre Chambart
2016-05-13 13:56     ` Alain Frisch
2016-05-13 16:17       ` Soegtrop, Michael
2016-05-13 18:57         ` Thomas Braibant
2016-05-13 22:45           ` Soegtrop, Michael
2016-05-14  8:41             ` Thomas Braibant
2016-05-14  9:06               ` Soegtrop, Michael
2016-05-17 13:03                 ` Soegtrop, Michael

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).