caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* range of hash function
@ 2010-02-20  8:54 Grégoire Seux
  2010-02-20 10:52 ` [Caml-list] " Goswin von Brederlow
  2010-02-20 13:54 ` Jacques Garrigue
  0 siblings, 2 replies; 5+ messages in thread
From: Grégoire Seux @ 2010-02-20  8:54 UTC (permalink / raw)
  To: caml-list

[-- Attachment #1: Type: text/plain, Size: 681 bytes --]

hello !

i would like to use the polymorhpic hash function on strings. But i would
like to know what is the probability of a collision between two hashes.

my first question is about the range of the Hashtbl.hash function: what is
its range ? ( string -> [1..N] ?)

the second question is : can i assume that the result is a uniform
distribution over [1..N] ? (for 10⁶ words which is an estimation of the
english vocabulary size)

the third one is : is it possible to predict which will be the collision ? I
mean collisions are between words which are very 'similar' (for ex: "boy"
and "boys") or are completely unpredictable.

thanks !


-- 
Grégoire Seux

[-- Attachment #2: Type: text/html, Size: 859 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] range of hash function
  2010-02-20  8:54 range of hash function Grégoire Seux
@ 2010-02-20 10:52 ` Goswin von Brederlow
  2010-02-20 12:39   ` ygrek
  2010-02-20 13:54 ` Jacques Garrigue
  1 sibling, 1 reply; 5+ messages in thread
From: Goswin von Brederlow @ 2010-02-20 10:52 UTC (permalink / raw)
  To: Grgoire Seux; +Cc: caml-list

Grégoire Seux <kamaradclimber@gmail.com> writes:

> hello !
>
> i would like to use the polymorhpic hash function on strings. But i would like
> to know what is the probability of a collision between two hashes. 
>
> my first question is about the range of the Hashtbl.hash function: what is its
> range ? ( string -> [1..N] ?)

It is range [min_int..max_int]. Meaning 31 or 63 bits.

> the second question is : can i assume that the result is a uniform distribution
> over [1..N] ? (for 10? words which is an estimation of the english vocabulary
> size)

If the function is any good it will be reasonably uniform. And if it is
crap then I bet someone have replaced it already.

> the third one is : is it possible to predict which will be the collision ? I
> mean collisions are between words which are very 'similar' (for ex: "boy" and
> "boys") or are completely unpredictable.

You can read the source, analyze the math and try to figure out a
collision. From past experience every widely used hash function so far
has been shown to be vulnerable. Like md5 is now considered compromised
due to the ease of creating a collision. Sha1 is deemed unsafe too. It
is just a matter of how much brainpower and cpu time you want to throw
at it. Or pure bad luck.

Is the Hashtbl.hash the same function used to hash polymorphic variant
types? If so then there are collisions between words which are verry
similar. I knew of an example once but google isn't verry helpfull
finding it again.

MfG
        Goswin


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] range of hash function
  2010-02-20 10:52 ` [Caml-list] " Goswin von Brederlow
@ 2010-02-20 12:39   ` ygrek
  0 siblings, 0 replies; 5+ messages in thread
From: ygrek @ 2010-02-20 12:39 UTC (permalink / raw)
  To: caml-list

On Sat, 20 Feb 2010 11:52:05 +0100
Goswin von Brederlow <goswin-v-b@web.de> wrote:

> Grégoire Seux <kamaradclimber@gmail.com> writes:
> 
> > hello !
> >
> > i would like to use the polymorhpic hash function on strings. But i would like
> > to know what is the probability of a collision between two hashes. 
> >
> > my first question is about the range of the Hashtbl.hash function: what is its
> > range ? ( string -> [1..N] ?)
> 
> It is range [min_int..max_int]. Meaning 31 or 63 bits.
> 
> > the second question is : can i assume that the result is a uniform distribution
> > over [1..N] ? (for 10? words which is an estimation of the english vocabulary
> > size)

The best way is to test yourself. Or rely on someone else's tests :)

http://alaska-kamtchatka.blogspot.com/2009/06/hash-n-mix.html
 
> You can read the source, analyze the math and try to figure out a
> collision. From past experience every widely used hash function so far
> has been shown to be vulnerable. Like md5 is now considered compromised
> due to the ease of creating a collision. Sha1 is deemed unsafe too. It
> is just a matter of how much brainpower and cpu time you want to throw
> at it. Or pure bad luck.

Hashtbl.hash function is not required to be cryptographically strong, but rather
uniformly random on "usual" input data and fast.

-- 
 ygrek
 http://ygrek.org.ua


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] range of hash function
  2010-02-20  8:54 range of hash function Grégoire Seux
  2010-02-20 10:52 ` [Caml-list] " Goswin von Brederlow
@ 2010-02-20 13:54 ` Jacques Garrigue
  2010-02-21 23:10   ` Richard Jones
  1 sibling, 1 reply; 5+ messages in thread
From: Jacques Garrigue @ 2010-02-20 13:54 UTC (permalink / raw)
  To: kamaradclimber; +Cc: caml-list

From: Grégoire Seux <kamaradclimber@gmail.com>
> i would like to use the polymorhpic hash function on strings. But i would
> like to know what is the probability of a collision between two hashes.
> 
> my first question is about the range of the Hashtbl.hash function: what is
> its range ? ( string -> [1..N] ?)

Just to get things straight: this is 0..2^30-1 (0..0x3fffffff).
The result of the hash function is the same on 32-bit and 64-bit
architectures.

> the second question is : can i assume that the result is a uniform
> distribution over [1..N] ? (for 10⁶ words which is an estimation of the
> english vocabulary size)

The algorithm for strings is as follows:

      i = caml_string_length(obj);
      for (p = &Byte_u(obj, 0); i > 0; i--, p++)
        hash_accu = hash_accu * 19 + *p;

So you can assume an uniform distribution for sufficiently long
strings.

> the third one is : is it possible to predict which will be the collision ? I
> mean collisions are between words which are very 'similar' (for ex: "boy"
> and "boys") or are completely unpredictable.

Since you have the algorithm, you can predict collisions. Basically
shifting character n by 1 is equivalent to shifting character n+1 by
19, so you have lots of collisions. But this hash function being
intended for hashtables, collisions are not a problem, uniform
distribution matters more.

By the way, for polymorphic variants collisions matter, and the hash
function is different. The range is 31-bits rather than 30-bits, and
the factor is 243, so that names of no more than 4 characters are
guaranteed to be different. You still have collisions, but they are
going to be less similar.

Both hash functions are defined in byterun/hash.c.

Hope this helps,

Jacques Garrigue

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Caml-list] range of hash function
  2010-02-20 13:54 ` Jacques Garrigue
@ 2010-02-21 23:10   ` Richard Jones
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Jones @ 2010-02-21 23:10 UTC (permalink / raw)
  To: Jacques Garrigue; +Cc: kamaradclimber, caml-list

On Sat, Feb 20, 2010 at 10:54:12PM +0900, Jacques Garrigue wrote:
> Since you have the algorithm, you can predict collisions. Basically
> shifting character n by 1 is equivalent to shifting character n+1 by
> 19, so you have lots of collisions. But this hash function being
> intended for hashtables, collisions are not a problem, uniform
> distribution matters more.

I would slightly dispute your assertion that collisions are not a
problem, because of algorithmic complexity attacks:

http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/index.html

The hash implementation used by Perl was changed to avoid this attack:

http://perl5.git.perl.org/perl.git/blob/HEAD:/perl.c#l1465

Rich.

-- 
Richard Jones
Red Hat


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-02-21 23:10 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-20  8:54 range of hash function Grégoire Seux
2010-02-20 10:52 ` [Caml-list] " Goswin von Brederlow
2010-02-20 12:39   ` ygrek
2010-02-20 13:54 ` Jacques Garrigue
2010-02-21 23:10   ` Richard Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).