From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@yquem.inria.fr Delivered-To: caml-list@yquem.inria.fr Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by yquem.inria.fr (Postfix) with ESMTP id EE68FBBAF for ; Sat, 20 Feb 2010 11:52:08 +0100 (CET) X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Ao4BAGNNf0vZSMDqkGdsb2JhbACbCxUBAQEBCQkMBxMDH7tHhGgE X-IronPort-AV: E=Sophos;i="4.49,507,1262559600"; d="scan'208";a="53176501" Received: from fmmailgate03.web.de ([217.72.192.234]) by mail1-smtp-roc.national.inria.fr with ESMTP; 20 Feb 2010 11:52:08 +0100 Received: from smtp06.web.de (fmsmtp06.dlan.cinetic.de [172.20.5.172]) by fmmailgate03.web.de (Postfix) with ESMTP id 344C013EF9886; Sat, 20 Feb 2010 11:52:08 +0100 (CET) Received: from [95.208.117.111] (helo=frosties.localdomain) by smtp06.web.de with asmtp (TLSv1:AES256-SHA:256) (WEB.DE 4.110 #314) id 1NimwS-0003bs-00; Sat, 20 Feb 2010 11:52:08 +0100 Received: from mrvn by frosties.localdomain with local (Exim 4.71) (envelope-from ) id 1NimwR-0003t0-8x; Sat, 20 Feb 2010 11:52:07 +0100 From: Goswin von Brederlow To: Grgoire Seux Cc: caml-list Subject: Re: [Caml-list] range of hash function In-Reply-To: <1ae8fe881002200054s2e9752d8x620695c3c4e27939@mail.gmail.com> ("Grgoire Seux"'s message of "Sat, 20 Feb 2010 14:24:23 +0530") References: <1ae8fe881002200054s2e9752d8x620695c3c4e27939@mail.gmail.com> User-Agent: Gnus/5.110009 (No Gnus v0.9) XEmacs/21.4.22 (linux, no MULE) Date: Sat, 20 Feb 2010 11:52:05 +0100 Message-ID: <87fx4w6uy2.fsf@frosties.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: 8bit Sender: goswin-v-b@web.de X-Sender: goswin-v-b@web.de X-Provags-ID: V01U2FsdGVkX19JglDH8SVMDFTC5/agowkuluw5ynUerNjWPSxn ZchrD+qchREtl5JZ17tJ/U9/BTLQ2mKwlvywZr+bNeMIZdQPt9 5sOmgfwL0= X-Spam: no; 0.00; hash:01 hash:01 hashes:01 hashtbl:01 hashtbl:01 mfg:98 polymorphic:01 caml-list:01 unsafe:01 probability:01 int:01 int:01 writes:01 strings:01 variant:02 Grégoire Seux writes: > hello ! > > i would like to use the polymorhpic hash function on strings. But i would like > to know what is the probability of a collision between two hashes.  > > my first question is about the range of the Hashtbl.hash function: what is its > range ? ( string -> [1..N] ?) It is range [min_int..max_int]. Meaning 31 or 63 bits. > the second question is : can i assume that the result is a uniform distribution > over [1..N] ? (for 10? words which is an estimation of the english vocabulary > size) If the function is any good it will be reasonably uniform. And if it is crap then I bet someone have replaced it already. > the third one is : is it possible to predict which will be the collision ? I > mean collisions are between words which are very 'similar' (for ex: "boy" and > "boys") or are completely unpredictable. You can read the source, analyze the math and try to figure out a collision. From past experience every widely used hash function so far has been shown to be vulnerable. Like md5 is now considered compromised due to the ease of creating a collision. Sha1 is deemed unsafe too. It is just a matter of how much brainpower and cpu time you want to throw at it. Or pure bad luck. Is the Hashtbl.hash the same function used to hash polymorphic variant types? If so then there are collisions between words which are verry similar. I knew of an example once but google isn't verry helpfull finding it again. MfG Goswin