From mboxrd@z Thu Jan  1 00:00:00 1970
X-Sympa-To: caml-list@inria.fr
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id q18AivJZ021429
	for <caml-list@sympa-roc.inria.fr>; Wed, 8 Feb 2012 11:44:57 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AksDAChRMk+BryEplGdsb2JhbABAA4UNqlYiAQEBAQkLJgQjgXMBBSNWEAsaAiYCAh84BhOHfwanHYQzjV6BL4oYBwICAggBAQQNBAYBNAaCfxkEAwwDFAVRAQIHBUOCDIEWBI0JiCKSXw
X-IronPort-AV: E=Sophos;i="4.73,383,1325458800"; 
   d="scan'208";a="143364601"
Received: from smtp1.u-psud.fr ([129.175.33.41])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 08 Feb 2012 11:44:52 +0100
Received: from smtp1.u-psud.fr (localhost [127.0.0.1])
	by localhost (MTA) with SMTP id 04BEA2559B2;
	Wed,  8 Feb 2012 11:44:52 +0100 (CET)
Received: from ext.lri.fr (ext.lri.fr [129.175.15.4])
	by smtp1.u-psud.fr (MTA) with ESMTP id BF42F25596F;
	Wed,  8 Feb 2012 11:44:51 +0100 (CET)
Received: from localhost (localhost [127.0.0.1])
	by ext.lri.fr (Postfix) with ESMTP id BA0F8408F0;
	Wed,  8 Feb 2012 11:44:51 +0100 (CET)
X-Virus-Scanned: Debian amavisd-new at lri.fr
Received: from ext.lri.fr ([127.0.0.1])
	by localhost (ext.lri.fr [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id CVhNUHEcK7e2; Wed,  8 Feb 2012 11:44:51 +0100 (CET)
Received: from localhost.localdomain (AMontsouris-651-1-185-43.w82-123.abo.wanadoo.fr [82.123.184.43])
	(using TLSv1 with cipher DHE-RSA-AES128-SHA (128/128 bits))
	(No client certificate requested)
	by ext.lri.fr (Postfix) with ESMTPSA id 63CA5408C8;
	Wed,  8 Feb 2012 11:44:51 +0100 (CET)
Date: Wed, 8 Feb 2012 11:46:30 +0100
From: AUGER =?UTF-8?B?Q8OpZHJpYw==?= <Cedric.Auger@lri.fr>
To: Goswin von Brederlow <goswin-v-b@web.de>
Cc: "Gerd Stolpmann" <info@gerd-stolpmann.de>, caml-list@inria.fr
Message-ID: <20120208114630.4dcd531d@lri.fr>
In-Reply-To: <8739alfys0.fsf@frosties.localnet>
References: <1325263446.5036.104.camel@samsung>
	<87zkd531kl.fsf@frosties.localnet>
	<403c4e4bb2cfce801aad217c80365879.squirrel@gps.dynxs.de>
	<8739alfys0.fsf@frosties.localnet>
X-Mailer: Claws Mail 3.8.0 (GTK+ 2.24.10; i686-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by walapai.inria.fr id q18AivJZ021429
X-Validation-by: cedric.auger@lri.fr
Subject: Re: [Caml-list] Hashtbl and security

Le Wed, 08 Feb 2012 10:41:03 +0100,
Goswin von Brederlow <goswin-v-b@web.de> a écrit :

> "Gerd Stolpmann" <info@gerd-stolpmann.de> writes:
> 
> >> Gerd Stolpmann <info@gerd-stolpmann.de> writes:
> >> You are forgetting a variable in this. To create a DOS in the
> >> hashtable you need to hit the same bucket again and again. And
> >> bucket = hash % size.
> >>
> >> You focused on the hash but lets talk about the size for a moment.
> >>
> >> The size is rather limited and large hashtabels simply won't
> >> increate size anymore and allways have lots of collisions. So even
> >> if someone isn't actively trying to create DOS the performace
> >> still tanks as you add more items.
> >
> > I cannot completely follow here. If you add more elements, the
> > bucket array is resized. The ocaml Hashtbl does this when the
> > number of elements exceeds half of the array size. The argument of
> > Hashtbl.create is only the initial size of the bucket array.
> >
> > The upper bound is Sys.max_array_length, of course. Granted, on 32
> > bit platforms it is low (4M). But nobody of sound mind would run
> > ocaml programs on 32 bit for production purposes. (If you do,
> > consider to switch. You can use more memory and get a little
> > performance boost, not only for ocaml programs.)
> >
> > Gerd
> 
> In practice I see a serious performance degradation with large
> hashtables. So much so that I had to use an array of hashtables or
> hashtable of hashtables and split the values across them.
> 
> Maybe the size of the hashtable indeed isn't the problem but the hash
> function. Here is something odd:
> 
> # for i = 1 to 63 do Printf.printf "%d %d\n" (1 lsl i) (Hashtbl.hash
> (1 lsl i)) done;; 2 2
> 4 4
> 8 8
> 16 16
> 32 32
> 64 64
> 128 128
> 256 256
> 512 512
> 1024 1024
> 2048 2048
> 4096 4096
> 8192 8192
> 16384 16384
> 32768 32768
> 65536 65536
> 131072 131072
> 262144 262144
> 524288 524288
> 1048576 1048576
> 2097152 2097152
> 4194304 4194304
> 8388608 8388608
> 16777216 16777216
> 33554432 33554432
> 67108864 67108864
> 134217728 134217728
> 268435456 268435456
> 536870912 536870912
> 1073741824 0
> 2147483648 0
> 4294967296 0
> 8589934592 0
> 17179869184 0
> 34359738368 0
> 68719476736 0
> 137438953472 0
> 274877906944 0
> 549755813888 0
> 1099511627776 0
> 2199023255552 0
> 4398046511104 0
> 8796093022208 0
> 17592186044416 0
> 35184372088832 0
> 70368744177664 0
> 140737488355328 0
> 281474976710656 0
> 562949953421312 0
> 1125899906842624 0
> 2251799813685248 0
> 4503599627370496 0
> 9007199254740992 0
> 18014398509481984 0
> 36028797018963968 0
> 72057594037927936 0
> 144115188075855872 0
> 288230376151711744 0
> 576460752303423488 0
> 1152921504606846976 0
> 2305843009213693952 0
> -4611686018427387904 0
> 0 0
> - : unit = ()
> 
> Is every value over a billion hashed to 0?
> 
> MfG
>         Goswin
> 

Go to:
http://caml.inria.fr/cgi-bin/viewvc.cgi/ocaml/trunk/byterun/hash.c?view=markup
line 285

if you want another hasfunction, you can download the sources, modify
it and recompile.

More seriously, hashtables are here to store values, so you don't want
want to have a hash value bigger than (2^30)-1.

In fact even such a value is too big for that purpouse IMHO, since on a
4Gio RAM computer, it will take half it to store the table, assuming
each entry only takes 4 octets, that is it would be ok only for storing
just the structure of the table (each entry has to be a pointer),
so that in practice, either your table is ridiculously big to store
too few values either it is relevant to have such a size and in this
case your computer will spend all its time to swap since you won't
have enough RAM.

 Clearly you cannot a bijection from 63 bits
integers to 31 bits integers; so there are a lot of collisions. I do
not have a 64 arch, but I guesse that if you hash "2^48+168" you will
get "168". I guess such a function is ways to be ideal, since when
playing with bitvectors having two integers which differs by only one
bit is very common, but at least it is a fast function, and has a good
repartition (you have 2^32 collisions exactly per bucket).