caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Goswin von Brederlow <goswin-v-b@web.de>
To: Gerd Stolpmann <info@gerd-stolpmann.de>
Cc: "Richard W.M. Jones" <rich@annexia.org>, caml-list@inria.fr
Subject: Re: [Caml-list] Hardening [Perl's] hash function further
Date: Mon, 25 Nov 2013 14:38:28 +0100	[thread overview]
Message-ID: <20131125133828.GD3610@frosties> (raw)
In-Reply-To: <1384819720.4083.57.camel@zotac>

On Tue, Nov 19, 2013 at 01:08:40AM +0100, Gerd Stolpmann wrote:
> Am Montag, den 18.11.2013, 20:44 +0000 schrieb Richard W.M. Jones:
> > The Perl community have found further attacks against their
> > (already considered hardened) hash function.  This article
> > explains the problems quite well:
> > 
> > http://blog.booking.com/hardening-perls-hash-function.html
> 
> Interesting read. I'm wondering which conclusions one should draw from
> it. This article is more or less discussing cryptoattacks on an insecure
> hash function, and the new versions of the functions are only "better"
> but in no way safe (which you only get with crypto hashes or
> dictionaries that don't allow collisions). This sounds for me like a
> race between attackers and hash implementors.
> 
> Gerd

I notice that the problem isn't exploting hash collisions but
collisions in the bucket being used in the hashtable. So no matter how
good the hash is the table still breaks down.

1) the hashtable isn't randomized by default
   - the bucket being used is predictable so you can make long chains

Why not just always use the randomized hash? Why only use it when you
think you have detected an abuse? That is just silly.

2) on long chains the hashtable size is doubled
   - each bucket is split in two and assuming a good hash each will
     get half the entries

You don't double your size. The old and new size should not have a
common divisor, esspecially not 2. Best is to use primes but not
fastest. Using 2^n-1 is tempting since (x mod 2^n-1) is fast to
compute without needing a long division.

If the old and new size have no common divisor then the entries of the
overly long chain are distributed over *all* the buckets. So instead
of getting 2 buckets with n/2 entries you get buckets with n/size
entries.

3) hashtables aren't re-randomized on resize

When you resize a hashtable you have to sort every key into a new
bucket. So why not change the randomize of the hash function at that
time. If the randomizer is applied after the hash is computed this can
be done quickly without having to rehash every key. Takes extra memory
to store the non randomized hash value of each key though. If you
don't want to spend that memory you can rehash every key, which also
allows using a randomized hash function instead of randomizing the
result of the hash.

4) hashtable sizes are predictable

You can also randomize the size. Nobody says you have to (nearly)
double the size in a predictable way. You can have a long list of good
sizes (e.g. prime numbers) and randomly pick one that is between 1.8
and 2.2 times (or whatever grows factor you want) the size as the old
size. The randomness will make the size of the table unpredictable
making it that much harder to construct a key collision.

Note: if a table is resized due to an overlong chain without being
reasonable full I would just resize it a tiny bit instead of about
double. That gets rid of the collisions without doubling the memory
used.

MfG
	Goswin

PS: is anyone working on fixing the stdlib hashtable?

      parent reply	other threads:[~2013-11-25 13:38 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-18 20:44 Richard W.M. Jones
2013-11-19  0:08 ` Gerd Stolpmann
2013-11-19  7:53   ` David MENTRE
2013-11-19  8:50     ` Richard W.M. Jones
2013-11-19  9:14     ` Gabriel Scherer
2013-11-19 11:19       ` Dario Teixeira
2013-11-19 12:55         ` rixed
2013-11-19 22:18           ` Nicolas Braud-Santoni
2013-11-19 22:39             ` Eric Cooper
2013-11-19 22:55               ` Nicolas Braud-Santoni
2013-11-25 13:46                 ` Goswin von Brederlow
2013-11-19 22:31         ` Nicolas Braud-Santoni
2013-11-20 18:56         ` Florian Weimer
2013-11-20 21:47           ` Gerd Stolpmann
2013-11-25 13:51             ` Goswin von Brederlow
2013-11-25 14:43               ` Yaron Minsky
2013-11-19 22:15     ` Nicolas Braud-Santoni
2013-11-25 13:38   ` Goswin von Brederlow [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131125133828.GD3610@frosties \
    --to=goswin-v-b@web.de \
    --cc=caml-list@inria.fr \
    --cc=info@gerd-stolpmann.de \
    --cc=rich@annexia.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).