On 19/11/2013 08:53, David MENTRE wrote:
> Hello,
>
> 2013/11/19 Gerd Stolpmann <info@gerd-stolpmann.de>:
>> This article is more or less discussing cryptoattacks on an insecure
>> hash function, and the new versions of the functions are only "better"
>> but in no way safe (which you only get with crypto hashes or
>> dictionaries that don't allow collisions).
> Which in turn raises the question: Who has the responsibility to
> implement such hashes that can be exposed to the Internet: OCaml
> standard library or dedicated frameworks like OCamlNet or Ocsigen? In
> other words, shouldn't we keep simple hashes in the OCaml standard
> library (for people making regular programs) and let people making web
> applications chose the proper crypto hash in a specialized library?
>
> Best regards,
> david

Hello,

I believe that, before discussing changes to Hashtbl, we need to
know the “parameters” of our performance/security tradeoff:
1. We need actual performance data to know what is currently costly (in
seeded_hash).
A big part of the code in byterun/hash.c is actually concerned with
traversal of the data-structure being hashed;
is there any information about how much time is actually spent hashing
the data, versus traversal & memory access times ?

2. What security guarantees do we want Hashtbl.seeded_hash to provide?
On the security side, as Gabriel Scherer mentionned, “randomized
hashing” failed because little was known about the security of the hash
functions that were used (such as CityHash or MurmurHash).
a. There are several classical notions that may apply:
- Universal Hash Function: *theoretically* suitable for use in hash
tables, even when data is chosen by an adversary (as long as the random
parameter is kept secret);
- Universal One-Way Hash Function (pronounced âwoofâ): stronger
guarantees wrt. the hardness of collision (as long as the random
parameter is kept secret).

However, this still can be said to be insecure, as Hashtbl leaks
information about the random parameter:
for instance, Hashtbl.iter (currently) iterates the table in the order
of the hashes of the keys (mod 2^s);
naively displaying a table of data by iterating over the hash table
leaks some information about this parameter.

Moreover, there is no (as far as I know) UH function that is both fast,
gives good uniformity in practice (nevermind secure).

b. recent research has yielded some cryptographic hash functions, like
SipHash[1], which are:
- believed to be strong[2];
- have speed comparable with âcommonâ non-crypto hash functions,
notably on short inputs (but we need to evaluate the speed of seeded_hash);
- have good statistical properties.

Regarding the suggestion of libraries implementing hashes suitable for
use with untrusted data[3], I see 3 problems:
1. It is difficult for a library author to guess whether his library
will be used in a context where a âmore secureâ Hashtbl
implementation is required (and functorizing everything is problematic);
that's part of why having a âreasonable secureâ default seems
important (to me, at least).
2. Implementing a polymorphic hash function is rather non-trivial.
3. Most users do not use secure primitives, especially if the default
option doesn't document its security trade-off.

Regards,
Nicolas

[1] https://131002.net/siphash/
[2] “strong” means here that it is claimed to be a PRF (e.g. not
effectively distinguishable from a random oracle)
[3] It isn't necessarily “over the Internet”, and the hashes themselves
need not be exposed for the security to matter.