From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id FAA08642; Wed, 20 Mar 2002 05:45:29 +0100 (MET) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id FAA22939 for ; Wed, 20 Mar 2002 05:45:28 +0100 (MET) Received: from lobus.fungible.com (adsl-64-161-114-6.dsl.snfc21.pacbell.net [64.161.114.6]) by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id g2K4jQ903539 for ; Wed, 20 Mar 2002 05:45:27 +0100 (MET) Received: by lobus.fungible.com (Postfix, from userid 1000) id A9A328056; Tue, 19 Mar 2002 20:42:59 -0800 (PST) Message-Id: <1191-Tue19Mar2002204259-0800-tim@fungible.com> X-Mailer: emacs 21.1.1 (via feedmail 8 I) Date: Tue, 19 Mar 2002 19:46:24 -0700 From-Tims-Fingers: true To: georg.g@home.se Cc: garrigue@kurims.kyoto-u.ac.jp, caml-list@inria.fr In-reply-to: <00d901c1cf94$c5fa7700$f58c72d5@invariant.se> (message from =?iso-8859-1?Q?Johan_Georg_Granstr=F6m?= on Tue, 19 Mar 2002 23:21:02 +0100) Subject: Hashing research (was Re: [Caml-list] Big executables ...) References: <9601-Sat16Mar2002094255-0800-tim@fungible.com><20020318101225C.garrigue@kurims.kyoto-u.ac.jp><1191-Sun17Mar2002193326-0800-tim@fungible.com> <20020318142017J.garrigue@kurims.kyoto-u.ac.jp> <00d901c1cf94$c5fa7700$f58c72d5@invariant.se> From: tim@fungible.com (Tim Freeman) Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk From: georg.g@home.se >IMHO this a perfect research problem: > >Find a mapping H:S->B where S is the set of module signatures and >B is the set of binary (arbitrary length) strings. Such that if and only if >s_1 is a subset of s_2 then there is some relation between H(s_1) and >H(s_2), thus s_1 >Perhaps you could drop "and only if" and let H(s_1) R H(s_2) imply >s_1 < s_2 with 99.9...% certainty. I think you can't do it with constant-sized hashes. For instance, if s_2 has 100 elements, then it has 2 ** 100 subsets. Since R has to behave correctly on most of those 2 ** 100 subsets, those subsets need to have almost 2 ** 100 different hashes, so your hash can't be less than 100 bits. You have to know the name for each entry point into the library anyway so you can do the linking. We could just have one hash for the type per entry point. Hmm; MD5 is only 16 bytes, or 32 bytes of hex, or 22 bytes of base 62 (digits plus upper and lower case letters), so maybe we just append the MD5 checksum to the end of the symbol. If that's too much and we're willing to have less-than-cryptographic security we could truncate the added checksum to whatever number of bits is small enough and still have a very good chance of getting the right answer. -- Tim Freeman tim@fungible.com ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners