From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id B7A4B7FA5C for ; Fri, 13 May 2016 20:57:51 +0200 (CEST) IronPort-PHdr: 9a23:2NXClx0M1a5Nxh7KsmDT+DRfVm0co7zxezQtwd8ZsekXIvad9pjvdHbS+e9qxAeQG96LurQe0qGP6f6ocFdDyKjCmUhKSIZLWR4BhJdetC0bK+nBN3fGKuX3ZTcxBsVIWQwt1Xi6NU9IBJS2PAWK8TWM5DIfUi/yKRBybrysXNWC3oLtjqvrocObSj4LrQT+SIs6FA+xowTVu5teqqpZAYF19CH0pGBVcf9d32JiKAHbtR/94sCt4MwrqHwI6LoJvvRNWqTifqk+UacQTHF/azh0t/vQqALbQACTynwZW2QQ2loUUkmWpC39C93KtSb1qvB6wG3SGMz9Tbk5XX7qu6JqQx/hhSNBLDk0/33NjdRYjaRHrRbnrBt6ld36eoaQYcZ+eabUZpswX2NAVM9XS2QVHoO7aoIUSeAbNOdSpo/hj1QLpBq6QwKrAbW8mXdzmnbq0PhigKwaGgbc0VllQosD Authentication-Results: mail2-smtp-roc.national.inria.fr; spf=None smtp.pra=thomas.braibant@gmail.com; spf=Pass smtp.mailfrom=thomas.braibant@gmail.com; spf=None smtp.helo=postmaster@mail-qg0-f47.google.com Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of thomas.braibant@gmail.com) identity=pra; client-ip=209.85.192.47; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="thomas.braibant@gmail.com"; x-sender="thomas.braibant@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of thomas.braibant@gmail.com designates 209.85.192.47 as permitted sender) identity=mailfrom; client-ip=209.85.192.47; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="thomas.braibant@gmail.com"; x-sender="thomas.braibant@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-qg0-f47.google.com) identity=helo; client-ip=209.85.192.47; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="thomas.braibant@gmail.com"; x-sender="postmaster@mail-qg0-f47.google.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A0D+AACWIjZXjy/AVdFeg1U3fganEYFThWOLGYF2IoVlCgKBJgc5EwEBAQEBAQEBEQEBAQEHCwsJIS+CLYIWAQEEDgQRHQEIEwkJCwEDDAYFCw0CAgkdAgIiAREBBQEKEgYTEhCHcgEDFw6jPYExPjGLO4FqglgFh0wKGScDClKDUQEBAQEBAQQBAQEBAQEZAgYQcYUkg0qBA4Q/gwCCWQWGOwyRYIFWhCiIIII3jGKOAxIegQ4iAoI3DBIKgU06MgGIWQEBAQ X-IPAS-Result: A0D+AACWIjZXjy/AVdFeg1U3fganEYFThWOLGYF2IoVlCgKBJgc5EwEBAQEBAQEBEQEBAQEHCwsJIS+CLYIWAQEEDgQRHQEIEwkJCwEDDAYFCw0CAgkdAgIiAREBBQEKEgYTEhCHcgEDFw6jPYExPjGLO4FqglgFh0wKGScDClKDUQEBAQEBAQQBAQEBAQEZAgYQcYUkg0qBA4Q/gwCCWQWGOwyRYIFWhCiIIII3jGKOAxIegQ4iAoI3DBIKgU06MgGIWQEBAQ X-IronPort-AV: E=Sophos;i="5.24,614,1454972400"; d="scan'208";a="218225564" Received: from mail-qg0-f47.google.com ([209.85.192.47]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/AES128-GCM-SHA256; 13 May 2016 20:57:50 +0200 Received: by mail-qg0-f47.google.com with SMTP id 90so62560467qgz.1 for ; Fri, 13 May 2016 11:57:50 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=8Paw1HyhgmQ1G4te+zb7AU/xMCxx0hbSlYJIhka2f1Q=; b=N7WYwzdts24DbUKXpkOCGrHJ2hx+ivMTit8iwWUGi5VpWRADYQkC+FRtxsc9DVVjbr upZALY32TcFOelZex2gw/UyAlkFuvaIy+JfblVnCSLDgpb27cO6dpTAs0Mhs0KwJQHNS ITNziv3BE4k59LXFxarBZ5nDdbrG8r1ogbbkZ2jRsK1VycPHOkDApbEGaY10Vq5mmyRz YeNGAD7evmudOV3vMb4+wIhqoFKzvWqyI0SJH+SNmxVowFe+wxvRsHe5iddNMQqBMVFp jTvOurE+2+Wo3eKaw6PX8xZ8PKmaTeIn2ssf8R7phl1qD1+rsaQAoW6ijy0u0vQFjIkP 1mlw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=8Paw1HyhgmQ1G4te+zb7AU/xMCxx0hbSlYJIhka2f1Q=; b=TuhRMAVfCZs7avakHS6mrM0aToL0Rt3Xrbuz9YqT8D5efl1wih3VgldTCOeA4dFb1A LSTPm1V7eYbcyXJh+rpNGGDGYcHRcMM07o9YotPzmGGV/KhYttY15p3RrXiAcVavjNJc aWWVOg6OhIthz+xxTrD4FYj7czQpl4NT63UCoHbesvQ6hTA29AOp+12kTm9IgoFEC6fk XQuNqxLrcom6XiNs7aL+5a5Vz3FR8xNQsrynwdS4eyIz5MZPYXfvR7WN1p/rC8CGuf7P waWKb5912WHf/sIhe3IE8y/rDSHly0sKh3to+J3JWnZyzgaCwTmASLJqZ3HRGNNt4E/r vMrw== X-Gm-Message-State: AOPr4FUZXImG52c2OUrPdrcGYYa8nl8bYXkolMYdquarcQ3sTp6vbXM3pOBasqOlmWoM3Gsh+acQ4MX3Vk+H+w== X-Received: by 10.140.144.204 with SMTP id 195mr17609806qhq.92.1463165869384; Fri, 13 May 2016 11:57:49 -0700 (PDT) MIME-Version: 1.0 Received: by 10.140.92.173 with HTTP; Fri, 13 May 2016 11:57:29 -0700 (PDT) In-Reply-To: <0F7D3B1B3C4B894D824F5B822E3E5A172CEF2D2F@IRSMSX102.ger.corp.intel.com> References: <0F7D3B1B3C4B894D824F5B822E3E5A172CEF1630@IRSMSX102.ger.corp.intel.com> <850765ff-25f1-ecb9-9707-74d3dc47675c@lexifi.com> <5735DBB9.5050101@ocamlpro.com> <44df5d5e-8ad9-1a86-7fda-bfc203bfb479@lexifi.com> <0F7D3B1B3C4B894D824F5B822E3E5A172CEF2D2F@IRSMSX102.ger.corp.intel.com> From: Thomas Braibant Date: Fri, 13 May 2016 20:57:29 +0200 Message-ID: To: "Soegtrop, Michael" Cc: Alain Frisch , Pierre Chambart , "caml-list@inria.fr" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [Caml-list] Specify the default hash function for a type The OCaml hash function is not bad in itself (it's a slightly modified version of Murmur 3, with a 32 bit state). There are a few difficult design decisions that are made in the main hash loop which can impact users though. I wonder if the one that you are hitting is the fact that by default the polymorphic hash function only consider a (small) amount of meaningful value, in a breadth first manner. The following program is interesting ``` let rec make n acc =3D if n =3D 0 then acc else make (n - 1) (n :: acc);; let r =3D ref 0 in while Hashtbl.hash (make !r []) <> Hashtbl.hash (make (!r + 1) []) do incr r; done; !r;; - : int =3D 10 ``` The current default choice for the number of meaningful values is described here: http://caml.inria.fr/pub/docs/manual-ocaml/libref/Hashtbl.html and it is indeed 10. You could try to increase the number of meaningful nodes that you want to consider in your code, to the max of 256. There is really a balance to strike here between speed of hashing and collisions. Tom On Fri, May 13, 2016 at 6:17 PM, Soegtrop, Michael wrote: > Dear Ocaml Users, > > a possibly interesting data point in the discussion on polymorphic hashes: > > I replaced the polymorphic hash function for a record, which consists of = smallish integers, bools, lists of integers and short strings, with my own = hash function which looks like: > > ( ( ( (v1 * p1 + v2) * p2) + v3 ) *p3 + v4) * p4 > > where vi are values and pi are distinct prime numbers. For lists I use 4 = primes alternatingly and 4 different primes for empty cases (implemented as= 4 mutual recursive functions). The primes are random primes between 2^28 a= nd 2^29, avoiding those primes where larger powers of 2 are close to 0 modu= lo the prime. All arithmetic is done with plain ocaml ints and correspondin= g overflows. For strings I used the polymorphic hash function. > > The hash histogram for this hash function looks like this: > > num_bindings=3D1803454 > num_buckets=3D16777216 > max_bucket_length=3D24 > (the histogram is "number of items in bucket=3Dcount of buckets with this= number of items") > 0=3D15356268 (empty buckets) > 1=3D1185329 (non-colliding buckets) > 2=3D135231 > 3=3D80194 > 4=3D11344 > 5=3D2367 > 6=3D2836 > 7=3D2630 > 8=3D244 > 9=3D49 > 10=3D20 > 11=3D25 > 12=3D64 > 13=3D39 > 14=3D100 > 15=3D75 > 16=3D33 > 17=3D5 > 18=3D149 > 19=3D15 > 20=3D3 > 21=3D180 > 22=3D14 > 24=3D2 > > This is roughly in line with random numbers: > > fill rate =3D 1803454/16777216 =3D 10.7% > 2 collisions =3D 0.8% ~ fill rate ^2 > 3 collisions =3D 0.48% ~ 4 * fill rate ^3 > 4 collisions =3D 0.068% ~ 5 * fill rate ^4 > > The hash histogram for the same data and the polymorphic hash function lo= oks like this: > > num_bindings=3D1803454 > num_buckets=3D16777216 > max_bucket_length=3D3343 > 0=3D16730926 > 1=3D5520 > 2=3D3201 > 3=3D2779 > 4=3D1633 > 5=3D1079 > 6=3D3701 > 7=3D672 > 8=3D828 > 9=3D1584 > 10=3D600 > 11=3D384 > 12=3D2774 > 13=3D404 > 14=3D525 > 15=3D500 > 16=3D435 > 17=3D358 > 18=3D1406 > 19=3D244 > 20=3D504 > 21=3D427 > 22=3D316 > 23=3D369 > 24=3D837 > 25=3D153 > 26=3D250 > 27=3D417 > 28=3D191 > 29=3D222 > 30=3D501 > 31=3D76 > 32=3D196 > 33=3D530 > 34=3D142 > 35=3D153 > 36=3D859 > 37=3D88 > 38=3D178 > 39=3D310 > 40=3D147 > 41=3D173 > 42=3D313 > 43=3D102 > 44=3D235 > 45=3D123 > 46=3D149 > 47=3D152 > 48=3D244 > 49=3D107 > 50=3D173 > : > : > 1001=3D1 > 1002=3D1 > 1005=3D1 > 1006=3D3 > 1009=3D1 > 1010=3D1 > 1013=3D1 > 1014=3D2 > 1019=3D1 > 1020=3D1 > 1029=3D1 > 1031=3D1 > 1034=3D1 > 1035=3D1 > 1036=3D1 > 1064=3D1 > 1082=3D1 > 1184=3D1 > 1191=3D1 > 1196=3D1 > 1197=3D1 > 1200=3D1 > 1206=3D1 > 1207=3D1 > 1219=3D1 > 1225=3D1 > 1228=3D1 > 1232=3D1 > 1566=3D1 > 1581=3D1 > 1636=3D1 > 1698=3D1 > 1720=3D2 > 1737=3D2 > 1740=3D1 > 1744=3D1 > 1752=3D1 > 1762=3D1 > 1789=3D1 > 2255=3D1 > 2271=3D1 > 2287=3D1 > 2319=3D1 > 2332=3D1 > 2345=3D1 > 3296=3D1 > 3325=3D1 > 3343=3D1 > > So there are many buckets with 1000s of collisions and only a few 1000 co= llision free buckets. > > It might make sense to come up with a better polymorphic hashing function. > > Best regards, > > Michael > Intel Deutschland GmbH > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany > Tel: +49 89 99 8853-0, www.intel.de > Managing Directors: Christin Eisenschmid, Christian Lamprechter > Chairperson of the Supervisory Board: Nicole Lau > Registered Office: Munich > Commercial Register: Amtsgericht Muenchen HRB 186928 > > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa.inria.fr/sympa/arc/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs