From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id A6F907F907 for ; Tue, 3 Jun 2014 15:37:44 +0200 (CEST) Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of yminsky@janestreet.com) identity=pra; client-ip=38.105.200.115; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="yminsky@janestreet.com"; x-sender="yminsky@janestreet.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of yminsky@janestreet.com designates 38.105.200.115 as permitted sender) identity=mailfrom; client-ip=38.105.200.115; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="yminsky@janestreet.com"; x-sender="yminsky@janestreet.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@mxout1.mail.janestreet.com) identity=helo; client-ip=38.105.200.115; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="yminsky@janestreet.com"; x-sender="postmaster@mxout1.mail.janestreet.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AlYBABLPjVMmachznGdsb2JhbABZg1lYgmyqDIxriHsBgQUeDgEBAQEBBhYJPIIlAQEBAwESER0BASMJCwEECwsLDQwBHQICIQESAQUBChIGExIMBIgMAwkIAwIIoBxqijB3hH8BBZlyAwqFQhEGiTODCYEuEVMEB4J1gUuEYwWTJ4F4gT6Lf4QDGCmFBIFS X-IPAS-Result: AlYBABLPjVMmachznGdsb2JhbABZg1lYgmyqDIxriHsBgQUeDgEBAQEBBhYJPIIlAQEBAwESER0BASMJCwEECwsLDQwBHQICIQESAQUBChIGExIMBIgMAwkIAwIIoBxqijB3hH8BBZlyAwqFQhEGiTODCYEuEVMEB4J1gUuEYwWTJ4F4gT6Lf4QDGCmFBIFS X-IronPort-AV: E=Sophos;i="4.98,965,1392159600"; d="scan'208";a="77804586" Received: from mxout2.mail.janestreet.com (HELO mxout1.mail.janestreet.com) ([38.105.200.115]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 03 Jun 2014 15:37:42 +0200 Received: from tot-smtp.delacy.com ([172.27.22.15] helo=tot-smtp) by mxout1.mail.janestreet.com with esmtps (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.82) (envelope-from ) id 1WrouL-000692-G0 for caml-list@inria.fr; Tue, 03 Jun 2014 09:37:41 -0400 Received: from tot-dmz-mxgoog1.delacy.com ([172.27.224.14] helo=mxgoog2.janestreet.com) by tot-smtp with esmtps (TLSv1:AES256-SHA:256) (Exim 4.72) (envelope-from ) id 1WrouK-0003xz-Sk for caml-list@inria.fr; Tue, 03 Jun 2014 09:37:40 -0400 Received: from mail-lb0-f174.google.com ([209.85.217.174]) by mxgoog2.janestreet.com with esmtp (Exim 4.76) (envelope-from ) id 1WrouK-00010n-Fs for caml-list@inria.fr; Tue, 03 Jun 2014 09:37:40 -0400 Received: by mail-lb0-f174.google.com with SMTP id n15so3467270lbi.33 for ; Tue, 03 Jun 2014 06:37:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=janestreet.com; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=Jgqpq16So4m9rskMUXYnVxs87jukhyGaagfZ5OrvHx0=; b=XsPENI/6GlrkmaIHV+zNZTblgIuQhPcaBKyFLaZmzzEqAd90Qkce2rdupxs4CBYxs0 Gh+RzfxA0DoMM2y6+hrxEn92J4Dkb/ZNC7FIufQKujFRlxngHIfly7skC696cRhoUusW 2Ia0rzdc7M7Ku4ETOfwGqNda1AQU3f/aGzcEU= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc:content-type; bh=Jgqpq16So4m9rskMUXYnVxs87jukhyGaagfZ5OrvHx0=; b=klzHY4ZRW8Dxzm6/UKcF4fTWjYDXz98IMFN5KS3OLC+pxaapuzpmsjfJaQOpLVP8fc JEjaoz1zxLZ7J0p7yBKcFOOftbXJRUE/2iITFxLOAI2x2U1A+U8FgGTAsTM+l22ZXZcY fTKwQhsQZl/L6IUEbVw/dU3xCpAvrBpnprHqc+EgvjMYc629wDGU6/PXWj84PUHF5JT+ vxOeIXZrSm6DWsIYileAcZe2KRlUbX4Eatx3yxSnFtmfyLCt4qvRPB9AHAHR7Idab7ZK /E5Yx8PT4LDIvgo9LMhQ8c7hdgqf8K7cARVNWk4ejx3EkXmJsfee30KGPKawElaWDSfs b6+A== X-Gm-Message-State: ALoCoQn6iAa5ni9k3uhSxKWmEpjKebUg/RY+3nxtOCoKneQspu/jq+GWhX99HzWKMbDeiA4xF1NCFBpnDVYOtsYhyziA94qhAVoPIVefI7WLdtzfmrUTFGPZD80HMy0sblsPhIo01UqT X-Received: by 10.152.28.228 with SMTP id e4mr35020342lah.3.1401802659800; Tue, 03 Jun 2014 06:37:39 -0700 (PDT) MIME-Version: 1.0 X-Received: by 10.152.28.228 with SMTP id e4mr35020329lah.3.1401802659687; Tue, 03 Jun 2014 06:37:39 -0700 (PDT) Received: by 10.112.3.39 with HTTP; Tue, 3 Jun 2014 06:37:38 -0700 (PDT) Received: by 10.112.3.39 with HTTP; Tue, 3 Jun 2014 06:37:38 -0700 (PDT) In-Reply-To: References: <543099239773658961@orange.fr> <1401716071976.85ecb8da@Nodemailer> Date: Tue, 3 Jun 2014 09:37:38 -0400 Message-ID: From: Yaron Minsky To: Gabriel Scherer Cc: caml-list@inria.fr, Andrew Herron , David Powers , Damien Guichard , Eric Stokes Content-Type: multipart/alternative; boundary=089e0160b69493c29a04faee9d7f Subject: Re: [Caml-list] Why AVL-tree? --089e0160b69493c29a04faee9d7f Content-Type: text/plain; charset=UTF-8 Looping in Eric Stokes, who did the benchmarking on that. On Jun 3, 2014 9:13 AM, "Gabriel Scherer" wrote: > Thanks Yaron, that is very interesting feedback. > > Would you happen to have the same kind of information about your > experiments with balanced tree buckets for Hashtable? I'm quite > interested in their good worst-case behavior, and considered > experimenting with such a structure for Batteries, but didn't have > time to look at it so far. > > On Tue, Jun 3, 2014 at 2:48 PM, Yaron Minsky > wrote: > > The following summary of what we do with respect to Maps and Sets in > > Core was written by David Powers (who isn't yet subscribe to the list, > > so he asked me to forward it on.) > > > > In Core we use a slight modification of the AVL tree found in the > > standard library. I think the biggest change (other than the > > interface) is that we add a specialized constructor (Leaf of 'key * > > 'value) as a specialization of Node (left * key * value * right) to > > limit allocation. It's a nice speed bump and doesn't do too much > > damage to the readability of the code. > > > > We also spent a bunch of time last summer working through the research > > papers of the last 10 years to see if we could find an implementation > > we liked better. I'd have to pull up the full history of the project > > to give real details, but we tried at least all of the following: > > > > - red-black trees > > - left-leaning red-black trees > > - treaps (including a variant that stored entropy in the spare bits in > > the variant tag) > > - splay trees > > - weight balanced trees > > - AVL trees with GADT enforcement of the invariants > > - 1-2 brother trees > > > > I'll lead with the caveat that benchmarking is hard, and these > > structures shine in different ways depending on the type of workload > > you throw at them. Each implementation below was also mostly a > > first-pass to understand the structure and do simple tests, so there > > may be more speed gold in the hills. Your mileage may vary. > > > > That said, our conclusions at the end: > > > > - red black trees are hard to code and understand (mostly due to > > remove), and don't show a real performance win. > > > > - treaps are a wonderful structure in terms of code simplicity, but > > getting enough randomness quickly enough is too costly to make them a > > win over AVL trees (you need to allocate just as much and you need to > > generate randomness) > > > > - splay trees are in our tree, but are too special purpose to be a > general win. > > > > - Weight balanced trees are a nice structure, and are used in other > > languages/libraries. They were neither better or worse than AVL > > trees. > > > > - AVL trees with GADT enforcement work, but were actually slower than > > straightforward AVL trees at the time we tested them. There is some > > extra matching due to the variant having more cases, so perhaps this > > isn't surprising. It's also likely that we didn't carry the > > 2-imbalance trick into the GADT version, which might have skewed the > > result. > > > > - 1-2 brother trees were the best of the lot, and we actually produced > > a version of the code that we felt was an overall win (or tie) for all > > workloads. Unfortunately, the optimizations we needed to get us there > > made the code much longer and harder to understand than the AVL tree > > code. We just couldn't convince ourselves that it was worth it. > > > > Probably the most important point is that nothing we did above gave a > > general win of more than 10-20% in the tight loop case. Given that, > > we kept our tweaked AVL tree implementation. If you want to be very > > very fast, you probably can't get away with a map, and if you just > > want to be "fast enough" the AVL tree we have is a nice set of > > tradeoffs for code complexity. > > > > On Mon, Jun 2, 2014 at 11:06 AM, Gabriel Scherer > > wrote: > >> Note that OCaml's balanced trees are not exactly what is usually > >> called AVL, as the imbalance between different branches can be at most > >> 2 (+1 on one side and -1 on the other) instead of just 1 as the > >> traditional definition assumes. > >> > >> On Mon, Jun 2, 2014 at 3:34 PM, Andrew Herron > wrote: > >>> Wikipedia has some notes on the difference: > >>> > >>> http://en.wikipedia.org/wiki/AVL_tree > >>> > >>> AVL has faster lookup, so maybe they decided to optimise for that. > >>> > >>> It's different to some other languages I've seen, but then so is their > >>> decision to not use a tail recursive List.map. Each to their own, it's > not > >>> hard to implement the alternative :) > >>> > >>> > >>> On Mon, Jun 2, 2014 at 11:21 PM, Damien Guichard > > >>> wrote: > >>>> > >>>> > >>>> Red-black tree would spare a machine word per node, because a > red-black > >>>> tree doesn't need depth information. > >>>> Hence the reason is either historical or a space/speed trade-off > >>>> (comparing two depths may be faster than pattern matching). > >>>> > >>>> Regards, > >>>> > >>>> damien guichard > >>>> > >>>> Hi, list, > >>>> > >>>> Just from the curiosity, why balanced binary trees used in Set and > Map are > >>>> AVL-trees, not their alternative, say, red-black trees? Is there a > deep > >>>> reason for it, or just a historical one? > >>>> > >>>> Best, > >>>> -- > >>>> Yoriyuki Yamagata > >>>> http://yoriyuki.info/ > >>>> > >>>> > >>> > >> > >> -- > >> Caml-list mailing list. Subscription management and archives: > >> https://sympa.inria.fr/sympa/arc/caml-list > >> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > >> Bug reports: http://caml.inria.fr/bin/caml-bugs > --089e0160b69493c29a04faee9d7f Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

Looping in Eric Stokes, who did the benchmarking on that.

On Jun 3, 2014 9:13 AM, "Gabriel Scherer&qu= ot; <gabriel.scherer@gmail.= com> wrote:
Thanks Yaron, that is very interesting feedback.

Would you happen to have the same kind of information about your
experiments with balanced tree buckets for Hashtable? I'm quite
interested in their good worst-case behavior, and considered
experimenting with such a structure for Batteries, but didn't have
time to look at it so far.

On Tue, Jun 3, 2014 at 2:48 PM, Yaron Minsky <yminsky@janestreet.com> wrote:
> The following summary of what we do with respect to Maps and Sets in > Core was written by David Powers (who isn't yet subscribe to the l= ist,
> so he asked me to forward it on.)
>
> In Core we use a slight modification of the AVL tree found in the
> standard library. =C2=A0I think the biggest change (other than the
> interface) is that we add a specialized constructor (Leaf of 'key = *
> 'value) as a specialization of Node (left * key * value * right) t= o
> limit allocation. =C2=A0It's a nice speed bump and doesn't do = too much
> damage to the readability of the code.
>
> We also spent a bunch of time last summer working through the research=
> papers of the last 10 years to see if we could find an implementation<= br> > we liked better. =C2=A0I'd have to pull up the full history of the= project
> to give real details, but we tried at least all of the following:
>
> - red-black trees
> - left-leaning red-black trees
> - treaps (including a variant that stored entropy in the spare bits in=
> the variant tag)
> - splay trees
> - weight balanced trees
> - AVL trees with GADT enforcement of the invariants
> - 1-2 brother trees
>
> I'll lead with the caveat that benchmarking is hard, and these
> structures shine in different ways depending on the type of workload > you throw at them. =C2=A0Each implementation below was also mostly a > first-pass to understand the structure and do simple tests, so there > may be more speed gold in the hills. =C2=A0Your mileage may vary.
>
> That said, our conclusions at the end:
>
> - red black trees are hard to code and understand (mostly due to
> remove), and don't show a real performance win.
>
> - treaps are a wonderful structure in terms of code simplicity, but
> getting enough randomness quickly enough is too costly to make them a<= br> > win over AVL trees (you need to allocate just as much and you need to<= br> > generate randomness)
>
> - splay trees are in our tree, but are too special purpose to be a gen= eral win.
>
> - Weight balanced trees are a nice structure, and are used in other
> languages/libraries. =C2=A0They were neither better or worse than AVL<= br> > trees.
>
> - AVL trees with GADT enforcement work, but were actually slower than<= br> > straightforward AVL trees at the time we tested them. =C2=A0There is s= ome
> extra matching due to the variant having more cases, so perhaps this > isn't surprising. =C2=A0It's also likely that we didn't ca= rry the
> 2-imbalance trick into the GADT version, which might have skewed the > result.
>
> - 1-2 brother trees were the best of the lot, and we actually produced=
> a version of the code that we felt was an overall win (or tie) for all=
> workloads. =C2=A0Unfortunately, the optimizations we needed to get us = there
> made the code much longer and harder to understand than the AVL tree > code. =C2=A0We just couldn't convince ourselves that it was worth = it.
>
> Probably the most important point is that nothing we did above gave a<= br> > general win of more than 10-20% in the tight loop case. =C2=A0Given th= at,
> we kept our tweaked AVL tree implementation. =C2=A0If you want to be v= ery
> very fast, you probably can't get away with a map, and if you just=
> want to be "fast enough" the AVL tree we have is a nice set = of
> tradeoffs for code complexity.
>
> On Mon, Jun 2, 2014 at 11:06 AM, Gabriel Scherer
> <gabriel.scherer@gmail= .com> wrote:
>> Note that OCaml's balanced trees are not exactly what is usual= ly
>> called AVL, as the imbalance between different branches can be at = most
>> 2 (+1 on one side and -1 on the other) instead of just 1 as the
>> traditional definition assumes.
>>
>> On Mon, Jun 2, 2014 at 3:34 PM, Andrew Herron <andrew.herron@gmail.com> wrote:
>>> Wikipedia has some notes on the difference:
>>>
>>> http://en.wikipedia.org/wiki/AVL_tree
>>>
>>> AVL has faster lookup, so maybe they decided to optimise for t= hat.
>>>
>>> It's different to some other languages I've seen, but = then so is their
>>> decision to not use a tail recursive List.map. Each to their o= wn, it's not
>>> hard to implement the alternative :)
>>>
>>>
>>> On Mon, Jun 2, 2014 at 11:21 PM, Damien Guichard <alphablock@orange.fr>
>>> wrote:
>>>>
>>>>
>>>> Red-black tree would spare a machine word per node, becaus= e a red-black
>>>> tree doesn't need depth information.
>>>> Hence the reason is either historical or a space/speed tra= de-off
>>>> (comparing two depths may be faster than pattern matching)= .
>>>>
>>>> Regards,
>>>>
>>>> damien guichard
>>>>
>>>> Hi, list,
>>>>
>>>> Just from the curiosity, why balanced binary trees used in= Set and Map are
>>>> AVL-trees, not their alternative, say, red-black trees? = =C2=A0Is there a deep
>>>> reason for it, or just a historical one?
>>>>
>>>> Best,
>>>> --
>>>> Yoriyuki Yamagata
>>>> http:/= /yoriyuki.info/
>>>>
>>>>
>>>
>>
>> --
>> Caml-list mailing list. =C2=A0Subscription management and archives= :
>> https://sympa.inria.fr/sympa/arc/caml-list
>> Beginner's list: http://groups.yahoo.com/group/ocaml_beginner= s
>> Bug reports: http://caml.inria.fr/bin/caml-bugs
--089e0160b69493c29a04faee9d7f--