From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by sympa.inria.fr (Postfix) with ESMTPS id 7289E7FA56 for ; Sun, 3 Aug 2014 23:25:23 +0200 (CEST) Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of dofp.ocaml@gmail.com) identity=pra; client-ip=209.85.220.44; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="dofp.ocaml@gmail.com"; x-sender="dofp.ocaml@gmail.com"; x-conformance=sidf_compatible Received-SPF: Pass (mail2-smtp-roc.national.inria.fr: domain of dofp.ocaml@gmail.com designates 209.85.220.44 as permitted sender) identity=mailfrom; client-ip=209.85.220.44; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="dofp.ocaml@gmail.com"; x-sender="dofp.ocaml@gmail.com"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: None (mail2-smtp-roc.national.inria.fr: no sender authenticity information available from domain of postmaster@mail-pa0-f44.google.com) identity=helo; client-ip=209.85.220.44; receiver=mail2-smtp-roc.national.inria.fr; envelope-from="dofp.ocaml@gmail.com"; x-sender="postmaster@mail-pa0-f44.google.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AoACAMin3lPRVdwslGdsb2JhbABag19XBIJ0q14BBoJqmG+BY4dMAYECCBYQAQEBAQcLCwkSK4QEAQEDARIRBBkBGwkJCwEDAQsGAwILGh0CAiIBEQEFAQoSBiUQiAsBAwkIDZIukCBqiymBcoMQih8KGScDCmSGKxEBBQ6FbolMBAeCeYFSAQSEfwWXAoFUkRkYKYUAOy8 X-IPAS-Result: AoACAMin3lPRVdwslGdsb2JhbABag19XBIJ0q14BBoJqmG+BY4dMAYECCBYQAQEBAQcLCwkSK4QEAQEDARIRBBkBGwkJCwEDAQsGAwILGh0CAiIBEQEFAQoSBiUQiAsBAwkIDZIukCBqiymBcoMQih8KGScDCmSGKxEBBQ6FbolMBAeCeYFSAQSEfwWXAoFUkRkYKYUAOy8 X-IronPort-AV: E=Sophos;i="5.01,794,1400018400"; d="scan'208";a="88125553" Received: from mail-pa0-f44.google.com ([209.85.220.44]) by mail2-smtp-roc.national.inria.fr with ESMTP/TLS/RC4-SHA; 03 Aug 2014 23:25:22 +0200 Received: by mail-pa0-f44.google.com with SMTP id eu11so8836613pac.3 for ; Sun, 03 Aug 2014 14:25:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=tSssW0fv5dr/mBYVcI5UwPJINK+pNP424NBDp/QVXn8=; b=BXziHj3NamJrIGFuo1AuMuyb7XF2e1uE0FkG90EIOlG2jx48hM5ESaopKJZe2/q/7X sJlJcVw7ex3jZYVMD4Yhza08du4/N2CE24HUXArfXvQveiuyGUJqTQlNCXKsntF24mGo ajVew+9BwwGnBmO0bl/07yK4CUs5eZBIJVQL7bUbnuSABzKwnOwZsDigS6yyOFWo9s32 /2GArbuFewr6UwI7FShFHaVkBNYUrpcCKHamE8BtmgF9vTJ99Gjnx1Syz80PjKyHmoz1 GkpaK39TQjgiUywZ5uEIlkIR5yLy3Sp1UCbguSi0xSeIrAmYtzKlg9rReOfXgy7a+dKU 3lxA== MIME-Version: 1.0 X-Received: by 10.66.100.170 with SMTP id ez10mr19563795pab.12.1407101120710; Sun, 03 Aug 2014 14:25:20 -0700 (PDT) Received: by 10.70.50.225 with HTTP; Sun, 3 Aug 2014 14:25:20 -0700 (PDT) In-Reply-To: <538CAD12.2040104@inria.fr> References: <543099239773658961@orange.fr> <1401716071976.85ecb8da@Nodemailer> <538CAD12.2040104@inria.fr> Date: Sun, 3 Aug 2014 23:25:20 +0200 Message-ID: From: Diego Olivier Fernandez Pons To: Xavier Leroy Cc: caml-list Content-Type: multipart/alternative; boundary=047d7bf17dcc768d0104ffc04223 Subject: Re: [Caml-list] Why AVL-tree? --047d7bf17dcc768d0104ffc04223 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Hi > At the time Set was written, no efficient algorithms for whole-set > operations (union, intersection, difference, etc) were known for > red-black trees. I'm not sure they are known today. > The answer is no and I will try to explain. When doing a set union, at some point you will have to merge two disjoint sets represented each by a tree [1 .. 10] union [15 .. 22] At that point you need to know the "size" of each tree because if both trees are the same size you just need to append them return Node ([1..10], [15..22]) while if one tree is bigger than the other, you will need to cut that tree into smaller pieces and reorganize them (a "rotation"). divide [1..10] into [1..5], [6..10] return Node([1..5], Node ([6..10], [15..22])) The AVL implementation of OCaml gives you the "size" of each tree, its the extra int in the constructor 'a tree =3D Empty | Node 'a tree * 'a * 'a tree * int <--- here it is equal to the longest path in the tree f =3D function E -> 0 | Node (l, _, r, h) -> 1 + max (f l, fr) If you implement Weight-balanced trees, that value is the cardinal of the set represented by the tree. If you use red-black or "traditional AVL" representation, that extra bit just tells you if the tree is leaning towards the left or towards the right, but doesn't tell you any information about the global "size" of the tree. Which means you cannot implement set union directly. You have to traverse full tree in O(n) to recompute that information. There is at least one library in Haskell that implements AVL trees that way, it uses -1 | 0 | 1 information for most operations and recomputes the longest path for all trees when doing set operations, then discards that info. There is also a way to implement red black trees to do set operations just like (non-traditional) AVLs. It is described in Olivi=C3=A9 H.J A new class= of balanced search trees : half balanced search trees, RAIRO Informatique Th=C3=A9orique 16, 51 71 - also in his PhD Thesis. Basically a tree is red-black when shortest-branch * 2 >=3D longest-branch (cf. Constructing Red-Black trees, Ralf Hinze 1999 - easier to find than Olivi=C3=A9's works) So you can implement a tree structure that memorizes both the shortest and the longest branch in each node 'a tree =3D E | N of 'a tree * 'a * 'a tree * int * int <---- shortest and longest >From there you just follow the same scheme than the AVL implementation in the OCaml lib, you just need to cope with a couple of triple rotation cases. > As for performance of insert/lookup operations, Jean-Christophe > Filli=C3=A2tre has measurements showing that OCaml's 2-imbalance AVL trees > perform better than red-black trees. It all depends on your ratio of > insertions to lookups, of course. But the 2-imbalance trick makes a > big difference with textbook AVL trees. > This is a bit more difficult but in short "red-black trees do not implement truly red-black trees" and "all implementations of red-black trees implement a different approximation of red-black trees". Any implementation of red-black trees that uses the 1 bit trick and where insertions are in log (n) is not surjective. You can understand that as meaning (1) - some red-black trees will never be built by the balancing algorithm (2) - sometimes you will give the balancing algorithm a red-black tree (=3D has a provable red-black coloring) and it will still rebalance it instead of noticing it's already balanced and returning it untouched. The reason is insertion in red-black trees using the 1 bit trick is linear. Meaning if I want a balancing algorithm that never falls into (1) or (2) I cannot avoid linear time insertion. The result is all algorithms found in the literature are log(n) approximations of the full algorithm. The only algorithm that is truly logarithmic is again Olivi=C3=A9's thanks to it's implicit representation of the red-black coloring. >From there, comparing performance of AVLs with "red-black trees" in general makes no sense as the person that implemented them may have implemented any arbitrary approximation of the full algorithm. This translates in the code as a random number of "cases" to check for balancing : 4 for Okasaki, 27 in the CLR, etc. With double, triple, quadruple rotations, according to how "deep" the author was willing to go. Because what these guys are doing is "unfolding" the linear behavior of the algorithm into a constant and falling back into rebuilding any tree they are given for all other cases. In any case, some benchmarks I did very, very, very longtime ago showed AVL and Weight balanced are good all-purpose implementations. The advantage of WB being it gives you the cardinality of the set in O(1). In some sense, AVL is the log of WB (in the same way the height of a tree is the log of its size). Diego Olivier > > - Xavier Leroy > > -- > Caml-list mailing list. Subscription management and archives: > https://sympa.inria.fr/sympa/arc/caml-list > Beginner's list: http://groups.yahoo.com/group/ocaml_beginners > Bug reports: http://caml.inria.fr/bin/caml-bugs > --047d7bf17dcc768d0104ffc04223 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
=C2=A0 =C2=A0 Hi
=C2=A0
At the time Set was written, no efficient algorithms for wh= ole-set
operations (union, intersection, difference, etc) were known for
red-black trees. =C2=A0I'm not sure they are known today.

The answer is no and I will try to explain.

When doing a set union, at some point you will have to mer= ge two disjoint sets represented each by a tree

[1 .. 10] union [15 .. 22]

At = that point you need to know the "size" of each tree because if bo= th trees are the same size you just need to append them

=C2=A0 =C2=A0 return Node ([1..10], [15..22])

=
while if one tree is bigger than the other, you will need to cut that = tree into smaller pieces and reorganize them (a "rotation").

=C2=A0 =C2=A0 divide [1..10] into [1..5], [6..10]
= =C2=A0 =C2=A0 return Node([1..5], Node ([6..10], [15..22]))
<= br>
The AVL implementation of OCaml gives you the "size"= ; of each tree, its the extra int in the constructor=C2=A0

=C2=A0 =C2=A0 'a tree =3D Empty | Node 'a tree = * 'a * 'a tree * int <--- here

it is eq= ual to the longest path in the tree

=C2=A0 =C2=A0 = f =3D function E -> 0 | Node (l, _, r, h) -> 1 + max (f l, fr)

If you implement Weight-balanced trees, that value is t= he cardinal of the set represented by the tree.

If you use red-black or "traditional AVL" representation, that = extra bit just tells you if the tree is leaning towards the left or towards= the right, but doesn't tell you any information about the global "= ;size" of the tree. Which means you cannot implement set union directl= y. You have to traverse full tree in O(n) to recompute that information.

There is at least one library in Haskell that implement= s AVL trees that way, it uses -1 | 0 | 1 information for most operations an= d recomputes the longest path for all trees when doing set operations, then= discards that info.

There is also a way to implement red black trees to do = set operations just like (non-traditional) AVLs. It is described in Olivi= =C3=A9 H.J A new class of balanced search trees : half balanced search tree= s, RAIRO Informatique Th=C3=A9orique 16, 51 71 - also in his PhD Thesis.

Basically a tree is red-black when shortest-branch * 2 = >=3D longest-branch (cf. Constructing Red-Black trees, Ralf Hinze 1999 -= easier to find than Olivi=C3=A9's works)

So you can implement a= tree structure that memorizes both the shortest and=C2=A0the longest branc= h in each node

=C2=A0 =C2=A0 'a tree =3D E | N of 'a tre= e * 'a * 'a tree * int * int <---- shortest and longest

From there you just follow the same scheme than the AVL im= plementation in the OCaml lib, you just need to cope with a couple of tripl= e rotation cases.

=C2=A0
As for performance of inse= rt/lookup operations, Jean-Christophe
Filli=C3=A2tre has measurements showing that OCaml's 2-imbalance AVL tr= ees
perform better than red-black trees. =C2=A0It all depends on your ratio of<= br> insertions to lookups, of course. =C2=A0But the 2-imbalance trick makes a big difference with textbook AVL trees.


This is a bit more difficult but in short "red-black t= rees do not implement truly red-black trees" and "all implementat= ions of red-black trees implement a different approximation of red-black tr= ees".

Any implementation of red-black trees that uses the 1 b= it trick and where insertions are in log (n) is not surjective. You can und= erstand that as meaning
(1) - some red-black trees will never be = built by the balancing algorithm
(2) - sometimes you will give the balancing algorithm a red-black tree= (=3D has a provable red-black coloring) and it will still rebalance it ins= tead of noticing it's already balanced and returning it untouched.

The reason is insertion in red-black trees using the 1 = bit trick is linear. Meaning if I want a balancing algorithm that never fal= ls into (1) or (2) I cannot avoid linear time insertion. The result is all = algorithms found in the literature are log(n) approximations of the full al= gorithm. The only algorithm that is truly logarithmic is again Olivi=C3=A9&= #39;s thanks to it's implicit representation of the red-black coloring.=

From there, comparing performance of AVLs with "re= d-black trees" in general makes no sense as the person that implemente= d them may have implemented any arbitrary approximation of the full algorit= hm. This translates in the code as a random number of "cases" to = check for balancing : 4 for Okasaki, 27 in the CLR, etc. With double, tripl= e, quadruple rotations, according to how "deep" the author was wi= lling to go. Because what these guys are doing is "unfolding" the= linear behavior of the algorithm into a constant and falling back into reb= uilding any tree they are given for all other cases.

In any case, some benchmarks I did very, very, very lon= gtime ago showed AVL and Weight balanced are good all-purpose implementatio= ns. The advantage of WB being it gives you the cardinality of the set in O(= 1). In some sense, AVL is the log of WB (in the same way the height of a tr= ee is the log of its size).

=C2=A0 =C2=A0 =C2=A0 =C2=A0 Diego Olivier
=C2= =A0

- Xavier Leroy

--
Caml-list mailing list. =C2=A0Subscription management and archives:
ht= tps://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

--047d7bf17dcc768d0104ffc04223--