Hi
 
At the time Set was written, no efficient algorithms for whole-set
operations (union, intersection, difference, etc) were known for
red-black trees.  I'm not sure they are known today.

The answer is no and I will try to explain.

When doing a set union, at some point you will have to merge two disjoint sets represented each by a tree

[1 .. 10] union [15 .. 22]

At that point you need to know the "size" of each tree because if both trees are the same size you just need to append them

    return Node ([1..10], [15..22])

while if one tree is bigger than the other, you will need to cut that tree into smaller pieces and reorganize them (a "rotation").

    divide [1..10] into [1..5], [6..10]
    return Node([1..5], Node ([6..10], [15..22]))

The AVL implementation of OCaml gives you the "size" of each tree, its the extra int in the constructor 

    'a tree = Empty | Node 'a tree * 'a * 'a tree * int <--- here

it is equal to the longest path in the tree

    f = function E -> 0 | Node (l, _, r, h) -> 1 + max (f l, fr)

If you implement Weight-balanced trees, that value is the cardinal of the set represented by the tree.

If you use red-black or "traditional AVL" representation, that extra bit just tells you if the tree is leaning towards the left or towards the right, but doesn't tell you any information about the global "size" of the tree. Which means you cannot implement set union directly. You have to traverse full tree in O(n) to recompute that information.

There is at least one library in Haskell that implements AVL trees that way, it uses -1 | 0 | 1 information for most operations and recomputes the longest path for all trees when doing set operations, then discards that info.

There is also a way to implement red black trees to do set operations just like (non-traditional) AVLs. It is described in Olivié H.J A new class of balanced search trees : half balanced search trees, RAIRO Informatique Théorique 16, 51 71 - also in his PhD Thesis.

Basically a tree is red-black when shortest-branch * 2 >= longest-branch (cf. Constructing Red-Black trees, Ralf Hinze 1999 - easier to find than Olivié's works)

So you can implement a tree structure that memorizes both the shortest and the longest branch in each node

    'a tree = E | N of 'a tree * 'a * 'a tree * int * int <---- shortest and longest

From there you just follow the same scheme than the AVL implementation in the OCaml lib, you just need to cope with a couple of triple rotation cases.

 
As for performance of insert/lookup operations, Jean-Christophe
Filliâtre has measurements showing that OCaml's 2-imbalance AVL trees
perform better than red-black trees.  It all depends on your ratio of
insertions to lookups, of course.  But the 2-imbalance trick makes a
big difference with textbook AVL trees.


This is a bit more difficult but in short "red-black trees do not implement truly red-black trees" and "all implementations of red-black trees implement a different approximation of red-black trees".

Any implementation of red-black trees that uses the 1 bit trick and where insertions are in log (n) is not surjective. You can understand that as meaning
(1) - some red-black trees will never be built by the balancing algorithm
(2) - sometimes you will give the balancing algorithm a red-black tree (= has a provable red-black coloring) and it will still rebalance it instead of noticing it's already balanced and returning it untouched.

The reason is insertion in red-black trees using the 1 bit trick is linear. Meaning if I want a balancing algorithm that never falls into (1) or (2) I cannot avoid linear time insertion. The result is all algorithms found in the literature are log(n) approximations of the full algorithm. The only algorithm that is truly logarithmic is again Olivié's thanks to it's implicit representation of the red-black coloring.

From there, comparing performance of AVLs with "red-black trees" in general makes no sense as the person that implemented them may have implemented any arbitrary approximation of the full algorithm. This translates in the code as a random number of "cases" to check for balancing : 4 for Okasaki, 27 in the CLR, etc. With double, triple, quadruple rotations, according to how "deep" the author was willing to go. Because what these guys are doing is "unfolding" the linear behavior of the algorithm into a constant and falling back into rebuilding any tree they are given for all other cases.

In any case, some benchmarks I did very, very, very longtime ago showed AVL and Weight balanced are good all-purpose implementations. The advantage of WB being it gives you the cardinality of the set in O(1). In some sense, AVL is the log of WB (in the same way the height of a tree is the log of its size).

        Diego Olivier
 

- Xavier Leroy

--
Caml-list mailing list.  Subscription management and archives:
https://sympa.inria.fr/sympa/arc/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs