Message-ID: <47CED80A.1010504@frisch.fr>
Date: Wed, 05 Mar 2008 18:27:38 +0100
From: Alain Frisch
To: Berke Durak
Cc: caml-list
Subject: Re: [Caml-list] Canonical Set/Map datastructure?
In-Reply-To: <47CECF23.1020508@exalead.com>

Berke Durak wrote:
> The Map and Set modules use AVL trees which are efficient but not
> canonical - a given set of elements can have more than one
> representation. This means that you cannot use ad hoc comparison on
> sets and maps, and this is why they are presented as functors.
>
> Does anyone know if, in the many years that have passed since the
> implementation of those fine modules, someone has invented a
> (functional) datastructure that is as efficient while being canonic?

Well, Patricia trees have been around for many years and they satisfy
this property. They also allow set operations (union, intersection, ...)
in linear time (and I explain below how this can be optimized into
something really efficient for some applications). Jean-Christophe
Filliâtre has an implementation on his web page.

Patricia trees work fine when the set elements can easily be represented
as strings of bits. So if you can map your elements to integers, that's
ok. Otherwise, you can hash-cons your elements to get unique integers
for them.

Something that Jean-Christophe's implementation doesn't do, but which is
quite easy to add, is to use hash-consing on the Patricia trees
themselves, that is, to memoize their constructors in order to get a
unique physical representation and maximal sharing.
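
A minimal sketch of what memoizing the constructors could look like, for
sets of integers. It is not taken from Jean-Christophe's code: the names
(`hashcons`, `leaf`, `branch`, `nodes`, the `id` field) are made up for
illustration, and a plain `Hashtbl` stands in for the weak hash table a
real implementation would probably want so that unused trees can be
collected.

```ocaml
(* Hash-consed little-endian Patricia trees over ints (illustrative sketch).
   Every node carries a unique [id]; nodes are only built through the smart
   constructors below, which consult a global table first. *)
type t = { id : int; node : node }
and node =
  | Empty
  | Leaf of int
  | Branch of int * int * t * t   (* prefix, branching bit, left, right *)

(* Table key: children are identified by their ids, so hashing and
   comparing a key never traverses a subtree. *)
type key = KLeaf of int | KBranch of int * int * int * int

(* Assumption: a strong table, for brevity; it never frees anything. *)
let nodes : (key, t) Hashtbl.t = Hashtbl.create 251
let next_id = ref 0

(* Return the existing node for [key], or register a fresh one. *)
let hashcons key node =
  match Hashtbl.find_opt nodes key with
  | Some t -> t
  | None ->
    incr next_id;
    let t = { id = !next_id; node } in
    Hashtbl.add nodes key t;
    t

let empty = { id = 0; node = Empty }
let leaf k = hashcons (KLeaf k) (Leaf k)
let branch p m l r = hashcons (KBranch (p, m, l.id, r.id)) (Branch (p, m, l, r))

let zero_bit k m = k land m = 0
let mask k m = k land (m - 1)
let match_prefix k p m = mask k m = p

(* Combine two trees whose prefixes [p0] and [p1] differ. *)
let join p0 t0 p1 t1 =
  let d = p0 lxor p1 in
  let m = d land (-d) in            (* lowest bit where the prefixes differ *)
  if zero_bit p0 m then branch (mask p0 m) m t0 t1
  else branch (mask p0 m) m t1 t0

let rec add k t =
  match t.node with
  | Empty -> leaf k
  | Leaf j -> if j = k then t else join k (leaf k) j t
  | Branch (p, m, l, r) ->
    if match_prefix k p m then
      if zero_bit k m then branch p m (add k l) r
      else branch p m l (add k r)
    else join k (leaf k) p t

(* Same elements, different insertion orders: one physical tree. *)
let s1 = List.fold_right add [1; 5; 9; 12] empty
let s2 = List.fold_right add [12; 9; 1; 5; 9] empty
let () = assert (s1 == s2 && s1.id = s2.id)
```

Because children are looked up by id, each constructor call costs one
hash-table probe on a small tuple of integers, whatever the size of the
subtrees.
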
That way, you get:

  structural equality = physical equality = set equality

With this property, set operations on Patricia trees can be optimized
using reflexivity (e.g. the inner loop of the union function can start
by checking whether its two arguments are equal). Also, you get a nice
unique integer for each tree. This allows you to memoize set operations
efficiently (for union or intersection, you can use memoization in the
inner loop, not only at toplevel), and to build sets of sets (and so
on).

-- Alain
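
As a rough illustration of the two optimizations described above (this is
not Alain's or Jean-Christophe's actual code), here is a sketch of a
union that first checks physical equality and then memoizes on the pair
of tree ids at every recursive call. It carries its own small
hash-consed tree so that it runs on its own. All names (`hashcons`,
`unions`, `hits`, ...) are hypothetical, elements are assumed to be
non-negative ints, and the plain `Hashtbl` tables would need to be weak
or bounded in real code.

```ocaml
(* Assumption: elements are non-negative ints, so branching bits are
   positive powers of two and can be compared with (<). *)
type t = { id : int; node : node }
and node = Empty | Leaf of int | Branch of int * int * t * t

type key = KLeaf of int | KBranch of int * int * int * int

let nodes : (key, t) Hashtbl.t = Hashtbl.create 251
let next_id = ref 0

(* Hash-consing constructor: one physical node per distinct set. *)
let hashcons key node =
  match Hashtbl.find_opt nodes key with
  | Some t -> t
  | None ->
    incr next_id;
    let t = { id = !next_id; node } in
    Hashtbl.add nodes key t;
    t

let empty = { id = 0; node = Empty }
let leaf k = hashcons (KLeaf k) (Leaf k)
let branch p m l r = hashcons (KBranch (p, m, l.id, r.id)) (Branch (p, m, l, r))

let zero_bit k m = k land m = 0
let mask k m = k land (m - 1)
let match_prefix k p m = mask k m = p

let join p0 t0 p1 t1 =
  let d = p0 lxor p1 in
  let m = d land (-d) in
  if zero_bit p0 m then branch (mask p0 m) m t0 t1
  else branch (mask p0 m) m t1 t0

let rec add k t =
  match t.node with
  | Empty -> leaf k
  | Leaf j -> if j = k then t else join k (leaf k) j t
  | Branch (p, m, l, r) ->
    if match_prefix k p m then
      if zero_bit k m then branch p m (add k l) r
      else branch p m l (add k r)
    else join k (leaf k) p t

(* Memo table for union, keyed on the (ordered) pair of tree ids. *)
let unions : (int * int, t) Hashtbl.t = Hashtbl.create 251
let hits = ref 0   (* number of memo hits, only here for the demo *)

let rec union s t =
  if s == t then s                          (* reflexivity shortcut *)
  else
    match s.node, t.node with
    | Empty, _ -> t
    | _, Empty -> s
    | Leaf k, _ -> add k t
    | _, Leaf k -> add k s
    | Branch (p, m, s0, s1), Branch (q, n, t0, t1) ->
      (* union is commutative, so order the ids to share entries *)
      let key = (min s.id t.id, max s.id t.id) in
      match Hashtbl.find_opt unions key with
      | Some r -> incr hits; r
      | None ->
        let r =
          if m = n && p = q then branch p m (union s0 t0) (union s1 t1)
          else if m < n && match_prefix q p m then
            (* all of [t] falls on one side of [s] *)
            if zero_bit q m then branch p m (union s0 t) s1
            else branch p m s0 (union s1 t)
          else if m > n && match_prefix p q n then
            (* all of [s] falls on one side of [t] *)
            if zero_bit p n then branch q n (union s t0) t1
            else branch q n t0 (union s t1)
          else join p s q t                 (* prefixes disagree *)
        in
        Hashtbl.add unions key r;
        r

let of_list l = List.fold_left (fun s k -> add k s) empty l
let a = of_list [1; 2; 3; 8]
let b = of_list [3; 4; 8; 16]
let u = union a b
let () = assert (u == of_list [1; 2; 3; 4; 8; 16])
```

Since the memo lookup sits inside the recursion, a later union that
shares subtrees with an earlier one can reuse partial results, not only
the final one.
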