From mboxrd@z Thu Jan 1 00:00:00 1970 Received: by margaux.inria.fr, Thu, 6 May 93 10:22:37 +0200 Message-Id: <9370.732388392@margaux.inria.fr> Subject: sets To: caml-redistribution@margaux Date: Thu, 6 May 1993 10:22:36 +0200 (MET DST) X-Mailer: ELM [version 2.4 PL13] Content-Type: text/plain This is a very long message about sets. Read the abstract and skip it, if it does not interest you. Abstract: Sets are a very useful data structure, but are very difficult to implement. The basic problem is that ML polymorphism is not adequate to a proper representation of sets. It's not a problem about modules, but about polymorphism. We need a more powerful type system to get really efficient sets. As Xavier pointed out, a naive implementation based on lists is very easy to implement. Its drawback is evidently efficiency (O(n^2) for basic operations union and intersection). But as far as I know, it is the only one which is absolutely compatible with ML polymorphism: you get a truly polymorphic empty set and a truly polymorphic function to add an element to a set. If we agree that a library implementing sets must be very efficient or useless, then the problem becomes very hard. The reason is that there exists many clever ways to represent sets, each one adapted to a particular usage of sets and to different underlying type. (I guess we want to implement finite and homogeneous sets over any ML type.) Consider for instance sets based on an enumerated type: if this type has less than, say 32 or 31 constant constructors, you have to represent sets over this type as integers so that union and intersection cost only one machine instruction. If the enumerated type has more than 32 constant constructors, you must represent sets on this type as strings (considered as bit vectors) in order to get a still very fast implementation of union and intersection. If the underlying type gets non constant constructors, things become even worse: it can be the case that you have to stick to the representation of sets as string, if there are not many elements of the underlying type used in the sets manipulated by the program, otherwise you must adopt a tree-like representation (balanced tree or even lists in some cases). You may choose even more complex representation, such that a mixt of bit vector and balanced trees. Well, my claim is that efficient set implementation requires various representations of sets coexisting at the same time in the program, either a single implementation for each basic universe (underlying type), or even better a dynamically modified scheme of representation for the sets of a given universe. This is very difficult to handle in ML, since data representation should be uniform to be compatible with polymorphism (and we want polymorphic sets, don't we?). My own opinion is that this is not a question of modules, but a problem of typechecking system of the core language. This is analoguous to I/Os and printing. There exists some situations where you need to write functions which are able to run differently according to the type of their arguments or results (polymorphic write and read as in Pascal). The problem is clear when you imagine to offer a polymorphic map_set function in the set library: map_set is the higher-order function on sets, equivalent to map on lists. Thus map_set has type: For all 'a, 'b. ('a -> 'b) -> 'a set -> 'b set. The ``object oriented'' approach of Damien or Didier is the trivial solution that comes in mind in the first 5 minutes when you imagine to implement sets. Since it took 5 minutes of Damien's or Didier's time to be found, it's a very clever idea (I'm not flaming). Thus, this method works very fine when you design functions which get as arguments every sets involved in the problem: they dynamically asked to these arguments ``Hey guy, what's your real implementation?'', get the answer and work accordingly. This method (not in the OO sense!) completely fails for map_set: you have to ask to the function what can be the representation of a set built with elements of its co-domain type! We don't want to add this ``method'' to each closure, for the sole reason that we want to implement sets. Well you can say: I don't mind map_set. You can, but that's not a good way to deal with the problem: the problem is a real one, it will not desappear like that. What about singleton : 'a -> 'a set ? or empty : 'a set ? There are other examples of this scheme in the field of primitives over ``maps'' (in the set litterature sense). We need a type-system to treat this cases, and in particular polymorphic printing (for user's convenience and debugging) and input (I want a safe function read : unit -> 'a). Catherine Dubois and I have are working since now 1.5 year about set implementation in Caml. She knows very well set implementation, and I know the Caml system. Thus, we got very efficient schemes to represent sets, and all was fine, until we faced the map_set problem. We then turned to the design of a new type-system for Caml integrating parametric polymorphism and a new one which allows the IMPLEMENTATION of map_set. Thanks to this new type-system we can write the map_set primitive. Without it, even the most dirty tricks (including OO features), those which cannot be told in manuals, failed to implement the map_set function (providing you keep it its natural and required type). For the time being, we are writing a paper (article or research report) on this type-system. Wait and see ... Pierre Weis ---------------------------------------------------------------------------- Formel Project INRIA, BP 105, F-78153 Le Chesnay Cedex (France) E-mail: Pierre.Weis@inria.fr Telephone: +33 1 39 63 55 98 ----------------------------------------------------------------------------