From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id LAA28987; Mon, 2 Sep 2002 11:42:25 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id LAA28902 for ; Mon, 2 Sep 2002 11:42:24 +0200 (MET DST) Received: from shiva.jussieu.fr (shiva.jussieu.fr [134.157.0.129]) by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id g829gND16153 for ; Mon, 2 Sep 2002 11:42:23 +0200 (MET DST) Received: from ibm3.cicrp.jussieu.fr (ibm3.cicrp.jussieu.fr [134.157.15.3]) by shiva.jussieu.fr (8.12.5/jtpda-5.4) with ESMTP id g829gMjR086001 ; Mon, 2 Sep 2002 11:42:23 +0200 (CEST) Received: from ibm1.cicrp.jussieu.fr (ibm1.cicrp.jussieu.fr [134.157.15.1]) by ibm3.cicrp.jussieu.fr (8.8.8/jtpda/mob-V8) with ESMTP id LAA34714 ; Mon, 2 Sep 2002 11:40:43 +0200 Received: from localhost (fernande@localhost) by ibm1.cicrp.jussieu.fr (8.8.8/jtpda/mob-v8) with SMTP id LAA61898 ; Mon, 2 Sep 2002 11:40:42 +0200 Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST) From: Diego Olivier Fernandez Pons To: John Max Skaller cc: Caml-list@inria.fr Subject: [Caml-list] Explaining bit sets In-Reply-To: <3D70203F.1000106@ozemail.com.au> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=iso-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE X-Antivirus: scanned by sophie at shiva.jussieu.fr Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk Bonjour, I will try to explain how does the bitSet module work. A bit set is just =E0 n-ary tree with n =3D 30 and an integer data in each node : instead of just having type tree =3D Empty | Node of int * tree * tree you have type tree =3D Empty | Node of int * tree array (I really use an other representation which is mosty equivalent, but saves some space and makes algorithms easier :=20 tree =3D N of int * int array) When you call (insert [1,15,2,3,17] my_set) the only thing you are doing is set to 1 the 17h bit of the integer located in the first tree first level, 15th tree second level, 2nd tree third level, 3rd tree 4th level.=20 Now, all the game is to find a convenient decomposition for each value you want to save, the actual module provide two classical function - big-endian (base 30) example : path 15123 -> [16,24,3]=20 path 15124 -> [16,24,4] (same mask) path 15153 -> [16,25,3] (next node) that means that integer masks will represent numbers as follows [ 0, 1, ..., 30 ] [ 31, 32, ... , 60] etc. - endian-endian (base 30) example : path 15123 -> [3,24,16] path 16023 -> [3,24,17] (same mask) path 15124 -> [4,24,16] (next tree from the root) that means that integer masks will represent numbers as follows [ 0, 30, 60, ..., 900] [ 1, 31, 61, ..., 901] etc. I am working on a third function allowing other modulo representations e.g. [ 0, 5, 10, 15, ... , 150] etc. > Hmmm. I need a lexer that supports unicode, > i.e. 2^24 characters. To make that work, > I need to classify characters (array of 2^24 size is a bit expensive), > but also the lexer needs an efficient representation > of subsets of unicode characters, typically > containining a few ranges (eg 'letter' is a code > in one of about 30 different ranges). You could use big-endian representation of course, but you could even use your own 30-ary representation. Why ?=20 let say you insert 10 ints from 221 456 to 221 466 that is [8,6,1,26] to [8,6,2,6], You will have a degenerated tree with 2 useless nodes (level 1 and 2) since the rest of the tree is empty=20 | | / \ int int Of course, two nodes is not much but modulo 30 operations have a cost.=20 The same data structure with modulo 2 operations for optimal speed acces would look have 5 times more useless nodes (because 30 ~=3D 32 =3D 2^5) To have an optimal space occupation you just need to write down your own decomposition function which should have the following properties - injectivity in the domain you are working with - neighbourg conservation - most accessed integers in the top (id est shortest path) There are other solutions based on using common prefixes : - patricia trees - a trie of packed arrays You can find patricia trees in Jean Christophe Fill=E2tre's home page I have written a small packed arrays module which will soon be in my data structures library (an old buggy beta version is in my home page but will be removed) If you give me some more informations (on typical ranges that you can find in unicode subsets) I will be able to give more precise answers Diego Olivier ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners