From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id LAA28987; Mon, 2 Sep 2002 11:42:25 +0200 (MET DST)
X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id LAA28902 for <caml-list@pauillac.inria.fr>; Mon, 2 Sep 2002 11:42:24 +0200 (MET DST)
Received: from shiva.jussieu.fr (shiva.jussieu.fr [134.157.0.129])
	by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id g829gND16153
	for <Caml-list@inria.fr>; Mon, 2 Sep 2002 11:42:23 +0200 (MET DST)
Received: from ibm3.cicrp.jussieu.fr (ibm3.cicrp.jussieu.fr [134.157.15.3])
          by shiva.jussieu.fr (8.12.5/jtpda-5.4) with ESMTP id g829gMjR086001
          ; Mon, 2 Sep 2002 11:42:23 +0200 (CEST)
Received: from ibm1.cicrp.jussieu.fr (ibm1.cicrp.jussieu.fr [134.157.15.1])
          by ibm3.cicrp.jussieu.fr (8.8.8/jtpda/mob-V8) with ESMTP id LAA34714
          ; Mon, 2 Sep 2002 11:40:43 +0200
Received: from localhost (fernande@localhost)
          by ibm1.cicrp.jussieu.fr (8.8.8/jtpda/mob-v8) with SMTP id LAA61898
          ; Mon, 2 Sep 2002 11:40:42 +0200
Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST)
From: Diego Olivier Fernandez Pons <Diego-Olivier.FERNANDEZ-PONS@cicrp.jussieu.fr>
To: John Max Skaller <skaller@ozemail.com.au>
cc: Caml-list@inria.fr
Subject: [Caml-list] Explaining bit sets
In-Reply-To: <3D70203F.1000106@ozemail.com.au>
Message-ID: <Pine.A32.3.95.1020902105608.97756A-100000@ibm1.cicrp.jussieu.fr>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
X-Antivirus: scanned by sophie at shiva.jussieu.fr
Sender: owner-caml-list@pauillac.inria.fr
Precedence: bulk

    Bonjour,

I will try to explain how does the bitSet module work.

A bit set is just =E0 n-ary tree with n =3D 30 and an integer data in
each node : instead of just having

  type tree =3D Empty | Node of int * tree * tree

you have

  type tree =3D Empty | Node of int * tree array

(I really use an other representation which is mosty equivalent,
but saves some space and makes algorithms easier :=20
tree =3D N of int * int array)

When you call (insert [1,15,2,3,17] my_set) the only thing you are
doing is set to 1 the 17h bit of the integer located in the first tree
first level, 15th tree second level, 2nd tree third level, 3rd tree
4th level.=20

Now, all the game is to find a convenient decomposition for each value
you want to save, the actual module provide two classical function

- big-endian (base 30)
example : path 15123 -> [16,24,3]=20
          path 15124 -> [16,24,4] (same mask)
          path 15153 -> [16,25,3] (next node)

that means that integer masks will represent numbers as follows
[ 0, 1, ..., 30 ] [ 31, 32, ... , 60] etc.

- endian-endian (base 30)

example : path 15123 -> [3,24,16]
          path 16023 -> [3,24,17] (same mask)
          path 15124 -> [4,24,16] (next tree from the root)

that means that integer masks will represent numbers as follows
[ 0, 30, 60, ..., 900] [ 1, 31, 61, ..., 901] etc.

I am working on a third function allowing other modulo representations
e.g. [ 0, 5, 10, 15, ... , 150] etc.


> Hmmm. I need a lexer that supports unicode,
> i.e. 2^24 characters. To make that work,
> I need to classify characters (array of 2^24 size is a bit expensive),
> but also the lexer needs an efficient representation
> of subsets of unicode characters, typically
> containining a few ranges (eg 'letter' is a code
> in one of about 30 different ranges).

You could use big-endian representation of course, but you could even
use your own 30-ary representation. Why ?=20

let say you insert 10 ints from 221 456 to 221 466
that is [8,6,1,26] to [8,6,2,6],

You will have a degenerated tree with 2 useless nodes (level 1 and 2)
since the rest of the tree is empty=20
     |
     |
    / \
  int  int

Of course, two nodes is not much but modulo 30 operations have a cost.=20
The same data structure with modulo 2 operations for optimal speed
acces would look have 5 times more useless nodes (because 30 ~=3D 32 =3D
2^5)

To have an optimal space occupation you just need to write down your
own decomposition function which should have the following properties

- injectivity in the domain you are working with
- neighbourg conservation
- most accessed integers in the top (id est shortest path)

There are other solutions based on using common prefixes :
- patricia trees
- a trie of packed arrays

You can find patricia trees in Jean Christophe Fill=E2tre's home page
I have written a small packed arrays module which will soon be in my
data structures library (an old buggy beta version is in my home page
but will be removed)

If you give me some more informations (on typical ranges that you can
find in unicode subsets) I will be able to give more precise answers

        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners