[Caml-list] Explaining bit sets - Diego Olivier Fernandez Pons

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Diego Olivier Fernandez Pons <Diego-Olivier.FERNANDEZ-PONS@cicrp.jussieu.fr>
To: John Max Skaller <skaller@ozemail.com.au>
Cc: Caml-list@inria.fr
Subject: [Caml-list] Explaining bit sets
Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST)	[thread overview]
Message-ID: <Pine.A32.3.95.1020902105608.97756A-100000@ibm1.cicrp.jussieu.fr> (raw)
In-Reply-To: <3D70203F.1000106@ozemail.com.au>

    Bonjour,

I will try to explain how does the bitSet module work.

A bit set is just à n-ary tree with n = 30 and an integer data in
each node : instead of just having

  type tree = Empty | Node of int * tree * tree

you have

  type tree = Empty | Node of int * tree array

(I really use an other representation which is mosty equivalent,
but saves some space and makes algorithms easier : 
tree = N of int * int array)

When you call (insert [1,15,2,3,17] my_set) the only thing you are
doing is set to 1 the 17h bit of the integer located in the first tree
first level, 15th tree second level, 2nd tree third level, 3rd tree
4th level. 

Now, all the game is to find a convenient decomposition for each value
you want to save, the actual module provide two classical function

- big-endian (base 30)
example : path 15123 -> [16,24,3] 
          path 15124 -> [16,24,4] (same mask)
          path 15153 -> [16,25,3] (next node)

that means that integer masks will represent numbers as follows
[ 0, 1, ..., 30 ] [ 31, 32, ... , 60] etc.

- endian-endian (base 30)

example : path 15123 -> [3,24,16]
          path 16023 -> [3,24,17] (same mask)
          path 15124 -> [4,24,16] (next tree from the root)

that means that integer masks will represent numbers as follows
[ 0, 30, 60, ..., 900] [ 1, 31, 61, ..., 901] etc.

I am working on a third function allowing other modulo representations
e.g. [ 0, 5, 10, 15, ... , 150] etc.


> Hmmm. I need a lexer that supports unicode,
> i.e. 2^24 characters. To make that work,
> I need to classify characters (array of 2^24 size is a bit expensive),
> but also the lexer needs an efficient representation
> of subsets of unicode characters, typically
> containining a few ranges (eg 'letter' is a code
> in one of about 30 different ranges).

You could use big-endian representation of course, but you could even
use your own 30-ary representation. Why ? 

let say you insert 10 ints from 221 456 to 221 466
that is [8,6,1,26] to [8,6,2,6],

You will have a degenerated tree with 2 useless nodes (level 1 and 2)
since the rest of the tree is empty 
     |
     |
    / \
  int  int

Of course, two nodes is not much but modulo 30 operations have a cost. 
The same data structure with modulo 2 operations for optimal speed
acces would look have 5 times more useless nodes (because 30 ~= 32 =
2^5)

To have an optimal space occupation you just need to write down your
own decomposition function which should have the following properties

- injectivity in the domain you are working with
- neighbourg conservation
- most accessed integers in the top (id est shortest path)

There are other solutions based on using common prefixes :
- patricia trees
- a trie of packed arrays

You can find patricia trees in Jean Christophe Fillâtre's home page
I have written a small packed arrays module which will soon be in my
data structures library (an old buggy beta version is in my home page
but will be removed)

If you give me some more informations (on typical ranges that you can
find in unicode subsets) I will be able to give more precise answers

        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners

next      parent reply	other threads:[~2002-09-02  9:42 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <3D70203F.1000106@ozemail.com.au>
2002-09-02  9:40 ` Diego Olivier Fernandez Pons [this message]
2002-09-02 13:10   ` [Caml-list] Holding a set of random integers of very wide range? john.chu
2002-09-02 13:25     ` Noel Welsh
2002-09-02 17:18     ` Tim Freeman
2002-09-02 21:30     ` Berke Durak
2002-09-02 23:46   ` [Caml-list] Explaining bit sets Yamagata Yoriyuki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.A32.3.95.1020902105608.97756A-100000@ibm1.cicrp.jussieu.fr \
    --to=diego-olivier.fernandez-pons@cicrp.jussieu.fr \
    --cc=Caml-list@inria.fr \
    --cc=skaller@ozemail.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).