caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
* [Caml-list] Explaining bit sets
       [not found] <3D70203F.1000106@ozemail.com.au>
@ 2002-09-02  9:40 ` Diego Olivier Fernandez Pons
  2002-09-02 13:10   ` [Caml-list] Holding a set of random integers of very wide range? john.chu
  2002-09-02 23:46   ` [Caml-list] Explaining bit sets Yamagata Yoriyuki
  0 siblings, 2 replies; 6+ messages in thread
From: Diego Olivier Fernandez Pons @ 2002-09-02  9:40 UTC (permalink / raw)
  To: John Max Skaller; +Cc: Caml-list

    Bonjour,

I will try to explain how does the bitSet module work.

A bit set is just à n-ary tree with n = 30 and an integer data in
each node : instead of just having

  type tree = Empty | Node of int * tree * tree

you have

  type tree = Empty | Node of int * tree array

(I really use an other representation which is mosty equivalent,
but saves some space and makes algorithms easier : 
tree = N of int * int array)

When you call (insert [1,15,2,3,17] my_set) the only thing you are
doing is set to 1 the 17h bit of the integer located in the first tree
first level, 15th tree second level, 2nd tree third level, 3rd tree
4th level. 

Now, all the game is to find a convenient decomposition for each value
you want to save, the actual module provide two classical function

- big-endian (base 30)
example : path 15123 -> [16,24,3] 
          path 15124 -> [16,24,4] (same mask)
          path 15153 -> [16,25,3] (next node)

that means that integer masks will represent numbers as follows
[ 0, 1, ..., 30 ] [ 31, 32, ... , 60] etc.

- endian-endian (base 30)

example : path 15123 -> [3,24,16]
          path 16023 -> [3,24,17] (same mask)
          path 15124 -> [4,24,16] (next tree from the root)

that means that integer masks will represent numbers as follows
[ 0, 30, 60, ..., 900] [ 1, 31, 61, ..., 901] etc.

I am working on a third function allowing other modulo representations
e.g. [ 0, 5, 10, 15, ... , 150] etc.


> Hmmm. I need a lexer that supports unicode,
> i.e. 2^24 characters. To make that work,
> I need to classify characters (array of 2^24 size is a bit expensive),
> but also the lexer needs an efficient representation
> of subsets of unicode characters, typically
> containining a few ranges (eg 'letter' is a code
> in one of about 30 different ranges).

You could use big-endian representation of course, but you could even
use your own 30-ary representation. Why ? 

let say you insert 10 ints from 221 456 to 221 466
that is [8,6,1,26] to [8,6,2,6],

You will have a degenerated tree with 2 useless nodes (level 1 and 2)
since the rest of the tree is empty 
     |
     |
    / \
  int  int

Of course, two nodes is not much but modulo 30 operations have a cost. 
The same data structure with modulo 2 operations for optimal speed
acces would look have 5 times more useless nodes (because 30 ~= 32 =
2^5)

To have an optimal space occupation you just need to write down your
own decomposition function which should have the following properties

- injectivity in the domain you are working with
- neighbourg conservation
- most accessed integers in the top (id est shortest path)

There are other solutions based on using common prefixes :
- patricia trees
- a trie of packed arrays

You can find patricia trees in Jean Christophe Fillâtre's home page
I have written a small packed arrays module which will soon be in my
data structures library (an old buggy beta version is in my home page
but will be removed)

If you give me some more informations (on typical ranges that you can
find in unicode subsets) I will be able to give more precise answers

        Diego Olivier

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Caml-list] Holding a set of random integers of very wide range?
  2002-09-02  9:40 ` [Caml-list] Explaining bit sets Diego Olivier Fernandez Pons
@ 2002-09-02 13:10   ` john.chu
  2002-09-02 13:25     ` Noel Welsh
                       ` (2 more replies)
  2002-09-02 23:46   ` [Caml-list] Explaining bit sets Yamagata Yoriyuki
  1 sibling, 3 replies; 6+ messages in thread
From: john.chu @ 2002-09-02 13:10 UTC (permalink / raw)
  To: Caml-list

Diego Olivier Fernandez Pons writes:
> If you give me some more informations (on typical ranges that you can
> find in unicode subsets) I will be able to give more precise answers

This brings up a possibly related question that I have.

For reasons I tried to explain in a previous draft of this e-mail and
has been cut due to the amount of space it took, I need to generate
multiple permuted lists of integers ranging from 0 to approximately
2^100 (or more, it's a bit open-ended unfortunately).

Since I only need one value at a time, I can use a lazy list for
this. i.e., the head holds a value and the tail is a suspension of the
state I need to generate the next value.  No big deal there.

The big deal, for me anyways, is that the state I need is tracking the
integers that I've already used (so that I don't generate the same one
twice) given that the range of possible values is so large.

I'm currently using the very low-tech solution of maintaining a list
of ranges of used values.  (i.e., when I use a value that would link 2
different ranges, I go coalesce the list.  The worst case would be if
I picked all the odd integers or all the even integers.)

Given the recent talk of Patricia trees and bit sets, I was wondering
if some variant of either one of those, or some other representation
would be better for what I'm doing?  (I figure at the very least I
ought to be using a tree of ranges of used values.  I'm currently
using the Big_int module to represent my integers.)

BTW, it's extremely unlikely I will ever need every single value in
the lazy list.  I will probably have queried the garbage collector and
declared defeat long before then.  I'm hoping that, at worst, I will
need only several thousand values from each permuted list.  But I
don't want to precompute them since I may not need several thousand
values from some lists and others might need more.  So I want to
generate only the numbers that are necessary.  (Sorry, this bit only
makes sense if you knew what I'm using this for.  The short
description would be backtracking to find a set of variable values
which satisfy a bunch of contraints where there can be multiple
possible solutions of which I want a random one where variable values
can be mapped onto a potentially very large range of integers.  I'm
applying as many constraints before backtracking as possible to reduce
the state space to search.)

Thanks.

					john
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Holding a set of random integers of very wide range?
  2002-09-02 13:10   ` [Caml-list] Holding a set of random integers of very wide range? john.chu
@ 2002-09-02 13:25     ` Noel Welsh
  2002-09-02 17:18     ` Tim Freeman
  2002-09-02 21:30     ` Berke Durak
  2 siblings, 0 replies; 6+ messages in thread
From: Noel Welsh @ 2002-09-02 13:25 UTC (permalink / raw)
  To: john.chu, Caml-list

--- john.chu@east.sun.com wrote:
> For reasons I tried to explain in a previous draft
> of this e-mail and
> has been cut due to the amount of space it took, I
> need to generate
> multiple permuted lists of integers ranging from 0
> to approximately
> 2^100 (or more, it's a bit open-ended
> unfortunately).

How about using a linear-congruential random number
generators.  These simple algorithms will generate
numbers that don't repeat for a long period of time
(typically they use 32bit integers but I'm sure you'll
be able to find algorithms that generate numbers in a
larger range).  You can generate permutations of by
setting the seeds to different values.

HTH,
Noel


__________________________________________________
Do You Yahoo!?
Yahoo! Finance - Get real-time stock quotes
http://finance.yahoo.com
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Holding a set of random integers of very wide range?
  2002-09-02 13:10   ` [Caml-list] Holding a set of random integers of very wide range? john.chu
  2002-09-02 13:25     ` Noel Welsh
@ 2002-09-02 17:18     ` Tim Freeman
  2002-09-02 21:30     ` Berke Durak
  2 siblings, 0 replies; 6+ messages in thread
From: Tim Freeman @ 2002-09-02 17:18 UTC (permalink / raw)
  To: john.chu; +Cc: Caml-list

>For reasons I tried to explain in a previous draft of this e-mail and
>has been cut due to the amount of space it took, I need to generate
>multiple permuted lists of integers ranging from 0 to approximately
>2^100 (or more, it's a bit open-ended unfortunately).
>
>Since I only need one value at a time, I can use a lazy list for
>this. i.e., the head holds a value and the tail is a suspension of the
>state I need to generate the next value.  No big deal there.
>
>The big deal, for me anyways, is that the state I need is tracking the
>integers that I've already used (so that I don't generate the same one
>twice) given that the range of possible values is so large.

If your integers have enough bits, then a list of random numbers is
indistinguishable from a random permutation.  This assumes you only
have time to look at a reasonable number of values from the
permutation.  Do the math to figure out what "big enough" is.  For
example, in the past year I designed a scheme that assumed that two
files were identical if and only if their 64-bit checksums were
identical, and there were only a million or so files, so I did the
math and concluded that the chances of a collision were small enough.
It worked fine.

If you're paranoid then use the MD5 checksums in the Digest module to
generate random numbers; otherwise you can use a linear congruential
pseudo-random number generator as mentioned by another poster and hope
that there's no interesting interaction between the number generation
scheme and the other details of your algorithm.

>I'm currently using the very low-tech solution of maintaining a list
>of ranges of used values.  (i.e., when I use a value that would link 2
>different ranges, I go coalesce the list.  The worst case would be if
>I picked all the odd integers or all the even integers.)
>
>Given the recent talk of Patricia trees and bit sets, I was wondering
>if some variant of either one of those, or some other representation
>would be better for what I'm doing?  (I figure at the very least I
>ought to be using a tree of ranges of used values.  I'm currently
>using the Big_int module to represent my integers.)

If your integers are large enough, then you can ignore the possiblity
of a collision and no data structure is needed.  Otherwise, is there a
reason not to use a hash table?  

-- 
Tim Freeman       
tim@fungible.com
GPG public key fingerprint ECDF 46F8 3B80 BB9E 575D  7180 76DF FE00 34B1 5C78 
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Holding a set of random integers of very wide range?
  2002-09-02 13:10   ` [Caml-list] Holding a set of random integers of very wide range? john.chu
  2002-09-02 13:25     ` Noel Welsh
  2002-09-02 17:18     ` Tim Freeman
@ 2002-09-02 21:30     ` Berke Durak
  2 siblings, 0 replies; 6+ messages in thread
From: Berke Durak @ 2002-09-02 21:30 UTC (permalink / raw)
  To: john.chu; +Cc: caml-list

On Mon, Sep 02, 2002 at 09:10:18AM -0400, john.chu@east.sun.com wrote:
> For reasons I tried to explain in a previous draft of this e-mail and
> has been cut due to the amount of space it took, I need to generate
> multiple permuted lists of integers ranging from 0 to approximately
> 2^100 (or more, it's a bit open-ended unfortunately).

You can solve that problem in constant space by encrypting a counter
using a convenient block cipher (such as RC6 or idea).  Since
encryption is bijective, you are assured of not hitting the same
integer twice. You have to somehow encode and decode integers, but in
runs in constant space and time.
--
Berke

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Caml-list] Explaining bit sets
  2002-09-02  9:40 ` [Caml-list] Explaining bit sets Diego Olivier Fernandez Pons
  2002-09-02 13:10   ` [Caml-list] Holding a set of random integers of very wide range? john.chu
@ 2002-09-02 23:46   ` Yamagata Yoriyuki
  1 sibling, 0 replies; 6+ messages in thread
From: Yamagata Yoriyuki @ 2002-09-02 23:46 UTC (permalink / raw)
  To: skaller; +Cc: Caml-list

From: Diego Olivier Fernandez Pons <Diego-Olivier.FERNANDEZ-PONS@cicrp.jussieu.fr>
Subject: [Caml-list] Explaining bit sets
Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST)

> > Hmmm. I need a lexer that supports unicode,
> > i.e. 2^24 characters. To make that work,
> > I need to classify characters (array of 2^24 size is a bit expensive),
> > but also the lexer needs an efficient representation
> > of subsets of unicode characters, typically
> > containining a few ranges (eg 'letter' is a code
> > in one of about 30 different ranges).

I wrote something for the same purpose (Unicode -> foo tables), which
might be useful.  The relevant modules are tbl.ml[i],
bitsvect.ml[i], bytesvect.ml[i] in
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/camomile/camomile/internal

There are 6 table types, 'a Tbl32.rw, 'a Tbl32.t, which are tables
from int to 'a, Tbl32.bits_rw, Tbl32.bits which are from int to int (<
256), Tbl32.bytes_rw, Tbl32.bytes which are from int to int.
Tbl32.*rw are internally tries whose nodes have 256 branches.  Terminal
nodes are arrays (Tbl32.rw) or bits vectors (Tbl32.bits_rw) or bytes
sequences (Tbl32.bytes_rw).  Inplace-update is possible for these
types.

'a Tbl32.t (and Tbl32.bits, Tbl32.bytes) are read-only.  They are
internally 4-dimensional arrays, so access should be fast.
Tbl32.rw_to_ro creates Tbl32.t from Tbl32.rw.  (There are similar
functions for Tbl32.bits and Tbl32.bytes.)  It scans the whole trie
and make identical nodes shared, so that table size is small.  For
example, the size of Unicode character category table (Classification
of Unicode characters into about 30 categories) becomes 16K.

(btw, Unicode is 21-bits, not 24-bits.)
--
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2002-09-02 23:42 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <3D70203F.1000106@ozemail.com.au>
2002-09-02  9:40 ` [Caml-list] Explaining bit sets Diego Olivier Fernandez Pons
2002-09-02 13:10   ` [Caml-list] Holding a set of random integers of very wide range? john.chu
2002-09-02 13:25     ` Noel Welsh
2002-09-02 17:18     ` Tim Freeman
2002-09-02 21:30     ` Berke Durak
2002-09-02 23:46   ` [Caml-list] Explaining bit sets Yamagata Yoriyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).