caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: Yamagata Yoriyuki <yoriyuki@mbg.sphere.ne.jp>
To: skaller@ozemail.com.au
Cc: Caml-list@inria.fr
Subject: Re: [Caml-list] Explaining bit sets
Date: Tue, 03 Sep 2002 08:46:28 +0900 (JST)	[thread overview]
Message-ID: <20020903.084628.10760457.yoriyuki@mbg.sphere.ne.jp> (raw)
In-Reply-To: <Pine.A32.3.95.1020902105608.97756A-100000@ibm1.cicrp.jussieu.fr>

From: Diego Olivier Fernandez Pons <Diego-Olivier.FERNANDEZ-PONS@cicrp.jussieu.fr>
Subject: [Caml-list] Explaining bit sets
Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST)

> > Hmmm. I need a lexer that supports unicode,
> > i.e. 2^24 characters. To make that work,
> > I need to classify characters (array of 2^24 size is a bit expensive),
> > but also the lexer needs an efficient representation
> > of subsets of unicode characters, typically
> > containining a few ranges (eg 'letter' is a code
> > in one of about 30 different ranges).

I wrote something for the same purpose (Unicode -> foo tables), which
might be useful.  The relevant modules are tbl.ml[i],
bitsvect.ml[i], bytesvect.ml[i] in
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/camomile/camomile/internal

There are 6 table types, 'a Tbl32.rw, 'a Tbl32.t, which are tables
from int to 'a, Tbl32.bits_rw, Tbl32.bits which are from int to int (<
256), Tbl32.bytes_rw, Tbl32.bytes which are from int to int.
Tbl32.*rw are internally tries whose nodes have 256 branches.  Terminal
nodes are arrays (Tbl32.rw) or bits vectors (Tbl32.bits_rw) or bytes
sequences (Tbl32.bytes_rw).  Inplace-update is possible for these
types.

'a Tbl32.t (and Tbl32.bits, Tbl32.bytes) are read-only.  They are
internally 4-dimensional arrays, so access should be fast.
Tbl32.rw_to_ro creates Tbl32.t from Tbl32.rw.  (There are similar
functions for Tbl32.bits and Tbl32.bytes.)  It scans the whole trie
and make identical nodes shared, so that table size is small.  For
example, the size of Unicode character category table (Classification
of Unicode characters into about 30 categories) becomes 16K.

(btw, Unicode is 21-bits, not 24-bits.)
--
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF

-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


      parent reply	other threads:[~2002-09-02 23:42 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <3D70203F.1000106@ozemail.com.au>
2002-09-02  9:40 ` Diego Olivier Fernandez Pons
2002-09-02 13:10   ` [Caml-list] Holding a set of random integers of very wide range? john.chu
2002-09-02 13:25     ` Noel Welsh
2002-09-02 17:18     ` Tim Freeman
2002-09-02 21:30     ` Berke Durak
2002-09-02 23:46   ` Yamagata Yoriyuki [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20020903.084628.10760457.yoriyuki@mbg.sphere.ne.jp \
    --to=yoriyuki@mbg.sphere.ne.jp \
    --cc=Caml-list@inria.fr \
    --cc=skaller@ozemail.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).