From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id BAA31555; Tue, 3 Sep 2002 01:42:25 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id BAA13442 for ; Tue, 3 Sep 2002 01:42:25 +0200 (MET DST) Received: from mbg.sphere.ne.jp (mbg.sphere.ne.jp [210.150.254.179]) by concorde.inria.fr (8.11.1/8.11.1) with ESMTP id g82NgJr25502 for ; Tue, 3 Sep 2002 01:42:22 +0200 (MET DST) Received: from localhost (pl1158.nas923.o-tokyo.nttpc.ne.jp [210.165.110.134]) by mbg.sphere.ne.jp (Postfix) with ESMTP id DD56B3A25C; Tue, 3 Sep 2002 08:42:12 +0900 (JST) Date: Tue, 03 Sep 2002 08:46:28 +0900 (JST) Message-Id: <20020903.084628.10760457.yoriyuki@mbg.sphere.ne.jp> To: skaller@ozemail.com.au Cc: Caml-list@inria.fr Subject: Re: [Caml-list] Explaining bit sets From: Yamagata Yoriyuki In-Reply-To: References: <3D70203F.1000106@ozemail.com.au> X-Mailer: Mew version 2.2 on Emacs 21.2 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk From: Diego Olivier Fernandez Pons Subject: [Caml-list] Explaining bit sets Date: Mon, 2 Sep 2002 11:40:42 +0200 (DST) > > Hmmm. I need a lexer that supports unicode, > > i.e. 2^24 characters. To make that work, > > I need to classify characters (array of 2^24 size is a bit expensive), > > but also the lexer needs an efficient representation > > of subsets of unicode characters, typically > > containining a few ranges (eg 'letter' is a code > > in one of about 30 different ranges). I wrote something for the same purpose (Unicode -> foo tables), which might be useful. The relevant modules are tbl.ml[i], bitsvect.ml[i], bytesvect.ml[i] in http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/camomile/camomile/internal There are 6 table types, 'a Tbl32.rw, 'a Tbl32.t, which are tables from int to 'a, Tbl32.bits_rw, Tbl32.bits which are from int to int (< 256), Tbl32.bytes_rw, Tbl32.bytes which are from int to int. Tbl32.*rw are internally tries whose nodes have 256 branches. Terminal nodes are arrays (Tbl32.rw) or bits vectors (Tbl32.bits_rw) or bytes sequences (Tbl32.bytes_rw). Inplace-update is possible for these types. 'a Tbl32.t (and Tbl32.bits, Tbl32.bytes) are read-only. They are internally 4-dimensional arrays, so access should be fast. Tbl32.rw_to_ro creates Tbl32.t from Tbl32.rw. (There are similar functions for Tbl32.bits and Tbl32.bytes.) It scans the whole trie and make identical nodes shared, so that table size is small. For example, the size of Unicode character category table (Classification of Unicode characters into about 30 categories) becomes 16K. (btw, Unicode is 21-bits, not 24-bits.) -- Yamagata Yoriyuki http://www.mars.sphere.ne.jp/yoriyuki/ PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners