Date: Sun, 01 Sep 2002 20:57:10 +0900 (JST)
Message-Id: <20020901.205710.118791924.yoriyuki@mbg.sphere.ne.jp>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
From: Yamagata Yoriyuki
In-Reply-To: <3D71D544.4010509@ozemail.com.au>
References: <20020901014544.GC820@ice.gerd-stolpmann.de> <3D71D544.4010509@ozemail.com.au>

From: John Max Skaller
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Date: Sun, 01 Sep 2002 18:52:20 +1000

> I have ALL the code sets specified at Unicode.org in
> programmatic form. Easy to generate Ocaml versions
> of the tables.

The data at Unicode.org for East Asian encodings are buggy. Don't use
them. (Moreover, the Unicode Consortium has declared that it does not
want to fix these bugs and has made the East Asian mapping tables
obsolete; see
ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/README.TXT)
I use mapping tables from glibc for my camomile, which seem to be
better debugged.
> My functions are in Python, and take the form:
>
> decode: string -> (int * string)
> encode: int -> string
>
> where string is an 8 bit byte stream,
> and int is a unicode (or other) code point.

This interface has a problem with stateful encodings, which are quite
important here. (ISO-2022-JP, also called JIS encoding, is stateful and
is the standard encoding for email.) In addition, it is inefficient.

> The actual python functions use dynamically loaded
> data tables, but each character set has a fixed
> format for the tables that knows about the raw
> structure of the character set (eg what ranges of
> hi and low bytes are allowed in two byte encodings
> of Shift-Jis, KSC, etc). For Ocaml, we'd probably
> want to bind the encodings at compile time
> (since there is no well defined way to find
> the data tables at run time :(
>
> The tables are very compact, but there are quite
> a few encodings -- some overhead if they're all
> in the one module ..

I read somewhere that Perl6 delegates code conversion to add-on
programs, since making standard mapping tables is really hard. (Even
the naming of encodings is a problem: there is no cross-platform way to
do it.) Introducing a generic channel type (for char and Unicode
characters) and letting third-party libraries do the conversion is a
better solution, IMO.

-- 
Yamagata Yoriyuki
http://www.mars.sphere.ne.jp/yoriyuki/
PGP fingerprint = 0374 5290 7445 4C06 D79E AA86 1A91 48CB 2B4E 34CF
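
A minimal sketch of the stateful-encoding problem mentioned above, in
Python since the quoted functions are in Python. It uses the standard
library's incremental decoder for ISO-2022-JP; it is not the quoted
decode/encode functions and not camomile. The helper name decode_chunks
is hypothetical, chosen only for this illustration.

```python
import codecs

# In ISO-2022-JP, an escape sequence (ESC $ B) switches the stream to
# JIS X 0208, and another (ESC ( B) switches it back to ASCII. The
# meaning of a byte pair therefore depends on what came before it.
data = "日本".encode("iso2022_jp")


def decode_chunks(chunks, encoding="iso2022_jp"):
    """Decode a byte stream delivered in pieces, keeping the shift state.

    Hypothetical helper: the decoder object remembers which character
    set is currently selected, which a pure function of the form
    bytes -> (code point, remaining bytes) has no place to store.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # flush at end of input
    return "".join(parts)


# Split the stream after the first character. The first chunk carries
# the escape sequence; the second chunk starts with the raw bytes of 本.
first, second = data[:5], data[5:]

# Decoding with carried state recovers the original text.
text = decode_chunks([first, second])

# Decoding the second chunk on its own starts from the ASCII state, so
# its bytes are not read as 本.
isolated = second.decode("iso2022_jp")
```

The design point is only that the state has to live somewhere: either
in a decoder object, as here, or as an extra value threaded through
every decode call.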