From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id KAA10953; Sun, 1 Sep 2002 10:52:34 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id KAA11009 for ; Sun, 1 Sep 2002 10:52:33 +0200 (MET DST) Received: from mail2.tpgi.com.au (mail.tpgi.com.au [203.12.160.58]) by nez-perce.inria.fr (8.11.1/8.11.1) with ESMTP id g818qVD24045 for ; Sun, 1 Sep 2002 10:52:32 +0200 (MET DST) Received: from ozemail.com.au (syd-ts16-2600-111.tpgi.com.au [203.213.84.111]) (authenticated (0 bits)) by mail2.tpgi.com.au (8.11.6/8.11.6) with ESMTP id g818qLQ01114; Sun, 1 Sep 2002 18:52:22 +1000 Message-ID: <3D71D544.4010509@ozemail.com.au> Date: Sun, 01 Sep 2002 18:52:20 +1000 From: John Max Skaller User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.2.1) Gecko/20010901 X-Accept-Language: en-us MIME-Version: 1.0 To: Gerd Stolpmann CC: caml-list@inria.fr Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version) References: <20020901014544.GC820@ice.gerd-stolpmann.de> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk Gerd Stolpmann wrote: > previous versions of PXP, the internal representation of the XML trees was > restricted to either UTF-8 or ISO-8859-1. Now, a number of additional > encodings are supported, including the whole ISO-8859 series. I have ALL the code sets specified at Unicode.org in programmatic form. Easy to generate Ocaml versions of the tables. however, how about developing a standard I18n library with an eye to future inclusion in the standard distribution? The questions are mainly: what form should the encode/decode functions take? My functions are in Python, and take the form: decode: string -> (int * string) encode: int -> string where string is an 8 bit byte stream, and int is a unicode (or other) code point. The actual python functions use dynamically loaded data tables, but each character set has a fixed format for the tables that knows about the raw structure of the character set (eg what ranges of hi and low bytes are allowed in two byte encodings of Shift-Jis, KSC, etc). For Ocaml, we'd probably want to bind the encodings at compile time (since there is no well defined way to find the data tables at run time :( The tables are very compact, but there are quite a few encodings -- some overhead if they're all in the one module .. -- John Max Skaller, mailto:skaller@ozemail.com.au snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia. voice:61-2-9660-0850 ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners