From mboxrd@z Thu Jan 1 00:00:00 1970 Received: (from majordomo@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id XAA09555; Fri, 14 Sep 2001 23:06:05 +0200 (MET DST) X-Authentication-Warning: pauillac.inria.fr: majordomo set sender to owner-caml-list@pauillac.inria.fr using -f Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id XAA09685 for ; Fri, 14 Sep 2001 23:06:04 +0200 (MET DST) Received: from moutvdom00.kundenserver.de (moutvdom00.kundenserver.de [195.20.224.149]) by concorde.inria.fr (8.11.1/8.10.0) with ESMTP id f8EL63f12193 for ; Fri, 14 Sep 2001 23:06:03 +0200 (MET DST) Received: from [195.20.224.220] (helo=mrvdom04.kundenserver.de) by moutvdom00.kundenserver.de with esmtp (Exim 2.12 #2) id 15i09y-00015E-00; Fri, 14 Sep 2001 23:06:02 +0200 Received: from drms-3e364bb7.pool.mediaways.net ([62.54.75.183] helo=ice.gerd-stolpmann.de) by mrvdom04.kundenserver.de with esmtp (Exim 2.12 #2) id 15i09y-0001gT-00; Fri, 14 Sep 2001 23:06:02 +0200 Received: from localhost (localhost [[UNIX: localhost]]) by ice.gerd-stolpmann.de (8.9.3/8.9.3) id XAA31979; Fri, 14 Sep 2001 23:05:05 +0200 From: Gerd Stolpmann Reply-To: info@gerd-stolpmann.de Organization: privat To: "SooHyoung Oh" Subject: Re: [Caml-list] Unicode support? Date: Fri, 14 Sep 2001 22:30:03 +0200 X-Mailer: KMail [version 1.0.29.2] Content-Type: text/plain References: <007f01c13b28$da1897e0$1e01a8c0@hama> In-Reply-To: <007f01c13b28$da1897e0$1e01a8c0@hama> Cc: caml-list@inria.fr MIME-Version: 1.0 Message-Id: <01091423045501.29025@ice> Content-Transfer-Encoding: 8bit Sender: owner-caml-list@pauillac.inria.fr Precedence: bulk On Wed, 12 Sep 2001, you wrote: >Hi! > >I'm very interesed to use Caml for educational purpose >in multibyte environment such as CJK. >So I'd like to know whether Caml supoprt Unicode or not? >(Unicode (ISO 10646) is the character coding (or character set) standard.) >Many people forsee that Unicode is going to replace all older lagcy >encodings in a few years >and many languages such as C, Java, Perl, Tcl, C# and Python support it >already. People have also forseen that PL/I replaces Cobol. So take that with care. >If Caml doesn't support unicode, could you give me any clue >from which part in caml source I should start in order to support unicode? In the mailing list archives you find a long debate, beginning here and never ended: http://caml.inria.fr/archives/199910/msg00132.html Further threads: http://caml.inria.fr/archives/199910/msg00189.html http://caml.inria.fr/archives/200101/msg00006.html >More question, >Which form is better for processing unicode in Caml, UTF-8 or UCS-2 (or >UCS-4)? There is currently not very much unicode support in caml. But there are some software packages and projects aiming at improving this (besides other goals). For my XML parser "PXP" I needed some unicode functionality, and that was the reason to become active regarding this issue. Caml strings can contain arbitrary byte sequences, and the null byte is explicitly allowed. So this would permit all of UTF-8, UCS-2, and UCS-4. For PXP, I needed especially a lexical analyzer that supports unicode, and that lead to two solutions: - PXP itself contains a tool that converts .mll files containing unicode characters into their UTF-8 representation. The idea is that UTF-8 regular expressions can be handled by every regular expression engine. This approach worked for the limited usage of unicode in the XML definition, but the resulting lexers are huge. This tool was a contribution by Claudio Sacerdoti Coen. Link: http://www.ocaml-programming.de/packages/documentation/pxp (look into the directory tools/src/ucs2_to_utf8) - Alain Frisch developed a patch for the ocamllex tool: wlex. He first used it for his XPath package, but it is now also used in PXP. wlex reads an arbitrary encoding, maps this encoding to character classes, and allows it to form regular expressions using these classes. The lexers are usually much smaller, but you need to patch ocamllex. Link: http://www.eleves.ens.fr:8080/home/frisch/soft Another problem is to convert between various character encodings. The netstring package contains a routine that can map between UTF-8, UCS-2, UCS-4, and many 8-bit encodings (but no other multibyte encodings). Link: http://www.ocaml-programming.de/packages/documentation/netstring/ The netstring development is now done by the ocamlnet project (http://ocamlnet.sourceforge.net). I hope this helps, Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 45 64293 Darmstadt EMail: gerd@gerd-stolpmann.de Germany ---------------------------------------------------------------------------- ------------------- Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr