From: John Max Skaller
Date: Sun, 01 Sep 2002 23:54:29 +1000
To: Yamagata Yoriyuki
CC: caml-list@inria.fr
Subject: Re: [Caml-list] Announcement: PXP 1.1.92 (development version)
Message-ID: <3D721C15.5000308@ozemail.com.au>
References: <20020901014544.GC820@ice.gerd-stolpmann.de> <3D71D544.4010509@ozemail.com.au> <20020901.205710.118791924.yoriyuki@mbg.sphere.ne.jp>

Yamagata Yoriyuki wrote:

> Data at Unicode.org for East Asian encodings are buggy. Don't use
> them.

Noted.

>>My functions are in Python, and take the form:
>>
>>   decode: string -> (int * string)
>>   encode: int -> string
>>
>>where string is an 8 bit byte stream,
>>and int is a unicode (or other) code point.
>>
>
> This interface has a problem with stateful encodings, which are quite
> important here. (ISO-2020-JP or JIS encoding is stateful, and
> standard encoding for email.) In addition, it is inefficient.
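
For illustration only, here is a minimal sketch of what a decode/encode pair of the quoted shape could look like in Python, using UTF-8 as the example encoding. It is not the code discussed in this thread: the bodies are an assumption, and the sketch does not reject overlong forms or surrogates.

```python
from typing import Tuple


def encode(cp: int) -> bytes:
    """encode: int -> string. Turn one code point into UTF-8 bytes."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


def decode(data: bytes) -> Tuple[int, bytes]:
    """decode: string -> (int * string).

    Read one code point from the front of data and return it together
    with the unconsumed remainder. Simplified: overlong forms and
    surrogates are not rejected.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        n, cp = 1, lead
    elif 0xC0 <= lead < 0xE0:
        n, cp = 2, lead & 0x1F
    elif 0xE0 <= lead < 0xF0:
        n, cp = 3, lead & 0x0F
    elif 0xF0 <= lead < 0xF8:
        n, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < n:
        raise ValueError("truncated sequence")
    for b in data[1:n]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp, data[n:]
```

Because decode returns only a code point and the remaining bytes, there is nowhere to carry a shift state between calls, which is the difficulty with stateful encodings raised in the quoted text.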

Agreed on both counts, though none of the encodings I handle are
stateful (I handle Shift-JIS, which isn't stateful AFAIK). The functions
I give are canonical, and they're fast enough in Python (if you want
fast, you'd use C anyhow).

There is an issue for OCaml: what is a Unicode string like?
My answer would be 'array of int'. But another answer is
'string with UTF-8 encoding'.

In theory, mappings and codecs are orthogonal.
UTF-8 has nothing to do with Unicode; it works just fine for
any national character set. In practice, many character sets are
defined by two-byte encodings. So you might want a function:

    Shift-JIS -> Unicode as UTF-8

modelled by

    string -> string   (8-bit-clean strings)

That can be made from the canonical functions, but it isn't
efficient to do the conversion via an integer intermediate form.

> I read somewhere that Perl6 delegates code conversion to add-on
> programs, since making standard mapping tables is really hard.
> (Even naming of encodings is a problem. There is no cross-platform
> way of this.) Introducing generic channel type (for char and unicode
> character) and letting 3rd party libraries do conversion is better
> solution, IMO.

Well, you also want in-core conversions. And then a third party
library is an arbitrary function.

The problem is that people are rewriting these functions for each
application that needs some i18n support. Reuse would be better,
but that requires some form of standardisation. It's both hard to get
the conversions right, and also to make them efficient. I spent ages
converting the unicode.org data (I also found a bug in the UNICODE
tables).

The problem is: 'third party libraries' might be a reasonable
answer for a C program. It's not so reasonable for OCaml:
where are they? We're short of useful libraries, and indeed of
a mechanism to install and access them.

-- 
John Max Skaller, mailto:skaller@ozemail.com.au
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850
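
The message mentions building a Shift-JIS to UTF-8 function of type string -> string out of the canonical decode/encode pair. A rough sketch of that composition follows, offered as an illustration and not as the author's code. The names decode_sjis, encode_utf8 and transcode are hypothetical, and Python's built-in shift_jis codec stands in for a hand-built mapping table.

```python
from typing import Callable, Tuple

Decoder = Callable[[bytes], Tuple[int, bytes]]
Encoder = Callable[[int], bytes]


def decode_sjis(data: bytes) -> Tuple[int, bytes]:
    """Hypothetical canonical decoder for Shift-JIS: bytes -> (int, bytes)."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    # Lead bytes 0x81-0x9F and 0xE0-0xFC start a two-byte character;
    # everything else is a single byte.
    n = 2 if (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC) else 1
    if len(data) < n:
        raise ValueError("truncated sequence")
    # Assumption: the built-in codec plays the role of the mapping table.
    ch = data[:n].decode("shift_jis")
    return ord(ch), data[n:]


def encode_utf8(cp: int) -> bytes:
    """Hypothetical canonical encoder: code point -> UTF-8 bytes."""
    return chr(cp).encode("utf-8")


def transcode(data: bytes, decode: Decoder, encode: Encoder) -> bytes:
    """Compose any decoder with any encoder through an integer code point."""
    out = bytearray()
    while data:
        # One call and one slice per character: simple, but this is the
        # integer intermediate form the message describes as inefficient.
        cp, data = decode(data)
        out += encode(cp)
    return bytes(out)
```

For example, transcode(sjis_bytes, decode_sjis, encode_utf8) would convert a Shift-JIS byte string to UTF-8. The per-character call overhead and repeated slicing show why a direct table from Shift-JIS bytes to UTF-8 bytes could be preferable when speed matters.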