From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: weis Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id XAA26006 for caml-redistribution; Tue, 19 Oct 1999 23:39:30 +0200 (MET DST) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id UAA21298 for ; Tue, 19 Oct 1999 20:42:17 +0200 (MET DST) Received: from ruby (pm1-29.triode.net.au [203.63.235.45]) by nez-perce.inria.fr (8.8.7/8.8.7) with ESMTP id UAA06702; Tue, 19 Oct 1999 20:42:12 +0200 (MET DST) Received: from maxtal.com.au (IDENT:root@localhost [127.0.0.1]) by ruby (8.9.3/8.9.3) with ESMTP id EAA23081; Wed, 20 Oct 1999 04:36:25 +1000 Sender: weis Message-ID: <380CBA28.E13AFB6E@maxtal.com.au> Date: Wed, 20 Oct 1999 04:36:24 +1000 From: skaller Organization: Maxtal X-Mailer: Mozilla 4.51 [en] (X11; I; Linux 2.2.12 i586) X-Accept-Language: en MIME-Version: 1.0 To: Xavier Leroy CC: caml-list@inria.fr Subject: Re: localization, internationalization and Caml References: <199910151406.QAA07501@yana.inria.fr> <19991017162917.48773@pauillac.inria.fr> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Xavier Leroy wrote: > The support for ISO-8859-1 in Caml Light and OCaml is essentially an > historical and geographical accident. The first books on Caml were > written in French, and it was nice to be able to use accented french > words as identifiers. Also, that was at a time (1991-1992) where > Unicode and consorts didn't even exist. And supporting ISO-8859-1 was a fine thing to do at the time! > The choice of ISO-8859-1 is not that politically incorrect either: it > works not only for western Europe, but also for Latin America, many > Pacific countries, and large parts of Africa. If we were to choose an > 8-bit character set based on the number of OCaml programmers that > actually need it, I guess ISO-8859-1 (or its newer incarnation with > the Euro sign whose name I can't remember) would still win. (At least > until we get OCaml in the Chinese curriculum...) While this is true, there is a circularity here: people not using 8 bit character sets face an extra battle using ocaml. > Notice also that Caml doesn't prevent the programmer from putting any > character set that includes ASCII (ISO-8859-x, but also UTF8-encoded > Unicode) in character strings and in comments. Yes. This one of the key points of my argument that UTF-8 is the natural way to go: it provides ISO-10646 compliance without requiring any new string kind. > There are several ways to internationalize further. One is to support > other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There > are several problems with this: > - It's not enough for Asian languages. > - The POSIX localization stuff isn't supported under Windows. > - It's badly supported on all Unixes I know (e.g. to get French, I > need to set LC_CTYPE to different values under Linux, Solaris, and > Digital Unix; it gets worse for other languages such as Japanese). > - Handling of mixed-language texts is a nightmare. If you are suggesting not using C locale stuff -- I agree entirely. > Unicode / ISO10646 is probably a better approach. However, it has its > own problems: > - There's 16-bit Unicode and 32-bit Unicode. Early adopters of that > technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix) > chose 32-bit Unicode. (That's the great things about standards: > there are so many to choose from...) I cannot see the problem -- except for the 16 bit adopters, who must eventually upgrade .. again. > - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well. > E.g. Java seems to have its own variant of UTF8. How are we going > to interoperate? I do not understand: UTF-8 is a fixed, internationally standardised encoding. If it is used, the ISO Standard is followed. If Java doesn't do that, that is Java's problem. > - I/O is a nightmare. The API has to handle at least byte streams, > wide character streams, and UTF8-encoded streams. No, it doesn't. This is a possibility. But it is NOT necessary. It is necessary only to read byte streams. Conversion can be done later using strings. This is less efficient, but it is a sensible starting point (to ignore internationalisation on I/O completely). > - Support for Unicode / UTF8 files in today's operating systems and GUIs > is very low. When will I be able to do "more" on an UTF8 file and see my > French accented letters? Yes. I agree. This is a major problem. One of the answers is "When programming languages provide the support that applications programmers need" :-) > My conclusion is that I18N is such a mess that I don't think we'll do > much about it in Caml anytime soon. I agree. The way forward is, I believe: a) do not change the I/O system, but deprecate TEXT mode (all I/O should be done in binary) b) do not change the String module, but deprecate the upper/lower case functions (and anything else that smacks of relating to natural language) c) Provide functions to support internationalisation. d) modify the ocaml compiler, to process \uXXXX and \UXXXX escapes [everywhere] e) provide a fast variable length array type (d) could be done easily using ocamlp4 I think. >Perhaps some basic support for > wide characters and wide character strings will be added at some > point, if only because COM interoperability requires it. I don't think it is necessary, a variable length array of integers is good enough. -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller