From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: weis
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3)
    id QAA14899 for caml-redistribution;
    Mon, 18 Oct 1999 16:18:56 +0200 (MET DST)
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39])
    by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id QAA19335 for ;
    Sun, 17 Oct 1999 16:29:19 +0200 (MET DST)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
    by concorde.inria.fr (8.8.7/8.8.7) with ESMTP id QAA06940 for ;
    Sun, 17 Oct 1999 16:29:17 +0200 (MET DST)
Received: (from xleroy@localhost) by pauillac.inria.fr (8.7.6/8.7.3)
    id QAA28272; Sun, 17 Oct 1999 16:29:17 +0200 (MET DST)
Message-ID: <19991017162917.48773@pauillac.inria.fr>
Date: Sun, 17 Oct 1999 16:29:17 +0200
From: Xavier Leroy
To: caml-list@inria.fr
Subject: Re: localization, internationalization and Caml
References: <199910151406.QAA07501@yana.inria.fr>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Mailer: Mutt 0.89.1
In-Reply-To: <199910151406.QAA07501@yana.inria.fr>; from Gerard Huet on Fri, Oct 15, 1999 at 03:53:15PM +0200
Sender: weis

Wow, there's nothing like internationalization to spark lively
discussions. Since even Gérard Huet (oops, sorry for that 8859-1
accent, couldn't resist) and Francis Dupont broke their vows of
silence, I guess I have to say something too.

The support for ISO-8859-1 in Caml Light and OCaml is essentially an
historical and geographical accident. The first books on Caml were
written in French, and it was nice to be able to use accented French
words as identifiers. Also, that was at a time (1991-1992) when
Unicode and consorts didn't even exist.

The choice of ISO-8859-1 is not that politically incorrect either: it
works not only for western Europe, but also for Latin America, many
Pacific countries, and large parts of Africa.
If we were to choose an 8-bit character set based on the number of
OCaml programmers that actually need it, I guess ISO-8859-1 (or its
newer incarnation with the Euro sign whose name I can't remember)
would still win. (At least until we get OCaml in the Chinese
curriculum...)

Notice also that Caml doesn't prevent the programmer from putting any
character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
Unicode) in character strings and in comments.

There are several ways to internationalize further. One is to support
other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There
are several problems with this:

- It's not enough for Asian languages.
- The POSIX localization stuff isn't supported under Windows.
- It's badly supported on all Unixes I know (e.g. to get French, I
  need to set LC_CTYPE to different values under Linux, Solaris, and
  Digital Unix; it gets worse for other languages such as Japanese).
- Handling of mixed-language texts is a nightmare.

Unicode / ISO10646 is probably a better approach. However, it has its
own problems:

- There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
  technology (Windows, Java) chose 16-bit Unicode; late adopters
  (Unix) chose 32-bit Unicode. (That's the great thing about
  standards: there are so many to choose from...)
- Apparently, not everyone agrees on multi-byte encodings (UTF8)
  either. E.g. Java seems to have its own variant of UTF8. How are we
  going to interoperate?
- I/O is a nightmare. The API has to handle at least byte streams,
  wide character streams, and UTF8-encoded streams.
- Support for Unicode / UTF8 files in today's operating systems and
  GUIs is very low. When will I be able to do "more" on a UTF8 file
  and see my French accented letters?

My conclusion is that I18N is such a mess that I don't think we'll do
much about it in Caml anytime soon.
Perhaps some basic support for wide characters and wide character
strings will be added at some point, if only because COM
interoperability requires it.

- Xavier Leroy
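
The remark above that Caml strings can carry any character set that includes ASCII can be illustrated with a minimal OCaml sketch. It assumes only the standard library; the helper name `latin1_to_utf8` is hypothetical and introduced here for illustration. The same word is stored once as ISO-8859-1 bytes and once as UTF8 bytes, and the helper re-encodes the first form into the second.

```ocaml
(* OCaml strings are plain byte sequences, so the same word can be stored
   in either encoding; only the bytes (and their count) differ. *)

(* "Gérard" in ISO-8859-1: the accented e is the single byte 233 (0xE9). *)
let latin1 = "G\233rard"

(* "Gérard" in UTF8: the accented e is the two bytes 195 169 (0xC3 0xA9). *)
let utf8 = "G\195\169rard"

(* Hypothetical helper: re-encode an ISO-8859-1 string as UTF8.
   Each ISO-8859-1 byte has the same value as its Unicode code point, so
   bytes below 128 are copied as-is and the rest become two-byte sequences. *)
let latin1_to_utf8 s =
  let b = Buffer.create (2 * String.length s) in
  String.iter
    (fun c ->
      let n = Char.code c in
      if n < 0x80 then Buffer.add_char b c
      else begin
        Buffer.add_char b (Char.chr (0xC0 lor (n lsr 6)));
        Buffer.add_char b (Char.chr (0x80 lor (n land 0x3F)))
      end)
    s;
  Buffer.contents b

let () =
  Printf.printf "latin1: %d bytes, utf8: %d bytes\n"
    (String.length latin1) (String.length utf8);
  assert (latin1_to_utf8 latin1 = utf8)
```

Note that `String.length` counts bytes, not characters, which is one reason mixing encodings in the same program needs care.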