From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: weis
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id XAA26006 for caml-redistribution; Tue, 19 Oct 1999 23:39:30 +0200 (MET DST)
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id UAA21298 for <caml-list@pauillac.inria.fr>; Tue, 19 Oct 1999 20:42:17 +0200 (MET DST)
Received: from ruby (pm1-29.triode.net.au [203.63.235.45])
	by nez-perce.inria.fr (8.8.7/8.8.7) with ESMTP id UAA06702;
	Tue, 19 Oct 1999 20:42:12 +0200 (MET DST)
Received: from maxtal.com.au (IDENT:root@localhost [127.0.0.1])
	by ruby (8.9.3/8.9.3) with ESMTP id EAA23081;
	Wed, 20 Oct 1999 04:36:25 +1000
Sender: weis
Message-ID: <380CBA28.E13AFB6E@maxtal.com.au>
Date: Wed, 20 Oct 1999 04:36:24 +1000
From: skaller <skaller@maxtal.com.au>
Organization: Maxtal
X-Mailer: Mozilla 4.51 [en] (X11; I; Linux 2.2.12 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: Xavier Leroy <Xavier.Leroy@inria.fr>
CC: caml-list@inria.fr
Subject: Re: localization, internationalization and Caml
References: <199910151406.QAA07501@yana.inria.fr> <19991017162917.48773@pauillac.inria.fr>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Xavier Leroy wrote:

> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident.  The first books on Caml were
> written in French, and it was nice to be able to use accented french
> words as identifiers.  Also, that was at a time (1991-1992) where
> Unicode and consorts didn't even exist.

	And supporting ISO-8859-1 was a fine thing to do at the time!
 
> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa.  If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win.  (At least
> until we get OCaml in the Chinese curriculum...)

	While this is true, there is a circularity here: people not
using 8 bit character sets face an extra battle using ocaml.
 
> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.

	Yes. This one of the key points of my argument that UTF-8
is the natural way to go: it provides ISO-10646 compliance without
requiring any new string kind.
 
> There are several ways to internationalize further.  One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff).  There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g. to get French, I
>   need to set LC_CTYPE to different values under Linux, Solaris, and
>   Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.

	If you are suggesting not using C locale stuff -- I agree entirely.

> Unicode / ISO10646 is probably a better approach.  However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode.  Early adopters of that
>   technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
>   chose 32-bit Unicode.   (That's the great things about standards:
>   there are so many to choose from...)

	I cannot see the problem -- except for the 16 bit adopters,
who must eventually upgrade .. again.

> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
>   E.g. Java seems to have its own variant of UTF8.  How are we going
>   to interoperate?

	I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO Standard is followed. If Java doesn't
do that,
that is Java's problem.

> - I/O is a nightmare.  The API has to handle at least byte streams,
>   wide character streams, and UTF8-encoded streams.

	No, it doesn't. This is a possibility. But it is NOT necessary.
It is necessary only to read byte streams. Conversion can be done
later using strings. This is less efficient, but it is a sensible
starting point (to ignore internationalisation on I/O completely).

> - Support for Unicode / UTF8 files in today's operating systems and GUIs
>   is very low.  When will I be able to do "more" on an UTF8 file and see my
>   French accented letters?

	Yes. I agree. This is a major problem. One of the answers is
"When programming languages provide the support that applications
programmers need" :-)
 
> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.  

	I agree. The way forward is, I believe:

	a) do not change the I/O system, but deprecate TEXT mode
	   (all I/O should be done in binary)

	b) do not change the String module, but deprecate the
	   upper/lower case functions (and anything else that
	   smacks of relating to natural language)

	c) Provide functions to support internationalisation.

	d) modify the ocaml compiler, to process \uXXXX and \UXXXX
	   escapes [everywhere]

	e) provide a fast variable length array type

(d) could be done easily using ocamlp4 I think.

>Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.

	I don't think it is necessary, a variable length
array of integers is good enough.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller