From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: weis
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA14899 for caml-redistribution; Mon, 18 Oct 1999 16:18:56 +0200 (MET DST)
Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id QAA19335 for <caml-list@pauillac.inria.fr>; Sun, 17 Oct 1999 16:29:19 +0200 (MET DST)
Received: from pauillac.inria.fr (pauillac.inria.fr [128.93.11.35])
	by concorde.inria.fr (8.8.7/8.8.7) with ESMTP id QAA06940
	for <caml-list@inria.fr>; Sun, 17 Oct 1999 16:29:17 +0200 (MET DST)
Received: (from xleroy@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA28272; Sun, 17 Oct 1999 16:29:17 +0200 (MET DST)
Message-ID: <19991017162917.48773@pauillac.inria.fr>
Date: Sun, 17 Oct 1999 16:29:17 +0200
From: Xavier Leroy <Xavier.Leroy@inria.fr>
To: caml-list@inria.fr
Subject: Re: localization, internationalization and Caml
References: <199910151406.QAA07501@yana.inria.fr>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Mailer: Mutt 0.89.1
In-Reply-To: <199910151406.QAA07501@yana.inria.fr>; from Gerard Huet on Fri, Oct 15, 1999 at 03:53:15PM +0200
Sender: weis

Wow, there's nothing like internationalization to spark lively
discussions.  Since even Gérard Huet (oops, sorry for that 8859-1
accent, couldn't resist) and Francis Dupont broke their
vows of silence, I guess I have to say something too.

The support for ISO-8859-1 in Caml Light and OCaml is essentially an
historical and geographical accident.  The first books on Caml were
written in French, and it was nice to be able to use accented french
words as identifiers.  Also, that was at a time (1991-1992) where
Unicode and consorts didn't even exist.

The choice of ISO-8859-1 is not that politically incorrect either: it
works not only for western Europe, but also for Latin America, many
Pacific countries, and large parts of Africa.  If we were to choose an
8-bit character set based on the number of OCaml programmers that
actually need it, I guess ISO-8859-1 (or its newer incarnation with
the Euro sign whose name I can't remember) would still win.  (At least
until we get OCaml in the Chinese curriculum...)

Notice also that Caml doesn't prevent the programmer from putting any
character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
Unicode) in character strings and in comments.  

There are several ways to internationalize further.  One is to support
other 8-bit character sets the POSIX way (the LC_CTYPE stuff).  There
are several problems with this:
- It's not enough for Asian languages.
- The POSIX localization stuff isn't supported under Windows.
- It's badly supported on all Unixes I know (e.g. to get French, I
  need to set LC_CTYPE to different values under Linux, Solaris, and
  Digital Unix; it gets worse for other languages such as Japanese).
- Handling of mixed-language texts is a nightmare.

Unicode / ISO10646 is probably a better approach.  However, it has its
own problems:
- There's 16-bit Unicode and 32-bit Unicode.  Early adopters of that
  technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
  chose 32-bit Unicode.   (That's the great things about standards:
  there are so many to choose from...)
- Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
  E.g. Java seems to have its own variant of UTF8.  How are we going
  to interoperate?
- I/O is a nightmare.  The API has to handle at least byte streams,
  wide character streams, and UTF8-encoded streams.
- Support for Unicode / UTF8 files in today's operating systems and GUIs
  is very low.  When will I be able to do "more" on an UTF8 file and see my
  French accented letters? 

My conclusion is that I18N is such a mess that I don't think we'll do
much about it in Caml anytime soon.  Perhaps some basic support for
wide characters and wide character strings will be added at some
point, if only because COM interoperability requires it.

- Xavier Leroy