Re: localization, internationalization and Caml

caml-list - the Caml user's mailing list

From: skaller <skaller@maxtal.com.au>
To: Xavier Leroy <Xavier.Leroy@inria.fr>
Cc: caml-list@inria.fr
Subject: Re: localization, internationalization and Caml
Date: Wed, 20 Oct 1999 04:36:24 +1000
Message-ID: <380CBA28.E13AFB6E@maxtal.com.au>
In-Reply-To: <19991017162917.48773@pauillac.inria.fr>

Xavier Leroy wrote:

> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident.  The first books on Caml were
> written in French, and it was nice to be able to use accented French
> words as identifiers.  Also, that was at a time (1991-1992) where
> Unicode and consorts didn't even exist.

	And supporting ISO-8859-1 was a fine thing to do at the time!
 
> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa.  If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win.  (At least
> until we get OCaml in the Chinese curriculum...)

	While this is true, there is a circularity here: people not
using 8 bit character sets face an extra battle using ocaml.
 
> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.

	Yes. This is one of the key points of my argument that UTF-8
is the natural way to go: it provides ISO-10646 compliance without
requiring any new string kind.
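
To illustrate the point, here is a minimal sketch of encoding one ISO-10646 code point as UTF-8 into an ordinary OCaml string. The name utf8_of_code_point is hypothetical (it is not a standard library function), and the sketch assumes the modern code point range up to 0x10FFFF rather than the full 31-bit space.

```ocaml
(* Encode one code point as UTF-8 into an ordinary byte string.
   [utf8_of_code_point] is a hypothetical helper, not a stdlib function.
   Assumption: code points are limited to 0..0x10FFFF, surrogates excluded. *)
let utf8_of_code_point (cp : int) : string =
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg "utf8_of_code_point"
  else if cp < 0x80 then
    (* ASCII is encoded as itself, so existing ASCII text is unchanged. *)
    add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end;
  Buffer.contents b
```

The result is a plain string of bytes, so it can be concatenated, stored, and written out with the existing String and I/O functions.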
 
> There are several ways to internationalize further.  One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff).  There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g. to get French, I
>   need to set LC_CTYPE to different values under Linux, Solaris, and
>   Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.

	If you are suggesting not using C locale stuff -- I agree entirely.

> Unicode / ISO10646 is probably a better approach.  However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode.  Early adopters of that
>   technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
>   chose 32-bit Unicode.   (That's the great thing about standards:
>   there are so many to choose from...)

	I cannot see the problem -- except for the 16 bit adopters,
who must eventually upgrade .. again.

> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
>   E.g. Java seems to have its own variant of UTF8.  How are we going
>   to interoperate?

	I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO Standard is followed. If Java doesn't
do that, that is Java's problem.

> - I/O is a nightmare.  The API has to handle at least byte streams,
>   wide character streams, and UTF8-encoded streams.

	No, it doesn't. This is a possibility. But it is NOT necessary.
It is necessary only to read byte streams. Conversion can be done
later using strings. This is less efficient, but it is a sensible
starting point (to ignore internationalisation on I/O completely).
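
A rough sketch of that approach: read the file as raw bytes, then decode UTF-8 afterwards from the string. Both read_bytes and code_points_of_utf8 are hypothetical names. The decoder is deliberately lenient (it does not reject overlong forms or surrogates), so treat it as an illustration rather than a validating implementation.

```ocaml
(* Read a whole file as raw bytes; the channel is opened in binary mode,
   so no text-mode translation takes place. Hypothetical helper.
   Note: the channel is not closed if reading raises an exception. *)
let read_bytes (path : string) : string =
  let ic = open_in_bin path in
  let n = in_channel_length ic in
  let s = really_input_string ic n in
  close_in ic;
  s

(* Decode a UTF-8 byte string into a list of code points, after the fact.
   Assumption: sequences of at most 4 bytes; overlong forms and surrogates
   are not rejected. *)
let code_points_of_utf8 (s : string) : int list =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  (* A continuation byte must exist and have the form 10xxxxxx. *)
  let cont i =
    if i < n && byte i land 0xC0 = 0x80 then byte i land 0x3F
    else invalid_arg "code_points_of_utf8: malformed input"
  in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      let b = byte i in
      if b < 0x80 then go (i + 1) (b :: acc)
      else if b land 0xE0 = 0xC0 then (
        let c1 = cont (i + 1) in
        go (i + 2) ((((b land 0x1F) lsl 6) lor c1) :: acc))
      else if b land 0xF0 = 0xE0 then (
        let c1 = cont (i + 1) in
        let c2 = cont (i + 2) in
        go (i + 3) ((((b land 0x0F) lsl 12) lor (c1 lsl 6) lor c2) :: acc))
      else if b land 0xF8 = 0xF0 then (
        let c1 = cont (i + 1) in
        let c2 = cont (i + 2) in
        let c3 = cont (i + 3) in
        go (i + 4)
          ((((b land 0x07) lsl 18) lor (c1 lsl 12) lor (c2 lsl 6) lor c3)
           :: acc))
      else invalid_arg "code_points_of_utf8: malformed input"
  in
  go 0 []
```

With these two pieces, something like code_points_of_utf8 (read_bytes "input.txt") keeps the I/O layer byte-oriented and confines all encoding knowledge to a pure function over strings.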

> - Support for Unicode / UTF8 files in today's operating systems and GUIs
>   is very low.  When will I be able to do "more" on an UTF8 file and see my
>   French accented letters?

	Yes. I agree. This is a major problem. One of the answers is
"When programming languages provide the support that applications
programmers need" :-)
 
> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.  

	I agree. The way forward is, I believe:

	a) do not change the I/O system, but deprecate TEXT mode
	   (all I/O should be done in binary)

	b) do not change the String module, but deprecate the
	   upper/lower case functions (and anything else that
	   smacks of relating to natural language)

	c) Provide functions to support internationalisation.

	d) modify the ocaml compiler, to process \uXXXX and \UXXXX
	   escapes [everywhere]

	e) provide a fast variable length array type

(d) could be done easily using camlp4, I think.
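
As a rough idea of what (d) involves, here is a sketch of a textual pass that rewrites \uXXXX and \UXXXXXXXX escapes into UTF-8 bytes. This is not how the compiler or any preprocessor actually does it; expand_escapes and its helpers are hypothetical names, and the 4-digit and 8-digit forms are assumed to follow the C/C++ convention.

```ocaml
(* Append code point [cp] to buffer [b] as UTF-8. The caller is assumed to
   have checked that [cp] is a valid code point. Hypothetical helper. *)
let add_utf8 b cp =
  let add n = Buffer.add_char b (Char.chr n) in
  if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end

let hex_value c =
  match c with
  | '0' .. '9' -> Some (Char.code c - Char.code '0')
  | 'a' .. 'f' -> Some (Char.code c - Char.code 'a' + 10)
  | 'A' .. 'F' -> Some (Char.code c - Char.code 'A' + 10)
  | _ -> None

(* Parse exactly [len] hex digits of [s] starting at [pos], if present. *)
let parse_hex s pos len =
  let rec go i acc =
    if i = len then Some acc
    else
      match hex_value s.[pos + i] with
      | Some v -> go (i + 1) (acc * 16 + v)
      | None -> None
  in
  if pos + len <= String.length s then go 0 0 else None

(* Replace \uXXXX and \UXXXXXXXX escapes in [s] by UTF-8 bytes; anything
   else, including malformed escapes, is copied through unchanged. *)
let expand_escapes (s : string) : string =
  let n = String.length s in
  let b = Buffer.create n in
  let rec go i =
    if i >= n then ()
    else if s.[i] = '\\' && i + 1 < n then begin
      let digits = match s.[i + 1] with 'u' -> 4 | 'U' -> 8 | _ -> 0 in
      match (if digits > 0 then parse_hex s (i + 2) digits else None) with
      | Some cp when cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF) ->
          add_utf8 b cp;
          go (i + 2 + digits)
      | _ ->
          (* Copy the backslash together with the next character, so that
             an escaped backslash cannot start a new escape. *)
          Buffer.add_char b s.[i];
          Buffer.add_char b s.[i + 1];
          go (i + 2)
    end else begin
      Buffer.add_char b s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents b
```

Because the output is UTF-8 in a plain string, a pass like this could run before the ordinary lexer without requiring a new string type.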

> Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.

	I don't think it is necessary, a variable length
array of integers is good enough.
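
For concreteness, a minimal sketch of such a variable length array of integers, using a doubling buffer. The type int_vec and its functions are hypothetical names chosen for illustration, not an existing library.

```ocaml
(* A growable array of integers, e.g. one code point per element.
   Hypothetical type; [len] counts the elements in use, the rest of
   [data] is spare capacity. *)
type int_vec = { mutable data : int array; mutable len : int }

let create () = { data = Array.make 16 0; len = 0 }

(* Append [x], doubling the backing array when it is full, so that a
   sequence of pushes costs amortised constant time per element. *)
let push v x =
  if v.len = Array.length v.data then begin
    let bigger = Array.make (2 * v.len) 0 in
    Array.blit v.data 0 bigger 0 v.len;
    v.data <- bigger
  end;
  v.data.(v.len) <- x;
  v.len <- v.len + 1

(* Bounds-checked access against the logical length, not the capacity. *)
let get v i =
  if i < 0 || i >= v.len then invalid_arg "int_vec.get" else v.data.(i)

let length v = v.len

(* Copy out exactly the elements in use. *)
let to_array v = Array.sub v.data 0 v.len
```

Each element holds a full code point, so indexing is constant time, at the cost of more memory than a UTF-8 string.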

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller
