From: skaller <skaller@maxtal.com.au>
To: Xavier Leroy <Xavier.Leroy@inria.fr>
Cc: caml-list@inria.fr
Subject: Re: localization, internationalization and Caml
Date: Wed, 20 Oct 1999 04:36:24 +1000
Message-ID: <380CBA28.E13AFB6E@maxtal.com.au>
In-Reply-To: <19991017162917.48773@pauillac.inria.fr>
Xavier Leroy wrote:
> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident. The first books on Caml were
> written in French, and it was nice to be able to use accented french
> words as identifiers. Also, that was at a time (1991-1992) where
> Unicode and consorts didn't even exist.
And supporting ISO-8859-1 was a fine thing to do at the time!
> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa. If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win. (At least
> until we get OCaml in the Chinese curriculum...)
While this is true, there is a circularity here: people not
using 8-bit character sets face an extra battle using OCaml.
> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.
Yes. This is one of the key points of my argument that UTF-8
is the natural way to go: it provides ISO-10646 compliance without
requiring any new string kind.
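
To illustrate the point about not needing a new string kind, here is a
minimal sketch (nothing from the OCaml distribution) that treats an ordinary
string as UTF-8 and decodes it into a list of ISO-10646 code points. The name
decode_utf8 is hypothetical. It assumes sequences of one to four bytes only,
and the validation is deliberately thin: overlong forms and surrogates are
not rejected.

```ocaml
(* Hypothetical helper: decode a UTF-8 byte string into code points.
   Handles 1- to 4-byte sequences; does not reject overlong encodings. *)
let decode_utf8 (s : string) : int list =
  let n = String.length s in
  (* Read [count] continuation bytes starting at index [i],
     folding their low 6 bits into [acc]. *)
  let rec cont i count acc =
    if count = 0 then (acc, i)
    else if i >= n then invalid_arg "decode_utf8: truncated sequence"
    else
      let b = Char.code s.[i] in
      if b land 0xC0 <> 0x80 then
        invalid_arg "decode_utf8: bad continuation byte"
      else cont (i + 1) (count - 1) ((acc lsl 6) lor (b land 0x3F))
  in
  let rec loop i acc =
    if i >= n then List.rev acc
    else
      let b = Char.code s.[i] in
      let (cp, next) =
        if b < 0x80 then (b, i + 1)                       (* ASCII *)
        else if b land 0xE0 = 0xC0 then cont (i + 1) 1 (b land 0x1F)
        else if b land 0xF0 = 0xE0 then cont (i + 1) 2 (b land 0x0F)
        else if b land 0xF8 = 0xF0 then cont (i + 1) 3 (b land 0x07)
        else invalid_arg "decode_utf8: bad lead byte"
      in
      loop next (cp :: acc)
  in
  loop 0 []
```

The string type itself is untouched: ASCII text decodes to itself, and
anything that only copies or concatenates strings never needs to call this.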
> There are several ways to internationalize further. One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g. to get French, I
> need to set LC_CTYPE to different values under Linux, Solaris, and
> Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.
If you are suggesting not using C locale stuff -- I agree entirely.
> Unicode / ISO10646 is probably a better approach. However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
> technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
> chose 32-bit Unicode. (That's the great things about standards:
> there are so many to choose from...)
I cannot see the problem, except for the 16-bit adopters,
who must eventually upgrade ... again.
> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
> E.g. Java seems to have its own variant of UTF8. How are we going
> to interoperate?
I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO Standard is followed. If Java doesn't
do that, that is Java's problem.
> - I/O is a nightmare. The API has to handle at least byte streams,
> wide character streams, and UTF8-encoded streams.
No, it doesn't. This is a possibility. But it is NOT necessary.
It is necessary only to read byte streams. Conversion can be done
later using strings. This is less efficient, but it is a sensible
starting point (to ignore internationalisation on I/O completely).
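
As a rough sketch of what "read bytes, convert later" could look like: the
file is read in binary mode into an ordinary string, and a separate
string-to-string function does the conversion afterwards. Both function
names (read_bytes, latin1_to_utf8) are made up for illustration, and the
choice of ISO-8859-1 as the source encoding is only an assumption for the
example.

```ocaml
(* Hypothetical helper: read a whole file as raw bytes, with no
   text-mode translation of any kind. *)
let read_bytes (path : string) : string =
  let ic = open_in_bin path in
  match really_input_string ic (in_channel_length ic) with
  | s -> close_in ic; s
  | exception e -> close_in_noerr ic; raise e

(* Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
   Every Latin-1 byte is the code point of the same value, so bytes
   >= 0x80 become a two-byte UTF-8 sequence. *)
let latin1_to_utf8 (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
       let b = Char.code c in
       if b < 0x80 then Buffer.add_char buf c
       else begin
         Buffer.add_char buf (Char.chr (0xC0 lor (b lsr 6)));
         Buffer.add_char buf (Char.chr (0x80 lor (b land 0x3F)))
       end)
    s;
  Buffer.contents buf
```

The I/O layer knows nothing about encodings here; a program would call
something like latin1_to_utf8 (read_bytes path) and pay for one extra pass
over the data.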
> - Support for Unicode / UTF8 files in today's operating systems and GUIs
> is very low. When will I be able to do "more" on an UTF8 file and see my
> French accented letters?
Yes. I agree. This is a major problem. One of the answers is
"When programming languages provide the support that applications
programmers need" :-)
> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.
I agree. The way forward is, I believe:
a) do not change the I/O system, but deprecate TEXT mode
(all I/O should be done in binary)
b) do not change the String module, but deprecate the
upper/lower case functions (and anything else that
smacks of relating to natural language)
c) Provide functions to support internationalisation.
d) modify the ocaml compiler, to process \uXXXX and \UXXXX
escapes [everywhere]
e) provide a fast variable length array type
(d) could be done easily using camlp4, I think.
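
A sketch of the kind of processing (d) implies, written as a plain string
function rather than as a compiler or preprocessor change. It expands
\uXXXX (exactly four hex digits, an assumption) into UTF-8 bytes. The names
add_utf8 and expand_escapes are hypothetical, \UXXXX is left out because its
digit count is not pinned down here, and the 0x10FFFF upper limit is also an
assumption.

```ocaml
(* Hypothetical helper: append one code point to [buf] as UTF-8.
   Surrogate values (0xD800-0xDFFF) are not rejected in this sketch. *)
let add_utf8 (buf : Buffer.t) (cp : int) : unit =
  let add b = Buffer.add_char buf (Char.chr b) in
  if cp < 0 || cp > 0x10FFFF then invalid_arg "add_utf8: out of range"
  else if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end

let is_hex (c : char) : bool =
  match c with
  | '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true
  | _ -> false

(* Replace every backslash-u followed by four hex digits with the UTF-8
   encoding of that code point; all other bytes are copied unchanged. *)
let expand_escapes (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec loop i =
    if i < n then
      if i + 6 <= n && s.[i] = '\\' && s.[i + 1] = 'u'
         && is_hex s.[i + 2] && is_hex s.[i + 3]
         && is_hex s.[i + 4] && is_hex s.[i + 5]
      then begin
        add_utf8 buf (int_of_string ("0x" ^ String.sub s (i + 2) 4));
        loop (i + 6)
      end else begin
        Buffer.add_char buf s.[i];
        loop (i + 1)
      end
  in
  loop 0;
  Buffer.contents buf
```

Because the result is again an ordinary byte string, the escapes need no
support from the String module or the I/O system.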
> Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.
I don't think it is necessary: a variable-length
array of integers is good enough.
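
For concreteness, a minimal sketch of such a variable-length integer array,
holding one code point per element. This is only one possible design
(capacity doubling over a plain int array), and the type and function names
(ivec, create, push, get, length, to_list) are hypothetical, not an existing
library.

```ocaml
(* Hypothetical growable array of ints; [len] is the number of
   elements in use, [data] may have spare capacity beyond that. *)
type ivec = { mutable data : int array; mutable len : int }

let create () : ivec = { data = Array.make 16 0; len = 0 }

(* Append [x], doubling the backing array when it is full, so that
   a sequence of pushes costs amortised constant time each. *)
let push (v : ivec) (x : int) : unit =
  if v.len = Array.length v.data then begin
    let bigger = Array.make (2 * v.len) 0 in
    Array.blit v.data 0 bigger 0 v.len;
    v.data <- bigger
  end;
  v.data.(v.len) <- x;
  v.len <- v.len + 1

let length (v : ivec) : int = v.len

let get (v : ivec) (i : int) : int =
  if i < 0 || i >= v.len then invalid_arg "get: index out of bounds"
  else v.data.(i)

let to_list (v : ivec) : int list =
  Array.to_list (Array.sub v.data 0 v.len)
```

Indexing is constant time per code point, which is the main thing a
wide-string type would otherwise be wanted for.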
--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller