From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: weis
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id TAA03304 for caml-redistribution; Thu, 21 Oct 1999 19:11:42 +0200 (MET DST)
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id RAA03898 for <caml-list@pauillac.inria.fr>; Thu, 21 Oct 1999 17:41:50 +0200 (MET DST)
Received: from ruby (pm1-3.triode.net.au [203.63.235.19])
	by nez-perce.inria.fr (8.8.7/8.8.7) with ESMTP id RAA02724
	for <caml-list@inria.fr>; Thu, 21 Oct 1999 17:41:45 +0200 (MET DST)
Received: from maxtal.com.au (IDENT:root@localhost [127.0.0.1])
	by ruby (8.9.3/8.9.3) with ESMTP id BAA25586;
	Fri, 22 Oct 1999 01:35:00 +1000
Sender: weis
Message-ID: <380F32A4.8E5ECC5F@maxtal.com.au>
Date: Fri, 22 Oct 1999 01:35:00 +1000
From: skaller <skaller@maxtal.com.au>
Organization: Maxtal
X-Mailer: Mozilla 4.51 [en] (X11; I; Linux 2.2.12 i586)
X-Accept-Language: en
MIME-Version: 1.0
To: matias@k-bell.com
CC: caml-list@inria.fr, Gerd.Stolpmann@darmstadt.netsurf.de
Subject: Re: localization, internationalization and Caml
References: <380CB30E.56D1A8A2@maxtal.com.au> <99102100543400.15513@ice> <380F0157.CDBBAD7D@k-bell.com>
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

Mat韆s Giovannini wrote:

> OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll
> agree that my view is chauvinistic (and selfish, perhaps: I already have
> "俊衢眢鷘窳赏于苎" for writing in Spanish, why should I ask for more?),
> I see no restriction in that (well, If I were Chinese, or Egiptian, I
> would see things differently). 

	Exactly. There are quite a lot of Chinese, Indian,
Russian ... and non-Latin people in the world: more than Latins.
And many are faced with a barrier, participating in the computing world
because of language problems.

>What's more, the whole syntactic
> apparatus of a programming language *assumes* a Latin setting, where
> things make sense when read from left to right, from top to bottom; and
> where punctuation is what we're used to. Programming languages suited
> for a Han, or Arab, or even a Hebrew audience would have to be rethinked
> from the grounds up.

	Actually, no. Most of these peoples learn English and learn
computing, if they are to work with computers. But they still wish
to use comments, strings, and identifiers in their native script.

	Have you ever seen a Japanese program? I have.
Quite an interesting challenge: normal C/C++ code, with 
Latin characters encoding Japanese character names in identifiers,
and actual Japanese characters in comments and strings.
 
	I had no idea what the code did. My point: for a non-native
speaker, being forced to use a foreign language for identifiers and
comments is a serious impediment, not having native characters
in string is not an impediment, but a complete disaster (how will
the users of the program understand it -- they may not know any
Latin language)

> On the other hand, OCaml provides a String type that *can be* seen as a
> variable-length sequence of uninterpreted bytes. 

	Yes. What ocaml does not provide is a way of encoding
extended characters -- \uXXXX \UXXXXXXXXX in strings, or in identifiers.

>We have uninterpreted
> bytes! It's all we need to build whatever I18NString type we may need.
> What is missing is *library* facilities to abstract that view into a
> full-fledged i18n machinery. 

	I agree.

>Of course, there's a problem with the
> manipulation of 32-bit integer values, but if used with care, the Nat
> datatype could serve perfectly well as the underlying, low-level datatype.
> 
> Which makes me think, John, you already have variable-length int arrays.

	But they're not standard (yet). Actually, ocaml 'int' is 31 bits,
which is enough bits for ISO10646 (with some careful fiddling to avoid
problems with the sign?).

	So there are TWO issues -- one is to make ocaml itself
ISO10646 aware (i.e., the compiler), and the other is to provide
users with libraries to manipulate extended characters.

	Please note: neither of these features would be optional,
were ocaml to be submitted for ISO standardisation. ISO directives
require all ISO languages to upgrade to provide international
support. I know ocaml isn't an ISO language, but I think the 
basic intent is sound. [In some sense, ocaml is already a leader,
accepting Latin-1 characters when other languages only allowed ASCII]

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller