* localization, internationalization and Caml
@ 1999-10-13 12:12 STARYNKEVITCH Basile
  1999-10-14 22:20 ` skaller
  0 siblings, 1 reply; 21+ messages in thread
From: STARYNKEVITCH Basile @ 1999-10-13 12:12 UTC
To: caml-list

Hello All,

Just a small remark about localization and internationalization (see
your setlocale, printf, and strtod man pages), which means adapting
software to culturally different users. Problems include date
representation, number representation, error messages, and even
character sets and left-to-right or right-to-left reading order. For
example, some French users want "Taux d'inflation = 3,14% - TROP"
instead of "TOO MUCH inflation 3.14%": the message is in English/French,
the numbers use a decimal point/comma, and the argument 3.14 and the
locale-dependent string "TOO MUCH"/"TROP" appear in a different order.

I am not at all a fan of localization. But I do have a wish if it ever
occurs in Ocaml:

* do not depend on C localization (this means Printf.printf should not
  depend on the LC_NUMERIC environment variable; is this true now?)

* make the locale an explicit argument, or at least a property bound to
  a channel. Several channels may need different locales (for instance
  an HTTP socket needs the C locale, while the user's stderr could be in
  a French locale), so

    lprintf Locale.French "%d %g" 2 3.14

  is much better than

    set_locale LC_ALL "FR"
    printf "%d %g" 2 3.14

By the way, I believe more and more that the printf interface is (in C
as in Ocaml) a big mistake (one which could easily be avoided in Ocaml,
thanks to its typing). We should code

    print [Int 2; String " < "; Float 3.14]

instead of

    printf "%d < %g" 2 3.14

Again, I am *not* asking for localization in Ocaml, but if somebody
needs it (I don't), I still hope it would be implemented better than in
C. And I think that Unicode would be more useful than localization.
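Basile's variant-based alternative to printf can be sketched in a few lines of Ocaml. This is only an illustrative sketch: the `item` type and the function names are hypothetical, not an existing library API.

```ocaml
(* A sketch of printing via an ordinary list of tagged values instead of
   a format string; the type and the names are illustrative only. *)
type item = Int of int | Float of float | String of string

let string_of_item = function
  | Int n -> string_of_int n
  | Float x -> string_of_float x
  | String s -> s

(* Build the output string, then print it; a wrongly typed list element
   is rejected by the ordinary type checker, with no format-string magic. *)
let print items =
  print_string (String.concat "" (List.map string_of_item items))

let () = print [Int 2; String " < "; Float 3.14]
```

A locale-aware variant would simply thread an explicit locale value through `string_of_item`, which is exactly the "locale as an explicit argument" design asked for above.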
I'm saying all this because I now have a headache regarding C
localization, so I hope that Ocaml will avoid that mistake.

################

Short summary (originally in French): I think that localization in Ocaml
(which I do not feel the need for) should not be done as in C.

N.B. Any opinions expressed here are only mine, and not of my
organization.

---------------------------------------------------------------------
Basile STARYNKEVITCH ---- Commissariat à l'Energie Atomique
DTA/LETI/DEIN/SLA * CEA/Saclay b.528 (p111f) * 91191 GIF/YVETTE CEDEX * France
phone: 1.69.08.60.55; fax: 1.69.08.83.95; home: 1.46.65.45.53
email: Basile point Starynkevitch at cea point fr

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml
  1999-10-13 12:12 localization, internationalization and Caml STARYNKEVITCH Basile
@ 1999-10-14 22:20 ` skaller
  1999-10-15  8:26   ` Francis Dupont
  0 siblings, 1 reply; 21+ messages in thread
From: skaller @ 1999-10-14 22:20 UTC
To: STARYNKEVITCH Basile; +Cc: caml-list

STARYNKEVITCH Basile wrote:
>
> By the way, I believe more and more that the printf interface is (in C
> as in Ocaml) a big mistake (one which could easily be avoided in
> Ocaml, thanks to its typing)

I agree but ..

> We should code
>
>     print [Int 2; String " < "; Float 3.14]
>
> instead of
>
>     printf "%d < %g" 2 3.14

However, I do not agree with the solution. The correct method, IMHO, is
to provide some proper formatting functions (Ocaml's are plain WRONG!)
such as

    formatted_string_of_int justify width value

[where justify is LeftSpace | RightSpace | LeftZero] and then use the
power of functional programming to create output strings. [The above is
only a quick exemplary interface, not a well-considered one.]

> Again, I am *not* asking for localization in Ocaml, but if somebody
> needs it (I don't), I still hope it would be implemented better than
> in C. And I think that Unicode would be more useful than localization.

Please, ISO10646, not Unicode. We have International Standards.

There is a lot of work to be done in internationalisation. If it is
worth doing, it is worth doing right. The current 'support' for 8 bit
characters in ocaml should be deprecated immediately. It is an extremely
bad thing to have, since Latin-1 et al. are archaic 8 bit standards
incompatible with the international standard for ISO10646 communication,
namely the UTF-8 encoding.

Yes, I know Latin-1 is useful now for French. The way forward may well
be to provide an input filter to convert Latin-1 (or any other encoding)
to UTF-8, and have ocaml process that.
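skaller's exemplary interface can be sketched directly. The `justify` type and function name follow his illustrative suggestion and are not a real Ocaml API; note that the `LeftZero` case is deliberately naive about negative numbers.

```ocaml
(* Sketch of the suggested formatting primitive; names follow the post's
   exemplary interface, not any existing library. *)
type justify = LeftSpace | RightSpace | LeftZero

let formatted_string_of_int justify width value =
  let s = string_of_int value in
  let pad = max 0 (width - String.length s) in
  match justify with
  | LeftSpace  -> String.make pad ' ' ^ s   (* right-align with spaces *)
  | RightSpace -> s ^ String.make pad ' '   (* left-align with spaces *)
  | LeftZero   -> String.make pad '0' ^ s   (* naive: pads "-5" to "00-5" *)

(* Output strings are then built by ordinary functional composition: *)
let line = String.concat " " [formatted_string_of_int LeftZero 3 7; "items"]
```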
This requires almost no changes to the compiler: the design should open
up the set of characters acceptable in identifiers, probably to some
subset of the set recommended in one of the ISO10646-related documents;
the other change required is to accept \uXXXX and \UXXXXXXXX escapes in
strings. String processing functions should generally continue to be
8 bit [per octet]: full internationalisation of client string handling
functions is a very complex, non-trivial task.
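The escape processing mentioned above could be prototyped entirely outside the compiler. The sketch below expands \uXXXX escapes (code points up to U+FFFF) into UTF-8 bytes; the function names are hypothetical and \UXXXXXXXX handling is omitted for brevity.

```ocaml
(* Encode one code point (up to U+FFFF) as UTF-8 bytes. *)
let utf8_of_code cp =
  let buf = Buffer.create 3 in
  if cp < 0x80 then Buffer.add_char buf (Char.chr cp)
  else if cp < 0x800 then begin
    Buffer.add_char buf (Char.chr (0xC0 lor (cp lsr 6)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end else begin
    Buffer.add_char buf (Char.chr (0xE0 lor (cp lsr 12)));
    Buffer.add_char buf (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end;
  Buffer.contents buf

(* Replace each \uXXXX escape in [s] with its UTF-8 byte sequence;
   all other bytes pass through unchanged. *)
let expand_u_escapes s =
  let n = String.length s in
  let buf = Buffer.create n in
  let i = ref 0 in
  while !i < n do
    if !i + 5 < n && s.[!i] = '\\' && s.[!i + 1] = 'u' then begin
      let cp = int_of_string ("0x" ^ String.sub s (!i + 2) 4) in
      Buffer.add_string buf (utf8_of_code cp);
      i := !i + 6
    end else begin
      Buffer.add_char buf s.[!i];
      incr i
    end
  done;
  Buffer.contents buf
```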
* Re: localization, internationalization and Caml
  1999-10-14 22:20 ` skaller
@ 1999-10-15  8:26   ` Francis Dupont
  1999-10-17 11:27     ` skaller
  0 siblings, 1 reply; 21+ messages in thread
From: Francis Dupont @ 1999-10-15 8:26 UTC
To: skaller; +Cc: STARYNKEVITCH Basile, caml-list

In your previous mail you wrote:

   The current 'support' for 8 bit characters in ocaml should be
   deprecated immediately. It is an extremely bad thing to have, since
   Latin-1 et al. are archaic 8 bit standards incompatible with the
   international standard for ISO10646 communication, namely the UTF-8
   encoding.

=> there is rather strong opposition against UTF-8 in France because it
is not a natural encoding (i.e. while ASCII maps to ASCII, that is not
the case for ISO 8859-* characters; imagine a new UTF-X encoding that
mapped ASCII to strange things and you would understand our concern).

   Yes, I know Latin-1 is useful now for French.

=> it is more than useful: Latin-1 (soon ISO IS 8859-15) is necessary if
you need really readable texts in French.

   The way forward may well be to provide an input filter to convert
   Latin-1 (or any other encoding) to UTF-8, and have ocaml process
   that.

=> my problem is that the output of the filter will no longer be
readable once I have put too much French in the program (in comments,
for instance).

   This requires almost no changes to the compiler: the design should
   open up the set of characters acceptable in identifiers, probably to
   some subset of the set recommended in one of the ISO10646-related
   documents; the other change required is to accept \uXXXX and
   \UXXXXXXXX escapes in strings. String processing functions should
   generally continue to be 8 bit [per octet]: full internationalisation
   of client string handling functions is a very complex, non-trivial
   task.

=> I believe internationalization should not be done by countries where
English is the only language used: this is at least awkward...
Regards

Francis.Dupont@inria.fr
* Re: localization, internationalization and Caml
  1999-10-15  8:26 ` Francis Dupont
@ 1999-10-17 11:27   ` skaller
  1999-10-17 15:54     ` Francis Dupont
  0 siblings, 1 reply; 21+ messages in thread
From: skaller @ 1999-10-17 11:27 UTC
To: Francis Dupont; +Cc: STARYNKEVITCH Basile, caml-list

Francis Dupont wrote:
>
> In your previous mail you wrote:
>
>    The current 'support' for 8 bit characters in ocaml should be
>    deprecated immediately. It is an extremely bad thing to have, since
>    Latin-1 et al. are archaic 8 bit standards incompatible with the
>    international standard for ISO10646 communication, namely the UTF-8
>    encoding.
>
> => there is rather strong opposition against UTF-8 in France because
> it is not a natural encoding (i.e. while ASCII maps to ASCII, that is
> not the case for ISO 8859-* characters; imagine a new UTF-X encoding
> that mapped ASCII to strange things and you would understand our
> concern).

I do understand the concern, but the decision on the International
Standards has been made. The transition for ISO 8859-x clients will
involve some pain. Better to start going through the pain now :-(

>    Yes, I know Latin-1 is useful now for French.
>
> => it is more than useful: Latin-1 (soon ISO IS 8859-15) is necessary
> if you need really readable texts in French.

No, what you mean is that with _current technology_ there is plenty of
support for 8 bit characters, using code pages, so that Latin-1 is well
supported. For example, there are a lot of text editors that accept
8 bit characters, and even permit switching code pages. There are almost
none that work with ISO10646 or Unicode, let alone accept the UTF-8
encoding. (Yudit is the only one I know of.)

>    The way forward may well be to provide an input filter to convert
>    Latin-1 (or any other encoding) to UTF-8, and have ocaml process
>    that.
>
> => my problem is that the output of the filter will no longer be
> readable once I have put too much French in the program (in comments,
> for instance).
You will have no problem with the right tools: the difference will be
transparent. Of course, you will need the right tools. For example, you
will need a browser like Internet Explorer 5, which processes the UTF-8
encoding correctly. I agree that this is a problem, but supporting
Latin-1, or any other archaic standard, is not going to help move
forward. It is bad enough that most vendors only support Unicode, which
is a small, almost filled, 16 bit subset of the full 31 bit ISO-10646
Standard.

>    This requires almost no changes to the compiler: the design should
>    open up the set of characters acceptable in identifiers, probably
>    to some subset of the set recommended in one of the
>    ISO10646-related documents; the other change required is to accept
>    \uXXXX and \UXXXXXXXX escapes in strings. String processing
>    functions should generally continue to be 8 bit [per octet]: full
>    internationalisation of client string handling functions is a very
>    complex, non-trivial task.
>
> => I believe internationalization should not be done by countries
> where English is the only language used: this is at least awkward...

I believe people with international concerns can work together no matter
what their native language. Some English speakers may be concerned;
some, like me, are somewhat embarrassed to be non-fluent in _any_ other
language. [I speak a smattering of high school German.] However,
Australia, where I live, has migrants from all over the world, and
support for many languages is an important issue here, particularly
Asian languages. And ISO-8859-x is not much help there :-)

--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller
* Re: localization, internationalization and Caml
  1999-10-17 11:27 ` skaller
@ 1999-10-17 15:54   ` Francis Dupont
  1999-10-19 18:48     ` skaller
  0 siblings, 1 reply; 21+ messages in thread
From: Francis Dupont @ 1999-10-17 15:54 UTC
To: skaller; +Cc: STARYNKEVITCH Basile, caml-list

In your previous mail you wrote:

   > => there is rather strong opposition against UTF-8 in France
   > because it is not a natural encoding ...

   I do understand the concern, but the decision on the International
   Standards has been made.

=> this is not so obvious, because there are other encodings (UTF-X)
without this kind of problem. I'll send this thread to a colleague who
tried to get something better than UTF-8 at the IETF (but he was too
late).

   No, what you mean is that with _current technology_ there is plenty
   of support for 8 bit characters, using code pages, so that Latin-1 is
   well supported.

=> yes, for instance you have a reasonable set of fonts.

   For example, there are a lot of text editors that accept 8 bit
   characters, and even permit switching code pages. There are almost
   none that work with ISO10646 or Unicode, let alone accept the UTF-8
   encoding. (Yudit is the only one I know of.)

=> I'd like to get some free ISO10646/Unicode fonts. I believe that
without them ISO10646/Unicode will not be accepted by users.
   I agree that this is a problem, but supporting Latin-1, or any other
   archaic standard, is not going to help move forward.

=> Latin-1 is not so archaic (it should be old enough in order to become
archaic :-).

   It is bad enough that most vendors only support Unicode, which is a
   small, almost filled, 16 bit subset of the full 31 bit ISO-10646
   Standard.

=> Unicode is not so supported...

   I believe people with international concerns can work together no
   matter what their native language. Some English speakers may be
   concerned; some, like me, are somewhat embarrassed to be non-fluent
   in _any_ other language. [I speak a smattering of high school
   German.]

=> It is great that English speakers support internationalization, but
we need speakers of other languages in order to make it as complete as
possible. For instance, where is the first character of a string? An
Arabic speaker can easily show that this is not so obvious.

   However, Australia, where I live, has migrants from all over the
   world and support for many languages is an important issue here.
   Particularly Asian languages.

=> Asian languages seem hard, and we can't ignore one third of the
world...

Regards

Francis.Dupont@inria.fr
* Re: localization, internationalization and Caml
  1999-10-17 15:54 ` Francis Dupont
@ 1999-10-19 18:48   ` skaller
  0 siblings, 0 replies; 21+ messages in thread
From: skaller @ 1999-10-19 18:48 UTC
To: Francis Dupont; +Cc: STARYNKEVITCH Basile, caml-list

Francis Dupont wrote:
> I believe people with international concerns can work together no
> matter what their native language. Some English speakers may be
> concerned; some, like me, are somewhat embarrassed to be non-fluent in
> _any_ other language. [I speak a smattering of high school German.]
>
> => It is great that English speakers support internationalization, but
> we need speakers of other languages in order to make it as complete as
> possible. For instance, where is the first character of a string? An
> Arabic speaker can easily show that this is not so obvious.

Yes. And ocaml developers cannot go through all the pain of this complex
field, and do not need to: workers in the field have produced an
International Standard which guides how programming language developers
should proceed. Not all choices are fixed, by any means, but the
documents exist, and _are_ being worked on by people speaking many
different languages.

> => Asian languages seem hard, and we can't ignore one third of the
> world...

Not nearly so hard as Arabic and Indic languages, in which the usual
categorical composition of strings by concatenation fails. Similarly,
things like collation sequences are a serious nightmare. But they can be
implemented with (perhaps complex) functions, so it is possible to
extend the support a programming language gives clients as time goes on.

My initial point was that very few changes are required to prepare for
internationalisation, if the UTF-8 encoding of ISO10646 is adopted, but
one of the key things that is required, urgently, is to deprecate
Latin-1.
--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller
* Re: localization, internationalization and Caml
@ 1999-10-15 13:53 Gerard Huet
  1999-10-15 20:28 ` Gerd Stolpmann
  1999-10-17 14:29 ` Xavier Leroy
  0 siblings, 2 replies; 21+ messages in thread
From: Gerard Huet @ 1999-10-15 13:53 UTC
To: Francis Dupont, skaller; +Cc: STARYNKEVITCH Basile, caml-list

Just to put my 2 cents on this issue...

At 10:26 15/10/99 +0200, Francis Dupont wrote:
> In your previous mail you wrote:
>
>    The current 'support' for 8 bit characters in ocaml should be
>    deprecated immediately. It is an extremely bad thing to have, since
>    Latin-1 et al. are archaic 8 bit standards incompatible with the
>    international standard for ISO10646 communication, namely the UTF-8
>    encoding.

I do not agree. What we need is not ayatollah diktats, but careful
thinking about the evolution of standards.

First of all, ISO-Latin is as international a standard as ISO10646, only
a bit more mature. In essence, international standards are not
immediately obsolete; they are here to stay, because we need some
stability in a world of sound engineering, as opposed to the permanent
hype which our discipline is subjected to.

Secondly, the string data type of Ocaml is not about ASCII or ISO-Latin
or whatever. It is a low-level data type implementing lists of bytes of
data efficiently represented in machine memory. These bytes may be used
for encoding elements of various finite sets such as ASCII or ISO-Latin,
but the string library does not care about such intentions. When such
strings are used to represent natural language sentences, there is a
natural tendency to sophistication, from the UPPER CASE letters of the
computer printers of old, to ASCII, to ISO-Latin 1, 2, etc., to Unicode.
At some point (beyond 256) these sets of codes can no longer be mapped
one to one onto bytes, and so multi-byte representations must be
designed, such as UTF-8.
Such multi-byte representations are inconsistent with the ISO-Latin
convention somewhere, and thus the ISO-Latin character set must be
shifted out of its usual representation, since the 8th bit is needed for
the multi-byte encoding. So, for instance, engineers designing natural
language interfaces must choose between sticking to the old convention
in a purely local piece of software, or upgrading their software to the
international standard, typically for Web applications.

At some point I am sure some brave soul from the Ocaml implementation
team will write a Unicode library implementing the non-trivial
manipulations of lists of Unicode characters, so that the above
engineers will have a generic tool to use. Such libraries will typically
implement a NEW datatype of "unistring" or whatever, with proper
conversion to string representations of course, but the string data type
is surely here to stay, because bytes are not going to become obsolete
overnight. :-)

> => there is rather strong opposition against UTF-8 in France because
> it is not a natural encoding (i.e. while ASCII maps to ASCII, that is
> not the case for ISO 8859-* characters; imagine a new UTF-X encoding
> that mapped ASCII to strange things and you would understand our
> concern).

I do not share Francis' pessimism. The ISO committees are not entirely
stupid, and care has been taken to make the move as painless as
possible. ISO-Latin has just been shifted by a mere translation. Here is
my Ocaml code for translating strings of ISO-Latin 1 characters to UTF-8
HTML:

    let print_unicode c =
      let ascii = int_of_char c in
      (* test for ISO-LATIN *)
      if ascii < 128 then print_char c  (* 7 bit ascii *)
      else print_string ("&#" ^ string_of_int ascii ^ ";")

This is hardly mysterious or complicated or inefficient.

> => my problem is that the output of the filter will no longer be
> readable once I have put too much French in the program (in comments,
> for instance).
Come on, Francis, we do not read core dumps nowadays; we read through
the eyes of HTML or TeX or whatever!

> => I believe internationalization should not be done by countries
> where English is the only language used: this is at least awkward...

I simply do not understand this remark in a WWW world.

Cheers

Gérard
* Re: localization, internationalization and Caml
  1999-10-15 13:53 Gerard Huet
@ 1999-10-15 20:28 ` Gerd Stolpmann
  1999-10-19 18:06   ` skaller
  1999-10-17 14:29 ` Xavier Leroy
  1 sibling, 1 reply; 21+ messages in thread
From: Gerd Stolpmann @ 1999-10-15 20:28 UTC
To: caml-list

I agree that Unicode or even ISO-10646 support would be a nice thing. I
also agree that for many users (including myself) the Latin-1 character
set suffices. Luckily, the two character sets are strongly related: the
first 256 character positions of Unicode (which is the 16-bit subset of
ISO-10646) are exactly the same as in Latin-1.

UTF-8 is a special encoding of the bigger character sets. Every 16- or
31-bit character is represented by one to six bytes; the higher the
character code, the more bytes are needed. This encoding is mainly
interesting for I/O, and not for internal processing, because it is
impossible to access the characters of a string by their position.
Internally, you must represent the characters as 16- or 32-bit numbers
(this is called UCS-2 and UCS-4, respectively), even if memory is wasted
(this is the price for the enhanced possibilities).

UTF-8 is designed for compatibility, because the following holds:

- Every ASCII character (i.e. with codes 0 to 127) is represented as
  before, and every non-ASCII character is represented by a byte
  sequence in which the eighth bit is set. Old, non-UTF-aware programs
  can at least interpret the ASCII characters. (Note that there is a
  variant of UTF-8 which encodes the 0 character differently, by two
  bytes.)

- If you sort UTF-8 strings alphabetically (more precisely, using the
  byte values of the encoding as the criterion), you get the same result
  as if you sorted the strings by their character codes.

This means that we need at least three types of strings: Latin-1 strings
for compatibility, UCS-2 or UCS-4 strings for internal processing, and
UTF-8 strings for I/O.
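The variable-width scheme Gerd describes is visible in the first byte of each sequence alone. A minimal sketch (the function name is illustrative) that reports how many bytes a UTF-8 sequence occupies under the original one-to-six-byte scheme for 31-bit ISO-10646:

```ocaml
(* Length in bytes of a UTF-8 sequence, judged from its first byte.
   Covers the original 1..6 byte scheme for 31-bit ISO-10646. *)
let utf8_seq_length first_byte =
  if first_byte < 0x80 then 1          (* plain ASCII *)
  else if first_byte < 0xC0 then invalid_arg "continuation byte"
  else if first_byte < 0xE0 then 2
  else if first_byte < 0xF0 then 3
  else if first_byte < 0xF8 then 4
  else if first_byte < 0xFC then 5
  else 6
```

The rejected 0x80..0xBF range is exactly why characters cannot be accessed by position: a byte in that range is the middle of some character, and finding the n-th character means walking the string from the start.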
For simplicity, I suggest representing both Latin-1 and UTF-8 strings by
the same language type "string", and providing "wchar" and "wstring" for
the extended character set.

Of course, the current "string" data type is mainly an implementation of
byte sequences, which is independent of the underlying interpretation.
Only the following functions seem to introduce a character set:

- The String.uppercase and String.lowercase functions

- The channels, if opened in text mode (because they specially recognize
  the line endings if needed for the operating system, and newline
  characters are part of the character set)

The best solution would be to have internationalized versions of these
functions (and perhaps of some more functions) which still operate on
the "string" type but allow the user to select the encoding and the
locale. This means we would have something like

    type encoding = UTF8 | Latin1 | ...
    type locale = ...

    val String.i18n_uppercase : encoding -> locale -> string -> string
    val String.i18n_lowercase : encoding -> locale -> string -> string

    val String.recode : encoding -> encoding -> string -> string
        (* changes the encoding if possible *)

For "wstring" it is simpler:

    val Wstring.uppercase : string -> string
    val Wstring.i18n_uppercase : locale -> string -> string

New opening mode for channels:

    Text of encoding

This encoding specifies the encoding of the file. It must be possible to
change it later (e.g. to process XML's "encoding" declaration).

New input/output functions:

    val output_i18n_string : out_channel -> encoding -> string -> unit
    val input_i18n_line : in_channel -> encoding -> string

Here, the encoding argument specifies the encoding of the internal
representation.
The other I/O functions need I18N versions as well, and of course we
need to operate on "wstring"s directly:

    val output_wstring : out_channel -> wstring -> unit
    val input_wstring_line : in_channel -> wstring

This all means that the number of string functions explodes: we need
functions for compatibility (Latin-1), functions for arbitrary 8 bit
encodings, and functions for wide strings. I think this is the main
argument against it, and it is very difficult to get around. (Any
ideas?)

Francis Dupont:
> => my problem is that the output of the filter will no longer be
> readable once I have put too much French in the program (in comments,
> for instance).

The enlarged character sets become more and more important, and it is
only a matter of time until every piece of software which wants to be
taken seriously can process them, even a dumb terminal or a simple text
editor. So you will be able to put accented characters into your
comments, and you will see them as such even if you 'cat' the program
text to the terminal or printer; this will work everywhere...

> => I believe internationalization should not be done by countries
> where English is the only language used: this is at least awkward...

... but in the USA (a general prejudice not worth discussing; there is
only some "personal experience" behind it).

Gerd

----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100
64293 Darmstadt     EMail: Gerd.Stolpmann@darmstadt.netsurf.de (privat)
Germany
----------------------------------------------------------------------------
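Gerd's proposed `String.recode` could be prototyped for the one conversion that is trivial, Latin-1 to UTF-8, since every Latin-1 byte maps to the identical ISO-10646 code point. The type and functions below follow his sketched signatures and are illustrative only, not a real library.

```ocaml
type encoding = UTF8 | Latin1

(* Latin-1 bytes 0x00..0xFF map to the identical code points
   U+0000..U+00FF, so recoding to UTF-8 needs at most two bytes per
   character. *)
let latin1_to_utf8 s =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      let code = Char.code c in
      if code < 0x80 then Buffer.add_char buf c
      else begin
        Buffer.add_char buf (Char.chr (0xC0 lor (code lsr 6)));
        Buffer.add_char buf (Char.chr (0x80 lor (code land 0x3F)))
      end)
    s;
  Buffer.contents buf

(* A cut-down recode in the spirit of the proposed signature; only the
   Latin-1 -> UTF-8 direction is implemented in this sketch. *)
let recode src dst s =
  match src, dst with
  | Latin1, UTF8 -> latin1_to_utf8 s
  | UTF8, UTF8 | Latin1, Latin1 -> s
  | UTF8, Latin1 -> failwith "not implemented in this sketch"
```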
* Re: localization, internationalization and Caml
  1999-10-15 20:28 ` Gerd Stolpmann
@ 1999-10-19 18:06   ` skaller
  1999-10-20 21:05     ` Gerd Stolpmann
  0 siblings, 1 reply; 21+ messages in thread
From: skaller @ 1999-10-19 18:06 UTC
To: Gerd.Stolpmann; +Cc: caml-list

Gerd Stolpmann wrote:
>
> I agree that Unicode or even ISO-10646 support would be a nice thing.
> I also agree that for many users (including myself) the Latin-1
> character set suffices.

Generally, for me, 7 bit ASCII 'suffices'. But that is irrelevant: the
world is bigger than my country, or Europe.

> Luckily, both character sets are strongly related: the first 256
> character positions of Unicode (which is the 16-bit subset of
> ISO-10646) are exactly the same as in Latin-1.

.. of course, this is not luck ..

> UTF-8 is a special encoding of the bigger character sets. Every 16- or
> 31-bit character is represented by one to six bytes; the higher the
> character code the more bytes are needed. This encoding is mainly
> interesting for I/O, and not for internal processing, because it is
> impossible to access the characters of a string by their position.
> Internally, you must represent the characters as 16- or 32-bit numbers
> (this is called UCS-2 and -4, respectively), even if memory is wasted
> (this is the price for the enhanced possibilities).

I don't agree. If you read ISO10646 carefully, you will find that you
must STILL parse sequences of code points to obtain the equivalent of a
'character', if, indeed, such a concept exists; and furthermore, the
sequences are not unique. For example, many diacritic marks such as
accents may be appended to a code point, and act in conjunction with the
preceding code point to represent a character. This is permitted EVEN if
there is a single code point for that character; and worse, if there are
TWO such marks, the order is not fixed.

And that's just simple European languages.
Now try Arabic or Thai :-)

> This means that we need at least three types of strings: Latin-1
> strings for compatibility, UCS-2 or -4 strings for internal
> processing, and UTF-8 strings for I/O. For simplicity, I suggest
> representing both Latin-1 and UTF-8 strings by the same language type
> "string", and providing "wchar" and "wstring" for the extended
> character set.

I'd like to suggest we forget the 'wchar' string, at least initially. I
think you will find UTF-8 encoding requires very few changes. For
example, genuine regular expressions work out of the box. String
searching works out of the box. What doesn't work efficiently is
indexing. And it is never necessary to do it for human script. Why would
you ever want to, say, replace the 10th character of a string?? [You
could, if you were analysing, say, a stock code, but in that case the
n-th byte would do: it isn't natural language script.]

The way to handle Latin-1, or Big-5, or KSC, or ShiftJis, is to
translate it with an input filter, or internally, if the client is
reading the codes directly.

> Of course, the current "string" data type is mainly an implementation
> of byte sequences, which is independent of the underlying
> interpretation. Only the following functions seem to introduce a
> character set:
>
> - The String.uppercase and String.lowercase functions

It is best to get rid of these functions. They belong in a more
sophisticated natural language processing package.

> - The channels, if opened in text mode (because they specially
>   recognize the line endings if needed for the operating system, and
>   newline characters are part of the character set)

This is a serious problem. It is also partly ill formed: what is a
'line' in Chinese, which writes characters top down?
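skaller's claim that searching works out of the box rests on a property of UTF-8: no character's byte sequence occurs inside another character's byte sequence. A naive byte-level search (a sketch with an illustrative function name) is therefore already correct on UTF-8 text, even though it knows nothing about character boundaries:

```ocaml
(* Naive byte-level substring search. On UTF-8 input this is already
   correct: continuation bytes (0x80..0xBF) never begin a character, so
   a match can only start on a real character boundary. *)
let find_sub haystack needle =
  let nh = String.length haystack and nn = String.length needle in
  let rec go i =
    if i + nn > nh then None
    else if String.sub haystack i nn = needle then Some i
    else go (i + 1)
  in
  go 0

(* "é" in UTF-8 is the two bytes 0xC3 0xA9. *)
let pos = find_sub "caf\xC3\xA9 au lait" "\xC3\xA9"
```

Indexing is what this does not give you: `pos` is a byte offset, not a character count, which is exactly the trade-off discussed above.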
What is a 'line' in a Web page :-)

> The best solution would be to have internationalized versions of these
> functions (and perhaps of some more functions) which still operate on
> the "string" type but allow the user to select the encoding and the
> locale.

There is more. The compiler must be modified to accept identifiers in
the extended character set. This should work out of the box (since
characters with the 8th bit set are already accepted); in fact, it is
too permissive. Secondly, literals need to be processed, to expand
\uXXXX and \UXXXXXXXX escapes.

> This means we would have something like
>
>     type encoding = UTF8 | Latin1 | ...

Be careful to distinguish CODE SET from ENCODING. See the Unicode home
page for more details: Latin-1 is NOT an encoding, but a character set.
There is a MAPPING from Latin-1 to ISO-10646. This is not the same thing
as an encoding.

I think this is the wrong approach: we do not want built-in cases for
every possible encoding/character set. Instead, we want an open-ended
set of conversions from (and perhaps to) the internally used
representation. There are a LOT of such combinations; we need to be able
to add new ones without breaking into a module. We should do it
functionally; this should work well in ocaml :-)

>     type locale = ...
>     val String.i18n_uppercase : encoding -> locale -> string -> string
>     val String.i18n_lowercase : encoding -> locale -> string -> string

Not in the String module. This belongs in a different package which
handles the complex vagaries of human script. [This particular function
is relatively simple. Others are 'get_digit' and 'isdigit'. Whitespace
is much harder. Collation is a nightmare :-]

>     val String.recode : encoding -> encoding -> string -> string
>         (* changes the encoding if possible *)

This isn't quite right. The way to do this is to have a function

    LATIN1_to_ISO10646 code

which does the mapping (from a SINGLE code point in Latin-1 to
ISO-10646). The code point is an int.
Separately, we handle encodings:

    UCS4_to_UTF8 code

converts an int to a UTF-8 string, and

    UTF8_to_UCS4 string position

parses the string from position, returning a code point and a position.
There are other encodings, such as DBCS encodings, which generally are
tied to a single character set. [UCS4/UTF8 are less dependent.]

[....]

> This all means that the number of string functions explodes:

Exactly. And we don't want that. So I suggest we continue to use the
existing strings of 8 bit bytes ONLY, and represent ALL foreign [non
ISO10646] character sets using ISO-10646 code points, encoded as UTF-8,
and provide an input filter for the compiler. In addition, some extra
functions to convert other character sets and encodings to
ISO-10646/UTF-8 are provided, and, if you like, they can be plugged into
the I/O system. This means a lot of conversion functions, but ONE
internal representation only: the one we already have.

> We need functions for compatibility (Latin-1), functions for arbitrary
> 8 bit encodings, and functions for wide strings. I think this is the
> main argument against it, and it is very difficult to get around.
> (Any ideas?)

I've been trying to tell you how to do it. The solution is simple: adopt
ISO-10646 as the SOLE character set, and UTF-8 as the SOLE encoding of
it, and provide conversions from other character sets and encodings. All
the code that needs to manipulate strings can then be provided NOW as
additional functions manipulating the existing string type. The apparent
loss of indexing is a mirage. The gain is huge: ISO-10646 Level 1
compliance without any explosion of data types.

Yes, some more _functions_ are needed to do extra processing, such as
normalisation, comparisons of various kinds, capitalisation, etc.
Regular expressions will need to be enhanced to handle the special
features (like case-insensitive searching), but basic regular
expressions will work out of the box.
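The two layers skaller separates (code-set mapping versus encoding) can be sketched as follows. For Latin-1 the code-set mapping is the identity, and the encoder/decoder pair below covers code points up to U+FFFF for brevity (the historical scheme continues to 31-bit values with longer sequences). The names are illustrative, not an existing API.

```ocaml
(* Code-set mapping: Latin-1 code points coincide with the first 256
   ISO-10646 code points, so this mapping is the identity. *)
let latin1_to_iso10646 (code : int) = code

(* Encoding layer: one code point to its UTF-8 bytes (up to U+FFFF
   here; the full scheme extends to 31-bit values). *)
let ucs4_to_utf8 cp =
  let buf = Buffer.create 3 in
  if cp < 0x80 then Buffer.add_char buf (Char.chr cp)
  else if cp < 0x800 then begin
    Buffer.add_char buf (Char.chr (0xC0 lor (cp lsr 6)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end else begin
    Buffer.add_char buf (Char.chr (0xE0 lor (cp lsr 12)));
    Buffer.add_char buf (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end;
  Buffer.contents buf

(* Decoding layer: parse one code point starting at [pos], returning
   the code point and the position just past it. *)
let utf8_to_ucs4 s pos =
  let b = Char.code s.[pos] in
  if b < 0x80 then (b, pos + 1)
  else if b < 0xE0 then
    (((b land 0x1F) lsl 6) lor (Char.code s.[pos + 1] land 0x3F), pos + 2)
  else
    (((b land 0x0F) lsl 12)
       lor ((Char.code s.[pos + 1] land 0x3F) lsl 6)
       lor (Char.code s.[pos + 2] land 0x3F),
     pos + 3)
```

Converting any Latin-1 byte to the internal representation is then the composition `ucs4_to_utf8 (latin1_to_iso10646 code)`, which is the "lots of conversion functions, one internal representation" design argued for above.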
> The enlarged character sets become more and more important, and it is only a > matter of time until every piece of software which wants to be taken seriously > can process them, even a dumb terminal or simple text editor. So you will be > able to put accented characters into your comments, and you will see them as > such even if you 'cat' the program text to the terminal or printer; this will > work everywhere... Yes. That time is not here yet, but it will soon be the case that international support is mandatory for all large software purchases by governments and large corporations. -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml 1999-10-19 18:06 ` skaller @ 1999-10-20 21:05 ` Gerd Stolpmann 1999-10-21 4:42 ` skaller 1999-10-21 12:05 ` Matías Giovannini 0 siblings, 2 replies; 21+ messages in thread From: Gerd Stolpmann @ 1999-10-20 21:05 UTC (permalink / raw) To: skaller; +Cc: caml-list On Tue, 19 Oct 1999, John Skaller wrote: >Gerd Stolpmann wrote: >> UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit >> character is represented by one to six bytes; the higher the character code the >> more bytes are needed. This encoding is mainly interesting for I/O, and not for >> internal processing, because it is impossible to access the characters of a >> string by their position. Internally, you must represent the characters as 16- >> or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is >> wasted (this is the price for the enhanced possibilities). > > I don't agree. If you read ISO10646 carefully, you will find >that you must STILL parse sequences of code points to obtain the >equivalent >of a 'character', if, indeed, such a concept exists, >and furthermore, the sequences are not unique. For example, >many diacritic marks such as accents may be appended to a code point, >and act in conjunction with the preceding code point to represent >a character. This is permitted EVEN if there is a single code point >for that character; and worse, if there are TWO such marks, the order is >not >fixed. > > And that's just simple European languages. Now try Arabic or Thai :-) Let's begin with languages we know. As far as I know, ISO10646 allows it not to implement the combining characters. I think, a programming language should only provide the basic means by which you can operate with characters, but should not solve it completely. >> This means that we need at least three types of strings: Latin1 strings for >> compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings >> for I/O. 
For simplicity, I suggest representing both Latin1 and UTF-8 strings by >> the same language type "string", and providing "wchar" and "wstring" for the >> extended character set. > > I'd like to suggest we forget the 'wchar' string, at least initially. >I think you will find UTF-8 encoding requires very few changes. For >example, >genuine regular expressions work out of the box. String searching >works out of the box. > > What doesn't work efficiently is indexing. >And it is never necessary to do it for human script. >Why would you ever want to, say, replace the 10'th character >of a string?? [You could: if you were analysing, say, a stock code, >but in that case the n'th byte would do: it isn't natural language >script] Because I have an algorithm operating on the characters of a string. Such algorithms use indexes as pointers to parts of a string, and in most cases the indexes are only incremented or decremented. On a UTF-8 string, you could define an index as type index = { index_position : int; byte_position : int } and define the operations "increment", "decrement" (only working together with the string), "add", "subtract", "compare" (to calculate string lengths). Such indexes have strange properties; they can only be interpreted together with the string to which they refer. You cannot avoid such an index type really; you can only avoid giving the thing a name, programming the index operations anew every time. Perhaps your suggestion works; but string manipulation will then be much slower. For example, an "increment" must be implemented by finding the next beginning of a character (instead of just incrementing a numeric index). >> Of course, the current "string" data type is mainly an implementation of byte >> sequences, which is independent of the underlying interpretation. Only the >> following functions seem to introduce a character set: >> >> - The String.uppercase and .lowercase functions > >It is best to get rid of these functions. 
They belong in a more >sophisticated >natural language processing package. There will always be a difference between natural languages and sophisticated packages. Even the current String.uppercase is wrong (in Latin1 there is a lower-case character without a corresponding capital character (\223), but WORDS containing this character can be capitalized by applying a semantic rule). I would suppose that String.upper/lowercase are part of the library because the compiler itself needs them. Currently, ocaml depends on languages that know the distinction of character cases. In my opinion such case functions can only approximate the semantic meaning, and a simple approximation is better than no approximation. >> - The channels if opened in text mode (because they specially recognize the >> line endings if needed for the operating system, and newline characters are >> part of the character set) > > This is a serious problem. It is also partly ill formed: >what is a 'line' in Chinese, which writes characters top down? But lines exist. For example, your message is divided into lines. The concept of lines is too important to be dropped although it is simple (much of its success has to do with its simplicity). Other writing traditions also have a writing direction. >What is a 'line' in a Web page :-) What is a 'line' in the sky? >> This means we would have something like >> >> type encoding = UTF8 | Latin1 | ... > > Be careful to distinguish CODE SET from ENCODING. >See the Unicode home page for more details: Latin-1 is NOT >an encoding, but a character set. There is a MAPPING from >Latin-1 to ISO-10646. This is not the same thing as an encoding. >I think this is the wrong approach: we do not want built-in >cases for every possible encoding/character set. Character sets and encodings are both artificial concepts. When I program, I always have to deal with a combination of both. 
The distinction is irrelevant for most applications; it is important if you want to convert texts from (cs1,enc1) to (cs2,enc2) because conversion is not always possible. My idea is that the type "encoding" enumerates all supported combinations; I expect only a few. > Instead, we want an open ended set of conversions from (and perhaps to) >the internally used representation. There are a LOT of such >combinations, >we need to add new ones without breaking into a module. We should do it >functionally; this should work well in ocaml :-) What kind of problem do you want to solve with an open ended set of conversions? Isn't this the task of a specialized program? >> type locale = ... >> val String.i18n_uppercase : encoding -> locale -> string -> string >> val String.i18n_lowercase : encoding -> locale -> string -> string > > Not in the String module. This belongs in a different package >which handles complex vagaries of human script. [This particular >function, is relatively simple. See above. >Another is 'get_digit', 'isdigit'. >Whitespace is much harder. Collation is a nightmare :-] I think collation should be left out of a basic library. Even for a single language, there are often several traditions for how to sort, and it also depends on the kind of strings you are sorting (for example, think of personal names). Members of those traditions can contribute special modules for collation. > >> val String.recode : encoding -> encoding -> string -> string >> (* changes the encoding if possible *) > > This isn't quite right. The way to do this is to have a function: > > LATIN1_to_ISO10646 code > >which does the mapping (from a SINGLE code point in LATIN1 to ISO10646). >The code point is an int. Separately, we handle encodings: > > UCS4_to_UTF8 code > >converts an int to a UTF8 string, and > > UTF8_to_UCS4 string position > >parses the string from position, returning a code point and position. 
>There are other encodings, such as DBCS encodings, which generally >are tied to a single character set. [UCS4/UTF8 are less depenent] > The most correct interface is not always the best. >[....] >> This all means that the number of string functions explodes: > > Exactly. And we don't want that. So I suggest, we continue >to use the existing strings of 8 bit bytes ONLY, and represent >ALL foreign [non ISO10646] character sets using ISO-10646 code points, >encoded as UTF-8, and provide an input filter for the compiler. > > In addition, some extra functions to convert other >character sets and encodings to ISO-10646/UTF-8 are provided, >and, if you like, they can be plugged into the I/O system. > > This means a lot of conversion functions, but ONE >internal representation only: the one we already have. There will be a significant slow-down of all ocaml programs if the strings are encoded as UTF-8. I think the user of a language should be able to choose what is more important: time or space or reduced functionality. UTF-8 saves space, and costs time; UCS-4 wastes space, but saves time; UCS-2 is a compromise and bad because it is a compromise; Latin 1 (or another 8 bit cs) saves time and space but has less functionality. >> We need functions >> for compatibility (Latin1), functions for arbitrary 8 bit encodings, and >> functions for wide strings. I think this is the main argument against it, >> and it is very difficult to get around this. (Any ideas?) > > I've been trying to tell you how to do it. The solution is simple, >to adopt ISO-10646 as the SOLE character set, and UTF-8 as the SOLE >encoding of it; and provide conversions from other character sets and >encodings. It looks simple but I suppose it is not what the ocaml users want. >All the code that needs to manipulate strings can then be provided NOW >as additional functions manipulating the existing string type. And because compatibility is lost, the whole current code base has to be worked through. 
> The apparent loss of indexing is a mirage. The gain is huge: >ISO-10646 Level 1 compliance without any explosion of data types. >Yes, some more _functions_ are needed to do extra processing, >such as normalisation, comparisons of various kinds, >capitalisation, etc. Regular expressions will need to be >enhanced, to fix the special features (like case insensitive searching), >but the basic regular expressions will work out of the box. > >> The enlarged character sets become more and more important, and it is only a >> matter of time until every piece of software which wants to be taken seriously >> can process them, even a dumb terminal or simple text editor. So you will be >> able to put accented characters into your comments, and you will see them as >> such even if you 'cat' the program text to the terminal or printer; this will >> work everywhere... > > Yes. That time is not here yet, but it will soon be the case that >international support is mandatory for all large software purchases >by governments and large corporations. I do not believe that this will be the driving force because the current solutions exist, and it is VERY expensive to replace them. It is even cheaper to replace a language than a character set/encoding. Looks like another Year 2000 but without a deadline. The first field where some progress will be made is data exchange, because ISO10646 can bridge several character sets. At that time, tools will be available to view and edit such data, and of course to convert them; ISO10646 will be used in parallel with the "traditional" character set. These tools will be low-level, and perhaps operating systems will then support the tools with fonts, input methods, and conventions for how to indicate the encoding. (The current environment-variable solution is a pain if you try to use two encodings in parallel. For example, I can imagine that Unix terminal drivers could allow one to select the encoding directly, in the same way as you can set other terminal properties.) 
In contrast to this, many applications need not be replaced, and won't be. Perhaps they will have an ISO10646 import/export filter. -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: Gerd.Stolpmann@darmstadt.netsurf.de (privat) Germany ---------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml 1999-10-20 21:05 ` Gerd Stolpmann @ 1999-10-21 4:42 ` skaller 1999-10-21 12:05 ` Matías Giovannini 1 sibling, 0 replies; 21+ messages in thread From: skaller @ 1999-10-21 4:42 UTC (permalink / raw) To: Gerd.Stolpmann; +Cc: caml-list Gerd Stolpmann wrote: > On Tue, 19 Oct 1999, John Skaller wrote: > > > > I don't agree. If you read ISO10646 carefully, you will find > >that you must STILL parse sequences of code points > Let's begin with languages we know. As far as I know, ISO10646 allows it not to > implement the combining characters. From memory, you are correct: there are three specified levels of compliance. Level 1 compliance does not require processing combining characters. > I think, a programming language should only > provide the basic means by which you can operate with characters, but should not > solve it completely. Yes, I agree, at least at this time. > > What doesn't work efficiently is indexing. > >And it is never necessary to do it for human script. > >Why would you ever want to, say, replace the 10'th character > >of a string?? > > Because I have an algorithm operating on the characters of a string. If the string represents human script, it is then wrong because it makes incorrect assumptions about the nature of human script. You will need to rewrite it, if you want it to work in an international setting. > Such > algorithms use indexes as pointers to parts of a string, and in most cases the > indexes are only incremented or decremented. On a UTF-8 string, you could > define an index as > > type index = { index_position : int; byte_position : int } > > and define the operations "increment", "decrement" (only working together with > the string), "add", "substract", "compare" (to calculate string lengths). Such > indexes have strange properties; they can only be interpreted together with the > string to which they refer. 
> > You cannot avoid such an index type really; you can only avoid giving the > thing a name, programming the index operations anew every time. I agree. But my point is: you should change your code _anyhow_, to use the new and correct parsing method, because it is necessary for Level 2 and Level 3 compliance. Your code will then work correctly at those levels when the 'increment' function is upgraded. What you will find is something which by chance, perhaps, is natural in Python: there is no such thing as a character. A string is NOT an array of characters. Strings can be composed from strings, and decomposed into arrays of strings, but there is not really any character type. > Perhaps your suggestion works; but string manipulation will then be much > slower. For example, an "increment" must be implemented by finding the next > beginning of a character (instead of just incrementing a numeric index). Yes, but this is a fact: it is actually required for correct processing of human script. You cannot 'magic' away the facts. What you can do is: if you are programming with a known subset, such as the characters for a stock code, then you can use indexing anyhow, perhaps with the ASCII subset. That is, you can use the byte strings as character strings. > There will always be a difference between natural languages and sophisticated > packages. Yes. However, there is an important point here. Natural languages are quirky and behaviour is variant: each human uses language differently in each sentence, varying with region, context, etc. Obviously, computer systems only use some abstracted representation. While there are many levels and ways of abstracting this, there is one that is worthy of special interest here: the ISO10646 Standard. So I guess my suggestion is that in the _standard_ language libraries we will eventually need to implement the algorithms required for compliance with that Standard. 
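Gerd's index type, with an `increment` that finds the start of the next UTF-8 character, might be sketched like this (illustrative only; a UTF-8 continuation byte is any byte of the form 10xxxxxx, so the next character starts at the next non-continuation byte):

```ocaml
(* The index type proposed earlier in the thread. *)
type index = { index_position : int; byte_position : int }

(* Advance one character: skip the current lead byte, then any
   continuation bytes, stopping at the next lead byte or end of string. *)
let increment (s : string) (i : index) : index =
  let is_continuation c = Char.code c land 0xC0 = 0x80 in
  let n = String.length s in
  let rec next b =
    if b >= n || not (is_continuation s.[b]) then b else next (b + 1)
  in
  { index_position = i.index_position + 1;
    byte_position = next (i.byte_position + 1) }
```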
In my opinion, that naturally breaks into two parts: a) (byte/word) string management: this is an issue of storage allocation and manipulation, not natural language processing b) basic natural language processing >Even the current String.uppercase is wrong (in Latin1 there is a > lower case character without corresponding capital character (\223), but WORDS > containing this character can be capitalized by applying a semantical rule). > > I would suppose that String.upper/lowercase are part of the library because the > compiler itself needs them. Currently, ocaml depends on languages that know the > distinction of character cases. AH! you are right! > In my opinion such case functions can only approximate the semantical meaning, > and a simple approximation is better than no approximation. No. That is, I agree entirely, but make a different point: an arbitrary simple approximation is worthless, the one that is useful is the ISO Standardised one. > My idea is that the type "encoding" enumerates all supported combinations; I > expect only a few. Please no. Leave the type open to external augmentation. Just consider: my Interscript literate programming tool ALREADY supports something like 30 "encodings" -- all those present on the unicode.org website. Your 'type' is already a joke. I already support a lot more encodings than that. > What kind of problem do you want to solve with an open ended set of > conversions? Isn't this the task of a specialized program? No. It allows a generalised ISO10646 compliant program to read and perhaps write any file encoded in any supported encoding, but manipulate it internally in one format. If there is an encoding that is missed, it is easy to add a new pair of conversion functions, without breaking the standard library. That is, it is the task of specialised _functions_. It makes sense to provide some as standard like the ones your type suggests -- but not represent the cases with a type. 
Ocaml variants are not amenable to extension. Function parameters are. That is, I think there are exactly two cases: a) no conversion required b) user supplied conversion function > I think collation should be left out by a basic library. Probably right. Level 1 compliance is a good start, and does not require collation. > The most correct interface is not always the best. What do you mean 'most correct'? Either the interface supports the (ISO10646) required behaviour or not. > There will be a significant slow-down of all ocaml programs if the strings are > encoded as UTF-8. No. On the contrary, most existing programs will be unaffected. Those which actually care about internationalisation can only be made faster ( by providing native support). >I think the user of a language should be able to choose what > is more important: time or space or reduced functionality. UTF-8 saves space, > and costs time; UCS-4 wastes space, but saves time; UCS-2 is a compromise > and bad because it is a compromise; Latin 1 (or another 8 bit cs) saves > time and space but has less functionality. Sure, but, this leads to multiple interfaces. Was that not the original problem? Let me put the argument for UTF-8 differently. Processing UTF-8 'as is' is non-trivial and should be done in low level system functions for speed. Processing arrays of 31 bit integers is _already_ well supported in ocaml, and will be better supported by adding variable length arrays with functions that are designed with some view of use for string processing. So we don't actually need a wide character string type or supporting functions, precisely because in the simplest cases a standard data type not really specialised to script processing will do the job. What is actually required (in both cases) are some 'data tables' to support things like case mapping. For example, a function convert_to_upper i which takes an ocaml integer argument would be useful, and it is easy enough to 'map' this over an array. Sigh. 
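A sketch of that idea, with a hypothetical convert_to_upper that covers only the ASCII range as a stand-in for the real data table:

```ocaml
(* Case-map a single code point; only ASCII a..z handled here,
   as a placeholder for a table derived from the Unicode data files. *)
let convert_to_upper (i : int) : int =
  if i >= Char.code 'a' && i <= Char.code 'z' then i - 32 else i

(* 'Mapping' it over an array of code points is then one line. *)
let uppercase_codes (codes : int array) : int array =
  Array.map convert_to_upper codes
```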
See next post. I will post my code, so it can be torn up by experts. -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml 1999-10-20 21:05 ` Gerd Stolpmann 1999-10-21 4:42 ` skaller @ 1999-10-21 12:05 ` Matías Giovannini 1999-10-21 15:35 ` skaller 1 sibling, 1 reply; 21+ messages in thread From: Matías Giovannini @ 1999-10-21 12:05 UTC (permalink / raw) To: caml-list; +Cc: Gerd.Stolpmann, skaller Gerd Stolpmann wrote: > > On Tue, 19 Oct 1999, John Skaller wrote: > >Gerd Stolpmann wrote: > >> The enlarged character sets become more and more important, and it is only a > >> matter of time until every piece of software which wants to be taken seriously > >> can process them, even a dumb terminal or simple text editor. So you will be > >> able to put accented characters into your comments, and you will see them as > >> such even if you 'cat' the program text to the terminal or printer; this will > >> work everywhere... > > > > Yes. This time is not here yet, but it will come soon that > >international support is mandatory for all large software purchases > >by governments and large corporations. > > I do not believe that this will be the driving force because the current > solutions exist, and it is VERY expensive to replace them. It is even cheaper > to replace a language than a character set/encoding. Looks like another Year > 2000 but without deadline. I still don't understand the point of this discussion. As a MacOS programmer of many years, I tend to view localization and internationalization as tasks best performed by the operating system, or at least by pluggable modules. This discussion of patching l10n and i18n functions *into* OCaml is, to me at least, losing direction. OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll agree that my view is chauvinistic (and selfish, perhaps: I already have "¿¡áéíóúüñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?), I see no restriction in that (well, if I were Chinese, or Egyptian, I would see things differently). 
What's more, the whole syntactic apparatus of a programming language *assumes* a Latin setting, where things make sense when read from left to right, from top to bottom; and where punctuation is what we're used to. Programming languages suited for a Han, or Arab, or even a Hebrew audience would have to be rethought from the ground up. On the other hand, OCaml provides a String type that *can be* seen as a variable-length sequence of uninterpreted bytes. We have uninterpreted bytes! It's all we need to build whatever I18NString type we may need. What is missing is *library* facilities to abstract that view into a full-fledged i18n machinery. Of course, there's a problem with the manipulation of 32-bit integer values, but if used with care, the Nat datatype could serve perfectly well as the underlying, low-level datatype. Which makes me think, John, you already have variable-length int arrays. Nat's are as unsafe as they get :-) Regards, Matías. -- I got your message. I couldn't read it. It was a cryptogram. -- Laurie Anderson ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml 1999-10-21 12:05 ` Matías Giovannini @ 1999-10-21 15:35 ` skaller 1999-10-21 16:27 ` Matías Giovannini 0 siblings, 1 reply; 21+ messages in thread From: skaller @ 1999-10-21 15:35 UTC (permalink / raw) To: matias; +Cc: caml-list, Gerd.Stolpmann Matías Giovannini wrote: > OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll > agree that my view is chauvinistic (and selfish, perhaps: I already have > "¿¡áéíóúuñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?), > I see no restriction in that (well, If I were Chinese, or Egiptian, I > would see things differently). Exactly. There are quite a lot of Chinese, Indian, Russian ... and non-Latin people in the world: more than Latins. And many are faced with a barrier, participating in the computing world because of language problems. >What's more, the whole syntactic > apparatus of a programming language *assumes* a Latin setting, where > things make sense when read from left to right, from top to bottom; and > where punctuation is what we're used to. Programming languages suited > for a Han, or Arab, or even a Hebrew audience would have to be rethinked > from the grounds up. Actually, no. Most of these peoples learn English and learn computing, if they are to work with computers. But they still wish to use comments, strings, and identifiers in their native script. Have you ever seen a Japanese program? I have. Quite an interesting challenge: normal C/C++ code, with Latin characters encoding Japanese character names in identifiers, and actual Japanese characters in comments and strings. I had no idea what the code did. 
My point: for a non-native speaker, being forced to use a foreign language for identifiers and comments is a serious impediment; not having native characters in strings is not merely an impediment, but a complete disaster (how will the users of the program understand it -- they may not know any Latin language) > On the other hand, OCaml provides a String type that *can be* seen as a > variable-length sequence of uninterpreted bytes. Yes. What ocaml does not provide is a way of encoding extended characters -- \uXXXX \UXXXXXXXX in strings, or in identifiers. >We have uninterpreted > bytes! It's all we need to build whatever I18NString type we may need. > What is missing is *library* facilities to abstract that view into a > full-fledged i18n machinery. I agree. >Of course, there's a problem with the > manipulation of 32-bit integer values, but if used with care, the Nat > datatype could serve perfectly well as the underlying, low-level datatype. > > Which makes me think, John, you already have variable-length int arrays. But they're not standard (yet). Actually, ocaml 'int' is 31 bits, which is enough bits for ISO10646 (with some careful fiddling to avoid problems with the sign?). So there are TWO issues -- one is to make ocaml itself ISO10646 aware (i.e., the compiler), and the other is to provide users with libraries to manipulate extended characters. Please note: neither of these features would be optional, were ocaml to be submitted for ISO standardisation. ISO directives require all ISO languages to upgrade to provide international support. I know ocaml isn't an ISO language, but I think the basic intent is sound. [In some sense, ocaml is already a leader, accepting Latin-1 characters when other languages only allowed ASCII] -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml 1999-10-21 15:35 ` skaller @ 1999-10-21 16:27 ` Matías Giovannini 1999-10-21 16:36 ` skaller 0 siblings, 1 reply; 21+ messages in thread From: Matías Giovannini @ 1999-10-21 16:27 UTC (permalink / raw) To: caml-list; +Cc: skaller skaller wrote: > > Matías Giovannini wrote: > >What's more, the whole syntactic > > apparatus of a programming language *assumes* a Latin setting, where > > things make sense when read from left to right, from top to bottom; and > > where punctuation is what we're used to. Programming languages suited > > for a Han, or Arab, or even a Hebrew audience would have to be rethinked > > from the grounds up. > > Actually, no. Most of these peoples learn English and learn > computing, if they are to work with computers. But they still wish > to use comments, strings, and identifiers in their native script. Strings can be localized with a package mechanism, à la Java. I don't like hardwired strings in code, they're a maintenance nightmare (not that I always abide by my own rule :-) > Have you ever seen a Japanese program? I have. > Quite an interesting challenge: normal C/C++ code, with > Latin characters encoding Japanese character names in identifiers, > and actual Japanese characters in comments and strings. I agree that comments should be written in the language most suited to the intended audience (I normally comment my code in English, unless I know I want someone else to maintain it, in which case I comment it in Spanish.) > > On the other hand, OCaml provides a String type that *can be* seen as a > > variable-length sequence of uninterpreted bytes. > > Yes. What ocaml does not provide is a way of encoding > extended characters -- \uXXXX \UXXXXXXXXX in strings, or in identifiers. No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude. 
> >Of course, there's a problem with the > > manipulation of 32-bit integer values, but if used with care, the Nat > > datatype could serve perfectly well as the underlying, low-level datatype. > > > > Which makes me think, John, you already have variable-length int arrays. > > But they're not standard (yet). They are! Don't be put off by its status as "experimental feature". Nat's been around since CamlLight. You could even use it as a template implementation of unsafe longint varlen arrays and link a custom toplevel. Yet again, OCaml provides the tools. > So there are TWO issues -- one is to make ocaml itself > ISO10646 aware (i.e., the compiler), and the other is to provide > users with libraries to manipulate extended characters. I think a more realistic goal would be making OCaml ISO10646-tolerant in comments. Perhaps adding real conditional compilation and transparent comments would suffice. Again, anyone can download the source code and modify OCaml to suit his tastes. OCaml's goal is not to be a model of i18n awareness, but a platform for experimenting with types in a functional setting. It happens that OCaml is open enough, and extensible enough, and efficient enough to make a good i18n effort possible, and that is a tribute to its success as a strongly-typed, imperative, fast functional language. > Please note: neither of these features would be optional, > were ocaml to be submitted for ISO standardisation. ISO directives > require all ISO languages to upgrade to provide international > support. I know ocaml isn't an ISO language, but I think the > basic intent is sound. [In some sense, ocaml is already a leader, > accepting Latin-1 characters when other languages only allowed ASCII] The implementors have made clear on more than one occasion that they're not interested in making OCaml a standard language (remember the thread "How to convince management?"). But don't take my word for it, ask Pierre. -- I got your message. I couldn't read it. 
It was a cryptogram. -- Laurie Anderson ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml 1999-10-21 16:27 ` Matías Giovannini @ 1999-10-21 16:36 ` skaller 1999-10-21 17:21 ` Matías Giovannini 1999-10-23 9:53 ` Benoit Deboursetty 0 siblings, 2 replies; 21+ messages in thread From: skaller @ 1999-10-21 16:36 UTC (permalink / raw) To: matias; +Cc: caml-list Matías Giovannini wrote: > Strings can be localized with a package mechanism, à la Java. I don't > like hardwired strings in code, they're a maintenance nightmare (not > that I always abide by my own rule :-) It doesn't matter what you like (or what I like). > > Have you ever seen a Japanese program? I have. > > Yes. What ocaml does not provide is a way of encoding > > extended characters -- \uXXXX \UXXXXXXXXX in strings, or in identifiers. > > No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude. Irrelevant. The \u \U escapes are ISO recommended, used in C and C++, and must be supported. > > > Which makes me think, John, you already have variable-length int arrays. > > > > But they're not standard (yet). > > They are! Don't be put off by its status as "experimental feature". > Nat's been around since CamlLight. Oh, I must have misunderstood your comment: Nat is standard, I'm using it in Viper, but 'a Varray -- a variable length array of 'a, is not. > Again, anyone can download the source code and modify OCaml to suit his > tastes. OCaml's goal is not to be a model of i18n awareness, but a > platform for experimenting with types in a functional setting. Ocaml is a tool, it doesn't have a goal. :-) Humans have goals. The problem is that the designers of ocaml have been too successful: ocaml is so good that other people now want to use it, and _their_ goals are important too. >It > happens that OCaml is open enough, and extensible enough, and efficient > enough to make a good i18n effort possible, and that is a tribute to its > success as strongly-typed, imperative, fast functional language. I agree. 
It could easily become a leader in this field, since implementing complex
stuff is relatively easy in ocaml :-)

> The implementors have made clear on more than one occasion that they're
> not interested in making OCaml a standard language (remember the thread
> "How to convince management?"). But don't take my word for it, ask Pierre.

My point was simply that the ISO internationalisation requirements are not
unreasonable, and that other languages will be doing this work, some
because they have to, and some because they want to stay part of the real
world -- and encourage non-English (whoops, I mean, non-Latin :-) clients,
who, after all, may well make significant contributions.

--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 21+ messages in thread
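The disagreement above -- ISO \uXXXX escapes versus hand-written \HH\LL byte pairs -- is narrower than it looks if an escape simply expands to UTF-8 bytes inside an ordinary string. A hedged sketch of that expansion (the function name is mine; it handles only code points up to U+FFFF):

```ocaml
(* Sketch of what a \uXXXX escape could expand to if the compiler
   translated it into UTF-8 bytes in an ordinary string, so no new
   string type is needed. Illustrative only; covers U+0000..U+FFFF. *)
let utf8_of_code_point cp =
  let buf = Buffer.create 4 in
  if cp < 0x80 then
    (* 1-byte sequence: plain ASCII *)
    Buffer.add_char buf (Char.chr cp)
  else if cp < 0x800 then begin
    (* 2-byte sequence: 110xxxxx 10xxxxxx *)
    Buffer.add_char buf (Char.chr (0xC0 lor (cp lsr 6)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end else begin
    (* 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx *)
    Buffer.add_char buf (Char.chr (0xE0 lor (cp lsr 12)));
    Buffer.add_char buf (Char.chr (0x80 lor ((cp lsr 6) land 0x3F)));
    Buffer.add_char buf (Char.chr (0x80 lor (cp land 0x3F)))
  end;
  Buffer.contents buf
```

For example, `utf8_of_code_point 0xE9` (é, i.e. \u00E9) yields the two bytes "\195\169" -- exactly the byte pair one would otherwise write by hand.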
* Re: localization, internationalization and Caml
1999-10-21 16:36 ` skaller
@ 1999-10-21 17:21 ` Matías Giovannini
1999-10-23 9:53 ` Benoit Deboursetty
1 sibling, 0 replies; 21+ messages in thread
From: Matías Giovannini @ 1999-10-21 17:21 UTC (permalink / raw)
To: caml-list; +Cc: skaller

skaller wrote:
>
> Matías Giovannini wrote:
> > Strings can be localized with a package mechanism, à la Java. I don't
> > like hardwired strings in code, they're a maintenance nightmare (not
> > that I always abide by my own rule :-)
>
> It doesn't matter what you like (or what I like).

It doesn't; my point is that the functionality for localized strings can
be had, only through an indirect route, such as "string packages". As an
aside, let's keep the tone light, ok?

> > > > Have you ever seen a Japanese program? I have.
> > > Yes. What ocaml does not provide is a way of encoding
> > > extended characters -- \uXXXX \UXXXXXXXX in strings, or in identifiers.
> >
> > No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude.
>
> Irrelevant. The \u \U escapes are ISO recommended, used in
> C and C++, and must be supported.

Well, OCaml is *not* ISO recommended, is *not* C, and it is certainly
*not* C++. Let's learn to live with languages other than ISO-mandated,
ISO-validated, ISO-standardized and whatnot. In fact, now that I think of
it, standardization is driven by market pressure. If OCaml were a
commercial product, I guess things would be different. But it's not (thank
Pete), see below.

> > > > Which makes me think, John, you already have variable-length int arrays.
> > >
> > > But they're not standard (yet).
> >
> > They are! Don't be put off by its status as "experimental feature".
> > Nat's been around since CamlLight.
>
> Oh, I must have misunderstood your comment: Nat is standard,
> I'm using it in Viper, but 'a Varray -- a variable-length array of 'a --
> is not.
And it's not going to be, unless someone comes up with a sound typing
strategy *and* an efficient implementation for them.

> > Again, anyone can download the source code and modify OCaml to suit his
> > tastes. OCaml's goal is not to be a model of i18n awareness, but a
> > platform for experimenting with types in a functional setting.
>
> Ocaml is a tool, it doesn't have a goal. :-)
> Humans have goals. The problem is that the designers of ocaml
> have been too successful: ocaml is so good that other people now
> want to use it, and _their_ goals are important too.

Let me restate it: OCaml is the intellectual property of INRIA, developed
under a specific project (Projet Cristal, if I remember correctly) with
very definite goals. The project *has* goals; anything outside those goals
is a gift (what is more, everything falling *within* those goals is
already a gift), and must be accepted as such. If INRIA decides that,
since OCaml is useful to many, many people around the world, one of its
goals should be to turn OCaml into a platform for experimenting in the
implementation of programming languages with strong i18n support, well,
bring the champagne. In the meantime, we'll have to build upon what's
there.

Suppose the following scenario: INRIA decides that the MacOS platform is
not nearly significant enough to justify the porting effort, and so it is
dropped. What should I do? Plead, certainly, until I'm told "don't whine,
there's nothing we can do". What would be my options? Use a Wintel box, or
make my own port. This scenario is not unrealistic: there's no native
compiler under MacOS, and there won't be until someone ports it. I can't
do it, the implementors can't do it, and such is life.

> > It happens that OCaml is open enough, and extensible enough, and efficient
> > enough to make a good i18n effort possible, and that is a tribute to its
> > success as a strongly-typed, imperative, fast functional language.
>
> I agree.
> It could easily become a leader in this field,
> since implementing complex stuff is relatively easy in ocaml :-)
>
> > The implementors have made clear on more than one occasion that they're
> > not interested in making OCaml a standard language (remember the thread
> > "How to convince management?"). But don't take my word for it, ask Pierre.
>
> My point was simply that the ISO internationalisation requirements
> are not unreasonable, and that other languages will be doing this work,
> some because they have to, and some because they want to stay part of
> the real world -- and encourage non-English (whoops, I mean, non-Latin :-)
> clients, who, after all, may well make significant contributions.

Hm. I see your point. I don't necessarily agree, though.

--
I got your message. I couldn't read it. It was a cryptogram.
-- Laurie Anderson

^ permalink raw reply	[flat|nested] 21+ messages in thread
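The "string package" route mentioned above can be sketched in OCaml as a catalog keyed by locale and message key. Everything here -- the locale type, the keys, the messages, and the fallback rule -- is an illustrative assumption, not an existing library:

```ocaml
(* Sketch of the "string package" idea: messages are looked up by
   (locale, key) instead of being hardwired in code. Contents and
   names are illustrative assumptions only. *)
type locale = English | French

let catalog : (locale * string, string) Hashtbl.t = Hashtbl.create 16

let () =
  List.iter (fun (loc, key, text) -> Hashtbl.add catalog (loc, key) text)
    [ (English, "too_much", "TOO MUCH inflation");
      (French,  "too_much", "Taux d'inflation - TROP");
      (English, "greeting", "Hello") ]

(* Look up a message, falling back to English when no translation
   exists for the requested locale. *)
let message loc key =
  try Hashtbl.find catalog (loc, key)
  with Not_found -> Hashtbl.find catalog (English, key)
```

A caller writes `message French "too_much"` instead of a hardwired string; adding a locale means adding catalog entries, not editing code.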
* Re: localization, internationalization and Caml
1999-10-21 16:36 ` skaller
1999-10-21 17:21 ` Matías Giovannini
@ 1999-10-23 9:53 ` Benoit Deboursetty
1999-10-25 21:06 ` Jan Skibinski
1999-10-26 18:02 ` skaller
1 sibling, 2 replies; 21+ messages in thread
From: Benoit Deboursetty @ 1999-10-23 9:53 UTC (permalink / raw)
To: caml-list

This message just wants to raise a paradoxical point in this discussion
[though it may have already been posted?]. It seems to me that allowing
foreign characters to be used in a computer language, as identifiers or
comments, would reduce the exchange of contributions worldwide.

Here is my personal experience: I have used caml and ocaml for more than
two years now. From the beginning, it seemed really cool to be able to
have identifiers in French, with accents and everything. So I took up the
habit of using French in my programs. Now I'm writing a more substantial
program, which could become a small international "open project" --
*except* that I find myself with a program in French, and it's not so easy
to find qualified programming partners who understand French. The range of
people who could help with my program is terribly limited. You should
understand that I sometimes feel I should have written it in English.

I must however acknowledge that [O']Caml's ability to cope with Latin-1
characters is above all useful for educational purposes. Let me explain...
Perhaps it is a French thing, but in this country it sounds quite snobbish
for a French speaker to embed English words in a sentence with the right
accent and stress. Hence, almost every computer science teacher takes on
an exaggerated French accent to pronounce English words ("la fonction
'rimouve'"). [I shall not disclose the names of my teachers in CaML :)]
So, for educational purposes, it is much better if teachers can use French
identifiers ("la fonction 'enlève'"). Much easier to pronounce, isn't it?
I suppose it is the same in many other countries.
(I think especially of Japan: "biko-zu ingurisshu izu ha-do tsu
puronaonsu foa japani-zu pi-poru tsu-")

My point remains: encouraging people to write code in their own language
would reduce the possibilities of exchanging their work. This does not
mean, though, that I will translate the program I've written into English.
I consider it a sort of tribute to the preservation of the diversity of
languages, at my own humble scale... and I will write enough programs in
English when I work for a company, too.

Benoît de Boursetty
Benoit.de-Boursetty@polytechnique.org

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml
1999-10-23 9:53 ` Benoit Deboursetty
@ 1999-10-25 21:06 ` Jan Skibinski
1999-10-26 18:02 ` skaller
1 sibling, 0 replies; 21+ messages in thread
From: Jan Skibinski @ 1999-10-25 21:06 UTC (permalink / raw)
To: Benoit Deboursetty; +Cc: caml-list

On Sat, 23 Oct 1999, Benoit Deboursetty wrote:

> This message just wants to raise a paradoxical point in this discussion
> [yet it may have already been posted ?]. It seems to me that allowing
> foreign characters to be used in a computer language, as identifiers or
> comments, would reduce the exchange of contributions worldwide.

Yes, but it is nice to have error messages, prompts, etc. expressed in the
native language of a program's user. And the ability to process text in a
native language is also quite often desirable.

I have been reading this thread for some time and I've seen plenty of
references to Latin1, and many different attitudes towards its usefulness
(or lack thereof). Let me add my two cents here. Demanding support for
diacritical marks is often not a matter of being snobbish or a language
purist. I cannot speak for other languages that use the Latin alphabet,
but I can tell you what a mess it is with Polish (which has 8 diacritical
marks) and, I suppose, with other languages, such as Hungarian, etc., that
have been assigned to Latin2. Someone made that decision some time ago,
and now we pay the price, since Latin1 seems to be seen by some as some
sort of improvement over plain ASCII.

I am not whining here, because I can get by quite well with plain ASCII in
my email, etc., and I can even cope with all sorts of email that arrive
here formatted as either Latin1 or Latin2. But even so, I sometimes find
myself cornered by plain ASCII when the meaning of a sentence suddenly
becomes funny, or berserk, or senseless. One example to illustrate the
point:

1. z<.>a<;>danie - "a strong request". This is what I want to use.
2. zadanie - "a problem to solve, or a goal". Wrong!
   This is what I get from plain ASCII.
3. rzadanie - When pronounced it does not sound quite like "a request",
   but an intelligent recipient can guess my intention. They might as well
   consider me illiterate, though; Polish has two alternative spellings of
   the same (similar) sound: "z<.>" and "rz". In this case "rz" is very
   wrong.
4. rzondanie - Now it sounds almost OK ("on" sounds close to "a<;>"), but
   the spelling is even worse.

where
   z<.> stands for a dot over z
   a<;> stands for an "ogonek" (yes, this is the official name in
        Unicode), or "a tail", under a.

As you can see, this is not just a matter of some perky accents.

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml
1999-10-23 9:53 ` Benoit Deboursetty
1999-10-25 21:06 ` Jan Skibinski
@ 1999-10-26 18:02 ` skaller
1 sibling, 0 replies; 21+ messages in thread
From: skaller @ 1999-10-26 18:02 UTC (permalink / raw)
To: Benoit Deboursetty; +Cc: caml-list

Benoit Deboursetty wrote:
>
> This message just wants to raise a paradoxical point in this discussion
> [yet it may have already been posted ?]. It seems to me that allowing
> foreign characters to be used in a computer language, as identifiers or
> comments, would reduce the exchange of contributions worldwide.

Excuse me, but exactly what do you mean by 'foreign' characters? Do you
mean non-Chinese characters? What? You aren't Chinese?

> You should understand I sometimes feel I should have written it in
> English.

I think that, at the moment, English is the 'lingua franca' <grin> of the
Internet. Spoken with an American accent :-) However, the Internet is
growing fast, and English speakers will soon enough be a minority. It will
probably remain true that most of the _programmers_ will be able to use
English.

> I must however acknowledge that [O']Caml's ability to cope with Latin-1
> characters is above all useful for educational purposes.

Yes. I think it is highly laudable that ocaml accepts more than just plain
'ASCII': many students are more fluent in their native language (even if
they speak some English and/or are learning it), and being able to program
in it will enhance learning. Internationalising software that is actually
worth sharing internationally is a lesser obstacle than writing good
software in the first place.

> My point remains: encouraging people to write code in their language would
> reduce the possibilities of exchanging their work.

In my opinion, a programming language should simply give clients a
_choice_. Cultures, people, and circumstances vary.
I don't think programming language designers should be in the business of
encouraging or discouraging the use of a particular language, but rather
of facilitating the implementation of the clients' own wishes or
requirements.

--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 21+ messages in thread
* Re: localization, internationalization and Caml
1999-10-15 13:53 Gerard Huet
1999-10-15 20:28 ` Gerd Stolpmann
@ 1999-10-17 14:29 ` Xavier Leroy
1999-10-19 18:36 ` skaller
1 sibling, 1 reply; 21+ messages in thread
From: Xavier Leroy @ 1999-10-17 14:29 UTC (permalink / raw)
To: caml-list

Wow, there's nothing like internationalization to spark lively
discussions. Since even Gérard Huet (oops, sorry for that 8859-1 accent,
couldn't resist) and Francis Dupont broke their vows of silence, I guess I
have to say something too.

The support for ISO-8859-1 in Caml Light and OCaml is essentially a
historical and geographical accident. The first books on Caml were written
in French, and it was nice to be able to use accented French words as
identifiers. Also, that was at a time (1991-1992) when Unicode and its kin
didn't even exist.

The choice of ISO-8859-1 is not that politically incorrect either: it
works not only for western Europe, but also for Latin America, many
Pacific countries, and large parts of Africa. If we were to choose an
8-bit character set based on the number of OCaml programmers that actually
need it, I guess ISO-8859-1 (or its newer incarnation with the Euro sign,
whose name I can't remember) would still win. (At least until we get OCaml
into the Chinese curriculum...)

Notice also that Caml doesn't prevent the programmer from putting any
character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
Unicode) in character strings and in comments.

There are several ways to internationalize further. One is to support
other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There are
several problems with this:
- It's not enough for Asian languages.
- The POSIX localization stuff isn't supported under Windows.
- It's badly supported on all Unixes I know (e.g. to get French, I need to
  set LC_CTYPE to different values under Linux, Solaris, and Digital Unix;
  it gets worse for other languages such as Japanese).
- Handling of mixed-language texts is a nightmare.

Unicode / ISO 10646 is probably a better approach. However, it has its own
problems:
- There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
  technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
  chose 32-bit Unicode. (That's the great thing about standards: there are
  so many to choose from...)
- Apparently, not everyone agrees on multi-byte encodings (UTF8) either.
  E.g. Java seems to have its own variant of UTF8. How are we going to
  interoperate?
- I/O is a nightmare. The API has to handle at least byte streams, wide
  character streams, and UTF8-encoded streams.
- Support for Unicode / UTF8 files in today's operating systems and GUIs
  is very low. When will I be able to do "more" on a UTF8 file and see my
  French accented letters?

My conclusion is that I18N is such a mess that I don't think we'll do much
about it in Caml anytime soon. Perhaps some basic support for wide
characters and wide character strings will be added at some point, if only
because COM interoperability requires it.

- Xavier Leroy

^ permalink raw reply	[flat|nested] 21+ messages in thread
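The I/O point -- byte streams versus wide and UTF8-encoded streams -- has a minimal alternative: keep I/O byte-oriented and decode afterwards. A hedged sketch of such a decoder, from an ordinary byte string to an array of code points, assuming well-formed input and handling only 1- to 3-byte sequences (no validation, not a full ISO 10646 decoder):

```ocaml
(* Sketch of the "decode later" approach: I/O stays byte-oriented, and
   a UTF-8 string read from a channel is decoded afterwards into an
   array of code points represented as plain ints. Illustrative only:
   assumes well-formed 1- to 3-byte sequences and does no validation. *)
let decode_utf8 s =
  let n = String.length s in
  let out = ref [] in
  let i = ref 0 in
  while !i < n do
    let b0 = Char.code s.[!i] in
    let cp, len =
      if b0 < 0x80 then
        (* ASCII byte: the code point is the byte itself *)
        b0, 1
      else if b0 < 0xE0 then
        (* 2-byte sequence: 110xxxxx 10xxxxxx *)
        ((b0 land 0x1F) lsl 6) lor (Char.code s.[!i + 1] land 0x3F), 2
      else
        (* 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx *)
        ((b0 land 0x0F) lsl 12)
        lor ((Char.code s.[!i + 1] land 0x3F) lsl 6)
        lor (Char.code s.[!i + 2] land 0x3F), 3
    in
    out := cp :: !out;
    i := !i + len
  done;
  Array.of_list (List.rev !out)
```

With this, the channel API never needs to know about character sets: `decode_utf8 "a\195\169"` turns the three bytes of "aé" into the two code points 97 and 233.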
* Re: localization, internationalization and Caml
1999-10-17 14:29 ` Xavier Leroy
@ 1999-10-19 18:36 ` skaller
0 siblings, 0 replies; 21+ messages in thread
From: skaller @ 1999-10-19 18:36 UTC (permalink / raw)
To: Xavier Leroy; +Cc: caml-list

Xavier Leroy wrote:
> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident. The first books on Caml were
> written in French, and it was nice to be able to use accented french
> words as identifiers. Also, that was at a time (1991-1992) where
> Unicode and consorts didn't even exist.

And supporting ISO-8859-1 was a fine thing to do at the time!

> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa. If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win. (At least
> until we get OCaml in the Chinese curriculum...)

While this is true, there is a circularity here: people not using 8-bit
character sets face an extra battle using ocaml.

> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.

Yes. This is one of the key points of my argument that UTF-8 is the
natural way to go: it provides ISO-10646 compliance without requiring any
new string kind.

> There are several ways to internationalize further. One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g.
to get French, I
> need to set LC_CTYPE to different values under Linux, Solaris, and
> Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.

If you are suggesting not using the C locale stuff -- I agree entirely.

> Unicode / ISO10646 is probably a better approach. However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
> technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
> chose 32-bit Unicode. (That's the great things about standards:
> there are so many to choose from...)

I cannot see the problem -- except for the 16-bit adopters, who must
eventually upgrade ... again.

> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
> E.g. Java seems to have its own variant of UTF8. How are we going
> to interoperate?

I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO standard is followed. If Java doesn't do
that, that is Java's problem.

> - I/O is a nightmare. The API has to handle at least byte streams,
> wide character streams, and UTF8-encoded streams.

No, it doesn't. That is a possibility, but it is NOT necessary. It is
necessary only to read byte streams. Conversion can be done later using
strings. This is less efficient, but it is a sensible starting point (to
ignore internationalisation on I/O completely).

> - Support for Unicode / UTF8 files in today's operating systems and GUIs
> is very low. When will I be able to do "more" on an UTF8 file and see my
> French accented letters?

Yes. I agree. This is a major problem. One of the answers is "When
programming languages provide the support that applications programmers
need" :-)

> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.

I agree.
The way forward is, I believe:

a) do not change the I/O system, but deprecate TEXT mode (all I/O should
   be done in binary)
b) do not change the String module, but deprecate the upper/lower case
   functions (and anything else that smacks of relating to natural
   language)
c) provide functions to support internationalisation
d) modify the ocaml compiler to process \uXXXX and \UXXXXXXXX escapes
   [everywhere]
e) provide a fast variable-length array type

(d) could be done easily using camlp4, I think.

> Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.

I don't think that is necessary; a variable-length array of integers is
good enough.

--
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 21+ messages in thread
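Point (e) above -- the fast variable-length array of integers that would stand in for a wide string -- might look like the following sketch; the type, its representation, and the function names are mine, not a proposed interface:

```ocaml
(* Hypothetical sketch of a variable-length int array ("wide string"):
   amortised O(1) append over a plain int array, doubling the backing
   store when full. Illustrative only. *)
type varray = { mutable data : int array; mutable len : int }

let create () = { data = Array.make 8 0; len = 0 }

let append v cp =
  if v.len = Array.length v.data then begin
    (* backing store is full: double it and copy the contents over *)
    let bigger = Array.make (2 * Array.length v.data) 0 in
    Array.blit v.data 0 bigger 0 v.len;
    v.data <- bigger
  end;
  v.data.(v.len) <- cp;
  v.len <- v.len + 1

(* Extract the used prefix as an ordinary fixed-size array. *)
let to_array v = Array.sub v.data 0 v.len
```

A decoder can append code points one at a time without knowing the final length in advance, which is exactly the property a fixed-size `int array` lacks.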
end of thread, other threads: [~1999-10-28 17:07 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-10-13 12:12 localization, internationalization and Caml STARYNKEVITCH Basile
1999-10-14 22:20 ` skaller
1999-10-15  8:26 ` Francis Dupont
1999-10-17 11:27 ` skaller
1999-10-17 15:54 ` Francis Dupont
1999-10-19 18:48 ` skaller
1999-10-15 13:53 Gerard Huet
1999-10-15 20:28 ` Gerd Stolpmann
1999-10-19 18:06 ` skaller
1999-10-20 21:05 ` Gerd Stolpmann
1999-10-21  4:42 ` skaller
1999-10-21 12:05 ` Matías Giovannini
1999-10-21 15:35 ` skaller
1999-10-21 16:27 ` Matías Giovannini
1999-10-21 16:36 ` skaller
1999-10-21 17:21 ` Matías Giovannini
1999-10-23  9:53 ` Benoit Deboursetty
1999-10-25 21:06 ` Jan Skibinski
1999-10-26 18:02 ` skaller
1999-10-17 14:29 ` Xavier Leroy
1999-10-19 18:36 ` skaller