caml-list - the Caml user's mailing list
* Re: localization, internationalization and Caml
@ 1999-10-15 13:53 Gerard Huet
  1999-10-15 20:28 ` Gerd Stolpmann
  1999-10-17 14:29 ` localization, internationalization and Caml Xavier Leroy
  0 siblings, 2 replies; 24+ messages in thread
From: Gerard Huet @ 1999-10-15 13:53 UTC (permalink / raw)
  To: Francis Dupont, skaller; +Cc: STARYNKEVITCH Basile, caml-list

Just to put my 2 cents on this issue...

At 10:26 15/10/99 +0200, Francis Dupont wrote:
> In your previous mail you wrote:
>
>   	The current 'support' for 8 bit characters in ocaml should be
>   deprecated immediately. It is an extremely bad thing to have, since
>   Latin-1 et al are archaic 8 bit standards incompatible with the
>   international standard for ISO10646 communication, namely
>   the UTF-8 encoding.

I do not agree. What we need is not ayatollah diktats, but careful thinking
about the evolution of standards.

First of all, ISO-Latin is as international a standard as ISO10646, only a bit
more mature. In essence, international standards are not immediately obsolete;
they are here to stay, because we need some stability in a world of sound
engineering, as opposed to the permanent hype to which our discipline is
subjected.

Secondly, the string data type of Ocaml is not about ASCII or ISO-Latin or
whatever. It is a low-level data type: an implementation of sequences of bytes
efficiently represented in machine memory. These bytes may be used
for encoding elements of various finite sets such as ASCII or ISO-Latin,
but the string library does not care about such intentions.

When such strings are used to represent natural language sentences, there
is a natural tendency to sophistication, from the UPPER CASE letters of the
computer printers of old, to ASCII, to ISO-Latin 1, 2, etc., to Unicode. At
some point (beyond 256 codes) these sets can no longer be mapped one-to-one
into bytes, and so multi-byte representations such as UTF-8 must be designed.
Any such multi-byte representation is inconsistent with the ISO-Latin
convention somewhere, and thus the ISO-Latin character set must be shifted out
of its usual representation, since the 8th bit is needed for the multi-byte
encoding.

So for instance engineers designing natural language interfaces must choose
between sticking to the old convention in purely local software, and
upgrading their software to the international standard, typically for Web
applications. At some point I am sure some brave soul from the Ocaml
implementation team will write a Unicode library for implementing the
non-trivial manipulations of lists of Unicode characters, so that the above
engineers will have a generic tool to use. Such libraries will typically
implement a NEW datatype of "unistring" or whatever, with proper conversion
to string representations of course, but the string data type is surely
here to stay, because bytes are not going to become obsolete overnight. :-)

>=> there is a rather strong opposition against UTF-8 in France
>because it is not a natural encoding (ie. if ASCII maps to ASCII
>it is not the case for ISO 8859-* characters, imagine a new UTF-X
>encoding maps ASCII to strange things and you'd be able to understand
>our concern).

I do not share Francis' pessimism. The ISO committees are not entirely
stupid, and care has been taken to make the move as painless as possible:
ISO-Latin has just been shifted by a mere translation. Here is my Ocaml
code for rendering strings of ISO-Latin 1 characters as HTML numeric
character references (which any UTF-8 HTML page can carry):

let print_unicode c =
        let ascii = int_of_char c in (* test for ISO-LATIN *)
        if ascii < 128 then print_char c (* 7 bit ascii *)
        else print_string ("&#" ^ string_of_int ascii ^ ";")
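
For instance (the test string is mine; it uses a decimal escape so that
this message itself stays 7-bit):

        let () = String.iter print_unicode "th\233orie"
        (* prints: th&#233;orie *)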

This is hardly mysterious or complicated or inefficient.

>=> my problem is the output of the filter will be no more readable when
>I've put too much French in the program (in comments for instance).

Come on, Francis, we do not read core dumps nowadays, we read through the
eyes of HTML or TeX or whatever!

>=> I believe internationalization should not be done by countries
>where English is the only used language: this is at least awkward...

I simply do not understand this remark in a WWW world.

Cheers
Gérard






^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-15 13:53 localization, internationalization and Caml Gerard Huet
@ 1999-10-15 20:28 ` Gerd Stolpmann
  1999-10-19 18:06   ` skaller
  1999-10-17 14:29 ` localization, internationalization and Caml Xavier Leroy
  1 sibling, 1 reply; 24+ messages in thread
From: Gerd Stolpmann @ 1999-10-15 20:28 UTC (permalink / raw)
  To: caml-list

I agree that Unicode or even ISO-10646 support would be a nice thing. I
also agree that for many users (including myself) the Latin-1 character set
suffices. Luckily, both character sets are strongly related: The first 256
character positions of Unicode (which is the 16-bit subset of ISO-10646) are
exactly the same as in Latin1.

UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit
character is represented by one to six bytes; the higher the character code the
more bytes are needed. This encoding is mainly interesting for I/O, and not for
internal processing, because it is impossible to access the characters of a
string by their position. Internally, you must represent the characters as 16-
or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is
wasted (this is the price for the enhanced possibilities). UTF-8 is designed
for compatibility, because the following holds:

- Every ASCII character (i.e. with codes 0 to 127) is represented as before,
  and every non-ASCII character is represented by a byte sequence where the
  eighth bit is set. Old, non-UTF-aware programs can at least interpret the
  ASCII characters.
  (Note that there is a variant of UTF-8 which encodes the 0 character
  differently - by two bytes.)

- If you sort UTF-8 strings alphabetically (more precisely, using
  the byte values of the encoding as the criterion) you get the same result as if
  you sorted the strings alphabetically by their character codes.
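
To make the one-to-six-byte scheme concrete, here is a minimal OCaml
sketch of the encoder (the function name is mine, and it follows the
original 31-bit rules described above; error handling is omitted):

	let utf8_of_code_point n =
	  (* each continuation byte carries 6 bits, tagged 10xxxxxx *)
	  let buf = Buffer.create 6 in
	  let add b = Buffer.add_char buf (Char.chr b) in
	  let cont shift = add (0x80 lor ((n lsr shift) land 0x3F)) in
	  if n < 0x80 then add n                      (* 0xxxxxxx *)
	  else if n < 0x800 then
	    (add (0xC0 lor (n lsr 6)); cont 0)        (* 110xxxxx 10xxxxxx *)
	  else if n < 0x10000 then
	    (add (0xE0 lor (n lsr 12)); cont 6; cont 0)
	  else if n < 0x200000 then
	    (add (0xF0 lor (n lsr 18)); cont 12; cont 6; cont 0)
	  else if n < 0x4000000 then
	    (add (0xF8 lor (n lsr 24)); cont 18; cont 12; cont 6; cont 0)
	  else
	    (add (0xFC lor (n lsr 30)); cont 24; cont 18; cont 12;
	     cont 6; cont 0);
	  Buffer.contents buf

For example, utf8_of_code_point 0xE9 (e-acute) yields the two bytes
C3 A9, and the first property above can be read off the code directly.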

This means that we need at least three types of strings: Latin1 strings for
compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings
for I/O. For simplicity, I suggest representing both Latin1 and UTF-8 strings by
the same language type "string", and providing "wchar" and "wstring" for the
extended character set.

Of course, the current "string" data type is mainly an implementation of byte
sequences, which is independent of the underlying interpretation. Only the
following functions seem to introduce a character set:

- The String.uppercase and .lowercase functions
- The channels, if opened in text mode (because they specially recognize
  the line endings needed by the operating system, and newline characters
  are part of the character set)

The best solution would be to have internationalized versions of these
functions (and of perhaps some more functions) which still operate on the
"string" type but allow the user to select the encoding and the locale.

This means we would have something like

	type encoding = UTF8 | Latin1 | ...
	type locale = ...
	val String.i18n_uppercase : encoding -> locale -> string -> string
	val String.i18n_lowercase : encoding -> locale -> string -> string
	val String.recode : encoding -> encoding -> string -> string
		(* changes the encoding if possible *)
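
As a hedged illustration of the easiest direction of such a recode
(Latin1 to UTF8, relying on the fact noted above that the first 256
Unicode positions coincide with Latin1; the concrete function name is
mine):

	let latin1_to_utf8 s =
	  let buf = Buffer.create (String.length s) in
	  String.iter
	    (fun c ->
	      let n = Char.code c in
	      if n < 0x80 then Buffer.add_char buf c   (* ASCII unchanged *)
	      else begin
	        (* Latin1 code n is code point n: two UTF-8 bytes *)
	        Buffer.add_char buf (Char.chr (0xC0 lor (n lsr 6)));
	        Buffer.add_char buf (Char.chr (0x80 lor (n land 0x3F)))
	      end)
	    s;
	  Buffer.contents buf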

For "wstring" it is simpler:

	val Wstring.uppercase : wstring -> wstring
	val Wstring.i18n_uppercase : locale -> wstring -> wstring

New opening mode for channels:
	Text of encoding
The argument specifies the encoding of the file. It must be possible to change
it later (e.g. to process XML's "encoding" declaration).

New input/output functions:

	val output_i18n_string : out_channel -> encoding -> string -> unit
	val input_i18n_line : in_channel -> encoding -> string

Here, the encoding argument specifies the encoding of the internal
representation. The other I/O functions need I18N versions as well, and of
course we need to operate on "wstring"s directly:

	val output_wstring : out_channel -> wstring -> unit
	val input_wstring_line : in_channel -> wstring

This all means that the number of string functions explodes: We need functions
for compatibility (Latin1), functions for arbitrary 8 bit encodings, and
functions for wide strings. I think this is the main argument against it,
and it is very difficult to get around this. (Any ideas?)

Francis Dupont:
>=> my problem is the output of the filter will be no more readable when
>I've put too much French in the program (in comments for instance).

The enlarged character sets become more and more important, and it is only a
matter of time until every piece of software which wants to be taken seriously
can process them, even a dumb terminal or simple text editor. So you will be
able to put accented characters into your comments, and you will see them as
such even if you 'cat' the program text to the terminal or printer; this will
work everywhere...

>=> I believe internationalization should not be done by countries
>where English is the only used language: this is at least awkward...

but in the USA (a general prejudice not worth discussing; there is only some
"personal experience" behind it).

Gerd
--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   Gerd.Stolpmann@darmstadt.netsurf.de (privat)
Germany                     
----------------------------------------------------------------------------




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-15 13:53 localization, internationalization and Caml Gerard Huet
  1999-10-15 20:28 ` Gerd Stolpmann
@ 1999-10-17 14:29 ` Xavier Leroy
  1999-10-19 18:36   ` skaller
  1 sibling, 1 reply; 24+ messages in thread
From: Xavier Leroy @ 1999-10-17 14:29 UTC (permalink / raw)
  To: caml-list

Wow, there's nothing like internationalization to spark lively
discussions.  Since even Gérard Huet (oops, sorry for that 8859-1
accent, couldn't resist) and Francis Dupont broke their
vows of silence, I guess I have to say something too.

The support for ISO-8859-1 in Caml Light and OCaml is essentially a
historical and geographical accident.  The first books on Caml were
written in French, and it was nice to be able to use accented French
words as identifiers.  Also, that was at a time (1991-1992) when
Unicode and company didn't even exist.

The choice of ISO-8859-1 is not that politically incorrect either: it
works not only for western Europe, but also for Latin America, many
Pacific countries, and large parts of Africa.  If we were to choose an
8-bit character set based on the number of OCaml programmers that
actually need it, I guess ISO-8859-1 (or its newer incarnation with
the Euro sign whose name I can't remember) would still win.  (At least
until we get OCaml in the Chinese curriculum...)

Notice also that Caml doesn't prevent the programmer from putting any
character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
Unicode) in character strings and in comments.  

There are several ways to internationalize further.  One is to support
other 8-bit character sets the POSIX way (the LC_CTYPE stuff).  There
are several problems with this:
- It's not enough for Asian languages.
- The POSIX localization stuff isn't supported under Windows.
- It's badly supported on all Unixes I know (e.g. to get French, I
  need to set LC_CTYPE to different values under Linux, Solaris, and
  Digital Unix; it gets worse for other languages such as Japanese).
- Handling of mixed-language texts is a nightmare.

Unicode / ISO10646 is probably a better approach.  However, it has its
own problems:
- There's 16-bit Unicode and 32-bit Unicode.  Early adopters of that
  technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
  chose 32-bit Unicode.   (That's the great thing about standards:
  there are so many to choose from...)
- Apparently, not everyone agrees on the multi-byte encoding (UTF8) either.
  E.g. Java seems to have its own variant of UTF8.  How are we going
  to interoperate?
- I/O is a nightmare.  The API has to handle at least byte streams,
  wide character streams, and UTF8-encoded streams.
- Support for Unicode / UTF8 files in today's operating systems and GUIs
  is very low.  When will I be able to do "more" on a UTF8 file and see my
  French accented letters? 

My conclusion is that I18N is such a mess that I don't think we'll do
much about it in Caml anytime soon.  Perhaps some basic support for
wide characters and wide character strings will be added at some
point, if only because COM interoperability requires it.

- Xavier Leroy




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-15 20:28 ` Gerd Stolpmann
@ 1999-10-19 18:06   ` skaller
  1999-10-20 21:05     ` Gerd Stolpmann
  0 siblings, 1 reply; 24+ messages in thread
From: skaller @ 1999-10-19 18:06 UTC (permalink / raw)
  To: Gerd.Stolpmann; +Cc: caml-list

Gerd Stolpmann wrote:
> 
> I agree that Unicode or even ISO-10646 support would be a nice thing. I
> also agree that for many users (including myself) the Latin-1 character set
> suffices. 

	Generally, for me, 7 bit ASCII 'suffices'. But that is irrelevant,
the world is bigger than my country, or Europe.

>Luckily, both character sets are strongly related: The first 256
> character positions of Unicode (which is the 16-bit subset of ISO-10646) are
> exactly the same as in Latin1.

	.. of course, this is not luck .. 
 
> UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit
> character is represented by one to six bytes; the higher the character code the
> more bytes are needed. This encoding is mainly interesting for I/O, and not for
> internal processing, because it is impossible to access the characters of a
> string by their position. Internally, you must represent the characters as 16-
> or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is
> wasted (this is the price for the enhanced possibilities). 

	I don't agree. If you read ISO10646 carefully, you will find
that you must STILL parse sequences of code points to obtain the
equivalent of a 'character', if, indeed, such a concept exists, and
furthermore, the sequences are not unique. For example, many diacritic
marks such as accents may be appended to a code point, and act in
conjunction with the preceding code point to represent a character.
This is permitted EVEN if there is a single code point for that
character; and worse, if there are TWO such marks, the order is not
fixed.

	And that's just simple European languages. Now try Arabic or Thai :-)

> This means that we need at least three types of strings: Latin1 strings for
> compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings
> for I/O. For simplicity, I suggest representing both Latin1 and UTF-8 strings by
> the same language type "string", and providing "wchar" and "wstring" for the
> extended character set.

	I'd like to suggest we forget the 'wchar' string, at least
initially. I think you will find UTF-8 encoding requires very few
changes. For example, genuine regular expressions work out of the box.
String searching works out of the box (see the sketch below).
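
Here is a hedged illustration of that second claim: lead bytes and
continuation bytes (10xxxxxx) occupy disjoint ranges, so the UTF-8
bytes of a whole character can only match at character boundaries, and
a naive byte-level search finds exactly the character-level matches
(names and the test string are mine):

	let contains pattern s =
	  let m = String.length pattern and n = String.length s in
	  let rec same i j = j = m || (s.[i+j] = pattern.[j] && same i (j+1)) in
	  let rec from i = i + m <= n && (same i 0 || from (i+1)) in
	  from 0

	let () =
	  (* "\xC3\xA9" is the UTF-8 encoding of e-acute *)
	  assert (contains "\xC3\xA9" "th\xC3\xA9orie")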

	What doesn't work efficiently is indexing. And it is never
necessary to do it for human script. Why would you ever want to, say,
replace the 10th character of a string?? [You could: if you were
analysing, say, a stock code, but in that case the nth byte would do:
it isn't natural language script]

	The way to handle Latin-1, or Big-5, or KSC, or ShiftJis, 
is to translate it with an input filter, or internally, if the client
is reading the codes directly.

> Of course, the current "string" data type is mainly an implementation of byte
> sequences, which is independent of the underlying interpretation. Only the
> following functions seem to introduce a character set:
> 
> - The String.uppercase and .lowercase functions

It is best to get rid of these functions. They belong in a more
sophisticated natural language processing package.

> - The channels, if opened in text mode (because they specially recognize
>   the line endings needed by the operating system, and newline characters
>   are part of the character set)

	This is a serious problem. It is also partly ill-formed:
what is a 'line' in Chinese, which writes characters top down?
What is a 'line' in a Web page :-)
 
> The best solution would be to have internationalized versions of these
> functions (and of perhaps some more functions) which still operate on the
> "string" type but allow the user to select the encoding and the locale.

	There is more. The compiler must be modified to accept
identifiers in the extended character set. This should work out
of the box (since characters with the 8th bit set are already accepted);
in fact, it is too permissive. Secondly, literals need to be
processed, to expand \uXXXX and \UXXXXXXXX escapes.
 
> This means we would have something like
> 
>         type encoding = UTF8 | Latin1 | ...

	Be careful to distinguish CODE SET from ENCODING.
See the Unicode home page for more details: Latin-1 is NOT
an encoding, but a character set. There is a MAPPING from
Latin-1 to ISO-10646. This is not the same thing as an encoding.
I think this is the wrong approach: we do not want built-in
cases for every possible encoding/character set.

	Instead, we want an open ended set of conversions from (and
perhaps to) the internally used representation. There are a LOT of
such combinations, and we need to be able to add new ones without
breaking into a module. We should do it functionally; this should
work well in ocaml :-)

>         type locale = ...
>         val String.i18n_uppercase : encoding -> locale -> string -> string
>         val String.i18n_lowercase : encoding -> locale -> string -> string

	Not in the String module. This belongs in a different package
which handles the complex vagaries of human script. [This particular
function is relatively simple. Another is 'get_digit', 'isdigit'.
Whitespace is much harder. Collation is a nightmare :-]

>         val String.recode : encoding -> encoding -> string -> string
>                 (* changes the encoding if possible *)

	This isn't quite right. The way to do this is to have a function:

	LATIN1_to_ISO10646 code

which does the mapping (from a SINGLE code point in LATIN1 to ISO10646).
The code point is an int. Separately, we handle encodings: 

	UCS4_to_UTF8 code

converts an int to a UTF8 string, and

	UTF8_to_UCS4 string position

parses the string from position, returning a code point and position.
There are other encodings, such as DBCS encodings, which are generally
tied to a single character set. [UCS4/UTF8 are less dependent]
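
A hedged sketch of that parser (minimal error handling; it assumes the
position points at a lead byte, and the lower-case name simply follows
the text above):

	let utf8_to_ucs4 s pos =
	  let b = Char.code s.[pos] in
	  let len =                  (* sequence length, from the lead byte *)
	    if b < 0x80 then 1
	    else if b < 0xE0 then 2
	    else if b < 0xF0 then 3
	    else if b < 0xF8 then 4
	    else if b < 0xFC then 5
	    else 6
	  in
	  let first = if len = 1 then b else b land (0x7F lsr len) in
	  let rec loop acc i =
	    if i = len then acc
	    else loop ((acc lsl 6) lor (Char.code s.[pos + i] land 0x3F)) (i + 1)
	  in
	  (loop first 1, pos + len)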

[....]
> This all means that the number of string functions explodes: 

	Exactly. And we don't want that. So I suggest we continue
to use the existing strings of 8 bit bytes ONLY, represent
ALL foreign [non-ISO10646] character sets using ISO-10646 code points
encoded as UTF-8, and provide an input filter for the compiler.

	In addition, some extra functions to convert other
character sets and encodings to ISO-10646/UTF-8 are provided,
and, if you like, they can be plugged into the I/O system.

	This means a lot of conversion functions, but ONE
internal representation only: the one we already have.	

> We need functions
> for compatibility (Latin1), functions for arbitrary 8 bit encodings, and
> functions for wide strings. I think this is the main argument against it,
> and it is very difficult to get around this. (Any ideas?)

	I've been trying to tell you how to do it. The solution is simple:
adopt ISO-10646 as the SOLE character set, and UTF-8 as the SOLE
encoding of it; and provide conversions from other character sets and
encodings.
All the code that needs to manipulate strings can then be provided NOW
as additional functions manipulating the existing string type.

	The apparent loss of indexing is a mirage. The gain is huge:
ISO-10646 Level 1 compliance without any explosion of data types.
Yes, some more _functions_ are needed to do extra processing,
such as normalisation, comparisons of various kinds,
capitalisation, etc. Regular expressions will need to be
enhanced, to support special features (like case-insensitive searching),
but the basic regular expressions will work out of the box.

> The enlarged character sets become more and more important, and it is only a
> matter of time until every piece of software which wants to be taken seriously
> can process them, even a dumb terminal or simple text editor. So you will be
> able to put accented characters into your comments, and you will see them as
> such even if you 'cat' the program text to the terminal or printer; this will
> work everywhere...

	Yes. That time is not here yet, but soon international support
will be mandatory for all large software purchases by governments and
large corporations.
 
-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-17 14:29 ` localization, internationalization and Caml Xavier Leroy
@ 1999-10-19 18:36   ` skaller
  0 siblings, 0 replies; 24+ messages in thread
From: skaller @ 1999-10-19 18:36 UTC (permalink / raw)
  To: Xavier Leroy; +Cc: caml-list

Xavier Leroy wrote:

> The support for ISO-8859-1 in Caml Light and OCaml is essentially an
> historical and geographical accident.  The first books on Caml were
> written in French, and it was nice to be able to use accented french
> words as identifiers.  Also, that was at a time (1991-1992) where
> Unicode and consorts didn't even exist.

	And supporting ISO-8859-1 was a fine thing to do at the time!
 
> The choice of ISO-8859-1 is not that politically incorrect either: it
> works not only for western Europe, but also for Latin America, many
> Pacific countries, and large parts of Africa.  If we were to choose an
> 8-bit character set based on the number of OCaml programmers that
> actually need it, I guess ISO-8859-1 (or its newer incarnation with
> the Euro sign whose name I can't remember) would still win.  (At least
> until we get OCaml in the Chinese curriculum...)

	While this is true, there is a circularity here: people not
using 8 bit character sets face an extra battle using ocaml.
 
> Notice also that Caml doesn't prevent the programmer from putting any
> character set that includes ASCII (ISO-8859-x, but also UTF8-encoded
> Unicode) in character strings and in comments.

	Yes. This is one of the key points of my argument that UTF-8
is the natural way to go: it provides ISO-10646 compliance without
requiring any new string kind.
 
> There are several ways to internationalize further.  One is to support
> other 8-bit character sets the POSIX way (the LC_CTYPE stuff).  There
> are several problems with this:
> - It's not enough for Asian languages.
> - The POSIX localization stuff isn't supported under Windows.
> - It's badly supported on all Unixes I know (e.g. to get French, I
>   need to set LC_CTYPE to different values under Linux, Solaris, and
>   Digital Unix; it gets worse for other languages such as Japanese).
> - Handling of mixed-language texts is a nightmare.

	If you are suggesting not using C locale stuff -- I agree entirely.

> Unicode / ISO10646 is probably a better approach.  However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode.  Early adopters of that
>   technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix)
>   chose 32-bit Unicode.   (That's the great things about standards:
>   there are so many to choose from...)

	I cannot see the problem -- except for the 16 bit adopters,
who must eventually upgrade .. again.

> - Apparently, not everyone agrees on multi-byte encodings (UTF8) as well.
>   E.g. Java seems to have its own variant of UTF8.  How are we going
>   to interoperate?

	I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO Standard is followed. If Java doesn't
do that, that is Java's problem.

> - I/O is a nightmare.  The API has to handle at least byte streams,
>   wide character streams, and UTF8-encoded streams.

	No, it doesn't. This is a possibility. But it is NOT necessary.
It is necessary only to read byte streams. Conversion can be done
later using strings. This is less efficient, but it is a sensible
starting point (to ignore internationalisation on I/O completely).

> - Support for Unicode / UTF8 files in today's operating systems and GUIs
>   is very low.  When will I be able to do "more" on an UTF8 file and see my
>   French accented letters?

	Yes. I agree. This is a major problem. One of the answers is
"When programming languages provide the support that applications
programmers need" :-)
 
> My conclusion is that I18N is such a mess that I don't think we'll do
> much about it in Caml anytime soon.  

	I agree. The way forward is, I believe:

	a) do not change the I/O system, but deprecate TEXT mode
	   (all I/O should be done in binary)

	b) do not change the String module, but deprecate the
	   upper/lower case functions (and anything else that
	   smacks of relating to natural language)

	c) Provide functions to support internationalisation.

	d) modify the ocaml compiler, to process \uXXXX and \UXXXXXXXX
	   escapes [everywhere]

	e) provide a fast variable length array type

(d) could be done easily using ocamlp4 I think.
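
As a rough sketch of what (d) amounts to as a plain string filter
(names, bounds checks and the restriction to \uXXXX are mine;
\UXXXXXXXX would be handled analogously):

	let expand_unicode_escapes src =
	  let utf8 n =                (* n <= 0xFFFF, so at most 3 bytes *)
	    if n < 0x80 then String.make 1 (Char.chr n)
	    else if n < 0x800 then
	      Printf.sprintf "%c%c"
	        (Char.chr (0xC0 lor (n lsr 6)))
	        (Char.chr (0x80 lor (n land 0x3F)))
	    else
	      Printf.sprintf "%c%c%c"
	        (Char.chr (0xE0 lor (n lsr 12)))
	        (Char.chr (0x80 lor ((n lsr 6) land 0x3F)))
	        (Char.chr (0x80 lor (n land 0x3F)))
	  in
	  let buf = Buffer.create (String.length src) in
	  let n = String.length src in
	  let rec go i =
	    if i >= n then Buffer.contents buf
	    else if i + 5 < n && src.[i] = '\\' && src.[i+1] = 'u' then begin
	      (* \uXXXX: four hex digits become UTF-8 bytes *)
	      Buffer.add_string buf
	        (utf8 (int_of_string ("0x" ^ String.sub src (i+2) 4)));
	      go (i + 6)
	    end else begin
	      Buffer.add_char buf src.[i];
	      go (i + 1)
	    end
	  in
	  go 0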

>Perhaps some basic support for
> wide characters and wide character strings will be added at some
> point, if only because COM interoperability requires it.

	I don't think it is necessary, a variable length
array of integers is good enough.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-19 18:06   ` skaller
@ 1999-10-20 21:05     ` Gerd Stolpmann
  1999-10-21  4:42       ` skaller
  1999-10-21 12:05       ` Matías Giovannini
  0 siblings, 2 replies; 24+ messages in thread
From: Gerd Stolpmann @ 1999-10-20 21:05 UTC (permalink / raw)
  To: skaller; +Cc: caml-list

On Tue, 19 Oct 1999, John Skaller wrote:
>Gerd Stolpmann wrote:
>> UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit
>> character is represented by one to six bytes; the higher the character code the
>> more bytes are needed. This encoding is mainly interesting for I/O, and not for
>> internal processing, because it is impossible to access the characters of a
>> string by their position. Internally, you must represent the characters as 16-
>> or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is
>> wasted (this is the price for the enhanced possibilities). 
>
>	I don't agree. If you read ISO10646 carefully, you will find
>that you must STILL parse sequences of code points to obtain the
>equivalent of a 'character', if, indeed, such a concept exists, and
>furthermore, the sequences are not unique. For example, many diacritic
>marks such as accents may be appended to a code point, and act in
>conjunction with the preceding code point to represent a character.
>This is permitted EVEN if there is a single code point for that
>character; and worse, if there are TWO such marks, the order is not
>fixed.
>
>	And that's just simple European languages. Now try Arabic or Thai :-)

Let's begin with languages we know. As far as I know, ISO10646 does not
require implementing the combining characters. I think a programming language
should only provide the basic means by which you can operate with characters,
and should not solve the problem completely.

>> This means that we need at least three types of strings: Latin1 strings for
>> compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings
>> for I/O. For simplicity, I suggest representing both Latin1 and UTF-8 strings by
>> the same language type "string", and providing "wchar" and "wstring" for the
>> extended character set.
>
>	I'd like to suggest we forget the 'wchar' string, at least
>initially. I think you will find UTF-8 encoding requires very few
>changes. For example, genuine regular expressions work out of the box.
>String searching works out of the box.
>
>	What doesn't work efficiently is indexing. And it is never
>necessary to do it for human script. Why would you ever want to, say,
>replace the 10th character of a string?? [You could: if you were
>analysing, say, a stock code, but in that case the nth byte would do:
>it isn't natural language script]

Because I have an algorithm operating on the characters of a string. Such
algorithms use indexes as pointers to parts of a string, and in most cases the
indexes are only incremented or decremented. On a UTF-8 string, you could
define an index as

type index = { index_position : int; byte_position : int }

and define the operations "increment", "decrement" (only working together with
the string), "add", "subtract", "compare" (to calculate string lengths). Such
indexes have strange properties; they can only be interpreted together with the
string to which they refer.

You cannot really avoid such an index type; you can only avoid giving the
thing a name, programming the index operations anew every time.

Perhaps your suggestion works; but string manipulation will then be much
slower. For example, an "increment" must be implemented by finding the next
beginning of a character (instead of just incrementing a numeric index).
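
For what it is worth, that "increment" is still only a few lines, using
the index type above (a hedged sketch; the boundary test relies on the
10xxxxxx tag of continuation bytes):

	let is_continuation c = Char.code c land 0xC0 = 0x80

	let incr_index s idx =
	  (* skip the continuation bytes of the current character *)
	  let n = String.length s in
	  let rec boundary b =
	    if b >= n || not (is_continuation s.[b]) then b
	    else boundary (b + 1)
	  in
	  { index_position = idx.index_position + 1;
	    byte_position = boundary (idx.byte_position + 1) }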

>> Of course, the current "string" data type is mainly an implementation of byte
>> sequences, which is independent of the underlying interpretation. Only the
>> following functions seem to introduce a character set:
>> 
>> - The String.uppercase and .lowercase functions
>
>It is best to get rid of these functions. They belong in a more
>sophisticated natural language processing package.

There will always be a difference between natural languages and sophisticated
packages. Even the current String.uppercase is wrong (in Latin1 there is a
lower case character without a corresponding capital character (\223, the
German ß), but WORDS containing this character can be capitalized by applying
a semantic rule).

I would suppose that String.upper/lowercase are part of the library because the
compiler itself needs them. Currently, ocaml depends on languages that make a
distinction between character cases.

In my opinion such case functions can only approximate the semantic meaning,
and a simple approximation is better than no approximation.

>> - The channels, if opened in text mode (because they specially recognize
>>   the line endings needed by the operating system, and newline characters
>>   are part of the character set)
>
>	This is a serious problem. It is also partly ill-formed:
>what is a 'line' in Chinese, which writes characters top down?

But lines exist. For example, your message is divided into lines. The concept
of lines is too important to be dropped although it is simple (much of the
success has to do with its simplicity). Other writing traditions also have a
writing direction.

>What is a 'line' in a Web page :-)

What is a 'line' in the sky?

>> This means we would have something like
>> 
>>         type encoding = UTF8 | Latin1 | ...
>
>	Be careful to distinguish CODE SET from ENCODING.
>See the Unicode home page for more details: Latin-1 is NOT
>an encoding, but a character set. There is a MAPPING from
>Latin-1 to ISO-10646. This is not the same thing as an encoding.
>I think this is the wrong approach: we do not want built-in
>cases for every possible encoding/character set.

Character sets and encodings are both artificial concepts. When I program, I
always deal with a combination of both. The distinction is irrelevant for
most applications; it is important if you want to convert texts from (cs1,enc1)
to (cs2,enc2), because conversion is not always possible.

My idea is that the type "encoding" enumerates all supported combinations; I
expect only a few.

>	Instead, we want an open ended set of conversions from (and
>perhaps to) the internally used representation. There are a LOT of
>such combinations, and we need to be able to add new ones without
>breaking into a module. We should do it functionally; this should
>work well in ocaml :-)

What kind of problem do you want to solve with an open ended set of
conversions? Isn't this the task of a specialized program?

>>         type locale = ...
>>         val String.i18n_uppercase : encoding -> locale -> string -> string
>>         val String.i18n_lowercase : encoding -> locale -> string -> string
>
>	Not in the String module. This belongs in a different package
>which handles the complex vagaries of human script. [This particular
>function is relatively simple. 

See above.

>Another is 'get_digit', 'isdigit'.
>Whitespace is much harder. Collation is a nightmare :-]

I think collation should be left out of a basic library. Even for a single
language, there are often several traditions of how to sort, and it also depends
on the kind of strings you are sorting (for example, think of personal names).
Members of those traditions can contribute special modules for collation.


>
>>         val String.recode : encoding -> encoding -> string -> string
>>                 (* changes the encoding if possible *)
>
>	This isn't quite right. The way to do this is to have a function:
>
>	LATIN1_to_ISO10646 code
>
>which does the mapping (from a SINGLE code point in LATIN1 to ISO10646).
>The code point is an int. Separately, we handle encodings: 
>
>	UCS4_to_UTF8 code
>
>converts an int to a UTF8 string, and
>
>	UTF8_to_UCS4 string position
>
>parses the string from position, returning a code point and position.
>There are other encodings, such as DBCS encodings, which are generally
>tied to a single character set. [UCS4/UTF8 are less dependent]
>

The most correct interface is not always the best.

>[....]
>> This all means that the number of string functions explodes: 
>
>	Exactly. And we don't want that. So I suggest we continue
>to use the existing strings of 8 bit bytes ONLY, represent
>ALL foreign [non-ISO10646] character sets using ISO-10646 code points
>encoded as UTF-8, and provide an input filter for the compiler.
>
>	In addition, some extra functions to convert other
>character sets and encodings to ISO-10646/UTF-8 are provided,
>and, if you like, they can be plugged into the I/O system.
>
>	This means a lot of conversion functions, but ONE
>internal representation only: the one we already have.	

There will be a significant slow-down of all ocaml programs if the strings are
encoded as UTF-8. I think the user of a language should be able to choose what
is more important: time or space or reduced functionality. UTF-8 saves space,
and costs time; UCS-4 wastes space, but saves time; UCS-2 is a compromise
and bad because it is a compromise; Latin 1 (or another 8 bit cs) saves
time and space but has less functionality. 

>> We need functions
>> for compatibility (Latin1), functions for arbitrary 8 bit encodings, and
>> functions for wide strings. I think this is the main argument against it,
>> and it is very difficult to get around this. (Any ideas?)
>
>	I've been trying to tell you how to do it. The solution is simple:
>adopt ISO-10646 as the SOLE character set, and UTF-8 as the SOLE
>encoding of it; and provide conversions from other character sets and
>encodings.

It looks simple but I suppose it is not what the ocaml users want.

>All the code that needs to manipulate strings can then be provided NOW
>as additional functions manipulating the existing string type.

And because compatibility is lost, the whole current code base has to be worked
through.

>	The apparent loss of indexing is a mirage. The gain is huge:
>ISO-10646 Level 1 compliance without any explosion of data types.
>Yes, some more _functions_ are needed to do extra processing,
>such as normalisation, comparisons of various kinds,
>capitalisation, etc. Regular expressions will need to be
>enhanced, to support special features (like case-insensitive searching),
>but the basic regular expressions will work out of the box.
>
>> The enlarged character sets become more and more important, and it is only a
>> matter of time until every piece of software which wants to be taken seriously
>> can process them, even a dumb terminal or simple text editor. So you will be
>> able to put accented characters into your comments, and you will see them as
>> such even if you 'cat' the program text to the terminal or printer; this will
>> work everywhere...
>
>	Yes. That time is not here yet, but soon international support
>will be mandatory for all large software purchases by governments and
>large corporations.

I do not believe that this will be the driving force because the current
solutions exist, and it is VERY expensive to replace them. It is even cheaper
to replace a language than a character set/encoding. It looks like another
Year 2000, but without a deadline.

The first field where some progress will be made is data exchange, because
ISO10646 can bridge several character sets. At that time, tools will be
available to view and edit such data, and of course to convert them; ISO10646
will be used in parallel with the "traditional" character set. These tools will
be low-level, and perhaps operating systems will then support the tools with
fonts, input methods, and conventions for indicating the encoding. (The
current environment-variables solution is a pain if you try to use two
encodings in parallel. For example, I can imagine Unix terminal drivers
allowing the encoding to be selected directly, in the same way as you can
set other terminal
properties.)

In contrast to this, many applications need not be replaced and won't be.
Perhaps they will have an ISO10646 import/export filter.

--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   Gerd.Stolpmann@darmstadt.netsurf.de (privat)
Germany                     
----------------------------------------------------------------------------




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-20 21:05     ` Gerd Stolpmann
@ 1999-10-21  4:42       ` skaller
  1999-10-21 12:05       ` Matías Giovannini
  1 sibling, 0 replies; 24+ messages in thread
From: skaller @ 1999-10-21  4:42 UTC (permalink / raw)
  To: Gerd.Stolpmann; +Cc: caml-list

Gerd Stolpmann wrote: 
> On Tue, 19 Oct 1999, John Skaller wrote:
> >
> >       I don't agree. If you read ISO10646 carefully, you will find
> >that you must STILL parse sequences of code points

> Let's begin with languages we know. As far as I know, ISO10646 does not
> require implementing the combining characters. 

	From memory, you are correct: there are three specified levels of
compliance. Level 1 compliance does not require processing combining
characters.

> I think a programming language should only
> provide the basic means by which you can operate with characters,
> and should not solve the problem completely.

	Yes, I agree, at least at this time. 

> >       What doesn't work efficiently is indexing.
> >And it is never necessary to do it for human script.
> >Why would you ever want to, say, replace the 10th character
> >of a string?? 
> 
> Because I have an algorithm operating on the characters of a string. 

	If the string represents human script, it is then wrong
because it makes incorrect assumptions about the nature of human script.
You will need to rewrite it, if you want it to work in an international
setting.

> Such
> algorithms use indexes as pointers to parts of a string, and in most cases the
> indexes are only incremented or decremented. On a UTF-8 string, you could
> define an index as
> 
> type index = { index_position : int; byte_position : int }
> 
> and define the operations "increment", "decrement" (only working together with
> the string), "add", "subtract", "compare" (to calculate string lengths). Such
> indexes have strange properties; they can only be interpreted together with the
> string to which they refer.
> 
> You cannot really avoid such an index type; you can only avoid giving the
> thing a name, programming the index operations anew every time.

	I agree. But my point is: you should change your code _anyhow_,
to use the new and correct parsing method, because it is necessary
for Level 2 and Level 3 compliance. Your code will then work
correctly at those levels when the 'increment' function is upgraded.

	What you will find is something which by chance, perhaps, is
natural in Python: there is no such thing as a character. A string
is NOT an array of characters. Strings can be composed from strings,
and decomposed into arrays of strings, but there is not really any
character type.
 
> Perhaps your suggestion works; but string manipulation will then be much
> slower. For example, an "increment" must be implemented by finding the next
> beginning of a character (instead of just incrementing a numeric index).

	Yes, but this is a fact: it is actually required for correct
processing of human script. You cannot 'magic' away the facts.

	What you can do, is, if you are programming with a known
subset, such as the characters for a stock code, then you can
use indexing anyhow, perhaps with the ASCII subset. That is,
you can use the byte strings as character strings.

> There will always be a difference between natural languages and sophisticated
> packages. 

	Yes. However, there is an important point here. Natural languages
are quirky and their behaviour is variant: each human uses language
differently in each sentence, varying with region, context .. etc.
Obviously, computer systems only use some abstracted representation.
While there are many levels and ways of abstracting this, there is one
that is worthy of special interest here: the ISO10646 Standard.

	So I guess my suggestion is that in the _standard_ language
libraries we will eventually need to implement the algorithms required
for compliance with that Standard. In my opinion, that naturally breaks
into two parts:

	a) (byte/word) string management: this is an issue of storage
	   allocation and manipulation, not natural language processing

	b) basic natural language processing

>Even the current String.uppercase is wrong (in Latin1 there is a
> lower case character without a corresponding capital character (\223, the
> German ß), but WORDS containing this character can be capitalized by applying
> a semantic rule).
> 
> I would suppose that String.upper/lowercase are part of the library because the
> compiler itself needs them. Currently, ocaml depends on languages that make a
> distinction between character cases.

	AH! you are right!
 
> In my opinion such case functions can only approximate the semantic meaning,
> and a simple approximation is better than no approximation.

	No. That is, I agree entirely, but make a different point:
an arbitrary simple approximation is worthless; the one that is useful
is the ISO Standardised one.

> My idea is that the type "encoding" enumerates all supported combinations; I
> expect only a few.

	Please no. Leave the type open to external augmentation.
Just consider: my Interscript literate programming tool ALREADY supports
something like 30 "encodings" -- all those present on the unicode.org
website. Your 'type' is already a joke. I already support a lot more
encodings than that.

> What kind of problem do you want to solve with an open ended set of
> conversions? Isn't this the task of a specialized program?

	No. It allows a generalised ISO10646 compliant program
to read and perhaps write any file encoded in any supported encoding,
but manipulate it internally in one format. If there is an encoding
that is missed, it is easy to add a new pair of conversion functions,
without breaking the standard library.

	That is, it is the task of specialised _functions_.
It makes sense to provide some as standard like the ones your
type suggests -- but not represent the cases with a type.
Ocaml variants are not amenable to extension. Function parameters
are.

	That is, I think there are exactly two cases (see the sketch below):

	a) no conversion required
	b) user supplied conversion function
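
A hedged sketch of case (b): a decoder is just a function from a string
and a byte position to a code point and the next position, so new
encodings plug in without touching any module (all names are mine):

	type decoder = string -> int -> int * int

	(* fold a decoder over a whole string, yielding code points *)
	let decode_all (decode : decoder) s =
	  let n = String.length s in
	  let rec go pos acc =
	    if pos >= n then List.rev acc
	    else
	      let cp, pos' = decode s pos in
	      go pos' (cp :: acc)
	  in
	  go 0 []

	(* case (a), no conversion: each byte is its own code point *)
	let byte_decoder : decoder = fun s pos -> (Char.code s.[pos], pos + 1)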
 
> I think collation should be left out of a basic library. 

	Probably right. Level 1 compliance is a good start,
and does not require collation.

> The most correct interface is not always the best.

	What do you mean 'most correct'?
Either the interface supports the (ISO10646) required behaviour or not.

> There will be a significant slow-down of all ocaml programs if the strings are
> encoded as UTF-8. 

	No. On the contrary, most existing programs will be unaffected.
Those which actually care about internationalisation can only be made
faster (by providing native support).


>I think the user of a language should be able to choose what
> is more important: time or space or reduced functionality. UTF-8 saves space,
> and costs time; UCS-4 wastes space, but saves time; UCS-2 is a compromise
> and bad because it is a compromise; Latin 1 (or another 8 bit cs) saves
> time and space but has less functionality.

	Sure, but this leads to multiple interfaces. Was that not
the original problem?
 
	Let me put the argument for UTF-8 differently.
Processing UTF-8 'as is' is non-trivial and should be done
in low level system functions for speed.

	Processing arrays of 31 bit integers is _already_ well
supported in ocaml, and will be better supported by adding
variable length arrays with functions that are designed with
some view of use for string processing.

	So we don't actually need a wide character string type
or supporting functions, precisely because in the simplest
cases a standard data type not really specialised to script
processing will do the job.

	What is actually required (in both cases) are some
'data tables' to support things like case mapping. For example,
a function

	convert_to_upper i

which takes an ocaml integer argument would be useful,
and it is easy enough to 'map' this over an array.
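
A hedged sketch of such a table-driven function (the few entries shown
are only examples; a real table would be generated from the Unicode
data files):

	let upper_table : (int, int) Hashtbl.t = Hashtbl.create 1024

	let () =
	  for c = Char.code 'a' to Char.code 'z' do
	    Hashtbl.add upper_table c (c - 32)      (* a..z -> A..Z *)
	  done;
	  Hashtbl.add upper_table 0x00E9 0x00C9;    (* e-acute *)
	  Hashtbl.add upper_table 0x03B1 0x0391     (* Greek alpha *)

	let convert_to_upper i =
	  try Hashtbl.find upper_table i with Not_found -> i

	(* mapping over a UCS-4 string held as an int array *)
	let uppercase_ucs4 = Array.map convert_to_upper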

Sigh. See next post. I will post my code, so it can
be torn up by experts. 

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-20 21:05     ` Gerd Stolpmann
  1999-10-21  4:42       ` skaller
@ 1999-10-21 12:05       ` Matías Giovannini
  1999-10-21 15:35         ` skaller
  1999-10-26  4:36         ` Go for ultimate localization! Benoit Deboursetty
  1 sibling, 2 replies; 24+ messages in thread
From: Matías Giovannini @ 1999-10-21 12:05 UTC (permalink / raw)
  To: caml-list; +Cc: Gerd.Stolpmann, skaller

Gerd Stolpmann wrote:
> 
> On Tue, 19 Oct 1999, John Skaller wrote:
> >Gerd Stolpmann wrote:
> >> The enlarged character sets become more and more important, and it is only a
> >> matter of time until every piece of software which wants to be taken seriously
> >> can process them, even a dumb terminal or simple text editor. So you will be
> >> able to put accented characters into your comments, and you will see them as
> >> such even if you 'cat' the program text to the terminal or printer; this will
> >> work everywhere...
> >
> >       Yes. That time is not here yet, but soon international support
> >will be mandatory for all large software purchases by governments and
> >large corporations.
> 
> I do not believe that this will be the driving force because the current
> solutions exist, and it is VERY expensive to replace them. It is even cheaper
> to replace a language than a character set/encoding. It looks like another
> Year 2000, but without a deadline.

I still don't understand the point of this discussion. As a MacOS
programmer of many years, I tend to view localization and
internationalization as tasks best performed by the operating system, or
at least by pluggable modules. This discussion of patching l10n and i18n
functions *into* OCaml is, to me at least, losing direction.

OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll
agree that my view is chauvinistic (and selfish, perhaps: I already have
"¿¡áéíóúüñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?),
I see no restriction in that (well, if I were Chinese, or Egyptian, I
would see things differently). What's more, the whole syntactic
apparatus of a programming language *assumes* a Latin setting, where
things make sense when read from left to right, from top to bottom; and
where punctuation is what we're used to. Programming languages suited
for a Han, or Arab, or even a Hebrew audience would have to be rethought
from the ground up.

On the other hand, OCaml provides a String type that *can be* seen as a
variable-length sequence of uninterpreted bytes. We have uninterpreted
bytes! It's all we need to build whatever I18NString type we may need.
What is missing is *library* facilities to abstract that view into a
full-fledged i18n machinery. Of course, there's a problem with the
manipulation of 32-bit integer values, but if used with care, the Nat
datatype could serve perfectly well as the underlying, low-level datatype.

Which makes me think, John, you already have variable-length int arrays.
Nat's are as unsafe as they get :-)

Regards,
Matías.

-- 
I got your message. I couldn't read it. It was a cryptogram.
-- Laurie Anderson




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-21 12:05       ` Matías Giovannini
@ 1999-10-21 15:35         ` skaller
  1999-10-21 16:27           ` Matías Giovannini
  1999-10-26  4:36         ` Go for ultimate localization! Benoit Deboursetty
  1 sibling, 1 reply; 24+ messages in thread
From: skaller @ 1999-10-21 15:35 UTC (permalink / raw)
  To: matias; +Cc: caml-list, Gerd.Stolpmann

Matías Giovannini wrote:

> OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll
> agree that my view is chauvinistic (and selfish, perhaps: I already have
> "¿¡áéíóúuñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?),
> I see no restriction in that (well, If I were Chinese, or Egiptian, I
> would see things differently). 

	Exactly. There are quite a lot of Chinese, Indian,
Russian ... and other non-Latin people in the world: more than Latins.
And many face a barrier to participating in the computing world
because of language problems.

>What's more, the whole syntactic
> apparatus of a programming language *assumes* a Latin setting, where
> things make sense when read from left to right, from top to bottom; and
> where punctuation is what we're used to. Programming languages suited
> for a Han, or Arab, or even a Hebrew audience would have to be rethought
> from the ground up.

	Actually, no. Most of these peoples learn English and learn
computing, if they are to work with computers. But they still wish
to use comments, strings, and identifiers in their native script.

	Have you ever seen a Japanese program? I have.
Quite an interesting challenge: normal C/C++ code, with 
Latin characters encoding Japanese character names in identifiers,
and actual Japanese characters in comments and strings.
 
	I had no idea what the code did. My point: for a non-native
speaker, being forced to use a foreign language for identifiers and
comments is a serious impediment; not having native characters
in strings is not merely an impediment, but a complete disaster (how will
the users of the program understand it -- they may not know any
Latin language?)

> On the other hand, OCaml provides a String type that *can be* seen as a
> variable-length sequence of uninterpreted bytes. 

	Yes. What ocaml does not provide is a way of encoding
extended characters -- \uXXXX \UXXXXXXXX in strings, or in identifiers.

>We have uninterpreted
> bytes! It's all we need to build whatever I18NString type we may need.
> What is missing is *library* facilities to abstract that view into a
> full-fledged i18n machinery. 

	I agree.

>Of course, there's a problem with the
> manipulation of 32-bit integer values, but if used with care, the Nat
> datatype could serve perfectly well as the underlying, low-level datatype.
> 
> Which makes me think, John, you already have variable-length int arrays.

	But they're not standard (yet). Actually, ocaml 'int' is 31 bits,
which is enough bits for ISO10646 (with some careful fiddling to avoid
problems with the sign?).

	So there are TWO issues -- one is to make ocaml itself
ISO10646 aware (i.e., the compiler), and the other is to provide
users with libraries to manipulate extended characters.

	Please note: neither of these features would be optional,
were ocaml to be submitted for ISO standardisation. ISO directives
require all ISO languages to upgrade to provide international
support. I know ocaml isn't an ISO language, but I think the 
basic intent is sound. [In some sense, ocaml is already a leader,
accepting Latin-1 characters when other languages only allowed ASCII]

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: localization, internationalization and Caml
  1999-10-21 15:35         ` skaller
@ 1999-10-21 16:27           ` Matías Giovannini
  1999-10-21 16:36             ` skaller
  0 siblings, 1 reply; 24+ messages in thread
From: Matías Giovannini @ 1999-10-21 16:27 UTC (permalink / raw)
  To: caml-list; +Cc: skaller

skaller wrote:
> 
> Matías Giovannini wrote:
> >What's more, the whole syntactic
> > apparatus of a programming language *assumes* a Latin setting, where
> > things make sense when read from left to right, from top to bottom; and
> > where punctuation is what we're used to. Programming languages suited
> > for a Han, or Arab, or even a Hebrew audience would have to be rethought
> > from the ground up.
> 
>         Actually, no. Most of these peoples learn English and learn
> computing, if they are to work with computers. But they still wish
> to use comments, strings, and identifiers in their native script.

Strings can be localized with a package mechanism, à la Java. I don't
like hardwired strings in code, they're a maintenance nightmare (not
that I always abide by my own rule :-)

>         Have you ever seen a Japanese program? I have.
> Quite an interesting challenge: normal C/C++ code, with
> Latin characters encoding Japanese character names in identifiers,
> and actual Japanese characters in comments and strings.

I agree that comments should be written in the language most suited to
the intended audience (I normally comment my code in English, unless I
know I want someone else to maintain it, in which case I comment it in Spanish.)

> > On the other hand, OCaml provides a String type that *can be* seen as a
> > variable-length sequence of uninterpreted bytes.
> 
>         Yes. What ocaml does not provide is a way of encoding
> extended characters -- \uXXXX \UXXXXXXXX in strings, or in identifiers.

No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude.

> >Of course, there's a problem with the
> > manipulation of 32-bit integer values, but if used with care, the Nat
> > datatype could serve perfectly well as the underlying, low-level datatype.
> >
> > Which makes me think, John, you already have variable-length int arrays.
> 
>         But they're not standard (yet).

They are! Don't be put off by its status as "experimental feature".
Nat's been around since CamlLight. You could even use it as a template
implementation of unsafe longint varlen arrays and link a custom
toplevel. Yet again, OCaml provides the tools.

>         So there are TWO issues -- one is to make ocaml itself
> ISO10646 aware (i.e., the compiler), and the other is to provide
> users with libraries to manipulate extended characters.

I think a more realistic goal would be making OCaml ISO10646-tolerant in
comments. Perhaps adding real conditional compilation and transparent
comments would suffice.

Again, anyone can download the source code and modify OCaml to suit his
tastes. OCaml's goal is not to be a model of i18n awareness, but a
platform for experimenting with types in a functional setting. It
happens that OCaml is open enough, and extensible enough, and efficient
enough to make a good i18n effort possible, and that is a tribute to its
success as a strongly-typed, imperative, fast functional language.

>         Please note: neither of these features would be optional,
> were ocaml to be submitted for ISO standardisation. ISO directives
> require all ISO languages to upgrade to provide international
> support. I know ocaml isn't an ISO language, but I think the
> basic intent is sound. [In some sense, ocaml is already a leader,
> accepting Latin-1 characters when other languages only allowed ASCII]

The implementors have made clear on more than one occasion that they're
not interested in making OCaml a standard language (remember the thread
"How to convince management?"). But don't take my word for it, ask Pierre.


-- 
I got your message. I couldn't read it. It was a cryptogram.
-- Laurie Anderson





* Re: localization, internationalization and Caml
  1999-10-21 16:27           ` Matías Giovannini
@ 1999-10-21 16:36             ` skaller
  1999-10-21 17:21               ` Matías Giovannini
                                 ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: skaller @ 1999-10-21 16:36 UTC (permalink / raw)
  To: matias; +Cc: caml-list

Matías Giovannini wrote:
 
> Strings can be localized with a package mechanism, à la Java. I don't
> like hardwired strings in code, they're a maintenance nightmare (not
> that I always abide by my own rule :-)

	It doesn't matter what you like (or what I like).

> >         Have you ever seen a Japanese program? I have.
> >         Yes. What ocaml does not provide is a way of encoding
> > extended characters -- \uXXXX \UXXXXXXXXX in strings, or in identifiers.
> 
> No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude.

	Irrelevant. The \u \U escapes are ISO recommended, used in
C and C++, and must be supported.
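
	For the record, expanding such an escape into bytes is only a few
lines of ocaml. A sketch for the \uXXXX range (code points up to 0xFFFF),
using the Bytes module of later OCaml releases:

let utf8_of_code_point n =
  (* build the 1-, 2- or 3-byte UTF-8 sequence for a code point <= 0xFFFF *)
  if n < 0x80 then String.make 1 (Char.chr n)
  else if n < 0x800 then begin
    let b = Bytes.create 2 in
    Bytes.set b 0 (Char.chr (0xC0 lor (n lsr 6)));
    Bytes.set b 1 (Char.chr (0x80 lor (n land 0x3F)));
    Bytes.to_string b
  end else begin
    let b = Bytes.create 3 in
    Bytes.set b 0 (Char.chr (0xE0 lor (n lsr 12)));
    Bytes.set b 1 (Char.chr (0x80 lor ((n lsr 6) land 0x3F)));
    Bytes.set b 2 (Char.chr (0x80 lor (n land 0x3F)));
    Bytes.to_string b
  end

(* utf8_of_code_point 0xE9 gives "\xC3\xA9", the UTF-8 bytes for é *)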
 
> > > Which makes me think, John, you already have variable-length int arrays.
> >
> >         But they're not standard (yet).
> 
> They are! Don't be put off by its status as "experimental feature".
> Nat's been around since CamlLight. 

	Oh, I must have misunderstood your comment: Nat is standard --
I'm using it in Viper -- but 'a Varray, a variable-length array of 'a,
is not.

> Again, anyone can download the source code and modify OCaml to suit his
> tastes. OCaml's goal is not to be a model of i18n awareness, but a
> platform for experimenting with types in a functional setting. 

	Ocaml is a tool, it doesn't have a goal. :-)
Humans have goals. The problem is that the designers of ocaml
have been too successful: ocaml is so good that other people now
want to use it, and _their_ goals are important too.

>It
> happens that OCaml is open enough, and extensible enough, and efficient
> enough to make a good i18n effort possible, and that is a tribute to its
> success as a strongly-typed, imperative, fast functional language.

	I agree. It could easily become a leader in this field,
since implementing complex stuff is relatively easy in ocaml :-)
 
> The implementors have made clear on more than one occasion that they're
> not interested in making OCaml a standard language (remember the thread
> "How to convince management?"). But don't take my word for it, ask Pierre.

	My point was simply that the ISO internationalisation requirements
are not unreasonable, and that other languages will be doing this work,
some because they have to, and some because they want to stay part of
the real world -- and encourage non-English (whoops, I mean non-Latin :-)
clients, who, after all, may well make significant contributions.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller





* Re: localization, internationalization and Caml
  1999-10-21 16:36             ` skaller
@ 1999-10-21 17:21               ` Matías Giovannini
  1999-10-23  9:53               ` Benoit Deboursetty
  1999-10-25  0:54               ` How to format a float? skaller
  2 siblings, 0 replies; 24+ messages in thread
From: Matías Giovannini @ 1999-10-21 17:21 UTC (permalink / raw)
  To: caml-list; +Cc: skaller

skaller wrote:
> 
> Matías Giovannini wrote:
> 
> > Strings can be localized with a package mechanism, à la Java. I don't
> > like hardwired strings in code, they're a maintenance nightmare (not
> > that I always abide by my own rule :-)
> 
>         It doesn't matter what you like (or what I like).

It doesn't. My point is that the functionality for localized strings can
be had, only through an indirect route such as "string packages".

As an aside, let's keep the tone light, ok?

> 
> > >         Have you ever seen a Japanese program? I have.
> > >         Yes. What ocaml does not provide is a way of encoding
> > > extended characters -- \uXXXX \UXXXXXXXXX in strings, or in identifiers.
> >
> > No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude.
> 
>         Irrelevant. The \u \U escapes are ISO recommended, used in
> C and C++, and must be supported.

Well, OCaml is *not* ISO-recommended, is *not* C, and it is certainly
*not* C++. Let's learn to live with languages other than the ISO-mandated,
ISO-validated, ISO-standardized and whatnot.

In fact, now that I think of it, standardization is driven by market
pressure. If OCaml were a commercial product, I guess things would be
different. But it's not (thank Pete), see below.

> > > > Which makes me think, John, you already have variable-length int arrays.
> > >
> > >         But they're not standard (yet).
> >
> > They are! Don't be put off by its status as "experimental feature".
> > Nat's been around since CamlLight.
> 
>         Oh, I must have misunderstood your comment: Nat is standard,
> I'm using it in Viper, but 'a Varray -- a variable length array of 'a,
> is not.

And it's not going to be, unless someone comes up with a sound typing
strategy *and* an efficient implementation for them.

> 
> > Again, anyone can download the source code and modify OCaml to suit his
> > tastes. OCaml's goal is not to be a model of i18n awareness, but a
> > platform for experimenting with types in a functional setting.
> 
>         Ocaml is a tool, it doesn't have a goal. :-)
> Humans have goals. The problem is that the designers of ocaml
> have been too successful: ocaml is so good that other people now
> want to use it, and _their_ goals are important too.

Let me restate it: OCaml is the intellectual property of INRIA,
developed under a specific project (Projet Cristal, if I remember
correctly) with very definite goals. The project *has* goals; anything
outside those goals is a gift (what is more, everything falling *within*
those goals already *is* a gift), and must be accepted as such. If
INRIA decides that, since OCaml is useful to many, many people around the
world, it wants to make it one of its goals to turn OCaml into a platform
for experimenting with the implementation of programming languages with
strong i18n support, well, bring out the champagne. In the meantime, we'll
have to build upon what's there.

Suppose the following scenario: INRIA decides that the MacOS platform is
not nearly significant enough to justify the porting effort, and so it
is dropped. What should I do? Plead, certainly, until I'm told "don't
whine, there's nothing we can do". What would be my options? Use a
Wintel box, or make my own port.

This scenario is not unrealistic: there's no native compiler under
MacOS, and there won't be until someone ports it. I can't do it, the
implementors can't do it, and such is life.

> >It
> > happens that OCaml is open enough, and extensible enough, and efficient
> > enough to make a good i18n effort possible, and that is a tribute to its
> > success as a strongly-typed, imperative, fast functional language.
> 
>         I agree. It could easily become a leader in this field,
> since implementing complex stuff is relatively easy in ocaml :-)
> 
> > The implementors have made clear on more than one occasion that they're
> > not interested in making OCaml a standard language (remember the thread
> > "How to convince management?"). But don't take my word for it, ask Pierre.
> 
>         My point was simply that the ISO internationalisation requirements
> are not unreasonable, and that other languages will be doing this work,
> some because they have to, and some because they want to stay part of
> the real world -- and encourage non-English (whoops, I mean non-Latin :-)
> clients, who, after all, may well make significant contributions.

Hm. I see your point. I don't necessarily agree, though.

-- 
I got your message. I couldn't read it. It was a cryptogram.
-- Laurie Anderson





* Re: localization, internationalization and Caml
  1999-10-21 16:36             ` skaller
  1999-10-21 17:21               ` Matías Giovannini
@ 1999-10-23  9:53               ` Benoit Deboursetty
  1999-10-25 21:06                 ` Jan Skibinski
  1999-10-26 18:02                 ` skaller
  1999-10-25  0:54               ` How to format a float? skaller
  2 siblings, 2 replies; 24+ messages in thread
From: Benoit Deboursetty @ 1999-10-23  9:53 UTC (permalink / raw)
  To: caml-list

This message just wants to raise a paradoxical point in this discussion
[though it may have already been raised?]. It seems to me that allowing
foreign characters to be used in a computer language, as identifiers or
comments, would reduce the exchange of contributions worldwide.

Here is my personal experience: I have used caml and ocaml for more than 2
years now. From the beginning, it seemed really cool to be able to
have identifiers in French, with accents and everything. So I got into the
habit of using French in my programs.

Now, I'm writing a more substantial program, which could become a small
international "open project" -- *except* that I find myself with a program
in French, and it's not so easy to find qualified programming partners who
understand French. The range of people who could help with my program is
terribly limited.

You should understand that I sometimes feel I should have written it in
English.

I must however acknowledge that [o']caml's ability to cope with latin1
characters is above all useful for educational purposes. Let me explain...
Perhaps it is a French thing, but in this country it sounds quite snobbish
for a French speaker to embed English words in a sentence with the right
accent and stress. Hence, almost every computer science teacher puts on an
exaggerated French accent to pronounce English words ("la fonction
'rimouve'"). [I shall not disclose the names of my teachers in CaML :) ]

So, for educational purposes, it is much better if the teachers can have
French identifiers ("la fonction 'enlève'"). Much easier to pronounce,
isn't it? I suppose it is the same in many other countries. (I think
especially of Japan: "biko-zu ingurisshu izu ha-do tsu puronaonsu foa
japani-zu pi-poru tsu-".)

My point remains: encouraging people to write code in their own language
would reduce the possibilities of exchanging their work. This does not mean,
though, that I will translate the program I've written into English. I
consider it a sort of tribute to the preservation of the diversity of
languages, at my own humble scale... and I will write enough programs
in English when I work for a company, too.

Benoît de Boursetty
Benoit.de-Boursetty@polytechnique.org





* How to format a float?
  1999-10-21 16:36             ` skaller
  1999-10-21 17:21               ` Matías Giovannini
  1999-10-23  9:53               ` Benoit Deboursetty
@ 1999-10-25  0:54               ` skaller
  1999-10-26  0:53                 ` Michel Quercia
  2 siblings, 1 reply; 24+ messages in thread
From: skaller @ 1999-10-25  0:54 UTC (permalink / raw)
  To: caml-list

How do I format a floating point number 
correctly in ocaml, with a given _variable_
width and precision?

I can't find a routine in the library for this.
Shouldn't there be one? Did I miss it?

--
sprintf is not suitable: the format string
must be a literal. string_of_float takes no
width or precision arguments, and it gives the
wrong format if the number happens to be integral.
I need to write a 'printf'-like routine.

It's relatively easy to write an integer formatting
routine, but floats are somewhat more difficult.
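
The integer case is little more than padding -- a sketch:

let format_int width n =
  let s = string_of_int n in
  String.make (max 0 (width - String.length s)) ' ' ^ s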

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller





* Re: localization, internationalization and Caml
  1999-10-23  9:53               ` Benoit Deboursetty
@ 1999-10-25 21:06                 ` Jan Skibinski
  1999-10-26 18:02                 ` skaller
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Skibinski @ 1999-10-25 21:06 UTC (permalink / raw)
  To: Benoit Deboursetty; +Cc: caml-list



On Sat, 23 Oct 1999, Benoit Deboursetty wrote:

> This message just wants to raise a paradoxical point in this discussion
> [though it may have already been raised?]. It seems to me that allowing
> foreign characters to be used in a computer language, as identifiers or
> comments, would reduce the exchange of contributions worldwide.

	Yes, but it is nice to have error messages, prompts, etc.
	expressed in the native language of the program's user. And the
	ability to process text in the native language is also quite
	often desirable.

	I have been reading this thread for some time, and I've
	seen plenty of references to Latin1 and plenty of different
	attitudes toward its usefulness (or not). Let me add my two
	cents here.

	Demanding support for diacritical marks is often not a matter
	of snobbery or language purism. I cannot speak for
	other languages that use the Latin alphabet, but I can tell you
	what a mess it is with Polish (which has 8 diacritical marks),
	and, I suppose, with other languages, such as Hungarian, etc.,
	that have been assigned to Latin2. Someone made that
	decision some time ago, and now we pay the price, since Latin1
	seems to be seen by some as some sort of improvement over
	plain ascii.

	I am not whining here, because I can get along quite well with plain
	ascii in my email, etc., and I can even cope with all sorts
	of email that arrive here formatted as either Latin1 or Latin2.
	But even so, I sometimes find myself cornered by plain
	ascii when the meaning of a sentence suddenly becomes funny,
	berserk, or senseless. One example to illustrate the point.

	1. z<.>a<;>danie - "a strong request". This is what I want to use

	2. zadanie       - "a problem to solve, or a goal". Wrong!
			   This is what I get from plain ascii

	3. rzadanie      - When pronounced it does not sound quite like
	                   "a request", but an intelligent recipient
			   can guess my intention. But they might
			   as well consider me illiterate; Polish 
		           has two alternative spellings of the
	                   same (similar) sound: "z<.>" and "rz".
			   In this case "rz" is very wrong.

	4. rzondanie     - Now it sounds almost OK ("on" sounds close to
			   "a<;>"), but the spelling is even worse.  
	where
		z<.> stands for a dot over z
		a<;> stands for "ogonek" (yes, this is the official
	             name in Unicode), or a "tail" under the a.

	As you can see, this is not just a matter of some perky accents.

	Jan






* Re: How to format a float?
  1999-10-25  0:54               ` How to format a float? skaller
@ 1999-10-26  0:53                 ` Michel Quercia
  0 siblings, 0 replies; 24+ messages in thread
From: Michel Quercia @ 1999-10-26  0:53 UTC (permalink / raw)
  To: caml-list

On Mon, 25 Oct 1999, you wrote:
: How do I format a floating point number 
: correctly in ocaml, with a given _variable_
: width and precision?

Like this:

let str width prec =
  let prec    = min 8 (max prec 0) in   (* cap prec so that z fits in an int *)
  let pow_ten = 10.0 ** float_of_int prec in
  fun x ->
    let y = abs_float x in
    (* round once, up front, so that a carry propagates into the integer part *)
    let r = floor (y *. pow_ten +. 0.5) /. pow_ten in
    let a = floor r in
    let b = 1. +. r -. a in             (* 1.ff...f: the leading 1 keeps the zeros *)
    let z = truncate (b *. pow_ten +. 0.5) in
    let sa = (if x < 0. then "-" else "") ^ Printf.sprintf "%.0f" a in
    let sb = string_of_int z in
    let pad = width - String.length sa - prec - 1 in
    String.make (max 0 pad) ' ' ^ sa ^ "." ^ String.sub sb 1 prec

The "8" in "min 8 (max prec 0)" is there to avoid integer overflow when
computing "z" on 32-bit computers. "b" is 1 too large in order to force
"string_of_int z" to have "prec+1" characters, and the rounding is done
up front so that a carry propagates into the integer part (1.999 at two
digits comes out as "2.00", not "1.00").

Not very nice... Perhaps it would be better to extend the standard printf
library routines to accept the "*" variable field width.
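
If printf accepted the "*" fields, as C's printf does, the whole function
above would collapse to a one-liner (hypothetical here, assuming such
support existed):

let format_float width prec x =
  (* width and precision supplied at run time via the "*" fields *)
  Printf.sprintf "%*.*f" width prec x

(* format_float 10 3 3.14159 would give "     3.142" *)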
--
Michel Quercia
9/11 rue du grand rabbin Haguenauer, 54000 Nancy
http://pauillac.inria.fr/~quercia
mailto:quercia@cal.enst.fr




* Go for ultimate localization!
  1999-10-21 12:05       ` Matías Giovannini
  1999-10-21 15:35         ` skaller
@ 1999-10-26  4:36         ` Benoit Deboursetty
  1999-10-28 17:04           ` Pierre Weis
                             ` (4 more replies)
  1 sibling, 5 replies; 24+ messages in thread
From: Benoit Deboursetty @ 1999-10-26  4:36 UTC (permalink / raw)
  To: caml-list

[ Since my post ended up being *very* long, I put in some headers and a
  summary... "rhetorical sugar" ]

--- Summary

1. INTRODUCTION
  Present situation: O'CaML allows Latin1 in identifiers, but keywords are
  still in English -> one's code is made of intermingled languages

2. ULTIMATE LANGUAGE LOCALIZATION?
  Why not localize the parser? the libraries?

3. RELEVANCE
  Is this not just another useless theory?
  [ as was said of functional programming a few years ago ]

4. CONCLUSION
  Well, time to set to work, people :)

---

On Thu, 21 Oct 1999, Matías Giovannini wrote:

> OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll
> agree that my view is chauvinistic (and selfish, perhaps: I already have
> "¿¡áéíóúüñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?),
> I see no restriction in that (well, if I were Chinese, or Egyptian, I
> would see things differently). What's more, the whole syntactic
> apparatus of a programming language *assumes* a Latin setting, where
> things make sense when read from left to right, from top to bottom; and
> where punctuation is what we're used to. Programming languages suited
> for a Han, or Arab, or even a Hebrew audience would have to be rethought
> from the ground up.

1. INTRODUCTION
---------------

	Hi,

	Thanks for giving me this idea, Matías. "Rethought from the
ground up" was the only wrong thing: you only have to change the
preprocessor...

  [ Since I am catching this thread in the middle, I apologize in advance
    for any possible redundancy with what has been said before ]

	So, O'CaML handles Latin-1 characters in the identifiers. This
allows most Europeans, well at least you and me, to use their own
language; but to me, it has always seemed a bit strange to write a program
with keywords in English, and identifiers in French. Indeed, I feel
it's really bogus when I try to read this out loud:

	"let élément = String.sub chaîne where chaîne = "...

Do you take the French accent to read this? an English accent? Do you
switch accents? How do you say "=", "égal" or "equals"? Do you translate
every word to your mother tongue on-the-fly?

  [ You know, there is something with the French and the "purity" of their
    language... ]

Don't think I put this example in just to be funny: when building
software in teams, it often happens that you want to explain what your
program does to someone who speaks your language, and that you feel
uncomfortable reading your code aloud. I'm speaking from personal
experience; I hope I am not an exception.

	At least, there is a certain consistency in C allowing only
English characters: were I on a standardization committee, I would even
demand, in the C standard, that compliant source code have everything,
identifiers and comments, in American English (you must be hoping I will
never be given such power).

	Everyone is happy with O'CaML handling Latin1 within Western
Europe and the US *because* the grammars differ only slightly from one
country to another. That was your point, Matías. And the Indo-European
kind of syntax is not universal. The syntax I know of that differs most
from Indo-European languages is that of Japanese, of course. And Japan is
a country too often forgotten when it comes to localization. It must be
difficult to deal with Japanese. Anyway, it makes this economically
powerful country especially sensitive to locale-aware software (an OS such
as BeOS is therefore quite popular there).

	To sum up the present situation: half-and-half. Your program is half
in your language, half in English. The syntax corresponds to the native
language of only half of the potential users. Latin1 is meaningful to
only part of the users, too.

2. ULTIMATE LANGUAGE LOCALIZATION?
----------------------------------

	I know one of the authors of olabl is Japanese; I would like to
know whether it makes sense to have identifiers in Japanese in a language
with English keywords.

	For instance, we often give verbal identifiers to functions that
perform an action. I think if the Japanese had invented the first computer
languages, we would naturally put such function calls after the arguments
(a bit like in RPL), because the verb is always put after its objects in
Japanese. [it has nothing to do with writing right-to-left]

	So, from my naive point of view, using for identifiers a language
different from the one the keywords are chosen from is very
unnatural. Another way of seeing things would be to have different parsers
for the same compiler: one with English keywords (the one that exists
now), one with French keywords and a similar syntax, one with Japanese
keywords and perhaps a completely different syntax, etc., each one
adapting the same abstract language, "abstract ocaml", to a different
natural language.
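
For the Indo-European cases, a first experiment would not even need a new
parser: a keyword-substitution pass in front of the lexer would do. A toy
sketch, with invented French keywords (a real pass would work on the token
stream, so as not to touch strings and comments; String.split_on_char is
from a later standard library):

let keyword = function
  | "soit" -> "let" | "dans" -> "in"
  | "filtrer" -> "match" | "avec" -> "with"
  | w -> w

let localize line =
  String.concat " " (List.map keyword (String.split_on_char ' ' line))

(* localize "soit x = 1 dans x + 1" gives "let x = 1 in x + 1" *)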

	In other words, localizing error messages and charsets is fine, but
*localizing the parser* as well seems more consistent to me. Of course, this
raises again the issue I was speaking of in my previous post: using
languages other than the "computer esperanto" in computer programs
would probably block international exchanges (but translation here is just
another kind of pretty-printing -- see section 4). I am just following the
idea of localization to its logical conclusion.

	Perhaps one could go even further: library interfaces, after
all, should also have the possibility of being localized. Why would you
use a module named "String" in a French program? This word has a
completely different meaning in French, and every French computer
scientist sounds really suspect to a non-scientific person when he speaks
of "strings" all the time. The same goes for "bit". It's also because of
things like that that computer people are sometimes thought unable to
communicate.

  [ "un string" is only an item of underwear in French. And there is a
    very unfortunate collision between the English word "bit" and another
    word. ]

3. RELEVANCE
------------

	So, well, is all this theory going to be useful to anyone? I
really think there would be a gain in implementing such a feature. There
is a pervasive, wasteful effort of translation being made by non-native
English speakers when programming. It can be ridiculous when two speakers
of the same language find themselves speaking English to each other.

	Of course, such an extension probably won't be used in business as
long as everybody speaks English at work in almost every software
development group anyway.

	But first, I think I read that O'CaML was designed primarily as an
experiment in language design. Second, it's no wonder everybody
speaks English in software businesses if it is the language that lies
under every computer language (a "chicken and egg" problem). Third,
there is a constant motion from natural language to computer language and
vice versa... For instance, most constructs in computer languages (except
in lisp, scheme, ...) are based on English. Perhaps we are missing some
useful constructs that could come from other natural languages.

	Another, more concrete example. There is no simple translation in
French for the expression "to match some constant against a pattern".
Instead of saying "matcher" (which is what we do), we had better try
to find a better word. But even if a good word had been invented, it
would never be used in practice, I mean in the code of programs, and we
would still use "matcher" (erk).

4. CONCLUSION
-------------

	All this may be very difficult to implement, I don't know; I am
speaking from the user's point of view. This may also be a project
completely different from what O'CaML wants to focus on. But couldn't the
camlp4 "PreProcessorPrettyPrinter" by D. de Rauglaudre be the beginning of
a solution, at least for parser localization? It would even provide
translations between different localizations of the parser by
pretty-printing, wouldn't it?

	Congratulations if you've read this post all the way to here; I
await your comments on this utopian project.

	Benoît de Boursetty
	Benoit.de-Boursetty@polytechnique.org


---

PS: the camlp4 manual:
http://caml.inria.fr/camlp4/manual/index.html

PPS: For "to match sthg against a pattern", is there an "official" French
     translation? What do you think of "plaquer qqch sur un motif"?

PPPS: I apologize for any mistakes in English (I myself am not perfectly
      localized) and for taking examples only in the French & Japanese
      languages.





* Re: localization, internationalization and Caml
  1999-10-23  9:53               ` Benoit Deboursetty
  1999-10-25 21:06                 ` Jan Skibinski
@ 1999-10-26 18:02                 ` skaller
  1 sibling, 0 replies; 24+ messages in thread
From: skaller @ 1999-10-26 18:02 UTC (permalink / raw)
  To: Benoit Deboursetty; +Cc: caml-list

Benoit Deboursetty wrote:
> 
> This message just wants to raise a paradoxical point in this discussion
> [though it may have already been raised?]. It seems to me that allowing
> foreign characters to be used in a computer language, as identifiers or
> comments, would reduce the exchange of contributions worldwide.

	Excuse me, but exactly what do you mean by 'foreign' characters?
Do you mean non-Chinese characters? What? You aren't Chinese?

> You should understand that I sometimes feel I should have written it in
> English.

	I think that, at the moment, English is the 'lingua franca' <grin>
of the Internet. Spoken with an American accent :-)

	However, the Internet is growing fast, and the number
of English speakers will soon enough be a minority. It will probably
remain true that most of the _programmers_ will be able to use English.
 
> I must however acknowledge that [o']caml's ability to cope with latin1
> characters is above all useful for educational purposes.

	Yes. I think it is highly laudable that ocaml accepts more than
just plain 'ASCII': many students are more fluent in their native
language (even if they speak some English and/or are learning it),
and being able to program in it will enhance learning.
Internationalising software that is actually worth sharing
internationally is a lesser obstacle than writing good software in the
first place.

> My point remains: encouraging people to write code in their own language
> would reduce the possibilities of exchanging their work.

	In my opinion, a programming language should simply
give clients a _choice_. Cultures, people, and circumstances vary.
I don't think programming language designers should be in the
business of encouraging or discouraging use of a particular
language, but rather facilitating the implementation of 
the clients' own wishes or requirements.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller





* Re: Go for ultimate localization!
  1999-10-26  4:36         ` Go for ultimate localization! Benoit Deboursetty
@ 1999-10-28 17:04           ` Pierre Weis
  1999-10-28 17:41           ` Matías Giovannini
                             ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Pierre Weis @ 1999-10-28 17:04 UTC (permalink / raw)
  To: Benoit Deboursetty; +Cc: caml-list

> 1. INTRODUCTION
>   Present situation: O'CaML allows Latin1 in identifiers, but keywords are
>   still in English -> one's code is made of intermingled languages

Yes, and this can even be a convenient way of distinguishing between
pure syntactical markers (let, try, with) of the language, and
semantically relevant names, for instance

let élimine_accents s =
  ...

In this respect, when teaching Caml to beginners it is now something of
an advantage not to be a native English speaker!
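
(A possible body for élimine_accents, by the way: only a sketch, over
Latin-1 byte values, with a deliberately partial table and a later
standard library's String.map.)

let élimine_accents s =
  String.map (fun c ->
    match Char.code c with              (* Latin-1 byte values *)
    | 0xE0 | 0xE2 -> 'a'                (* à â *)
    | 0xE7 -> 'c'                       (* ç *)
    | 0xE8 | 0xE9 | 0xEA | 0xEB -> 'e'  (* è é ê ë *)
    | 0xEE | 0xEF -> 'i'                (* î ï *)
    | 0xF4 -> 'o'                       (* ô *)
    | 0xF9 | 0xFB -> 'u'                (* ù û *)
    | _ -> c) s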

Concerning strange kinds of natural language syntax, or kinds of
syntax that would not be suitable for Caml, I think the discussion is
endless and heading for a dead end anyway. Just consider Chinese
instead of Japanese as an example: there is no alphabet at all (so no
idea of dividing words into a small fixed set of characters), and
traditional writing is to draw vertical lines that go from right to
left. Now consider another language of the Caml kind (semi-rigorously
defined, semi-universal (:) that you definitely want to use as a
Chinese human being: the language of mathematics. What can you do?
Would you still use traditional Chinese digits (one is denoted as -,
two as =, ...) or adopt the ``long noses''' strange notations (0, 1, 2,
3, ...)? Would you adapt the whole set of well-designed mathematical
notations just to be able to write vertically and from right to
left? Surely not. You would use a simpler solution instead: you
would adapt *yourself*, not the established notations! That's why
Chinese people use the same well-known notations as everybody else in
the world, just because they want to understand others and also to be
understood by others (otherwise foreigners could be a bit confused
by the strange use of - as 1 or = as 2!). For computer languages, I
think it is the same: you must use the keywords unchanged because,
in the first place, you must communicate with some machine and its
hardware, and also you want to communicate your programs to others.

If you really want to communicate about your program, you may go one step
further and use English identifiers, or two steps further and use
English comments as well.

> 	Another, more concrete example. There is no simple translation in
> French for the expression "to match some constant against a
> pattern".

Yes there is one: filtrer. (Pattern: filtre (sometimes motif, more
rarely patron); pattern matching: filtrage (sometimes sélection de
motifs, more rarely appel par patron).) You may read the French
version of the Caml FAQ, or a good French book about the language.

> Instead of saying "matcher" (which is what we do), we had better try
> to find a better word. But even if a good word had been invented, it
> would never be used in practice, I mean in the code of programs, and we
> would still use "matcher" (erk).

We don't say ``matcher'', except as a kind of jargon, a funny verb
that we use at a low level of language not very far from
slang. In practice we often use the proper word. For instance, we
say:

Appel direct au filtrage: la construction ``match ... with''

To mean the english section title

Direct call to pattern matching: the ``match ... with'' construct.

Best regards && Cordialement (but not ``meilleurs regards''!)

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://cristal.inria.fr/~weis/






* Re: Go for ultimate localization!
  1999-10-26  4:36         ` Go for ultimate localization! Benoit Deboursetty
  1999-10-28 17:04           ` Pierre Weis
@ 1999-10-28 17:41           ` Matías Giovannini
  1999-10-28 17:59           ` Matías Giovannini
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Matías Giovannini @ 1999-10-28 17:41 UTC (permalink / raw)
  To: Benoit Deboursetty; +Cc: caml-list

Uh, I'm going to get flamed for off-topicness, but...

Benoit Deboursetty wrote:
>         So, O'CaML handles Latin-1 characters in the identifiers. This
> allows most Europeans, well at least you and me, to use their own
> language; but to me, it has always seemed a bit strange to write a program
> with keywords in English, and identifiers in French. Indeed, I feel
> it's really bogus when I try to read this out loud:
> 
>         "let élément = String.sub chaîne where chaîne = "...
> 
> Do you take the French accent to read this? an English accent? Do you
> switch accents? How do you say "=", "égal" or "equals"? Do you translate
> every word to your mother tongue on-the-fly?

Symbols and numbers in Spanish, "words" (be they identifiers or
keywords) as they are. The accent varies, but normally it is Argentinian
English (I can speak acceptable English, with something of an American
English accent).
This I do when reading programs aloud, or when doing mathematics.

For instance

"Let x be a real such that x >= 0"

is for me "Let <<equis>> be a real such that <<equis mayor o igual a cero>>"

Evidently, the linguistic structures I use for words and concepts are
separate. And I have never been misunderstood, nor have I ever
misunderstood anybody, doing this sort of thing.

I think all this is fascinating ;-}

-- 
I got your message. I couldn't read it. It was a cryptogram.
-- Laurie Anderson





* Re: Go for ultimate localization!
  1999-10-26  4:36         ` Go for ultimate localization! Benoit Deboursetty
  1999-10-28 17:04           ` Pierre Weis
  1999-10-28 17:41           ` Matías Giovannini
@ 1999-10-28 17:59           ` Matías Giovannini
  1999-10-29  9:44             ` Francois Pottier
  1999-10-28 21:00           ` Gerd Stolpmann
  1999-10-29  4:29           ` skaller
  4 siblings, 1 reply; 24+ messages in thread
From: Matías Giovannini @ 1999-10-28 17:59 UTC (permalink / raw)
  To: Benoit Deboursetty; +Cc: caml-list

I should have read it all before letting my fingers twitch, but... ObCaml
now :-)

Benoit Deboursetty wrote:
>         All this may be very difficult to implement, I don't know; I am
> speaking from the user's point of view. This may also be a project
> completely different from what O'CaML wants to focus on. But couldn't the
> camlp4 "PreProcessorPrettyPrinter" by D. de Rauglaudre be the beginning of
> a solution, at least for parser localization? It would even provide
> translations between different localizations of the parser by
> pretty-printing, wouldn't it?

Localizing the language is not that difficult, but it would involve
exposing the abstract syntax to the localizer, and the localizer would
have the difficult task of rewriting the parser. This is not as bad as
it sounds for Indo-European languages (the most difficult problem I can
think of is the use of cases vs. prepositions, but I can't think of the
equivalent of a prepositional phrase in a computer language), but for
subject-object-verb languages like Japanese it could be nightmarish.

Localizing the libraries can be done the right way, or the easy way. The
easy way is to have a dictionary of external-vs-internal names, with the
language using only internal names. It could be easy if initially the
internal names are the names used by the developer, and external names
are added as the localization proceeds. Somehow, this seems wrong to me
(because it seems too easy), so I think there must be a "right" way to
do it that is absolutely non-trivial.
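
Concretely, the easy way amounts to little more than a lookup function --
a sketch, with invented French external names:

let internal_name = function
  | "Chaine.longueur"    -> "String.length"  (* hypothetical externals *)
  | "Chaine.sous_chaine" -> "String.sub"
  | n -> n                                   (* internal names pass through *)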

> PPS: For "to match sthg against a pattern", is there an "official" French
>      translation? What do you think of "plaquer qqch sur un motif"?

Um, you know, I think that this is pressing things too far. It's the
same attitude the Spanish have, trying to find a Spanish word for every
technical word out there. The result is doubly bad: on one hand, you're
restricting yourself to *translating* instead of *generating* technical
language (as is done in the Social Sciences, for instance), and you're
leaving out people who have no problem with direct imports (we speak
routinely of *matchear*, and I have yet to find someone, technical or
not, who doesn't "get me"). I'm not afraid of jargon, and I think we in
Argentina are used to jargon (we speak dialect anyway, in Buenos Aires).

The literal Spanish for your French would be "pegar algo sobre un patrón"
(paste sthg over a pattern), but it sounds backwards. Better would be
"aplicar un patrón sobre algo" (apply a pattern on sthg), but while it
now sounds right it *is* exactly backwards. The best compromise is
"matchear algo contra un pattern", and it only translates the normal
"part-of-speech" words, leaving order and technical words intact.

-- 
I got your message. I couldn't read it. It was a cryptogram.
-- Laurie Anderson





* Re: Go for ultimate localization!
  1999-10-26  4:36         ` Go for ultimate localization! Benoit Deboursetty
                             ` (2 preceding siblings ...)
  1999-10-28 17:59           ` Matías Giovannini
@ 1999-10-28 21:00           ` Gerd Stolpmann
  1999-10-29  4:29           ` skaller
  4 siblings, 0 replies; 24+ messages in thread
From: Gerd Stolpmann @ 1999-10-28 21:00 UTC (permalink / raw)
  To: Benoit Deboursetty; +Cc: caml-list

On Tue, 26 Oct 1999, Benoît de Boursetty wrote:
>
>	So, O'CaML handles Latin-1 characters in the identifiers. This
>allows most Europeans, well at least you and me, to use their own
>language; but to me, it has always seemed a bit strange to write a program
>with keywords in English, and identifiers in French. Indeed, I feel
>it's really bogus when I try to read this out loud:
>
>	"let élément = String.sub chaîne where chaîne = "...

I currently have some projects in a bank in Frankfurt. Although our team often
chooses English identifiers, there are sometimes special terms which are
difficult to translate. Big organizations tend to have their own language, and
everybody in the organization knows what is meant when one of the special words
is used. For example, in this bank the German word "Sachgebiet" means a software
application tied to an organizational unit; nobody outside would think of
software on hearing this word, and it seems to be impossible to
translate such a term into English because nobody inside the bank would
understand it. So the only way out is to mix English and German words, e.g.
get_sachgebiet as a method of an object.

I mention this example because it demonstrates that you cannot get rid of the
local language even if you try hard. I think this is no different in Japan or
other parts of the world, so non-Latin1 characters are really needed there.

Please note: this argument has nothing to do with aesthetics ("how it sounds");
the language mixture is simply necessary to make yourself understood. The
grammar (the language of the keywords) of Ocaml should be the same all over the
world; otherwise you would split the software base artificially.

>It can be ridiculous when two speakers
>of the same language find themselves speaking English to each other.

Yes, but this is not the problem. We also have native English speakers on our
team, and international teams are very normal in this world. John also mixes
languages, and it is sometimes very funny...

Gerd
--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   Gerd.Stolpmann@darmstadt.netsurf.de (privat)
Germany                     
----------------------------------------------------------------------------





* Re: Go for ultimate localization!
  1999-10-26  4:36         ` Go for ultimate localization! Benoit Deboursetty
                             ` (3 preceding siblings ...)
  1999-10-28 21:00           ` Gerd Stolpmann
@ 1999-10-29  4:29           ` skaller
  4 siblings, 0 replies; 24+ messages in thread
From: skaller @ 1999-10-29  4:29 UTC (permalink / raw)
  To: Benoit Deboursetty; +Cc: caml-list

Benoit Deboursetty wrote:
>> 
>         At least, there is a certain consistency in C allowing only
> English characters: were I on a standardization committee, I would even
> demand, in the C standard, that compliant source code have everything,
> identifiers and comments, in American English (you must be hoping I will
> never be given such power).

	FYI: as it happens, I _am_ on the C standardisation committee,
and the latest version of C, code-named 'C9X', which has now passed
the final NB balloting, as well as C++, which is already an International
Standard, permits a certain subset of ISO10646 in identifiers. All the
characters are in the first (Unicode) plane.
characters are in the first (unicode) plane.
[The actual subset is listed in Appendix E of the C++ Standard]

	However, neither permits any variations on keywords :-)

>         In other words, localizing error messages and charsets is fine, but
> *localizing the parser* as well seems more consistent to me.

	I have no opinion on the idea of allowing _alternate_
keywords, or even syntax, more amenable to international use.
However, I would like to be 'pedantic' and object to
the word 'localisation'. I would STRONGLY object to
localising ANY aspect of the ocaml language, and this is one of
my reasons for objecting to the current Latin-1 support.

	I advocate _internationalisation_, not localisation,
that is: all the possible characters, grammar extensions,
new keywords, or what ever, are universally available to 
everyone.

	On the other hand, I _would_ support localisation
of diagnostic error messages and documentation: that is,
if you speak French, you get errors in French.

> Of course, this
> raises again the issue I was speaking of in my previous post: using
> languages other than the "computer esperanto" in computer programs
> would probably block international exchanges 

	I am not at all sure about that. Have you considered that it
might _facilitate_ international exchange? For example, a Japanese
programmer able to use Japanese extensively in a programming
language might now be free to release a program internationally,
without feeling bound to 'Anglicize' it. This would make
it harder for non-Japanese speakers to understand than for
Japanese speakers -- but 'harder' is better than
'impossible because the program was never publicly released'.

	I personally feel embarrassed that I can't follow
the dialogue in French when reading this group. But that is surely
MY problem, and I am grateful that so many French speakers can read and
write MY native language (English with an Ozzie accent :)
fluently.
 
> every French computer
> scientist sounds really suspect to a non-scientific person when he speaks
> of "strings" all the time. The same goes for "bit". It's also because of
> things like that that computer people are sometimes thought unable to
> communicate.

	Now, that cannot be the reason -- because it doesn't explain
why the same is thought of English-speaking computer nerds. :-)
 
>   [ "un string" is only an item of underwear in French. And there is a
>     very unfortunate collision between the English word "bit" and another
>     word. ]

	The mind boggles. An anecdote from the C++ standardisation process:
when a new feature of the C++ library providing 'auxiliary' information
was being developed, it was originally called 'baggage'.
The name was changed to 'traits'. I was one of the people who said that
'baggage' has unfortunate connotations in Australia (at least),
referring to a sexually promiscuous woman in derogatory masculine slang.

>         Of course, such an extension probably won't be used in business as
> long as everybody speaks English at work in almost every software
> development group anyway.

	I am surprised: is this really the case in France, for example?
It is not in Japan (perhaps the language is more alien?)
 
>         All this may be very difficult to implement, I don't know, I am
> telling things from the user's point of view. 

	I do not think the difficulty of implementation is any more
than a minor issue. The major issue, surely, is that at least
in most programming languages, _something_ is universal, even if
the keywords tend to be derived from English. If too many
alternative ways of doing the same thing are available, we may
end up with a mish-mash language (sort of like English itself :-)


-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller





* Re: Go for ultimate localization!
  1999-10-28 17:59           ` Matías Giovannini
@ 1999-10-29  9:44             ` Francois Pottier
  0 siblings, 0 replies; 24+ messages in thread
From: Francois Pottier @ 1999-10-29  9:44 UTC (permalink / raw)
  To: matias; +Cc: caml-list


> Localizing the language is not that difficult, but it would involve
> exposing the abstract syntax to the localizer, and the localizer would
> have the difficult task of rewriting the parser.

I would like to point out that this has been done by Apple in their
AppleScript scripting language. It looks cool at first sight, but it
means that textual programs are no longer portable, since the machine
you're moving to may not use the same language as yours. So, only the
(binary) abstract syntax tree is portable, which is (IMHO) a very bad
thing, since maintaining compatibility across versions of the language
becomes very difficult.

-- 
François Pottier
Francois.Pottier@inria.fr
http://pauillac.inria.fr/~fpottier/






Thread overview: 24+ messages
1999-10-15 13:53 localization, internationalization and Caml Gerard Huet
1999-10-15 20:28 ` Gerd Stolpmann
1999-10-19 18:06   ` skaller
1999-10-20 21:05     ` Gerd Stolpmann
1999-10-21  4:42       ` skaller
1999-10-21 12:05       ` Matías Giovannini
1999-10-21 15:35         ` skaller
1999-10-21 16:27           ` Matías Giovannini
1999-10-21 16:36             ` skaller
1999-10-21 17:21               ` Matías Giovannini
1999-10-23  9:53               ` Benoit Deboursetty
1999-10-25 21:06                 ` Jan Skibinski
1999-10-26 18:02                 ` skaller
1999-10-25  0:54               ` How to format a float? skaller
1999-10-26  0:53                 ` Michel Quercia
1999-10-26  4:36         ` Go for ultimate localization! Benoit Deboursetty
1999-10-28 17:04           ` Pierre Weis
1999-10-28 17:41           ` Matías Giovannini
1999-10-28 17:59           ` Matías Giovannini
1999-10-29  9:44             ` Francois Pottier
1999-10-28 21:00           ` Gerd Stolpmann
1999-10-29  4:29           ` skaller
1999-10-17 14:29 ` localization, internationalization and Caml Xavier Leroy
1999-10-19 18:36   ` skaller
