From: John Prevost
To: skaller
Cc: caml-list@inria.fr
Subject: Re: Request for Ideas: i18n issues
Date: 25 Oct 1999 22:46:48 -0400
In-Reply-To: skaller's message of "Tue, 26 Oct 1999 04:55:01 +1000"

skaller writes:

> > * a type for characters in the charset
>
> I think you mean 'code points'.  These are logically distinct
> from characters (which are kind of amorphous).  For example, a single
> character -- if there is such a thing in the language -- may
> consist of several code points combined, and there are code points
> which are not characters.

I'm not sure the distinction matters much at this level.  I am
predisposed to prefer the name "char" to the name "codepoint" just
because more people will understand it.  In the back of my head, I
also generally interpret "codepoint" as meaning the number associated
with the character, rather than the abstract "character" itself.  I
believe Unicode makes this distinction between a "glyph", a
"character", and a "codepoint".

> > * a type for strings in the charset (maybe char array, maybe not)
>
> I think you mean 'script'.  Strings of code points can be used
> to represent script.

Uh.  Okay, whatever.  Again, I will tend to use the more traditional
word "string" to mean a sequence of characters.  (On reflection, I
think you're trying to say something deeper, but your explanation is
so vague I won't try to interpret what it is.  I leave that to you.)

> > * functions to work with strings in the character set in order to do
> >   standard manipulations.  If we said a string is always a char
> >   array and that there are standard functions to work on strings
> >   given the above, this might be something that can be done away
> >   with.
>
> Probably not.  There is a distinction, for example, between
> concatenating two arrays of code points, and producing a new array
> of code points corresponding to the concatenated script.  This is
> 'most true' in Arabic, but it is also true in plain old English: the
> script {...}

I don't believe that this is a job for the basic string manipulation
stuff to do.  There do need to be methods for manipulating strings as
sequences.  As such, I'm not going to worry about it at this level.

> > * functions to convert characters and strings to a reference format,
> >   perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
> >   of coverage, and without some common format, converting from one
> >   character set to another is problematic.
>
> I agree.  I think there are two choices here, UCS-4 and UTF-8
> encodings of ISO-10646.
> UCS-4 is better for internal operations,
> UTF-8 for IO and streaming operations (perhaps including
> regular expression matching where tables of 256 cases are more
> acceptable than 2^31 :-)

True.  Of course, there are ways and ways.  Regexps based on
character classes for efficiency is one (now I only need 5 bits (for
Unicode, anyway) to represent that I want "all letters or numbers or
connectors").  And there could be multi-level tables.  (A small
sketch of the UTF-8 byte layout appears at the end of this message.)

> > Locales:
> >
> > * functions to do case mapping, collation, etc.
>
> No.  It is generally accepted that 'locale' information is
> limited to culturally sensitive variations like whether full stop or
> comma is used for a decimal point, and whether the date is written
> dd/mm/yy or yy/mm/dd or mm/dd/yy.
>
> Collation, case mapping, etc are not locale data, but specific
> to a particular script.
>
> The tendency in i18n developments has been, I think, to
> divorce character sets, encodings, collation, and script issues from
> the locale: the locale may indicate the local language, but that is
> independent (more or less) of script processing.

Okay.  I'm going to have a summary at the bottom of my message of the
various kinds of things Java has for this sort of thing, just as
something to think about.  (i.e. are these all separate things?  How
does a program get them?  etc.)

> > (I think this is really why Java went to "the one true charset is
> > Unicode".  Not just because of politics, but because interacting with
> > mutually incompatible character sets can be a type-safety nightmare.)
>
> Yes.  I am somewhat surprised to see an attempt to create a more
> abstract interface to multiple character sets/encodings.  This area
> tends, I think, to be complex, full of ad hoc rules, and so quirky
> as to defy genuine abstraction.
>
> Fixing a single standard (ISO10646) is a simpler
> alternative; even simpler if there is a single reference encoding
> such as UCS-4 or UTF-8.  In that case, the functions that do the
> work can be specialised to a well researched International Standard.
>
> It is still necessary to provide functions that encode/decode
> the standard format to other formats (encodings/character sets), but
> no functions need be provided to do things like collation or case
> mapping for these other formats.

The big difficulty here is that not everybody wants to eat Unicode.
I think it's appropriate, but not everyone does.  And there are still
characters in iso-2022, for example, which have no Unicode code
point.

I think that as I look at the problem more, though, I'm inclined to
say "one definitive set of characters" is a better idea.  Especially
since that set is needed for reasonable interoperation between the
others.

Something I've noted looking at O'Caml these last few days: the
"string" type is really more an efficient byte array type.  And the
char type is really a byte type.  There's no real way to do "input
bytes from a stream" except inputting them as characters and then
interpreting those characters as bytes.

Here's what Java has, related to encodings, strings, characters,
collators, and so on:

java.lang
  Character
    Represents individual Unicode characters.  Provides methods to
    get character class, etc.
  String
    Immutable string of characters.
  StringBuffer
    Mutable string of characters.

java.util
  Locale
    Appears to provide mainly identity, in a package based on ISO
    language and country codes.  Methods to look up "resource
    bundles".
    Various formatting tools allow you to get a new formatter given a
    specific locale (perhaps by using their own resource bundles
    describing which subclass should be used for each locale.)

java.text
  BreakIterator
    For finding word, sentence, para breaks in text.
  ChoiceFormat
    Allows a complex mapping from values to strings.
    (i.e. 0 -> "no files", 1 -> "1 file", n -> "n files")
  Collator
    Configurable comparison between strings.
  DateFormat
    Parses and unparses date values.
  DecimalFormat
    Configurable formatter of numbers ("####.##" -> " 123.20")
  Format
    Superclass for all these formats.  They all apparently have to
    support parsing and unparsing values in Java.
  MessageFormat
    Formatting strings including argument number specification, for
    message catalogs.  ("{0}'s {1}" -> "John's foo",
    "{1} u {0}" -> "foo u Ivan")
  NumberFormat
    Generic any-number formatter, not decimal.
  RuleBasedCollator
    Collator that can be given strings that represent rules for
    ordering strings.
  and a few others

java.io
  OutputStreamWriter
    Writes strings to an output stream, in an encoding specified by a
    string (the encoding's name)
  InputStreamReader
    Ditto for the other direction.
  ByteArray...

Anyway, I think this partitions things like so:

1) Basic char and string types
2) Locale type which exists solely for locale identity
3) Collator type which allows various sorting of strings
4) Formatter and parser types for different data values
4a) A sort of formatter which allows reordering of arguments in the
    output string is needed (not too hard).
5) Reader and writer types for encoding characters into various
   encodings

My current thought is that the char and string types should be
defined in terms of Unicode.  The char type should be abstract, and
have a function for converting to Unicode codepoints.  The new string
type may want to be immutable, with buffers used for changeable
things instead.

You should be able to get ahold of collators, formatters, message
catalogs, default encodings, and the like by having a locale.  You
should of course be able to ignore them, too.

I'll try to get some type signatures for possibilities up in the next
day or so.

John.
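A very rough, purely hypothetical sketch of what signatures along the
lines of the partitioning above might look like.  Every name here
(the module type UNI, the types uchar, ustring, locale, collator, and
all the operations) is invented for illustration; nothing of the sort
exists in the standard library:

  module type UNI = sig
    (* 1) basic char and string types, defined in terms of Unicode *)
    type uchar                                (* abstract character *)
    val uchar_of_code : int -> uchar          (* from a Unicode codepoint *)
    val code_of_uchar : uchar -> int          (* back to the codepoint *)

    type ustring                              (* immutable string of uchars *)
    val length : ustring -> int
    val get : ustring -> int -> uchar
    val concat : ustring -> ustring -> ustring

    (* 2) a locale exists solely for identity *)
    type locale
    val locale : string -> string -> locale   (* language code, country code *)

    (* 3) collators provide locale-sensitive ordering of strings *)
    type collator
    val collator : locale -> collator
    val compare : collator -> ustring -> ustring -> int

    (* 4) formatters and parsers for data values are omitted here *)

    (* 5) readers and writers translate ustrings to and from byte
       encodings named by strings, as in Java's InputStreamReader;
       the int argument is a character count to read *)
    val output_ustring : out_channel -> string -> ustring -> unit
    val input_ustring : in_channel -> string -> int -> ustring
  end

A mutable buffer type playing the role of StringBuffer, and
conversions between ustring and the existing byte-oriented string
type, would presumably sit alongside this.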
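And, for the UCS-4 vs. UTF-8 point above, a small sketch of how an
ISO-10646 codepoint (held here as a plain int) maps onto UTF-8 bytes.
The function name add_utf8 and the use of a Buffer are only
illustrative; the original scheme extends the same pattern up to six
bytes to cover the full 31-bit range, and only the one- to four-byte
cases are shown:

  (* append the UTF-8 encoding of codepoint cp to buffer buf *)
  let add_utf8 buf cp =
    let add n = Buffer.add_char buf (Char.chr n) in
    if cp < 0x80 then
      add cp                                     (* 0xxxxxxx *)
    else if cp < 0x800 then begin
      add (0xC0 lor (cp lsr 6));                 (* 110xxxxx *)
      add (0x80 lor (cp land 0x3F))              (* 10xxxxxx *)
    end
    else if cp < 0x10000 then begin
      add (0xE0 lor (cp lsr 12));                (* 1110xxxx *)
      add (0x80 lor ((cp lsr 6) land 0x3F));
      add (0x80 lor (cp land 0x3F))
    end
    else begin
      add (0xF0 lor (cp lsr 18));                (* 11110xxx *)
      add (0x80 lor ((cp lsr 12) land 0x3F));
      add (0x80 lor ((cp lsr 6) land 0x3F));
      add (0x80 lor (cp land 0x3F))
    end

  (* e.g. add_utf8 b 0x00E9 appends the two bytes 0xC3 0xA9 *)

A decoder walks the same cases in the other direction, which is where
byte-indexed 256-entry dispatch tables remain practical.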