* Request for Ideas: i18n issues
@ 1999-10-22  0:25 John Prevost
  1999-10-25 18:55 ` skaller
  0 siblings, 1 reply; 7+ messages in thread

From: John Prevost @ 1999-10-22 0:25 UTC (permalink / raw)
To: caml-list

I think there are a number of issues involved in adding i18n support to
O'Caml.  One of the first that should probably be addressed is providing
a standard signature for a "character set/encoding" module.  (This isn't
necessarily part of the stock distributed O'Caml libraries, but it might
make sense if the standard library could take advantage of it.
Tradeoffs.)  This is the heart of my message, and continues after a
short digression.

The other issue I've thought about for a while has to do with wanting to
use non-8859-1 characters in O'Caml code.  I don't know if this is a
necessity (though the language geek in me says "yes, please!"), but it
would be interesting.  The difficulty I see here is mainly the syntactic
distinction between Uidents and Lidents.  If I want to define a symbol
named by the kanji "san", it's hard to say whether that's an uppercase
ident or a lowercase one.  (Why would I want to?  Like I said, I'm a
language geek.)  I understand that Caml didn't have this restriction,
and that it was added to help distinguish things like Foo.bar from
foo.bar (Foo is a module, foo is a record value).  But would it be
possible to remove the distinction again?

Back to the charset/encoding module type.  Here's what I think might
want to be in here.  I appeal to you for suggestions about things to
remove or add.

Charsets:

 * a type for characters in the charset
 * a type for strings in the charset (maybe char array, maybe not)
 * functions to determine the "class" of a character.  This would
   probably involve a standard "character class" type, possibly
   informed by character classes in Unicode.
 * functions to work with strings in the character set in order to do
   standard manipulations.  If we said a string is always a char array
   and that there are standard functions to work on strings given the
   above, this might be something that can be done away with.
 * functions to convert characters and strings to a reference format,
   perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
   of coverage, and without some common format, converting from one
   character set to another is problematic.

Encodings:

These are tied to charsets much of the time, but not always.

 * functions to encode and decode strings in streams and buffers.

Locales:

 * functions to do case mapping, collation, etc.

I'm sure I've missed useful features that should be in these categories.
Some of these might be functors.  Some might be objects, since modules
aren't values, and you may very well want to select a locale/encoding at
runtime.  Selecting a charset is more problematic.  (I think this is
really why Java went to "the one true charset is Unicode".  Not just
because of politics, but because interacting with mutually incompatible
character sets can be a type-safety nightmare.)

One or more of the above might be functors, so that you can compose a
character set, encoding, and locale to get what you want.  This, of
course, gets into questions of whether certain character sets, locales,
and encodings are interoperable, and how one might cause a type error
when trying to combine an encoding and a character set that don't work
together.  Dunno if this is possible.  How to recover from failure is
of course a good question to try to answer as well.

I'll see if I can carve out some initial guesses at what could be used
above, and watch the list for commentary.  I've a week off next week
and will probably be hacking around on my own stuff (probably the UCS-4
based unicode module I've been putting off for a while.)

John Prevost.

^ permalink raw reply	[flat|nested] 7+ messages in thread
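[Editorial note: the charset module type sketched in the bullet list above could be written down roughly as follows. This is a sketch only, not anything proposed on the list or taken from an actual library: the names `CHARSET`, `char_class`, `to_ucs4`, and the toy `Latin1` instance are all invented for illustration.]

```ocaml
(* A coarse character-class type, loosely informed by Unicode categories. *)
type char_class = Letter | Digit | Punctuation | Space | Control | Other

module type CHARSET = sig
  type ch                          (* a character of this charset *)
  type str                         (* a string of this charset *)
  val classify : ch -> char_class
  val length : str -> int
  val get : str -> int -> ch
  val concat : str -> str -> str
  (* Conversion through a common reference format (UCS-4 code points),
     without which charset-to-charset conversion is problematic. *)
  val to_ucs4 : ch -> int
  val of_ucs4 : int -> ch option   (* None if the code point has no mapping *)
end

(* A toy Latin-1 instance, just to show the signature is inhabitable. *)
module Latin1 : CHARSET with type ch = char and type str = string = struct
  type ch = char
  type str = string
  let classify = function
    | 'a' .. 'z' | 'A' .. 'Z' -> Letter
    | '0' .. '9' -> Digit
    | ' ' | '\t' | '\n' -> Space
    | '\000' .. '\031' -> Control
    | '!' .. '/' | ':' .. '@' -> Punctuation
    | _ -> Other
  let length = String.length
  let get = String.get
  let concat = ( ^ )
  let to_ucs4 = Char.code
  let of_ucs4 n = if 0 <= n && n < 256 then Some (Char.chr n) else None
end
```

Different charsets would then be different modules matching `CHARSET`, and a functor could take one as a parameter, which is exactly the composition question raised above.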
* Re: Request for Ideas: i18n issues
  1999-10-22  0:25 Request for Ideas: i18n issues John Prevost
@ 1999-10-25 18:55 ` skaller
  1999-10-26  2:46   ` John Prevost
  0 siblings, 1 reply; 7+ messages in thread

From: skaller @ 1999-10-25 18:55 UTC (permalink / raw)
To: John Prevost; +Cc: caml-list

John Prevost wrote:

> Back to the charset/encoding module type.  Here's what I think might
> want to be in here.  I appeal to you for suggestions about things to
> remove or add.
>
> Charsets:
>
>  * a type for characters in the charset

I think you mean 'code points'.  These are logically distinct from
characters (which are kind of amorphous).  For example, a single
character -- if there is such a thing in the language -- may consist of
several code points combined, and there are code points which are not
characters.

>  * a type for strings in the charset (maybe char array, maybe not)

I think you mean 'script'.  Strings of code points can be used to
represent script.

>  * functions to determine the "class" of a character.  This would
>    probably involve a standard "character class" type, possibly
>    informed by character classes in Unicode.

Yes.  For ISO10646 plane 1 (Unicode), the data is readily available for
a few key attributes (such as character class, case mappings,
corresponding decimal digit, etc).

>  * functions to work with strings in the character set in order to do
>    standard manipulations.  If we said a string is always a char
>    array and that there are standard functions to work on strings
>    given the above, this might be something that can be done away
>    with.

Probably not.  There is a distinction, for example, between
concatenating two arrays of code points, and producing a new array of
code points corresponding to the concatenated script.  This is 'most
true' in Arabic, but it is also true in plain old English: the scripts
"It is a" and "nice day" require a space to be inserted between the
code point arrays to obtain the correctly concatenated script "It is a
nice day".

>  * functions to convert characters and strings to a reference format,
>    perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
>    of coverage, and without some common format, converting from one
>    character set to another is problematic.

I agree.  I think there are two choices here, UCS-4 and UTF-8 encodings
of ISO-10646.  UCS-4 is better for internal operations, UTF-8 for IO
and streaming operations (perhaps including regular expression
matching, where tables of 256 cases are more acceptable than 2^31 :-)

> Encodings:
>
> These are tied to charsets much of the time, but not always.
>
>  * functions to encode and decode strings in streams and buffers.

Yes.

> Locales:
>
>  * functions to do case mapping, collation, etc.

No.  It is generally accepted that 'locale' information is limited to
culturally sensitive variations like whether a full stop or comma is
used for a decimal point, and whether the date is written dd/mm/yy or
yy/mm/dd or mm/dd/yy.

Collation, case mapping, etc. are not locale data, but specific to a
particular script.

The tendency in i18n developments has been, I think, to divorce
character sets, encodings, collation, and script issues from the
locale: the locale may indicate the local language, but that is
independent (more or less) of script processing.

> (I think this is really why Java went to "the one true charset is
> Unicode".  Not just because of politics, but because interacting with
> mutually incompatible character sets can be a type-safety nightmare.)

Yes.  I am somewhat surprised to see an attempt to create a more
abstract interface to multiple character sets/encodings.  This area
tends, I think, to be complex, full of ad hoc rules, and so quirky as
to defy genuine abstraction.

Fixing a single standard (ISO10646) is a simpler alternative; even
simpler if there is a single reference encoding such as UCS-4 or UTF-8.
In that case, the functions that do the work can be specialised to a
well-researched International Standard.

It is still necessary to provide functions that encode/decode the
standard format to other formats (encodings/character sets), but no
functions need be provided to do things like collation or case mapping
for these other formats.

> One or more of the above might be functors, so that you can compose a
> character set, encoding, and locale to get what you want.  This, of
> course, gets into questions of whether certain character sets,
> locales, and encodings are interoperable, and how one might cause a
> type error when trying to combine an encoding and a character set
> that don't work together.  Dunno if this is possible.

I think it is, but it isn't desirable.  For example, it is possible to
use UCS-4 or UTF-8 encodings of ANY character set, since all have
integral code points to represent them: UCS-4 is universal for all sets
of fewer than 2^32 points, UTF-8 for sets of fewer than 2^31 points.

> How to recover from failure is of course a good question to try to
> answer as well.

There are, in fact, multiple answers; this is one of the complicating
factors.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 7+ messages in thread
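[Editorial note: the UCS-4/UTF-8 split described above is easy to make concrete. Below is a sketch of the encoding direction only, written for the original 1..6-byte form of UTF-8 that covers code points up to 2^31 - 1, as in the discussion; the function name is invented for this example, and decoding is the symmetric exercise.]

```ocaml
(* Encode a UCS-4 code point (a plain int, 0 .. 2^31 - 1) as UTF-8. *)
let utf8_of_code_point (cp : int) : string =
  let buf = Buffer.create 6 in
  let add n = Buffer.add_char buf (Char.chr (n land 0xff)) in
  if cp < 0 then invalid_arg "utf8_of_code_point"
  else if cp < 0x80 then add cp          (* ASCII: one byte, unchanged *)
  else begin
    (* nbytes: length of the encoding; lead: leading-byte marker bits *)
    let nbytes, lead =
      if cp < 0x800 then 2, 0xC0
      else if cp < 0x10000 then 3, 0xE0
      else if cp < 0x200000 then 4, 0xF0
      else if cp < 0x4000000 then 5, 0xF8
      else 6, 0xFC
    in
    (* leading byte carries the top bits of the code point *)
    add (lead lor (cp lsr (6 * (nbytes - 1))));
    (* each continuation byte is 10xxxxxx with the next 6 bits *)
    for i = nbytes - 2 downto 0 do
      add (0x80 lor ((cp lsr (6 * i)) land 0x3f))
    done
  end;
  Buffer.contents buf
```

For example, `utf8_of_code_point 0x20AC` produces the three bytes E2 82 AC. The 256-entry tables mentioned above come from the fact that a UTF-8 decoder only ever dispatches on one byte at a time.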
* Re: Request for Ideas: i18n issues
  1999-10-25 18:55 ` skaller
@ 1999-10-26  2:46   ` John Prevost
  1999-10-26 13:36     ` Frank A. Christoph
  1999-10-26 20:16     ` skaller
  0 siblings, 2 replies; 7+ messages in thread

From: John Prevost @ 1999-10-26 2:46 UTC (permalink / raw)
To: skaller; +Cc: caml-list

skaller <skaller@maxtal.com.au> writes:

> >  * a type for characters in the charset
>
> I think you mean 'code points'.  These are logically distinct from
> characters (which are kind of amorphous).  For example, a single
> character -- if there is such a thing in the language -- may consist
> of several code points combined, and there are code points which are
> not characters.

I'm not sure the distinction matters much at this level.  I am
predisposed to prefer the name "char" to the name "codepoint" just
because more people will understand it.  In the back of my head, I also
generally interpret "codepoint" as meaning the number associated with
the character, rather than the abstract "character" itself.  I believe
Unicode makes this distinction between a "glyph", a "character", and a
"codepoint".

> >  * a type for strings in the charset (maybe char array, maybe not)
>
> I think you mean 'script'.  Strings of code points can be used to
> represent script.

Uh.  Okay, whatever.  Again, I will tend to use the more traditional
word "string" to mean a sequence of characters.  (On reflection, I
think you're trying to say something deeper, but your explanation is
so vague I won't try to interpret what it is.  I leave that to you.)

> >  * functions to work with strings in the character set in order to
> >    do standard manipulations.  If we said a string is always a char
> >    array and that there are standard functions to work on strings
> >    given the above, this might be something that can be done away
> >    with.
>
> Probably not.  There is a distinction, for example, between
> concatenating two arrays of code points, and producing a new array of
> code points corresponding to the concatenated script.  This is 'most
> true' in Arabic, but it is also true in plain old English: the script
{...}

I don't believe that this is a job for the basic string manipulation
stuff to do.  There do need to be methods for manipulating strings as
sequences.  As such, I'm not going to worry about it at this level.

> >  * functions to convert characters and strings to a reference
> >    format, perhaps UCS-4.  UCS-4 isn't perfect, but it does have a
> >    great deal of coverage, and without some common format,
> >    converting from one character set to another is problematic.
>
> I agree.  I think there are two choices here, UCS-4 and UTF-8
> encodings of ISO-10646.  UCS-4 is better for internal operations,
> UTF-8 for IO and streaming operations (perhaps including regular
> expression matching where tables of 256 cases are more acceptable
> than 2^31 :-)

True.  Of course, there are ways and ways.  Regexps based on character
classes for efficiency are one (now I only need 5 bits (for Unicode,
anyway) to represent that I want "all letters or numbers or
connectors").  And there could be multi-level tables.

> > Locales:
> >
> >  * functions to do case mapping, collation, etc.
>
> No.  It is generally accepted that 'locale' information is limited to
> culturally sensitive variations like whether a full stop or comma is
> used for a decimal point, and whether the date is written dd/mm/yy or
> yy/mm/dd or mm/dd/yy.
>
> Collation, case mapping, etc. are not locale data, but specific to a
> particular script.
>
> The tendency in i18n developments has been, I think, to divorce
> character sets, encodings, collation, and script issues from the
> locale: the locale may indicate the local language, but that is
> independent (more or less) of script processing.

Okay.  I'm going to have a summary at the bottom of my message of the
various kinds of things Java has for this sort of thing, just as
something to think about.  (i.e. are these all separate things?  How
does a program get them?  etc.)
> > (I think this is really why Java went to "the one true charset is
> > Unicode".  Not just because of politics, but because interacting
> > with mutually incompatible character sets can be a type-safety
> > nightmare.)
>
> Yes.  I am somewhat surprised to see an attempt to create a more
> abstract interface to multiple character sets/encodings.  This area
> tends, I think, to be complex, full of ad hoc rules, and so quirky
> as to defy genuine abstraction.
>
> Fixing a single standard (ISO10646) is a simpler alternative; even
> simpler if there is a single reference encoding such as UCS-4 or
> UTF-8.  In that case, the functions that do the work can be
> specialised to a well-researched International Standard.
>
> It is still necessary to provide functions that encode/decode the
> standard format to other formats (encodings/character sets), but no
> functions need be provided to do things like collation or case
> mapping for these other formats.

The big difficulty here is that not everybody wants to eat Unicode.  I
think it's appropriate, but not everyone does.  And there are still
characters in iso-2022, for example, which have no Unicode code point.

I think that as I look at the problem more, though, I'm inclined to say
"one definitive set of characters" is a better idea.  Especially since
that set is needed for reasonable interoperation between the others.

Something I've noted looking at O'Caml these last few days: the
"string" type is really more an efficient byte array type.  And the
char type is really a byte type.  There's no real way to do "input
bytes from a stream" except inputting them as characters and then
interpreting those characters as bytes.

Here's what Java has, related to encodings, strings, characters,
collators, and so on:

java.lang
  Character      Represents individual Unicode characters.  Provides
                 methods to get character class, etc.
  String         Immutable string of characters.
  StringBuffer   Mutable string of characters.

java.util
  Locale         Appears to provide mainly identity, in a package based
                 on ISO language and country codes.  Methods to look up
                 "resource bundles".  Various formatting tools allow
                 you to get a new formatter given a specific locale
                 (perhaps by using their own resource bundles
                 describing which subclass should be used for each
                 locale.)

java.text
  BreakIterator      For finding word, sentence, para breaks in text.
  ChoiceFormat       Allows a complex mapping from values to strings.
                     (i.e. 0 -> "no files", 1 -> "1 file",
                      n -> "n files")
  Collator           Configurable comparison between strings.
  DateFormat         Parses and unparses date values.
  DecimalFormat      Configurable formatter of numbers
                     ("####.##" -> " 123.20")
  Format             Superclass for all these formats.  They all
                     apparently have to support parsing and unparsing
                     values in Java.
  MessageFormat      Formatting strings including argument number
                     specification, for message catalogs.
                     ("{0}'s {1}" -> "John's foo",
                      "{1} u {0}" -> "foo u Ivan")
  NumberFormat       Generic any-number formatter, not decimal.
  RuleBasedCollator  Collator that can be given strings that represent
                     rules for ordering strings.
  and a few others

java.io
  OutputStreamWriter  Writes strings to an output stream, in an
                      encoding specified by a string (the encoding's
                      name)
  InputStreamReader   Ditto for the other direction.
  ByteArray...

Anyway, I think this partitions things like so:

1) Basic char and string types
2) Locale type which exists solely for locale identity
3) Collator type which allows various sorting of strings
4) Formatter and parser types for different data values
4a) A sort of formatter which allows reordering of arguments in the
    output string is needed (not too hard).
5) Reader and writer types for encoding characters into various
   encodings

My current thought is that the char and string types should be defined
in terms of Unicode.  The char type should be abstract, and have a
function for converting to Unicode codepoints.
The new string type may want to be immutable, with buffers used for
changeable things instead.

You should be able to get ahold of collators, formatters, message
catalogs, default encodings, and the like by having a locale.  You
should of course be able to ignore them, too.

I'll try to get some type signatures for possibilities up in the next
day or so.

John.

^ permalink raw reply	[flat|nested] 7+ messages in thread
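[Editorial note: the abstract char defined in terms of Unicode code points, plus an immutable string type, might be sketched like this. The module names `Uchar_` and `Ustring` are invented here (the trailing underscore avoids clashing with any real library module), and the range check is the simple 31-bit ISO-10646 one from this thread.]

```ocaml
(* An abstract character that can only be inspected via its code point. *)
module Uchar_ : sig
  type t
  val of_code_point : int -> t   (* raises Invalid_argument out of range *)
  val code_point : t -> int
end = struct
  type t = int
  let of_code_point n =
    if n < 0 || n > 0x7FFFFFFF then invalid_arg "Uchar_.of_code_point"
    else n
  let code_point n = n
end

(* An immutable string of such characters.  The representation is an
   array, but since the type is abstract no caller can mutate it
   through this interface. *)
module Ustring : sig
  type t
  val of_list : Uchar_.t list -> t
  val length : t -> int
  val get : t -> int -> Uchar_.t
end = struct
  type t = Uchar_.t array
  let of_list = Array.of_list
  let length = Array.length
  let get = Array.get
end
```

The point of the abstraction is exactly the one made below in the thread: an int and a character should be distinct types, even if the representation is an int.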
* RE: Request for Ideas: i18n issues
  1999-10-26  2:46 ` John Prevost
@ 1999-10-26 13:36   ` Frank A. Christoph
  1999-10-26 22:02     ` skaller
  1999-10-26 20:16   ` skaller
  1 sibling, 1 reply; 7+ messages in thread

From: Frank A. Christoph @ 1999-10-26 13:36 UTC (permalink / raw)
To: John Prevost, skaller; +Cc: caml-list

John Prevost wrote in response to John Skaller:

> > Probably not.  There is a distinction, for example, between
> > concatenating two arrays of code points, and producing a new array
> > of code points corresponding to the concatenated script.  This is
> > 'most true' in Arabic, but it is also true in plain old English:
> > the script
> {...}
>
> I don't believe that this is a job for the basic string manipulation
> stuff to do.  There do need to be methods for manipulating strings as
> sequences.  As such, I'm not going to worry about it at this level.

Maybe he is arguing for a higher-level approach to text representation.
Text is, after all, an important enough application field that it
deserves some treatment in a standard library.  Unfortunately (as we
are all bearing witness to in this discussion) there is no standard way
to do it, even for one (natural) language, much less all of them.

> The big difficulty here is that not everybody wants to eat Unicode.
> I think it's appropriate, but not everyone does.  And there are still
> characters in iso-2022, for example, which have no Unicode code
> point.
>
> I think that as I look at the problem more, though, I'm inclined to
> say "one definitive set of characters" is a better idea.  Especially
> since that set is needed for reasonable interoperation between the
> others.

This is getting a little afield from the topic of the list, but I have
been wondering if maybe Unicode (and other related standards) might not
end up being most valuable in the long run not for being able to
represent text from existing languages, but for shaping the form of
languages to come.
I don't mean completely new languages of course (I think I heard
Klingon and Tengwar were actually submitted for inclusion!), but rather
that as electronic information becomes more and more central to
communication, the need to encode it will change the way people speak
and write in other media as well.  I think this is happening already,
sort of: for example, I've seen email emoticons used in published
materials, and on billboards here in Japan; and we all know a hundred
words from computer jargon that have made it into mainstream languages.

> Something I've noted looking at O'Caml these last few days: the
> "string" type is really more an efficient byte array type.  And the
> char type is really a byte type.  There's no real way to do "input
> bytes from a stream" except inputting them as characters and then
> interpreting those characters as bytes.

That's exactly the way I think of and use O'Caml characters and
strings.  It is sort of unfortunate that O'Caml inherited this
terminology from C...  I would have preferred "byte" or "octet" for
characters, at least.

--FAC

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Request for Ideas: i18n issues
  1999-10-26 13:36 ` Frank A. Christoph
@ 1999-10-26 22:02   ` skaller
  0 siblings, 0 replies; 7+ messages in thread

From: skaller @ 1999-10-26 22:02 UTC (permalink / raw)
To: Frank A. Christoph; +Cc: John Prevost, caml-list

"Frank A. Christoph" wrote:

> This is getting a little afield from the topic of the list, [...]
> (I think I heard Klingon and Tengwar were actually submitted for
> inclusion!)

Klingon characters are in the ISO10646 International Standard.  It
appears it should have been called 'Intergalactic Standard' though ;-)
[Couldn't resist]

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Request for Ideas: i18n issues
  1999-10-26  2:46 ` John Prevost
  1999-10-26 13:36   ` Frank A. Christoph
@ 1999-10-26 20:16   ` skaller
  1999-10-27  0:37     ` John Prevost
  1 sibling, 1 reply; 7+ messages in thread

From: skaller @ 1999-10-26 20:16 UTC (permalink / raw)
To: John Prevost; +Cc: caml-list

John Prevost wrote:
>
> skaller <skaller@maxtal.com.au> writes:
>
> > >  * a type for characters in the charset
> >
> > I think you mean 'code points'.  These are logically distinct from
> > characters (which are kind of amorphous).  For example, a single
> > character -- if there is such a thing in the language -- may
> > consist of several code points combined, and there are code points
> > which are not characters.
>
> I'm not sure the distinction matters much at this level.  I am
> predisposed to prefer the name "char" to the name "codepoint" just
> because more people will understand it.

I myself tend to use these (incorrect) words as well, through
familiarity.  However, in the longer run, vague, and even incorrect,
terminology will make it even harder to communicate.

My experience with this is from the C++ Standardisation committee
(rather than i18n forums).  In C++, there are a number of important and
distinct concepts without proper terminology, and it makes it almost
impossible to have a technical discussion about the C++ language.  Over
90% of all issues and debates centred around terminology and
interpretation, rather than functionality.

But let me give another clarifying example.  In Unicode, there are some
code points which are reserved, and do not represent characters, namely
the High and Low surrogates.  These are used to allow the two-unit,
16-bit UTF-16 encoding of the first 16 planes of ISO-10646.  So when
you see two of these together, they are not two 'characters', but in
fact an encoding of a SINGLE code point, which itself may or may not be
a character.
> In the back of my head, I also generally interpret "codepoint" as
> meaning the number associated with the character,

Yes, this is more or less correct.  And basic array/string
manipulations must first work with code points, rather than the
abstract 'characters'.  In fact, the 'meaning' of the code points as
characters is to be found in the functions which manipulate the code
points (sort of like an ocaml module).  More correctly, sequences of
code points can be understood as 'script', and manipulated to represent
script manipulations.

Thus: it is possible to do a lexicographical string sort by code
points, which is not the same as a native language sort on script.  The
former is a convenience for representing strings in _some_ total order,
for example, for use in an ocaml 'Set'.  The latter may not even be
well defined, or may have multiple definitions, depending on the script
or use -- for example, the usual code point order sort of ASCII
characters is utterly wrong for a book index.

> rather than the abstract "character" itself.  I believe Unicode makes
> this distinction between a "glyph", a "character", and a "codepoint".

Yes: a glyph is the shape, a font's character is a particular rendering
of it, a character is an abstraction (is 'A' the same character as
'a'??  The shape is different, the concept is the same).  A code point
is just a number in a set of numbers used to encode script.

The difference is subtle, and I do not claim to fully understand it,
but the ISO and Unicode committees, needing to create standards,
emphasise the distinctions (to the point of the liaison on the C++
committee complaining about the incorrect use of the word 'character'
in the C++ language draft).

> > >  * a type for strings in the charset (maybe char array, maybe
> > >    not)
> >
> > I think you mean 'script'.  Strings of code points can be used to
> > represent script.
>
> Uh.  Okay, whatever.  Again, I will tend to use the more traditional
> word "string" to mean a sequence of characters.
> (On reflection, I think you're trying to say something deeper, but
> your explanation is so vague I won't try to interpret what it is.  I
> leave that to you.)

Sorry: I do not understand it entirely myself.  But consider: the code
point for a space has no character.  There is no such thing as a
'space' character.  There is no glyph.  There _is_ such a thing as
white space in (latin) script, however.

> I don't believe that this is a job for the basic string manipulation
> stuff to do.  There do need to be methods for manipulating strings as
> sequences.  As such, I'm not going to worry about it at this level.

I agree.  But, then, you are only dealing with some kind of array of
code points.  Ocaml already has an 'array' type which can be used for
this purpose, and if there were a string type that were variable
length, we would be somewhat better off.  But manipulating these data
structures is entirely structural, and only relates to 'script' in the
sense that the data structures are convenient for working with script
representations.  That is, actual script related functions such as

 (1) trimming whitespace
 (2) capitalisation
 (3) collation
 (4) charset mapping and encoding/decoding

are, IMHO, quite separate.

> The big difficulty here is that not everybody wants to eat Unicode.

I know this is true.  However, the many issues have been addressed by
representatives of National Governments and industry, and a consensus
has been reached, and is embodied in an International Standard.

> I think it's appropriate, but not everyone does.  And there are still
> characters in iso-2022, for example, which have no Unicode code
> point.

These issues are being addressed by members of the ISO technical
committee.  Any complaints should be addressed to them.  Programmers
should probably implement what is standardised, and perhaps complain,
or provide input to the process, rather than going off on their own
private, probably doomed track.
[This is not the same issue, IMHO, as programming languages: there is
just ONE commonly accepted universal standard for script, namely
ISO-10646.]

> I think that as I look at the problem more, though, I'm inclined to
> say "one definitive set of characters" is a better idea.  Especially
> since that set is needed for reasonable interoperation between the
> others.

This is my feeling, partly for another reason: anything else is just
too complicated.  ISO 10646 is hard enough as it is.

> Something I've noted looking at O'Caml these last few days: the
> "string" type is really more an efficient byte array type.  And the
> char type is really a byte type.

Yes.  And it is needed as such, even though perhaps poorly named.
Something which represents byte strings is essentially distinct from
something representing 'script'/'text'.  IMHO.

> Anyway, I think this partitions things like so:
>
> 1) Basic char and string types
> 2) Locale type which exists solely for locale identity
> 3) Collator type which allows various sorting of strings
> 4) Formatter and parser types for different data values
> 4a) A sort of formatter which allows reordering of arguments in the
>     output string is needed (not too hard).
> 5) Reader and writer types for encoding characters into various
>    encodings

Seems reasonable.  I personally would address (1) and (5) first.  Also,
I think 'regexp' et al. fits in somewhere, and needs to be addressed.

> My current thought is that the char and string types should be
> defined in terms of Unicode.

Please NO.  ISO10646.  'Unicode' is no longer supported by Unicode
Corp, which is fully behind ISO10646.  "unicode" is for implementations
that will be out of date before anyone gets the tools to actually use
internationalised software (like text editors, fonts, etc).  Java has
gone this way, and will pay the price in the long run.  Python is going
that way too.  But Unicode is ALREADY out of date.
If work is to be done -- and there is a LOT of work in this area --
then it should conform to the far-sighted International Standard, which
has plenty of space for extension, and is supported by an international
consensus of National bodies and technical experts.

> The char type should be abstract,

Why?  ISO 10646 code points are just integers 0 .. 2^31-1, and it is
necessary to have some way of representing them as _literals_.  In
particular, you cannot 'match' easily on an abstract type, but matching
is going to be a very common operation:

  match wchar with
  | WhatDoIPutHere -> ...

One solution is to have a native wchar type in ocaml, with literals
recognized by the lexer .. but it seems to me that integers will do the
job, even if they can be abused by, for example, multiplication (which,
off hand, doesn't seem appropriate).  [That is, code points are a Z
module: they're offsets.]

> You should be able to get ahold of collators, formatters, message
> catalogs, default encodings, and the like by having a locale.  You
> should of course be able to ignore them, too.

I would leave most of this out, at least for the moment: it is enough
work to just handle encodings and mappings.  These things are important
in the transition from other character sets, including Latin-1,
ShiftJis, etc.  Without these mapping/encoding tools, it isn't really
possible to actually work with ISO10646, because there are not many
tools that can do so: most people use either 8-bit editors, or
specialised editors for their DBCS encoding (like ShiftJis).

> I'll try to get some type signatures for possibilities up in the next
> day or so.

Great!

BTW: I have some Python code for doing a lot of mappings/encodings (all
the ones available on the Unicode.org website).  This may be useful; I
can generate tables of any kind as required.  (Please send me private
email.)
Anyone interested can examine my web page:
http://www.triode.net.au/~skaller/unicode/index.html

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 7+ messages in thread
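[Editorial note: the point above about matching on literals is worth seeing in code. With plain ints as code points, literals and guards work directly in patterns; this is a sketch in that style, and the function `describe` is invented purely for illustration.]

```ocaml
(* Code points as plain ints: literals match directly, no constructors
   or conversion functions needed in patterns. *)
type wchar = int

let describe (c : wchar) : string =
  match c with
  | 0x0A -> "line feed"
  | 0x20 -> "space"
  | c when 0x30 <= c && c <= 0x39 -> "decimal digit"
  | c when 0xD800 <= c && c <= 0xDFFF -> "surrogate (not a character)"
  | _ -> "something else"
```

The cost of this convenience is exactly the one conceded above: nothing stops a caller from multiplying two "characters", since the type system sees only int.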
* Re: Request for Ideas: i18n issues
  1999-10-26 20:16 ` skaller
@ 1999-10-27  0:37   ` John Prevost
  0 siblings, 0 replies; 7+ messages in thread

From: John Prevost @ 1999-10-27 0:37 UTC (permalink / raw)
To: skaller; +Cc: caml-list

skaller <skaller@maxtal.com.au> writes:

> Sorry: I do not understand it entirely myself.  But consider: the
> code point for a space has no character.  There is no such thing as
> a 'space' character.  There is no glyph.  There _is_ such a thing as
> white space in (latin) script, however.

I think this is overly pedantic.  Let's try not to go out this far.

> > 1) Basic char and string types
> > 2) Locale type which exists solely for locale identity
> > 3) Collator type which allows various sorting of strings
> > 4) Formatter and parser types for different data values
> > 4a) A sort of formatter which allows reordering of arguments in the
> >     output string is needed (not too hard).
> > 5) Reader and writer types for encoding characters into various
> >    encodings
>
> Seems reasonable.  I personally would address (1) and (5) first.
> Also, I think 'regexp' et al fits in somewhere, and needs to be
> addressed.

True.  And you're right, (1) and (5) are probably the most important
things.  I'm just trying to clarify what sorts of things are needed
for i18n stuff.

> > My current thought is that the char and string types should be
> > defined in terms of Unicode.
>
> Please NO.  ISO10646.  'Unicode' is no longer supported by Unicode
> Corp, which is fully behind ISO10646.  "unicode" is for
> implementations that will be out of date before anyone gets the
> tools to actually use internationalised software (like text editors,
> fonts, etc).

Ahh.  I'm not entirely sure which distinction you are making here.  I
use the phrase "Unicode" as meaning "ISO 10646", because I find it
easier to pronounce.  I understand the differences, but not why you
went off on a flame when you saw "Unicode".
As I understand it, the goal in Unicode has been conformance with
ISO 10646 since at least revision 2.0.

> > The char type should be abstract,
>
> Why? ISO 10646 code points are just integers 0 .. 2^31-1, and
> it is necessary to have some way of representing them as
> _literals_. In particular, you cannot 'match' easily on an abstract
> type, but matching is going to be a very common operation:
>
>   match wchar with
>   | WhatDoIPutHere -> ...

This is true, and having base-language support for wchar literals would
be ideal. However, the type of characters and the type of integers
should be disjoint, since the number 32 and the character ' ' are
distinct. You do not want to apply a function on characters to an
integer, because they are semantically different. This is precisely the
sort of thing the type system in ML is supposed to be used for, and I
won't stand for leaving such an important constraint out of the
abstraction. You may map an integer into the character with the
equivalent code point and vice versa, but a character is not an integer.
(In the implementation, of course, it actually is.)

From another part of your message:

> But let me give another clarifying example. In Unicode, there
> are some code points which are reserved, and do not represent
> characters, namely the High and Low surrogates. These are used to
> allow two-byte 16-bit UTF-16 encoding of the first 16 planes of
> ISO 10646.

I'd like to see this issue not exist in O'Caml's character type. To me,
a surrogate is half a character. If I have an array of characters, I
should not be able to look up a cell and get half a character.
Therefore, the base character type in O'Caml should be defined in terms
of the code point itself. This still leaves the surrogate area as
problematic: what if I want to be able to deal with input in which
surrogates are mismatched? What if the library client asks me to turn an
integer which is the code point of a surrogate into a character?
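To make the "a character is not an integer" point concrete, here is a
minimal sketch of what such an abstract character type could look like in
O'Caml. The module name `Wchar` and the `of_int`/`to_int` names are
illustrative only, not a proposed API; the surrogate check reflects the
policy argued for above (reject an integer whose code point is a
surrogate at construction time).

```ocaml
(* Hypothetical sketch: an abstract ISO 10646 character type.
   The representation is an int, but that fact is hidden by the
   signature, so a Wchar.t and an int cannot be confused. *)
module Wchar : sig
  type t                               (* abstract: disjoint from int *)
  exception Invalid_codepoint of int
  val of_int : int -> t                (* rejects surrogates, range errors *)
  val to_int : t -> int
end = struct
  type t = int
  exception Invalid_codepoint of int
  let of_int n =
    (* D800-DFFF are the UTF-16 surrogate code points: "half a
       character", never a character on their own. ISO 10646 code
       points range over 0 .. 2^31-1. *)
    if n < 0 || n > 0x7FFFFFFF || (n >= 0xD800 && n <= 0xDFFF)
    then raise (Invalid_codepoint n)
    else n
  let to_int c = c
end
```

With this in place, `Wchar.of_int 0xD800` raises instead of producing a
half-character, which is one answer to the "what if the client hands me
a surrogate code point" question: the library refuses at the boundary.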
But I think that in O'Caml, the character type should never contain
"half a character".

Your other message:

> In that process, I have learned a bit; mainly that:
>
>   (1) it is complicated
>   (2) it is not wise to specify interfaces based on
>       vague assumptions.
>
> That is, I'd like to suggest focussing on encodings/mappings, and
> character classes; and I believe that you will find that the issues
> demand different interfaces than you might think at first: I know
> that what I _thought_ made sense, at first, turned out to be not
> really correct.

I agree, and this is where I'll focus first. At the same time, one might
argue that an object that represents collation, at some level, has a
specific interface which allows an operation: comparing two pieces of
script (to employ an excessively long turn of phrase) for order.
Regardless of how the order is determined, that's what the goal is.

I don't intend to try to come up with some things and have them set in
stone. I intend this as an exercise to come up with some tools which
people can use, try out, fiddle with, and make suggestions about.

> On things like collation, there is no way I can see to proceed
> without obtaining the relevant ISO documents. I seriously doubt
> that it is possible to design an 'abstract' interface, without
> actually first trying to implement the algorithms: the interfaces
> will be determined by concrete details and issues.

In this case, I don't propose to describe in detail all of the issues of
collating. But, in order to provide a non-codepoint-ordered mechanism
for ordering strings, and maybe provide a way to find such a mechanism,
it might be worthwhile to explore a few possibilities, even if they come
to nought in the end. As for myself, I do not generally have the funds
or wherewithal to obtain ISO documents. I do have a copy of the Unicode
v2.0 book handy, but it doesn't concern itself with such issues.

Anyway, you're right, the encoding and character class stuff comes
first.
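The "object that represents collation" idea can be sketched without
settling any of the hard ordering questions. Since a locale is selected
at runtime, a collator is best modelled as a value rather than a module;
a record of functions is the simplest such value. All names here are
hypothetical, and the only collator shown is the trivial codepoint one.

```ocaml
(* Hypothetical sketch: a collator is just a value carrying a string
   comparison, so different locales can supply different orderings
   at runtime without changing any client code. *)
type collator = {
  compare : string -> string -> int;   (* < 0, 0, > 0: like compare *)
}

(* The trivial collator: plain codepoint (byte) order. A real locale
   would substitute a tailored comparison here. *)
let codepoint_collator = { compare = String.compare }

(* Clients are written against the interface, not the locale: *)
let sort_strings col lst = List.sort col.compare lst
```

Whatever the eventual interface looks like, the point stands: regardless
of how the order is determined, comparing two pieces of script for order
is the one operation every collator must expose.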
I think format strings and message catalogs are a doable second goal,
which actually won't cause too much pain. Ways to format dates and the
like are harder, not least because you have to have a standard way to
express a date before you can have a standard interface for outputting a
date in some localized format.

Anyway, encodings first. Then the world. I'll take a look at your Python
stuff.

John.
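For the message-catalog half of that second goal, the core lookup really
is painless. The sketch below is a toy, assuming an in-memory
association list keyed by locale; the names (`catalogs`, `gettext`, the
example keys and locales) are all invented for illustration. The one
design point it shows is the fallback: an unknown locale or key returns
the key itself rather than failing, which is the usual catalog behaviour.

```ocaml
(* Hypothetical toy message catalog: locale -> (key -> translation). *)
let catalogs = [
  ("en", [ ("greeting", "Hello, world") ]);
  ("fr", [ ("greeting", "Bonjour, monde") ]);
]

(* Look up a message for a locale; fall back to the key itself when
   either the locale or the message is missing, so output degrades
   gracefully instead of raising. *)
let gettext locale key =
  match List.assoc_opt locale catalogs with
  | None -> key
  | Some cat ->
      (match List.assoc_opt key cat with
       | Some msg -> msg
       | None -> key)
```

Argument reordering in format strings (goal 4a) is the harder part and
is not addressed by this sketch.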
end of thread, other threads:[~1999-10-28 17:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-10-22  0:25 Request for Ideas: i18n issues John Prevost
1999-10-25 18:55 ` skaller
1999-10-26  2:46   ` John Prevost
1999-10-26 13:36     ` Frank A. Christoph
1999-10-26 22:02       ` skaller
1999-10-26 20:16   ` skaller
1999-10-27  0:37     ` John Prevost