Re: Request for Ideas: i18n issues

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: skaller <skaller@maxtal.com.au>
To: John Prevost <prevost@maya.com>
Cc: caml-list@inria.fr
Subject: Re: Request for Ideas: i18n issues
Date: Tue, 26 Oct 1999 04:55:01 +1000	[thread overview]
Message-ID: <3814A785.B01AFAD4@maxtal.com.au> (raw)
In-Reply-To: <m3yacws0lu.fsf@isil.maya.com>

John Prevost wrote:

> Back to the charset/encoding module type.  Here's what I think might
> want to be in here.  I appeal to you for suggestions about things to
> remove or add.
> 
>   Charsets:
> 
>   * a type for characters in the charset

	I think you mean 'code points'. These are
logically distinct from characters (which are kind of
amorphous). For example, a single character -- if there
is such a thing in the language -- may consists of several
code points combined, and there are code points which are
not characters.
 
>   * a type for strings in the charset (maybe char array, maybe not)

	I think you mean 'script'. Strings of code points
can be used to represent script.
 
>   * functions to determine the "class" of a character.  This would
>     probably involve a standard "character class" type, possibly
>     informed by character classes in Unicode.

	Yes. For ISO10646 plane 1 (Unicode), the data is readily
available for a few key attributes (such as character class,
case mappings, corresponding decimal digit, etc).
 
>   * functions to work with strings in the character set in order to do
>     standard manipulations.  If we said a string is always a char
>     array and that there are standard functions to work on strings
>     given the above, this might be something that can be done away
>     with.

	Probably not. There is a distinction, for example, between
concatenating two arrays of code points, and producing a new
array of code points corresponding to the concatenated script.
This is 'most true' in Arabic, but it is also true in
plain old English: the script

	"It is a"

and 

	"nice day"

requires a space to be inserted between the code point arrays
to obtain the correctly concatenated script

	
"It is a nice day"
 
>   * functions to convert characters and strings to a reference format,
>     perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
>     of coverage, and without some common format, converting from one
>     character set to another is problemmatic.

	I agree. I think there are two choices here, UCS-4 and UTF-8
encodings of ISO-10646. UCS-4 is better for internal operations,
UTF-8 for IO and streaming operations (perhaps including
regular expression matching where tables of 256 cases are more
acceptable than 2^31 :-)
 
>   Encodings:
> 
>     These are tied to charsets much of the time, but not always.
> 
>   * functions to encode and decode strings in streams and buffers.

	Yes.
 
>   Locales:
> 
>   * functions to do case mapping, collation, etc.

	No. It is generally accepted that 'locale' information
is limited to culturally sensitive variations like whether
full stop or comma is used for a decimal point, and whether
the date is written dd/mm/yy or yy/mm/dd or mm/dd/yy.

	Collation, case mapping, etc are not locale
data, but specific to a particular script. 

	The tendency in i18n developments has been, I think,
to divorce character sets, encodings, collation, and script
issues from the locale: the locale may indicate the local
language, but that is independent (more or less) of
script processing.
 
> (I think this is really why Java went to "the one true charset is
> Unicode".  Not just because of politics, but because interacting with
> mutually incompatible character sets can be a type-safety nightmare.)

	Yes. I am somewhat suprised to see an attempt to create a more
abstract interface to multiple character sets/encodings. This area
tends, I think, to be complex, full of ad hoc rules, and so quirky
as to defy genuine abstraction.

	Fixing a single standard (ISO10646) is a simpler
alternative; even simpler if there is a single reference encoding
such as UCS-4 or UTF-8. In that case, the functions that do the
work can be specialised to a well researched International Standard.
 
	It is still necessary to provide functions that
encode/decode the standard format to other formats (encodings/character
sets),
but no functions need be provided to do things like collation or
case mapping for these other formats.

> One or more of the above might be functors, so that you can compose a
> character set, encoding, and locale to get what you want.  This, of
> course, gets into questions of whether certain character sets,
> locales, and encodings are interoperable, and how one might cause a
> type error when trying to combine an encoding and a character set that
> don't work together.  Dunno if this is possible. 

	I think it is, but it isn't desriable. For example,
it is possible to use UCS-4 or UTF-8 encodings of ANY 
character set, since all have integral code points to represent
them: UCS-4 is universal for all sets of less than 2^32 points,
UTF-8 for sets less than 2^31 points.

> How to recover from
> failure is of course a good question to try to answer as well.

	There are, in fact, multiple answers; this is one
of the complicating factors.
 
-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

next prev parent reply	other threads:[~1999-10-25 18:55 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
1999-10-22  0:25 John Prevost
1999-10-25 18:55 ` skaller [this message]
1999-10-26  2:46   ` John Prevost
1999-10-26 13:36     ` Frank A. Christoph
1999-10-26 22:02       ` skaller
1999-10-26 20:16     ` skaller
1999-10-27  0:37       ` John Prevost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3814A785.B01AFAD4@maxtal.com.au \
    --to=skaller@maxtal.com.au \
    --cc=caml-list@inria.fr \
    --cc=prevost@maya.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).