From mboxrd@z Thu Jan  1 00:00:00 1970
To: caml-list@inria.fr
Subject: Request for Ideas: i18n issues
From: John Prevost <prevost@maya.com>
Date: 21 Oct 1999 20:25:17 -0400
Message-ID: <m3yacws0lu.fsf@isil.maya.com>
User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

I think there are a number of issues for adding i18n support to
O'Caml.  One of the first issues that should probably be addressed is
providing a standard signature for a "character set/encoding" module.
(This isn't necessarily part of the stock distributed O'Caml
libraries, but it might make sense if the standard library could take
advantage of it.  Tradeoffs.)  This is the heart of my message, and
continues after a short digression.


The other issue I've thought about for a while has to do with wanting
to use non-8859-1 characters in O'Caml code.  I don't know if this is
a necessity (though the language geek in me says "yes, please!"), but
it would be interesting.  The difficulty I see here is mainly the
syntactic distinction between Uidents and Lidents.  If I want to
define a symbol named by the kanji "san", it's hard to say whether
that's an uppercase ident or a lowercase one.  (Why would I want to?
Like I said, I'm a language geek.)

I understand that Caml didn't have this restriction, and that it was
added to help distinguish things like Foo.bar from foo.bar (Foo is a
module, foo is a record value).  But would it be possible to remove
the distinction again?
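
As a rough illustration of the distinction (the names Foo, foo, and
bar are made up for the example), the capital letter is the only thing
that tells the parser which reading of "X.bar" is meant:

```ocaml
(* A module named Foo that contains a value called bar. *)
module Foo = struct
  let bar = 1
end

(* A record type with a field called bar, and a record value named foo. *)
type r = { bar : int }
let foo = { bar = 2 }

let from_module = Foo.bar   (* module path: the value bar inside Foo *)
let from_record = foo.bar   (* field access: the field bar of foo *)
```

An identifier written in kanji has no case, so the lexer would need
some other rule to pick between these two readings.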


Back to the charset/encoding module type.  Here's what I think might
belong in it.  I appeal to you for suggestions about things to remove
or add.

  Charsets:

  * a type for characters in the charset

  * a type for strings in the charset (maybe char array, maybe not)

  * functions to determine the "class" of a character.  This would
    probably involve a standard "character class" type, possibly
    informed by character classes in Unicode.

  * functions to work with strings in the character set in order to do
    standard manipulations.  If we said a string is always a char
    array and that there are standard functions to work on strings
    given the above, this might be something that can be done away
    with.

  * functions to convert characters and strings to a reference format,
    perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
    of coverage, and without some common format, converting from one
    character set to another is problematic.
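
One possible shape for such a signature, as a sketch only.  Every name
here (CHARSET, char_class, classify, to_ucs4, of_ucs4, Latin1) is
hypothetical, an int stands in for a UCS-4 code point, and the
classifier only knows about the ASCII range:

```ocaml
(* Hypothetical character classes, loosely in the spirit of Unicode's. *)
type char_class = Letter | Digit | Space | Punctuation | Other

module type CHARSET = sig
  type ch                          (* a character in this charset *)
  type str                         (* a string in this charset *)
  val classify : ch -> char_class
  val length : str -> int
  val get : str -> int -> ch
  val to_ucs4 : ch -> int          (* code point in the reference format *)
  val of_ucs4 : int -> ch option   (* None if not representable here *)
end

(* ISO 8859-1 as an instance: its 256 characters coincide with the
   first 256 code points of the reference format. *)
module Latin1 : CHARSET with type ch = char and type str = string = struct
  type ch = char
  type str = string
  let classify c =
    match c with
    | 'a' .. 'z' | 'A' .. 'Z' -> Letter
    | '0' .. '9' -> Digit
    | ' ' | '\t' | '\n' | '\r' -> Space
    | '!' .. '/' | ':' .. '@' | '[' .. '`' | '{' .. '~' -> Punctuation
    | _ -> Other                   (* accented letters etc. left unclassified *)
  let length = String.length
  let get = String.get
  let to_ucs4 = Char.code
  let of_ucs4 n = if n >= 0 && n < 256 then Some (Char.chr n) else None
end
```

Returning an option from of_ucs4 is one way to make "this character
does not exist in the target charset" visible to the caller when
converting through the reference format.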

  Encodings:

    These are tied to charsets much of the time, but not always.

  * functions to encode and decode strings in streams and buffers.
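
To make the encode/decode idea concrete, here is a minimal sketch using
UTF-8 as the encoding and plain ints as code points.  The ENCODING
signature, the Malformed exception, and the helper encode_string are
assumptions made for the example, and the decoder does not reject
overlong forms or surrogates, so it is not a validating decoder:

```ocaml
exception Malformed of int   (* byte offset where decoding failed *)

module type ENCODING = sig
  val encode : Buffer.t -> int -> unit   (* append one code point *)
  val decode : string -> int list        (* raises Malformed on bad input *)
end

module Utf8 : ENCODING = struct
  let encode buf cp =
    let add n = Buffer.add_char buf (Char.chr n) in
    if cp < 0 || cp > 0x10FFFF then invalid_arg "Utf8.encode"
    else if cp < 0x80 then add cp
    else if cp < 0x800 then begin
      add (0xC0 lor (cp lsr 6));
      add (0x80 lor (cp land 0x3F))
    end else if cp < 0x10000 then begin
      add (0xE0 lor (cp lsr 12));
      add (0x80 lor ((cp lsr 6) land 0x3F));
      add (0x80 lor (cp land 0x3F))
    end else begin
      add (0xF0 lor (cp lsr 18));
      add (0x80 lor ((cp lsr 12) land 0x3F));
      add (0x80 lor ((cp lsr 6) land 0x3F));
      add (0x80 lor (cp land 0x3F))
    end

  let decode s =
    let n = String.length s in
    (* Read the continuation byte at offset i and return its 6 payload bits. *)
    let cont i =
      if i >= n then raise (Malformed i);
      let b = Char.code s.[i] in
      if b land 0xC0 <> 0x80 then raise (Malformed i);
      b land 0x3F
    in
    let rec go i acc =
      if i >= n then List.rev acc
      else
        let b = Char.code s.[i] in
        if b < 0x80 then go (i + 1) (b :: acc)
        else if b land 0xE0 = 0xC0 then
          (let c1 = cont (i + 1) in
           go (i + 2) ((((b land 0x1F) lsl 6) lor c1) :: acc))
        else if b land 0xF0 = 0xE0 then
          (let c1 = cont (i + 1) in
           let c2 = cont (i + 2) in
           go (i + 3) ((((b land 0x0F) lsl 12) lor (c1 lsl 6) lor c2) :: acc))
        else if b land 0xF8 = 0xF0 then
          (let c1 = cont (i + 1) in
           let c2 = cont (i + 2) in
           let c3 = cont (i + 3) in
           go (i + 4)
             ((((b land 0x07) lsl 18) lor (c1 lsl 12) lor (c2 lsl 6) lor c3)
              :: acc))
        else raise (Malformed i)
    in
    go 0 []
end

(* Convenience wrapper: encode a whole list of code points. *)
let encode_string cps =
  let buf = Buffer.create 16 in
  List.iter (Utf8.encode buf) cps;
  Buffer.contents buf
```

For instance, encode_string [0x4E09] (the kanji "san") yields the three
bytes E4 B8 89, and Utf8.decode turns them back into [0x4E09].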

  Locales:

  * functions to do case mapping, collation, etc.


I'm sure I've missed useful features that should be in these
categories.  Some of these might be functors.  Some might be objects,
since modules aren't values, and you may very well want to select a
locale/encoding at runtime.  Selecting a charset is more problematic.
(I think this is really why Java went to "the one true charset is
Unicode".  Not just because of politics, but because interacting with
mutually incompatible character sets can be a type-safety nightmare.)
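
A sketch of the object approach for locales, assuming a made-up class
type with just a name, a case mapping, and a collation order.  The two
locales shown ("C" and a case-folding one called "fold") are toy
examples that only handle ASCII:

```ocaml
(* Hypothetical interface for a locale chosen at runtime. *)
class type locale = object
  method name : string
  method uppercase : char -> char
  method compare : string -> string -> int
end

let ascii_upper c =
  if c >= 'a' && c <= 'z' then Char.chr (Char.code c - 32) else c

(* Plain byte-order collation. *)
let c_locale : locale = object
  method name = "C"
  method uppercase = ascii_upper
  method compare a b = String.compare a b
end

(* Collation that ignores ASCII case. *)
let folding_locale : locale = object
  method name = "fold"
  method uppercase = ascii_upper
  method compare a b =
    String.compare (String.map ascii_upper a) (String.map ascii_upper b)
end

(* The choice is an ordinary runtime value, e.g. from a config string. *)
let select_locale name =
  match name with
  | "fold" -> folding_locale
  | _ -> c_locale

let sort_with (l : locale) words = List.sort (l#compare) words
```

Because both objects have the same type, code such as sort_with does
not care which one it is handed.  All of them share one character
type, though, which is exactly why selecting a charset this way is the
harder case.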

Functors would let you compose a character set, encoding, and locale
to get what you want.  This, of
course, gets into questions of whether certain character sets,
locales, and encodings are interoperable, and how one might cause a
type error when trying to combine an encoding and a character set that
don't work together.  Dunno if this is possible.  How to recover from
failure is of course a good question to try to answer as well.
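
For what it's worth, a sharing constraint on the functor argument is
one way to get such a type error.  The sketch below uses deliberately
cut-down signatures (all module and function names are hypothetical):
the encoding declares which character type it understands, and the
functor insists that it match the charset's:

```ocaml
module type CHARSET = sig
  type ch
  val to_ucs4 : ch -> int
end

module type ENCODING = sig
  type ch                               (* characters this encoding accepts *)
  val encode : Buffer.t -> ch -> unit
end

(* Compose a charset with an encoding over the same character type. *)
module Text (C : CHARSET) (E : ENCODING with type ch = C.ch) = struct
  let encode_all chars =
    let buf = Buffer.create 16 in
    List.iter (E.encode buf) chars;
    Buffer.contents buf
  let code_points chars = List.map C.to_ucs4 chars
end

module Latin1 = struct
  type ch = char
  let to_ucs4 = Char.code
end

module Latin1_bytes = struct
  type ch = char
  let encode buf c = Buffer.add_char buf c
end

module Ucs4 = struct
  type ch = int
  let to_ucs4 cp = cp
end

(* Accepted: both sides agree that ch = char. *)
module Latin1_text = Text (Latin1) (Latin1_bytes)

(* Rejected by the type checker, since Ucs4.ch = int but
   Latin1_bytes.ch = char:

   module Bad = Text (Ucs4) (Latin1_bytes)
*)
```

This only catches mismatches that show up in the types.  Two charsets
that both use char as their representation would still be
indistinguishable unless their character types are made abstract.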


I'll see if I can carve out some initial guesses at what could be used
above, and watch the list for commentary.  I've a week off next week
and will probably be hacking around on my own stuff (probably the
UCS-4 based Unicode module I've been putting off for a while).


John Prevost.