Request for Ideas: i18n issues

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: John Prevost <prevost@maya.com>
To: caml-list@inria.fr
Subject: Request for Ideas: i18n issues
Date: 21 Oct 1999 20:25:17 -0400	[thread overview]
Message-ID: <m3yacws0lu.fsf@isil.maya.com> (raw)

I think there are a number of issues for adding i18n support to
O'Caml.  One of the first issues that should probably be addressed is
providing a standard signature for a "character set/encoding" module.
(This isn't necessarily part of the stock distributed O'Caml
libraries, but it might make sense if the standard library could take
advantage of it.  Tradeoffs.)  This is the heart of my message, and
continues after a short digression.

The other issue I've thought about for a while has to do with wanting
to use non-8859-1 characters in O'Caml code.  I don't know if this is
a necessity (though the language geek in me says "yes, please!), but
it would be interesting.  The difficulty I see here is mainly the
syntactic distinction between Uidents and Lidents.  If I want to
define a symbol named by the kanji "san", it's hard to say whether
that's an uppercase ident or a lowercase one.  (Why would I want to?
Like I said, I'm a language geek.)

I understand that Caml didn't have this restriction, and that it was
added to deal with understanding things like Foo.bar vs foo.bar (Foo
is a module, foo is a record value) But would it be possible to remove
the distinction again?

Back to the charset/encoding module type.  Here's what I think might
want to be in here.  I appeal to you for suggestions about things to
remove or add.

  Charsets:

  * a type for characters in the charset

  * a type for strings in the charset (maybe char array, maybe not)

  * functions to determine the "class" of a character.  This would
    probably involve a standard "character class" type, possibly
    informed by character classes in Unicode.

  * functions to work with strings in the character set in order to do
    standard manipulations.  If we said a string is always a char
    array and that there are standard functions to work on strings
    given the above, this might be something that can be done away
    with.

  * functions to convert characters and strings to a reference format,
    perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
    of coverage, and without some common format, converting from one
    character set to another is problemmatic.

  Encodings:

    These are tied to charsets much of the time, but not always.

  * functions to encode and decode strings in streams and buffers.

  Locales:

  * functions to do case mapping, collation, etc.

I'm sure I've missed useful features that should be in these
categories.  Some of these might be functors.  Some might be objects,
since modules aren't values, and you may very well want to select a
locale/encoding at runtime.  Selecting a charset is more problemmatic.
(I think this is really why Java went to "the one true charset is
Unicode".  Not just because of politics, but because interacting with
mutually incompatible character sets can be a type-safety nightmare.)

One or more of the above might be functors, so that you can compose a
character set, encoding, and locale to get what you want.  This, of
course, gets into questions of whether certain character sets,
locales, and encodings are interoperable, and how one might cause a
type error when trying to combine an encoding and a character set that
don't work together.  Dunno if this is possible.  How to recover from
failure is of course a good question to try to answer as well.

I'll see if I can carve out some initial guesses at what could be used
above, and watch the list for commentary.  I've a week off next week
and will probably be hacking around on my own stuff (probably the
UCS-4 based unicode module I've been putting off for a while.)

John Prevost.

next             reply	other threads:[~1999-10-25 15:32 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
1999-10-22  0:25 John Prevost [this message]
1999-10-25 18:55 ` skaller
1999-10-26  2:46   ` John Prevost
1999-10-26 13:36     ` Frank A. Christoph
1999-10-26 22:02       ` skaller
1999-10-26 20:16     ` skaller
1999-10-27  0:37       ` John Prevost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=m3yacws0lu.fsf@isil.maya.com \
    --to=prevost@maya.com \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).