From: John Prevost
To: caml-list@inria.fr
Subject: Request for Ideas: i18n issues
Date: 21 Oct 1999 20:25:17 -0400

I think there are a number of issues in adding i18n support to O'Caml.
One of the first that should probably be addressed is providing a
standard signature for a "character set/encoding" module. (This isn't
necessarily part of the stock distributed O'Caml libraries, but it
might make sense if the standard library could take advantage of it.
Tradeoffs.) This is the heart of my message, and continues after a
short digression.

The other issue I've thought about for a while has to do with wanting
to use non-8859-1 characters in O'Caml code. I don't know if this is a
necessity (though the language geek in me says "yes, please!"), but it
would be interesting.

The difficulty I see here is mainly the syntactic distinction between
Uidents and Lidents. If I want to define a symbol named by the kanji
"san", it's hard to say whether that's an uppercase ident or a
lowercase one. (Why would I want to? Like I said, I'm a language geek.)
I understand that Caml didn't have this restriction, and that it was
added to help distinguish things like Foo.bar vs foo.bar (Foo is a
module, foo is a record value). But would it be possible to remove the
distinction again?

Back to the charset/encoding module type. Here's what I think might
belong in it. I appeal to you for suggestions about things to remove
or add.

Charsets:

  * a type for characters in the charset
  * a type for strings in the charset (maybe char array, maybe not)
  * functions to determine the "class" of a character. This would
    probably involve a standard "character class" type, possibly
    informed by character classes in Unicode.
  * functions to work with strings in the character set in order to do
    standard manipulations. If we said a string is always a char array
    and that there are standard functions to work on strings given the
    above, this might be something that can be done away with.
  * functions to convert characters and strings to a reference format,
    perhaps UCS-4. UCS-4 isn't perfect, but it does have a great deal
    of coverage, and without some common format, converting from one
    character set to another is problematic.

Encodings:

These are tied to charsets much of the time, but not always.

  * functions to encode and decode strings in streams and buffers.

Locales:

  * functions to do case mapping, collation, etc.

I'm sure I've missed useful features that should be in these
categories.

Some of these might be functors. Some might be objects, since modules
aren't values, and you may very well want to select a locale/encoding
at runtime. Selecting a charset is more problematic. (I think this is
really why Java went to "the one true charset is Unicode". Not just
because of politics, but because interacting with mutually
incompatible character sets can be a type-safety nightmare.)

One or more of the above might be functors, so that you can compose a
character set, encoding, and locale to get what you want.
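To make the charset items above a bit more concrete, here is a rough
sketch of what such a signature might look like, with conversion
between two charsets going through UCS-4 as the reference format. Every
name in it (CHARSET, char_class, Convert, and so on) is made up for
illustration; it is a guess at one possible shape, not a proposal for
the standard library.

```ocaml
(* Hypothetical sketch: all names are assumptions, not an existing API. *)

(* A UCS-4 code point, used as the common reference format. *)
type ucs4 = int

(* A deliberately tiny character class type; a real one would likely
   follow the Unicode general categories. *)
type char_class = Letter | Digit | Space | Other

module type CHARSET = sig
  type t                          (* a character in this charset *)
  type str = t array              (* assumes strings are char arrays *)
  val classify : t -> char_class
  val to_ucs4 : t -> ucs4
  val of_ucs4 : ucs4 -> t option  (* None if not representable *)
end

module Latin1 : CHARSET with type t = char = struct
  type t = char
  type str = t array
  let classify = function
    | 'a' .. 'z' | 'A' .. 'Z' -> Letter
    | '\192' .. '\214' | '\216' .. '\246' | '\248' .. '\255' -> Letter
    | '0' .. '9' -> Digit
    | ' ' | '\t' | '\n' | '\r' -> Space
    | _ -> Other
  let to_ucs4 = Char.code
  let of_ucs4 n = if n >= 0 && n < 256 then Some (Char.chr n) else None
end

module Ascii : CHARSET with type t = char = struct
  type t = char
  type str = t array
  let classify c = if Char.code c < 128 then Latin1.classify c else Other
  let to_ucs4 = Char.code
  let of_ucs4 n = if n >= 0 && n < 128 then Some (Char.chr n) else None
end

(* Convert from charset A to charset B by way of UCS-4. *)
module Convert (A : CHARSET) (B : CHARSET) = struct
  let convert_char (c : A.t) : B.t option = B.of_ucs4 (A.to_ucs4 c)

  (* All or nothing: None if any character has no counterpart in B. *)
  let convert_string (s : A.str) : B.str option =
    try
      Some (Array.map
              (fun c ->
                 match convert_char c with
                 | Some c' -> c'
                 | None -> raise Exit)
              s)
    with Exit -> None
end

module Latin1_to_ascii = Convert (Latin1) (Ascii)

(* "cafe" survives the trip; "caf\233" (e-acute) does not. *)
let plain = Latin1_to_ascii.convert_string [| 'c'; 'a'; 'f'; 'e' |]
let lossy = Latin1_to_ascii.convert_string [| 'c'; 'a'; 'f'; '\233' |]
```

Exposing "type t = char" in both modules keeps the sketch short, but it
also lets Latin-1 and ASCII characters be mixed freely. Keeping t
abstract would be the way to get the type checker to stop that.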
This, of course, gets into questions of whether certain character
sets, locales, and encodings are interoperable, and how one might
cause a type error when trying to combine an encoding and a character
set that don't work together. Dunno if this is possible. How to
recover from failure is of course a good question to try to answer as
well.

I'll see if I can carve out some initial guesses at what could be used
above, and watch the list for commentary. I've a week off next week
and will probably be hacking around on my own stuff (probably the
UCS-4 based Unicode module I've been putting off for a while).

John Prevost.
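
P.S. On getting a type error when an encoding and a character set don't
fit together: one way it might work is to keep each charset's character
type abstract and have the composing functor demand, through a "with
type" constraint, that the encoding speaks that exact type. The sketch
below is only that, a sketch. CHARSET, ENCODING, Make, Ascii_bytes and
the rest are hypothetical names, and decode failure is reported with a
plain option as one simple answer to the recovery question.

```ocaml
(* Hypothetical sketch: all names are assumptions, not an existing API. *)

module type CHARSET = sig
  type t                            (* kept abstract on purpose *)
  val of_code : int -> t option     (* None if outside the charset *)
  val to_code : t -> int
end

module type ENCODING = sig
  type ch                           (* characters this encoding handles *)
  val encode : ch list -> string
  val decode : string -> ch list option   (* None on malformed input *)
end

module Ascii : CHARSET = struct
  type t = char
  let of_code n = if n >= 0 && n < 128 then Some (Char.chr n) else None
  let to_code = Char.code
end

module Latin1 : CHARSET = struct
  type t = char
  let of_code n = if n >= 0 && n < 256 then Some (Char.chr n) else None
  let to_code = Char.code
end

(* One byte per character; bytes above 127 are rejected on decode. *)
module Ascii_bytes : ENCODING with type ch = Ascii.t = struct
  type ch = Ascii.t
  let encode cs =
    let b = Buffer.create 16 in
    List.iter (fun c -> Buffer.add_char b (Char.chr (Ascii.to_code c))) cs;
    Buffer.contents b
  let decode s =
    (* Walk from the end so the list comes out in order. *)
    let rec go i acc =
      if i < 0 then Some acc
      else
        match Ascii.of_code (Char.code s.[i]) with
        | Some c -> go (i - 1) (c :: acc)
        | None -> None
    in
    go (String.length s - 1) []
end

(* The constraint "ch = C.t" is what ties encoding and charset together. *)
module Make (C : CHARSET) (E : ENCODING with type ch = C.t) = struct
  let decode_codes s =
    match E.decode s with
    | Some cs -> Some (List.map C.to_code cs)
    | None -> None

  let round_trip s =
    match E.decode s with
    | Some cs -> Some (E.encode cs)
    | None -> None
end

module Ascii_text = Make (Ascii) (Ascii_bytes)

(* Rejected by the type checker, since Ascii_bytes.ch is Ascii.t and
   not Latin1.t:

   module Broken = Make (Latin1) (Ascii_bytes)
*)

let ok = Ascii_text.round_trip "abc"
let bad = Ascii_text.round_trip "caf\233"
```

This only catches mismatches that show up in the types. Two charsets
that happen to share a representation still need the abstraction
boundary to keep them apart, which is why Ascii and Latin1 are sealed
with the bare CHARSET signature here.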