From: skaller <skaller@maxtal.com.au>
Date: Tue, 26 Oct 1999 04:55:01 +1000
To: John Prevost
CC: caml-list@inria.fr
Subject: Re: Request for Ideas: i18n issues

John Prevost wrote:

> Back to the charset/encoding module type. Here's what I think might
> want to be in here. I appeal to you for suggestions about things to
> remove or add.
>
> Charsets:
>
> * a type for characters in the charset

I think you mean 'code points'. These are logically distinct from
characters (which are somewhat amorphous). For example, a single
character -- if there is such a thing in the language -- may consist
of several code points combined, and there are code points which are
not characters.

> * a type for strings in the charset (maybe char array, maybe not)

I think you mean 'script'. Strings of code points can be used to
represent script.

> * functions to determine the "class" of a character. This would
>   probably involve a standard "character class" type, possibly
>   informed by character classes in Unicode.

Yes. For the ISO 10646 Basic Multilingual Plane (Unicode), the data is
readily available for a few key attributes (such as character class,
case mappings, corresponding decimal digit, etc.).

> * functions to work with strings in the character set in order to do
>   standard manipulations. If we said a string is always a char
>   array and that there are standard functions to work on strings
>   given the above, this might be something that can be done away
>   with.

Probably not. There is a distinction, for example, between
concatenating two arrays of code points and producing a new array of
code points corresponding to the concatenated script. This is 'most
true' in Arabic, but it is also true in plain old English: the scripts
"It is a" and "nice day" require a space to be inserted between the
code point arrays to obtain the correctly concatenated script
"It is a nice day".

> * functions to convert characters and strings to a reference format,
>   perhaps UCS-4. UCS-4 isn't perfect, but it does have a great deal
>   of coverage, and without some common format, converting from one
>   character set to another is problematic.

I agree. I think there are two choices here: the UCS-4 and UTF-8
encodings of ISO 10646. UCS-4 is better for internal operations, UTF-8
for I/O and streaming operations (perhaps including regular expression
matching, where tables of 256 cases are more acceptable than 2^31 :-)

> Encodings:
>
> These are tied to charsets much of the time, but not always.
>
> * functions to encode and decode strings in streams and buffers.

Yes.
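For concreteness, here is a rough Caml sketch of what such a codec
might look like -- the names (utf8_encode, utf8_decode) and the error
handling are illustrative only, not a proposal for an actual
interface:

(* Sketch only: encode an int code point as UTF-8 into a Buffer, and
   decode one code point from a string at a byte offset. Sequences up
   to four bytes are handled; the original 31-bit UTF-8 definition
   extends the same scheme to five- and six-byte sequences. Overlong
   forms are not rejected, and truncated input shows up as an
   Invalid_argument from the out-of-bounds string access. *)

exception Bad_code_point of int
exception Bad_utf8 of int   (* byte offset of the bad sequence *)

let utf8_encode buf cp =
  let add n = Buffer.add_char buf (Char.chr n) in
  if cp < 0 then raise (Bad_code_point cp)
  else if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x10000 then begin
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else if cp < 0x200000 then begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end else raise (Bad_code_point cp)

(* Returns the code point starting at byte [i] of [s], and the offset
   of the next sequence. *)
let utf8_decode s i =
  let byte k = Char.code s.[k] in
  let cont k =
    let b = byte k in
    if b land 0xC0 <> 0x80 then raise (Bad_utf8 i) else b land 0x3F
  in
  let b0 = byte i in
  if b0 < 0x80 then (b0, i + 1)
  else if b0 land 0xE0 = 0xC0 then
    (((b0 land 0x1F) lsl 6) lor cont (i + 1), i + 2)
  else if b0 land 0xF0 = 0xE0 then
    (((b0 land 0x0F) lsl 12) lor (cont (i + 1) lsl 6) lor cont (i + 2),
     i + 3)
  else if b0 land 0xF8 = 0xF0 then
    (((b0 land 0x07) lsl 18) lor (cont (i + 1) lsl 12)
     lor (cont (i + 2) lsl 6) lor cont (i + 3),
     i + 4)
  else raise (Bad_utf8 i)

A real module would wrap such a pair in a signature shared with the
UCS-4 codec, so that streams and buffers could be parameterised over
the encoding.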
> Locales:
>
> * functions to do case mapping, collation, etc.

No. It is generally accepted that 'locale' information is limited to
culturally sensitive variations, such as whether a full stop or a
comma is used for the decimal point, and whether the date is written
dd/mm/yy, yy/mm/dd, or mm/dd/yy. Collation, case mapping, etc. are not
locale data, but specific to a particular script. The tendency in i18n
development has been, I think, to divorce character sets, encodings,
collation, and script issues from the locale: the locale may indicate
the local language, but that is independent (more or less) of script
processing.

> (I think this is really why Java went to "the one true charset is
> Unicode". Not just because of politics, but because interacting with
> mutually incompatible character sets can be a type-safety nightmare.)

Yes. I am somewhat surprised to see an attempt to create a more
abstract interface to multiple character sets/encodings. This area
tends, I think, to be complex, full of ad hoc rules, and so quirky as
to defy genuine abstraction. Fixing a single standard (ISO 10646) is a
simpler alternative; it is simpler still if there is a single
reference encoding such as UCS-4 or UTF-8. In that case, the functions
that do the work can be specialised to a well-researched International
Standard. It is still necessary to provide functions that encode and
decode the standard format to other formats (encodings/character
sets), but no functions need be provided to do things like collation
or case mapping for those other formats.

> One or more of the above might be functors, so that you can compose a
> character set, encoding, and locale to get what you want. This, of
> course, gets into questions of whether certain character sets,
> locales, and encodings are interoperable, and how one might cause a
> type error when trying to combine an encoding and a character set
> that don't work together. Dunno if this is possible.

I think it is possible, but it isn't desirable. For example, it is
possible to use the UCS-4 or UTF-8 encoding for ANY character set,
since all character sets have integral code points to represent them:
UCS-4 is universal for all sets of fewer than 2^32 points, UTF-8 for
sets of fewer than 2^31 points.

> How to recover from
> failure is of course a good question to try to answer as well.

There are, in fact, multiple answers; this is one of the complicating
factors.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller