From: John Prevost
To: skaller
Cc: caml-list@inria.fr
Subject: Re: Request for Ideas: i18n issues
Date: 25 Oct 1999 22:46:48 -0400
In-Reply-To: skaller's message of "Tue, 26 Oct 1999 04:55:01 +1000"

skaller writes:

> > * a type for characters in the charset
>
> I think you mean 'code points'.  These are logically distinct
> from characters (which are kind of amorphous).  For example, a single
> character -- if there is such a thing in the language -- may
> consist of several code points combined, and there are code points
> which are not characters.

I'm not sure the distinction matters much at this level.  I am
predisposed to prefer the name "char" to the name "codepoint" just
because more people will understand it.  In the back of my head, I
also generally interpret "codepoint" as meaning the number associated
with the character, rather than the abstract "character" itself.  I
believe Unicode makes this distinction between a "glyph", a
"character", and a "codepoint".

> > * a type for strings in the charset (maybe char array, maybe not)
>
> I think you mean 'script'.  Strings of code points can be used
> to represent script.

Uh.  Okay, whatever.  Again, I will tend to use the more traditional
word "string" to mean a sequence of characters.  (On reflection, I
think you're trying to say something deeper, but your explanation is
so vague I won't try to interpret what it is.  I leave that to you.)

> > * functions to work with strings in the character set in order to do
> >   standard manipulations.  If we said a string is always a char
> >   array and that there are standard functions to work on strings
> >   given the above, this might be something that can be done away
> >   with.
>
> Probably not.  There is a distinction, for example, between
> concatenating two arrays of code points, and producing a new array
> of code points corresponding to the concatenated script.  This is
> 'most true' in Arabic, but it is also true in plain old English: the
> script {...}

I don't believe that this is a job for the basic string manipulation
stuff to do.  There do need to be methods for manipulating strings as
sequences.  As such, I'm not going to worry about it at this level.

> > * functions to convert characters and strings to a reference format,
> >   perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
> >   of coverage, and without some common format, converting from one
> >   character set to another is problematic.
>
> I agree.  I think there are two choices here, UCS-4 and UTF-8
> encodings of ISO-10646.
> UCS-4 is better for internal operations,
> UTF-8 for IO and streaming operations (perhaps including
> regular expression matching where tables of 256 cases are more
> acceptable than 2^31 :-)

True.  Of course, there are ways and ways.  Regexps based on
character classes for efficiency is one (now I only need 5 bits (for
Unicode, anyway) to represent that I want "all letters or numbers or
connectors").  And there could be multi-level tables.  (A small
sketch of the UTF-8 byte layout appears at the end of this message.)

> > Locales:
> >
> > * functions to do case mapping, collation, etc.
>
> No.  It is generally accepted that 'locale' information is
> limited to culturally sensitive variations like whether full stop or
> comma is used for a decimal point, and whether the date is written
> dd/mm/yy or yy/mm/dd or mm/dd/yy.
>
> Collation, case mapping, etc are not locale data, but specific
> to a particular script.
>
> The tendency in i18n developments has been, I think, to
> divorce character sets, encodings, collation, and script issues from
> the locale: the locale may indicate the local language, but that is
> independent (more or less) of script processing.

Okay.  I'm going to have a summary at the bottom of my message of the
various kinds of things Java has for this sort of thing, just as
something to think about.  (i.e. are these all separate things?  How
does a program get them?  etc.)

> > (I think this is really why Java went to "the one true charset is
> > Unicode".  Not just because of politics, but because interacting with
> > mutually incompatible character sets can be a type-safety nightmare.)
>
> Yes.  I am somewhat surprised to see an attempt to create a more
> abstract interface to multiple character sets/encodings.  This area
> tends, I think, to be complex, full of ad hoc rules, and so quirky
> as to defy genuine abstraction.
>
> Fixing a single standard (ISO10646) is a simpler
> alternative; even simpler if there is a single reference encoding
> such as UCS-4 or UTF-8.  In that case, the functions that do the
> work can be specialised to a well researched International Standard.
>
> It is still necessary to provide functions that encode/decode
> the standard format to other formats (encodings/character sets), but
> no functions need be provided to do things like collation or case
> mapping for these other formats.

The big difficulty here is that not everybody wants to eat Unicode.
I think it's appropriate, but not everyone does.  And there are still
characters in iso-2022, for example, which have no Unicode code
point.

I think that as I look at the problem more, though, I'm inclined to
say "one definitive set of characters" is a better idea.  Especially
since that set is needed for reasonable interoperation between the
others.

Something I've noted looking at O'Caml these last few days: the
"string" type is really more an efficient byte array type.  And the
char type is really a byte type.  There's no real way to do "input
bytes from a stream" except inputting them as characters and then
interpreting those characters as bytes.

Here's what Java has, related to encodings, strings, characters,
collators, and so on:

java.lang
  Character
    Represents individual Unicode characters.  Provides methods to
    get character class, etc.
  String
    Immutable string of characters.
  StringBuffer
    Mutable string of characters.

java.util
  Locale
    Appears to provide mainly identity, in a package based on ISO
    language and country codes.  Methods to look up "resource
    bundles".
    Various formatting tools allow you to get a new formatter given a
    specific locale (perhaps by using their own resource bundles
    describing which subclass should be used for each locale.)

java.text
  BreakIterator
    For finding word, sentence, para breaks in text.
  ChoiceFormat
    Allows a complex mapping from values to strings.
    (i.e. 0 -> "no files", 1 -> "1 file", n -> "n files")
  Collator
    Configurable comparison between strings.
  DateFormat
    Parses and unparses date values.
  DecimalFormat
    Configurable formatter of numbers ("####.##" -> " 123.20")
  Format
    Superclass for all these formats.  They all apparently have to
    support parsing and unparsing values in Java.
  MessageFormat
    Formatting strings including argument number specification, for
    message catalogs.  ("{0}'s {1}" -> "John's foo",
    "{1} u {0}" -> "foo u Ivan")
  NumberFormat
    Generic any-number formatter, not decimal.
  RuleBasedCollator
    Collator that can be given strings that represent rules for
    ordering strings.
  and a few others

java.io
  OutputStreamWriter
    Writes strings to an output stream, in an encoding specified by a
    string (the encoding's name)
  InputStreamReader
    Ditto for the other direction.
  ByteArray...

Anyway, I think this partitions things like so:

1) Basic char and string types
2) Locale type which exists solely for locale identity
3) Collator type which allows various sorting of strings
4) Formatter and parser types for different data values
4a) A sort of formatter which allows reordering of arguments in the
    output string is needed (not too hard).
5) Reader and writer types for encoding characters into various
   encodings

My current thought is that the char and string types should be
defined in terms of Unicode.  The char type should be abstract, and
have a function for converting to Unicode codepoints.  The new string
type may want to be immutable, with buffers used for changeable
things instead.

You should be able to get ahold of collators, formatters, message
catalogs, default encodings, and the like by having a locale.  You
should of course be able to ignore them, too.

I'll try to get some type signatures for possibilities up in the next
day or so.

John.
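A very rough, purely hypothetical sketch of what signatures along the
lines of the partitioning above might look like.  Every name here
(the module type UNI, the types uchar, ustring, locale, collator, and
all the operations) is invented for illustration; nothing of the sort
exists in the standard library:

  module type UNI = sig
    (* 1) basic char and string types, defined in terms of Unicode *)
    type uchar                                (* abstract character *)
    val uchar_of_code : int -> uchar          (* from a Unicode codepoint *)
    val code_of_uchar : uchar -> int          (* back to the codepoint *)

    type ustring                              (* immutable string of uchars *)
    val length : ustring -> int
    val get : ustring -> int -> uchar
    val concat : ustring -> ustring -> ustring

    (* 2) a locale exists solely for identity *)
    type locale
    val locale : string -> string -> locale   (* language code, country code *)

    (* 3) collators provide locale-sensitive ordering of strings *)
    type collator
    val collator : locale -> collator
    val compare : collator -> ustring -> ustring -> int

    (* 4) formatters and parsers for data values are omitted here *)

    (* 5) readers and writers translate ustrings to and from byte
       encodings named by strings, as in Java's InputStreamReader;
       the int argument is a character count to read *)
    val output_ustring : out_channel -> string -> ustring -> unit
    val input_ustring : in_channel -> string -> int -> ustring
  end

A mutable buffer type playing the role of StringBuffer, and
conversions between ustring and the existing byte-oriented string
type, would presumably sit alongside this.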
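And, for the UCS-4 vs. UTF-8 point above, a small sketch of how an
ISO-10646 codepoint (held here as a plain int) maps onto UTF-8 bytes.
The function name add_utf8 and the use of a Buffer are only
illustrative; the original scheme extends the same pattern up to six
bytes to cover the full 31-bit range, and only the one- to four-byte
cases are shown:

  (* append the UTF-8 encoding of codepoint cp to buffer buf *)
  let add_utf8 buf cp =
    let add n = Buffer.add_char buf (Char.chr n) in
    if cp < 0x80 then
      add cp                                     (* 0xxxxxxx *)
    else if cp < 0x800 then begin
      add (0xC0 lor (cp lsr 6));                 (* 110xxxxx *)
      add (0x80 lor (cp land 0x3F))              (* 10xxxxxx *)
    end
    else if cp < 0x10000 then begin
      add (0xE0 lor (cp lsr 12));                (* 1110xxxx *)
      add (0x80 lor ((cp lsr 6) land 0x3F));
      add (0x80 lor (cp land 0x3F))
    end
    else begin
      add (0xF0 lor (cp lsr 18));                (* 11110xxx *)
      add (0x80 lor ((cp lsr 12) land 0x3F));
      add (0x80 lor ((cp lsr 6) land 0x3F));
      add (0x80 lor (cp land 0x3F))
    end

  (* e.g. add_utf8 b 0x00E9 appends the two bytes 0xC3 0xA9 *)

A decoder walks the same cases in the other direction, which is where
byte-indexed 256-entry dispatch tables remain practical.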