From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: weis
Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3)
	id TAA03132 for caml-redistribution; Thu, 28 Oct 1999 19:09:54 +0200 (MET DST)
Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78])
	by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id CAA11478
	for ; Wed, 27 Oct 1999 02:30:01 +0200 (MET DST)
Received: from isil.maya.com (HOPKINS.PC.CS.CMU.EDU [128.2.215.35])
	by nez-perce.inria.fr (8.8.7/8.8.7) with ESMTP id CAA01733
	for ; Wed, 27 Oct 1999 02:29:57 +0200 (MET DST)
Received: (from prevost@localhost) by isil.maya.com (8.9.3/8.9.3)
	id UAA31687; Tue, 26 Oct 1999 20:37:55 -0400
Sender: weis
To: skaller
Cc: caml-list@inria.fr
Subject: Re: Request for Ideas: i18n issues
References: <3814A785.B01AFAD4@maxtal.com.au> <38160C26.25223AE1@maxtal.com.au>
From: John Prevost
Date: 26 Oct 1999 20:37:55 -0400
In-Reply-To: skaller's message of "Wed, 27 Oct 1999 06:16:38 +1000"
Message-ID:
User-Agent: Gnus/5.070096 (Pterodactyl Gnus v0.96) Emacs/20.4
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

skaller writes:

> Sorry: I do not understand it entirely myself. But consider:
> the code point for a space, has no character. There is no such
> thing as a 'space' character. There is no glyph. There _is_ such a
> thing as white space in (latin) script, however.

I think this is overly pedantic. Let's try not to take it that far.

> > 1) Basic char and string types
> > 2) Locale type which exists solely for locale identity
> > 3) Collator type which allows various sorting of strings
> > 4) Formatter and parser types for different data values
> > 4a) A sort of formatter which allows reordering of arguments in the output
> > string is needed (not too hard).
> > 5) Reader and writer types for encoding characters into various encodings
>
> Seems reasonable. I personally would address (1) and (5)
> first. Also, I think 'regexp' et al fits in somewhere, and needs to
> be addressed.

True. And you're right, (1) and (5) are probably the most important
things. I'm just trying to clarify what sorts of things are needed for
i18n work.

> > My current thought is that the char and string types should be defined
> > in terms of Unicode.
>
> Please NO. ISO10646. 'Unicode' is no longer supported by
> Unicode Corp, which is fully behind ISO10646. "unicode" is for
> implementations that will be out of date before anyone get the tools
> to actually use internationalised software (like text editors,
> fonts, etc).

Ahh. I'm not entirely sure which distinction you are making here. I use
the phrase "Unicode" to mean "ISO 10646", because I find it easier to
pronounce. I understand the differences, but not why you went off on a
flame when you saw "Unicode". As I understand it, the goal in Unicode
has been to conform to ISO 10646 since at least revision 2.0.

> > The char type should be abstract,
>
> Why? ISO 10646 code points are just integers 0 .. 2^31-1, and
> it is necessary to have some way of representing them as
> _literals_. In particular, you cannot 'match' easily on an abstract
> type, but matching is going to be a very common operation:
>
> match wchar with
> | WhatDoIPutHere -> ...

This is true, and having base-language support for wchar literals would
be ideal. However, the type of characters and the type of integers
should be disjoint, since the number 32 and the character ' ' are
distinct. You do not want to apply a function on characters to an
integer, because they are semantically different.
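Just to make that concrete, here is a very rough sketch of the kind of
thing I mean. (The names Wchar, of_int, to_int and is_white are made up
on the spot, and the bound on the code space is only there to keep the
example short; this is not a proposed interface.)

  module Wchar : sig
    type t                      (* abstract: not interchangeable with int *)
    val of_int : int -> t
    val to_int : t -> int
  end = struct
    type t = int                (* in the implementation it really is an int *)
    let of_int n =
      (* bounded at the first 17 planes just to keep the sketch small;
         the full ISO 10646 code space is 0 .. 2^31-1 *)
      if n < 0 || n > 0x10FFFF then invalid_arg "Wchar.of_int" else n
    let to_int c = c
  end

  (* Without literal syntax, matching has to go through to_int: *)
  let is_white c =
    match Wchar.to_int c with
    | 0x20 | 0x09 | 0x0A | 0x0D -> true
    | _ -> false

Whether literal syntax for such a type belongs in the language proper
is a separate question; the point is only that Wchar.t and int stay
distinct.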
Enforcing that distinction is precisely the sort of thing the type
system in ML is supposed to be used for, and I won't stand for leaving
such an important constraint out of the abstraction. You may map an
integer into the character with the equivalent codepoint and vice
versa, but a character is not an integer. In the implementation, of
course, it actually is.

From another part of your message:

> But let me give another clarifying example. In Unicode, there
> are some code points which are reserved, and do not represent
> characters, namely the High and Low surrogates. These are used to
> allow two byte 16 bit UTF-16 encoding of the first 16 planes of
> ISO-10646.

I'd like this issue not to exist in O'Caml's character type. To me, a
surrogate is half a character. If I have an array of characters, I
should not be able to look up a cell and get half a character.
Therefore, the base character type in O'Caml should be defined in terms
of the codepoint itself.

This still leaves the surrogate area as problematic: what if I want to
be able to deal with input in which surrogates are mismatched? What if
a library client asks me to turn an integer which is the codepoint of a
surrogate into a character? Even so, I think that in O'Caml the
character type should never contain "half a character".

Your other message:

> In that process, I have learned a bit; mainly that:
>
> (1) it is complicated
> (2) it is not wise to specify interfaces based on
> vague assumptions.
>
> That is, I'd like to suggest focussing on encodings/mappings, and
> character classes; and I believe that you will find that the issues
> demand different interfaces than you might think at first: I know
> that what I _thought_ made sense, at first, turned out to be not
> really correct.

I agree, and this is where I'll focus first. At the same time, one
might argue that an object representing collation has, at some level, a
specific interface allowing one operation: comparing two pieces of
script (to employ an excessively long turn of phrase) for order.
Regardless of how the order is determined, that is the goal.

I don't intend to come up with something and have it set in stone. I
intend this as an exercise: to come up with some tools which people can
use, try out, fiddle with, and make suggestions about.

> On things like collation, there is no way I can see to proceed
> without obtaining the relevant ISO documents. I seriously doubt
> that it is possible to design an 'abstract' interface, without
> actually first trying to implement the algorithms: the interfaces
> will be determined by concrete details and issues.

In this case, I don't propose to describe in detail all of the issues
of collating. But in order to provide a mechanism for ordering strings
other than by codepoint, and maybe a way to look such a mechanism up,
it might be worthwhile to explore a few possibilities, even if they
come to nought in the end.

As for myself, I do not generally have the funds or wherewithal to
obtain ISO documents. I do have a copy of the Unicode v2.0 book handy,
but it doesn't concern itself with such issues.

Anyway, you're right: the encoding and character class work comes
first. I think format strings and message catalogs are a doable second
goal, which shouldn't cause too much pain. Formatting dates and the
like is harder, not least because you need a standard way to express a
date before you can have a standard interface for outputting one in
some localized format.

Anyway, encodings first. Then the world.
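To give a feel for where I'd like this to end up, here are two
signature sketches: one for the reader side of (5) and one for the
collator of (3). All of the names here are invented for the occasion,
and none of it is meant as a final design.

  (* Decoding bytes into abstract characters.  Surrogate code points
     (D800 .. DFFF) would be rejected here, so that the character type
     never holds "half a character". *)
  module type DECODER = sig
    type t
    type uchar                  (* e.g. the Wchar.t sketched earlier *)
    exception Malformed
    val create : unit -> t
    val feed : t -> char -> uchar option
        (* feed one byte: Some c when a character is complete, None
           while more input is needed, Malformed on illegal sequences *)
  end

  (* The surrogate range itself is easy to test for: *)
  let is_surrogate n = n >= 0xD800 && n <= 0xDFFF

  (* Collation: however the order is computed, what the client sees is
     just a comparison of two strings under a given locale. *)
  module type COLLATOR = sig
    type t
    type locale
    type wstring
    val get : locale -> t
    val compare : t -> wstring -> wstring -> int    (* < 0, 0, > 0 *)
  end

Something shaped like DECODER is where the surrogate question would get
settled, which is why I'd like it kept out of the character type
entirely.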
I'll take a look at your Python stuff.

John.