From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: weis Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id TAA16622 for caml-redistribution; Thu, 28 Oct 1999 19:08:17 +0200 (MET DST) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id TAA01845 for ; Tue, 26 Oct 1999 19:50:12 +0200 (MET DST) Received: from ruby (pm1-15.triode.net.au [203.63.235.31]) by nez-perce.inria.fr (8.8.7/8.8.7) with ESMTP id TAA23176 for ; Tue, 26 Oct 1999 19:50:07 +0200 (MET DST) Received: from maxtal.com.au (IDENT:root@localhost [127.0.0.1]) by ruby (8.9.3/8.9.3) with ESMTP id GAA03136; Wed, 27 Oct 1999 06:16:38 +1000 Sender: weis Message-ID: <38160C26.25223AE1@maxtal.com.au> Date: Wed, 27 Oct 1999 06:16:38 +1000 From: skaller Organization: Maxtal X-Mailer: Mozilla 4.51 [en] (X11; I; Linux 2.2.12 i586) X-Accept-Language: en MIME-Version: 1.0 To: John Prevost CC: caml-list@inria.fr Subject: Re: Request for Ideas: i18n issues References: <3814A785.B01AFAD4@maxtal.com.au> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit John Prevost wrote: > > skaller writes: > > > > * a type for characters in the charset > > > > I think you mean 'code points'. These are logically distinct > > from characters (which are kind of amorphous). For example, a single > > character -- if there is such a thing in the language -- may > > consists of several code points combined, and there are code points > > which are not characters. > > I'm not sure the distinction matters much at this level. I am > predisposed to prefer the name "char" to the name "codepoint" just > because more people will understandit. I myself tend to use these (incorrect) words as well, through familiarity. However, in the longer run, vague, and even incorrect, terminology will make it even harder to communicate. My experience with this is from the C++ Standardisation committee (rather than I18n forums). In C++, there are a number of important and distinct concepts without proper terminology, and it makes it almost impossible to have a technical discussion about the C++ language. Over 90% of all issues and debates centred around terminology and interpretation, rather than functionality. But let me give another clarifying example. In Unicode, there are some code points which are reserved, and do not represent characters, namely the High and Low surrogates. These are used to allow two byte 16 bit UTF-16 encoding of the first 16 planes of ISO-10646. So when you see two of these together, they are not two 'characters', but in fact an encoding of a SINGLE code point, which itself may or may not be a character. > In the back of my head, I also > generally interpret "codepoint" as meaning the number associated with > the character, Yes, this is more or less correct. And, basic array /string manipulations must first work with code points, rather than the abstract 'characters'. In fact, the 'meaning' of the code points as characters is to be found in the functions which manipulate the code points (sort of like an ocaml module). More correctly, sequences of code points can be understood as 'script', and manipulated to represent script manipulations. Thus: it is possible to do a lexicographical string sort by code points, which is not the same as a native language sort on script. The former is a convenience for representing strings in _some_ total order, for example, for use in an ocaml 'Set'. The latter may not even be well defined, or have multiple definitions, depending on the script or use -- for example, the usual code point order sort of ASCII character is utterly wrong for a book index. >rather than the abstract "character" itself. I believe > Unicode makes this distinction between a "glyph", a "character", and a > "codepoint". Yes: a glyph is the shape, a font's character is particular rendering of it, a character is an abstraction (is 'A' the same character as 'a'?? The shape is different, the concept is the same). A code point is just a number in a set of numbers used to encode script. The difference is subtle, and I do not claim to fully understand it, but the ISO and Unicode committees needing to create standards emphasise the distinctions (to the point of the liason on the C++ committee complaining about the incorrect use of the word 'character' in the C++ language draft). > > > * a type for strings in the charset (maybe char array, maybe not) > > > > I think you mean 'script'. Strings of code points can be used > > to represent script. > > Uh. Okay, whatever. Again, I will tend to use the more traditional > word "string" to mean a sequence of characters. (On reflection, I > think you're trying to say something deeper, but your explanation is > so vague I won't try to interpret what it is. I leave that to you.) Sorry: I do not understand it entirely myself. But consider: the code point for a space, has no character. There is no such thing as a 'space' character. There is no glyph. There _is_ such a thing as white space in (latin) script, however. > I don't believe that this is a job for the basic string manipulation > stuff to do. There do need to be methods for manipulating strings as > sequences. As such, I'm not going to worry about it at this level. I agree. But, then, you are only dealing with some kind of array of code points. Ocaml already has an 'array' type which can be used for this purpose, and if there were a string type that were variable length, we would be somewhat better off. But manipulating these data structures is entirely structural, and only relates to 'script' in the sense that the data structures are convenient for working with script representations. That is, actual script related functions such as (1) trimming whitespace (2) capitalisation (3) collation (4) charset mapping and encoding/decoding are, IMHO, quite separate. > The big difficulty here is that not everybody wants to eat Unicode. I know this is true. However, the many issues have been addressed by representatives of National Governments and industry, and a consensus has been reached, and is embodied in an International Standard. >I think it's appropriate, but not everyone does. And there are still > characters in iso-2022, for example, which have no Unicode code point. These issues are being addressed by members of the ISO technical committee. Any complaints should be addressed to them. Programmers should probably implement what is standardised, and perhaps complain, or provide input to the process, rather than going off on their own private, probably doomed track. [This is not the same issue, IMHO, as programming languages: there is just ONE commonly accepted universal standard for script, namely ISO-10646] > I think that as I look at the problem more, though, I'm inclined to > say "one definitive set of characters" is a better idea. Especially > since that set is needed for reasonable interoperation between the > others. This is my feeling, partly for another reason: anything else is just too complicated. ISO 10646 is hard enough as it is. > Something I've noted looking at O'Caml these last few days: the > "string" type is really more an efficient byte array type. And the > char type is really a byte type. Yes. And it is needed as such, even though perhaps poorly named. Something which represents byte strings is essentially distinct from something representing 'script'/'text'. IMHO. > Anyway, I think this partitions things like so: > > 1) Basic char and string types > 2) Locale type which exists solely for locale identity > 3) Collator type which allows various sorting of strings > 4) Formatter and parser types for different data values > 4a) A sort of formatter which allows reordering of arguments in the output > string is needed (not too hard). > 5) Reader and writer types for encoding characters into various encodings Seems reasonable. I personally would address (1) and (5) first. Also, I think 'regexp' et al fits in somewhere, and needs to be addressed. > My current thought is that the char and string types should be defined > in terms of Unicode. Please NO. ISO10646. 'Unicode' is no longer supported by Unicode Corp, which is fully behind ISO10646. "unicode" is for implementations that will be out of date before anyone get the tools to actually use internationalised software (like text editors, fonts, etc). Java has gone this way, and will pay the price in the long run. Python is going that way too. But Unicode is ALREADY out of date. If work is to be done -- and there is a LOT of work in this area -- then it should conform to the long sighted International Standard, which has plenty of space for extension, and is supported by an international consensus of National bodies and technical experts. > The char type should be abstract, Why? ISO 10646 code points are just integers 0 .. 2^31-1, and it is necessary to have some way of representing them as _literals_. In particular, you cannot 'match' easily on an abstract type, but matching is going to be a very common operation: match wchar with | WhatDoIPutHere -> ... One solution is to have a native wchar type in ocaml, with literals recognized by the lexer .. but it seems to me that integers will do the job, even if they can be abused, by, for example, multiplication (which, off hand, doesn't seem appropriate). [That is, code points are a Z module: they're offsets] > You should be able to get ahold of collators, formatters, message > catalogs, default encodings, and the like by having a locale. You > should of course be able to ignore them, too. I would leave most of this out, at least for the moment: it is enough work to just handle encodings and mappings. These things are important in the transition from other character sets including Latin-1, ShiftJis -- etc. Without these mapping/encoding tools, it isn't really possible to actually work with ISO10646, because there are not many tools that can do so: most people use either 8 bit editors, or specialised editors for their DBCS encoding (like ShiftJis). > I'll try to get some type signatures for possibilities up in the next > day or so. Great! BTW: I have some Python code for doing a lot of mappings/encodings (all the ones available on the Unicode.org website). This may be useful, I can generate tables of any kind as required. (Please send me private email). Anyone interested can examine my web page: http://www.triode.net.au/~skaller/unicode/index.html -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller