* Request for Ideas: i18n issues
@ 1999-10-22  0:25 John Prevost
  1999-10-25 18:55 ` skaller
  0 siblings, 1 reply; 7+ messages in thread

From: John Prevost @ 1999-10-22 0:25 UTC (permalink / raw)
To: caml-list

I think there are a number of issues involved in adding i18n support to
O'Caml.  One of the first that should probably be addressed is providing
a standard signature for a "character set/encoding" module.  (This isn't
necessarily part of the stock distributed O'Caml libraries, but it might
make sense if the standard library could take advantage of it.
Tradeoffs.)  This is the heart of my message, and continues after a
short digression.

The other issue I've thought about for a while has to do with wanting to
use non-8859-1 characters in O'Caml code.  I don't know if this is a
necessity (though the language geek in me says "yes, please!"), but it
would be interesting.  The difficulty I see here is mainly the syntactic
distinction between Uidents and Lidents.  If I want to define a symbol
named by the kanji "san", it's hard to say whether that's an uppercase
ident or a lowercase one.  (Why would I want to?  Like I said, I'm a
language geek.)  I understand that Caml didn't have this restriction,
and that it was added to help distinguish things like Foo.bar from
foo.bar (Foo is a module, foo is a record value).  But would it be
possible to remove the distinction again?

Back to the charset/encoding module type.  Here's what I think might
want to be in here.  I appeal to you for suggestions about things to
remove or add.

Charsets:

 * a type for characters in the charset
 * a type for strings in the charset (maybe char array, maybe not)
 * functions to determine the "class" of a character.  This would
   probably involve a standard "character class" type, possibly
   informed by character classes in Unicode.
 * functions to work with strings in the character set in order to do
   standard manipulations.  If we said a string is always a char array
   and that there are standard functions to work on strings given the
   above, this might be something that can be done away with.
 * functions to convert characters and strings to a reference format,
   perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
   of coverage, and without some common format, converting from one
   character set to another is problematic.

Encodings:

These are tied to charsets much of the time, but not always.

 * functions to encode and decode strings in streams and buffers.

Locales:

 * functions to do case mapping, collation, etc.

I'm sure I've missed useful features that should be in these categories.
Some of these might be functors.  Some might be objects, since modules
aren't values, and you may very well want to select a locale/encoding at
runtime.  Selecting a charset is more problematic.  (I think this is
really why Java went to "the one true charset is Unicode".  Not just
because of politics, but because interacting with mutually incompatible
character sets can be a type-safety nightmare.)

One or more of the above might be functors, so that you can compose a
character set, encoding, and locale to get what you want.  This, of
course, gets into questions of whether certain character sets, locales,
and encodings are interoperable, and how one might cause a type error
when trying to combine an encoding and a character set that don't work
together.  Dunno if this is possible.  How to recover from failure is
of course a good question to try to answer as well.

I'll see if I can carve out some initial guesses at what could be used
above, and watch the list for commentary.  I've a week off next week
and will probably be hacking around on my own stuff (probably the UCS-4
based unicode module I've been putting off for a while.)

John Prevost.

^ permalink raw reply	[flat|nested] 7+ messages in thread
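[Editorial note: the charset module type sketched in the bullet list above could be written down roughly as follows. This is a sketch only, not anything proposed on the list or taken from an actual library: the names `CHARSET`, `char_class`, `to_ucs4`, and the toy `Latin1` instance are all invented for illustration.]

```ocaml
(* A coarse character-class type, loosely informed by Unicode categories. *)
type char_class = Letter | Digit | Punctuation | Space | Control | Other

module type CHARSET = sig
  type ch                          (* a character of this charset *)
  type str                         (* a string of this charset *)
  val classify : ch -> char_class
  val length : str -> int
  val get : str -> int -> ch
  val concat : str -> str -> str
  (* Conversion through a common reference format (UCS-4 code points),
     without which charset-to-charset conversion is problematic. *)
  val to_ucs4 : ch -> int
  val of_ucs4 : int -> ch option   (* None if the code point has no mapping *)
end

(* A toy Latin-1 instance, just to show the signature is inhabitable. *)
module Latin1 : CHARSET with type ch = char and type str = string = struct
  type ch = char
  type str = string
  let classify = function
    | 'a' .. 'z' | 'A' .. 'Z' -> Letter
    | '0' .. '9' -> Digit
    | ' ' | '\t' | '\n' -> Space
    | '\000' .. '\031' -> Control
    | '!' .. '/' | ':' .. '@' -> Punctuation
    | _ -> Other
  let length = String.length
  let get = String.get
  let concat = ( ^ )
  let to_ucs4 = Char.code
  let of_ucs4 n = if 0 <= n && n < 256 then Some (Char.chr n) else None
end
```

Different charsets would then be different modules matching `CHARSET`, and a functor could take one as a parameter, which is exactly the composition question raised above.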
* Re: Request for Ideas: i18n issues
  1999-10-22  0:25 Request for Ideas: i18n issues John Prevost
@ 1999-10-25 18:55 ` skaller
  1999-10-26  2:46   ` John Prevost
  0 siblings, 1 reply; 7+ messages in thread

From: skaller @ 1999-10-25 18:55 UTC (permalink / raw)
To: John Prevost; +Cc: caml-list

John Prevost wrote:

> Back to the charset/encoding module type.  Here's what I think might
> want to be in here.  I appeal to you for suggestions about things to
> remove or add.
>
> Charsets:
>
>  * a type for characters in the charset

I think you mean 'code points'.  These are logically distinct from
characters (which are kind of amorphous).  For example, a single
character -- if there is such a thing in the language -- may consist of
several code points combined, and there are code points which are not
characters.

>  * a type for strings in the charset (maybe char array, maybe not)

I think you mean 'script'.  Strings of code points can be used to
represent script.

>  * functions to determine the "class" of a character.  This would
>    probably involve a standard "character class" type, possibly
>    informed by character classes in Unicode.

Yes.  For ISO10646 plane 1 (Unicode), the data is readily available for
a few key attributes (such as character class, case mappings,
corresponding decimal digit, etc).

>  * functions to work with strings in the character set in order to do
>    standard manipulations.  If we said a string is always a char
>    array and that there are standard functions to work on strings
>    given the above, this might be something that can be done away
>    with.

Probably not.  There is a distinction, for example, between
concatenating two arrays of code points, and producing a new array of
code points corresponding to the concatenated script.  This is 'most
true' in Arabic, but it is also true in plain old English: the scripts
"It is a" and "nice day" require a space to be inserted between the
code point arrays to obtain the correctly concatenated script "It is a
nice day".

>  * functions to convert characters and strings to a reference format,
>    perhaps UCS-4.  UCS-4 isn't perfect, but it does have a great deal
>    of coverage, and without some common format, converting from one
>    character set to another is problematic.

I agree.  I think there are two choices here, UCS-4 and UTF-8 encodings
of ISO-10646.  UCS-4 is better for internal operations, UTF-8 for IO
and streaming operations (perhaps including regular expression
matching, where tables of 256 cases are more acceptable than 2^31 :-)

> Encodings:
>
> These are tied to charsets much of the time, but not always.
>
>  * functions to encode and decode strings in streams and buffers.

Yes.

> Locales:
>
>  * functions to do case mapping, collation, etc.

No.  It is generally accepted that 'locale' information is limited to
culturally sensitive variations like whether a full stop or comma is
used for a decimal point, and whether the date is written dd/mm/yy or
yy/mm/dd or mm/dd/yy.

Collation, case mapping, etc. are not locale data, but specific to a
particular script.

The tendency in i18n developments has been, I think, to divorce
character sets, encodings, collation, and script issues from the
locale: the locale may indicate the local language, but that is
independent (more or less) of script processing.

> (I think this is really why Java went to "the one true charset is
> Unicode".  Not just because of politics, but because interacting with
> mutually incompatible character sets can be a type-safety nightmare.)

Yes.  I am somewhat surprised to see an attempt to create a more
abstract interface to multiple character sets/encodings.  This area
tends, I think, to be complex, full of ad hoc rules, and so quirky as
to defy genuine abstraction.

Fixing a single standard (ISO10646) is a simpler alternative; even
simpler if there is a single reference encoding such as UCS-4 or UTF-8.
In that case, the functions that do the work can be specialised to a
well-researched International Standard.

It is still necessary to provide functions that encode/decode the
standard format to other formats (encodings/character sets), but no
functions need be provided to do things like collation or case mapping
for these other formats.

> One or more of the above might be functors, so that you can compose a
> character set, encoding, and locale to get what you want.  This, of
> course, gets into questions of whether certain character sets,
> locales, and encodings are interoperable, and how one might cause a
> type error when trying to combine an encoding and a character set
> that don't work together.  Dunno if this is possible.

I think it is, but it isn't desirable.  For example, it is possible to
use UCS-4 or UTF-8 encodings of ANY character set, since all have
integral code points to represent them: UCS-4 is universal for all sets
of fewer than 2^32 points, UTF-8 for sets of fewer than 2^31 points.

> How to recover from failure is of course a good question to try to
> answer as well.

There are, in fact, multiple answers; this is one of the complicating
factors.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 7+ messages in thread
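[Editorial note: the UCS-4/UTF-8 split described above is easy to make concrete. Below is a sketch of the encoding direction only, written for the original 1..6-byte form of UTF-8 that covers code points up to 2^31 - 1, as in the discussion; the function name is invented for this example, and decoding is the symmetric exercise.]

```ocaml
(* Encode a UCS-4 code point (a plain int, 0 .. 2^31 - 1) as UTF-8. *)
let utf8_of_code_point (cp : int) : string =
  let buf = Buffer.create 6 in
  let add n = Buffer.add_char buf (Char.chr (n land 0xff)) in
  if cp < 0 then invalid_arg "utf8_of_code_point"
  else if cp < 0x80 then add cp          (* ASCII: one byte, unchanged *)
  else begin
    (* nbytes: length of the encoding; lead: leading-byte marker bits *)
    let nbytes, lead =
      if cp < 0x800 then 2, 0xC0
      else if cp < 0x10000 then 3, 0xE0
      else if cp < 0x200000 then 4, 0xF0
      else if cp < 0x4000000 then 5, 0xF8
      else 6, 0xFC
    in
    (* leading byte carries the top bits of the code point *)
    add (lead lor (cp lsr (6 * (nbytes - 1))));
    (* each continuation byte is 10xxxxxx with the next 6 bits *)
    for i = nbytes - 2 downto 0 do
      add (0x80 lor ((cp lsr (6 * i)) land 0x3f))
    done
  end;
  Buffer.contents buf
```

For example, `utf8_of_code_point 0x20AC` produces the three bytes E2 82 AC. The 256-entry tables mentioned above come from the fact that a UTF-8 decoder only ever dispatches on one byte at a time.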
* Re: Request for Ideas: i18n issues
  1999-10-25 18:55 ` skaller
@ 1999-10-26  2:46   ` John Prevost
  1999-10-26 13:36     ` Frank A. Christoph
  1999-10-26 20:16     ` skaller
  0 siblings, 2 replies; 7+ messages in thread

From: John Prevost @ 1999-10-26 2:46 UTC (permalink / raw)
To: skaller; +Cc: caml-list

skaller <skaller@maxtal.com.au> writes:

> >  * a type for characters in the charset
>
> I think you mean 'code points'.  These are logically distinct from
> characters (which are kind of amorphous).  For example, a single
> character -- if there is such a thing in the language -- may consist
> of several code points combined, and there are code points which are
> not characters.

I'm not sure the distinction matters much at this level.  I am
predisposed to prefer the name "char" to the name "codepoint" just
because more people will understand it.  In the back of my head, I also
generally interpret "codepoint" as meaning the number associated with
the character, rather than the abstract "character" itself.  I believe
Unicode makes this distinction between a "glyph", a "character", and a
"codepoint".

> >  * a type for strings in the charset (maybe char array, maybe not)
>
> I think you mean 'script'.  Strings of code points can be used to
> represent script.

Uh.  Okay, whatever.  Again, I will tend to use the more traditional
word "string" to mean a sequence of characters.  (On reflection, I
think you're trying to say something deeper, but your explanation is
so vague I won't try to interpret what it is.  I leave that to you.)

> >  * functions to work with strings in the character set in order to
> >    do standard manipulations.  If we said a string is always a char
> >    array and that there are standard functions to work on strings
> >    given the above, this might be something that can be done away
> >    with.
>
> Probably not.  There is a distinction, for example, between
> concatenating two arrays of code points, and producing a new array of
> code points corresponding to the concatenated script.  This is 'most
> true' in Arabic, but it is also true in plain old English: the script
{...}

I don't believe that this is a job for the basic string manipulation
stuff to do.  There do need to be methods for manipulating strings as
sequences.  As such, I'm not going to worry about it at this level.

> >  * functions to convert characters and strings to a reference
> >    format, perhaps UCS-4.  UCS-4 isn't perfect, but it does have a
> >    great deal of coverage, and without some common format,
> >    converting from one character set to another is problematic.
>
> I agree.  I think there are two choices here, UCS-4 and UTF-8
> encodings of ISO-10646.  UCS-4 is better for internal operations,
> UTF-8 for IO and streaming operations (perhaps including regular
> expression matching where tables of 256 cases are more acceptable
> than 2^31 :-)

True.  Of course, there are ways and ways.  Regexps based on character
classes for efficiency are one (now I only need 5 bits (for Unicode,
anyway) to represent that I want "all letters or numbers or
connectors").  And there could be multi-level tables.

> > Locales:
> >
> >  * functions to do case mapping, collation, etc.
>
> No.  It is generally accepted that 'locale' information is limited to
> culturally sensitive variations like whether a full stop or comma is
> used for a decimal point, and whether the date is written dd/mm/yy or
> yy/mm/dd or mm/dd/yy.
>
> Collation, case mapping, etc. are not locale data, but specific to a
> particular script.
>
> The tendency in i18n developments has been, I think, to divorce
> character sets, encodings, collation, and script issues from the
> locale: the locale may indicate the local language, but that is
> independent (more or less) of script processing.

Okay.  I'm going to have a summary at the bottom of my message of the
various kinds of things Java has for this sort of thing, just as
something to think about.  (i.e. are these all separate things?  How
does a program get them?  etc.)
> > (I think this is really why Java went to "the one true charset is
> > Unicode".  Not just because of politics, but because interacting
> > with mutually incompatible character sets can be a type-safety
> > nightmare.)
>
> Yes.  I am somewhat surprised to see an attempt to create a more
> abstract interface to multiple character sets/encodings.  This area
> tends, I think, to be complex, full of ad hoc rules, and so quirky
> as to defy genuine abstraction.
>
> Fixing a single standard (ISO10646) is a simpler alternative; even
> simpler if there is a single reference encoding such as UCS-4 or
> UTF-8.  In that case, the functions that do the work can be
> specialised to a well-researched International Standard.
>
> It is still necessary to provide functions that encode/decode the
> standard format to other formats (encodings/character sets), but no
> functions need be provided to do things like collation or case
> mapping for these other formats.

The big difficulty here is that not everybody wants to eat Unicode.  I
think it's appropriate, but not everyone does.  And there are still
characters in iso-2022, for example, which have no Unicode code point.

I think that as I look at the problem more, though, I'm inclined to say
"one definitive set of characters" is a better idea.  Especially since
that set is needed for reasonable interoperation between the others.

Something I've noted looking at O'Caml these last few days: the
"string" type is really more an efficient byte array type.  And the
char type is really a byte type.  There's no real way to do "input
bytes from a stream" except inputting them as characters and then
interpreting those characters as bytes.

Here's what Java has, related to encodings, strings, characters,
collators, and so on:

java.lang
  Character      Represents individual Unicode characters.  Provides
                 methods to get character class, etc.
  String         Immutable string of characters.
  StringBuffer   Mutable string of characters.

java.util
  Locale         Appears to provide mainly identity, in a package based
                 on ISO language and country codes.  Methods to look up
                 "resource bundles".  Various formatting tools allow
                 you to get a new formatter given a specific locale
                 (perhaps by using their own resource bundles
                 describing which subclass should be used for each
                 locale.)

java.text
  BreakIterator      For finding word, sentence, para breaks in text.
  ChoiceFormat       Allows a complex mapping from values to strings.
                     (i.e. 0 -> "no files", 1 -> "1 file",
                      n -> "n files")
  Collator           Configurable comparison between strings.
  DateFormat         Parses and unparses date values.
  DecimalFormat      Configurable formatter of numbers
                     ("####.##" -> " 123.20")
  Format             Superclass for all these formats.  They all
                     apparently have to support parsing and unparsing
                     values in Java.
  MessageFormat      Formatting strings including argument number
                     specification, for message catalogs.
                     ("{0}'s {1}" -> "John's foo",
                      "{1} u {0}" -> "foo u Ivan")
  NumberFormat       Generic any-number formatter, not decimal.
  RuleBasedCollator  Collator that can be given strings that represent
                     rules for ordering strings.
  and a few others

java.io
  OutputStreamWriter  Writes strings to an output stream, in an
                      encoding specified by a string (the encoding's
                      name)
  InputStreamReader   Ditto for the other direction.
  ByteArray...

Anyway, I think this partitions things like so:

1) Basic char and string types
2) Locale type which exists solely for locale identity
3) Collator type which allows various sorting of strings
4) Formatter and parser types for different data values
4a) A sort of formatter which allows reordering of arguments in the
    output string is needed (not too hard).
5) Reader and writer types for encoding characters into various
   encodings

My current thought is that the char and string types should be defined
in terms of Unicode.  The char type should be abstract, and have a
function for converting to Unicode codepoints.
The new string type may want to be immutable, with buffers used for
changeable things instead.

You should be able to get ahold of collators, formatters, message
catalogs, default encodings, and the like by having a locale.  You
should of course be able to ignore them, too.

I'll try to get some type signatures for possibilities up in the next
day or so.

John.

^ permalink raw reply	[flat|nested] 7+ messages in thread
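[Editorial note: the abstract char defined in terms of Unicode code points, plus an immutable string type, might be sketched like this. The module names `Uchar_` and `Ustring` are invented here (the trailing underscore avoids clashing with any real library module), and the range check is the simple 31-bit ISO-10646 one from this thread.]

```ocaml
(* An abstract character that can only be inspected via its code point. *)
module Uchar_ : sig
  type t
  val of_code_point : int -> t   (* raises Invalid_argument out of range *)
  val code_point : t -> int
end = struct
  type t = int
  let of_code_point n =
    if n < 0 || n > 0x7FFFFFFF then invalid_arg "Uchar_.of_code_point"
    else n
  let code_point n = n
end

(* An immutable string of such characters.  The representation is an
   array, but since the type is abstract no caller can mutate it
   through this interface. *)
module Ustring : sig
  type t
  val of_list : Uchar_.t list -> t
  val length : t -> int
  val get : t -> int -> Uchar_.t
end = struct
  type t = Uchar_.t array
  let of_list = Array.of_list
  let length = Array.length
  let get = Array.get
end
```

The point of the abstraction is exactly the one made below in the thread: an int and a character should be distinct types, even if the representation is an int.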
* RE: Request for Ideas: i18n issues
  1999-10-26  2:46 ` John Prevost
@ 1999-10-26 13:36   ` Frank A. Christoph
  1999-10-26 22:02     ` skaller
  1999-10-26 20:16   ` skaller
  1 sibling, 1 reply; 7+ messages in thread

From: Frank A. Christoph @ 1999-10-26 13:36 UTC (permalink / raw)
To: John Prevost, skaller; +Cc: caml-list

John Prevost wrote in response to John Skaller:

> > Probably not.  There is a distinction, for example, between
> > concatenating two arrays of code points, and producing a new array
> > of code points corresponding to the concatenated script.  This is
> > 'most true' in Arabic, but it is also true in plain old English:
> > the script
> {...}
>
> I don't believe that this is a job for the basic string manipulation
> stuff to do.  There do need to be methods for manipulating strings as
> sequences.  As such, I'm not going to worry about it at this level.

Maybe he is arguing for a higher-level approach to text representation.
Text is, after all, an important enough application field that it
deserves some treatment in a standard library.  Unfortunately (as we
are all bearing witness to in this discussion) there is no standard way
to do it, even for one (natural) language, much less all of them.

> The big difficulty here is that not everybody wants to eat Unicode.
> I think it's appropriate, but not everyone does.  And there are still
> characters in iso-2022, for example, which have no Unicode code
> point.
>
> I think that as I look at the problem more, though, I'm inclined to
> say "one definitive set of characters" is a better idea.  Especially
> since that set is needed for reasonable interoperation between the
> others.

This is getting a little afield from the topic of the list, but I have
been wondering if maybe Unicode (and other related standards) might not
end up being most valuable in the long run not for being able to
represent text from existing languages, but for shaping the form of
languages to come.
I don't mean completely new languages of course (I think I heard
Klingon and Tengwar were actually submitted for inclusion!), but rather
that as electronic information becomes more and more central to
communication, the need to encode it will change the way people speak
and write in other media as well.  I think this is happening already,
sort of: for example, I've seen email emoticons used in published
materials, and on billboards here in Japan; and we all know a hundred
words from computer jargon that have made it into mainstream languages.

> Something I've noted looking at O'Caml these last few days: the
> "string" type is really more an efficient byte array type.  And the
> char type is really a byte type.  There's no real way to do "input
> bytes from a stream" except inputting them as characters and then
> interpreting those characters as bytes.

That's exactly the way I think of and use O'Caml characters and
strings.  It is sort of unfortunate that O'Caml inherited this
terminology from C...  I would have preferred "byte" or "octet" for
characters, at least.

--FAC

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Request for Ideas: i18n issues
  1999-10-26 13:36 ` Frank A. Christoph
@ 1999-10-26 22:02   ` skaller
  0 siblings, 0 replies; 7+ messages in thread

From: skaller @ 1999-10-26 22:02 UTC (permalink / raw)
To: Frank A. Christoph; +Cc: John Prevost, caml-list

"Frank A. Christoph" wrote:

> This is getting a little afield from the topic of the list, [...]
> (I think I heard Klingon and Tengwar were actually submitted for
> inclusion!)

Klingon characters are in the ISO10646 International Standard.  It
appears it should have been called 'Intergalactic Standard' though ;-)
[Couldn't resist]

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Request for Ideas: i18n issues
  1999-10-26  2:46 ` John Prevost
  1999-10-26 13:36   ` Frank A. Christoph
@ 1999-10-26 20:16   ` skaller
  1999-10-27  0:37     ` John Prevost
  1 sibling, 1 reply; 7+ messages in thread

From: skaller @ 1999-10-26 20:16 UTC (permalink / raw)
To: John Prevost; +Cc: caml-list

John Prevost wrote:
>
> skaller <skaller@maxtal.com.au> writes:
>
> > >  * a type for characters in the charset
> >
> > I think you mean 'code points'.  These are logically distinct from
> > characters (which are kind of amorphous).  For example, a single
> > character -- if there is such a thing in the language -- may
> > consist of several code points combined, and there are code points
> > which are not characters.
>
> I'm not sure the distinction matters much at this level.  I am
> predisposed to prefer the name "char" to the name "codepoint" just
> because more people will understand it.

I myself tend to use these (incorrect) words as well, through
familiarity.  However, in the longer run, vague, and even incorrect,
terminology will make it even harder to communicate.

My experience with this is from the C++ Standardisation committee
(rather than i18n forums).  In C++, there are a number of important and
distinct concepts without proper terminology, and it makes it almost
impossible to have a technical discussion about the C++ language.  Over
90% of all issues and debates centred around terminology and
interpretation, rather than functionality.

But let me give another clarifying example.  In Unicode, there are some
code points which are reserved, and do not represent characters, namely
the High and Low surrogates.  These are used to allow the two-unit,
16-bit UTF-16 encoding of the first 16 planes of ISO-10646.  So when
you see two of these together, they are not two 'characters', but in
fact an encoding of a SINGLE code point, which itself may or may not be
a character.
> In the back of my head, I also generally interpret "codepoint" as
> meaning the number associated with the character,

Yes, this is more or less correct.  And basic array/string
manipulations must first work with code points, rather than the
abstract 'characters'.  In fact, the 'meaning' of the code points as
characters is to be found in the functions which manipulate the code
points (sort of like an ocaml module).  More correctly, sequences of
code points can be understood as 'script', and manipulated to represent
script manipulations.

Thus: it is possible to do a lexicographical string sort by code
points, which is not the same as a native language sort on script.  The
former is a convenience for representing strings in _some_ total order,
for example, for use in an ocaml 'Set'.  The latter may not even be
well defined, or may have multiple definitions, depending on the script
or use -- for example, the usual code point order sort of ASCII
characters is utterly wrong for a book index.

> rather than the abstract "character" itself.  I believe Unicode makes
> this distinction between a "glyph", a "character", and a "codepoint".

Yes: a glyph is the shape, a font's character is a particular rendering
of it, a character is an abstraction (is 'A' the same character as
'a'??  The shape is different, the concept is the same).  A code point
is just a number in a set of numbers used to encode script.

The difference is subtle, and I do not claim to fully understand it,
but the ISO and Unicode committees, needing to create standards,
emphasise the distinctions (to the point of the liaison on the C++
committee complaining about the incorrect use of the word 'character'
in the C++ language draft).

> > >  * a type for strings in the charset (maybe char array, maybe
> > >    not)
> >
> > I think you mean 'script'.  Strings of code points can be used to
> > represent script.
>
> Uh.  Okay, whatever.  Again, I will tend to use the more traditional
> word "string" to mean a sequence of characters.
> (On reflection, I think you're trying to say something deeper, but
> your explanation is so vague I won't try to interpret what it is.  I
> leave that to you.)

Sorry: I do not understand it entirely myself.  But consider: the code
point for a space has no character.  There is no such thing as a
'space' character.  There is no glyph.  There _is_ such a thing as
white space in (latin) script, however.

> I don't believe that this is a job for the basic string manipulation
> stuff to do.  There do need to be methods for manipulating strings as
> sequences.  As such, I'm not going to worry about it at this level.

I agree.  But, then, you are only dealing with some kind of array of
code points.  Ocaml already has an 'array' type which can be used for
this purpose, and if there were a string type that were variable
length, we would be somewhat better off.  But manipulating these data
structures is entirely structural, and only relates to 'script' in the
sense that the data structures are convenient for working with script
representations.  That is, actual script related functions such as

 (1) trimming whitespace
 (2) capitalisation
 (3) collation
 (4) charset mapping and encoding/decoding

are, IMHO, quite separate.

> The big difficulty here is that not everybody wants to eat Unicode.

I know this is true.  However, the many issues have been addressed by
representatives of National Governments and industry, and a consensus
has been reached, and is embodied in an International Standard.

> I think it's appropriate, but not everyone does.  And there are still
> characters in iso-2022, for example, which have no Unicode code
> point.

These issues are being addressed by members of the ISO technical
committee.  Any complaints should be addressed to them.  Programmers
should probably implement what is standardised, and perhaps complain,
or provide input to the process, rather than going off on their own
private, probably doomed track.
[This is not the same issue, IMHO, as programming languages: there is
just ONE commonly accepted universal standard for script, namely
ISO-10646.]

> I think that as I look at the problem more, though, I'm inclined to
> say "one definitive set of characters" is a better idea.  Especially
> since that set is needed for reasonable interoperation between the
> others.

This is my feeling, partly for another reason: anything else is just
too complicated.  ISO 10646 is hard enough as it is.

> Something I've noted looking at O'Caml these last few days: the
> "string" type is really more an efficient byte array type.  And the
> char type is really a byte type.

Yes.  And it is needed as such, even though perhaps poorly named.
Something which represents byte strings is essentially distinct from
something representing 'script'/'text'.  IMHO.

> Anyway, I think this partitions things like so:
>
> 1) Basic char and string types
> 2) Locale type which exists solely for locale identity
> 3) Collator type which allows various sorting of strings
> 4) Formatter and parser types for different data values
> 4a) A sort of formatter which allows reordering of arguments in the
>     output string is needed (not too hard).
> 5) Reader and writer types for encoding characters into various
>    encodings

Seems reasonable.  I personally would address (1) and (5) first.  Also,
I think 'regexp' et al. fits in somewhere, and needs to be addressed.

> My current thought is that the char and string types should be
> defined in terms of Unicode.

Please NO.  ISO10646.  'Unicode' is no longer supported by Unicode
Corp, which is fully behind ISO10646.  "unicode" is for implementations
that will be out of date before anyone gets the tools to actually use
internationalised software (like text editors, fonts, etc).  Java has
gone this way, and will pay the price in the long run.  Python is going
that way too.  But Unicode is ALREADY out of date.
If work is to be done -- and there is a LOT of work in this area --
then it should conform to the far-sighted International Standard, which
has plenty of space for extension, and is supported by an international
consensus of National bodies and technical experts.

> The char type should be abstract,

Why?  ISO 10646 code points are just integers 0 .. 2^31-1, and it is
necessary to have some way of representing them as _literals_.  In
particular, you cannot 'match' easily on an abstract type, but matching
is going to be a very common operation:

  match wchar with
  | WhatDoIPutHere -> ...

One solution is to have a native wchar type in ocaml, with literals
recognized by the lexer .. but it seems to me that integers will do the
job, even if they can be abused by, for example, multiplication (which,
off hand, doesn't seem appropriate).  [That is, code points are a Z
module: they're offsets.]

> You should be able to get ahold of collators, formatters, message
> catalogs, default encodings, and the like by having a locale.  You
> should of course be able to ignore them, too.

I would leave most of this out, at least for the moment: it is enough
work to just handle encodings and mappings.  These things are important
in the transition from other character sets, including Latin-1,
ShiftJis, etc.  Without these mapping/encoding tools, it isn't really
possible to actually work with ISO10646, because there are not many
tools that can do so: most people use either 8-bit editors, or
specialised editors for their DBCS encoding (like ShiftJis).

> I'll try to get some type signatures for possibilities up in the next
> day or so.

Great!

BTW: I have some Python code for doing a lot of mappings/encodings (all
the ones available on the Unicode.org website).  This may be useful; I
can generate tables of any kind as required.  (Please send me private
email.)
Anyone interested can examine my web page:
http://www.triode.net.au/~skaller/unicode/index.html

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 7+ messages in thread
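[Editorial note: the point above about matching on literals is worth seeing in code. With plain ints as code points, literals and guards work directly in patterns; this is a sketch in that style, and the function `describe` is invented purely for illustration.]

```ocaml
(* Code points as plain ints: literals match directly, no constructors
   or conversion functions needed in patterns. *)
type wchar = int

let describe (c : wchar) : string =
  match c with
  | 0x0A -> "line feed"
  | 0x20 -> "space"
  | c when 0x30 <= c && c <= 0x39 -> "decimal digit"
  | c when 0xD800 <= c && c <= 0xDFFF -> "surrogate (not a character)"
  | _ -> "something else"
```

The cost of this convenience is exactly the one conceded above: nothing stops a caller from multiplying two "characters", since the type system sees only int.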
* Re: Request for Ideas: i18n issues
  1999-10-26 20:16 ` skaller
@ 1999-10-27  0:37   ` John Prevost
  0 siblings, 0 replies; 7+ messages in thread

From: John Prevost @ 1999-10-27 0:37 UTC (permalink / raw)
To: skaller; +Cc: caml-list

skaller <skaller@maxtal.com.au> writes:

> Sorry: I do not understand it entirely myself.  But consider: the
> code point for a space has no character.  There is no such thing as
> a 'space' character.  There is no glyph.  There _is_ such a thing as
> white space in (latin) script, however.

I think this is overly pedantic.  Let's try not to go out this far.

> > 1) Basic char and string types
> > 2) Locale type which exists solely for locale identity
> > 3) Collator type which allows various sorting of strings
> > 4) Formatter and parser types for different data values
> > 4a) A sort of formatter which allows reordering of arguments in the
> >     output string is needed (not too hard).
> > 5) Reader and writer types for encoding characters into various
> >    encodings
>
> Seems reasonable.  I personally would address (1) and (5) first.
> Also, I think 'regexp' et al fits in somewhere, and needs to be
> addressed.

True.  And you're right, (1) and (5) are probably the most important
things.  I'm just trying to clarify what sorts of things are needed
for i18n stuff.

> > My current thought is that the char and string types should be
> > defined in terms of Unicode.
>
> Please NO.  ISO10646.  'Unicode' is no longer supported by Unicode
> Corp, which is fully behind ISO10646.  "unicode" is for
> implementations that will be out of date before anyone gets the
> tools to actually use internationalised software (like text editors,
> fonts, etc).

Ahh.  I'm not entirely sure which distinction you are making here.  I
use the phrase "Unicode" as meaning "ISO 10646", because I find it
easier to pronounce.  I understand the differences, but not why you
went off on a flame when you saw "Unicode".
As I understand it, the goal in Unicode has been conformance with
ISO 10646 since at least revision 2.0.

> > The char type should be abstract,
>
> Why? ISO 10646 code points are just integers 0 .. 2^31-1, and
> it is necessary to have some way of representing them as
> _literals_. In particular, you cannot 'match' easily on an abstract
> type, but matching is going to be a very common operation:
>
>   match wchar with
>   | WhatDoIPutHere -> ...

This is true, and having base-language support for wchar literals would
be ideal. However, the type of characters and the type of integers
should be disjoint, since the number 32 and the character ' ' are
distinct. You do not want to apply a function on characters to an
integer, because they are semantically different. This is precisely the
sort of thing the type system in ML is supposed to be used for, and I
won't stand for leaving such an important constraint out of the
abstraction. You may map an integer into the character with the
equivalent code point and vice versa, but a character is not an integer.
(In the implementation, of course, it actually is.)

From another part of your message:

> But let me give another clarifying example. In Unicode, there
> are some code points which are reserved, and do not represent
> characters, namely the High and Low surrogates. These are used to
> allow two-byte 16-bit UTF-16 encoding of the first 16 planes of
> ISO 10646.

I'd like to see this issue not exist in O'Caml's character type. To me,
a surrogate is half a character. If I have an array of characters, I
should not be able to look up a cell and get half a character.
Therefore, the base character type in O'Caml should be defined in terms
of the code point itself. This still leaves the surrogate area as
problematic: what if I want to be able to deal with input in which
surrogates are mismatched? What if the library client asks me to turn an
integer which is the code point of a surrogate into a character?
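To make the "a character is not an integer" point concrete, here is a
minimal sketch of what such an abstract character type could look like in
O'Caml. The module name `Wchar` and the `of_int`/`to_int` names are
illustrative only, not a proposed API; the surrogate check reflects the
policy argued for above (reject an integer whose code point is a
surrogate at construction time).

```ocaml
(* Hypothetical sketch: an abstract ISO 10646 character type.
   The representation is an int, but that fact is hidden by the
   signature, so a Wchar.t and an int cannot be confused. *)
module Wchar : sig
  type t                               (* abstract: disjoint from int *)
  exception Invalid_codepoint of int
  val of_int : int -> t                (* rejects surrogates, range errors *)
  val to_int : t -> int
end = struct
  type t = int
  exception Invalid_codepoint of int
  let of_int n =
    (* D800-DFFF are the UTF-16 surrogate code points: "half a
       character", never a character on their own. ISO 10646 code
       points range over 0 .. 2^31-1. *)
    if n < 0 || n > 0x7FFFFFFF || (n >= 0xD800 && n <= 0xDFFF)
    then raise (Invalid_codepoint n)
    else n
  let to_int c = c
end
```

With this in place, `Wchar.of_int 0xD800` raises instead of producing a
half-character, which is one answer to the "what if the client hands me
a surrogate code point" question: the library refuses at the boundary.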
But I think that in O'Caml, the character type should never contain
"half a character".

Your other message:

> In that process, I have learned a bit; mainly that:
>
>   (1) it is complicated
>   (2) it is not wise to specify interfaces based on
>       vague assumptions.
>
> That is, I'd like to suggest focussing on encodings/mappings, and
> character classes; and I believe that you will find that the issues
> demand different interfaces than you might think at first: I know
> that what I _thought_ made sense, at first, turned out to be not
> really correct.

I agree, and this is where I'll focus first. At the same time, one might
argue that an object that represents collation, at some level, has a
specific interface which allows an operation: comparing two pieces of
script (to employ an excessively long turn of phrase) for order.
Regardless of how the order is determined, that's what the goal is.

I don't intend to try to come up with some things and have them set in
stone. I intend this as an exercise to come up with some tools which
people can use, try out, fiddle with, and make suggestions about.

> On things like collation, there is no way I can see to proceed
> without obtaining the relevant ISO documents. I seriously doubt
> that it is possible to design an 'abstract' interface, without
> actually first trying to implement the algorithms: the interfaces
> will be determined by concrete details and issues.

In this case, I don't propose to describe in detail all of the issues of
collating. But, in order to provide a non-codepoint-ordered mechanism
for ordering strings, and maybe provide a way to find such a mechanism,
it might be worthwhile to explore a few possibilities, even if they come
to nought in the end. As for myself, I do not generally have the funds
or wherewithal to obtain ISO documents. I do have a copy of the Unicode
v2.0 book handy, but it doesn't concern itself with such issues.

Anyway, you're right, the encoding and character class stuff comes
first.
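The "object that represents collation" idea can be sketched without
settling any of the hard ordering questions. Since a locale is selected
at runtime, a collator is best modelled as a value rather than a module;
a record of functions is the simplest such value. All names here are
hypothetical, and the only collator shown is the trivial codepoint one.

```ocaml
(* Hypothetical sketch: a collator is just a value carrying a string
   comparison, so different locales can supply different orderings
   at runtime without changing any client code. *)
type collator = {
  compare : string -> string -> int;   (* < 0, 0, > 0: like compare *)
}

(* The trivial collator: plain codepoint (byte) order. A real locale
   would substitute a tailored comparison here. *)
let codepoint_collator = { compare = String.compare }

(* Clients are written against the interface, not the locale: *)
let sort_strings col lst = List.sort col.compare lst
```

Whatever the eventual interface looks like, the point stands: regardless
of how the order is determined, comparing two pieces of script for order
is the one operation every collator must expose.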
I think format strings and message catalogs are a doable second goal,
which actually won't cause too much pain. Ways to format dates and the
like are harder, not least because you have to have a standard way to
express a date before you can have a standard interface for outputting a
date in some localized format.

Anyway, encodings first. Then the world. I'll take a look at your Python
stuff.

John.
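For the message-catalog half of that second goal, the core lookup really
is painless. The sketch below is a toy, assuming an in-memory
association list keyed by locale; the names (`catalogs`, `gettext`, the
example keys and locales) are all invented for illustration. The one
design point it shows is the fallback: an unknown locale or key returns
the key itself rather than failing, which is the usual catalog behaviour.

```ocaml
(* Hypothetical toy message catalog: locale -> (key -> translation). *)
let catalogs = [
  ("en", [ ("greeting", "Hello, world") ]);
  ("fr", [ ("greeting", "Bonjour, monde") ]);
]

(* Look up a message for a locale; fall back to the key itself when
   either the locale or the message is missing, so output degrades
   gracefully instead of raising. *)
let gettext locale key =
  match List.assoc_opt locale catalogs with
  | None -> key
  | Some cat ->
      (match List.assoc_opt key cat with
       | Some msg -> msg
       | None -> key)
```

Argument reordering in format strings (goal 4a) is the harder part and
is not addressed by this sketch.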
end of thread, other threads:[~1999-10-28 17:09 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-10-22  0:25 Request for Ideas: i18n issues John Prevost
1999-10-25 18:55 ` skaller
1999-10-26  2:46   ` John Prevost
1999-10-26 13:36     ` Frank A. Christoph
1999-10-26 22:02       ` skaller
1999-10-26 20:16   ` skaller
1999-10-27  0:37     ` John Prevost