* Re: localization, internationalization and Caml @ 1999-10-15 13:53 Gerard Huet 1999-10-15 20:28 ` Gerd Stolpmann 1999-10-17 14:29 ` localization, internationalization and Caml Xavier Leroy 0 siblings, 2 replies; 24+ messages in thread From: Gerard Huet @ 1999-10-15 13:53 UTC (permalink / raw) To: Francis Dupont, skaller; +Cc: STARYNKEVITCH Basile, caml-list Just to put my 2 cents on this issue... At 10:26 15/10/99 +0200, Francis Dupont wrote: > In your previous mail you wrote: > > The current 'support' for 8 bit characters in ocaml should be deprecated immediately. It is an extremely bad thing to have, since Latin-1 et al are archaic 8 bit standards incompatible with the international standard for ISO10646 communication, namely the UTF-8 encoding. I do not agree. What we need is not ayatollah diktats, but careful thinking about the evolution of standards. First of all, ISO-Latin is as international a standard as ISO10646, only a bit more mature. In essence, international standards are not immediately obsolete; they are here to stay, because we need some stability in a world of sound engineering, as opposed to the permanent hype to which our discipline is subjected. Secondly, the string data type of Ocaml is not about ASCII or ISO-Latin or whatever. It is a low-level data type implementing sequences of bytes efficiently represented in machine memory. These bytes may be used for encoding elements of various finite sets such as ASCII or ISO-Latin, but the string library does not care about such intentions. When such strings are used to represent natural language sentences, there is a natural tendency to sophistication, from the UPPER CASE letters of the computer printers of old to ASCII to ISO-Latin 1, 2, etc., to Unicode. Beyond some point (256) these sets of codes can no longer be mapped one-to-one into bytes, and so multi-byte representations must be designed, such as UTF-8. 
Such multi-byte representations are inconsistent with the ISO-Latin convention somewhere, and thus the ISO-Latin character set must be shifted out of its usual representation, since the 8th bit is needed for the multi-byte encoding. So, for instance, engineers designing natural language interfaces must choose between sticking to the old convention in purely local software, or upgrading their software to the international standard, typically for Web applications. At some point I am sure some brave soul from the Ocaml implementation team will write a Unicode library for implementing the non-trivial manipulations of lists of Unicode characters, so that the above engineers will have a generic tool to use. Such libraries will typically implement a NEW datatype of "unistring" or whatever, with proper conversion to string representations of course, but the string data type is surely here to stay, because bytes are not going to become obsolete overnight. :-) >=> there is a rather strong opposition against UTF-8 in France >because it is not a natural encoding (ie. if ASCII maps to ASCII >it is not the case for ISO 8859-* characters, imagine a new UTF-X >encoding maps ASCII to strange things and you'd be able to understand >our concern). I do not share Francis' pessimism. The ISO committees are not entirely stupid, and care has been taken to make the move as painless as possible. ISO-Latin has just been shifted by a mere translation. Here is my Ocaml code for translating strings of ISO-Latin 1 characters to HTML with Unicode character references: let print_unicode c = let ascii = int_of_char c in (* test for ISO-LATIN *) if ascii < 128 then print_char c (* 7 bit ascii *) else print_string ("&#" ^ (string_of_int ascii) ^ ";") This is hardly mysterious or complicated or inefficient. >=> my problem is the output of the filter will be no more readable when >I've put too much French in the program (in comments for instance). 
Come on, Francis, we do not read core dumps nowadays, we read through the eyes of HTML or TeX or whatever! >=> I believe internationalization should not be done by countries >where English is the only used language: this is at least awkward... I simply do not understand this remark in a WWW world. Cheers Gérard ^ permalink raw reply [flat|nested] 24+ messages in thread
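The "mere translation" Huet mentions can be made concrete. His function above emits HTML numeric character references; for genuine UTF-8 bytes, the same observation (Latin-1 embeds directly into the first 256 Unicode code points) gives a two-line encoder. This is an illustrative sketch in modern OCaml, not code from the thread; the function names are invented:

```ocaml
(* A Latin-1 byte below 128 is already its own UTF-8 encoding;
   bytes 128..255 become exactly two bytes. *)
let utf8_of_latin1_char c =
  let code = Char.code c in
  if code < 0x80 then String.make 1 c            (* 7-bit ASCII: unchanged *)
  else
    let b1 = Char.chr (0xC0 lor (code lsr 6))      (* lead byte 110xxxxx *)
    and b2 = Char.chr (0x80 lor (code land 0x3F))  (* tail byte 10xxxxxx *)
    in Printf.sprintf "%c%c" b1 b2

let utf8_of_latin1 s =
  String.concat ""
    (List.map utf8_of_latin1_char
       (List.init (String.length s) (String.get s)))
```

For example, Latin-1 'é' (code 0xE9) becomes the two bytes 0xC3 0xA9, while plain ASCII passes through untouched.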
* Re: localization, internationalization and Caml 1999-10-15 13:53 localization, internationalization and Caml Gerard Huet @ 1999-10-15 20:28 ` Gerd Stolpmann 1999-10-19 18:06 ` skaller 1999-10-17 14:29 ` localization, internationalization and Caml Xavier Leroy 1 sibling, 1 reply; 24+ messages in thread From: Gerd Stolpmann @ 1999-10-15 20:28 UTC (permalink / raw) To: caml-list I agree that Unicode or even ISO-10646 support would be a nice thing. I also agree that for many users (including myself) the Latin-1 character set suffices. Luckily, both character sets are strongly related: The first 256 character positions of Unicode (which is the 16-bit subset of ISO-10646) are exactly the same as in Latin1. UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit character is represented by one to six bytes; the higher the character code the more bytes are needed. This encoding is mainly interesting for I/O, and not for internal processing, because it is impossible to access the characters of a string by their position. Internally, you must represent the characters as 16- or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is wasted (this is the price for the enhanced possibilities). UTF-8 is designed for compatibility, because the following holds: - Every ASCII character (i.e. with codes 0 to 127) is represented as before, and every non-ASCII character is represented by a byte sequence where the eighth bit is set. Old, non-UTF-aware programs can at least interpret the ASCII characters. (Note that there is a variant of UTF-8 which encodes the 0 character differently - by two bytes.) - If you sort UTF-8 strings alphabetically (more precisely, using the byte values of the encoding as the criterion) you get the same result as if you sorted the strings alphabetically by their character codes. 
This means that we need at least three types of strings: Latin1 strings for compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings for I/O. For simplicity, I suggest representing both Latin1 and UTF-8 strings by the same language type "string", and providing "wchar" and "wstring" for the extended character set. Of course, the current "string" data type is mainly an implementation of byte sequences, which is independent of the underlying interpretation. Only the following functions seem to introduce a character set: - The String.uppercase and .lowercase functions - The channels, if opened in text mode (because they translate line endings as needed for the operating system, and newline characters are part of the character set) The best solution would be to have internationalized versions of these functions (and perhaps of some more functions) which still operate on the "string" type but allow the user to select the encoding and the locale. This means we would have something like type encoding = UTF8 | Latin1 | ... type locale = ... val String.i18n_uppercase : encoding -> locale -> string -> string val String.i18n_lowercase : encoding -> locale -> string -> string val String.recode : encoding -> encoding -> string -> string (* changes the encoding if possible *) For "wstring" it is simpler: val Wstring.uppercase : wstring -> wstring val Wstring.i18n_uppercase : locale -> wstring -> wstring New opening mode for channels: Text of encoding The encoding argument specifies the encoding of the file. It must be possible to change this later (e.g. to process XML's "encoding" declaration). New input/output functions: val output_i18n_string : out_channel -> encoding -> string -> unit val input_i18n_line : in_channel -> encoding -> string Here, the encoding argument specifies the encoding of the internal representation. 
The other I/O functions need I18N versions as well, and of course we need to operate on "wstring"s directly: val output_wstring : out_channel -> wstring -> unit val input_wstring_line : in_channel -> wstring This all means that the number of string functions explodes: We need functions for compatibility (Latin1), functions for arbitrary 8 bit encodings, and functions for wide strings. I think this is the main argument against it, and it is very difficult to get around this. (Any ideas?) Francis Dupont: >=> my problem is the output of the filter will be no more readable when >I've put too much French in the program (in comments for instance). The enlarged character sets become more and more important, and it is only a matter of time until every piece of software which wants to be taken seriously can process them, even a dumb terminal or simple text editor. So you will be able to put accented characters into your comments, and you will see them as such even if you 'cat' the program text to the terminal or printer; this will work everywhere... >=> I believe internationalization should not be done by countries >where English is the only used language: this is at least awkward... but in the USA (a general prejudice not worth discussing; there is only some "personal experience" behind it). Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: Gerd.Stolpmann@darmstadt.netsurf.de (privat) Germany ---------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 24+ messages in thread
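The one-to-six-byte scheme Gerd describes can be sketched as a single OCaml function. This is an illustration of the original (RFC 2279 era) UTF-8 he is discussing, in which code points up to 31 bits are allowed; today's UTF-8 is restricted to four bytes and U+10FFFF. The function name is invented:

```ocaml
(* Encode one UCS-4 code point as UTF-8, using the original
   1..6-byte scheme: the higher the code, the more bytes. *)
let utf8_encode cp =
  (* continuation byte carrying bits [n+5..n] of cp: 10xxxxxx *)
  let cont n = Char.chr (0x80 lor ((cp lsr n) land 0x3F)) in
  if cp < 0x80 then Printf.sprintf "%c" (Char.chr cp)
  else if cp < 0x800 then
    Printf.sprintf "%c%c" (Char.chr (0xC0 lor (cp lsr 6))) (cont 0)
  else if cp < 0x10000 then
    Printf.sprintf "%c%c%c"
      (Char.chr (0xE0 lor (cp lsr 12))) (cont 6) (cont 0)
  else if cp < 0x200000 then
    Printf.sprintf "%c%c%c%c"
      (Char.chr (0xF0 lor (cp lsr 18))) (cont 12) (cont 6) (cont 0)
  else if cp < 0x4000000 then
    Printf.sprintf "%c%c%c%c%c"
      (Char.chr (0xF8 lor (cp lsr 24))) (cont 18) (cont 12) (cont 6) (cont 0)
  else
    Printf.sprintf "%c%c%c%c%c%c"
      (Char.chr (0xFC lor (cp lsr 30))) (cont 24) (cont 18) (cont 12) (cont 6) (cont 0)
```

The two properties Gerd lists are visible here: ASCII code points are encoded as themselves, and byte-wise comparison of two encodings agrees with comparison of the code points.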
* Re: localization, internationalization and Caml 1999-10-15 20:28 ` Gerd Stolpmann @ 1999-10-19 18:06 ` skaller 1999-10-20 21:05 ` Gerd Stolpmann 0 siblings, 1 reply; 24+ messages in thread From: skaller @ 1999-10-19 18:06 UTC (permalink / raw) To: Gerd.Stolpmann; +Cc: caml-list Gerd Stolpmann wrote: > > I agree that Unicode or even ISO-10646 support would be a nice thing. I > also agree that for many users (including myself) the Latin-1 character set > suffices. Generally, for me, 7 bit ASCII 'suffices'. But that is irrelevant, the world is bigger than my country, or Europe. >Luckily, both character sets are strongly related: The first 256 > character positions of Unicode (which is the 16-bit subset of ISO-10646) are > exactly the same as in Latin1. .. of course, this is not luck .. > UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit > character is represented by one to six bytes; the higher the character code the > more bytes are needed. This encoding is mainly interesting for I/O, and not for > internal processing, because it is impossible to access the characters of a > string by their position. Internally, you must represent the characters as 16- > or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is > wasted (this is the price for the enhanced possibilities). I don't agree. If you read ISO10646 carefully, you will find that you must STILL parse sequences of code points to obtain the equivalent of a 'character', if, indeed, such a concept exists, and furthermore, the sequences are not unique. For example, many diacritic marks such as accents may be appended to a code point, and act in conjunction with the preceding code point to represent a character. This is permitted EVEN if there is a single code point for that character; and worse, if there are TWO such marks, the order is not fixed. And that's just simple European languages. 
Now try Arabic or Thai :-) > This means that we need at least three types of strings: Latin1 strings for > compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings > for I/O. For simplicity, I suggest to represent both Latin1 and UTF-8 strings by > the same language type "string", and to provide "wchar" and "wstring" for the > extended character set. I'd like to suggest we forget the 'wchar' string, at least initially. I think you will find UTF-8 encoding requires very few changes. For example, genuine regular expressions work out of the box. String searching works out of the box. What doesn't work efficiently is indexing. And it is never necessary to do it for human script. Why would you ever want to, say, replace the 10th character of a string?? [You could: if you were analysing, say, a stock code, but in that case the nth byte would do: it isn't natural language script] The way to handle Latin-1, or Big-5, or KSC, or ShiftJis, is to translate it with an input filter, or internally, if the client is reading the codes directly. > Of course, the current "string" data type is mainly an implementation of byte > sequences, which is independent of the underlying interpretation. Only the > following functions seem to introduce a character set: > > - The String.uppercase and .lowercase functions It is best to get rid of these functions. They belong in a more sophisticated natural language processing package. > - The channels if opened in text mode (because they specially recognize the > line endings if needed for the operating system, and newline characters are > part of the character set) This is a serious problem. It is also partly ill-formed: what is a 'line' in Chinese, which writes characters top down? 
What is a 'line' in a Web page :-) > The best solution would be to have internationalized versions of these > functions (and of perhaps some more functions) which still operate on the > "string" type but allow the user to select the encoding and the locale. There is more. The compiler must be modified to accept identifiers in the extended character set. This should work out of the box (since characters with the 8th bit set are already accepted); in fact, it is too permissive. Secondly, literals need to be processed, to expand \uXXXX and \UXXXXXXXX escapes. > This means we would have something like > > type encoding = UTF8 | Latin1 | ... Be careful to distinguish CODE SET from ENCODING. See the Unicode home page for more details: Latin-1 is NOT an encoding, but a character set. There is a MAPPING from Latin-1 to ISO-10646. This is not the same thing as an encoding. I think this is the wrong approach: we do not want built-in cases for every possible encoding/character set. Instead, we want an open-ended set of conversions from (and perhaps to) the internally used representation. There are a LOT of such combinations; we need to add new ones without breaking into a module. We should do it functionally; this should work well in ocaml :-) > type locale = ... > val String.i18n_uppercase : encoding -> locale -> string -> string > val String.i18n_lowercase : encoding -> locale -> string -> string Not in the String module. This belongs in a different package which handles complex vagaries of human script. [This particular function is relatively simple. Another is 'get_digit', 'isdigit'. Whitespace is much harder. Collation is a nightmare :-] > val String.recode : encoding -> encoding -> string -> string > (* changes the encoding if possible *) This isn't quite right. The way to do this is to have a function: LATIN1_to_ISO10646 code which does the mapping (from a SINGLE code point in LATIN1 to ISO10646). The code point is an int. 
Separately, we handle encodings: UCS4_to_UTF8 code converts an int to a UTF8 string, and UTF8_to_UCS4 string position parses the string from position, returning a code point and position. There are other encodings, such as DBCS encodings, which generally are tied to a single character set. [UCS4/UTF8 are less dependent] [....] > This all means that the number of string functions explodes: Exactly. And we don't want that. So I suggest we continue to use the existing strings of 8 bit bytes ONLY, and represent ALL foreign [non ISO10646] character sets using ISO-10646 code points, encoded as UTF-8, and provide an input filter for the compiler. In addition, some extra functions to convert other character sets and encodings to ISO-10646/UTF-8 are provided, and, if you like, they can be plugged into the I/O system. This means a lot of conversion functions, but ONE internal representation only: the one we already have. > We need functions > for compatibility (Latin1), functions for arbitrary 8 bit encodings, and > functions for wide strings. I think this is the main argument against it, > and it is very difficult to get around this. (Any ideas?) I've been trying to tell you how to do it. The solution is simple: adopt ISO-10646 as the SOLE character set, and UTF-8 as the SOLE encoding of it; and provide conversions from other character sets and encodings. All the code that needs to manipulate strings can then be provided NOW as additional functions manipulating the existing string type. The apparent loss of indexing is a mirage. The gain is huge: ISO-10646 Level 1 compliance without any explosion of data types. Yes, some more _functions_ are needed to do extra processing, such as normalisation, comparisons of various kinds, capitalisation, etc. Regular expressions will need to be enhanced, to fix the special features (like case insensitive searching), but the basic regular expressions will work out of the box. 
> The enlarged character sets become more and more important, and it is only a > matter of time until every piece of software which wants to be taken seriously > can process them, even a dumb terminal or simple text editor. So you will be > able to put accented characters into your comments, and you will see them as > such even if you 'cat' the program text to the terminal or printer; this will > work everywhere... Yes. That time is not here yet, but it will soon come: international support will be mandatory for all large software purchases by governments and large corporations. -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 24+ messages in thread
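The parse-from-a-position interface skaller sketches (UTF8_to_UCS4 string position, returning a code point and the next position) might look like this in OCaml. An illustrative sketch only: it handles the one- to three-byte forms and does no validation of malformed input:

```ocaml
(* Parse one code point from a UTF-8 string at [pos];
   return (code point, position of the next code point). *)
let utf8_to_ucs4 s pos =
  let b n = Char.code s.[pos + n] in
  let c0 = b 0 in
  if c0 < 0x80 then (c0, pos + 1)                          (* 0xxxxxxx *)
  else if c0 < 0xE0 then                                   (* 110xxxxx *)
    (((c0 land 0x1F) lsl 6) lor (b 1 land 0x3F), pos + 2)
  else                                                     (* 1110xxxx *)
    (((c0 land 0x0F) lsl 12) lor ((b 1 land 0x3F) lsl 6)
       lor (b 2 land 0x3F), pos + 3)

(* With such a parser, whole-string processing is a fold over
   code points, with no random indexing needed. *)
let fold_utf8 f acc s =
  let rec go acc pos =
    if pos >= String.length s then acc
    else let cp, pos' = utf8_to_ucs4 s pos in go (f acc cp) pos'
  in go acc 0
```

For instance, folding a counter over a string gives its length in characters rather than bytes, which is the only "length" natural-language code usually needs.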
* Re: localization, internationalization and Caml 1999-10-19 18:06 ` skaller @ 1999-10-20 21:05 ` Gerd Stolpmann 1999-10-21 4:42 ` skaller 1999-10-21 12:05 ` Matías Giovannini 0 siblings, 2 replies; 24+ messages in thread From: Gerd Stolpmann @ 1999-10-20 21:05 UTC (permalink / raw) To: skaller; +Cc: caml-list On Tue, 19 Oct 1999, John Skaller wrote: >Gerd Stolpmann wrote: >> UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit >> character is represented by one to six bytes; the higher the character code the >> more bytes are needed. This encoding is mainly interesting for I/O, and not for >> internal processing, because it is impossible to access the characters of a >> string by their position. Internally, you must represent the characters as 16- >> or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is >> wasted (this is the price for the enhanced possibilities). > > I don't agree. If you read ISO10646 carefully, you will find >that you must STILL parse sequences of code points to obtain the >equivalent >of a 'character', if, indeed, such a concept exists, >and furthermore, the sequences are not unique. For example, >many diacritic marks such as accents may be appended to a code point, >and act in conjunction with the preceding code point to represent >a character. This is permitted EVEN if there is a single code point >for that character; and worse, if there are TWO such marks, the order is >not >fixed. > > And that's just simple European languages. Now try Arabic or Thai :-) Let's begin with languages we know. As far as I know, ISO10646 allows one not to implement the combining characters. I think a programming language should only provide the basic means by which you can operate on characters, but should not solve it completely. >> This means that we need at least three types of strings: Latin1 strings for >> compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings >> for I/O. 
For simplicity, I suggest to represent both Latin1 and UTF-8 strings by >> the same language type "string", and to provide "wchar" and "wstring" for the >> extended character set. > > I'd like to suggest we forget the 'wchar' string, at least initially. >I think you will find UTF-8 encoding requires very few changes. For >example, >genuine regular expressions work out of the box. String searching >works out of the box. > > What doesn't work efficiently is indexing. >And it is never necessary to do it for human script. >Why would you ever want to, say, replace the 10'th character >of a string?? [You could: if you were analysing, say, a stock code, >but in that case the n'th byte would do: it isn't natural language >script] Because I have an algorithm operating on the characters of a string. Such algorithms use indexes as pointers to parts of a string, and in most cases the indexes are only incremented or decremented. On a UTF-8 string, you could define an index as type index = { index_position : int; byte_position : int } and define the operations "increment", "decrement" (only working together with the string), "add", "subtract", "compare" (to calculate string lengths). Such indexes have strange properties; they can only be interpreted together with the string to which they refer. You cannot avoid such an index type really; you can only avoid giving the thing a name, programming the index operations anew every time. Perhaps your suggestion works; but string manipulation will then be much slower. For example, an "increment" must be implemented by finding the next beginning of a character (instead of just incrementing a numeric index). >> Of course, the current "string" data type is mainly an implementation of byte >> sequences, which is independent of the underlying interpretation. Only the >> following functions seem to introduce a character set: >> >> - The String.uppercase and .lowercase functions > >It is best to get rid of these functions. 
They belong in a more >sophisticated >natural language processing package. There will always be a difference between natural languages and sophisticated packages. Even the current String.uppercase is wrong (in Latin1 there is a lower case character without a corresponding capital character (\223, the German ß), but WORDS containing this character can be capitalized by applying a semantic rule). I would suppose that String.upper/lowercase are part of the library because the compiler itself needs them. Currently, ocaml depends on languages that know the distinction of character cases. In my opinion such case functions can only approximate the semantic meaning, and a simple approximation is better than no approximation. >> - The channels if opened in text mode (because they specially recognize the >> line endings if needed for the operating system, and newline characters are >> part of the character set) > > This is a serious problem. It is also partly ill-formed: >what is a 'line' in Chinese, which writes characters top down? But lines exist. For example, your message is divided into lines. The concept of lines is too important to be dropped, although it is simple (much of the success has to do with its simplicity). Other writing traditions also have a writing direction. >What is a 'line' in a Web page :-) What is a 'line' in the sky? >> This means we would have something like >> >> type encoding = UTF8 | Latin1 | ... > > Be careful to distinguish CODE SET from ENCODING. >See the Unicode home page for more details: Latin-1 is NOT >an encoding, but a character set. There is a MAPPING from >Latin-1 to ISO-10646. This is not the same thing as an encoding. >I think this is the wrong approach: we do not want built-in >cases for every possible encoding/character set. Character sets and encodings are both artificial concepts. When I program, I always deal with a combination of both. 
The distinction is irrelevant for most applications; it is important if you want to convert texts from (cs1,enc1) to (cs2,enc2) because conversion is not always possible. My idea is that the type "encoding" enumerates all supported combinations; I expect only a few. > Instead, we want an open ended set of conversions from (and perhaps to) >the internally used representation. There are a LOT of such >combinations, >we need to add new ones without breaking into a module. We should do it >functionally; this should work well in ocaml :-) What kind of problem do you want to solve with an open ended set of conversions? Isn't this the task of a specialized program? >> type locale = ... >> val String.i18n_uppercase : encoding -> locale -> string -> string >> val String.i18n_lowercase : encoding -> locale -> string -> string > > Not in the String module. This belongs in a different package >which handles complex vagaries of human script. [This particular >function, is relatively simple. See above. >Another is 'get_digit', 'isdigit'. >Whitespace is much harder. Collation is a nightmare :-] I think collation should be left out of a basic library. Even for a single language, there are often several traditions of how to sort, and it also depends on the kind of strings you are sorting (for example, think of personal names). Members of traditions can contribute special modules for collation. > >> val String.recode : encoding -> encoding -> string -> string >> (* changes the encoding if possible *) > > This isn't quite right. The way to do this is to have a function: > > LATIN1_to_ISO10646 code > >which does the mapping (from a SINGLE code point in LATIN1 to ISO10646). >The code point is an int. 
>There are other encodings, such as DBCS encodings, which generally >are tied to a single character set. [UCS4/UTF8 are less dependent] > The most correct interface is not always the best. >[....] >> This all means that the number of string functions explodes: > > Exactly. And we don't want that. So I suggest, we continue >to use the existing strings of 8 bit bytes ONLY, and represent >ALL foreign [non ISO10646] character sets using ISO-10646 code points, >encoded as UTF-8, and provide an input filter for the compiler. > > In addition, some extra functions to convert other >character sets and encodings to ISO-10646/UTF-8 are provided, >and, if you like, they can be plugged into the I/O system. > > This means a lot of conversion functions, but ONE >internal representation only: the one we already have. There will be a significant slow-down of all ocaml programs if the strings are encoded as UTF-8. I think the user of a language should be able to choose what is more important: time or space or reduced functionality. UTF-8 saves space, and costs time; UCS-4 wastes space, but saves time; UCS-2 is a compromise and bad because it is a compromise; Latin 1 (or another 8 bit cs) saves time and space but has less functionality. >> We need functions >> for compatibility (Latin1), functions for arbitrary 8 bit encodings, and >> functions for wide strings. I think this is the main argument against it, >> and it is very difficult to get around this. (Any ideas?) > > I've been trying to tell you how to do it. The solution is simple, >to adopt ISO-10646 as the SOLE character set, and UTF-8 as the SOLE >encoding of it; and provide conversions from other character sets and >encodings. It looks simple but I suppose it is not what the ocaml users want. >All the code that needs to manipulate strings can then be provided NOW >as additional functions manipulating the existing string type. And because compatibility is lost, the whole current code base has to be worked through. 
> The apparent loss of indexing is a mirage. The gain is huge: >ISO-10646 Level 1 compliance without any explosion of data types. >Yes, some more _functions_ are needed to do extra processing, >such as normalisation, comparisons of various kinds, >capitalisation, etc. Regular expressions will need to be >enhanced, to fix the special features (like case insensitive searching), >but the basic regular expressions will work out of the box. > >> The enlarged character sets become more and more important, and it is only a >> matter of time until every piece of software which wants to be taken seriously >> can process them, even a dumb terminal or simple text editor. So you will be >> able to put accented characters into your comments, and you will see them as >> such even if you 'cat' the program text to the terminal or printer; this will >> work everywhere... > > Yes. This time is not here yet, but it will come soon that >international support is mandatory for all large software purchases >by governments and large corporations. I do not believe that this will be the driving force because the current solutions exist, and it is VERY expensive to replace them. It is even cheaper to replace a language than a character set/encoding. Looks like another Year 2000 but without a deadline. The first field where some progress will be made is data exchange, because ISO10646 can bridge several character sets. At that time, tools will be available to view and edit such data, and of course to convert them; ISO10646 will be used in parallel with the "traditional" character set. These tools will be low-level, and perhaps operating systems will then support the tools with fonts, input methods, and conventions for indicating the encoding. (The current environment variables solution is a pain if you try to use two encodings in parallel. For example, I can imagine that Unix terminal drivers allow one to select the encoding directly, in the same way as you can set other terminal properties.) 
In contrast to this, many applications need not be replaced and won't be. Perhaps they will have an ISO10646 import/export filter. -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: Gerd.Stolpmann@darmstadt.netsurf.de (privat) Germany ---------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 24+ messages in thread
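The two-component index Gerd proposes above, together with the "increment" he notes must scan for the next character boundary, can be sketched as follows. This is an illustration, not code from the thread; it assumes well-formed UTF-8:

```ocaml
(* Gerd's proposal: a character position paired with the byte
   position it corresponds to in a given UTF-8 string. *)
type index = { index_position : int; byte_position : int }

let start = { index_position = 0; byte_position = 0 }

(* Advance by one character: step past the lead byte, then past any
   continuation bytes (those of the form 10xxxxxx). *)
let increment s i =
  let n = String.length s in
  let rec skip b =
    if b >= n || Char.code s.[b] land 0xC0 <> 0x80 then b
    else skip (b + 1)
  in
  { index_position = i.index_position + 1;
    byte_position = skip (i.byte_position + 1) }
```

As Gerd says, such an index is meaningful only together with its string: the same record applied to a different string would point into the middle of a character.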
* Re: localization, internationalization and Caml 1999-10-20 21:05 ` Gerd Stolpmann @ 1999-10-21 4:42 ` skaller 1999-10-21 12:05 ` Matías Giovannini 1 sibling, 0 replies; 24+ messages in thread From: skaller @ 1999-10-21 4:42 UTC (permalink / raw) To: Gerd.Stolpmann; +Cc: caml-list Gerd Stolpmann wrote: > On Tue, 19 Oct 1999, John Skaller wrote: > > > > I don't agree. If you read ISO10646 carefully, you will find > >that you must STILL parse sequences of code points > Let's begin with languages we know. As far as I know, ISO10646 allows it not to > implement the combining characters. From memory, you are correct: there are three specified levels of compliance. Level 1 compliance does not require processing combining characters. > I think, a programming language should only > provide the basic means by which you can operate with characters, but should not > solve it completely. Yes, I agree, at least at this time. > > What doesn't work efficiently is indexing. > >And it is never necessary to do it for human script. > >Why would you ever want to, say, replace the 10'th character > >of a string?? > > Because I have an algorithm operating on the characters of a string. If the string represents human script, it is then wrong because it makes incorrect assumptions about the nature of human script. You will need to rewrite it, if you want it to work in an international setting. > Such > algorithms use indexes as pointers to parts of a string, and in most cases the > indexes are only incremented or decremented. On a UTF-8 string, you could > define an index as > > type index = { index_position : int; byte_position : int } > > and define the operations "increment", "decrement" (only working together with > the string), "add", "substract", "compare" (to calculate string lengths). Such > indexes have strange properties; they can only be interpreted together with the > string to which they refer. 
> > You cannot avoid such an index type really; you can only avoid giving the > thing a name and programming the index operations every time anew. I agree. But my point is: you should change your code _anyhow_, to use the new and correct parsing method, because it is necessary for Level 2 and Level 3 compliance. Your code will then work correctly at those levels when the 'increment' function is upgraded. What you will find is something which by chance, perhaps, is natural in Python: there is no such thing as a character. A string is NOT an array of characters. Strings can be composed from strings, and decomposed into arrays of strings, but there is not really any character type. > Perhaps your suggestion works; but string manipulation will then be much > slower. For example, an "increment" must be implemented by finding the next > beginning of a character (instead of just incrementing a numeric index). Yes, but this is a fact: it is actually required for correct processing of human script. You cannot 'magic' away the facts. What you can do, if you are programming with a known subset such as the characters of a stock code, is use indexing anyway, perhaps with the ASCII subset. That is, you can use the byte strings as character strings. > There will always be a difference between natural languages and sophisticated > packages. Yes. However, there is an important point here. Natural languages are quirky and their behaviour varies: every human uses language differently in every sentence, varying with region, context, etc. Obviously, computer systems only use some abstracted representation. While there are many levels and ways of abstracting this, there is one that is worthy of special interest here: the ISO10646 Standard. So I guess my suggestion is that in the _standard_ language libraries we will eventually need to implement the algorithms required for compliance with that Standard.
In my opinion, that naturally breaks into two parts: a) (byte/word) string management: this is an issue of storage allocation and manipulation, not natural language processing b) basic natural language processing > Even the current String.uppercase is wrong (in Latin1 there is a > lower case character without a corresponding capital character (\223), but WORDS > containing this character can be capitalized by applying a semantical rule). > > I would suppose that String.upper/lowercase are part of the library because the > compiler itself needs them. Currently, ocaml depends on languages that know the > distinction of character cases. AH! you are right! > In my opinion such case functions can only approximate the semantical meaning, > and a simple approximation is better than no approximation. No. That is, I agree entirely, but make a different point: an arbitrary simple approximation is worthless; the one that is useful is the ISO Standardised one. > My idea is that the type "encoding" enumerates all supported combinations; I > expect only a few. Please no. Leave the type open to external augmentation. Just consider: my Interscript literate programming tool ALREADY supports something like 30 "encodings" -- all those present on the unicode.org website. A closed 'type' would already be out of date. > What kind of problem do you want to solve with an open ended set of > conversions? Isn't this the task of a specialized program? No. It allows a generalised ISO10646 compliant program to read and perhaps write any file encoded in any supported encoding, but manipulate it internally in one format. If an encoding is missing, it is easy to add a new pair of conversion functions, without breaking the standard library. That is, it is the task of specialised _functions_. It makes sense to provide some as standard, like the ones your type suggests -- but not to represent the cases with a type.
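The open, function-based design being argued for here could be sketched like this. The names are entirely hypothetical (no such library exists); the point is simply that a codec represented as a pair of conversion functions can be supplied by users, whereas a constructor of a closed variant type cannot:

```ocaml
(* Hypothetical sketch: an encoding is a record of conversion
   functions to and from an internal form (here, arrays of code
   points), not a constructor of a closed variant type. Users can
   add codecs without touching the library. Modern OCaml assumed. *)
type codec = {
  name : string;
  decode : string -> int array;   (* external bytes -> code points *)
  encode : int array -> string;   (* code points -> external bytes *)
}

(* A trivial built-in codec for Latin-1, where byte = code point.
   Char.chr raises Invalid_argument for code points above 255. *)
let latin1 = {
  name = "ISO-8859-1";
  decode = (fun s -> Array.init (String.length s)
                       (fun i -> Char.code s.[i]));
  encode = (fun a -> String.init (Array.length a)
                       (fun i -> Char.chr a.(i)));
}
```

A UTF-8 codec, or one for any of the encodings on the unicode.org site, would just be another value of type codec.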
Ocaml variants are not amenable to extension. Function parameters are. That is, I think there are exactly two cases: a) no conversion required b) user-supplied conversion function > I think collation should be left out of a basic library. Probably right. Level 1 compliance is a good start, and does not require collation. > The most correct interface is not always the best. What do you mean 'most correct'? Either the interface supports the (ISO10646) required behaviour or not. > There will be a significant slow-down of all ocaml programs if the strings are > encoded as UTF-8. No. On the contrary, most existing programs will be unaffected. Those which actually care about internationalisation can only be made faster (by providing native support). > I think the user of a language should be able to choose what > is more important: time or space or reduced functionality. UTF-8 saves space, > and costs time; UCS-4 wastes space, but saves time; UCS-2 is a compromise > and bad because it is a compromise; Latin 1 (or another 8 bit cs) saves > time and space but has less functionality. Sure, but this leads to multiple interfaces. Was that not the original problem? Let me put the argument for UTF-8 differently. Processing UTF-8 'as is' is non-trivial and should be done in low-level system functions for speed. Processing arrays of 31 bit integers is _already_ well supported in ocaml, and will be better supported by adding variable length arrays with functions that are designed with some view of use for string processing. So we don't actually need a wide character string type or supporting functions, precisely because in the simplest cases a standard data type not really specialised to script processing will do the job. What is actually required (in both cases) are some 'data tables' to support things like case mapping. For example, a function convert_to_upper i, which takes an ocaml integer argument, would be useful, and it is easy enough to 'map' this over an array. Sigh.
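The "data tables" approach sketched here could look like the following. This is illustrative only: the table knowledge is hard-coded and covers just the ASCII and Latin-1 ranges, where a real implementation would be generated from the Unicode data files:

```ocaml
(* Illustrative sketch of a convert_to_upper for code points: a
   per-code-point function, mapped over an int array representing a
   wide string. Only ASCII and Latin-1 letters are handled here; a
   real table would come from the Unicode data files. *)
let convert_to_upper i =
  if i >= Char.code 'a' && i <= Char.code 'z' then i - 32
  else if i >= 0xE0 && i <= 0xFE && i <> 0xF7 then i - 32
  (* 0xF7 is the Latin-1 division sign, not a letter; 0xDF (the
     character Gerd mentions, with no single-character upper case)
     lies below 0xE0 and so correctly stays unchanged. *)
  else i

(* 'Mapping this over an array' is then one line. *)
let uppercase_wide (s : int array) = Array.map convert_to_upper s
```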
See next post. I will post my code, so it can be torn up by experts. -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-20 21:05 ` Gerd Stolpmann 1999-10-21 4:42 ` skaller @ 1999-10-21 12:05 ` Matías Giovannini 1999-10-21 15:35 ` skaller 1999-10-26 4:36 ` Go for ultimate localization! Benoit Deboursetty 1 sibling, 2 replies; 24+ messages in thread From: Matías Giovannini @ 1999-10-21 12:05 UTC (permalink / raw) To: caml-list; +Cc: Gerd.Stolpmann, skaller Gerd Stolpmann wrote: > > On Tue, 19 Oct 1999, John Skaller wrote: > >Gerd Stolpmann wrote: > >> The enlarged character sets become more and more important, and it is only a > >> matter of time until every piece of software which wants to be taken seriously > >> can process them, even a dumb terminal or simple text editor. So you will be > >> able to put accented characters into your comments, and you will see them as > >> such even if you 'cat' the program text to the terminal or printer; this will > >> work everywhere... > > > > Yes. This time is not here yet, but it will come soon that > >international support is mandatory for all large software purchases > >by governments and large corporations. > > I do not believe that this will be the driving force because the current > solutions exist, and it is VERY expensive to replace them. It is even cheaper > to replace a language than a character set/encoding. Looks like another Year > 2000 but without a deadline. I still don't understand the point of this discussion. As a MacOS programmer of many years, I tend to view localization and internationalization as tasks best performed by the operating system, or at least by pluggable modules. This discussion of patching l10n and i18n functions *into* OCaml is, to me at least, losing direction. OCaml uses Latin1 for its *internal* encoding of identifiers.
While I'll agree that my view is chauvinistic (and selfish, perhaps: I already have "¿¡áéíóúüñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?), I see no restriction in that (well, if I were Chinese, or Egyptian, I would see things differently). What's more, the whole syntactic apparatus of a programming language *assumes* a Latin setting, where things make sense when read from left to right, from top to bottom; and where punctuation is what we're used to. Programming languages suited for a Han, or Arab, or even a Hebrew audience would have to be rethought from the ground up. On the other hand, OCaml provides a String type that *can be* seen as a variable-length sequence of uninterpreted bytes. We have uninterpreted bytes! It's all we need to build whatever I18NString type we may need. What is missing are *library* facilities to abstract that view into a full-fledged i18n machinery. Of course, there's a problem with the manipulation of 32-bit integer values, but if used with care, the Nat datatype could serve perfectly well as the underlying, low-level datatype. Which makes me think, John, you already have variable-length int arrays. Nat's are as unsafe as they get :-) Regards, Matías. -- I got your message. I couldn't read it. It was a cryptogram. -- Laurie Anderson ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-21 12:05 ` Matías Giovannini @ 1999-10-21 15:35 ` skaller 1999-10-21 16:27 ` Matías Giovannini 1999-10-26 4:36 ` Go for ultimate localization! Benoit Deboursetty 1 sibling, 1 reply; 24+ messages in thread From: skaller @ 1999-10-21 15:35 UTC (permalink / raw) To: matias; +Cc: caml-list, Gerd.Stolpmann Matías Giovannini wrote: > OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll > agree that my view is chauvinistic (and selfish, perhaps: I already have > "¿¡áéíóúüñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?), > I see no restriction in that (well, if I were Chinese, or Egyptian, I > would see things differently). Exactly. There are quite a lot of Chinese, Indian, Russian ... and non-Latin people in the world: more than Latins. And many face a barrier to participating in the computing world because of language problems. > What's more, the whole syntactic > apparatus of a programming language *assumes* a Latin setting, where > things make sense when read from left to right, from top to bottom; and > where punctuation is what we're used to. Programming languages suited > for a Han, or Arab, or even a Hebrew audience would have to be rethought > from the ground up. Actually, no. Most of these people learn English and learn computing, if they are to work with computers. But they still wish to use comments, strings, and identifiers in their native script. Have you ever seen a Japanese program? I have. Quite an interesting challenge: normal C/C++ code, with Latin characters encoding Japanese character names in identifiers, and actual Japanese characters in comments and strings. I had no idea what the code did.
My point: for a non-native speaker, being forced to use a foreign language for identifiers and comments is a serious impediment; not having native characters in strings is not merely an impediment but a complete disaster (how will the users of the program understand it -- they may not know any Latin language). > On the other hand, OCaml provides a String type that *can be* seen as a > variable-length sequence of uninterpreted bytes. Yes. What ocaml does not provide is a way of encoding extended characters -- \uXXXX \UXXXXXXXX in strings, or in identifiers. > We have uninterpreted > bytes! It's all we need to build whatever I18NString type we may need. > What is missing is *library* facilities to abstract that view into a > full-fledged i18n machinery. I agree. > Of course, there's a problem with the > manipulation of 32-bit integer values, but if used with care, the Nat > datatype could serve perfectly well as the underlying, low-level datatype. > > Which makes me think, John, you already have variable-length int arrays. But they're not standard (yet). Actually, ocaml 'int' is 31 bits, which is enough bits for ISO10646 (with some careful fiddling to avoid problems with the sign?). So there are TWO issues -- one is to make ocaml itself ISO10646-aware (i.e., the compiler), and the other is to provide users with libraries to manipulate extended characters. Please note: neither of these features would be optional, were ocaml to be submitted for ISO standardisation. ISO directives require all ISO languages to upgrade to provide international support. I know ocaml isn't an ISO language, but I think the basic intent is sound. [In some sense, ocaml is already a leader, accepting Latin-1 characters when other languages only allowed ASCII] -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 24+ messages in thread
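What a \uXXXX escape would have to expand to is simply the UTF-8 byte sequence of the code point. A sketch (names illustrative; for brevity this handles only code points up to U+FFFF, where real \U escapes go further):

```ocaml
(* Sketch: expand a code point (as from a \uXXXX escape) into its
   UTF-8 bytes. BMP only (up to U+FFFF); the name is illustrative. *)
let utf8_of_code_point u =
  if u < 0x80 then
    (* 1 byte: 0xxxxxxx *)
    String.make 1 (Char.chr u)
  else if u < 0x800 then
    (* 2 bytes: 110xxxxx 10xxxxxx *)
    let b0 = 0xC0 lor (u lsr 6)
    and b1 = 0x80 lor (u land 0x3F) in
    Printf.sprintf "%c%c" (Char.chr b0) (Char.chr b1)
  else
    (* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
    let b0 = 0xE0 lor (u lsr 12)
    and b1 = 0x80 lor ((u lsr 6) land 0x3F)
    and b2 = 0x80 lor (u land 0x3F) in
    Printf.sprintf "%c%c%c" (Char.chr b0) (Char.chr b1) (Char.chr b2)
```

For example, \u00E9 (é) expands to the two bytes C3 A9, and \u20AC (the euro sign) to E2 82 AC.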
* Re: localization, internationalization and Caml 1999-10-21 15:35 ` skaller @ 1999-10-21 16:27 ` Matías Giovannini 1999-10-21 16:36 ` skaller 0 siblings, 1 reply; 24+ messages in thread From: Matías Giovannini @ 1999-10-21 16:27 UTC (permalink / raw) To: caml-list; +Cc: skaller skaller wrote: > > Matías Giovannini wrote: > >What's more, the whole syntactic > > apparatus of a programming language *assumes* a Latin setting, where > > things make sense when read from left to right, from top to bottom; and > > where punctuation is what we're used to. Programming languages suited > > for a Han, or Arab, or even a Hebrew audience would have to be rethought > > from the ground up. > > Actually, no. Most of these people learn English and learn > computing, if they are to work with computers. But they still wish > to use comments, strings, and identifiers in their native script. Strings can be localized with a package mechanism, à la Java. I don't like hardwired strings in code; they're a maintenance nightmare (not that I always abide by my own rule :-) > Have you ever seen a Japanese program? I have. > Quite an interesting challenge: normal C/C++ code, with > Latin characters encoding Japanese character names in identifiers, > and actual Japanese characters in comments and strings. I agree that comments should be written in the language most suited to the intended audience (I normally comment my code in English, unless I know I want someone else to maintain it, in which case I comment it in Spanish.) > > On the other hand, OCaml provides a String type that *can be* seen as a > > variable-length sequence of uninterpreted bytes. > > Yes. What ocaml does not provide is a way of encoding > extended characters -- \uXXXX \UXXXXXXXX in strings, or in identifiers. No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude.
> >Of course, there's a problem with the > > manipulation of 32-bit integer values, but if used with care, the Nat > > datatype could serve perfectly well as the underlying, low-level datatype. > > > > Which makes me think, John, you already have variable-length int arrays. > > But they're not standard (yet). They are! Don't be put off by its status as "experimental feature". Nat's been around since CamlLight. You could even use it as a template implementation of unsafe longint varlen arrays and link a custom toplevel. Yet again, OCaml provides the tools. > So there are TWO issues -- one is to make ocaml itself > ISO10646 aware (i.e., the compiler), and the other is to provide > users with libraries to manipulate extended characters. I think a more realistic goal would be making OCaml ISO10646-tolerant in comments. Perhaps adding real conditional compilation and transparent comments would suffice. Again, anyone can download the source code and modify OCaml to suit his tastes. OCaml's goal is not to be a model of i18n awareness, but a platform for experimenting with types in a functional setting. It happens that OCaml is open enough, and extensible enough, and efficient enough to make a good i18n effort possible, and that is a tribute to its success as a strongly-typed, imperative, fast functional language. > Please note: neither of these features would be optional, > were ocaml to be submitted for ISO standardisation. ISO directives > require all ISO languages to upgrade to provide international > support. I know ocaml isn't an ISO language, but I think the > basic intent is sound. [In some sense, ocaml is already a leader, > accepting Latin-1 characters when other languages only allowed ASCII] The implementors have made clear on more than one occasion that they're not interested in making OCaml a standard language (remember the thread "How to convince management?"). But don't take my word for it, ask Pierre. -- I got your message. I couldn't read it.
It was a cryptogram. -- Laurie Anderson ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-21 16:27 ` Matías Giovannini @ 1999-10-21 16:36 ` skaller 1999-10-21 17:21 ` Matías Giovannini ` (2 more replies) 0 siblings, 3 replies; 24+ messages in thread From: skaller @ 1999-10-21 16:36 UTC (permalink / raw) To: matias; +Cc: caml-list Matías Giovannini wrote: > Strings can be localized with a package mechanism, à la Java. I don't > like hardwired strings in code, they're a maintenance nightmare (not > that I always abide by my own rule :-) It doesn't matter what you like (or what I like). > > Have you ever seen a Japanese program? I have. > > Yes. What ocaml does not provide is a way of encoding > > extended characters -- \uXXXX \UXXXXXXXXX in strings, or in identifiers. > > No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude. Irrelevant. The \u \U escapes are ISO recommended, used in C and C++, and must be supported. > > > Which makes me think, John, you already have variable-length int arrays. > > > > But they're not standard (yet). > > They are! Don't be put off by its status as "experimental feature". > Nat's been around since CamlLight. Oh, I must have misunderstood your comment: Nat is standard, I'm using it in Viper, but 'a Varray -- a variable length array of 'a, is not. > Again, anyone can download the source code and modify OCaml to suit his > tastes. OCaml's goal is not to be a model of i18n awareness, but a > platform for experimenting with types in a functional setting. Ocaml is a tool, it doesn't have a goal. :-) Humans have goals. The problem is that the designers of ocaml have been too successful: ocaml is so good that other people now want to use it, and _their_ goals are important too. >It > happens that OCaml is open enough, and extensible enough, and efficient > enough to make a good i18n effort possible, and that is a tribute to its > success as strongly-typed, imperative, fast functional language. I agree. 
It could easily become a leader in this field, since implementing complex stuff is relatively easy in ocaml :-) > The implementors have made clear in more than one occasion that they're > not interested in making OCaml a standard language (remember the thread > "How to convince management?"). But don't take my word for it, ask Pierre. My point was simply that the ISO internationalisation requirements are not unreasonable, and that other languages will be doing this work, some because they have to, and some because they want to stay part of the real world -- and encourage non-English (woops, I mean, non-Latin :-) clients, who, after all, may well make significant contributions. -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-21 16:36 ` skaller @ 1999-10-21 17:21 ` Matías Giovannini 1999-10-23 9:53 ` Benoit Deboursetty 1999-10-25 0:54 ` How to format a float? skaller 2 siblings, 0 replies; 24+ messages in thread From: Matías Giovannini @ 1999-10-21 17:21 UTC (permalink / raw) To: caml-list; +Cc: skaller skaller wrote: > > Matías Giovannini wrote: > > > Strings can be localized with a package mechanism, à la Java. I don't > > like hardwired strings in code, they're a maintenance nightmare (not > > that I always abide by my own rule :-) > > It doesn't matter what you like (or what I like). It doesn't; my point is that the functionality for localized strings can be had, only through an indirect route such as "string packages". As an aside, let's keep the tone light, ok? > > > > Have you ever seen a Japanese program? I have. > > > Yes. What ocaml does not provide is a way of encoding > > > extended characters -- \uXXXX \UXXXXXXXX in strings, or in identifiers. > > > > No need to. Use \HH\LL. Again, what OCaml does is sensible, if crude. > > Irrelevant. The \u \U escapes are ISO recommended, used in > C and C++, and must be supported. Well, OCaml is *not* ISO recommended, is *not* C, and it is certainly *not* C++. Let's learn to live with languages other than ISO-mandated, ISO-validated, ISO-standardized and whatnot. In fact, now that I think of it, standardization is driven by market pressure. If OCaml were a commercial product, I guess things would be different. But it's not (thank Pete), see below. > > > > Which makes me think, John, you already have variable-length int arrays. > > > > > > But they're not standard (yet). > > > > They are! Don't be put off by its status as "experimental feature". > > Nat's been around since CamlLight. > > Oh, I must have misunderstood your comment: Nat is standard, > I'm using it in Viper, but 'a Varray -- a variable length array of 'a, > is not.
And it's not going to be, unless someone comes up with a sound typing strategy *and* an efficient implementation for them. > > > Again, anyone can download the source code and modify OCaml to suit his > > tastes. OCaml's goal is not to be a model of i18n awareness, but a > > platform for experimenting with types in a functional setting. > > Ocaml is a tool, it doesn't have a goal. :-) > Humans have goals. The problem is that the designers of ocaml > have been too successful: ocaml is so good that other people now > want to use it, and _their_ goals are important too. Let me restate it: OCaml is the intellectual property of INRIA, developed under a specific project (Projet Cristal if I remember correctly) with very definite goals. The project *has* goals; anything outside those goals is a gift (what is more, everything falling *within* those goals already *is* a gift), and must be accepted as such. If INRIA decides that, since OCaml is useful to many, many people around the world, it wants to make it one of its goals to turn OCaml into a platform for experimenting in the implementation of programming languages with strong i18n support, well, bring out the champagne. In the meantime, we'll have to build upon what's there. Suppose the following scenario: INRIA decides that the MacOS platform is not nearly significant enough to justify the porting effort, and so it is dropped. What should I do? Plead, certainly, until I'm told "don't whine, there's nothing we can do". What would be my options? Use a Wintel box, or make my own port. This scenario is not unrealistic: there's no native compiler under MacOS, and there won't be until someone ports it. I can't do it, the implementors can't do it, and such is life. > >It > > happens that OCaml is open enough, and extensible enough, and efficient > > enough to make a good i18n effort possible, and that is a tribute to its > > success as strongly-typed, imperative, fast functional language. > > I agree.
It could easily become a leader in this field, > since implementing complex stuff is relatively easy in ocaml :-) > > > The implementors have made clear in more than one occasion that they're > > not interested in making OCaml a standard language (remember the thread > > "How to convince management?"). But don't take my word for it, ask Pierre. > > My point was simply that the ISO internationalisation requirements > are not unreasonable, and that other languages will be doing this work, > some because they have to, and some because they want to stay part of > the real world -- and encourage non-English (woops, I mean, non-Latin > :-) > clients, who, after all, may well make significant contributions. Hm. I see your point. I don't necessarily agree, though. -- I got your message. I couldn't read it. It was a cryptogram. -- Laurie Anderson ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-21 16:36 ` skaller 1999-10-21 17:21 ` Matías Giovannini @ 1999-10-23 9:53 ` Benoit Deboursetty 1999-10-25 21:06 ` Jan Skibinski 1999-10-26 18:02 ` skaller 1999-10-25 0:54 ` How to format a float? skaller 2 siblings, 2 replies; 24+ messages in thread From: Benoit Deboursetty @ 1999-10-23 9:53 UTC (permalink / raw) To: caml-list This message just wants to raise a paradoxical point in this discussion [though it may already have been posted?]. It seems to me that allowing foreign characters to be used in a computer language, as identifiers or comments, would reduce the exchange of contributions worldwide. Here is my personal experience: I have used caml and ocaml for more than 2 years now. From the beginning, it seemed to me really cool to be able to have identifiers in French, with accents and everything. So I got into the habit of using French in my programs. Now I'm writing a more substantial program, which could become a small international "open project". *Except* that I find myself with a program in French, and that it's not so easy to find qualified programming partners who understand French. The range of people who could help with my program is terribly limited. You should understand I sometimes feel I should have written it in English. I must however acknowledge that [o']caml's ability to cope with latin1 characters is above all useful for educational purposes. Let me explain... Perhaps it is a French thing, but in this country it sounds quite snobbish for a French person to embed English words in a sentence with the right accent + stress. Hence, almost every computer science teacher puts on an exaggerated French accent to pronounce English words ("la fonction 'rimouve'"). [I shall not disclose the names of my teachers in CaML :) ] So, for educational purposes, it is much better if the teachers can have French identifiers ("la fonction 'enlève'"). Much easier to pronounce, isn't it?
I suppose it is the same for many other countries. (I think especially of Japan: "biko-zu ingurisshu izu ha-do tsu puronaonsu foa japani-zu pi-poru tsu-") My point remains: encouraging people to write code in their own language would reduce the possibilities of exchanging their work. This does not mean, though, that I will translate the program I've written into English. I consider it a sort of tribute to the preservation of the diversity of languages, at my own humble scale... and I will write enough programs in English when I work for a company, too. Benoît de Boursetty Benoit.de-Boursetty@polytechnique.org ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-23 9:53 ` Benoit Deboursetty @ 1999-10-25 21:06 ` Jan Skibinski 1999-10-26 18:02 ` skaller 1 sibling, 0 replies; 24+ messages in thread From: Jan Skibinski @ 1999-10-25 21:06 UTC (permalink / raw) To: Benoit Deboursetty; +Cc: caml-list On Sat, 23 Oct 1999, Benoit Deboursetty wrote: > This message just wants to raise a paradoxical point in this discussion > [yet it may have already been posted ?]. It seems to me that allowing > foreign characters to be used in a computer language, as identifiers or > comments, would reduce the exchange of contributions worldwide. Yes, but it is nice to have error messages, prompts, etc. expressed in the native language of the program's user. And the ability to process native-language text is also quite often desirable. I have been reading this thread for some time and I've seen plenty of references to Latin1 and of differing attitudes to its usefulness (or lack thereof). Let me add my two cents here. Demanding support for diacritical marks is often not a matter of being a snob or a language purist. I cannot speak for other languages that use the Latin alphabet, but I can tell you what a mess it is with Polish (having 8 diacritical marks), and, I suppose, with other languages, such as Hungarian, etc. that have been assigned to Latin2. Someone made that decision some time ago, and now we pay a price, since Latin1 seems to be seen by some as some sort of improvement over plain ascii. I am not whining here because I can get along quite fine with plain ascii in my email, etc., and I can even cope with all sorts of email that come here formatted as either Latin1 or Latin2. But even so, I sometimes find myself cornered by plain ascii when the meaning of a sentence suddenly becomes funny, berserk, or senseless. One example to illustrate the point. 1. z<.>a<;>danie - "a strong request". This is what I want to use 2. zadanie - "a problem to solve, or a goal". Wrong!
This is what I get from plain ascii 3. rzadanie - When pronounced it does not sound quite like "a request", but an intelligent recipient can guess my intention. But they might as well consider me illiterate; Polish has two alternative spellings of the same (or similar) sound: "z<.>" and "rz". In this case "rz" is very wrong. 4. rzondanie - Now it sounds almost OK ("on" sounds close to "a<;>"), but the spelling is even worse. where z<.> stands for a dot over z a<;> stands for "ogonek" (yes, this is an official name in Unicode), or "a tail" under a. As you can see, this is not just a matter of some perky accents. Jan ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-23 9:53 ` Benoit Deboursetty 1999-10-25 21:06 ` Jan Skibinski @ 1999-10-26 18:02 ` skaller 1 sibling, 0 replies; 24+ messages in thread From: skaller @ 1999-10-26 18:02 UTC (permalink / raw) To: Benoit Deboursetty; +Cc: caml-list Benoit Deboursetty wrote: > > This message just wants to raise a paradoxical point in this discussion > [yet it may have already been posted ?]. It seems to me that allowing > foreign characters to be used in a computer language, as identifiers or > comments, would reduce the exchange of contributions worldwide. Excuse me, but exactly what do you mean by 'foreign' characters? Do you mean non-Chinese characters? What? You aren't Chinese? > You should understand i sometimes feel i should have written it in > english. I think that, at the moment, English is the 'lingua franca' <grin> of the Internet. Spoken with an American accent :-) However, the Internet is growing fast, and the number of English speakers will soon enough be a minority. It will probably remain true that most of the _programmers_ will be able to use English. > I must however acknowledge that [o']caml 's ability to cope with latin1 > characters is above all useful for educational purpose. Yes. I think it is highly laudable that ocaml accepts more than just plain 'ASCII': many students are more fluent in their native language (even if they speak some English and/or are learning it), and being able to program with it will enhance learning. Internationalising software that is actually worth sharing internationally is a lesser obstacle than writing good software in the first place. > My point remains: encouraging people to write code in their language would > reduce the possibilities of exchanging their work. In my opinion, a programming language should simply give clients a _choice_. Cultures, people, and circumstances vary.
I don't think programming language designers should be in the business of encouraging or discouraging use of a particular language, but rather facilitating the implementation of the clients own wishes or requirements. -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 24+ messages in thread
* How to format a float? 1999-10-21 16:36 ` skaller 1999-10-21 17:21 ` Matías Giovannini 1999-10-23 9:53 ` Benoit Deboursetty @ 1999-10-25 0:54 ` skaller 1999-10-26 0:53 ` Michel Quercia 2 siblings, 1 reply; 24+ messages in thread From: skaller @ 1999-10-25 0:54 UTC (permalink / raw) To: caml-list How do I format a floating point number correctly in ocaml, with a given _variable_ width and precision? I can't find a routine in the library for this. Shouldn't there be one? Did I miss it? -- sprintf is not suitable: the format string must be a literal. string_of_float takes no formatting arguments, and it gives the wrong format if the number happens to be integral. I need to write a 'printf'-like routine. It's relatively easy to write an integer formatting routine, but floats are somewhat more difficult. -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 24+ messages in thread
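[Editorial sketch, not part of the original thread.] For what it's worth, OCaml's Printf accepts `*` in a format, meaning the field width or precision is taken from the argument list, just as in C's printf. This addresses the question directly while keeping the format string a literal. A minimal sketch, assuming a reasonably recent OCaml (whether this was available in 1999 is not something the thread confirms):

```ocaml
(* Variable width and precision via Printf's "*" specifiers:
   both values are taken from the argument list, as in C's printf. *)
let format_float width prec x =
  Printf.sprintf "%*.*f" width prec x

let () =
  (* width 10, precision 3: right-justified "3.142" padded to 10 chars *)
  assert (format_float 10 3 3.14159 = "     3.142");
  print_endline (format_float 10 3 3.14159)
```

This sidesteps the literal-format-string restriction because the format itself remains a literal; only the width and precision vary at run time.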
* Re: How to format a float? 1999-10-25 0:54 ` How to format a float? skaller @ 1999-10-26 0:53 ` Michel Quercia 0 siblings, 0 replies; 24+ messages in thread From: Michel Quercia @ 1999-10-26 0:53 UTC (permalink / raw) To: caml-list

On Mon, 25 Oct 1999, you wrote:
: How do I format a floating point number
: correctly in ocaml, with a given _variable_
: width and precision?

Like this:

    let str width prec =
      let prec = min 8 (max prec 0) in
      let pow_ten = 10.0 ** float(prec) in
      fun x ->
        let y = abs_float(x) in
        let sgn = if x < 0. then -1. else 1. in
        let a = floor(y) in
        let b = 1. +. y -. a in
        let z = truncate(b*.pow_ten +. 0.5) in
        let sa = string_of_float(sgn*.a) in
        let sb = string_of_int(z) in
        let pad = width - String.length(sa) - prec - 1 in
        (String.make (max 0 pad) ' ') ^ sa ^ "." ^ (String.sub sb 1 prec)

The "8" in "min 8 (max prec 0)" is there to avoid integer overflow when computing "z" on 32-bit computers. "b" is 1 too large in order to force "string_of_int(z)" to have "prec+1" characters. Not very nice... Perhaps an addition to the standard printf library routines to accept the "*" variable field length would be better. -- Michel Quercia 9/11 rue du grand rabbin Haguenauer, 54000 Nancy http://pauillac.inria.fr/~quercia mailto:quercia@cal.enst.fr ^ permalink raw reply [flat|nested] 24+ messages in thread
* Go for ultimate localization! 1999-10-21 12:05 ` Matías Giovannini 1999-10-21 15:35 ` skaller @ 1999-10-26 4:36 ` Benoit Deboursetty 1999-10-28 17:04 ` Pierre Weis ` (4 more replies) 1 sibling, 5 replies; 24+ messages in thread From: Benoit Deboursetty @ 1999-10-26 4:36 UTC (permalink / raw) To: caml-list [ Since my post ended up being *very* long, I put in some headers and a summary... "rhetoric sugar" ] --- Summary 1. INTRODUCTION Present situation: O'CaML allows Latin1 in identifiers, but keywords are still in English -> one's code is made of intermingled languages 2. ULTIMATE LANGUAGE LOCALIZATION? Why not localize the parser? the libraries? 3. RELEVANCE Is this not just another useless theory? [ as was said of functional programming a few years ago ] 4. CONCLUSION Well, time to set to work, people :) --- On Thu, 21 Oct 1999, Matías Giovannini wrote: > OCaml uses Latin1 for its *internal* encoding of identifiers. While I'll > agree that my view is chauvinistic (and selfish, perhaps: I already have > "¿¡áéíóúüñÁÉÍÓÚÜÑ" for writing in Spanish, why should I ask for more?), > I see no restriction in that (well, if I were Chinese, or Egyptian, I > would see things differently). What's more, the whole syntactic > apparatus of a programming language *assumes* a Latin setting, where > things make sense when read from left to right, from top to bottom; and > where punctuation is what we're used to. Programming languages suited > for a Han, or Arab, or even a Hebrew audience would have to be rethought > from the ground up. 1. INTRODUCTION --------------- Hi, Thanks for giving me this idea, Matías. "Rethought from the ground up" was the only wrong thing: you only have to change the preprocessor... [ Since I am catching this thread in the middle, I apologize in advance for any possible redundancy with what has been said before ] So, O'CaML handles Latin-1 characters in the identifiers.
This allows most Europeans, well at least you and me, to use their own language; but to me, it has always seemed a bit strange to write a program with keywords in English and identifiers in French. Indeed, I feel it's really bogus when I try to read this out loud: "let élément = String.sub chaîne where chaîne = "... Do you take the French accent to read this? an English accent? Do you switch accents? How do you say "=", "égal" or "equals"? Do you translate every word to your mother tongue on-the-fly? [ You know, there is something with the French and the "purity" of their language... ] Don't think I put this example just to try to be funny: when building software in teams, it often happens that you want to explain what your program does to someone who speaks your language, and you feel uncomfortable reading your code. I'm speaking from personal experience; I hope I am not an exception. At least, there is a certain consistency in C allowing only English characters: were I in a normalization committee, I would even demand, in the C norm, that a compliant source code has everything, identifiers and comments, in American English (you must be hoping I will never be given such power). Everyone is happy with O'CaML handling Latin1 within Western Europe and the US *because* the grammars differ only slightly from one country to another. That was your point, Matías. And the Indo-European kind of syntax is not universal. The syntax I know of that differs most from Indo-European languages is that of Japanese, of course. And Japan is a country too often forgotten when it comes to localization. Japanese must be difficult to deal with. Anyway, it makes this economically powerful country over-sensitive to locale-aware software (an OS such as BeOS is therefore quite popular there). To sum up the present situation: half and half. Your program is half in your language, half in English. The syntax corresponds to the native language of half the number of potential users.
Latin1 is meaningful to only part of the users, too. 2. ULTIMATE LANGUAGE LOCALIZATION? ---------------------------------- I know one of the authors of olabl is Japanese; I would like to know whether it makes sense to have identifiers in Japanese in a language having English keywords. For instance, we often give verbal identifiers to functions that perform an action. I think if the Japanese had invented the first computer languages, we would naturally put such function calls after the arguments (a bit like in RPL), because the verb is always put after its objects in Japanese. [it has nothing to do with writing right-to-left] So, from my naive point of view, using for identifiers a language different from the one the keywords are chosen from is very unnatural. Another way of seeing things would be to have different parsers for the same compiler: one with English keywords (the one that exists now), one with French keywords and a similar syntax, one with Japanese keywords and perhaps a completely different syntax, etc., each one adapting the same abstract language, "abstract ocaml", to different natural languages. In other words, if localizing error messages and charsets is fine, then *localizing the parser* as well seems more consistent to me. Of course, this raises again the issue I was speaking of in my previous post: using languages different from the "computer esperanto" in computer programs would probably block international exchanges (but translation here is just another kind of pretty-printing -- see 4.). I am just following the idea of localization to its logical conclusion. Perhaps one could go even further: library interfaces, after all, should also have the possibility of being localized. Why would you use a module named "String" in a French program? This word has a completely different meaning in French, and every French computer scientist sounds really suspect to a non-scientific person when he speaks of "strings" all the time. The same with "bit".
It's also because of things like that that computer people are sometimes thought unable to communicate. [ "un string" is only an item of underwear in French. And there is a very unfortunate collision between the English word "bit" and another word. ] 3. RELEVANCE ------------ So, well, is all this theory going to be useful to anyone? I really think there would be a gain in implementing such a feature. There is a pervasive, wasteful effort of translation being made by non-native English speakers when programming. It can be ridiculous when two speakers of the same language come to speaking English to each other. Of course, such an extension probably won't be used in business as long as everybody speaks English at work in almost every software development group anyway. But first, I think I read O'CaML was designed primarily to be an experiment in language design. Second, it's no wonder everybody speaks English in software businesses if it is the language that lies under every computer language (a "chicken and egg" problem). Third, there is a constant motion from natural language to computer language and vice-versa... For instance, most constructs in computer languages (except in lisp, scheme,...) are based on English. Perhaps we are missing some useful constructs that could come from other natural languages. Another example, more concrete. There is no simple translation in French for the expression "to match some constant against a pattern". Instead of saying "matcher" (which is what we do), we should better try and find a better word. But even if a good word had been invented, it would never be used in practice, I mean in the code of programs, and we would still use "matcher" (erk). 4. CONCLUSION ------------- All this may be very difficult to implement, I don't know; I am telling things from the user's point of view. This may also be a project completely different from what O'CaML wants to focus on. But couldn't the camlp4 "PreProcessorPrettyPrinter" by D.
de Rauglaudre be the beginning of a solution, at least for parser localization? It would even provide translations between different localizations of the parser by pretty-printing, wouldn't it? Congratulations if you've read this post this far; I await your comments on this utopian project. Benoît de Boursetty Benoit.de-Boursetty@polytechnique.org --- PS: the camlp4 manual: http://caml.inria.fr/camlp4/manual/index.html PPS: For "to match sthg against a pattern", is there an "official" French translation? What do you think of "plaquer qqch sur un motif"? PPPS: I apologize for any mistakes in English (I myself am not perfectly localized) and for taking examples only in the French & Japanese languages. ^ permalink raw reply [flat|nested] 24+ messages in thread
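[Editorial sketch, not part of the original thread.] The parser-localization idea can be made concrete with a deliberately naive sketch: a table-driven pass that maps localized keywords back to standard OCaml ones before the real parser runs. The French keyword table and the function names below are invented for illustration; a real solution would work at the lexer level (as camlp4 does), not on whitespace-separated words.

```ocaml
(* Toy sketch of keyword localization: rewrite French keywords to the
   standard English ones.  Splitting on spaces is obviously too crude
   for real source code (strings, comments, punctuation), but it shows
   the idea of a localized front end over a shared abstract syntax. *)
let english_of_french =
  [ "soit", "let"; "dans", "in"; "si", "if";
    "alors", "then"; "sinon", "else" ]

let translate_token tok =
  try List.assoc tok english_of_french with Not_found -> tok

let translate_source src =
  String.concat " " (List.map translate_token (String.split_on_char ' ' src))

let () =
  print_endline (translate_source "soit x = 1 dans si x > 0 alors x sinon 0")
  (* let x = 1 in if x > 0 then x else 0 *)
```

Because the table is symmetric, the same mechanism run in reverse is the "translation by pretty-printing" the post alludes to.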
* Re: Go for ultimate localization! 1999-10-26 4:36 ` Go for ultimate localization! Benoit Deboursetty @ 1999-10-28 17:04 ` Pierre Weis 1999-10-28 17:41 ` Matías Giovannini ` (3 subsequent siblings) 4 siblings, 0 replies; 24+ messages in thread From: Pierre Weis @ 1999-10-28 17:04 UTC (permalink / raw) To: Benoit Deboursetty; +Cc: caml-list > 1. INTRODUCTION > Present situation: O'CaML allows Latin1 in identifiers, but keywords are > still in English -> one's code is made of intermingled languages Yes, and this can even be a convenient way of distinguishing between pure syntactical markers (let, try, with) of the language, and semantically relevant names, for instance let élimine_accents s = ... In this respect, when teaching Caml to beginners it is now some kind of advantage not to be a native English speaker! Concerning strange kinds of natural language syntax, or kinds of syntax that would not be suitable for Caml, I think the discussion is endless and running to a dead end anyway. Just consider Chinese instead of Japanese as an example: there is no alphabet at all (so no idea of dividing words into a small fixed set of characters), and traditional writing is to draw vertical lines that go from right to left. Now consider another language of the Caml kind (semi-rigorously defined, semi-universal :-) that you definitely want to use as a Chinese human being: the language of mathematics. What can you do? Would you still use traditional Chinese digits (one is denoted as -, two as =, ...) or adopt the ``long noses''' strange notations (0, 1, 2, 3, ...)? Would you adapt the whole set of well designed mathematical notations just to be able to write vertically and from right to left? Surely not. You would use instead a simpler solution: you would adapt *yourself*, not the established notations!
That's why Chinese people use the same well-known notations as everybody else in the world, just because they want to understand others and also to be understood by others (otherwise foreign people could be a bit confused with the strange use of - as 1 or = as 2!). For computer languages, I think it is the same: you must use the keywords unchanged because, in the first place, you must communicate with some machine and its hardware, and also you want to communicate your programs to others. If you really want to communicate about your program, you may go one step further and use English identifiers, or two steps further and use English comments as well. > Another example, more concrete. There is no simple translation in > French for the expression "to match some constant against a > pattern". Yes there is one: filtrer. (Pattern: filtre (sometimes motif, more rarely patron); pattern matching: filtrage (sometimes sélection de motifs, more rarely appel par patron)). You may read the French version of the Caml FAQ, or a good French book about the language. > Instead of saying "matcher" (which is what we do), we should better try > and find a better word. But even if a good word had been invented, it > would never be used in practice, I mean in the code of programs, and we > would still use "matcher" (erk). We don't say ``matcher'', except as a kind of jargon, a funny verb that we use when using a low level of language not very far from slang. We do use the proper word in practice. For instance, we say: Appel direct au filtrage: la construction ``match ... with'' To mean the English section title Direct call to pattern matching: the ``match ... with'' construct. Best regards && Cordialement (but not ``meilleurs regards''!) Pierre Weis INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://cristal.inria.fr/~weis/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Go for ultimate localization! 1999-10-26 4:36 ` Go for ultimate localization! Benoit Deboursetty 1999-10-28 17:04 ` Pierre Weis @ 1999-10-28 17:41 ` Matías Giovannini 1999-10-28 17:59 ` Matías Giovannini ` (2 subsequent siblings) 4 siblings, 0 replies; 24+ messages in thread From: Matías Giovannini @ 1999-10-28 17:41 UTC (permalink / raw) To: Benoit Deboursetty; +Cc: caml-list Uh, I'm going to get flamed for off-topicness, but... Benoit Deboursetty wrote: > So, O'CaML handles Latin-1 characters in the identifiers. This > allows most Europeans, well at least you and me, to use their own > language; but to me, it has always seemed a bit strange to write a program > with keywords in English, and identifiers in French. Indeed, I feel > it's really bogus when I try to read this out loud: > > "let élément = String.sub chaîne where chaîne = "... > > Do you take the French accent to read this? an English accent? Do you > switch accents? How do you say "=", "égal" or "equals"? Do you translate > every word to your mother tongue on-the-fly? Symbols and numbers in Spanish, "words" (be they identifiers or keywords) as they are. Accent varies, but it is normally Argentinian English (I can speak acceptable English, with something of an American English accent). This I do when reading programs aloud, or when doing mathematics. For instance, "Let x be a real such that x >= 0" is for me "Let <<equis>> be a real such that <<equis mayor o igual a cero>>". Evidently, the linguistic structures I use for words and concepts are separate. And I have never been misunderstood, nor have I ever misunderstood anybody, doing this sort of thing. I think all this is fascinating ;-} -- I got your message. I couldn't read it. It was a cryptogram. -- Laurie Anderson ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Go for ultimate localization! 1999-10-26 4:36 ` Go for ultimate localization! Benoit Deboursetty 1999-10-28 17:04 ` Pierre Weis 1999-10-28 17:41 ` Matías Giovannini @ 1999-10-28 17:59 ` Matías Giovannini 1999-10-29 9:44 ` Francois Pottier 1999-10-28 21:00 ` Gerd Stolpmann 1999-10-29 4:29 ` skaller 4 siblings, 1 reply; 24+ messages in thread From: Matías Giovannini @ 1999-10-28 17:59 UTC (permalink / raw) To: Benoit Deboursetty; +Cc: caml-list I should have read it all before letting my fingers twitch, but... ObCaml now :-) Benoit Deboursetty wrote: > All this may be very difficult to implement, I don't know, I am > telling things from the user's point of view. This may also be a project > completely different from what O'CaML wants to focus on. But couldn't the > camlp4 "PreProcessorPrettyPrinter" by D. de Rauglaudre be the beginning of > a solution, at least for parser localization? It would even provide > translations between different localizations of the parser by > pretty-printing, wouldn't it? Localizing the language is not that difficult, but it would involve exposing the abstract syntax to the localizer, and the localizer would have the difficult task of rewriting the parser. This is not as bad as it sounds for Indo-European languages (the most difficult problem I can think of is the use of cases vs. prepositions, but I can't think of the equivalent of a prepositional phrase in a computer language), but for subject-object-verb languages like Japanese it could be nightmarish. Localizing the libraries can be done the right way, or the easy way. The easy way is to have a dictionary of external-vs-internal names, with the language using only internal names. It could be easy if initially the internal names are the names used by the developer, and external names are added as the localization proceeds. Somehow, this seems wrong to me (because it seems too easy), so I think there must be a "right" way to do it that is absolutely non-trivial.
> PPS: For "to match sthg against a pattern", is there an "official" > translation? What do you think of "plaquer qqch sur un motif"? Um, you know, I think that this is pressing things too far. It's the same attitude the Spanish have, trying to find a Spanish word for every technical word out there. The result is doubly bad: on one hand, you're restricting yourself to *translating* instead of *generating* technical language (as is done in the Social Sciences, for instance), and you're leaving out people who have no problem with direct imports (we speak routinely of *matchear*, and I have yet to find someone, technical or not, who doesn't "get me"). I'm not afraid of jargon, and I think we in Argentina are used to jargon (we speak dialect anyway, in Buenos Aires). The literal Spanish for your French would be "pegar algo sobre un patrón" (paste sthg over a pattern), but it sounds backwards. Better would be "aplicar un patrón sobre algo" (apply a pattern on sthg), but while it now sounds right it *is* exactly backwards. The best compromise is "matchear algo contra un pattern", and it only translates the normal "part-of-speech" words, leaving order and technical words intact. -- I got your message. I couldn't read it. It was a cryptogram. -- Laurie Anderson ^ permalink raw reply [flat|nested] 24+ messages in thread
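[Editorial sketch, not part of the original thread.] The "easy way" of library localization described in the previous message — a dictionary of external vs. internal names — has a crude approximation inside OCaml itself: a module alias can expose a library under a localized name with no compiler-level dictionary at all. A sketch (the French-flavoured names Chaine and longueur are invented for the example):

```ocaml
(* Expose the standard String module under a French-flavoured name.
   A module alias costs nothing at run time; this is only a surface
   rename, not a real localization of the interface or its docs. *)
module Chaine = String

let longueur = Chaine.length

let () = assert (longueur "bonjour" = 7)
```

This covers renaming but not translated documentation or error messages, which is presumably part of why the "right" way is non-trivial.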
* Re: Go for ultimate localization! 1999-10-28 17:59 ` Matías Giovannini @ 1999-10-29 9:44 ` Francois Pottier 0 siblings, 0 replies; 24+ messages in thread From: Francois Pottier @ 1999-10-29 9:44 UTC (permalink / raw) To: matias; +Cc: caml-list > Localizing the language is not that difficult, but it would involve > exposing the abstract syntax to the localizer, and the localizer would > have the difficult task of rewriting the parser. I would like to point out that this has been done by Apple in their AppleScript scripting language. It looks cool at first sight, but it means that textual programs are no longer portable, since the machine you're moving to may not use the same language as yours. So, only the (binary) abstract syntax tree is portable, which is (IMHO) a very bad thing, since maintaining compatibility across versions of the language becomes very difficult. -- François Pottier Francois.Pottier@inria.fr http://pauillac.inria.fr/~fpottier/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Go for ultimate localization! 1999-10-26 4:36 ` Go for ultimate localization! Benoit Deboursetty ` (2 preceding siblings ...) 1999-10-28 17:59 ` Matías Giovannini @ 1999-10-28 21:00 ` Gerd Stolpmann 1999-10-29 4:29 ` skaller 4 siblings, 0 replies; 24+ messages in thread From: Gerd Stolpmann @ 1999-10-28 21:00 UTC (permalink / raw) To: Benoit Deboursetty; +Cc: caml-list On Tue, 26 Oct 1999, Benoît de Boursetty wrote: > > So, O'CaML handles Latin-1 characters in the identifiers. This >allows most Europeans, well at least you and me, to use their own >language; but to me, it has always seemed a bit strange to write a program >with keywords in English, and identifiers in French. Indeed, I feel >it's really bogus when I try to read this out loud: > > "let élément = String.sub chaîne where chaîne = "... I currently have some projects at a bank in Frankfurt. Although our team often chooses English identifiers, there are sometimes special terms which are difficult to translate. Big organizations tend to have their own language, and everybody in the organization knows what is meant if one of the special words is used. For example, at this bank the German word "Sachgebiet" denotes a software application tied to an organizational unit; nobody outside would think of software on hearing this word, and it seems impossible to translate such a term into English because nobody inside the bank would understand it. So the only way out is to mix English and German words, e.g. get_sachgebiet as a method of an object. I mention this example because it demonstrates that you cannot get rid of the local language even if you try hard. I think this is no different in Japan or other parts of the world, so non-Latin1 characters are really needed there. Please note: this argument has nothing to do with esthetics ("how it sounds"); the language mixture is simply necessary to make yourself understood.
The grammar (the language of the keywords) of Ocaml should be the same all over the world, otherwise you would split the software base artificially. >It can be ridiculous when two speakers >of the same language come to speaking English to each other. Yes, but this is not the problem. We also have native English speakers in our team, and international teams are very normal in this world. John also mixes languages and it is sometimes very funny... Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: Gerd.Stolpmann@darmstadt.netsurf.de (privat) Germany ---------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Go for ultimate localization! 1999-10-26 4:36 ` Go for ultimate localization! Benoit Deboursetty ` (3 preceding siblings ...) 1999-10-28 21:00 ` Gerd Stolpmann @ 1999-10-29 4:29 ` skaller 4 siblings, 0 replies; 24+ messages in thread From: skaller @ 1999-10-29 4:29 UTC (permalink / raw) To: Benoit Deboursetty; +Cc: caml-list Benoit Deboursetty wrote: >> > At least, there is a certain consistency in C allowing only > English characters: were I in a normalization committee, I would even > demand, in the C norm, that a compliant source code has everything, > identifiers and comments, in American English (you must be hoping I will > never be given such power). FYI: as it happens, I _am_ in the C standardisation committee, and the latest version of C, code named 'C9X', which has now passed the final NB balloting, as well as C++, which is already an International Standard, permits a certain subset of ISO10646 in identifiers. All the characters are in the first (unicode) plane. [The actual subset is listed in Appendix E of the C++ Standard] However, neither permits any variations on keywords :-) > In other words, localizing error messages, charsets is fine: also > *localizing the parser* seems more consistent to me. I have no opinion on the idea of allowing _alternate_ keywords, or even syntax, more amenable to international use. However, I would like to be 'pedantic' and object to the word 'localisation'. I would STRONGLY object to localising ANY aspect of the ocaml language, and this is one of my reasons for objecting to the current Latin-1 support. I advocate _internationalisation_, not localisation, that is: all the possible characters, grammar extensions, new keywords, or what ever, are universally available to everyone. On the other hand, I _would_ support localisation of diagnostic error messages and documentation: that is, if you speak French, you get errors in French. 
> Of course, this > raises again the issue I was speaking of in my previous post: using > languages different from the "computer esperanto" in computer programs > would probably block international exchanges I am not at all sure about that. Have you considered it might _facilitate_ international exchange? For example, a Japanese programmer able to use Japanese extensively in a programming language might now be free to release it internationally, without feeling bound to 'Anglicize' it. This would make it harder for non-Japanese speakers to understand than for Japanese speakers -- but 'harder' is better than 'impossible because the program was never publicly released'. I personally feel embarrassed that I can't follow dialog in French, reading this group. But that is surely MY problem, and I am grateful so many French can read and write MY native language (English with an Ozzie accent :) fluently. > every French computer > scientist sounds really suspect to a non-scientific person when he speaks > of "strings" all the time. The same with "bit". It's also because of > things like that that computer people are sometimes thought > unable to communicate. Now, that cannot be the reason -- because it doesn't explain why the same is thought of English-speaking computer nerds. :-) > [ "un string" is only an item of underwear in French. And there is a > very unfortunate collision between the English word "bit" and another > word. ] The mind boggles. An anecdote from the C++ Standardisation process: when a new feature of the C++ library was being developed, which provided 'auxiliary' information, it was originally called 'baggage'. The name was changed to 'traits'. I was one of the people who said that 'baggage' has unfortunate connotations in Australia (at least), referring to a sexually promiscuous woman in derogatory masculine slang.
> Of course, such an extension won't probably be used in business as > long as everybody speaks English at work in almost every software > development group anyway. I am surprised: is this really the case in France, for example? It is not in Japan (perhaps the language is more alien?) > All this may be very difficult to implement, I don't know, I am > telling things from the user's point of view. I do not think the difficulty of implementation is any more than a minor issue. The major issue, surely, is, that: at least in most programming languages, _something_ is universal, even if the keywords tend to be derived from English. If too many alternative ways of doing the same thing are available, we may end up with a mish mash language (sort of like English itself :-) -- John Skaller, mailto:skaller@maxtal.com.au 1/10 Toxteth Rd Glebe NSW 2037 Australia homepage: http://www.maxtal.com.au/~skaller downloads: http://www.triode.net.au/~skaller ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-15 13:53 localization, internationalization and Caml Gerard Huet 1999-10-15 20:28 ` Gerd Stolpmann @ 1999-10-17 14:29 ` Xavier Leroy 1999-10-19 18:36 ` skaller 1 sibling, 1 reply; 24+ messages in thread From: Xavier Leroy @ 1999-10-17 14:29 UTC (permalink / raw) To: caml-list Wow, there's nothing like internationalization to spark lively discussions. Since even Gérard Huet (oops, sorry for that 8859-1 accent, couldn't resist) and Francis Dupont broke their vows of silence, I guess I have to say something too. The support for ISO-8859-1 in Caml Light and OCaml is essentially an historical and geographical accident. The first books on Caml were written in French, and it was nice to be able to use accented french words as identifiers. Also, that was at a time (1991-1992) where Unicode and consorts didn't even exist. The choice of ISO-8859-1 is not that politically incorrect either: it works not only for western Europe, but also for Latin America, many Pacific countries, and large parts of Africa. If we were to choose an 8-bit character set based on the number of OCaml programmers that actually need it, I guess ISO-8859-1 (or its newer incarnation with the Euro sign whose name I can't remember) would still win. (At least until we get OCaml in the Chinese curriculum...) Notice also that Caml doesn't prevent the programmer from putting any character set that includes ASCII (ISO-8859-x, but also UTF8-encoded Unicode) in character strings and in comments. There are several ways to internationalize further. One is to support other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There are several problems with this:
- It's not enough for Asian languages.
- The POSIX localization stuff isn't supported under Windows.
- It's badly supported on all Unixes I know (e.g.
to get French, I need to set LC_CTYPE to different values under Linux, Solaris, and Digital Unix; it gets worse for other languages such as Japanese).
- Handling of mixed-language texts is a nightmare.

Unicode / ISO10646 is probably a better approach. However, it has its own problems:
- There's 16-bit Unicode and 32-bit Unicode. Early adopters of that technology (Windows, Java) chose 16-bit Unicode; late adopters (Unix) chose 32-bit Unicode. (That's the great thing about standards: there are so many to choose from...)
- Apparently, not everyone agrees on multi-byte encodings (UTF8) as well. E.g. Java seems to have its own variant of UTF8. How are we going to interoperate?
- I/O is a nightmare. The API has to handle at least byte streams, wide character streams, and UTF8-encoded streams.
- Support for Unicode / UTF8 files in today's operating systems and GUIs is very low. When will I be able to do "more" on a UTF8 file and see my French accented letters?

My conclusion is that I18N is such a mess that I don't think we'll do much about it in Caml anytime soon. Perhaps some basic support for wide characters and wide character strings will be added at some point, if only because COM interoperability requires it. - Xavier Leroy ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: localization, internationalization and Caml 1999-10-17 14:29 ` localization, internationalization and Caml Xavier Leroy @ 1999-10-19 18:36 ` skaller 0 siblings, 0 replies; 24+ messages in thread From: skaller @ 1999-10-19 18:36 UTC (permalink / raw) To: Xavier Leroy; +Cc: caml-list Xavier Leroy wrote: > The support for ISO-8859-1 in Caml Light and OCaml is essentially an > historical and geographical accident. The first books on Caml were > written in French, and it was nice to be able to use accented french > words as identifiers. Also, that was at a time (1991-1992) where > Unicode and consorts didn't even exist. And supporting ISO-8859-1 was a fine thing to do at the time! > The choice of ISO-8859-1 is not that politically incorrect either: it > works not only for western Europe, but also for Latin America, many > Pacific countries, and large parts of Africa. If we were to choose an > 8-bit character set based on the number of OCaml programmers that > actually need it, I guess ISO-8859-1 (or its newer incarnation with > the Euro sign whose name I can't remember) would still win. (At least > until we get OCaml in the Chinese curriculum...) While this is true, there is a circularity here: people not using 8-bit character sets face an extra battle using ocaml. > Notice also that Caml doesn't prevent the programmer from putting any > character set that includes ASCII (ISO-8859-x, but also UTF8-encoded > Unicode) in character strings and in comments. Yes. This is one of the key points of my argument that UTF-8 is the natural way to go: it provides ISO-10646 compliance without requiring any new string kind. > There are several ways to internationalize further. One is to support > other 8-bit character sets the POSIX way (the LC_CTYPE stuff). There > are several problems with this: > - It's not enough for Asian languages. > - The POSIX localization stuff isn't supported under Windows. > - It's badly supported on all Unixes I know (e.g.
> to get French, I need to set LC_CTYPE to different values under
> Linux, Solaris, and Digital Unix; it gets worse for other languages
> such as Japanese).
> - Handling of mixed-language texts is a nightmare.

If you are suggesting not using the C locale stuff -- I agree entirely.

> Unicode / ISO10646 is probably a better approach. However, it has its
> own problems:
> - There's 16-bit Unicode and 32-bit Unicode. Early adopters of that
>   technology (Windows, Java) chose 16-bit Unicode; late adopters
>   (Unix) chose 32-bit Unicode. (That's the great thing about
>   standards: there are so many to choose from...)

I cannot see the problem -- except for the 16-bit adopters, who must
eventually upgrade... again.

> - Apparently, not everyone agrees on multi-byte encodings (UTF8)
>   either. E.g. Java seems to have its own variant of UTF8. How are
>   we going to interoperate?

I do not understand: UTF-8 is a fixed, internationally standardised
encoding. If it is used, the ISO standard is followed. If Java doesn't
do that, that is Java's problem.

> - I/O is a nightmare. The API has to handle at least byte streams,
>   wide character streams, and UTF8-encoded streams.

No, it doesn't. This is a possibility, but it is NOT necessary. It is
necessary only to read byte streams; conversion can be done later
using strings. This is less efficient, but it is a sensible starting
point (to ignore internationalisation on I/O completely).

> - Support for Unicode / UTF8 files in today's operating systems and
>   GUIs is very low. When will I be able to do "more" on a UTF8 file
>   and see my French accented letters?

Yes, I agree. This is a major problem. One of the answers is "when
programming languages provide the support that applications
programmers need". :-)

> My conclusion is that I18N is such a mess that I don't think we'll
> do much about it in Caml anytime soon.

I agree.
The way forward is, I believe:

a) do not change the I/O system, but deprecate TEXT mode (all I/O
   should be done in binary)
b) do not change the String module, but deprecate the upper/lower-case
   functions (and anything else that smacks of relating to natural
   language)
c) provide functions to support internationalisation
d) modify the ocaml compiler to process \uXXXX and \UXXXX escapes
   [everywhere]
e) provide a fast variable-length array type

(d) could be done easily using camlp4, I think.

> Perhaps some basic support for wide characters and wide character
> strings will be added at some point, if only because COM
> interoperability requires it.

I don't think that is necessary: a variable-length array of integers
is good enough.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller

^ permalink raw reply	[flat|nested] 24+ messages in thread
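Skaller's point that "a variable-length array of integers is good enough" amounts to decoding UTF-8 bytes into an array of code points. A minimal sketch of such a decoder, assuming well-formed input (it handles 1- to 3-byte sequences only and performs no validation; the name `utf8_decode` is illustrative, not from the thread):

```ocaml
(* Sketch, not from the thread: decode a UTF-8 byte string into an
   int array of Unicode code points. Covers 1-, 2-, and 3-byte
   sequences (U+0000..U+FFFF); ill-formed input is not detected. *)
let utf8_decode (s : string) : int array =
  let acc = ref [] in
  let i = ref 0 in
  let n = String.length s in
  while !i < n do
    let b0 = Char.code s.[!i] in
    let cp, len =
      if b0 < 0x80 then
        (* single byte: plain ASCII *)
        b0, 1
      else if b0 < 0xE0 then
        (* two bytes: 110xxxxx 10xxxxxx *)
        ((b0 land 0x1F) lsl 6) lor (Char.code s.[!i + 1] land 0x3F), 2
      else
        (* three bytes: 1110xxxx 10xxxxxx 10xxxxxx *)
        ((b0 land 0x0F) lsl 12)
        lor ((Char.code s.[!i + 1] land 0x3F) lsl 6)
        lor (Char.code s.[!i + 2] land 0x3F), 3
    in
    acc := cp :: !acc;
    i := !i + len
  done;
  Array.of_list (List.rev !acc)
```

For example, the bytes `"A\xC3\xA9"` (an "A" followed by UTF-8-encoded "é") decode to the code points `[|65; 233|]`, i.e. U+0041 and U+00E9.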
end of thread, other threads:[~1999-10-29 17:21 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1999-10-15 13:53 localization, internationalization and Caml Gerard Huet
1999-10-15 20:28 ` Gerd Stolpmann
1999-10-19 18:06   ` skaller
1999-10-20 21:05     ` Gerd Stolpmann
1999-10-21  4:42       ` skaller
1999-10-21 12:05         ` Matías Giovannini
1999-10-21 15:35           ` skaller
1999-10-21 16:27             ` Matías Giovannini
1999-10-21 16:36               ` skaller
1999-10-21 17:21                 ` Matías Giovannini
1999-10-23  9:53       ` Benoit Deboursetty
1999-10-25 21:06         ` Jan Skibinski
1999-10-26 18:02           ` skaller
1999-10-25  0:54       ` How to format a float? skaller
1999-10-26  0:53         ` Michel Quercia
1999-10-26  4:36       ` Go for ultimate localization! Benoit Deboursetty
1999-10-28 17:04         ` Pierre Weis
1999-10-28 17:41           ` Matías Giovannini
1999-10-28 17:59             ` Matías Giovannini
1999-10-29  9:44               ` Francois Pottier
1999-10-28 21:00           ` Gerd Stolpmann
1999-10-29  4:29             ` skaller
1999-10-17 14:29 ` localization, internationalization and Caml Xavier Leroy
1999-10-19 18:36   ` skaller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).