From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: weis Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA12251 for caml-redistribution; Fri, 15 Oct 1999 16:27:50 +0200 (MET DST) Received: from concorde.inria.fr (concorde.inria.fr [192.93.2.39]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id QAA09128 for ; Fri, 15 Oct 1999 16:06:16 +0200 (MET DST) Received: from yana.inria.fr (yana.inria.fr [128.93.9.62]) by concorde.inria.fr (8.8.7/8.8.7) with ESMTP id QAA25584; Fri, 15 Oct 1999 16:06:04 +0200 (MET DST) Received: from huet-gera1p (huet-gera2p.inria.fr [128.93.20.113]) by yana.inria.fr (8.8.5/8.8.7) with SMTP id QAA07501; Fri, 15 Oct 1999 16:06:03 +0200 (MET DST) Message-Id: <199910151406.QAA07501@yana.inria.fr> Identite-Message: <4.0.2.19991015154222.00e5f2c0@pop-rocq.inria.fr> X-Sender: huet@pop-rocq.inria.fr X-Mailer: QUALCOMM Windows Eudora Pro Version 4.0.2 Date: Fri, 15 Oct 1999 15:53:15 +0200 To: Francis Dupont , skaller From: Gerard Huet Subject: Re: localization, internationalization and Caml Cc: STARYNKEVITCH Basile , caml-list@inria.fr Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Sender: weis Just to put my 2 cents on this issue... At 10:26 15/10/99 +0200, Francis Dupont wrote: > In your previous mail you wrote: > > The current 'support' for 8 bit characters in ocaml should be > deprecated immediately. It is an extremely bad thing to have, since > Latin-1 et al are archaic 8 bit standards incompatible with the > international standard for ISO10646 communication, namely > the UTF-8 encoding. I do not agree. What we need is not ayatollah dictats, but careful thinking about evolution of standards. First of all, ISO-Latin is as international a standard as ISO10646, only a= bit more mature. By essence, international standards are not immediately= obsolete, they are here to stay because we need some stability in a world of sound engineering, as opposed to the permanent hype which our discipline is subjected to. Secondly, the string data type of Ocaml is not about ASCII or ISO-Latin or whatever. It is a low-level data type of implementation of lists of bytes of data efficiently represented in machine memory. These bytes may be used for encoding elements of various finite sets such as ASCII or ISO-Latin, but the string library does not care about such intentions. When such strings are used to represent natural language sentences, there is a natural tendency to sophistication, from UPPER CASE letters of the computer printers of old to ASCII to ISO-Latin 1, 2, etc to Unicode. At some point (256) these sets of codes cannot be represented in a one to one fashion into bytes, and so multi-bytes representations must be designed, such as UTF-8. Such multi-bytes representations are inconsistent with ISO-Latin convention somewhere, and thus the ISO-Latin character set must be shifted out of its usual representation since the 8th bit is needed for the multi-byte= encoding. So for instance engineers designing natural language interfaces must make the choice of sticking to the old convention in a purely local software, or upgrading their software to the international standard, typically for Web applications. At some point I am sure some brave soul from the Ocaml implementation team will write a Unicode library for implementing the non-trivial manipulations of lists of Unicode characters, so that the above engineers will have a generic tool to use. Such libraries will typically implement a NEW datatype of "unistring" or whatever, with proper conversion to string representations of course, but the string data type is surely here to stay, because bytes are not going to become obsolete overnight. :-) >=> there is a rather strong opposition against UTF-8 in France >because it is not a natural encoding (ie. if ASCII maps to ASCII >it is not the case for ISO 8859-* characters, imagine a new UTF-X >encoding maps ASCII to strange things and you'd be able to understand >our concern). I do not share Francis' pessimism. The ISO commitees are not entirely stupid, and care has been taken to make the move as painless as possible. ISO-Latin has just been shifted by a mere translation. Here is my Ocaml code for translating strings of ISO-Latin 1 characters to UTF-8 HTML: let print_unicode c = let ascii = int_of_char c in (* test for ISO-LATIN *) if ascii < 128 then print_char c (* 7 bit ascii *) else print_string ("&#" ^ (string_of_int ascii) ^ ";"); This is hardly mysterious or complicated or inefficient. >=> my problem is the output of the filter will be no more readable when >I've put too much French in the program (in comments for instance). Come on, Francis, we do not read core dumps nowadays, we read through the eyes of HTML or TeX or whatever ! >=> I believe internationalization should not be done by countries >where English is the only used language: this is at least awkward... I simply do not understand this remark in a WWW world. Cheers Gérard