From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: weis Received: (from weis@localhost) by pauillac.inria.fr (8.7.6/8.7.3) id QAA20333 for caml-redistribution; Mon, 18 Oct 1999 16:03:36 +0200 (MET DST) Received: from nez-perce.inria.fr (nez-perce.inria.fr [192.93.2.78]) by pauillac.inria.fr (8.7.6/8.7.3) with ESMTP id XAA15447 for ; Fri, 15 Oct 1999 23:59:46 +0200 (MET DST) Received: from beach.frankfurt.netsurf.de (beach.frankfurt.netsurf.de [194.64.181.2]) by nez-perce.inria.fr (8.8.7/8.8.7) with ESMTP id XAA09388 for ; Fri, 15 Oct 1999 23:59:45 +0200 (MET DST) Received: from ice.darmstadt.netsurf.de (board-65.darmstadt.netsurf.de [194.163.86.193]) by beach.frankfurt.netsurf.de (8.8.5/8.8.5) with ESMTP id XAA04651 for ; Fri, 15 Oct 1999 23:59:43 +0200 (MET DST) Received: from localhost (localhost [[UNIX: localhost]]) by ice.darmstadt.netsurf.de (8.9.3/8.9.3) id XAA25791 for caml-list@inria.fr; Fri, 15 Oct 1999 23:51:52 +0200 From: Gerd Stolpmann Reply-To: Gerd.Stolpmann@darmstadt.netsurf.de Organization: privat To: caml-list@inria.fr Subject: Re: localization, internationalization and Caml Date: Fri, 15 Oct 1999 22:28:44 +0200 X-Mailer: KMail [version 1.0.21] Content-Type: text/plain References: <199910151406.QAA07501@yana.inria.fr> MIME-Version: 1.0 Message-Id: <99101523515200.25289@ice> Content-Transfer-Encoding: 8bit Sender: weis I agree that Unicode or even ISO-10646 support would be a nice thing. I also agree that for many users (including myself) the Latin-1 character set suffices. Luckily, both character sets are strongly related: The first 256 character positions of Unicode (which is the 16-bit subset of ISO-10646) are exactly the same as in Latin1. UTF-8 is a special encoding of the bigger character sets. Every 16- or 31-bit character is represented by one to six bytes; the higher the character code the more bytes are needed. This encoding is mainly interesting for I/O, and not for internal processing, because it is impossible to access the characters of a string by their position. Internally, you must represent the characters as 16- or 32-bit numbers (this is called UCS-2 and -4, respectively), even if memory is wasted (this is the price for the enhanced possibilities). UTF-8 is thought for compatibility, because the following holds: - Every ASCII character (i.e. with codes 0 to 127) is represented as before, and every non-ASCII character is represented by a byte sequence where the eighth bit is set. Old, non-UTF-aware programs can at least interpret the ASCII characters. (Note that there is a variant of UTF-8 which encodes the 0 character differently - by two characters.) - If you sort UTF-8 strings alphabetically (more precisely, using the byte values of the encoding as criterion) you get the same result as if you sorted the strings alphabetically by their character codes. This means that we need at least three types of strings: Latin1 strings for compatibility, UCS-2 or -4 strings for internal processing, and UTF-8 strings for I/O. For simplicity, I suggest to represent both Latin1 and UTF-8 strings by the same language type "string", and to provide "wchar" and "wstring" for the extended character set. Of course, the current "string" data type is mainly an implementation of byte sequences, which is independent of the underlying interpretation. Only the following functions seem to introduce a character set: - The String.uppercase and .lowercase functions - The channels if opened in text mode (because they specially recognize the line endings if needed for the operating system, and newline characters are part of the character set) The best solution would be to have internationalized versions of these functions (and of perhaps some more functions) which still operate on the "string" type but allow the user to select the encoding and the locale. This means we would have something like type encoding = UTF8 | Latin1 | ... type locale = ... val String.i18n_uppercase : encoding -> locale -> string -> string val String.i18n_lowercase : encoding -> locale -> string -> string val String.recode : encoding -> encoding -> string -> string (* changes the encoding if possible *) For "wstring" it is simpler: val Wstring.uppercase : string -> string val Wstring.i18n_uppercase : locale -> string -> string New opening mode for channels: Text of encoding This encoding specifies the encoding of the file. It must be possible to change this later (e.g. to process XML's "encoding" declaration). New input/output functions: val output_i18n_string : out_channel -> encoding -> string -> unit val input_i18n_line : in_channel -> encoding -> string Here, the encoding argument specifies the encoding of the internal representation. The other I/O functions need I18N versions as well, and of course we need to operate on "wstring"s directly: val output_wstring : out_channel -> wstring -> unit val input_wstring_line : in_channel -> wstring This all means that the number of string functions explodes: We need functions for compatibility (Latin1), functions for arbitrary 8 bit encodings, and functions for wide strings. I think this is the main argument against it, and it is very difficult to get around this. (Any ideas?) Francis Dupont: >=> my problem is the output of the filter will be no more readable when >I've put too much French in the program (in comments for instance). The enlarged character sets become more and more important, and it is only a matter of time until every piece of software which wants to be taken seriously can process them, even a dumb terminal or simple text editor. So you will be able to put accented characters into your comments, and you will see them as such even if you 'cat' the program text to the terminal or printer; this will work everywhere... >=> I believe internationalization should not be done by countries >where English is the only used language: this is at least awkward... but in the USA (a general prejudice not worth to discuss; there is only some "personal experience" behind it). Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 100 64293 Darmstadt EMail: Gerd.Stolpmann@darmstadt.netsurf.de (privat) Germany ----------------------------------------------------------------------------