From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p1SEDleL002496 for ; Mon, 28 Feb 2011 15:13:47 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AikCADs8a03U4xEKkGdsb2JhbACEJKIiFQEBAQEJCQwHEQMiq2eQCAKBJYNEdgSHdodliFM X-IronPort-AV: E=Sophos;i="4.62,239,1297033200"; d="scan'208";a="100850959" Received: from moutng.kundenserver.de ([212.227.17.10]) by mail1-smtp-roc.national.inria.fr with ESMTP; 28 Feb 2011 15:13:42 +0100 Received: from office1.lan.sumadev.de (dslb-188-097-079-229.pools.arcor-ip.net [188.97.79.229]) by mrelayeu.kundenserver.de (node=mreu1) with ESMTP (Nemesis) id 0MUjkW-1PX0AL3SXc-00YnFu; Mon, 28 Feb 2011 15:13:42 +0100 Received: from [192.168.5.106] (dslb-188-097-079-229.pools.arcor-ip.net [188.97.79.229]) by office1.lan.sumadev.de (Postfix) with ESMTPA id 77C505F701; Mon, 28 Feb 2011 15:13:41 +0100 (CET) From: Gerd Stolpmann To: Christophe TROESTLER Cc: OCaml Mailing List In-Reply-To: <20110228.093528.996524125295855263.Christophe.Troestler@umons.ac.be> References: <20110228.093528.996524125295855263.Christophe.Troestler@umons.ac.be> Content-Type: text/plain; charset="UTF-8" Date: Mon, 28 Feb 2011 15:13:40 +0100 Message-ID: <1298902420.27243.42.camel@thinkpad> Mime-Version: 1.0 X-Mailer: Evolution 2.28.1 Content-Transfer-Encoding: 7bit X-Provags-ID: V02:K0:JtwduN7lUqLlfVSBy3gBXWZVDWrVhfEf8dyELGFjHTF 00fn+C6+XZig9uJ0/Kqmnp90Y3GgaZHZJtRdiO293+G2TFxbSF muBs7iIgtluMSQf6Q7BzI3h6KrnAyu5QyHOKItq6jnWUQ7WXAT DAMSEKtDSpWxuBKGHyCkGgOr52o3OhpYvkIXKsLr43gCS1ndns CetIOB0Hru7FydMx91yxA== Subject: Re: [Caml-list] GSoC: better UTF-8 support Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER: > Hi, > > Starting from an idea on the Ocsigen mailing list, it was suggested > that better support for UTF-8 in the tools would be of interest to > several people. In particular, the following points were identified: > > - A flag (-utf8 ?) to the compilers should be added so that errors > locations are correct in presence of UTF-8 strings [the programmer > restricting himself to ASCII identifiers]. > > - ocamldoc: while an UTF-8 aware doc-generator is very easy to write, > it would be nice to be able to parametrize any of them with the > correct charset (using again the -utf8 flag ?) > > - UTF8.Char and UTF8.String modules should be written with the same > interface as Char and String. [Camomile should be adapted > consequently.] Well, UTF-8 is the wrong term here. What you need on this level are Unicode modules, where a uni_char can contain all Unicode code points, and a uni_string is an array of such uni_char's. UTF-8 is a run-length encoding of Unicode for I/O. It is not well suited for string manipulation, at least if you want efficient support for index-based access, because the length of the char representation is not constant. Probably you would choose uni_char=int as representation for characters, but for strings there are several possibilities: - 16 bits/char: This path is taken by other languages, but only a subset of Unicode chars can be represented directly - 24 bits/char: All Unicode chars can be represented (range is 0 to 0x10ffff), but you need to multiply by 3 to access by index. This multiplication is relatively cheap (one bit shift plus one addition). - 32 bits/char: A slight waste of RAM but very efficient access by index - int/char: same as 32 bits/char for 32-bit platforms but 64 bits/char for 64-bit. Probably no good choice. Of course, there should also be conversions from/to normal chars/strings, and this is the place where UTF-8 comes into play. Another comment: for supporting lowercase/uppercase conversion one needs lookup tables. Not really big tables, because only a small fraction of the Unicode chars has this variation. One should also think about whether other properties of the Unicode character database should be made available. E.g. character classes. This could also live in add-on libraries, but it is worth discussing. > - Printf/Scanf: %U of %cu for UTF8.Char.t You need also string conversions for uni_string. > > - Graphics: UTF-8 text printing > > - Str: (character ranges) Before talking about character ranges, Str needs to support single Unicode characters properly. If it runs over a uni_string this is trivial. If it runs directly over a UTF-8 encoded string this is possible but the algorithm needs to be adapted to cope with multi-byte representations. For full internationalization you need more than just character ranges. There are also character classes (e.g. "all letters", "all digits"), and a few other phenomenons. > The questions are: would such changes be beneficial to you? I'd like it very much. > Are there > other issues to address? You probably would also need a Unicode-version of Buffer. Which syntax to choose for Unicode string accesses? Maybe s.[[k]] ? So far I see normal strings would still be used for I/O, only that it is now easy to decode them as UTF-8. There is one difficulty, though. Imagine you read a file block by block, where a block has a fixed length in bytes. Also, the file contains UTF-8-encoded data. It can now happen that the end of a block is not at the end of a character. For that reason you need special decoding functions that can deal with that. There are probably other functions where you would like to have Unicode support directly, e.g. int_of_uni_string. > Is this enough for a GSoc proposal (seems a > little light to me)? Honestly, this is not the type of work that is well-suited for GSoc. You need to only change code, not develop something entirely new. Also, you need to dig into very different parts of the code base - and you need broad knowledge how everything interacts. This is might be better done as community project with some moderation from INRIA. There could be a master plan, and several people take over tasks. Gerd > If it is done, is there a chance to have this > work included in the standard distribution? > > Best, > C. > -- ------------------------------------------------------------ Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de Phone: +49-6151-153855 Fax: +49-6151-997714 ------------------------------------------------------------