From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p1SEDleL002496
	for <caml-list@sympa-roc.inria.fr>; Mon, 28 Feb 2011 15:13:47 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AikCADs8a03U4xEKkGdsb2JhbACEJKIiFQEBAQEJCQwHEQMiq2eQCAKBJYNEdgSHdodliFM
X-IronPort-AV: E=Sophos;i="4.62,239,1297033200"; 
   d="scan'208";a="100850959"
Received: from moutng.kundenserver.de ([212.227.17.10])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 28 Feb 2011 15:13:42 +0100
Received: from office1.lan.sumadev.de (dslb-188-097-079-229.pools.arcor-ip.net [188.97.79.229])
	by mrelayeu.kundenserver.de (node=mreu1) with ESMTP (Nemesis)
	id 0MUjkW-1PX0AL3SXc-00YnFu; Mon, 28 Feb 2011 15:13:42 +0100
Received: from [192.168.5.106] (dslb-188-097-079-229.pools.arcor-ip.net [188.97.79.229])
	by office1.lan.sumadev.de (Postfix) with ESMTPA id 77C505F701;
	Mon, 28 Feb 2011 15:13:41 +0100 (CET)
From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Christophe TROESTLER <Christophe.Troestler@umons.ac.be>
Cc: OCaml Mailing List <caml-list@inria.fr>
In-Reply-To: <20110228.093528.996524125295855263.Christophe.Troestler@umons.ac.be>
References: 
	 <20110228.093528.996524125295855263.Christophe.Troestler@umons.ac.be>
Content-Type: text/plain; charset="UTF-8"
Date: Mon, 28 Feb 2011 15:13:40 +0100
Message-ID: <1298902420.27243.42.camel@thinkpad>
Mime-Version: 1.0
X-Mailer: Evolution 2.28.1 
Content-Transfer-Encoding: 7bit
X-Provags-ID: V02:K0:JtwduN7lUqLlfVSBy3gBXWZVDWrVhfEf8dyELGFjHTF
 00fn+C6+XZig9uJ0/Kqmnp90Y3GgaZHZJtRdiO293+G2TFxbSF
 muBs7iIgtluMSQf6Q7BzI3h6KrnAyu5QyHOKItq6jnWUQ7WXAT
 DAMSEKtDSpWxuBKGHyCkGgOr52o3OhpYvkIXKsLr43gCS1ndns
 CetIOB0Hru7FydMx91yxA==
Subject: Re: [Caml-list] GSoC: better UTF-8 support

Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER:
> Hi,
> 
> Starting from an idea on the Ocsigen mailing list, it was suggested
> that better support for UTF-8 in the tools would be of interest to
> several people.  In particular, the following points were identified:
> 
> - A flag (-utf8 ?) to the compilers should be added so that errors
>   locations are correct in presence of UTF-8 strings [the programmer
>   restricting himself to ASCII identifiers].
> 
> - ocamldoc: while an UTF-8 aware doc-generator is very easy to write,
>   it would be nice to be able to parametrize any of them with the
>   correct charset (using again the -utf8 flag ?)
> 
> - UTF8.Char and UTF8.String modules should be written with the same
>   interface as Char and String.  [Camomile should be adapted
>   consequently.]

Well, UTF-8 is the wrong term here. What you need on this level are
Unicode modules, where a uni_char can contain all Unicode code points,
and a uni_string is an array of such uni_char's.

UTF-8 is a run-length encoding of Unicode for I/O. It is not well suited
for string manipulation, at least if you want efficient support for
index-based access, because the length of the char representation is not
constant.

Probably you would choose uni_char=int as representation for characters,
but for strings there are several possibilities:

- 16 bits/char: This path is taken by other languages, but only a 
  subset of Unicode chars can be represented directly
- 24 bits/char: All Unicode chars can be represented (range is
  0 to 0x10ffff), but you need to multiply by 3 to access by index. 
  This multiplication is relatively cheap (one bit shift plus 
  one addition).
- 32 bits/char: A slight waste of RAM but very efficient access by
  index
- int/char: same as 32 bits/char for 32-bit platforms but
  64 bits/char for 64-bit. Probably no good choice.

Of course, there should also be conversions from/to normal
chars/strings, and this is the place where UTF-8 comes into play.

Another comment: for supporting lowercase/uppercase conversion one needs
lookup tables. Not really big tables, because only a small fraction of
the Unicode chars has this variation.

One should also think about whether other properties of the Unicode
character database should be made available. E.g. character classes.
This could also live in add-on libraries, but it is worth discussing.

> - Printf/Scanf: %U of %cu for UTF8.Char.t

You need also string conversions for uni_string.

> 
> - Graphics: UTF-8 text printing
> 
> - Str: (character ranges)

Before talking about character ranges, Str needs to support single
Unicode characters properly. If it runs over a uni_string this is
trivial. If it runs directly over a UTF-8 encoded string this is
possible but the algorithm needs to be adapted to cope with multi-byte
representations.

For full internationalization you need more than just character ranges.
There are also character classes (e.g. "all letters", "all digits"), and
a few other phenomenons.

> The questions are: would such changes be beneficial to you?

I'd like it very much.

>   Are there
> other issues to address? 

You probably would also need a Unicode-version of Buffer.

Which syntax to choose for Unicode string accesses? Maybe

s.[[k]] ?

So far I see normal strings would still be used for I/O, only that it is
now easy to decode them as UTF-8. There is one difficulty, though.
Imagine you read a file block by block, where a block has a fixed length
in bytes. Also, the file contains UTF-8-encoded data. It can now happen
that the end of a block is not at the end of a character. For that
reason you need special decoding functions that can deal with that.

There are probably other functions where you would like to have Unicode
support directly, e.g. int_of_uni_string.

> Is this enough for a GSoc proposal (seems a
> little light to me)? 

Honestly, this is not the type of work that is well-suited for GSoc. You
need to only change code, not develop something entirely new. Also, you
need to dig into very different parts of the code base - and you need
broad knowledge how everything interacts.

This is might be better done as community project with some moderation
from INRIA. There could be a master plan, and several people take over
tasks.

Gerd

>  If it is done, is there a chance to have this
> work included in the standard distribution?
> 
> Best,
> C.
> 


-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------