Re: [Caml-list] GSoC: better UTF-8 support

caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed

From: Gerd Stolpmann <info@gerd-stolpmann.de>
To: Christophe TROESTLER <Christophe.Troestler@umons.ac.be>
Cc: OCaml Mailing List <caml-list@inria.fr>
Subject: Re: [Caml-list] GSoC: better UTF-8 support
Date: Mon, 28 Feb 2011 15:13:40 +0100	[thread overview]
Message-ID: <1298902420.27243.42.camel@thinkpad> (raw)
In-Reply-To: <20110228.093528.996524125295855263.Christophe.Troestler@umons.ac.be>

Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER:
> Hi,
> 
> Starting from an idea on the Ocsigen mailing list, it was suggested
> that better support for UTF-8 in the tools would be of interest to
> several people.  In particular, the following points were identified:
> 
> - A flag (-utf8 ?) to the compilers should be added so that errors
>   locations are correct in presence of UTF-8 strings [the programmer
>   restricting himself to ASCII identifiers].
> 
> - ocamldoc: while an UTF-8 aware doc-generator is very easy to write,
>   it would be nice to be able to parametrize any of them with the
>   correct charset (using again the -utf8 flag ?)
> 
> - UTF8.Char and UTF8.String modules should be written with the same
>   interface as Char and String.  [Camomile should be adapted
>   consequently.]

Well, UTF-8 is the wrong term here. What you need on this level are
Unicode modules, where a uni_char can contain all Unicode code points,
and a uni_string is an array of such uni_char's.

UTF-8 is a run-length encoding of Unicode for I/O. It is not well suited
for string manipulation, at least if you want efficient support for
index-based access, because the length of the char representation is not
constant.

Probably you would choose uni_char=int as representation for characters,
but for strings there are several possibilities:

- 16 bits/char: This path is taken by other languages, but only a 
  subset of Unicode chars can be represented directly
- 24 bits/char: All Unicode chars can be represented (range is
  0 to 0x10ffff), but you need to multiply by 3 to access by index. 
  This multiplication is relatively cheap (one bit shift plus 
  one addition).
- 32 bits/char: A slight waste of RAM but very efficient access by
  index
- int/char: same as 32 bits/char for 32-bit platforms but
  64 bits/char for 64-bit. Probably no good choice.

Of course, there should also be conversions from/to normal
chars/strings, and this is the place where UTF-8 comes into play.

Another comment: for supporting lowercase/uppercase conversion one needs
lookup tables. Not really big tables, because only a small fraction of
the Unicode chars has this variation.

One should also think about whether other properties of the Unicode
character database should be made available. E.g. character classes.
This could also live in add-on libraries, but it is worth discussing.

> - Printf/Scanf: %U of %cu for UTF8.Char.t

You need also string conversions for uni_string.

> 
> - Graphics: UTF-8 text printing
> 
> - Str: (character ranges)

Before talking about character ranges, Str needs to support single
Unicode characters properly. If it runs over a uni_string this is
trivial. If it runs directly over a UTF-8 encoded string this is
possible but the algorithm needs to be adapted to cope with multi-byte
representations.

For full internationalization you need more than just character ranges.
There are also character classes (e.g. "all letters", "all digits"), and
a few other phenomenons.

> The questions are: would such changes be beneficial to you?

I'd like it very much.

>   Are there
> other issues to address? 

You probably would also need a Unicode-version of Buffer.

Which syntax to choose for Unicode string accesses? Maybe

s.[[k]] ?

So far I see normal strings would still be used for I/O, only that it is
now easy to decode them as UTF-8. There is one difficulty, though.
Imagine you read a file block by block, where a block has a fixed length
in bytes. Also, the file contains UTF-8-encoded data. It can now happen
that the end of a block is not at the end of a character. For that
reason you need special decoding functions that can deal with that.

There are probably other functions where you would like to have Unicode
support directly, e.g. int_of_uni_string.

> Is this enough for a GSoc proposal (seems a
> little light to me)? 

Honestly, this is not the type of work that is well-suited for GSoc. You
need to only change code, not develop something entirely new. Also, you
need to dig into very different parts of the code base - and you need
broad knowledge how everything interacts.

This is might be better done as community project with some moderation
from INRIA. There could be a master plan, and several people take over
tasks.

Gerd

>  If it is done, is there a chance to have this
> work included in the standard distribution?
> 
> Best,
> C.
> 

-- 
------------------------------------------------------------
Gerd Stolpmann, Bad Nauheimer Str.3, 64289 Darmstadt,Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------

next prev parent reply	other threads:[~2011-02-28 14:13 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-02-28  8:35 Christophe TROESTLER
2011-02-28  8:58 ` Daniel Bünzli
2011-02-28 10:07   ` David Allsopp
2011-02-28 11:21     ` Daniel Bünzli
2011-02-28 11:46       ` David Allsopp
2011-02-28 12:32         ` Daniel Bünzli
2011-02-28 12:59           ` [Caml-list] " Sylvain Le Gall
2011-02-28 10:59   ` Sylvain Le Gall
2011-02-28 14:39   ` [Caml-list] " David Rajchenbach-Teller
2011-02-28 10:07 ` David Allsopp
     [not found]   ` <20110228.143157.1265982603697554449.Christophe.Troestler+ocaml@umons.ac.be>
2011-02-28 14:11     ` Daniel Bünzli
2011-02-28 14:57       ` Dario Teixeira
2011-02-28 14:13 ` Gerd Stolpmann [this message]
2011-02-28 14:31   ` [Caml-list] " Sylvain Le Gall
2011-02-28 15:09   ` [Caml-list] " Dario Teixeira
2011-02-28 15:50   ` David Allsopp
2011-03-01  5:49     ` [Caml-list] " Yoriyuki Yamagata
2011-02-28 14:21 ` [Caml-list] " Michael Ekstrand
2011-03-03 15:37 ` Damien Doligez
2011-03-03 16:42   ` Dario Teixeira

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1298902420.27243.42.camel@thinkpad \
    --to=info@gerd-stolpmann.de \
    --cc=Christophe.Troestler@umons.ac.be \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).