caml-list - the Caml user's mailing list
From: David Allsopp <dra-news@metastack.com>
To: "'Gerd Stolpmann'" <info@gerd-stolpmann.de>,
	"'Christophe TROESTLER'" <Christophe.Troestler@umons.ac.be>
Cc: "'OCaml Mailing List'" <caml-list@inria.fr>
Subject: RE: [Caml-list] GSoC: better UTF-8 support
Date: Mon, 28 Feb 2011 15:50:40 +0000
Message-ID: <E51C5B015DBD1348A1D85763337FB6D9491012A2@Remus.metastack.local>
In-Reply-To: <1298902420.27243.42.camel@thinkpad>

Gerd Stolpmann wrote:
> Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER:
> > Hi,
> >
> > Starting from an idea on the Ocsigen mailing list, it was suggested
> > that better support for UTF-8 in the tools would be of interest to
> > several people.  In particular, the following points were identified:
> >
> > - A flag (-utf8 ?) to the compilers should be added so that errors
> >   locations are correct in presence of UTF-8 strings [the programmer
> >   restricting himself to ASCII identifiers].
> >
> > - ocamldoc: while an UTF-8 aware doc-generator is very easy to write,
> >   it would be nice to be able to parametrize any of them with the
> >   correct charset (using again the -utf8 flag ?)
> >
> > - UTF8.Char and UTF8.String modules should be written with the same
> >   interface as Char and String.  [Camomile should be adapted
> >   consequently.]
> 
> Well, UTF-8 is the wrong term here. What you need on this level are
> Unicode modules, where a uni_char can contain all Unicode code points,
> and a uni_string is an array of such uni_char's.
> 
> UTF-8 is a run-length encoding of Unicode for I/O. It is not well suited
> for string manipulation, at least if you want efficient support for
> index-based access, because the length of the char representation is not
> constant.
> 
> Probably you would choose uni_char=int as representation for characters,
> but for strings there are several possibilities:
> 
> - 16 bits/char: This path is taken by other languages, but only a
>   subset of Unicode chars can be represented directly
> - 24 bits/char: All Unicode chars can be represented (range is
>   0 to 0x10ffff), but you need to multiply by 3 to access by index.
>   This multiplication is relatively cheap (one bit shift plus
>   one addition).
> - 32 bits/char: A slight waste of RAM but very efficient access by
>   index
> - int/char: same as 32 bits/char for 32-bit platforms but
>   64 bits/char for 64-bit. Probably no good choice.
> 
> Of course, there should also be conversions from/to normal chars/strings,
> and this is the place where UTF-8 comes into play.
> 
> Another comment: for supporting lowercase/uppercase conversion one needs
> lookup tables. Not really big tables, because only a small fraction of
> the Unicode chars has this variation.

Although you could reasonably leave out the case conversion functions if you wanted (of course, if you're aiming for total compatibility with Char/String then they'd have to be implemented, since those modules have them). Not providing a function doesn't make a library half-baked, as long as it can be implemented on top of the functions you do provide.
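
As a rough sketch of what "on top of" could look like: assuming the uni_char=int representation suggested above and, purely for illustration, a uni_string that is a plain int array, an add-on library could supply case conversion with nothing more than a per-character map. The names uni_char, uni_string, toy_uppercase and uppercase are hypothetical, and the table below is a toy covering only ASCII and Latin-1 letters, not the real Unicode case mappings.

```ocaml
(* Hypothetical representation: one Unicode code point per array cell. *)
type uni_char = int
type uni_string = uni_char array

(* Toy upper-casing rule: ASCII a-z and the Latin-1 lowercase letters
   U+00E0..U+00FE (skipping U+00F7, the division sign) each sit 0x20
   above their uppercase form. Everything else is returned unchanged.
   A real library would use tables from the Unicode character database. *)
let toy_uppercase (c : uni_char) : uni_char =
  if (c >= 0x61 && c <= 0x7A) || (c >= 0xE0 && c <= 0xFE && c <> 0xF7)
  then c - 0x20
  else c

(* Case conversion built purely on a generic map over the string. *)
let uppercase (s : uni_string) : uni_string = Array.map toy_uppercase s
```

The core module only has to expose the map (or index-based get/set); the lookup table can live wherever it likes.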

> One should also think about whether other properties of the Unicode
> character database should be made available. E.g. character classes.
> This could also live in add-on libraries, but it is worth discussing.

Personally, I'd say they could safely live in other libraries. The advantage of having basic Unicode string handling (length, character retrieval, simple operations over Unicode-character offsets, etc.) would be that the standard library can use the representation itself and other libraries can be updated to work with it (that's what we have modules and functors for, after all).

The fact that OCaml's I/O functions can interface with Unicode-based file systems means, to me, that OCaml absolutely must support Unicode (at least as a future target). The status quo, in which you cannot, for example, accurately query the number of characters in a filename returned by Unix.readdir, is not in any way desirable. And that's to say nothing of the fact that if the Windows ports used the wide versions of the Win32 API instead, you'd have Unix.readdir returning UTF-8 strings on *nix and 16-bit wchar strings on Windows, so you'd have lost platform independence as well!

Fixing that does not require a fully featured Unicode library, and wishing for one seems a bit silly as a) it's exceedingly unlikely to happen and b) OCaml already has several very good libraries for full-blown Unicode.
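
To make the filename-length point concrete, here is a minimal sketch of the kind of basic operation I mean. It assumes the string holds valid UTF-8 (as a *nix filename typically, but not necessarily, does) and counts code points, not user-perceived characters. utf8_length is a made-up name, not an existing standard library function.

```ocaml
(* Count Unicode code points in a string assumed to hold valid UTF-8.
   Each code point contributes exactly one byte that is not a
   continuation byte (continuation bytes look like 10xxxxxx), so it is
   enough to count the bytes that do not match that pattern. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter
    (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n)
    s;
  !n
```

For the five-byte string "caf\xc3\xa9" (café encoded as UTF-8), String.length gives 5 while utf8_length gives 4.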


David

