From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mail1-relais-roc.national.inria.fr (mail1-relais-roc.national.inria.fr [192.134.164.82])
	by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p1SFtqZ7014333
	for <caml-list@sympa-roc.inria.fr>; Mon, 28 Feb 2011 16:55:52 +0100
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: AikCANdUa01RZ90wkWdsb2JhbACEJaE+ZBUBAQEBCQsKBxEDIqxOkB+BJ4NEdgSMHYwR
X-IronPort-AV: E=Sophos;i="4.62,240,1297033200"; 
   d="scan'208";a="100861405"
Received: from mtaout02-winn.ispmail.ntl.com ([81.103.221.48])
  by mail1-smtp-roc.national.inria.fr with ESMTP; 28 Feb 2011 16:55:35 +0100
Received: from aamtaout03-winn.ispmail.ntl.com ([81.103.221.35])
          by mtaout02-winn.ispmail.ntl.com
          (InterMail vM.7.08.04.00 201-2186-134-20080326) with ESMTP
          id <20110228155535.ITNK19887.mtaout02-winn.ispmail.ntl.com@aamtaout03-winn.ispmail.ntl.com>;
          Mon, 28 Feb 2011 15:55:35 +0000
Received: from romulus.metastack.com ([81.102.132.77])
          by aamtaout03-winn.ispmail.ntl.com
          (InterMail vG.3.00.04.00 201-2196-133-20080908) with ESMTP
          id <20110228155534.CIGY28282.aamtaout03-winn.ispmail.ntl.com@romulus.metastack.com>;
          Mon, 28 Feb 2011 15:55:34 +0000
Received: from remus.metastack.local ([172.16.0.1])
	by romulus.metastack.com (8.14.2/8.14.2) with ESMTP id p1SFtRL7005430
	(version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=FAIL);
	Mon, 28 Feb 2011 15:55:27 GMT
Received: from Remus.metastack.local ([fe80::547c:3c42:e1da:eda2]) by
 Remus.metastack.local ([fe80::547c:3c42:e1da:eda2%11]) with mapi; Mon, 28 Feb
 2011 15:50:41 +0000
From: David Allsopp <dra-news@metastack.com>
To: "'Gerd Stolpmann'" <info@gerd-stolpmann.de>,
        "'Christophe TROESTLER'"
	<Christophe.Troestler@umons.ac.be>
CC: "'OCaml Mailing List'" <caml-list@inria.fr>
Thread-Topic: [Caml-list] GSoC: better UTF-8 support
Thread-Index: AQHL1yHbpO6NgLMzxEmbDuJZqeBtYpQW9SEAgAAWSuA=
Date: Mon, 28 Feb 2011 15:50:40 +0000
Message-ID: <E51C5B015DBD1348A1D85763337FB6D9491012A2@Remus.metastack.local>
References: <20110228.093528.996524125295855263.Christophe.Troestler@umons.ac.be>
 <1298902420.27243.42.camel@thinkpad>
In-Reply-To: <1298902420.27243.42.camel@thinkpad>
Accept-Language: en-GB, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Organization: MetaStack Solutions Ltd.
X-Scanned-By: MIMEDefang 2.65 on 81.102.132.77
X-Cloudmark-Analysis: v=1.1 cv=R50lirqlHffDPPkwUlkuVa99MrvKdVWo//yz83qex8g= c=1 sm=0 a=fAv-q_DikvIA:10 a=IkcTkHD0fZMA:10 a=xqWC_Br6kY4A:10 a=7qjdOVGF2v_mxneQgdcA:9 a=SetqJa4EOQBt_7ShmIgA:7 a=YCQfkkbi7EizrjY93settsU-rNIA:4 a=QEXdDO2ut3YA:10 a=EmT01T_X6bhj07Dw:21 a=KE88DNbGnzcQiTst:21 a=HpAAvcLHHh0Zw7uRqdWCyQ==:117
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from base64 to 8bit by walapai.inria.fr id p1SFtqZ7014333
Subject: RE: [Caml-list] GSoC: better UTF-8 support

Gerd Stolpmann wrote:
> Am Montag, den 28.02.2011, 09:35 +0100 schrieb Christophe TROESTLER:
> > Hi,
> >
> > Starting from an idea on the Ocsigen mailing list, it was suggested
> > that better support for UTF-8 in the tools would be of interest to
> > several people.  In particular, the following points were identified:
> >
> > - A flag (-utf8 ?) to the compilers should be added so that errors
> >   locations are correct in presence of UTF-8 strings [the programmer
> >   restricting himself to ASCII identifiers].
> >
> > - ocamldoc: while an UTF-8 aware doc-generator is very easy to write,
> >   it would be nice to be able to parametrize any of them with the
> >   correct charset (using again the -utf8 flag ?)
> >
> > - UTF8.Char and UTF8.String modules should be written with the same
> >   interface as Char and String.  [Camomile should be adapted
> >   consequently.]
> 
> Well, UTF-8 is the wrong term here. What you need on this level are
> Unicode modules, where a uni_char can contain all Unicode code points,
> and a uni_string is an array of such uni_char's.
> 
> UTF-8 is a run-length encoding of Unicode for I/O. It is not well suited
> for string manipulation, at least if you want efficient support for
> index-based access, because the length of the char representation is not
> constant.
> 
> Probably you would choose uni_char=int as representation for characters,
> but for strings there are several possibilities:
> 
> - 16 bits/char: This path is taken by other languages, but only a
>   subset of Unicode chars can be represented directly
> - 24 bits/char: All Unicode chars can be represented (range is
>   0 to 0x10ffff), but you need to multiply by 3 to access by index.
>   This multiplication is relatively cheap (one bit shift plus
>   one addition).
> - 32 bits/char: A slight waste of RAM but very efficient access by
>   index
> - int/char: same as 32 bits/char for 32-bit platforms but
>   64 bits/char for 64-bit. Probably no good choice.
> 
> Of course, there should also be conversions from/to normal chars/strings,
> and this is the place where UTF-8 comes into play.
> 
> Another comment: for supporting lowercase/uppercase conversion one needs
> lookup tables. Not really big tables, because only a small fraction of
> the Unicode chars has this variation.

Although you could reasonably exclude case conversion functions if you wanted (of course, if you're trying to be totally compatible with Char/String then they'd have to be implemented as it has them). Not providing a function doesn't imply half-baked as long as there's the capability to implement it on top of the functions you do provide.

> One should also think about whether other properties of the Unicode
> character database should be made available. E.g. character classes.
> This could also live in add-on libraries, but it is worth discussing.

Personally, I'd say they could safely live in other libraries - the advantage of having basic Unicode string handling (length, character retrieval, simple operations over Unicode-character offsets, etc.) would be that the standard library can use the representation itself and other libraries can be updated to work with it (that's what we have modules and functors for, after all). The fact that OCaml's I/O functions can interface with Unicode-based file systems to me means that it absolutely must support Unicode (at least as a future target) - the status quo of being unable, for example, accurately to query the number of characters in the length of a filename returned by Unix.readdir () is not in any way desirable (and that's to say nothing of the fact that if the Windows ports used the wide versions of the Win32 API instead then you'd have Unix.readdir() returning UTF-8 strings on *nix and 16-bit wchar strings on Windows so you'd have lost platform independence as well!). Fixing that does not require a fully featured Unicode library and wishing for that seems a bit silly as a) it's exceedingly unlikely to happen and b) OCaml already has several very good libraries for full-blown Unicode.


David