From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83]) by walapai.inria.fr (8.13.6/8.13.6) with ESMTP id p1SBpIcT027235 for ; Mon, 28 Feb 2011 12:51:18 +0100 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AikCAHcba01RZ90xkWdsb2JhbACEJKE+ZBUBAQEBCQsKBxEDIqwAkAOBJ4NEdgSMHQ X-IronPort-AV: E=Sophos;i="4.62,238,1297033200"; d="scan'208";a="92215889" Received: from mtaout03-winn.ispmail.ntl.com ([81.103.221.49]) by mail2-smtp-roc.national.inria.fr with ESMTP; 28 Feb 2011 12:51:12 +0100 Received: from aamtaout04-winn.ispmail.ntl.com ([81.103.221.35]) by mtaout03-winn.ispmail.ntl.com (InterMail vM.7.08.04.00 201-2186-134-20080326) with ESMTP id <20110228115111.EFWV23441.mtaout03-winn.ispmail.ntl.com@aamtaout04-winn.ispmail.ntl.com>; Mon, 28 Feb 2011 11:51:11 +0000 Received: from romulus.metastack.com ([81.102.132.77]) by aamtaout04-winn.ispmail.ntl.com (InterMail vG.3.00.04.00 201-2196-133-20080908) with ESMTP id <20110228115111.JFIT25656.aamtaout04-winn.ispmail.ntl.com@romulus.metastack.com>; Mon, 28 Feb 2011 11:51:11 +0000 Received: from remus.metastack.local ([172.16.0.1]) by romulus.metastack.com (8.14.2/8.14.2) with ESMTP id p1SBp7Ks004013 (version=TLSv1/SSLv3 cipher=AES128-SHA bits=128 verify=FAIL); Mon, 28 Feb 2011 11:51:08 GMT Received: from Remus.metastack.local ([fe80::547c:3c42:e1da:eda2]) by Remus.metastack.local ([fe80::547c:3c42:e1da:eda2%11]) with mapi; Mon, 28 Feb 2011 11:46:21 +0000 From: David Allsopp To: =?utf-8?B?J0RhbmllbCBCw7xuemxpJw==?= CC: "'Christophe TROESTLER'" , "'OCaml Mailing List'" Thread-Topic: [Caml-list] GSoC: better UTF-8 support Thread-Index: AQHL1yHbpO6NgLMzxEmbDuJZqeBtYpQWnSWAgAALgYCAABxjAIAAADKw Date: Mon, 28 Feb 2011 11:46:20 +0000 Message-ID: References: <20110228.093528.996524125295855263.Christophe.Troestler@umons.ac.be> In-Reply-To: Accept-Language: en-GB, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Organization: MetaStack Solutions Ltd. X-Scanned-By: MIMEDefang 2.65 on 81.102.132.77 X-Cloudmark-Analysis: v=1.1 cv=R50lirqlHffDPPkwUlkuVa99MrvKdVWo//yz83qex8g= c=1 sm=0 a=fAv-q_DikvIA:10 a=IkcTkHD0fZMA:10 a=xqWC_Br6kY4A:10 a=fsWwqtZfhJyd2lOehUQA:9 a=ROkZ8YJkMAz-Tc9YMjUA:7 a=MmdDHkUCH_hNIw9dWfJ1fzHZeRsA:4 a=QEXdDO2ut3YA:10 a=HpAAvcLHHh0Zw7uRqdWCyQ==:117 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from base64 to 8bit by walapai.inria.fr id p1SBpIcT027235 Subject: RE: [Caml-list] GSoC: better UTF-8 support Daniel Bünzli wrote: > > If it's to go into the standard library then yes, it should exactly > > replicate the interface of the Char and String modules, that way > > within the standard library UTF8 can be used as a drop-in replacement > [...] > > I'm not sure many programs would actually benefit from that. At a certain > point if you really want to process unicode at the character level you'll > need a proper library. Yes, and at that point you'd use one. Not all programs need to analyse strings at a character level. For example, it would be nice for this to work within just the standard library: D:\>md "Paweł Łukaszewski" D:\>cd "Paweł Łukaszewski" D:\Paweł Łukaszewski>ocaml Objective Caml version 3.11.2 # Sys.getcwd();; - : string = "D:\\Pawel Lukaszewski" The reason for this is because an internal conversion is done to hack around Unicode filenames - but having any representation of Unicode in the standard library would mean that these and other functions would no longer have to return incorrect answers, as in this case, but the standard library would have a default mechanism for dealing with Unicode. At the moment, there's nothing the standard library can do with this Unicode filename because there's no standardised (within the library) representation available for it. Consider it as being similar to something like the Digest module. It's only really there because digests are used in .cmi files within the compiler. If you're serious about using digests in an application then you have to use an external library because MD5 is not usually enough and you'll need other algorithms. But the module is still potentially useful so it's better that it's publically visible in the standard library and not just an internal module of the compiler. Same would apply to this simple native support for UTF-8 - it would allow other parts of the standard library to work properly for simple applications would allow Unicode strings to be handled. > Using these ascii/latin1 oriented interfaces to > process unicode at the character level would be debilitating and > frustrating for your final users (e.g. no treatement of normal forms, you > do realize that in unicode there's more than one way of representing the > character 'é'). Fully aware - but just because you need to work with strings does *not* imply you ever need even to compare them. Granted, the documentation may note that to perform a canonical comparison you'll need a third party library but I still maintain that having basic support is better and more usable than having none. The above issue, for example, means I can't even manipulate a directory tree in OCaml on my system which shouldn't even be related to string/text processing! David