caml-list - the Caml user's mailing list
 help / color / mirror / Atom feed
From: David Allsopp <dra-news@metastack.com>
To: "'Daniel Bünzli'" <daniel.buenzli@erratique.ch>
Cc: "'Christophe TROESTLER'" <Christophe.Troestler@umons.ac.be>,
	"'OCaml Mailing List'" <caml-list@inria.fr>
Subject: RE: [Caml-list] GSoC: better UTF-8 support
Date: Mon, 28 Feb 2011 11:46:20 +0000	[thread overview]
Message-ID: <E51C5B015DBD1348A1D85763337FB6D949100ED6@Remus.metastack.local> (raw)
In-Reply-To: <AANLkTi=dHMofPWCoMwMVtiB+Q-9A32JwP2OuuqxGLYoh@mail.gmail.com>

Daniel Bünzli wrote:
> > If it's to go into the standard library then yes, it should exactly
> > replicate the interface of the Char and String modules, that way
> > within the standard library UTF8 can be used as a drop-in replacement
> [...]
> 
> I'm not sure many programs would actually benefit from that. At a certain
> point if you really want to process unicode at the character level you'll
> need a proper library.

Yes, and at that point you'd use one. Not all programs need to analyse strings at a character level. For example, it would be nice for this to work within just the standard library:

D:\>md "Paweł Łukaszewski"
D:\>cd "Paweł Łukaszewski"
D:\Paweł Łukaszewski>ocaml
        Objective Caml version 3.11.2

# Sys.getcwd();;
- : string = "D:\\Pawel Lukaszewski"

The reason for this is because an internal conversion is done to hack around Unicode filenames - but having any representation of Unicode in the standard library would mean that these and other functions would no longer have to return incorrect answers, as in this case, but the standard library would have a default mechanism for dealing with Unicode. At the moment, there's nothing the standard library can do with this Unicode filename because there's no standardised (within the library) representation available for it.

Consider it as being similar to something like the Digest module. It's only really there because digests are used in .cmi files within the compiler. If you're serious about using digests in an application then you have to use an external library because MD5 is not usually enough and you'll need other algorithms. But the module is still potentially useful so it's better that it's publically visible in the standard library and not just an internal module of the compiler. Same would apply to this simple native support for UTF-8 - it would allow other parts of the standard library to work properly for simple applications would allow Unicode strings to be handled.

> Using these ascii/latin1 oriented interfaces to
> process unicode at the character level would be debilitating and
> frustrating for your final users (e.g. no treatement of normal forms, you
> do realize that in unicode there's more than one way of representing the
> character 'é').

Fully aware - but just because you need to work with strings does *not* imply you ever need even to compare them. Granted, the documentation may note that to perform a canonical comparison you'll need a third party library but I still maintain that having basic support is better and more usable than having none. The above issue, for example, means I can't even manipulate a directory tree in OCaml on my system which shouldn't even be related to string/text processing!


David


  reply	other threads:[~2011-02-28 11:51 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-02-28  8:35 Christophe TROESTLER
2011-02-28  8:58 ` Daniel Bünzli
2011-02-28 10:07   ` David Allsopp
2011-02-28 11:21     ` Daniel Bünzli
2011-02-28 11:46       ` David Allsopp [this message]
2011-02-28 12:32         ` Daniel Bünzli
2011-02-28 12:59           ` [Caml-list] " Sylvain Le Gall
2011-02-28 10:59   ` Sylvain Le Gall
2011-02-28 14:39   ` [Caml-list] " David Rajchenbach-Teller
2011-02-28 10:07 ` David Allsopp
     [not found]   ` <20110228.143157.1265982603697554449.Christophe.Troestler+ocaml@umons.ac.be>
2011-02-28 14:11     ` Daniel Bünzli
2011-02-28 14:57       ` Dario Teixeira
2011-02-28 14:13 ` Gerd Stolpmann
2011-02-28 14:31   ` [Caml-list] " Sylvain Le Gall
2011-02-28 15:09   ` [Caml-list] " Dario Teixeira
2011-02-28 15:50   ` David Allsopp
2011-03-01  5:49     ` [Caml-list] " Yoriyuki Yamagata
2011-02-28 14:21 ` [Caml-list] " Michael Ekstrand
2011-03-03 15:37 ` Damien Doligez
2011-03-03 16:42   ` Dario Teixeira

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=E51C5B015DBD1348A1D85763337FB6D949100ED6@Remus.metastack.local \
    --to=dra-news@metastack.com \
    --cc=Christophe.Troestler@umons.ac.be \
    --cc=caml-list@inria.fr \
    --cc=daniel.buenzli@erratique.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).