Re: [Caml-list] A proposal of a standard support for Unicode string

From: "Török Edwin" <edwin+ml-ocaml@etorok.net>
To: caml-list@inria.fr
Subject: Re: [Caml-list] A proposal of a standard support for Unicode string
Date: Fri, 18 Jul 2014 18:42:02 +0300
Message-ID: <53C9404A.1000007@etorok.net>
In-Reply-To: <CALdQWQ5WTUdupzKc=tKkqKat=xtADHadv7_hcct1kS5w=2tLgw@mail.gmail.com>

On 07/18/2014 05:08 PM, Yoriyuki Yamagata wrote:
> Dear List,
> 
> I wrote a blog post, http://yoriyuki.info/en/blog/2014/07/18/unicode/, which proposes including Unicode strings in the OCaml standard distribution.
> 
> The reason for this proposal can be summarized as follows.
> 
>  1. A type for human-readable text is too important to be left out of the standard distribution, in particular from a beginner's perspective.
> 
>  2. This enhances the interoperability of Unicode processing libraries.
> 
>  3. This discourages the current dangerous practice of using raw UTF-8 encoded strings as Unicode strings.
> 
> I hope this stimulates further discussion of human-readable text in OCaml.

Just my two cents:

I don't think that iterating over code points is the fundamental operation for Unicode strings; there are too many pitfalls
to be aware of (unless you are writing a low-level Unicode library for normalization, regular expression matching, etc.).
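One such pitfall, as a minimal sketch: the same user-perceived character can be one code point or several, so a code point count is not a character count. The helper name count_code_points is made up for illustration; the sketch assumes OCaml 4.14 or later for String.get_utf_8_uchar.

```ocaml
(* Count Unicode code points in a UTF-8 string using only the standard
   library. Malformed bytes are counted as one (replacement) code point
   each time the decoder reports an error. *)
let count_code_points (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      (* utf_decode_length is always >= 1, so the loop makes progress. *)
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

(* "é" precomposed: the single code point U+00E9. *)
let precomposed = "\xC3\xA9"

(* "é" decomposed: 'e' followed by U+0301 COMBINING ACUTE ACCENT. *)
let decomposed = "e\xCC\x81"

let () =
  Printf.printf "%d %d\n"
    (count_code_points precomposed)
    (count_code_points decomposed)
```

Both strings render as the same character, yet they differ in code point count and in bytes, which is why normalization and segmentation matter more to a user than raw code point iteration.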

From a user's perspective I think you need mostly these operations:
  * create a valid UTF-8 string (reject invalid ones)
  * be able to put Unicode strings in standard containers (i.e. comparison and hash functions)
  * transform Unicode strings (normalize, case-fold)
  * Unicode regular expressions (UTS#18)
  * Unicode text segmentation (UTS#29)
  * Unicode collation algorithms (UTS#10)
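The first two items can be sketched with the standard library alone. This is only an illustration, not a proposed API: the module name Utf8 and its functions are hypothetical, and it assumes OCaml 4.14 or later for String.is_valid_utf_8. Comparison here is plain byte order, so canonically equivalent strings compare as different unless they are normalized first; normalization, case folding, and the UTS algorithms would need a real Unicode library.

```ocaml
(* Hypothetical validated-UTF-8 string type. The abstract type guarantees
   that every value went through the validity check in of_string. *)
module Utf8 : sig
  type t
  val of_string : string -> t option   (* None if not valid UTF-8 *)
  val to_string : t -> string
  val compare : t -> t -> int          (* byte order, no normalization *)
  val equal : t -> t -> bool
  val hash : t -> int
end = struct
  type t = string
  let of_string s = if String.is_valid_utf_8 s then Some s else None
  let to_string s = s
  let compare = String.compare
  let equal = String.equal
  let hash = Hashtbl.hash
end

(* Because Utf8 provides compare, equal and hash, it plugs directly into
   the standard containers. *)
module Utf8_map = Map.Make (Utf8)
module Utf8_tbl = Hashtbl.Make (Utf8)

let () =
  match Utf8.of_string "h\xC3\xA9llo" with
  | None -> print_endline "invalid UTF-8"
  | Some k ->
    let m = Utf8_map.add k 1 Utf8_map.empty in
    Printf.printf "%d binding(s)\n" (Utf8_map.cardinal m)
```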

Regular expressions could be used to match, replace, and split Unicode strings, just as you would use String.index and String.sub on byte strings,
and they could also be used to query the Unicode properties of characters.
Unicode text segmentation can be used to define more useful iterators than code point by code point.
If performance is a concern, specialized implementations could be provided for commonly used expressions.

Of course this is only my view, and someone else might consider other Unicode features to be more important.
Unicode also evolves, and there could be bugs in the implementation, so I don't think this is suitable to be baked into
the compiler-provided standard library.

At the least, such a "high-level" Unicode library should first be prototyped and fully implemented outside the compiler
(based on the already existing uu* libraries, perhaps).
Then we can see what problems people face when using it, test it out in a "standard library" such as Batteries or Core, and only then make it part of the compiler's standard library.

Given the recent tendency to split functionality out of the OCaml distribution (camlp4, ocamlbuild, etc.), I don't think
that adding such complicated and evolving Unicode algorithms to the standard library is the way to go.
I'd rather see the standard library deprecate or remove the ASCII-specific interfaces (or move them to a submodule).

As for the compiler itself, it could provide some useful functionality, though I am not sure whether it is required:
  * warn when string literals are not valid UTF-8
  * support for \u escapes
  * if there is going to be a [`String|`Bytes ] stringlike type, there could be a 'type unicode = `Unicode stringlike' and some way to make string literals
    have type unicode, with the actual Unicode implementation left to a user library
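A rough sketch of the last two points, under stated assumptions: the Stringlike module and all of its names are hypothetical and only illustrate the phantom-type idea; nothing like it is in the standard library. The \u{...} escape shown is the form that later OCaml releases (4.06 and newer) accept in string literals, and String.is_valid_utf_8 needs OCaml 4.14 or later.

```ocaml
(* Hypothetical phantom-typed string: the type parameter records what the
   bytes are known to be, while the representation stays a plain string. *)
module Stringlike : sig
  type 'kind t
  type unicode = [ `Unicode ] t
  val bytes : string -> [ `Bytes ] t       (* any byte sequence *)
  val unicode : string -> unicode option   (* only valid UTF-8 *)
  val to_string : 'kind t -> string
end = struct
  type 'kind t = string
  type unicode = [ `Unicode ] t
  let bytes s = s
  let unicode s = if String.is_valid_utf_8 s then Some s else None
  let to_string s = s
end

(* A \u{...} escape is expanded by the compiler to the UTF-8 encoding of
   the code point, here U+00E9. *)
let e_acute = "\u{00E9}"

let () =
  match Stringlike.unicode e_acute with
  | Some u -> Printf.printf "%d bytes\n" (String.length (Stringlike.to_string u))
  | None -> print_endline "not valid UTF-8"
```

With this shape, a user library could supply the Unicode algorithms over Stringlike.unicode values while the compiler only needs to know how to type and validate literals.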

Best regards,
--Edwin

