caml-list - the Caml user's mailing list
* [Caml-list] A proposal of a standard support for Unicode string
@ 2014-07-18 14:08 Yoriyuki Yamagata
  2014-07-18 15:42 ` Török Edwin
  0 siblings, 1 reply; 2+ messages in thread
From: Yoriyuki Yamagata @ 2014-07-18 14:08 UTC (permalink / raw)
  To: Caml List

Dear List,

I wrote a blog post, http://yoriyuki.info/en/blog/2014/07/18/unicode/, which
proposes including Unicode strings in the OCaml standard distribution.

The reasons for this proposal can be summarized as follows.

   1. A type for human-readable text is too important to be left out of the
      standard distribution, in particular from a beginner's perspective.
   2. This improves the interoperability of Unicode processing libraries.
   3. This discourages the current dangerous practice of using raw UTF-8
      encoded strings as Unicode strings.

I hope this stimulates further discussion of human-readable text in
OCaml.

Best,
-- 
Yoriyuki Yamagata
http://yoriyuki.info/


* Re: [Caml-list] A proposal of a standard support for Unicode string
  2014-07-18 14:08 [Caml-list] A proposal of a standard support for Unicode string Yoriyuki Yamagata
@ 2014-07-18 15:42 ` Török Edwin
  0 siblings, 0 replies; 2+ messages in thread
From: Török Edwin @ 2014-07-18 15:42 UTC (permalink / raw)
  To: caml-list

Just my two cents:

I don't think that iterating over code points is the fundamental operation for Unicode strings; there are too many pitfalls
to be aware of (unless you are writing low-level Unicode functionality such as normalization, regular expression matching, etc.).
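
One such pitfall can be illustrated with a minimal sketch. It assumes OCaml 4.14 or later (for String.get_utf_8_uchar and Uchar.utf_decode_length) and OCaml 4.06 or later for the \u{...} escape; the helper name count_code_points is made up for this example. The same user-perceived character can be one code point or two, so a code point count says little about what a reader sees:

```ocaml
(* Count the Unicode code points in a UTF-8 encoded string.
   Invalid byte sequences are counted as one code point each
   (the decoder reports them as U+FFFD and still advances). *)
let count_code_points (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

(* Two encodings of the same user-perceived character "é". *)
let precomposed = "\u{00E9}"  (* U+00E9, a single code point *)
let decomposed = "e\u{0301}"  (* U+0065 followed by combining U+0301 *)

let () =
  Printf.printf "%d %d %b\n"
    (count_code_points precomposed)
    (count_code_points decomposed)
    (String.equal precomposed decomposed)
```

The two strings render identically, yet they differ in code point count and are not equal byte-for-byte, which is why normalization and segmentation matter more to most users than raw code point iteration.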

From a user's perspective I think you need mostly these operations:
  * create a valid UTF-8 string (reject invalid ones)
  * be able to put Unicode strings in standard containers (i.e. comparison and hash functions)
  * transform Unicode strings (normalize, case-fold)
  * Unicode regular expressions (UTS#18)
  * Unicode text segmentation (UAX#29)
  * Unicode collation algorithms (UTS#10)
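
The first two items in the list above could be sketched roughly as follows. This is only an illustration: the module name Utf8_text and its signature are hypothetical, and it assumes OCaml 4.14 or later for String.is_valid_utf_8. Comparison and hashing here are byte-wise, so canonically equivalent strings are still treated as different unless they are normalized first.

```ocaml
(* Hypothetical module: an abstract type that can only hold valid UTF-8. *)
module Utf8_text : sig
  type t
  val of_string : string -> t option  (* None if the input is not valid UTF-8 *)
  val to_string : t -> string
  val compare : t -> t -> int
  val equal : t -> t -> bool
  val hash : t -> int
end = struct
  type t = string
  let of_string s = if String.is_valid_utf_8 s then Some s else None
  let to_string s = s
  let compare = String.compare  (* byte-wise order, not collation *)
  let equal = String.equal
  let hash = Hashtbl.hash
end

(* The type plugs directly into the standard containers. *)
module Text_set = Set.Make (Utf8_text)
module Text_tbl = Hashtbl.Make (Utf8_text)
```

Because the type is abstract, the only way to obtain a value is through of_string, so every element of a Text_set or key of a Text_tbl is known to be valid UTF-8.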

Regular expressions could be used to match/replace/split a Unicode string just as you would use String.index/String.sub,
and they could also be used to find/access the Unicode properties of Unicode characters.
Unicode text segmentation can be used to define more useful iterators than codepoint-by-codepoint.
If performance is a concern, specialized implementations could be provided for commonly used expressions.

Of course this is only my view, and someone else might consider other Unicode features to be more important.
Also, Unicode evolves, there could be bugs in the implementation, etc., so I don't think this is suitable to be baked into
the compiler-provided standard library.

At least such a "high-level" Unicode library should first be prototyped and fully implemented outside the compiler
(based on the already existing uu* libraries perhaps).
Then see what problems people face when using it, then test it out in a "standard library" such as Batteries or Core, and only then make it part of the compiler's standard library.

Given that recently there's been a tendency to split out functionality from the OCaml distribution (camlp4, ocamlbuild, etc.), I don't think
that adding such complicated and evolving Unicode algorithms to the standard library is the way to go.
I'd rather see the standard library deprecate/remove the ASCII-specific interfaces (or move them to a submodule).

As for the compiler itself, it could provide some useful functionality, though I'm not sure it is required:
  * warn when string literals are not valid UTF-8
  * support for \u escapes
  * if there will be a [`String|`Bytes ] stringlike type, there could be a 'type unicode = `Unicode stringlike' and some way to make literal strings be
 of type unicode, with the actual Unicode implementation left to a user library
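
The last item can be approximated today with phantom types. The sketch below is an assumption-laden illustration, not an existing API: the module Stringlike, its tags, and all function names are hypothetical, and it assumes OCaml 4.14 or later for String.is_valid_utf_8. It uses a checking constructor in place of the compiler support for typed literals mentioned above.

```ocaml
(* Hypothetical sketch of a tagged "stringlike" type. *)
module Stringlike : sig
  type 'kind t                     (* 'kind is a phantom tag; data is a string *)
  type raw = [ `String ] t
  type unicode = [ `Unicode ] t
  val raw : string -> raw
  val unicode : string -> unicode option  (* None unless valid UTF-8 *)
  val to_string : 'kind t -> string
  val byte_length : 'kind t -> int
end = struct
  type 'kind t = string
  type raw = [ `String ] t
  type unicode = [ `Unicode ] t
  let raw s = s
  let unicode s = if String.is_valid_utf_8 s then Some s else None
  let to_string s = s
  let byte_length = String.length
end

(* A function that only accepts text already checked to be valid UTF-8.
   Passing a Stringlike.raw value here is rejected by the type checker. *)
let describe (u : Stringlike.unicode) : string =
  Printf.sprintf "%d bytes of valid UTF-8" (Stringlike.byte_length u)
```

Both kinds share one runtime representation, so the tag costs nothing at runtime; the actual Unicode algorithms would still live in a user library that consumes Stringlike.unicode values.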

Best regards,
--Edwin
