From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id 1F9507FA43 for ; Fri, 18 Jul 2014 17:42:06 +0200 (CEST) Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of edwin+ml-ocaml@etorok.net) identity=pra; client-ip=62.113.205.31; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="edwin+ml-ocaml@etorok.net"; x-sender="edwin+ml-ocaml@etorok.net"; x-conformance=sidf_compatible Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of edwin+ml-ocaml@etorok.net designates 62.113.205.31 as permitted sender) identity=mailfrom; client-ip=62.113.205.31; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="edwin+ml-ocaml@etorok.net"; x-sender="edwin+ml-ocaml@etorok.net"; x-conformance=sidf_compatible; x-record-type="v=spf1" Received-SPF: Pass (mail3-smtp-sop.national.inria.fr: domain of postmaster@mx.etorok.net designates 62.113.205.31 as permitted sender) identity=helo; client-ip=62.113.205.31; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="edwin+ml-ocaml@etorok.net"; x-sender="postmaster@mx.etorok.net"; x-conformance=sidf_compatible; x-record-type="v=spf1" X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AhAFAJ4/yVM+cc0f/2dsb2JhbABZgw5Sg0/BfIdDAYEIFnaEBAEFIx04Ag8LGAICBRMDCwICCQMCAQIBRRMIAohCCas/d4UIkk0RBoEsjTdvgniBTopkkEaBTZJig0c6Lw X-IPAS-Result: AhAFAJ4/yVM+cc0f/2dsb2JhbABZgw5Sg0/BfIdDAYEIFnaEBAEFIx04Ag8LGAICBRMDCwICCQMCAQIBRRMIAohCCas/d4UIkk0RBoEsjTdvgniBTopkkEaBTZJig0c6Lw X-IronPort-AV: E=Sophos;i="5.01,685,1400018400"; d="scan'208";a="71984426" Received: from mx.etorok.net ([62.113.205.31]) by mail3-smtp-sop.national.inria.fr with ESMTP/TLS/DHE-RSA-AES256-SHA; 18 Jul 2014 17:42:05 +0200 Received: by mx.etorok.net (OpenSMTPD) with ESMTP id 40a49501; for ; Fri, 18 Jul 2014 18:42:03 +0300 (EEST) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=etorok.net; h= message-id:date:from:mime-version:to:references:in-reply-to :content-type:content-transfer-encoding; s=ml; l=3217; bh=I7W1QU PxcCaMWD7YHVU4dvVpuSY=; b=IeYTCgEvAeGIASD4Arax3qRJq9+/jAu7rslOBJ BEbq4cyH2qa7VzAQIBiALZl7ik1KRanEMfZB7OzTWOEwmLfwHx2xdCAf0GUULkue ulo/x4tcIaFNKRF5zu6XrJI+oP1UG0CF726UtRDaRoVJRrOm3LghSFreeF6PF9i3 R8WEo= Received: by mx.etorok.net (OpenSMTPD) with ESMTPSA id 4c7ac7af; TLS version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO; for ; Fri, 18 Jul 2014 18:42:02 +0300 (EEST) Message-ID: <53C9404A.1000007@etorok.net> Date: Fri, 18 Jul 2014 18:42:02 +0300 From: =?UTF-8?B?VMO2csO2ayBFZHdpbg==?= User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.0 MIME-Version: 1.0 To: caml-list@inria.fr References: In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Subject: Re: [Caml-list] A proposal of a standard support for Unicode string On 07/18/2014 05:08 PM, Yoriyuki Yamagata wrote: > Dear List, > > I write a blog post http://yoriyuki.info/en/blog/2014/07/18/unicode/ which proposes inclusion of Unicode strings in OCaml standard distribution. > > The reason for this proposal can be summarized as follows. > > 1. > > Type for human readable text is too important to left out from the standard distribution, in particular from the beginner's perspective > > 2. > > This enhances interpolability of Unicode processing libraries > > 3. > > This suppresses the current dangerous practice that raw UTF-8 encoded string is used for Unicode string. > > I hope this stimulates the further discussion of human readable texts in OCaml Just my two cents: I don't think that iterating on codepoints is the fundamental operation for Unicode strings, there are too many pitfalls to be aware of (unless you are writing a low-level Unicode library such as normalization, regular expression matching, etc.). >From a user's perspective I think you need mostly these operations: * create a valid UTF-8 string (reject invalid ones) * be able to put Unicode strings in standard containers (i.e. comparison and hash functions) * transform Unicode strings (normalize, case-fold) * Unicode regular expressions (UTS#18) * Unicode text segmentation (UTS#29) * Unicode collation algorithms (UTS#10) The regular expressions could be used to match/replace/split the Unicode string just as you would use String.index/String.sub, and it could be used to find/access the Unicode properties of unicode characters. And Unicode text segmentation can be used to define more useful iterators than codepoint-by-codepoint. If performance is a concern specialized implementations could be provided for commonly used expressions. Now of course this is only my view and someone else would consider other Unicode features to be more important. Also Unicode evolves, there could be bugs in the implementation, etc. so I don't think this is suitable to be baked in the compiler-provided standard library. At least such a "high-level" Unicode library should be prototyped and fully implemented outside the compiler (based on the already existing uu* libraries perhaps). Then see what problems people face when using it, and then tested out in a "standard library" such as Batteries or Core, and only then be made part of the compiler's standard library. Given that recently there's been a tendency to split-out functionality from the OCaml distribution (camlp4, ocamlbuild, etc.) I don't think that adding such complicated and evolving Unicode algorithms to the standard library is the way to go. I'd rather see the standard library deprecate/remove the ASCII-specific interfaces (or move them to a submodule). As for the compiler itself it could provide some useful functionality, not sure if it is required: * warn when string literals are not valid UTF-8 * support for \u * if there will be [`String|`Bytes ] stringlike there could be a 'type unicode = `Unicode stringlike' and some way to make literal strings be of type unicode, with the actual unicode implementation left to a user library Best regards, --Edwin