From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <daniel.buenzli@erratique.ch>
X-Original-To: caml-list@sympa.inria.fr
Delivered-To: caml-list@sympa.inria.fr
Received: from mail2-relais-roc.national.inria.fr (mail2-relais-roc.national.inria.fr [192.134.164.83])
	by sympa.inria.fr (Postfix) with ESMTPS id 204F07FA13
	for <caml-list@sympa.inria.fr>; Tue,  8 Jul 2014 21:24:19 +0200 (CEST)
Received: from mail6.webfaction.com (HELO smtp.webfaction.com) ([74.55.86.74])
  by mail2-smtp-roc.national.inria.fr with ESMTP; 08 Jul 2014 21:24:18 +0200
Received: from [172.20.10.2] (188.29.165.193.threembb.co.uk [188.29.165.193])
	by smtp.webfaction.com (Postfix) with ESMTP id 2348222CFED3;
	Tue,  8 Jul 2014 19:24:14 +0000 (UTC)
Date: Tue, 8 Jul 2014 20:24:09 +0100
From: Daniel Bünzli <daniel.buenzli@erratique.ch>
To: mattiasw@gmail.com
Cc: caml-list@inria.fr
Message-ID: <A55AB3EABAC8457B8D777447055C49DA@erratique.ch>
In-Reply-To: <sympa.1404842907.21063.651@inria.fr>
References: <sympa.1404842907.21063.651@inria.fr>
X-Mailer: sparrow 1.6.4 (build 1178)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Subject: Re: [Caml-list] Immutable strings

On Tuesday, 8 July 2014 at 19:15, mattiasw@gmail.com wrote:
> My two cents:
>
> To me it seems very strange to introduce a new string type and not make it
> UTF-8 from the start.

No new string type was introduced. A bytes type was introduced.

> OCaml will be the last language that doesn't have standardized Unicode
> support.

What *exactly* do you mean by standardized Unicode support in the language?

I'd be genuinely interested in knowing the actual level of Unicode
support in these languages, beyond saying that their string is a UTF-X
encoded sequence of scalar values. For example, do these other languages
perform Unicode normalisation on string literals and patterns (and on
identifiers, if they chose that craze)? That would be absolutely
necessary for any kind of real-world processing of Unicode strings, but
then there is not just a single normalisation form, and the one you want
depends on the context. Do they have a notation to indicate which form
the literal or pattern should be in?
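To make the normalisation point concrete, here is a minimal OCaml sketch. It only uses byte escapes in plain string literals; the names `precomposed`, `decomposed` and `matches_precomposed` are made up for the illustration. It shows that two canonically equivalent spellings of "é" are different byte sequences, so a literal pattern written in one form does not match data in the other.

```ocaml
(* U+00E9 LATIN SMALL LETTER E WITH ACUTE, UTF-8 encoded.
   This is the spelling NFC would give. *)
let precomposed = "\xC3\xA9"

(* U+0065 followed by U+0301 COMBINING ACUTE ACCENT, UTF-8 encoded.
   This is the spelling NFD would give. *)
let decomposed = "e\xCC\x81"

(* Structural equality compares bytes, so the two spellings differ. *)
let same_bytes = (precomposed = decomposed)

(* A string pattern is also compared byte by byte: it only matches
   the spelling that was typed in the source. *)
let matches_precomposed = function
  | "\xC3\xA9" -> true
  | _ -> false
```

Without a normalisation step applied to both the literal and the data (not shown here, and not something the compiler does), `matches_precomposed decomposed` is false even though both strings display as the same character.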

> Even old languages like Erlang have gone the UTF-8 way, and that
> includes program code.

For a very very very very long time it has been possible to write UTF-8
encoded literals in your OCaml sources, unnormalized or normalized
according to whatever normal form your editor produces; you just had to
drop the idea of using Latin-1 identifiers, which have been deprecated
since 4.01 anyway.

As for being able to write Unicode *identifiers* in the language, I'm
actually quite glad OCaml doesn't have that: there are both too many
arrow characters in Unicode and too many unreasonable programmers out
there.

> Bytes and strings have nothing in common, but str.[4] is still relevant for
> UTF-8 strings.

Direct indexing is rarely relevant in Unicode, as you usually want those
indices to correspond to user-perceived characters (e.g. to align things
in text formatting), and a user-perceived character may be written as a
sequence of Unicode scalar values… or not (even in normal forms, since
an arbitrary number of combining characters can be applied to a base
character). The Unicode segmentation algorithm allows you to find these
boundaries; simple indexing doesn't, and is mostly worthless in Unicode
processing.
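As a rough illustration, the sketch below compares three different "lengths" of the same text. It assumes the input is valid UTF-8, and `scalar_count` is a hypothetical helper written for this example, not a standard library function.

```ocaml
(* Count Unicode scalar values in a string assumed to be valid UTF-8,
   by counting the bytes that are not continuation bytes (10xxxxxx). *)
let scalar_count (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

(* "é" spelled as e + U+0301 COMBINING ACUTE ACCENT, UTF-8 encoded. *)
let s = "e\xCC\x81"

(* Number of bytes: what String.length and s.[i] work with. *)
let byte_length = String.length s

(* Number of scalar values: base letter plus combining accent. *)
let scalars = scalar_count s

(* Indexing returns a byte, here the first half of the encoded accent. *)
let second_byte = s.[1]
```

Here the string is three bytes and two scalar values, yet a reader sees a single character. Neither count gives that third number; finding it requires an implementation of the segmentation rules, which is not shown.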

Best,

Daniel