Date: Tue, 8 Jul 2014 20:24:09 +0100
From: Daniel Bünzli
To: mattiasw@gmail.com
Cc: caml-list@inria.fr
Subject: Re: [Caml-list] Immutable strings

On Tuesday, 8 July 2014 at 19:15, mattiasw@gmail.com wrote:

> My two cents:
>
> To me it seems very strange to introduce a new string type and not make it
> UTF-8 from start.

No new string type was introduced. A bytes type was introduced.

> ocaml will be that last language that doesn't have standardize unicode
> support.

What do you mean by standardized Unicode support in the language *exactly*?

I'd be genuinely interested in knowing the actual level of support for
Unicode in these languages, beyond saying "our string is a UTF-X encoded
sequence of scalar values". For example, do these other languages perform
Unicode normalisation on string literals and patterns (and on identifiers,
if they choose that craze)? That would be absolutely necessary for any kind
of real-world processing of Unicode strings, but there is more than one
normalisation form, and the one you want depends on the context. Do they
have a notation to indicate which form the literal or pattern should be in?

> Even old languages like Erlang has gone the UTF-8 way, and that
> includes program code.

For a very very very very long time it has been possible to write UTF-8
encoded literals in your OCaml sources, unnormalized or normalized according
to whatever normal form your editor uses; you just had to drop the idea of
using latin1 identifiers, which have been deprecated since 4.01 anyway.

As for being able to write Unicode *identifiers* in the language, I'm
actually quite glad OCaml doesn't have that: there are both too many arrow
characters to use in Unicode and too many unreasonable programmers out there.

> Bytes and strings have nothing in common, but str.[4] is still relevant for
> UTF-8 strings.

Direct indexing is rarely relevant in Unicode, as you usually want those
indexes to correspond to user-perceived characters (e.g. to align things in
text formatting), and a user-perceived character may be written as a
sequence of Unicode scalar values… or not (even in normal forms, since an
arbitrary number of combining characters can be applied to a base
character). The Unicode segmentation algorithm allows you to find these
boundaries; simple indexing doesn't, and is mostly worthless in Unicode
processing.

Best,

Daniel
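
P.S. To make the indexing and normalisation points concrete, here is a
minimal sketch using only the standard library. The helper name
utf_8_scalar_count is made up for this illustration, and it assumes its
argument is valid UTF-8 (it does no validation). It counts scalar values
only; finding user-perceived character boundaries would need the Unicode
segmentation algorithm, which this sketch does not implement.

```ocaml
(* Hypothetical helper: count Unicode scalar values in a string that is
   ASSUMED to be valid UTF-8. Every byte that is not a continuation byte
   (bit pattern 10xxxxxx) starts exactly one scalar value. *)
let utf_8_scalar_count s =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

(* "é" as a single precomposed scalar value, U+00E9. *)
let precomposed = "\xC3\xA9"

(* "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT. *)
let decomposed = "e\xCC\x81"

let () =
  Printf.printf "precomposed: %d bytes, %d scalar value(s)\n"
    (String.length precomposed) (utf_8_scalar_count precomposed);
  Printf.printf "decomposed:  %d bytes, %d scalar value(s)\n"
    (String.length decomposed) (utf_8_scalar_count decomposed);
  (* Same user-perceived character, but the byte sequences differ, so a
     literal or pattern in one form does not match the other. *)
  Printf.printf "byte-wise equal: %b\n" (precomposed = decomposed);
  (* Byte indexing lands inside the encoding of the combining accent. *)
  Printf.printf "decomposed.[1] = 0x%02X\n" (Char.code decomposed.[1])
```

Both strings show the same single character to a user, yet one is 2 bytes
and 1 scalar value while the other is 3 bytes and 2 scalar values, and
neither byte indexing nor scalar-value indexing gives you the "first
character" of the decomposed form.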