From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Original-To: caml-list@sympa.inria.fr Delivered-To: caml-list@sympa.inria.fr Received: from mail3-relais-sop.national.inria.fr (mail3-relais-sop.national.inria.fr [192.134.164.104]) by sympa.inria.fr (Postfix) with ESMTPS id C1D7E7FA13 for ; Wed, 9 Jul 2014 16:15:38 +0200 (CEST) Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of daniel.buenzli@erratique.ch) identity=pra; client-ip=74.55.86.74; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="daniel.buenzli@erratique.ch"; x-sender="daniel.buenzli@erratique.ch"; x-conformance=sidf_compatible Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of daniel.buenzli@erratique.ch) identity=mailfrom; client-ip=74.55.86.74; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="daniel.buenzli@erratique.ch"; x-sender="daniel.buenzli@erratique.ch"; x-conformance=sidf_compatible Received-SPF: None (mail3-smtp-sop.national.inria.fr: no sender authenticity information available from domain of postmaster@smtp.webfaction.com) identity=helo; client-ip=74.55.86.74; receiver=mail3-smtp-sop.national.inria.fr; envelope-from="daniel.buenzli@erratique.ch"; x-sender="postmaster@smtp.webfaction.com"; x-conformance=sidf_compatible X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ApUBAElNvVNKN1ZKlWdsb2JhbABZg2CCJ4EivC2HQQGBJQ8BAQEBBw0JCRIqhAMBAQQBI2YLGgImAgIhJhAZCIgmAwkIDa5pkm4Nhn8XgSyIToMegXc6FoJhNoEWBZh2g0iFLxiGaYlY X-IPAS-Result: ApUBAElNvVNKN1ZKlWdsb2JhbABZg2CCJ4EivC2HQQGBJQ8BAQEBBw0JCRIqhAMBAQQBI2YLGgImAgIhJhAZCIgmAwkIDa5pkm4Nhn8XgSyIToMegXc6FoJhNoEWBZh2g0iFLxiGaYlY X-IronPort-AV: E=Sophos;i="5.01,631,1400018400"; d="scan'208";a="70705846" Received: from mail6.webfaction.com (HELO smtp.webfaction.com) ([74.55.86.74]) by mail3-smtp-sop.national.inria.fr with ESMTP; 09 Jul 2014 16:15:37 +0200 Received: from [172.17.152.221] (global-1-26.nat.csx.cam.ac.uk [131.111.184.26]) by smtp.webfaction.com (Postfix) with ESMTP id D18E022C883D for ; Wed, 9 Jul 2014 14:15:34 +0000 (UTC) Date: Wed, 9 Jul 2014 15:15:33 +0100 From: =?utf-8?Q?Daniel_B=C3=BCnzli?= To: caml-list@inria.fr Message-ID: In-Reply-To: References: X-Mailer: sparrow 1.6.4 (build 1178) MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Subject: Re: [Caml-list] Immutable strings Le mardi, 8 juillet 2014 =C3=A0 19:15, mattiasw@gmail.com a =C3=A9crit : > ocaml will be that last language that doesn't have standardize unicode > support. Even old languages like Erlang has gone the UTF-8 way, and that > includes program code. For the fun I just had a look what python does.=20=20 So in python basically they have a Unicode string which is a string made of= Unicode *code points*. Fail, end of discussion. Should have been: *scalar = values* (for those who don't understand why, I suggest reading my minimal U= nicode introduction [1]). (both in 2 and 3, apparently 2 used to be messier for reason I didn't bothe= r to understand, they seem to be highly confused) Sample code. U+D800 is the first surrogate, i.e. something you should never= see in concrete Unicode textual processing, only in UTF-16 encoded bytes a= nd paired with an appropriate low surrogate. Python2: >>> u'\uD800'.encode('utf-8') '\xed\xa0\x80' Congratulations, you just produced an invalid UTF-8 sequence (serialized a = surrogate).=20=20 Python3 is a *little* better with *UTF-8* (but wait=E2=80=A6) encoding stuff >>> "\uD800".encode('utf-8') Traceback (most recent call last): File "", line 1, in UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in positi= on 0: surrogates not allowed So now let's try UTF-16: >>> "\uD800".encode("utf-16") b'\xff\xfe\x00\xd8' Congratulations you just produced an invalid UTF-16 sequence hi-surrogate w= ithout a corresponding low surrogate (which together would define an Unicod= e scalar value). Why on earth do they allow to represent surrogates *at all* in their Unicod= e text data structure ? Basically they don't understand Unicode.=20=20 The old camel should not be ashamed of its *outsanding* (absolutely) unicod= e support =E2=80=94 this is not to say that nothing can be improved, I do h= ave some proposal in the works =E2=80=94 but the situation is not bad eithe= r. Best, Daniel P.S. Skimming through these articles about python unicode strings I gather = why people find unicode hard, there seem to be a high level of both technic= al and conceptual confusion. Again have a read at [1] if you'd like to clea= r (I hope) your mind about these things. [1] http://erratique.ch/software/uucp/doc/Uucp.html#uminimal