caml-list - the Caml user's mailing list
From: "Daniel Bünzli" <daniel.buenzli@erratique.ch>
To: caml-list@inria.fr
Subject: Re: [Caml-list] Immutable strings
Date: Wed, 9 Jul 2014 15:15:33 +0100
Message-ID: <C8E64BE53B6D4027B43B29260AC28C5D@erratique.ch>
In-Reply-To: <sympa.1404842907.21063.651@inria.fr>

On Tuesday, 8 July 2014 at 19:15, mattiasw@gmail.com wrote:
> OCaml will be the last language that doesn't have standardized Unicode
> support. Even old languages like Erlang have gone the UTF-8 way, and that
> includes program code.

For fun, I just had a look at what Python does.

So in Python, basically, they have a Unicode string type which is a sequence of Unicode *code points*. Fail, end of discussion. It should have been a sequence of *scalar values*, i.e. code points excluding the surrogates U+D800 to U+DFFF (for those who don't understand why, I suggest reading my minimal Unicode introduction [1]).

(This holds for both Python 2 and 3. Apparently 2 used to be messier, for reasons I didn't bother to understand; they seem to be highly confused.)
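
To make the distinction concrete, here is a minimal sketch of the two predicates. The function and constant names are mine, chosen for illustration; they are not part of any Python library.

```python
# Illustrative names only; none of this is a standard library API.
MAX_CODE_POINT = 0x10FFFF   # last Unicode code point
SURROGATE_FIRST = 0xD800    # first high surrogate
SURROGATE_LAST = 0xDFFF     # last low surrogate


def is_code_point(cp: int) -> bool:
    """True for any integer in the Unicode code space U+0000..U+10FFFF."""
    return 0 <= cp <= MAX_CODE_POINT


def is_scalar_value(cp: int) -> bool:
    """True for code points that are not surrogates."""
    return is_code_point(cp) and not (SURROGATE_FIRST <= cp <= SURROGATE_LAST)
```

A text data structure built on `is_scalar_value` rather than `is_code_point` simply cannot hold U+D800.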

Sample code. U+D800 is the first surrogate code point (a high surrogate), i.e. something you should never see in concrete Unicode text processing, only in UTF-16 encoded bytes and paired with an appropriate low surrogate.

Python 2:

>>> u'\uD800'.encode('utf-8')
'\xed\xa0\x80'

Congratulations, you just produced an invalid UTF-8 sequence (serialized a surrogate).  
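
One way to convince yourself that these three bytes are not valid UTF-8 is to feed them to a strict decoder. The sketch below assumes Python 3, whose `utf-8` codec is strict by default; `is_valid_utf8` is a hypothetical helper name of mine.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under the default strict handler."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The bytes produced by the Python 2 session above (U+D800 serialized as is).
serialized_surrogate = b"\xed\xa0\x80"

# For comparison: U+D7FF, the last scalar value before the surrogate range.
last_before_surrogates = b"\xed\x9f\xbf"
```

Here `is_valid_utf8(serialized_surrogate)` is false while `is_valid_utf8(last_before_surrogates)` is true.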

Python 3 is a *little* better when encoding to *UTF-8* (but wait…):

>>> "\uD800".encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed


So now let's try UTF-16:

>>> "\uD800".encode("utf-16")
b'\xff\xfe\x00\xd8'


Congratulations, you just produced an invalid UTF-16 sequence: a high surrogate without a corresponding low surrogate (the two together would define a Unicode scalar value).
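
The pairing rule is easy to check mechanically. What follows is a minimal sketch under my own naming (`code_units_le` and `utf16_is_well_formed` are hypothetical helpers, not library functions): it splits little-endian bytes into 16-bit code units and verifies that every high surrogate is immediately followed by a low surrogate.

```python
import struct


def code_units_le(data: bytes) -> list:
    """Split little-endian bytes into 16-bit code units (length must be even)."""
    count = len(data) // 2
    return list(struct.unpack("<%dH" % count, data))


def utf16_is_well_formed(units) -> bool:
    """True if every surrogate code unit is part of a high/low pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True


# The bytes from the session above: a byte order mark, then a lone U+D800.
units = code_units_le(b"\xff\xfe\x00\xd8")
```

With these definitions `units` is `[0xFEFF, 0xD800]` and `utf16_is_well_formed(units)` is false.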

Why on earth do they allow surrogates to be represented *at all* in their Unicode text data structure? Basically, they don't understand Unicode.

The old camel should not be ashamed of its *outstanding* (absolutely) Unicode support. This is not to say that nothing can be improved, and I do have some proposals in the works, but the situation is not bad either.

Best,

Daniel

P.S. Skimming through these articles about Python Unicode strings, I can see why people find Unicode hard: there seems to be a high level of both technical and conceptual confusion. Again, have a read of [1] if you'd like to clear (I hope) your mind about these things.


[1] http://erratique.ch/software/uucp/doc/Uucp.html#uminimal




