From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <7359f04905081908294942438a@mail.gmail.com>
Date: Fri, 19 Aug 2005 08:29:39 -0700
From: Rob Pike <robpike@gmail.com>
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] plan9 and the Unicode Consortium definitions
In-Reply-To: <bcba51a05081907514df66d76@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <bcba51a05081907514df66d76@mail.gmail.com>
Topicbox-Message-UUID: 796be5f2-ead0-11e9-9d60-3106f5b1d025

The addition of surrogates was a serious error of judgement for Unicode,
in my opinion.  As I understand it, the original idea for Unicode was to
have a useful 16-bit subset of the 32-bit ISO 10646 standard, yet now
we see Unicode growing until it no longer fits in 16 bits, in order to
include some politically expedient characters.  This is unwise.  It's a
fact of life, but it's unwise.

As far as Plan 9 is concerned, it shouldn't be too hard to cope with
surrogates but the solution will be to bump Rune to unsigned int from
unsigned short.  It's not worth doing until everything else, for instance
Java, has made a similar jump.  I don't see a lot of pressure to pull
the systems in line with the standard.  Everyone is annoyed.
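
A minimal sketch of what the wider Rune buys, assuming a 32-bit Rune
typedef and <stdint.h> types rather than the Plan 9 headers.  The names
surrtorune and BadRune are made up for illustration and are not part of
any Plan 9 library.  It folds a UTF-16 surrogate pair into the single
code point it stands for:

```c
#include <stdint.h>

typedef uint32_t Rune;	/* hypothetical widened Rune (assumption) */

enum {
	SurrHi  = 0xD800,	/* first high (leading) surrogate */
	SurrLo  = 0xDC00,	/* first low (trailing) surrogate */
	SurrEnd = 0xDFFF,	/* last low surrogate */
	BadRune = 0xFFFD	/* illustrative error value (assumption) */
};

/*
 * Combine a UTF-16 surrogate pair into one code point.
 * Returns BadRune if hi and lo do not form a valid pair.
 */
Rune
surrtorune(uint16_t hi, uint16_t lo)
{
	if(hi < SurrHi || hi >= SurrLo)
		return BadRune;
	if(lo < SurrLo || lo > SurrEnd)
		return BadRune;
	/* 10 bits from each half, offset past the 16-bit range */
	return 0x10000 + (((Rune)(hi - SurrHi) << 10) | (Rune)(lo - SurrLo));
}
```

For example, surrtorune(0xD834, 0xDD1E) yields 0x1D11E, a value that
cannot be stored in an unsigned short.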

Plan 9's libraries provide a modest, convenient subset of the standard.
They could use updating to (the non-surrogate part of) Unicode 3.0.
It's not hard; I've seen code to auto-generate the appropriate tables from
the Unicode data set.  I'll see about digging it up.  If you need the
full monstrosity, though, you may need to accept something like ICU.
It's instructive to consider the magnitude of software growth that will
entail.
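
To give a feel for the table generation, here is a rough sketch, not the
code mentioned above.  It assumes the data set is the UnicodeData.txt
file of the Unicode Character Database, whose lines are ';'-separated
with the code point in field 0 and the simple lowercase mapping in
field 13.  The function names field and lowermap are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Return a pointer to the start of field n (0-based) of a
 * ';'-separated line, or NULL if the line has fewer fields.
 */
static const char*
field(const char *line, int n)
{
	while(n-- > 0){
		line = strchr(line, ';');
		if(line == NULL)
			return NULL;
		line++;
	}
	return line;
}

/*
 * Parse one UnicodeData.txt line.  Stores the code point in *c
 * and its simple lowercase mapping in *lower.  Returns 1 if the
 * line carries a lowercase mapping, 0 otherwise.
 */
int
lowermap(const char *line, unsigned long *c, unsigned long *lower)
{
	const char *f;
	char *end;

	*c = strtoul(line, &end, 16);
	if(end == line || *end != ';')
		return 0;
	f = field(line, 13);
	if(f == NULL)
		return 0;
	*lower = strtoul(f, &end, 16);
	if(end == f)	/* empty field: no mapping */
		return 0;
	return 1;
}
```

A generator would call lowermap on each line and print the pairs it
finds as a C array; the uppercase and titlecase mappings (fields 12 and
14) go the same way.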

All that aside, Plan 9 doesn't do some things it should.  For instance,
it should canonicalize all characters to separate out the diacritics and
merge them on display.  At the moment, it doesn't handle diacritics at
all.
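
As a sketch of what separating out the diacritics could look like,
assume a 32-bit Rune typedef and a tiny hand-written table; a real one
would be generated from the Unicode data.  The names Decomp, decomptab
and decompose are hypothetical, and the display side (merging the mark
back onto its base) is not shown.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Rune;	/* hypothetical widened Rune (assumption) */

typedef struct Decomp Decomp;
struct Decomp {
	Rune	c;	/* precomposed character */
	Rune	base;	/* base letter */
	Rune	mark;	/* combining diacritic */
};

/* a few hand-picked entries, for illustration only */
static const Decomp decomptab[] = {
	{ 0x00C9, 0x0045, 0x0301 },	/* E with acute */
	{ 0x00E9, 0x0065, 0x0301 },	/* e with acute */
	{ 0x00F1, 0x006E, 0x0303 },	/* n with tilde */
	{ 0x00FC, 0x0075, 0x0308 },	/* u with diaeresis */
};

/*
 * Write the decomposition of c into out and return the number
 * of runes written: 2 if c is in the table, else 1 (c itself).
 */
int
decompose(Rune c, Rune out[2])
{
	size_t i;

	for(i = 0; i < sizeof decomptab / sizeof decomptab[0]; i++){
		if(decomptab[i].c == c){
			out[0] = decomptab[i].base;
			out[1] = decomptab[i].mark;
			return 2;
		}
	}
	out[0] = c;
	return 1;
}
```

With this table, decompose(0x00E9, out) returns 2 and leaves 0x0065
followed by 0x0301 in out; a plain 'A' comes back unchanged.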

-rob