From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <7359f04905081908294942438a@mail.gmail.com>
Date: Fri, 19 Aug 2005 08:29:39 -0700
From: Rob Pike
To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu>
Subject: Re: [9fans] plan9 and the Unicode Consortium definitions
In-Reply-To:
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References:
Topicbox-Message-UUID: 796be5f2-ead0-11e9-9d60-3106f5b1d025

The addition of surrogates was a serious error of judgement for Unicode, in my opinion. As I understand it, the original idea for Unicode was to have a useful 16-bit subset of the 32-bit ISO 10646 standard, yet now we see Unicode growing until it no longer fits in 16 bits, in order to include some politically expedient characters. This is unwise. It's a fact of life, but it's unwise.

As far as Plan 9 is concerned, it shouldn't be too hard to cope with surrogates, but the solution will be to bump Rune to unsigned int from unsigned short. It's not worth doing until everything else, for instance Java, has made a similar jump. I don't see a lot of pressure to pull the systems in line with the standard. Everyone is annoyed.

Plan 9's libraries provide a modest, convenient subset of the standard. They could use updating to (the non-surrogate part of) Unicode 3.0. It's not hard; I've seen code to auto-generate the appropriate tables from the Unicode data set. I'll see about digging it up.

If you need the full monstrosity, though, you may need to accept something like ICU. It's instructive to compare the magnitude of software growth that will encompass.

All that aside, Plan 9 doesn't do some things it should. For instance, it should canonicalize all characters to separate out the diacritics and merge them on display. At the moment, it doesn't handle diacritics at all.

-rob
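
To illustrate the Rune width point above, here is a minimal sketch in C of why surrogates push a 16-bit Rune to 32 bits: a pair of 16-bit code units combines into one code point above 0xFFFF, which an unsigned short cannot hold. The type and function names (Rune16, Rune32, surrogatepair, splitrune) are hypothetical and are not taken from Plan 9's libc; only the numeric ranges come from the UTF-16 definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; this is not Plan 9's libc. */
typedef uint16_t Rune16;	/* a 16-bit Rune, as the message describes */
typedef uint32_t Rune32;	/* the widened Rune the message proposes */

enum {
	SurHiMin = 0xD800, SurHiMax = 0xDBFF,	/* high (leading) surrogates */
	SurLoMin = 0xDC00, SurLoMax = 0xDFFF,	/* low (trailing) surrogates */
	SurBase  = 0x10000,	/* first code point that needs a pair */
	RuneMax  = 0x10FFFF,	/* last code point reachable by a pair */
	RuneErr  = 0xFFFD	/* replacement character, returned on a bad pair */
};

/*
 * Combine a high/low surrogate pair into a single code point.
 * Returns RuneErr if the two units do not form a valid pair.
 */
Rune32
surrogatepair(Rune16 hi, Rune16 lo)
{
	if(hi < SurHiMin || hi > SurHiMax || lo < SurLoMin || lo > SurLoMax)
		return RuneErr;
	return SurBase + (((Rune32)(hi - SurHiMin) << 10) | (Rune32)(lo - SurLoMin));
}

/*
 * Split a code point in the range SurBase..RuneMax into its
 * surrogate pair. The caller must not pass a value outside that range.
 */
void
splitrune(Rune32 r, Rune16 *hi, Rune16 *lo)
{
	assert(r >= SurBase && r <= RuneMax);
	r -= SurBase;
	*hi = (Rune16)(SurHiMin + (r >> 10));	/* top 10 bits */
	*lo = (Rune16)(SurLoMin + (r & 0x3FF));	/* bottom 10 bits */
}
```

For example, the pair 0xD834, 0xDD1E combines to the code point 0x1D11E, which does not fit in a Rune16. With a 32-bit Rune the pair can be folded into one value at the decoding boundary, so the rest of the code keeps seeing one Rune per character.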