From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <1556.63.165.50.175.1090272954.squirrel@wish.cooper.edu>
In-Reply-To: <007b01c46dd6$89a0c420$8efa7d50@SOMA>
References: <6e35c06204071810312daa31a9@mail.gmail.com><000701c46cf6$814c4370$92ec7d50@SOMA><7359f049040718120571c93b25@mail.gmail.com><1485.63.165.50.175.1090270909.squirrel@wish.cooper.edu>
	<007b01c46dd6$89a0c420$8efa7d50@SOMA>
Date: Mon, 19 Jul 2004 17:35:54 -0400
Subject: Re: [9fans] UTF-8 criticism?
From: "Joel Salomon" <salomo3@cooper.edu>
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
User-Agent: SquirrelMail/1.4.2
MIME-Version: 1.0
Content-Type: text/plain;charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: c33ee588-eacd-11e9-9e20-41e7f4b1d025

>> Would moving to 32 bit signed (and only 0 -- 2^21 allowed, plus -1 for
>> EOF) as in the more recent revisions of Unicode take care of the
>> surrogates problem?
>
> this has nothing to do with EOF.

Sorry if I was unclear; let me try again. Would moving to 32-bit signed
Runes (with only 0 -- 2^21 allowed), so that every character now reached
through surrogates is directly accessible, solve the problem?

Yes, this does open a new can of worms, but how much more difficult would
it be to move from 16-bit Runes to 21/32-bit-wide Runes than it was to
move from 7-bit ASCII to Unicode in the first place?

As an aside, the way I understand the Unicode standard (4.0), 21-bit
characters can be encoded in 1, 2, 3, or 4 bytes in UTF-8, and if text is
represented internally as int32, some out-of-band information (such as
EOF, or bad UTF with its original bytes preserved) can be carried along.
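
To make the aside concrete, here is a rough sketch in C of what I have in
mind. It is not the existing Plan 9 code: Rune32, RuneEOF, BADBYTE and
runetoutf8 are names made up for this example, and using the negative
values -2 .. -257 to carry a bad input byte is only one possible
convention.

```c
#include <stdint.h>

typedef int32_t Rune32;		/* hypothetical 32-bit signed rune */

enum {
	RuneEOF = -1,		/* out-of-band: end of input */
	RuneMax = 0x10FFFF	/* highest code point in Unicode 4.0 */
};

/*
 * Out-of-band marker for a byte that was not valid UTF-8.
 * The original byte is kept, mapped into -2 .. -257 (assumed convention).
 */
#define BADBYTE(b)	((Rune32)(-2 - (int)(unsigned char)(b)))
#define ISBAD(r)	((r) <= -2 && (r) >= -257)
#define BADVALUE(r)	((unsigned char)(-2 - (r)))

/*
 * Encode r as UTF-8 into buf, which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if r is out-of-band,
 * above RuneMax, or a surrogate code point (not legal in UTF-8).
 */
int
runetoutf8(unsigned char *buf, Rune32 r)
{
	if(r < 0 || r > RuneMax || (r >= 0xD800 && r <= 0xDFFF))
		return 0;
	if(r < 0x80){			/* 7 bits: 1 byte */
		buf[0] = r;
		return 1;
	}
	if(r < 0x800){			/* 11 bits: 2 bytes */
		buf[0] = 0xC0 | (r >> 6);
		buf[1] = 0x80 | (r & 0x3F);
		return 2;
	}
	if(r < 0x10000){		/* 16 bits: 3 bytes */
		buf[0] = 0xE0 | (r >> 12);
		buf[1] = 0x80 | ((r >> 6) & 0x3F);
		buf[2] = 0x80 | (r & 0x3F);
		return 3;
	}
	/* up to 21 bits: 4 bytes */
	buf[0] = 0xF0 | (r >> 18);
	buf[1] = 0x80 | ((r >> 12) & 0x3F);
	buf[2] = 0x80 | ((r >> 6) & 0x3F);
	buf[3] = 0x80 | (r & 0x3F);
	return 4;
}
```

For example, runetoutf8(buf, 0x20AC) stores the bytes E2 82 AC and returns
3, while runetoutf8(buf, RuneEOF) returns 0 because the value is
out-of-band. A reader that hits a stray 0xFF could hand back BADBYTE(0xFF),
and the caller can recover the byte later with BADVALUE.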

--Joel