From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <1556.63.165.50.175.1090272954.squirrel@wish.cooper.edu>
In-Reply-To: <007b01c46dd6$89a0c420$8efa7d50@SOMA>
References: <6e35c06204071810312daa31a9@mail.gmail.com>
 <000701c46cf6$814c4370$92ec7d50@SOMA>
 <7359f049040718120571c93b25@mail.gmail.com>
 <1485.63.165.50.175.1090270909.squirrel@wish.cooper.edu>
 <007b01c46dd6$89a0c420$8efa7d50@SOMA>
Date: Mon, 19 Jul 2004 17:35:54 -0400
Subject: Re: [9fans] UTF-8 criticism?
From: "Joel Salomon"
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
User-Agent: SquirrelMail/1.4.2
MIME-Version: 1.0
Content-Type: text/plain;charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: c33ee588-eacd-11e9-9e20-41e7f4b1d025

>> Would moving to 32 bit signed (and only 0 -- 2^21 allowed, plus -1 for
>> EOF) as in the more recent revisions of Unicode take care of the
>> surrogates problem?
>
> this has nothing to do with EOF.

Sorry if I was unclear - let me try again.

Would moving to 32 bit signed Runes (and only 0 -- 2^21 allowed), thus
including all surrogates in the directly accessible character set, solve
the problem?

Yes, this does open a new can of worms, but how much more difficult would
it be to move from 16 bit Runes to 21/32 bit wide Runes than it was to
move from 7 bit ASCII to Unicode in the first place?

As an aside, the way I've understood the Unicode standard (4.0), 21 bit
characters can be encoded in 1, 2, 3, or 4 bytes in UTF-8, and if text is
internally represented by int32, some out-of-band information (like EOF,
or bad UTF, while preserving the original bytes) can be carried along.

--Joel
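
To make the aside concrete, here is a rough sketch of what I have in
mind. It is not taken from the Plan 9 libraries: Rune32, rune32toutf,
RuneEOF and the BADBYTE macros are names made up for illustration, and
mapping bad bytes onto -2 .. -257 is just one possible convention. Note
that the highest code point Unicode actually assigns is 0x10FFFF, a bit
below 2^21.

```c
#include <stdint.h>

typedef int32_t Rune32;         /* hypothetical 32 bit signed rune */

enum {
	RuneMax = 0x10FFFF,     /* highest Unicode code point */
	RuneEOF = -1            /* out-of-band: end of input */
};

/*
 * Out-of-band: an input byte that was not valid UTF-8.
 * Assumed convention: byte b is carried as -2 - b, so the
 * original byte can be written back out unchanged.
 */
#define BADBYTE(b)   ((Rune32)(-2 - (int)(unsigned char)(b)))
#define ISBADBYTE(r) ((r) <= -2 && (r) >= -257)
#define BADVALUE(r)  ((unsigned char)(-2 - (r)))

/*
 * Encode r as UTF-8 into buf, which must hold at least 4 bytes.
 * Returns the number of bytes written (1 to 4), or 0 if r is not
 * an encodable character: negative (out-of-band), above RuneMax,
 * or a surrogate code point.
 */
int
rune32toutf(unsigned char *buf, Rune32 r)
{
	if(r < 0 || r > RuneMax || (r >= 0xD800 && r <= 0xDFFF))
		return 0;
	if(r < 0x80){                   /* 7 bits: 0xxxxxxx */
		buf[0] = r;
		return 1;
	}
	if(r < 0x800){                  /* 11 bits: 110xxxxx 10xxxxxx */
		buf[0] = 0xC0 | (r >> 6);
		buf[1] = 0x80 | (r & 0x3F);
		return 2;
	}
	if(r < 0x10000){                /* 16 bits: 1110xxxx + 2 tail bytes */
		buf[0] = 0xE0 | (r >> 12);
		buf[1] = 0x80 | ((r >> 6) & 0x3F);
		buf[2] = 0x80 | (r & 0x3F);
		return 3;
	}
	/* 21 bits: 11110xxx + 3 tail bytes */
	buf[0] = 0xF0 | (r >> 18);
	buf[1] = 0x80 | ((r >> 12) & 0x3F);
	buf[2] = 0x80 | ((r >> 6) & 0x3F);
	buf[3] = 0x80 | (r & 0x3F);
	return 4;
}
```

Since every real character is non-negative, a reader could hand back
RuneEOF or BADBYTE(b) in the same int32 stream, and a writer would emit
BADVALUE(r) for the latter instead of calling rune32toutf.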