From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <1583.63.165.50.175.1090274172.squirrel@wish.cooper.edu>
In-Reply-To: <1556.63.165.50.175.1090272954.squirrel@wish.cooper.edu>
References: <6e35c06204071810312daa31a9@mail.gmail.com>
	<000701c46cf6$814c4370$92ec7d50@SOMA>
	<7359f049040718120571c93b25@mail.gmail.com>
	<1485.63.165.50.175.1090270909.squirrel@wish.cooper.edu>
	<007b01c46dd6$89a0c420$8efa7d50@SOMA>
	<1556.63.165.50.175.1090272954.squirrel@wish.cooper.edu>
Date: Mon, 19 Jul 2004 17:56:12 -0400
Subject: Re: [9fans] UTF-8 criticism?
From: "Joel Salomon" <salomo3@cooper.edu>
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
User-Agent: SquirrelMail/1.4.2
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: c34ae63a-eacd-11e9-9e20-41e7f4b1d025

Joel Salomon said:
> As an aside, the way I've understood the Unicode standard (4.0), 21-bit
> characters can be encoded in 1, 2, 3, or 4 bytes in UTF-8, and if text is
> internally represented by int32, some out-of-band information (like EOF,
> or bad UTF (but preserving the original bytes)) can be carried along.
>

And here's where the out-of-band encoding might come in useful:

rog@vitanuova.com said:
> you do have to be a bit careful with utf-8, as many possible byte
> sequences map down to the same rune (error), so if you
> do your comparisons too early, you run the risk of inconsistency.
>
> for instance, you can exploit this (at least, i *think* this is the
> cause) to create a file that can never be removed on ken's fileserver:
<snip>

but if "error" becomes 0x80000000 | XX, where XX is the original (bad, or
out-of-place) byte, we never lose the ability to retrieve/delete the file.
This would be an extension to Unicode, possibly a dangerous one, but maybe
worth considering.
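
A minimal sketch of the idea in C, under assumptions of my own: the names
BADBYTE, decode, and encode are hypothetical, not anything from the Plan 9
libraries, and one bad byte is consumed per error value. Since each bad
byte keeps its own value, two different malformed names no longer collapse
to the same rune, and encoding gives the original bytes back.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker: high bit set means "raw byte, not a code point". */
#define BADBYTE 0x80000000u

/*
 * Decode one value from s[0..n-1] (n >= 1) into *out and return the
 * number of bytes consumed. Well-formed UTF-8 yields a code point;
 * anything else yields BADBYTE | byte and consumes exactly one byte.
 */
size_t
decode(const unsigned char *s, size_t n, uint32_t *out)
{
	/* smallest code point that legitimately needs len bytes */
	static const uint32_t min[] = {0, 0, 0x80, 0x800, 0x10000};
	uint32_t c = s[0];
	size_t len, i;

	if(c < 0x80){
		*out = c;
		return 1;
	}
	if((c & 0xE0) == 0xC0){ len = 2; c &= 0x1F; }
	else if((c & 0xF0) == 0xE0){ len = 3; c &= 0x0F; }
	else if((c & 0xF8) == 0xF0){ len = 4; c &= 0x07; }
	else
		goto bad;	/* stray continuation byte or 0xF8..0xFF */
	if(len > n)
		goto bad;	/* sequence truncated by end of input */
	for(i = 1; i < len; i++){
		if((s[i] & 0xC0) != 0x80)
			goto bad;
		c = c<<6 | (s[i] & 0x3F);
	}
	/* reject overlong forms, surrogates, and values past U+10FFFF */
	if(c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		goto bad;
	*out = c;
	return len;
bad:
	*out = BADBYTE | s[0];
	return 1;
}

/*
 * Encode v into buf (room for 4 bytes) and return the byte count.
 * A value carrying BADBYTE is written back as its original byte.
 */
size_t
encode(uint32_t v, unsigned char *buf)
{
	if(v & BADBYTE){
		buf[0] = v & 0xFF;
		return 1;
	}
	if(v < 0x80){
		buf[0] = v;
		return 1;
	}
	if(v < 0x800){
		buf[0] = 0xC0 | v>>6;
		buf[1] = 0x80 | (v & 0x3F);
		return 2;
	}
	if(v < 0x10000){
		buf[0] = 0xE0 | v>>12;
		buf[1] = 0x80 | (v>>6 & 0x3F);
		buf[2] = 0x80 | (v & 0x3F);
		return 3;
	}
	buf[0] = 0xF0 | v>>18;
	buf[1] = 0x80 | (v>>12 & 0x3F);
	buf[2] = 0x80 | (v>>6 & 0x3F);
	buf[3] = 0x80 | (v & 0x3F);
	return 4;
}
```

With this, the bytes 0xFF and 0xFE decode to two distinct values instead of
one shared error rune, so a comparison done after decoding still tells the
two names apart. The cost is that such values are not Unicode, which is the
"possibly dangerous" part.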

--Joel