From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <1583.63.165.50.175.1090274172.squirrel@wish.cooper.edu>
In-Reply-To: <1556.63.165.50.175.1090272954.squirrel@wish.cooper.edu>
References: <6e35c06204071810312daa31a9@mail.gmail.com>
    <000701c46cf6$814c4370$92ec7d50@SOMA>
    <7359f049040718120571c93b25@mail.gmail.com>
    <1485.63.165.50.175.1090270909.squirrel@wish.cooper.edu>
    <007b01c46dd6$89a0c420$8efa7d50@SOMA>
    <1556.63.165.50.175.1090272954.squirrel@wish.cooper.edu>
Date: Mon, 19 Jul 2004 17:56:12 -0400
Subject: Re: [9fans] UTF-8 criticism?
From: "Joel Salomon"
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
User-Agent: SquirrelMail/1.4.2
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: c34ae63a-eacd-11e9-9e20-41e7f4b1d025

Joel Salomon said:
> As an aside, the way I've understood the Unicode standard (4.0), 21 bit
> characters can be encoded in 1, 2, 3, or 4 bytes in UTF-8 and if text is
> internally represented by int32, some out-of-band information (like EOF,
> or bad UTF (but preserving the original bytes)) can be carried along.
>

And here's where the out-of-band encoding might come in useful:

rog@vitanuova.com said:
> you do have to be a bit careful with utf-8, as many possible byte
> sequences map down to the same rune (error), so if you
> do your comparisons too early, you run the risk of inconsistency.
>
> for instance, you can exploit this (at least, i *think* this is the
> cause) to create a file that can never be removed on ken's fileserver:

But if "error" becomes 0x80000000 | XX, where XX is the original (bad or
out-of-place) byte, then distinct bad bytes map to distinct runes and we
never lose the ability to retrieve or delete the file.

This would be an extension to Unicode, possibly a dangerous one, but maybe
worth considering.

--Joel
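
To make the proposal concrete, here is a rough sketch of how such a decoder might behave. It is only an illustration under stated assumptions, not how Plan 9 or ken's fileserver handles runes: the names BAD, decode, and encode are made up for this example, and it assumes the flag 0x80000000 is OR-ed with the offending byte. Python integers are unbounded, so the value that would be the sign bit of a C int32 is simply a large positive number here.

```python
# Hypothetical out-of-band flag: well above the 21-bit Unicode range,
# so it can never collide with a real code point.
BAD = 0x80000000


def decode(data):
    """Decode UTF-8 bytes into integer runes.

    Each byte that is not part of a well-formed sequence becomes
    BAD | byte instead of collapsing into one shared error rune.
    """
    runes = []
    i = 0
    while i < len(data):
        b = data[i]
        # Sequence length implied by the lead byte (0 = not a valid lead).
        if b < 0x80:
            n = 1
        elif 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("bad lead byte or truncated sequence")
            # The strict codec also rejects overlong forms, surrogates,
            # and values above U+10FFFF.
            runes.append(ord(chunk.decode("utf-8")))
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            runes.append(BAD | b)
            i += 1
    return runes


def encode(runes):
    """Inverse of decode: flagged runes give back their original byte."""
    out = bytearray()
    for r in runes:
        if r & BAD:
            out.append(r & 0xFF)
        else:
            out += chr(r).encode("utf-8")
    return bytes(out)


name = b"ok\xff\xc3\xa9\x80"   # a file name containing two stray bytes
runes = decode(name)
roundtrip = encode(runes)
```

Because 0xFF and 0x80 decode to different runes, two names that differ only in their bad bytes no longer compare equal, and encode(decode(name)) returns the original bytes, so the name can still be passed back to the file server.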