From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <6e35c06204071810312daa31a9@mail.gmail.com>
Date: Sun, 18 Jul 2004 10:31:11 -0700
From: Jack Johnson <knapjack@gmail.com>
To: 9fans <9fans@cse.psu.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [9fans] UTF-8 criticism?
Topicbox-Message-UUID: c29fec76-eacd-11e9-9e20-41e7f4b1d025

I've always appreciated Plan 9's native UTF-8 support, but while
searching for reasons for the crappy multilingual support in Squeak I
came across the following message.

I'm just wondering how Plan 9 deals with this, or if it really matters?

-Jack

------

UTF-8 is what is known as an output transformation. It is used to put
whatever is in memory into some other form that is more readily
digestible by other devices that expect 7-bit ASCII and its associated
zero-null-byte convention. The UTF-8 format specifies ways of storing
up to 4-byte characters without any nulls aligned on bytes.

UTF-8 is also more compact for the European languages, but it is very
lengthy for traditional Chinese, as most characters require 3 bytes
and some require 4.
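
The per-character byte counts are easy to check. What follows is a
minimal sketch, not part of the quoted message: the helper names
utf8_len and has_null_byte and the sample characters are made up for
illustration, and it relies only on Python's built-in UTF-8 codec.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the single character ch occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def has_null_byte(text: str) -> bool:
    """True if the UTF-8 form of text contains a zero byte."""
    return 0 in text.encode("utf-8")


# Sample characters (an arbitrary choice), written as escapes so the
# listing itself stays 7-bit clean.
SAMPLES = [
    ("LATIN CAPITAL LETTER A", "A"),
    ("LATIN SMALL LETTER E WITH ACUTE", "\u00e9"),
    ("CJK IDEOGRAPH U+4E2D", "\u4e2d"),
    ("CJK IDEOGRAPH U+20000", "\U00020000"),
]

for name, ch in SAMPLES:
    print(f"{name}: {utf8_len(ch)} byte(s)")
```

The first sample is plain ASCII, the second is typical of accented
European text, and the last two are CJK ideographs inside and outside
the 16-bit range. None of them produces a zero byte when encoded.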

The problem with UTF-8 is that it is non-indexable. If you have a
string of characters, you can't make an assumption about where the nth
character is. To find out, you have to do a linear search. That makes
string indexing O(n) instead of O(1), which is unacceptable. If you
were to sort a UTF-8 string for some reason, the bubble sort would
actually have a lower order of magnitude than the quicksort.
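
To make the indexing complaint concrete, here is a rough sketch of
what finding the nth character in a UTF-8 byte string involves. The
function name utf8_offset_of is hypothetical, and the code assumes the
input is already well-formed UTF-8; it does no validation.

```python
def utf8_offset_of(data: bytes, n: int) -> int:
    """Return the byte offset of the nth (0-based) character in data.

    Continuation bytes have the bit pattern 10xxxxxx, so any byte that
    does not match that pattern starts a new character. The scan has to
    walk from the beginning, which is the O(n) cost being described.
    """
    if n < 0:
        raise IndexError(n)
    seen = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # lead byte or plain ASCII
            seen += 1
            if seen == n:
                return offset
    raise IndexError(n)


# "a", e-acute, a CJK ideograph, "b": 1 + 2 + 3 + 1 bytes.
data = "a\u00e9\u4e2db".encode("utf-8")
print([utf8_offset_of(data, i) for i in range(4)])
```

If random access matters, one option is to decode once into a
fixed-width sequence, for instance list(data.decode("utf-8")), and
index that instead. That trades memory for O(1) lookups, which is
essentially the UCS-2-in-memory position argued for here.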

So UTF-8 is not a very good memory format for characters. NT uses
UCS-2 (the 2-octet character set) in its native encoding, and UTF-8
for a lot of transfers to disk and the network. I'm not so sure that
Be uses UTF-8 in memory. I think you'd actually find they use UCS-2.

So I don't think it'd be good for someone to go through the hassle of
implementing a UTF-8 set of string methods. I like the idea of
bringing Unicode into Squeak. But there's a lot more involved than
just adding 2-byte arrays.

For example, you will want to store method strings in UTF-8, because
they aren't allowed to carry characters larger than 7 bits. But you'd
have to make sure that they get transformed properly for other
purposes. You will have to provide alternate input/output routines for
files because you shouldn't store text files in UCS-2. There are many
considerations and I recommend that you read the standard, and all,
before going ahead and doing it.
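
As a rough illustration of the "alternate input/output routines"
idea, the sketch below keeps text in memory as a list of 16-bit code
units and converts to and from UTF-8 at the file boundary. The names
ucs2_to_utf8 and utf8_to_ucs2 are hypothetical, and the code assumes
strict UCS-2, so surrogates and anything above U+FFFF are rejected
rather than handled.

```python
def ucs2_to_utf8(units: list[int]) -> bytes:
    """Encode a list of UCS-2 code units as UTF-8 bytes for storage."""
    for u in units:
        if not 0 <= u <= 0xFFFF:
            raise ValueError(f"not a 16-bit code unit: {u!r}")
        if 0xD800 <= u <= 0xDFFF:
            raise ValueError(f"surrogate is not a UCS-2 character: {u:#06x}")
    return "".join(chr(u) for u in units).encode("utf-8")


def utf8_to_ucs2(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of UCS-2 code units."""
    units = [ord(ch) for ch in data.decode("utf-8")]
    for u in units:
        if u > 0xFFFF:
            raise ValueError(f"outside the UCS-2 range: U+{u:04X}")
    return units


# In memory: fixed width, so units[i] is an O(1) lookup.
units = [0x0048, 0x00E9, 0x4E2D]
on_disk = ucs2_to_utf8(units)
print(len(units) * 2, "bytes as UCS-2,", len(on_disk), "bytes as UTF-8")
```

A real implementation would also have to decide what to do with
characters outside the 16-bit range, which is one of the details the
standard covers.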