From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <6e35c06204071810312daa31a9@mail.gmail.com>
Date: Sun, 18 Jul 2004 10:31:11 -0700
From: Jack Johnson
To: 9fans <9fans@cse.psu.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: [9fans] UTF-8 criticism?
Topicbox-Message-UUID: c29fec76-eacd-11e9-9e20-41e7f4b1d025

I've always appreciated Plan 9's native UTF-8 support, but while searching for reasons for the crappy multilingual support in Squeak I came across the following message. I'm just wondering how Plan 9 deals with this, or if it really matters?

-Jack

------

UTF-8 is what is known as an output transformation. It is used to put whatever is in memory into some other form that is more readily digestible by other devices that expect 7-bit ASCII and its associated zero-null-byte convention. The UTF-8 format specifies ways of storing up to 4-byte characters without any nulls aligned on bytes. UTF-8 is also more compact for the European languages, but it is very lengthy for traditional Chinese, as all characters require 2 bytes and some characters inevitably require 3 bytes.

The problem with UTF-8 is that it is non-indexable. If you have a string of characters, you can't make an assumption about where the nth character is. To find out, you have to do a linear search. That makes string indexing O(n) instead of O(1), which is unacceptable. If you were to sort a UTF-8 string for some reason, the bubble sort would actually have a lower order of magnitude than the quicksort. So UTF-8 is not a very good memory format for characters.

NT uses UCS-2 (the 2-octet character set) in its native encoding, and UTF-8 for a lot of transfers to disk and the network. I'm not so sure that Be uses UTF-8 in memory. I think you'd actually find they use UCS-2. So I don't think it'd be good for someone to go through the hassle of implementing a UTF-8 set of string methods.

I like the idea of bringing Unicode into Squeak.
But there's a lot more involved than just adding 2-byte arrays. For example, you will want to store method strings in UTF-8, because they aren't allowed to carry characters larger than 7 bits. But you'd have to make sure that they get transformed properly for other purposes. You will have to provide alternate input/output routines for files because you shouldn't store text files in UCS-2. There are many considerations and I recommend that you read the standard, and all, before going ahead and doing it.
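
For anyone who wants to poke at the two claims in the quoted message (how many bytes a character takes in UTF-8, and what "non-indexable" means in practice), here is a rough Python sketch. The helper names `utf8_len` and `nth_char_offset` are made up for illustration; they are not taken from Plan 9, Squeak, or any library. Non-ASCII characters are written as escapes so the code itself stays 7-bit clean.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single code point occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def nth_char_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in UTF-8 encoded data.

    Scans from the start and counts every byte that is not a
    continuation byte (bit pattern 10xxxxxx), so the work grows
    linearly with n. This is the linear search the message refers to.
    """
    count = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # ASCII or lead byte: a new code point starts here
            count += 1
            if count == n:
                return offset
    raise IndexError("data holds fewer than n + 1 code points")


if __name__ == "__main__":
    # U+0041 'A', U+00E9 (e with acute), U+4E2D (a common CJK ideograph),
    # U+20000 (a CJK ideograph outside the 16-bit range)
    for ch in ["A", "\u00e9", "\u4e2d", "\U00020000"]:
        print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s)")

    data = "a\u00e9\u4e2d".encode("utf-8")
    print([nth_char_offset(data, i) for i in range(3)])
```

Run as a script, this prints the encoded length of each sample character and then the byte offsets of the three characters in a short mixed string. Decoding once into a fixed-width array of code points would give constant-time indexing at the cost of more memory, which is the trade-off the message is arguing about.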