9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] UTF-8 criticism?
@ 2004-07-18 17:31 Jack Johnson
  2004-07-18 18:27 ` Rob Pike
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jack Johnson @ 2004-07-18 17:31 UTC (permalink / raw)
  To: 9fans

I've always appreciated Plan 9's native UTF-8 support, but while
searching for reasons for the crappy multilingual support in Squeak I
came across the following message.

I'm just wondering how Plan 9 deals with this, or if it really matters?

-Jack

------

UTF-8 is what is known as an output transformation. It is used to put
whatever is in memory into some other form that is more readily
digestible by other devices that expect 7-bit ASCII and its associated
zero-null-byte convention. The UTF-8 format specifies ways of storing
up to 4-byte characters without any nulls aligned on bytes.
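
A minimal sketch of the length and no-null claims above, assuming Python's built-in "utf-8" codec as a stand-in for any conforming encoder. The helper names `utf8_len` and `has_zero_byte` are made up for illustration. It encodes code points at the boundary of each length class and checks that only U+0000 itself produces a zero byte.

```python
# Code points at the edges of each UTF-8 length class (surrogates excluded).
samples = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]


def utf8_len(cp: int) -> int:
    """Number of bytes the code point occupies when encoded as UTF-8."""
    return len(chr(cp).encode("utf-8"))


def has_zero_byte(cp: int) -> bool:
    """True if the UTF-8 encoding of the code point contains a 0x00 byte."""
    return 0 in chr(cp).encode("utf-8")


# Map each sample code point to its encoded length (1 to 4 bytes).
lengths = {cp: utf8_len(cp) for cp in samples}

for cp, n in lengths.items():
    print(f"U+{cp:04X}: {n} byte(s), zero byte present: {has_zero_byte(cp)}")
```

The only code point whose encoding contains a zero byte is U+0000, which encodes as the single byte 0x00; that is why NUL-terminated strings keep working.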

UTF-8 is also more compact for the European languages, but it is very
lengthy for traditional Chinese, as all characters require 2 bytes and
some characters inevitably require 3 bytes.
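
The size comparison can be checked directly. This is a rough sketch, again assuming Python's standard codecs; `sizes` is a hypothetical helper, and "utf-16-le" stands in for a 2-byte in-memory form. Note that with these codecs a common CJK ideograph such as U+4E2D comes out at 3 bytes in UTF-8 and an ideograph outside the Basic Multilingual Plane at 4, rather than the 2 and 3 stated above.

```python
def sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for a one-character string."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Escapes are used so the example does not depend on the file's own encoding.
examples = {
    "LATIN CAPITAL A": "A",
    "LATIN SMALL E WITH ACUTE": "\u00e9",
    "CJK IDEOGRAPH U+4E2D": "\u4e2d",
    "CJK IDEOGRAPH U+20000": "\U00020000",
}

for name, ch in examples.items():
    utf8, utf16 = sizes(ch)
    print(f"{name}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```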

The problem with UTF-8 is that it is non-indexable. If you have a
string of characters, you can't make an assumption about where the nth
character is. To find out, you have to do a linear search. That makes
string indexing O(n) instead of O(1), which is unacceptable. If you
were to sort a UTF-8 string for some reason, the bubble sort would
actually have a lower order of magnitude than the quicksort.
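
The linear search described above can be sketched as follows. This assumes well-formed UTF-8 input and is only an illustration; `byte_offset_of_char` and `char_at` are hypothetical names, not routines from any library. A byte starts a new character unless its top two bits are 10, so finding the nth character means counting start bytes from the beginning.

```python
def is_continuation(b: int) -> bool:
    """Continuation bytes in UTF-8 have the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


def byte_offset_of_char(data: bytes, n: int) -> int:
    """Return the byte offset where the nth (0-based) character starts.

    Walks the buffer from the start, so the cost grows with n.
    Assumes `data` is well-formed UTF-8.
    """
    seen = 0
    for offset, b in enumerate(data):
        if not is_continuation(b):
            if seen == n:
                return offset
            seen += 1
    raise IndexError("character index out of range")


def char_at(data: bytes, n: int) -> str:
    """Decode and return the nth character of a UTF-8 byte string."""
    start = byte_offset_of_char(data, n)
    end = start + 1
    while end < len(data) and is_continuation(data[end]):
        end += 1
    return data[start:end].decode("utf-8")


# "a", e-acute, a CJK ideograph, "z": encoded widths of 1, 2, 3 and 1 bytes.
data = "a\u00e9\u4e2dz".encode("utf-8")
offsets = [byte_offset_of_char(data, i) for i in range(4)]
print(offsets)
```

The scan is only needed for random access by character number; walking a string from one character to the next costs the same as in a fixed-width encoding.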

So UTF-8 is not a very good memory format for characters. NT uses
UCS-2 (the 2-octet character set) in its native encoding, and UTF-8
for a lot of transfers to disk and the network. I'm not so sure that
Be uses UTF-8 in memory. I think you'd actually find they use UCS-2.

So I don't think it'd be good for someone to go through the hassle of
implementing a UTF-8 set of string methods. I like the idea of
bringing Unicode into Squeak. But there's a lot more involved than
just adding 2-byte arrays.

For example, you will want to store method strings in UTF-8, because
they aren't allowed to carry characters larger than 7 bits. But you'd
have to make sure that they get transformed properly for other
purposes. You will have to provide alternate input/output routines for
files because you shouldn't store text files in UCS-2. There are many
considerations and I recommend that you read the standard, and all,
before going ahead and doing it.



Thread overview: 18+ messages
2004-07-18 17:31 [9fans] UTF-8 criticism? Jack Johnson
2004-07-18 18:27 ` Rob Pike
2004-07-18 18:39 ` boyd, rounin
2004-07-18 19:05   ` Rob Pike
2004-07-18 19:06     ` boyd, rounin
2004-07-19  9:00       ` Douglas A. Gwyn
2004-07-19 15:34         ` Skip Tavakkolian
2004-07-18 19:34     ` boyd, rounin
2004-07-19  7:40       ` Charles Forsyth
2004-07-19  8:39         ` Geoff Collyer
2004-07-19 21:01     ` Joel Salomon
2004-07-19 21:22       ` boyd, rounin
2004-07-19 21:35         ` Joel Salomon
2004-07-19 21:56           ` Joel Salomon
2004-07-19 21:42       ` andrey mirtchovski
2004-07-19 21:43         ` Tengwar " Joel Salomon
2004-07-20  8:32       ` Douglas A. Gwyn
2004-07-19 21:35 ` rog
