From: Jack Johnson <knapjack@gmail.com>
To: 9fans <9fans@cse.psu.edu>
Subject: [9fans] UTF-8 criticism?
Date: Sun, 18 Jul 2004 10:31:11 -0700
Message-ID: <6e35c06204071810312daa31a9@mail.gmail.com>
I've always appreciated Plan 9's native UTF-8 support, but while
searching for reasons for the crappy multilingual support in Squeak I
came across the following message.
I'm just wondering how Plan 9 deals with this, or whether it really matters.
-Jack
------
UTF-8 is what is known as an output transformation. It is used to put
whatever is in memory into some other form that is more readily
digestible by other devices that expect 7-bit ASCII and its associated
zero-null-byte convention. The UTF-8 format specifies ways of storing
up to 4-byte characters without any nulls aligned on bytes.
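To make the byte layout described above concrete, here is a minimal sketch (an added illustration, not part of the quoted message). The function name utf8_encode is hypothetical, and for brevity the sketch does not reject surrogate code points the way a strict encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using the UTF-8 bit layout.

    Hypothetical helper for illustration; it does not reject
    surrogates (U+D800..U+DFFF) as a strict encoder would.
    """
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    if cp <= 0x10FFFF:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    raise ValueError("code point out of Unicode range")


# Every byte of a multi-byte sequence has its high bit set, so a zero
# byte can only come from encoding U+0000 itself.
print(utf8_encode(0x4E2D).hex())
```

Because lead bytes and continuation bytes all have the high bit set, code that treats a zero byte as a string terminator keeps working on UTF-8 text.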
UTF-8 is also more compact for the European languages, but it is very
lengthy for traditional Chinese, as all characters require 2 bytes and
some characters inevitably require 3 bytes.
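The size comparison above is easy to check directly. The following is a small sketch (an added illustration, not part of the quoted message); the sample characters and the helper name encoded_lengths are arbitrary choices made for the example. In this sketch the common CJK character takes 3 bytes in UTF-8 and 2 in UTF-16, and the character outside the Basic Multilingual Plane takes 4 bytes in both.

```python
# Sample characters are illustrative choices, not taken from the message.
samples = {
    "ASCII 'A' (U+0041)": "A",
    "Latin e-acute (U+00E9)": "\u00e9",
    "CJK ideograph (U+4E2D)": "\u4e2d",
    "CJK Extension B (U+20000)": "\U00020000",
}


def encoded_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte-order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for label, ch in samples.items():
    u8, u16 = encoded_lengths(ch)
    print(f"{label}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```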
The problem with UTF-8 is that it is non-indexable. If you have a
string of characters, you can't make an assumption about where the nth
character is. To find out, you have to do a linear search. That makes
string indexing O(n) instead of O(1), which is unacceptable. If you
were to sort a UTF-8 string for some reason, the bubble sort would
actually have a lower order of magnitude than the quicksort.
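The indexing cost described above can be seen in a short sketch (an added illustration, not part of the quoted message). The function byte_offset_of_char is a hypothetical name, and the sketch assumes its input is valid UTF-8. It finds where the nth character starts by scanning from the beginning and skipping continuation bytes, which is the linear search the message refers to.

```python
def byte_offset_of_char(data: bytes, n: int) -> int:
    """Return the byte offset where the n-th code point (0-based) starts.

    Assumes `data` is valid UTF-8. Continuation bytes have the bit
    pattern 10xxxxxx, so any other byte starts a new code point.
    """
    count = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte
            count += 1
            if count == n:
                return offset
    raise IndexError("character index out of range")


# 'a' (1 byte), e-acute (2 bytes), U+4E2D (3 bytes), 'b' (1 byte)
text = "a\u00e9\u4e2db".encode("utf-8")
print(byte_offset_of_char(text, 3))  # prints 6
```

The same property that forces the scan also means a reader can resynchronize from any byte position by skipping continuation bytes, so sequential processing does not need random access by character index.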
So UTF-8 is not a very good memory format for characters. NT uses
UCS-2 (the 2-octet character set) in its native encoding, and UTF-8
for a lot of transfers to disk and the network. I'm not so sure that
Be uses UTF-8 in memory. I think you'd actually find they use UCS-2.
So I don't think it'd be good for someone to go through the hassle of
implementing a UTF-8 set of string methods. I like the idea of
bringing Unicode into Squeak. But there's a lot more involved than
just adding 2-byte arrays.
For example, you will want to store method strings in UTF-8, because
they aren't allowed to carry characters larger than 7 bits. But you'd
have to make sure that they get transformed properly for other
purposes. You will have to provide alternate input/output routines for
files because you shouldn't store text files in UCS-2. There are many
considerations and I recommend that you read the standard, and all,
before going ahead and doing it.