9fans - fans of the OS Plan 9 from Bell Labs
From: Jack Johnson <knapjack@gmail.com>
To: 9fans <9fans@cse.psu.edu>
Subject: [9fans] UTF-8 criticism?
Date: Sun, 18 Jul 2004 10:31:11 -0700
Message-ID: <6e35c06204071810312daa31a9@mail.gmail.com>

I've always appreciated Plan 9's native UTF-8 support, but while
searching for reasons for the crappy multilingual support in Squeak I
came across the following message.

I'm just wondering how Plan 9 deals with this, or if it really matters?

-Jack

------

UTF-8 is what is known as an output transformation. It is used to put
whatever is in memory into some other form that is more readily
digestible by other devices that expect 7-bit ASCII and its associated
null-terminated string convention. The UTF-8 format specifies how to
store characters of up to 4 bytes each without any embedded null bytes.

UTF-8 is also more compact for the European languages, but it is very
lengthy for traditional Chinese, as all characters require 2 bytes and
some characters inevitably require 3 bytes.
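
As a rough check on the size figures quoted above, here is a small
sketch that measures encoded lengths with Python's built-in UTF-8
codec. The sample characters and the helper name utf8_len are my own
choices for illustration, not anything from the quoted message. Note
that a common CJK ideograph in the Basic Multilingual Plane comes out
at 3 bytes, and one from the supplementary planes at 4.

```python
# Encoded length, in bytes, of a few sample characters under UTF-8.
# The samples are illustrative choices, not taken from the original message.
samples = {
    "A": "ASCII letter (U+0041)",
    "\u00e9": "Latin small e with acute (U+00E9)",
    "\u4e2d": "CJK ideograph (U+4E2D)",
    "\U00020000": "CJK Extension B ideograph (U+20000)",
}


def utf8_len(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, label in samples.items():
    encoded = ch.encode("utf-8")
    # Only U+0000 itself encodes to a zero byte, so none of these may contain one.
    assert 0 not in encoded
    utf16_len = len(ch.encode("utf-16-be"))
    print(f"{label}: UTF-8 {utf8_len(ch)} byte(s), UTF-16 {utf16_len} byte(s)")
```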

The problem with UTF-8 is that it is non-indexable. If you have a
string of characters, you can't make an assumption about where the nth
character is. To find out, you have to do a linear search. That makes
string indexing O(n) instead of O(1), which is unacceptable. If you
were to sort a UTF-8 string for some reason, the bubble sort would
actually have a lower order of magnitude than the quicksort.
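
To make the indexing point concrete, here is a minimal sketch of the
linear scan needed to find the nth character in a UTF-8 buffer. It
assumes the input is valid UTF-8 and relies on the fact that
continuation bytes always have the bit pattern 10xxxxxx. The function
names utf8_char_offset and utf8_char_at are hypothetical, made up for
this example.

```python
def utf8_char_offset(buf: bytes, n: int) -> int:
    """Return the byte offset where the n-th character (0-based) starts.

    Assumes buf is valid UTF-8. Continuation bytes look like 10xxxxxx,
    so any byte that does not match that pattern begins a new character.
    """
    count = -1
    for offset, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts here
            count += 1
            if count == n:
                return offset
    raise IndexError("character index out of range")


def utf8_char_at(buf: bytes, n: int) -> str:
    """Return the n-th character by scanning from the start of the buffer."""
    start = utf8_char_offset(buf, n)
    end = start + 1
    # Extend over the continuation bytes that belong to this character.
    while end < len(buf) and buf[end] & 0xC0 == 0x80:
        end += 1
    return buf[start:end].decode("utf-8")


if __name__ == "__main__":
    data = "a\u00e9\u4e2db".encode("utf-8")
    print(utf8_char_offset(data, 2), utf8_char_at(data, 2))
```

The scan touches every byte before the target, which is the O(n) cost
the message describes. Walking a string from front to back one
character at a time does not pay that cost on each step, since each
step only has to skip the current character's continuation bytes.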

So UTF-8 is not a very good memory format for characters. NT uses
UCS-2 (the 2-octet character set) in its native encoding, and UTF-8
for a lot of transfers to disk and the network. I'm not so sure that
Be uses UTF-8 in memory. I think you'd actually find they use UCS-2.

So I don't think it'd be good for someone to go through the hassle of
implementing a UTF-8 set of string methods. I like the idea of
bringing Unicode into Squeak. But there's a lot more involved than
just adding 2-byte arrays.

For example, you will want to store method strings in UTF-8, because
they aren't allowed to carry characters larger than 7 bits. But you'd
have to make sure that they get transformed properly for other
purposes. You will have to provide alternate input/output routines for
files because you shouldn't store text files in UCS-2. There are many
considerations and I recommend that you read the standard, and all,
before going ahead and doing it.
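
One way to read the advice above is to convert at the I/O boundary:
keep UTF-8 on disk and on the wire, and decode into a fixed-width
array in memory when O(1) indexing matters. The sketch below is only
an illustration of that split. The names read_text and write_text are
hypothetical, and it uses one machine word of at least 32 bits per
code point rather than the 2-byte UCS-2 the message mentions, so that
supplementary-plane characters also fit.

```python
from array import array


def read_text(raw: bytes) -> array:
    """Decode UTF-8 bytes (from a file or socket) into fixed-width code points.

    Typecode "L" is an unsigned long of at least 4 bytes, wide enough
    for any Unicode code point.
    """
    return array("L", (ord(ch) for ch in raw.decode("utf-8")))


def write_text(codepoints: array) -> bytes:
    """Encode fixed-width code points back to UTF-8 for storage or transfer."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")


if __name__ == "__main__":
    raw = "na\u00efve \u4e2d".encode("utf-8")
    text = read_text(raw)
    # Indexing is now a direct array lookup, independent of position.
    print(hex(text[2]), len(text), write_text(text) == raw)
```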



