[9fans] UTF-8 criticism?

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] UTF-8 criticism?
@ 2004-07-18 17:31 Jack Johnson
  2004-07-18 18:27 ` Rob Pike
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Jack Johnson @ 2004-07-18 17:31 UTC (permalink / raw)
  To: 9fans

I've always appreciated Plan 9's native UTF-8 support, but while
searching for reasons for the crappy multilingual support in Squeak I
came across the following message.

I'm just wondering how Plan 9 deals with this, or if it really matters?

-Jack

------

UTF-8 is what is known as an output transformation. It is used to put
whatever is in memory into some other form that is more readily
digestible by other devices that expect 7-bit ASCII and its associated
zero-null-byte convention. The UTF-8 format specifies ways of storing
up to 4-byte characters without any nulls aligned on bytes.

UTF-8 is also more compact for the European languages, but it is very
lengthy for traditional Chinese, as all characters require 2 bytes and
some characters inevitably require 3 bytes.
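
For reference, the byte counts follow from the code point ranges UTF-8 defines. A minimal sketch, where utf8_len is a made-up name and not a Plan 9 or Squeak routine; note that common Chinese characters such as U+4E2D land in the three-byte range:

```c
/*
 * Number of bytes UTF-8 uses for code point c, or 0 if c is beyond
 * U+10FFFF. Illustrative only: surrogate values are not rejected.
 */
int
utf8_len(unsigned long c)
{
	if (c < 0x80)
		return 1;	/* ASCII */
	if (c < 0x800)
		return 2;	/* Latin supplements, Greek, Cyrillic, ... */
	if (c < 0x10000)
		return 3;	/* rest of the BMP, including CJK ideographs */
	if (c < 0x110000)
		return 4;	/* supplementary planes */
	return 0;
}
```

So utf8_len(0x4E2D) is 3, against 2 bytes for the same character in UCS-2.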

The problem with UTF-8 is that it is non-indexable. If you have a
string of characters, you can't make an assumption about where the nth
character is. To find out, you have to do a linear search. That makes
string indexing O(n) instead of O(1), which is unacceptable. If you
were to sort a UTF-8 string for some reason, the bubble sort would
actually have a lower order of magnitude than the quicksort.
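
The linear search described above can be sketched as follows. This is only an illustration: utf8_offset is a hypothetical helper, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/*
 * Return the byte offset of the n-th character (0-based) in the
 * NUL-terminated UTF-8 string s, or -1 if s has fewer than n+1
 * characters. Assumes s is valid UTF-8; nothing is validated.
 */
long
utf8_offset(const char *s, size_t n)
{
	size_t i;

	for (i = 0; s[i] != '\0'; i++) {
		/* continuation bytes look like 10xxxxxx; skip them */
		if (((unsigned char)s[i] & 0xC0) == 0x80)
			continue;
		if (n == 0)
			return (long)i;
		n--;
	}
	return -1;
}
```

Every call walks the string from the start, which is the O(n) cost the message complains about.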

So UTF-8 is not a very good memory format for characters. NT uses
UCS-2 (the 2-octet character set) in its native encoding, and UTF-8
for a lot of transfers to disk and the network. I'm not so sure that
Be uses UTF-8 in memory. I think you'd actually find they use UCS-2.

So I don't think it'd be good for someone to go through the hassle of
implementing a UTF-8 set of string methods. I like the idea of
bringing Unicode into Squeak. But there's a lot more involved than
just adding 2-byte arrays.

For example, you will want to store method strings in UTF-8, because
they aren't allowed to carry characters larger than 7 bits. But you'd
have to make sure that they get transformed properly for other
purposes. You will have to provide alternate input/output routines for
files because you shouldn't store text files in UCS-2. There are many
considerations and I recommend that you read the standard, and all,
before going ahead and doing it.


* Re: [9fans] UTF-8 criticism?
  2004-07-18 17:31 [9fans] UTF-8 criticism? Jack Johnson
@ 2004-07-18 18:27 ` Rob Pike
  2004-07-18 18:39 ` boyd, rounin
  2004-07-19 21:35 ` rog
  2 siblings, 0 replies; 18+ messages in thread
From: Rob Pike @ 2004-07-18 18:27 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

utf-8 is an exchange format and that's mostly how it's used.
the few programs that do any analysis on utf-8 in memory
do so usually as a side effect of whatever else is going on.
for instance, they might call cleanname() or some other
such file name-processing routine.

his criticisms are on target but building utf-8 routines is
not intellectually challenging, nor is it a big job.

in short, it doesn't really matter.

-rob


* Re: [9fans] UTF-8 criticism?
  2004-07-18 17:31 [9fans] UTF-8 criticism? Jack Johnson
  2004-07-18 18:27 ` Rob Pike
@ 2004-07-18 18:39 ` boyd, rounin
  2004-07-18 19:05   ` Rob Pike
  2004-07-19 21:35 ` rog
  2 siblings, 1 reply; 18+ messages in thread
From: boyd, rounin @ 2004-07-18 18:39 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> If you were to sort a UTF-8 string for some reason, the bubble
> sort would actually have a lower order of magnitude than the
> quicksort.

that's not the real problem.  it's implementing the collation
sequences.  the internal representation as 16 bit unsigneds
is not a problem.




* Re: [9fans] UTF-8 criticism?
  2004-07-18 18:39 ` boyd, rounin
@ 2004-07-18 19:05   ` Rob Pike
  2004-07-18 19:06     ` boyd, rounin
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Rob Pike @ 2004-07-18 19:05 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> that's not the real problem.  it's implementing the collation
> sequences.  the internal representation as 16 bit unsigneds
> is not a problem.

actually it is, now that surrogates are well-established.

-rob



* Re: [9fans] UTF-8 criticism?
  2004-07-18 19:05   ` Rob Pike
@ 2004-07-18 19:06     ` boyd, rounin
  2004-07-19  9:00       ` Douglas A. Gwyn
  2004-07-18 19:34     ` boyd, rounin
  2004-07-19 21:01     ` Joel Salomon
  2 siblings, 1 reply; 18+ messages in thread
From: boyd, rounin @ 2004-07-18 19:06 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> actually it is, now that surrogates are well-established.

surrogates?




* Re: [9fans] UTF-8 criticism?
  2004-07-18 19:06     ` boyd, rounin
@ 2004-07-19  9:00       ` Douglas A. Gwyn
  2004-07-19 15:34         ` Skip Tavakkolian
  0 siblings, 1 reply; 18+ messages in thread
From: Douglas A. Gwyn @ 2004-07-19  9:00 UTC (permalink / raw)
  To: 9fans

boyd, rounin wrote:
> surrogates?

Yeah, that is what happens when people settle on
too small a code space.
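
For readers who have not met them: a surrogate pair is two 16-bit units, taken from reserved ranges, that together stand for one code point above U+FFFF. A rough sketch of the arithmetic (the function names are made up for illustration):

```c
/*
 * Combine a UTF-16 surrogate pair into one code point.
 * Returns -1 if hi/lo are not a high surrogate followed by a low one.
 */
long
surrogate_decode(unsigned hi, unsigned lo)
{
	if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
		return -1;
	return 0x10000L + (((long)(hi - 0xD800) << 10) | (long)(lo - 0xDC00));
}

/*
 * Split code point c (U+10000..U+10FFFF) into a surrogate pair.
 * Returns 0 on success, -1 if c is outside that range.
 */
int
surrogate_encode(unsigned long c, unsigned *hi, unsigned *lo)
{
	if (c < 0x10000 || c > 0x10FFFF)
		return -1;
	c -= 0x10000;
	*hi = 0xD800 + (unsigned)(c >> 10);	/* top 10 bits */
	*lo = 0xDC00 + (unsigned)(c & 0x3FF);	/* bottom 10 bits */
	return 0;
}
```

With pairs in play, an array of 16-bit units no longer has one element per character, so it loses the constant-time indexing that was the argument for it.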



* Re: [9fans] UTF-8 criticism?
  2004-07-19  9:00       ` Douglas A. Gwyn
@ 2004-07-19 15:34         ` Skip Tavakkolian
  0 siblings, 0 replies; 18+ messages in thread
From: Skip Tavakkolian @ 2004-07-19 15:34 UTC (permalink / raw)
  To: 9fans

>> surrogates?
>
> Yeah, that is what happens when people settle on
> too small a code space.

The Klingon Language Institute probably agrees.  Far out man!

http://www.kli.org/tlh/pIqaD.html




* Re: [9fans] UTF-8 criticism?
  2004-07-18 19:05   ` Rob Pike
  2004-07-18 19:06     ` boyd, rounin
@ 2004-07-18 19:34     ` boyd, rounin
  2004-07-19  7:40       ` Charles Forsyth
  2004-07-19 21:01     ` Joel Salomon
  2 siblings, 1 reply; 18+ messages in thread
From: boyd, rounin @ 2004-07-18 19:34 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> actually it is, now that surrogates are well-established.

ahh, i see.  ick.




* Re: [9fans] UTF-8 criticism?
  2004-07-18 19:34     ` boyd, rounin
@ 2004-07-19  7:40       ` Charles Forsyth
  2004-07-19  8:39         ` Geoff Collyer
  0 siblings, 1 reply; 18+ messages in thread
From: Charles Forsyth @ 2004-07-19  7:40 UTC (permalink / raw)
  To: 9fans

yes, there is no standard so useful that standards committees cannot ruin it.


* Re: [9fans] UTF-8 criticism?
  2004-07-19  7:40       ` Charles Forsyth
@ 2004-07-19  8:39         ` Geoff Collyer
  0 siblings, 0 replies; 18+ messages in thread
From: Geoff Collyer @ 2004-07-19  8:39 UTC (permalink / raw)
  To: 9fans

A sitting standards committee is like a sitting legislature: unless
they pass new standards (resp.  laws), they have nothing to do.


* Re: [9fans] UTF-8 criticism?
  2004-07-18 19:05   ` Rob Pike
  2004-07-18 19:06     ` boyd, rounin
  2004-07-18 19:34     ` boyd, rounin
@ 2004-07-19 21:01     ` Joel Salomon
  2004-07-19 21:22       ` boyd, rounin
                         ` (2 more replies)
  2 siblings, 3 replies; 18+ messages in thread
From: Joel Salomon @ 2004-07-19 21:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

>> that's not the real problem.  it's implementing the collation
>> sequences.  the internal representation as 16 bit unsigneds
>> is not a problem.
>
> actually it is, now that surrogates are well-established.
>
> -rob
>

Would moving to 32 bit signed (and only 0 -- 2^21 allowed, plus -1 for
EOF) as in the more recent revisions of Unicode take care of the
surrogates problem?

--Joel

p.s.

>>> surrogates?
>>
>> Yeah, that is what happens when people settle on
>> too small a code space.
>
> The Klingon Language Institute probably agrees.  Far out man!
>

And I want native tengwar support... Go Geeks!!




* Re: [9fans] UTF-8 criticism?
  2004-07-19 21:01     ` Joel Salomon
@ 2004-07-19 21:22       ` boyd, rounin
  2004-07-19 21:35         ` Joel Salomon
  2004-07-19 21:42       ` andrey mirtchovski
  2004-07-20  8:32       ` Douglas A. Gwyn
  2 siblings, 1 reply; 18+ messages in thread
From: boyd, rounin @ 2004-07-19 21:22 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Would moving to 32 bit signed (and only 0 -- 2^21 allowed, plus -1 for
> EOF) as in the more recent revisions of Unicode take care of the
> surrogates problem?

this has nothing to do with EOF.




* Re: [9fans] UTF-8 criticism?
  2004-07-19 21:22       ` boyd, rounin
@ 2004-07-19 21:35         ` Joel Salomon
  2004-07-19 21:56           ` Joel Salomon
  0 siblings, 1 reply; 18+ messages in thread
From: Joel Salomon @ 2004-07-19 21:35 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

>> Would moving to 32 bit signed (and only 0 -- 2^21 allowed, plus -1 for
>> EOF) as in the more recent revisions of Unicode take care of the
>> surrogates problem?
>
> this has nothing to do with EOF.

Sorry if I was unclear - let me try again. Would moving to 32 bit signed
(and only 0 -- 2^21 allowed), thus including all surrogates in the
directly accessible character set solve the problem?
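
A sketch of what the wider representation buys, assuming valid UTF-8 input (utf8_to_runes is a hypothetical name, not a libc routine): decode once into 32-bit values, and every character, including those beyond the BMP, occupies exactly one array slot.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the NUL-terminated UTF-8 string s into out[], one 32-bit
 * value per character, storing at most max entries. Returns the
 * number of characters stored. Input is assumed valid; nothing is
 * checked beyond stopping at a byte that is not a continuation.
 */
size_t
utf8_to_runes(const char *s, int32_t *out, size_t max)
{
	const unsigned char *p = (const unsigned char *)s;
	size_t n = 0;

	while (*p != '\0' && n < max) {
		int32_t c = *p++;
		int extra = 0;

		if (c >= 0xF0) { c &= 0x07; extra = 3; }
		else if (c >= 0xE0) { c &= 0x0F; extra = 2; }
		else if (c >= 0xC0) { c &= 0x1F; extra = 1; }
		/* fold in the 6 payload bits of each continuation byte */
		while (extra-- > 0 && (*p & 0xC0) == 0x80)
			c = (c << 6) | (*p++ & 0x3F);
		out[n++] = c;
	}
	return n;
}
```

After the conversion out[i] is the i-th character in constant time, at the cost of four bytes per character in memory.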

Yes, this does open a new can of worms, but how much more difficult would
it be to move from 16 bit Runes to 21/32 bit wide Runes than it was to
move from 7 bit ASCII to Unicode in the first place?

As an aside, the way I've understood the Unicode standard (4.0), 21 bit
characters can be encoded in 1, 2, 3, or 4 bytes in UTF-8 and if text is
internally represented by int32, some out-of-band information (like EOF,
or bad UTF (but preserving the original bytes)) can be carried along.

--Joel


* Re: [9fans] UTF-8 criticism?
  2004-07-19 21:35         ` Joel Salomon
@ 2004-07-19 21:56           ` Joel Salomon
  0 siblings, 0 replies; 18+ messages in thread
From: Joel Salomon @ 2004-07-19 21:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Joel Salomon said:
> As an aside, the way I've understood the Unicode standard (4.0), 21 bit
> characters can be encoded in 1, 2, 3, or 4 bytes in UTF-8 and if text is
> internally represented by int32, some out-of-band information (like EOF,
> or bad UTF (but preserving the original bytes)) can be carried along.
>

And here's where the out-of-band encoding might come in useful:

rog@vitanuova.com said:
> you do have to be a bit careful with utf-8, as many possible byte
> sequences map down to the same rune (error), so if you
> do your comparisons too early, you run the risk of inconsistency.
>
> for instance, you can exploit this (at least, i *think* this is the
> cause) to create a file that can never be removed on ken's fileserver:
<snip>

but if "error" becomes 0x80000000 | XX, where XX is the original (bad, or
out-of-place) byte, we never lose the ability to retrieve/delete the file.
This would be an extension to Unicode, possibly a dangerous one, but maybe
worth considering.
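
One way this could look, purely as a sketch: BADBYTE and decode are invented names, this is not how Plan 9's chartorune behaves, and the flag is combined with the offending byte using bitwise OR so the byte can be recovered later.

```c
#include <stddef.h>
#include <stdint.h>

#define BADBYTE 0x80000000u	/* hypothetical out-of-band flag */

/*
 * Decode one character from s[0..n-1] (n >= 1) into *r and return the
 * number of bytes consumed. On malformed input, consume one byte and
 * store BADBYTE | that byte, so the original bytes can be rebuilt.
 */
size_t
decode(const unsigned char *s, size_t n, uint32_t *r)
{
	/* smallest code point that legitimately needs len bytes */
	static const uint32_t min[] = {0, 0, 0x80, 0x800, 0x10000};
	uint32_t c = s[0];
	size_t len, i;

	if (c < 0x80) {
		*r = c;
		return 1;
	} else if ((c & 0xE0) == 0xC0) {
		len = 2; c &= 0x1F;
	} else if ((c & 0xF0) == 0xE0) {
		len = 3; c &= 0x0F;
	} else if ((c & 0xF8) == 0xF0) {
		len = 4; c &= 0x07;
	} else
		goto bad;	/* stray continuation or invalid lead byte */

	if (len > n)
		goto bad;	/* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			goto bad;
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* reject overlong forms, surrogates and values past U+10FFFF */
	if (c < min[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		goto bad;
	*r = c;
	return len;
bad:
	*r = BADBYTE | s[0];
	return 1;
}
```

Fed the bytes 0xc0 0xb0, this yields 0x800000C0 and then 0x800000B0 rather than two identical error values, so distinct bad names stay distinct after decoding.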

--Joel



* Re: [9fans] UTF-8 criticism?
  2004-07-19 21:01     ` Joel Salomon
  2004-07-19 21:22       ` boyd, rounin
@ 2004-07-19 21:42       ` andrey mirtchovski
  2004-07-19 21:43         ` Tengwar " Joel Salomon
  2004-07-20  8:32       ` Douglas A. Gwyn
  2 siblings, 1 reply; 18+ messages in thread
From: andrey mirtchovski @ 2004-07-19 21:42 UTC (permalink / raw)
  To: 9fans

> And I want native tengwar support... Go Geeks!!

sure thing. at plan9.ucalgary.ca

	% font=tengwar.20.font rio

gives you:

	http://pages.cpsc.ucalgary.ca/~mirtchov/screenshots/tengwar.gif

(acme mail displaying your message second from top :)




* Tengwar Re: [9fans] UTF-8 criticism?
  2004-07-19 21:42       ` andrey mirtchovski
@ 2004-07-19 21:43         ` Joel Salomon
  0 siblings, 0 replies; 18+ messages in thread
From: Joel Salomon @ 2004-07-19 21:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

>> And I want native tengwar support... Go Geeks!!
>
> sure thing. at plan9.ucalgary.ca
> 	% font=tengwar.20.font rio
> gives you:
> 	http://pages.cpsc.ucalgary.ca/~mirtchov/screenshots/tengwar.gif
> (acme mail displaying your message second from top :)
>
Thanks. Still, it'd be nice to have the code spaces (tentatively)
allocated for tengwar available - and they're not in the first 65535
characters in Unicode (old unicode, now the BMP).

Of course, there's more to Unicode than the character set, but most of
that is in the text processing domain (troff, TeX).

--Joel



* Re: [9fans] UTF-8 criticism?
  2004-07-19 21:01     ` Joel Salomon
  2004-07-19 21:22       ` boyd, rounin
  2004-07-19 21:42       ` andrey mirtchovski
@ 2004-07-20  8:32       ` Douglas A. Gwyn
  2 siblings, 0 replies; 18+ messages in thread
From: Douglas A. Gwyn @ 2004-07-20  8:32 UTC (permalink / raw)
  To: 9fans

Joel Salomon wrote:
> Would moving to 32 bit signed (and only 0 -- 2^21 allowed, plus -1 for
> EOF) as in the more recent revisions of Unicode take care of the
> surrogates problem?

Sure.  Actually it's not "more recent", as that's the
original ISO-10646 code points, predating Unicode.



* Re: [9fans] UTF-8 criticism?
  2004-07-18 17:31 [9fans] UTF-8 criticism? Jack Johnson
  2004-07-18 18:27 ` Rob Pike
  2004-07-18 18:39 ` boyd, rounin
@ 2004-07-19 21:35 ` rog
  2 siblings, 0 replies; 18+ messages in thread
From: rog @ 2004-07-19 21:35 UTC (permalink / raw)
  To: 9fans

you do have to be a bit careful with utf-8, as many possible byte
sequences map down to the same rune (error), so if you
do your comparisons too early, you run the risk of inconsistency.

for instance, you can exploit this (at least, i *think* this is the
cause) to create a file that can never be removed on ken's fileserver:

#include <u.h>
#include <libc.h>

void
main(void)
{
	/* 0xc0 0xb0 is not valid UTF-8: an overlong encoding of '0' */
	char f[] = {0xc0, 0xb0, 0};

	create(f, OWRITE, 0666);
	exits(nil);
}

don't try this at home...



