9fans - fans of the OS Plan 9 from Bell Labs
* [9fans]  About The Codes Beyond Unicode-BMP
@ 2008-03-13 14:28 Hongzheng Wang
  2008-03-13 14:55 ` erik quanstrom
  0 siblings, 1 reply; 6+ messages in thread
From: Hongzheng Wang @ 2008-03-13 14:28 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

[-- Attachment #1: Type: text/plain, Size: 1098 bytes --]

Hi,

I ran an experiment to test whether the programs in Plan 9 support
code points beyond the Unicode BMP.
The result is not so good.

To reproduce it:

Take the code point U+010000 as an example.  Create a file containing
only the single character U+010000, encoded in UTF-8.

This can be done with Vim on Linux, since Vim has good support for it,
and double-checked with Nvi.  The UTF-8 encoding of U+010000 is
F0908080 [1].

Open the file with ed, sam, or acme.  Of course, it cannot be
displayed correctly, since no font on the system covers such a code
point yet.  Then just write the file back out, and open it again with
a non-Plan 9 program, say, Nvi on Linux.  The contents have become
EFBFBDEFBFBDEFBFBDEFBFBD.  That is, ed, sam, and acme all failed to
recognize U+010000 encoded in UTF-8, and destroyed it when writing.
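
A minimal sketch of how those twelve bytes can arise, assuming a
decoder that accepts only 1- to 3-byte UTF-8 sequences and replaces
every byte it cannot interpret with U+FFFD.  The name decode_bmp_only
is hypothetical; this models the observed behavior and is not the
Plan 9 library code.

```python
RUNE_ERROR = 0xFFFD  # replacement character, UTF-8 bytes EF BF BD

def decode_bmp_only(data):
    """Decode UTF-8, accepting only 1- to 3-byte sequences (the BMP)."""
    runes, i = [], 0
    while i < len(data):
        b = data[i]
        # Sequence length from the lead byte; 0 means "not a lead byte we know".
        n = 1 if b < 0x80 else 2 if 0xC0 <= b < 0xE0 else 3 if 0xE0 <= b < 0xF0 else 0
        tail = data[i + 1:i + n]
        if n == 0 or len(tail) != n - 1 or any((t & 0xC0) != 0x80 for t in tail):
            runes.append(RUNE_ERROR)   # bad byte: emit U+FFFD and skip one byte
            i += 1
            continue
        r = b if n == 1 else b & (0xFF >> (n + 1))
        for t in tail:
            r = (r << 6) | (t & 0x3F)
        runes.append(r)
        i += n
    return runes

original = chr(0x10000).encode("utf-8")
rewritten = "".join(chr(r) for r in decode_bmp_only(original)).encode("utf-8")
print(original.hex().upper())    # F0908080
print(rewritten.hex().upper())   # EFBFBDEFBFBDEFBFBDEFBFBD
```

Under that assumption, the lead byte F0 and each of the three
continuation bytes is rejected on its own, which gives four U+FFFD
characters, each written back as EF BF BD.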

So, does Plan9 acctually supports only the codes in Unicode-BMP?

BTW: the attachment is the gzipped test file containing only U+010000
encoded by UTF-8.

[1] http://en.wikipedia.org/wiki/UTF-8

--
HZ

[-- Attachment #2: test.gz --]
[-- Type: application/x-gzip, Size: 30 bytes --]


* Re: [9fans] About The Codes Beyond Unicode-BMP
  2008-03-13 14:28 [9fans] About The Codes Beyond Unicode-BMP Hongzheng Wang
@ 2008-03-13 14:55 ` erik quanstrom
  2008-03-13 15:02   ` Hongzheng Wang
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: erik quanstrom @ 2008-03-13 14:55 UTC (permalink / raw)
  To: 9fans

plan 9 supports utf16.  that is codepoints u+0000 to u+ffff.  there is no
support for 32bit characters.  to support larger characters, the starting point
would be changing Rune from ushort to ulong and changing constants like
UTFmax and fixing chartorune and runetochar.  (and finding all the places
that assume that UTFmax really is 3.)

it's all very doable, but it would be a very invasive change.
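
a rough sketch of what the widened conversion routines would have to
do, written in python rather than plan 9 c.  the names rune_to_char,
char_to_rune, UTF_MAX and RUNE_MAX are hypothetical stand-ins for
runetochar, chartorune, UTFmax and a maximum rune value; this is an
illustration, not the library source, and it does not reject
surrogate values.

```python
UTF_MAX = 4          # bytes per rune once U+10000..U+10FFFF are allowed
RUNE_MAX = 0x10FFFF  # last Unicode code point
RUNE_ERROR = 0xFFFD

def rune_to_char(r):
    """Encode one code point as UTF-8 (1 to UTF_MAX bytes)."""
    if r < 0x80:
        return bytes([r])
    if r < 0x800:
        return bytes([0xC0 | (r >> 6), 0x80 | (r & 0x3F)])
    if r < 0x10000:
        return bytes([0xE0 | (r >> 12), 0x80 | ((r >> 6) & 0x3F), 0x80 | (r & 0x3F)])
    if r <= RUNE_MAX:   # the new 4-byte case
        return bytes([0xF0 | (r >> 18), 0x80 | ((r >> 12) & 0x3F),
                      0x80 | ((r >> 6) & 0x3F), 0x80 | (r & 0x3F)])
    return rune_to_char(RUNE_ERROR)

def char_to_rune(data):
    """Decode the first rune in data; return (rune, bytes consumed)."""
    b = data[0]
    if b < 0x80:
        return b, 1
    # (length, lead-byte mask, lead-byte bits, smallest value allowed)
    forms = ((2, 0xE0, 0xC0, 0x80), (3, 0xF0, 0xE0, 0x800), (4, 0xF8, 0xF0, 0x10000))
    for n, mask, bits, minimum in forms:
        if (b & mask) == bits:
            tail = data[1:n]
            if len(tail) != n - 1 or any((t & 0xC0) != 0x80 for t in tail):
                break
            r = b & ~mask & 0xFF
            for t in tail:
                r = (r << 6) | (t & 0x3F)
            if minimum <= r <= RUNE_MAX:   # reject overlong and out-of-range
                return r, n
            break
    return RUNE_ERROR, 1

print(rune_to_char(0x10000).hex())               # f0908080
print(char_to_rune(bytes.fromhex("f0908080")))   # (65536, 4)
```

in c the rune type also has to hold 21 bits, which is why a 16-bit
Rune has to grow before either routine can handle the 4-byte case.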

- erik



* Re: [9fans] About The Codes Beyond Unicode-BMP
  2008-03-13 14:55 ` erik quanstrom
@ 2008-03-13 15:02   ` Hongzheng Wang
  2008-03-13 18:23   ` Russ Cox
  2008-03-13 18:44   ` Joel C. Salomon
  2 siblings, 0 replies; 6+ messages in thread
From: Hongzheng Wang @ 2008-03-13 15:02 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I see.  Thanks.

On Thu, Mar 13, 2008 at 10:55 PM, erik quanstrom <quanstro@coraid.com> wrote:
> plan 9 supports utf16.  that is codepoints u+0000 to u+ffff.  there is no
>  support for 32bit characters.  to support larger characters, the starting point
>  would be changing Rune from ushort to ulong and changing constants like
>  UTFmax and fixing chartorune and runetochar.  (and finding all the places
>  that assume that UTFmax really is 3.)
>
>  it's all very doable, but it would be a very invasive change.
>
>  - erik
>
>



-- 
HZ



* Re: [9fans] About The Codes Beyond Unicode-BMP
  2008-03-13 14:55 ` erik quanstrom
  2008-03-13 15:02   ` Hongzheng Wang
@ 2008-03-13 18:23   ` Russ Cox
  2008-03-13 18:38     ` erik quanstrom
  2008-03-13 18:44   ` Joel C. Salomon
  2 siblings, 1 reply; 6+ messages in thread
From: Russ Cox @ 2008-03-13 18:23 UTC (permalink / raw)
  To: 9fans

> plan 9 supports utf16.  that is codepoints u+0000 to u+ffff.  there is no
> support for 32bit characters. 

this is correct except for the use of the term utf16,
which is a character encoding, not a character set.
the subject line is correct - plan 9 doesn't support
codes beyond the BMP.

> to support larger characters, the starting point
> would be changing Rune from ushort to ulong and changing constants like
> UTFmax and fixing chartorune and runetochar.  (and finding all the places
> that assume that UTFmax really is 3.)
> it's all very doable, but it would be a very invasive change.

it would require recompiling everything, 
but i don't believe it would require changes
to code beyond the utf routines in the c library.
i do not believe there are many places (if any)
that presume to know the value of UTFmax.

russ




* Re: [9fans] About The Codes Beyond Unicode-BMP
  2008-03-13 18:23   ` Russ Cox
@ 2008-03-13 18:38     ` erik quanstrom
  0 siblings, 0 replies; 6+ messages in thread
From: erik quanstrom @ 2008-03-13 18:38 UTC (permalink / raw)
  To: 9fans

> it would require recompiling everything,
> but i don't believe it would require changes
> to code beyond the utf routines in the c library.
> i do not believe there are many places (if any)
> that presume to know the value of UTFmax.

you just pointed one out yesterday -- in devattach().

- erik



* Re: [9fans] About The Codes Beyond Unicode-BMP
  2008-03-13 14:55 ` erik quanstrom
  2008-03-13 15:02   ` Hongzheng Wang
  2008-03-13 18:23   ` Russ Cox
@ 2008-03-13 18:44   ` Joel C. Salomon
  2 siblings, 0 replies; 6+ messages in thread
From: Joel C. Salomon @ 2008-03-13 18:44 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Thu, Mar 13, 2008 at 10:55 AM, erik quanstrom <quanstro@coraid.com> wrote:
> plan 9 supports utf16.  that is codepoints u+0000 to u+ffff.

To be pedantic, UTF-16 has the ability to represent characters in the
'astral planes' via surrogate pairs (pairs of code units in the range
U+D800–U+DFFF); Plan 9's charset is approximately UCS-2.

Java has the same trouble; its astral plane characters are first
encoded as UTF-16 surrogate pairs, then those 16-bit values are
encoded as UTF-8.
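
A small sketch of the two steps described above, using U+10000 as the
example.  The helper names surrogate_pair and three_byte are
hypothetical.  The surrogate-then-UTF-8 scheme is the one commonly
called CESU-8, and it is how Java's "modified UTF-8" handles
supplementary characters.

```python
def surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into UTF-16 surrogates."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                          # 20-bit value
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

def three_byte(u):
    """Encode a 16-bit value with the 3-byte UTF-8 bit pattern."""
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])

hi, lo = surrogate_pair(0x10000)
print(hex(hi), hex(lo))                          # 0xd800 0xdc00
print((three_byte(hi) + three_byte(lo)).hex())   # eda080edb080
print(chr(0x10000).encode("utf-8").hex())        # f0908080
```

The same character therefore takes six bytes when each surrogate is
encoded separately, against four bytes in standard UTF-8.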

> to support larger characters, the starting point would be changing Rune
> from ushort to ulong and changing constants like UTFmax and fixing
> chartorune and runetochar.  (and finding all the places that assume that
> UTFmax really is 3.)
>
> it's all very doable, but it would be a very invasive change.

Not really, since only the 2²⁰+2¹⁶ values from 0–0x10FFFF are needed
and UTFmax only needs to go up to 4.  An advantage would be that
out-of-band symbols like EOF and yacc terminals could be represented
in the same data type as the characters.
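
The arithmetic can be checked directly.  The marker names EOF_RUNE and
FIRST_TOKEN below are hypothetical, chosen only to illustrate how
out-of-band values could sit above the last code point in a wider
rune type.

```python
RUNE_MAX = 0x10FFFF                  # last Unicode code point

# 16 supplementary planes (2**20 values) plus the BMP (2**16 values).
total = RUNE_MAX + 1
print(total == 2**20 + 2**16)        # True
print(RUNE_MAX.bit_length())         # 21

# Hypothetical out-of-band markers: anything above RUNE_MAX still fits
# in a 32-bit integer and cannot collide with a real character.
EOF_RUNE = RUNE_MAX + 1
FIRST_TOKEN = RUNE_MAX + 2

def is_character(r):
    """True if r is in the Unicode code space rather than out of band."""
    return 0 <= r <= RUNE_MAX

print(is_character(0x10000), is_character(EOF_RUNE))   # True False
```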

On the other hand, there are more useful bits of Unicode that are
unimplemented in Plan 9.  Mañana (as in /sys/doc/utf.{html,ps,pdf})
never did come.

--Joel

