9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] plan9 and the Unicode Consortium definitions
@ 2005-08-19 15:23 Dimitry Golubovsky
  0 siblings, 0 replies; 5+ messages in thread
From: Dimitry Golubovsky @ 2005-08-19 15:23 UTC (permalink / raw)
  To: mirtchov; +Cc: 9fans

Andrey,

Andrey wrote:

>> I am just wondering whether any API to access more complete set of
>> character properties defined by Unicode.org is available in Plan9.

>you mean things like diacritics?

I mean character categories defined in 

http://www.unicode.org/Public/4.1.0/ucd/UCD.html#General_Category_Values

Abbr.  Description
 
Lu Letter, Uppercase 
Ll Letter, Lowercase 
Lt Letter, Titlecase 
Lm Letter, Modifier 
Lo Letter, Other 
Mn Mark, Nonspacing 
Mc Mark, Spacing Combining 
Me Mark, Enclosing 
Nd Number, Decimal Digit 

and so on, about 30 in total. The isxxxrune functions distinguish only 5 categories.
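
For illustration, a lookup that returns a General Category instead of a yes/no answer could be sketched in C roughly as follows. Everything named here (Rune32, the Category enum, cattab, runecategory) is hypothetical and not part of Plan 9's libc, and the table is a small hand-picked excerpt, not generated data.

```c
#include <stddef.h>

/* Hypothetical wide rune type; not Plan 9's Rune. */
typedef unsigned int Rune32;

/* A small subset of the ~30 General Category values. Cn = not in table. */
enum Category { Cn, Lu, Ll, Nd, Mn };

typedef struct CatRange CatRange;
struct CatRange {
	Rune32 lo, hi;		/* inclusive bounds */
	enum Category cat;
};

/* Illustrative excerpt only; must stay sorted by lo, with no overlaps. */
static const CatRange cattab[] = {
	{ 0x0030, 0x0039, Nd },	/* 0-9 */
	{ 0x0041, 0x005A, Lu },	/* A-Z */
	{ 0x0061, 0x007A, Ll },	/* a-z */
	{ 0x0300, 0x036F, Mn },	/* combining diacritical marks */
	{ 0x0410, 0x042F, Lu },	/* Cyrillic capital letters */
	{ 0x0430, 0x044F, Ll },	/* Cyrillic small letters */
};

/* Binary search over the sorted ranges; Cn if r falls in no range. */
enum Category
runecategory(Rune32 r)
{
	size_t lo = 0, hi = sizeof cattab / sizeof cattab[0];

	while(lo < hi){
		size_t mid = lo + (hi - lo)/2;

		if(r < cattab[mid].lo)
			hi = mid;
		else if(r > cattab[mid].hi)
			lo = mid + 1;
		else
			return cattab[mid].cat;
	}
	return Cn;
}
```

With a complete table, a predicate in the style of the isxxxrune functions would reduce to a comparison on the returned category, for example runecategory(r) == Mn for nonspacing marks.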

This would probably include diacritics, but my question was more
general (maybe even philosophical): Unicode.org recommends a set of
character properties, APIs, and interfaces. Plan 9, which probably
influenced how some aspects of Unicode were implemented in other
systems, does not follow them. Is there any historical, political,
technical, or other reason? The related man pages do mention "The
Unicode Standard" in their SEE ALSO sections, though.

What interests me more technically (as I asked in my first
message): is the 16-bitness of runes hardcoded anywhere in the kernel,
or only in libc?

-- 
Dimitry Golubovsky

Anywhere on the Web



* Re: [9fans] plan9 and the Unicode Consortium definitions
  2005-08-19 14:51 Dimitry Golubovsky
  2005-08-19 15:00 ` Christoph Lohmann
  2005-08-19 15:03 ` andrey mirtchovski
@ 2005-08-19 15:29 ` Rob Pike
  2 siblings, 0 replies; 5+ messages in thread
From: Rob Pike @ 2005-08-19 15:29 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

The addition of surrogates was a serious error of judgement for Unicode,
in my opinion.  As I understand it, the original idea for Unicode was to
have a useful 16-bit subset of the 32-bit ISO 10646 standard, yet now
we see Unicode growing until it no longer fits in 16 bits, in order to
include some politically expedient characters.  This is unwise.  It's a
fact of life, but it's unwise.

As far as Plan 9 is concerned, it shouldn't be too hard to cope with
surrogates but the solution will be to bump Rune to unsigned int from
unsigned short.  It's not worth doing until everything else, for instance
Java, has made a similar jump.  I don't see a lot of pressure to pull
the systems in line with the standard.  Everyone is annoyed.
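
As a rough sketch of what coping with surrogates involves once a rune is wider than 16 bits, the C below combines a UTF-16 surrogate pair into a single code point. Rune32, Runeerror32, and surrogatepair are hypothetical names chosen for this example; they are not Plan 9 identifiers, and using U+FFFD as the error value is an assumption.

```c
/* Hypothetical 32-bit rune, standing in for "Rune bumped to unsigned int". */
typedef unsigned int Rune32;

enum {
	SurrHiMin = 0xD800,	/* first high (leading) surrogate */
	SurrHiMax = 0xDBFF,	/* last high surrogate */
	SurrLoMin = 0xDC00,	/* first low (trailing) surrogate */
	SurrLoMax = 0xDFFF,	/* last low surrogate */
	Runeerror32 = 0xFFFD,	/* assumed error value: REPLACEMENT CHARACTER */
};

/*
 * Combine a high and a low surrogate into a code point in the range
 * U+10000..U+10FFFF. Returns Runeerror32 if either half is out of range.
 */
Rune32
surrogatepair(unsigned short hi, unsigned short lo)
{
	if(hi < SurrHiMin || hi > SurrHiMax || lo < SurrLoMin || lo > SurrLoMax)
		return Runeerror32;
	return 0x10000 + (((Rune32)(hi - SurrHiMin) << 10) | (Rune32)(lo - SurrLoMin));
}
```

Each surrogate contributes 10 bits, so a pair covers the 2^20 code points above U+FFFF, none of which fit in an unsigned short.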

Plan 9's libraries provide a modest, convenient subset of the standard.
They could use an updating to (the non-surrogate part of) Unicode 3.0.
It's not hard; I've seen code to auto-generate the appropriate tables from
the Unicode data set.  I'll see about digging them up.  If you need the
full monstrosity, though, you may need to accept something like ICU.
It's instructive to compare the magnitude of software growth that
would entail.

All that aside, Plan 9 doesn't do some things it should.  For instance,
it should canonicalize all characters to separate out the diacritics and
merge them on display.  At the moment, it doesn't handle diacritics at
all.
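
To make "separate out the diacritics" concrete, here is a minimal C sketch that splits a precomposed character into a base letter plus one combining mark. The names (Rune32, Decomp, decomptab, decompose) are hypothetical and the four-entry table is only an excerpt. Real canonical decomposition uses data derived from UnicodeData.txt, is applied recursively, and can yield more than two code points.

```c
#include <stddef.h>

/* Hypothetical wide rune type; not Plan 9's Rune. */
typedef unsigned int Rune32;

typedef struct Decomp Decomp;
struct Decomp {
	Rune32 composed;	/* precomposed character */
	Rune32 base;		/* base letter */
	Rune32 mark;		/* combining mark that follows the base */
};

/* Tiny illustrative excerpt of canonical decompositions. */
static const Decomp decomptab[] = {
	{ 0x00C9, 0x0045, 0x0301 },	/* E with acute = E + combining acute */
	{ 0x00E9, 0x0065, 0x0301 },	/* e with acute = e + combining acute */
	{ 0x00F1, 0x006E, 0x0303 },	/* n with tilde = n + combining tilde */
	{ 0x00FC, 0x0075, 0x0308 },	/* u with diaeresis = u + combining diaeresis */
};

/*
 * Write the decomposition of r into out and return the number of
 * runes written: 2 if r is in the table, otherwise 1 (r unchanged).
 */
int
decompose(Rune32 r, Rune32 out[2])
{
	size_t i;

	for(i = 0; i < sizeof decomptab / sizeof decomptab[0]; i++)
		if(decomptab[i].composed == r){
			out[0] = decomptab[i].base;
			out[1] = decomptab[i].mark;
			return 2;
		}
	out[0] = r;
	return 1;
}
```

Merging on display would be the reverse step: the renderer draws the mark over the preceding base glyph, which this sketch does not attempt.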

-rob



* Re: [9fans] plan9 and the Unicode Consortium definitions
  2005-08-19 14:51 Dimitry Golubovsky
  2005-08-19 15:00 ` Christoph Lohmann
@ 2005-08-19 15:03 ` andrey mirtchovski
  2005-08-19 15:29 ` Rob Pike
  2 siblings, 0 replies; 5+ messages in thread
From: andrey mirtchovski @ 2005-08-19 15:03 UTC (permalink / raw)
  To: 9fans

> I am just wondering whether any API to access more complete set of
> character properties defined by Unicode.org is available in Plan9.

you mean things like diacritics?

i think the answer is no:

http://pages.cpsc.ucalgary.ca/~mirtchov/screenshots/x-p9-diacritics.png

(the upper one is xterm in unicode mode, the lower is Plan 9)




* Re: [9fans] plan9 and the Unicode Consortium definitions
  2005-08-19 14:51 Dimitry Golubovsky
@ 2005-08-19 15:00 ` Christoph Lohmann
  2005-08-19 15:03 ` andrey mirtchovski
  2005-08-19 15:29 ` Rob Pike
  2 siblings, 0 replies; 5+ messages in thread
From: Christoph Lohmann @ 2005-08-19 15:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Good day.

On Fri, 19 Aug 2005 10:51:07 -0400
Dimitry Golubovsky <golubovsky@gmail.com> wrote:

> PS I looked at the sources mirror at 9grid.de, and manpages at the
> Bell Labs website. Outdated?

From this morning.

Sincerely,

Christoph



* [9fans] plan9 and the Unicode Consortium definitions
@ 2005-08-19 14:51 Dimitry Golubovsky
  2005-08-19 15:00 ` Christoph Lohmann
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Dimitry Golubovsky @ 2005-08-19 14:51 UTC (permalink / raw)
  To: 9fans

I am just wondering whether any API to access more complete set of
character properties defined by Unicode.org is available in Plan9. So
far I have seen only library functions like isalpharune(2), defined in
runetype.c, and these do not cover all the character categories defined
by the Unicode Consortium. Something might be expected in Section
7 of the man pages, might it not?

BTW I've got some code I wrote earlier for Hugs and the Glasgow Haskell
Compiler, which is autogenerated from UnicodeData.txt (runetype.c
seems to be hardcoded by hand, or at least there is nothing in the
mkfile that shows how it was generated). If there is any interest, I
can send a link. My code is based on the same principles as I see in
runetype.c: binary search over sorted lists of character ranges.
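
A minimal sketch of how such range tables might be generated, written in C instead of Haskell and not taken from the code mentioned above: parse the code point and General Category from each semicolon-separated line of UnicodeData.txt, then merge consecutive code points that share a category into ranges. The names parseucd, addcp, and UcdRange are hypothetical. The sketch ignores the "First>"/"Last>" line pairs that UnicodeData.txt uses for large blocks, and the caller must supply a table with enough room.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct UcdRange UcdRange;
struct UcdRange {
	unsigned long lo, hi;	/* inclusive bounds */
	char cat[3];		/* two-letter General Category, NUL-terminated */
};

/*
 * Parse one UnicodeData.txt line: field 0 is the code point in hex,
 * field 1 the name, field 2 the two-letter General Category.
 * Returns 0 on success, -1 on a malformed line.
 */
int
parseucd(const char *line, unsigned long *cp, char cat[3])
{
	char *end;
	const char *p;

	*cp = strtoul(line, &end, 16);
	if(end == line || *end != ';')
		return -1;
	p = strchr(end + 1, ';');	/* skip the name field */
	if(p == NULL)
		return -1;
	p++;
	if(p[0] == '\0' || p[1] == '\0' || p[2] != ';')
		return -1;
	cat[0] = p[0];
	cat[1] = p[1];
	cat[2] = '\0';
	return 0;
}

/*
 * Add code point cp with a two-letter category cat to tab, which holds
 * n ranges. Extends the last range when cp directly follows it with the
 * same category; otherwise starts a new range. Returns the new count.
 * Code points must arrive in increasing order, as they do in the file.
 */
size_t
addcp(UcdRange *tab, size_t n, unsigned long cp, const char *cat)
{
	if(n > 0 && tab[n-1].hi + 1 == cp && strcmp(tab[n-1].cat, cat) == 0){
		tab[n-1].hi = cp;
		return n;
	}
	tab[n].lo = tab[n].hi = cp;
	tab[n].cat[0] = cat[0];
	tab[n].cat[1] = cat[1];
	tab[n].cat[2] = '\0';
	return n + 1;
}
```

Because the resulting table is sorted and non-overlapping, it can be emitted as a C array and searched with the same binary search used in runetype.c.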

Another question: is (historical) 16-bitness of runes a limitation of
the C runtime library only, or is the kernel rune-size-aware, too?
Because what Unicode.org defines is wider than 16 bits, as everybody
knows.

Unless there is any intentional divergence from the Unicode.org definitions.

PS I looked at the sources mirror at 9grid.de, and manpages at the
Bell Labs website. Outdated?

-- 
Dimitry Golubovsky

Anywhere on the Web

