[9fans] hard-coded UTF-8 in wc.c

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] hard-coded UTF-8 in wc.c
From: anonymous @ 2010-03-15 21:02 UTC
  To: 9fans

I just looked at the source of wc
(http://plan9.bell-labs.com/sources/plan9/sys/src/cmd/wc.c). UTF-8
is hard-coded there. What is the reason? Does nobody want to rewrite
it, is it an optimization, or is it impossible to rewrite it using
runes for some reason?

http://plan9.bell-labs.com/sys/doc/utf.html says all you need to do to
change encoding is:
1. Rewrite UTF encoding/decoding code.
2. Convert all text files.
3. Recompile all software.

It looks like that is impossible with the current code. Has it not
been fixed just because there is more important work, or is there
some serious problem in the design?

* Re: [9fans] hard-coded UTF-8 in wc.c
From: erik quanstrom @ 2010-03-15 21:13 UTC
  To: 9fans

On Mon Mar 15 17:12:06 EDT 2010, aim0shei@lavabit.com wrote:
> I just looked at the source of wc
> (http://plan9.bell-labs.com/sources/plan9/sys/src/cmd/wc.c). UTF-8
> is hard-coded there. What is the reason? Does nobody want to rewrite
> it, is it an optimization, or is it impossible to rewrite it using
> runes for some reason?
>
> http://plan9.bell-labs.com/sys/doc/utf.html says all you need to do to
> change encoding is:
> 1. Rewrite UTF encoding/decoding code.
> 2. Convert all text files.
> 3. Recompile all software.
>
> It looks like that is impossible with the current code. Has it not
> been fixed just because there is more important work, or is there
> some serious problem in the design?

perhaps you have misunderstood.

inside programs, sometimes unicode text is represented as
runes.  runes are not sent over pipes nor stored in files.

therefore, there is no need to wc runes.

- erik
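
To make the distinction concrete: a rune is an integer held in memory,
and only its UTF-8 encoding ever reaches a file or a pipe. The sketch
below is portable C written for illustration. encode_utf8 is a
hypothetical helper, not the Plan 9 runetochar, and it assumes the
caller passes a valid code point (it does not reject surrogates or
values above 0x10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: encode one code point as UTF-8 into out,
 * which must have room for 4 bytes. Returns the number of bytes
 * written. No validation of r is attempted.
 */
int
encode_utf8(unsigned long r, unsigned char *out)
{
	assert(out != NULL);
	if (r < 0x80) {			/* 0xxxxxxx */
		out[0] = (unsigned char)r;
		return 1;
	}
	if (r < 0x800) {		/* 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (r >> 6));
		out[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	if (r < 0x10000) {		/* 1110xxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xE0 | (r >> 12));
		out[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (r & 0x3F));
		return 3;
	}
	/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
	out[0] = (unsigned char)(0xF0 | (r >> 18));
	out[1] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (r & 0x3F));
	return 4;
}
```

Encoding U+20AC with this helper yields the three bytes E2 82 AC.
Those bytes, not the integer 0x20AC, are what a program like wc reads
from its input.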



* Re: [9fans] hard-coded UTF-8 in wc.c
From: anonymous @ 2010-03-15 21:35 UTC
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Mar 15, 2010 at 05:13:40PM -0400, erik quanstrom wrote:
> perhaps you have misunderstood.
>
> inside programs, sometimes unicode text is represented as
> runes.  runes are not sent over pipes nor stored in files.
>
> therefore, there is no need to wc runes.

Yes, but why does the wc utility count runes (wc(1) calls them runes)
manually using a huge table instead of using functions from rune(3)
such as utflen?
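
For comparison with the table in wc.c, here is a sketch of what rune
counting without a table can look like. It is a swapped-in technique,
not utflen and not what wc does: it counts every byte that is not a
UTF-8 continuation byte (10xxxxxx). count_runes is a hypothetical
name, and the function assumes well-formed input and performs no
validation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count code points in n bytes of UTF-8 by
 * counting the bytes that start a sequence, i.e. every byte that
 * is not of the form 10xxxxxx. Malformed input is not detected.
 */
size_t
count_runes(const unsigned char *s, size_t n)
{
	size_t i, runes;

	assert(s != NULL || n == 0);
	runes = 0;
	for (i = 0; i < n; i++)
		if ((s[i] & 0xC0) != 0x80)
			runes++;
	return runes;
}
```

This is short, but it tracks no state between bytes, so it has
nowhere to hang the other bookkeeping a word counter needs and cannot
tell a truncated sequence from a complete one.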




* Re: [9fans] hard-coded UTF-8 in wc.c
From: erik quanstrom @ 2010-03-15 22:44 UTC
  To: 9fans

On Mon Mar 15 17:46:11 EDT 2010, aim0shei@lavabit.com wrote:
> On Mon, Mar 15, 2010 at 05:13:40PM -0400, erik quanstrom wrote:
> > perhaps you have misunderstood.
> >
> > inside programs, sometimes unicode text is represented as
> > runes.  runes are not sent over pipes nor stored in files.
> >
> > therefore, there is no need to wc runes.
>
> Yes, but why does the wc utility count runes (wc(1) calls them runes)
> manually using a huge table instead of using functions from rune(3)
> such as utflen?

i didn't write wc, but i would imagine that it's for speed.

- erik



* [9fans] hard-coded UTF-8 in wc.c
From: erik quanstrom @ 2010-12-29  4:40 UTC
  To: 9fans

this just popped up when i was searching the archive.

On Mon 15 Mar 2010 18:44:41 EST 2010, quanstro@quanstro.net wrote:
> On Mon Mar 15 17:46:11 EDT 2010, aim0shei@lav... wrote:
> > Yes, but why does the wc utility count runes (wc(1) calls them runes)
> > manually using a huge table instead of using functions from rune(3)
> > such as utflen?
>
> i didn't write wc, but i would imagine that it's for speed.

i took some time a few weeks ago to extend wc to handle runes
up to 0x10ffff which meant adding 3 states for 4-byte runes and
adding an additional table.  with that perspective ...

wc is a big state machine.  using the rune functions would hide
a good deal of the state machine, which would make the states
harder to understand and some of this work would need to be redone.
the tables are actually really easy to understand and generate.
wikipedia has a discussion of the bit patterns which can help.

- erik
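
The idea of a table generated from the UTF-8 bit patterns, driving a
small state machine, can be sketched as follows. This is not the
table layout used in wc.c. The byte classes, the transition table and
the name count_runes_fsm are all assumptions made for illustration.
The state is simply the number of continuation bytes still expected,
so 4-byte sequences add a state 3 and a LEAD4 class. The sketch does
not reject overlong encodings, surrogates, or values above 0x10ffff.

```c
#include <assert.h>
#include <stddef.h>

enum { ASCII, CONT, LEAD2, LEAD3, LEAD4, BAD };	/* byte classes */

static unsigned char class_of[256];

/* Generate the class table from the lead-byte bit patterns. */
static void
init_classes(void)
{
	int c;

	for (c = 0; c < 256; c++) {
		if (c < 0x80)
			class_of[c] = ASCII;	/* 0xxxxxxx */
		else if (c < 0xC0)
			class_of[c] = CONT;	/* 10xxxxxx */
		else if (c < 0xE0)
			class_of[c] = LEAD2;	/* 110xxxxx */
		else if (c < 0xF0)
			class_of[c] = LEAD3;	/* 1110xxxx */
		else if (c < 0xF8)
			class_of[c] = LEAD4;	/* 11110xxx */
		else
			class_of[c] = BAD;
	}
}

/*
 * transition[state][class]: state is the number of continuation
 * bytes still expected (0..3); -1 marks a malformed sequence.
 */
static const signed char transition[4][6] = {
	/*	 ASCII CONT LEAD2 LEAD3 LEAD4 BAD */
	/* 0 */	{  0,   -1,   1,    2,    3,   -1 },
	/* 1 */	{ -1,    0,  -1,   -1,   -1,   -1 },
	/* 2 */	{ -1,    1,  -1,   -1,   -1,   -1 },
	/* 3 */	{ -1,    2,  -1,   -1,   -1,   -1 },
};

/* Count complete sequences in s; *bad receives the malformed count. */
size_t
count_runes_fsm(const unsigned char *s, size_t n, size_t *bad)
{
	size_t i, runes;
	int state, t;

	assert(bad != NULL);
	assert(s != NULL || n == 0);
	init_classes();
	*bad = 0;
	runes = 0;
	state = 0;
	for (i = 0; i < n; i++) {
		t = transition[state][class_of[s[i]]];
		if (t < 0 && state != 0) {
			/* sequence cut short: note it, retry this byte from state 0 */
			(*bad)++;
			state = 0;
			t = transition[0][class_of[s[i]]];
		}
		if (t < 0) {
			(*bad)++;	/* stray continuation or invalid byte */
			state = 0;
			continue;
		}
		if (t == 0)
			runes++;	/* a complete sequence just ended */
		state = t;
	}
	if (state != 0)
		(*bad)++;		/* input ended in mid-sequence */
	return runes;
}
```

Because every state is visible in one table, a real word counter
could add further states (in a word, at a newline) alongside the
decoding states, which is harder to do when the decoding is hidden
inside a library call.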
