[9fans] simplicity

9fans - fans of the OS Plan 9 from Bell Labs

* [9fans] simplicity
@ 2007-09-16 18:55 Francisco J Ballesteros
  2007-09-16 20:42 ` Anant Narayanan
                   ` (4 more replies)
  0 siblings, 5 replies; 40+ messages in thread
From: Francisco J Ballesteros @ 2007-09-16 18:55 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Some time ago, Ron said

> I know we have some faculty on this list. Please talk to your students :-)

regarding the madness of making complex software (that time, it was
about configure).

I have allocated half of the presentation lecture for this semester to
"Why does this matter at all". Among other things,
I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture.

Any other suggestion?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [9fans] simplicity
  2007-09-16 18:55 [9fans] simplicity Francisco J Ballesteros
@ 2007-09-16 20:42 ` Anant Narayanan
  2007-09-16 21:24   ` Francisco J Ballesteros
  2007-09-16 20:43 ` roger peppe
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 40+ messages in thread
From: Anant Narayanan @ 2007-09-16 20:42 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> I have allocated half of the presentation lecture for this semester to
> "Why does this matter at all". Among other things,
> I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture.
>
> Any other suggestion?

Please do put up the slides online, if possible, for the benefit of  
the students on this list :)

--
Anant


* Re: [9fans] simplicity
  2007-09-16 18:55 [9fans] simplicity Francisco J Ballesteros
  2007-09-16 20:42 ` Anant Narayanan
@ 2007-09-16 20:43 ` roger peppe
  2007-09-16 20:53   ` Steve Simon
  2007-09-17 20:00   ` Scott Schwartz
  2007-09-17  3:23 ` erik quanstrom
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 40+ messages in thread
From: roger peppe @ 2007-09-16 20:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> I have allocated half of the presentation lecture for this semester to
> "Why does this matter at all". Among other things,
> I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture.
>
> Any other suggestion?

comparing documentation can be instructive - e.g. all the unix socket
calls vs. plan 9's dial(2) - maybe get them to write a network dialler
from first principles using both interfaces.

it might be a problem trying to illustrate just why complex software can be
so maddening - i think that insight only really comes with the experience
of trying to maintain and transform one's own (and others') software, along
with the realisation of just how much time is spent maintaining software vs.
the time writing it in the first place.
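
Editorial sketch, not part of the original message: one way to see the appeal of the dial(2) interface is to imitate its single-string address form, "net!host!service", on top of BSD sockets. The names `parse_dial` and `dial` below are hypothetical helpers written for illustration, and the sketch handles only TCP and only the full three-field form, which is narrower than what the real dial(2) accepts.

```python
import socket


def parse_dial(addr: str) -> tuple[str, str, str]:
    """Split a Plan 9 style address "net!host!service" into its three fields."""
    parts = addr.split("!")
    if len(parts) != 3:
        raise ValueError(f"expected net!host!service, got {addr!r}")
    net, host, service = parts
    return net, host, service


def dial(addr: str, timeout: float = 5.0) -> socket.socket:
    """Hypothetical dial-like helper over BSD sockets (TCP only in this sketch)."""
    net, host, service = parse_dial(addr)
    if net != "tcp":
        raise ValueError(f"this sketch only handles tcp, not {net!r}")
    # Accept either a numeric port or a service name such as "http".
    port = int(service) if service.isdigit() else socket.getservbyname(service, "tcp")
    # create_connection performs the resolve / socket / connect sequence.
    return socket.create_connection((host, port), timeout=timeout)
```

A caller would write something like `conn = dial("tcp!example.com!80")`; the address, protocol, and service all travel in one string instead of being spread over several structures and calls.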


* Re: [9fans] simplicity
  2007-09-16 20:43 ` roger peppe
@ 2007-09-16 20:53   ` Steve Simon
  2007-09-17 15:22     ` Douglas A. Gwyn
  2007-09-17 20:00   ` Scott Schwartz
  1 sibling, 1 reply; 40+ messages in thread
From: Steve Simon @ 2007-09-16 20:53 UTC (permalink / raw)
  To: 9fans

Top of my over-complex list would be configure.

-Steve



* Re: [9fans] simplicity
  2007-09-16 20:42 ` Anant Narayanan
@ 2007-09-16 21:24   ` Francisco J Ballesteros
  2007-09-17 15:22     ` Douglas A. Gwyn
  0 siblings, 1 reply; 40+ messages in thread
From: Francisco J Ballesteros @ 2007-09-16 21:24 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

the "slides" are a bunch of programs. In fact, I use a terminal to
compile and run programs from the 9.intro.pdf book. I introduce mistakes
and show the consequences, and then I fix them.

In this particular course, I use slides just for the introduction
class. I'll put them on the web once we update the web pages for the
semester.




* Re: [9fans] simplicity
  2007-09-16 18:55 [9fans] simplicity Francisco J Ballesteros
  2007-09-16 20:42 ` Anant Narayanan
  2007-09-16 20:43 ` roger peppe
@ 2007-09-17  3:23 ` erik quanstrom
  2007-09-17 15:22   ` Douglas A. Gwyn
  2007-09-17 14:52 ` ron minnich
  2007-09-17 14:53 ` ron minnich
  4 siblings, 1 reply; 40+ messages in thread
From: erik quanstrom @ 2007-09-17  3:23 UTC (permalink / raw)
  To: 9fans

>> I know we have some faculty on this list. Please talk to your students :-)
> 
> regarding the madness of making complex software (that time, it was
> about configure).
> 
> I have allocated half of the presentation lecture for this semester to
> "Why does this matter at all". Among other things,
> I'll be comparing gnu cat.c with plan 9 cat.c, so they get the picture.
> 
> Any other suggestion?

i think the devolution of gnu grep is quite instructive.  once upon a time
it was simple and very fast.  (thanks, mike.)  today it is neither.

the last time i tried to fix a utf-8 problem (it was 80 times slower
processing utf8 than ascii), i gave up after encountering dozens of
if(special char set){fast version}else{slow version} constructions.

it gets to the heart of why plan9's invention and use (thanks, rob and ken) of
utf-8 is so great.

and speaking of regular expressions, one could use russ' excellent work
on perl regular expressions vs. plan 9 regular expressions to talk about
how seemingly straightforward extensions are not always Mostly Harmless;
complexity is a sneaky thing.

- erik




* Re: [9fans] simplicity
  2007-09-16 18:55 [9fans] simplicity Francisco J Ballesteros
                   ` (2 preceding siblings ...)
  2007-09-17  3:23 ` erik quanstrom
@ 2007-09-17 14:52 ` ron minnich
  2007-09-17 14:53 ` ron minnich
  4 siblings, 0 replies; 40+ messages in thread
From: ron minnich @ 2007-09-17 14:52 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 9/16/07, Francisco J Ballesteros <nemo@lsub.org> wrote:

> Any other suggestion?
>
ELF prelinking (on, e.g., FC7)

how to take a bad decision and make it worse

ron



* Re: [9fans] simplicity
  2007-09-16 18:55 [9fans] simplicity Francisco J Ballesteros
                   ` (3 preceding siblings ...)
  2007-09-17 14:52 ` ron minnich
@ 2007-09-17 14:53 ` ron minnich
  4 siblings, 0 replies; 40+ messages in thread
From: ron minnich @ 2007-09-17 14:53 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

oh, yeah, the utf8 example is great.

abiword used to be fast, before internationalization. Now it is so slow
as to be totally useless.

ron



* Re: [9fans] simplicity
  2007-09-16 20:53   ` Steve Simon
@ 2007-09-17 15:22     ` Douglas A. Gwyn
  0 siblings, 0 replies; 40+ messages in thread
From: Douglas A. Gwyn @ 2007-09-17 15:22 UTC (permalink / raw)
  To: 9fans

Steve Simon wrote:
> Top of my over-complex list would be configure.

My experience with configure is that it seldom selects the compiler
I wanted to use, for some reason preferring the Gnu software even
though the conventional Unix versions work at least as well for the
purpose.


* Re: [9fans] simplicity
  2007-09-16 21:24   ` Francisco J Ballesteros
@ 2007-09-17 15:22     ` Douglas A. Gwyn
  0 siblings, 0 replies; 40+ messages in thread
From: Douglas A. Gwyn @ 2007-09-17 15:22 UTC (permalink / raw)
  To: 9fans

Francisco J Ballesteros wrote:
> the "slides" are a bunch of programs. In fact, I use a terminal to
> compile and run
> programs from the 9.intro.pdf book. ...

By the way, I've been reading through that book in my spare time,
and it's a pretty good resource.



* Re: [9fans] simplicity
  2007-09-17  3:23 ` erik quanstrom
@ 2007-09-17 15:22   ` Douglas A. Gwyn
  2007-09-17 15:55     ` erik quanstrom
  2007-09-18 15:27     ` Rob Pike
  0 siblings, 2 replies; 40+ messages in thread
From: Douglas A. Gwyn @ 2007-09-17 15:22 UTC (permalink / raw)
  To: 9fans

erik quanstrom wrote:
> i think the devolution of gnu grep is quite instructive.  ...
> it gets to the heart of why plan9's invention and use (thanks, rob and ken) of
> utf-8 is so great.

If the problem is that Gnu grep converts any non-8-bit character set
to wchar_t (the equivalent of Plan 9 "rune"), then it's not really a
fair criticism of the software.  The conversion approach handles a
wide variety of character encoding schemes, whereas grepping the
encodings directly (the fast approach) doesn't work well for many
non-UTF-8 encodings.


* Re: [9fans] simplicity
  2007-09-17 15:22   ` Douglas A. Gwyn
@ 2007-09-17 15:55     ` erik quanstrom
  2007-09-18  8:38       ` Douglas A. Gwyn
  2007-09-18 15:27     ` Rob Pike
  1 sibling, 1 reply; 40+ messages in thread
From: erik quanstrom @ 2007-09-17 15:55 UTC (permalink / raw)
  To: 9fans

> erik quanstrom wrote:
> > i think the devolution of gnu grep is quite instructive.  ...
> > it gets to the heart of why plan9's invention and use (thanks, rob and ken) of
> > utf-8 is so great.
>
> If the problem is that Gnu grep converts any non-8-bit character set
> to wchar_t (the equivalent of Plan 9 "rune"), then it's not really a
> fair criticism of the software.  The conversion approach handles a
> wide variety of character encoding schemes, whereas grepping the
> encodings directly (the fast approach) doesn't work well for many
> non-UTF-8 encodings.

performance may suck, but that's just a symptom of a bigger problem.

wchar_t is not the equivalent of Rune.  Rune is always utf-8.  wchar_t
can be whatever.

this is not a feature.  it is a bug.

suppose Linux user a and user b grep the same "text" file for the same string.
results will depend on the users' locales.

contrast plan 9.  any two users grepping the same file for the same string
will get the same results.

in either case a character set conversion might be necessary to match
the locale.  but in the plan 9 case, one conversion will fix things for
any plan 9 user.  in the Linux case, there is no conversion that will fix
things for any Linux user.

- erik

p.s. gnu grep does special-case utf-8 and avoids wchar_t conversions
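
Editorial sketch, not part of the original message: the locale argument above can be reproduced in a few lines. The same bytes are searched for the same string, and the answer changes with the encoding used to interpret them. The function name `grep`, the sample data, and the two encodings (UTF-8 and Latin-1) are assumptions chosen for illustration; this is not how any real grep is implemented.

```python
# The bytes stored in the "text" file; they happen to be UTF-8.
DATA = "año\nano\n".encode("utf-8")


def grep(pattern: str, raw: bytes, encoding: str) -> list[str]:
    """Return the lines that contain pattern after decoding raw with encoding."""
    text = raw.decode(encoding, errors="replace")
    return [line for line in text.splitlines() if pattern in line]


# A user whose locale says UTF-8 finds the line; a Latin-1 user does not,
# because the two bytes of "ñ" are read as two unrelated characters.
utf8_user = grep("ñ", DATA, "utf-8")
latin1_user = grep("ñ", DATA, "latin-1")
```

With a single system-wide encoding there is only one possible decoding, so both users necessarily get the same answer.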


* Re: [9fans] simplicity
  2007-09-16 20:43 ` roger peppe
  2007-09-16 20:53   ` Steve Simon
@ 2007-09-17 20:00   ` Scott Schwartz
  1 sibling, 0 replies; 40+ messages in thread
From: Scott Schwartz @ 2007-09-17 20:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

In my experience, the one thing that really gets Plan 9 across to people
is the telco server.  That's an example of something that you can't nicely
do in Unix, and that exhibits power and elegance as a consequence of a
few basic design choices.


* Re: [9fans] simplicity
  2007-09-17 15:55     ` erik quanstrom
@ 2007-09-18  8:38       ` Douglas A. Gwyn
  2007-09-18 10:45         ` dave.l
  2007-10-10  3:30         ` Jack Johnson
  0 siblings, 2 replies; 40+ messages in thread
From: Douglas A. Gwyn @ 2007-09-18  8:38 UTC (permalink / raw)
  To: 9fans

erik quanstrom wrote:
> wchar_t is not the equivalent of Rune.  Rune is always utf-8.  wchar_t
> can be whatever.

I could have sworn that Plan 9 "rune" is used to contain a Unicode
value (UCS-2).  wchar_t can do the same thing, and does on some
platforms.  On others, wchar_t holds a full 31-bit UCS-4 code, and
on others (Solaris for example) its encoding is locale-dependent
(which I would agree is not a good design).

> suppose Linux user a and user b grep the same "text" file for the same string.
> results will depend on the users' locales.

But if they're trying to match an alphabetic character class, the
result *should* depend on the locale.


* Re: [9fans] simplicity
  2007-09-18  8:38       ` Douglas A. Gwyn
@ 2007-09-18 10:45         ` dave.l
  2007-09-18 14:44           ` Iruata Souza
  2007-10-10  3:30         ` Jack Johnson
  1 sibling, 1 reply; 40+ messages in thread
From: dave.l @ 2007-09-18 10:45 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

>But if they're trying to match an alphabetic character class, the
>result *should* depend on the locale.

... so what *should* the result be if the locale specifies an ideographic script?

DaveL



* Re: [9fans] simplicity
  2007-09-18 10:45         ` dave.l
@ 2007-09-18 14:44           ` Iruata Souza
  2007-09-18 15:41             ` Douglas A. Gwyn
  0 siblings, 1 reply; 40+ messages in thread
From: Iruata Souza @ 2007-09-18 14:44 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 9/18/07, dave.l@mac.com <dave.l@mac.com> wrote:
> >But if they're trying to match an alphabetic character class, the
> >result *should* depend on the locale.
>
> ... so what *should* the result be if the locale specifies an ideographic script?
>
> DaveL
>

the result *should* be 'now go and use plan 9'

iru



* Re: [9fans] simplicity
  2007-09-17 15:22   ` Douglas A. Gwyn
  2007-09-17 15:55     ` erik quanstrom
@ 2007-09-18 15:27     ` Rob Pike
  2007-09-18 15:38       ` Uriel
  1 sibling, 1 reply; 40+ messages in thread
From: Rob Pike @ 2007-09-18 15:27 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 9/17/07, Douglas A. Gwyn <DAGwyn@null.net> wrote:
> erik quanstrom wrote:
> > i think the devolution of gnu grep is quite instructive.  ...
> > it gets to the heart of why plan9's invention and use (thanks, rob and ken) of
> > utf-8 is so great.
>
> If the problem is that Gnu grep converts any non-8-bit character set
> to wchar_t (the equivalent of Plan 9 "rune"), then it's not really a
> fair criticism of the software.  The conversion approach handles a
> wide variety of character encoding schemes, whereas grepping the
> encodings directly (the fast approach) doesn't work well for many
> non-UTF-8 encodings.

Well, on a 2GHz x86, gnu grep ran for me at about 9600 baud on an
ASCII file if I set my locale to the UTF-8 locale.  UTF-8 is ASCII
compatible - explicitly, publicly, and on purpose - so there is no
excuse for this sort of performance penalty.  To be specific, in
the UTF-8 locale it should take just a few instructions to convert
any character to wchar_t, ASCII or not, but gnu grep was calling
malloc for this, even for an ASCII byte.

It is a fair criticism to say this is unacceptable, whatever the
intentions of the authors may be.

-rob
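
Editorial sketch, not part of the original message: the claim that converting one UTF-8 character to a wide character takes only a few operations can be made concrete. The function below decodes a single UTF-8 sequence with shifts and masks and never allocates per character. `decode_rune` and `REPLACEMENT` are names invented for this sketch, and it deliberately skips some strictness checks (overlong three- and four-byte forms, surrogates) that a production decoder would add.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, returned for malformed input in this sketch


def decode_rune(buf: bytes, i: int = 0) -> tuple[int, int]:
    """Decode one UTF-8 sequence at buf[i]; return (code point, bytes consumed)."""
    b0 = buf[i]
    if b0 < 0x80:
        # ASCII fast path: the byte is the code point.
        return b0, 1
    if 0xC2 <= b0 <= 0xDF:
        n, cp = 2, b0 & 0x1F
    elif 0xE0 <= b0 <= 0xEF:
        n, cp = 3, b0 & 0x0F
    elif 0xF0 <= b0 <= 0xF4:
        n, cp = 4, b0 & 0x07
    else:
        return REPLACEMENT, 1  # stray continuation byte or invalid lead byte
    if i + n > len(buf):
        return REPLACEMENT, 1  # truncated sequence
    for b in buf[i + 1:i + n]:
        if (b & 0xC0) != 0x80:
            return REPLACEMENT, 1  # not a continuation byte
        cp = (cp << 6) | (b & 0x3F)  # append six payload bits
    return cp, n
```

For an ASCII byte the whole conversion is one comparison and a return, which is the case the message above says should cost almost nothing.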


* Re: [9fans] simplicity
  2007-09-18 15:27     ` Rob Pike
@ 2007-09-18 15:38       ` Uriel
  2007-09-19  8:50         ` Douglas A. Gwyn
                           ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Uriel @ 2007-09-18 15:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Don't complain, at least it is not producing random behaviour, I have
seen versions of gnu awk that when fed plain ASCII input, if the
locale was UTF-8, rules would match random lines of input, the fix?
set the locale to 'C' at the top of all your scripts (and don't even
think of dealing with files which actually contain non-ASCII UTF-8).

This was some years ago, it might be fixed by now, but it demonstrates
how the locale insanity makes life so much more fun.

And talking of simplicity, don't forget to mention X. By chance I just
found this gem in one of the many X headers:

#define NBBY    8       /* number of bits in a byte */

uriel





* Re: [9fans] simplicity
  2007-09-18 14:44           ` Iruata Souza
@ 2007-09-18 15:41             ` Douglas A. Gwyn
  2007-09-18 21:34               ` Iruata Souza
  0 siblings, 1 reply; 40+ messages in thread
From: Douglas A. Gwyn @ 2007-09-18 15:41 UTC (permalink / raw)
  To: 9fans

Iruata Souza wrote:
> On 9/18/07, dave.l@mac.com <dave.l@mac.com> wrote:
> > >But if they're trying to match an alphabetic character class, the
> > >result *should* depend on the locale.
> > ... so what *should* the result be if the locale specifies an ideographic script?
> the result *should* be 'now go and use plan 9'

That doesn't address the issue Dave L raised.

I don't know off hand what POSIX decreed for "character classes"
involving ideographs.  My guess is that they have to not count
as uppercase or lowercase, and probably not as alphabetic nor
alphanumeric.  You could ask similar questions about accented
characters in alphabet-based languages.  This isn't about
character coding so much as it is about classification.
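
Editorial sketch, not part of the original message: the distinction between coding and classification can be inspected directly. The snippet below asks Python's copy of the Unicode character database how it classifies an ideograph and an accented letter. This shows what the Unicode tables say, which is an assumption about one possible classification scheme and not a statement of what POSIX locales require. The helper name `classify` is hypothetical.

```python
import unicodedata


def classify(ch: str) -> dict[str, object]:
    """Report the Unicode general category and a few class predicates for ch."""
    return {
        "category": unicodedata.category(ch),  # e.g. "Lo" = letter, other
        "alpha": ch.isalpha(),
        "upper": ch.isupper(),
        "lower": ch.islower(),
    }


ideograph = classify("漢")  # a CJK ideograph
accented = classify("ñ")    # an accented letter from an alphabetic script
```

In these tables the ideograph counts as a letter with no case, while the accented character is an ordinary lowercase letter; neither answer depends on how the characters are encoded on disk.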


* Re: [9fans] simplicity
  2007-09-18 15:41             ` Douglas A. Gwyn
@ 2007-09-18 21:34               ` Iruata Souza
  0 siblings, 0 replies; 40+ messages in thread
From: Iruata Souza @ 2007-09-18 21:34 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 9/18/07, Douglas A. Gwyn <DAGwyn@null.net> wrote:
> Iruata Souza wrote:
> > On 9/18/07, dave.l@mac.com <dave.l@mac.com> wrote:
> > > >But if they're trying to match an alphabetic character class, the
> > > >result *should* depend on the locale.
> > > ... so what *should* the result be if the locale specifies an ideographic script?
> > the result *should* be 'now go and use plan 9'
>
> That doesn't address the issue Dave L raised.
>
I can't see why not.

iru



* Re: [9fans] simplicity
  2007-09-18 15:38       ` Uriel
@ 2007-09-19  8:50         ` Douglas A. Gwyn
  2007-09-19 11:51           ` erik quanstrom
                             ` (3 more replies)
  2007-10-09 20:08         ` Aharon Robbins
  2007-10-10  5:33         ` sqweek
  2 siblings, 4 replies; 40+ messages in thread
From: Douglas A. Gwyn @ 2007-09-19  8:50 UTC (permalink / raw)
  To: 9fans

Uriel wrote:
> found this gem in one of the many X headers:
> #define NBBY    8       /* number of bits in a byte */

So what is supposed to be wrong with using a manifest constant
instead of hard-coding "8" in various places?  As I recall,
The Elements of Programming Style recommended this approach.

Similar definitions have been in Unix system headers for
decades.  CHAR_BIT is defined in <limits.h>. (Yes, I know
there is a difference between a char and a byte.  Less well
known, there is a difference between a byte and an octet.)

I'm not saying that some of the complaints don't have a
point, especially when important tools perform poorly.
However, I've observed an unusual degree of arrogance in
the Plan 9 newsgroup, approaching religion.  Plan 9's way
of doing things is not the only intelligent way; others
may have different goals and constraints that affect how
they do things in their particular environments.


* Re: [9fans] simplicity
  2007-09-19  8:50         ` Douglas A. Gwyn
@ 2007-09-19 11:51           ` erik quanstrom
  2007-09-19 15:02             ` Russ Cox
  2007-09-19 14:17           ` Charles Forsyth
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: erik quanstrom @ 2007-09-19 11:51 UTC (permalink / raw)
  To: 9fans

> So what is supposed to be wrong with using a manifest constant
> instead of hard-coding "8" in various places?  As I recall,
> The Elements of Programming Style recommended this approach.

i see two problems with this sort of indirection.  if i see NBBY
in the code, i have to look up its value.  NBBY doesn't mean anything
to me.  this layer of mental gymnastics makes the code hard
to read and understand.  on the other hand, 8 means something to me.

more importantly, it implies that the code would work with NBBY
of 10 or 12.  (c standard says you can't have < 8 §5.2.4.2.1.)
i'd bet there are many things in the code that depend on the size of
a byte that don't reference NBBY.

so this define goes 0 fer 2.  it can't be changed and it is not informative.

> Similar definitions have been in Unix system headers for
> decades.  CHAR_BIT is defined in <limits.h>. (Yes, I know
> there is a difference between a char and a byte.  Less well
> known, there is a difference between a byte and an octet.)

this mightn't be the right place to defend a practice by saying that
"unix systems have been doing it for years."

- erik


* Re: [9fans] simplicity
  2007-09-19  8:50         ` Douglas A. Gwyn
  2007-09-19 11:51           ` erik quanstrom
@ 2007-09-19 14:17           ` Charles Forsyth
  2007-09-19 14:21           ` Iruata Souza
  2007-09-19 15:32           ` Skip Tavakkolian
  3 siblings, 0 replies; 40+ messages in thread
From: Charles Forsyth @ 2007-09-19 14:17 UTC (permalink / raw)
  To: 9fans

>Less well known, there is a difference between a byte and an octet.

grep octet /sys/games/lib/fortunes
	20 octets is 160 guys playing flutes -- rob

easily one of my favourites




* Re: [9fans] simplicity
  2007-09-19  8:50         ` Douglas A. Gwyn
  2007-09-19 11:51           ` erik quanstrom
  2007-09-19 14:17           ` Charles Forsyth
@ 2007-09-19 14:21           ` Iruata Souza
  2007-09-19 15:32           ` Skip Tavakkolian
  3 siblings, 0 replies; 40+ messages in thread
From: Iruata Souza @ 2007-09-19 14:21 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 9/19/07, Douglas A. Gwyn <DAGwyn@null.net> wrote:
> I'm not saying that some of the complaints don't have a
> point, especially when important tools perform poorly.
> However, I've observed an unusual degree of arrogance in
> the Plan 9 newsgroup, approaching religion.  Plan 9's way
> of doing things is not the only intelligent way; others
> may have different goals and constraints that affect how
> they do things in their particular environments.
>

imho a big problem is that in the mentioned places every environment
is always thought of as a particular one.

iru



* Re: [9fans] simplicity
  2007-09-19 11:51           ` erik quanstrom
@ 2007-09-19 15:02             ` Russ Cox
  0 siblings, 0 replies; 40+ messages in thread
From: Russ Cox @ 2007-09-19 15:02 UTC (permalink / raw)
  To: 9fans

> i see two problems with this sort of indirection.  if i see NBBY
> in the code, i have to look up its value.  NBBY doesn't mean anything
> to me.  this layer of mental gymnastics makes the code hard
> to read and understand.  on the other hand, 8 means something to me.
> 
> more importantly, it implies that the code would work with NBBY
> of 10 or 12.  (c standard says you can't have < 8 §5.2.4.2.1.)
> i'd bet there are many things in the code that depend on the size of
> a byte that don't reference NBBY.
> 
> so this define goes 0 fer 2.  it can't be changed and it is not informative.

8 can be a lot of things besides the number of bits in a byte
(the number of bytes in a double or vlong, for example).
if you're doing enough conversions between byte counts
and bit counts, then using NBBY makes it clear *why* you're
using an 8 there, which might help a lot.

in other contexts, it might not be worth the effort.

jumping all over a #define without seeing how or 
why it is being used is not productive.  nor interesting.
in fact i can't believe i'm writing this.  sorry.

russ
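
Editorial sketch, not part of the original message: the point about a bare 8 being ambiguous can be shown with two constants that share the value but not the meaning. The names `BITS_PER_BYTE`, `DOUBLE_SIZE`, `bytes_for_bits`, and `bits_in` are invented for this illustration; `BITS_PER_BYTE` plays the role that NBBY or CHAR_BIT plays in C.

```python
import struct

BITS_PER_BYTE = 8                     # a bit count: how many bits one byte holds
DOUBLE_SIZE = struct.calcsize("<d")   # a byte count: size of an IEEE double, also 8


def bytes_for_bits(nbits: int) -> int:
    """Bytes needed to hold nbits bits, rounding up."""
    return (nbits + BITS_PER_BYTE - 1) // BITS_PER_BYTE


def bits_in(nbytes: int) -> int:
    """Number of bits in nbytes bytes."""
    return nbytes * BITS_PER_BYTE
```

Written with literals, `bits_in(DOUBLE_SIZE)` would read `8 * 8`, and a reader could not tell which 8 is which; whether that matters depends on how much of the surrounding code converts between bits and bytes.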


* Re: [9fans] simplicity
  2007-09-19  8:50         ` Douglas A. Gwyn
                             ` (2 preceding siblings ...)
  2007-09-19 14:21           ` Iruata Souza
@ 2007-09-19 15:32           ` Skip Tavakkolian
  3 siblings, 0 replies; 40+ messages in thread
From: Skip Tavakkolian @ 2007-09-19 15:32 UTC (permalink / raw)
  To: 9fans

> However, I've observed an unusual degree of arrogance in
> the Plan 9 newsgroup, approaching religion.

elitism, not arrogance.

"I don't want to belong to any club that will accept me as a member." - Groucho Marx




* Re: [9fans] simplicity
  2007-09-18 15:38       ` Uriel
  2007-09-19  8:50         ` Douglas A. Gwyn
@ 2007-10-09 20:08         ` Aharon Robbins
  2007-10-09 21:08           ` Uriel
  2007-10-10  5:33         ` sqweek
  2 siblings, 1 reply; 40+ messages in thread
From: Aharon Robbins @ 2007-10-09 20:08 UTC (permalink / raw)
  To: 9fans

In article <5d375e920709180838t4070c23al11bc0eb5cc7280c9@mail.gmail.com> Uriel wrote:
>Don't complain, at least it is not producing random behaviour, I have
>seen versions of gnu awk that when fed plain ASCII input, if the
>locale was UTF-8, rules would match random lines of input, the fix?
>set the locale to 'C' at the top of all your scripts (and don't even
>think of dealing with files which actually contain non-ASCII UTF-8).
>
>This was some years ago, it might be fixed by now, but it demonstrates
>how the locale insanity makes life so much more fun.

It likely is fixed by now.  If not, I'd like to have a sample program and
data and locale name to test under. And the truth is, even if it doesn't work,
I can blame the library routines and locale and not my code. :-)

Testing should be performed using current sources, available via anonymous
CVS from savannah.gnu.org, check out the gawk-stable module.  From CVS use:

	./bootstrap.sh
	./configure && make && make check

to build on a Unix or Linux system.

I hope to make a formal release in the next few weeks.

As to the original thread, yeah, configure (= autoconf + automake +
libtool + gnulib) has gotten way too hairy to handle. I don't use gnulib
on principle: I have the gut feeling that the configuration goop would
likely outweigh the source code in line count.

The only reason I added Automake support was to get GNU Gettext, which
on balance is a good thing.  Locales, on the other hand, I think are
very painful.  I hope that people who use them find them valuable (I'm
a parochial English speaking American myself, so ASCII is usually
enough for me.)

My two cents,

Arnold
-- 
Aharon (Arnold) Robbins 				arnold AT skeeve DOT com
P.O. Box 354		Home Phone: +972  8 979-0381	Fax: +1 206 202 4333
Nof Ayalon		Cell Phone: +972 50  729-7545
D.N. Shimshon 99785	ISRAEL


* Re: [9fans] simplicity
  2007-10-09 20:08         ` Aharon Robbins
@ 2007-10-09 21:08           ` Uriel
  0 siblings, 0 replies; 40+ messages in thread
From: Uriel @ 2007-10-09 21:08 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> >This was some years ago, it might be fixed by now, but it demonstrates
> >how the locale insanity makes life so much more fun.
>
> It likely is fixed by now.  If not, I'd like to have a sample program and
> data and locale name to test under. And the truth is, even if it doesn't work,
> I can blame the library routines and locale and not my code. :-)

Yes, it is likely fixed now, and it was very likely a bug in the
libraries rather than awk, but illustrates the kinds of problems
locales create. And I can tell you, in a production environment it can
be a pain when who knows what tool who knows where in your whole
system starts to misbehave because it is not happy with your locale.

I also find most sad how in the name of 'localization' the output of
many tools (especially error messages) has become unpredictable. It
makes providing support most fun when you ask people "can you copy
paste the output you get when you run this", and they answer with a
bunch of stuff in Aramaic. If you use unix, you are supposed to
understand English, period. (Or what is next? will they have a set of
'magic symlinks' that links '/bin/gato' to '/bin/cat' if your locale
is in Spanish?)

And now that you mention Gettext, if only I could get back all the
time I wasted trying to compile some stupid program (that should never
have been 'localized' in the first place) which is somehow unhappy
about the gettext version I have (or the other way around)...

uriel

P.S.: Oh, and people who insist on using encodings other than UTF-8
should be locked up in padded cells (without access to computers and
ideally even without electricity, unless it is to help them
electrocute themselves) for the good of mankind.


* Re: [9fans] simplicity
  2007-09-18  8:38       ` Douglas A. Gwyn
  2007-09-18 10:45         ` dave.l
@ 2007-10-10  3:30         ` Jack Johnson
  2007-10-10  4:02           ` erik quanstrom
  1 sibling, 1 reply; 40+ messages in thread
From: Jack Johnson @ 2007-10-10  3:30 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Yes, old thread, sorry.  Blame Uriel.

On 9/18/07, Douglas A. Gwyn <DAGwyn@null.net> wrote:
> erik quanstrom wrote:
> > suppose Linux user a and user b grep the same "text" file for the same string.
> > results will depend on the users' locales.
>
> But if they're trying to match an alphabetic character class, the
> result *should* depend on the locale.

This baffles me.  Can anyone think of examples where one might want
differing results depending on your locale?

-Jack



* Re: [9fans] simplicity
  2007-10-10  3:30         ` Jack Johnson
@ 2007-10-10  4:02           ` erik quanstrom
  2007-10-10  6:17             ` Jack Johnson
  0 siblings, 1 reply; 40+ messages in thread
From: erik quanstrom @ 2007-10-10  4:02 UTC (permalink / raw)
  To: 9fans

> Yes, old thread, sorry.  Blame Uriel.
> 
> On 9/18/07, Douglas A. Gwyn <DAGwyn@null.net> wrote:
> > erik quanstrom wrote:
> > > suppose Linux user a and user b grep the same "text" file for the same string.
> > > results will depend on the users' locales.
> >
> > But if they're trying to match an alphabetic character class, the
> > result *should* depend on the locale.
> 
> This baffles me.  Can anyone think of examples where one might want
> differing results depending on your locale?
> 
> -Jack

i think i see what the reasoning is.  the thought is that, e.g.,
in spanish [a-z] should match ñ.  

the problem is this means that grep(regexp, data) now
returns a set of results, one for each locale.

so on the one hand, one would like [a-z] to do the Right Thing,
depending on language.  and on the other hand, one wants
grep(regexp, data) to return a single result.

i think the way to see through this issue is to notice that
the reason we want ñ to be in [a-z] is because of visual
similarity.  what if we were dealing with chinese?  i think
it's pretty clear that [a-z] should map to a contiguous set
of unicode codepoints.

if you want to deal with ñ, the unicode tables do note that ñ
is n+combining ~, so one could come up with a new
denotation for base codepoint.  unfortunately, combining
that with existing regexp would be a bit painful.
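
a minimal sketch of the point above, in python rather than anything
from plan 9 (the use of python's re and unicodedata modules is an
assumption made for illustration only).  it shows that [a-z], read as a
contiguous range of codepoints, excludes ñ, while the unicode
decomposition does record n as the base codepoint:

```python
import re
import unicodedata

word = "niño"

# [a-z] is the contiguous range U+0061..U+007A; ñ is U+00F1, outside it.
print(re.fullmatch(r"[a-z]+", word) is not None)    # False

# The Unicode table records ñ as n followed by COMBINING TILDE (U+0303).
print(unicodedata.decomposition("ñ"))               # 006E 0303

# Decompose (NFD), then drop combining marks to keep only base codepoints.
base = "".join(
    c for c in unicodedata.normalize("NFD", word)
    if not unicodedata.combining(c)
)
print(base)                                         # nino
print(re.fullmatch(r"[a-z]+", base) is not None)    # True
```

the result does not depend on any locale setting, which is the
property being argued for.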

- erik


* Re: [9fans] simplicity
  2007-09-18 15:38       ` Uriel
  2007-09-19  8:50         ` Douglas A. Gwyn
  2007-10-09 20:08         ` Aharon Robbins
@ 2007-10-10  5:33         ` sqweek
  2007-10-10 11:49           ` erik quanstrom
  2 siblings, 1 reply; 40+ messages in thread
From: sqweek @ 2007-10-10  5:33 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 9/18/07, Uriel <uriel99@gmail.com> wrote:
> Don't complain, at least it is not producing random behaviour, I have
> seen versions of gnu awk that when fed plain ASCII input, if the
> locale was UTF-8, rules would match random lines of input, the fix?
> set the locale to 'C' at the top of all your scripts (and don't even
> think of dealing with files which actually contain non-ASCII UTF-8).
>
> This was some years ago, it might be fixed by now, but it demonstrates
> how the locale insanity makes life so much more fun.-

 Heh, funny that this thread got revived the very day that my
colleague's backup script choked because he was running in a utf8
locale and hit a filename encoded in iso8859-1. Apparently GNU sed's .
stops matching when it hits an invalid bytestream (which is not
entirely unreasonable I guess).
-sqweek
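
A rough illustration of why a UTF-8 decoder gives up on such a
filename.  This is a Python sketch, not a reproduction of GNU sed's
behaviour; the filename and the helper utf8_error_offset are made up
for the example:

```python
import re

# A filename as a program using ISO 8859-1 would have stored it.
name = "niño.txt".encode("iso8859-1")        # b'ni\xf1o.txt'

def utf8_error_offset(data: bytes):
    """Return the offset of the first invalid UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

# 0xf1 announces a multi-byte sequence, but 'o' is not a continuation byte.
print(utf8_error_offset(name))                   # 2

# A byte-oriented '.' has no such problem: every byte is a character.
print(re.fullmatch(rb".*", name) is not None)    # True
```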



* Re: [9fans] simplicity
  2007-10-10  4:02           ` erik quanstrom
@ 2007-10-10  6:17             ` Jack Johnson
  2007-10-10 12:22               ` erik quanstrom
  0 siblings, 1 reply; 40+ messages in thread
From: Jack Johnson @ 2007-10-10  6:17 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On 10/9/07, erik quanstrom <quanstro@quanstro.net> wrote:
> i think i see what the reasoning is.  the thought is that, e.g.,
> in spanish [a-z] should match ñ.

Ah, thanks!

I was thinking of the simplistic scenario, where someone might be
looking for niño in some file, regardless of what locale they might
happen to be in.  Now I can imagine the nightmare it must be for
non-English speakers looking for letter combinations irrespective of
accents.

But, it seems more like a problem with the shorthand than grep, per
se.  I could see an argument for [:alpha:] potentially matching n and
ñ depending on the locale, but [a-z] not matching ñ in any locale. But
even that, my tendency would be that [:alpha:] match ñ in every
locale.

But then, does [:alpha:] match ἄγαθος?  How ironic that it doesn't match α.

What an ugly problem.

-Jack


* Re: [9fans] simplicity
  2007-10-10  5:33         ` sqweek
@ 2007-10-10 11:49           ` erik quanstrom
  0 siblings, 0 replies; 40+ messages in thread
From: erik quanstrom @ 2007-10-10 11:49 UTC (permalink / raw)
  To: 9fans

>  Heh, funny that this thread got revived the very day that my
> colleague's backup script choked because he was running in a utf8
> locale and hit a filename encoded in iso8859-1. Apparently GNU sed's .
> stops matching when it hits an invalid bytestream (which is not
> entirely unreasonable I guess).
> -sqweek

clearly in their world, it is unreasonable.

- erik



* Re: [9fans] simplicity
  2007-10-10  6:17             ` Jack Johnson
@ 2007-10-10 12:22               ` erik quanstrom
  0 siblings, 0 replies; 40+ messages in thread
From: erik quanstrom @ 2007-10-10 12:22 UTC (permalink / raw)
  To: 9fans

> I was thinking of the simplistic scenario, where someone might be
> looking for niño in some file, regardless of what locale they might
> happen to be in.  Now I can imagine the nightmare it must be for
> non-English speakers looking for letter combinations irrespective of
> accents.
> 
> But, it seems more like a problem with the shorthand than grep, per
> se.

i agree with this.  or it's a historical problem with the character set.
clearly if you were designing a universal character set with no compatibility
constraints, the alphabet would have nñ together so [a-z] would 
match both.

> I could see an argument for [:alpha:] potentially matching n and
> ñ depending on the locale, but [a-z] not matching ñ in any locale. But
> even that, my tendency would be that [:alpha:] match ñ in every
> locale.
> 
> But then, does [:alpha:] match ἄγαθος?  How ironic that it doesn't match α.

i don't think one can go this route.  you can't have a magic environment
variable that changes everything.  testing is a nightmare in such a world.
you have to go through every combination of (data cs, locale) to see if
things are working.

a better solution is to use the properties of unicode.  ñ is noted in the
table as

00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1

field 6 has the base codepoint 006e as its first subfield.  it would not be hard
to build a table quickly mapping a codepoint to its base codepoint σ.
but it would probably be most useful to also have a mapping from
base codepoints to all composed forms ξ.

suppose, for lack of creativity, we use » to mean all codepoints sharing
the base codepoint of the next item, so »a matches ä, as does »[a-z].
then » of a letter c can be grepped by taking ξ(σ(c)), which results
in a character class.

plan 9 already has some of this in the c library with tolowerrune, etc.
i did some work with this some time ago and wrote some rc scripts to
generate the to*rune tables from the unicode standard data.  it would
be easy to adapt them to generate ξ and σ.  (the tables would be pretty big.)
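
a hedged sketch of the two tables, in python instead of rc and c.  the
names sigma, build_xi and expand are invented for this example, the
table is limited to codepoints below U+0250 to keep it small, and
python's unicodedata module stands in for parsing the unicode data file
directly:

```python
import re
import string
import unicodedata
from collections import defaultdict

def sigma(ch: str) -> str:
    """Map a codepoint to its base codepoint via the canonical decomposition."""
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return ch                      # no decomposition, or only a compatibility one
    first = chr(int(decomp.split()[0], 16))
    return sigma(first)                # decompositions can nest, e.g. ǖ -> ü -> u

def build_xi(limit: int = 0x250) -> dict:
    """Map each base codepoint to every codepoint below limit that reduces to it."""
    xi = defaultdict(set)
    for cp in range(limit):
        ch = chr(cp)
        xi[sigma(ch)].add(ch)
    return xi

XI = build_xi()

def expand(chars: str) -> str:
    """Return a regexp character class equivalent to » applied to chars."""
    members = set()
    for c in chars:
        base = sigma(c)
        members |= XI.get(base, {base})
    return "[" + "".join(re.escape(m) for m in sorted(members)) + "]"

letters = re.compile(expand(string.ascii_lowercase) + "+")
print(letters.fullmatch("niño") is not None)          # True
print(re.fullmatch(expand("a"), "ä") is not None)     # True
```

the expansion happens once, when the regexp is compiled, so the matcher
itself is unchanged and no environment variable is consulted.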

> 
> What an ugly problem.

it can be made ugly quickly.  but i'm not convinced that all approaches
to this problem are bad.

- erik


* Re: [9fans] simplicity
  2007-10-10 14:29     ` erik quanstrom
@ 2007-10-10 15:26       ` John Stalker
  0 siblings, 0 replies; 40+ messages in thread
From: John Stalker @ 2007-10-10 15:26 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> > I'm not sure your solution is always the correct one, or is implementable.
> > Should an MTA silently convert incoming mail to the local character set?
> 
> it doesn't have to.  upas/fs does given the character set in the file.
> i've thought about the mta doing it.  i think that would be a nice solution.

In my case this was being done by the MUA, which was mh rather than upas,
but the net effect is the same.

> > I'm not sure I want that.  The other program in my example was a web
> > browser reading from a pipe.  It can't know whether it's processing data
> > as it comes into the system or data which is already there and has already
> > been converted, unless either it can trust the meta tag in the document to
> > have been updated or the conversion is pushed out into the network layer.
> 
> what is the standard?  if the encoding in the header is x, does that mean
> that the encoding in the html header needs to be x?  what happens if they
> differ?
> 
> the only case that makes sense is that they have to be the same.  but html
> and http generally run counter to common sense. ;-)

I don't know what happens if they differ.  In my case they were the same, but
the problem was that both programs assigned themselves the job of converting.
I think that the mailer SHOULD NOT, to use the RFC capitals, convert the
character set if it is handing off the display job to another program.  In any
case that's the way I set things up once I figured out what was going on.
This is counter to the way the CRLF issue is handled, though.  There the network
standard is CRLF and systems which use other conventions, including all the ones I use,
are expected to convert before sending and after receiving so no local programs
need to know about such issues.
-- 
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282



* Re: [9fans] simplicity
  2007-10-10 14:05   ` John Stalker
@ 2007-10-10 14:29     ` erik quanstrom
  2007-10-10 15:26       ` John Stalker
  0 siblings, 1 reply; 40+ messages in thread
From: erik quanstrom @ 2007-10-10 14:29 UTC (permalink / raw)
  To: 9fans

On Wed Oct 10 10:05:45 EDT 2007, stalker@maths.tcd.ie wrote:
> > i think this is a character set conversion problem, not a locale
> > problem.  a small distinction, but i think one can live with converting
> > character sets as they come onto a system.  localized (ha!) complexity.
> 
> I'm not sure your solution is always the correct one, or is implementable.
> Should an MTA silently convert incoming mail to the local character set?

it doesn't have to.  upas/fs does given the character set in the file.
i've thought about the mta doing it.  i think that would be a nice solution.

> I'm not sure I want that.  The other program in my example was a web
> browser reading from a pipe.  It can't know whether it's processing data
> as it comes into the system or data which is already there and has already
> been converted, unless either it can trust the meta tag in the document to
> have been updated or the conversion is pushed out into the network layer.

what is the standard?  if the encoding in the header is x, does that mean
that the encoding in the html header needs to be x?  what happens if they
differ?

the only case that makes sense is that they have to be the same.  but html
and http generally run counter to common sense. ;-)

- erik



* Re: [9fans] simplicity
  2007-10-10 11:47 ` erik quanstrom
@ 2007-10-10 14:05   ` John Stalker
  2007-10-10 14:29     ` erik quanstrom
  0 siblings, 1 reply; 40+ messages in thread
From: John Stalker @ 2007-10-10 14:05 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> i think this is a character set conversion problem, not a locale
> problem.  a small distinction, but i think one can live with converting
> character sets as they come onto a system.  localized (ha!) complexity.

I'm not sure your solution is always the correct one, or is implementable.
Should an MTA silently convert incoming mail to the local character set?
I'm not sure I want that.  The other program in my example was a web
browser reading from a pipe.  It can't know whether it's processing data
as it comes into the system or data which is already there and has already
been converted, unless either it can trust the meta tag in the document to
have been updated or the conversion is pushed out into the network layer.
Also, it's meaningful to talk about the system character set in the plan9
world or the windows world, but not under UNIX, which is where I spend
most of my time, for better or worse.
-- 
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282



* Re: [9fans] simplicity
  2007-10-10  7:36 John Stalker
  2007-10-10  8:24 ` Charles Forsyth
@ 2007-10-10 11:47 ` erik quanstrom
  2007-10-10 14:05   ` John Stalker
  1 sibling, 1 reply; 40+ messages in thread
From: erik quanstrom @ 2007-10-10 11:47 UTC (permalink / raw)
  To: 9fans

> My most annoying locale problem concerned reading Czech HTML emails in
> mh.  Don't ask why, just accept that I got a lot of these and could not
> simply ignore them.  The problem was that mh saw a text/html MIME type
> and, as it does for text, helpfully converted from the original encoding,
> usually CP1250 or iso8859-2, [...]

i think this is a character set conversion problem, not a locale
problem.  a small distinction, but i think one can live with converting
character sets as they come onto a system.  localized (ha!) complexity.

- erik



* Re: [9fans] simplicity
  2007-10-10  7:36 John Stalker
@ 2007-10-10  8:24 ` Charles Forsyth
  2007-10-10 11:47 ` erik quanstrom
  1 sibling, 0 replies; 40+ messages in thread
From: Charles Forsyth @ 2007-10-10  8:24 UTC (permalink / raw)
  To: 9fans

> Forcing everyone to use utf-8 would be better, but is not
> going to happen either.

it will, it will just take some time (some things will be in utf-x for x>8)
partly because it isn't `forced' (who could ever do the `forcing')




* Re: [9fans] simplicity
@ 2007-10-10  7:36 John Stalker
  2007-10-10  8:24 ` Charles Forsyth
  2007-10-10 11:47 ` erik quanstrom
  0 siblings, 2 replies; 40+ messages in thread
From: John Stalker @ 2007-10-10  7:36 UTC (permalink / raw)
  To: 9fans

My most annoying locale problem concerned reading Czech HTML emails in
mh.  Don't ask why, just accept that I got a lot of these and could not
simply ignore them.  The problem was that mh saw a text/html MIME type
and, as it does for text, helpfully converted from the original encoding,
usually CP1250 or iso8859-2, to the encoding specified in my locale
environment variable, utf-8.  Since the content was html, it then handed
it to a ``browser'', in my case w3m, for pretty formatting.  w3m read the
encoding from the html header, thought its input was CP1250 or iso8859-2,
and helpfully converted to utf-8.  Both programs were behaving in a
vaguely sensible way, but iconv was being run twice, and the result was
gibberish.  It took me a while to figure out what was happening and a
while to figure out a way to make it stop.  I don't know what the general
answer to problems like this is.  Forcing everyone to use English is not
an option.  Forcing everyone to use utf-8 would be better, but is not
going to happen either.
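
The double conversion can be reproduced in a few lines.  This is only a
Python sketch of the effect, not what mh or w3m actually execute; the
sample sentence and the helper convert are invented for illustration,
and iso8859-2 is used because Python's codec for it accepts every byte
value:

```python
def convert(data: bytes, src: str, dst: str = "utf-8") -> bytes:
    """Re-encode data from src to dst, as iconv would."""
    return data.decode(src).encode(dst)

text = "Příliš žluťoučký kůň"
wire = text.encode("iso8859-2")        # what the sender transmitted

once = convert(wire, "iso8859-2")      # first program: convert to the locale encoding
twice = convert(once, "iso8859-2")     # second program: trusts the stale charset label

print(once.decode("utf-8") == text)    # True: one conversion is correct
print(twice.decode("utf-8") == text)   # False: the second pass produces gibberish
print(repr(twice.decode("utf-8")))

# Because this particular conversion is lossless, the damage can be undone.
repaired = twice.decode("utf-8").encode("iso8859-2").decode("utf-8")
print(repaired == text)                # True
```

Either program alone does the right thing; the fault is that the
charset label was not updated after the first conversion.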

John
-- 
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282


end of thread, other threads:[~2007-10-10 15:26 UTC | newest]

Thread overview: 40+ messages
2007-09-16 18:55 [9fans] simplicity Francisco J Ballesteros
2007-09-16 20:42 ` Anant Narayanan
2007-09-16 21:24   ` Francisco J Ballesteros
2007-09-17 15:22     ` Douglas A. Gwyn
2007-09-16 20:43 ` roger peppe
2007-09-16 20:53   ` Steve Simon
2007-09-17 15:22     ` Douglas A. Gwyn
2007-09-17 20:00   ` Scott Schwartz
2007-09-17  3:23 ` erik quanstrom
2007-09-17 15:22   ` Douglas A. Gwyn
2007-09-17 15:55     ` erik quanstrom
2007-09-18  8:38       ` Douglas A. Gwyn
2007-09-18 10:45         ` dave.l
2007-09-18 14:44           ` Iruata Souza
2007-09-18 15:41             ` Douglas A. Gwyn
2007-09-18 21:34               ` Iruata Souza
2007-10-10  3:30         ` Jack Johnson
2007-10-10  4:02           ` erik quanstrom
2007-10-10  6:17             ` Jack Johnson
2007-10-10 12:22               ` erik quanstrom
2007-09-18 15:27     ` Rob Pike
2007-09-18 15:38       ` Uriel
2007-09-19  8:50         ` Douglas A. Gwyn
2007-09-19 11:51           ` erik quanstrom
2007-09-19 15:02             ` Russ Cox
2007-09-19 14:17           ` Charles Forsyth
2007-09-19 14:21           ` Iruata Souza
2007-09-19 15:32           ` Skip Tavakkolian
2007-10-09 20:08         ` Aharon Robbins
2007-10-09 21:08           ` Uriel
2007-10-10  5:33         ` sqweek
2007-10-10 11:49           ` erik quanstrom
2007-09-17 14:52 ` ron minnich
2007-09-17 14:53 ` ron minnich
2007-10-10  7:36 John Stalker
2007-10-10  8:24 ` Charles Forsyth
2007-10-10 11:47 ` erik quanstrom
2007-10-10 14:05   ` John Stalker
2007-10-10 14:29     ` erik quanstrom
2007-10-10 15:26       ` John Stalker
