From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <0658d9ebf605b525d017007cadbc2e51@cat-v.org>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] ports from GPL
Date: Mon, 20 Mar 2006 04:39:43 +0100
From: uriel@cat-v.org
In-Reply-To: <20060320021808.91DE411FC1@dexter-peak.quanstro.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: 19650dea-ead1-11e9-9d60-3106f5b1d025

> the gnu awk folks are doing a pretty good job, given their constraints.
>
> i have not read the sed code (for a while, anyway), but i could imagine
> that it may have the same character set problems as newer versions of gnu grep.
> gnu grep calls mbtowc for each input character, even when not required.
>
> have you tried your test with LC_LANG=C?

I have seen GNU awk produce different matches with LC_ALL set to a
UTF-8 locale than with LC_ALL=C, even when the input was plain ASCII
(only digits!).

Since then I set LC_ALL=C at the top of every unix shell script, not
for performance reasons, but because otherwise things often break in
subtle and very hard to debug ways. Really sad.
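
A minimal sketch of that habit, assuming a POSIX sh and any POSIX awk.
It uses LC_ALL, the variable POSIX defines to override every locale
category; count_digit_lines is a hypothetical helper made up purely
for illustration, not something from a real script:

```shell
#!/bin/sh
# Force the POSIX "C" locale for everything this script runs, so that
# awk, sed, grep and sort treat input as plain bytes no matter what
# the caller's environment says.
LC_ALL=C
export LC_ALL

# Hypothetical helper: count input lines made up only of ASCII digits.
# In the C locale the range [0-9] covers exactly the ten ASCII digits.
count_digit_lines() {
    awk '/^[0-9]+$/ { n++ } END { print n + 0 }'
}

printf '%s\n' 123 abc 456 | count_digit_lines   # prints 2
```

Setting and exporting the variable once at the top means every child
process inherits it, so no individual command needs its own override.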

I wonder how many more years we will have to wait until any unix
system supports UTF-8 properly.

The only thing that excuses GNU is that the locale system is not
entirely their fault. Locales are probably one of the worst ideas in
the history of Unix, if not the worst.

I will ignore the subject of UTF-8 support in terminal emulators;
many books could be written about the various kinds of braindamage
in this area.  Thank God for 9term.

> | I wonder who spent so much time speeding up awk and ignoring sed? :)

A program that produces incorrect results twice as fast is infinitely slower.
    -- John Ousterhout

I wonder how many thousands of man-years have been wasted due to
locale-related braindamage.

uriel