From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <0658d9ebf605b525d017007cadbc2e51@cat-v.org>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] ports from GPL
Date: Mon, 20 Mar 2006 04:39:43 +0100
From: uriel@cat-v.org
In-Reply-To: <20060320021808.91DE411FC1@dexter-peak.quanstro.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Topicbox-Message-UUID: 19650dea-ead1-11e9-9d60-3106f5b1d025

> the gnu awk folks are doing a pretty good job, given their constraints.
>
> i have not read the sed code (for a while, anyway), but i could imagine
> that it may have the same character set problems as newer versions of gnu grep.
> gnu grep calls mbtowc for each input character, even when not required.
>
> have you tried your test with LC_LANG=C?

I have seen GNU awk produce different matches with LC_ALL=UTF-8 than
with LC_ALL=C when the input was plain ASCII (only digits!).

Since then I add LC_LANG=C at the top of all unix shell scripts, not for
performance reasons, but because otherwise things often break in subtle
and very hard to debug ways. Really sad.

I wonder how many more years we will have to wait until any unix system
supports UTF-8 properly.

The only thing that excuses GNU is that the locale system is not entirely
their fault. Locales are probably one of the worst ideas in the history
of Unix, if not the worst.

I will ignore the subject of UTF-8 support in terminal emulators; many
books could be written about the various kinds of braindamage in this
area. Thank God for 9term.

> | I wonder who spent so much time speeding up awk and ignoring sed? :)

A program that produces incorrect results twice as fast is infinitely slower.
	-- John Ousterhout

I wonder how many thousands of man-years have been wasted due to
locale-related braindamage.

uriel