From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <6ba21b46fa535aa3c7f5cd555d86f8be@quanstro.net>
From: erik quanstrom <quanstro@quanstro.net>
Date: Wed, 27 Feb 2008 04:57:10 -0500
To: paurea@gmail.com, 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Cc:
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: 644475ba-ead3-11e9-9d60-3106f5b1d025

> There is split and other functions,
> for example:
>
> toupper("aí")
> gives
> Aí
>
> My guess is that there are many more little (or not) corners where it
> doesn't work.
> We can go on and on looking for crevices and hiding the bugs further
> under the rug
> so that they are not evident and find everyone completely unaware,
> leave awk as it is now or really fix the problem. The first approach
> doesn't work. I am going to take
> the second till I have time to take the third which means use runes or
> at least revise all the
> code so that it is uniformly aware of the existence of non-ascii
> characters.

i don't understand this approach.  you propose redoing a fundamental
part of awk.  yet at the end you won't have solved the bug that's
bothering you.

ignoring the fact that awk is an ape program and doesn't use runes, the
problem with toupper is independent of the internal representation
of strings. as far as i can tell, ape doesn't even have towupper and
towlower.

so if you provide those functions, fixing toupper and tolower could be
a 5 minute fix.  and you know you won't have broken anything else.
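
a minimal sketch of what such a fix might look like, assuming the
strings are utf-8: decode each character, map it, re-encode it.
sketch_towupper is a hypothetical stand-in for the missing towupper
and only knows ascii and the latin-1 supplement.  utf_toupper and the
helper names are made up for illustration; none of this is taken from
awk or ape.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* hypothetical stand-in for towupper: ascii and latin-1 supplement only */
static unsigned long
sketch_towupper(unsigned long r)
{
	if(r >= 'a' && r <= 'z')
		return r - 0x20;
	if(r >= 0xE0 && r <= 0xFE && r != 0xF7)	/* 0xF7 is the division sign */
		return r - 0x20;
	return r;
}

/* decode one utf-8 sequence; return bytes consumed, or -1 if malformed.
 * overlong encodings are not rejected in this sketch. */
static int
decode(const unsigned char *s, unsigned long *r)
{
	int i, n;
	unsigned long c = s[0];

	if(c < 0x80){ *r = c; return 1; }
	if((c & 0xE0) == 0xC0){ n = 2; c &= 0x1F; }
	else if((c & 0xF0) == 0xE0){ n = 3; c &= 0x0F; }
	else if((c & 0xF8) == 0xF0){ n = 4; c &= 0x07; }
	else return -1;		/* stray continuation or invalid lead byte */
	for(i = 1; i < n; i++){
		if((s[i] & 0xC0) != 0x80)	/* also stops at the terminating NUL */
			return -1;
		c = (c << 6) | (s[i] & 0x3F);
	}
	*r = c;
	return n;
}

/* encode one code point as utf-8; return bytes written (1 to 4) */
static int
encode(unsigned long r, unsigned char *out)
{
	if(r < 0x80){
		out[0] = r;
		return 1;
	}
	if(r < 0x800){
		out[0] = 0xC0 | (r >> 6);
		out[1] = 0x80 | (r & 0x3F);
		return 2;
	}
	if(r < 0x10000){
		out[0] = 0xE0 | (r >> 12);
		out[1] = 0x80 | ((r >> 6) & 0x3F);
		out[2] = 0x80 | (r & 0x3F);
		return 3;
	}
	out[0] = 0xF0 | (r >> 18);
	out[1] = 0x80 | ((r >> 12) & 0x3F);
	out[2] = 0x80 | ((r >> 6) & 0x3F);
	out[3] = 0x80 | (r & 0x3F);
	return 4;
}

/* upper-case src into dst, one character at a time; malformed bytes are
 * copied through untouched.  output is truncated at a character boundary
 * if dst is too small.  returns bytes written, not counting the NUL. */
size_t
utf_toupper(const char *src, char *dst, size_t dstsize)
{
	const unsigned char *s = (const unsigned char*)src;
	unsigned char buf[4];
	unsigned long r;
	size_t o = 0;
	int n, m;

	assert(src != NULL && dst != NULL && dstsize > 0);
	while(*s != '\0'){
		n = decode(s, &r);
		if(n < 0){
			buf[0] = *s;
			m = 1;
			n = 1;
		}else
			m = encode(sketch_towupper(r), buf);
		if(o + m >= dstsize)
			break;
		memcpy(dst + o, buf, m);
		o += m;
		s += n;
	}
	dst[o] = '\0';
	return o;
}
```

with this, utf_toupper("aí", out, sizeof out) leaves "AÍ" in out.  a
real version would want a full case table and a check for overlong
encodings.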

/sys/doc/utf.ps is worth a read.  it's not too hard to think of situations
that depend on character boundaries or operate on non-ascii characters.
generally there are few.  for example, rc only bothers with character
boundaries in matching. perhaps you could build a utf testsuite for awk.
make sure to use non-latin1 languages, too.
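
as one example of a character-boundary case such a testsuite might
cover, here is a rough sketch of counting characters rather than
bytes in a utf-8 string.  utf_nchars is a hypothetical name, and the
sketch assumes well-formed input: it just counts the bytes that are
not continuation bytes.

```c
#include <stddef.h>

/* count characters in a utf-8 string by skipping continuation bytes
 * (those of the form 10xxxxxx).  assumes well-formed input. */
size_t
utf_nchars(const char *s)
{
	size_t n = 0;

	for(; *s != '\0'; s++)
		if(((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

for "aí" this gives 2 where a byte count gives 3, and for "日本語" it
gives 3 where a byte count gives 9.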

- erik