Message-ID: <6ba21b46fa535aa3c7f5cd555d86f8be@quanstro.net>
From: erik quanstrom
Date: Wed, 27 Feb 2008 04:57:10 -0500
To: paurea@gmail.com, 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...

> There is split and other functions,
> for example:
>
> toupper("aí")
> gives
> Aí
>
> My guess is that there are many more little (or not) corners where it
> doesn't work.
> We can go on and on looking for crevices and hiding the bugs further
> under the rug
> so that they are not evident and find everyone completely unaware,
> leave awk as it is now or really fix the problem. The first approach
> doesn't work. I am going to take
> the second till I have time to take the third which means use runes or
> at least revise all the
> code so that it is uniformly aware of the existance of non-ascii characters.

i don't understand this approach. you propose redoing a fundamental
part of awk. yet at the end you won't have solved the bug that's
bothering you.

ignoring the fact that awk is an ape program and doesn't use runes,
the problem with toupper is independent of the internal representation
of strings. as far as i can tell, ape doesn't even have towupper and
towlower. so if you provide those functions, fixing toupper and
tolower could be a 5 minute fix. and you know you won't have broken
anything else.

/sys/doc/utf.ps is worth a read. it's not too hard to think of
situations that depend on character boundaries or operate on
non-ascii characters. generally there are few. for example, rc
only bothers with character boundaries in matching.

perhaps you could build a utf testsuite for awk. make sure to use
non-latin1 languages, too.

- erik
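
A minimal sketch of the point about toupper, not awk's or ape's actual code: the string stays a plain UTF-8 byte array, and only the case-mapping step works on decoded code points. All names here (`sketch_towupper`, `decode`, `encode`, `utf8_toupper`) are hypothetical. `sketch_towupper` stands in for the missing towupper and, as an assumption made to keep the example short, covers only ASCII and the Latin-1 lowercase range; a real towupper would need full case tables.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for towupper: ASCII and Latin-1 only. */
static unsigned long sketch_towupper(unsigned long c)
{
	if (c >= 'a' && c <= 'z')
		return c - 0x20;
	if (c >= 0xE0 && c <= 0xFE && c != 0xF7)	/* U+00E0..U+00FE, skipping the division sign */
		return c - 0x20;
	return c;	/* everything else passes through unchanged */
}

/* Decode one UTF-8 sequence at s; return its byte length, or 0 if malformed. */
static int decode(const unsigned char *s, unsigned long *c)
{
	int i, n;

	if (s[0] < 0x80) { *c = s[0]; return 1; }
	else if ((s[0] & 0xE0) == 0xC0) { *c = s[0] & 0x1F; n = 2; }
	else if ((s[0] & 0xF0) == 0xE0) { *c = s[0] & 0x0F; n = 3; }
	else if ((s[0] & 0xF8) == 0xF0) { *c = s[0] & 0x07; n = 4; }
	else return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)	/* also stops at the terminating NUL */
			return 0;
		*c = (*c << 6) | (s[i] & 0x3F);
	}
	return n;
}

/* Encode code point c as UTF-8 into out (at least 4 bytes); return the length. */
static int encode(unsigned long c, unsigned char *out)
{
	if (c < 0x80) {
		out[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char)(0xC0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	if (c < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (c >> 12));
		out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (c & 0x3F));
		return 3;
	}
	out[0] = (unsigned char)(0xF0 | (c >> 18));
	out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
	out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
	out[3] = (unsigned char)(0x80 | (c & 0x3F));
	return 4;
}

/*
 * Uppercase the UTF-8 string src into dst, which holds dstlen bytes
 * including the terminating NUL. Output is truncated at a character
 * boundary if it does not fit.
 */
void utf8_toupper(const char *src, char *dst, size_t dstlen)
{
	const unsigned char *s = (const unsigned char *)src;
	unsigned char buf[4];
	unsigned long c;
	size_t used = 0;
	int n, m;

	if (dstlen == 0)
		return;
	while (*s != '\0') {
		n = decode(s, &c);
		if (n == 0) {	/* malformed byte: copy it through as-is */
			buf[0] = *s;
			n = m = 1;
		} else
			m = encode(sketch_towupper(c), buf);
		if (used + (size_t)m >= dstlen)
			break;
		memcpy(dst + used, buf, (size_t)m);
		used += (size_t)m;
		s += n;
	}
	dst[used] = '\0';
}
```

With this sketch, `utf8_toupper("aí", out, sizeof out)` leaves the bytes for "AÍ" in `out`, where a byte-at-a-time ASCII toupper produces the "Aí" from the report. Text outside the sketch's mapping, such as Greek or Cyrillic, comes back unchanged, which is the kind of gap a test suite with non-latin1 input would expose.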