From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Tue, 26 Feb 2008 14:16:13 +0100 From: Martin Neubauer To: Fans of the OS Plan 9 from Bell Labs <9fans@cse.psu.edu> Subject: Re: [9fans] awk, not utf aware... Message-ID: <20080226131613.GA811@shodan.homeunix.net> References: <599f06db0802260418m1c2732fdt1487051c59152e27@mail.gmail.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable In-Reply-To: <599f06db0802260418m1c2732fdt1487051c59152e27@mail.gmail.com> User-Agent: Mutt/1.4.2.3i Topicbox-Message-UUID: 61c1278e-ead3-11e9-9d60-3106f5b1d025 Awk is one of the few programs in the ditribution that is maintained externally (by Brian Kernighan) and is pulled in via ape and pcc (it might actually be the only one - I didn't bother to check.) A quick glimpse at lex.c suggests that awk scans input one char at a time. In hindsight I'm a bit surprised that I haven't got bitten by this, but I probably didn't split within multibyte sequences. It's probably not too hard to change awk to read runes for the price of creating ``the other one true awk.'' Martin * Gorka Guardiola (paurea@gmail.com) wrote: > I think this has come up before, but I didn't found reply. > If I do in awk something like: >=20 > split($0, c, ""); >=20 > c should be an array of Runes internally, UTF externally, but apparently, > it is not. Is it just broken?, is there a replacement?, is it just the > builtins or > is the whole awk broken?. >=20 > Example, freqpair >=20 > ------ > #!/bin/awk -f >=20 > { > n =3D split($0, c , ""); > for(i=3D1; i pair=3Dc[i] c[i+1] > f[pair]++; > } > } > END{ > for(h in f) > printf("%d %s\n", f[h], h); > } >=20 > ------ >=20 > % echo abcd|freqpair > 1 ab > 1 cd > 1 bc > % echo a=C3=ADcd|freqpair > 1 cd > 1 =EF=BF=BDc > 1 =C3=AD > 1 a=EF=BF=BD >=20 >=20 > where the ? is a Peter face... >=20 > Thanks. >=20 > --=20 > - curiosity sKilled the cat
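
For illustration only, here is a small sketch of the effect discussed in this thread, written in Python rather than awk. It is not how awk's lex.c works internally; the names pairs, byte_units, byte_pairs and rune_pairs are made up for this example. It counts adjacent pairs the way freqpair does, once over the raw UTF-8 bytes (what a char-at-a-time scanner sees) and once over whole code points (what a rune-aware split would see).

```python
# Sketch: byte-wise vs. rune-wise splitting of a UTF-8 string.
# All names here are hypothetical and exist only for this example.

def pairs(units):
    """Count adjacent pairs in a sequence, like the freqpair script."""
    counts = {}
    for a, b in zip(units, units[1:]):
        counts[a + b] = counts.get(a + b, 0) + 1
    return counts

line = "a\u00edcd"  # "aícd": the i-acute takes two bytes in UTF-8

# Byte-wise view: each element is a single byte, so the two bytes of
# the accented character end up in different pairs.
raw = line.encode("utf-8")
byte_units = [raw[i:i + 1] for i in range(len(raw))]
byte_pairs = {
    # Broken sequences decode to U+FFFD, the replacement character.
    key.decode("utf-8", errors="replace"): count
    for key, count in pairs(byte_units).items()
}

# Rune-wise view: each element is one whole code point.
rune_pairs = pairs(list(line))

for label, table in (("bytes", byte_pairs), ("runes", rune_pairs)):
    for pair, count in table.items():
        print(label, count, pair)
```

The byte-wise table contains the same four entries as the second freqpair run quoted above (two of them with a replacement character), while the rune-wise table contains only the three pairs one would expect from a four-character line.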