From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <7f50f59eb69c5e678cbb97e2b978b112@quanstro.net>
From: erik quanstrom
Date: Tue, 26 Feb 2008 15:24:00 -0500
To: 9fans@cse.psu.edu
Subject: Re: [9fans] awk, not utf aware...
In-Reply-To: <599f06db0802260418m1c2732fdt1487051c59152e27@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Topicbox-Message-UUID: 62a907b6-ead3-11e9-9d60-3106f5b1d025

> I think this has come up before, but I didn't found reply.
> If I do in awk something like:
>
> split($0, c, "");
>
> c should be an array of Runes internally, UTF externally, but apparently,
> it is not. Is it just broken?, is there a replacement?, is it just the
> builtins or is the whole awk broken?.

i think the comments about this problem are missing the point a bit.
utf8 should be transparent to awk unless the situation demands
that awk needs to know the length of a character. it's not necessary
to keep strings as Rune*s internally to work with utf8.

splitting on "" is a special case where awk does need to know the
length of a character. e.g. this script should work fine

	; cat /tmp/smile
	#!/bin/awk -f
	{
		n = split($0, c, "☺");
		for(i = 1; i <= n; i++)
			print c[i]
	}
	; echo fu☺bar|/tmp/smile
	fu
	bar

but splitting on "" won't. i attached a patch that fixes this problem
as an illustration. i'm not using utflen because pcc won't see it.

it's an ugly patch. i don't think i know what a proper fix for awk would be.
i wouldn't think there are many cases like this, but i haven't spent much
time with awk internals.
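
a minimal sketch, not taken from awk's source, of why splitting on a
multi-byte separator such as "☺" can work with no utf8 awareness at all:
in valid utf8, the byte sequence of one character never occurs inside or
across other characters, so a plain byte-wise strstr finds only real
matches. bytesplit and its fixed 32-byte field slots are hypothetical names
and limits chosen for illustration only.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of awk: split s on the byte string sep
 * using only byte-wise strstr, with no UTF-8 decoding anywhere.
 * Fields are copied into out[0..max-1]; returns the number of fields.
 */
static int
bytesplit(const char *s, const char *sep, char out[][32], int max)
{
	size_t seplen = strlen(sep);
	const char *p;
	int n = 0;

	if (seplen == 0)
		return 0;	/* the "" case needs character lengths; not handled here */
	while (n < max) {
		p = strstr(s, sep);	/* plain byte search */
		size_t len = p ? (size_t)(p - s) : strlen(s);
		if (len > 31)
			len = 31;	/* illustration only: may cut a long field mid-character */
		memcpy(out[n], s, len);
		out[n][len] = '\0';
		n++;
		if (p == NULL)
			break;
		s = p + seplen;	/* skip the separator's bytes, however many there are */
	}
	return n;
}
```

called with the bytes of "fu☺bar" and "☺" as the separator, this yields the
two fields "fu" and "bar", the same result as the script above. only the
empty separator forces the code to know where one character ends and the
next begins.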

- erik

------
9diff run.c
/n/sources/plan9//sys/src/cmd/awk/run.c:1191,1196 - run.c:1191,1219
  	return(False);
  }
  
+ static int
+ utf8len(char *s)
+ {
+ 	int c, n, i;
+ 
+ 	c = *(unsigned char*)s++;
+ 	if ((c&0xe0) == 0xc0)
+ 		n = 2;
+ 	else if ((c&0xf0) == 0xe0)
+ 		n = 3;
+ 	else if ((c&0xf8) == 0xf0)
+ 		n = 4;
+ 	else
+ 		return 1;	//-1;
+ 	i = n-1;
+ 	if(strlen(s) < i)
+ 		return 1;	// -1;
+ 	for(; i-- && (c = *(unsigned char*)s++);)
+ 		if(0x80 != (c&0xc0))
+ 			return 1;	//-1;
+ 	return n;
+ }
+ 
  Cell *split(Node **a, int nnn)	/* split(a[0], a[1], a[2]); a[3] is type */
  {
  	Cell *x = 0, *y, *ap;
/n/sources/plan9//sys/src/cmd/awk/run.c:1279,1290 - run.c:1302,1316
  			s++;
  		}
  	} else if (sep == 0) {	/* new: split(s, a, "") => 1 char/elem */
- 		for (n = 0; *s != 0; s++) {
- 			char buf[2];
+ 		int i, len;
+ 		char buf[5];
+ 		for (n = 0; *s != 0; s += len) {
  			n++;
  			sprintf(num, "%d", n);
- 			buf[0] = *s;
- 			buf[1] = 0;
+ 			len = utf8len(s);
+ 			for(i = 0; i < len; i++)
+ 				buf[i] = s[i];
+ 			buf[len] = 0;
  			if (isdigit(buf[0]))
  				setsymtab(num, buf, atof(buf), STR|NUM, (Array *) ap->sval);
  			else