From mboxrd@z Thu Jan 1 00:00:00 1970
From: erik quanstrom
Date: Sat, 29 Mar 2014 19:54:15 -0400
To: 9fans@9fans.net
Message-ID: <938abc1c40e15468aa034d34b07a2d49@brasstown.quanstro.net>
In-Reply-To: <76A0A13C-0D91-4620-A282-A581C206A9FA@ar.aichi-u.ac.jp>
References: <76A0A13C-0D91-4620-A282-A581C206A9FA@ar.aichi-u.ac.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Subject: Re: [9fans] a strange bug in grep
Topicbox-Message-UUID: d2a482b6-ead8-11e9-9d60-3106f5b1d025

> Hello,
>
> I found a strange bug in grep.
> some Japanese runes does not match ‘[^0-9]’.
>
> for example ‘ま' (307e) and ‘み’(307f).
>

i can't replicate here with 9atom's fixes to grep. with the same
t3 file as you've got,

	; wc -l /tmp/t3
	21 /tmp/t3
	; grep -v '^[0-9]' /tmp/t3 | wc -l
	21

i have some other differences in grep, including -I (same as -i, except
fold runes), but i think the differences in comp.c are what cause the
bug. in particular, you really need that 0xffff entry in the tabs.

/n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147
  {
  	0x007f,
  	0x07ff,
+ 	0xffff,
  };
  Rune	tab2[] =
  {
  	0x003f,
  	0x0fff,
+ 	0xffff,
  };
  
  Re2

the additional pairs and the correction to the combining case here
were not accepted to sources, but they allow for large character classes
generated by folding. many of the characters are contiguous so
getting the contiguous case right is important.

/n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223
  Re2
  re2class(char *s)
  {
- 	Rune pairs[200+2], *p, *q, ov;
+ 	Rune pairs[400+2], *p, *q, ov;
  	int nc;
  	Re2 x;
  
/n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242
  			break;
  		p[1] = *p;
  		p += 2;
- 		if(p >= pairs + nelem(pairs) - 2)
+ 		if(p == pairs + nelem(pairs) - 2)
  			error("class too big");
  		s += chartorune(p, s);
  		if(*p != '-')
/n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262
  	for(p=pairs+2; *p; p+=2) {
  		if(p[0] > p[1])
  			continue;
- 		if(p[0] > q[1] || p[1] < q[0]) {
+ 		if(p[0] > q[1]+1 || p[1] < q[0]) {
  			q[2] = p[0];
  			q[3] = p[1];
  			q += 2;

i believe this case is also critical. split the bmp off.

/n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283
  			x = re2or(x, rclass(ov, p[0]-1));
  			ov = p[1]+1;
  		}
- 		x = re2or(x, rclass(ov, Runemask));
+ 		x = re2or(x, rclass(ov, 0xffff));
  	} else {
  		x = rclass(p[0], p[1]);
  		for(p+=2; *p; p+=2)

- erik
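
for readers following along without the grep source: the two ideas in the hunks above (joining pairs that touch, and capping a negated class at 0xffff) can be sketched outside grep roughly as below. this is only an illustration under assumptions, not the code in comp.c. the Rune typedef is a stand-in, the function names mergepairs, complement and inclass are hypothetical, and the pairs are assumed to be sorted by their low end already.

```c
#include <assert.h>

typedef unsigned int Rune;	/* stand-in for Plan 9's Rune; not taken from comp.c */

/*
 * mergepairs (hypothetical name): combine sorted inclusive [lo,hi]
 * pairs in place.  pairs that overlap or merely touch (lo == hi+1)
 * are joined; touching pairs are the case the q[1]+1 change appears
 * to target.  assumes every hi is below the largest Rune value, so
 * hi+1 cannot wrap.  returns the number of pairs left.
 */
int
mergepairs(Rune *p, int n)
{
	int i, m;

	assert(n >= 0);
	if(n == 0)
		return 0;
	m = 0;	/* index of the last merged pair */
	for(i = 1; i < n; i++){
		if(p[2*i] > p[2*m+1] + 1){
			/* a real gap: start a new pair */
			m++;
			p[2*m] = p[2*i];
			p[2*m+1] = p[2*i+1];
		}else if(p[2*i+1] > p[2*m+1])
			p[2*m+1] = p[2*i+1];	/* extend the current pair */
	}
	return m+1;
}

/*
 * complement (hypothetical name): write the gaps between merged
 * pairs, from 0 up to max inclusive, into out.  out needs room for
 * n+1 pairs.  returns the number of pairs written.
 */
int
complement(Rune *in, int n, Rune max, Rune *out)
{
	Rune ov;
	int i, k;

	ov = 0;	/* first rune not yet covered */
	k = 0;
	for(i = 0; i < n; i++){
		if(in[2*i] > ov){
			out[2*k] = ov;
			out[2*k+1] = in[2*i] - 1;
			k++;
		}
		ov = in[2*i+1] + 1;
	}
	if(ov <= max){
		/* the tail of the class, capped at max (0xffff for the bmp) */
		out[2*k] = ov;
		out[2*k+1] = max;
		k++;
	}
	return k;
}

/* inclass (hypothetical name): is r inside any of the n pairs? */
int
inclass(Rune *p, int n, Rune r)
{
	int i;

	for(i = 0; i < n; i++)
		if(r >= p[2*i] && r <= p[2*i+1])
			return 1;
	return 0;
}
```

with the pairs 0x30-0x39, 0x61-0x6d, 0x6e-0x7a, mergepairs leaves two pairs (0x30-0x39 and 0x61-0x7a), and complement with max 0xffff gives 0x00-0x2f, 0x3a-0x60 and 0x7b-0xffff, so 0x307e falls in the last pair. without the +1, the two touching letter pairs would stay separate, which is wasteful when folding generates many adjacent runes.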