* [9fans] a strange bug in grep @ 2014-03-29 23:31 arisawa 2014-03-29 23:54 ` erik quanstrom 0 siblings, 1 reply; 10+ messages in thread From: arisawa @ 2014-03-29 23:31 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs Hello, I found a strange bug in grep. some Japanese runes does not match ‘[^0-9]’. for example ‘ま' (307e) and ‘み’(307f). unicode table from 3040 to 3090 is 3040 3041 ぁ 3042 あ 3043 ぃ 3044 い 3045 ぅ 3046 う 3047 ぇ 3048 え 3049 ぉ 304a お 304b か 304c が 304d き 304e ぎ 304f く 3050 ぐ 3051 け 3052 げ 3053 こ 3054 ご 3055 さ 3056 ざ 3057 し 3058 じ 3059 す 305a ず 305b せ 305c ぜ 305d そ 305e ぞ 305f た 3060 だ 3061 ち 3062 ぢ 3063 っ 3064 つ 3065 づ 3066 て 3067 で 3068 と 3069 ど 306a な 306b に 306c ぬ 306d ね 306e の 306f は 3070 ば 3071 ぱ 3072 ひ 3073 び 3074 ぴ 3075 ふ 3076 ぶ 3077 ぷ 3078 へ 3079 べ 307a ぺ 307b ほ 307c ぼ 307d ぽ 307e ま 307f み 3080 む 3081 め 3082 も 3083 ゃ 3084 や 3085 ゅ 3086 ゆ 3087 ょ 3088 よ 3089 ら 308a り 308b る 308c れ 308d ろ 308e ゎ 308f わ 3090 ゐ I tried some of them. term% cat t3 あ は ば ぱ ひ び ぴ ふ ぶ ぷ へ べ ぺ ほ ぼ ぽ ま み む め も term% grep -v '[^0-9]' t3 ま み term% Kenji Arisawa ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-29 23:31 [9fans] a strange bug in grep arisawa @ 2014-03-29 23:54 ` erik quanstrom 2014-03-30 0:51 ` arisawa 2014-03-30 1:44 ` cinap_lenrek 0 siblings, 2 replies; 10+ messages in thread From: erik quanstrom @ 2014-03-29 23:54 UTC (permalink / raw) To: 9fans > Hello, > > I found a strange bug in grep. > some Japanese runes does not match ‘[^0-9]’. > > for example ‘ま' (307e) and ‘み’(307f). > i can't replicate here with 9atom's fixes to grep. with the same t3 file as you've got, ; wc -l /tmp/t3 21 /tmp/t3 ; grep -v '^[0-9]' /tmp/t3 | wc -l 21 i have some other differences in grep, including -I (same as -i, except fold runes), but i think the differences in comp.c are what cause the bug. in particular, you really need that 0xffff entry in the tabs. /n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147 { 0x007f, 0x07ff, + 0xffff, }; Rune tab2[] = { 0x003f, 0x0fff, + 0xffff, }; Re2 the additional pairs and the correction to the combining case here were not accepted to sources, but they allow for large character classes generated used by folding. many of the characters are contiguous so getting the contiguous case right is important. /n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223 Re2 re2class(char *s) { - Rune pairs[200+2], *p, *q, ov; + Rune pairs[400+2], *p, *q, ov; int nc; Re2 x; /n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242 break; p[1] = *p; p += 2; - if(p >= pairs + nelem(pairs) - 2) + if(p == pairs + nelem(pairs) - 2) error("class too big"); s += chartorune(p, s); if(*p != '-') /n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262 for(p=pairs+2; *p; p+=2) { if(p[0] > p[1]) continue; - if(p[0] > q[1] || p[1] < q[0]) { + if(p[0] > q[1]+1 || p[1] < q[0]) { q[2] = p[0]; q[3] = p[1]; q += 2; i believe this case is also critical. split the bmp off. /n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283 x = re2or(x, rclass(ov, p[0]-1)); ov = p[1]+1; } - x = re2or(x, rclass(ov, Runemask)); + x = re2or(x, rclass(ov, 0xffff)); } else { x = rclass(p[0], p[1]); for(p+=2; *p; p+=2) - erik ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-29 23:54 ` erik quanstrom @ 2014-03-30 0:51 ` arisawa 2014-03-30 1:44 ` cinap_lenrek 1 sibling, 0 replies; 10+ messages in thread From: arisawa @ 2014-03-30 0:51 UTC (permalink / raw) To: Fans of the OS Plan 9 from Bell Labs thanks eric. that fixed problems of my sample data! Kenji Arisawa 2014/03/30 8:54、erik quanstrom <quanstro@quanstro.net> のメール: >> Hello, >> >> I found a strange bug in grep. >> some Japanese runes does not match ‘[^0-9]’. >> >> for example ‘ま' (307e) and ‘み’(307f). >> > > i can't replicate here with 9atom's fixes to grep. > with the same t3 file as you've got, > > ; wc -l /tmp/t3 > 21 /tmp/t3 > ; grep -v '^[0-9]' /tmp/t3 | wc -l > 21 > > i have some other differences in grep, including -I (same > as -i, except fold runes), but i think the differences in > comp.c are what cause the bug. in particular, you really > need that 0xffff entry in the tabs. > > /n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147 > { > 0x007f, > 0x07ff, > + 0xffff, > }; > Rune tab2[] = > { > 0x003f, > 0x0fff, > + 0xffff, > }; > > Re2 > > the additional pairs and the correction to the combining case > here were not accepted to sources, but they allow for large character > classes generated used by folding. many of the characters are contiguous > so getting the contiguous case right is important. > > /n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223 > Re2 > re2class(char *s) > { > - Rune pairs[200+2], *p, *q, ov; > + Rune pairs[400+2], *p, *q, ov; > int nc; > Re2 x; > > /n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242 > break; > p[1] = *p; > p += 2; > - if(p >= pairs + nelem(pairs) - 2) > + if(p == pairs + nelem(pairs) - 2) > error("class too big"); > s += chartorune(p, s); > if(*p != '-') > /n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262 > for(p=pairs+2; *p; p+=2) { > if(p[0] > p[1]) > continue; > - if(p[0] > q[1] || p[1] < q[0]) { > + if(p[0] > q[1]+1 || p[1] < q[0]) { > q[2] = p[0]; > q[3] = p[1]; > q += 2; > > i believe this case is also critical. split the bmp off. > > /n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283 > x = re2or(x, rclass(ov, p[0]-1)); > ov = p[1]+1; > } > - x = re2or(x, rclass(ov, Runemask)); > + x = re2or(x, rclass(ov, 0xffff)); > } else { > x = rclass(p[0], p[1]); > for(p+=2; *p; p+=2) > > - erik > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-29 23:54 ` erik quanstrom 2014-03-30 0:51 ` arisawa @ 2014-03-30 1:44 ` cinap_lenrek 2014-03-30 6:10 ` erik quanstrom 2014-03-30 6:24 ` erik quanstrom 1 sibling, 2 replies; 10+ messages in thread From: cinap_lenrek @ 2014-03-30 1:44 UTC (permalink / raw) To: 9fans very good. one question about: - x = re2or(x, rclass(ov, Runemask)); + x = re2or(x, rclass(ov, 0xffff)); this seems wrong for 21 bit runes (the old is also wrong i think). shouldnt that be: + x = re2or(x, rclass(ov, Runemax)); as Runemask (0x1fffff) is not a valid rune for 21-bit rune as it is >Runemax. as i understand it, tab1[] array contains the last valid rune in a range of the same utf8 encoding length. basically: 0-07f -> 1 byte, 0x80-0x7ff -> 2 byte ect... so adding 0xffff is right. the next would be 0x10ffff for 21 bit runes but there shouldnt be any runes above 0x10ffff. makes any sense? -- cinap ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-30 1:44 ` cinap_lenrek @ 2014-03-30 6:10 ` erik quanstrom 2014-03-30 15:40 ` cinap_lenrek 2014-03-30 6:24 ` erik quanstrom 1 sibling, 1 reply; 10+ messages in thread From: erik quanstrom @ 2014-03-30 6:10 UTC (permalink / raw) To: 9fans On Sat Mar 29 21:46:33 EDT 2014, cinap_lenrek@felloff.net wrote: > very good. > > one question about: > > - x = re2or(x, rclass(ov, Runemask)); > + x = re2or(x, rclass(ov, 0xffff)); > > this seems wrong for 21 bit runes (the old is also wrong i think). > > shouldnt that be: > > + x = re2or(x, rclass(ov, Runemax)); > > as Runemask (0x1fffff) is not a valid rune for 21-bit rune > as it is >Runemax. yes, that's correct. i left it at 0xffff because was still a bug. tab2 still needs to burst the leading bytes so we enum all the cases. i think tab2 should be Rune tab2[] = { 0x003f, 0x0fff, 0x07ffff, }; since the first byte of the 21-bit rune is 0b11110xxx. what do you think? > as i understand it, tab1[] array contains the last valid rune > in a range of the same utf8 encoding length. > > basically: > > 0-07f -> 1 byte, 0x80-0x7ff -> 2 byte ect... > > so adding 0xffff is right. the next would be 0x10ffff for 21 bit > runes but there shouldnt be any runes above 0x10ffff. > > makes any sense? since the tab1 array is bursting at byte boundaries, the next birst is at 0x1fffff. but that's in undefined territory. - erik ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-30 6:10 ` erik quanstrom @ 2014-03-30 15:40 ` cinap_lenrek 2014-03-30 16:26 ` erik quanstrom 0 siblings, 1 reply; 10+ messages in thread From: cinap_lenrek @ 2014-03-30 15:40 UTC (permalink / raw) To: 9fans couldnt make any sense out of tab2 at first, but i think i got it now: 0xxxxxxx 110xxxxx 10mmmmmm 1110xxxx 10mmmmmm 10mmmmmm 11110xxx 10mmmmmm 10mmmmmm 10mmmmmm m = tab2[i] x = ~m so i think tab2 should be: Rune tab2[] = { 0x003f, 0x0fff, 0x3ffff, }; makes snese? -- cinap ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-30 15:40 ` cinap_lenrek @ 2014-03-30 16:26 ` erik quanstrom 2014-03-30 18:05 ` cinap_lenrek 0 siblings, 1 reply; 10+ messages in thread From: erik quanstrom @ 2014-03-30 16:26 UTC (permalink / raw) To: 9fans > 0xxxxxxx > 110xxxxx 10mmmmmm > 1110xxxx 10mmmmmm 10mmmmmm > 11110xxx 10mmmmmm 10mmmmmm 10mmmmmm > > m = tab2[i] > x = ~m [...] > > Rune tab2[] = > { > 0x003f, > 0x0fff, > 0x3ffff, > }; > > makes snese? doesn't a 3-bit mask -> 7 not 3? i read tab2[] as the number of possibilties for byte 1, assuming not ascii. if i've got that wrong, explain it again. :-) - erik ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-30 16:26 ` erik quanstrom @ 2014-03-30 18:05 ` cinap_lenrek 2014-03-30 18:10 ` erik quanstrom 0 siblings, 1 reply; 10+ messages in thread From: cinap_lenrek @ 2014-03-30 18:05 UTC (permalink / raw) To: 9fans no. lets try example with p1=0x3ffff and p2=0x7ffff and assuming your m=0x7ffff mask in tab2. p1 (0x3ffff) = 0xf0 0xbf 0xbf 0xbf (0b11110000 0b10111111 0b10111111 0b10111111) p2 (0x7ffff) = 0xf1 0xbf 0xbf 0xbf (0b11110001 0b10111111 0b10111111 0b10111111) for m == 0x7ffff (p1 & ~m) == 0 (p2 & ~m) == 0 if((p1 & ~m) != (p2 & ~m)) { ... bust } so this if case is never taken, tho the lead byte in the utf8 encoding is different for this p1-p2 range and the following encoded bytes are not comparable. > 0xxxxxxx > 110xxxxx 10mmmmmm 0x3f -> 6 bits (count m's) > 1110xxxx 10mmmmmm 10mmmmmm 0xfff -> 12 bits (count m's) > 11110xxx 10mmmmmm 10mmmmmm 10mmmmmm 0x3ffff -> 18 bits (count m's) repeat example with 18 bit mask m=0x3ffff we get (p2 & ~m) == 0x40000 then we bust and following encoded bytes are compared in each branch separately. -- cinap ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-30 18:05 ` cinap_lenrek @ 2014-03-30 18:10 ` erik quanstrom 0 siblings, 0 replies; 10+ messages in thread From: erik quanstrom @ 2014-03-30 18:10 UTC (permalink / raw) To: 9fans > repeat example with 18 bit mask m=0x3ffff we get (p2 & ~m) == 0x40000 then > we bust and following encoded bytes are compared in each branch separately. yup, you're right. - erik ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep 2014-03-30 1:44 ` cinap_lenrek 2014-03-30 6:10 ` erik quanstrom @ 2014-03-30 6:24 ` erik quanstrom 1 sibling, 0 replies; 10+ messages in thread From: erik quanstrom @ 2014-03-30 6:24 UTC (permalink / raw) To: 9fans i should mention a test case that's pretty interesting. this will not work without the changes to tab2[]. ; cat /tmp/cuneiform 𒐊 𒐋 𒑑 ; grep '[^x]' /tmp/cuneiform 𒐊 𒐋 𒑑 ; grep '[^α]' /tmp/cuneiform 𒐊 𒐋 𒑑 ; grep '[^ℙ]' /tmp/cuneiform 𒐊 𒐋 𒑑 ; grep '[^🎄]' /tmp/cuneiform 𒐊 𒐋 𒑑 - erik ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2014-03-30 18:10 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2014-03-29 23:31 [9fans] a strange bug in grep arisawa 2014-03-29 23:54 ` erik quanstrom 2014-03-30 0:51 ` arisawa 2014-03-30 1:44 ` cinap_lenrek 2014-03-30 6:10 ` erik quanstrom 2014-03-30 15:40 ` cinap_lenrek 2014-03-30 16:26 ` erik quanstrom 2014-03-30 18:05 ` cinap_lenrek 2014-03-30 18:10 ` erik quanstrom 2014-03-30 6:24 ` erik quanstrom
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).