* [9fans] a strange bug in grep
@ 2014-03-29 23:31 arisawa
2014-03-29 23:54 ` erik quanstrom
0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2014-03-29 23:31 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
Hello,
I found a strange bug in grep.
some Japanese runes does not match ‘[^0-9]’.
for example ‘ま' (307e) and ‘み’(307f).
unicode table from 3040 to 3090 is
3040 3041 ぁ 3042 あ 3043 ぃ 3044 い 3045 ぅ 3046 う 3047 ぇ
3048 え 3049 ぉ 304a お 304b か 304c が 304d き 304e ぎ 304f く
3050 ぐ 3051 け 3052 げ 3053 こ 3054 ご 3055 さ 3056 ざ 3057 し
3058 じ 3059 す 305a ず 305b せ 305c ぜ 305d そ 305e ぞ 305f た
3060 だ 3061 ち 3062 ぢ 3063 っ 3064 つ 3065 づ 3066 て 3067 で
3068 と 3069 ど 306a な 306b に 306c ぬ 306d ね 306e の 306f は
3070 ば 3071 ぱ 3072 ひ 3073 び 3074 ぴ 3075 ふ 3076 ぶ 3077 ぷ
3078 へ 3079 べ 307a ぺ 307b ほ 307c ぼ 307d ぽ 307e ま 307f み
3080 む 3081 め 3082 も 3083 ゃ 3084 や 3085 ゅ 3086 ゆ 3087 ょ
3088 よ 3089 ら 308a り 308b る 308c れ 308d ろ 308e ゎ 308f わ
3090 ゐ
I tried some of them.
term% cat t3
あ
は
ば
ぱ
ひ
び
ぴ
ふ
ぶ
ぷ
へ
べ
ぺ
ほ
ぼ
ぽ
ま
み
む
め
も
term% grep -v '[^0-9]' t3
ま
み
term%
Kenji Arisawa
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-29 23:31 [9fans] a strange bug in grep arisawa
@ 2014-03-29 23:54 ` erik quanstrom
2014-03-30 0:51 ` arisawa
2014-03-30 1:44 ` cinap_lenrek
0 siblings, 2 replies; 10+ messages in thread
From: erik quanstrom @ 2014-03-29 23:54 UTC (permalink / raw)
To: 9fans
> Hello,
>
> I found a strange bug in grep.
> some Japanese runes does not match ‘[^0-9]’.
>
> for example ‘ま' (307e) and ‘み’(307f).
>
i can't replicate here with 9atom's fixes to grep.
with the same t3 file as you've got,
; wc -l /tmp/t3
21 /tmp/t3
; grep -v '^[0-9]' /tmp/t3 | wc -l
21
i have some other differences in grep, including -I (same
as -i, except fold runes), but i think the differences in
comp.c are what cause the bug. in particular, you really
need that 0xffff entry in the tabs.
/n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147
{
0x007f,
0x07ff,
+ 0xffff,
};
Rune tab2[] =
{
0x003f,
0x0fff,
+ 0xffff,
};
Re2
the additional pairs and the correction to the combining case
here were not accepted to sources, but they allow for large character
classes generated used by folding. many of the characters are contiguous
so getting the contiguous case right is important.
/n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223
Re2
re2class(char *s)
{
- Rune pairs[200+2], *p, *q, ov;
+ Rune pairs[400+2], *p, *q, ov;
int nc;
Re2 x;
/n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242
break;
p[1] = *p;
p += 2;
- if(p >= pairs + nelem(pairs) - 2)
+ if(p == pairs + nelem(pairs) - 2)
error("class too big");
s += chartorune(p, s);
if(*p != '-')
/n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262
for(p=pairs+2; *p; p+=2) {
if(p[0] > p[1])
continue;
- if(p[0] > q[1] || p[1] < q[0]) {
+ if(p[0] > q[1]+1 || p[1] < q[0]) {
q[2] = p[0];
q[3] = p[1];
q += 2;
i believe this case is also critical. split the bmp off.
/n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283
x = re2or(x, rclass(ov, p[0]-1));
ov = p[1]+1;
}
- x = re2or(x, rclass(ov, Runemask));
+ x = re2or(x, rclass(ov, 0xffff));
} else {
x = rclass(p[0], p[1]);
for(p+=2; *p; p+=2)
- erik
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-29 23:54 ` erik quanstrom
@ 2014-03-30 0:51 ` arisawa
2014-03-30 1:44 ` cinap_lenrek
1 sibling, 0 replies; 10+ messages in thread
From: arisawa @ 2014-03-30 0:51 UTC (permalink / raw)
To: Fans of the OS Plan 9 from Bell Labs
thanks eric.
that fixed problems of my sample data!
Kenji Arisawa
2014/03/30 8:54、erik quanstrom <quanstro@quanstro.net> のメール:
>> Hello,
>>
>> I found a strange bug in grep.
>> some Japanese runes does not match ‘[^0-9]’.
>>
>> for example ‘ま' (307e) and ‘み’(307f).
>>
>
> i can't replicate here with 9atom's fixes to grep.
> with the same t3 file as you've got,
>
> ; wc -l /tmp/t3
> 21 /tmp/t3
> ; grep -v '^[0-9]' /tmp/t3 | wc -l
> 21
>
> i have some other differences in grep, including -I (same
> as -i, except fold runes), but i think the differences in
> comp.c are what cause the bug. in particular, you really
> need that 0xffff entry in the tabs.
>
> /n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147
> {
> 0x007f,
> 0x07ff,
> + 0xffff,
> };
> Rune tab2[] =
> {
> 0x003f,
> 0x0fff,
> + 0xffff,
> };
>
> Re2
>
> the additional pairs and the correction to the combining case
> here were not accepted to sources, but they allow for large character
> classes generated used by folding. many of the characters are contiguous
> so getting the contiguous case right is important.
>
> /n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223
> Re2
> re2class(char *s)
> {
> - Rune pairs[200+2], *p, *q, ov;
> + Rune pairs[400+2], *p, *q, ov;
> int nc;
> Re2 x;
>
> /n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242
> break;
> p[1] = *p;
> p += 2;
> - if(p >= pairs + nelem(pairs) - 2)
> + if(p == pairs + nelem(pairs) - 2)
> error("class too big");
> s += chartorune(p, s);
> if(*p != '-')
> /n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262
> for(p=pairs+2; *p; p+=2) {
> if(p[0] > p[1])
> continue;
> - if(p[0] > q[1] || p[1] < q[0]) {
> + if(p[0] > q[1]+1 || p[1] < q[0]) {
> q[2] = p[0];
> q[3] = p[1];
> q += 2;
>
> i believe this case is also critical. split the bmp off.
>
> /n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283
> x = re2or(x, rclass(ov, p[0]-1));
> ov = p[1]+1;
> }
> - x = re2or(x, rclass(ov, Runemask));
> + x = re2or(x, rclass(ov, 0xffff));
> } else {
> x = rclass(p[0], p[1]);
> for(p+=2; *p; p+=2)
>
> - erik
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-29 23:54 ` erik quanstrom
2014-03-30 0:51 ` arisawa
@ 2014-03-30 1:44 ` cinap_lenrek
2014-03-30 6:10 ` erik quanstrom
2014-03-30 6:24 ` erik quanstrom
1 sibling, 2 replies; 10+ messages in thread
From: cinap_lenrek @ 2014-03-30 1:44 UTC (permalink / raw)
To: 9fans
very good.
one question about:
- x = re2or(x, rclass(ov, Runemask));
+ x = re2or(x, rclass(ov, 0xffff));
this seems wrong for 21 bit runes (the old is also wrong i think).
shouldnt that be:
+ x = re2or(x, rclass(ov, Runemax));
as Runemask (0x1fffff) is not a valid rune for 21-bit rune
as it is >Runemax.
as i understand it, tab1[] array contains the last valid rune
in a range of the same utf8 encoding length.
basically:
0-07f -> 1 byte, 0x80-0x7ff -> 2 byte ect...
so adding 0xffff is right. the next would be 0x10ffff for 21 bit
runes but there shouldnt be any runes above 0x10ffff.
makes any sense?
--
cinap
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-30 1:44 ` cinap_lenrek
@ 2014-03-30 6:10 ` erik quanstrom
2014-03-30 15:40 ` cinap_lenrek
2014-03-30 6:24 ` erik quanstrom
1 sibling, 1 reply; 10+ messages in thread
From: erik quanstrom @ 2014-03-30 6:10 UTC (permalink / raw)
To: 9fans
On Sat Mar 29 21:46:33 EDT 2014, cinap_lenrek@felloff.net wrote:
> very good.
>
> one question about:
>
> - x = re2or(x, rclass(ov, Runemask));
> + x = re2or(x, rclass(ov, 0xffff));
>
> this seems wrong for 21 bit runes (the old is also wrong i think).
>
> shouldnt that be:
>
> + x = re2or(x, rclass(ov, Runemax));
>
> as Runemask (0x1fffff) is not a valid rune for 21-bit rune
> as it is >Runemax.
yes, that's correct. i left it at 0xffff because was still a bug.
tab2 still needs to burst the leading bytes so we enum all
the cases. i think tab2 should be
Rune tab2[] =
{
0x003f,
0x0fff,
0x07ffff,
};
since the first byte of the 21-bit rune is 0b11110xxx.
what do you think?
> as i understand it, tab1[] array contains the last valid rune
> in a range of the same utf8 encoding length.
>
> basically:
>
> 0-07f -> 1 byte, 0x80-0x7ff -> 2 byte ect...
>
> so adding 0xffff is right. the next would be 0x10ffff for 21 bit
> runes but there shouldnt be any runes above 0x10ffff.
>
> makes any sense?
since the tab1 array is bursting at byte boundaries, the next
birst is at 0x1fffff. but that's in undefined territory.
- erik
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-30 1:44 ` cinap_lenrek
2014-03-30 6:10 ` erik quanstrom
@ 2014-03-30 6:24 ` erik quanstrom
1 sibling, 0 replies; 10+ messages in thread
From: erik quanstrom @ 2014-03-30 6:24 UTC (permalink / raw)
To: 9fans
i should mention a test case that's pretty interesting.
this will not work without the changes to tab2[].
; cat /tmp/cuneiform
𒐊
𒐋
𒑑
; grep '[^x]' /tmp/cuneiform
𒐊
𒐋
𒑑
; grep '[^α]' /tmp/cuneiform
𒐊
𒐋
𒑑
; grep '[^ℙ]' /tmp/cuneiform
𒐊
𒐋
𒑑
; grep '[^🎄]' /tmp/cuneiform
𒐊
𒐋
𒑑
- erik
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-30 6:10 ` erik quanstrom
@ 2014-03-30 15:40 ` cinap_lenrek
2014-03-30 16:26 ` erik quanstrom
0 siblings, 1 reply; 10+ messages in thread
From: cinap_lenrek @ 2014-03-30 15:40 UTC (permalink / raw)
To: 9fans
couldnt make any sense out of tab2 at first, but
i think i got it now:
0xxxxxxx
110xxxxx 10mmmmmm
1110xxxx 10mmmmmm 10mmmmmm
11110xxx 10mmmmmm 10mmmmmm 10mmmmmm
m = tab2[i]
x = ~m
so i think tab2 should be:
Rune tab2[] =
{
0x003f,
0x0fff,
0x3ffff,
};
makes snese?
--
cinap
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-30 15:40 ` cinap_lenrek
@ 2014-03-30 16:26 ` erik quanstrom
2014-03-30 18:05 ` cinap_lenrek
0 siblings, 1 reply; 10+ messages in thread
From: erik quanstrom @ 2014-03-30 16:26 UTC (permalink / raw)
To: 9fans
> 0xxxxxxx
> 110xxxxx 10mmmmmm
> 1110xxxx 10mmmmmm 10mmmmmm
> 11110xxx 10mmmmmm 10mmmmmm 10mmmmmm
>
> m = tab2[i]
> x = ~m
[...]
>
> Rune tab2[] =
> {
> 0x003f,
> 0x0fff,
> 0x3ffff,
> };
>
> makes snese?
doesn't a 3-bit mask -> 7 not 3?
i read tab2[] as the number of possibilties for byte 1,
assuming not ascii.
if i've got that wrong, explain it again. :-)
- erik
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-30 16:26 ` erik quanstrom
@ 2014-03-30 18:05 ` cinap_lenrek
2014-03-30 18:10 ` erik quanstrom
0 siblings, 1 reply; 10+ messages in thread
From: cinap_lenrek @ 2014-03-30 18:05 UTC (permalink / raw)
To: 9fans
no.
lets try example with p1=0x3ffff and p2=0x7ffff and assuming your m=0x7ffff mask
in tab2.
p1 (0x3ffff) = 0xf0 0xbf 0xbf 0xbf (0b11110000 0b10111111 0b10111111 0b10111111)
p2 (0x7ffff) = 0xf1 0xbf 0xbf 0xbf (0b11110001 0b10111111 0b10111111 0b10111111)
for m == 0x7ffff
(p1 & ~m) == 0
(p2 & ~m) == 0
if((p1 & ~m) != (p2 & ~m)) {
... bust
}
so this if case is never taken, tho the lead byte in the utf8 encoding is different
for this p1-p2 range and the following encoded bytes are not comparable.
> 0xxxxxxx
> 110xxxxx 10mmmmmm
0x3f -> 6 bits (count m's)
> 1110xxxx 10mmmmmm 10mmmmmm
0xfff -> 12 bits (count m's)
> 11110xxx 10mmmmmm 10mmmmmm 10mmmmmm
0x3ffff -> 18 bits (count m's)
repeat example with 18 bit mask m=0x3ffff we get (p2 & ~m) == 0x40000 then
we bust and following encoded bytes are compared in each branch separately.
--
cinap
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [9fans] a strange bug in grep
2014-03-30 18:05 ` cinap_lenrek
@ 2014-03-30 18:10 ` erik quanstrom
0 siblings, 0 replies; 10+ messages in thread
From: erik quanstrom @ 2014-03-30 18:10 UTC (permalink / raw)
To: 9fans
> repeat example with 18 bit mask m=0x3ffff we get (p2 & ~m) == 0x40000 then
> we bust and following encoded bytes are compared in each branch separately.
yup, you're right.
- erik
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2014-03-30 18:10 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-29 23:31 [9fans] a strange bug in grep arisawa
2014-03-29 23:54 ` erik quanstrom
2014-03-30 0:51 ` arisawa
2014-03-30 1:44 ` cinap_lenrek
2014-03-30 6:10 ` erik quanstrom
2014-03-30 15:40 ` cinap_lenrek
2014-03-30 16:26 ` erik quanstrom
2014-03-30 18:05 ` cinap_lenrek
2014-03-30 18:10 ` erik quanstrom
2014-03-30 6:24 ` erik quanstrom
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).