9fans - fans of the OS Plan 9 from Bell Labs
 help / color / mirror / Atom feed
* [9fans] a strange bug in grep
@ 2014-03-29 23:31 arisawa
  2014-03-29 23:54 ` erik quanstrom
  0 siblings, 1 reply; 10+ messages in thread
From: arisawa @ 2014-03-29 23:31 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Hello,

I found a strange bug in grep.
some Japanese runes does not match ‘[^0-9]’.

for example ‘ま' (307e) and ‘み’(307f).

unicode table from 3040 to 3090 is 
3040 ぀	3041 ぁ	3042 あ	3043 ぃ	3044 い	3045 ぅ	3046 う	3047 ぇ
3048 え	3049 ぉ	304a お	304b か	304c が	304d き	304e ぎ	304f く
3050 ぐ	3051 け	3052 げ	3053 こ	3054 ご	3055 さ	3056 ざ	3057 し
3058 じ	3059 す	305a ず	305b せ	305c ぜ	305d そ	305e ぞ	305f た
3060 だ	3061 ち	3062 ぢ	3063 っ	3064 つ	3065 づ	3066 て	3067 で
3068 と	3069 ど	306a な	306b に	306c ぬ	306d ね	306e の	306f は
3070 ば	3071 ぱ	3072 ひ	3073 び	3074 ぴ	3075 ふ	3076 ぶ	3077 ぷ
3078 へ	3079 べ	307a ぺ	307b ほ	307c ぼ	307d ぽ	307e ま	307f み
3080 む	3081 め	3082 も	3083 ゃ	3084 や	3085 ゅ	3086 ゆ	3087 ょ
3088 よ	3089 ら	308a り	308b る	308c れ	308d ろ	308e ゎ	308f わ
3090 ゐ

I tried some of them.
term% cat t3
あ
は
ば
ぱ
ひ
び
ぴ
ふ
ぶ
ぷ
へ
べ
ぺ
ほ
ぼ
ぽ
ま
み
む
め
も
term% grep -v '[^0-9]' t3
ま
み
term% 

Kenji Arisawa





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-29 23:31 [9fans] a strange bug in grep arisawa
@ 2014-03-29 23:54 ` erik quanstrom
  2014-03-30  0:51   ` arisawa
  2014-03-30  1:44   ` cinap_lenrek
  0 siblings, 2 replies; 10+ messages in thread
From: erik quanstrom @ 2014-03-29 23:54 UTC (permalink / raw)
  To: 9fans

> Hello,
> 
> I found a strange bug in grep.
> some Japanese runes does not match ‘[^0-9]’.
> 
> for example ‘ま' (307e) and ‘み’(307f).
> 

i can't replicate here with 9atom's fixes to grep.
with the same t3 file as you've got,

	; wc -l /tmp/t3
	     21 /tmp/t3
	; grep -v '^[0-9]' /tmp/t3 | wc -l
	     21

i have some other differences in grep, including -I (same
as -i, except fold runes), but i think the differences in
comp.c are what cause the bug.  in particular, you really
need that 0xffff entry in the tabs.

/n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147
  {
  	0x007f,
  	0x07ff,
+ 	0xffff,
  };
  Rune	tab2[] =
  {
  	0x003f,
  	0x0fff,
+ 	0xffff,
  };
  
  Re2

the additional pairs and the correction to the combining case
here were not accepted to sources, but they allow for large character
classes generated used by folding.  many of the characters are contiguous
so getting the contiguous case right is important.

/n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223
  Re2
  re2class(char *s)
  {
- 	Rune pairs[200+2], *p, *q, ov;
+ 	Rune pairs[400+2], *p, *q, ov;
  	int nc;
  	Re2 x;
  
/n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242
  			break;
  		p[1] = *p;
  		p += 2;
- 		if(p >= pairs + nelem(pairs) - 2)
+ 		if(p == pairs + nelem(pairs) - 2)
  			error("class too big");
  		s += chartorune(p, s);
  		if(*p != '-')
/n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262
  	for(p=pairs+2; *p; p+=2) {
  		if(p[0] > p[1])
  			continue;
- 		if(p[0] > q[1] || p[1] < q[0]) {
+ 		if(p[0] > q[1]+1 || p[1] < q[0]) {
  			q[2] = p[0];
  			q[3] = p[1];
  			q += 2;

i believe this case is also critical.  split the bmp off.

/n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283
  			x = re2or(x, rclass(ov, p[0]-1));
  			ov = p[1]+1;
  		}
- 		x = re2or(x, rclass(ov, Runemask));
+ 		x = re2or(x, rclass(ov, 0xffff));
  	} else {
  		x = rclass(p[0], p[1]);
  		for(p+=2; *p; p+=2)

- erik



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-29 23:54 ` erik quanstrom
@ 2014-03-30  0:51   ` arisawa
  2014-03-30  1:44   ` cinap_lenrek
  1 sibling, 0 replies; 10+ messages in thread
From: arisawa @ 2014-03-30  0:51 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

thanks eric.
that fixed problems of my sample data!

Kenji Arisawa

2014/03/30 8:54、erik quanstrom <quanstro@quanstro.net> のメール:

>> Hello,
>> 
>> I found a strange bug in grep.
>> some Japanese runes does not match ‘[^0-9]’.
>> 
>> for example ‘ま' (307e) and ‘み’(307f).
>> 
> 
> i can't replicate here with 9atom's fixes to grep.
> with the same t3 file as you've got,
> 
> 	; wc -l /tmp/t3
> 	     21 /tmp/t3
> 	; grep -v '^[0-9]' /tmp/t3 | wc -l
> 	     21
> 
> i have some other differences in grep, including -I (same
> as -i, except fold runes), but i think the differences in
> comp.c are what cause the bug.  in particular, you really
> need that 0xffff entry in the tabs.
> 
> /n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147
>  {
>  	0x007f,
>  	0x07ff,
> + 	0xffff,
>  };
>  Rune	tab2[] =
>  {
>  	0x003f,
>  	0x0fff,
> + 	0xffff,
>  };
> 
>  Re2
> 
> the additional pairs and the correction to the combining case
> here were not accepted to sources, but they allow for large character
> classes generated used by folding.  many of the characters are contiguous
> so getting the contiguous case right is important.
> 
> /n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223
>  Re2
>  re2class(char *s)
>  {
> - 	Rune pairs[200+2], *p, *q, ov;
> + 	Rune pairs[400+2], *p, *q, ov;
>  	int nc;
>  	Re2 x;
> 
> /n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242
>  			break;
>  		p[1] = *p;
>  		p += 2;
> - 		if(p >= pairs + nelem(pairs) - 2)
> + 		if(p == pairs + nelem(pairs) - 2)
>  			error("class too big");
>  		s += chartorune(p, s);
>  		if(*p != '-')
> /n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262
>  	for(p=pairs+2; *p; p+=2) {
>  		if(p[0] > p[1])
>  			continue;
> - 		if(p[0] > q[1] || p[1] < q[0]) {
> + 		if(p[0] > q[1]+1 || p[1] < q[0]) {
>  			q[2] = p[0];
>  			q[3] = p[1];
>  			q += 2;
> 
> i believe this case is also critical.  split the bmp off.
> 
> /n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283
>  			x = re2or(x, rclass(ov, p[0]-1));
>  			ov = p[1]+1;
>  		}
> - 		x = re2or(x, rclass(ov, Runemask));
> + 		x = re2or(x, rclass(ov, 0xffff));
>  	} else {
>  		x = rclass(p[0], p[1]);
>  		for(p+=2; *p; p+=2)
> 
> - erik
> 




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-29 23:54 ` erik quanstrom
  2014-03-30  0:51   ` arisawa
@ 2014-03-30  1:44   ` cinap_lenrek
  2014-03-30  6:10     ` erik quanstrom
  2014-03-30  6:24     ` erik quanstrom
  1 sibling, 2 replies; 10+ messages in thread
From: cinap_lenrek @ 2014-03-30  1:44 UTC (permalink / raw)
  To: 9fans

very good.

one question about:

- 		x = re2or(x, rclass(ov, Runemask));
+ 		x = re2or(x, rclass(ov, 0xffff));

this seems wrong for 21 bit runes (the old is also wrong i think).

shouldnt that be:

+ 		x = re2or(x, rclass(ov, Runemax));

as Runemask (0x1fffff) is not a valid rune for 21-bit rune
 as it is >Runemax.

as i understand it, tab1[] array contains the last valid rune
in a range of the same utf8 encoding length.

basically:

0-07f	-> 1 byte, 0x80-0x7ff -> 2 byte  ect...

so adding 0xffff is right. the next would be 0x10ffff for 21 bit
runes but there shouldnt be any runes above 0x10ffff.

makes any sense?

--
cinap



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-30  1:44   ` cinap_lenrek
@ 2014-03-30  6:10     ` erik quanstrom
  2014-03-30 15:40       ` cinap_lenrek
  2014-03-30  6:24     ` erik quanstrom
  1 sibling, 1 reply; 10+ messages in thread
From: erik quanstrom @ 2014-03-30  6:10 UTC (permalink / raw)
  To: 9fans

On Sat Mar 29 21:46:33 EDT 2014, cinap_lenrek@felloff.net wrote:
> very good.
>
> one question about:
>
> - 		x = re2or(x, rclass(ov, Runemask));
> + 		x = re2or(x, rclass(ov, 0xffff));
>
> this seems wrong for 21 bit runes (the old is also wrong i think).
>
> shouldnt that be:
>
> + 		x = re2or(x, rclass(ov, Runemax));
>
> as Runemask (0x1fffff) is not a valid rune for 21-bit rune
>  as it is >Runemax.

yes, that's correct.  i left it at 0xffff because was still a bug.
tab2 still needs to burst the leading bytes so we enum all
the cases.  i think tab2 should be

Rune	tab2[] =
{
	0x003f,
	0x0fff,
	0x07ffff,
};

since the first byte of the 21-bit rune is 0b11110xxx.

what do you think?

> as i understand it, tab1[] array contains the last valid rune
> in a range of the same utf8 encoding length.
>
> basically:
>
> 0-07f	-> 1 byte, 0x80-0x7ff -> 2 byte  ect...
>
> so adding 0xffff is right. the next would be 0x10ffff for 21 bit
> runes but there shouldnt be any runes above 0x10ffff.
>
> makes any sense?

since the tab1 array is bursting at byte boundaries, the next
birst is at 0x1fffff.  but that's in undefined territory.

- erik



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-30  1:44   ` cinap_lenrek
  2014-03-30  6:10     ` erik quanstrom
@ 2014-03-30  6:24     ` erik quanstrom
  1 sibling, 0 replies; 10+ messages in thread
From: erik quanstrom @ 2014-03-30  6:24 UTC (permalink / raw)
  To: 9fans

i should mention a test case that's pretty interesting.
this will not work without the changes to tab2[].

; cat /tmp/cuneiform
𒐊
𒐋
𒑑
; grep '[^x]' /tmp/cuneiform
𒐊
𒐋
𒑑
; grep '[^α]' /tmp/cuneiform
𒐊
𒐋
𒑑
; grep '[^ℙ]' /tmp/cuneiform
𒐊
𒐋
𒑑
; grep '[^🎄]' /tmp/cuneiform
𒐊
𒐋
𒑑

- erik



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-30  6:10     ` erik quanstrom
@ 2014-03-30 15:40       ` cinap_lenrek
  2014-03-30 16:26         ` erik quanstrom
  0 siblings, 1 reply; 10+ messages in thread
From: cinap_lenrek @ 2014-03-30 15:40 UTC (permalink / raw)
  To: 9fans

couldnt make any sense out of tab2 at first, but
i think i got it now:

0xxxxxxx
110xxxxx 10mmmmmm
1110xxxx 10mmmmmm 10mmmmmm
11110xxx 10mmmmmm 10mmmmmm 10mmmmmm

m = tab2[i]
x = ~m

so i think tab2 should be:

Rune	tab2[] =
{
	0x003f,
	0x0fff,
	0x3ffff,
};

makes snese?

--
cinap



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-30 15:40       ` cinap_lenrek
@ 2014-03-30 16:26         ` erik quanstrom
  2014-03-30 18:05           ` cinap_lenrek
  0 siblings, 1 reply; 10+ messages in thread
From: erik quanstrom @ 2014-03-30 16:26 UTC (permalink / raw)
  To: 9fans

> 0xxxxxxx
> 110xxxxx 10mmmmmm
> 1110xxxx 10mmmmmm 10mmmmmm
> 11110xxx 10mmmmmm 10mmmmmm 10mmmmmm
>
> m = tab2[i]
> x = ~m
[...]
>
> Rune	tab2[] =
> {
> 	0x003f,
> 	0x0fff,
> 	0x3ffff,
> };
>
> makes snese?

doesn't a 3-bit mask -> 7 not 3?
i read tab2[] as the number of possibilties for byte 1,
assuming not ascii.

if i've got that wrong, explain it again.  :-)

- erik



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-30 16:26         ` erik quanstrom
@ 2014-03-30 18:05           ` cinap_lenrek
  2014-03-30 18:10             ` erik quanstrom
  0 siblings, 1 reply; 10+ messages in thread
From: cinap_lenrek @ 2014-03-30 18:05 UTC (permalink / raw)
  To: 9fans

no.

lets try example with p1=0x3ffff and p2=0x7ffff and assuming your m=0x7ffff mask
in tab2.

p1 (0x3ffff) = 0xf0 0xbf 0xbf 0xbf (0b11110000 0b10111111 0b10111111 0b10111111)
p2 (0x7ffff) = 0xf1 0xbf 0xbf 0xbf (0b11110001 0b10111111 0b10111111 0b10111111)

for m == 0x7ffff

(p1 & ~m) == 0
(p2 & ~m) == 0

if((p1 & ~m) != (p2 & ~m)) {
	... bust
}

so this if case is never taken, tho the lead byte in the utf8 encoding is different
for this p1-p2 range and the following encoded bytes are not comparable.

> 0xxxxxxx
> 110xxxxx 10mmmmmm
0x3f -> 6 bits (count m's)

> 1110xxxx 10mmmmmm 10mmmmmm
0xfff -> 12 bits (count m's)

> 11110xxx 10mmmmmm 10mmmmmm 10mmmmmm
0x3ffff -> 18 bits (count m's)

repeat example with 18 bit mask m=0x3ffff we get (p2 & ~m) == 0x40000 then
we bust and following encoded bytes are compared in each branch separately.

--
cinap



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [9fans] a strange bug in grep
  2014-03-30 18:05           ` cinap_lenrek
@ 2014-03-30 18:10             ` erik quanstrom
  0 siblings, 0 replies; 10+ messages in thread
From: erik quanstrom @ 2014-03-30 18:10 UTC (permalink / raw)
  To: 9fans

> repeat example with 18 bit mask m=0x3ffff we get (p2 & ~m) == 0x40000 then
> we bust and following encoded bytes are compared in each branch separately.

yup, you're right.

- erik



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-03-30 18:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-29 23:31 [9fans] a strange bug in grep arisawa
2014-03-29 23:54 ` erik quanstrom
2014-03-30  0:51   ` arisawa
2014-03-30  1:44   ` cinap_lenrek
2014-03-30  6:10     ` erik quanstrom
2014-03-30 15:40       ` cinap_lenrek
2014-03-30 16:26         ` erik quanstrom
2014-03-30 18:05           ` cinap_lenrek
2014-03-30 18:10             ` erik quanstrom
2014-03-30  6:24     ` erik quanstrom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).