9fans - fans of the OS Plan 9 from Bell Labs
From: erik quanstrom <quanstro@quanstro.net>
To: 9fans@9fans.net
Subject: Re: [9fans] a strange bug in grep
Date: Sat, 29 Mar 2014 19:54:15 -0400
Message-ID: <938abc1c40e15468aa034d34b07a2d49@brasstown.quanstro.net>
In-Reply-To: <76A0A13C-0D91-4620-A282-A581C206A9FA@ar.aichi-u.ac.jp>

> Hello,
> 
> I found a strange bug in grep.
> some Japanese runes do not match ‘[^0-9]’.
> 
> for example ‘ま’ (307e) and ‘み’ (307f).
> 

i can't replicate here with 9atom's fixes to grep.
with the same t3 file as you've got,

	; wc -l /tmp/t3
	     21 /tmp/t3
	; grep -v '^[0-9]' /tmp/t3 | wc -l
	     21

i have some other differences in grep, including -I (same
as -i, except that it folds runes), but i think the differences
in comp.c are the ones that matter for this bug.  in particular,
you really need that 0xffff entry in the tables.

/n/sources/plan9/sys/src/cmd/grep/comp.c:135,145 - comp.c:135,147
  {
  	0x007f,
  	0x07ff,
+ 	0xffff,
  };
  Rune	tab2[] =
  {
  	0x003f,
  	0x0fff,
+ 	0xffff,
  };
  
  Re2
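
for illustration, a minimal sketch of why the 0xffff row matters.  it assumes the first table marks the largest rune for each utf-8 encoding length (1, 2 and 3 bytes) and that ranges are split at those boundaries into same-length pieces.  lenmax, enclen and splitbylen are made-up names for this sketch, not code from comp.c.  ま (0x307e) needs three bytes, so it lies above 0x07ff, the last boundary in the two-entry table.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for Plan 9's Rune (assumption) */

#define nelem(x)	(sizeof(x)/sizeof((x)[0]))

/* largest rune encodable in 1, 2 and 3 bytes of utf-8 */
static const Rune lenmax[] = { 0x007f, 0x07ff, 0xffff };

/* number of utf-8 bytes needed for r; 4 for anything past the bmp */
int
enclen(Rune r)
{
	size_t i;

	for(i = 0; i < nelem(lenmax); i++)
		if(r <= lenmax[i])
			return (int)i + 1;
	return 4;
}

/*
 * split [lo, hi] at the encoding-length boundaries so that every
 * piece holds runes of a single encoded length.  writes at most
 * max (lo, hi) pairs to out and returns the number written.
 */
size_t
splitbylen(Rune lo, Rune hi, Rune out[][2], size_t max)
{
	size_t i, n = 0;

	for(i = 0; i < nelem(lenmax) && n + 1 < max; i++){
		if(lo <= lenmax[i] && hi > lenmax[i]){
			out[n][0] = lo;
			out[n][1] = lenmax[i];
			n++;
			lo = lenmax[i] + 1;
		}
	}
	if(n < max){
		out[n][0] = lo;
		out[n][1] = hi;
		n++;
	}
	return n;
}
```

with this table, splitbylen(0x3a, 0x1ffff, out, 8) yields four pieces, one of which is exactly 0x0800-0xffff.  drop the 0xffff row and the three-byte and four-byte runes end up in the same piece.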

the additional pairs and the correction to the combining case
here were not accepted to sources, but they allow for the large
character classes generated by folding.  many of the characters are
contiguous, so getting the contiguous case right is important.

/n/sources/plan9/sys/src/cmd/grep/comp.c:215,221 - comp.c:217,223
  Re2
  re2class(char *s)
  {
- 	Rune pairs[200+2], *p, *q, ov;
+ 	Rune pairs[400+2], *p, *q, ov;
  	int nc;
  	Re2 x;
  
/n/sources/plan9/sys/src/cmd/grep/comp.c:234,240 - comp.c:236,242
  			break;
  		p[1] = *p;
  		p += 2;
- 		if(p >= pairs + nelem(pairs) - 2)
+ 		if(p == pairs + nelem(pairs) - 2)
  			error("class too big");
  		s += chartorune(p, s);
  		if(*p != '-')
/n/sources/plan9/sys/src/cmd/grep/comp.c:254,260 - comp.c:256,262
  	for(p=pairs+2; *p; p+=2) {
  		if(p[0] > p[1])
  			continue;
- 		if(p[0] > q[1] || p[1] < q[0]) {
+ 		if(p[0] > q[1]+1 || p[1] < q[0]) {
  			q[2] = p[0];
  			q[3] = p[1];
  			q += 2;
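
the idea behind the +1 can be sketched on its own: two ranges should be combined not only when they overlap but also when the second starts right after the first ends.  this is a standalone illustration, not the code in re2class; mergepairs and its two-column array layout are assumptions made for the sketch.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for Plan 9's Rune (assumption) */

/*
 * merge (lo, hi) pairs in place and return the new pair count.
 * the pairs must already be sorted by lo, and every hi must be
 * below the largest value of Rune so that hi+1 cannot wrap.
 * ranges that overlap or touch (next lo == current hi+1) are combined.
 */
size_t
mergepairs(Rune p[][2], size_t n)
{
	size_t i, q;

	if(n == 0)
		return 0;
	q = 0;
	for(i = 1; i < n; i++){
		if(p[i][0] > p[q][1] + 1){
			/* a real gap: start a new output range */
			q++;
			p[q][0] = p[i][0];
			p[q][1] = p[i][1];
		}else if(p[i][1] > p[q][1]){
			/* overlapping or adjacent: extend the current range */
			p[q][1] = p[i][1];
		}
	}
	return q + 1;
}
```

given the pairs a-a, b-b, c-c and x-z, this leaves a-c and x-z.  without the +1 the three single-character pairs would stay separate, which adds up quickly when folding expands a class.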

i believe this case is also critical: it splits the bmp off, so a
negated class ends at 0xffff rather than at Runemask.

/n/sources/plan9/sys/src/cmd/grep/comp.c:275,281 - comp.c:277,283
  			x = re2or(x, rclass(ov, p[0]-1));
  			ov = p[1]+1;
  		}
- 		x = re2or(x, rclass(ov, Runemask));
+ 		x = re2or(x, rclass(ov, 0xffff));
  	} else {
  		x = rclass(p[0], p[1]);
  		for(p+=2; *p; p+=2)
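
a rough sketch of the negated case, assuming the class has already been reduced to sorted, non-overlapping pairs: the complement is the gaps between the pairs plus one final range from the last hi+1 up to a limit.  complement and limit are made-up names for this sketch; the point is only that with limit set to 0xffff, the last range of [^0-9] is 0x3a-0xffff, which contains ま (0x307e).

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for Plan 9's Rune (assumption) */

/*
 * write the complement of the sorted, non-overlapping pairs in p,
 * restricted to [0, limit], into out.  at most max pairs are
 * written; the number written is returned.
 * assumes every hi is below the largest value of Rune.
 */
size_t
complement(Rune p[][2], size_t n, Rune limit, Rune out[][2], size_t max)
{
	Rune ov = 0;	/* first rune not yet covered */
	size_t i, m = 0;

	for(i = 0; i < n; i++){
		if(p[i][0] > ov && m < max){
			/* the gap before this pair belongs to the complement */
			out[m][0] = ov;
			out[m][1] = p[i][0] - 1;
			m++;
		}
		ov = p[i][1] + 1;
	}
	if(ov <= limit && m < max){
		/* the tail from the last pair up to the limit */
		out[m][0] = ov;
		out[m][1] = limit;
		m++;
	}
	return m;
}
```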

- erik


