From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To: <cb038fc012830f0a8f6cad5b76beb980@ladd.quanstro.net>
References: <cb038fc012830f0a8f6cad5b76beb980@ladd.quanstro.net>
Date: Mon, 30 Nov 2009 01:52:20 -0600
Message-ID: <d50d7d460911292352j7cbcbc7erefa21b3b7f29f20a@mail.gmail.com>
From: Jason Catena <jason.catena@gmail.com>
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans]
	=?iso-8859-1?q?gr=EBp_=28rhymes_with_creep=29_and_cptmp?=
Topicbox-Message-UUID: a51f9126-ead5-11e9-9d60-3106f5b1d025

> hey, this is great stuff!  i really like the approach.

Thank you.  It evolved from wanting to cut-and-paste character
classes to applying them automatically so I could test them.  I suppose
the character classes file could be useful in other applications that
want to selectively ignore accents.

I added a dash-and-hyphen class, keyed to the hyphen-minus as the
first character (since it's overused), so I had to change the sed
command.

sed '/^\[.+-/d;...

I also now "rm $classes" at the end, of course, though I guess it now
doesn't exit with the exit status of grep.  I should probably save
$status after the grep command, and exit with it.  Or, save the
expanded regex in a new shell variable, rm $classes, then grep with
the new shell variable so the grep is the last command.
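
Both options come down to making sure the script's final status is
grep's, not rm's.  A minimal sketch of the first option in POSIX sh,
where $? plays the role of rc's $status; the function name, its
arguments, and the grep_status variable are made up for illustration
and are not taken from the actual script:

```sh
# Sketch only: $1 stands in for the temporary classes file, $2 for the
# already-expanded regex, and the remaining arguments are files to search.
search_then_cleanup() {
	classes_file=$1
	pattern=$2
	shift 2

	# Record grep's exit status before rm can overwrite it.
	grep_status=0
	grep -e "$pattern" "$@" || grep_status=$?

	rm -f "$classes_file"
	return "$grep_status"
}
```

With this shape a caller still sees 0 for a match and 1 for no match,
even though rm runs after grep.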

> the patterns get really big in a hurry.

Agreed.  Part of grep's job is to be a regex engine, so I thought in
general it would be okay to push it here.

> i played with this a little bit, but quickly ran into problems.

> "reasonable" re size limits of say 300 characters
> just don't work if you're doing expansion.  expanding "cooperate"
> results in a 460-byte string!

Where does this 300-character limit come from?  If you code them by
hand, I agree that a 300-character regex could be hard to fully
understand.  The regexes this script generates are very simple in
structure and (ahem) regular, so I'd be inclined to exempt them from a
size restriction based on style.  As for the time and space required to
wade through the character sets, I haven't yet run into performance
problems or actual failures in my tests.
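
As a rough illustration of how regular the generated patterns are,
here is a sketch in POSIX sh.  It is not the actual script: the
two-entry table, the sample_classes variable, and the expand function
are assumptions made for this example, and the real classes file is
not shown in this thread.

```sh
# Hypothetical classes table: one bracket expression per line, keyed to
# the plain ASCII character that comes first inside the brackets.
sample_classes='[eèéêë]
[uùúûü]'

expand() {
	# Turn each table line "[kXYZ]" into the sed command "s|k|[kXYZ]|g".
	# LC_ALL=C treats the table as bytes; the key is a single ASCII byte.
	script=$(printf '%s\n' "$sample_classes" |
		LC_ALL=C sed 's/^\[\(.\)\(.*\)$/s|\1|[\1\2|g/')

	# Apply the substitutions in order.  This is safe here because no
	# sample class contains another class's key character.
	printf '%s\n' "$1" | sed "$script"
}
```

Every keyed character becomes one bracket expression and everything
else passes through unchanged, so the pattern grows with the number of
keyed characters in the word.  Piping the output of expand through
wc -c gives the byte count of an expanded pattern.  Matching accented
text with the result assumes a grep that reads its input as UTF-8.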

$ which grep
/usr/local/plan9/bin/grep

$ wc *|tail -1
  17655  118910  774237 total

$ time grëp Obergruppenfuhrersaal *
wewelsburg:155: (1938–1943): The "Obergruppenführersaal" (SS Generals' Hall) and
wewelsburg:161: floor of the "Obergruppenführersaal" lie on this axis.  Both redesigned
wewelsburg:180: The "Obergruppenführersaal" (SS Generals' Hall).  On the ground
wewelsburg:181: floor the "Obergruppenführersaal" (literally translated:
wewelsburg:236: castle, in the so-called Obergruppenführersaal ("Obergruppenführer
0.00u 0.03s 0.03r 	 grëp Obergruppenfuhrersaal 0–31acme 0–31i850 1920s ...

0.03s was the largest time I got in practice.  The first run had 0.02
user time.  This seems negligible to me, so I'm not yet pushing grep's
performance boundaries with this string (lots of vowels and other
characters with bigger classes) on this data set (a collection of
notes largely cut-and-pasted from the web).

> - erik

Jason Catena