From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
In-Reply-To:
References:
Date: Mon, 30 Nov 2009 01:52:20 -0600
Message-ID:
From: Jason Catena
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Subject: Re: [9fans] grëp (rhymes with creep) and cptmp
Topicbox-Message-UUID: a51f9126-ead5-11e9-9d60-3106f5b1d025

> hey, this is great stuff!  i really like the approach.

Thank you.  It evolved from wanting to cut-and-paste character classes, to automatically applying them to test them.  I suppose the character classes file could be useful in other applications that selectively don't want to care about accents.

I added a dash-and-hyphen class, keyed to the hyphen-minus as the first character (since it's overused), so I had to change the sed command.

sed '/^\[.+-/d;...

I also now "rm $classes" at the end, of course, though I guess it now doesn't exit with the exit status of grep.  I should probably save $status after the grep command, and exit with it.  Or, save the expanded regex in a new shell variable, rm $classes, then grep with the new shell variable so the grep is the last command.

> the patterns get really big in a hurry.

Agreed.  Part of grep's job is to be a regex engine, so I thought in general it would be okay to push it here.

> i played with this a little bit, but quickly ran into problems.
> "reasonable" re size limits of say 300 characters
> just don't work if you're doing expansion.  expanding "cooperate"
> results in a 460-byte string!

Where does this 300-character limit come from?  If you code them by hand I agree that a 300 character regex could be hard to fully understand.  The regexes this script generates are very simple in structure and (ahem) regular, so I'd be inclined to allow them past a size restriction based on style.
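To make the size growth concrete, here is a minimal Python sketch of the expansion idea.  It is not the grëp script itself: the CLASSES table, its contents, and the expand function are assumptions made up for illustration, since the actual character classes file isn't shown in this message.  A real classes file would presumably cover many more characters, so its patterns would be larger than what this produces.

```python
import re

# Hypothetical accent classes, keyed by base letter.  These are illustrative
# only; the real classes file used by the script is not reproduced here.
CLASSES = {
    "a": "aàáâãäå",
    "c": "cç",
    "e": "eèéêë",
    "o": "oòóôõö",
    "u": "uùúûü",
}


def expand(word):
    """Replace each letter that has a class with a bracket expression
    covering both cases; escape every other character literally."""
    parts = []
    for ch in word:
        members = CLASSES.get(ch.lower())
        if members:
            parts.append("[" + members + members.upper() + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


if __name__ == "__main__":
    for word in ("cooperate", "Obergruppenfuhrersaal"):
        pattern = expand(word)
        # Report input length, pattern length in characters, and in UTF-8 bytes.
        print(word, len(word), len(pattern), len(pattern.encode("utf-8")))
```

Even with five small classes the pattern is several times longer than the word, and the UTF-8 byte count grows faster than the character count, which is roughly the effect being discussed above.  The structure stays flat, though: one bracket expression per letter, with no nesting or alternation.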
As far as time and space required to wade through the character sets, I haven't yet run into performance problems or actual failures in my tests.

$ which grep
/usr/local/plan9/bin/grep
$ wc *|tail -1
17655 118910 774237 total
$ time grëp Obergruppenfuhrersaal *
wewelsburg:155: (1938–1943): The "Obergruppenführersaal" (SS Generals' Hall) and
wewelsburg:161: floor of the "Obergruppenführersaal" lie on this axis. Both redesigned
wewelsburg:180: The "Obergruppenführersaal" (SS Generals' Hall). On the ground
wewelsburg:181: floor the "Obergruppenführersaal" (literally translated:
wewelsburg:236: castle, in the so-called Obergruppenführersaal ("Obergruppenführer
0.00u 0.03s 0.03r grëp Obergruppenfuhrersaal 0–31acme 0–31i850 1920s ...

0.03 was the biggest result I got in practice.  The first run had 0.02 user time.  This seems negligible to me, so I'm not yet pushing its performance boundaries with this string (lots of vowels and other characters with bigger classes) on this data set (a collection of notes largely cut-and-pasted from the web).

> - erik

Jason Catena