From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 30 Nov 2009 09:00:25 +0000
From: Eris Discordia
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Message-ID: <954FF94C2C131285456E5657@[192.168.1.2]>
Subject: Re: [9fans] grëp (rhymes with creep) and cptmp
Topicbox-Message-UUID: a529a8d2-ead5-11e9-9d60-3106f5b1d025

> $ time grëp Obergruppenfuhrersaal *

Touché :-)

--On Monday, November 30, 2009 01:52 -0600 Jason Catena wrote:

>> hey, this is great stuff!  i really like the approach.
>
> Thank you. It evolved from wanting to cut-and-paste character
> classes, to automatically applying them to test them. I suppose the
> character classes file could be useful in other applications that
> selectively don't want to care about accents.
>
> I added a dash-and-hyphen class, keyed to the hyphen-minus as the
> first character (since it's overused), so I had to change the sed
> command.
>
> sed '/^\[.+-/d;...
>
> I also now "rm $classes" at the end, of course, though I guess it now
> doesn't exit with the exit status of grep. I should probably save
> $status after the grep command, and exit with it. Or, save the
> expanded regex in a new shell variable, rm $classes, then grep with
> the new shell variable so the grep is the last command.
>
>> the patterns get really big in a hurry.
>
> Agreed. Part of grep's job is to be a regex engine, so I thought in
> general it would be okay to push it here.
>
>> i played with this a little bit, but quickly ran into problems.
>
>> "reasonable" re size limits of say 300 characters
>> just don't work if you're doing expansion.  expanding "cooperate"
>> results in a 460-byte string!
>
> Where does this 300-character limit come from?
> If you code them by
> hand I agree that a 300 character regex could be hard to fully
> understand. The regexes this script generates are very simple in
> structure and (ahem) regular, so I'd be inclined to allow them past a
> size restriction based on style. As far as time and space required to
> wade through the character sets, I haven't yet run into performance
> problems or actual failures in my tests.
>
> $ which grep
> /usr/local/plan9/bin/grep
>
> $ wc *|tail -1
> 17655 118910 774237 total
>
> $ time grëp Obergruppenfuhrersaal *
> wewelsburg:155: (1938–1943): The "Obergruppenführersaal" (SS Generals' Hall) and
> wewelsburg:161: floor of the "Obergruppenführersaal" lie on this axis. Both redesigned
> wewelsburg:180: The "Obergruppenführersaal" (SS Generals' Hall). On the ground
> wewelsburg:181: floor the "Obergruppenführersaal" (literally translated:
> wewelsburg:236: castle, in the so-called Obergruppenführersaal ("Obergruppenführer
> 0.00u 0.03s 0.03r grëp Obergruppenfuhrersaal 0–31acme 0–31i850 1920s ...
>
> 0.03 was the biggest result I got in practice. The first run had 0.02
> user time. This seems negligible to me, so I'm not yet pushing its
> performance boundaries with this string (lots of vowels and other
> characters with bigger classes) on this data set (a collection of
> notes largely cut-and-pasted from the web).
>
>> - erik
>
> Jason Catena
>
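
For anyone following the size argument quoted above, here is a rough sketch of the kind of expansion being discussed, written in Python rather than rc since the script itself is not shown in this thread. The CLASSES table and the expand function are hypothetical stand-ins, not Jason's classes file or his sed command, and the table is deliberately tiny. The point is only that every letter with a class turns into a bracket expression, so the pattern grows by the size of each class.

```python
import re

# Hypothetical, deliberately small class table (an assumption for
# illustration; the real classes file from the thread is not shown).
CLASSES = {
    "a": "aàáâãäå",
    "c": "cç",
    "e": "eèéêë",
    "o": "oòóôõö",
    "u": "uùúûü",
}


def expand(pattern: str) -> str:
    """Replace each character that has a class with a bracket expression.

    Characters without a class are kept as escaped literals.
    """
    out = []
    for ch in pattern:
        members = CLASSES.get(ch)
        out.append("[" + members + "]" if members else re.escape(ch))
    return "".join(out)


if __name__ == "__main__":
    word = "Obergruppenfuhrersaal"
    regex = expand(word)
    # Compare the typed pattern with the expanded one, in characters and
    # in UTF-8 bytes (accented members take more than one byte each).
    print(len(word), len(regex), len(regex.encode("utf-8")))
    # The unaccented pattern now matches the accented text.
    print(bool(re.search(regex, 'The "Obergruppenführersaal" (SS Generals\' Hall)')))
```

With fuller classes per letter the growth is correspondingly larger, which is the effect erik describes for "cooperate".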