From: Bruce Ellis
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Date: Mon, 30 Nov 2009 15:51:51 +1100
Message-ID: <775b8d190911292051g57001bf7p3deb7439858b9e4b@mail.gmail.com>
Subject: Re: [9fans] grëp (rhymes with creep) and cptmp
Topicbox-Message-UUID: a51a0b84-ead5-11e9-9d60-3106f5b1d025

i like the approach. back in basser computational linguistics days
frank was indexing a greek verb dictionary. to sort the keys - he used
tr | sort | tr.

i'm glad you didn't screw with grep. it's brilliant but the
implementation is not easily understood. i was in the room at the
time, so i have a headstart.

brucee

On 11/30/09, Jason Catena wrote:
> I wrote a wrapper around grep to search for words regardless of
> accents. I didn't want to worry about whether I used accents on
> characters (I sometimes use them inconsistently, and others decidedly
> do), but I still wanted to limit the results to exact matches if I
> supplied an accent. Here's an example run.
>
>
> $ grep facade word
> treatment . A false, superficial, or artificial
>
> $ grëp facade word
> 89: to bow to man. façade. circa 1681. French façade, from Italian
> 92: treatment . A false, superficial, or artificial
>
> $ grëp façade *
> style:21: crucial difference to pronunciation: cliché, soupçon, façade, café,
> wabisabi:51: or the crumbling stone façade of an old building. Transience,
> word:89: to bow to man. façade. circa 1681. French façade, from Italian
>
>
> Note that line word:92 (output by the second command) is not output by
> the third command, since I supplied an accent on that particular
> character (ç) in my input pattern. I chose the umlaut or diæresis to
> remind me that grëp provides the -n option by default, so I'll get a
> line number and : in the output.
> (I should probably just pass through
> all of grep's command-line options.)
>
>
> =
> #!/usr/local/plan9/bin/rc
>
> regex=$1
> shift
>
> classes=`{cptmp classes}
> sed '/-/d;s,^\[(.),s/\1/\[\1,;s,$,/g,' charclass > $classes
>
> grep -n `{echo $regex | sed -f $classes} $*
>
>
> I translate each ordinary latin character in the input pattern (eg
> [0-9A-Za-z]) into a character class (the attached charclass file,
> which doesn't cut-and-paste well), and then call grep with the updated
> pattern. The first sed command in grëp turns the character classes in
> charclass into s commands for sed. The charclass file contains the
> square brackets because I also use it to cut-and-paste from when I
> need a character class for a sed script.
>
> The script cptmp creates a temporary copy of an existing file, or a
> temporary new file.
>
>
> =
> #!/usr/local/plan9/bin/rc
> flag e +
>
> if(~ $#TMPDIR 0)
> 	TMPDIR=/tmp
> base=`{basename $1}
> tmp=$TMPDIR/$base.$USER.$pid
>
> if (test -f $1) {
> 	cp -pr $1 $tmp
> }
> if not {
> 	touch $tmp
> }
> chmod +wx $tmp
> echo $tmp
>
>
> Jason Catena
>
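
For anyone reading along without rc or the charclass attachment, the
pattern rewriting that grëp does can be sketched in Python. This is only
a rough sketch, not Jason's script: CHARCLASSES is a hypothetical and
much shortened stand-in for the charclass file, and build_table, widen
and grep_n are names made up for illustration. It also assumes the text
being searched uses precomposed accented characters.

```python
import re

# Hypothetical stand-in for the charclass file (not the real attachment):
# one bracketed class per line, the plain letter first.
CHARCLASSES = [
    "[aàáâãäå]",
    "[cç]",
    "[eèéêë]",
    "[iìíîï]",
    "[oòóôõö]",
    "[uùúûü]",
    "[0-9]",  # a range class; skipped below, as sed's /-/d does
]


def build_table(classes):
    """Map each plain letter to its bracketed class, skipping range classes."""
    table = {}
    for cls in classes:
        if "-" in cls:
            continue
        table[cls[1]] = cls  # cls[1] is the character right after '['
    return table


def widen(pattern, table):
    """Replace each plain letter by its class; accented letters stay literal."""
    return "".join(table.get(ch, ch) for ch in pattern)


def grep_n(pattern, lines, table):
    """Return 'N: line' for each line matching the widened pattern."""
    rx = re.compile(widen(pattern, table))
    return [f"{n}: {line}" for n, line in enumerate(lines, 1) if rx.search(line)]


if __name__ == "__main__":
    table = build_table(CHARCLASSES)
    sample = ["French façade, from Italian", "a false facade"]
    print(grep_n("facade", sample, table))  # pattern without accents
    print(grep_n("façade", sample, table))  # pattern with an explicit ç
```

Because only the plain letters are keys in the table, an accented
character typed in the pattern passes through untouched and has to match
exactly, which is the behaviour described for word:92 above.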