9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] grëp (rhymes with creep) and cptmp
       [not found] <<d50d7d460911292352j7cbcbc7erefa21b3b7f29f20a@mail.gmail.com>
@ 2009-11-30 13:50 ` erik quanstrom
  2009-11-30 14:48   ` roger peppe
  2009-11-30 15:10   ` Jason Catena
  0 siblings, 2 replies; 19+ messages in thread
From: erik quanstrom @ 2009-11-30 13:50 UTC (permalink / raw)
  To: 9fans

On Mon Nov 30 02:54:45 EST 2009, jason.catena@gmail.com wrote:
> Agreed.  Part of grep's job is to be a regex engine, so I thought in
> general it would be okay to push it here.
>
> > i played with this a little bit, but quickly ran into problems.
>
> > "reasonable" re size limits of say 300 characters
> > just don't work if you're doing expansion.  expanding "cooperate"
> > results in a 460-byte string!
>
> Where does this 300-character limit come from?  If you code them by

dict.

i used unfold (/n/sources/contrib/quanstro/runetype/unfold.c).
	; unfold cooperate | wc -rc
  	  199     454
it turns out that doing regular expressions is difficult, since
it's not clear to me what [a-z] should match when unfolded.

on the other hand, a folding-based approach makes the meaning
of [a-z] clear.  it's a good argument for folding.
	 echo 'rhymes with grëep' |../grep/8.out -I 'gr[a-z]ep'
	rhymes with grëep
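
The folding idea can be sketched as follows. This is only an illustration, not the grepfold source: it assumes Unicode canonical decomposition (NFD) as a stand-in for the fold table, and the names `fold` and `fold_search` are made up here. Only the input line is folded, so an accented character in the pattern would never match in this sketch.

```python
import re
import unicodedata

def fold(text: str) -> str:
    """Map each character to its base letter by dropping combining marks.

    Assumption: NFD decomposition stands in for a real fold table, so
    letters with no canonical decomposition pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold_search(pattern: str, line: str) -> bool:
    """Match an ordinary regex against the folded form of the line."""
    return re.search(pattern, fold(line)) is not None

if __name__ == "__main__":
    line = "rhymes with grëep"
    print(fold(line))
    print(fold_search("gr[a-z]ep", line))
```

Because the class [a-z] is applied after folding, it keeps its ordinary ASCII meaning and nothing has to be decided about where accented characters sit in a range.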

- erik




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 13:50 ` [9fans] grëp (rhymes with creep) and cptmp erik quanstrom
@ 2009-11-30 14:48   ` roger peppe
  2009-11-30 14:54     ` David Leimbach
  2009-11-30 15:10   ` Jason Catena
  1 sibling, 1 reply; 19+ messages in thread
From: roger peppe @ 2009-11-30 14:48 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2009/11/30 erik quanstrom <quanstro@quanstro.net>:
> i used unfold (/n/sources/contrib/quanstro/runetype/unfold.c.

% 8c -I ../grepfold unfold.c
unfold.c:5 8c: 'utfunfold.h' file does not exist: utfunfold.h
% du -a /n/sources/contrib/quanstro | grep utfunfold.h
%

forgive me for not reading the source code,
but what does unfold do?




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 14:48   ` roger peppe
@ 2009-11-30 14:54     ` David Leimbach
  0 siblings, 0 replies; 19+ messages in thread
From: David Leimbach @ 2009-11-30 14:54 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


On Mon, Nov 30, 2009 at 6:48 AM, roger peppe <rogpeppe@gmail.com> wrote:

> 2009/11/30 erik quanstrom <quanstro@quanstro.net>:
> > i used unfold (/n/sources/contrib/quanstro/runetype/unfold.c.
>
> % 8c -I ../grepfold unfold.c
> unfold.c:5 8c: 'utfunfold.h' file does not exist: utfunfold.h
> % du -a /n/sources/contrib/quanstro | grep utfunfold.h
> %
>
> forgive me for not reading the source code,
> but what does unfold do?
>
Also didn't read the source... sounds functionally delicious.  :-)



* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 13:50 ` [9fans] grëp (rhymes with creep) and cptmp erik quanstrom
  2009-11-30 14:48   ` roger peppe
@ 2009-11-30 15:10   ` Jason Catena
  2009-11-30 15:32     ` erik quanstrom
  1 sibling, 1 reply; 19+ messages in thread
From: Jason Catena @ 2009-11-30 15:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> it turns out that doing regular expressions is difficult, since
> it's not clear to me what [a-z] should match when unfolded.

I have discovered a truly marvellous proof of this, which this memory
is too narrow to contain.

209 runes in an unfolded a-Ǯ superclass later...
12498: signal: sys: segmentation violation

> - erik

Jason Catena




* Re: [9fans] grëp (rhymes with creep) and cptmp
       [not found] <<df49a7370911300648l5e243b12ncdf6de116d81afa9@mail.gmail.com>
@ 2009-11-30 15:28 ` erik quanstrom
  2009-11-30 16:38   ` roger peppe
  0 siblings, 1 reply; 19+ messages in thread
From: erik quanstrom @ 2009-11-30 15:28 UTC (permalink / raw)
  To: 9fans

On Mon Nov 30 09:49:34 EST 2009, rogpeppe@gmail.com wrote:
> 2009/11/30 erik quanstrom <quanstro@quanstro.net>:
> > i used unfold (/n/sources/contrib/quanstro/runetype/unfold.c.
>
> % 8c -I ../grepfold unfold.c
> unfold.c:5 8c: 'utfunfold.h' file does not exist: utfunfold.h
> % du -a /n/sources/contrib/quanstro | grep utfunfold.h
> %
>

utfunfold is generated from the utf tables.  copy the
whole directory and mk.  you'll have to build fold
by hand.  the mkfile is terrible.

> forgive me for not reading the source code,
> but what does unfold do?

unfold turns a character, say ë into the set of
characters that can be folded to the same base
character.  so
	; unfold ë
	[eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ]
on second thought the [] seem like a bad idea.
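
A rough sketch of what unfold computes, for readers without access to the sources. It is not unfold.c: it assumes NFD decomposition as the fold, scans only code points below a hypothetical `limit`, and omits the surrounding brackets.

```python
import unicodedata

def base(ch: str) -> str:
    """Fold one character to its base by dropping combining marks (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch  # a bare combining mark folds to itself

def unfold(ch: str, limit: int = 0x2000) -> str:
    """Return every character below `limit` that folds to the same base as ch."""
    target = base(ch)
    return "".join(chr(cp) for cp in range(limit) if base(chr(cp)) == target)

if __name__ == "__main__":
    print(unfold("ë"))
```

Unfolding is just the inverse image of the fold, so unfolding e and unfolding ë give the same set.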

- erik




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 15:10   ` Jason Catena
@ 2009-11-30 15:32     ` erik quanstrom
  2009-11-30 15:54       ` Jorden Mauro
  0 siblings, 1 reply; 19+ messages in thread
From: erik quanstrom @ 2009-11-30 15:32 UTC (permalink / raw)
  To: 9fans

On Mon Nov 30 10:13:09 EST 2009, jason.catena@gmail.com wrote:
> > it turns out that doing regular expressions is difficult, since
> > it's not clear to me what [a-z] should match when unfolded.
>
> I have discovered a truly marvellous proof of this, which this memory
> is too narrow to contain.
>
> 209 runes in an unfolded a-Ǯ superclass later...
> 12498: signal: sys: segmentation violation

size isn't the real issue.  the real issue is determining what
the ranges are for other than the base character.  if a maps
to [aa'...] and z maps to [zz'...]  it's not clear that [a'-z'] is a
sensible set.  for example what does [e-f] map to?  [e-f], clearly
but [ë-what?]

- erik




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 15:32     ` erik quanstrom
@ 2009-11-30 15:54       ` Jorden Mauro
  2009-11-30 16:00         ` erik quanstrom
  0 siblings, 1 reply; 19+ messages in thread
From: Jorden Mauro @ 2009-11-30 15:54 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Nov 30, 2009 at 10:32 AM, erik quanstrom <quanstro@coraid.com> wrote:
> size isn't the real issue.  the real issue is determining what
> the ranges are for other than the base character.  if a maps
> to [aa'...] and z maps to [zz'...]  it's not clear that [a'-z'] is a
> sensible set.  for example what does [e-f] map to?  [e-f], clearly
> but [ë-what?]
>

``unfold turns a character, say ë into the set of
characters that can be folded to the same base
character.  so
       ; unfold ë
       [eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ]''

To me, that sounds like [e-f] should be

[eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệfƒ]

iff e unfolds to the same set as ë. If e only unfolds to [e], then
[e-f] would unfold to [ef].

Does that sound sane?




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 15:54       ` Jorden Mauro
@ 2009-11-30 16:00         ` erik quanstrom
  2009-11-30 18:38           ` hiro
  2009-11-30 19:43           ` Jorden Mauro
  0 siblings, 2 replies; 19+ messages in thread
From: erik quanstrom @ 2009-11-30 16:00 UTC (permalink / raw)
  To: 9fans

> ``unfold turns a character, say ë into the set of
> characters that can be folded to the same base
> character.  so
>        ; unfold ë
>        [eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ]''
>
> To me, that sounds like [e-f] should be
>
> [eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệfƒ]
>
> iff e unfolds to the same set as ë. If e only unfolds to [e], then
> [e-f] would unfold to [ef].

i don't think that works.  consider [e-g].  normally
this would match 'f', but under your algorithm it wouldn't.
the problem is that [a-z] works because ascii is arranged
in alphabetical order.  all the various accented characters
are not.

that's why the folding approach has an advantage: [a-z]
will work and will do the Right Thing.
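
The objection can be checked with a small sketch. It takes one reading of the proposal (unfold only the two endpoints of the range) and uses a hypothetical, heavily abbreviated `UNFOLD` table; neither is taken from any real implementation.

```python
import re

# Hypothetical, abbreviated unfold sets; real ones are far larger.
UNFOLD = {"e": "eèéêë", "g": "gĝğġģ"}

def endpoint_class(lo: str, hi: str) -> str:
    """Build a class from the unfolded endpoints only (the flawed reading)."""
    return "[" + UNFOLD.get(lo, lo) + UNFOLD.get(hi, hi) + "]"

if __name__ == "__main__":
    cls = endpoint_class("e", "g")
    print(cls)
    print(re.fullmatch(cls, "ë") is not None)  # an accented endpoint
    print(re.fullmatch(cls, "f") is not None)  # the letter between e and g
    print("e" <= "ë" <= "g")                   # code point order of ë
```

The accented forms live far above the ASCII letters in code point order, so a range written between two of them does not describe an alphabetical span.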

- erik




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 15:28 ` erik quanstrom
@ 2009-11-30 16:38   ` roger peppe
  2009-11-30 17:34     ` erik quanstrom
  0 siblings, 1 reply; 19+ messages in thread
From: roger peppe @ 2009-11-30 16:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2009/11/30 erik quanstrom <quanstro@quanstro.net>:
> utfunfold is generated from the utf tables.  copy the
> whole directory and mk.

i would if i could.

% pwd
/n/sources/contrib/quanstro/runetype
% cat mkfile
cat: can't open mkfile: 'mkfile' permission denied
% ls -l | grep -- -----
--rw------- M 5961 quanstro sys   1427 Nov 30 08:29 mkfile
--rw------- M 5961 quanstro sys   2243 Nov 28 17:30 runetype.c
--rw------- M 5961 quanstro sys   1046 Nov 28 17:30 uconv.c
%




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 16:38   ` roger peppe
@ 2009-11-30 17:34     ` erik quanstrom
  0 siblings, 0 replies; 19+ messages in thread
From: erik quanstrom @ 2009-11-30 17:34 UTC (permalink / raw)
  To: 9fans

On Mon Nov 30 11:40:50 EST 2009, rogpeppe@gmail.com wrote:
> 2009/11/30 erik quanstrom <quanstro@quanstro.net>:
> > utfunfold is generated from the utf tables.  copy the
> > whole directory and mk.
>
> i would if i could.

fixed.

- erik




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 16:00         ` erik quanstrom
@ 2009-11-30 18:38           ` hiro
  2009-11-30 19:43           ` Jorden Mauro
  1 sibling, 0 replies; 19+ messages in thread
From: hiro @ 2009-11-30 18:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

>> iff e unfolds to the same set as ë. If e only unfolds to [e], then
>> [e-f] would unfold to [ef].
>
> i don't think that works.  consider [e-g].  normally
> this would match 'f', but under your algorithm it wouldn't.

I don't get it, why not? Especially, what algorithm are we speaking about?




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30 16:00         ` erik quanstrom
  2009-11-30 18:38           ` hiro
@ 2009-11-30 19:43           ` Jorden Mauro
  1 sibling, 0 replies; 19+ messages in thread
From: Jorden Mauro @ 2009-11-30 19:43 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Mon, Nov 30, 2009 at 11:00 AM, erik quanstrom <quanstro@coraid.com> wrote:
>> ``unfold turns a character, say ë into the set of
>> characters that can be folded to the same base
>> character.  so
>>        ; unfold ë
>>        [eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ]''
>>
>> To me, that sounds like [e-f] should be
>>
>> [eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệfƒ]
>>
>> iff e unfolds to the same set as ë. If e only unfolds to [e], then
>> [e-f] would unfold to [ef].
>
> i don't think that works.  consider [e-g].  normally
> this would match 'f', but under your algorithm it wouldn't.
> the problem is that [a-z] works because ascii is arranged
> in alphabetical order.  all the various accented characters
> are not.

It would work if the algorithm didn't expand the class just by
enumerating ASCII letters, but
for every letter also added the accented chars.
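
A minimal sketch of that variant, assuming NFD decomposition as the fold and a hypothetical scan `limit`: a character joins the class when its base letter falls inside the ASCII range, so every letter of the range brings its accented forms along.

```python
import re
import unicodedata

def base(ch: str) -> str:
    """Fold one character to its base by dropping combining marks (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch

def unfold_range(lo: str, hi: str, limit: int = 0x2000) -> str:
    """Expand [lo-hi] into a class of all characters whose base is in range."""
    members = []
    for cp in range(limit):
        b = base(chr(cp))
        if len(b) == 1 and lo <= b <= hi:
            members.append(chr(cp))
    # re.escape guards against ']' or '-' if the range covers punctuation.
    return "[" + re.escape("".join(members)) + "]"

if __name__ == "__main__":
    cls = unfold_range("e", "g")
    print(len(cls))
    print(re.fullmatch(cls, "f") is not None)
```

The range test is applied to the base letter rather than to the accented character, which is the same comparison a folding grep would make; the cost is the size of the generated class. Letters without a canonical decomposition, such as ƒ, are not picked up by this stand-in fold.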

>
> that's why the folding approach has an advantage [a-z]
> will work and will do the Right Thing.
>
> - erik
>
>




* Re: [9fans] grëp (rhymes with creep) and cptmp
       [not found] <<df49a7370911300326m3e3a6be1yc77e49a2b23a6da2@mail.gmail.com>
@ 2009-11-30 14:06 ` erik quanstrom
  0 siblings, 0 replies; 19+ messages in thread
From: erik quanstrom @ 2009-11-30 14:06 UTC (permalink / raw)
  To: 9fans

On Mon Nov 30 06:28:25 EST 2009, rogpeppe@gmail.com wrote:
> now just some handling of combining characters to do :-)

i assume you mean "now just some handling of combining characters
to do.  :)", not "now just some handling of combining characters
to do ☺".  :-)

of course the problem with combining characters is that they're
metafont hidden inside a character set.  unicode doesn't limit
the decorations one can put on a letter.  so a + hat + hat +
breve is just a fine character.  in fact if i haven't gotten mixed
up, that's a romanization of an archaic cyrillic character that's still
common in surnames.  there's no reason you can't put that all
in a circle.

so given that, we know there's no requirement for a precombined
form.  the easy approach of just turning x + combiners into
x' which can be decomposed into x + combiners won't work, in
general.

(doing as mac and decomposing everything seems excessively hard
on libdraw.  what if successive calls to string() split a combiner?)

maybe we can live with combine everything combineable.
it could be an improvement.
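
"combine everything combineable" corresponds roughly to Unicode normalization form C. The sketch below uses Python's `unicodedata` only to illustrate the point; it says nothing about how libdraw or grep would do it. The stacked sequence (a + circumflex + circumflex + breve) is an assumed encoding of the example above.

```python
import unicodedata

def combine(text: str) -> str:
    """Compose base + combining sequences wherever a precombined form exists."""
    return unicodedata.normalize("NFC", text)

if __name__ == "__main__":
    simple = "e\u0308"                 # e + combining diaeresis
    stacked = "a\u0302\u0302\u0306"    # a + hat + hat + breve
    print(len(simple), len(combine(simple)))
    print(len(stacked), len(combine(stacked)))
```

The first sequence collapses to a single rune, while the stacked one still carries combining marks after composition, so code that assumes one rune per character cannot rely on composition alone.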

- erik




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-29 19:01 Jason Catena
  2009-11-30  4:51 ` Bruce Ellis
@ 2009-11-30 11:26 ` roger peppe
  1 sibling, 0 replies; 19+ messages in thread
From: roger peppe @ 2009-11-30 11:26 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

now just some handling of combining characters to do :-)




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30  7:52   ` Jason Catena
@ 2009-11-30  9:00     ` Eris Discordia
  0 siblings, 0 replies; 19+ messages in thread
From: Eris Discordia @ 2009-11-30  9:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> $ time grëp Obergruppenfuhrersaal *

Touché :-)


--On Monday, November 30, 2009 01:52 -0600 Jason Catena 
<jason.catena@gmail.com> wrote:

>> hey, this is great stuff!  i really like the approach.
>
> Thank you.  It evolved from wanting to cut-and-paste character
> classes, to automatically applying them to test them.  I suppose the
> character classes file could be useful in other applications that
> selectively don't want to care about accents.
>
> I added a dash-and-hyphen class, keyed to the hyphen-minus as the
> first character (since it's overused), so I had to change the sed
> command.
>
> sed '/^\[.+-/d;...
>
> I also now "rm $classes" at the end, of course, though I guess it now
> doesn't exit with the exit status of grep.  I should probably save
> $status after the grep command, and exit with it.  Or, save the
> expanded regex in a new shell variable, rm $classes, then grep with
> the new shell variable so the grep is the last command.
>
>> the patterns get really big in a hurry.
>
> Agreed.  Part of grep's job is to be a regex engine, so I thought in
> general it would be okay to push it here.
>
>> i played with this a little bit, but quickly ran into problems.
>
>> "reasonable" re size limits of say 300 characters
>> just don't work if you're doing expansion.  expanding "cooperate"
>> results in a 460-byte string!
>
> Where does this 300-character limit come from?  If you code them by
> hand I agree that a 300 character regex could be hard to fully
> understand.  The regexes this script generates are very simple in
> structure and (ahem) regular, so I'd be inclined to allow them past a
> size restriction based on style.  As far as time and space required to
> wade through the character sets, I haven't yet run into performance
> problems or actual failures in my tests.
>
> $ which grep
> /usr/local/plan9/bin/grep
>
> $ wc *|tail -1
>   17655  118910  774237 total
>
> $ time grëp Obergruppenfuhrersaal *
> wewelsburg:155: (1938–1943): The "Obergruppenführersaal" (SS Generals' Hall) and
> wewelsburg:161: floor of the "Obergruppenführersaal" lie on this axis.  Both redesigned
> wewelsburg:180: The "Obergruppenführersaal" (SS Generals' Hall).  On the ground
> wewelsburg:181: floor the "Obergruppenführersaal" (literally translated:
> wewelsburg:236: castle, in the so-called Obergruppenführersaal ("Obergruppenführer
> 0.00u 0.03s 0.03r 	 grëp Obergruppenfuhrersaal 0–31acme 0–31i850 1920s ...
>
> 0.03 was the biggest result I got in practice.  The first run had 0.02
> user time.  This seems negligible to me, so I'm not yet pushing its
> performance boundaries with this string (lots of vowels and other
> characters with bigger classes) on this data set (a collection of
> notes largely cut-and-pasted from the web).
>
>> - erik
>
> Jason Catena
>








* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-30  4:29 ` erik quanstrom
@ 2009-11-30  7:52   ` Jason Catena
  2009-11-30  9:00     ` Eris Discordia
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Catena @ 2009-11-30  7:52 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> hey, this is great stuff!  i really like the approach.

Thank you.  It evolved from wanting to cut-and-paste character
classes, to automatically applying them to test them.  I suppose the
character classes file could be useful in other applications that
selectively don't want to care about accents.

I added a dash-and-hyphen class, keyed to the hyphen-minus as the
first character (since it's overused), so I had to change the sed
command.

sed '/^\[.+-/d;...

I also now "rm $classes" at the end, of course, though I guess it now
doesn't exit with the exit status of grep.  I should probably save
$status after the grep command, and exit with it.  Or, save the
expanded regex in a new shell variable, rm $classes, then grep with
the new shell variable so the grep is the last command.

> the patterns get really big in a hurry.

Agreed.  Part of grep's job is to be a regex engine, so I thought in
general it would be okay to push it here.

> i played with this a little bit, but quickly ran into problems.

> "reasonable" re size limits of say 300 characters
> just don't work if you're doing expansion.  expanding "cooperate"
> results in a 460-byte string!

Where does this 300-character limit come from?  If you code them by
hand I agree that a 300 character regex could be hard to fully
understand.  The regexes this script generates are very simple in
structure and (ahem) regular, so I'd be inclined to allow them past a
size restriction based on style.  As far as time and space required to
wade through the character sets, I haven't yet run into performance
problems or actual failures in my tests.

$ which grep
/usr/local/plan9/bin/grep

$ wc *|tail -1
  17655  118910  774237 total

$ time grëp Obergruppenfuhrersaal *
wewelsburg:155: (1938–1943): The "Obergruppenführersaal" (SS Generals' Hall) and
wewelsburg:161: floor of the "Obergruppenführersaal" lie on this axis.  Both redesigned
wewelsburg:180: The "Obergruppenführersaal" (SS Generals' Hall).  On the ground
wewelsburg:181: floor the "Obergruppenführersaal" (literally translated:
wewelsburg:236: castle, in the so-called Obergruppenführersaal ("Obergruppenführer
0.00u 0.03s 0.03r 	 grëp Obergruppenfuhrersaal 0–31acme 0–31i850 1920s ...

0.03 was the biggest result I got in practice.  The first run had 0.02
user time.  This seems negligible to me, so I'm not yet pushing its
performance boundaries with this string (lots of vowels and other
characters with bigger classes) on this data set (a collection of
notes largely cut-and-pasted from the web).

> - erik

Jason Catena




* Re: [9fans] grëp (rhymes with creep) and cptmp
  2009-11-29 19:01 Jason Catena
@ 2009-11-30  4:51 ` Bruce Ellis
  2009-11-30 11:26 ` roger peppe
  1 sibling, 0 replies; 19+ messages in thread
From: Bruce Ellis @ 2009-11-30  4:51 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

i like the approach. back in basser computational linguistics days
frank was indexing a greek verb dictionary. to sort the keys - he used
tr | sort | tr.

i'm glad you didn't screw with grep. it's brilliant but the
implementation is not easily understood. i was in the room at the
time, so i have a headstart.

brucee

On 11/30/09, Jason Catena <jason.catena@gmail.com> wrote:
> I wrote a wrapper around grep to search for words regardless of
> accents.  I didn't want to worry about whether I used accents on
> characters (I sometimes use them inconsistently, and others decidedly
> do), but I still wanted to limit the results to exact matches if I
> supplied an accent.  Here's an example run.
>
>
> $ grep facade word
> treatment <a museum's east facade>.  A false, superficial, or artificial
>
> $ grëp facade word
> 89: to bow to man. façade. circa 1681.  French façade, from Italian
> 92: treatment <a museum's east facade>.  A false, superficial, or artificial
>
> $ grëp façade *
> style:21: crucial difference to pronunciation: cliché, soupçon, façade, café,
> wabisabi:51: or the crumbling stone façade of an old building.   Transience,
> word:89: to bow to man. façade. circa 1681.  French façade, from Italian
>
>
> Note that line word:92 (output by the second command) is not output by
> the third command, since I supplied an accent on that particular
> character (ç) in my input pattern.  I chose the umlaut or diæresis to
> remind me that grëp provides the -n option by default, so I'll get a
> line number and : in the output.  (I should probably just pass through
> all of grep's command-line options.)
>
>
> <grëp>=
> #!/usr/local/plan9/bin/rc
>
> regex=$1
> shift
>
> classes=`{cptmp classes}
> sed '/-/d;s,^\[(.),s/\1/\[\1,;s,$,/g,' charclass > $classes
>
> grep -n `{echo $regex | sed -f $classes} $*
>
>
> I translate each ordinary latin character in the input pattern (eg
> [0-9A-Za-z]) into a character class (the attached charclass file,
> which doesn't cut-and-paste well), and then call grep with the updated
> pattern.  The first sed command in grëp turns the character classes in
> charclass into s commands for sed.  The charclass file contains the
> square brackets because I also use it to cut-and-paste from when I
> need a character class for a sed script.
>
> The script cptmp creates a temporary copy of an existing file, or a
> temporary new file.
>
>
> <cptmp>=
> #!/usr/local/plan9/bin/rc
> flag e +
>
> if(~ $#TMPDIR 0)
>        TMPDIR=/tmp
> base=`{basename $1}
> tmp=$TMPDIR/$base.$USER.$pid
>
> if (test -f $1) {
>        cp -pr $1 $tmp
> }
> if not {
>        touch $tmp
> }
> chmod +wx $tmp
> echo $tmp
>
>
> Jason Catena
>
>




* Re: [9fans] grëp (rhymes with creep) and cptmp
       [not found] <<d50d7d460911291101k7420eb0fna61f87646606e991@mail.gmail.com>
@ 2009-11-30  4:29 ` erik quanstrom
  2009-11-30  7:52   ` Jason Catena
  0 siblings, 1 reply; 19+ messages in thread
From: erik quanstrom @ 2009-11-30  4:29 UTC (permalink / raw)
  To: 9fans

On Sun Nov 29 14:03:23 EST 2009, jason.catena@gmail.com wrote:

> I wrote a wrapper around grep to search for words regardless of
> accents.  I didn't want to worry about whether I used accents on
> characters (I sometimes use them inconsistently, and others decidedly
> do), but I still wanted to limit the results to exact matches if I
> supplied an accent.  Here's an example run.

hey, this is great stuff!  i really like the approach.  i played with
this a little bit, but quickly ran into problems.  the patterns get
really big in a hurry.  "reasonable" re size limits of say 300 characters
just don't work if you're doing expansion.  expanding "cooperate"
results in a 460-byte string!

so i went back to an old idea.  i hope you won't accuse me of topperism,
but you finally motivated me to work on something i threatened
to do at iwp9 2e: add folding to grep.

it was right up my alley since i just recently redid the rune tables
that i've been using.  they're built directly from UnicodeData.txt.
it wasn't too hard to build a table that folds modified letters to
a base with the unicode data.  from there, i reused the same
technique used for case folding.  since the tables i'm using don't
fold case, "grep -Ii" makes sense.
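
A sketch of how such a table might be derived, not the actual grepfold code. It assumes the decomposition field of UnicodeData.txt as exposed by Python's `unicodedata.decomposition`, follows only canonical decompositions, and stops at a hypothetical `limit`.

```python
import unicodedata

def fold_rune(ch: str) -> str:
    """Follow canonical decompositions down to the first (base) code point."""
    while True:
        fields = unicodedata.decomposition(ch).split()
        if not fields or fields[0].startswith("<"):
            return ch  # no decomposition, or a compatibility one: stop here
        ch = chr(int(fields[0], 16))

def build_fold_table(limit: int = 0x2000) -> dict:
    """Map each modified letter below `limit` to its base letter."""
    table = {}
    for cp in range(limit):
        ch = chr(cp)
        folded = fold_rune(ch)
        if folded != ch:
            table[ch] = folded
    return table

if __name__ == "__main__":
    table = build_fold_table()
    print(len(table))
    print(table["ë"], table["É"])
```

In this sketch an upper-case accented letter folds to its upper-case base, so case is left alone and a separate case fold can still be applied on top.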

performance is pretty good.  worst case is about 2x the user time.
there's no overhead when the I flag isn't given.

the source is in /n/sources/contrib/quanstro/src/grepfold.
please let me know of any bugs.  i'm sure there are a few weird
cases.

- erik




* [9fans] grëp (rhymes with creep) and cptmp
@ 2009-11-29 19:01 Jason Catena
  2009-11-30  4:51 ` Bruce Ellis
  2009-11-30 11:26 ` roger peppe
  0 siblings, 2 replies; 19+ messages in thread
From: Jason Catena @ 2009-11-29 19:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


I wrote a wrapper around grep to search for words regardless of
accents.  I didn't want to worry about whether I used accents on
characters (I sometimes use them inconsistently, and others decidedly
do), but I still wanted to limit the results to exact matches if I
supplied an accent.  Here's an example run.


$ grep facade word
treatment <a museum's east facade>.  A false, superficial, or artificial

$ grëp facade word
89: to bow to man. façade. circa 1681.  French façade, from Italian
92: treatment <a museum's east facade>.  A false, superficial, or artificial

$ grëp façade *
style:21: crucial difference to pronunciation: cliché, soupçon, façade, café,
wabisabi:51: or the crumbling stone façade of an old building.   Transience,
word:89: to bow to man. façade. circa 1681.  French façade, from Italian


Note that line word:92 (output by the second command) is not output by
the third command, since I supplied an accent on that particular
character (ç) in my input pattern.  I chose the umlaut or diæresis to
remind me that grëp provides the -n option by default, so I'll get a
line number and : in the output.  (I should probably just pass through
all of grep's command-line options.)


<grëp>=
#!/usr/local/plan9/bin/rc

regex=$1
shift

classes=`{cptmp classes}
sed '/-/d;s,^\[(.),s/\1/\[\1,;s,$,/g,' charclass > $classes

grep -n `{echo $regex | sed -f $classes} $*


I translate each ordinary latin character in the input pattern (eg
[0-9A-Za-z]) into a character class (the attached charclass file,
which doesn't cut-and-paste well), and then call grep with the updated
pattern.  The first sed command in grëp turns the character classes in
charclass into s commands for sed.  The charclass file contains the
square brackets because I also use it to cut-and-paste from when I
need a character class for a sed script.
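
The same expansion can be sketched in Python. This is an illustration rather than a port of the script: `CHARCLASS` is a hypothetical, abbreviated stand-in for the attached file, and the substitution is done in a single pass over the pattern instead of a series of sed s commands.

```python
import re

# Hypothetical, abbreviated stand-in for the charclass file.
CHARCLASS = ["[a-z]", "[aáàâä]", "[cćĉčċç]", "[eéèêë]"]

def build_map(lines):
    """Key each class by its first character, skipping range entries."""
    table = {}
    for line in lines:
        if "-" in line:        # same effect as the sed '/-/d' step
            continue
        body = line[1:-1]      # drop the surrounding square brackets
        table[body[0]] = "[" + body + "]"
    return table

def expand(pattern: str, table: dict) -> str:
    """Replace each plain character with its class; others stay literal."""
    return "".join(table.get(ch, ch) for ch in pattern)

if __name__ == "__main__":
    table = build_map(CHARCLASS)
    loose = expand("facade", table)
    strict = expand("façade", table)
    print(loose, len(loose.encode("utf-8")))
    print(re.search(loose, "façade") is not None)
    print(re.search(strict, "facade") is not None)
```

An accented character in the pattern is not a key in the table, so it stays literal and only matches itself; that gives the exact-match behaviour for supplied accents. Printing the byte length shows how quickly the expanded pattern grows.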

The script cptmp creates a temporary copy of an existing file, or a
temporary new file.


<cptmp>=
#!/usr/local/plan9/bin/rc
flag e +

if(~ $#TMPDIR 0)
	TMPDIR=/tmp
base=`{basename $1}
tmp=$TMPDIR/$base.$USER.$pid

if (test -f $1) {
	cp -pr $1 $tmp
}
if not {
	touch $tmp
}
chmod +wx $tmp
echo $tmp
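
For comparison, a rough Python equivalent of cptmp. It swaps in `tempfile.mkstemp` for the `$base.$USER.$pid` naming, so the file name differs from the script's, and it sets the write and execute bits for the owner only; both choices are assumptions of this sketch.

```python
import os
import shutil
import stat
import tempfile

def cptmp(path: str) -> str:
    """Create a temporary copy of path, or an empty temporary file."""
    tmpdir = os.environ.get("TMPDIR") or tempfile.gettempdir()
    prefix = os.path.basename(path) + "."
    fd, tmp = tempfile.mkstemp(prefix=prefix, dir=tmpdir)
    os.close(fd)
    if os.path.isfile(path):
        shutil.copy2(path, tmp)  # like cp -p: keeps mode and timestamps
    # Roughly chmod +wx, restricted here to the owner.
    mode = os.stat(tmp).st_mode
    os.chmod(tmp, mode | stat.S_IWUSR | stat.S_IXUSR)
    return tmp

if __name__ == "__main__":
    name = cptmp("classes")
    print(name)
    os.remove(name)
```

mkstemp creates the file atomically under an unpredictable name, which avoids clashes that a name built from the user and process id could have; the caller is still responsible for removing the file.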


Jason Catena

[-- Attachment #2: charclass --]
[-- Type: application/octet-stream, Size: 1126 bytes --]

[ 	]
[0-9]
[0⁰₀]
[1¹₁]
[2²₂]
[3³₃]
[4⁴₄]
[5⁵₅]
[6⁶₆]
[7⁷₇]
[8⁸₈]
[9⁹₉]
[A-Z]
[AÁÀĂÂǍÅǺÄǞÃȦǠĄĀȀȂª]
[BƁʙɞʚ]
[CĆĈČĊÇƇ]
[DĎĐÐƉƊ]
[EÉÈĔÊĚËĖȨĘĒȄȆɝƎƐɛɜ]
[FƑℲ]
[GǴĞĜǦĠĢǤƓɢʛ]
[HĤȞHĦʜǶ]
[IÍÌĬÎǏÏĨİĮĪȈȊIƗɪ]
[JĴJ]
[KǨĶƘKĸ]
[LĹĽĻŁŁĿʟ]
[M]
[NŃǸŇÑŅƝNɴŊ]
[OÓÒŎÔǑÖȪŐÕȬȮȰØǾǪǬŌȌȎƠƟ]
[PƤP]
[Q]
[RŔŘŖȐȒƦʀʁ]
[SŚŜŠŞȘ]
[TŤTŢȚŦƬƮ]
[UÚÙŬÛǓŮÜǗǛǙǕŰŨŲŪȔȖƯ]
[VƲ]
[WŴW]
[X]
[YÝŶYŸȲʏƳ]
[ZŹŽŻƵȤʐǮ]
[a-z]
[aáàăâǎåǻäǟãȧǡąāȁȃɐɑɒ]
[bƀɓƂƃ]
[cćĉčċçƈɕ]
[dďđðɖɗƋƌȡ]
[eéèĕêěëėȩęēȅȇɚǝƏəɘ]
[fƒʩ]
[gǵğĝǧġģǥɠɡ]
[hĥȟħƕɦɧ]
[iíìĭîǐïĩiįīȉȋıɨƖɩ]
[jĵǰʝɟʄ]
[kǩķƙʞ]
[lĺľļłłŀƚɫɬɭȴ]
[mɱ]
[nńǹňñņɲȠƞɳȵnŋ]
[oóòŏôǒöȫőõȭȯȱøǿǫǭōȍȏơɵ]
[pƥp]
[qʠ]
[rŕřŗȑȓɼɽɾɹɺɻɿ]
[sśŝšşșʂ]
[tťţțƫƭʈȶ]
[uúùŭûǔůüǘǜǚǖűũųūȕȗưʉ]
[vʋ]
[wŵ]
[x]
[yýŷÿȳƴ]
[zźžżƶȥʑǯƺ]
[ÆǼǢ]
[æǽǣ]
[Œɶ]
[œ]
[ɮ]

