9fans - fans of the OS Plan 9 from Bell Labs
From: Jason Catena <jason.catena@gmail.com>
To: Fans of the OS Plan 9 from Bell Labs <9fans@9fans.net>
Subject: Re: [9fans] grëp (rhymes with creep) and cptmp
Date: Mon, 30 Nov 2009 01:52:20 -0600
Message-ID: <d50d7d460911292352j7cbcbc7erefa21b3b7f29f20a@mail.gmail.com>
In-Reply-To: <cb038fc012830f0a8f6cad5b76beb980@ladd.quanstro.net>

> hey, this is great stuff!  i really like the approach.

Thank you.  It evolved from wanting to cut and paste character
classes into applying them automatically, as a way to test them.  I
suppose the character-classes file could be useful in other
applications that want to selectively ignore accents.

I added a dash-and-hyphen class, keyed to the hyphen-minus as the
first character (since it's overused), so I had to change the sed
command.

sed '/^\[.+-/d;...

I also now "rm $classes" at the end, of course, though I guess the
script now exits with the status of rm rather than that of grep.  I
should probably save $status after the grep command, and exit with it.
Or, save the expanded regex in a new shell variable, rm $classes, then
grep with the new shell variable so that grep is the last command.
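
The first option (save the status, clean up, then exit with the saved
status) can be sketched outside the shell as well.  This is only an
illustration in Python, not the rc script itself: run_with_temp and
the make_argv callback are hypothetical names, and the child process
is a stand-in for grep.

```python
import os
import subprocess
import sys
import tempfile


def run_with_temp(make_argv):
    """Run a command that needs a temp file, clean up, return its status."""
    fd, path = tempfile.mkstemp(prefix="classes.")
    os.close(fd)
    try:
        # Save the command's status before doing any cleanup.
        status = subprocess.run(make_argv(path)).returncode
    finally:
        # The cleanup step no longer decides what the caller sees.
        os.remove(path)
    return status, path


if __name__ == "__main__":
    # Stand-in for grep: a child process that exits with status 1,
    # which is what grep reports when nothing matches.
    def fake_grep(path):
        return [sys.executable, "-c", "import sys; sys.exit(1)"]

    status, _ = run_with_temp(fake_grep)
    print("saved status:", status)
```

The second option in the paragraph above avoids the bookkeeping
entirely by making the search the last command that runs.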

> the patterns get really big in a hurry.

Agreed.  Part of grep's job is to be a regex engine, so I thought in
general it would be okay to push it here.

> i played with this a little bit, but quickly ran into problems.

> "reasonable" re size limits of say 300 characters
> just don't work if you're doing expansion.  expanding "cooperate"
> results in a 460-byte string!

Where does this 300-character limit come from?  If you code them by
hand, I agree that a 300-character regex could be hard to fully
understand.  The regexes this script generates are very simple in
structure and (ahem) regular, so I'd be inclined to allow them past a
size restriction based on style.  As for the time and space required
to wade through the character sets, I haven't yet run into performance
problems or actual failures in my tests.
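
To make the size growth concrete, here is a rough sketch of this kind
of expansion in Python.  It is not the grëp script: the class table is
an assumption, derived from Unicode canonical decompositions over
U+00C0..U+017F rather than read from the classes file, so the byte
counts it prints will differ from the 460 bytes quoted above.  The
names build_classes and expand are made up for the example, and it
does not try to handle letters that already sit inside a bracket
expression in the pattern.

```python
import re
import unicodedata
from collections import defaultdict


def build_classes(first=0x00C0, last=0x017F):
    """Group accented letters under the ASCII letter they decompose to."""
    classes = defaultdict(set)
    for cp in range(first, last + 1):
        ch = chr(cp)
        # NFD puts the base letter first, followed by combining marks.
        base = unicodedata.normalize("NFD", ch)[0]
        if base != ch and base.isascii() and base.isalpha():
            classes[base].add(ch)
    return classes


def expand(pattern, classes):
    """Replace each letter that has accented variants with a bracket class."""
    out = []
    for ch in pattern:
        variants = classes.get(ch)
        if variants:
            out.append("[" + ch + "".join(sorted(variants)) + "]")
        else:
            out.append(ch)  # everything else is passed through unchanged
    return "".join(out)


if __name__ == "__main__":
    classes = build_classes()
    for word in ("cooperate", "Obergruppenfuhrersaal"):
        regex = expand(word, classes)
        print(word, len(word), "->", len(regex.encode("utf-8")), "bytes")
    hit = re.search(expand("Obergruppenfuhrersaal", classes),
                    "the Obergruppenf\u00fchrersaal lies on this axis")
    print("matched" if hit else "no match")
```

Every expanded position is a single bracket expression, so the result
is long but flat: no nesting, alternation, or repetition is added.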

$ which grep
/usr/local/plan9/bin/grep

$ wc *|tail -1
  17655  118910  774237 total

$ time grëp Obergruppenfuhrersaal *
wewelsburg:155: (1938–1943): The "Obergruppenführersaal" (SS Generals' Hall) and
wewelsburg:161: floor of the "Obergruppenführersaal" lie on this axis.  Both redesigned
wewelsburg:180: The "Obergruppenführersaal" (SS Generals' Hall).  On the ground
wewelsburg:181: floor the "Obergruppenführersaal" (literally translated:
wewelsburg:236: castle, in the so-called Obergruppenführersaal ("Obergruppenführer
0.00u 0.03s 0.03r 	 grëp Obergruppenfuhrersaal 0–31acme 0–31i850 1920s ...

0.03 was the biggest result I got in practice.  The first run had 0.02
user time.  This seems negligible to me, so I'm not yet pushing its
performance boundaries with this string (lots of vowels and other
characters with bigger classes) on this data set (a collection of
notes largely cut-and-pasted from the web).

> - erik

Jason Catena
