The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
From: Doug McIlroy <doug@cs.dartmouth.edu>
To: tuhs@tuhs.org
Subject: Re: [TUHS] Regular Expressions
Date: Sat, 01 Aug 2020 17:12:54 -0400	[thread overview]
Message-ID: <202008012112.071LCsdo037245@tahoe.cs.Dartmouth.EDU> (raw)

> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?


The editor ed was in Unix from day 1. For the necessarily tiny
 implementation, Ken discarded various features
from the ancestral qed. Among the casualties was alternation
in regular expressions. It has never fully returned.

Ken's original paper described a method for simulating all paths
of a nondeterministic finite automaton in parallel, although he
didn't describe it in these exact terms. This meant he had to
keep track of up to n possible states, where n is the number of
terminal symbols in the regular expression.

"Computing Reviews" published a scathing critique of the paper:
everyone knows a deterministic automaton can recognize regular
expressions with one state transition per input character; what
a waste of time to have to keep track of multiple states! What the
review missed was that the size of the DFA can be exponential in n.
For one-shot use, as in an editor, it can take far longer to construct
the DFA than to run it.

This lesson came home with a vengeance when Al Aho wrote egrep,
which implemented full regular expressions as DFA's. I happened
to be writing calendar(1) at the same time, and used egrep to

search calendar files for dates in rather free formats for today
and all days through the next working day.  Here's an example
(egrep interprets newline as "|"):

(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*1)([^0123456789]|$)
(^|[ (,;])((\* *)0*1)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*2)([^0123456789]|$)
(^|[ (,;])((\* *)0*2)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*3)([^0123456789]|$)
(^|[ (,;])((\* *)0*3)([^0123456789]|$)

Much to Al's chagrin, this regular expression took the better
part of a minute to compile into a DFA, which would then run in
microseconds. The trouble was that the DFA was enormously
bigger than the input--only a tiny fraction of the machine's
states would be visited; the rest were useless. That led
him to the brilliant idea of constructing the machine on
the fly, creating only the states that were pertinent to
the input at hand. That innovation made the DFA again
competitive with an NFA.

Doug

             reply	other threads:[~2020-08-01 21:13 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-01 21:12 Doug McIlroy [this message]
2020-08-09 23:44 ` Dave Horsfall
2020-08-10  0:50   ` Rob Pike
  -- strict thread matches above, loose matches on Subject: below --
2020-08-02  4:59 Rudi Blom
2020-08-01  0:00 Noel Chiappa
2020-07-31 22:57 Will Senn
2020-08-01  0:01 ` Bakul Shah
2020-08-01  0:36   ` Rob Pike
2020-08-01  0:53     ` John P. Linderman
2020-08-01  1:31     ` Bakul Shah
2020-08-01  1:39     ` Larry McVoy
2020-08-01  2:33     ` Will Senn
2020-08-01  2:50       ` Rich Morin
2020-08-01  3:01         ` Larry McVoy
2020-08-01  3:07     ` Will Senn
2020-08-01  4:31       ` Earl Baugh
2020-08-01  4:53         ` ron minnich
2020-08-01  5:48 ` Andrew Hume
2020-08-01 13:31   ` Richard Salz
2020-08-01 13:43     ` Andrew Hume
2020-08-02  0:45   ` Christopher Browne
2020-08-09  1:00   ` Dave Horsfall
2020-08-09  1:15     ` Nelson H. F. Beebe
2020-08-09 23:53       ` Dave Horsfall
2020-08-10  1:38         ` John Cowan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=202008012112.071LCsdo037245@tahoe.cs.Dartmouth.EDU \
    --to=doug@cs.dartmouth.edu \
    --cc=tuhs@tuhs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).