The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
* [TUHS] Regular Expressions
@ 2020-07-31 22:57 Will Senn
  2020-08-01  0:01 ` Bakul Shah
  2020-08-01  5:48 ` Andrew Hume
  0 siblings, 2 replies; 25+ messages in thread
From: Will Senn @ 2020-07-31 22:57 UTC (permalink / raw)
  To: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 2221 bytes --]

I've always been intrigued with regexes. When I was first exposed to 
them, I was mystified and lost in the greediness of matches. Now, I use 
them regularly, but still have trouble using them. I think it is because 
I don't really understand how they work.

My question for y'all has to do with early unix. I have a copy of 
Thompson, K. (1968). Regular expression search algorithm. Communications 
of the ACM, 11(6), 419-422. It is interesting as an example of 
Thompson's thinking about regexes. In this paper, he presents a 
non-backtracking, efficient, algorithm for converting a regex into an 
IBM 7094 (whatever that is) program that can be run against text input 
that generates matches. It's cool. It got me to thinking maybe the way 
to understand the unix regex lies in a careful investigation into how it 
is implemented (original thought, right?). So, here I am again to ask 
your indulgence as the latecomer wannabe unix apprentice. My thought is 
that ed is where it begins and might be a good starting point, but I'm 
not sure - what say y'all?

I also have a copy of the O'Reilly Mastering Regular Expressions book, 
but that's not really the kind of thing I'm talking about. My question 
is more basic than how to use regexes practically. I would like to 
understand them at a parsing level/state change level (not sure that's 
the correct way to say it, but I'm really new to this kind of lingo).  
When I'm done with my stepping through the source, I want to be able to 
reason that this is why that search matched that text and not this text 
and why the search was greedy, or not greedy because of this logic here...

If my question above isn't focused or on topic enough, here's an 
alternative set to ruminate on and hopefully discuss:

1. What's the provenance of regex in unix (when did it appear, in what 
form, etc)?
2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
3. What are some of the milestones along the way (major changes, forks, 
disagreements)?
4. Where, in the source, or in a paper, would you point someone to 
wanting to better understand the mechanics of regex?

Thanks!

Will

-- 
GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF


[-- Attachment #2: Type: text/html, Size: 2765 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
@ 2020-08-01  0:00 Noel Chiappa
  0 siblings, 0 replies; 25+ messages in thread
From: Noel Chiappa @ 2020-08-01  0:00 UTC (permalink / raw)
  To: tuhs; +Cc: jnc

    > From: Will Senn

    > I don't really understand how they work.
    > ...
    > maybe the way to understand the unix regex lies in a careful
    > investigation into how it is implemented

I'm not sure what I did, but it wasn't the latter, since I have no idea how
they are done!

I just mentally break the regex search string up into substrings (I use them
most in Epsilon, which has syntax to do substrings of search strings, which
helps a lot); past that, I think it's just using them and getting used to
them.

    > an IBM 7094 (whatever that is)

IBM's last 36-bit scientific mainframe before the System/360's. CTSS (which
DMR held out as the ancestor of Unix) ran on 7094's.

    Noel

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
@ 2020-08-01 21:12 Doug McIlroy
  2020-08-09 23:44 ` Dave Horsfall
  0 siblings, 1 reply; 25+ messages in thread
From: Doug McIlroy @ 2020-08-01 21:12 UTC (permalink / raw)
  To: tuhs

> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?


The editor ed was in Unix from day 1. For the necessarily tiny
 implementation, Ken discarded various features
from the ancestral qed. Among the casualties was alternation
in regular expressions. It has never fully returned.

Ken's original paper described a method for simulating all paths
of a nondeterministic finite automaton in parallel, although he
didn't describe it in these exact terms. This meant he had to
keep track of up to n possible states, where n is the number of
terminal symbols in the regular expression.

"Computing Reviews" published a scathing critique of the paper:
everyone knows a deterministic automaton can recognize regular
expressions with one state transition per input character; what
a waste of time to have to keep track of multiple states! What the
review missed was that the size of the DFA can be exponential in n.
For one-shot use, as in an editor, it can take far longer to construct
the DFA than to run it.

This lesson came home with a vengeance when Al Aho wrote egrep,
which implemented full regular expressions as DFA's. I happened
to be writing calendar(1) at the same time, and used egrep to

search calendar files for dates in rather free formats for today
and all days through the next working day.  Here's an example
(egrep interprets newline as "|"):

(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*1)([^0123456789]|$)
(^|[ (,;])((\* *)0*1)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*2)([^0123456789]|$)
(^|[ (,;])((\* *)0*2)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*3)([^0123456789]|$)
(^|[ (,;])((\* *)0*3)([^0123456789]|$)

Much to Al's chagrin, this regular expression took the better
part of a minute to compile into a DFA, which would then run in
microseconds. The trouble was that the DFA was enormously
bigger than the input--only a tiny fraction of the machine's
states would be visited; the rest were useless. That led
him to the brilliant idea of constructing the machine on
the fly, creating only the states that were pertinent to
the input at hand. That innovation made the DFA again
competitive with an NFA.

Doug

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
@ 2020-08-02  4:59 Rudi Blom
  0 siblings, 0 replies; 25+ messages in thread
From: Rudi Blom @ 2020-08-02  4:59 UTC (permalink / raw)
  To: tuhs

Probably same as others do, when I'm implementing some 'trace'
messages in a new program or one just 'under investigation' I try to
make sure the message has a nice format to be able to process a few
megabyte of logfile easily.

Cheers,
uncle rubl

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-08-10  1:39 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-31 22:57 [TUHS] Regular Expressions Will Senn
2020-08-01  0:01 ` Bakul Shah
2020-08-01  0:36   ` Rob Pike
2020-08-01  0:53     ` John P. Linderman
2020-08-01  1:31     ` Bakul Shah
2020-08-01  1:39     ` Larry McVoy
2020-08-01  2:33     ` Will Senn
2020-08-01  2:50       ` Rich Morin
2020-08-01  3:01         ` Larry McVoy
2020-08-01  3:07     ` Will Senn
2020-08-01  4:31       ` Earl Baugh
2020-08-01  4:53         ` ron minnich
2020-08-01  5:48 ` Andrew Hume
2020-08-01 13:31   ` Richard Salz
2020-08-01 13:43     ` Andrew Hume
2020-08-02  0:45   ` Christopher Browne
2020-08-09  1:00   ` Dave Horsfall
2020-08-09  1:15     ` Nelson H. F. Beebe
2020-08-09 23:53       ` Dave Horsfall
2020-08-10  1:38         ` John Cowan
2020-08-01  0:00 Noel Chiappa
2020-08-01 21:12 Doug McIlroy
2020-08-09 23:44 ` Dave Horsfall
2020-08-10  0:50   ` Rob Pike
2020-08-02  4:59 Rudi Blom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).