The Unix Heritage Society mailing list
 help / color / Atom feed
* Re: [TUHS] Regular Expressions
@ 2020-08-01 21:12 Doug McIlroy
  0 siblings, 0 replies; 19+ messages in thread
From: Doug McIlroy @ 2020-08-01 21:12 UTC (permalink / raw)
  To: tuhs

> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?


The editor ed was in Unix from day 1. For the necessarily tiny
 implementation, Ken discarded various features
from the ancestral qed. Among the casualties was alternation
in regular expressions. It has never fully returned.

Ken's original paper described a method for simulating all paths
of a nondeterministic finite automaton in parallel, although he
didn't describe it in these exact terms. This meant he had to
keep track of up to n possible states, where n is the number of
terminal symbols in the regular expression.

"Computing Reviews" published a scathing critique of the paper:
everyone knows a deterministic automaton can recognize regular
expressions with one state transition per input character; what
a waste of time to have to keep track of multiple states! What the
review missed was that the size of the DFA can be exponential in n.
For one-shot use, as in an editor, it can take far longer to construct
the DFA than to run it.

This lesson came home with a vengeance when Al Aho wrote egrep,
which implemented full regular expressions as DFA's. I happened
to be writing calendar(1) at the same time, and used egrep to

search calendar files for dates in rather free formats for today
and all days through the next working day.  Here's an example
(egrep interprets newline as "|"):

(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*1)([^0123456789]|$)
(^|[ (,;])((\* *)0*1)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*2)([^0123456789]|$)
(^|[ (,;])((\* *)0*2)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*3)([^0123456789]|$)
(^|[ (,;])((\* *)0*3)([^0123456789]|$)

Much to Al's chagrin, this regular expression took the better
part of a minute to compile into a DFA, which would then run in
microseconds. The trouble was that the DFA was enormously
bigger than the input--only a tiny fraction of the machine's
states would be visited; the rest were useless. That led
him to the brilliant idea of constructing the machine on
the fly, creating only the states that were pertinent to
the input at hand. That innovation made the DFA again
competitive with an NFA.

Doug

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
@ 2020-08-02  4:59 Rudi Blom
  0 siblings, 0 replies; 19+ messages in thread
From: Rudi Blom @ 2020-08-02  4:59 UTC (permalink / raw)
  To: tuhs

Probably same as others do, when I'm implementing some 'trace'
messages in a new program or one just 'under investigation' I try to
make sure the message has a nice format to be able to process a few
megabyte of logfile easily.

Cheers,
uncle rubl

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  5:48 ` Andrew Hume
  2020-08-01 13:31   ` Richard Salz
@ 2020-08-02  0:45   ` Christopher Browne
  1 sibling, 0 replies; 19+ messages in thread
From: Christopher Browne @ 2020-08-02  0:45 UTC (permalink / raw)
  To: Andrew Hume; +Cc: TUHS main list


[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

On Sat, 1 Aug 2020 at 02:38, Andrew Hume <andrew@humeweb.com> wrote:

>
> during the mid 80s, i was preparing one of the first invited talks for
> USENIX on the subject of regular expressions.
> i had concocted a massive RE that matched all the known spellings of
> saddam hussein. (it was complicated,
> and there were many different romanizations active at the time.)
>
> dennis ritchie wandered into the unix room and asked what it was and then
> immediately applied it to our
> somewhat newish AP news wire feed. it matched two new spellings!!
>

I recalled that one being about Khadaffi, of Libya, and there being a
punchline involving one of the names being
"squid-breath."
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

[-- Attachment #2: Type: text/html, Size: 1185 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01 13:31   ` Richard Salz
@ 2020-08-01 13:43     ` Andrew Hume
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Hume @ 2020-08-01 13:43 UTC (permalink / raw)
  To: Richard Salz; +Cc: TUHS main list


[-- Attachment #1: Type: text/plain, Size: 774 bytes --]

i apologise.

i was in fact referring to the qaddafi torture test (another victim of my stroke).
thanks to rich for catching that slip.

> On Aug 1, 2020, at 6:31 AM, Richard Salz <rich.salz@gmail.com> wrote:
> 
> 
> during the mid 80s, i was preparing one of the first invited talks for USENIX on the subject of regular expressions.
> i had concocted a massive RE that matched all the known spellings of saddam hussein. (it was complicated,
> 
> I wonder if that was the inspiration for the Qaddafi torture test; see http://sources.vsta.org/comp.sources.unix/volume9/fastgrep/part01 <http://sources.vsta.org/comp.sources.unix/volume9/fastgrep/part01> (June 87)
> 
> The first time I saw "p[-1]" was looking at some regexp source.  It blew/expanded my mind.


[-- Attachment #2: Type: text/html, Size: 1639 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  5:48 ` Andrew Hume
@ 2020-08-01 13:31   ` Richard Salz
  2020-08-01 13:43     ` Andrew Hume
  2020-08-02  0:45   ` Christopher Browne
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Salz @ 2020-08-01 13:31 UTC (permalink / raw)
  To: Andrew Hume; +Cc: TUHS main list


[-- Attachment #1: Type: text/plain, Size: 473 bytes --]

> during the mid 80s, i was preparing one of the first invited talks for
> USENIX on the subject of regular expressions.
> i had concocted a massive RE that matched all the known spellings of
> saddam hussein. (it was complicated,
>

I wonder if that was the inspiration for the Qaddafi torture test; see
http://sources.vsta.org/comp.sources.unix/volume9/fastgrep/part01 (June 87)

The first time I saw "p[-1]" was looking at some regexp source.  It
blew/expanded my mind.

[-- Attachment #2: Type: text/html, Size: 822 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-07-31 22:57 Will Senn
  2020-08-01  0:01 ` Bakul Shah
@ 2020-08-01  5:48 ` Andrew Hume
  2020-08-01 13:31   ` Richard Salz
  2020-08-02  0:45   ` Christopher Browne
  1 sibling, 2 replies; 19+ messages in thread
From: Andrew Hume @ 2020-08-01  5:48 UTC (permalink / raw)
  To: Will Senn; +Cc: TUHS main list


during the mid 80s, i was preparing one of the first invited talks for USENIX on the subject of regular expressions.
i had concocted a massive RE that matched all the known spellings of saddam hussein. (it was complicated,
and there were many different romanizations active at the time.)

dennis ritchie wandered into the unix room and asked what it was and then immediately applied it to our
somewhat newish AP news wire feed. it matched two new spellings!!


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  4:31       ` Earl Baugh
@ 2020-08-01  4:53         ` ron minnich
  0 siblings, 0 replies; 19+ messages in thread
From: ron minnich @ 2020-08-01  4:53 UTC (permalink / raw)
  To: Earl Baugh; +Cc: TUHS main list, Bakul Shah

kernighan did this one at a talk at udel in 1976: find all the words
in the dictionary that you can see on a calculator looking at it
upside down.

On Fri, Jul 31, 2020 at 9:32 PM Earl Baugh <earl.baugh@gmail.com> wrote:
>
> I have a 30k+ character regex that I used to parse US street addresses from a single line of input into component pieces. That’s the largest ( and yes it was crazy... but it just grew and grew...and made sense to me as I decomposed the problem ).   That’s the largest one I ever saw.... and it was the fastest way I could solve the problem.. both mentally and processing wise.
>
> ( there was some post processing to deal with soundex matching and data base matching of the pieces to confirm the right answer.  But it solved the problem with 95%+ accuracy... that I got into the high 99%’s with the post processing. )
>
> So that’s at least an example of an arbitrary problem that was solved with regex.
>
> Earl
>
>
> Sent from my iPhone
>
> On Jul 31, 2020, at 11:08 PM, Will Senn <will.senn@gmail.com> wrote:
>
> 
> Oh, and one good google over another, I also found this:
>
> https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904
>
> Which is also great and simple to follow :).
>
> Will
>
> On 7/31/20 7:36 PM, Rob Pike wrote:
>
> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>>
>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>> >
>> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
>> > ...
>> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
>> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
>> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
>> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
>>
>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>
>> [I learned about regular expressions in an automata theory class,
>>  before I knew anything about Unix. What helped me was learning
>>  about finite state machines. You won't need more than paper and
>>  pencil to construct one. Reading source code would make more
>>  sense once you grasp how to construct a FSM corresponding to a RE.]
>
>
>
> --
> GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  3:07     ` Will Senn
@ 2020-08-01  4:31       ` Earl Baugh
  2020-08-01  4:53         ` ron minnich
  0 siblings, 1 reply; 19+ messages in thread
From: Earl Baugh @ 2020-08-01  4:31 UTC (permalink / raw)
  To: Will Senn; +Cc: TUHS main list, Bakul Shah


[-- Attachment #1: Type: text/plain, Size: 2785 bytes --]

I have a 30k+ character regex that I used to parse US street addresses from a single line of input into component pieces. That’s the largest ( and yes it was crazy... but it just grew and grew...and made sense to me as I decomposed the problem ).   That’s the largest one I ever saw.... and it was the fastest way I could solve the problem.. both mentally and processing wise.  

( there was some post processing to deal with soundex matching and data base matching of the pieces to confirm the right answer.  But it solved the problem with 95%+ accuracy... that I got into the high 99%’s with the post processing. )

So that’s at least an example of an arbitrary problem that was solved with regex. 

Earl 


Sent from my iPhone

> On Jul 31, 2020, at 11:08 PM, Will Senn <will.senn@gmail.com> wrote:
> 
> 
> Oh, and one good google over another, I also found this:
> 
> https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904
> 
> Which is also great and simple to follow :).
> 
> Will
> 
>> On 7/31/20 7:36 PM, Rob Pike wrote:
>> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
>> 
>> Regular expressions are beautiful, simple, and widely misunderstood.
>> 
>> -rob
>> 
>> 
>> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>>> > 
>>> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
>>> > ...
>>> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
>>> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
>>> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
>>> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
>>> 
>>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>> 
>>> [I learned about regular expressions in an automata theory class, 
>>>  before I knew anything about Unix. What helped me was learning
>>>  about finite state machines. You won't need more than paper and
>>>  pencil to construct one. Reading source code would make more
>>>  sense once you grasp how to construct a FSM corresponding to a RE.]
> 
> 
> -- 
> GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF

[-- Attachment #2: Type: text/html, Size: 4939 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
                       ` (3 preceding siblings ...)
  2020-08-01  2:33     ` Will Senn
@ 2020-08-01  3:07     ` Will Senn
  2020-08-01  4:31       ` Earl Baugh
  4 siblings, 1 reply; 19+ messages in thread
From: Will Senn @ 2020-08-01  3:07 UTC (permalink / raw)
  To: Rob Pike, Bakul Shah; +Cc: TUHS main list


[-- Attachment #1: Type: text/plain, Size: 2023 bytes --]

Oh, and one good google over another, I also found this:

https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904

Which is also great and simple to follow :).

Will

On 7/31/20 7:36 PM, Rob Pike wrote:
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s 
> the best place to start. Superb exposition on the background, theory, 
> and implementation as well as a bit of history of how the industry 
> lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org 
> <mailto:bakul@iitbombay.org>> wrote:
>
>     On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com
>     <mailto:will.senn@gmail.com>> wrote:
>     >
>     > I've always been intrigued with regexes. When I was first
>     exposed to them, I was mystified and lost in the greediness of
>     matches. Now, I use them regularly, but still have trouble using
>     them. I think it is because I don't really understand how they work.
>     > ...
>     > 1. What's the provenance of regex in unix (when did it appear,
>     in what form, etc)?
>     > 2. What are the 'best' implementations throughout unix (keep it
>     pre 1980s)?
>     > 3. What are some of the milestones along the way (major changes,
>     forks, disagreements)?
>     > 4. Where, in the source, or in a paper, would you point someone
>     to wanting to better understand the mechanics of regex?
>
>     Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
>     [I learned about regular expressions in an automata theory class,
>      before I knew anything about Unix. What helped me was learning
>      about finite state machines. You won't need more than paper and
>      pencil to construct one. Reading source code would make more
>      sense once you grasp how to construct a FSM corresponding to a RE.]
>


-- 
GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF


[-- Attachment #2: Type: text/html, Size: 3760 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  2:50       ` Rich Morin
@ 2020-08-01  3:01         ` Larry McVoy
  0 siblings, 0 replies; 19+ messages in thread
From: Larry McVoy @ 2020-08-01  3:01 UTC (permalink / raw)
  To: Rich Morin; +Cc: TUHS main list

I was a fan of this:

http://www.cse.yorku.ca/~oz/regex.bun

my employees moved me to PCRE at some point but OZ has a talent for writing
really small implementations of useful stuff.  Very early Unix like.

On Fri, Jul 31, 2020 at 07:50:57PM -0700, Rich Morin wrote:
> You might want to pick up a used copy of this book, if only as an occasional resource:
> 
> https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124
> 
> -r

-- 
---
Larry McVoy            	     lm at mcvoy.com             http://www.mcvoy.com/lm 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  2:33     ` Will Senn
@ 2020-08-01  2:50       ` Rich Morin
  2020-08-01  3:01         ` Larry McVoy
  0 siblings, 1 reply; 19+ messages in thread
From: Rich Morin @ 2020-08-01  2:50 UTC (permalink / raw)
  To: TUHS main list

You might want to pick up a used copy of this book, if only as an occasional resource:

https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124

-r


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
                       ` (2 preceding siblings ...)
  2020-08-01  1:39     ` Larry McVoy
@ 2020-08-01  2:33     ` Will Senn
  2020-08-01  2:50       ` Rich Morin
  2020-08-01  3:07     ` Will Senn
  4 siblings, 1 reply; 19+ messages in thread
From: Will Senn @ 2020-08-01  2:33 UTC (permalink / raw)
  To: Rob Pike, Bakul Shah; +Cc: TUHS main list


[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

Yes, I googled it per Clem's suggestion and wound up on that exact link 
after wandering around admiring the scenery. I envy the more 
mathematically inclined among us their view of matters technical. This 
piece, being in C and having step by step articulation of the diagrams, 
is better for me than the more formal wikipedia article, although, when 
I get enough background, that looks like it'll be good, too.

Thanks Rob, Clem and Bakul.

Will


On 7/31/20 7:36 PM, Rob Pike wrote:
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s 
> the best place to start. Superb exposition on the background, theory, 
> and implementation as well as a bit of history of how the industry 
> lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org 
> <mailto:bakul@iitbombay.org>> wrote:
>
>     On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com
>     <mailto:will.senn@gmail.com>> wrote:
>     >
>     > I've always been intrigued with regexes. When I was first
>     exposed to them, I was mystified and lost in the greediness of
>     matches. Now, I use them regularly, but still have trouble using
>     them. I think it is because I don't really understand how they work.
>     > ...
>     > 1. What's the provenance of regex in unix (when did it appear,
>     in what form, etc)?
>     > 2. What are the 'best' implementations throughout unix (keep it
>     pre 1980s)?
>     > 3. What are some of the milestones along the way (major changes,
>     forks, disagreements)?
>     > 4. Where, in the source, or in a paper, would you point someone
>     to wanting to better understand the mechanics of regex?
>
>     Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
>     [I learned about regular expressions in an automata theory class,
>      before I knew anything about Unix. What helped me was learning
>      about finite state machines. You won't need more than paper and
>      pencil to construct one. Reading source code would make more
>      sense once you grasp how to construct a FSM corresponding to a RE.]
>


-- 
GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF


[-- Attachment #2: Type: text/html, Size: 3921 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
  2020-08-01  0:53     ` John P. Linderman
  2020-08-01  1:31     ` Bakul Shah
@ 2020-08-01  1:39     ` Larry McVoy
  2020-08-01  2:33     ` Will Senn
  2020-08-01  3:07     ` Will Senn
  4 siblings, 0 replies; 19+ messages in thread
From: Larry McVoy @ 2020-08-01  1:39 UTC (permalink / raw)
  To: Rob Pike; +Cc: Bakul Shah, TUHS main list

That link is a great read.

On Sat, Aug 01, 2020 at 10:36:48AM +1000, Rob Pike wrote:
> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the
> best place to start. Superb exposition on the background, theory, and
> implementation as well as a bit of history of how the industry lost its way
> with regular expressions.
> 
> Regular expressions are beautiful, simple, and widely misunderstood.
> 
> -rob
> 
> 
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
> 
> > On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
> > >
> > > I've always been intrigued with regexes. When I was first exposed to
> > them, I was mystified and lost in the greediness of matches. Now, I use
> > them regularly, but still have trouble using them. I think it is because I
> > don't really understand how they work.
> > > ...
> > > 1. What's the provenance of regex in unix (when did it appear, in what
> > form, etc)?
> > > 2. What are the 'best' implementations throughout unix (keep it pre
> > 1980s)?
> > > 3. What are some of the milestones along the way (major changes, forks,
> > disagreements)?
> > > 4. Where, in the source, or in a paper, would you point someone to
> > wanting to better understand the mechanics of regex?
> >
> > Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
> >
> > [I learned about regular expressions in an automata theory class,
> >  before I knew anything about Unix. What helped me was learning
> >  about finite state machines. You won't need more than paper and
> >  pencil to construct one. Reading source code would make more
> >  sense once you grasp how to construct a FSM corresponding to a RE.]

-- 
---
Larry McVoy            	     lm at mcvoy.com             http://www.mcvoy.com/lm 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
  2020-08-01  0:53     ` John P. Linderman
@ 2020-08-01  1:31     ` Bakul Shah
  2020-08-01  1:39     ` Larry McVoy
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Bakul Shah @ 2020-08-01  1:31 UTC (permalink / raw)
  To: Rob Pike; +Cc: TUHS main list


[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]

I was trying to get Will to try simple stuff out himself from first principles (to get started) but this is of course far better!

> On Jul 31, 2020, at 5:36 PM, Rob Pike <robpike@gmail.com> wrote:
> 
> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html <https://swtch.com/~rsc/regexp/regexp1.html> i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
> 
> Regular expressions are beautiful, simple, and widely misunderstood.
> 
> -rob
> 
> 
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org <mailto:bakul@iitbombay.org>> wrote:
> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com <mailto:will.senn@gmail.com>> wrote:
> > 
> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
> > ...
> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
> 
> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction <https://en.wikipedia.org/wiki/Thompson%27s_construction>
> 
> [I learned about regular expressions in an automata theory class, 
>  before I knew anything about Unix. What helped me was learning
>  about finite state machines. You won't need more than paper and
>  pencil to construct one. Reading source code would make more
>  sense once you grasp how to construct a FSM corresponding to a RE.]


[-- Attachment #2: Type: text/html, Size: 3133 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
@ 2020-08-01  0:53     ` John P. Linderman
  2020-08-01  1:31     ` Bakul Shah
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: John P. Linderman @ 2020-08-01  0:53 UTC (permalink / raw)
  To: Rob Pike; +Cc: Bakul Shah, TUHS main list


[-- Attachment #1: Type: text/plain, Size: 1879 bytes --]

Hear hear. And I think it should be possible to analyze regular expressions
to determine if "simple" (hence fast) regexps are adequate. "Fancy" re's
are rare. At least from those of us where they couldn't be expressed.

On Fri, Jul 31, 2020 at 8:38 PM Rob Pike <robpike@gmail.com> wrote:

> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the
> best place to start. Superb exposition on the background, theory, and
> implementation as well as a bit of history of how the industry lost its way
> with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>
>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>> >
>> > I've always been intrigued with regexes. When I was first exposed to
>> them, I was mystified and lost in the greediness of matches. Now, I use
>> them regularly, but still have trouble using them. I think it is because I
>> don't really understand how they work.
>> > ...
>> > 1. What's the provenance of regex in unix (when did it appear, in what
>> form, etc)?
>> > 2. What are the 'best' implementations throughout unix (keep it pre
>> 1980s)?
>> > 3. What are some of the milestones along the way (major changes, forks,
>> disagreements)?
>> > 4. Where, in the source, or in a paper, would you point someone to
>> wanting to better understand the mechanics of regex?
>>
>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>
>> [I learned about regular expressions in an automata theory class,
>>  before I knew anything about Unix. What helped me was learning
>>  about finite state machines. You won't need more than paper and
>>  pencil to construct one. Reading source code would make more
>>  sense once you grasp how to construct a FSM corresponding to a RE.]
>
>

[-- Attachment #2: Type: text/html, Size: 2897 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:01 ` Bakul Shah
@ 2020-08-01  0:36   ` Rob Pike
  2020-08-01  0:53     ` John P. Linderman
                       ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Rob Pike @ 2020-08-01  0:36 UTC (permalink / raw)
  To: Bakul Shah; +Cc: TUHS main list


[-- Attachment #1: Type: text/plain, Size: 1544 bytes --]

I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the
best place to start. Superb exposition on the background, theory, and
implementation as well as a bit of history of how the industry lost its way
with regular expressions.

Regular expressions are beautiful, simple, and widely misunderstood.

-rob


On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:

> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
> >
> > I've always been intrigued with regexes. When I was first exposed to
> them, I was mystified and lost in the greediness of matches. Now, I use
> them regularly, but still have trouble using them. I think it is because I
> don't really understand how they work.
> > ...
> > 1. What's the provenance of regex in unix (when did it appear, in what
> form, etc)?
> > 2. What are the 'best' implementations throughout unix (keep it pre
> 1980s)?
> > 3. What are some of the milestones along the way (major changes, forks,
> disagreements)?
> > 4. Where, in the source, or in a paper, would you point someone to
> wanting to better understand the mechanics of regex?
>
> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
> [I learned about regular expressions in an automata theory class,
>  before I knew anything about Unix. What helped me was learning
>  about finite state machines. You won't need more than paper and
>  pencil to construct one. Reading source code would make more
>  sense once you grasp how to construct a FSM corresponding to a RE.]

[-- Attachment #2: Type: text/html, Size: 2196 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-07-31 22:57 Will Senn
@ 2020-08-01  0:01 ` Bakul Shah
  2020-08-01  0:36   ` Rob Pike
  2020-08-01  5:48 ` Andrew Hume
  1 sibling, 1 reply; 19+ messages in thread
From: Bakul Shah @ 2020-08-01  0:01 UTC (permalink / raw)
  To: Will Senn; +Cc: TUHS main list

On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
> 
> I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
> ...
> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?
> 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?

Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction

[I learned about regular expressions in an automata theory class, 
 before I knew anything about Unix. What helped me was learning
 about finite state machines. You won't need more than paper and
 pencil to construct one. Reading source code would make more
 sense once you grasp how to construct a FSM corresponding to a RE.]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [TUHS] Regular Expressions
@ 2020-08-01  0:00 Noel Chiappa
  0 siblings, 0 replies; 19+ messages in thread
From: Noel Chiappa @ 2020-08-01  0:00 UTC (permalink / raw)
  To: tuhs; +Cc: jnc

    > From: Will Senn

    > I don't really understand how they work.
    > ...
    > maybe the way to understand the unix regex lies in a careful
    > investigation into how it is implemented

I'm not sure what I did, but it wasn't the latter, since I have no idea how
they are done!

I just mentally break the regex search string up into substrings (I use them
most in Epsilon, which has syntax to do substrings of search strings, which
helps a lot); past that, I think it's just using them and getting used to
them.

    > an IBM 7094 (whatever that is)

IBM's last 36-bit scientific mainframe before the System/360's. CTSS (which
DMR held out as the ancestor of Unix) ran on 7094's.

    Noel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [TUHS] Regular Expressions
@ 2020-07-31 22:57 Will Senn
  2020-08-01  0:01 ` Bakul Shah
  2020-08-01  5:48 ` Andrew Hume
  0 siblings, 2 replies; 19+ messages in thread
From: Will Senn @ 2020-07-31 22:57 UTC (permalink / raw)
  To: TUHS main list


[-- Attachment #1: Type: text/plain, Size: 2221 bytes --]

I've always been intrigued with regexes. When I was first exposed to 
them, I was mystified and lost in the greediness of matches. Now, I use 
them regularly, but still have trouble using them. I think it is because 
I don't really understand how they work.

My question for y'all has to do with early unix. I have a copy of 
Thompson, K. (1968). Regular expression search algorithm. Communications 
of the ACM, 11(6), 419-422. It is interesting as an example of 
Thompson's thinking about regexes. In this paper, he presents a 
non-backtracking, efficient, algorithm for converting a regex into an 
IBM 7094 (whatever that is) program that can be run against text input 
that generates matches. It's cool. It got me to thinking maybe the way 
to understand the unix regex lies in a careful investigation into how it 
is implemented (original thought, right?). So, here I am again to ask 
your indulgence as the latecomer wannabe unix apprentice. My thought is 
that ed is where it begins and might be a good starting point, but I'm 
not sure - what say y'all?

I also have a copy of the O'Reilly Mastering Regular Expressions book, 
but that's not really the kind of thing I'm talking about. My question 
is more basic than how to use regexes practically. I would like to 
understand them at a parsing level/state change level (not sure that's 
the correct way to say it, but I'm really new to this kind of lingo).  
When I'm done with my stepping through the source, I want to be able to 
reason that this is why that search matched that text and not this text 
and why the search was greedy, or not greedy because of this logic here...

If my question above isn't focused or on topic enough, here's an 
alternative set to ruminate on and hopefully discuss:

1. What's the provenance of regex in unix (when did it appear, in what 
form, etc)?
2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
3. What are some of the milestones along the way (major changes, forks, 
disagreements)?
4. Where, in the source, or in a paper, would you point someone to 
wanting to better understand the mechanics of regex?

Thanks!

Will

-- 
GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF


[-- Attachment #2: Type: text/html, Size: 2765 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, back to index

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-01 21:12 [TUHS] Regular Expressions Doug McIlroy
  -- strict thread matches above, loose matches on Subject: below --
2020-08-02  4:59 Rudi Blom
2020-08-01  0:00 Noel Chiappa
2020-07-31 22:57 Will Senn
2020-08-01  0:01 ` Bakul Shah
2020-08-01  0:36   ` Rob Pike
2020-08-01  0:53     ` John P. Linderman
2020-08-01  1:31     ` Bakul Shah
2020-08-01  1:39     ` Larry McVoy
2020-08-01  2:33     ` Will Senn
2020-08-01  2:50       ` Rich Morin
2020-08-01  3:01         ` Larry McVoy
2020-08-01  3:07     ` Will Senn
2020-08-01  4:31       ` Earl Baugh
2020-08-01  4:53         ` ron minnich
2020-08-01  5:48 ` Andrew Hume
2020-08-01 13:31   ` Richard Salz
2020-08-01 13:43     ` Andrew Hume
2020-08-02  0:45   ` Christopher Browne

The Unix Heritage Society mailing list

Archives are clonable: git clone --mirror http://inbox.vuxu.org/tuhs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://inbox.vuxu.org/vuxu.archive.tuhs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git