The Unix Heritage Society mailing list
 help / color / mirror / Atom feed
* [TUHS] Regular Expressions
@ 2020-07-31 22:57 Will Senn
  2020-08-01  0:01 ` Bakul Shah
  2020-08-01  5:48 ` Andrew Hume
  0 siblings, 2 replies; 25+ messages in thread
From: Will Senn @ 2020-07-31 22:57 UTC (permalink / raw)
  To: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 2221 bytes --]

I've always been intrigued with regexes. When I was first exposed to 
them, I was mystified and lost in the greediness of matches. Now, I use 
them regularly, but still have trouble using them. I think it is because 
I don't really understand how they work.

My question for y'all has to do with early unix. I have a copy of 
Thompson, K. (1968). Regular expression search algorithm. Communications 
of the ACM, 11(6), 419-422. It is interesting as an example of 
Thompson's thinking about regexes. In this paper, he presents a 
non-backtracking, efficient, algorithm for converting a regex into an 
IBM 7094 (whatever that is) program that can be run against text input 
that generates matches. It's cool. It got me to thinking maybe the way 
to understand the unix regex lies in a careful investigation into how it 
is implemented (original thought, right?). So, here I am again to ask 
your indulgence as the latecomer wannabe unix apprentice. My thought is 
that ed is where it begins and might be a good starting point, but I'm 
not sure - what say y'all?

I also have a copy of the O'Reilly Mastering Regular Expressions book, 
but that's not really the kind of thing I'm talking about. My question 
is more basic than how to use regexes practically. I would like to 
understand them at a parsing level/state change level (not sure that's 
the correct way to say it, but I'm really new to this kind of lingo).  
When I'm done with my stepping through the source, I want to be able to 
reason that this is why that search matched that text and not this text 
and why the search was greedy, or not greedy because of this logic here...

If my question above isn't focused or on topic enough, here's an 
alternative set to ruminate on and hopefully discuss:

1. What's the provenance of regex in unix (when did it appear, in what 
form, etc)?
2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
3. What are some of the milestones along the way (major changes, forks, 
disagreements)?
4. Where, in the source, or in a paper, would you point someone to 
wanting to better understand the mechanics of regex?

Thanks!

Will

-- 
GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF


[-- Attachment #2: Type: text/html, Size: 2765 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-07-31 22:57 [TUHS] Regular Expressions Will Senn
@ 2020-08-01  0:01 ` Bakul Shah
  2020-08-01  0:36   ` Rob Pike
  2020-08-01  5:48 ` Andrew Hume
  1 sibling, 1 reply; 25+ messages in thread
From: Bakul Shah @ 2020-08-01  0:01 UTC (permalink / raw)
  To: Will Senn; +Cc: TUHS main list

On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
> 
> I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
> ...
> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?
> 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?

Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction

[I learned about regular expressions in an automata theory class, 
 before I knew anything about Unix. What helped me was learning
 about finite state machines. You won't need more than paper and
 pencil to construct one. Reading source code would make more
 sense once you grasp how to construct a FSM corresponding to a RE.]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:01 ` Bakul Shah
@ 2020-08-01  0:36   ` Rob Pike
  2020-08-01  0:53     ` John P. Linderman
                       ` (4 more replies)
  0 siblings, 5 replies; 25+ messages in thread
From: Rob Pike @ 2020-08-01  0:36 UTC (permalink / raw)
  To: Bakul Shah; +Cc: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 1544 bytes --]

I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the
best place to start. Superb exposition on the background, theory, and
implementation as well as a bit of history of how the industry lost its way
with regular expressions.

Regular expressions are beautiful, simple, and widely misunderstood.

-rob


On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:

> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
> >
> > I've always been intrigued with regexes. When I was first exposed to
> them, I was mystified and lost in the greediness of matches. Now, I use
> them regularly, but still have trouble using them. I think it is because I
> don't really understand how they work.
> > ...
> > 1. What's the provenance of regex in unix (when did it appear, in what
> form, etc)?
> > 2. What are the 'best' implementations throughout unix (keep it pre
> 1980s)?
> > 3. What are some of the milestones along the way (major changes, forks,
> disagreements)?
> > 4. Where, in the source, or in a paper, would you point someone to
> wanting to better understand the mechanics of regex?
>
> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
> [I learned about regular expressions in an automata theory class,
>  before I knew anything about Unix. What helped me was learning
>  about finite state machines. You won't need more than paper and
>  pencil to construct one. Reading source code would make more
>  sense once you grasp how to construct a FSM corresponding to a RE.]

[-- Attachment #2: Type: text/html, Size: 2196 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
@ 2020-08-01  0:53     ` John P. Linderman
  2020-08-01  1:31     ` Bakul Shah
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: John P. Linderman @ 2020-08-01  0:53 UTC (permalink / raw)
  To: Rob Pike; +Cc: Bakul Shah, TUHS main list

[-- Attachment #1: Type: text/plain, Size: 1879 bytes --]

Hear hear. And I think it should be possible to analyze regular expressions
to determine if "simple" (hence fast) regexps are adequate. "Fancy" re's
are rare. At least from those of us where they couldn't be expressed.

On Fri, Jul 31, 2020 at 8:38 PM Rob Pike <robpike@gmail.com> wrote:

> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the
> best place to start. Superb exposition on the background, theory, and
> implementation as well as a bit of history of how the industry lost its way
> with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>
>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>> >
>> > I've always been intrigued with regexes. When I was first exposed to
>> them, I was mystified and lost in the greediness of matches. Now, I use
>> them regularly, but still have trouble using them. I think it is because I
>> don't really understand how they work.
>> > ...
>> > 1. What's the provenance of regex in unix (when did it appear, in what
>> form, etc)?
>> > 2. What are the 'best' implementations throughout unix (keep it pre
>> 1980s)?
>> > 3. What are some of the milestones along the way (major changes, forks,
>> disagreements)?
>> > 4. Where, in the source, or in a paper, would you point someone to
>> wanting to better understand the mechanics of regex?
>>
>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>
>> [I learned about regular expressions in an automata theory class,
>>  before I knew anything about Unix. What helped me was learning
>>  about finite state machines. You won't need more than paper and
>>  pencil to construct one. Reading source code would make more
>>  sense once you grasp how to construct a FSM corresponding to a RE.]
>
>

[-- Attachment #2: Type: text/html, Size: 2897 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
  2020-08-01  0:53     ` John P. Linderman
@ 2020-08-01  1:31     ` Bakul Shah
  2020-08-01  1:39     ` Larry McVoy
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 25+ messages in thread
From: Bakul Shah @ 2020-08-01  1:31 UTC (permalink / raw)
  To: Rob Pike; +Cc: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]

I was trying to get Will to try simple stuff out himself from first principles (to get started) but this is of course far better!

> On Jul 31, 2020, at 5:36 PM, Rob Pike <robpike@gmail.com> wrote:
> 
> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html <https://swtch.com/~rsc/regexp/regexp1.html> i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
> 
> Regular expressions are beautiful, simple, and widely misunderstood.
> 
> -rob
> 
> 
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org <mailto:bakul@iitbombay.org>> wrote:
> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com <mailto:will.senn@gmail.com>> wrote:
> > 
> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
> > ...
> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
> 
> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction <https://en.wikipedia.org/wiki/Thompson%27s_construction>
> 
> [I learned about regular expressions in an automata theory class, 
>  before I knew anything about Unix. What helped me was learning
>  about finite state machines. You won't need more than paper and
>  pencil to construct one. Reading source code would make more
>  sense once you grasp how to construct a FSM corresponding to a RE.]


[-- Attachment #2: Type: text/html, Size: 3133 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
  2020-08-01  0:53     ` John P. Linderman
  2020-08-01  1:31     ` Bakul Shah
@ 2020-08-01  1:39     ` Larry McVoy
  2020-08-01  2:33     ` Will Senn
  2020-08-01  3:07     ` Will Senn
  4 siblings, 0 replies; 25+ messages in thread
From: Larry McVoy @ 2020-08-01  1:39 UTC (permalink / raw)
  To: Rob Pike; +Cc: Bakul Shah, TUHS main list

That link is a great read.

On Sat, Aug 01, 2020 at 10:36:48AM +1000, Rob Pike wrote:
> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the
> best place to start. Superb exposition on the background, theory, and
> implementation as well as a bit of history of how the industry lost its way
> with regular expressions.
> 
> Regular expressions are beautiful, simple, and widely misunderstood.
> 
> -rob
> 
> 
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
> 
> > On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
> > >
> > > I've always been intrigued with regexes. When I was first exposed to
> > them, I was mystified and lost in the greediness of matches. Now, I use
> > them regularly, but still have trouble using them. I think it is because I
> > don't really understand how they work.
> > > ...
> > > 1. What's the provenance of regex in unix (when did it appear, in what
> > form, etc)?
> > > 2. What are the 'best' implementations throughout unix (keep it pre
> > 1980s)?
> > > 3. What are some of the milestones along the way (major changes, forks,
> > disagreements)?
> > > 4. Where, in the source, or in a paper, would you point someone to
> > wanting to better understand the mechanics of regex?
> >
> > Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
> >
> > [I learned about regular expressions in an automata theory class,
> >  before I knew anything about Unix. What helped me was learning
> >  about finite state machines. You won't need more than paper and
> >  pencil to construct one. Reading source code would make more
> >  sense once you grasp how to construct a FSM corresponding to a RE.]

-- 
---
Larry McVoy            	     lm at mcvoy.com             http://www.mcvoy.com/lm 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
                       ` (2 preceding siblings ...)
  2020-08-01  1:39     ` Larry McVoy
@ 2020-08-01  2:33     ` Will Senn
  2020-08-01  2:50       ` Rich Morin
  2020-08-01  3:07     ` Will Senn
  4 siblings, 1 reply; 25+ messages in thread
From: Will Senn @ 2020-08-01  2:33 UTC (permalink / raw)
  To: Rob Pike, Bakul Shah; +Cc: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]

Yes, I googled it per Clem's suggestion and wound up on that exact link 
after wandering around admiring the scenery. I envy the more 
mathematically inclined among us their view of matters technical. This 
piece, being in C and having step by step articulation of the diagrams, 
is better for me than the more formal wikipedia article, although, when 
I get enough background, that looks like it'll be good, too.

Thanks Rob, Clem and Bakul.

Will


On 7/31/20 7:36 PM, Rob Pike wrote:
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s 
> the best place to start. Superb exposition on the background, theory, 
> and implementation as well as a bit of history of how the industry 
> lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org 
> <mailto:bakul@iitbombay.org>> wrote:
>
>     On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com
>     <mailto:will.senn@gmail.com>> wrote:
>     >
>     > I've always been intrigued with regexes. When I was first
>     exposed to them, I was mystified and lost in the greediness of
>     matches. Now, I use them regularly, but still have trouble using
>     them. I think it is because I don't really understand how they work.
>     > ...
>     > 1. What's the provenance of regex in unix (when did it appear,
>     in what form, etc)?
>     > 2. What are the 'best' implementations throughout unix (keep it
>     pre 1980s)?
>     > 3. What are some of the milestones along the way (major changes,
>     forks, disagreements)?
>     > 4. Where, in the source, or in a paper, would you point someone
>     to wanting to better understand the mechanics of regex?
>
>     Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
>     [I learned about regular expressions in an automata theory class,
>      before I knew anything about Unix. What helped me was learning
>      about finite state machines. You won't need more than paper and
>      pencil to construct one. Reading source code would make more
>      sense once you grasp how to construct a FSM corresponding to a RE.]
>


-- 
GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF


[-- Attachment #2: Type: text/html, Size: 3921 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  2:33     ` Will Senn
@ 2020-08-01  2:50       ` Rich Morin
  2020-08-01  3:01         ` Larry McVoy
  0 siblings, 1 reply; 25+ messages in thread
From: Rich Morin @ 2020-08-01  2:50 UTC (permalink / raw)
  To: TUHS main list

You might want to pick up a used copy of this book, if only as an occasional resource:

https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124

-r


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  2:50       ` Rich Morin
@ 2020-08-01  3:01         ` Larry McVoy
  0 siblings, 0 replies; 25+ messages in thread
From: Larry McVoy @ 2020-08-01  3:01 UTC (permalink / raw)
  To: Rich Morin; +Cc: TUHS main list

I was a fan of this:

http://www.cse.yorku.ca/~oz/regex.bun

my employees moved me to PCRE at some point but OZ has a talent for writing
really small implementations of useful stuff.  Very early Unix like.

On Fri, Jul 31, 2020 at 07:50:57PM -0700, Rich Morin wrote:
> You might want to pick up a used copy of this book, if only as an occasional resource:
> 
> https://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124
> 
> -r

-- 
---
Larry McVoy            	     lm at mcvoy.com             http://www.mcvoy.com/lm 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  0:36   ` Rob Pike
                       ` (3 preceding siblings ...)
  2020-08-01  2:33     ` Will Senn
@ 2020-08-01  3:07     ` Will Senn
  2020-08-01  4:31       ` Earl Baugh
  4 siblings, 1 reply; 25+ messages in thread
From: Will Senn @ 2020-08-01  3:07 UTC (permalink / raw)
  To: Rob Pike, Bakul Shah; +Cc: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 2023 bytes --]

Oh, and one good google over another, I also found this:

https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904

Which is also great and simple to follow :).

Will

On 7/31/20 7:36 PM, Rob Pike wrote:
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s 
> the best place to start. Superb exposition on the background, theory, 
> and implementation as well as a bit of history of how the industry 
> lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org 
> <mailto:bakul@iitbombay.org>> wrote:
>
>     On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com
>     <mailto:will.senn@gmail.com>> wrote:
>     >
>     > I've always been intrigued with regexes. When I was first
>     exposed to them, I was mystified and lost in the greediness of
>     matches. Now, I use them regularly, but still have trouble using
>     them. I think it is because I don't really understand how they work.
>     > ...
>     > 1. What's the provenance of regex in unix (when did it appear,
>     in what form, etc)?
>     > 2. What are the 'best' implementations throughout unix (keep it
>     pre 1980s)?
>     > 3. What are some of the milestones along the way (major changes,
>     forks, disagreements)?
>     > 4. Where, in the source, or in a paper, would you point someone
>     to wanting to better understand the mechanics of regex?
>
>     Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
>     [I learned about regular expressions in an automata theory class,
>      before I knew anything about Unix. What helped me was learning
>      about finite state machines. You won't need more than paper and
>      pencil to construct one. Reading source code would make more
>      sense once you grasp how to construct a FSM corresponding to a RE.]
>


-- 
GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF


[-- Attachment #2: Type: text/html, Size: 3760 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  3:07     ` Will Senn
@ 2020-08-01  4:31       ` Earl Baugh
  2020-08-01  4:53         ` ron minnich
  0 siblings, 1 reply; 25+ messages in thread
From: Earl Baugh @ 2020-08-01  4:31 UTC (permalink / raw)
  To: Will Senn; +Cc: TUHS main list, Bakul Shah

[-- Attachment #1: Type: text/plain, Size: 2785 bytes --]

I have a 30k+ character regex that I used to parse US street addresses from a single line of input into component pieces. That’s the largest ( and yes it was crazy... but it just grew and grew...and made sense to me as I decomposed the problem ).   That’s the largest one I ever saw.... and it was the fastest way I could solve the problem.. both mentally and processing wise.  

( there was some post processing to deal with soundex matching and data base matching of the pieces to confirm the right answer.  But it solved the problem with 95%+ accuracy... that I got into the high 99%’s with the post processing. )

So that’s at least an example of an arbitrary problem that was solved with regex. 

Earl 


Sent from my iPhone

> On Jul 31, 2020, at 11:08 PM, Will Senn <will.senn@gmail.com> wrote:
> 
> 
> Oh, and one good google over another, I also found this:
> 
> https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904
> 
> Which is also great and simple to follow :).
> 
> Will
> 
>> On 7/31/20 7:36 PM, Rob Pike wrote:
>> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
>> 
>> Regular expressions are beautiful, simple, and widely misunderstood.
>> 
>> -rob
>> 
>> 
>> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>>> > 
>>> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
>>> > ...
>>> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
>>> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
>>> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
>>> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
>>> 
>>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>> 
>>> [I learned about regular expressions in an automata theory class, 
>>>  before I knew anything about Unix. What helped me was learning
>>>  about finite state machines. You won't need more than paper and
>>>  pencil to construct one. Reading source code would make more
>>>  sense once you grasp how to construct a FSM corresponding to a RE.]
> 
> 
> -- 
> GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF

[-- Attachment #2: Type: text/html, Size: 4939 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  4:31       ` Earl Baugh
@ 2020-08-01  4:53         ` ron minnich
  0 siblings, 0 replies; 25+ messages in thread
From: ron minnich @ 2020-08-01  4:53 UTC (permalink / raw)
  To: Earl Baugh; +Cc: TUHS main list, Bakul Shah

kernighan did this one at a talk at udel in 1976: find all the words
in the dictionary that you can see on a calculator looking at it
upside down.

On Fri, Jul 31, 2020 at 9:32 PM Earl Baugh <earl.baugh@gmail.com> wrote:
>
> I have a 30k+ character regex that I used to parse US street addresses from a single line of input into component pieces. That’s the largest ( and yes it was crazy... but it just grew and grew...and made sense to me as I decomposed the problem ).   That’s the largest one I ever saw.... and it was the fastest way I could solve the problem.. both mentally and processing wise.
>
> ( there was some post processing to deal with soundex matching and data base matching of the pieces to confirm the right answer.  But it solved the problem with 95%+ accuracy... that I got into the high 99%’s with the post processing. )
>
> So that’s at least an example of an arbitrary problem that was solved with regex.
>
> Earl
>
>
> Sent from my iPhone
>
> On Jul 31, 2020, at 11:08 PM, Will Senn <will.senn@gmail.com> wrote:
>
> 
> Oh, and one good google over another, I also found this:
>
> https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904
>
> Which is also great and simple to follow :).
>
> Will
>
> On 7/31/20 7:36 PM, Rob Pike wrote:
>
> I think this link -  https://swtch.com/~rsc/regexp/regexp1.html i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>>
>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>> >
>> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
>> > ...
>> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
>> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
>> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
>> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
>>
>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>
>> [I learned about regular expressions in an automata theory class,
>>  before I knew anything about Unix. What helped me was learning
>>  about finite state machines. You won't need more than paper and
>>  pencil to construct one. Reading source code would make more
>>  sense once you grasp how to construct a FSM corresponding to a RE.]
>
>
>
> --
> GPG Fingerprint: 68F4 B3BD 1730 555A 4462  7D45 3EAA 5B6D A982 BAAF

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-07-31 22:57 [TUHS] Regular Expressions Will Senn
  2020-08-01  0:01 ` Bakul Shah
@ 2020-08-01  5:48 ` Andrew Hume
  2020-08-01 13:31   ` Richard Salz
                     ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Andrew Hume @ 2020-08-01  5:48 UTC (permalink / raw)
  To: Will Senn; +Cc: TUHS main list


during the mid 80s, i was preparing one of the first invited talks for USENIX on the subject of regular expressions.
i had concocted a massive RE that matched all the known spellings of saddam hussein. (it was complicated,
and there were many different romanizations active at the time.)

dennis ritchie wandered into the unix room and asked what it was and then immediately applied it to our
somewhat newish AP news wire feed. it matched two new spellings!!


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  5:48 ` Andrew Hume
@ 2020-08-01 13:31   ` Richard Salz
  2020-08-01 13:43     ` Andrew Hume
  2020-08-02  0:45   ` Christopher Browne
  2020-08-09  1:00   ` Dave Horsfall
  2 siblings, 1 reply; 25+ messages in thread
From: Richard Salz @ 2020-08-01 13:31 UTC (permalink / raw)
  To: Andrew Hume; +Cc: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 473 bytes --]

> during the mid 80s, i was preparing one of the first invited talks for
> USENIX on the subject of regular expressions.
> i had concocted a massive RE that matched all the known spellings of
> saddam hussein. (it was complicated,
>

I wonder if that was the inspiration for the Qaddafi torture test; see
http://sources.vsta.org/comp.sources.unix/volume9/fastgrep/part01 (June 87)

The first time I saw "p[-1]" was looking at some regexp source.  It
blew/expanded my mind.

[-- Attachment #2: Type: text/html, Size: 822 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01 13:31   ` Richard Salz
@ 2020-08-01 13:43     ` Andrew Hume
  0 siblings, 0 replies; 25+ messages in thread
From: Andrew Hume @ 2020-08-01 13:43 UTC (permalink / raw)
  To: Richard Salz; +Cc: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 774 bytes --]

i apologise.

i was in fact referring to the qaddafi torture test (another victim of my stroke).
thanks to rich for catching that slip.

> On Aug 1, 2020, at 6:31 AM, Richard Salz <rich.salz@gmail.com> wrote:
> 
> 
> during the mid 80s, i was preparing one of the first invited talks for USENIX on the subject of regular expressions.
> i had concocted a massive RE that matched all the known spellings of saddam hussein. (it was complicated,
> 
> I wonder if that was the inspiration for the Qaddafi torture test; see http://sources.vsta.org/comp.sources.unix/volume9/fastgrep/part01 <http://sources.vsta.org/comp.sources.unix/volume9/fastgrep/part01> (June 87)
> 
> The first time I saw "p[-1]" was looking at some regexp source.  It blew/expanded my mind.


[-- Attachment #2: Type: text/html, Size: 1639 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  5:48 ` Andrew Hume
  2020-08-01 13:31   ` Richard Salz
@ 2020-08-02  0:45   ` Christopher Browne
  2020-08-09  1:00   ` Dave Horsfall
  2 siblings, 0 replies; 25+ messages in thread
From: Christopher Browne @ 2020-08-02  0:45 UTC (permalink / raw)
  To: Andrew Hume; +Cc: TUHS main list

[-- Attachment #1: Type: text/plain, Size: 806 bytes --]

On Sat, 1 Aug 2020 at 02:38, Andrew Hume <andrew@humeweb.com> wrote:

>
> during the mid 80s, i was preparing one of the first invited talks for
> USENIX on the subject of regular expressions.
> i had concocted a massive RE that matched all the known spellings of
> saddam hussein. (it was complicated,
> and there were many different romanizations active at the time.)
>
> dennis ritchie wandered into the unix room and asked what it was and then
> immediately applied it to our
> somewhat newish AP news wire feed. it matched two new spellings!!
>

I recalled that one being about Khadaffi, of Libya, and there being a
punchline involving one of the names being
"squid-breath."
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

[-- Attachment #2: Type: text/html, Size: 1185 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01  5:48 ` Andrew Hume
  2020-08-01 13:31   ` Richard Salz
  2020-08-02  0:45   ` Christopher Browne
@ 2020-08-09  1:00   ` Dave Horsfall
  2020-08-09  1:15     ` Nelson H. F. Beebe
  2 siblings, 1 reply; 25+ messages in thread
From: Dave Horsfall @ 2020-08-09  1:00 UTC (permalink / raw)
  To: The Eunuchs Hysterical Society

On Fri, 31 Jul 2020, Andrew Hume wrote:

> i had concocted a massive RE that matched all the known spellings of 
> saddam hussein. (it was complicated, and there were many different 
> romanizations active at the time.)

Another neat test was to run it against all the variations of "Muammar 
Gaddafi" and its various Anglicised spellings.

-- Dave

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-09  1:00   ` Dave Horsfall
@ 2020-08-09  1:15     ` Nelson H. F. Beebe
  2020-08-09 23:53       ` Dave Horsfall
  0 siblings, 1 reply; 25+ messages in thread
From: Nelson H. F. Beebe @ 2020-08-09  1:15 UTC (permalink / raw)
  To: Dave Horsfall, tuhs

Dave Horsfall writes about complex regular expressions:

>> ... a test was to run it against all the variations of "Muammar
>> Gaddafi" and its various Anglicised spellings.

The name "Chebyshev" is well known in numerical computation in
mathematics, science, and engineering.  This Web site lists 207
variants in its transliteration from the Cyrillic alphabet to the
Roman one:

	https://mathreader.livejournal.com/9239.html

The MathSciNet and zbMATH databases each record more than 4300
publications with "Chebyshev" in their titles.  Alas, no online
publication database that I know of permits regular expressions in
searches, so finding publications that use the 200+ variants would be
rather difficult.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
- 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-09  1:15     ` Nelson H. F. Beebe
@ 2020-08-09 23:53       ` Dave Horsfall
  2020-08-10  1:38         ` John Cowan
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Horsfall @ 2020-08-09 23:53 UTC (permalink / raw)
  To: The Eunuchs Hysterical Society

On Sat, 8 Aug 2020, Nelson H. F. Beebe wrote:

> The name "Chebyshev" is well known in numerical computation in 
> mathematics, science, and engineering.  This Web site lists 207 variants 
> in its transliteration from the Cyrillic alphabet to the Roman one:
>
> 	https://mathreader.livejournal.com/9239.html

Interesting; I was taught it was "Chebychev", which as second ranking 
doesn't even come close to "Chebyshev"...

Possibly a cultural thing; I went to an Australian university (UNSW).

-- Dave

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-09 23:53       ` Dave Horsfall
@ 2020-08-10  1:38         ` John Cowan
  0 siblings, 0 replies; 25+ messages in thread
From: John Cowan @ 2020-08-10  1:38 UTC (permalink / raw)
  To: Dave Horsfall; +Cc: The Eunuchs Hysterical Society

[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]

On Sun, Aug 9, 2020 at 7:53 PM Dave Horsfall <dave@horsfall.org> wrote:

Interesting; I was taught it was "Chebychev", which as second ranking
> doesn't even come close to "Chebyshev"...
>

"Chebychev" is downright bizarre.  The first "Ch" is as in English
"Chebyshev", the second is as in French "Tchebychev".  The other variants
are mostly explicable: for example, in German it's "Tschebyschew".

The final vowel is written "e" but pronounced "o" in Russian, and can come
out either way in transliteration.  Likewise, the final "v" is pronounced
"f", and can be written "v" or "ff" in transliteration.   Tolstoy was
affected by this too: his first name is usually given as either "Leo"
(which is what it means) or "Lev" nowadays, but in the 19C it was usually
spelled "Lyoff" (one syllable), which is how it's pronounced.



John Cowan          http://vrici.lojban.org/~cowan        cowan@ccil.org
You are a child of the universe no less than the trees and all other acyclic
graphs; you have a right to be here.  --DeXiderata by Sean McGrath

[-- Attachment #2: Type: text/html, Size: 1791 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-09 23:44 ` Dave Horsfall
@ 2020-08-10  0:50   ` Rob Pike
  0 siblings, 0 replies; 25+ messages in thread
From: Rob Pike @ 2020-08-10  0:50 UTC (permalink / raw)
  To: Dave Horsfall; +Cc: The Eunuchs Hysterical Society

[-- Attachment #1: Type: text/plain, Size: 443 bytes --]

Ken's grep for Plan 9 beats them all.

% time /usr/local/plan9/bin/grep -c . /tmp/x

39536274


real 0m1.731s

user 0m1.568s

sys 0m0.158s

% time /usr/bin/egrep -c . /tmp/x

39536274


real 0m9.798s

user 0m9.644s

sys 0m0.153s

%


% time /usr/local/plan9/bin/grep -c '(ab|cd)' /tmp/x

1023171


real 0m1.719s

user 0m1.560s

sys 0m0.158s

% time /usr/bin/egrep -c '(ab|cd)' /tmp/x

1023171


real 0m32.692s

user 0m32.525s
sys 0m0.166s-rob

[-- Attachment #2: Type: text/html, Size: 9453 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
  2020-08-01 21:12 Doug McIlroy
@ 2020-08-09 23:44 ` Dave Horsfall
  2020-08-10  0:50   ` Rob Pike
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Horsfall @ 2020-08-09 23:44 UTC (permalink / raw)
  To: The Eunuchs Hysterical Society

On Sat, 1 Aug 2020, Doug McIlroy wrote:

> This lesson came home with a vengeance when Al Aho wrote egrep, which 
> implemented full regular expressions as DFA's. I happened to be writing 
> calendar(1) at the same time, and used egrep to search calendar files 
> for dates in rather free formats for today and all days through the next 
> working day.  [...]

I heard somewhere (a Usenix paper?) that "egrep" turned out to be faster 
than "fgrep" (designed to look for fixed strings only), thus "egrep" is 
now symlinked everywhere.

On my FreeBSD box (in /usr/bin):

     aneurin% ls -li *grep
     25471 -r-xr-xr-x  7 root  wheel  40744 Oct 15  2017 bsdgrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 bzegrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 bzfgrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 bzgrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 egrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 fgrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 grep
     25471 -r-xr-xr-x  7 root  wheel  40744 Oct 15  2017 lzegrep
     25471 -r-xr-xr-x  7 root  wheel  40744 Oct 15  2017 lzfgrep
     25471 -r-xr-xr-x  7 root  wheel  40744 Oct 15  2017 lzgrep
     23554 lrwxr-xr-x  1 root  wheel     10 Feb 18  2011 pgrep -> /bin/pgrep
     25471 -r-xr-xr-x  7 root  wheel  40744 Oct 15  2017 xzegrep
     25471 -r-xr-xr-x  7 root  wheel  40744 Oct 15  2017 xzfgrep
     25471 -r-xr-xr-x  7 root  wheel  40744 Oct 15  2017 xzgrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 zegrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 zfgrep
     25480 -r-xr-xr-x  9 root  wheel  82136 Oct 15  2017 zgrep

OK, there's a few strays in there...

-- Dave

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
@ 2020-08-02  4:59 Rudi Blom
  0 siblings, 0 replies; 25+ messages in thread
From: Rudi Blom @ 2020-08-02  4:59 UTC (permalink / raw)
  To: tuhs

Probably same as others do, when I'm implementing some 'trace'
messages in a new program or one just 'under investigation' I try to
make sure the message has a nice format to be able to process a few
megabyte of logfile easily.

Cheers,
uncle rubl

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
@ 2020-08-01 21:12 Doug McIlroy
  2020-08-09 23:44 ` Dave Horsfall
  0 siblings, 1 reply; 25+ messages in thread
From: Doug McIlroy @ 2020-08-01 21:12 UTC (permalink / raw)
  To: tuhs

> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?


The editor ed was in Unix from day 1. For the necessarily tiny
 implementation, Ken discarded various features
from the ancestral qed. Among the casualties was alternation
in regular expressions. It has never fully returned.

Ken's original paper described a method for simulating all paths
of a nondeterministic finite automaton in parallel, although he
didn't describe it in these exact terms. This meant he had to
keep track of up to n possible states, where n is the number of
terminal symbols in the regular expression.

"Computing Reviews" published a scathing critique of the paper:
everyone knows a deterministic automaton can recognize regular
expressions with one state transition per input character; what
a waste of time to have to keep track of multiple states! What the
review missed was that the size of the DFA can be exponential in n.
For one-shot use, as in an editor, it can take far longer to construct
the DFA than to run it.

This lesson came home with a vengeance when Al Aho wrote egrep,
which implemented full regular expressions as DFA's. I happened
to be writing calendar(1) at the same time, and used egrep to

search calendar files for dates in rather free formats for today
and all days through the next working day.  Here's an example
(egrep interprets newline as "|"):

(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*1)([^0123456789]|$)
(^|[ (,;])((\* *)0*1)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*2)([^0123456789]|$)
(^|[ (,;])((\* *)0*2)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*3)([^0123456789]|$)
(^|[ (,;])((\* *)0*3)([^0123456789]|$)

Much to Al's chagrin, this regular expression took the better
part of a minute to compile into a DFA, which would then run in
microseconds. The trouble was that the DFA was enormously
bigger than the input--only a tiny fraction of the machine's
states would be visited; the rest were useless. That led
him to the brilliant idea of constructing the machine on
the fly, creating only the states that were pertinent to
the input at hand. That innovation made the DFA again
competitive with an NFA.

Doug

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TUHS] Regular Expressions
@ 2020-08-01  0:00 Noel Chiappa
  0 siblings, 0 replies; 25+ messages in thread
From: Noel Chiappa @ 2020-08-01  0:00 UTC (permalink / raw)
  To: tuhs; +Cc: jnc

    > From: Will Senn

    > I don't really understand how they work.
    > ...
    > maybe the way to understand the unix regex lies in a careful
    > investigation into how it is implemented

I'm not sure what I did, but it wasn't the latter, since I have no idea how
they are done!

I just mentally break the regex search string up into substrings (I use them
most in Epsilon, which has syntax to do substrings of search strings, which
helps a lot); past that, I think it's just using them and getting used to
them.

    > an IBM 7094 (whatever that is)

IBM's last 36-bit scientific mainframe before the System/360's. CTSS (which
DMR held out as the ancestor of Unix) ran on 7094's.

    Noel

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-08-10  1:39 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-31 22:57 [TUHS] Regular Expressions Will Senn
2020-08-01  0:01 ` Bakul Shah
2020-08-01  0:36   ` Rob Pike
2020-08-01  0:53     ` John P. Linderman
2020-08-01  1:31     ` Bakul Shah
2020-08-01  1:39     ` Larry McVoy
2020-08-01  2:33     ` Will Senn
2020-08-01  2:50       ` Rich Morin
2020-08-01  3:01         ` Larry McVoy
2020-08-01  3:07     ` Will Senn
2020-08-01  4:31       ` Earl Baugh
2020-08-01  4:53         ` ron minnich
2020-08-01  5:48 ` Andrew Hume
2020-08-01 13:31   ` Richard Salz
2020-08-01 13:43     ` Andrew Hume
2020-08-02  0:45   ` Christopher Browne
2020-08-09  1:00   ` Dave Horsfall
2020-08-09  1:15     ` Nelson H. F. Beebe
2020-08-09 23:53       ` Dave Horsfall
2020-08-10  1:38         ` John Cowan
2020-08-01  0:00 Noel Chiappa
2020-08-01 21:12 Doug McIlroy
2020-08-09 23:44 ` Dave Horsfall
2020-08-10  0:50   ` Rob Pike
2020-08-02  4:59 Rudi Blom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).