* [TUHS] Regular Expressions
@ 2020-07-31 22:57 Will Senn
2020-08-01 0:01 ` Bakul Shah
2020-08-01 5:48 ` Andrew Hume
0 siblings, 2 replies; 25+ messages in thread
From: Will Senn @ 2020-07-31 22:57 UTC (permalink / raw)
To: TUHS main list
[-- Attachment #1: Type: text/plain, Size: 2221 bytes --]
I've always been intrigued with regexes. When I was first exposed to
them, I was mystified and lost in the greediness of matches. Now, I use
them regularly, but still have trouble using them. I think it is because
I don't really understand how they work.
My question for y'all has to do with early unix. I have a copy of
Thompson, K. (1968). Regular expression search algorithm. Communications
of the ACM, 11(6), 419-422. It is interesting as an example of
Thompson's thinking about regexes. In this paper, he presents a
non-backtracking, efficient, algorithm for converting a regex into an
IBM 7094 (whatever that is) program that can be run against text input
that generates matches. It's cool. It got me to thinking maybe the way
to understand the unix regex lies in a careful investigation into how it
is implemented (original thought, right?). So, here I am again to ask
your indulgence as the latecomer wannabe unix apprentice. My thought is
that ed is where it begins and might be a good starting point, but I'm
not sure - what say y'all?
I also have a copy of the O'Reilly Mastering Regular Expressions book,
but that's not really the kind of thing I'm talking about. My question
is more basic than how to use regexes practically. I would like to
understand them at a parsing level/state change level (not sure that's
the correct way to say it, but I'm really new to this kind of lingo).
When I'm done with my stepping through the source, I want to be able to
reason that this is why that search matched that text and not this text
and why the search was greedy, or not greedy because of this logic here...
If my question above isn't focused or on topic enough, here's an
alternative set to ruminate on and hopefully discuss:
1. What's the provenance of regex in unix (when did it appear, in what
form, etc)?
2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
3. What are some of the milestones along the way (major changes, forks,
disagreements)?
4. Where, in the source, or in a paper, would you point someone to
wanting to better understand the mechanics of regex?
Thanks!
Will
--
GPG Fingerprint: 68F4 B3BD 1730 555A 4462 7D45 3EAA 5B6D A982 BAAF
[-- Attachment #2: Type: text/html, Size: 2765 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-07-31 22:57 [TUHS] Regular Expressions Will Senn
@ 2020-08-01 0:01 ` Bakul Shah
2020-08-01 0:36 ` Rob Pike
2020-08-01 5:48 ` Andrew Hume
1 sibling, 1 reply; 25+ messages in thread
From: Bakul Shah @ 2020-08-01 0:01 UTC (permalink / raw)
To: Will Senn; +Cc: TUHS main list
On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>
> I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
> ...
> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?
> 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
[I learned about regular expressions in an automata theory class,
before I knew anything about Unix. What helped me was learning
about finite state machines. You won't need more than paper and
pencil to construct one. Reading source code would make more
sense once you grasp how to construct a FSM corresponding to a RE.]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 0:01 ` Bakul Shah
@ 2020-08-01 0:36 ` Rob Pike
2020-08-01 0:53 ` John P. Linderman
` (4 more replies)
0 siblings, 5 replies; 25+ messages in thread
From: Rob Pike @ 2020-08-01 0:36 UTC (permalink / raw)
To: Bakul Shah; +Cc: TUHS main list
[-- Attachment #1: Type: text/plain, Size: 1544 bytes --]
I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s the
best place to start. Superb exposition on the background, theory, and
implementation as well as a bit of history of how the industry lost its way
with regular expressions.
Regular expressions are beautiful, simple, and widely misunderstood.
-rob
On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
> >
> > I've always been intrigued with regexes. When I was first exposed to
> them, I was mystified and lost in the greediness of matches. Now, I use
> them regularly, but still have trouble using them. I think it is because I
> don't really understand how they work.
> > ...
> > 1. What's the provenance of regex in unix (when did it appear, in what
> form, etc)?
> > 2. What are the 'best' implementations throughout unix (keep it pre
> 1980s)?
> > 3. What are some of the milestones along the way (major changes, forks,
> disagreements)?
> > 4. Where, in the source, or in a paper, would you point someone to
> wanting to better understand the mechanics of regex?
>
> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
> [I learned about regular expressions in an automata theory class,
> before I knew anything about Unix. What helped me was learning
> about finite state machines. You won't need more than paper and
> pencil to construct one. Reading source code would make more
> sense once you grasp how to construct a FSM corresponding to a RE.]
[-- Attachment #2: Type: text/html, Size: 2196 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 0:36 ` Rob Pike
@ 2020-08-01 0:53 ` John P. Linderman
2020-08-01 1:31 ` Bakul Shah
` (3 subsequent siblings)
4 siblings, 0 replies; 25+ messages in thread
From: John P. Linderman @ 2020-08-01 0:53 UTC (permalink / raw)
To: Rob Pike; +Cc: Bakul Shah, TUHS main list
[-- Attachment #1: Type: text/plain, Size: 1879 bytes --]
Hear hear. And I think it should be possible to analyze regular expressions
to determine if "simple" (hence fast) regexps are adequate. "Fancy" re's
are rare. At least from those of us where they couldn't be expressed.
On Fri, Jul 31, 2020 at 8:38 PM Rob Pike <robpike@gmail.com> wrote:
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s the
> best place to start. Superb exposition on the background, theory, and
> implementation as well as a bit of history of how the industry lost its way
> with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>
>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>> >
>> > I've always been intrigued with regexes. When I was first exposed to
>> them, I was mystified and lost in the greediness of matches. Now, I use
>> them regularly, but still have trouble using them. I think it is because I
>> don't really understand how they work.
>> > ...
>> > 1. What's the provenance of regex in unix (when did it appear, in what
>> form, etc)?
>> > 2. What are the 'best' implementations throughout unix (keep it pre
>> 1980s)?
>> > 3. What are some of the milestones along the way (major changes, forks,
>> disagreements)?
>> > 4. Where, in the source, or in a paper, would you point someone to
>> wanting to better understand the mechanics of regex?
>>
>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>
>> [I learned about regular expressions in an automata theory class,
>> before I knew anything about Unix. What helped me was learning
>> about finite state machines. You won't need more than paper and
>> pencil to construct one. Reading source code would make more
>> sense once you grasp how to construct a FSM corresponding to a RE.]
>
>
[-- Attachment #2: Type: text/html, Size: 2897 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 0:36 ` Rob Pike
2020-08-01 0:53 ` John P. Linderman
@ 2020-08-01 1:31 ` Bakul Shah
2020-08-01 1:39 ` Larry McVoy
` (2 subsequent siblings)
4 siblings, 0 replies; 25+ messages in thread
From: Bakul Shah @ 2020-08-01 1:31 UTC (permalink / raw)
To: Rob Pike; +Cc: TUHS main list
[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]
I was trying to get Will to try simple stuff out himself from first principles (to get started) but this is of course far better!
> On Jul 31, 2020, at 5:36 PM, Rob Pike <robpike@gmail.com> wrote:
>
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html <https://swtch.com/~rsc/regexp/regexp1.html> i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org <mailto:bakul@iitbombay.org>> wrote:
> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com <mailto:will.senn@gmail.com>> wrote:
> >
> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
> > ...
> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
>
> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction <https://en.wikipedia.org/wiki/Thompson%27s_construction>
>
> [I learned about regular expressions in an automata theory class,
> before I knew anything about Unix. What helped me was learning
> about finite state machines. You won't need more than paper and
> pencil to construct one. Reading source code would make more
> sense once you grasp how to construct a FSM corresponding to a RE.]
[-- Attachment #2: Type: text/html, Size: 3133 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 0:36 ` Rob Pike
2020-08-01 0:53 ` John P. Linderman
2020-08-01 1:31 ` Bakul Shah
@ 2020-08-01 1:39 ` Larry McVoy
2020-08-01 2:33 ` Will Senn
2020-08-01 3:07 ` Will Senn
4 siblings, 0 replies; 25+ messages in thread
From: Larry McVoy @ 2020-08-01 1:39 UTC (permalink / raw)
To: Rob Pike; +Cc: Bakul Shah, TUHS main list
That link is a great read.
On Sat, Aug 01, 2020 at 10:36:48AM +1000, Rob Pike wrote:
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s the
> best place to start. Superb exposition on the background, theory, and
> implementation as well as a bit of history of how the industry lost its way
> with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>
> > On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
> > >
> > > I've always been intrigued with regexes. When I was first exposed to
> > them, I was mystified and lost in the greediness of matches. Now, I use
> > them regularly, but still have trouble using them. I think it is because I
> > don't really understand how they work.
> > > ...
> > > 1. What's the provenance of regex in unix (when did it appear, in what
> > form, etc)?
> > > 2. What are the 'best' implementations throughout unix (keep it pre
> > 1980s)?
> > > 3. What are some of the milestones along the way (major changes, forks,
> > disagreements)?
> > > 4. Where, in the source, or in a paper, would you point someone to
> > wanting to better understand the mechanics of regex?
> >
> > Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
> >
> > [I learned about regular expressions in an automata theory class,
> > before I knew anything about Unix. What helped me was learning
> > about finite state machines. You won't need more than paper and
> > pencil to construct one. Reading source code would make more
> > sense once you grasp how to construct a FSM corresponding to a RE.]
--
---
Larry McVoy lm at mcvoy.com http://www.mcvoy.com/lm
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 0:36 ` Rob Pike
` (2 preceding siblings ...)
2020-08-01 1:39 ` Larry McVoy
@ 2020-08-01 2:33 ` Will Senn
2020-08-01 2:50 ` Rich Morin
2020-08-01 3:07 ` Will Senn
4 siblings, 1 reply; 25+ messages in thread
From: Will Senn @ 2020-08-01 2:33 UTC (permalink / raw)
To: Rob Pike, Bakul Shah; +Cc: TUHS main list
[-- Attachment #1: Type: text/plain, Size: 2285 bytes --]
Yes, I googled it per Clem's suggestion and wound up on that exact link
after wandering around admiring the scenery. I envy the more
mathematically inclined among us their view of matters technical. This
piece, being in C and having step by step articulation of the diagrams,
is better for me than the more formal wikipedia article, although, when
I get enough background, that looks like it'll be good, too.
Thanks Rob, Clem and Bakul.
Will
On 7/31/20 7:36 PM, Rob Pike wrote:
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s
> the best place to start. Superb exposition on the background, theory,
> and implementation as well as a bit of history of how the industry
> lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org
> <mailto:bakul@iitbombay.org>> wrote:
>
> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com
> <mailto:will.senn@gmail.com>> wrote:
> >
> > I've always been intrigued with regexes. When I was first
> exposed to them, I was mystified and lost in the greediness of
> matches. Now, I use them regularly, but still have trouble using
> them. I think it is because I don't really understand how they work.
> > ...
> > 1. What's the provenance of regex in unix (when did it appear,
> in what form, etc)?
> > 2. What are the 'best' implementations throughout unix (keep it
> pre 1980s)?
> > 3. What are some of the milestones along the way (major changes,
> forks, disagreements)?
> > 4. Where, in the source, or in a paper, would you point someone
> to wanting to better understand the mechanics of regex?
>
> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
> [I learned about regular expressions in an automata theory class,
> before I knew anything about Unix. What helped me was learning
> about finite state machines. You won't need more than paper and
> pencil to construct one. Reading source code would make more
> sense once you grasp how to construct a FSM corresponding to a RE.]
>
--
GPG Fingerprint: 68F4 B3BD 1730 555A 4462 7D45 3EAA 5B6D A982 BAAF
[-- Attachment #2: Type: text/html, Size: 3921 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 0:36 ` Rob Pike
` (3 preceding siblings ...)
2020-08-01 2:33 ` Will Senn
@ 2020-08-01 3:07 ` Will Senn
2020-08-01 4:31 ` Earl Baugh
4 siblings, 1 reply; 25+ messages in thread
From: Will Senn @ 2020-08-01 3:07 UTC (permalink / raw)
To: Rob Pike, Bakul Shah; +Cc: TUHS main list
[-- Attachment #1: Type: text/plain, Size: 2023 bytes --]
Oh, and one good google over another, I also found this:
https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904
Which is also great and simple to follow :).
Will
On 7/31/20 7:36 PM, Rob Pike wrote:
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s
> the best place to start. Superb exposition on the background, theory,
> and implementation as well as a bit of history of how the industry
> lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org
> <mailto:bakul@iitbombay.org>> wrote:
>
> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com
> <mailto:will.senn@gmail.com>> wrote:
> >
> > I've always been intrigued with regexes. When I was first
> exposed to them, I was mystified and lost in the greediness of
> matches. Now, I use them regularly, but still have trouble using
> them. I think it is because I don't really understand how they work.
> > ...
> > 1. What's the provenance of regex in unix (when did it appear,
> in what form, etc)?
> > 2. What are the 'best' implementations throughout unix (keep it
> pre 1980s)?
> > 3. What are some of the milestones along the way (major changes,
> forks, disagreements)?
> > 4. Where, in the source, or in a paper, would you point someone
> to wanting to better understand the mechanics of regex?
>
> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>
> [I learned about regular expressions in an automata theory class,
> before I knew anything about Unix. What helped me was learning
> about finite state machines. You won't need more than paper and
> pencil to construct one. Reading source code would make more
> sense once you grasp how to construct a FSM corresponding to a RE.]
>
--
GPG Fingerprint: 68F4 B3BD 1730 555A 4462 7D45 3EAA 5B6D A982 BAAF
[-- Attachment #2: Type: text/html, Size: 3760 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 3:07 ` Will Senn
@ 2020-08-01 4:31 ` Earl Baugh
2020-08-01 4:53 ` ron minnich
0 siblings, 1 reply; 25+ messages in thread
From: Earl Baugh @ 2020-08-01 4:31 UTC (permalink / raw)
To: Will Senn; +Cc: TUHS main list, Bakul Shah
[-- Attachment #1: Type: text/plain, Size: 2785 bytes --]
I have a 30k+ character regex that I used to parse US street addresses from a single line of input into component pieces. That’s the largest ( and yes it was crazy... but it just grew and grew...and made sense to me as I decomposed the problem ). That’s the largest one I ever saw.... and it was the fastest way I could solve the problem.. both mentally and processing wise.
( there was some post processing to deal with soundex matching and data base matching of the pieces to confirm the right answer. But it solved the problem with 95%+ accuracy... that I got into the high 99%’s with the post processing. )
So that’s at least an example of an arbitrary problem that was solved with regex.
Earl
Sent from my iPhone
> On Jul 31, 2020, at 11:08 PM, Will Senn <will.senn@gmail.com> wrote:
>
>
> Oh, and one good google over another, I also found this:
>
> https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904
>
> Which is also great and simple to follow :).
>
> Will
>
>> On 7/31/20 7:36 PM, Rob Pike wrote:
>> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
>>
>> Regular expressions are beautiful, simple, and widely misunderstood.
>>
>> -rob
>>
>>
>> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>>> >
>>> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
>>> > ...
>>> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
>>> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
>>> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
>>> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
>>>
>>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>>
>>> [I learned about regular expressions in an automata theory class,
>>> before I knew anything about Unix. What helped me was learning
>>> about finite state machines. You won't need more than paper and
>>> pencil to construct one. Reading source code would make more
>>> sense once you grasp how to construct a FSM corresponding to a RE.]
>
>
> --
> GPG Fingerprint: 68F4 B3BD 1730 555A 4462 7D45 3EAA 5B6D A982 BAAF
[-- Attachment #2: Type: text/html, Size: 4939 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 4:31 ` Earl Baugh
@ 2020-08-01 4:53 ` ron minnich
0 siblings, 0 replies; 25+ messages in thread
From: ron minnich @ 2020-08-01 4:53 UTC (permalink / raw)
To: Earl Baugh; +Cc: TUHS main list, Bakul Shah
kernighan did this one at a talk at udel in 1976: find all the words
in the dictionary that you can see on a calculator looking at it
upside down.
On Fri, Jul 31, 2020 at 9:32 PM Earl Baugh <earl.baugh@gmail.com> wrote:
>
> I have a 30k+ character regex that I used to parse US street addresses from a single line of input into component pieces. That’s the largest ( and yes it was crazy... but it just grew and grew...and made sense to me as I decomposed the problem ). That’s the largest one I ever saw.... and it was the fastest way I could solve the problem.. both mentally and processing wise.
>
> ( there was some post processing to deal with soundex matching and data base matching of the pieces to confirm the right answer. But it solved the problem with 95%+ accuracy... that I got into the high 99%’s with the post processing. )
>
> So that’s at least an example of an arbitrary problem that was solved with regex.
>
> Earl
>
>
> Sent from my iPhone
>
> On Jul 31, 2020, at 11:08 PM, Will Senn <will.senn@gmail.com> wrote:
>
>
> Oh, and one good google over another, I also found this:
>
> https://www.drdobbs.com/architecture-and-design/regular-expressions/184410904
>
> Which is also great and simple to follow :).
>
> Will
>
> On 7/31/20 7:36 PM, Rob Pike wrote:
>
> I think this link - https://swtch.com/~rsc/regexp/regexp1.html i- s the best place to start. Superb exposition on the background, theory, and implementation as well as a bit of history of how the industry lost its way with regular expressions.
>
> Regular expressions are beautiful, simple, and widely misunderstood.
>
> -rob
>
>
> On Sat, Aug 1, 2020 at 10:03 AM Bakul Shah <bakul@iitbombay.org> wrote:
>>
>> On Jul 31, 2020, at 3:57 PM, Will Senn <will.senn@gmail.com> wrote:
>> >
>> > I've always been intrigued with regexes. When I was first exposed to them, I was mystified and lost in the greediness of matches. Now, I use them regularly, but still have trouble using them. I think it is because I don't really understand how they work.
>> > ...
>> > 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
>> > 2. What are the 'best' implementations throughout unix (keep it pre 1980s)?
>> > 3. What are some of the milestones along the way (major changes, forks, disagreements)?
>> > 4. Where, in the source, or in a paper, would you point someone to wanting to better understand the mechanics of regex?
>>
>> Start here: https://en.wikipedia.org/wiki/Thompson%27s_construction
>>
>> [I learned about regular expressions in an automata theory class,
>> before I knew anything about Unix. What helped me was learning
>> about finite state machines. You won't need more than paper and
>> pencil to construct one. Reading source code would make more
>> sense once you grasp how to construct a FSM corresponding to a RE.]
>
>
>
> --
> GPG Fingerprint: 68F4 B3BD 1730 555A 4462 7D45 3EAA 5B6D A982 BAAF
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-07-31 22:57 [TUHS] Regular Expressions Will Senn
2020-08-01 0:01 ` Bakul Shah
@ 2020-08-01 5:48 ` Andrew Hume
2020-08-01 13:31 ` Richard Salz
` (2 more replies)
1 sibling, 3 replies; 25+ messages in thread
From: Andrew Hume @ 2020-08-01 5:48 UTC (permalink / raw)
To: Will Senn; +Cc: TUHS main list
during the mid 80s, i was preparing one of the first invited talks for USENIX on the subject of regular expressions.
i had concocted a massive RE that matched all the known spellings of saddam hussein. (it was complicated,
and there were many different romanizations active at the time.)
dennis ritchie wandered into the unix room and asked what it was and then immediately applied it to our
somewhat newish AP news wire feed. it matched two new spellings!!
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 5:48 ` Andrew Hume
@ 2020-08-01 13:31 ` Richard Salz
2020-08-01 13:43 ` Andrew Hume
2020-08-02 0:45 ` Christopher Browne
2020-08-09 1:00 ` Dave Horsfall
2 siblings, 1 reply; 25+ messages in thread
From: Richard Salz @ 2020-08-01 13:31 UTC (permalink / raw)
To: Andrew Hume; +Cc: TUHS main list
[-- Attachment #1: Type: text/plain, Size: 473 bytes --]
> during the mid 80s, i was preparing one of the first invited talks for
> USENIX on the subject of regular expressions.
> i had concocted a massive RE that matched all the known spellings of
> saddam hussein. (it was complicated,
>
I wonder if that was the inspiration for the Qaddafi torture test; see
http://sources.vsta.org/comp.sources.unix/volume9/fastgrep/part01 (June 87)
The first time I saw "p[-1]" was looking at some regexp source. It
blew/expanded my mind.
[-- Attachment #2: Type: text/html, Size: 822 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 5:48 ` Andrew Hume
2020-08-01 13:31 ` Richard Salz
@ 2020-08-02 0:45 ` Christopher Browne
2020-08-09 1:00 ` Dave Horsfall
2 siblings, 0 replies; 25+ messages in thread
From: Christopher Browne @ 2020-08-02 0:45 UTC (permalink / raw)
To: Andrew Hume; +Cc: TUHS main list
[-- Attachment #1: Type: text/plain, Size: 806 bytes --]
On Sat, 1 Aug 2020 at 02:38, Andrew Hume <andrew@humeweb.com> wrote:
>
> during the mid 80s, i was preparing one of the first invited talks for
> USENIX on the subject of regular expressions.
> i had concocted a massive RE that matched all the known spellings of
> saddam hussein. (it was complicated,
> and there were many different romanizations active at the time.)
>
> dennis ritchie wandered into the unix room and asked what it was and then
> immediately applied it to our
> somewhat newish AP news wire feed. it matched two new spellings!!
>
I recalled that one being about Khadaffi, of Libya, and there being a
punchline involving one of the names being
"squid-breath."
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
[-- Attachment #2: Type: text/html, Size: 1185 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 5:48 ` Andrew Hume
2020-08-01 13:31 ` Richard Salz
2020-08-02 0:45 ` Christopher Browne
@ 2020-08-09 1:00 ` Dave Horsfall
2020-08-09 1:15 ` Nelson H. F. Beebe
2 siblings, 1 reply; 25+ messages in thread
From: Dave Horsfall @ 2020-08-09 1:00 UTC (permalink / raw)
To: The Eunuchs Hysterical Society
On Fri, 31 Jul 2020, Andrew Hume wrote:
> i had concocted a massive RE that matched all the known spellings of
> saddam hussein. (it was complicated, and there were many different
> romanizations active at the time.)
Another neat test was to run it against all the variations of "Muammar
Gaddafi" and its various Anglicised spellings.
-- Dave
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-09 1:00 ` Dave Horsfall
@ 2020-08-09 1:15 ` Nelson H. F. Beebe
2020-08-09 23:53 ` Dave Horsfall
0 siblings, 1 reply; 25+ messages in thread
From: Nelson H. F. Beebe @ 2020-08-09 1:15 UTC (permalink / raw)
To: Dave Horsfall, tuhs
Dave Horsfall writes about complex regular expressions:
>> ... a test was to run it against all the variations of "Muammar
>> Gaddafi" and its various Anglicised spellings.
The name "Chebyshev" is well known in numerical computation in
mathematics, science, and engineering. This Web site lists 207
variants in its transliteration from the Cyrillic alphabet to the
Roman one:
https://mathreader.livejournal.com/9239.html
The MathSciNet and zbMATH databases each record more than 4300
publications with "Chebyshev" in their titles. Alas, no online
publication database that I know of permits regular expressions in
searches, so finding publications that use the 200+ variants would be
rather difficult.
-------------------------------------------------------------------------------
- Nelson H. F. Beebe Tel: +1 801 581 5254 -
- University of Utah FAX: +1 801 581 4148 -
- Department of Mathematics, 110 LCB Internet e-mail: beebe@math.utah.edu -
- 155 S 1400 E RM 233 beebe@acm.org beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-09 1:15 ` Nelson H. F. Beebe
@ 2020-08-09 23:53 ` Dave Horsfall
2020-08-10 1:38 ` John Cowan
0 siblings, 1 reply; 25+ messages in thread
From: Dave Horsfall @ 2020-08-09 23:53 UTC (permalink / raw)
To: The Eunuchs Hysterical Society
On Sat, 8 Aug 2020, Nelson H. F. Beebe wrote:
> The name "Chebyshev" is well known in numerical computation in
> mathematics, science, and engineering. This Web site lists 207 variants
> in its transliteration from the Cyrillic alphabet to the Roman one:
>
> https://mathreader.livejournal.com/9239.html
Interesting; I was taught it was "Chebychev", which as second ranking
doesn't even come close to "Chebyshev"...
Possibly a cultural thing; I went to an Australian university (UNSW).
-- Dave
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-09 23:53 ` Dave Horsfall
@ 2020-08-10 1:38 ` John Cowan
0 siblings, 0 replies; 25+ messages in thread
From: John Cowan @ 2020-08-10 1:38 UTC (permalink / raw)
To: Dave Horsfall; +Cc: The Eunuchs Hysterical Society
[-- Attachment #1: Type: text/plain, Size: 1044 bytes --]
On Sun, Aug 9, 2020 at 7:53 PM Dave Horsfall <dave@horsfall.org> wrote:
Interesting; I was taught it was "Chebychev", which as second ranking
> doesn't even come close to "Chebyshev"...
>
"Chebychev" is downright bizarre. The first "Ch" is as in English
"Chebyshev", the second is as in French "Tchebychev". The other variants
are mostly explicable: for example, in German it's "Tschebyschew".
The final vowel is written "e" but pronounced "o" in Russian, and can come
out either way in transliteration. Likewise, the final "v" is pronounced
"f", and can be written "v" or "ff" in transliteration. Tolstoy was
affected by this too: his first name is usually given as either "Leo"
(which is what it means) or "Lev" nowadays, but in the 19C it was usually
spelled "Lyoff" (one syllable), which is how it's pronounced.
John Cowan http://vrici.lojban.org/~cowan cowan@ccil.org
You are a child of the universe no less than the trees and all other acyclic
graphs; you have a right to be here. --DeXiderata by Sean McGrath
[-- Attachment #2: Type: text/html, Size: 1791 bytes --]
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
@ 2020-08-01 0:00 Noel Chiappa
0 siblings, 0 replies; 25+ messages in thread
From: Noel Chiappa @ 2020-08-01 0:00 UTC (permalink / raw)
To: tuhs; +Cc: jnc
> From: Will Senn
> I don't really understand how they work.
> ...
> maybe the way to understand the unix regex lies in a careful
> investigation into how it is implemented
I'm not sure what I did, but it wasn't the latter, since I have no idea how
they are done!
I just mentally break the regex search string up into substrings (I use them
most in Epsilon, which has syntax to do substrings of search strings, which
helps a lot); past that, I think it's just using them and getting used to
them.
> an IBM 7094 (whatever that is)
IBM's last 36-bit scientific mainframe before the System/360's. CTSS (which
DMR held out as the ancestor of Unix) ran on 7094's.
Noel
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
@ 2020-08-01 21:12 Doug McIlroy
2020-08-09 23:44 ` Dave Horsfall
0 siblings, 1 reply; 25+ messages in thread
From: Doug McIlroy @ 2020-08-01 21:12 UTC (permalink / raw)
To: tuhs
> 1. What's the provenance of regex in unix (when did it appear, in what form, etc)?
> 2. What are the 'best' implementations throughout unix (keep it pre1980s)?
> 3. What are some of the milestones along the way (major changes, forks, disagreements)?
The editor ed was in Unix from day 1. For the necessarily tiny
implementation, Ken discarded various features
from the ancestral qed. Among the casualties was alternation
in regular expressions. It has never fully returned.
Ken's original paper described a method for simulating all paths
of a nondeterministic finite automaton in parallel, although he
didn't describe it in these exact terms. This meant he had to
keep track of up to n possible states, where n is the number of
terminal symbols in the regular expression.
"Computing Reviews" published a scathing critique of the paper:
everyone knows a deterministic automaton can recognize regular
expressions with one state transition per input character; what
a waste of time to have to keep track of multiple states! What the
review missed was that the size of the DFA can be exponential in n.
For one-shot use, as in an editor, it can take far longer to construct
the DFA than to run it.
This lesson came home with a vengeance when Al Aho wrote egrep,
which implemented full regular expressions as DFA's. I happened
to be writing calendar(1) at the same time, and used egrep to
search calendar files for dates in rather free formats for today
and all days through the next working day. Here's an example
(egrep interprets newline as "|"):
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*1)([^0123456789]|$)
(^|[ (,;])((\* *)0*1)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*2)([^0123456789]|$)
(^|[ (,;])((\* *)0*2)([^0123456789]|$)
(^|[ (,;])(([Aa]ug[^ ]* *|(08|8)/)0*3)([^0123456789]|$)
(^|[ (,;])((\* *)0*3)([^0123456789]|$)
Much to Al's chagrin, this regular expression took the better
part of a minute to compile into a DFA, which would then run in
microseconds. The trouble was that the DFA was enormously
bigger than the input--only a tiny fraction of the machine's
states would be visited; the rest were useless. That led
him to the brilliant idea of constructing the machine on
the fly, creating only the states that were pertinent to
the input at hand. That innovation made the DFA again
competitive with an NFA.
Doug
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
2020-08-01 21:12 Doug McIlroy
@ 2020-08-09 23:44 ` Dave Horsfall
2020-08-10 0:50 ` Rob Pike
0 siblings, 1 reply; 25+ messages in thread
From: Dave Horsfall @ 2020-08-09 23:44 UTC (permalink / raw)
To: The Eunuchs Hysterical Society
On Sat, 1 Aug 2020, Doug McIlroy wrote:
> This lesson came home with a vengeance when Al Aho wrote egrep, which
> implemented full regular expressions as DFA's. I happened to be writing
> calendar(1) at the same time, and used egrep to search calendar files
> for dates in rather free formats for today and all days through the next
> working day. [...]
I heard somewhere (a Usenix paper?) that "egrep" turned out to be faster
than "fgrep" (designed to look for fixed strings only), thus "egrep" is
now symlinked everywhere.
On my FreeBSD box (in /usr/bin):
aneurin% ls -li *grep
25471 -r-xr-xr-x 7 root wheel 40744 Oct 15 2017 bsdgrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 bzegrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 bzfgrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 bzgrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 egrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 fgrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 grep
25471 -r-xr-xr-x 7 root wheel 40744 Oct 15 2017 lzegrep
25471 -r-xr-xr-x 7 root wheel 40744 Oct 15 2017 lzfgrep
25471 -r-xr-xr-x 7 root wheel 40744 Oct 15 2017 lzgrep
23554 lrwxr-xr-x 1 root wheel 10 Feb 18 2011 pgrep -> /bin/pgrep
25471 -r-xr-xr-x 7 root wheel 40744 Oct 15 2017 xzegrep
25471 -r-xr-xr-x 7 root wheel 40744 Oct 15 2017 xzfgrep
25471 -r-xr-xr-x 7 root wheel 40744 Oct 15 2017 xzgrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 zegrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 zfgrep
25480 -r-xr-xr-x 9 root wheel 82136 Oct 15 2017 zgrep
OK, there's a few strays in there...
-- Dave
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TUHS] Regular Expressions
@ 2020-08-02 4:59 Rudi Blom
0 siblings, 0 replies; 25+ messages in thread
From: Rudi Blom @ 2020-08-02 4:59 UTC (permalink / raw)
To: tuhs
Probably same as others do, when I'm implementing some 'trace'
messages in a new program or one just 'under investigation' I try to
make sure the message has a nice format to be able to process a few
megabyte of logfile easily.
Cheers,
uncle rubl
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2020-08-10 1:39 UTC | newest]
Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-31 22:57 [TUHS] Regular Expressions Will Senn
2020-08-01 0:01 ` Bakul Shah
2020-08-01 0:36 ` Rob Pike
2020-08-01 0:53 ` John P. Linderman
2020-08-01 1:31 ` Bakul Shah
2020-08-01 1:39 ` Larry McVoy
2020-08-01 2:33 ` Will Senn
2020-08-01 2:50 ` Rich Morin
2020-08-01 3:01 ` Larry McVoy
2020-08-01 3:07 ` Will Senn
2020-08-01 4:31 ` Earl Baugh
2020-08-01 4:53 ` ron minnich
2020-08-01 5:48 ` Andrew Hume
2020-08-01 13:31 ` Richard Salz
2020-08-01 13:43 ` Andrew Hume
2020-08-02 0:45 ` Christopher Browne
2020-08-09 1:00 ` Dave Horsfall
2020-08-09 1:15 ` Nelson H. F. Beebe
2020-08-09 23:53 ` Dave Horsfall
2020-08-10 1:38 ` John Cowan
2020-08-01 0:00 Noel Chiappa
2020-08-01 21:12 Doug McIlroy
2020-08-09 23:44 ` Dave Horsfall
2020-08-10 0:50 ` Rob Pike
2020-08-02 4:59 Rudi Blom
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).