9fans - fans of the OS Plan 9 from Bell Labs
* [9fans] non greedy regular expressions
@ 2008-10-23 18:58 Rudolf Sykora
  2008-10-23 19:05 ` erik quanstrom
  0 siblings, 1 reply; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-23 18:58 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Hello

regexp(6) seems to know only greedy regular expressions. So,
presumably, do awk, sed, grep, ..., since these are based on that regexp library.
My question is: what should one do when one needs non-greedy regexps?

Also, is there anything in Plan 9 like submatch extraction, counted
repetition of regexps (like {3,5}), or (lookahead) assertions?
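For reference, here is what the features being asked about look like in Python's re module, a leftmost-first engine that has all of them (none of this is regexp(6) syntax):

```python
import re

text = 'aaaa foobar <x><y>'

# counted repetition: between 3 and 5 a's, greedy
assert re.search(r'a{3,5}', text).group() == 'aaaa'
# lookahead assertion: 'foo' only when followed by 'bar'
assert re.search(r'foo(?=bar)', text).group() == 'foo'
# non-greedy star: shortest match, '<x>' rather than '<x><y>'
assert re.search(r'<.*?>', text).group() == '<x>'
# submatch extraction: parenthesized group captured separately
assert re.search(r'foo(bar)', text).group(1) == 'bar'
```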

Thanks
Ruda



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [9fans] non greedy regular expressions
  2008-10-23 18:58 [9fans] non greedy regular expressions Rudolf Sykora
@ 2008-10-23 19:05 ` erik quanstrom
  2008-10-24  8:08   ` Rudolf Sykora
  0 siblings, 1 reply; 45+ messages in thread
From: erik quanstrom @ 2008-10-23 19:05 UTC (permalink / raw)
  To: 9fans

> regexp(6) seems to know only greedy regular expressions. So does
> probably awk, sed, grep, ..., since these are based on that regexp.
> My question is what to do if one needs non-greedy regexps.

russ has a great writeup on this.
http://swtch.com/~rsc/regexp/
i think it covers all your questions.

- erik





* Re: [9fans] non greedy regular expressions
  2008-10-23 19:05 ` erik quanstrom
@ 2008-10-24  8:08   ` Rudolf Sykora
  2008-10-24 12:23     ` erik quanstrom
  0 siblings, 1 reply; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-24  8:08 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> russ has a great writeup on this.
> http://swtch.com/~rsc/regexp/
> i think it covers all your questions.
>
> - erik

I read through some of that already yesterday. Anyway, I am still
puzzled. In the text of

Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)

R. Cox writes:
---
While writing the text editor sam [6] in the early 1980s, Rob Pike
wrote a new regular expression implementation, which Dave Presotto
extracted into a library that appeared in the Eighth Edition. Pike's
implementation incorporated submatch tracking into an efficient NFA
simulation but, like the rest of the Eighth Edition source, was not
widely distributed.
...
Pike's regular expression implementation, extended to support Unicode,
was made freely available with sam in late 1992, but the particularly
efficient regular expression search algorithm went unnoticed. The code
is now available in many forms: as part of sam, as Plan 9's regular
expression library, or packaged separately for Unix.
---

But the manual pages (regexp(6), sam(1)) are completely silent
about, e.g., any submatch tracking.
So what's wrong? Can anybody clarify the situation for me, or do I
really have to read the code?

Ruda




* Re: [9fans] non greedy regular expressions
  2008-10-24  8:08   ` Rudolf Sykora
@ 2008-10-24 12:23     ` erik quanstrom
  2008-10-24 16:11       ` Rudolf Sykora
  0 siblings, 1 reply; 45+ messages in thread
From: erik quanstrom @ 2008-10-24 12:23 UTC (permalink / raw)
  To: 9fans

> But any manual page (regexp(6), that of sam)  keeps completely silent
> about eg. any submatch tracking.
> So what's wrong? Can anybody clarify the situation for me or do I
> really have to read the codes?

well reading the code would be a travesty.  it's curious
that neither the sam paper nor regexp(6) mentions
submatches.  maybe i missed them.

sed -n 's:.*(KRAK[A-Z]+*) +([a-zA-Z]+).*:\2, \1:gp' </lib/volcanoes

sam -d /lib/volcanoes
,x g/KRAK/ s:[^A-Z]+([A-Z]+*) +([a-zA-Z]+).*:\2, \1:
/KRAK/-+

or, if you want to run sam as a stream editor
{	echo ',x g/KRAK/ s:[^A-Z]+([A-Z]+*) +([a-zA-Z]+).*:\2, \1:'
	echo '/KRAK/-+'
} | sam -d /lib/volcanoes >[2=]

a couple of examples of submatch tracking from our system
(i'm not sure about the version of iwhois on sources.)

/rc/bin/B
/rc/bin/Kill
/rc/bin/broke
/rc/bin/fax
/rc/bin/fshalt
/rc/bin/iwhois
/rc/bin/juke
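The submatch idea in the sed command above can be sketched with Python's re.sub; the input line here is made up, since the actual format of /lib/volcanoes is not shown in the thread:

```python
import re

# Hypothetical line in the style the sed example expects; the real
# /lib/volcanoes format is not shown in the thread.
line = 'x KRAKATAU  Indonesia y'
# Capture two submatches and swap their order in the replacement.
out = re.sub(r'.*(KRAK[A-Z]*) +([a-zA-Z]+).*', r'\2, \1', line)
print(out)  # -> Indonesia, KRAKATAU
```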

- erik





* Re: [9fans] non greedy regular expressions
  2008-10-24 12:23     ` erik quanstrom
@ 2008-10-24 16:11       ` Rudolf Sykora
  2008-10-24 16:54         ` erik quanstrom
                           ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-24 16:11 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> well reading the code would be a travesty.  it's curious
> that neither the sam paper nor regexp(6) mentions
> submatches.  maybe i missed them.
>
> sed -n 's:.*(KRAK[A-Z]+*) +([a-zA-Z]+).*:\2, \1:gp' </lib/volcanoes
> - erik

Ok, so despite the documentation, some submatch tracking is there.
But in all (?) your examples, as well as in the scripts you mentioned,
this tracking is used exclusively with the s command (which is said to
be unnecessary, at least in sam/acme). If I try something like
/( b(.)b)/a/\1\2/
on
bla blb 56
I get
bla blb\1\2 56
which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
he claims that all the nice features except backreferences can be
implemented with Thompson's NFA algorithm, and that even backreferences
can be handled gracefully somehow. That is, ALL of them: non-greedy operators,
generalized assertions, counted repetitions, character classes CAN be
processed using the fast algorithm. Why then don't we have it? I once
wrote a program in Python and was pretty happy to have non-greedy
operators and lookahead assertions on hand. Had I not had those,
I probably wouldn't have been able to write it (nicely).

Ruda




* Re: [9fans] non greedy regular expressions
  2008-10-24 16:11       ` Rudolf Sykora
@ 2008-10-24 16:54         ` erik quanstrom
  2008-10-24 17:02         ` John Stalker
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: erik quanstrom @ 2008-10-24 16:54 UTC (permalink / raw)
  To: rudolf.sykora, 9fans

> Ok, so despite the documentation, some submatch tracking is there.
> But in all (?) your examples, as well as in the scripts you mentioned,
> this tracking is exclusively used with the s command (which is said to
> be unnecessary at least in sam/acme). If I try sth. like
> /( b(.)b)/a/\1\2/

this is covered in the sam paper in the section where pike
discusses s.

> Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
> he claims that all nice features except for backreferences can be
> implemented with Thompson's NFA algorithm. And even the backreferences
> can be handled gracefully somehow. That is: ALL: non-greedy operators,
> generalized assertions, counted repetitions, character classes CAN be
> processed using the fast algorithm. Why then we don't have it?

let me turn this around.  why would these additions (complications)
benefit plan 9?

- erik




* Re: [9fans] non greedy regular expressions
  2008-10-24 16:11       ` Rudolf Sykora
  2008-10-24 16:54         ` erik quanstrom
@ 2008-10-24 17:02         ` John Stalker
  2008-10-24 17:15           ` Rob Pike
  2008-10-24 17:41           ` Rudolf Sykora
  2008-10-24 17:10         ` Uriel
  2008-10-24 19:56         ` Charles Forsyth
  3 siblings, 2 replies; 45+ messages in thread
From: John Stalker @ 2008-10-24 17:02 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I think you've understood correctly.  Back references mostly aren't
there.  Greedy operators aren't there.  For back references, this may
be due to philosophical reservations; I have a few myself.  For greedy
operators, I suspect it's more because no one has cared enough to do
it.  It wouldn't be too hard, as Russ's article says.  If someone is
going to do this I would suggest going all the way and implementing
tags.  See http://laurikari.net/ville/spire2000-tnfa.ps.

> > well reading the code would be a travesty.  it's curious
> > that neither the sam paper nor regexp(6) mentions
> > submatches.  maybe i missed them.
> >
> > sed -n 's:.*(KRAK[A-Z]+*) +([a-zA-Z]+).*:\2, \1:gp' </lib/volcanoes
> > - erik
>
> Ok, so despite the documentation, some submatch tracking is there.
> But in all (?) your examples, as well as in the scripts you mentioned,
> this tracking is exclusively used with the s command (which is said to
> be unnecessary at least in sam/acme). If I try sth. like
> /( b(.)b)/a/\1\2/
> on
> bla blb 56
> I get
> bla blb\1\2 56
> which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')
>
> Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
> he claims that all nice features except for backreferences can be
> implemented with Thompson's NFA algorithm. And even the backreferences
> can be handled gracefully somehow. That is: ALL: non-greedy operators,
> generalized assertions, counted repetitions, character classes CAN be
> processed using the fast algorithm. Why then we don't have it? I once
> wrote a program in python and was pretty happy to have non-greedy
> operators and lookahead assertions on hand. Should I hadn't had those,
> I probably wouldn't have been able to write it (nicely).
>
> Ruda
>
--
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282




* Re: [9fans] non greedy regular expressions
  2008-10-24 16:11       ` Rudolf Sykora
  2008-10-24 16:54         ` erik quanstrom
  2008-10-24 17:02         ` John Stalker
@ 2008-10-24 17:10         ` Uriel
  2008-10-24 19:56         ` Charles Forsyth
  3 siblings, 0 replies; 45+ messages in thread
From: Uriel @ 2008-10-24 17:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

On Fri, Oct 24, 2008 at 6:11 PM, Rudolf Sykora <rudolf.sykora@gmail.com> wrote:
> Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
> he claims that all nice features except for backreferences can be
> implemented with Thompson's NFA algorithm. And even the backreferences
> can be handled gracefully somehow. That is: ALL: non-greedy operators,
> generalized assertions, counted repetitions, character classes CAN be
> processed using the fast algorithm. Why then we don't have it?

Just because something is possible doesn't mean that it is a good
idea. Implementing every possible feature is the GNU/way, if you want
to go down that path, you know where to find it.

> I once
> wrote a program in python and was pretty happy to have non-greedy
> operators and lookahead assertions on hand. Should I hadn't had those,
> I probably wouldn't have been able to write it (nicely).

Python uses the absolutely dreadful and unintelligible PCRE, which not
only has unpredictable performance, as Russ points out, but is a
total mess, so complex that I doubt anybody fully understands it;
such regexps are to Plan 9 regexps what C++ is to C.

Maybe Plan 9 regexps are not as 'powerful', but at least I can understand them.

uriel




* Re: [9fans] non greedy regular expressions
  2008-10-24 17:02         ` John Stalker
@ 2008-10-24 17:15           ` Rob Pike
  2008-10-24 17:41           ` Rudolf Sykora
  1 sibling, 0 replies; 45+ messages in thread
From: Rob Pike @ 2008-10-24 17:15 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Backreferences within the pattern (such as in /(.*)\1/) make the
matcher non-regular and exponentially hard. It was a deliberate
decision not to implement them in sam and I'd make the same decision
today.

As far as greedy/non-greedy operators go, they have more utility but I
believe they have become popular primarily to deal with the problems
with Perl and PCRE doing greedy implementations badly in order to
avoid the exponential cost of their backtracking matchers. These
matchers can give crazy results.

Sam etc. are O(length of string * length of pattern) at all times and
their leftmost-longest properties work throughout.

Again, Russ has most of this in his analysis.

-rob




* Re: [9fans] non greedy regular expressions
  2008-10-24 17:02         ` John Stalker
  2008-10-24 17:15           ` Rob Pike
@ 2008-10-24 17:41           ` Rudolf Sykora
  2008-10-24 18:01             ` Russ Cox
  2008-10-24 18:02             ` John Stalker
  1 sibling, 2 replies; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-24 17:41 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2008/10/24 John Stalker <stalker@maths.tcd.ie>:
> I think you've understood correctly.  Back references mostly aren't
> there.  Greedy operators aren't there.

you probably mean NON-greedy ops.
I was not that concerned about backreferences but about submatch extraction
(according to the terminology used by R. Cox in his article).
R.

PS.: and submatch extraction is there, somehow, undocumented (to my knowledge)




* Re: [9fans] non greedy regular expressions
  2008-10-24 17:41           ` Rudolf Sykora
@ 2008-10-24 18:01             ` Russ Cox
  2008-10-24 19:56               ` Rudolf Sykora
  2008-10-24 18:02             ` John Stalker
  1 sibling, 1 reply; 45+ messages in thread
From: Russ Cox @ 2008-10-24 18:01 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Regarding greedy vs non-greedy etc.

In Plan 9, a regular expression search always looks
for the "leftmost-longest" match, which means
a match that starts as far to the left as possible
in the target text, and of the matches that start there,
the longest one.

In that model, it is not accurate to describe the * + ?
operators as greedy or not.  None of them is working
toward any goal other than the overall longest match
at the leftmost position.

In Perl and its imitators, the match starts at the leftmost
position but is otherwise the first one that is found,
not necessarily the longest.  In that context, words like
"greedy" and "non-greedy" start to make sense,
because the behavior of any one operator influences which
match is encountered first.

Either approach -- leftmost-longest or leftmost-first --
can be implemented using finite automata.

Russ



* Re: [9fans] non greedy regular expressions
  2008-10-24 17:41           ` Rudolf Sykora
  2008-10-24 18:01             ` Russ Cox
@ 2008-10-24 18:02             ` John Stalker
  1 sibling, 0 replies; 45+ messages in thread
From: John Stalker @ 2008-10-24 18:02 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> you probably mean NON-greedy ops.

Yes, my mistake.  I'll risk making a very minor correction to
Rob's post as well:

> Backreferences within the pattern (such as in /(.*)\1/) make the
> matcher non-regular and exponentially hard.

They do change the class of the grammar and nobody knows how to
implement them in subexponential time, but it hasn't been proved
to be impossible.

--
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282




* Re: [9fans] non greedy regular expressions
  2008-10-24 19:56         ` Charles Forsyth
@ 2008-10-24 19:56           ` Rudolf Sykora
  2008-10-26 21:23             ` Rob Pike
  0 siblings, 1 reply; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-24 19:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2008/10/24 Charles Forsyth <forsyth@terzarima.net>:
>>If I try sth. like
>>/( b(.)b)/a/\1\2/
>>on
>>bla blb 56
>>I get
>>bla blb\1\2 56
>>which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')
>
> echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
> bla blb blbl 56
>
> similarly use `s' not `a' in sam.


Yes. But my question was more subtle. I know it can be done with the
's' command in sam now. But I asked something different. The documentation
(the sam paper) says 's' is redundant. But it seems (only seems,
since the documentation says nothing) that submatches only work with the
's' command. If this is true, 's' is not redundant, since if it weren't
there you would have no way to use \1, \2, etc.
That was the reason I wanted to have my example rewritten WITHOUT the
's' command...

But anyway thanks!
Ruda




* Re: [9fans] non greedy regular expressions
  2008-10-24 16:11       ` Rudolf Sykora
                           ` (2 preceding siblings ...)
  2008-10-24 17:10         ` Uriel
@ 2008-10-24 19:56         ` Charles Forsyth
  2008-10-24 19:56           ` Rudolf Sykora
  3 siblings, 1 reply; 45+ messages in thread
From: Charles Forsyth @ 2008-10-24 19:56 UTC (permalink / raw)
  To: 9fans

>If I try sth. like
>/( b(.)b)/a/\1\2/
>on
>bla blb 56
>I get
>bla blb\1\2 56
>which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
bla blb blbl 56

similarly use `s' not `a' in sam.
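The same substitution can be checked with Python's re.sub, where sed's `&` (the whole match) is written `\g<0>`:

```python
import re

# '&' in sed's replacement is the whole match; Python spells it \g<0>.
# \1 is ' blb' (the outer group), \2 is 'l' (the inner group).
out = re.sub(r'( b(.)b)', r'\g<0>\1\2', 'bla blb 56')
print(out)  # -> bla blb blbl 56
```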




* Re: [9fans] non greedy regular expressions
  2008-10-24 18:01             ` Russ Cox
@ 2008-10-24 19:56               ` Rudolf Sykora
  2008-10-24 21:10                 ` Russ Cox
  0 siblings, 1 reply; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-24 19:56 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> In that model, it is not accurate to describe the * + ?
> operators as greedy or not.  None of them is working
> toward any goal other than the overall longest match
> at the leftmost position.

So then I must be mistaken about the terminology.
I thought greedy = leftmost-longest, while non-greedy = leftmost-first:
Having
bla bla (AB)(CDEF)(GH)
/\(.*\)/
matches the whole (AB)(CDEF)(GH) ('greedy'), while if '*' were
'non-greedy', I'd expect a match of just (AB). This does not exist in Plan 9 now.
The example above is easy to solve if I want to go in 'bracket'
steps with
/\([^)]*\)/
but there are harder examples, e.g. where the_interesting_part is
delimited by a more complicated structure ('ABC' here):
ABCthe_interesting_partABC blabla bla ABCthe_interesting_partABC etc.
Now, how to parse it? With a non-greedy '*', no problem. With a 'greedy'
one and no negative lookahead assertion? Ok, maybe 'y' in sam could help
(I don't know). What about
ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA etc.?
All the thinking about this is simply removed with 'non-greedy' ops.
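The two behaviours described here can be reproduced in Python's re, a leftmost-first engine that has both greedy and non-greedy operators:

```python
import re

s = 'bla bla (AB)(CDEF)(GH)'

# Greedy star: the match runs to the last ')'.
assert re.search(r'\(.*\)', s).group() == '(AB)(CDEF)(GH)'
# Negated character class: step bracket by bracket.
assert re.findall(r'\([^)]*\)', s) == ['(AB)', '(CDEF)', '(GH)']
# Non-greedy star (leftmost-first engines only): stop at the first ')'.
assert re.search(r'\(.*?\)', s).group() == '(AB)'
```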

Ruda



> In Perl and its imitators, the match starts at the leftmost
> position but is otherwise the first one that is found,
> not necessarily the longest.  In that context, words like
> "greedy" and "non-greedy" start to make sense,
> because the behavior of any one operator influences which
> match is encountered first.
>
> Either approach -- leftmost-longest or leftmost-first --
> can be implemented using finite automata.
>
> Russ
>
>




* Re: [9fans] non greedy regular expressions
  2008-10-24 19:56               ` Rudolf Sykora
@ 2008-10-24 21:10                 ` Russ Cox
  2008-10-24 21:40                   ` Rudolf Sykora
  0 siblings, 1 reply; 45+ messages in thread
From: Russ Cox @ 2008-10-24 21:10 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> I thought greedy=leftmost-longest, while non-greedy=leftmost-first:

Greedy leftmost-first is different from leftmost-longest.
Search for /a*(ab)?/ in "ab".  The leftmost-longest match
is "ab", but the leftmost-first match (because of the
greedy star) is "a".  In the leftmost-first case, the greediness
of the star caused an overall short match.
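Russ's example is easy to verify in a leftmost-first engine such as Python's re:

```python
import re

# Leftmost-first: the greedy star commits to 'a', then (ab)? finds
# nothing at 'b' and matches empty, so the overall match is just 'a'.
# A leftmost-longest engine would report 'ab' instead.
m = re.match(r'a*(ab)?', 'ab')
print(m.group())  # -> a
```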

> All the thinking about this is simply removed with 'non-greedy' ops.

But it isn't (or shouldn't be).
Using /\(.*?\)/ to match small parenthesized expressions
is fragile: /\(.*?\)/ in "(a(b))" matches "(a(b)".
In contrast, the solution you rejected, /\([^)]*\)/, is more robust.
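Both points can be checked in Python's re (leftmost-first, so it also has the non-greedy operator):

```python
import re

# Non-greedy stops at the first ')', ignoring nesting:
assert re.search(r'\(.*?\)', '(a(b))').group() == '(a(b)'
# A greedy star happily spans unrelated groups:
assert re.search(r'\(.*\)', '(a) and (b)').group() == '(a) and (b)'
# The negated-class version steps group by group:
assert re.findall(r'\([^)]*\)', '(a) and (b)') == ['(a)', '(b)']
```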

It doesn't make sense to shoehorn non-greedy and greedy
operators into an engine that provides leftmost-longest matching.
If you want a different model, you need to use a different program.
Perl has been ported.

Russ



* Re: [9fans] non greedy regular expressions
  2008-10-24 21:10                 ` Russ Cox
@ 2008-10-24 21:40                   ` Rudolf Sykora
  2008-10-24 21:47                     ` erik quanstrom
  0 siblings, 1 reply; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-24 21:40 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Greedy leftmost-first is different from leftmost-longest.
> Search for /a*(ab)?/ in "ab".  The leftmost-longest match
> is "ab", but the leftmost-first match (because of the
> greedy star) is "a".  In the leftmost-first case, the greediness
> of the star caused an overall short match.

Ok, I finally see the point... thanks.
Only one last question: is there any simple way (using existing
regexps) to find 'the interesting parts' in my last example?
ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA ...
...i.e. delimited by ABC on the left and CBA (or e.g. EFG) on the right?
Say I have a file with such a structure and want to get only 'the
interesting parts' from it.

>
>> All the thinking about this is simply removed with 'non-greedy' ops.
>
> But it isn't (or shouldn't be).
well, I admit it isn't :)




* Re: [9fans] non greedy regular expressions
  2008-10-24 21:40                   ` Rudolf Sykora
@ 2008-10-24 21:47                     ` erik quanstrom
  2008-10-24 22:04                       ` Rudolf Sykora
  0 siblings, 1 reply; 45+ messages in thread
From: erik quanstrom @ 2008-10-24 21:47 UTC (permalink / raw)
  To: 9fans

> Ok, I finally see the point... thanks.
> Only one last question: So is there any simple way (using existing
> regexps) to find 'the interesting parts' in my last example?:
> ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA ...
> ...i.e. delimited with ABC from the left and CBA (or e.g. EFG) from the right?
> Say I have a file with such a structure and want to only get 'the
> interesting parts' form it.

doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
maybe i don't understand the example.  if so, could you explain?

- erik




* Re: [9fans] non greedy regular expressions
  2008-10-24 21:47                     ` erik quanstrom
@ 2008-10-24 22:04                       ` Rudolf Sykora
  2008-10-24 22:38                         ` Gabriel Diaz Lopez de la Llave
                                           ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-24 22:04 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
> maybe i don't understand the example.  if so, could you explain?
>
> - erik

I think not.
I have a file say like this

ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
sodifs
sdfsd
ABC
dasdfas aasdfa
njnjn CBA

and I want to get

' asassadfasdf asdfasdf asdfasdf '
'dasdfas aasdfa'
'njnjn'

where I added apostrophes to show the spaces on individual lines. Simply:
give me everything that is between the delimiters (ABC and CBA).

Ruda




* Re: [9fans] non greedy regular expressions
  2008-10-24 22:04                       ` Rudolf Sykora
@ 2008-10-24 22:38                         ` Gabriel Diaz Lopez de la Llave
  2008-10-24 22:54                         ` Charles Forsyth
  2008-10-24 23:52                         ` Tom Simons
  2 siblings, 0 replies; 45+ messages in thread
From: Gabriel Diaz Lopez de la Llave @ 2008-10-24 22:38 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

hello

is using sed and only one regexp mandatory?

cat t.txt | sed 's/(ABC | CBA)/ \n\1\n /g' | awk '/ABC/,/CBA/' | grep -v 'ABC|CBA'

that's a naive and simple approach, but i can't see why you need to
use just one regexp and just one sed. maybe i missed something
through the thread :-?

gabi


On 25/10/2008, at 0:04, Rudolf Sykora wrote:

>> doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
>> maybe i don't understand the example.  if so, could you explain?
>>
>> - erik
>
> I think not.
> I have a file say like this
>
> ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
> sodifs
> sdfsd
> ABC
> dasdfas aasdfa
> njnjn CBA
>
> and I want to get
>
> ' asassadfasdf asdfasdf asdfasdf '
> 'dasdfas aasdfa'
> 'njnjn'
>
> where I added apostrophes to see the spaces on indivial lines. Simply:
> give me everything that is between delimiters (ABC and CBA).
>
> Ruda
>
>





* Re: [9fans] non greedy regular expressions
  2008-10-24 22:04                       ` Rudolf Sykora
  2008-10-24 22:38                         ` Gabriel Diaz Lopez de la Llave
@ 2008-10-24 22:54                         ` Charles Forsyth
  2008-10-24 22:59                           ` Charles Forsyth
  2008-10-24 23:52                         ` Tom Simons
  2 siblings, 1 reply; 45+ messages in thread
From: Charles Forsyth @ 2008-10-24 22:54 UTC (permalink / raw)
  To: 9fans

although it can be satisfying to do things in one instruction (one command)
i confess i often find it quicker to split up the problem in sam or acme:

1. delete everything not between delimiters
	,y/ABC([^C]|C[^B]|CB[^A]|\n)+CBA/d
2. delete the delimiters
	,x/ABC|CBA/d
3. look to decide if i missed a boundary case for my input

it would be easier if the delimiters weren't words, or your sections didn't span lines.
i didn't spend any time deciding whether there was a better way
to express the trailing word delimiter.  it's too late.
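The alternation trick carries over to other engines; a sketch in Python, using the sample input posted earlier in the thread. A plain greedy `+` cannot run past a closing delimiter, because no branch of the alternation matches at the start of 'CBA'. (The `|\n` branch matters in sam, whose negated classes don't match newline; in Python `[^C]` already does, so it is redundant but harmless.)

```python
import re

text = ('ABC asassadfasdf asdfasdf asdfasdf CBA hhh\n'
        'sodifs\nsdfsd\nABC\ndasdfas aasdfa\nnjnjn CBA\n')
# Each branch consumes text that cannot begin the string 'CBA',
# so the repetition stops exactly at the closing delimiter.
parts = re.findall(r'ABC((?:[^C]|C[^B]|CB[^A]|\n)+)CBA', text)
for p in parts:
    print(repr(p))  # the two sections between ABC ... CBA
```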

an example actually using the () [submatch] to extract and exchange two fields on each line:
	,x s/([^,]*),(.*)/\2,\1/




* Re: [9fans] non greedy regular expressions
  2008-10-24 22:54                         ` Charles Forsyth
@ 2008-10-24 22:59                           ` Charles Forsyth
  0 siblings, 0 replies; 45+ messages in thread
From: Charles Forsyth @ 2008-10-24 22:59 UTC (permalink / raw)
  To: 9fans

>i didn't spend any time deciding whether there was a better way
>to express the trailing word delimiter.  it's too late.

gdiaz@9grid.es meanwhile pointed out one way: change the words to otherwise unused special
characters




* Re: [9fans] non greedy regular expressions
  2008-10-24 22:04                       ` Rudolf Sykora
  2008-10-24 22:38                         ` Gabriel Diaz Lopez de la Llave
  2008-10-24 22:54                         ` Charles Forsyth
@ 2008-10-24 23:52                         ` Tom Simons
  2008-10-25 22:35                           ` Rudolf Sykora
  2 siblings, 1 reply; 45+ messages in thread
From: Tom Simons @ 2008-10-24 23:52 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs


Is awk available?  This worked for me, but I'm not on Plan 9.  It does copy
the newline after the 2nd "ABC" (I wasn't sure if leading or all blank lines
should be deleted).
$ cat a.data
 dflkdl dlkrwo3je4ogjmdmxd
ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
sodifs
sdfsd
ABC
dasdfas aasdfa
njnjn CBA
fkpri34ouijglkrlptgf;c
$ awk 'BEGIN {RS = "ABC"; FS = "CBA"}NR == 1 {next}{print $1}' a.data
 asassadfasdf asdfasdf asdfasdf

dasdfas aasdfa
njnjn
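The awk recipe can be mirrored without regexps at all; a sketch in Python on the same sample input (record split on 'ABC', field split on 'CBA'):

```python
text = ('ABC asassadfasdf asdfasdf asdfasdf CBA hhh\n'
        'sodifs\nsdfsd\nABC\ndasdfas aasdfa\nnjnjn CBA\n')
records = text.split('ABC')[1:]               # like RS='ABC'; NR==1 {next}
parts = [r.split('CBA')[0] for r in records]  # like FS='CBA'; print $1
for p in parts:
    print(p)
```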

On Fri, Oct 24, 2008 at 3:04 PM, Rudolf Sykora <rudolf.sykora@gmail.com>wrote:

> > doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
> > maybe i don't understand the example.  if so, could you explain?
> >
> > - erik
>
> I think not.
> I have a file say like this
>
> ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
> sodifs
> sdfsd
> ABC
> dasdfas aasdfa
> njnjn CBA
>
> and I want to get
>
> ' asassadfasdf asdfasdf asdfasdf '
> 'dasdfas aasdfa'
> 'njnjn'
>
> where I added apostrophes to see the spaces on indivial lines. Simply:
> give me everything that is between delimiters (ABC and CBA).
>
> Ruda
>
>



* Re: [9fans] non greedy regular expressions
  2008-10-24 23:52                         ` Tom Simons
@ 2008-10-25 22:35                           ` Rudolf Sykora
  2008-10-25 23:02                             ` Steve Simon
                                               ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-25 22:35 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

2008/10/25 Tom Simons <tom.simons@gmail.com>:
> Is awk available?  This worked for me, but it's not on Plan9.  It does copy
> the newline after the 2nd "ABC" (I wasn't sure if leading or all blank lines
> should be deleted).
> $ awk 'BEGIN {RS = "ABC"; FS = "CBA"}NR == 1 {next}{print $1}' a.data

About that newline: it should copy the newline you describe, since that
one really is between the delimiters. However, this one is also the only
one that should be copied; there shouldn't be any blank line anywhere
in the middle of the resulting output. In this sense your solution
doesn't work.

Your solution ALMOST works on Linux. It turns out not to work on Plan 9 at
all, probably because Plan 9's awk takes only the 1st character of
the RS variable as the record separator.

But what I really wanted to see is how people using Plan 9 can solve
the problem without using a specialized minilanguage like awk. See
what Eric S. Raymond says in his Art of Unix Programming:

http://www.faqs.org/docs/artu/ch08s02.html#awk

Basically he claims that the way this language was designed was
unfortunate, and that the language is in decline. Among the
reasons is that languages like Perl, Python, and Ruby all form a suitable
superset, and that
'Programmers increasingly chose to do awklike things with Perl or
(later) Python, rather than keep two different scripting languages in
their heads'.

I myself may not be competent enough to make the same claim, but at least
from my own experience I have started to like to use as few tools as
possible. Thus I don't want to use awk any longer. I don't like Perl
either (in my opinion it's a bad language). Python is nice for coding,
but somehow not handy for command-line use. Ruby seems superior
to them all. So in my ideal (though not the quickest) world I'd rather get
rid of perl and awk, and use ruby instead whenever anything more
complicated is needed.

Anyway, my main reason for the task was to see if someone could really
come up with a nice solution using exclusively sam (and, better yet,
without that 's' command --- btw. no one so far has answered the
question of possible use of submatch tracking with commands other than
's'; remember, 's' was designated unnecessary).

I wanted to see the thing done in sam/acme without use of awk or
sed. That is Charles Forsyth's solution, which really works:
---
1. delete everything not between delimiters
       ,y/ABC([^C]|C[^B]|CB[^A]|\n)+CBA/d
2. delete the delimiters
       ,x/ABC|CBA/d
3. look to decide if i missed a boundary case for my input
---
I like it. It does exactly what I wanted. And here comes the point
I've been after all the time, from the very beginning. I wanted to
show that the solution has a very ugly part in it, namely
([^C]|C[^B]|CB[^A]|\n)+
whose only purpose is to ensure there is no CBA somewhere in the
middle. Imagine something more complicated as a delimiter. Imagine
I'd like the closing delimiter to be either CBA or EFG (either one
would do). I think you'd soon go mad.
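
For concreteness, the spelled-out pattern can be checked in any
backtracking engine too; a small Python sketch (my own illustration,
with made-up sample text) of how it stops at the nearest CBA while a
naive greedy .* overruns:

```python
import re

text = "ABC one CBA noise ABC two CBA"

# Naive greedy version: .* runs to the LAST CBA, swallowing the middle.
greedy = re.search(r'ABC(.*)CBA', text, re.S).group(0)

# Forsyth's spelled-out alternation: the group can never consume a
# complete "CBA", so each match stops at the nearest closing delimiter.
safe = re.compile(r'ABC(?:[^C]|C[^B]|CB[^A]|\n)+CBA')
matches = [m.group(0) for m in safe.finditer(text)]

print(greedy)   # ABC one CBA noise ABC two CBA
print(matches)  # ['ABC one CBA', 'ABC two CBA']
```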

In python ( http://www.amk.ca/python/howto/regex/), this is easily
solved with a non-greedy operator

/ABC(.*?)CBA/
/ABC(.*?)(CBA|EFG)/
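
A quick sketch of the second pattern in action (sample text invented
for illustration):

```python
import re

text = "ABC one CBA noise ABC two EFG"

# Non-greedy .*? stops at the NEAREST closing delimiter, CBA or EFG.
pat = re.compile(r'ABC(.*?)(CBA|EFG)', re.S)
bodies = [m.group(1) for m in pat.finditer(text)]

print(bodies)  # [' one ', ' two ']
```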

It's true that non-greedy operators don't have a good meaning in Plan 9
(as R. Cox explained), due to its leftmost-longest paradigm. However,
what I conclude from this example is that the leftmost-first kind of
thinking, with two kinds of operators (greedy/non-greedy), can
sometimes be quite useful.

Now. If the leftmost-longest match is usable for my problem, I am fine
with C + regexp(6). If not, I only see the possibility of using
perl/python nowadays (if I don't want to go mad as above). Put it
the other way round: perl/python can hopefully almost always be used;
they solve almost any problem with regexps to our liking. Then
universality wins and we may end up using perl/python exclusively. And
we will (people do) use them in spite of their wrong (i.e. slow;
perhaps horrible -- as some of you said) designs.

My question then is: wouldn't it be better to switch to the
leftmost-first paradigm, hence opening up the possible use of
(non-)greedy operators, and in a way contributing to an accord with
perl/python syntax?  And to use a good algorithm for all that? But
maybe it's not worth it and the current state is just sufficient...

Ruda



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [9fans] non greedy regular expressions
  2008-10-25 22:35                           ` Rudolf Sykora
@ 2008-10-25 23:02                             ` Steve Simon
  2008-10-26  8:57                             ` John Stalker
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 45+ messages in thread
From: Steve Simon @ 2008-10-25 23:02 UTC (permalink / raw)
  To: 9fans

> ...I have started to like to use as few tools as
> possible.

I entirely agree, there is too much to learn and you have to be selective;
however my selection is: sed, awk, C and sam (for impromptu, well, editing).

I cannot really comment directly on greedy operators; I have never
knowingly used a system that provided them, and I have never found myself
needing features that Plan 9's REs don't have.

Perhaps it's a lack of vision on my part.

It is probably possible to visualise problems that would be solved
much more simply on Plan 9 than on Linux/BSD/etc. The bottom line
is that the designers of Plan 9 took a different approach
to REs, and as a result the systems are different.

-Steve




* Re: [9fans] non greedy regular expressions
  2008-10-25 22:35                           ` Rudolf Sykora
  2008-10-25 23:02                             ` Steve Simon
@ 2008-10-26  8:57                             ` John Stalker
  2008-10-26 18:36                               ` Eris Discordia
  2008-10-27  4:55                             ` Russ Cox
  2008-11-30  8:29                             ` Yard Ape
  3 siblings, 1 reply; 45+ messages in thread
From: John Stalker @ 2008-10-26  8:57 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Now. If the leftmost-longest match is usable for my problem, I am fine
> with C + regexp(6). If not, I only see the possibility of using
> perl/python nowadays (if I don't want to go mad as above).

There is another option: yacc.  I'm not saying it's simpler
than perl or python, but it's not much harder and it's a more
general tool.  The resulting code will be fast, if
you care.
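
The parser approach can be sketched without yacc as well; here is a
hypothetical Python scanner for the ABC...CBA/EFG extraction task
discussed earlier in the thread (function name and sample input are my
own, not from the thread):

```python
def extract(text, open_tok="ABC", close_toks=("CBA", "EFG")):
    """Linear scan: return the text between each open_tok and the
    nearest following close token -- no backtracking, no regexp."""
    out, i = [], 0
    while True:
        start = text.find(open_tok, i)
        if start < 0:
            break
        body = start + len(open_tok)
        # positions of each closing delimiter after the opener, if present
        ends = [(text.find(t, body), t) for t in close_toks]
        ends = [(p, t) for p, t in ends if p >= 0]
        if not ends:
            break
        pos, tok = min(ends)          # nearest closing delimiter wins
        out.append(text[body:pos])
        i = pos + len(tok)
    return out

print(extract("ABC one CBA x ABC two EFG"))  # [' one ', ' two ']
```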

> My question then is: wouldn't it be better to switch to the
> leftmost-first paradigm, hence opening up the possible use of
> (non-)greedy operators, and in a way contributing to an accord with
> perl/python syntax?  And to use a good algorithm for all that? But
> maybe it's not worth it and the current state is just sufficient...

Switching semantics in existing tools is rarely a good idea.  Too many
things break.  If anyone is going to do this, and it won't be me, then
it needs to be done in a way that doesn't break anything, like when
UNIX switched from basic to extended regular expressions and added a
-e option to grep.
--
John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282




* Re: [9fans] non greedy regular expressions
  2008-10-26  8:57                             ` John Stalker
@ 2008-10-26 18:36                               ` Eris Discordia
  0 siblings, 0 replies; 45+ messages in thread
From: Eris Discordia @ 2008-10-26 18:36 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

I loved this thread. Thanks everyone. Thanks Rudolf Sykora.




* Re: [9fans] non greedy regular expressions
  2008-10-24 19:56           ` Rudolf Sykora
@ 2008-10-26 21:23             ` Rob Pike
  0 siblings, 0 replies; 45+ messages in thread
From: Rob Pike @ 2008-10-26 21:23 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

The ability to put \1 in the right hand side of a substitution was
done by jason@ at the Uni of Sydney, but after the Sam papers were
published.  It was a welcome feature that added special functionality
to the 's' command within Sam.  (Ed(1) had the feature, within its
limited regexps, long before, of course.)

-rob


On Fri, Oct 24, 2008 at 12:56 PM, Rudolf Sykora <rudolf.sykora@gmail.com> wrote:
> 2008/10/24 Charles Forsyth <forsyth@terzarima.net>:
>>>If I try sth. like
>>>/( b(.)b)/a/\1\2/
>>>on
>>>bla blb 56
>>>I get
>>>bla blb\1\2 56
>>>which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')
>>
>> echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
>> bla blb blbl 56
>>
>> similarly use `s' not `a' in sam.
>
>
> Yes. But my question was more subtle. I know it can be done with the
> 's' command in sam now. But I asked sth. different. The documentation
> (the paper about sam) says 's' is redundant. But it seems (only seems,
> since the documentation says nothing) that submatches only work with
> the 's' command. If this is true, 's' is not redundant, since if it
> weren't there you would have no chance to use \1, \2, etc.
> That was the reason I wanted to have my example rewritten WITHOUT the
> 's' command...
>
> But anyway thanks!
> Ruda
>
>




* Re: [9fans] non greedy regular expressions
  2008-10-25 22:35                           ` Rudolf Sykora
  2008-10-25 23:02                             ` Steve Simon
  2008-10-26  8:57                             ` John Stalker
@ 2008-10-27  4:55                             ` Russ Cox
  2008-10-27  8:28                               ` Rudolf Sykora
  2008-10-27 10:18                               ` Charles Forsyth
  2008-11-30  8:29                             ` Yard Ape
  3 siblings, 2 replies; 45+ messages in thread
From: Russ Cox @ 2008-10-27  4:55 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

As I explained in an earlier post, your suggested

> /ABC(.*?)CBA/

is less robust than Charles's spelled-out version,
since yours doesn't handle nested expressions well.
That's a serious enough limitation that your
scenario stops being a compelling argument for
leftmost-first matching and non-greedy operators.

> Then universality wins and we may end up using perl/python exclusively.

No one said you couldn't.  If they let you do your job faster than
acme and sam do, no one here is going to tell you not to use them.

> My question then is: wouldn't it be better to switch to the
> leftmost-first paradigm, hence opening up the possible use of
> (non-)greedy operators, and in a way contributing to an accord with
> perl/python syntax?  And to use a good algorithm for all that? But
> maybe it's not worth it and the current state is just sufficient...

Leftmost-first matching is difficult to explain.
When POSIX had to define the rules for regexps,
they chose to define them as leftmost-longest
even though all the implementations were leftmost-first,
because describing leftmost-first precisely was too
complex.

Leftmost-first matching is not widely understood.
At the beginning of this conversation, you believed that in the
absence of non-greedy operators, leftmost-first matching
produced leftmost-longest matches.  I would guess that
more people believe this than know otherwise.
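
A tiny example makes the distinction concrete: in a leftmost-first
engine such as Python's, alternatives are tried in order, so a|ab
matches only "a" in "ab", where a POSIX leftmost-longest engine must
report "ab":

```python
import re

# Leftmost-first: the first alternative that lets the whole pattern
# succeed wins, even when a later alternative would match more text.
m = re.match(r'a|ab', 'ab')
print(m.group(0))  # 'a'  (a POSIX leftmost-longest engine reports 'ab')
```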

Leftmost-first matching only lets you write a regexp that
is half right (see above) in the use case that you are
focusing on.

I don't see the benefit of switching from leftmost-longest
to leftmost-first.  Certainly an "accord with perl/python"
is not good justification.  Playing catch-up with Perl and
Python regexps can only lead to madness.  Both systems
are complex enough that essentially no one completely
understands them.

Russ



* Re: [9fans] non greedy regular expressions
  2008-10-27  4:55                             ` Russ Cox
@ 2008-10-27  8:28                               ` Rudolf Sykora
  2008-10-27 10:18                               ` Charles Forsyth
  1 sibling, 0 replies; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-27  8:28 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Leftmost-first matching is difficult to explain.
> When POSIX had to define the rules for regexps,
> they chose to define them as leftmost-longest
> even though all the implementations were leftmost-first,
> because describing leftmost-first precisely was too
> complex.
>
> Leftmost-first matching is not widely understood.
> At the beginning of this conversation, you believed that in the
> absence of non-greedy operators, leftmost-first matching
> produced leftmost-longest matches.  I would guess that
> more people believe this than know otherwise.

Yes. I wasn't quite acquainted with the problem. Probably
because I have never bumped into trouble with a leftmost-first
implementation. Perhaps my problems were just not complex enough to
notice that a regexp found sth. different from my expectation. That
seems to be the case for the majority of people.
(E.g. Vim seems to use leftmost-first, too.)

Ok. I am just somehow tired of this now. I thank everybody who
contributed positively to this thread. I'll wait some time and will
try to use what is available, and I will see.

Ruda




* Re: [9fans] non greedy regular expressions
  2008-10-27  4:55                             ` Russ Cox
  2008-10-27  8:28                               ` Rudolf Sykora
@ 2008-10-27 10:18                               ` Charles Forsyth
  2008-10-27 13:13                                 ` Eris Discordia
  1 sibling, 1 reply; 45+ messages in thread
From: Charles Forsyth @ 2008-10-27 10:18 UTC (permalink / raw)
  To: 9fans

>Both systems are complex enough that essentially no one completely understands them.

this touches on an important point. the first introduction of regular expressions
to editors was great, because it took some formal language theory and made it useful
in an `every day' way. conversely, the theory provided significant underpinning (in a structural
sense) to that practical application.  now there are big books on `regular expressions'
mainly because they are no longer regular but a big collection of ad-hoc additions.
(i think there is a similarity to Perl's relationship to the history of programming languages.)
the result really is hard to reason about ... because it hasn't got that underpinning and the algebra has gone.
it's worse than that: the code is a mess too, i suppose in consequence.




* Re: [9fans] non greedy regular expressions
  2008-10-27 10:18                               ` Charles Forsyth
@ 2008-10-27 13:13                                 ` Eris Discordia
  2008-10-27 13:23                                   ` erik quanstrom
  2008-10-27 16:13                                   ` Brian L. Stuart
  0 siblings, 2 replies; 45+ messages in thread
From: Eris Discordia @ 2008-10-27 13:13 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> practical application.  now there are big books on `regular expressions'
> mainly because they are no longer regular but a big collection of ad-hoc

I thought they were "regular" because they "regularly" occurred in the 
target text. Turns out other interpretations are possible. Though, mine has 
the advantage of being correct :-D

The set of "big books on regular expressions" includes Jeffrey Friedl's 
"Mastering Regular Expressions" that happens to contain a chapter by the 
title "NFA, DFA, and POSIX" wherein he says:

> DFA Speed with NFA Capabilities: Regex Nirvana?
>
> I've said several times that a DFA can't provide capturing parentheses or
> backreferences. This is quite true, but it certainly doesn't preclude
> hybrid approaches that mix technologies in an attempt to reach regex
> nirvana. The sidebar in Section 4.6.3.1 told how NFAs have diverged from
> the theoretical straight and narrow in search of more power, and it's
> only natural that the same happens with DFAs. A DFA's construction makes
> it more difficult, but that doesn't mean impossible.
>
> GNU grep takes a simple but effective approach. It uses a DFA when
> possible, reverting to an NFA when backreferences are used. GNU awk does
> something similar—it uses GNU grep's fast shortest-leftmost DFA engine
> for simple "does it match" checks, and reverts to a different engine for
> checks where the actual extent of the match must be known. Since that
> other engine is an NFA, GNU awk can conveniently offer capturing
> parentheses, and it does via its special gensub function.
>
> Tcl's regex engine is a true hybrid, custom built by Henry Spencer (whom
> you may remember having played an important part in the early development
> and popularization of regular expressions see Section 3.1.1.6). The Tcl
> engine sometimes appears to be an NFA— it has lookaround, capturing
> parentheses, backreferences, and lazy quantifiers. Yet, it has true POSIX
> longest-leftmost match (see Section 4.6.1), and doesn't suffer from some
> of the NFA problems that we'll see in Chapter 6. It really seems quite
> wonderful.
>
> Currently, this engine is available only to Tcl, but Henry tells me that
> it's on his to-do list to break it out into a separate package that can
> be used by others.

Again, turns out the "big books on regular expressions" can give the 
lowlife--that's me--things "hackers" deny them.

--On Monday, October 27, 2008 10:18 AM +0000 Charles Forsyth 
<forsyth@terzarima.net> wrote:

>> Both systems are complex enough that essentially no one completely
>> understands them.
>
> this touches on an important point. the first introduction of regular
> expressions to editors was great, because it took some formal language
> theory and made it useful in an `every day' way. conversely, the theory
> provided significant underpinning (in a structural sense) to that
> practical application.  now there are big books on `regular expressions'
> mainly because they are no longer regular but a big collection of ad-hoc
> additions. (i think there is a similarity to Perl's relationship to the
> history of programming languages.) the result really is hard to reason
> about ... because it hasn't got that underpinning and the algebra has
> gone. it's worse than that: the code is a mess too, i supposed in
> consequence.
>








* Re: [9fans] non greedy regular expressions
  2008-10-27 13:13                                 ` Eris Discordia
@ 2008-10-27 13:23                                   ` erik quanstrom
  2008-10-27 19:42                                     ` Eris Discordia
  2008-10-27 16:13                                   ` Brian L. Stuart
  1 sibling, 1 reply; 45+ messages in thread
From: erik quanstrom @ 2008-10-27 13:23 UTC (permalink / raw)
  To: 9fans

>> practical application.  now there are big books on `regular expressions'
>> mainly because they are no longer regular but a big collection of ad-hoc
>
> I thought they were "regular" because they "regularly" occurred in the
> target text. Turns out other interpretations are possible. Though, mine has
> the advantage of being correct :-D

there's a reason they're not called regularly expressions.

(if this were the definition, an expression's regularlyness would
depend on the target text, would it not?)

- erik





* Re: [9fans] non greedy regular expressions
  2008-10-27 13:13                                 ` Eris Discordia
  2008-10-27 13:23                                   ` erik quanstrom
@ 2008-10-27 16:13                                   ` Brian L. Stuart
  1 sibling, 0 replies; 45+ messages in thread
From: Brian L. Stuart @ 2008-10-27 16:13 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> The set of "big books on regular expressions" includes Jeffrey Friedl's
> "Mastering Regular Expressions" that happens to contain a chapter by the
> title "NFA, DFA, and POSIX" wherein he says:
>
> > DFA Speed with NFA Capabilities: Regex Nirvana?

This guy seems to blur the distinctions here.  His discussion
makes it sound like he's saying that NFAs have more expressive
power than DFAs.  This is incorrect.  Both NFAs and DFAs have
exactly the same expressive power as the class of grammars
called regular.  For the arbitrary case of nesting (e.g. parens),
these machines are insufficient.  However, for any prescribed
maximum nesting level, you can write a regular expression to
account for it, though it becomes clumsy.
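
That clumsiness is easy to demonstrate; a sketch (mine, not from the
post) of a regexp accounting for parenthesis nesting up to depth two,
where each additional level requires pasting in another copy of the
inner group:

```python
import re

# Depth <= 2: the outer parens may contain plain characters or one
# level of inner parens (which may contain only plain characters).
depth2 = re.compile(r'\((?:[^()]|\([^()]*\))*\)')

print(bool(depth2.fullmatch('(a(b)(c)d)')))  # True
print(bool(depth2.fullmatch('((a))')))       # True
print(bool(depth2.fullmatch('(((a)))')))     # False -- depth 3
```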

To get more expressiveness, you need to go to a machine
with more functionality.  Classically, the next step
up the hierarchy is the pushdown automaton.  The expressive
power of this machine corresponds directly to the context-
free grammars.  Because the full generality of the CFG
require nondeterminism, automated translations from CFG
to code/machine are usually done with restricted classes
of CFGs, such as LR(k) and LALR.  You can also increase
the power of a FA by adding a counter or by making the
transitions probabilistic.  If you truly want to build
expression matching mechanisms that go beyond regular,
building on the FA with counter(s) would be a far more
sound foundation than a lot of the ad hoc stuff that's
been done.  But the truth is that whipping up a CFG
and feeding it to yacc is far more expedient than torturing
regular expressions all day.

> Again, turns out the "big books on regular expressions" can give the
> lowlife--that's me--things "hackers" deny them.

And a good book on automata theory can give you far more
than any "big book of...", "... in 21 days" or "... for
dummies" book can.  Besides, why would you think that
anyone is denying you knowledge or the opportunity
to write code that demonstrates your ideas?

BLS





* Re: [9fans] non greedy regular expressions
  2008-10-27 13:23                                   ` erik quanstrom
@ 2008-10-27 19:42                                     ` Eris Discordia
  0 siblings, 0 replies; 45+ messages in thread
From: Eris Discordia @ 2008-10-27 19:42 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> there's a reason they're not called regularly expressions.

As explained in the post by Brian L. Stuart it's a matter of "grammar" :-P

> (if this were the definition, an expression's regularlyness would
> depend on the target text, would it not?)

Yes, and that _would_ be why you wouldn't craft a regex for matching in a
text with a, say, single-digit frequency of some string.

--On Monday, October 27, 2008 9:23 AM -0400 erik quanstrom
<quanstro@quanstro.net> wrote:

>>> practical application.  now there are big books on `regular expressions'
>>> mainly because they are no longer regular but a big collection of ad-hoc
>>
>> I thought they were "regular" because they "regularly" occurred in the
>> target text. Turns out other interpretations are possible. Though, mine
>> has  the advantage of being correct :-D
>
> there's a reason they're not called regularly expressions.
>
> (if this were the definition, an expression's regularlyness would
> depend on the target text, would it not?)
>
> - erik
>
>





* Re: [9fans] non greedy regular expressions
  2008-10-25 22:35                           ` Rudolf Sykora
                                               ` (2 preceding siblings ...)
  2008-10-27  4:55                             ` Russ Cox
@ 2008-11-30  8:29                             ` Yard Ape
  2008-12-11 16:32                               ` Rudolf Sykora
  3 siblings, 1 reply; 45+ messages in thread
From: Yard Ape @ 2008-11-30  8:29 UTC (permalink / raw)
  To: 9fans

"Rudolf Sykora" wrote:

> I have a file say like this
>
> ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
> sodifs
> sdfsd
> ABC
> dasdfas aasdfa
> njnjn CBA
>
> and I want to get
>
> ' asassadfasdf asdfasdf asdfasdf '
> 'dasdfas aasdfa'
> 'njnjn'
>
> ...i.e. delimited with ABC from the left and CBA (or
> e.g. EFG) from the right?  Say I have a file with such a
> structure and want to only get 'the interesting parts'
> from it.
>
> ...using exclusively sam (and more, possibly without that
> 's' command
>

One of the things I love about ed is how easy it
is to loop through *ranges* which can be delimited
by regexes, like so:

   ,g/pre-interesting/.+,/post-interesting/-\
   s/this/that/g

The same method can be used in sam:

   ,x/pre-interesting/+#0;/post-interesting/-#0{
    x/this/c/that/
   }

In cases where the delimiters don't alternate
perfectly (as in nesting), you can just start adding
conditionals and/or subloops:

   ,x/pre-interesting/.+#0;/post-interesting/-#0{
    v/pre-interesting/{
     x/this/c/that/
    }
   }

You can operate on the inverse of the interesting
range by employing a marker in the loop to maintain a
"trailing" range throughout:

   0k
   ,x/pre-interesting/.+#0;/post-interesting/-#0{
    v/pre-interesting/{
    ',.-#0{
      x/this/c/that/
      }
     .+#0k
     }
    }

etc., etc.

Here's a version that works on your example:

   0k
   ,x/ABC/+#0;/CBA|EFG/{
    ',.-#0d
    .+#0??d
    .+#0k
   }

-Derek





* Re: [9fans] non greedy regular expressions
  2008-11-30  8:29                             ` Yard Ape
@ 2008-12-11 16:32                               ` Rudolf Sykora
  0 siblings, 0 replies; 45+ messages in thread
From: Rudolf Sykora @ 2008-12-11 16:32 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> Here's a version that works on your example:
>
>   0k
>   ,x/ABC/+#0;/CBA|EFG/{
>    ',.-#0d
>    .+#0??d
>    .+#0k
>   }
>
> -Derek


Thanks. This is what I wanted to see...
Ruda




* Re: [9fans] non greedy regular expressions
  2008-10-28 14:51 ` Brian L. Stuart
@ 2008-10-28 15:07   ` Eris Discordia
  0 siblings, 0 replies; 45+ messages in thread
From: Eris Discordia @ 2008-10-28 15:07 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Thanks for the explanations. The lowlife learns a bit or two :-)

--On Tuesday, October 28, 2008 2:51 PM +0000 "Brian L. Stuart"
<blstuart@bellsouth.net> wrote:

>> > This guy seems to blur the distinctions here.  His discussion
>>
>> He doesn't. If one reads the whole section part of which was quoted one
>> will see that he clearly states DFA and NFA are theoretically
>> equivalent,  but then goes on to explain that DFA and NFA
>> _implementations_ are not  identical.
>
> Actually, that's why I said "seems to."  Basically, what he
> should be saying is that some people have found it easier
> to add elements to the NFA model that augment the grammar.
> But then he goes on to use the initials NFA when he really
> means the extensions thereof.  This creates in the reader's
> mind the misimpression that there's something magically
> different between an NFA and a DFA, theoretically or
> implementationally.  They have exactly the same expressive
> power in both realms.  Besides, nondeterminism can only
> be simulated when running on a deterministic computer.
> That's actually the insight behind the NFA to DFA construction.
> So talking about an "NFA implementation" is rather artificial.
>
>> They learn slowly, hardly, painfully--they aren't smart. If
>> possible they'll learn less rather than learn more. What the "hacker"
>> denies the lowlife is the opportunity to exist free of "GNU-is-wrong" or
>> "X-is-wrong" blemish.
>
> Look at it this way.  The people here aren't trying to
> create blemishes on anything.  Rather we are trying to
> help your learning process.  We're trying to head off
> the tendency for people to jump from "it was easy to
> learn how to do X in this tool" to "how this tool does
> things is better."  Instead, before asserting something
> is better or asking for others to write software according
> to a personal preference, it is important to broaden
> your understanding so that you know what the options
> really are and what the design decisions really imply.
> Implementing a laundry list of features without that
> perspective simply produces bad software.
>
>> It's
>> good--for the lowlife, of course--to know the wonders they see didn't
>> spring into existence out of the blue.
>
> That's why we teach the theory that everyone seems to
> want to complain about.  Building on Newton, if we want
> to see farther than those before us, we need to know
> those before us well enough to climb onto their shoulders.
> No software is based on Hogwarts' technology, and no
> good software comes from the "random walk method of
> programming."  It is the product of intellectual reasoning.
> Criticism of the result won't be taken too seriously
> if the critic shows they have not understood the reasoning
> behind it.
>
> BLS
>
>




* Re: [9fans] non greedy regular expressions
  2008-10-27 21:08 Aharon Robbins
@ 2008-10-28 14:53 ` Eris Discordia
  0 siblings, 0 replies; 45+ messages in thread
From: Eris Discordia @ 2008-10-28 14:53 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> It is merely the traditional POSIX flavor.  Some people like that
> flavor, some don't.

Understandable.

> It is more that Perl simply was never part of the picture for the people
> who develop(ed) and use(d) Plan 9.  It's like asking why the paper on
> the Plan 9 C compiler doesn't state that C++ classes are not available.

Very reasonable.

> if you really want Perl, you know where to get it.

We all do. We all did [get it].

> Me, I'm pretty happy with the traditional shell + sed + grep + awk
> combinations, but then again, I'm biased, particularly towards awk.
> :-)

Yeah, I've seen your (g)awk activities on other lists. Good luck, and
thanks.

--On Monday, October 27, 2008 11:08 PM +0200 Aharon Robbins
<arnold@skeeve.com> wrote:

>> > As other mails have pointed out, anything that isn't leftmost longest
>> > has weird semantics.  Non-greedy operators are mostly syntactic sugar.
>>
>> Is (leftmost-longest + all-greedy operators) syntactic salt then?
>
> It is merely the traditional POSIX flavor.  Some people like that
> flavor, some don't.
>
>> > Not in the least. The Plan 9 regexp library in fact gives you close to
>> > the same nirvana; an automata that has DFA speed characteristics with
>> > the NFA's ability to capture sub texts.
>>
>> Does regexp(n) also give the lowlife any hint of why it should behave
>> differently from Perl? Friedl's book doesn't, but it has good reason.
>
> It is more that Perl simply was never part of the picture for the people
> who develop(ed) and use(d) Plan 9.  It's like asking why the paper on
> the Plan 9 C compiler doesn't state that C++ classes are not available.
>
> The long-time users of Plan 9 were using Unix before Perl even came along;
> for reasons having to do with both taste and the theoretical soundness,
> they saw no reason to try to support the Perl features.  After all,
> if you really want Perl, you know where to get it.
>
> Me, I'm pretty happy with the traditional shell + sed + grep + awk
> combinations, but then again, I'm biased, particularly towards awk.
> :-)
>
> Arnold
>




* Re: [9fans] non greedy regular expressions
  2008-10-27 20:00 Eris Discordia
@ 2008-10-28 14:51 ` Brian L. Stuart
  2008-10-28 15:07   ` Eris Discordia
  0 siblings, 1 reply; 45+ messages in thread
From: Brian L. Stuart @ 2008-10-28 14:51 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> > This guy seems to blur the distinctions here.  His discussion
>
> He doesn't. If one reads the whole section part of which was quoted one
> will see that he clearly states DFA and NFA are theoretically equivalent,
> but then goes on to explain that DFA and NFA _implementations_ are not
> identical.

Actually, that's why I said "seems to."  Basically, what he
should be saying is that some people have found it easier
to add elements to the NFA model that augment the grammar.
But then he goes on to use the initials NFA when he really
means the extensions thereof.  This creates in the reader's
mind the misimpression that there's something magically
different between an NFA and a DFA, theoretically or
implementationally.  They have exactly the same expressive
power in both realms.  Besides, nondeterminism can only
be simulated when running on a deterministic computer.
That's actually the insight behind the NFA to DFA construction.
So talking about an "NFA implementation" is rather artificial.
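
That simulation is small enough to write down; a sketch (with a
made-up example machine, not from the post) of running an NFA for
(a|b)*abb by carrying the set of live states, which is exactly the
set the subset construction precomputes:

```python
# NFA for (a|b)*abb: state 0 loops on a/b; 0 -a-> 1, 1 -b-> 2,
# 2 -b-> 3.  State 3 is accepting.
delta = {
    (0, 'a'): {0, 1}, (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},
}

def accepts(s):
    """Deterministically simulate the NFA: after each character,
    track the set of states the machine could be in."""
    states = {0}
    for ch in s:
        states = set().union(*(delta.get((q, ch), set()) for q in states))
    return 3 in states

print(accepts('aababb'))  # True
print(accepts('abab'))    # False
```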

> They learn slowly, hardly, painfully--they aren't smart. If
> possible they'll learn less rather than learn more. What the "hacker"
> denies the lowlife is the opportunity to exist free of "GNU-is-wrong" or
> "X-is-wrong" blemish.

Look at it this way.  The people here aren't trying to
create blemishes on anything.  Rather we are trying to
help your learning process.  We're trying to head off
the tendency for people to jump from "it was easy to
learn how to do X in this tool" to "how this tool does
things is better."  Instead, before asserting something
is better or asking for others to write software according
to a personal preference, it is important to broaden
your understanding so that you know what the options
really are and what the design decisions really imply.
Implementing a laundry list of features without that
perspective simply produces bad software.

> It's
> good--for the lowlife, of course--to know the wonders they see didn't
> spring into existence out of the blue.

That's why we teach the theory that everyone seems to
want to complain about.  Building on Newton, if we want
to see farther than those before us, we need to know
those before us well enough to climb onto their shoulders.
No software is based on Hogwarts' technology, and no
good software comes from the "random walk method of
programming."  It is the product of intellectual reasoning.
Criticism of the result won't be taken too seriously
if the critic shows they have not understood the reasoning
behind it.

BLS





* Re: [9fans] non greedy regular expressions
@ 2008-10-27 21:08 Aharon Robbins
  2008-10-28 14:53 ` Eris Discordia
  0 siblings, 1 reply; 45+ messages in thread
From: Aharon Robbins @ 2008-10-27 21:08 UTC (permalink / raw)
  To: 9fans

> > As other mails have pointed out, anything that isn't leftmost longest
> > has weird semantics.  Non-greedy operators are mostly syntactic sugar.
>
> Is (leftmost-longest + all-greedy operators) syntactic salt then?

It is merely the traditional POSIX flavor.  Some people like that
flavor, some don't.

> > Not in the least. The Plan 9 regexp library in fact gives you close to
> > the same nirvana: an automaton that has DFA speed characteristics with
> > the NFA's ability to capture subtexts.
>
> Does regexp(n) also give the lowlife any hint of why it should behave
> differently from Perl? Friedl's book doesn't, but it has good reason.

It is more that Perl simply was never part of the picture for the people
who develop(ed) and use(d) Plan 9.  It's like asking why the paper on
the Plan 9 C compiler doesn't state that C++ classes are not available.

The long-time users of Plan 9 were using Unix before Perl even came along;
for reasons having to do with both taste and the theoretical soundness,
they saw no reason to try to support the Perl features.  After all,
if you really want Perl, you know where to get it.

Me, I'm pretty happy with the traditional shell + sed + grep + awk
combinations, but then again, I'm biased, particularly towards awk.
:-)

Arnold




* Re: [9fans] non greedy regular expressions
  2008-10-27 19:23 Aharon Robbins
@ 2008-10-27 20:15 ` Eris Discordia
  0 siblings, 0 replies; 45+ messages in thread
From: Eris Discordia @ 2008-10-27 20:15 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

> As other mails have pointed out, anything that isn't leftmost longest
> has weird semantics.  Non-greedy operators are mostly syntactic sugar.

Is (leftmost-longest + all-greedy operators) syntactic salt then?

> Not in the least. The Plan 9 regexp library in fact gives you close to
> the same nirvana: an automaton that has DFA speed characteristics with
> the NFA's ability to capture subtexts.

Does regexp(n) also give the lowlife any hint of why it should behave
differently from Perl? Friedl's book doesn't, but it has good reason.

> Friedl's book is good for what it aims to be: an introduction to
> regular expressions.  But scientifically rigid (as in, on the same
> order as the dragon book) it's definitely not.

That's good. No problem with "big books on regular expressions," then.
Introductory ones, or scientific ones.

--On Monday, October 27, 2008 9:23 PM +0200 Aharon Robbins
<arnold@skeeve.com> wrote:

>> > GNU grep takes a simple but effective approach. It uses a DFA when
>> > possible, reverting to an NFA when backreferences are used. GNU awk
>> > does something similar---it uses GNU grep's fast shortest-leftmost DFA
>> > engine for simple "does it match" checks, and reverts to a different
>> > engine for checks where the actual extent of the match must be known.
>> > Since that other engine is an NFA, GNU awk can conveniently offer
>> > capturing parentheses, and it does via its special gensub function.
>
> I'll be the first to admit that gawk's behavior is something of a hack.
> I'd much rather be able to use a single matcher. (At least I was able to
> get Friedl to describe gawk's implementation correctly. :-)
>
>> > Tcl's regex engine is a true hybrid, custom built by Henry Spencer ....
>> >
>> > Currently, this engine is available only to Tcl, but Henry tells me
>> > that it's on his to-do list to break it out into a separate package
>> > that can be used by others.
>
> It's been on that TODO list for well over a decade, IIRC. More vaporware
> as far as a broader community is concerned.
>
>> Again, turns out the "big books on regular expressions" can give the
>> lowlife--that's me--things "hackers" deny them.
>
> Not in the least. The Plan 9 regexp library in fact gives you close to
> the same nirvana: an automaton that has DFA speed characteristics with
> the NFA's ability to capture subtexts.
>
> As other mails have pointed out, anything that isn't leftmost longest
> has weird semantics.  Non-greedy operators are mostly syntactic sugar.
> I'll agree that in practice they're likely to be useful, but as I'm pretty
> much an awk guy (<grin>) I've never felt the pain of their lack, either.
>
> Gawk is stuck with its current two-matcher approach since it provides
> the option for different syntaxes, but that approach also has lots of
> problems; more than once I've come across cases where they interpret the
> same syntax differently, or don't handle multibyte characters the same.
> (Not to mention that the syntax bits API in the GNU matchers is
> a real nightmare. Another thing that's too late to change.)
>
> If I could start over again, I'd start with either the plan 9 lib or
> ripping out Henry's library from tcl, but the former is probably easier.
>
> Friedl's book is good for what it aims to be: an introduction to
> regular expressions.  But scientifically rigid (as in, on the same
> order as the dragon book) it's definitely not.
>
> Russ's paper describes the state of the world pretty well.
>
> Arnold
>




* Re: [9fans] non greedy regular expressions
@ 2008-10-27 20:00 Eris Discordia
  2008-10-28 14:51 ` Brian L. Stuart
  0 siblings, 1 reply; 45+ messages in thread
From: Eris Discordia @ 2008-10-27 20:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

First of all, thanks for the explanation. It's above my head, but thanks
anyway.

> This guy seems to blur the distinctions here.  His discussion

He doesn't. If one reads the whole section, part of which was quoted, one
will see that he clearly states that DFA and NFA are theoretically equivalent,
but then goes on to explain that DFA and NFA _implementations_ are not
identical.

> The true mathematical and computational meaning of "NFA" is different
> from what is commonly called an "NFA regex engine." In theory, NFA and
> DFA engines should match exactly the same text and have exactly the same
> features. In practice, the desire for richer, more expressive regular
> expressions has caused their semantics to diverge. An example is the
> support for backreferences.
>
> The design of a DFA engine precludes backreferences, but it's a
> relatively small task to add backreference support to a true
> (mathematically speaking) NFA engine. In doing so, you create a more
> powerful tool, but you also make it decidedly nonregular (mathematically
> speaking). What does this mean? At most, that you should probably stop
> calling it an NFA, and start using the phrase "nonregular expressions,"
> since that describes (mathematically speaking) the new situation. No one
> has actually done this, so the name "NFA" has lingered, even though the
> implementation is no longer (mathematically speaking) an NFA.

-- from the same book
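
The divergence Friedl describes is easy to make concrete. A minimal sketch in Python, whose `re` module is one of the Perl-derived "NFA engines" in his sense: a backreference requires the engine to remember the exact text a group matched, which no DFA, and no mathematically regular language, can do.

```python
import re

# \1 must repeat exactly what group 1 matched; the language
# { a^n b a^n } is not regular, yet this engine accepts it.
print(bool(re.fullmatch(r"(a+)b\1", "aabaa")))   # True:  group 1 = "aa"
print(bool(re.fullmatch(r"(a+)b\1", "aabaaa")))  # False: no split of the a's works
```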

> And a good book on automata theory can give you far more
> than any "big book of...", "... in 21 days" or "... for
> dummies" book can.  Besides, why would you think that
> anyone is denying you knowledge or the opportunity
> to write code that demonstrates your ideas?

Because lowlifes don't write code. No need for somebody to stop them from
doing so. They learn slowly, hardly, painfully--they aren't smart. If
possible they'll learn less rather than learn more. What the "hacker"
denies the lowlife is the opportunity to exist free of "GNU-is-wrong" or
"X-is-wrong" blemish. GNU may be wrong, or right, but GNU is learnt fast
and easy. And it does the job.

By the way, Friedl's book has the advantage of giving a lowlife a glimpse
of a subject otherwise arcane from that same lowlife's point of view. It's
good--for the lowlife, of course--to know the wonders they see didn't
spring into existence out of the blue. I benefitted from that learning.

--On Monday, October 27, 2008 4:13 PM +0000 "Brian L. Stuart"
<blstuart@bellsouth.net> wrote:

>> The set of "big books on regular expressions" includes Jeffrey Friedl's
>> "Mastering Regular Expressions" that happens to contain a chapter by the
>> title "NFA, DFA, and POSIX" wherein he says:
>>
>> > DFA Speed with NFA Capabilities: Regex Nirvana?
>
> This guy seems to blur the distinctions here.  His discussion
> makes it sound like he's saying that NFAs have more expressive
> power than DFAs.  This is incorrect.  Both NFAs and DFAs have
> exactly the same expressive power as the class of grammars
> called regular.  For the arbitrary case of nesting (e.g. parens),
> these machines are insufficient.  However, for any prescribed
> maximum nesting level, you can write a regular expression to
> account for it, though it becomes clumsy.
>
> To get more expressiveness, you need to go to a machine
> with more functionality.  Classically, the next step
> up the hierarchy is the pushdown automaton.  The expressive
> power of this machine corresponds directly to the context-
> free grammars.  Because the full generality of the CFG
> requires nondeterminism, automated translations from CFG
> to code/machine are usually done with restricted classes
> of CFGs, such as LR(k) and LALR.  You can also increase
> the power of a FA by adding a counter or by making the
> transitions probabilistic.  If you truly want to build
> expression matching mechanisms that go beyond regular,
> building on the FA with counter(s) would be a far more
> sound foundation than a lot of the ad hoc stuff that's
> been done.  But the truth is that whipping up a CFG
> and feeding it to yacc is far more expedient than torturing
> regular expressions all day.
>
>> Again, turns out the "big books on regular expressions" can give the
>> lowlife--that's me--things "hackers" deny them.
>
> And a good book on automata theory can give you far more
> than any "big book of...", "... in 21 days" or "... for
> dummies" book can.  Besides, why would you think that
> anyone is denying you knowledge or the opportunity
> to write code that demonstrates your ideas?
>
> BLS
>
>




* Re: [9fans] non greedy regular expressions
@ 2008-10-27 19:23 Aharon Robbins
  2008-10-27 20:15 ` Eris Discordia
  0 siblings, 1 reply; 45+ messages in thread
From: Aharon Robbins @ 2008-10-27 19:23 UTC (permalink / raw)
  To: 9fans

> > GNU grep takes a simple but effective approach. It uses a DFA when
> > possible, reverting to an NFA when backreferences are used. GNU awk does
> > something similar---it uses GNU grep's fast shortest-leftmost DFA engine
> > for simple "does it match" checks, and reverts to a different engine for
> > checks where the actual extent of the match must be known. Since that
> > other engine is an NFA, GNU awk can conveniently offer capturing
> > parentheses, and it does via its special gensub function.

I'll be the first to admit that gawk's behavior is something of a hack.
I'd much rather be able to use a single matcher. (At least I was able to
get Friedl to describe gawk's implementation correctly. :-)

> > Tcl's regex engine is a true hybrid, custom built by Henry Spencer ....
> >
> > Currently, this engine is available only to Tcl, but Henry tells me that
> > it's on his to-do list to break it out into a separate package that can
> > be used by others.

It's been on that TODO list for well over a decade, IIRC. More vaporware
as far as a broader community is concerned.

> Again, turns out the "big books on regular expressions" can give the
> lowlife--that's me--things "hackers" deny them.

Not in the least. The Plan 9 regexp library in fact gives you close to
the same nirvana: an automaton that has DFA speed characteristics with
the NFA's ability to capture subtexts.

As other mails have pointed out, anything that isn't leftmost longest
has weird semantics.  Non-greedy operators are mostly syntactic sugar.
I'll agree that in practice they're likely to be useful, but as I'm pretty
much an awk guy (<grin>) I've never felt the pain of their lack, either.
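
The semantic split is visible with a one-line alternation. In a Perl-derived, leftmost-first engine (Python's `re` here as a stand-in), the first branch that succeeds wins; a POSIX leftmost-longest matcher, like the one in grep or awk, would report the longer match instead.

```python
import re

# Leftmost-first: the alternation commits to its first workable branch.
print(re.match(r"a|ab", "ab").group())  # "a"; POSIX leftmost-longest gives "ab"
```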

Gawk is stuck with its current two-matcher approach since it provides
the option for different syntaxes, but that approach also has lots of
problems; more than once I've come across cases where they interpret the
same syntax differently, or don't handle multibyte characters the same.
(Not to mention that the syntax bits API in the GNU matchers is
a real nightmare. Another thing that's too late to change.)

If I could start over again, I'd start with either the plan 9 lib or
ripping out Henry's library from tcl, but the former is probably easier.

Friedl's book is good for what it aims to be: an introduction to
regular expressions.  But scientifically rigid (as in, on the same
order as the dragon book) it's definitely not.

Russ's paper describes the state of the world pretty well.

Arnold




* Re: [9fans] non greedy regular expressions
@ 2008-10-24 11:27 Aharon Robbins
  0 siblings, 0 replies; 45+ messages in thread
From: Aharon Robbins @ 2008-10-24 11:27 UTC (permalink / raw)
  To: 9fans, rudolf.sykora

You are not missing anything.

Subexpression matching means when you have an expression like

	q(a+b)(c*d)z

that you can get access to the exact text matched by the two
parenthesized subexpressions.
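
In Python's `re`, used here purely as a stand-in (the Plan 9 library reports comparable start/end offsets through the Resub array filled in by regexec, per regexp(2)), extraction for the pattern above looks like:

```python
import re

m = re.match(r"q(a+b)(c*d)z", "qaabccdz")
print(m.group(1))  # "aab" -- text matched by (a+b)
print(m.group(2))  # "ccd" -- text matched by (c*d)
```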

You asked about non-greedy regular expressions which were first
popularized by perl.

IIRC the Plan 9 regex library does not provide this at all; Bell Labs
code never did non-greedy regexp matching.

Rob and/or Russ can correct me if I'm wrong.

FWIW, tools from the GNU world also do not support non-greedy
matching, nor are such expressions part of POSIX.
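
For reference, the behavior being asked about, sketched in Python's `re`, a Perl-style engine that does support the non-greedy `*?`:

```python
import re

s = "<a><b>"
print(re.match(r"<.*>", s).group())   # greedy:     "<a><b>"
print(re.match(r"<.*?>", s).group())  # non-greedy: "<a>"
```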

Hope this helps,

Arnold

> > russ has a great writeup on this.
> > http://swtch.com/~rsc/regexp/
> > i think it covers all your questions.
> >
> > - erik
>
> I read through some of that already yesterday. Anyway, I am still
> puzzled. In the text of
>
> Regular Expression Matching Can Be Simple And Fast
> (but is slow in Java, Perl, PHP, Python, Ruby, ...)
>
> R. Cox writes:
> ---
> While writing the text editor sam [6] in the early 1980s, Rob Pike
> wrote a new regular expression implementation, which Dave Presotto
> extracted into a library that appeared in the Eighth Edition. Pike's
> implementation incorporated submatch tracking into an efficient NFA
> simulation but, like the rest of the Eighth Edition source, was not
> widely distributed.
> ...
> Pike's regular expression implementation, extended to support Unicode,
> was made freely available with sam in late 1992, but the particularly
> efficient regular expression search algorithm went unnoticed. The code
> is now available in many forms: as part of sam, as Plan 9's regular
> expression library, or packaged separately for Unix.
> ---
>
> But any manual page (regexp(6), that of sam) keeps completely silent
> about, e.g., any submatch tracking.
> So what's wrong? Can anybody clarify the situation for me or do I
> really have to read the codes?
>
> Ruda
>




end of thread, other threads:[~2008-12-11 16:32 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-23 18:58 [9fans] non greedy regular expressions Rudolf Sykora
2008-10-23 19:05 ` erik quanstrom
2008-10-24  8:08   ` Rudolf Sykora
2008-10-24 12:23     ` erik quanstrom
2008-10-24 16:11       ` Rudolf Sykora
2008-10-24 16:54         ` erik quanstrom
2008-10-24 17:02         ` John Stalker
2008-10-24 17:15           ` Rob Pike
2008-10-24 17:41           ` Rudolf Sykora
2008-10-24 18:01             ` Russ Cox
2008-10-24 19:56               ` Rudolf Sykora
2008-10-24 21:10                 ` Russ Cox
2008-10-24 21:40                   ` Rudolf Sykora
2008-10-24 21:47                     ` erik quanstrom
2008-10-24 22:04                       ` Rudolf Sykora
2008-10-24 22:38                         ` Gabriel Diaz Lopez de la Llave
2008-10-24 22:54                         ` Charles Forsyth
2008-10-24 22:59                           ` Charles Forsyth
2008-10-24 23:52                         ` Tom Simons
2008-10-25 22:35                           ` Rudolf Sykora
2008-10-25 23:02                             ` Steve Simon
2008-10-26  8:57                             ` John Stalker
2008-10-26 18:36                               ` Eris Discordia
2008-10-27  4:55                             ` Russ Cox
2008-10-27  8:28                               ` Rudolf Sykora
2008-10-27 10:18                               ` Charles Forsyth
2008-10-27 13:13                                 ` Eris Discordia
2008-10-27 13:23                                   ` erik quanstrom
2008-10-27 19:42                                     ` Eris Discordia
2008-10-27 16:13                                   ` Brian L. Stuart
2008-11-30  8:29                             ` Yard Ape
2008-12-11 16:32                               ` Rudolf Sykora
2008-10-24 18:02             ` John Stalker
2008-10-24 17:10         ` Uriel
2008-10-24 19:56         ` Charles Forsyth
2008-10-24 19:56           ` Rudolf Sykora
2008-10-26 21:23             ` Rob Pike
2008-10-24 11:27 Aharon Robbins
2008-10-27 19:23 Aharon Robbins
2008-10-27 20:15 ` Eris Discordia
2008-10-27 20:00 Eris Discordia
2008-10-28 14:51 ` Brian L. Stuart
2008-10-28 15:07   ` Eris Discordia
2008-10-27 21:08 Aharon Robbins
2008-10-28 14:53 ` Eris Discordia
