9fans - fans of the OS Plan 9 from Bell Labs
 help / color / mirror / Atom feed
* Re: [9fans] non greedy regular expressions
@ 2008-10-27 21:08 Aharon Robbins
  2008-10-28 14:53 ` Eris Discordia
  0 siblings, 1 reply; 45+ messages in thread
From: Aharon Robbins @ 2008-10-27 21:08 UTC (permalink / raw)
  To: 9fans

> > As other mails have pointed out, anything that isn't leftmost longest
> > has weird semantics.  Non-greedy operators are mostly syntactic sugar.
>
> Is (leftmost-longest + all-greedy operators) syntactic salt then?

It is merely the traditional POSIX flavor.  Some people like that
flavor, some don't.

> > Not in the least. The Plan 9 regexp library in fact gives you close to
> > the same nirvana; an automata that has DFA speed characteristics with
> > the NFA's ability to capture sub texts.
>
> Does regexp(n) also give the lowlife any hint of why it should behave
> differently from Perl? Friedl's book doesn't, but it has good reason.

It is more that Perl simply was never part of the picture for the people
who develop(ed) and use(d) Plan 9.  It's like asking why the paper on
the Plan 9 C compiler doesn't state that C++ classes are not available.

The long-time users of Plan 9 were using Unix before Perl even came along;
for reasons having to do with both taste and the theoretical soundness,
they saw no reason to try to support the Perl features.  After all,
if you really want Perl, you know where to get it.

Me, I'm pretty happy with the traditional shell + sed + grep + awk
combinations, but then again, I'm biased, particularly towards awk.
:-)

Arnold



^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [9fans] non greedy regular expressions
@ 2008-10-27 20:00 Eris Discordia
  2008-10-28 14:51 ` Brian L. Stuart
  0 siblings, 1 reply; 45+ messages in thread
From: Eris Discordia @ 2008-10-27 20:00 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

First of all, thanks for the explanation. It's above my head, but thanks
anyway.

> This guy seems to blur the distinctions here.  His discussion

He doesn't. If one reads the whole section part of which was quoted one
will see that he clearly states DFA and NFA are theoretically equivalent,
but then goes on to explain that DFA and NFA _implementations_ are not
identical.

> The true mathematical and computational meaning of "NFA" is different
> from what is commonly called an "NFA regex engine." In theory, NFA and
> DFA engines should match exactly the same text and have exactly the same
> features. In practice, the desire for richer, more expressive regular
> expressions has caused their semantics to diverge. An example is the
> support for backreferences.
>
> The design of a DFA engine precludes backreferences, but it's a
> relatively small task to add backreference support to a true
> (mathematically speaking) NFA engine. In doing so, you create a more
> powerful tool, but you also make it decidedly nonregular (mathematically
> speaking). What does this mean? At most, that you should probably stop
> calling it an NFA, and start using the phrase "nonregular expressions,"
> since that describes (mathematically speaking) the new situation. No one
> has actually done this, so the name "NFA" has lingered, even though the
> implementation is no longer (mathematically speaking) an NFA.

-- from the same book

> And a good book on automata theory can give you far more
> than any "big book of...", "... in 21 days" or "... for
> dummies" book can.  Besides, why would you think that
> anyone is denying you knowledge or the opportunity
> to write code that demonstrates your ideas?

Because lowlifes don't write code. No need for somebody to stop them from
doing so. They learn slowly, hardly, painfully--they aren't smart. If
possible they'll learn less rather than learn more. What the "hacker"
denies the lowlife is the opportunity to exist free of "GNU-is-wrong" or
"X-is-wrong" blemish. GNU may be wrong, or right, but GNU is learnt fast
and easy. And it does the job.

By the way, Friedl's book has the advantage of giving a lowlife a glimpse
of a subject otherwise arcane from that same lowlife's point of view. It's
good--for the lowlife, of course--to know the wonders they see didn't
spring into existence out of the blue. I benefitted from that learning.

--On Monday, October 27, 2008 4:13 PM +0000 "Brian L. Stuart"
<blstuart@bellsouth.net> wrote:

>> The set of "big books on regular expressions" includes Jeffrey Friedl's
>> "Mastering Regular Expressions" that happens to contain a chapter by the
>> title "NFA, DFA, and POSIX" wherein he says:
>>
>> > DFA Speed with NFA Capabilities: Regex Nirvana?
>
> This guy seems to blur the distinctions here.  His discussion
> makes it sound like he's saying that NFAs have more expressive
> power than DFAs.  This is incorrect.  Both NFAs and DFAs have
> exactly the same expressive power as the class of grammars
> called regular.  For the arbitrary case of nesting (e.g. parens),
> these machines are insufficient.  However, for any prescribed
> maximum nesting level, you can write a regular expression to
> account for it, though it becomes clumsy.
>
> To get more expressiveness, you need to go to a machine
> with more functionality.  Classically, the next step
> up the hierarchy is the pushdown automaton.  The expressive
> power of this machine corresponds directly to the context-
> free grammars.  Because the full generality of the CFG
> require nondeterminism, automated translations from CFG
> to code/machine are usually done with restricted classes
> of CFGs, such as LR(k) and LALR.  You can also increase
> the power of a FA by adding a counter or by making the
> transitions probablistic.  If you truly want to build
> expression matching mechanisms that go beyond regular,
> building on the FA with counter(s) would be a far more
> sound foundation than a lot of the ad hoc stuff that's
> been done.  But the truth is that whipping up a CFG
> and feeding it to yacc is far more expedient than torturing
> regular expressions all day.
>
>> Again, turns out the "big books on regular expressions" can give the
>> lowlife--that's me--things "hackers" deny them.
>
> And a good book on automata theory can give you far more
> than any "big book of...", "... in 21 days" or "... for
> dummies" book can.  Besides, why would you think that
> anyone is denying you knowledge or the opportunity
> to write code that demonstrates your ideas?
>
> BLS
>
>



^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [9fans] non greedy regular expressions
@ 2008-10-27 19:23 Aharon Robbins
  2008-10-27 20:15 ` Eris Discordia
  0 siblings, 1 reply; 45+ messages in thread
From: Aharon Robbins @ 2008-10-27 19:23 UTC (permalink / raw)
  To: 9fans

> > GNU grep takes a simple but effective approach. It uses a DFA when
> > possible, reverting to an NFA when backreferences are used. GNU awk does
> > something similar---it uses GNU grep's fast shortest-leftmost DFA engine
> > for simple "does it match" checks, and reverts to a different engine for
> > checks where the actual extent of the match must be known. Since that
> > other engine is an NFA, GNU awk can conveniently offer capturing
> > parentheses, and it does via its special gensub function.

I'll be the first to admit that gawk's behavior is something of a hack.
I'd much rather be able to use a single matcher. (At least I was able to
get Friedl to describe gawk's implementation correctly. :-)

> > Tcl's regex engine is a true hybrid, custom built by Henry Spencer ....
> >
> > Currently, this engine is available only to Tcl, but Henry tells me that
> > it's on his to-do list to break it out into a separate package that can
> > be used by others.

It's been on that TODO list for well over a decade, IIRC. More vaporware
as far as a broader community is concerned.

> Again, turns out the "big books on regular expressions" can give the
> lowlife--that's me--things "hackers" deny them.

Not in the least. The Plan 9 regexp library in fact gives you close to
the same nirvana; an automata that has DFA speed characteristics with
the NFA's ability to capture sub texts.

As other mails have pointed out, anything that isn't leftmost longest
has weird semantics.  Non-greedy operators are mostly syntactic sugar.
I'll agree that in practice they're likely to be useful, but as I'm pretty
much an awk guy (<grin>) I've never felt the pain of their lack, either.

Gawk is stuck with its current two-matcher approach since it provides
the option for different syntaxes, but that approach also has lots of
problems; more than once I've come across cases where they interpret the
same syntax differently, or don't handle multibyte characters the same.
(Not to mention that they sntax bits API in the GNU matchers is a
a real nightmare. Another thing that's too late to change.)

If I could start over again, I'd start with either the plan 9 lib or
ripping out Henry's library from tcl, but the former is probably easier.

Friedl's book is good for what it aims to be: an introduction to
regular expressions.  But scientifically rigid (as in, on the same
order as the dragon book) it's definitely not.

Russ's paper describes the state of the world pretty well.

Arnold



^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [9fans] non greedy regular expressions
@ 2008-10-24 11:27 Aharon Robbins
  0 siblings, 0 replies; 45+ messages in thread
From: Aharon Robbins @ 2008-10-24 11:27 UTC (permalink / raw)
  To: 9fans, rudolf.sykora

You are not missing anything.

Subexpression matching means when you have an expression like

	q(a+b)(c*d)z

that you can get access to the exact text matched by the two
parenthesized subexpressions.

You asked about non-greedy regular expressions which were first
popularized by perl.

IIRC the Plan 9 regex library does not provide this at all; Bell Labs
code never did non-greedy regexp matching.

Rob and/or Russ can correct me if I'm wrong.

FWIW, tools from the GNU world also do not support non-greedy
matching, nor are such expressions part of POSIX.

Hope this helps,

Arnold

> > russ has a great writeup on this.
> > http://swtch.com/~rsc/regexp/
> > i think it covers all your questions.
> >
> > - erik
>
> I read trough some of that already yesterday. Anyway, am still
> puzzled. In the text of
>
> Regular Expression Matching Can Be Simple And Fast
> (but is slow in Java, Perl, PHP, Python, Ruby, ...)
>
> R. Cox writes:
> ---
> While writing the text editor sam [6] in the early 1980s, Rob Pike
> wrote a new regular expression implementation, which Dave Presotto
> extracted into a library that appeared in the Eighth Edition. Pike's
> implementation incorporated submatch tracking into an efficient NFA
> simulation but, like the rest of the Eighth Edition source, was not
> widely distributed.
> ...
> Pike's regular expression implementation, extended to support Unicode,
> was made freely available with sam in late 1992, but the particularly
> efficient regular expression search algorithm went unnoticed. The code
> is now available in many forms: as part of sam, as Plan 9's regular
> expression library, or packaged separately for Unix.
> ---
>
> But any manual page (regexp(6), that of sam)  keeps completely silent
> about eg. any submatch tracking.
> So what's wrong? Can anybody clarify the situation for me or do I
> really have to read the codes?
>
> Ruda
>



^ permalink raw reply	[flat|nested] 45+ messages in thread
* [9fans] non greedy regular expressions
@ 2008-10-23 18:58 Rudolf Sykora
  2008-10-23 19:05 ` erik quanstrom
  0 siblings, 1 reply; 45+ messages in thread
From: Rudolf Sykora @ 2008-10-23 18:58 UTC (permalink / raw)
  To: Fans of the OS Plan 9 from Bell Labs

Hello

regexp(6) seems to know only greedy regular expressions. So does
probably awk, sed, grep, ..., since these are based on that regexp.
My question is what to do if one needs non-greedy regexps.

Also, is there anything like submatch extraction, counted repetition
of regexp (like {3,5}), (lookahead) assertions in plan9?

Thanks
Ruda



^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2008-12-11 16:32 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-27 21:08 [9fans] non greedy regular expressions Aharon Robbins
2008-10-28 14:53 ` Eris Discordia
  -- strict thread matches above, loose matches on Subject: below --
2008-10-27 20:00 Eris Discordia
2008-10-28 14:51 ` Brian L. Stuart
2008-10-28 15:07   ` Eris Discordia
2008-10-27 19:23 Aharon Robbins
2008-10-27 20:15 ` Eris Discordia
2008-10-24 11:27 Aharon Robbins
2008-10-23 18:58 Rudolf Sykora
2008-10-23 19:05 ` erik quanstrom
2008-10-24  8:08   ` Rudolf Sykora
2008-10-24 12:23     ` erik quanstrom
2008-10-24 16:11       ` Rudolf Sykora
2008-10-24 16:54         ` erik quanstrom
2008-10-24 17:02         ` John Stalker
2008-10-24 17:15           ` Rob Pike
2008-10-24 17:41           ` Rudolf Sykora
2008-10-24 18:01             ` Russ Cox
2008-10-24 19:56               ` Rudolf Sykora
2008-10-24 21:10                 ` Russ Cox
2008-10-24 21:40                   ` Rudolf Sykora
2008-10-24 21:47                     ` erik quanstrom
2008-10-24 22:04                       ` Rudolf Sykora
2008-10-24 22:38                         ` Gabriel Diaz Lopez de la Llave
2008-10-24 22:54                         ` Charles Forsyth
2008-10-24 22:59                           ` Charles Forsyth
2008-10-24 23:52                         ` Tom Simons
2008-10-25 22:35                           ` Rudolf Sykora
2008-10-25 23:02                             ` Steve Simon
2008-10-26  8:57                             ` John Stalker
2008-10-26 18:36                               ` Eris Discordia
2008-10-27  4:55                             ` Russ Cox
2008-10-27  8:28                               ` Rudolf Sykora
2008-10-27 10:18                               ` Charles Forsyth
2008-10-27 13:13                                 ` Eris Discordia
2008-10-27 13:23                                   ` erik quanstrom
2008-10-27 19:42                                     ` Eris Discordia
2008-10-27 16:13                                   ` Brian L. Stuart
2008-11-30  8:29                             ` Yard Ape
2008-12-11 16:32                               ` Rudolf Sykora
2008-10-24 18:02             ` John Stalker
2008-10-24 17:10         ` Uriel
2008-10-24 19:56         ` Charles Forsyth
2008-10-24 19:56           ` Rudolf Sykora
2008-10-26 21:23             ` Rob Pike

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).