Computer Old Farts Forum
 help / color / mirror / Atom feed
From: Dan Cross <crossd@gmail.com>
To: Grant Taylor <gtaylor@tnetconsulting.net>
Cc: coff@tuhs.org
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
Date: Fri, 3 Mar 2023 14:31:14 -0500	[thread overview]
Message-ID: <CAEoi9W45t__mk_MH_2VOn=exhuRhTwE4XdnQCug9Jb0fPVWLOQ@mail.gmail.com> (raw)
In-Reply-To: <c4e87cee-d04f-50be-cb91-edf1b24eaf07@spamtrap.tnetconsulting.net>

On Fri, Mar 3, 2023 at 2:06 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
> Thank you all for very interesting and engaging comments & threads to
> chase / pull / untangle.
>
> I'd like to expand / refine my original question a little bit.
>
> On 3/2/23 11:54 AM, Grant Taylor via COFF wrote:
> > I'd like some thoughts ~> input on extended regular expressions used
> > with grep, specifically GNU grep -e / egrep.
>
> While some reading of the references that Clem provided I came across
> multiple indications that back-references can be problematic from a
> performance stand point.
>
> So I'd like to know if all back-references are problematic, or if very
> specific back-references are okay.

The thing about backreferences is that they're not representable in
the regular languages because they require additional state (the thing
the backref refers to), so you cannot construct a DFA corresponding to
them, nor an NDFA simulator (this is where Freidl gets things wrong!);
you really need a pushdown automata and then you're in the domain of
the context-free languages. Therefore, "regexps" that use back
references are not actually regular expressions.

Yet, popular engines support them...but how? Well, pretty much all of
them use a backtracking implementation, which _can_ be exponential in
both time and space.

Now, that said, there are plenty of REs, even some with backrefs,
that'll execute plenty fast enough on backtracking implementations; it
really depends on the expressions in question and the size of strings
you're trying to match against. But you lose the bounding guarantees
DFAs and NDFAs provide.

> Suppose I have the following two lines:
>
>     aaa aaa
>     aaa bbb
>
> Does the following RE w/ back-reference introduce a big performance penalty?
>
>     (aaa|bbb) \1
>
> As in:
>
>     % echo "aaa aaa" | egrep "(aaa|bbb) \1"
>     aaa aaa
>
> I can easily see how a back reference to something that is not a fixed
> length can become a rabbit hole.  But I'm wondering if a back-reference
> to -- what I think is called -- an alternation (with length fixed in the
> RE) is a performance hit or not.

Well, it's more about the implementation strategy than the specific
expression here. Could this become exponential? I don't think this one
would, no; but others may, particularly if you use Kleene closures in
the alternation.

This _is_ something that appears in the wild, by the way, not just in
theory; I did a change to Google's spelling service code to replace
PCRE with re2 precisely because it was blowing up with exponential
memory usage on some user input. The problems went away, but I had to
rewrite a bunch of the REs involved.

        - Dan C.

  reply	other threads:[~2023-03-03 19:31 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 18:54 [COFF] " Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38   ` Grant Taylor via COFF
2023-03-02 23:01   ` Stuff Received
2023-03-02 23:46     ` Steffen Nurpmeso
2023-03-03  1:08     ` Grant Taylor via COFF
2023-03-03  2:10       ` Dave Horsfall
2023-03-03  3:34         ` Grant Taylor via COFF
2023-03-02 21:53 ` Dan Cross
2023-03-03  1:05   ` Grant Taylor via COFF
2023-03-03  3:04     ` Dan Cross
2023-03-03  3:53       ` Grant Taylor via COFF
2023-03-03 13:47         ` Dan Cross
2023-03-03 19:26           ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11   ` Dan Cross
2023-03-03 13:42     ` Ralph Corderoy
2023-03-03 19:19       ` Grant Taylor via COFF
2023-03-04 10:15         ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49           ` [COFF] " Tomasz Rola
2023-03-07 22:46             ` Tomasz Rola
2023-06-20 16:02           ` Michael Parson
2023-06-20 21:26             ` Tomasz Rola
2023-06-22 15:45               ` Michael Parson
2023-07-10  9:08                 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
2023-03-03 16:12   ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13     ` Dan Cross
2023-03-03 17:38       ` Ralph Corderoy
2023-03-03 19:09         ` Dan Cross
2023-03-03 19:36     ` Grant Taylor via COFF
2023-03-04 10:26       ` Ralph Corderoy
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31   ` Dan Cross [this message]
2023-03-04 10:07   ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01   ` Dan Cross
2023-03-06 21:49     ` Steffen Nurpmeso
2023-03-07  1:43     ` Larry McVoy
2023-03-07  4:01       ` Ed Bradford
2023-03-07 11:39         ` [COFF] " Ralph Corderoy
2023-03-07 18:31           ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22           ` Ed Bradford
2023-03-07 16:14         ` Dan Cross
2023-03-07 17:34           ` [COFF] " Ralph Corderoy
2023-03-07 18:33             ` [COFF] " Dan Cross
2023-03-07  4:19     ` Ed Bradford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAEoi9W45t__mk_MH_2VOn=exhuRhTwE4XdnQCug9Jb0fPVWLOQ@mail.gmail.com' \
    --to=crossd@gmail.com \
    --cc=coff@tuhs.org \
    --cc=gtaylor@tnetconsulting.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).