[COFF] Re: Requesting thoughts on extended regular expressions in grep.

Computer Old Farts Forum
 help / color / mirror / Atom feed

From: Dan Cross <crossd@gmail.com>
To: Grant Taylor <gtaylor@tnetconsulting.net>
Cc: COFF <coff@tuhs.org>
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
Date: Thu, 2 Mar 2023 16:53:31 -0500	[thread overview]
Message-ID: <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com> (raw)
In-Reply-To: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net>

On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -e / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
>     ^\w{3}
>
> vs:
>
>     ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)

Well, obviously the former matches any sequence 3 of
alpha-numerics/underscores at the beginning of a string, while the
latter only matches abbreviations of months in the western calendar;
that is, the two REs are matching very different things (the latter is
a strict subset of the former). But I suspect you mean in a more
general sense.

> Or:
>
>     [ :[:digit:]]{11}

...do you really want to match a space, a colon and a single digit 11
times in a single string?

> vs:
>
>     ( 1| 2| 3| 4| 5| 6| 7| 8|
> 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31)
> (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]

Using character classes would greatly simplify what you're trying to
do. It seems like this could be simplified to (untested) snippet:

    ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]

For this, I'd probably eschew `[:digit:]`. Named character classes are
for handy locale support, or in lieu of typing every character in the
alphabet (though we can use ranges to abbreviate that), but it kind of
seems like that's not coming into play here and, IMHO, `[0-9]` is
clearer in context.

> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.

It's not clear to me that dates, in their generality, can be matched
with regular expressions.  Consider leap years; you'd almost
necessarily have to use backtracking for that, but I admit I haven't
thought it through.

> For matching patterns like the following in log files?
>
>     Mar  2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries.  So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly.  As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured.  The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.

`\w` is a GNU extension; I'd probably avoid it on portability grounds
(though `\b` is very handy).

The thing about regular expressions is that they describe regular
languages, and regular languages are those for which there exists a
finite automaton that can recognize the language. An important class
of finite automata are deterministic finite automata; by definition,
recognition by such automata are linear in the length of the input.

However, construction of a DFA for any given regular expression can be
superlinear (in fact, it can be exponential) so practically speaking,
we usually construct non-deterministic finite automata (NDFAs) and
"simulate" their execution for matching. NDFAs generalize DFAs (DFAs
are a subset of NDFAs, incidentally) in that, in any non-terminal
state, there can be multiple subsequent states that the machine can
transition to given an input symbol. When executed, for any state, the
simulator will transition to every permissible subsequent state
simultaneously, discarding impossible states as they become evident.

This implies that NDFA execution is superlinear, but it is bounded,
and is O(n*m*e), where n is the length of the input, m is the number
of nodes in the state transition graph corresponding to the NDFA, and
e is the maximum number of edges leaving any node in that graph (for a
fully connected graph, that would m, so this can be up to O(n*m^2)).
Construction of an NDFA is O(m), so while it's slower to execute, it's
actually possible to construct in a reasonable amount of time. Russ's
excellent series of articles that Clem linked to gives details and
algorithms.

> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.

In practical terms? Basically, don't worry about it too much. Egrep
will generate an NDFA simulation that's going to be acceptably fast
for all but the weirdest cases.

        - Dan C.

next prev parent reply	other threads:[~2023-03-02 21:54 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 18:54 [COFF] " Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38   ` Grant Taylor via COFF
2023-03-02 23:01   ` Stuff Received
2023-03-02 23:46     ` Steffen Nurpmeso
2023-03-03  1:08     ` Grant Taylor via COFF
2023-03-03  2:10       ` Dave Horsfall
2023-03-03  3:34         ` Grant Taylor via COFF
2023-03-02 21:53 ` Dan Cross [this message]
2023-03-03  1:05   ` Grant Taylor via COFF
2023-03-03  3:04     ` Dan Cross
2023-03-03  3:53       ` Grant Taylor via COFF
2023-03-03 13:47         ` Dan Cross
2023-03-03 19:26           ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11   ` Dan Cross
2023-03-03 13:42     ` Ralph Corderoy
2023-03-03 19:19       ` Grant Taylor via COFF
2023-03-04 10:15         ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49           ` [COFF] " Tomasz Rola
2023-03-07 22:46             ` Tomasz Rola
2023-06-20 16:02           ` Michael Parson
2023-06-20 21:26             ` Tomasz Rola
2023-06-22 15:45               ` Michael Parson
2023-07-10  9:08                 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
2023-03-03 16:12   ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13     ` Dan Cross
2023-03-03 17:38       ` Ralph Corderoy
2023-03-03 19:09         ` Dan Cross
2023-03-03 19:36     ` Grant Taylor via COFF
2023-03-04 10:26       ` Ralph Corderoy
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31   ` Dan Cross
2023-03-04 10:07   ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01   ` Dan Cross
2023-03-06 21:49     ` Steffen Nurpmeso
2023-03-07  1:43     ` Larry McVoy
2023-03-07  4:01       ` Ed Bradford
2023-03-07 11:39         ` [COFF] " Ralph Corderoy
2023-03-07 18:31           ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22           ` Ed Bradford
2023-03-07 16:14         ` Dan Cross
2023-03-07 17:34           ` [COFF] " Ralph Corderoy
2023-03-07 18:33             ` [COFF] " Dan Cross
2023-03-07  4:19     ` Ed Bradford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com \
    --to=crossd@gmail.com \
    --cc=coff@tuhs.org \
    --cc=gtaylor@tnetconsulting.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).