From: Grant Taylor via COFF <coff@tuhs.org>
To: coff@tuhs.org
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
Date: Thu, 2 Mar 2023 18:05:51 -0700 [thread overview]
Message-ID: <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> (raw)
In-Reply-To: <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 5704 bytes --]
On 3/2/23 2:53 PM, Dan Cross wrote:
> Well, obviously the former matches any sequence 3 of
> alpha-numerics/underscores at the beginning of a string, while the
> latter only matches abbreviations of months in the western calendar;
> that is, the two REs are matching very different things (the latter
> is a strict subset of the former).
I completely agree with you. That's also why I'm wanting to start
utilizing the latter, more specific RE. But I don't know where the line
of over complicating things is to avoid crossing it.
> But I suspect you mean in a more general sense.
Yes and no. Does the comment above clarify at all?
> ...do you really want to match a space, a colon and a single digit
> 11 times ...
Yes.
> ... in a single string?
What constitutes a single string? ;-) I sort of rhetorically ask.
The log lines start with
MMM dd hh:mm:ss
Where:
- MMM is the month abbreviation
- dd is the day of the month
- hh is the hour of the day
- mm is the minute of the hour
- ss is the second of the minute
So, yes, there are eleven characters that fall into the class consisting
of a space or a colon or a number.
Is that a single string? It depends what you're looking at, the
sequences of non white space in the log? No. The patter that I'm
matching ya.
> Using character classes would greatly simplify what you're trying to
> do. It seems like this could be simplified to (untested) snippet:
Agreed.
I'm starting with the examples that came with; "^\w{3} [
:[:digit:]]{11}", the logcheck package that I'm working with and
evaluating what I want to do.
I actually like the idea of dividing out the following:
- months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
- months that have 30 days: Apr, Jun, Sep, Nov
- month that have 28/29 days: Feb
> ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]
Aside: Why do you have the double square brackets in "[12][[0-9]]"?
> For this, I'd probably eschew `[:digit:]`. Named character classes
> are for handy locale support, or in lieu of typing every character
> in the alphabet (though we can use ranges to abbreviate that), but
> it kind of seems like that's not coming into play here and, IMHO,
> `[0-9]` is clearer in context.
ACK
"[[:digit:]]+" was a construct that I'm parroting. It and
[.:[:xdigit:]]+ are good for some things. But they definitely aren't
the best for all things.
Hence trying to find the line of being more accurate without going too far.
> It's not clear to me that dates, in their generality, can be
> matched with regular expressions. Consider leap years; you'd almost
> necessarily have to use backtracking for that, but I admit I haven't
> thought it through.
Given the context that these extended regular expressions are going to
be used in, logcheck -- filtering out known okay log entries to email
what doesn't get filtered -- I'm okay with having a few things slip
through like leap day / leap seconds / leap frogs.
> `\w` is a GNU extension; I'd probably avoid it on portability grounds
> (though `\b` is very handy).
I hear, understand, and acknowledge your concern. At present, these
filters are being used in a package; logcheck, which I believe is
specific to Debian and ilk. As such, GNU grep is very much a thing.
I'm also not a fan of the use of `\w` and would prefer to (...|...) things.
> The thing about regular expressions is that they describe regular
> languages, and regular languages are those for which there exists a
> finite automaton that can recognize the language. An important class
> of finite automata are deterministic finite automata; by definition,
> recognition by such automata are linear in the length of the input.
>
> However, construction of a DFA for any given regular expression can be
> superlinear (in fact, it can be exponential) so practically speaking,
> we usually construct non-deterministic finite automata (NDFAs) and
> "simulate" their execution for matching. NDFAs generalize DFAs (DFAs
> are a subset of NDFAs, incidentally) in that, in any non-terminal
> state, there can be multiple subsequent states that the machine can
> transition to given an input symbol. When executed, for any state,
> the simulator will transition to every permissible subsequent state
> simultaneously, discarding impossible states as they become evident.
>
> This implies that NDFA execution is superlinear, but it is bounded,
> and is O(n*m*e), where n is the length of the input, m is the number
> of nodes in the state transition graph corresponding to the NDFA, and
> e is the maximum number of edges leaving any node in that graph (for
> a fully connected graph, that would m, so this can be up to O(n*m^2)).
> Construction of an NDFA is O(m), so while it's slower to execute, it's
> actually possible to construct in a reasonable amount of time. Russ's
> excellent series of articles that Clem linked to gives details and
> algorithms.
I only vaguely understand those three paragraphs as they are deeper
computer science than I've gone before.
I think I get the gist of them but could not explain them if my life
depended upon it.
> In practical terms? Basically, don't worry about it too much. Egrep
> will generate an NDFA simulation that's going to be acceptably fast
> for all but the weirdest cases.
ACK
It sounds like I can make any reasonable extended regular expression a
human can read and I'll probably be good.
Thank you for the detailed response Dan. :-)
--
Grant. . . .
unix || die
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]
next prev parent reply other threads:[~2023-03-03 1:06 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-02 18:54 [COFF] " Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38 ` Grant Taylor via COFF
2023-03-02 23:01 ` Stuff Received
2023-03-02 23:46 ` Steffen Nurpmeso
2023-03-03 1:08 ` Grant Taylor via COFF
2023-03-03 2:10 ` Dave Horsfall
2023-03-03 3:34 ` Grant Taylor via COFF
2023-03-02 21:53 ` Dan Cross
2023-03-03 1:05 ` Grant Taylor via COFF [this message]
2023-03-03 3:04 ` Dan Cross
2023-03-03 3:53 ` Grant Taylor via COFF
2023-03-03 13:47 ` Dan Cross
2023-03-03 19:26 ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11 ` Dan Cross
2023-03-03 13:42 ` Ralph Corderoy
2023-03-03 19:19 ` Grant Taylor via COFF
2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49 ` [COFF] " Tomasz Rola
2023-03-07 22:46 ` Tomasz Rola
2023-06-20 16:02 ` Michael Parson
2023-06-20 21:26 ` Tomasz Rola
2023-06-22 15:45 ` Michael Parson
2023-07-10 9:08 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13 ` Dan Cross
2023-03-03 17:38 ` Ralph Corderoy
2023-03-03 19:09 ` Dan Cross
2023-03-03 19:36 ` Grant Taylor via COFF
2023-03-04 10:26 ` Ralph Corderoy
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31 ` Dan Cross
2023-03-04 10:07 ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01 ` Dan Cross
2023-03-06 21:49 ` Steffen Nurpmeso
2023-03-07 1:43 ` Larry McVoy
2023-03-07 4:01 ` Ed Bradford
2023-03-07 11:39 ` [COFF] " Ralph Corderoy
2023-03-07 18:31 ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22 ` Ed Bradford
2023-03-07 16:14 ` Dan Cross
2023-03-07 17:34 ` [COFF] " Ralph Corderoy
2023-03-07 18:33 ` [COFF] " Dan Cross
2023-03-07 4:19 ` Ed Bradford
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net \
--to=coff@tuhs.org \
--cc=gtaylor@tnetconsulting.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).