Computer Old Farts Forum
 help / color / mirror / Atom feed
From: Grant Taylor via COFF <coff@tuhs.org>
To: coff@tuhs.org
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
Date: Thu, 2 Mar 2023 18:05:51 -0700	[thread overview]
Message-ID: <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> (raw)
In-Reply-To: <CAEoi9W7pTBWTekzPuUpMScNzwC9TgQKvr59PBAjr8+FRQOrarg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5704 bytes --]

On 3/2/23 2:53 PM, Dan Cross wrote:
> Well, obviously the former matches any sequence 3 of 
> alpha-numerics/underscores at the beginning of a string, while the 
> latter only matches abbreviations of months in the western calendar; 
> that is, the two REs are matching very different things (the latter 
> is a strict subset of the former).

I completely agree with you.  That's also why I'm wanting to start 
utilizing the latter, more specific RE.  But I don't know where the line 
of over complicating things is to avoid crossing it.

> But I suspect you mean in a more general sense.

Yes and no.  Does the comment above clarify at all?

> ...do you really want to match a space, a colon and a single digit 
> 11 times ...

Yes.

> ... in a single string?

What constitutes a single string?  ;-)  I sort of rhetorically ask.

The log lines start with

MMM dd hh:mm:ss

Where:
  - MMM is the month abbreviation
  - dd  is the day of the month
  - hh  is the hour of the day
  - mm  is the minute of the hour
  - ss  is the second of the minute

So, yes, there are eleven characters that fall into the class consisting 
of a space or a colon or a number.

Is that a single string?  It depends what you're looking at, the 
sequences of non white space in the log? No.  The patter that I'm 
matching ya.

> Using character classes would greatly simplify what you're trying to 
> do. It seems like this could be simplified to (untested) snippet:

Agreed.

I'm starting with the examples that came with; "^\w{3} [ 
:[:digit:]]{11}", the logcheck package that I'm working with and 
evaluating what I want to do.

I actually like the idea of dividing out the following:

  - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
  - months that have 30 days: Apr, Jun, Sep, Nov
  - month that have 28/29 days: Feb

> ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]

Aside:  Why do you have the double square brackets in "[12][[0-9]]"?

> For this, I'd probably eschew `[:digit:]`. Named character classes 
> are for handy locale support, or in lieu of typing every character 
> in the alphabet (though we can use ranges to abbreviate that), but 
> it kind of seems like that's not coming into play here and, IMHO, 
> `[0-9]` is clearer in context.

ACK

"[[:digit:]]+" was a construct that I'm parroting.  It and 
[.:[:xdigit:]]+ are good for some things.  But they definitely aren't 
the best for all things.

Hence trying to find the line of being more accurate without going too far.

> It's not clear to me that dates, in their generality, can be 
> matched with regular expressions.  Consider leap years; you'd almost 
> necessarily have to use backtracking for that, but I admit I haven't 
> thought it through.

Given the context that these extended regular expressions are going to 
be used in, logcheck -- filtering out known okay log entries to email 
what doesn't get filtered -- I'm okay with having a few things slip 
through like leap day / leap seconds / leap frogs.

> `\w` is a GNU extension; I'd probably avoid it on portability grounds 
> (though `\b` is very handy).

I hear, understand, and acknowledge your concern.  At present, these 
filters are being used in a package; logcheck, which I believe is 
specific to Debian and ilk.  As such, GNU grep is very much a thing.

I'm also not a fan of the use of `\w` and would prefer to (...|...) things.

> The thing about regular expressions is that they describe regular 
> languages, and regular languages are those for which there exists a 
> finite automaton that can recognize the language. An important class 
> of finite automata are deterministic finite automata; by definition, 
> recognition by such automata are linear in the length of the input.
> 
> However, construction of a DFA for any given regular expression can be 
> superlinear (in fact, it can be exponential) so practically speaking, 
> we usually construct non-deterministic finite automata (NDFAs) and 
> "simulate" their execution for matching. NDFAs generalize DFAs (DFAs 
> are a subset of NDFAs, incidentally) in that, in any non-terminal 
> state, there can be multiple subsequent states that the machine can 
> transition to given an input symbol. When executed, for any state, 
> the simulator will transition to every permissible subsequent state 
> simultaneously, discarding impossible states as they become evident.
> 
> This implies that NDFA execution is superlinear, but it is bounded, 
> and is O(n*m*e), where n is the length of the input, m is the number 
> of nodes in the state transition graph corresponding to the NDFA, and 
> e is the maximum number of edges leaving any node in that graph (for 
> a fully connected graph, that would m, so this can be up to O(n*m^2)). 
> Construction of an NDFA is O(m), so while it's slower to execute, it's 
> actually possible to construct in a reasonable amount of time. Russ's 
> excellent series of articles that Clem linked to gives details and 
> algorithms.

I only vaguely understand those three paragraphs as they are deeper 
computer science than I've gone before.

I think I get the gist of them but could not explain them if my life 
depended upon it.

> In practical terms? Basically, don't worry about it too much. Egrep 
> will generate an NDFA simulation that's going to be acceptably fast 
> for all but the weirdest cases.

ACK

It sounds like I can make any reasonable extended regular expression a 
human can read and I'll probably be good.

Thank you for the detailed response Dan.  :-)



-- 
Grant. . . .
unix || die


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4017 bytes --]

  reply	other threads:[~2023-03-03  1:06 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 18:54 [COFF] " Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38   ` Grant Taylor via COFF
2023-03-02 23:01   ` Stuff Received
2023-03-02 23:46     ` Steffen Nurpmeso
2023-03-03  1:08     ` Grant Taylor via COFF
2023-03-03  2:10       ` Dave Horsfall
2023-03-03  3:34         ` Grant Taylor via COFF
2023-03-02 21:53 ` Dan Cross
2023-03-03  1:05   ` Grant Taylor via COFF [this message]
2023-03-03  3:04     ` Dan Cross
2023-03-03  3:53       ` Grant Taylor via COFF
2023-03-03 13:47         ` Dan Cross
2023-03-03 19:26           ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11   ` Dan Cross
2023-03-03 13:42     ` Ralph Corderoy
2023-03-03 19:19       ` Grant Taylor via COFF
2023-03-04 10:15         ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49           ` [COFF] " Tomasz Rola
2023-03-07 22:46             ` Tomasz Rola
2023-06-20 16:02           ` Michael Parson
2023-06-20 21:26             ` Tomasz Rola
2023-06-22 15:45               ` Michael Parson
2023-07-10  9:08                 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
2023-03-03 16:12   ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13     ` Dan Cross
2023-03-03 17:38       ` Ralph Corderoy
2023-03-03 19:09         ` Dan Cross
2023-03-03 19:36     ` Grant Taylor via COFF
2023-03-04 10:26       ` Ralph Corderoy
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31   ` Dan Cross
2023-03-04 10:07   ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01   ` Dan Cross
2023-03-06 21:49     ` Steffen Nurpmeso
2023-03-07  1:43     ` Larry McVoy
2023-03-07  4:01       ` Ed Bradford
2023-03-07 11:39         ` [COFF] " Ralph Corderoy
2023-03-07 18:31           ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22           ` Ed Bradford
2023-03-07 16:14         ` Dan Cross
2023-03-07 17:34           ` [COFF] " Ralph Corderoy
2023-03-07 18:33             ` [COFF] " Dan Cross
2023-03-07  4:19     ` Ed Bradford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net \
    --to=coff@tuhs.org \
    --cc=gtaylor@tnetconsulting.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).