[COFF] Re: Requesting thoughts on extended regular expressions in grep.

Computer Old Farts Forum

From: Grant Taylor via COFF <coff@tuhs.org>
To: coff@tuhs.org
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
Date: Thu, 2 Mar 2023 20:53:08 -0700
Message-ID: <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net>
In-Reply-To: <CAEoi9W4BjrcyEQdqUigfd+Oa3WYh-H_B4kh84XOoqRKrUmMm2A@mail.gmail.com>

On 3/2/23 8:04 PM, Dan Cross wrote:
> I guess what I'm saying is, match what you want to match and don't sweat 
> the small stuff.

ACK

> Not exactly. :-)
> 
> What I understand you to mean, based on this and the rest of your note, 
> is that you want to find a good division point between overly specific, 
> complex REs and simpler, easy to understand REs that are less specific. 
> The danger with the latter is that they may match things you don't 
> intend, while the former are harder to maintain and (arguably) more 
> brittle. I can sympathize.

You got it.

> For the purposes of grep/egrep, that'll be a logical "line" of text, 
> terminated by a newline, though the newline itself isn't considered part 
> of the text for matching. I believe the `-z` option can be used to set a 
> NUL byte as the "line" terminator; presumably this lets one match 
> strings with embedded newlines, though I haven't tried.

Fair enough.  That's also sort of what I thought might be the case.
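
A minimal sketch of the `-z` behaviour discussed above, assuming GNU grep (the option is not in POSIX). The sample data and the function name `count_nul_records` are invented for illustration. Each NUL-terminated record is treated as one "line", so a record may itself contain newlines.

```shell
# Hypothetical sample input: two NUL-terminated records. The first record
# contains an embedded newline ("foo\nbar"); the second is "baz".
count_nul_records() {
    # -z: records end at a NUL byte instead of a newline (GNU grep)
    # -c: print the number of matching records rather than the records
    printf 'foo\nbar\0baz\0' | grep -zc -- "$1"
}

count_nul_records 'bar'   # only the first record contains "bar"
count_nul_records 'ba'    # both records contain "ba"
```

Whether `.` or `^`/`$` treat the embedded newline specially under `-z` is not shown here; that is worth checking against the grep version in use.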

> "string" in this context is the input you're attempting to match 
> against. `egrep` will attempt to match your pattern against each "line" 
> of text it reads from the files it's searching. That is, each line in 
> your log file(s).

*nod*

> But consider what `[ :[:digit:]]{11}` means: you've got a character 
> class consisting of space, colon and a digit; {11} means "match any of 
> the characters in that class exactly 11 times" (as opposed to other 
> variations on the '{}' syntax that say "at least m times", "at most n 
> times", or "between n and m times").

Yep, I'm well aware of that.

> But that'll match all sorts of things that don't look like 'dd 
> hh:mm:ss':

That's one of the reasons that I'm interested in coming up with a more 
precise regular expression ... without being overly complex.
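
One way to picture the trade-off is the sketch below. `loose` is the class-and-count form from the thread; `tight` is a hypothetical, more precise alternative for a space-padded syslog 'dd hh:mm:ss' stamp and is an assumption, not something proposed in the thread. The helper name `matches` and the sample strings are likewise made up.

```shell
# The class-and-count form from the thread: any 11 of space, colon, digit.
loose='[ :[:digit:]]{11}'

# A hypothetical tighter form: a space-padded day 1-31, hour 00-23,
# minute and second 00-59. It is still exactly 11 characters wide.
tight='( [1-9]|[12][0-9]|3[01]) ([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]'

# Succeeds (exit status 0) when the string in $2 contains a match for
# the extended regular expression in $1.
matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

for sample in ' 2 20:53:08' '11111111111' '::::: :::::' '31 23:59:59'; do
    for re in "$loose" "$tight"; do
        if matches "$re" "$sample"; then verdict='match'; else verdict='no match'; fi
        printf '%-14s %-9s %s\n' "[$sample]" "$verdict" "$re"
    done
done
```

The tighter form is roughly three times as long, which is the maintenance cost being weighed here.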

> (The first line is my typing; the second is output from egrep except for 
> the short line of 9 '1's, for which egrep had no output. The last two 
> lines are matching space characters and egrep echoing the match, but I'm 
> guessing gmail will eat those.)
> 
> Note that there are inputs with more than 11 characters that match; this 
> is because there is some 11-character substring that matches the RE in 
> those lines. In any event, I suspect this would generally not be what 
> you want. But if nothing else in your input can match the RE (which you 
> might know a priori because of domain knowledge about whatever is 
> generating those logs) then it's no big deal, even if the RE was capable 
> of matching more things generally.

Yep.
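
The substring effect can be sketched as follows. The function names and sample strings are assumptions for illustration; `-x` is the POSIX option that requires the whole line to match.

```shell
re='[ :[:digit:]]{11}'

# Unanchored: succeeds if any 11-character substring of the line matches.
matches_anywhere() {
    printf '%s\n' "$1" | grep -Eq -- "$re"
}

# -x: succeeds only if the entire line matches, as if the RE were
# wrapped in ^( ... )$.
matches_whole_line() {
    printf '%s\n' "$1" | grep -Exq -- "$re"
}

matches_anywhere   'abc 12345678901 xyz' && echo 'substring match'
matches_whole_line 'abc 12345678901 xyz' || echo 'no whole-line match'
```

In the full RE the same bounding comes from the literal text on either side of the sub-expression rather than from `-x`.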

Here's an example of the full RE:

^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$

As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a 
larger RE and there is bounding & delimiting around the subpart.

This is to match a standard message from postfix via standard SYSLOG.
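
A quick way to exercise the full RE is sketched below, assuming GNU grep (`\w` is a GNU extension to EREs). The log lines are hypothetical: the hostnames, PIDs, and addresses are invented, and `matches_re` is just a helper name.

```shell
# The RE from the message, written on a single line.
re='^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$'

# Hypothetical sample lines (invented hosts, PIDs, and addresses).
good4='Mar  2 20:53:08 mail.example.net postfix/msa/smtpd[12345]: timeout after STARTTLS from unknown[192.0.2.10]'
good6='Mar 12 04:05:06 mail postfix/msa/smtpd[7]: timeout after STARTTLS from host.example.org[2001:db8::1]'
other='Mar  2 20:53:08 mail.example.net postfix/msa/smtpd[12345]: connect from unknown[192.0.2.10]'

# Succeeds when the line in $1 matches the RE.
matches_re() {
    printf '%s\n' "$1" | grep -Eq -- "$re"
}

for line in "$good4" "$good6" "$other"; do
    if matches_re "$line"; then echo "match:    $line"; else echo "no match: $line"; fi
done
```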

> Ah. I suspect this relies on domain knowledge about the format of log 
> lines to match reliably. Otherwise it could match, `___ 123 456:789` 
> which is probably not what you are expecting.

Yep.

Though said domain knowledge isn't anything special in and of itself.

> Sure.  One nice thing about `egrep` et al is that you can put the REs 
> into a file and include them with `-f`, as opposed to having them all 
> directly on the command line.

Yep.  logcheck makes extensive use of many files like this to do its work.
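
A minimal sketch of the `-f` approach, assuming GNU grep for `\w`. The second pattern, the temporary file, and the function name `count_with_pattern_file` are illustrative assumptions; this is not how logcheck itself lays out its rule files.

```shell
# Write two patterns, one ERE per line, to a temporary file, then count
# the lines on standard input that match either of them.
count_with_pattern_file() {
    patterns=$(mktemp) || return 1
    cat > "$patterns" <<'EOF'
^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: lost connection after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
EOF
    # -f FILE: read patterns from FILE; a line matching any one is selected.
    grep -Ec -f "$patterns"
    status=$?
    rm -f "$patterns"
    return "$status"
}
```

Usage would be along the lines of `count_with_pattern_file < /path/to/logfile`, where the path is whatever log is being checked.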

> Typo.  :-)

ACKK

> That seems reasonable.

Thank you for the logic CRC.

> Aside: I found the note on its website amusing: Brought to you by the 
> UK's best gambling sites! "Only gamble with what you can afford to 
> lose." Yikes!

Um ... that's concerning.

> I'd proceed with caution here; it also seems to be in the FreeBSD and 
> DragonFly ports collections and Homebrew on the Mac (but so is GNU grep 
> for all of those).

Fair enough.

My use case is on Linux where GNU egrep is a thing.

> Yeah. IMHO `\w` is too general for what you're trying to do.

I think that `\w` is a good primer, but not where I want things to end 
up long term.
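
One possible tightening is sketched below, assuming GNU grep for `\w`. Spelling out the month abbreviations is a hypothetical replacement for `\w{3}`, not something settled in the thread; `prefix_matches` and the sample strings are made up, apart from the `___ 123 456:789` counterexample quoted earlier.

```shell
# Prefix as written in the thread.
loose_prefix='^\w{3} [ :[:digit:]]{11} '

# Hypothetical tighter variant: the twelve syslog month abbreviations.
month='(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
tight_prefix="^$month [ :[:digit:]]{11} "

# Succeeds when the string in $2 contains a match for the ERE in $1.
prefix_matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

prefix_matches "$loose_prefix" '___ 123 456:789 host' && echo 'loose prefix accepts the counterexample'
prefix_matches "$tight_prefix" '___ 123 456:789 host' || echo 'tight prefix rejects it'
```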

> Basically, a regular expression is a regular expression if you can build 
> a machine with no additional memory that can tell you whether or not a 
> given string matches the RE examining its input one character at a time.

I /think/ that I could build a complex nested tree of switch statements 
to test each character to see if things match what they should or not. 
Though I would need at least one variable / memory to hold absolutely 
minimal state to know where I am in the switch tree.  I think a number 
to identify the switch statement in question would be sufficient.  So 
I'm guessing two bytes of variable and uncounted bytes of program code.
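
A rough sketch of that idea in POSIX shell, using `case` in place of switch statements. The pattern it recognises, `^[0-9]{2}:[0-9]{2}:[0-9]{2}$`, is a small hypothetical stand-in rather than the full RE, and `match_hms` is an invented name. The single `state` number is the only thing remembered between characters.

```shell
# Hand-written state machine for ^[0-9]{2}:[0-9]{2}:[0-9]{2}$
match_hms() {
    rest=$1
    state=0                      # which arm of the "switch tree" we are in
    while [ -n "$rest" ]; do
        tail=${rest#?}           # everything after the first character
        c=${rest%"$tail"}        # the first character itself
        rest=$tail
        case $state in
            0|1|3|4|6|7)         # these states expect a digit
                case $c in
                    [0-9]) state=$((state + 1)) ;;
                    *) return 1 ;;
                esac ;;
            2|5)                 # these states expect a colon
                case $c in
                    :) state=$((state + 1)) ;;
                    *) return 1 ;;
                esac ;;
            *) return 1 ;;       # input continues past the accepting state
        esac
    done
    [ "$state" -eq 8 ]           # accept only if all eight characters were seen
}

match_hms '20:53:08' && echo 'accepted'
match_hms '20:53:8'  || echo 'rejected'
```

The variables `rest`, `tail`, and `c` only hold the input cursor; the matching logic itself depends on nothing but `state` and the current character.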

> I think that's about right.

Thank you again Dan.

> Sure thing!

:-)

-- 
Grant. . . .
unix || die
