Computer Old Farts Forum
 help / color / mirror / Atom feed
From: Dan Cross <crossd@gmail.com>
To: Grant Taylor <gtaylor@tnetconsulting.net>
Cc: coff@tuhs.org
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
Date: Fri, 3 Mar 2023 08:47:39 -0500	[thread overview]
Message-ID: <CAEoi9W5JBfNC8rEmcv=_fMvKRKcD1CocqnQzFv_oXjcKZ8M8zQ@mail.gmail.com> (raw)
In-Reply-To: <1519cce3-1c38-8a9c-cfdd-b39484bd163b@spamtrap.tnetconsulting.net>

On Thu, Mar 2, 2023 at 10:53 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
>[snip
> Here's an example of the full RE:
>
> ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+
> postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from
> [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$
>
> As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a
> larger RE and there is bounding & delimiting around the subpart.

Oh, for sure; to be clear, it was obvious that in the earlier
discussion the original was just part of something larger.

FWIW, this RE seems ok to me; the additional context makes it unlikely
to match something else accidentally.

> This is to match a standard message from postfix via standard SYSLOG.
>
> > Ah. I suspect this relies on domain knowledge about the format of log
> > lines to match reliably. Otherwise it could match, `___ 123 456:789`
> > which is probably not what you are expecting.
>
> Yep.
>
> Though said domain knowledge isn't anything special in and of itself.

It needn't be special.  The point is simply that there's some external
knowledge that can be brought to bear to guide the shape of the REs.
In this case, you know that log lines won't begin with `___ 123
456:789` or other similar junk.

> [snip]
> > Basically, a regular expression is a regular expression if you can build
> > a machine with no additional memory that can tell you whether or not a
> > given string matches the RE examining its input one character at a time.
>
> I /think/ that I could build a complex nested tree of switch statements
> to test each character to see if things match what they should or not.
> Though I would need at least one variable / memory to hold absolutely
> minimal state to know where I am in the switch tree.  I think a number
> to identify the switch statement in question would be sufficient.  So
> I'm guessing two bytes of variable and uncounted bytes of program code.

Kinda. The "machine" in this case is actually an abstraction, like a
Turing machine. The salient point here is that REs map to finite state
machines, and in particular, one need not keep (say) a stack of prior
states when simulating them. Note that even in an NDFA simulation,
where one keeps track of what states one may be in, one doesn't need
to keep track of how one got into those states.

Obviously in a real implementation you've got the program counter,
register contents, local variables, etc, all of which consume "memory"
in the conventional sense. But the point is that you don't need
additional memory proportional to anything other than the size of the
RE. DFA implementation could be implemented entirely with `switch` and
`goto` if one wanted, as opposed to a bunch of mutually recursive
function calls, NDFA simulation similarly except that you need some
(bounded) additional memory to hold the active set of states. Contrast
this with a pushdown automata, which can parse a context-free
language, in which a stack is maintained that can store additional
information relative to the input (for example, an already seen
character). Pushdown automata can, for example, recognize matched
parenthesis while regular languages cannot.

Anyway, sorry, this is all rather more theoretical than is perhaps
interesting or useful. Bottom line is, I think your REs are probably
fine. `egrep` will complain at you if they are not, and I wouldn't
worry too much about optimizing them: I'd "stop" whenever you're happy
that you've got something understandable that matches what you want it
to match.

        - Dan C.

  reply	other threads:[~2023-03-03 13:48 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 18:54 [COFF] " Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38   ` Grant Taylor via COFF
2023-03-02 23:01   ` Stuff Received
2023-03-02 23:46     ` Steffen Nurpmeso
2023-03-03  1:08     ` Grant Taylor via COFF
2023-03-03  2:10       ` Dave Horsfall
2023-03-03  3:34         ` Grant Taylor via COFF
2023-03-02 21:53 ` Dan Cross
2023-03-03  1:05   ` Grant Taylor via COFF
2023-03-03  3:04     ` Dan Cross
2023-03-03  3:53       ` Grant Taylor via COFF
2023-03-03 13:47         ` Dan Cross [this message]
2023-03-03 19:26           ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11   ` Dan Cross
2023-03-03 13:42     ` Ralph Corderoy
2023-03-03 19:19       ` Grant Taylor via COFF
2023-03-04 10:15         ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49           ` [COFF] " Tomasz Rola
2023-03-07 22:46             ` Tomasz Rola
2023-06-20 16:02           ` Michael Parson
2023-06-20 21:26             ` Tomasz Rola
2023-06-22 15:45               ` Michael Parson
2023-07-10  9:08                 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
2023-03-03 16:12   ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13     ` Dan Cross
2023-03-03 17:38       ` Ralph Corderoy
2023-03-03 19:09         ` Dan Cross
2023-03-03 19:36     ` Grant Taylor via COFF
2023-03-04 10:26       ` Ralph Corderoy
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31   ` Dan Cross
2023-03-04 10:07   ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01   ` Dan Cross
2023-03-06 21:49     ` Steffen Nurpmeso
2023-03-07  1:43     ` Larry McVoy
2023-03-07  4:01       ` Ed Bradford
2023-03-07 11:39         ` [COFF] " Ralph Corderoy
2023-03-07 18:31           ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22           ` Ed Bradford
2023-03-07 16:14         ` Dan Cross
2023-03-07 17:34           ` [COFF] " Ralph Corderoy
2023-03-07 18:33             ` [COFF] " Dan Cross
2023-03-07  4:19     ` Ed Bradford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAEoi9W5JBfNC8rEmcv=_fMvKRKcD1CocqnQzFv_oXjcKZ8M8zQ@mail.gmail.com' \
    --to=crossd@gmail.com \
    --cc=coff@tuhs.org \
    --cc=gtaylor@tnetconsulting.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).