From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FROM,HTML_MESSAGE,MAILING_LIST_MULTI autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 22291 invoked from network); 3 Mar 2023 03:05:22 -0000 Received: from minnie.tuhs.org (50.116.15.146) by inbox.vuxu.org with ESMTPUTF8; 3 Mar 2023 03:05:22 -0000 Received: from minnie.tuhs.org (localhost [IPv6:::1]) by minnie.tuhs.org (Postfix) with ESMTP id 076BE4330A; Fri, 3 Mar 2023 13:05:20 +1000 (AEST) Received: from mail-lf1-x130.google.com (mail-lf1-x130.google.com [IPv6:2a00:1450:4864:20::130]) by minnie.tuhs.org (Postfix) with ESMTPS id 5AC7A432FE for ; Fri, 3 Mar 2023 13:05:11 +1000 (AEST) Received: by mail-lf1-x130.google.com with SMTP id i9so1891896lfc.6 for ; Thu, 02 Mar 2023 19:05:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=EQh0n0KMxu1PAbNhdBG008HN/bO6OAfQB3pkqe3owRY=; b=S7YMVMCTbomh/pGReOBRslIQgbXsYN83WmZEhfh4BJXTNea3hXxABWbIyO0Q3LBBi7 jr22RvdnrBLQPtHSBI7hhr4NR4ViUZ266UiS1+Q8WW4OwCWdt3nRWi/NxOplVvYHxTi7 QtiS9GnIwu1vs8ihagayEicmpp+ZoGe9G9yCJUV+oSUDXE6MVC3P2HqLG2wmbKWfhlSK Mio6Jipi5aHPMUBIILubhVYNq6obwtMHGw7cVzNNaORBkmwrkLXtIGl7ER/LfXscI5QN QFajVAHR0+9wld44DUrET5PJ70KrqUUQg9GtXYAAq1UGSOhW2yJyhCoLKbFmo0SCrLPw hVJA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=EQh0n0KMxu1PAbNhdBG008HN/bO6OAfQB3pkqe3owRY=; b=Q0QV8TDTxb1VMgFBekbjXNB5MxEvyo1rHtJ2YAMChOZc9Dby3gfXJDcUYAl034Ga8E cYlb9nZ1B4MTU+VilQCl3WQyIiC7RZ/LyKLUVmWSapuUFfGxEvaav37X7cOx13oOfwio UmhO7GbfSzWeOpxKjbuvaflecbGxYufu97hOhh4kqrn2YEwA4XswGj4Bp2MQs86z/Bp7 DQvzwSzumufTCl2bf+9WPK54E09ZRImzj/8ZdYpD3zBAcEeHmtn/av1EVTenDqU4AVdp TRE9/4YsfCQVRUmbW20Tk48MDq3X79Pi3wGe5v6oLyaK7582wLfPKwHcBaPKu8yoy47I ddJQ== X-Gm-Message-State: AO0yUKXOCUnwzj+HD8oO+/h8BJlMfPUd2360ipMx5qwYDO865H26tdom O57Jjm93Y1kuxQEMxggm9VGQ4unHG1j8iOT+JP6vlF2p X-Google-Smtp-Source: AK7set9mopWQsM7OOs6GxloOK9UquwHWJ2PqmJl9E2WXspYgRK1UMJlIarFM6RZyOUHaqE75DyQ9Q+cJFs+hdIVpYe4= X-Received: by 2002:a05:6512:102c:b0:4d5:ca32:7bc3 with SMTP id r12-20020a056512102c00b004d5ca327bc3mr134892lfr.10.1677812708709; Thu, 02 Mar 2023 19:05:08 -0800 (PST) MIME-Version: 1.0 References: <8d1de5c8-1f34-3d37-395d-0f1da7b062ec@spamtrap.tnetconsulting.net> <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> In-Reply-To: <688396c8-7a25-5cd6-282c-49f1b13117d4@spamtrap.tnetconsulting.net> From: Dan Cross Date: Thu, 2 Mar 2023 22:04:32 -0500 Message-ID: To: Grant Taylor Content-Type: multipart/alternative; boundary="00000000000080991205f5f63689" Message-ID-Hash: 72PFSD5PW3YFFTTBR4PISO5YVGAPPR2C X-Message-ID-Hash: 72PFSD5PW3YFFTTBR4PISO5YVGAPPR2C X-MailFrom: crossd@gmail.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: coff@tuhs.org X-Mailman-Version: 3.3.6b1 Precedence: list Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep. List-Id: Computer Old Farts Forum Archived-At: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: --00000000000080991205f5f63689 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Thu, Mar 2, 2023 at 8:06=E2=80=AFPM Grant Taylor via COFF wrote: > On 3/2/23 2:53 PM, Dan Cross wrote: > > Well, obviously the former matches any sequence 3 of > > alpha-numerics/underscores at the beginning of a string, while the > > latter only matches abbreviations of months in the western calendar; > > that is, the two REs are matching very different things (the latter > > is a strict subset of the former). > > I completely agree with you. That's also why I'm wanting to start > utilizing the latter, more specific RE. But I don't know where the line > of over complicating things is to avoid crossing it. I guess what I'm saying is, match what you want to match and don't sweat the small stuff. > > But I suspect you mean in a more general sense. > > Yes and no. Does the comment above clarify at all? Not exactly. :-) What I understand you to mean, based on this and the rest of your note, is that you want to find a good division point between overly specific, complex REs and simpler, easy to understand REs that are less specific. The danger with the latter is that they may match things you don't intend, while the former are harder to maintain and (arguably) more brittle. I can sympathize. > > ...do you really want to match a space, a colon and a single digit > > 11 times ... > > Yes. > > > ... in a single string? > > What constitutes a single string? ;-) I sort of rhetorically ask. For the purposes of grep/egrep, that'll be a logical "line" of text, terminated by a newline, though the newline itself isn't considered part of the text for matching. I believe the `-z` option can be used to set a NUL byte as the "line" terminator; presumably this lets one match strings with embedded newlines, though I haven't tried. > The log lines start with > > MMM dd hh:mm:ss > > Where: > - MMM is the month abbreviation > - dd is the day of the month > - hh is the hour of the day > - mm is the minute of the hour > - ss is the second of the minute > > So, yes, there are eleven characters that fall into the class consisting > of a space or a colon or a number. > > Is that a single string? It depends what you're looking at, the > sequences of non white space in the log? No. The patter that I'm > matching ya. "string" in this context is the input you're attempting to match against. `egrep` will attempt to match your pattern against each "line" of text it reads from the files its searching. That is, each line in your log file(s). But consider what `[ :[:digit:]]{11}` means: you've got a character class consisting of space, colon and a digit; {11} means "match any of the characters in that class exactly 11 times" (as opposed to other variations on the '{}' syntax that say "at least m times", "at most n times", or "between n and m times"). But that'll match all sorts of things that don't look like 'dd hh:mm:ss': term% egrep '[ :[:digit:]]{11}' 11111111111 11111111111 111111111 1111111111111 1111111111111 :::::::::::::::: :::::::::::::::: aaaa bbbbb aaaa bbbbb term% (The first line is my typing; the second is output from egrep except for the short line of 9 '1's, for which egrep had no output. That last two lines are matching space characters and egrep echoing the match, but I'm guessing gmail will eat those.) Note that there are inputs with more than 11 characters that match; this is because there is some 11-character substring that matches the RE in those lines. In any event, I suspect this would generally not be what you want. But if nothing else in your input can match the RE (which you might know a priori because of domain knowledge about whatever is generating those logs) then it's no big deal, even if the RE was capable of matching more things generally. > > Using character classes would greatly simplify what you're trying to > > do. It seems like this could be simplified to (untested) snippet: > > Agreed. > > I'm starting with the examples that came with; "^\w{3} [ > :[:digit:]]{11}", the logcheck package that I'm working with and > evaluating what I want to do. Ah. I suspect this relies on domain knowledge about the format of log lines to match reliably. Otherwise it could match, `___ 123 456:789` which is probably not what you are expecting. > I actually like the idea of dividing out the following: > > - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec > - months that have 30 days: Apr, Jun, Sep, Nov > - month that have 28/29 days: Feb Sure. One nice thing about `egrep` et al is that you can put the REs into a file and include them with `-f`, as opposed to having them all directly on the command line. > > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9] > > Aside: Why do you have the double square brackets in "[12][[0-9]]"? Typo. :-) > > For this, I'd probably eschew `[:digit:]`. Named character classes > > are for handy locale support, or in lieu of typing every character > > in the alphabet (though we can use ranges to abbreviate that), but > > it kind of seems like that's not coming into play here and, IMHO, > > `[0-9]` is clearer in context. > > ACK > > "[[:digit:]]+" was a construct that I'm parroting. It and > [.:[:xdigit:]]+ are good for some things. But they definitely aren't > the best for all things. > > Hence trying to find the line of being more accurate without going too far. > > > It's not clear to me that dates, in their generality, can be > > matched with regular expressions. Consider leap years; you'd almost > > necessarily have to use backtracking for that, but I admit I haven't > > thought it through. > > Given the context that these extended regular expressions are going to > be used in, logcheck -- filtering out known okay log entries to email > what doesn't get filtered -- I'm okay with having a few things slip > through like leap day / leap seconds / leap frogs. That seems reasonable. > > `\w` is a GNU extension; I'd probably avoid it on portability grounds > > (though `\b` is very handy). > > I hear, understand, and acknowledge your concern. At present, these > filters are being used in a package; logcheck, Aside: I found the note on it's website amusing: Brought to you by the UK's best gambling sites! "Only gamble with what you can afford to lose." Yikes! > which I believe is > specific to Debian and ilk. As such, GNU grep is very much a thing. I'd proceed with caution here; it also seems to be in the FreeBSD and DragonFly ports collections and Homebrew on the Mac (but so is GNU grep for all of those). > I'm also not a fan of the use of `\w` and would prefer to (...|...) things. Yeah. IMHO `\w` is too general for what you're trying to do. > > The thing about regular expressions is that they describe regular > > languages, and regular languages are those for which there exists a > > finite automaton that can recognize the language. An important class > > of finite automata are deterministic finite automata; by definition, > > recognition by such automata are linear in the length of the input. > > > > However, construction of a DFA for any given regular expression can be > > superlinear (in fact, it can be exponential) so practically speaking, > > we usually construct non-deterministic finite automata (NDFAs) and > > "simulate" their execution for matching. NDFAs generalize DFAs (DFAs > > are a subset of NDFAs, incidentally) in that, in any non-terminal > > state, there can be multiple subsequent states that the machine can > > transition to given an input symbol. When executed, for any state, > > the simulator will transition to every permissible subsequent state > > simultaneously, discarding impossible states as they become evident. > > > > This implies that NDFA execution is superlinear, but it is bounded, > > and is O(n*m*e), where n is the length of the input, m is the number > > of nodes in the state transition graph corresponding to the NDFA, and > > e is the maximum number of edges leaving any node in that graph (for > > a fully connected graph, that would m, so this can be up to O(n*m^2)). > > Construction of an NDFA is O(m), so while it's slower to execute, it's > > actually possible to construct in a reasonable amount of time. Russ's > > excellent series of articles that Clem linked to gives details and > > algorithms. > > I only vaguely understand those three paragraphs as they are deeper > computer science than I've gone before. > > I think I get the gist of them but could not explain them if my life > depended upon it. Basically, a regular expression is a regular expression if you can build a machine with no additional memory that can tell you whether or not a given string matches the RE examining its input one character at a time. > > In practical terms? Basically, don't worry about it too much. Egrep > > will generate an NDFA simulation that's going to be acceptably fast > > for all but the weirdest cases. > > ACK > > It sounds like I can make any reasonable extended regular expression a > human can read and I'll probably be good. I think that's about right. > Thank you for the detailed response Dan. :-) Sure thing! - Dan C. --00000000000080991205f5f63689 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Thu, Mar 2, 2023 at 8:06=E2=80=AFPM Grant Taylor via CO= FF <coff@tuhs.org> wrote:
>= ; On 3/2/23 2:53 PM, Dan Cross wrote:
> > Well, obviously the form= er matches any sequence 3 of
> > alpha-numerics/underscores at the= beginning of a string, while the
> > latter only matches abbrevia= tions of months in the western calendar;
> > that is, the two REs = are matching very different things (the latter
> > is a strict sub= set of the former).
>
> I completely agree with you.=C2=A0 That= 's also why I'm wanting to start
> utilizing the latter, more= specific RE.=C2=A0 But I don't know where the line
> of over com= plicating things is to avoid crossing it.

I guess what I'm sayin= g is, match what you want to match and don't sweat the small stuff.
=
> > But I suspect you mean in a more general sense.
>
&g= t; Yes and no.=C2=A0 Does the comment above clarify at all?

Not exac= tly. :-)

What I understand you to mean, based on this and the rest o= f your note, is that you want to find a good division point between overly = specific, complex REs and simpler, easy to understand REs that are less spe= cific. The danger with the latter is that they may match things you don'= ;t intend, while the former are harder to maintain and (arguably) more brit= tle. I can sympathize.

> > ...do you really want to match a sp= ace, a colon and a single digit
> > 11 times ...
>
> Y= es.
>
> > ... in a single string?
>
> What const= itutes a single string? =C2=A0;-) =C2=A0I sort of rhetorically ask.

= For the purposes of grep/egrep, that'll be a logical "line" o= f text, terminated by a newline, though the newline itself isn't consid= ered part of the text for matching. I believe the `-z` option can be used t= o set a NUL byte as the "line" terminator; presumably this lets o= ne match strings with embedded newlines, though I haven't tried.
> The log lines start with
>
> MMM dd hh:mm:ss
>
&= gt; Where:
> =C2=A0 - MMM is the month abbreviation
> =C2=A0 - = dd =C2=A0is the day of the month
> =C2=A0 - hh =C2=A0is the hour of t= he day
> =C2=A0 - mm =C2=A0is the minute of the hour
> =C2=A0 -= ss =C2=A0is the second of the minute
>
> So, yes, there are el= even characters that fall into the class consisting
> of a space or a= colon or a number.
>
> Is that a single string?=C2=A0 It= depends what you're looking at, the
> sequences of non white spa= ce in the log? No.=C2=A0 The patter that I'm
> matching ya.
=

"string" in this context is the input you're a= ttempting to match against. `egrep` will attempt to match your pattern agai= nst each "line" of text it reads from the files its searching. Th= at is, each line in your log file(s).

But consider what `[ :[= :digit:]]{11}` means: you've got a character class consisting of space,= colon and a digit; {11} means "match any of the characters in that cl= ass exactly 11 times" (as opposed to other variations on the '{}&#= 39; syntax that say "at least m times", "at most n times&quo= t;, or "between n and m times"). But that'll match all sorts = of things that don't look like 'dd hh:mm:ss':

term% egre= p '[ :[:digit:]]{11}'
11111111111
11111111111
1111111111111111111111
1111111111111
::::::::::::::::
::::::::::::::::
= aaaa =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0bbbbb
aaaa =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0bbbbb


term%

(The first line is my ty= ping; the second is output from egrep except for the short line of 9 '1= 's, for which egrep had no output. That last two lines are matching spa= ce characters and egrep echoing the match, but I'm guessing gmail will = eat those.)

Note that there are inputs with more than 11= characters that match; this is because there is some 11-character substrin= g that matches the RE=C2=A0 in those lines. In any event, I suspect this wo= uld generally not be what you want. But if nothing else in your input can m= atch the RE (which you might know a priori because of domain knowledge abou= t whatever is generating those logs) then it's no big deal, even if the= RE was capable of matching more things generally.

> &= gt; Using character classes would greatly simplify what you're trying t= o
> > do. It seems like this could be simplified to (untested) sni= ppet:
>
> Agreed.
>
> I'm starting with the exa= mples that came with; "^\w{3} [
> :[:digit:]]{11}", the log= check package that I'm working with and
> evaluating what I want = to do.

Ah. I suspect this relies on domain knowled= ge about the format of log lines to match reliably. Otherwise it could matc= h, `___ 123 456:789` which is probably not what you are expecting.

> I actually like the idea of dividing out the following:
><= br>> =C2=A0 - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, an= d Dec
> =C2=A0 - months that have 30 days: Apr, Jun, Sep, Nov
>= =C2=A0 - month that have 28/29 days: Feb

Sure.=C2=A0 One nice thing= about `egrep` et al is that you can put the REs into a file and include th= em with `-f`, as opposed to having them all directly on the command line.
> > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]=
>
> Aside: =C2=A0Why do you have the double square brackets in= "[12][[0-9]]"?

Typo.=C2=A0 :-)

> > For this,= I'd probably eschew `[:digit:]`. Named character classes
> > = are for handy locale support, or in lieu of typing every character
> = > in the alphabet (though we can use ranges to abbreviate that), but
= > > it kind of seems like that's not coming into play here and, I= MHO,
> > `[0-9]` is clearer in context.
>
> ACK
>= ;
> "[[:digit:]]+" was a construct that I'm parroting.= =C2=A0 It and
> [.:[:xdigit:]]+ are good for some things.=C2=A0 But t= hey definitely aren't
> the best for all things.
>
> = Hence trying to find the line of being more accurate without going too far.=
>
> > It's not clear to me that dates, in their general= ity, can be
> > matched with regular expressions.=C2=A0 Consider l= eap years; you'd almost
> > necessarily have to use backtracki= ng for that, but I admit I haven't
> > thought it through.
= >
> Given the context that these extended regular expressions are = going to
> be used in, logcheck -- filtering out known okay log entri= es to email
> what doesn't get filtered -- I'm okay with havi= ng a few things slip
> through like leap day / leap seconds / leap fr= ogs.

That seems reasonable.

> > `\w` = is a GNU extension; I'd probably avoid it on portability grounds
>= ; > (though `\b` is very handy).
>
> I hear, understand, and= acknowledge your concern.=C2=A0 At present, these
> filters are bein= g used in a package; logcheck,

Aside: I found the = note on it's website amusing: Brought to you by the UK's best gambl= ing sites! "Only gamble with what you can afford to lose." Yikes!=

> which I believe is
> specific to Debia= n and ilk.=C2=A0 As such, GNU grep is very much a thing.

I'd proceed with caution here; it also seems to be in the FreeBS= D and DragonFly ports collections and Homebrew on the Mac (but so is GNU gr= ep for all of those).

> I'm also not a fan of the use = of `\w` and would prefer to (...|...) things.

Yeah. IMHO `\w` is too= general for what you're trying to do.

> > The thing about= regular expressions is that they describe regular
> > languages, = and regular languages are those for which there exists a
> > finit= e automaton that can recognize the language. An important class
> >= ; of finite automata are deterministic finite automata; by definition,
&= gt; > recognition by such automata are linear in the length of the input= .
> >
> > However, construction of a DFA for any given re= gular expression can be
> > superlinear (in fact, it can be expone= ntial) so practically speaking,
> > we usually construct non-deter= ministic finite automata (NDFAs) and
> > "simulate" thei= r execution for matching. NDFAs generalize DFAs (DFAs
> > are a su= bset of NDFAs, incidentally) in that, in any non-terminal
> > stat= e, there can be multiple subsequent states that the machine can
> >= ; transition to given an input symbol. When executed, for any state,
>= ; > the simulator will transition to every permissible subsequent state<= br>> > simultaneously, discarding impossible states as they become ev= ident.
> >
> > This implies that NDFA execution is superl= inear, but it is bounded,
> > and is O(n*m*e), where n is the leng= th of the input, m is the number
> > of nodes in the state transit= ion graph corresponding to the NDFA, and
> > e is the maximum numb= er of edges leaving any node in that graph (for
> > a fully connec= ted graph, that would m, so this can be up to O(n*m^2)).
> > Const= ruction of an NDFA is O(m), so while it's slower to execute, it's> > actually possible to construct in a reasonable amount of time. = Russ's
> > excellent series of articles that Clem linked to gi= ves details and
> > algorithms.
>
> I only vaguely und= erstand those three paragraphs as they are deeper
> computer science = than I've gone before.
>
> I think I get the gist of them b= ut could not explain them if my life
> depended upon it.
Basically, a regular expression is a regular expression if you= can build a machine with no additional memory that can tell you whether or= not a given string matches the RE examining its input one character at a t= ime.

> > In practical terms? Basically, don't worry= about it too much. Egrep
> > will generate an NDFA simulation tha= t's going to be acceptably fast
> > for all but the weirdest c= ases.
>
> ACK
>
> It sounds like I can make any rea= sonable extended regular expression a
> human can read and I'll p= robably be good.

I think that's about right.

> Thank you for the detailed response Dan. =C2=A0:-)

Sure thing!

=C2=A0 = =C2=A0 =C2=A0 =C2=A0 - Dan C.

--00000000000080991205f5f63689--