On Thu, Mar 2, 2023 at 8:06=E2=80=AFPM Grant Taylor via CO=
FF <coff@tuhs.org> wrote:

>= ; On 3/2/23 2:53 PM, Dan Cross wrote:

> > Well, obviously the form= er matches any sequence 3 of

> > alpha-numerics/underscores at the= beginning of a string, while the

> > latter only matches abbrevia= tions of months in the western calendar;

> > that is, the two REs = are matching very different things (the latter

> > is a strict sub= set of the former).

>

> I completely agree with you.=C2=A0 That= 's also why I'm wanting to start

> utilizing the latter, more= specific RE.=C2=A0 But I don't know where the line

> of over com= plicating things is to avoid crossing it.

I guess what I'm sayin= g is, match what you want to match and don't sweat the small stuff.

=

> > But I suspect you mean in a more general sense.

>

&g= t; Yes and no.=C2=A0 Does the comment above clarify at all?

Not exac= tly. :-)

What I understand you to mean, based on this and the rest o= f your note, is that you want to find a good division point between overly = specific, complex REs and simpler, easy to understand REs that are less spe= cific. The danger with the latter is that they may match things you don'= ;t intend, while the former are harder to maintain and (arguably) more brit= tle. I can sympathize.

> > ...do you really want to match a sp= ace, a colon and a single digit

> > 11 times ...

>

> Y= es.

>

> > ... in a single string?

>

> What const= itutes a single string? =C2=A0;-) =C2=A0I sort of rhetorically ask.

= For the purposes of grep/egrep, that'll be a logical "line" o= f text, terminated by a newline, though the newline itself isn't consid= ered part of the text for matching. I believe the `-z` option can be used t= o set a NUL byte as the "line" terminator; presumably this lets o= ne match strings with embedded newlines, though I haven't tried.

> The log lines start with

>

> MMM dd hh:mm:ss

>

&= gt; Where:

> =C2=A0 - MMM is the month abbreviation

> =C2=A0 - = dd =C2=A0is the day of the month

> =C2=A0 - hh =C2=A0is the hour of t= he day

> =C2=A0 - mm =C2=A0is the minute of the hour

> =C2=A0 -= ss =C2=A0is the second of the minute

>

> So, yes, there are el= even characters that fall into the class consisting

> of a space or a= colon or a number.

>

"string" in this context is the input you're a=
ttempting to match against. `egrep` will attempt to match your pattern agai=
nst each "line" of text it reads from the files its searching. Th=
at is, each line in your log file(s).

But consider what `[ :[=
:digit:]]{11}` means: you've got a character class consisting of space,=
colon and a digit; {11} means "match any of the characters in that cl=
ass exactly 11 times" (as opposed to other variations on the '{}&#=
39; syntax that say "at least m times", "at most n times&quo=
t;, or "between n and m times"). But that'll match all sorts =
of things that don't look like 'dd hh:mm:ss':

term% egre= p '[ :[:digit:]]{11}'

11111111111

11111111111

111111111

1111111111111

1111111111111

::::::::::::::::

::::::::::::::::

= aaaa =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0bbbbb

aaaa =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0bbbbb

term%

(The first line is my ty= ping; the second is output from egrep except for the short line of 9 '1= 's, for which egrep had no output. That last two lines are matching spa= ce characters and egrep echoing the match, but I'm guessing gmail will = eat those.)

> I actually like the idea of dividing out the following:

><= br>> =C2=A0 - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, an= d Dec

> =C2=A0 - months that have 30 days: Apr, Jun, Sep, Nov

>= =C2=A0 - month that have 28/29 days: Feb

Sure.=C2=A0 One nice thing= about `egrep` et al is that you can put the REs into a file and include th= em with `-f`, as opposed to having them all directly on the command line.

> > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]=

>

> Aside: =C2=A0Why do you have the double square brackets in= "[12][[0-9]]"?

Typo.=C2=A0 :-)

> > For this,= I'd probably eschew `[:digit:]`. Named character classes

> > = are for handy locale support, or in lieu of typing every character

> = > in the alphabet (though we can use ranges to abbreviate that), but

= > > it kind of seems like that's not coming into play here and, I= MHO,

> > `[0-9]` is clearer in context.

>

> ACK

>= ;

> "[[:digit:]]+" was a construct that I'm parroting.= =C2=A0 It and

> [.:[:xdigit:]]+ are good for some things.=C2=A0 But t= hey definitely aren't

> the best for all things.

>

> = Hence trying to find the line of being more accurate without going too far.=

>

> > It's not clear to me that dates, in their general= ity, can be

> > matched with regular expressions.=C2=A0 Consider l= eap years; you'd almost

> > necessarily have to use backtracki= ng for that, but I admit I haven't

> > thought it through.

= >

> Given the context that these extended regular expressions are = going to

> be used in, logcheck -- filtering out known okay log entri= es to email

> what doesn't get filtered -- I'm okay with havi= ng a few things slip

> through like leap day / leap seconds / leap fr= ogs.

> I'm also not a fan of the use = of `\w` and would prefer to (...|...) things.

Yeah. IMHO `\w` is too= general for what you're trying to do.

> > The thing about= regular expressions is that they describe regular

> > languages, = and regular languages are those for which there exists a

> > finit= e automaton that can recognize the language. An important class

> >= ; of finite automata are deterministic finite automata; by definition,

&= gt; > recognition by such automata are linear in the length of the input= .

> >

> > However, construction of a DFA for any given re= gular expression can be

> > superlinear (in fact, it can be expone= ntial) so practically speaking,

> > we usually construct non-deter= ministic finite automata (NDFAs) and

> > "simulate" thei= r execution for matching. NDFAs generalize DFAs (DFAs

> > are a su= bset of NDFAs, incidentally) in that, in any non-terminal

> > stat= e, there can be multiple subsequent states that the machine can

> >= ; transition to given an input symbol. When executed, for any state,

>= ; > the simulator will transition to every permissible subsequent state<= br>> > simultaneously, discarding impossible states as they become ev= ident.

> >

> > This implies that NDFA execution is superl= inear, but it is bounded,

> > and is O(n*m*e), where n is the leng= th of the input, m is the number

> > of nodes in the state transit= ion graph corresponding to the NDFA, and

> > e is the maximum numb= er of edges leaving any node in that graph (for

> > a fully connec= ted graph, that would m, so this can be up to O(n*m^2)).

> > Const= ruction of an NDFA is O(m), so while it's slower to execute, it's**> > actually possible to construct in a reasonable amount of time. =
Russ's**

> > excellent series of articles that Clem linked to gi= ves details and

> > algorithms.

>

> I only vaguely und= erstand those three paragraphs as they are deeper

> computer science = than I've gone before.

>

> I think I get the gist of them b= ut could not explain them if my life

> depended upon it.

> > In practical terms? Basically, don't worry= about it too much. Egrep

> > will generate an NDFA simulation tha= t's going to be acceptably fast

> > for all but the weirdest c= ases.

>

> ACK

>

> It sounds like I can make any rea= sonable extended regular expression a

> human can read and I'll p= robably be good.

>= ; On 3/2/23 2:53 PM, Dan Cross wrote:

> > Well, obviously the form= er matches any sequence 3 of

> > alpha-numerics/underscores at the= beginning of a string, while the

> > latter only matches abbrevia= tions of months in the western calendar;

> > that is, the two REs = are matching very different things (the latter

> > is a strict sub= set of the former).

>

> I completely agree with you.=C2=A0 That= 's also why I'm wanting to start

> utilizing the latter, more= specific RE.=C2=A0 But I don't know where the line

> of over com= plicating things is to avoid crossing it.

I guess what I'm sayin= g is, match what you want to match and don't sweat the small stuff.

=

> > But I suspect you mean in a more general sense.

>

&g= t; Yes and no.=C2=A0 Does the comment above clarify at all?

Not exac= tly. :-)

What I understand you to mean, based on this and the rest o= f your note, is that you want to find a good division point between overly = specific, complex REs and simpler, easy to understand REs that are less spe= cific. The danger with the latter is that they may match things you don'= ;t intend, while the former are harder to maintain and (arguably) more brit= tle. I can sympathize.

> > ...do you really want to match a sp= ace, a colon and a single digit

> > 11 times ...

>

> Y= es.

>

> > ... in a single string?

>

> What const= itutes a single string? =C2=A0;-) =C2=A0I sort of rhetorically ask.

= For the purposes of grep/egrep, that'll be a logical "line" o= f text, terminated by a newline, though the newline itself isn't consid= ered part of the text for matching. I believe the `-z` option can be used t= o set a NUL byte as the "line" terminator; presumably this lets o= ne match strings with embedded newlines, though I haven't tried.

> The log lines start with

>

> MMM dd hh:mm:ss

>

&= gt; Where:

> =C2=A0 - MMM is the month abbreviation

> =C2=A0 - = dd =C2=A0is the day of the month

> =C2=A0 - hh =C2=A0is the hour of t= he day

> =C2=A0 - mm =C2=A0is the minute of the hour

> =C2=A0 -= ss =C2=A0is the second of the minute

>

> So, yes, there are el= even characters that fall into the class consisting

> of a space or a= colon or a number.

>

> Is that a single string?=C2=A0 It=
depends what you're looking at, the

> sequences of non white spa= ce in the log? No.=C2=A0 The patter that I'm

> matching ya.

=
> sequences of non white spa= ce in the log? No.=C2=A0 The patter that I'm

> matching ya.

term% egre= p '[ :[:digit:]]{11}'

11111111111

11111111111

111111111

1111111111111

1111111111111

::::::::::::::::

::::::::::::::::

= aaaa =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0bbbbb

aaaa =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0bbbbb

term%

(The first line is my ty= ping; the second is output from egrep except for the short line of 9 '1= 's, for which egrep had no output. That last two lines are matching spa= ce characters and egrep echoing the match, but I'm guessing gmail will = eat those.)

Note that there are inputs with more than 11=
characters that match; this is because there is some 11-character substrin=
g that matches the RE=C2=A0 in those lines. In any event, I suspect this wo=
uld generally not be what you want. But if nothing else in your input can m=
atch the RE (which you might know a priori because of domain knowledge abou=
t whatever is generating those logs) then it's no big deal, even if the=
RE was capable of matching more things generally.

> &=
gt; Using character classes would greatly simplify what you're trying t=
o

> > do. It seems like this could be simplified to (untested) sni= ppet:

>

> Agreed.

>

> I'm starting with the exa= mples that came with; "^\w{3} [

> :[:digit:]]{11}", the log= check package that I'm working with and

> evaluating what I want = to do.

> > do. It seems like this could be simplified to (untested) sni= ppet:

>

> Agreed.

>

> I'm starting with the exa= mples that came with; "^\w{3} [

> :[:digit:]]{11}", the log= check package that I'm working with and

> evaluating what I want = to do.

Ah. I suspect this relies on domain knowled=
ge about the format of log lines to match reliably. Otherwise it could matc=
h, `___ 123 456:789` which is probably not what you are expecting.

> I actually like the idea of dividing out the following:

><= br>> =C2=A0 - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, an= d Dec

> =C2=A0 - months that have 30 days: Apr, Jun, Sep, Nov

>= =C2=A0 - month that have 28/29 days: Feb

Sure.=C2=A0 One nice thing= about `egrep` et al is that you can put the REs into a file and include th= em with `-f`, as opposed to having them all directly on the command line.

> > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9]=

>

> Aside: =C2=A0Why do you have the double square brackets in= "[12][[0-9]]"?

Typo.=C2=A0 :-)

> > For this,= I'd probably eschew `[:digit:]`. Named character classes

> > = are for handy locale support, or in lieu of typing every character

> = > in the alphabet (though we can use ranges to abbreviate that), but

= > > it kind of seems like that's not coming into play here and, I= MHO,

> > `[0-9]` is clearer in context.

>

> ACK

>= ;

> "[[:digit:]]+" was a construct that I'm parroting.= =C2=A0 It and

> [.:[:xdigit:]]+ are good for some things.=C2=A0 But t= hey definitely aren't

> the best for all things.

>

> = Hence trying to find the line of being more accurate without going too far.=

>

> > It's not clear to me that dates, in their general= ity, can be

> > matched with regular expressions.=C2=A0 Consider l= eap years; you'd almost

> > necessarily have to use backtracki= ng for that, but I admit I haven't

> > thought it through.

= >

> Given the context that these extended regular expressions are = going to

> be used in, logcheck -- filtering out known okay log entri= es to email

> what doesn't get filtered -- I'm okay with havi= ng a few things slip

> through like leap day / leap seconds / leap fr= ogs.

That seems reasonable.

> > `\w` = is a GNU extension; I'd probably avoid it on portability grounds

>= ; > (though `\b` is very handy).

>

> I hear, understand, and= acknowledge your concern.=C2=A0 At present, these

> filters are bein= g used in a package; logcheck,

> > `\w` = is a GNU extension; I'd probably avoid it on portability grounds

>= ; > (though `\b` is very handy).

>

> I hear, understand, and= acknowledge your concern.=C2=A0 At present, these

> filters are bein= g used in a package; logcheck,

Aside: I found the =
note on it's website amusing: Brought to you by the UK's best gambl=
ing sites! "Only gamble with what you can afford to lose." Yikes!=

> which I believe is

> specific to Debia= n and ilk.=C2=A0 As such, GNU grep is very much a thing.

> specific to Debia= n and ilk.=C2=A0 As such, GNU grep is very much a thing.

I'd proceed with caution here; it also seems to be in the FreeBS=
D and DragonFly ports collections and Homebrew on the Mac (but so is GNU gr=
ep for all of those).

> I'm also not a fan of the use = of `\w` and would prefer to (...|...) things.

Yeah. IMHO `\w` is too= general for what you're trying to do.

> > The thing about= regular expressions is that they describe regular

> > languages, = and regular languages are those for which there exists a

> > finit= e automaton that can recognize the language. An important class

> >= ; of finite automata are deterministic finite automata; by definition,

&= gt; > recognition by such automata are linear in the length of the input= .

> >

> > However, construction of a DFA for any given re= gular expression can be

> > superlinear (in fact, it can be expone= ntial) so practically speaking,

> > we usually construct non-deter= ministic finite automata (NDFAs) and

> > "simulate" thei= r execution for matching. NDFAs generalize DFAs (DFAs

> > are a su= bset of NDFAs, incidentally) in that, in any non-terminal

> > stat= e, there can be multiple subsequent states that the machine can

> >= ; transition to given an input symbol. When executed, for any state,

>= ; > the simulator will transition to every permissible subsequent state<= br>> > simultaneously, discarding impossible states as they become ev= ident.

> >

> > This implies that NDFA execution is superl= inear, but it is bounded,

> > and is O(n*m*e), where n is the leng= th of the input, m is the number

> > of nodes in the state transit= ion graph corresponding to the NDFA, and

> > e is the maximum numb= er of edges leaving any node in that graph (for

> > a fully connec= ted graph, that would m, so this can be up to O(n*m^2)).

> > Const= ruction of an NDFA is O(m), so while it's slower to execute, it's

> > excellent series of articles that Clem linked to gi= ves details and

> > algorithms.

>

> I only vaguely und= erstand those three paragraphs as they are deeper

> computer science = than I've gone before.

>

> I think I get the gist of them b= ut could not explain them if my life

> depended upon it.

Basically, a regular expression is a regular expression if you=
can build a machine with no additional memory that can tell you whether or=
not a given string matches the RE examining its input one character at a t=
ime.

> > In practical terms? Basically, don't worry= about it too much. Egrep

> > will generate an NDFA simulation tha= t's going to be acceptably fast

> > for all but the weirdest c= ases.

>

> ACK

>

> It sounds like I can make any rea= sonable extended regular expression a

> human can read and I'll p= robably be good.

I think that's about right.

> Thank you for the detailed response Dan. =C2=A0:-)

--00000000000080991205f5f63689--
> Thank you for the detailed response Dan. =C2=A0:-)

Sure thing!

=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 - Dan C.