On 3/2/23 8:04 PM, Dan Cross wrote:
> I guess what I'm saying is, match what you want to match and don't sweat
> the small stuff.

ACK

> Not exactly. :-)
>
> What I understand you to mean, based on this and the rest of your note,
> is that you want to find a good division point between overly specific,
> complex REs and simpler, easy to understand REs that are less specific.
> The danger with the latter is that they may match things you don't
> intend, while the former are harder to maintain and (arguably) more
> brittle. I can sympathize.

You got it.

> For the purposes of grep/egrep, that'll be a logical "line" of text,
> terminated by a newline, though the newline itself isn't considered part
> of the text for matching. I believe the `-z` option can be used to set a
> NUL byte as the "line" terminator; presumably this lets one match
> strings with embedded newlines, though I haven't tried.

Fair enough. That's also sort of what I thought might be the case.

> "string" in this context is the input you're attempting to match
> against. `egrep` will attempt to match your pattern against each "line"
> of text it reads from the files it's searching. That is, each line in
> your log file(s).

*nod*

> But consider what `[ :[:digit:]]{11}` means: you've got a character
> class consisting of space, colon and a digit; {11} means "match any of
> the characters in that class exactly 11 times" (as opposed to other
> variations on the '{}' syntax that say "at least m times", "at most n
> times", or "between n and m times").

Yep, I'm well aware of that.

> But that'll match all sorts of things that don't look like 'dd
> hh:mm:ss':

That's one of the reasons that I'm interested in coming up with a more
precise regular expression ... without being overly complex.

> (The first line is my typing; the second is output from egrep except for
> the short line of 9 '1's, for which egrep had no output.
> Those last two lines are matching space characters and egrep echoing the
> match, but I'm guessing gmail will eat those.)
>
> Note that there are inputs with more than 11 characters that match; this
> is because there is some 11-character substring that matches the RE in
> those lines. In any event, I suspect this would generally not be what
> you want. But if nothing else in your input can match the RE (which you
> might know a priori because of domain knowledge about whatever is
> generating those logs) then it's no big deal, even if the RE was capable
> of matching more things generally.

Yep.

Here's an example of the full RE:

^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$

As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a
larger RE and there is bounding & delimiting around the subpart.

This is to match a standard message from postfix via standard SYSLOG.

> Ah. I suspect this relies on domain knowledge about the format of log
> lines to match reliably. Otherwise it could match, `___ 123 456:789`
> which is probably not what you are expecting.

Yep. Though said domain knowledge isn't anything special in and of
itself.

> Sure. One nice thing about `egrep` et al is that you can put the REs
> into a file and include them with `-f`, as opposed to having them all
> directly on the command line.

Yep. logcheck makes extensive use of many files like this to do its
work.

> Typo. :-)

ACKK

> That seems reasonable.

Thank you for the logic CRC.

> Aside: I found the note on its website amusing: Brought to you by the
> UK's best gambling sites! "Only gamble with what you can afford to
> lose." Yikes!

Um ... that's concerning.

> I'd proceed with caution here; it also seems to be in the FreeBSD and
> DragonFly ports collections and Homebrew on the Mac (but so is GNU grep
> for all of those).

Fair enough. My use case is on Linux where GNU egrep is a thing.

> Yeah.
> IMHO `\w` is too general for what you're trying to do.

I think that `\w` is a good primer, but not where I want things to end
up long term.

> Basically, a regular expression is a regular expression if you can build
> a machine with no additional memory that can tell you whether or not a
> given string matches the RE examining its input one character at a time.

I /think/ that I could build a complex nested tree of switch statements
to test each character to see if things match what they should or not.
Though I would need at least one variable / memory to hold absolutely
minimal state to know where I am in the switch tree. I think a number
to identify the switch statement in question would be sufficient. So
I'm guessing two bytes of variable and uncounted bytes of program code.

> I think that's about right.

Thank you again Dan.

> Sure thing! :-)

-- 
Grant. . . . unix || die
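
The switch-tree idea discussed above, where a single small number records where the matcher currently is, can be sketched in Python. This is only an illustration under assumptions that are not from the thread: the names `TRANSITIONS`, `LOOSE` and `matches_timestamp` are hypothetical, the field is assumed to be an 11-character syslog-style 'dd hh:mm:ss' with a possibly space-padded day, and numeric ranges (an hour of 99, say) are not checked. The loose pattern is a Python rendering of `[ :[:digit:]]{11}` for comparison.

```python
import re

# Python rendering of the loose class from the thread: [ :[:digit:]]{11}
LOOSE = re.compile(r"[ :0-9]{11}")

DIGITS = "0123456789"

# Hypothetical transition table: entry i lists the characters that may be
# consumed in state i. Consuming one always moves to state i + 1, so the
# integer `state` below is the only memory the matcher carries.
TRANSITIONS = [
    " " + DIGITS,  # state 0: first day character (space-padded or digit)
    DIGITS,        # state 1: second day digit
    " ",           # state 2: separator between day and time
    DIGITS,        # state 3: hour, first digit
    DIGITS,        # state 4: hour, second digit
    ":",           # state 5
    DIGITS,        # state 6: minute, first digit
    DIGITS,        # state 7: minute, second digit
    ":",           # state 8
    DIGITS,        # state 9: second, first digit
    DIGITS,        # state 10: second, second digit
]


def matches_timestamp(text):
    """Return True if text is exactly one 11-character 'dd hh:mm:ss' field."""
    state = 0
    for ch in text:
        # Too many characters, or a character not allowed here: reject.
        if state >= len(TRANSITIONS) or ch not in TRANSITIONS[state]:
            return False
        state += 1
    # Accept only if every position was consumed.
    return state == len(TRANSITIONS)


if __name__ == "__main__":
    for sample in (" 2 20:04:00", "123 456:789", "11111111111"):
        loose = LOOSE.fullmatch(sample) is not None
        print(repr(sample), "loose:", loose, "strict:", matches_timestamp(sample))
```

In this sketch the loose class accepts all three samples, while the position-by-position matcher accepts only the first. The price is a longer, more specific pattern, which is the trade-off the thread is about.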