On 3/2/23 2:53 PM, Dan Cross wrote: > Well, obviously the former matches any sequence 3 of > alpha-numerics/underscores at the beginning of a string, while the > latter only matches abbreviations of months in the western calendar; > that is, the two REs are matching very different things (the latter > is a strict subset of the former). I completely agree with you. That's also why I'm wanting to start utilizing the latter, more specific RE. But I don't know where the line of over complicating things is to avoid crossing it. > But I suspect you mean in a more general sense. Yes and no. Does the comment above clarify at all? > ...do you really want to match a space, a colon and a single digit > 11 times ... Yes. > ... in a single string? What constitutes a single string? ;-) I sort of rhetorically ask. The log lines start with MMM dd hh:mm:ss Where: - MMM is the month abbreviation - dd is the day of the month - hh is the hour of the day - mm is the minute of the hour - ss is the second of the minute So, yes, there are eleven characters that fall into the class consisting of a space or a colon or a number. Is that a single string? It depends what you're looking at, the sequences of non white space in the log? No. The patter that I'm matching ya. > Using character classes would greatly simplify what you're trying to > do. It seems like this could be simplified to (untested) snippet: Agreed. I'm starting with the examples that came with; "^\w{3} [ :[:digit:]]{11}", the logcheck package that I'm working with and evaluating what I want to do. I actually like the idea of dividing out the following: - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec - months that have 30 days: Apr, Jun, Sep, Nov - month that have 28/29 days: Feb > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9] Aside: Why do you have the double square brackets in "[12][[0-9]]"? > For this, I'd probably eschew `[:digit:]`. Named character classes > are for handy locale support, or in lieu of typing every character > in the alphabet (though we can use ranges to abbreviate that), but > it kind of seems like that's not coming into play here and, IMHO, > `[0-9]` is clearer in context. ACK "[[:digit:]]+" was a construct that I'm parroting. It and [.:[:xdigit:]]+ are good for some things. But they definitely aren't the best for all things. Hence trying to find the line of being more accurate without going too far. > It's not clear to me that dates, in their generality, can be > matched with regular expressions. Consider leap years; you'd almost > necessarily have to use backtracking for that, but I admit I haven't > thought it through. Given the context that these extended regular expressions are going to be used in, logcheck -- filtering out known okay log entries to email what doesn't get filtered -- I'm okay with having a few things slip through like leap day / leap seconds / leap frogs. > `\w` is a GNU extension; I'd probably avoid it on portability grounds > (though `\b` is very handy). I hear, understand, and acknowledge your concern. At present, these filters are being used in a package; logcheck, which I believe is specific to Debian and ilk. As such, GNU grep is very much a thing. I'm also not a fan of the use of `\w` and would prefer to (...|...) things. > The thing about regular expressions is that they describe regular > languages, and regular languages are those for which there exists a > finite automaton that can recognize the language. An important class > of finite automata are deterministic finite automata; by definition, > recognition by such automata are linear in the length of the input. > > However, construction of a DFA for any given regular expression can be > superlinear (in fact, it can be exponential) so practically speaking, > we usually construct non-deterministic finite automata (NDFAs) and > "simulate" their execution for matching. NDFAs generalize DFAs (DFAs > are a subset of NDFAs, incidentally) in that, in any non-terminal > state, there can be multiple subsequent states that the machine can > transition to given an input symbol. When executed, for any state, > the simulator will transition to every permissible subsequent state > simultaneously, discarding impossible states as they become evident. > > This implies that NDFA execution is superlinear, but it is bounded, > and is O(n*m*e), where n is the length of the input, m is the number > of nodes in the state transition graph corresponding to the NDFA, and > e is the maximum number of edges leaving any node in that graph (for > a fully connected graph, that would m, so this can be up to O(n*m^2)). > Construction of an NDFA is O(m), so while it's slower to execute, it's > actually possible to construct in a reasonable amount of time. Russ's > excellent series of articles that Clem linked to gives details and > algorithms. I only vaguely understand those three paragraphs as they are deeper computer science than I've gone before. I think I get the gist of them but could not explain them if my life depended upon it. > In practical terms? Basically, don't worry about it too much. Egrep > will generate an NDFA simulation that's going to be acceptably fast > for all but the weirdest cases. ACK It sounds like I can make any reasonable extended regular expression a human can read and I'll probably be good. Thank you for the detailed response Dan. :-) -- Grant. . . . unix || die