* [COFF] Requesting thoughts on extended regular expressions in grep. @ 2023-03-02 18:54 Grant Taylor via COFF 2023-03-02 19:23 ` [COFF] " Clem Cole ` (4 more replies) 0 siblings, 5 replies; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-02 18:54 UTC (permalink / raw) To: COFF [-- Attachment #1: Type: text/plain, Size: 1436 bytes --] Hi, I'd like some thoughts ~> input on extended regular expressions used with grep, specifically GNU grep -e / egrep. What are the pros / cons to creating extended regular expressions like the following: ^\w{3} vs: ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) Or: [ :[:digit:]]{11} vs: ( 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]] I'm currently eliding the 61st (60) second, the 32nd day, and dealing with February having fewer days for simplicity. For matching patterns like the following in log files? Mar 2 03:23:38 I'm working on organically training logcheck to match known good log entries. So I'm *DEEP* in the bowels of extended regular expressions (GNU egrep) that runs over all logs hourly. As such, I'm interested in making sure that my REs are both efficient and accurate or at least not WILDLY badly structured. The pedantic part of me wants to avoid wildcard type matches (\w), even if they are bounded (\w{3}), unless it truly is for unpredictable text. I'd appreciate any feedback and recommendations from people who have been using and / or optimizing (extended) regular expressions for longer than I have been using them. Thank you for your time and input. -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
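The trade-off being asked about can be seen concretely. Below is a rough sketch translated into Python's re syntax (an assumption for illustration: Python lacks POSIX [[:digit:]], so [0-9] stands in, and the sample lines are made up):

```python
import re

# Loose form from the post: any 3 word characters, a space, then 11
# characters drawn from space/colon/digit.
loose = re.compile(r"^\w{3} [ :0-9]{11}")

# Strict form: real month names plus bounded day and time fields.
months = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
strict = re.compile(
    r"^(" + months + r") "
    r"( [1-9]|[12][0-9]|3[01]) "
    r"[0-2][0-9]:[0-5][0-9]:[0-5][0-9]"
)

good = "Mar  2 03:23:38 mail postfix/smtpd[1]: connect"
junk = "abc 99 99:99:99 not a real timestamp"

print(bool(loose.match(good)), bool(strict.match(good)))  # both accept the real line
print(bool(loose.match(junk)), bool(strict.match(junk)))  # only the loose RE accepts junk
```

Both forms accept the real syslog timestamp; only the loose form also accepts the junk line, which is the precision the strict form buys.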
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF @ 2023-03-02 19:23 ` Clem Cole 2023-03-02 19:38 ` Grant Taylor via COFF 2023-03-02 23:01 ` Stuff Received 2023-03-02 21:53 ` Dan Cross ` (3 subsequent siblings) 4 siblings, 2 replies; 46+ messages in thread From: Clem Cole @ 2023-03-02 19:23 UTC (permalink / raw) To: Grant Taylor; +Cc: COFF [-- Attachment #1: Type: text/plain, Size: 1788 bytes --] Grant - check out Russ Cox's web page on this very subject: Implementing Regular Expressions <https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.com%2F%7Ersc%2Fregexp%2F> ᐧ On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote: > Hi, > > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -e / egrep. > > What are the pros / cons to creating extended regular expressions like > the following: > > ^\w{3} > > vs: > > ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) > > Or: > > [ :[:digit:]]{11} > > vs: > > ( 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) > (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]] > > I'm currently eliding the 61st (60) second, the 32nd day, and dealing > with February having fewer days for simplicity. > > For matching patterns like the following in log files? > > Mar 2 03:23:38 > > I'm working on organically training logcheck to match known good log > entries. So I'm *DEEP* in the bowels of extended regular expressions > (GNU egrep) that runs over all logs hourly. As such, I'm interested in > making sure that my REs are both efficient and accurate or at least not > WILDLY badly structured. The pedantic part of me wants to avoid > wildcard type matches (\w), even if they are bounded (\w{3}), unless it > truly is for unpredictable text. 
> > I'd appreciate any feedback and recommendations from people who have > been using and / or optimizing (extended) regular expressions for longer > than I have been using them. > > Thank you for your time and input. > > > > -- > Grant. . . . > unix || die > > [-- Attachment #2: Type: text/html, Size: 2619 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 19:23 ` [COFF] " Clem Cole @ 2023-03-02 19:38 ` Grant Taylor via COFF 2023-03-02 23:01 ` Stuff Received 1 sibling, 0 replies; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-02 19:38 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 308 bytes --] On 3/2/23 12:23 PM, Clem Cole wrote: > Grant - check out Russ Cox's web page on this very subject: Implementing > Regular Expressions Thank you for the pointer Clem. It's at the top of my reading list. I'll dig into the articles listed thereon later today. -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 19:23 ` [COFF] " Clem Cole 2023-03-02 19:38 ` Grant Taylor via COFF @ 2023-03-02 23:01 ` Stuff Received 2023-03-02 23:46 ` Steffen Nurpmeso 2023-03-03 1:08 ` Grant Taylor via COFF 1 sibling, 2 replies; 46+ messages in thread From: Stuff Received @ 2023-03-02 23:01 UTC (permalink / raw) To: coff On 2023-03-02 14:23, Clem Cole wrote: > Grant - check out Russ Cox's web page on this very subject: Implementing > Regular Expressions > <https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.com%2F%7Ersc%2Fregexp%2F> Clem, why are you linking through streaklinks.com? N. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 23:01 ` Stuff Received @ 2023-03-02 23:46 ` Steffen Nurpmeso 2023-03-03 1:08 ` Grant Taylor via COFF 1 sibling, 0 replies; 46+ messages in thread From: Steffen Nurpmeso @ 2023-03-02 23:46 UTC (permalink / raw) To: Stuff Received; +Cc: coff Stuff Received wrote in <c4328f0f-f304-e131-a9f3-32ce01fa6814@riddermarkfarm.ca>: |On 2023-03-02 14:23, Clem Cole wrote: |> Grant - check out Russ Cox's web page on this very subject: Implementing |> Regular Expressions |> <https://streaklinks.com/BaglRcA5OeZepwiX_AELMUI5/https%3A%2F%2Fswtch.co\ |> m%2F%7Ersc%2Fregexp%2F> | |Clem, why are you linking through streaklinks.com? I do not want to be unfriendly; (but) i use firefox-bin (mozilla-compiled) and my only extension is uMatrix that i have been pointed to, and i can only recommend it highly to anyone (though the "modern" web mostly requires to turn off tracking protection and numerous white flags in uMatrix to work), maybe even to those who simply put their browser into a container. Anyhow, uMatrix gives you the following, and while i have not tried to selectively click me through to get to the target, i could have done so: streak.com www.streak.com cloudflare.com cdnjs.cloudflare.com d3e54v103j8qbb.cloudfront.net facebook.net connect.facebook.net google.com www.google.com ajax.googleapis.com storage.googleapis.com intercom.io widget.intercom.io licdn.com snap.licdn.com pdst.fm cdn.pdst.fm producthunt.com api.producthunt.com sentry-cdn.com js.sentry-cdn.com website-files.com assets.website-files.com assets-global.website-files.com google-analytics.com www.google-analytics.com googletagmanager.com www.googletagmanager.com Randomized links i find just terrible. IETF started using randomized archive links, which are mesmerising; most often mailman archive links give you a bit of orientation by themselves, isn't that more appealing. 
--steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 23:01 ` Stuff Received 2023-03-02 23:46 ` Steffen Nurpmeso @ 2023-03-03 1:08 ` Grant Taylor via COFF 2023-03-03 2:10 ` Dave Horsfall 1 sibling, 1 reply; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-03 1:08 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 560 bytes --] On 3/2/23 4:01 PM, Stuff Received wrote: > Clem, why are you linking through streaklinks.com? Here's a direct link to the page that I landed on when following Clem's link: Link - Implementing Regular Expressions - https://swtch.com/~rsc/regexp/ I didn't pay attention to Clem's link beyond the fact that I got to the desired page without needing to tilt at my various filtering plugins. Though the message I'm replying to has caused a few brain cells to find themselves in confusion ~> curiosity. -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 1:08 ` Grant Taylor via COFF @ 2023-03-03 2:10 ` Dave Horsfall 2023-03-03 3:34 ` Grant Taylor via COFF 0 siblings, 1 reply; 46+ messages in thread From: Dave Horsfall @ 2023-03-03 2:10 UTC (permalink / raw) To: Computer Old Farts Followers On Thu, 2 Mar 2023, Grant Taylor via COFF wrote: > Though the message I'm replying to has caused a few brain cells to find > themselves in confusion ~> curiosity. Because evil things can happen with URL redirectors; personally I like to know where I'm going beforehand... -- Dave ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 2:10 ` Dave Horsfall @ 2023-03-03 3:34 ` Grant Taylor via COFF 0 siblings, 0 replies; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-03 3:34 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 299 bytes --] On 3/2/23 7:10 PM, Dave Horsfall wrote: > Because evil things can happen with URL redirectors; personally I like to > know where I'm going beforehand... I absolutely agree. The confusion was why someone would purposefully choose to use a redirector. -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF 2023-03-02 19:23 ` [COFF] " Clem Cole @ 2023-03-02 21:53 ` Dan Cross 2023-03-03 1:05 ` Grant Taylor via COFF 2023-03-03 10:59 ` Ralph Corderoy ` (2 subsequent siblings) 4 siblings, 1 reply; 46+ messages in thread From: Dan Cross @ 2023-03-02 21:53 UTC (permalink / raw) To: Grant Taylor; +Cc: COFF On Thu, Mar 2, 2023 at 1:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote: > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -e / egrep. > > What are the pros / cons to creating extended regular expressions like > the following: > > ^\w{3} > > vs: > > ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) Well, obviously the former matches any sequence of 3 alpha-numerics/underscores at the beginning of a string, while the latter only matches abbreviations of months in the western calendar; that is, the two REs are matching very different things (the latter is a strict subset of the former). But I suspect you mean in a more general sense. > Or: > > [ :[:digit:]]{11} ...do you really want to match a space, a colon and a single digit 11 times in a single string? > vs: > > ( 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) > (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]] Using character classes would greatly simplify what you're trying to do. It seems like this could be simplified to this (untested) snippet: ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9] For this, I'd probably eschew `[:digit:]`. Named character classes are handy for locale support, or in lieu of typing every character in the alphabet (though we can use ranges to abbreviate that), but it kind of seems like that's not coming into play here and, IMHO, `[0-9]` is clearer in context. 
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing > with February having fewer days for simplicity. It's not clear to me that dates, in their generality, can be matched with regular expressions. Consider leap years; you'd almost necessarily have to use backtracking for that, but I admit I haven't thought it through. > For matching patterns like the following in log files? > > Mar 2 03:23:38 > > I'm working on organically training logcheck to match known good log > entries. So I'm *DEEP* in the bowels of extended regular expressions > (GNU egrep) that runs over all logs hourly. As such, I'm interested in > making sure that my REs are both efficient and accurate or at least not > WILDLY badly structured. The pedantic part of me wants to avoid > wildcard type matches (\w), even if they are bounded (\w{3}), unless it > truly is for unpredictable text. `\w` is a GNU extension; I'd probably avoid it on portability grounds (though `\b` is very handy). The thing about regular expressions is that they describe regular languages, and regular languages are those for which there exists a finite automaton that can recognize the language. An important class of finite automata are deterministic finite automata; by definition, recognition by such automata are linear in the length of the input. However, construction of a DFA for any given regular expression can be superlinear (in fact, it can be exponential) so practically speaking, we usually construct non-deterministic finite automata (NDFAs) and "simulate" their execution for matching. NDFAs generalize DFAs (DFAs are a subset of NDFAs, incidentally) in that, in any non-terminal state, there can be multiple subsequent states that the machine can transition to given an input symbol. When executed, for any state, the simulator will transition to every permissible subsequent state simultaneously, discarding impossible states as they become evident. 
This implies that NDFA execution is superlinear, but it is bounded, and is O(n*m*e), where n is the length of the input, m is the number of nodes in the state transition graph corresponding to the NDFA, and e is the maximum number of edges leaving any node in that graph (for a fully connected graph, that would be m, so this can be up to O(n*m^2)). Construction of an NDFA is O(m), so while it's slower to execute, it's actually possible to construct in a reasonable amount of time. Russ's excellent series of articles that Clem linked to gives details and algorithms. > I'd appreciate any feedback and recommendations from people who have > been using and / or optimizing (extended) regular expressions for longer > than I have been using them. > > Thank you for your time and input. In practical terms? Basically, don't worry about it too much. Egrep will generate an NDFA simulation that's going to be acceptably fast for all but the weirdest cases. - Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
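The simulation loop described here ("transition to every permissible subsequent state simultaneously, discarding impossible states") can be sketched in a few lines. The machine below is hand-built for the toy regular expression (a|b)*abb, an assumption for illustration rather than the output of a general NFA construction:

```python
# Hand-built NDFA for (a|b)*abb: state 0 loops on a/b; states 1-3 spell "abb".
NFA = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START, ACCEPT = {0}, 3

def matches(text):
    states = set(START)
    for ch in text:
        # one step of the simulation: advance every live state at once,
        # discarding states with no transition on this character
        states = {nxt for st in states for nxt in NFA[st].get(ch, ())}
        if not states:
            return False  # no live states remain; reject early
    return ACCEPT in states

print(matches("aababb"), matches("abab"))
```

Each input character costs at most one look at every live state and its outgoing edges, which is where the O(n*m*e) bound above comes from.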
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 21:53 ` Dan Cross @ 2023-03-03 1:05 ` Grant Taylor via COFF 2023-03-03 3:04 ` Dan Cross 0 siblings, 1 reply; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-03 1:05 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 5704 bytes --] On 3/2/23 2:53 PM, Dan Cross wrote: > Well, obviously the former matches any sequence 3 of > alpha-numerics/underscores at the beginning of a string, while the > latter only matches abbreviations of months in the western calendar; > that is, the two REs are matching very different things (the latter > is a strict subset of the former). I completely agree with you. That's also why I'm wanting to start utilizing the latter, more specific RE. But I don't know where the line of over-complicating things is, so that I can avoid crossing it. > But I suspect you mean in a more general sense. Yes and no. Does the comment above clarify at all? > ...do you really want to match a space, a colon and a single digit > 11 times ... Yes. > ... in a single string? What constitutes a single string? ;-) I sort of rhetorically ask. The log lines start with

MMM dd hh:mm:ss

Where:
- MMM is the month abbreviation
- dd is the day of the month
- hh is the hour of the day
- mm is the minute of the hour
- ss is the second of the minute

So, yes, there are eleven characters that fall into the class consisting of a space or a colon or a number. Is that a single string? It depends what you're looking at. The sequences of non-whitespace in the log? No. The pattern that I'm matching? Yeah. > Using character classes would greatly simplify what you're trying to > do. It seems like this could be simplified to (untested) snippet: Agreed. I'm starting with the examples, like "^\w{3} [ :[:digit:]]{11}", that came with the logcheck package that I'm working with, and evaluating what I want to do. 
I actually like the idea of dividing out the following:

- months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec
- months that have 30 days: Apr, Jun, Sep, Nov
- the month that has 28/29 days: Feb

> ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9] Aside: Why do you have the double square brackets in "[12][[0-9]]"? > For this, I'd probably eschew `[:digit:]`. Named character classes > are for handy locale support, or in lieu of typing every character > in the alphabet (though we can use ranges to abbreviate that), but > it kind of seems like that's not coming into play here and, IMHO, > `[0-9]` is clearer in context. ACK "[[:digit:]]+" was a construct that I'm parroting. It and [.:[:xdigit:]]+ are good for some things. But they definitely aren't the best for all things. Hence trying to find the line of being more accurate without going too far. > It's not clear to me that dates, in their generality, can be > matched with regular expressions. Consider leap years; you'd almost > necessarily have to use backtracking for that, but I admit I haven't > thought it through. Given the context that these extended regular expressions are going to be used in, logcheck -- filtering out known okay log entries to email what doesn't get filtered -- I'm okay with having a few things slip through like leap day / leap seconds / leap frogs. > `\w` is a GNU extension; I'd probably avoid it on portability grounds > (though `\b` is very handy). I hear, understand, and acknowledge your concern. At present, these filters are being used in a package; logcheck, which I believe is specific to Debian and ilk. As such, GNU grep is very much a thing. I'm also not a fan of the use of `\w` and would prefer to (...|...) things. > The thing about regular expressions is that they describe regular > languages, and regular languages are those for which there exists a > finite automaton that can recognize the language. 
An important class > of finite automata are deterministic finite automata; by definition, > recognition by such automata are linear in the length of the input. > > However, construction of a DFA for any given regular expression can be > superlinear (in fact, it can be exponential) so practically speaking, > we usually construct non-deterministic finite automata (NDFAs) and > "simulate" their execution for matching. NDFAs generalize DFAs (DFAs > are a subset of NDFAs, incidentally) in that, in any non-terminal > state, there can be multiple subsequent states that the machine can > transition to given an input symbol. When executed, for any state, > the simulator will transition to every permissible subsequent state > simultaneously, discarding impossible states as they become evident. > > This implies that NDFA execution is superlinear, but it is bounded, > and is O(n*m*e), where n is the length of the input, m is the number > of nodes in the state transition graph corresponding to the NDFA, and > e is the maximum number of edges leaving any node in that graph (for > a fully connected graph, that would m, so this can be up to O(n*m^2)). > Construction of an NDFA is O(m), so while it's slower to execute, it's > actually possible to construct in a reasonable amount of time. Russ's > excellent series of articles that Clem linked to gives details and > algorithms. I only vaguely understand those three paragraphs as they are deeper computer science than I've gone before. I think I get the gist of them but could not explain them if my life depended upon it. > In practical terms? Basically, don't worry about it too much. Egrep > will generate an NDFA simulation that's going to be acceptably fast > for all but the weirdest cases. ACK It sounds like I can make any reasonable extended regular expression a human can read and I'll probably be good. Thank you for the detailed response Dan. :-) -- Grant. . . . 
unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
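The grouping by day count proposed above could be sketched like this (Python re syntax with [0-9] in place of [[:digit:]]; the exact day alternations are assumptions, and leap-day handling is elided as in the message):

```python
import re

month31 = "(Jan|Mar|May|Jul|Aug|Oct|Dec)"
month30 = "(Apr|Jun|Sep|Nov)"
day31 = "( [1-9]|[12][0-9]|3[01])"  # syslog pads single-digit days with a space
day30 = "( [1-9]|[12][0-9]|30)"
day29 = "( [1-9]|1[0-9]|2[0-9])"    # Feb; leap-day refinement left out

date_re = re.compile(
    "^(%s %s|%s %s|Feb %s) " % (month31, day31, month30, day30, day29)
)

for line in ("Mar  2 03:23:38 host", "Apr 30 12:00:00 host",
             "Apr 31 12:00:00 host", "Feb 30 12:00:00 host"):
    print(line[:6], bool(date_re.match(line)))
```

"Apr 31" and "Feb 30" no longer match, which a single shared day field cannot express.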
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 1:05 ` Grant Taylor via COFF @ 2023-03-03 3:04 ` Dan Cross 2023-03-03 3:53 ` Grant Taylor via COFF 0 siblings, 1 reply; 46+ messages in thread From: Dan Cross @ 2023-03-03 3:04 UTC (permalink / raw) To: Grant Taylor; +Cc: coff [-- Attachment #1: Type: text/plain, Size: 9408 bytes --] On Thu, Mar 2, 2023 at 8:06 PM Grant Taylor via COFF <coff@tuhs.org> wrote: > On 3/2/23 2:53 PM, Dan Cross wrote: > > Well, obviously the former matches any sequence 3 of > > alpha-numerics/underscores at the beginning of a string, while the > > latter only matches abbreviations of months in the western calendar; > > that is, the two REs are matching very different things (the latter > > is a strict subset of the former). > > I completely agree with you. That's also why I'm wanting to start > utilizing the latter, more specific RE. But I don't know where the line > of over complicating things is to avoid crossing it. I guess what I'm saying is, match what you want to match and don't sweat the small stuff. > > But I suspect you mean in a more general sense. > > Yes and no. Does the comment above clarify at all? Not exactly. :-) What I understand you to mean, based on this and the rest of your note, is that you want to find a good division point between overly specific, complex REs and simpler, easy to understand REs that are less specific. The danger with the latter is that they may match things you don't intend, while the former are harder to maintain and (arguably) more brittle. I can sympathize. > > ...do you really want to match a space, a colon and a single digit > > 11 times ... > > Yes. > > > ... in a single string? > > What constitutes a single string? ;-) I sort of rhetorically ask. For the purposes of grep/egrep, that'll be a logical "line" of text, terminated by a newline, though the newline itself isn't considered part of the text for matching. 
I believe the `-z` option can be used to set a NUL byte as the "line" terminator; presumably this lets one match strings with embedded newlines, though I haven't tried. > The log lines start with > > MMM dd hh:mm:ss > > Where: > - MMM is the month abbreviation > - dd is the day of the month > - hh is the hour of the day > - mm is the minute of the hour > - ss is the second of the minute > > So, yes, there are eleven characters that fall into the class consisting > of a space or a colon or a number. > > Is that a single string? It depends what you're looking at, the > sequences of non white space in the log? No. The pattern that I'm > matching? Yeah. "string" in this context is the input you're attempting to match against. `egrep` will attempt to match your pattern against each "line" of text it reads from the files it's searching. That is, each line in your log file(s). But consider what `[ :[:digit:]]{11}` means: you've got a character class consisting of space, colon and a digit; {11} means "match any of the characters in that class exactly 11 times" (as opposed to other variations on the '{}' syntax that say "at least m times", "at most n times", or "between n and m times"). But that'll match all sorts of things that don't look like 'dd hh:mm:ss':

term% egrep '[ :[:digit:]]{11}'
11111111111
11111111111
111111111
1111111111111
1111111111111
::::::::::::::::
::::::::::::::::
aaaa bbbbb
aaaa bbbbb
term%

(The first line is my typing; the second is output from egrep except for the short line of 9 '1's, for which egrep had no output. The last two lines are matching space characters and egrep echoing the match, but I'm guessing gmail will eat those.) Note that there are inputs with more than 11 characters that match; this is because there is some 11-character substring that matches the RE in those lines. In any event, I suspect this would generally not be what you want. 
But if nothing else in your input can match the RE (which you might know a priori because of domain knowledge about whatever is generating those logs) then it's no big deal, even if the RE was capable of matching more things generally. > > Using character classes would greatly simplify what you're trying to > > do. It seems like this could be simplified to (untested) snippet: > > Agreed. > > I'm starting with the examples that came with; "^\w{3} [ > :[:digit:]]{11}", the logcheck package that I'm working with and > evaluating what I want to do. Ah. I suspect this relies on domain knowledge about the format of log lines to match reliably. Otherwise it could match, `___ 123 456:789` which is probably not what you are expecting. > I actually like the idea of dividing out the following: > > - months that have 31 days: Jan, Mar, May, Jul, Aug, Oct, and Dec > - months that have 30 days: Apr, Jun, Sep, Nov > - month that have 28/29 days: Feb Sure. One nice thing about `egrep` et al is that you can put the REs into a file and include them with `-f`, as opposed to having them all directly on the command line. > > ( [1-9]|[12][[0-9]]|3[01]) [0-2][0-9]:[0-5][0-9]:[0-5][0-9] > > Aside: Why do you have the double square brackets in "[12][[0-9]]"? Typo. :-) > > For this, I'd probably eschew `[:digit:]`. Named character classes > > are for handy locale support, or in lieu of typing every character > > in the alphabet (though we can use ranges to abbreviate that), but > > it kind of seems like that's not coming into play here and, IMHO, > > `[0-9]` is clearer in context. > > ACK > > "[[:digit:]]+" was a construct that I'm parroting. It and > [.:[:xdigit:]]+ are good for some things. But they definitely aren't > the best for all things. > > Hence trying to find the line of being more accurate without going too far. > > > It's not clear to me that dates, in their generality, can be > > matched with regular expressions. 
Consider leap years; you'd almost > > necessarily have to use backtracking for that, but I admit I haven't > > thought it through. > > Given the context that these extended regular expressions are going to > be used in, logcheck -- filtering out known okay log entries to email > what doesn't get filtered -- I'm okay with having a few things slip > through like leap day / leap seconds / leap frogs. That seems reasonable. > > `\w` is a GNU extension; I'd probably avoid it on portability grounds > > (though `\b` is very handy). > > I hear, understand, and acknowledge your concern. At present, these > filters are being used in a package; logcheck, Aside: I found the note on its website amusing: Brought to you by the UK's best gambling sites! "Only gamble with what you can afford to lose." Yikes! > which I believe is > specific to Debian and ilk. As such, GNU grep is very much a thing. I'd proceed with caution here; it also seems to be in the FreeBSD and DragonFly ports collections and Homebrew on the Mac (but so is GNU grep for all of those). > I'm also not a fan of the use of `\w` and would prefer to (...|...) things. Yeah. IMHO `\w` is too general for what you're trying to do. > > The thing about regular expressions is that they describe regular > > languages, and regular languages are those for which there exists a > > finite automaton that can recognize the language. An important class > > of finite automata are deterministic finite automata; by definition, > > recognition by such automata are linear in the length of the input. > > > > However, construction of a DFA for any given regular expression can be > > superlinear (in fact, it can be exponential) so practically speaking, > > we usually construct non-deterministic finite automata (NDFAs) and > > "simulate" their execution for matching. 
NDFAs generalize DFAs (DFAs > > are a subset of NDFAs, incidentally) in that, in any non-terminal > > state, there can be multiple subsequent states that the machine can > > transition to given an input symbol. When executed, for any state, > > the simulator will transition to every permissible subsequent state > > simultaneously, discarding impossible states as they become evident. > > > > This implies that NDFA execution is superlinear, but it is bounded, > > and is O(n*m*e), where n is the length of the input, m is the number > > of nodes in the state transition graph corresponding to the NDFA, and > > e is the maximum number of edges leaving any node in that graph (for > > a fully connected graph, that would m, so this can be up to O(n*m^2)). > > Construction of an NDFA is O(m), so while it's slower to execute, it's > > actually possible to construct in a reasonable amount of time. Russ's > > excellent series of articles that Clem linked to gives details and > > algorithms. > > I only vaguely understand those three paragraphs as they are deeper > computer science than I've gone before. > > I think I get the gist of them but could not explain them if my life > depended upon it. Basically, a regular expression is a regular expression if you can build a machine with no additional memory that can tell you whether or not a given string matches the RE examining its input one character at a time. > > In practical terms? Basically, don't worry about it too much. Egrep > > will generate an NDFA simulation that's going to be acceptably fast > > for all but the weirdest cases. > > ACK > > It sounds like I can make any reasonable extended regular expression a > human can read and I'll probably be good. I think that's about right. > Thank you for the detailed response Dan. :-) Sure thing! - Dan C. [-- Attachment #2: Type: text/html, Size: 11000 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 3:04 ` Dan Cross @ 2023-03-03 3:53 ` Grant Taylor via COFF 2023-03-03 13:47 ` Dan Cross 0 siblings, 1 reply; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-03 3:53 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 5031 bytes --] On 3/2/23 8:04 PM, Dan Cross wrote: > I guess what I'm saying is, match what you want to match and don't sweat > the small stuff. ACK > Not exactly. :-) > > What I understand you to mean, based on this and the rest of your note, > is that you want to find a good division point between overly specific, > complex REs and simpler, easy to understand REs that are less specific. > The danger with the latter is that they may match things you don't > intend, while the former are harder to maintain and (arguably) more > brittle. I can sympathize. You got it. > For the purposes of grep/egrep, that'll be a logical "line" of text, > terminated by a newline, though the newline itself isn't considered part > of the text for matching. I believe the `-z` option can be used to set a > NUL byte as the "line" terminator; presumably this lets one match > strings with embedded newlines, though I haven't tried. Fair enough. That's also sort of what I thought might be the case. > "string" in this context is the input you're attempting to match > against. `egrep` will attempt to match your pattern against each "line" > of text it reads from the files its searching. That is, each line in > your log file(s). *nod* > But consider what `[ :[:digit:]]{11}` means: you've got a character > class consisting of space, colon and a digit; {11} means "match any of > the characters in that class exactly 11 times" (as opposed to other > variations on the '{}' syntax that say "at least m times", "at most n > times", or "between n and m times"). Yep, I'm well aware of the that. 
> But that'll match all sorts of things that don't look like 'dd > hh:mm:ss': That's one of the reasons that I'm interested in coming up with a more precise regular expression ... without being overly complex. > (The first line is my typing; the second is output from egrep except for > the short line of 9 '1's, for which egrep had no output. The last two > lines are matching space characters and egrep echoing the match, but I'm > guessing gmail will eat those.) > > Note that there are inputs with more than 11 characters that match; this > is because there is some 11-character substring that matches the RE in > those lines. In any event, I suspect this would generally not be what > you want. But if nothing else in your input can match the RE (which you > might know a priori because of domain knowledge about whatever is > generating those logs) then it's no big deal, even if the RE was capable > of matching more things generally. Yep. Here's an example of the full RE: ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$ As you can see, the "[ :[:digit:]]{11}" is actually only a sub-part of a larger RE and there is bounding & delimiting around the subpart. This is to match a standard message from postfix via standard SYSLOG. > Ah. I suspect this relies on domain knowledge about the format of log > lines to match reliably. Otherwise it could match, `___ 123 456:789` > which is probably not what you are expecting. Yep. Though said domain knowledge isn't anything special in and of itself. > Sure. One nice thing about `egrep` et al is that you can put the REs > into a file and include them with `-f`, as opposed to having them all > directly on the command line. Yep. logcheck makes extensive use of many files like this to do its work. > Typo. :-) ACKK > That seems reasonable. Thank you for the logic CRC. 
> Aside: I found the note on its website amusing: Brought to you by the > UK's best gambling sites! "Only gamble with what you can afford to > lose." Yikes! Um ... that's concerning. > I'd proceed with caution here; it also seems to be in the FreeBSD and > DragonFly ports collections and Homebrew on the Mac (but so is GNU grep > for all of those). Fair enough. My use case is on Linux where GNU egrep is a thing. > Yeah. IMHO `\w` is too general for what you're trying to do. I think that `\w` is a good primer, but not where I want things to end up long term. > Basically, a regular expression is a regular expression if you can build > a machine with no additional memory that can tell you whether or not a > given string matches the RE examining its input one character at a time. I /think/ that I could build a complex nested tree of switch statements to test each character to see if things match what they should or not. Though I would need at least one variable / memory to hold absolutely minimal state to know where I am in the switch tree. I think a number to identify the switch statement in question would be sufficient. So I'm guessing two bytes of variable and uncounted bytes of program code. > I think that's about right. Thank you again Dan. > Sure thing! :-) -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
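Grant's switch-tree idea can be made concrete. The matcher below is my own illustration, not code from the thread: it checks an hh:mm:ss field using a single integer of state, the "number to identify the switch statement" he describes, and is deliberately exactly as permissive as the thread's (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]] sub-expression (so it accepts hour 29, for instance).

```python
DIGITS = '0123456789'

def match_hhmmss(text):
    # One integer of state, as Grant suggests: which "switch arm" we
    # are in. States 0..7 consume the eight characters of hh:mm:ss;
    # reaching state 8 with no input left over means a complete match.
    state = 0
    for ch in text:
        if state == 0:                 # first hour digit: 0-2
            if ch in '012': state = 1
            else: return False
        elif state in (1, 4, 7):       # second digit of hh, mm, ss
            if ch in DIGITS: state += 1
            else: return False
        elif state in (2, 5):          # the ':' separators
            if ch == ':': state += 1
            else: return False
        elif state in (3, 6):          # first digit of mm, ss: 0-5
            if ch in '012345': state += 1
            else: return False
        else:                          # trailing input after a full match
            return False
    return state == 8

print(match_hhmmss('03:23:38'))   # True
print(match_hhmmss('03:63:38'))   # False
print(match_hhmmss('3:23:38'))    # False
```

No stack, no backtracking: the single `state` integer is the entirety of the machine's memory, which is what makes the pattern regular.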
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 3:53 ` Grant Taylor via COFF @ 2023-03-03 13:47 ` Dan Cross 2023-03-03 19:26 ` Grant Taylor via COFF 0 siblings, 1 reply; 46+ messages in thread From: Dan Cross @ 2023-03-03 13:47 UTC (permalink / raw) To: Grant Taylor; +Cc: coff On Thu, Mar 2, 2023 at 10:53 PM Grant Taylor via COFF <coff@tuhs.org> wrote: >[snip > Here's an example of the full RE: > > ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ > postfix/msa/smtpd\[[[:digit:]]+\]: timeout after STARTTLS from > [._[:alnum:]-]+\[[.:[:xdigit:]]+\]$ > > As you can see the "[ :[:digit:]]{11}" is actually only a sub-part of a > larger RE and there is bounding & delimiting around the subpart. Oh, for sure; to be clear, it was obvious that in the earlier discussion the original was just part of something larger. FWIW, this RE seems ok to me; the additional context makes it unlikely to match something else accidentally. > This is to match a standard message from postfix via standard SYSLOG. > > > Ah. I suspect this relies on domain knowledge about the format of log > > lines to match reliably. Otherwise it could match, `___ 123 456:789` > > which is probably not what you are expecting. > > Yep. > > Though said domain knowledge isn't anything special in and of itself. It needn't be special. The point is simply that there's some external knowledge that can be brought to bear to guide the shape of the REs. In this case, you know that log lines won't begin with `___ 123 456:789` or other similar junk. > [snip] > > Basically, a regular expression is a regular expression if you can build > > a machine with no additional memory that can tell you whether or not a > > given string matches the RE examining its input one character at a time. > > I /think/ that I could build a complex nested tree of switch statements > to test each character to see if things match what they should or not. 
> Though I would need at least one variable / memory to hold absolutely > minimal state to know where I am in the switch tree. I think a number > to identify the switch statement in question would be sufficient. So > I'm guessing two bytes of variable and uncounted bytes of program code. Kinda. The "machine" in this case is actually an abstraction, like a Turing machine. The salient point here is that REs map to finite state machines, and in particular, one need not keep (say) a stack of prior states when simulating them. Note that even in an NDFA simulation, where one keeps track of what states one may be in, one doesn't need to keep track of how one got into those states. Obviously in a real implementation you've got the program counter, register contents, local variables, etc, all of which consume "memory" in the conventional sense. But the point is that you don't need additional memory proportional to anything other than the size of the RE. A DFA could be implemented entirely with `switch` and `goto` if one wanted, as opposed to a bunch of mutually recursive function calls; NDFA simulation is similar, except that you need some (bounded) additional memory to hold the active set of states. Contrast this with a pushdown automaton, which can parse a context-free language, in which a stack is maintained that can store additional information relative to the input (for example, an already seen character). Pushdown automata can, for example, recognize matched parentheses while regular languages cannot. Anyway, sorry, this is all rather more theoretical than is perhaps interesting or useful. Bottom line is, I think your REs are probably fine. `egrep` will complain at you if they are not, and I wouldn't worry too much about optimizing them: I'd "stop" whenever you're happy that you've got something understandable that matches what you want it to match. - Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
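Dan's contrast can be sketched side by side. Both snippets below are mine, invented for illustration: a table-driven DFA (the Python stand-in for his switch-and-goto formulation, one current-state integer and nothing else) next to a stack-based recognizer for balanced parentheses, the canonical job that needs a pushdown automaton's extra memory.

```python
# DFA over the alphabet {a, b} accepting strings that end in "ab".
# The transition table plays the role of Dan's switch/goto: the only
# mutable memory is the single `state` integer.
dfa = {
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 1, (1, 'b'): 2,
    (2, 'a'): 1, (2, 'b'): 0,
}

def dfa_match(text):
    state = 0
    for ch in text:
        state = dfa[(state, ch)]
    return state == 2            # state 2 is the accepting state

# Pushdown-automaton-style recognizer: the stack is the additional
# memory, proportional to the input, that lets it do what no regular
# expression can -- recognize balanced parentheses.
def balanced(text):
    stack = []
    for ch in text:
        if ch == '(':
            stack.append(ch)
        elif ch == ')':
            if not stack:
                return False
            stack.pop()
    return not stack

print(dfa_match('abab'))    # True
print(dfa_match('abba'))    # False
print(balanced('(()())'))   # True
print(balanced('(()'))      # False
```

The DFA's memory stays constant no matter how long the input is; the parenthesis checker's stack can grow with the input, which is precisely the line between regular and context-free.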
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 13:47 ` Dan Cross @ 2023-03-03 19:26 ` Grant Taylor via COFF 0 siblings, 0 replies; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-03 19:26 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 3045 bytes --] On 3/3/23 6:47 AM, Dan Cross wrote: > Oh, for sure; to be clear, it was obvious that in the earlier > discussion the original was just part of something larger. Good. For a moment I thought that you might be thinking it was stand alone. > FWIW, this RE seems ok to me; the additional context makes it unlikely > to match something else accidentally. :-) > It needn't be special. The point is simply that there's some external > knowledge that can be brought to bear to guide the shape of the REs. ACK I've heard "domain (specific) knowledge" used to refer to both extremely specific training in a field and -- as you have -- data that is having something done to it. > In this case, you know that log lines won't begin with `___ 123 > 456:789` or other similar junk. They darned well had better not. > Kinda. The "machine" in this case is actually an abstraction, like a > Turing machine. The salient point here is that REs map to finite state > machines, and in particular, one need not keep (say) a stack of prior > states when simulating them. Note that even in an NDFA simulation, > where one keeps track of what states one may be in, one doesn't need > to keep track of how one got into those states. ACK > Obviously in a real implementation you've got the program counter, > register contents, local variables, etc, all of which consume > "memory" in the conventional sense. But the point is that you don't > need additional memory proportional to anything other than the size > of the RE. 
DFA implementation could be implemented entirely with > `switch` and `goto` if one wanted, as opposed to a bunch of mutually > recursive function calls, NDFA simulation similarly except that > you need some (bounded) additional memory to hold the active set > of states. Contrast this with a pushdown automata, which can parse > a context-free language, in which a stack is maintained that can > store additional information relative to the input (for example, > an already seen character). Pushdown automata can, for example, > recognize matched parenthesis while regular languages cannot. I think I understand the gist of what you're saying, but I need to re-read it and think about it a little bit. > Anyway, sorry, this is all rather more theoretical than is perhaps > interesting or useful. Apology returned to sender as unnecessary. You are providing the requested thought provoking discussion, which is exactly what I asked for. I feel like I'm going to walk away from this thread wiser based on the thread's content plus all additional reading material on top of the thread itself. > Bottom line is, I think your REs are probably fine. `egrep` will > complain at you if they are not, and I wouldn't worry too much about > optimizing them: I'd "stop" whenever you're happy that you've got > something understandable that matches what you want it to match. Thank you (again) Dan. :-) -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF 2023-03-02 19:23 ` [COFF] " Clem Cole 2023-03-02 21:53 ` Dan Cross @ 2023-03-03 10:59 ` Ralph Corderoy 2023-03-03 13:11 ` Dan Cross 2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall 2023-03-03 19:06 ` Grant Taylor via COFF 2023-03-06 10:01 ` Ed Bradford 4 siblings, 2 replies; 46+ messages in thread From: Ralph Corderoy @ 2023-03-03 10:59 UTC (permalink / raw) To: coff Hi Grant, > What are the pros / cons to creating extended regular expressions like > the following: If you want to understand: - the maths of regular expressions, - the syntax of regexps which these days expresses more than REs, and - the regexp engines in programs, the differences in how they work and what they match, and - how to efficiently steer an engine's internals then I recommend Jeffrey Friedl's Mastering Regular Expressions. http://regex.info/book.html > For matching patterns like the following in log files? > > Mar 2 03:23:38 Do you want speed of matching with some false positives or validation by regexp rather than post-lexing logic and to what depth, e.g. does this month have a ‘31st’? /^... .. ..:..:../ You'd said egrep, which is NDFA, but in other engines, alternation order can matter, e.g. ‘J’ starts the most months and some months have more days than others. /^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Oct|Dec|... -- Cheers, Ralph. ^ permalink raw reply [flat|nested] 46+ messages in thread
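Ralph's two patterns can be exercised directly. In the sketch below, the Feb|Sep|Oct|Nov|Dec tail of the factored alternation is my own completion of his trailing "...", not something from the thread; note also that Python's re is a backtracking engine, where alternation order genuinely affects work done, whereas in egrep's NDFA simulation it only affects readability.

```python
import re

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Ralph's fast-but-loose prefilter: three chars, space, two chars,
# then an hh:mm:ss-shaped field ('.' matches any character).
loose = re.compile(r'^... .. ..:..:..')

# One possible completion of the factored alternation; common first
# letters are grouped so a backtracking engine fails fast.
factored = re.compile(r'^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Feb|Sep|Oct|Nov|Dec)')

assert all(factored.match(m) for m in months)
print(loose.match('Mar  2 03:23:38') is not None)   # True
print(factored.match('Mar  2 03:23:38').group(1))   # Mar
```

The loose form trades false positives for speed, as Ralph says; the factored form still validates the month names while sharing the J/Ma/A prefixes.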
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 10:59 ` Ralph Corderoy @ 2023-03-03 13:11 ` Dan Cross 2023-03-03 13:42 ` Ralph Corderoy 2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall 1 sibling, 1 reply; 46+ messages in thread From: Dan Cross @ 2023-03-03 13:11 UTC (permalink / raw) To: Ralph Corderoy; +Cc: coff On Fri, Mar 3, 2023 at 5:59 AM Ralph Corderoy <ralph@inputplus.co.uk> wrote: > [snip] > > If you want to understand: > > - the maths of regular expressions, > - the syntax of regexps which these days expresses more than REs, and > - the regexp engines in programs, the differences in how they work and > what they match, and > - how to efficiently steer an engine's internals > > then I recommend Jeffrey Friedl's Mastering Regular Expressions. > http://regex.info/book.html I'm afraid I must sound a note of caution about Friedl's book. Russ Cox alludes to some of the problems in the "History and References" section of his page (https://swtch.com/~rsc/regexp/regexp1.html), which was linked earlier, and he links to this post: http://regex.info/blog/2006-09-15/248 The impression is that Friedl shows wonderfully how to _use_ regular expressions, but does not understand the theory behind their implementation. It is certainly true that today what many people refer to as "regular expressions" are not in fact regular (and require a pushdown automaton to implement, putting them somewhere between REs and the context-free languages in terms of expressiveness). Personally, I'd stick with Russ's stuff, especially as `egrep` is the target here. - Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 13:11 ` Dan Cross @ 2023-03-03 13:42 ` Ralph Corderoy 2023-03-03 19:19 ` Grant Taylor via COFF 0 siblings, 1 reply; 46+ messages in thread From: Ralph Corderoy @ 2023-03-03 13:42 UTC (permalink / raw) To: coff Hi Dan, > > If you want to understand: > > > > - the maths of regular expressions, > > - the syntax of regexps which these days expresses more than REs, and > > - the regexp engines in programs, the differences in how they work > > and what they match, and > > - how to efficiently steer an engine's internals > > > > then I recommend Jeffrey Friedl's Mastering Regular Expressions. > > http://regex.info/book.html > > I'm afraid I must sound a note of caution about Friedl's book. Russ > Cox alludes to some of the problems in the "History and References" > section of his page (https://swtch.com/~rsc/regexp/regexp1.html), that > was linked earlier Russ says: 1 ‘Finally, any discussion of regular expressions would be incomplete without mentioning Jeffrey Friedl's book Mastering Regular Expressions, perhaps the most popular reference among today's programmers. 2 Friedl's book teaches programmers how best to use today's regular expression implementations, but not how best to implement them. 3 What little text it devotes to implementation issues perpetuates the widespread belief that recursive backtracking is the only way to simulate an NFA. 4 Friedl makes it clear that he [neither understands nor respects] the underlying theory.’ http://regex.info/blog/2006-09-15/248 I think Grant is after what Russ addresses in sentence 2. :-) > The impression is that Friedl shows wonderfully how to _use_ regular > expressions, but does not understand the theory behind their > implementation. Yes, Friedl does show that wonderfully. From long-ago memory, Friedl understands enough to have diagrams of NFAs and DFAs clocking through their inputs, showing the differences in number of states, etc. 
Yes, Friedl says an NFA must recursively backtrack. As Russ says in #3, it was a ‘widespread belief’. Friedl didn't originate it; I ‘knew’ it before reading his book. Friedl was at the sharp end of regexps, needing to process large amounts of text, at Yahoo! IIRC. He investigated how the programs available behaved; he didn't start at the theory and come up with a new program best suited to his needs. > Personally, I'd stick with Russ's stuff, especially as `egrep` is the > target here. Russ's stuff is great. He refuted that widespread belief, for one thing. But Russ isn't trying to teach a programmer how to best use the regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes the many pages needed to do this. It depends what one wants to learn first. As Friedl says in the post Russ linked to: ‘As a user, you don't care if it's regular, nonregular, unregular, irregular, or incontinent. So long as you know what you can expect from it (something this chapter will show you), you know all you need to care about. ‘For those wishing to learn more about the theory of regular expressions, the classic computer-science text is chapter 3 of Aho, Sethi, and Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley, 1986), commonly called “The Dragon Book” due to the cover design. More specifically, this is the “red dragon”. The “green dragon” is its predecessor, Aho and Ullman's Principles of Compiler Design.’ In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory, Languages, and Computation’ goes further into the subject. Chapter two has DFA, NFA, epsilon transitions, and uses searching text as an example. Chapter three is regular expressions, four is regular languages. Pushdown automata is chapter six. Too many books, not enough time to read. :-) -- Cheers, Ralph. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 13:42 ` Ralph Corderoy @ 2023-03-03 19:19 ` Grant Taylor via COFF 2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy 0 siblings, 1 reply; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-03 19:19 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 3590 bytes --] On 3/3/23 6:42 AM, Ralph Corderoy wrote: > I think Grant is after what Russ addresses in sentence 2. :-) You are mostly correct. The motivation for this thread is very much wanting to learn "how best to use today's regular expression implementations". However, there is also the part of me that wants to have a little bit of understanding behind why the former is the case. > Yes, Friedl does show that wonderfully. From long-ago memory, Friedl > understands enough to have diagrams of NFAs and DFAs clocking through > their inputs, showing the differences in number of states, etc. It seems like I need to find another copy of Friedl's book. -- My current copy is boxed up for a move nearly 1k miles away. :-/ > Yes, Friedl says an NFA must recursively backtrack. As Russ says in #3, > it was a ‘widespread belief’. Friedl didn't originate it; I ‘knew’ it > before reading his book. Friedl was at the sharp end of regexps, > needing to process large amounts of text, at Yahoo! IIRC. He > investigated how the programs available behaved; he didn't start at the > theory and come up with a new program best suited to his needs. It sounds like I'm coming from a similar position of "what is the best* way to process this corpus" more than "what is the underlying theory behind what I'm wanting to do". > Russ's stuff is great. He refuted that widespread belief, for one > thing. But Russ isn't trying to teach a programmer how to best use the > regexp engine in sed, grep, egrep, Perl, PCRE, ... 
whereas Friedl takes > the many pages needed to do this. :-) > It depends what one wants to learn first. I'm learning that I'm more of a technician who wants to know how to use the existing tools to the best of his / their ability, while having some interest in the theory behind things. > As Friedl says in the post Russ linked to: > > ‘As a user, you don't care if it's regular, nonregular, unregular, > irregular, or incontinent. So long as you know what you can expect > from it (something this chapter will show you), you know all you need > to care about. Yep. That's the position that I would be in if someone were paying me to write the REs that I'm writing. > ‘For those wishing to learn more about the theory of regular expressions, > the classic computer-science text is chapter 3 of Aho, Sethi, and > Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley, > 1986), commonly called “The Dragon Book” due to the cover design. > More specifically, this is the “red dragon”. The “green dragon” > is its predecessor, Aho and Ullman's Principles of Compiler Design.’ This all sounds interesting to me, and like something I might add to my collection of books. But it also sounds like something that will be an uphill read and a vast learning opportunity. > In addition to the Dragon Book, Hopcroft and Ullman's ‘Automata Theory, > Languages, and Computation’ goes further into the subject. Chapter two > has DFA, NFA, epsilon transitions, and uses searching text as an > example. Chapter three is regular expressions, four is regular > languages. Pushdown automata is chapter six. > > Too many books, not enough time to read. :-) Yep. Even inventorying and keeping track of the books can be time consuming. -- Thankfully I took some time to do exactly that and have access to that information on the super computer in my pocket. -- Grant. . . . 
unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) 2023-03-03 19:19 ` Grant Taylor via COFF @ 2023-03-04 10:15 ` Ralph Corderoy 2023-03-07 21:49 ` [COFF] " Tomasz Rola 2023-06-20 16:02 ` Michael Parson 0 siblings, 2 replies; 46+ messages in thread From: Ralph Corderoy @ 2023-03-04 10:15 UTC (permalink / raw) To: coff Hi, Grant wrote: > Even inventorying and keeping track of the books can be time > consuming. -- Thankfully I took some time to do exactly that and > have access to that information on the super computer in my pocket. I seek recommendations for an Android app to comfortably read PDFs on a mobile phone's screen. They were intended to be printed as a book. In particular, once I've zoomed and panned to get the interesting part of a page as large as possible, swiping between pages should persist that view. An extra point for allowing odd and even pages to use different panning. -- Cheers, Ralph. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) 2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy @ 2023-03-07 21:49 ` Tomasz Rola 2023-03-07 22:46 ` Tomasz Rola 1 sibling, 1 reply; 46+ messages in thread From: Tomasz Rola @ 2023-03-07 21:49 UTC (permalink / raw) To: coff On Sat, Mar 04, 2023 at 10:15:33AM +0000, Ralph Corderoy wrote: > Hi, > > Grant wrote: > > Even inventorying and keeping track of the books can be time > > consuming. -- Thankfully I took some time to do exactly that and > > have access to that information on the super computer in my pocket. > > I seek recommendations for an Android app to comfortably read PDFs on a > mobile phone's screen. They were intended to be printed as a book. In > particular, once I've zoomed and panned to get the interesting part of a > page as large as possible, swiping between pages should persist that > view. An extra point for allowing odd and even pages to use different > panning. My own recommendation for this is to get a dedicated ebook reader. It will feel a bit clumsy to have both a cretinphone and another thing with you, but at least the thing is doing the job. At least, mine keeps cropping across pages. Also, the e-ink/epaper display of an ebook reader is not supposed to screw your eyes and/or circadian rhythms (not that I know anything specific, but I find it very strange that people shine blue light into their eyes for extended periods of time and do not even quietly protest - well, perhaps it is akin to what goes between human and a dog, they become alike to each other, now, when a human has cretinphone...). Or, if it is just one pdf to read, then you should be fine reading it on a bigger screen. HTH -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. 
** ** As the answer, master did "rm -rif" on the programmer's home ** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_rola@bigfoot.com ** ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) 2023-03-07 21:49 ` [COFF] " Tomasz Rola @ 2023-03-07 22:46 ` Tomasz Rola 0 siblings, 0 replies; 46+ messages in thread From: Tomasz Rola @ 2023-03-07 22:46 UTC (permalink / raw) To: coff On Tue, Mar 07, 2023 at 10:49:14PM +0100, Tomasz Rola wrote: [...] > people shine blue light into their eyes for extended periods of time > and do not even quietly protest - well, perhaps it is akin to what > goes between human and a dog, they become alike to each other, now, > when a human has cretinphone...). To answer unasked question, I own a cretinphone too :-). And few dumbs. Together, they sum up to something like cretinphone on steroids. -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home ** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_rola@bigfoot.com ** ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) 2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy 2023-03-07 21:49 ` [COFF] " Tomasz Rola @ 2023-06-20 16:02 ` Michael Parson 2023-06-20 21:26 ` Tomasz Rola 1 sibling, 1 reply; 46+ messages in thread From: Michael Parson @ 2023-06-20 16:02 UTC (permalink / raw) To: coff On Sat, 4 Mar 2023, Ralph Corderoy wrote: > Date: Sat, 4 Mar 2023 04:15:33 > From: Ralph Corderoy <ralph@inputplus.co.uk> > To: coff@tuhs.org > Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on > extended regular expressions in grep.) > > Hi, > > Grant wrote: >> Even inventorying and keeping track of the books can be time >> consuming. -- Thankfully I took some time to do exactly that and have >> access to that information on the super computer in my pocket. > > I seek recommendations for an Android app to comfortably read PDFs on > a mobile phone's screen. They were intended to be printed as a book. > In particular, once I've zoomed and panned to get the interesting part > of a page as large as possible, swiping between pages should persist > that view. An extra point for allowing odd and even pages to use > different panning. Sorry for responding to an old thread got behind on my list-mail reading, but I wanted to share my $.02. Someone else mentioned an e-book reader app, and I second that, mostly...Moon+ Reader on Android is the e-book reader I've been using for a while and it does a good job with standard e-book formats as well as PDF files, IF the PDF is a PDF of formatted text. It even has a mode where it will do a pretty decent job of on-the-fly converting/reformatting the text of the PDF to something that can actually be read on a small (phone) screen. 
However, if the PDF is just a bunch of 1 image per page wrapped in a PDF container, you're out of luck and back to zoom/pan around the page. For most of my digital book reading these days, I use a Boox e-ink reader. It runs Android, so I can use the same e-book reader I used on my phone. It can even sync where you're at in the book/document via dropbox and you can move between multiple devices if needed. If I want to mark-up the PDF, the built-in stuff on the Boox handles that nicely. If I'm on my phone, I use an app called Xodo. -- Michael Parson Pflugerville, TX ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) 2023-06-20 16:02 ` Michael Parson @ 2023-06-20 21:26 ` Tomasz Rola 2023-06-22 15:45 ` Michael Parson 0 siblings, 1 reply; 46+ messages in thread From: Tomasz Rola @ 2023-06-20 21:26 UTC (permalink / raw) To: Computer Old Farts Followers On Tue, Jun 20, 2023 at 11:02:33AM -0500, Michael Parson wrote: > On Sat, 4 Mar 2023, Ralph Corderoy wrote: > > > Date: Sat, 4 Mar 2023 04:15:33 > > From: Ralph Corderoy <ralph@inputplus.co.uk> > > To: coff@tuhs.org > > Subject: [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on > > extended regular expressions in grep.) > > > > Hi, > > > > Grant wrote: > > > Even inventorying and keeping track of the books can be time > > > consuming. -- Thankfully I took some time to do exactly that and have > > > access to that information on the super computer in my pocket. > > > > I seek recommendations for an Android app to comfortably read PDFs on > > a mobile phone's screen. They were intended to be printed as a book. > > In particular, once I've zoomed and panned to get the interesting part > > of a page as large as possible, swiping between pages should persist > > that view. An extra point for allowing odd and even pages to use > > different panning. > > Sorry for responding to an old thread got behind on my list-mail > reading, but I wanted to share my $.02. > > Someone else mentioned an e-book reader app, and I second that, > mostly...Moon+ Reader on Android is the e-book reader I've been using > for a while and it does a good job with standard e-book formats > as well as PDF files, IF the PDF is a PDF of formatted text. It > even has a mode where it will do a pretty decent job of on-the-fly > converting/reformatting the text of the PDF to something that can > actually be read on a small (phone) screen. 
However, if the PDF is just > a bunch of 1 image per page wrapped in a PDF container, you're out of > luck and back to zoom/pan around the page. > > For most of my digital book reading these days, I use a Boox e-ink > reader. It runs Android, so I can use the same e-book reader I used > on my phone. It can even sync where you're at in the book/document via > dropbox and you can move between multiple devices if needed. > > If I want to mark-up the PDF, the built-in stuff on the Boox handles > that nicely. If I'm on my phone, I use an app called Xodo. > > -- > Michael Parson > Pflugerville, TX Hello Michael, I am not challenging your choices (to each his/her own), but to add some alternative, my own preferences go toward: a. have sd card slot in a reader (I mean hardware with e-ink, not some app on a phone). This means a card can be slipped into the box without opening it. This means the box is not water-proof. However, I had a look inside and I suspect it can still be water-proofed with duct tape, if someone wants it so much. b. so far I was rather happy with Linux custom made by manufacturer, but not an Android - I am yet to try Android based ebook reader (but maybe not too fast). Phones with A* are rather lousy at keeping their batteries loaded, I wonder how eink devices fare - do they, too, need to be recharged every day? My reader is recharged every 2-3 weeks, when batt drops to about 70%, while I keep using it at least every second day for few hours at a time. I had once (many years ago, when I was to buy my first reader) a dream of browsing web pages with it. However, built-in browser in non-A* reader proved to be lousy, equally lousy to browser in A* phones that I have tried. So, my current ereader was never connected to the net because I see no point. Of course each model nowadays comes with wi-fi, it just does not add anything useful so no need to even configure it on home router etc. 
Nowadays, I would rather convert whatever to pdf or epub and upload to the reader. Reading wikipedia pages printed to pdf saved me plenty of grief, as opposed to trying them in a (builtin) browser. I suspect elinks could look much better, but trying this requires some free time (compiling/porting, uploading). As a side note, I have observed that some pdfs do not render too well on my reader - it seems that they make this software "too new" to be solid & fast nowadays. Same pdfs converted to djvu read like a dream, however. Having more than few supported book formats is nice. My reader also comes with BT, possibly meant to connect headphones but perhaps usable for BT keyboard. Might be a thing to try in a future (or not), I mention it to let others know there may be such an option in case they care about it (I really do not, but I do not make those things so what can I do...). HTH -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home ** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_rola@bigfoot.com ** ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) 2023-06-20 21:26 ` Tomasz Rola @ 2023-06-22 15:45 ` Michael Parson 2023-07-10 9:08 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola 0 siblings, 1 reply; 46+ messages in thread From: Michael Parson @ 2023-06-22 15:45 UTC (permalink / raw) To: Computer Old Farts Followers On Tue, 20 Jun 2023, Tomasz Rola wrote: > On Tue, Jun 20, 2023 at 11:02:33AM -0500, Michael Parson wrote: <snip> >> Sorry for responding to an old thread; I got behind on my list-mail >> reading, but I wanted to share my $.02. >> >> Someone else mentioned an e-book reader app, and I second that, >> mostly...Moon+ Reader on Android is the e-book reader I've been using >> for a while and it does a good job with standard e-book formats >> as well as PDF files, IF the PDF is a PDF of formatted text. It >> even has a mode where it will do a pretty decent job of on-the-fly >> converting/reformatting the text of the PDF to something that can >> actually be read on a small (phone) screen. However, if the PDF is just >> a bunch of 1 image per page wrapped in a PDF container, you're out of >> luck and back to zoom/pan around the page. >> >> For most of my digital book reading these days, I use a Boox e-ink >> reader. It runs Android, so, I can use the same e-book reader I used >> on my phone. It can even sync where you're at in the book/document via >> dropbox and you can move between multiple devices if needed. >> >> If I want to mark-up the PDF, the built-in stuff on the Boox handles >> that nicely. If I'm on my phone, I use an app called Xodo. >> > Hello Michael, > > I am not challenging your choices (to each his/her own), but to add > some alternative, my own preferences go toward: > > a. have sd card slot in a reader (I mean hardware with e-ink, not some > app on a phone). 
This means a card can be slipped into the box without > opening it. This means the box is not water-proof. However, I had a > look inside and I suspect it can still be water-proofed with duct > tape, if someone wants it so much. (For the record, the "device" I'm discussing is my Boox Nova Air 7.8" E Ink tablet, Amazon product B09FQ14Z6N.) What are you wanting/needing an SD card for? Extra storage? File-transfer? For what I use this device for, the built-in storage (32GB) has been more than enough to hold the books & notes I need it to. For file transfer, the USB Type-C port on it goes both ways, you can connect it to a computer and move files that way, or you can get a Type-C SD card reader. Plus there's wifi for network transfers. I store my e-books in Calibre on my Linux desktop, which has a web-server that the software on the device knows how to talk to, or I can connect the device and my Linux box together over USB and Calibre recognizes it as an e-reader device and can push/pull books via its interface. This device does not advertise any water-proofing at all. > b. so far I was rather happy with Linux custom made by manufacturer, > but not an Android - I am yet to try Android based ebook reader (but > maybe not too fast). Phones with A* are rather lousy at keeping their > batteries loaded, I wonder how eink devices fare - do they, too, need > to be recharged every day? My reader is recharged every 2-3 weeks, > when batt drops to about 70%, while I keep using it at least every > second day for a few hours at a time. Out of the box, this device has very aggressive power-saving modes. If the screen is left off for more than something like 10 minutes, it would do a full shutdown, which means the next time you hit the power button, it does a full boot. To be fair, it boots much faster than my phone, about 15 seconds from hitting the power button to asking me to unlock the screen. Default settings also turn off the wifi & bluetooth when the screen is off. 
In this mode, yeah, it will last a few weeks. I disabled the shutdown bits, I want to have that "instant on" experience. Well, as instant as things get with E-Ink. From what I've noticed, the biggest battery drain seems to be if I use the backlight or not. I mostly don't, but it's there when other light isn't available. With my usage patterns and device settings, I probably charge it up to full once or twice a week, but that's plugging it in when it gets to about 50% battery. If I left the wifi off, I could probably extend it even more. To be completely honest, I've not worried too much about how long things keep a charge, as long as it can stay charged long enough for me to use it for what I want to do with it. I have lots of USB chargers spread around my house, which means that I can tend to have something charging and still be nearby. The exception to this is my watch. I have a Fossil Hybrid smart-watch. It's an analog watch with an e-ink background that can show me alerts/info from my phone. It only needs to be charged about once every 10 days or so. I don't want a watch that I'd have to recharge daily. > I had once (many years go, when I was to buy my first reader) a > dream of browsing web pages with it. However, built in browser in > non-A* reader proved to be lousy, equally lousy to browser in A* > phones that I have tried. So, my current ereader was never connected > to the net because I see no point. Of course each model nowadays > comes with wi-fi, it just does not add anything useful so no need > to even configure it on home router etc. Nowadays, I would rather > convert whatever to pdf or epub and upload to the reader. Reading > wikipedia pages printed to pdf saved me plenty of grief, as opposed to > trying them in a (builtin) browser. I suspect elinks could look much > better, but trying this requires some free time (compiling/porting, > uploading). 
I don't do a lot of web browsing on the device, but when I do, it's using firefox, not the vendor-provided browser. I'm all-in on the FF ecosystem, have an FF account, all my systems have their browsers logged into my FF account, I can send tabs between devices/desktops/laptops easily, etc. I'm sure the same can be done with Chrome. Firefox on Android also has a "Desktop site" slider that will re-render the page like a desktop browser would instead of a mobile one. This is nice on bigger screen Android devices. That being said. The latest update to the firmware on this device has added per-app E-ink refresh settings. So, for reading books in the book-reader app, I have it set on one of the higher-quality, but slower refresh modes. I read my books like books, page at a time, generally little to no scrolling. This is harder to do when browsing the web, so, I have Firefox set to the lower-quality but faster refresh mode. Yeah, there is a little ghosting after the scrolling stops, but it's more like the faint bleed-through you get when reading a newspaper. > As a side note, I have observed that some pdfs do not render too well > on my reader - it seems that they make this software "too new" to be > solid & fast nowadays. Same pdfs converted to djvu read like a dream, > however. Having more than few supported book formats is nice. I mostly only deal with epub, mobi, and PDF. The blurb on Amazon says that it handles "PDF(reflowable), PPT, EPUB, TXT, DJVU, HTML, RTF, FB2, DOC, MOBI, CHM..." > My reader also comes with BT, possibly meant to connect headphones but > perhaps usable for BT keyboard. Might be a thing to try in a future > (or not), I mention it to let others know there may be such an option > in case they care about it (I really do not, but I do not make those > things so what can I do...). Yup, this device has BT as well, since being an Android device, you can listen to MP3s, or stream from Pandora/Spotify/etc. 
I've never set that up on here, I have other devices that can already do that. The only BT I've hooked up is a keyboard, and that was mostly for a bit of fun. I installed an ssh-client on there and spent a few hours logged into my Linux box reading email. I still kinda want a laptop with an e-ink display. The other thing I use this device for is for hand-written notes and sketches. Writing on the screen with the stylus feels a lot like writing with a pencil on paper. I'm not an artist by any stretch, but sometimes writing things out and using boxes, circles, helps sort things out in my head, or sometimes I'll sketch things out while working on a design I eventually want to 3d print. All stuff I'd historically done in a paper notebook, but now I have similarly-sized electronic notepad where I can erase mistakes, have layers to my drawings (like photoshop/gimp/etc), zoom in/out, etc. The notepad app is also linked to my DropBox account, so when I'm done with whatever doodles/notes/sketches, I can then load them up on one of my other systems as PDFs. I've even toyed with the idea of writing letters out in long-hand in the notepad and then emailing the resulting PDFs to people, since the practice of hand-written letters has gone by the wayside, I thought this would be an entertaining way to bring that back. :) If what you're looking for is an E-reader that can also do the note-taking (Or a note-taking device that also functions as an e-reader), and not really much else, I've heard good things about the Remarkable tablets. One of the things they advertise is it being a distraction-free device, no alerts from the socials medias, emails, etc, since the device flat doesn't do those things. I've never used one, but I've seen people at my $DAYJOB say they really like theirs. I'm not saying this is the perfect device, or that it will fill the needs of anyone else. 
It doesn't solve every need I have either, which is why I have multiple devices (phone, this e-reader tablet, personal laptop, work laptop, personal desktop, scattered Raspberry Pis, etc). -- Michael Parson Pflugerville, TX ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) 2023-06-22 15:45 ` Michael Parson @ 2023-07-10 9:08 ` Tomasz Rola 0 siblings, 0 replies; 46+ messages in thread From: Tomasz Rola @ 2023-07-10 9:08 UTC (permalink / raw) To: Computer Old Farts Followers On Thu, Jun 22, 2023 at 10:45:43AM -0500, Michael Parson wrote: > On Tue, 20 Jun 2023, Tomasz Rola wrote: > > On Tue, Jun 20, 2023 at 11:02:33AM -0500, Michael Parson wrote: > > <snip> > [...] > > a. have sd card slot in a reader (I mean hardware with e-ink, not some > > app on a phone). This means a card can be slipped into the box without > > opening it. This means the box is not water-proof. However, I had a > > look inside and I suspect it can still be water-proofed with duct > > tape, if someone wants it so much. > Hello again and sorry for the late answer. Busy, hardware glitches, tired, etc. > (For the record, the "device" I'm discussing is my Boox Nova Air 7.8" E Ink > tablet, Amazon product B09FQ14Z6N.) > > What are you wanting/needing an SD card for? Extra storage? > File-transfer? For what I use this device for, the built-in storage > (32GB) has been more than enough to hold the books & notes I need it to. There is no use pattern as such. It was just very convenient to have one. When I bought my first reader, PocketBook 622 (or something like it, a good decade ago), it only had 2GB of flash. Of this, about half was already prefilled with free ebooks, free photos to see in some app. I decided to keep all this, which left me with .LT. 1GB for my use. If I wanted just epubs, this was more than enough to have a few thousand files. But, an atlas of human anatomy from Internet Archive was ca. 8MB, Lisp 1.5 manual from bitsavers was another 9MB and if I wanted some more, space would quickly run out. An 8GB card solved the problem. Later on, a member of family fell in love with e-reader, too. 
But, in her hands they died rather quickly. So we struck a deal - she will buy me a new one and get mine, which after a year of use was still mostly like new. I just replugged SD card and it was it. Now I have two broken Pocketbooks 622 in a drawer, waiting for their chance - just replace their displays. Later on, I bought me an Inkpad (again from Pocketbook, probably I am a fanboy). This one had about 7.8'' display, many more inky pixels and not too big cpu, so was a bit sluggish. But I liked it. Alas, it entered nirvana as a result of botched upgrade. If I am to believe internet, it can be pulled back into this world with the help of specially crafted serial cable and kermit. So, later on I bought me a Pocketbook 633 (definitely a fanboy!). This one can display 256 colors on outer LCD display and 16 levels of grey on inner e-ink display, which combines into 4096 colors. I wanted to give it a try, but if this specimen dies, I will go back to normal e-ink b/w. The colors are there, but nothing outstanding. Ok for office kind of documents - pie charts, graphs etc. Space photographs, ukiyo-e, no, no, no. Besides, if I use color, it powers on the LCD layer and consumes significantly more power than when it goes with e-ink only. For some documents, color can be turned off, for icons on main page, they show in color. Each time I just replug the card. So next time, I guess I will be looking for another reader with a slot. As for what you write about using cable and or BT for file transfer, it is all true but if reader does not work anymore, it might be hard to get files back. Or impossible. I have a smartphone, just to make a photo sometimes, nothing more, and I store it all on sdcard, again. This one phone does not want to emulate pendrive when I plug it into my Linux box, I had to install ftp server on it (phone). If it dies, I will unplug the card. Easy as a pie. 
No need to spend days looking for some bloody program which maybe is Windows-XP-only or maybe just been dropped from manufacturer support website. So I guess my next phone should better have a socket for sdcard, too. This is the future - if I have a need to support, I need to support it by me. Investing money in something that may turn into a brick and get my stuff with it - I do not think so. Too bad if some manufacturer does not want to play along, but they come by a dozen so maybe I can find what I want :-)... [...] > This device does not advertise any water-proofing at all. Well, somehow all readers without sdcard tout about them being waterproof. Like I wanted to read in water. [...] > The exception to this is my watch. I have a Fossil Hybrid smart-watch. > It's an analog watch with an e-ink background that can show me > alerts/info from my phone. It only needs to be charged about once every > 10 days or so. I don't want a watch that I'd have to recharge daily. I bought me a cheap mechanical clock from China (say, 17 bucks). It only needs a "mechanical recharge" once a day and I have to turn it back 5 minutes once a week to keep it somewhat accurate. Good enough for me. [...] > The other thing I use this device for is for hand-written notes and > sketches. Writing on the screen with the stylus feels a lot like > writing with a pencil on paper. I'm not an artist by any stretch, but > sometimes writing things out and using boxes, circles, helps sort things > out in my head, or sometimes I'll sketch things out while working on a > design I eventually want to 3d print. All stuff I'd historically done in > a paper notebook, but now I have similarly-sized electronic notepad > where I can erase mistakes, have layers to my drawings (like > photoshop/gimp/etc), zoom in/out, etc. Well I went the other way - bought some pencils to see how it goes with paper notes. Color crayons, some other thingies for making circles and lines (compass, ruler). 
Now, seems I will have to make some kind of notebook (in a future, not now) because what I see in my shops somehow does not convince me to buy (yeah I know about Moleskines and Leuchtturms, but this is not what I want, not sure what I want). Right now, it is just a normal crappy school notebook but with nice cover :-) . > The notepad app is also linked to my DropBox account, so when I'm done > with whatever doodles/notes/sketches, I can then load them up on one > of my other systems as PDFs. I've even toyed with the idea of writing "Dropbox baaad, shoe box good" :-) -- Regards, Tomasz Rola -- ** A C programmer asked whether computer had Buddha's nature. ** ** As the answer, master did "rm -rif" on the programmer's home ** ** directory. And then the C programmer became enlightened... ** ** ** ** Tomasz Rola mailto:tomasz_rola@bigfoot.com ** ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 10:59 ` Ralph Corderoy 2023-03-03 13:11 ` Dan Cross @ 2023-03-03 16:12 ` Dave Horsfall 2023-03-03 17:13 ` Dan Cross 2023-03-03 19:36 ` Grant Taylor via COFF 1 sibling, 2 replies; 46+ messages in thread From: Dave Horsfall @ 2023-03-03 16:12 UTC (permalink / raw) To: Computer Old Farts Followers [-- Attachment #1: Type: text/plain, Size: 1185 bytes --] On Fri, 3 Mar 2023, Ralph Corderoy wrote: > You'd said egrep, which is NDFA, but in other engines, alternation order > can matter, e.g. ‘J’ starts the most months and some months have more > days than others. > > /^(J(an|u[nl])|Ma[ry]|A(ug|pr)|Oct|Dec|... I can't help but provide an extract from my antispam log summariser (AWK): # Yes, I have a warped sense of humour here. /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ { date = sprintf("%4d/%.2d/%.2d", year, months[substr($0, 1, 3)], substr($0, 5, 2)) Etc. The idea is not to validate so much as to grab a line of interest to me and extract the bits that I want. In this case I trust the source (the Sendmail log), but of course that is not always the case... When doing things like this, you need to ask yourself at least the following questions: 1) What exactly am I trying to do? This is fairly important :-) 2) Can I trust the data? Bobby Tables, Reflections on Trusting Trust... 3) Etc. And let's not get started on the difference betwixt "trusted" and "trustworthy" (that distinction keeps security bods awake at night). -- Dave ^ permalink raw reply [flat|nested] 46+ messages in thread
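Dave's grab-and-extract idea translates readily to other tools. Here is a hedged Python sketch of the same logic (the names `MONTHS`, `LINE`, and `extract_date` are invented for illustration, and Python's `re` stands in for AWK; this is not Dave's actual code):

```python
import re

# Hypothetical Python rendering of Dave's AWK grab-and-extract idea.
# Like the original, it trusts the source: a "Jer"-style pseudo-month
# would get past the bracket expressions and raise a KeyError at the
# dictionary lookup, which is exactly the trust trade-off Dave notes.
MONTHS = {m: i + 1 for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}
LINE = re.compile(r"^([JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc]) ([ 0123][0-9]) ")

def extract_date(line, year=2023):
    m = LINE.match(line)
    if m is None:
        return None
    # int(" 2") tolerates the leading space in single-digit days.
    return "%4d/%.2d/%.2d" % (year, MONTHS[m.group(1)], int(m.group(2)))

assert extract_date("Mar  2 03:23:38 host prog: msg") == "2023/03/02"
```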
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall @ 2023-03-03 17:13 ` Dan Cross 2023-03-03 17:38 ` Ralph Corderoy 2023-03-03 19:36 ` Grant Taylor via COFF 1 sibling, 1 reply; 46+ messages in thread From: Dan Cross @ 2023-03-03 17:13 UTC (permalink / raw) To: Dave Horsfall; +Cc: Computer Old Farts Followers On Fri, Mar 3, 2023 at 11:12 AM Dave Horsfall <dave@horsfall.org> wrote: > [snip] > # Yes, I have a warped sense of humour here. > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ > { > date = sprintf("%4d/%.2d/%.2d", > year, months[substr($0, 1, 3)], substr($0, 5, 2)) If I may, I'd like to point out something fairly subtle here that, I think, bears on the original question (paraphrased as, "where does one draw the line between concision and understandability?"). Note Dave's class to match the first letter of the month: `[JFMAMJJASOND]`. One may notice that a few letters are repeated (J, M, A), and one _could_ shorten this to: `[JFMASOND]`. But I can see a serious argument where that may be regarded as a mistake; in particular, the original is easy to validate by just saying the names of the month out loud as one scans the list. For the shorter version, I'd worry that I would miss something or make a mistake. The lesson here is keep it simple and don't over-optimize! > Etc. The idea is not to validate so much as to grab a line of interest to > me and extract the bits that I want. > [snip] Too true. A few years ago, Rob Pike gave a talk about lexing in Go that bears on this that's worth a listen: https://www.youtube.com/watch?v=HxaD_trXwRE - Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
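Dan's question about shortening the class can be checked mechanically. A small sketch, using Python's `re` in place of egrep purely for illustration: duplicate members of a character class are redundant, so the two forms accept exactly the same characters and the choice really is only about readability.

```python
import re

# The long, "say the months out loud" class and the deduplicated class:
# a character class ignores duplicate members, so these describe the
# same set of single characters.
long_class = re.compile(r"^[JFMAMJJASOND]")
short_class = re.compile(r"^[JFMASOND]")

samples = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul",
           "Aug", "Sep", "Oct", "Nov", "Dec", "Jer", "Xxx"]
for s in samples:
    # Equivalent on every input, months and non-months alike.
    assert bool(long_class.match(s)) == bool(short_class.match(s))
```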
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 17:13 ` Dan Cross @ 2023-03-03 17:38 ` Ralph Corderoy 2023-03-03 19:09 ` Dan Cross 0 siblings, 1 reply; 46+ messages in thread From: Ralph Corderoy @ 2023-03-03 17:38 UTC (permalink / raw) To: coff Hi, > Dave Horsfall <dave@horsfall.org> wrote: > > # Yes, I have a warped sense of humour here. > > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ ... > in particular, the original is easy to validate by just saying the > names of the month out loud as one scans the list. Some clients pay me to read code and find fault. It's a hard habit to break. ‘coc’ smells wrong. :-) A bit of vi's :map later... Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dcc The regexp works, of course, but in this case removing the redundancy would also fix the ‘fault’. -- Cheers, Ralph. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 17:38 ` Ralph Corderoy @ 2023-03-03 19:09 ` Dan Cross 0 siblings, 0 replies; 46+ messages in thread From: Dan Cross @ 2023-03-03 19:09 UTC (permalink / raw) To: Ralph Corderoy; +Cc: coff On Fri, Mar 3, 2023 at 12:39 PM Ralph Corderoy <ralph@inputplus.co.uk> wrote: > > Dave Horsfall <dave@horsfall.org> wrote: > > > # Yes, I have a warped sense of humour here. > > > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ > ... > > in particular, the original is easy to validate by just saying the > > names of the month out loud as one scans the list. > > Some clients pay me to read code and find fault. It's a hard habit to > break. ‘coc’ smells wrong. :-) > > A bit of vi's :map later... > > Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dcc > > The regexp works, of course, but in this case removing the redundancy > would also fix the ‘fault’. Ha! Good catch. I'd probably just write it as, `(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)` which isn't much longer than the original anyway. - Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall 2023-03-03 17:13 ` Dan Cross @ 2023-03-03 19:36 ` Grant Taylor via COFF 2023-03-04 10:26 ` Ralph Corderoy 1 sibling, 1 reply; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-03 19:36 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 2055 bytes --] On 3/3/23 9:12 AM, Dave Horsfall wrote: > I can't help but provide an extract from my antispam log summariser > (AWK): > > # Yes, I have a warped sense of humour here. > /^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc] [ 0123][0-9] / \ > { > date = sprintf("%4d/%.2d/%.2d", > year, months[substr($0, 1, 3)], substr($0, 5, 2)) Thank you for sharing that, Dave. > Etc. The idea is not to validate so much as to grab a line of interest > to me and extract the bits that I want. Fair enough. Using bracket expressions for the three letters is definitely another idea that I hadn't considered. But I believe I like what I think is -- what I'm going to describe as -- the more precise alternation listing out each month. (Jan|Feb|Mar... Such an alternation is not going to match Jer like the three bracket expressions will. I also believe that the alternation will be easier to maintain in the future. Especially by someone other than me who has less experience with REs. > In this case I trust the source (the Sendmail log), but of course > that is not always the case... I trust that syslog will produce consistent line beginnings more than I trust the data that is provided to syslog. But I'd still like to be able to detect "Jer" or "Dot" if syslog ever tosses its cookies. > When doing things like this, you need to ask yourself at least the > following questions: > > 1) What exactly am I trying to do? This is fairly important :-) Filter out known to be okay log entries. > 2) Can I trust the data? 
Bobby Tables, Reflections on Trusting > Trust... Given that I'm effectively negating things and filtering out log entries that I want to not see (because they are okay) I'm comfortable with trusting the data from syslog. Brown M&Ms come to mind. > 3) Etc. > > And let's not get started on the difference betwixt "trusted" and > "trustworthy" (that distinction keeps security bods awake at night). ACK -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
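Grant's point about "Jer" is easy to demonstrate; a quick sketch with Python's `re` standing in for egrep (illustrative only):

```python
import re

# Dave's column-wise bracket expressions: one class per letter position.
by_column = re.compile(r"^[JFMAMJJASOND][aeapauuuecoc][nbrrynlgptvc]")
# The explicit alternation Grant prefers.
by_name = re.compile(
    r"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)")

# Both accept a real month...
assert by_column.match("Mar  2 03:23:38")
assert by_name.match("Mar  2 03:23:38")

# ...but "Jer" mixes letters from different months, so only the looser
# column-wise form accepts it ("Dot" sneaks through the same way).
assert by_column.match("Jer  2 03:23:38")
assert not by_name.match("Jer  2 03:23:38")
```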
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 19:36 ` Grant Taylor via COFF @ 2023-03-04 10:26 ` Ralph Corderoy 0 siblings, 0 replies; 46+ messages in thread From: Ralph Corderoy @ 2023-03-04 10:26 UTC (permalink / raw) To: coff Hi Grant, > the more precise alternation listing out each month. (Jan|Feb|Mar... For those regexp engines which test each alternative in turn, ordering the months most-frequent first would give a slight win. :-) It really is a rabbit hole once you start. Typically not worth entering, but it can be fun if you like that kind of thing. > I trust that syslog will produce consistent line beginnings more than > I trust the data that is provided to syslog. But I'd still like to be > able to detect "Jer" or "Dot" if syslog ever tosses it's cookies. You could develop your regexps to find lines of interest and then flip them about, e.g. egrep's -v, to see what lines are missed and consider if any are interesting. Repeat. But this happens at development time. Or at run time, you can have a ‘loose’ regexp to let all expected lines in through the door and then match with one or more ‘tight’ regexps, baulking if none do. There's no right answer in general. -- Cheers, Ralph. ^ permalink raw reply [flat|nested] 46+ messages in thread
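Ralph's run-time variant -- one loose gate plus tighter claimants -- might be sketched like this (a hypothetical Python sketch with invented patterns and names, not logcheck's or Ralph's actual mechanism):

```python
import re

# Two-stage filtering: a permissive regexp decides whether a line looks
# like a syslog line at all; a list of strict regexps must then claim
# it, and anything unclaimed is surfaced instead of silently passing.
loose = re.compile(r"^\w{3} [ 0-3]\d \d\d:\d\d:\d\d ")
tight = [
    re.compile(r"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
               r" ( [1-9]|[12]\d|3[01]) [0-2]\d:[0-5]\d:[0-5]\d "),
]

def triage(line):
    """Return 'ok', 'suspicious', or 'not-a-log-line'."""
    if not loose.match(line):
        return "not-a-log-line"
    if any(t.match(line) for t in tight):
        return "ok"
    return "suspicious"  # passed the loose gate, but no tight RE claims it

assert triage("Mar  2 03:23:38 host sshd[1]: ok") == "ok"
assert triage("Jer  2 03:23:38 host sshd[1]: ?") == "suspicious"
assert triage("not a syslog line") == "not-a-log-line"
```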
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF ` (2 preceding siblings ...) 2023-03-03 10:59 ` Ralph Corderoy @ 2023-03-03 19:06 ` Grant Taylor via COFF 2023-03-03 19:31 ` Dan Cross 2023-03-04 10:07 ` Ralph Corderoy 4 siblings, 2 replies; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-03 19:06 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 1241 bytes --] Thank you all for very interesting and engaging comments & threads to chase / pull / untangle. I'd like to expand / refine my original question a little bit. On 3/2/23 11:54 AM, Grant Taylor via COFF wrote: > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -e / egrep. While doing some reading of the references that Clem provided, I came across multiple indications that back-references can be problematic from a performance standpoint. So I'd like to know if all back-references are problematic, or if very specific back-references are okay. Suppose I have the following two lines: aaa aaa aaa bbb Does the following RE w/ back-reference introduce a big performance penalty? (aaa|bbb) \1 As in: % echo "aaa aaa" | egrep "(aaa|bbb) \1" aaa aaa I can easily see how a back reference to something that is not a fixed length can become a rabbit hole. But I'm wondering if a back-reference to -- what I think is called -- an alternation (with length fixed in the RE) is a performance hit or not. Now to read and reply to the many good comments that people have shared. :-) -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 19:06 ` Grant Taylor via COFF @ 2023-03-03 19:31 ` Dan Cross 2023-03-04 10:07 ` Ralph Corderoy 1 sibling, 0 replies; 46+ messages in thread From: Dan Cross @ 2023-03-03 19:31 UTC (permalink / raw) To: Grant Taylor; +Cc: coff On Fri, Mar 3, 2023 at 2:06 PM Grant Taylor via COFF <coff@tuhs.org> wrote: > Thank you all for very interesting and engaging comments & threads to > chase / pull / untangle. > > I'd like to expand / refine my original question a little bit. > > On 3/2/23 11:54 AM, Grant Taylor via COFF wrote: > > I'd like some thoughts ~> input on extended regular expressions used > > with grep, specifically GNU grep -e / egrep. > > While doing some reading of the references that Clem provided, I came across > multiple indications that back-references can be problematic from a > performance standpoint. > > So I'd like to know if all back-references are problematic, or if very > specific back-references are okay. The thing about backreferences is that they're not representable in the regular languages because they require additional state (the thing the backref refers to), so you cannot construct a DFA corresponding to them, nor an NDFA simulator (this is where Friedl gets things wrong!); you really need a pushdown automaton, and then you're in the domain of the context-free languages. Therefore, "regexps" that use back references are not actually regular expressions. Yet, popular engines support them...but how? Well, pretty much all of them use a backtracking implementation, which _can_ be exponential in both time and space. Now, that said, there are plenty of REs, even some with backrefs, that'll execute plenty fast enough on backtracking implementations; it really depends on the expressions in question and the size of strings you're trying to match against. But you lose the bounding guarantees DFAs and NDFAs provide. 
> Suppose I have the following two lines: > > aaa aaa > aaa bbb > > Does the following RE w/ back-reference introduce a big performance penalty? > > (aaa|bbb) \1 > > As in: > > % echo "aaa aaa" | egrep "(aaa|bbb) \1" > aaa aaa > > I can easily see how a back reference to something that is not a fixed > length can become a rabbit hole. But I'm wondering if a back-reference > to -- what I think is called -- an alternation (with length fixed in the > RE) is a performance hit or not. Well, it's more about the implementation strategy than the specific expression here. Could this become exponential? I don't think this one would, no; but others may, particularly if you use Kleene closures in the alternation. This _is_ something that appears in the wild, by the way, not just in theory; I did a change to Google's spelling service code to replace PCRE with re2 precisely because it was blowing up with exponential memory usage on some user input. The problems went away, but I had to rewrite a bunch of the REs involved. - Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
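Dan's distinction is easy to see with Grant's own example; a small sketch using Python's `re` (a backtracking engine, like PCRE):

```python
import re

# The backreference \1 must repeat whatever text the group captured,
# which requires remembering state that no finite automaton carries --
# hence the need for a backtracking implementation.
pat = re.compile(r"(aaa|bbb) \1")

assert pat.search("aaa aaa")      # same branch both times: match
assert pat.search("bbb bbb")
assert not pat.search("aaa bbb")  # branches differ: no match

# A fixed-length backref like this one stays cheap in practice.  The
# blow-ups come from patterns such as (a+)+$ run against "a" * 30 + "b",
# where a backtracking engine explores exponentially many ways to split
# the a's; DFA-based engines (e.g. RE2) avoid that, at the price of not
# supporting backrefs at all.
```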
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-03 19:06 ` Grant Taylor via COFF 2023-03-03 19:31 ` Dan Cross @ 2023-03-04 10:07 ` Ralph Corderoy 1 sibling, 0 replies; 46+ messages in thread From: Ralph Corderoy @ 2023-03-04 10:07 UTC (permalink / raw) To: coff Hi Grant, > Suppose I have the following two lines: > > aaa aaa > aaa bbb > > Does the following RE w/ back-reference introduce a big performance > penalty? > > (aaa|bbb) \1 > > As in: > > % echo "aaa aaa" | egrep "(aaa|bbb) \1" > aaa aaa You could measure the number of CPU instructions and experiment. $ echo xyzaaa aaaxyz >f $ ticks() { LC_ALL=C perf stat -e instructions egrep "$@"; } $ $ ticks '(aaa|bbb) \1' <f xyzaaa aaaxyz Performance counter stats for 'egrep (aaa|bbb) \1': 2790889 instructions:u 0.009146904 seconds time elapsed 0.009178000 seconds user 0.000000000 seconds sys $ Bear in mind that egreps differ, even within GNU egrep, say, over time. $ LC_ALL=C perf stat -e instructions egrep '(aaa|bbb) \1' f xyzaaa aaaxyz ... 2795836 instructions:u ... $ LC_ALL=C perf stat -e instructions perl -ne '/(aaa|bbb) \1/ and print' f xyzaaa aaaxyz ... 2563488 instructions:u ... $ LC_ALL=C perf stat -e instructions sed -nr '/(aaa|bbb) \1/p' f xyzaaa aaaxyz ... 610213 instructions:u ... $ -- Cheers, Ralph. ^ permalink raw reply [flat|nested] 46+ messages in thread
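[Editor's note: a portable, if cruder, companion to the `perf stat` instruction counts above is wall-clock timing with Python's `timeit`. A sketch; absolute numbers vary by machine and engine version, just as the egreps differ above.]

```python
import re
import timeit

line = 'xyzaaa aaaxyz'
backref = re.compile(r'(aaa|bbb) \1')
expanded = re.compile(r'(aaa aaa|bbb bbb)')   # backref spelled out by hand

# Both match the same text here; the spelled-out form stays regular.
assert backref.search(line).group(0) == 'aaa aaa'
assert expanded.search(line).group(0) == 'aaa aaa'

# Relative cost only; like the perf(1) numbers, engines and versions differ.
for name, rx in (('backref', backref), ('expanded', expanded)):
    t = timeit.timeit(lambda: rx.search(line), number=10_000)
    print(f'{name:9s} {t:.4f}s')
```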
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF ` (3 preceding siblings ...) 2023-03-03 19:06 ` Grant Taylor via COFF @ 2023-03-06 10:01 ` Ed Bradford 2023-03-06 21:01 ` Dan Cross 4 siblings, 1 reply; 46+ messages in thread From: Ed Bradford @ 2023-03-06 10:01 UTC (permalink / raw) To: Grant Taylor; +Cc: COFF [-- Attachment #1: Type: text/plain, Size: 5314 bytes --] Thanks, Grant and contributors in this thread, Great thread on RE's. I bought and read the book (it's on the floor over there in the corner and I'm not getting up). My task was finding dates in binary and text files. It turns out RE's work just fine for that. Because I was looking at both text files and binary files, I wrote my stuff using 8-bit python "bytes" rather than python "text" which is, I think, 7-bit in python. (I use python because it works on Linux, Macs, and Windows and reduces the number of RE implementations I have to deal with to 1). I finished my first round of the program late fall of 2022. Then I put it down and now I am revisiting it. I was creating: A Python program to search for media files (pictures and movies) and copy them to another directory tree, copying only the unique ones (deduplication), and renaming each with *YYYY-MM-DD-* as a prefix. Here is a list of observations from my programming. 1. RE's are quite unreadable. I defined a lot of python variables and simply added them together in python to make a larger byte string (see below). The resulting expressions were shorter on screen and more readable. Furthermore, I could construct them incrementally. I insist on readable code because I frequently put things down for a month or more. A while back it was a sad day when I restarted something and simply had to throw it away, moaning, "What was that programmer thinking?". 
Here is an example RE for YYYY-MM-DD # FR = front BA = back # ymdt is text version ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP ymdc = re.compile( ymdt ) 1a. I also had a hard time defining delimiters. There are delimiters for the beginning, delimiters for internal separation, and delimiters for the end. The significant thing is I have to find the RE if it is the very first string in the file or the very last. That also complicates buffered reading immensely. Hence, I wrote the whole program by reading the file into a single python variable. However, when files become much larger than memory, python simply ground to a halt as did my Windows machine. I then rewrote it using a memory mapped file (for all files) and the problem was fixed. 2. Dates are formatted in a number of ways. I chose exactly one format to learn about RE's and how to construct them and use them. Even the book didn't elaborate everything. I could not find detailed documentation on some of the interfaces in the book. On a whim, I asked chatGPT to write a python module that returns a list of offsets and dates in a file. Surprisingly, it wrote one that was quite credible. It had bugs but it knew more about how to use the various functional interfaces in RE's than I did. 3. Testing an RE is maybe even more difficult than writing one. I have not given any serious effort to verification testing yet. I would like to extend my program to any date format. That would require a much bigger RE. I have been led to believe that a 50Kbyte or 500Kbyte RE works just as well (if not as fast) as a 100 byte RE. I think with parentheses and pipe-symbols suitably used, one could match Monday, March 6, 2023 2023-03-06 Mar 6, 2023 or ... I'm just guessing, though. This thread has been very informative. I have much to read. Thank all of you. 
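[Editor's note: the variable-composition style described above can be sketched as follows. The names and the separator class are my stand-ins for Ed's `Y_`, `SEP`, and friends, which are not shown in full in the thread.]

```python
import re

# Stand-ins for Ed's named fragments; compiled as bytes so the same pattern
# works on binary files read (or memory-mapped) as bytes.
_YYYY = r'(19\d\d|20[01]\d|202[0-3])'
_MM   = r'(0[1-9]|1[012])'
_DD   = r'(0[1-9]|[12]\d|3[01])'
_SEP  = r'[-._]'

ymdt = _YYYY + _SEP + _MM + _SEP + _DD
ymdc = re.compile(ymdt.encode())    # bytes pattern, per the 8-bit approach

assert ymdc.search(b'IMG\x00\x012023-03-06.jpg').group(0) == b'2023-03-06'
assert ymdc.search(b'no date: 9999-99-99') is None
```

The payoff of the named pieces is that each range (years, months, days) is defined exactly once and can be tested in isolation before being concatenated.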
Ed Bradford Pflugerville, TX On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote: > Hi, > > I'd like some thoughts ~> input on extended regular expressions used > with grep, specifically GNU grep -e / egrep. > > What are the pros / cons to creating extended regular expressions like > the following: > > ^\w{3} > > vs: > > ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) > > Or: > > [ :[:digit:]]{11} > > vs: > > ( 1| 2| 3| 4| 5| 6| 7| 8| > 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) > (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]] > > I'm currently eliding the 61st (60) second, the 32nd day, and dealing > with February having fewer days for simplicity. > > For matching patterns like the following in log files? > > Mar 2 03:23:38 > > I'm working on organically training logcheck to match known good log > entries. So I'm *DEEP* in the bowels of extended regular expressions > (GNU egrep) that runs over all logs hourly. As such, I'm interested in > making sure that my REs are both efficient and accurate or at least not > WILDLY badly structured. The pedantic part of me wants to avoid > wildcard type matches (\w), even if they are bounded (\w{3}), unless it > truly is for unpredictable text. > > I'd appreciate any feedback and recommendations from people who have > been using and / or optimizing (extended) regular expressions for longer > than I have been using them. > > Thank you for your time and input. > > > > -- > Grant. . . . > unix || die > > -- Advice is judged by results, not by intentions. Cicero [-- Attachment #2: Type: text/html, Size: 6600 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-06 10:01 ` Ed Bradford @ 2023-03-06 21:01 ` Dan Cross 2023-03-06 21:49 ` Steffen Nurpmeso ` (2 more replies) 0 siblings, 3 replies; 46+ messages in thread From: Dan Cross @ 2023-03-06 21:01 UTC (permalink / raw) To: Ed Bradford; +Cc: Grant Taylor, COFF On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2@gmail.com> wrote: >[snip] > I would like to extend my program to > any date format. That would require > a much bigger RE. I have been led to > believe that a 50Kbyte or 500Kbyte > RE works just as well (if not > as fast) as a 100 byte RE. I think > with parentheses and > pipe-symbols suitably used, > one could match > > Monday, March 6, 2023 > 2023-03-06 > Mar 6, 2023 > or > ... This reminds me of something that I wanted to bring up. Perhaps one _could_ define a sufficiently rich regular expression that one could match a number of date formats. However, I submit that one _should not_. REs may be sufficiently powerful, but in all likelihood what you'll end up with is an unreadable mess; it's like people who abuse `sed` or whatever to execute complex, general purpose programs: yeah, it's a clever hack, but that doesn't mean you should do it. Pick the right tool for the job. REs are a powerful tool, but they're not the right tool for _every_ job, and I'd argue that once you hit a threshold of complexity that'll be mostly self-evident, it's time to move on to something else. As for large vs small REs.... When we start talking about differences of orders of magnitude in size, we start talking about real performance implications; in general an NDFA simulation of a regular expression will have on the order of the length of the RE in states, so when the length of the RE is half a million symbols, that's half a million states, which, even though bounded, is practically speaking a pretty big number, even on modern CPUs. I wouldn't want to poke that bear. 
- Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-06 21:01 ` Dan Cross @ 2023-03-06 21:49 ` Steffen Nurpmeso 2023-03-07 1:43 ` Larry McVoy 2023-03-07 4:19 ` Ed Bradford 2 siblings, 0 replies; 46+ messages in thread From: Steffen Nurpmeso @ 2023-03-06 21:49 UTC (permalink / raw) To: Dan Cross; +Cc: Grant Taylor, COFF Dan Cross wrote in <CAEoi9W6tZ+55MSPxPoZqfS3k9RO9MOQqB0yu=MO_vzzw0K6Lhw@mail.gmail.com>: |On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2@gmail.com> wrote: |>[snip] |> I would like to extend my program to |> any date format. That would require |> a much bigger RE. I have been led to ... |> one could match |> |> Monday, March 6, 2023 |> 2023-03-06 |> Mar 6, 2023 |> or ... |This reminds me of something that I wanted to bring up. Me too. If it becomes something regular and stable maybe turn into a dedicated parser. (As a lex yacc bison byacc refuser, but these surely can too.) |Perhaps one _could_ define a sufficiently rich regular expression that |one could match a number of date formats. However, I submit that one |_should not_. REs may be sufficiently powerful, but in all likelihood ... Kurt Shoens implemented some date template parser for BSD Mail in about 1980 that was successively changed many years later by Edward Wang in 1988 ([1] commit [309eb459e35f77985851ce143ad2f9da5f0d90da], 1988-07-08 18:41:33 -0800). There is strftime(3), but it came later than both to CSRG, and the Wang thing (in usr.bin/mail/head.c) is a dedicated thing. (Ie /* Template characters for cmatch_data.tdata: * 'A' An upper case char * 'a' A lower case char * ' ' A space * '0' A digit * 'O' An optional digit or space; MUST be followed by '0space'! * ':' A colon * '+' Either a plus or a minus sign */ and then according strings like "Aaa Aaa O0 00:00:00 0000".) 
[1] https://github.com/robohack/ucb-csrg-bsd.git --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) ^ permalink raw reply [flat|nested] 46+ messages in thread
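[Editor's note: the template scheme quoted above can be illustrated with a small matcher. This is a simplified sketch of my own; the real head.c logic — e.g. the rule that 'O' "MUST be followed by '0space'" — is more involved.]

```python
# One template character classifies one input character, as in the BSD Mail
# cmatch_data templates quoted above.
CHECKS = {
    'A': str.isupper,                        # an upper case char
    'a': str.islower,                        # a lower case char
    ' ': lambda c: c == ' ',                 # a space
    '0': str.isdigit,                        # a digit
    'O': lambda c: c == ' ' or c.isdigit(),  # optional digit or space
    ':': lambda c: c == ':',                 # a colon
    '+': lambda c: c in '+-',                # a plus or a minus sign
}

def cmatch(template: str, s: str) -> bool:
    return len(s) == len(template) and all(
        CHECKS[t](c) for t, c in zip(template, s))

assert cmatch('Aaa Aaa O0 00:00:00 0000', 'Mon Mar  6 21:49:00 2023')
assert not cmatch('Aaa Aaa O0 00:00:00 0000', 'mon Mar  6 21:49:00 2023')
```

For a small, fixed family of formats, a character-template matcher like this is both cheaper and more readable than the equivalent regexp.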
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-06 21:01 ` Dan Cross 2023-03-06 21:49 ` Steffen Nurpmeso @ 2023-03-07 1:43 ` Larry McVoy 2023-03-07 4:01 ` Ed Bradford 2023-03-07 4:19 ` Ed Bradford 2 siblings, 1 reply; 46+ messages in thread From: Larry McVoy @ 2023-03-07 1:43 UTC (permalink / raw) To: Dan Cross; +Cc: Grant Taylor, COFF On Mon, Mar 06, 2023 at 04:01:51PM -0500, Dan Cross wrote: > On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2@gmail.com> wrote: > >[snip] > > I would like to extend my program to > > any date format. That would require > > a much bigger RE. I have been led to > > believe that a 50Kbyte or 500Kbyte > > RE works just as well (if not > > as fast) as a 100 byte RE. I think > > with parentheses and > > pipe-symbols suitably used, > > one could match > > > > Monday, March 6, 2023 > > 2023-03-06 > > Mar 6, 2023 > > or > > ... > > This reminds me of something that I wanted to bring up. > > Perhaps one _could_ define a sufficiently rich regular expression that > one could match a number of date formats. However, I submit that one > _should not_. REs may be sufficiently powerful, but in all likelihood > what you'll end up with is an unreadable mess; it's like people who > abuse `sed` or whatever to execute complex, general purpose programs: > yeah, it's a clever hack, but that doesn't mean you should do it. Dan, I agree with you. I ran a software company for almost 20 years and the main thing I contributed was "let's be dumb". Let's write code that is easy to read, easy to bug fix. Smart engineers love to be clever; they would be the folks that wrote those long REs that worked magic. But that magic was something they understood and nobody else did. Less is more. Less is easy to support. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-07 1:43 ` Larry McVoy @ 2023-03-07 4:01 ` Ed Bradford 2023-03-07 11:39 ` [COFF] " Ralph Corderoy 2023-03-07 16:14 ` Dan Cross 0 siblings, 2 replies; 46+ messages in thread From: Ed Bradford @ 2023-03-07 4:01 UTC (permalink / raw) To: Larry McVoy; +Cc: Grant Taylor, COFF [-- Attachment #1: Type: text/plain, Size: 2560 bytes --] I have made an attempt to make my RE stuff readable and supportable. I think I write more description than I do RE "code". As for *it won't be comprehensible*: Machine language was unreadable and then along came assembly language. Assembly language was unreadable, then came higher level languages. Even higher level languages are unsupportable if not well documented and mostly simple to understand ("you are not expected to understand this" notwithstanding). The jump from machine language to python today was unimagined in early times. [ As an old timer, I see inflection points between: machine language and assembly language assembly language and high level languages and high level languages and python. But that's just me. ] I think it is possible to make a 50K RE that is understandable. However, it requires a lot of 'splainin' throughout the code. I'm naive though; I will eventually discover a lack of truth in that belief, if such exists. I repeat. I put stuff down for months at a time. My metric is *coming back to it* *and understanding where I left off*. So far, I can do that for this RE program that works for small files, large files, binary files and text files for exactly one pattern: YYYY[-MM-DD] I constructed this RE with code like this: # ymdt is YYYY-MM-DD RE in text. # looking only for 1900s and 2000s years and no later than today. 
_YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}" # months _MM = "(0[1-9]|1[012])" # days _DD = "(0[1-9]|[12]\d|3[01])" ymdt = _YYYY + '[' + _INTERNALSEP + _MM + _INTERNALSEP + ']'{0,1) For the whole file, RE I used ymdthf = _FRSEP + ymdt + _BASEP where FRSEP is front separator which includes a bunch of possible separators, excluding numbers and letters, or-ed with the up arrow "beginning of line" RE mark. BASEP is back separator is same as FRSEP with "^" replaced with "$". I then aimed ymdthf at "data" the thing that represents the entire memory mapped file (where there is only one beginning and one end). Again, I say validating an RE is as difficult or more than writing one. What does it miss? Dates are an excellent test ground for RE's. Latitude and longitude is another. Ed PS: I thought I was on the COFF mailing list. I received this email by direct mail to from Larry. I haven't seen any other comments on my submission. I might have unsubscribed, but now I regret it. Dear powers that be: Please resubscribe me. [-- Attachment #2: Type: text/html, Size: 8375 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Requesting thoughts on extended regular expressions in grep. 2023-03-07 4:01 ` Ed Bradford @ 2023-03-07 11:39 ` Ralph Corderoy 2023-03-07 18:31 ` [COFF] " Grant Taylor via COFF 2023-03-08 11:22 ` Ed Bradford 2023-03-07 16:14 ` Dan Cross 1 sibling, 2 replies; 46+ messages in thread From: Ralph Corderoy @ 2023-03-07 11:39 UTC (permalink / raw) To: coff Hi Ed, > I have made an attempt to make my RE stuff readable and supportable. Readable to you, which is fine because you're the prime future reader. But it's less readable than the regexp to those that know and read them because of the indirection introduced by the variables. You've created your own little language of CAPITALS rather than the lingua franca of regexps. :-) > Machine language was unreadable and then along came assembly language. > Assembly language was unreadable, then came higher level languages. Each time the original language was readable because practitioners had to read and write it. When its replacement came along, the old skill was no longer learnt and the language became ‘unreadable’. > So far, I can do that for this RE program that works for small files, > large files, binary files and text files for exactly one pattern: > YYYY[-MM-DD] > I constructed this RE with code like this: > # ymdt is YYYY-MM-DD RE in text. > # looking only for 1900s and 2000s years and no later than today. > _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}" ‘{1}’ is redundant. > # months > _MM = "(0[1-9]|1[012])" > # days > _DD = "(0[1-9]|[12]\d|3[01])" > ymdt = _YYYY + '[' + _INTERNALSEP + > _MM + > _INTERNALSEP + > ']'{0,1) I think we're missing something as the ‘'['’ is starting a character class which is odd for wrapping the month and the ‘{0,1)’ doesn't have matching brackets and is outside the string. BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’. 
> For the whole file, RE I used > ymdthf = _FRSEP + ymdt + _BASEP > where FRSEP is front separator which includes > a bunch of possible separators, excluding numbers and letters, or-ed > with the up arrow "beginning of line" RE mark. It sounds like you're wanting a word boundary; something provided by regexps. In Python, it's ‘\b’. >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'), (<re.Match object; span=(16, 19), match='foo'>,) Are you aware of the /x modifier to a regexp which ignores internal whitespace, including linefeeds? This allows a large regexp to be split over lines. There's a comment syntax too. See https://docs.python.org/3/library/re.html#re.X GNU grep isn't too shabby at looking through binary files. I can't use /x with grep so in a bash script, I'd do it manually. \< and \> match the start and end of a word, a bit like Python's \b. re=' .?\< (19[0-9][0-9]|20[01][0-9]|202[0-3]) ( ([-:._]) (0[1-9]|1[0-2]) \3 (0[1-9]|[12][0-9]|3[01]) )? \>.? ' re=${re//$'\n'/} re=${re// /} printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 >big-binary-file LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l which gives 0:2001-04-01,$ 11:1999_12_31$ 22:1944.03.01,$ 33:1914!$ 39:2000-$ showing: - the byte offset within the file of each match, - along with any byte before and after if it's not a \n and not already matched, just to show the word-boundary at work, - with any non-printables escaped into octal by sed. > I thought I was on the COFF mailing list. I'm sending this to just the list. > I received this email by direct mail to from Larry. Perhaps your account on the list is configured to not send you an email if it sees your address in the header's fields. -- Cheers, Ralph. ^ permalink raw reply [flat|nested] 46+ messages in thread
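[Editor's note: the grep pattern above translates almost directly into Python's verbose mode. A sketch of mine, using `\b` in place of grep's `\<`/`\>`, with the same test strings as the printf above.]

```python
import re

# The same date pattern as the grep example, in re.X (verbose) form.
date_re = re.compile(r'''
    \b
    (19[0-9][0-9] | 20[01][0-9] | 202[0-3])   # year
    (
        ([-:._])                              # separator, captured...
        (0[1-9] | 1[0-2])                     # month
        \3                                    # ...and required to repeat
        (0[1-9] | [12][0-9] | 3[01])          # day
    )?
    \b
''', re.X)

assert date_re.search('2001-04-01,1999_12_31').group(0) == '2001-04-01'
assert date_re.search('1999_12_31').group(0) == '1999_12_31'
assert date_re.search('1944.03.01,1914!').group(0) == '1944.03.01'
assert date_re.search('2000-01.01').group(0) == '2000'   # mixed separators
```

Note how the last case mirrors the grep output's `39:2000-` line: the backreference `\3` rejects mismatched separators, so only the bare year matches.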
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-07 11:39 ` [COFF] " Ralph Corderoy @ 2023-03-07 18:31 ` Grant Taylor via COFF 2023-03-08 11:22 ` Ed Bradford 1 sibling, 0 replies; 46+ messages in thread From: Grant Taylor via COFF @ 2023-03-07 18:31 UTC (permalink / raw) To: coff [-- Attachment #1: Type: text/plain, Size: 4934 bytes --] On 3/7/23 4:39 AM, Ralph Corderoy wrote: > Readable to you, which is fine because you're the prime future > reader. But it's less readable than the regexp to those that know > and read them because of the indirection introduced by the variables. > You've created your own little language of CAPITALS rather than the > lingua franca of regexps. :-) I want to agree, but then I run into things like this: ^\w{3} [ :[:digit:]]{11} [._[:alnum:]-]+ postfix(/smtps)?/smtpd\[[[:digit:]]+\]: disconnect from [._[:alnum:]-]+\[[.:[:xdigit:]]+\]( helo=[[:digit:]]+(/[[:digit:]]+)?)?( ehlo=[[:digit:]]+(/[[:digit:]]+)?)?( starttls=[[:digit:]]+(/[[:digit:]]+)?)?( auth=[[:digit:]]+(/[[:digit:]]+)?)?( mail=[[:digit:]]+(/[[:digit:]]+)?)?( rcpt=[[:digit:]]+(/[[:digit:]]+)?)?( data=[[:digit:]]+(/[[:digit:]]+)?)?( bdat=[[:digit:]]+(/[[:digit:]]+)?)?( rset=[[:digit:]]+(/[[:digit:]]+)?)?( noop=[[:digit:]]+(/[[:digit:]]+)?)?( quit=[[:digit:]]+(/[[:digit:]]+)?)?( unknown=[[:digit:]]+(/[[:digit:]]+)?)?( commands=[[:digit:]]+(/[[:digit:]]+)?)?$ Which is produced by this m4: define(`DAEMONPID', `$1\[DIGITS\]:')dnl define(`DATE', `\w{3} [ :[:digit:]]{11}')dnl define(`DIGIT', `[[:digit:]]')dnl define(`DIGITS', `DIGIT+')dnl define(`HOST', `[._[:alnum:]-]+')dnl define(`HOSTIP', `HOST\[IP\]')dnl define(`IP', `[.:[:xdigit:]]+')dnl define(`VERB', `( $1=DIGITS`'(/DIGITS)?)?')dnl ^DATE HOST DAEMONPID(`postfix(/smtps)?/smtpd') disconnect from HOSTIP`'VERB(`helo')VERB(`ehlo')VERB(`starttls')VERB(`auth')VERB(`mail')VERB(`rcpt')VERB(`data')VERB(`bdat')VERB(`rset')VERB(`noop')VERB(`quit')VERB(`unknown')VERB(`commands')$ I only consider myself to 
be an /adequate/ m4 user. Though I've done some things that are arguably creating new languages. I personally find the generated regular expression to be onerous to read and understand, much less modify. I would be highly dependent on my editor's (vim's) parenthesis / square bracket matching (%) capability and / or would need to explode the RE into multiple components on multiple lines to have a hope of accurately understanding or modifying it. Conversely, I think that the m4 is /largely/ find and replace with a little syntactic sugar around the definitions. I also think that anyone who understands regular expressions and the concept of find & replace is likely to be able to recognize patterns -- as in that "VERB(...)" corresponds to "( $1=DIGITS`'(/DIGITS)?)?", that "DIGITS" corresponds to "DIGIT+", and that "DIGIT" corresponds to "[[:digit:]]". There seems to be a point between simple REs w/o any supporting constructor and complex REs with a supporting constructor where I think it is better to have the constructors. Especially when duplication comes into play. If nothing else, the constructors are likely to reduce one-off typo errors. The typo will either be everywhere the constructor was used, or similarly be fixed everywhere at the same time. Conversely, finding an unmatched parenthesis or square bracket in the RE above will be annoying at best if not likely to be more daunting. > Each time the original language was readable because practitioners > had to read and write it. When its replacement came along, the old > skill was no longer learnt and the language became ‘unreadable’. I feel like there is an analogy between machine code and assembly language as well as assembly language and higher level languages. My understanding is that the computer industry has broadly agreed that the higher level language is easier to understand and maintain. > ‘{1}’ is redundant. That may very well be. 
But what will be more maintainable / easier to correct in the future: adding `{2}` when necessary or changing the value of `1` to `2`? I think this is an example of a tradeoff: something not strictly required now that makes things more maintainable down the road. Sort of like fleet vehicles vs non-fleet vehicles. > BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’. I think this is another example of the maintainability. > I'm sending this to just the list. I'm also replying to only the COFF mailing list. > Perhaps your account on the list is configured to not send you an > email if it sees your address in the header's fields. There is a reasonable chance that the COFF mailing list and / or your account therein is configured to minimize duplicates, meaning the COFF mailing list won't send you a copy if it sees your subscribed address as receiving a copy directly. I personally always prefer the mailing list copy and shun the direct copies. I think that the copy from the mailing list keeps the discussion on the mailing list and avoids accidental replies bypassing the mailing list. -- Grant. . . . unix || die [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 4017 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
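[Editor's note: the constructor approach discussed above is not m4-specific. The same idea in Python, as a hypothetical sketch of mine: `\d` stands in for `[[:digit:]]`, which Python's `re` does not support, and the host/verb details are simplified from the m4 original.]

```python
import re

DIGITS = r'\d+'

def verb(name):
    # Mirrors the m4 VERB macro: an optional " name=N" or " name=N/M" clause.
    return rf'( {name}={DIGITS}(/{DIGITS})?)?'

verbs = ('helo', 'ehlo', 'starttls', 'auth', 'mail', 'rcpt', 'data',
         'bdat', 'rset', 'noop', 'quit', 'unknown', 'commands')
pat = re.compile(r'disconnect from \S+'
                 + ''.join(verb(v) for v in verbs) + '$')

line = ('disconnect from mail.example.org[192.0.2.1]'
        ' ehlo=1 mail=1 rcpt=0/1 quit=1 commands=3/4')
assert pat.search(line) is not None
```

As with the m4 version, a typo in `verb` is either everywhere or fixed everywhere at once, which is exactly the maintainability argument made above.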
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-07 11:39 ` [COFF] " Ralph Corderoy 2023-03-07 18:31 ` [COFF] " Grant Taylor via COFF @ 2023-03-08 11:22 ` Ed Bradford 1 sibling, 0 replies; 46+ messages in thread From: Ed Bradford @ 2023-03-08 11:22 UTC (permalink / raw) To: Ralph Corderoy; +Cc: coff [-- Attachment #1: Type: text/plain, Size: 10419 bytes --] Thank you for the very useful comments. However, I disagree with you about the RE language. While I agree all RE experts don't need that, when I was hiring and gave some software to a new hire (whether an experienced programmer or a recent college grad) simply handing over huge RE's to my new hire was a daunting task to that person. I wrote that stuff that way to help remind me and anyone who might use the python program. I don't claim success. It does help me. When you say '{1}' is redundant, I think I did that to avoid any possibility of conflicts with the next string that is concatentated to the *Y_* (e.g. '*' or '+' or '{4,7}'). I am embarrassed I did not communicate that in the code. I had to think about it for a couple of hours before I recalled the "why". I will fix that. (it would be difficult to discuss this RE if I had to write "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE + "]" + ") rather than just *Y_*). My initial thoughts on naming were I wanted the definition to be defined in exactly one place in the software. Python and the BTL folks told me to never use a constant in code. Always name it. Hence, I gave it a name. Each name might be used in multiple places. They might be imported. You are correct, the expression is unbalanced. I tried to remove the text2bytes(lastYearRE*)* call so the expression in this email was all text. I failed to remove the trailing *)* when I removed the call to text2bytes(). My hasty transcriptions might have produced similar errors in my email. Recall, my focus was on any file of any size. I'm on Windows 10 and an m1 MacBook. 
Python works on both. I don't have a Linux machine or enough desktop space to host one. I'm also mildly fed-up with virtual machines. Friedl taught me one thing. Most RE implementations are different. I'm trying to write a program that I could give to anyone and could reliably find a date (an RE) in any file. YYYY, MM, DD, HR, MI, SE, TH are words my user could use in the command line or in an options dialog. LAT and LON might also be possibilities. CST, EST, MST, PST, ... also. A 500 gigabyte archive or directory/folder of pictures and movies would be a great test target. I very much appreciate your comments. If this discussion is boring to others, I would be happy to take it to emails. I like your program. My experience with RE, grep, python, and sed suggests that anything but gnu grep and sed might not work due to the different implementations. I've been out of the Unix software business for 30 years after starting work at BTL in the 1970s and working on Version 6. I didn't know "printf" was now built into bash! That was a surprise. It's an incremental improvement, but doesn't compare with f-strings in python. *The interactive interpreter for python should have* *a "bash" mode?!* Does grep use a memory mapped file for its search, thereby avoiding all buffering boundaries? That too, would be new information to me. The additional complexity of dealing with buffering is more than annoying. Do you have any thoughts on how to verify a program that uses RE's. I've given no thought until now. My first thought for dates would be to write a separate module that simply searched through the file looking for 4 numbers in a row without using RE's, recording the offsets and 16 characters after and 1 character before in a python list of (offset,str) of tuples, ddddList, and using *dddd**List* as a proxy for the entire file. I could then aim my RE's at *ddddList*. *[A list of tuples in python* *is wonderful! 
!]* It seems to me '*' and '+' and {x,y} are the performance hogs in RE's. My RE's avoid them. One pass, I think, should suffice. What do you think? I haven't "archived" my 350 GB of pictures and movies, but one pass over all files therein ought to suffice, right? Two different programs that use different algorithms should be pretty good proof of correctness wouldn't you think? My RE's have no stars or pluses. If there is a mismatch before a match, give up and move on. On my Windows 10 machine, I have cygwin. Microsoft says my CPU doesn't have a TPM and the specific Intel Core I7 on my system is not supported so Windows 11 is not happening. Microsoft is DOS personified. (An unkind editorial remark about the low quality of software coming from Microsoft.) Anyway, I thank you again for your patience with me and your observations. I value your views and the other views I've seen here on coff@tuhs.org. I welcome all input to my education and will share all I have done so far with anyone who wants to collaborate, test, or is just curious. GOAL: run python program from an at-cost thumb drive that: reaps all media files from a user specified directory/folder tree and Adds files to the thumb drive. *Adds files* means Original file system is untouched Adds only unique files (hash codes are unique) Creates on the thumb drive a relative directory wherein the original file was found Prepends a "YYYY-MM-DD-" string to the filename if one can be found (EXIF is great shortcut). Copies srcroot/relative_path/oldfilename to thumbdrive/relative_path/YYYY-MM-DD-oldfilename or thumbdrive/relative_path/0000-oldfilename. Can also incrementally add new files by just scanning anywhere in any other computer file system or any other computer. Must work on Mac, Windows, and Linux What I have is a working prototype. It works on Mac and Windows. It doesn't do the date thing very well, and there are other shortcomings. 
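[Editor's note: the regexp-free cross-check proposed above — scan for four digits in a row, record offsets and context, and use the result as a proxy for the file — might look like this. The function name, context width, and test data are my own guesses, not Ed's code.]

```python
def dddd_list(data: bytes, after: int = 16):
    """Regexp-free proxy scan: record an (offset, context) tuple at the
    start of every run of four ASCII digits, as a validation target
    against which the RE-based scanner can be checked."""
    hits = []
    run = 0
    for i, b in enumerate(data):
        if 0x30 <= b <= 0x39:          # b'0' .. b'9'
            run += 1
            if run == 4:               # fires once per run of >= 4 digits
                start = i - 3
                hits.append((start, data[max(0, start - 1):start + after]))
        else:
            run = 0
    return hits

hits = dddd_list(b'x2023-03-06 and 1999_12_31')
assert [off for off, _ in hits] == [1, 16]
```

Two scanners built on different algorithms agreeing over the same 350 GB corpus is indeed decent evidence of correctness; disagreements pinpoint exactly which offsets one of them mishandles.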
I have delivered exactly one Christmas present to my favorite person in the world - a 400 GB SSD drive with all our pictures and media we have ever taken. The next things are to *add *more media and *re-unique-ify* (check) what is already present on the SSD drive and *improve the proper choice of "YYYY-MM-DD-" prefix* to filenames. I am retired and this is fun. I'm too old to want to get rich. Ed Bradford Pflugerville, TX egbegb2@gmail.com On Tue, Mar 7, 2023 at 5:40 AM Ralph Corderoy <ralph@inputplus.co.uk> wrote: > Hi Ed, > > > I have made an attempt to make my RE stuff readable and supportable. > > Readable to you, which is fine because you're the prime future reader. > But it's less readable than the regexp to those that know and read them > because of the indirection introduced by the variables. You've created > your own little language of CAPITALS rather than the lingua franca of > regexps. :-) > > > Machine language was unreadable and then along came assembly language. > > Assembly language was unreadable, then came higher level languages. > > Each time the original language was readable because practitioners had > to read and write it. When its replacement came along, the old skill > was no longer learnt and the language became ‘unreadable’. > > > So far, I can do that for this RE program that works for small files, > > large files, binary files and text files for exactly one pattern: > > YYYY[-MM-DD] > > I constructed this RE with code like this: > > # ymdt is YYYY-MM-DD RE in text. > > # looking only for 1900s and 2000s years and no later than today. > > _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}" > > ‘{1}’ is redundant. 
> > > # months > > _MM = "(0[1-9]|1[012])" > > # days > > _DD = "(0[1-9]|[12]\d|3[01])" > > ymdt = _YYYY + '[' + _INTERNALSEP + > > _MM + > > _INTERNALSEP + > > ']'{0,1) > > I think we're missing something as the ‘'['’ is starting a character > class which is odd for wrapping the month and the ‘{0,1)’ doesn't have > matching brackets and is outside the string. > > BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’. > > > For the whole file, RE I used > > ymdthf = _FRSEP + ymdt + _BASEP > > where FRSEP is front separator which includes > > a bunch of possible separators, excluding numbers and letters, or-ed > > with the up arrow "beginning of line" RE mark. > > It sounds like you're wanting a word boundary; something provided by > regexps. In Python, it's ‘\b’. > > >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'), > (<re.Match object; span=(16, 19), match='foo'>,) > > Are you aware of the /x modifier to a regexp which ignores internal > whitespace, including linefeeds? This allows a large regexp to be split > over lines. There's a comment syntax too. See > https://docs.python.org/3/library/re.html#re.X > > GNU grep isn't too shabby at looking through binary files. I can't use > /x with grep so in a bash script, I'd do it manually. \< and \> match > the start and end of a word, a bit like Python's \b. > > re=' > .?\< > (19[0-9][0-9]|20[01][0-9]|202[0-3]) > ( > ([-:._]) > (0[1-9]|1[0-2]) > \3 > (0[1-9]|[12][0-9]|3[01]) > )? > \>.? > ' > re=${re//$'\n'/} > re=${re// /} > > printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 > >big-binary-file > LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l > > which gives > > 0:2001-04-01,$ > 11:1999_12_31$ > 22:1944.03.01,$ > 33:1914!$ > 39:2000-$ > > showing: > > - the byte offset within the file of each match, > - along with the any before and after byte if it's not a \n and not > already matched, just to show the word-boundary at work, > - with any non-printables escaped into octal by sed. 
> > > I thought I was on the COFF mailing list. > > I'm sending this to just the list. > > > I received this email by direct mail from Larry. > > Perhaps your account on the list is configured to not send you an email > if it sees your address in the header's fields. > > -- > Cheers, Ralph. > -- Advice is judged by results, not by intentions. Cicero [-- Attachment #2: Type: text/html, Size: 25039 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
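[Editor's note: the reaper described in Ed's GOAL is concrete enough to sketch. The outline below is illustrative only, not Ed's prototype: the media-extension list and helper names are assumptions, and `date_prefix` is stubbed where the EXIF lookup would go, since Ed notes the date handling is incomplete.]

```python
import hashlib
import shutil
from pathlib import Path

# Assumption: which suffixes count as "media files".
MEDIA_EXTS = {".jpg", ".jpeg", ".png", ".mov", ".mp4"}

def file_hash(path):
    """Content hash used to decide uniqueness (the 'hash codes are unique' rule)."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def date_prefix(path):
    """Stub: real code would try EXIF first, then fall back to file mtime.
    Returns '0000' when no date can be found, per the GOAL."""
    return "0000"

def reap(srcroot, destroot, seen):
    """Copy unique media files from srcroot into destroot, preserving
    relative paths and prefixing a date string. srcroot is never modified."""
    for src in srcroot.rglob("*"):
        if not src.is_file() or src.suffix.lower() not in MEDIA_EXTS:
            continue
        digest = file_hash(src)
        if digest in seen:            # add only unique files
            continue
        seen.add(digest)
        rel = src.parent.relative_to(srcroot)
        dest = destroot / rel / f"{date_prefix(src)}-{src.name}"
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)       # original file system is untouched
```

Persisting the `seen` set between runs is what would make the incremental "scan any other computer" step work: a second invocation with the same set skips everything already on the drive.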
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-07 4:01 ` Ed Bradford 2023-03-07 11:39 ` [COFF] " Ralph Corderoy @ 2023-03-07 16:14 ` Dan Cross 2023-03-07 17:34 ` [COFF] " Ralph Corderoy 1 sibling, 1 reply; 46+ messages in thread From: Dan Cross @ 2023-03-07 16:14 UTC (permalink / raw) To: Ed Bradford; +Cc: Grant Taylor, COFF On Mon, Mar 6, 2023 at 11:01 PM Ed Bradford <egbegb2@gmail.com> wrote: >[snip] > I think it is possible to make a 50K RE that is understandable. However, it requires > a lot of 'splainin' throughout the code. I'm naive though; I will eventually discover > a lack of truth in that belief, if such exists. Actually, I believe you. I'm sure that with enough effort, it _is_ possible to make a 50K RE that is understandable to mere mortals. But it begs the question: why bother? The answer to that question, in my mind, shows the difference between a clever programmer and a pragmatic engineer. I submit that it's time to reach for another tool well before you get to an RE that big, and if one is still considering such a thing, one must really ask what properties of REs and the problem at hand one thinks lend themselves to that as the solution. >[snip] > It sounds to me like an "optimizer" is needed. There is already a compiler > that uses FA's. I'm not sure what you're referring to here, though you were replying to me. There are a couple of different threads floating around: 1. Writing really big regular expressions: this is probably a bad idea. Don't do it (see below). 2. Writing a recognizer for dates. Yeah, the small REs you have for that are fine. If you want to extend those to arbitrary date formats, I think you'll find it starts getting ugly. 3. Optimizing regular expressions. You're still bound by the known theoretical properties of finite automata here. > Is someone else going to create a program > to look for dates without using regular expressions? Many people have already done so. 
:-) > Today, I write small-sized RE's. If I write a giant RE, there is nothing preventing > the owner of RE world to change how they are used. For instance. Compile your RE > and a subroutine/function is produced that performs the RE search. I'm not sure I understand what you mean. The theory here is well-understood: we know recognizers for regular languages can be built from DFAs, that run in time linear in the size of their inputs, but we also know that constructing such a DFA can be exponential in space and time, and thus impractical for many REs. We know that NDFA simulators can be built in time and space linear in the length of the RE, but that the resulting recognizers will be superlinear at runtime, proportional to the product of the length of input, number of states, and number of edges between states in the state transition graph. For a very large regular expression, that's going to be a pretty big number, and even on modern CPUs won't be particularly fast. Compilation to native code won't really help you. There is no "owner of RE world" that can change that. If you can find some way to do so, I think that would qualify as a major breakthrough in computer science. > RE is a language, not necessarily an implementation. > At least that is my understanding. Regular expressions describe regular languages, but as I mentioned above, the theory gives the currently understood bounds for their performance characteristics. It's kinda like the speed of light in this regard; we can't really make it go faster. - Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
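[Editor's note: the cost model Dan describes is easy to see in a toy NFA simulation. One left-to-right pass keeps a set of live states; the work per input symbol is bounded by the number of states times edges, which is the superlinear factor when the RE, and hence the automaton, is huge. The hand-built NFA below, for the textbook pattern `(a|b)*abb`, is illustrative only and comes from neither poster.]

```python
# NFA for (a|b)*abb over states 0..3, accepting in state 3.
# Transitions map (state, symbol) -> set of successor states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = {0}, {3}

def matches(s):
    """Single pass over the input; per-symbol cost grows with the
    number of live states and their outgoing edges."""
    live = set(START)
    for ch in s:
        live = {t for q in live for t in NFA.get((q, ch), ())}
        if not live:          # no state can continue: give up early
            return False
    return bool(live & ACCEPT)
```

For a 500 KB regular expression the same loop still runs once over the input, but each step churns through a state set on the order of half a million entries, which is Dan's point about the constant factor.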
* [COFF] Requesting thoughts on extended regular expressions in grep. 2023-03-07 16:14 ` Dan Cross @ 2023-03-07 17:34 ` Ralph Corderoy 2023-03-07 18:33 ` [COFF] " Dan Cross 0 siblings, 1 reply; 46+ messages in thread From: Ralph Corderoy @ 2023-03-07 17:34 UTC (permalink / raw) To: coff Hi Dan, > I'm sure that with enough effort, it _is_ possible to make a 50K RE > that is understandable to mere mortals. But it begs the question: why > bother? It could be the quickest way to express the intent. > The answer to that question, in my mind, shows the difference between > a clever programmer and a pragmatic engineer. I think those two can overlap. :-) > I submit that it's time to reach for another tool well before you get > to an RE that big Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular grammar? I think sticking with code aimed at regular grammars, or more likely regexps, will do better than, say, a parser generator for a type-two context-free grammar. As well as the lex(1) family, there's Ragel as another example. http://www.colm.net/open-source/ragel/ > 3. Optimizing regular expressions. You're still bound by the known > theoretical properties of finite automata here. Well, we're back to the RE v. regexp again. /^[0-9]+\.jpeg$/ is matched by some engines by first checking the last five bytes are ‘.jpeg’. $ debugperl -Dr -e \ > '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/' ... Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg" Intuit: trying to determine minimum start position... doing 'check' fbm scan, [1..46] gave -1 Did not find floating substr ".jpeg"$... Match rejected by optimizer Freeing REx: "^[0-9]+\.jpeg$" $ Boyer-Moore string searching can be used. Common-subregexp-elimination can spot repetitive fragments of regexp and factor them into a single set of states along with pairing the route into them with the appropriate route out. 
The more regexp engines are optimised, the more benefit to the programmer from sticking to a regexp rather than, say, ad hoc parsing. The theory of REs is interesting and important, but regexps deviate from it ever more. -- Cheers, Ralph. ^ permalink raw reply [flat|nested] 46+ messages in thread
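[Editor's note: the optimisation Ralph's debugperl trace shows can be sketched in a few lines. If the regexp requires a literal substring, a plain substring scan — Python's `in` operator uses a fast string search, playing the role of Perl's fbm/Boyer-Moore check — rejects most inputs before the full engine ever runs. The pattern reuses Ralph's `^[0-9]+\.jpeg$` example; the prefilter itself is my illustration, not how any particular engine is implemented.]

```python
import re

PATTERN = re.compile(r"^[0-9]+\.jpeg$")
REQUIRED = ".jpeg"   # a literal that every match must contain

def match_with_prefilter(line):
    # Cheap check first: if ".jpeg" never appears, no match is possible,
    # so the regexp engine never has to scan the digits at all.
    if REQUIRED not in line:
        return None
    return PATTERN.match(line)
```

This is the same shape as the trace above: the long `.jpg` input is rejected by the substring check alone, exactly as Perl's "Match rejected by optimizer" line reports.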
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-07 17:34 ` [COFF] " Ralph Corderoy @ 2023-03-07 18:33 ` Dan Cross 0 siblings, 0 replies; 46+ messages in thread From: Dan Cross @ 2023-03-07 18:33 UTC (permalink / raw) To: Ralph Corderoy; +Cc: coff On Tue, Mar 7, 2023 at 12:34 PM Ralph Corderoy <ralph@inputplus.co.uk> wrote: > > I'm sure that with enough effort, it _is_ possible to make a 50K RE > > that is understandable to mere mortals. But it begs the question: why > > bother? > > It could be the quickest way to express the intent. Ok, I challenge you to find me anything for which the quickest way to express the intent is a 50 *thousand* symbol regular expression. :-) > > The answer to that question, in my mind, shows the difference between > > a clever programmer and a pragmatic engineer. > > I think those two can overlap. :-) Indeed they can. But I have grave doubts that this is a good example of such overlap. > > I submit that it's time to reach for another tool well before you get > > to an RE that big > > Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular > grammar? I think sticking with code aimed at regular grammars, or more > likely regexps, will do better than, say, a parser generator for a > type-two context-free grammar. Is there an extant, non-theoretical, example of such a grammar? > As well as the lex(1) family, there's Ragel as another example. > http://www.colm.net/open-source/ragel/ This is moving the goal posts more than a bit. I'm suggesting that a 50k-symbol RE is unlikely to be the best solution to any reasonable problem. A state-machine generator, even one with 50k statements, is not a 50k RE. > > 3. Optimizing regular expressions. You're still bound by the known > > theoretical properties of finite automata here. > > Well, we're back to the RE v. regexp again. /^[0-9]+\.jpeg$/ is matched > by some engines by first checking the last five bytes are ‘.jpeg’. 
...in general, in order to find the end, won't _something_ have to traverse the entire input? (Note that I said, "in general". Allusions to mmap'ed files or seeking to the end of a file are not general, since they don't apply well to important categories of input sources, such as pipes or network connections.) > $ debugperl -Dr -e \ > > '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/' > ... > Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg" > Intuit: trying to determine minimum start position... > doing 'check' fbm scan, [1..46] gave -1 > Did not find floating substr ".jpeg"$... > Match rejected by optimizer > Freeing REx: "^[0-9]+\.jpeg$" > $ > > Boyer-Moore string searching can be used. Common-subregexp-elimination > can spot repetitive fragments of regexp and factor them into a single set > of states along with pairing the route into them with the appropriate > route out. Well, that's what big-O notation accounts for. I'm afraid none of this really changes the time bounds, however, when applied in general. > The more regexp engines are optimised, the more benefit to the > programmer from sticking to a regexp rather than, say, ad hoc parsing. This is comparing apples and oranges. There may be all sorts of heuristics that we can apply to specific regular expressions to prune the search space, and that's great. But by their very nature, heuristics are not always generally applicable. As an analogy, we know that we cannot solve _the_ halting problem, but we also know that we can solve _many_ halting problem_s_. For example, a compiler can recognize that any of, `for(;;);` or `while(1);` or `loop {}` do not halt, and so on, ad nauseam, but even if some oracle can recognize arbitrarily many such halting problems, we still haven't solved the general problem. > The theory of REs is interesting and important, but regexps deviate from > it ever more. Yup. 
My post should not be construed as suggesting that regexps are not useful, or that they should not be a part of a programmer's toolkit. My post _should_ be construed as a suggestion that they are not always the best solution, and a part of being an engineer is finding that dividing line. - Dan C. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [COFF] Re: Requesting thoughts on extended regular expressions in grep. 2023-03-06 21:01 ` Dan Cross 2023-03-06 21:49 ` Steffen Nurpmeso 2023-03-07 1:43 ` Larry McVoy @ 2023-03-07 4:19 ` Ed Bradford 2 siblings, 0 replies; 46+ messages in thread From: Ed Bradford @ 2023-03-07 4:19 UTC (permalink / raw) To: Dan Cross; +Cc: Grant Taylor, COFF [-- Attachment #1: Type: text/plain, Size: 2509 bytes --] Hi Dan, It sounds to me like an "optimizer" is needed. There is already a compiler that uses FA's. Is someone else going to create a program to look for dates without using regular expressions? Today, I write small-sized RE's. If I write a giant RE, there is nothing preventing the owner of RE world to change how they are used. For instance. Compile your RE and a subroutine/function is produced that performs the RE search. RE is a *language*, not necessarily an implementation. At least that is my understanding. Ed On Mon, Mar 6, 2023 at 3:02 PM Dan Cross <crossd@gmail.com> wrote: > On Mon, Mar 6, 2023 at 5:02 AM Ed Bradford <egbegb2@gmail.com> wrote: > >[snip] > > I would like to extend my program to > > any date format. That would require > > a much bigger RE. I have been led to > > believe that a 50Kbyte or 500Kbyte > > RE works just as well (if not > > as fast) as a 100 byte RE. I think > > with parentheses and > > pipe-symbols suitably used, > > one could match > > > > Monday, March 6, 2023 > > 2023-03-06 > > Mar 6, 2023 > > or > > ... > > This reminds me of something that I wanted to bring up. > > Perhaps one _could_ define a sufficiently rich regular expression that > one could match a number of date formats. However, I submit that one > _should not_. REs may be sufficiently powerful, but in all likelihood > what you'll end up with is an unreadable mess; it's like people who > abuse `sed` or whatever to execute complex, general purpose programs: > yeah, it's a clever hack, but that doesn't mean you should do it. > > Pick the right tool for the job. 
REs are a powerful tool, but they're > not the right tool for _every_ job, and I'd argue that once you hit a > threshold of complexity that'll be mostly self-evident, it's time to > move on to something else. > > As for large vs small REs.... When we start talking about differences > of orders of magnitude in size, we start talking about real > performance implications; in general an NDFA simulation of a regular > expression will have on the order of the length of the RE in states, > so when the length of the RE is half a million symbols, that's > half a million states which, even though bounded, is practically > speaking a pretty big number, even on modern CPUs. > > I wouldn't want to poke that bear. > > - Dan C. > -- Advice is judged by results, not by intentions. Cicero [-- Attachment #2: Type: text/html, Size: 4427 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
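[Editor's note: Dan's "pick the right tool" point, applied to the three date formats Ed quotes, can be sketched without any giant alternation: try a short list of `strptime` formats in turn, and extending coverage is a one-line change rather than regexp surgery. The format strings are my assumption from Ed's examples; the `%A`/`%B` name matching relies on an English locale.]

```python
from datetime import datetime

# Formats for "Monday, March 6, 2023", "2023-03-06", and "Mar 6, 2023".
FORMATS = ["%A, %B %d, %Y", "%Y-%m-%d", "%b %d, %Y"]

def parse_date(text):
    """Return a datetime for the first format that fits, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    return None
```

Unlike an RE, this also validates the calendar for free: "2023-02-30" is rejected by `strptime` without any special-casing of February.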
end of thread, other threads:[~2023-07-10 9:08 UTC | newest] Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-03-02 18:54 [COFF] Requesting thoughts on extended regular expressions in grep Grant Taylor via COFF 2023-03-02 19:23 ` [COFF] " Clem Cole 2023-03-02 19:38 ` Grant Taylor via COFF 2023-03-02 23:01 ` Stuff Received 2023-03-02 23:46 ` Steffen Nurpmeso 2023-03-03 1:08 ` Grant Taylor via COFF 2023-03-03 2:10 ` Dave Horsfall 2023-03-03 3:34 ` Grant Taylor via COFF 2023-03-02 21:53 ` Dan Cross 2023-03-03 1:05 ` Grant Taylor via COFF 2023-03-03 3:04 ` Dan Cross 2023-03-03 3:53 ` Grant Taylor via COFF 2023-03-03 13:47 ` Dan Cross 2023-03-03 19:26 ` Grant Taylor via COFF 2023-03-03 10:59 ` Ralph Corderoy 2023-03-03 13:11 ` Dan Cross 2023-03-03 13:42 ` Ralph Corderoy 2023-03-03 19:19 ` Grant Taylor via COFF 2023-03-04 10:15 ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy 2023-03-07 21:49 ` [COFF] " Tomasz Rola 2023-03-07 22:46 ` Tomasz Rola 2023-06-20 16:02 ` Michael Parson 2023-06-20 21:26 ` Tomasz Rola 2023-06-22 15:45 ` Michael Parson 2023-07-10 9:08 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. 
(Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola 2023-03-03 16:12 ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall 2023-03-03 17:13 ` Dan Cross 2023-03-03 17:38 ` Ralph Corderoy 2023-03-03 19:09 ` Dan Cross 2023-03-03 19:36 ` Grant Taylor via COFF 2023-03-04 10:26 ` Ralph Corderoy 2023-03-03 19:06 ` Grant Taylor via COFF 2023-03-03 19:31 ` Dan Cross 2023-03-04 10:07 ` Ralph Corderoy 2023-03-06 10:01 ` Ed Bradford 2023-03-06 21:01 ` Dan Cross 2023-03-06 21:49 ` Steffen Nurpmeso 2023-03-07 1:43 ` Larry McVoy 2023-03-07 4:01 ` Ed Bradford 2023-03-07 11:39 ` [COFF] " Ralph Corderoy 2023-03-07 18:31 ` [COFF] " Grant Taylor via COFF 2023-03-08 11:22 ` Ed Bradford 2023-03-07 16:14 ` Dan Cross 2023-03-07 17:34 ` [COFF] " Ralph Corderoy 2023-03-07 18:33 ` [COFF] " Dan Cross 2023-03-07 4:19 ` Ed Bradford
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).