Thanks, Grant and contributors in this thread. Great thread on REs. I bought and read the book (it's on the floor over there in the corner and I'm not getting up).

My task was finding dates in binary and text files. It turns out REs work just fine for that. Because I was looking at both text files and binary files, I wrote my code using 8-bit Python bytes rather than Python str, which is Unicode text. (I use Python because it works on Linux, Macs, and Windows, and reduces the number of RE implementations I have to deal with to one.) I finished the first round of the program in late fall of 2022, put it down, and am now revisiting it.

I was writing a Python program to search for media files (pictures and movies) and copy them to another directory tree, copying only the unique ones (deduplication) and renaming each with YYYY-MM-DD- as a prefix.

Here is a list of observations from my programming.

1. REs are quite unreadable. I defined a lot of Python variables and simply added them together in Python to make a larger byte string (see below). The resulting expressions were shorter on screen and more readable. Furthermore, I could construct them incrementally. I insist on readable code because I frequently put things down for a month or more. A while back it was a sad day when I restarted something and simply had to throw it away, moaning, "What was that programmer thinking?" Here is an example RE for YYYY-MM-DD:

    # FR = front, BA = back
    # ymdt is the text (pattern) version
    ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
    ymdc = re.compile(ymdt)

1a. I also had a hard time defining delimiters. There are delimiters for the beginning, delimiters for internal separation, and delimiters for the end. The significant complication is that I have to find the RE even when it is the very first string in the file or the very last. That also complicates buffered reading immensely. Hence, I wrote the whole program to read each file into a single Python variable.
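A sketch of the fragment-composition idea in point 1: the fragment names (FRSEP, Y_, SEP, M_, D_, BASEP) come from my code above, but these particular definitions are illustrative guesses, not the original program:

```python
# Build one big byte-string RE by adding named fragments together.
# These fragment definitions are guesses for illustration only.
import re

Y_    = rb"(19|20)\d\d"            # year 1900-2099
M_    = rb"(0[1-9]|1[0-2])"        # month 01-12
D_    = rb"(0[1-9]|[12]\d|3[01])"  # day 01-31
SEP   = rb"[-_.]"                  # internal separator
FRSEP = rb"(?:^|[^0-9])"           # front: start of data or a non-digit
BASEP = rb"(?:[^0-9]|$)"           # back: end of data or a non-digit

# ymdt is the text (pattern) version; ymdc is the compiled form.
ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
ymdc = re.compile(ymdt)

m = ymdc.search(b"IMG_2022-11-05_home.jpg")
```

Because FRSEP and BASEP accept start- and end-of-data as well as a non-digit byte, the pattern still matches a date that is the very first or very last thing in the file, which is exactly the delimiter headache point 1a describes.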
However, when files became much larger than memory, Python simply ground to a halt, as did my Windows machine. I then rewrote it using a memory-mapped file (for all files) and the problem was fixed.

2. Dates are formatted in a number of ways. I chose exactly one format in order to learn about REs and how to construct and use them. Even the book didn't elaborate everything; I could not find detailed documentation on some of the interfaces in the book. On a whim, I asked ChatGPT to write a Python module that returns a list of offsets and dates in a file. Surprisingly, it wrote one that was quite credible. It had bugs, but it knew more about how to use the various functional interfaces of REs than I did.

3. Testing an RE is maybe even more difficult than writing one. I have not given any serious effort to verification testing yet. I would like to extend my program to handle any date format. That would require a much bigger RE. I have been led to believe that a 50-Kbyte or 500-Kbyte RE works just as well (if not as fast) as a 100-byte RE. I think with parentheses and pipe symbols suitably used, one could match

    Monday, March 6, 2023
    2023-03-06
    Mar 6, 2023
    or ...

I'm just guessing, though.

This thread has been very informative. I have much to read. Thank all of you.

Ed Bradford
Pflugerville, TX

On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF wrote:
> Hi,
>
> I'd like some thoughts / input on extended regular expressions used
> with grep, specifically GNU grep -e / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
>    ^\w{3}
>
> vs:
>
>    ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
>
> Or:
>
>    [ :[:digit:]]{11}
>
> vs:
>
>    ( 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|
>    21|22|23|24|25|26|27|28|29|30|31)
>    (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
>
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.
>
> For matching patterns like the following in log files?
>
>    Mar 2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries. So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly. As such, I'm interested in
> making sure that my REs are both efficient and accurate, or at least not
> WILDLY badly structured. The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.
>
> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.
>
> --
> Grant. . . .
> unix || die

--
Advice is judged by results, not by intentions.
  Cicero
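The memory-mapped rewrite mentioned above could be sketched roughly like this; the pattern and function name are illustrative, not from the original program:

```python
# Map the file and let re scan the mapped bytes directly, so the whole
# file never has to sit in a Python bytes object at once. find_dates
# is an illustrative sketch, not the original program.
import mmap
import os
import re

# YYYY-MM-DD, years 1900-2099, as a bytes pattern so it also works on
# binary files.
ymdc = re.compile(rb"(19|20)\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def find_dates(path):
    """Return a list of (offset, matched_bytes) for each date in the file."""
    hits = []
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return hits  # mmap refuses to map empty files
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in ymdc.finditer(mm):
                hits.append((m.start(), m.group(0)))
    return hits
```

Compiled bytes patterns can search any bytes-like object, including an mmap, so the OS pages the file in on demand instead of Python holding it all in memory.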
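The guess in point 3 about parentheses and pipe symbols is basically how it would work. A hedged sketch covering only the three example formats (the month and day-name fragments are mine and far from exhaustive):

```python
# One RE with pipe-separated branches, one per date style.
# Covers only the three example formats; not a complete date matcher.
import re

MONTH = (rb"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|"
         rb"Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|"
         rb"Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
DAYNAME = rb"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day"

iso  = rb"(19|20)\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"  # 2023-03-06
mdy  = MONTH + rb" [1-9]\d?, (19|20)\d\d"                     # Mar 6, 2023
full = DAYNAME + rb", " + mdy                          # Monday, March 6, 2023

# Longest form first, so the day-name variant wins when present.
any_date = re.compile(full + rb"|" + iso + rb"|" + mdy)
```

Each branch is built from the same kind of named fragments as before, so adding another format is just one more variable and one more pipe.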