Thanks, Grant and contributors in
this thread,

Great thread on RE's. I bought and read
the book (it's on the floor over there
in the corner and I'm not getting up).

My task was finding dates in binary
and text files. It turns out RE's work just
fine for that. Because I was looking at
both text files and binary files, I
wrote my stuff using 8-bit python
"bytes" rather than python "str" text,
which is Unicode in python 3. (I use
python because it works on
Linux, Macs and Windows and reduces the
number of RE implementations I have
to deal with to 1).

I finished my first round of the
program in late fall of 2022. Then
I put it down and now I am
revisiting it. I was creating:

  A Python program to search for
  media files (pictures and movies)
  and copy them to another
  directory tree, copying only the
  unique ones (deduplication), and
  renaming each with
 
    YYYY-MM-DD-

  as a prefix.
 

Here is a list of observations from my
programming.

1. RE's are quite unreadable. I defined
   a lot of python variables and simply
   added them together in python to make
   a larger byte string (see below).
   The resulting
   expressions were shorter on screen
   and more readable. Furthermore,
   I could construct them incrementally.
   I insist on readable code
   because I frequently put things down
   for a month or more. A while back
   it was a sad day when I restarted
   something and simply had to throw it
   away, moaning, "What was that
   programmer thinking?".

   Here is an example RE for
       YYYY-MM-DD

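      # FR = front   BA = back   SEP = internal separator
      # The pieces are bytes patterns defined elsewhere (not shown).
      # ymdt is the text (uncompiled) version, ymdc the compiled one.
      ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
      ymdc = re.compile(ymdt)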

     
1a. I also had a hard time defining
    delimiters. There are delimiters
    for the beginning, delimiters
    for internal separation,
    and delimiters for the end.

    The significant thing is I have
    to find the match even if it is the
    very first string in the file or the
    very last. That also complicates
    buffered reading immensely. Hence, I wrote
    the whole program to read each
    file into a single python variable.
    However, when files were much
    larger than memory, python simply
    ground to a halt, as did my Windows
    machine. I then rewrote it using a
    memory mapped file (for all files)
    and the problem was fixed.

2. Dates are formatted in a number of
   ways. I chose exactly one
   format to learn about RE's
   and how to construct them and use
   them. Even the book didn't elaborate
   everything. I could not find
   detailed documentation on some of
   the interfaces in the book.

   On a whim, I asked ChatGPT
   to write a python module that returns
   a list of offsets and dates in a file.
   Surprisingly, it wrote one that was
   quite credible. It had bugs but it
   knew more about how to use the various
   functional interfaces in RE's than I
   did.

3. Testing an RE is maybe even more
   difficult than writing one. I have
   not given any serious effort to
   verification testing yet.
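
To make items 1 and 1a concrete, here is a minimal sketch,
not the actual program. The values given to FRSEP, BASEP,
SEP, Y_, M_ and D_, and the helper find_dates(), are
assumptions made up for illustration. The lookarounds are
zero-width, so a date can match at the very first or very
last byte, and mmap lets the re module scan the file
without reading it all into one variable.

```python
import mmap
import os
import re

# Hypothetical pieces (assumed for illustration); all are bytes patterns.
SEP = rb"-"                        # internal separator
Y_ = rb"(?:19|20)\d\d"             # four-digit year, 1900-2099
M_ = rb"(?:0[1-9]|1[0-2])"         # month 01-12
D_ = rb"(?:0[1-9]|[12]\d|3[01])"   # day 01-31
FRSEP = rb"(?<![0-9])"             # front: no digit before (true at offset 0)
BASEP = rb"(?![0-9])"              # back: no digit after (true at end of file)

ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
ymdc = re.compile(ymdt)


def find_dates(path):
    """Return a list of (offset, date_bytes) for each YYYY-MM-DD in the file."""
    with open(path, "rb") as f:
        # mmap cannot map an empty file, so handle that case first.
        if os.fstat(f.fileno()).st_size == 0:
            return []
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # re accepts the mmap object directly as a bytes-like object.
            return [(m.start(), m.group(0)) for m in ymdc.finditer(mm)]
```

Calling find_dates() on a file returns the byte offset and
the matched text of each date, whether the file is text or
binary.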

I would like to extend my program to
any date format. That would require
a much bigger RE. I have been led to
believe that a 50Kbyte or 500Kbyte
RE works just as well (if not
as fast) as a 100 byte RE. I think
with parentheses and
pipe-symbols suitably used,
one could match

  Monday, March 6, 2023
  2023-03-06
  Mar 6, 2023
  or
  ...
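
A minimal sketch of that idea, covering only the three
formats listed above. All names here (MONTHS, ANY_DATE,
find_any_dates and the group names iso, long, abbr) are
hypothetical and chosen for illustration. Python's re is a
backtracking engine that tries alternatives left to right,
so speed with a very large alternation is worth measuring
rather than assuming.

```python
import re

MONTHS = [b"January", b"February", b"March", b"April", b"May", b"June",
          b"July", b"August", b"September", b"October", b"November",
          b"December"]
WEEKDAYS = [b"Monday", b"Tuesday", b"Wednesday", b"Thursday", b"Friday",
            b"Saturday", b"Sunday"]

MON_LONG = b"(?:" + b"|".join(MONTHS) + b")"
MON_ABBR = b"(?:" + b"|".join(m[:3] for m in MONTHS) + b")"
WEEKDAY = b"(?:" + b"|".join(WEEKDAYS) + b")"
DAY = rb"(?:0?[1-9]|[12]\d|3[01])"
YEAR = rb"(?:19|20)\d\d"

# One named group per format, so the caller can tell which one matched.
ISO = rb"(?P<iso>" + YEAR + rb"-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]))"
LONG = (rb"(?P<long>(?:" + WEEKDAY + rb", )?" + MON_LONG + rb" " + DAY
        + rb", " + YEAR + rb")")
ABBR = rb"(?P<abbr>" + MON_ABBR + rb" " + DAY + rb", " + YEAR + rb")"

# Parentheses and pipe symbols join the formats into one larger RE.
ANY_DATE = re.compile(
    rb"(?<![0-9A-Za-z])(?:" + ISO + rb"|" + LONG + rb"|" + ABBR + rb")(?![0-9])"
)


def find_any_dates(data):
    """Return (format_name, offset, matched_bytes) for each date in data."""
    return [(m.lastgroup, m.start(), m.group(0)) for m in ANY_DATE.finditer(data)]
```

Each further format would be one more named piece added to
the alternation.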

I'm just guessing, though. This
thread has been very informative.
I have much to read.
Thank all of you.

Ed Bradford
Pflugerville, TX




On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
Hi,

I'd like some thoughts ~> input on extended regular expressions used
with grep, specifically GNU grep -E / egrep.

What are the pros / cons to creating extended regular expressions like
the following:

    ^\w{3}

vs:

    ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)

Or:

    [ :[:digit:]]{11}

vs:

    ( 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]

I'm currently eliding the 61st (60) second, the 32nd day, and dealing
with February having fewer days for simplicity.

For matching patterns like the following in log files?

    Mar  2 03:23:38

I'm working on organically training logcheck to match known good log
entries.  So I'm *DEEP* in the bowels of extended regular expressions
(GNU egrep) that runs over all logs hourly.  As such, I'm interested in
making sure that my REs are both efficient and accurate or at least not
WILDLY badly structured.  The pedantic part of me wants to avoid
wildcard type matches (\w), even if they are bounded (\w{3}), unless it
truly is for unpredictable text.

I'd appreciate any feedback and recommendations from people who have
been using and / or optimizing (extended) regular expressions for longer
than I have been using them.

Thank you for your time and input.



--
Grant. . . .
unix || die



--
Advice is judged by results, not by intentions.
  Cicero