From: Ed Bradford
Date: Mon, 6 Mar 2023 04:01:47 -0600
To: Grant Taylor
CC: COFF
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
List-Id: Computer Old Farts Forum

Thanks, Grant and contributors in
this thread,
Great thread on RE's. I bought and read
the book (it's on the floor over there
in the corner and I'm not getting up).

My task was finding dates in binary
and text files. It turns out RE's work just
fine for that. Because I was looking at
both text files and binary files, I
wrote my stuff using 8-bit Python
"bytes" rather than Python "str" text, which
holds Unicode code points rather than raw
bytes. (I use Python because it works on
Linux, Macs and Windows and reduces the
number of RE implementations I have
to deal with to 1.)

I finished my first round of the
program late fall of 2022. Then
I put it down and now I am
revisiting it. I was creating:

  A Python program to search for
  media files (pictures and movies)
  and copy them to another
  directory tree, copying only the
  unique ones (deduplication), and
  renaming each with

    YYYY-MM-DD-

  as a prefix.

Here is a list of observations from my programming.

1. RE's are quite unreadable. I defined
   a lot of Python variables and simply
   added them together in Python to make
   a larger byte string (see below).
   The resulting expressions were
   shorter on screen
   and more readable. Furthermore,
   I could construct them incrementally.
   I insist on readable code
   because I frequently put things down
   for a month or more. A while back
   it was a sad day when I restarted
   something and simply had to throw it
   away, moaning, "What was that
   programmer thinking?"

   Here is an example RE for
       YYYY-MM-DD

      # FR = front   BA = back
      # ymdt is text version
      ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
      ymdc = re.compile(ymdt)

1a. I also had a hard time defining
    delimiters. There are delimiters
    for the beginning, delimiters
    for internal separation,
    and delimiters for the end.

    The significant thing is I have
    to find a match even if it is the very
    first string in the file or the
    very last. That also complicates
    buffered reading immensely. Hence, I wrote
    the whole program by reading the
    file into a single Python variable.
    However, when files became much
    larger than memory, Python simply
    ground to a halt, as did my Windows
    machine. I then rewrote it using a
    memory-mapped file (for all files)
    and the problem was fixed.

2. Dates are formatted in a number of
   ways. I chose exactly one
   format to learn about RE's
   and how to construct them and use
   them. Even the book didn't elaborate
   on everything. I could not find
   detailed documentation on some of
   the interfaces in the book.

   On a whim, I asked ChatGPT
   to write a Python module that returns
   a list of offsets and dates in a file.
   Surprisingly, it wrote one that was
   quite credible. It had bugs but it
   knew more about how to use the various
   functional interfaces in RE's than I
   did.

3. Testing an RE is maybe even more
   difficult than writing one. I have
   not given any serious effort to
   verification testing yet.
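
As a minimal sketch of items 1 and 1a, with a starting point for item 3:
the definitions of FRSEP, Y_, SEP, M_, D_ and BASEP are not shown above,
so the ones here are assumptions. The helper names find_dates,
find_dates_in_file, CASES and check_cases are also hypothetical and are
not taken from the original program. The idea is to use lookarounds as
the front and back delimiters, so a date still matches when it is the
very first or very last thing in the data, and to search a read-only
memory map rather than a buffer read in pieces.

```python
import mmap
import os
import re

# Assumed building blocks (bytes, so they work on binary and text files alike).
Y_ = rb"(?P<year>\d{4})"
M_ = rb"(?P<month>0[1-9]|1[0-2])"
D_ = rb"(?P<day>0[1-9]|[12]\d|3[01])"
SEP = rb"-"
# Lookarounds consume nothing, so they also succeed at the start and end of the data.
FRSEP = rb"(?<![0-9])"  # front: not preceded by a digit
BASEP = rb"(?![0-9])"   # back: not followed by a digit

ymdt = FRSEP + Y_ + SEP + M_ + SEP + D_ + BASEP
ymdc = re.compile(ymdt)


def find_dates(data):
    """Return (offset, date) pairs for each YYYY-MM-DD in a bytes-like object."""
    return [(m.start(), m.group(0).decode("ascii")) for m in ymdc.finditer(data)]


def find_dates_in_file(path):
    """Search a file through a read-only memory map instead of reading it whole."""
    with open(path, "rb") as f:
        # mmap refuses to map an empty file, so handle that case first.
        if os.fstat(f.fileno()).st_size == 0:
            return []
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return find_dates(mm)


# A starting point for verification: inputs paired with the expected result.
CASES = [
    (b"2023-03-06", [(0, "2023-03-06")]),        # the whole input is a date
    (b"x2023-03-06\x00", [(1, "2023-03-06")]),   # surrounded by non-digits
    (b"12023-03-06", []),                        # a five-digit run is not a year
    (b"2023-13-06", []),                         # month 13 is rejected
]


def check_cases():
    """Return the cases whose actual result differs from the expected one."""
    return [(data, want, find_dates(data))
            for data, want in CASES if find_dates(data) != want]
```

An empty list from check_cases() means every listed case behaved as
expected. Growing the CASES table is one cheap way to start on
verification.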

I would like to extend my program to
any date format. That would require
a much bigger RE. I have been led to
believe that a 50 Kbyte or 500 Kbyte
RE works just as well (if not
as fast) as a 100-byte RE. I think
with parentheses and
pipe symbols suitably used,
one could match

  Monday, March 6, 2023
  2023-03-06
  Mar 6, 2023
  or
  ...
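
A rough sketch of how parentheses and pipe symbols could combine the
three example formats into one expression, in the same style of building
the RE from named pieces. All names here (MONTH, WEEKDAY, ISO, NAMED,
ANY_DATE, classify) are hypothetical, and only the shape of each date is
checked, not whether the day exists in that month.

```python
import re

MONTHS = ("January February March April May June July "
          "August September October November December").split()
WEEKDAYS = "Monday Tuesday Wednesday Thursday Friday Saturday Sunday".split()

# Full month names first, then three-letter abbreviations ("May" is listed twice).
MONTH = "|".join(MONTHS + [name[:3] for name in MONTHS]).encode("ascii")
WEEKDAY = "|".join(WEEKDAYS).encode("ascii")

# One named group per format, so a match reports which alternative fired.
ISO = rb"(?P<iso>\d{4}-\d{2}-\d{2})"
NAMED = (rb"(?P<named>(?:(?:" + WEEKDAY + rb"), )?"   # optional "Monday, "
         rb"(?:" + MONTH + rb") \d{1,2}, \d{4})")     # "March 6, 2023"

ANY_DATE = re.compile(
    rb"(?<![0-9A-Za-z])(?:" + ISO + rb"|" + NAMED + rb")(?![0-9])"
)


def classify(data):
    """Return (offset, format name, text) for each date found in bytes-like data."""
    return [(m.start(), m.lastgroup, m.group(0).decode("ascii"))
            for m in ANY_DATE.finditer(data)]
```

Each further format would be one more named alternative inside the outer
group. Whether a very large alternation stays fast enough is something to
measure rather than assume.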

I'm just guessing, though. This
thread has been very informative.
I have much to read.
Thank all of you.
Ed Bradford
Pflugerville, TX




On Thu, Mar 2, 2023 at 12:55 PM Grant Taylor via COFF <coff@tuhs.org> wrote:
> Hi,
>
> I'd like some thoughts ~> input on extended regular expressions used
> with grep, specifically GNU grep -E / egrep.
>
> What are the pros / cons to creating extended regular expressions like
> the following:
>
>     ^\w{3}
>
> vs:
>
>     ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
>
> Or:
>
>     [ :[:digit:]]{11}
>
> vs:
>
>     ( 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31) (0|1|2)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]:(0|1|2|3|4|5)[[:digit:]]
>
> I'm currently eliding the 61st (60) second, the 32nd day, and dealing
> with February having fewer days for simplicity.
>
> For matching patterns like the following in log files?
>
>     Mar  2 03:23:38
>
> I'm working on organically training logcheck to match known good log
> entries.  So I'm *DEEP* in the bowels of extended regular expressions
> (GNU egrep) that runs over all logs hourly.  As such, I'm interested in
> making sure that my REs are both efficient and accurate or at least not
> WILDLY badly structured.  The pedantic part of me wants to avoid
> wildcard type matches (\w), even if they are bounded (\w{3}), unless it
> truly is for unpredictable text.
>
> I'd appreciate any feedback and recommendations from people who have
> been using and / or optimizing (extended) regular expressions for longer
> than I have been using them.
>
> Thank you for your time and input.
>
>
>
> --
> Grant. . . .
> unix || die



--
Advice is judged by results, not by intentions.
  Cicero
