[COFF] Re: Requesting thoughts on extended regular expressions in grep.

Computer Old Farts Forum
 help / color / mirror / Atom feed

From: Ed Bradford <egbegb2@gmail.com>
To: Ralph Corderoy <ralph@inputplus.co.uk>
Cc: coff@tuhs.org
Subject: [COFF] Re: Requesting thoughts on extended regular expressions in grep.
Date: Wed, 8 Mar 2023 05:22:56 -0600	[thread overview]
Message-ID: <CAHTagfGYNi-TvkPMsXBf36a3g-b7D7qtk-xn9k6kiwu0YM7DcA@mail.gmail.com> (raw)
In-Reply-To: <20230307113949.501602135B@orac.inputplus.co.uk>

[-- Attachment #1: Type: text/plain, Size: 10419 bytes --]

Thank you for the very useful comments. However,
I disagree with you about the RE language. While
I agree all RE experts don't need that, when I was
hiring and gave some software to a new hire (whether
an experienced programmer or a recent college grad)
simply handing over huge RE's to my new hire was
a daunting task to that person. I wrote that stuff
that way to help remind me and anyone who might
use the python program.

I don't claim success. It does help me.

When you say '{1}' is redundant, I think I did that
to avoid any possibility of conflicts with the
next string that is
concatentated to the *Y_* (e.g. '*' or '+' or '{4,7}').
I am embarrassed I did not communicate that
in the code. I had to think about it for a couple of hours
before I recalled the "why". I will fix that.

  (it would be difficult to discuss
   this RE if I had to write
       "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE + "]" + ")
   rather than just *Y_*).

My initial thoughts
on naming were I wanted the definition to be defined
in exactly one place in the software.
Python and the BTL folks told me to never
use a constant in code. Always name it.
Hence, I gave it a name. Each name might
be used in multiple places. They might be imported.

You are correct, the expression is unbalanced. I tried
to remove the text2bytes(lastYearRE*)* call so the expression in
this email was all text. I failed to remove the trailing *)* when
I removed the call to text2bytes(). My hasty transcriptions
might have produced similar errors in my email.

Recall, my focus was on any file of any size.
I'm on Windows 10 and an m1 MacBook.
Python works on both. I don't have
a Linux machine or enough desktop space to
host one. I'm also mildly fed-up with
virtual machines.

Friedl taught me one thing. Most
RE implementations are different. I'm trying
to write a program that I could give
to anyone and could reliably find a date (an RE) in
any file. YYYY, MM, DD, HR, MI, SE, TH are words
my user could use in the command line or in
an options dialog. LAT and LON might also be
possibilities. CST, EST, MST, PST, ... also.
A 500 gigabyte archive or directory/folder
of pictures and movies would be
a great test target.

I very much appreciate your comments. If this
discussion is boring to others, I would be happy
to take it to emails.

I like your program. My experience
with RE, grep, python, and sed suggests that
anything but gnu grep and sed might not work due to the
different implementations.

I've been out of the Unix software business
for 30 years after starting work at BTL in the 1970s
and working on Version 6. I didn't know "printf" was now
built into bash! That was a surprise. It's an incremental
improvement, but doesn't compare with f-strings in python.
*The interactive interpreter for python should have*
*a "bash" mode?!*

Does grep use a memory mapped file for its search, thereby
avoiding all buffering boundaries? That too, would
be new information to me. The additional complexity
of dealing with buffering is more than annoying.

Do you have any thoughts on how to verify
a program that uses RE's. I've given no thought
until now. My first thought for dates would be
to write a separate module that simply searched
through the file looking for 4 numbers in a row
without using RE's, recording the offsets and 16 characters
after and 1 character before in a python list of (offset,str)
of tuples, ddddList, and using *dddd**List*
as a proxy for the entire file. I could then
aim my RE's at *ddddList*. *[A list of tuples in python*
*is wonderful! !]* It seems to me '*' and '+' and {x,y} are the performance
hogs in RE's. My RE's avoid them. One pass, I think, should
suffice. What do you think? I haven't "archived" my 350 GB
of pictures and movies, but one pass over all files therein
ought to suffice, right? Two different programs that use different
algorithms should be pretty good proof of correctness wouldn't
you think?

My RE's have no stars or pluses. If there is a mismatch before
a match, give up and move on.

On my Windows 10 machine, I have cygwin.
Microsoft says my CPU doesn't have a TPM and
the specific Intel Core I7 on my system is not
supported so Windows 11 is not happening.
Microsoft is DOS personified.
 (An unkind editorial remark about the low
  quality of software coming from Microsoft.)

Anyway, I thank you again for your patience with me
and your observations. I value your views and the
other views I've seen here on coff@tuhs.org.

I welcome all input to my education and will share
all I have done so far with anyone who wants to
collaborate, test, or is just curious.

    GOAL: run python program from an at-cost thumb drive that:
          reaps all media files from a user specified
          directory/folder tree and

          Adds files to the thumb drive.

          *Adds files* means
            Original file system is untouched

            Adds only unique files (hash codes are unique)

            Creates on the thumb drive a relative directory
              wherein the original file was found

            Prepends a "YYYY-MM-DD-" string to the filename
              if one can be found (EXIF is great shortcut).

            Copies
                      srcroot/relative_path/oldfilename
              to
                   thumbdrive/relative_path/YYYY-MM-DD-oldfilename
                     or
                   thumbdrive/relative_path/0000-oldfilename.

          Can also incrementally add new files by just
            scanning anywhere in any other computer
            file system or any other computer.

          Must work on Mac, Windows, and Linux

What I have is a working prototype. It works
on Mac and Windows. It doesn't do the
date thing very well, and there are other shortcomings.

I have delivered exactly one Christmas present to my favorite person
in the world - a 400 GB SSD drive with all our pictures and media
we have ever taken. The next things are to *add *more media
and *re-unique-ify* (check) what is already present on the SSD drive
and  *improve the proper choice of "YYYY-MM-DD-" prefix* to
filenames.

I am retired and this is fun.
I'm too old to want to get rich.

Ed Bradford
Pflugerville, TX
egbegb2@gmail.com

On Tue, Mar 7, 2023 at 5:40 AM Ralph Corderoy <ralph@inputplus.co.uk> wrote:

> Hi Ed,
>
> > I have made an attempt to make my RE stuff readable and supportable.
>
> Readable to you, which is fine because you're the prime future reader.
> But it's less readable than the regexp to those that know and read them
> because of the indirection introduced by the variables.  You've created
> your own little language of CAPITALS rather than the lingua franca of
> regexps.  :-)
>
> > Machine language was unreadable and then along came assembly language.
> > Assembly language was unreadable, then came higher level languages.
>
> Each time the original language was readable because practitioners had
> to read and write it.  When its replacement came along, the old skill
> was no longer learnt and the language became ‘unreadable’.
>
> > So far, I can do that for this RE program that works for small files,
> > large files, binary files and text files for exactly one pattern:
> >     YYYY[-MM-DD]
> > I constructed this RE with code like this:
> >     # ymdt is YYYY-MM-DD RE in text.
> >     # looking only for 1900s and 2000s years and no later than today.
> >     _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"
>
> ‘{1}’ is redundant.
>
> >     # months
> >     _MM   = "(0[1-9]|1[012])"
> >     # days
> >     _DD   = "(0[1-9]|[12]\d|3[01])"
> >     ymdt = _YYYY + '[' + _INTERNALSEP +
> >                          _MM          +
> >                          _INTERNALSEP +
> >                    ']'{0,1)
>
> I think we're missing something as the ‘'['’ is starting a character
> class which is odd for wrapping the month and the ‘{0,1)’ doesn't have
> matching brackets and is outside the string.
>
> BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
>
> > For the whole file, RE I used
> >     ymdthf = _FRSEP + ymdt + _BASEP
> > where FRSEP is front separator which includes
> > a bunch of possible separators, excluding numbers and letters, or-ed
> > with the up arrow "beginning of line" RE mark.
>
> It sounds like you're wanting a word boundary; something provided by
> regexps.  In Python, it's ‘\b’.
>
>     >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'),
>     (<re.Match object; span=(16, 19), match='foo'>,)
>
> Are you aware of the /x modifier to a regexp which ignores internal
> whitespace, including linefeeds?  This allows a large regexp to be split
> over lines.  There's a comment syntax too.  See
> https://docs.python.org/3/library/re.html#re.X
>
> GNU grep isn't too shabby at looking through binary files.  I can't use
> /x with grep so in a bash script, I'd do it manually.  \< and \> match
> the start and end of a word, a bit like Python's \b.
>
>     re='
>         .?\<
>             (19[0-9][0-9]|20[01][0-9]|202[0-3])
>             (
>                 ([-:._])
>                 (0[1-9]|1[0-2])
>                 \3
>                 (0[1-9]|[12][0-9]|3[01])
>             )?
>         \>.?
>     '
>     re=${re//$'\n'/}
>     re=${re// /}
>
>     printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01
> >big-binary-file
>     LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l
>
> which gives
>
>     0:2001-04-01,$
>     11:1999_12_31$
>     22:1944.03.01,$
>     33:1914!$
>     39:2000-$
>
> showing:
>
> - the byte offset within the file of each match,
> - along with the any before and after byte if it's not a \n and not
>   already matched, just to show the word-boundary at work,
> - with any non-printables escaped into octal by sed.
>
> > I thought I was on the COFF mailing list.
>
> I'm sending this to just the list.
>
> > I received this email by direct mail to from Larry.
>
> Perhaps your account on the list is configured to not send you an email
> if it sees your address in the header's fields.
>
> --
> Cheers, Ralph.
>

-- 
Advice is judged by results, not by intentions.
  Cicero

[-- Attachment #2: Type: text/html, Size: 25039 bytes --]

next prev parent reply	other threads:[~2023-03-08 11:23 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 18:54 [COFF] " Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38   ` Grant Taylor via COFF
2023-03-02 23:01   ` Stuff Received
2023-03-02 23:46     ` Steffen Nurpmeso
2023-03-03  1:08     ` Grant Taylor via COFF
2023-03-03  2:10       ` Dave Horsfall
2023-03-03  3:34         ` Grant Taylor via COFF
2023-03-02 21:53 ` Dan Cross
2023-03-03  1:05   ` Grant Taylor via COFF
2023-03-03  3:04     ` Dan Cross
2023-03-03  3:53       ` Grant Taylor via COFF
2023-03-03 13:47         ` Dan Cross
2023-03-03 19:26           ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11   ` Dan Cross
2023-03-03 13:42     ` Ralph Corderoy
2023-03-03 19:19       ` Grant Taylor via COFF
2023-03-04 10:15         ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49           ` [COFF] " Tomasz Rola
2023-03-07 22:46             ` Tomasz Rola
2023-06-20 16:02           ` Michael Parson
2023-06-20 21:26             ` Tomasz Rola
2023-06-22 15:45               ` Michael Parson
2023-07-10  9:08                 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
2023-03-03 16:12   ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13     ` Dan Cross
2023-03-03 17:38       ` Ralph Corderoy
2023-03-03 19:09         ` Dan Cross
2023-03-03 19:36     ` Grant Taylor via COFF
2023-03-04 10:26       ` Ralph Corderoy
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31   ` Dan Cross
2023-03-04 10:07   ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01   ` Dan Cross
2023-03-06 21:49     ` Steffen Nurpmeso
2023-03-07  1:43     ` Larry McVoy
2023-03-07  4:01       ` Ed Bradford
2023-03-07 11:39         ` [COFF] " Ralph Corderoy
2023-03-07 18:31           ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22           ` Ed Bradford [this message]
2023-03-07 16:14         ` Dan Cross
2023-03-07 17:34           ` [COFF] " Ralph Corderoy
2023-03-07 18:33             ` [COFF] " Dan Cross
2023-03-07  4:19     ` Ed Bradford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAHTagfGYNi-TvkPMsXBf36a3g-b7D7qtk-xn9k6kiwu0YM7DcA@mail.gmail.com \
    --to=egbegb2@gmail.com \
    --cc=coff@tuhs.org \
    --cc=ralph@inputplus.co.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).