Computer Old Farts Forum
From: Ralph Corderoy <ralph@inputplus.co.uk>
To: coff@tuhs.org
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
Date: Tue, 07 Mar 2023 11:39:49 +0000
Message-ID: <20230307113949.501602135B@orac.inputplus.co.uk>
In-Reply-To: <CAHTagfFqfP3eVSgQOgV29O=JJkGdhjiv40pw-LNsvNvORC1XTA@mail.gmail.com>

Hi Ed,

> I have made an attempt to make my RE stuff readable and supportable.

Readable to you, which is fine because you're the prime future reader.
But it's less readable than the regexp to those that know and read them
because of the indirection introduced by the variables.  You've created
your own little language of CAPITALS rather than the lingua franca of
regexps.  :-)

> Machine language was unreadable and then along came assembly language.
> Assembly language was unreadable, then came higher level languages.

Each time the original language was readable because practitioners had
to read and write it.  When its replacement came along, the old skill
was no longer learnt and the language became ‘unreadable’.

> So far, I can do that for this RE program that works for small files,
> large files, binary files and text files for exactly one pattern:
>     YYYY[-MM-DD]
> I constructed this RE with code like this:
>     # ymdt is YYYY-MM-DD RE in text.
>     # looking only for 1900s and 2000s years and no later than today.
>     _YYYY = "(19\d\d|20[01]\d|202" + "[0-" + lastYearRE) + "]" + "){1}"

‘{1}’ is redundant.

>     # months
>     _MM   = "(0[1-9]|1[012])"
>     # days
>     _DD   = "(0[1-9]|[12]\d|3[01])"
>     ymdt = _YYYY + '[' + _INTERNALSEP +
>                          _MM          +
>                          _INTERNALSEP +
>                    ']'{0,1)

I think we're missing something here.  The ‘'['’ starts a character
class, which is an odd way to wrap the month, and the ‘{0,1)’ has
mismatched brackets and sits outside the string.

BTW, ‘{0,1}’ is more readable to those who know regexps as ‘?’.
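
To illustrate both points, here's a small sketch.  The two patterns are
simplified stand-ins I've made up for the occasion, not your actual
regexp: a four-digit year with an optional ‘-MM’ suffix, written once
with ‘{1}’ and ‘{0,1}’ and once without.

```python
import re

# Hypothetical simplified patterns: a year with an optional "-MM" suffix.
with_braces = re.compile(r"(\d{4}){1}(-\d\d){0,1}")
with_question = re.compile(r"(\d{4})(-\d\d)?")

samples = ["2023", "2023-03", "2023-", "202"]

def spans(pattern, texts):
    """Return the full-match span for each text, or None where it fails."""
    results = []
    for text in texts:
        m = pattern.fullmatch(text)
        results.append(m.span() if m else None)
    return results

braces_result = spans(with_braces, samples)
question_result = spans(with_question, samples)

# Both spellings accept and reject exactly the same inputs.
print(braces_result == question_result)
```

The two patterns should agree on every sample, so the shorter spelling
loses nothing.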

> For the whole file, RE I used
>     ymdthf = _FRSEP + ymdt + _BASEP
> where FRSEP is front separator which includes
> a bunch of possible separators, excluding numbers and letters, or-ed
> with the up arrow "beginning of line" RE mark.

It sounds like you're wanting a word boundary; something provided by
regexps.  In Python, it's ‘\b’.

    >>> import re
    >>> re.search(r'\bfoo\b', 'endfoo foostart foo ends'),
    (<re.Match object; span=(16, 19), match='foo'>,)

Are you aware of the /x modifier to a regexp, re.X in Python, which
ignores internal whitespace, including linefeeds, except inside a
character class?  This allows a large regexp to be split over lines.
There's a comment syntax too.  See
https://docs.python.org/3/library/re.html#re.X
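
As a rough sketch of how that might look for a date pattern like yours:
the name DATE, the set of separators, and the 2023 upper bound on the
year are my assumptions, not taken from your code.

```python
import re

# re.X (re.VERBOSE) ignores whitespace in the pattern and allows comments.
DATE = re.compile(r"""
    \b
    (19\d\d | 20[01]\d | 202[0-3])   # year: 1900 to 2023 (assumed bound)
    (?:
        ([-:._])                     # separator, captured as group 2
        (0[1-9] | 1[0-2])            # month
        \2                           # the same separator again
        (0[1-9] | [12]\d | 3[01])    # day
    )?                               # the -MM-DD part is optional
    \b
""", re.X)

full = DATE.search("on 2001-04-01, ok")
print(full.group(0))

# Mixed separators: only the year matches.
mixed = DATE.search("2000-01.01")
print(mixed.group(0))

# No word boundary before the year, so nothing matches.
embedded = DATE.search("x19991231")
print(embedded)
```

The back-reference is what stops ‘2000-01.01’ matching as a full date.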

GNU grep isn't too shabby at looking through binary files.  I can't use
/x with grep so in a bash script, I'd do it manually.  \< and \> match
the start and end of a word, a bit like Python's \b.

    re='
        .?\<
            (19[0-9][0-9]|20[01][0-9]|202[0-3])
            (
                ([-:._])
                (0[1-9]|1[0-2])
                \3
                (0[1-9]|[12][0-9]|3[01])
            )?
        \>.?
    '
    re=${re//$'\n'/}
    re=${re// /}

    printf '%s\n' 2001-04-01,1999_12_31 1944.03.01,1914! 2000-01.01 >big-binary-file
    LC_ALL=C grep -Eboa "$re" big-binary-file | sed -n l

which gives

    0:2001-04-01,$
    11:1999_12_31$
    22:1944.03.01,$
    33:1914!$
    39:2000-$

showing:

- the byte offset within the file of each match,
- along with the byte before and the byte after the match, when it
  isn't a \n and hasn't already been matched, just to show the
  word boundary at work,
- with any non-printables escaped into octal by sed.
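
For comparison, here's a sketch of the same search in Python over bytes.
It assumes \b is an acceptable stand-in for both \< and \>, which isn't
exactly the same thing, and the names PATTERN, data and hits are mine.
The data is what the printf above writes to the file.

```python
import re

# A bytes pattern searches binary data directly; re.X allows the layout.
PATTERN = re.compile(rb"""
    .?\b
    (19[0-9][0-9] | 20[01][0-9] | 202[0-3])
    (
        ([-:._])
        (0[1-9] | 1[0-2])
        \3
        (0[1-9] | [12][0-9] | 3[01])
    )?
    \b.?
""", re.X)

# Same bytes as the printf in the shell example writes.
data = b"2001-04-01,1999_12_31\n1944.03.01,1914!\n2000-01.01\n"

# Collect (byte offset, matched bytes) pairs, like grep -b -o.
hits = [(m.start(), m.group(0)) for m in PATTERN.finditer(data)]
for offset, text in hits:
    print(f"{offset}:{text!r}")
```

On this input it should find the same five matches at the same offsets
as grep.  Python's engine backtracks rather than looking for the
longest match, so the two could differ on other patterns.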

> I thought I was on the COFF mailing list.

I'm sending this to just the list.

> I received this email by direct mail to from Larry.

Perhaps your account on the list is configured not to send you a copy
if it sees your address in the header fields.

-- 
Cheers, Ralph.
