Computer Old Farts Forum
 help / color / mirror / Atom feed
From: Ralph Corderoy <ralph@inputplus.co.uk>
To: coff@tuhs.org
Subject: [COFF] Requesting thoughts on extended regular expressions in grep.
Date: Tue, 07 Mar 2023 17:34:51 +0000	[thread overview]
Message-ID: <20230307173451.D94B421C9B@orac.inputplus.co.uk> (raw)
In-Reply-To: <CAEoi9W6sLfpoGo8pGF_1RYA62Pi8aCWdyo53vNKqdtxyFKTGpA@mail.gmail.com>

Hi Dan,

> I'm sure that with enough effort, it _is_ possible to make a 50K RE
> that is understandable to mere mortals.  But it begs the question: why
> bother?

It could be the quickest way to express the intent.

> The answer to that question, in my mind, shows the difference between
> a clever programmer and a pragmatic engineer.

I think those two can overlap.  :-)

> I submit that it's time to reach for another tool well before you get
> to an RE that big

Why, if the grammar is type three in Chomsky's hierarchy, i.e. a regular
grammar?  I think sticking with code aimed at regular grammars, or more
likely regexps, will do better than, say, a parser generator for a
type-two context-free grammar.

As well as the lex(1) family, there's Ragel as another example.
http://www.colm.net/open-source/ragel/

> 3.  Optimizing regular expressions.  You're still bound by the known
> theoretical properties of finite automata here.

Well, we're back to the RE v. regexp again.  /^[0-9]+\.jpeg$/ is matched
by some engines by first checking the last five bytes are ‘.jpeg’.

    $ debugperl -Dr -e \
    >     '"123546789012354678901235467890123546789012.jpg" =~ /^[0-9]+\.jpeg$/'
    ...
    Matching REx "^[0-9]+\.jpeg$" against "123546789012354678901235467890123546789012.jpg"
    Intuit: trying to determine minimum start position...
      doing 'check' fbm scan, [1..46] gave -1
      Did not find floating substr ".jpeg"$...
    Match rejected by optimizer
    Freeing REx: "^[0-9]+\.jpeg$"
    $

Boyer-Moore string searching can be used.  Common-subregexp-elimination
can spot repetitive fragment of regexp and factor them into a single set
of states along with pairing the route into them with the appropriate
route out.

The more regexp engines are optimised, the more benefit to the
programmer from sticking to a regexp rather than, say, ad hoc parsing.

The theory of REs is interesting and important, but regexps deviate from
it ever more.

-- 
Cheers, Ralph.

  reply	other threads:[~2023-03-07 17:34 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 18:54 Grant Taylor via COFF
2023-03-02 19:23 ` [COFF] " Clem Cole
2023-03-02 19:38   ` Grant Taylor via COFF
2023-03-02 23:01   ` Stuff Received
2023-03-02 23:46     ` Steffen Nurpmeso
2023-03-03  1:08     ` Grant Taylor via COFF
2023-03-03  2:10       ` Dave Horsfall
2023-03-03  3:34         ` Grant Taylor via COFF
2023-03-02 21:53 ` Dan Cross
2023-03-03  1:05   ` Grant Taylor via COFF
2023-03-03  3:04     ` Dan Cross
2023-03-03  3:53       ` Grant Taylor via COFF
2023-03-03 13:47         ` Dan Cross
2023-03-03 19:26           ` Grant Taylor via COFF
2023-03-03 10:59 ` Ralph Corderoy
2023-03-03 13:11   ` Dan Cross
2023-03-03 13:42     ` Ralph Corderoy
2023-03-03 19:19       ` Grant Taylor via COFF
2023-03-04 10:15         ` [COFF] Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.) Ralph Corderoy
2023-03-07 21:49           ` [COFF] " Tomasz Rola
2023-03-07 22:46             ` Tomasz Rola
2023-06-20 16:02           ` Michael Parson
2023-06-20 21:26             ` Tomasz Rola
2023-06-22 15:45               ` Michael Parson
2023-07-10  9:08                 ` [COFF] Re: Reader, paper, tablet, phone (was: Re: Reading PDFs on a mobile. (Was: Requesting thoughts on extended regular expressions in grep.)) Tomasz Rola
2023-03-03 16:12   ` [COFF] Re: Requesting thoughts on extended regular expressions in grep Dave Horsfall
2023-03-03 17:13     ` Dan Cross
2023-03-03 17:38       ` Ralph Corderoy
2023-03-03 19:09         ` Dan Cross
2023-03-03 19:36     ` Grant Taylor via COFF
2023-03-04 10:26       ` Ralph Corderoy
2023-03-03 19:06 ` Grant Taylor via COFF
2023-03-03 19:31   ` Dan Cross
2023-03-04 10:07   ` Ralph Corderoy
2023-03-06 10:01 ` Ed Bradford
2023-03-06 21:01   ` Dan Cross
2023-03-06 21:49     ` Steffen Nurpmeso
2023-03-07  1:43     ` Larry McVoy
2023-03-07  4:01       ` Ed Bradford
2023-03-07 11:39         ` [COFF] " Ralph Corderoy
2023-03-07 18:31           ` [COFF] " Grant Taylor via COFF
2023-03-08 11:22           ` Ed Bradford
2023-03-07 16:14         ` Dan Cross
2023-03-07 17:34           ` Ralph Corderoy [this message]
2023-03-07 18:33             ` Dan Cross
2023-03-07  4:19     ` Ed Bradford

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230307173451.D94B421C9B@orac.inputplus.co.uk \
    --to=ralph@inputplus.co.uk \
    --cc=coff@tuhs.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).