zsh-workers
* Extending regexes
@ 2022-07-04 12:03 Sebastian Gniazdowski
  2022-07-04 13:47 ` Peter Stephenson
  2022-07-06 23:07 ` Phil Pennock
  0 siblings, 2 replies; 8+ messages in thread
From: Sebastian Gniazdowski @ 2022-07-04 12:03 UTC (permalink / raw)
  To: Zsh hackers list

Hi,
Zsh has extensions beyond ordinary regexes: the ~ and ^ negations. As can
be expected of negations, which Turing-universal machines require, they
introduce a whole new universe of computations over standard regular
expressions. For example, matching in an AND fashion:

setopt extendedglob  # ~ and ^ are only active with EXTENDED_GLOB
if [[ ABC == *A*~^*B*~^*C* ]]; then
  print A, B and C found
fi

I think that regexes look pretty limited from this point of view, and that
the PCRE extensions took the wrong path with their lookahead and lookbehind
semantics. What people commonly attempt with the regex [^] negation, such as
[^(string)], is simply available in zsh patterns as ^string.

I've recently used the ~ negation in a project, with great success, to reject
a set of known tokens from matching at a given position. That let me match a
loose `for` syntax in zinit-annex-pull, an extension to zinit that greps and
extracts zinit commands from any web page. I cannot see how that would be
possible without the extra negation.

Therefore I thought it weird that such a useful feature is missing from the
commonly used regex syntax. So maybe an attempt at updating it makes sense?
Could someone experienced with regexes, like Oliver, prepare some white papers
to accomplish this? It would be a great event to extend the old regexes with
not one, but TWO new negations.


* Re: Extending regexes
  2022-07-04 12:03 Extending regexes Sebastian Gniazdowski
@ 2022-07-04 13:47 ` Peter Stephenson
  2022-07-04 19:15   ` Bart Schaefer
  2022-07-06 18:40   ` stephane
  2022-07-06 23:07 ` Phil Pennock
  1 sibling, 2 replies; 8+ messages in thread
From: Peter Stephenson @ 2022-07-04 13:47 UTC (permalink / raw)
  To: Zsh hackers list

> On 04 July 2022 at 13:03 Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
> Hi,
> Zsh has extensions to regular regexes - the ~ and ^ negations.
> 
> Therefore I thought that it's weird that such an useful feature is missing
> from the commonly used regex syntax. So maybe an attempt of updating it has
> sense?

You're quite right both that they're very useful in zsh and there's nothing
like this in normal regular expressions, but unfortunately I've got a strong
feeling this is a big can of worms [hope that image is graphic enough that
I don't need to explain the phrase for non-native English speakers].

I say that because, although I'm not very up on the mathematics of regular
expressions, I did write the basics of the current zsh implementation of glob negations.
(Before that, there was an even less efficient implementation that created a
structure for each part of the pattern, which wasn't good for memory management
--- this is going back to the 1990s, I think.)  There are some pretty pathological
details to make sure this works in every case, so I'd really want a real expert in
the subject area to think about this before it got much further.

pws


* Re: Extending regexes
  2022-07-04 13:47 ` Peter Stephenson
@ 2022-07-04 19:15   ` Bart Schaefer
  2022-07-04 19:41     ` Peter Stephenson
  2022-07-06 18:40   ` stephane
  1 sibling, 1 reply; 8+ messages in thread
From: Bart Schaefer @ 2022-07-04 19:15 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On Mon, Jul 4, 2022 at 6:53 AM Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
> > On 04 July 2022 at 13:03 Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
> > Zsh has extensions to regular regexes - the ~ and ^ negations.
>
> You're quite right both that they're very useful in zsh and there's nothing
> like this in normal regular expressions, but unfortunately I've got a strong
> feeling this is a big can of worms [hope that image is graphic enough that
> I don't need to explain the phrase for non-native English speakers].

In particular, these no longer fit the formal definition of "regular".

PWS correct me if I go too far astray, but (^Y) is internally (*~Y)
and (X~Y) is implemented by first matching (X) and then removing
anything that matches (Y) ... which is where the regular-ness goes
astray.  My formal training on this is more than a little rusty, but I
believe this means chaining together two finite-state machines rather
than building a single one.

On Mon, Jul 4, 2022 at 5:06 AM Sebastian Gniazdowski
<sgniazdowski@gmail.com> wrote:
>
> I think that regexes look pretty limited from this point of view and that pcre extensions went wrong path with the look forward and behind semantics.

Note that of course "pcre" stands for "perl-compatible RE" so you can
find the justifications for look-{ahead,behind} in the history of perl
development.  Again, a long time ago, but my recollection is that the
reason "lookaround assertions" are zero-width elements is to preserve
the finite-state semantics.  Please take that with 30 years worth of
salt grains (a less self-explanatory idiom than Peter's, I fear).


* Re: Extending regexes
  2022-07-04 19:15   ` Bart Schaefer
@ 2022-07-04 19:41     ` Peter Stephenson
  2022-07-06 10:03       ` Daniel Shahaf
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Stephenson @ 2022-07-04 19:41 UTC (permalink / raw)
  To: Zsh hackers list


> On 04 July 2022 at 20:15 Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Mon, Jul 4, 2022 at 6:53 AM Peter Stephenson
> <p.w.stephenson@ntlworld.com> wrote:>
> > > On 04 July 2022 at 13:03 Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
> > > Zsh has extensions to regular regexes - the ~ and ^ negations.
>
> PWS correct me if I go too far astray, but (^Y) is internally (*~Y)
> and (X~Y) is implemented by first matching (X) and then removing
> anything that matches (Y) ... which is where the regular-ness goes
> astray.  My formal training on this is more than a little rusty, but I
> believe this means chaining together two finite-state machines rather
> than building a single one.

That is basically how they're implemented. We have a sort of internal
scratchpad that allows us to backtrack over the exclusions as a nested
state of the main pattern match. You're entitled to say 'ick' at this
point.

pws


* Re: Extending regexes
  2022-07-04 19:41     ` Peter Stephenson
@ 2022-07-06 10:03       ` Daniel Shahaf
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Shahaf @ 2022-07-06 10:03 UTC (permalink / raw)
  To: zsh-workers

Peter Stephenson wrote on Mon, 04 Jul 2022 19:41 +00:00:
>> On 04 July 2022 at 20:15 Bart Schaefer <schaefer@brasslantern.com> wrote:
>> On Mon, Jul 4, 2022 at 6:53 AM Peter Stephenson
>> <p.w.stephenson@ntlworld.com> wrote:>
>> > > On 04 July 2022 at 13:03 Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
>> > > Zsh has extensions to regular regexes - the ~ and ^ negations.
>>
>> PWS correct me if I go too far astray, but (^Y) is internally (*~Y)
>> and (X~Y) is implemented by first matching (X) and then removing
>> anything that matches (Y) ... which is where the regular-ness goes
>> astray.  My formal training on this is more than a little rusty, but I
>> believe this means chaining together two finite-state machines rather
>> than building a single one.

"X and not Y" isn't chaining; it's a Cartesian product.  Essentially one
walks both the X machine and the "not Y" machine simultaneously and
accepts iff both of them accept.

Chaining machines would create a non-deterministic machine that matches
the concatenation of the input machines' languages.
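
To make the product construction concrete, here is a minimal bash sketch.
It is not how zsh implements exclusions. The two hand-written machines
(step_x, step_y) and the example languages, "contains A" for X and
"contains B" for Y, are assumptions chosen only for illustration. Both
machines are stepped in lockstep over the input, and the pair of states
accepts iff X accepts and Y does not.

```bash
# bash sketch: hand-written DFAs, not how zsh compiles patterns.

# DFA for X = "contains A": state 0 = no A seen yet, 1 = A seen (accepting).
step_x() {  # usage: step_x STATE CHAR -> prints the next state
  if [[ $1 == 1 || $2 == A ]]; then echo 1; else echo 0; fi
}

# DFA for Y = "contains B": state 0 = no B seen yet, 1 = B seen (accepting).
step_y() {
  if [[ $1 == 1 || $2 == B ]]; then echo 1; else echo 0; fi
}

# Product machine for "X and not Y": its state is the pair (sx, sy).
# Both components advance on every input character; the pair accepts
# iff X ends in an accepting state and Y does not.
match_x_and_not_y() {
  local input=$1 sx=0 sy=0 i ch
  for (( i = 0; i < ${#input}; i++ )); do
    ch=${input:i:1}
    sx=$(step_x "$sx" "$ch")
    sy=$(step_y "$sy" "$ch")
  done
  [[ $sx == 1 && $sy == 0 ]]
}

match_x_and_not_y "xAy" && echo "xAy: match"       # has A, no B
match_x_and_not_y "xAB" || echo "xAB: no match"    # the B excludes it
```

The input is read once and neither machine ever backs up, which is why
the result is still a single finite-state machine (with up to
|X| * |Y| states) rather than two machines run one after the other.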

Cheers,

Daniel
(backlogged, so, replying out of order)

> That is basically how they're implemented. We have a sort of internal
> scratchpad that allows us to backtrack over the exclusions as a nested
> state of the main pattern match. You're entitled to say 'ick' at this
> point.
>
> pws


* Re: Extending regexes
  2022-07-04 13:47 ` Peter Stephenson
  2022-07-04 19:15   ` Bart Schaefer
@ 2022-07-06 18:40   ` stephane
  1 sibling, 0 replies; 8+ messages in thread
From: stephane @ 2022-07-06 18:40 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

FWIW, ast's extended and augmented regexps, as supported by ksh93's
[[ =~ ]] operator or more generally in globs after ~(E) (extended) or
~(A) (augmented), do have AND and NOT operators.

That's \& and \! in ERE and & and ! in ARE.

ere() ksh -c '[[ $1 =~ $2 ]]' ksh "$@"
are() ksh -c '[[ $1 =~ (?A)$2 ]]' ksh "$@"

And then you can do

ere x '^([[:lower:]]\&.)$'
ere y '^x\!$'

are x '^([[:lower:]]&.)$'
are y '^x!$'

Also note that AND(A,B) can be done with NOT(OR(NOT(A), NOT(B))) so even 
ksh88 or bash can do AND in their globs (with extglob in bash) with 
!(!(A)|!(B))
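
As a quick check of that De Morgan form, here is a small bash sketch.
The helper name has_all and the sample patterns are made up for
illustration. The pattern is kept in a variable and expanded unquoted
inside [[ ]], so bash matches it as an extglob pattern rather than as a
literal string.

```bash
# bash sketch; has_all is a hypothetical helper, not part of any shell.
shopt -s extglob   # enable extended pattern operators such as !(...)

# AND(A,B,C) written as NOT(OR(NOT(A), NOT(B), NOT(C))),
# with A = *A*, B = *B*, C = *C* ("contains A", "contains B", ...).
and_pat='!(!(*A*)|!(*B*)|!(*C*))'

has_all() {
  # Unquoted right-hand side: matched as a pattern against the whole string.
  [[ $1 == $and_pat ]]
}

for s in ABC CxBxA AB ""; do
  if has_all "$s"; then
    echo "'$s': A, B and C found"
  else
    echo "'$s': something is missing"
  fi
done
```

Only the strings containing all three letters should be reported as
found. Order does not matter, which is the same behaviour as the zsh
*A*~^*B*~^*C* pattern from the start of the thread.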

-- 
Stephane



* Re: Extending regexes
  2022-07-04 12:03 Extending regexes Sebastian Gniazdowski
  2022-07-04 13:47 ` Peter Stephenson
@ 2022-07-06 23:07 ` Phil Pennock
  2022-07-07  0:22   ` Bart Schaefer
  1 sibling, 1 reply; 8+ messages in thread
From: Phil Pennock @ 2022-07-06 23:07 UTC (permalink / raw)
  To: zsh-workers

On 2022-07-04 at 14:03 +0200, Sebastian Gniazdowski wrote:
> Zsh has extensions to regular regexes - the ~ and ^ negations. They, as it
> can be expected from negations that are required by Turing universal
> machines, introduce a whole new universe of computations over standard
> regular expressions. For example matching in an AND fashion:

For clarity: zsh has long had the module zsh/pcre, providing
-pcre-match; when the =~ regexp matching operator was added, we
deliberately chose to add a module zsh/regex to use the system ERE
libraries with -regex-match and made that the default implementation
behind the =~ operator.

If you're getting PCRE semantics, then probably somewhere in your
startup files you have something like `setopt re_match_pcre`.

A while back I wrote some bindings for using the RE2 library, which
matches the efficient regexps found in Go and which is licensed such
that more vendors might enable it by default with zsh.  I stopped as I
tried to puzzle through how to dig myself out of my own hole, in having
made `RE_MATCH_PCRE` be a simple boolean.

My _tentative_ thinking, which I'd appreciate feedback on, is to
introduce a new special parameter, `ZSH_EQTILDE_ENGINE` or somesuch;
have assignment to it succeed only for a parseable value, and make
changes to the RE_MATCH_PCRE option be implicit assignments of `regex`
or `pcre` to that parameter.

Is this sane?  Are we happy introducing new special parameters, as long
as the name starts `zsh`?  Should the semantics just be "name of a
module" or a static list?  If "name of a module" then that would let
people do more than just use our engines (at their own risk), but should
we then update the .mdd files or the exported tables with some new
identifier to mark "use this function to back =~ when the engine points
here"?

I would quite like to move towards being able to expect "better, but
sane" REs to be available, even with commercial OS vendor builds of zsh.
I think RE2 is probably the best way forward, but ... I should probably
have asked long ago for advice on the design decisions which need to be
made.

-Phil


* Re: Extending regexes
  2022-07-06 23:07 ` Phil Pennock
@ 2022-07-07  0:22   ` Bart Schaefer
  0 siblings, 0 replies; 8+ messages in thread
From: Bart Schaefer @ 2022-07-07  0:22 UTC (permalink / raw)
  To: Zsh hackers list

On Wed, Jul 6, 2022 at 4:13 PM Phil Pennock
<zsh-workers+phil.pennock@spodhuis.org> wrote:
>
> A while back I wrote some bindings for using the RE2 library [...]
>
> My _tentative_ thinking, which I'd appreciate feedback on, is to
> introduce a new special parameter, `ZSH_EQTILDE_ENGINE` or somesuch;
> [...]
>
> Is this sane?  Are we happy introducing new special parameters, as long
> as the name starts `zsh`?  [...]

My intuition about this suggests that an interface somewhere between
"enable -p $patchar" and "ztie -d $dbtype" would be more appropriate
here.  Something like

zregex zsh/re/$flavor

where the named module must implement "-$flavor-match" as a
conditional.  For backwards-compatibility, zsh/pcre would load
zsh/re/pcre and zsh/regex would load zsh/re/regex.

An option "zregex -x" (choose your x) replaces -regex-match with
-$flavor-match (a no-op for zsh/re/regex) in the implementation of
"=~".

Leave RE_MATCH_PCRE as-is (it replaces -$flavor-match with -pcre-match
when set) but document it as deprecated and perhaps print a warning if
that option is set when zregex -x is called.

