Hi,

Zsh has extensions to regular regexes - the ~ and ^ negations. They, as
can be expected from negations that are required by Turing universal
machines, introduce a whole new universe of computations over standard
regular expressions. For example, matching in an AND fashion:

    setopt extendedglob   # ~ and ^ are only special with EXTENDED_GLOB
    if [[ ABC == *A*~^*B*~^*C* ]]; then
        print A, B and C found
    fi

I think that regexes look pretty limited from this point of view and
that the pcre extensions took a wrong path with the look forward and
behind semantics. The typical, common attempts at using the regex [^]
negation like [^(string)] are simply there in zsh patterns as ^string.

I've recently used the ~ negation in a project to reject a set of known
tokens from matching at a given position, with great success, to match a
loose `for` syntax in a zinit-annex-pull extension to zinit that greps
and extracts zinit commands from any web page. I cannot see it being
possible without the extra negation.

Therefore I thought that it's weird that such a useful feature is
missing from the commonly used regex syntax. So maybe an attempt at
updating it makes sense? Could someone experienced with them, like
Oliver, prepare some white papers to accomplish this? It would be a
great event to extend the old regexes with such a great feature: not
one, but TWO new negations.
> On 04 July 2022 at 13:03 Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
> Hi,
> Zsh has extensions to regular regexes - the ~ and ^ negations.
>
> Therefore I thought that it's weird that such a useful feature is missing
> from the commonly used regex syntax. So maybe an attempt at updating it
> makes sense?
You're quite right both that they're very useful in zsh and there's nothing
like this in normal regular expressions, but unfortunately I've got a strong
feeling this is a big can of worms [hope that image is graphic enough that
I don't need to explain the phrase for non-native English speakers].
I say that because, although I'm not very up on the mathematics of regular
expressions, I did write the basics of the current zsh implementation of glob
negations.
(Before that, there was an even less efficient implementation that created a
structure for each part of the pattern, which wasn't good for memory management
--- this is going back to the 1990s, I think.) There are some pretty pathological
details to make sure this works in every case, so I'd really want a real expert in
the subject area to think about this before it got much further.
pws
On Mon, Jul 4, 2022 at 6:53 AM Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
>
> > On 04 July 2022 at 13:03 Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
> > Zsh has extensions to regular regexes - the ~ and ^ negations.
>
> You're quite right both that they're very useful in zsh and there's nothing
> like this in normal regular expressions, but unfortunately I've got a strong
> feeling this is a big can of worms [hope that image is graphic enough that
> I don't need to explain the phrase for non-native English speakers].

In particular, these no longer fit the formal definition of "regular".

PWS correct me if I go too far astray, but (^Y) is internally (*~Y)
and (X~Y) is implemented by first matching (X) and then removing
anything that matches (Y) ... which is where the regular-ness goes
astray. My formal training on this is more than a little rusty, but I
believe this means chaining together two finite-state machines rather
than building a single one.

On Mon, Jul 4, 2022 at 5:06 AM Sebastian Gniazdowski
<sgniazdowski@gmail.com> wrote:
>
> I think that regexes look pretty limited from this point of view and
> that the pcre extensions took a wrong path with the look forward and
> behind semantics.

Note that of course "pcre" stands for "perl-compatible RE" so you can
find the justifications for look-{ahead,behind} in the history of perl
development. Again, a long time ago, but my recollection is that the
reason "lookaround assertions" are zero-width elements is to preserve
the finite-state semantics. Please take that with 30 years worth of
salt grains (a less self-explanatory idiom than Peter's, I fear).
> On 04 July 2022 at 20:15 Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Mon, Jul 4, 2022 at 6:53 AM Peter Stephenson
> <p.w.stephenson@ntlworld.com> wrote:
> > > On 04 July 2022 at 13:03 Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
> > > Zsh has extensions to regular regexes - the ~ and ^ negations.
>
> PWS correct me if I go too far astray, but (^Y) is internally (*~Y)
> and (X~Y) is implemented by first matching (X) and then removing
> anything that matches (Y) ... which is where the regular-ness goes
> astray. My formal training on this is more than a little rusty, but I
> believe this means chaining together two finite-state machines rather
> than building a single one.
That is basically how they're implemented. We have a sort of internal
scratchpad that allows us to backtrack over the exclusions as a nested
state of the main pattern match. You're entitled to say 'ick' at this
point.
pws
Peter Stephenson wrote on Mon, 04 Jul 2022 19:41 +00:00:
>> On 04 July 2022 at 20:15 Bart Schaefer <schaefer@brasslantern.com> wrote:
>> On Mon, Jul 4, 2022 at 6:53 AM Peter Stephenson
>> <p.w.stephenson@ntlworld.com> wrote:
>> > > On 04 July 2022 at 13:03 Sebastian Gniazdowski <sgniazdowski@gmail.com> wrote:
>> > > Zsh has extensions to regular regexes - the ~ and ^ negations.
>>
>> PWS correct me if I go too far astray, but (^Y) is internally (*~Y)
>> and (X~Y) is implemented by first matching (X) and then removing
>> anything that matches (Y) ... which is where the regular-ness goes
>> astray. My formal training on this is more than a little rusty, but I
>> believe this means chaining together two finite-state machines rather
>> than building a single one.

"X and not Y" isn't chaining; it's a Cartesian product. Essentially one
walks both the X machine and the "not Y" machine simultaneously and
accepts iff both of them accept.

Chaining machines would create a non-deterministic machine that matches
the concatenation of the input machines' languages.

Cheers,

Daniel
(backlogged, so, replying out of order)

> That is basically how they're implemented. We have a sort of internal
> scratchpad that allows us to backtrack over the exclusions as a nested
> state of the main pattern match. You're entitled to say 'ick' at this
> point.
>
> pws
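To illustrate the "walk both machines simultaneously" idea from Daniel's
reply, here is a minimal sketch in POSIX shell. The two toy machines and all
names in it (step_x, step_y, x_and_not_y) are invented for this illustration;
it only shows the product construction and is not how zsh implements X~Y.

```shell
# Toy DFA X: accepts strings that contain an "a".
# States: 0 = no "a" seen yet, 1 = "a" seen (accepting).
# Usage: step_x STATE CHAR -> prints the next state.
step_x() {
    case "$1$2" in
        0a|1*) echo 1 ;;
        *)     echo 0 ;;
    esac
}

# Toy DFA Y: accepts strings that contain a "b" (same state layout).
step_y() {
    case "$1$2" in
        0b|1*) echo 1 ;;
        *)     echo 0 ;;
    esac
}

# Product machine for "X and not Y": its state is the pair (x, y).
# Both components advance on every input character; the pair is
# accepting iff X is in an accepting state and Y is not.
# The body runs in a subshell so the variables stay local.
x_and_not_y() (
    rest=$1 x=0 y=0
    while [ -n "$rest" ]; do
        c=${rest%"${rest#?}"}    # first character of what is left
        rest=${rest#?}           # drop that character
        x=$(step_x "$x" "$c")
        y=$(step_y "$y" "$c")
    done
    [ "$x" = 1 ] && [ "$y" = 0 ]
)
```

With these toy machines, `x_and_not_y cat` succeeds (an "a" but no "b"),
while `x_and_not_y cab` and `x_and_not_y dog` fail. The input is read once
and no backtracking is needed, which is the point of the product view.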
FWIW, ast's extended and augmented regexps as supported by ksh93's
[[ =~ ]] operator or more generally in globs after ~(E) (extended)
~(A) (augmented) do have AND and NOT operators.

That's \& and \! in ERE and & and ! in ARE.

    ere() ksh -c '[[ $1 =~ $2 ]]' ksh "$@"
    are() ksh -c '[[ $1 =~ (?A)$2 ]]' ksh "$@"

And then you can do

    ere x '^([[:lower:]]\&.)$'
    ere y '^x\!$'
    are x '^([[:lower:]]&.)$'
    are y '^x!$'

Also note that AND(A,B) can be done with NOT(OR(NOT(A), NOT(B))) so even
ksh88 or bash can do AND in their globs (with extglob in bash) with

    !(!(A)|!(B))

-- 
Stephane
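The De Morgan rewrite above can be tried directly in bash. This is a minimal
sketch, assuming bash with extglob available; the function name has_a_and_b
is made up for the example, and the concrete patterns *A* and *B* stand in
for the A and B of the formula.

```shell
#!/usr/bin/env bash
shopt -s extglob    # enables !(...) in bash; ksh has it without an option

# AND(*A*, *B*) written as NOT(OR(NOT(*A*), NOT(*B*))):
#   !(*A*)  matches strings with no "A"
#   !(*B*)  matches strings with no "B"
# so the outer !(...) matches only strings that contain both.
has_a_and_b() {
    [[ $1 == !(!(*A*)|!(*B*)) ]]
}

has_a_and_b ABC && echo "ABC: A and B found"
has_a_and_b AXC || echo "AXC: no match"
```

This mirrors the zsh pattern *A*~^*B* from the start of the thread, using
only the negation operator that ksh and bash globs provide.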
On 2022-07-04 at 14:03 +0200, Sebastian Gniazdowski wrote:
> Zsh has extensions to regular regexes - the ~ and ^ negations. They, as
> can be expected from negations that are required by Turing universal
> machines, introduce a whole new universe of computations over standard
> regular expressions. For example, matching in an AND fashion:
For clarity: zsh has long had the module zsh/pcre, providing
-pcre-match; when the =~ regexp matching operator was added, we
deliberately chose to add a module zsh/regex to use the system ERE
libraries with -regex-match and made that the default implementation
behind the =~ operator.
If you're getting PCRE semantics, then probably somewhere in your
startup files you have something like `setopt re_match_pcre`.
A while back I wrote some bindings for using the RE2 library, which
matches the efficient regexps found in Go and which is licensed such
that more vendors might enable it by default with zsh. I stopped as I
tried to puzzle through how to dig myself out of my own hole, in having
made `RE_MATCH_PCRE` be a simple boolean.
My _tentative_ thinking, which I'd appreciate feedback on, is to
introduce a new special parameter, `ZSH_EQTILDE_ENGINE` or somesuch;
have assignment to it only succeed when given a parseable value, and make
changes to the RE_MATCH_PCRE option be implicit assignments of `regex` or
`pcre` to that parameter.
Is this sane? Are we happy introducing new special parameters, as long
as the name starts `zsh`? Should the semantics just be "name of a
module" or a static list? If "name of a module" then that would let
people do more than just use our engines (at their own risk), but should
we then update the .mdd files or the exported tables with some new
identifier to mark "use this function to back =~ when the engine points
here"?
I would quite like to move towards being able to expect "better, but
sane" REs to be available, even with commercial OS vendor builds of zsh.
I think RE2 is probably the best way forward, but ... I should probably
have asked long ago for advice on the design decisions which need to be
made.
-Phil
On Wed, Jul 6, 2022 at 4:13 PM Phil Pennock
<zsh-workers+phil.pennock@spodhuis.org> wrote:
>
> A while back I wrote some bindings for using the RE2 library [...]
>
> My _tentative_ thinking, which I'd appreciate feedback on, is to
> introduce a new special parameter, `ZSH_EQTILDE_ENGINE` or somesuch;
> [...]
>
> Is this sane? Are we happy introducing new special parameters, as long
> as the name starts `zsh`? [...]
My intuition about this suggests that an interface somewhere between
"enable -p $patchar" and "ztie -d $dbtype" would be more appropriate
here. Something like
zregex zsh/re/$flavor
where the named module must implement "-$flavor-match" as a
conditional. For backwards-compatibility, zsh/pcre would load
zsh/re/pcre and zsh/regex would load zsh/re/regex.
An option "zregex -x" (choose your x) replaces -regex-match with
-$flavor-match (a no-op for zsh/re/regex) in the implementation of
"=~".
Leave RE_MATCH_PCRE as-is (it replaces -$flavor-match with -pcre-match
when set) but document it as deprecated and perhaps print a warning if
that option is set when zregex -x is called.