zsh-workers
 help / color / mirror / code / Atom feed
* Pattern bug on (a*|)~^(*b)
@ 2023-07-25 13:19 Johan Grande
  2023-07-25 18:35 ` Bart Schaefer
  2023-07-31 11:36 ` Peter Stephenson
  0 siblings, 2 replies; 13+ messages in thread
From: Johan Grande @ 2023-07-25 13:19 UTC (permalink / raw)
  To: zsh-workers

In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob,

[[ "ab" = (|a*)~^(*b) ]]

incorrectly (unless I'm mistaken) returns 1. However

[[ "ab" = (a*|)~^(*b) ]]

correctly returns 0.

-- 
Johan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-07-25 13:19 Pattern bug on (a*|)~^(*b) Johan Grande
@ 2023-07-25 18:35 ` Bart Schaefer
  2023-07-25 18:47   ` Johan Grande
  2023-07-31 11:36 ` Peter Stephenson
  1 sibling, 1 reply; 13+ messages in thread
From: Bart Schaefer @ 2023-07-25 18:35 UTC (permalink / raw)
  To: Johan Grande; +Cc: zsh-workers

On Tue, Jul 25, 2023 at 6:19 AM Johan Grande <nahoj@crans.org> wrote:
>
> In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob,
>
> [[ "ab" = (|a*)~^(*b) ]]
>
> incorrectly (unless I'm mistaken) returns 1.

This is not precisely a bug, it's expected behavior.  Alternation with
(x|y) tries the patterns in left-to-right order and has lower
precedence than p~q (despite the parens).  The upshot is that
(|a*)~^(*b) matches like (|)~^(*b) whereas (a*|)~^(*b) is like
(a*)~^(*b) and the final bit of the puzzle is that the double negative
in  ~^(b*) cannot match the empty result from (|).

The same thing happens with globbing, touch a file named "ab" and then
try those patterns for filename generation.  PWS may be able to
explain some more about the precedence of (x|y) vs. (p~q).

To resolve this, you have to put the p~q inside the x|y, like [[ "ab"
= (|a*~^(*b)) ]]


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-07-25 18:35 ` Bart Schaefer
@ 2023-07-25 18:47   ` Johan Grande
  2023-07-28  1:02     ` Bart Schaefer
  0 siblings, 1 reply; 13+ messages in thread
From: Johan Grande @ 2023-07-25 18:47 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers

Le 25/07/2023 à 20:35, Bart Schaefer a écrit :
> On Tue, Jul 25, 2023 at 6:19 AM Johan Grande <nahoj@crans.org> wrote:
>>
>> In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob,
>>
>> [[ "ab" = (|a*)~^(*b) ]]
>>
>> incorrectly (unless I'm mistaken) returns 1.
> 
> This is not precisely a bug, it's expected behavior.  Alternation with
> (x|y) tries the patterns in left-to-right order and has lower
> precedence than p~q (despite the parens).

I don't get it. What are the different patterns tried during this 
left-to-right evaluation of (|a*)~^(*b) ?

-- 
Johan



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-07-25 18:47   ` Johan Grande
@ 2023-07-28  1:02     ` Bart Schaefer
  2023-07-28  6:41       ` Stephane Chazelas
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Schaefer @ 2023-07-28  1:02 UTC (permalink / raw)
  To: Johan Grande; +Cc: zsh-workers

On Tue, Jul 25, 2023 at 11:47 AM Johan Grande <nahoj@crans.org> wrote:
>
> I don't get it. What are the different patterns tried during this
> left-to-right evaluation of (|a*)~^(*b) ?

I'm going to invoke my "PWS may be able to explain some more" remark,
but basically once you cross the "~" it's no longer possible to
backtrack again; so it tries to match the empty alternative, gets an
empty result set, excludes ^(*b) from that result set leaving a
still-empty set, can't go back, and fails because the final result is
empty.  I don't recall all the details but this has something to do
with X~Y implemented as "exclude Y from the result of X".


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-07-28  1:02     ` Bart Schaefer
@ 2023-07-28  6:41       ` Stephane Chazelas
  2023-07-29  1:35         ` Bart Schaefer
  0 siblings, 1 reply; 13+ messages in thread
From: Stephane Chazelas @ 2023-07-28  6:41 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Johan Grande, zsh-workers

2023-07-27 18:02:30 -0700, Bart Schaefer:
> On Tue, Jul 25, 2023 at 11:47 AM Johan Grande <nahoj@crans.org> wrote:
> >
> > I don't get it. What are the different patterns tried during this
> > left-to-right evaluation of (|a*)~^(*b) ?
> 
> I'm going to invoke my "PWS may be able to explain some more" remark,
> but basically once you cross the "~" it's no longer possible to
> backtrack again; so it tries to match the empty alternative, gets an
> empty result set, excludes ^(*b) from that result set leaving a
> still-empty set, can't go back, and fails because the final result is
> empty.  I don't recall all the details but this has something to do
> with X~Y implemented as "exclude Y from the result of X".

I have to say I'm with the OP and don't understand that
explanation either.

It also doesn't seem to tie in with the fact that in globbing
the part after ~ is seemingly done *after* globbing so with
very low precedence, as in ./**/*.txt~*/src/* finding all the
txt files including in src dirs, but the */src/* filtered out
afterwards.

Even if that's "how it works", I struggle to see how it can be
seen as anything but a bug. It's not working as documented, and
I can't really see how we would document it.

See also:

$ [[ ab = a(|a*)~^*b ]]; echo $?
1
$ [[ ab = a(|*b)~^*b ]]; echo $?
1
$ [[ ab = a(*z|*b)~^*b ]]; echo $?
0

Which doesn't seem to tie in with the explanation.

-- 
Stephane


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-07-28  6:41       ` Stephane Chazelas
@ 2023-07-29  1:35         ` Bart Schaefer
  2023-08-01 13:19           ` Johan Grande
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Schaefer @ 2023-07-29  1:35 UTC (permalink / raw)
  To: Bart Schaefer, Johan Grande, zsh-workers

On Thu, Jul 27, 2023 at 11:41 PM Stephane Chazelas
<stephane@chazelas.org> wrote:
>
> I have to say I'm with the OP and don't understand that
> explanation either.

In the absence of a direct response from PWS, I'll just point you to
his comments in pattern.c, some of which date from before we had a git
repository:

 * The strategy is to test the asserted pattern,
 * recording via P_EXCSYNC how far the part to
 * be excluded matched.  We then set the
 * length of the test string to that
 * point and see if the exclusion as far as
 * P_EXCEND also matches that string.
 * We need to keep testing the asserted pattern
 * by backtracking, since the first attempt
 * may be excluded while a later attempt may not.
 * For this we keep a pointer just after
 * the P_EXCLUDE which is tested by the P_EXCSYNC
 * to see if we matched there last time, in which
 * case we fail.  If there is nothing to backtrack
 * over, that doesn't matter:  we should fail anyway.
 * The pointer also tells us where the asserted
 * pattern matched for use by the exclusion.

 * P.S. in case you were wondering, this code
 * is horrible.

 * This is where we make sure that we are not
 * repeatedly matching zero-length strings in
 * a closure, which would cause an infinite loop,
 * and also remove exponential behaviour in
 * backtracking nested closures.

 * If we come round to the same branch again, and
 * there is already a 1, then the test fails.

By my reading, zero-length matches may be short-circuited to avoid
pathological behavior.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-07-25 13:19 Pattern bug on (a*|)~^(*b) Johan Grande
  2023-07-25 18:35 ` Bart Schaefer
@ 2023-07-31 11:36 ` Peter Stephenson
  2023-07-31 15:21   ` Peter Stephenson
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Stephenson @ 2023-07-31 11:36 UTC (permalink / raw)
  To: Johan Grande, zsh-workers

> On 25/07/2023 14:19 Johan Grande <nahoj@crans.org> wrote:
> In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob,
> 
> [[ "ab" = (|a*)~^(*b) ]]
> 
> incorrectly (unless I'm mistaken) returns 1. However
> 
> [[ "ab" = (a*|)~^(*b) ]]
> 
> correctly returns 0.

You can see this is a real bug with

[[ ab = (a|a*)~^*b ]]

which also fails, while

[[ ab = (b*|a*)~^*b ]]

succeeds, so this is clearly inconsistent.

I think I've tracked down the problem.  Because of the unpleasant
hierarchy-violating properties of exclusion pointed out by Bart, we record the
point at which the match succeeded.  The point we do this isn't where the
entire pattern succeeds, however, only the point where the part we're going to
exclude has matched.  So at this point we've successfully matched "a" in the
case I showed, or the empty string in the original case.  But we haven't yet
taken account of the anchor at the end, which will later cause us to fail.

When we do fail and backtrack, the marker is still there saying we matched just
the "a" or the empty string.  We then successfully match a*, but when we get
to the point of trying to exclude ^*b the record saying what we're excluding
against is messed up --- ^*b DOES match either a or the empty string, so
the exclusion succeeds.

The fixes I can think of are either to find a point at which to undo the sync
record for the match that has failed after the event or, perhaps better as it's
a local change even though it does more work, reset any previous matches before
the sync point on the next alternative when we mark that as having matched
(we will only do that if the previous branch failed --- remember [irony
intended] that a branch succeeds only if all following patterns match too).

This is a bit fraught as some of the markers (not necessarily these ones, it's
almost a quarter of a century since the first version of this code) are there
to take account of some pathological cases so removing them might have hard
to discover effects.  This will take another load of wet towels and caffeine.

pws


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-07-31 11:36 ` Peter Stephenson
@ 2023-07-31 15:21   ` Peter Stephenson
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Stephenson @ 2023-07-31 15:21 UTC (permalink / raw)
  To: Johan Grande, zsh-workers

> On 31/07/2023 12:36 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> > On 25/07/2023 14:19 Johan Grande <nahoj@crans.org> wrote:
> > In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob,
> > 
> > [[ "ab" = (|a*)~^(*b) ]]
> > 
> > incorrectly (unless I'm mistaken) returns 1.
>
> The fixes I can think of are either to find a point at which to undo the sync
> record for the match that has failed after the event or, perhaps better as it's
> a local change even though it does more work, reset any previous matches before
> the sync point on the next alternative when we mark that as having matched
> (we will only do that if the previous branch failed --- remember [irony
> intended] that a branch succeeds only if all following patterns match too).

I think the latter is good enough.  There's no sign of a slowdown with this
fix in D02glob.ztst, so any resulting problems will have to be complicated
things with nested branches and probably zero-length matches --- not all
of which are well optimised anyway.  I could be wrong, but I think we'll
have to take it on the chin.

Anyway, I couldn't think of any easy way around that, despite some investigation,
so as this does fix the immediate problem, we can go with it.  It's the sort
of self-contained fix I like.

pws

diff --git a/Src/pattern.c b/Src/pattern.c
index 3edda1772..2a1a514fb 100644
--- a/Src/pattern.c
+++ b/Src/pattern.c
@@ -2987,14 +2987,15 @@ patmatch(Upat prog)
 	case P_EXCSYNC:
 	    /* See the P_EXCLUDE code below for where syncptr comes from */
 	    {
-		unsigned char *syncptr;
+		unsigned char *syncstart, *syncptr, *ptr;
 		Upat after;
 		after = P_OPERAND(scan);
 		DPUTS(!P_ISEXCLUDE(after),
 		      "BUG: EXCSYNC not followed by EXCLUDE.");
 		DPUTS(!P_OPERAND(after)->p,
 		      "BUG: EXCSYNC not handled by EXCLUDE");
-		syncptr = P_OPERAND(after)->p + (patinput - patinstart);
+		syncstart = P_OPERAND(after)->p;
+		syncptr = syncstart + (patinput - patinstart);
 		/*
 		 * If we already matched from here, this time we fail.
 		 * See WBRANCH code for story about error count.
@@ -3009,6 +3010,23 @@ patmatch(Upat prog)
 		 * failed anyway.
 		 */
 		*syncptr = errsfound + 1;
+		/*
+		 * Because of backtracking, any match before this point
+		 * can't apply to the current branch we're on so is now
+		 * a failure --- this can happen if, on a previous
+		 * branch, we initially marked a success before failing
+		 * on a later part of the pattern after marking up the
+		 * P_EXCSYNC (even an end anchor will have this effect).
+		 * To make sure we record the current match point
+		 * correctly, mark those down now.
+		 *
+		 * This might have side effects on the efficiency of
+		 * pathological cases involving nested branches.  To
+		 * fix that we'd probably need to record matches on
+		 * different branches separately.
+		 */
+		for (ptr = syncstart; ptr < syncptr; ++ptr)
+		    *ptr = 0;
 	    }
 	    break;
 	case P_EXCEND:
diff --git a/Test/D02glob.ztst b/Test/D02glob.ztst
index 850a535e5..4d88e5c27 100644
--- a/Test/D02glob.ztst
+++ b/Test/D02glob.ztst
@@ -817,6 +817,32 @@
 *>*/glob.tmp/(flip|flop)
 *>*/glob.tmp/(flip|flop)/trailing/components
 
+# The following set test an obscure problem with branches followed by
+# exclusions that shows up when the exclusion matches against
+# something other than the complete test string, hence the complicated
+# double negative.
+  [[ ab = (|a*)~^(*b) ]]
+0:Regression test for exclusion after branches: empty first alternative
+
+  [[ ab = (b|a*)~^(*b) ]]
+0:Regression test for exclusion after branches: non-empty first alternative
+
+  [[ ab = (b*|a*)~^(*b) ]]
+0:Regression test for exclusion after branches: full length first alternative
+
+# Corresponding tests where the exclusion should succeed, so the
+# match fails.  It's hard to know how to provoke bugs here...
+  [[ abc = (|a*)~^(*b) ]]
+1:Regression test for exclusion after branches: failure case 1
+
+  [[ abc = (b|a*)~^(*b) ]]
+1:Regression test for exclusion after branches: failure case 2
+
+  [[ abc = (b*|a*)~^(*b) ]]
+1:Regression test for exclusion after branches: failure case 3
+
+# Careful: extendedglob off from this point.
+
   unsetopt extendedglob
   print -r -- ${(*)=${(@s.+.):-A+B}/(#b)(?)/-${(L)match[1]} ${match[1]}}
 0:the '*' qualfier enables extended_glob for pattern matching


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-07-29  1:35         ` Bart Schaefer
@ 2023-08-01 13:19           ` Johan Grande
  2023-08-01 13:30             ` Peter Stephenson
  0 siblings, 1 reply; 13+ messages in thread
From: Johan Grande @ 2023-08-01 13:19 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers

Le 29/07/2023 à 03:35, Bart Schaefer a écrit :
> On Thu, Jul 27, 2023 at 11:41 PM Stephane Chazelas 
> <stephane@chazelas.org> wrote:
>> 
>> I have to say I'm with the OP and don't understand that explanation
>> either.
> 
> In the absence of a direct response from PWS, I'll just point you to 
> his comments in pattern.c, some of which date from before we had a
> git repository: [...]
> 
> By my reading, zero-length matches may be short-circuited to avoid 
> pathological behavior.

Thank you for your answers, though the latter is rather obscure to me.

If you can answer, or if PWS reads this:

What globbing should do is one thing, but in the meantime I'm interested
to know what I can safely put on the left-hand side of ~ with the
current implementation. Does this behavior concern any case of
backtracking, or only (...|...) patterns, or, which would be even
better, only (|...) and (...|) patterns?

In my project, I use globbing to find files with tags matching patterns
provided by the user, because I noticed that it's much faster than
anything else I tried, and I use ()~^() and (|) as AND and OR operators.

As an example, here is the pattern generated for files with tags 
matching pat1:

*[[](((* |)pat1( *|)))[]]*

which I could easily replace with:

*[[](((* )#pat1( *)#))[]]*

Then this would be the pattern for files with pat1 and pat2 and neither 
pat3 nor pat4:

*[[](((* )#pat1( *)#)~^((* )#pat2( *)#))[]]*~*[[](((* )#pat3( *)#|(* 
)#pat4( *)#))[]]*

So I'm thinking I could filter for user-provided patterns that contain 
'|' and don't put them in the mega-pattern but filter for them 
post-globbing. Does that sound like a sound solution to you?

-- 
Johan


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-08-01 13:19           ` Johan Grande
@ 2023-08-01 13:30             ` Peter Stephenson
  2023-08-01 13:46               ` Johan Grande
  2023-08-02  8:31               ` Johan Grande
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Stephenson @ 2023-08-01 13:30 UTC (permalink / raw)
  To: Johan Grande, zsh-workers

On 01/08/2023 14:19 Johan Grande <nahoj@crans.org> wrote:
> So I'm thinking I could filter for user-provided patterns that contain 
> '|' and don't put them in the mega-pattern but filter for them 
> post-globbing. Does that sound like a sound solution to you?

Yes, I think that would be OK.  The only case I definitely know of where
you'd hit the problem would be branches using that specific syntax.
I'm not entirely sure if you'd get issues with approximation, which is
even more complicated, but so long as there are no (#a...)'s in your
patterns that's not a problem.

If you can guarantee that the first branch is a longer match than the
second, you'll also get a away with it.  So (a*|) will work, because
the first branch must be at least 1 character, and the second can't
have any.  Obviously, that's harder to check for automatically, though,
but you could do a limited check, say, that any | was followed by a
right parenthesis.

pws


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-08-01 13:30             ` Peter Stephenson
@ 2023-08-01 13:46               ` Johan Grande
  2023-08-02  8:31               ` Johan Grande
  1 sibling, 0 replies; 13+ messages in thread
From: Johan Grande @ 2023-08-01 13:46 UTC (permalink / raw)
  To: Peter Stephenson, zsh-workers

Le 01/08/2023 à 15:30, Peter Stephenson a écrit :
> On 01/08/2023 14:19 Johan Grande <nahoj@crans.org> wrote:
>> So I'm thinking I could filter for user-provided patterns that contain
>> '|' and don't put them in the mega-pattern but filter for them
>> post-globbing. Does that sound like a sound solution to you?
> 
> Yes, I think that would be OK.  The only case I definitely know of where
> you'd hit the problem would be branches using that specific syntax.
> I'm not entirely sure if you'd get issues with approximation, which is
> even more complicated, but so long as there are no (#a...)'s in your
> patterns that's not a problem.
> 
> If you can guarantee that the first branch is a longer match than the
> second, you'll also get a away with it.  So (a*|) will work, because
> the first branch must be at least 1 character, and the second can't
> have any.  Obviously, that's harder to check for automatically, though,
> but you could do a limited check, say, that any | was followed by a
> right parenthesis.

Understood, thank you.

-- 
Johan



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-08-01 13:30             ` Peter Stephenson
  2023-08-01 13:46               ` Johan Grande
@ 2023-08-02  8:31               ` Johan Grande
  2023-08-02  9:37                 ` Peter Stephenson
  1 sibling, 1 reply; 13+ messages in thread
From: Johan Grande @ 2023-08-02  8:31 UTC (permalink / raw)
  To: Peter Stephenson, zsh-workers

Le 01/08/2023 à 15:30, Peter Stephenson a écrit :
> On 01/08/2023 14:19 Johan Grande <nahoj@crans.org> wrote:
>> So I'm thinking I could filter for user-provided patterns that contain
>> '|' and don't put them in the mega-pattern but filter for them
>> post-globbing. Does that sound like a sound solution to you?
> 
> Yes, I think that would be OK.  [...]

Actually, an AND can be expressed as a combination of ORs and NOTs. So I 
can build patterns such as

^(^(pat1)|^(pat2))

and never introduce ~. (With pat1 and pat2 arbitrary.) Basis tests show 
it working. Can you see any issue I might run into if I implement my 
queries like this?

-- 
Johan



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Pattern bug on (a*|)~^(*b)
  2023-08-02  8:31               ` Johan Grande
@ 2023-08-02  9:37                 ` Peter Stephenson
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Stephenson @ 2023-08-02  9:37 UTC (permalink / raw)
  To: Johan Grande, zsh-workers


> On 02/08/2023 09:31 Johan Grande <nahoj@crans.org> wrote:
> 
>  
> Le 01/08/2023 à 15:30, Peter Stephenson a écrit :
> > On 01/08/2023 14:19 Johan Grande <nahoj@crans.org> wrote:
> >> So I'm thinking I could filter for user-provided patterns that contain
> >> '|' and don't put them in the mega-pattern but filter for them
> >> post-globbing. Does that sound like a sound solution to you?
> > 
> > Yes, I think that would be OK.  [...]
> 
> Actually, an AND can be expressed as a combination of ORs and NOTs. So I 
> can build patterns such as
> 
> ^(^(pat1)|^(pat2))
> 
> and never introduce ~. (With pat1 and pat2 arbitrary.) Basis tests show 
> it working. Can you see any issue I might run into if I implement my 
> queries like this?

That certainly shouldn't tickle this problem since the exclusions
operate locally, so the problematic string recording positions is
recreated each time.  It's probably a bit less efficient since the
exclusions do some memory management for that state record, but
if that's not causing you a problems you should be fine.

pws


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2023-08-02  9:38 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-25 13:19 Pattern bug on (a*|)~^(*b) Johan Grande
2023-07-25 18:35 ` Bart Schaefer
2023-07-25 18:47   ` Johan Grande
2023-07-28  1:02     ` Bart Schaefer
2023-07-28  6:41       ` Stephane Chazelas
2023-07-29  1:35         ` Bart Schaefer
2023-08-01 13:19           ` Johan Grande
2023-08-01 13:30             ` Peter Stephenson
2023-08-01 13:46               ` Johan Grande
2023-08-02  8:31               ` Johan Grande
2023-08-02  9:37                 ` Peter Stephenson
2023-07-31 11:36 ` Peter Stephenson
2023-07-31 15:21   ` Peter Stephenson

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).