* Pattern bug on (a*|)~^(*b) @ 2023-07-25 13:19 Johan Grande 2023-07-25 18:35 ` Bart Schaefer 2023-07-31 11:36 ` Peter Stephenson 0 siblings, 2 replies; 13+ messages in thread From: Johan Grande @ 2023-07-25 13:19 UTC (permalink / raw) To: zsh-workers In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob, [[ "ab" = (|a*)~^(*b) ]] incorrectly (unless I'm mistaken) returns 1. However [[ "ab" = (a*|)~^(*b) ]] correctly returns 0. -- Johan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-07-25 13:19 Pattern bug on (a*|)~^(*b) Johan Grande @ 2023-07-25 18:35 ` Bart Schaefer 2023-07-25 18:47 ` Johan Grande 2023-07-31 11:36 ` Peter Stephenson 1 sibling, 1 reply; 13+ messages in thread From: Bart Schaefer @ 2023-07-25 18:35 UTC (permalink / raw) To: Johan Grande; +Cc: zsh-workers On Tue, Jul 25, 2023 at 6:19 AM Johan Grande <nahoj@crans.org> wrote: > > In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob, > > [[ "ab" = (|a*)~^(*b) ]] > > incorrectly (unless I'm mistaken) returns 1. This is not precisely a bug, it's expected behavior. Alternation with (x|y) tries the patterns in left-to-right order and has lower precedence than p~q (despite the parens). The upshot is that (|a*)~^(*b) matches like (|)~^(*b) whereas (a*|)~^(*b) is like (a*)~^(*b) and the final bit of the puzzle is that the double negative in ~^(b*) cannot match the empty result from (|). The same thing happens with globbing, touch a file named "ab" and then try those patterns for filename generation. PWS may be able to explain some more about the precedence of (x|y) vs. (p~q). To resolve this, you have to put the p~q inside the x|y, like [[ "ab" = (|a*~^(*b)) ]] ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-07-25 18:35 ` Bart Schaefer @ 2023-07-25 18:47 ` Johan Grande 2023-07-28 1:02 ` Bart Schaefer 0 siblings, 1 reply; 13+ messages in thread From: Johan Grande @ 2023-07-25 18:47 UTC (permalink / raw) To: Bart Schaefer; +Cc: zsh-workers Le 25/07/2023 à 20:35, Bart Schaefer a écrit : > On Tue, Jul 25, 2023 at 6:19 AM Johan Grande <nahoj@crans.org> wrote: >> >> In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob, >> >> [[ "ab" = (|a*)~^(*b) ]] >> >> incorrectly (unless I'm mistaken) returns 1. > > This is not precisely a bug, it's expected behavior. Alternation with > (x|y) tries the patterns in left-to-right order and has lower > precedence than p~q (despite the parens). I don't get it. What are the different patterns tried during this left-to-right evaluation of (|a*)~^(*b) ? -- Johan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-07-25 18:47 ` Johan Grande @ 2023-07-28 1:02 ` Bart Schaefer 2023-07-28 6:41 ` Stephane Chazelas 0 siblings, 1 reply; 13+ messages in thread From: Bart Schaefer @ 2023-07-28 1:02 UTC (permalink / raw) To: Johan Grande; +Cc: zsh-workers On Tue, Jul 25, 2023 at 11:47 AM Johan Grande <nahoj@crans.org> wrote: > > I don't get it. What are the different patterns tried during this > left-to-right evaluation of (|a*)~^(*b) ? I'm going to invoke my "PWS may be able to explain some more" remark, but basically once you cross the "~" it's no longer possible to backtrack again; so it tries to match the empty alternative, gets an empty result set, excludes ^(*b) from that result set leaving a still-empty set, can't go back, and fails because the final result is empty. I don't recall all the details but this has something to do with X~Y implemented as "exclude Y from the result of X". ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-07-28 1:02 ` Bart Schaefer @ 2023-07-28 6:41 ` Stephane Chazelas 2023-07-29 1:35 ` Bart Schaefer 0 siblings, 1 reply; 13+ messages in thread From: Stephane Chazelas @ 2023-07-28 6:41 UTC (permalink / raw) To: Bart Schaefer; +Cc: Johan Grande, zsh-workers 2023-07-27 18:02:30 -0700, Bart Schaefer: > On Tue, Jul 25, 2023 at 11:47 AM Johan Grande <nahoj@crans.org> wrote: > > > > I don't get it. What are the different patterns tried during this > > left-to-right evaluation of (|a*)~^(*b) ? > > I'm going to invoke my "PWS may be able to explain some more" remark, > but basically once you cross the "~" it's no longer possible to > backtrack again; so it tries to match the empty alternative, gets an > empty result set, excludes ^(*b) from that result set leaving a > still-empty set, can't go back, and fails because the final result is > empty. I don't recall all the details but this has something to do > with X~Y implemented as "exclude Y from the result of X". I have to say I'm with the OP and don't understand that explanation either. It also doesn't seem to tie in with the fact that in globbing the part after ~ is seemingly done *after* globbing so with very low precedence, as in ./**/*.txt~*/src/* finding all the txt files including in src dirs, but the */src/* filtered out afterwards. Even if that's "how it works", I struggle to see how it can be seen as anything but a bug. It's not working as documented, and I can't really see how we would document it. See also: $ [[ ab = a(|a*)~^*b ]]; echo $? 1 $ [[ ab = a(|*b)~^*b ]]; echo $? 1 $ [[ ab = a(*z|*b)~^*b ]]; echo $? 0 Which doesn't seem to tie in with the explanation. -- Stephane ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-07-28 6:41 ` Stephane Chazelas @ 2023-07-29 1:35 ` Bart Schaefer 2023-08-01 13:19 ` Johan Grande 0 siblings, 1 reply; 13+ messages in thread From: Bart Schaefer @ 2023-07-29 1:35 UTC (permalink / raw) To: Bart Schaefer, Johan Grande, zsh-workers On Thu, Jul 27, 2023 at 11:41 PM Stephane Chazelas <stephane@chazelas.org> wrote: > > I have to say I'm with the OP and don't understand that > explanation either. In the absence of a direct response from PWS, I'll just point you to his comments in pattern.c, some of which date from before we had a git repository: * The strategy is to test the asserted pattern, * recording via P_EXCSYNC how far the part to * be excluded matched. We then set the * length of the test string to that * point and see if the exclusion as far as * P_EXCEND also matches that string. * We need to keep testing the asserted pattern * by backtracking, since the first attempt * may be excluded while a later attempt may not. * For this we keep a pointer just after * the P_EXCLUDE which is tested by the P_EXCSYNC * to see if we matched there last time, in which * case we fail. If there is nothing to backtrack * over, that doesn't matter: we should fail anyway. * The pointer also tells us where the asserted * pattern matched for use by the exclusion. * P.S. in case you were wondering, this code * is horrible. * This is where we make sure that we are not * repeatedly matching zero-length strings in * a closure, which would cause an infinite loop, * and also remove exponential behaviour in * backtracking nested closures. * If we come round to the same branch again, and * there is already a 1, then the test fails. By my reading, zero-length matches may be short-circuited to avoid pathological behavior. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-07-29 1:35 ` Bart Schaefer @ 2023-08-01 13:19 ` Johan Grande 2023-08-01 13:30 ` Peter Stephenson 0 siblings, 1 reply; 13+ messages in thread From: Johan Grande @ 2023-08-01 13:19 UTC (permalink / raw) To: Bart Schaefer, zsh-workers Le 29/07/2023 à 03:35, Bart Schaefer a écrit : > On Thu, Jul 27, 2023 at 11:41 PM Stephane Chazelas > <stephane@chazelas.org> wrote: >> >> I have to say I'm with the OP and don't understand that explanation >> either. > > In the absence of a direct response from PWS, I'll just point you to > his comments in pattern.c, some of which date from before we had a > git repository: [...] > > By my reading, zero-length matches may be short-circuited to avoid > pathological behavior. Thank you for your answers, though the latter is rather obscure to me. If you can answer, or if PWS reads this: What globbing should do is one thing, but in the meantime I'm interested to know what I can safely put on the left-hand side of ~ with the current implementation. Does this behavior concern any case of backtracking, or only (...|...) patterns, or, which would be even better, only (|...) and (...|) patterns? In my project, I use globbing to find files with tags matching patterns provided by the user, because I noticed that it's much faster than anything else I tried, and I use ()~^() and (|) as AND and OR operators. As an example, here is the pattern generated for files with tags matching pat1: *[[](((* |)pat1( *|)))[]]* which I could easily replace with: *[[](((* )#pat1( *)#))[]]* Then this would be the pattern for files with pat1 and pat2 and neither pat3 nor pat4: *[[](((* )#pat1( *)#)~^((* )#pat2( *)#))[]]*~*[[](((* )#pat3( *)#|(* )#pat4( *)#))[]]* So I'm thinking I could filter for user-provided patterns that contain '|' and don't put them in the mega-pattern but filter for them post-globbing. Does that sound like a sound solution to you? -- Johan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-08-01 13:19 ` Johan Grande @ 2023-08-01 13:30 ` Peter Stephenson 2023-08-01 13:46 ` Johan Grande 2023-08-02 8:31 ` Johan Grande 0 siblings, 2 replies; 13+ messages in thread From: Peter Stephenson @ 2023-08-01 13:30 UTC (permalink / raw) To: Johan Grande, zsh-workers On 01/08/2023 14:19 Johan Grande <nahoj@crans.org> wrote: > So I'm thinking I could filter for user-provided patterns that contain > '|' and don't put them in the mega-pattern but filter for them > post-globbing. Does that sound like a sound solution to you? Yes, I think that would be OK. The only case I definitely know of where you'd hit the problem would be branches using that specific syntax. I'm not entirely sure if you'd get issues with approximation, which is even more complicated, but so long as there are no (#a...)'s in your patterns that's not a problem. If you can guarantee that the first branch is a longer match than the second, you'll also get a away with it. So (a*|) will work, because the first branch must be at least 1 character, and the second can't have any. Obviously, that's harder to check for automatically, though, but you could do a limited check, say, that any | was followed by a right parenthesis. pws ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-08-01 13:30 ` Peter Stephenson @ 2023-08-01 13:46 ` Johan Grande 2023-08-02 8:31 ` Johan Grande 1 sibling, 0 replies; 13+ messages in thread From: Johan Grande @ 2023-08-01 13:46 UTC (permalink / raw) To: Peter Stephenson, zsh-workers Le 01/08/2023 à 15:30, Peter Stephenson a écrit : > On 01/08/2023 14:19 Johan Grande <nahoj@crans.org> wrote: >> So I'm thinking I could filter for user-provided patterns that contain >> '|' and don't put them in the mega-pattern but filter for them >> post-globbing. Does that sound like a sound solution to you? > > Yes, I think that would be OK. The only case I definitely know of where > you'd hit the problem would be branches using that specific syntax. > I'm not entirely sure if you'd get issues with approximation, which is > even more complicated, but so long as there are no (#a...)'s in your > patterns that's not a problem. > > If you can guarantee that the first branch is a longer match than the > second, you'll also get a away with it. So (a*|) will work, because > the first branch must be at least 1 character, and the second can't > have any. Obviously, that's harder to check for automatically, though, > but you could do a limited check, say, that any | was followed by a > right parenthesis. Understood, thank you. -- Johan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-08-01 13:30 ` Peter Stephenson 2023-08-01 13:46 ` Johan Grande @ 2023-08-02 8:31 ` Johan Grande 2023-08-02 9:37 ` Peter Stephenson 1 sibling, 1 reply; 13+ messages in thread From: Johan Grande @ 2023-08-02 8:31 UTC (permalink / raw) To: Peter Stephenson, zsh-workers Le 01/08/2023 à 15:30, Peter Stephenson a écrit : > On 01/08/2023 14:19 Johan Grande <nahoj@crans.org> wrote: >> So I'm thinking I could filter for user-provided patterns that contain >> '|' and don't put them in the mega-pattern but filter for them >> post-globbing. Does that sound like a sound solution to you? > > Yes, I think that would be OK. [...] Actually, an AND can be expressed as a combination of ORs and NOTs. So I can build patterns such as ^(^(pat1)|^(pat2)) and never introduce ~. (With pat1 and pat2 arbitrary.) Basis tests show it working. Can you see any issue I might run into if I implement my queries like this? -- Johan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-08-02 8:31 ` Johan Grande @ 2023-08-02 9:37 ` Peter Stephenson 0 siblings, 0 replies; 13+ messages in thread From: Peter Stephenson @ 2023-08-02 9:37 UTC (permalink / raw) To: Johan Grande, zsh-workers > On 02/08/2023 09:31 Johan Grande <nahoj@crans.org> wrote: > > > Le 01/08/2023 à 15:30, Peter Stephenson a écrit : > > On 01/08/2023 14:19 Johan Grande <nahoj@crans.org> wrote: > >> So I'm thinking I could filter for user-provided patterns that contain > >> '|' and don't put them in the mega-pattern but filter for them > >> post-globbing. Does that sound like a sound solution to you? > > > > Yes, I think that would be OK. [...] > > Actually, an AND can be expressed as a combination of ORs and NOTs. So I > can build patterns such as > > ^(^(pat1)|^(pat2)) > > and never introduce ~. (With pat1 and pat2 arbitrary.) Basis tests show > it working. Can you see any issue I might run into if I implement my > queries like this? That certainly shouldn't tickle this problem since the exclusions operate locally, so the problematic string recording positions is recreated each time. It's probably a bit less efficient since the exclusions do some memory management for that state record, but if that's not causing you a problems you should be fine. pws ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-07-25 13:19 Pattern bug on (a*|)~^(*b) Johan Grande 2023-07-25 18:35 ` Bart Schaefer @ 2023-07-31 11:36 ` Peter Stephenson 2023-07-31 15:21 ` Peter Stephenson 1 sibling, 1 reply; 13+ messages in thread From: Peter Stephenson @ 2023-07-31 11:36 UTC (permalink / raw) To: Johan Grande, zsh-workers > On 25/07/2023 14:19 Johan Grande <nahoj@crans.org> wrote: > In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob, > > [[ "ab" = (|a*)~^(*b) ]] > > incorrectly (unless I'm mistaken) returns 1. However > > [[ "ab" = (a*|)~^(*b) ]] > > correctly returns 0. You can see this is a real bug with [[ ab = (a|a*)~^*b ]] which also fails, while [[ ab = (b*|a*)~^*b ]] succeeds, so this is clearly inconsistent. I think I've tracked down the problem. Because of the unpleasant hierarchy-violating properties of exclusion pointed out by Bart, we record the point at which the match succeeded. The point we do this isn't where the entire pattern succeeds, however, only the point where the part we're going to exclude has matched. So at this point we've successfully matched "a" in the case I showed, or the empty string in the original case. But we haven't yet taken account of the anchor at the end, which will later cause us to fail. When we do fail and backtrack, the marker is still there saying we matched just the "a" or the empty string. We then successfully match a*, but when we get to the point of trying to exclude ^*b the record saying what we're excluding against is messed up --- ^*b DOES match either a or the empty string, so the exclusion succeeds. The fixes I can think of are either to find a point at which to undo the sync record for the match that has failed after the event or, perhaps better as it's a local change even though it does more work, reset any previous matches before the sync point on the next alternative when we mark that as having matched (we will only do that if the previous branch failed --- remember [irony intended] that a branch succeeds only if all following patterns match too). This is a bit fraught as some of the markers (not necessarily these ones, it's almost a quarter of a century since the first version of this code) are there to take account of some pathological cases so removing them might have hard to discover effects. This will take another load of wet towels and caffeine. pws ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Pattern bug on (a*|)~^(*b) 2023-07-31 11:36 ` Peter Stephenson @ 2023-07-31 15:21 ` Peter Stephenson 0 siblings, 0 replies; 13+ messages in thread From: Peter Stephenson @ 2023-07-31 15:21 UTC (permalink / raw) To: Johan Grande, zsh-workers > On 31/07/2023 12:36 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote: > > On 25/07/2023 14:19 Johan Grande <nahoj@crans.org> wrote: > > In zsh 5.8.1 (x86_64-ubuntu-linux-gnu) with extended_glob, > > > > [[ "ab" = (|a*)~^(*b) ]] > > > > incorrectly (unless I'm mistaken) returns 1. > > The fixes I can think of are either to find a point at which to undo the sync > record for the match that has failed after the event or, perhaps better as it's > a local change even though it does more work, reset any previous matches before > the sync point on the next alternative when we mark that as having matched > (we will only do that if the previous branch failed --- remember [irony > intended] that a branch succeeds only if all following patterns match too). I think the latter is good enough. There's no sign of a slowdown with this fix in D02glob.ztst, so any resulting problems will have to be complicated things with nested branches and probably zero-length matches --- not all of which are well optimised anyway. I could be wrong, but I think we'll have to take it on the chin. Anyway, I couldn't think of any easy way around that, despite some investigation, so as this does fix the immediate problem, we can go with it. It's the sort of self-contained fix I like. pws diff --git a/Src/pattern.c b/Src/pattern.c index 3edda1772..2a1a514fb 100644 --- a/Src/pattern.c +++ b/Src/pattern.c @@ -2987,14 +2987,15 @@ patmatch(Upat prog) case P_EXCSYNC: /* See the P_EXCLUDE code below for where syncptr comes from */ { - unsigned char *syncptr; + unsigned char *syncstart, *syncptr, *ptr; Upat after; after = P_OPERAND(scan); DPUTS(!P_ISEXCLUDE(after), "BUG: EXCSYNC not followed by EXCLUDE."); DPUTS(!P_OPERAND(after)->p, "BUG: EXCSYNC not handled by EXCLUDE"); - syncptr = P_OPERAND(after)->p + (patinput - patinstart); + syncstart = P_OPERAND(after)->p; + syncptr = syncstart + (patinput - patinstart); /* * If we already matched from here, this time we fail. * See WBRANCH code for story about error count. @@ -3009,6 +3010,23 @@ patmatch(Upat prog) * failed anyway. */ *syncptr = errsfound + 1; + /* + * Because of backtracking, any match before this point + * can't apply to the current branch we're on so is now + * a failure --- this can happen if, on a previous + * branch, we initially marked a success before failing + * on a later part of the pattern after marking up the + * P_EXCSYNC (even an end anchor will have this effect). + * To make sure we record the current match point + * correctly, mark those down now. + * + * This might have side effects on the efficiency of + * pathological cases involving nested branches. To + * fix that we'd probably need to record matches on + * different branches separately. + */ + for (ptr = syncstart; ptr < syncptr; ++ptr) + *ptr = 0; } break; case P_EXCEND: diff --git a/Test/D02glob.ztst b/Test/D02glob.ztst index 850a535e5..4d88e5c27 100644 --- a/Test/D02glob.ztst +++ b/Test/D02glob.ztst @@ -817,6 +817,32 @@ *>*/glob.tmp/(flip|flop) *>*/glob.tmp/(flip|flop)/trailing/components +# The following set test an obscure problem with branches followed by +# exclusions that shows up when the exclusion matches against +# something other than the complete test string, hence the complicated +# double negative. + [[ ab = (|a*)~^(*b) ]] +0:Regression test for exclusion after branches: empty first alternative + + [[ ab = (b|a*)~^(*b) ]] +0:Regression test for exclusion after branches: non-empty first alternative + + [[ ab = (b*|a*)~^(*b) ]] +0:Regression test for exclusion after branches: full length first alternative + +# Corresponding tests where the exclusion should succeed, so the +# match fails. It's hard to know how to provoke bugs here... + [[ abc = (|a*)~^(*b) ]] +1:Regression test for exclusion after branches: failure case 1 + + [[ abc = (b|a*)~^(*b) ]] +1:Regression test for exclusion after branches: failure case 2 + + [[ abc = (b*|a*)~^(*b) ]] +1:Regression test for exclusion after branches: failure case 3 + +# Careful: extendedglob off from this point. + unsetopt extendedglob print -r -- ${(*)=${(@s.+.):-A+B}/(#b)(?)/-${(L)match[1]} ${match[1]}} 0:the '*' qualfier enables extended_glob for pattern matching ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2023-08-02 9:38 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-07-25 13:19 Pattern bug on (a*|)~^(*b) Johan Grande 2023-07-25 18:35 ` Bart Schaefer 2023-07-25 18:47 ` Johan Grande 2023-07-28 1:02 ` Bart Schaefer 2023-07-28 6:41 ` Stephane Chazelas 2023-07-29 1:35 ` Bart Schaefer 2023-08-01 13:19 ` Johan Grande 2023-08-01 13:30 ` Peter Stephenson 2023-08-01 13:46 ` Johan Grande 2023-08-02 8:31 ` Johan Grande 2023-08-02 9:37 ` Peter Stephenson 2023-07-31 11:36 ` Peter Stephenson 2023-07-31 15:21 ` Peter Stephenson
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/zsh/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).