* Issue with ${var#(*_)(#cN,M)}
From: Stephane Chazelas @ 2015-10-19 9:33 UTC
To: Zsh hackers list

Unless I'm missing something, this looks like a bug:

~$ a='1_2_3_4_5_6'
~$ echo ${a#(*_)(#c1)}
2_3_4_5_6 #OK
~$ echo ${a#(*_)(#c2)}
2_3_4_5_6
~$ echo ${a#(*_)(#c3)}
3_4_5_6
~$ echo ${a#(*_)(#c4)}
4_5_6
~$ echo ${a#(*_)(#c5)}
4_5_6
~$ echo ${a#(*_)(#c6)}
3_4_5_6
~$ echo ${a#(*_)(#c7)}
4_5_6
~$ echo ${a%(_*)(#c1)}
1_2_3_4_5
~$ echo ${a%(_*)(#c2)}
1_2_3_4_5_6
~$ echo ${a%(_*)(#c3)}
1_2_3_4_5_6
~$ echo ${a%(_*)(#c4)}
1_2_3_4
~$ echo ${(S)a/(*_)(#c1)/+}
+2_3_4_5_6
~$ echo ${(S)a/(*_)(#c2)/+}
+2_3_4_5_6
~$ echo ${(S)a/(*_)(#c3)/+}
+3_4_5_6

These are OK:

~$ echo ${a#(?_)(#c1)}
2_3_4_5_6
~$ echo ${a#(?_)(#c2)}
3_4_5_6
~$ echo ${a#(?_)(#c3)}
4_5_6
~$ echo ${a#([^_]#_)(#c1)}
2_3_4_5_6
~$ echo ${a#([^_]#_)(#c2)}
3_4_5_6
~$ echo ${a#([^_]#_)(#c3)}
4_5_6

zsh 5.0.2 (x86_64-pc-linux-gnu) and zsh 5.1.1 (x86_64-debian-linux-gnu)

-- 
Stephane

* Re: Issue with ${var#(*_)(#cN,M)}
From: Bart Schaefer @ 2015-10-19 19:17 UTC
To: Zsh hackers list

On Oct 19, 10:33am, Stephane Chazelas wrote:
} Subject: Issue with ${var#(*_)(#cN,M)}
}
} Unless I'm missing something, this looks like a bug:

Hm.  I think it's counting the number of times it backtracked.  E.g.

} ~$ a='1_2_3_4_5_6'
} ~$ echo ${a#(*_)(#c2)}
} 2_3_4_5_6

Here, it first matched "1_2_3_4_5_" but then couldn't match a second
time, so it backtracked, matched "1_", and stopped counting.

However, there's an interaction with ${a#...} here -- because you've
asked for the shortest match, glob.c:igetmatch() first tries for the
longest match and then "brute-force" (see comment in glob.c) looks for
a shorter one.  So the pattern code gets invoked multiple times.

To make (*_)(#c2) work as you'd expect (each (*_) uses the shortest
match and then is tried again on the remainder) I think we would have
to teach the pattern code itself about shortest/longest match.

There's a further issue that backreferences don't seem to be
well-defined when a parenthesized subpattern is required to repeat.

* Re: Issue with ${var#(*_)(#cN,M)}
From: Stephane Chazelas @ 2015-10-20 19:09 UTC
To: Bart Schaefer
Cc: Zsh hackers list

2015-10-19 12:17:28 -0700, Bart Schaefer:
> On Oct 19, 10:33am, Stephane Chazelas wrote:
> } Subject: Issue with ${var#(*_)(#cN,M)}
> }
> } Unless I'm missing something, this looks like a bug:
>
> Hm.  I think it's counting the number of times it backtracked.  E.g.
>
> } ~$ a='1_2_3_4_5_6'
> } ~$ echo ${a#(*_)(#c2)}
> } 2_3_4_5_6
>
> Here, it first matched "1_2_3_4_5_" but then couldn't match a second
> time, so it backtracked, matched "1_", and stopped counting.
[...]

Note that the:

~$ echo ${a#*_*_}
3_4_5_6
~$ echo ${a#*_*_*_}
4_5_6

work OK.

And also:

~$ echo ${a#(*_)(*_)(*_|)}
3_4_5_6
~$ echo ${a#(*_)(*_)(*_|)4}
_5_6

(equivalent of (#c2,3))

-- 
Stephane

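The manual repetition shown in this message generalizes to a loop. What follows is a minimal sketch in portable shell, assuming the separator is `_`; the helper name `strip_leading_fields` is hypothetical and not part of zsh. It strips the shortest `*_` prefix n times and, like a failed match in `${a#pattern}`, prints the string unchanged when it contains fewer than n separators.

```shell
# Hypothetical helper (not part of zsh): emulate ${a#(*_)(#cN)} by
# removing the shortest "*_" prefix N times.
# Usage: strip_leading_fields STRING N
strip_leading_fields() {
    s=$1    # working copy of the string (global in plain POSIX sh)
    n=$2    # number of "_"-terminated fields still to remove
    while [ "$n" -gt 0 ]; do
        case $s in
            *_*) s=${s#*_} ;;      # shortest match: drop one field
            *)   printf '%s\n' "$1" # too few separators: no match,
                 return 1 ;;        # so print the input unchanged
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$s"
}

strip_leading_fields 1_2_3_4_5_6 2
strip_leading_fields 1_2_3_4_5_6 5
```

Because it relies only on `${s#*_}`, the loop does not depend on how a given zsh version handles `(#c...)`, which is the design reason for preferring it as a stopgap.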
* Re: Issue with ${var#(*_)(#cN,M)}
From: Bart Schaefer @ 2015-10-20 23:04 UTC
To: Zsh hackers list

On Oct 20,  8:09pm, Stephane Chazelas wrote:
} Subject: Re: Issue with ${var#(*_)(#cN,M)}
}
} 2015-10-19 12:17:28 -0700, Bart Schaefer:
} >
} > } ~$ a='1_2_3_4_5_6'
} > } ~$ echo ${a#(*_)(#c2)}
} > } 2_3_4_5_6
} >
} > Here, it first matched "1_2_3_4_5_" but then couldn't match a second
} > time, so it backtracked, matched "1_", and stopped counting.
}
} Note that the:
}
} ~$ echo ${a#*_*_}
} 3_4_5_6
} ~$ echo ${a#*_*_*_}
} 4_5_6
}
} work OK.

Well, yes, but not really relevant.

} And also:
}
} ~$ echo ${a#(*_)(*_)(*_|)}
} 3_4_5_6

The (#c) modifier is not implemented by replicating the pattern, it's
implemented by counting the number of successful trials that can be
made using the single pattern.  So it really makes no difference to
the bug that manually repeating the pattern does the right thing.

What's messing it up is the "*" operator and the backtracking that is
implied because * can match anything.

* Re: Issue with ${var#(*_)(#cN,M)}
From: Peter Stephenson @ 2015-10-27 10:00 UTC
To: Zsh hackers list

Original problem

> } ~$ a='1_2_3_4_5_6'
> } ~$ echo ${a#(*_)(#c2)}
> } 2_3_4_5_6

On Tue, 20 Oct 2015 16:04:22 -0700
Bart Schaefer <schaefer@brasslantern.com> wrote:
> What's messing it up is the "*" operator and the backtracking that is
> implied because * can match anything.

Exactly.  What's backtracking over what in what order here is a bit of
a nightmare, and I'm not sure I'm likely to get my mind round it.

Unless someone does, you'll be better off sticking to

% a='1_2_3_4_5_6'
% echo ${a#([^_]#_)(#c2)}
3_4_5_6

and then we don't have the "*" within the group to worry about.

pws

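The same idea, keeping the separator out of the repeated group so that each repetition can consume only one field, can also be sketched outside zsh with an extended regular expression. This is an illustration under assumptions rather than anything proposed in the thread: it assumes a `sed` that accepts `-E` (GNU and BSD sed do), and `strip_fields_ere` is a hypothetical helper name.

```shell
# Hypothetical helper: strip the first N "_"-terminated fields.
# [^_]*_ plays the role of zsh's [^_]#_ : it cannot run past a "_",
# so there is only one way for each repetition to match.
# Usage: strip_fields_ere STRING N   (N must be a non-negative integer)
strip_fields_ere() {
    printf '%s\n' "$1" | sed -E "s/^([^_]*_){$2}//"
}

strip_fields_ere 1_2_3_4_5_6 2
```

If the string holds fewer than N fields, the substitution does not apply and the input is printed unchanged.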
* Re: Issue with ${var#(*_)(#cN,M)}
From: Peter Stephenson @ 2015-10-27 10:46 UTC
To: Zsh hackers list

On Tue, 27 Oct 2015 10:00:34 +0000
Peter Stephenson <p.stephenson@samsung.com> wrote:
> Original problem
>
> > } ~$ a='1_2_3_4_5_6'
> > } ~$ echo ${a#(*_)(#c2)}
> > } 2_3_4_5_6
>
> On Tue, 20 Oct 2015 16:04:22 -0700
> Bart Schaefer <schaefer@brasslantern.com> wrote:
> > What's messing it up is the "*" operator and the backtracking that is
> > implied because * can match anything.
>
> Exactly.  What's backtracking over what in what order here is a bit of
> a nightmare, and I'm not sure I'm likely to get my mind round it.
>
> Unless someone does, you'll be better off sticking to
>
> % a='1_2_3_4_5_6'
> % echo ${a#([^_]#_)(#c2)}
> 3_4_5_6
>
> and then we don't have the "*" within the group to worry about.

Indeed, I've just noticed that with

% egrep --version
egrep (GNU grep) 2.8

the following:

% egrep '^(*_){2}$' <<<'1_2_'

fails to match completely, i.e. the backtracking is too complicated
to handle, whereas

% egrep '^([^_]+_){2}$' <<<'1_2_'

succeeds.

At this point, I'm going to document the difficulty and slowly retreat
backwards from the dark corner.

pws

diff --git a/Doc/Zsh/expn.yo b/Doc/Zsh/expn.yo
index 5ea8610..49a0f0d 100644
--- a/Doc/Zsh/expn.yo
+++ b/Doc/Zsh/expn.yo
@@ -2192,6 +2192,16 @@ inclusive.  The form tt(LPAR()#c)var(N)tt(RPAR()) requires exactly tt(N)
 matches; tt(LPAR()#c,)var(M)tt(RPAR()) is equivalent to specifying var(N)
 as 0; tt(LPAR()#c)var(N)tt(,RPAR()) specifies that there is no maximum
 limit on the number of matches.
+
+Note that if the previous group of characters contains wildcards,
+results can be unpredictable to the point of being logically incorrect.
+It is recommended that the pattern be trimmed to match the minimum
+possible.  For example, to match a string of the form `tt(1_2_3_)', use
+a pattern of the form `tt(LPAR()[[:digit:]]##_+RPAR()LPAR()#c3+RPAR())', not
+`tt(LPAR()*_+RPAR()LPAR()#c3+RPAR())'.  This arises from the
+complicated interaction between attempts to match a number of
+repetitions of the whole pattern and attempts to match the wildcard
+`tt(*)'.
 )
 vindex(MATCH)
 vindex(MBEGIN)

* Re: Issue with ${var#(*_)(#cN,M)}
From: Stephane Chazelas @ 2015-10-27 11:03 UTC
To: zsh-workers

2015-10-27 10:46:33 +0000, Peter Stephenson:
[...]
> % egrep '^(*_){2}$' <<<'1_2_'
>
> fails to match completely, i.e. the backtracking is too complicated
> to handle, whereas
[...]

Except that it should be .* in REs and that REs are greedy.

$ egrep '^(.*_){2}$' <<<'1_2_'
1_2_
$ grep -Eo '^(.*_){2}' <<<'1_2_3_4_5'
1_2_3_4_
$ grep -Po '^(.*?_){2}' <<<'1_2_3_4_5'
1_2_

ksh93 is also fine with it:

$ a='1_2_3_4_5' ksh -c 'echo "${a#{2}(*_)}"'
3_4_5
$ a='1_2_3_4_5' ksh -c 'echo "${a##{2}(*_)}"'
5

The zsh limitation should probably be documented if not fixed.

-- 
Stephane

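The two ksh93 results in this message differ only in `#` (shortest matching prefix) versus `##` (longest matching prefix). A small sketch of that distinction in portable shell, using the manually repeated pattern `*_*_` as a stand-in for two repetitions of `*_`; the variable names `shortest` and `longest` are chosen here for illustration only:

```shell
a='1_2_3_4_5'

# "#" removes the shortest prefix matching *_*_ , here "1_2_"
shortest=${a#*_*_}

# "##" removes the longest prefix matching *_*_ , here "1_2_3_4_"
longest=${a##*_*_}

printf '%s\n' "$shortest" "$longest"
```

Under these assumptions the two values line up with the ksh93 output quoted above, which is what a counted repetition of `*_` would be expected to produce in each mode.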
* Re: Issue with ${var#(*_)(#cN,M)}
From: Peter Stephenson @ 2015-10-27 11:11 UTC
To: zsh-workers

On Tue, 27 Oct 2015 11:03:53 +0000
Stephane Chazelas <stephane.chazelas@gmail.com> wrote:
> 2015-10-27 10:46:33 +0000, Peter Stephenson:
> [...]
> > % egrep '^(*_){2}$' <<<'1_2_'
> >
> > fails to match completely, i.e. the backtracking is too complicated
> > to handle, whereas
> [...]
>
> Except that it should be .* in REs and that REs are greedy.

You're right, I got the pattern wrong, and it does work.

pws

* Re: Issue with ${var#(*_)(#cN,M)}
From: Stephane Chazelas @ 2015-10-27 11:11 UTC
To: zsh-workers

2015-10-27 11:03:53 +0000, Stephane Chazelas:
[...]
> ksh93 is also fine with it:
>
> $ a='1_2_3_4_5' ksh -c 'echo "${a#{2}(*_)}"'
> 3_4_5
> $ a='1_2_3_4_5' ksh -c 'echo "${a##{2}(*_)}"'
> 5
>
> The zsh limitation should probably be documented if not fixed.
[...]

Another workaround is to use zsh's PCREs:

$ a='1_2_3_4_5' zsh -o rematchpcre -c '[[ $a =~ "(?s)^(.*?_){2}" ]] &&echo $MATCH'
1_2_
$ a=$'1_2_3_4_5\nqweq' zsh -o rematchpcre -c '[[ $a =~ "(?s)^(?:.*?_){2}(.*)" ]]; echo $match'
3_4_5
qweq

-- 
Stephane

* Re: Issue with ${var#(*_)(#cN,M)}
From: Peter Stephenson @ 2015-10-27 11:37 UTC
To: zsh-workers

Sigh.  Can't see the wood for the trees.  It is a backtracking problem,
but it's a simple bug with restoring the state when backtracking, not a
logical error in the matching machine.

I'll take out the weasel words again, shall I?

pws

diff --git a/Doc/Zsh/expn.yo b/Doc/Zsh/expn.yo
index 49a0f0d..5ea8610 100644
--- a/Doc/Zsh/expn.yo
+++ b/Doc/Zsh/expn.yo
@@ -2192,16 +2192,6 @@ inclusive.  The form tt(LPAR()#c)var(N)tt(RPAR()) requires exactly tt(N)
 matches; tt(LPAR()#c,)var(M)tt(RPAR()) is equivalent to specifying var(N)
 as 0; tt(LPAR()#c)var(N)tt(,RPAR()) specifies that there is no maximum
 limit on the number of matches.
-
-Note that if the previous group of characters contains wildcards,
-results can be unpredictable to the point of being logically incorrect.
-It is recommended that the pattern be trimmed to match the minimum
-possible.  For example, to match a string of the form `tt(1_2_3_)', use
-a pattern of the form `tt(LPAR()[[:digit:]]##_+RPAR()LPAR()#c3+RPAR())', not
-`tt(LPAR()*_+RPAR()LPAR()#c3+RPAR())'.  This arises from the
-complicated interaction between attempts to match a number of
-repetitions of the whole pattern and attempts to match the wildcard
-`tt(*)'.
 )
 vindex(MATCH)
 vindex(MBEGIN)
diff --git a/Src/pattern.c b/Src/pattern.c
index 8b07cca..9e8a80a 100644
--- a/Src/pattern.c
+++ b/Src/pattern.c
@@ -3376,6 +3376,7 @@ patmatch(Upat prog)
 		    scan[P_CT_CURRENT].l = cur + 1;
 		    if (patmatch(scan + P_CT_OPERAND))
 			return 1;
+		    scan[P_CT_CURRENT].l = cur;
 		    patinput = patinput_thistime;
 		}
 		if (cur < min)
diff --git a/Test/D02glob.ztst b/Test/D02glob.ztst
index 3e2095a..f944a4f 100644
--- a/Test/D02glob.ztst
+++ b/Test/D02glob.ztst
@@ -574,3 +574,11 @@
 0:Optimisation to squeeze multiple *'s used as ordinary glob wildcards.
 >glob.tmp/ra=1.0_et=3.5
 >glob.tmp/ra=1.0_et=3.5
+
+  [[ 1_2_ = (*_)(#c1) ]] && print 1 OK # because * matches 1_2
+  [[ 1_2_ = (*_)(#c2) ]] && print 2 OK
+  [[ 1_2_ = (*_)(#c3) ]] || print 3 OK
+0:Some more complicated backtracking with match counts.
+>1 OK
+>2 OK
+>3 OK
diff --git a/Test/D04parameter.ztst b/Test/D04parameter.ztst
index f1cc23e..cb7079c 100644
--- a/Test/D04parameter.ztst
+++ b/Test/D04parameter.ztst
@@ -1735,3 +1735,12 @@
 0:History modifier works the same for scalar and array substitution
 >ddd bdb cdc
 >ddd bdb cdc
+
+  a=1_2_3_4_5_6
+  print ${a#(*_)(#c2)}
+  print ${a#(*_)(#c5)}
+  print ${a#(*_)(#c7)}
+0:Complicated backtracking with match counts
+>3_4_5_6
+>6
+>1_2_3_4_5_6