zsh-workers
* [BUG] `$match` is haunting my regex’s trailing, optional, capture
@ 2023-12-09  5:14 chris0e3
  2023-12-09  6:23 ` Bart Schaefer
  0 siblings, 1 reply; 6+ messages in thread
From: chris0e3 @ 2023-12-09  5:14 UTC
  To: zsh-workers

Hello,

I’m using a custom-built zsh 5.9 and PCRE 8.45 on macOS.
I’m seeing unexpected values in `$match` after a successful match.

What is the expected output of:
```
  setopt rematch_pcre
  [[ 'REQUIRE. OPT' =~ 'REQUIRE.(\s*OPT)?' ]] && printf '\tA. ‹%s›\n' $match
  [[ 'REQUIRE.'     =~ 'REQUIRE.(\s*OPT)?' ]] && printf '\tB. ‹%s›\n' $match
```

I had expected:
```
	A. ‹ OPT›
	B. ‹›
```

But I get:
```
	A. ‹ OPT›
	B. ‹ OPT›
```

Reversing the order of the tests (and running them in a new Terminal window) produces the expected results. [Though running them in a subshell appears to inherit the previous value of `$match`.  Is that expected?]  So this is probably just because `$match` is initially empty.
However, changing the regex to 'REQUIRE.(\s*OPT)?(.*)' or '(REQUIRE).(\s*OPT)?' also produces the expected results.

It looks like: if there is a match, but no captures are matched then `$match` is not cleared.  However, I think it should be cleared.  The zsh manual §22.23 appears to imply what I contend.  [If I read it correctly.]

Based on my hypothesis I wrote this (simplification):
```
  setopt rematch_pcre; match=RUBBISH
  [[ A =~ 'A|(B)' ]] && printf '\ta. ‹%s›\n' $match
  [[ B =~ '(A)|B' ]] && printf '\tb. ‹%s›\n' $match
```

I would expect:
```
	a. ‹›
	b. ‹›
```

But I get:
```
	a. ‹RUBBISH›
	b. ‹RUBBISH›
```
[But changing the regexes to 'A()|(B)' & '(A)|B()' produces the expected results.]


So.  Am I right?  And is it possible to fix zsh?  Or am I wrong?

Thanks,

CHRIS





* Re: [BUG] `$match` is haunting my regex’s trailing, optional, capture
  2023-12-09  5:14 [BUG] `$match` is haunting my regex’s trailing, optional, capture chris0e3
@ 2023-12-09  6:23 ` Bart Schaefer
  2023-12-09 20:54   ` [PATCH?] " Bart Schaefer
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Schaefer @ 2023-12-09  6:23 UTC
  To: chris0e3; +Cc: zsh-workers

On Fri, Dec 8, 2023 at 9:14 PM <chris0e3@gmail.com> wrote:
>
>   setopt rematch_pcre
>   [[ 'REQUIRE. OPT' =~ 'REQUIRE.(\s*OPT)?' ]] && printf '\tA. ‹%s›\n' $match
>   [[ 'REQUIRE.'     =~ 'REQUIRE.(\s*OPT)?' ]] && printf '\tB. ‹%s›\n' $match

Aside, you should use e.g. "$match[@]" there (including the quotes) or
you'll be surprised when there is more than one back reference.

> It looks like: if there is a match, but no captures are matched then `$match` is not cleared.

The output of your latter example is certainly not consistent with e.g.

 setopt extendedglob
 [[ 'REQUIRE.' = (#b)REQUIRE.(*OPT)(#c0,1) ]] && printf '\tB. ‹%s›\n' $match

And I can reproduce your example with the most recent git checkout, as
well.  Oliver recently updated to pcre2, so if a patch appears, you
may be on your own to backport it.



* [PATCH?] Re: [BUG] `$match` is haunting my regex’s trailing, optional, capture
  2023-12-09  6:23 ` Bart Schaefer
@ 2023-12-09 20:54   ` Bart Schaefer
  2023-12-11 23:49     ` Oliver Kiddle
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Schaefer @ 2023-12-09 20:54 UTC
  To: chris0e3; +Cc: Zsh hackers list


On Fri, Dec 8, 2023 at 10:23 PM Bart Schaefer <schaefer@brasslantern.com>
wrote:

> On Fri, Dec 8, 2023 at 9:14 PM <chris0e3@gmail.com> wrote:
> >
> >   setopt rematch_pcre
> >   [[ 'REQUIRE. OPT' =~ 'REQUIRE.(\s*OPT)?' ]] && printf '\tA. ‹%s›\n' $match
> >   [[ 'REQUIRE.'     =~ 'REQUIRE.(\s*OPT)?' ]] && printf '\tB. ‹%s›\n' $match
>
> [...] I can reproduce your example with the most recent git checkout, as
> well.  Oliver recently updated to pcre2, so if a patch appears, you
> may be on your own to backport it.
>

This was easier to track down than I feared, but possibly difficult to
handle correctly.

Is "unset match" OK here?  There doesn't seem to be an obvious way to
distinguish "there are capture expressions, but none matched anything" from
"there were no capture expressions".  Maybe Oliver has a better clue.

[-- Attachment #2: pcre-unset-match.txt --]
[-- Type: text/plain, Size: 444 bytes --]

diff --git a/Src/Modules/pcre.c b/Src/Modules/pcre.c
index e48ae3ae5..e4cdd8dbd 100644
--- a/Src/Modules/pcre.c
+++ b/Src/Modules/pcre.c
@@ -210,7 +210,8 @@ zpcre_get_substrings(pcre2_code *pat, char *arg, pcre2_match_data *mdata,
 	    }
 	    *x = NULL;
 	    setaparam(substravar, matches);
-	}
+	} else if (substravar)
+	    unsetparam(substravar);
 
 	if (namedassoc
 		&& !pcre2_pattern_info(pat, PCRE2_INFO_NAMECOUNT, &ncount) && ncount


* Re: [PATCH?] Re: [BUG] `$match` is haunting my regex’s trailing, optional, capture
  2023-12-09 20:54   ` [PATCH?] " Bart Schaefer
@ 2023-12-11 23:49     ` Oliver Kiddle
  2023-12-12  1:38       ` Bart Schaefer
  0 siblings, 1 reply; 6+ messages in thread
From: Oliver Kiddle @ 2023-12-11 23:49 UTC
  To: Bart Schaefer; +Cc: chris0e3, Zsh hackers list

Bart Schaefer wrote:
> On Fri, Dec 8, 2023 at 10:23 PM Bart Schaefer <schaefer@brasslantern.com>
> wrote:
>
>     On Fri, Dec 8, 2023 at 9:14 PM <chris0e3@gmail.com> wrote:
>     >
>     >   setopt rematch_pcre
>     >   [[ 'REQUIRE. OPT' =~ 'REQUIRE.(\s*OPT)?' ]] && printf '\tA. ‹%s›\n' $match
>     >   [[ 'REQUIRE.'     =~ 'REQUIRE.(\s*OPT)?' ]] && printf '\tB. ‹%s›\n' $match

Without rematchpcre and with \s changed to just a space, this will set
match=( '' ) which is what would seem most logical to me.

> Is "unset match" OK here?  There doesn't seem to be an obvious way to
> distinguish "there are capture expressions, but none matched anything" from
> "there were no capture expressions".  Maybe Oliver has a better clue.

pcre2_get_ovector_count() will give the number of pairs in the ovector
which, for match data created from the pattern, is one more than the
number of capture expressions the pattern contains. The following:
  [[ 'REQUIRE.1' =~ 'REQUIRE.(\s*O(P)T)?(1)' ]]
results in match=( '' '' 1 ). So adding empty elements at the end too is
consistent with that. pcre2_match's return status only tells us the
last capture element that was set (it is one more than that group's
number).

I didn't find anything in the documentation to confirm that later
elements of the ovector will have been initialised empty but they do
appear to be. If you get garbage instead of empty elements, that'll be
the cause.

Oliver

diff --git a/Src/Modules/pcre.c b/Src/Modules/pcre.c
index e48ae3ae5..a49d1a307 100644
--- a/Src/Modules/pcre.c
+++ b/Src/Modules/pcre.c
@@ -391,6 +391,8 @@ bin_pcre_match(char *nam, char **args, Options ops, UNUSED(int func))
 	pcre_mdata = pcre2_match_data_create_from_pattern(pcre_pattern, NULL);
 	ret = pcre2_match(pcre_pattern, (PCRE2_SPTR) plaintext, subject_len,
 		offset_start, 0, pcre_mdata, mcontext);
+	if (ret > 0)
+	    ret = pcre2_get_ovector_count(pcre_mdata);
     }
 
     if (ret==0) return_value = 0;
@@ -479,7 +481,8 @@ cond_pcre_match(char **a, int id)
 		    break;
 		}
                 else if (r>0) {
-		    zpcre_get_substrings(pcre_pat, lhstr_plain, pcre_mdata, r, svar, avar,
+		    uint32_t ovec_count = pcre2_get_ovector_count(pcre_mdata);
+		    zpcre_get_substrings(pcre_pat, lhstr_plain, pcre_mdata, ovec_count, svar, avar,
 			    ".pcre.match", 0, isset(BASHREMATCH), !isset(BASHREMATCH));
 		    return_value = 1;
 		    break;
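
To illustrate why the two counts discussed above give different results,
here is a minimal, self-contained C sketch. It does not call PCRE2 at all:
`UNSET`, `collect_captures`, `demo_ovec` and `demo_elements` are
hypothetical names made up for this illustration, and the hand-written
offset vector only models what pcre2_match() is described as producing for
'REQUIRE.' matched against 'REQUIRE.(\s*OPT)?'. It is not the code in
Src/Modules/pcre.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNSET  ((size_t)-1)   /* stand-in for PCRE2_UNSET (assumption) */
#define MAXLEN 32             /* arbitrary buffer size for this sketch */

/* Copy groups 1..npairs-1 into out[]; an unset group becomes "".
 * ovec holds (start, end) offset pairs; pair 0 is the whole match.
 * Returns the number of elements written. */
static size_t collect_captures(const char *subject, const size_t *ovec,
                               size_t npairs, char out[][MAXLEN])
{
    size_t n = 0;
    for (size_t i = 1; i < npairs; i++) {
        size_t start = ovec[2 * i], end = ovec[2 * i + 1];
        if (start == UNSET || end == UNSET) {
            out[n][0] = '\0';            /* group did not participate */
        } else {
            size_t len = end - start;
            assert(len < MAXLEN);
            memcpy(out[n], subject + start, len);
            out[n][len] = '\0';
        }
        n++;
    }
    return n;
}

/* 'REQUIRE.' against REQUIRE.(\s*OPT)? : group 0 is set, group 1 is not. */
static const char demo_subject[] = "REQUIRE.";
static const size_t demo_ovec[] = { 0, 8, UNSET, UNSET };

/* How many $match elements would result for a given pair count. */
static size_t demo_elements(size_t npairs)
{
    char out[4][MAXLEN];
    return collect_captures(demo_subject, demo_ovec, npairs, out);
}
```

Under this model, demo_elements(1) (the count taken from the match return
value) yields no elements, which appears to be the case where the old
`$match` was left in place, while demo_elements(2) (the ovector count)
yields one empty element, i.e. match=( '' ).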



* Re: [PATCH?] Re: [BUG] `$match` is haunting my regex’s trailing, optional, capture
  2023-12-11 23:49     ` Oliver Kiddle
@ 2023-12-12  1:38       ` Bart Schaefer
  2024-01-25 22:14         ` Oliver Kiddle
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Schaefer @ 2023-12-12  1:38 UTC
  To: Zsh hackers list

On Mon, Dec 11, 2023 at 3:49 PM Oliver Kiddle <opk@zsh.org> wrote:
>
> diff --git a/Src/Modules/pcre.c b/Src/Modules/pcre.c

So is this instead of my patch or as well?



* Re: [PATCH?] Re: [BUG] `$match` is haunting my regex’s trailing, optional, capture
  2023-12-12  1:38       ` Bart Schaefer
@ 2024-01-25 22:14         ` Oliver Kiddle
  0 siblings, 0 replies; 6+ messages in thread
From: Oliver Kiddle @ 2024-01-25 22:14 UTC
  To: Bart Schaefer; +Cc: Zsh hackers list

On 11 Dec, Bart Schaefer wrote:
> On Mon, Dec 11, 2023 at 3:49 PM Oliver Kiddle <opk@zsh.org> wrote:
> >
> > diff --git a/Src/Modules/pcre.c b/Src/Modules/pcre.c
>
> So is this instead of my patch or as well?

Instead. Sorry for not getting back sooner. I had forgotten this. The
following adds a test case.

Oliver

diff --git a/Test/V07pcre.ztst b/Test/V07pcre.ztst
index 585698d05..b8cd31c96 100644
--- a/Test/V07pcre.ztst
+++ b/Test/V07pcre.ztst
@@ -108,6 +108,11 @@
 >0 xo→t →t
 >0 Xo→t →t
 
+  [[ foo =~ (pre)?f(o*)(opt(i)onal)?(y)* ]]
+  typeset -p match
+0:Empty string for optional captures that don't match
+>typeset -g -a match=( '' oo '' '' '' )
+
   string="The following zip codes: 78884 90210 99513"
   pcre_compile -m "\d{5}"
   pcre_match -b -- $string && print "$MATCH; ZPCRE_OP: $ZPCRE_OP"
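
The expected output in the new test case can be checked by hand with a
small model. The following C sketch is only an illustration under stated
assumptions: it does not use PCRE2, `UNSET`, `model_rc`, `foo_ovec` and
`FOO_PAIRS` are hypothetical names, and the offsets are written out
manually for 'foo' matched against (pre)?f(o*)(opt(i)onal)?(y)*.

```c
#include <assert.h>
#include <stddef.h>

#define UNSET ((size_t)-1)   /* stand-in for PCRE2_UNSET (assumption) */

/* Model of a successful match return value:
 * one more than the highest-numbered group that was set. */
static size_t model_rc(const size_t *ovec, size_t npairs)
{
    size_t rc = 0;
    assert(ovec != NULL);
    for (size_t i = 0; i < npairs; i++)
        if (ovec[2 * i] != UNSET)
            rc = i + 1;
    return rc;
}

/* Only groups 0 and 2 take part in the match of 'foo'. */
static const size_t foo_ovec[] = {
    0, 3,           /* group 0: whole match "foo" */
    UNSET, UNSET,   /* group 1: (pre)?            */
    1, 3,           /* group 2: (o*) = "oo"       */
    UNSET, UNSET,   /* group 3: (opt(i)onal)?     */
    UNSET, UNSET,   /* group 4: (i)               */
    UNSET, UNSET,   /* group 5: (y)*              */
};
#define FOO_PAIRS (sizeof foo_ovec / sizeof foo_ovec[0] / 2)
```

Here model_rc(foo_ovec, FOO_PAIRS) is 3, so a count based on the return
value would cover only groups 1 and 2, whereas FOO_PAIRS is 6 and covers
all five groups, which lines up with the five elements in the expected
`typeset -p match` output.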




Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/
