zsh-workers
 help / color / mirror / Atom feed
* regexp-replace and ^, word boundary or look-behind operators
@ 2019-12-16 21:10 Stephane Chazelas
  2019-12-16 21:27 ` Stephane Chazelas
  0 siblings, 1 reply; 10+ messages in thread
From: Stephane Chazelas @ 2019-12-16 21:10 UTC (permalink / raw)
  To: Zsh hackers list

The way regexp-replace works means that these things:

$ a='aaab'; regexp-replace a '^a' x; echo "$a"
xxxb
$ a='abab'; regexp-replace a '\<ab' '<$MATCH>'; echo $a
<ab><ab>
$ set -o rematchpcre
$ a=xxx; regexp-replace a '(?<!x)x' y; echo $a
yyy

don't work properly as after the first substitution, the regex
is no longer matched on the full subject, but on the part of
subject after the last match.

I don't think that can be fixed without exposing more of the
regex/pcre C API.

-- 
Stephane

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-16 21:10 regexp-replace and ^, word boundary or look-behind operators Stephane Chazelas
@ 2019-12-16 21:27 ` Stephane Chazelas
  2019-12-17  7:38   ` Stephane Chazelas
  0 siblings, 1 reply; 10+ messages in thread
From: Stephane Chazelas @ 2019-12-16 21:27 UTC (permalink / raw)
  To: Zsh hackers list

2019-12-16 21:10:13 +0000, Stephane Chazelas:
> The way regexp-replace works means that these things:
> 
> $ a='aaab'; regexp-replace a '^a' x; echo "$a"
> xxxb
> $ a='abab'; regexp-replace a '\<ab' '<$MATCH>'; echo $a
> <ab><ab>
> $ set -o rematchpcre
> $ a=xxx; regexp-replace a '(?<!x)x' y; echo $a
> yyy
[...]

FWIW, looks like some sed implementations (like that of the
heirloom toolchest or busybox) or ksh93 have the same problem:

$ echo xxx | busybox sed 's/\<x/y/g'
yyy
$ a=xxx ksh -c 'echo ${a//~(E:^x)/y}'
yyy
$ a=xxx ksh -c 'echo ${a//[[:<:]]x/y}'
yyy

It may be that the POSIX regex API doesn't have a way to fix
that (REG_NOTBOL addresses the ^ case, but there's nothing about
\< / \b / [[:<]] which are non-POSIX extensions anyway).

PCRE should be OK, so it could be just a matter of
exposing it via the pcre_match builtin and document the
limitation otherwise for EREs (PCRE is the new de-facto standard
anyway).

-- 
Stephane

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-16 21:27 ` Stephane Chazelas
@ 2019-12-17  7:38   ` Stephane Chazelas
  2019-12-17 11:11     ` [PATCH] " Stephane Chazelas
  0 siblings, 1 reply; 10+ messages in thread
From: Stephane Chazelas @ 2019-12-17  7:38 UTC (permalink / raw)
  To: Zsh hackers list

2019-12-16 21:27:06 +0000, Stephane Chazelas:
[...]
> PCRE should be OK, so it could be just a matter of
> exposing it via the pcre_match builtin
[...]

D'oh, it's there already with the -b, -n options.

I'll try and suggest a regexp-replace improvement using that.

-- 
Stephane

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-17  7:38   ` Stephane Chazelas
@ 2019-12-17 11:11     ` Stephane Chazelas
  2019-12-18  0:22       ` Daniel Shahaf
  0 siblings, 1 reply; 10+ messages in thread
From: Stephane Chazelas @ 2019-12-17 11:11 UTC (permalink / raw)
  To: zsh-workers

2019-12-17 07:38:46 +0000, Stephane Chazelas:
> 2019-12-16 21:27:06 +0000, Stephane Chazelas:
> [...]
> > PCRE should be OK, so it could be just a matter of
> > exposing it via the pcre_match builtin
> [...]
> 
> D'oh, it's there already with the -b, -n options.
> 
> I'll try and suggest a regexp-replace improvement using that.
[...]

There's another issue in that the zero-width matches cause
infinite loops. Here's my first attempt at fixing those issues
(also fixing a few issues in the zpgrep example functions while
I'm at it):

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index d32ba018d..61e6a434f 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4301,6 +4301,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if not using PCRE, using the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..41ea9d79e 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -8,36 +8,79 @@
 # $ and backtick substitutions; in particular, $MATCH will be replaced
 # by the portion of the string matched by the regular expression.
 
-integer pcre
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable. Make sure we start with 3 and only
+# 3 elements:
+argv=("$1" "$2" "$3")
 
-[[ -o re_match_pcre ]] && pcre=1
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
+
+
 local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $5, $6 reserved for later
+  4=0 5= 6=1
+
+  local ZPCRE_OP IFS=' '
+  while pcre_match -b -n $4 -- "${(P)1}"; do
+    # append offsets and computed replacement to the array
+    argv+=($=ZPCRE_OP ${(e)3})
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$((argv[-2] + (argv[-3] == argv[-2])))
+  done
+
+  (($# > 6)) || return # no match
+
+  set +o multibyte
+
+  # $5 contains the result, $6 the current offset
+  for 2 3 4 in "$@[7,-1]"; do
+    5+=${(P)1[$6,$2]}$4
+    6=$(($3 + 1))
+  done
+  5+=${(P)1[$6,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+
+  # $4 is the string to be matched
+  4=${(P)1}
+
+  while [[ -n $4 ]]; do
+    if [[ $4 =~ $2 ]]; then
+      # append initial part and substituted match
+      5+=${4[1,MBEGIN-1]}${(e)3}
+      # truncate remaining string
+      if ((MEND < MBEGIN)); then
+        # zero-width match, skip one character for the next match
+        ((MEND++))
+	5+=${4[1]}
+      fi
+      4=${4[MEND+1,-1]}
+      # indicate we did something
+      6=1
+    else
+      break
+    fi
+  done
+  [[ -n $6 ]] || return # no match
+  5+=$4
+fi
+
+eval $1=\$5


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-17 11:11     ` [PATCH] " Stephane Chazelas
@ 2019-12-18  0:22       ` Daniel Shahaf
  2019-12-18  8:31         ` Stephane Chazelas
  2020-01-01 14:03         ` [PATCH v2] " Stephane Chazelas
  0 siblings, 2 replies; 10+ messages in thread
From: Daniel Shahaf @ 2019-12-18  0:22 UTC (permalink / raw)
  To: Stephane Chazelas, zsh-workers

Stephane Chazelas wrote on Tue, 17 Dec 2019 11:11 +00:00:
> +++ b/Doc/Zsh/contrib.yo
> @@ -4301,6 +4301,9 @@ and arithmetic expressions which will be 
> replaced:  in particular, a
>  reference to tt($MATCH) will be replaced by the text matched by the 
> pattern.
>  
>  The return status is 0 if at least one match was performed, else 1.
> +
> +Note that if not using PCRE, using the tt(^) or word boundary operators
> +(where available) may not work properly.

Suggest to avoid the double negative:

1. s/not using PCRE/using POSIX ERE's/

2. Add "(ERE's)" after "POSIX extended regular expressions" in the first paragraph

I'll push a minor change to that first paragraph in a moment.

>  )
> +++ b/Functions/Example/zpgrep
> @@ -2,24 +2,31 @@
> +eval $1=\$5

How about «: ${(P)1::="$5"}» to avoid eval?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-18  0:22       ` Daniel Shahaf
@ 2019-12-18  8:31         ` Stephane Chazelas
  2020-01-01 14:03         ` [PATCH v2] " Stephane Chazelas
  1 sibling, 0 replies; 10+ messages in thread
From: Stephane Chazelas @ 2019-12-18  8:31 UTC (permalink / raw)
  To: Daniel Shahaf; +Cc: zsh-workers

2019-12-18 00:22:53 +0000, Daniel Shahaf:
[...]
> > +eval $1=\$5
> 
> How about «: ${(P)1::="$5"}» to avoid eval?

I suppose that would work but would not prevent code injection
vulnerabilities if $1 was not guaranteed to contain a valid
variable name:

$ 1='a[`uname>&2`]'
$ : ${(P)1::="$5"}
Linux
zsh: bad math expression: empty string
Linux
zsh: bad math expression: empty string

Note that uname was run twice suggesting it's potentially less
efficient than using eval (IIRC, that was already discussed
here. possibly that was fixed in a newer version).

Here, I'd say it's the caller's responsibility to make sure they
pass a valid lvalue as first argument.

-- 
Stephane

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
  2019-12-18  0:22       ` Daniel Shahaf
  2019-12-18  8:31         ` Stephane Chazelas
@ 2020-01-01 14:03         ` Stephane Chazelas
  2021-04-30  6:11           ` Stephane Chazelas
  1 sibling, 1 reply; 10+ messages in thread
From: Stephane Chazelas @ 2020-01-01 14:03 UTC (permalink / raw)
  To: zsh-workers

2019-12-18 00:22:53 +0000, Daniel Shahaf:
[...]
> > +
> > +Note that if not using PCRE, using the tt(^) or word boundary operators
> > +(where available) may not work properly.
> 
> Suggest to avoid the double negative:
> 
> 1. s/not using PCRE/using POSIX ERE's/
> 
> 2. Add "(ERE's)" after "POSIX extended regular expressions" in the first paragraph
> 
> I'll push a minor change to that first paragraph in a moment.
[...]

Thanks, I've incorporated that suggesting and fixed an issue
with PCRE when the replacement was empty or generated more than
one element.

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 558342711..9a804fc11 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4284,7 +4284,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable.  POSIX extended regular expressions are used,
+on a variable.  POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4302,6 +4302,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..0d5948075 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -8,36 +8,84 @@
 # $ and backtick substitutions; in particular, $MATCH will be replaced
 # by the portion of the string matched by the regular expression.
 
-integer pcre
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable. Make sure we start with 3 and only
+# 3 elements:
+argv=("$1" "$2" "$3")
 
-[[ -o re_match_pcre ]] && pcre=1
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
+
+
 local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $5, $6 reserved for later use
+  4=0 6=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "${(P)1}"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    5=${(e)3}
+    argv+=(${(s: :)ZPCRE_OP} "$5")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$((argv[-2] + (argv[-3] == argv[-2])))
+  done
+
+  (($# > 6)) || return # no match
+
+  set +o multibyte
+
+  # $5 contains the result, $6 the current offset
+  5= 6=1
+  for 2 3 4 in "$@[7,-1]"; do
+    5+=${(P)1[$6,$2]}$4
+    6=$(($3 + 1))
+  done
+  5+=${(P)1[$6,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+
+  # $4 is the string to be matched
+  4=${(P)1}
+
+  while [[ -n $4 ]]; do
+    if [[ $4 =~ $2 ]]; then
+      # append initial part and substituted match
+      5+=${4[1,MBEGIN-1]}${(e)3}
+      # truncate remaining string
+      if ((MEND < MBEGIN)); then
+        # zero-width match, skip one character for the next match
+        ((MEND++))
+	5+=${4[1]}
+      fi
+      4=${4[MEND+1,-1]}
+      # indicate we did something
+      6=1
+    else
+      break
+    fi
+  done
+  [[ -n $6 ]] || return # no match
+  5+=$4
+fi
+
+eval $1=\$5


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
  2020-01-01 14:03         ` [PATCH v2] " Stephane Chazelas
@ 2021-04-30  6:11           ` Stephane Chazelas
  2021-04-30 23:13             ` Bart Schaefer
  0 siblings, 1 reply; 10+ messages in thread
From: Stephane Chazelas @ 2021-04-30  6:11 UTC (permalink / raw)
  To: Zsh hackers list

ping.

2020-01-01 14:03:43 +0000, Stephane Chazelas:
2019-12-18 00:22:53 +0000, Daniel Shahaf:
[...]
> > +
> > +Note that if not using PCRE, using the tt(^) or word boundary operators
> > +(where available) may not work properly.
> 
> Suggest to avoid the double negative:
> 
> 1. s/not using PCRE/using POSIX ERE's/
> 
> 2. Add "(ERE's)" after "POSIX extended regular expressions" in the first paragraph
> 
> I'll push a minor change to that first paragraph in a moment.
[...]

Thanks, I've incorporated that suggesting and fixed an issue
with PCRE when the replacement was empty or generated more than
one element.

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 558342711..9a804fc11 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4284,7 +4284,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable.  POSIX extended regular expressions are used,
+on a variable.  POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4302,6 +4302,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..0d5948075 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -8,36 +8,84 @@
 # $ and backtick substitutions; in particular, $MATCH will be replaced
 # by the portion of the string matched by the regular expression.
 
-integer pcre
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable. Make sure we start with 3 and only
+# 3 elements:
+argv=("$1" "$2" "$3")
 
-[[ -o re_match_pcre ]] && pcre=1
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
+
+
 local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $5, $6 reserved for later use
+  4=0 6=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "${(P)1}"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    5=${(e)3}
+    argv+=(${(s: :)ZPCRE_OP} "$5")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$((argv[-2] + (argv[-3] == argv[-2])))
+  done
+
+  (($# > 6)) || return # no match
+
+  set +o multibyte
+
+  # $5 contains the result, $6 the current offset
+  5= 6=1
+  for 2 3 4 in "$@[7,-1]"; do
+    5+=${(P)1[$6,$2]}$4
+    6=$(($3 + 1))
+  done
+  5+=${(P)1[$6,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+
+  # $4 is the string to be matched
+  4=${(P)1}
+
+  while [[ -n $4 ]]; do
+    if [[ $4 =~ $2 ]]; then
+      # append initial part and substituted match
+      5+=${4[1,MBEGIN-1]}${(e)3}
+      # truncate remaining string
+      if ((MEND < MBEGIN)); then
+        # zero-width match, skip one character for the next match
+        ((MEND++))
+	5+=${4[1]}
+      fi
+      4=${4[MEND+1,-1]}
+      # indicate we did something
+      6=1
+    else
+      break
+    fi
+  done
+  [[ -n $6 ]] || return # no match
+  5+=$4
+fi
+
+eval $1=\$5


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
  2021-04-30  6:11           ` Stephane Chazelas
@ 2021-04-30 23:13             ` Bart Schaefer
  2021-05-05 11:45               ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  0 siblings, 1 reply; 10+ messages in thread
From: Bart Schaefer @ 2021-04-30 23:13 UTC (permalink / raw)
  To: Zsh hackers list

On Thu, Apr 29, 2021 at 11:12 PM Stephane Chazelas
<stephane.chazelas@gmail.com> wrote:
>
> ping.

I went back and looked at the patch again.

Tangential question:  "pgrep" commonly refers to grepping the process
list, and is linked to "pkill".  I know "zpgrep" precedes this patch,
but I'm wondering if we should rename it.

More directly about regexp-replace:

If $argv[4,-1] are going to be ignored/discarded, perhaps there should
be a warning?  (Another thing that predates the patch, I know)

What do you think about replacing the final eval with typeset -g, as
mentioned in workers/48760 ?

I don't see any reason to doubt the accuracy of the implementation, so
except possibly for that last question I think the patch could be
applied and the other two items addressed elsewhere.  Thoughts?


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
  2021-04-30 23:13             ` Bart Schaefer
@ 2021-05-05 11:45               ` Stephane Chazelas
  0 siblings, 0 replies; 10+ messages in thread
From: Stephane Chazelas @ 2021-05-05 11:45 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2021-04-30 16:13:34 -0700, Bart Schaefer:
[...]
> I went back and looked at the patch again.

Thanks. Here's a third version with further improvements
addressing some of the comments here.

> Tangential question:  "pgrep" commonly refers to grepping the process
> list, and is linked to "pkill".  I know "zpgrep" precedes this patch,
> but I'm wondering if we should rename it.

I agree, zshpcregrep or zpmatch may be better names. There
exists a pcregrep command, zpcregrep would be likely interpreted
as zip-pcregrep. I'll leave it out for now. IMO, that zpgrep
serves more as example code than a command people would actually
use, so it probably doesn't matter much.

> More directly about regexp-replace:
> 
> If $argv[4,-1] are going to be ignored/discarded, perhaps there should
> be a warning?  (Another thing that predates the patch, I know)

Agreed. I've addressed that.

> What do you think about replacing the final eval with typeset -g, as
> mentioned in workers/48760 ?

I've compared:

(1) eval -- $lvalue=\$value
(2) Bart's typeset -g -- $lvalue=$value
(3) Daniel's (zsh-workers 45073) : ${(P)lvalue::="$value"}

(1) to me is the most legible but if $lvalue is not a valid
lvalue, it doesn't necessarily return a useful error message to
the user (like when lvalue='reboot;var'...)

(2) is also very legible. It has the benefit (or inconvenience
depending on PoV) of returning an error if the lvalue is not a
scalar. It reports an error (and exits the shell process)  upon
incorrect lvalues (except ones such as "var=foo"). A major
drawback though is that if chokes on lvalue='array[n=1]' or
lvalue='assoc[keywith=characters]'

(3) is the least legible. It also causes the lvalue to be
dereferenced twice. For instance with lvalue='a[++n]', n is
incremented twice. However, it does report an error upon invalid
lvalue (even though ${(P)lvalue} alone doesn't), and as we use
${(P)lvalue} above already, that has the benefit of that lvalue
being interpreted consistently. Non-scalar variables are
converted to scalar (like with (1)). It works OK for
lvalue='assoc[$key]' and lvalue='assoc[=]' or
lvalue='assoc[\]]' for instance.

Performance wise, for usual cases (lvalue being a simple
variable name and value short enough), (1) seems to be the worst
in my tests and (3) best, (2) very close. But that's reversed
for the less usual cases.

So, I've gone for (3), changed the code to limit the number of
times the lvalue is dereferenced. I've also addressed an issue
whereby regexp-replace empty '^' x would not insert x in ERE
mode.

(note that it is affected by the (e) failure exit code issue
I've just raised separately; I'm not attempting to work around
it here; though I've added the X flag for error reporting to be
more consistent)

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 8bf1a208e..db06d7925 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4328,7 +4328,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable.  POSIX extended regular expressions are used,
+on a variable.  POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4346,6 +4346,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..c947a2043 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,43 +1,109 @@
-# Replace all occurrences of a regular expression in a variable.  The
-# variable is modified directly.  Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly.  Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string.  This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-integer pcre
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue,
+#    expected to be scalar.  That lvalue will be evaluated once to
+#    retrieve the current value, and two more times (not just one as a
+#    side effect of using ${(P)varname::=$value}; FIXME) for the
+#    assignment of the new value if a substitution was made.  So lvalues
+#    such as array[++n] where the subscript is dynamic should be
+#    avoided.
+#
+# 2. regular expression
+#
+# 3. replacement string.  This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+#
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable.
 
-[[ -o re_match_pcre ]] && pcre=1
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 <varname> <regexp> [<replacement>]"
+  return 2
+fi
 
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
-local MATCH MBEGIN MEND
+
+# $5 is the string to be matched
+5=${(P)1}
+
+local    MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $6, $7 reserved for later use
+  4=0 7=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "$5"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    6=${(Xe)3}
+    argv+=(${(s: :)ZPCRE_OP} "$6")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$(( argv[-2] + (argv[-3] == argv[-2]) ))
+  done
+
+  (( $# > 7 )) || return # no match
+
+  set +o multibyte
+
+  # $6 contains the result, $7 the current offset
+  6= 7=1
+  for 2 3 4 in "$@[8,-1]"; do
+    6+=${5[$7,$2]}$4
+    7=$(( $3 + 1 ))
+  done
+  6+=${5[$7,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+  while
+    if [[ $5 =~ $2 ]]; then
+      # append initial part and substituted match
+      6+=${5[1,MBEGIN-1]}${(Xe)3}
+      # truncate remaining string
+      if (( MEND < MBEGIN )); then
+	# zero-width match, skip one character for the next match
+	(( MEND++ ))
+	6+=${5[1]}
+      fi
+      5=${5[MEND+1,-1]}
+      # indicate we did something
+      7=1
+    fi
+    [[ -n $5 ]]
+  do
+    continue
+  done
+  [[ -n $7 ]] || return # no match
+  6+=$5
+fi
+
+# assign result to target variable if at least one substitution was
+# made.  At this point, if the variable was originally array or assoc, it
+# is converted to scalar. If $1 doesn't contain a valid lvalue
+# specification, an exception is raised (exits the shell process if
+# non-interactive).
+: ${(P)1::="$6"}


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-05-05 11:45 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-16 21:10 regexp-replace and ^, word boundary or look-behind operators Stephane Chazelas
2019-12-16 21:27 ` Stephane Chazelas
2019-12-17  7:38   ` Stephane Chazelas
2019-12-17 11:11     ` [PATCH] " Stephane Chazelas
2019-12-18  0:22       ` Daniel Shahaf
2019-12-18  8:31         ` Stephane Chazelas
2020-01-01 14:03         ` [PATCH v2] " Stephane Chazelas
2021-04-30  6:11           ` Stephane Chazelas
2021-04-30 23:13             ` Bart Schaefer
2021-05-05 11:45               ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas

zsh-workers

This inbox may be cloned and mirrored by anyone:

	git clone --mirror http://inbox.vuxu.org/zsh-workers

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 zsh-workers zsh-workers/ http://inbox.vuxu.org/zsh-workers \
		zsh-workers@zsh.org
	public-inbox-index zsh-workers

Example config snippet for mirrors.
Newsgroup available over NNTP:
	nntp://inbox.vuxu.org/vuxu.archive.zsh.workers


code repositories for the project(s) associated with this inbox:

	https://git.vuxu.org/mirror/zsh/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git