regexp-replace and ^, word boundary or look-behind operators

zsh-workers
 help / color / mirror / code / Atom feed

* regexp-replace and ^, word boundary or look-behind operators
@ 2019-12-16 21:10 Stephane Chazelas
  2019-12-16 21:27 ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2019-12-16 21:10 UTC (permalink / raw)
  To: Zsh hackers list

The way regexp-replace works means that these things:

$ a='aaab'; regexp-replace a '^a' x; echo "$a"
xxxb
$ a='abab'; regexp-replace a '\<ab' '<$MATCH>'; echo $a
<ab><ab>
$ set -o rematchpcre
$ a=xxx; regexp-replace a '(?<!x)x' y; echo $a
yyy

don't work properly as after the first substitution, the regex
is no longer matched on the full subject, but on the part of
subject after the last match.

I don't think that can be fixed without exposing more of the
regex/pcre C API.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-16 21:10 regexp-replace and ^, word boundary or look-behind operators Stephane Chazelas
@ 2019-12-16 21:27 ` Stephane Chazelas
  2019-12-17  7:38   ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2019-12-16 21:27 UTC (permalink / raw)
  To: Zsh hackers list

2019-12-16 21:10:13 +0000, Stephane Chazelas:
> The way regexp-replace works means that these things:
> 
> $ a='aaab'; regexp-replace a '^a' x; echo "$a"
> xxxb
> $ a='abab'; regexp-replace a '\<ab' '<$MATCH>'; echo $a
> <ab><ab>
> $ set -o rematchpcre
> $ a=xxx; regexp-replace a '(?<!x)x' y; echo $a
> yyy
[...]

FWIW, looks like some sed implementations (like that of the
heirloom toolchest or busybox) or ksh93 have the same problem:

$ echo xxx | busybox sed 's/\<x/y/g'
yyy
$ a=xxx ksh -c 'echo ${a//~(E:^x)/y}'
yyy
$ a=xxx ksh -c 'echo ${a//[[:<:]]x/y}'
yyy

It may be that the POSIX regex API doesn't have a way to fix
that (REG_NOTBOL addresses the ^ case, but there's nothing about
\< / \b / [[:<]] which are non-POSIX extensions anyway).

PCRE should be OK, so it could be just a matter of
exposing it via the pcre_match builtin and document the
limitation otherwise for EREs (PCRE is the new de-facto standard
anyway).

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-16 21:27 ` Stephane Chazelas
@ 2019-12-17  7:38   ` Stephane Chazelas
  2019-12-17 11:11     ` [PATCH] " Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2019-12-17  7:38 UTC (permalink / raw)
  To: Zsh hackers list

2019-12-16 21:27:06 +0000, Stephane Chazelas:
[...]
> PCRE should be OK, so it could be just a matter of
> exposing it via the pcre_match builtin
[...]

D'oh, it's there already with the -b, -n options.

I'll try and suggest a regexp-replace improvement using that.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH] Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-17  7:38   ` Stephane Chazelas
@ 2019-12-17 11:11     ` Stephane Chazelas
  2019-12-18  0:22       ` Daniel Shahaf
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2019-12-17 11:11 UTC (permalink / raw)
  To: zsh-workers

2019-12-17 07:38:46 +0000, Stephane Chazelas:
> 2019-12-16 21:27:06 +0000, Stephane Chazelas:
> [...]
> > PCRE should be OK, so it could be just a matter of
> > exposing it via the pcre_match builtin
> [...]
> 
> D'oh, it's there already with the -b, -n options.
> 
> I'll try and suggest a regexp-replace improvement using that.
[...]

There's another issue in that the zero-width matches cause
infinite loops. Here's my first attempt at fixing those issues
(also fixing a few issues in the zpgrep example functions while
I'm at it):

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index d32ba018d..61e6a434f 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4301,6 +4301,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if not using PCRE, using the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..41ea9d79e 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -8,36 +8,79 @@
 # $ and backtick substitutions; in particular, $MATCH will be replaced
 # by the portion of the string matched by the regular expression.
 
-integer pcre
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable. Make sure we start with 3 and only
+# 3 elements:
+argv=("$1" "$2" "$3")
 
-[[ -o re_match_pcre ]] && pcre=1
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
+
+
 local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $5, $6 reserved for later
+  4=0 5= 6=1
+
+  local ZPCRE_OP IFS=' '
+  while pcre_match -b -n $4 -- "${(P)1}"; do
+    # append offsets and computed replacement to the array
+    argv+=($=ZPCRE_OP ${(e)3})
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$((argv[-2] + (argv[-3] == argv[-2])))
+  done
+
+  (($# > 6)) || return # no match
+
+  set +o multibyte
+
+  # $5 contains the result, $6 the current offset
+  for 2 3 4 in "$@[7,-1]"; do
+    5+=${(P)1[$6,$2]}$4
+    6=$(($3 + 1))
+  done
+  5+=${(P)1[$6,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+
+  # $4 is the string to be matched
+  4=${(P)1}
+
+  while [[ -n $4 ]]; do
+    if [[ $4 =~ $2 ]]; then
+      # append initial part and substituted match
+      5+=${4[1,MBEGIN-1]}${(e)3}
+      # truncate remaining string
+      if ((MEND < MBEGIN)); then
+        # zero-width match, skip one character for the next match
+        ((MEND++))
+	5+=${4[1]}
+      fi
+      4=${4[MEND+1,-1]}
+      # indicate we did something
+      6=1
+    else
+      break
+    fi
+  done
+  [[ -n $6 ]] || return # no match
+  5+=$4
+fi
+
+eval $1=\$5


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-17 11:11     ` [PATCH] " Stephane Chazelas
@ 2019-12-18  0:22       ` Daniel Shahaf
  2019-12-18  8:31         ` Stephane Chazelas
  2020-01-01 14:03         ` [PATCH v2] " Stephane Chazelas
  0 siblings, 2 replies; 58+ messages in thread
From: Daniel Shahaf @ 2019-12-18  0:22 UTC (permalink / raw)
  To: Stephane Chazelas, zsh-workers

Stephane Chazelas wrote on Tue, 17 Dec 2019 11:11 +00:00:
> +++ b/Doc/Zsh/contrib.yo
> @@ -4301,6 +4301,9 @@ and arithmetic expressions which will be 
> replaced:  in particular, a
>  reference to tt($MATCH) will be replaced by the text matched by the 
> pattern.
>  
>  The return status is 0 if at least one match was performed, else 1.
> +
> +Note that if not using PCRE, using the tt(^) or word boundary operators
> +(where available) may not work properly.

Suggest to avoid the double negative:

1. s/not using PCRE/using POSIX ERE's/

2. Add "(ERE's)" after "POSIX extended regular expressions" in the first paragraph

I'll push a minor change to that first paragraph in a moment.

>  )
> +++ b/Functions/Example/zpgrep
> @@ -2,24 +2,31 @@
> +eval $1=\$5

How about «: ${(P)1::="$5"}» to avoid eval?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] Re: regexp-replace and ^, word boundary or look-behind operators
  2019-12-18  0:22       ` Daniel Shahaf
@ 2019-12-18  8:31         ` Stephane Chazelas
  2020-01-01 14:03         ` [PATCH v2] " Stephane Chazelas
  1 sibling, 0 replies; 58+ messages in thread
From: Stephane Chazelas @ 2019-12-18  8:31 UTC (permalink / raw)
  To: Daniel Shahaf; +Cc: zsh-workers

2019-12-18 00:22:53 +0000, Daniel Shahaf:
[...]
> > +eval $1=\$5
> 
> How about «: ${(P)1::="$5"}» to avoid eval?

I suppose that would work but would not prevent code injection
vulnerabilities if $1 was not guaranteed to contain a valid
variable name:

$ 1='a[`uname>&2`]'
$ : ${(P)1::="$5"}
Linux
zsh: bad math expression: empty string
Linux
zsh: bad math expression: empty string

Note that uname was run twice suggesting it's potentially less
efficient than using eval (IIRC, that was already discussed
here. possibly that was fixed in a newer version).

Here, I'd say it's the caller's responsibility to make sure they
pass a valid lvalue as first argument.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
  2019-12-18  0:22       ` Daniel Shahaf
  2019-12-18  8:31         ` Stephane Chazelas
@ 2020-01-01 14:03         ` Stephane Chazelas
  2021-04-30  6:11           ` Stephane Chazelas
  1 sibling, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2020-01-01 14:03 UTC (permalink / raw)
  To: zsh-workers

2019-12-18 00:22:53 +0000, Daniel Shahaf:
[...]
> > +
> > +Note that if not using PCRE, using the tt(^) or word boundary operators
> > +(where available) may not work properly.
> 
> Suggest to avoid the double negative:
> 
> 1. s/not using PCRE/using POSIX ERE's/
> 
> 2. Add "(ERE's)" after "POSIX extended regular expressions" in the first paragraph
> 
> I'll push a minor change to that first paragraph in a moment.
[...]

Thanks, I've incorporated that suggesting and fixed an issue
with PCRE when the replacement was empty or generated more than
one element.

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 558342711..9a804fc11 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4284,7 +4284,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable.  POSIX extended regular expressions are used,
+on a variable.  POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4302,6 +4302,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..0d5948075 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -8,36 +8,84 @@
 # $ and backtick substitutions; in particular, $MATCH will be replaced
 # by the portion of the string matched by the regular expression.
 
-integer pcre
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable. Make sure we start with 3 and only
+# 3 elements:
+argv=("$1" "$2" "$3")
 
-[[ -o re_match_pcre ]] && pcre=1
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
+
+
 local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $5, $6 reserved for later use
+  4=0 6=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "${(P)1}"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    5=${(e)3}
+    argv+=(${(s: :)ZPCRE_OP} "$5")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$((argv[-2] + (argv[-3] == argv[-2])))
+  done
+
+  (($# > 6)) || return # no match
+
+  set +o multibyte
+
+  # $5 contains the result, $6 the current offset
+  5= 6=1
+  for 2 3 4 in "$@[7,-1]"; do
+    5+=${(P)1[$6,$2]}$4
+    6=$(($3 + 1))
+  done
+  5+=${(P)1[$6,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+
+  # $4 is the string to be matched
+  4=${(P)1}
+
+  while [[ -n $4 ]]; do
+    if [[ $4 =~ $2 ]]; then
+      # append initial part and substituted match
+      5+=${4[1,MBEGIN-1]}${(e)3}
+      # truncate remaining string
+      if ((MEND < MBEGIN)); then
+        # zero-width match, skip one character for the next match
+        ((MEND++))
+	5+=${4[1]}
+      fi
+      4=${4[MEND+1,-1]}
+      # indicate we did something
+      6=1
+    else
+      break
+    fi
+  done
+  [[ -n $6 ]] || return # no match
+  5+=$4
+fi
+
+eval $1=\$5


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
  2020-01-01 14:03         ` [PATCH v2] " Stephane Chazelas
@ 2021-04-30  6:11           ` Stephane Chazelas
  2021-04-30 23:13             ` Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-04-30  6:11 UTC (permalink / raw)
  To: Zsh hackers list

ping.

2020-01-01 14:03:43 +0000, Stephane Chazelas:
2019-12-18 00:22:53 +0000, Daniel Shahaf:
[...]
> > +
> > +Note that if not using PCRE, using the tt(^) or word boundary operators
> > +(where available) may not work properly.
> 
> Suggest to avoid the double negative:
> 
> 1. s/not using PCRE/using POSIX ERE's/
> 
> 2. Add "(ERE's)" after "POSIX extended regular expressions" in the first paragraph
> 
> I'll push a minor change to that first paragraph in a moment.
[...]

Thanks, I've incorporated that suggesting and fixed an issue
with PCRE when the replacement was empty or generated more than
one element.

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 558342711..9a804fc11 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4284,7 +4284,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable.  POSIX extended regular expressions are used,
+on a variable.  POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4302,6 +4302,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..0d5948075 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -8,36 +8,84 @@
 # $ and backtick substitutions; in particular, $MATCH will be replaced
 # by the portion of the string matched by the regular expression.
 
-integer pcre
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable. Make sure we start with 3 and only
+# 3 elements:
+argv=("$1" "$2" "$3")
 
-[[ -o re_match_pcre ]] && pcre=1
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
+
+
 local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $5, $6 reserved for later use
+  4=0 6=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "${(P)1}"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    5=${(e)3}
+    argv+=(${(s: :)ZPCRE_OP} "$5")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$((argv[-2] + (argv[-3] == argv[-2])))
+  done
+
+  (($# > 6)) || return # no match
+
+  set +o multibyte
+
+  # $5 contains the result, $6 the current offset
+  5= 6=1
+  for 2 3 4 in "$@[7,-1]"; do
+    5+=${(P)1[$6,$2]}$4
+    6=$(($3 + 1))
+  done
+  5+=${(P)1[$6,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+
+  # $4 is the string to be matched
+  4=${(P)1}
+
+  while [[ -n $4 ]]; do
+    if [[ $4 =~ $2 ]]; then
+      # append initial part and substituted match
+      5+=${4[1,MBEGIN-1]}${(e)3}
+      # truncate remaining string
+      if ((MEND < MBEGIN)); then
+        # zero-width match, skip one character for the next match
+        ((MEND++))
+	5+=${4[1]}
+      fi
+      4=${4[MEND+1,-1]}
+      # indicate we did something
+      6=1
+    else
+      break
+    fi
+  done
+  [[ -n $6 ]] || return # no match
+  5+=$4
+fi
+
+eval $1=\$5


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
  2021-04-30  6:11           ` Stephane Chazelas
@ 2021-04-30 23:13             ` Bart Schaefer
  2021-05-05 11:45               ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-04-30 23:13 UTC (permalink / raw)
  To: Zsh hackers list

On Thu, Apr 29, 2021 at 11:12 PM Stephane Chazelas
<stephane.chazelas@gmail.com> wrote:
>
> ping.

I went back and looked at the patch again.

Tangential question:  "pgrep" commonly refers to grepping the process
list, and is linked to "pkill".  I know "zpgrep" precedes this patch,
but I'm wondering if we should rename it.

More directly about regexp-replace:

If $argv[4,-1] are going to be ignored/discarded, perhaps there should
be a warning?  (Another thing that predates the patch, I know)

What do you think about replacing the final eval with typeset -g, as
mentioned in workers/48760 ?

I don't see any reason to doubt the accuracy of the implementation, so
except possibly for that last question I think the patch could be
applied and the other two items addressed elsewhere.  Thoughts?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
  2021-04-30 23:13             ` Bart Schaefer
@ 2021-05-05 11:45               ` Stephane Chazelas
  2021-05-31  0:58                 ` Lawrence Velázquez
                                   ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Stephane Chazelas @ 2021-05-05 11:45 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2021-04-30 16:13:34 -0700, Bart Schaefer:
[...]
> I went back and looked at the patch again.

Thanks. Here's a third version with further improvements
addressing some of the comments here.

> Tangential question:  "pgrep" commonly refers to grepping the process
> list, and is linked to "pkill".  I know "zpgrep" precedes this patch,
> but I'm wondering if we should rename it.

I agree, zshpcregrep or zpmatch may be better names. There
exists a pcregrep command, zpcregrep would be likely interpreted
as zip-pcregrep. I'll leave it out for now. IMO, that zpgrep
serves more as example code than a command people would actually
use, so it probably doesn't matter much.

> More directly about regexp-replace:
> 
> If $argv[4,-1] are going to be ignored/discarded, perhaps there should
> be a warning?  (Another thing that predates the patch, I know)

Agreed. I've addressed that.

> What do you think about replacing the final eval with typeset -g, as
> mentioned in workers/48760 ?

I've compared:

(1) eval -- $lvalue=\$value
(2) Bart's typeset -g -- $lvalue=$value
(3) Daniel's (zsh-workers 45073) : ${(P)lvalue::="$value"}

(1) to me is the most legible but if $lvalue is not a valid
lvalue, it doesn't necessarily return a useful error message to
the user (like when lvalue='reboot;var'...)

(2) is also very legible. It has the benefit (or inconvenience
depending on PoV) of returning an error if the lvalue is not a
scalar. It reports an error (and exits the shell process)  upon
incorrect lvalues (except ones such as "var=foo"). A major
drawback though is that if chokes on lvalue='array[n=1]' or
lvalue='assoc[keywith=characters]'

(3) is the least legible. It also causes the lvalue to be
dereferenced twice. For instance with lvalue='a[++n]', n is
incremented twice. However, it does report an error upon invalid
lvalue (even though ${(P)lvalue} alone doesn't), and as we use
${(P)lvalue} above already, that has the benefit of that lvalue
being interpreted consistently. Non-scalar variables are
converted to scalar (like with (1)). It works OK for
lvalue='assoc[$key]' and lvalue='assoc[=]' or
lvalue='assoc[\]]' for instance.

Performance wise, for usual cases (lvalue being a simple
variable name and value short enough), (1) seems to be the worst
in my tests and (3) best, (2) very close. But that's reversed
for the less usual cases.

So, I've gone for (3), changed the code to limit the number of
times the lvalue is dereferenced. I've also addressed an issue
whereby regexp-replace empty '^' x would not insert x in ERE
mode.

(note that it is affected by the (e) failure exit code issue
I've just raised separately; I'm not attempting to work around
it here; though I've added the X flag for error reporting to be
more consistent)

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 8bf1a208e..db06d7925 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4328,7 +4328,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable.  POSIX extended regular expressions are used,
+on a variable.  POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4346,6 +4346,9 @@ and arithmetic expressions which will be replaced:  in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..c947a2043 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,43 +1,109 @@
-# Replace all occurrences of a regular expression in a variable.  The
-# variable is modified directly.  Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly.  Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string.  This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-integer pcre
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue,
+#    expected to be scalar.  That lvalue will be evaluated once to
+#    retrieve the current value, and two more times (not just one as a
+#    side effect of using ${(P)varname::=$value}; FIXME) for the
+#    assignment of the new value if a substitution was made.  So lvalues
+#    such as array[++n] where the subscript is dynamic should be
+#    avoided.
+#
+# 2. regular expression
+#
+# 3. replacement string.  This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+#
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable.
 
-[[ -o re_match_pcre ]] && pcre=1
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 <varname> <regexp> [<replacement>]"
+  return 2
+fi
 
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
-local MATCH MBEGIN MEND
+
+# $5 is the string to be matched
+5=${(P)1}
+
+local    MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $6, $7 reserved for later use
+  4=0 7=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "$5"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    6=${(Xe)3}
+    argv+=(${(s: :)ZPCRE_OP} "$6")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$(( argv[-2] + (argv[-3] == argv[-2]) ))
+  done
+
+  (( $# > 7 )) || return # no match
+
+  set +o multibyte
+
+  # $6 contains the result, $7 the current offset
+  6= 7=1
+  for 2 3 4 in "$@[8,-1]"; do
+    6+=${5[$7,$2]}$4
+    7=$(( $3 + 1 ))
+  done
+  6+=${5[$7,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+  while
+    if [[ $5 =~ $2 ]]; then
+      # append initial part and substituted match
+      6+=${5[1,MBEGIN-1]}${(Xe)3}
+      # truncate remaining string
+      if (( MEND < MBEGIN )); then
+	# zero-width match, skip one character for the next match
+	(( MEND++ ))
+	6+=${5[1]}
+      fi
+      5=${5[MEND+1,-1]}
+      # indicate we did something
+      7=1
+    fi
+    [[ -n $5 ]]
+  do
+    continue
+  done
+  [[ -n $7 ]] || return # no match
+  6+=$5
+fi
+
+# assign result to target variable if at least one substitution was
+# made.  At this point, if the variable was originally array or assoc, it
+# is converted to scalar. If $1 doesn't contain a valid lvalue
+# specification, an exception is raised (exits the shell process if
+# non-interactive).
+: ${(P)1::="$6"}


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
  2021-05-05 11:45               ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
@ 2021-05-31  0:58                 ` Lawrence Velázquez
  2021-05-31 18:18                 ` Bart Schaefer
  2024-03-08 15:30                 ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  2 siblings, 0 replies; 58+ messages in thread
From: Lawrence Velázquez @ 2021-05-31  0:58 UTC (permalink / raw)
  To: zsh-workers; +Cc: Stephane Chazelas

On Wed, May 5, 2021, at 7:45 AM, Stephane Chazelas wrote:
> 2021-04-30 16:13:34 -0700, Bart Schaefer:
> [...]
> > I went back and looked at the patch again.
> 
> Thanks. Here's a third version with further improvements
> addressing some of the comments here.

Any further feedback on this?

-- 
vq


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
  2021-05-05 11:45               ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  2021-05-31  0:58                 ` Lawrence Velázquez
@ 2021-05-31 18:18                 ` Bart Schaefer
  2021-05-31 21:37                   ` [PATCH] (?) typeset array[position=index]=value Bart Schaefer
  2024-03-08 15:30                 ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  2 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-05-31 18:18 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

On Wed, May 5, 2021 at 4:45 AM Stephane Chazelas <stephane@chazelas.org> wrote:
>
> > What do you think about replacing the final eval with typeset -g, as
> > mentioned in workers/48760 ?
>
> [...] very legible. It has the benefit (or inconvenience
> depending on PoV) of returning an error if the lvalue is not a
> scalar. It reports an error (and exits the shell process)  upon
> incorrect lvalues (except ones such as "var=foo"). A major
> drawback though is that if chokes on lvalue='array[n=1]' or
> lvalue='assoc[keywith=characters]'

Hmm, I wonder if that should be considered a bug.  It seems the
parsing in builtin.c:getasg() is not equivalent to what's done for the
same input syntax elsewhere.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH] (?) typeset array[position=index]=value
  2021-05-31 18:18                 ` Bart Schaefer
@ 2021-05-31 21:37                   ` Bart Schaefer
  2021-06-01  5:32                     ` Stephane Chazelas
  2021-06-05  4:29                     ` Mikael Magnusson
  0 siblings, 2 replies; 58+ messages in thread
From: Bart Schaefer @ 2021-05-31 21:37 UTC (permalink / raw)
  To: Zsh hackers list

On Mon, May 31, 2021 at 11:18 AM Bart Schaefer
<schaefer@brasslantern.com> wrote:
>
> On Wed, May 5, 2021 at 4:45 AM Stephane Chazelas <stephane@chazelas.org> wrote:
> > [typeset] chokes on lvalue='array[n=1]' or
> > lvalue='assoc[keywith=characters]'
>
> Hmm, I wonder if that should be considered a bug.

This copies (tweaked for context) the code from parse.c at line 2006
or thereabouts.

All tests still pass, but as you can see from the comment this is not
yet handling x+=y which there doesn't seem to be any reason for
typeset NOT to support; I think it would require only another flag in
struct asgment, but I haven't attempted that.

Commentary?

diff --git a/Src/builtin.c b/Src/builtin.c
index a16fddcb7..148c56952 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -1933,10 +1933,11 @@ getasg(char ***argvp, LinkList assigns)
     asg.flags = 0;

     /* search for `=' */
-    for (; *s && *s != '='; s++);
+    for (; *s && *s != '[' && *s != '=' /* && *s != '+' */; s++);
+    if (s > asg.name && *s == '[') skipparens('[', ']', &s);

     /* found `=', so return with a value */
-    if (*s) {
+    if (*s && *s == '=') {
     *s = '\0';
     asg.value.scalar = s + 1;
     } else {


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-05-31 21:37                   ` [PATCH] (?) typeset array[position=index]=value Bart Schaefer
@ 2021-06-01  5:32                     ` Stephane Chazelas
  2021-06-01 16:05                       ` Bart Schaefer
  2021-06-05  4:29                     ` Mikael Magnusson
  1 sibling, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-01  5:32 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

While we're at it, could we fix:

assoc[]=x
unset 'assoc[]'
etc.

The former can be worked around with assoc[${-+}]=x or
assoc+=('' x), the latter not AFAICT.

See also

worker:43269
https://unix.stackexchange.com/questions/626393/in-zsh-how-do-i-unset-an-arbitrary-associative-array-element/626529#626529

unset -k "$key" assoc_or_array

or:

assoc[$key]=()

Would probably make sense to fix the issue with escaping \ and ]
bytes in there unless we're happy to break backward
compatibility and make unset "assoc[$key]" work whatever the
value of $key (unset 'assoc[f\]oo]]' for unset the element of
key 'f\]oo]' for instance)

-- 
Stephane


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-01  5:32                     ` Stephane Chazelas
@ 2021-06-01 16:05                       ` Bart Schaefer
  2021-06-02  2:51                         ` [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff] Bart Schaefer
                                           ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Bart Schaefer @ 2021-06-01 16:05 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

On Mon, May 31, 2021 at 10:32 PM Stephane Chazelas
<stephane@chazelas.org> wrote:
>
> While we're at it, could we fix:

Wow, that thread from 2016 sort of died out in the middle, at
workers/37933.  Thanks for reminding me of that thread, though, I
should at least examine whether getasg() ought to be using
parse_subscript() even though the corresponding parse.c block is using
skipparens().

> assoc[]=x
> unset 'assoc[]'
> etc.
>
> The former can be worked around with assoc[${-+}]=x or
> assoc+=('' x), the latter not AFAICT.

The former can also be done with

assoc[(e)]=x

Whereas most subscript flags are meaningless for unset, we might
consider supporting (e) there.  I'm not sure whether that would fully
address the problem with the other characters, though.

The issue with the empty key seems merely to be that the subscript
validity test for associative arrays never changed from the one for
plain arrays.

> [...] unless we're happy to break backward
> compatibility and make unset "assoc[$key]" work whatever the
> value of $key (unset 'assoc[f\]oo]]' for unset the element of
> key 'f\]oo]' for instance)

Hm, I'm not sure what backward-compatibility would be broken?  Do you
mean that scripts that already have work-arounds for the current issue
might stop working?

That's sort of why I'm thinking about (e).  If that flag had to be
present in order to get the "whatever the value of $key" behavior, old
workarounds would be unchanged.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff]
  2021-06-01 16:05                       ` Bart Schaefer
@ 2021-06-02  2:51                         ` Bart Schaefer
  2021-06-02 10:06                           ` Stephane Chazelas
  2021-06-02  9:11                         ` [PATCH] (?) typeset array[position=index]=value Stephane Chazelas
  2021-06-10  0:21                         ` [PATCH] (?) typeset array[position=index]=value Bart Schaefer
  2 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-02  2:51 UTC (permalink / raw)
  To: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 1944 bytes --]

On Tue, Jun 1, 2021 at 9:05 AM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> [...] I should at least examine whether getasg() ought to be using
> parse_subscript() even though the corresponding parse.c block is using
> skipparens().

I've decided parse_subscript() is appropriate there, so that's
included below.  If anyone sees any problem with that, please point it
out.

> Whereas most subscript flags are meaningless for unset, we might
> consider supporting (e) there.  I'm not sure whether that would fully
> address the problem with the other characters, though.

Turns out it does seem to fix that, so that's the direction I've gone.
More discussion below.

> The issue with the empty key seems merely to be that the subscript
> validity test for associative arrays never changed from the one for
> plain arrays.

To maintain error-equivalent backward compatibility I didn't "fix"
this, instead, hash[(e)] (or hash[(e)''] if you think that more
readable) is required in order to unset the element with the empty
key.

> [...] I'm not sure what backward-compatibility would be broken?  Do you
> mean that scripts that already have work-arounds for the current issue
> might stop working?

The one compatibility issue with the foregoing is this:

Current pre-patch zsh:
% typeset -A zz
% bad='(e)bang'
% zz[$bad]=x
t% typeset -p zz
typeset -A zz=( ['(e)bang']=x )
% unset zz\["$bad"\]
% typeset -p zz
typeset -A zz=( )

With the patch, the "(e)" appearing in the value of $bad becomes a
subscript flag, because $bad is expanded before "unset" parses:
% zz[$bad]=x
% typeset -p zz
typeset -A zz=( ['(e)bang']=x )
% unset zz\["$bad"\]
% typeset -p zz
typeset -A zz=( ['(e)bang']=x )

You have to double the flag:
% unset zz\["(e)$bad"\]
% typeset -p zz
typeset -A zz=( )

Is that a small enough incompatibility for this to be acceptable?

I've included doc updates, but not tests.  If anyone else would like
to take a stab at those?

[-- Attachment #2: hash_subscripts.txt --]
[-- Type: text/plain, Size: 2913 bytes --]

diff --git a/Doc/Zsh/builtins.yo b/Doc/Zsh/builtins.yo
index f8e11d79d..07d8f6ac9 100644
--- a/Doc/Zsh/builtins.yo
+++ b/Doc/Zsh/builtins.yo
@@ -2411,6 +2411,8 @@ but the previous value will still reappear when the scope ends.
 Individual elements of associative array parameters may be unset by using
 subscript syntax on var(name), which should be quoted (or the entire command
 prefixed with tt(noglob)) to protect the subscript from filename generation.
+The `tt(LPAR())tt(e)tt(RPAR())' subscript flag may be used to require strict
+interpretation of characters in the key.
 
 If the tt(-m) flag is specified the arguments are taken as patterns (should
 be quoted) and all parameters with matching names are unset.  Note that this
diff --git a/Doc/Zsh/params.yo b/Doc/Zsh/params.yo
index dc28a45ae..03547ef3f 100644
--- a/Doc/Zsh/params.yo
+++ b/Doc/Zsh/params.yo
@@ -443,7 +443,9 @@ not inhibited.
 
 This flag can also be used to force tt(*) or tt(@) to be interpreted as
 a single key rather than as a reference to all values.  It may be used
-for either purpose on the left side of an assignment.
+for either purpose on the left side of an assignment.  This flag may
+also be used in the subscript of an argument to tt(unset) to disable
+interpretation of special characters and quoting.
 )
 enditem()
 
diff --git a/Src/builtin.c b/Src/builtin.c
index a16fddcb7..7aff354cc 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -1933,10 +1933,14 @@ getasg(char ***argvp, LinkList assigns)
     asg.flags = 0;
 
     /* search for `=' */
-    for (; *s && *s != '='; s++);
+    for (; *s && *s != '[' && *s != '=' /* && *s != '+' */; s++);
+    if (s > asg.name && *s == '[') {
+	char *se = parse_subscript(s + 1, 1, ']');
+	if (se && *se == ']') s = se + 1;
+    }
 
     /* found `=', so return with a value */
-    if (*s) {
+    if (*s && *s == '=') {
 	*s = '\0';
 	asg.value.scalar = s + 1;
     } else {
@@ -3726,11 +3730,16 @@ bin_unset(char *name, char **argv, Options ops, int func)
 	if (ss) {
 	    char *sse;
 	    *ss = 0;
+	    /* parse_subscript() doesn't handle empty brackets - should it? */
 	    if ((sse = parse_subscript(ss+1, 1, ']'))) {
 		*sse = 0;
-		subscript = dupstring(ss+1);
+		if (ss[1] == '(' && ss[2] == 'e' && ss[3] == ')') {
+		    subscript = dupstring(ss+4);
+		} else {
+		    subscript = dupstring(ss+1);
+		    remnulargs(subscript);
+		}
 		*sse = ']';
-		remnulargs(subscript);
 		untokenize(subscript);
 	    }
 	}
@@ -3763,6 +3772,12 @@ bin_unset(char *name, char **argv, Options ops, int func)
 	    } else if (PM_TYPE(pm->node.flags) == PM_SCALAR ||
 		       PM_TYPE(pm->node.flags) == PM_ARRAY) {
 		struct value vbuf;
+		if (!*subscript) {
+		    *ss = '[';
+		    zerrnam(name, "%s: invalid parameter name", s);
+		    returnval = 1;
+		    continue;
+		}
 		vbuf.isarr = (PM_TYPE(pm->node.flags) == PM_ARRAY ?
 			      SCANPM_ARRONLY : 0);
 		vbuf.pm = pm;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff]
  2021-06-02  2:51                         ` [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff] Bart Schaefer
@ 2021-06-02 10:06                           ` Stephane Chazelas
  2021-06-02 14:52                             ` Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-02 10:06 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2021-06-01 19:51:32 -0700, Bart Schaefer:
[...]
> > The issue with the empty key seems merely to be that the subscript
> > validity test for associative arrays never changed from the one for
> > plain arrays.
> 
> To maintain error-equivalent backward compatibility I didn't "fix"
> this, instead, hash[(e)] (or hash[(e)''] if you think that more
> readable) is required in order to unset the element with the empty
> key.

I have to admit I don't see the problem here. I would have
thought allowing a[]=foo and unset 'a[]' would be no-brainers
as there's no concern about backward compatibility as those
currently return an error.

Even for plain arrays, IMO, it would make sense to allow empty
subscripts. In most contexts, an empty arithmetic expression is
interpreted as 0:

$ echo $(())
0
$ printf '%d\n' ''
0
$ set -o ksharrays; a[empty]=1; typeset -p a
typeset -a a=( 1 )

In ksh93:

$ ksh -c '(( a[] = 1 )); typeset -p a'
typeset -a a=(1)

$ ksh -c 'a[]=1'
ksh: syntax error at line 1: `[]' empty subscript
(oddly enough)

$ ksh -c 'a[""]=1; typeset -p a; unset "a[]"; typeset -p a'
typeset -a a=(1)
typeset -a a=()

$ ksh -c 'typeset -A a; a[""]=1; typeset -p a; unset "a[]"; typeset -p a'
typeset -A a=(['']=1)
typeset -A a=()

mksh:

$ mksh -xc 'a[]=1; typeset -p a; unset "a[]"; typeset -p a'
+ a[]=1
+ typeset -p a
set -A a
typeset a[0]=1
+ unset 'a[]'
+ typeset -p a

> The one compatibility issue with the foregoing is this:
[...]
> With the patch, the "(e)" appearing in the value of $bad becomes a
> subscript flag, because $bad is expanded before "unset" parses:
> % zz[$bad]=x
> % typeset -p zz
> typeset -A zz=( ['(e)bang']=x )
> % unset zz\["$bad"\]
> % typeset -p zz
> typeset -A zz=( ['(e)bang']=x )
> 
> You have to double the flag:
> % unset zz\["(e)$bad"\]

Or more legibly:

unset "zz[(e)$bad]"

> % typeset -p zz
> typeset -A zz=( )
> 
> Is that a small enough incompatibility for this to be acceptable?
[...]

Well, currently, you already need to escape the (s and )s in
general (except when they're matched):

$ key='(' zsh -c 'typeset -A a; a[$key]=x; unset "a[$key]"'
zsh:unset:1: a[(]: invalid parameter name

So I'm not sure there's much of a compatibility problem.

But while it allows unsetting the element with empty key with
unset 'a[(e)]', it seems to make it even more difficult to unset 
elements with arbitrary keys. 

One still can't use:

unset "a[$key]"

nor

unset "a[(e)$key]"

That still chokes on ()[]`\ and that still can't be worked around with

unset "a[${(b)key}]"

as it inserts backslashes in too many places and not in front of
backticks:

$ key='?' ./Src/zsh -c 'typeset -A a; a[x]=y; a[$key]=x; typeset -p a; unset "a[${(b)key}]"; typeset -p a'
typeset -A a=( ['?']=x [x]=y )
typeset -A a=( ['?']=x [x]=y )

And with (e), we can't use backslash to escape problematic
characters:

$ typeset -A a=('[' x)
$ unset 'a[(e)[]'
unset: a[(e)[]: invalid parameter name
$ unset 'a[(e)\[]'
$ typeset -p a
typeset -A a=( ['[']=x )

So, you'd need something like:

if [[ -n $key ]]; then
  () {
    set -o localoptions +o multibyte -o extendedglob
    unset "a[${key//[][()\`\\]/\\$MATCH}]"
  }
else
  unset "a[(e)]"
fi

(untested)

To unset an element with arbitrary key (granted, that's an
improvement as you can now unset the element with empty key, but
IMO not an acceptable solution).

"e" for "exact" is also a bit misleading in that case as without it,
wildcards and */@ are not treated specially.

It's also a bit confusing that subscript flags would be
seemingly parsed but later ignored (included in the value of the
key) except for (e). The fact that (e) is recognised and (ee) is
not also makes for a not very consistent API.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff]
  2021-06-02 10:06                           ` Stephane Chazelas
@ 2021-06-02 14:52                             ` Bart Schaefer
  2021-06-02 16:02                               ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-02 14:52 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

On Wed, Jun 2, 2021 at 3:06 AM Stephane Chazelas <stephane@chazelas.org> wrote:
>
> 2021-06-01 19:51:32 -0700, Bart Schaefer:
> [...]
> > > The issue with the empty key seems merely to be that the subscript
> > > validity test for associative arrays never changed from the one for
> > > plain arrays.
> >
> > To maintain error-equivalent backward compatibility I didn't "fix"
> > this, instead, hash[(e)] (or hash[(e)''] if you think that more
> > readable) is required in order to unset the element with the empty
> > key.
>
> I have to admit I don't see the problem here. I would have
> thought allowing a[]=foo and unset 'a[]' would be no-brainers

Mostly I was thinking about

key=$(some derived value)
unset "hash[$key]"

In existing code that would fail on [[ -z $key ]], but you can't see
that by examination.

> as there's no concern about backward compatibility as those
> currently return an error.

That's not our usual criteria for backward compatibility.  Usually we
only change things if the new construct was previously a syntax error,
something that would prevent an old script from even being properly
parsed.

> Even for plain arrays, IMO, it would make sense to allow empty
> subscripts. In most contexts, an empty arithmetic expression is
> interpreted as 0:

But ... there's no such thing as array position 0 in native zsh.

> unset "a[(e)$key]"
>
> That still chokes on ()[]`\ and that still can't be worked around

This is why I asked for test cases ... combinations that I tried DID
work with a[(e)$key].  So, that may be a deal-breaker regardless of
the rest of this discussion.

> It's also a bit confusing that subscript flags would be
> seemingly parsed but later ignored (included in the value of the
> key) except for (e).

Well, they're parsed and ignored currently, and we already allow some
subscripts but not others in various assignment contexts, so this
doesn't seem all that weird in comparison.

> The fact that (e) is recognised and (ee) is
> not also makes for a not very consistent API.

What would (ee) mean?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff]
  2021-06-02 14:52                             ` Bart Schaefer
@ 2021-06-02 16:02                               ` Stephane Chazelas
  0 siblings, 0 replies; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-02 16:02 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2021-06-02 07:52:25 -0700, Bart Schaefer:
[...]
> > I have to admit I don't see the problem here. I would have
> > thought allowing a[]=foo and unset 'a[]' would be no-brainers
> 
> Mostly I was thinking about
> 
> key=$(some derived value)
> unset "hash[$key]"
> 
> In existing code that would fail on [[ -z $key ]], but you can't see
> that by examination.

Not sure I follow. That script doesn't work properly atm as it
fails to unset the corresponding hash element and would be fixed
once we allow unset 'hash[]'

I can't think of real life scenarios where one would *rely* on

unset 'hash[]'

Aborting the shell with a

zsh:unset:1: hash[]: invalid parameter name

error.

> > as there's no concern about backward compatibility as those
> > currently return an error.
> 
> That's not our usual criteria for backward compatibility.  Usually we
> only change things if the new construct was previously a syntax error,
> something that would prevent an old script from even being properly
> parsed.

By that logic, we could never add features like new options to
builtins or new flags. For instance, we couldn't add a -k option
to unset because

unset -k key arr

currently is not a zsh syntax error, that script is parsed OK,
and that command returns with:

unset: bad option: -k

[...]
> > Even for plain arrays, IMO, it would make sense to allow empty
> > subscripts. In most contexts, an empty arithmetic expression is
> > interpreted as 0:
> 
> But ... there's no such thing as array position 0 in native zsh.

But it would make the API more consistent if array subscripts
could be any arithmetic expressions or comma-separated pair of
arithmetic expressions.

And when ksharrays is not enabled a[] to return the same error
as for a[0] or a[empty] ("assignment to invalid subscript
range").

Note that 0 position is allowed in second place:

a[2,0]=x
for instance (or a[2,empty]=x, but not a[2,]=x atm) to insert a
x in second position.

I'm not saying that's something we should do or would be
terribly useful, just that it would make the interface more
consistent, and that array[] being rejected should not be a
justification for rejecting assoc[].

[...]
> > The fact that (e) is recognised and (ee) is
> > not also makes for a not very consistent API.
> 
> What would (ee) mean?

The e flag passed twice.

echo $a[(e)*], $a[(ee)*], $a[(eee)*]

All expand to element of key "*".

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-01 16:05                       ` Bart Schaefer
  2021-06-02  2:51                         ` [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff] Bart Schaefer
@ 2021-06-02  9:11                         ` Stephane Chazelas
  2021-06-02 13:34                           ` Daniel Shahaf
  2021-06-10  0:21                         ` [PATCH] (?) typeset array[position=index]=value Bart Schaefer
  2 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-02  9:11 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2021-06-01 09:05:13 -0700, Bart Schaefer:
[...]
> > [...] unless we're happy to break backward
> > compatibility and make unset "assoc[$key]" work whatever the
> > value of $key (unset 'assoc[f\]oo]]' for unset the element of
> > key 'f\]oo]' for instance)
> 
> Hm, I'm not sure what backward-compatibility would be broken?  Do you
> mean that scripts that already have work-arounds for the current issue
> might stop working?

Yes, like that guy's unset_keys at
https://unix.stackexchange.com/questions/626393/in-zsh-how-do-i-unset-an-arbitrary-associative-array-element/626529#626529
mentioned earlier.

ATM, you need to do:

unset 'hash[\]]'

To unset the element of key "]" for instance.

Having said that, in practice, it's rare for values to be passed
literally. What you generally want to do is:

unset "hash[$key]"

With $key being any arbitrary string. And that's where the
problem is.

Maybe the best approach would be to make unset a dual
keyword/builtin like typeset/export... so one can do:

unset hash[$key]

And that hash[$key] being interpreted the same as when you do:

hash[$key]=value

Being able to do:

unset hash[(R)pattern]
unset hash[(I)pattern]

would also be useful. We don't want however subscript flags to
be interpreted in:

unset "hash[$key]"

as that would introduce command injection vulnerabilities, like
is already the case in things like:

$ key='(n:evil:)' evil='psvar[$(uname>&2)1]' zsh -c 'typeset -A a; (( a[$key]++ ))'
Linux
Linux

(though in that case, that's not limited to subscript flags,
key='x]+psvar[$(uname>&2)1' works as well, see
https://unix.stackexchange.com/questions/627474/how-to-use-associative-arrays-safely-inside-arithmetic-expressions/627475#627475
)

That's also why I was suggesting allowing:

hash[$key]=()

to unset an element, where you don't have the ambiguity of
whether the contents of $key is going to be interpreted
specially.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-02  9:11                         ` [PATCH] (?) typeset array[position=index]=value Stephane Chazelas
@ 2021-06-02 13:34                           ` Daniel Shahaf
  2021-06-02 14:20                             ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Daniel Shahaf @ 2021-06-02 13:34 UTC (permalink / raw)
  To: zsh-workers

Stephane Chazelas wrote on Wed, 02 Jun 2021 09:11 +00:00:
> Maybe the best approach would be to make unset a dual
> keyword/builtin like typeset/export... so one can do:
> 
> unset hash[$key]
> 
> And that hash[$key] being interpreted the same as when you do:
> 
> hash[$key]=value

Haven't read the whole thread, so apologies if I'm missing something, but:

Please let's not invent a reserved word that uses different variable
expansion rules.  The sequence «hash[$key]» should mean the same thing
everywhere.  Instead, we could have a builtin that takes two separate
arguments, as in «foo hash $key» (and «foo hash ''» to unset the element
whose key is the empty string).

Makes sense?

Cheers,

Daniel

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-02 13:34                           ` Daniel Shahaf
@ 2021-06-02 14:20                             ` Stephane Chazelas
  2021-06-02 15:59                               ` Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-02 14:20 UTC (permalink / raw)
  To: Daniel Shahaf; +Cc: zsh-workers

2021-06-02 13:34:57 +0000, Daniel Shahaf:
[...]
> Haven't read the whole thread, so apologies if I'm missing something, but:
> 
> Please let's not invent a reserved word that uses different variable
> expansion rules.  The sequence «hash[$key]» should mean the same thing
> everywhere.

That's the problem here.

It's already different in

hash[$key]=1
typeset hash[$key]=1 # same as in assignment in recent versions
                     # of zsh where here typeset is a keyword

(( hash[$key] = 1 ))


unset hash[$key] # globbing performed.
read hash[$key]
$dryrun typeset hash[$key]=1 # here typeset not recognised as
                             # keyword

$dryrun typeset 'hash[$key]=1'
read 'hash[$key]'
let 'hash[$key] = 1' # yes, it works despite the single quotes
                     # and is actually safe it seems

unset 'hash[$key]' # unsets the element of key $key (literally)

(see also
https://unix.stackexchange.com/questions/627474/how-to-use-associative-arrays-safely-inside-arithmetic-expressions/627475#627475)

Parsing rules are different because we are in different
contexts. For unset hash[$key] or unset hash[(e)*] to treat
those as associative array lvalue, and not globs, it would need
to be a reserved word, like typeset above.

But, yes I agree it's all very messy.

> Instead, we could have a builtin that takes two separate
> arguments, as in «foo hash $key» (and «foo hash ''» to unset the element
> whose key is the empty string).
> 
> Makes sense?
[...]

I suggested:

unset -k "$key" hash

for that.

-- 
Stephane


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-02 14:20                             ` Stephane Chazelas
@ 2021-06-02 15:59                               ` Bart Schaefer
  2021-06-03  2:04                                 ` [PATCH (not final)] (take three?) unset "array[$anything]" Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-02 15:59 UTC (permalink / raw)
  To: Daniel Shahaf, Zsh hackers list

On Wed, Jun 2, 2021 at 7:20 AM Stephane Chazelas <stephane@chazelas.org> wrote:
>
> $dryrun typeset hash[$key]=1 # here typeset not recognised as
>                              # keyword

Same for "builtin typeset ..." of course.

> let 'hash[$key] = 1' # yes, it works despite the single quotes

I'm surprised that solves the \`()[] problem.  I suspect it's
accidental and has to do with a different meaning of [...] in math
context.

> Parsing rules are different because we are in different
> contexts. For unset hash[$key] or unset hash[(e)*] to treat
> those as associative array lvalue, and not globs, it would need
> to be a reserved word, like typeset above.

That's true regarding globs, but I've just had a hand-slaps-forehead
moment ... take 3 to follow in another thread.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-02 15:59                               ` Bart Schaefer
@ 2021-06-03  2:04                                 ` Bart Schaefer
  2021-06-03  2:42                                   ` Bart Schaefer
  2021-06-03  6:05                                   ` [PATCH (not final)] (take three?) unset "array[$anything]" Stephane Chazelas
  0 siblings, 2 replies; 58+ messages in thread
From: Bart Schaefer @ 2021-06-03  2:04 UTC (permalink / raw)
  To: Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 2349 bytes --]

On Wed, Jun 2, 2021 at 8:59 AM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> I've just had a hand-slaps-forehead
> moment ... take 3 to follow in another thread.

What I realized is that for any unset of an array element, the closing
bracket must always be the last character of the argument.  There's no
reason to parse the subscript or skip over matching brackets; if a '['
is found, just make sure the last character is ']' and the subscript
must be everything in between.

% typeset -A ax
% for k in '' '\' '`' '(' '[' ')' ']'; do
for> ax[$k]=$k
for> done
% typeset -p ax
typeset -A ax=( ['']='' ['(']='(' [')']=')' ['[']='[' ['\']='\'
[']']=']' ['`']='`' )
% for k in ${(k)ax}; do
unset "ax[$k]"
done
% typeset -p ax
typeset -A ax=( )

Given this realization, it's easy to make { unset "hash[$key]" } work
"like it always should have".  The trouble comes in with (not)
breaking past workarounds.  Because the current (and theoretically
experimental, though we forgot about that for 5 years) code uses
parse_subscript(), we get a partly-tokenized (as if double-quoted,
actually) string in the cases where backslashes are used to force the
closing bracket to be found.  If those backslashes aren't needed any
more, there's no clean way to ignore them upon untokenize, to get back
to something that actually matches the intended hash key.

The attached not-ready-for-push patch has 4 variations that can be
chosen by #define, currently set up as follows:

#define unset_workers_37914 0
#define unset_hashelem_empty_only 0
#define unset_hashelem_literal 1
#define unset_hashelem_stripquote 0

The first one is just the current code.  The second one allows { unset
"hash[]" } (and gives "invalid subscript" for array[] instead of
"invalid parameter name").  The third one, which I have defined by
default, uses the subscript literally, so if you can do { hash[$k]=v }
you can also do { unset "hash[$k]" } (at least for all cases I
tested).  The fourth one requires { unset "hash[${(q)k}]" } instead,
but I think it otherwise works for all cases.  Both of those also work
for "hash[]".

Therefore I think the best option is to choose one of the latter two,
possibly depending on which one induces the least damage to any
workarounds for the current behavior that are known in the wild,
though aesthetically I'd rather use the literal version.

[-- Attachment #2: unset_subscripts.txt --]
[-- Type: text/plain, Size: 2174 bytes --]

diff --git a/Src/builtin.c b/Src/builtin.c
index a16fddcb7..ec0376367 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -1933,10 +1933,14 @@ getasg(char ***argvp, LinkList assigns)
     asg.flags = 0;

     /* search for `=' */
-    for (; *s && *s != '='; s++);
+    for (; *s && *s != '[' && *s != '=' /* && *s != '+' */; s++);
+    if (s > asg.name && *s == '[') {
+	char *se = parse_subscript(s + 1, 1, ']');
+	if (se && *se == ']') s = se + 1;
+    }

     /* found `=', so return with a value */
-    if (*s) {
+    if (*s && *s == '=') {
 	*s = '\0';
 	asg.value.scalar = s + 1;
     } else {
@@ -3724,6 +3728,11 @@ bin_unset(char *name, char **argv, Options ops, int func)
     while ((s = *argv++)) {
 	char *ss = strchr(s, '['), *subscript = 0;
 	if (ss) {
+#define unset_workers_37914 0
+#define unset_hashelem_empty_only 0
+#define unset_hashelem_literal 1
+#define unset_hashelem_stripquote 0
+#if unset_workers_37914
 	    char *sse;
 	    *ss = 0;
 	    if ((sse = parse_subscript(ss+1, 1, ']'))) {
@@ -3733,6 +3742,47 @@ bin_unset(char *name, char **argv, Options ops, int func)
 		remnulargs(subscript);
 		untokenize(subscript);
 	    }
+#elif unset_hashelem_empty_only
+	    char *sse;
+	    *ss = 0;
+	    if (ss[1] == ']' && !ss[2] ? (sse = ss+1) :
+		(sse = parse_subscript(ss+1, 1, ']'))) {
+		*sse = 0;
+		subscript = dupstring(ss+1);
+		*sse = ']';
+		remnulargs(subscript);
+		untokenize(subscript);
+	    }
+#else
+	    char *sse = ss + strlen(ss)-1;
+	    *ss = 0;
+	    if (*sse == ']') {
+# if unset_hashelem_literal
+		*sse = 0;
+		subscript = dupstring(ss+1);
+		*sse = ']';
+# elif unset_hashelem_stripquote
+		int ne = noerrs;
+		noerrs = 2;
+		*ss = 0;
+		*sse = 0;
+		subscript = dupstring(ss+1);
+		*sse = ']';
+		/*
+		 * parse_subst_string() removes one level of quoting.
+		 * If it returns nonzero, substring is unchanged, else
+		 * it has been re-tokenized in place, so clean it up.
+		 */
+		if (!parse_subst_string(subscript)) {
+		    remnulargs(subscript);
+		    untokenize(subscript);
+		}
+		noerrs = ne;
+# else
+		XXX parse error XXX
+# endif
+	    }
+#endif
 	}
 	if ((ss && !subscript) || !isident(s)) {
 	    if (ss)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03  2:04                                 ` [PATCH (not final)] (take three?) unset "array[$anything]" Bart Schaefer
@ 2021-06-03  2:42                                   ` Bart Schaefer
  2021-06-03  6:12                                     ` Bart Schaefer
  2021-06-03  6:05                                   ` [PATCH (not final)] (take three?) unset "array[$anything]" Stephane Chazelas
  1 sibling, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-03  2:42 UTC (permalink / raw)
  To: Zsh hackers list

On Wed, Jun 2, 2021 at 7:04 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> #define unset_hashelem_literal 1
> #define unset_hashelem_stripquote 0
>
> The fourth one requires { unset "hash[${(q)k}]" } instead,
> but I think it otherwise works for all cases.

Another point in favor of "stripquote" is that it makes ${(b)k}
function in a lot of cases.  The failing cases for (b) in that branch
are backticks and a leading '\='.

Also, if you know of good test cases, please try them / send them along.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03  2:42                                   ` Bart Schaefer
@ 2021-06-03  6:12                                     ` Bart Schaefer
  2021-06-03  8:54                                       ` Peter Stephenson
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-03  6:12 UTC (permalink / raw)
  To: Zsh hackers list

On Wed, Jun 2, 2021 at 7:42 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> Also, if you know of good test cases, please try them / send them along.

I've now tried all the hash keys from the test harness in
 https://unix.stackexchange.com/questions/626393/in-zsh-how-do-i-unset-an-arbitrary-associative-array-element/626529#626529
It found a few additional misfires when using the (b) flag with the
stripquote variation, but both literal and stripquote worked in the
respectively expected ways for all those cases plus some others I had
already tried.

Furthermore, both of them do approximately as well with the $MATCH
pattern from workers/43269 as does the current variation -- the
original shell unsets 281 out of 291 keys, the literal variation 280,
and the stripquote variation 284.  The keys not unset using that
pattern are different in each of the three cases, except that they all
fail to unset key='\`' ... several of the failures with the literal
variation have multiple backslashes, though not always consecutive,
and it does fail with key=']' and key='\' which could be important.

So, we're probably in pretty safe territory with either choice but in
terms of backwards compatibility the stripquote version is slightly
ahead.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03  6:12                                     ` Bart Schaefer
@ 2021-06-03  8:54                                       ` Peter Stephenson
  2021-06-03 13:13                                         ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Stephenson @ 2021-06-03  8:54 UTC (permalink / raw)
  To: Zsh hackers list

> On 03 June 2021 at 07:12 Bart Schaefer <schaefer@brasslantern.com> wrote:
> So, we're probably in pretty safe territory with either choice but in
> terms of backwards compatibility the stripquote version is slightly
> ahead.

I'd be in favour of this.  I agree with the consensus that the current code
is not really usable for difficult cases, so some degree of incompatibility
is warranted.

pws


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03  8:54                                       ` Peter Stephenson
@ 2021-06-03 13:13                                         ` Stephane Chazelas
  2021-06-03 14:41                                           ` Peter Stephenson
  2021-06-03 18:12                                           ` Bart Schaefer
  0 siblings, 2 replies; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-03 13:13 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

2021-06-03 09:54:41 +0100, Peter Stephenson:
> > On 03 June 2021 at 07:12 Bart Schaefer <schaefer@brasslantern.com> wrote:
> > So, we're probably in pretty safe territory with either choice but in
> > terms of backwards compatibility the stripquote version is slightly
> > ahead.
> 
> I'd be in favour of this.  I agree with the consensus that the current code
> is not really usable for difficult cases, so some degree of incompatibility
> is warranted.
[...]

If I understand correctly, the "stripquote" variant is the one
where the user can do:

unset 'hash["foo"]'

unset "hash['']"

unset "hash[${(qq)key}]"

That is where quotes are parsed and removed but otherwise
serve no purpose.

IMO, that just confuses things even more.

That hardly helps with backward compatibility as users who did
work around the previous behaviour will still have to adapt
their work around, and those who didn't (who did unset
"hash[$key]") will have their code choke on even more characters
(", $, ' in addition to the previous ones, and likely more
confusing behaviour when there are unmatched quotes or ()).

It also deviates even more from the other places that take
lvalues causing potential confustion. hash["foo"]=x or read
'hash["foo"]' set the element of key "foo" (quotes included) not
foo.

To me, acceptable options would be the "literal" one, or add
another quoting flag (maybe "bb"?) that quotes in the manner
currently expected by unset so people can use
unset "hash[${(bb)key}]" (but unfortunately not read
"hash[${(bb)key}]" for which $, ` also need to be escaped with
backslash so that bb flag would only be useful for unset), or
unset -k "$key" hash, or a new option à la bash.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03 13:13                                         ` Stephane Chazelas
@ 2021-06-03 14:41                                           ` Peter Stephenson
  2021-06-04 19:25                                             ` Bart Schaefer
  2021-06-03 18:12                                           ` Bart Schaefer
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Stephenson @ 2021-06-03 14:41 UTC (permalink / raw)
  To: Zsh hackers list

> On 03 June 2021 at 14:13 Stephane Chazelas <stephane@chazelas.org> wrote:
> If I understand correctly, the "stripquote" variant is the one
> where the user can do:
> 
> unset 'hash["foo"]'
> 
> unset "hash['']"
> 
> unset "hash[${(qq)key}]"
> 
> That is where quotes are parsed and removed but otherwise
> serve no purpose.
> 
> IMO, that just confuses things even more.

It would probably be useful to make a list of difficult cases --- both
how the original is handled and how anything that looks like a likely workaround
is treated  --- and then assemble a table of how they're handled now and with the
two most promising proposed fixes.

pws


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03 14:41                                           ` Peter Stephenson
@ 2021-06-04 19:25                                             ` Bart Schaefer
  2021-06-05 18:18                                               ` Peter Stephenson
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-04 19:25 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On Thu, Jun 3, 2021 at 7:42 AM Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
>
> It would probably be useful to make a list of difficult cases --- both
> how the original is handled and how anything that looks like a likely workaround
> is treated  --- and then assemble a table of how they're handled now and with the
> two most promising proposed fixes.

Are the samples I posted in workers/49005 adequate?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-04 19:25                                             ` Bart Schaefer
@ 2021-06-05 18:18                                               ` Peter Stephenson
  2021-06-09 23:31                                                 ` Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Stephenson @ 2021-06-05 18:18 UTC (permalink / raw)
  To: Zsh hackers list

On Fri, 2021-06-04 at 12:25 -0700, Bart Schaefer wrote:
> On Thu, Jun 3, 2021 at 7:42 AM Peter Stephenson
> <p.w.stephenson@ntlworld.com> wrote:
> > 
> > It would probably be useful to make a list of difficult cases --- both
> > how the original is handled and how anything that looks like a likely workaround
> > is treated  --- and then assemble a table of how they're handled now and with the
> > two most promising proposed fixes.
> 
> Are the samples I posted in workers/49005 adequate?

Looks very much like what's needed.  Thanks.

pws



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-05 18:18                                               ` Peter Stephenson
@ 2021-06-09 23:31                                                 ` Bart Schaefer
  2021-06-13 16:51                                                   ` Peter Stephenson
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-09 23:31 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On Sat, Jun 5, 2021 at 11:18 AM Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
>
> On Fri, 2021-06-04 at 12:25 -0700, Bart Schaefer wrote:
> > On Thu, Jun 3, 2021 at 7:42 AM Peter Stephenson
> > <p.w.stephenson@ntlworld.com> wrote:
> > >
> > > assemble a table of how they're handled now and with the
> > > two most promising proposed fixes.
> >
> > Are the samples I posted in workers/49005 adequate?
>
> Looks very much like what's needed.  Thanks.

How do we make the call on what to do, then?

There are a total of 29 test cases where at least one combination
(bare current shell, current shell with Stephane's workaround, and
stripquote or literal with the workaround) fails on the test key.  (31
appear in 49005 but two of the test cases are redundant, which I
hadn't noticed before.)

Considering only the 3 variants with the workaround, literal misses 5
of the same cases as the current shell, but 9 additional cases.
Stripquote misses only one of the same cases as the current shell but
also only 5 additional cases.  The current shell (w/workaround) misses
4 cases that neither of the other two miss.

The cases literal fails on are, as might be expected, those in which
the workaround adds extra backslashes.  Stripquote works for most of
those because it peels the extra backslash off again ... but that's
exactly the behavior which Stephane points out "otherwise serves no
purpose".

I still feel literal is the best option despite potentially breaking
more workarounds, because it means the workarounds can just be removed
rather than modified.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-09 23:31                                                 ` Bart Schaefer
@ 2021-06-13 16:51                                                   ` Peter Stephenson
  2021-06-13 18:04                                                     ` Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Stephenson @ 2021-06-13 16:51 UTC (permalink / raw)
  To: zsh-workers

On Wed, 2021-06-09 at 16:31 -0700, Bart Schaefer wrote:
> On Sat, Jun 5, 2021 at 11:18 AM Peter Stephenson
> <p.w.stephenson@ntlworld.com> wrote:
> > 
> > On Fri, 2021-06-04 at 12:25 -0700, Bart Schaefer wrote:
> > > On Thu, Jun 3, 2021 at 7:42 AM Peter Stephenson
> > > <p.w.stephenson@ntlworld.com> wrote:
> > > > 
> > > > assemble a table of how they're handled now and with the
> > > > two most promising proposed fixes.
> > > 
> > > Are the samples I posted in workers/49005 adequate?
> > 
> > Looks very much like what's needed.  Thanks.
> 
> How do we make the call on what to do, then?
> 
> The cases literal fails on are, as might be expected, those in which
> the workaround adds extra backslashes.  Stripquote works for most of
> those because it peels the extra backslash off again ... but that's
> exactly the behavior which Stephane points out "otherwise serves no
> purpose".
> 
> I still feel literal is the best option despite potentially breaking
> more workarounds, because it means the workarounds can just be removed
> rather than modified.

I don't have a killer argument, but what you just said sounds like the
best we've got, so I'd certainly be happy to vote for that, in the
absence of a magic wand.

I'm wondering about something we could do that would ease the
transition, but I think I've convinced myeelf that just adds yet more
unwanted complexity.

pws



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-13 16:51                                                   ` Peter Stephenson
@ 2021-06-13 18:04                                                     ` Bart Schaefer
  2021-06-13 19:48                                                       ` Peter Stephenson
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-13 18:04 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On Sun, Jun 13, 2021 at 9:54 AM Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
>
> I'm wondering about something we could do that would ease the
> transition, but I think I've convinced myself that just adds yet more
> unwanted complexity.

The only thing I could come up with is an option, but then ... in
which state does it start, and what do we do with it later?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-13 18:04                                                     ` Bart Schaefer
@ 2021-06-13 19:48                                                       ` Peter Stephenson
  2021-06-13 21:44                                                         ` Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Stephenson @ 2021-06-13 19:48 UTC (permalink / raw)
  To: zsh-workers

On Sun, 2021-06-13 at 11:04 -0700, Bart Schaefer wrote:
> On Sun, Jun 13, 2021 at 9:54 AM Peter Stephenson
> <p.w.stephenson@ntlworld.com> wrote:
> > 
> > I'm wondering about something we could do that would ease the
> > transition, but I think I've convinced myself that just adds yet more
> > unwanted complexity.
> 
> The only thing I could come up with is an option, but then ... in
> which state does it start, and what do we do with it later?

Yes, in fact those are the same thoughts that flitted through my mind.
I think the answer to the first is "we'd have it on the new setting to
alert people the change is imminent", but I don't have an answer to the
second: it just seems to be stuck there, useless.

pws



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-13 19:48                                                       ` Peter Stephenson
@ 2021-06-13 21:44                                                         ` Bart Schaefer
  2021-06-14  7:19                                                           ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-13 21:44 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: Zsh hackers list

On Sun, Jun 13, 2021 at 12:50 PM Peter Stephenson
<p.w.stephenson@ntlworld.com> wrote:
>
> Yes, in fact those are the same thoughts that flitted through my mind.

One other idea ... NO_UNSET could warn if you unset something that's
not set, as well as when you $deref something that's not set.  That
would at least blat at you if your hash key was misinterpreted.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-13 21:44                                                         ` Bart Schaefer
@ 2021-06-14  7:19                                                           ` Stephane Chazelas
  0 siblings, 0 replies; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-14  7:19 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Peter Stephenson, Zsh hackers list

2021-06-13 14:44:03 -0700, Bart Schaefer:
> On Sun, Jun 13, 2021 at 12:50 PM Peter Stephenson
> <p.w.stephenson@ntlworld.com> wrote:
> >
> > Yes, in fact those are the same thoughts that flitted through my mind.
> 
> One other idea ... NO_UNSET could warn if you unset something that's
> not set, as well as when you $deref something that's not set.  That
> would at least blat at you if your hash key was misinterpreted.

Note that nounset is a POSIX option. And for the unset special
utility:

"Unsetting a variable or function that was not previously set
shall not be considered an error and does not cause the shell to
abort."

POSIX has no jurisdiction over arrays/hashes, but that does mean
we can't have unset fail on unset variables when in sh emulation
at least.

A shell that would fail upon:

  unset var

When var was not previously set would not be compliant.

A script that would do

  unset "var[foo]"

Would not be a POSIX script as var[foo] is not a valid variable
name, so implementations are free to do whatever they want for
them. But if zsh returned failure for when var[foo] is not set
and not in unset var, that would make it quite inconsistent.

Changing the behaviour would likely break some scripts that use
errexit as well.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03 13:13                                         ` Stephane Chazelas
  2021-06-03 14:41                                           ` Peter Stephenson
@ 2021-06-03 18:12                                           ` Bart Schaefer
  2021-06-04  8:02                                             ` Stephane Chazelas
  1 sibling, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-03 18:12 UTC (permalink / raw)
  To: Peter Stephenson, Zsh hackers list

[-- Attachment #1: Type: text/plain, Size: 3153 bytes --]

On Thu, Jun 3, 2021 at 6:14 AM Stephane Chazelas <stephane@chazelas.org> wrote:
>
> If I understand correctly, the "stripquote" variant is the one
> where the user can do:
>
> unset 'hash["foo"]'
>
> unset "hash['']"
>
> unset "hash[${(qq)key}]"
>
> That is where quotes are parsed and removed but otherwise
> serve no purpose.

I guess you can describe it that way, yes.

> That hardly helps with backward compatibility as users who did
> work around the previous behaviour will still have to adapt
> their work around

The point of my previous analysis is that they probably will not have
to adapt.  Your $MATCH workaround operates correctly with the
stripquote version in nearly as many cases as it currently does (it is
not foolproof for the current shell implementation).

However, that doesn't mean I advocate for this over the "literal" variant.

> and those who didn't (who did unset
> "hash[$key]") will have their code choke on even more characters
> (", $, ' in addition to the previous ones, and likely more
> confusing behaviour when there are unmatched quotes or ()).

I extended my test set to include a number of variations on those
characters as well.  I've attached the test script.

Here are my test cases that fail if you just blindly do
  for k in ${(k)ax}; do unset "ax[$k]"; done
  typeset -p ax
with appropriate trapping for the fatal errors.  The literal variant
gets all of these, leaving the test hash empty.  Neither literal nor
stripquote throws any errors, so the trap is only needed for the
current implementation.  Starting with that:

typeset -A ax=(
  [$'\M-\C-@\\']=$'\M-\C-@\\'
  [$'\M-\C-@`']=$'\M-\C-@`'
  ['"${(@w)oops"']='"${(@w)oops"'
  ['(a']='(a'
  ['\$']='\$'
  ['\(']='\('
  ['\)']='\)'
  ['\[']='\['
  ['\\\']='\\\'
  ['\\\\']='\\\\'
  ['\]']='\]'
  ['\`']='\`'
)

The cases that fail with the stripquote variation:

typeset -A ax=(
  [$'\M-\C-@\\']=$'\M-\C-@\\'
  ['""']='""'
  ['"$oops"']='"$oops"'
  ['"safe"']='"safe"'
  ['"set"']='"set"'
  [\'\']=\'\'
  [\''safe'\']=\''safe'\'
  [\''set'\']=\''set'\'
  ['\!']='\!'
  ['\$']='\$'
  ['\(']='\('
  ['\)']='\)'
  ['\*']='\*'
  ['\=']='\='
  ['\@']='\@'
  ['\[']='\['
  ['\\\']='\\\'
  ['\\\\']='\\\\'
  ['\]']='\]'
  ['\`']='\`'
  ['\s\a\f\e']='\s\a\f\e'
  ['\s\e\t']='\s\e\t'
)

Current shell using $MATCH workaround:

typeset -A ax=(
  [$'\M-\C-@`']=$'\M-\C-@`'
  ['"${(@w)oops"']='"${(@w)oops"'
  ['(']='('
  ['(a']='(a'
  [')']=')'
  ['[']='['
  ['\(']='\('
  ['\)']='\)'
  ['\[']='\['
  ['\`']='\`'
  ['`']='`'
)

And stripquote with $MATCH workaround:

typeset -A ax=(
  ['""']='""'
  ['"$oops"']='"$oops"'
  ['"safe"']='"safe"'
  ['"set"']='"set"'
  [\'\']=\'\'
  [\''safe'\']=\''safe'\'
  [\''set'\']=\''set'\'
  ['\`']='\`'
)

Finally, literal with $MATCH workaround:

typeset -A ax=(
  [$'\M-\C-@\\']=$'\M-\C-@\\'
  ['\']='\'
  ['\!']='\!'
  ['\$']='\$'
  ['\(']='\('
  ['\)']='\)'
  ['\*']='\*'
  ['\=']='\='
  ['\@']='\@'
  ['\[']='\['
  ['\\\']='\\\'
  ['\`']='\`'
  ['\s\a\f\e']='\s\a\f\e'
  ['\s\e\t']='\s\e\t'
  [']']=']'
)

I agree that the literal variation is probably the one most useful to
re-apply in cases like read, print -v, etc.

[-- Attachment #2: test_unset.zsh --]
[-- Type: application/octet-stream, Size: 2033 bytes --]

exec 2>&1	# to see fatal error messages in proper sequence
unset ax bx oops
oops='this is a problem'
typeset -A ax bx
typeset -a kx=(
  '' '\' '`' '(' '[' ')' ']' '=' '"' "'" '!' '@' '$' '*'
  $'\0' '\\' '\`' '\(' '\[' '\)' '\]' '\=' '\!' '\@' '\$' '\*'
  '()safe' '(r)set' '(R)set' '(k)safe'	# look like valid subscript flags
  '(a' '(a)' '(n:foo:)a' '(n:1)a'	# look like invalid subscript flags
  'set' '"set"' \'set\' '\s\e\t'
  'safe' '"safe"' \'safe\' '\s\a\f\e'
  '\\' '\\\' '\\\\' '""' \'\'
  'two words' 'two  spaces' ' initial space' 'trailing space '
  $'\x80\\' $'\x80\`' $'\x80\~'		# broken UTF-8
  '?~>#'
  '$oops' '${oops}' '"$oops"' '"${(@w)oops"' '${(qqq)oops}' '$(echo oops)'
)
for ((i=0; i<255; i++)); do
  printf -v n '\\x%02x' $i
  eval "kx+=(\$'$n')"
done

for k in "$kx[@]"; do
  ax[$k]=$k
  bx[$k]=$k
done
typeset -p 1 ax
print COUNT: $#ax

print $'\nWith (b)'
for k in "$kx[@]"; do
  ( unset "ax[${(b)k}]" ) && unset "ax[${(b)k}]" &&
  print -r -- ${ax[$k]+Missed} ${(V)ax[$k]-Got $k}
done
typeset -p 1 ax

ax=(${(kv)bx})
print $'\nWith (q)'
for k in "$kx[@]"; do
  ( unset "ax[${(q)k}]" ) && unset "ax[${(q)k}]" &&
  print -r -- ${ax[$k]+Missed} ${(V)ax[$k]-Got $k}
done
typeset -p 1 ax

ax=(${(kv)bx})
print $'\nWith (qq)'
for k in "$kx[@]"; do
  ( unset "ax[${(qq)k}]" ) && unset "ax[${(qq)k}]" &&
  print -r -- ${ax[$k]+Missed} ${(V)ax[$k]-Got $k}
done
typeset -p 1 ax

ax=(${(kv)bx})
print $'\nWith (qqq)'
for k in "$kx[@]"; do
  ( unset "ax[${(qqq)k}]" ) && unset "ax[${(qqq)k}]" &&
  print -r -- ${ax[$k]+Missed} ${(V)ax[$k]-Got $k}
done
typeset -p 1 ax

ax=(${(kv)bx})
print $'\nWithout backslashes'
for k in "$kx[@]"; do
  ( unset "ax[$k]" ) && unset "ax[$k]" &&
  print -r -- ${ax[$k]+Missed} ${(V)ax[$k]-Got $k}
done
typeset -p 1 ax

ax=(${(kv)bx})
print $'\nWith pattern from workers/43269'
for k in "$kx[@]"; do
  ( unset "ax[${(*)k//(#m)[]\\]/\\$MATCH}]" ) &&
  unset "ax[${(*)k//(#m)[]\\]/\\$MATCH}]" &&
  print -r -- ${ax[$k]+Missed} ${(V)ax[$k]-Got $k}
done
typeset -p 1 ax
print COUNT: $#ax

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03 18:12                                           ` Bart Schaefer
@ 2021-06-04  8:02                                             ` Stephane Chazelas
  2021-06-04 18:36                                               ` Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-04  8:02 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Peter Stephenson, Zsh hackers list

2021-06-03 11:12:19 -0700, Bart Schaefer:
[...]
> > If I understand correctly, the "stripquote" variant is the one
> > where the user can do:
[...]
> > That is where quotes are parsed and removed but otherwise
> > serve no purpose.
> 
> I guess you can describe it that way, yes.

It also means the lexer, a large and complex bit of code whose
behaviour also depends on a the setting of a number of options
will end up being exposed to user-supplied data (of the users of
the zsh scripts, not just the users of zsh), which adds some
level of risk.

> > That hardly helps with backward compatibility as users who did
> > work around the previous behaviour will still have to adapt
> > their work around
> 
> The point of my previous analysis is that they probably will not have
> to adapt.  Your $MATCH workaround operates correctly with the
> stripquote version in nearly as many cases as it currently does (it is
> not foolproof for the current shell implementation).

If I understand correctly, that quote stripping removes one
level of quotes from tokens identified as string tokens provided
the lexer succeeded in tokenising the input, so any workaround
would have to escape/quote any shell special character to
guarantee the lexer succeeds.

[...]
> I agree that the literal variation is probably the one most useful to
> re-apply in cases like read, print -v, etc.

Note that read/print -v, etc. is radically different.

Currently we can do (reliably):

   read 'hash[$key]'

and can't do (reliably)

   read "hash[$key]"

If we align with whatever solution we pick for unset, we're
going to break a lot more scripts. I don't think we can touch
those at least in the default mode operation.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-04  8:02                                             ` Stephane Chazelas
@ 2021-06-04 18:36                                               ` Bart Schaefer
  2021-06-04 20:21                                                 ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-04 18:36 UTC (permalink / raw)
  To: Bart Schaefer, Peter Stephenson, Zsh hackers list

On Fri, Jun 4, 2021 at 1:02 AM Stephane Chazelas <stephane@chazelas.org> wrote:
>
> It also means the lexer, a large and complex bit of code whose
> behaviour also depends on a the setting of a number of options
> will end up being exposed to user-supplied data (of the users of
> the zsh scripts, not just the users of zsh), which adds some
> level of risk.

This remark baffles me.  The exact same code is used for ${v/pat/repl}
among other things, I haven't added any new entry point to the lexer.

Still, we're continuing to banter about what I think is the
second-best solution anyway.

> Currently we can do (reliably):
>
>    read 'hash[$key]'

It's unfortunate that this is "reliable" because I'm pretty sure it's
totally unintentional and should be considered a bug.  "read" should
not be re-interpreting $key in its NAME parameter.

> and can't do (reliably)
>
>    read "hash[$key]"

That one ought to be reliable if using hash elements here is permitted at all.

> If we align with whatever solution we pick for unset, we're
> going to break a lot more scripts. I don't think we can touch
> those at least in the default mode operation.

I don't know how many is "a lot" here, because frankly this is
something I would never have thought of attempting in the first place
(particularly, relying on the buggy behavior of the single-quoted
case).  But I really want to STOP talking about this in the same
thread as the "unset" question.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-04 18:36                                               ` Bart Schaefer
@ 2021-06-04 20:21                                                 ` Stephane Chazelas
  2021-06-05  0:20                                                   ` Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-04 20:21 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Peter Stephenson, Zsh hackers list

2021-06-04 11:36:09 -0700, Bart Schaefer:
> On Fri, Jun 4, 2021 at 1:02 AM Stephane Chazelas <stephane@chazelas.org> wrote:
> >
> > It also means the lexer, a large and complex bit of code whose
> > behaviour also depends on a the setting of a number of options
> > will end up being exposed to user-supplied data (of the users of
> > the zsh scripts, not just the users of zsh), which adds some
> > level of risk.
> 
> This remark baffles me.  The exact same code is used for ${v/pat/repl}
> among other things, I haven't added any new entry point to the lexer.

But there (and I expect most other contexts where that is used),
the repl is not data supplied by the user of the script, it will
be typically fixed code like $replacement as supplied by the
*author* of the script.

While in:

while IFS=, read -r a b c; do
  unset "hash[$b]"
done < data

And with the stripquotes patch, you'd have the *contents* of $b
(from external data), not the literal $b string fed to the lexer
as if it was meant to be zsh code.

[...]
> > Currently we can do (reliably):
> >
> >    read 'hash[$key]'
> 
> It's unfortunate that this is "reliable" because I'm pretty sure it's
> totally unintentional and should be considered a bug.  "read" should
> not be re-interpreting $key in its NAME parameter.

I'd tend to agree with that statement, and it's this kind of
thing that makes things taking lvalues or arithmetic expressions
very dangerous.

As discussed several times (IIRC, Oliver Kiddle brought it up in
2014 in the aftermaths of shellshock).

  ((x == 1))
  or
  printf %d x
  or
  read "array[x]"

are also command injection vulnerabilities for the same kind of
reason. Here for instance when x='psvar[$(reboot)1]' even
though there's probably not a good reason why $(reboot) would be
expanded when x is referenced in an arithmetic expression.

> 
> > and can't do (reliably)
> >
> >    read "hash[$key]"
> 
> That one ought to be reliable if using hash elements here is permitted at all.
> 
> > If we align with whatever solution we pick for unset, we're
> > going to break a lot more scripts. I don't think we can touch
> > those at least in the default mode operation.
> 
> I don't know how many is "a lot" here

I don't have an answer to that either.

> because frankly this is
> something I would never have thought of attempting in the first place
> (particularly, relying on the buggy behavior of the single-quoted
> case).  But I really want to STOP talking about this in the same
> thread as the "unset" question.

Well, it would be good if all those things would be fixed once
and for all and in a consistent way.

There's also the fact that we can't do:

hash['foo bar
baz'$'\n'...]=value

(one needs things like hash[${(pl[1][\n])}]=x to set elements of
key $'\n')

or

array[1 + 1]=4

That is pretty annoying.

Hard to fix without breaking backward portability, but could we
introduce a better design under a new option (setopt
bettersubscriptsandarithmetics), or a "use 6.0" à la perl?

See also https://github.com/ksh93/ksh/issues/152 
where Martijn Dekker has done some improvements on that front in
ksh93.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-04 20:21                                                 ` Stephane Chazelas
@ 2021-06-05  0:20                                                   ` Bart Schaefer
  2021-06-05 17:05                                                     ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-05  0:20 UTC (permalink / raw)
  To: Zsh hackers list

On Fri, Jun 4, 2021 at 1:21 PM Stephane Chazelas <stephane@chazelas.org> wrote:
>
> while IFS=, read -r a b c; do
>   unset "hash[$b]"
> done < data
>
> And with the stripquotes patch, you'd have the *contents* of $b
> (from external data), not the literal $b string fed to the lexer
> as if it was meant to be zsh code.

It's doing only pure lexical analysis of a single "string" token,
there's no chance of it interpolating anything.

> There's also the fact that we can't do:
>
> hash['foo bar
> baz'$'\n'...]=value

Hmm, you mean because subscripts (even hash subscripts) are parsed as
if in $(( )) and $'...' does not expand there.

> array[1 + 1]=4
>
> That is pretty annoying.

That's just shell whitespace rules, though.  If you have nobadpattern
set, you get

zsh: command not found: array[1

And if the parser doesn't eventually find ]= then what?  Re-parse the
whole thing to word-split it?

I don't think there's any way to retconn word-splitting.  We could
perhaps overload ={...} (or something) so that instead of

: ${array[1 + 1]:=4}

you could write

={array[1 + 1]}=4

but I don't think that's really that much prettier (and unlikely as it
would be to appear in any existing script, that's obviously not
backwards-compatible).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-05  0:20                                                   ` Bart Schaefer
@ 2021-06-05 17:05                                                     ` Stephane Chazelas
  2021-06-10  0:14                                                       ` Square brackets in command position Bart Schaefer
  0 siblings, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-05 17:05 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2021-06-04 17:20:29 -0700, Bart Schaefer:
[...]
> > array[1 + 1]=4
> >
> > That is pretty annoying.
> 
> That's just shell whitespace rules, though.  If you have nobadpattern
> set, you get
> 
> zsh: command not found: array[1
> 
> And if the parser doesn't eventually find ]= then what?  Re-parse the
> whole thing to word-split it?
[...]

Note that all other shells that support a[1]=value (ksh, pdksh,
bash) do it.

In effect that makes them non-POSIX as there's nothing in the
spec POSIX that says

a[1 + 1]=2

May not run the a[1 with +] and =2 as arguments.

I did bring that up a few years ago at:
https://www.mail-archive.com/austin-group-l@opengroup.org/msg04563.html
(see also https://austingroupbugs.net/view.php?id=1279)

There is variation in behaviour between shells if the first word
starts with name[ and either a matching ] or matching ] followed
by = is not found, and how that matching is done or what kind of
quoting are supported inside the [...].

But all in all, even if it may not be pretty when you look closely,
that does work fine there as long as you're not trying to fool
the parser.

In zsh, that would be even less problematic as (at least when
not in sh emulation), zsh complains about unmatched [s or ]s
(except for [ alone)

Even:

$ /bin/[ a = a ]
zsh: bad pattern: /bin/[

So we wouldn't be breaking anything if we started to accept:

hash[foo bar; baz]=x

or

array[1 + 1]=x

We would if we started to accept.

hash['x]y'-$'\n']=x

like ksh93/bash do though.

For literal hash[key]=value assignment, I never remember what needs
to be quoted and how. I generally resort to doing
var=key; hash[$var]=x
or
hash+=(key x)
which I know are fine.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Square brackets in command position
  2021-06-05 17:05                                                     ` Stephane Chazelas
@ 2021-06-10  0:14                                                       ` Bart Schaefer
  0 siblings, 0 replies; 58+ messages in thread
From: Bart Schaefer @ 2021-06-10  0:14 UTC (permalink / raw)
  To: Zsh hackers list

On Sat, Jun 5, 2021 at 10:05 AM Stephane Chazelas <stephane@chazelas.org> wrote:
>
> 2021-06-04 17:20:29 -0700, Bart Schaefer:
> [...]
> > > array[1 + 1]=4
> > >
> > > That is pretty annoying.
> >
> > That's just shell whitespace rules, though.
>
> Note that all other shells that support a[1]=value (ksh, pdksh,
> bash) do it.

Chip will probably correct me, but from some experiments with bash it
appears that (with the exception of '[' all alone, to invoke "test")
an (unquoted) '[' in a word in command position always begins an
expression that must be bounded by a closing (unquoted) ']' (even if
that means continuing to the PS2 prompt).  Thus for example
  $ a[ 1 + 1 ]b
globs to a file named "a b" in the current directory and does a path
search to run that as a command.  Note that
 $ [a ]b
globs to "ab" or " b" (leading space there if that's hard to see),
it's not what's to the left of '[' that triggers this.

Thus it's not just assignments that have command-position magic,
character classes can have embedded (unquoted) spaces there as well.
This breaks if you are looking for a file named e.g. "a+=" because
']=' turns the whole thing into an array element assignment.

> There is variation in behaviour between shells if the first word
> starts with name[ and either a matching ] or matching ] followed
> by = is not found, and how that matching is done or what kind of
> quoting are supported inside the [...].

That doesn't give us any good ideas on where to start.  Of course
zsh's parsing hash keys using the same quoting rules as arithmetic was
a quick way to slurp hash[string] through the existing array[math]
parser at the time, and in retrospect probably should have been given
its own rules.

> In zsh, that would be even less problematic as (at least when
> not in sh emulation), zsh complains about unmatched [s or ]s

As mentioned before, that's the BAD_PATTERN setopt, which can be
turned off independent of emulation, so it might be a stretch to say
we wouldn't be breaking anything.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03  2:04                                 ` [PATCH (not final)] (take three?) unset "array[$anything]" Bart Schaefer
  2021-06-03  2:42                                   ` Bart Schaefer
@ 2021-06-03  6:05                                   ` Stephane Chazelas
  2021-06-03  6:43                                     ` Bart Schaefer
  1 sibling, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-03  6:05 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2021-06-02 19:04:01 -0700, Bart Schaefer:
> On Wed, Jun 2, 2021 at 8:59 AM Bart Schaefer <schaefer@brasslantern.com> wrote:
> >
> > I've just had a hand-slaps-forehead
> > moment ... take 3 to follow in another thread.
> 
> What I realized is that for any unset of an array element, the closing
> bracket must always be the last character of the argument.  There's no
> reason to parse the subscript or skip over matching brackets; if a '['
> is found, just make sure the last character is ']' and the subscript
> must be everything in between.

D'oh, I had assumed that had been like that in the beginning but
changed to allow backslash processing for some reason or other
such as alignment with something else.

[...]
> Given this realization, it's easy to make { unset "hash[$key]" } work
> "like it always should have".  The trouble comes in with (not)
> breaking past workarounds.

I think I'd be in favour of breaking these workarounds on the
ground that it would fix far more existing script (that just do
unset "hash[$key]") than it would fix.

There's also the question of what to do with

read "hash[$key]"
getopts "hash[$key]"

and getln, print[f] -v, sysread, strftime -s...

I've not tested all of them but at least for read and print -v,

key=foo
typeset -A a
print -v 'a[$key]' x

gives:

typeset -A a=( [foo]=x )

and not

typeset -A a=( ['$key']=x )

With those, one can't do:

read "hash[$key]"

(which would be a command injection vulnerability)

And the (unintuitive) work around is:

read 'hash[$key]'

It's the same in bash unless you set the assoc_expand_once
option there, not in ksh93.

My suggestion of making unset a keyword so that unset hash[$key]
works as expected  while unset "hash[$key]" works as before
won't fly as we can't possibly make all builtins that take
lvalues keywords.

[...]
> Therefore I think the best option is to choose one of the latter two,
> possibly depending on which one induces the least damage to any
> workarounds for the current behavior that are known in the wild,
> though aesthetically I'd rather use the literal version.
[...]

Well, the only one that doesn't break workarounds is to leave it
asis (and potentially add support for unset 'hash[]'), or
introduce a new syntax (like unset -k "$key" hash) or an option
like bash's assoc_expand_once to switch to the new behaviour.

We can't really expect people to carry on doing:

() { set -o localoptions +o multibyte -o extendedglob
unset "hash[${key//(#m)[][\\()\`]/\\$MATCH}]"; }

to unset a hash key, so I don't thing it's reasonable to leave
it asis.

Another option would be to align unset with the other builtins,
but that's even more problematic as codes that were doing:

unset "hash[$key]"

would change from being buggy (choking on []()`\ and other
characters containing the encoding of those)  to being command
injection vulnerabilities (like with bash without
assoc_expand_once):

$ echo x | key='$(uname>&2)x' bash -c 'typeset -A a; a[$key]=1; unset "a[$key]"; typeset -p a'
Linux
declare -A a=(["\$(uname>&2)x"]="1" )
$ echo x | key='$(uname>&2)x' bash -O assoc_expand_once -c 'typeset -A a; a[$key]=1; unset "a[$key]"; typeset -p a'
declare -A a=()

(zsh has the same problem with read/print -v...)

And again, there is also the problem of hashes used in
arithmetic expressions, including:

key='evil $(reboot)'
print -v 'array[++hash[\$key]]' x

(here you need both the single quotes and the \ to avoid
evaluations of code in the contents of $key)

I must admit I don't really have an idea of how to get out of
this mess. bash, which must have gone through a similar
exercise, hasn't fixed everything with its assoc_expand_once
(see
https://unix.stackexchange.com/questions/627474/how-to-use-associative-arrays-safely-inside-arithmetic-expressions/627475#627475
again).

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03  6:05                                   ` [PATCH (not final)] (take three?) unset "array[$anything]" Stephane Chazelas
@ 2021-06-03  6:43                                     ` Bart Schaefer
  2021-06-03  7:31                                       ` Stephane Chazelas
  0 siblings, 1 reply; 58+ messages in thread
From: Bart Schaefer @ 2021-06-03  6:43 UTC (permalink / raw)
  To: Zsh hackers list

On Wed, Jun 2, 2021 at 11:05 PM Stephane Chazelas <stephane@chazelas.org> wrote:
>
> 2021-06-02 19:04:01 -0700, Bart Schaefer:
> > On Wed, Jun 2, 2021 at 8:59 AM Bart Schaefer <schaefer@brasslantern.com> wrote:
> > What I realized is that for any unset of an array element, the closing
> > bracket must always be the last character of the argument.
>
> D'oh, I had assumed that had been like that in the beginning

No ... it used to do skipparens() which would always stop too soon if
there was an embedded ']', and I changed it to parse_subscript() to
make '\]' possible.

> I think I'd be in favour of breaking these workarounds on the
> ground that it would fix far more existing script (that just do
> unset "hash[$key]") than it would [break].

See my follow-up message comparing variations.

> There's also the question of what to do with

... a bunch of things that aren't documented to work with associative
array elements but accidentally do ...

> My suggestion of making unset a keyword so that unset hash[$key]
> works as expected  while unset "hash[$key]" works as before
> won't fly as we can't possibly make all builtins that take
> lvalues keywords.

Strictly speaking they don't take lvalues, they take shell parameter
names and use them as lvalues.

> We can't really expect people to carry on doing:
>
> () { set -o localoptions +o multibyte -o extendedglob
> unset "hash[${key//(#m)[][\\()\`]/\\$MATCH}]"; }

There's already an unreleased change for the (*) parameter flag to
locally enable extendedglob:

  unset "hash[${(*)key//(#m)[][\\()\`]/\\$MATCH}]"; }

but still unreasonable.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH (not final)] (take three?) unset "array[$anything]"
  2021-06-03  6:43                                     ` Bart Schaefer
@ 2021-06-03  7:31                                       ` Stephane Chazelas
  0 siblings, 0 replies; 58+ messages in thread
From: Stephane Chazelas @ 2021-06-03  7:31 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

2021-06-02 23:43:50 -0700, Bart Schaefer:
[...]
> > There's also the question of what to do with
> 
> ... a bunch of things that aren't documented to work with associative
> array elements but accidentally do ...
[...]

That's quite a natural thing to want to do though and is
supported by all shells with arrays (except yash it seems).

See for instance CSV processing such as:

typeset -A field
IFS=,
{
  read -rA headers
  (($#headers)) && while read -r 'field['$^headers']'; do
    typeset -p field
    do-something-with "$field[firstname]" "$field[lastname]"
  done
} < file.csv

That code above currently fails if the header contains some
special characters and has a code injection vulnerability (like
if the first line of the csv is $(reboot)), and making read a
keyword wouldn't help.

It can currently be avoided with:

typeset -A field
IFS=,
{
  read -rA headers
  (($#headers)) && while read -r 'field[$headers['{1..$#headers}']]'; do
    typeset -p field
    do-something-with "$field[firstname]" "$field[lastname]"
  done
} < file.csv

-- 
Stephane


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-01 16:05                       ` Bart Schaefer
  2021-06-02  2:51                         ` [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff] Bart Schaefer
  2021-06-02  9:11                         ` [PATCH] (?) typeset array[position=index]=value Stephane Chazelas
@ 2021-06-10  0:21                         ` Bart Schaefer
  2 siblings, 0 replies; 58+ messages in thread
From: Bart Schaefer @ 2021-06-10  0:21 UTC (permalink / raw)
  To: Zsh hackers list

On Tue, Jun 1, 2021 at 9:05 AM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> I should at least examine whether getasg() ought to be using
> parse_subscript() even though the corresponding parse.c block is using
> skipparens().

This got lumped into the "unset" thread via workers/48993 ... should I
split it off and commit it separately?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-05-31 21:37                   ` [PATCH] (?) typeset array[position=index]=value Bart Schaefer
  2021-06-01  5:32                     ` Stephane Chazelas
@ 2021-06-05  4:29                     ` Mikael Magnusson
  2021-06-05  5:49                       ` Bart Schaefer
  1 sibling, 1 reply; 58+ messages in thread
From: Mikael Magnusson @ 2021-06-05  4:29 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

On 5/31/21, Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Mon, May 31, 2021 at 11:18 AM Bart Schaefer
> <schaefer@brasslantern.com> wrote:
>>
>> On Wed, May 5, 2021 at 4:45 AM Stephane Chazelas <stephane@chazelas.org>
>> wrote:
>> > [typeset] chokes on lvalue='array[n=1]' or
>> > lvalue='assoc[keywith=characters]'
>>
>> Hmm, I wonder if that should be considered a bug.
>
> This copies (tweaked for context) the code from parse.c at line 2006
> or thereabouts.
>
> All tests still pass, but as you can see from the comment this is not
> yet handling x+=y which there doesn't seem to be any reason for
> typeset NOT to support; I think it would require only another flag in
> struct asgment, but I haven't attempted that.
>
> Commentary?

Is there some fundamental reason we couldn't just make the obvious
thing work instead?

% typeset -A foo
% foo=(a b c d)
% foo[a]=()
# what actually happens
zsh: foo: attempt to set slice of associative array
# what seems reasonable to happen
% typeset -p foo
typeset -A foo=( [c]=d )



Relatedly, this also seems very inconsistent:
% typeset -A foo
% foo=(a b c d)
% bar=(a b c d)
% unset 'foo[a]'
% unset 'bar[2]'
% typeset -p foo bar
typeset -A foo=( [c]=d )
typeset -a bar=( a '' c d )

So for regular arrays, unset will just set the element to the empty
string, for assoc arrays it removes the key and the value.
For regular arrays, assigning () to the element unsets(?) the element,
and for assoc arrays it is an error.

-- 
Mikael Magnusson


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-05  4:29                     ` Mikael Magnusson
@ 2021-06-05  5:49                       ` Bart Schaefer
  2021-06-05 11:06                         ` Mikael Magnusson
  2021-06-18 10:53                         ` Mikael Magnusson
  0 siblings, 2 replies; 58+ messages in thread
From: Bart Schaefer @ 2021-06-05  5:49 UTC (permalink / raw)
  To: Mikael Magnusson; +Cc: Zsh hackers list

On Fri, Jun 4, 2021 at 9:29 PM Mikael Magnusson <mikachu@gmail.com> wrote:
>
> On 5/31/21, Bart Schaefer <schaefer@brasslantern.com> wrote:
> > All tests still pass, but as you can see from the comment this is not
> > yet handling x+=y which there doesn't seem to be any reason for
> > typeset NOT to support; I think it would require only another flag in
> > struct asgment, but I haven't attempted that.
> >
> > Commentary?
>
> Is there some fundamental reason we couldn't just make the obvious
> thing work instead?

For one thing, because the above isn't about what happens to the
value, it's about whether assignment can understand the key?

> % typeset -A foo
> % foo=(a b c d)
> % foo[a]=()
> # what actually happens
> zsh: foo: attempt to set slice of associative array
> # what seems reasonable to happen
> % typeset -p foo
> typeset -A foo=( [c]=d )

The problem is with the follow-up questions, namely ,what should happen with

foo=( [a]=() )
foo[a]=( z )
foo[a]=( x y )
etc.

> Relatedly, this also seems very inconsistent:
[...]
> So for regular arrays, unset will just set the element to the empty
> string, for assoc arrays it removes the key and the value.

That's because zsh doesn't support sparse arrays, and the NULL element
indicates the end of the array, so you can't put a NULL element in the
middle.  If we'd thought about it long ago, it might be an error to
unset a regular array element, but we're stuck now.

> For regular arrays, assigning () to the element unsets(?) the element

No.  For regular arrays assigning an array to an element splices the
rvalue array into the lvalue array.   Splicing the empty array
shortens the regular array, it doesn't cause elements to become unset
(unless you consider that the former $#'th element is now unset
because that position no longer exists).

> and for assoc arrays it is an error.

Because you can't perform a splice on a hash table.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-05  5:49                       ` Bart Schaefer
@ 2021-06-05 11:06                         ` Mikael Magnusson
  2021-06-05 16:22                           ` Bart Schaefer
  2021-06-18 10:53                         ` Mikael Magnusson
  1 sibling, 1 reply; 58+ messages in thread
From: Mikael Magnusson @ 2021-06-05 11:06 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

On 6/5/21, Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Fri, Jun 4, 2021 at 9:29 PM Mikael Magnusson <mikachu@gmail.com> wrote:
>>
>> On 5/31/21, Bart Schaefer <schaefer@brasslantern.com> wrote:
>> > All tests still pass, but as you can see from the comment this is not
>> > yet handling x+=y which there doesn't seem to be any reason for
>> > typeset NOT to support; I think it would require only another flag in
>> > struct asgment, but I haven't attempted that.
>> >
>> > Commentary?
>>
>> Is there some fundamental reason we couldn't just make the obvious
>> thing work instead?
>
> For one thing, because the above isn't about what happens to the
> value, it's about whether assignment can understand the key?

Well, there was 4 different threads about saying "unset foo[$bar]" and
I didn't know exactly where to reply so I went to the source, maybe I
went one thread too far back though.

>> % typeset -A foo
>> % foo=(a b c d)
>> % foo[a]=()
>> # what actually happens
>> zsh: foo: attempt to set slice of associative array
>> # what seems reasonable to happen
>> % typeset -p foo
>> typeset -A foo=( [c]=d )
>
> The problem is with the follow-up questions, namely ,what should happen
> with
>
> foo=( [a]=() )
remove foo[a]
> foo[a]=( z )
set foo[a] to z
> foo[a]=( x y )
error
> etc.
there probably aren't more cases

>> Relatedly, this also seems very inconsistent:
> [...]
>> So for regular arrays, unset will just set the element to the empty
>> string, for assoc arrays it removes the key and the value.
>
> That's because zsh doesn't support sparse arrays, and the NULL element
> indicates the end of the array, so you can't put a NULL element in the
> middle.  If we'd thought about it long ago, it might be an error to
> unset a regular array element, but we're stuck now.
>
>> For regular arrays, assigning () to the element unsets(?) the element
>
> No.  For regular arrays assigning an array to an element splices the
> rvalue array into the lvalue array.   Splicing the empty array
> shortens the regular array, it doesn't cause elements to become unset
> (unless you consider that the former $#'th element is now unset
> because that position no longer exists).

By this logic, unset "foo[bar]" should have had the same effect as
foo[bar]="" (also too late).

>> and for assoc arrays it is an error.
>
> Because you can't perform a splice on a hash table.

Well, with my naive end user hat, this seems much more logical to
remove an element than fiddling with the unset keyword, which doesn't
even do the same on a normal array. If you come from knowing that
bar[5]=() removes the 5th element, then it's reasonable to assume that
foo[baz]=() would work too.

Since we've established that both unset and assignment to an element
behaves in no way the same already for regular arrays and associative
arrays, is there any reason to not make assignment to associative
array slices (with the empty slice) just remove the element? The
parsing already works and it wouldn't break any existing cases.

-- 
Mikael Magnusson


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-05 11:06                         ` Mikael Magnusson
@ 2021-06-05 16:22                           ` Bart Schaefer
  0 siblings, 0 replies; 58+ messages in thread
From: Bart Schaefer @ 2021-06-05 16:22 UTC (permalink / raw)
  To: Mikael Magnusson; +Cc: Zsh hackers list

On Sat, Jun 5, 2021 at 4:06 AM Mikael Magnusson <mikachu@gmail.com> wrote:
>
> On 6/5/21, Bart Schaefer <schaefer@brasslantern.com> wrote:
> >
> > foo=( [a]=() )
> remove foo[a]

That doesn't make sense.  foo=( [a]=z ) replaces the entire hash table
with a single element.
  foo=( [a]=() [b]=z )
would just be a very verbose way of saying foo=( b z ) ??

> > foo[a]=( z )
> set foo[a] to z
> > foo[a]=( x y )
> error

How is that any more consistent than what we have now?

> > etc.
> there probably aren't more cases

foo[(R)*a*]=( ... )

If we can assign one element to foo[a] with parens, why can't we
assign a list of elements?

> Since we've established that both unset and assignment to an element
> behaves in no way the same already for regular arrays and associative
> arrays, is there any reason to not make assignment to associative
> array slices (with the empty slice) just remove the element? The
> parsing already works and it wouldn't break any existing cases.

We can continue to have that discussion, but IMO that's orthogonal to
whether subscripts are parsed correctly by unset.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] (?) typeset array[position=index]=value
  2021-06-05  5:49                       ` Bart Schaefer
  2021-06-05 11:06                         ` Mikael Magnusson
@ 2021-06-18 10:53                         ` Mikael Magnusson
  1 sibling, 0 replies; 58+ messages in thread
From: Mikael Magnusson @ 2021-06-18 10:53 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Zsh hackers list

On 6/5/21, Bart Schaefer <schaefer@brasslantern.com> wrote:
> On Fri, Jun 4, 2021 at 9:29 PM Mikael Magnusson <mikachu@gmail.com> wrote:
>>
>> Relatedly, this also seems very inconsistent:
> [...]
>> So for regular arrays, unset will just set the element to the empty
>> string, for assoc arrays it removes the key and the value.
>
> That's because zsh doesn't support sparse arrays, and the NULL element
> indicates the end of the array, so you can't put a NULL element in the
> middle.  If we'd thought about it long ago, it might be an error to
> unset a regular array element, but we're stuck now.
>
>> For regular arrays, assigning () to the element unsets(?) the element
>
> No.  For regular arrays assigning an array to an element splices the
> rvalue array into the lvalue array.   Splicing the empty array
> shortens the regular array, it doesn't cause elements to become unset
> (unless you consider that the former $#'th element is now unset
> because that position no longer exists).

I just happened to come across the following, and I'm struggling to
fit the result into any sort of logic:
% a=( {1..50} )
% unset 'a[5,45]'
% typeset -p a
typeset -a a=( 1 2 3 4 '' 46 47 48 49 50 )

So did we just remove the given elements, or set them to the empty
string? Yes, we did both :).

(note, this isn't intended as an argument for or against any previous
points, i'm just mentioning it because it still seems inconsistent to
me)

-- 
Mikael Magnusson


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
  2021-05-05 11:45               ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  2021-05-31  0:58                 ` Lawrence Velázquez
  2021-05-31 18:18                 ` Bart Schaefer
@ 2024-03-08 15:30                 ` Stephane Chazelas
  2024-03-09  8:41                   ` [PATCH v5] " Stephane Chazelas
  2024-03-09 13:03                   ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  2 siblings, 2 replies; 58+ messages in thread
From: Stephane Chazelas @ 2024-03-08 15:30 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

2021-05-05 12:45:21 +0100, Stephane Chazelas:
> 2021-04-30 16:13:34 -0700, Bart Schaefer:
> [...]
> > I went back and looked at the patch again.
> 
> Thanks. Here's a third version with further improvements
> addressing some of the comments here.
[...]

That v3 patch had (at least) a couple of bugs:
- in ERE mode, replacement was not inserted properly when
  pattern matched an empty string not at the start of the
  subject (like in regexp-replace var '\>' new)

- it would run in an infinite loop when there's no match in ERE
  mode.

I see Bart ended up committing the v2 version of my patch (from
48747) a few months later in:

commit bb61da36aaeeaa70413cdf5bc66d7a71194f93e5
Author:     Stephane Chazelas <stephane.chazelas@gmail.com>
AuthorDate: Mon Sep 6 14:43:01 2021 -0700
Commit:     Bart Schaefer <schaefer@ipost.com>
CommitDate: Mon Sep 6 14:43:01 2021 -0700

    45180: clarify doc for POSIX EREs, fix an issue with PCRE when the replacement was empty or generated more than one element

That one didn't have the second problem but had the first and
also failed to add the replacement in:

regexp-replace var '^' replacement

for instance when $var is initially empty.

So here's a v4 that should address that, some of the objections
to v2 and uses namerefs to replace that illegible usage
of positional parameters for local variables (that's diff
against current HEAD, not pre-v2).

I went for the:

typeset -g -- "$1"
typeset -nu -- var=$1

suggested by Bart to avoid possible clashes with local variable
names. That might have side effects if called as regexp-replace
'a[2]' re?

diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index d4408f0f7..0e3deed4f 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,91 +1,95 @@
-# Replace all occurrences of a regular expression in a variable.  The
-# variable is modified directly.  Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly.  Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string.  This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-# we use positional parameters instead of variables to avoid
-# clashing with the user's variable. Make sure we start with 3 and only
-# 3 elements:
-argv=("$1" "$2" "$3")
-
-# $4 records whether pcre is enabled as that information would otherwise
-# be lost after emulate -L zsh
-4=0
-[[ -o re_match_pcre ]] && 4=1
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue;
+#    expected to be scalar.
+#
+# 2. regular expression
+#
+# 3. replacement string.  This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 <varname> <regexp> [<replacement>]"
+  return 2
+fi
 
-emulate -L zsh
+# ensure variable exists in the caller's scope before referencing it
+# to make sure we don't end up referencing one of our own.
+typeset -g -- "$1" || return 2
+typeset -nu -- var=$1 || return 2
 
+local -i use_pcre=0
+[[ -o re_match_pcre ]] && use_pcre=1
 
-local MATCH MBEGIN MEND
+emulate -L zsh
+
+local regexp=$2 replacement=$3 result MATCH MBEGIN MEND
 local -a match mbegin mend
 
-if (( $4 )); then
+if (( use_pcre )); then
   # if using pcre, we're using pcre_match and a running offset
   # That's needed for ^, \A, \b, and look-behind operators to work
   # properly.
 
   zmodload zsh/pcre || return 2
-  pcre_compile -- "$2" && pcre_study || return 2
+  pcre_compile -- "$regexp" && pcre_study || return 2
+
+  local -i offset=0 start stop
+  local new ZPCRE_OP
+  local -a finds
 
-  # $4 is the current *byte* offset, $5, $6 reserved for later use
-  4=0 6=
+  while pcre_match -b -n $offset -- "$var"; do
+    # we need to perform the evaluation in a scalar assignment so that
+    # if it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    new=${(Xe)replacement}
 
-  local ZPCRE_OP
-  while pcre_match -b -n $4 -- "${(P)1}"; do
-    # append offsets and computed replacement to the array
-    # we need to perform the evaluation in a scalar assignment so that if
-    # it generates an array, the elements are converted to string (by
-    # joining with the first character of $IFS as usual)
-    5=${(e)3}
-    argv+=(${(s: :)ZPCRE_OP} "$5")
+    finds+=( ${(s[ ])ZPCRE_OP} "$new" )
 
     # for 0-width matches, increase offset by 1 to avoid
     # infinite loop
-    4=$((argv[-2] + (argv[-3] == argv[-2])))
+    (( offset = finds[-2] + (finds[-3] == finds[-2]) ))
   done
 
-  (($# > 6)) || return # no match
+  (( $#finds )) || return # no match
 
-  set +o multibyte
+  unsetopt multibyte
 
-  # $5 contains the result, $6 the current offset
-  5= 6=1
-  for 2 3 4 in "$@[7,-1]"; do
-    5+=${(P)1[$6,$2]}$4
-    6=$(($3 + 1))
+  offset=1
+  for start stop new in "$finds[@]"; do
+    result+=${var[offset,start]}$new
+    (( offset = stop + 1 ))
   done
-  5+=${(P)1[$6,-1]}
-else
+  result+=${var[offset,-1]}
+
+else # no PCRE
+
   # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
   # available) won't work properly.
-
-  # $4 is the string to be matched
-  4=${(P)1}
-
-  while [[ -n $4 ]]; do
-    if [[ $4 =~ $2 ]]; then
-      # append initial part and substituted match
-      5+=${4[1,MBEGIN-1]}${(e)3}
-      # truncate remaining string
-      if ((MEND < MBEGIN)); then
-        # zero-width match, skip one character for the next match
-        ((MEND++))
-	5+=${4[1]}
-      fi
-      4=${4[MEND+1,-1]}
-      # indicate we did something
-      6=1
-    else
-      break
+  local subject=$var
+  local -i ok
+  while [[ $subject =~ $regexp ]]; do
+    # append initial part and substituted match
+    result+=$subject[1,MBEGIN-1]${(Xe)replacement}
+    # truncate remaining string
+    if (( MEND < MBEGIN )); then
+      # zero-width match, skip one character for the next match
+      (( MEND++ ))
+      result+=$subject[MBEGIN]
     fi
+    subject=$subject[MEND+1,-1]
+    ok=1
+    [[ -n $subject ]] && break
   done
-  [[ -n $6 ]] || return # no match
-  5+=$4
+  (( ok )) || return
+  result+=$subject
 fi
 
-eval $1=\$5
+var=$result


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v5] regexp-replace and ^, word boundary or look-behind operators (and more).
  2024-03-08 15:30                 ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
@ 2024-03-09  8:41                   ` Stephane Chazelas
  2024-03-09  9:21                     ` MBEGIN when =~ finds bytes inside characters (Was: [PATCH v5] regexp-replace and ^, word boundary or look-behind operators (and more).) Stephane Chazelas
  2024-03-09 13:03                   ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  1 sibling, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2024-03-09  8:41 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

2024-03-08 15:30:50 +0000, Stephane Chazelas:
[...]
> - it would run in an infinite loop when there's no match in ERE
>   mode.
[...]
> +    [[ -n $subject ]] && break
[...]

D'oh, should be -z instead of -n or || instead of &&.

So, here's a v5 patch (previous one should have been a v4, I
forgot to update the subject):

diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index d4408f0f7..db8f63404 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,91 +1,95 @@
-# Replace all occurrences of a regular expression in a variable.  The
-# variable is modified directly.  Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly.  Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string.  This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-# we use positional parameters instead of variables to avoid
-# clashing with the user's variable. Make sure we start with 3 and only
-# 3 elements:
-argv=("$1" "$2" "$3")
-
-# $4 records whether pcre is enabled as that information would otherwise
-# be lost after emulate -L zsh
-4=0
-[[ -o re_match_pcre ]] && 4=1
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue;
+#    expected to be scalar.
+#
+# 2. regular expression
+#
+# 3. replacement string.  This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 <varname> <regexp> [<replacement>]"
+  return 2
+fi
 
-emulate -L zsh
+# ensure variable exists in the caller's scope before referencing it
+# to make sure we don't end up referencing one of our own.
+typeset -g -- "$1" || return 2
+typeset -nu -- var=$1 || return 2
 
+local -i use_pcre=0
+[[ -o re_match_pcre ]] && use_pcre=1
 
-local MATCH MBEGIN MEND
+emulate -L zsh
+
+local regexp=$2 replacement=$3 result MATCH MBEGIN MEND
 local -a match mbegin mend
 
-if (( $4 )); then
+if (( use_pcre )); then
   # if using pcre, we're using pcre_match and a running offset
   # That's needed for ^, \A, \b, and look-behind operators to work
   # properly.
 
   zmodload zsh/pcre || return 2
-  pcre_compile -- "$2" && pcre_study || return 2
+  pcre_compile -- "$regexp" && pcre_study || return 2
+
+  local -i offset=0 start stop
+  local new ZPCRE_OP
+  local -a finds
 
-  # $4 is the current *byte* offset, $5, $6 reserved for later use
-  4=0 6=
+  while pcre_match -b -n $offset -- "$var"; do
+    # we need to perform the evaluation in a scalar assignment so that
+    # if it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    new=${(Xe)replacement}
 
-  local ZPCRE_OP
-  while pcre_match -b -n $4 -- "${(P)1}"; do
-    # append offsets and computed replacement to the array
-    # we need to perform the evaluation in a scalar assignment so that if
-    # it generates an array, the elements are converted to string (by
-    # joining with the first character of $IFS as usual)
-    5=${(e)3}
-    argv+=(${(s: :)ZPCRE_OP} "$5")
+    finds+=( ${(s[ ])ZPCRE_OP} "$new" )
 
     # for 0-width matches, increase offset by 1 to avoid
     # infinite loop
-    4=$((argv[-2] + (argv[-3] == argv[-2])))
+    (( offset = finds[-2] + (finds[-3] == finds[-2]) ))
   done
 
-  (($# > 6)) || return # no match
+  (( $#finds )) || return # no match
 
-  set +o multibyte
+  unsetopt multibyte
 
-  # $5 contains the result, $6 the current offset
-  5= 6=1
-  for 2 3 4 in "$@[7,-1]"; do
-    5+=${(P)1[$6,$2]}$4
-    6=$(($3 + 1))
+  offset=1
+  for start stop new in "$finds[@]"; do
+    result+=${var[offset,start]}$new
+    (( offset = stop + 1 ))
   done
-  5+=${(P)1[$6,-1]}
-else
+  result+=${var[offset,-1]}
+
+else # no PCRE
+
   # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
   # available) won't work properly.
-
-  # $4 is the string to be matched
-  4=${(P)1}
-
-  while [[ -n $4 ]]; do
-    if [[ $4 =~ $2 ]]; then
-      # append initial part and substituted match
-      5+=${4[1,MBEGIN-1]}${(e)3}
-      # truncate remaining string
-      if ((MEND < MBEGIN)); then
-        # zero-width match, skip one character for the next match
-        ((MEND++))
-	5+=${4[1]}
-      fi
-      4=${4[MEND+1,-1]}
-      # indicate we did something
-      6=1
-    else
-      break
+  local subject=$var
+  local -i ok
+  while [[ $subject =~ $regexp ]]; do
+    # append initial part and substituted match
+    result+=$subject[1,MBEGIN-1]${(Xe)replacement}
+    # truncate remaining string
+    if (( MEND < MBEGIN )); then
+      # zero-width match, skip one character for the next match
+      (( MEND++ ))
+      result+=$subject[MBEGIN]
     fi
+    subject=$subject[MEND+1,-1]
+    ok=1
+    [[ -z $subject ]] && break
   done
-  [[ -n $6 ]] || return # no match
-  5+=$4
+  (( ok )) || return
+  result+=$subject
 fi
 
-eval $1=\$5
+var=$result


^ permalink raw reply	[flat|nested] 58+ messages in thread

* MBEGIN when =~ finds bytes inside characters (Was: [PATCH v5] regexp-replace and ^, word boundary or look-behind operators (and more).)
  2024-03-09  8:41                   ` [PATCH v5] " Stephane Chazelas
@ 2024-03-09  9:21                     ` Stephane Chazelas
  0 siblings, 0 replies; 58+ messages in thread
From: Stephane Chazelas @ 2024-03-09  9:21 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

2024-03-09 08:41:58 +0000, Stephane Chazelas:
[...]
> +  while [[ $subject =~ $regexp ]]; do
> +    # append initial part and substituted match
> +    result+=$subject[1,MBEGIN-1]${(Xe)replacement}
[...]

BTW, likely not zsh's fault but here on Ubuntu 22.04

With:

$ a=$'ABC/\U0010fffe/DEF'
$ print -r - ${(q)a}
ABC/$'\364\217\277\276'/DEF

So with a string containing a 4-byte multibyte character.

$ regexp-replace  a $'\276' $'\277'
$ print -r - ${(q)a}
ABC/$'\364\217\277\276'/D$'\277'F

See $'\277' not replacing $'\276' but E instead.

It's my bad as a user to be doing that with multibyte enabled in
a locale with a multibyte charset.

$ a=$'ABC/\U0010fffe/DEF'
$ set +o multibyte
$ regexp-replace  a $'\276' $'\277'
$ print -r - ${(q+)a}
$'ABC/\U0010ffff/DEF'
$ set -o multibyte
$ print -r - ${(q)a}
ABC/$'\364\217\277\277'/DEF

Is OK

The problem here is:

$ [[ $a =~ $'\276' ]]
$ echo $MBEGIN $MEND
8 8
$ [[ $a =~ D ]]
$ echo $MBEGIN $MEND
7 7

And could very well be caused by a bug in my regex library,
maybe a variation of
https://sourceware.org/bugzilla/show_bug.cgi?id=31075 for regex.

If the problem is in the system's regexps, I can't think of
anything zsh could do about it except maybe checking that
subject and regexp decode as text properly, and error out if not
like it does in pcre mode.

zsh pattern matching seems to be handling it better.

$ [[ $a = (#b)*($'\276')* ]] && echo match; typeset mbegin mend
mbegin=( -1 -1 )
mend=( -1 -1 )
$ [[ $a = (#b)*(D)* ]] && echo match; typeset mbegin mend
match
mbegin=( 7 )
mend=( 7 )

I wonder if PCRE2_MATCH_INVALID_UTF/PCRE2_NO_UTF_CHECK could be
used to improve matching with invalid UTF-8 for the pcre mode,
at least for the pcre builtins where offsets are byte-wide
rather than character-wise.

-- 
Stephane

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
  2024-03-08 15:30                 ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
  2024-03-09  8:41                   ` [PATCH v5] " Stephane Chazelas
@ 2024-03-09 13:03                   ` Stephane Chazelas
  2024-03-10 19:52                     ` [PATCH v6] " Stephane Chazelas
  1 sibling, 1 reply; 58+ messages in thread
From: Stephane Chazelas @ 2024-03-09 13:03 UTC (permalink / raw)
  To: Bart Schaefer, Zsh hackers list

2024-03-08 15:30:50 +0000, Stephane Chazelas:
[...]
> So here's a v4 that should address that, some of the objections
> to v2 and uses namerefs to replace that illegible usage
> of positional parameters for local variables
[...]

Scratch that, we do have to use positional parameters or
namespacing as expansions are performed in the replacement so
the users could do:

regexp-replace var $regexp '$start[$#MATCH]$regexp'

For instance, and expect the $start / $regexp in the replacement
to be *their* variable, not the local variables of
regexp-replace.

I'll send a v6 likely using namespaced variables rather than
going back to using positional parameters, once I understand the
point of using .regexp_replace.myvar over _regexp_replace_myvar
(TBH, ATM I don't).

-- 
Stephane


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH v6] regexp-replace and ^, word boundary or look-behind operators (and more).
  2024-03-09 13:03                   ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
@ 2024-03-10 19:52                     ` Stephane Chazelas
  0 siblings, 0 replies; 58+ messages in thread
From: Stephane Chazelas @ 2024-03-10 19:52 UTC (permalink / raw)
  To: Zsh hackers list

2024-03-09 13:03:10 +0000, Stephane Chazelas:
[...]
> I'll send a v6 likely using namespaced variables rather than
> going back to using positional parameters, once I understand the
> point of using .regexp_replace.myvar over _regexp_replace_myvar
[...]

So here it is. I ended up using none of the new features
(nameref and namespace) as they were not overly useful in this
instance and that means the code can be used as-is in older
versions. I'm using $_regexp_replace_localvarname for
namespacing. I compared performance with
${.regexp_replace.localvarname} and they were similar (the
latter about 2-3% slower in my limited tests).

diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index d4408f0f7..630a5ceab 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,91 +1,99 @@
-# Replace all occurrences of a regular expression in a variable.  The
-# variable is modified directly.  Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly.  Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string.  This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-# we use positional parameters instead of variables to avoid
-# clashing with the user's variable. Make sure we start with 3 and only
-# 3 elements:
-argv=("$1" "$2" "$3")
-
-# $4 records whether pcre is enabled as that information would otherwise
-# be lost after emulate -L zsh
-4=0
-[[ -o re_match_pcre ]] && 4=1
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue;
+#    expected to be scalar.
+#
+# 2. regular expression
+#
+# 3. replacement string.  This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 <varname> <regexp> [<replacement>]"
+  return 2
+fi
+
+local _regexp_replace_use_pcre=0
+[[ -o re_match_pcre ]] && _regexp_replace_use_pcre=1
 
 emulate -L zsh
 
+local _regexp_replace_subject=${(P)1} \
+      _regexp_replace_regexp=$2 \
+      _regexp_replace_replacement=$3 \
+      _regexp_replace_result \
+      MATCH MBEGIN MEND
 
-local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-if (( $4 )); then
+if (( _regexp_replace_use_pcre )); then
   # if using pcre, we're using pcre_match and a running offset
   # That's needed for ^, \A, \b, and look-behind operators to work
   # properly.
 
   zmodload zsh/pcre || return 2
-  pcre_compile -- "$2" && pcre_study || return 2
+  pcre_compile -- "$_regexp_replace_regexp" && pcre_study || return 2
+
+  local _regexp_replace_offset=0 _regexp_replace_start _regexp_replace_stop _regexp_replace_new ZPCRE_OP
+  local -a _regexp_replace_finds
 
-  # $4 is the current *byte* offset, $5, $6 reserved for later use
-  4=0 6=
+  while pcre_match -b -n $_regexp_replace_offset -- "$_regexp_replace_subject"; do
+    # we need to perform the evaluation in a scalar assignment so that
+    # if it generates an array, the elements are converted to string (by
+    # joining with the first chararacter of $IFS as usual)
+    _regexp_replace_new=${(Xe)_regexp_replace_replacement}
 
-  local ZPCRE_OP
-  while pcre_match -b -n $4 -- "${(P)1}"; do
-    # append offsets and computed replacement to the array
-    # we need to perform the evaluation in a scalar assignment so that if
-    # it generates an array, the elements are converted to string (by
-    # joining with the first character of $IFS as usual)
-    5=${(e)3}
-    argv+=(${(s: :)ZPCRE_OP} "$5")
+    _regexp_replace_finds+=( ${(s[ ])ZPCRE_OP} "$_regexp_replace_new" )
 
     # for 0-width matches, increase offset by 1 to avoid
     # infinite loop
-    4=$((argv[-2] + (argv[-3] == argv[-2])))
+    (( _regexp_replace_offset = _regexp_replace_finds[-2] + (_regexp_replace_finds[-3] == _regexp_replace_finds[-2]) ))
   done
 
-  (($# > 6)) || return # no match
+  (( $#_regexp_replace_finds )) || return # no match
 
-  set +o multibyte
+  unsetopt multibyte
 
-  # $5 contains the result, $6 the current offset
-  5= 6=1
-  for 2 3 4 in "$@[7,-1]"; do
-    5+=${(P)1[$6,$2]}$4
-    6=$(($3 + 1))
+  _regexp_replace_offset=1
+  for _regexp_replace_start _regexp_replace_stop _regexp_replace_new in "$_regexp_replace_finds[@]"; do
+    _regexp_replace_result+=${_regexp_replace_subject[_regexp_replace_offset,_regexp_replace_start]}$_regexp_replace_new
+    (( _regexp_replace_offset = _regexp_replace_stop + 1 ))
   done
-  5+=${(P)1[$6,-1]}
-else
+  _regexp_replace_result+=${_regexp_replace_subject[_regexp_replace_offset,-1]}
+
+else # no PCRE
   # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
   # available) won't work properly.
 
-  # $4 is the string to be matched
-  4=${(P)1}
-
-  while [[ -n $4 ]]; do
-    if [[ $4 =~ $2 ]]; then
-      # append initial part and substituted match
-      5+=${4[1,MBEGIN-1]}${(e)3}
-      # truncate remaining string
-      if ((MEND < MBEGIN)); then
-        # zero-width match, skip one character for the next match
-        ((MEND++))
-	5+=${4[1]}
-      fi
-      4=${4[MEND+1,-1]}
-      # indicate we did something
-      6=1
-    else
-      break
+  local _regexp_replace_ok=0
+  while [[ $_regexp_replace_subject =~ $_regexp_replace_regexp ]]; do
+    # append initial part and substituted match
+    _regexp_replace_result+=$_regexp_replace_subject[1,MBEGIN-1]${(Xe)_regexp_replace_replacement}
+    # truncate remaining string
+    if (( MEND < MBEGIN )); then
+      # zero-width match, skip one character for the next match
+      (( MEND++ ))
+      _regexp_replace_result+=$_regexp_replace_subject[MBEGIN]
     fi
+    _regexp_replace_subject=$_regexp_replace_subject[MEND+1,-1]
+    _regexp_replace_ok=1
+    [[ -z $_regexp_replace_subject ]] && break
   done
-  [[ -n $6 ]] || return # no match
-  5+=$4
+  (( _regexp_replace_ok )) || return
+  _regexp_replace_result+=$_regexp_replace_subject
 fi
 
-eval $1=\$5
+# assign result to target variable if at least one substitution was
+# made.  At this point, if the variable was originally array or assoc, it
+# is converted to scalar. If $1 doesn't contain a valid lvalue
+# specification, an exception is raised (exits the shell process if
+# non-interactive).
+: ${(P)1::="$_regexp_replace_result"}
+




^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2024-03-10 19:52 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-16 21:10 regexp-replace and ^, word boundary or look-behind operators Stephane Chazelas
2019-12-16 21:27 ` Stephane Chazelas
2019-12-17  7:38   ` Stephane Chazelas
2019-12-17 11:11     ` [PATCH] " Stephane Chazelas
2019-12-18  0:22       ` Daniel Shahaf
2019-12-18  8:31         ` Stephane Chazelas
2020-01-01 14:03         ` [PATCH v2] " Stephane Chazelas
2021-04-30  6:11           ` Stephane Chazelas
2021-04-30 23:13             ` Bart Schaefer
2021-05-05 11:45               ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
2021-05-31  0:58                 ` Lawrence Velázquez
2021-05-31 18:18                 ` Bart Schaefer
2021-05-31 21:37                   ` [PATCH] (?) typeset array[position=index]=value Bart Schaefer
2021-06-01  5:32                     ` Stephane Chazelas
2021-06-01 16:05                       ` Bart Schaefer
2021-06-02  2:51                         ` [PATCH] (take two?) typeset array[position=index]=value / unset hash[$stuff] Bart Schaefer
2021-06-02 10:06                           ` Stephane Chazelas
2021-06-02 14:52                             ` Bart Schaefer
2021-06-02 16:02                               ` Stephane Chazelas
2021-06-02  9:11                         ` [PATCH] (?) typeset array[position=index]=value Stephane Chazelas
2021-06-02 13:34                           ` Daniel Shahaf
2021-06-02 14:20                             ` Stephane Chazelas
2021-06-02 15:59                               ` Bart Schaefer
2021-06-03  2:04                                 ` [PATCH (not final)] (take three?) unset "array[$anything]" Bart Schaefer
2021-06-03  2:42                                   ` Bart Schaefer
2021-06-03  6:12                                     ` Bart Schaefer
2021-06-03  8:54                                       ` Peter Stephenson
2021-06-03 13:13                                         ` Stephane Chazelas
2021-06-03 14:41                                           ` Peter Stephenson
2021-06-04 19:25                                             ` Bart Schaefer
2021-06-05 18:18                                               ` Peter Stephenson
2021-06-09 23:31                                                 ` Bart Schaefer
2021-06-13 16:51                                                   ` Peter Stephenson
2021-06-13 18:04                                                     ` Bart Schaefer
2021-06-13 19:48                                                       ` Peter Stephenson
2021-06-13 21:44                                                         ` Bart Schaefer
2021-06-14  7:19                                                           ` Stephane Chazelas
2021-06-03 18:12                                           ` Bart Schaefer
2021-06-04  8:02                                             ` Stephane Chazelas
2021-06-04 18:36                                               ` Bart Schaefer
2021-06-04 20:21                                                 ` Stephane Chazelas
2021-06-05  0:20                                                   ` Bart Schaefer
2021-06-05 17:05                                                     ` Stephane Chazelas
2021-06-10  0:14                                                       ` Square brackets in command position Bart Schaefer
2021-06-03  6:05                                   ` [PATCH (not final)] (take three?) unset "array[$anything]" Stephane Chazelas
2021-06-03  6:43                                     ` Bart Schaefer
2021-06-03  7:31                                       ` Stephane Chazelas
2021-06-10  0:21                         ` [PATCH] (?) typeset array[position=index]=value Bart Schaefer
2021-06-05  4:29                     ` Mikael Magnusson
2021-06-05  5:49                       ` Bart Schaefer
2021-06-05 11:06                         ` Mikael Magnusson
2021-06-05 16:22                           ` Bart Schaefer
2021-06-18 10:53                         ` Mikael Magnusson
2024-03-08 15:30                 ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
2024-03-09  8:41                   ` [PATCH v5] " Stephane Chazelas
2024-03-09  9:21                     ` MBEGIN when =~ finds bytes inside characters (Was: [PATCH v5] regexp-replace and ^, word boundary or look-behind operators (and more).) Stephane Chazelas
2024-03-09 13:03                   ` [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more) Stephane Chazelas
2024-03-10 19:52                     ` [PATCH v6] " Stephane Chazelas

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/zsh/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).