Date: Wed, 5 May 2021 12:45:21 +0100
From: Stephane Chazelas
To: Bart Schaefer
Cc: Zsh hackers list
Subject: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
Message-ID: <20210505114521.bemoiekpophssbug@chazelas.org>
X-Seq: 48786

2021-04-30 16:13:34 -0700, Bart Schaefer:
[...]
> I went back and looked at the patch again.

Thanks. Here's a third version with further improvements
addressing some of the comments here.

> Tangential question: "pgrep" commonly refers to grepping the process
> list, and is linked to "pkill". I know "zpgrep" precedes this patch,
> but I'm wondering if we should rename it.

I agree, zshpcregrep or zpmatch may be better names. There
is already a pcregrep command, and zpcregrep would likely be
read as zip-pcregrep. I'll leave the rename out for now. IMO,
zpgrep serves more as example code than as a command people
would actually use, so it probably doesn't matter much.

> More directly about regexp-replace:
>
> If $argv[4,-1] are going to be ignored/discarded, perhaps there should
> be a warning? (Another thing that predates the patch, I know)

Agreed. I've addressed that.

> What do you think about replacing the final eval with typeset -g, as
> mentioned in workers/48760 ?

I've compared:

(1) eval -- $lvalue=\$value
(2) Bart's typeset -g -- $lvalue=$value
(3) Daniel's (zsh-workers 45073) : ${(P)lvalue::="$value"}

(1) to me is the most legible, but if $lvalue is not a valid
lvalue, it doesn't necessarily return a useful error message to
the user (like when lvalue='reboot;var'...)

(2) is also very legible. It has the benefit (or inconvenience,
depending on PoV) of returning an error if the lvalue is not a
scalar. It reports an error (and exits the shell process) upon
incorrect lvalues (except ones such as "var=foo"). A major
drawback though is that it chokes on lvalue='array[n=1]' or
lvalue='assoc[keywith=characters]'

(3) is the least legible. It also causes the lvalue to be
dereferenced twice. For instance with lvalue='a[++n]', n is
incremented twice. However, it does report an error upon invalid
lvalue (even though ${(P)lvalue} alone doesn't), and as we
already use ${(P)lvalue} to read the current value, it has the
benefit of the lvalue being interpreted consistently.
Non-scalar variables are converted to scalar (like with (1)).
It works OK for lvalue='assoc[$key]' and lvalue='assoc[=]' or
lvalue='assoc[\]]' for instance.

Performance-wise, for usual cases (lvalue being a simple
variable name and value short enough), (1) seems to be the worst
in my tests and (3) the best, with (2) very close. But that's
reversed for the less usual cases.

So, I've gone for (3), and changed the code to limit the number
of times the lvalue is dereferenced.

I've also addressed an issue whereby regexp-replace empty '^' x
would not insert x in ERE mode.

(note that it is affected by the (e) failure exit code issue
I've just raised separately; I'm not attempting to work around
it here, though I've added the X flag for error reporting to be
more consistent)

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 8bf1a208e..db06d7925 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4328,7 +4328,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable. POSIX extended regular expressions are used,
+on a variable. POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4346,6 +4346,9 @@ and arithmetic expressions which will be replaced: in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.
 
 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #
 
 zpgrep() {
-local file pattern
+local file pattern ret
 
 pattern=$1
 shift
+ret=1
 
 if ((! ARGC)) then
 	set -- -
 fi
 
-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study
 
 for file
 do
 	if [[ "$file" == - ]] then
-		while read -u0 buf; do pcre_match $buf && print $buf; done
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done
 	else
-		while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+		while IFS= read -ru0 buf; do
+			pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+		done < "$file"
 	fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..c947a2043 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,43 +1,109 @@
-# Replace all occurrences of a regular expression in a variable. The
-# variable is modified directly. Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly. Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string. This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-integer pcre
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue,
+#    expected to be scalar. That lvalue will be evaluated once to
+#    retrieve the current value, and two more times (not just one as a
+#    side effect of using ${(P)varname::=$value}; FIXME) for the
+#    assignment of the new value if a substitution was made. So lvalues
+#    such as array[++n] where the subscript is dynamic should be
+#    avoided.
+#
+# 2. regular expression
+#
+# 3. replacement string. This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+#
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable.
 
-[[ -o re_match_pcre ]] && pcre=1
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 []"
+  return 2
+fi
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1
 
 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
-local MATCH MBEGIN MEND
+
+# $5 is the string to be matched
+5=${(P)1}
+
+local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $6, $7 reserved for later use
+  4=0 7=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "$5"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first character of $IFS as usual)
+    6=${(Xe)3}
+    argv+=(${(s: :)ZPCRE_OP} "$6")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$(( argv[-2] + (argv[-3] == argv[-2]) ))
+  done
+
+  (( $# > 7 )) || return # no match
+
+  set +o multibyte
+
+  # $6 contains the result, $7 the current offset
+  6= 7=1
+  for 2 3 4 in "$@[8,-1]"; do
+    6+=${5[$7,$2]}$4
+    7=$(( $3 + 1 ))
+  done
+  6+=${5[$7,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+  while
+    if [[ $5 =~ $2 ]]; then
+      # append initial part and substituted match
+      6+=${5[1,MBEGIN-1]}${(Xe)3}
+      # truncate remaining string
+      if (( MEND < MBEGIN )); then
+        # zero-width match, skip one character for the next match
+        (( MEND++ ))
+        6+=${5[1]}
+      fi
+      5=${5[MEND+1,-1]}
+      # indicate we did something
+      7=1
+    fi
+    [[ -n $5 ]]
+  do
+    continue
+  done
+  [[ -n $7 ]] || return # no match
+  6+=$5
+fi
+
+# assign result to target variable if at least one substitution was
+# made. At this point, if the variable was originally array or assoc, it
+# is converted to scalar. If $1 doesn't contain a valid lvalue
+# specification, an exception is raised (exits the shell process if
+# non-interactive).
+: ${(P)1::="$6"}
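
An aside on the three assignment forms compared near the top of this message: the sketch below illustrates form (1), the eval assignment, with a name check in front of it so that a target such as 'reboot;var' is rejected rather than executed. The helper name assign_checked and the identifier pattern are assumptions made for illustration and are not part of the patch. The check is also stricter than what the patch wants, since it refuses subscripted lvalues such as assoc[key].

```shell
# Hypothetical helper (not in the patch): assign value $2 to the variable
# named by $1 using the eval form, i.e. form (1) from the comparison.
assign_checked() {
  case $1 in
    ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*)
      # not a plain identifier: refuse rather than let eval run it as code
      printf 'assign_checked: invalid variable name: %s\n' "$1" >&2
      return 2
      ;;
  esac
  # \$2 stays unexpanded until eval runs, so the value is taken
  # literally and is never re-parsed as shell code
  eval "$1=\$2"
}

assign_checked greeting 'hello; $(echo not run)' && printf '%s\n' "$greeting"
assign_checked 'reboot;var' x || echo "rejected, status $?"
```

The second call prints a diagnostic on stderr and returns 2 without running anything. That is the kind of explicit error report that form (1) alone does not give, and it is one reason the patch settles on form (3), where the shell itself reports the invalid lvalue.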