Date: Fri, 8 Mar 2024 15:30:50 +0000
From: Stephane Chazelas
To: Bart Schaefer, Zsh hackers list
Subject: Re: [PATCH v3] regexp-replace and ^, word boundary or look-behind operators (and more).
Message-ID: <20240308153050.u63fqtcjyr2yewye@chazelas.org>
In-Reply-To: <20210505114521.bemoiekpophssbug@chazelas.org>
X-Seq: 52713

2021-05-05 12:45:21 +0100, Stephane Chazelas:
> 2021-04-30 16:13:34 -0700, Bart Schaefer:
> [...]
> > I went back and looked at the patch again.
>
> Thanks. Here's a third version with further improvements
> addressing some of the comments here.
[...]

That v3 patch had (at least) a couple of bugs:

- in ERE mode, replacement was not inserted properly when
  pattern matched an empty string not at the start of the
  subject (like in regexp-replace var '\>' new)
- it would run in an infinite loop when there's no match in ERE
  mode.

I see Bart ended up committing the v2 version of my patch (from
48747) a few months later in:

commit bb61da36aaeeaa70413cdf5bc66d7a71194f93e5
Author:     Stephane Chazelas
AuthorDate: Mon Sep 6 14:43:01 2021 -0700
Commit:     Bart Schaefer
CommitDate: Mon Sep 6 14:43:01 2021 -0700

    45180: clarify doc for POSIX EREs, fix an issue with PCRE when the replacement was empty or generated more than one element

That one didn't have the second problem but had the first and
also failed to add the replacement in:

regexp-replace var '^' replacement

for instance when $var is initially empty.

So here's a v4 that should address that and some of the
objections to v2, and that uses namerefs to replace the illegible
use of positional parameters as local variables (the diff is
against current HEAD, not pre-v2).

I went for the:

typeset -g -- "$1"
typeset -nu -- var=$1

suggested by Bart to avoid possible clashes with local variable
names. That might have side effects if called as regexp-replace
'a[2]' re?

diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index d4408f0f7..0e3deed4f 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,91 +1,95 @@
-# Replace all occurrences of a regular expression in a variable. The
-# variable is modified directly. Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly. Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string. This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-# we use positional parameters instead of variables to avoid
-# clashing with the user's variable. Make sure we start with 3 and only
-# 3 elements:
-argv=("$1" "$2" "$3")
-
-# $4 records whether pcre is enabled as that information would otherwise
-# be lost after emulate -L zsh
-4=0
-[[ -o re_match_pcre ]] && 4=1
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue;
+#    expected to be scalar.
+#
+# 2. regular expression
+#
+# 3. replacement string. This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 []"
+  return 2
+fi
 
-emulate -L zsh
+# ensure variable exists in the caller's scope before referencing it
+# to make sure we don't end up referencing one of our own.
+typeset -g -- "$1" || return 2
+typeset -nu -- var=$1 || return 2
 
+local -i use_pcre=0
+[[ -o re_match_pcre ]] && use_pcre=1
-local MATCH MBEGIN MEND
+emulate -L zsh
+
+local regexp=$2 replacement=$3 result MATCH MBEGIN MEND
 local -a match mbegin mend
 
-if (( $4 )); then
+if (( use_pcre )); then
   # if using pcre, we're using pcre_match and a running offset
   # That's needed for ^, \A, \b, and look-behind operators to work
   # properly.
 
   zmodload zsh/pcre || return 2
-  pcre_compile -- "$2" && pcre_study || return 2
+  pcre_compile -- "$regexp" && pcre_study || return 2
+
+  local -i offset=0 start stop
+  local new ZPCRE_OP
+  local -a finds
 
-  # $4 is the current *byte* offset, $5, $6 reserved for later use
-  4=0 6=
+  while pcre_match -b -n $offset -- "$var"; do
+    # we need to perform the evaluation in a scalar assignment so that
+    # if it generates an array, the elements are converted to string (by
+    # joining with the first character of $IFS as usual)
+    new=${(Xe)replacement}
 
-  local ZPCRE_OP
-  while pcre_match -b -n $4 -- "${(P)1}"; do
-    # append offsets and computed replacement to the array
-    # we need to perform the evaluation in a scalar assignment so that if
-    # it generates an array, the elements are converted to string (by
-    # joining with the first character of $IFS as usual)
-    5=${(e)3}
-    argv+=(${(s: :)ZPCRE_OP} "$5")
+    finds+=( ${(s[ ])ZPCRE_OP} "$new" )
 
     # for 0-width matches, increase offset by 1 to avoid
     # infinite loop
-    4=$((argv[-2] + (argv[-3] == argv[-2])))
+    (( offset = finds[-2] + (finds[-3] == finds[-2]) ))
   done
 
-  (($# > 6)) || return # no match
+  (( $#finds )) || return # no match
 
-  set +o multibyte
+  unsetopt multibyte
 
-  # $5 contains the result, $6 the current offset
-  5= 6=1
-  for 2 3 4 in "$@[7,-1]"; do
-    5+=${(P)1[$6,$2]}$4
-    6=$(($3 + 1))
+  offset=1
+  for start stop new in "$finds[@]"; do
+    result+=${var[offset,start]}$new
+    (( offset = stop + 1 ))
   done
-  5+=${(P)1[$6,-1]}
-else
+  result+=${var[offset,-1]}
+
+else # no PCRE
   # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
   # available) won't work properly.
-
-  # $4 is the string to be matched
-  4=${(P)1}
-
-  while [[ -n $4 ]]; do
-    if [[ $4 =~ $2 ]]; then
-      # append initial part and substituted match
-      5+=${4[1,MBEGIN-1]}${(e)3}
-      # truncate remaining string
-      if ((MEND < MBEGIN)); then
-        # zero-width match, skip one character for the next match
-        ((MEND++))
-        5+=${4[1]}
-      fi
-      4=${4[MEND+1,-1]}
-      # indicate we did something
-      6=1
-    else
-      break
+  local subject=$var
+  local -i ok
+  while [[ $subject =~ $regexp ]]; do
+    # append initial part and substituted match
+    result+=$subject[1,MBEGIN-1]${(Xe)replacement}
+    # truncate remaining string
+    if (( MEND < MBEGIN )); then
+      # zero-width match, skip one character for the next match
+      (( MEND++ ))
+      result+=$subject[MBEGIN]
     fi
+    subject=$subject[MEND+1,-1]
+    ok=1
+    [[ -n $subject ]] || break
   done
-  [[ -n $6 ]] || return # no match
-  5+=$4
+  (( ok )) || return
+  result+=$subject
 fi
 
-eval $1=\$5
+var=$result
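
As an illustration of the name clash that the typeset -g / typeset -nu pair is meant to avoid, here is a minimal sketch. It is not part of the patch: naive_upper is a hypothetical function that keeps its working copy in an ordinary local called result and reaches the caller's variable by name through eval. It deliberately avoids zsh-only syntax so that it behaves the same in zsh and bash.

```shell
# Hypothetical helper (an assumption for illustration, not from the patch):
# upper-cases the variable whose *name* is passed as $1, keeping its
# working copy in an ordinary local variable called "result".
naive_upper() {
  local result
  # read the named variable; if $1 is "result" this reads our own local
  eval "result=\$$1"
  result=$(printf '%s' "$result" | tr '[:lower:]' '[:upper:]')
  # write back by name; if $1 is "result" this assigns our own local again
  eval "$1=\$result"
}

greeting=hello
naive_upper greeting
printf '%s\n' "$greeting"   # HELLO

result=hello
naive_upper result
printf '%s\n' "$result"     # hello: the caller's variable was never touched
```

The second call fails silently because the name result resolves to the function's own local. That is the situation described by the patch's comment about making sure "we don't end up referencing one of our own", and the reason the old code kept its state in positional parameters instead.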