From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 30 Apr 2021 07:11:17 +0100
From: Stephane Chazelas
To: Zsh hackers list
Subject: Re: [PATCH v2] regexp-replace and ^, word boundary or look-behind operators
Message-ID: <20210430061117.buyhdhky5crqjrf2@chazelas.org>
Mail-Followup-To: Zsh hackers list
References: <20191216211013.6opkv5sy4wvp3yn2@chaz.gmail.com>
 <20191216212706.i3xvf6hn5h3jwkjh@chaz.gmail.com>
 <20191217073846.4usg2hnsk66bhqvl@chaz.gmail.com>
 <20191217111113.z242f4g6sx7xdwru@chaz.gmail.com>
 <2ea6feb3-a686-4d83-ab27-6a582424487c@www.fastmail.com>
 <20200101140343.qwfx2xaojumuds3d@chaz.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200101140343.qwfx2xaojumuds3d@chaz.gmail.com>
X-Seq: 48747
X-Loop: zsh-workers@zsh.org
Sender: zsh-workers-request@zsh.org
X-no-archive: yes

ping.

2020-01-01 14:03:43 +0000, Stephane Chazelas:
2019-12-18 00:22:53 +0000, Daniel Shahaf:
[...]
> > +
> > +Note that if not using PCRE, using the tt(^) or word boundary operators
> > +(where available) may not work properly.
>
> Suggest to avoid the double negative:
>
> 1. s/not using PCRE/using POSIX ERE's/
>
> 2. Add "(ERE's)" after "POSIX extended regular expressions" in the first paragraph
>
> I'll push a minor change to that first paragraph in a moment.
[...]

Thanks, I've incorporated that suggestion and fixed an issue
with PCRE when the replacement was empty or generated more than
one element.

diff --git a/Doc/Zsh/contrib.yo b/Doc/Zsh/contrib.yo
index 558342711..9a804fc11 100644
--- a/Doc/Zsh/contrib.yo
+++ b/Doc/Zsh/contrib.yo
@@ -4284,7 +4284,7 @@ See also the tt(pager), tt(prompt) and tt(rprompt) styles below.
 findex(regexp-replace)
 item(tt(regexp-replace) var(var) var(regexp) var(replace))(
 Use regular expressions to perform a global search and replace operation
-on a variable. POSIX extended regular expressions are used,
+on a variable. POSIX extended regular expressions (ERE) are used,
 unless the option tt(RE_MATCH_PCRE) has been set, in which case
 Perl-compatible regular expressions are used
 (this requires the shell to be linked against the tt(pcre)
@@ -4302,6 +4302,9 @@ and arithmetic expressions which will be replaced: in particular, a
 reference to tt($MATCH) will be replaced by the text matched by the pattern.

 The return status is 0 if at least one match was performed, else 1.
+
+Note that if using POSIX EREs, the tt(^) or word boundary operators
+(where available) may not work properly.
 )
 findex(run-help)
 item(tt(run-help) var(cmd))(
diff --git a/Functions/Example/zpgrep b/Functions/Example/zpgrep
index 8b1edaa1c..556e58cd6 100644
--- a/Functions/Example/zpgrep
+++ b/Functions/Example/zpgrep
@@ -2,24 +2,31 @@
 #

 zpgrep() {
-local file pattern
+local file pattern ret

 pattern=$1
 shift
+ret=1

 if ((! ARGC)) then
     set -- -
 fi

-pcre_compile $pattern
+zmodload zsh/pcre || return
+pcre_compile -- "$pattern"
 pcre_study

 for file
 do
     if [[ "$file" == - ]] then
-        while read -u0 buf; do pcre_match $buf && print $buf; done
+        while IFS= read -ru0 buf; do
+            pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+        done
     else
-        while read -u0 buf; do pcre_match $buf && print $buf; done < "$file"
+        while IFS= read -ru0 buf; do
+            pcre_match -- "$buf" && ret=0 && print -r -- "$buf"
+        done < "$file"
     fi
 done
+return "$ret"
 }
diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index dec105524..0d5948075 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -8,36 +8,84 @@
 # $ and backtick substitutions; in particular, $MATCH will be replaced
 # by the portion of the string matched by the regular expression.

-integer pcre
+# we use positional parameters instead of variables to avoid
+# clashing with the user's variable. Make sure we start with 3 and only
+# 3 elements:
+argv=("$1" "$2" "$3")

-[[ -o re_match_pcre ]] && pcre=1
+# $4 records whether pcre is enabled as that information would otherwise
+# be lost after emulate -L zsh
+4=0
+[[ -o re_match_pcre ]] && 4=1

 emulate -L zsh
-(( pcre )) && setopt re_match_pcre
-
-# $4 is the string to be matched
-4=${(P)1}
-# $5 is the final string
-5=
-# 6 indicates if we made a change
-6=
+
+
 local MATCH MBEGIN MEND
 local -a match mbegin mend

-while [[ -n $4 ]]; do
-  if [[ $4 =~ $2 ]]; then
-    # append initial part and subsituted match
-    5+=${4[1,MBEGIN-1]}${(e)3}
-    # truncate remaining string
-    4=${4[MEND+1,-1]}
-    # indicate we did something
-    6=1
-  else
-    break
-  fi
-done
-5+=$4
-
-eval ${1}=${(q)5}
-# status 0 if we did something, else 1.
-[[ -n $6 ]]
+if (( $4 )); then
+  # if using pcre, we're using pcre_match and a running offset
+  # That's needed for ^, \A, \b, and look-behind operators to work
+  # properly.
+
+  zmodload zsh/pcre || return 2
+  pcre_compile -- "$2" && pcre_study || return 2
+
+  # $4 is the current *byte* offset, $5, $6 reserved for later use
+  4=0 6=
+
+  local ZPCRE_OP
+  while pcre_match -b -n $4 -- "${(P)1}"; do
+    # append offsets and computed replacement to the array
+    # we need to perform the evaluation in a scalar assignment so that if
+    # it generates an array, the elements are converted to string (by
+    # joining with the first character of $IFS as usual)
+    5=${(e)3}
+    argv+=(${(s: :)ZPCRE_OP} "$5")
+
+    # for 0-width matches, increase offset by 1 to avoid
+    # infinite loop
+    4=$((argv[-2] + (argv[-3] == argv[-2])))
+  done
+
+  (($# > 6)) || return # no match
+
+  set +o multibyte
+
+  # $5 contains the result, $6 the current offset
+  5= 6=1
+  for 2 3 4 in "$@[7,-1]"; do
+    5+=${(P)1[$6,$2]}$4
+    6=$(($3 + 1))
+  done
+  5+=${(P)1[$6,-1]}
+else
+  # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
+  # available) won't work properly.
+
+  # $4 is the string to be matched
+  4=${(P)1}
+
+  while [[ -n $4 ]]; do
+    if [[ $4 =~ $2 ]]; then
+      # append initial part and substituted match
+      5+=${4[1,MBEGIN-1]}${(e)3}
+      # truncate remaining string
+      if ((MEND < MBEGIN)); then
+        # zero-width match, skip one character for the next match
+        ((MEND++))
+        5+=${4[1]}
+      fi
+      4=${4[MEND+1,-1]}
+      # indicate we did something
+      6=1
+    else
+      break
+    fi
+  done
+  [[ -n $6 ]] || return # no match
+  5+=$4
+fi
+
+eval $1=\$5
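
For anyone wondering why the ERE branch still carries the caveat about ^,
here is a minimal sketch of the truncate-and-rematch loop on its own. It is
not the code from the patch: the helper name truncating_replace is made up
for illustration, the replacement is taken literally rather than evaluated,
and the match is located as the first occurrence of the matched text
instead of via MBEGIN/MEND. That shortcut is an assumption that only holds
for simple patterns like the ones below; it keeps the sketch runnable in
both zsh and bash.

```shell
# Hypothetical helper, for illustration only (not part of the patch).
# Repeatedly matches $2 against what is left of $1 and truncates the
# string after each match, as the ERE branch of regexp-replace does.
truncating_replace() {
  local rest="$1" re="$2" repl="$3" out= m
  while [[ -n $rest && $rest =~ $re ]]; do
    # zsh sets $MATCH, bash sets BASH_REMATCH[0]
    m=${MATCH-${BASH_REMATCH[0]}}
    # stop on a zero-width match to avoid looping forever
    [[ -n $m ]] || break
    # text before the match, then the (literal) replacement
    out+=${rest%%"$m"*}$repl
    # drop everything up to and including the match
    rest=${rest#*"$m"}
  done
  printf '%s\n' "$out$rest"
}

truncating_replace banana 'a' x   # prints bxnxnx
truncating_replace aaa '^a' x     # prints xxx, not xaa
```

Each truncated remainder is a fresh string as far as the regex engine is
concerned, so ^ matches again at every step. Matching the original string
from a running offset, as the PCRE branch does with pcre_match -n, avoids
that.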