From: Stephane Chazelas
To: Zsh hackers list
Date: Sun, 10 Mar 2024 19:52:01 +0000
Subject: [PATCH v6] regexp-replace and ^, word boundary or look-behind operators (and more).
Message-ID: <20240310195201.c53tzhgcyk5qgi27@chazelas.org>
In-Reply-To: <20240309130310.5zovit5jk6l4rnak@chazelas.org>
References: <20191216212706.i3xvf6hn5h3jwkjh@chaz.gmail.com>
 <20191217073846.4usg2hnsk66bhqvl@chaz.gmail.com>
 <20191217111113.z242f4g6sx7xdwru@chaz.gmail.com>
 <2ea6feb3-a686-4d83-ab27-6a582424487c@www.fastmail.com>
 <20200101140343.qwfx2xaojumuds3d@chaz.gmail.com>
 <20210430061117.buyhdhky5crqjrf2@chazelas.org>
 <20210505114521.bemoiekpophssbug@chazelas.org>
 <20240308153050.u63fqtcjyr2yewye@chazelas.org>
 <20240309130310.5zovit5jk6l4rnak@chazelas.org>
X-Seq: 52727

2024-03-09 13:03:10 +0000, Stephane Chazelas:
[...]
> I'll send a v6 likely using namespaced variables rather than
> going back to using positional parameters, once I understand the
> point of using .regexp_replace.myvar over _regexp_replace_myvar
[...]

So here it is. I ended up using none of the new features
(nameref and namespace) as they were not overly useful in this
instance and that means the code can be used as-is in older
versions.

I'm using $_regexp_replace_localvarname for namespacing. I
compared performance with ${.regexp_replace.localvarname} and
they were similar (the latter about 2-3% slower in my limited
tests).

diff --git a/Functions/Misc/regexp-replace b/Functions/Misc/regexp-replace
index d4408f0f7..630a5ceab 100644
--- a/Functions/Misc/regexp-replace
+++ b/Functions/Misc/regexp-replace
@@ -1,91 +1,99 @@
-# Replace all occurrences of a regular expression in a variable. The
-# variable is modified directly. Respects the setting of the
-# option RE_MATCH_PCRE.
+# Replace all occurrences of a regular expression in a scalar variable.
+# The variable is modified directly. Respects the setting of the option
+# RE_MATCH_PCRE, but otherwise sets the zsh emulation mode.
 #
-# First argument: *name* (not contents) of variable.
-# Second argument: regular expression
-# Third argument: replacement string. This can contain all forms of
-# $ and backtick substitutions; in particular, $MATCH will be replaced
-# by the portion of the string matched by the regular expression.
-
-# we use positional parameters instead of variables to avoid
-# clashing with the user's variable. Make sure we start with 3 and only
-# 3 elements:
-argv=("$1" "$2" "$3")
-
-# $4 records whether pcre is enabled as that information would otherwise
-# be lost after emulate -L zsh
-4=0
-[[ -o re_match_pcre ]] && 4=1
+# Arguments:
+#
+# 1. *name* (not contents) of variable or more generally any lvalue;
+#    expected to be scalar.
+#
+# 2. regular expression
+#
+# 3. replacement string. This can contain all forms of
+#    $ and backtick substitutions; in particular, $MATCH will be
+#    replaced by the portion of the string matched by the regular
+#    expression. Parsing errors are fatal to the shell process.
+
+if (( $# < 2 || $# > 3 )); then
+  setopt localoptions functionargzero
+  print -ru2 "Usage: $0 <varname> <regexp> [<replacement>]"
+  return 2
+fi
+
+local _regexp_replace_use_pcre=0
+[[ -o re_match_pcre ]] && _regexp_replace_use_pcre=1
 
 emulate -L zsh
 
+local _regexp_replace_subject=${(P)1} \
+      _regexp_replace_regexp=$2 \
+      _regexp_replace_replacement=$3 \
+      _regexp_replace_result \
+      MATCH MBEGIN MEND
 
-local MATCH MBEGIN MEND
 local -a match mbegin mend
 
-if (( $4 )); then
+if (( _regexp_replace_use_pcre )); then
   # if using pcre, we're using pcre_match and a running offset
   # That's needed for ^, \A, \b, and look-behind operators to work
   # properly.
 
   zmodload zsh/pcre || return 2
-  pcre_compile -- "$2" && pcre_study || return 2
+  pcre_compile -- "$_regexp_replace_regexp" && pcre_study || return 2
+
+  local _regexp_replace_offset=0 _regexp_replace_start _regexp_replace_stop _regexp_replace_new ZPCRE_OP
+  local -a _regexp_replace_finds
 
-  # $4 is the current *byte* offset, $5, $6 reserved for later use
-  4=0 6=
+  while pcre_match -b -n $_regexp_replace_offset -- "$_regexp_replace_subject"; do
+    # we need to perform the evaluation in a scalar assignment so that
+    # if it generates an array, the elements are converted to string (by
+    # joining with the first character of $IFS as usual)
+    _regexp_replace_new=${(Xe)_regexp_replace_replacement}
 
-  local ZPCRE_OP
-  while pcre_match -b -n $4 -- "${(P)1}"; do
-    # append offsets and computed replacement to the array
-    # we need to perform the evaluation in a scalar assignment so that if
-    # it generates an array, the elements are converted to string (by
-    # joining with the first character of $IFS as usual)
-    5=${(e)3}
-    argv+=(${(s: :)ZPCRE_OP} "$5")
+    _regexp_replace_finds+=( ${(s[ ])ZPCRE_OP} "$_regexp_replace_new" )
 
     # for 0-width matches, increase offset by 1 to avoid
     # infinite loop
-    4=$((argv[-2] + (argv[-3] == argv[-2])))
+    (( _regexp_replace_offset = _regexp_replace_finds[-2] + (_regexp_replace_finds[-3] == _regexp_replace_finds[-2]) ))
   done
 
-  (($# > 6)) || return # no match
+  (( $#_regexp_replace_finds )) || return # no match
 
-  set +o multibyte
+  unsetopt multibyte
 
-  # $5 contains the result, $6 the current offset
-  5= 6=1
-  for 2 3 4 in "$@[7,-1]"; do
-    5+=${(P)1[$6,$2]}$4
-    6=$(($3 + 1))
+  _regexp_replace_offset=1
+  for _regexp_replace_start _regexp_replace_stop _regexp_replace_new in "$_regexp_replace_finds[@]"; do
+    _regexp_replace_result+=${_regexp_replace_subject[_regexp_replace_offset,_regexp_replace_start]}$_regexp_replace_new
+    (( _regexp_replace_offset = _regexp_replace_stop + 1 ))
   done
-  5+=${(P)1[$6,-1]}
-else
+  _regexp_replace_result+=${_regexp_replace_subject[_regexp_replace_offset,-1]}
+
+else # no PCRE
   # in ERE, we can't use an offset so ^, (and \<, \b, \B, [[:<:]] where
   # available) won't work properly.
 
-  # $4 is the string to be matched
-  4=${(P)1}
-
-  while [[ -n $4 ]]; do
-    if [[ $4 =~ $2 ]]; then
-      # append initial part and substituted match
-      5+=${4[1,MBEGIN-1]}${(e)3}
-      # truncate remaining string
-      if ((MEND < MBEGIN)); then
-        # zero-width match, skip one character for the next match
-        ((MEND++))
-        5+=${4[1]}
-      fi
-      4=${4[MEND+1,-1]}
-      # indicate we did something
-      6=1
-    else
-      break
+  local _regexp_replace_ok=0
+  while [[ $_regexp_replace_subject =~ $_regexp_replace_regexp ]]; do
+    # append initial part and substituted match
+    _regexp_replace_result+=$_regexp_replace_subject[1,MBEGIN-1]${(Xe)_regexp_replace_replacement}
+    # truncate remaining string
+    if (( MEND < MBEGIN )); then
+      # zero-width match, skip one character for the next match
+      (( MEND++ ))
+      _regexp_replace_result+=$_regexp_replace_subject[MBEGIN]
     fi
+    _regexp_replace_subject=$_regexp_replace_subject[MEND+1,-1]
+    _regexp_replace_ok=1
+    [[ -z $_regexp_replace_subject ]] && break
   done
-  [[ -n $6 ]] || return # no match
-  5+=$4
+  (( _regexp_replace_ok )) || return
+  _regexp_replace_result+=$_regexp_replace_subject
 fi
 
-eval $1=\$5
+# assign result to target variable if at least one substitution was
+# made. At this point, if the variable was originally array or assoc, it
+# is converted to scalar. If $1 doesn't contain a valid lvalue
+# specification, an exception is raised (exits the shell process if
+# non-interactive).
+: ${(P)1::="$_regexp_replace_result"}
+