Date: Sat, 9 Mar 2024 09:21:11 +0000
From: Stephane Chazelas
To: Bart Schaefer, Zsh hackers list
Subject: MBEGIN when =~ finds bytes inside characters (Was: [PATCH v5] regexp-replace and ^, word boundary or look-behind operators (and more).)
Message-ID: <20240309092111.4izlumpqejgqhyti@chazelas.org>
References: <20191216212706.i3xvf6hn5h3jwkjh@chaz.gmail.com>
 <20191217073846.4usg2hnsk66bhqvl@chaz.gmail.com>
 <20191217111113.z242f4g6sx7xdwru@chaz.gmail.com>
 <2ea6feb3-a686-4d83-ab27-6a582424487c@www.fastmail.com>
 <20200101140343.qwfx2xaojumuds3d@chaz.gmail.com>
 <20210430061117.buyhdhky5crqjrf2@chazelas.org>
 <20210505114521.bemoiekpophssbug@chazelas.org>
 <20240308153050.u63fqtcjyr2yewye@chazelas.org>
 <20240309084158.jiyx2is3tbrwyzia@chazelas.org>
In-Reply-To: <20240309084158.jiyx2is3tbrwyzia@chazelas.org>
X-Seq: 52719

2024-03-09 08:41:58 +0000, Stephane Chazelas:
[...]
> +  while [[ $subject =~ $regexp ]]; do
> +    # append initial part and substituted match
> +    result+=$subject[1,MBEGIN-1]${(Xe)replacement}
[...]

BTW, this is likely not zsh's fault, but here on Ubuntu 22.04, with:

$ a=$'ABC/\U0010fffe/DEF'
$ print -r - ${(q)a}
ABC/$'\364\217\277\276'/DEF

that is, with a string containing a 4-byte multibyte character:

$ regexp-replace a $'\276' $'\277'
$ print -r - ${(q)a}
ABC/$'\364\217\277\276'/D$'\277'F

Note that $'\277' replaced the E instead of $'\276'.

It's my bad as a user to be doing that with multibyte enabled in
a locale with a multibyte charset.
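The arithmetic behind that result can be checked outside zsh. The following is only an illustrative Python sketch (the variable names are made up, and it says nothing about where in zsh or the regex library the offset goes astray): it shows that the byte $'\276' sits at 1-based byte position 8 of the string, and that the 8th character of the same string is the E that got replaced.

```python
# Illustration only: names are hypothetical, not taken from zsh.
subject = "ABC/\U0010fffe/DEF"
raw = subject.encode("utf-8")

# The four UTF-8 bytes of U+10FFFE in octal, as ${(q)a} printed them.
char_bytes_octal = [format(b, "o") for b in "\U0010fffe".encode("utf-8")]

# 1-based position of byte 0xBE (octal 276) when counting bytes.
byte_pos = raw.index(0o276) + 1

# Character found at that same 1-based index when counting characters.
char_at_same_index = subject[byte_pos - 1]

print(char_bytes_octal, byte_pos, char_at_same_index)
```

Whether a byte position is actually being reused as a character position here is an assumption; the sketch only shows that the numbers are consistent with that reading.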
$ a=$'ABC/\U0010fffe/DEF'
$ set +o multibyte
$ regexp-replace a $'\276' $'\277'
$ print -r - ${(q+)a}
$'ABC/\U0010ffff/DEF'
$ set -o multibyte
$ print -r - ${(q)a}
ABC/$'\364\217\277\277'/DEF

That works OK.

The problem here is:

$ [[ $a =~ $'\276' ]]
$ echo $MBEGIN $MEND
8 8
$ [[ $a =~ D ]]
$ echo $MBEGIN $MEND
7 7

And it could very well be caused by a bug in my regex library, maybe
a variation of https://sourceware.org/bugzilla/show_bug.cgi?id=31075
for regex.

If the problem is in the system's regexps, I can't think of
anything zsh could do about it, except maybe checking that subject
and regexp decode as text properly, and erroring out if not, like it
does in pcre mode.

zsh pattern matching seems to be handling it better:

$ [[ $a = (#b)*($'\276')* ]] && echo match; typeset mbegin mend
mbegin=( -1 -1 )
mend=( -1 -1 )
$ [[ $a = (#b)*(D)* ]] && echo match; typeset mbegin mend
match
mbegin=( 7 )
mend=( 7 )

I wonder if PCRE2_MATCH_INVALID_UTF/PCRE2_NO_UTF_CHECK could be
used to improve matching with invalid UTF-8 for the pcre mode, at
least for the pcre builtins where offsets are counted in bytes
rather than in characters.

-- 
Stephane
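As a rough illustration of the "check that subject and regexp decode as text properly" idea, here is a minimal Python sketch. It assumes a UTF-8 locale; the helper name decodes_as_utf8 is hypothetical, and this is neither zsh code nor what pcre mode actually does internally.

```python
# Sketch under assumptions: UTF-8 charset, hypothetical helper name.
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


subject = "ABC/\U0010fffe/DEF".encode("utf-8")  # well-formed text
pattern = b"\xbe"                               # lone continuation byte

subject_ok = decodes_as_utf8(subject)
pattern_ok = decodes_as_utf8(pattern)

# A caller could refuse to match when either operand fails the check.
print(subject_ok, pattern_ok)
```

With the values from the session, the subject passes and the regexp $'\276' does not, so a check along these lines would reject the match before any offsets are computed.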