From: Stephane Chazelas <stephane@chazelas.org>
To: Bart Schaefer <schaefer@brasslantern.com>,
	Zsh hackers list <zsh-workers@zsh.org>
Subject: MBEGIN when =~ finds bytes inside characters (Was: [PATCH v5] regexp-replace and ^, word boundary or look-behind operators (and more).)
Date: Sat, 9 Mar 2024 09:21:11 +0000
Message-ID: <20240309092111.4izlumpqejgqhyti@chazelas.org>
In-Reply-To: <20240309084158.jiyx2is3tbrwyzia@chazelas.org>

2024-03-09 08:41:58 +0000, Stephane Chazelas:
[...]
> +  while [[ $subject =~ $regexp ]]; do
> +    # append initial part and substituted match
> +    result+=$subject[1,MBEGIN-1]${(Xe)replacement}
[...]

BTW, this is likely not zsh's fault, but here on Ubuntu 22.04, with:

$ a=$'ABC/\U0010fffe/DEF'
$ print -r - ${(q)a}
ABC/$'\364\217\277\276'/DEF

So that's a string containing a 4-byte multibyte character.

$ regexp-replace  a $'\276' $'\277'
$ print -r - ${(q)a}
ABC/$'\364\217\277\276'/D$'\277'F

See how $'\277' replaced E instead of $'\276'.

It's my bad as a user to be doing that with multibyte enabled in
a locale with a multibyte charset.

$ a=$'ABC/\U0010fffe/DEF'
$ set +o multibyte
$ regexp-replace  a $'\276' $'\277'
$ print -r - ${(q+)a}
$'ABC/\U0010ffff/DEF'
$ set -o multibyte
$ print -r - ${(q)a}
ABC/$'\364\217\277\277'/DEF

That works as expected.

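The problem here is that, for the $'\276' match, MBEGIN and MEND
point at character 8 (the E) rather than anywhere inside the
multibyte character at position 5: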

$ [[ $a =~ $'\276' ]]
$ echo $MBEGIN $MEND
8 8
$ [[ $a =~ D ]]
$ echo $MBEGIN $MEND
7 7

And could very well be caused by a bug in my regex library,
maybe a variation of
https://sourceware.org/bugzilla/show_bug.cgi?id=31075 for regex.

If the problem is in the system's regexps, I can't think of
anything zsh could do about it except maybe checking that
subject and regexp decode as text properly, and error out if not
like it does in pcre mode.
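
In the meantime, a user-level guard along those lines could look
something like the sketch below. It is only an illustration, not
something zsh does: is_valid_utf8 is a hypothetical helper name, it
assumes the locale's charset is UTF-8, and it assumes an iconv(1)
that exits non-zero on malformed input. It is written in portable
sh syntax so it can be sourced from zsh as is.

```shell
# Hypothetical helper (not part of zsh): succeed only if $1 is
# well-formed UTF-8. Assumes a UTF-8 locale and an iconv(1) that
# fails on malformed input.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Possible use from zsh before calling regexp-replace:
#   if is_valid_utf8 "$subject" && is_valid_utf8 "$regexp"; then
#     regexp-replace subject "$regexp" "$replacement"
#   else
#     print -ru2 -- "subject or regexp is not valid text"
#   fi
```

With that, a regexp consisting of the lone byte $'\276' would be
rejected up front instead of producing a misplaced MBEGIN.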

zsh pattern matching seems to handle it better.

$ [[ $a = (#b)*($'\276')* ]] && echo match; typeset mbegin mend
mbegin=( -1 -1 )
mend=( -1 -1 )
$ [[ $a = (#b)*(D)* ]] && echo match; typeset mbegin mend
match
mbegin=( 7 )
mend=( 7 )

I wonder if PCRE2_MATCH_INVALID_UTF/PCRE2_NO_UTF_CHECK could be
used to improve matching with invalid UTF-8 for the pcre mode,
at least for the pcre builtins where offsets are byte-wide
rather than character-wise.

-- 
Stephane

