* clarification on (#U) in pattern matching.
From: Stephane Chazelas @ 2022-02-06 8:42 UTC (permalink / raw)
To: Zsh hackers list
$ set -o extendedglob
$ a='Stéphane€'
$ print -rn -- ${a//(#U)?} | hd
00000000 a9 82 ac |...|
00000003
It seems that with (#U) (here in a locale using UTF-8 as
charmap), ? matches only the first byte of each multibyte
character. Is that how it's meant to be?
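For reference, here is a minimal sketch of the byte layout behind that hd output. It uses only printf and od, so it does not depend on zsh or on the current locale. The helper name hexbytes is made up for this illustration.

```shell
# hexbytes is a hypothetical helper, not part of zsh: it prints the bytes
# of its argument as space-separated hex.
hexbytes() {
  # Word splitting on od's output drops its padding spaces and newlines.
  set -- $(printf '%s' "$1" | od -An -v -tx1)
  echo "$*"
}

# 'Stéphane€' spelled out as UTF-8 bytes in octal, so the script stays ASCII.
a=$(printf 'St\303\251phane\342\202\254')

hexbytes "$a"
# prints: 53 74 c3 a9 70 68 61 6e 65 e2 82 ac
```

é is c3 a9 and € is e2 82 ac, so the three bytes left over above (a9 82 ac) are exactly the non-leading bytes of the two multibyte characters.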
$ print -r -- ${a//(#m)?/[$MATCH]}
[S][t][é][p][h][a][n][e][€]
$ print -r -- ${a//(#Um)?/[$MATCH]}
[S][t][�]�[p][h][a][n][e][�]��
Also
[[ $'\ue9' = (#U)*$'\xa9'* ]] returns true (and doesn't without
(#U)), but:
print -r -- ${a//(#U)$'\xa9'}
fails to remove it. But:
$ echo ${a//(#U)?$'\xa9'}
Stphane€
With set +o multibyte:
$ set +o multibyte
$ print -r -- ${a//(#m)?/[$MATCH]}
[S][t][�][�][p][h][a][n][e][�][�][�]
Where ? matches on each byte of those multi-byte characters.
The doc has:
> U
> All characters are considered to be a single byte long. The
> opposite of u. This overrides the MULTIBYTE option.
which is a bit ambiguous and may be interpreted as justifying
the current behaviour.
But I suspect the cause is that when ${var//pattern/replace}
resumes searching for the next match after the first one, it
starts at the next character instead of the next byte, and the (#U)
applies to pattern matching but not to the
${var//pattern/replace} operator itself.
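The suspicion above can be checked against the observed output with a toy model. This is only a sketch of the hypothesis, not zsh's implementation. It assumes the operator retries at every character start and that (#U)? removes exactly one byte there. The function name survivors is hypothetical.

```shell
# Toy model only; this is not how zsh is implemented.
# Assumption: the // operator retries at every character start, and at each
# start (#U)? matches (and removes) exactly one byte, the lead byte.
# UTF-8 continuation bytes (0x80 to 0xbf) are never a character start,
# so they are what survives.
survivors() {
  out=
  for b in "$@"; do
    case $b in
      [89ab]?) out="$out${out:+ }$b" ;;   # continuation byte: kept
      *) ;;                               # ASCII or lead byte: removed by ?
    esac
  done
  echo "$out"
}

# Bytes of 'Stéphane€' in UTF-8.
survivors 53 74 c3 a9 70 68 61 6e 65 e2 82 ac
# prints: a9 82 ac
```

Under those assumptions the model leaves a9 82 ac, the same three bytes as the hd output at the top of this message.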
Thanks
Stephane
* Re: clarification on (#U) in pattern matching.
From: Peter Stephenson @ 2022-02-07 12:15 UTC (permalink / raw)
To: zsh workers
Sorry, this just went to Stephane.
pws
> On 07 February 2022 at 11:30 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
>
>
> > On 06 February 2022 at 08:42 Stephane Chazelas <stephane@chazelas.org> wrote:
> > $ set -o extendedglob
> > $ a='Stéphane€'
> > $ print -rn -- ${a//(#U)?} | hd
> > 00000000 a9 82 ac |...|
> > 00000003
> >
> > It seems that with (#U) (and here in a locale using UTF-8 as
> > charmap), ? with (#U) matches only on the first byte of
> > multibyte characters. Is that how it's meant to be?
>
> I think what you're hitting is probably, as you suspected, a
> difference between the pattern matching code and the substitution
> code. The underlying pattern matching really is byte by byte,
> but this doesn't force any substitution such as // to behave
> in the same way. As far as I know, the MULTIBYTE option is
> the only higher level consistency measure we have.
>
> I think there might be a parameter matching flag that you can
> also set that would help. I'd have to look in more detail.
>
> pws
* Re: clarification on (#U) in pattern matching.
From: Stephane Chazelas @ 2022-02-07 12:24 UTC (permalink / raw)
To: Peter Stephenson; +Cc: zsh workers
See also:
$ a=été
$ echo ${a%(#U)?}
été
$ echo ${a%%(#U)?}
été
$ echo ${a%(#U)$'\xa9'}
été
$ echo ${a%(#U)?*}
ét
Consistent with the other behaviours but maybe even more confusing.
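One way to read these results, as a sketch only: assume the % operator tries suffixes that begin at character boundaries, while (#U)? matches exactly one byte. The last character of été is then a two-byte candidate, which ? alone cannot match but ?* can. The helper last_char_len is hypothetical and only counts bytes.

```shell
# Toy model only, not zsh internals: report how many bytes the last
# UTF-8 character of a byte sequence occupies.
last_char_len() {
  n=0
  # Restart the count at every non-continuation byte; continuation bytes
  # (0x80 to 0xbf) extend the current character.
  for b in "$@"; do
    case $b in
      [89ab]?) n=$((n + 1)) ;;
      *) n=1 ;;
    esac
  done
  echo "$n"
}

# Bytes of 'été' in UTF-8.
last_char_len c3 a9 74 c3 a9
# prints: 2
```

With a two-byte final character, a one-byte pattern such as (#U)? or (#U)$'\xa9' has no character-aligned suffix to match, whereas (#U)?* matches c3 followed by a9 and leaves "ét".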
--
Stephane
* Re: clarification on (#U) in pattern matching.
From: Peter Stephenson @ 2022-02-07 12:59 UTC (permalink / raw)
To: zsh workers
On 07 February 2022 at 12:15 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> I think there might be a parameter matching flag that you can
> also set that would help. I'd have to look in more detail.
On examination, no, I don't think there is an option at the parameter level,
as opposed to the pattern match level --- and I appreciate that this
distinction, while clear in the code, isn't likely to be so obvious to a
user staring at shell code, even if parameter substitution and pattern
matching are documented separately. The doc for parameters simply refers
to the MULTIBYTE option.
There's a limited feature to turn off multibyte counting when calculating
widths for use in padding etc. I suppose in principle that could be extended
for pattern matching, but in practice (at least without a major rewrite) all
it would do is turn off the multibyte option locally --- probably carrying over
into pattern matching, so you'd at least get consistency between the two that
way round.
There isn't a sane way to propagate the (#U) flag out of the pattern code
back up into the parameter level.
pws