* clarification on (#U) in pattern matching.
@ 2022-02-06 8:42 Stephane Chazelas
From: Stephane Chazelas @ 2022-02-06 8:42 UTC (permalink / raw)
To: Zsh hackers list
$ set -o extendedglob
$ a='Stéphane€'
$ print -rn -- ${a//(#U)?} | hd
00000000 a9 82 ac |...|
00000003
It seems that with (#U) (and here in a locale using UTF-8 as
charmap), ? with (#U) matches only on the first byte of
multibyte characters. Is that how it's meant to be?
$ print -r -- ${a//(#m)?/[$MATCH]}
[S][t][é][p][h][a][n][e][€]
$ print -r -- ${a//(#Um)?/[$MATCH]}
[S][t][�]�[p][h][a][n][e][�]��
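For readers less familiar with UTF-8, the following is a minimal sketch (plain POSIX sh, not zsh, and not part of the original report) that dumps the bytes of the test string and keeps only the continuation bytes, i.e. every byte that is not the first byte of a character. The variable names are illustrative. It relies on sh-style word splitting, so run it with sh or bash rather than zsh.

```shell
# UTF-8 bytes of 'Stéphane€', spelled as octal escapes so the result does not
# depend on the encoding of this text: é is c3 a9, € is e2 82 ac.
bytes=$(printf 'St\303\251phane\342\202\254' | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//')
echo "$bytes"

# Keep only the continuation bytes (0x80..0xbf), i.e. every byte that is
# not the first byte of a character.
leftover=
for b in $bytes; do
  n=$((0x$b))
  if [ "$n" -ge 128 ] && [ "$n" -le 191 ]; then
    leftover="$leftover $b"
  fi
done
leftover=${leftover# }
echo "$leftover"
```

The second line printed is `a9 82 ac`, the same three bytes that hd shows after ${a//(#U)?}, which fits the observation that only the first byte of each character was removed.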
Also
[[ $'\ue9' = (#U)*$'\xa9'* ]] returns true (and doesn't without
(#U)), but:
print -r -- ${a//(#U)$'\xa9'}
fails to remove it. But:
$ echo ${a//(#U)?$'\xa9'}
Stphane€
With set +o multibyte:
$ set +o multibyte
$ print -r -- ${a//(#m)?/[$MATCH]}
[S][t][�][�][p][h][a][n][e][�][�][�]
Where ? matches on each byte of those multi-byte characters.
The doc has:
> U
> All characters are considered to be a single byte long. The
> opposite of u. This overrides the MULTIBYTE option.
which is a bit ambiguous and may be interpreted as justifying
the current behaviour.
But I suspect that's because when ${var//pattern/replace}
resumes searching for the next pattern after the first one, it
starts at the next character instead of next byte, and the (#U)
applies to pattern matching but not to the
${var//pattern/replace} operator itself.
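The suspicion above can be turned into a small model. This is a hypothetical sketch in plain POSIX sh, not zsh's actual implementation: the subject is a list of hex bytes, the pattern is compared byte by byte, but match attempts only start at character boundaries. The function names and the `any` token (standing in for ? under (#U)) are invented for illustration. It relies on sh-style word splitting, so run it with sh or bash rather than zsh.

```shell
# True if the hex byte $1 is a UTF-8 continuation byte (0x80..0xbf).
is_continuation() {
  [ "$((0x$1))" -ge 128 ] && [ "$((0x$1))" -le 191 ]
}

# model_subst PATTERN BYTE...
# PATTERN is a space-separated list of hex bytes; 'any' matches any one byte.
model_subst() {
  pat=$1; shift
  out=
  while [ $# -gt 0 ]; do
    # Compare the pattern byte by byte, starting at this character boundary.
    i=0; matched=1
    for p in $pat; do
      i=$((i + 1))
      eval "b=\${$i-}"
      if [ -z "$b" ] || { [ "$p" != any ] && [ "$p" != "$b" ]; }; then
        matched=0
        break
      fi
    done
    if [ "$matched" -eq 1 ]; then
      shift "$i"              # drop the matched bytes
    else
      out="$out $1"; shift    # keep the unmatched byte
    fi
    # Resume at the next character boundary, keeping any continuation
    # bytes that are passed over on the way.
    while [ $# -gt 0 ] && is_continuation "$1"; do
      out="$out $1"; shift
    done
  done
  echo "${out# }"
}

s='53 74 c3 a9 70 68 61 6e 65 e2 82 ac'   # UTF-8 bytes of 'Stéphane€'
r_any=$(model_subst 'any' $s)             # models ${a//(#U)?}
r_a9=$(model_subst 'a9' $s)               # models ${a//(#U)$'\xa9'}
r_any_a9=$(model_subst 'any a9' $s)       # models ${a//(#U)?$'\xa9'}
echo "$r_any"
echo "$r_a9"
echo "$r_any_a9"
```

Under this model the three calls give `a9 82 ac`, the unchanged byte list, and the byte list without `c3 a9`, which lines up with the three results reported above. That agreement is only consistent with the hypothesis; it does not confirm how the zsh code is written.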
Thanks
Stephane
* Re: clarification on (#U) in pattern matching.
From: Peter Stephenson @ 2022-02-07 12:15 UTC (permalink / raw)
To: zsh workers
Sorry, this just went to Stephane.
pws
> On 07 February 2022 at 11:30 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
>
> > On 06 February 2022 at 08:42 Stephane Chazelas <stephane@chazelas.org> wrote:
> > $ set -o extendedglob
> > $ a='Stéphane€'
> > $ print -rn -- ${a//(#U)?} | hd
> > 00000000 a9 82 ac |...|
> > 00000003
> >
> > It seems that with (#U) (and here in a locale using UTF-8 as
> > charmap), ? with (#U) matches only on the first byte of
> > multibyte characters. Is that how it's meant to be?
>
> I think what you're hitting is probably, as you suspected, a
> difference between the pattern matching code and the substitution
> code. The underlying pattern matching really is byte by byte,
> but this doesn't force any substitution such as // to behave
> in the same way. As far as I know, the MULTIBYTE option is
> the only higher level consistency measure we have.
>
> I think there might be a parameter matching flag that you can
> also set that would help. I'd have to look in more detail.
>
> pws
* Re: clarification on (#U) in pattern matching.
From: Stephane Chazelas @ 2022-02-07 12:24 UTC (permalink / raw)
To: Peter Stephenson; +Cc: zsh workers
See also:
$ a=été
$ echo ${a%(#U)?}
été
$ echo ${a%%(#U)?}
été
$ echo ${a%(#U)$'\xa9'}
été
$ echo ${a%(#U)?*}
ét
Consistent with the other behaviours but maybe even more confusing.
--
Stephane
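The suffix results for a=été can be reproduced with a small hypothetical model, written in plain POSIX sh rather than zsh and not taken from the zsh source. It assumes that ${a%pattern} only tries suffixes that begin at a character boundary, while the pattern is compared byte by byte and has to cover the whole suffix. The function names and the `any` and `rest` tokens (standing in for ? and a trailing * under (#U)) are invented for illustration. It relies on sh-style word splitting, so run it with sh or bash rather than zsh.

```shell
# True if the hex byte $1 is a UTF-8 continuation byte (0x80..0xbf).
is_continuation() {
  [ "$((0x$1))" -ge 128 ] && [ "$((0x$1))" -le 191 ]
}

# model_strip_suffix PATTERN BYTE...
# PATTERN tokens: a hex byte, 'any' (one byte), or a final 'rest' (all
# remaining bytes). Models the shortest-suffix form ${a%pattern}.
model_strip_suffix() {
  pat=$1; shift
  start=$#
  while [ "$start" -ge 1 ]; do
    eval "first=\${$start}"
    if ! is_continuation "$first"; then
      # Candidate suffix: bytes start..$#, compared byte by byte.
      i=$start; ok=1
      for p in $pat; do
        if [ "$p" = rest ]; then i=$(($# + 1)); break; fi
        if [ "$i" -gt $# ]; then ok=0; break; fi
        eval "b=\${$i}"
        if [ "$p" != any ] && [ "$p" != "$b" ]; then ok=0; break; fi
        i=$((i + 1))
      done
      # The pattern must consume the whole suffix, not just its start.
      if [ "$ok" -eq 1 ] && [ "$i" -eq $(($# + 1)) ]; then
        kept=; k=1
        while [ "$k" -lt "$start" ]; do
          eval "kept=\"\$kept \${$k}\""
          k=$((k + 1))
        done
        echo "${kept# }"
        return 0
      fi
    fi
    start=$((start - 1))
  done
  echo "$*"    # no suffix matched: value unchanged
}

s='c3 a9 74 c3 a9'                            # UTF-8 bytes of 'été'
r_any=$(model_strip_suffix 'any' $s)          # models ${a%(#U)?}
r_a9=$(model_strip_suffix 'a9' $s)            # models ${a%(#U)$'\xa9'}
r_any_rest=$(model_strip_suffix 'any rest' $s)  # models ${a%(#U)?*}
echo "$r_any"
echo "$r_a9"
echo "$r_any_rest"
```

In this model the first two calls leave all five bytes in place and the third keeps `c3 a9 74`, i.e. ét, matching the outputs in the message. The single-byte ? cannot cover the two-byte suffix é, and the lone a9 never starts a candidate suffix.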
* Re: clarification on (#U) in pattern matching.
From: Peter Stephenson @ 2022-02-07 12:59 UTC (permalink / raw)
To: zsh workers
On 07 February 2022 at 12:15 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> I think there might be a parameter matching flag that you can
> also set that would help. I'd have to look in more detail.
On examination, no, I don't think there is an option at the
parameter level, as opposed to the pattern match level --- and I
appreciate that this distinction, while clear in the code, isn't
likely to be so obvious to a user staring at shell code, even if
parameter substitution and pattern matching are documented
separately. The doc for parameters simply refers to the
MULTIBYTE option.
There's a limited feature to turn off multibyte counting when
calculating widths for use in padding etc. I suppose in principle
that could be extended for pattern matching, but in practice (at
least without a major rewrite) all it would do is turn off the
multibyte option locally --- probably carrying over into pattern
matching, so you'd at least get consistency between the two that
way round. There isn't a sane way to propagate the (U) flag out of
the pattern code back up into the parameter level.
pws