clarification on (#U) in pattern matching.

zsh-workers

* clarification on (#U) in pattern matching.
@ 2022-02-06  8:42 Stephane Chazelas
       [not found] ` <1071890479.577225.1644233454174@mail2.virginmedia.com>
  0 siblings, 1 reply; 4+ messages in thread
From: Stephane Chazelas @ 2022-02-06  8:42 UTC (permalink / raw)
  To: Zsh hackers list

$ set -o extendedglob
$ a='Stéphane€'
$ print -rn -- ${a//(#U)?} | hd
00000000  a9 82 ac                                          |...|
00000003

It seems that with (#U) (here in a locale using UTF-8 as
charmap), ? matches only the first byte of each multibyte
character. Is that how it's meant to be?

$ print -r -- ${a//(#m)?/[$MATCH]}
[S][t][é][p][h][a][n][e][€]
$ print -r -- ${a//(#Um)?/[$MATCH]}
[S][t][�]�[p][h][a][n][e][�]��

Also

[[ $'\ue9' = (#U)*$'\xa9'* ]] returns true (and doesn't without
(#U)), but:

print -r -- ${a//(#U)$'\xa9'}

fails to remove it. But:

$ echo ${a//(#U)?$'\xa9'}
Stphane€

With set +o multibyte:

$ set +o multibyte
$ print -r -- ${a//(#m)?/[$MATCH]}
[S][t][�][�][p][h][a][n][e][�][�][�]

Where ? matches on each byte of those multi-byte characters.

The doc has:

> U
>     All characters are considered to be a single byte long.  The
>     opposite of u.  This overrides the MULTIBYTE option.

which is a bit ambiguous and may be interpreted as justifying
the current behaviour.

But I suspect that's because when ${var//pattern/replace}
resumes searching for the next match after the first one, it
starts at the next character instead of the next byte, and the
(#U) applies to pattern matching but not to the
${var//pattern/replace} operator itself.
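
The suspicion above can be checked against the hd output with a small model. This is only a sketch of the hypothesis, not zsh's actual code: the function name model_strip is made up for illustration, and it assumes the input is valid UTF-8. It treats ? as matching exactly one byte, tries a match only at the start of each character, and lets the search step over the rest of the character without ever matching it.

```sh
# Hypothetical model of ${a//(#U)?} under the suspected behaviour.
# Prints the hex values of the bytes that survive the substitution.
model_strip() (
  skip=0   # continuation bytes still to step over in the current character
  out=
  # od turns the string into one hex token per byte
  for b in $(printf '%s' "$1" | od -An -v -tx1); do
    if [ "$skip" -gt 0 ]; then
      # Not at a character start: the search never tries to match here,
      # so the byte is kept.
      out="$out${out:+ }$b"
      skip=$((skip - 1))
      continue
    fi
    # Character start: ? matches this single byte and it is removed.
    # The UTF-8 lead byte tells us how many continuation bytes follow.
    n=$((0x$b))
    if [ "$n" -ge 240 ]; then skip=3
    elif [ "$n" -ge 224 ]; then skip=2
    elif [ "$n" -ge 192 ]; then skip=1
    else skip=0
    fi
  done
  printf '%s\n' "$out"
)

# 'Stéphane€' spelled out as UTF-8 bytes in octal
model_strip "$(printf 'St\303\251phane\342\202\254')"
```

For 'Stéphane€' the model leaves a9 82 ac, the same three bytes hd shows above, which is at least consistent with the search resuming at the next character rather than the next byte.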

Thanks
Stephane


* Re: clarification on (#U) in pattern matching.
       [not found] ` <1071890479.577225.1644233454174@mail2.virginmedia.com>
@ 2022-02-07 12:15   ` Peter Stephenson
  2022-02-07 12:24     ` Stephane Chazelas
  2022-02-07 12:59     ` Peter Stephenson
  0 siblings, 2 replies; 4+ messages in thread
From: Peter Stephenson @ 2022-02-07 12:15 UTC (permalink / raw)
  To: zsh workers

Sorry, this just went to Stephane.

pws

> On 07 February 2022 at 11:30 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> 
> 
> > On 06 February 2022 at 08:42 Stephane Chazelas <stephane@chazelas.org> wrote:
> > $ set -o extendedglob
> > $ a='Stéphane€'
> > $ print -rn -- ${a//(#U)?} | hd
> > 00000000  a9 82 ac                                          |...|
> > 00000003
> > 
> > It seems that with (#U) (here in a locale using UTF-8 as
> > charmap), ? matches only the first byte of each multibyte
> > character. Is that how it's meant to be?
> 
> I think what you're hitting is probably, as you suspected, a
> difference between the pattern matching code and the substitution
> code.  The underlying pattern matching really is byte by byte,
> but this doesn't force any substitution such as // to behave
> in the same way.  As far as I know, the MULTIBYTE option is
> the only higher level consistency measure we have.
> 
> I think there might be a parameter matching flag that you can
> also set that would help.  I'd have to look in more detail.
> 
> pws



* Re: clarification on (#U) in pattern matching.
  2022-02-07 12:15   ` Peter Stephenson
@ 2022-02-07 12:24     ` Stephane Chazelas
  2022-02-07 12:59     ` Peter Stephenson
  1 sibling, 0 replies; 4+ messages in thread
From: Stephane Chazelas @ 2022-02-07 12:24 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: zsh workers

See also:

$ a=été
$ echo ${a%(#U)?}
été
$ echo ${a%%(#U)?}
été
$ echo ${a%(#U)$'\xa9'}
été
$ echo ${a%(#U)?*}
ét

Consistent with the other behaviours but maybe even more confusing.

-- 
Stephane



* Re: clarification on (#U) in pattern matching.
  2022-02-07 12:15   ` Peter Stephenson
  2022-02-07 12:24     ` Stephane Chazelas
@ 2022-02-07 12:59     ` Peter Stephenson
  1 sibling, 0 replies; 4+ messages in thread
From: Peter Stephenson @ 2022-02-07 12:59 UTC (permalink / raw)
  To: zsh workers

On 07 February 2022 at 12:15 Peter Stephenson <p.w.stephenson@ntlworld.com> wrote:
> I think there might be a parameter matching flag that you can
> also set that would help.  I'd have to look in more detail.

On examination, no, I don't think there is an option at the parameter level,
as opposed to the pattern match level --- and I appreciate that this
distinction, while clear in the code, isn't likely to be so obvious to a
user staring at shell code, even if parameter substitution and pattern
matching are documented separately. The doc for parameters simply refers
to the MULTIBYTE option.

There's a limited feature to turn off multibyte counting when calculating
widths for use in padding etc. I suppose in principle that could be extended
for pattern matching, but in practice (at least without a major rewrite) all
it would do is turn off the multibyte option locally --- probably carrying over
into pattern matching, so you'd at least get consistency between the two that
way round.

There isn't a sane way to propagate the (#U) flag out of the pattern code
back up into the parameter level.

pws

