From: Stephane Chazelas <stephane@chazelas.org>
To: zsh-workers@zsh.org
Subject: Re: Test ./E03posix.ztst was expected to fail, but passed.
Date: Wed, 23 Mar 2022 10:38:35 +0000
Message-ID: <20220323103835.hpoprdgt45iyqqgt@chazelas.org>
In-Reply-To: <20220323022644.GA349036@zira.vinc17.org>

2022-03-23 03:26:44 +0100, Vincent Lefevre:
> On 2022-03-22 14:04:30 -0700, Bart Schaefer wrote:
> > Specifically in this instance, we consider it a POSIX bug that '%s'
> > always counts byte positions and that zsh has fixed this when it
> > counts character positions.
> But, AFAIK, on the POSIX side, it has never been regarded as a bug
> (I haven't seen any bug report).

It's been raised several times on the POSIX mailing list, and
my understanding is that the Open Group doesn't consider it a
bug; they have made it clear that they would not address it. They
may consider specifying ksh93's %Ls (which pads based on display
width, not byte or character count) if enough implementations
start to support it.

That's why I didn't bother raising it as a bug personally, but
to me, that position (where printf(1) is meant to be an
interface to printf(3) without decoding those bytes into
characters) does not make sense. printf is for printing formatted
text, not for padding binary strings. printf(3) was
extended with wprintf(3) to handle wide characters; printf(1)
should have been enhanced to switch to that or an equivalent,
just like every other text utility is now specified to be able
to cope with wide characters.
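
To make the byte-counting behaviour concrete, here is a minimal
sketch, assuming a POSIX sh with od(1) and tr(1) available. The
helper names hex and pad_bytewise are made up for illustration.
The C locale is forced so that the result does not depend on
which shell runs it:

```shell
# Hypothetical helper: dump stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# "é" as its two UTF-8 bytes (0xc3 0xa9), built from octal escapes
# so that this snippet itself stays pure ASCII.
e_acute=$(printf '\303\251')

# Run in a subshell with the C locale so printf counts bytes,
# the way printf(3) would.
pad_bytewise() ( LC_ALL=C; export LC_ALL; printf '%5s' "$1" | hex )

pad_bytewise "$e_acute"; echo   # prints 202020c3a9
```

Only three spaces are added in front of the single character,
because its two bytes count as two positions. Padding by
character count would add four.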

printf(1) would need to decode its arguments into text anyway, if
only because in the format or %b arguments, the "\" character
(also "%" in the format) is interpreted specially. zsh doesn't do
that, btw (which may be considered a bug, but then again those
non-UTF-8 multibyte charsets are poorly supported throughout,
and to me it doesn't seem worth the effort given that hardly
anybody uses multibyte charsets other than UTF-8 these days):

$ LC_ALL=zh_HK SHELL=/bin/zsh luit
zsh$ locale charmap
zsh$ printf 'αb' | hd
00000000  a3 08                                             |..|

(as α is encoded as 0xa3 0x5c in BIG5-HKSCS as used in that
locale, 0x5c being also \)
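
The same effect can be reproduced without that locale or luit by
building the format from raw bytes. This is only a sketch,
assuming a shell whose printf processes the format byte by byte
(as zsh does above); hex, as_format and as_argument are
hypothetical helper names:

```shell
# Hypothetical helper: dump stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Bytes 0xa3 0x5c 0x62: the BIG5-HKSCS encoding of "α", then "b".
fmt=$(printf '\243\134b')

# As a format, a byte-oriented printf sees 0xa3 followed by the
# escape sequence "\b" (backspace, 0x08).
as_format() { printf "$1" | hex; }

# As a %s argument, the same bytes pass through untouched.
as_argument() { printf '%s' "$1" | hex; }

as_format "$fmt"; echo     # prints a308
as_argument "$fmt"; echo   # prints a35c62
```

The trailing 0x5c of the two-byte character is indistinguishable
from a backslash unless the bytes are first decoded as
characters.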

Yash is probably the only shell that implements the POSIX
spec the way POSIX likely intends it:

~$ LC_ALL=zh_HK SHELL=yash luit
yash$ printf 'αb' | hd
00000000  a3 5c 62                                          |.\b|
yash$ printf %5s 'αb' | hd
00000000  20 20 a3 5c 62                                    |  .\b|
yash$ printf %5b 'αb' | hd
00000000  20 20 a3 5c 62                                    |  .\b|

That is, bytes are decoded into characters so that those
backslashes are interpreted "correctly" (yash decodes everything;
it's not specific to printf¹), and then encoded back so that it
behaves as if they were passed to printf(3), as POSIX requires.

I've not verified it, but I've read somewhere that the C standard
was considering enhancing printf("%.3s") so that it doesn't break
characters in the middle (or maybe that's already the case?).
So printf '%.3s\n' Stéphane, where é is UTF-8 encoded in a
locale using UTF-8, would output "St" instead of "St<0xc3>".
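
The byte-wise truncation can be sketched as follows, assuming a
POSIX sh with od(1) and tr(1). The helpers hex and
truncate_bytewise are illustrative names, and the C locale is
forced so that the precision is counted in bytes:

```shell
# Hypothetical helper: dump stdin as one compact hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# "Stéphane" with é as UTF-8 (0xc3 0xa9), written with octal
# escapes so that this snippet itself stays pure ASCII.
name=$(printf 'St\303\251phane')

# In the C locale, %.3s keeps the first three bytes.
truncate_bytewise() ( LC_ALL=C; export LC_ALL; printf '%.3s' "$1" | hex )

truncate_bytewise "$name"; echo   # prints 5374c3
```

The output ends in 0xc3, the first half of é, which is not valid
UTF-8 on its own.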

My opinion would be:

- not change how %5s works in zsh. To me, zsh made an effort to
  fix that, and I wouldn't expect anyone to rely on the POSIX
  behaviour, which to me is a bug. One can always do

    printf() {
      set -o localoptions +o multibyte; builtin printf "$@"
    }

  if they want the POSIX behaviour.

- no need to fix the problems with backslashes in those
  messed-up multibyte encodings as I'd expect they're being
  phased out.

- maybe implement ksh93's %Ls (zsh does have a ${(ml[5])param}
  alternative though it does both padding and truncation).

¹ That approach is not tenable IMO, as it means yash can't cope
with arbitrary file paths, arguments, or environment variables.


