zsh-workers
* bug report : printf %.1s outputting more than 1 character
       [not found] <1621619253.265114.1678847919086.ref@mail.yahoo.com>
@ 2023-03-15  2:38 ` Jason C. Kwan
  2023-03-15  3:46   ` Bart Schaefer
  0 siblings, 1 reply; 6+ messages in thread
From: Jason C. Kwan @ 2023-03-15  2:38 UTC (permalink / raw)
  To: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 3346 bytes --]

I'm using the macOS 13.2.1 OS-provided zsh, version 5.8.1, which I understand isn't the latest and greatest of 5.9, so perhaps this bug has already been addressed.
In the 4-byte sequence as seen below ( defined via explicit octal codes ), under no Unicode scenario should 4 bytes be printed out via a command of printf %.1s, by design. 
 - The first byte of \377 \xFF is explicitly invalid under UTF-8 (even allowing up to 7-byte in the oldest of definitions).
 - The 4-byte value is too large to constitute a single character under either endian of UTF-32.
 - It's also not a pair of beyond-BMP UTF-16 surrogates either, regardless of endian.
At best, if treated as UTF-16, of either endian, this 4-byte sequence represents 2 code points, in which case, only 2 bytes should be printed not 4.
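
As a rough cross-check of the UTF-16 and UTF-32 points above, the C sketch below reads bytes as 16-bit units in either byte order and tests values against the surrogate range D800-DFFF and the 10FFFF ceiling. The helper names (unit_be, unit_le, is_surrogate, is_scalar) are made up for illustration and are not part of zsh or any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Read one 16-bit unit from two bytes, big-endian. */
uint16_t unit_be(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read one 16-bit unit from two bytes, little-endian. */
uint16_t unit_le(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)((p[1] << 8) | p[0]);
}

/* UTF-16 surrogates occupy D800..DFFF. */
int is_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDFFF;
}

/* A Unicode scalar value is at most 10FFFF and is not a surrogate. */
int is_scalar(uint32_t v)
{
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}
```

For the bytes \377\210\234\256 the units are FF88 and 9CAE big-endian, or 88FF and AE9C little-endian. None falls in the surrogate range, and both 32-bit readings (FF889CAE and AE9C88FF) are above 10FFFF.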

My high-level understanding of printf %.1s is that it should output the first locale-valid character of the input string, and in its absence, output the first byte instead, if any, so setting LC_ALL=C or POSIX would defeat the purpose of this bug report.
The reproducible sample shell command below includes what the output from zsh built-in printf looks like, what the macOS built-in printf looks like, and what the gnu printf looks like, all else being equal. The testing shell was invoked via
    zsh --restricted --no-rcs --nologin --verbose -xtrace -f -c
In all 3 test scenarios, LC_ALL is explicitly cleared, while LANG is explicitly set to a widely used one. 
The od used is the macOS one, not the gnu one.
To my best knowledge, the other printfs have produced the correct output.
Thanks for your time.
====================================================================
echo; echo "$ZSH_VERSION"; echo; uname -a; echo;
LC_ALL= LANG="en_US.UTF-8" builtin printf '\n\n\t[%.1s]\n\n' $'\377\210\234\256' | od -bacx; echo;
LC_ALL= LANG="en_US.UTF-8" command printf '\n\n\t[%.1s]\n\n' $'\377\210\234\256' | od -bacx; echo;
LC_ALL= LANG="en_US.UTF-8" gprintf '\n\n\t[%.1s]\n\n' $'\377\210\234\256' | od -bacx; echo;

+zsh:1> echo
+zsh:1> echo 5.8.1
5.8.1
+zsh:1> echo
+zsh:1> uname -a
Darwin m1mx4CT 22.3.0 Darwin Kernel Version 22.3.0: Mon Jan 30 20:38:37 PST 2023; root:xnu-8792.81.3~2/RELEASE_ARM64_T6000 arm64
+zsh:1> echo
+zsh:1> LC_ALL='' LANG=en_US.UTF-8 +zsh:1> printf '\n\n\t[%.1s]\n\n' $'\M-\C-?\M-\C-H\M-\C-\\M-.'
+zsh:1> od -bacx
0000000   012 012 011 133 377 210 234 256 135 012 012
           nl  nl  ht   [   ?  88  9c   ?   ]  nl  nl
           \n  \n  \t   [ 377 210 234 256   ]  \n  \n
             0a0a    5b09    88ff    ae9c    0a5d    000a
0000013
+zsh:1> echo
+zsh:1> LC_ALL='' LANG=en_US.UTF-8 printf '\n\n\t[%.1s]\n\n' $'\M-\C-?\M-\C-H\M-\C-\\M-.'
+zsh:1> od -bacx
0000000   012 012 011 133 377 135 012 012
           nl  nl  ht   [   ?   ]  nl  nl
           \n  \n  \t   [ 377   ]  \n  \n
             0a0a    5b09    5dff    0a0a
0000010
+zsh:1> echo
+zsh:1> LC_ALL='' LANG=en_US.UTF-8 gprintf '\n\n\t[%.1s]\n\n' $'\M-\C-?\M-\C-H\M-\C-\\M-.'
+zsh:1> od -bacx
0000000   012 012 011 133 377 135 012 012
           nl  nl  ht   [   ?   ]  nl  nl
           \n  \n  \t   [ 377   ]  \n  \n
             0a0a    5b09    5dff    0a0a
0000010
+zsh:1> echo
zsh 5.8.1 (x86_64-apple-darwin22.0)

[-- Attachment #2: Type: text/html, Size: 10769 bytes --]


* Re: bug report : printf %.1s outputting more than 1 character
  2023-03-15  2:38 ` bug report : printf %.1s outputting more than 1 character Jason C. Kwan
@ 2023-03-15  3:46   ` Bart Schaefer
  2023-03-15  4:56     ` Jason C. Kwan
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Schaefer @ 2023-03-15  3:46 UTC (permalink / raw)
  To: Jason C. Kwan; +Cc: zsh-workers

On Tue, Mar 14, 2023 at 7:40 PM Jason C. Kwan <jasonckwan@yahoo.com> wrote:
>
> I'm using the macOS 13.2.1 OS-provided zsh, version 5.8.1, which I understand isn't the latest and greatest of 5.9, so perhaps this bug has already been addressed.

A related case has been addressed by declaring it an intentional
divergence from POSIX, see
https://www.zsh.org/mla/workers/2022/msg00240.html

However ...

> In the 4-byte sequence as seen below ( defined via explicit octal codes ), under no Unicode scenario should 4 bytes be printed out via a command of printf %.1s, by design.
>
>  - The first byte of \377 \xFF is explicitly invalid under UTF-8 (even allowing up to 7-byte in the oldest of definitions).

This triggers a branch of the printf code introduced by this comment:
    /*
     * Invalid/incomplete character at this
     * point.  Assume all the rest are a
     * single byte.  That's about the best we
     * can do.
     */

Thus, you've deliberately invoked a case where zsh's response to
invalid input is to punt.  This dates back to the original
implementation in workers/23098,
https://www.zsh.org/mla/workers/2007/msg00019.html, January 2007.



* Re: bug report : printf %.1s outputting more than 1 character
  2023-03-15  3:46   ` Bart Schaefer
@ 2023-03-15  4:56     ` Jason C. Kwan
  2023-03-15 15:31       ` Bart Schaefer
  0 siblings, 1 reply; 6+ messages in thread
From: Jason C. Kwan @ 2023-03-15  4:56 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers

[-- Attachment #1: Type: text/plain, Size: 7381 bytes --]

Quote:
============
This triggers a branch of the printf code introduced by this comment:
    /*
     * Invalid/incomplete character at this
     * point.  Assume all the rest are a
     * single byte.  That's about the best we
     * can do.
     */
============

Does the following (below the "====" line) behavior look even reasonable at all, regardless of your spec? Because what the spec ends up doing is treating the rest of the input string as 1 byte and printing everything out, even though there are valid code points further down the input string.
The behavior is correct when LC_ALL=C is set, meaning zsh already has the code needed to generate the correct output. My point was that instead of treating the rest of the input string, regardless of size, as 1 byte/character, why not have it behave "as if" LC_ALL=C is in effect whenever it enters this branch:
    if (chars < 0) {
        /*
         * Invalid/incomplete character at this
         * point.  Assume all the rest are a
         * single byte.  That's about the best we
         * can do.
         */
        lchars += lleft;
        lbytes = (ptr - b) + lleft;
        break;
and continue in this mode until a locale-valid character is found, then revert back to multi-byte behavior? Wouldn't that be a more logical behavior?
If that's too complex to implement, then perhaps treat the rest of the input string as a collection of individual bytes instead of just 1 byte?
I just find printf '%.3s' outputting a 179 KB string rather odd.
=========================
 zsh --restricted --no-rcs --nologin --verbose -xtrace -f -c '___=$'\''=\343\276\255#\377\210\234\256A\301B\354\210\264_'\''; command printf "%s" "$___" | gwc -lcm; for __ in {1..16}; do builtin printf "%.${__}s" "$___" | gwc -lcm; done '
___=$'=\343\276\255#\377\210\234\256A\301B\354\210\264_'; command printf "%s" "$___" | gwc -lcm; for __ in {1..16}; do builtin printf "%.${__}s" "$___" | gwc -lcm; done
+zsh:1> ___=$'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> printf %s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=1
+zsh:1> printf %.1s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       1       1
+zsh:1> __=2
+zsh:1> printf %.2s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       2       4
+zsh:1> __=3
+zsh:1> printf %.3s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       3       5
+zsh:1> __=4
+zsh:1> printf %.4s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=5
+zsh:1> printf %.5s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=6
+zsh:1> printf %.6s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=7
+zsh:1> printf %.7s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=8
+zsh:1> printf %.8s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=9
+zsh:1> printf %.9s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=10
+zsh:1> printf %.10s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=11
+zsh:1> printf %.11s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=12
+zsh:1> printf %.12s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=13
+zsh:1> printf %.13s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=14
+zsh:1> printf %.14s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=15
+zsh:1> printf %.15s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=16
+zsh:1> printf %.16s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16

+zsh:1> ___=$'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> LC_ALL=C printf %s $'=㾭#\M-\C-?\M-\C-H\M-\C-\\M-.A\M-AB숴_'
+zsh:1> gwc -lcm
      0       7      16
+zsh:1> __=1
+zsh:1> LC_ALL=C +zsh:1> printf %.1s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       1       1
+zsh:1> __=2
+zsh:1> LC_ALL=C +zsh:1> printf %.2s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       1       2
+zsh:1> __=3
+zsh:1> LC_ALL=C +zsh:1> printf %.3s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       1       3
+zsh:1> __=4
+zsh:1> LC_ALL=C +zsh:1> printf %.4s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       2       4
+zsh:1> __=5
+zsh:1> LC_ALL=C +zsh:1> printf %.5s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       5
+zsh:1> __=6
+zsh:1> LC_ALL=C +zsh:1> printf %.6s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       6
+zsh:1> __=7
+zsh:1> LC_ALL=C +zsh:1> printf %.7s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       7
+zsh:1> __=8
+zsh:1> LC_ALL=C +zsh:1> printf %.8s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       8
+zsh:1> __=9
+zsh:1> LC_ALL=C +zsh:1> printf %.9s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       3       9
+zsh:1> __=10
+zsh:1> LC_ALL=C +zsh:1> printf %.10s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       4      10
+zsh:1> __=11
+zsh:1> LC_ALL=C +zsh:1> printf %.11s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       4      11
+zsh:1> __=12
+zsh:1> LC_ALL=C +zsh:1> printf %.12s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       5      12
+zsh:1> __=13
+zsh:1> LC_ALL=C +zsh:1> printf %.13s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       5      13
+zsh:1> __=14
+zsh:1> LC_ALL=C +zsh:1> printf %.14s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       5      14
+zsh:1> __=15
+zsh:1> LC_ALL=C +zsh:1> printf %.15s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       6      15
+zsh:1> __=16
+zsh:1> LC_ALL=C +zsh:1> printf %.16s '=㾭#????A?B숴_'
+zsh:1> gwc -lcm
      0       7      16




[-- Attachment #2: Type: text/html, Size: 26365 bytes --]


* Re: bug report : printf %.1s outputting more than 1 character
  2023-03-15  4:56     ` Jason C. Kwan
@ 2023-03-15 15:31       ` Bart Schaefer
  2023-03-15 15:50         ` Roman Perepelitsa
  2023-03-18 16:56         ` Peter Stephenson
  0 siblings, 2 replies; 6+ messages in thread
From: Bart Schaefer @ 2023-03-15 15:31 UTC (permalink / raw)
  To: Jason C. Kwan; +Cc: zsh-workers

On Tue, Mar 14, 2023 at 9:56 PM Jason C. Kwan <jasonckwan@yahoo.com> wrote:
>
> does the following ( below the "====" line ) behavior look even reasonable at all, regardless of your spec ? Because what the spec ends up doing is treating the rest of the input string as 1 byte and printing everything out, even though there are valid code points further down the input string.

I'm not the resident expert on multibyte character sets, so I'm just
reporting the situation and waiting for e.g. PWS to respond.  However,
as far as my understanding of the multibyte library goes, once you've
"desynchronized" the input by encountering an invalid byte, you're not
guaranteed that anything further that you see can be correctly
interpreted as a code point.  I agree that it's not ideal to just dump
everything else "raw".



* Re: bug report : printf %.1s outputting more than 1 character
  2023-03-15 15:31       ` Bart Schaefer
@ 2023-03-15 15:50         ` Roman Perepelitsa
  2023-03-18 16:56         ` Peter Stephenson
  1 sibling, 0 replies; 6+ messages in thread
From: Roman Perepelitsa @ 2023-03-15 15:50 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: Jason C. Kwan, zsh-workers

On Wed, Mar 15, 2023 at 4:32 PM Bart Schaefer <schaefer@brasslantern.com> wrote:
>
> On Tue, Mar 14, 2023 at 9:56 PM Jason C. Kwan <jasonckwan@yahoo.com> wrote:
> >
> > does the following ( below the "====" line ) behavior look even reasonable at all, regardless of your spec ? Because what the spec ends up doing is treating the rest of the input string as 1 byte and printing everything out, even though there are valid code points further down the input string.
>
> I'm not the resident expert on multibyte character sets, so I'm just
> reporting the situation and waiting for e.g. PWS to respond.  However,
> as far as my understanding of the multibyte library goes, once you've
> "desynchronized" the input by encountering an invalid byte, you're not
> guaranteed that anything further that you see can be correctly
> interpreted as a code point.  I agree that it's not ideal to just dump
> everything else "raw".

UTF-8 has a nice property that you can jump to an arbitrary byte
position in the stream and quickly find the start of the next
character. A byte is the start of a character if it has the most
significant bit equal to 0 or two most significant bits equal to 1.
This can also be used to recover after an invalid character.
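
The resynchronization rule described above could be sketched in C roughly as follows. This is only an illustration: the function names utf8_is_start and utf8_next_start are hypothetical and do not come from zsh. Note that the rule is a byte-class test, not a validity check, so a byte such as \377 still matches the "two top bits set" pattern even though it can never begin a valid sequence.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Nonzero if b can begin a UTF-8 character under the rule above:
 * top bit 0 (ASCII), or top two bits both 1 (lead byte).
 * Continuation bytes have the form 10xxxxxx and are rejected.
 */
int utf8_is_start(unsigned char b)
{
    return (b & 0x80) == 0x00 || (b & 0xC0) == 0xC0;
}

/*
 * Return the offset of the first start byte strictly after pos,
 * or len if there is none.  Hypothetical helper for illustration.
 */
size_t utf8_next_start(const unsigned char *s, size_t len, size_t pos)
{
    assert(s != NULL);
    size_t i = pos + 1;
    while (i < len && !utf8_is_start(s[i]))
        i++;
    return i < len ? i : len;
}
```

Applied to the bytes \377\210\234\256A starting at offset 0, the scan skips the three continuation-pattern bytes and stops at the A at offset 4.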

Roman.



* Re: bug report : printf %.1s outputting more than 1 character
  2023-03-15 15:31       ` Bart Schaefer
  2023-03-15 15:50         ` Roman Perepelitsa
@ 2023-03-18 16:56         ` Peter Stephenson
  1 sibling, 0 replies; 6+ messages in thread
From: Peter Stephenson @ 2023-03-18 16:56 UTC (permalink / raw)
  To: zsh-workers

On Wed, 2023-03-15 at 08:31 -0700, Bart Schaefer wrote:
> On Tue, Mar 14, 2023 at 9:56 PM Jason C. Kwan <jasonckwan@yahoo.com> wrote:
> > 
> > does the following ( below the "====" line ) behavior look even
> > reasonable at all, regardless of your spec ? Because what the spec ends
> > up doing is treating the rest of the input string as 1 byte and printing
> > everything out, even though there are valid code points further down the
> > input string.
> 
> I'm not the resident expert on multibyte character sets, so I'm just
> reporting the situation and waiting for e.g. PWS to respond.  However,
> as far as my understanding of the multibyte library goes, once you've
> "desynchronized" the input by encountering an invalid byte, you're not
> guaranteed that anything further that you see can be correctly
> interpreted as a code point.  I agree that it's not ideal to just dump
> everything else "raw".

Elsewhere, we mostly treat invalid codes as if they're single octets, so
this is a bit inconsistent.  I think it's really just to try to avoid
overcomplicating %s output.  However, it would probably be more
consistent just to treat everything that doesn't make sense as single
bytes until we get back on track.  There doesn't seem to be any point in
doing anything different with incomplete characters here, either ---
we've already got all the characters we're going to get.  Something like
this, but feel free to tweak further --- I don't have any motivation to
do so myself.

This is probably good enough for the obvious simple case of "just
output the next thing you see whatever the heck it looks like".

pws

diff --git a/Src/builtin.c b/Src/builtin.c
index 70a950666..9719d26d1 100644
--- a/Src/builtin.c
+++ b/Src/builtin.c
@@ -5222,20 +5222,21 @@ bin_print(char *name, char **args, Options ops, int func)
 #ifdef MULTIBYTE_SUPPORT
 			if (isset(MULTIBYTE)) {
 			    chars = mbrlen(ptr, lleft, &mbs);
-			    if (chars < 0) {
-				/*
-				 * Invalid/incomplete character at this
-				 * point.  Assume all the rest are a
-				 * single byte.  That's about the best we
-				 * can do.
-				 */
-				lchars += lleft;
-				lbytes = (ptr - b) + lleft;
-				break;
-			    } else if (chars == 0) {
-				/* NUL, handle as real character */
+			    /*
+			     * chars <= 0 means one of
+			     *
+			     * 0: NUL, handle as real character
+			     *
+			     * -1: MB_INVALID: Assume this is
+			     *     a single character as we do
+			     *     elsewhere in the code.
+			     *
+			     * -2: MB_INCOMPLETE: We're not waiting
+			     *     for input on this occasion, so
+			     *     just treat this as invalid.
+			     */
+			    if (chars <= 0)
 				chars = 1;
-			    }
 			}
 			else	/* use the non-multibyte code below */
 #endif
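
For readers who want to try the patched behaviour outside of zsh, here is a minimal standalone sketch of the same idea: walk the string with mbrlen() and count any NUL, invalid or incomplete sequence as one single-byte character. The function name precision_bytes and its signature are assumptions made for this example and are not zsh code. Resetting the mbstate_t after an error is also an addition of this sketch, on the assumption that the conversion state should not be reused once mbrlen() has failed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: return how many bytes of s (length len)
 * make up its first prec characters in the current locale.
 * NUL, invalid and incomplete sequences each count as one
 * single-byte character, as in the patch above.
 */
size_t precision_bytes(const char *s, size_t len, size_t prec)
{
    assert(s != NULL);
    mbstate_t mbs;
    size_t off = 0;

    memset(&mbs, 0, sizeof mbs);
    while (prec > 0 && off < len) {
        size_t n = mbrlen(s + off, len - off, &mbs);
        if (n == 0 || n == (size_t)-1 || n == (size_t)-2) {
            /* NUL, invalid or incomplete: consume one byte */
            n = 1;
            /* start the next character from a clean state */
            memset(&mbs, 0, sizeof mbs);
        }
        off += n;
        prec--;
    }
    return off;
}
```

A caller would normally run setlocale(LC_ALL, "") first so that mbrlen() follows the user's locale. With this approach, a precision of 1 applied to \377\210\234\256 consumes one byte rather than all four.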



