From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.4 required=5.0 tests=DKIM_SIGNED,DKIM_VALID,
	DKIM_VALID_AU,FREEMAIL_FROM,HTML_MESSAGE,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL autolearn=ham
	autolearn_force=no version=3.4.4
Received: (qmail 4326 invoked from network); 25 May 2021 00:04:30 -0000
Received: from mother.openwall.net (195.42.179.200)
  by inbox.vuxu.org with ESMTPUTF8; 25 May 2021 00:04:30 -0000
Received: (qmail 15602 invoked by uid 550); 25 May 2021 00:04:28 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 15583 invoked from network); 25 May 2021 00:04:27 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=2JUusHAR95H6CM06poJhfRK6vuOIp7whEJOgjETU8vg=;
        b=DSnW4SoyhjNHZeR8yCocL08TQeD6zyyumjsixiFNmUaX5qVijIENIDVKGm6oyJI4zY
         mvqjzB92lz21pugUNfAYN4ZWNUvmDDWpFhJXBG9GiJDX/ZNL9gclXN+jJ+Myar1d5P0Z
         imgkMJLb/07yZDU+le+8fuw/mzWBfoGdVFahQgJ/OWpROugqR/r8pdKSMtpTm/gy6tTZ
         8n7fFEmUpwo44QF6NSqg8UScguwGAbRf0SEch3F4pP6EAfydR4BIRVMnXKhyygb0mDNn
         gy7739eiGCQztE+r584aKmIlIWWblyMVF96DOPCkWyZle37DhOawC1LObZ+KHM310ngX
         KT7w==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=2JUusHAR95H6CM06poJhfRK6vuOIp7whEJOgjETU8vg=;
        b=Ncl9+vdS31g9RBFqxmWklIJg14AOYx4EMI35x/uI/30bRv7cSuU0AaJVnppZPAWI+A
         CPF7479pfyyplq0h8nvzCjHhyH7L27JJeFSGly2YE8rjUpMrEUfE0eX3OA15DAKtOnvt
         xiaIhSO4UY8MHCx6P83q0x5IBGd3a0a2Ub7lCsDieN9sCUdOF12yKX2Rp2I2Cr70dUj6
         YwMM3GznRq/1Jvtco8W0h2X85MbkgUNk4QRINkSQ1eN6xpHf3x9+E48mNMuknzcKHoKd
         BLq1DL6i7AG4ADpOUWVs+MmHb5o+U5PGNwEjcFIDdeV4VpLxsWMMYOGJHWEqGyTo3ong
         ooGg==
X-Gm-Message-State: AOAM533m1oGu7IJVRlA2iALvHGRCVb+IRheWSIGO+j8v6HOM78CfFKaX
	z0BCTN6NtZ6QL8fqSI9isbYI28NLs1l+2xZCcfU=
X-Google-Smtp-Source: ABdhPJxKa4S6tg3zZGoacZNMqVWo+rngdqJI8X6WvWVsUeBXesxYDlNDqJUWWy8LpUGtuutoI1WDg0snVOUgZAvLjgw=
X-Received: by 2002:a17:902:728c:b029:f6:6aff:4d66 with SMTP id
 d12-20020a170902728cb02900f66aff4d66mr23072470pll.20.1621901055278; Mon, 24
 May 2021 17:04:15 -0700 (PDT)
MIME-Version: 1.0
References: <CAMOBWkOsdWXqva-5LVYo76jcBEr8TkE6V-O69XRw0ruhn5Fd_g@mail.gmail.com>
 <20210524215021.GC2546@brightrain.aerifal.cx>
In-Reply-To: <20210524215021.GC2546@brightrain.aerifal.cx>
From: Konstantin Isakov <dragonroot@gmail.com>
Date: Mon, 24 May 2021 20:04:04 -0400
Message-ID: <CAMOBWkPjcqSnPHCMdFdfyyXD-ovKXg0k6014mENcgeHJ+O-akg@mail.gmail.com>
To: Rich Felker <dalias@libc.org>
Cc: musl@lists.openwall.com
Content-Type: multipart/alternative; boundary="00000000000042c40005c31c4420"
Subject: Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly

--00000000000042c40005c31c4420
Content-Type: text/plain; charset="UTF-8"

Thanks for replying!

That fixed it.

I'm surprised, however, that this is required, given that in this case
swprintf() operates on wide characters exclusively: it takes wchar_t
arguments and produces wchar_t output. In the worst case I'd expect it to
convert from narrow characters to wide ones, but never the other way
around, so the representability requirement seems strange. The setlocale()
call also doesn't seem to be needed with glibc.
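
For reference, here is a minimal sketch of the call sequence after the
fix. The helper name format_sample() is hypothetical and not part of the
original program, and the sketch assumes the environment selects a locale
in which U+00E1 is representable (for example a UTF-8 one).

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: select the environment's LC_CTYPE locale, then
 * format the test string from the original report into buf.
 * Returns swprintf()'s result: the number of wide characters written
 * (excluding the terminator), or a negative value on error.
 */
static int format_sample(wchar_t *buf, size_t n)
{
    assert(buf != NULL && n > 0);

    /* Leave the default "C" locale; "" means "take it from the environment". */
    setlocale(LC_CTYPE, "");

    return swprintf(buf, n, L"ab\u00E1c");
}
```

Checking that format_sample(buf, 32) returns 4 rather than a negative
value is a quick way to tell whether formatting succeeded. The original
test program ignored swprintf()'s return value, so any error it reported
went unseen.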

On Mon, May 24, 2021 at 5:50 PM Rich Felker <dalias@libc.org> wrote:

> On Mon, May 24, 2021 at 12:39:35AM -0400, Konstantin Isakov wrote:
> > Hi,
> >
> > The following program:
> >
> > ===================================
> > #include <stdio.h>
> > #include <wchar.h>
> >
> > int main()
> > {
> >   wchar_t buf[ 32 ];
> >
> >   swprintf( buf, sizeof( buf ) / sizeof( *buf ), L"ab\u00E1c" );
> >
> >   for ( wchar_t * p = buf; *p; ++p )
> >     printf( "%u\n", ( unsigned ) *p );
> >
> >   return 0;
> > }
> > ===================================
> >
> > With musl 1.2.2 produces the following output:
> > 97
> > 98
> >
> > The expected output is:
> > 97
> > 98
> > 225
> > 99
> >
> > With musl, only the first two characters ('a' and 'b') are processed, and
> > the string ends on a Unicode character (U+00E1, which is an 'a' with
> > acute accent), instead of outputting it and the last character, 'c'.
> >
> > Please CC me when replying. Thanks!
>
> You need to call setlocale(LC_CTYPE, ""). Otherwise the character
> \u00e1 is unrepresentable, because POSIX requires the C locale be
> single-byte and you're in the C locale until you call setlocale, and
> thus produces an encoding error (EILSEQ).
>
> Rich
>
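
To illustrate the representability point, here is a small sketch that asks
the C library whether a wide character can be encoded in the current
LC_CTYPE locale. The helper name representable() is hypothetical. It uses
wcrtomb(), which reports EILSEQ for characters it cannot encode. Whether
swprintf() goes through the same conversion internally is an assumption
here, not something confirmed in this thread.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: returns 1 if wc can be encoded as a multibyte
 * character in the current LC_CTYPE locale, 0 if the conversion fails
 * with EILSEQ, and -1 on any other failure.
 */
static int representable(wchar_t wc)
{
    char out[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof st);   /* start from the initial conversion state */
    errno = 0;

    if (wcrtomb(out, wc, &st) != (size_t)-1)
        return 1;

    return errno == EILSEQ ? 0 : -1;
}
```

Before any setlocale() call the program is in the C locale, where a
character such as U+4E2D cannot be encoded, so representable() returns 0
for it. After setlocale(LC_CTYPE, "") selects a UTF-8 locale, it would be
expected to return 1.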

--00000000000042c40005c31c4420--