From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
Reply-To: musl@lists.openwall.com
MIME-Version: 1.0
References: <20210524215021.GC2546@brightrain.aerifal.cx>
In-Reply-To: <20210524215021.GC2546@brightrain.aerifal.cx>
From: Konstantin Isakov
Date: Mon, 24 May 2021 20:04:04 -0400
To: Rich Felker
Cc: musl@lists.openwall.com
Content-Type: multipart/alternative; boundary="00000000000042c40005c31c4420"
Subject: Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly

--00000000000042c40005c31c4420
Content-Type: text/plain; charset="UTF-8"

Thanks for replying!

That fixed it.

I'm surprised, however, that this is required given that in this case
swprintf() operates on wchars exclusively -- taking wchar arguments and
producing wchar output. I'd expect that in the worst case scenario it
would have to convert from single chars to wide chars, but never the
other way around, so the representation requirement seems strange. That
setlocale() step also doesn't seem to be needed with glibc.

On Mon, May 24, 2021 at 5:50 PM Rich Felker <dalias@libc.org> wrote:

> On Mon, May 24, 2021 at 12:39:35AM -0400, Konstantin Isakov wrote:
> > Hi,
> >
> > The following program:
> >
> > ===================================
> > #include <stdio.h>
> > #include <wchar.h>
> >
> > int main()
> > {
> >   wchar_t buf[ 32 ];
> >
> >   swprintf( buf, sizeof( buf ) / sizeof( *buf ), L"ab\u00E1c" );
> >
> >   for ( wchar_t * p = buf; *p; ++p )
> >     printf( "%u\n", ( unsigned ) *p );
> >
> >   return 0;
> > }
> > ===================================
> >
> > With musl 1.2.2 produces the following output:
> > 97
> > 98
> >
> > The expected output is:
> > 97
> > 98
> > 225
> > 99
> >
> > With musl, only the first two characters ('a' and 'b') are processed, and
> > the string ends on a Unicode character (U+00E1, which is an 'a' with acute
> > accent), instead of outputting it and the last character, 'c'.
> >
> > Please CC me when replying. Thanks!
>
> You need to call setlocale(LC_CTYPE, ""). Otherwise the character
> \u00e1 is unrepresentable, because POSIX requires the C locale be
> single-byte and you're in the C locale until you call setlocale, and
> thus produces an encoding error (EILSEQ).
>
> Rich
>

--00000000000042c40005c31c4420
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Thanks for replying!

That fixed it.

I'm surprised, however, that this is required given that in this case swprintf() operates on wchars exclusively -- taking wchar arguments and producing wchar output. I'd expect that in the worst case scenario it would have to convert from single chars to wide chars, but never the other way around, so the representation requirement seems strange. That setlocale() step also doesn't seem to be needed with glibc.
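
For reference, here is a minimal sketch of the adjusted test, assuming the only change needed is the setlocale(LC_CTYPE, "") call suggested in the reply. The helper names format_sample() and print_sample() are made up for illustration and are not part of the original program; the result also depends on the environment actually selecting a locale that can represent U+00E1 (for example a UTF-8 one).

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper (not from the thread): formats the test string
 * into buf and returns swprintf()'s result, i.e. the number of wide
 * characters written, or a negative value on error. */
int format_sample(wchar_t *buf, size_t len)
{
    return swprintf(buf, len, L"ab\u00E1c");
}

/* Hypothetical helper: switches LC_CTYPE from the default "C" locale
 * to the one named by the environment, then formats the string and
 * prints each wide character's code. Returns 0 on success, 1 if
 * swprintf() reported an error. */
int print_sample(void)
{
    wchar_t buf[32];

    /* Assumption: the environment names a locale that can represent
     * U+00E1; the return value is not checked in this sketch. */
    setlocale(LC_CTYPE, "");

    if (format_sample(buf, sizeof buf / sizeof *buf) < 0) {
        perror("swprintf");
        return 1;
    }

    for (const wchar_t *p = buf; *p; ++p)
        printf("%u\n", (unsigned)*p);

    return 0;
}
```

Calling print_sample() from main() should then print the four values the original report expected (97, 98, 225, 99). Checking the return value of swprintf() also makes the failure visible instead of leaving a silently truncated buffer.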

On Mon, May 24, 2021 at 5:50 PM Rich Felker <dalias@libc.org> wrote:
On Mon, May 24, 2021 at 12:39:35AM -0400, Konstantin Isakov wrote:
> Hi,
>
> The following program:
>
> ===================================
> #include <stdio.h>
> #include <wchar.h>
>
> int main()
> {
>   wchar_t buf[ 32 ];
>
>   swprintf( buf, sizeof( buf ) / sizeof( *buf ), L"ab\u00E1c" );
>
>   for ( wchar_t * p = buf; *p; ++p )
>     printf( "%u\n", ( unsigned ) *p );
>
>   return 0;
> }
> ===================================
>
> With musl 1.2.2 produces the following output:
> 97
> 98
>
> The expected output is:
> 97
> 98
> 225
> 99
>
> With musl, only the first two characters ('a' and 'b') are processed, and
> the string ends on a Unicode character (U+00E1, which is an 'a' with acute
> accent), instead of outputting it and the last character, 'c'.
>
> Please CC me when replying. Thanks!

You need to call setlocale(LC_CTYPE, ""). Otherwise the character
\u00e1 is unrepresentable, because POSIX requires the C locale be
single-byte and you're in the C locale until you call setlocale, and thus produces an encoding error (EILSEQ).

Rich
--00000000000042c40005c31c4420--