From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (qmail 8969 invoked from network); 25 May 2021 00:46:27 -0000
Received: from mother.openwall.net (195.42.179.200)
  by inbox.vuxu.org with ESMTPUTF8; 25 May 2021 00:46:27 -0000
Received: (qmail 32008 invoked by uid 550); 25 May 2021 00:46:25 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 31989 invoked from network); 25 May 2021 00:46:24 -0000
MIME-Version: 1.0
References: <CAMOBWkOsdWXqva-5LVYo76jcBEr8TkE6V-O69XRw0ruhn5Fd_g@mail.gmail.com>
 <20210524215021.GC2546@brightrain.aerifal.cx> <CAMOBWkPjcqSnPHCMdFdfyyXD-ovKXg0k6014mENcgeHJ+O-akg@mail.gmail.com>
 <20210525003040.GE2546@brightrain.aerifal.cx>
In-Reply-To: <20210525003040.GE2546@brightrain.aerifal.cx>
From: Konstantin Isakov <dragonroot@gmail.com>
Date: Mon, 24 May 2021 20:46:01 -0400
Message-ID: <CAMOBWkPtuijGJK7nrajh4F0kzR4bO6dURn_ZQ_QGBUnxL04j7A@mail.gmail.com>
To: Rich Felker <dalias@libc.org>
Cc: musl@lists.openwall.com
Content-Type: multipart/alternative; boundary="0000000000004c88ce05c31cdab2"
Subject: Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly

--0000000000004c88ce05c31cdab2
Content-Type: text/plain; charset="UTF-8"

Is swprintf() really a form of fwprintf(), though? fwprintf() and wprintf()
write to streams that ultimately hold bytes, so converting wide characters
to multibyte sequences is necessary there, whereas swprintf() writes to a
wide-character buffer. Converting twice (wide to multibyte and back) seems
like unnecessary work in that case, although of course it is less work to
implement swprintf() that way.
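
For the archive, here is a minimal sketch of how a caller can cope with the
current behaviour. It assumes the environment selects a UTF-8 locale;
use_environment_ctype() and format_sample() are hypothetical helper names of
mine, not anything provided by musl or glibc.

```c
#include <errno.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: switch LC_CTYPE from the initial "C" locale to
 * whatever the environment specifies. Returns 1 on success, 0 on failure.
 */
int use_environment_ctype(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/*
 * Hypothetical helper: format the test string from the original report.
 * Returns the number of wide characters written, or a negative value on
 * failure (errno is left as set by swprintf()). On failure the buffer is
 * emptied so that a partially written string is never used by mistake.
 */
int format_sample(wchar_t *buf, size_t n)
{
    errno = 0;
    int r = swprintf(buf, n, L"ab\u00E1c");
    if (r < 0 && n > 0)
        buf[0] = L'\0';
    return r;
}
```

A caller would run use_environment_ctype() once at startup and then treat a
negative return from format_sample() as a failure. With musl in the C locale
I would expect errno to be EILSEQ at that point, as Rich describes below.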

On Mon, May 24, 2021 at 8:30 PM Rich Felker <dalias@libc.org> wrote:

> On Mon, May 24, 2021 at 08:04:04PM -0400, Konstantin Isakov wrote:
> > Thanks for replying!
> >
> > That fixed it.
> >
> > I'm surprised, however, that this is required given that in this case
> > swprintf() operates on wchars exclusively -- taking wchar arguments and
> > producing wchar output. I'd expect that in the worst case scenario it would
> > have to convert from single chars to wide chars, but never the other way
> > around, so the representation requirement seems strange. That setlocale()
> > step also doesn't seem to be needed with glibc.
>
> Yes, it's not clear to me whether the glibc behavior is conforming or
> not. As specified,
>
>   In addition, all forms of fwprintf() shall fail if:
>
>   [EILSEQ]
>     A wide-character code that does not correspond
>     to a valid character has been detected.
>
>   ...
>
> The "has been detected" wording may allow for the possibility of
> ignoring the error, as glibc does, if the function is implemented such
> that no conversion takes place (or, for fwprintf, such that conversion
> is deferred until flush time) and thus no "detection" takes place. But
> it's wrong to assume the operation will succeed.
>
> In musl, there is no separate wide stdio buffering mode; conversion to
> a multibyte sequence happens at (logical) fputwc time, and in the case
> of swprintf, conversion (in this case, conversion back) to a wchar_t[]
> string occurs at flush time.
>
> Rich
>
>
>
>
> > On Mon, May 24, 2021 at 5:50 PM Rich Felker <dalias@libc.org> wrote:
> >
> > > On Mon, May 24, 2021 at 12:39:35AM -0400, Konstantin Isakov wrote:
> > > > Hi,
> > > >
> > > > The following program:
> > > >
> > > > ===================================
> > > > #include <stdio.h>
> > > > #include <wchar.h>
> > > >
> > > > int main()
> > > > {
> > > >   wchar_t buf[ 32 ];
> > > >
> > > >   swprintf( buf, sizeof( buf ) / sizeof( *buf ), L"ab\u00E1c" );
> > > >
> > > >   for ( wchar_t * p = buf; *p; ++p )
> > > >     printf( "%u\n", ( unsigned ) *p );
> > > >
> > > >   return 0;
> > > > }
> > > > ===================================
> > > >
> > > > With musl 1.2.2, it produces the following output:
> > > > 97
> > > > 98
> > > >
> > > > The expected output is:
> > > > 97
> > > > 98
> > > > 225
> > > > 99
> > > >
> > > > With musl, only the first two characters ('a' and 'b') are processed, and
> > > > the string ends on a Unicode character (U+00E1, which is an 'a' with acute
> > > > accent), instead of outputting it and the last character, 'c'.
> > > >
> > > > Please CC me when replying. Thanks!
> > >
> > > You need to call setlocale(LC_CTYPE, ""). Otherwise the character
> > > \u00e1 is unrepresentable, because POSIX requires the C locale to be
> > > single-byte and you're in the C locale until you call setlocale, so
> > > converting it produces an encoding error (EILSEQ).
> > >
> > > Rich
> > >
>

--0000000000004c88ce05c31cdab2--