Date: Mon, 24 May 2021 20:30:41 -0400
From: Rich Felker <dalias@libc.org>
To: Konstantin Isakov <dragonroot@gmail.com>
Cc: musl@lists.openwall.com
Message-ID: <20210525003040.GE2546@brightrain.aerifal.cx>
References: <CAMOBWkOsdWXqva-5LVYo76jcBEr8TkE6V-O69XRw0ruhn5Fd_g@mail.gmail.com>
 <20210524215021.GC2546@brightrain.aerifal.cx>
 <CAMOBWkPjcqSnPHCMdFdfyyXD-ovKXg0k6014mENcgeHJ+O-akg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAMOBWkPjcqSnPHCMdFdfyyXD-ovKXg0k6014mENcgeHJ+O-akg@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] [BUG] swprintf() doesn't handle Unicode characters
 correctly

On Mon, May 24, 2021 at 08:04:04PM -0400, Konstantin Isakov wrote:
> Thanks for replying!
> 
> That fixed it.
> 
> I'm surprised, however, that this is required given that in this case
> swprintf() operates on wchars exclusively -- taking wchar arguments and
> producing wchar output. I'd expect that in the worst case scenario it would
> have to convert from single chars to wide chars, but never the other way
> around, so the representation requirement seems strange. That setlocale()
> step also doesn't seem to be needed with glibc.

Yes, it's not clear to me whether the glibc behavior is conforming or
not. As specified in POSIX:

  In addition, all forms of fwprintf() shall fail if:

  [EILSEQ]
    A wide-character code that does not correspond
    to a valid character has been detected.

  ...

The "has been detected" wording may allow for the possibility of
ignoring the error, as glibc does, if the function is implemented such
that no conversion takes place (or, for fwprintf, such that conversion
is deferred until flush time) and thus no "detection" takes place. But
it's wrong to assume the operation will succeed.

In musl, there is no separate wide stdio buffering mode; conversion to
a multibyte sequence happens at (logical) fputwc time, and in the case
of swprintf, conversion (in this case, conversion back) to a wchar_t[]
string occurs at flush time.

Rich


> On Mon, May 24, 2021 at 5:50 PM Rich Felker <dalias@libc.org> wrote:
> 
> > On Mon, May 24, 2021 at 12:39:35AM -0400, Konstantin Isakov wrote:
> > > Hi,
> > >
> > > The following program:
> > >
> > > ===================================
> > > #include <stdio.h>
> > > #include <wchar.h>
> > >
> > > int main()
> > > {
> > >   wchar_t buf[ 32 ];
> > >
> > >   swprintf( buf, sizeof( buf ) / sizeof( *buf ), L"ab\u00E1c" );
> > >
> > >   for ( wchar_t * p = buf; *p; ++p )
> > >     printf( "%u\n", ( unsigned ) *p );
> > >
> > >   return 0;
> > > }
> > > ===================================
> > >
> > > With musl 1.2.2 produces the following output:
> > > 97
> > > 98
> > >
> > > The expected output is:
> > > 97
> > > 98
> > > 225
> > > 99
> > >
> > > With musl, only the first two characters ('a' and 'b') are processed, and
> > > the string ends on a Unicode character (U+00E1, which is an 'a' with acute
> > > accent), instead of outputting it and the last character, 'c'.
> > >
> > > Please CC me when replying. Thanks!
> >
> > You need to call setlocale(LC_CTYPE, ""). Otherwise the character
> > \u00e1 is unrepresentable, because POSIX requires the C locale to be
> > single-byte and you're in the C locale until you call setlocale, so
> > converting it produces an encoding error (EILSEQ).
> >
> > Rich
> >