mailing list of musl libc
From: Konstantin Isakov <dragonroot@gmail.com>
To: Rich Felker <dalias@libc.org>
Cc: musl@lists.openwall.com
Subject: Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly
Date: Mon, 24 May 2021 20:46:01 -0400
Message-ID: <CAMOBWkPtuijGJK7nrajh4F0kzR4bO6dURn_ZQ_QGBUnxL04j7A@mail.gmail.com>
In-Reply-To: <20210525003040.GE2546@brightrain.aerifal.cx>

Is swprintf() a form of fwprintf() though? fwprintf() and wprintf() output
to single-byte streams, so the conversion is necessary there, while
swprintf() outputs to a wide buffer. Performing double conversion (to
single chars and back) seems like unnecessary work in that case (though, of
course, it's less work to implement swprintf() like that).
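
As a rough illustration of the double conversion described above, here is a
minimal sketch using the standard wcrtomb() and mbrtowc() calls. It is not
musl's actual code; round_trip() is a hypothetical helper, and whether a
given character survives depends on the LC_CTYPE locale in effect.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: push one wide character through a multibyte
 * encoding and back, the way a byte-oriented buffer would require.
 * Returns 0 and stores the result in *out on success, or -1 if either
 * conversion fails in the current LC_CTYPE locale. */
static int round_trip(wchar_t in, wchar_t *out)
{
    char bytes[MB_LEN_MAX];
    mbstate_t st;

    /* Step 1: wide character -> multibyte sequence. */
    memset(&st, 0, sizeof st);
    size_t n = wcrtomb(bytes, in, &st);
    if (n == (size_t)-1)
        return -1;              /* not representable in this locale */

    /* Step 2: multibyte sequence -> wide character. */
    memset(&st, 0, sizeof st);
    size_t m = mbrtowc(out, bytes, n, &st);
    if (m == (size_t)-1 || m == (size_t)-2)
        return -1;              /* invalid or incomplete sequence */

    return 0;
}
```

An ASCII character such as L'a' round-trips in any locale. A character such
as L'\u00E1' can fail at step 1 when the locale's encoding cannot represent
it, which is where an encoding error would come from.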

On Mon, May 24, 2021 at 8:30 PM Rich Felker <dalias@libc.org> wrote:

> On Mon, May 24, 2021 at 08:04:04PM -0400, Konstantin Isakov wrote:
> > Thanks for replying!
> >
> > That fixed it.
> >
> > I'm surprised, however, that this is required given that in this case
> > swprintf() operates on wchars exclusively -- taking wchar arguments and
> > producing wchar output. I'd expect that in the worst case scenario it
> > would have to convert from single chars to wide chars, but never the
> > other way around, so the representation requirement seems strange. That
> > setlocale() step also doesn't seem to be needed with glibc.
>
> Yes, it's not clear to me whether the glibc behavior is conforming or
> not. As specified,
>
>   In addition, all forms of fwprintf() shall fail if:
>
>   [EILSEQ]
>     A wide-character code that does not correspond
>     to a valid character has been detected.
>
>   ...
>
> The "has been detected" wording may allow for the possibility of
> ignoring the error, as glibc does, if the function is implemented such
> that no conversion takes place (or, for fwprintf, such that conversion
> is deferred until flush time) and thus no "detection" takes place. But
> it's wrong to assume the operation will succeed.
>
> In musl, there is no separate wide stdio buffering mode; conversion to
> a multibyte sequence happens at (logical) fputwc time, and in the case
> of swprintf, conversion (in this case, conversion back) to a wchar_t[]
> string occurs at flush time.
>
> Rich
>
>
>
>
> > On Mon, May 24, 2021 at 5:50 PM Rich Felker <dalias@libc.org> wrote:
> >
> > > On Mon, May 24, 2021 at 12:39:35AM -0400, Konstantin Isakov wrote:
> > > > Hi,
> > > >
> > > > The following program:
> > > >
> > > > ===================================
> > > > #include <stdio.h>
> > > > #include <wchar.h>
> > > >
> > > > int main()
> > > > {
> > > >   wchar_t buf[ 32 ];
> > > >
> > > >   swprintf( buf, sizeof( buf ) / sizeof( *buf ), L"ab\u00E1c" );
> > > >
> > > >   for ( wchar_t * p = buf; *p; ++p )
> > > >     printf( "%u\n", ( unsigned ) *p );
> > > >
> > > >   return 0;
> > > > }
> > > > ===================================
> > > >
> > > > With musl 1.2.2 produces the following output:
> > > > 97
> > > > 98
> > > >
> > > > The expected output is:
> > > > 97
> > > > 98
> > > > 225
> > > > 99
> > > >
> > > > With musl, only the first two characters ('a' and 'b') are
> > > > processed, and the string ends on a Unicode character (U+00E1,
> > > > which is an 'a' with acute accent), instead of outputting it and
> > > > the last character, 'c'.
> > > >
> > > > Please CC me when replying. Thanks!
> > >
> > > You need to call setlocale(LC_CTYPE, ""). Otherwise the character
> > > \u00e1 is unrepresentable, because POSIX requires the C locale be
> > > single-byte and you're in the C locale until you call setlocale, and
> > > thus produces an encoding error (EILSEQ).
> > >
> > > Rich
> > >
>
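
For reference, the fix suggested in the quoted thread can be sketched as
follows. This is only an illustration under stated assumptions:
format_sample() is a hypothetical wrapper around the reproducer's swprintf()
call, and the environment is assumed to name a locale that can encode U+00E1
(for example a UTF-8 locale).

```c
#include <locale.h>
#include <wchar.h>

/* Hypothetical wrapper: select the environment's LC_CTYPE locale, then
 * format the sample string and report failure instead of assuming
 * success. Returns the number of wide characters written (excluding
 * the terminator), or -1 on error. */
static int format_sample(wchar_t *buf, size_t n)
{
    /* Leave the default "C" locale. If the request cannot be honored,
     * setlocale() returns NULL and the locale is left unchanged. */
    setlocale(LC_CTYPE, "");

    /* swprintf() returns a negative value on failure, including an
     * encoding error (EILSEQ) or a buffer that is too small. */
    int r = swprintf(buf, n, L"ab\u00E1c");
    return r < 0 ? -1 : r;
}
```

Checking the return value matters here: without it, a failed call leaves a
truncated string in the buffer that looks like valid output.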


Thread overview: 7+ messages
2021-05-24  4:39 Konstantin Isakov
2021-05-24 21:50 ` Rich Felker
2021-05-25  0:04   ` Konstantin Isakov
2021-05-25  0:30     ` Rich Felker
2021-05-25  0:46       ` Konstantin Isakov [this message]
2021-05-25  1:09         ` Rich Felker
2021-05-25  1:58           ` Konstantin Isakov
