mailing list of musl libc
* [musl] [BUG] swprintf() doesn't handle Unicode characters correctly
@ 2021-05-24  4:39 Konstantin Isakov
  2021-05-24 21:50 ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Isakov @ 2021-05-24  4:39 UTC (permalink / raw)
  To: musl

Hi,

The following program:

===================================
#include <stdio.h>
#include <wchar.h>

int main()
{
  wchar_t buf[ 32 ];

  swprintf( buf, sizeof( buf ) / sizeof( *buf ), L"ab\u00E1c" );

  for ( wchar_t * p = buf; *p; ++p )
    printf( "%u\n", ( unsigned ) *p );

  return 0;
}
===================================

With musl 1.2.2, it produces the following output:
97
98

The expected output is:
97
98
225
99

With musl, only the first two characters ('a' and 'b') are written; the
string is terminated at the non-ASCII character (U+00E1, an 'a' with acute
accent) instead of including it and the last character, 'c'.

Please CC me when replying. Thanks!


* Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly
  2021-05-24  4:39 [musl] [BUG] swprintf() doesn't handle Unicode characters correctly Konstantin Isakov
@ 2021-05-24 21:50 ` Rich Felker
  2021-05-25  0:04   ` Konstantin Isakov
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2021-05-24 21:50 UTC (permalink / raw)
  To: Konstantin Isakov; +Cc: musl

You need to call setlocale(LC_CTYPE, "") first. Until you call
setlocale you're in the C locale, which POSIX requires to be
single-byte, so the character \u00e1 is unrepresentable and swprintf
fails with an encoding error (EILSEQ).

Rich

* Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly
  2021-05-24 21:50 ` Rich Felker
@ 2021-05-25  0:04   ` Konstantin Isakov
  2021-05-25  0:30     ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Isakov @ 2021-05-25  0:04 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

Thanks for replying!

That fixed it.

I'm surprised, however, that this is required given that in this case
swprintf() operates on wchars exclusively -- taking wchar arguments and
producing wchar output. I'd expect that in the worst case scenario it would
have to convert from single chars to wide chars, but never the other way
around, so the representation requirement seems strange. That setlocale()
step also doesn't seem to be needed with glibc.


* Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly
  2021-05-25  0:04   ` Konstantin Isakov
@ 2021-05-25  0:30     ` Rich Felker
  2021-05-25  0:46       ` Konstantin Isakov
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2021-05-25  0:30 UTC (permalink / raw)
  To: Konstantin Isakov; +Cc: musl

On Mon, May 24, 2021 at 08:04:04PM -0400, Konstantin Isakov wrote:
> Thanks for replying!
> 
> That fixed it.
> 
> I'm surprised, however, that this is required given that in this case
> swprintf() operates on wchars exclusively -- taking wchar arguments and
> producing wchar output. I'd expect that in the worst case scenario it would
> have to convert from single chars to wide chars, but never the other way
> around, so the representation requirement seems strange. That setlocale()
> step also doesn't seem to be needed with glibc.

Yes, it's not clear to me whether the glibc behavior is conforming or
not. As specified,

  In addition, all forms of fwprintf() shall fail if:

  [EILSEQ]
    A wide-character code that does not correspond
    to a valid character has been detected.

  ...

The "has been detected" wording may allow for the possibility of
ignoring the error, as glibc does, if the function is implemented such
that no conversion takes place (or, for fwprintf, such that conversion
is deferred until flush time) and thus no "detection" takes place. But
it's wrong to assume the operation will succeed.

In musl, there is no separate wide stdio buffering mode; conversion to
a multibyte sequence happens at (logical) fputwc time, and in the case
of swprintf, conversion (in this case, conversion back) to a wchar_t[]
string occurs at flush time.

Rich





* Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly
  2021-05-25  0:30     ` Rich Felker
@ 2021-05-25  0:46       ` Konstantin Isakov
  2021-05-25  1:09         ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: Konstantin Isakov @ 2021-05-25  0:46 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

Is swprintf() a form of fwprintf() though? fwprintf() and wprintf() output
to single-byte streams, so the conversion is necessary there, while
swprintf() outputs to a wide buffer. Performing double conversion (to
single chars and back) seems like unnecessary work in that case (though, of
course, it's less work to implement swprintf() like that).


* Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly
  2021-05-25  0:46       ` Konstantin Isakov
@ 2021-05-25  1:09         ` Rich Felker
  2021-05-25  1:58           ` Konstantin Isakov
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2021-05-25  1:09 UTC (permalink / raw)
  To: Konstantin Isakov; +Cc: musl

On Mon, May 24, 2021 at 08:46:01PM -0400, Konstantin Isakov wrote:
> Is swprintf() a form of fwprintf() though?

As specified, it is. All three functions are covered together under
https://pubs.opengroup.org/onlinepubs/9699919799/functions/swprintf.html

and "all forms" stands in contrast to just "fwprintf() and wprintf()"
(the other two of the three), which the spec mentions just above as
failing for any of the fputwc reasons (which would already cover EILSEQ
anyway).

> fwprintf() and wprintf() output
> to single-byte streams, so the conversion is necessary there, while
> swprintf() outputs to a wide buffer. Performing double conversion (to
> single chars and back) seems like unnecessary work in that case (though, of
> course, it's less work to implement swprintf() like that).

It's what gives consistent behavior, and it's what you get
automatically if you don't want either a completely independent
implementation of swprintf (that behaves surprisingly unlike fwprintf)
or the wide-mode buffering glibc does.

(Note: the original reason glibc did separate wide-mode buffering was
that gconv is very slow for individual character conversions and was
designed only for bulk conversion calls, which would happen at flush
time. Making individual conversions fast was one of the original
design goals of musl before there even was a whole libc around it.)

Rich



* Re: [musl] [BUG] swprintf() doesn't handle Unicode characters correctly
  2021-05-25  1:09         ` Rich Felker
@ 2021-05-25  1:58           ` Konstantin Isakov
  0 siblings, 0 replies; 7+ messages in thread
From: Konstantin Isakov @ 2021-05-25  1:58 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

Thanks, Rich, that was very informative!


