* [musl] wcrtomb in UTF-8 locale should check the multibyte state
From: Kang-Che Sung @ 2025-04-04 21:12 UTC
To: musl
Hello.
I'm reporting what I believe is a bug.
Even though UTF-8 is not an encoding that uses shift characters, the
wide character encoding functions like wcrtomb should check the
mbstate_t object anyway if supplied by the caller.
I'm presenting a use case for why this matters:
(1) I call mbrtowc (or sometimes mbsnrtowcs) to scan a few bytes from
a multibyte string
(2) Then I call wcrtomb at the appropriate offset into the multibyte
string buffer, to "overwrite" or "append to" the end of the buffer.
In this case, the mbstate_t object after the mbrtowc call might
contain an incomplete UTF-8 sequence. When writing a new character
through wcrtomb, I should be able to tell if appending a character
would cause the string to become ill-formed. If the mbstate_t object
is ignored when calling wcrtomb, the error goes unreported. (An
incomplete sequence during mbrtowc does not set errno=EILSEQ, but
writing a new sequence right after the incomplete one should make that
incomplete sequence _invalid_, and thus it should have an EILSEQ
error.)
Here is example code:
```c
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void) {
    // Preloaded character: U+306B Hiragana Letter Ni
    char buf[256] = "\xE3\x81\xAB";
    setlocale(LC_ALL, "en_US.UTF-8");

    mbstate_t state;
    memset(&state, 0, sizeof(state));
    wchar_t wc = 0;

    // Scan only the first 2 of the 3 bytes, leaving a partial character in state.
    size_t len = mbrtowc(&wc, buf, 2, &state);
    printf("%zu %04x\n", len, (unsigned int)wc);

    // Write at offset 2 while state still holds the partial character.
    wc = (wchar_t)0;
    len = wcrtomb(&buf[2], wc, &state);
    printf("%zu\n", len);
    return 0;
}
```
Actual result (with musl libc 1.2.5 on Arch Linux x86-64):
```text
18446744073709551614 0000
1
```
Expected result:
```text
18446744073709551614 0000
18446744073709551615
```
Note: It is _allowed_ in the C standard to reuse an mbstate_t object
across different multibyte conversion functions. It is _not an
undefined behavior_ when the mbstate_t object is used for the _same
string_ in the _same locale_, and thus the example code above should
be a valid use.
When I tested the code on macOS 15.4, it showed the expected behavior.
Both glibc and musl libc, however, seem to have the bug.
* Re: [musl] wcrtomb in UTF-8 locale should check the multibyte state
From: Thorsten Glaser @ 2025-04-04 21:39 UTC
To: musl
On Sat, 5 Apr 2025, Kang-Che Sung wrote:
>Note: It is _allowed_ in the C standard to reuse an mbstate_t object
>across different multibyte conversion functions. It is _not an
7.31.6 begs to differ:
| If an mbstate_t object has been altered by any of the functions
| described in this subclause, and is then used with a different
| multibyte character sequence, or in the other conversion direction, or
| with a different LC_CTYPE category setting than on earlier function
| calls, the behavior is undefined.414)
bye,
//mirabilos
--
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
* Re: [musl] wcrtomb in UTF-8 locale should check the multibyte state
From: Kang-Che Sung @ 2025-04-04 23:32 UTC
To: musl
Hi.
On Sat, Apr 5, 2025 at 5:39 AM Thorsten Glaser <tg@evolvis.org> wrote:
>
> On Sat, 5 Apr 2025, Kang-Che Sung wrote:
>
> >Note: It is _allowed_ in the C standard to reuse an mbstate_t object
> >across different multibyte conversion functions. It is _not an
>
> 7.31.6 begs to differ:
>
> | If an mbstate_t object has been altered by any of the functions
> | described in this subclause, and is then used with a different
> | multibyte character sequence, or in the other conversion direction, or
> | with a different LC_CTYPE category setting than on earlier function
> | calls, the behavior is undefined.414)
>
I'm aware of that part of the standard paragraph.
I may have read it wrongly regarding the meaning of the "conversion
direction", but I still believe that ignoring the mbstate_t object is
a bad idea.
I need to make a correction on one thing though:
In macOS, the wcrtomb call in the example code in my last email
actually sets errno=EINVAL, not EILSEQ.
I guess some BSD implementations also follow this (I'm not sure).
POSIX says "EINVAL: ps points to an object that contains an invalid
conversion state."
* Re: [musl] wcrtomb in UTF-8 locale should check the multibyte state
From: Rich Felker @ 2025-04-05 2:50 UTC
To: Kang-Che Sung; +Cc: musl
On Sat, Apr 05, 2025 at 07:32:37AM +0800, Kang-Che Sung wrote:
> I'm aware of that part of the standard paragraph.
> I may have read it wrongly regarding the meaning of the "conversion
> direction", but I still believe that ignoring the mbstate_t object is
> a bad idea.
>
> I need to make a correction on one thing though:
> In macOS, the wcrtomb call in the example code in my last email
> actually sets errno=EINVAL, not EILSEQ.
> I guess some BSD implementations also follow this (I'm not sure).
> POSIX says "EINVAL: ps points to an object that contains an invalid
> conversion state."
"...the behavior is undefined" means (among other things) there is no
obligation to follow any particular error convention.
Rich