mailing list of musl libc
* [PATCH] implement a private state for the uchar.h functions
@ 2014-11-09 10:18 Jens Gustedt
  2014-11-11  3:21 ` Rich Felker
  2014-11-15 17:29 ` Rich Felker
  0 siblings, 2 replies; 8+ messages in thread
From: Jens Gustedt @ 2014-11-09 10:18 UTC
  To: musl

The C standard is imperative on that:

  7.28.1 ... If ps is a null pointer, each function uses its own internal
  mbstate_t object instead, which is initialized at program startup to
  the initial conversion state;

and these functions are also not supposed to implicitly use the state of
the wchar.h functions:

  7.29.6.3 ... The implementation behaves as if no library function calls
  these functions with a null pointer for ps.

Previously this resulted in two bugs.

 - The functions c16rtomb and mbrtoc16 would crash when called with ps
   set to null.

 - The functions c32rtomb and mbrtoc32 used the private states of wcrtomb
   and mbrtowc, respectively, which they are not allowed to do.
---
 src/multibyte/c16rtomb.c | 2 ++
 src/multibyte/c32rtomb.c | 2 ++
 src/multibyte/mbrtoc16.c | 2 ++
 src/multibyte/mbrtoc32.c | 2 ++
 4 files changed, 8 insertions(+)

diff --git a/src/multibyte/c16rtomb.c b/src/multibyte/c16rtomb.c
index 2e8ec97..39ca375 100644
--- a/src/multibyte/c16rtomb.c
+++ b/src/multibyte/c16rtomb.c
@@ -4,6 +4,8 @@
 
 size_t c16rtomb(char *restrict s, char16_t c16, mbstate_t *restrict ps)
 {
+	static unsigned internal_state;
+	if (!ps) ps = (void *)&internal_state;
 	unsigned *x = (unsigned *)ps;
 	wchar_t wc;
 
diff --git a/src/multibyte/c32rtomb.c b/src/multibyte/c32rtomb.c
index 6785132..a5d49ff 100644
--- a/src/multibyte/c32rtomb.c
+++ b/src/multibyte/c32rtomb.c
@@ -3,5 +3,7 @@
 
 size_t c32rtomb(char *restrict s, char32_t c32, mbstate_t *restrict ps)
 {
+	static unsigned internal_state;
+	if (!ps) ps = (void *)&internal_state;
 	return wcrtomb(s, c32, ps);
 }
diff --git a/src/multibyte/mbrtoc16.c b/src/multibyte/mbrtoc16.c
index 74b7d77..765ff90 100644
--- a/src/multibyte/mbrtoc16.c
+++ b/src/multibyte/mbrtoc16.c
@@ -3,6 +3,8 @@
 
 size_t mbrtoc16(char16_t *restrict pc16, const char *restrict s, size_t n, mbstate_t *restrict ps)
 {
+	static unsigned internal_state;
+	if (!ps) ps = (void *)&internal_state;
 	unsigned *pending = (unsigned *)ps;
 
 	if (!s) return mbrtoc16(0, "", 1, ps);
diff --git a/src/multibyte/mbrtoc32.c b/src/multibyte/mbrtoc32.c
index c6d2082..9b6b236 100644
--- a/src/multibyte/mbrtoc32.c
+++ b/src/multibyte/mbrtoc32.c
@@ -3,6 +3,8 @@
 
 size_t mbrtoc32(char32_t *restrict pc32, const char *restrict s, size_t n, mbstate_t *restrict ps)
 {
+	static unsigned internal_state;
+	if (!ps) ps = (void *)&internal_state;
 	if (!s) return mbrtoc32(0, "", 1, ps);
 	wchar_t wc;
 	size_t ret = mbrtowc(&wc, s, n, ps);
-- 
1.9.1




* Re: [PATCH] implement a private state for the uchar.h functions
  2014-11-09 10:18 [PATCH] implement a private state for the uchar.h functions Jens Gustedt
@ 2014-11-11  3:21 ` Rich Felker
  2014-11-11 13:53   ` Jens Gustedt
  2014-11-15 17:29 ` Rich Felker
  1 sibling, 1 reply; 8+ messages in thread
From: Rich Felker @ 2014-11-11  3:21 UTC
  To: musl

On Sun, Nov 09, 2014 at 11:18:08AM +0100, Jens Gustedt wrote:
> The C standard is imperative on that:
> 
>   7.28.1 ... If ps is a null pointer, each function uses its own internal
>   mbstate_t object instead, which is initialized at program startup to
>   the initial conversion state;

Thanks. Actually I originally had this functionality and removed it
because it seemed to be unnecessary, due to the requirement being
buried in that introductory text rather than the descriptions of the
individual functions. I figured the committee had just intentionally
decided not to copy this backwards functionality from the old
multibyte functions into the new uchar ones, but sadly that's not the
case...

Rich



* Re: [PATCH] implement a private state for the uchar.h functions
  2014-11-11  3:21 ` Rich Felker
@ 2014-11-11 13:53   ` Jens Gustedt
  2014-11-11 14:39     ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Gustedt @ 2014-11-11 13:53 UTC
  To: musl

On Monday, 10.11.2014, 22:21 -0500, Rich Felker wrote:
> On Sun, Nov 09, 2014 at 11:18:08AM +0100, Jens Gustedt wrote:
> > The C standard is imperative on that:
> > 
> >   7.28.1 ... If ps is a null pointer, each function uses its own internal
> >   mbstate_t object instead, which is initialized at program startup to
> >   the initial conversion state;
> 
> Thanks. Actually I originally had this functionality and removed it
> because it seemed to be unnecessary, due to the requirement being
> buried in that introductory text rather than the descriptions of the
> individual functions. I figured the committee had just intentionally
> decided not to copy this backwards functionality from the old
> multibyte functions into the new uchar ones, but sadly that's not the
> case...

Yes, these are bizarre additions. That makes almost a dozen different
static states, one for each of the restartable functions.

Perhaps I misunderstood something, but isn't it the case that in the
direction mbs -> charXX_t these functions can handle surrogates, but
the other way around is not possible?

I get some of the ideas behind the new Unicode support in C11, but
some things remain quite mysterious:

 - having a standard way to specify unicode characters inside a string
   of any kind through \u and \U is really a great achievement

 - introducing types charXX_t and literals with the u and U prefixes is
   already less clear. The only thing that can be done with them is
   conversion; there are no auxiliary functions. In particular the
   character counting and classification problems for surrogates are
   still not solved.

 - introducing a u8 prefix for strings that guarantees utf8 encoding
   for mbs sounds nice. But then there is nothing that relates these
   to "normal" string literals. What are we supposed to do with these?

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




* Re: [PATCH] implement a private state for the uchar.h functions
  2014-11-11 13:53   ` Jens Gustedt
@ 2014-11-11 14:39     ` Rich Felker
  2014-11-11 16:03       ` Jens Gustedt
  0 siblings, 1 reply; 8+ messages in thread
From: Rich Felker @ 2014-11-11 14:39 UTC
  To: musl

On Tue, Nov 11, 2014 at 02:53:02PM +0100, Jens Gustedt wrote:
> On Monday, 10.11.2014, 22:21 -0500, Rich Felker wrote:
> > On Sun, Nov 09, 2014 at 11:18:08AM +0100, Jens Gustedt wrote:
> > > The C standard is imperative on that:
> > > 
> > >   7.28.1 ... If ps is a null pointer, each function uses its own internal
> > >   mbstate_t object instead, which is initialized at program startup to
> > >   the initial conversion state;
> > 
> > Thanks. Actually I originally had this functionality and removed it
> > because it seemed to be unnecessary, due to the requirement being
> > buried in that introductory text rather than the descriptions of the
> > individual functions. I figured the committee had just intentionally
> > decided not to copy this backwards functionality from the old
> > multibyte functions into the new uchar ones, but sadly that's not the
> > case...
> 
> Yes these are bizarre additions. That has almost a dozen different
> static states for all of the different restartable functions.
> 
> Perhaps I misunderstood something, but isn't it that in direction mbs
> -> charXX_t these functions allow to handle surrogates, but the other
> way around is not possible?

Both directions are possible. c16rtomb returns 0 and saves the first
surrogate as state for the next call. mbrtoc16 writes out the first
surrogate, saves the second in the state, and returns 4 on the first
call, then returns (size_t)-3 and writes out the second surrogate on
the next call. Yes, it's hideously ugly, but it was trivial to
implement.
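
To make the c16rtomb half of that concrete, here is a minimal sketch. It
is not musl's code: toy_state, toy_utf8_encode and toy_c16rtomb are
hypothetical names, the output is assumed to be UTF-8 regardless of
locale, and unlike the real function it requires a non-null s. It shows
the high surrogate being parked in the state (return value 0) and the
fallback to a function-private static state when ps is null.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mbstate_t: a pending high surrogate, or 0. */
typedef struct { uint32_t pending_high; } toy_state;

/* Encode one code point (assumed valid) as UTF-8; returns the byte count. */
static size_t toy_utf8_encode(char *out, uint32_t cp)
{
	if (cp < 0x80) {
		out[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (char)(0xC0 | (cp >> 6));
		out[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (char)(0xE0 | (cp >> 12));
		out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (char)(0xF0 | (cp >> 18));
	out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (char)(0x80 | (cp & 0x3F));
	return 4;
}

/* Toy analogue of c16rtomb. Returns the number of bytes written, 0 when a
 * high surrogate was stored in the state, or (size_t)-1 on an invalid unit. */
size_t toy_c16rtomb(char *s, uint16_t c16, toy_state *ps)
{
	static toy_state internal_state; /* zero-initialized: initial state */

	assert(s); /* toy simplification: the real function accepts a null s */
	if (!ps) ps = &internal_state;

	if (ps->pending_high) {
		uint32_t hi = ps->pending_high;
		ps->pending_high = 0;
		/* A stored high surrogate must be followed by a low one. */
		if (c16 < 0xDC00 || c16 > 0xDFFF) return (size_t)-1;
		return toy_utf8_encode(s,
			0x10000 + ((hi - 0xD800) << 10) + (c16 - 0xDC00u));
	}
	if (c16 >= 0xD800 && c16 <= 0xDBFF) {
		ps->pending_high = c16; /* remember it, emit nothing yet */
		return 0;
	}
	if (c16 >= 0xDC00 && c16 <= 0xDFFF) return (size_t)-1; /* stray low */
	return toy_utf8_encode(s, c16);
}
```

Feeding it the pair 0xD83D, 0xDE00 (U+1F600) yields 0 on the first call
and 4 on the second, whether the caller passes its own state or null.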

> I get some of the ideas behind the new Unicode support in C11, but
> some things remain quite mysterious:
> 
>  - having a standard way to specify unicode characters inside a string
>    of any kind through \u and \U is really a great achievement

Yes and no. I don't think anyone really wants to use these. They're
unreadable except when used extremely sparingly, and embedding natural
language text in source is widely frowned upon anyway which limits the
usefulness. But it is nice to at least have a way if/when you need it.

>  - introducing types charXX_t and constants literals with u and U is
>    already less clear. The only thing that can be done with them is
>    conversion, there are no auxiliary functions. In particular the
>    character counting and classification problems for surrogates is
>    still not solved.

The provided conversions to/from multibyte are useless because the
current multibyte character set cannot necessarily even represent
them. Initially I thought they should have provided conversions
to/from wchar_t, but that would also be useless since wchar_t is only
officially meaningful for characters in the current (multibyte)
character set. The only conversions that would actually be useful are
between UTF-8, UTF-16, and UTF-32, but those are all well-defined in
an implementation-independent manner and thus something you can
provide yourself (even though at least 70% of people doing so do it
wrong...) which I can only assume is the reason the language standard
doesn't provide them.
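
As an illustration of what "provide yourself" involves, here is a
minimal sketch of a UTF-8 to UTF-32 decoder. The name utf8_decode and
its return convention are assumptions made for this example, not part of
any standard interface. The checks at the end (overlong forms,
surrogates, values above 0x10FFFF) are the ones that are easy to forget.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from at most n bytes at s.
 * Returns the number of bytes consumed (1 to 4) and stores the code point
 * in *out, or returns 0 if the input is invalid or truncated. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
	size_t len, i;
	uint32_t cp, min;

	assert(s && out);
	if (n == 0) return 0;

	if (s[0] < 0x80) {
		*out = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		len = 2; cp = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		len = 3; cp = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		len = 4; cp = s[0] & 0x07; min = 0x10000;
	} else {
		return 0; /* continuation byte or invalid lead byte */
	}
	if (n < len) return 0; /* truncated sequence */

	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation */
		cp = (cp << 6) | (s[i] & 0x3F);
	}

	if (cp < min) return 0;                      /* overlong encoding */
	if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
	if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */

	*out = cp;
	return len;
}
```

Because nothing here depends on the locale or on wchar_t, the same code
behaves identically on every implementation.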

>  - introducing a u8 prefix for strings that guarantees utf8 encoding
>    for mbs sounds nice. But then there is nothing that relates these
>    to "normal" string literals. What are we supposed to do with these?

Process them with your own code, or just pass them to external
interfaces that expect UTF-8 (e.g. filesystem structures, network
protocols, etc.).
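
A small example of "your own code" for such strings: counting code
points by skipping continuation bytes. This is only a sketch; the name
utf8_count_codepoints is hypothetical, and the function assumes its
input is already valid, null-terminated UTF-8 (it does no validation).

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a valid, null-terminated UTF-8 string.
 * Every byte that is not a continuation byte (10xxxxxx) starts one. */
size_t utf8_count_codepoints(const char *s)
{
	size_t count = 0;

	assert(s);
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			count++;
	return count;
}
```

For the bytes of "a", U+263A and "z" this gives 3, where strlen would
report 5.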

Rich



* Re: [PATCH] implement a private state for the uchar.h functions
  2014-11-11 14:39     ` Rich Felker
@ 2014-11-11 16:03       ` Jens Gustedt
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Gustedt @ 2014-11-11 16:03 UTC
  To: musl

On Tuesday, 11.11.2014, 09:39 -0500, Rich Felker wrote:
> On Tue, Nov 11, 2014 at 02:53:02PM +0100, Jens Gustedt wrote:
> >  - having a standard way to specify unicode characters inside a string
> >    of any kind through \u and \U is really a great achievement
> 
> Yes and no. I don't think anyone really wants to use these. They're
> unreadable except when used extremely sparingly, and embedding natural
> language text in source is widely frowned upon anyway which limits the
> usefulness. But it is nice to at least have a way if/when you need it.

Yes.

Unicode is not only for natural language scripts, but also for graphics,
mathematics, IPA, other technical stuff and smilies ☺

So for these uses it might come in handy.


> >  - introducing a u8 prefix for strings that guarantees utf8 encoding
> >    for mbs sounds nice. But then there is nothing that relates these
> >    to "normal" string literals. What are we supposed to do with these?
> 
> Process them with your own code,

At some point I thought the C library was there to provide the basics
for interacting with the environment; probably much too naive :)

> or just pass them to external
> interfaces that expect UTF-8 (e.g. filesystem structures, network
> protocols, etc.).

right, good point

Jens


* Re: [PATCH] implement a private state for the uchar.h functions
  2014-11-09 10:18 [PATCH] implement a private state for the uchar.h functions Jens Gustedt
  2014-11-11  3:21 ` Rich Felker
@ 2014-11-15 17:29 ` Rich Felker
  2014-11-15 17:57   ` Jens Gustedt
  1 sibling, 1 reply; 8+ messages in thread
From: Rich Felker @ 2014-11-15 17:29 UTC
  To: musl

On Sun, Nov 09, 2014 at 11:18:08AM +0100, Jens Gustedt wrote:
> The C standard is imperative on that:
> 
>   7.28.1 ... If ps is a null pointer, each function uses its own internal
>   mbstate_t object instead, which is initialized at program startup to
>   the initial conversion state;
> 
> and these functions are also not supposed to implicitly use the state of
> the wchar.h functions:
> 
>   7.29.6.3 ... The implementation behaves as if no library function calls
>   these functions with a null pointer for ps.
> 
> Previously this resulted in two bugs.
> 
>  - The functions c16rtomb and mbrtoc16 would crash when called with ps
>    set to null.
> 
>  - The functions c32rtomb and mbrtoc32 used the private states of wcrtomb
>    and mbrtowc, respectively, which they are not allowed to do.

One small correction: wcrtomb has no state, and c32rtomb does not need
any either.

Rich



* Re: [PATCH] implement a private state for the uchar.h functions
  2014-11-15 17:29 ` Rich Felker
@ 2014-11-15 17:57   ` Jens Gustedt
  2014-11-15 20:10     ` Rich Felker
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Gustedt @ 2014-11-15 17:57 UTC
  To: musl

On Saturday, 15.11.2014, 12:29 -0500, Rich Felker wrote:
> >  - The functions c32rtomb and mbrtoc32 used the private states of wcrtomb
> >    and mbrtowc, respectively, which they are not allowed to do.
> 
> One small correction: wcrtomb has no state, and c32rtomb does not need
> any either.

Right, so the corrected phrase would be

  - The function mbrtoc32 used the private state of mbrtowc, which it
    is not allowed to do.

Thanks

Jens


* Re: [PATCH] implement a private state for the uchar.h functions
  2014-11-15 17:57   ` Jens Gustedt
@ 2014-11-15 20:10     ` Rich Felker
  0 siblings, 0 replies; 8+ messages in thread
From: Rich Felker @ 2014-11-15 20:10 UTC
  To: musl

On Sat, Nov 15, 2014 at 06:57:21PM +0100, Jens Gustedt wrote:
> On Saturday, 15.11.2014, 12:29 -0500, Rich Felker wrote:
> > >  - The functions c32rtomb and mbrtoc32 used the private states of wcrtomb
> > >    and mbrtowc, respectively, which they are not allowed to do.
> > 
> > One small correction: wcrtomb has no state, and c32rtomb does not need
> > any either.
> 
> right, so the corrected phrase would be
> 
>   - the function mbrtoc32 uses the private state of mbrtowc, which it
>     is not allowed to do.
> 
> thanks

Committing with this change. Thanks!

Rich



Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
