From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-1.7 required=5.0 tests=MAILING_LIST_MULTI, RCVD_IN_DNSWL_LOW,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 26131 invoked from network); 25 Jun 2023 17:26:35 -0000 Received: from second.openwall.net (193.110.157.125) by inbox.vuxu.org with ESMTPUTF8; 25 Jun 2023 17:26:35 -0000 Received: (qmail 26104 invoked by uid 550); 25 Jun 2023 17:26:31 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Reply-To: musl@lists.openwall.com Received: (qmail 26066 invoked from network); 25 Jun 2023 17:26:31 -0000 Date: Sun, 25 Jun 2023 13:26:20 -0400 From: Rich Felker To: Jens Gustedt Cc: musl@lists.openwall.com Message-ID: <20230625172619.GR4163@brightrain.aerifal.cx> References: <976b592b08c82bd6b1f8834904071db6602f3885.1685534402.git.Jens.Gustedt@inria.fr> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <976b592b08c82bd6b1f8834904071db6602f3885.1685534402.git.Jens.Gustedt@inria.fr> User-Agent: Mutt/1.5.21 (2010-09-15) Subject: Re: [musl] [C23 string conversion 1/2] C23: implement the c8rtomb and mbrtoc8 functions Following up on some specifics of code I hadn't reviewed in enough detail yet... On Wed, May 31, 2023 at 04:01:43PM +0200, Jens Gustedt wrote: > The implementations each separate two cases: a fast path that > corresponds to calls with state variables that are in the initial > state and a leading input character that is ASCII; the other cases are > handled by a function for the slow path. These functions are > implemented such that the fast versions should not need stack > variables of their own, and thus can tail call into the slow path with > a jmp instruction if necessary. The fast versions could typically be > also used as shallow inline wrapper, if we like to go that way. > > Only the implementation of mbrtoc8 is a bit tricky. This is because of > the weird conventions for dripping the output bytes one call after the > other. This is handled by using the state variable as a stack for the > up to three characters that are still to be sent to the output. > --- > include/uchar.h | 6 +++++ > src/multibyte/c8rtomb.c | 31 +++++++++++++++++++++++++ > src/multibyte/mbrtoc8.c | 51 +++++++++++++++++++++++++++++++++++++++++ > 3 files changed, 88 insertions(+) > create mode 100644 src/multibyte/c8rtomb.c > create mode 100644 src/multibyte/mbrtoc8.c > > diff --git a/include/uchar.h b/include/uchar.h > index 7e5c4d40..9f6a3706 100644 > --- a/include/uchar.h > +++ b/include/uchar.h > @@ -9,6 +9,9 @@ extern "C" { > typedef unsigned short char16_t; > typedef unsigned char32_t; > #endif > +#if __cplusplus < 201811L > +typedef unsigned char char8_t; > +#endif > > #define __NEED_mbstate_t > #define __NEED_size_t > @@ -16,6 +19,9 @@ typedef unsigned char32_t; > #include > #include > > +size_t c8rtomb(char *__restrict, char8_t, mbstate_t *__restrict); > +size_t mbrtoc8(char8_t *__restrict, const char *__restrict, size_t, mbstate_t *__restrict); > + > size_t c16rtomb(char *__restrict, char16_t, mbstate_t *__restrict); > size_t mbrtoc16(char16_t *__restrict, const char *__restrict, size_t, mbstate_t *__restrict); > > diff --git a/src/multibyte/c8rtomb.c b/src/multibyte/c8rtomb.c > new file mode 100644 > index 00000000..8645e112 > --- /dev/null > +++ b/src/multibyte/c8rtomb.c > @@ -0,0 +1,31 @@ > +#include > +#include > + > +__attribute__((__noinline__)) Where we use this so far is conditioned on #ifdef __GNUC__. I'm not sure if it's actually helping, but I think it'd make sense to be consistent. > +static size_t c8rtomb_slow(char *__restrict s, unsigned char c8, mbstate_t *__restrict st) > +{ > + // We need an internal state different from mbrtowc. > + static mbstate_t internal_state; > + if (!st) st = &internal_state; > + wchar_t wc; > + > + // mbrtowc has return values -2, -1, 0, 1. > + size_t res = mbrtowc(&wc, (char const*)&c8, 1, st); > + switch (res) { > + // our implementation of wcrtomb ignores the state > + case 1: res = wcrtomb(s, wc, 0); break; > + case 0: res = 1; if (s) *s = 0; break; > + } > + return res; > +} At first I was confused by the conversion back and forth, but indeed it absolutely makes sense to handle the weird buffering this interface imposes a requirement for. However, it seems to be malfunctioning in the C locale, where arbitrary bytes of c8 input would be accepted (despite not being valid UTF-8, as the interface requires). Logically, the right thing to do is to set the calling thread's locale to __c_dot_utf8_locale for the duration of the mbrtowc only. However, I think there may be simpler solutions depending on what you interpret the interface requirement as being. If c8rtomb is required to accept valid UTF-8 up to the penultimate byte then error only on storing it due to the locale not being able to represent it, I don't see a good solution except what I just said. But if it's permitted to preemptively say "sorry, nope, not gonna be able to represent that" as soon as it sees the first byte, then when MB_CUR_MAX==1, having c8rtomb_slow fail unconditionally with EILSEQ seems like the best and easiest approach. > +__attribute__((__noinline__)) > +static size_t mbrtoc8_slow(unsigned char *__restrict pc8, const char *__restrict src, size_t n, mbstate_t *__restrict st) > +{ > + static unsigned internal_state; > + wchar_t wc; > + unsigned* pending = st ? (void*)st : &internal_state; > + > + // The high bit is set if there is still missing input. If it > + // is not set and the value is not zero, a previous call has > + // stacked the missing output bytes withing the state. > + if ((-*pending) > INT_MIN) { > + if (pc8) *pc8 = *pending; > + (*pending) >>= 8; > + return -3; > + } > + > + // mbrtowc has return values -2, -1, 0, 1, ..., 4. > + size_t res = mbrtowc(&wc, src, n, (void*)pending); > + if (res <= 4) { > + _Alignas(unsigned) unsigned char s[8] = { 0 }; > + // Write the result bytes to s over the word boundary > + // as we need it. Our wcrtomb implementation ignores > + // the state variable, there will be no errors and the > + // return value can be ignored. > + wcrtomb((void*)(s+3), wc, 0); > + // Depending on the endianess this maybe a single load > + // instruction. We want to be sure that the bytes are > + // in this order, and that the high-order byte is 0. > + *pending = (0u|s[4])<<000 | (0u|s[5])<<010 | (0u|s[6])<<020 | (0u|s[7])<<030; > + if (pc8) *pc8 = s[3]; > + } > + return res; > +} Does _Alignas work the way you want it on an array like this? I was under the impression you would need a structure to do that in a way that actually meshes with the standard semantics for _Alignas and not the GCC attribute's special behaviors. A cleaner approach might be a union where the presence of the unsigned union member yields the alignment. This would also be good since the code is written in C99 which doesn't have _Alignas. Overall it seems like a lot of complexity effort to set things up for an efficient store, but if anyone's going to use this interface it's probably worthwhile. > +weak_alias(__mbrtoc8, mbrtoc8); I think we already discussed this in another place, but in case I'm misremembering, these aliases aren't needed unless you need a namespace-safe way to call the function internally. Rich