Rich, on Sun, 25 Jun 2023 13:26:20 -0400 you (Rich Felker ) wrote: > Following up on some specifics of code I hadn't reviewed in enough > detail yet... > > On Wed, May 31, 2023 at 04:01:43PM +0200, Jens Gustedt wrote: > > The implementations each separate two cases: a fast path that > > corresponds to calls with state variables that are in the initial > > state and a leading input character that is ASCII; the other cases > > are handled by a function for the slow path. These functions are > > implemented such that the fast versions should not need stack > > variables of their own, and thus can tail call into the slow path > > with a jmp instruction if necessary. The fast versions could > > typically be also used as shallow inline wrapper, if we like to go > > that way. > > > > Only the implementation of mbrtoc8 is a bit tricky. This is because > > of the weird conventions for dripping the output bytes one call > > after the other. This is handled by using the state variable as a > > stack for the up to three characters that are still to be sent to > > the output. --- > > include/uchar.h | 6 +++++ > > src/multibyte/c8rtomb.c | 31 +++++++++++++++++++++++++ > > src/multibyte/mbrtoc8.c | 51 > > +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 88 > > insertions(+) create mode 100644 src/multibyte/c8rtomb.c > > create mode 100644 src/multibyte/mbrtoc8.c > > > > diff --git a/include/uchar.h b/include/uchar.h > > index 7e5c4d40..9f6a3706 100644 > > --- a/include/uchar.h > > +++ b/include/uchar.h > > @@ -9,6 +9,9 @@ extern "C" { > > typedef unsigned short char16_t; > > typedef unsigned char32_t; > > #endif > > +#if __cplusplus < 201811L > > +typedef unsigned char char8_t; > > +#endif > > > > #define __NEED_mbstate_t > > #define __NEED_size_t > > @@ -16,6 +19,9 @@ typedef unsigned char32_t; > > #include > > #include > > > > +size_t c8rtomb(char *__restrict, char8_t, mbstate_t *__restrict); > > +size_t mbrtoc8(char8_t *__restrict, const char *__restrict, > > size_t, mbstate_t *__restrict); + > > size_t c16rtomb(char *__restrict, char16_t, mbstate_t *__restrict); > > size_t mbrtoc16(char16_t *__restrict, const char *__restrict, > > size_t, mbstate_t *__restrict); > > diff --git a/src/multibyte/c8rtomb.c b/src/multibyte/c8rtomb.c > > new file mode 100644 > > index 00000000..8645e112 > > --- /dev/null > > +++ b/src/multibyte/c8rtomb.c > > @@ -0,0 +1,31 @@ > > +#include > > +#include > > + > > +__attribute__((__noinline__)) > > Where we use this so far is conditioned on #ifdef __GNUC__. I'm not > sure if it's actually helping, but I think it'd make sense to be > consistent. ok, no pb > > +static size_t c8rtomb_slow(char *__restrict s, unsigned char c8, > > mbstate_t *__restrict st) +{ > > + // We need an internal state different from mbrtowc. > > + static mbstate_t internal_state; > > + if (!st) st = &internal_state; > > + wchar_t wc; > > + > > + // mbrtowc has return values -2, -1, 0, 1. > > + size_t res = mbrtowc(&wc, (char const*)&c8, 1, st); > > + switch (res) { > > + // our implementation of wcrtomb ignores the state > > + case 1: res = wcrtomb(s, wc, 0); break; > > + case 0: res = 1; if (s) *s = 0; break; > > + } > > + return res; > > +} > > At first I was confused by the conversion back and forth, but indeed > it absolutely makes sense to handle the weird buffering this interface > imposes a requirement for. it is weird, indeed > However, it seems to be malfunctioning in the C locale, where > arbitrary bytes of c8 input would be accepted (despite not being valid > UTF-8, as the interface requires). > > Logically, the right thing to do is to set the calling thread's locale > to __c_dot_utf8_locale for the duration of the mbrtowc only. However, > I think there may be simpler solutions depending on what you interpret > the interface requirement as being. > > If c8rtomb is required to accept valid UTF-8 up to the penultimate > byte then error only on storing it due to the locale not being able to > represent it, I don't see a good solution except what I just said. > > But if it's permitted to preemptively say "sorry, nope, not gonna be > able to represent that" as soon as it sees the first byte, then when > MB_CUR_MAX==1, having c8rtomb_slow fail unconditionally with EILSEQ > seems like the best and easiest approach. Good question. We should just be consistent and do the same whatever we do when we pass a surrogate into `c16rtomb`. If I read that correctly this does fail with `EILSEQ`. The intent (IIRC what I did some weeks ago ;-) of this > > + size_t res = mbrtowc(&wc, (char const*)&c8, 1, st); was to fail the same if `c8` is not start or continue of a valid mb-character. > > +__attribute__((__noinline__)) > > +static size_t mbrtoc8_slow(unsigned char *__restrict pc8, const > > char *__restrict src, size_t n, mbstate_t *__restrict st) +{ > > + static unsigned internal_state; > > + wchar_t wc; > > + unsigned* pending = st ? (void*)st : &internal_state; > > + > > + // The high bit is set if there is still missing input. If > > it > > + // is not set and the value is not zero, a previous call > > has > > + // stacked the missing output bytes withing the state. > > + if ((-*pending) > INT_MIN) { > > + if (pc8) *pc8 = *pending; > > + (*pending) >>= 8; > > + return -3; > > + } > > + > > + // mbrtowc has return values -2, -1, 0, 1, ..., 4. > > + size_t res = mbrtowc(&wc, src, n, (void*)pending); > > + if (res <= 4) { > > + _Alignas(unsigned) unsigned char s[8] = { 0 }; > > + // Write the result bytes to s over the word > > boundary > > + // as we need it. Our wcrtomb implementation > > ignores > > + // the state variable, there will be no errors and > > the > > + // return value can be ignored. > > + wcrtomb((void*)(s+3), wc, 0); > > + // Depending on the endianess this maybe a single > > load > > + // instruction. We want to be sure that the bytes > > are > > + // in this order, and that the high-order byte is > > 0. > > + *pending = (0u|s[4])<<000 | (0u|s[5])<<010 | > > (0u|s[6])<<020 | (0u|s[7])<<030; > > + if (pc8) *pc8 = s[3]; > > + } > > + return res; > > +} > > Does _Alignas work the way you want it on an array like this? It is supposed to, yes. > I was under the impression you would need a structure to do that in > a way that actually meshes with the standard semantics for _Alignas > and not the GCC attribute's special behaviors. In the contrary, I think the support for this feature in `struct` had first been forgotten in C11 (was it) and then added in C17. > A cleaner approach might be a union where the presence of the unsigned > union member yields the alignment. This would also be good since the > code is written in C99 which doesn't have _Alignas. This may indeed be better readable and portable to oldish C. I'll look into this. > Overall it seems like a lot of complexity effort to set things up for > an efficient store, but if anyone's going to use this interface it's > probably worthwhile. Great. Yes, this is supposed to be used to process streams of input. I observed quite nice optimization on my platform with that trick. > > +weak_alias(__mbrtoc8, mbrtoc8); > > I think we already discussed this in another place, but in case I'm > misremembering, these aliases aren't needed unless you need a > namespace-safe way to call the function internally. Yes, this is probably a left-over, sorry. Thanks Jₑₙₛ -- :: ICube :::::::::::::::::::::::::::::: deputy director :: :: Université de Strasbourg :::::::::::::::::::::: ICPS :: :: INRIA Nancy Grand Est :::::::::::::::::::::::: Camus :: :: :::::::::::::::::::::::::::::::::::: ☎ +33 368854536 :: :: https://icube-icps.unistra.fr/index.php/Jens_Gustedt ::