From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-1.1 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,MAILING_LIST_MULTI,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 29777 invoked from network); 26 Jun 2023 10:28:52 -0000 Received: from second.openwall.net (193.110.157.125) by inbox.vuxu.org with ESMTPUTF8; 26 Jun 2023 10:28:52 -0000 Received: (qmail 32089 invoked by uid 550); 26 Jun 2023 10:28:47 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Reply-To: musl@lists.openwall.com Received: (qmail 32054 invoked from network); 26 Jun 2023 10:28:46 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=inria.fr; s=dc; h=date:from:to:subject:message-id:in-reply-to:references: mime-version; bh=WjTzfQ/SswfA5AWuBbH/ljRsfX9uz/k2ka/skHeC2UE=; b=n66N78AoT5/DY5zenIqhKKS/AjBF4Y50vrgP8j6PEvDqELI6sHWiSGfL MpAanIH9qabUA4/WfH9Be1+37kB3/6nOhutbS/aXaSD5M4PLbvB9AgZ+1 D2YZwYaXQNf/n0UZr2zLO1u5Ow8Ijuz8NnS9G9cae3GKU4bTS89qKW8Hc k=; Authentication-Results: mail3-relais-sop.national.inria.fr; dkim=none (message not signed) header.i=none; spf=SoftFail smtp.mailfrom=jens.gustedt@inria.fr; dmarc=fail (p=none dis=none) d=inria.fr X-IronPort-AV: E=Sophos;i="6.01,159,1684792800"; d="scan'208";a="59815891" Date: Mon, 26 Jun 2023 12:28:33 +0200 From: =?UTF-8?B?SuKCkeKCmeKCmw==?= Gustedt To: musl@lists.openwall.com Message-ID: <20230626122833.7f0e73a8@inria.fr> In-Reply-To: <20230625172619.GR4163@brightrain.aerifal.cx> References: <976b592b08c82bd6b1f8834904071db6602f3885.1685534402.git.Jens.Gustedt@inria.fr> <20230625172619.GR4163@brightrain.aerifal.cx> Organization: inria.fr X-Mailer: Claws Mail 4.0.0 (GTK+ 3.24.33; x86_64-pc-linux-gnu) X-Face: iVBORw0KGgoAAAANSUhEUgAAADAAAAAwBAMAAAClLOS0AAAAAXNSR0IArs4c6QAAACRQTFRFERslNjAsLTE9Ok9wUk9TaUs8iWhSrYZkj42Rz6aD3sGZ MIME-Version: 1.0 Content-Type: multipart/signed; boundary="Sig_/_1CrXysnZopIat+DOL26NkK"; protocol="application/pgp-signature"; micalg=pgp-sha1 Subject: Re: [musl] [C23 string conversion 1/2] C23: implement the c8rtomb and mbrtoc8 functions --Sig_/_1CrXysnZopIat+DOL26NkK Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Rich, on Sun, 25 Jun 2023 13:26:20 -0400 you (Rich Felker ) wrote: > Following up on some specifics of code I hadn't reviewed in enough > detail yet... >=20 > On Wed, May 31, 2023 at 04:01:43PM +0200, Jens Gustedt wrote: > > The implementations each separate two cases: a fast path that > > corresponds to calls with state variables that are in the initial > > state and a leading input character that is ASCII; the other cases > > are handled by a function for the slow path. These functions are > > implemented such that the fast versions should not need stack > > variables of their own, and thus can tail call into the slow path > > with a jmp instruction if necessary. The fast versions could > > typically be also used as shallow inline wrapper, if we like to go > > that way. > >=20 > > Only the implementation of mbrtoc8 is a bit tricky. This is because > > of the weird conventions for dripping the output bytes one call > > after the other. This is handled by using the state variable as a > > stack for the up to three characters that are still to be sent to > > the output. --- > > include/uchar.h | 6 +++++ > > src/multibyte/c8rtomb.c | 31 +++++++++++++++++++++++++ > > src/multibyte/mbrtoc8.c | 51 > > +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 88 > > insertions(+) create mode 100644 src/multibyte/c8rtomb.c > > create mode 100644 src/multibyte/mbrtoc8.c > >=20 > > diff --git a/include/uchar.h b/include/uchar.h > > index 7e5c4d40..9f6a3706 100644 > > --- a/include/uchar.h > > +++ b/include/uchar.h > > @@ -9,6 +9,9 @@ extern "C" { > > typedef unsigned short char16_t; > > typedef unsigned char32_t; > > #endif > > +#if __cplusplus < 201811L > > +typedef unsigned char char8_t; > > +#endif > > =20 > > #define __NEED_mbstate_t > > #define __NEED_size_t > > @@ -16,6 +19,9 @@ typedef unsigned char32_t; > > #include > > #include > > =20 > > +size_t c8rtomb(char *__restrict, char8_t, mbstate_t *__restrict); > > +size_t mbrtoc8(char8_t *__restrict, const char *__restrict, > > size_t, mbstate_t *__restrict); + > > size_t c16rtomb(char *__restrict, char16_t, mbstate_t *__restrict); > > size_t mbrtoc16(char16_t *__restrict, const char *__restrict, > > size_t, mbstate_t *__restrict);=20 > > diff --git a/src/multibyte/c8rtomb.c b/src/multibyte/c8rtomb.c > > new file mode 100644 > > index 00000000..8645e112 > > --- /dev/null > > +++ b/src/multibyte/c8rtomb.c > > @@ -0,0 +1,31 @@ > > +#include > > +#include > > + > > +__attribute__((__noinline__)) =20 >=20 > Where we use this so far is conditioned on #ifdef __GNUC__. I'm not > sure if it's actually helping, but I think it'd make sense to be > consistent. ok, no pb > > +static size_t c8rtomb_slow(char *__restrict s, unsigned char c8, > > mbstate_t *__restrict st) +{ > > + // We need an internal state different from mbrtowc. > > + static mbstate_t internal_state; > > + if (!st) st =3D &internal_state; > > + wchar_t wc; > > + > > + // mbrtowc has return values -2, -1, 0, 1. > > + size_t res =3D mbrtowc(&wc, (char const*)&c8, 1, st); > > + switch (res) { > > + // our implementation of wcrtomb ignores the state > > + case 1: res =3D wcrtomb(s, wc, 0); break; > > + case 0: res =3D 1; if (s) *s =3D 0; break; > > + } > > + return res; > > +} =20 >=20 > At first I was confused by the conversion back and forth, but indeed > it absolutely makes sense to handle the weird buffering this interface > imposes a requirement for. it is weird, indeed > However, it seems to be malfunctioning in the C locale, where > arbitrary bytes of c8 input would be accepted (despite not being valid > UTF-8, as the interface requires). >=20 > Logically, the right thing to do is to set the calling thread's locale > to __c_dot_utf8_locale for the duration of the mbrtowc only. However, > I think there may be simpler solutions depending on what you interpret > the interface requirement as being. >=20 > If c8rtomb is required to accept valid UTF-8 up to the penultimate > byte then error only on storing it due to the locale not being able to > represent it, I don't see a good solution except what I just said. >=20 > But if it's permitted to preemptively say "sorry, nope, not gonna be > able to represent that" as soon as it sees the first byte, then when > MB_CUR_MAX=3D=3D1, having c8rtomb_slow fail unconditionally with EILSEQ > seems like the best and easiest approach. Good question. We should just be consistent and do the same whatever we do when we pass a surrogate into `c16rtomb`. If I read that correctly this does fail with `EILSEQ`. The intent (IIRC what I did some weeks ago ;-) of this > > + size_t res =3D mbrtowc(&wc, (char const*)&c8, 1, st); was to fail the same if `c8` is not start or continue of a valid mb-character. > > +__attribute__((__noinline__)) > > +static size_t mbrtoc8_slow(unsigned char *__restrict pc8, const > > char *__restrict src, size_t n, mbstate_t *__restrict st) +{ > > + static unsigned internal_state; > > + wchar_t wc; > > + unsigned* pending =3D st ? (void*)st : &internal_state; > > + > > + // The high bit is set if there is still missing input. If > > it > > + // is not set and the value is not zero, a previous call > > has > > + // stacked the missing output bytes withing the state. > > + if ((-*pending) > INT_MIN) { > > + if (pc8) *pc8 =3D *pending; > > + (*pending) >>=3D 8; > > + return -3; > > + } > > + > > + // mbrtowc has return values -2, -1, 0, 1, ..., 4. > > + size_t res =3D mbrtowc(&wc, src, n, (void*)pending); > > + if (res <=3D 4) { > > + _Alignas(unsigned) unsigned char s[8] =3D { 0 }; > > + // Write the result bytes to s over the word > > boundary > > + // as we need it. Our wcrtomb implementation > > ignores > > + // the state variable, there will be no errors and > > the > > + // return value can be ignored. > > + wcrtomb((void*)(s+3), wc, 0); > > + // Depending on the endianess this maybe a single > > load > > + // instruction. We want to be sure that the bytes > > are > > + // in this order, and that the high-order byte is > > 0. > > + *pending =3D (0u|s[4])<<000 | (0u|s[5])<<010 | > > (0u|s[6])<<020 | (0u|s[7])<<030; > > + if (pc8) *pc8 =3D s[3]; > > + } > > + return res; > > +} =20 >=20 > Does _Alignas work the way you want it on an array like this? It is supposed to, yes. > I was under the impression you would need a structure to do that in > a way that actually meshes with the standard semantics for _Alignas > and not the GCC attribute's special behaviors. In the contrary, I think the support for this feature in `struct` had first been forgotten in C11 (was it) and then added in C17. > A cleaner approach might be a union where the presence of the unsigned > union member yields the alignment. This would also be good since the > code is written in C99 which doesn't have _Alignas. This may indeed be better readable and portable to oldish C. I'll look into this. > Overall it seems like a lot of complexity effort to set things up for > an efficient store, but if anyone's going to use this interface it's > probably worthwhile. Great. Yes, this is supposed to be used to process streams of input. I observed quite nice optimization on my platform with that trick. > > +weak_alias(__mbrtoc8, mbrtoc8); =20 >=20 > I think we already discussed this in another place, but in case I'm > misremembering, these aliases aren't needed unless you need a > namespace-safe way to call the function internally. Yes, this is probably a left-over, sorry. Thanks J=E2=82=91=E2=82=99=E2=82=9B --=20 :: ICube :::::::::::::::::::::::::::::: deputy director :: :: Universit=C3=A9 de Strasbourg :::::::::::::::::::::: ICPS :: :: INRIA Nancy Grand Est :::::::::::::::::::::::: Camus :: :: :::::::::::::::::::::::::::::::::::: =E2=98=8E +33 368854536 :: :: https://icube-icps.unistra.fr/index.php/Jens_Gustedt :: --Sig_/_1CrXysnZopIat+DOL26NkK Content-Type: application/pgp-signature Content-Description: OpenPGP digital signature -----BEGIN PGP SIGNATURE----- iF0EARECAB0WIQSN9stI2OFN1pLljN0P0+hp2tU34gUCZJloUgAKCRAP0+hp2tU3 4tnYAKCMjcv+3zsA3GVaFSwNu+30/X7uAgCfTyuianLrdWxCL8yrPPlqqdH+M3M= =7k9k -----END PGP SIGNATURE----- --Sig_/_1CrXysnZopIat+DOL26NkK--