From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.7 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_LOW,T_SCC_BODY_TEXT_LINE autolearn=ham
	autolearn_force=no version=3.4.4
Received: (qmail 26131 invoked from network); 25 Jun 2023 17:26:35 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 25 Jun 2023 17:26:35 -0000
Received: (qmail 26104 invoked by uid 550); 25 Jun 2023 17:26:31 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 26066 invoked from network); 25 Jun 2023 17:26:31 -0000
Date: Sun, 25 Jun 2023 13:26:20 -0400
From: Rich Felker <dalias@libc.org>
To: Jens Gustedt <Jens.Gustedt@inria.fr>
Cc: musl@lists.openwall.com
Message-ID: <20230625172619.GR4163@brightrain.aerifal.cx>
References: <cover.1685534402.git.Jens.Gustedt@inria.fr>
 <976b592b08c82bd6b1f8834904071db6602f3885.1685534402.git.Jens.Gustedt@inria.fr>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <976b592b08c82bd6b1f8834904071db6602f3885.1685534402.git.Jens.Gustedt@inria.fr>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] [C23 string conversion 1/2] C23: implement the c8rtomb
 and mbrtoc8 functions

Following up on some specifics of code I hadn't reviewed in enough
detail yet...

On Wed, May 31, 2023 at 04:01:43PM +0200, Jens Gustedt wrote:
> The implementations each separate two cases: a fast path that
> corresponds to calls with state variables that are in the initial
> state and a leading input character that is ASCII; the other cases are
> handled by a function for the slow path. These functions are
> implemented such that the fast versions should not need stack
> variables of their own, and thus can tail call into the slow path with
> a jmp instruction if necessary. The fast versions could typically be
> also used as shallow inline wrapper, if we like to go that way.
> 
> Only the implementation of mbrtoc8 is a bit tricky. This is because of
> the weird conventions for dripping the output bytes one call after the
> other. This is handled by using the state variable as a stack for the
> up to three characters that are still to be sent to the output.
> ---
>  include/uchar.h         |  6 +++++
>  src/multibyte/c8rtomb.c | 31 +++++++++++++++++++++++++
>  src/multibyte/mbrtoc8.c | 51 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 88 insertions(+)
>  create mode 100644 src/multibyte/c8rtomb.c
>  create mode 100644 src/multibyte/mbrtoc8.c
> 
> diff --git a/include/uchar.h b/include/uchar.h
> index 7e5c4d40..9f6a3706 100644
> --- a/include/uchar.h
> +++ b/include/uchar.h
> @@ -9,6 +9,9 @@ extern "C" {
>  typedef unsigned short char16_t;
>  typedef unsigned char32_t;
>  #endif
> +#if  __cplusplus < 201811L
> +typedef unsigned char char8_t;
> +#endif
>  
>  #define __NEED_mbstate_t
>  #define __NEED_size_t
> @@ -16,6 +19,9 @@ typedef unsigned char32_t;
>  #include <features.h>
>  #include <bits/alltypes.h>
>  
> +size_t c8rtomb(char *__restrict, char8_t, mbstate_t *__restrict);
> +size_t mbrtoc8(char8_t *__restrict, const char *__restrict, size_t, mbstate_t *__restrict);
> +
>  size_t c16rtomb(char *__restrict, char16_t, mbstate_t *__restrict);
>  size_t mbrtoc16(char16_t *__restrict, const char *__restrict, size_t, mbstate_t *__restrict);
>  
> diff --git a/src/multibyte/c8rtomb.c b/src/multibyte/c8rtomb.c
> new file mode 100644
> index 00000000..8645e112
> --- /dev/null
> +++ b/src/multibyte/c8rtomb.c
> @@ -0,0 +1,31 @@
> +#include <uchar.h>
> +#include <wchar.h>
> +
> +__attribute__((__noinline__))

Where we use this so far is conditioned on #ifdef __GNUC__. I'm not
sure if it's actually helping, but I think it'd make sense to be
consistent.

> +static size_t c8rtomb_slow(char *__restrict s, unsigned char c8, mbstate_t *__restrict st)
> +{
> +	// We need an internal state different from mbrtowc.
> +	static mbstate_t internal_state;
> +	if (!st) st = &internal_state;
> +	wchar_t wc;
> +
> +	// mbrtowc has return values -2, -1, 0, 1.
> +	size_t res = mbrtowc(&wc, (char const*)&c8, 1, st);
> +	switch (res) {
> +	// our implementation of wcrtomb ignores the state
> +	case  1: res = wcrtomb(s, wc, 0); break;
> +	case  0: res = 1; if (s) *s = 0;  break;
> +	}
> +	return res;
> +}

At first I was confused by the conversion back and forth, but indeed
it absolutely makes sense to handle the weird buffering this interface
imposes a requirement for.

However, it seems to be malfunctioning in the C locale, where
arbitrary bytes of c8 input would be accepted (despite not being valid
UTF-8, as the interface requires).

Logically, the right thing to do is to set the calling thread's locale
to __c_dot_utf8_locale for the duration of the mbrtowc only. However,
I think there may be simpler solutions depending on what you interpret
the interface requirement as being.

If c8rtomb is required to accept valid UTF-8 up to the penultimate
byte then error only on storing it due to the locale not being able to
represent it, I don't see a good solution except what I just said.

But if it's permitted to preemptively say "sorry, nope, not gonna be
able to represent that" as soon as it sees the first byte, then when
MB_CUR_MAX==1, having c8rtomb_slow fail unconditionally with EILSEQ
seems like the best and easiest approach.

> +__attribute__((__noinline__))
> +static size_t mbrtoc8_slow(unsigned char *__restrict pc8, const char *__restrict src, size_t n, mbstate_t *__restrict st)
> +{
> +	static unsigned internal_state;
> +	wchar_t wc;
> +	unsigned* pending = st ? (void*)st : &internal_state;
> +
> +	// The high bit is set if there is still missing input. If it
> +	// is not set and the value is not zero, a previous call has
> +	// stacked the missing output bytes withing the state.
> +	if ((-*pending) > INT_MIN) {
> +		if (pc8) *pc8 = *pending;
> +		(*pending) >>= 8;
> +		return -3;
> +	}
> +
> +	// mbrtowc has return values -2, -1, 0, 1, ..., 4.
> +	size_t res = mbrtowc(&wc, src, n, (void*)pending);
> +	if (res <= 4) {
> +		_Alignas(unsigned) unsigned char s[8] = { 0 };
> +		// Write the result bytes to s over the word boundary
> +		// as we need it. Our wcrtomb implementation ignores
> +		// the state variable, there will be no errors and the
> +		// return value can be ignored.
> +		wcrtomb((void*)(s+3), wc, 0);
> +		// Depending on the endianess this maybe a single load
> +		// instruction. We want to be sure that the bytes are
> +		// in this order, and that the high-order byte is 0.
> +		*pending = (0u|s[4])<<000 | (0u|s[5])<<010 | (0u|s[6])<<020 | (0u|s[7])<<030;
> +		if (pc8) *pc8 = s[3];
> +	}
> +	return res;
> +}

Does _Alignas work the way you want it on an array like this? I was
under the impression you would need a structure to do that in a way
that actually meshes with the standard semantics for _Alignas and not
the GCC attribute's special behaviors.

A cleaner approach might be a union where the presence of the unsigned
union member yields the alignment. This would also be good since the
code is written in C99 which doesn't have _Alignas.

Overall it seems like a lot of complexity effort to set things up for
an efficient store, but if anyone's going to use this interface it's
probably worthwhile.

> +weak_alias(__mbrtoc8, mbrtoc8);

I think we already discussed this in another place, but in case I'm
misremembering, these aliases aren't needed unless you need a
namespace-safe way to call the function internally.

Rich