From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_INVALID,DKIM_SIGNED,
	MAILING_LIST_MULTI,RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no
	version=3.4.4
Received: (qmail 32197 invoked from network); 11 Nov 2022 15:02:45 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 11 Nov 2022 15:02:45 -0000
Received: (qmail 11857 invoked by uid 550); 11 Nov 2022 15:02:40 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 11816 invoked from network); 11 Nov 2022 15:02:39 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1668178947;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:  references:references;
	bh=ED7Rem5ApJh0aJrwrBM/ZbKCZOmyv9yZ88hQFEvAjK8=;
	b=Rx80ip8EO9eThTodNzUw/NTtpiNyPnIOutMn1tBp0tRM64hnUfAwo4cB4tnZo0Pir1GbYS
	423JbnVW77HasVbmcLczpetfGvPCghGnSugtE4hygdYeOcdx9mxkLgJj9vwWJziZ1XcKmg
	5CwoViagezKOkzBoKxytSURIfmvlDbQ=
X-MC-Unique: BNA_qNJaN3yMHPSXCkMm8Q-1
From: Florian Weimer <fweimer@redhat.com>
To: Rich Felker <dalias@libc.org>
Cc: musl@lists.openwall.com
References: <875yfn5j1i.fsf@oldenburg.str.redhat.com>
	<20221110144447.GI29905@brightrain.aerifal.cx>
Date: Fri, 11 Nov 2022 16:02:23 +0100
Message-ID: <87v8nlo7pc.fsf@oldenburg.str.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.1 on 10.11.54.4
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the
 POSIX locale

* Rich Felker:

> On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
>> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
>> cover the entire byte range:
>>
>> /* Arbitrary encoding for representing code units instead of characters. */
>> #define CODEUNIT(c) (0xdfff & (signed char)(c))
>> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
>>
>> There is a very similar surrogate character mapping for undecodable
>> UTF-8 bytes, suggested here:
>>=20
>>   <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
>>
>> It uses 0xDC80…0xDCFF.  This has been picked up by various
>> implementations, including Python.
>>=20
>> Is there a reason why musl picked a different surrogate mapping here?
>> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
>> the same range?
>
> I'll have to look back through archives to see what the motivations
> for the particular range were -- I seem to recall there being some.
> But I think the more important thing here is the *lack* of any
> motivation to align with anything else. The values here are explicitly
> *not* intended for use in any sort of information interchange. They're
> invalid codes that are not Unicode scalar values, and the only reason
> they exist at all is to make application-internal (or even
> implementation-internal, in the case of regex/glob/etc.)
> round-tripping work in the byte-based C locale while avoiding
> assigning character properties to the bytes or inadvertently handling
> them in a way that might facilitate pretending they're just latin1.

For glibc, we are doing this because POSIX requires this for the C
(POSIX) locale.  It's now required to use a single-byte character set
with wchar_t mappings for all bytes.  Previously, I had hoped to
transition to UTF-8 by default (possibly with a surrogate-escape
encoding like Python's).
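
To make the "wchar_t mappings for all bytes" requirement concrete, here
is a minimal sketch built on the two musl macros quoted above.  The
helper names (byte_to_wc, wc_to_byte, all_bytes_round_trip) are
hypothetical and only for illustration; this is not glibc or musl
code, and it assumes the usual two's-complement conversion to signed
char.

```c
#include <wchar.h>

/* Macros as quoted earlier in this thread (from musl). */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper: map one byte to a wchar_t.
 * ASCII bytes map to themselves, 0x80..0xFF map to 0xDF80..0xDFFF. */
static wchar_t byte_to_wc(unsigned char b)
{
	return (wchar_t)CODEUNIT(b);
}

/* Hypothetical helper: map a wchar_t back to a byte.
 * Returns -1 if the value has no single-byte representation. */
static int wc_to_byte(wchar_t wc)
{
	if ((unsigned)wc < 0x80)
		return (int)wc;
	if (IS_CODEUNIT(wc))
		return (int)(unsigned char)wc;
	return -1;
}

/* Returns 1 if every byte value survives byte -> wchar_t -> byte. */
static int all_bytes_round_trip(void)
{
	for (int b = 0; b < 256; b++) {
		wchar_t wc = byte_to_wc((unsigned char)b);
		if (wc_to_byte(wc) != b)
			return 0;
	}
	return 1;
}
```

With this mapping every byte has a distinct wchar_t value, while a
value such as 0x00E9 has no byte representation at all.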

I guess as an alternative, we could just use the Latin-1 mapping.  Why
hasn't musl done this?  Because it would promote the idea that the world
is Latin-1?
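
As a rough illustration of the three candidates under discussion, the
sketch below maps a high byte under each of them and checks whether
the result is a Unicode scalar value.  The enum and function names are
made up for this example, and it shows only the numeric mapping, not
any locale behaviour of an actual libc.

```c
#include <wchar.h>

/* Hypothetical labels for the three mappings discussed in this thread. */
enum mapping { MAP_LATIN1, MAP_DC80, MAP_DF80 };

/* Map a byte to a wchar_t under the chosen mapping.
 * ASCII bytes map to themselves in all three cases. */
static wchar_t map_byte(unsigned char b, enum mapping m)
{
	if (b < 0x80)
		return (wchar_t)b;
	switch (m) {
	case MAP_LATIN1:
		return (wchar_t)b;            /* U+0080..U+00FF */
	case MAP_DC80:
		return (wchar_t)(0xDC00 + b); /* 0xDC80..0xDCFF */
	default:
		return (wchar_t)(0xDF00 + b); /* 0xDF80..0xDFFF */
	}
}

/* A Unicode scalar value is any code point outside the surrogate range. */
static int is_unicode_scalar(unsigned long v)
{
	return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}
```

Under the Latin-1 mapping, byte 0xE9 becomes U+00E9, a real character
with its own properties; under either surrogate range it becomes a
value that is not a scalar value and so cannot be mistaken for text.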

> Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> that appeared in a stream expected to be UTF-8" and "bytes of what's
> expected to be valid UTF-8 being treated bytewise for processing by
> user request" are related.

I think those two are fairly similar?  But “fake single-byte character
set due to POSIX mandate” is different?

Thanks,
Florian