From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_INVALID,DKIM_SIGNED,
	MAILING_LIST_MULTI,RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no
	version=3.4.4
Received: (qmail 21334 invoked from network); 10 Nov 2022 08:08:14 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 10 Nov 2022 08:08:14 -0000
Received: (qmail 29993 invoked by uid 550); 10 Nov 2022 08:08:10 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 29970 invoked from network); 10 Nov 2022 08:08:09 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1668067678;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding;
	bh=isQNNBhTcA3D4DlAA7Ev304w78686BZ/Dam/olyjdW0=;
	b=Ts18836iBkO7n+OdCUYwwkcPsT95p2vcRmkud9guw4fzHH4CkGyWjy0OTqaMnfEi1lP1nB
	flwjy5BwX0nEjtvLBMtNLMo0ZfBHzo43qh0i0cFZzz4/S9AhEh47hXWSwsaV816n88j193
	HFhqvVtTehhczr/cSMBweRzk2kTLSPY=
X-MC-Unique: 75Wc7jgxNyybDtuxUb9T_w-1
From: Florian Weimer <fweimer@redhat.com>
To: musl@lists.openwall.com
Date: Thu, 10 Nov 2022 09:07:53 +0100
Message-ID: <875yfn5j1i.fsf@oldenburg.str.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 3.1 on 10.11.54.4
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

It has come to my attention that musl uses the range 0xDF80…0xDFFF to
cover the non-ASCII bytes (0x80…0xFF) in the POSIX locale:

/* Arbitrary encoding for representing code units instead of characters. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
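
For illustration, here is a minimal sketch of what these two macros compute.
The wrapper names codeunit_of and is_codeunit are hypothetical and not part
of musl. It assumes the usual two's-complement conversion to signed char.
ASCII bytes map to themselves, while bytes 0x80…0xFF land in 0xDF80…0xDFFF:

```c
/* Macros as quoted above. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical wrappers, for illustration only. */

/* Map one byte to its wchar_t value: bytes below 0x80 are unchanged,
   bytes 0x80..0xFF sign-extend and are masked into 0xDF80..0xDFFF. */
static inline unsigned codeunit_of(unsigned char b)
{
    return (unsigned)CODEUNIT(b);
}

/* Nonzero if wc lies in 0xDF80..0xDFFF. The unsigned subtraction wraps
   for smaller values, so one comparison checks both bounds. */
static inline int is_codeunit(unsigned wc)
{
    return IS_CODEUNIT(wc);
}
```

With these, codeunit_of(0x80) yields 0xDF80 and codeunit_of(0xFF) yields
0xDFFF, while codeunit_of('A') stays 'A'.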

There is a very similar surrogate character mapping for undecodable
UTF-8 bytes, suggested here:

  <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>

It uses 0xDC80…0xDCFF.  This has been picked up by various
implementations, including Python.
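
As a rough sketch of that scheme (the function names below are hypothetical
and not taken from any library), an undecodable byte 0xNN in 0x80…0xFF is
represented as the code point 0xDCNN, which is what Python's surrogateescape
error handler does:

```c
/* Hypothetical helpers illustrating the 0xDC80..0xDCFF mapping. */

/* Bytes below 0x80 are ASCII and pass through unchanged; a byte
   0x80..0xFF becomes 0xDC00 + byte, i.e. 0xDC80..0xDCFF. */
static inline unsigned escape_byte(unsigned char b)
{
    return b < 0x80 ? b : 0xdc00u + b;
}

/* Nonzero if wc lies in 0xDC80..0xDCFF (unsigned wrap-around makes
   the single comparison cover both bounds). */
static inline int is_escaped_byte(unsigned wc)
{
    return wc - 0xdc80u < 0x80u;
}

/* Recover the original byte; only meaningful if is_escaped_byte(wc). */
static inline unsigned char unescape_byte(unsigned wc)
{
    return (unsigned char)(wc - 0xdc00u);
}
```

The only difference from a 0xDF80-based mapping is the base of the range;
the low eight bits carry the original byte in both cases.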

Is there a reason why musl picked a different surrogate mapping here?
Isn't it similar enough to the UTF-8 hack that it makes sense to pick
the same range?

Thanks,
Florian