From: Florian Weimer
To: musl@lists.openwall.com
Date: Thu, 10 Nov 2022 09:07:53 +0100
Message-ID: <875yfn5j1i.fsf@oldenburg.str.redhat.com>
Subject: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

It has come to my attention that musl uses the range 0xDF80…0xDFFF to
cover the entire byte range:

  /* Arbitrary encoding for representing code units instead of characters. */
  #define CODEUNIT(c) (0xdfff & (signed char)(c))
  #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

There is a very similar surrogate character mapping for undecodable
UTF-8 bytes, suggested here:

It uses 0xDC80…0xDCFF. This has been picked up by various
implementations, including Python.

Is there a reason why musl picked a different surrogate mapping here?
Isn't it similar enough to the UTF-8 hack that it makes sense to pick
the same range?

Thanks,
Florian
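
For reference, the two mappings discussed in this message can be compared side by side. The sketch below copies CODEUNIT and IS_CODEUNIT from the musl excerpt quoted above; dc_codeunit and codeunit_to_byte are hypothetical helpers written only for this illustration and are not part of musl or of any other implementation. It assumes the usual conversion of values 0x80...0xFF to negative signed char values.

```c
#include <assert.h>

/* Copied from the musl excerpt quoted in the message. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper (not musl code): map a byte into the
   0xDC80...0xDCFF range mentioned in the message, leaving
   ASCII bytes unchanged. */
static inline unsigned dc_codeunit(unsigned char b)
{
    return b < 0x80 ? b : 0xdc00u + b;
}

/* Hypothetical helper (not musl code): recover the original byte
   from a value in 0xDF80...0xDFFF. The low 8 bits of the code unit
   value are the byte itself. */
static inline unsigned char codeunit_to_byte(unsigned wc)
{
    assert(IS_CODEUNIT(wc));
    return (unsigned char)wc;
}
```

With these definitions, a high byte such as 0xE9 maps to 0xDFE9 under the musl macro and to 0xDCE9 under the other scheme, so the two results differ by a constant offset of 0x300 for every byte in 0x80...0xFF, while ASCII bytes map to themselves in both.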