From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 10 Nov 2022 09:44:48 -0500
From: Rich Felker
To: Florian Weimer
Cc: musl@lists.openwall.com
Message-ID: <20221110144447.GI29905@brightrain.aerifal.cx>
References: <875yfn5j1i.fsf@oldenburg.str.redhat.com>
In-Reply-To: <875yfn5j1i.fsf@oldenburg.str.redhat.com>
Subject: Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
> cover the entire byte range:
>
> /* Arbitrary encoding for representing code units instead of characters. */
> #define CODEUNIT(c) (0xdfff & (signed char)(c))
> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
>
> There is a very similar surrogate character mapping for undecodable
> UTF-8 bytes, suggested here:
>
>
>
> It uses 0xDC80…0xDCFF. This has been picked up by various
> implementations, including Python.
>
> Is there a reason why musl picked a different surrogate mapping here?
> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
> the same range?

I'll have to look back through archives to see what the motivations for
the particular range were -- I seem to recall there being some. But I
think the more important thing here is the *lack* of any motivation to
align with anything else. The values here are explicitly *not* intended
for use in any sort of information interchange. They're invalid codes
that are not Unicode scalar values, and the only reason they exist at
all is to make application-internal (or even implementation-internal,
in the case of regex/glob/etc.) round-tripping work in the byte-based C
locale while avoiding assigning character properties to the bytes or
inadvertently handling them in a way that might facilitate pretending
they're just latin1.

Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes that
appeared in a stream expected to be UTF-8" and "bytes of what's
expected to be valid UTF-8 being treated bytewise for processing by
user request" are related. The proposal you linked is a decent
implementation-internal choice for handling data in a binary-clean
manner where that's needed (e.g. a text editor operating on files
containing a mix of text and binary data or a mix of text encodings),
but I think (or at least hope?) that in the years since it was written,
there's come to be a consensus that it is *not* a good idea to do this
as a "decoding" operation (where the data is saved out as invalid
UTF-16 or -32 and used in interchange, as opposed to just internally)
because it breaks lots of the good properties of UTF-8.

Rich