From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 10 Nov 2022 09:44:48 -0500
From: Rich Felker
To: Florian Weimer
Cc: musl@lists.openwall.com
Message-ID: <20221110144447.GI29905@brightrain.aerifal.cx>
References: <875yfn5j1i.fsf@oldenburg.str.redhat.com>
In-Reply-To: <875yfn5j1i.fsf@oldenburg.str.redhat.com>
Subject: Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
> cover the entire byte range:
>
> /* Arbitrary encoding for representing code units instead of characters. */
> #define CODEUNIT(c) (0xdfff & (signed char)(c))
> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
>
> There is a very similar surrogate character mapping for undecodable
> UTF-8 bytes, suggested here:
>
>
>
> It uses 0xDC80…0xDCFF. This has been picked up by various
> implementations, including Python.
>
> Is there a reason why musl picked a different surrogate mapping here?
> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
> the same range?

I'll have to look back through archives to see what the motivations for
the particular range were -- I seem to recall there being some. But I
think the more important thing here is the *lack* of any motivation to
align with anything else. The values here are explicitly *not* intended
for use in any sort of information interchange. They're invalid codes
that are not Unicode scalar values, and the only reason they exist at
all is to make application-internal (or even implementation-internal,
in the case of regex/glob/etc.) round-tripping work in the byte-based C
locale while avoiding assigning character properties to the bytes or
inadvertently handling them in a way that might facilitate pretending
they're just latin1.

Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes that
appeared in a stream expected to be UTF-8" and "bytes of what's
expected to be valid UTF-8 being treated bytewise for processing by
user request" are related. The proposal you linked is a decent
implementation-internal choice for handling data in a binary-clean
manner where that's needed (e.g. a text editor operating on files
containing a mix of text and binary data or a mix of text encodings),
but I think (or at least hope?) that in the years since it was written,
there's come to be a consensus that it is *not* a good idea to do this
as a "decoding" operation (where the data is saved out as invalid
UTF-16 or -32 and used in interchange, as opposed to just internally)
because it breaks lots of the good properties of UTF-8.

Rich