Date: Fri, 11 Nov 2022 10:38:12 -0500
From: Rich Felker
To: Florian Weimer
Cc: musl@lists.openwall.com
Message-ID: <20221111153812.GK29905@brightrain.aerifal.cx>
References: <875yfn5j1i.fsf@oldenburg.str.redhat.com> <20221110144447.GI29905@brightrain.aerifal.cx> <87v8nlo7pc.fsf@oldenburg.str.redhat.com>
In-Reply-To: <87v8nlo7pc.fsf@oldenburg.str.redhat.com>
Subject: Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale

On Fri, Nov 11, 2022 at 04:02:23PM +0100, Florian Weimer wrote:
> * Rich Felker:
>
> > On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
> >> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
> >> cover the entire byte range:
> >>
> >> /* Arbitrary encoding for representing code units instead of characters. */
> >> #define CODEUNIT(c) (0xdfff & (signed char)(c))
> >> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
> >>
> >> There is a very similar surrogate character mapping for undecodable
> >> UTF-8 bytes, suggested here:
> >>
> >>
> >>
> >> It uses 0xDC80…0xDCFF.
> >> This has been picked up by various
> >> implementations, including Python.
> >>
> >> Is there a reason why musl picked a different surrogate mapping here?
> >> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
> >> the same range?
> >
> > I'll have to look back through archives to see what the motivations
> > for the particular range were -- I seem to recall there being some.
> > But I think the more important thing here is the *lack* of any
> > motivation to align with anything else. The values here are explicitly
> > *not* intended for use in any sort of information interchange. They're
> > invalid codes that are not Unicode scalar values, and the only reason
> > they exist at all is to make application-internal (or even
> > implementation-internal, in the case of regex/glob/etc.)
> > round-tripping work in the byte-based C locale while avoiding
> > assigning character properties to the bytes or inadvertently handling
> > them in a way that might facilitate pretending they're just latin1.
>
> For glibc, we are doing this because POSIX requires this for the C
> (POSIX) locale. It's now required to use a single-byte character set
> with wchar_t mappings for all bytes. Previously, I had hoped to
> transition to UTF-8 by default (possibly with a surrogate-escape
> encoding like Python's).

Yes, that's entirely my fault and I'm so sorry. I reported a bug where
an interface's spec was ambiguous because they hadn't considered the
possibility that the C locale might be multibyte, and rather than fix
it, all the old-timers freaked out that something they were taking for
granted (that the C locale would be byte-based) wasn't actually
specified.

> I guess as an alternative, we could just use the Latin-1 mapping. Why
> hasn't musl done this? Because it would promote the idea that the world
> is Latin-1?

Exactly.
musl has always been very intentional about not supporting legacy
m17n-incompatible encodings, and about character identity under musl
not being locale-specific. So, when we got stuck having to do a
byte-based C locale because of the above unfortunate outcome, what we
strived for was a way to express "these are code units of UTF-8 being
processed as individual bytes for a workflow where the user wants to
operate on bytes".

> > Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> > that appeared in a stream expected to be UTF-8" and "bytes of what's
> > expected to be valid UTF-8 being treated bytewise for processing by
> > user request" are related.
>
> I think those two are fairly similar? But “fake single-byte character
> set due to POSIX mandate” is different?

They admit the same mechanism and yes, they at least have
"similarities", but the problems themselves are somewhat different, I
think. And the former has lots of weird, likely unwanted behaviors, like
decode(concat(a,b)) != concat(decode(a),decode(b)), that arise from the
mapping only being taken in the 'error path' rather than applied to all
data uniformly.

Regardless of whether there's a technical reason DF80... is better than
DC80..., I think I'd generally be disinclined to change anything now.
Not because I want to preserve an existing mapping that nothing should
be relying on, but because the only practical motivation for a change
would be to align the mapping for interchange purposes -- which means,
even if we say "this is explicitly not for interchange purposes", to
anyone reading the change it clearly is for interchange purposes because
that's the only effect, and thereby, we might as well be saying "go
ahead and use this for interchange purposes!"

Rich
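
For readers following the thread, here is a minimal sketch of the
round-tripping discussed above, built only on the two macros quoted
from musl. The helper names byte_to_wide and wide_to_byte are
hypothetical, chosen for illustration; they are not musl functions and
do not claim to show how musl wires the macros into its own interfaces.

```c
#include <assert.h>

/* Macros exactly as quoted in the thread. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper (not a musl function): map one byte of the
 * byte-based C locale to a wide value. ASCII bytes map to themselves;
 * bytes 0x80..0xFF land in 0xDF80..0xDFFF. This assumes the conversion
 * to signed char wraps, as it does on two's-complement targets. */
static unsigned byte_to_wide(unsigned char b)
{
    return (unsigned)CODEUNIT(b);
}

/* Hypothetical inverse: recover the original byte from a wide value
 * produced by byte_to_wide. Only ASCII and code-unit values are valid
 * inputs here. */
static unsigned char wide_to_byte(unsigned wc)
{
    assert(wc < 0x80 || IS_CODEUNIT(wc));
    return (unsigned char)wc;
}
```

Under these assumptions, byte_to_wide(0xC3) yields 0xDFC3 and
wide_to_byte(0xDFC3) returns 0xC3, while 'A' maps to 0x41 and is not
flagged by IS_CODEUNIT. A Latin-1 mapping would instead have produced
U+00C3, an assigned character with properties of its own, which is the
outcome the message argues against.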