* [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
From: Florian Weimer
To: musl
Date: 2022-11-10 8:07 UTC

It has come to my attention that musl uses the range 0xDF80…0xDFFF to
cover the entire byte range:

  /* Arbitrary encoding for representing code units instead of characters. */
  #define CODEUNIT(c) (0xdfff & (signed char)(c))
  #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

There is a very similar surrogate character mapping for undecodable
UTF-8 bytes, suggested here:

  <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>

It uses 0xDC80…0xDCFF. This has been picked up by various
implementations, including Python.

Is there a reason why musl picked a different surrogate mapping here?
Isn't it similar enough to the UTF-8 hack that it makes sense to pick
the same range?

Thanks,
Florian
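[Editor's note: for readers unfamiliar with C's signed-char arithmetic, the effect of the two macros quoted above can be checked with a short Python transcription — this is an illustration, not musl code; it relies on Python's bitwise AND treating negative integers as infinite two's complement, which matches what the `(signed char)` cast produces in C.]

```python
def codeunit(byte: int) -> int:
    """Transcription of musl's CODEUNIT(c): 0xdfff & (signed char)(c)."""
    signed = byte - 0x100 if byte >= 0x80 else byte  # the (signed char) cast
    return 0xDFFF & signed  # negative ints AND like two's complement in Python

def is_codeunit(c: int) -> bool:
    """Transcription of IS_CODEUNIT(c): (unsigned)(c)-0xdf80 < 0x80."""
    return 0 <= c - 0xDF80 < 0x80

# High bytes 0x80..0xFF land in the surrogate range 0xDF80..0xDFFF...
assert codeunit(0x80) == 0xDF80
assert codeunit(0xFF) == 0xDFFF
assert all(is_codeunit(codeunit(b)) for b in range(0x80, 0x100))
# ...while ASCII bytes map to themselves and are not flagged as code units.
assert codeunit(0x41) == 0x41
assert not is_codeunit(0x41)
```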
* Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
From: Rich Felker
To: Florian Weimer
Cc: musl
Date: 2022-11-10 14:44 UTC

On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
> cover the entire byte range:
>
>   /* Arbitrary encoding for representing code units instead of characters. */
>   #define CODEUNIT(c) (0xdfff & (signed char)(c))
>   #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
>
> There is a very similar surrogate character mapping for undecodable
> UTF-8 bytes, suggested here:
>
>   <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
>
> It uses 0xDC80…0xDCFF. This has been picked up by various
> implementations, including Python.
>
> Is there a reason why musl picked a different surrogate mapping here?
> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
> the same range?

I'll have to look back through archives to see what the motivations
for the particular range were -- I seem to recall there being some.
But I think the more important thing here is the *lack* of any
motivation to align with anything else. The values here are explicitly
*not* intended for use in any sort of information interchange. They're
invalid codes that are not Unicode scalar values, and the only reason
they exist at all is to make application-internal (or even
implementation-internal, in the case of regex/glob/etc.)
round-tripping work in the byte-based C locale while avoiding
assigning character properties to the bytes or inadvertently handling
them in a way that might facilitate pretending they're just latin1.

Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
that appeared in a stream expected to be UTF-8" and "bytes of what's
expected to be valid UTF-8 being treated bytewise for processing by
user request" are related. The proposal you linked is a decent
implementation-internal choice for handling data in a binary-clean
manner where that's needed (e.g. a text editor operating on files
containing a mix of text and binary data or a mix of text encodings),
but I think (or at least hope?) that in the years since it was
written, there's come to be a consensus that it is *not* a good idea
to do this as a "decoding" operation (where the data is saved out as
invalid UTF-16 or -32 and used in interchange, as opposed to just
internally) because it breaks lots of the good properties of UTF-8.

Rich
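[Editor's note: the interchange hazard described above is easy to demonstrate with Python's `surrogateescape` handler (the 0xDC80…0xDCFF mapping mentioned earlier in the thread) — the escaped result contains lone surrogates, which are not Unicode scalar values, so a strict UTF-8 encoder refuses them and the data cannot leak into interchange by accident.]

```python
# A stray non-UTF-8 byte decodes to a lone surrogate under surrogateescape...
s = b"abc\xff".decode("utf-8", "surrogateescape")
assert s == "abc\udcff"

# ...and strict UTF-8 (the interchange path) rejects that lone surrogate.
try:
    s.encode("utf-8")
    raise AssertionError("unexpectedly encoded a lone surrogate")
except UnicodeEncodeError:
    pass  # expected: surrogates are not allowed in well-formed UTF-8
```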
* Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
From: Florian Weimer
To: Rich Felker
Cc: musl
Date: 2022-11-11 15:02 UTC

* Rich Felker:

> I'll have to look back through archives to see what the motivations
> for the particular range were -- I seem to recall there being some.
> But I think the more important thing here is the *lack* of any
> motivation to align with anything else. The values here are explicitly
> *not* intended for use in any sort of information interchange. They're
> invalid codes that are not Unicode scalar values, and the only reason
> they exist at all is to make application-internal (or even
> implementation-internal, in the case of regex/glob/etc.)
> round-tripping work in the byte-based C locale while avoiding
> assigning character properties to the bytes or inadvertently handling
> them in a way that might facilitate pretending they're just latin1.

For glibc, we are doing this because POSIX requires it for the C
(POSIX) locale: it is now required to be a single-byte character set
with wchar_t mappings for all bytes. Previously, I had hoped to
transition to UTF-8 by default (possibly with a surrogate-escape
encoding like Python's).

I guess as an alternative, we could just use the Latin-1 mapping. Why
hasn't musl done this? Because it would promote the idea that the world
is Latin-1?

> Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> that appeared in a stream expected to be UTF-8" and "bytes of what's
> expected to be valid UTF-8 being treated bytewise for processing by
> user request" are related.

I think those two are fairly similar? But “fake single-byte character
set due to POSIX mandate” is different?

Thanks,
Florian
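[Editor's note: the "surrogate-escape encoding like Python's" mentioned above is a lossless byte round trip — undecodable bytes 0x80…0xFF map to U+DC80…U+DCFF on decode and are restored on encode. A quick sketch:]

```python
raw = bytes(range(0x100))  # every byte value; 0x80..0xFF are not valid UTF-8 here
s = raw.decode("utf-8", "surrogateescape")

# Bytes that don't form valid UTF-8 come back as U+DC80..U+DCFF...
assert "\udc80" in s and "\udcff" in s
# ...and encoding with the same handler reproduces the input exactly.
assert s.encode("utf-8", "surrogateescape") == raw
```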
* Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
From: Rich Felker
To: Florian Weimer
Cc: musl
Date: 2022-11-11 15:38 UTC

On Fri, Nov 11, 2022 at 04:02:23PM +0100, Florian Weimer wrote:
> For glibc, we are doing this because POSIX requires this for the C
> (POSIX) locale. It's now required to use a single-byte character set
> with wchar_t mappings for all bytes. Previously, I had hoped to
> transition to UTF-8 by default (possibly with a surrogate-escape
> encoding like Python's).

Yes, that's entirely my fault and I'm so sorry. I reported a bug where
an interface's spec was ambiguous because they hadn't considered the
possibility that the C locale might be multibyte, and rather than fix
it, all the old-timers freaked out that something they were taking for
granted (that the C locale would be byte-based) wasn't actually
specified.

> I guess as an alternative, we could just use the Latin-1 mapping. Why
> hasn't musl done this? Because it would promote the idea that the world
> is Latin-1?

Exactly. musl has always been very intentional about not supporting
legacy m17n-incompatible encodings, and about the principle that
character identity under musl is not locale-specific. So, when we got
stuck having to do a byte-based C locale because of the above
unfortunate outcome, what we strove for was a way to express "these
are code units of UTF-8 being processed as individual bytes for a
workflow where the user wants to operate on bytes".

> > Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> > that appeared in a stream expected to be UTF-8" and "bytes of what's
> > expected to be valid UTF-8 being treated bytewise for processing by
> > user request" are related.
>
> I think those two are fairly similar? But “fake single-byte character
> set due to POSIX mandate” is different?

They admit the same mechanism, and yes, they at least have
similarities, but the problems themselves are somewhat different, I
think. And the former has lots of weird, likely unwanted behaviors,
like decode(concat(a,b)) != concat(decode(a),decode(b)), that arise
from the mapping only being taken in the 'error path' rather than
applied to all data uniformly.

Regardless of whether there's a technical reason 0xDF80… is better
than 0xDC80…, I think I'd generally be disinclined to change anything
now. Not because I want to preserve an existing mapping that nothing
should be relying on, but because the only practical motivation for a
change would be to align the mapping for interchange purposes -- which
means that even if we say "this is explicitly not for interchange
purposes", to anyone reading the change it clearly is for interchange
purposes, because that's the only effect, and thereby we might as well
be saying "go ahead and use this for interchange purposes!"

Rich
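[Editor's note: the non-distributive behavior noted above — the mapping firing only on the error path — can be shown concretely with Python's `surrogateescape` handler by splitting a valid two-byte UTF-8 sequence across chunks, so the two decode orders disagree.]

```python
a, b = b"\xc3", b"\xa9"  # the two bytes of UTF-8 "é" (U+00E9), split apart

whole = (a + b).decode("utf-8", "surrogateescape")       # decode(concat(a,b))
piecewise = (a.decode("utf-8", "surrogateescape")
             + b.decode("utf-8", "surrogateescape"))     # concat(decode(a),decode(b))

assert whole == "\u00e9"            # joined, the bytes decode as "é"
assert piecewise == "\udcc3\udca9"  # apart, each byte hits the error path
assert whole != piecewise           # decode does not distribute over concat
```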
Code repositories for project(s) associated with this public inbox: https://git.vuxu.org/mirror/musl/