[musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale


* [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
@ 2022-11-10  8:07 Florian Weimer
  2022-11-10 14:44 ` Rich Felker
  0 siblings, 1 reply; 4+ messages in thread
From: Florian Weimer @ 2022-11-10  8:07 UTC
  To: musl

It has come to my attention that musl uses the range 0xDF80…0xDFFF to
cover the entire byte range:

/* Arbitrary encoding for representing code units instead of characters. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

There is a very similar surrogate character mapping for undecodable
UTF-8 bytes, suggested here:

  <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>

It uses 0xDC80…0xDCFF.  This has been picked up by various
implementations, including Python.

Is there a reason why musl picked a different surrogate mapping here?
Isn't it similar enough to the UTF-8 hack that it makes sense to pick
the same range?

Thanks,
Florian


* Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
  2022-11-10  8:07 [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale Florian Weimer
@ 2022-11-10 14:44 ` Rich Felker
  2022-11-11 15:02   ` Florian Weimer
  0 siblings, 1 reply; 4+ messages in thread
From: Rich Felker @ 2022-11-10 14:44 UTC
  To: Florian Weimer; +Cc: musl

On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
> cover the entire byte range:
> 
> /* Arbitrary encoding for representing code units instead of characters. */
> #define CODEUNIT(c) (0xdfff & (signed char)(c))
> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
> 
> There is a very similar surrogate character mapping for undecodable
> UTF-8 bytes, suggested here:
> 
>   <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
> 
> It uses 0xDC80…0xDCFF.  This has been picked up by various
> implementations, including Python.
> 
> Is there a reason why musl picked a different surrogate mapping here?
> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
> the same range?

I'll have to look back through archives to see what the motivations
for the particular range were -- I seem to recall there being some.
But I think the more important thing here is the *lack* of any
motivation to align with anything else. The values here are explicitly
*not* intended for use in any sort of information interchange. They're
invalid codes that are not Unicode scalar values, and the only reason
they exist at all is to make application-internal (or even
implementation-internal, in the case of regex/glob/etc.)
round-tripping work in the byte-based C locale while avoiding
assigning character properties to the bytes or inadvertently handling
them in a way that might facilitate pretending they're just latin1.

Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
that appeared in a stream expected to be UTF-8" and "bytes of what's
expected to be valid UTF-8 being treated bytewise for processing by
user request" are related.

The proposal you linked is a decent implementation-internal choice for
handling data in a binary-clean manner where that's needed (e.g. a
text editor operating on files containing a mix of text and binary
data or a mix of text encodings), but I think (or at least hope?) that
in the years since it was written, there's come to be a consensus that
it is *not* a good idea to do this as a "decoding" operation (where
the data is saved out as invalid UTF-16 or -32 and used in
interchange, as opposed to just internally) because it breaks lots of
the good properties of UTF-8.

Rich


* Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
  2022-11-10 14:44 ` Rich Felker
@ 2022-11-11 15:02   ` Florian Weimer
  2022-11-11 15:38     ` Rich Felker
  0 siblings, 1 reply; 4+ messages in thread
From: Florian Weimer @ 2022-11-11 15:02 UTC
  To: Rich Felker; +Cc: musl

* Rich Felker:

> On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
>> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
>> cover the entire byte range:
>> 
>> /* Arbitrary encoding for representing code units instead of characters. */
>> #define CODEUNIT(c) (0xdfff & (signed char)(c))
>> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
>> 
>> There is a very similar surrogate character mapping for undecodable
>> UTF-8 bytes, suggested here:
>> 
>>   <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
>> 
>> It uses 0xDC80…0xDCFF.  This has been picked up by various
>> implementations, including Python.
>> 
>> Is there a reason why musl picked a different surrogate mapping here?
>> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
>> the same range?
>
> I'll have to look back through archives to see what the motivations
> for the particular range were -- I seem to recall there being some.
> But I think the more important thing here is the *lack* of any
> motivation to align with anything else. The values here are explicitly
> *not* intended for use in any sort of information interchange. They're
> invalid codes that are not Unicode scalar values, and the only reason
> they exist at all is to make application-internal (or even
> implementation-internal, in the case of regex/glob/etc.)
> round-tripping work in the byte-based C locale while avoiding
> assigning character properties to the bytes or inadvertently handling
> them in a way that might facilitate pretending they're just latin1.

For glibc, we are doing this because POSIX requires this for the C
(POSIX) locale.  It's now required to use a single-byte character set
with wchar_t mappings for all bytes.  Previously, I had hoped to
transition to UTF-8 by default (possibly with a surrogate-escape
encoding like Python's).

I guess as an alternative, we could just use the Latin-1 mapping.  Why
hasn't musl done this?  Because it would promote the idea that the world
is Latin-1?

> Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> that appeared in a stream expected to be UTF-8" and "bytes of what's
> expected to be valid UTF-8 being treated bytewise for processing by
> user request" are related.

I think those two are fairly similar?  But “fake single-byte character
set due to POSIX mandate” is different?

Thanks,
Florian



* Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
  2022-11-11 15:02   ` Florian Weimer
@ 2022-11-11 15:38     ` Rich Felker
  0 siblings, 0 replies; 4+ messages in thread
From: Rich Felker @ 2022-11-11 15:38 UTC
  To: Florian Weimer; +Cc: musl

On Fri, Nov 11, 2022 at 04:02:23PM +0100, Florian Weimer wrote:
> * Rich Felker:
> 
> > On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
> >> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
> >> cover the entire byte range:
> >> 
> >> /* Arbitrary encoding for representing code units instead of characters. */
> >> #define CODEUNIT(c) (0xdfff & (signed char)(c))
> >> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
> >> 
> >> There is a very similar surrogate character mapping for undecodable
> >> UTF-8 bytes, suggested here:
> >> 
> >>   <https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
> >> 
> >> It uses 0xDC80…0xDCFF.  This has been picked up by various
> >> implementations, including Python.
> >> 
> >> Is there a reason why musl picked a different surrogate mapping here?
> >> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
> >> the same range?
> >
> > I'll have to look back through archives to see what the motivations
> > for the particular range were -- I seem to recall there being some.
> > But I think the more important thing here is the *lack* of any
> > motivation to align with anything else. The values here are explicitly
> > *not* intended for use in any sort of information interchange. They're
> > invalid codes that are not Unicode scalar values, and the only reason
> > they exist at all is to make application-internal (or even
> > implementation-internal, in the case of regex/glob/etc.)
> > round-tripping work in the byte-based C locale while avoiding
> > assigning character properties to the bytes or inadvertently handling
> > them in a way that might facilitate pretending they're just latin1.
> 
> For glibc, we are doing this because POSIX requires this for the C
> (POSIX) locale.  It's now required to use a single-byte character set
> with wchar_t mappings for all bytes.  Previously, I had hoped to
> transition to UTF-8 by default (possibly with a surrogate-escape
> encoding like Python's).

Yes, that's entirely my fault and I'm so sorry. I reported a bug where
an interface's spec was ambiguous because they hadn't considered the
possibility that the C locale might be multibyte, and rather than fix
it, all the old-timers freaked out that something they were taking for
granted (that the C locale would be byte-based) wasn't actually
specified.

> I guess as an alternative, we could just use the Latin-1 mapping.  Why
> hasn't musl done this?  Because it would promote the idea that the world
> is Latin-1?

Exactly. musl has always been very intentional about not supporting
legacy m17n-incompatible encodings, and about character identity under
musl not being locale-specific. So, when we got stuck having to do a
byte-based C locale because of the above unfortunate outcome, what we
strove for was a way to express "these are code units of UTF-8 being
processed as individual bytes for a workflow where the user wants to
operate on bytes".

> > Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> > that appeared in a stream expected to be UTF-8" and "bytes of what's
> > expected to be valid UTF-8 being treated bytewise for processing by
> > user request" are related.
> 
> I think those two are fairly similar?  But “fake single-byte character
> set due to POSIX mandate” is different?

They admit the same mechanism and, yes, they at least have
"similarities", but the problems themselves are somewhat different, I
think. And the former has lots of weird, likely unwanted behaviors,
like decode(concat(a,b)) != concat(decode(a),decode(b)), that arise
from the mapping only being taken in the 'error path' rather than
applied to all data uniformly.

Regardless of whether there's a technical reason DF80... is better
than DC80..., I think I'd generally be disinclined to change anything
now. Not because I want to preserve an existing mapping that nothing
should be relying on, but because the only practical motivation for a
change would be to align the mapping for interchange purposes -- which
means, even if we say "this is explicitly not for interchange
purposes", to anyone reading the change it clearly is for interchange
purposes because that's the only effect, and thereby, we might as well
be saying "go ahead and use this for interchange purposes!"

Rich

