From: Florian Weimer
To: Rich Felker
Cc: musl@lists.openwall.com
Date: Fri, 11 Nov 2022 16:02:23 +0100
Subject: Re: [musl] Choice of wchar_t mapping for non-ASCII bytes in the POSIX locale
Message-ID: <87v8nlo7pc.fsf@oldenburg.str.redhat.com>
References: <875yfn5j1i.fsf@oldenburg.str.redhat.com> <20221110144447.GI29905@brightrain.aerifal.cx>

* Rich Felker:

> On Thu, Nov 10, 2022 at 09:07:53AM +0100, Florian Weimer wrote:
>> It has come to my attention that musl uses the range 0xDF80…0xDFFF to
>> cover the entire byte range:
>>
>> /* Arbitrary encoding for representing code units instead of characters. */
>> #define CODEUNIT(c) (0xdfff & (signed char)(c))
>> #define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)
>>
>> There is a very similar surrogate character mapping for undecodable
>> UTF-8 bytes, suggested here:
>>
>> It uses 0xDC80…0xDCFF.  This has been picked up by various
>> implementations, including Python.
>>
>> Is there a reason why musl picked a different surrogate mapping here?
>> Isn't it similar enough to the UTF-8 hack that it makes sense to pick
>> the same range?
>
> I'll have to look back through archives to see what the motivations
> for the particular range were -- I seem to recall there being some.
> But I think the more important thing here is the *lack* of any
> motivation to align with anything else. The values here are explicitly
> *not* intended for use in any sort of information interchange. They're
> invalid codes that are not Unicode scalar values, and the only reason
> they exist at all is to make application-internal (or even
> implementation-internal, in the case of regex/glob/etc.)
> round-tripping work in the byte-based C locale while avoiding
> assigning character properties to the bytes or inadvertently handling
> them in a way that might facilitate pretending they're just latin1.

For glibc, we are doing this because POSIX requires this for the C
(POSIX) locale.  It's now required to use a single-byte character set
with wchar_t mappings for all bytes.  Previously, I had hoped to
transition to UTF-8 by default (possibly with a surrogate-escape
encoding like Python's).

I guess as an alternative, we could just use the Latin-1 mapping.  Why
hasn't musl done this?  Because it would promote the idea that the world
is Latin-1?

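To make the three candidate mappings concrete, here is a minimal sketch
in C.  The CODEUNIT and IS_CODEUNIT macros are the ones quoted above;
every other name (map_musl, map_surrogateescape, map_latin1, the unmap_*
helpers, is_surrogate, check_roundtrip) is hypothetical and exists only
for illustration.  The surrogate-escape helper assumes the 0xDC80…0xDCFF
scheme as described in this thread, not any particular implementation's
code, and the sketch assumes the usual wrap-around when a value above
127 is converted to signed char.

```c
#include <assert.h>
#include <stdint.h>

/* Macros as quoted from musl earlier in this message. */
#define CODEUNIT(c) (0xdfff & (signed char)(c))
#define IS_CODEUNIT(c) ((unsigned)(c)-0xdf80 < 0x80)

/* Hypothetical helper: musl-style mapping.  ASCII maps to itself,
 * bytes 0x80..0xFF land in 0xDF80..0xDFFF (assumes the conversion to
 * signed char wraps, as it does on common compilers). */
static inline uint32_t map_musl(unsigned char b)
{
    return (uint32_t)CODEUNIT(b);
}

/* Hypothetical helper: surrogate-escape style mapping, assuming the
 * 0xDC80..0xDCFF range mentioned in the thread. */
static inline uint32_t map_surrogateescape(unsigned char b)
{
    return b < 0x80 ? b : 0xDC00u + b;
}

/* Hypothetical helper: Latin-1 mapping, byte value == code point. */
static inline uint32_t map_latin1(unsigned char b)
{
    return b;
}

/* Reverse mappings; each returns -1 for values with no byte. */
static inline int unmap_musl(uint32_t wc)
{
    if (wc < 0x80) return (int)wc;
    return IS_CODEUNIT(wc) ? (int)(wc & 0xff) : -1;
}

static inline int unmap_surrogateescape(uint32_t wc)
{
    if (wc < 0x80) return (int)wc;
    return (wc - 0xDC80u < 0x80u) ? (int)(wc - 0xDC00u) : -1;
}

static inline int unmap_latin1(uint32_t wc)
{
    return wc < 0x100 ? (int)wc : -1;
}

/* Surrogates (0xD800..0xDFFF) are not Unicode scalar values. */
static inline int is_surrogate(uint32_t wc)
{
    return wc - 0xD800u < 0x800u;
}

/* Check that all three mappings round-trip every byte, and that only
 * the Latin-1 mapping turns high bytes into real scalar values. */
void check_roundtrip(void)
{
    for (int i = 0; i < 256; i++) {
        unsigned char b = (unsigned char)i;
        assert(unmap_musl(map_musl(b)) == i);
        assert(unmap_surrogateescape(map_surrogateescape(b)) == i);
        assert(unmap_latin1(map_latin1(b)) == i);
        if (i >= 0x80) {
            assert(is_surrogate(map_musl(b)));
            assert(is_surrogate(map_surrogateescape(b)));
            assert(!is_surrogate(map_latin1(b)));
        }
    }
}
```

All three round-trip every byte.  The difference is that with the
Latin-1 mapping a byte such as 0xE9 becomes U+00E9, a real character
with character properties, whereas both surrogate ranges yield values
that are not Unicode scalar values.
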
> Aside from that, I'm not sure how closely "invalid non-UTF-8 bytes
> that appeared in a stream expected to be UTF-8" and "bytes of what's
> expected to be valid UTF-8 being treated bytewise for processing by
> user request" are related.

I think those two are fairly similar?  But “fake single-byte character
set due to POSIX mandate” is different?

Thanks,
Florian