From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/7741
Path: news.gmane.org!not-for-mail
From: Josiah Worcester <josiahw@gmail.com>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Revisiting byte-based C locale
Date: Thu, 21 May 2015 23:04:47 -0500
Message-ID: <CAMAJcuCK=nh9=Kvsh5h5Et3tw5XDQv5XGy4OozyZOLyft9LgVA@mail.gmail.com>
References: <20150522022203.GA26651@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Trace: ger.gmane.org 1432267504 3428 80.91.229.3 (22 May 2015 04:05:04 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 22 May 2015 04:05:04 +0000 (UTC)
To: musl@lists.openwall.com
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
In-Reply-To: <20150522022203.GA26651@brightrain.aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:7741
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/7741>

On Thu, May 21, 2015 at 9:22 PM, Rich Felker <dalias@libc.org> wrote:
> The last time the byte-based C locale topic was visited ("Possible
> bytelocale patch", http://www.openwall.com/lists/musl/2014/07/03/2),
> it was a rather ugly patch introducing lots of code duplication. Now,
> I believe the callers of multibyte/wide char functions which need to
> always work in UTF-8 mode (iconv) or need to match a previously-saved
> mode (stdio wide functions, which save the encoding in the FILE when
> it becomes wide-oriented) can simply swap __pthread_self()->locale
> back and forth. There is no longer a possibility that the thread
> pointer may be uninitialized, nor a heavy synchronization cost of
> switching thread-local locales from the atomics in uselocale -- commit
> 68630b55c0c7219fe9df70dc28ffbf9efc8021d8 removed all that.
>
> Thus, I think we're at a point where we can evaluate the choice to
> support or not to support a byte-based C locale on the basis of things
> like standards conformance and impact on users and on software
> compatibility without having to weigh implementation costs (which
> would have contributed to "impact on users").
>
> Since last year, the issue of byte-based C locale has come up a few
> more times as a stumbling point for users on the IRC channel and/or
> mailing list (I forget which and haven't gone back to look it up yet).
> In particular, broken configure tests passing binary data to grep
> failed, and I believe one or more language interpreters loading source
> files in the C locale errored out due to a Latin-1 encoded "©"
> character in source comments. Personally I'm in favor of getting the
> broken stuff fixed, but I can see both sides.
>
> There are also minor conformance reasons to consider the byte-based C
> locale even without accepting the resolution to Austin Group issue 663
> (which is supposedly imposing the requirement, someday). In
> particular, the C standard seems to allow the current behavior of
> musl, where the C locale has extra characters for which isw*() return
> true, as long as the non-wide is*() functions don't have such extra
> characters. C doesn't even define abstract character classes that
> these functions report, just loose requirements on their behavior. But
> POSIX specifies LC_CTYPE in terms of character classes which have
> members, and does not leave room for extra characters in the C locale
> as far as I can tell. This could affect real-world usage cases where
> an application intentionally running in the C locale expects the
> regex/fnmatch bracket [[:alpha:]] not to match anything but ASCII
> letters. As mentioned several times in the past, this non-conformance
> could be addressed by changes in the isw*() functions (making them
> locale-aware) rather than by adding the byte-based C locale, but if
> there are other motivations to support the byte-based C locale, it
> may make sense to solve both issues with one change.
>
> Any new opinions on the topic? Or interest in re-emphasizing a
> previously stated opinion? :)
>
> Rich

Given the POSIX rules on LC_CTYPE character classes affecting
[[:alpha:]], it seems to me now that the clear intent (if not the
explicit statement) is in fact for a byte-based C locale. Though maybe
unfortunate, it does seem as though that is the most conformant way of
doing it, and conforming looks to have little cost now.