From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/7740
Path: news.gmane.org!not-for-mail
From: Rich Felker <dalias@libc.org>
Newsgroups: gmane.linux.lib.musl.general
Subject: Revisiting byte-based C locale
Date: Thu, 21 May 2015 22:22:03 -0400
Message-ID: <20150522022203.GA26651@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: ger.gmane.org 1432261353 17521 80.91.229.3 (22 May 2015 02:22:33 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 22 May 2015 02:22:33 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-7752-gllmg-musl=m.gmane.org@lists.openwall.com Fri May 22 04:22:28 2015
Return-path: <musl-return-7752-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-7752-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1YvcbR-0008Ce-Qn
	for gllmg-musl@m.gmane.org; Fri, 22 May 2015 04:22:25 +0200
Original-Received: (qmail 5742 invoked by uid 550); 22 May 2015 02:22:23 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 5668 invoked from network); 22 May 2015 02:22:17 -0000
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Original-Sender: Rich Felker <dalias@aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:7740
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/7740>

The last time the byte-based C locale topic was visited ("Possible
bytelocale patch", http://www.openwall.com/lists/musl/2014/07/03/2),
the proposal was a rather ugly patch introducing lots of code
duplication. Now,
I believe the callers of multibyte/wide char functions which need to
always work in UTF-8 mode (iconv) or need to match a previously-saved
mode (stdio wide functions, which save the encoding in the FILE when
it becomes wide-oriented) can simply swap __pthread_self()->locale
back and forth. The thread pointer can no longer be uninitialized, and
switching thread-local locales no longer carries a heavy
synchronization cost from the atomics in uselocale; commit
68630b55c0c7219fe9df70dc28ffbf9efc8021d8 removed both problems.
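
As a rough illustration of the swap-and-restore pattern, here is a
minimal sketch. It is not musl code: it uses the public uselocale()
interface as a stand-in for assigning __pthread_self()->locale
directly, and the helper name decode_in_locale is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not musl code: decode one character from s under
 * the LC_CTYPE rules of `mode`, then restore the caller's thread locale.
 * The public uselocale() stands in for the internal pointer swap. */
static size_t decode_in_locale(locale_t mode, wchar_t *wc,
                               const char *s, size_t n)
{
	assert(mode != (locale_t)0 && wc != NULL && s != NULL);

	mbstate_t st;
	memset(&st, 0, sizeof st);          /* start in the initial shift state */

	locale_t saved = uselocale(mode);   /* swap in the required mode */
	size_t r = mbrtowc(wc, s, n, &st);  /* conversion runs under `mode` */
	uselocale(saved);                   /* swap the caller's locale back */
	return r;
}
```

A caller like iconv would pass a locale that always means UTF-8, while
a wide-oriented stdio stream would pass whatever mode it saved when it
became wide-oriented; in both cases the caller's own locale is
untouched on return.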

Thus, I think we're at a point where we can evaluate the choice to
support or not to support a byte-based C locale on the basis of things
like standards conformance and impact on users and on software
compatibility without having to weigh implementation costs (which
would have contributed to "impact on users").

Since last year, the issue of byte-based C locale has come up a few
more times as a stumbling block for users on the IRC channel and/or
mailing list (I forget which and haven't gone back to look it up yet).
In particular, broken configure tests passing binary data to grep
failed, and I believe one or more language interpreters loading source
files in the C locale errored out due to a Latin-1 encoded "©"
character in source comments. Personally I'm in favor of getting the
broken stuff fixed, but I can see both sides.
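
For anyone wanting to check how a given libc treats such input, a
small probe along these lines may help. It is only a sketch: the
function name c_locale_accepts_byte is hypothetical, and the result for
high bytes such as 0xA9 (Latin-1 "©") depends on the implementation,
which is exactly the point under discussion.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical probe: returns 1 if the single byte b decodes as a
 * complete character in the "C" locale, 0 otherwise.
 * Side effect: sets the global LC_CTYPE category to "C". */
static int c_locale_accepts_byte(unsigned char b)
{
	const char *ok = setlocale(LC_CTYPE, "C");
	assert(ok != NULL);
	(void)ok;

	char in = (char)b;
	wchar_t wc;
	mbstate_t st;
	memset(&st, 0, sizeof st);

	size_t r = mbrtowc(&wc, &in, 1, &st);
	/* (size_t)-1 is an encoding error (EILSEQ); (size_t)-2 means the
	 * byte is only the start of a longer multibyte sequence. */
	return r != (size_t)-1 && r != (size_t)-2;
}
```

With a byte-based C locale, c_locale_accepts_byte(0xA9) would be
expected to return 1; with a UTF-8-only C locale a lone 0xA9 is not a
valid sequence and the probe would return 0.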

There are also minor conformance reasons to consider the byte-based C
locale even without accepting the resolution to Austin Group issue 663
(which is supposedly imposing the requirement, someday). In
particular, the C standard seems to allow the current behavior of
musl, where the C locale has extra characters for which isw*() return
true, as long as the non-wide is*() functions don't have such extra
characters. C doesn't even define abstract character classes that
these functions report, just loose requirements on their behavior. But
POSIX specifies LC_CTYPE in terms of character classes which have
members, and does not leave room for extra characters in the C locale
as far as I can tell. This could affect real-world use cases where
an application intentionally running in the C locale expects the
regex/fnmatch bracket [[:alpha:]] not to match anything but ASCII
letters. As mentioned several times in the past, this non-conformance
could be addressed by changes in the isw*() functions (making them
locale-aware) rather than by adding the byte-based C locale, but if
there are other motivations to support the byte-based C locale, it
may make sense to solve both issues with one change.
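
The [[:alpha:]] concern can be checked with a short test like the one
below. This is a sketch under assumptions: c_locale_has_alpha is a
hypothetical name, and only the standard regcomp/regexec interfaces
are used.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>

/* Hypothetical probe: returns 1 if s contains a character that the
 * bracket expression [[:alpha:]] matches in the "C" locale, else 0.
 * Side effect: sets the global locale to "C". */
static int c_locale_has_alpha(const char *s)
{
	assert(s != NULL);
	setlocale(LC_ALL, "C");

	regex_t re;
	int rc = regcomp(&re, "[[:alpha:]]", REG_NOSUB);
	assert(rc == 0);
	(void)rc;

	/* regexec returns 0 on a match, REG_NOMATCH otherwise. */
	int matched = regexec(&re, s, 0, NULL, 0) == 0;
	regfree(&re);
	return matched;
}
```

Calling c_locale_has_alpha("\xc3\xa9") (UTF-8 "é") is the interesting
case: a C locale with only the POSIX-specified class members should
return 0, while one with extra alphabetic characters may return 1.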

Any new opinions on the topic? Or interest in re-emphasizing a
previously stated opinion? :)

Rich