From: Josiah Worcester
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Revisiting byte-based C locale
Date: Thu, 21 May 2015 23:04:47 -0500
References: <20150522022203.GA26651@brightrain.aerifal.cx>
In-Reply-To: <20150522022203.GA26651@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Thu, May 21, 2015 at 9:22 PM, Rich Felker wrote:
> The last time the byte-based C locale topic was visited ("Possible
> bytelocale patch", http://www.openwall.com/lists/musl/2014/07/03/2),
> it was a rather ugly patch introducing lots of code duplication. Now,
> I believe the callers of multibyte/wide char functions which need to
> always work in UTF-8 mode (iconv) or need to match a previously-saved
> mode (stdio wide functions, which save the encoding in the FILE when
> it becomes wide-oriented) can simply swap __pthread_self()->locale
> back and forth. There is no longer a possibility that the thread
> pointer may be uninitialized, nor a heavy synchronization cost of
> switching thread-local locales from the atomics in uselocale -- commit
> 68630b55c0c7219fe9df70dc28ffbf9efc8021d8 removed all that.
>
> Thus, I think we're at a point where we can evaluate the choice to
> support or not to support a byte-based C locale on the basis of things
> like standards conformance and impact on users and on software
> compatibility without having to weigh implementation costs (which
> would have contributed to "impact on users").
>
> Since last year, the issue of byte-based C locale has come up a few
> more times as a stumbling point for users on the IRC channel and/or
> mailing list (I forget which and haven't gone back to look it up yet).
> In particular, broken configure tests passing binary data to grep
> failed, and I believe one or more language interpreters loading source
> files in the C locale errored out due to a Latin-1 encoded "©"
> character in source comments. Personally I'm in favor of getting the
> broken stuff fixed, but I can see both sides.
>
> There are also minor conformance reasons to consider the byte-based C
> locale even without accepting the resolution to Austin Group issue 663
> (which is supposedly imposing the requirement, someday). In
> particular, the C standard seems to allow the current behavior of
> musl, where the C locale has extra characters for which isw*() return
> true, as long as the non-wide is*() functions don't have such extra
> characters. C doesn't even define abstract character classes that
> these functions report, just loose requirements on their behavior. But
> POSIX specifies LC_CTYPE in terms of character classes which have
> members, and does not leave room for extra characters in the C locale
> as far as I can tell. This could affect real-world usage cases where
> an application intentionally running in the C locale expects the
> regex/fnmatch bracket [[:alpha:]] not to match anything but ASCII
> letters. As mentioned several times in the past, this non-conformance
> could be addressed by changes in the isw*() functions (making them
> locale-aware) rather than by adding the byte-based C locale, but if
> there are other motivations to support the byte-based C locale, it
> may make sense to solve both issues with one change.
>
> Any new opinions on the topic? Or interest in re-emphasizing a
> previously stated opinion? :)
>
> Rich

Given the POSIX rules on LC_CTYPE character classes affecting
[[:alpha:]], it seems to me now that the clear intent (if not the
explicit statement) is in fact for a byte-based C locale. Though
maybe unfortunate, it does seem as though that is in fact the most
conformant way of doing it, and conforming looks to have little cost
now.