From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/12065
Path: news.gmane.org!.POSTED!not-for-mail
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: setlocale behavior with 'missing' locales
Date: Wed, 8 Nov 2017 00:27:15 -0500
Message-ID: <20171108052715.GM1627@brightrain.aerifal.cx>
References: <20171108050338.GL1627@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: blaine.gmane.org 1510118848 30081 195.159.176.226 (8 Nov 2017 05:27:28 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Wed, 8 Nov 2017 05:27:28 +0000 (UTC)
User-Agent: Mutt/1.5.21 (2010-09-15)
To: musl@lists.openwall.com
Original-X-From: musl-return-12081-gllmg-musl=m.gmane.org@lists.openwall.com Wed Nov 08 06:27:23 2017
Return-path:
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200]) by blaine.gmane.org with smtp (Exim 4.84_2) (envelope-from ) id 1eCItX-0007gC-Cw for gllmg-musl@m.gmane.org; Wed, 08 Nov 2017 06:27:23 +0100
Original-Received: (qmail 9673 invoked by uid 550); 8 Nov 2017 05:27:28 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
List-ID:
Original-Received: (qmail 9649 invoked from network); 8 Nov 2017 05:27:27 -0000
Content-Disposition: inline
In-Reply-To: <20171108050338.GL1627@brightrain.aerifal.cx>
Original-Sender: Rich Felker
Xref: news.gmane.org gmane.linux.lib.musl.general:12065
Archived-At:

On Wed, Nov 08, 2017 at 12:03:38AM -0500, Rich Felker wrote:
> Unfortunately this turns out to have been something of a tradeoff,
> since there's no way for applications (and, as it turns out,
> especially tests/test suites) to query whether a particular locale is
> "really" available.
> I've been asked to change the behavior to fail on
> unknown locale names, but of course that's not a working option in
> light of the above.
>
> I think there may be a solution that makes everyone happy, but I'm not
> sure yet. I'm going to follow up with a description and analysis of
> whether it's valid/conforming.

So here's the possible solution. ISO C leaves the locale selected by
setlocale(cat, "") implementation-defined. POSIX, however, defines it
in terms of the LANG and LC_* environment variables. See the CX text
in:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

"Setting all of the categories of the global locale is similar to
successively setting each individual category of the global locale,
except that all error checking is done before any actions are
performed. To set all the categories of the global locale, setlocale()
can be invoked as:

    setlocale(LC_ALL, "");

In this case, setlocale() shall first verify that the values of all
the environment variables it needs according to the precedence rules
(described in XBD Environment Variables) indicate supported locales.
If the value of any of these environment variable searches yields a
locale that is not supported (and non-null), setlocale() shall return
a null pointer and the global locale shall not be changed. If all
environment variables name supported locales, setlocale() shall
proceed as if it had been called for each category, using the
appropriate value from the associated environment variable or from the
implementation-defined default if there is no such value."

and the Environment Variables text in XBD 8.2:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02

The former seems to tie our hands: unless the locales determined by
the environment variables all exist, setlocale is required to fail and
leave us in the (unacceptable) "C" locale where UTF-8 doesn't work.
However, the latter seems to offer us a way out.
After describing how the precedence of the variables works, how locale
pathnames work if localedef is supported (musl doesn't support it),
and how implementation-provided/defined locale names work, it
specifies:

"If the locale value is not recognized by the implementation, the
behavior is unspecified."

My optimistic reading of this is that, in the event the locale name
provided does not correspond to something we recognize, we're free to
define how it's interpreted, and always interpret it as C.UTF-8.

What this would achieve is the following:

1. setlocale(cat, explicit_locale_name) - succeeds if the locale
   actually has a definition file, fails and returns a null pointer
   otherwise.

2. setlocale(cat, "") - always succeeds, honoring the environment
   variable for the category if a locale definition file by that name
   exists, but otherwise (the unspecified behavior) treating it as if
   it were C.UTF-8.

This way, applications that probe for specific locale names can do so
and determine if they exist, but applications that just want to use
the default locale the user configured will still avoid catastrophic
breakage (failure to support UTF-8) even if they encounter "bad" LC_*
variables.

Does this approach sound acceptable? I'm fairly content with
interpreting it as conforming to the standard; I'm mainly concerned
about whether there might be unforeseen breakage.

One notable issue is that, right now, we rely on being able to set
LC_MESSAGES to an arbitrary name even if there's no libc locale
definition for it; this is because gettext() relies on the name of the
current LC_MESSAGES locale to find (application-specific) translation
files that might exist even without a libc translation. I'm not sure
how we would best keep this working under changes similar to the
above.

Rich