From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Possible bug in setlocale upon invalid LC_ALL value
Date: Sat, 2 Apr 2016 00:09:14 -0400
Message-ID: <20160402040914.GD21636@brightrain.aerifal.cx>
References: <4C4AEBC7-4344-4867-B8F6-F1A691F123E0@gmail.com> <20160402005858.GA21636@brightrain.aerifal.cx> <9292C698-FABF-4721-AFC6-221ABAAD14F5@gmail.com>
In-Reply-To: <9292C698-FABF-4721-AFC6-221ABAAD14F5@gmail.com>
Reply-To: musl@lists.openwall.com
To: Assaf Gordon
Cc: musl@lists.openwall.com

On Fri, Apr 01, 2016 at 10:46:25PM -0400, Assaf Gordon wrote:
> Hello Rich,
>
> thank you for the prompt and detailed response.
>
> > On Apr 1, 2016, at 20:58, Rich Felker wrote:
> >
> > On Fri, Apr 01, 2016 at 08:47:01PM -0400, Assaf Gordon wrote:
> >> I think I've encountered a problem in musl, where using setlocale with invalid locale name returns the invalid locale instead of a known locale.
> >
> > This is intentional. All locale names are valid under musl, and those
> > which don't have any particular definition are just aliases for
> > C.UTF-8.
>
> I will suggest a minor fix to GNU coreutils to accommodate for this
> current implementation.

I think any 'fix' would be inconsistent with both the specified
behavior and the intended behavior. See below:

> > The alternative would be that UTF-8 support breaks whenever
> > LC_* vars are set but locales are not installed/configured, which
> > would pretty much _always_ be the case when running a static-linked
> > standalone binary on a non-musl-based system (where LC_* are probably
> > set to something the main host libc recognizes).
> >
> > One possibility if this behavior is problematic would be to only
> > consider names without their own definitions as aliases for C.UTF-8
> > when MUSL_LOCPATH is not set. However I think we'd need to see a
> > strong motivation for doing that, since it seems like it would be
> > worse behavior in some ways, especially when using LC_MESSAGES set to
> > a language for which you don't have a locale installed.
>
> I'm not an expert about locales to argue one way or the other.
>
> Naively, I would think that this is somewhat problematic, because a
> best-behaving program (one that checks setlocale's return code for
> errors) has no way to warn the user that he/she used an invalid
> locale.

Well the intent is that it _is_ valid.

> Perhaps a work-around would be to handle it this way:
> if an invalid (non-existing) locale is given in LC_* env vars,
> setlocale(LC_ALL,"") should return NULL (indicating an error), then
> all other invocations of setlocale(LC_*,NULL) would return the
> "C.UTF-8" indicator.
> This would allow detecting the error, but not
> affect further processing (if invalid locales are already an alias
> to C.UTF-8). This seems to match other OSes/libcs which return fixed
> "C" in such cases.

This is non-conforming. If setlocale returns NULL it is required not
to have modified the locale. This, combined with the fact that prior
to calling setlocale successfully, the locale is in an unusable
(single-byte, non-UTF-8-handling) state, is the whole motivation for
musl's treatment of locale names that don't have definitions.

> The reason for such check is that it is a common user mistake to
> specify non-existing locales, then be confused by the seemingly
> incorrect results. Allowing a program to detect incorrect locales is
> a good mitigation.
>
> I'll side-step the non-UTF-8 locales (which would be a problem in
> the current musl auto-aliasing to UTF-8), and show one possible case
> where silent aliasing leads to incorrect results.

musl does not support non-UTF-8 encodings at all, so that's not a very
interesting case anyway.

> consider the following UTF-8 string:
> M N Ñ O P Y Z Å Ä Ö
> (which includes Spanish eñe and the last three letters in the Swedish alphabet).
> When sorting with locale-aware programs, different locales should
> give different collation orders (e.g. es_ES.UTF-8 vs sv_FI.UTF-8).
>
> To reproduce:
> A='\116\n\303\221\n\117\n\120\n\131\n\132\n\303\205\n\303\204\n\303\226\n'
> printf "$A" | LC_ALL=sv_FI.UTF-8 sort
> printf "$A" | LC_ALL=es_ES.UTF-8 sort
>
> If a user has a typo in the locale name (e.g. sv_SV.UTF-8), there's
> no way for a program to detect it, and he will get unexpected
> ordered results.

But how is this any different from having a typo that results in
another defined locale being selected?

> GNU coreutils' 'sort' program added a --debug option to help users diagnose such issues.
> On Linux with glibc, this will be the output:
>
> $ printf "$A" | LC_ALL=es_ES.UTF-8 sort --debug > /dev/null
> sort: using ‘es_ES.UTF-8’ sorting rules
>
> $ printf "$A" | LC_ALL=sv_FI.UTF-8 sort --debug > /dev/null
> sort: using ‘sv_FI.UTF-8’ sorting rules
>
> $ printf "$A" | LC_ALL=sv_SV.UTF-8 sort --debug > /dev/null
> sort: using simple byte comparison
>
> $ printf "$A" | LC_ALL=foobar sort --debug > /dev/null
> sort: using simple byte comparison
>
> The last two messages ("simple byte") are the hint that the locale is
> invalid, and sort does not use it.
>
> On Alpine (linux + musl), there's no way to detect such case:
>
> $ printf "$A" | LC_ALL=sv_FI.UTF-8 gsort --debug > /dev/null
> gsort: using ‘sv_FI.UTF-8’ sorting rules
>
> $ printf "$A" | LC_ALL=sv_SV.UTF-8 gsort --debug > /dev/null
> gsort: using ‘sv_SV.UTF-8’ sorting rules
>
> $ printf "$A" | LC_ALL=foobar gsort --debug > /dev/null
> gsort: using ‘foobar’ sorting rules

It might help if this resulted in:

gsort: using ‘C.UTF-8’ sorting rules

This is what used to happen ("hard resolving" the alias to a different
name, rather than "soft resolving" it), but now we save the actual
requested name so that it can be used for loading messages if dcgettext
is used with a category other than LC_MESSAGES. This is actually a
very rarely used feature, which could probably be sacrificed for
categories other than LC_MESSAGES if there's a strong benefit to doing
so.

Note that musl does not have any collation support at all right now,
nor any official locale files. That gives us some flexibility to
change things without impacting users, but the changes still can't
impact standards conformance/API contracts. I do hope to add collation
in the near future, as part of the goals for "1.2".

Rich
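
For reference, the return-value check this thread keeps coming back to
can be sketched in C roughly as follows. This is only an illustration
of the setlocale contract discussed above (a NULL return means the
locale was not modified), not code from musl or coreutils. The helper
name select_locale and its parameters are hypothetical, made up for
this sketch.

```c
#include <locale.h>
#include <stdio.h>

/*
 * select_locale() is a hypothetical helper, not part of any libc.
 * It asks setlocale() for `name` in `category` and copies the name of
 * the locale in effect afterwards into `out`.
 * Returns 1 if setlocale() accepted the request, 0 if it returned NULL.
 */
int select_locale(int category, const char *name, char *out, size_t outlen)
{
    const char *got = setlocale(category, name);
    int ok = (got != NULL);

    /* On failure the locale is unchanged, so only query the current one. */
    if (!ok)
        got = setlocale(category, NULL);

    /* Copy the reported name; fall back to an empty string if none. */
    snprintf(out, outlen, "%s", got ? got : "");
    return ok;
}
```

A caller would normally pass "" as the name so that the LC_* environment
variables are used, and print the copied name when reporting which rules
are in effect. Whether an undefined name makes the call fail depends on
the libc: with musl as described above it succeeds and reports the
requested name, while other libcs may return NULL.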