mailing list of musl libc
* Possible bug in setlocale upon invalid LC_ALL value
@ 2016-04-02  0:47 Assaf Gordon
  2016-04-02  0:58 ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Assaf Gordon @ 2016-04-02  0:47 UTC
  To: musl

Hello musl developers,

I'm testing compilation of GNU coreutils on Alpine Linux 3.3.3 (linux kernel 4.1.20, musl-1.1.12-r3).

I think I've encountered a problem in musl, where calling setlocale with an invalid locale name returns the invalid locale instead of a known locale.
Example:

   $ LC_ALL=missing ./myprogram

If myprogram calls setlocale(LC_ALL,""),
then musl sets the internal locale even though the value is invalid.
Later, querying the locale for a specific category (e.g. LC_COLLATE) returns 'missing' instead of 'C'.


The relevant POSIX clause is this:
 http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html
 "[...] If the value of any of these environment variable searches yields a locale that
  is not supported (and non-null), setlocale() shall return a null pointer and the global
  locale shall not be changed."

Below is a short C program demonstrating the issue, with example output from various OSes.

Comments welcome,
 - assaf


/*
Test 'setlocale()' behaviour.

compile:
   cc -o print-locale print-locale.c
test:
   ./print-locale
   LC_ALL=C ./print-locale
   LC_ALL=missing ./print-locale
*/
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
 char* p = getenv("LC_ALL");
 printf("LC_ALL env var = '%s'\n", p?p:"(NULL)");

 p = setlocale(LC_ALL,"");
 printf("setlocale(LC_ALL,\"\") = '%s'\n", p?p:"(NULL)");

 p = setlocale(LC_ALL,NULL);
 printf("LC_ALL from setlocale = '%s'\n", p?p:"(NULL)");

 p = setlocale(LC_COLLATE,NULL);
 printf("LC_COLLATE from setlocale = '%s'\n", p?p:"(NULL)");

 return 0;
}

==== musl libc =======

$ ./print-locale
LC_ALL env var = '(NULL)'
setlocale(LC_ALL,"") = 'C.UTF-8;C;C;C;C;C'
LC_ALL from setlocale = 'C.UTF-8;C;C;C;C;C'
LC_COLLATE from setlocale = 'C'

$ LC_ALL=C ./print-locale
LC_ALL env var = 'C'
setlocale(LC_ALL,"") = 'C;C;C;C;C;C'
LC_ALL from setlocale = 'C;C;C;C;C;C'
LC_COLLATE from setlocale = 'C'

$ LC_ALL=missing ./print-locale
LC_ALL env var = 'missing'
setlocale(LC_ALL,"") = 'missing;missing;missing;missing;missing;missing'
LC_ALL from setlocale = 'missing;missing;missing;missing;missing;missing'
LC_COLLATE from setlocale = 'missing'


==== glibc (Ubuntu) ====

$ ./print-locale
LC_ALL env var = '(NULL)'
setlocale(LC_ALL,"") = 'en_US.UTF-8'
LC_ALL from setlocale = 'en_US.UTF-8'
LC_COLLATE from setlocale = 'en_US.UTF-8'

$ LC_ALL=C ./print-locale
LC_ALL env var = 'C'
setlocale(LC_ALL,"") = 'C'
LC_ALL from setlocale = 'C'
LC_COLLATE from setlocale = 'C'

$ LC_ALL=missing ./print-locale
LC_ALL env var = 'missing'
setlocale(LC_ALL,"") = '(NULL)'
LC_ALL from setlocale = 'C'
LC_COLLATE from setlocale = 'C'

==== FreeBSD 10.1 ====

$ ./print-locale
LC_ALL env var = '(NULL)'
setlocale(LC_ALL,"") = 'C'
LC_ALL from setlocale = 'C'
LC_COLLATE from setlocale = 'C'

$ LC_ALL=C ./print-locale
LC_ALL env var = 'C'
setlocale(LC_ALL,"") = 'C'
LC_ALL from setlocale = 'C'
LC_COLLATE from setlocale = 'C'

$ LC_ALL=missing ./print-locale
LC_ALL env var = 'missing'
setlocale(LC_ALL,"") = '(NULL)'
LC_ALL from setlocale = 'C'
LC_COLLATE from setlocale = 'C'


==== OpenBSD 5.8 ====

$ ./print-locale
LC_ALL env var = '(NULL)'
setlocale(LC_ALL,"") = 'C'
LC_ALL from setlocale = 'C'
LC_COLLATE from setlocale = 'C'

$ LC_ALL=C ./print-locale
LC_ALL env var = 'C'
setlocale(LC_ALL,"") = 'C'
LC_ALL from setlocale = 'C'
LC_COLLATE from setlocale = 'C'

$ LC_ALL=missing ./print-locale
LC_ALL env var = 'missing'
setlocale(LC_ALL,"") = 'C/missing/C/C/C/C'
LC_ALL from setlocale = 'C/missing/C/C/C/C'
LC_COLLATE from setlocale = 'C'

==== AIX 7 ===

$ ./print-locale
LC_ALL env var = '(NULL)'
setlocale(LC_ALL,"") = 'en_US en_US en_US en_US en_US en_US'
LC_ALL from setlocale = 'en_US en_US en_US en_US en_US en_US'
LC_COLLATE from setlocale = 'en_US'

$ LC_ALL=C ./print-locale
LC_ALL env var = 'C'
setlocale(LC_ALL,"") = 'C C C C C C'
LC_ALL from setlocale = 'C C C C C C'
LC_COLLATE from setlocale = 'C'

$ LC_ALL=missing ./print-locale
LC_ALL env var = 'missing'
setlocale(LC_ALL,"") = '(NULL)'
LC_ALL from setlocale = 'C C C C C C'
LC_COLLATE from setlocale = 'C'



* Re: Possible bug in setlocale upon invalid LC_ALL value
  2016-04-02  0:47 Possible bug in setlocale upon invalid LC_ALL value Assaf Gordon
@ 2016-04-02  0:58 ` Rich Felker
  2016-04-02  2:46   ` Assaf Gordon
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2016-04-02  0:58 UTC
  To: Assaf Gordon; +Cc: musl

On Fri, Apr 01, 2016 at 08:47:01PM -0400, Assaf Gordon wrote:
> Hello musl developers,
> 
> I'm testing compilation of GNU coreutils on Alpine Linux 3.3.3 (linux kernel 4.1.20, musl-1.1.12-r3).
> 
> I think I've encountered a problem in musl, where calling setlocale with an invalid locale name returns the invalid locale instead of a known locale.
> Example:
>
>    $ LC_ALL=missing ./myprogram
>
> If myprogram calls setlocale(LC_ALL,""),
> then musl sets the internal locale even though the value is invalid.
> Later, querying the locale for a specific category (e.g. LC_COLLATE) returns 'missing' instead of 'C'.
> 
> 
> The relevant POSIX clause is this:
>  http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html
>  "[...] If the value of any of these environment variable searches yields a locale that
>   is not supported (and non-null), setlocale() shall return a null pointer and the global
>   locale shall not be changed."
> 
> Below is a short C program demonstrating the issue, with example output from various OSes.

This is intentional. All locale names are valid under musl, and those
which don't have any particular definition are just aliases for
C.UTF-8. The alternative would be that UTF-8 support breaks whenever
LC_* vars are set but locales are not installed/configured, which
would pretty much _always_ be the case when running a static-linked
standalone binary on a non-musl-based system (where LC_* are probably
set to something the main host libc recognizes).

One possibility if this behavior is problematic would be to only
consider names without their own definitions as aliases for C.UTF-8
when MUSL_LOCPATH is not set. However I think we'd need to see a
strong motivation for doing that, since it seems like it would be
worse behavior in some ways, especially when using LC_MESSAGES set to
a language for which you don't have a locale installed.

Rich



* Re: Possible bug in setlocale upon invalid LC_ALL value
  2016-04-02  0:58 ` Rich Felker
@ 2016-04-02  2:46   ` Assaf Gordon
  2016-04-02  4:09     ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Assaf Gordon @ 2016-04-02  2:46 UTC
  To: Rich Felker; +Cc: musl

Hello Rich,

thank you for the prompt and detailed response.

> On Apr 1, 2016, at 20:58, Rich Felker <dalias@libc.org> wrote:
> 
> On Fri, Apr 01, 2016 at 08:47:01PM -0400, Assaf Gordon wrote:
>> I think I've encountered a problem in musl, where calling setlocale with an invalid locale name returns the invalid locale instead of a known locale.
> 
> This is intentional. All locale names are valid under musl, and those
> which don't have any particular definition are just aliases for
> C.UTF-8.

I will suggest a minor fix to GNU coreutils to accommodate the current implementation.

> The alternative would be that UTF-8 support breaks whenever
> LC_* vars are set but locales are not installed/configured, which
> would pretty much _always_ be the case when running a static-linked
> standalone binary on a non-musl-based system (where LC_* are probably
> set to something the main host libc recognizes).
> 
> One possibility if this behavior is problematic would be to only
> consider names without their own definitions as aliases for C.UTF-8
> when MUSL_LOCPATH is not set. However I think we'd need to see a
> strong motivation for doing that, since it seems like it would be
> worse behavior in some ways, especially when using LC_MESSAGES set to
> a language for which you don't have a locale installed.

I'm not enough of a locale expert to argue one way or the other.

Naively, I would think this is somewhat problematic, because a well-behaved program (one that checks setlocale's return value for errors) has no way to warn the user that they specified an invalid locale.

Perhaps a work-around would be to handle it this way:
If an invalid (nonexistent) locale is given in the LC_* env vars, setlocale(LC_ALL,"") would return NULL (indicating an error), and all subsequent calls to setlocale(LC_*,NULL) would return the "C.UTF-8" indicator. This would allow detecting the error, but not affect further processing (if invalid locales are already an alias to C.UTF-8). This seems to match other OSes/libcs which return fixed "C" in such cases.

The reason for such a check is that it is a common user mistake to specify a nonexistent locale and then be confused by the seemingly incorrect results. Letting a program detect incorrect locales is a good mitigation.

I'll side-step the non-UTF-8 locales (which would be a problem in the current musl auto-aliasing to UTF-8), and show one possible case where silent aliasing leads to incorrect results.

Consider the following UTF-8 letters:
   N Ñ O P Y Z Å Ä Ö
(which include the Spanish eñe and the last three letters of the Swedish alphabet).
When sorting with locale-aware programs, different locales should give different collation orders (e.g. es_ES.UTF-8 vs sv_FI.UTF-8).

To reproduce:
  A='\116\n\303\221\n\117\n\120\n\131\n\132\n\303\205\n\303\204\n\303\226\n'
  printf "$A" | LC_ALL=sv_FI.UTF-8 sort
  printf "$A" | LC_ALL=es_ES.UTF-8 sort

If a user makes a typo in the locale name (e.g. sv_SV.UTF-8), there's no way for a program to detect it, and the user will get unexpectedly ordered results.

GNU coreutils' 'sort' program added a --debug option to help users diagnose such issues.
On Linux with glibc, this will be the output:

  $ printf "$A" | LC_ALL=es_ES.UTF-8 sort --debug > /dev/null
  sort: using ‘es_ES.UTF-8’ sorting rules

  $ printf "$A" | LC_ALL=sv_FI.UTF-8 sort --debug > /dev/null
  sort: using ‘sv_FI.UTF-8’ sorting rules

  $ printf "$A" | LC_ALL=sv_SV.UTF-8 sort --debug > /dev/null
  sort: using simple byte comparison

  $ printf "$A" | LC_ALL=foobar sort --debug > /dev/null
  sort: using simple byte comparison

The last two messages ("simple byte comparison") are the hint that the locale is invalid and that sort does not use it.

On Alpine (Linux + musl), there's no way to detect such a case:

  $ printf "$A" | LC_ALL=sv_FI.UTF-8 gsort --debug > /dev/null
  gsort: using ‘sv_FI.UTF-8’ sorting rules

  $ printf "$A" | LC_ALL=sv_SV.UTF-8 gsort --debug > /dev/null
  gsort: using ‘sv_SV.UTF-8’ sorting rules

  $ printf "$A" | LC_ALL=foobar gsort --debug > /dev/null
  gsort: using ‘foobar’ sorting rules


regards,
 - assaf




* Re: Possible bug in setlocale upon invalid LC_ALL value
  2016-04-02  2:46   ` Assaf Gordon
@ 2016-04-02  4:09     ` Rich Felker
  2016-04-02  4:18       ` Assaf Gordon
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2016-04-02  4:09 UTC
  To: Assaf Gordon; +Cc: musl

On Fri, Apr 01, 2016 at 10:46:25PM -0400, Assaf Gordon wrote:
> Hello Rich,
> 
> thank you for the prompt and detailed response.
> 
> > On Apr 1, 2016, at 20:58, Rich Felker <dalias@libc.org> wrote:
> > 
> > On Fri, Apr 01, 2016 at 08:47:01PM -0400, Assaf Gordon wrote:
> >> I think I've encountered a problem in musl, where calling setlocale with an invalid locale name returns the invalid locale instead of a known locale.
> > 
> > This is intentional. All locale names are valid under musl, and those
> > which don't have any particular definition are just aliases for
> > C.UTF-8.
> 
> I will suggest a minor fix to GNU coreutils to accommodate the
> current implementation.

I think any 'fix' would be inconsistent with both the specified
behavior and the intended behavior. See below:

> > The alternative would be that UTF-8 support breaks whenever
> > LC_* vars are set but locales are not installed/configured, which
> > would pretty much _always_ be the case when running a static-linked
> > standalone binary on a non-musl-based system (where LC_* are probably
> > set to something the main host libc recognizes).
> > 
> > One possibility if this behavior is problematic would be to only
> > consider names without their own definitions as aliases for C.UTF-8
> > when MUSL_LOCPATH is not set. However I think we'd need to see a
> > strong motivation for doing that, since it seems like it would be
> > worse behavior in some ways, especially when using LC_MESSAGES set to
> > a language for which you don't have a locale installed.
> 
> I'm not enough of a locale expert to argue one way or the other.
>
> Naively, I would think this is somewhat problematic, because a
> well-behaved program (one that checks setlocale's return value for
> errors) has no way to warn the user that they specified an invalid
> locale.

Well the intent is that it _is_ valid.

> Perhaps a work-around would be to handle it this way:
> If an invalid (nonexistent) locale is given in the LC_* env vars,
> setlocale(LC_ALL,"") would return NULL (indicating an error), and
> all subsequent calls to setlocale(LC_*,NULL) would return the
> "C.UTF-8" indicator. This would allow detecting the error, but not
> affect further processing (if invalid locales are already an alias
> to C.UTF-8). This seems to match other OSes/libcs which return fixed
> "C" in such cases.

This is non-conforming. If setlocale returns NULL it is required not
to have modified the locale. This, combined with the fact that prior
to calling setlocale successfully, the locale is in an unusable
(single-byte, non-UTF-8-handling) state, is the whole motivation for
musl's treatment of locale names that don't have definitions.

> The reason for such a check is that it is a common user mistake to
> specify a nonexistent locale and then be confused by the seemingly
> incorrect results. Letting a program detect incorrect locales is
> a good mitigation.
> 
> I'll side-step the non-UTF-8 locales (which would be a problem in
> the current musl auto-aliasing to UTF-8), and show one possible case
> where silent aliasing leads to incorrect results.

musl does not support non-UTF-8 encodings at all, so that's not a very
interesting case anyway.

> Consider the following UTF-8 letters:
>    N Ñ O P Y Z Å Ä Ö
> (which include the Spanish eñe and the last three letters of the Swedish alphabet).
> When sorting with locale-aware programs, different locales should
> give different collation orders (e.g. es_ES.UTF-8 vs sv_FI.UTF-8).
> 
> To reproduce:
>   A='\116\n\303\221\n\117\n\120\n\131\n\132\n\303\205\n\303\204\n\303\226\n'
>   printf "$A" | LC_ALL=sv_FI.UTF-8 sort
>   printf "$A" | LC_ALL=es_ES.UTF-8 sort
> 
> If a user makes a typo in the locale name (e.g. sv_SV.UTF-8), there's
> no way for a program to detect it, and the user will get unexpectedly
> ordered results.

But how is this any different from having a typo that results in
another defined locale being selected?

> GNU coreutils' 'sort' program added a --debug option to help users diagnose such issues.
> On Linux with glibc, this will be the output:
>
>   $ printf "$A" | LC_ALL=es_ES.UTF-8 sort --debug > /dev/null
>   sort: using ‘es_ES.UTF-8’ sorting rules
>
>   $ printf "$A" | LC_ALL=sv_FI.UTF-8 sort --debug > /dev/null
>   sort: using ‘sv_FI.UTF-8’ sorting rules
>
>   $ printf "$A" | LC_ALL=sv_SV.UTF-8 sort --debug > /dev/null
>   sort: using simple byte comparison
>
>   $ printf "$A" | LC_ALL=foobar sort --debug > /dev/null
>   sort: using simple byte comparison
>
> The last two messages ("simple byte comparison") are the hint that the
> locale is invalid and that sort does not use it.
>
> On Alpine (Linux + musl), there's no way to detect such a case:
> 
>   $ printf "$A" | LC_ALL=sv_FI.UTF-8 gsort --debug > /dev/null
>   gsort: using ‘sv_FI.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=sv_SV.UTF-8 gsort --debug > /dev/null
>   gsort: using ‘sv_SV.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=foobar gsort --debug > /dev/null
>   gsort: using ‘foobar’ sorting rules

It might help if this resulted in:

	gsort: using ‘C.UTF-8’ sorting rules

This is what used to happen ("hard resolving" the alias to a different
name, rather than "soft resolving" it), but now we save the actual
requested name so that it can be used for loading messages if dcgettext
is used with a category other than LC_MESSAGES. This is actually a
very rarely used feature, which could probably be sacrificed for
categories other than LC_MESSAGES if there's a strong benefit to doing
so.

Note that musl does not have any collation support at all right now,
nor any official locale files. That gives us some flexibility to
change things without impacting users, but the changes still can't
impact standards conformance/API contracts. I do hope to add collation
in the near future, as part of the goals for "1.2".

Rich



* Re: Possible bug in setlocale upon invalid LC_ALL value
  2016-04-02  4:09     ` Rich Felker
@ 2016-04-02  4:18       ` Assaf Gordon
  0 siblings, 0 replies; 5+ messages in thread
From: Assaf Gordon @ 2016-04-02  4:18 UTC
  To: Rich Felker; +Cc: musl

Hello,

Thanks again for the detailed response.

> On Apr 2, 2016, at 00:09, Rich Felker <dalias@libc.org> wrote:
> On Fri, Apr 01, 2016 at 10:46:25PM -0400, Assaf Gordon wrote:
>>> On Apr 1, 2016, at 20:58, Rich Felker <dalias@libc.org> wrote:
>>> This is intentional. All locale names are valid under musl, and those
>>> which don't have any particular definition are just aliases for
>>> C.UTF-8.
>> 
>> I will suggest a minor fix to GNU coreutils to accommodate the
>> current implementation.
> 
> I think any 'fix' would be inconsistent with both the specified
> behavior and the intended behavior. See below:

I should've worded it better: the 'fix' will simply detect the situation and skip the test (or give a warning) instead of reporting a test failure.

(cf. http://lists.gnu.org/archive/html/coreutils/2016-04/msg00002.html )

regards,
 - assaf




