mailing list of musl libc
From: Rich Felker <dalias@libc.org>
To: Assaf Gordon <assafgordon@gmail.com>
Cc: musl@lists.openwall.com
Subject: Re: Possible bug in setlocale upon invalid LC_ALL value
Date: Sat, 2 Apr 2016 00:09:14 -0400
Message-ID: <20160402040914.GD21636@brightrain.aerifal.cx>
In-Reply-To: <9292C698-FABF-4721-AFC6-221ABAAD14F5@gmail.com>

On Fri, Apr 01, 2016 at 10:46:25PM -0400, Assaf Gordon wrote:
> Hello Rich,
> 
> thank you for the prompt and detailed response.
> 
> > On Apr 1, 2016, at 20:58, Rich Felker <dalias@libc.org> wrote:
> > 
> > On Fri, Apr 01, 2016 at 08:47:01PM -0400, Assaf Gordon wrote:
> >> I think I've encountered a problem in musl, where using setlocale with an invalid locale name returns the invalid locale instead of a known locale.
> > 
> > This is intentional. All locale names are valid under musl, and those
> > which don't have any particular definition are just aliases for
> > C.UTF-8.
> 
> I will suggest a minor fix to GNU coreutils to accommodate this
> current implementation.

I think any 'fix' would be inconsistent with both the specified
behavior and the intended behavior. See below:

> > The alternative would be that UTF-8 support breaks whenever
> > LC_* vars are set but locales are not installed/configured, which
> > would pretty much _always_ be the case when running a static-linked
> > standalone binary on a non-musl-based system (where LC_* are probably
> > set to something the main host libc recognizes).
> > 
> > One possibility if this behavior is problematic would be to only
> > consider names without their own definitions as aliases for C.UTF-8
> > when MUSL_LOCPATH is not set. However I think we'd need to see a
> > strong motivation for doing that, since it seems like it would be
> > worse behavior in some ways, especially when using LC_MESSAGES set to
> > a language for which you don't have a locale installed.
> 
> I'm not an expert about locales to argue one way or the other.
> 
> Naively, I would think that this is somewhat problematic, because a
> well-behaved program (one that checks setlocale's return value for
> errors) has no way to warn the user that he/she used an invalid
> locale.

Well the intent is that it _is_ valid.

> Perhaps a work-around would be to handle it this way:
> if an invalid (non-existing) locale is given in LC_* env vars,
> setlocale(LC_ALL,"") should return NULL (indicating an error), then
> all other invocations of setlocale(LC_*,NULL) would return the
> "C.UTF-8" indicator. This would allow detecting the error, but not
> affect further processing (if invalid locales are already an alias
> to C.UTF-8). This seems to match other OSes/libcs which return fixed
> "C" in such cases.

This is non-conforming. If setlocale returns NULL it is required not
to have modified the locale. This, combined with the fact that prior
to calling setlocale successfully, the locale is in an unusable
(single-byte, non-UTF-8-handling) state, is the whole motivation for
musl's treatment of locale names that don't have definitions.

> The reason for such check is that it is common user mistake to
> specify non-existing locales, then be confused by the seemingly
> incorrect results. Allowing a program to detect incorrect locales is
> a good mitigation.
> 
> I'll side-step the non-UTF-8 locales (which would be a problem in
> the current musl auto-aliasing to UTF-8), and show one possible case
> where silent aliasing leads to incorrect results.

musl does not support non-UTF-8 encodings at all, so that's not a very
interesting case anyway.

> consider the following UTF-8 string:
>    N Ñ O P Y Z Å Ä Ö
> (which includes Spanish eñe and the last three letters in the Swedish alphabet).
> When sorting with locale-aware programs, different locales should
> give different collation orders (e.g. es_ES.UTF-8 vs sv_FI.UTF-8).
> 
> To reproduce:
>   A='\116\n\303\221\n\117\n\120\n\131\n\132\n\303\205\n\303\204\n\303\226\n'
>   printf "$A" | LC_ALL=sv_FI.UTF-8 sort
>   printf "$A" | LC_ALL=es_ES.UTF-8 sort
> 
> If a user has a typo in the locale name (e.g. sv_SV.UTF-8), there's
> no way for a program to detect it, and the user will get unexpectedly
> ordered results.

But how is this any different from having a typo that results in
another defined locale being selected?
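
A caller that wants to catch names with no definition, regardless of
libc, can compare the requested name against a list of installed
locales before running the program. The following is only a minimal
sketch, not something musl or coreutils provides: normalize_locale and
locale_is_listed are hypothetical helper names, and it assumes a
`locale -a` style list (one name per line) is available, which may not
be the case on a musl-based system.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative only.

# Lower-case the codeset part and drop hyphens from it, so that
# "sv_FI.UTF-8" and "sv_FI.utf8" compare equal. Names without a
# codeset (e.g. "C") are printed unchanged.
normalize_locale() (
    case $1 in
        *.*)
            base=${1%%.*}
            codeset=$(printf '%s' "${1#*.}" |
                tr '[:upper:]' '[:lower:]' | sed 's/-//g')
            printf '%s.%s\n' "$base" "$codeset"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
)

# Succeed (exit status 0) if the name in $1 appears in the list read
# from stdin, one locale name per line, as `locale -a` prints it.
locale_is_listed() {
    wanted=$(normalize_locale "$1")
    while IFS= read -r name; do
        if [ "$(normalize_locale "$name")" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}
```

With a real list this could be used as
`locale -a | locale_is_listed "$LC_ALL" || echo "unknown: $LC_ALL" >&2`.
It only catches names that have no definition at all; a typo that
happens to name another defined locale still goes through unnoticed.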

> GNU coreutils' 'sort' program added a --debug option to help user diagnose such issues.
> On Linux with glibc, this will be the output:
> 
>   $ printf "$A" | LC_ALL=es_ES.UTF-8 sort --debug > /dev/null
>   sort: using ‘es_ES.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=sv_FI.UTF-8 sort --debug > /dev/null                             
>   sort: using ‘sv_FI.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=sv_SV.UTF-8 sort --debug > /dev/null                             
>   sort: using simple byte comparison 
> 
>   $ printf "$A" | LC_ALL=foobar sort --debug > /dev/null                                   
>   sort: using simple byte comparison
> 
> The last two messages ("simple byte") are the hint that the locale is
> invalid and that sort does not use it.
> 
> On Alpine (linux + musl), there's no way to detect such case:
> 
>   $ printf "$A" | LC_ALL=sv_FI.UTF-8 gsort --debug > /dev/null
>   gsort: using ‘sv_FI.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=sv_SV.UTF-8 gsort --debug > /dev/null
>   gsort: using ‘sv_SV.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=foobar gsort --debug > /dev/null
>   gsort: using ‘foobar’ sorting rules

It might help if this resulted in:

	gsort: using ‘C.UTF-8’ sorting rules

This is what used to happen ("hard resolving" the alias to a different
name, rather than "soft resolving" it), but now we save the actual
requested name so that it can be used for loading messages if dcgettext
is used with a category other than LC_MESSAGES. This is actually a
very rarely used feature, which could probably be sacrificed for
categories other than LC_MESSAGES if there's a strong benefit to doing
so.

Note that musl does not have any collation support at all right now,
nor any official locale files. That gives us some flexibility to
change things without impacting users, but the changes still can't
impact standards conformance/API contracts. I do hope to add collation
in the near future, as part of the goals for "1.2".
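
Until collation exists, one rough way to see whether a given locale
name changes ordering at all is to compare its output against the
byte-order result of LC_ALL=C. A sketch using the sample input from
the reproduction quoted earlier; sorted_with and collates_differently
are hypothetical helper names, and the result for any locale other
than C depends on the libc and on which locale data is installed.

```shell
#!/bin/sh
# Sample input from the reproduction: N Ñ O P Y Z Å Ä Ö, one per line,
# written as octal escapes of the UTF-8 bytes.
A='\116\n\303\221\n\117\n\120\n\131\n\132\n\303\205\n\303\204\n\303\226\n'

# Print the sample sorted under the locale named by $1.
sorted_with() {
    printf "$A" | LC_ALL="$1" sort
}

# Succeed if sorting under locale $1 gives a different order than
# plain byte comparison (LC_ALL=C).
collates_differently() {
    [ "$(sorted_with "$1")" != "$(sorted_with C)" ]
}
```

Under LC_ALL=C the order is N O P Y Z Ä Å Ñ Ö, since the two-byte
UTF-8 sequences sort after ASCII and then by their second byte. A call
such as `collates_differently sv_FI.UTF-8` succeeding suggests that
real collation rules were applied; failing means either the locale has
no collation data or its order happens to match byte order for this
input.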

Rich


