mailing list of musl libc
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: setlocale behavior with 'missing' locales
Date: Wed, 8 Nov 2017 00:03:38 -0500
Message-ID: <20171108050338.GL1627@brightrain.aerifal.cx>

One of the primary concerns when the byte-based C locale was added(*)
was not to introduce regressions in the property that musl is "always
UTF-8" except when the user or application has explicitly requested a
byte-based ("C"/"POSIX") locale.

First, some background: In order for the standard libc interfaces to
honor character encoding, a portable program has always needed to call
setlocale(LC_CTYPE, "") or setlocale(LC_ALL, ""). Addition of the
byte-based C locale "disabled UTF-8" in any application which wasn't
calling setlocale, but that was deemed acceptable since such
applications were not portable and would not work on other systems
anyway.
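
To make the background concrete, here is a minimal sketch of what such a
portable program does: it asks for the environment's LC_CTYPE locale and
then decodes text with the standard multibyte functions. The helper names
use_environment_locale and count_chars are hypothetical, made up for this
illustration; they are not from musl.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: request the character encoding named by the
 * environment (LC_ALL, LC_CTYPE, LANG). Returns 1 on success, 0 if
 * setlocale failed and the previous locale is still in effect. */
int use_environment_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: count the characters in s as decoded under the
 * current LC_CTYPE locale. Returns (size_t)-1 if s is invalid or
 * truncated in that locale's encoding. */
size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t n = 0, len = strlen(s);

    memset(&st, 0, sizeof st);
    while (len > 0) {
        size_t k = mbrtowc(NULL, s, len, &st);
        if (k == (size_t)-1 || k == (size_t)-2)
            return (size_t)-1;
        if (k == 0) /* decoded a null character; stop */
            break;
        s += k;
        len -= k;
        n++;
    }
    return n;
}
```

A program that skips use_environment_locale() stays in the initial "C"
locale, so count_chars interprets its input under that locale's encoding
rather than the one the user asked for.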

The other important case to consider was failure of setlocale. Prior
to the addition of the byte-based C locale, setlocale was essentially
a no-op, and from a practical standpoint it didn't matter if it
succeeded or failed because the preexisting "C" locale at program
entry already provided UTF-8. But afterwards, if setlocale failed for
some reason, applications that were trying to do the right thing would
suffer regression.

We ruled out spurious failure for resource exhaustion reasons by
making a statically allocated C.UTF-8 locale object. But the other
possible source of failure would have been having LC_* variables in
the environment (perhaps as a result of ssh'ing from another system or
running a musl-linked binary on a glibc-based system) with no
corresponding locale files for musl. If we treated that as an error,
UTF-8 would have suddenly broken in all sorts of real-world
situations, and one of the core original design goals/values of musl
would have been broken.

The choice I made at the time to avoid this was to declare that all
locale names are valid locales, and if there's no actual file defining
the locale, it's simply a clone of C.UTF-8. So for example if you run
with LC_ALL=fr_FR but no fr_FR translation file, you get a locale
named fr_FR (that's what setlocale reports as the active locale) but
with no translated messages/dates/etc., just UTF-8 character encoding
(so you're still able to access all characters properly and use
localized or multilingual data).
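
A rough sketch of how this can be observed from a program, using only
setlocale and the POSIX nl_langinfo interface. The helper names
activate_locale and report_locale are hypothetical, and the behavior for
a name with no locale file is the musl behavior described above; other
libcs typically return NULL from setlocale in that case.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: switch to the named locale and return the name
 * setlocale reports as active, or NULL if the request was rejected. */
const char *activate_locale(const char *name)
{
    return setlocale(LC_ALL, name);
}

/* Hypothetical helper: print the reported locale name and codeset for a
 * requested locale, or note that the request failed. */
void report_locale(const char *requested)
{
    const char *active = activate_locale(requested);

    if (!active) {
        printf("%s: setlocale failed\n", requested);
        return;
    }
    printf("%s: active=%s codeset=%s\n",
           requested, active, nl_langinfo(CODESET));
}
```

With the behavior described above, report_locale("fr_FR") on a system
with no fr_FR file would take the success branch and report fr_FR as the
active name; a libc that rejects unknown names would take the failure
branch instead.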

Unfortunately this turns out to have been something of a tradeoff,
since there's no way for applications (and, as it turns out,
especially tests/test suites) to query whether a particular locale is
"really" available. I've been asked to change the behavior to fail on
unknown locale names, but of course that's not a working option in
light of the above.
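
For illustration, this is roughly the kind of probe a test suite might
use, sketched under the assumption that it relies only on setlocale's
return value. The function name locale_seems_available is hypothetical.
Under the policy above it reports every name as available on musl, which
is exactly the problem.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical probe: returns 1 if setlocale accepts name, 0 if it
 * rejects it, and -1 if the current locale could not be saved. The
 * previously active locale is restored before returning. */
int locale_seems_available(const char *name)
{
    const char *cur = setlocale(LC_ALL, NULL);
    char *saved;
    int ok;

    if (!cur)
        return -1;
    /* Copy the name: the string returned by setlocale may be
     * overwritten by a later call. */
    saved = malloc(strlen(cur) + 1);
    if (!saved)
        return -1;
    strcpy(saved, cur);

    ok = setlocale(LC_ALL, name) != NULL;

    setlocale(LC_ALL, saved);
    free(saved);
    return ok;
}
```

A return value of 1 only says that the name was accepted; it says
nothing about whether translated messages or date formats exist for it.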

I think there may be a solution that makes everyone happy, but I'm not
sure yet. I'm going to follow up with a description and analysis of
whether it's valid/conforming.

Rich

(*) References on byte-based C locale:

Subject: [musl] Possible bytelocale patch
Message-ID: <20140703071318.GA10117@brightrain.aerifal.cx>

Subject: [musl] Revisiting byte-based C locale
Message-ID: <20150522022203.GA26651@brightrain.aerifal.cx>

Subject: [musl] [PATCH] Byte-based C locale, draft 1
Message-ID: <20150606214007.GA17398@brightrain.aerifal.cx>

commit 1507ebf837334e9e07cfab1ca1c2e88449069a80
byte-based C locale, phase 1: multibyte character handling functions

commit 16f18d036d9a7bf590ee6eb86785c0a9658220b6
byte-based C locale, phase 2: stdio and iconv (multibyte callers)



