mailing list of musl libc
* setlocale behavior with 'missing' locales
@ 2017-11-08  5:03 Rich Felker
  2017-11-08  5:27 ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2017-11-08  5:03 UTC (permalink / raw)
  To: musl

One of the primary concerns when the byte-based C locale was added(*)
was not to introduce regressions in the property that musl is "always
UTF-8" except when the user or application has explicitly requested a
byte-based ("C"/"POSIX") locale.

First, some background: In order for the standard libc interfaces to
honor character encoding, a portable program has always needed to call
setlocale(LC_CTYPE, "") or setlocale(LC_ALL, ""). Addition of the
byte-based C locale "disabled UTF-8" in any application which wasn't
calling setlocale, but that was deemed acceptable since such
applications were not portable and would not work on other systems
anyway.
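
A minimal sketch of the portable pattern described above: adopt the
environment's LC_CTYPE before relying on multibyte functions. The
helper names use_environment_ctype and count_characters are
hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Adopt the character encoding configured in the environment.
 * Returns 0 on success, -1 if setlocale reports failure. */
int use_environment_ctype(void)
{
    return setlocale(LC_CTYPE, "") ? 0 : -1;
}

/* Count characters (not bytes) in s under the current LC_CTYPE.
 * Returns (size_t)-1 if s is not valid in the current encoding. */
size_t count_characters(const char *s)
{
    mbstate_t st;

    assert(s != NULL);
    memset(&st, 0, sizeof st);
    /* With a null destination, mbsrtowcs only computes the length. */
    return mbsrtowcs(NULL, &s, 0, &st);
}
```

A program that skips use_environment_ctype() stays in the startup "C"
locale, so how count_characters treats non-ASCII bytes then depends on
what that locale's encoding is.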

The other important case to consider was failure of setlocale. Prior
to the addition of the byte-based C locale, setlocale was essentially
a no-op, and from a practical standpoint it didn't matter if it
succeeded or failed because the preexisting "C" locale at program
entry already provided UTF-8. But afterwards, if setlocale failed for
some reason, applications that were trying to do the right thing would
suffer regression.

We ruled out spurious failure for resource exhaustion reasons by
making a statically allocated C.UTF-8 locale object. But the other
possible source of failure would have been having LC_* variables in
the environment (perhaps as a result of ssh'ing from another system or
running a musl-linked binary on a glibc-based system) with no
corresponding locale files for musl. If we treated that as an error,
UTF-8 would have suddenly broken in all sorts of real-world
situations, and one of the core original design goals/values of musl
would have been violated.

The choice I made at the time to avoid this was to declare that all
locale names are valid locales, and if there's no actual file defining
the locale, it's simply a clone of C.UTF-8. So for example if you run
with LC_ALL=fr_FR but no fr_FR translation file, you get a locale
named fr_FR (that's what setlocale reports as the active locale) but
with no translated messages/dates/etc., just UTF-8 character encoding
(so you're still able to access all characters properly and use
localized or multilingual data).
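
One way to observe this is to set LC_ALL and ask setlocale which name
it reports. The sketch below is only illustrative;
apply_environment_locale is a hypothetical helper, and the result for
a name like "fr_FR" depends on the libc and on which locale files are
installed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Set LC_ALL in the environment, apply it with setlocale(LC_ALL, ""),
 * and return the name now active for LC_CTYPE. Returns NULL if the
 * environment could not be changed or setlocale reported failure. */
const char *apply_environment_locale(const char *lc_all_value)
{
    assert(lc_all_value != NULL);
    if (setenv("LC_ALL", lc_all_value, 1) != 0)
        return NULL;
    if (!setlocale(LC_ALL, ""))
        return NULL;
    /* A null locale argument queries the current setting. */
    return setlocale(LC_CTYPE, NULL);
}
```

Under the behavior described above, apply_environment_locale("fr_FR")
would report "fr_FR" even with no translation file present; an
implementation that rejects unknown names would yield NULL instead.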

Unfortunately this turns out to have been something of a tradeoff,
since there's no way for applications (and, as it turns out,
especially tests/test suites) to query whether a particular locale is
"really" available. I've been asked to change the behavior to fail on
unknown locale names, but of course that's not a working option in
light of the above.

I think there may be a solution that makes everyone happy, but I'm not
sure yet. I'm going to follow up with a description and analysis of
whether it's valid/conforming.

Rich






(*) References on byte-based C locale:

Subject: [musl] Possible bytelocale patch
Message-ID: <20140703071318.GA10117@brightrain.aerifal.cx>

Subject: [musl] Revisiting byte-based C locale
Message-ID: <20150522022203.GA26651@brightrain.aerifal.cx>

Subject: [musl] [PATCH] Byte-based C locale, draft 1
Message-ID: <20150606214007.GA17398@brightrain.aerifal.cx>

commit 1507ebf837334e9e07cfab1ca1c2e88449069a80
byte-based C locale, phase 1: multibyte character handling functions

commit 16f18d036d9a7bf590ee6eb86785c0a9658220b6
byte-based C locale, phase 2: stdio and iconv (multibyte callers)




* Re: setlocale behavior with 'missing' locales
  2017-11-08  5:03 setlocale behavior with 'missing' locales Rich Felker
@ 2017-11-08  5:27 ` Rich Felker
  2017-11-12 22:19   ` A. Wilcox
  2018-03-01  1:13   ` Rich Felker
  0 siblings, 2 replies; 11+ messages in thread
From: Rich Felker @ 2017-11-08  5:27 UTC (permalink / raw)
  To: musl

On Wed, Nov 08, 2017 at 12:03:38AM -0500, Rich Felker wrote:
> Unfortunately this turns out to have been something of a tradeoff,
> since there's no way for applications (and, as it turns out,
> especially tests/test suites) to query whether a particular locale is
> "really" available. I've been asked to change the behavior to fail on
> unknown locale names, but of course that's not a working option in
> light of the above.
> 
> I think there may be a solution that makes everyone happy, but I'm not
> sure yet. I'm going to follow up with a description and analysis of
> whether it's valid/conforming.

So here's the possible solution. ISO C leaves the default locale when
setlocale(cat,"") is called implementation-defined. POSIX however
defines it in terms of the LANG and LC_* environment variables. See
the CX text in:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

  "Setting all of the categories of the global locale is similar to
  successively setting each individual category of the global locale,
  except that all error checking is done before any actions are
  performed. To set all the categories of the global locale,
  setlocale() can be invoked as:

  setlocale(LC_ALL, "");

  In this case, setlocale() shall first verify that the values of all
  the environment variables it needs according to the precedence rules
  (described in XBD Environment Variables) indicate supported locales.
  If the value of any of these environment variable searches yields a
  locale that is not supported (and non-null), setlocale() shall
  return a null pointer and the global locale shall not be changed. If
  all environment variables name supported locales, setlocale() shall
  proceed as if it had been called for each category, using the
  appropriate value from the associated environment variable or from
  the implementation-defined default if there is no such value."

and the Environment Variables text in XBD 8.2:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02

The former seems to tie our hands: unless the locales determined by
the environment variables all exist, setlocale is required to fail and
leave us in the (unacceptable) "C" locale where UTF-8 doesn't work.
However, the latter seems to offer us a way out. After describing how
the precedence of the variables works, how locale pathnames work if
localedef is supported (musl doesn't support it), and how
implementation-provided/defined locale names work, it specifies:

  "If the locale value is not recognized by the implementation, the
  behavior is unspecified."

My optimistic reading of this is that, in the event the locale name
provided does not correspond to something we recognize, we're free to
define how it's interpreted, and always interpret it as C.UTF-8.

What this would achieve is the following:

1. setlocale(cat, explicit_locale_name) - succeeds if the locale
   actually has a definition file, fails and returns a null pointer
   otherwise.

2. setlocale(cat, "") - always succeeds, honoring the environment
   variable for the category if a locale definition file by that name
   exists, but otherwise (the unspecified behavior) treating it as if
   it were C.UTF-8.

This way, applications that probe for specific locale names can do so
and determine if they exist, but applications that just want to use
the default locale the user configured will still avoid catastrophic
breakage (failure to support UTF-8) even if they encounter "bad" LC_*
variables.
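
Under the proposed semantics, a probe for a specific locale could be
sketched as below. locale_is_installed is a hypothetical helper, and
its result is only meaningful on an implementation where setlocale
with an explicit name fails for locales that have no definition.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if setlocale accepts the explicit name for the category,
 * 0 otherwise. The previously active locale is restored afterwards. */
int locale_is_installed(int category, const char *name)
{
    const char *cur;
    char *saved;
    int ok;

    /* "" means "use the environment", which is not a probe. */
    assert(name != NULL && *name != '\0');

    cur = setlocale(category, NULL);
    if (!cur)
        return 0;
    /* Copy the name: later setlocale calls may overwrite the string. */
    saved = malloc(strlen(cur) + 1);
    if (!saved)
        return 0;
    strcpy(saved, cur);

    ok = setlocale(category, name) != NULL;
    setlocale(category, saved);
    free(saved);
    return ok;
}
```

A test suite could then skip locale-dependent cases when
locale_is_installed(LC_COLLATE, "en_US") returns 0, while ordinary
applications keep calling setlocale(LC_ALL, "").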

Does this approach sound acceptable? I'm fairly content with
interpreting it as conforming to the standard; I'm mainly concerned
about whether there might be unforeseen breakage.

One notable issue is that, right now, we rely on being able to set
LC_MESSAGES to an arbitrary name even if there's no libc locale
definition for it; this is because gettext() relies on the name of the
current LC_MESSAGES locale to find (application-specific) translation
files that might exist even without a libc translation. I'm not sure
how we would best keep this working under changes similar to the
above.
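
To make the dependency concrete, here is a rough sketch of how a
gettext-style lookup derives a catalog path from the LC_MESSAGES
locale name, following the conventional
<dir>/<locale>/LC_MESSAGES/<domain>.mo layout. Both function names are
hypothetical, and real implementations also try fallbacks (for
example, dropping the territory or modifier) that are omitted here.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Build "<dir>/<locale_name>/LC_MESSAGES/<domain>.mo" in buf.
 * Returns 0 on success, -1 if the name is unusable or buf is too small. */
int catalog_path(char *buf, size_t size, const char *dir,
                 const char *locale_name, const char *domain)
{
    int n;

    assert(buf && dir && locale_name && domain);
    /* Refuse names that could escape the locale directory. */
    if (strchr(locale_name, '/'))
        return -1;
    n = snprintf(buf, size, "%s/%s/LC_MESSAGES/%s.mo",
                 dir, locale_name, domain);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Same, but take the name from the currently active LC_MESSAGES locale. */
int current_catalog_path(char *buf, size_t size, const char *dir,
                         const char *domain)
{
    const char *name = setlocale(LC_MESSAGES, NULL);

    if (!name)
        return -1;
    return catalog_path(buf, size, dir, name, domain);
}
```

If setlocale refused to record a name that has no libc locale file,
current_catalog_path would see "C" or "C.UTF-8" and the application's
own translations would never be found.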

Rich



* Re: setlocale behavior with 'missing' locales
  2017-11-08  5:27 ` Rich Felker
@ 2017-11-12 22:19   ` A. Wilcox
  2017-11-13  0:15     ` Rich Felker
  2018-03-01  1:13   ` Rich Felker
  1 sibling, 1 reply; 11+ messages in thread
From: A. Wilcox @ 2017-11-12 22:19 UTC (permalink / raw)
  To: musl

On 07/11/17 23:27, Rich Felker wrote:
> One notable issue is that, right now, we rely on being able to set 
> LC_MESSAGES to an arbitrary name even if there's no libc locale 
> definition for it; this is because gettext() relies on the name of
> the current LC_MESSAGES locale to find (application-specific)
> translation files that might exist even without a libc translation.
> I'm not sure how we would best keep this working under changes
> similar to the above.

I can think of two ways to handle it, but neither of them is all that
great:

1) Provide simple translations for the most common 90% of languages.
Some people are getting unfairly screwed here, but they are probably
getting screwed by every other app/library as well.  It should be very
simple to find a list of month names and abbreviations online for
pretty much any language (even using Wikipedia's month article
translations or Microsoft's Open Software Translations project).

2) Use an access(3) call for /usr/share/locale/$LC_MESSAGES.  This
means there is virtually no work for musl beyond adding the call, and
it will only succeed if the locale is available (which is exactly what
the standard demands).  The two problems I see with this are:

  1) if /usr/share is NFS shared this could lock.
     But it would do so anyway if setlocale(3) succeeded.

  2) it requires use of stdio which most people on this list hate.


Best,
--arw

-- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org



* Re: setlocale behavior with 'missing' locales
  2017-11-12 22:19   ` A. Wilcox
@ 2017-11-13  0:15     ` Rich Felker
  2018-02-12  6:02       ` A. Wilcox
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2017-11-13  0:15 UTC (permalink / raw)
  To: musl

On Sun, Nov 12, 2017 at 04:19:51PM -0600, A. Wilcox wrote:
> On 07/11/17 23:27, Rich Felker wrote:
> > One notable issue is that, right now, we rely on being able to set 
> > LC_MESSAGES to an arbitrary name even if there's no libc locale 
> > definition for it; this is because gettext() relies on the name of
> > the current LC_MESSAGES locale to find (application-specific)
> > translation files that might exist even without a libc translation.
> > I'm not sure how we would best keep this working under changes
> > similar to the above.
> 
> I can think of two ways to handle it, but neither of them is all that
> great:
> 
> 1) Provide simple translations for the most common 90% of languages.
> Some people are getting unfairly screwed here, but they are probably
> getting screwed by every other app/library as well.  It should be very
> simple to find a list of month names and abbreviations online for
> pretty much any language (even using Wikipedia's month article
> translations or Microsoft's Open Software Translations project).

This is an interesting idea I once considered, but it has too many
problems. It's a large volume of data that would be duplicated in
every static-linked application, and at least some of it would become
outdated or would be subject to user disagreement about what it should
be -- things like time formatting, radix separator (if we make it
variable at all), etc. Also the category in question is LC_MESSAGES,
which has nothing to do with dates but rather strerror messages and
such. Having default translations for these for all languages linked
into libc really does not make sense.

> 2) Use an access(3) call for /usr/share/locale/$LC_MESSAGES.  This
> means there is virtually no work for musl beyond adding the call, and
> it will only succeed if the locale is available (which is exactly what
> the standard demands).  The two problems I see with this are:

access() is generally always wrong (uses wrong permissions when real &
effective ids differ), but that's a minor detail. However I don't see
how this tells you anything useful. All it tells you is whether at
least some application installed in the given prefix (here /usr) has
translation files for the particular locale name. It doesn't tell you
whether the running application does (false positives), nor does it
tell you whether applications with different prefixes might (false
negatives).

>   1) if /usr/share is NFS shared this could lock.
>      But it would do so anyway if setlocale(3) succeeded.
> 
>   2) it requires use of stdio which most people on this list hate.

I don't see what relation it has to stdio.

Rich



* Re: setlocale behavior with 'missing' locales
  2017-11-13  0:15     ` Rich Felker
@ 2018-02-12  6:02       ` A. Wilcox
  2018-02-12 20:04         ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: A. Wilcox @ 2018-02-12  6:02 UTC (permalink / raw)
  To: musl, adelie-devel



I'd just like to further note that as of 2.55, the GLib test suite now
fails for the same reason as Perl's (and libunistring, and coreutils,
and libidn, and git, ...): it tries to set LC_COLLATE to en_US, it
"succeeds", then it tries to collate and fails to get the expected result.

I'm not *quite* to the point where I am going to just LD_PRELOAD a stub
that makes setlocale always fail when running `abuild check`, but I'm
very close.
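
The failure mode in these test suites can be sketched as follows: a
successful setlocale return is taken as proof that locale-aware
collation works. A more defensive check verifies the behavior itself.
collation_looks_localized is a hypothetical helper, and it assumes
that a dictionary-style locale such as en_US sorts "a" before "B",
whereas byte-order collation sorts "B" (0x42) before "a" (0x61).

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if LC_COLLATE can be set to name AND strcoll then orders
 * "a" before "B"; return 0 otherwise. Restores the previous locale. */
int collation_looks_localized(const char *name)
{
    const char *cur;
    char *saved;
    int ok = 0;

    assert(name != NULL && *name != '\0');

    cur = setlocale(LC_COLLATE, NULL);
    saved = cur ? malloc(strlen(cur) + 1) : NULL;
    if (!saved)
        return 0;
    strcpy(saved, cur);

    /* Success of setlocale alone is not treated as sufficient. */
    if (setlocale(LC_COLLATE, name))
        ok = strcoll("a", "B") < 0;

    setlocale(LC_COLLATE, saved);
    free(saved);
    return ok;
}
```

A test could skip its collation cases when this returns 0 instead of
failing on an implementation where en_US is accepted but collates in
byte order.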

--arw

-- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org




* Re: setlocale behavior with 'missing' locales
  2018-02-12  6:02       ` A. Wilcox
@ 2018-02-12 20:04         ` Rich Felker
  0 siblings, 0 replies; 11+ messages in thread
From: Rich Felker @ 2018-02-12 20:04 UTC (permalink / raw)
  To: musl

On Mon, Feb 12, 2018 at 12:02:48AM -0600, A. Wilcox wrote:
> I'd just like to further note that as of 2.55, the GLib test suite now
> fails for the same reason as Perl's (and libunistring, and coreutils,
> and libidn, and git, ...): it tries to set LC_COLLATE to en_US, it
> "succeeds", then it tries to collate and fails to get the expected result.
> 
> I'm not *quite* to the point where I am going to just LD_PRELOAD a stub
> that makes setlocale always fail when running `abuild check`, but I'm
> very close.

I'm still interested in following up with the idea for solving this,
but was hoping for more input on whether it's appropriate. However
actually implementing collation table support is a separate issue from
fixing the "spurious success" of setlocale, and this breakage would
still happen if you have a "real" en_US locale installed. So figuring
out how to store collation tables in the locale data is also important
here.

Rich



* Re: setlocale behavior with 'missing' locales
  2017-11-08  5:27 ` Rich Felker
  2017-11-12 22:19   ` A. Wilcox
@ 2018-03-01  1:13   ` Rich Felker
  2018-03-01 19:10     ` William Pitcock
  1 sibling, 1 reply; 11+ messages in thread
From: Rich Felker @ 2018-03-01  1:13 UTC (permalink / raw)
  To: musl

On Wed, Nov 08, 2017 at 12:27:15AM -0500, Rich Felker wrote:
> What this would achieve is the following:
> 
> 1. setlocale(cat, explicit_locale_name) - succeeds if the locale
>    actually has a definition file, fails and returns a null pointer
>    otherwise.
> 
> 2. setlocale(cat, "") - always succeeds, honoring the environment
>    variable for the category if a locale definition file by that name
>    exists, but otherwise (the unspecified behavior) treating it as if
>    it were C.UTF-8.
> 
> This way, applications that probe for specific locale names can do so
> and determine if they exist, but applications that just want to use
> the default locale the user configured will still avoid catastrophic
> breakage (failure to support UTF-8) even if they encounter "bad" LC_*
> variables.
> 
> Does this approach sound acceptable? I'm fairly content with
> interpreting it as conforming to the standard; I'm mainly concerned
> about whether there might be unforeseen breakage.
> 
> One notable issue is that, right now, we rely on being able to set
> LC_MESSAGES to an arbitrary name even if there's no libc locale
> definition for it; this is because gettext() relies on the name of the
> current LC_MESSAGES locale to find (application-specific) translation
> files that might exist even without a libc translation. I'm not sure
> how we would best keep this working under changes similar to the
> above.

Any further thoughts on this? I'd like to begin addressing these
issues in this release cycle.

I think the above plan works (is conforming, doesn't break things)
except for the LC_MESSAGES issue mentioned at the end. I still don't
have any good ideas for dealing with that. Really, since gettext can
be used with any category, not just LC_MESSAGES (although LC_MESSAGES
is the normal choice), it applies to all categories. Maybe we could
still use the ("nonexistent") requested locale name in this case, or
some derivative of it that clarifies that it's synthesized...?

Rich



* Re: setlocale behavior with 'missing' locales
  2018-03-01  1:13   ` Rich Felker
@ 2018-03-01 19:10     ` William Pitcock
  2018-03-01 19:25       ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: William Pitcock @ 2018-03-01 19:10 UTC (permalink / raw)
  To: musl

Hi,

On Wed, Feb 28, 2018 at 7:13 PM, Rich Felker <dalias@libc.org> wrote:
> Any further thoughts on this? I'd like to begin addressing these
> issues in this release cycle.
>
> I think the above plan works (is conforming, doesn't break things)
> except for the LC_MESSAGES issue mentioned at the end. I don't have
> any good ideas still for dealing with that. Really since gettext can
> be used with any category, not just LC_MESSAGES (although LC_MESSAGES
> is the normal choice), it applies to all categories. Maybe we could
> still use the ("nonexistent") requested locale name in this case, or
> some derivative of it that clarifies that it's synthesized...?

+1 to using this approach.

We could use a locale name such as "en_US@virtual.UTF-8".

glibc uses this style of locale name for locales such as UK English
with eurozone LC_CURRENCY: en_UK@euro.UTF-8.

William



* Re: setlocale behavior with 'missing' locales
  2018-03-01 19:10     ` William Pitcock
@ 2018-03-01 19:25       ` Rich Felker
  2018-03-01 20:45         ` Rich Felker
  2018-03-02  1:43         ` Rich Felker
  0 siblings, 2 replies; 11+ messages in thread
From: Rich Felker @ 2018-03-01 19:25 UTC (permalink / raw)
  To: musl

On Thu, Mar 01, 2018 at 01:10:47PM -0600, William Pitcock wrote:
> >> One notable issue is that, right now, we rely on being able to set
> >> LC_MESSAGES to an arbitrary name even if there's no libc locale
> >> definition for it; this is because gettext() relies on the name of the
> >> current LC_MESSAGES locale to find (application-specific) translation
> >> files that might exist even without a libc translation. I'm not sure
> >> how we would best keep this working under changes similar to the
> >> above.
> >
> > Any further thoughts on this? I'd like to begin addressing these
> > issues in this release cycle.
> >
> > I think the above plan works (is conforming, doesn't break things)
> > except for the LC_MESSAGES issue mentioned at the end. I don't have
> > any good ideas still for dealing with that. Really since gettext can
> > be used with any category, not just LC_MESSAGES (although LC_MESSAGES
> > is the normal choice), it applies to all categories. Maybe we could
> > still use the ("nonexistent") requested locale name in this case, or
> > some derivative of it that clarifies that it's synthesized...?
> 
> +1 to using this approach.
> 
> We could use a locale name such as "en_US@virtual.UTF-8".
> 
> glibc uses this style of locale name for locales such as UK english
> with eurozone LC_CURRENCY: en_UK@euro.UTF-8.

I was actually just in the process of trying to work out something
very similar. Here's how I think it might work:

setlocale(cat, "") -- always succeeds, produces ll_TT@virtual (or
ll_TT@missing was my idea) if a locale file by the matching name is
not found.

setlocale(cat, "ll_TT@virtual") (or whatever name) - always succeeds.

setlocale(cat, "ll_TT[@other]") - succeeds only if a file matching the
name is found.

One thing I don't entirely like is repurposing the @ modifier for
this; it conflicts with (and perhaps fails to preserve) an existing
modifier if there is one, and affects how search for gettext
translation files would happen (searching extra @virtual paths).
Perhaps we should instead make it a separate component delimited in
some other way so it can always be dropped by gettext.
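
The three cases above, plus the gettext concern, can be sketched as
pure string handling. Everything here is hypothetical: the "@virtual"
marker is just one of the candidate spellings under discussion, and
the function names are invented for the illustration.

```c
#include <assert.h>
#include <string.h>

#define VIRTUAL_SUFFIX "@virtual"   /* candidate marker, not final */

enum request_kind {
    REQ_ENVIRONMENT,   /* "": always succeeds, may synthesize a name */
    REQ_SYNTHESIZED,   /* "ll_TT@virtual": always succeeds */
    REQ_NEEDS_FILE     /* anything else: succeeds only if a file exists */
};

static int has_virtual_suffix(const char *name)
{
    size_t n = strlen(name), k = strlen(VIRTUAL_SUFFIX);

    return n > k && strcmp(name + n - k, VIRTUAL_SUFFIX) == 0;
}

/* Decide which of the three rules applies to a setlocale argument. */
enum request_kind classify_request(const char *name)
{
    assert(name != NULL);
    if (!*name)
        return REQ_ENVIRONMENT;
    if (has_virtual_suffix(name))
        return REQ_SYNTHESIZED;
    return REQ_NEEDS_FILE;
}

/* Copy name into buf with the marker dropped, giving the name gettext
 * should search under. Returns 0 on success, -1 if buf is too small. */
int gettext_lookup_name(char *buf, size_t size, const char *name)
{
    size_t n;

    assert(buf != NULL && name != NULL);
    n = strlen(name);
    if (has_virtual_suffix(name))
        n -= strlen(VIRTUAL_SUFFIX);
    if (n >= size)
        return -1;
    memcpy(buf, name, n);
    buf[n] = '\0';
    return 0;
}
```

The sketch also shows the conflict: a requested name that already
carries a modifier, such as sr_RS@latin, has no obvious place for the
marker without losing or mangling the existing one.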

Rich



* Re: setlocale behavior with 'missing' locales
  2018-03-01 19:25       ` Rich Felker
@ 2018-03-01 20:45         ` Rich Felker
  2018-03-02  1:43         ` Rich Felker
  1 sibling, 0 replies; 11+ messages in thread
From: Rich Felker @ 2018-03-01 20:45 UTC (permalink / raw)
  To: musl

On Thu, Mar 01, 2018 at 02:25:45PM -0500, Rich Felker wrote:
> On Thu, Mar 01, 2018 at 01:10:47PM -0600, William Pitcock wrote:
> > >> One notable issue is that, right now, we rely on being able to set
> > >> LC_MESSAGES to an arbitrary name even if there's no libc locale
> > >> definition for it; this is because gettext() relies on the name of the
> > >> current LC_MESSAGES locale to find (application-specific) translation
> > >> files that might exist even without a libc translation. I'm not sure
> > >> how we would best keep this working under changes similar to the
> > >> above.
> > >
> > > Any further thoughts on this? I'd like to begin addressing these
> > > issues in this release cycle.
> > >
> > > I think the above plan works (is conforming, doesn't break things)
> > > except for the LC_MESSAGES issue mentioned at the end. I don't have
> > > any good ideas still for dealing with that. Really since gettext can
> > > be used with any category, not just LC_MESSAGES (although LC_MESSAGES
> > > is the normal choice), it applies to all categories. Maybe we could
> > > still use the ("nonexistent") requested locale name in this case, or
> > > some derivative of it that clarifies that it's synthesized...?
> > 
> > +1 to using this approach.
> > 
> > We could use a locale name such as "en_US@virtual.UTF-8".
> > 
> > glibc uses this style of locale name for locales such as UK english
> > with eurozone LC_CURRENCY: en_UK@euro.UTF-8.
> 
> I was actually just in the process of trying to work out something
> very similar. Here's how I think it might work:
> 
> setlocale(cat, "") -- always succeeds, produces ll_TT@virtual (or
> ll_TT@missing was my idea) if a locale file by the matching name is
> not found.
> 
> setlocale(cat, "ll_TT@virtual") (or whatever name) - always succeeds.
> 
> setlocale(cat, "ll_TT[@other]") - succeeds only if a file matching the
> name is found.
> 
> One thing I don't entirely like is repurposing the @ modifier for
> this; it conflicts with (and perhaps fails to preserve) an existing
> modifier if there is one, and affects how search for gettext
> translation files would happen (searching extra @virtual paths).
> Perhaps we should instead make it a separate component delimited in
> some other way so it can always be dropped by gettext.

Implementation notes if we do this:

__get_locale is the internal backend that loads locale maps, and looks
like the point at which this all should be implemented.

Presently __get_locale has no means to return an error; a null return
value indicates the C locale, which is represented everywhere by the
lack of any locale map.

It seems __get_locale has all the information it needs to decide how
to act:

- If the argument is "", missing/virtual locale synthesis should
  happen. If allocation failures etc. prevent synthesis, it should
  behave as if the argument had been "C.UTF-8".

- If the argument is one of the builtin locales (C/C.UTF-8/POSIX) it
  can return one of the builtin maps. Right now it oddly replaces
  "C.UTF-8" with just plain "C" (null return value) in all categories
  except LC_CTYPE. This behavior should perhaps be revisited, but
  newlocale.c and perhaps other places encode assumptions that it's
  done this way.

- If the argument is another name that can't be found, an error should
  be returned to the caller somehow. We could perhaps use MAP_FAILED.
  The alternative seems to be reworking the contract so that null
  doesn't mean C and either using a real locale_map object for the C
  locale or translating to null in the caller, but these choices seem
  to impose worse costs/effects elsewhere.
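
A rough sketch of the three cases above, for illustration only. This is
not the actual __get_locale code: the struct, the LOC_FAILED sentinel
(standing in for something like MAP_FAILED), the table of "installed"
locales, and the envval parameter (standing in for the LC_*/LANG lookup)
are all hypothetical. How a synthesized name should be spelled is left
open here, so the requested name is simply copied.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a locale map; not musl's real structure. */
struct locale_map_sketch {
	int synthesized;   /* nonzero if no locale file backs this map */
	char name[24];
};

/* Hypothetical error sentinel, distinct from null (which means "C"). */
#define LOC_FAILED ((const struct locale_map_sketch *)-1)

static const struct locale_map_sketch c_utf8_map = { 0, "C.UTF-8" };

/* Pretend these are the locale files found on disk. */
static const struct locale_map_sketch installed[] = { { 0, "en_US" } };

/* Mirrors the behavior described above: "C.UTF-8" collapses to plain
 * "C" (null) in every category except LC_CTYPE. */
static const struct locale_map_sketch *builtin_c_utf8(int cat)
{
	return cat == LC_CTYPE ? &c_utf8_map : NULL;
}

const struct locale_map_sketch *
get_locale_sketch(int cat, const char *name, const char *envval)
{
	int from_env = !*name;

	/* "" means: take the name from the environment, assuming a
	 * default of "C.UTF-8" when nothing is set. */
	if (from_env) name = (envval && *envval) ? envval : "C.UTF-8";

	/* Builtin locales never fail. */
	if (!strcmp(name, "C") || !strcmp(name, "POSIX")) return NULL;
	if (!strcmp(name, "C.UTF-8")) return builtin_c_utf8(cat);

	for (size_t i = 0; i < sizeof installed / sizeof *installed; i++)
		if (!strcmp(name, installed[i].name)) return &installed[i];

	/* Explicit name that can't be found: report an error. */
	if (!from_env) return LOC_FAILED;

	/* Name came from the environment: synthesize, and fall back to
	 * C.UTF-8 behavior if synthesis is not possible. */
	struct locale_map_sketch *m;
	if (strlen(name) >= sizeof m->name || !(m = malloc(sizeof *m)))
		return builtin_c_utf8(cat);
	m->synthesized = 1;
	strcpy(m->name, name);
	return m;
}
```

With this shape, a caller like setlocale would treat LOC_FAILED as
failure and keep treating a null return as the C locale, so existing
code that tests for null would not need to change.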

None of the above covers anything about _how_ the synthesis of names
for missing locales should happen, just where/when it should happen.

Rich



* Re: setlocale behavior with 'missing' locales
  2018-03-01 19:25       ` Rich Felker
  2018-03-01 20:45         ` Rich Felker
@ 2018-03-02  1:43         ` Rich Felker
  1 sibling, 0 replies; 11+ messages in thread
From: Rich Felker @ 2018-03-02  1:43 UTC (permalink / raw)
  To: musl

On Thu, Mar 01, 2018 at 02:25:45PM -0500, Rich Felker wrote:
> On Thu, Mar 01, 2018 at 01:10:47PM -0600, William Pitcock wrote:
> > >> One notable issue is that, right now, we rely on being able to set
> > >> LC_MESSAGES to an arbitrary name even if there's no libc locale
> > >> definition for it; this is because gettext() relies on the name of the
> > >> current LC_MESSAGES locale to find (application-specific) translation
> > >> files that might exist even without a libc translation. I'm not sure
> > >> how we would best keep this working under changes similar to the
> > >> above.
> > >
> > > Any further thoughts on this? I'd like to begin addressing these
> > > issues in this release cycle.
> > >
> > > I think the above plan works (is conforming, doesn't break things)
> > > except for the LC_MESSAGES issue mentioned at the end. I don't have
> > > any good ideas still for dealing with that. Really since gettext can
> > > be used with any category, not just LC_MESSAGES (although LC_MESSAGES
> > > is the normal choice), it applies to all categories. Maybe we could
> > > still use the ("nonexistant") requested locale name in this case, or
> > > some derivative of it that clarifies that it's synthesized...?
> > 
> > +1 to using this approach.
> > 
> > We could use a locale name such as "en_US@virtual.UTF-8".
> > 
> > glibc uses this style of locale name for locales such as UK english
> > with eurozone LC_CURRENCY: en_UK@euro.UTF-8.
> 
> I was actually just in the process of trying to work out something
> very similar. Here's how I think it might work:
> 
> setlocale(cat, "") -- always succeeds, produces ll_TT@virtual (or
> ll_TT@missing was my idea) if a locale file by the matching name is
> not found.
> 
> setlocale(cat, "ll_TT@virtual") (or whatever name) - always succeeds.
> 
> setlocale(cat, "ll_TT[@other]") - succeeds only if a file matching the
> name is found.
> 
> One thing I don't entirely like is repurposing the @ modifier for
> this; it conflicts with (and perhaps fails to preserve) an existing
> modifier if there is one, and affects how search for gettext
> translation files would happen (searching extra @virtual paths).
> Perhaps we should instead make it a separate component delimited in
> some other way so it can always be dropped by gettext.

On this topic, I did some research on GNU gettext, and just like
musl's, it ignores the codeset part of the locale name
ll[_TT][.codeset][@modifier] while trying combinations of including or
omitting _TT and @modifier. So it looks like the only way to make a
synthesized locale name that can match all the same translation files
as the original name, under either musl or GNU gettext, is by
misappropriating the codeset field as the indicator that it's a
synthesized locale. That doesn't sound particularly good.

If we're only concerned about musl gettext and not GNU gettext or
other third-party software trying to parse the resulting synthesized
locale names, we can simply adopt any notation we like and have musl's
gettext ignore it.

Also in the case where the original requested locale had no @modifier
component, adding a special @synth/@missing/whatever modifier would
not disturb search for translations with either musl or GNU gettext.
At worst GNU gettext would search a few extra nonexistent pathnames.
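
To make the search behavior concrete, here is a small sketch that lists
the names a gettext-style lookup might try for a given locale name of
the form ll[_TT][.codeset][@modifier]. It is an illustration only, not
code from musl or GNU gettext; the function name, the fixed buffer
sizes, and the exact order of candidates are assumptions.

```c
#include <stdio.h>
#include <string.h>

/* Fills out[] with candidate directory names and returns the count.
 * The codeset is always dropped; _TT and @modifier are each tried
 * both present and absent. Assumes ll[_TT][.codeset][@modifier]. */
int gettext_candidates(const char *loc, char out[4][64])
{
	size_t len = strcspn(loc, ".@");   /* length of ll[_TT] */
	size_t ll = strcspn(loc, "_.@");   /* length of ll alone */
	const char *mod = strchr(loc, '@'); /* "@modifier" or none */
	int n = 0;

	if (!mod) mod = "";

	snprintf(out[n++], 64, "%.*s%s", (int)len, loc, mod);
	if (*mod) snprintf(out[n++], 64, "%.*s", (int)len, loc);
	if (ll < len) {
		snprintf(out[n++], 64, "%.*s%s", (int)ll, loc, mod);
		if (*mod) snprintf(out[n++], 64, "%.*s", (int)ll, loc);
	}
	return n;
}
```

Under these assumptions "de_DE" yields de_DE and de, while
"de_DE@missing" yields those same two plus de_DE@missing and
de@missing, which is the "few extra pathnames" case. A name that
already carries a modifier, such as "de_DE@euro", has no free slot left
for a marker without losing or altering the original modifier.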

One other thing to note is that synthesizing locales without adjusting
the name to indicate that they're synthesized does not seem consistent
if setlocale is going to reject unknown explicit names. The name that
the program reads back from setlocale(cat,0) or NL_LOCALE_NAME would
then fail to be valid for subsequent use as an explicit name.
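
The consistency property in question can be checked with nothing but
standard setlocale calls. The helper below is a hypothetical test
function, not part of any libc: it sets a locale, reads the name back,
and then tries to use that name as an explicit request.

```c
#include <locale.h>
#include <string.h>

/* Returns 1 if the name reported for `cat` can be set again as an
 * explicit name, 0 if it is rejected, and -1 if the initial request
 * failed or the name did not fit the local buffer. */
int locale_name_roundtrips(int cat, const char *request)
{
	char saved[256];
	const char *name;

	if (!setlocale(cat, request)) return -1;

	name = setlocale(cat, 0);   /* query only, changes nothing */
	if (!name || strlen(name) >= sizeof saved) return -1;

	/* Copy first: later setlocale calls may overwrite the string. */
	strcpy(saved, name);

	/* Move away, then ask for the saved name explicitly. */
	setlocale(cat, "C");
	return setlocale(cat, saved) != 0;
}
```

If setlocale(cat, "") synthesized a locale but reported the unmodified
requested name, and explicit unknown names were rejected, then
locale_name_roundtrips(cat, "") would return 0 on a system lacking the
corresponding locale file.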

One possible alternative to synthesizing names would be just reading
back the name of the locale that was actually set ("C.UTF-8" or some
fallback like "en" when "en_US" was requested but only "en" was
available). In this case GNU gettext or any third-party code would be
unable to honor the requested locale. musl's internal gettext could,
but I'm not sure this kind of hidden state would be desirable or
consistent, so I'd be a bit hesitant to do it. An alternative would be
just giving up on the ability to get message translations in a
language for which you don't have a locale installed. This would sound
a lot more acceptable if we actually had locale definition files, I
think....

Rich



Thread overview: 11+ messages
2017-11-08  5:03 setlocale behavior with 'missing' locales Rich Felker
2017-11-08  5:27 ` Rich Felker
2017-11-12 22:19   ` A. Wilcox
2017-11-13  0:15     ` Rich Felker
2018-02-12  6:02       ` A. Wilcox
2018-02-12 20:04         ` Rich Felker
2018-03-01  1:13   ` Rich Felker
2018-03-01 19:10     ` William Pitcock
2018-03-01 19:25       ` Rich Felker
2018-03-01 20:45         ` Rich Felker
2018-03-02  1:43         ` Rich Felker
