* setlocale behavior with 'missing' locales
@ 2017-11-08  5:03 Rich Felker
  2017-11-08  5:27 ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2017-11-08  5:03 UTC (permalink / raw)
  To: musl

One of the primary concerns when the byte-based C locale was added(*)
was not to introduce regressions in the property that musl is "always
UTF-8" except when the user or application has explicitly requested a
byte-based ("C"/"POSIX") locale.

First, some background: In order for the standard libc interfaces to
honor character encoding, a portable program has always needed to call
setlocale(LC_CTYPE, "") or setlocale(LC_ALL, ""). Addition of the
byte-based C locale "disabled UTF-8" in any application which wasn't
calling setlocale, but that was deemed acceptable since such
applications were not portable and would not work on other systems
anyway.

The other important case to consider was failure of setlocale. Prior
to the addition of the byte-based C locale, setlocale was essentially
a no-op, and from a practical standpoint it didn't matter if it
succeeded or failed because the preexisting "C" locale at program
entry already provided UTF-8. But afterwards, if setlocale failed for
some reason, applications that were trying to do the right thing would
suffer regression. We ruled out spurious failure for resource
exhaustion reasons by making a statically allocated C.UTF-8 locale
object. But the other possible source of failure would have been
having LC_* variables in the environment (perhaps as a result of
ssh'ing from another system or running a musl-linked binary on a
glibc-based system) with no corresponding locale files for musl. If we
treated that as an error, UTF-8 would have suddenly broken in all
sorts of real-world situations, and one of the core original design
goals/values of musl would have been broken.

The choice I made at the time to avoid this was to declare that all
locale names are valid locales, and if there's no actual file defining
the locale, it's simply a clone of C.UTF-8. So for example if you run
with LC_ALL=fr_FR but no fr_FR translation file, you get a locale
named fr_FR (that's what setlocale reports as the active locale) but
with no translated messages/dates/etc., just UTF-8 character encoding
(so you're still able to access all characters properly and use
localized or multilingual data).

Unfortunately this turns out to have been something of a tradeoff,
since there's no way for applications (and, as it turns out,
especially tests/test suites) to query whether a particular locale is
"really" available. I've been asked to change the behavior to fail on
unknown locale names, but of course that's not a working option in
light of the above.

I think there may be a solution that makes everyone happy, but I'm not
sure yet. I'm going to follow up with a description and analysis of
whether it's valid/conforming.

Rich


(*) References on byte-based C locale:

Subject: [musl] Possible bytelocale patch
Message-ID: <20140703071318.GA10117@brightrain.aerifal.cx>

Subject: [musl] Revisiting byte-based C locale
Message-ID: <20150522022203.GA26651@brightrain.aerifal.cx>

Subject: [musl] [PATCH] Byte-based C locale, draft 1
Message-ID: <20150606214007.GA17398@brightrain.aerifal.cx>

commit 1507ebf837334e9e07cfab1ca1c2e88449069a80
byte-based C locale, phase 1: multibyte character handling functions

commit 16f18d036d9a7bf590ee6eb86785c0a9658220b6
byte-based C locale, phase 2: stdio and iconv (multibyte callers)

^ permalink raw reply	[flat|nested] 11+ messages in thread
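
For readers following along, the startup call described in this message can be sketched as below. This is only an illustration, not musl code: the helper names init_ctype_from_env and ctype_is_utf8 are made up here, and only setlocale() and nl_langinfo() are standard interfaces. On an implementation that rejects unknown locale names the first helper reports the failure and the program stays in the "C" locale it started in, which is exactly the regression the message is worried about.

```c
#include <locale.h>
#include <langinfo.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: select LC_CTYPE from the environment, as a
 * portable program must, and return the name of the locale that is
 * active afterwards. */
const char *init_ctype_from_env(void)
{
    if (!setlocale(LC_CTYPE, "")) {
        /* Can happen where unknown names are treated as errors; the
         * program then remains in the startup "C" locale. */
        fputs("warning: environment names an unsupported locale\n", stderr);
    }
    /* A null second argument only queries the current setting. */
    return setlocale(LC_CTYPE, NULL);
}

/* Hypothetical helper: 1 if the active LC_CTYPE locale reports UTF-8
 * as its character encoding, 0 otherwise. */
int ctype_is_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program would call init_ctype_from_env() once at startup and could then use ctype_is_utf8() to decide whether multibyte output is safe.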
* Re: setlocale behavior with 'missing' locales 2017-11-08 5:03 setlocale behavior with 'missing' locales Rich Felker @ 2017-11-08 5:27 ` Rich Felker 2017-11-12 22:19 ` A. Wilcox 2018-03-01 1:13 ` Rich Felker 0 siblings, 2 replies; 11+ messages in thread From: Rich Felker @ 2017-11-08 5:27 UTC (permalink / raw) To: musl On Wed, Nov 08, 2017 at 12:03:38AM -0500, Rich Felker wrote: > Unfortunately this turns out to have been something of a tradeoff, > since there's no way for applications (and, as it turns out, > especially tests/test suites) to query whether a particular locale is > "really" available. I've been asked to change the behavior to fail on > unknown locale names, but of course that's not a working option in > light of the above. > > I think there may be a solution that makes everyone happy, but I'm not > sure yet. I'm going to follow up with a description and analysis of > whether it's valid/conforming. So here's the possible solution. ISO C leaves the default locale when setlocale(cat,"") is called implementation-defined. POSIX however defines it in terms of the LANG and LC_* environment variables. See the CX text in: http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html "Setting all of the categories of the global locale is similar to successively setting each individual category of the global locale, except that all error checking is done before any actions are performed. To set all the categories of the global locale, setlocale() can be invoked as: setlocale(LC_ALL, ""); In this case, setlocale() shall first verify that the values of all the environment variables it needs according to the precedence rules (described in XBD Environment Variables) indicate supported locales. If the value of any of these environment variable searches yields a locale that is not supported (and non-null), setlocale() shall return a null pointer and the global locale shall not be changed. 
If all environment variables name supported locales, setlocale() shall proceed as if it had been called for each category, using the appropriate value from the associated environment variable or from the implementation-defined default if there is no such value." and the Environment Variables text in XBD 8.2: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02 The former seems to tie our hands: unless the locales determined by the environment variables all exist, setlocale is required to fail and leave us in the (unacceptable) "C" locale where UTF-8 doesn't work. However the latter seems to offer us a way out. After describing how the precedence of the variables work, how locale pathnames work if localedef is supported (musl doesn't support it), and how implementation-provided/defined locale names work, it specifies: "If the locale value is not recognized by the implementation, the behavior is unspecified." My optimistic reading of this is that, in the event the locale name provided does not correspond to something we recognize, we're free to define how it's interpreted, and always interpret it as C.UTF-8. What this would achieve is the following: 1. setlocale(cat, explicit_locale_name) - succeeds if the locale actually has a definition file, fails and returns a null pointer otherwise. 2. setlocale(cat, "") - always succeeds, honoring the environment variable for the category if a locale definition file by that name exists, but otherwise (the unspecified behavior) treating it as if it were C.UTF-8. This way, applications that probe for specific locale names can do so and determine if they exist, but applications that just want to use the default locale the user configured will still avoid catastrophic breakage (failure to support UTF-8) even if they encounter "bad" LC_* variables. Does this approach sound acceptable? 
I'm fairly content with interpreting it as conforming to the standard;
I'm mainly concerned about whether there might be unforeseen breakage.

One notable issue is that, right now, we rely on being able to set
LC_MESSAGES to an arbitrary name even if there's no libc locale
definition for it; this is because gettext() relies on the name of the
current LC_MESSAGES locale to find (application-specific) translation
files that might exist even without a libc translation. I'm not sure
how we would best keep this working under changes similar to the
above.

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread
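
The two-case rule proposed in this message can be modelled in a few lines of C. This is a sketch under stated assumptions, not the musl implementation: have_locale_file() is a hypothetical stand-in for the lookup of a locale definition file (it pretends only fr_FR is installed), resolve_locale() is an invented name, and the environment value is passed in as a parameter so the model does not depend on the real environment.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: pretend that only fr_FR has a definition file. */
static int have_locale_file(const char *name)
{
    return strcmp(name, "fr_FR") == 0;
}

/* The builtin names mentioned in the thread. */
static int is_builtin(const char *name)
{
    return strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0 ||
           strcmp(name, "C.UTF-8") == 0;
}

/* Model of the proposal. `requested` is the setlocale() argument and
 * `env_value` is what the LC_ and LANG variables resolve to for the
 * category (may be NULL). Returns the name of the locale definition
 * that would be used, or NULL where setlocale() would fail. */
const char *resolve_locale(const char *requested, const char *env_value)
{
    if (requested[0] != '\0') {
        /* Case 1: explicit name, fails unless the locale really exists. */
        if (is_builtin(requested) || have_locale_file(requested))
            return requested;
        return NULL;
    }
    /* Case 2: "" always succeeds; unrecognized names act as C.UTF-8. */
    if (env_value && env_value[0] &&
        (is_builtin(env_value) || have_locale_file(env_value)))
        return env_value;
    return "C.UTF-8";
}
```

With this model, resolve_locale("de_DE", NULL) fails, so a test suite can detect the missing locale, while resolve_locale("", "de_DE") still yields a UTF-8 locale for ordinary applications.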
* Re: setlocale behavior with 'missing' locales
  2017-11-08  5:27 ` Rich Felker
@ 2017-11-12 22:19   ` A. Wilcox
  2017-11-13  0:15     ` Rich Felker
  2018-03-01  1:13   ` Rich Felker
  1 sibling, 1 reply; 11+ messages in thread
From: A. Wilcox @ 2017-11-12 22:19 UTC (permalink / raw)
  To: musl

On 07/11/17 23:27, Rich Felker wrote:
> One notable issue is that, right now, we rely on being able to set
> LC_MESSAGES to an arbitrary name even if there's no libc locale
> definition for it; this is because gettext() relies on the name of
> the current LC_MESSAGES locale to find (application-specific)
> translation files that might exist even without a libc translation.
> I'm not sure how we would best keep this working under changes
> similar to the above.

I can think of two ways to handle it, but neither of them is all that
great:

1) Provide simple translations for the most common 90% of languages.
Some people are getting unfairly screwed here, but they are probably
getting screwed by every other app/library as well. It should be very
simple to find a list of month names and abbreviations online for
pretty much any language (even using Wikipedia's month article
translations or Microsoft's Open Software Translations project).

2) Use an access(3) call for /usr/share/locale/$LC_MESSAGES. This
means there is virtually no work for musl beyond adding the call, and
it will only succeed if the locale is available (which is exactly what
the standard demands). The two problems I see with this are:

1) if /usr/share is NFS shared this could lock.
But it would do so anyway if setlocale(3) succeeded.

2) it requires use of stdio which most people on this list hate.

Best,
--arw

-- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: setlocale behavior with 'missing' locales 2017-11-12 22:19 ` A. Wilcox @ 2017-11-13 0:15 ` Rich Felker 2018-02-12 6:02 ` A. Wilcox 0 siblings, 1 reply; 11+ messages in thread From: Rich Felker @ 2017-11-13 0:15 UTC (permalink / raw) To: musl On Sun, Nov 12, 2017 at 04:19:51PM -0600, A. Wilcox wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA256 > > On 07/11/17 23:27, Rich Felker wrote: > > One notable issue is that, right now, we rely on being able to set > > LC_MESSAGES to an arbitrary name even if there's no libc locale > > definition for it; this is because gettext() relies on the name of > > the current LC_MESSAGES locale to find (application-specific) > > translation files that might exist even without a libc translation. > > I'm not sure how we would best keep this working under changes > > similar to the above. > > I can think of two ways to handle it, but neither of them are all that > great: > > 1) Provide simple translations for the most common 90% of languages. > Some people are getting unfairly screwed here, but they are probably > getting screwed by every other app/library as well. It should be very > simple to find a list of month names and abbreviations online for > pretty much any language (even using Wikipedia's month article > translations or Microsoft's Open Software Translations project). This is an interesting idea I once considered, but it has too many problems. It's a large volume of data that would be duplicated in every static-linked application, and at least some of it would become outdated or would be subject to user disagreement about what it should be -- things like time formatting, radix separator (if we make it variable at all), etc. Also the category in question is LC_MESSAGES, which has nothing to do with dates but rather strerror messages and such. Having default translations for these for all languages linked into libc really does not make sense. > 2) Use an access(3) call for /usr/share/locale/$LC_MESSAGES. 
This > means there is virtually no work for musl beyond adding the call, and > it will only succeed if the locale is available (which is exactly what > the standard demands). The two problems I see with this is: access() is generally always wrong (uses wrong permissions when real & effective ids differ), but that's a minor detail. However I don't see how this tells you anything useful. All it tells you is whether at least some application installed in the given prefix (here /usr) has translation files for the particular locale name. It doesn't tell you whether the running application does (false positives), nor does it tell you whether applications with different prefixes might (false negatives). > 1) if /usr/share is NFS shared this could lock. > But it would do so anyway if setlocale(3) succeeded. > > 2) it requires use of stdio which most people on this list hate. I don't see what relation it has to stdio. Rich ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: setlocale behavior with 'missing' locales 2017-11-13 0:15 ` Rich Felker @ 2018-02-12 6:02 ` A. Wilcox 2018-02-12 20:04 ` Rich Felker 0 siblings, 1 reply; 11+ messages in thread From: A. Wilcox @ 2018-02-12 6:02 UTC (permalink / raw) To: musl, adelie-devel [-- Attachment #1.1: Type: text/plain, Size: 538 bytes --] I'd just like to further note that as of 2.55, the GLib test suite now fails for the same reason as Perl's (and libunistring, and coreutils, and libidn, and git, ...): it tries to set LC_COLLATE to en_US, it "succeeds", then it tries to collate and fails to get the expected result. I'm not *quite* to the point where I am going to just LD_PRELOAD a stub that makes setlocale always fail when running `abuild check`, but I'm very close. --arw -- A. Wilcox (awilfox) Project Lead, Adélie Linux http://adelielinux.org [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 866 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: setlocale behavior with 'missing' locales 2018-02-12 6:02 ` A. Wilcox @ 2018-02-12 20:04 ` Rich Felker 0 siblings, 0 replies; 11+ messages in thread From: Rich Felker @ 2018-02-12 20:04 UTC (permalink / raw) To: musl On Mon, Feb 12, 2018 at 12:02:48AM -0600, A. Wilcox wrote: > I'd just like to further note that as of 2.55, the GLib test suite now > fails for the same reason as Perl's (and libunistring, and coreutils, > and libidn, and git, ...): it tries to set LC_COLLATE to en_US, it > "succeeds", then it tries to collate and fails to get the expected result. > > I'm not *quite* to the point where I am going to just LD_PRELOAD a stub > that makes setlocale always fail when running `abuild check`, but I'm > very close. I'm still interested in following up with the idea for solving this, but was hoping for more input on whether it's appropriate. However actually implementing collation table support is a separate issue from fixing the "spurious success" of setlocale, and this breakage would still happen if you have a "real" en_US locale installed. So figuring out how to store collation tables in the locale data is also important here. Rich ^ permalink raw reply [flat|nested] 11+ messages in thread
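
The failing pattern reported for the test suites (set LC_COLLATE, see it "succeed", then find that strings still sort by code point) can be sketched as a small probe. This is a heuristic illustration only, not taken from GLib or any of the other projects named: collation_looks_localized() is an invented name, and comparing "a" with "B" is just one simple way to tell byte-order collation from a typical dictionary-style order.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical probe. Returns -1 if the locale cannot be selected,
 * 0 if strcoll() still orders "B" before "a" (plain code point order),
 * and 1 if "a" sorts before "B", as dictionary-style collation would
 * have it. The previous LC_COLLATE setting is restored before return. */
int collation_looks_localized(const char *name)
{
    char saved[256];
    const char *cur = setlocale(LC_COLLATE, NULL);
    int result;

    /* Copy the current name: a later setlocale() call may overwrite
     * the string that `cur` points to. */
    if (!cur || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);

    if (!setlocale(LC_COLLATE, name))
        return -1;
    result = strcoll("a", "B") < 0;
    setlocale(LC_COLLATE, saved);
    return result;
}
```

As the reply above points out, a result of 0 for en_US does not by itself show that the locale is missing: without collation tables in the locale data, a "real" en_US would give the same answer.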
* Re: setlocale behavior with 'missing' locales 2017-11-08 5:27 ` Rich Felker 2017-11-12 22:19 ` A. Wilcox @ 2018-03-01 1:13 ` Rich Felker 2018-03-01 19:10 ` William Pitcock 1 sibling, 1 reply; 11+ messages in thread From: Rich Felker @ 2018-03-01 1:13 UTC (permalink / raw) To: musl On Wed, Nov 08, 2017 at 12:27:15AM -0500, Rich Felker wrote: > On Wed, Nov 08, 2017 at 12:03:38AM -0500, Rich Felker wrote: > > Unfortunately this turns out to have been something of a tradeoff, > > since there's no way for applications (and, as it turns out, > > especially tests/test suites) to query whether a particular locale is > > "really" available. I've been asked to change the behavior to fail on > > unknown locale names, but of course that's not a working option in > > light of the above. > > > > I think there may be a solution that makes everyone happy, but I'm not > > sure yet. I'm going to follow up with a description and analysis of > > whether it's valid/conforming. > > So here's the possible solution. ISO C leaves the default locale when > setlocale(cat,"") is called implementation-defined. POSIX however > defines it in terms of the LANG and LC_* environment variables. See > the CX text in: > > http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html > > "Setting all of the categories of the global locale is similar to > successively setting each individual category of the global locale, > except that all error checking is done before any actions are > performed. To set all the categories of the global locale, > setlocale() can be invoked as: > > setlocale(LC_ALL, ""); > > In this case, setlocale() shall first verify that the values of all > the environment variables it needs according to the precedence rules > (described in XBD Environment Variables) indicate supported locales. 
> If the value of any of these environment variable searches yields a > locale that is not supported (and non-null), setlocale() shall > return a null pointer and the global locale shall not be changed. If > all environment variables name supported locales, setlocale() shall > proceed as if it had been called for each category, using the > appropriate value from the associated environment variable or from > the implementation-defined default if there is no such value." > > and the Environment Variables text in XBD 8.2: > > http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02 > > The former seems to tie our hands: unless the locales determined by > the environment variables all exist, setlocale is required to fail and > leave us in the (unacceptable) "C" locale where UTF-8 doesn't work. > However the latter seems to offer us a way out. After describing how > the precedence of the variables work, how locale pathnames work if > localedef is supported (musl doesn't support it), and how > implementation-provided/defined locale names work, it specifies: > > "If the locale value is not recognized by the implementation, the > behavior is unspecified." > > My optimistic reading of this is that, in the event the locale name > provided does not correspond to something we recognize, we're free to > define how it's interpreted, and always interpret it as C.UTF-8. > > What this would achieve is the following: > > 1. setlocale(cat, explicit_locale_name) - succeeds if the locale > actually has a definition file, fails and returns a null pointer > otherwise. > > 2. setlocale(cat, "") - always succeeds, honoring the environment > variable for the category if a locale definition file by that name > exists, but otherwise (the unspecified behavior) treating it as if > it were C.UTF-8. 
>
> This way, applications that probe for specific locale names can do so
> and determine if they exist, but applications that just want to use
> the default locale the user configured will still avoid catastrophic
> breakage (failure to support UTF-8) even if they encounter "bad" LC_*
> variables.
>
> Does this approach sound acceptable? I'm fairly content with
> interpreting it as conforming to the standard; I'm mainly concerned
> about whether there might be unforeseen breakage.
>
> One notable issue is that, right now, we rely on being able to set
> LC_MESSAGES to an arbitrary name even if there's no libc locale
> definition for it; this is because gettext() relies on the name of the
> current LC_MESSAGES locale to find (application-specific) translation
> files that might exist even without a libc translation. I'm not sure
> how we would best keep this working under changes similar to the
> above.

Any further thoughts on this? I'd like to begin addressing these
issues in this release cycle.

I think the above plan works (is conforming, doesn't break things)
except for the LC_MESSAGES issue mentioned at the end. I don't have
any good ideas still for dealing with that. Really since gettext can
be used with any category, not just LC_MESSAGES (although LC_MESSAGES
is the normal choice), it applies to all categories. Maybe we could
still use the ("nonexistent") requested locale name in this case, or
some derivative of it that clarifies that it's synthesized...?

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: setlocale behavior with 'missing' locales 2018-03-01 1:13 ` Rich Felker @ 2018-03-01 19:10 ` William Pitcock 2018-03-01 19:25 ` Rich Felker 0 siblings, 1 reply; 11+ messages in thread From: William Pitcock @ 2018-03-01 19:10 UTC (permalink / raw) To: musl Hi, On Wed, Feb 28, 2018 at 7:13 PM, Rich Felker <dalias@libc.org> wrote: > On Wed, Nov 08, 2017 at 12:27:15AM -0500, Rich Felker wrote: >> On Wed, Nov 08, 2017 at 12:03:38AM -0500, Rich Felker wrote: >> > Unfortunately this turns out to have been something of a tradeoff, >> > since there's no way for applications (and, as it turns out, >> > especially tests/test suites) to query whether a particular locale is >> > "really" available. I've been asked to change the behavior to fail on >> > unknown locale names, but of course that's not a working option in >> > light of the above. >> > >> > I think there may be a solution that makes everyone happy, but I'm not >> > sure yet. I'm going to follow up with a description and analysis of >> > whether it's valid/conforming. >> >> So here's the possible solution. ISO C leaves the default locale when >> setlocale(cat,"") is called implementation-defined. POSIX however >> defines it in terms of the LANG and LC_* environment variables. See >> the CX text in: >> >> http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html >> >> "Setting all of the categories of the global locale is similar to >> successively setting each individual category of the global locale, >> except that all error checking is done before any actions are >> performed. To set all the categories of the global locale, >> setlocale() can be invoked as: >> >> setlocale(LC_ALL, ""); >> >> In this case, setlocale() shall first verify that the values of all >> the environment variables it needs according to the precedence rules >> (described in XBD Environment Variables) indicate supported locales. 
>> If the value of any of these environment variable searches yields a >> locale that is not supported (and non-null), setlocale() shall >> return a null pointer and the global locale shall not be changed. If >> all environment variables name supported locales, setlocale() shall >> proceed as if it had been called for each category, using the >> appropriate value from the associated environment variable or from >> the implementation-defined default if there is no such value." >> >> and the Environment Variables text in XBD 8.2: >> >> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02 >> >> The former seems to tie our hands: unless the locales determined by >> the environment variables all exist, setlocale is required to fail and >> leave us in the (unacceptable) "C" locale where UTF-8 doesn't work. >> However the latter seems to offer us a way out. After describing how >> the precedence of the variables work, how locale pathnames work if >> localedef is supported (musl doesn't support it), and how >> implementation-provided/defined locale names work, it specifies: >> >> "If the locale value is not recognized by the implementation, the >> behavior is unspecified." >> >> My optimistic reading of this is that, in the event the locale name >> provided does not correspond to something we recognize, we're free to >> define how it's interpreted, and always interpret it as C.UTF-8. >> >> What this would achieve is the following: >> >> 1. setlocale(cat, explicit_locale_name) - succeeds if the locale >> actually has a definition file, fails and returns a null pointer >> otherwise. >> >> 2. setlocale(cat, "") - always succeeds, honoring the environment >> variable for the category if a locale definition file by that name >> exists, but otherwise (the unspecified behavior) treating it as if >> it were C.UTF-8. 
>> >> This way, applications that probe for specific locale names can do so >> and determine if they exist, but applications that just want to use >> the default locale the user configured will still avoid catastrophic >> breakage (failure to support UTF-8) even if they encounter "bad" LC_* >> variables. >> >> Does this approach sound acceptable? I'm fairly content with >> interpreting it as conforming to the standard; I'm mainly concerned >> about whether there might be unforseen breakage. >> >> One notable issue is that, right now, we rely on being able to set >> LC_MESSAGES to an arbitrary name even if there's no libc locale >> definition for it; this is because gettext() relies on the name of the >> current LC_MESSAGES locale to find (application-specific) translation >> files that might exist even without a libc translation. I'm not sure >> how we would best keep this working under changes similar to the >> above. > > Any further thoughts on this? I'd like to begin addressing these > issues in this release cycle. > > I think the above plan works (is conforming, doesn't break things) > except for the LC_MESSAGES issue mentioned at the end. I don't have > any good ideas still for dealing with that. Really since gettext can > be used with any category, not just LC_MESSAGES (although LC_MESSAGES > is the normal choice), it applies to all categories. Maybe we could > still use the ("nonexistant") requested locale name in this case, or > some derivative of it that clarifies that it's synthesized...? +1 to using this approach. We could use a locale name such as "en_US@virtual.UTF-8". glibc uses this style of locale name for locales such as UK english with eurozone LC_CURRENCY: en_UK@euro.UTF-8. William ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: setlocale behavior with 'missing' locales
  2018-03-01 19:10 ` William Pitcock
@ 2018-03-01 19:25   ` Rich Felker
  2018-03-01 20:45     ` Rich Felker
  2018-03-02  1:43     ` Rich Felker
  0 siblings, 2 replies; 11+ messages in thread
From: Rich Felker @ 2018-03-01 19:25 UTC (permalink / raw)
  To: musl

On Thu, Mar 01, 2018 at 01:10:47PM -0600, William Pitcock wrote:
> >> One notable issue is that, right now, we rely on being able to set
> >> LC_MESSAGES to an arbitrary name even if there's no libc locale
> >> definition for it; this is because gettext() relies on the name of the
> >> current LC_MESSAGES locale to find (application-specific) translation
> >> files that might exist even without a libc translation. I'm not sure
> >> how we would best keep this working under changes similar to the
> >> above.
> >
> > Any further thoughts on this? I'd like to begin addressing these
> > issues in this release cycle.
> >
> > I think the above plan works (is conforming, doesn't break things)
> > except for the LC_MESSAGES issue mentioned at the end. I don't have
> > any good ideas still for dealing with that. Really since gettext can
> > be used with any category, not just LC_MESSAGES (although LC_MESSAGES
> > is the normal choice), it applies to all categories. Maybe we could
> > still use the ("nonexistent") requested locale name in this case, or
> > some derivative of it that clarifies that it's synthesized...?
>
> +1 to using this approach.
>
> We could use a locale name such as "en_US@virtual.UTF-8".
>
> glibc uses this style of locale name for locales such as UK english
> with eurozone LC_CURRENCY: en_UK@euro.UTF-8.

I was actually just in the process of trying to work out something
very similar. Here's how I think it might work:

setlocale(cat, "") -- always succeeds, produces ll_TT@virtual (or
ll_TT@missing was my idea) if a locale file by the matching name is
not found.

setlocale(cat, "ll_TT@virtual") (or whatever name) - always succeeds.

setlocale(cat, "ll_TT[@other]") - succeeds only if a file matching the
name is found.

One thing I don't entirely like is repurposing the @ modifier for
this; it conflicts with (and perhaps fails to preserve) an existing
modifier if there is one, and affects how search for gettext
translation files would happen (searching extra @virtual paths).
Perhaps we should instead make it a separate component delimited in
some other way so it can always be dropped by gettext.

Rich

^ permalink raw reply	[flat|nested] 11+ messages in thread
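
The naming scheme floated in this message can be sketched as follows. Everything here is illustrative: the "@virtual" suffix is only a proposal in the thread, and make_virtual_name(), is_virtual_name() and setlocale_would_succeed() are invented names rather than musl functions. The have_file parameter stands in for whether a locale file matching the name was found.

```c
#include <stdio.h>
#include <string.h>

#define VIRTUAL_SUFFIX "@virtual"   /* proposed marker, not an existing one */

/* Write "<requested>@virtual" into out. Returns 0 on success and -1 if
 * the buffer is too small. */
int make_virtual_name(char *out, size_t outsize, const char *requested)
{
    int n = snprintf(out, outsize, "%s%s", requested, VIRTUAL_SUFFIX);
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}

/* 1 if the name ends in the marker (and has something before it). */
int is_virtual_name(const char *name)
{
    size_t len = strlen(name), slen = strlen(VIRTUAL_SUFFIX);
    return len > slen && strcmp(name + len - slen, VIRTUAL_SUFFIX) == 0;
}

/* Model of the three cases: returns 1 where setlocale() would succeed. */
int setlocale_would_succeed(const char *arg, int have_file)
{
    if (arg[0] == '\0')
        return 1;               /* "" always succeeds */
    if (is_virtual_name(arg))
        return 1;               /* explicitly virtual names always succeed */
    return have_file;           /* other names need a real locale file */
}
```

Feeding a name that already carries a modifier, such as sr_RS@latin, through make_virtual_name() yields sr_RS@latin@virtual, which illustrates the concern about repurposing the @ modifier.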
* Re: setlocale behavior with 'missing' locales 2018-03-01 19:25 ` Rich Felker @ 2018-03-01 20:45 ` Rich Felker 2018-03-02 1:43 ` Rich Felker 1 sibling, 0 replies; 11+ messages in thread From: Rich Felker @ 2018-03-01 20:45 UTC (permalink / raw) To: musl On Thu, Mar 01, 2018 at 02:25:45PM -0500, Rich Felker wrote: > On Thu, Mar 01, 2018 at 01:10:47PM -0600, William Pitcock wrote: > > >> One notable issue is that, right now, we rely on being able to set > > >> LC_MESSAGES to an arbitrary name even if there's no libc locale > > >> definition for it; this is because gettext() relies on the name of the > > >> current LC_MESSAGES locale to find (application-specific) translation > > >> files that might exist even without a libc translation. I'm not sure > > >> how we would best keep this working under changes similar to the > > >> above. > > > > > > Any further thoughts on this? I'd like to begin addressing these > > > issues in this release cycle. > > > > > > I think the above plan works (is conforming, doesn't break things) > > > except for the LC_MESSAGES issue mentioned at the end. I don't have > > > any good ideas still for dealing with that. Really since gettext can > > > be used with any category, not just LC_MESSAGES (although LC_MESSAGES > > > is the normal choice), it applies to all categories. Maybe we could > > > still use the ("nonexistant") requested locale name in this case, or > > > some derivative of it that clarifies that it's synthesized...? > > > > +1 to using this approach. > > > > We could use a locale name such as "en_US@virtual.UTF-8". > > > > glibc uses this style of locale name for locales such as UK english > > with eurozone LC_CURRENCY: en_UK@euro.UTF-8. > > I was actually just in the process of trying to work out something > very similar. Here's how I think it might work: > > setlocale(cat, "") -- always succeeds, produces ll_TT@virtual (or > ll_TT@missing was my idea) if a locale file by the matching name is > not found. 
> > setlocale(cat, "ll_TT@virtual") (or whatever name) - always succeeds. > > setlocale(cat, "ll_TT[@other]") - succeeds only if a file matching the > name is found. > > One thing I don't entirely like is repurposing the @ modifier for > this; it conflicts with (and perhaps fails to preserve) an existing > modifier if there is one, and affects how search for gettext > translation files would happen (searching extra @virtual paths). > Perhaps we should instead make it a separate component delimited in > some other way so it can always be dropped by gettext. Implementation notes if we do this: __get_locale is the internal backend that loads locale maps, and looks like the point at which this all should be implemented. Presently __get_locale has no means to return an error; a null return value indicates the C locale, which is represented everywhere by the lack of any locale map. It seems __get_locale has all the information it needs to decide how to act: - If the argument is "", missing/virtual locale synthesis should happen. If allocation failures etc. prevent synthesis, it should behave as if the argument had been "C.UTF-8". - If the argument is one of the builtin locales (C/C.UTF-8/POSIX) it can return one of the builtin maps. Right now it oddly replaces "C.UTF-8" with just plain "C" (null return value) in all categories except LC_CTYPE. This behavior might should be revisited but newlocale.c and perhaps other places encode assumptions that it's done this way. - If the argument is another name that can't be found, an error should be returned to the caller somehow. We could perhaps use MAP_FAILED. The alternative seems to be reworking the contract so that null doesn't mean C and either using a real locale_map object for the C locale or translating to null in the caller, but these choices seem to impose worse costs/effects elsewhere. 
None of the above covers anything about _how_ the synthesis of names
for missing locales should happen, just where/when it should happen.

Rich
* Re: setlocale behavior with 'missing' locales
  2018-03-01 19:25 ` Rich Felker
  2018-03-01 20:45 ` Rich Felker
@ 2018-03-02 1:43 ` Rich Felker
  1 sibling, 0 replies; 11+ messages in thread
From: Rich Felker @ 2018-03-02 1:43 UTC
To: musl

On Thu, Mar 01, 2018 at 02:25:45PM -0500, Rich Felker wrote:
> On Thu, Mar 01, 2018 at 01:10:47PM -0600, William Pitcock wrote:
> > >> One notable issue is that, right now, we rely on being able to set
> > >> LC_MESSAGES to an arbitrary name even if there's no libc locale
> > >> definition for it; this is because gettext() relies on the name of the
> > >> current LC_MESSAGES locale to find (application-specific) translation
> > >> files that might exist even without a libc translation. I'm not sure
> > >> how we would best keep this working under changes similar to the
> > >> above.
> > >
> > > Any further thoughts on this? I'd like to begin addressing these
> > > issues in this release cycle.
> > >
> > > I think the above plan works (is conforming, doesn't break things)
> > > except for the LC_MESSAGES issue mentioned at the end. I don't have
> > > any good ideas still for dealing with that. Really since gettext can
> > > be used with any category, not just LC_MESSAGES (although LC_MESSAGES
> > > is the normal choice), it applies to all categories. Maybe we could
> > > still use the ("nonexistent") requested locale name in this case, or
> > > some derivative of it that clarifies that it's synthesized...?
> >
> > +1 to using this approach.
> >
> > We could use a locale name such as "en_US@virtual.UTF-8".
> >
> > glibc uses this style of locale name for locales such as UK English
> > with eurozone LC_CURRENCY: en_UK@euro.UTF-8.
>
> I was actually just in the process of trying to work out something
> very similar. Here's how I think it might work:
>
> setlocale(cat, "") -- always succeeds, produces ll_TT@virtual (or
> ll_TT@missing was my idea) if a locale file by the matching name is
> not found.
>
> setlocale(cat, "ll_TT@virtual") (or whatever name) - always succeeds.
>
> setlocale(cat, "ll_TT[@other]") - succeeds only if a file matching the
> name is found.
>
> One thing I don't entirely like is repurposing the @ modifier for
> this; it conflicts with (and perhaps fails to preserve) an existing
> modifier if there is one, and affects how search for gettext
> translation files would happen (searching extra @virtual paths).
> Perhaps we should instead make it a separate component delimited in
> some other way so it can always be dropped by gettext.

On this topic, I did some research on GNU gettext, and just like
musl's it ignores the codeset part of the locale name
ll[_TT][.codeset][@modifier] while trying combinations of including or
omitting _TT and @modifier. So it looks like the only way to make a
synthesized locale name that can match all the same translation files
as the original name, under either musl or GNU gettext, is by
misappropriating the codeset field as the indicator that it's a
synthesized locale. That doesn't sound particularly good.

If we're only concerned about musl gettext and not GNU gettext or
other third-party software trying to parse the resulting synthesized
locale names, we can simply adopt any notation we like and have musl's
gettext ignore it.

Also, in the case where the original requested locale had no @modifier
component, adding a special @synth/@missing/whatever modifier would
not disturb the search for translations with either musl or GNU
gettext. At worst GNU gettext would search a few extra nonexistent
pathnames.

One other thing to note is that synthesizing locales without adjusting
the name to indicate that they're synthesized does not seem consistent
if setlocale is going to reject unknown explicit names. The name that
the program reads back from setlocale(cat,0) or NL_LOCALE_NAME would
then fail to be valid for subsequent use as an explicit name.
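
The gettext lookup behavior described above can be illustrated with a small sketch. It splits a name of the form ll[_TT][.codeset][@modifier], drops the codeset, and lists the names a lookup would try. The function name, the buffer sizes, and the exact order of candidates are assumptions made for illustration, not a copy of musl's or GNU gettext's code.

```c
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 4
#define CANDIDATE_LEN 64

/* Fill out[] with the locale directory names a gettext-style lookup
 * might try for locname, ignoring any ".codeset" part. Returns the
 * number of candidates written (1 to 4). */
int gettext_candidates(const char *locname,
                       char out[MAX_CANDIDATES][CANDIDATE_LEN])
{
    size_t len = strlen(locname);

    /* "@modifier" suffix, or the terminating NUL if there is none. */
    const char *mod = memchr(locname, '@', len);
    if (!mod)
        mod = locname + len;
    size_t modlen = len - (size_t)(mod - locname);

    /* Everything before the modifier, with ".codeset" cut off. */
    size_t baselen = (size_t)(mod - locname);
    const char *cs = memchr(locname, '.', baselen);
    if (cs)
        baselen = (size_t)(cs - locname);

    /* Language part alone, if a "_TT" territory is present. */
    const char *terr = memchr(locname, '_', baselen);
    size_t langlen = terr ? (size_t)(terr - locname) : baselen;

    int n = 0;
    /* ll_TT@modifier (or the whole base name if parts are absent) */
    snprintf(out[n++], CANDIDATE_LEN, "%.*s%.*s",
             (int)baselen, locname, (int)modlen, mod);
    /* ll_TT */
    if (modlen)
        snprintf(out[n++], CANDIDATE_LEN, "%.*s", (int)baselen, locname);
    if (terr) {
        /* ll@modifier */
        snprintf(out[n++], CANDIDATE_LEN, "%.*s%.*s",
                 (int)langlen, locname, (int)modlen, mod);
        /* ll */
        if (modlen)
            snprintf(out[n++], CANDIDATE_LEN, "%.*s",
                     (int)langlen, locname);
    }
    return n;
}
```

Under this sketch, "fr_FR" and a codeset-marked name like "fr_FR.missing" produce the same candidates, while "fr_FR@missing" produces those same two plus two extra @missing names that would simply not be found. A name that already carries a modifier, such as "fr_FR@euro", has no free slot for the marker, which is the conflict noted above.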
One possible alternative to synthesizing names would be just reading
back the name of the locale that was actually set ("C.UTF-8" or some
fallback like "en" when "en_US" was requested but only "en" was
available). In this case GNU gettext or any third-party code would be
unable to honor the requested locale. musl's internal gettext could,
but I'm not sure this kind of hidden state would be desirable or
consistent, so I'd be a bit hesitant to do it.

An alternative would be just giving up on the ability to get message
translations in a language for which you don't have a locale
installed. This would sound a lot more acceptable if we actually had
locale definition files, I think....

Rich
Thread overview: 11+ messages
2017-11-08 5:03 setlocale behavior with 'missing' locales Rich Felker
2017-11-08 5:27 ` Rich Felker
2017-11-12 22:19 ` A. Wilcox
2017-11-13 0:15 ` Rich Felker
2018-02-12 6:02 ` A. Wilcox
2018-02-12 20:04 ` Rich Felker
2018-03-01 1:13 ` Rich Felker
2018-03-01 19:10 ` William Pitcock
2018-03-01 19:25 ` Rich Felker
2018-03-01 20:45 ` Rich Felker
2018-03-02 1:43 ` Rich Felker
Code repositories for project(s) associated with this public inbox: https://git.vuxu.org/mirror/musl/