From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: gettext and locale names
Date: Wed, 11 May 2016 19:26:14 -0400
Message-ID: <20160511232614.GJ21636@brightrain.aerifal.cx>
References: <20160504213938.GV21636@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
> 2016-05-05 6:39 GMT+09:00 Rich Felker :
> > On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
> >> Hello,
> >>
> >> When I played with gettext API, I found that musl searches .mo files
> >> with a directory named as current *full* locale names, e.g.
> >> en_US.UTF-8. However, we often use shortened names too.
> >> Here is a list
> >> of those names from those of my machine in /usr/share/locale: de,
> >> en_GB, ru_UA.koi8u, sr@latin, etc.
> >>
> >> Due to this mismatch, we can't get translations with musl's gettext
> >> API for applications in wild. Thus, I'm considering to implement
> >> locale searching with shortening. Does it make sense?
> >
> > Yes, I think this makes sense. Before spending time on the code though
> > it makes sense to discuss the proposed logic here. What level would
> > the search/shortening happen at? __get_locale in locale_map.c? In
> > dcngettext.c?
>
> Sure. I doubt that shortening in __get_locale might be insufficient
> since some code may want the full locale name even if there is no
> locale data for it. I will dig into the code.
>
> Another problem is the preference of shortened locales. Obviously, the
> full locale itself has the highest priority and language-only locales
> (e.g. en, de, etc.) do the lowest one. However, which is the preferred
> locale, en_GB@euro or en_GB.UTF-8, when the code receives
> en_GB.UTF-8@euro?
>
> I am unsure whether someone actually uses such locale, but I think it
> is necessary to discuss such corner cases.

Conceptually there are two sets of names the locale names need to lead
us to: libc locales in MUSL_LOCPATH, and gettext translation files in
directories provided to bindtextdomain. If/when we add non-stub
catgets support, the locale name is also relevant to NLSPATH
processing where %L expands to the whole locale name, and %l, %t, and
%c expand to the language, territory, and codeset parts of it,
respectively.

From musl's standpoint all locales are UTF-8-encoded, so the codeset
portion of the locale name is at best redundant.
The official musl locale files, once we have such a thing, should not
have ".UTF-8" in their names, but a spurious ".UTF-8" component in the
locale name string should be accepted (and ignored) for compatibility
with glibc-based systems where the specifier may be necessary for
glibc-linked programs to distinguish from legacy versions of the
locales.

In principle we could implement this by stripping the ".UTF-8" at
setlocale time (in __get_locale from locale_map.c) but I don't see a
major advantage in doing that versus keeping the full string and just
stripping it when constructing filenames to try opening. On the other
hand there are advantages to keeping it: some users/distros may want
to put a spurious ".UTF-8" in the locale name to trick broken programs
that use strstr on the locale name, rather than nl_langinfo(CODESET),
to determine that they're in a UTF-8 environment.

For gettext translations, I haven't seen ".UTF-8" used either. My
$prefix/share/locale directories have under them directories only of
the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
gettext-based programs always store UTF-8 in their message catalogs
and legacy locales are expected to convert the contents when loading.

Based on all this, the search order we perform should probably be
something like this: First, take the input locale name and strip any
codeset identifier. Then, iterate over 4 steps:

1. Try full name.
2. If a modifier (@mod) is present, try with modifier removed.
3. If a territory (_TT) is present, try with territory removed.
4. If both modifier and territory are present, try with both removed.

At worst this yields 4 file-open attempts, and only in the case where
a user has requested a ll_TT@mod type locale but either the @mod or
_TT does not exist. For ll_TT type locales, it yields at most 2
attempts. For ll type locales, or locale names that don't fit the
standard pattern, there should be at most one attempt.
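
A rough sketch of that search order, written as standalone C rather
than actual musl code. The function name locale_candidates and the two
size macros are made up for illustration, and the fixed-size buffers
are only a simplification; it just builds the list of names to try and
leaves the file opening to the caller:

```c
#include <stdio.h>
#include <string.h>

#define LOC_CAND_MAX 4   /* hypothetical: at most 4 names are ever tried */
#define LOC_NAME_MAX 64  /* hypothetical: buffer size for one candidate */

/* Hypothetical helper, not part of musl. Fills out[] with the locale
 * names to try, highest priority first, and returns how many there
 * are (0 if the name does not fit in LOC_NAME_MAX). */
static int locale_candidates(const char *name, char out[][LOC_NAME_MAX])
{
    /* Split off the "@mod" suffix, if any. */
    const char *at = strchr(name, '@');
    size_t base_len = at ? (size_t)(at - name) : strlen(name);
    const char *mod = at ? at : "";

    /* Strip the codeset: everything from the first '.' before "@mod". */
    const char *dot = memchr(name, '.', base_len);
    if (dot) base_len = (size_t)(dot - name);

    /* The language part ends at the first '_' (start of "_TT"). */
    const char *us = memchr(name, '_', base_len);
    size_t lang_len = us ? (size_t)(us - name) : base_len;

    if (base_len + strlen(mod) >= LOC_NAME_MAX) return 0;

    int n = 0;
    /* 1. Full name (codeset already removed). */
    snprintf(out[n++], LOC_NAME_MAX, "%.*s%s", (int)base_len, name, mod);
    /* 2. Modifier removed. */
    if (*mod)
        snprintf(out[n++], LOC_NAME_MAX, "%.*s", (int)base_len, name);
    /* 3. Territory removed (modifier, if any, kept). */
    if (us)
        snprintf(out[n++], LOC_NAME_MAX, "%.*s%s", (int)lang_len, name, mod);
    /* 4. Both removed. */
    if (*mod && us)
        snprintf(out[n++], LOC_NAME_MAX, "%.*s", (int)lang_len, name);
    return n;
}
```

With "en_GB.UTF-8@euro" as input this produces en_GB@euro, en_GB,
en@euro, en, in that order; "en_US.UTF-8" produces en_US and en; and
"de" produces only de.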

From an implementation side, note that, presently, dcngettext uses the
full pathname of the message catalog file as the key to look up the
memory-mapped image. This lookup needs to happen without touching the
filesystem, and the only reason a pathname is used is that it
encompasses all the necessary key components (bound directory, locale
name, category name, domain name). So if we go with my above proposal,
the "pathname" used as a key should still contain the full locale
name, not the particular fallback it resolved to, and thus might not
actually be a valid pathname anymore.

Because of this it's plausible that the same catalog file could end up
getting mapped more than once (e.g. as both
/usr/share/locale/en_US/LC_MESSAGES/foo and
/usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
cost and I don't think it's worth trying to detect and avoid.

Does this all make sense? Does it sound reasonable?

Rich