mailing list of musl libc
From: Rich Felker <dalias@libc.org>
To: musl@lists.openwall.com
Subject: Re: gettext and locale names
Date: Wed, 11 May 2016 19:26:14 -0400
Message-ID: <20160511232614.GJ21636@brightrain.aerifal.cx>
In-Reply-To: <CAA-4+jc-q4Hsxvf9GbrgPobU0Q0gWWGiCM2aXMwDDPUji8=rJw@mail.gmail.com>

On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
> 2016-05-05 6:39 GMT+09:00 Rich Felker <dalias@libc.org>:
> > On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
> >> Hello,
> >>
> >> When I played with gettext API, I found that musl searches .mo files
> >> with a directory named as current *full* locale names, e.g.
> >> en_US.UTF-8. However, we often use shortened names too. Here is a list
> >> of those names from those of my machine in /usr/share/locale: de,
> >> en_GB, ru_UA.koi8u, sr@latin, etc.
> >>
> >> Due to this mismatch, we can't get translations with musl's gettext
> >> API for applications in wild. Thus, I'm considering to implement
> >> locale searching with shortening. Does it make sense?
> >
> > Yes, I think this makes sense. Before spending time on the code though
> > it makes sense to discuss the proposed logic here. What level would
> > the search/shortening happen at? __get_locale in locale_map.c? In
> > dcngettext.c?
> 
> Sure. I doubt that shortening in __get_locale might be insufficient
> since some code may want the full locale name even if there is no
> locale data for it. I will dig into the code.
> 
> Another problem is the preference of shortened locales. Obviously, the
> full locale itself has the highest priority and language-only locales
> (e.g. en, de, etc.) do the lowest one. However, which is the preferred
> locale, en_GB@euro or en_GB.UTF-8, when the code receives
> en_GB.UTF-8@euro?
> 
> I am unsure whether someone actually uses such locale, but I think it
> is necessary to discuss such corner cases.

Conceptually, a locale name has to lead us to two sets of files: libc
locales in MUSL_LOCPATH, and gettext translation files in directories
provided to bindtextdomain. If/when we add non-stub catgets support,
the locale name is also relevant to NLSPATH processing, where %L
expands to the whole locale name, and %l, %t, and %c expand to the
language, territory, and codeset parts of it, respectively.
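
To make the NLSPATH substitutions concrete, here is a rough sketch of
expanding one template. nls_expand is a hypothetical helper, not
anything in musl; it assumes the locale name has the form
ll[_TT][.codeset][@mod], and it also handles %N (the catalog name
passed to catopen) and %%, which come from the POSIX description of
NLSPATH rather than from this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: expand one NLSPATH template for locale name
 * "loc" and catalog name "cat". Returns 0 on success, -1 if the
 * result does not fit in "out". */
static int nls_expand(const char *tmpl, const char *loc, const char *cat,
                      char *out, size_t size)
{
	size_t lang_len = strcspn(loc, "_.@");
	const char *p = loc + lang_len;
	const char *terr = "", *cs = "";
	size_t terr_len = 0, cs_len = 0, n = 0;

	if (!size) return -1;
	if (*p == '_') {               /* territory follows '_' */
		terr = p + 1;
		terr_len = strcspn(terr, ".@");
		p = terr + terr_len;
	}
	if (*p == '.') {               /* codeset follows '.' */
		cs = p + 1;
		cs_len = strcspn(cs, "@");
	}
	for (; *tmpl; tmpl++) {
		const char *s = tmpl;      /* default: copy one literal byte */
		size_t l = 1;
		if (*tmpl == '%' && tmpl[1]) {
			switch (*++tmpl) {
			case 'N': s = cat; l = strlen(cat); break;
			case 'L': s = loc; l = strlen(loc); break;
			case 'l': s = loc; l = lang_len; break;
			case 't': s = terr; l = terr_len; break;
			case 'c': s = cs; l = cs_len; break;
			case '%': s = "%"; l = 1; break;
			default: s = tmpl - 1; l = 2; break; /* keep unknown %x */
			}
		}
		if (n + l >= size) return -1;
		memcpy(out + n, s, l);
		n += l;
	}
	out[n] = 0;
	return 0;
}
```

With the template "/usr/share/nls/%l_%t/%N.cat", the locale name
"en_GB.UTF-8@euro" and the catalog name "foo", this produces
"/usr/share/nls/en_GB/foo.cat".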

From musl's standpoint all locales are UTF-8-encoded, so the codeset
portion of the locale name is at best redundant. The official musl
locale files, once we have such a thing, should not have ".UTF-8" in
their names, but a spurious ".UTF-8" component in the locale name
string should be accepted (and ignored) for compatibility with
glibc-based systems where the specifier may be necessary for
glibc-linked programs to distinguish from legacy versions of the
locales.

In principle we could implement this by stripping the ".UTF-8" at
setlocale time (in __get_locale from locale_map.c) but I don't see a
major advantage in doing that versus keeping the full string and just
stripping it when constructing filenames to try opening. On the other
hand there are advantages to keeping it: some users/distros may want
to put a spurious ".UTF-8" in the locale name to trick broken programs
that use strstr on the locale name, rather than nl_langinfo(CODESET),
to determine that they're in a UTF-8 environment.
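
For illustration, the two kinds of check look roughly like this. Both
function names are made up for the example; only strstr, strcmp, and
nl_langinfo(CODESET) are real interfaces.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Fragile: looks at the locale name string, so a UTF-8 locale named
 * plain "en_US" is misdetected as non-UTF-8. */
static int name_claims_utf8(const char *locale_name)
{
	return strstr(locale_name, "UTF-8") != 0;
}

/* Preferred: ask the C library which codeset is in effect. The
 * caller is expected to have run setlocale(LC_CTYPE, "") first. */
static int env_is_utf8(void)
{
	return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Keeping a spurious ".UTF-8" in the name is what lets the first kind
of program keep working; the second kind does not care what the
locale is called.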

For gettext translations, I haven't seen ".UTF-8" used either. My
$prefix/share/locale directories have under them directories only of
the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
gettext-based programs always store UTF-8 in their message catalogs
and legacy locales are expected to convert the contents when loading.

Based on all this, the search order we perform should probably be
something like this: First, take the input locale name and strip any
codeset identifier. Then, iterate over 4 steps:

1. Try full name.
2. If a modifier (@mod) is present, try with modifier removed.
3. If a territory (_TT) is present, try with territory removed.
4. If both modifier and territory are present, try with both removed.

At worst this yields 4 file-open attempts, and only in the case where
a user has requested an ll_TT@mod type locale and none of the first
three candidates exists. For ll_TT type locales, it yields at most 2
attempts. For ll type locales, or locale names that don't fit the
standard pattern, there should be at most one attempt.
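
A minimal sketch of that search order, just to pin down the logic.
locale_candidates and CAND_MAX are hypothetical names, not existing
musl code. The sketch assumes the codeset, if any, comes before the
modifier (ll_TT.codeset@mod) and does not validate that the name
fits the standard pattern.

```c
#include <stdio.h>
#include <string.h>

#define CAND_MAX 64

/* Hypothetical helper: fill out[] with the directory names to try,
 * in order, for locale "name". Returns the number of candidates
 * (1 to 4), or 0 if the name is too long. */
static int locale_candidates(const char *name, char out[4][CAND_MAX])
{
	size_t len = strlen(name);
	const char *at = strchr(name, '@');
	const char *mod = at ? at : name + len;   /* "@mod" or "" */
	size_t base_len = strcspn(name, ".@");    /* ll or ll_TT */
	const char *us = memchr(name, '_', base_len);
	size_t lang_len = us ? (size_t)(us - name) : base_len;
	int n = 0;

	if (len >= CAND_MAX) return 0;

	/* 1. Full name, with any codeset stripped. */
	snprintf(out[n++], CAND_MAX, "%.*s%s", (int)base_len, name, mod);
	/* 2. Modifier removed. */
	if (*mod)
		snprintf(out[n++], CAND_MAX, "%.*s", (int)base_len, name);
	/* 3. Territory removed, modifier kept. */
	if (us)
		snprintf(out[n++], CAND_MAX, "%.*s%s", (int)lang_len, name, mod);
	/* 4. Both removed. */
	if (*mod && us)
		snprintf(out[n++], CAND_MAX, "%.*s", (int)lang_len, name);
	return n;
}
```

For en_GB.UTF-8@euro this gives en_GB@euro, en_GB, en@euro, en, in
that order, which also settles the en_GB@euro versus en_GB.UTF-8
question: the codeset is dropped up front, so the modifier variant
is tried first.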

From an implementation side, note that, presently, dcngettext uses the
full pathname of the message catalog file as the key to look up the
memory-mapped image. This lookup needs to happen without touching the
filesystem, and the only reason a pathname is used is that it
encompasses all the necessary key components (bound directory, locale
name, category name, domain name). So if we go with my above proposal,
the "pathname" used as a key should still contain the full locale
name, not the particular fallback it resolved to, and thus might not
actually be a valid pathname anymore. Because of this it's plausible
that the same catalog file could end up getting mapped more than once
(e.g. as both /usr/share/locale/en_US/LC_MESSAGES/foo and
/usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
cost and I don't think it's worth trying to detect and avoid.
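
To illustrate the keying, here is a simplified sketch. It is not the
actual dcngettext code, and the struct and function names are made
up. The cache is searched by the pathname built from the requested
locale name, so a lookup never needs to touch the filesystem, and
the entry records whatever mapping the fallback search ended up
with. Synchronization is left out for brevity.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical cache entry: "key" is the pathname built from the
 * requested locale name, which may not exist as a file. */
struct cat_entry {
	struct cat_entry *next;
	const void *map;   /* mapped catalog the search resolved to */
	char key[];
};

static struct cat_entry *cats;

/* Pure string comparison; no filesystem access. */
static struct cat_entry *cat_lookup(const char *key)
{
	for (struct cat_entry *p = cats; p; p = p->next)
		if (!strcmp(p->key, key)) return p;
	return 0;
}

/* Record the result of a fallback search under the requested key. */
static struct cat_entry *cat_insert(const char *key, const void *map)
{
	struct cat_entry *p = malloc(sizeof *p + strlen(key) + 1);
	if (!p) return 0;
	strcpy(p->key, key);
	p->map = map;
	p->next = cats;
	cats = p;
	return p;
}
```

Under this scheme a request for en_US that resolved to the en
catalog is stored under the en_US key, and a later direct request
for en gets its own entry pointing at a second mapping of the same
file.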

Does this all make sense? Does it sound reasonable?

Rich
