mailing list of musl libc
From: Masanori Ogino <masanori.ogino@gmail.com>
To: musl@lists.openwall.com
Subject: Re: gettext and locale names
Date: Fri, 20 May 2016 23:55:45 +0900
Message-ID: <CAA-4+je+sRPjxYqpVQM1SLeSDjoRAeAyFirjZKQ16VpXDGQ-kA@mail.gmail.com>
In-Reply-To: <20160511232614.GJ21636@brightrain.aerifal.cx>

Hello. I'm sorry for the delay.

2016-05-12 8:26 GMT+09:00 Rich Felker <dalias@libc.org>:
> On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
>> 2016-05-05 6:39 GMT+09:00 Rich Felker <dalias@libc.org>:
>> > On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
>> >> Hello,
>> >>
>> >> When I played with the gettext API, I found that musl searches for .mo
>> >> files in a directory named after the current *full* locale name, e.g.
>> >> en_US.UTF-8. However, shortened names are often used too. Here are some
>> >> examples from /usr/share/locale on my machine: de, en_GB, ru_UA.koi8u,
>> >> sr@latin, etc.
>> >>
>> >> Due to this mismatch, we can't get translations with musl's gettext
>> >> API for applications in the wild. Thus, I'm considering implementing
>> >> locale searching with shortening. Does it make sense?
>> >
>> > Yes, I think this makes sense. Before spending time on the code though
>> > it makes sense to discuss the proposed logic here. What level would
>> > the search/shortening happen at? __get_locale in locale_map.c? In
>> > dcngettext.c?
>>
>> Sure. I suspect that shortening in __get_locale might be insufficient
>> since some code may want the full locale name even if there is no
>> locale data for it. I will dig into the code.
>>
>> Another problem is the preference order of shortened locales. Obviously,
>> the full locale itself has the highest priority and language-only locales
>> (e.g. en, de, etc.) have the lowest. However, which is the preferred
>> locale, en_GB@euro or en_GB.UTF-8, when the code receives
>> en_GB.UTF-8@euro?
>>
>> I am unsure whether anyone actually uses such a locale, but I think it
>> is necessary to discuss such corner cases.
>
> Conceptually there are two sets of names the locale names need to lead
> us to: libc locales in MUSL_LOCPATH, and gettext translation files in
> directories provided to bindtextdomain. If/when we add non-stub
> catgets support, the locale name is also relevant to NLSPATH
> processing where %L expands to the whole locale name, and %l, %t, and
> %c expand to the language, territory, and codeset parts of it,
> respectively.

I wasn't aware of this. Thank you.
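
To make the parts concrete, here is a minimal sketch of splitting a locale
name of the form ll_TT.codeset@mod into the language, territory, codeset,
and modifier pieces that the %l, %t, and %c expansions refer to. The names
split_locale, struct locale_parts, and PART_MAX are hypothetical and not
taken from musl; over-long parts are simply truncated.

```c
#include <string.h>

#define PART_MAX 32

/* Hypothetical container for the pieces of "ll_TT.codeset@mod". */
struct locale_parts {
    char lang[PART_MAX], terr[PART_MAX], codeset[PART_MAX], mod[PART_MAX];
};

/* Copy at most n bytes of s into dst, truncating to fit, and terminate. */
static void copy_part(char *dst, const char *s, size_t n)
{
    if (n >= PART_MAX) n = PART_MAX - 1;
    memcpy(dst, s, n);
    dst[n] = 0;
}

/* Split name into its parts; parts that are absent become "". */
static void split_locale(const char *name, struct locale_parts *p)
{
    const char *s = name;
    size_t n;

    memset(p, 0, sizeof *p);

    n = strcspn(s, "_.@");            /* language runs up to '_', '.' or '@' */
    copy_part(p->lang, s, n);
    s += n;

    if (*s == '_') {                  /* optional territory */
        s++;
        n = strcspn(s, ".@");
        copy_part(p->terr, s, n);
        s += n;
    }
    if (*s == '.') {                  /* optional codeset */
        s++;
        n = strcspn(s, "@");
        copy_part(p->codeset, s, n);
        s += n;
    }
    if (*s == '@')                    /* optional modifier: rest of string */
        copy_part(p->mod, s + 1, strlen(s + 1));
}
```

For example, split_locale("ru_UA.koi8u", &p) leaves p.lang as "ru",
p.terr as "UA", p.codeset as "koi8u", and p.mod empty.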

> From musl's standpoint all locales are UTF-8-encoded, so the codeset
> portion of the locale name is at best redundant. The official musl
> locale files, once we have such a thing, should not have ".UTF-8" in
> their names, but a spurious ".UTF-8" component in the locale name
> string should be accepted (and ignored) for compatibility with
> glibc-based systems where the specifier may be necessary for
> glibc-linked programs to distinguish from legacy versions of the
> locales.
>
> In principle we could implement this by stripping the ".UTF-8" at
> setlocale time (in __get_locale from locale_map.c) but I don't see a
> major advantage in doing that versus keeping the full string and just
> stripping it when constructing filenames to try opening. On the other
> hand there are advantages to keeping it: some users/distros may want
> to put a spurious ".UTF-8" in the locale name to trick broken programs
> that use strstr on the locale name, rather than nl_langinfo(CODESET),
> to determine that they're in a UTF-8 environment.

I agree with you.
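
For illustration, the two ways of detecting a UTF-8 environment mentioned
in the quoted text can be sketched as below. The function names are
hypothetical. nl_langinfo(CODESET) is the standard POSIX query; the strstr
variant only inspects the locale name string, which is why it depends on a
".UTF-8" component being present.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Robust check: ask the C library which codeset the current
   LC_CTYPE locale uses. */
static int codeset_is_utf8(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}

/* Fragile check of the kind some programs use: look for the substring
   "UTF-8" in the locale name itself. It fails for a name like "en_US"
   even when the locale is in fact UTF-8-encoded. */
static int name_mentions_utf8(const char *locale_name)
{
    return strstr(locale_name, "UTF-8") != NULL;
}
```

A program would normally call setlocale(LC_ALL, "") first and then
codeset_is_utf8(). The second function shows why keeping a spurious
".UTF-8" in the name can help programs that rely on the substring test.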

> For gettext translations, I haven't seen ".UTF-8" used either. My
> $prefix/share/locale directories have under them directories only of
> the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
> gettext-based programs always store UTF-8 in their message catalogs
> and legacy locales are expected to convert the contents when loading.

I have never seen *.UTF-8 locale directories either.

> Based on all this, the search order we perform should probably be
> something like this: First, take the input locale name and strip any
> codeset identifier. Then, iterate over 4 steps:
>
> 1. Try full name.
> 2. If a modifier (@mod) is present, try with modifier removed.
> 3. If a territory (_TT) is present, try with territory removed.
> 4. If both modifier and territory are present, try with both removed.
>
> At worst this yields 4 file-open attempts, and only in the case where
> a user has requested a ll_TT@mod type locale but either the @mod or
> _TT does not exist. For ll_TT type locales, it yields at most 2
> attempts. For ll type locales, or locale names that don't fit the
> standard pattern, there should be at most one attempt.

This algorithm looks good to me.
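
As a rough sketch of the quoted search order (not actual musl code), the
candidate names could be generated as follows. The helper name
locale_candidates and the LOC_MAX limit are assumptions made for this
example. The codeset is dropped first, then up to four names are produced
in the order of steps 1 to 4.

```c
#include <stdio.h>
#include <string.h>

#define LOC_MAX 64

/* Hypothetical helper: fill out[] with the names to try, in order,
   and return how many there are (0 to 4). */
static int locale_candidates(const char *name, char out[4][LOC_MAX])
{
    char lang[LOC_MAX] = "", terr[LOC_MAX] = "", mod[LOC_MAX] = "";
    size_t len = strlen(name);
    size_t n = strcspn(name, "_.@");
    const char *p = name + n;
    int count = 0, has_t, has_m;

    if (len == 0 || len >= LOC_MAX) return 0;
    if (n == 0) {                     /* no language part: try name as-is */
        strcpy(out[0], name);
        return 1;
    }

    memcpy(lang, name, n);            /* buffers are zero-filled, so the  */
    if (*p == '_') {                  /* copies stay NUL-terminated       */
        p++;
        n = strcspn(p, ".@");
        memcpy(terr, p, n);
        p += n;
    }
    if (*p == '.')                    /* strip the codeset identifier */
        p += 1 + strcspn(p + 1, "@");
    if (*p == '@')
        strcpy(mod, p + 1);

    has_t = terr[0] != 0;
    has_m = mod[0] != 0;

    /* 1. Full name (without codeset). */
    snprintf(out[count++], LOC_MAX, "%s%s%s%s%s",
             lang, has_t ? "_" : "", terr, has_m ? "@" : "", mod);
    /* 2. Modifier removed. */
    if (has_m)
        snprintf(out[count++], LOC_MAX, "%s%s%s", lang, has_t ? "_" : "", terr);
    /* 3. Territory removed. */
    if (has_t)
        snprintf(out[count++], LOC_MAX, "%s%s%s", lang, has_m ? "@" : "", mod);
    /* 4. Both removed. */
    if (has_t && has_m)
        snprintf(out[count++], LOC_MAX, "%s", lang);

    return count;
}
```

With this sketch, en_GB.UTF-8@euro yields en_GB@euro, en_GB, en@euro, and
en, in that order, so en_GB@euro is preferred over a codeset-qualified
name. A caller would try to open the catalog for each candidate in turn
and stop at the first that exists.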

> From an implementation side, note that, presently, dcngettext uses the
> full pathname of the message catalog file as the key to look up the
> memory-mapped image. This lookup needs to happen without touching the
> filesystem, and the only reason a pathname is used is that it
> encompasses all the necessary key components (bound directory, locale
> name, category name, domain name). So if we go with my above proposal,
> the "pathname" used as a key should still contain the full locale
> name, not the particular fallback it resolved to, and thus might not
> actually be a valid pathname anymore. Because of this it's plausible
> that the same catalog file could end up getting mapped more than once
> (e.g. as both /usr/share/locale/en_US/LC_MESSAGES/foo and
> /usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
> cost and I don't think it's worth trying to detect and avoid.
>
> Does this all make sense? Does it sound reasonable?

Yes, I think it makes sense. Thanks for the suggestion.
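
To illustrate the point about lookup keys, here is a small sketch,
assuming a key built from the bound directory, the requested locale name,
the category name, and the domain name. The function catalog_key and the
exact key layout (including the ".mo" suffix) are hypothetical and may
differ from what dcngettext.c actually does. The key uses the requested
locale name, not the fallback that was finally opened.

```c
#include <stdio.h>

/* Hypothetical: build the in-memory lookup key for a message catalog.
   "locale" is the name the caller asked for, so the key need not be a
   path that exists on disk. Returns 0 on success, -1 if it doesn't fit. */
static int catalog_key(char *buf, size_t cap, const char *dir,
                       const char *locale, const char *category,
                       const char *domain)
{
    int n = snprintf(buf, cap, "%s/%s/%s/%s.mo",
                     dir, locale, category, domain);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Under this scheme a request for en_US and a request for en produce two
different keys even if both end up mapping the same file under the en
directory, which is the harmless duplication described in the quoted text.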

After checking gettext-tools to confirm that .mo files are encoded in
UTF-8, I will try to implement the algorithm.

-- 
Masanori Ogino



Thread overview: 5+ messages
2016-05-04 13:05 Masanori Ogino
2016-05-04 21:39 ` Rich Felker
2016-05-09 12:46   ` Masanori Ogino
2016-05-11 23:26     ` Rich Felker
2016-05-20 14:55       ` Masanori Ogino [this message]
