gettext and locale names

mailing list of musl libc

* gettext and locale names
From: Masanori Ogino @ 2016-05-04 13:05 UTC
  To: musl

Hello,

While experimenting with the gettext API, I found that musl searches for
.mo files in a directory named after the current *full* locale name,
e.g. en_US.UTF-8. However, shortened names are often used too. Some
examples from /usr/share/locale on my machine: de, en_GB, ru_UA.koi8u,
sr@latin, etc.

Due to this mismatch, musl's gettext API cannot find translations for
many applications in the wild. Thus, I'm considering implementing locale
searching with shortening. Does that make sense?
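
For reference, the kind of lookup affected can be sketched as below. This
is only an illustration: `lookup_message` is a hypothetical helper, and the
directory and domain names passed to it are placeholders. The libc calls
themselves (setlocale, bindtextdomain, textdomain, gettext) are the
standard gettext API.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical helper: look up msgid in `domain`, with catalogs rooted at
 * `localedir`, using the locale taken from the environment. If no catalog
 * is found, gettext() returns msgid itself. */
const char *lookup_message(const char *localedir, const char *domain,
                           const char *msgid)
{
	setlocale(LC_ALL, "");             /* e.g. LANG=en_US.UTF-8 */
	bindtextdomain(domain, localedir); /* e.g. /usr/share/locale */
	textdomain(domain);
	return gettext(msgid);
}
```

With LANG=en_US.UTF-8, a catalog installed only under an en_US or en
directory would not be found by a search that uses the full name alone,
and the untranslated msgid would be returned.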

-- 
Masanori Ogino

* Re: gettext and locale names
From: Rich Felker @ 2016-05-04 21:39 UTC
  To: musl

On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
> Hello,
> 
> While experimenting with the gettext API, I found that musl searches for
> .mo files in a directory named after the current *full* locale name,
> e.g. en_US.UTF-8. However, shortened names are often used too. Some
> examples from /usr/share/locale on my machine: de, en_GB, ru_UA.koi8u,
> sr@latin, etc.
> 
> Due to this mismatch, musl's gettext API cannot find translations for
> many applications in the wild. Thus, I'm considering implementing locale
> searching with shortening. Does that make sense?

Yes, I think this makes sense. Before spending time on the code though
it makes sense to discuss the proposed logic here. What level would
the search/shortening happen at? __get_locale in locale_map.c? In
dcngettext.c?

Rich


* Re: gettext and locale names
From: Masanori Ogino @ 2016-05-09 12:46 UTC
  To: musl

2016-05-05 6:39 GMT+09:00 Rich Felker <dalias@libc.org>:
> Yes, I think this makes sense. Before spending time on the code though
> it makes sense to discuss the proposed logic here. What level would
> the search/shortening happen at? __get_locale in locale_map.c? In
> dcngettext.c?
>
> Rich

Sure. I suspect that shortening in __get_locale might be insufficient,
since some code may want the full locale name even if there is no
locale data for it. I will dig into the code.

Another problem is the order of preference among shortened locales.
Obviously, the full locale name has the highest priority and
language-only locales (e.g. en, de) have the lowest. However, which
should be preferred, en_GB@euro or en_GB.UTF-8, when the code receives
en_GB.UTF-8@euro?

I am unsure whether anyone actually uses such a locale, but I think it
is necessary to discuss such corner cases.

-- 
Masanori Ogino

* Re: gettext and locale names
From: Rich Felker @ 2016-05-11 23:26 UTC
  To: musl

On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
> Sure. I suspect that shortening in __get_locale might be insufficient,
> since some code may want the full locale name even if there is no
> locale data for it. I will dig into the code.
> 
> Another problem is the order of preference among shortened locales.
> Obviously, the full locale name has the highest priority and
> language-only locales (e.g. en, de) have the lowest. However, which
> should be preferred, en_GB@euro or en_GB.UTF-8, when the code receives
> en_GB.UTF-8@euro?
> 
> I am unsure whether anyone actually uses such a locale, but I think it
> is necessary to discuss such corner cases.

Conceptually there are two sets of names the locale names need to lead
us to: libc locales in MUSL_LOCPATH, and gettext translation files in
directories provided to bindtextdomain. If/when we add non-stub
catgets support, the locale name is also relevant to NLSPATH
processing where %L expands to the whole locale name, and %l, %t, and
%c expand to the language, territory, and codeset parts of it,
respectively.
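
As an illustration of the NLSPATH conversions mentioned above, here is a
minimal sketch of expanding one template. It is not musl code:
`nlspath_expand` and `append` are hypothetical names, and the sketch
assumes a locale name of the form ll[_TT][.codeset][@mod]. The %N
conversion (the name passed to catopen) and %% are included because POSIX
defines them alongside %L, %l, %t, and %c.

```c
#include <string.h>

/* Append at most n bytes of s to buf (capacity cap), keeping buf
 * NUL-terminated and silently truncating if it would overflow. */
static void append(char *buf, size_t cap, const char *s, size_t n)
{
	size_t used = strlen(buf);
	if (used + 1 >= cap) return;
	if (n > cap - used - 1) n = cap - used - 1;
	memcpy(buf + used, s, n);
	buf[used + n] = '\0';
}

/* Expand a single NLSPATH template into buf for the given locale name
 * and catalog name. */
void nlspath_expand(char *buf, size_t cap, const char *tmpl,
                    const char *locale, const char *name)
{
	/* Split the locale name: language, then optional _TT and .codeset. */
	size_t ll = strcspn(locale, "_.@");
	const char *t = locale + ll;
	size_t tl = 0, cl = 0;
	if (*t == '_') { t++; tl = strcspn(t, ".@"); }
	const char *c = t + tl;
	if (*c == '.') { c++; cl = strcspn(c, "@"); }

	if (cap == 0) return;
	buf[0] = '\0';
	for (; *tmpl; tmpl++) {
		/* Ordinary character, or a trailing '%': copy literally. */
		if (*tmpl != '%' || !tmpl[1]) {
			append(buf, cap, tmpl, 1);
			continue;
		}
		switch (*++tmpl) {
		case 'N': append(buf, cap, name, strlen(name)); break;
		case 'L': append(buf, cap, locale, strlen(locale)); break;
		case 'l': append(buf, cap, locale, ll); break;
		case 't': append(buf, cap, t, tl); break;
		case 'c': append(buf, cap, c, cl); break;
		case '%': append(buf, cap, "%", 1); break;
		default: break; /* unknown conversion: dropped in this sketch */
		}
	}
}
```

For the locale name en_GB.UTF-8@euro and the catalog name prog, the
template /usr/share/nls/%l_%t/%N.cat expands to
/usr/share/nls/en_GB/prog.cat.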

From musl's standpoint all locales are UTF-8-encoded, so the codeset
portion of the locale name is at best redundant. The official musl
locale files, once we have such a thing, should not have ".UTF-8" in
their names, but a spurious ".UTF-8" component in the locale name
string should be accepted (and ignored) for compatibility with
glibc-based systems where the specifier may be necessary for
glibc-linked programs to distinguish from legacy versions of the
locales.

In principle we could implement this by stripping the ".UTF-8" at
setlocale time (in __get_locale from locale_map.c) but I don't see a
major advantage in doing that versus keeping the full string and just
stripping it when constructing filenames to try opening. On the other
hand there are advantages to keeping it: some users/distros may want
to put a spurious ".UTF-8" in the locale name to trick broken programs
that use strstr on the locale name, rather than nl_langinfo(CODESET),
to determine that they're in a UTF-8 environment.
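
To make the contrast concrete, the two ways of detecting a UTF-8
environment can be sketched as follows. The function names are
hypothetical; strstr and nl_langinfo(CODESET) are the calls named above.

```c
#include <langinfo.h>
#include <string.h>

/* Fragile: guesses from the spelling of the locale name. It fails for a
 * name such as "en_US" even when the locale is in fact UTF-8. */
int is_utf8_by_name(const char *locale_name)
{
	return strstr(locale_name, "UTF-8") != NULL;
}

/* Robust: asks the C library which codeset the current locale uses.
 * The caller is expected to have called setlocale() beforehand. */
int is_utf8_by_codeset(void)
{
	return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program that uses the first form only sees UTF-8 support when the
string ".UTF-8" happens to appear in the name, which is why keeping the
suffix in the stored locale name can help such programs.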

For gettext translations, I haven't seen ".UTF-8" used either. My
$prefix/share/locale directories have under them directories only of
the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
gettext-based programs always store UTF-8 in their message catalogs
and legacy locales are expected to convert the contents when loading.

Based on all this, the search order we perform should probably be
something like this: First, take the input locale name and strip any
codeset identifier. Then, iterate over 4 steps:

1. Try full name.
2. If a modifier (@mod) is present, try with modifier removed.
3. If a territory (_TT) is present, try with territory removed.
4. If both modifier and territory are present, try with both removed.

At worst this yields 4 file-open attempts, and only in the case where
a user has requested a ll_TT@mod type locale but either the @mod or
_TT does not exist. For ll_TT type locales, it yields at most 2
attempts. For ll type locales, or locale names that don't fit the
standard pattern, there should be at most one attempt.
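
The steps above can be sketched in C as a function that lists the names
to try, in order. This is a sketch under stated assumptions, not a
proposed patch: `locale_candidates` and `LOC_MAX` are hypothetical names,
the input is assumed to follow the ll[_TT][.codeset][@mod] pattern, and
names outside that pattern get no special handling here.

```c
#include <stdio.h>
#include <string.h>

#define LOC_MAX 64 /* hypothetical size limit for this sketch */

/* Fill out[] with the directory names to try, in order, for `locale`.
 * Returns the number of candidates (1 to 4), or 0 if the name is empty
 * or too long for this sketch. */
int locale_candidates(const char *locale, char out[4][LOC_MAX])
{
	char lang[LOC_MAX], terr[LOC_MAX] = "", mod[LOC_MAX] = "";
	size_t n;
	int count = 0;

	if (strlen(locale) >= LOC_MAX) return 0;

	/* Language: everything up to the first '_', '.', or '@'. */
	n = strcspn(locale, "_.@");
	if (n == 0) return 0;
	memcpy(lang, locale, n);
	lang[n] = '\0';
	locale += n;

	/* Territory, stored with its leading '_'. */
	if (*locale == '_') {
		n = strcspn(locale, ".@");
		memcpy(terr, locale, n);
		terr[n] = '\0';
		locale += n;
	}
	/* Codeset: skipped, since it is stripped before searching. */
	if (*locale == '.')
		locale += strcspn(locale, "@");
	/* Modifier, stored with its leading '@'. */
	if (*locale == '@')
		strcpy(mod, locale);

	/* 1. Full name (without codeset). */
	snprintf(out[count++], LOC_MAX, "%s%s%s", lang, terr, mod);
	/* 2. Modifier removed. */
	if (*mod)
		snprintf(out[count++], LOC_MAX, "%s%s", lang, terr);
	/* 3. Territory removed. */
	if (*terr)
		snprintf(out[count++], LOC_MAX, "%s%s", lang, mod);
	/* 4. Both removed. */
	if (*mod && *terr)
		snprintf(out[count++], LOC_MAX, "%s", lang);
	return count;
}
```

For en_GB.UTF-8@euro this yields en_GB@euro, en_GB, en@euro, en, in that
order, which also settles the earlier question: the codeset is dropped
first, so en_GB@euro is tried before en_GB. For en_US.UTF-8 it yields
en_US, en; for de it yields just de.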

From an implementation side, note that, presently, dcngettext uses the
full pathname of the message catalog file as the key to look up the
memory-mapped image. This lookup needs to happen without touching the
filesystem, and the only reason a pathname is used is that it
encompasses all the necessary key components (bound directory, locale
name, category name, domain name). So if we go with my above proposal,
the "pathname" used as a key should still contain the full locale
name, not the particular fallback it resolved to, and thus might not
actually be a valid pathname anymore. Because of this it's plausible
that the same catalog file could end up getting mapped more than once
(e.g. as both /usr/share/locale/en_US/LC_MESSAGES/foo and
/usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
cost and I don't think it's worth trying to detect and avoid.
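
A small sketch of the distinction between the cache key and the file that
is actually opened. `catalog_path` is a hypothetical helper, not the
function dcngettext uses, and the ".mo" suffix reflects the usual
installed file name rather than anything stated above.

```c
#include <stdio.h>

/* Build "<dir>/<locale>/<category>/<domain>.mo" into buf.
 * Returns the length snprintf reports (which may exceed len). */
int catalog_path(char *buf, size_t len, const char *dir,
                 const char *locale, const char *category,
                 const char *domain)
{
	return snprintf(buf, len, "%s/%s/%s/%s.mo",
	                dir, locale, category, domain);
}
```

Under the proposal, the key would be built once from the locale name as
requested, e.g. catalog_path(key, sizeof key, dir, "en_US",
"LC_MESSAGES", "foo"), and used for the in-memory lookup. Only on a cache
miss would the same helper be called with each fallback name (en_US, then
en) to form a path to try opening, and the resulting mapping would be
stored under the original key.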

Does this all make sense? Does it sound reasonable?

Rich

* Re: gettext and locale names
From: Masanori Ogino @ 2016-05-20 14:55 UTC
  To: musl

Hello. I'm sorry for the delay.

2016-05-12 8:26 GMT+09:00 Rich Felker <dalias@libc.org>:
> Conceptually there are two sets of names the locale names need to lead
> us to: libc locales in MUSL_LOCPATH, and gettext translation files in
> directories provided to bindtextdomain. If/when we add non-stub
> catgets support, the locale name is also relevant to NLSPATH
> processing where %L expands to the whole locale name, and %l, %t, and
> %c expand to the language, territory, and codeset parts of it,
> respectively.

I wasn't aware of this. Thank you.

> From musl's standpoint all locales are UTF-8-encoded, so the codeset
> portion of the locale name is at best redundant. The official musl
> locale files, once we have such a thing, should not have ".UTF-8" in
> their names, but a spurious ".UTF-8" component in the locale name
> string should be accepted (and ignored) for compatibility with
> glibc-based systems where the specifier may be necessary for
> glibc-linked programs to distinguish from legacy versions of the
> locales.
>
> In principle we could implement this by stripping the ".UTF-8" at
> setlocale time (in __get_locale from locale_map.c) but I don't see a
> major advantage in doing that versus keeping the full string and just
> stripping it when constructing filenames to try opening. On the other
> hand there are advantages to keeping it: some users/distros may want
> to put a spurious ".UTF-8" in the locale name to trick broken programs
> that use strstr on the locale name, rather than nl_langinfo(CODESET),
> to determine that they're in a UTF-8 environment.

I agree with you.

> For gettext translations, I haven't seen ".UTF-8" used either. My
> $prefix/share/locale directories have under them directories only of
> the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
> gettext-based programs always store UTF-8 in their message catalogs
> and legacy locales are expected to convert the contents when loading.

I also have never seen *.UTF-8 locale directories.

> Based on all this, the search order we perform should probably be
> something like this: First, take the input locale name and strip any
> codeset identifier. Then, iterate over 4 steps:
>
> 1. Try full name.
> 2. If a modifier (@mod) is present, try with modifier removed.
> 3. If a territory (_TT) is present, try with territory removed.
> 4. If both modifier and territory are present, try with both removed.
>
> At worst this yields 4 file-open attempts, and only in the case where
> a user has requested a ll_TT@mod type locale but either the @mod or
> _TT does not exist. For ll_TT type locales, it yields at most 2
> attempts. For ll type locales, or locale names that don't fit the
> standard pattern, there should be at most one attempt.

This algorithm looks good to me.

> From an implementation side, note that, presently, dcngettext uses the
> full pathname of the message catalog file as the key to look up the
> memory-mapped image. This lookup needs to happen without touching the
> filesystem, and the only reason a pathname is used is that it
> encompasses all the necessary key components (bound directory, locale
> name, category name, domain name). So if we go with my above proposal,
> the "pathname" used as a key should still contain the full locale
> name, not the particular fallback it resolved to, and thus might not
> actually be a valid pathname anymore. Because of this it's plausible
> that the same catalog file could end up getting mapped more than once
> (e.g. as both /usr/share/locale/en_US/LC_MESSAGES/foo and
> /usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
> cost and I don't think it's worth trying to detect and avoid.
>
> Does this all make sense? Does it sound reasonable?

Yes, I think it makes sense. Thanks for the suggestion.

After checking gettext-tools to confirm that .mo files are encoded in
UTF-8, I will try to implement the algorithm.

-- 
Masanori Ogino

