From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/10038
Path: news.gmane.org!not-for-mail
From: Masanori Ogino <masanori.ogino@gmail.com>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: gettext and locale names
Date: Fri, 20 May 2016 23:55:45 +0900
Message-ID: <CAA-4+je+sRPjxYqpVQM1SLeSDjoRAeAyFirjZKQ16VpXDGQ-kA@mail.gmail.com>
References: <CAA-4+jcC7aM8XWo6gRVgfqD-WFC3ODvQYSMLdcJWK9wu9sgNbQ@mail.gmail.com>
	<20160504213938.GV21636@brightrain.aerifal.cx>
	<CAA-4+jc-q4Hsxvf9GbrgPobU0Q0gWWGiCM2aXMwDDPUji8=rJw@mail.gmail.com>
	<20160511232614.GJ21636@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-Trace: ger.gmane.org 1463756162 6124 80.91.229.3 (20 May 2016 14:56:02 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Fri, 20 May 2016 14:56:02 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-10051-gllmg-musl=m.gmane.org@lists.openwall.com Fri May 20 16:56:01 2016
Return-path: <musl-return-10051-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-10051-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1b3lqL-0004pl-Ek
	for gllmg-musl@m.gmane.org; Fri, 20 May 2016 16:56:01 +0200
Original-Received: (qmail 15728 invoked by uid 550); 20 May 2016 14:55:58 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 15707 invoked from network); 20 May 2016 14:55:57 -0000
X-Received: by 10.157.15.226 with SMTP id m31mr2357252otd.46.1463756145375;
 Fri, 20 May 2016 07:55:45 -0700 (PDT)
Original-Sender: masanoriogino@gmail.com
In-Reply-To: <20160511232614.GJ21636@brightrain.aerifal.cx>
X-Google-Sender-Auth: WCGlbyvxrU1gbENAjm4DzwX1zcQ
Xref: news.gmane.org gmane.linux.lib.musl.general:10038
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/10038>

Hello. I'm sorry for the delay.

2016-05-12 8:26 GMT+09:00 Rich Felker <dalias@libc.org>:
> On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
>> 2016-05-05 6:39 GMT+09:00 Rich Felker <dalias@libc.org>:
>> > On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
>> >> Hello,
>> >>
>> >> While playing with the gettext API, I found that musl searches for .mo
>> >> files in a directory named after the current *full* locale name, e.g.
>> >> en_US.UTF-8. However, shortened names are often used too. Here are some
>> >> examples from /usr/share/locale on my machine: de, en_GB, ru_UA.koi8u,
>> >> sr@latin, etc.
>> >>
>> >> Due to this mismatch, musl's gettext API cannot find translations for
>> >> applications in the wild. Thus, I'm considering implementing locale
>> >> searching with shortening. Does that make sense?
>> >
>> > Yes, I think this makes sense. Before spending time on the code though
>> > it makes sense to discuss the proposed logic here. What level would
>> > the search/shortening happen at? __get_locale in locale_map.c? In
>> > dcngettext.c?
>>
>> Sure. I suspect that shortening in __get_locale might be insufficient
>> since some code may want the full locale name even if there is no
>> locale data for it. I will dig into the code.
>>
>> Another problem is the preference order of shortened locales. Obviously,
>> the full locale itself has the highest priority and language-only
>> locales (e.g. en, de, etc.) have the lowest. However, which locale is
>> preferred, en_GB@euro or en_GB.UTF-8, when the code receives
>> en_GB.UTF-8@euro?
>>
>> I am unsure whether anyone actually uses such a locale, but I think it
>> is necessary to discuss such corner cases.
>
> Conceptually there are two sets of names the locale names need to lead
> us to: libc locales in MUSL_LOCPATH, and gettext translation files in
> directories provided to bindtextdomain. If/when we add non-stub
> catgets support, the locale name is also relevant to NLSPATH
> processing where %L expands to the whole locale name, and %l, %t, and
> %c expand to the language, territory, and codeset parts of it,
> respectively.

I wasn't aware of this. Thank you.
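
To make the NLSPATH substitutions concrete for myself, here is a rough
sketch of how one template could be expanded. This is not musl code:
expand_nls_template and append are hypothetical helpers, the locale
name is assumed to have the form ll[_TT][.codeset][@mod], and %N (the
name passed to catopen) and %% are included because POSIX defines them
alongside the conversions mentioned above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append n bytes of s to out (capacity cap, current length *len).
 * Returns 0 on success, -1 if the result would not fit. */
static int append(char *out, size_t cap, size_t *len, const char *s, size_t n)
{
    if (*len + n >= cap) return -1;
    memcpy(out + *len, s, n);
    *len += n;
    out[*len] = '\0';
    return 0;
}

/* Hypothetical helper: expand a single NLSPATH template for `locale`. */
static int expand_nls_template(const char *tmpl, const char *locale,
                               const char *name, char *out, size_t cap)
{
    assert(tmpl && locale && name && out && cap > 0);

    /* Locate the parts of "ll[_TT][.codeset][@mod]". */
    size_t full = strlen(locale);
    const char *at = strchr(locale, '@');
    size_t base = at ? (size_t)(at - locale) : full;   /* up to '@' */
    const char *dot = memchr(locale, '.', base);
    size_t lt = dot ? (size_t)(dot - locale) : base;   /* "ll_TT" */
    const char *us = memchr(locale, '_', lt);
    size_t ll = us ? (size_t)(us - locale) : lt;       /* "ll" */

    size_t len = 0;
    out[0] = '\0';
    for (const char *p = tmpl; *p; p++) {
        const char *s = p;   /* default: copy one literal character */
        size_t n = 1;
        if (*p == '%' && p[1]) {
            switch (*++p) {
            case 'N': s = name; n = strlen(name); break;
            case 'L': s = locale; n = full; break;
            case 'l': s = locale; n = ll; break;
            case 't': s = us ? us + 1 : ""; n = us ? lt - ll - 1 : 0; break;
            case 'c': s = dot ? dot + 1 : ""; n = dot ? base - lt - 1 : 0; break;
            case '%': s = "%"; n = 1; break;
            default:  s = p - 1; n = 2; break;  /* unknown: copy verbatim */
            }
        }
        if (append(out, cap, &len, s, n) < 0) return -1;
    }
    return 0;
}
```

With the locale en_GB.UTF-8 and a made-up template such as
/usr/share/nls/%l_%t/%N.cat, the codeset is simply never referenced, so
the result does not depend on whether ".UTF-8" was present.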

> From musl's standpoint all locales are UTF-8-encoded, so the codeset
> portion of the locale name is at best redundant. The official musl
> locale files, once we have such a thing, should not have ".UTF-8" in
> their names, but a spurious ".UTF-8" component in the locale name
> string should be accepted (and ignored) for compatibility with
> glibc-based systems where the specifier may be necessary for
> glibc-linked programs to distinguish from legacy versions of the
> locales.
>
> In principle we could implement this by stripping the ".UTF-8" at
> setlocale time (in __get_locale from locale_map.c) but I don't see a
> major advantage in doing that versus keeping the full string and just
> stripping it when constructing filenames to try opening. On the other
> hand there are advantages to keeping it: some users/distros may want
> to put a spurious ".UTF-8" in the locale name to trick broken programs
> that use strstr on the locale name, rather than nl_langinfo(CODESET),
> to determine that they're in a UTF-8 environment.

I agree with you.
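
For reference, the difference between the two checks looks roughly like
this. It is only an illustration: the function names are made up, and
in_utf8_locale assumes the program has already called setlocale.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Fragile: guesses the encoding from the spelling of the locale name.
 * A name such as "en_GB" fails this check even where the locale is
 * UTF-8-encoded, as all locales are under musl. */
static int name_mentions_utf8(const char *locale_name)
{
    assert(locale_name != NULL);
    return strstr(locale_name, "UTF-8") != NULL;
}

/* Compares a codeset string, as reported by the C library, to "UTF-8". */
static int is_utf8_codeset(const char *codeset)
{
    assert(codeset != NULL);
    return strcmp(codeset, "UTF-8") == 0;
}

/* Robust: asks the C library which codeset the current locale uses. */
static int in_utf8_locale(void)
{
    return is_utf8_codeset(nl_langinfo(CODESET));
}
```

Keeping ".UTF-8" in the stored name lets programs of the first kind
keep working, while programs of the second kind are unaffected either
way.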

> For gettext translations, I haven't seen ".UTF-8" used either. My
> $prefix/share/locale directories have under them directories only of
> the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
> gettext-based programs always store UTF-8 in their message catalogs
> and legacy locales are expected to convert the contents when loading.

I have never seen *.UTF-8 locale directories either.

> Based on all this, the search order we perform should probably be
> something like this: First, take the input locale name and strip any
> codeset identifier. Then, iterate over 4 steps:
>
> 1. Try full name.
> 2. If a modifier (@mod) is present, try with modifier removed.
> 3. If a territory (_TT) is present, try with territory removed.
> 4. If both modifier and territory are present, try with both removed.
>
> At worst this yields 4 file-open attempts, and only in the case where
> a user has requested a ll_TT@mod type locale but either the @mod or
> _TT does not exist. For ll_TT type locales, it yields at most 2
> attempts. For ll type locales, or locale names that don't fit the
> standard pattern, there should be at most one attempt.

This algorithm looks good to me.
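
As a first sketch of the steps above (not a patch, and not how it will
necessarily look in dcngettext.c), the candidate names could be
generated like this. locale_candidates, CAND_MAX and LOC_NAME_LEN are
names I made up for illustration, and the input is assumed to have the
form ll[_TT][.codeset][@mod].

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CAND_MAX 4
#define LOC_NAME_LEN 64

/* Hypothetical helper: writes the directory names to try, in order,
 * into out[] and returns how many there are (0 if the name is too long). */
static int locale_candidates(const char *locale,
                             char out[CAND_MAX][LOC_NAME_LEN])
{
    assert(locale != NULL);

    /* Locate the parts of "ll[_TT][.codeset][@mod]". */
    size_t len = strlen(locale);
    const char *mod = strchr(locale, '@');             /* "@mod" or NULL */
    size_t base = mod ? (size_t)(mod - locale) : len;  /* up to '@' */
    const char *dot = memchr(locale, '.', base);
    size_t lt = dot ? (size_t)(dot - locale) : base;   /* "ll_TT" */
    const char *us = memchr(locale, '_', lt);
    size_t ll = us ? (size_t)(us - locale) : lt;       /* "ll" */

    const char *m = mod ? mod : "";
    if (lt + strlen(m) >= LOC_NAME_LEN) return 0;

    int n = 0;
    /* 1. Full name, with the codeset already stripped. */
    snprintf(out[n++], LOC_NAME_LEN, "%.*s%s", (int)lt, locale, m);
    /* 2. Modifier removed. */
    if (mod) snprintf(out[n++], LOC_NAME_LEN, "%.*s", (int)lt, locale);
    /* 3. Territory removed. */
    if (us) snprintf(out[n++], LOC_NAME_LEN, "%.*s%s", (int)ll, locale, m);
    /* 4. Both removed. */
    if (mod && us) snprintf(out[n++], LOC_NAME_LEN, "%.*s", (int)ll, locale);
    return n;
}
```

For my earlier corner case, en_GB.UTF-8@euro, this gives en_GB@euro,
en_GB, en@euro and en, in that order. The codeset is dropped before
the search starts, so en_GB.UTF-8 itself is never tried.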

> From an implementation side, note that, presently, dcngettext uses the
> full pathname of the message catalog file as the key to look up the
> memory-mapped image. This lookup needs to happen without touching the
> filesystem, and the only reason a pathname is used is that it
> encompasses all the necessary key components (bound directory, locale
> name, category name, domain name). So if we go with my above proposal,
> the "pathname" used as a key should still contain the full locale
> name, not the particular fallback it resolved to, and thus might not
> actually be a valid pathname anymore. Because of this it's plausible
> that the same catalog file could end up getting mapped more than once
> (e.g. as both /usr/share/locale/en_US/LC_MESSAGES/foo and
> /usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
> cost and I don't think it's worth trying to detect and avoid.
>
> Does this all make sense? Does it sound reasonable?

Yes, I think it makes sense. Thanks for the suggestion.
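
To check my understanding of the lookup key, here is a simplified
sketch. The struct layout, the function names and the exact key format
(including the ".mo" suffix) are my assumptions, not the actual
dcngettext code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define KEY_MAX 256

/* One cached catalog; `map` would point at the memory-mapped image. */
struct catalog {
    struct catalog *next;
    const void *map;
    char key[KEY_MAX];
};

/* Build the key from the bound directory, the *requested* locale name,
 * the category name and the domain name. Returns -1 if it does not fit. */
static int make_key(char *key, size_t cap, const char *dir,
                    const char *locale, const char *category,
                    const char *domain)
{
    assert(key && dir && locale && category && domain);
    int n = snprintf(key, cap, "%s/%s/%s/%s.mo", dir, locale, category, domain);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}

/* Pure in-memory lookup: no filesystem access on the fast path. */
static struct catalog *find_catalog(struct catalog *head, const char *key)
{
    for (struct catalog *c = head; c; c = c->next)
        if (strcmp(c->key, key) == 0) return c;
    return NULL;
}
```

Since the key is built from the requested name, a request for en_US
that falls back to the en directory is cached under the en_US key. A
later request for en builds a different key and maps the same file
again, which is the harmless duplication you describe.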

After checking gettext-tools to confirm that .mo files are encoded in
UTF-8, I will try to implement the algorithm.

-- 
Masanori Ogino