From: Masanori Ogino
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: gettext and locale names
Date: Fri, 20 May 2016 23:55:45 +0900
References: <20160504213938.GV21636@brightrain.aerifal.cx> <20160511232614.GJ21636@brightrain.aerifal.cx>
In-Reply-To: <20160511232614.GJ21636@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

Hello. I'm sorry for the delay.

2016-05-12 8:26 GMT+09:00 Rich Felker:
> On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
>> 2016-05-05 6:39 GMT+09:00 Rich Felker:
>> > On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
>> >> Hello,
>> >>
>> >> When I played with the gettext API, I found that musl searches for
>> >> .mo files in a directory named after the current *full* locale name,
>> >> e.g. en_US.UTF-8. However, we often use shortened names too. Here is
>> >> a list of such names from /usr/share/locale on my machine: de,
>> >> en_GB, ru_UA.koi8u, sr@latin, etc.
>> >>
>> >> Due to this mismatch, we can't get translations with musl's gettext
>> >> API for applications in the wild. Thus, I'm considering implementing
>> >> locale searching with shortening. Does it make sense?
>> >
>> > Yes, I think this makes sense. Before spending time on the code though
>> > it makes sense to discuss the proposed logic here. What level would
>> > the search/shortening happen at?
>> > __get_locale in locale_map.c? In dcngettext.c?
>>
>> Sure. I suspect that shortening in __get_locale might be insufficient,
>> since some code may want the full locale name even if there is no
>> locale data for it. I will dig into the code.
>>
>> Another problem is the preference among shortened locales. Obviously,
>> the full locale itself has the highest priority and language-only
>> locales (e.g. en, de, etc.) have the lowest. However, which is the
>> preferred locale, en_GB@euro or en_GB.UTF-8, when the code receives
>> en_GB.UTF-8@euro?
>>
>> I am unsure whether anyone actually uses such a locale, but I think it
>> is necessary to discuss such corner cases.
>
> Conceptually there are two sets of names the locale names need to lead
> us to: libc locales in MUSL_LOCPATH, and gettext translation files in
> directories provided to bindtextdomain. If/when we add non-stub
> catgets support, the locale name is also relevant to NLSPATH
> processing where %L expands to the whole locale name, and %l, %t, and
> %c expand to the language, territory, and codeset parts of it,
> respectively.

I wasn't aware of this. Thank you.

> From musl's standpoint all locales are UTF-8-encoded, so the codeset
> portion of the locale name is at best redundant. The official musl
> locale files, once we have such a thing, should not have ".UTF-8" in
> their names, but a spurious ".UTF-8" component in the locale name
> string should be accepted (and ignored) for compatibility with
> glibc-based systems where the specifier may be necessary for
> glibc-linked programs to distinguish from legacy versions of the
> locales.
>
> In principle we could implement this by stripping the ".UTF-8" at
> setlocale time (in __get_locale from locale_map.c) but I don't see a
> major advantage in doing that versus keeping the full string and just
> stripping it when constructing filenames to try opening.
> On the other hand there are advantages to keeping it: some
> users/distros may want to put a spurious ".UTF-8" in the locale name
> to trick broken programs that use strstr on the locale name, rather
> than nl_langinfo(CODESET), to determine that they're in a UTF-8
> environment.

I agree with you.

> For gettext translations, I haven't seen ".UTF-8" used either. My
> $prefix/share/locale directories have under them directories only of
> the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
> gettext-based programs always store UTF-8 in their message catalogs
> and legacy locales are expected to convert the contents when loading.

I have never seen *.UTF-8 locale directories either.

> Based on all this, the search order we perform should probably be
> something like this: First, take the input locale name and strip any
> codeset identifier. Then, iterate over 4 steps:
>
> 1. Try full name.
> 2. If a modifier (@mod) is present, try with modifier removed.
> 3. If a territory (_TT) is present, try with territory removed.
> 4. If both modifier and territory are present, try with both removed.
>
> At worst this yields 4 file-open attempts, and only in the case where
> a user has requested a ll_TT@mod type locale but either the @mod or
> _TT does not exist. For ll_TT type locales, it yields at most 2
> attempts. For ll type locales, or locale names that don't fit the
> standard pattern, there should be at most one attempt.

This algorithm looks good to me.

> From an implementation side, note that, presently, dcngettext uses the
> full pathname of the message catalog file as the key to look up the
> memory-mapped image. This lookup needs to happen without touching the
> filesystem, and the only reason a pathname is used is that it
> encompasses all the necessary key components (bound directory, locale
> name, category name, domain name).
> So if we go with my above proposal, the "pathname" used as a key
> should still contain the full locale name, not the particular
> fallback it resolved to, and thus might not actually be a valid
> pathname anymore. Because of this it's plausible that the same
> catalog file could end up getting mapped more than once (e.g. as both
> /usr/share/locale/en_US/LC_MESSAGES/foo and
> /usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
> cost and I don't think it's worth trying to detect and avoid.
>
> Does this all make sense? Does it sound reasonable?

Yes, I think it makes sense. Thanks for the suggestion.

After checking gettext-tools to confirm that .mo files are encoded in
UTF-8, I will try to implement the algorithm.

--
Masanori Ogino
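
P.S. For illustration, the search order discussed above (strip the
codeset, then try the full name, the name without the modifier, the name
without the territory, and the name without both) could be sketched in C
roughly as follows. This is not musl code: locale_candidates and LOC_MAX
are hypothetical names, and the sketch assumes the input has the form
ll[_TT][.codeset][@mod]. It only produces the candidate names; a real
implementation would presumably build a pathname from each candidate and
stop at the first file that opens.

```c
#include <stdio.h>
#include <string.h>

#define LOC_MAX 64  /* hypothetical upper bound on a candidate name */

/* Hypothetical helper (not part of musl): fills out[] with the locale
 * names to try, in order, and returns how many were produced (0 if the
 * name is too long). Assumes the form ll[_TT][.codeset][@mod]. */
static int locale_candidates(const char *name, char out[4][LOC_MAX])
{
    size_t langlen = strcspn(name, "_.@"); /* length of "ll" */
    size_t baselen = strcspn(name, ".@");  /* length of "ll" or "ll_TT" */
    const char *mod = strchr(name, '@');   /* "@mod", or NULL if absent */
    int has_terr = langlen < baselen;      /* a "_TT" part is present */
    int n = 0;

    if (!mod) mod = "";
    if (baselen + strlen(mod) >= LOC_MAX) return 0;

    /* 1. Full name, with any codeset (".UTF-8" etc.) stripped. */
    snprintf(out[n++], LOC_MAX, "%.*s%s", (int)baselen, name, mod);
    /* 2. Modifier removed. */
    if (*mod)
        snprintf(out[n++], LOC_MAX, "%.*s", (int)baselen, name);
    /* 3. Territory removed (modifier kept, if any). */
    if (has_terr)
        snprintf(out[n++], LOC_MAX, "%.*s%s", (int)langlen, name, mod);
    /* 4. Both modifier and territory removed. */
    if (*mod && has_terr)
        snprintf(out[n++], LOC_MAX, "%.*s", (int)langlen, name);
    return n;
}
```

With this sketch, en_GB.UTF-8@euro gives en_GB@euro, en_GB, en@euro, en
(in that order), en_US.UTF-8 gives en_US, en, and de gives only de.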