mailing list of musl libc
* [musl] gettext LC_MESSAGES differences from other libc
@ 2025-01-11 18:13 Gavin Smith
  2025-01-12  2:08 ` Thorsten Glaser
  2025-01-12  4:51 ` Rich Felker
  0 siblings, 2 replies; 5+ messages in thread
From: Gavin Smith @ 2025-01-11 18:13 UTC
  To: musl

(Please CC me in any replies as I am not subscribed to the list.)

As you know, the gettext function in musl does not behave exactly like
the function in glibc and some other libc implementations.  Specifically,
it does not obey the LANGUAGE variable which can be used to specify that
translated strings should be in a certain language.

In 2014, you discussed the rationale for not supporting LANGUAGE.  There
were issues with threads and caching:

Rich Felker, Thu, 31 Jul 2014, "How should $LANGUAGE work in our gettext?"
https://www.openwall.com/lists/musl/2014/07/31/2

Recently in the Texinfo project, we found this incompatibility with musl
for translations of strings to be placed in output files.  The gettext
API (whether in musl, glibc or another libc) is not a perfect match for
Texinfo's needs, as it assumes that the target language is that of the
user, the person sitting in front of the computer, whereas the appropriate
translation language is that of the input document.  For example, somebody
could be generating documentation in Italian to be posted to a website,
while they don't speak Italian themselves and do not have an Italian
locale installed.

The only way we can support this with glibc is to set LC_MESSAGES and/or
LC_ALL to a locale that is not "C" or "POSIX", and then to set the LANGUAGE
variable for the actual target language.  This is a nuisance, as sometimes
it is a struggle to actually find such a locale.  The assumption when this
API was designed was that a user with only a "C" locale does not need
translations, but this is false when they are generating them for somebody
else.  libc appears to offer no way just to open an arbitrary .mo file (the
file with the translated strings in it) to get the translations, forcing
you to go through the locale system.
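
To make the workaround above concrete, here is a minimal sketch in C.  The
function name translate_for_document and all of its parameters are
illustrative and are not taken from Texinfo.  It assumes the caller already
knows the name of some installed locale other than "C" or "POSIX", and it
calls setenv(), so it is only suitable for a single-threaded program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the translation of msgid into lang,
 * or msgid itself if the workaround cannot be applied. */
const char *translate_for_document(const char *domain, const char *localedir,
                                   const char *non_c_locale, const char *lang,
                                   const char *msgid)
{
    assert(domain && localedir && non_c_locale && lang && msgid);

    /* glibc skips translation entirely in the "C" and "POSIX" locales. */
    if (!strcmp(non_c_locale, "C") || !strcmp(non_c_locale, "POSIX"))
        return msgid;

    /* Step 1: switch LC_MESSAGES to any locale that is not "C". */
    if (!setlocale(LC_MESSAGES, non_c_locale))
        return msgid;

    /* Step 2: name the real target language through LANGUAGE.
     * Not thread-safe: this modifies the process environment. */
    if (setenv("LANGUAGE", lang, 1) != 0)
        return msgid;

    /* Step 3: ordinary catalog lookup; falls back to msgid if none found. */
    bindtextdomain(domain, localedir);
    return dgettext(domain, msgid);
}
```

A call might look like translate_for_document("texinfo_document",
"/usr/share/locale", "en_US.UTF-8", "it", "Table of Contents"), where the
domain, directory and locale name are placeholders.  On musl the LANGUAGE
step has no effect, which is the incompatibility described here.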

musl supports setting LC_MESSAGES to an arbitrary value that is not
a locale, so it can access arbitrary translation files in a different way.
However, we didn't think it was worth having a special case in the code
just for musl:
https://lists.gnu.org/archive/html/bug-texinfo/2024-12/msg00035.html

You also discussed changing how LC_MESSAGES works in a post in
2017, but as far as I am aware nothing came of it:

Rich Felker, Wed, 8 Nov 2017, "Re: setlocale behavior with 'missing' locales"

  One notable issue is that, right now, we rely on being able to set
  LC_MESSAGES to an arbitrary name even if there's no libc locale
  definition for it; this is because gettext() relies on the name of the
  current LC_MESSAGES locale to find (application-specific) translation
  files that might exist even without a libc translation. I'm not sure
  how we would best keep this working under changes similar to the
  above.
https://www.openwall.com/lists/musl/2017/11/08/2

Could there be a possibility of a new extension to the gettext API that
works with musl, glibc and other libc implementations, and that could be
used for arbitrary languages, not just those with installed locales?

I mention the possibility, as I found an old proposal (from 2016) to
add to the glibc API for translation languages that could be of interest:

Bruno Haible, 2016-05-10
"Re: [bug-gettext] RFC: move LANGUAGE check out of gettext()"
https://lists.gnu.org/archive/html/bug-gettext/2016-05/msg00009.html

> Why is this being reported for the LANGUAGE environment variable but not
> for the LANG and LC_ALL environment variables? Because for LANG and LC_*
> we have an architecture composed of three functionalities:
> 
>   (A) environment variables: getenv(), setenv()
> 
>   (B) locales: setlocale(), newlocale(), uselocale().
> 
>   (C) gettext() and friends.
> 
> (A) is the bottom-most layer. But it has the limitation that multi-threaded
> programs must not call setenv().
> 
> (B) is a layer that fetches the initial values from (A), and that allows
> mutators (setlocale(), uselocale()) in multi-threaded programs.
> So that multi-threaded applications can modify the program's locale after
> startup, there is the setlocale() function.
> So that multi-threaded programs can have a locale per thread, there is a
> uselocale() function.
> 
> (C) is an application layer that happens to be in Glibc for convenience
> reasons. It is based on the layer (B).
> 
> 
> Back to the LANGUAGE environment variable. The problem is that here we
> have the layers (A) and (C), but (B) is missing. The solution ought to
> be to introduce a layer (B) for LANGUAGE. LANGUAGE is not specified by
> POSIX and does not perfectly fit into the locale system, therefore I
> believe it is best treated separately.

This was also raised in the glibc bugtracker system:

Daiki Ueno, 2016-05-31
"API for language priority list"
https://sourceware.org/bugzilla/show_bug.cgi?id=20184

It was proposed that a language preference list could be set on a thread
specific basis, that would not involve setting environment variables.
This accords with point 2 in Rich Felker's 2014 commentary:

  2. The $LANGUAGE variable conflicts with uselocale and thread-local
     locales. For instance if the caller has called uselocale to request
     language Y despite the process-wide locale being language X, where
     language X is based on the user's preferences in the environment
     and language Y is based on data, it's wrong to present messages
     based on the environment ($LANGUAGE) rather than the requested
     language Y.
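
As an illustration of the layer (B) calls mentioned above, the following
sketch translates one string in a thread-local locale without touching the
environment or the process-wide locale.  The function name
gettext_in_locale is hypothetical.  Note the limitation that motivates this
thread: newlocale() is expected to fail on glibc when the named locale is
not installed, in which case the sketch simply returns the untranslated
string.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: look up msgid with LC_MESSAGES temporarily set
 * to locale_name for the calling thread only. */
const char *gettext_in_locale(const char *locale_name, const char *domain,
                              const char *msgid)
{
    assert(locale_name && domain && msgid);

    /* Build a locale object that differs from "C" only in LC_MESSAGES. */
    locale_t loc = newlocale(LC_MESSAGES_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return msgid;               /* locale not available */

    /* Install it for this thread, translate, then restore. */
    locale_t old = uselocale(loc);
    const char *result = dgettext(domain, msgid);
    uselocale(old);
    freelocale(loc);

    return result;
}
```

Because uselocale() affects only the calling thread, this avoids the
setenv() restriction described for layer (A), but it offers no place to
express a LANGUAGE-style priority list.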


I hope this possibility is interesting to you although I don't fully
understand all the issues involved.




* Re: [musl] gettext LC_MESSAGES differences from other libc
  2025-01-11 18:13 [musl] gettext LC_MESSAGES differences from other libc Gavin Smith
@ 2025-01-12  2:08 ` Thorsten Glaser
  2025-01-12  4:51 ` Rich Felker
  1 sibling, 0 replies; 5+ messages in thread
From: Thorsten Glaser @ 2025-01-12  2:08 UTC
  To: musl; +Cc: Gavin Smith

(you’ll most likely not get my eMail since you use Googlemail,
which actively works against community development)

Gavin Smith dixit:

>> Back to the LANGUAGE environment variable. The problem is that here we
>> have the layers (A) and (C), but (B) is missing. The solution ought to
>> be to introduce a layer (B) for LANGUAGE.

But isn’t $LANGUAGE a mostly annoying (I know I have to unset it
in my scripts to make them work on GNU systems all the time) GNU
extension?

>musl supports setting LC_MESSAGES to an arbitrary value that is not
>a locale, so it can access arbitrary translation files in a different way.
>However, we didn't think it was worth having a special case in the code
>just for musl:

I’d reconsider that decision instead.

(Not a musl developer, just lurking.)

bye,
//mirabilos
-- 
Yay for having to rewrite other people's Bash scripts because bash
suddenly stopped supporting the bash extensions they make use of
	-- Tonnerre Lombard in #nosec


* Re: [musl] gettext LC_MESSAGES differences from other libc
  2025-01-11 18:13 [musl] gettext LC_MESSAGES differences from other libc Gavin Smith
  2025-01-12  2:08 ` Thorsten Glaser
@ 2025-01-12  4:51 ` Rich Felker
  2025-01-21 20:43   ` Gavin Smith
  1 sibling, 1 reply; 5+ messages in thread
From: Rich Felker @ 2025-01-12  4:51 UTC
  To: Gavin Smith; +Cc: musl

On Sat, Jan 11, 2025 at 06:13:25PM +0000, Gavin Smith wrote:
> (Please CC me in any replies as I am not subscribed to the list.)
> 
> As you know, the gettext function in musl does not behave exactly like
> the function in glibc and some other libc implementations.  Specifically,
> it does not obey the LANGUAGE variable which can be used to specify that
> translated strings should be in a certain language.
> 
> In 2014, you discussed the rationale for not supporting LANGUAGE.  There
> were issues with threads and caching:
> 
> Rich Felker, Thu, 31 Jul 2014, "How should $LANGUAGE work in our gettext?"
> https://www.openwall.com/lists/musl/2014/07/31/2
> 
> Recently in the Texinfo project, we found this incompatibility with musl
> for translations of strings to be placed in output files.  The gettext
> API (whether in musl, glibc or another libc) is not a perfect match for
> Texinfo's needs, as it assumes that the target language is that of the
> user, the person sitting in front of the computer, whereas the appropriate
> translation language is that of the input document.  For example, somebody
> could be generating documentation in Italian to be posted to a website,
> while they don't speak Italian themselves and do not have an Italian
> locale installed.

This sounds like locale is not the right tool for processing it.

> The only way we can support this with glibc is to set LC_MESSAGES and/or
> LC_ALL to a locale that is not "C" or "POSIX", and then to set the LANGUAGE
> variable for the actual target language.  This is a nuisance, as sometimes
> it is a struggle to actually find such a locale.  The assumption when this
> API was designed was that a user with only a "C" locale does not need
> translations, but this is false when they are generating them for somebody
> else.  libc appears to offer no way just to open an arbitrary .mo file (the
> file with the translated strings in it) to get the translations, forcing
> you to go through the locale system.

If you just want to process .mo files without going thru the locale
system, the necessary code is about 42 source lines/329 machine code
bytes that's MIT-licensed in musl that you're free to copy. This
probably makes the most sense.
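
For readers who have not looked at the format, the following is a rough
sketch of such a lookup, written from the layout documented in the GNU
gettext manual rather than copied from musl.  The name mo_lookup_sketch is
made up for this example.  It assumes a revision-0 file whose original
strings are sorted in byte order, and it finds the entry by binary search
rather than through the optional hash table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit word at byte offset off, swapping bytes if requested. */
static uint32_t rd32(const unsigned char *p, size_t off, int swap)
{
    uint32_t v;
    memcpy(&v, p + off, sizeof v);
    if (swap)
        v = (v >> 24) | ((v >> 8) & 0xff00) | ((v << 8) & 0xff0000) | (v << 24);
    return v;
}

/* Hypothetical lookup: map/size describe a whole .mo file in memory.
 * Returns the translation of msgid, or NULL if absent or malformed. */
const char *mo_lookup_sketch(const void *map, size_t size, const char *msgid)
{
    const unsigned char *p = map;
    int swap;

    assert(map && msgid);
    if (size < 28)
        return NULL;                        /* header is 7 words */

    uint32_t magic = rd32(p, 0, 0);
    if (magic == 0x950412de) swap = 0;      /* same byte order as host */
    else if (magic == 0xde120495) swap = 1; /* opposite byte order */
    else return NULL;

    uint32_t n = rd32(p, 8, swap);          /* number of strings */
    uint32_t o = rd32(p, 12, swap);         /* offset of originals table */
    uint32_t t = rd32(p, 16, swap);         /* offset of translations table */
    if (n > size / 8 || o > size - 8 * (size_t)n || t > size - 8 * (size_t)n)
        return NULL;

    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t ol = rd32(p, o + 8 * mid, swap);      /* length */
        uint32_t os = rd32(p, o + 8 * mid + 4, swap);  /* offset */
        if (os >= size || ol >= size - os || p[os + ol] != 0)
            return NULL;
        int c = strcmp(msgid, (const char *)p + os);
        if (c == 0) {
            uint32_t tl = rd32(p, t + 8 * mid, swap);
            uint32_t ts = rd32(p, t + 8 * mid + 4, swap);
            if (ts >= size || tl >= size - ts || p[ts + tl] != 0)
                return NULL;
            return (const char *)p + ts;
        }
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    return NULL;
}
```

The caller would typically mmap() or read the catalog file for the wanted
language and pass the buffer in; no locale needs to be set or installed.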

> musl supports setting LC_MESSAGES to an arbitrary value that is not
> a locale, so it can access arbitrary translation files in a different way.
> However, we didn't think it was worth having a special case in the code
> just for musl:
> https://lists.gnu.org/archive/html/bug-texinfo/2024-12/msg00035.html
> 
> You also discussed changing how LC_MESSAGES works in a post in
> 2017, but as far as I am aware nothing came of it:
> 
> Rich Felker, Wed, 8 Nov 2017, "Re: setlocale behavior with 'missing' locales"
> 
>   One notable issue is that, right now, we rely on being able to set
>   LC_MESSAGES to an arbitrary name even if there's no libc locale
>   definition for it; this is because gettext() relies on the name of the
>   current LC_MESSAGES locale to find (application-specific) translation
>   files that might exist even without a libc translation. I'm not sure
>   how we would best keep this working under changes similar to the
>   above.
> https://www.openwall.com/lists/musl/2017/11/08/2

There's currently a proposal to partly remove this behavior, because it
prevents applications from being able to detect if there's actually a
meaningful locale installed for a specific locale name. The specifics
have not been worked out, and this is an area I'd really like input
from affected parties on.

The hard constraint from my perspective is that setlocale(x,"") can't
be allowed to fail (user stuck with no Unicode because of unsupported
locale name in environment), but both the current behavior of making a
virtual locale by the requested name, and replacing the name by
C.UTF-8 in this case, are options. It's plausible that only
LC_MESSAGES could keep the current behavior if this turns out to be
the most helpful.

Depending on how LC_MESSAGES is to be handled, it's plausible that we
could integrate support for LANGUAGE at the same time, maybe having
the synthesized locale for "" also storing/encoding the value of
LANGUAGE, or some other mechanism to achieve the same thing. But I'm
not sure it's a good idea. There are many reasons already discussed
why the LANGUAGE model is broken, and I'm not sure we can fix it in a
way that's consistent with user expectations.

I'll probably open a new thread on this specific topic soon.

But I suspect your problem is best solved by not using locale for
non-user-language data processing.

Rich


* Re: [musl] gettext LC_MESSAGES differences from other libc
  2025-01-12  4:51 ` Rich Felker
@ 2025-01-21 20:43   ` Gavin Smith
  2025-01-28 11:26     ` Patrice Dumas
  0 siblings, 1 reply; 5+ messages in thread
From: Gavin Smith @ 2025-01-21 20:43 UTC
  To: Rich Felker; +Cc: musl, Patrice Dumas

On Sat, Jan 11, 2025 at 11:51:05PM -0500, Rich Felker wrote:
> > Recently in the Texinfo project, we found this incompatibility with musl
> > for translations of strings to be placed in output files.  The gettext
> > API (whether in musl, glibc or another libc) is not a perfect match for
> > Texinfo's needs, as it assumes that the target language is that of the
> > user, the person sitting in front of the computer, whereas the appropriate
> > translation language is that of the input document.  For example, somebody
> > could be generating documentation in Italian to be posted to a website,
> > while they don't speak Italian themselves and do not have an Italian
> > locale installed.
> 
> This sounds like locale is not the right tool for processing it.
> 
> > The only way we can support this with glibc is to set LC_MESSAGES and/or
> > LC_ALL to a locale that is not "C" or "POSIX", and then to set the LANGUAGE
> > variable for the actual target language.  This is a nuisance, as sometimes
> > it is a struggle to actually find such a locale.  The assumption when this
> > API was designed was that a user with only a "C" locale does not need
> > translations, but this is false when they are generating them for somebody
> > else.  libc appears to offer no way just to open an arbitrary .mo file (the
> > file with the translated strings in it) to get the translations, forcing
> > you to go through the locale system.
> 
> If you just want to process .mo files without going thru the locale
> system, the necessary code is about 42 source lines/329 machine code
> bytes that's MIT-licensed in musl that you're free to copy. This
> probably makes the most sense.

Thanks for the suggestion.  It is possible that we will end up doing
this, if the current approach has more problems.

I noticed that your implementation at:

https://git.musl-libc.org/cgit/musl/tree/src/locale/__mo_lookup.c

does not refer to a hashing table section of the .mo file.  This
could make it slower.

I'm not sure if there is a relevant standard for the format for .mo
files.

At https://pubs.opengroup.org/onlinepubs/9799919799/utilities/msgfmt.html, it
says:

"The format of the created messages object files is unspecified."

The GNU gettext manual gives some documentation on the file format,
but does not document the format of the hashing table:

"The precise hashing algorithm used is fairly dependent on GNU gettext
code, and is not documented here."
https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html

Apart from the hashing table issue, using libc gettext handles some
other things that we would have to recreate ourselves, on top of the .mo
file format.  Ones I can think of are translation contexts, character
encodings, and regional language variants (e.g. pt and pt_BR).
Another issue could be plural variants of translations.  We could probably
reimplement all of this without a huge amount of difficulty, if we
really wanted to, although as the code is in the libc already it would
seem simpler if we could access it.
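
To give an idea of the size of two of these pieces, here is a small sketch
in C.  Both function names are hypothetical.  It relies on the GNU gettext
conventions that a context is joined to the msgid with an EOT byte (\004)
to form the lookup key, and that the plural forms of one translation are
stored as consecutive NUL-terminated strings.  Choosing which form index to
use requires evaluating the catalog's Plural-Forms header, which is not
shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build the lookup key for a message with a context.
 * Returns 0 on success, -1 if buf is too small. */
int make_context_key(char *buf, size_t size, const char *ctx,
                     const char *msgid)
{
    assert(buf && ctx && msgid);
    int n = snprintf(buf, size, "%s\004%s", ctx, msgid);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/* Pick plural form number `form` (0-based) out of a translation of
 * length len, where len excludes the final terminating NUL, as in the
 * .mo string tables.  Returns NULL if there are fewer forms. */
const char *pick_plural_form(const char *trans, size_t len, unsigned form)
{
    assert(trans);
    const char *end = trans + len;      /* points at the final NUL */

    while (form--) {
        /* Find the NUL that ends the current form, if any. */
        const char *nul = memchr(trans, 0, (size_t)(end - trans));
        if (!nul)
            return NULL;                /* ran out of forms */
        trans = nul + 1;
    }
    return trans;
}
```

Character encoding conversion and the fallback from a regional variant such
as pt_BR to pt would still need to be handled separately.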


* Re: [musl] gettext LC_MESSAGES differences from other libc
  2025-01-21 20:43   ` Gavin Smith
@ 2025-01-28 11:26     ` Patrice Dumas
  0 siblings, 0 replies; 5+ messages in thread
From: Patrice Dumas @ 2025-01-28 11:26 UTC
  To: Gavin Smith; +Cc: Rich Felker, musl

On Tue, Jan 21, 2025 at 08:43:16PM +0000, Gavin Smith wrote:
> > If you just want to process .mo files without going thru the locale
> > system, the necessary code is about 42 source lines/329 machine code
> > bytes that's MIT-licensed in musl that you're free to copy. This
> > probably makes the most sense.
> 
> Thanks for the suggestion.  It is possible that we will end up doing
> this, if the current approach has more problems.

As a side note, I discussed this specific issue in 2011 with Guido Flohr,
the maintainer of libintl-perl, as the same issue arises in Perl when
trying to use a language independent of the locale.  At that time, he
said that he had something in progress for more low-level and direct
access to .mo files, but so far it has not materialized.

-- 
Pat

