mailing list of musl libc
* [musl] setlocale() again
From: Alastair Houghton @ 2023-08-10 15:41 UTC
  To: musl; +Cc: Rich Felker

Hi again,

I spent some time today looking at the setlocale() problem and thought I’d put some notes down in an email.

1. Musl wishes to support UTF-8 “out of the box”.

2. At the same time, it needs to be 8-bit-safe, so the default locale, C, is NOT UTF-8.

3. POSIX, and the C standard, specify that setlocale() should fail if the locale name isn’t a valid locale, but don’t say precisely what makes a name valid.  A program that wants UTF-8 support and that does `setlocale(LC_ALL, "")` can therefore find itself in the C locale if the one specified in the environment happens to be invalid.

4. This seemed undesirable, so setlocale() presently accepts any locale name as valid; if it doesn’t have a definition file for a locale, it copies the C.UTF-8 locale, gives the copy the name passed in, and returns that.  This avoids the problem in (3), and also means that gettext() will work for any language without installing locale data for Musl.  Unfortunately it also means that there is no way for a program (notably a test suite) to determine whether data for a locale is present, because setlocale() will always succeed, even if we don’t have the data.

5. Back in 2017 (https://www.openwall.com/lists/musl/2017/11/08/2) Rich proposed changing things so that `setlocale(cat, "")` always succeeds, treating an unknown locale specified in the environment as C.UTF-8, while `setlocale(cat, explicit_name)` fails unless a valid definition file is installed for that locale name.  This would also avoid the problem in (3), although it would mean that gettext() will not work unless a valid locale definition is installed for the C library (BTW, this is exactly the situation Glibc is in here; if Glibc doesn’t have locale data, it will fail setlocale() and then gettext() will find itself in the C locale).  On the other hand, it does mean that programs can detect whether or not a given locale is present.

Why do I care?  Because I’m trying to make libc++ work with Musl and right now it has failing tests because it expects (not entirely unreasonably) that if e.g. `setlocale(LC_ALL, "fr_FR")` succeeds, then the C library will localise things into French.  While I can test for the unusual behaviour of Musl detailed in (4), the libc++ maintainer understandably doesn’t like it and we would both far rather Musl were fixed to behave similarly to other implementations.

It seems to me that Rich’s proposal (5) was sensible.  Programs that use gettext(), and users relying on it for localization, must already cope with the fact that the C library must have locale data for their chosen locale in order for gettext() to work; that is how things work on Glibc.  It so happened that (4) meant that such programs would work with partial localization on Musl without there being any locale data installed for Musl, but that isn’t really right (e.g. you might get a mix of localized strings from gettext() but with numeric formatting that didn’t match - for French, for instance, numbers would have “.”s instead of “,”s as a decimal separator).

Looking at the 2017 thread, it appears it didn’t go anywhere for whatever reason, so I’d like to understand the status of the proposed change.  Was it nixed for some reason?  Is it likely to happen in the future?  If it’s a matter of resources, would a patch from me be accepted, in principle?

Kind regards,

Alastair.



* Re: [musl] setlocale() again
From: Rich Felker @ 2023-08-10 15:51 UTC
  To: Alastair Houghton; +Cc: musl

On Thu, Aug 10, 2023 at 04:41:38PM +0100, Alastair Houghton wrote:
> Hi again,
> 
> I spent some time today looking at the setlocale() problem and
> thought I’d put some notes down in an email.
> 
> 1. Musl wishes to support UTF-8 “out of the box”.
> 
> 2. At the same time, it needs to be 8-bit-safe, so the default
> locale, C, is NOT UTF-8.
> 
> 3. POSIX, and the C standard, specify that setlocale() should fail
> if the locale name isn’t a valid locale, but don’t really say what
> that means precisely. A program that wants UTF-8 support and that
> does `setlocale(LC_ALL, "")` can therefore find itself in the C
> locale if the one specified in the environment happens to be
> invalid.
> 
> 4. This seemed undesirable, so setlocale() presently accepts any
> locale name as valid; if it doesn’t have a definition file for a
> locale, it will copy the C.UTF-8 locale, giving it the name passed
> in and return that. This avoids the problem in (3), and also means
> that gettext() will work for any language without installing locale
> data for Musl. Unfortunately it also means that there is no way for
> a program (notably a test suite) to determine the presence of data
> for a locale, because setlocale() will always succeed, even if we
> don’t have the data.
> 
> 5. Back in 2017 (https://www.openwall.com/lists/musl/2017/11/08/2)
> Rich was proposing to change things so that `setlocale(cat, "")`
> always succeeds, but if the environment specifies an unknown locale,
> treats it as C.UTF-8, while `setlocale(cat, explicit_name)` will
> fail unless a valid definition file is installed for that locale
> name. This would also avoid the problem in (3), although it will
> mean that gettext() will not work unless a valid locale definition
> is installed for the C library (BTW, this is exactly the situation
> Glibc is in here; if Glibc doesn’t have locale data, it will fail
> setlocale() and then gettext() will find itself in the C locale). On
> the other hand, it does mean that programs can detect whether or not
> a given locale is present.
> 
> Why do I care? Because I’m trying to make libc++ work with Musl and
> right now it has failing tests because it expects (not entirely
> unreasonably) that if e.g. `setlocale(LC_ALL, "fr_FR")` succeeds,
> then the C library will localise things into French. While I can
> test for the unusual behaviour of Musl detailed in (4), the libc++
> maintainer understandably doesn’t like it and we would both far
> rather Musl were fixed to behave similarly to other implementations.
> 
> It seems to me that Rich’s proposal (5) was sensible. Programs that
> use gettext(), and users relying on it for localization, must
> already cope with the fact that the C library must have locale data
> for their chosen locale in order for gettext() to work; that is how
> things work on Glibc. It so happened that (4) meant that such
> programs would work with partial localization on Musl without there
> being any locale data installed for Musl, but that isn’t really
> right (e.g. you might get a mix of localized strings from gettext()
> but with numeric formatting that didn’t match - for French, for
> instance, numbers would have “.”s instead of “,”s as a decimal
> separator).
> 
> Looking at the 2017 thread, it appears it didn’t go anywhere for
> whatever reason, so I’d like to understand the status of the
> proposed change. Was it nixed for some reason? Is it likely to
> happen in the future? If it’s a matter of resource, if I were to
> raise a patch for it, would it be accepted, in principle?

Thank you for following up on this! The main reason it didn't go
anywhere was lack of feedback/engagement from anyone who cares about
locale behavior. I want whatever steps we take to be informed by what
folks actually need, not just my guesses at that. So in that sense,
your bumping of the issue is helpful in itself!

At this point, it's been quite a while since I looked at the
mechanisms. If you'd like to help move this forward, rather than
starting with a patch, writing a high-level natural language
description of how you'd make the changes (in terms of musl's current
internal representation for locale state) would be the most helpful.
If I'm forgetting and there's already such a good description, just
digging it up and citing it might be fine.

Rich


* Re: [musl] setlocale() again
From: Alastair Houghton @ 2023-09-05 12:57 UTC
  To: Rich Felker; +Cc: musl

On 10 Aug 2023, at 16:51, Rich Felker <dalias@libc.org> wrote:
> 
> On Thu, Aug 10, 2023 at 04:41:38PM +0100, Alastair Houghton wrote:
>> I spent some time today looking at the setlocale() problem and
>> thought I’d put some notes down in an email.

[snip]

> At this point, it's been quite a while since I looked at the
> mechanisms. If you'd like to help move this forward, rather than
> starting with a patch, writing a high-level natural language
> description of how you'd make the changes (in terms of musl's current
> internal representation for locale state) would be the most helpful.
> If I'm forgetting and there's already such a good description, just
> digging it up and citing it might be fine.

I wrote something up here:

https://gist.github.com/al45tair/15c3ade52b09d0cad67074176ad43e4a

Let me know what you think; I can update the document there until we’re happy that we’ve got the right solution, then I should be able to create a patch and get the relevant permission from my employer to submit it.

Kind regards,

Alastair.



* Re: [musl] setlocale() again
From: Alastair Houghton @ 2023-09-18 14:18 UTC
  To: musl, Rich Felker

Hi all (Rich especially though :-))

Has anyone had time to take a look at this? I’d like to make some progress on this front if possible.

Kind regards,

Alastair.



