mailing list of musl libc
* Planned locale work and community thoughts
       [not found] <ecdf29e7b16f50fbb734870f5c34abaa61d946cd.camel@postmarketos.org>
@ 2025-06-02 17:37 ` Pablo Correa Gomez
  2025-06-18 19:28   ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: Pablo Correa Gomez @ 2025-06-02 17:37 UTC
  To: musl

Hi everybody,

I am Pablo Correa Gomez, a member of postmarketOS Core Contributors,
working on the collation and locale overhaul project
(https://www.openwall.com/lists/musl/2025/05/05/5) together with Rich.

We now have more details on the planned locale work that was announced
earlier. The current musl locales experience is sub-par compared to
other platforms, and we plan to use this project to fix that.

The main and biggest issue that we aim to solve is the representation
format of the locale strings. The initial implementation used English
strings as keys to look up translations. This had a major issue: "May"
is both the abbreviated and the non-abbreviated form of the month, so
the two cannot be translated separately in languages where the full
name of May has more than 3 letters. However, there are other issues
that we are also aiming to solve in this project:

* Implement RADIXCHAR so that "." is not the only possible separator.
THOUSEP will in principle not be implemented, because it breaks quite a
few assumptions and is less critical for users.
* Implement LC_MONETARY so that we can get properly localized currency
representation.
* Make sure that every function that accepts a locale actually uses it
for the translation.

To be able to prepare for the technical work, there are some things for
which we would like community input:

1. We need to figure out an alternative representation for the
translatable strings derived from [1] to avoid the "May issue". A simple
solution would be to use those constants (or an abbreviation of them)
as keys for the lookup. Hopefully that would be both unambiguous and
self-explanatory and, as a bonus, it's already documented. Does anybody
have other/better ideas?

2. Regardless of the representation we choose, we need to decide on a
workflow for translators. Currently, people can just copy the .pot
file [2] with a hard-coded representation that might include other
things to translate. That seems good enough if we choose the
representation directly from [1], but might not be possible if we
decide on something different.

3. Right now, other translatable strings coming from different sources
(another email with a detailed analysis will follow up) are also part
of the musl locales project. Those are also just translated directly as
strings. However, some also appear in different contexts, like "out of
memory" for regex and for getting network address info. Should these be
split, receive a different representation, and thus provide additional
context information to translators? I personally believe that most
high-level applications should hide these messages coming directly from
libc, and thus they should only be rarely available to users, like in
CLI applications, where users are generally expected to have a basic
knowledge of English. I would be fine with leaving these strings
represented just by their own English string names, even if that means
a bit of context is lost in some languages.

4. Choose a default locale placement, so that we can get translations
without needing to parse an envvar in [3]. In Alpine/pmOS the location
is currently /usr/share/i18n/locales/musl/. I do not think that's a
great place, but the FHS does not seem to provide an obvious place for
it to live, since AFAIU locales for the libc should not be mixed with
LC_MESSAGES from other applications. Are there other suggestions?

5. So far, although the musl-locales project exists, it has been kept
apart from the main musl project, and not really sanctioned as
"official". It would be great if we could have discussions related to
the musl-locales project directly on this mailing list, and if there
could be a synchronized copy of it in https://git.musl-libc.org/cgit
next to the musl repository. Is anybody against this?

6. Given that good localization is critical for postmarketOS,
I would be very happy if we could fork the current project, host it in
our gitlab, and use it as the place to synchronize with
https://git.musl-libc.org/cgit. If somebody has other ideas, or
moving it is considered disruptive, then it would be great if somebody
from our team could also get access, so we can increase the maintenance
it has seen lately.

7. If a locale is missing in musl, setlocale currently "fakes" that
support exists by copying the C data to the said locale. This has the
benefit that apps which are translated in a locale missing in musl
still show up as translated for the application-related messages. The
problem with this is that the UX is then inconsistent, since users get
things mixed and matched in different languages. This also generally
goes against the musl philosophy of being strictly correct. A previous
discussion [4] had a pretty good proposal [5] that I fully support. As I
said in the thread, as long as we have some time to adapt, the behavior
change should be acceptable.

8. Finally, if you want to be involved in testing a language for
which we don't yet have a volunteer signed up in [6], feel free to
put yourself forward. We might have some small funding available; for
that, please send me a private email with the details specified there.

We hope that at the end of this work, we have a setup for musl locales
that is able to fit the needs of most users. If you believe there is
something missing, please let us know.

This work is possible thanks to a grant from NLnet and the NGI Zero
Core Fund. Thank you for supporting us!

[1]
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/langinfo.h.html
[2]
https://git.adelielinux.org/adelie/musl-locales/-/blob/main/musl.pot?ref_type=heads
[3]
https://git.musl-libc.org/cgit/musl/tree/src/locale/locale_map.c#n66
[4] https://www.openwall.com/lists/musl/2023/08/10/3
[5]
https://gist.github.com/al45tair/15c3ade52b09d0cad67074176ad43e4a#proposed-behaviour
[6]
https://gitlab.postmarketos.org/postmarketOS/postmarketos/-/issues/65



* Re: Planned locale work and community thoughts
  2025-06-02 17:37 ` Planned locale work and community thoughts Pablo Correa Gomez
@ 2025-06-18 19:28   ` Rich Felker
  2025-06-18 21:23     ` [musl] " Rich Felker
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Rich Felker @ 2025-06-18 19:28 UTC
  To: Pablo Correa Gomez; +Cc: musl

On Mon, Jun 02, 2025 at 07:37:51PM +0200, Pablo Correa Gomez wrote:
> Hi everybody,
> 
> I am Pablo Correa Gomez, a member of postmarketOS Core Contributors,
> working on the collation and locale overhaul project
> (https://www.openwall.com/lists/musl/2025/05/05/5) together with Rich.
> 
> We have now more details on the planned locale work that was earlier
> announced. The current musl locales experience is sub-par compared to
> other platforms, and we plan to use this project to fix that. 
> 
> The main and biggest issue that we aim to solve is the representation
> format of the locale strings. The initial implementation used English
> strings as keys to lookup for translations. This had a major issue
> where May would represent both the abbreviated and non-abbreviated
> forms of the month, making it untranslatable in languages where May has
> more than 3 letters. However, there are other different issues that are
> also aiming to solve in this project:

Main decision to be made here is how we key items that need
localization, whether by fixing the string-based keying (e.g. using
the macro names like "ABMON5" as the keys) with the gettext-type
lookup we have now, or switching to assigned integer indices as the
keying for a more catgets-like system (likely using the values from
the macros in langinfo.h as the indices), or something else.
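
To make the two keying options concrete, here is a minimal sketch. It
is illustrative only and not musl code: demo_entry, demo_table and both
lookup functions are hypothetical names, the Spanish strings are sample
data, and the standard langinfo.h constants MON_5/ABMON_5 stand in for
whatever key spelling is eventually chosen. With English strings as
keys, the full and abbreviated "May" collapse into one key; with the
langinfo.h constants as keys, each item keeps its own slot.

```c
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical translation entry: one row per langinfo.h item. */
struct demo_entry {
    nl_item item;        /* key option B: the langinfo.h constant */
    const char *english; /* key option A: the English source string */
    const char *spanish; /* sample translated value */
};

static const struct demo_entry demo_table[] = {
    { MON_5,   "May",  "mayo"  },
    { ABMON_5, "May",  "may"   },
    { MON_6,   "June", "junio" },
    { ABMON_6, "Jun",  "jun"   },
};

#define DEMO_COUNT (sizeof demo_table / sizeof demo_table[0])

/* Option A: key by English string. The first match wins, so the full
 * and abbreviated forms of "May" cannot be told apart. */
const char *lookup_by_english(const char *english)
{
    for (size_t i = 0; i < DEMO_COUNT; i++)
        if (strcmp(demo_table[i].english, english) == 0)
            return demo_table[i].spanish;
    return english; /* untranslated fallback */
}

/* Option B: key by nl_item. Every item has its own entry. */
const char *lookup_by_item(nl_item item)
{
    for (size_t i = 0; i < DEMO_COUNT; i++)
        if (demo_table[i].item == item)
            return demo_table[i].spanish;
    return NULL; /* item not present in this sample table */
}
```

In this sketch lookup_by_english("May") can only ever return one of the
two Spanish forms, while lookup_by_item(MON_5) and
lookup_by_item(ABMON_5) return distinct strings.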

> * Implement RADIXCHAR so that "." is not the only possible separator.
> THOUSEP will in principle not be implemented due to it breaking quite
> some assumptions, and it being less critical for users.

To give some background on this: from the start I was largely opposed
to having the radix char be localizable at all, as this has been a
source of perpetual problems for parsing and generating text-based
data formats intended for interchange, and I didn't really think there
was any modern demand for it.

However, in past discussions of the topic, it's come up that some
people do want it, and I don't want us to be the bad guys who are
being stubborn dismissing someone else's cultural expectations, so the
tentative plan has been to offer this with 1-bit degree of freedom
between '.' and ',' as the only choices.
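
One possible reading of the 1-bit restriction, as a minimal sketch:
whatever separator the locale data carries, only "," is honoured and
everything else falls back to ".". The function name, and the
assumption that the separator is available as a NUL-terminated string,
are hypothetical and not taken from musl.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reduce the stored decimal separator to the one
 * bit of freedom described above. Only the exact string "," selects a
 * comma; a missing value, an empty string, or any other separator
 * (including multi-byte ones) yields '.'. */
char effective_radix(const char *stored)
{
    if (stored != NULL && strcmp(stored, ",") == 0)
        return ',';
    return '.';
}
```

Because the result is always a single ASCII byte, code that assumes a
one-byte radix character keeps working regardless of what the locale
data contains.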

I've been made aware that, at least historically prior to use in
computer systems, there have been other notations for radix point, but
it's not clear if there's any modern expectation to be able to do
that. What I think would be a useful next step is to grep the Unicode
CLDR for whether there are non-'.' non-',' radix chars in any locale
definitions. If there are none, I think that already settles it. If
there are any, we should attempt to figure out whether there are
real-world systems that support them and precedent for users to expect
they work.

Note that supporting basically anything plausible other than '.' and
',' as radix characters has major technical issues that may introduce
vulns into programs not expecting it, so in the absence of both strong
evidence of necessity and research into what would break and whether
unsafe breakage is unlikely, I want to just say no to this.

It may however make sense for the on-disk data format to allow for the
possibility, and for musl to just treat anything but "," as if it were
"." for the foreseeable future.

> * Implement LC_MONETARY so that we can get properly localized currency
> representation.

This is fairly straightforward, but does need a reasonable data format
that translates well into "struct localeconv" form. The localeconv
fields that are strings need to be directly usable from the
memory-mapped locale file, so that we don't need to allocate
variable-sized storage for them, and one complication of this is that
"grouping" and "mon_grouping" are *arch-specific* because the encoding
uses CHAR_MAX with a special meaning, and CHAR_MAX could be 127 or
255. This means both versions of the string (one for signed-plain-char
archs and one for unsigned-plain-char archs) need to be stored in the
on-disk format.
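
The CHAR_MAX complication can be sketched as follows. This is a
hypothetical layout, not the actual on-disk format: demo_grouping,
demo_rule and select_grouping are made-up names. Both encodings of the
same rule sit side by side in the data, and the library returns a
pointer to whichever one matches the plain char of the architecture it
was built for, so no copy or allocation is needed.

```c
#include <limits.h>

/* Hypothetical record: the same grouping rule encoded twice, once per
 * possible value of CHAR_MAX. */
struct demo_grouping {
    const char *for_signed_char;   /* uses byte 0x7f (CHAR_MAX == 127) */
    const char *for_unsigned_char; /* uses byte 0xff (CHAR_MAX == 255) */
};

/* Sample data: one group of 3 digits, then no further grouping. */
static const struct demo_grouping demo_rule = {
    "\x03\x7f",
    "\x03\xff",
};

/* Select the variant matching this architecture's plain char, so the
 * returned pointer could be used directly as a "grouping" string. */
const char *select_grouping(const struct demo_grouping *g)
{
#if CHAR_MAX == 127
    return g->for_signed_char;
#else
    return g->for_unsigned_char;
#endif
}
```

On either kind of architecture, the second element of the selected
string compares equal to CHAR_MAX, which is what callers of
localeconv() test for.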

> * Make sure that every function that accepts a locale actually uses it
> for the translation.
> 
> To be able to prepare for the technical work, there are some things for
> which we would like community input:
> 
> 1. We need to figure out an alternative representation for the
> translatable strings derived from[1] to avoid the "May issue". A simple
> solution would be to use those constants (or an abbreviation of them)
> as keys for the lookup. Hopefully that would be both unambiguous and
> self-explanatory and as a bonus, it's already documented. Does somebody
> have other/better ideas?

See above for the options I'm aware of.

> 2. Regardless of the representation we choose, we need to decide on a
> workflow for translators. Currently, people can just copy the .pot 
> file[2] with a hard-coded representation that might include other
> things to translate. That seems good enough if we chose the
> representation directly from [1], but might not be possible if we
> decide on something different.

Long-term, the workflow should probably be deriving the data from the
Unicode CLDR with possibility for overrides, with tooling to do that.
I'm not sure if we want to prepare such tooling now.

At least for collation, I think we need some level of tooling now in
order to be able to test/evaluate it. I'm presently trying to find the
relevant tooling other systems use. ICU has something that converts
the base weights table to a possibly-reasonable binary form but I
haven't located the tooling to apply locale-specific modifications
from CLDR data to the table.

> 3. Right now, other translatable strings coming from different sources
> (another email with a detailed analysis will follow up) are also part
> of the musl locales project. Those are also just translated directly as
> strings. However, some also appear in different contexts. Like "out of
> memory" on regex, on getting network address info. Should these be
> split, receive a different representation, and thus provide additional
> context information to translators? I personally believe that most
> high-level applications should hide these messages coming directly from
> libc, and thus they should only be rarely available to users, like in
> CLI applications, where users are generally expected to have a basic
> knowledge of English. I would be fine with leaving these strings
> represented just by their own English string names, even if that means
> a bit of context is lost in some languages.

I think it's expected that they're translated (don't set LC_MESSAGES
if you don't want that) but again the mechanism is open to change
while we're making a major overhaul here.

Do we want the messages keyed by the English strings, as now, or do we
want them keyed by identity of the error, whether that's the names of
the errno E*/REG_*/EAI_*/etc. macros or some assigned integer codes as
in the option for LC_TIME stuff above.
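
A minimal sketch of keying by error identity, using macro names as
keys. The table layout, the names demo_msg, find_msg and count_english,
and the translator notes are hypothetical; the English text is sample
data. It shows how three errors that share one English message become
three separate catalog entries, each of which can carry its own
context for translators.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical catalog row keyed by the name of the error macro. */
struct demo_msg {
    const char *key;     /* identity of the error, e.g. "REG_ESPACE" */
    const char *english; /* default (untranslated) text */
    const char *note;    /* context shown to translators */
};

static const struct demo_msg demo_msgs[] = {
    { "ENOMEM",     "Out of memory", "strerror: allocation failed" },
    { "REG_ESPACE", "Out of memory", "regerror: regex ran out of memory" },
    { "EAI_MEMORY", "Out of memory", "gai_strerror: lookup ran out of memory" },
};

#define DEMO_MSG_COUNT (sizeof demo_msgs / sizeof demo_msgs[0])

/* Look up one entry by its macro-name key; NULL if unknown. */
const struct demo_msg *find_msg(const char *key)
{
    for (size_t i = 0; i < DEMO_MSG_COUNT; i++)
        if (strcmp(demo_msgs[i].key, key) == 0)
            return &demo_msgs[i];
    return NULL;
}

/* Count entries sharing one English text: with English-string keying
 * all of these would collapse into a single translatable item. */
size_t count_english(const char *english)
{
    size_t n = 0;
    for (size_t i = 0; i < DEMO_MSG_COUNT; i++)
        if (strcmp(demo_msgs[i].english, english) == 0)
            n++;
    return n;
}
```

The trade-off is visible in the table itself: translators see more
entries, but each entry is unambiguous about where the message is used.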

> 4. Choose a default locale placement, so that we can get translations
> without needing to parse an envvar in [3]. In Alpine/pmOS the location
> is currently in /usr/share/i18n/locales/musl/ I do not think that's a
> great place, but the FHS does not seem to provide an obvious place for
> it to live, since AFAIU locales for the libc should not be mixed with
> LC_MESSAGES from other applications. Are there other suggestions?

My main concern, especially if we want them to be usable by suid
binaries, is that they should be in a location we can rely on to
belong to root. While I don't think they should be *stored* in /etc,
reaching them via a path component (intended to be a symlink) in /etc
is probably the best way to both ensure that and allow the actual
files to be placed wherever distro policy wants them to be placed.
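
As a rough sketch of the lookup policy this implies: a fixed path
under /etc is always available, and an environment override is only
considered for processes that are not running with elevated
privileges. The path /etc/musl-locale and the function name are
placeholders, not a proposal for the actual names; the "secure" flag
stands for whatever suid/sgid indication the libc has, and allowing an
override at all for non-secure processes is an assumption of this
sketch.

```c
#include <stddef.h>

/* Placeholder for a path component in /etc, intended to be a symlink
 * to wherever distro policy stores the locale files. */
#define DEMO_FIXED_LOCPATH "/etc/musl-locale"

/* Hypothetical policy function: pick the directory to search for
 * locale files. For secure (e.g. suid) processes the environment is
 * never trusted, so only the fixed, root-owned path is used. */
const char *locale_search_path(int secure, const char *env_locpath)
{
    if (!secure && env_locpath != NULL && *env_locpath != '\0')
        return env_locpath;
    return DEMO_FIXED_LOCPATH;
}
```

With this shape, translations work without any environment variable
being set, and a suid binary never reads locale data from a
user-controlled location.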

> 5. So far, although the musl-locales project exist, it has been kept
> apart from the main musl project, and not really sanctioned as
> "official". It would be great, if we could have discussions related to
> musl-locales project directly in this mailing list. And if there could
> be a synchronized copy of it in https://git.musl-libc.org/cgit next to
> the musl repository. Is there somebody against this?

I think having the discussion on the main mailing list should be fine.

> 6. Given that at postmarketOS good localization is something critical,
> I would be very happy if we could fork the current project, host it  in
> our gitlab, and use it as the place to synchronize with
> https://git.musl-libc.org/cgit. If somebody would have other ideas, or
> moving it is considered disruptive, then it would be great if somebody
> from our team could also get access, so we can increase the maintenance
> it has seen lately.

I don't have a strong opinion on this yet, but I do agree that we
should have it sync'd to the main musl git server, regardless of where
actual devel takes place, so that it presents as "official".

> 7. If a locale is missing in musl, setlocale currently "fakes" that
> support exists by copying the C data to the said locale. This has the
> benefit that apps which are translated in a locale missing in musl
> still show up as translated for the application-related messages. The
> problem with this is that the UX is then inconsistent, since users get
> things mixed and matched in different languages. This is also generally
> a step against musl philosophy of being strictly correct. A previous
> discussion[4] had a pretty good proposal[5] that I fully support. As I
> said in the thread, as long as we have some time to adapt, the behavior
> change should be acceptable.

I need to review this but I recall the proposal being acceptable.

> 8. Finally, if you want to be involved in testing in a language for
> which we don't yet have a volunteer signed-in in[6], feel free to
> report yourself, we might have some small funding available, for which
> please send me a private email with the details specified in there.
> 
> We hope that at the end of this work, we have a setup for musl locales
> that is able to fit the needs of most users. If you believe there is
> something missing, please let us know.
> 
> This work is possible thanks to a grant from NLnet and the NGI Zero
> Core Fund. Thank you for supporting us!
> 
> [1]
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/langinfo.h.html
> [2]
> https://git.adelielinux.org/adelie/musl-locales/-/blob/main/musl.pot?ref_type=heads
> [3]
> https://git.musl-libc.org/cgit/musl/tree/src/locale/locale_map.c#n66
> [4] https://www.openwall.com/lists/musl/2023/08/10/3
> [5]
> https://gist.github.com/al45tair/15c3ade52b09d0cad67074176ad43e4a#proposed-behaviour
> [6]
> https://gitlab.postmarketos.org/postmarketOS/postmarketos/-/issues/65



* Re: [musl] Planned locale work and community thoughts
  2025-06-18 19:28   ` Rich Felker
@ 2025-06-18 21:23     ` Rich Felker
  2025-06-18 22:42       ` Thorsten Glaser
  2025-07-30 15:53     ` Pablo Correa Gomez
  2025-08-07 19:25     ` [musl] " Rich Felker
  2 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2025-06-18 21:23 UTC
  To: Pablo Correa Gomez; +Cc: musl

On Wed, Jun 18, 2025 at 03:28:47PM -0400, Rich Felker wrote:
> > * Implement RADIXCHAR so that "." is not the only possible separator.
> > THOUSEP will in principle not be implemented due to it breaking quite
> > some assumptions, and it being less critical for users.
> 
> To give some background on this: from the start I was largely opposed
> to having the radix char be localizable at all, as this has been a
> source of perpetual problems for parsing and generating text-based
> data formats intended for interchange, and I didn't really think there
> was any modern demand for it.
> 
> However, in past discussions of the topic, it's come up that some
> people do want it, and I don't want us to be the bad guys who are
> being stubborn dismissing someone else's cultural expectations, so the
> tentative plan has been to offer this with 1-bit degree of freedom
> between '.' and ',' as the only choices.
> 
> I've been made aware that, at least historically prior to use in
> computer systems, there have been other notations for radix point, but
> it's not clear if there's any modern expectation to be able to do
> that. What I think would be a useful next step is to grep the Unicode
> CLDR for whether there are non-'.' non-',' radix chars in any locale
> definitions. If there are none, I think that already settles it. If
> there are any, we should attempt to figure out whether there are
> real-world systems that support them and precedent for users to expect
> they work.
> 
> Note that supporting basically anything plausible other than '.' and
> ',' as radix characters has major technical issues that may introduce
> vulns into programs not expecting it, so in the absence of both strong
> evidence of necessity and research into what would break and whether
> unsafe breakage is unlikely, I want to just say no to this.
> 
> It may however make sense for the on-disk data format to allow for the
> possibility, and for musl to just treat anything but "," as if it were

I've run a textual grep on the data from cldr-47.0.0-json-full.zip:

    grep '"decimal": *"[^,.]"' cldr-numbers-full/main/*/numbers.json

and the only results seem to be for alternative-numerals Arabic
profiles under "symbols-numberSystem-arabext", which is not
used/usable in the C/POSIX locale system.

(There is an alternate symbol, but it's only used with alternate
numeral characters, and C/POSIX can't use alternate numeral characters
in their locale model.)

Theoretically it's possible the textual grep missed things if there is
inconsistent json formatting anywhere, so if anyone familiar with jq
wants to conduct a search using it instead to confirm, go ahead. I
think we're good though.

Rich


* Re: [musl] Planned locale work and community thoughts
  2025-06-18 21:23     ` [musl] " Rich Felker
@ 2025-06-18 22:42       ` Thorsten Glaser
  2025-06-18 23:14         ` Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: Thorsten Glaser @ 2025-06-18 22:42 UTC
  To: musl; +Cc: Pablo Correa Gomez

On Wed, 18 Jun 2025, Rich Felker wrote:

>Theoretically it's possible the textual grep missed things if there is
>inconsistent json formatting anywhere, so if anyone familiar with jq
>wants to conduct a search using it instead to confirm, go ahead. I

My jq-foo is not very good, but I managed this:

tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' | sed 's/">>/>>/' | grep -e '^  "[^.,]"' -e '^  ".[^"]' | uniq
  "٫"

So yes, U+066B is the only other one, and no multi-char ones.

tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' | sed 's/">>/>>/' | grep -B 1 -e '^  "[^.,]"' -e '^  ".[^"]'

… shows all the occurrences, but a quick filter shows that we have
both symbols-numberSystem-arabext and symbols-numberSystem-arab but
assuming both are out of scope…

tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' | sed 's/">>/>>/' | grep -B 1 -e '^  "[^.,]"' -e '^  ".[^"]' | fgrep '>>' | fgrep -v -e '.symbols-numberSystem-arabext"' -e '.symbols-numberSystem-arab"'
  >>main.bgn-AE.numbers.symbols-numberSystem-latn",
  >>main.bgn-AF.numbers.symbols-numberSystem-latn",
  >>main.bgn-IR.numbers.symbols-numberSystem-latn",
  >>main.bgn-OM.numbers.symbols-numberSystem-latn",
  >>main.bgn.numbers.symbols-numberSystem-latn",

… leaves us with this; bgn/numbers.json as an example:

{
  "main": {
    "bgn": {
      "numbers": {
        "symbols-numberSystem-arabext": {
          "decimal": "٫",
          "group": "٬",
          "list": "؛",
…
        },
        "symbols-numberSystem-latn": {
          "decimal": "٫",
          "group": "،",
          "list": ";",
…

So, if the bgn locales are ever going to be relevant…
unsure what that exactly is, but my acronyms database says…
	[ISO 639-3] Western Balochi (cf. bal)
… which seems to fit.

bye,
//mirabilos
-- 
<ch> you introduced a merge commit        │<mika> % g rebase -i HEAD^^
<mika> sorry, no idea and rebasing just fscked │<mika> Segmentation
<ch> should have cloned into a clean repo      │  fault (core dumped)
<ch> if I rebase that now, it's really ugh     │<mika:#grml> wuahhhhhh


* Re: [musl] Planned locale work and community thoughts
  2025-06-18 22:42       ` Thorsten Glaser
@ 2025-06-18 23:14         ` Rich Felker
  2025-08-01  9:58           ` Pablo Correa Gomez
  0 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2025-06-18 23:14 UTC
  To: Thorsten Glaser; +Cc: musl, Pablo Correa Gomez

On Thu, Jun 19, 2025 at 12:42:50AM +0200, Thorsten Glaser wrote:
> On Wed, 18 Jun 2025, Rich Felker wrote:
> 
> >Theoretically it's possible the textual grep missed things if there is
> >inconsistent json formatting anywhere, so if anyone familiar with jq
> >wants to conduct a search using it instead to confirm, go ahead. I
> 
> My jq-foo is not very good, but I managed this:
> 
> tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' | sed 's/">>/>>/' | grep -e '^  "[^.,]"' -e '^  ".[^"]' | uniq
>   "٫"
> 
> So yes, U+066B is the only other one, and no multi-char ones.
> 
> tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' | sed 's/">>/>>/' | grep -B 1 -e '^  "[^.,]"' -e '^  ".[^"]'
> 
> … shows all the occurrences, but a quick filter shows that we have
> both symbols-numberSystem-arabext and symbols-numberSystem-arab but
> assuming both are out of scope…
> 
> tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")), getpath($p).decimal]' | sed 's/">>/>>/' | grep -B 1 -e '^  "[^.,]"' -e '^  ".[^"]' | fgrep '>>' | fgrep -v -e '.symbols-numberSystem-arabext"' -e '.symbols-numberSystem-arab"'
>   >>main.bgn-AE.numbers.symbols-numberSystem-latn",
>   >>main.bgn-AF.numbers.symbols-numberSystem-latn",
>   >>main.bgn-IR.numbers.symbols-numberSystem-latn",
>   >>main.bgn-OM.numbers.symbols-numberSystem-latn",
>   >>main.bgn.numbers.symbols-numberSystem-latn",
> 
> … leaves us with this; bgn/numbers.json examplary:
> 
> {
>   "main": {
>     "bgn": {
>       "numbers": {
>         "symbols-numberSystem-arabext": {
>           "decimal": "٫",
>           "group": "٬",
>           "list": "؛",
> …
>         },
>         "symbols-numberSystem-latn": {
>           "decimal": "٫",
>           "group": "،",
>           "list": ";",
> …
> 
> So, if the bgn locales are ever going to be relevant…
> unsure what that exactly is, but my acronyms database says…
> 	[ISO 639-3] Western Balochi (cf. bal)
> … which seems to fit.

Thanks. My grepping seems to have overlooked that just because it was
the same character that would normally only be used in an alt-digits
context. I wonder if the above is intentional or a mistake and if any
systems are actually doing that.



* Re: Planned locale work and community thoughts
  2025-06-18 19:28   ` Rich Felker
  2025-06-18 21:23     ` [musl] " Rich Felker
@ 2025-07-30 15:53     ` Pablo Correa Gomez
  2025-08-07 19:25     ` [musl] " Rich Felker
  2 siblings, 0 replies; 10+ messages in thread
From: Pablo Correa Gomez @ 2025-07-30 15:53 UTC
  To: Rich Felker; +Cc: musl

On Wed, 18-06-2025 at 15:28 -0400, Rich Felker wrote:
> On Mon, Jun 02, 2025 at 07:37:51PM +0200, Pablo Correa Gomez wrote:
> > Hi everybody,
> > 
> > I am Pablo Correa Gomez, a member of postmarketOS Core
> > Contributors,
> > working on the collation and locale overhaul project
> > (https://www.openwall.com/lists/musl/2025/05/05/5)together with
> > Rich.
> > 
> > We have now more details on the planned locale work that was
> > earlier
> > announced. The current musl locales experience is sub-par compared
> > to
> > other platforms, and we plan to use this project to fix that. 
> > 
> > The main and biggest issue that we aim to solve is the
> > representation
> > format of the locale strings. The initial implementation used
> > English
> > strings as keys to lookup for translations. This had a major issue
> > where May would represent both the abbreviated and non-abbreviated
> > forms of the month, making it untranslatable in languages where May
> > has
> > more than 3 letters. However, there are other different issues that
> > are
> > also aiming to solve in this project:
> 
> Main decision to be made here is how we key items that need
> localization, whether by fixing the string-based keying (e.g. using
> the macro names like "ABMON5" as the keys) with the gettext-type
> lookup we have now, or switching to assigned integer indices as the
> keying for a more catgets-like system (likely using the values from
> the macros in langinfo.h as the indices), or something else.

Given that not many people weighed in, Rich and I have agreed that he'll
come up with a proposal so that people can comment on something more
specific. In addition to passing it through the community, we will also
ask the translators who have volunteered for this project through
postmarketOS to weigh in.

> 
> > * Implement RADIXCHAR so that "." is not the only possible
> > separator.
> > THOUSEP will in principle not be implemented due to it breaking
> > quite
> > some assumptions, and it being less critical for users.
> 
> To give some background on this: from the start I was largely opposed
> to having the radix char be localizable at all, as this has been a
> source of perpetual problems for parsing and generating text-based
> data formats intended for interchange, and I didn't really think
> there
> was any modern demand for it.
> 
> However, in past discussions of the topic, it's come up that some
> people do want it, and I don't want us to be the bad guys who are
> being stubborn dismissing someone else's cultural expectations, so
> the
> tentative plan has been to offer this with 1-bit degree of freedom
> between '.' and ',' as the only choices.
> 
> I've been made aware that, at least historically prior to use in
> computer systems, there have been other notations for radix point,
> but
> it's not clear if there's any modern expectation to be able to do
> that. What I think would be a useful next step is to grep the Unicode
> CLDR for whether there are non-'.' non-',' radix chars in any locale
> definitions. If there are none, I think that already settles it. If
> there are any, we should attempt to figure out whether there are
> real-world systems that support them and precedent for users to
> expect
> they work.
> 
> Note that supporting basically anything plausible other than '.' and
> ',' as radix characters has major technical issues that may introduce
> vulns into programs not expecting it, so in the absence of both
> strong
> evidence of necessity and research into what would break and whether
> unsafe breakage is unlikely, I want to just say no to this.
> 
> It may however make sense for the on-disk data format to allow for
> the
> possibility, and for musl to just treat anything but "," as if it were
> "." for the foreseeable future.
> 
> > * Implement LC_MONETARY so that we can get properly localized
> > currency
> > representation.
> 
> This is fairly straightforward, but does need a reasonable data
> format
> that translates well into "struct localeconv" form. The localeconv
> fields that are strings need to be directly usable from the
> memory-mapped locale file, so that we don't need to allocate
> variable-sized storage for them, and one complication of this is that
> "grouping" and "mon_grouping" are *arch-specific* because the
> encoding
> uses CHAR_MAX with a special meaning, and CHAR_MAX could be 127 or
> 255. This means both versions of the string (one for
> signed-plain-char archs and one for unsigned-plain-char archs) need
> to be stored in the on-disk format.
> 
> > * Make sure that every function that accepts a locale actually uses
> > it
> > for the translation.
> > 
> > To be able to prepare for the technical work, there are some things
> > for
> > which we would like community input:
> > 
> > 1. We need to figure out an alternative representation for the
> > translatable strings derived from[1] to avoid the "May issue". A
> > simple
> > solution would be to use those constants (or an abbreviation of
> > them)
> > as keys for the lookup. Hopefully that would be both unambiguous
> > and
> > self-explanatory and as a bonus, it's already documented. Does
> > somebody
> > have other/better ideas?
> 
> See above for the options I'm aware of.
> 
> > 2. Regardless of the representation we choose, we need to decide on
> > a
> > workflow for translators. Currently, people can just copy the .pot 
> > file[2] with a hard-coded representation that might include other
> > things to translate. That seems good enough if we chose the
> > representation directly from [1], but might not be possible if we
> > decide on something different.
> 
> Long-term, the workflow should probably be deriving the data from the
> Unicode CLDR with possibility for overrides, with tooling to do that.
> I'm not sure if we want to prepare such tooling now.
> 
> At least for collation, I think we need some level of tooling now in
> order to be able to test/evaluate it. I'm presently trying to find
> the
> relevant tooling other systems use. ICU has something that converts
> the base weights table to a possibly-reasonable binary form but I
> haven't located the tooling to apply locale-specific modifications
> from CLDR data to the table.
> 
> > 3. Right now, other translatable strings coming from different
> > sources
> > (another email with a detailed analysis will follow up) are also
> > part
> > of the musl locales project. Those are also just translated
> > directly as
> > strings. However, some also appear in different contexts. Like "out
> > of
> > memory" on regex, on getting network address info. Should these be
> > split, receive a different representation, and thus provide
> > additional
> > context information to translators? I personally believe that most
> > high-level applications should hide these messages coming directly
> > from
> > libc, and thus they should only be rarely available to users, like
> > in
> > CLI applications, where users are generally expected to have a
> > basic
> > knowledge of English. I would be fine with leaving these strings
> > represented just by their own English string names, even if that
> > means
> > a bit of context is lost in some languages.
> 
> I think it's expected that they're translated (don't set LC_MESSAGES
> if you don't want that) but again the mechanism is open to change
> while we're making a major overhaul here.
> 
> Do we want the messages keyed by the English strings, as now, or do
> we
> want them keyed by identity of the error, whether that's the names of
> the errno E*/REG_*/EAI_*/etc. macros or some assigned integer codes
> as
> in the option for LC_TIME stuff above.

If in the end we're going to translate those (or lay the groundwork
for it), then from my personal translation experience, these are
probably best keyed in a context-dependent way. Different languages
have different ways of expressing context that English does not always
capture. So the E* proposal sounds a lot more sane to me than
English-based strings. On top of that, is there a chance of the
English strings ever changing? In the musl-locales project there are
currently some from musl 1.1.14 that no longer exist in master, so I'm
assuming the answer is "unlikely, but yes". I'm pretty sure that if we
take translations seriously from now on, we want neither to block musl
releases on translation updates, nor to break translations just
because of some typo fix, change, or clarification in the English
wording.
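
To make the context argument concrete, here is a minimal sketch in C
of keying messages by error identity instead of by English text. The
names msg_domain, msg_entry, demo_catalog and lookup_msg are
hypothetical, invented for this illustration and not taken from musl;
the bracketed strings are placeholders, not real translations. Only
ENOMEM, REG_ESPACE and EAI_MEMORY are real constants.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <netdb.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical message domains, one per error-reporting API. */
enum msg_domain { DOM_ERRNO, DOM_REGEX, DOM_EAI };

/* One catalog entry: the pair (dom, code) identifies the message. */
struct msg_entry {
    enum msg_domain dom;
    int code;
    const char *text;
};

/* Placeholder catalog; the strings stand in for real translations. */
static const struct msg_entry demo_catalog[] = {
    { DOM_ERRNO, ENOMEM,     "[errno: out of memory]" },
    { DOM_REGEX, REG_ESPACE, "[regex: out of memory]" },
    { DOM_EAI,   EAI_MEMORY, "[getaddrinfo: out of memory]" },
};

/* Return the catalog text for (dom, code), or the caller's fallback. */
const char *lookup_msg(enum msg_domain dom, int code, const char *fallback)
{
    size_t n = sizeof demo_catalog / sizeof demo_catalog[0];
    for (size_t i = 0; i < n; i++)
        if (demo_catalog[i].dom == dom && demo_catalog[i].code == code)
            return demo_catalog[i].text;
    return fallback;
}
```

With this kind of key, the three "out of memory" messages can be
worded differently per context, and rewording the English fallback
does not invalidate any existing translation.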

> > 4. Chose a default locale placement, so that we can get
> > translations
> > without needing to parse an envvar in [3]. In Alpine/pmOS the
> > location
> > is currently in /usr/share/i18n/locales/musl/ I do not think that's
> > a
> > great place, but the FHS does not seem to provide an obvious place
> > for
> > it to live, since AFAIU locales for the libc should not be mixed
> > with
> > LC_MESSAGES from other applications. Are there other suggestions?
> 
> My main concern, especially if we want them to be usable by suid
> binaries, is that they should be in a location we can rely on to
> belong to root. While I don't think they should be *stored* in /etc,
> reaching them via a path component (intended to be a symlink) in /etc
> is probably the best way to both ensure that and allow the actual
> files to be placed wherever distro policy wants them to be placed.

I think in general the musl ecosystem cares quite a bit about
standards. And whereas the FHS does not specify a perfect place for
this, expecting a symlink in /etc is certainly not spec-compliant,
since /etc is for "Host-specific configuration". It would seem
reasonable to me that in the case of a minimal recovery system where
/usr does not exist, translations are not available. If users have
specific requirements related to recovery, mounting /usr, and
translations, then they can place the locales under /etc and use
MUSL_LOCPATH. Forcing that on everybody seems like it would require
the majority to take action (either add a symlink in /etc or rely on
an environment variable, with less security) for the sake of some
niche use case, while going against a respected standard. Is there
something I'm missing?

> > 5. So far, although the musl-locales project exist, it has been
> > kept
> > apart from the main musl project, and not really sanctioned as
> > "official". It would be great, if we could have discussions related
> > to
> > musl-locales project directly in this mailing list. And if there
> > could
> > be a synchronized copy of it in https://git.musl-libc.org/cgit next
> > to
> > the musl repository. Is there somebody against this?
> 
> I think having the discussion on the main mailing list should be
> fine.

Great to know, for this and all the other ones below.

Best,
Pablo

> 
> > 6. Given that at postmarketOS good localization is something
> > critical,
> > I would be very happy if we could fork the current project, host
> > it  in
> > our gitlab, and use it as the place to synchronize with
> > https://git.musl-libc.org/cgit. If somebody would have other ideas,
> > or
> > moving it is considered disruptive, then it would be great if
> > somebody
> > from our team could also get access, so we can increase the
> > maintenance
> > it has seen lately.
> 
> I don't have a strong opinion on this yet, but I do agree that we
> should have it sync'd to the main musl git server, regardless of
> where
> actual devel takes place, so that it presents as "official".
> 
> > 7. If a locale is missing in musl, setlocale currently "fakes" that
> > support exist by copying the C data to the said locale. This has
> > the
> > benefit that apps which are translated in a locale missing in musl
> > still show up as translated for the application-related messages.
> > The
> > problem with this is that the UX is then inconsistent, since users
> > get
> > things mixed and matched in different languages. This is also
> > generally
> > a step against musl philosophy of being strictly correct. A
> > previous
> > discussion[4] had a pretty good proposal[5] that I fully support.
> > As I
> > said in the thread, as long as we have some time to adapt, the
> > behavior
> > change should be acceptable.
> 
> I need to review this but I recall the proposal being acceptable.
> 
> > 8. Finally, if you want to be involved in testing in a language for
> > which we don't yet have a volunteer signed-in in[6], feel free to
> > report yourself, we might have some small funding available, for
> > which
> > please send me a private email with the details specified in there.
> > 
> > We hope that at the end of this work, we have a setup for musl
> > locales
> > that is able to fit the needs of most users. If you believe there
> > is
> > something missing, please let us know.
> > 
> > This work is possible thanks to a grant from NLnet and the NGI Zero
> > Core Fund. Thank you for supporting us!
> > 
> > [1] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/langinfo.h.html
> > [2] https://git.adelielinux.org/adelie/musl-locales/-/blob/main/musl.pot?ref_type=heads
> > [3] https://git.musl-libc.org/cgit/musl/tree/src/locale/locale_map.c#n66
> > [4] https://www.openwall.com/lists/musl/2023/08/10/3
> > [5] https://gist.github.com/al45tair/15c3ade52b09d0cad67074176ad43e4a#proposed-behaviour
> > [6] https://gitlab.postmarketos.org/postmarketOS/postmarketos/-/issues/65



* Re: Planned locale work and community thoughts
  2025-06-18 23:14         ` Rich Felker
@ 2025-08-01  9:58           ` Pablo Correa Gomez
  2025-08-01 13:58             ` [musl] " Rich Felker
  0 siblings, 1 reply; 10+ messages in thread
From: Pablo Correa Gomez @ 2025-08-01  9:58 UTC (permalink / raw)
  To: Rich Felker, Thorsten Glaser; +Cc: musl

On Wed, 18-06-2025 at 19:14 -0400, Rich Felker wrote:
> On Thu, Jun 19, 2025 at 12:42:50AM +0200, Thorsten Glaser wrote:
> > On Wed, 18 Jun 2025, Rich Felker wrote:
> > 
> > > Theoretically it's possible the textual grep missed things if
> > > there is
> > > inconsistent json formatting anywhere, so if anyone familiar with
> > > jq
> > > wants to conduct a search using it instead to confirm, go ahead.
> > > I
> > 
> > My jq-foo is not very good, but I managed this:
> > 
> > tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq
> > 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")),
> > getpath($p).decimal]' | sed 's/">>/>>/' | grep -e '^  "[^.,]"' -e
> > '^  ".[^"]' | uniq
> >   "٫"
> > 
> > So yes, U+066B is the only other one, and no multi-char ones.
> > 
> > tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq
> > 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")),
> > getpath($p).decimal]' | sed 's/">>/>>/' | grep -B 1 -e '^  "[^.,]"'
> > -e '^  ".[^"]'
> > 
> > … shows all the occurrences, but a quick filter shows that we have
> > both symbols-numberSystem-arabext and symbols-numberSystem-arab but
> > assuming both are out of scope…
> > 
> > tg@x61p:/tmp/u/cldr-numbers-full/main $ cat */numbers.json | jq
> > 'paths(.decimal?|scalars) as $p | [">>" + ($p | join(".")),
> > getpath($p).decimal]' | sed 's/">>/>>/' | grep -B 1 -e '^  "[^.,]"'
> > -e '^  ".[^"]' | fgrep '>>' | fgrep -v -e '.symbols-numberSystem-
> > arabext"' -e '.symbols-numberSystem-arab"'
> >   >>main.bgn-AE.numbers.symbols-numberSystem-latn",
> >   >>main.bgn-AF.numbers.symbols-numberSystem-latn",
> >   >>main.bgn-IR.numbers.symbols-numberSystem-latn",
> >   >>main.bgn-OM.numbers.symbols-numberSystem-latn",
> >   >>main.bgn.numbers.symbols-numberSystem-latn",
> > 
> > … leaves us with this; bgn/numbers.json examplary:
> > 
> > {
> >   "main": {
> >     "bgn": {
> >       "numbers": {
> >         "symbols-numberSystem-arabext": {
> >           "decimal": "٫",
> >           "group": "٬",
> >           "list": "؛",
> > …
> >         },
> >         "symbols-numberSystem-latn": {
> >           "decimal": "٫",
> >           "group": "،",
> >           "list": ";",
> > …
> > 
> > So, if the bgn locales are ever going to be relevant…
> > unsure what that exactly is, but my acronyms database says…
> >  [ISO 639-3] Western Balochi (cf. bal)
> > … which seems to fit.
> 
> Thanks. My grepping seems to have overlooked that just because it was
> the same character that would normally only be used in an alt-digits
> context. I wonder if the above is intentional or a mistake and if any
> systems are actually doing that.

I've done some research on this topic to see if we could figure out a
bit more information. Unfortunately, online resources related to
Western Balochi are incredibly sparse:

* glibc: no support at all
https://github.com/bminor/glibc/tree/master/localedata/locales
* Windows: no support at all
https://support.microsoft.com/en-us/windows/language-packs-for-windows-a5094319-a92d-18de-5b53-1cfc697cfca8
* Android: no support at all
https://android.googlesource.com/platform/frameworks/base/+/android-16.0.0_r1/core/res/res/values/locale_config.xml
* Weblate: 3 projects seem to have translations, but all at 0%:
https://hosted.weblate.org/languages/bgn/
* iOS: no support in system languages
https://www.apple.com/ios/feature-availability/#system-language-system-language
or keyboard support
https://www.apple.com/ios/feature-availability/#quicktype-keyboard-language-support

In addition, it seems that this data was introduced into the CLDR 10
years ago in
https://github.com/unicode-org/cldr/commit/a4fe61ea1c1a01e3dfe2545d013ca3289640c81f
and has never changed since. I've also tried to research whether the
data in the CLDR could be an error. The survey for Western Balochi
unfortunately shows no votes:
https://st.unicode.org/cldr-apps/v#/bgn/Symbols/a1ef41eaeb6982d
compared to something like Spanish that has votes from Apple,
Microsoft, and Google:
https://st.unicode.org/cldr-apps/v#/es/Symbols/4ec3d1b99830ad07

I wonder if it's worth bringing this to the attention of the Unicode
Consortium to get some clarity on it, or if we should just consider
this a bug in the data for a language with a very limited digital
presence and move on with the assumption of just "." and ",".

Best,
Pablo Correa Gomez



* Re: [musl] Planned locale work and community thoughts
  2025-08-01  9:58           ` Pablo Correa Gomez
@ 2025-08-01 13:58             ` Rich Felker
  2025-08-01 14:24               ` Pablo Correa Gomez
  0 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2025-08-01 13:58 UTC (permalink / raw)
  To: Pablo Correa Gomez; +Cc: Thorsten Glaser, musl

On Fri, Aug 01, 2025 at 11:58:30AM +0200, Pablo Correa Gomez wrote:
> I've done some research on this topic to see if we could figure out a
> bit more information. Unfortunately, online resources related to
> Western Balochi are incredibly sparse:
> 
> * glibc: no support at all
> https://github.com/bminor/glibc/tree/master/localedata/locales
> * Windows: no support at all
> https://support.microsoft.com/en-us/windows/language-packs-for-windows-a5094319-a92d-18de-5b53-1cfc697cfca8
> * Android: no support at all
> https://android.googlesource.com/platform/frameworks/base/+/android-16.0.0_r1/core/res/res/values/locale_config.xml
> * Weblate: 3 projects seems to have translations, but on 0%
> translation: https://hosted.weblate.org/languages/bgn/
> * iOS: no support in system languages
> https://www.apple.com/ios/feature-availability/#system-language-system-language
> or keyboard support
> https://www.apple.com/ios/feature-availability/#quicktype-keyboard-language-support
> 
> In addition, it seems like that data in the CLDR was introduced 10
> years ago in
> https://github.com/unicode-org/cldr/commit/a4fe61ea1c1a01e3dfe2545d013ca3289640c81f
> and never changed since. I've also tried to do some research on whether
> the data in the CLDR could be an error. The survey for Western Balochi
> unfortunately shows no votes:
> https://st.unicode.org/cldr-apps/v#/bgn/Symbols/a1ef41eaeb6982d
> compared to something like Spanish that has votes from Apple,
> Microsoft, and Google:
> https://st.unicode.org/cldr-apps/v#/es/Symbols/4ec3d1b99830ad07
> 
> I wonder if it's worth it to bring this to the attention of the unicode
> consortium to get some clarity on it, or if we just consider this a bug
> from a language with very  limited digitalization and move on with the
> assumption of just "." and ",".

Thanks for the quick research!

My view is that unless there's an existing strong precedent for this
convention in digital interfaces, which you seem to have established
that there's not, we should not pursue supporting it.

I'm fine with leaving open the possibility in the data format (i.e.
not just encoding the value in the locale file as 1 bit) so that the
possibility isn't locked out, but I'm pretty strongly on the side of
either mapping anything but ',' to '.', or refusing to load locale
files where the field is neither '.' nor ',' as unsupported/malformed.
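
The two policies just mentioned could look roughly like this in C.
This is only a sketch: radix_lenient and radix_strict are hypothetical
helper names, and "field" is assumed to be the NUL-terminated
decimal-point string read from a locale file whose format has not been
defined yet.

```c
#include <stddef.h>
#include <string.h>

/* Lenient policy: anything other than "," is treated as ".". */
char radix_lenient(const char *field)
{
    return (field && strcmp(field, ",") == 0) ? ',' : '.';
}

/* Strict policy: accept only "." or ","; return -1 to signal that the
 * locale file should be rejected as unsupported/malformed. */
int radix_strict(const char *field, char *out)
{
    if (!field || (strcmp(field, ".") != 0 && strcmp(field, ",") != 0))
        return -1;
    *out = field[0];
    return 0;
}
```

Either way, the rest of libc only ever sees a single-byte '.' or ',',
while the file format itself stays free to carry a longer string such
as U+066B (bytes D9 AB in UTF-8).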

I just don't see any way to rationalize doing something that likely
has unforeseen security consequences for the sake of a generality that
no existing users expect (because there's no software that has set
that expectation).

Rich


* Re: Planned locale work and community thoughts
  2025-08-01 13:58             ` [musl] " Rich Felker
@ 2025-08-01 14:24               ` Pablo Correa Gomez
  0 siblings, 0 replies; 10+ messages in thread
From: Pablo Correa Gomez @ 2025-08-01 14:24 UTC (permalink / raw)
  To: Rich Felker; +Cc: Thorsten Glaser, musl

On Fri, 01-08-2025 at 09:58 -0400, Rich Felker wrote:
> On Fri, Aug 01, 2025 at 11:58:30AM +0200, Pablo Correa Gomez wrote:
> > I've done some research on this topic to see if we could figure out
> > a
> > bit more information. Unfortunately, online resources related to
> > Western Balochi are incredibly sparse:
> > 
> > * glibc: no support at all
> > https://github.com/bminor/glibc/tree/master/localedata/locales
> > * Windows: no support at all
> > https://support.microsoft.com/en-us/windows/language-packs-for-windows-a5094319-a92d-18de-5b53-1cfc697cfca8
> > * Android: no support at all
> > https://android.googlesource.com/platform/frameworks/base/+/android-16.0.0_r1/core/res/res/values/locale_config.xml
> > * Weblate: 3 projects seems to have translations, but on 0%
> > translation: https://hosted.weblate.org/languages/bgn/
> > * iOS: no support in system languages
> > https://www.apple.com/ios/feature-availability/#system-language-system-language
> > or keyboard support
> > https://www.apple.com/ios/feature-availability/#quicktype-keyboard-language-support
> > 
> > In addition, it seems like that data in the CLDR was introduced 10
> > years ago in
> > https://github.com/unicode-org/cldr/commit/a4fe61ea1c1a01e3dfe2545d013ca3289640c81f
> > and never changed since. I've also tried to do some research on
> > whether
> > the data in the CLDR could be an error. The survey for Western
> > Balochi
> > unfortunately shows no votes:
> > https://st.unicode.org/cldr-apps/v#/bgn/Symbols/a1ef41eaeb6982d
> > compared to something like Spanish that has votes from Apple,
> > Microsoft, and Google:
> > https://st.unicode.org/cldr-apps/v#/es/Symbols/4ec3d1b99830ad07
> > 
> > I wonder if it's worth it to bring this to the attention of the
> > unicode
> > consortium to get some clarity on it, or if we just consider this a
> > bug
> > from a language with very  limited digitalization and move on with
> > the
> > assumption of just "." and ",".
> 
> Thanks for the quick research!
> 
> My view is that unless there's an existing strong precedent for this
> convention in digital interfaces, which you seem to have established
> that there's not, we should not pursue supporting it.
> 
> I'm fine with leaving open the possibility in the data format (i.e.
> not just encoding the value in the locale file as 1 bit) so that the
> possibility isn't locked out, but I'm pretty strongly on the side of
> either mapping anything but ',' to '.', or refusing to load locale
> files where the field is neither '.' nor ',' as
> unsupported/malformed.

Seems like the best of both worlds :)

> 
> I just don't see any way to rationalize doing something that likely
> has unforeseen security consequences for the sake of a generality that
> no existing users expect (because there's no software that has set
> that expectation).
> 
> Rich



* Re: [musl] Planned locale work and community thoughts
  2025-06-18 19:28   ` Rich Felker
  2025-06-18 21:23     ` [musl] " Rich Felker
  2025-07-30 15:53     ` Pablo Correa Gomez
@ 2025-08-07 19:25     ` Rich Felker
  2 siblings, 0 replies; 10+ messages in thread
From: Rich Felker @ 2025-08-07 19:25 UTC (permalink / raw)
  To: Pablo Correa Gomez; +Cc: musl

On Wed, Jun 18, 2025 at 03:28:47PM -0400, Rich Felker wrote:
> On Mon, Jun 02, 2025 at 07:37:51PM +0200, Pablo Correa Gomez wrote:
> > Hi everybody,
> > 
> > I am Pablo Correa Gomez, a member of postmarketOS Core Contributors,
> > working on the collation and locale overhaul project
> > (https://www.openwall.com/lists/musl/2025/05/05/5)together with Rich.
> > 
> > We have now more details on the planned locale work that was earlier
> > announced. The current musl locales experience is sub-par compared to
> > other platforms, and we plan to use this project to fix that. 
> > 
> > The main and biggest issue that we aim to solve is the representation
> > format of the locale strings. The initial implementation used English
> > strings as keys to lookup for translations. This had a major issue
> > where May would represent both the abbreviated and non-abbreviated
> > forms of the month, making it untranslatable in languages where May has
> > more than 3 letters. However, there are other different issues that are
> > also aiming to solve in this project:
> 
> Main decision to be made here is how we key items that need
> localization, whether by fixing the string-based keying (e.g. using
> the macro names like "ABMON5" as the keys) with the gettext-type
> lookup we have now, or switching to assigned integer indices as the
> keying for a more catgets-like system (likely using the values from
> the macros in langinfo.h as the indices), or something else.

I've been reviewing the options here with the intent of making a
proposal, and my thinking so far is that neither of the above (catgets
approach or existing gettext approach) is very good.

While the integer-key approach of catgets solves the problem of
English strings being a really poor key for looking up the localized
value, the gratuitous vastness of the 32x32-bit keyspace necessitates
binary search, which is undesirably costly and pretty much entirely
defeats any runtime-efficiency argument for integer keying over
gettext-style string keying.

What I'm leaning towards proposing is a direct integer-indexed
multi-level table, analogous in form to the tables collation weight
lookup will use.
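
As a rough illustration of what "direct integer-indexed multi-level"
could mean, here is a minimal two-level sketch in C. Everything in it
is an assumption made for the example: the struct names, the
table_lookup function, the key layout (category in the upper 16 bits,
item index in the lower 16), and the placeholder strings. It is not a
proposed format.

```c
#include <stddef.h>
#include <stdint.h>

/* Second level: a dense array of strings for one category. */
struct leaf {
    size_t count;
    const char *const *strings;
};

/* First level: a dense array of leaves indexed by category number. */
struct root {
    size_t count;
    const struct leaf *leaves;
};

/* Assumed key layout (illustrative only): category in bits 16 and up,
 * item index in bits 0-15. */
const char *table_lookup(const struct root *r, uint32_t key)
{
    uint32_t cat = key >> 16, idx = key & 0xffff;
    if (cat >= r->count) return NULL;
    const struct leaf *l = &r->leaves[cat];
    if (idx >= l->count) return NULL;
    return l->strings[idx];   /* two array indexings, no search */
}

/* Tiny demo table with placeholder strings. */
static const char *const demo_cat0[] = { "[cat0 item0]", "[cat0 item1]" };
static const char *const demo_cat1[] = { "[cat1 item0]" };
static const struct leaf demo_leaves[] = {
    { 2, demo_cat0 },
    { 1, demo_cat1 },
};
const struct root demo_root = { 2, demo_leaves };
```

The lookup cost is two bounds checks and two array reads, independent
of how many strings the locale defines. A memory-mapped file would
presumably store offsets rather than pointers, but the indexing would
have the same shape.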

For reference, the currently needed lookups are:

1. nl_langinfo keys (this also covers all date/time functionality)
2. strerror (errno.h codes)
3. gai_strerror (netdb.h EAI_* codes)
4. hstrerror (legacy getXbyY resolver API error codes)
5. regerror (regex.h REG_* error codes)

And the added ones we will have are:

6. struct localeconv contents (for LC_NUMERIC/LC_MONETARY)
7. collation weight table roots (for LC_COLLATE)

Of these, items 1 and 3-5 already have arch-independent sequential
keys that are public constants (item 1 has them as 2-level, which is
fine). Item 2 (strerror) does require index remapping on archs with
their own numbering. Item 6 (localeconv) does not have lookups
addressable by an application, just stuffing data into a struct at
locale load-time, so there are no real constraints to set here. I can
just propose a simple format for the data to be loaded from. And item
7 is its own thing already covered by a multi-level table.
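
For the strerror case, the index remapping could be as small as a
per-arch byte table. The sketch below is hypothetical: errno_remap and
generic_errno_index are invented names, and the numbering in demo_map
belongs to a fictional arch, not to any real errno.h.

```c
#include <stddef.h>

/* Hypothetical per-arch remap table: entry i holds the generic index
 * for arch-specific errno value i, with 0 meaning "no mapping". */
struct errno_remap {
    size_t count;
    const unsigned char *to_generic;
};

/* Translate an arch errno value into the arch-independent index used
 * to address the message table; 0 selects a catch-all message. */
unsigned generic_errno_index(const struct errno_remap *m, int e)
{
    if (e < 0 || (size_t)e >= m->count) return 0;
    return m->to_generic[e];
}

/* Made-up numbering for a fictional arch where codes 3 and 4 are
 * swapped relative to the generic numbering. */
static const unsigned char demo_map[] = { 0, 1, 2, 4, 3 };
const struct errno_remap demo_remap = { sizeof demo_map, demo_map };
```

Archs that use the generic numbering would skip the table entirely and
use the errno value as the index directly.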

All of the indexing for 1-5 (for 2, via the arch/generic version of
errno.h) comes from constants that are already public ABI and thus
stable.

Anything in 6-7 is up to us to define in a way that's stable and
future-proof/extensible.

I'll follow up with more of this fleshed out.



Thread overview: 10+ messages
     [not found] <ecdf29e7b16f50fbb734870f5c34abaa61d946cd.camel@postmarketos.org>
2025-06-02 17:37 ` Planned locale work and community thoughts Pablo Correa Gomez
2025-06-18 19:28   ` Rich Felker
2025-06-18 21:23     ` [musl] " Rich Felker
2025-06-18 22:42       ` Thorsten Glaser
2025-06-18 23:14         ` Rich Felker
2025-08-01  9:58           ` Pablo Correa Gomez
2025-08-01 13:58             ` [musl] " Rich Felker
2025-08-01 14:24               ` Pablo Correa Gomez
2025-07-30 15:53     ` Pablo Correa Gomez
2025-08-07 19:25     ` [musl] " Rich Felker
