Locale bikeshed time

mailing list of musl libc

* Locale bikeshed time
@ 2014-07-22 18:49 Rich Felker
  2014-07-22 20:10 ` u-igbb
  2014-07-22 20:17 ` Laurent Bercot
  0 siblings, 2 replies; 43+ messages in thread
From: Rich Felker @ 2014-07-22 18:49 UTC (permalink / raw)
  To: musl

I've got the next phase of the locale work pretty much ready to
commit, but since it needs some policy for how to load locales, I want
to continue the discussion first rather than having commits that
change the behavior back and forth as we discuss this.

Overall, my plan at this point is to disallow any absolute/relative
pathnames in the LC_* vars and restrict them purely to locale names,
and have the path in a separate variable outside the scope of the
standard. This is basically how glibc does it, and the idea is that
you can allow locale names from an untrusted source (e.g. for suid,
for remote apps acting on behalf of a user such as web apps or
gitolite, or for apps that process mixed-locale data with uselocale
and have locale names in their data) as long as the locale path does
not contain malicious locales.
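
A minimal sketch of the kind of check this policy implies, assuming a hypothetical helper name (is_plain_locale_name) and an arbitrary length limit; it is not taken from musl's source. The idea is only that a value containing a path separator, or starting with a dot, is never treated as a locale name:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: accept a value only if it is a bare locale name,
 * i.e. non-empty, reasonably short, and not usable as a pathname.
 * The 255-byte limit is an arbitrary choice for this sketch. */
static bool is_plain_locale_name(const char *name)
{
    if (!name || !*name) return false;
    if (strlen(name) > 255) return false;
    /* A '/' anywhere would make the value an absolute or relative path. */
    if (strchr(name, '/')) return false;
    /* "." and ".." contain no slash but still name directories. */
    if (name[0] == '.') return false;
    return true;
}
```

With a check like this, a name such as "en_US.UTF-8" passes while "/etc/passwd" and "../evil" are rejected before any file lookup happens.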

So, the first bikeshed decision to be made is what environment
variable to use for the locale path, and what the fallback should be if
it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to
use the same var (since apps are already aware of the need to treat it
specially), but on the other it's undesirable to have them tied
together (e.g. if you're using musl as a non-root installation and
can't write to /usr/lib) and to avoid clashing with glibc's files we
would need to choose a subdirectory under $LOCPATH rather than using
it directly. All of these aspects make it a lot less attractive.

The second issue is how locale categories are split up. Glibc has each
category in a separate file, except for the "locale-archive" file
which stores everything in one file for easy mapping. My leaning so
far is to put the whole locale -- time format and translations,
message translations, ... in a single file. This avoids the need for
multiple mappings (and syscall overhead, and vma overhead, ...) if
you're using the same value for all categories. But on the other hand,
if you wanted to have lots of subtle variants of a locale, you might
end up with largely-duplicate files on disk. Fortunately I think
they'll all be very small anyway so this may not matter.

Of course making this work is contingent on finding a good way to
encode LC_MONETARY and LC_COLLATE data in a .mo file, since if the
whole locale is unified into one file, it would be a .mo file. My
leaning is to simply use "int_cur_symbol", etc. as gettext keys for
the string fields of LC_MONETARY and then put all the numeric fields
of lconv into a single string that could be parsed with scanf or a
tiny integer parser in localeconv() on the first usage. While not the
most efficient, it avoids needing nasty special tools to generate
locale files; a po-to-mo converter is all you need. For LC_COLLATE,
obviously one solution would be to have keys for each collation
element and use gettext to convert collation elements to the symbols
strxfrm is supposed to output. I'm not sure if the efficiency of this
method is tolerable however. We could go with it for now and later add
something more advanced if needed (e.g. mapping to a DFA represented
as a byte array that does the conversions).
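
As an illustration of the "tiny integer parser" idea for the numeric lconv fields, here is a sketch under stated assumptions: the function name (parse_lconv_numbers), the space-separated layout, and the use of "-1" to mean "not available" are all made up for this example. Only the convention that CHAR_MAX marks an unavailable field comes from the C standard.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical parser: read up to max space-separated decimal integers
 * from s into out and return how many were stored.
 * Assumed encoding: any negative number means "not available", which
 * struct lconv represents as CHAR_MAX. Large values are clamped. */
static size_t parse_lconv_numbers(const char *s, char *out, size_t max)
{
    size_t n = 0;
    while (n < max) {
        while (*s == ' ') s++;
        if (!*s) break;
        int neg = 0;
        if (*s == '-') { neg = 1; s++; }
        if (*s < '0' || *s > '9') break;   /* stop at malformed input */
        int v = 0;
        while (*s >= '0' && *s <= '9') {
            v = v * 10 + (*s - '0');
            if (v > CHAR_MAX) v = CHAR_MAX; /* clamp, also avoids overflow */
            s++;
        }
        out[n++] = neg ? CHAR_MAX : (char)v;
    }
    return n;
}
```

A translated string such as "2 2 1 -1" would then fill four consecutive char fields on the first call to localeconv(), with the last one left as CHAR_MAX.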

I probably have some more issues to discuss with this too but I'll
just go ahead and send now to get discussion started, and hopefully
get back to adding some more code first.

Rich


* Re: Locale bikeshed time
  2014-07-22 18:49 Locale bikeshed time Rich Felker
@ 2014-07-22 20:10 ` u-igbb
  2014-07-22 20:35   ` Rich Felker
  2014-07-22 20:17 ` Laurent Bercot
  1 sibling, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-22 20:10 UTC (permalink / raw)
  To: musl

On Tue, Jul 22, 2014 at 02:49:32PM -0400, Rich Felker wrote:
> Overall, my plan at this point is to disallow any absolute/relative
> pathnames in the LC_* vars and restrict them purely to locale names,
> and have the path in a separate variable outside the scope of the
> standard.

+1

> So, the first bikeshed decision to be made is what environment
> variable to use for the locale path, and what fallback should be if
> it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to
> use the same var (since apps are already aware of the need to treat it
> specially), but on the other it's undesirable to have them tied
> together (e.g. if you're using musl as a non-root installation and
> can't write to /usr/lib) and to avoid clashing with glibc's files we

This issue is not crucial for my usage pattern, here it is easy to assign
values of this kind per binary, not per process tree (in contrast to
the locale names which I want to be settable by the user and inheritable
regardless of which library can happen to interpret them).

Speaking more generally, using the same variable as glibc would introduce
a substantial risk of confusion, making the semantics of the variable
context-dependent (i.e. depending on which library a certain binary is
linked to).

This confusion is kind of hidden in monolithic distros where all binaries
are expected to have been built by tightly cooperating parties using the
same libraries - but the general case includes using binaries built
with different premises.

A musl-specific variable name would be a better/cleaner choice.

> would need to choose a subdirectory under $LOCPATH rather than using
> it directly. All of these aspects make it a lot less attractive.

+1

> The second issue is how locale categories are split up. Glibc has each
> category in a separate file, except for the "locale-archive" file
> which stores everything in one file for easy mapping. My leaning so

By the way, please do not follow the way of a single big file.
For systems which rely on file boundaries to reflect data clustering
(i.e. which data is most probable to be used together) it is very useful
to let the files correspond to the data structure. Otherwise some cheap
and efficient distributed data access optimizations become impossible.

Coda file system uses a file as a transmission and caching unit - which is
quite efficient because a file very often corresponds to an "information
unit" which is needed as a whole. Glibc's locale archive enforces a big
wasteful transfer and a large cache footprint for very little actual use.

> far is to put the whole locale -- time format and translations,
> message translations, ... in a single file. This avoids the need for
> multiple mappings (and syscall overhead, and vma overhead, ...) if
> you're using the same value for all categories. But on the other hand,
> if you wanted to have lots of subtle variants of a locale, you might
> end up with largely-duplicate files on disk. Fortunately I think
> they'll all be very small anyway so this may not matter.

I actually do mix categories from different locales.
No problem as long as the files are small.

Rune


* Re: Locale bikeshed time
  2014-07-22 18:49 Locale bikeshed time Rich Felker
  2014-07-22 20:10 ` u-igbb
@ 2014-07-22 20:17 ` Laurent Bercot
  2014-07-22 20:36   ` Rich Felker
  1 sibling, 1 reply; 43+ messages in thread
From: Laurent Bercot @ 2014-07-22 20:17 UTC (permalink / raw)
  To: musl


> So, the first bikeshed decision to be made is what environment
> variable to use for the locale path, and what fallback should be if
> it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to
> use the same var (since apps are already aware of the need to treat it
> specially), but on the other it's undesirable to have them tied
> together (e.g. if you're using musl as a non-root installation and
> can't write to /usr/lib) and to avoid clashing with glibc's files we
> would need to choose a subdirectory under $LOCPATH rather than using
> it directly. All of these aspects make it a lot less attractive.

  Well, my suggestion is going to be ugly, but it's the most flexible
one: have two variables. LOCPATH for compatibility, and something
like MUSL_LOCPATH for admins who fear conflicts. Try $MUSL_LOCPATH
first, then fall back to $LOCPATH if MUSL_LOCPATH is not set, then
fall back to a hard-coded directory.
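
The lookup order suggested here could be sketched roughly as follows. This is only an illustration of the proposal, not anything musl does: the function name (pick_locale_path) and the fallback directory are placeholders, and treating an empty variable as unset is an assumption of the sketch.

```c
#include <stdlib.h>

/* Placeholder only; the thread does not settle on a default directory. */
#define FALLBACK_LOCPATH "/path/to/musl/locale"

/* Hypothetical helper implementing the proposed order:
 * MUSL_LOCPATH first, then LOCPATH, then a compiled-in directory.
 * An empty value is treated like an unset one (assumption). */
static const char *pick_locale_path(void)
{
    const char *p = getenv("MUSL_LOCPATH");
    if (p && *p) return p;
    p = getenv("LOCPATH");
    if (p && *p) return p;
    return FALLBACK_LOCPATH;
}
```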

  I'm not experienced enough with locales (nor have enough interest in
them) to talk about the other points.

-- 
  Laurent



* Re: Locale bikeshed time
  2014-07-22 20:10 ` u-igbb
@ 2014-07-22 20:35   ` Rich Felker
  2014-07-23  9:50     ` u-igbb
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-22 20:35 UTC (permalink / raw)
  To: musl

On Tue, Jul 22, 2014 at 10:10:08PM +0200, u-igbb@aetey.se wrote:
> A musl-specific variable name would be a better/cleaner choice.

One question is whether this is really musl-specific or specific to a
locale scheme that could be used outside of musl too. However, either
way it's probably appropriate for the variable to be musl-specific.
Having one variable configure multiple things is usually error-prone
and inflexible.

> > The second issue is how locale categories are split up. Glibc has each
> > category in a separate file, except for the "locale-archive" file
> > which stores everything in one file for easy mapping. My leaning so
> 
> By the way, please do not follow the way of a single big file.
> For systems which rely on file boundaries to reflect data clustering
> (i.e. which data is most probable to be used together) it is very useful
> to let the files correspond to the data structure. Otherwise some cheap
> and efficient distributed data access optimizations become impossible.

I hadn't even considered this aspect, but I think the whole concept of
a single big file is undesirable with data that's naturally subject to
change over time, and where the data comes from multiple sources. So I
wasn't really considering that option anyway.

> > far is to put the whole locale -- time format and translations,
> > message translations, ... in a single file. This avoids the need for
> > multiple mappings (and syscall overhead, and vma overhead, ...) if
> > you're using the same value for all categories. But on the other hand,
> > if you wanted to have lots of subtle variants of a locale, you might
> > end up with largely-duplicate files on disk. Fortunately I think
> > they'll all be very small anyway so this may not matter.
> 
> I actually do mix categories from different locales.
> No problem as long as the files are small.

Note that if you're just mixing "ll_TT" and "C", there wouldn't be any
cost anyway since the C locale (and its aliases) are builtin and never
loaded from a file. Where I was thinking you might see duplication is
for things like: LC_ALL=ll_TT@modifier where modifier is really just
an alternate for one category (e.g. ISO date format for time, alt
collation order, etc.), but the file ends up storing duplicates of all
the data from other categories. However, I think the alternate
preferred usage here would be to provide a file for just the category
being overridden that does not contain the base data and require users
to set the individual categories, like what you're doing, e.g.

LANG=ll_TT LC_TIME=ll_TT@isodate

rather than:

LC_ALL=ll_TT@isodate
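
For reference, the difference between the two settings follows from the usual POSIX precedence: LC_ALL overrides everything, then the per-category variable, then LANG. A small sketch of that resolution, with a hypothetical function name (effective_locale) and "C" as the final default:

```c
#include <stdlib.h>

/* Hypothetical helper: return the locale name in effect for one
 * category variable (e.g. "LC_TIME"), following POSIX precedence.
 * Empty values count as unset. */
static const char *effective_locale(const char *category_var)
{
    const char *v = getenv("LC_ALL");
    if (v && *v) return v;
    v = getenv(category_var);
    if (v && *v) return v;
    v = getenv("LANG");
    if (v && *v) return v;
    return "C";
}
```

With LANG=ll_TT LC_TIME=ll_TT@isodate, only LC_TIME resolves to the override file and every other category resolves to ll_TT; with LC_ALL=ll_TT@isodate, every category resolves to the override.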

Rich


* Re: Locale bikeshed time
  2014-07-22 20:17 ` Laurent Bercot
@ 2014-07-22 20:36   ` Rich Felker
  2014-07-23 22:03     ` Laurent Bercot
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-22 20:36 UTC (permalink / raw)
  To: musl

On Tue, Jul 22, 2014 at 09:17:50PM +0100, Laurent Bercot wrote:
> 
> >So, the first bikeshed decision to be made is what environment
> >variable to use for the locale path, and what fallback should be if
> >it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to
> >use the same var (since apps are already aware of the need to treat it
> >specially), but on the other it's undesirable to have them tied
> >together (e.g. if you're using musl as a non-root installation and
> >can't write to /usr/lib) and to avoid clashing with glibc's files we
> >would need to choose a subdirectory under $LOCPATH rather than using
> >it directly. All of these aspects make it a lot less attractive.
> 
>  Well, my suggestion is going to be ugly, but it's the most flexible
> one: have two variables. LOCPATH for compatibility, and something
> like MUSL_LOCPATH for admins who fear conflicts. Try $MUSL_LOCPATH
> first, then fallback to $LOCPATH if MUSL_LOCPATH is not set, then
> fallback to a hard-coded directory.

Do you have in mind a usage case this would be beneficial for?

Rich



* Re: Locale bikeshed time
  2014-07-22 20:35   ` Rich Felker
@ 2014-07-23  9:50     ` u-igbb
  2014-07-23 16:39       ` Rich Felker
  0 siblings, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-23  9:50 UTC (permalink / raw)
  To: musl

On Tue, Jul 22, 2014 at 04:35:40PM -0400, Rich Felker wrote:
> Having one variable configure multiple things is usually error-prone
> and inflexible.

+1

> > By the way, please do not follow the way of a single big file.
> > For systems which rely on file boundaries to reflect data clustering

> I hadn't even considered this aspect, but I think the whole concept of
> a single big file is undesirable with data that's naturally subject to
> change over time, and where the data comes from multiple sources. So I
> wasn't really considering that option anyway.

Nice.

> > I actually do mix categories from different locales.
> > No problem as long as the files are small.
> 
> Note that if you're just mixing "ll_TT" and "C", there wouldn't be any
> cost anyway since the C locale (and its aliases) are builtin and never
> loaded from a file. Where I was thinking you might see duplication is

Sure. This covers certainly most of my preferences but I thought of
LANG=l1_T1 and LC_SOMETHING=l2_T2 [and LC_SOMETHINGELSE=l3_T3].
This would result in pulling in two or three locale data files but the
overhead is presumably negligible.

> for things like: LC_ALL=ll_TT@modifier where modifier is really just
> an alternate for one category (e.g. ISO date format for time, alt
> collation order, etc.), but the file ends up storing duplicates of all
> the data from other categories. However, I think the alternate
> preferred usage here would be to provide a file for just the category
> being overridden that does not contain the base data and require users
> to set the individual categories, like what you're doing, e.g.

> LANG=ll_TT LC_TIME=ll_TT@isodate

This means that most of the time there will be a single locale file to be
opened, sometimes more, in extreme cases up to the number of categories,
the files also being of different "completeness". This would certainly
contribute to confusion for both the administrators and the users.

For the sake of uniformity I would possibly prefer to see only the
"thinner" files defining exactly one category, instead of different
files having different numbers of included categories.

But most of all I'd support your approach of including all information in
each file. This is "least confusing" and quite efficient. The overhead
is mostly static storage (not noticeable in our setup and probably not
much anyway :) and the run time overhead affects just the minority of
users who mix locales/categories. (Oh btw as a nice bonus this makes
the file boundaries correspond to the data usage patterns).

To summarize my view,

- a file per locale, with all categories included  best
- a file per category                              acceptable
- files with differing data subsets                please don't

> rather than:
> 
> LC_ALL=ll_TT@isodate

In a real scenario it would be probably
 LANG=ll_TT@isodate
and this feels OK.

> Rich

Regards,
Rune


* Re: Locale bikeshed time
  2014-07-23  9:50     ` u-igbb
@ 2014-07-23 16:39       ` Rich Felker
  2014-07-23 19:25         ` u-igbb
  2014-07-23 23:22         ` writeonce
  0 siblings, 2 replies; 43+ messages in thread
From: Rich Felker @ 2014-07-23 16:39 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 11:50:31AM +0200, u-igbb@aetey.se wrote:
> > > I actually do mix categories from different locales.
> > > No problem as long as the files are small.
> > 
> > Note that if you're just mixing "ll_TT" and "C", there wouldn't be any
> > cost anyway since the C locale (and its aliases) are builtin and never
> > loaded from a file. Where I was thinking you might see duplication is
> 
> Sure. This covers certainly most of my preferences but I thought of
> LANG=l1_T1 and LC_SOMETHING=l2_T2 [and LC_SOMETHINGELSE=l3_T3].
> This would result in pulling in two or three locale data files but the
> overhead is presumably negligible.

It's two or three sets of syscalls -- open (one per path component
tried until it succeeds), fstat, mmap, close -- rather than one set.
And an extra vma (resulting from the mmap) for each used. But the
choice isn't whether to have this overhead or not, unless you want to
consider the glibc locale-archive ugliness. The choice is just whether
to optimize the case where the categories are all the same (only
having one set of syscalls in that case) or mostly the same, or not to
optimize it and always have multiple sets of syscalls. I believe the
latter is strictly worse.
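
To make the per-file cost concrete, one "set of syscalls" looks roughly like the sketch below. The function name (map_file) is hypothetical and this is not musl's loader; it just shows the open, fstat, mmap, close sequence and the single mapping (vma) that results.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only and return its address,
 * storing the length in *size. Returns NULL on any failure.
 * Each successful call costs open + fstat + mmap + close and leaves
 * one mapping behind. */
static const void *map_file(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    const void *map = NULL;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            map = p;
            *size = (size_t)st.st_size;
        }
    }
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return map;
}
```

If all categories name the same file, this runs once; with three different files it runs three times, plus any failed open() attempts along the search path.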

> > for things like: LC_ALL=ll_TT@modifier where modifier is really just
> > an alternate for one category (e.g. ISO date format for time, alt
> > collation order, etc.), but the file ends up storing duplicates of all
> > the data from other categories. However, I think the alternate
> > preferred usage here would be to provide a file for just the category
> > being overridden that does not contain the base data and require users
> > to set the individual categories, like what you're doing, e.g.
> 
> > LANG=ll_TT LC_TIME=ll_TT@isodate
> 
> This means that most of the time there will be a single locale file to be
> opened, sometimes more, in extreme cases up to the number of categories,
> the files also being of different "completeness". This would certainly
> contribute to confusion for both the administrators and the users.

Hmm. I see how it would be confusing and maybe it's best to discourage
this use (incomplete .mo files). But it's purely a usage issue,
outside of musl's control, unless we wanted to impose a check that any
locale file have data for all the categories (and I think such a rule
would be bad since it precludes having locale files that are unrelated
to languages, e.g. a generic "UCA" collation locale with the default
UCA data).

> For the sake of uniformity I would possibly prefer to see only the
> "thinner" files defining exactly one category, instead of different
> files having different numbers of included categories.

Yes that sounds like a good policy. Really, policy matters like this
(i.e. ones that don't affect libc implementation) should be worked out
when it comes time to actually make some locales and find a maintainer
for a musl-locale repo/package.

On that topic, while this is a matter outside my control for
individual users, my preference would be that the official musl-locale
data attempt to avoid multiple variants/modifiers and legacy options
if possible. For example I would like to see the numeric date format
be ISO format in all locales, with traditional formats only where the
natural-language string representations for months/days are included
(and I say this as someone coming from one of the locales, i.e. US,
where the traditional numeric date format is non-ISO). In keeping with
the principle that musl is "modern" I'd like to prefer modern cultural
conventions to historical ones.

> But most of all I'd support your approach of including all information in
> each file. This is "least confusing" and quite efficient. The overhead
> is mostly static storage (not noticeable in our setup and probably not
> much anyway :) and the run time overhead affects just the minority of
> users who mix locales/categories. (Oh btw as a nice bonus this makes
> the file boundaries correspond to the data usage patterns).
> 
> To summarize my view,
> 
> - a file per locale, with all categories included  best
> - a file per category                              acceptable
> - files with differing data subsets                please don't

Yes I think this makes sense. My leaning would be to use complete
files for language-based locales, and file-per-category for individual
category locales that are not associated with any particular language
(and where, thereby, there's no assumption that they should provide
any behavior to other categories).

Rich


* Re: Locale bikeshed time
  2014-07-23 16:39       ` Rich Felker
@ 2014-07-23 19:25         ` u-igbb
  2014-07-23 21:01           ` Rich Felker
  2014-07-23 23:22         ` writeonce
  1 sibling, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-23 19:25 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 12:39:07PM -0400, Rich Felker wrote:
> My leaning would be to use complete
> files for language-based locales, and file-per-category for individual
> category locales that are not associated with any particular language
> (and where, thereby, there's no assumption that they should provide
> any behavior to other categories).

This feels appropriate - if the definitions indeed fall into distinctive
classes like "full" / "single-category" and also if the naming reflects
the distinction (keeping objects with different properties in the same
name space is otherwise harmful, among others harmful for ease of
understanding by the prospective users and administrators).

Thanks.

Rune


* Re: Locale bikeshed time
  2014-07-23 19:25         ` u-igbb
@ 2014-07-23 21:01           ` Rich Felker
  2014-07-24 15:35             ` u-igbb
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-23 21:01 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 09:25:03PM +0200, u-igbb@aetey.se wrote:
> On Wed, Jul 23, 2014 at 12:39:07PM -0400, Rich Felker wrote:
> > My leaning would be to use complete
> > files for language-based locales, and file-per-category for individual
> > category locales that are not associated with any particular language
> > (and where, thereby, there's no assumption that they should provide
> > any behavior to other categories).
> 
> This feels appropriate - if the definitions indeed fall into distinctive
> classes like "full" / "single-category" and also if the naming reflects
> the distinction (keeping objects with different properties in the same
> name space is otherwise harmful, among others harmful for ease of
> understanding by the prospective users and administrators).

IMO language-based locales should be ll, lll, ll_TT, or lll_TT form
where ll or lll is lowercase ISO language code and TT is uppercase
territory code. Non-language-based locale files should avoid these
patterns.
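
A sketch of how that naming rule could be checked, purely as an illustration: the function name (looks_language_based) is made up, a two-letter territory code is assumed from the "TT" notation, and codeset or @modifier suffixes are deliberately not handled.

```c
#include <stdbool.h>

static bool is_lower(char c) { return c >= 'a' && c <= 'z'; }
static bool is_upper(char c) { return c >= 'A' && c <= 'Z'; }

/* Hypothetical check for the forms ll, lll, ll_TT and lll_TT:
 * two or three lowercase letters, optionally followed by '_' and
 * exactly two uppercase letters. */
static bool looks_language_based(const char *name)
{
    int n = 0;
    while (is_lower(name[n])) n++;
    if (n != 2 && n != 3) return false;
    if (name[n] == '\0') return true;
    /* Short-circuit evaluation stops at the first mismatch, so this
     * never reads past the terminating NUL. */
    return name[n] == '_' && is_upper(name[n + 1]) &&
           is_upper(name[n + 2]) && name[n + 3] == '\0';
}
```

Under this rule a collation-only file named "UCA" is not mistaken for a language-based locale, since it does not start with lowercase letters.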

Rich



* Re: Locale bikeshed time
  2014-07-22 20:36   ` Rich Felker
@ 2014-07-23 22:03     ` Laurent Bercot
  2014-07-23 22:12       ` Rich Felker
  0 siblings, 1 reply; 43+ messages in thread
From: Laurent Bercot @ 2014-07-23 22:03 UTC (permalink / raw)
  To: musl

On 22/07/2014 21:36, Rich Felker wrote:
> Do you have in mind a usage case this would be beneficial for?

  Not particularly - again, I'm far from a locale expert. But you
seem reluctant to reuse LOCPATH for mixed musl+glibc installations,
and also reluctant to break an established convention, so having
both would cater to all use cases - LOCPATH for the principle of
least surprise and general use, and MUSL_LOCPATH for people who
know what they are doing and need a different variable.

-- 
  Laurent


* Re: Locale bikeshed time
  2014-07-23 22:03     ` Laurent Bercot
@ 2014-07-23 22:12       ` Rich Felker
  2014-07-24 15:38         ` u-igbb
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-23 22:12 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 11:03:00PM +0100, Laurent Bercot wrote:
> On 22/07/2014 21:36, Rich Felker wrote:
> >Do you have in mind a usage case this would be beneficial for?
> 
>  Not particularly - again, I'm far from a locale expert. But you
> seem reluctant to reuse LOCPATH for mixed musl+glibc installations,
> and also reluctant to break an established convention, so having

I was not reluctant to break established convention here, but
reluctant to add new variables systems integrators and administrators
need to be aware of (for the sake of preserving them or filtering
them). But I think the alternative is worse.

> both would cater to all use cases - LOCPATH for the principle of
> least surprise and general use, and MUSL_LOCPATH for people who
> know what they are doing and need a different variable.

I'm not seeing a way that setting LOCPATH without being aware that the
app you're trying to affect is using musl could be helpful. The
locales you're trying to make visible to the app need to be in the
format used by musl, not the glibc format, so you have to already be
aware of this. Maybe there's something I'm not seeing -- this is why I
asked -- but if there's no reason for it, I think searching both "just
because" is bad.

Rich


* Re: Locale bikeshed time
  2014-07-23 16:39       ` Rich Felker
  2014-07-23 19:25         ` u-igbb
@ 2014-07-23 23:22         ` writeonce
  2014-07-23 23:38           ` Rich Felker
  1 sibling, 1 reply; 43+ messages in thread
From: writeonce @ 2014-07-23 23:22 UTC (permalink / raw)
  To: musl

On 07/23/2014 12:39 PM, Rich Felker wrote:
> On that topic, while this is a matter outside my control for 
> individual users, my preference would be that the official musl-locale 
> data attempt to avoid multiple variants/modifiers and legacy options 
> if possible. For example I would like to see the numeric date format 
> be ISO format in all locales, with traditional formats only where the 
> natural-language string representations for months/days are included 
> (and I say this as someone coming from one of the locales, i.e. US, 
> where the traditional numeric date format is non-ISO). In keeping with 
> the principle that musl is "modern" I'd like to prefer modern cultural 
> conventions to historical ones.
For what it's worth, I wanted to point out that ISO C explicitly 
pertains to the Gregorian calendar only, albeit in parentheses (N1570, 
7.27.1).  For users of [listed in alphabetical order:] Arabic, Chinese, 
Hebrew, Japanese, Persian, and Tibetan, for instance, there are two 
different issues at stake: the first is the representation of the date 
according to the Gregorian calendar in one's own language, which could 
(~easily~) be made "modern" (ISO compliant), whereas the second is the 
representation of the date according to the culture's native calendar in 
the language matching the current locale.

While I'm not necessarily suggesting that musl (or any other libc, for 
that matter) should implement the conversion functions from the 
Gregorian calendar to other calendars and vice versa, it would be nice 
if at least the prototypes of the conversion functions were somehow 
standardized, and also if the locale files likewise accounted for the 
above issues (e.g. in the form of placeholders).

PS. speaking of historical vs. modern and LC_MONETARY, we should 
probably bear in mind the many locale variants that are based on 
currency only, as for example in the case of EU member countries before 
and after the Euro.

zg




* Re: Locale bikeshed time
  2014-07-23 23:22         ` writeonce
@ 2014-07-23 23:38           ` Rich Felker
  2014-07-24  1:07             ` writeonce
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-23 23:38 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 07:22:49PM -0400, writeonce@midipix.org wrote:
> On 07/23/2014 12:39 PM, Rich Felker wrote:
> >On that topic, while this is a matter outside my control for
> >individual users, my preference would be that the official
> >musl-locale data attempt to avoid multiple variants/modifiers and
> >legacy options if possible. For example I would like to see the
> >numeric date format be ISO format in all locales, with traditional
> >formats only where the natural-language string representations for
> >months/days are included (and I say this as someone coming from
> >one of the locales, i.e. US, where the traditional numeric date
> >format is non-ISO). In keeping with the principle that musl is
> >"modern" I'd like to prefer modern cultural conventions to
> >historical ones.
> For what it's worth, I wanted to point out that ISO C explicitly
> pertains to the Gregorian calendar only, albeit in parentheses
> (N1570, 7.27.1).  For users of [listed in alphabetical order:]
> Arabic, Chinese, Hebrew, Japanese, Persian, and Tibetan, for
> instance, there are two different issues at stake: the first is the
> representation of the date according to the Gregorian calendar in
> one's own language, which could (~easily~) be made "modern" (ISO
> compliant), whereas the second is the representation of the date
> according to the culture's native calendar in the language matching
> the current locale.
> 
> While I'm not necessarily suggesting that musl (or any other libc, for
> that matter) should implement the conversion functions from the
> Gregorian calendar to other calendars and vice versa, it would be
> nice if at least the prototypes of the conversion functions were
> somehow standardized, and also if the locale files likewise
> accounted for the above issues (e.g. in the form of placeholders).

The strftime function has the %E? conversion specifiers which, if
supported, could provide this kind of functionality, I think. I have
no idea how to represent the rules for the conversions, though, or
whether doing so would be practical.
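
For readers unfamiliar with the E modifier, a small sketch of what a call looks like. The wrapper name (format_alt_year) is made up; %EY is the standard "alternative year representation", and whether it yields anything other than the Gregorian year depends entirely on the locale data and the libc in use.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical wrapper: format the year using the locale's alternative
 * representation. In the "C" locale the E modifier is ignored, so the
 * result is the same as plain %Y. */
static size_t format_alt_year(char *buf, size_t len, const struct tm *tm)
{
    return strftime(buf, len, "%EY", tm);
}
```

A locale with era data could return an era-based year here, but that only covers alternative year numbering within strftime, not a general conversion between calendars.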

> PS. speaking of historical vs. modern and LC_MONETARY, we should
> probably bear in mind the many locale variants that are based on
> currency only, as for example in the case of EU member countries
> before and after the Euro.

Interesting point I hadn't really thought of.

Rich



* Re: Locale bikeshed time
  2014-07-23 23:38           ` Rich Felker
@ 2014-07-24  1:07             ` writeonce
  2014-07-24  1:57               ` Rich Felker
  0 siblings, 1 reply; 43+ messages in thread
From: writeonce @ 2014-07-24  1:07 UTC (permalink / raw)
  To: musl

On 07/23/2014 07:38 PM, Rich Felker wrote:
> On Wed, Jul 23, 2014 at 07:22:49PM -0400, writeonce@midipix.org wrote:
>> On 07/23/2014 12:39 PM, Rich Felker wrote:
>>> On that topic, while this is a matter outside my control for
>>> individual users, my preference would be that the official
>>> musl-locale data attempt to avoid multiple variants/modifiers and
>>> legacy options if possible. For example I would like to see the
>>> numeric date format be ISO format in all locales, with traditional
>>> formats only where the natural-language string representations for
>>> months/days are included (and I say this as someone coming from
>>> one of the locales, i.e. US, where the traditional numeric date
>>> format is non-ISO). In keeping with the principle that musl is
>>> "modern" I'd like to prefer modern cultural conventions to
>>> historical ones.
>> For what it's worth, I wanted to point out that the ISO C explicitly
>> pertains to the Gregorian calendar only, albeit in parentheses
>> (N1570, 7.27.1).  For users of [listed in alphabetical order:]
>> Arabic, Chinese, Hebrew, Japanese, Persian, and Tibetan, for
>> instance, there are two different issues at stake: the first is the
>> representation of the date according to the Gregorian calendar in
>> one's own language, which could (~easily~) be made "modern" (ISO
>> compliant), whereas the second is the representation of the date
>> according to the culture's native calendar in the language matching
>> the current locale.
>>
>> While I'm not necessarily suggesting that musl (or any other libc, for
>> that matter) should implement the conversion functions from the
>> Gregorian calendar to other calendars and vice versa, it would be
>> nice if at least the prototypes of the conversion functions were
>> somehow standardized, and also if the locale files likewise
>> accounted for the above issues (e.g. in the form of placeholders).
> The strftime function has the %E? conversion specifiers which, if
> supported, could provide this kind of functionality, I think. I have
> no idea how to represent the rules for the conversions, though, or
> whether doing so would be practical.
The conversion function I'm missing is one that would produce a native 
struct tm from either a "general" (Gregorian), or a UTC struct tm.  
Consider for instance the case of lunisolar calendars with an occasional 
thirteenth month (i.e. Hebrew, Tibetan): given a Gregorian date, it is 
often useful to store the corresponding native date in a struct tm, 
manipulate it in some way (for instance finding the number of days left 
until the end of the native month), and then print the result.  There 
are already a lot of applications and websites that perform such 
calculations, yet no standard interface for the conversion functions (at 
least not as far as I can tell).

In terms of practicality: for many of the calendars I mentioned, 
conversion involves not only a formula, but also date- and year-based 
considerations.  A correct implementation of all the %E? specifiers is 
accordingly going to include many bytes of code that probably shouldn't 
be pulled in whenever a "random" static application uses strftime.  That 
being said, if musl's implementation of %E? could use weak aliases and 
standardized hooks, then applications or calendar-specific libraries 
could provide these hooks and still use the libc strftime, rather than a 
complex system of wrappers and conditionals.

* Not to be extra difficult or anything, but even the simple example in 
N1570 ("what day of the week is July 4, 2001") contains a cultural bias, 
namely the assumption that days begin and end at midnight;-)

zg

>
>> PS. speaking of historical vs. modern and LC_MONETARY, we should
>> probably bear in mind the many locale variants that are based on
>> currency only, as for example in the case of EU member countries
>> before and after the Euro.
> Interesting point I hadn't really thought of.
>
> Rich
>
>




* Re: Locale bikeshed time
  2014-07-24  1:07             ` writeonce
@ 2014-07-24  1:57               ` Rich Felker
  2014-07-24  2:16                 ` writeonce
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-24  1:57 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 09:07:22PM -0400, writeonce@midipix.org wrote:
> In terms of practicality: for many of the calendars I mentioned,
> conversion involves not only a formula, but also date- and
> year-based considerations.  A correct implementation of all the %E?
> specifiers is accordingly going to include many bytes of code that
> probably shouldn't be pulled in whenever a "random" static
> application uses strftime.  That being said, if musl's

Obviously unless the set of such rules is fixed and free of the need
for external updates, it would need to be represented as *data* that's
loaded as part of the locale file and not as code. Or as a query to a
(local) service.

> implementation of %E? could use weak aliases and standardized hooks,
> then applications or calendar-specific libraries could provide these
> hooks and still use the libc strftime, rather than a complex system
> of wrappers and conditionals.

That's not how weak symbols work -- they're not a way to add plug-in
code.

Rich



* Re: Locale bikeshed time
  2014-07-24  1:57               ` Rich Felker
@ 2014-07-24  2:16                 ` writeonce
  2014-07-24  2:24                   ` Rich Felker
  0 siblings, 1 reply; 43+ messages in thread
From: writeonce @ 2014-07-24  2:16 UTC (permalink / raw)
  To: musl

On 07/23/2014 09:57 PM, Rich Felker wrote:
> On Wed, Jul 23, 2014 at 09:07:22PM -0400, writeonce@midipix.org wrote:
>> In terms of practicality: for many of the calendars I mentioned,
>> conversion involves not only a formula, but also date- and
>> year-based considerations.  A correct implementation of all the %E?
>> specifiers is accordingly going to include many bytes of code that
>> probably shouldn't be pulled in whenever a "random" static
>> application uses strftime.  That being said, if musl's
> Obviously unless the set of such rules is fixed and free of the need
> for external updates, it would need to be represented as *data* that's
> loaded as part of the locale file and not as code. Or as a query to a
> (local) service.
>
>> implementation of %E? could use weak aliases and standardized hooks,
>> then applications or calendar-specific libraries could provide these
>> hooks and still use the libc strftime, rather than a complex system
>> of wrappers and conditionals.
> That's not how weak symbols work -- they're not a way to add plug-in
> code.
>
> Rich
>
>
Thanks for the clarification.  This leaves a query to a local service 
the more likely solution since for at least the Hijri (Muslim) and 
Hebrew (Jewish) calendars, accurate conversion cannot be based on data 
or tables alone.

zg



* Re: Locale bikeshed time
  2014-07-24  2:16                 ` writeonce
@ 2014-07-24  2:24                   ` Rich Felker
  2014-07-24  2:59                     ` writeonce
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-24  2:24 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 10:16:40PM -0400, writeonce@midipix.org wrote:
> >>implementation of %E? could use weak aliases and standardized hooks,
> >>then applications or calendar-specific libraries could provide these
> >>hooks and still use the libc strftime, rather than a complex system
> >>of wrappers and conditionals.
> >That's not how weak symbols work -- they're not a way to add plug-in
> >code.
> >
> Thanks for the clarification.  This leaves a query to a local
> service the more likely solution since for at least the Hijri
> (Muslim) and Hebrew (Jewish) calendars, accurate conversion cannot
> be based on data or tables alone.

How so? All code is fundamentally data/tables. Is your point that the
current data is not sufficient to compute all future times (in which
case updated data would be needed but sufficient) or just that the
algorithms are moderately complex (in which case the data would have
to represent a nontrivial computational language)?

In any case this is probably quite low priority. I'm not aware of any
other libc supporting these features or any significant demand for
them. So it's a neat topic to discuss, but getting pretty far off
topic from the topic at hand (locale support in 1.1.4).

Rich


* Re: Locale bikeshed time
  2014-07-24  2:24                   ` Rich Felker
@ 2014-07-24  2:59                     ` writeonce
  0 siblings, 0 replies; 43+ messages in thread
From: writeonce @ 2014-07-24  2:59 UTC (permalink / raw)
  To: musl

On 07/23/2014 10:24 PM, Rich Felker wrote:
> On Wed, Jul 23, 2014 at 10:16:40PM -0400, writeonce@midipix.org wrote:
>>>> implementation of %E? could use weak aliases and standardized hooks,
>>>> then applications or calendar-specific libraries could provide these
>>>> hooks and still use the libc strftime, rather than a complex system
>>>> of wrappers and conditionals.
>>> That's not how weak symbols work -- they're not a way to add plug-in
>>> code.
>>>
>> Thanks for the clarification.  This leaves a query to a local
>> service the more likely solution since for at least the Hijri
>> (Muslim) and Hebrew (Jewish) calendars, accurate conversion cannot
>> be based on data or tables alone.
> How so? All code is fundamentally data/tables. Is your point that the
> current data is not sufficient to compute all future times (in which
> case updated data would be needed but sufficient) or just that the
> algorithms are moderately complex (in which case the data would have
> to represent a nontrivial computational language)?
For one thing, the algorithms for past dates are moderately (or somehow 
more than moderately) complex.  This pertains to both the Hijri and 
Hebrew calendars.  With respect to future dates, in the most traditional 
Muslim countries the first day of the month is predicted astronomically, 
yet confirmed via traditional means (observation of the new moon).  In 
those countries you might therefore find out "in real time" whether 
tonight is the 1st of the next month, or the 30th of the current one.  
For date conversion to be accurate, then, you would have to get your 
data updated on a monthly basis, which at least for me makes a local 
service the better solution.  Obviously, things would have been much 
easier if the wise and elderly had designed their calendars with an ISO 
libc in mind (a bug in the prophet utility?), but I personally wouldn't 
dare trying to change them;-)
>   
>
> In any case this is probably quite low priority. I'm not aware of any
> other libc supporting these features or any significant demand for
> them. So it's a neat topic to discuss, but getting pretty far off
> topic from the topic at hand (locale support in 1.1.4).
>
> Rich
>
>
Of course.  That's why I suggested to define only the interfaces (or 
hooks, which now I understand are not an option), and not worry about 
the implementation.

zg




* Re: Locale bikeshed time
  2014-07-23 21:01           ` Rich Felker
@ 2014-07-24 15:35             ` u-igbb
  2014-07-24 16:01               ` Rich Felker
  0 siblings, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-24 15:35 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 05:01:20PM -0400, Rich Felker wrote:
> > This feels appropriate - if the definitions indeed fall into distinctive
> > classes like "full" / "single-category" and also if the naming reflects
> > the distinction
> 
> IMO language-based locales should be ll, lll, ll_TT, or lll_TT form
> where ll or lll is lowercase ISO language code and TT is uppercase
> territory code. Non-language-based locale files should avoid these
> patterns.

Just for certainty:

I assume you mean "l" above being lower case and non-language-based
definitions to begin/consist of uppercase letters? Totally avoiding two-
and three-letter combinations would be hardly followed by less scrupulous
parties :) but you certainly did not mean this.

Btw do we also have to use lll (the three-letter codes), or would
the two-letter ones be sufficient?

I understand that this is not an implementation question but rather a
discipline/policy one but in the long run it helps enormously to have
a clean deployment idea from the beginning.

An example of a spectacular failure to do so were the xkb keyboard maps.
[
  Two incompatible representations were in use, for many years (!) One was
  reasonable, structured by country i.e. reflecting different countries'
  actual standards. The other one was broken by design, using "language"
  as the main key without any actual definition of its semantics. This
  led to many of the available definitions being hardly useful hacks
  (and of course to a lot of confusion for everyone as this thing was
  impossible to document). Remarkably even the maintainers of the maps
  at x.org/freedesktop.org at the time did not realize the origin of the
  problem. I happen to have been involved in clarifying the issue;
  now the structure of xkb/symbols is reasonable.
]
This happens when one does not clearly document the target deployment
model which the implementation exists for, i.e. is meant to implement.

Other/unexpected ways to use a tool can be good too (or sometimes even
better) but most of the deployers lack the time and knowledge for the
analysis which the implementors by their role are to do - the analysis
which you among other things are doing by the discussions here.

The lack of the understanding easily leads to bad practices being
perpetuated (like the mess of the Kerberos keytab traditions).

I am afraid that not stating a clean usage model may harm musl deployments
too (say by mixing two- and three-letter locale codes so that one can not
sanely know which kind to use).

Rune


* Re: Locale bikeshed time
  2014-07-23 22:12       ` Rich Felker
@ 2014-07-24 15:38         ` u-igbb
  0 siblings, 0 replies; 43+ messages in thread
From: u-igbb @ 2014-07-24 15:38 UTC (permalink / raw)
  To: musl

On Wed, Jul 23, 2014 at 06:12:12PM -0400, Rich Felker wrote:
> I'm not seeing a way that setting LOCPATH without being aware that the
> app you're trying to affect is using musl could be helpful. The
> locales you're trying to make visible to the app need to be in the
> format used by musl, not the glibc format, so you have to already be
> aware of this. Maybe there's something I'm not seeing -- this is why I
> asked -- but if there's no reason for it, I think searching both "just
> because" is bad.

+1 (in my eyes it would be plainly harmful for a number of reasons)

Rune




* Re: Locale bikeshed time
  2014-07-24 15:35             ` u-igbb
@ 2014-07-24 16:01               ` Rich Felker
  2014-07-24 19:24                 ` u-igbb
  2014-07-24 20:15                 ` u-igbb
  0 siblings, 2 replies; 43+ messages in thread
From: Rich Felker @ 2014-07-24 16:01 UTC (permalink / raw)
  To: musl

On Thu, Jul 24, 2014 at 05:35:26PM +0200, u-igbb@aetey.se wrote:
> On Wed, Jul 23, 2014 at 05:01:20PM -0400, Rich Felker wrote:
> > > This feels appropriate - if the definitions indeed fall into distinctive
> > > classes like "full" / "single-category" and also if the naming reflects
> > > the distinction
> > 
> > IMO language-based locales should be ll, lll, ll_TT, or lll_TT form
> > where ll or lll is lowercase ISO language code and TT is uppercase
> > territory code. Non-language-based locale files should avoid these
> > patterns.
> 
> Just for certainty:
> 
> I assume you mean "l" above being lower case and non-language-based
> definitions to begin/consist of uppercase letters? Totally avoiding two-
> and three-letter combinations would be hardly followed by less scrupulous
> parties :) but you certainly did not mean this.

I just meant that language-based locales should match the pattern:

^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$

assuming I didn't make any stupid mistakes in writing that regex. And
non-language-based locales should not match this pattern.
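
As a quick way to try names against that rule, here is a minimal Python
sketch. It is only an approximation: Python's re module has no POSIX
bracket classes, so [[:lower:]], [[:upper:]] and [[:punct:]] are spelled
out for ASCII, and the function name is_language_locale is made up for
illustration.

```python
import re
import string

# ASCII stand-ins for the POSIX classes used in the pattern above
# (an assumption: Python's re module has no [[:lower:]] and friends).
LOWER = "a-z"
UPPER = "A-Z"
PUNCT = re.escape(string.punctuation)

LANGUAGE_LOCALE = re.compile(
    rf"[{LOWER}]{{2,3}}(_[{UPPER}]{{2}})?([{PUNCT}].*)?"
)


def is_language_locale(name: str) -> bool:
    """Return True if name has the ll, lll, ll_TT or lll_TT shape,
    optionally followed by a suffix that starts with punctuation."""
    return LANGUAGE_LOCALE.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ["eo", "en_US", "sv_SE.UTF-8", "C", "POSIX", "english"]:
        print(name, is_language_locale(name))
```

One thing the sketch makes visible: since "_" is itself a punctuation
character, a name like "en_us" is accepted through the suffix branch, so
the pattern as written does not strictly enforce an uppercase territory.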

BTW POSIX actually describes this pattern (or similar) for locale
names under the XSI option.

> Btw do we also have to use lll (the three-letter codes), or would
> the two-letter ones be sufficient?

I believe there are some languages for which there is no two-letter
code. (Note that even the whole 26x26 space is probably insufficient
to represent all of the world's languages, and for practical purposes,
the letters should have some correspondence with the name of the
language.)

> I understand that this is not an implementation question but rather a
> discipline/policy one but in the long run it helps enormously to have
> a clean deployment idea from the beginning.

Agreed.

> An example of a spectacular failure to do so were the xkb keyboard maps.
> [
>   Two incompatible representations were in use, for many years (!) One was
>   reasonable, structured by country i.e. reflecting different countries'
>   actual standards. The other one was broken by design, using "language"
>   as the main key without any actual definition of its semantics. This
>   led to many of the available definitions being hardly useful hacks
>   (and of course to a lot of confusion for everyone as this thing was
>   impossible to document). Remarkably even the maintainers of the maps
>   at x.org/freedesktop.org at the time did not realize the origin of the
>   problem. I happen to have been involved in clarifying the issue;
>   now the structure of xkb/symbols is reasonable.
> ]

This text is utterly backwards, and I've complained about the policy
before, but gotten nowhere with it. Yes many languages have keyboard
variants connected to a particular geographic territory (this is
mainly true for European languages, not so much for the rest of the
world), but it does not make keyboard layout a property of country.
You also have:

- Users who speak and use languages that have no relation to the
  country where they're living.

- Languages which have no territory.

- Languages used in territories where the country it belongs to is
  disputed.

- Etc.

All of these issues make country-based keyboard selection at best
inconvenient, and at worst culturally and politically offensive, to
users. And offending users is utterly bad policy.

The same issue exists in glibc -- for a long time, their policy
mandated that all locales have a territory associated with them, and
this (along with other stupid policy) was preventing the addition of
the Esperanto locale. See:

https://sourceware.org/bugzilla/show_bug.cgi?id=16190

I believe the policy has been fixed now, but the discussion happened
on a different bug tracker issue and/or mailing list thread, and I
don't have the link.

> I am afraid that not stating a clean usage model may harm musl deployments
> too (say by mixing two- and three-letter locale codes so that one can not
> sanely know which kind to use).

The reasonable approach to this is probably just using the
three-letter codes for languages that do not have a two-letter code.
In practice I haven't seen such translations/locales on other systems,
but we certainly don't want to preclude them.

Rich


* Re: Locale bikeshed time
  2014-07-24 16:01               ` Rich Felker
@ 2014-07-24 19:24                 ` u-igbb
  2014-07-24 20:15                 ` u-igbb
  1 sibling, 0 replies; 43+ messages in thread
From: u-igbb @ 2014-07-24 19:24 UTC (permalink / raw)
  To: musl

On Thu, Jul 24, 2014 at 12:01:50PM -0400, Rich Felker wrote:
> > Btw do we also have to use lll (the three-letter codes), or would
> > the two-letter ones be sufficient?
> 
> I believe there are some languages for which there is no two-letter
> code. (Note that even the whole 26x26 space is probably insufficient
> to represent all of the world's languages, and for practical purposes,
> the letters should have some correspondence with the name of the
> language.)

Ok, then it's a pity that we can not "postulate" three-letter ones :)
(as we want to embrace the values people already have in their LANG...)

> > An example of a spectacular failure to do so were the xkb keyboard maps.
> > [
> >   Two incompatible representations were in use, for many years (!) One was
> >   reasonable, structured by country i.e. reflecting different countries'
> >   actual standards. The other one was broken by design, using "language"
> >   as the main key without any actual definition of its semantics. This
> >   led to many of the available definitions being hardly useful hacks
> >   (and of course to a lot of confusion for everyone as this thing was
> >   impossible to document). Remarkably even the maintainers of the maps
> >   at x.org/freedesktop.org at the time did not realize the origin of the
> >   problem. I happen to have been involved in clarifying the issue;
> >   now the structure of xkb/symbols is reasonable.
> > ]
> 
> This text is utterly backwards, and I've complained about the policy
> before, but gotten nowhere with it. Yes many languages have keyboard

Oh.

> variants connected to a particular geographic territory (this is
> mainly true for European languages, not so much for the rest of the
> world), but it does not make keyboard layout a property of country.

A keyboard layout is nothing else than "a cultural or personal
preference". In the overwhelming majority of usage cases it is
reflected/strongly-suggested by the manufacturers of the keyboards by
carving some glyphs on the keys. The manufacturers tend to follow the
different countries' most formalized cultural references, expressed
by the locally defined standards (among others, the country-specific
standards for keyboard layouts). This also creates a strong bias towards
a certain one of these layouts for the corresponding users, of course.

I do not say that "country" is a perfect reference. It is nevertheless
much more reasonable/usable than "language" which has nothing to do with
layouts (_alphabets_ might have a say in corner cases, but not languages).

> You also have:
> 
> - Users who speak and use languages that have no relation to the
>   country where they're living.
> 
> - Languages which have no territory.
> 
> - Languages used in territories where the country it belongs to is
>   disputed.
> 
> - Etc.

Sure. But this is not what a keyboard layout reflects.
You mention a complexity which does not belong to the task.

> All of these issues make country-based keyboard selection at best
> inconvenient, and at worst culturally and politically offensive, to
> users. And offending users is utterly bad policy.

Believe it or not, I saw this argument from the freedesktop people
during the corresponding discussion (about 15+ years ago iirc).
Nevertheless reflecting what is carved on the keyboard which has been
bought or otherwise chosen _by_the_user_ is hardly an insult.

If there is no national standard for the "user's home culture"
layout, then there is none. Not our fault and a purely political
issue (technically you always can find a place for an extra layout
definition).

If a user chooses a layout defined by a standard of a "country of BAD" -
that was the user choice, not ours. Somebody can possibly feel that
a BAD country "stole" "their" layout but we are not in a position to
judge there.

The matter of fact is that the keyboards being manufactured are governed
in the _very_ first hand by national standards, which among others define
both the physical placement of the keys and the placement of the labels
on the keys.

Blind typers may forget about the second fact but this does not make
the fact irrelevant.

> The same issue exists in glibc -- for a long time, their policy
> mandated that all locales have a territory associated with them, and
> this (along with other stupid policy) was preventing the addition of
> the Esperanto locale. See:

_This_ was indeed stupid, I am aware of this issue too as I needed
the locale.

> The reasonable approach to this is probably just using the
> three-letter codes for languages that do not have a two-letter code.
> In practice I haven't seen such translations/locales on other systems,
> but we certainly don't want to preclude them.

Fair enough!

Thanks Rich.

Rune


* Re: Locale bikeshed time
  2014-07-24 16:01               ` Rich Felker
  2014-07-24 19:24                 ` u-igbb
@ 2014-07-24 20:15                 ` u-igbb
  2014-07-24 22:02                   ` Rich Felker
  1 sibling, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-24 20:15 UTC (permalink / raw)
  To: musl

On Thu, Jul 24, 2014 at 12:01:50PM -0400, Rich Felker wrote:
> I just meant that language-based locales should match the pattern:
> 
> ^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$
> 
> assuming I didn't make any stupid mistakes in writing that regex. And
> non-language-based locales should not match this pattern.

I feel it would be somewhat more robust if we'd have a positive
definition for "the second class" of locale data, just in case we one
day discover that we want to differently handle, say, three classes (?)

A negative definition also gives very little guidance for the actual
naming and in the worst case may lead to misunderstanding when multiple
parties are involved.

Why not make such a worst case less probable by a somewhat more strict
naming rule?
Possibly also defining "non-language-based" in a positive way?

This is just a thought. I have no actual proposal as I do not have a
good mental picture of which kinds of "non-language-based" definitions
exist or should exist and how they are being used or might/should be used.

Rune


* Re: Locale bikeshed time
  2014-07-24 20:15                 ` u-igbb
@ 2014-07-24 22:02                   ` Rich Felker
  2014-07-25  9:06                     ` u-igbb
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-24 22:02 UTC (permalink / raw)
  To: musl

On Thu, Jul 24, 2014 at 10:15:48PM +0200, u-igbb@aetey.se wrote:
> On Thu, Jul 24, 2014 at 12:01:50PM -0400, Rich Felker wrote:
> > I just meant that language-based locales should match the pattern:
> > 
> > ^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$
> > 
> > assuming I didn't make any stupid mistakes in writing that regex. And
> > non-language-based locales should not match this pattern.
> 
> I feel it would be somewhat more robust if we'd have a positive
> definition for "the second class" of locale data, just in case we one
> day discover that we want to differently handle, say, three classes (?)
> 
> A negative definition also gives very little guidance for the actual
> naming and in the worst case may lead to misunderstanding when multiple
> parties are involved.
> 
> Why not make such a worst case less probable by a somewhat more strict
> naming rule?
> Possibly also defining "non-language-based" in a positive way?
> 
> This is just a thought. I have no actual proposal as I do not have a
> good mental picture of which kinds of "non-language-based" definitions
> exist or should exist and how they are being used or might/should be used.

This is a reasonable sentiment, but do you have a proposal? I think
first you would need an idea of what some "non-language" category
values might be. I can think of some for LC_COLLATE, though I'm not
sure how valuable many of them are:

- UCA default tables
- UTF-16 code unit order
- Case-insensitive Unicode codepoint order

For the other categories, examples seem much harder to find.
LC_MESSAGES is inherently a language-based category, but perhaps you
could have a locale that eliminates verbose natural-language messages
and replaces them with C/POSIX identifiers (e.g. printing ENOENT
instead of "No such file or directory") conveying the meaning. (Or we
could be somewhat radical and replace all the internal strerror
messages like this and require LC_MESSAGES=en to get them back.) I'm
not sure if there would be interesting LC_TIME locales not associated
with a language (since LC_TIME has to offer day/month names). And for
LC_MONETARY, most if not all of the data really corresponds to a
political unit context, not a language, so in principle it might make
sense to have locales just for LC_MONETARY that aren't associated with
a language, but I can't see that being a convenient or reasonable
design in practice...
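
To make the LC_MESSAGES idea above concrete, here is a small Python
sketch; it illustrates the two output styles only and says nothing about
how musl would implement them. It relies on the standard errno.errorcode
table and os.strerror; the describe_error helper and its symbolic flag
are hypothetical.

```python
import errno
import os


def describe_error(err: int, symbolic: bool) -> str:
    """Return the C/POSIX identifier for err when symbolic is true,
    otherwise the platform's natural-language message."""
    if symbolic:
        # errno.errorcode maps numeric values to names such as "ENOENT";
        # fall back to the bare number for values without a known name.
        return errno.errorcode.get(err, str(err))
    return os.strerror(err)


if __name__ == "__main__":
    print(describe_error(errno.ENOENT, symbolic=True))
    print(describe_error(errno.ENOENT, symbolic=False))
```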

Rich


* Re: Locale bikeshed time
  2014-07-24 22:02                   ` Rich Felker
@ 2014-07-25  9:06                     ` u-igbb
  2014-07-25 20:15                       ` u-igbb
  0 siblings, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-25  9:06 UTC (permalink / raw)
  To: musl

On Thu, Jul 24, 2014 at 06:02:28PM -0400, Rich Felker wrote:
> first you would need an idea of what some "non-language" category
> values might be. I can think of some for LC_COLLATE, though I'm not
> sure how valuable many of them are:
> 
> - UCA default tables
> - UTF-16 code unit order
> - Case-insensitive Unicode codepoint order

I can hardly give any opinion on their importance.

> For the other categories, examples seem much harder to find.
> LC_MESSAGES is inherently a language-based category, but perhaps you
> could have a locale that eliminates verbose natural-language messages
> and replaces them with C/POSIX identifiers (e.g. printing ENOENT
> instead of "No such file or directory") conveying the meaning. (Or we
> could be somewhat radical and replace all the internal strerror
> messages like this and require LC_MESSAGES=en to get them back.) I'm

I like this - for clarity, conciseness and for making it as neutral
as possible (ENOENT stems of course from English but no worse than
the keywords of C itself).

> LC_MONETARY, most if not all of the data really corresponds to a
> political unit context, not a language, so in principle it might make
> sense to have locales just for LC_MONETARY that aren't associated with
> a language, but I can't see that being a convenient or reasonable
> design in practice...

Indeed, LC_MONETARY has basically nothing to do with language.

If I might choose I would not let LANG imply LC_MONETARY
(iow would skip LC_MONETARY in language-based locale definitions).

Returning to the naming. As language-based locales are named
after languages, it would be nice to name other kinds of locale
data after their "natural association" too. Then politically-bound
data could be put into the corresponding "territorial" family:

 language                ll[l][_TT]
 territory               TT[_ll[l]]

And if we find something that does not feel reasonable to connect
to either a language or a territory, we can do

 special cases           @<specialcase>

[or                       ZZ@<specialcase>     ("no territory")
 or                       zxx@<specialcase>    ("no language")
 but the shorter and simpler form is to be preferred]

The expected mode of usage would be like

LANG=de LC_MONETARY=EU
 or
LANG=sv LC_MONETARY=SE
 or
LANG=eo@iso8601 LC_MONETARY=US@iso4217

which would in every case access two locale data files of different
classes, clearly visible in the naming.
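
The lookup implied by these examples can be sketched in Python as
follows. The precedence (LC_ALL, then the category's own variable, then
LANG, then "C") is the usual POSIX rule; the helper names and the idea of
one data file per distinct name are illustrative assumptions, not a
description of what musl does.

```python
CATEGORIES = ("LC_CTYPE", "LC_COLLATE", "LC_MESSAGES",
              "LC_MONETARY", "LC_NUMERIC", "LC_TIME")


def effective_locale(env: dict, category: str) -> str:
    """Pick the locale name for one category using POSIX precedence."""
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var, "")
        if value:  # unset and empty variables are both skipped
            return value
    return "C"


def locale_files(env: dict) -> set:
    """Distinct non-C locale names, i.e. the data files that would be
    opened if each name maps to one file (a hypothetical layout)."""
    names = {effective_locale(env, cat) for cat in CATEGORIES}
    return names - {"C", "POSIX"}


if __name__ == "__main__":
    print(sorted(locale_files({"LANG": "de", "LC_MONETARY": "EU"})))
    print(sorted(locale_files({"LANG": "sv", "LC_MONETARY": "SE"})))
```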

ISO date format actually would be a good candidate for a standalone
"@iso8601", but it can as well live inside the C locale.
Then the last example above might look like

LANG=eo LC_TIME=@iso8601 LC_MONETARY=US@iso4217
 at the expense of a third file to be accessed
 or rather
LANG=eo LC_TIME=C LC_MONETARY=US@iso4217

What do you think about such a naming convention and usage mode?

Rune


* Re: Locale bikeshed time
  2014-07-25  9:06                     ` u-igbb
@ 2014-07-25 20:15                       ` u-igbb
  2014-07-25 22:32                         ` Rich Felker
  2014-07-26 20:43                         ` Rich Felker
  0 siblings, 2 replies; 43+ messages in thread
From: u-igbb @ 2014-07-25 20:15 UTC (permalink / raw)
  To: musl

Replying to myself.

On Fri, Jul 25, 2014 at 11:06:49AM +0200, u-igbb@aetey.se wrote:
> Returning to the naming. As language-based locales are named
> after languages, it would be nice to name other kinds of locale
> data after their "natural association" too. Then politically-bound
> data could be put into the corresponding "territorial" family:
> 
>  language                ll[l][_TT]
>  territory               TT[_ll[l]]

A bad idea, forget it. This would be open to misinterpretation
(which key is "more fundamental" for a certain kind of data,
shall it go to ll_TT or TT_ll ?)

Somewhat cleaner might be:   ("zxx" and "ZZ" below are literals)

   no localization                         C
   language[+territory]                    ll[l][_TT]
   purely territorial                      zxx_TT   ("no language" code)

and possibly
   no territory-specific stuff included    ll[l]_ZZ ("no territory" code)

The last item would e.g. allow treating ll[l] alone as "including
the most frequently used territorial features for this language"
(like "sv" == "sv_SE"),
but I think this approach would be bad and confusing - such a definition
is not certain nor stable.

I think that a language code alone should mean "no territory-specific
stuff included" and nothing else.

Then "ll" would be a synonym for "ll_ZZ" and hence "ll_ZZ" will not have
to exist at all.

Then the usage would be like

LANG=de_DE                        (... "€")

LANG=sv_SE                        (decimal comma, "kr")
LANG=sv     LC_MONETARY=zxx_SE    (decimal point from "C", iso4217 "SEK")

LANG=sv_FI  LC_MONETARY=sv_SE     (... "kr")

LANG=eo     LC_MONETARY=zxx_EU    (... iso4217 "EUR")

Assuming that any categories not explicitly defined
in the corresponding files are to be taken from "C".
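
A minimal sketch of how names of the form proposed above ("C", or
ll[l][_TT] with the literal "zxx" meaning "no language") could be
split into their parts. The struct and function names are hypothetical
and not part of musl; @variant suffixes are deliberately not handled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result type, for illustration only. */
struct locale_parts {
    char lang[4];     /* "ll" or "lll"; empty for "C" */
    char terr[3];     /* "TT"; empty when no territory is given */
    int no_language;  /* 1 when lang is the literal "zxx" */
};

/* Parse "C" or ll[l][_TT]. Returns 0 on success, -1 on malformed input. */
int parse_locale_name(const char *name, struct locale_parts *out)
{
    size_t n = 0;

    assert(name && out);
    memset(out, 0, sizeof *out);
    if (!strcmp(name, "C")) return 0;

    /* Language: two or three lowercase ASCII letters. */
    while (name[n] >= 'a' && name[n] <= 'z' && n < 4) n++;
    if (n < 2 || n > 3) return -1;
    memcpy(out->lang, name, n);
    out->no_language = !strcmp(out->lang, "zxx");

    /* Optional territory: '_' followed by exactly two uppercase letters. */
    if (!name[n]) return 0;
    if (name[n] != '_') return -1;
    const char *t = name + n + 1;
    if (!(t[0] >= 'A' && t[0] <= 'Z' && t[1] >= 'A' && t[1] <= 'Z' && !t[2]))
        return -1;
    memcpy(out->terr, t, 2);
    return 0;
}
```

With this sketch, "sv_SE" yields lang "sv" and terr "SE", "zxx_EU" sets
no_language, and the rejected TT_ll order (e.g. "SE_sv") fails to parse.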

Hope this makes some sense in your eyes.

Rune

* Re: Locale bikeshed time
  2014-07-25 20:15                       ` u-igbb
@ 2014-07-25 22:32                         ` Rich Felker
  2014-07-26  7:25                           ` u-igbb
  2014-07-26 20:43                         ` Rich Felker
  1 sibling, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-25 22:32 UTC (permalink / raw)
  To: musl

On Fri, Jul 25, 2014 at 10:15:51PM +0200, u-igbb@aetey.se wrote:
> Replying to myself.
> 
> On Fri, Jul 25, 2014 at 11:06:49AM +0200, u-igbb@aetey.se wrote:
> > Returning to the naming. As language-based locales are named
> > after languages, it would be nice to name other kinds of locale
> > data after their "natural association" too. Then politically-bound
> > data could be put into the corresponding "territorial" family:
> > 
> >  language                ll[l][_TT]
> >  territory               TT[_ll[l]]
> 
> A bad idea, forget it. This would be open to misinterpretation
> (which key is "more fundamental" for a certain kind of data,
> shall it go to ll_TT or TT_ll ?)

Yes, I agree that's a bad idea.

> Somewhat cleaner might be:   ("zxx" and "ZZ" below are literals)
> 
>    no localization                         C
>    language[+territory]                    ll[l][_TT]
>    purely territorial                      zxx_TT   ("no language" code)

While clean and well-defined, I wonder whether zxx_TT is
counter-intuitive to most users...

> and possibly
>    no territory-specific stuff included    ll[l]_ZZ ("no territory" code)
> 
> The last item would e.g. allow treating ll[l] alone as "including
> the most frequently used territorial features for this language"
> (like "sv" == "sv_SE"),
> but I think this approach would be bad and confusing - such a definition
> is not certain nor stable.
> 
> I think that a language code alone should mean "no territory-specific
> stuff included" and nothing else.

I think that's reasonable.

> Then "ll" would be a synonym for "ll_ZZ" and hence "ll_ZZ" will not have
> to exist at all.

That's definitely nice.

> Then the usage would be like
> 
> LANG=de_DE                        (... "€")
> 
> LANG=sv_SE                        (decimal comma, "kr")
> LANG=sv     LC_MONETARY=zxx_SE    (decimal point from "C", iso4217 "SEK")

Changing the numeric radix point is explicitly not supported. :)
LC_NUMERIC is just always C because, well, numbers are numbers, not
something to vary by culture, and changing the radix point just breaks
parsing and storing data for interchange. LC_MONETARY on the other
hand could in principle provide a different monetary radix point, but
it's not terribly useful until we get a full-featured strfmon anyway.

Rich


* Re: Locale bikeshed time
  2014-07-25 22:32                         ` Rich Felker
@ 2014-07-26  7:25                           ` u-igbb
  2014-07-26  8:03                             ` Rich Felker
  0 siblings, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-26  7:25 UTC (permalink / raw)
  To: musl

On Fri, Jul 25, 2014 at 06:32:39PM -0400, Rich Felker wrote:
> > Somewhat cleaner might be:   ("zxx" and "ZZ" below are literals)
> > 
> >    no localization                         C
> >    language[+territory]                    ll[l][_TT]
> >    purely territorial                      zxx_TT   ("no language" code)
> 
> While clean and well-defined, I wonder whether zxx_TT is
> counter-intuitive to most users...

Sure, this contradicts the convenient inclination to use short
names when possible. Nevertheless I would even argue against myself
(again :) and say that we'd better disallow short variants altogether
(neither TT nor ll).

> > I think that a language code alone should mean "no territory-specific
> > stuff included" and nothing else.
> 
> I think that's reasonable.

Given that we'd need both this extra rule and a hope that the future
user/maintainer keeps it in mind too

> > Then "ll" would be a synonym for "ll_ZZ" and hence "ll_ZZ" will not have
> > to exist at all.

it would in fact be more robust to do the contrary and simply always
assume the full ll[l]_TT syntax, with zxx and ZZ being already defined
by the corresponding standards to denote the needed special cases.

Then this would be fully standard-compliant and consistent.

I understand this may feel a bit strange and "too long" even though
the extra characters are hardly a burden in practice.

Let me compare this to DNS search domains: short names seem
convenient but they are neither reliable nor scalable. Short locale
names are just as prone to being misunderstood, and there will be
contributions with different semantics and long bikeshed discussions
on different forums about which one is right :)

In other words, I feel that it is more clear to _not_ include
Sweden-specific bits into "sv_ZZ" (which indicates "not _any_ country"
and hence "not Sweden") than into "sv".

> > LANG=sv_SE                        (decimal comma, "kr")
> > LANG=sv     LC_MONETARY=zxx_SE    (decimal point from "C", iso4217 "SEK")
> 
> Changing the numeric radix point is explicitly not supported. :)
> LC_NUMERIC is just always C because, well, numbers are numbers, not
> something to vary by culture, and changing the radix point just breaks
> parsing and storing data for interchange. LC_MONETARY on the other

I am fully with you on the point of formatting numerical data for
interchange. The purpose of locale is though the exact _opposite_, to
represent data in a format especially chosen for the specific occasion
and a specific user, _differently_ from what would be suitable for the
rest of the world. Isn't it?

So I would say it is indeed stupid to localize data meant for
interchange. Nevertheless it may still be meaningful to format numbers
for the user's taste when the data presentation is only meant for some
kind of a "local" context.

Related to the decimal point issue:

I think we (or at least myself) would need a clarification about
the role of "C" locale. It is to mean "no localization" which does
not say that it is expected to provide representation usable globally
(I think it is on the contrary by its origin heavily English/US biased).

I assume that you are aiming to reduce this bias as much as possible so
that "C" could be neutral and suitable for as many users/uses as possible.
Unfortunately this raises more questions, like the following:

According to https://en.wikipedia.org/wiki/Decimal_mark

"
Countries where a dot "." is used to mark the radix point comprise
roughly 60% of the world's population.[citation needed]
"

which indicates that this information is unreliable.

Notably, according to the same article (and verifiably :) the living
auxiliary languages meant for international communication all made a
different choice (apparently for reasons based on some research):

"
The three most spoken international auxiliary languages, Ido, Esperanto,
and Interlingua all use the comma as the official radix point
"

Is there anything that postulates C locale to use "." as the radix point?
Is there any evidence that "." is more widely used than "," ?

Do not misunderstand my questions as a cultural bias. I am _much_
more used to the decimal dot than comma, because of the involvement
with programming languages using ".". Nevertheless locale is not about
representing data for computers, but for humans - and I would love to
have a best possible internationally useful locale as the default.

Otherwise let us say that "C" locale is for interacting with programs,
not with humans, period (those wishing a human-friendly internationally
sound environment are to use e.g. LANG=eo_ZZ).
This is possibly the only reliable/efficient/robust approach?

Yet it would be a pity to not have a common representation for both
humans and computers, without a cultural bias.

Rune

* Re: Locale bikeshed time
  2014-07-26  7:25                           ` u-igbb
@ 2014-07-26  8:03                             ` Rich Felker
  2014-07-26  9:06                               ` Jens Gustedt
  2014-07-26  9:38                               ` u-igbb
  0 siblings, 2 replies; 43+ messages in thread
From: Rich Felker @ 2014-07-26  8:03 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 09:25:03AM +0200, u-igbb@aetey.se wrote:
> > Changing the numeric radix point is explicitly not supported. :)
> > LC_NUMERIC is just always C because, well, numbers are numbers, not
> > something to vary by culture, and changing the radix point just breaks
> > parsing and storing data for interchange. LC_MONETARY on the other
> 
> I am fully with you on the point of formatting numerical data for
> interchange. The purpose of locale is though the exact _opposite_, to
> represent data in a format especially chosen for the specific occasion
> and a specific user, _differently_ from what would be suitable for the
> rest of the world. Isn't it?
> 
> So I would say it is indeed stupid to localize data meant for
> interchange. Nevertheless it may still be meaningful to format numbers
> for the user's taste when the data presentation is only meant for some
> kind of a "local" context.

The problem is that the vast majority of actual printing and parsing
of floating point numbers is for interchange purposes, not mere visual
pretty-printing, and the existence of alternate radix characters
introduces subtle bugs into programs that are not tested in such
locales. Very few programs or libraries I've seen go to the trouble to
obtain a usable LC_NUMERIC locale in a portable, thread-safe, and
library-safe way before calling snprintf or strtod. And lots of broken
gui libraries set LC_NUMERIC behind the application's back even if the
application only wanted to set other categories.
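
For illustration, a minimal sketch of what "going to the trouble" can
look like with the POSIX.1-2008 per-thread locale API. The helper name
format_double_c is hypothetical; the sketch assumes newlocale,
duplocale and uselocale are available, and it leaves the global locale
untouched.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: format x with "." as the radix character no
 * matter what LC_NUMERIC the process or thread has selected.
 * Returns snprintf's result, or -1 if no locale could be created. */
int format_double_c(char *buf, size_t size, double x)
{
    /* Copy the calling thread's current locale (or the global one). */
    locale_t base = duplocale(uselocale((locale_t)0));
    if (base == (locale_t)0) return -1;

    /* Override only LC_NUMERIC; on success newlocale takes over base. */
    locale_t c = newlocale(LC_NUMERIC_MASK, "C", base);
    if (c == (locale_t)0) { freelocale(base); return -1; }

    locale_t old = uselocale(c);   /* affects this thread only */
    int n = snprintf(buf, size, "%.17g", x);
    uselocale(old);                /* restore the previous locale */
    freelocale(c);
    return n;
}
```

A caller would pass its own buffer, e.g. format_double_c(buf, sizeof buf,
1.5). The amount of ceremony around a single snprintf call is the point:
code that skips it inherits whatever radix character the environment
happens to select.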

> Is there anything that postulates C locale to use "." as the radix point?

Yes, it's required by ISO C and POSIX. The C locale is defined by its
ability to be used for translating C programs. In C programs, the
radix point is ".".
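
A small sketch of how a program can inspect the radix character
currently in effect, using the standard localeconv() interface. The
function name radix_is_dot is hypothetical. A program that has not
called setlocale() starts in the "C" locale, so the check succeeds
there.

```c
#include <locale.h>

/* Hypothetical helper: returns 1 if the current LC_NUMERIC radix
 * character is exactly ".", 0 otherwise. */
int radix_is_dot(void)
{
    const char *dp = localeconv()->decimal_point;
    return dp[0] == '.' && dp[1] == '\0';
}
```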

> Is there any evidence that "." is more widely used than "," ?

Well, 2/3 of the world's population is in India and China and they all
use ".", so I think that pretty much covers the question of which is
"more widely used".

> Do not misunderstand my questions as a cultural bias. I am _much_
> more used to the decimal dot than comma, because of the involvement
> with programming languages using ".". Nevertheless locale is not about
> representing data for computers, but for humans - and I would love to
> have a best possible internationally useful locale as the default.

This goes back to the question about modern versus old tradition.
Alternate radix points are a cultural convention that's (seemingly,
hopefully) on the way out due to computers and information
interchange. Maybe in some sense this is cultural imperialism (or just
globalization or whatnot) but it's certainly a lot less negative than
the "everyone should use English" attitude. Nobody's saying "don't use
your language", just "don't gratuitously break things for a one-pixel
difference". :-)

Rich

* Re: Locale bikeshed time
  2014-07-26  8:03                             ` Rich Felker
@ 2014-07-26  9:06                               ` Jens Gustedt
  2014-07-26  9:25                                 ` Rich Felker
  2014-07-26  9:38                               ` u-igbb
  1 sibling, 1 reply; 43+ messages in thread
From: Jens Gustedt @ 2014-07-26  9:06 UTC (permalink / raw)
  To: musl

Am Samstag, den 26.07.2014, 04:03 -0400 schrieb Rich Felker:
> The problem is that the vast majority of actual printing and parsing
> of floating point numbers is for interchange purposes, not mere visual
> pretty-printing,

do you have statistics that support that claim?

printing that is really concerned with interchange should just use the
%a formats. All other formats are intended for human readability.

> This goes back to the question about modern versus old tradition.
> Alternate radix points are a cultural convention that's (seemingly,
> hopefully) on the way out due to computers and information
> interchange. Maybe in some sense this is cultural imperialism (or just
> globalization or whatnot)

+1 for imperialism

Jens

-- 
:: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::




* Re: Locale bikeshed time
  2014-07-26  9:06                               ` Jens Gustedt
@ 2014-07-26  9:25                                 ` Rich Felker
  0 siblings, 0 replies; 43+ messages in thread
From: Rich Felker @ 2014-07-26  9:25 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 11:06:56AM +0200, Jens Gustedt wrote:
> Am Samstag, den 26.07.2014, 04:03 -0400 schrieb Rich Felker:
> > The problem is that the vast majority of actual printing and parsing
> > of floating point numbers is for interchange purposes, not mere visual
> > pretty-printing,
> 
> do you have statistics that support that claim?

Anecdotes, yes; statistics, no. Some examples that come to mind
immediately:

- Anything JSON
- Text-based 3D model files
- Subtitle file timings
- Video framerates, aspect ratios, etc.
- Input files for scientific and mathematical computing.
...

> printing that is really concerned with interchange should just use the
> %a formats. All other formats are intended for human readability.

In an ideal world, yes, people would use %a. In practice I don't think
I've ever seen it used. :( And the radix point affects %a anyway,
which is rather nonsensical, since there's definitely no cultural
convention for commas to be used as radix points in hex floats.
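
As a sketch of the %a approach mentioned above: a hex float written
with %a and read back with strtod reproduces the original double
exactly. The helper name roundtrip_hex is hypothetical, and the sketch
assumes the program runs in the default "C" locale, since the radix
character in %a output is locale-dependent as well.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write x as a hex float, then parse it back. */
double roundtrip_hex(double x)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%a", x);  /* exact binary representation */
    return strtod(buf, NULL);            /* C99 strtod accepts hex floats */
}
```

The exact text produced by %a may differ between implementations, so
only the round-trip property should be relied on, not the string form.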

> > This goes back to the question about modern versus old tradition.
> > Alternate radix points are a cultural convention that's (seemingly,
> > hopefully) on the way out due to computers and information
> > interchange. Maybe in some sense this is cultural imperialism (or just
> > globalization or whatnot)
> 
> +1 for imperialism

Call it what you like, but lack of a variable LC_NUMERIC has been part
of the proposed locale design since the beginning. This isn't
something new I'm springing now. The radix point in LC_NUMERIC is also
probably the single most-hated part of locale by members of the
community who objected to musl having any sort of locale support at
all.

Rich

* Re: Locale bikeshed time
  2014-07-26  8:03                             ` Rich Felker
  2014-07-26  9:06                               ` Jens Gustedt
@ 2014-07-26  9:38                               ` u-igbb
  2014-07-26 17:47                                 ` Szabolcs Nagy
  1 sibling, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-26  9:38 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 04:03:27AM -0400, Rich Felker wrote:
> > So I would say it is indeed stupid to localize data meant for
> > interchange. Nevertheless it may still be meaningful to format numbers
> > for the user's taste when the data presentation is only meant for some
> > kind of a "local" context.
> 
> The problem is that the vast majority of actual printing and parsing
> of floating point numbers is for interchange purposes, not mere visual
> pretty-printing, and the existence of alternate radix characters
> introduces subtle bugs into programs that are not tested in such
> locales. Very few programs or libraries I've seen go to the trouble to
> obtain a usable LC_NUMERIC locale in a portable, thread-safe, and
> library-safe way before calling snprintf or strtod. And lots of broken
> gui libraries set LC_NUMERIC behind the application's back even if the
> application only wanted to set other categories.

Ok, the reality is that locale is not being used in a reasonable way so
we do not have to bother implementing it for proper use.
Instead we are obliged to try to reduce the harm by being non-conforming
in a partially compensating fashion. Sigh.

Well, locale is a mess by design...

> > Is there any evidence that "." is more widely used than "," ?
> 
> Well, 2/3 of the world's population is in India and China and they all
> use ".", so I think that pretty much covers the question of which is
> "more widely used".

Ah indeed. That's a sufficient evidence.

> >  locale is not about
> > representing data for computers, but for humans - and I would love to
> > have a best possible internationally useful locale as the default.
> 
> This goes back to the question about modern versus old tradition.
> Alternate radix points are a cultural convention that's (seemingly,
> hopefully) on the way out due to computers and information
> interchange. Maybe in some sense this is cultural imperialism (or just
> globalization or whatnot) but it's certainly a lot less negative than
> the "everyone should use English" attitude. Nobody's saying "don't use
> your language", just "don't gratuitously break things for a one-pixel
> difference". :-)

:-D

In practice this calls for "eo_ZZ@decimal_dot" - which actually would
make sense.

This reminds me that we have an unsettled issue of naming the variants.
I wonder which schemes exist, are standardized (?), or are in use.

Gnu gettext manual states
"
The ‘@variant’ can denote any kind of characteristics that is not
already implied by the language ll and the country CC. It can denote a
particular monetary unit. For example, on glibc systems, ‘de_DE@euro’
denotes the locale that uses the Euro currency, in contrast to the
older locale ‘de_DE’ which implies the use of the currency before
2002. It can also denote a dialect of the language, or the script used
to write text (for example, ‘sr_RS@latin’ uses the Latin script,
whereas ‘sr_RS’ uses the Cyrillic script to write Serbian), or the
orthography rules, or similar.
"

I read this as "there is no structure on variant naming and all kinds
of variations share the same name space". Then it is the hopefully
present comment in the locale definition file which apparently has to
be consulted to know what a certain variant is about.

Fine with me but I would like to see this stated somewhere (instead
of my _guess_ after reading the above documentation - it does _not_
say a word about how one can learn the actual semantics of the variant
aka the intention of the locale submitter).

A straightforward try to learn what a certain installed locale is about,
on a Debian Linux system:

 $ locale -a | grep en
 en_US.utf8
 $ apropos en_US
 en_US: nothing appropriate.
 $

On a RedHat Linux system with "@Everything":

 $ locale -a | grep en
  ... lots of en_SOMETHING including en_US ...
 $ apropos en_US
 strlen_user          (9)  - Get the size of a string in user space
 strnlen_user         (9)  - Get the size of a string in user space
 $

Iow one has nice prerequisites for keeping the messy thing in a messy
state :)

Rune

* Re: Locale bikeshed time
  2014-07-26  9:38                               ` u-igbb
@ 2014-07-26 17:47                                 ` Szabolcs Nagy
  2014-07-26 18:23                                   ` Rich Felker
  2014-07-26 18:56                                   ` u-igbb
  0 siblings, 2 replies; 43+ messages in thread
From: Szabolcs Nagy @ 2014-07-26 17:47 UTC (permalink / raw)
  To: musl

* u-igbb@aetey.se <u-igbb@aetey.se> [2014-07-26 11:38:05 +0200]:
> On Sat, Jul 26, 2014 at 04:03:27AM -0400, Rich Felker wrote:
> > Well, 2/3 of the world's population is in India and China and they all
> > use ".", so I think that pretty much covers the question of which is
> > "more widely used".
> 
> Ah indeed. That's a sufficient evidence.
> 

world is about 7G
india+china is about 2.5G
that looks closer to 1/3 than to 2/3

but using anything other than '.' as the decimal point is broken


* Re: Locale bikeshed time
  2014-07-26 17:47                                 ` Szabolcs Nagy
@ 2014-07-26 18:23                                   ` Rich Felker
  2014-07-26 18:59                                     ` u-igbb
  2014-07-26 18:56                                   ` u-igbb
  1 sibling, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-26 18:23 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 07:47:06PM +0200, Szabolcs Nagy wrote:
> * u-igbb@aetey.se <u-igbb@aetey.se> [2014-07-26 11:38:05 +0200]:
> > On Sat, Jul 26, 2014 at 04:03:27AM -0400, Rich Felker wrote:
> > > Well, 2/3 of the world's population is in India and China and they all
> > > use ".", so I think that pretty much covers the question of which is
> > > "more widely used".
> > 
> > Ah indeed. That's a sufficient evidence.
> > 
> 
> world is about 7G
> india+china is about 2.5G
> that looks closer to 1/3 than to 2/3

Yes, sorry; I was thinking both were already closer to 2G and was
using an old idea (closer to 6G) for world population. So some more
work would be needed to get a good estimate.

> but using anything other than '.' as the decimal point is broken

Agreed. BTW if you support arbitrary radix characters, you should not
restrict it to ASCII; this then means the length in bytes of floating
point fields varies by locale (currently the only printf specifier
where either the contents OR length vary by locale is the nonstandard
one, %m) which affects asprintf (right now it's "broken" if it races
with setlocale and the format includes %m; I don't know if we care)
and the implementation of a lot of other stuff (like wprintf).

Rich

* Re: Locale bikeshed time
  2014-07-26 17:47                                 ` Szabolcs Nagy
  2014-07-26 18:23                                   ` Rich Felker
@ 2014-07-26 18:56                                   ` u-igbb
  2014-07-26 19:30                                     ` Rich Felker
  1 sibling, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-26 18:56 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 07:47:06PM +0200, Szabolcs Nagy wrote:
> > On Sat, Jul 26, 2014 at 04:03:27AM -0400, Rich Felker wrote:
> > > Well, 2/3 of the world's population is in India and China and they all
> > > use ".", so I think that pretty much covers the question of which is
> > > "more widely used".
> > 
> > Ah indeed. That's a sufficient evidence.

> world is about 7G
> india+china is about 2.5G
> that looks closer to 1/3 than to 2/3

Oops. I should have checked the numbers :)
Thanks for the notice.

> but using anything other than '.' as the decimal point is broken

I am unsure what/which context you mean (say application development,
user settings for the screen, user settings for editing documents to share,
or all scenarios at once?)

Or do you mean to second Rich's opinion that the applications' use of
locale facilities is so inconsistent that there is no use relying on
LC_NUMERIC?

(I do not have any bias towards any representation of the radix point but
I am happy to learn objective/verifiable arguments for one or another.)

This is hardly important though if (as Rich wrote) "C"/"POSIX" locale is
specified to use "." and given that musl already has a policy of disallowing
other characters.

I apparently missed the discussions which led to this policy,
sorry if I am beating a dead horse:

To prevent users from shooting themselves in the foot looks
considerate. Personally I would nevertheless prefer to have a possibility
to influence the radix dot character, at least as a locale variant
and take the consequences.

If anybody else feels for using "," then so be it in the corresponding
locale, as the default or as a variant - how much would it cost?

The generally "most international" locale I am aware of (eo_ZZ) happens
to need ",". If your population numbers are correct, this seems to be
a proper choice too - modulo broken locale use/deployment.

Actually I very much dislike software which expects that it knows my
situation better than myself and prevents me from doing what I need.

Regards,
Rune

* Re: Locale bikeshed time
  2014-07-26 18:23                                   ` Rich Felker
@ 2014-07-26 18:59                                     ` u-igbb
  2014-07-26 19:14                                       ` Rich Felker
  0 siblings, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-26 18:59 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 02:23:48PM -0400, Rich Felker wrote:
> > but using anything other than '.' as the decimal point is broken
> 
> Agreed. BTW if you support arbitrary radix characters, you should not
> restrict it to ASCII; this then means the length in bytes of floating

I think there are some international recommendations which actually
say "either '.' or ',' but not anything else".

Rune



* Re: Locale bikeshed time
  2014-07-26 18:59                                     ` u-igbb
@ 2014-07-26 19:14                                       ` Rich Felker
  0 siblings, 0 replies; 43+ messages in thread
From: Rich Felker @ 2014-07-26 19:14 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 08:59:04PM +0200, u-igbb@aetey.se wrote:
> On Sat, Jul 26, 2014 at 02:23:48PM -0400, Rich Felker wrote:
> > > but using anything other than '.' as the decimal point is broken
> > 
> > Agreed. BTW if you support arbitrary radix characters, you should not
> > restrict it to ASCII; this then means the length in bytes of floating
> 
> I think there are some international recommendations which actually
> say "either '.' or ',' but not anything else".

According to Stephane Chazelas on oss-security
(http://www.openwall.com/lists/oss-security/2014/07/21/12), glibc
allows '2' as the radix point...

This is yet another reason for musl's locale system having an "only
allow variations that are necessary" approach: it limits the impact of
malicious locale files if a user can somehow trick a privileged
process into using one.

Rich

* Re: Locale bikeshed time
  2014-07-26 18:56                                   ` u-igbb
@ 2014-07-26 19:30                                     ` Rich Felker
  2014-07-27  7:28                                       ` u-igbb
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-26 19:30 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 08:56:13PM +0200, u-igbb@aetey.se wrote:
> I apparently missed the discussions which led to this policy,
> sorry if I am beating a dead horse:

The original locale plan was based on the principle that the whole
locale system in C/POSIX is poorly designed, but that it's still the
basis for supporting native-language interfaces in software, and thus
that musl should eventually have the minimal locale support necessary
for enabling such usage. Using the locale system as a means for
arbitrary non-essential customization, support for legacy character
encodings, etc. is outside the scope of that.

> To prevent users from shooting themselves in the foot looks
> considerate. Personally I would nevertheless prefer to have a possibility
> to influence the radix dot character, at least as a locale variant
> and take the consequences.
> 
> If anybody else feels for using "," then so be it in the corresponding
> locale, as the default or as a variant - how much would it cost?
> 
> The generally "most international" locale I am aware of (eo_ZZ) happens
> to need ",". If your population numbers are correct, this seems to be
> a proper choice too - modulo broken locale use/deployment.
> 
> Actually I very much dislike software which expects that it knows my
> situation better than myself and prevents me from doing what I need.

And what about character encoding? Should we also support Latin-1? Or
ISO-2022? One thing I don't like about how the whole locale discussion
has gone is that, once one thing is added, demands for more and
more things that have much higher implementation and maintenance
costs, good technical reasons not to provide, and less and less
practical benefit, keep popping up. Radix points is definitely such an
item where the cost (in terms of bug/security risks, code size, making
state-free code stateful, maintenance, and just plain being ugly, ...)
is relatively high and the benefits are near-zero.

Rich

* Re: Locale bikeshed time
  2014-07-25 20:15                       ` u-igbb
  2014-07-25 22:32                         ` Rich Felker
@ 2014-07-26 20:43                         ` Rich Felker
  2014-07-27  7:51                           ` u-igbb
  1 sibling, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-26 20:43 UTC (permalink / raw)
  To: musl

On Fri, Jul 25, 2014 at 10:15:51PM +0200, u-igbb@aetey.se wrote:
> Replying to myself.
> 
> On Fri, Jul 25, 2014 at 11:06:49AM +0200, u-igbb@aetey.se wrote:
> > Returning to the naming. As language-based locales are named
> > after languages, it would be nice to name other kinds of locale
> > data after their "natural association" too. Then politically-bound
> > data could be put into the corresponding "territorial" family:
> > 
> >  language                ll[l][_TT]
> >  territory               TT[_ll[l]]
> 
> A bad idea, forget it. This would be open to misinterpretation
> (which key is "more fundamental" for a certain kind of data,
> shall it go to ll_TT or TT_ll ?)

I wasn't quite sure where to inject this reply into the thread, but
one thing I just remembered is that glibc (and the XSI option for
POSIX) has [.charset] as part of the standard form for locale names,
and all of glibc's usable locales end in ".UTF-8". So a user on a
mixed system is likely to have their locale vars set to include
".UTF-8" at the end, and therefore wouldn't get any localization when
running musl-linked programs with the locale names we've proposed.

The way I see it, we could either have the locale package provide
symlinks to all of the locales with ".UTF-8" on the end, or musl
itself could ignore anything starting with the first '.' in a locale
name. One downside of symlinks is that a locale could uselessly get
mapped twice if somebody happens to reference it by both names in
their locale vars. It also puts more of a configuration/complexity
burden on the installation. But it does keep policy out of libc and
saves a few bytes of code in libc.
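
The second option could be sketched roughly as follows. This is an
illustration only, assuming the rule is literally "drop everything from the
first '.' onward"; the function name locale_base_name and its signature are
hypothetical and do not describe musl's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not musl's actual code: copy `name` into `buf`,
 * dropping everything from the first '.' onward. Note that this also
 * drops an "@modifier" that follows the charset. Returns the length of
 * the base name, or 0 if it does not fit in `buf`. */
static size_t locale_base_name(const char *name, char *buf, size_t size)
{
    assert(name != NULL && buf != NULL);
    size_t n = strcspn(name, ".");   /* length up to first '.' or end */
    if (n >= size)
        return 0;                    /* no room for name plus terminator */
    memcpy(buf, name, n);
    buf[n] = '\0';
    return n;
}
```

With this rule "en_US.UTF-8" and "en_US.utf8" both reduce to "en_US", so
both spellings resolve to the same locale file and no per-spelling symlink
is needed.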

Any opinions on the matter?

Rich

* Re: Locale bikeshed time
  2014-07-26 19:30                                     ` Rich Felker
@ 2014-07-27  7:28                                       ` u-igbb
  0 siblings, 0 replies; 43+ messages in thread
From: u-igbb @ 2014-07-27  7:28 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 03:30:15PM -0400, Rich Felker wrote:
> The original locale plan was based on the principle that the whole
> locale system in C/POSIX is poorly designed, but that it's still the

Sure.

> basis for supporting native-language interfaces in software, and thus
> that musl should eventually have the minimal locale support necessary
> for enabling such usage. Using the locale system as a means for
> arbitrary non-essential customization, support for legacy character
> encodings, etc. is outside the scope of that.

Sounds pretty reasonable.

> > If anybody else feels for using "," then so be it in the corresponding
> > locale, as the default or as a variant - how much would it cost?

This (the cost) is as I see it the main consideration.

> > Actually I very much dislike software which expects that it knows my
> > situation better than myself and prevents me from doing what I need.
> 
> And what about character encoding? Should we also support Latin-1? Or

This is a different matter, not exactly about cost against perceived
benefit.

You _do_ offer a means to represent the user's data by unicode/utf-8,
so the desired function (e.g. to represent åäöü) is available.

It is different with the radix point as there is hardly another means for
helping the user to handle her grandmother's favourite radix character.

There is a difference between computing-related preferences like
"I like Latin-1 and refuse to recode my data to utf-8" and the
real life ones like "I grew up in Germany and despite 20 years in USA
still confuse 24 and 42 in English".

What you call "a single pixel" makes for some people the difference
between apparent legibility and gibberish. Note, one e.g. can be very
much used to interpreting '.' as a thousand delimiter and will be utterly
confused by 3.142 .

> ISO-2022? One thing I don't like about how the whole locale discussion
> has gone is that, once one thing is added, demands for more and
> more things that have much higher implementation and maintenance

I am not demanding more things, I just want to weigh cost against benefit.
If the cost is high we may have to drop even very useful features.

Without analysing the code I cannot say how dangerous / bothersome /
complex it would be to support one alternative character as radix dot and
weigh this against making about half the human population 1% happier. :)

(A third character for radix dot would certainly affect much less
than half the population).

It is you who has the say about the balance, I just call for objective
reasoning. Being generally upset about "feature proliferation" does
not help; features are what any software is about. Do not doubt, in my
eyes compactness and simplicity => robustness and safety are extremely
important features, as they are for you.

> practical benefit, keep popping up. The radix point is definitely such an
> item where the cost (in terms of bug/security risks, code size, making
> state-free code stateful, maintenance, and just plain being ugly, ...)
> is relatively high

It is your competence area.

> and the benefits are near-zero.

But this one is possibly to some degree outside of your personal
experience and you may _possibly_ have a biased impression.
That's why I dare to spend the precious time of yours and of mine
for this discussion. Thanks for listening and of course for musl as it
is - clean and useful.

Yours,
Rune

* Re: Locale bikeshed time
  2014-07-26 20:43                         ` Rich Felker
@ 2014-07-27  7:51                           ` u-igbb
  2014-07-27  8:00                             ` Rich Felker
  0 siblings, 1 reply; 43+ messages in thread
From: u-igbb @ 2014-07-27  7:51 UTC (permalink / raw)
  To: musl

On Sat, Jul 26, 2014 at 04:43:29PM -0400, Rich Felker wrote:
> I wasn't quite sure where to inject this reply into the thread, but
> one thing I just remembered is that glibc (and the XSI option for
> POSIX) has [.charset] as part of the standard form for locale names,
> and all of glibc's usable locales end in ".UTF-8". So a user on a
> mixed system is likely to have their locale vars set to include
> ".UTF-8" at the end, and therefore wouldn't get any localization when
> running musl-linked programs with the locale names we've proposed.

Ah yes this is regrettable. The transition from legacy charsets/encodings
has already happened and even with glibc .UTF-8 is a de-facto default,
thus "shouldn't" have to be indicated.

> The way I see it, we could either have the locale package provide
> symlinks to all of the locales with ".UTF-8" on the end, or musl
> itself could ignore anything starting with the first '.' in a locale
> name. One downside of symlinks is that a locale could uselessly get
> mapped twice if somebody happens to reference it by both names in
> their locale vars. It also puts more of a configuration/complexity
> burden on the installation. But it does keep policy out of libc and
> saves a few bytes of code in libc.

As an integrator I certainly appreciate if I can skip
making zillions of legacy links.
There is also a matter of spelling utf-8 Utf-8 UTF-8 utf8 UTF8 Utf8 utf_8
(did I forget some? :) which different distros/users may choose differently.

Debian Linux:

$ locale -a
C
C.UTF-8         <=====
en_US.utf8      <=====
POSIX
$

Given that the library implies utf-8, please ignore .anything
explicitly - this part of the name is meaningless for musl by design.

A packager can not fully imitate such behaviour even with a lot of links.

The rare cases when the user really means a different charset
but gets utf-8 are better handled by the user if/when encountered.

Rune

* Re: Locale bikeshed time
  2014-07-27  7:51                           ` u-igbb
@ 2014-07-27  8:00                             ` Rich Felker
  2014-07-27  8:24                               ` u-igbb
  0 siblings, 1 reply; 43+ messages in thread
From: Rich Felker @ 2014-07-27  8:00 UTC (permalink / raw)
  To: musl

On Sun, Jul 27, 2014 at 09:51:20AM +0200, u-igbb@aetey.se wrote:
> As an integrator I certainly appreciate if I can skip
> making zillions of legacy links.
> There is also a matter of spelling utf-8 Utf-8 UTF-8 utf8 UTF8 Utf8 utf_8
> (did I forget some? :) which different distros/users may choose differently.
> 
> Debian Linux:
> 
> $ locale -a
> C
> C.UTF-8         <=====
> en_US.utf8      <=====
> POSIX
> $
> 
> Given that the library implies utf-8, please ignore .anything
> explicitly - this part of the name is meaningless for musl by design.
> 
> A packager can not fully imitate such behaviour even with a lot of links.
> 
> The rare cases when the user really means a different charset
> but gets utf-8 are better handled by the user if/when encountered.

OK. I actually expected you to prefer symlinks, but if you prefer musl
automatically ignoring everything after the dot, I'm quite happy with
that approach and I'll probably go with it. Thanks for the feedback!

Rich


* Re: Locale bikeshed time
  2014-07-27  8:00                             ` Rich Felker
@ 2014-07-27  8:24                               ` u-igbb
  0 siblings, 0 replies; 43+ messages in thread
From: u-igbb @ 2014-07-27  8:24 UTC (permalink / raw)
  To: musl

On Sun, Jul 27, 2014 at 04:00:31AM -0400, Rich Felker wrote:
> OK. I actually expected you to prefer symlinks, but if you prefer musl
> automatically ignoring everything after the dot, I'm quite happy with
> that approach and I'll probably go with it. Thanks for the feedback!

This shows how important it is to talk to each other :)
Luckily you do that, thanks.

Rune



end of thread, other threads:[~2014-07-27  8:24 UTC | newest]

Thread overview: 43+ messages
2014-07-22 18:49 Locale bikeshed time Rich Felker
2014-07-22 20:10 ` u-igbb
2014-07-22 20:35   ` Rich Felker
2014-07-23  9:50     ` u-igbb
2014-07-23 16:39       ` Rich Felker
2014-07-23 19:25         ` u-igbb
2014-07-23 21:01           ` Rich Felker
2014-07-24 15:35             ` u-igbb
2014-07-24 16:01               ` Rich Felker
2014-07-24 19:24                 ` u-igbb
2014-07-24 20:15                 ` u-igbb
2014-07-24 22:02                   ` Rich Felker
2014-07-25  9:06                     ` u-igbb
2014-07-25 20:15                       ` u-igbb
2014-07-25 22:32                         ` Rich Felker
2014-07-26  7:25                           ` u-igbb
2014-07-26  8:03                             ` Rich Felker
2014-07-26  9:06                               ` Jens Gustedt
2014-07-26  9:25                                 ` Rich Felker
2014-07-26  9:38                               ` u-igbb
2014-07-26 17:47                                 ` Szabolcs Nagy
2014-07-26 18:23                                   ` Rich Felker
2014-07-26 18:59                                     ` u-igbb
2014-07-26 19:14                                       ` Rich Felker
2014-07-26 18:56                                   ` u-igbb
2014-07-26 19:30                                     ` Rich Felker
2014-07-27  7:28                                       ` u-igbb
2014-07-26 20:43                         ` Rich Felker
2014-07-27  7:51                           ` u-igbb
2014-07-27  8:00                             ` Rich Felker
2014-07-27  8:24                               ` u-igbb
2014-07-23 23:22         ` writeonce
2014-07-23 23:38           ` Rich Felker
2014-07-24  1:07             ` writeonce
2014-07-24  1:57               ` Rich Felker
2014-07-24  2:16                 ` writeonce
2014-07-24  2:24                   ` Rich Felker
2014-07-24  2:59                     ` writeonce
2014-07-22 20:17 ` Laurent Bercot
2014-07-22 20:36   ` Rich Felker
2014-07-23 22:03     ` Laurent Bercot
2014-07-23 22:12       ` Rich Felker
2014-07-24 15:38         ` u-igbb

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
