mailing list of musl libc
* Draft proposed locale changes
From: Rich Felker @ 2018-03-05 18:39 UTC
  To: musl


localeconv/LC_NUMERIC/LC_MONETARY

Each loaded locale needs an immutable lconv structure to represent
this data. It needs to be allocated with the locale (at locale loading
time) since localeconv() has no provision for failure, but we can wait
to populate it lazily, and we can put the code to populate it in
localeconv.c so that static-linked programs that don't use this
rarely-used interface don't have to pay for it. We could also omit
even allocating it (56 bytes on 32-bit archs, 96 on 64-bit) if
localeconv.o is not linked, but it's probably not worth the
special-casing code to do that.
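
A minimal sketch of the allocate-early, populate-lazily idea. The
struct and function names here are hypothetical stand-ins (not musl's
actual definitions), the defaults are just C-locale values, and
synchronization between threads is ignored:

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical stand-in for struct __locale_map; only the fields
 * relevant to this sketch are shown. */
struct locale_map_sketch {
	const void *map;     /* memory-mapped locale file */
	size_t map_size;
	int lconv_ready;     /* 0 until the first localeconv call */
	struct lconv lconv;  /* storage reserved at locale load time */
};

/* Fill in C-locale defaults. A real version would read the values
 * from the mapped locale file instead. */
static void populate_lconv(struct locale_map_sketch *lm)
{
	lm->lconv.decimal_point = ".";
	lm->lconv.thousands_sep = "";
	lm->lconv.grouping = "";
	lm->lconv.frac_digits = CHAR_MAX;
	lm->lconv.int_frac_digits = CHAR_MAX;
}

/* Cannot fail: the storage already exists, so nothing is allocated
 * here. The population code is only linked if this file is. */
struct lconv *localeconv_sketch(struct locale_map_sketch *lm)
{
	if (!lm->lconv_ready) {
		populate_lconv(lm);
		lm->lconv_ready = 1;
	}
	return &lm->lconv;
}
```

The second and later calls return the same pointer without touching
the data again, which is what lets callers treat it as immutable.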

The lconv structure should be part of struct __locale_map, not
struct __locale_struct, since it's a pure function of the data in the
memory-mapped locale file and not a function of how that data is
linked to a specific locale category. Putting it in __locale_struct
would just complicate setlocale and newlocale.

The obvious (but not terribly efficient) form for the data in the
locale file is to have each lconv field as a mo-level key, as in:

	msgid "int_frac_digits"
	msgstr "2"

A more compact form could pack them all into one, but then the order
becomes a hidden locale-file interface boundary/ABI.
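
To illustrate the one-key-per-field form, here is a rough sketch of
reading numeric fields by name. The lookup callback type, the table,
and demo_lookup are all assumptions made up for the example; the real
lookup would search the mapped mo file:

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical lookup: returns the msgstr for a msgid, or NULL. */
typedef const char *(*mo_lookup_fn)(const char *msgid);

/* One row per numeric lconv field, keyed by the field's own name. */
static const struct { const char *key; size_t off; } num_fields[] = {
	{ "int_frac_digits", offsetof(struct lconv, int_frac_digits) },
	{ "frac_digits",     offsetof(struct lconv, frac_digits) },
};

/* Store each numeric field found in the locale file. Fields with no
 * entry or an unusable value keep CHAR_MAX, meaning "unspecified". */
void load_numeric_fields(struct lconv *lc, mo_lookup_fn lookup)
{
	for (size_t i = 0; i < sizeof num_fields / sizeof *num_fields; i++) {
		char *field = (char *)lc + num_fields[i].off;
		const char *s = lookup(num_fields[i].key);
		char *end;
		long v;
		*field = CHAR_MAX;
		if (!s || !*s) continue;
		v = strtol(s, &end, 10);
		if (*end == 0 && v >= 0 && v < CHAR_MAX) *field = (char)v;
	}
}

/* Demo data only: a locale file that defines a single key. */
static const char *demo_lookup(const char *msgid)
{
	return strcmp(msgid, "int_frac_digits") ? NULL : "2";
}
```

Because each field is found by name, adding or reordering fields does
not change the file format, which is the property the packed form
would give up.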

For the string fields it's necessary that they each be in-place
strings in the mo file. grouping and mon_grouping also have the
special constraint that they need to vary by whether the arch uses
signed or unsigned plain-char (since CHAR_MAX has special meaning) so
the mo file needs to store both versions. That's ugly but I don't see
any good way around it. We can probably punt on this for now just by
not supporting grouping (i.e. only supporting locale definitions that
don't do grouping), since it's not implemented anyway.
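
A small sketch of why two versions are needed: the "no further
grouping" marker is CHAR_MAX, which is 127 with signed plain char and
255 with unsigned. The function names are hypothetical, and how the
two variants would be keyed in the mo file is left open:

```c
#include <assert.h>
#include <limits.h>

/* Pick whichever stored variant matches this arch's plain char. */
const char *pick_grouping(const char *signed_variant,
                          const char *unsigned_variant)
{
#if CHAR_MAX == SCHAR_MAX
	return signed_variant;
#else
	return unsigned_variant;
#endif
}

/* Return 1 if the grouping string contains the CHAR_MAX marker,
 * i.e. grouping stops rather than repeating the last group size. */
int grouping_stops(const char *g)
{
	for (; *g; g++)
		if (*g == CHAR_MAX) return 1;
	return 0;
}
```

For "one group of 3, then no more grouping" the file would carry
"\3\x7f" for signed-char archs and "\3\xff" for unsigned-char archs,
and pick_grouping returns the one whose second byte equals CHAR_MAX.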

If we support decimal_point, it should not go through the localeconv
mechanism since it would always be needed by printf and strtod.
Instead __get_locale should probe it right away and set a 1-bit flag
in the __locale_map structure for these functions to consume (1-bit
based on previous research that [.,] are the only values).
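
Roughly, the probe could look like the following. This is a sketch
under assumptions: the struct, the callback type, and the demo lookups
are invented for illustration, and only the "decimal_point" key name
follows the lconv field name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup: returns the msgstr for a msgid, or NULL. */
typedef const char *(*radix_lookup_fn)(const char *msgid);

/* Hypothetical slice of the locale map: one bit selects the radix. */
struct radix_map_sketch {
	unsigned comma_radix : 1;  /* 0 means '.', 1 means ',' */
};

/* Run once at locale load time, so printf and strtod never need the
 * localeconv machinery. Anything other than "," falls back to '.'. */
void probe_decimal_point(struct radix_map_sketch *lm, radix_lookup_fn lookup)
{
	const char *s = lookup("decimal_point");
	lm->comma_radix = (s && s[0] == ',' && s[1] == 0);
}

/* What printf/strtod would consume. */
static inline int radix_char(const struct radix_map_sketch *lm)
{
	return lm->comma_radix ? ',' : '.';
}

/* Demo data only. */
static const char *demo_comma_lookup(const char *msgid)
{
	return strcmp(msgid, "decimal_point") ? NULL : ",";
}

static const char *demo_missing_lookup(const char *msgid)
{
	(void)msgid;
	return NULL;
}
```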



nl_langinfo/LC_TIME/etc.

Eliminate the currently-present wrong values for ERA* and related
LC_TIME stuff; that gets rid of all ambiguous translation keys except
"May", which is both the full and the abbreviated month name. Bikeshed
up some alternate key for May.



strerror/LC_MESSAGES

Not sure yet. One radical idea I kinda like is removing all the
English-phrase messages from libc core and just having strerror
produce strings like "ENOENT", "EPERM", etc. in the C locale. This
seems to be the only option that wouldn't either moderately increase
libc size or require translation files to match the exact current text
in the builtin English libc messages. Users who want the current
messages would then need an "en" locale with contents like:

	msgid "ENOENT"
	msgstr "No such file or directory"
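
As a sketch of what this would mean in code (all names hypothetical,
only two error codes shown, and the lookup callback standing in for
the LC_MESSAGES translation lookup):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical translation lookup: msgstr for a msgid, or NULL. */
typedef const char *(*msg_lookup_fn)(const char *msgid);

/* The only strings libc itself would carry: the symbolic names. */
static const char *errno_name(int e)
{
	switch (e) {
	case ENOENT: return "ENOENT";
	case EPERM:  return "EPERM";
	default:     return "EUNKNOWN"; /* placeholder for this sketch */
	}
}

/* The name doubles as the translation key. With no locale loaded, or
 * no entry for the key, the bare name is returned. */
const char *strerror_sketch(int e, msg_lookup_fn lookup)
{
	const char *name = errno_name(e);
	const char *tr = lookup ? lookup(name) : 0;
	return tr ? tr : name;
}

/* Demo data only: an "en" locale with a single entry. */
static const char *demo_en_lookup(const char *msgid)
{
	return strcmp(msgid, "ENOENT") ? 0 : "No such file or directory";
}
```

With a null lookup (the C locale) strerror_sketch(ENOENT, ...) gives
"ENOENT"; with demo_en_lookup it gives the familiar English text.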

If we don't want this, the possible solutions look like one of:

1. Prepending the error code and a null byte (e.g. "ENOENT\0") to all
the existing error strings, then skipping past it if the translation
was not found.

2. Putting a second version of strerror in locale_map.c with the E*
names in it, so it's only linked if you use locale. I strongly dislike
this approach because it greatly increases the marginal size cost of
doing the right thing (calling setlocale) and imposes the cost even if
you don't use strerror at all (only setlocale).

3. Accepting that translations need to match (and perpetually be
updated to match) error strings in musl __strerror.h. I don't like
this much either.

So I think it should be between option 1 and "option zero" (the
radical idea above of stripping the English text). Option zero
decreases the size of libc by nearly 1k (removing messages) but
changes the behavior. Option 1 increases the size of libc by about 1k.
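
For comparison, option 1 could be sketched like this. The names are
hypothetical and only one table entry is shown; the point is just the
"name, null byte, English text" layout and the skip on a failed
lookup:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical translation lookup: msgstr for a msgid, or NULL. */
typedef const char *(*tr_lookup_fn)(const char *msgid);

/* Each entry is the code name, a null byte, then the English text.
 * Read as a C string, the entry is just "ENOENT". */
static const char enoent_entry[] = "ENOENT\0No such file or directory";

/* Use the code name as the translation key. If no translation is
 * found, skip past the name and its null byte to the English text. */
const char *message_for(const char *entry, tr_lookup_fn lookup)
{
	const char *tr = lookup ? lookup(entry) : 0;
	if (tr) return tr;
	return entry + strlen(entry) + 1;
}

/* Demo data only: a locale that translates ENOENT. */
static const char *demo_tr_lookup(const char *msgid)
{
	return strcmp(msgid, "ENOENT") ? 0 : "(translated ENOENT text)";
}
```

The extra name plus null byte on every entry is where the roughly 1k
size increase comes from.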



LC_COLLATE

No specific proposal yet. We need a data structure to map characters
and sequences of characters to collating elements. Obviously the mo
file's lookups could be used directly (O(log n), improved avg case if
we ever add hash table support) but they might be heavier than we
want. The alternative would be having a gigantic string in the mo file
that's just "compiled" collation table data, but unless it's
well-designed that seems like an undesirable permanent interface
boundary.
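
To make the O(log n) option concrete, here is a sketch of a binary
search from a character or character sequence to a collating element.
The entry layout, the demo table, and the weights are all made up for
illustration and are not a proposed format:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical collation entry. */
struct coll_entry {
	const char *seq;  /* character or multi-character sequence */
	unsigned weight;  /* made-up primary weight */
};

/* Demo data only. Must be sorted in strcmp order of seq. */
static const struct coll_entry demo_table[] = {
	{ "a", 10 }, { "b", 20 }, { "c", 30 }, { "ch", 35 }, { "d", 40 },
};

/* Binary search over the sorted keys: O(log n) string comparisons. */
const struct coll_entry *coll_find(const struct coll_entry *t, size_t n,
                                   const char *seq)
{
	size_t lo = 0, hi = n;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int c = strcmp(seq, t[mid].seq);
		if (!c) return &t[mid];
		if (c < 0) hi = mid;
		else lo = mid + 1;
	}
	return 0;
}
```

A caller would have to try the longest candidate sequence first
("ch" before "c"), which is part of why per-character lookups like
this might be heavier than we want.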



* Re: Draft proposed locale changes
From: Konstantin P. @ 2018-03-05 18:42 UTC
  To: musl

Can you publish an official po file for musl after the proposed changes?


* Re: Draft proposed locale changes
From: Rich Felker @ 2018-03-05 18:54 UTC
  To: musl

On Mon, Mar 05, 2018 at 09:42:49PM +0300, Konstantin P. wrote:
> Can you publish official po file for musl after proposed changes?

Do you mean a po template (pot) file showing what needs to be
translated? Or an example po file for a particular locale (maybe just
English preserving the error message text, if we go with "option zero"
and strip the English error text out of libc)?

Rich


* Re: Draft proposed locale changes
From: Konstantin P. @ 2018-03-05 20:00 UTC
  To: musl

A raw pot file. I want to include it in my musl-locales project:
https://github.com/rilian-la-te/musl-locales


* Re: Draft proposed locale changes
From: Rich Felker @ 2018-03-05 21:25 UTC
  To: musl

On Mon, Mar 05, 2018 at 11:00:41PM +0300, Konstantin P. wrote:
> Raw pot file. I want to include it to my musl-locales project:
> https://github.com/rilian-la-te/musl-locales

OK. FWIW I'm working on some script infrastructure to extract the
non-LC_MESSAGES locale data from Unicode CLDR, so with that available,
the only real translation work will be error messages, I think. I'm
not sure whether you'll want the other stuff (LC_TIME, etc.) in the
pot file or not since the definitions can be provided automatically
from other sources rather than handed off to translators.

Rich


* Re: Draft proposed locale changes
From: Konstantin P. @ 2018-03-06  1:54 UTC
  To: musl

If you are able to provide it without translators' aid, that is good.
But I would like a full pot file, if you can provide one.

