mailing list of musl libc
* Bikeshed invitation for nl_langinfo ambiguities
@ 2017-11-11  2:06 Rich Felker
  2017-11-26 23:19 ` A. Wilcox
  0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2017-11-11  2:06 UTC
  To: musl

I've found 2 ambiguous-string-to-translate bugs in musl's locale
support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and T_FMT
and ERA_T_FMT ("%H:%M:%S"), have the same values in the C locale, and
thus can't be translated to distinct values like they need to be in
other locales.
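
To make the ambiguity concrete, here is a minimal sketch in C. It assumes
a catalog keyed by the untranslated C-locale text, which is how the thread
describes the existing scheme. The names toy_catalog and toy_lookup and the
Spanish sample strings are hypothetical illustrations, not musl code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical catalog entry: untranslated C-locale text -> translation. */
struct entry {
	const char *msgid;
	const char *msgstr;
};

/* Toy Spanish catalog. "May" can appear only once as a key, so one
 * translation has to serve both MON_5 (full name, "mayo") and
 * ABMON_5 (abbreviation, "may"). */
static const struct entry toy_catalog[] = {
	{ "January", "enero" },
	{ "Jan",     "ene"   },
	{ "May",     "mayo"  },
};

/* Returns the translation, or the key itself when none is present. */
static const char *toy_lookup(const char *msgid)
{
	assert(msgid != NULL);
	for (size_t i = 0; i < sizeof toy_catalog / sizeof toy_catalog[0]; i++)
		if (strcmp(toy_catalog[i].msgid, msgid) == 0)
			return toy_catalog[i].msgstr;
	return msgid;
}
```

"January" and "Jan" are distinct keys, so they translate independently.
Both MON_5 and ABMON_5 reach toy_lookup("May") and therefore receive the
same result, whichever of the two the translator had in mind.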

Any opinions on the cleanest way to handle this? There are various
hacks I could do at the implementation level, like adding a prefix
character to one or the other then applying +1 to the output string.
But whatever solution we choose becomes a public interface for
translators, so it should be something that's not horribly ugly.

So.... it's bikeshed time.

Rich



* Re: Bikeshed invitation for nl_langinfo ambiguities
  2017-11-11  2:06 Bikeshed invitation for nl_langinfo ambiguities Rich Felker
@ 2017-11-26 23:19 ` A. Wilcox
  2017-11-27  1:07   ` Rich Felker
  2018-03-03  5:08   ` Rich Felker
  0 siblings, 2 replies; 7+ messages in thread
From: A. Wilcox @ 2017-11-26 23:19 UTC
  To: musl

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 10/11/17 20:06, Rich Felker wrote:
> I've found 2 ambiguous-string-to-translate bugs in musl's locale 
> support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> locale, and thus can't be translated to distinct values like they
> need to be in other locales.
> 
> Any opinions on the cleanest way to handle this? There are various 
> hacks I could do at the implementation level, like adding a prefix 
> character to one or the other then applying +1 to the output
> string. But whatever solution we choose becomes a public interface
> for translators, so it should be something that's not horribly
> ugly.

I would personally recommend actually using the enum values as the
strings to translate.  _("MON_5"), _("ABMON_5"), etc; this is
non-ambiguous, easily understandable and describable for translators,
and does not require weird hacks at the implementation or ABI level.

Of course, then a "C" / "POSIX" strings file must be present.  But
this is, in my opinion, a very small sacrifice to ensure full purity
and ease of translation.

A simple " " with a note it is intentional /could/ work, but then
every locale file has to have an extra " " for those two values.  This
would additionally affect any additional duplicate strings that are
found when musl is translated to other languages.  If there's just ten
of these, and musl is ported to just 100 languages (glibc has over
200), that's already 10 kB wasted on a silly hack.  It is also more
brittle.

> So.... it's bikeshed time.
> 
> Rich
> 

Yellow, definitely pastel yellow.  It's a nice muted colour that is
warm and inviting without being too striking or in-your-face.  The
perfect colour for any bikeshed!

Best,
- --arw

- -- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJaG0vnAAoJEMspy1GSK50UjGsQAKiuZi+AehXhCgZpzM/ZxKP9
UB4UtvnG9u6EyEbEI+2lpUcftoP+gDLMYyiObjIPH8o5/8v0jlQqFzWgd7E9Mdoa
fCgicD8iozr5rDPdF8aEpDOlks97leGErXTjVdozH4PRgHdU9XranzCEKD0rpAiI
BI0Ti2CaKOADRoqb5+ZCsL+3giljD2I1PTEkahSD4NeOd28NKAAIY1PtvqkOg5JX
XR4CyGA/XRUB/bbyGcfD/ASMpUltw1Jc57xXryvfeo5SHkmJ7/e/KCZmIrZZO0sO
lmsEU3OqbrE8/1PlDLMRLn9ty/DH241FWxDEsktTjLb09GgNjlv7W3Um2IdXpM/E
EkjZiRuudW0Wr6rQamaOpgJTmpDZSd0MrlNDnib8lFNrP6I7AnIserDSeRtUAVGG
pqRWtL2QxEnmZVRH24L71Z7g6BNOFmwIBKtQrYzvn4oUwijUnP23ZYJ0l837F6rR
VyhgklTReGjknvxDk2lcXvAnyjMRVFGMDFBOmMeVv1StcN0fIiro4CQh3Si8MFqS
nn2u0qBiziLx956MpjYJ4WezzesPJYsTBW77nb1YssPm+sP65aZ9hgaJh56jeESC
ruPi7wwMoqN9hZldzCWEMao7zOxpq/IX40T2YJtwBXfFdpOU99OS3jr09vQI3xR6
I6U+YiwwcCMFxU4IkjD4
=DwvU
-----END PGP SIGNATURE-----



* Re: Bikeshed invitation for nl_langinfo ambiguities
  2017-11-26 23:19 ` A. Wilcox
@ 2017-11-27  1:07   ` Rich Felker
  2017-11-27  2:57     ` A. Wilcox
  2018-03-03  5:08   ` Rich Felker
  1 sibling, 1 reply; 7+ messages in thread
From: Rich Felker @ 2017-11-27  1:07 UTC
  To: musl

On Sun, Nov 26, 2017 at 05:19:07PM -0600, A. Wilcox wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> On 10/11/17 20:06, Rich Felker wrote:
> > I've found 2 ambiguous-string-to-translate bugs in musl's locale 
> > support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> > T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> > locale, and thus can't be translated to distinct values like they
> > need to be in other locales.
> > 
> > Any opinions on the cleanest way to handle this? There are various 
> > hacks I could do at the implementation level, like adding a prefix 
> > character to one or the other then applying +1 to the output
> > string. But whatever solution we choose becomes a public interface
> > for translators, so it should be something that's not horribly
> > ugly.
> 
> I would personally recommend actually using the enum values as the
> strings to translate.  _("MON_5"), _("ABMON_5"), etc; this is
> non-ambiguous, easily understandable and describable for translators,
> and does not require weird hacks at the implementation or ABI level.

This is certainly one possibility, but it does result in embedding a
number of "useless" strings that are never used themselves, only as
translation keys, in the binary. One nice property of it (especially
if we did the same for strerror keys) is that it eliminates the need
for translation files to care about changes in the text in musl.

> Of course, then a "C" / "POSIX" strings file must be present.  But
> this is, in my opinion, a very small sacrifice to ensure full purity
> and ease of translation.

This is of course not acceptable. It's solvable just by having the
"translations" embedded (in mo format or otherwise) in the source, but
then it's impossible for __lctrans to be a dummy identity-map in
programs that don't link setlocale/newlocale. I have in mind a way we
could potentially avoid this: passing keys like "ABMON_5" to
__lctrans, and if it returns back the key (which is what happens with
the stub implementation or with no translation present), use the
builtin C locale strings instead.
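
A minimal sketch of that fallback in C, under stated assumptions: the
lookup takes a key and returns the same pointer when it has no
translation, as described above for the stub. stub_lctrans, toy_lctrans,
item_text and the sample translation are hypothetical stand-ins and do
not reflect the real __lctrans signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef const char *(*lookup_fn)(const char *);

/* Stand-in for the stub used when no locale code is linked:
 * the key comes back unchanged. */
static const char *stub_lctrans(const char *key)
{
	return key;
}

/* Stand-in for a loaded catalog that translates exactly one key. */
static const char *toy_lctrans(const char *key)
{
	return strcmp(key, "ABMON_5") == 0 ? "may" : key;
}

/* Look the item up by its symbolic key. If the lookup hands the key
 * back, there is no translation, so use the builtin C-locale text. */
static const char *item_text(lookup_fn lookup, const char *key,
                             const char *c_text)
{
	assert(lookup != NULL && key != NULL && c_text != NULL);
	const char *s = lookup(key);
	return s == key ? c_text : s;
}
```

With the stub, item_text(stub_lctrans, "ABMON_5", "May") yields the
builtin "May" without touching any file. With the toy catalog, ABMON_5
gets its own translation while MON_5 still falls back to "May".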

> A simple " " with a note it is intentional /could/ work, but then
> every locale file has to have an extra " " for those two values.  This
> would additionally affect any additional duplicate strings that are
> found when musl is translated to other languages.  If there's just ten
> of these, and musl is ported to just 100 languages (glibc has over
> 200), that's already 10 kB wasted on a silly hack.  It is also more
> brittle.

I don't follow; there are only two duplicate strings and they are
"May" and "%H:%M:%S". The number does not grow with the number of
translations because it's a property of the untranslated strings not
the translated ones.

Rich



* Re: Bikeshed invitation for nl_langinfo ambiguities
  2017-11-27  1:07   ` Rich Felker
@ 2017-11-27  2:57     ` A. Wilcox
  2017-11-27  5:09       ` Rich Felker
  0 siblings, 1 reply; 7+ messages in thread
From: A. Wilcox @ 2017-11-27  2:57 UTC
  To: musl

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 26/11/17 19:07, Rich Felker wrote:
>> I would personally recommend actually using the enum values as
>> the strings to translate.  _("MON_5"), _("ABMON_5"), etc; this
>> is non-ambiguous, easily understandable and describable for
>> translators, and does not require weird hacks at the
>> implementation or ABI level.
> 
> This is certainly one possibility, but it does result in embedding
> a number of "useless" strings that are never used themselves, only
> as translation keys, in the binary. One nice property of it
> (especially if we did the same for strerror keys) is that it
> eliminates the need for translation files to care about changes in
> the text in musl.


That was the idea, yes.  This would work for *all* translatable
strings, not just the nl_langinfo ones.


>> Of course, then a "C" / "POSIX" strings file must be present.
>> But this is, in my opinion, a very small sacrifice to ensure full
>> purity and ease of translation.
> 
> This is of course not acceptable.


"Of course"?  Why not?  The reason this wouldn't be acceptable is not
obvious to me.


> I have in mind a way we could potentially avoid this: passing keys
> like "ABMON_5" to __lctrans, and if it returns back the key (which
> is what happens with the stub implementation or with no translation
> present), use the builtin C locale strings instead.


That would work.


> I don't follow; there are only two duplicate strings and they are 
> "May" and "%H:%M:%S". The number does not grow with the number of 
> translations because it's a property of the untranslated strings
> not the translated ones.



If you are returning (string+1), then all strings need to have the " "
at the beginning, or else you are going to return "ai" instead of
"mai" for French and so on.


Best,
- --arw
- -- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJaG38SAAoJEMspy1GSK50UblkQAMqHOuqFTrgbV6k5+6scBc2y
a6IcldAupt6swk1b/+Bzro7n3DVcVhoBOV97ofyPSlwT1OOzFCoc6ljeqR1+TUgZ
ZaFyOPICpS/1xnBqiipiQgQjn0a9/ZH7fDeb54aP3F0Jp18hrjwpvFduWcNItx/D
AonDoUGecrahwwBYUYjk7GJ6Hp2hqeVGPLTpNWKwDkXTvN1CQ9HTCTRsBdaBkuFh
CevzYbt4m+BQblcr4IdCHlKr5Yenb8s5aHCvV8ag/PhC1uFD6/MFu6kgKQbyqjfr
ji9mD4xoX/l0rrUXqPGpdLRs5OwCpda42lGxUu9gGIUDrMt3v3YAPXTbsePTa6H6
+9cPNd562OqTqoCDiG6V7pWSapyDK+eR7swwnqNDe1Z8AyHZN3/2tJ3Gr/jH98Ko
Vvjyv4sybiAgqdiu5Q/v1kKX0hVafOz7KNRKQuFt9kiUG8bj6S714CYg3WCFW7uk
30KdhI02Iy888y/nZIn6ApzmCsmx42ZZeG16CqgsbizXuyZK1kndXSaPuE7TG7Ot
0vxiz5B1Zv0ftqgEMaFeG1yqNpCPINhyYRjWMMohfrm9hi7gXQEtYWb4o+ibmb52
I/iCYdvRCcfPll48p3MNIc3w8zUDVs2vF86WQhX0nrWNnwI6urTmOdnC3L5fM4zz
CzCoHkaHJIj2HEPE0V2o
=MtiD
-----END PGP SIGNATURE-----



* Re: Bikeshed invitation for nl_langinfo ambiguities
  2017-11-27  2:57     ` A. Wilcox
@ 2017-11-27  5:09       ` Rich Felker
  0 siblings, 0 replies; 7+ messages in thread
From: Rich Felker @ 2017-11-27  5:09 UTC
  To: musl

On Sun, Nov 26, 2017 at 08:57:25PM -0600, A. Wilcox wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> On 26/11/17 19:07, Rich Felker wrote:
> >> I would personally recommend actually using the enum values as
> >> the strings to translate.  _("MON_5"), _("ABMON_5"), etc; this
> >> is non-ambiguous, easily understandable and describable for
> >> translators, and does not require weird hacks at the
> >> implementation or ABI level.
> > 
> > This is certainly one possibility, but it does result in embedding
> > a number of "useless" strings that are never used themselves, only
> > as translation keys, in the binary. One nice property of it
> > (especially if we did the same for strerror keys) is that it
> > eliminates the need for translation files to care about changes in
> > the text in musl.
> 
> That was the idea, yes.  This would work for *all* translatable
> strings, not just the nl_langinfo ones.
> 
> >> Of course, then a "C" / "POSIX" strings file must be present.
> >> But this is, in my opinion, a very small sacrifice to ensure full
> >> purity and ease of translation.
> > 
> > This is of course not acceptable.
> 
> "Of course"?  Why not?  The reason this wouldn't be acceptable is not
> obvious to me.

It basically means there's no such thing as "truly statically linked"
binaries anymore, and all programs need to search for and open the "C
locale translation file" at program startup time, resulting in a number
of syscalls. None of this would be remotely acceptable to a large
portion of musl's userbase, including myself.

If it's not clear why, just observe that functions like nl_langinfo,
strftime, etc. have no way to fail for arbitrary reasons when the
input is valid. The only way to ensure that they can't fail is to have
the data they need already loaded at the time the locale is set, which
for the default C locale is program start time. Normally/currently
this is achieved just by having the strings in the program text
segment.

> > I have in mind a way we could potentially avoid this: passing keys
> > like "ABMON_5" to __lctrans, and if it returns back the key (which
> > is what happens with the stub implementation or with no translation
> > present), use the builtin C locale strings instead.
> 
> That would work.
> 
> > I don't follow; there are only two duplicate strings and they are 
> > "May" and "%H:%M:%S". The number does not grow with the number of 
> > translations because it's a property of the untranslated strings
> > not the translated ones.
> 
> If you are returning (string+1), then all strings need to have the " "
> at the beginning, or else you are going to return "ai" instead of
> "mai" for French and so on.

I see. I was not intending to include the prefix in the translated
strings, just in the key.
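
One way to read "prefix only in the key" is sketched below in C. The
prefix character '.', the function names, and the sample translation are
all hypothetical. The only behaviour assumed from the thread is that an
untranslated lookup returns its argument unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef const char *(*lookup_fn)(const char *);

/* No catalog loaded: the key comes back unchanged. */
static const char *stub_lookup(const char *key)
{
	return key;
}

/* Toy catalog: the translator sees the key ".May" and supplies a
 * translation that carries no prefix of its own. */
static const char *toy_lookup(const char *key)
{
	return strcmp(key, ".May") == 0 ? "may" : key;
}

/* The key carries one prefix byte to keep it distinct from "May".
 * Skip that byte only when the key itself came back untranslated. */
static const char *prefixed_item(lookup_fn lookup, const char *prefixed_key)
{
	assert(lookup != NULL && prefixed_key != NULL && prefixed_key[0] != '\0');
	const char *s = lookup(prefixed_key);
	return s == prefixed_key ? s + 1 : s;
}
```

Because the +1 is applied only on the untranslated path, a translated
value such as "may" is returned whole rather than as "ay".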

Rich



* Re: Bikeshed invitation for nl_langinfo ambiguities
  2017-11-26 23:19 ` A. Wilcox
  2017-11-27  1:07   ` Rich Felker
@ 2018-03-03  5:08   ` Rich Felker
  2018-03-05 17:10     ` Rich Felker
  1 sibling, 1 reply; 7+ messages in thread
From: Rich Felker @ 2018-03-03  5:08 UTC
  To: musl

On Sun, Nov 26, 2017 at 05:19:07PM -0600, A. Wilcox wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> On 10/11/17 20:06, Rich Felker wrote:
> > I've found 2 ambiguous-string-to-translate bugs in musl's locale 
> > support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> > T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> > locale, and thus can't be translated to distinct values like they
> > need to be in other locales.
> > 
> > Any opinions on the cleanest way to handle this? There are various 
> > hacks I could do at the implementation level, like adding a prefix 
> > character to one or the other then applying +1 to the output
> > string. But whatever solution we choose becomes a public interface
> > for translators, so it should be something that's not horribly
> > ugly.
> 
> I would personally recommend actually using the enum values as the
> strings to translate.  _("MON_5"), _("ABMON_5"), etc; this is
> non-ambiguous, easily understandable and describable for translators,
> and does not require weird hacks at the implementation or ABI level.

I think this may be the nicest approach, despite being an incompatible
change from the existing system, which apparently doesn't matter and
isn't being used or people would have noticed that "May" can't be
translated right.

> Of course, then a "C" / "POSIX" strings file must be present.  But
> this is, in my opinion, a very small sacrifice to ensure full purity
> and ease of translation.

As noted before, obviously this isn't acceptable. We could drop a .mo
file blob in the musl langinfo.c, but I think it might make more sense
to just use different code paths for translated vs nontranslated case.
Then we could just synthesize the keys (ABMON_*, MON_*, ABDAY_*,
DAY_*) to pass into LCTRANS() rather than having a table of them all
expanded out. I might change my mind when actually working out how the
code would look, though.

An alternative I thought about would be just having translations for
"ABMON", "MON", etc. that produce nul-delimited multistrings, but then
the data file encodes an assumption that musl will do the O(n) (albeit
small n) multistring scan for the requested item, and I don't think
it's nice encoding assumptions that limit efficiency or force a
particular size/efficiency tradeoff.
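
For reference, the multistring scan in question would look roughly like
this in C. It is only a sketch: toy_abmon (with Spanish abbreviations
for the first five months) and multistring_item are hypothetical names,
and the data is assumed to end with an empty string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical translation of the single key "ABMON": the items are
 * packed back to back, each terminated by NUL. The literal's implicit
 * trailing NUL adds an empty string that marks the end. */
static const char toy_abmon[] = "ene\0feb\0mar\0abr\0may\0";

/* Return the n-th item (0-based), or "" if the multistring has fewer
 * items. Reaching item n means stepping over n strings: O(n). */
static const char *multistring_item(const char *ms, size_t n)
{
	assert(ms != NULL);
	while (n-- > 0) {
		if (*ms == '\0')
			return "";          /* hit the end marker early */
		ms += strlen(ms) + 1;       /* skip one item and its NUL */
	}
	return ms;
}
```

A data file in this form cannot be indexed directly, so any consumer
would have to perform the same linear walk.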

Rich



* Re: Bikeshed invitation for nl_langinfo ambiguities
  2018-03-03  5:08   ` Rich Felker
@ 2018-03-05 17:10     ` Rich Felker
  0 siblings, 0 replies; 7+ messages in thread
From: Rich Felker @ 2018-03-05 17:10 UTC
  To: musl

On Sat, Mar 03, 2018 at 12:08:54AM -0500, Rich Felker wrote:
> On Sun, Nov 26, 2017 at 05:19:07PM -0600, A. Wilcox wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
> > 
> > On 10/11/17 20:06, Rich Felker wrote:
> > > I've found 2 ambiguous-string-to-translate bugs in musl's locale 
> > > support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> > > T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> > > locale, and thus can't be translated to distinct values like they
> > > need to be in other locales.
> > > 
> > > Any opinions on the cleanest way to handle this? There are various 
> > > hacks I could do at the implementation level, like adding a prefix 
> > > character to one or the other then applying +1 to the output
> > > string. But whatever solution we choose becomes a public interface
> > > for translators, so it should be something that's not horribly
> > > ugly.
> > 
> > I would personally recommend actually using the enum values as the
> > strings to translate.  _("MON_5"), _("ABMON_5"), etc; this is
> > non-ambiguous, easily understandable and describable for translators,
> > and does not require weird hacks at the implementation or ABI level.
> 
> I think this may be the nicest approach, despite being an incompatible
> change from the existing system, which apparently doesn't matter and
> isn't being used or people would have noticed that "May" can't be
> translated right.

One really ugly thing here is that the POSIX key for weekdays is
"highly unconventional" - ABDAY_1/DAY_1 is Sunday and ABDAY_7/DAY_7 is
Saturday. Even the Unicode CLDR noticed this nonsense and used
"sun"..."sat" as the keys rather than using numbers so as to be
unambiguous.
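
The numbering can be seen from the standard interface itself. The small
C sketch below maps a struct tm style weekday (0 = Sunday) onto the
POSIX items; weekday_name is a hypothetical helper, and the explicit
table avoids assuming that the DAY_* constants are consecutive.

```c
#include <assert.h>
#include <langinfo.h>

/* POSIX numbering: DAY_1 is Sunday and DAY_7 is Saturday. */
static const nl_item day_items[7] = {
	DAY_1, DAY_2, DAY_3, DAY_4, DAY_5, DAY_6, DAY_7
};

/* wday follows the tm_wday convention: 0 = Sunday ... 6 = Saturday. */
static const char *weekday_name(int wday)
{
	assert(wday >= 0 && wday <= 6);
	return nl_langinfo(day_items[wday]);
}
```

In a program that never calls setlocale, so that the C locale is in
effect, weekday_name(0) is "Sunday" and weekday_name(6) is "Saturday".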

> > Of course, then a "C" / "POSIX" strings file must be present.  But
> > this is, in my opinion, a very small sacrifice to ensure full purity
> > and ease of translation.
> 
> As noted before, obviously this isn't acceptable. We could drop a .mo
> file blob in the musl langinfo.c, but I think it might make more sense
> to just use different code paths for translated vs nontranslated case.

I did some simple estimates with a toy .po/.mo file, and it looks like
either of those approaches is going to more-than-double the size of
langinfo.o, and make it a lot more complex. Given that "Sun".."Sat"
are nicer keys for days anyway, I'm leaning back towards sticking with
what we have and just adding a special case for "May". The other
ambiguity is one of the ERA_* formats, which we're not even doing
right now anyway; they're "not available in the POSIX locale"
according to XBD 7.3.5 LC_TIME, so as I read it they should return ""
(not the corresponding non-era string) in the C/POSIX locale, and only
return something else if they're defined for the locale. Eventually,
we should probably look them up with mo keys like "era_d_fmt", etc.
but unless/until we properly support them, the lookups for them should
just be removed.

> Then we could just synthesize the keys (ABMON_*, MON_*, ABDAY_*,
> DAY_*) to pass into LCTRANS() rather than having a table of them all
> expanded out. I might change my mind when actually working out how the
> code would look, though.

I started working on a nice means of doing this synthesis - having a
table like the existing c_time etc. but contents like:

	"ABDAY_1\0\0\0\0\0\0\0"
	"DAY_1\0\0\0\0\0\0\0"
	"ABMON_1\0\0\0\0\0\0\0\0\0\0\0\0"
	"MON_1\0\0\0\0\0\0\0\0\0\0\0\0"
	...

where, when a zero-length entry is hit, the last non-zero-length one
seen gets used as a basis for synthesis. But it still didn't seem
possible to avoid significant increase in code size and complexity.
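
A rough C sketch of that synthesis, for illustration only. It assumes
each group's first key ends in "_1" and that every empty entry stands
for the next number in the same group. toy_keys, TOY_KEY_COUNT and
synth_key are hypothetical names, and this is not the code that was
being drafted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* 7 + 7 + 12 + 12 entries; only the first key of each group is stored. */
static const char toy_keys[] =
	"ABDAY_1\0\0\0\0\0\0\0"
	"DAY_1\0\0\0\0\0\0\0"
	"ABMON_1\0\0\0\0\0\0\0\0\0\0\0\0"
	"MON_1\0\0\0\0\0\0\0\0\0\0\0\0";

#define TOY_KEY_COUNT 38

/* Write the key for entry idx into buf. Returns 0 on success, or -1
 * if idx is out of range or buf is too small. */
static int synth_key(size_t idx, char *buf, size_t buflen)
{
	const char *p = toy_keys;
	const char *base = NULL;   /* last non-empty entry seen */
	size_t offset = 0;         /* empty entries since base */

	assert(buf != NULL);
	if (idx >= TOY_KEY_COUNT)
		return -1;
	for (size_t i = 0; ; i++) {
		if (*p != '\0') {
			base = p;
			offset = 0;
		} else {
			offset++;
		}
		if (i == idx)
			break;
		p += strlen(p) + 1;
	}
	/* Copy base without its final '1', then append the real number. */
	int n = snprintf(buf, buflen, "%.*s%zu",
	                 (int)(strlen(base) - 1), base, offset + 1);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Entry 18 is the fifth slot of the ABMON group, so it comes out as
"ABMON_5". Even in this reduced form the walk and the formatting step
add code that the current fixed table does not need.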

Rich


