* Bikeshed invitation for nl_langinfo ambiguities
From: Rich Felker @ 2017-11-11 2:06 UTC
To: musl

I've found 2 ambiguous-string-to-translate bugs in musl's locale
support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and T_FMT
and ERA_T_FMT ("%H:%M:%S"), have the same values in the C locale, and
thus can't be translated to distinct values like they need to be in
other locales.

Any opinions on the cleanest way to handle this? There are various
hacks I could do at the implementation level, like adding a prefix
character to one or the other then applying +1 to the output string,
but whatever solution we choose becomes a public interface for
translators, so it should be something that's not horribly ugly.

So.... it's bikeshed time.

Rich
* Re: Bikeshed invitation for nl_langinfo ambiguities
From: A. Wilcox @ 2017-11-26 23:19 UTC
To: musl

On 10/11/17 20:06, Rich Felker wrote:
> I've found 2 ambiguous-string-to-translate bugs in musl's locale
> support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> locale, and thus can't be translated to distinct values like they
> need to be in other locales.
>
> Any opinions on the cleanest way to handle this? There are various
> hacks I could do at the implementation level, like adding a prefix
> character to one or the other then applying +1 to the output
> string, but whatever solution we choose becomes a public interface
> for translators, so it should be something that's not horribly
> ugly.

I would personally recommend actually using the enum values as the
strings to translate. _("MON_5"), _("ABMON_5"), etc; this is
non-ambiguous, easily understandable and describable for translators,
and does not require weird hacks at the implementation or ABI level.

Of course, then a "C" / "POSIX" strings file must be present. But
this is, in my opinion, a very small sacrifice to ensure full purity
and ease of translation.

A simple " " with a note it is intentional /could/ work, but then
every locale file has to have an extra " " for those two values. This
would additionally affect any additional duplicate strings that are
found when musl is translated to other languages. If there's just ten
of these, and musl is ported to just 100 languages (glibc has over
200), that's already 10 kB wasted on a silly hack. It is also more
brittle.

> So.... it's bikeshed time.
>
> Rich

Yellow, definitely pastel yellow. It's a nice muted colour that is
warm and inviting without being too striking or in-your-face. The
perfect colour for any bikeshed!

Best,
--arw

--
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org
* Re: Bikeshed invitation for nl_langinfo ambiguities
From: Rich Felker @ 2017-11-27 1:07 UTC
To: musl

On Sun, Nov 26, 2017 at 05:19:07PM -0600, A. Wilcox wrote:
> On 10/11/17 20:06, Rich Felker wrote:
> > I've found 2 ambiguous-string-to-translate bugs in musl's locale
> > support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> > T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> > locale, and thus can't be translated to distinct values like they
> > need to be in other locales.
> >
> > Any opinions on the cleanest way to handle this? There are various
> > hacks I could do at the implementation level, like adding a prefix
> > character to one or the other then applying +1 to the output
> > string, but whatever solution we choose becomes a public interface
> > for translators, so it should be something that's not horribly
> > ugly.
>
> I would personally recommend actually using the enum values as the
> strings to translate. _("MON_5"), _("ABMON_5"), etc; this is
> non-ambiguous, easily understandable and describable for translators,
> and does not require weird hacks at the implementation or ABI level.

This is certainly one possibility, but it does result in embedding a
number of "useless" strings that are never used themselves, only as
translation keys, in the binary. One nice property of it (especially
if we did the same for strerror keys) is that it eliminates the need
for translation files to care about changes in the text in musl.

> Of course, then a "C" / "POSIX" strings file must be present. But
> this is, in my opinion, a very small sacrifice to ensure full purity
> and ease of translation.

This is of course not acceptable. It's solvable just by having the
"translations" embedded (in mo format or otherwise) in the source, but
then it's impossible for __lctrans to be a dummy identity-map in
programs that don't link setlocale/newlocale.

I have in mind a way we could potentially avoid this: passing keys
like "ABMON_5" to __lctrans, and if it returns back the key (which is
what happens with the stub implementation or with no translation
present), use the builtin C locale strings instead.

> A simple " " with a note it is intentional /could/ work, but then
> every locale file has to have an extra " " for those two values. This
> would additionally affect any additional duplicate strings that are
> found when musl is translated to other languages. If there's just ten
> of these, and musl is ported to just 100 languages (glibc has over
> 200), that's already 10 kB wasted on a silly hack. It is also more
> brittle.

I don't follow; there are only two duplicate strings and they are
"May" and "%H:%M:%S". The number does not grow with the number of
translations because it's a property of the untranslated strings not
the translated ones.

Rich
* Re: Bikeshed invitation for nl_langinfo ambiguities
From: A. Wilcox @ 2017-11-27 2:57 UTC
To: musl

On 26/11/17 19:07, Rich Felker wrote:
>> I would personally recommend actually using the enum values as
>> the strings to translate. _("MON_5"), _("ABMON_5"), etc; this
>> is non-ambiguous, easily understandable and describable for
>> translators, and does not require weird hacks at the
>> implementation or ABI level.
>
> This is certainly one possibility, but it does result in embedding
> a number of "useless" strings that are never used themselves, only
> as translation keys, in the binary. One nice property of it
> (especially if we did the same for strerror keys) is that it
> eliminates the need for translation files to care about changes in
> the text in musl.

That was the idea, yes. This would work for *all* translatable
strings, not just the nl_langinfo ones.

>> Of course, then a "C" / "POSIX" strings file must be present.
>> But this is, in my opinion, a very small sacrifice to ensure full
>> purity and ease of translation.
>
> This is of course not acceptable.

"Of course"? Why not? The reason this wouldn't be acceptable is not
obvious to me.

> I have in mind a way we could potentially avoid this: passing keys
> like "ABMON_5" to __lctrans, and if it returns back the key (which
> is what happens with the stub implementation or with no translation
> present), use the builtin C locale strings instead.

That would work.

> I don't follow; there are only two duplicate strings and they are
> "May" and "%H:%M:%S". The number does not grow with the number of
> translations because it's a property of the untranslated strings
> not the translated ones.

If you are returning (string+1), then all strings need to have the " "
at the beginning, or else you are going to return "ai" instead of
"mai" for French and so on.

Best,
--arw

--
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org
* Re: Bikeshed invitation for nl_langinfo ambiguities
From: Rich Felker @ 2017-11-27 5:09 UTC
To: musl

On Sun, Nov 26, 2017 at 08:57:25PM -0600, A. Wilcox wrote:
> On 26/11/17 19:07, Rich Felker wrote:
> >> I would personally recommend actually using the enum values as
> >> the strings to translate. _("MON_5"), _("ABMON_5"), etc; this
> >> is non-ambiguous, easily understandable and describable for
> >> translators, and does not require weird hacks at the
> >> implementation or ABI level.
> >
> > This is certainly one possibility, but it does result in embedding
> > a number of "useless" strings that are never used themselves, only
> > as translation keys, in the binary. One nice property of it
> > (especially if we did the same for strerror keys) is that it
> > eliminates the need for translation files to care about changes in
> > the text in musl.
>
> That was the idea, yes. This would work for *all* translatable
> strings, not just the nl_langinfo ones.
>
> >> Of course, then a "C" / "POSIX" strings file must be present.
> >> But this is, in my opinion, a very small sacrifice to ensure full
> >> purity and ease of translation.
> >
> > This is of course not acceptable.
>
> "Of course"? Why not? The reason this wouldn't be acceptable is not
> obvious to me.

It basically means there's no such thing as "truly static linked"
binaries anymore, and all programs need to search for and open the "C
locale translation file" at program startup time, resulting in a
number of syscalls. None of this would be remotely acceptable to a
large portion of musl's userbase, including myself.

If it's not clear why, just observe that functions like nl_langinfo,
strftime, etc. have no way to fail for arbitrary reasons when the
input is valid. The only way to ensure that they can't fail is to have
the data they need already loaded at the time the locale is set, which
for the default C locale is program start time. Normally/currently
this is achieved just by having the strings in the program text
segment.

> > I have in mind a way we could potentially avoid this: passing keys
> > like "ABMON_5" to __lctrans, and if it returns back the key (which
> > is what happens with the stub implementation or with no translation
> > present), use the builtin C locale strings instead.
>
> That would work.
>
> > I don't follow; there are only two duplicate strings and they are
> > "May" and "%H:%M:%S". The number does not grow with the number of
> > translations because it's a property of the untranslated strings
> > not the translated ones.
>
> If you are returning (string+1), then all strings need to have the " "
> at the beginning, or else you are going to return "ai" instead of
> "mai" for French and so on.

I see. I was not intending to include the prefix in the translated
strings, just in the key.

Rich
* Re: Bikeshed invitation for nl_langinfo ambiguities
From: Rich Felker @ 2018-03-03 5:08 UTC
To: musl

On Sun, Nov 26, 2017 at 05:19:07PM -0600, A. Wilcox wrote:
> On 10/11/17 20:06, Rich Felker wrote:
> > I've found 2 ambiguous-string-to-translate bugs in musl's locale
> > support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> > T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> > locale, and thus can't be translated to distinct values like they
> > need to be in other locales.
> >
> > Any opinions on the cleanest way to handle this? There are various
> > hacks I could do at the implementation level, like adding a prefix
> > character to one or the other then applying +1 to the output
> > string, but whatever solution we choose becomes a public interface
> > for translators, so it should be something that's not horribly
> > ugly.
>
> I would personally recommend actually using the enum values as the
> strings to translate. _("MON_5"), _("ABMON_5"), etc; this is
> non-ambiguous, easily understandable and describable for translators,
> and does not require weird hacks at the implementation or ABI level.

I think this may be the nicest approach, despite being an incompatible
change from the existing system, which apparently doesn't matter and
isn't being used or people would have noticed that "May" can't be
translated right.

> Of course, then a "C" / "POSIX" strings file must be present. But
> this is, in my opinion, a very small sacrifice to ensure full purity
> and ease of translation.

As noted before, obviously this isn't acceptable. We could drop a .mo
file blob in the musl langinfo.c, but I think it might make more sense
to just use different code paths for translated vs nontranslated case.
Then we could just synthesize the keys (ABMON_*, MON_*, ABDAY_*,
DAY_*) to pass into LCTRANS() rather than having a table of them all
expanded out. I might change my mind when actually working out how the
code would look, though.

An alternative I thought about would be just having translations for
"ABMON", "MON", etc. that produce nul-delimited multistrings, but then
the data file encodes an assumption that musl will do the O(n) (albeit
small n) multistring scan for the requested item, and I don't think
it's nice encoding assumptions that limit efficiency or force a
particular size/efficiency tradeoff.

Rich
* Re: Bikeshed invitation for nl_langinfo ambiguities
From: Rich Felker @ 2018-03-05 17:10 UTC
To: musl

On Sat, Mar 03, 2018 at 12:08:54AM -0500, Rich Felker wrote:
> On Sun, Nov 26, 2017 at 05:19:07PM -0600, A. Wilcox wrote:
> > On 10/11/17 20:06, Rich Felker wrote:
> > > I've found 2 ambiguous-string-to-translate bugs in musl's locale
> > > support in nl_langinfo: The pairs ABMON_5 and MON_5 ("May"), and
> > > T_FMT and ERA_T_FMT ("%H:%M:%S"), have the same values in the C
> > > locale, and thus can't be translated to distinct values like they
> > > need to be in other locales.
> > >
> > > Any opinions on the cleanest way to handle this? There are various
> > > hacks I could do at the implementation level, like adding a prefix
> > > character to one or the other then applying +1 to the output
> > > string, but whatever solution we choose becomes a public interface
> > > for translators, so it should be something that's not horribly
> > > ugly.
> >
> > I would personally recommend actually using the enum values as the
> > strings to translate. _("MON_5"), _("ABMON_5"), etc; this is
> > non-ambiguous, easily understandable and describable for translators,
> > and does not require weird hacks at the implementation or ABI level.
>
> I think this may be the nicest approach, despite being an incompatible
> change from the existing system, which apparently doesn't matter and
> isn't being used or people would have noticed that "May" can't be
> translated right.

One really ugly thing here is that the POSIX key for weekdays is
"highly unconventional" - ABDAY_1/DAY_1 is Sunday and ABDAY_7/DAY_7 is
Saturday. Even the Unicode CLDR noticed this nonsense and used
"sun"..."sat" as the keys rather than using numbers so as to be
unambiguous.

> > Of course, then a "C" / "POSIX" strings file must be present. But
> > this is, in my opinion, a very small sacrifice to ensure full purity
> > and ease of translation.
>
> As noted before, obviously this isn't acceptable. We could drop a .mo
> file blob in the musl langinfo.c, but I think it might make more sense
> to just use different code paths for translated vs nontranslated case.

I did some simple estimates with a toy .po/.mo file, and it looks like
either of those approaches is going to more-than-double the size of
langinfo.o, and make it a lot more complex. Given that "Sun".."Sat"
are nicer keys for days anyway, I'm leaning back towards sticking with
what we have and just adding a special case for "May".

The other ambiguity is one of the ERA_* formats, which we're not even
doing right now anyway; they're "not available in the POSIX locale"
according to XBD 7.3.5 LC_TIME, so as I read it they should return ""
(not the corresponding non-era string) in the C/POSIX locale, and only
return something else if they're defined for the locale. Eventually,
we should probably look them up with mo keys like "era_d_fmt", etc.
but unless/until we properly support them, the lookups for them should
just be removed.

> Then we could just synthesize the keys (ABMON_*, MON_*, ABDAY_*,
> DAY_*) to pass into LCTRANS() rather than having a table of them all
> expanded out. I might change my mind when actually working out how the
> code would look, though.

I started working on a nice means of doing this synthesis - having a
table like the existing c_time etc. but contents like:

	"ABDAY_1\0\0\0\0\0\0\0"
	"DAY_1\0\0\0\0\0\0\0"
	"ABMON_1\0\0\0\0\0\0\0\0\0\0\0\0"
	"MON_1\0\0\0\0\0\0\0\0\0\0\0\0"
	...

where, when a zero-length entry is hit, the last non-zero-length one
seen gets used as a basis for synthesis. But it still didn't seem
possible to avoid significant increase in code size and complexity.

Rich