* Locale bikeshed time
@ 2014-07-22 18:49 Rich Felker
  2014-07-22 20:10 ` u-igbb
  2014-07-22 20:17 ` Laurent Bercot
  0 siblings, 2 replies; 43+ messages in thread
From: Rich Felker @ 2014-07-22 18:49 UTC (permalink / raw)
  To: musl

I've got the next phase of the locale work pretty much ready to commit,
but since it needs some policy for how to load locales, I want to
continue the discussion first rather than having commits that change
the behavior back and forth as we discuss this.

Overall, my plan at this point is to disallow any absolute/relative
pathnames in the LC_* vars and restrict them purely to locale names,
and have the path in a separate variable outside the scope of the
standard. This is basically how glibc does it, and the idea is that you
can allow locale names from an untrusted source (e.g. for suid, for
remote apps acting on behalf of a user such as web apps or gitolite, or
for apps that process mixed-locale data with uselocale and have locale
names in their data) as long as the locale path does not contain
malicious locales.

So, the first bikeshed decision to be made is what environment
variable to use for the locale path, and what fallback should be if
it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to
use the same var (since apps are already aware of the need to treat it
specially), but on the other it's undesirable to have them tied
together (e.g. if you're using musl as a non-root installation and
can't write to /usr/lib) and to avoid clashing with glibc's files we
would need to choose a subdirectory under $LOCPATH rather than using
it directly. All of these aspects make it a lot less attractive.

The second issue is how locale categories are split up. Glibc has each
category in a separate file, except for the "locale-archive" file
which stores everything in one file for easy mapping. My leaning so
far is to put the whole locale -- time format and translations,
message translations, ... in a single file.
This avoids the need for multiple mappings (and syscall overhead, and
vma overhead, ...) if you're using the same value for all categories.
But on the other hand, if you wanted to have lots of subtle variants of
a locale, you might end up with largely-duplicate files on disk.
Fortunately I think they'll all be very small anyway so this may not
matter.

Of course making this work is contingent on finding a good way to
encode LC_MONETARY and LC_COLLATE data in a .mo file, since if the
whole locale is unified into one file, it would be a .mo file. My
leaning is to simply use "int_cur_symbol", etc. as gettext keys for the
string fields of LC_MONETARY and then put all the numeric fields of
lconv into a single string that could be parsed with scanf or a tiny
integer parser in localeconv() on the first usage. While not the most
efficient, it avoids needing nasty special tools to generate locale
files; a po-to-mo converter is all you need.

For LC_COLLATE, obviously one solution would be to have keys for each
collation element and use gettext to convert collation elements to the
symbols strxfrm is supposed to output. I'm not sure if the efficiency
of this method is tolerable however. We could go with it for now and
later add something more advanced if needed (e.g. mapping to a DFA
represented as a byte array that does the conversions).

I probably have some more issues to discuss with this too but I'll just
go ahead and send now to get discussion started, and hopefully get back
to adding some more code first.

Rich

^ permalink raw reply [flat|nested] 43+ messages in thread
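The "tiny integer parser" idea for the numeric lconv fields can be sketched as follows. This is only an illustration under stated assumptions: the encoding (decimal integers separated by spaces, in a fixed order, stored under one key) and the names NUMERIC_FIELDS and parse_lconv_numeric are hypothetical and are not taken from musl or from any agreed format.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical layout: the char-typed numeric members of struct lconv,
 * stored in a fixed order as decimal integers separated by spaces,
 * e.g. "2 2 1 0 1 0 1 1". The count and order are assumptions. */
#define NUMERIC_FIELDS 8

/* Parse at most n small integers from s into out[].
 * Returns the number of fields parsed, or -1 on malformed input. */
static int parse_lconv_numeric(const char *s, char *out, size_t n)
{
    size_t i = 0;
    while (i < n) {
        char *end;
        long v = strtol(s, &end, 10);
        if (end == s) break;               /* nothing left to convert */
        if (v < 0 || v > 127) return -1;   /* must fit in a plain char */
        out[i++] = (char)v;
        s = end;
    }
    /* Only trailing spaces may remain; anything else is an error. */
    while (*s == ' ') s++;
    return *s ? -1 : (int)i;
}
```

A caller would run this once, on first use, over the string looked up from the locale file and copy the results into the lconv structure; the string-valued fields would still be looked up by key as described above.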
* Re: Locale bikeshed time 2014-07-22 18:49 Locale bikeshed time Rich Felker @ 2014-07-22 20:10 ` u-igbb 2014-07-22 20:35 ` Rich Felker 2014-07-22 20:17 ` Laurent Bercot 1 sibling, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-22 20:10 UTC (permalink / raw) To: musl On Tue, Jul 22, 2014 at 02:49:32PM -0400, Rich Felker wrote: > Overall, my plan at this point is to disallow any absolute/relative > pathnames in the LC_* vars and restrict them purely to locale names, > and have the path in a separate variable outside the scope of the > standard. +1 > So, the first bikeshed decision to be made is what environment > variable to use for the locale path, and what fallback should be if > it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to > use the same var (since apps are already aware of the need to treat it > specially), but on the other it's undesirable to have them tied > together (e.g. if you're using musl as a non-root installation and > can't write to /usr/lib) and to avoid clashing with glibc's files we This issue is not crucial for my usage pattern, here it is easy to assign values of this kind per binary, not per process tree (in contrast to the locale names which I want to be settable by the user and inheritable regardless of which library can happen to interpret them). Speaking more generally, using the same variable as glibc would introduce a substantial risk of confusion, making the semantics of the variable context-dependent (i.e. depending on which library a certain binary is linked to). This confusion is kind of hidden in monolithic distros where all binaries are expected to have been built by tightly cooperating parties using the same libraries - but the general case includes using binaries built with different premises. A musl-specific variable name would be a better/cleaner choice. > would need to choose a subdirectory under $LOCPATH rather than using > it directly. All of these aspects make it a lot less attractive. 
+1 > The second issue is how locale categories are split up. Glibc has each > category in a separate file, except for the "locale-archive" file > which stores everything in one file for easy mapping. My leaning so By the way, please do not follow the way of a single big file. For systems which rely on file boundaries to reflect data clustering (i.e. which data is most probable to be used together) it is very useful to let the files correspond to the data structure. Otherwise some cheap and efficient distributed data access optimizations become impossible. Coda file system uses a file as a transmission and caching unit - which is quite efficient because a file very often corresponds to an "information unit" which is needed as a whole. Glibc's locale archive enforces a big wasteful transfer and a large cache footprint for very little actual use. > far is to put the whole locale -- time format and translations, > message translations, ... in a single file. This avoids the need for > multiple mappings (and syscall overhead, and vma overhead, ...) if > you're using the same value for all categories. But on the other hand, > if you wanted to have lots of subtle variants of a locale, you might > end up with largely-duplicate files on disk. Fortunately I think > they'll all be very small anyway so this may not matter. I actually do mix categories from different locales. No problem as long as the files are small. Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
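The restriction quoted in this message, that LC_* values be plain locale names and that the directory come from elsewhere, can be sketched like this. The policy details (rejecting any '/', the empty string, "." and "..") and the function names locale_name_ok and locale_file_path are assumptions for illustration; the thread does not settle the variable name or the exact checks, so the directory is passed in as a parameter.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if name is acceptable as a bare locale name, 0 otherwise.
 * Assumed policy: non-empty, contains no '/', and is not "." or "..". */
static int locale_name_ok(const char *name)
{
    if (!name || !*name) return 0;
    if (strchr(name, '/')) return 0;
    if (!strcmp(name, ".") || !strcmp(name, "..")) return 0;
    return 1;
}

/* Build "dir/name" in buf. Returns 0 on success, -1 if the name is
 * rejected or the result does not fit in len bytes. */
static int locale_file_path(char *buf, size_t len,
                            const char *dir, const char *name)
{
    int n;
    if (!locale_name_ok(name)) return -1;
    n = snprintf(buf, len, "%s/%s", dir, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Because the name can never introduce a path separator, an untrusted name can only select among files that already exist in the trusted directory, which is the property the proposal relies on.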
* Re: Locale bikeshed time 2014-07-22 20:10 ` u-igbb @ 2014-07-22 20:35 ` Rich Felker 2014-07-23 9:50 ` u-igbb 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-22 20:35 UTC (permalink / raw) To: musl On Tue, Jul 22, 2014 at 10:10:08PM +0200, u-igbb@aetey.se wrote: > A musl-specific variable name would be a better/cleaner choice. One question is whether this is really musl-specific or specific to a locale scheme that could be used outside of musl too. However, either way it's probably appropriate for the variable to be musl-specific. Having one variable configure multiple things is usually error-prone and inflexible. > > The second issue is how locale categories are split up. Glibc has each > > category in a separate file, except for the "locale-archive" file > > which stores everything in one file for easy mapping. My leaning so > > By the way, please do not follow the way of a single big file. > For systems which rely on file boundaries to reflect data clustering > (i.e. which data is most probable to be used together) it is very useful > to let the files correspond to the data structure. Otherwise some cheap > and efficient distributed data access optimizations become impossible. I hadn't even considered this aspect, but I think the whole concept of a single big file is undesirable with data that's naturally subject to change over time, and where the data comes from multiple sources. So I wasn't really considering that option anyway. > > far is to put the whole locale -- time format and translations, > > message translations, ... in a single file. This avoids the need for > > multiple mappings (and syscall overhead, and vma overhead, ...) if > > you're using the same value for all categories. But on the other hand, > > if you wanted to have lots of subtle variants of a locale, you might > > end up with largely-duplicate files on disk. Fortunately I think > > they'll all be very small anyway so this may not matter. 
> > I actually do mix categories from different locales. > No problem as long as the files are small. Note that if you're just mixing "ll_TT" and "C", there wouldn't be any cost anyway since the C locale (and its aliases) are builtin and never loaded from a file. Where I was thinking you might see duplication is for things like: LC_ALL=ll_TT@modifier where modifier is really just an alternate for one category (e.g. ISO date format for time, alt collation order, etc.), but the file ends up storing duplicates of all the data from other categories. However, I think the alternate preferred usage here would be to provide a file for just the category being overridden that does not contain the base data and require users to set the individual categories, like what you're doing, e.g. LANG=ll_TT LC_TIME=ll_TT@isodate rather than: LC_ALL=ll_TT@isodate Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
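The difference between the two settings contrasted above comes from the standard POSIX precedence: LC_ALL wins, then the category's own variable, then LANG, with empty values treated as unset. A minimal sketch of that lookup follows; the helper names pick_locale and category_locale are hypothetical and this is not musl's code.

```c
#include <stdlib.h>

/* Choose the locale name for one category from the three candidate
 * values, using POSIX precedence. Empty strings count as unset. */
static const char *pick_locale(const char *lc_all, const char *lc_cat,
                               const char *lang)
{
    if (lc_all && *lc_all) return lc_all;
    if (lc_cat && *lc_cat) return lc_cat;
    if (lang && *lang) return lang;
    return "C";   /* builtin default, never loaded from a file */
}

/* Resolve a category such as "LC_TIME" from the environment. */
static const char *category_locale(const char *catname)
{
    return pick_locale(getenv("LC_ALL"), getenv(catname), getenv("LANG"));
}
```

With LANG=ll_TT LC_TIME=ll_TT@isodate, LC_TIME resolves to ll_TT@isodate and every other category to ll_TT, so two names are in play and the override file only needs time data. With LC_ALL=ll_TT@isodate every category resolves to the same name, so that one file has to carry everything.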
* Re: Locale bikeshed time 2014-07-22 20:35 ` Rich Felker @ 2014-07-23 9:50 ` u-igbb 2014-07-23 16:39 ` Rich Felker 0 siblings, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-23 9:50 UTC (permalink / raw) To: musl On Tue, Jul 22, 2014 at 04:35:40PM -0400, Rich Felker wrote: > Having one variable configure multiple things is usually error-prone > and inflexible. +1 > > By the way, please do not follow the way of a single big file. > > For systems which rely on file boundaries to reflect data clustering > I hadn't even considered this aspect, but I think the whole concept of > a single big file is undesirable with data that's naturally subject to > change over time, and where the data comes from multiple sources. So I > wasn't really considering that option anyway. Nice. > > I actually do mix categories from different locales. > > No problem as long as the files are small. > > Note that if you're just mixing "ll_TT" and "C", there wouldn't be any > cost anyway since the C locale (and its aliases) are builtin and never > loaded from a file. Where I was thinking you might see duplication is Sure. This covers certainly most of my preferences but I thought of LANG=l1_T1 and LC_SOMETHING=l2_T2 [and LC_SOMETHINGELSE=l3_T3]. This would result in pulling in two or three locale data files but the overhead is presumably negligible. > for things like: LC_ALL=ll_TT@modifier where modifier is really just > an alternate for one category (e.g. ISO date format for time, alt > collation order, etc.), but the file ends up storing duplicates of all > the data from other categories. However, I think the alternate > preferred usage here would be to provide a file for just the category > being overridden that does not contain the base data and require users > to set the individual categories, like what you're doing, e.g. 
> LANG=ll_TT LC_TIME=ll_TT@isodate This means that most of the time there will be a single locale file to be opened, sometimes more, in extreme cases up to the number of categories, the files also being of different "completeness". This would certainly contribute to confusion for both the administrators and the users. For the sake of uniformity I would possibly prefer to see only the "thinner" files defining exactly one category, instead of different files having different numbers of included categories. But most of all I'd support your approach of including all information in each file. This is "least confusing" and quite efficient. The overhead is mostly static storage (not noticeable in our setup and probably not much anyway :) and the run time overhead affects just the minority of users who mix locales/categories. (Oh btw as a nice bonus this makes the file boundaries correspond to the data usage patterns). To summarize my view, - a file per locale, with all categories included best - a file per category acceptable - files with differing data subsets please don't > rather than: > > LC_ALL=ll_TT@isodate In a real scenario it would be probably LANG=ll_TT@isodate and this feels OK. > Rich Regards, Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-23 9:50 ` u-igbb @ 2014-07-23 16:39 ` Rich Felker 2014-07-23 19:25 ` u-igbb 2014-07-23 23:22 ` writeonce 0 siblings, 2 replies; 43+ messages in thread From: Rich Felker @ 2014-07-23 16:39 UTC (permalink / raw) To: musl On Wed, Jul 23, 2014 at 11:50:31AM +0200, u-igbb@aetey.se wrote: > > > I actually do mix categories from different locales. > > > No problem as long as the files are small. > > > > Note that if you're just mixing "ll_TT" and "C", there wouldn't be any > > cost anyway since the C locale (and its aliases) are builtin and never > > loaded from a file. Where I was thinking you might see duplication is > > Sure. This covers certainly most of my preferences but I thought of > LANG=l1_T1 and LC_SOMETHING=l2_T2 [and LC_SOMETHINGELSE=l3_T3]. > This would result in pulling in two or three locale data files but the > overhead is presumably negligible. It's two or three sets of syscalls -- open (one per path component tried until it succeeds), fstat, mmap, close -- rather than one set. And an extra vma (resulting from the mmap) for each used. But the choice isn't whether to have this overhead or not, unless you want to consider the glibc locale-archive ugliness. The choice is just whether to optimize the case where the categories are all the same (only having one set of syscalls in that case) or mostly the same, or not to optimize it and always have multiple sets of syscalls. I believe the latter is strictly worse. > > for things like: LC_ALL=ll_TT@modifier where modifier is really just > > an alternate for one category (e.g. ISO date format for time, alt > > collation order, etc.), but the file ends up storing duplicates of all > > the data from other categories. However, I think the alternate > > preferred usage here would be to provide a file for just the category > > being overridden that does not contain the base data and require users > > to set the individual categories, like what you're doing, e.g. 
> >
> > LANG=ll_TT LC_TIME=ll_TT@isodate
>
> This means that most of the time there will be a single locale file to be
> opened, sometimes more, in extreme cases up to the number of categories,
> the files also being of different "completeness". This would certainly
> contribute to confusion for both the administrators and the users.

Hmm. I see how it would be confusing and maybe it's best to discourage
this use (incomplete .mo files). But it's purely a usage issue, outside
of musl's control, unless we wanted to impose a check that any locale
file have data for all the categories (and I think such a rule would be
bad since it precludes having locale files that are unrelated to
languages, e.g. a generic "UCA" collation locale with the default UCA
data).

> For the sake of uniformity I would possibly prefer to see only the
> "thinner" files defining exactly one category, instead of different
> files having different numbers of included categories.

Yes that sounds like a good policy. Really, policy matters like this
(i.e. ones that don't affect libc implementation) should be worked out
when it comes time to actually make some locales and find a maintainer
for a musl-locale repo/package.

On that topic, while this is a matter outside my control for individual
users, my preference would be that the official musl-locale data
attempt to avoid multiple variants/modifiers and legacy options if
possible. For example I would like to see the numeric date format be
ISO format in all locales, with traditional formats only where the
natural-language string representations for months/days are included
(and I say this as someone coming from one of the locales, i.e. US,
where the traditional numeric date format is non-ISO). In keeping with
the principle that musl is "modern" I'd like to prefer modern cultural
conventions to historical ones.

> But most of all I'd support your approach of including all information in
> each file. This is "least confusing" and quite efficient.
The overhead > is mostly static storage (not noticeable in our setup and probably not > much anyway :) and the run time overhead affects just the minority of > users who mix locales/categories. (Oh btw as a nice bonus this makes > the file boundaries correspond to the data usage patterns). > > To summarize my view, > > - a file per locale, with all categories included best > - a file per category acceptable > - files with differing data subsets please don't Yes I think this makes sense. My leaning would be to use complete files for language-based locales, and file-per-category for individual category locales that are not associated with any particular language (and where, thereby, there's no assumption that they should provide any behavior to other categories). Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
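The per-file cost described in this message (open, fstat, mmap, close, plus one vma per mapping) corresponds to a sequence like the one below. It is a sketch only, assuming a POSIX system; the function name map_file is hypothetical, and the search over several path components is left out.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only. Returns the mapping and stores its length in
 * *size, or returns NULL on any failure (including an empty file). */
static const void *map_file(const char *path, size_t *size)
{
    struct stat st;
    const void *map = NULL;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                       MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            map = p;
            *size = (size_t)st.st_size;
        }
    }
    close(fd);   /* the mapping remains valid after the descriptor is closed */
    return map;
}
```

If every category resolves to the same name, a loader built on this needs to run the sequence once and can share the single mapping; with two or three distinct names it runs two or three times and leaves that many mappings in the process.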
* Re: Locale bikeshed time 2014-07-23 16:39 ` Rich Felker @ 2014-07-23 19:25 ` u-igbb 2014-07-23 21:01 ` Rich Felker 2014-07-23 23:22 ` writeonce 1 sibling, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-23 19:25 UTC (permalink / raw) To: musl On Wed, Jul 23, 2014 at 12:39:07PM -0400, Rich Felker wrote: > My leaning would be to use complete > files for language-based locales, and file-per-category for individual > category locales that are not associated with any particular language > (and where, thereby, there's no assumption that they should provide > any behavior to other categories). This feels appropriate - if the definitions indeed fall into distinctive classes like "full" / "single-category" and also if the naming reflects the distinction (keeping objects with different properties in the same name space is otherwise harmful, among others harmful for ease of understanding by the prospective users and administrators). Thanks. Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-23 19:25 ` u-igbb @ 2014-07-23 21:01 ` Rich Felker 2014-07-24 15:35 ` u-igbb 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-23 21:01 UTC (permalink / raw) To: musl On Wed, Jul 23, 2014 at 09:25:03PM +0200, u-igbb@aetey.se wrote: > On Wed, Jul 23, 2014 at 12:39:07PM -0400, Rich Felker wrote: > > My leaning would be to use complete > > files for language-based locales, and file-per-category for individual > > category locales that are not associated with any particular language > > (and where, thereby, there's no assumption that they should provide > > any behavior to other categories). > > This feels appropriate - if the definitions indeed fall into distinctive > classes like "full" / "single-category" and also if the naming reflects > the distinction (keeping objects with different properties in the same > name space is otherwise harmful, among others harmful for ease of > understanding by the prospective users and administrators). IMO language-based locales should be ll, lll, ll_TT, or lll_TT form where ll or lll is lowercase ISO language code and TT is uppercase territory code. Non-language-based locale files should avoid these patterns. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-23 21:01 ` Rich Felker @ 2014-07-24 15:35 ` u-igbb 2014-07-24 16:01 ` Rich Felker 0 siblings, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-24 15:35 UTC (permalink / raw) To: musl On Wed, Jul 23, 2014 at 05:01:20PM -0400, Rich Felker wrote: > > This feels appropriate - if the definitions indeed fall into distinctive > > classes like "full" / "single-category" and also if the naming reflects > > the distinction > > IMO language-based locales should be ll, lll, ll_TT, or lll_TT form > where ll or lll is lowercase ISO language code and TT is uppercase > territory code. Non-language-based locale files should avoid these > patterns. Just for certainty: I assume you mean "l" above being lower case and non-language-based definitions to begin/consist of uppercase letters? Totally avoiding two- and three-letter combinations would be hardly followed by less scrupulous parties :) but you certainly did not mean this. Btw do we have to also use lll (the three-letter codes) or would be the two-letter ones sufficient? I understand that this is not an implementation question but rather a discipline/policy one but in the long run it helps enormously to have a clean deployment idea from the beginning. An example of a spectacular failure to do so were the xkb keyboard maps. [ Two incompatible representations were in use, for many years (!) One was reasonable, structured by country i.e. reflecting different countries' actual standards. The other one was broken by design, using "language" as the main key without any actual definition of its semantics. This led to many of the available definitions being a hardly useful hacks (and of course to a lot of confusion for everyone as this thing was impossible to document). Remarkably even the maintainers of the maps at x.org/freedesktop.org at the time did not realize the origin of the problem. I happen to have been involved into clarifying the issue, now the structure of xkb/symbols is reasonable. 
]

This happens when one does not clearly document the target deployment
model which the implementation exists for, in other words the model it
is meant to implement.

Other/unexpected ways to use a tool can be good too (or sometimes even
better) but most of the deployers lack the time and knowledge for the
analysis which the implementors by their role are to do - the analysis
which you among other things are doing by the discussions here.

The lack of the understanding easily leads to bad practices being
perpetuated (like the mess of the Kerberos keytab traditions).

I am afraid that not stating a clean usage model may harm musl
deployments too (say by mixing two- and three-letter locale codes so
that one can not sanely know which kind to use).

Rune
* Re: Locale bikeshed time 2014-07-24 15:35 ` u-igbb @ 2014-07-24 16:01 ` Rich Felker 2014-07-24 19:24 ` u-igbb 2014-07-24 20:15 ` u-igbb 0 siblings, 2 replies; 43+ messages in thread From: Rich Felker @ 2014-07-24 16:01 UTC (permalink / raw) To: musl On Thu, Jul 24, 2014 at 05:35:26PM +0200, u-igbb@aetey.se wrote: > On Wed, Jul 23, 2014 at 05:01:20PM -0400, Rich Felker wrote: > > > This feels appropriate - if the definitions indeed fall into distinctive > > > classes like "full" / "single-category" and also if the naming reflects > > > the distinction > > > > IMO language-based locales should be ll, lll, ll_TT, or lll_TT form > > where ll or lll is lowercase ISO language code and TT is uppercase > > territory code. Non-language-based locale files should avoid these > > patterns. > > Just for certainty: > > I assume you mean "l" above being lower case and non-language-based > definitions to begin/consist of uppercase letters? Totally avoiding two- > and three-letter combinations would be hardly followed by less scrupulous > parties :) but you certainly did not mean this. I just meant that language-based locales should match the pattern: ^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$ assuming I didn't make any stupid mistakes in writing that regex. And non-language-based locales should not match this pattern. BTW POSIX actually describes this pattern (or similar) for locale names under the XSI option. > Btw do we have to also use lll (the three-letter codes) or would be > the two-letter ones sufficient? I believe there are some languages for which there is no two-letter code. (Note that even the whole 26x26 space is probably insufficient to represent all of the world's languages, and for practical purposes, the letters should have some correspondence with the name of the language.) 
> I understand that this is not an implementation question but rather a > discipline/policy one but in the long run it helps enormously to have > a clean deployment idea from the beginning. Agreed. > An example of a spectacular failure to do so were the xkb keyboard maps. > [ > Two incompatible representations were in use, for many years (!) One was > reasonable, structured by country i.e. reflecting different countries' > actual standards. The other one was broken by design, using "language" > as the main key without any actual definition of its semantics. This > led to many of the available definitions being a hardly useful hacks > (and of course to a lot of confusion for everyone as this thing was > impossible to document). Remarkably even the maintainers of the maps > at x.org/freedesktop.org at the time did not realize the origin of the > problem. I happen to have been involved into clarifying the issue, > now the structure of xkb/symbols is reasonable. > ] This text is utterly backwards, and I've complained about the policy before, but gotten nowhere with it. Yes many languages have keyboard variants connected to a particular geographic territory (this is mainly true for European languages, not so much for the rest of the world), but it does not make keyboard layout a property of country. You also have: - Users who speak and use languages that have no relation to the country where they're living. - Languages which have no territory. - Languages used in territories where the country it belongs to is disputed. - Etc. All of these issues make country-based keyboard selection at best inconvenient, and at worst culturally and politically offensive, to users. And offending users is utterly bad policy. The same issue exists in glibc -- for a long time, their policy mandated that all locales have a territory associated with them, and this (along with other stupid policy) was preventing the addition of the Esperanto locale. 
See: https://sourceware.org/bugzilla/show_bug.cgi?id=16190 I believe the policy has been fixed now, but the discussion happened on a different bug tracker issue and/or mailing list thread, and I don't have the link. > I am afraid that not stating a clean usage model may harm musl deployments > too (say by mixing two- and three-letter locale codes so that one can not > sanely know which kind to use). The reasonable approach to this is probably just using the three-letter codes for languages that do not have a two-letter code. In practice I haven't seen such translations/locales on other systems, but we certainly don't want to preclude them. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
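The naming pattern given earlier in this message can be checked mechanically with the POSIX regex API. The sketch below uses the pattern exactly as written in the thread; the function name is_language_locale is hypothetical, and nothing here implies that musl performs such a check.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if name matches the language-based pattern from the thread,
 * 0 if it does not, and -1 if the pattern fails to compile. */
static int is_language_locale(const char *name)
{
    static const char pat[] =
        "^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$";
    regex_t re;
    int r;
    if (regcomp(&re, pat, REG_EXTENDED | REG_NOSUB)) return -1;
    r = regexec(&re, name, 0, NULL, 0) == 0;
    regfree(&re);
    return r;
}
```

Names such as "eo", "en_US" and "ll_TT@isodate" match, while "C" and an uppercase name like "UCA" do not. A lowercase two- or three-letter name for a non-language locale would match, which is why such files would have to avoid that shape.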
* Re: Locale bikeshed time 2014-07-24 16:01 ` Rich Felker @ 2014-07-24 19:24 ` u-igbb 2014-07-24 20:15 ` u-igbb 1 sibling, 0 replies; 43+ messages in thread From: u-igbb @ 2014-07-24 19:24 UTC (permalink / raw) To: musl On Thu, Jul 24, 2014 at 12:01:50PM -0400, Rich Felker wrote: > > Btw do we have to also use lll (the three-letter codes) or would be > > the two-letter ones sufficient? > > I believe there are some languages for which there is no two-letter > code. (Note that even the whole 26x26 space is probably insufficient > to represent all of the world's languages, and for practical purposes, > the letters should have some correspondence with the name of the > language.) Ok, then it's a pity that we can not "postulate" three-letter ones :) (as we want to embrace the values people already have in their LANG...) > > An example of a spectacular failure to do so were the xkb keyboard maps. > > [ > > Two incompatible representations were in use, for many years (!) One was > > reasonable, structured by country i.e. reflecting different countries' > > actual standards. The other one was broken by design, using "language" > > as the main key without any actual definition of its semantics. This > > led to many of the available definitions being a hardly useful hacks > > (and of course to a lot of confusion for everyone as this thing was > > impossible to document). Remarkably even the maintainers of the maps > > at x.org/freedesktop.org at the time did not realize the origin of the > > problem. I happen to have been involved into clarifying the issue, > > now the structure of xkb/symbols is reasonable. > > ] > > This text is utterly backwards, and I've complained about the policy > before, but gotten nowhere with it. Yes many languages have keyboard Oh. > variants connected to a particular geographic territory (this is > mainly true for European languages, not so much for the rest of the > world), but it does not make keyboard layout a property of country. 
A keyboard layout is nothing else than "a cultural or personal
preference". In the overwhelming majority of usage cases it is
reflected/strongly-suggested by the manufacturers of the keyboards by
carving some glyphs on the keys. The manufacturers tend to follow the
different countries' most formalized cultural references, expressed by
the locally defined standards (among others, the country-specific
standards for keyboard layouts). This also creates a strong bias
towards a certain one of these layouts for the corresponding users, of
course.

I do not say that "country" is a perfect reference. It is nevertheless
much more reasonable/usable than "language" which has nothing to do
with layouts (_alphabets_ might have a say in corner cases, but not
languages).

> You also have:
>
> - Users who speak and use languages that have no relation to the
>   country where they're living.
>
> - Languages which have no territory.
>
> - Languages used in territories where the country it belongs to is
>   disputed.
>
> - Etc.

Sure. But this is not what a keyboard layout reflects. You mention a
complexity which does not belong to the task.

> All of these issues make country-based keyboard selection at best
> inconvenient, and at worst culturally and politically offensive, to
> users. And offending users is utterly bad policy.

Believe it or not, I saw this argument from the freedesktop people
during the corresponding discussion (about 15+ years ago iirc).
Nevertheless reflecting what is carved on the keyboard which has been
bought or otherwise chosen _by_the_user_ is hardly an insult.

If there is no national standard for the "user's home culture" layout,
then there is none. Not our fault and a purely political issue
(technically you always can find a place for an extra layout
definition).

If a user chooses a layout defined by a standard of a "country of BAD" -
that was the user's choice, not ours.
Somebody can possibly feel that a BAD country "stole" "their" layout but we are not in a position to judge there. The matter of fact is that the keyboards being manufactured are governed in the _very_ first hand by national standards, which among others define both the physical placement of the keys and the placement of the labels on the keys. Blind typers may forget about the second fact but this does not make the fact irrelevant. > The same issue exists in glibc -- for a long time, their policy > mandated that all locales have a territory associated with them, and > this (along with other stupid policy) was preventing the addition of > the Esperanto locale. See: _This_ was indeed stupid, I am aware of this issue too as I needed the locale. > The reasonable approach to this is probably just using the > three-letter codes for languages that do not have a two-letter code. > In practice I haven't seen such translations/locales on other systems, > but we certainly don't want to preclude them. Fair enough! Thanks Rich. Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-24 16:01 ` Rich Felker 2014-07-24 19:24 ` u-igbb @ 2014-07-24 20:15 ` u-igbb 2014-07-24 22:02 ` Rich Felker 1 sibling, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-24 20:15 UTC (permalink / raw) To: musl On Thu, Jul 24, 2014 at 12:01:50PM -0400, Rich Felker wrote: > I just meant that language-based locales should match the pattern: > > ^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$ > > assuming I didn't make any stupid mistakes in writing that regex. And > non-language-based locales should not match this pattern. I feel it would be somewhat more robust if we'd have a positive definition for "the second class" of locale data, just in case we one day discover that we want to differently handle, say, three classes (?) A negative definition also gives very little guidance for the actual naming and in the worst case may lead to misunderstanding when multiple parties are involved. Why not make such a worst case less probable by a somewhat more strict naming rule? Possibly also defining "non-language-based" in a positive way? This is just a thought. I have no actual proposal as I do not have a good mental picture of which kinds of "non-language-based" definitions exist or should exist and how they are being used or might/should be used. Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
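The pattern quoted above can be tried out directly with the POSIX regex API. The following is only a sketch for experimenting with names; the helper name is_language_locale is made up for illustration and is not part of musl or of any proposal in this thread. It assumes the program runs in the default "C" locale, since the character classes are locale-dependent.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Pattern quoted from the thread, in POSIX extended regex syntax. */
static const char LANG_LOCALE_PATTERN[] =
    "^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$";

/* Hypothetical helper: returns 1 if `name` matches the language-based
 * pattern, 0 otherwise. Compiles the regex on every call for simplicity. */
int is_language_locale(const char *name)
{
    regex_t re;
    int rc = regcomp(&re, LANG_LOCALE_PATTERN, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0); /* the pattern is a fixed, valid ERE */
    rc = regexec(&re, name, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this, "de", "sv_SE" and "eo@iso8601" are accepted while "C" and "POSIX" are not. One thing worth noticing when experimenting: '_' is itself a punctuation character, so a name like "sv_se" is accepted through the last group even though it has no upper-case territory.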
* Re: Locale bikeshed time 2014-07-24 20:15 ` u-igbb @ 2014-07-24 22:02 ` Rich Felker 2014-07-25 9:06 ` u-igbb 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-24 22:02 UTC (permalink / raw) To: musl On Thu, Jul 24, 2014 at 10:15:48PM +0200, u-igbb@aetey.se wrote: > On Thu, Jul 24, 2014 at 12:01:50PM -0400, Rich Felker wrote: > > I just meant that language-based locales should match the pattern: > > > > ^[[:lower:]]{2,3}(_[[:upper:]]{2})?([[:punct:]].*)?$ > > > > assuming I didn't make any stupid mistakes in writing that regex. And > > non-language-based locales should not match this pattern. > > I feel it would be somewhat more robust if we'd have a positive > definition for "the second class" of locale data, just in case we one > day discover that we want to differently handle, say, three classes (?) > > A negative definition also gives very little guidance for the actual > naming and in the worst case may lead to misunderstanding when multiple > parties are involved. > > Why not make such a worst case less probable by a somewhat more strict > naming rule? > Possibly also defining "non-language-based" in a positive way? > > This is just a thought. I have no actual proposal as I do not have a > good mental picture of which kinds of "non-language-based" definitions > exist or should exist and how they are being used or might/should be used. This is a reasonable sentiment, but do you have a proposal? I think first you would need an idea of what some "non-language" category values might be. I can think of some for LC_COLLATE, though I'm not sure how valuable many of them are: - UCA default tables - UTF-16 code unit order - Case-insensitive Unicode codepoint order For the other categories, examples seem much harder to find. LC_MESSAGES is inherently a language-based category, but perhaps you could have a locale that eliminates verbose natural-language messages and replaces them with C/POSIX identifiers (e.g. 
printing ENOENT instead of "No such file or directory") conveying the meaning. (Or we could be somewhat radical and replace all the internal strerror messages like this and require LC_MESSAGES=en to get them back.) I'm not sure if there would be interesting LC_TIME locales not associated with a language (since LC_TIME has to offer day/month names). And for LC_MONETARY, most if not all of the data really corresponds to a political unit context, not a language, so in principle it might make sense to have locales just for LC_MONETARY that aren't associated with a language, but I can't see that being a convenient or reasonable design in practice... Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
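To make the "identifiers instead of messages" idea concrete, here is a minimal sketch. The functions errno_identifier and error_text are hypothetical names chosen for illustration, only a handful of errno values are covered, and the `symbolic` flag merely stands in for whatever locale selection would drive the choice; this is not how musl's strerror is implemented.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map a few errno values to their C/POSIX identifiers.
 * A real table would cover every value; "E?" marks anything not listed. */
const char *errno_identifier(int err)
{
    switch (err) {
    case ENOENT: return "ENOENT";
    case EACCES: return "EACCES";
    case EINVAL: return "EINVAL";
    case ENOMEM: return "ENOMEM";
    default:     return "E?";
    }
}

/* Return either the symbolic identifier or the natural-language message.
 * `symbolic` is an assumed stand-in for a locale-based decision. */
const char *error_text(int err, int symbolic)
{
    return symbolic ? errno_identifier(err) : strerror(err);
}
```

A caller would then see "ENOENT" rather than "No such file or directory" whenever the symbolic variant is selected.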
* Re: Locale bikeshed time 2014-07-24 22:02 ` Rich Felker @ 2014-07-25 9:06 ` u-igbb 2014-07-25 20:15 ` u-igbb 0 siblings, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-25 9:06 UTC (permalink / raw) To: musl On Thu, Jul 24, 2014 at 06:02:28PM -0400, Rich Felker wrote: > first you would need an idea of what some "non-language" category > values might be. I can think of some for LC_COLLATE, though I'm not > sure how valuable many of them are: > > - UCA default tables > - UTF-16 code unit order > - Case-insensitive Unicode codepoint order I can hardly give any opinion on their importance. > For the other categories, examples seem much harder to find. > LC_MESSAGES is inherently a language-based category, but perhaps you > could have a locale that eliminates verbose natural-language messages > and replaces them with C/POSIX identifiers (e.g. printing ENOENT > instead of "No such file or directory") conveying the meaning. (Or we > could be somewhat radical and replace all the internal strerror > messages like this and require LC_MESSAGES=en to get them back.) I'm I like this - for clarity, conciseness and for making it as neutral as possible (ENOENT stems of course from English but no worse than the keywords of C itself). > LC_MONETARY, most if not all of the data really corresponds to a > political unit context, not a language, so in principle it might make > sense to have locales just for LC_MONETARY that aren't associated with > a language, but I can't see that being a convenient or reasonable > design in practice... Indeed, LC_MONETARY has basically nothing to do with language. If I might choose I would not let LANG imply LC_MONETARY (iow would skip LC_MONETARY in language-based locale definitions). Returning to the naming. As language-based locales are named after languages, it would be nice to name other kinds of locale data after their "natural association" too. 
Then politically-bound data could be put into the corresponding "territorial" family: language ll[l][_TT] territory TT[_ll[l]] And if we find something that does not feel reasonable to connect to either a language or a territory, we can do special cases @<specialcase> [or ZZ@<specialcase> ("no territory") or zxx@<specialcase> ("no language") but the shorter and simpler is to prefer] The expected mode of usage would be like LANG=de LC_MONETARY=EU or LANG=sv LC_MONETARY=SE or LANG=eo@iso8601 LC_MONETARY=US@iso4217 which would in every case access two locale data files of different classes, clearly visible in the naming. Iso date format actually would be a good candidate for a standalone "@iso8601", but it can as well live inside the C locale. Then the last example above might look like LANG=eo LC_TIME=@iso8601 LC_MONETARY=US@iso4217 at the expense of a third file to be accessed or rather LANG=eo LC_TIME=C LC_MONETARY=US@iso4217 What do you think about such a naming convention and usage mode? Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-25 9:06 ` u-igbb @ 2014-07-25 20:15 ` u-igbb 2014-07-25 22:32 ` Rich Felker 2014-07-26 20:43 ` Rich Felker 0 siblings, 2 replies; 43+ messages in thread From: u-igbb @ 2014-07-25 20:15 UTC (permalink / raw) To: musl Replying to myself. On Fri, Jul 25, 2014 at 11:06:49AM +0200, u-igbb@aetey.se wrote: > Returning to the naming. As language-based locales are named > after languages, it would be nice to name other kinds of locale > data after their "natural association" too. Then politically-bound > data could be put into the corresponding "territorial" family: > > language ll[l][_TT] > territory TT[_ll[l]] A bad idea, forget it. This would be open to misinterpretation (which key is "more fundamental" for a certain kind of data, shall it go to ll_TT or TT_ll ?) Somewhat cleaner might be: ("zxx" and "ZZ" below are literals) no localization C language[+territory] ll[l][_TT] purely territorial zxx_TT ("no language" code) and possibly no territory-specific stuff included ll[l]_ZZ ("no territory" code) The last item would e.g. allow treating ll[l] alone as "including the most frequently used territorial features for this language" (like "sv" == "sv_SE"), but I think this approach would be bad and confusing - such a definition is not certain nor stable. I think that a language code alone should mean "no territory-specific stuff included" and nothing else. Then "ll" would be a synonym for "ll_ZZ" and hence "ll_ZZ" will not have to exist at all. Then the usage would be like LANG=de_DE (... "€") LANG=sv_SE (decimal comma, "kr") LANG=sv LC_MONETARY=zxx_SE (decimal point from "C", iso4217 "SEK") LANG=sv_FI LC_MONETARY=sv_SE (... "kr") LANG=eo LC_MONETARY=zxx_EU (... iso4217 "EUR") Assuming that any categories not explicitly defined in the corresponding files are to be taken from "C". Hope this makes some sense in your eyes. Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
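The usage examples above (e.g. LANG=sv LC_MONETARY=zxx_SE) rely on the usual environment precedence: LC_ALL overrides everything, then the per-category variable, then LANG. A small sketch of that lookup follows; resolve_category is a hypothetical name, and falling back to "C" when nothing is set is an assumption taken from the message above (POSIX leaves that default implementation-defined).

```c
#include <stddef.h>

/* A variable counts as set only if it exists and is non-empty. */
static int is_set(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Hypothetical helper: pick the locale name for one category from the
 * values of LC_ALL, LC_<category> and LANG (pass NULL for unset ones).
 * The "C" fallback is an assumption of this sketch. */
const char *resolve_category(const char *lc_all,
                             const char *lc_category,
                             const char *lang)
{
    if (is_set(lc_all))
        return lc_all;
    if (is_set(lc_category))
        return lc_category;
    if (is_set(lang))
        return lang;
    return "C";
}
```

For LANG=sv LC_MONETARY=zxx_SE this yields "zxx_SE" for the monetary category and "sv" for every other one, which is exactly the two-file access pattern described in the examples.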
* Re: Locale bikeshed time 2014-07-25 20:15 ` u-igbb @ 2014-07-25 22:32 ` Rich Felker 2014-07-26 7:25 ` u-igbb 2014-07-26 20:43 ` Rich Felker 1 sibling, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-25 22:32 UTC (permalink / raw) To: musl On Fri, Jul 25, 2014 at 10:15:51PM +0200, u-igbb@aetey.se wrote: > Replying to myself. > > On Fri, Jul 25, 2014 at 11:06:49AM +0200, u-igbb@aetey.se wrote: > > Returning to the naming. As language-based locales are named > > after languages, it would be nice to name other kinds of locale > > data after their "natural association" too. Then politically-bound > > data could be put into the corresponding "territorial" family: > > > > language ll[l][_TT] > > territory TT[_ll[l]] > > A bad idea, forget it. This would be open to misinterpretation > (which key is "more fundamental" for a certain kind of data, > shall it go to ll_TT or TT_ll ?) Yes, I agree that's a bad idea. > Somewhat cleaner might be: ("zxx" and "ZZ" below are literals) > > no localization C > language[+territory] ll[l][_TT] > purely territorial zxx_TT ("no language" code) While clean and well-defined, I wonder whether zxx_TT is counter-intuitive to most users... > and possibly > no territory-specific stuff included ll[l]_ZZ ("no territory" code) > > The last item would e.g. allow treating ll[l] alone as "including > the most frequently used territorial features for this language" > (like "sv" == "sv_SE"), > but I think this approach would be bad and confusing - such a definition > is not certain nor stable. > > I think that a language code alone should mean "no territory-specific > stuff included" and nothing else. I think that's reasonable. > Then "ll" would be a synonym for "ll_ZZ" and hence "ll_ZZ" will not have > to exist at all. That's definitely nice. > Then the usage would be like > > LANG=de_DE (... 
"€") > > LANG=sv_SE (decimal comma, "kr") > LANG=sv LC_MONETARY=zxx_SE (decimal point from "C", iso4217 "SEK") Changing the numeric radix point is explicitly not supported. :) LC_NUMERIC is just always C because, well, numbers are numbers, not something to vary by culture, and changing the radix point just breaks parsing and storing data for interchange. LC_MONETARY on the other hand could in principle provide a different monetary radix point, but it's not terribly useful until we get a full-featured strfmon anyway. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-25 22:32 ` Rich Felker @ 2014-07-26 7:25 ` u-igbb 2014-07-26 8:03 ` Rich Felker 0 siblings, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-26 7:25 UTC (permalink / raw) To: musl On Fri, Jul 25, 2014 at 06:32:39PM -0400, Rich Felker wrote: > > Somewhat cleaner might be: ("zxx" and "ZZ" below are literals) > > > > no localization C > > language[+territory] ll[l][_TT] > > purely territorial zxx_TT ("no language" code) > > While clean and well-defined, I wonder whether zxx_TT is > counter-intuitive to most users... Sure, this contradicts that all too convenient inclination to use short names when possible. Nevertheless I would even argue against myself (again :) and say that we'd better disallow short variants altogether (no TT, nor ll). > > I think that a language code alone should mean "no territory-specific > > stuff included" and nothing else. > > I think that's reasonable. Given that we'd need both this extra rule and a hope that the future user/maintainer keeps it in mind too > > Then "ll" would be a synonym for "ll_ZZ" and hence "ll_ZZ" will not have > > to exist at all. it would in fact be more robust to do the contrary and simply always assume the full ll[l]_TT syntax, with zxx and ZZ being already defined by the corresponding standards to denote the needed special cases. Then this would be fully standard-compliant and consistent. I understand this may feel a bit strange and "too long" even though the extra characters are hardly a burden in practice. Let me compare this to the dns search domains - short names seem convenient but they are not reliable nor do they scale. Short locale names are likewise prone to be misunderstood and there will be contributions with different semantics and long bikeshed discussions on different forums about which one is right :) In other words, I feel that it is more clear to _not_ include Sweden-specific bits into "sv_ZZ" (which indicates "not _any_ country" and hence "not Sweden") than into "sv". 
> > LANG=sv_SE (decimal comma, "kr") > > LANG=sv LC_MONETARY=zxx_SE (decimal point from "C", iso4217 "SEK") > > Changing the numeric radix point is explicitly not supported. :) > LC_NUMERIC is just always C because, well, numbers are numbers, not > something to vary by culture, and changing the radix point just breaks > parsing and storing data for interchange. LC_MONETARY on the other I am fully with you on the point of formatting numerical data for interchange. The purpose of locale is though the exact _opposite_, to represent data in a format especially chosen for the specific occasion and a specific user, _differently_ from what would be suitable for the rest of the world. Isn't it? So I would say it is indeed stupid to localize data meant for interchange. Nevertheless it may still be meaningful to format numbers for the user's taste when the data presentation is only meant for some kind of a "local" context. Related to the decimal point issue: I think we (or at least myself) would need a clarification about the role of "C" locale. It is to mean "no localization" which does not say that it is expected to provide representation usable globally (I think it is on the contrary by its origin heavily English/US biased). I assume that you are aiming to reduce this bias as much as possible so that "C" could be neutral and suitable for as many users/uses as possible. Unfortunately this raises more questions, like the following: According to https://en.wikipedia.org/wiki/Decimal_mark " Countries where a dot "." is used to mark the radix point comprise roughly 60% of the world's population.[citation needed] " which indicates that this information is unreliable. 
Notably, according to the same article (and verifiably :) the living auxiliary languages meant for international communication all made a different choice (apparently for reasons based on some research): " The three most spoken international auxiliary languages, Ido, Esperanto, and Interlingua all use the comma as the official radix point " Is there anything that postulates C locale to use "." as the radix point? Is there any evidence that "." is more widely used than "," ? Do not misunderstand my questions as a cultural bias. I am _much_ more used to the decimal dot than comma, because of the involvement with programming languages using ".". Nevertheless locale is not about representing data for computers, but for humans - and I would love to have a best possible internationally useful locale as the default. Otherwise let us say that "C" locale is for interacting with programs, not with humans, period (those wishing a human-friendly internationally sound environment are to use e.g. LANG=eo_ZZ). This is possibly the only reliable/efficient/robust approach? Yet it would be a pity to not have a common representation for both humans and computers, without a cultural bias. Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-26 7:25 ` u-igbb @ 2014-07-26 8:03 ` Rich Felker 2014-07-26 9:06 ` Jens Gustedt 2014-07-26 9:38 ` u-igbb 0 siblings, 2 replies; 43+ messages in thread From: Rich Felker @ 2014-07-26 8:03 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 09:25:03AM +0200, u-igbb@aetey.se wrote: > > Changing the numeric radix point is explicitly not supported. :) > > LC_NUMERIC is just always C because, well, numbers are numbers, not > > something to vary by culture, and changing the radix point just breaks > > parsing and storing data for interchange. LC_MONETARY on the other > > I am fully with you on the point of formatting numerical data for > interchange. The purpose of locale is though the exact _opposite_, to > represent data in a format especially chosen for the specific occasion > and a specific user, _differently_ from what would be suitable for the > rest of the world. Isn't it? > > So I would say it is indeed stupid to localize data meant for > interchange. Nevertheless it may still be meaningful to format numbers > for the user's taste when the data presentation is only meant for some > kind of a "local" context. The problem is that the vast majority of actual printing and parsing of floating point numbers is for interchange purposes, not mere visual pretty-printing, and the existence of alternate radix characters introduces subtle bugs into programs that are not tested in such locales. Very few programs or libraries I've seen go to the trouble to obtain a usable LC_NUMERIC locale in a portable, thread-safe, and library-safe way before calling snprintf or strtod. And lots of broken gui libraries set LC_NUMERIC behind the application's back even if the application only wanted to set other categories. > Is there anything that postulates C locale to use "." as the radix point? Yes, it's required by ISO C and POSIX. The C locale is defined by its ability to be used for translating C programs. In C programs, the radix point is ".". 
> Is there any evidence that "." is more widely used than "," ? Well, 2/3 of the world's population is in India and China and they all use ".", so I think that pretty much covers the question of which is "more widely used". > Do not misunderstand my questions as a cultural bias. I am _much_ > more used to the decimal dot than comma, because of the involvement > with programming languages using ".". Nevertheless locale is not about > representing data for computers, but for humans - and I would love to > have a best possible internationally useful locale as the default. This goes back to the question about modern versus old tradition. Alternate radix points are a cultural convention that's (seemingly, hopefully) on the way out due to computers and information interchange. Maybe in some sense this is cultural imperialism (or just globalization or whatnot) but it's certainly a lot less negative than the "everyone should use English" attitude. Nobody's saying "don't use your language", just "don't gratuitously break things for a one-pixel difference". :-) Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
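For readers wondering what "obtain a usable LC_NUMERIC locale in a thread-safe way" involves on a system where LC_NUMERIC can vary, here is a sketch using the POSIX.1-2008 per-thread locale functions. The helper name format_double_c is made up for illustration; the approach (copy the current locale, override only LC_NUMERIC with "C", install it for the calling thread around the snprintf call) is one possible way, not something prescribed in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: format `x` with "." as the radix point regardless of
 * the caller's LC_NUMERIC, touching only the calling thread's locale.
 * Returns snprintf's result, or -1 if the locale objects cannot be made. */
int format_double_c(char *buf, size_t size, double x)
{
    /* uselocale((locale_t)0) only queries the thread's current locale. */
    locale_t copy = duplocale(uselocale((locale_t)0));
    if (copy == (locale_t)0)
        return -1;

    /* Replace just LC_NUMERIC; on success `copy` is consumed. */
    locale_t c_numeric = newlocale(LC_NUMERIC_MASK, "C", copy);
    if (c_numeric == (locale_t)0) {
        freelocale(copy);
        return -1;
    }

    locale_t previous = uselocale(c_numeric);
    int n = snprintf(buf, size, "%.17g", x);
    uselocale(previous);      /* restore before freeing the temporary */
    freelocale(c_numeric);
    return n;
}
```

The amount of ceremony needed for one snprintf call illustrates the point made above: few programs bother, so a variable radix point tends to leak into data meant for interchange.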
* Re: Locale bikeshed time 2014-07-26 8:03 ` Rich Felker @ 2014-07-26 9:06 ` Jens Gustedt 2014-07-26 9:25 ` Rich Felker 2014-07-26 9:38 ` u-igbb 1 sibling, 1 reply; 43+ messages in thread From: Jens Gustedt @ 2014-07-26 9:06 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 1062 bytes --] Am Samstag, den 26.07.2014, 04:03 -0400 schrieb Rich Felker: > The problem is that the vast majority of actual printing and parsing > of floating point numbers is for interchange purposes, not mere visual > pretty-printing, do you have statistics that support that claim? printing that is really concerned in interchange, should just use the %a formats. All other formats are intended for human readability. > This goes back to the question about modern versus old tradition. > Alternate radix points are a cultural convention that's (seemingly, > hopefully) on the way out due to computers and information > interchange. Maybe in some sense this is cultural imperialism (or just > globalization or whatnot) +1 for imperialism Jens -- :: INRIA Nancy Grand Est ::: AlGorille ::: ICube/ICPS ::: :: ::::::::::::::: office Strasbourg : +33 368854536 :: :: :::::::::::::::::::::: gsm France : +33 651400183 :: :: ::::::::::::::: gsm international : +49 15737185122 :: :: http://icube-icps.unistra.fr/index.php/Jens_Gustedt :: [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-26 9:06 ` Jens Gustedt @ 2014-07-26 9:25 ` Rich Felker 0 siblings, 0 replies; 43+ messages in thread From: Rich Felker @ 2014-07-26 9:25 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 11:06:56AM +0200, Jens Gustedt wrote: > Am Samstag, den 26.07.2014, 04:03 -0400 schrieb Rich Felker: > > The problem is that the vast majority of actual printing and parsing > > of floating point numbers is for interchange purposes, not mere visual > > pretty-printing, > > do you have statistics that support that claim? Anecdotes, yes; statistics, no. Some examples that come to mind immediately: - Anything JSON - Text-based 3D model files - Subtitle file timings - Video framerates, aspect ratios, etc. - Input files for scientific and mathematical computing. ... > printing that is really concerned in interchange, should just use the > %a formats. All other formats are intended for human readability. In an ideal world, yes, people would use %a. In practice I don't think I've ever seen it used. :( And the radix point affects %a anyway, which is rather nonsensical, since there's definitely no cultural convention for commas to be used as radix points in hex floats. > > This goes back to the question about modern versus old tradition. > > Alternate radix points are a cultural convention that's (seemingly, > > hopefully) on the way out due to computers and information > > interchange. Maybe in some sense this is cultural imperialism (or just > > globalization or whatnot) > > +1 for imperialism Call it what you like, but lack of a variable LC_NUMERIC has been part of the proposed locale design since the beginning. This isn't something new I'm springing now. The radix point in LC_NUMERIC is also probably the single most-hated part of locale by members of the community who objected to musl having any sort of locale support at all. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
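As a small illustration of the %a suggestion: hexadecimal floating-point output represents a double exactly, so printing with %a and reading back with strtod recovers the same value. The helper name hexfloat_round_trips is hypothetical, and the sketch assumes the default "C" locale (setlocale is never called), which sidesteps the radix-point issue mentioned above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if `x` survives being printed with %a and
 * parsed back with strtod, 0 otherwise. Assumes the "C" locale. */
int hexfloat_round_trips(double x)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%a", x);
    return strtod(buf, NULL) == x;
}
```

Values such as 0.1 or 1.0/3.0, which have no exact short decimal form, round-trip without having to choose a decimal precision.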
* Re: Locale bikeshed time 2014-07-26 8:03 ` Rich Felker 2014-07-26 9:06 ` Jens Gustedt @ 2014-07-26 9:38 ` u-igbb 2014-07-26 17:47 ` Szabolcs Nagy 1 sibling, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-26 9:38 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 04:03:27AM -0400, Rich Felker wrote: > > So I would say it is indeed stupid to localize data meant for > > interchange. Nevertheless it may still be meaningful to format numbers > > for the user's taste when the data presentation is only meant for some > > kind of a "local" context. > > The problem is that the vast majority of actual printing and parsing > of floating point numbers is for interchange purposes, not mere visual > pretty-printing, and the existence of alternate radix characters > introduces subtle bugs into programs that are not tested in such > locales. Very few programs or libraries I've seen go to the trouble to > obtain a usable LC_NUMERIC locale in a portable, thread-safe, and > library-safe way before calling snprintf or strtod. And lots of broken > gui libraries set LC_NUMERIC behind the application's back even if the > application only wanted to set other categories. Ok, the reality is that locale is not being used in a reasonable way so we do not have to bother implementing it for proper use. Instead we are obliged to try to reduce the harm by being non-conforming in a partially compensating fashion. Sigh. Well, locale is a mess by design... > > Is there any evidence that "." is more widely used than "," ? > > Well, 2/3 of the world's population is in India and China and they all > use ".", so I think that pretty much covers the question of which is > "more widely used". Ah indeed. That's a sufficient evidence. > > locale is not about > > representing data for computers, but for humans - and I would love to > > have a best possible internationally useful locale as the default. > > This goes back to the question about modern versus old tradition. 
> Alternate radix points are a cultural convention that's (seemingly, > hopefully) on the way out due to computers and information > interchange. Maybe in some sense this is cultural imperialism (or just > globalization or whatnot) but it's certainly a lot less negative than > the "everyone should use English" attitude. Nobody's saying "don't use > your language", just "don't gratuitously break things for a one-pixel > difference". :-) :-D In practice this calls for "eo_ZZ@decimal_dot" - which actually would make sense. This reminds me that we have an unset issue of naming the variants. Wonder which schemes happen to exist, to be standardized (?), to be in use? Gnu gettext manual states " The ‘@variant’ can denote any kind of characteristics that is not already implied by the language ll and the country CC. It can denote a particular monetary unit. For example, on glibc systems, ‘de_DE@euro’ denotes the locale that uses the Euro currency, in contrast to the older locale ‘de_DE’ which implies the use of the currency before 2002. It can also denote a dialect of the language, or the script used to write text (for example, ‘sr_RS@latin’ uses the Latin script, whereas ‘sr_RS’ uses the Cyrillic script to write Serbian), or the orthography rules, or similar. " I read this as "there is no structure on variant naming and all kinds of variations share the same name space". Then it is the hopefully present comment in the locale definition file which apparently has to be consulted to know what a certain variant is about. Fine with me but I would like to see this stated somewhere (instead of my _guess_ after reading the above documentation - it does _not_ say a word about how one can learn the actual semantics of the variant aka the intention of the locale submitter). A straightforward try to learn what a certain installed locale is about, on a Debian Linux system: $ locale -a | grep en en_US.utf8 $ apropos en_US en_US: nothing appropriate. 
$ On a RedHat Linux system with "@Everything": $ locale -a | grep en ... lots of en_SOMETHING including en_US ... $ apropos en_US strlen_user (9) - Get the size of a string in user space strnlen_user (9) - Get the size of a string in user space $ Iow one has nice prerequisites for keeping the messy thing in a messy state :) Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
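Since the variant has no internal structure, about all a program can do with a name is split it into its parts. A sketch of that split follows, assuming the common language[_territory][.codeset][@modifier] layout seen in names like sr_RS@latin and en_US.utf8. The struct and function names are hypothetical and the field sizes are arbitrary; over-long parts are silently truncated.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the parts of a locale name. */
struct locale_parts {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copy at most dstsize-1 bytes of src[0..len) and NUL-terminate. */
static void copy_span(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split `name` into its parts. Returns 0 on success, -1 if the language
 * part is empty (e.g. "@iso8601"). Missing parts are left as "". */
int split_locale_name(const char *name, struct locale_parts *out)
{
    memset(out, 0, sizeof *out);

    size_t n = strcspn(name, "_.@");
    if (n == 0)
        return -1;
    copy_span(out->language, sizeof out->language, name, n);

    const char *p = name + n;
    if (*p == '_') {
        p++;
        n = strcspn(p, ".@");
        copy_span(out->territory, sizeof out->territory, p, n);
        p += n;
    }
    if (*p == '.') {
        p++;
        n = strcspn(p, "@");
        copy_span(out->codeset, sizeof out->codeset, p, n);
        p += n;
    }
    if (*p == '@')
        copy_span(out->modifier, sizeof out->modifier, p + 1, strlen(p + 1));
    return 0;
}
```

For "eo_ZZ@decimal_dot" this gives language "eo", territory "ZZ" and modifier "decimal_dot"; what the modifier means still has to be looked up elsewhere, which is exactly the gap described above.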
* Re: Locale bikeshed time 2014-07-26 9:38 ` u-igbb @ 2014-07-26 17:47 ` Szabolcs Nagy 2014-07-26 18:23 ` Rich Felker 2014-07-26 18:56 ` u-igbb 0 siblings, 2 replies; 43+ messages in thread From: Szabolcs Nagy @ 2014-07-26 17:47 UTC (permalink / raw) To: musl * u-igbb@aetey.se <u-igbb@aetey.se> [2014-07-26 11:38:05 +0200]: > On Sat, Jul 26, 2014 at 04:03:27AM -0400, Rich Felker wrote: > > Well, 2/3 of the world's population is in India and China and they all > > use ".", so I think that pretty much covers the question of which is > > "more widely used". > > Ah indeed. That's a sufficient evidence. > world is about 7G india+china is about 2.5G that looks closer to 1/3 than to 2/3 but using anything other than '.' as the decimal point is broken ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-26 17:47 ` Szabolcs Nagy @ 2014-07-26 18:23 ` Rich Felker 2014-07-26 18:59 ` u-igbb 2014-07-26 18:56 ` u-igbb 1 sibling, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-26 18:23 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 07:47:06PM +0200, Szabolcs Nagy wrote: > * u-igbb@aetey.se <u-igbb@aetey.se> [2014-07-26 11:38:05 +0200]: > > On Sat, Jul 26, 2014 at 04:03:27AM -0400, Rich Felker wrote: > > > Well, 2/3 of the world's population is in India and China and they all > > > use ".", so I think that pretty much covers the question of which is > > > "more widely used". > > > > Ah indeed. That's a sufficient evidence. > > > > world is about 7G > india+china is about 2.5G > that looks closer to 1/3 than to 2/3 Yes, sorry; I was thinking both were already closer to 2G and was using an old idea (closer to 6G) for world population. So some more work would be needed to get a good estimate. > but using anything other than '.' as the decimal point is broken Agreed. BTW if you support arbitrary radix characters, you should not restrict it to ASCII; this then means the length in bytes of floating point fields varies by locale (currently the only printf specifier where either the contents OR length vary by locale is the nonstandard one, %m) which affects asprintf (right now it's "broken" if it races with setlocale and the format includes %m; I don't know if we care) and the implementation of a lot of other stuff (like wprintf). Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
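The length concern can be shown with a few lines of code. The sketch below formats a number in the "C" locale and then substitutes an arbitrary radix string; format_with_radix is a hypothetical helper for illustration, not a libc interface. U+066B (ARABIC DECIMAL SEPARATOR) is used as an example of a non-ASCII radix character; it takes two bytes in UTF-8.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format `x` with two decimals and replace the '.'
 * with `radix`. Returns the number of bytes written (excluding the NUL),
 * or -1 if the result does not fit or has no radix point. Assumes the
 * program runs in the "C" locale, so snprintf itself emits '.'. */
int format_with_radix(char *buf, size_t size, double x, const char *radix)
{
    char plain[64];
    int n = snprintf(plain, sizeof plain, "%.2f", x);
    if (n < 0 || (size_t)n >= sizeof plain)
        return -1;

    char *dot = strchr(plain, '.');
    if (dot == NULL)
        return -1;

    int head = (int)(dot - plain);
    int total = snprintf(buf, size, "%.*s%s%s", head, plain, radix, dot + 1);
    if (total < 0 || (size_t)total >= size)
        return -1;
    return total;
}
```

Formatting 3.14 gives 4 bytes with "." or ",", but 5 bytes with "\xd9\xab" (U+066B), so any code that sizes buffers from the format string alone would have to account for the locale.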
* Re: Locale bikeshed time 2014-07-26 18:23 ` Rich Felker @ 2014-07-26 18:59 ` u-igbb 2014-07-26 19:14 ` Rich Felker 0 siblings, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-26 18:59 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 02:23:48PM -0400, Rich Felker wrote: > > but using anything other than '.' as the decimal point is broken > > Agreed. BTW if you support arbitrary radix characters, you should not > restrict it to ASCII; this then means the length in bytes of floating I think there are some international recommendations which actually say "either '.' or ',' but not anything else". Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-26 18:59 ` u-igbb @ 2014-07-26 19:14 ` Rich Felker 0 siblings, 0 replies; 43+ messages in thread From: Rich Felker @ 2014-07-26 19:14 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 08:59:04PM +0200, u-igbb@aetey.se wrote: > On Sat, Jul 26, 2014 at 02:23:48PM -0400, Rich Felker wrote: > > > but using anything other than '.' as the decimal point is broken > > > > Agreed. BTW if you support arbitrary radix characters, you should not > > restrict it to ASCII; this then means the length in bytes of floating > > I think there are some international recommendations which actually > say "either '.' or ',' but not anything else". According to Stephane Chazelas on oss-security (http://www.openwall.com/lists/oss-security/2014/07/21/12), glibc allows '2' as the radix point... This is yet another reason for musl's locale system having an "only allow variations that are necessary" approach: it limits the impact of malicious locale files if a user can somehow trick a privileged process into using one. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
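For an implementation that did accept a radix point from locale data, the "2 as radix point" case could be excluded with a simple whitelist. The sketch below accepts only the two characters mentioned in the quoted recommendation; radix_is_acceptable is a hypothetical name, and this is not musl's code (the policy described in this thread is stricter and keeps "." unconditionally).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical check for a radix string read from untrusted locale data:
 * accept exactly "." or ",", reject everything else (digits, empty
 * strings, multi-character strings, NULL). */
int radix_is_acceptable(const char *radix)
{
    return radix != NULL &&
           (strcmp(radix, ".") == 0 || strcmp(radix, ",") == 0);
}
```

A whitelist like this fails closed: a locale file carrying '2' or a multi-byte sequence is simply rejected instead of silently changing how numbers are printed and parsed.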
* Re: Locale bikeshed time 2014-07-26 17:47 ` Szabolcs Nagy 2014-07-26 18:23 ` Rich Felker @ 2014-07-26 18:56 ` u-igbb 2014-07-26 19:30 ` Rich Felker 1 sibling, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-26 18:56 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 07:47:06PM +0200, Szabolcs Nagy wrote: > > On Sat, Jul 26, 2014 at 04:03:27AM -0400, Rich Felker wrote: > > > Well, 2/3 of the world's population is in India and China and they all > > > use ".", so I think that pretty much covers the question of which is > > > "more widely used". > > > > Ah indeed. That's a sufficient evidence. > world is about 7G > india+china is about 2.5G > that looks closer to 1/3 than to 2/3 Oops. I should have checked the numbers :) Thanks for the notice. > but using anything other than '.' as the decimal point is broken I am unsure what/which context you mean (say application development, user settings for the screen, user settings for editing documents to share, or all scenarios at once?) Or do you mean to second Rich's opinion that the applications' use of locale facilities is so inconsistent that there is no use in relying on LC_NUMERIC? (I do not have any bias towards any representation of the radix point but I am happy to learn objective/verifiable arguments for one or another.) This is hardly important though if (as Rich wrote) "C"/"POSIX" locale is specified to use "." and given that musl already has a policy of disallowing other characters. I apparently missed the discussions which led to this policy, sorry if I am beating a dead horse: To prevent users from shooting themselves in the foot looks considerate. Personally I would nevertheless prefer to have a possibility to influence the radix dot character, at least as a locale variant and take the consequences. If anybody else feels for using "," then so be it in the corresponding locale, as the default or as a variant - how much would it cost? 
The generally "most international" locale I am aware of (eo_ZZ) happens to need ",". If your population numbers are correct, this seems to be a proper choice too - modulo broken locale use/deployment. Actually I very much dislike software which expects that it knows my situation better than myself and prevents me from doing what I need. Regards, Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-26 18:56 ` u-igbb @ 2014-07-26 19:30 ` Rich Felker 2014-07-27 7:28 ` u-igbb 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-26 19:30 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 08:56:13PM +0200, u-igbb@aetey.se wrote: > I apparently missed the discussions which led to this policy, > sorry if I am beating a dead horse: The original locale plan was based on the principle that the whole locale system in C/POSIX is poorly designed, but that it's still the basis for supporting native-language interfaces in software, and thus that musl should eventually have the minimal locale support necessary for enabling such usage. Using the locale system as a means for arbitrary non-essential customization, support for legacy character encodings, etc. is outside the scope of that. > To prevent users from shooting themselves in the foot looks > considerate. Personally I would nevertheless prefer to have a possibility > to influence the radix dot character, at least as a locale variant > and take the consequences. > > If anybody else feels for using "," than so be it in the corresponding > locale, as the default or as a variant - how much would it cost? > > The generally "most international" locale I am aware of (eo_ZZ) happens > to need ",". If your population numbers are correct, this seems to be > a proper choice too - modulo broken locale use/deployment. > > Actually I very much dislike software which expects that it knows my > situation better than myself and prevents me from doing what I need. And what about character encoding? Should we also support Latin-1? Or ISO-2022? One thing I don't like about how the whole locale discussion has gone is that, once one thing is added, demands for more and more things that have much higher implementation and maintenance costs, good technical reasons not to provide, and less and less practical benefit, keep popping up. 
Radix points are definitely such an item where the cost (in terms of bug/security risks, code size, making state-free code stateful, maintenance, and just plain being ugly, ...) is relatively high and the benefits are near-zero. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-26 19:30 ` Rich Felker @ 2014-07-27 7:28 ` u-igbb 0 siblings, 0 replies; 43+ messages in thread From: u-igbb @ 2014-07-27 7:28 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 03:30:15PM -0400, Rich Felker wrote: > The original locale plan was based on the principle that the whole > locale system in C/POSIX is poorly designed, but that it's still the Sure. > basis for supporting native-language interfaces in software, and thus > that musl should eventually have the minimal locale support necessary > for enabling such usage. Using the locale system as a means for > arbitrary non-essential customization, support for legacy character > encodings, etc. is outside the scope of that. Sounds pretty reasonable. > > If anybody else feels for using "," than so be it in the corresponding > > locale, as the default or as a variant - how much would it cost? This (the cost) is as I see it the main consideration. > > Actually I very much dislike software which expects that it knows my > > situation better than myself and prevents me from doing what I need. > > And what about character encoding? Should we also support Latin-1? Or This is a different matter, not exactly about cost against perceived benefit. You _do_ offer a means to represent the user's data by unicode/utf-8, so the desired function (e.g. to represent åäöü) is available. It is different with the radix point as there is hardly another means for helping the user to handle her grandmother's favourite radix character. There is a difference between computing-related preferences like "I like Latin-1 and refuse to recode my data to utf-8" and the real life ones like "I grew up in Germany and despite 20 years in USA still confuse 24 and 42 in English". What you call "a single pixel" makes for some people the difference between apparent legibility and gibberish. Note, one e.g. can be very much used to interpreting '.' as a thousand delimiter and will be utterly confused by 3.142 . 
> ISO-2022? One thing I don't like about how the whole locale discussion > has gone is that, once one one thing is added, demands for more and > more things that have much higher implementation and maintenance I am not demanding more things; I just want to weigh cost against benefit. If the cost is high we may have to drop even very useful features. Without analysing the code I can not say how dangerous / bothersome / complex it would be to support one alternative character as radix dot and weigh this against making about half the human population 1% happier. :) (A third character for radix dot would certainly affect much less than half the population). It is you who has the say about the balance; I just call for objective reasoning. Being generally upset about "feature proliferation" does not help; features are what any software is about. Do not doubt, in my eyes compactness and simplicity => robustness and safety are extremely important features, as they are for you. > practical benefit, keep popping up. Radix points is definitely such an > item where the cost (in terms of bug/security risks, code size, making > state-free code stateful, maintenance, and just plain being ugly, ...) > is relatively high That is your area of competence. > and the benefits are near-zero. But this one is possibly to some degree outside of your personal experience and you may _possibly_ have a biased impression. That's why I dare to spend the precious time of yours and of mine for this discussion. Thanks for listening and of course for musl as it is - clean and useful. Yours, Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-25 20:15 ` u-igbb 2014-07-25 22:32 ` Rich Felker @ 2014-07-26 20:43 ` Rich Felker 2014-07-27 7:51 ` u-igbb 1 sibling, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-26 20:43 UTC (permalink / raw) To: musl On Fri, Jul 25, 2014 at 10:15:51PM +0200, u-igbb@aetey.se wrote: > Replying to myself. > > On Fri, Jul 25, 2014 at 11:06:49AM +0200, u-igbb@aetey.se wrote: > > Returning to the naming. As language-based locales are named > > after languages, it would be nice to name other kinds of locale > > data after their "natural association" too. Then politically-bound > > data could be put into the corresponding "territorial" family: > > > > language ll[l][_TT] > > territory TT[_ll[l]] > > A bad idea, forget it. This would be open to misinterpretation > (which key is "more fundamental" for a certain kind of data, > shall it go to ll_TT or TT_ll ?) I wasn't quite sure where to inject this reply into the thread, but one thing I just remembered is that glibc (and the XSI option for POSIX) has [.charset] as part of the standard form for locale names, and all of glibc's usable locales end in ".UTF-8". So a user on a mixed system is likely to have their locale vars set to include ".UTF-8" at the end, and therefore wouldn't get any localization when running musl-linked programs with the locale names we've proposed. The way I see it, we could either have the locale package provide symlinks to all of the locales with ".UTF-8" on the end, or musl itself could ignore anything starting with the first '.' in a locale name. One downside of symlinks is that a locale could uselessly get mapped twice if somebody happens to reference it by both names in their locale vars. It also puts more of a configuration/complexity burden on the installation. But it does keep policy out of libc and saves a few bytes of code in libc. Any opinions on the matter? Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-26 20:43 ` Rich Felker @ 2014-07-27 7:51 ` u-igbb 2014-07-27 8:00 ` Rich Felker 0 siblings, 1 reply; 43+ messages in thread From: u-igbb @ 2014-07-27 7:51 UTC (permalink / raw) To: musl On Sat, Jul 26, 2014 at 04:43:29PM -0400, Rich Felker wrote: > I wasn't quite sure where to inject this reply into the thread, but > one thing I just remembered is that glibc (and the XSI option for > POSIX) has [.charset] as part of the standard form for locale names, > and all of glibc's usable locales end in ".UTF-8". So a user on a > mixed system is likely to have their locale vars set to include > ".UTF-8 "at the end, and therefore wouldn't get any localization when > running musl-linked programs with the locale names we've proposed. Ah yes this is regrettable. The transition from legacy charsets/encodings has already happened and even with glibc .UTF-8 is a de-facto default, thus "shouldn't" have to be indicated. > The way I see it, we could either have the locale package provide > symlinks to all of the locales with ".UTF-8" on the end, or musl > itself could ignore anything starting with the first '.' in a locale > name. One downside of symlinks is that a locale could uselessly get > mapped twice if somebody happens to reference it by both names in > their locale vars. It also puts more of a configuration/complexity > burden on the installation. But it does keep policy out of libc and > saves a few bytes of code in libc. As an integrator I certainly appreciate if I can skip making zillions of legacy links. There is also a matter of spelling utf-8 Utf-8 UTF-8 utf8 UTF8 Utf8 utf_8 (did I forget some? :) which different distros/users may choose differently. Debian Linux: $ locale -a C C.UTF-8 <===== en_US.utf8 <===== POSIX $ Given that the library implies utf-8, please ignore .anything explicitly - this part of the name is meaningless for musl by design. A packager can not fully imitate such behaviour even with a lot of links. 
The rare cases when the user really means a different charset but gets utf-8 are better handled by the user if/when encountered. Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-27 7:51 ` u-igbb @ 2014-07-27 8:00 ` Rich Felker 2014-07-27 8:24 ` u-igbb 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-27 8:00 UTC (permalink / raw) To: musl On Sun, Jul 27, 2014 at 09:51:20AM +0200, u-igbb@aetey.se wrote: > As an integrator I certainly appreciate if I can skip > making zillions of legacy links. > There is also a matter of spelling utf-8 Utf-8 UTF-8 utf8 UTF8 Utf8 utf_8 > (did I forget some? :) which different distros/users may choose differently. > > Debian Linux: > > $ locale -a > C > C.UTF-8 <===== > en_US.utf8 <===== > POSIX > $ > > Given that the library implies utf-8, please ignore .anything > explicitly - this part of the name is meaningless for musl by design. > > A packager can not fully imitate such behaviour even with a lot of links. > > The rare cases when the user really means a different charset > but gets utf-8 are better handled by the user if/when encountered. OK. I actually expected you to prefer symlinks, but if you prefer musl automatically ignoring everything after the dot, I'm quite happy with that approach and I'll probably go with it. Thanks for the feedback! Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
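The "ignore everything after the dot" approach agreed on here can be sketched in a few lines of C. This is a hypothetical helper written for illustration, not the code that went into musl; the name `strip_charset` and its buffer-based interface are assumptions.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not musl's actual code: copy a locale name into
 * `out`, dropping everything from the first '.' onward, so that
 * "en_US.UTF-8", "en_US.utf8" and "en_US" all yield "en_US".
 * Returns 0 on success, -1 if the stripped name does not fit in `out`.
 */
static int strip_charset(const char *name, char *out, size_t outsz)
{
    size_t n = strcspn(name, ".");  /* length up to the first '.', or the whole string */

    if (n >= outsz)
        return -1;
    memcpy(out, name, n);
    out[n] = '\0';
    return 0;
}
```

Because the cut happens at the first '.', every spelling of the charset suffix (UTF-8, utf8, Utf8, ...) is handled without listing any of them. One consequence of this literal reading is that a modifier written after the charset, as in "de_DE.UTF-8@euro", would be dropped as well.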
* Re: Locale bikeshed time 2014-07-27 8:00 ` Rich Felker @ 2014-07-27 8:24 ` u-igbb 0 siblings, 0 replies; 43+ messages in thread From: u-igbb @ 2014-07-27 8:24 UTC (permalink / raw) To: musl On Sun, Jul 27, 2014 at 04:00:31AM -0400, Rich Felker wrote: > OK. I actually expected you to prefer symlinks, but if you prefer musl > automatically ignoring everything after the dot, I'm quite happy with > that approach and I'll probably go with it. Thanks for the feedback! This shows how important it is to talk to each other :) Luckily you do that, thanks. Rune ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-23 16:39 ` Rich Felker 2014-07-23 19:25 ` u-igbb @ 2014-07-23 23:22 ` writeonce 2014-07-23 23:38 ` Rich Felker 1 sibling, 1 reply; 43+ messages in thread From: writeonce @ 2014-07-23 23:22 UTC (permalink / raw) To: musl On 07/23/2014 12:39 PM, Rich Felker wrote: > On that topic, while this is a matter outside my control for > individual users, my preference would be that the official musl-locale > data attempt to avoid multiple variants/modifiers and legacy options > if possible. For example I would like to see the numeric date format > be ISO format in all locales, with traditional formats only where the > natural-language string representations for months/days are included > (and I say this as someone coming from one of the locales, i.e. US, > where the traditional numeric date format is non-ISO). In keeping with > the principle that musl is "modern" I'd like to prefer modern cultural > conventions to historical ones. For what it's worth, I wanted to point out that ISO C explicitly pertains to the Gregorian calendar only, albeit in parentheses (N1570, 7.27.1). For users of [listed in alphabetical order:] Arabic, Chinese, Hebrew, Japanese, Persian, and Tibetan, for instance, there are two different issues at stake: the first is the representation of the date according to the Gregorian calendar in one's own language, which could (~easily~) be made "modern" (ISO compliant), whereas the second is the representation of the date according to the culture's native calendar in the language matching the current locale. While I'm not necessarily suggesting that musl (or any other libc, for that matter) should implement the conversion functions from the Gregorian calendar to other calendars and vice versa, it would be nice if at least the prototypes of the conversion functions were somehow standardized, and also if the locale files likewise accounted for the above issues (e.g. in the form of placeholders). PS. speaking of historical vs. 
modern and LC_MONETARY, we should probably bear in mind the many locale variants that are based on currency only, as for example in the case of EU member countries before and after the Euro. zg ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-23 23:22 ` writeonce @ 2014-07-23 23:38 ` Rich Felker 2014-07-24 1:07 ` writeonce 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-23 23:38 UTC (permalink / raw) To: musl On Wed, Jul 23, 2014 at 07:22:49PM -0400, writeonce@midipix.org wrote: > On 07/23/2014 12:39 PM, Rich Felker wrote: > >On that topic, while this is a matter outside my control for > >individual users, my preference would be that the official > >musl-locale data attempt to avoid multiple variants/modifiers and > >legacy options if possible. For example I would like to see the > >numeric date format be ISO format in all locales, with traditional > >formats only where the natural-language string representations for > >months/days are included (and I say this as someone coming from > >one of the locales, i.e. US, where the traditional numeric date > >format is non-ISO). In keeping with the principle that musl is > >"modern" I'd like to prefer modern cultural conventions to > >historical ones. > For what it's worth, I wanted to point out that the ISO C explicitly > pertains to the Gregorian calendar only, albeit in parenthesis > (N1570, 7.27.1). For users of [listed in alphabetical order:] > Arabic, Chinese, Hebrew, Japanese, Persian, and Tibetan, for > instance, there are two different issues at stake: the first is the > representation of the date according to the Gregorian calendar in > one's own language, which could (~easily~) be made "modern" (ISO > compliant), whereas the second is the representation of the date > according to the culture's native calendar in the language matching > the current locale. 
> > While I'm not necessary suggesting that musl (or any other libc, for > that matter) should implement the conversion functions from the > Gregorian calendar to other calendars and vice versa, it would be > nice if at least the prototypes of the conversion functions were > somehow standardized, and also if the locale files likewise > accounted for the above issues (e.g. in the form of placeholders). The strftime function has the %E? conversion specifiers which, if supported, could provide this kind of functionality, I think. I have no idea how to represent the rules for the conversions, though, or whether doing so would be practical. > PS. speaking of historical vs. modern and LC_MONETARY, we should > probably bear in mind the many locale variants that are based on > currency only, as for example in the case of EU member countries > before and after the Euro. Interesting point I hadn't really thought of. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
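For readers unfamiliar with the %E modifiers mentioned above, here is a minimal sketch of requesting the locale's alternative year representation through strftime. The wrapper name `format_alt_year` is invented for this example. In the "C" locale ISO C says the E modifier is ignored, so the result there is the same as plain %Y; whether any other locale supplies an alternative era or calendar depends entirely on the libc and its locale data.

```c
#include <stddef.h>
#include <time.h>

/*
 * Illustrative wrapper (the name is made up for this sketch): format the
 * year of `tm` using the locale's alternative representation, %EY.
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t format_alt_year(char *buf, size_t bufsz, const struct tm *tm)
{
    return strftime(buf, bufsz, "%EY", tm);
}
```

The interface gives the locale a place to substitute its own year notation, but it says nothing about how a libc would compute a date in a different calendar, which is the open question raised in the message above.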
* Re: Locale bikeshed time 2014-07-23 23:38 ` Rich Felker @ 2014-07-24 1:07 ` writeonce 2014-07-24 1:57 ` Rich Felker 0 siblings, 1 reply; 43+ messages in thread From: writeonce @ 2014-07-24 1:07 UTC (permalink / raw) To: musl On 07/23/2014 07:38 PM, Rich Felker wrote: > On Wed, Jul 23, 2014 at 07:22:49PM -0400, writeonce@midipix.org wrote: >> On 07/23/2014 12:39 PM, Rich Felker wrote: >>> On that topic, while this is a matter outside my control for >>> individual users, my preference would be that the official >>> musl-locale data attempt to avoid multiple variants/modifiers and >>> legacy options if possible. For example I would like to see the >>> numeric date format be ISO format in all locales, with traditional >>> formats only where the natural-language string representations for >>> months/days are included (and I say this as someone coming from >>> one of the locales, i.e. US, where the traditional numeric date >>> format is non-ISO). In keeping with the principle that musl is >>> "modern" I'd like to prefer modern cultural conventions to >>> historical ones. >> For what it's worth, I wanted to point out that the ISO C explicitly >> pertains to the Gregorian calendar only, albeit in parenthesis >> (N1570, 7.27.1). For users of [listed in alphabetical order:] >> Arabic, Chinese, Hebrew, Japanese, Persian, and Tibetan, for >> instance, there are two different issues at stake: the first is the >> representation of the date according to the Gregorian calendar in >> one's own language, which could (~easily~) be made "modern" (ISO >> compliant), whereas the second is the representation of the date >> according to the culture's native calendar in the language matching >> the current locale. 
>> >> While I'm not necessary suggesting that musl (or any other libc, for >> that matter) should implement the conversion functions from the >> Gregorian calendar to other calendars and vice versa, it would be >> nice if at least the prototypes of the conversion functions were >> somehow standardized, and also if the locale files likewise >> accounted for the above issues (e.g. in the form of placeholders). > The strftime function has the %E? conversion specifiers which, if > supported, could provide this kind of functionality, I think. I have > no idea how to represent the rules for the conversions, though, or > whether doing so would be practical. The conversion function I'm missing is one that would produce a native struct tm from either a "general" (Gregorian), or a UTC struct tm. Consider for instance the case of lunisolar calendars with an occasional thirteenth month (i.e. Hebrew, Tibetan): given a Gregorian date, it is often useful to store the corresponding native date in a struct tm, manipulate it in some way (for instance finding the number of days left until the end of the native month), and then print the result. There are already a lot of applications and websites that perform such calculations, yet no standard interface for the conversion functions (at least not as far as I can tell). In terms of practicality: for many of the calendars I mentioned, conversion involves not only a formula, but also date- and year-based considerations. A correct implementation of all the %E? specifiers is accordingly going to include many bytes of code that probably shouldn't be pulled in whenever a "random" static application uses strftime. That being said, if musl's implementation of %E? could use weak aliases and standardized hooks, then applications or calendar-specific libraries could provide these hooks and still use the libc strftime, rather than a complex system of wrappers and conditionals. 
* Not to be extra difficult or anything, but even the simple example in N1570 ("what day of the week is July 4, 2001") contains a cultural bias, namely the assumption that days begin and end at midnight;-) zg > >> PS. speaking of historical vs. modern and LC_MONETARY, we should >> probably bear in mind the many locale variants that are based on >> currency only, as for example in the case of EU member countries >> before and after the Euro. > Interesting point I hadn't really thought of. > > Rich > > ^ permalink raw reply [flat|nested] 43+ messages in thread
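The N1570 question referred to above ("what day of the week is July 4, 2001") can be answered with mktime, which normalizes a struct tm and fills in tm_wday. The sketch below is written for this note and is not quoted from the standard; the helper name `weekday_of` is an assumption. It covers only the Gregorian case: nothing comparable exists in ISO C for producing a native-calendar struct tm, which is the gap described in the message.

```c
#include <time.h>

/*
 * Illustrative helper (name invented for this sketch): return the day of
 * the week for a Gregorian date, 0 = Sunday ... 6 = Saturday, or -1 if
 * mktime cannot represent the date.
 */
static int weekday_of(int year, int month, int day)
{
    struct tm tm = {0};

    tm.tm_year = year - 1900;  /* struct tm counts years from 1900 */
    tm.tm_mon = month - 1;     /* and months from 0 */
    tm.tm_mday = day;
    tm.tm_hour = 12;           /* noon keeps clear of DST transitions */
    tm.tm_isdst = -1;          /* let the library decide whether DST applies */

    if (mktime(&tm) == (time_t)-1)
        return -1;
    return tm.tm_wday;
}
```

Calling `weekday_of(2001, 7, 4)` yields 3, i.e. Wednesday. The midnight-to-midnight day boundary is built into this interface, which is the cultural assumption the footnote points out.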
* Re: Locale bikeshed time 2014-07-24 1:07 ` writeonce @ 2014-07-24 1:57 ` Rich Felker 2014-07-24 2:16 ` writeonce 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-24 1:57 UTC (permalink / raw) To: musl On Wed, Jul 23, 2014 at 09:07:22PM -0400, writeonce@midipix.org wrote: > In terms of practicality: for many of the calendars I mentioned, > conversion involves not only a formula, but also date- and > year-based considerations. A correct implementation of all the %E? > specifiers is accordingly going to include many bytes of code that > probably shouldn't be pulled in whenever a "random" static > application uses strftime. That being said, if musl's Obviously unless the set of such rules is fixed and free of the need for external updates, it would need to be represented as *data* that's loaded as part of the locale file and not as code. Or as a query to a (local) service. > implementation of %E? could use weak aliases and standardized hooks, > then applications or calendar-specific libraries could provide these > hooks and still use the libc strftime, rather than a complex system > of wrappers and conditionals. That's not how weak symbols work -- they're not a way to add plug-in code. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-24 1:57 ` Rich Felker @ 2014-07-24 2:16 ` writeonce 2014-07-24 2:24 ` Rich Felker 0 siblings, 1 reply; 43+ messages in thread From: writeonce @ 2014-07-24 2:16 UTC (permalink / raw) To: musl On 07/23/2014 09:57 PM, Rich Felker wrote: > On Wed, Jul 23, 2014 at 09:07:22PM -0400, writeonce@midipix.org wrote: >> In terms of practicality: for many of the calendars I mentioned, >> conversion involves not only a formula, but also date- and >> year-based considerations. A correct implementation of all the %E? >> specifiers is accordingly going to include many bytes of code that >> probably shouldn't be pulled in whenever a "random" static >> application uses strftime. That being said, if musl's > Obviously unless the set of such rules is fixed and free of the need > for external updates, it would need to be represented as *data* that's > loaded as part of the locale file and not as code. Or as a query to a > (local) service. > >> implementation of %E? could use weak aliases and standardized hooks, >> then applications or calendar-specific libraries could provide these >> hooks and still use the libc strftime, rather than a complex system >> of wrappers and conditionals. > That's now how weak symbols work -- they're not a way to add plug-in > code. > > Rich > > Thanks for the clarification. This leaves a query to a local service the more likely solution since for at least the Hijri (Muslim) and Hebrew (Jewish) calendars, accurate conversion cannot be based on data or tables alone. zg ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-24 2:16 ` writeonce @ 2014-07-24 2:24 ` Rich Felker 2014-07-24 2:59 ` writeonce 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-24 2:24 UTC (permalink / raw) To: musl On Wed, Jul 23, 2014 at 10:16:40PM -0400, writeonce@midipix.org wrote: > >>implementation of %E? could use weak aliases and standardized hooks, > >>then applications or calendar-specific libraries could provide these > >>hooks and still use the libc strftime, rather than a complex system > >>of wrappers and conditionals. > >That's now how weak symbols work -- they're not a way to add plug-in > >code. > > > Thanks for the clarification. This leaves a query to a local > service the more likely solution since for at least the Hijri > (Muslim) and Hebrew (Jewish) calendars, accurate conversion cannot > be based on data or tables alone. How so? All code is fundamentally data/tables. Is your point that the current data is not sufficient to compute all future times (in which case updated data would be needed but sufficient) or just that the algorithms are moderately complex (in which case the data would have to represent a nontrivial computational language)? In any case this is probably quite low priority. I'm not aware of any other libc supporting these features or any significant demand for them. So it's a neat topic to discuss, but getting pretty far off topic from the topic at hand (locale support in 1.1.4). Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-24 2:24 ` Rich Felker @ 2014-07-24 2:59 ` writeonce 0 siblings, 0 replies; 43+ messages in thread From: writeonce @ 2014-07-24 2:59 UTC (permalink / raw) To: musl On 07/23/2014 10:24 PM, Rich Felker wrote: > On Wed, Jul 23, 2014 at 10:16:40PM -0400, writeonce@midipix.org wrote: >>>> implementation of %E? could use weak aliases and standardized hooks, >>>> then applications or calendar-specific libraries could provide these >>>> hooks and still use the libc strftime, rather than a complex system >>>> of wrappers and conditionals. >>> That's now how weak symbols work -- they're not a way to add plug-in >>> code. >>> >> Thanks for the clarification. This leaves a query to a local >> service the more likely solution since for at least the Hijri >> (Muslim) and Hebrew (Jewish) calendars, accurate conversion cannot >> be based on data or tables alone. > How so? All code is fundamentally data/tables. Is your point that the > current data is not sufficient to compute all future times (in which > case updated data would be needed but sufficient) or just that the > algorithms are moderately complex (in which case a the data would have > to represent a nontrivial computational language). For one thing, the algorithms for past dates are moderately (or somewhat more than moderately) complex. This pertains to both the Hijri and Hebrew calendars. With respect to future dates, in the most traditional Muslim countries the first day of the month is predicted astronomically, yet confirmed by traditional means (observation of the new moon). In those countries you might therefore find out "in real time" whether tonight is the 1st of the next month, or the 30th of the current one. For date conversion to be accurate, then, you would have to get your data updated on a monthly basis, which at least for me makes a local service the better solution. 
Obviously, things would have been much easier if the wise and elderly had designed their calendars with an ISO libc in mind (a bug in the prophet utility?), but I personally wouldn't dare trying to change them;-) > > > In any case this is probably quite low priority. I'm not aware of any > other libc supporting these features or any significant demand for > them. So it's a neat topic to discuss, but getting pretty far off > topic from the topic at hand (locale support in 1.1.4). > > Rich > > Of course. That's why I suggested to define only the interfaces (or hooks, which now I understand are not an option), and not worry about the implementation. zg ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-22 18:49 Locale bikeshed time Rich Felker 2014-07-22 20:10 ` u-igbb @ 2014-07-22 20:17 ` Laurent Bercot 2014-07-22 20:36 ` Rich Felker 1 sibling, 1 reply; 43+ messages in thread From: Laurent Bercot @ 2014-07-22 20:17 UTC (permalink / raw) To: musl > So, the first bikeshed decision to be made is what environment > variable to use for the locale path, and what fallback should be if > it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to > use the same var (since apps are already aware of the need to treat it > specially), but on the other it's undesirable to have them tied > together (e.g. if you're using musl as a non-root installation and > can't write to /usr/lib) and to avoid clashing with glibc's files we > would need to choose a subdirectory under $LOCPATH rather than using > it directly. All of these aspects make it a lot less attractive. Well, my suggestion is going to be ugly, but it's the most flexible one: have two variables. LOCPATH for compatibility, and something like MUSL_LOCPATH for admins who fear conflicts. Try $MUSL_LOCPATH first, then fall back to $LOCPATH if MUSL_LOCPATH is not set, then fall back to a hard-coded directory. I'm not experienced enough with locales (nor have enough interest in them) to talk about the other points. -- Laurent ^ permalink raw reply [flat|nested] 43+ messages in thread
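The lookup order proposed in this message can be written down as a short C sketch. It illustrates the proposal only and does not describe what musl does; `MUSL_LOCPATH` is the name floated in the message, while the fallback directory, both function names, and the choice to treat an empty variable as unset are assumptions made for this example.

```c
#include <stdlib.h>

/* Hypothetical hard-coded default; the thread does not name a directory. */
#define FALLBACK_LOCPATH "/usr/share/musl-locale"

/*
 * Pick the locale search path from the two candidate values, in the
 * proposed order: MUSL_LOCPATH first, then LOCPATH, then the built-in
 * default. Treating an empty string as "not set" is an assumption.
 */
static const char *choose_locpath(const char *musl_locpath, const char *locpath)
{
    if (musl_locpath && *musl_locpath)
        return musl_locpath;
    if (locpath && *locpath)
        return locpath;
    return FALLBACK_LOCPATH;
}

/* Convenience wrapper that reads the process environment. */
static const char *locale_search_path(void)
{
    return choose_locpath(getenv("MUSL_LOCPATH"), getenv("LOCPATH"));
}
```

Keeping the decision in `choose_locpath` separates the policy from the environment access, which also makes it easy to see what a two-variable fallback implies: a LOCPATH set for glibc's files would be picked up whenever MUSL_LOCPATH is absent.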
* Re: Locale bikeshed time 2014-07-22 20:17 ` Laurent Bercot @ 2014-07-22 20:36 ` Rich Felker 2014-07-23 22:03 ` Laurent Bercot 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-22 20:36 UTC (permalink / raw) To: musl On Tue, Jul 22, 2014 at 09:17:50PM +0100, Laurent Bercot wrote: > > >So, the first bikeshed decision to be made is what environment > >variable to use for the locale path, and what fallback should be if > >it's not set. Glibc uses $LOCPATH. On the one hand it would be nice to > >use the same var (since apps are already aware of the need to treat it > >specially), but on the other it's undesirable to have them tied > >together (e.g. if you're using musl as a non-root installation and > >can't write to /usr/lib) and to avoid clashing with glibc's files we > >would need to choose a subdirectory under $LOCPATH rather than using > >it directly. All of these aspects make it a lot less attractive. > > Well, my suggestion is going to be ugly, but it's the most flexible > one: have two variables. LOCPATH for compatibility, and something > like MUSL_LOCPATH for admins who fear conflicts. Try $MUSL_LOCPATH > first, then fallback to $LOCPATH if MUSL_LOCPATH is not set, then > fallback to a hard-coded directory. Do you have in mind a usage case this would be beneficial for? Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-22 20:36 ` Rich Felker @ 2014-07-23 22:03 ` Laurent Bercot 2014-07-23 22:12 ` Rich Felker 0 siblings, 1 reply; 43+ messages in thread From: Laurent Bercot @ 2014-07-23 22:03 UTC (permalink / raw) To: musl On 22/07/2014 21:36, Rich Felker wrote: > Do you have in mind a usage case this would be beneficial for? Not particularly - again, I'm far from a locale expert. But you seem reluctant to reuse LOCPATH for mixed musl+glibc installations, and also reluctant to break an established convention, so having both would cater to all use cases - LOCPATH for the principle of least surprise and general use, and MUSL_LOCPATH for people who know what they are doing and need a different variable. -- Laurent ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time 2014-07-23 22:03 ` Laurent Bercot @ 2014-07-23 22:12 ` Rich Felker 2014-07-24 15:38 ` u-igbb 0 siblings, 1 reply; 43+ messages in thread From: Rich Felker @ 2014-07-23 22:12 UTC (permalink / raw) To: musl On Wed, Jul 23, 2014 at 11:03:00PM +0100, Laurent Bercot wrote: > On 22/07/2014 21:36, Rich Felker wrote: > >Do you have in mind a usage case this would be beneficial for? > > Not particularly - again, I'm far from a locale expert. But you > seem reluctant to reuse LOCPATH for mixed musl+glibc installations, > and also reluctant to break an established convention, so having I was not reluctant to break established convention here, but reluctant to add new variables systems integrators and administrators need to be aware of (for the sake of preserving them or filtering them). But I think the alternative is worse. > both would cater to all use cases - LOCPATH for the principle of > least surprise and general use, and MUSL_LOCPATH for people who > know what they are doing and need a different variable. I'm not seeing a way that setting LOCPATH without being aware that the app you're trying to affect is using musl could be helpful. The locales you're trying to make visible to the app need to be in the format used by musl, not the glibc format, so you have to already be aware of this. Maybe there's something I'm not seeing -- this is why I asked -- but if there's no reason for it, I think searching both "just because" is bad. Rich ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Locale bikeshed time
From: u-igbb @ 2014-07-24 15:38 UTC
To: musl

On Wed, Jul 23, 2014 at 06:12:12PM -0400, Rich Felker wrote:
> I'm not seeing a way that setting LOCPATH without being aware that the
> app you're trying to affect is using musl could be helpful. The
> locales you're trying to make visible to the app need to be in the
> format used by musl, not the glibc format, so you have to already be
> aware of this. Maybe there's something I'm not seeing -- this is why I
> asked -- but if there's no reason for it, I think searching both "just
> because" is bad.

+1
(in my eyes it would be plainly harmful for a number of reasons)

Rune
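
[Editorial illustration, not part of the thread.] The policy argued for above (locale names that are never pathnames, resolved against a single musl-specific path variable rather than both LOCPATH and a second one) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not musl's implementation: the helper names locale_name_ok and locale_candidate are hypothetical, the variable name MUSL_LOCPATH is taken from Laurent's message, and the colon-separated list format and the rule of skipping empty components are assumptions borrowed from common PATH-style conventions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: accept only plain locale names, i.e. non-empty,
 * not starting with '.', and containing no '/' (so never a pathname). */
int locale_name_ok(const char *name)
{
    return name != NULL && name[0] != '\0' && name[0] != '.'
        && strchr(name, '/') == NULL;
}

/* Hypothetical helper: write the idx-th candidate file name
 * "<dir>/<name>" for a colon-separated directory list into out.
 * Empty list components are skipped (an assumption of this sketch).
 * Returns 1 on success; 0 if the name is rejected, the list has fewer
 * than idx+1 non-empty components, or the result does not fit. */
int locale_candidate(const char *pathlist, const char *name,
                     size_t idx, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    if (pathlist == NULL || !locale_name_ok(name))
        return 0;

    for (const char *p = pathlist; ; ) {
        size_t len = strcspn(p, ":");   /* length of this component */
        if (len > 0) {
            if (idx == 0) {
                int n = snprintf(out, outlen, "%.*s/%s", (int)len, p, name);
                return n > 0 && (size_t)n < outlen;
            }
            idx--;
        }
        if (p[len] == '\0')
            return 0;                   /* ran off the end of the list */
        p += len + 1;                   /* step past the ':' */
    }
}
```

A caller would pass getenv("MUSL_LOCPATH") as pathlist and try to open each candidate in turn for idx = 0, 1, 2, ... until one succeeds or the function returns 0. Because only one variable is consulted, every directory on the list is expected to hold files in a single format, which is the point Rich makes above.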