* a bug in bindtextdomain() and strip '.UTF-8' @ 2017-01-20 11:25 He X 2017-01-29 4:52 ` He X 0 siblings, 1 reply; 39+ messages in thread From: He X @ 2017-01-20 11:25 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 425 bytes --] sorry for my poor english :D first, it's not p->domainname but q->domainname in dcngettext:77. second, any news about http://www.openwall.com/lists/musl/2016/05/11/81? strip '.UTF-8' is important, i think. irc log: 18:58 < xhe> @dalias: i must found a bug in bindtextdomain(), and also the improvement about stripping '.UTF-8' should be merged(my code sucks, that's for my tests), detailed: http://pastebin.com/3C2APqMH [-- Attachment #2: Type: text/html, Size: 1173 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-20 11:25 a bug in bindtextdomain() and strip '.UTF-8' He X @ 2017-01-29 4:52 ` He X 2017-01-29 13:39 ` Szabolcs Nagy 0 siblings, 1 reply; 39+ messages in thread From: He X @ 2017-01-29 4:52 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 1155 bytes --] found two more bugs related to intl: 1. no memset after malloc, caused chromium crash: http://www.openwall.com/lists/musl/2017/01/28/1 --- musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000 +++ musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000 @@ -180,6 +180,7 @@ __munmap((void *)map, map_size); goto notrans; } + memset(p, 0, sizeof *p + namelen + 1); p->map = map; p->map_size = map_size; memcpy(p->name, name, namelen+1); 2. musl uses generic config of libstdc++, which blocked the support of locale, patch is there: https://github.com/xhebox/noname-linux/issues/2#issuecomment-275704150 2017-01-20 19:25 GMT+08:00 He X <xw897002528@gmail.com>: > sorry for my poor english :D > > first, it's not p->domainname but q->domainname in dcngettext:77. > > second, any news about http://www.openwall.com/lists/musl/2016/05/11/81? > strip '.UTF-8' is important, i think. > > irc log: > 18:58 < xhe> @dalias: i must found a bug in bindtextdomain(), and also the > improvement about stripping '.UTF-8' should be merged(my code sucks, > that's for my tests), detailed: http://pastebin.com/3C2APqMH > [-- Attachment #2: Type: text/html, Size: 3170 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 4:52 ` He X @ 2017-01-29 13:39 ` Szabolcs Nagy 2017-01-29 14:07 ` Rich Felker 0 siblings, 1 reply; 39+ messages in thread From: Szabolcs Nagy @ 2017-01-29 13:39 UTC (permalink / raw) To: musl * He X <xw897002528@gmail.com> [2017-01-29 12:52:56 +0800]: > 1. no memset after malloc, caused chromium crash: > http://www.openwall.com/lists/musl/2017/01/28/1 > --- musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000 > +++ musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000 > @@ -180,6 +180,7 @@ > __munmap((void *)map, map_size); > goto notrans; > } > + memset(p, 0, sizeof *p + namelen + 1); > p->map = map; > p->map_size = map_size; > memcpy(p->name, name, namelen+1); if you want to zero the entire allocation, then use calloc. but i think initializing plural_rule is enough. > 2. musl uses generic config of libstdc++, which blocked the support of > locale, patch is there: > https://github.com/xhebox/noname-linux/issues/2#issuecomment-275704150 this breaks the abi of libstdc++ because the definition of a type in the public api is changed. so existing c++ binaries break if the toolchain is patched. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 13:39 ` Szabolcs Nagy @ 2017-01-29 14:07 ` Rich Felker 2017-01-29 14:48 ` He X 0 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-01-29 14:07 UTC (permalink / raw) To: musl On Sun, Jan 29, 2017 at 02:39:47PM +0100, Szabolcs Nagy wrote: > * He X <xw897002528@gmail.com> [2017-01-29 12:52:56 +0800]: > > 1. no memset after malloc, caused chromium crash: > > http://www.openwall.com/lists/musl/2017/01/28/1 > > --- musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000 > > +++ musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000 > > @@ -180,6 +180,7 @@ > > __munmap((void *)map, map_size); > > goto notrans; > > } > > + memset(p, 0, sizeof *p + namelen + 1); > > p->map = map; > > p->map_size = map_size; > > memcpy(p->name, name, namelen+1); > > if you want to zero the entire allocation, then use calloc. > but i think initializing plural_rule is enough. Conceptually it seems nice to avoid filling name[] twice, but since namelen is bounded in size (note: we should be using strnlen elsewhere where it first gets introduced, but strlen is already checked) it's not such a practical issue. I would be ok with just zeroing plural_rule and nplurals (the latter doesn't seem necessary but leaving it uninitialized until later seems to be a poor choice w.r.t. future-proofing the code) but just calling calloc for these allocations is probably the cleanest fix. > > 2. musl uses generic config of libstdc++, which blocked the support of > > locale, patch is there: > > https://github.com/xhebox/noname-linux/issues/2#issuecomment-275704150 > > this breaks the abi of libstdc++ because the definition of > a type in the public api is changed. > > so existing c++ binaries break if the toolchain is patched. I'm not sufficiently familiar with this code to understand why right away. Do you see an easy fix to avoid ABI breakage while fixing the bug? Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 14:07 ` Rich Felker @ 2017-01-29 14:48 ` He X 2017-01-29 15:55 ` Rich Felker ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: He X @ 2017-01-29 14:48 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2580 bytes --] 1. agreed with rich, nplurals is important too; compiling the kernel, cannot update the patch 2. no other ways, musl will use generic config 100%, and then the exception, the run time error is hardcoded there; but i doubt if this really breaks binaries, the function is only called by libstdc++ itself. you cant only update the config, but does not update libstdc++. libstdc++ exported the same abi for common binaries, wont break most dynamic-loaded binary in my view. btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread), and these two patches, fcitx, chromium are working well. but there're some names like 'de_DE@euro', 'zh_CN.GBK', these should be stripped, either, any good ideas? 2017-01-29 22:07 GMT+08:00 Rich Felker <dalias@libc.org>: > On Sun, Jan 29, 2017 at 02:39:47PM +0100, Szabolcs Nagy wrote: > > * He X <xw897002528@gmail.com> [2017-01-29 12:52:56 +0800]: > > > 1. no memset after malloc, caused chromium crash: > > > http://www.openwall.com/lists/musl/2017/01/28/1 > > > --- musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 > +0000 > > > +++ musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 > +0000 > > > @@ -180,6 +180,7 @@ > > > __munmap((void *)map, map_size); > > > goto notrans; > > > } > > > + memset(p, 0, sizeof *p + namelen + 1); > > > p->map = map; > > > p->map_size = map_size; > > > memcpy(p->name, name, namelen+1); > > > > if you want to zero the entire allocation, then use calloc. > > but i think initializing plural_rule is enough. > > Conceptually it seems nice to avoid filling name[] twice, but since > namelen is bounded in size (note: we should be using strnlen elsewhere > where it first gets introduced, but strlen is already checked) it's > not such a practical issue. I would be ok with just zeroing > plural_rule and nplurals (the latter doesn't seem necessary but > leaving it uninitialized until later seems to be a poor choice w.r.t. > future-proofing the code) but just calling calloc for these > allocations is probably the cleanest fix. > > > > 2. musl uses generic config of libstdc++, which blocked the support of > > > locale, patch is there: > > > https://github.com/xhebox/noname-linux/issues/2#issuecomment-275704150 > > > > this breaks the abi of libstdc++ because the definition of > > a type in the public api is changed. > > > > so existing c++ binaries break if the toolchain is patched. > > I'm not sufficiently familiar with this code to understand why right > away. Do you see an easy fix to avoid ABI breakage while fixing the > bug? > > Rich > [-- Attachment #2: Type: text/html, Size: 3631 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 14:48 ` He X @ 2017-01-29 15:55 ` Rich Felker 2017-01-29 16:14 ` He X 2017-01-29 16:37 ` Rich Felker 2017-01-29 16:40 ` Szabolcs Nagy 2 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-01-29 15:55 UTC (permalink / raw) To: musl On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote: > 1. agreed with rich, nplurals is important too; compiling the kernel, > cannot update the patch > 2. no other ways, musl will use generic config 100%, and then the > exception, the run time error is hardcoded there; but i doubt if this > really breaks binaries, the function is only called by libstdc++ itself. > you cant only update the config, but does not update libstdc++. libstdc++ > exported the same abi for common binaries, wont break most dynamic-loaded > binary in my view. > > btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread), and > these two patches, fcitx, chromium are working well. > > but there're some names like 'de_DE@euro', 'zh_CN.GBK', these should be > stripped, either, any good ideas? This has all been discussed before; see this email and others in the thread: http://www.openwall.com/lists/musl/2016/05/11/8 Masanori Ogino was going to work on some follow-up research, testing, and/or implementation but didn't get around to it. I'm not aware of any newer findings that contradict the direction suggested in that thread. For your specific examples, de_DE@euro would be searched in de_DE, de@euro, and finally de; zh_CN.GBK would be invalid (non-UTF-8 encodings not permitted) but it's not clear to me how it should be handled (rejection or rewriting at setlocale time, stripping .GBK at translation load time, or leaving .GBK there and letting translation fail). Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 15:55 ` Rich Felker @ 2017-01-29 16:14 ` He X 2017-01-29 16:33 ` Rich Felker 0 siblings, 1 reply; 39+ messages in thread From: He X @ 2017-01-29 16:14 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 1923 bytes --] I can't wait, can i work on it and make a patch for these issues if Masanori Ogino is busy now? I'd like to see that these issues could be solved in official musl repo as soon as possible. And maybe rejection for NON-UTF-8, since 'LANG=zh_CN.GBK ./a.out( setlocale(LC_*, "") )' showed me a segfault with glibc. Wang He 2017-01-29 23:55 GMT+08:00 Rich Felker <dalias@libc.org>: > On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote: > > 1. agreed with rich, nplurals is important too; compiling the kernel, > > cannot update the patch > > 2. no other ways, musl will use generic config 100%, and then the > > exception, the run time error is hardcoded there; but i doubt if this > > really breaks binaries, the function is only called by libstdc++ itself. > > you cant only update the config, but does not update libstdc++. libstdc++ > > exported the same abi for common binaries, wont break most dynamic-loaded > > binary in my view. > > > > btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread), > and > > these two patches, fcitx, chromium are working well. > > > > but there're some names like 'de_DE@euro', 'zh_CN.GBK', these should be > > stripped, either, any good ideas? > > This has all been discussed before; see this email and others in the > thread: > > http://www.openwall.com/lists/musl/2016/05/11/8 > > Masanori Ogino was going to work on some follow-up research, testing, > and/or implementation but didn't get around to it. I'm not aware of > any newer findings that contradict the direction suggested in that > thread. > > For your specific examples, de_DE@euro would be searched in de_DE, > de@euro, and finally de; zh_CN.GBK would be invalid (non-UTF-8 > encodings not permitted) but it's not clear to me how it should be > handled (rejection or rewriting at setlocale time, stripping .GBK at > translation load time, or leaving .GBK there and letting translation > fail). > > Rich > [-- Attachment #2: Type: text/html, Size: 2690 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 16:14 ` He X @ 2017-01-29 16:33 ` Rich Felker 2017-02-08 10:13 ` He X 0 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-01-29 16:33 UTC (permalink / raw) To: musl On Mon, Jan 30, 2017 at 12:14:49AM +0800, He X wrote: > I can't wait, can i work on it and make a patch for these issues if Masanori > Ogino is busy now? I'd like to see that these issues could be solved in > official musl repo as soon as possible. I'm not saying you need to wait, just that you should be aware of past discussion of the topic, and if you want to propose patches they should either follow the behavior outlined before or come with discussion of why you think a different behavior is more appropriate. > And maybe rejection for NON-UTF-8, since 'LANG=zh_CN.GBK ./a.out( > setlocale(LC_*, "") )' showed me a segfault with glibc. I don't think "it crashes on glibc" is a good justification for anything. Rather there should probably be UX discussions of what different choices mean for different poor-configuration situations that are likely to arise in the wild (from things like LC_* getting copied over ssh). Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 16:33 ` Rich Felker @ 2017-02-08 10:13 ` He X 2017-02-08 14:31 ` Rich Felker 0 siblings, 1 reply; 39+ messages in thread From: He X @ 2017-02-08 10:13 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2283 bytes --] here the patch is: http://paste.ubuntu.com/23953329/ The code tested, but maybe it sucks. 1. striping @xx, _TT: when mapping with full name failed, we check if there's a '@' in locname. if so, go back to the part of copying catname, override and skip '@xx'. Then we check if there's a '_', and if both '@' and '_TT' is there, point locname to '@xx', set a correct loclen, go back to the part of writing locname to replace '_TT' with '@xx'. If not both, skip and simply override '_TT'. Because there's also '_' in 'LC_xx', we may get into a dead loop of stripping '_TT'. So locname is checked, it's set to NULL if we used strchr to skip once. Same reason, we may get into a dead loop of overriding '_TT'. The first position of '/' should be front of the '_' if we replaced it once, the name will like: 'zh@t/LC_xx'. zh_CN@t (stripped by the first part)-> zh_CN (overrided by the second part)-> zh@t (stripped by the first part again)-> zh 2. about rewriting of '.GBK': I agreeded with keeping the original value of user, and stripping it in gettext() before. But i thought that someone may validate if libc set the correct charset by setlocale(). So we should rewrite .XX to .UTF-8 in setlocale(), we cant return a wrong value in principle. 2017-01-30 0:33 GMT+08:00 Rich Felker <dalias@libc.org>: > On Mon, Jan 30, 2017 at 12:14:49AM +0800, He X wrote: > > I can't wait, can i work on it and make a patch for these issues if > Masanori > > Ogino is busy now? I'd like to see that these issues could be solved in > > official musl repo as soon as possible. > > I'm not saying you need to wait, just that you should be aware of past > discussion of the topic, and if you want to propose patches they > should either follow the behavior outlined before or come with > discussion of why you think a different behavior is more appropriate. > > > And maybe rejection for NON-UTF-8, since 'LANG=zh_CN.GBK ./a.out( > > setlocale(LC_*, "") )' showed me a segfault with glibc. > > I don't think "it crashes on glibc" is a good justification for > anything. Rather there should probably be UX discussions of what > different choices mean for different poor-configuration situations > that are likely to arise in the wild (from things like LC_* getting > copied over ssh). > > Rich > [-- Attachment #2: Type: text/html, Size: 3184 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-08 10:13 ` He X @ 2017-02-08 14:31 ` Rich Felker 2017-02-09 9:49 ` He X 0 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-02-08 14:31 UTC (permalink / raw) To: musl On Wed, Feb 08, 2017 at 06:13:30PM +0800, He X wrote: > here the patch is: http://paste.ubuntu.com/23953329/ > The code tested, but maybe it sucks. Patches need to be attached and sent to the list, not pastebins that might disappear. The latter don't work for discussing and preserving discussion of the patch. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-08 14:31 ` Rich Felker @ 2017-02-09 9:49 ` He X 2017-02-11 2:36 ` Rich Felker 0 siblings, 1 reply; 39+ messages in thread From: He X @ 2017-02-09 9:49 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 401 bytes --] sry! 2017-02-08 22:31 GMT+08:00 Rich Felker <dalias@libc.org>: > On Wed, Feb 08, 2017 at 06:13:30PM +0800, He X wrote: > > here the patch is: http://paste.ubuntu.com/23953329/ > > The code tested, but maybe it sucks. > > Patches need to be attached and sent to the list, not pastebins that > might disappear. The latter don't work for discussing and preserving > discussion of the patch. > > Rich > [-- Attachment #1.2: Type: text/html, Size: 893 bytes --] [-- Attachment #2: locale.diff --] [-- Type: text/plain, Size: 2015 bytes --] --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 @@ -19,6 +19,7 @@ }; static void *volatile bindings; +char *__strchrnul(const char *, int); static char *gettextdir(const char *domainname, size_t *dirlen) { @@ -143,7 +143,7 @@ catname = catnames[category]; catlen = catlens[category]; - loclen = strlen(locname); + loclen = __strchrnul(locname, '.') - locname; size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; char name[namelen+1], *s = name; @@ -157,6 +157,8 @@ +rewrite_loc: memcpy(s, locname, loclen); s[loclen] = '/'; s += loclen + 1; +skip_loc: memcpy(s, catname, catlen); s[catlen] = '/'; s += catlen + 1; @@ -174,7 +175,22 @@ void *old_cats; size_t map_size; const void *map = __map_file(name, &map_size); - if (!map) goto notrans; + if (!map) { + if (s = strchr(name + dirlen + 1, '@')) { + *s++ = '/'; + goto skip_loc; + } + if (locname && (s = strchr(name + dirlen + 1, '_')) && (strchr(name + dirlen +1, '/') > s) ) { + if (locname = strchr(locname, '@')) { + loclen = __strchrnul(lm->name, '.') - locname; + goto rewrite_loc; + } else { + *s++ = '/'; + goto skip_loc; + } + } + goto notrans; + } p = calloc(sizeof *p + namelen + 1, 1); if (!p) { __munmap((void *)map, map_size); --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 @@ -32,6 +32,7 @@ struct __locale_map *new = 0; const char *path = 0, *z; char buf[256]; + char *dotp; size_t l, n; if (!*val) { @@ -40,6 +41,12 @@ (val = getenv("LANG")) && *val || (val = "C.UTF-8"); } + if (dotp = strchr(val, '.')) { + char part[256]; + memcpy(part, val, dotp - val); + memcpy(&part[dotp - val], ".UTF-8\0", 7); + val = part; + } /* Limit name length and forbid leading dot or any slashes. */ for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++); ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-09 9:49 ` He X @ 2017-02-11 2:36 ` Rich Felker 2017-02-11 6:00 ` He X 0 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-02-11 2:36 UTC (permalink / raw) To: musl On Thu, Feb 09, 2017 at 05:49:13PM +0800, He X wrote: > sry! > > 2017-02-08 22:31 GMT+08:00 Rich Felker <dalias@libc.org>: > > > On Wed, Feb 08, 2017 at 06:13:30PM +0800, He X wrote: > > > here the patch is: http://paste.ubuntu.com/23953329/ > > > The code tested, but maybe it sucks. > > > > Patches need to be attached and sent to the list, not pastebins that > > might disappear. The latter don't work for discussing and preserving > > discussion of the patch. > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > @@ -19,6 +19,7 @@ > }; > > static void *volatile bindings; > +char *__strchrnul(const char *, int); > > static char *gettextdir(const char *domainname, size_t *dirlen) > { > @@ -143,7 +143,7 @@ > > catname = catnames[category]; > catlen = catlens[category]; > - loclen = strlen(locname); > + loclen = __strchrnul(locname, '.') - locname; > > size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > char name[namelen+1], *s = name; > @@ -157,6 +157,8 @@ > +rewrite_loc: > memcpy(s, locname, loclen); > s[loclen] = '/'; > s += loclen + 1; > +skip_loc: > memcpy(s, catname, catlen); > s[catlen] = '/'; > s += catlen + 1; > @@ -174,7 +175,22 @@ > void *old_cats; > size_t map_size; > const void *map = __map_file(name, &map_size); > - if (!map) goto notrans; > + if (!map) { > + if (s = strchr(name + dirlen + 1, '@')) { > + *s++ = '/'; > + goto skip_loc; > + } > + if (locname && (s = strchr(name + dirlen + 1, '_')) && (strchr(name + dirlen +1, '/') > s) ) { > + if (locname = strchr(locname, '@')) { > + loclen = __strchrnul(lm->name, '.') - locname; > + goto rewrite_loc; > + } else { > + *s++ = '/'; > + goto skip_loc; > + } > + } > + goto notrans; > + } This doesn't work because it changes both the key used for the lookup and the filename mapped. If you try this code with a translation that requires a fallback, and run it under strace, you'll see that _every_ call to gettext will try again to find the nonexistent files. It could be fixed, but I think the code should be refactored so that, rather than the msgcat list being indexed by pathname strings, it's indexed by tuples of: ( struct __locale_map *, struct binding *, category ) These are all integers/pointers and thus compare very fast versus the current strcmp operation, and it's very quick to look them up. Then we only have to construct the pathname string when a new file needs to be loaded, not on every call, and you're free to clobber the pathname string while doing fallbacks. > p = calloc(sizeof *p + namelen + 1, 1); > if (!p) { > __munmap((void *)map, map_size); > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > @@ -32,6 +32,7 @@ > struct __locale_map *new = 0; > const char *path = 0, *z; > char buf[256]; > + char *dotp; > size_t l, n; > > if (!*val) { > @@ -40,6 +41,12 @@ > (val = getenv("LANG")) && *val || > (val = "C.UTF-8"); > } > + if (dotp = strchr(val, '.')) { > + char part[256]; > + memcpy(part, val, dotp - val); > + memcpy(&part[dotp - val], ".UTF-8\0", 7); > + val = part; > + } > > /* Limit name length and forbid leading dot or any slashes. */ > for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++); I don't think this part is desirable, but if it were, it would need to be done differently. As-is, it has serious UB, use of part[] after the end of its lifetime. It also seems to have no check to see that dotp-val is less than 256-7 or even that it's bounded, whereas the code that immediately follows checks the length of the string pointed to by val. I think what it should be doing is the opposite, stopping when hitting a dot in the name and only using the part up to the dot, except in the one special case "C.UTF-8". The subsequent path search for the locale file should probably then be repeated with combinations of dropping @mod and _CC suffixes, but this dropping should _not_ affect the name that's saved and reported back. (That is, if LC_TIME=fr_CA but only a "fr" locale file exists, the "fr" file should get mapped but the name returned by setlocale, and saved for use by gettext, should still be the full "fr_CA" in case applications have "fr_CA" translations.) Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-11 2:36 ` Rich Felker @ 2017-02-11 6:00 ` He X 2017-02-11 23:59 ` Rich Felker 2017-02-12 2:34 ` Rich Felker 0 siblings, 2 replies; 39+ messages in thread From: He X @ 2017-02-11 6:00 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 6301 bytes --] fresh patch :) 1. It's easier that just stopping at dot, and i think this should be commented in the wiki or somewhere. 2. I read your first part of reply for 20mins, but im not sure; If i understand right, you mean, let the __locale_map* and strcut binding* be the id-card for msgcat list instead of the long name string, not only faster, but also more easy to construct pathname string. But there's some questions: + I removed name from msgcat, i can't find its use there, is it safe? + gettextdir() is replaced by a new loop, since i need the pointer of struct binding not only the dirname, but then, gettextdir() is only called by bindtextdomain(), is there a need to keep it? Or we have a better way to get the pointer of struct binding? + you said msgcat's indexed by ( struct __locale_map *, struct binding *, category ), but i found lm(locale_map) is located by category, so if category is different, then we can't get the same lm, so we can just compare lm, right? 2017-02-11 10:36 GMT+08:00 Rich Felker <dalias@libc.org>: > On Thu, Feb 09, 2017 at 05:49:13PM +0800, He X wrote: > > sry! > > > > 2017-02-08 22:31 GMT+08:00 Rich Felker <dalias@libc.org>: > > > > > On Wed, Feb 08, 2017 at 06:13:30PM +0800, He X wrote: > > > > here the patch is: http://paste.ubuntu.com/23953329/ > > > > The code tested, but maybe it sucks. > > > > > > Patches need to be attached and sent to the list, not pastebins that > > > might disappear. The latter don't work for discussing and preserving > > > discussion of the patch. > > > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > > @@ -19,6 +19,7 @@ > > }; > > > > static void *volatile bindings; > > +char *__strchrnul(const char *, int); > > > > static char *gettextdir(const char *domainname, size_t *dirlen) > > { > > @@ -143,7 +143,7 @@ > > > > catname = catnames[category]; > > catlen = catlens[category]; > > - loclen = strlen(locname); > > + loclen = __strchrnul(locname, '.') - locname; > > > > size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > > char name[namelen+1], *s = name; > > @@ -157,6 +157,8 @@ > > +rewrite_loc: > > memcpy(s, locname, loclen); > > s[loclen] = '/'; > > s += loclen + 1; > > +skip_loc: > > memcpy(s, catname, catlen); > > s[catlen] = '/'; > > s += catlen + 1; > > @@ -174,7 +175,22 @@ > > void *old_cats; > > size_t map_size; > > const void *map = __map_file(name, &map_size); > > - if (!map) goto notrans; > > + if (!map) { > > + if (s = strchr(name + dirlen + 1, '@')) { > > + *s++ = '/'; > > + goto skip_loc; > > + } > > + if (locname && (s = strchr(name + dirlen + 1, > '_')) && (strchr(name + dirlen +1, '/') > s) ) { > > + if (locname = strchr(locname, '@')) { > > + loclen = __strchrnul(lm->name, > '.') - locname; > > + goto rewrite_loc; > > + } else { > > + *s++ = '/'; > > + goto skip_loc; > > + } > > + } > > + goto notrans; > > + } > > This doesn't work because it changes both the key used for the lookup > and the filename mapped. If you try this code with a translation that > requires a fallback, and run it under strace, you'll see that _every_ > call to gettext will try again to find the nonexistent files. > > It could be fixed, but I think the code should be refactored so that, > rather than the msgcat list being indexed by pathname strings, it's > indexed by tuples of: > > ( struct __locale_map *, struct binding *, category ) > > These are all integers/pointers and thus compare very fast versus the > current strcmp operation, and it's very quick to look them up. Then we > only have to construct the pathname string when a new file needs to be > loaded, not on every call, and you're free to clobber the pathname > string while doing fallbacks. > > > p = calloc(sizeof *p + namelen + 1, 1); > > if (!p) { > > __munmap((void *)map, map_size); > > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > > @@ -32,6 +32,7 @@ > > struct __locale_map *new = 0; > > const char *path = 0, *z; > > char buf[256]; > > + char *dotp; > > size_t l, n; > > > > if (!*val) { > > @@ -40,6 +41,12 @@ > > (val = getenv("LANG")) && *val || > > (val = "C.UTF-8"); > > } > > + if (dotp = strchr(val, '.')) { > > + char part[256]; > > + memcpy(part, val, dotp - val); > > + memcpy(&part[dotp - val], ".UTF-8\0", 7); > > + val = part; > > + } > > > > /* Limit name length and forbid leading dot or any slashes. */ > > for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++); > > I don't think this part is desirable, but if it were, it would need to > be done differently. As-is, it has serious UB, use of part[] after the > end of its lifetime. It also seems to have no check to see that > dotp-val is less than 256-7 or even that it's bounded, whereas the > code that immediately follows checks the length of the string pointed > to by val. > > I think what it should be doing is the opposite, stopping when hitting > a dot in the name and only using the part up to the dot, except in the > one special case "C.UTF-8". The subsequent path search for the locale > file should probably then be repeated with combinations of dropping > @mod and _CC suffixes, but this dropping should _not_ affect the name > that's saved and reported back. (That is, if LC_TIME=fr_CA but only a > "fr" locale file exists, the "fr" file should get mapped but the name > returned by setlocale, and saved for use by gettext, should still be > the full "fr_CA" in case applications have "fr_CA" translations.) > > Rich > [-- Attachment #1.2: Type: text/html, Size: 8221 bytes --] [-- Attachment #2: locale.diff --] [-- Type: text/plain, Size: 3878 bytes --] --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 @@ -100,7 +100,8 @@ size_t map_size; void *volatile plural_rule; volatile int nplurals; - char name[]; + struct binding *binding; + struct __locale_map *lm; }; static char *dummy_gettextdomain() @@ -120,58 +122,87 @@ struct msgcat *p; struct __locale_struct *loc = CURRENT_LOCALE; const struct __locale_map *lm; - const char *dirname, *locname, *catname; - size_t dirlen, loclen, catlen, domlen; + size_t domlen; + struct binding *q; if ((unsigned)category >= LC_ALL) goto notrans; if (!domainname) domainname = __gettextdomain(); domlen = strnlen(domainname, NAME_MAX+1); if (domlen > NAME_MAX) goto notrans; - dirname = gettextdir(domainname, &dirlen); - if (!dirname) goto notrans; + for (q=bindings; q; q=q->next) + if (!strcmp(q->domainname, domainname) && q->active) + break; + if (!q) goto notrans; lm = loc->cat[category]; if (!lm) { notrans: return (char *) ((n == 1) ? msgid1 : msgid2); } - locname = lm->name; - - catname = catnames[category]; - catlen = catlens[category]; - loclen = strlen(locname); - - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; - char name[namelen+1], *s = name; - - memcpy(s, dirname, dirlen); - s[dirlen] = '/'; - s += dirlen + 1; - memcpy(s, locname, loclen); - s[loclen] = '/'; - s += loclen + 1; - memcpy(s, catname, catlen); - s[catlen] = '/'; - s += catlen + 1; - memcpy(s, domainname, domlen); - s[domlen] = '.'; - s[domlen+1] = 'm'; - s[domlen+2] = 'o'; - s[domlen+3] = 0; for (p=cats; p; p=p->next) - if (!strcmp(p->name, name)) + if (p->binding == q && p->lm == lm) break; if (!p) { + const char *dirname, *locname, *catname; + size_t dirlen, loclen, catlen; void *old_cats; size_t map_size; + + dirname = q->dirname; + locname = lm->name; + catname = catnames[category]; + + dirlen = q->dirlen; + loclen = strlen(locname); + catlen = catlens[category]; + + size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; + char name[namelen+1], *s = name; + char *str = name; + + memcpy(s, dirname, dirlen); + s[dirlen] = '/'; + s += dirlen + 1; + memcpy(s, locname, loclen); + s[loclen] = '/'; + s += loclen + 1; +skip_loc: + memcpy(s, catname, catlen); + s[catlen] = '/'; + s += catlen + 1; + memcpy(s, domainname, domlen); + s[domlen] = '.'; + s[domlen+1] = 'm'; + s[domlen+2] = 'o'; + s[domlen+3] = 0; + const void *map = __map_file(name, &map_size); - if (!map) goto notrans; + if (!map) { + if (s = strchr(name+dirlen+1, '@')) { + *s++ = '/'; + goto skip_loc;; + } + if ( str && (s = strchr(name+dirlen+1, '_')) && (s < strchr(name+dirlen+1, '/')) ) { + if (str = strchr(locname, '@')) { + loclen += locname - str; + memcpy(s, str, loclen); + s[loclen] = '/'; + s += loclen + 1; + str = 0; + goto skip_loc; + } else { + *s++ = '/'; + goto skip_loc; + } + } + goto notrans; + } p = calloc(sizeof *p + namelen + 1, 1); if (!p) { __munmap((void *)map, map_size); goto notrans; @@ -209,7 +209,6 @@ } p->map = map; p->map_size = map_size; - memcpy(p->name, name, namelen+1); do { old_cats = cats; p->next = old_cats; --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 @@ -49,8 +49,8 @@ } /* Limit name length and forbid leading dot or any slashes. */ - for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++); - if (val[0]=='.' || val[n]) val = "C.UTF-8"; + for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' && val[n]!='.'; n++); + if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8"; int builtin = (val[0]=='C' && !val[1]) || !strcmp(val, "C.UTF-8") || !strcmp(val, "POSIX"); ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-11 6:00 ` He X @ 2017-02-11 23:59 ` Rich Felker 2017-02-12 2:34 ` Rich Felker 1 sibling, 0 replies; 39+ messages in thread From: Rich Felker @ 2017-02-11 23:59 UTC (permalink / raw) To: musl On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote: > fresh patch :) > 1. It's easier that just stopping at dot, and i think this should be > commented in the wiki or somewhere. > 2. I read your first part of reply for 20mins, but im not sure; If i > understand right, you mean, let the __locale_map* and strcut binding* be > the id-card for msgcat list instead of the long name string, not only > faster, but also more easy to construct pathname string. Yes. The values needed for the "id-card" (key) for the lookup are: 1. loc->cat[category] 2. category 3. The struct binding * active for domainname; gettextdir should be replaced with a function to lookup the binding rather than just returning the dir name. These three pointer/integer values uniquely determine the pathname(s) to try, and thus the mapped translation file to use. > But there's some > questions: > + I removed name from msgcat, i can't find its use there, is it safe? I think that's fine. > + gettextdir() is replaced by a new loop, since i need the pointer of > struct binding not only the dirname, but then, gettextdir() is only called > by bindtextdomain(), is there a need to keep it? Or we have a better way to > get the pointer of struct binding? It could be replaced with a function calle getdomainbinding that returns the struct binding * for the domain argument. Then the caller can use the returned pointer to lookup an existing mapped msgcat, or read out the ->dirname member if it needs to construct a path to map a new one. > + you said msgcat's indexed by ( struct __locale_map *, struct binding *, > category ), but i found lm(locale_map) is located by category, so if > category is different, then we can't get the same lm, so we can just > compare lm, right? If LC_TIME and LC_MESSAGES are both the same locale, then loc->cat[LC_TIME] and loc->cat[LC_MESSAGES] will both be the same pointer. Thus category also needs to be kept for the lookup (and path expansion). Does this help? I'll review the patch code separately. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-11 6:00 ` He X 2017-02-11 23:59 ` Rich Felker @ 2017-02-12 2:34 ` Rich Felker 2017-02-12 6:56 ` He X 2017-02-13 8:01 ` He X 1 sibling, 2 replies; 39+ messages in thread From: Rich Felker @ 2017-02-12 2:34 UTC (permalink / raw) To: musl On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote: > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > @@ -100,7 +100,8 @@ > size_t map_size; > void *volatile plural_rule; > volatile int nplurals; > - char name[]; > + struct binding *binding; > + struct __locale_map *lm; > }; As stated in the reply to message body, I think you need the category in the keying too, because there can be different .mo files loaded depending on which category was requested. > static char *dummy_gettextdomain() > @@ -120,58 +122,87 @@ > struct msgcat *p; > struct __locale_struct *loc = CURRENT_LOCALE; > const struct __locale_map *lm; > - const char *dirname, *locname, *catname; > - size_t dirlen, loclen, catlen, domlen; > + size_t domlen; > + struct binding *q; > > if ((unsigned)category >= LC_ALL) goto notrans; > > if (!domainname) domainname = __gettextdomain(); > > domlen = strnlen(domainname, NAME_MAX+1); > if (domlen > NAME_MAX) goto notrans; > > - dirname = gettextdir(domainname, &dirlen); > - if (!dirname) goto notrans; > + for (q=bindings; q; q=q->next) > + if (!strcmp(q->domainname, domainname) && q->active) > + break; > + if (!q) goto notrans; Looks ok. I had said this should be a function but it really doesn't need to be; it's plenty simple inline. > lm = loc->cat[category]; > if (!lm) { > notrans: > return (char *) ((n == 1) ? msgid1 : msgid2); > } > - locname = lm->name; > - > - catname = catnames[category]; > - catlen = catlens[category]; > - loclen = strlen(locname); > - > - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > - char name[namelen+1], *s = name; > - > - memcpy(s, dirname, dirlen); > - s[dirlen] = '/'; > - s += dirlen + 1; > - memcpy(s, locname, loclen); > - s[loclen] = '/'; > - s += loclen + 1; > - memcpy(s, catname, catlen); > - s[catlen] = '/'; > - s += catlen + 1; > - memcpy(s, domainname, domlen); > - s[domlen] = '.'; > - s[domlen+1] = 'm'; > - s[domlen+2] = 'o'; > - s[domlen+3] = 0; > > for (p=cats; p; p=p->next) > - if (!strcmp(p->name, name)) > + if (p->binding == q && p->lm == lm) > break; && p->cat == category > if (!p) { > + const char *dirname, *locname, *catname; > + size_t dirlen, loclen, catlen; > void *old_cats; > size_t map_size; > + > + dirname = q->dirname; > + locname = lm->name; > + catname = catnames[category]; > + > + dirlen = q->dirlen; > + loclen = strlen(locname); > + catlen = catlens[category]; Now that these are only computed once rather than per-call, optimizing out strlen is probably not worthwhile anymore, but it doesn't really hurt either. Not something you need to change, just a comment. > + > + size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > + char name[namelen+1], *s = name; > + char *str = name; > + > + memcpy(s, dirname, dirlen); > + s[dirlen] = '/'; > + s += dirlen + 1; > + memcpy(s, locname, loclen); > + s[loclen] = '/'; > + s += loclen + 1; > +skip_loc: > + memcpy(s, catname, catlen); > + s[catlen] = '/'; > + s += catlen + 1; > + memcpy(s, domainname, domlen); > + s[domlen] = '.'; > + s[domlen+1] = 'm'; > + s[domlen+2] = 'o'; > + s[domlen+3] = 0; Actually, now that this code is not a hot path, it should just be using snprintf to construct the pathname, I think. It would be a lot simpler and easier to ensure correctness. > + > const void *map = __map_file(name, &map_size); > - if (!map) goto notrans; > + if (!map) { > + if (s = strchr(name+dirlen+1, '@')) { > + *s++ = '/'; > + goto skip_loc;; > + } > + if ( str && (s = strchr(name+dirlen+1, '_')) && (s < strchr(name+dirlen+1, '/')) ) { > + if (str = strchr(locname, '@')) { > + loclen += locname - str; > + memcpy(s, str, loclen); > + s[loclen] = '/'; > + s += loclen + 1; > + str = 0; > + goto skip_loc; > + } else { > + *s++ = '/'; > + goto skip_loc; > + } > + } > + goto notrans; > + } Using snprintf should also make it easy to get rid of the goto/retry logic here, perhaps even with a 4-iteration loop and array of which format modifications happen on each iteration. > p = calloc(sizeof *p + namelen + 1, 1); > if (!p) { > __munmap((void *)map, map_size); > goto notrans; > @@ -209,7 +209,6 @@ > } > p->map = map; > p->map_size = map_size; > - memcpy(p->name, name, namelen+1); > do { > old_cats = cats; > p->next = old_cats; > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > @@ -49,8 +49,8 @@ > } > > /* Limit name length and forbid leading dot or any slashes. */ > - for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++); > - if (val[0]=='.' || val[n]) val = "C.UTF-8"; > + for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' && val[n]!='.'; n++); > + if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8"; > int builtin = (val[0]=='C' && !val[1]) > || !strcmp(val, "C.UTF-8") > || !strcmp(val, "POSIX"); This looks ok but might still need some tweaks. Should an input like "zh_CN.GBK" get treated as "zh_CN" (thus outputting UTF-8 that might appear as junk on the user's terminal) or as "C" (no localization) with only ASCII characters (safe for the user's terminal), or even cause setlocale to fail and return an error so that the application can decide what to do? These are not technical comments on your patch but policy matters the community should weigh in on. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-12 2:34 ` Rich Felker @ 2017-02-12 6:56 ` He X 2017-02-12 7:11 ` He X 2017-02-13 17:08 ` Rich Felker 2017-02-13 8:01 ` He X 1 sibling, 2 replies; 39+ messages in thread From: He X @ 2017-02-12 6:56 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 9048 bytes --] 1. cat is added to the keys, also do a validate 2. so we what do we deal with the gettextdir() exactly? inline it or construct a gettextpointer()? 3. i added a extra locbuf array, and goto is replaced by a loop, memcpy is replaced by snprintf, compiled, and working well with fcitx 4. i just found that i forgot to store the keys to new buffer, it's ok to just use normal expression? or we need atomic operations? ``` + p->cat = category; + p->binding = q; + p->lm = lm; ``` 5. I do want to rewrite all to .UTF8, but it's a bit annoying as your words, then i changed the code to simply strip. > (safe for the user's terminal) LANG is set by users who are using musl and it's modified to zh_CN at setlocale(), app will use UTF8 directly, there's no such situation where charset will cause troubles to users' terminal, except apps which get the LANG manually by getenv(). I have not seen such strange applications so far, and most apps only have the UTF8 translation files. For moving from glibc to musl, i think doing this way is good for now, we could delete it later, or just keep it forever. And most people won't use non-UTF8 at all, if they do use GBK, their app will even fallback to UTF8, because no translation files for GBK. So, it's not so dagerous, i think :). And for developers, they should not use setlocale to detect the charset, this is wrong, nl_langinfo is the correct way. If they use, stripping will let their app know something went wrong. Strip .GBK or .UTF-8, so users would be happy that their old settings are working, developers will notice their mistakes that using setlocale() to validate charset is wrong. We get a lot more than failing the setlocale() and return C, the only bad thing is we need to care about a almost impossible event: an app directly getenv(). 2017-02-12 10:34 GMT+08:00 Rich Felker <dalias@libc.org>: > On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote: > > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > > @@ -100,7 +100,8 @@ > > size_t map_size; > > void *volatile plural_rule; > > volatile int nplurals; > > - char name[]; > > + struct binding *binding; > > + struct __locale_map *lm; > > }; > > As stated in the reply to message body, I think you need the category > in the keying too, because there can be different .mo files loaded > depending on which category was requested. > > > static char *dummy_gettextdomain() > > @@ -120,58 +122,87 @@ > > struct msgcat *p; > > struct __locale_struct *loc = CURRENT_LOCALE; > > const struct __locale_map *lm; > > - const char *dirname, *locname, *catname; > > - size_t dirlen, loclen, catlen, domlen; > > + size_t domlen; > > + struct binding *q; > > > > if ((unsigned)category >= LC_ALL) goto notrans; > > > > if (!domainname) domainname = __gettextdomain(); > > > > domlen = strnlen(domainname, NAME_MAX+1); > > if (domlen > NAME_MAX) goto notrans; > > > > - dirname = gettextdir(domainname, &dirlen); > > - if (!dirname) goto notrans; > > + for (q=bindings; q; q=q->next) > > + if (!strcmp(q->domainname, domainname) && q->active) > > + break; > > + if (!q) goto notrans; > > Looks ok. I had said this should be a function but it really doesn't > need to be; it's plenty simple inline. > > > lm = loc->cat[category]; > > if (!lm) { > > notrans: > > return (char *) ((n == 1) ? msgid1 : msgid2); > > } > > - locname = lm->name; > > - > > - catname = catnames[category]; > > - catlen = catlens[category]; > > - loclen = strlen(locname); > > - > > - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > > - char name[namelen+1], *s = name; > > - > > - memcpy(s, dirname, dirlen); > > - s[dirlen] = '/'; > > - s += dirlen + 1; > > - memcpy(s, locname, loclen); > > - s[loclen] = '/'; > > - s += loclen + 1; > > - memcpy(s, catname, catlen); > > - s[catlen] = '/'; > > - s += catlen + 1; > > - memcpy(s, domainname, domlen); > > - s[domlen] = '.'; > > - s[domlen+1] = 'm'; > > - s[domlen+2] = 'o'; > > - s[domlen+3] = 0; > > > > for (p=cats; p; p=p->next) > > - if (!strcmp(p->name, name)) > > + if (p->binding == q && p->lm == lm) > > break; > > && p->cat == category > > > if (!p) { > > + const char *dirname, *locname, *catname; > > + size_t dirlen, loclen, catlen; > > void *old_cats; > > size_t map_size; > > + > > + dirname = q->dirname; > > + locname = lm->name; > > + catname = catnames[category]; > > + > > + dirlen = q->dirlen; > > + loclen = strlen(locname); > > + catlen = catlens[category]; > > Now that these are only computed once rather than per-call, optimizing > out strlen is probably not worthwhile anymore, but it doesn't really > hurt either. Not something you need to change, just a comment. > > > + > > + size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > > + char name[namelen+1], *s = name; > > + char *str = name; > > + > > + memcpy(s, dirname, dirlen); > > + s[dirlen] = '/'; > > + s += dirlen + 1; > > + memcpy(s, locname, loclen); > > + s[loclen] = '/'; > > + s += loclen + 1; > > +skip_loc: > > + memcpy(s, catname, catlen); > > + s[catlen] = '/'; > > + s += catlen + 1; > > + memcpy(s, domainname, domlen); > > + s[domlen] = '.'; > > + s[domlen+1] = 'm'; > > + s[domlen+2] = 'o'; > > + s[domlen+3] = 0; > > Actually, now that this code is not a hot path, it should just be > using snprintf to construct the pathname, I think. It would be a lot > simpler and easier to ensure correctness. > > > + > > const void *map = __map_file(name, &map_size); > > - if (!map) goto notrans; > > + if (!map) { > > + if (s = strchr(name+dirlen+1, '@')) { > > + *s++ = '/'; > > + goto skip_loc;; > > + } > > + if ( str && (s = strchr(name+dirlen+1, '_')) && (s > < strchr(name+dirlen+1, '/')) ) { > > + if (str = strchr(locname, '@')) { > > + loclen += locname - str; > > + memcpy(s, str, loclen); > > + s[loclen] = '/'; > > + s += loclen + 1; > > + str = 0; > > + goto skip_loc; > > + } else { > > + *s++ = '/'; > > + goto skip_loc; > > + } > > + } > > + goto notrans; > > + } > > Using snprintf should also make it easy to get rid of the goto/retry > logic here, perhaps even with a 4-iteration loop and array of which > format modifications happen on each iteration. > > > p = calloc(sizeof *p + namelen + 1, 1); > > if (!p) { > > __munmap((void *)map, map_size); > > goto notrans; > > @@ -209,7 +209,6 @@ > > } > > p->map = map; > > p->map_size = map_size; > > - memcpy(p->name, name, namelen+1); > > do { > > old_cats = cats; > > p->next = old_cats; > > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > > @@ -49,8 +49,8 @@ > > } > > > > /* Limit name length and forbid leading dot or any slashes. */ > > - for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++); > > - if (val[0]=='.' || val[n]) val = "C.UTF-8"; > > + for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' && > val[n]!='.'; n++); > > + if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8"; > > int builtin = (val[0]=='C' && !val[1]) > > || !strcmp(val, "C.UTF-8") > > || !strcmp(val, "POSIX"); > > This looks ok but might still need some tweaks. Should an input like > "zh_CN.GBK" get treated as "zh_CN" (thus outputting UTF-8 that might > appear as junk on the user's terminal) or as "C" (no localization) > with only ASCII characters (safe for the user's terminal), or even > cause setlocale to fail and return an error so that the application > can decide what to do? These are not technical comments on your patch > but policy matters the community should weigh in on. > > Rich > [-- Attachment #1.2: Type: text/html, Size: 12745 bytes --] [-- Attachment #2: locale.diff --] [-- Type: text/plain, Size: 3236 bytes --] --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 @@ -100,7 +100,9 @@ size_t map_size; void *volatile plural_rule; volatile int nplurals; - char name[]; + struct binding *binding; + struct __locale_map *lm; + struct msgcat cat; }; static char *dummy_gettextdomain() @@ -120,8 +122,8 @@ struct msgcat *p; struct __locale_struct *loc = CURRENT_LOCALE; const struct __locale_map *lm; - const char *dirname, *locname, *catname; - size_t dirlen, loclen, catlen, domlen; + size_t domlen; + struct binding *q; if ((unsigned)category >= LC_ALL) goto notrans; @@ -130,47 +132,62 @@ domlen = strnlen(domainname, NAME_MAX+1); if (domlen > NAME_MAX) goto notrans; - dirname = gettextdir(domainname, &dirlen); - if (!dirname) goto notrans; + for (q=bindings; q; q=q->next) + if (!strcmp(q->domainname, domainname) && q->active) + break; + if (!q) goto notrans; lm = loc->cat[category]; if (!lm) { notrans: return (char *) ((n == 1) ? msgid1 : msgid2); } - locname = lm->name; - - catname = catnames[category]; - catlen = catlens[category]; - loclen = strlen(locname); - - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; - char name[namelen+1], *s = name; - - memcpy(s, dirname, dirlen); - s[dirlen] = '/'; - s += dirlen + 1; - memcpy(s, locname, loclen); - s[loclen] = '/'; - s += loclen + 1; - memcpy(s, catname, catlen); - s[catlen] = '/'; - s += catlen + 1; - memcpy(s, domainname, domlen); - s[domlen] = '.'; - s[domlen+1] = 'm'; - s[domlen+2] = 'o'; - s[domlen+3] = 0; for (p=cats; p; p=p->next) - if (!strcmp(p->name, name)) + if (p->binding == q && p->lm == lm && p->cat == category) break; if (!p) { + const char *dirname, *locname, *catname; + size_t dirlen, loclen, catlen; void *old_cats; size_t map_size; - const void *map = __map_file(name, &map_size); + + dirname = q->dirname; + locname = lm->name; + catname = catnames[category]; + + dirlen = q->dirlen; + loclen = strlen(locname); + catlen = catlens[category]; + + size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; + char name[namelen+1]; + char locbuf[loclen+1], *locp = locbuf; + const void *map; + + memcpy(locbuf, locname, loclen); + locbuf[loclen] = 0; + + for (;;) { + snprintf(name, namelen+1, "%s/%s/%s/%s.mo\0", dirname, locbuf, catname, domainname); + if (map = __map_file(name, &map_size)) break; + + if (locp = strchr(locbuf, '@')) { + *locp = 0; + locbuf[loclen] = '@'; + } else if (locp = strchr(locbuf, '_')) { + if (locbuf[loclen] == '@') { + locbuf[loclen] = 0; + *locp = '@'; + strcat(locp+1, locbuf + strlen(locbuf) + 1); + } else *locp = 0; + } else { + break; + } + } if (!map) goto notrans; + p = calloc(sizeof *p + namelen + 1, 1); if (!p) { __munmap((void *)map, map_size); @@ -178,7 +195,6 @@ } p->map = map; p->map_size = map_size; - memcpy(p->name, name, namelen+1); do { old_cats = cats; p->next = old_cats; @@ -193,6 +193,9 @@ __munmap((void *)map, map_size); goto notrans; } + p->cat = category; + p->binding = q; + p->lm = lm; p->map = map; p->map_size = map_size; do { ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-12 6:56 ` He X @ 2017-02-12 7:11 ` He X 2017-02-13 17:08 ` Rich Felker 1 sibling, 0 replies; 39+ messages in thread From: He X @ 2017-02-12 7:11 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 9460 bytes --] sry, i uploaded the wront patch, it should be int cat, rather than struct msgcat :/ 2017-02-12 14:56 GMT+08:00 He X <xw897002528@gmail.com>: > 1. cat is added to the keys, also do a validate > 2. so we what do we deal with the gettextdir() exactly? inline it or > construct a gettextpointer()? > 3. i added a extra locbuf array, and goto is replaced by a loop, memcpy is > replaced by snprintf, compiled, and working well with fcitx > 4. i just found that i forgot to store the keys to new buffer, it's ok to > just use normal expression? or we need atomic operations? > ``` > + p->cat = category; > + p->binding = q; > + p->lm = lm; > ``` > 5. I do want to rewrite all to .UTF8, but it's a bit annoying as your > words, then i changed the code to simply strip. > > > (safe for the user's terminal) > LANG is set by users who are using musl and it's modified to zh_CN at > setlocale(), app will use UTF8 directly, there's no such situation where > charset will cause troubles to users' terminal, except apps which get the > LANG manually by getenv(). I have not seen such strange applications so > far, and most apps only have the UTF8 translation files. > > For moving from glibc to musl, i think doing this way is good for now, we > could delete it later, or just keep it forever. And most people won't use > non-UTF8 at all, if they do use GBK, their app will even fallback to UTF8, > because no translation files for GBK. So, it's not so dagerous, i think > :). > > And for developers, they should not use setlocale to detect the charset, > this is wrong, nl_langinfo is the correct way. If they use, stripping will > let their app know something went wrong. > > Strip .GBK or .UTF-8, so users would be happy that their old settings are > working, developers will notice their mistakes that using setlocale() to > validate charset is wrong. We get a lot more than failing the setlocale() > and return C, the only bad thing is we need to care about a almost > impossible event: an app directly getenv(). > > 2017-02-12 10:34 GMT+08:00 Rich Felker <dalias@libc.org>: > >> On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote: >> > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 >> > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 >> > @@ -100,7 +100,8 @@ >> > size_t map_size; >> > void *volatile plural_rule; >> > volatile int nplurals; >> > - char name[]; >> > + struct binding *binding; >> > + struct __locale_map *lm; >> > }; >> >> As stated in the reply to message body, I think you need the category >> in the keying too, because there can be different .mo files loaded >> depending on which category was requested. >> >> > static char *dummy_gettextdomain() >> > @@ -120,58 +122,87 @@ >> > struct msgcat *p; >> > struct __locale_struct *loc = CURRENT_LOCALE; >> > const struct __locale_map *lm; >> > - const char *dirname, *locname, *catname; >> > - size_t dirlen, loclen, catlen, domlen; >> > + size_t domlen; >> > + struct binding *q; >> > >> > if ((unsigned)category >= LC_ALL) goto notrans; >> > >> > if (!domainname) domainname = __gettextdomain(); >> > >> > domlen = strnlen(domainname, NAME_MAX+1); >> > if (domlen > NAME_MAX) goto notrans; >> > >> > - dirname = gettextdir(domainname, &dirlen); >> > - if (!dirname) goto notrans; >> > + for (q=bindings; q; q=q->next) >> > + if (!strcmp(q->domainname, domainname) && q->active) >> > + break; >> > + if (!q) goto notrans; >> >> Looks ok. I had said this should be a function but it really doesn't >> need to be; it's plenty simple inline. >> >> > lm = loc->cat[category]; >> > if (!lm) { >> > notrans: >> > return (char *) ((n == 1) ? msgid1 : msgid2); >> > } >> > - locname = lm->name; >> > - >> > - catname = catnames[category]; >> > - catlen = catlens[category]; >> > - loclen = strlen(locname); >> > - >> > - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; >> > - char name[namelen+1], *s = name; >> > - >> > - memcpy(s, dirname, dirlen); >> > - s[dirlen] = '/'; >> > - s += dirlen + 1; >> > - memcpy(s, locname, loclen); >> > - s[loclen] = '/'; >> > - s += loclen + 1; >> > - memcpy(s, catname, catlen); >> > - s[catlen] = '/'; >> > - s += catlen + 1; >> > - memcpy(s, domainname, domlen); >> > - s[domlen] = '.'; >> > - s[domlen+1] = 'm'; >> > - s[domlen+2] = 'o'; >> > - s[domlen+3] = 0; >> > >> > for (p=cats; p; p=p->next) >> > - if (!strcmp(p->name, name)) >> > + if (p->binding == q && p->lm == lm) >> > break; >> >> && p->cat == category >> >> > if (!p) { >> > + const char *dirname, *locname, *catname; >> > + size_t dirlen, loclen, catlen; >> > void *old_cats; >> > size_t map_size; >> > + >> > + dirname = q->dirname; >> > + locname = lm->name; >> > + catname = catnames[category]; >> > + >> > + dirlen = q->dirlen; >> > + loclen = strlen(locname); >> > + catlen = catlens[category]; >> >> Now that these are only computed once rather than per-call, optimizing >> out strlen is probably not worthwhile anymore, but it doesn't really >> hurt either. Not something you need to change, just a comment. >> >> > + >> > + size_t namelen = dirlen+1 + loclen+1 + catlen+1 + >> domlen+3; >> > + char name[namelen+1], *s = name; >> > + char *str = name; >> > + >> > + memcpy(s, dirname, dirlen); >> > + s[dirlen] = '/'; >> > + s += dirlen + 1; >> > + memcpy(s, locname, loclen); >> > + s[loclen] = '/'; >> > + s += loclen + 1; >> > +skip_loc: >> > + memcpy(s, catname, catlen); >> > + s[catlen] = '/'; >> > + s += catlen + 1; >> > + memcpy(s, domainname, domlen); >> > + s[domlen] = '.'; >> > + s[domlen+1] = 'm'; >> > + s[domlen+2] = 'o'; >> > + s[domlen+3] = 0; >> >> Actually, now that this code is not a hot path, it should just be >> using snprintf to construct the pathname, I think. It would be a lot >> simpler and easier to ensure correctness. >> >> > + >> > const void *map = __map_file(name, &map_size); >> > - if (!map) goto notrans; >> > + if (!map) { >> > + if (s = strchr(name+dirlen+1, '@')) { >> > + *s++ = '/'; >> > + goto skip_loc;; >> > + } >> > + if ( str && (s = strchr(name+dirlen+1, '_')) && >> (s < strchr(name+dirlen+1, '/')) ) { >> > + if (str = strchr(locname, '@')) { >> > + loclen += locname - str; >> > + memcpy(s, str, loclen); >> > + s[loclen] = '/'; >> > + s += loclen + 1; >> > + str = 0; >> > + goto skip_loc; >> > + } else { >> > + *s++ = '/'; >> > + goto skip_loc; >> > + } >> > + } >> > + goto notrans; >> > + } >> >> Using snprintf should also make it easy to get rid of the goto/retry >> logic here, perhaps even with a 4-iteration loop and array of which >> format modifications happen on each iteration. >> >> > p = calloc(sizeof *p + namelen + 1, 1); >> > if (!p) { >> > __munmap((void *)map, map_size); >> > goto notrans; >> > @@ -209,7 +209,6 @@ >> > } >> > p->map = map; >> > p->map_size = map_size; >> > - memcpy(p->name, name, namelen+1); >> > do { >> > old_cats = cats; >> > p->next = old_cats; >> > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 >> > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 >> > @@ -49,8 +49,8 @@ >> > } >> > >> > /* Limit name length and forbid leading dot or any slashes. */ >> > - for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++); >> > - if (val[0]=='.' || val[n]) val = "C.UTF-8"; >> > + for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' && >> val[n]!='.'; n++); >> > + if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8"; >> > int builtin = (val[0]=='C' && !val[1]) >> > || !strcmp(val, "C.UTF-8") >> > || !strcmp(val, "POSIX"); >> >> This looks ok but might still need some tweaks. Should an input like >> "zh_CN.GBK" get treated as "zh_CN" (thus outputting UTF-8 that might >> appear as junk on the user's terminal) or as "C" (no localization) >> with only ASCII characters (safe for the user's terminal), or even >> cause setlocale to fail and return an error so that the application >> can decide what to do? These are not technical comments on your patch >> but policy matters the community should weigh in on. >> >> Rich >> > > [-- Attachment #2: Type: text/html, Size: 13352 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-12 6:56 ` He X 2017-02-12 7:11 ` He X @ 2017-02-13 17:08 ` Rich Felker 1 sibling, 0 replies; 39+ messages in thread From: Rich Felker @ 2017-02-13 17:08 UTC (permalink / raw) To: musl On Sun, Feb 12, 2017 at 02:56:53PM +0800, He X wrote: > 1. cat is added to the keys, also do a validate > 2. so we what do we deal with the gettextdir() exactly? inline it or > construct a gettextpointer()? > 3. i added a extra locbuf array, and goto is replaced by a loop, memcpy is > replaced by snprintf, compiled, and working well with fcitx I haven't verified the loop logic yet but on a high level it looks correct. > 4. i just found that i forgot to store the keys to new buffer, it's ok to > just use normal expression? or we need atomic operations? > ``` > + p->cat = category; > + p->binding = q; > + p->lm = lm; > ``` This is fine since the new msgcat is not visible to other threads until it's installed with an atomic, which makes all previous writes visible. I do want to rework this all with a lock structure rather than atomics but that's a separate project. > 5. I do want to rewrite all to .UTF8, but it's a bit annoying as your > words, then i changed the code to simply strip. Since this part is separate and there seems to be disagreement about what it should do, let's separate it from the issue at hand; it's really a separate change from making gettext do proper fallbacks anyway. > > (safe for the user's terminal) > LANG is set by users who are using musl and it's modified to zh_CN at > setlocale(), app will use UTF8 directly, there's no such situation where > charset will cause troubles to users' terminal, except apps which get the > LANG manually by getenv(). I have not seen such strange applications so > far, and most apps only have the UTF8 translation files. > > For moving from glibc to musl, i think doing this way is good for now, we > could delete it later, or just keep it forever. And most people won't use > non-UTF8 at all, if they do use GBK, their app will even fallback to UTF8, > because no translation files for GBK. So, it's not so dagerous, i think :). The main considerations are: 1. what happens when a glibc user ssh's into a musl-based system 2. what happens when a musl user ssh's into a glibc-based system 3. what happens when running musl binaries on a glibc-based system For #1 and #3, it's desirable for musl to accept ".UTF-8" in the locale name, and for #2, users may desire to have ".UTF-8" in their LC_* env vars so that remote glibc programs behave correctly. For #1 and #3, if a glibc uses is using a legacy non-UTF-8 locale and runs a musl program, they're either going to get messed-up output or ASCII-only, depending on decisions we make and/or what their locale value is. These are not really important since legacy encodings are not supported, but it might be nice to make least-bad. If the user has a locale name like "fr_FR" or "zh_CN" that, that's going to be interpreted differently by musl vs glibc; that was already decided a long time ago in the interest of designing around the future rather than broken legacy stuff. But if the locale name is explicitly non-UTF-8 like "zh_CN.GBK", we could opt to reject it without breaking anything, and this may give users better feedback about what's going wrong if they have such settings when ssh'ing into a musl-based system. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-12 2:34 ` Rich Felker 2017-02-12 6:56 ` He X @ 2017-02-13 8:01 ` He X 2017-02-13 13:28 ` Rich Felker 1 sibling, 1 reply; 39+ messages in thread From: He X @ 2017-02-13 8:01 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 7765 bytes --] New find, as you can see, zh_CN is different from zh_CN.UTF-8, it's GBK codeset, we can't strip .UTF-8 easily, or we will get a lot of junk: ``` [xhe@xhe-PC ~]$ ls /share/vim/lang/zh_ zh_CN/ zh_CN.UTF-8/ zh_CN.cp936/ zh_TW/ zh_TW.UTF-8/ ``` I add this to the loop, and delete strip in setlocale: + if (locp = strchr(locbuf, '.')) { + *locp = 0; + } else if (locp = strchr(locbuf, '@')) { Now i think we should just fail non-UTF8 codeset in setlocale for safe, and check the codeset before map that file. 2017-02-12 10:34 GMT+08:00 Rich Felker <dalias@libc.org>: > On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote: > > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > > @@ -100,7 +100,8 @@ > > size_t map_size; > > void *volatile plural_rule; > > volatile int nplurals; > > - char name[]; > > + struct binding *binding; > > + struct __locale_map *lm; > > }; > > As stated in the reply to message body, I think you need the category > in the keying too, because there can be different .mo files loaded > depending on which category was requested. > > > static char *dummy_gettextdomain() > > @@ -120,58 +122,87 @@ > > struct msgcat *p; > > struct __locale_struct *loc = CURRENT_LOCALE; > > const struct __locale_map *lm; > > - const char *dirname, *locname, *catname; > > - size_t dirlen, loclen, catlen, domlen; > > + size_t domlen; > > + struct binding *q; > > > > if ((unsigned)category >= LC_ALL) goto notrans; > > > > if (!domainname) domainname = __gettextdomain(); > > > > domlen = strnlen(domainname, NAME_MAX+1); > > if (domlen > NAME_MAX) goto notrans; > > > > - dirname = gettextdir(domainname, &dirlen); > > - if (!dirname) goto notrans; > > + for (q=bindings; q; q=q->next) > > + if (!strcmp(q->domainname, domainname) && q->active) > > + break; > > + if (!q) goto notrans; > > Looks ok. I had said this should be a function but it really doesn't > need to be; it's plenty simple inline. > > > lm = loc->cat[category]; > > if (!lm) { > > notrans: > > return (char *) ((n == 1) ? msgid1 : msgid2); > > } > > - locname = lm->name; > > - > > - catname = catnames[category]; > > - catlen = catlens[category]; > > - loclen = strlen(locname); > > - > > - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > > - char name[namelen+1], *s = name; > > - > > - memcpy(s, dirname, dirlen); > > - s[dirlen] = '/'; > > - s += dirlen + 1; > > - memcpy(s, locname, loclen); > > - s[loclen] = '/'; > > - s += loclen + 1; > > - memcpy(s, catname, catlen); > > - s[catlen] = '/'; > > - s += catlen + 1; > > - memcpy(s, domainname, domlen); > > - s[domlen] = '.'; > > - s[domlen+1] = 'm'; > > - s[domlen+2] = 'o'; > > - s[domlen+3] = 0; > > > > for (p=cats; p; p=p->next) > > - if (!strcmp(p->name, name)) > > + if (p->binding == q && p->lm == lm) > > break; > > && p->cat == category > > > if (!p) { > > + const char *dirname, *locname, *catname; > > + size_t dirlen, loclen, catlen; > > void *old_cats; > > size_t map_size; > > + > > + dirname = q->dirname; > > + locname = lm->name; > > + catname = catnames[category]; > > + > > + dirlen = q->dirlen; > > + loclen = strlen(locname); > > + catlen = catlens[category]; > > Now that these are only computed once rather than per-call, optimizing > out strlen is probably not worthwhile anymore, but it doesn't really > hurt either. Not something you need to change, just a comment. > > > + > > + size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > > + char name[namelen+1], *s = name; > > + char *str = name; > > + > > + memcpy(s, dirname, dirlen); > > + s[dirlen] = '/'; > > + s += dirlen + 1; > > + memcpy(s, locname, loclen); > > + s[loclen] = '/'; > > + s += loclen + 1; > > +skip_loc: > > + memcpy(s, catname, catlen); > > + s[catlen] = '/'; > > + s += catlen + 1; > > + memcpy(s, domainname, domlen); > > + s[domlen] = '.'; > > + s[domlen+1] = 'm'; > > + s[domlen+2] = 'o'; > > + s[domlen+3] = 0; > > Actually, now that this code is not a hot path, it should just be > using snprintf to construct the pathname, I think. It would be a lot > simpler and easier to ensure correctness. > > > + > > const void *map = __map_file(name, &map_size); > > - if (!map) goto notrans; > > + if (!map) { > > + if (s = strchr(name+dirlen+1, '@')) { > > + *s++ = '/'; > > + goto skip_loc;; > > + } > > + if ( str && (s = strchr(name+dirlen+1, '_')) && (s > < strchr(name+dirlen+1, '/')) ) { > > + if (str = strchr(locname, '@')) { > > + loclen += locname - str; > > + memcpy(s, str, loclen); > > + s[loclen] = '/'; > > + s += loclen + 1; > > + str = 0; > > + goto skip_loc; > > + } else { > > + *s++ = '/'; > > + goto skip_loc; > > + } > > + } > > + goto notrans; > > + } > > Using snprintf should also make it easy to get rid of the goto/retry > logic here, perhaps even with a 4-iteration loop and array of which > format modifications happen on each iteration. > > > p = calloc(sizeof *p + namelen + 1, 1); > > if (!p) { > > __munmap((void *)map, map_size); > > goto notrans; > > @@ -209,7 +209,6 @@ > > } > > p->map = map; > > p->map_size = map_size; > > - memcpy(p->name, name, namelen+1); > > do { > > old_cats = cats; > > p->next = old_cats; > > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000 > > @@ -49,8 +49,8 @@ > > } > > > > /* Limit name length and forbid leading dot or any slashes. */ > > - for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++); > > - if (val[0]=='.' || val[n]) val = "C.UTF-8"; > > + for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' && > val[n]!='.'; n++); > > + if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8"; > > int builtin = (val[0]=='C' && !val[1]) > > || !strcmp(val, "C.UTF-8") > > || !strcmp(val, "POSIX"); > > This looks ok but might still need some tweaks. Should an input like > "zh_CN.GBK" get treated as "zh_CN" (thus outputting UTF-8 that might > appear as junk on the user's terminal) or as "C" (no localization) > with only ASCII characters (safe for the user's terminal), or even > cause setlocale to fail and return an error so that the application > can decide what to do? These are not technical comments on your patch > but policy matters the community should weigh in on. > > Rich > [-- Attachment #2: Type: text/html, Size: 11088 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-13 8:01 ` He X @ 2017-02-13 13:28 ` Rich Felker 2017-02-13 14:06 ` He X 2017-02-13 14:12 ` He X 0 siblings, 2 replies; 39+ messages in thread From: Rich Felker @ 2017-02-13 13:28 UTC (permalink / raw) To: musl On Mon, Feb 13, 2017 at 04:01:31PM +0800, He X wrote: > New find, as you can see, zh_CN is different from zh_CN.UTF-8, it's GBK > codeset, we can't strip .UTF-8 easily, or we will get a lot of junk: That's on glibc; your "finding" is irrelevant to musl, where the encoding for all locales is UTF-8. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-13 13:28 ` Rich Felker @ 2017-02-13 14:06 ` He X 2017-02-13 17:12 ` Rich Felker 2017-02-13 14:12 ` He X 1 sibling, 1 reply; 39+ messages in thread From: He X @ 2017-02-13 14:06 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 1185 bytes --] no, it's on musl, i just tested it with my patches, with vim, stripping will lead to unknown characters. I mean, .mo files under zh_CN/ of vim is GBK set, while zh_CN/ of other apps is UTF-8 set, that meas there may be other apps like vim, we should be more cautious, add a check before map the .mo files, and fail non-UTF8 set in setlocale. Btw, _nl_msg_cat_cntr & _nl_domain_bindings will block apps compiling with the native intl of musl, and after i added a dump for these two symbols, gnu tar showed me segfaults, because he passed a zero msgid1 causing __mo_lookup segfault, we should add a check in dcngettext to avoid it(if (!msgid1) goto notrans;): #2 0x00007ffff7d82a6f in dcngettext (domainname=0x6737a0 "tar", msgid1=0x0, msgid2=0x0, n=1, category=5) at src/locale/dcngettext.c:211 2017-02-13 21:28 GMT+08:00 Rich Felker <dalias@libc.org>: > On Mon, Feb 13, 2017 at 04:01:31PM +0800, He X wrote: > > New find, as you can see, zh_CN is different from zh_CN.UTF-8, it's GBK > > codeset, we can't strip .UTF-8 easily, or we will get a lot of junk: > > That's on glibc; your "finding" is irrelevant to musl, where the > encoding for all locales is UTF-8. > > Rich > [-- Attachment #2: Type: text/html, Size: 3048 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-13 14:06 ` He X @ 2017-02-13 17:12 ` Rich Felker 2017-03-04 8:02 ` He X 0 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-02-13 17:12 UTC (permalink / raw) To: musl On Mon, Feb 13, 2017 at 10:06:49PM +0800, He X wrote: > no, it's on musl, i just tested it with my patches, with vim, stripping > will lead to unknown characters. That's not a matter of the locale being non-UTF-8 (it's UTF-8) but of the application doing something broken. The locale is UTF-8 because nl_langinfo(CODESET) says it is and because mb/wc conversion functions process UTF-8. That's what it means for the locale to be UTF-8. > I mean, .mo files under zh_CN/ of vim is GBK set, while zh_CN/ of other > apps is UTF-8 set, that meas there may be other apps like vim, we should be > more cautious, add a check before map the .mo files, and fail non-UTF8 set > in setlocale. All musl locale files are required to be UTF-8. If an application has translation files that are not UTF-8, they're not usable. This could be fixed in the application or by using a fixed version of msgfmt that converts to UTF-8 before producing the .mo file. > Btw, _nl_msg_cat_cntr & _nl_domain_bindings will block apps compiling with > the native intl of musl, and after i added a dump for these two symbols, The autoconf text for gettext is supposed to be getting fixed not to do that anymore, but I'm not sure what the progress on upstreaming it is. > gnu tar showed me segfaults, because he passed a zero msgid1 causing > __mo_lookup segfault, we should add a check in dcngettext to avoid it(if > (!msgid1) goto notrans;): > > #2 0x00007ffff7d82a6f in dcngettext (domainname=0x6737a0 "tar", > msgid1=0x0, msgid2=0x0, n=1, > category=5) at src/locale/dcngettext.c:211 Is it expecting gettext to return a null pointer in this case, or to return something else (like the "header", i.e. the translation of "")? I think it's acceptable to change this behavior as long as we do it right, but it should be a separate patch since it's an independent change. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-13 17:12 ` Rich Felker @ 2017-03-04 8:02 ` He X 2017-03-17 19:27 ` Rich Felker 0 siblings, 1 reply; 39+ messages in thread From: He X @ 2017-03-04 8:02 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 4411 bytes --] OK, i am busy on school these days. I read the mailing lists again, and i clean up. These are all remaining issues we need to solve since previous discussion: 1. about zero msgid1, i can prove that glibc will fallback to no translations. It's equal to printf(""), so this should be ok: @@ -120,8 +122,9 @@ + if (!msgid1) goto notrans; > but it should be a separate patch since it's an independent change. (added in the head of dcngettext(), ill send a new standalone mail for this, but it's also included in this patch, be careful) 2. >But if the locale name is explicitly non-UTF-8 like "zh_CN.GBK", we could opt to reject it without breaking anything, and this may give users better feedback about what's going wrong if they have such settings when ssh'ing into a musl-based system. About the .GBK(and any other non-UTF8 charsets), i ignore them by treating them as C.UTF-8, do we need to be more strict? --- musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000 +++ musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000 @@ -46,7 +46,8 @@ if (val[0]=='.' || val[n]) val = "C.UTF-8"; int builtin = (val[0]=='C' && !val[1]) || !strcmp(val, "C.UTF-8") - || !strcmp(val, "POSIX"); + || !strcmp(val, "POSIX") + || strcmp(__strchrnul(val, '.'), ".UTF-8"); if (builtin) { if (cat == LC_CTYPE && val[1]=='.') 3. >The autoconf text for gettext is supposed to be getting fixed not to do that anymore, but I'm not sure what the progress on upstreaming it is. It's just a workaround before they handle it, and i am not going to change anything in musl, just a description. I only patched myself. 4. > Support for non-UTF-8 .mo files won't be added. > msgfmt just needs to be fixed not to produce non-UTF-8 output. I agreed with you, so then i hope '.UTF-8' could be kept. Rather than stripping it as i thought before, '.UTF-8' should be kept until the code went into dcngettext(). What if the .mo files are downloaded from www? Or what if it's pre-generated in the releases of programs? (i guess that's why vim gave me GBK set, it must be pre-generated) And even with msgfmt generating UTF-8 outputs, what if programs still name the dir as zh_CN.UTF-8 instead of simply zh_CN? You can't say it's wrong, right? It's their preference how to name it. It's necessary for those who have a full name like zh_CN.UTF-8 instead of zh_CN. This's what i am trying to express now. 2017-02-14 1:12 GMT+08:00 Rich Felker <dalias@libc.org>: > On Mon, Feb 13, 2017 at 10:06:49PM +0800, He X wrote: > > no, it's on musl, i just tested it with my patches, with vim, stripping > > will lead to unknown characters. > > That's not a matter of the locale being non-UTF-8 (it's UTF-8) but of > the application doing something broken. The locale is UTF-8 because > nl_langinfo(CODESET) says it is and because mb/wc conversion functions > process UTF-8. That's what it means for the locale to be UTF-8. > > > I mean, .mo files under zh_CN/ of vim is GBK set, while zh_CN/ of other > > apps is UTF-8 set, that meas there may be other apps like vim, we should > be > > more cautious, add a check before map the .mo files, and fail non-UTF8 > set > > in setlocale. > > All musl locale files are required to be UTF-8. If an application has > translation files that are not UTF-8, they're not usable. This could > be fixed in the application or by using a fixed version of msgfmt that > converts to UTF-8 before producing the .mo file. > > > Btw, _nl_msg_cat_cntr & _nl_domain_bindings will block apps compiling > with > > the native intl of musl, and after i added a dump for these two symbols, > > The autoconf text for gettext is supposed to be getting fixed not to > do that anymore, but I'm not sure what the progress on upstreaming it > is. > > > gnu tar showed me segfaults, because he passed a zero msgid1 causing > > __mo_lookup segfault, we should add a check in dcngettext to avoid it(if > > (!msgid1) goto notrans;): > > > > #2 0x00007ffff7d82a6f in dcngettext (domainname=0x6737a0 "tar", > > msgid1=0x0, msgid2=0x0, n=1, > > category=5) at src/locale/dcngettext.c:211 > > Is it expecting gettext to return a null pointer in this case, or to > return something else (like the "header", i.e. the translation of "")? > I think it's acceptable to change this behavior as long as we do it > right, but it should be a separate patch since it's an independent > change. > > Rich > [-- Attachment #1.2: Type: text/html, Size: 12499 bytes --] [-- Attachment #2: locale.diff --] [-- Type: text/plain, Size: 3636 bytes --] --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 @@ -100,7 +100,9 @@ size_t map_size; void *volatile plural_rule; volatile int nplurals; - char name[]; + struct binding *binding; + struct __locale_map *lm; + int cat; }; static char *dummy_gettextdomain() @@ -120,8 +122,9 @@ + if (!msgid1) goto notrans; struct msgcat *p; struct __locale_struct *loc = CURRENT_LOCALE; const struct __locale_map *lm; - const char *dirname, *locname, *catname; - size_t dirlen, loclen, catlen, domlen; + size_t domlen; + struct binding *q; if ((unsigned)category >= LC_ALL) goto notrans; @@ -130,47 +132,64 @@ domlen = strnlen(domainname, NAME_MAX+1); if (domlen > NAME_MAX) goto notrans; - dirname = gettextdir(domainname, &dirlen); - if (!dirname) goto notrans; + for (q=bindings; q; q=q->next) + if (!strcmp(q->domainname, domainname) && q->active) + break; + if (!q) goto notrans; lm = loc->cat[category]; if (!lm) { notrans: return (char *) ((n == 1) ? msgid1 : msgid2); } - locname = lm->name; - - catname = catnames[category]; - catlen = catlens[category]; - loclen = strlen(locname); - - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; - char name[namelen+1], *s = name; - - memcpy(s, dirname, dirlen); - s[dirlen] = '/'; - s += dirlen + 1; - memcpy(s, locname, loclen); - s[loclen] = '/'; - s += loclen + 1; - memcpy(s, catname, catlen); - s[catlen] = '/'; - s += catlen + 1; - memcpy(s, domainname, domlen); - s[domlen] = '.'; - s[domlen+1] = 'm'; - s[domlen+2] = 'o'; - s[domlen+3] = 0; for (p=cats; p; p=p->next) - if (!strcmp(p->name, name)) + if (p->binding == q && p->lm == lm && p->cat == category) break; if (!p) { + const char *dirname, *locname, *catname; + size_t dirlen, loclen, catlen; void *old_cats; size_t map_size; - const void *map = __map_file(name, &map_size); + + dirname = q->dirname; + locname = lm->name; + catname = catnames[category]; + + dirlen = q->dirlen; + loclen = strlen(locname); + catlen = catlens[category]; + + size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; + char name[namelen+1]; + char locbuf[loclen+1], *locp = locbuf; + const void *map; + + memcpy(locbuf, locname, loclen); + locbuf[loclen] = 0; + + for (;;) { + snprintf(name, namelen+1, "%s/%s/%s/%s.mo\0", dirname, locbuf, catname, domainname); + if (map = __map_file(name, &map_size)) break; + + if (locp = strchr(locbuf, '.')) { + *locp = 0; + } else if (locp = strchr(locbuf, '@')) { + *locp = 0; + locbuf[loclen] = '@'; + } else if (locp = strchr(locbuf, '_')) { + if (locbuf[loclen] == '@') { + locbuf[loclen] = 0; + *locp = '@'; + strcat(locp+1, locbuf + strlen(locbuf) + 1); + } else *locp = 0; + } else { + break; + } + } if (!map) goto notrans; + p = calloc(sizeof *p + namelen + 1, 1); if (!p) { __munmap((void *)map, map_size); @@ -178,7 +195,9 @@ } + p->cat = category; + p->binding = q; + p->lm = lm; p->map = map; p->map_size = map_size; - memcpy(p->name, name, namelen+1); do { old_cats = cats; p->next = old_cats; --- musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000 +++ musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000 @@ -46,7 +46,8 @@ if (val[0]=='.' || val[n]) val = "C.UTF-8"; int builtin = (val[0]=='C' && !val[1]) || !strcmp(val, "C.UTF-8") - || !strcmp(val, "POSIX"); + || !strcmp(val, "POSIX") + || strcmp(__strchrnul(val, '.'), ".UTF-8"); if (builtin) { if (cat == LC_CTYPE && val[1]=='.') ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-03-04 8:02 ` He X @ 2017-03-17 19:27 ` Rich Felker 2017-03-17 19:37 ` Rich Felker 0 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-03-17 19:27 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 5428 bytes --] As discussed on IRC, I'm preparing a version that fixes some issues. I also found a few more problems after you logged off which I'll describe inline: On Sat, Mar 04, 2017 at 04:02:58PM +0800, He X wrote: > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000 > @@ -100,7 +100,9 @@ > size_t map_size; > void *volatile plural_rule; > volatile int nplurals; > - char name[]; > + struct binding *binding; > + struct __locale_map *lm; This needs to be const struct __locale_map *lm because the value assigned to it is pointer-to-const. > + int cat; > }; > > static char *dummy_gettextdomain() > @@ -120,8 +122,9 @@ > + if (!msgid1) goto notrans; This overlaps with the other patch so I'll factor it out. > struct msgcat *p; > struct __locale_struct *loc = CURRENT_LOCALE; > const struct __locale_map *lm; > - const char *dirname, *locname, *catname; > - size_t dirlen, loclen, catlen, domlen; > + size_t domlen; > + struct binding *q; > > if ((unsigned)category >= LC_ALL) goto notrans; > > @@ -130,47 +132,64 @@ > domlen = strnlen(domainname, NAME_MAX+1); > if (domlen > NAME_MAX) goto notrans; > > - dirname = gettextdir(domainname, &dirlen); > - if (!dirname) goto notrans; > + for (q=bindings; q; q=q->next) > + if (!strcmp(q->domainname, domainname) && q->active) > + break; > + if (!q) goto notrans; > > lm = loc->cat[category]; > if (!lm) { > notrans: > return (char *) ((n == 1) ? msgid1 : msgid2); > } > - locname = lm->name; > - > - catname = catnames[category]; > - catlen = catlens[category]; > - loclen = strlen(locname); > - > - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > - char name[namelen+1], *s = name; > - > - memcpy(s, dirname, dirlen); > - s[dirlen] = '/'; > - s += dirlen + 1; > - memcpy(s, locname, loclen); > - s[loclen] = '/'; > - s += loclen + 1; > - memcpy(s, catname, catlen); > - s[catlen] = '/'; > - s += catlen + 1; > - memcpy(s, domainname, domlen); > - s[domlen] = '.'; > - s[domlen+1] = 'm'; > - s[domlen+2] = 'o'; > - s[domlen+3] = 0; > > for (p=cats; p; p=p->next) > - if (!strcmp(p->name, name)) > + if (p->binding == q && p->lm == lm && p->cat == category) > break; > > if (!p) { > + const char *dirname, *locname, *catname; > + size_t dirlen, loclen, catlen; > void *old_cats; > size_t map_size; > - const void *map = __map_file(name, &map_size); > + > + dirname = q->dirname; > + locname = lm->name; > + catname = catnames[category]; > + > + dirlen = q->dirlen; > + loclen = strlen(locname); > + catlen = catlens[category]; > + > + size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; > + char name[namelen+1]; > + char locbuf[loclen+1], *locp = locbuf; > + const void *map; > + > + memcpy(locbuf, locname, loclen); > + locbuf[loclen] = 0; > + > + for (;;) { > + snprintf(name, namelen+1, "%s/%s/%s/%s.mo\0", dirname, locbuf, catname, domainname); > + if (map = __map_file(name, &map_size)) break; > + > + if (locp = strchr(locbuf, '.')) { > + *locp = 0; As discussed on irc, .charset suffixes should be dropped before the loop even begins (never used in pathnames), and they occur before the @mod, not after it, so the logic for dropping them is different. > + } else if (locp = strchr(locbuf, '@')) { > + *locp = 0; > + locbuf[loclen] = '@'; > + } else if (locp = strchr(locbuf, '_')) { > + if (locbuf[loclen] == '@') { > + locbuf[loclen] = 0; > + *locp = '@'; > + strcat(locp+1, locbuf + strlen(locbuf) + 1); This invocation of strcat is UB because src and dest overlap; you need memmove. > + } else *locp = 0; > + } else { > + break; > + } > + } Aside from the above, I'm concerned what happens when there are an unexpected mix of _'s and @'s in places you don't expect them to be. I've worked out an (untested) alternative approach which I think is cleaner, where the locale name is split up into components before the loop (in-place, with no locbuf) and the loop controls which parts get used. This way the logic for how the parts are broken up is static and not dependent on what happened in previous loop iterations. > if (!map) goto notrans; > + > p = calloc(sizeof *p + namelen + 1, 1); namelen space is not needed here anymore since name[] was removed from the struct. > if (!p) { > __munmap((void *)map, map_size); > @@ -178,7 +195,9 @@ > } > + p->cat = category; > + p->binding = q; > + p->lm = lm; > p->map = map; > p->map_size = map_size; > - memcpy(p->name, name, namelen+1); > do { > old_cats = cats; > p->next = old_cats; > --- musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000 > +++ musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000 > @@ -46,7 +46,8 @@ > if (val[0]=='.' || val[n]) val = "C.UTF-8"; > int builtin = (val[0]=='C' && !val[1]) > || !strcmp(val, "C.UTF-8") > - || !strcmp(val, "POSIX"); > + || !strcmp(val, "POSIX") > + || strcmp(__strchrnul(val, '.'), ".UTF-8"); > > if (builtin) { > if (cat == LC_CTYPE && val[1]=='.') As discussed this is a separate change, and I don't think it's right as-is. The right change requires adding an error path to the function (and its callers), which I'll look into next. Attached is my draft of modifications to your patch. I haven't tested it yet but it does compile ok. Rich [-- Attachment #2: dcngettext.diff --] [-- Type: text/plain, Size: 3574 bytes --] diff --git a/src/locale/dcngettext.c b/src/locale/dcngettext.c index b68e24b..abaa414 100644 --- a/src/locale/dcngettext.c +++ b/src/locale/dcngettext.c @@ -100,7 +100,9 @@ struct msgcat { size_t map_size; void *volatile plural_rule; volatile int nplurals; - char name[]; + struct binding *binding; + const struct __locale_map *lm; + int cat; }; static char *dummy_gettextdomain() @@ -120,8 +122,8 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2, struct msgcat *p; struct __locale_struct *loc = CURRENT_LOCALE; const struct __locale_map *lm; - const char *dirname, *locname, *catname; - size_t dirlen, loclen, catlen, domlen; + size_t domlen; + struct binding *q; if ((unsigned)category >= LC_ALL) goto notrans; @@ -130,55 +132,76 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2, domlen = strnlen(domainname, NAME_MAX+1); if (domlen > NAME_MAX) goto notrans; - dirname = gettextdir(domainname, &dirlen); - if (!dirname) goto notrans; + for (q=bindings; q; q=q->next) + if (!strcmp(q->domainname, domainname) && q->active) + break; + if (!q) goto notrans; lm = loc->cat[category]; if (!lm) { notrans: return (char *) ((n == 1) ? msgid1 : msgid2); } - locname = lm->name; - - catname = catnames[category]; - catlen = catlens[category]; - loclen = strlen(locname); - - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; - char name[namelen+1], *s = name; - - memcpy(s, dirname, dirlen); - s[dirlen] = '/'; - s += dirlen + 1; - memcpy(s, locname, loclen); - s[loclen] = '/'; - s += loclen + 1; - memcpy(s, catname, catlen); - s[catlen] = '/'; - s += catlen + 1; - memcpy(s, domainname, domlen); - s[domlen] = '.'; - s[domlen+1] = 'm'; - s[domlen+2] = 'o'; - s[domlen+3] = 0; for (p=cats; p; p=p->next) - if (!strcmp(p->name, name)) + if (p->binding == q && p->lm == lm && p->cat == category) break; if (!p) { + const char *dirname, *locname, *catname, *modname, *locp; + size_t dirlen, loclen, catlen, modlen, alt_modlen; void *old_cats; size_t map_size; - const void *map = __map_file(name, &map_size); + + dirname = q->dirname; + locname = lm->name; + catname = catnames[category]; + + dirlen = q->dirlen; + loclen = strlen(locname); + catlen = catlens[category]; + + /* Logically split @mod suffix from locale name. */ + modname = memchr(locname, '@', loclen); + if (!modname) modname = locname + loclen; + alt_modlen = modlen = loclen - (modname-locname); + loclen = modname-locname; + + /* Drop .charset identifier; it is not used. */ + const char *csp = memchr(locname, '.', loclen); + if (csp) loclen = csp-locname; + + char name[dirlen+1 + loclen+1 + catlen+1 + domlen+3 + 1]; + const void *map; + + for (;;) { + snprintf(name, sizeof name, "%s/%.*s%.*s/%s/%s.mo\0", + dirname, (int)loclen, locname, + (int)alt_modlen, modname, catname, domainname); + if (map = __map_file(name, &map_size)) break; + + /* Try dropping @mod, _YY, then both. */ + if (alt_modlen) { + alt_modlen = 0; + } else if ((locp = memchr(locname, '_', loclen))) { + loclen = locp-locname; + alt_modlen = modlen; + } else { + break; + } + } if (!map) goto notrans; - p = calloc(sizeof *p + namelen + 1, 1); + + p = calloc(sizeof *p, 1); if (!p) { __munmap((void *)map, map_size); goto notrans; } + p->cat = category; + p->binding = q; + p->lm = lm; p->map = map; p->map_size = map_size; - memcpy(p->name, name, namelen+1); do { old_cats = cats; p->next = old_cats; ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-03-17 19:27 ` Rich Felker @ 2017-03-17 19:37 ` Rich Felker 2017-03-18 7:34 ` He X 0 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-03-17 19:37 UTC (permalink / raw) To: musl On Fri, Mar 17, 2017 at 03:27:49PM -0400, Rich Felker wrote: > + /* Logically split @mod suffix from locale name. */ > + modname = memchr(locname, '@', loclen); > + if (!modname) modname = locname + loclen; > + alt_modlen = modlen = loclen - (modname-locname); > + loclen = modname-locname; > + > + /* Drop .charset identifier; it is not used. */ > + const char *csp = memchr(locname, '.', loclen); > + if (csp) loclen = csp-locname; > + > + char name[dirlen+1 + loclen+1 + catlen+1 + domlen+3 + 1]; > + const void *map; One bug in my version: loclen+1 should be loclen+modlen+1. Without changing that, the pathname gets truncated if @mod is used. Otherwise it seems to work. One separate but related problem I noticed while testing is that musl's locale name length limit of 16 is insufficient for something like "en_GB.UTF-8@euro". If we want to allow users to have an explicit ".UTF-8" in the locale name (which, as we discussed a while back, users may want to do if they plan to ssh into glibc-based systems) then we should probably increase this limit a bit, perhaps to 24 or so. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-03-17 19:37 ` Rich Felker @ 2017-03-18 7:34 ` He X 2017-03-18 12:28 ` Rich Felker 0 siblings, 1 reply; 39+ messages in thread From: He X @ 2017-03-18 7:34 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 4037 bytes --] > As discussed on irc, .charset suffixes should be dropped before the loop even begins (never used in pathnames), and they occur before the @mod, not after it, so the logic for dropping them is different. 1. drop .charset: Sorry for proposing it again, i forget this case after around three weeks, as i said before, vim will generate three different .mo files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po. In that case, dropping is to generate a lots of junk. I now found it's not a bug of msgfmt. That is charset is converted by: iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e 's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname is decided by softwares, msgfmt does not do charset converting at all, just a format-translator. (btw, iconv.c is from alpine) How do we deal with the case? Patch the special software(since these special cases are rare, maybe other editors like emacs have similar actions, general softwares only prepare UTF-8 files)? Or put the drop back in the loop? I prefer the latter, it seems more elegant and correct than ugly patches, my idea is: ``` tt_TT.XXX@mod tt_TT.XXX tt.XXX@mod tt.XXX tt_TT@mod tt_TT tt@mod tt ``` I have made it in a similar way as @mod in the attached. Here's the output from strace: ``` 15:14:32.235463 open("./es_MX.utf8@euro/LC_MESSAGES/hellogt.mo", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory) 15:14:32.235495 open("./es_MX.utf8/LC_MESSAGES/hellogt.mo", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory) 15:14:32.235560 open("./es.utf8@euro/LC_MESSAGES/hellogt.mo", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory) 15:14:32.235585 open("./es.utf8/LC_MESSAGES/hellogt.mo", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory) 15:14:32.235621 open("./es_MX@euro/LC_MESSAGES/hellogt.mo", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory) 15:14:32.235656 open("./es_MX/LC_MESSAGES/hellogt.mo", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory) 15:14:32.235683 open("./es@euro/LC_MESSAGES/hellogt.mo", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory) 15:14:32.235706 open("./es/LC_MESSAGES/hellogt.mo", O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 3 ``` Other possible LANG is tested, too. > One bug in my version: loclen+1 should be loclen+modlen+1. Without changing that, the pathname gets truncated if @mod is used. Otherwise it seems to work. 2. Yes, it is, made it in the attached. > then we should probably increase this limit a bit, perhaps to 24 or so. 3. the longest line is `tt_TT.iso885915@euro`, 21. 24 seems good for the current standard. Made it in the attached. 2017-03-17 19:37 GMT+00:00 Rich Felker <dalias@libc.org>: > On Fri, Mar 17, 2017 at 03:27:49PM -0400, Rich Felker wrote: > > + /* Logically split @mod suffix from locale name. */ > > + modname = memchr(locname, '@', loclen); > > + if (!modname) modname = locname + loclen; > > + alt_modlen = modlen = loclen - (modname-locname); > > + loclen = modname-locname; > > + > > + /* Drop .charset identifier; it is not used. */ > > + const char *csp = memchr(locname, '.', loclen); > > + if (csp) loclen = csp-locname; > > + > > + char name[dirlen+1 + loclen+1 + catlen+1 + domlen+3 + 1]; > > + const void *map; > > One bug in my version: loclen+1 should be loclen+modlen+1. Without > changing that, the pathname gets truncated if @mod is used. Otherwise > it seems to work. > > One separate but related problem I noticed while testing is that > musl's locale name length limit of 16 is insufficient for something > like "en_GB.UTF-8@euro". If we want to allow users to have an explicit > ".UTF-8" in the locale name (which, as we discussed a while back, > users may want to do if they plan to ssh into glibc-based systems) > then we should probably increase this limit a bit, perhaps to 24 or > so. > > Rich > [-- Attachment #1.2: Type: text/html, Size: 6082 bytes --] [-- Attachment #2: locale.diff --] [-- Type: text/plain, Size: 4100 bytes --] diff --git a/src/locale/dcngettext.c b/src/locale/dcngettext.c index b68e24b..abaa414 100644 --- a/src/locale/dcngettext.c +++ b/src/locale/dcngettext.c @@ -100,7 +100,9 @@ struct msgcat { size_t map_size; void *volatile plural_rule; volatile int nplurals; - char name[]; + struct binding *binding; + const struct __locale_map *lm; + int cat; }; static char *dummy_gettextdomain() @@ -120,8 +122,8 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2, struct msgcat *p; struct __locale_struct *loc = CURRENT_LOCALE; const struct __locale_map *lm; - const char *dirname, *locname, *catname; - size_t dirlen, loclen, catlen, domlen; + size_t domlen; + struct binding *q; if ((unsigned)category >= LC_ALL) goto notrans; @@ -130,55 +132,83 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2, domlen = strnlen(domainname, NAME_MAX+1); if (domlen > NAME_MAX) goto notrans; - dirname = gettextdir(domainname, &dirlen); - if (!dirname) goto notrans; + for (q=bindings; q; q=q->next) + if (!strcmp(q->domainname, domainname) && q->active) + break; + if (!q) goto notrans; lm = loc->cat[category]; if (!lm) { notrans: return (char *) ((n == 1) ? msgid1 : msgid2); } - locname = lm->name; - - catname = catnames[category]; - catlen = catlens[category]; - loclen = strlen(locname); - - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; - char name[namelen+1], *s = name; - - memcpy(s, dirname, dirlen); - s[dirlen] = '/'; - s += dirlen + 1; - memcpy(s, locname, loclen); - s[loclen] = '/'; - s += loclen + 1; - memcpy(s, catname, catlen); - s[catlen] = '/'; - s += catlen + 1; - memcpy(s, domainname, domlen); - s[domlen] = '.'; - s[domlen+1] = 'm'; - s[domlen+2] = 'o'; - s[domlen+3] = 0; for (p=cats; p; p=p->next) - if (!strcmp(p->name, name)) + if (p->binding == q && p->lm == lm && p->cat == category) break; if (!p) { + const char *dirname, *locname, *catname, *modname, *csname, *locp; + size_t dirlen, loclen, catlen, modlen, alt_modlen, cslen, alt_charset; void *old_cats; size_t map_size; - const void *map = __map_file(name, &map_size); + + dirname = q->dirname; + locname = lm->name; + catname = catnames[category]; + + dirlen = q->dirlen; + loclen = strlen(locname); + catlen = catlens[category]; + + /* Logically split @mod suffix from locale name. */ + modname = memchr(locname, '@', loclen); + if (!modname) modname = locname + loclen; + alt_modlen = modlen = loclen - (modname-locname); + loclen -= modlen; + + /* Split .charset suffix as @mod. */ + csname = memchr(locname, '.', loclen); + if (!csname) csname = locname + loclen; + alt_charset = cslen = loclen - (csname-locname); + loclen -= cslen; + + char name[dirlen+1 + loclen+modlen+cslen+1 + catlen+1 + domlen+3 + 1]; + const void *map; + + for (;;) { + snprintf(name, sizeof name, "%s/%.*s%.*s%.*s/%s/%s.mo\0", + dirname, (int)loclen, locname, + (int)alt_charset, csname, + (int)alt_modlen, modname, catname, domainname); + if (map = __map_file(name, &map_size)) break; + + /* Try dropping @mod, _YY, then both. */ + if (alt_modlen) { + alt_modlen = 0; + } else if ((locp = memchr(locname, '_', loclen))) { + loclen = locp-locname; + alt_modlen = modlen; + } else if (alt_charset) { + alt_charset = 0; + alt_modlen = modlen; + loclen = strlen(locname) - modlen - cslen; + } else { + break; + } + } if (!map) goto notrans; - p = calloc(sizeof *p + namelen + 1, 1); + + p = calloc(sizeof *p, 1); if (!p) { __munmap((void *)map, map_size); goto notrans; } + p->cat = category; + p->binding = q; + p->lm = lm; p->map = map; p->map_size = map_size; - memcpy(p->name, name, namelen+1); do { old_cats = cats; p->next = old_cats; --- musl-1.1.16/src/internal/locale_impl.h +++ musl-1.1.16/src/internal/locale_impl.h @@ -6,7 +6,7 @@ #include "libc.h" #include "pthread_impl.h" -#define LOCALE_NAME_MAX 15 +#define LOCALE_NAME_MAX 23 struct __locale_map { const void *map; ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-03-18 7:34 ` He X @ 2017-03-18 12:28 ` Rich Felker 2017-03-18 13:50 ` He X 0 siblings, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-03-18 12:28 UTC (permalink / raw) To: musl On Sat, Mar 18, 2017 at 07:34:58AM +0000, He X wrote: > > As discussed on irc, .charset suffixes should be dropped before the > loop even begins (never used in pathnames), and they occur before the > @mod, not after it, so the logic for dropping them is different. > > 1. drop .charset: Sorry for proposing it again, i forget this case after > around three weeks, as i said before, vim will generate three different .mo > files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po. > In that case, dropping is to generate a lots of junk. > > I now found it's not a bug of msgfmt. That is charset is converted by: > iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e > 's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname is > decided by softwares, msgfmt does not do charset converting at all, just a > format-translator. (btw, iconv.c is from alpine) There are two things you seem to be missing: 1. musl does not, and won't, support non-UTF-8 locales, so there is no point in trying to load translations for them. Moreover, with the proposed changes to setlocale/locale_map.c, it will never be possible for the locale name to contain a . with anything other than UTF-8 (or, for compatibility, some variant like utf8) after it. So I don't see how there's any point in iterating and trying with/without .charset when the only possibilities are that .charset is blank, .UTF-8, or some misspelling of .UTF-8. In the latter case, we'd even have to do remapping of the misspellings to avoid having to have multiple dirs/symlinks. 2. From my perspective, msgfmt's production of non-UTF-8 .mo files is a bug. Yes the .po file can be something else, but msgfmt should be transcoding it at 'compile' time. There's at least one other change msgfmt needs for all features to work with musl's gettext -- expansion of SYSDEP strings to all their possible format patterns -- so I don't think it's a significant additional burden to ensure that the msgfmt used on musl-based systems outputs UTF-8. Of course software trying to do multiple encodings like you described will still install duplicate files unless patched, but any of them should work as long as msgfmt recoded them. In the mean time, distros can just patch the build process for software that's still installing non-UTF-8 locale files. AFAIK doing that is not a recommended practice even by the GNU gettext project, so the patches might even make it upstream. One thing we could do for robustness is check the .mo header at load time and, if it has a charset= specification with something other than UTF-8, reject it. I mainly suggest this in case the program is running on a non-musl system where a glibc-built version of the same program (e.g. vi) with non-UTF-8 .mo files is present and they're using the same textdomain dir (actually unlikely since prefix should be different). But if we do this it should be a separate patch because it's a separate functional change. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-03-18 12:28 ` Rich Felker @ 2017-03-18 13:50 ` He X 0 siblings, 0 replies; 39+ messages in thread From: He X @ 2017-03-18 13:50 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 3467 bytes --] OK, i think there's no further needs of discussion. I got your idea, if this is what musl want to be. I will try to make patches to vim later! But for the checking of `charset=`, i can't help, i did not understand what's up in __mo_lookup(). Hope you can make the patch. The attached has deleted all things related to drop .charset. 2017-03-18 20:28 GMT+08:00 Rich Felker <dalias@libc.org>: > On Sat, Mar 18, 2017 at 07:34:58AM +0000, He X wrote: > > > As discussed on irc, .charset suffixes should be dropped before the > > loop even begins (never used in pathnames), and they occur before the > > @mod, not after it, so the logic for dropping them is different. > > > > 1. drop .charset: Sorry for proposing it again, i forget this case after > > around three weeks, as i said before, vim will generate three different > .mo > > files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po. > > In that case, dropping is to generate a lots of junk. > > > > I now found it's not a bug of msgfmt. That is charset is converted by: > > iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e > > 's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname > is > > decided by softwares, msgfmt does not do charset converting at all, just > a > > format-translator. (btw, iconv.c is from alpine) > > There are two things you seem to be missing: > > 1. musl does not, and won't, support non-UTF-8 locales, so there is no > point in trying to load translations for them. Moreover, with the > proposed changes to setlocale/locale_map.c, it will never be possible > for the locale name to contain a . with anything other than UTF-8 (or, > for compatibility, some variant like utf8) after it. So I don't see > how there's any point in iterating and trying with/without .charset > when the only possibilities are that .charset is blank, .UTF-8, or > some misspelling of .UTF-8. In the latter case, we'd even have to do > remapping of the misspellings to avoid having to have multiple > dirs/symlinks. > > 2. From my perspective, msgfmt's production of non-UTF-8 .mo files is > a bug. Yes the .po file can be something else, but msgfmt should be > transcoding it at 'compile' time. There's at least one other change > msgfmt needs for all features to work with musl's gettext -- expansion > of SYSDEP strings to all their possible format patterns -- so I don't > think it's a significant additional burden to ensure that the msgfmt > used on musl-based systems outputs UTF-8. > > Of course software trying to do multiple encodings like you described > will still install duplicate files unless patched, but any of them > should work as long as msgfmt recoded them. In the mean time, distros > can just patch the build process for software that's still installing > non-UTF-8 locale files. AFAIK doing that is not a recommended practice > even by the GNU gettext project, so the patches might even make it > upstream. > > One thing we could do for robustness is check the .mo header at load > time and, if it has a charset= specification with something other than > UTF-8, reject it. I mainly suggest this in case the program is running > on a non-musl system where a glibc-built version of the same program > (e.g. vi) with non-UTF-8 .mo files is present and they're using the > same textdomain dir (actually unlikely since prefix should be > different). But if we do this it should be a separate patch because > it's a separate functional change. > > Rich > [-- Attachment #1.2: Type: text/html, Size: 4138 bytes --] [-- Attachment #2: locale.diff --] [-- Type: text/plain, Size: 3824 bytes --] diff --git a/src/locale/dcngettext.c b/src/locale/dcngettext.c index b68e24b..abaa414 100644 --- a/src/locale/dcngettext.c +++ b/src/locale/dcngettext.c @@ -100,7 +100,9 @@ struct msgcat { size_t map_size; void *volatile plural_rule; volatile int nplurals; - char name[]; + struct binding *binding; + const struct __locale_map *lm; + int cat; }; static char *dummy_gettextdomain() @@ -120,8 +122,8 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2, struct msgcat *p; struct __locale_struct *loc = CURRENT_LOCALE; const struct __locale_map *lm; - const char *dirname, *locname, *catname; - size_t dirlen, loclen, catlen, domlen; + size_t domlen; + struct binding *q; if ((unsigned)category >= LC_ALL) goto notrans; @@ -130,55 +132,76 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2, domlen = strnlen(domainname, NAME_MAX+1); if (domlen > NAME_MAX) goto notrans; - dirname = gettextdir(domainname, &dirlen); - if (!dirname) goto notrans; + for (q=bindings; q; q=q->next) + if (!strcmp(q->domainname, domainname) && q->active) + break; + if (!q) goto notrans; lm = loc->cat[category]; if (!lm) { notrans: return (char *) ((n == 1) ? msgid1 : msgid2); } - locname = lm->name; - - catname = catnames[category]; - catlen = catlens[category]; - loclen = strlen(locname); - - size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3; - char name[namelen+1], *s = name; - - memcpy(s, dirname, dirlen); - s[dirlen] = '/'; - s += dirlen + 1; - memcpy(s, locname, loclen); - s[loclen] = '/'; - s += loclen + 1; - memcpy(s, catname, catlen); - s[catlen] = '/'; - s += catlen + 1; - memcpy(s, domainname, domlen); - s[domlen] = '.'; - s[domlen+1] = 'm'; - s[domlen+2] = 'o'; - s[domlen+3] = 0; for (p=cats; p; p=p->next) - if (!strcmp(p->name, name)) + if (p->binding == q && p->lm == lm && p->cat == category) break; if (!p) { + const char *dirname, *locname, *catname, *modname, *locp; + size_t dirlen, loclen, catlen, modlen, alt_modlen; void *old_cats; size_t map_size; - const void *map = __map_file(name, &map_size); + + dirname = q->dirname; + locname = lm->name; + catname = catnames[category]; + + dirlen = q->dirlen; + loclen = strlen(locname); + catlen = catlens[category]; + + /* Logically split @mod suffix from locale name. */ + modname = memchr(locname, '@', loclen); + if (!modname) modname = locname + loclen; + alt_modlen = modlen = loclen - (modname-locname); + loclen = modname-locname; + + /* Drop .charset identifier; it is not used. */ + const char *csp = memchr(locname, '.', loclen); + if (csp) loclen = csp-locname; + + char name[dirlen+1 + loclen+modlen+1 + catlen+1 + domlen+3 + 1]; + const void *map; + + for (;;) { + snprintf(name, sizeof name, "%s/%.*s%.*s/%s/%s.mo\0", + dirname, (int)loclen, locname, + (int)alt_modlen, modname, catname, domainname); + if (map = __map_file(name, &map_size)) break; + + /* Try dropping @mod, _YY, then both. */ + if (alt_modlen) { + alt_modlen = 0; + } else if ((locp = memchr(locname, '_', loclen))) { + loclen = locp-locname; + alt_modlen = modlen; + } else { + break; + } + } if (!map) goto notrans; - p = calloc(sizeof *p + namelen + 1, 1); + + p = calloc(sizeof *p, 1); if (!p) { __munmap((void *)map, map_size); goto notrans; } + p->cat = category; + p->binding = q; + p->lm = lm; p->map = map; p->map_size = map_size; - memcpy(p->name, name, namelen+1); do { old_cats = cats; p->next = old_cats; --- musl-1.1.16/src/internal/locale_impl.h +++ musl-1.1.16/src/internal/locale_impl.h @@ -6,7 +6,7 @@ #include "libc.h" #include "pthread_impl.h" -#define LOCALE_NAME_MAX 15 +#define LOCALE_NAME_MAX 23 struct __locale_map { const void *map; ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-13 13:28 ` Rich Felker 2017-02-13 14:06 ` He X @ 2017-02-13 14:12 ` He X 2017-02-13 17:13 ` Rich Felker 1 sibling, 1 reply; 39+ messages in thread From: He X @ 2017-02-13 14:12 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 654 bytes --] I dont know how, but it's indeed GBK, even with musl, vim indeed generated GBK set files, maybe it's because im using gnu gettext(without-included-gettext). I think we should avoid this issue depending on a check of libc, rather than assuming all .mo files are UTF-8 set. 2017-02-13 21:28 GMT+08:00 Rich Felker <dalias@libc.org>: > On Mon, Feb 13, 2017 at 04:01:31PM +0800, He X wrote: > > New find, as you can see, zh_CN is different from zh_CN.UTF-8, it's GBK > > codeset, we can't strip .UTF-8 easily, or we will get a lot of junk: > > That's on glibc; your "finding" is irrelevant to musl, where the > encoding for all locales is UTF-8. > > Rich > [-- Attachment #2: Type: text/html, Size: 1084 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-02-13 14:12 ` He X @ 2017-02-13 17:13 ` Rich Felker 0 siblings, 0 replies; 39+ messages in thread From: Rich Felker @ 2017-02-13 17:13 UTC (permalink / raw) To: musl On Mon, Feb 13, 2017 at 10:12:13PM +0800, He X wrote: > I dont know how, but it's indeed GBK, even with musl, vim indeed generated > GBK set files, maybe it's because im using gnu > gettext(without-included-gettext). I think we should avoid this issue > depending on a check of libc, rather than assuming all .mo files are UTF-8 > set. Support for non-UTF-8 .mo files won't be added. It means wasting lots of memory per-process to keep converted copies in-core rather than mapping from disk. msgfmt just needs to be fixed not to produce non-UTF-8 output. (It also needs to be fixed to handle SYSDEP strings correctly.) Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 14:48 ` He X 2017-01-29 15:55 ` Rich Felker @ 2017-01-29 16:37 ` Rich Felker 2017-01-30 0:37 ` He X 2017-01-30 14:17 ` He X 2017-01-29 16:40 ` Szabolcs Nagy 2 siblings, 2 replies; 39+ messages in thread From: Rich Felker @ 2017-01-29 16:37 UTC (permalink / raw) To: musl On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote: > btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread), and > these two patches, fcitx, chromium are working well. Can I ask how .UTF-8 got in the locale name to begin with? Did you put it there, or was it copied from another non-glibc system you logged in from, or did chromium itself add it? Re: the original patch, it should probably (depending on what we want to do with other invalid encodings) either use strchr to find the first '.' and strip everything after it, or something like: if (loclen > 6 && !strcmp(locname+loclen-6, ".UTF-8")) There's no reason to pull strstr in here. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 16:37 ` Rich Felker @ 2017-01-30 0:37 ` He X 2017-01-30 14:17 ` He X 1 sibling, 0 replies; 39+ messages in thread From: He X @ 2017-01-30 0:37 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2573 bytes --] > I'm not saying you need to wait.... 1. its hard to read that thread for me, i just glanced once, thx for you advice, ill be more cautious next time! ;p > Can I ask how .UTF-8 got in the locale name.... 2. And '.UTF-8' is copied from glibc's locale-table, i put it there, it's set by normal user. As i looked in to musl's source, i found it's totally useless for musl to set such a suffix, suffixes are meaningless. But we should still do a compatibility with glibc in my view, suffixes seems already unofficial but standard way to ask libc to provide a proper charset. > I don't think "it crashes on glibc"... 3. Really sorry, forgot to locale-gen before test, that's why segfault, seems glibc only stripped '.GBK' at translation load time, showed me '»ỰѡÏî:'. In another word, it was using real GBK set! Though I agree with rejection: because musl is utf8, but this '.GBK' asked for using 'GBK' rather than utf8, conceptually we should just reject it. But stand on the side of normal users, rewriting is nice to avoid failing. And for developers using musl, they should know there's no 'non-utf8' sets in musl rather than depending on libc, so i would like the idea of rewriting. Or we could put the responsibility of setting right LC_* to users? Not so friendly... Because users may want to validate the strings returned by setlocale()... So the best rewriting time, i think, is at the translation time. > Re: the original patch, it should probably... 4. makes sense, i'm not a pro coder, i havnt think about using strchr or strcmp! :) And with the idea above, i suggest better using strchr to strip all things after '.'. that is good, and we dont need focus at what is placed after '.', since whatever he asked, musl is using utf8. 2017-01-30 0:37 GMT+08:00 Rich Felker <dalias@libc.org>: > On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote: > > btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread), > and > > these two patches, fcitx, chromium are working well. > > Can I ask how .UTF-8 got in the locale name to begin with? Did you put > it there, or was it copied from another non-glibc system you logged in > from, or did chromium itself add it? > > Re: the original patch, it should probably (depending on what we want > to do with other invalid encodings) either use strchr to find the > first '.' and strip everything after it, or something like: > > if (loclen > 6 && !strcmp(locname+loclen-6, ".UTF-8")) > > There's no reason to pull strstr in here. > > Rich > [-- Attachment #2: Type: text/html, Size: 4042 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 16:37 ` Rich Felker 2017-01-30 0:37 ` He X @ 2017-01-30 14:17 ` He X 1 sibling, 0 replies; 39+ messages in thread From: He X @ 2017-01-30 14:17 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 952 bytes --] 1. how do we validate if a name is correct? i mean how do we attempt with zh, zh_CN, zh_CN@xx? using open() to check? what's the fastest and correctest way? 2. Thx for your help, nsz! 2017-01-30 0:37 GMT+08:00 Rich Felker <dalias@libc.org>: > On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote: > > btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread), > and > > these two patches, fcitx, chromium are working well. > > Can I ask how .UTF-8 got in the locale name to begin with? Did you put > it there, or was it copied from another non-glibc system you logged in > from, or did chromium itself add it? > > Re: the original patch, it should probably (depending on what we want > to do with other invalid encodings) either use strchr to find the > first '.' and strip everything after it, or something like: > > if (loclen > 6 && !strcmp(locname+loclen-6, ".UTF-8")) > > There's no reason to pull strstr in here. > > Rich > [-- Attachment #2: Type: text/html, Size: 1478 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 14:48 ` He X 2017-01-29 15:55 ` Rich Felker 2017-01-29 16:37 ` Rich Felker @ 2017-01-29 16:40 ` Szabolcs Nagy 2017-01-29 16:49 ` Rich Felker 2017-01-30 1:32 ` He X 2 siblings, 2 replies; 39+ messages in thread From: Szabolcs Nagy @ 2017-01-29 16:40 UTC (permalink / raw) To: musl * He X <xw897002528@gmail.com> [2017-01-29 22:48:34 +0800]: > 2. no other ways, musl will use generic config 100%, and then the > exception, the run time error is hardcoded there; but i doubt if this > really breaks binaries, the function is only called by libstdc++ itself. > you cant only update the config, but does not update libstdc++. libstdc++ > exported the same abi for common binaries, wont break most dynamic-loaded > binary in my view. that's not how abi works.. you change the __c_locale typedef of the generic config from int* to locale_t, such change breaks abi and cannot be upstreamed to gcc. (it's unfortunate that the c++ locale handling default for non-gnu systems does not use posix locales correctly, but breaking abi for all non-gnu systems is not an acceptable fix.) with your change the abi of libstdc++ would look like the current gnu abi: $ grep __locale_struct libstdc++-v3/config/abi/post/x86_64-linux-gnu/baseline_symbols.txt |wc -l 72 the abi on a musl based system is like $ nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l 0 so with your patch if a compiled binary refers to any one of those 72 symbols then the dynamic linker will fail to load it on a musl based system. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 16:40 ` Szabolcs Nagy @ 2017-01-29 16:49 ` Rich Felker 2017-01-30 12:36 ` He X 2017-01-30 1:32 ` He X 1 sibling, 1 reply; 39+ messages in thread From: Rich Felker @ 2017-01-29 16:49 UTC (permalink / raw) To: musl On Sun, Jan 29, 2017 at 05:40:08PM +0100, Szabolcs Nagy wrote: > * He X <xw897002528@gmail.com> [2017-01-29 22:48:34 +0800]: > > 2. no other ways, musl will use generic config 100%, and then the > > exception, the run time error is hardcoded there; but i doubt if this > > really breaks binaries, the function is only called by libstdc++ itself. > > you cant only update the config, but does not update libstdc++. libstdc++ > > exported the same abi for common binaries, wont break most dynamic-loaded > > binary in my view. > > that's not how abi works.. you change the __c_locale typedef > of the generic config from int* to locale_t, such change > breaks abi and cannot be upstreamed to gcc. (it's unfortunate > that the c++ locale handling default for non-gnu systems > does not use posix locales correctly, but breaking abi > for all non-gnu systems is not an acceptable fix.) > > with your change the abi of libstdc++ would look like > the current gnu abi: > > $ grep __locale_struct libstdc++-v3/config/abi/post/x86_64-linux-gnu/baseline_symbols.txt |wc -l > 72 > > the abi on a musl based system is like > > $ nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l > 0 > > so with your patch if a compiled binary refers to any one > of those 72 symbols then the dynamic linker will fail to > load it on a musl based system. So would this work? struct __generic_locale { int dummy; localt_t locale; }; Then allocate these objects with new. That way int* could point to the object (by pointing to its first member) and the locale_t could be obtained. Wasteful and ugly, but valid. Any other fix seems to assume int* can represent locale_t; I don't think that's valid for a "generic" implementation. Rich ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 16:49 ` Rich Felker @ 2017-01-30 12:36 ` He X 2017-01-30 13:05 ` Szabolcs Nagy 0 siblings, 1 reply; 39+ messages in thread From: He X @ 2017-01-30 12:36 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2086 bytes --] worked! http://paste.ubuntu.com/23893379/ -su-4.4# nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l 0 anything i missed? it it acceptable ? and how could i post this to the upstream? 2017-01-30 0:49 GMT+08:00 Rich Felker <dalias@libc.org>: > On Sun, Jan 29, 2017 at 05:40:08PM +0100, Szabolcs Nagy wrote: > > * He X <xw897002528@gmail.com> [2017-01-29 22:48:34 +0800]: > > > 2. no other ways, musl will use generic config 100%, and then the > > > exception, the run time error is hardcoded there; but i doubt if this > > > really breaks binaries, the function is only called by libstdc++ > itself. > > > you cant only update the config, but does not update libstdc++. > libstdc++ > > > exported the same abi for common binaries, wont break most > dynamic-loaded > > > binary in my view. > > > > that's not how abi works.. you change the __c_locale typedef > > of the generic config from int* to locale_t, such change > > breaks abi and cannot be upstreamed to gcc. (it's unfortunate > > that the c++ locale handling default for non-gnu systems > > does not use posix locales correctly, but breaking abi > > for all non-gnu systems is not an acceptable fix.) > > > > with your change the abi of libstdc++ would look like > > the current gnu abi: > > > > $ grep __locale_struct libstdc++-v3/config/abi/post/ > x86_64-linux-gnu/baseline_symbols.txt |wc -l > > 72 > > > > the abi on a musl based system is like > > > > $ nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l > > 0 > > > > so with your patch if a compiled binary refers to any one > > of those 72 symbols then the dynamic linker will fail to > > load it on a musl based system. > > So would this work? > > struct __generic_locale { > int dummy; > localt_t locale; > }; > > Then allocate these objects with new. That way int* could point to the > object (by pointing to its first member) and the locale_t could be > obtained. Wasteful and ugly, but valid. > > Any other fix seems to assume int* can represent locale_t; I don't > think that's valid for a "generic" implementation. > > Rich > [-- Attachment #2: Type: text/html, Size: 2894 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-30 12:36 ` He X @ 2017-01-30 13:05 ` Szabolcs Nagy 0 siblings, 0 replies; 39+ messages in thread From: Szabolcs Nagy @ 2017-01-30 13:05 UTC (permalink / raw) To: musl * He X <xw897002528@gmail.com> [2017-01-30 20:36:41 +0800]: > worked! http://paste.ubuntu.com/23893379/ > -su-4.4# nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l > 0 > anything i missed? it it acceptable ? and how could i post this to the > upstream? on a second look you might be right that conforming c++ code is probably not able to reference those symbols at all. in that case i'm not sure why they are public symbols, and changing the abi might be preferable to the workaround hack. ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: Re: a bug in bindtextdomain() and strip '.UTF-8' 2017-01-29 16:40 ` Szabolcs Nagy 2017-01-29 16:49 ` Rich Felker @ 2017-01-30 1:32 ` He X 1 sibling, 0 replies; 39+ messages in thread From: He X @ 2017-01-30 1:32 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 2088 bytes --] > To Szabolcs Nagy: that's not how abi works... 1. I know, what i meant, is that: these symbols seems only called by stdc++ themselves, i havnt seen any program invoke these symbols directly, they use std::locale() or something else. At least in my system, schroot does not need to recompile, neither chromium, they worked well just after i patched the toolchain. It's worth to checking if these symbols is used by other binaries out of stdc++, if just 1 or 2, or even no, it's not a big deal to apply this patch. But now i change my mind, as you said, this not a perfect solution, indeed there's a risk of breaking binaries. If it is on the way of upstreaming the patch, i wont do like this. > To Rich: So would this work?... 2. let me try it. :) 2017-01-30 0:40 GMT+08:00 Szabolcs Nagy <nsz@port70.net>: > * He X <xw897002528@gmail.com> [2017-01-29 22:48:34 +0800]: > > 2. no other ways, musl will use generic config 100%, and then the > > exception, the run time error is hardcoded there; but i doubt if this > > really breaks binaries, the function is only called by libstdc++ itself. > > you cant only update the config, but does not update libstdc++. libstdc++ > > exported the same abi for common binaries, wont break most dynamic-loaded > > binary in my view. > > that's not how abi works.. you change the __c_locale typedef > of the generic config from int* to locale_t, such change > breaks abi and cannot be upstreamed to gcc. (it's unfortunate > that the c++ locale handling default for non-gnu systems > does not use posix locales correctly, but breaking abi > for all non-gnu systems is not an acceptable fix.) > > with your change the abi of libstdc++ would look like > the current gnu abi: > > $ grep __locale_struct libstdc++-v3/config/abi/post/ > x86_64-linux-gnu/baseline_symbols.txt |wc -l > 72 > > the abi on a musl based system is like > > $ nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l > 0 > > so with your patch if a compiled binary refers to any one > of those 72 symbols then the dynamic linker will fail to > load it on a musl based system. > [-- Attachment #2: Type: text/html, Size: 2874 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
* a bug in bindtextdomain() and strip '.UTF-8' @ 2017-01-20 11:11 He X 0 siblings, 0 replies; 39+ messages in thread From: He X @ 2017-01-20 11:11 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 425 bytes --] sorry for my poor english :D first, it's not p->domainname but q->domainname in dcngettext:77. second, any news about http://www.openwall.com/lists/musl/2016/05/11/81? strip '.UTF-8' is important, i think. irc log: 18:58 < xhe> @dalias: i must found a bug in bindtextdomain(), and also the improvement about stripping '.UTF-8' should be merged(my code sucks, that's for my tests), detailed: http://pastebin.com/3C2APqMH [-- Attachment #2: Type: text/html, Size: 962 bytes --] ^ permalink raw reply [flat|nested] 39+ messages in thread
end of thread, other threads:[~2017-03-18 13:50 UTC | newest] Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-01-20 11:25 a bug in bindtextdomain() and strip '.UTF-8' He X 2017-01-29 4:52 ` He X 2017-01-29 13:39 ` Szabolcs Nagy 2017-01-29 14:07 ` Rich Felker 2017-01-29 14:48 ` He X 2017-01-29 15:55 ` Rich Felker 2017-01-29 16:14 ` He X 2017-01-29 16:33 ` Rich Felker 2017-02-08 10:13 ` He X 2017-02-08 14:31 ` Rich Felker 2017-02-09 9:49 ` He X 2017-02-11 2:36 ` Rich Felker 2017-02-11 6:00 ` He X 2017-02-11 23:59 ` Rich Felker 2017-02-12 2:34 ` Rich Felker 2017-02-12 6:56 ` He X 2017-02-12 7:11 ` He X 2017-02-13 17:08 ` Rich Felker 2017-02-13 8:01 ` He X 2017-02-13 13:28 ` Rich Felker 2017-02-13 14:06 ` He X 2017-02-13 17:12 ` Rich Felker 2017-03-04 8:02 ` He X 2017-03-17 19:27 ` Rich Felker 2017-03-17 19:37 ` Rich Felker 2017-03-18 7:34 ` He X 2017-03-18 12:28 ` Rich Felker 2017-03-18 13:50 ` He X 2017-02-13 14:12 ` He X 2017-02-13 17:13 ` Rich Felker 2017-01-29 16:37 ` Rich Felker 2017-01-30 0:37 ` He X 2017-01-30 14:17 ` He X 2017-01-29 16:40 ` Szabolcs Nagy 2017-01-29 16:49 ` Rich Felker 2017-01-30 12:36 ` He X 2017-01-30 13:05 ` Szabolcs Nagy 2017-01-30 1:32 ` He X -- strict thread matches above, loose matches on Subject: below -- 2017-01-20 11:11 He X
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).