a bug in bindtextdomain() and strip '.UTF-8'

mailing list of musl libc
 help / color / mirror / code / Atom feed

* a bug in bindtextdomain() and strip '.UTF-8'
@ 2017-01-20 11:25 He X
  2017-01-29  4:52 ` He X
  0 siblings, 1 reply; 38+ messages in thread
From: He X @ 2017-01-20 11:25 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 425 bytes --]

sorry for my poor english :D

first, it's not p->domainname but q->domainname in dcngettext:77.

second, any news about http://www.openwall.com/lists/musl/2016/05/11/81?
strip  '.UTF-8' is important, i think.

irc log:
18:58 < xhe> @dalias: i must found a bug in bindtextdomain(), and also the
improvement about stripping  '.UTF-8' should be merged(my code sucks,
that's for my tests), detailed: http://pastebin.com/3C2APqMH

[-- Attachment #2: Type: text/html, Size: 1173 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-20 11:25 a bug in bindtextdomain() and strip '.UTF-8' He X
@ 2017-01-29  4:52 ` He X
  2017-01-29 13:39   ` Szabolcs Nagy
  0 siblings, 1 reply; 38+ messages in thread
From: He X @ 2017-01-29  4:52 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1155 bytes --]

found two more bugs related to intl:

1. no memset after malloc, caused chromium crash:
http://www.openwall.com/lists/musl/2017/01/28/1
--- musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000
+++ musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000
@@ -180,6 +180,7 @@
  __munmap((void *)map, map_size);
  goto notrans;
  }
+ memset(p, 0, sizeof *p + namelen + 1);
  p->map = map;
  p->map_size = map_size;
  memcpy(p->name, name, namelen+1);

2. musl uses generic config of libstdc++, which blocked the support of
locale, patch is there:
https://github.com/xhebox/noname-linux/issues/2#issuecomment-275704150


2017-01-20 19:25 GMT+08:00 He X <xw897002528@gmail.com>:

> sorry for my poor english :D
>
> first, it's not p->domainname but q->domainname in dcngettext:77.
>
> second, any news about http://www.openwall.com/lists/musl/2016/05/11/81?
> strip  '.UTF-8' is important, i think.
>
> irc log:
> 18:58 < xhe> @dalias: i must found a bug in bindtextdomain(), and also the
> improvement about stripping  '.UTF-8' should be merged(my code sucks,
> that's for my tests), detailed: http://pastebin.com/3C2APqMH
>

[-- Attachment #2: Type: text/html, Size: 3170 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29  4:52 ` He X
@ 2017-01-29 13:39   ` Szabolcs Nagy
  2017-01-29 14:07     ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Szabolcs Nagy @ 2017-01-29 13:39 UTC (permalink / raw)
  To: musl

* He X <xw897002528@gmail.com> [2017-01-29 12:52:56 +0800]:
> 1. no memset after malloc, caused chromium crash:
> http://www.openwall.com/lists/musl/2017/01/28/1
> --- musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000
> +++ musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000
> @@ -180,6 +180,7 @@
>   __munmap((void *)map, map_size);
>   goto notrans;
>   }
> + memset(p, 0, sizeof *p + namelen + 1);
>   p->map = map;
>   p->map_size = map_size;
>   memcpy(p->name, name, namelen+1);

if you want to zero the entire allocation, then use calloc.
but i think initializing plural_rule is enough.

> 2. musl uses generic config of libstdc++, which blocked the support of
> locale, patch is there:
> https://github.com/xhebox/noname-linux/issues/2#issuecomment-275704150

this breaks the abi of libstdc++ because the definition of
a type in the public api is changed.

so existing c++ binaries break if the toolchain is patched.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 13:39   ` Szabolcs Nagy
@ 2017-01-29 14:07     ` Rich Felker
  2017-01-29 14:48       ` He X
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-01-29 14:07 UTC (permalink / raw)
  To: musl

On Sun, Jan 29, 2017 at 02:39:47PM +0100, Szabolcs Nagy wrote:
> * He X <xw897002528@gmail.com> [2017-01-29 12:52:56 +0800]:
> > 1. no memset after malloc, caused chromium crash:
> > http://www.openwall.com/lists/musl/2017/01/28/1
> > --- musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000
> > +++ musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317 +0000
> > @@ -180,6 +180,7 @@
> >   __munmap((void *)map, map_size);
> >   goto notrans;
> >   }
> > + memset(p, 0, sizeof *p + namelen + 1);
> >   p->map = map;
> >   p->map_size = map_size;
> >   memcpy(p->name, name, namelen+1);
> 
> if you want to zero the entire allocation, then use calloc.
> but i think initializing plural_rule is enough.

Conceptually it seems nice to avoid filling name[] twice, but since
namelen is bounded in size (note: we should be using strnlen elsewhere
where it first gets introduced, but strlen is already checked) it's
not such a practical issue. I would be ok with just zeroing
plural_rule and nplurals (the latter doesn't seem necessary but
leaving it uninitialized until later seems to be a poor choice w.r.t.
future-proofing the code) but just calling calloc for these
allocations is probably the cleanest fix.

> > 2. musl uses generic config of libstdc++, which blocked the support of
> > locale, patch is there:
> > https://github.com/xhebox/noname-linux/issues/2#issuecomment-275704150
> 
> this breaks the abi of libstdc++ because the definition of
> a type in the public api is changed.
> 
> so existing c++ binaries break if the toolchain is patched.

I'm not sufficiently familiar with this code to understand why right
away. Do you see an easy fix to avoid ABI breakage while fixing the
bug?

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 14:07     ` Rich Felker
@ 2017-01-29 14:48       ` He X
  2017-01-29 15:55         ` Rich Felker
                           ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: He X @ 2017-01-29 14:48 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2580 bytes --]

1. agreed with rich, nplurals is important too; compiling the kernel,
cannot update the patch
2. no other ways, musl will use generic config 100%, and then the
exception, the run time error is hardcoded there; but i doubt if this
really breaks binaries, the function is only called by libstdc++ itself.
you cant only update the config, but does not update libstdc++. libstdc++
exported the same abi for common binaries, wont break most dynamic-loaded
binary in my view.

btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread),  and
these two patches, fcitx, chromium are working well.

but there're some names like 'de_DE@euro', 'zh_CN.GBK', these should be
stripped, either, any good ideas?

2017-01-29 22:07 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sun, Jan 29, 2017 at 02:39:47PM +0100, Szabolcs Nagy wrote:
> > * He X <xw897002528@gmail.com> [2017-01-29 12:52:56 +0800]:
> > > 1. no memset after malloc, caused chromium crash:
> > > http://www.openwall.com/lists/musl/2017/01/28/1
> > > --- musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317
> +0000
> > > +++ musl-1.1.16/src/locale/dcngettext.c 2017-01-29 04:42:49.002221317
> +0000
> > > @@ -180,6 +180,7 @@
> > >   __munmap((void *)map, map_size);
> > >   goto notrans;
> > >   }
> > > + memset(p, 0, sizeof *p + namelen + 1);
> > >   p->map = map;
> > >   p->map_size = map_size;
> > >   memcpy(p->name, name, namelen+1);
> >
> > if you want to zero the entire allocation, then use calloc.
> > but i think initializing plural_rule is enough.
>
> Conceptually it seems nice to avoid filling name[] twice, but since
> namelen is bounded in size (note: we should be using strnlen elsewhere
> where it first gets introduced, but strlen is already checked) it's
> not such a practical issue. I would be ok with just zeroing
> plural_rule and nplurals (the latter doesn't seem necessary but
> leaving it uninitialized until later seems to be a poor choice w.r.t.
> future-proofing the code) but just calling calloc for these
> allocations is probably the cleanest fix.
>
> > > 2. musl uses generic config of libstdc++, which blocked the support of
> > > locale, patch is there:
> > > https://github.com/xhebox/noname-linux/issues/2#issuecomment-275704150
> >
> > this breaks the abi of libstdc++ because the definition of
> > a type in the public api is changed.
> >
> > so existing c++ binaries break if the toolchain is patched.
>
> I'm not sufficiently familiar with this code to understand why right
> away. Do you see an easy fix to avoid ABI breakage while fixing the
> bug?
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 3631 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 14:48       ` He X
@ 2017-01-29 15:55         ` Rich Felker
  2017-01-29 16:14           ` He X
  2017-01-29 16:37         ` Rich Felker
  2017-01-29 16:40         ` Szabolcs Nagy
  2 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-01-29 15:55 UTC (permalink / raw)
  To: musl

On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote:
> 1. agreed with rich, nplurals is important too; compiling the kernel,
> cannot update the patch
> 2. no other ways, musl will use generic config 100%, and then the
> exception, the run time error is hardcoded there; but i doubt if this
> really breaks binaries, the function is only called by libstdc++ itself.
> you cant only update the config, but does not update libstdc++. libstdc++
> exported the same abi for common binaries, wont break most dynamic-loaded
> binary in my view.
> 
> btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread),  and
> these two patches, fcitx, chromium are working well.
> 
> but there're some names like 'de_DE@euro', 'zh_CN.GBK', these should be
> stripped, either, any good ideas?

This has all been discussed before; see this email and others in the
thread:

http://www.openwall.com/lists/musl/2016/05/11/8

Masanori Ogino was going to work on some follow-up research, testing,
and/or implementation but didn't get around to it. I'm not aware of
any newer findings that contradict the direction suggested in that
thread.

For your specific examples, de_DE@euro would be searched in de_DE,
de@euro, and finally de; zh_CN.GBK would be invalid (non-UTF-8
encodings not permitted) but it's not clear to me how it should be
handled (rejection or rewriting at setlocale time, stripping .GBK at
translation load time, or leaving .GBK there and letting translation
fail).

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 15:55         ` Rich Felker
@ 2017-01-29 16:14           ` He X
  2017-01-29 16:33             ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: He X @ 2017-01-29 16:14 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1923 bytes --]

I can't wait, can i work on it and make a patch for these issues if Masanori
Ogino is busy now? I'd like to see that these issues could be solved in
official musl repo as soon as possible.
And maybe rejection for NON-UTF-8, since 'LANG=zh_CN.GBK ./a.out(
setlocale(LC_*, "") )' showed me a segfault with glibc.

Wang He

2017-01-29 23:55 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote:
> > 1. agreed with rich, nplurals is important too; compiling the kernel,
> > cannot update the patch
> > 2. no other ways, musl will use generic config 100%, and then the
> > exception, the run time error is hardcoded there; but i doubt if this
> > really breaks binaries, the function is only called by libstdc++ itself.
> > you cant only update the config, but does not update libstdc++. libstdc++
> > exported the same abi for common binaries, wont break most dynamic-loaded
> > binary in my view.
> >
> > btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread),
> and
> > these two patches, fcitx, chromium are working well.
> >
> > but there're some names like 'de_DE@euro', 'zh_CN.GBK', these should be
> > stripped, either, any good ideas?
>
> This has all been discussed before; see this email and others in the
> thread:
>
> http://www.openwall.com/lists/musl/2016/05/11/8
>
> Masanori Ogino was going to work on some follow-up research, testing,
> and/or implementation but didn't get around to it. I'm not aware of
> any newer findings that contradict the direction suggested in that
> thread.
>
> For your specific examples, de_DE@euro would be searched in de_DE,
> de@euro, and finally de; zh_CN.GBK would be invalid (non-UTF-8
> encodings not permitted) but it's not clear to me how it should be
> handled (rejection or rewriting at setlocale time, stripping .GBK at
> translation load time, or leaving .GBK there and letting translation
> fail).
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 2690 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 16:14           ` He X
@ 2017-01-29 16:33             ` Rich Felker
  2017-02-08 10:13               ` He X
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-01-29 16:33 UTC (permalink / raw)
  To: musl

On Mon, Jan 30, 2017 at 12:14:49AM +0800, He X wrote:
> I can't wait, can i work on it and make a patch for these issues if Masanori
> Ogino is busy now? I'd like to see that these issues could be solved in
> official musl repo as soon as possible.

I'm not saying you need to wait, just that you should be aware of past
discussion of the topic, and if you want to propose patches they
should either follow the behavior outlined before or come with
discussion of why you think a different behavior is more appropriate.

> And maybe rejection for NON-UTF-8, since 'LANG=zh_CN.GBK ./a.out(
> setlocale(LC_*, "") )' showed me a segfault with glibc.

I don't think "it crashes on glibc" is a good justification for
anything. Rather there should probably be UX discussions of what
different choices mean for different poor-configuration situations
that are likely to arise in the wild (from things like LC_* getting
copied over ssh).

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 14:48       ` He X
  2017-01-29 15:55         ` Rich Felker
@ 2017-01-29 16:37         ` Rich Felker
  2017-01-30  0:37           ` He X
  2017-01-30 14:17           ` He X
  2017-01-29 16:40         ` Szabolcs Nagy
  2 siblings, 2 replies; 38+ messages in thread
From: Rich Felker @ 2017-01-29 16:37 UTC (permalink / raw)
  To: musl

On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote:
> btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread),  and
> these two patches, fcitx, chromium are working well.

Can I ask how .UTF-8 got in the locale name to begin with? Did you put
it there, or was it copied from another non-glibc system you logged in
from, or did chromium itself add it?

Re: the original patch, it should probably (depending on what we want
to do with other invalid encodings) either use strchr to find the
first '.' and strip everything after it, or something like:

	if (loclen > 6 && !strcmp(locname+loclen-6, ".UTF-8"))

There's no reason to pull strstr in here.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 14:48       ` He X
  2017-01-29 15:55         ` Rich Felker
  2017-01-29 16:37         ` Rich Felker
@ 2017-01-29 16:40         ` Szabolcs Nagy
  2017-01-29 16:49           ` Rich Felker
  2017-01-30  1:32           ` He X
  2 siblings, 2 replies; 38+ messages in thread
From: Szabolcs Nagy @ 2017-01-29 16:40 UTC (permalink / raw)
  To: musl

* He X <xw897002528@gmail.com> [2017-01-29 22:48:34 +0800]:
> 2. no other ways, musl will use generic config 100%, and then the
> exception, the run time error is hardcoded there; but i doubt if this
> really breaks binaries, the function is only called by libstdc++ itself.
> you cant only update the config, but does not update libstdc++. libstdc++
> exported the same abi for common binaries, wont break most dynamic-loaded
> binary in my view.

that's not how abi works.. you change the __c_locale typedef
of the generic config from int* to locale_t, such change
breaks abi and cannot be upstreamed to gcc. (it's unfortunate
that the c++ locale handling default for non-gnu systems
does not use posix locales correctly, but breaking abi
for all non-gnu systems is not an acceptable fix.)

with your change the abi of libstdc++ would look like
the current gnu abi:

$ grep __locale_struct libstdc++-v3/config/abi/post/x86_64-linux-gnu/baseline_symbols.txt |wc -l
72

the abi on a musl based system is like

$ nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l
0

so with your patch if a compiled binary refers to any one
of those 72 symbols then the dynamic linker will fail to
load it on a musl based system.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 16:40         ` Szabolcs Nagy
@ 2017-01-29 16:49           ` Rich Felker
  2017-01-30 12:36             ` He X
  2017-01-30  1:32           ` He X
  1 sibling, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-01-29 16:49 UTC (permalink / raw)
  To: musl

On Sun, Jan 29, 2017 at 05:40:08PM +0100, Szabolcs Nagy wrote:
> * He X <xw897002528@gmail.com> [2017-01-29 22:48:34 +0800]:
> > 2. no other ways, musl will use generic config 100%, and then the
> > exception, the run time error is hardcoded there; but i doubt if this
> > really breaks binaries, the function is only called by libstdc++ itself.
> > you cant only update the config, but does not update libstdc++. libstdc++
> > exported the same abi for common binaries, wont break most dynamic-loaded
> > binary in my view.
> 
> that's not how abi works.. you change the __c_locale typedef
> of the generic config from int* to locale_t, such change
> breaks abi and cannot be upstreamed to gcc. (it's unfortunate
> that the c++ locale handling default for non-gnu systems
> does not use posix locales correctly, but breaking abi
> for all non-gnu systems is not an acceptable fix.)
> 
> with your change the abi of libstdc++ would look like
> the current gnu abi:
> 
> $ grep __locale_struct libstdc++-v3/config/abi/post/x86_64-linux-gnu/baseline_symbols.txt |wc -l
> 72
> 
> the abi on a musl based system is like
> 
> $ nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l
> 0
> 
> so with your patch if a compiled binary refers to any one
> of those 72 symbols then the dynamic linker will fail to
> load it on a musl based system.

So would this work?

struct __generic_locale {
	int dummy;
	localt_t locale;
};

Then allocate these objects with new. That way int* could point to the
object (by pointing to its first member) and the locale_t could be
obtained. Wasteful and ugly, but valid.

Any other fix seems to assume int* can represent locale_t; I don't
think that's valid for a "generic" implementation.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 16:37         ` Rich Felker
@ 2017-01-30  0:37           ` He X
  2017-01-30 14:17           ` He X
  1 sibling, 0 replies; 38+ messages in thread
From: He X @ 2017-01-30  0:37 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2573 bytes --]

> I'm not saying you need to wait....
1. its hard to read that thread for me, i just glanced once, thx for you
advice, ill be more cautious next time! ;p

> Can I ask how .UTF-8 got in the locale name....
2. And '.UTF-8' is copied from glibc's locale-table, i put it there, it's
set by normal user. As i looked in to musl's source, i found it's totally
useless for musl to set such a suffix, suffixes are meaningless. But we
should still do a compatibility with glibc in my view, suffixes seems
already unofficial but standard way to ask libc to provide a proper charset.

> I don't think "it crashes on glibc"...
3. Really sorry, forgot to locale-gen before test, that's why segfault,
seems glibc only stripped '.GBK' at translation load time, showed me
'»ỰѡÏî:'. In another word, it was using real GBK set!

Though I agree with rejection: because musl is utf8, but this '.GBK' asked
for using 'GBK' rather than utf8, conceptually we should just reject it.
But stand on the side of normal users, rewriting is nice to avoid failing.
And for developers using musl, they should know there's no 'non-utf8' sets
in musl rather than depending on libc, so i would like the idea of
rewriting. Or we could put the responsibility of setting right LC_* to
users? Not so friendly...

Because users may want to validate the strings returned by setlocale()...
So the best rewriting time, i think, is at the translation time.

> Re: the original patch, it should probably...
4. makes sense, i'm not a pro coder, i havnt think about using strchr or
strcmp! :)

And with the idea above, i suggest better using strchr to strip all things
after '.'. that is good, and we dont need focus at what is placed after
'.', since whatever he asked, musl is using utf8.

2017-01-30 0:37 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote:
> > btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread),
> and
> > these two patches, fcitx, chromium are working well.
>
> Can I ask how .UTF-8 got in the locale name to begin with? Did you put
> it there, or was it copied from another non-glibc system you logged in
> from, or did chromium itself add it?
>
> Re: the original patch, it should probably (depending on what we want
> to do with other invalid encodings) either use strchr to find the
> first '.' and strip everything after it, or something like:
>
>         if (loclen > 6 && !strcmp(locname+loclen-6, ".UTF-8"))
>
> There's no reason to pull strstr in here.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 4042 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 16:40         ` Szabolcs Nagy
  2017-01-29 16:49           ` Rich Felker
@ 2017-01-30  1:32           ` He X
  1 sibling, 0 replies; 38+ messages in thread
From: He X @ 2017-01-30  1:32 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2088 bytes --]

> To Szabolcs Nagy: that's not how abi works...
1. I know, what i meant, is that: these symbols seems only called by stdc++
themselves, i havnt seen any program invoke these symbols directly, they
use std::locale() or something else. At least in my system, schroot does
not need to recompile, neither chromium, they worked well just after i
patched the toolchain. It's worth to checking if these symbols is used by
other binaries out of stdc++, if just 1 or 2, or even no, it's not a big
deal to apply this patch.

But now i change my mind, as you said, this not a perfect solution, indeed
there's a risk of breaking binaries. If it is on the way of upstreaming the
patch, i wont do like this.

> To Rich: So would this work?...
2. let me try it. :)

2017-01-30 0:40 GMT+08:00 Szabolcs Nagy <nsz@port70.net>:

> * He X <xw897002528@gmail.com> [2017-01-29 22:48:34 +0800]:
> > 2. no other ways, musl will use generic config 100%, and then the
> > exception, the run time error is hardcoded there; but i doubt if this
> > really breaks binaries, the function is only called by libstdc++ itself.
> > you cant only update the config, but does not update libstdc++. libstdc++
> > exported the same abi for common binaries, wont break most dynamic-loaded
> > binary in my view.
>
> that's not how abi works.. you change the __c_locale typedef
> of the generic config from int* to locale_t, such change
> breaks abi and cannot be upstreamed to gcc. (it's unfortunate
> that the c++ locale handling default for non-gnu systems
> does not use posix locales correctly, but breaking abi
> for all non-gnu systems is not an acceptable fix.)
>
> with your change the abi of libstdc++ would look like
> the current gnu abi:
>
> $ grep __locale_struct libstdc++-v3/config/abi/post/
> x86_64-linux-gnu/baseline_symbols.txt |wc -l
> 72
>
> the abi on a musl based system is like
>
> $ nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l
> 0
>
> so with your patch if a compiled binary refers to any one
> of those 72 symbols then the dynamic linker will fail to
> load it on a musl based system.
>

[-- Attachment #2: Type: text/html, Size: 2874 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 16:49           ` Rich Felker
@ 2017-01-30 12:36             ` He X
  2017-01-30 13:05               ` Szabolcs Nagy
  0 siblings, 1 reply; 38+ messages in thread
From: He X @ 2017-01-30 12:36 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2086 bytes --]

worked!  http://paste.ubuntu.com/23893379/
-su-4.4# nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l
0
anything i missed? it it acceptable ? and how could i post this to the
upstream?

2017-01-30 0:49 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sun, Jan 29, 2017 at 05:40:08PM +0100, Szabolcs Nagy wrote:
> > * He X <xw897002528@gmail.com> [2017-01-29 22:48:34 +0800]:
> > > 2. no other ways, musl will use generic config 100%, and then the
> > > exception, the run time error is hardcoded there; but i doubt if this
> > > really breaks binaries, the function is only called by libstdc++
> itself.
> > > you cant only update the config, but does not update libstdc++.
> libstdc++
> > > exported the same abi for common binaries, wont break most
> dynamic-loaded
> > > binary in my view.
> >
> > that's not how abi works.. you change the __c_locale typedef
> > of the generic config from int* to locale_t, such change
> > breaks abi and cannot be upstreamed to gcc. (it's unfortunate
> > that the c++ locale handling default for non-gnu systems
> > does not use posix locales correctly, but breaking abi
> > for all non-gnu systems is not an acceptable fix.)
> >
> > with your change the abi of libstdc++ would look like
> > the current gnu abi:
> >
> > $ grep __locale_struct libstdc++-v3/config/abi/post/
> x86_64-linux-gnu/baseline_symbols.txt |wc -l
> > 72
> >
> > the abi on a musl based system is like
> >
> > $ nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l
> > 0
> >
> > so with your patch if a compiled binary refers to any one
> > of those 72 symbols then the dynamic linker will fail to
> > load it on a musl based system.
>
> So would this work?
>
> struct __generic_locale {
>         int dummy;
>         localt_t locale;
> };
>
> Then allocate these objects with new. That way int* could point to the
> object (by pointing to its first member) and the locale_t could be
> obtained. Wasteful and ugly, but valid.
>
> Any other fix seems to assume int* can represent locale_t; I don't
> think that's valid for a "generic" implementation.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 2894 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-30 12:36             ` He X
@ 2017-01-30 13:05               ` Szabolcs Nagy
  0 siblings, 0 replies; 38+ messages in thread
From: Szabolcs Nagy @ 2017-01-30 13:05 UTC (permalink / raw)
  To: musl

* He X <xw897002528@gmail.com> [2017-01-30 20:36:41 +0800]:
> worked!  http://paste.ubuntu.com/23893379/
> -su-4.4# nm -D /usr/lib/libstdc++.so.6 |grep __locale_struct |wc -l
> 0
> anything i missed? it it acceptable ? and how could i post this to the
> upstream?

on a second look you might be right that
conforming c++ code is probably not able
to reference those symbols at all.

in that case i'm not sure why they are
public symbols, and changing the abi
might be preferable to the workaround hack.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 16:37         ` Rich Felker
  2017-01-30  0:37           ` He X
@ 2017-01-30 14:17           ` He X
  1 sibling, 0 replies; 38+ messages in thread
From: He X @ 2017-01-30 14:17 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 952 bytes --]

1. how do we validate if a name is correct? i mean how do we attempt with
zh, zh_CN, zh_CN@xx? using open() to check? what's the fastest and
correctest way?
2. Thx for your help, nsz!

2017-01-30 0:37 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sun, Jan 29, 2017 at 10:48:34PM +0800, He X wrote:
> > btw, with 'p-> to q->', 'strip .UTF-8'(these two in the first thread),
> and
> > these two patches, fcitx, chromium are working well.
>
> Can I ask how .UTF-8 got in the locale name to begin with? Did you put
> it there, or was it copied from another non-glibc system you logged in
> from, or did chromium itself add it?
>
> Re: the original patch, it should probably (depending on what we want
> to do with other invalid encodings) either use strchr to find the
> first '.' and strip everything after it, or something like:
>
>         if (loclen > 6 && !strcmp(locname+loclen-6, ".UTF-8"))
>
> There's no reason to pull strstr in here.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 1478 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-01-29 16:33             ` Rich Felker
@ 2017-02-08 10:13               ` He X
  2017-02-08 14:31                 ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: He X @ 2017-02-08 10:13 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2283 bytes --]

here the patch is: http://paste.ubuntu.com/23953329/
The code tested, but maybe it sucks.

1. striping @xx, _TT: when mapping with full name failed,  we check if
there's a '@' in locname. if so, go back to the part of copying catname,
override and skip '@xx'.

Then we check if there's a '_', and if both '@' and '_TT' is there, point
locname to '@xx', set a correct loclen, go back to the part of writing
locname to replace '_TT' with '@xx'. If not both, skip and simply override
'_TT'.

Because there's also '_' in 'LC_xx', we may get into a dead loop of
stripping '_TT'. So locname is checked, it's set to NULL if we used strchr
to skip once.

Same reason, we may get into a dead loop of overriding '_TT'. The first
position of '/' should be front of the '_' if we replaced it once, the name
will like: 'zh@t/LC_xx'.

zh_CN@t (stripped by the first part)-> zh_CN (overrided by the second
part)-> zh@t  (stripped by the first part again)-> zh

2. about rewriting of '.GBK': I agreeded with keeping the original value
 of user, and stripping it in gettext() before. But i thought that someone
may validate if libc set the correct charset by setlocale(). So we should
rewrite .XX to .UTF-8 in setlocale(), we cant return a wrong value in
principle.

2017-01-30 0:33 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Mon, Jan 30, 2017 at 12:14:49AM +0800, He X wrote:
> > I can't wait, can i work on it and make a patch for these issues if
> Masanori
> > Ogino is busy now? I'd like to see that these issues could be solved in
> > official musl repo as soon as possible.
>
> I'm not saying you need to wait, just that you should be aware of past
> discussion of the topic, and if you want to propose patches they
> should either follow the behavior outlined before or come with
> discussion of why you think a different behavior is more appropriate.
>
> > And maybe rejection for NON-UTF-8, since 'LANG=zh_CN.GBK ./a.out(
> > setlocale(LC_*, "") )' showed me a segfault with glibc.
>
> I don't think "it crashes on glibc" is a good justification for
> anything. Rather there should probably be UX discussions of what
> different choices mean for different poor-configuration situations
> that are likely to arise in the wild (from things like LC_* getting
> copied over ssh).
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 3184 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-08 10:13               ` He X
@ 2017-02-08 14:31                 ` Rich Felker
  2017-02-09  9:49                   ` He X
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-02-08 14:31 UTC (permalink / raw)
  To: musl

On Wed, Feb 08, 2017 at 06:13:30PM +0800, He X wrote:
> here the patch is: http://paste.ubuntu.com/23953329/
> The code tested, but maybe it sucks.

Patches need to be attached and sent to the list, not pastebins that
might disappear. The latter don't work for discussing and preserving
discussion of the patch.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-08 14:31                 ` Rich Felker
@ 2017-02-09  9:49                   ` He X
  2017-02-11  2:36                     ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: He X @ 2017-02-09  9:49 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 401 bytes --]

sry!

2017-02-08 22:31 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Wed, Feb 08, 2017 at 06:13:30PM +0800, He X wrote:
> > here the patch is: http://paste.ubuntu.com/23953329/
> > The code tested, but maybe it sucks.
>
> Patches need to be attached and sent to the list, not pastebins that
> might disappear. The latter don't work for discussing and preserving
> discussion of the patch.
>
> Rich
>

[-- Attachment #1.2: Type: text/html, Size: 893 bytes --]

[-- Attachment #2: locale.diff --]
[-- Type: text/plain, Size: 2015 bytes --]

--- a/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000 
+++ b/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000
@@ -19,6 +19,7 @@
 };
 
 static void *volatile bindings;
+char *__strchrnul(const char *, int);
 
 static char *gettextdir(const char *domainname, size_t *dirlen)
 {
@@ -143,7 +143,7 @@
 
 	catname = catnames[category];
 	catlen = catlens[category];
-	loclen = strlen(locname);
+	loclen = __strchrnul(locname, '.') - locname;
 
 	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
 	char name[namelen+1], *s = name;
@@ -157,6 +157,8 @@
+rewrite_loc:
 	memcpy(s, locname, loclen);
 	s[loclen] = '/';
 	s += loclen + 1;
+skip_loc:
 	memcpy(s, catname, catlen);
 	s[catlen] = '/';
 	s += catlen + 1;
@@ -174,7 +175,22 @@
 		void *old_cats;
 		size_t map_size;
 		const void *map = __map_file(name, &map_size);
-		if (!map) goto notrans;
+		if (!map) {
+			if (s = strchr(name + dirlen + 1, '@')) {
+				*s++ = '/';
+				goto skip_loc;
+			}
+			if (locname && (s = strchr(name + dirlen + 1, '_')) && (strchr(name + dirlen +1, '/') > s) ) {
+				if (locname = strchr(locname, '@')) {
+					loclen = __strchrnul(lm->name, '.') - locname;
+					goto rewrite_loc;
+				} else {
+					*s++ = '/';
+					goto skip_loc;
+				}
+			}
+			goto notrans;
+		}
 		p = calloc(sizeof *p + namelen + 1, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
--- a/src/locale/locale_map.c	2017-02-06 14:39:17.797148750 +0000
+++ b/src/locale/locale_map.c	2017-02-06 14:39:17.797148750 +0000
@@ -32,6 +32,7 @@
 	struct __locale_map *new = 0;
 	const char *path = 0, *z;
 	char buf[256];
+	char *dotp;
 	size_t l, n;
 
 	if (!*val) {
@@ -40,6 +41,12 @@
 		(val = getenv("LANG")) && *val ||
 		(val = "C.UTF-8");
 	}
+	if (dotp = strchr(val, '.')) {
+		char part[256];
+		memcpy(part, val, dotp - val);
+		memcpy(&part[dotp - val], ".UTF-8\0", 7);
+		val = part;
+	}
 
 	/* Limit name length and forbid leading dot or any slashes. */
 	for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-09  9:49                   ` He X
@ 2017-02-11  2:36                     ` Rich Felker
  2017-02-11  6:00                       ` He X
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-02-11  2:36 UTC (permalink / raw)
  To: musl

On Thu, Feb 09, 2017 at 05:49:13PM +0800, He X wrote:
> sry!
> 
> 2017-02-08 22:31 GMT+08:00 Rich Felker <dalias@libc.org>:
> 
> > On Wed, Feb 08, 2017 at 06:13:30PM +0800, He X wrote:
> > > here the patch is: http://paste.ubuntu.com/23953329/
> > > The code tested, but maybe it sucks.
> >
> > Patches need to be attached and sent to the list, not pastebins that
> > might disappear. The latter don't work for discussing and preserving
> > discussion of the patch.

> --- a/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000 
> +++ b/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000
> @@ -19,6 +19,7 @@
>  };
>  
>  static void *volatile bindings;
> +char *__strchrnul(const char *, int);
>  
>  static char *gettextdir(const char *domainname, size_t *dirlen)
>  {
> @@ -143,7 +143,7 @@
>  
>  	catname = catnames[category];
>  	catlen = catlens[category];
> -	loclen = strlen(locname);
> +	loclen = __strchrnul(locname, '.') - locname;
>  
>  	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
>  	char name[namelen+1], *s = name;
> @@ -157,6 +157,8 @@
> +rewrite_loc:
>  	memcpy(s, locname, loclen);
>  	s[loclen] = '/';
>  	s += loclen + 1;
> +skip_loc:
>  	memcpy(s, catname, catlen);
>  	s[catlen] = '/';
>  	s += catlen + 1;
> @@ -174,7 +175,22 @@
>  		void *old_cats;
>  		size_t map_size;
>  		const void *map = __map_file(name, &map_size);
> -		if (!map) goto notrans;
> +		if (!map) {
> +			if (s = strchr(name + dirlen + 1, '@')) {
> +				*s++ = '/';
> +				goto skip_loc;
> +			}
> +			if (locname && (s = strchr(name + dirlen + 1, '_')) && (strchr(name + dirlen +1, '/') > s) ) {
> +				if (locname = strchr(locname, '@')) {
> +					loclen = __strchrnul(lm->name, '.') - locname;
> +					goto rewrite_loc;
> +				} else {
> +					*s++ = '/';
> +					goto skip_loc;
> +				}
> +			}
> +			goto notrans;
> +		}

This doesn't work because it changes both the key used for the lookup
and the filename mapped. If you try this code with a translation that
requires a fallback, and run it under strace, you'll see that _every_
call to gettext will try again to find the nonexistent files.

It could be fixed, but I think the code should be refactored so that,
rather than the msgcat list being indexed by pathname strings, it's
indexed by tuples of:

	( struct __locale_map *, struct binding *, category )

These are all integers/pointers and thus compare very fast versus the
current strcmp operation, and it's very quick to look them up. Then we
only have to construct the pathname string when a new file needs to be
loaded, not on every call, and you're free to clobber the pathname
string while doing fallbacks.

>  		p = calloc(sizeof *p + namelen + 1, 1);
>  		if (!p) {
>  			__munmap((void *)map, map_size);
> --- a/src/locale/locale_map.c	2017-02-06 14:39:17.797148750 +0000
> +++ b/src/locale/locale_map.c	2017-02-06 14:39:17.797148750 +0000
> @@ -32,6 +32,7 @@
>  	struct __locale_map *new = 0;
>  	const char *path = 0, *z;
>  	char buf[256];
> +	char *dotp;
>  	size_t l, n;
>  
>  	if (!*val) {
> @@ -40,6 +41,12 @@
>  		(val = getenv("LANG")) && *val ||
>  		(val = "C.UTF-8");
>  	}
> +	if (dotp = strchr(val, '.')) {
> +		char part[256];
> +		memcpy(part, val, dotp - val);
> +		memcpy(&part[dotp - val], ".UTF-8\0", 7);
> +		val = part;
> +	}
>  
>  	/* Limit name length and forbid leading dot or any slashes. */
>  	for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++);

I don't think this part is desirable, but if it were, it would need to
be done differently. As-is, it has serious UB, use of part[] after the
end of its lifetime. It also seems to have no check to see that
dotp-val is less than 256-7 or even that it's bounded, whereas the
code that immediately follows checks the length of the string pointed
to by val.

I think what it should be doing is the opposite, stopping when hitting
a dot in the name and only using the part up to the dot, except in the
one special case "C.UTF-8". The subsequent path search for the locale
file should probably then be repeated with combinations of dropping
@mod and _CC suffixes, but this dropping should _not_ affect the name
that's saved and reported back. (That is, if LC_TIME=fr_CA but only a
"fr" locale file exists, the "fr" file should get mapped but the name
returned by setlocale, and saved for use by gettext, should still be
the full "fr_CA" in case applications have "fr_CA" translations.)

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-11  2:36                     ` Rich Felker
@ 2017-02-11  6:00                       ` He X
  2017-02-11 23:59                         ` Rich Felker
  2017-02-12  2:34                         ` Rich Felker
  0 siblings, 2 replies; 38+ messages in thread
From: He X @ 2017-02-11  6:00 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 6301 bytes --]

fresh patch :)
1. It's easier that just stopping at dot, and i think this should be
commented in the wiki or somewhere.
2. I read your first part of reply for 20mins, but im not sure; If i
understand right, you mean, let the __locale_map* and strcut binding* be
the id-card for msgcat list instead of the long name string, not only
faster, but also more easy to construct pathname string. But there's some
questions:
+ I removed name from msgcat, i can't find its use there, is it safe?
+ gettextdir() is replaced by a new loop, since i need the pointer of
struct binding not only the dirname, but then, gettextdir() is only called
by bindtextdomain(), is there a need to keep it? Or we have a better way to
get the pointer of struct binding?
+ you said msgcat's indexed by  ( struct __locale_map *, struct binding *,
category ), but i found lm(locale_map) is located by category, so if
category is different, then we can't get the same lm, so we can just
compare lm, right?

2017-02-11 10:36 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Thu, Feb 09, 2017 at 05:49:13PM +0800, He X wrote:
> > sry!
> >
> > 2017-02-08 22:31 GMT+08:00 Rich Felker <dalias@libc.org>:
> >
> > > On Wed, Feb 08, 2017 at 06:13:30PM +0800, He X wrote:
> > > > here the patch is: http://paste.ubuntu.com/23953329/
> > > > The code tested, but maybe it sucks.
> > >
> > > Patches need to be attached and sent to the list, not pastebins that
> > > might disappear. The latter don't work for discussing and preserving
> > > discussion of the patch.
>
> > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000
> > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000
> > @@ -19,6 +19,7 @@
> >  };
> >
> >  static void *volatile bindings;
> > +char *__strchrnul(const char *, int);
> >
> >  static char *gettextdir(const char *domainname, size_t *dirlen)
> >  {
> > @@ -143,7 +143,7 @@
> >
> >       catname = catnames[category];
> >       catlen = catlens[category];
> > -     loclen = strlen(locname);
> > +     loclen = __strchrnul(locname, '.') - locname;
> >
> >       size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> >       char name[namelen+1], *s = name;
> > @@ -157,6 +157,8 @@
> > +rewrite_loc:
> >       memcpy(s, locname, loclen);
> >       s[loclen] = '/';
> >       s += loclen + 1;
> > +skip_loc:
> >       memcpy(s, catname, catlen);
> >       s[catlen] = '/';
> >       s += catlen + 1;
> > @@ -174,7 +175,22 @@
> >               void *old_cats;
> >               size_t map_size;
> >               const void *map = __map_file(name, &map_size);
> > -             if (!map) goto notrans;
> > +             if (!map) {
> > +                     if (s = strchr(name + dirlen + 1, '@')) {
> > +                             *s++ = '/';
> > +                             goto skip_loc;
> > +                     }
> > +                     if (locname && (s = strchr(name + dirlen + 1,
> '_')) && (strchr(name + dirlen +1, '/') > s) ) {
> > +                             if (locname = strchr(locname, '@')) {
> > +                                     loclen = __strchrnul(lm->name,
> '.') - locname;
> > +                                     goto rewrite_loc;
> > +                             } else {
> > +                                     *s++ = '/';
> > +                                     goto skip_loc;
> > +                             }
> > +                     }
> > +                     goto notrans;
> > +             }
>
> This doesn't work because it changes both the key used for the lookup
> and the filename mapped. If you try this code with a translation that
> requires a fallback, and run it under strace, you'll see that _every_
> call to gettext will try again to find the nonexistent files.
>
> It could be fixed, but I think the code should be refactored so that,
> rather than the msgcat list being indexed by pathname strings, it's
> indexed by tuples of:
>
>         ( struct __locale_map *, struct binding *, category )
>
> These are all integers/pointers and thus compare very fast versus the
> current strcmp operation, and it's very quick to look them up. Then we
> only have to construct the pathname string when a new file needs to be
> loaded, not on every call, and you're free to clobber the pathname
> string while doing fallbacks.
>
> >               p = calloc(sizeof *p + namelen + 1, 1);
> >               if (!p) {
> >                       __munmap((void *)map, map_size);
> > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000
> > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000
> > @@ -32,6 +32,7 @@
> >       struct __locale_map *new = 0;
> >       const char *path = 0, *z;
> >       char buf[256];
> > +     char *dotp;
> >       size_t l, n;
> >
> >       if (!*val) {
> > @@ -40,6 +41,12 @@
> >               (val = getenv("LANG")) && *val ||
> >               (val = "C.UTF-8");
> >       }
> > +     if (dotp = strchr(val, '.')) {
> > +             char part[256];
> > +             memcpy(part, val, dotp - val);
> > +             memcpy(&part[dotp - val], ".UTF-8\0", 7);
> > +             val = part;
> > +     }
> >
> >       /* Limit name length and forbid leading dot or any slashes. */
> >       for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++);
>
> I don't think this part is desirable, but if it were, it would need to
> be done differently. As-is, it has serious UB, use of part[] after the
> end of its lifetime. It also seems to have no check to see that
> dotp-val is less than 256-7 or even that it's bounded, whereas the
> code that immediately follows checks the length of the string pointed
> to by val.
>
> I think what it should be doing is the opposite, stopping when hitting
> a dot in the name and only using the part up to the dot, except in the
> one special case "C.UTF-8". The subsequent path search for the locale
> file should probably then be repeated with combinations of dropping
> @mod and _CC suffixes, but this dropping should _not_ affect the name
> that's saved and reported back. (That is, if LC_TIME=fr_CA but only a
> "fr" locale file exists, the "fr" file should get mapped but the name
> returned by setlocale, and saved for use by gettext, should still be
> the full "fr_CA" in case applications have "fr_CA" translations.)
>
> Rich
>

[-- Attachment #1.2: Type: text/html, Size: 8221 bytes --]

[-- Attachment #2: locale.diff --]
[-- Type: text/plain, Size: 3878 bytes --]

--- a/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000 
+++ b/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000
@@ -100,7 +100,8 @@
 	size_t map_size;
 	void *volatile plural_rule;
 	volatile int nplurals;
-	char name[];
+	struct binding *binding;
+	struct __locale_map *lm;
 };
 
 static char *dummy_gettextdomain()
@@ -120,58 +122,87 @@
 	struct msgcat *p;
 	struct __locale_struct *loc = CURRENT_LOCALE;
 	const struct __locale_map *lm;
-	const char *dirname, *locname, *catname;
-	size_t dirlen, loclen, catlen, domlen;
+	size_t domlen;
+	struct binding *q;
 
 	if ((unsigned)category >= LC_ALL) goto notrans;
 
 	if (!domainname) domainname = __gettextdomain();
 
 	domlen = strnlen(domainname, NAME_MAX+1);
 	if (domlen > NAME_MAX) goto notrans;
 
-	dirname = gettextdir(domainname, &dirlen);
-	if (!dirname) goto notrans;
+	for (q=bindings; q; q=q->next)
+		if (!strcmp(q->domainname, domainname) && q->active)
+			break;
+	if (!q) goto notrans;
 
 	lm = loc->cat[category];
 	if (!lm) {
 notrans:
 		return (char *) ((n == 1) ? msgid1 : msgid2);
 	}
-	locname = lm->name;
-
-	catname = catnames[category];
-	catlen = catlens[category];
-	loclen = strlen(locname);
-
-	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
-	char name[namelen+1], *s = name;
-
-	memcpy(s, dirname, dirlen);
-	s[dirlen] = '/';
-	s += dirlen + 1;
-	memcpy(s, locname, loclen);
-	s[loclen] = '/';
-	s += loclen + 1;
-	memcpy(s, catname, catlen);
-	s[catlen] = '/';
-	s += catlen + 1;
-	memcpy(s, domainname, domlen);
-	s[domlen] = '.';
-	s[domlen+1] = 'm';
-	s[domlen+2] = 'o';
-	s[domlen+3] = 0;
 
 	for (p=cats; p; p=p->next)
-		if (!strcmp(p->name, name))
+		if (p->binding == q && p->lm == lm)
 			break;
 
 	if (!p) {
+		const char *dirname, *locname, *catname;
+		size_t dirlen, loclen, catlen;
 		void *old_cats;
 		size_t map_size;
+
+		dirname = q->dirname;
+		locname = lm->name;
+		catname = catnames[category];
+
+		dirlen = q->dirlen;
+		loclen = strlen(locname);
+		catlen = catlens[category];
+
+		size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
+		char name[namelen+1], *s = name;
+		char *str = name;
+
+		memcpy(s, dirname, dirlen);
+		s[dirlen] = '/';
+		s += dirlen + 1;
+		memcpy(s, locname, loclen);
+		s[loclen] = '/';
+		s += loclen + 1;
+skip_loc:
+		memcpy(s, catname, catlen);
+		s[catlen] = '/';
+		s += catlen + 1;
+		memcpy(s, domainname, domlen);
+		s[domlen] = '.';
+		s[domlen+1] = 'm';
+		s[domlen+2] = 'o';
+		s[domlen+3] = 0;
+
 		const void *map = __map_file(name, &map_size);
-		if (!map) goto notrans;
+		if (!map) {
+			if (s = strchr(name+dirlen+1, '@')) {
+ 				*s++ = '/';
+ 				goto skip_loc;;
+ 			}
+ 			if ( str && (s = strchr(name+dirlen+1, '_')) && (s < strchr(name+dirlen+1, '/')) ) {
+ 				if (str = strchr(locname, '@')) {
+ 					loclen += locname - str;
+					memcpy(s, str, loclen);
+					s[loclen] = '/';
+					s += loclen + 1;
+					str = 0;
+ 					goto skip_loc;
+ 				} else {
+					*s++ = '/';
+ 					goto skip_loc;
+ 				}
+ 			}
+			goto notrans;
+		}
 		p = calloc(sizeof *p + namelen + 1, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
 			goto notrans;
@@ -209,7 +209,6 @@
 		}
 		p->map = map;
 		p->map_size = map_size;
-		memcpy(p->name, name, namelen+1);
 		do {
 			old_cats = cats;
 			p->next = old_cats;
--- a/src/locale/locale_map.c	2017-02-06 14:39:17.797148750 +0000
+++ b/src/locale/locale_map.c	2017-02-06 14:39:17.797148750 +0000
@@ -49,8 +49,8 @@
 	}
 
 	/* Limit name length and forbid leading dot or any slashes. */
-	for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++);
-	if (val[0]=='.' || val[n]) val = "C.UTF-8";
+	for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' && val[n]!='.'; n++);
+	if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8";
 	int builtin = (val[0]=='C' && !val[1])
 		|| !strcmp(val, "C.UTF-8")
 		|| !strcmp(val, "POSIX");

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-11  6:00                       ` He X
@ 2017-02-11 23:59                         ` Rich Felker
  2017-02-12  2:34                         ` Rich Felker
  1 sibling, 0 replies; 38+ messages in thread
From: Rich Felker @ 2017-02-11 23:59 UTC (permalink / raw)
  To: musl

On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote:
> fresh patch :)
> 1. It's easier that just stopping at dot, and i think this should be
> commented in the wiki or somewhere.
> 2. I read your first part of reply for 20mins, but im not sure; If i
> understand right, you mean, let the __locale_map* and strcut binding* be
> the id-card for msgcat list instead of the long name string, not only
> faster, but also more easy to construct pathname string.

Yes. The values needed for the "id-card" (key) for the lookup are:

1. loc->cat[category]
2. category
3. The struct binding * active for domainname; gettextdir should be
   replaced with a function to lookup the binding rather than just
   returning the dir name.

These three pointer/integer values uniquely determine the pathname(s)
to try, and thus the mapped translation file to use.

> But there's some
> questions:
> + I removed name from msgcat, i can't find its use there, is it safe?

I think that's fine.

> + gettextdir() is replaced by a new loop, since i need the pointer of
> struct binding not only the dirname, but then, gettextdir() is only called
> by bindtextdomain(), is there a need to keep it? Or we have a better way to
> get the pointer of struct binding?

It could be replaced with a function calle getdomainbinding that
returns the struct binding * for the domain argument. Then the caller
can use the returned pointer to lookup an existing mapped msgcat, or
read out the ->dirname member if it needs to construct a path to map a
new one.

> + you said msgcat's indexed by  ( struct __locale_map *, struct binding *,
> category ), but i found lm(locale_map) is located by category, so if
> category is different, then we can't get the same lm, so we can just
> compare lm, right?

If LC_TIME and LC_MESSAGES are both the same locale, then
loc->cat[LC_TIME] and loc->cat[LC_MESSAGES] will both be the same
pointer. Thus category also needs to be kept for the lookup (and path
expansion).

Does this help?

I'll review the patch code separately.

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-11  6:00                       ` He X
  2017-02-11 23:59                         ` Rich Felker
@ 2017-02-12  2:34                         ` Rich Felker
  2017-02-12  6:56                           ` He X
  2017-02-13  8:01                           ` He X
  1 sibling, 2 replies; 38+ messages in thread
From: Rich Felker @ 2017-02-12  2:34 UTC (permalink / raw)
  To: musl

On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote:
> --- a/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000 
> +++ b/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000
> @@ -100,7 +100,8 @@
>  	size_t map_size;
>  	void *volatile plural_rule;
>  	volatile int nplurals;
> -	char name[];
> +	struct binding *binding;
> +	struct __locale_map *lm;
>  };

As stated in the reply to message body, I think you need the category
in the keying too, because there can be different .mo files loaded
depending on which category was requested.

>  static char *dummy_gettextdomain()
> @@ -120,58 +122,87 @@
>  	struct msgcat *p;
>  	struct __locale_struct *loc = CURRENT_LOCALE;
>  	const struct __locale_map *lm;
> -	const char *dirname, *locname, *catname;
> -	size_t dirlen, loclen, catlen, domlen;
> +	size_t domlen;
> +	struct binding *q;
>  
>  	if ((unsigned)category >= LC_ALL) goto notrans;
>  
>  	if (!domainname) domainname = __gettextdomain();
>  
>  	domlen = strnlen(domainname, NAME_MAX+1);
>  	if (domlen > NAME_MAX) goto notrans;
>  
> -	dirname = gettextdir(domainname, &dirlen);
> -	if (!dirname) goto notrans;
> +	for (q=bindings; q; q=q->next)
> +		if (!strcmp(q->domainname, domainname) && q->active)
> +			break;
> +	if (!q) goto notrans;

Looks ok. I had said this should be a function but it really doesn't
need to be; it's plenty simple inline.

>  	lm = loc->cat[category];
>  	if (!lm) {
>  notrans:
>  		return (char *) ((n == 1) ? msgid1 : msgid2);
>  	}
> -	locname = lm->name;
> -
> -	catname = catnames[category];
> -	catlen = catlens[category];
> -	loclen = strlen(locname);
> -
> -	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> -	char name[namelen+1], *s = name;
> -
> -	memcpy(s, dirname, dirlen);
> -	s[dirlen] = '/';
> -	s += dirlen + 1;
> -	memcpy(s, locname, loclen);
> -	s[loclen] = '/';
> -	s += loclen + 1;
> -	memcpy(s, catname, catlen);
> -	s[catlen] = '/';
> -	s += catlen + 1;
> -	memcpy(s, domainname, domlen);
> -	s[domlen] = '.';
> -	s[domlen+1] = 'm';
> -	s[domlen+2] = 'o';
> -	s[domlen+3] = 0;
>  
>  	for (p=cats; p; p=p->next)
> -		if (!strcmp(p->name, name))
> +		if (p->binding == q && p->lm == lm)
>  			break;

&& p->cat == category

>  	if (!p) {
> +		const char *dirname, *locname, *catname;
> +		size_t dirlen, loclen, catlen;
>  		void *old_cats;
>  		size_t map_size;
> +
> +		dirname = q->dirname;
> +		locname = lm->name;
> +		catname = catnames[category];
> +
> +		dirlen = q->dirlen;
> +		loclen = strlen(locname);
> +		catlen = catlens[category];

Now that these are only computed once rather than per-call, optimizing
out strlen is probably not worthwhile anymore, but it doesn't really
hurt either. Not something you need to change, just a comment.

> +
> +		size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> +		char name[namelen+1], *s = name;
> +		char *str = name;
> +
> +		memcpy(s, dirname, dirlen);
> +		s[dirlen] = '/';
> +		s += dirlen + 1;
> +		memcpy(s, locname, loclen);
> +		s[loclen] = '/';
> +		s += loclen + 1;
> +skip_loc:
> +		memcpy(s, catname, catlen);
> +		s[catlen] = '/';
> +		s += catlen + 1;
> +		memcpy(s, domainname, domlen);
> +		s[domlen] = '.';
> +		s[domlen+1] = 'm';
> +		s[domlen+2] = 'o';
> +		s[domlen+3] = 0;

Actually, now that this code is not a hot path, it should just be
using snprintf to construct the pathname, I think. It would be a lot
simpler and easier to ensure correctness.

> +
>  		const void *map = __map_file(name, &map_size);
> -		if (!map) goto notrans;
> +		if (!map) {
> +			if (s = strchr(name+dirlen+1, '@')) {
> + 				*s++ = '/';
> + 				goto skip_loc;;
> + 			}
> + 			if ( str && (s = strchr(name+dirlen+1, '_')) && (s < strchr(name+dirlen+1, '/')) ) {
> + 				if (str = strchr(locname, '@')) {
> + 					loclen += locname - str;
> +					memcpy(s, str, loclen);
> +					s[loclen] = '/';
> +					s += loclen + 1;
> +					str = 0;
> + 					goto skip_loc;
> + 				} else {
> +					*s++ = '/';
> + 					goto skip_loc;
> + 				}
> + 			}
> +			goto notrans;
> +		}

Using snprintf should also make it easy to get rid of the goto/retry
logic here, perhaps even with a 4-iteration loop and array of which
format modifications happen on each iteration.

>  		p = calloc(sizeof *p + namelen + 1, 1);
>  		if (!p) {
>  			__munmap((void *)map, map_size);
>  			goto notrans;
> @@ -209,7 +209,6 @@
>  		}
>  		p->map = map;
>  		p->map_size = map_size;
> -		memcpy(p->name, name, namelen+1);
>  		do {
>  			old_cats = cats;
>  			p->next = old_cats;
> --- a/src/locale/locale_map.c	2017-02-06 14:39:17.797148750 +0000
> +++ b/src/locale/locale_map.c	2017-02-06 14:39:17.797148750 +0000
> @@ -49,8 +49,8 @@
>  	}
>  
>  	/* Limit name length and forbid leading dot or any slashes. */
> -	for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++);
> -	if (val[0]=='.' || val[n]) val = "C.UTF-8";
> +	for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' && val[n]!='.'; n++);
> +	if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8";
>  	int builtin = (val[0]=='C' && !val[1])
>  		|| !strcmp(val, "C.UTF-8")
>  		|| !strcmp(val, "POSIX");

This looks ok but might still need some tweaks. Should an input like
"zh_CN.GBK" get treated as "zh_CN" (thus outputting UTF-8 that might
appear as junk on the user's terminal) or as "C" (no localization)
with only ASCII characters (safe for the user's terminal), or even
cause setlocale to fail and return an error so that the application
can decide what to do? These are not technical comments on your patch
but policy matters the community should weigh in on.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-12  2:34                         ` Rich Felker
@ 2017-02-12  6:56                           ` He X
  2017-02-12  7:11                             ` He X
  2017-02-13 17:08                             ` Rich Felker
  2017-02-13  8:01                           ` He X
  1 sibling, 2 replies; 38+ messages in thread
From: He X @ 2017-02-12  6:56 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 9048 bytes --]

1. cat is added to the keys, also do a validate
2. so we what do we deal with the gettextdir() exactly? inline it or
construct a gettextpointer()?
3. i added a extra locbuf array, and goto is replaced by a loop, memcpy is
replaced by snprintf, compiled, and working well with fcitx
4. i just found that i forgot to store the keys to new buffer, it's ok to
just use normal expression? or we need atomic operations?
```
+ p->cat = category;
+ p->binding = q;
+ p->lm = lm;
```
5.  I do want to rewrite all to .UTF8, but it's a bit annoying as your
words, then i changed the code to simply strip.

>  (safe for the user's terminal)
LANG is set by users who are using musl and it's modified to zh_CN at
setlocale(), app will use UTF8 directly, there's no such situation where
charset will cause troubles to users' terminal, except apps which get the
LANG manually by getenv(). I have not seen such strange applications so
far, and most apps only have the UTF8 translation files.

For moving from glibc to musl, i think doing this way is good for now, we
could delete it later, or just keep it forever. And most people won't use
non-UTF8 at all, if they do use GBK, their app will even fallback to UTF8,
because no translation files for GBK. So, it's not so dagerous, i think :).

And for developers,  they should not use setlocale to detect the charset,
this is wrong, nl_langinfo is the correct way. If they use, stripping will
let their app know something went wrong.

Strip .GBK or .UTF-8, so users would be happy that their old settings are
working, developers will notice their mistakes that using setlocale() to
validate charset is wrong. We get a lot more than failing the setlocale()
and return C, the only bad thing is we need to care about a almost
impossible event: an app directly getenv().

2017-02-12 10:34 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote:
> > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000
> > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000
> > @@ -100,7 +100,8 @@
> >       size_t map_size;
> >       void *volatile plural_rule;
> >       volatile int nplurals;
> > -     char name[];
> > +     struct binding *binding;
> > +     struct __locale_map *lm;
> >  };
>
> As stated in the reply to message body, I think you need the category
> in the keying too, because there can be different .mo files loaded
> depending on which category was requested.
>
> >  static char *dummy_gettextdomain()
> > @@ -120,58 +122,87 @@
> >       struct msgcat *p;
> >       struct __locale_struct *loc = CURRENT_LOCALE;
> >       const struct __locale_map *lm;
> > -     const char *dirname, *locname, *catname;
> > -     size_t dirlen, loclen, catlen, domlen;
> > +     size_t domlen;
> > +     struct binding *q;
> >
> >       if ((unsigned)category >= LC_ALL) goto notrans;
> >
> >       if (!domainname) domainname = __gettextdomain();
> >
> >       domlen = strnlen(domainname, NAME_MAX+1);
> >       if (domlen > NAME_MAX) goto notrans;
> >
> > -     dirname = gettextdir(domainname, &dirlen);
> > -     if (!dirname) goto notrans;
> > +     for (q=bindings; q; q=q->next)
> > +             if (!strcmp(q->domainname, domainname) && q->active)
> > +                     break;
> > +     if (!q) goto notrans;
>
> Looks ok. I had said this should be a function but it really doesn't
> need to be; it's plenty simple inline.
>
> >       lm = loc->cat[category];
> >       if (!lm) {
> >  notrans:
> >               return (char *) ((n == 1) ? msgid1 : msgid2);
> >       }
> > -     locname = lm->name;
> > -
> > -     catname = catnames[category];
> > -     catlen = catlens[category];
> > -     loclen = strlen(locname);
> > -
> > -     size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> > -     char name[namelen+1], *s = name;
> > -
> > -     memcpy(s, dirname, dirlen);
> > -     s[dirlen] = '/';
> > -     s += dirlen + 1;
> > -     memcpy(s, locname, loclen);
> > -     s[loclen] = '/';
> > -     s += loclen + 1;
> > -     memcpy(s, catname, catlen);
> > -     s[catlen] = '/';
> > -     s += catlen + 1;
> > -     memcpy(s, domainname, domlen);
> > -     s[domlen] = '.';
> > -     s[domlen+1] = 'm';
> > -     s[domlen+2] = 'o';
> > -     s[domlen+3] = 0;
> >
> >       for (p=cats; p; p=p->next)
> > -             if (!strcmp(p->name, name))
> > +             if (p->binding == q && p->lm == lm)
> >                       break;
>
> && p->cat == category
>
> >       if (!p) {
> > +             const char *dirname, *locname, *catname;
> > +             size_t dirlen, loclen, catlen;
> >               void *old_cats;
> >               size_t map_size;
> > +
> > +             dirname = q->dirname;
> > +             locname = lm->name;
> > +             catname = catnames[category];
> > +
> > +             dirlen = q->dirlen;
> > +             loclen = strlen(locname);
> > +             catlen = catlens[category];
>
> Now that these are only computed once rather than per-call, optimizing
> out strlen is probably not worthwhile anymore, but it doesn't really
> hurt either. Not something you need to change, just a comment.
>
> > +
> > +             size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> > +             char name[namelen+1], *s = name;
> > +             char *str = name;
> > +
> > +             memcpy(s, dirname, dirlen);
> > +             s[dirlen] = '/';
> > +             s += dirlen + 1;
> > +             memcpy(s, locname, loclen);
> > +             s[loclen] = '/';
> > +             s += loclen + 1;
> > +skip_loc:
> > +             memcpy(s, catname, catlen);
> > +             s[catlen] = '/';
> > +             s += catlen + 1;
> > +             memcpy(s, domainname, domlen);
> > +             s[domlen] = '.';
> > +             s[domlen+1] = 'm';
> > +             s[domlen+2] = 'o';
> > +             s[domlen+3] = 0;
>
> Actually, now that this code is not a hot path, it should just be
> using snprintf to construct the pathname, I think. It would be a lot
> simpler and easier to ensure correctness.
>
> > +
> >               const void *map = __map_file(name, &map_size);
> > -             if (!map) goto notrans;
> > +             if (!map) {
> > +                     if (s = strchr(name+dirlen+1, '@')) {
> > +                             *s++ = '/';
> > +                             goto skip_loc;;
> > +                     }
> > +                     if ( str && (s = strchr(name+dirlen+1, '_')) && (s
> < strchr(name+dirlen+1, '/')) ) {
> > +                             if (str = strchr(locname, '@')) {
> > +                                     loclen += locname - str;
> > +                                     memcpy(s, str, loclen);
> > +                                     s[loclen] = '/';
> > +                                     s += loclen + 1;
> > +                                     str = 0;
> > +                                     goto skip_loc;
> > +                             } else {
> > +                                     *s++ = '/';
> > +                                     goto skip_loc;
> > +                             }
> > +                     }
> > +                     goto notrans;
> > +             }
>
> Using snprintf should also make it easy to get rid of the goto/retry
> logic here, perhaps even with a 4-iteration loop and array of which
> format modifications happen on each iteration.
>
> >               p = calloc(sizeof *p + namelen + 1, 1);
> >               if (!p) {
> >                       __munmap((void *)map, map_size);
> >                       goto notrans;
> > @@ -209,7 +209,6 @@
> >               }
> >               p->map = map;
> >               p->map_size = map_size;
> > -             memcpy(p->name, name, namelen+1);
> >               do {
> >                       old_cats = cats;
> >                       p->next = old_cats;
> > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000
> > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000
> > @@ -49,8 +49,8 @@
> >       }
> >
> >       /* Limit name length and forbid leading dot or any slashes. */
> > -     for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++);
> > -     if (val[0]=='.' || val[n]) val = "C.UTF-8";
> > +     for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' &&
> val[n]!='.'; n++);
> > +     if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8";
> >       int builtin = (val[0]=='C' && !val[1])
> >               || !strcmp(val, "C.UTF-8")
> >               || !strcmp(val, "POSIX");
>
> This looks ok but might still need some tweaks. Should an input like
> "zh_CN.GBK" get treated as "zh_CN" (thus outputting UTF-8 that might
> appear as junk on the user's terminal) or as "C" (no localization)
> with only ASCII characters (safe for the user's terminal), or even
> cause setlocale to fail and return an error so that the application
> can decide what to do? These are not technical comments on your patch
> but policy matters the community should weigh in on.
>
> Rich
>

[-- Attachment #1.2: Type: text/html, Size: 12745 bytes --]

[-- Attachment #2: locale.diff --]
[-- Type: text/plain, Size: 3236 bytes --]

--- a/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000 
+++ b/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000
@@ -100,7 +100,9 @@
 	size_t map_size;
 	void *volatile plural_rule;
 	volatile int nplurals;
-	char name[];
+	struct binding *binding;
+	struct __locale_map *lm;
+	struct msgcat cat;
 };
 
 static char *dummy_gettextdomain()
@@ -120,8 +122,8 @@
 	struct msgcat *p;
 	struct __locale_struct *loc = CURRENT_LOCALE;
 	const struct __locale_map *lm;
-	const char *dirname, *locname, *catname;
-	size_t dirlen, loclen, catlen, domlen;
+	size_t domlen;
+	struct binding *q;
 
 	if ((unsigned)category >= LC_ALL) goto notrans;
 
@@ -130,47 +132,62 @@
 	domlen = strnlen(domainname, NAME_MAX+1);
 	if (domlen > NAME_MAX) goto notrans;
 
-	dirname = gettextdir(domainname, &dirlen);
-	if (!dirname) goto notrans;
+	for (q=bindings; q; q=q->next)
+		if (!strcmp(q->domainname, domainname) && q->active)
+			break;
+	if (!q) goto notrans;
 
 	lm = loc->cat[category];
 	if (!lm) {
 notrans:
 		return (char *) ((n == 1) ? msgid1 : msgid2);
 	}
-	locname = lm->name;
-
-	catname = catnames[category];
-	catlen = catlens[category];
-	loclen = strlen(locname);
-
-	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
-	char name[namelen+1], *s = name;
-
-	memcpy(s, dirname, dirlen);
-	s[dirlen] = '/';
-	s += dirlen + 1;
-	memcpy(s, locname, loclen);
-	s[loclen] = '/';
-	s += loclen + 1;
-	memcpy(s, catname, catlen);
-	s[catlen] = '/';
-	s += catlen + 1;
-	memcpy(s, domainname, domlen);
-	s[domlen] = '.';
-	s[domlen+1] = 'm';
-	s[domlen+2] = 'o';
-	s[domlen+3] = 0;
 
 	for (p=cats; p; p=p->next)
-		if (!strcmp(p->name, name))
+		if (p->binding == q && p->lm == lm && p->cat == category)
 			break;
 
 	if (!p) {
+		const char *dirname, *locname, *catname;
+		size_t dirlen, loclen, catlen;
 		void *old_cats;
 		size_t map_size;
-		const void *map = __map_file(name, &map_size);
+
+		dirname = q->dirname;
+		locname = lm->name;
+		catname = catnames[category];
+
+		dirlen = q->dirlen;
+		loclen = strlen(locname);
+		catlen = catlens[category];
+
+		size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
+		char name[namelen+1];
+		char locbuf[loclen+1], *locp = locbuf;
+		const void *map;
+
+		memcpy(locbuf, locname, loclen);
+		locbuf[loclen] = 0;
+
+		for (;;) {
+			snprintf(name, namelen+1, "%s/%s/%s/%s.mo\0", dirname, locbuf, catname, domainname);
+			if (map = __map_file(name, &map_size)) break;
+
+			if (locp = strchr(locbuf, '@')) {
+				*locp = 0;
+				locbuf[loclen] = '@';
+			} else if (locp = strchr(locbuf, '_')) {
+				if (locbuf[loclen] == '@') {
+					locbuf[loclen] = 0;
+					*locp = '@';
+					strcat(locp+1, locbuf + strlen(locbuf) + 1);
+				} else *locp = 0;
+			} else {
+				break;
+			}
+		}
 		if (!map) goto notrans;
+
 		p = calloc(sizeof *p + namelen + 1, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
@@ -178,7 +195,6 @@
 		}
 		p->map = map;
 		p->map_size = map_size;
-		memcpy(p->name, name, namelen+1);
 		do {
 			old_cats = cats;
 			p->next = old_cats;
@@ -193,6 +193,9 @@
 			__munmap((void *)map, map_size);
 			goto notrans;
 		}
+		p->cat = category;
+		p->binding = q;
+		p->lm = lm;
 		p->map = map;
 		p->map_size = map_size;
 		do {

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-12  6:56                           ` He X
@ 2017-02-12  7:11                             ` He X
  2017-02-13 17:08                             ` Rich Felker
  1 sibling, 0 replies; 38+ messages in thread
From: He X @ 2017-02-12  7:11 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 9460 bytes --]

sry, i uploaded the wront patch, it should be int cat, rather than struct
msgcat :/

2017-02-12 14:56 GMT+08:00 He X <xw897002528@gmail.com>:

> 1. cat is added to the keys, also do a validate
> 2. so we what do we deal with the gettextdir() exactly? inline it or
> construct a gettextpointer()?
> 3. i added a extra locbuf array, and goto is replaced by a loop, memcpy is
> replaced by snprintf, compiled, and working well with fcitx
> 4. i just found that i forgot to store the keys to new buffer, it's ok to
> just use normal expression? or we need atomic operations?
> ```
> + p->cat = category;
> + p->binding = q;
> + p->lm = lm;
> ```
> 5.  I do want to rewrite all to .UTF8, but it's a bit annoying as your
> words, then i changed the code to simply strip.
>
> >  (safe for the user's terminal)
> LANG is set by users who are using musl and it's modified to zh_CN at
> setlocale(), app will use UTF8 directly, there's no such situation where
> charset will cause troubles to users' terminal, except apps which get the
> LANG manually by getenv(). I have not seen such strange applications so
> far, and most apps only have the UTF8 translation files.
>
> For moving from glibc to musl, i think doing this way is good for now, we
> could delete it later, or just keep it forever. And most people won't use
> non-UTF8 at all, if they do use GBK, their app will even fallback to UTF8,
> because no translation files for GBK. So, it's not so dagerous, i think
> :).
>
> And for developers,  they should not use setlocale to detect the charset,
> this is wrong, nl_langinfo is the correct way. If they use, stripping will
> let their app know something went wrong.
>
> Strip .GBK or .UTF-8, so users would be happy that their old settings are
> working, developers will notice their mistakes that using setlocale() to
> validate charset is wrong. We get a lot more than failing the setlocale()
> and return C, the only bad thing is we need to care about a almost
> impossible event: an app directly getenv().
>
> 2017-02-12 10:34 GMT+08:00 Rich Felker <dalias@libc.org>:
>
>> On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote:
>> > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000
>> > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000
>> > @@ -100,7 +100,8 @@
>> >       size_t map_size;
>> >       void *volatile plural_rule;
>> >       volatile int nplurals;
>> > -     char name[];
>> > +     struct binding *binding;
>> > +     struct __locale_map *lm;
>> >  };
>>
>> As stated in the reply to message body, I think you need the category
>> in the keying too, because there can be different .mo files loaded
>> depending on which category was requested.
>>
>> >  static char *dummy_gettextdomain()
>> > @@ -120,58 +122,87 @@
>> >       struct msgcat *p;
>> >       struct __locale_struct *loc = CURRENT_LOCALE;
>> >       const struct __locale_map *lm;
>> > -     const char *dirname, *locname, *catname;
>> > -     size_t dirlen, loclen, catlen, domlen;
>> > +     size_t domlen;
>> > +     struct binding *q;
>> >
>> >       if ((unsigned)category >= LC_ALL) goto notrans;
>> >
>> >       if (!domainname) domainname = __gettextdomain();
>> >
>> >       domlen = strnlen(domainname, NAME_MAX+1);
>> >       if (domlen > NAME_MAX) goto notrans;
>> >
>> > -     dirname = gettextdir(domainname, &dirlen);
>> > -     if (!dirname) goto notrans;
>> > +     for (q=bindings; q; q=q->next)
>> > +             if (!strcmp(q->domainname, domainname) && q->active)
>> > +                     break;
>> > +     if (!q) goto notrans;
>>
>> Looks ok. I had said this should be a function but it really doesn't
>> need to be; it's plenty simple inline.
>>
>> >       lm = loc->cat[category];
>> >       if (!lm) {
>> >  notrans:
>> >               return (char *) ((n == 1) ? msgid1 : msgid2);
>> >       }
>> > -     locname = lm->name;
>> > -
>> > -     catname = catnames[category];
>> > -     catlen = catlens[category];
>> > -     loclen = strlen(locname);
>> > -
>> > -     size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
>> > -     char name[namelen+1], *s = name;
>> > -
>> > -     memcpy(s, dirname, dirlen);
>> > -     s[dirlen] = '/';
>> > -     s += dirlen + 1;
>> > -     memcpy(s, locname, loclen);
>> > -     s[loclen] = '/';
>> > -     s += loclen + 1;
>> > -     memcpy(s, catname, catlen);
>> > -     s[catlen] = '/';
>> > -     s += catlen + 1;
>> > -     memcpy(s, domainname, domlen);
>> > -     s[domlen] = '.';
>> > -     s[domlen+1] = 'm';
>> > -     s[domlen+2] = 'o';
>> > -     s[domlen+3] = 0;
>> >
>> >       for (p=cats; p; p=p->next)
>> > -             if (!strcmp(p->name, name))
>> > +             if (p->binding == q && p->lm == lm)
>> >                       break;
>>
>> && p->cat == category
>>
>> >       if (!p) {
>> > +             const char *dirname, *locname, *catname;
>> > +             size_t dirlen, loclen, catlen;
>> >               void *old_cats;
>> >               size_t map_size;
>> > +
>> > +             dirname = q->dirname;
>> > +             locname = lm->name;
>> > +             catname = catnames[category];
>> > +
>> > +             dirlen = q->dirlen;
>> > +             loclen = strlen(locname);
>> > +             catlen = catlens[category];
>>
>> Now that these are only computed once rather than per-call, optimizing
>> out strlen is probably not worthwhile anymore, but it doesn't really
>> hurt either. Not something you need to change, just a comment.
>>
>> > +
>> > +             size_t namelen = dirlen+1 + loclen+1 + catlen+1 +
>> domlen+3;
>> > +             char name[namelen+1], *s = name;
>> > +             char *str = name;
>> > +
>> > +             memcpy(s, dirname, dirlen);
>> > +             s[dirlen] = '/';
>> > +             s += dirlen + 1;
>> > +             memcpy(s, locname, loclen);
>> > +             s[loclen] = '/';
>> > +             s += loclen + 1;
>> > +skip_loc:
>> > +             memcpy(s, catname, catlen);
>> > +             s[catlen] = '/';
>> > +             s += catlen + 1;
>> > +             memcpy(s, domainname, domlen);
>> > +             s[domlen] = '.';
>> > +             s[domlen+1] = 'm';
>> > +             s[domlen+2] = 'o';
>> > +             s[domlen+3] = 0;
>>
>> Actually, now that this code is not a hot path, it should just be
>> using snprintf to construct the pathname, I think. It would be a lot
>> simpler and easier to ensure correctness.
>>
>> > +
>> >               const void *map = __map_file(name, &map_size);
>> > -             if (!map) goto notrans;
>> > +             if (!map) {
>> > +                     if (s = strchr(name+dirlen+1, '@')) {
>> > +                             *s++ = '/';
>> > +                             goto skip_loc;;
>> > +                     }
>> > +                     if ( str && (s = strchr(name+dirlen+1, '_')) &&
>> (s < strchr(name+dirlen+1, '/')) ) {
>> > +                             if (str = strchr(locname, '@')) {
>> > +                                     loclen += locname - str;
>> > +                                     memcpy(s, str, loclen);
>> > +                                     s[loclen] = '/';
>> > +                                     s += loclen + 1;
>> > +                                     str = 0;
>> > +                                     goto skip_loc;
>> > +                             } else {
>> > +                                     *s++ = '/';
>> > +                                     goto skip_loc;
>> > +                             }
>> > +                     }
>> > +                     goto notrans;
>> > +             }
>>
>> Using snprintf should also make it easy to get rid of the goto/retry
>> logic here, perhaps even with a 4-iteration loop and array of which
>> format modifications happen on each iteration.
>>
>> >               p = calloc(sizeof *p + namelen + 1, 1);
>> >               if (!p) {
>> >                       __munmap((void *)map, map_size);
>> >                       goto notrans;
>> > @@ -209,7 +209,6 @@
>> >               }
>> >               p->map = map;
>> >               p->map_size = map_size;
>> > -             memcpy(p->name, name, namelen+1);
>> >               do {
>> >                       old_cats = cats;
>> >                       p->next = old_cats;
>> > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000
>> > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000
>> > @@ -49,8 +49,8 @@
>> >       }
>> >
>> >       /* Limit name length and forbid leading dot or any slashes. */
>> > -     for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++);
>> > -     if (val[0]=='.' || val[n]) val = "C.UTF-8";
>> > +     for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' &&
>> val[n]!='.'; n++);
>> > +     if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8";
>> >       int builtin = (val[0]=='C' && !val[1])
>> >               || !strcmp(val, "C.UTF-8")
>> >               || !strcmp(val, "POSIX");
>>
>> This looks ok but might still need some tweaks. Should an input like
>> "zh_CN.GBK" get treated as "zh_CN" (thus outputting UTF-8 that might
>> appear as junk on the user's terminal) or as "C" (no localization)
>> with only ASCII characters (safe for the user's terminal), or even
>> cause setlocale to fail and return an error so that the application
>> can decide what to do? These are not technical comments on your patch
>> but policy matters the community should weigh in on.
>>
>> Rich
>>
>
>

[-- Attachment #2: Type: text/html, Size: 13352 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-12  2:34                         ` Rich Felker
  2017-02-12  6:56                           ` He X
@ 2017-02-13  8:01                           ` He X
  2017-02-13 13:28                             ` Rich Felker
  1 sibling, 1 reply; 38+ messages in thread
From: He X @ 2017-02-13  8:01 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 7765 bytes --]

New find, as you can see, zh_CN is different from zh_CN.UTF-8, it's GBK
codeset, we can't strip .UTF-8 easily, or we will get a lot of junk:
```
[xhe@xhe-PC ~]$ ls /share/vim/lang/zh_
zh_CN/       zh_CN.UTF-8/ zh_CN.cp936/ zh_TW/       zh_TW.UTF-8/
```
I add this to the loop, and delete strip in setlocale:
+ if (locp = strchr(locbuf, '.')) {
+ *locp = 0;
+ } else if (locp = strchr(locbuf, '@')) {
Now i think we should just fail non-UTF8 codeset in setlocale for safe, and
check the codeset before map that file.

2017-02-12 10:34 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sat, Feb 11, 2017 at 02:00:56PM +0800, He X wrote:
> > --- a/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000
> > +++ b/src/locale/dcngettext.c 2017-02-06 14:39:17.860482624 +0000
> > @@ -100,7 +100,8 @@
> >       size_t map_size;
> >       void *volatile plural_rule;
> >       volatile int nplurals;
> > -     char name[];
> > +     struct binding *binding;
> > +     struct __locale_map *lm;
> >  };
>
> As stated in the reply to message body, I think you need the category
> in the keying too, because there can be different .mo files loaded
> depending on which category was requested.
>
> >  static char *dummy_gettextdomain()
> > @@ -120,58 +122,87 @@
> >       struct msgcat *p;
> >       struct __locale_struct *loc = CURRENT_LOCALE;
> >       const struct __locale_map *lm;
> > -     const char *dirname, *locname, *catname;
> > -     size_t dirlen, loclen, catlen, domlen;
> > +     size_t domlen;
> > +     struct binding *q;
> >
> >       if ((unsigned)category >= LC_ALL) goto notrans;
> >
> >       if (!domainname) domainname = __gettextdomain();
> >
> >       domlen = strnlen(domainname, NAME_MAX+1);
> >       if (domlen > NAME_MAX) goto notrans;
> >
> > -     dirname = gettextdir(domainname, &dirlen);
> > -     if (!dirname) goto notrans;
> > +     for (q=bindings; q; q=q->next)
> > +             if (!strcmp(q->domainname, domainname) && q->active)
> > +                     break;
> > +     if (!q) goto notrans;
>
> Looks ok. I had said this should be a function but it really doesn't
> need to be; it's plenty simple inline.
>
> >       lm = loc->cat[category];
> >       if (!lm) {
> >  notrans:
> >               return (char *) ((n == 1) ? msgid1 : msgid2);
> >       }
> > -     locname = lm->name;
> > -
> > -     catname = catnames[category];
> > -     catlen = catlens[category];
> > -     loclen = strlen(locname);
> > -
> > -     size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> > -     char name[namelen+1], *s = name;
> > -
> > -     memcpy(s, dirname, dirlen);
> > -     s[dirlen] = '/';
> > -     s += dirlen + 1;
> > -     memcpy(s, locname, loclen);
> > -     s[loclen] = '/';
> > -     s += loclen + 1;
> > -     memcpy(s, catname, catlen);
> > -     s[catlen] = '/';
> > -     s += catlen + 1;
> > -     memcpy(s, domainname, domlen);
> > -     s[domlen] = '.';
> > -     s[domlen+1] = 'm';
> > -     s[domlen+2] = 'o';
> > -     s[domlen+3] = 0;
> >
> >       for (p=cats; p; p=p->next)
> > -             if (!strcmp(p->name, name))
> > +             if (p->binding == q && p->lm == lm)
> >                       break;
>
> && p->cat == category
>
> >       if (!p) {
> > +             const char *dirname, *locname, *catname;
> > +             size_t dirlen, loclen, catlen;
> >               void *old_cats;
> >               size_t map_size;
> > +
> > +             dirname = q->dirname;
> > +             locname = lm->name;
> > +             catname = catnames[category];
> > +
> > +             dirlen = q->dirlen;
> > +             loclen = strlen(locname);
> > +             catlen = catlens[category];
>
> Now that these are only computed once rather than per-call, optimizing
> out strlen is probably not worthwhile anymore, but it doesn't really
> hurt either. Not something you need to change, just a comment.
>
> > +
> > +             size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> > +             char name[namelen+1], *s = name;
> > +             char *str = name;
> > +
> > +             memcpy(s, dirname, dirlen);
> > +             s[dirlen] = '/';
> > +             s += dirlen + 1;
> > +             memcpy(s, locname, loclen);
> > +             s[loclen] = '/';
> > +             s += loclen + 1;
> > +skip_loc:
> > +             memcpy(s, catname, catlen);
> > +             s[catlen] = '/';
> > +             s += catlen + 1;
> > +             memcpy(s, domainname, domlen);
> > +             s[domlen] = '.';
> > +             s[domlen+1] = 'm';
> > +             s[domlen+2] = 'o';
> > +             s[domlen+3] = 0;
>
> Actually, now that this code is not a hot path, it should just be
> using snprintf to construct the pathname, I think. It would be a lot
> simpler and easier to ensure correctness.
>
> > +
> >               const void *map = __map_file(name, &map_size);
> > -             if (!map) goto notrans;
> > +             if (!map) {
> > +                     if (s = strchr(name+dirlen+1, '@')) {
> > +                             *s++ = '/';
> > +                             goto skip_loc;;
> > +                     }
> > +                     if ( str && (s = strchr(name+dirlen+1, '_')) && (s
> < strchr(name+dirlen+1, '/')) ) {
> > +                             if (str = strchr(locname, '@')) {
> > +                                     loclen += locname - str;
> > +                                     memcpy(s, str, loclen);
> > +                                     s[loclen] = '/';
> > +                                     s += loclen + 1;
> > +                                     str = 0;
> > +                                     goto skip_loc;
> > +                             } else {
> > +                                     *s++ = '/';
> > +                                     goto skip_loc;
> > +                             }
> > +                     }
> > +                     goto notrans;
> > +             }
>
> Using snprintf should also make it easy to get rid of the goto/retry
> logic here, perhaps even with a 4-iteration loop and array of which
> format modifications happen on each iteration.
>
> >               p = calloc(sizeof *p + namelen + 1, 1);
> >               if (!p) {
> >                       __munmap((void *)map, map_size);
> >                       goto notrans;
> > @@ -209,7 +209,6 @@
> >               }
> >               p->map = map;
> >               p->map_size = map_size;
> > -             memcpy(p->name, name, namelen+1);
> >               do {
> >                       old_cats = cats;
> >                       p->next = old_cats;
> > --- a/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000
> > +++ b/src/locale/locale_map.c 2017-02-06 14:39:17.797148750 +0000
> > @@ -49,8 +49,8 @@
> >       }
> >
> >       /* Limit name length and forbid leading dot or any slashes. */
> > -     for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/'; n++);
> > -     if (val[0]=='.' || val[n]) val = "C.UTF-8";
> > +     for (n=0; n<LOCALE_NAME_MAX && val[n] && val[n]!='/' &&
> val[n]!='.'; n++);
> > +     if (val[0]=='.' || (val[n] && val[n]!='.')) val = "C.UTF-8";
> >       int builtin = (val[0]=='C' && !val[1])
> >               || !strcmp(val, "C.UTF-8")
> >               || !strcmp(val, "POSIX");
>
> This looks ok but might still need some tweaks. Should an input like
> "zh_CN.GBK" get treated as "zh_CN" (thus outputting UTF-8 that might
> appear as junk on the user's terminal) or as "C" (no localization)
> with only ASCII characters (safe for the user's terminal), or even
> cause setlocale to fail and return an error so that the application
> can decide what to do? These are not technical comments on your patch
> but policy matters the community should weigh in on.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 11088 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-13  8:01                           ` He X
@ 2017-02-13 13:28                             ` Rich Felker
  2017-02-13 14:06                               ` He X
  2017-02-13 14:12                               ` He X
  0 siblings, 2 replies; 38+ messages in thread
From: Rich Felker @ 2017-02-13 13:28 UTC (permalink / raw)
  To: musl

On Mon, Feb 13, 2017 at 04:01:31PM +0800, He X wrote:
> New find, as you can see, zh_CN is different from zh_CN.UTF-8, it's GBK
> codeset, we can't strip .UTF-8 easily, or we will get a lot of junk:

That's on glibc; your "finding" is irrelevant to musl, where the
encoding for all locales is UTF-8.

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-13 13:28                             ` Rich Felker
@ 2017-02-13 14:06                               ` He X
  2017-02-13 17:12                                 ` Rich Felker
  2017-02-13 14:12                               ` He X
  1 sibling, 1 reply; 38+ messages in thread
From: He X @ 2017-02-13 14:06 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]

no, it's on musl, i just tested it with my patches, with vim, stripping
will lead to unknown characters.
I mean, .mo files under zh_CN/ of vim is GBK set, while zh_CN/ of other
apps is UTF-8 set, that meas there may be other apps like vim, we should be
more cautious, add a check before map the .mo files, and fail non-UTF8 set
in setlocale.

Btw, _nl_msg_cat_cntr & _nl_domain_bindings will block apps compiling with
the native intl of musl, and after i added a dump for these two symbols,
gnu tar showed me segfaults, because he passed a zero msgid1 causing
__mo_lookup segfault, we should add a check in dcngettext to avoid it(if
(!msgid1) goto notrans;):

 #2  0x00007ffff7d82a6f in dcngettext (domainname=0x6737a0 "tar",
msgid1=0x0, msgid2=0x0, n=1,
    category=5) at src/locale/dcngettext.c:211


2017-02-13 21:28 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Mon, Feb 13, 2017 at 04:01:31PM +0800, He X wrote:
> > New find, as you can see, zh_CN is different from zh_CN.UTF-8, it's GBK
> > codeset, we can't strip .UTF-8 easily, or we will get a lot of junk:
>
> That's on glibc; your "finding" is irrelevant to musl, where the
> encoding for all locales is UTF-8.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 3048 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-13 13:28                             ` Rich Felker
  2017-02-13 14:06                               ` He X
@ 2017-02-13 14:12                               ` He X
  2017-02-13 17:13                                 ` Rich Felker
  1 sibling, 1 reply; 38+ messages in thread
From: He X @ 2017-02-13 14:12 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 654 bytes --]

I dont know how, but it's indeed GBK, even with musl, vim indeed generated
GBK set files, maybe it's because im using gnu
gettext(without-included-gettext). I think we should avoid this issue
depending on a check of libc, rather than assuming all .mo files are UTF-8
set.

2017-02-13 21:28 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Mon, Feb 13, 2017 at 04:01:31PM +0800, He X wrote:
> > New find, as you can see, zh_CN is different from zh_CN.UTF-8, it's GBK
> > codeset, we can't strip .UTF-8 easily, or we will get a lot of junk:
>
> That's on glibc; your "finding" is irrelevant to musl, where the
> encoding for all locales is UTF-8.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 1084 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-12  6:56                           ` He X
  2017-02-12  7:11                             ` He X
@ 2017-02-13 17:08                             ` Rich Felker
  1 sibling, 0 replies; 38+ messages in thread
From: Rich Felker @ 2017-02-13 17:08 UTC (permalink / raw)
  To: musl

On Sun, Feb 12, 2017 at 02:56:53PM +0800, He X wrote:
> 1. cat is added to the keys, also do a validate
> 2. so we what do we deal with the gettextdir() exactly? inline it or
> construct a gettextpointer()?
> 3. i added a extra locbuf array, and goto is replaced by a loop, memcpy is
> replaced by snprintf, compiled, and working well with fcitx

I haven't verified the loop logic yet but on a high level it looks
correct.

> 4. i just found that i forgot to store the keys to new buffer, it's ok to
> just use normal expression? or we need atomic operations?
> ```
> + p->cat = category;
> + p->binding = q;
> + p->lm = lm;
> ```

This is fine since the new msgcat is not visible to other threads
until it's installed with an atomic, which makes all previous writes
visible. I do want to rework this all with a lock structure rather
than atomics but that's a separate project.

> 5.  I do want to rewrite all to .UTF8, but it's a bit annoying as your
> words, then i changed the code to simply strip.

Since this part is separate and there seems to be disagreement about
what it should do, let's separate it from the issue at hand; it's
really a separate change from making gettext do proper fallbacks
anyway.

> >  (safe for the user's terminal)
> LANG is set by users who are using musl and it's modified to zh_CN at
> setlocale(), app will use UTF8 directly, there's no such situation where
> charset will cause troubles to users' terminal, except apps which get the
> LANG manually by getenv(). I have not seen such strange applications so
> far, and most apps only have the UTF8 translation files.
> 
> For moving from glibc to musl, i think doing this way is good for now, we
> could delete it later, or just keep it forever. And most people won't use
> non-UTF8 at all, if they do use GBK, their app will even fallback to UTF8,
> because no translation files for GBK. So, it's not so dagerous, i think :).

The main considerations are:

1. what happens when a glibc user ssh's into a musl-based system
2. what happens when a musl user ssh's into a glibc-based system
3. what happens when running musl binaries on a glibc-based system

For #1 and #3, it's desirable for musl to accept ".UTF-8" in the
locale name, and for #2, users may desire to have ".UTF-8" in their
LC_* env vars so that remote glibc programs behave correctly.

For #1 and #3, if a glibc uses is using a legacy non-UTF-8 locale and
runs a musl program, they're either going to get messed-up output or
ASCII-only, depending on decisions we make and/or what their locale
value is. These are not really important since legacy encodings are
not supported, but it might be nice to make least-bad.

If the user has a locale name like "fr_FR" or "zh_CN" that, that's
going to be interpreted differently by musl vs glibc; that was already
decided a long time ago in the interest of designing around the future
rather than broken legacy stuff. But if the locale name is explicitly
non-UTF-8 like "zh_CN.GBK", we could opt to reject it without breaking
anything, and this may give users better feedback about what's going
wrong if they have such settings when ssh'ing into a musl-based
system.

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-13 14:06                               ` He X
@ 2017-02-13 17:12                                 ` Rich Felker
  2017-03-04  8:02                                   ` He X
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-02-13 17:12 UTC (permalink / raw)
  To: musl

On Mon, Feb 13, 2017 at 10:06:49PM +0800, He X wrote:
> no, it's on musl, i just tested it with my patches, with vim, stripping
> will lead to unknown characters.

That's not a matter of the locale being non-UTF-8 (it's UTF-8) but of
the application doing something broken. The locale is UTF-8 because
nl_langinfo(CODESET) says it is and because mb/wc conversion functions
process UTF-8. That's what it means for the locale to be UTF-8.

> I mean, .mo files under zh_CN/ of vim is GBK set, while zh_CN/ of other
> apps is UTF-8 set, that meas there may be other apps like vim, we should be
> more cautious, add a check before map the .mo files, and fail non-UTF8 set
> in setlocale.

All musl locale files are required to be UTF-8. If an application has
translation files that are not UTF-8, they're not usable. This could
be fixed in the application or by using a fixed version of msgfmt that
converts to UTF-8 before producing the .mo file.

> Btw, _nl_msg_cat_cntr & _nl_domain_bindings will block apps compiling with
> the native intl of musl, and after i added a dump for these two symbols,

The autoconf text for gettext is supposed to be getting fixed not to
do that anymore, but I'm not sure what the progress on upstreaming it
is.

> gnu tar showed me segfaults, because he passed a zero msgid1 causing
> __mo_lookup segfault, we should add a check in dcngettext to avoid it(if
> (!msgid1) goto notrans;):
> 
>  #2  0x00007ffff7d82a6f in dcngettext (domainname=0x6737a0 "tar",
> msgid1=0x0, msgid2=0x0, n=1,
>     category=5) at src/locale/dcngettext.c:211

Is it expecting gettext to return a null pointer in this case, or to
return something else (like the "header", i.e. the translation of "")?
I think it's acceptable to change this behavior as long as we do it
right, but it should be a separate patch since it's an independent
change.

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-13 14:12                               ` He X
@ 2017-02-13 17:13                                 ` Rich Felker
  0 siblings, 0 replies; 38+ messages in thread
From: Rich Felker @ 2017-02-13 17:13 UTC (permalink / raw)
  To: musl

On Mon, Feb 13, 2017 at 10:12:13PM +0800, He X wrote:
> I dont know how, but it's indeed GBK, even with musl, vim indeed generated
> GBK set files, maybe it's because im using gnu
> gettext(without-included-gettext). I think we should avoid this issue
> depending on a check of libc, rather than assuming all .mo files are UTF-8
> set.

Support for non-UTF-8 .mo files won't be added. It means wasting lots
of memory per-process to keep converted copies in-core rather than
mapping from disk. msgfmt just needs to be fixed not to produce
non-UTF-8 output. (It also needs to be fixed to handle SYSDEP strings
correctly.)

Rich


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-02-13 17:12                                 ` Rich Felker
@ 2017-03-04  8:02                                   ` He X
  2017-03-17 19:27                                     ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: He X @ 2017-03-04  8:02 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 4411 bytes --]

OK, i am busy on school these days. I read the mailing lists again, and i
clean up. These are all remaining issues we need to solve since previous
discussion:

1. about zero msgid1, i can prove that glibc will fallback to no
translations. It's equal to printf(""), so this should be ok:
@@ -120,8 +122,9 @@
+ if (!msgid1) goto notrans;

> but it should be a separate patch since it's an independent
change.
(added in the head of dcngettext(), ill send a new standalone mail for
this, but it's also included in this patch, be careful)

2.
>But if the locale name is explicitly
non-UTF-8 like "zh_CN.GBK", we could opt to reject it without breaking
anything, and this may give users better feedback about what's going
wrong if they have such settings when ssh'ing into a musl-based
system.

About the .GBK(and any other non-UTF8 charsets), i ignore them by treating
them as C.UTF-8, do we need to be more strict?
--- musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000
+++ musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000
@@ -46,7 +46,8 @@
  if (val[0]=='.' || val[n]) val = "C.UTF-8";
  int builtin = (val[0]=='C' && !val[1])
  || !strcmp(val, "C.UTF-8")
- || !strcmp(val, "POSIX");
+ || !strcmp(val, "POSIX")
+ || strcmp(__strchrnul(val, '.'), ".UTF-8");

  if (builtin) {
  if (cat == LC_CTYPE && val[1]=='.')
3.
>The autoconf text for gettext is supposed to be getting fixed not to
do that anymore, but I'm not sure what the progress on upstreaming it
is.
It's just a workaround before they handle it, and i am not going to change
anything in musl, just a description. I only patched myself.

4.
 > Support for non-UTF-8 .mo files won't be added.
> msgfmt just needs to be fixed not to produce
non-UTF-8 output.

I agreed with you, so then i hope '.UTF-8' could be kept. Rather than
stripping it as i thought before, '.UTF-8' should be kept until the code
went into dcngettext(). What if the .mo files are downloaded from www? Or
what if it's pre-generated in the releases of programs? (i guess that's why
vim gave me GBK set, it must be pre-generated)

And even with msgfmt generating UTF-8 outputs, what if programs still name
the dir as zh_CN.UTF-8 instead of simply zh_CN? You can't say it's wrong,
right? It's their preference how to name it.

It's necessary for those who have a full name like zh_CN.UTF-8 instead of
zh_CN. This's what i am trying to express now.

2017-02-14 1:12 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Mon, Feb 13, 2017 at 10:06:49PM +0800, He X wrote:
> > no, it's on musl, i just tested it with my patches, with vim, stripping
> > will lead to unknown characters.
>
> That's not a matter of the locale being non-UTF-8 (it's UTF-8) but of
> the application doing something broken. The locale is UTF-8 because
> nl_langinfo(CODESET) says it is and because mb/wc conversion functions
> process UTF-8. That's what it means for the locale to be UTF-8.
>
> > I mean, .mo files under zh_CN/ of vim is GBK set, while zh_CN/ of other
> > apps is UTF-8 set, that meas there may be other apps like vim, we should
> be
> > more cautious, add a check before map the .mo files, and fail non-UTF8
> set
> > in setlocale.
>
> All musl locale files are required to be UTF-8. If an application has
> translation files that are not UTF-8, they're not usable. This could
> be fixed in the application or by using a fixed version of msgfmt that
> converts to UTF-8 before producing the .mo file.
>
> > Btw, _nl_msg_cat_cntr & _nl_domain_bindings will block apps compiling
> with
> > the native intl of musl, and after i added a dump for these two symbols,
>
> The autoconf text for gettext is supposed to be getting fixed not to
> do that anymore, but I'm not sure what the progress on upstreaming it
> is.
>
> > gnu tar showed me segfaults, because he passed a zero msgid1 causing
> > __mo_lookup segfault, we should add a check in dcngettext to avoid it(if
> > (!msgid1) goto notrans;):
> >
> >  #2  0x00007ffff7d82a6f in dcngettext (domainname=0x6737a0 "tar",
> > msgid1=0x0, msgid2=0x0, n=1,
> >     category=5) at src/locale/dcngettext.c:211
>
> Is it expecting gettext to return a null pointer in this case, or to
> return something else (like the "header", i.e. the translation of "")?
> I think it's acceptable to change this behavior as long as we do it
> right, but it should be a separate patch since it's an independent
> change.
>
> Rich
>

[-- Attachment #1.2: Type: text/html, Size: 12499 bytes --]

[-- Attachment #2: locale.diff --]
[-- Type: text/plain, Size: 3636 bytes --]

--- a/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000 
+++ b/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000
@@ -100,7 +100,9 @@
 	size_t map_size;
 	void *volatile plural_rule;
 	volatile int nplurals;
-	char name[];
+	struct binding *binding;
+	struct __locale_map *lm;
+	int cat;
 };
 
 static char *dummy_gettextdomain()
@@ -120,8 +122,9 @@
+	if (!msgid1) goto notrans;
 	struct msgcat *p;
 	struct __locale_struct *loc = CURRENT_LOCALE;
 	const struct __locale_map *lm;
-	const char *dirname, *locname, *catname;
-	size_t dirlen, loclen, catlen, domlen;
+	size_t domlen;
+	struct binding *q;
 
 	if ((unsigned)category >= LC_ALL) goto notrans;
 
@@ -130,47 +132,64 @@
 	domlen = strnlen(domainname, NAME_MAX+1);
 	if (domlen > NAME_MAX) goto notrans;
 
-	dirname = gettextdir(domainname, &dirlen);
-	if (!dirname) goto notrans;
+	for (q=bindings; q; q=q->next)
+		if (!strcmp(q->domainname, domainname) && q->active)
+			break;
+	if (!q) goto notrans;
 
 	lm = loc->cat[category];
 	if (!lm) {
 notrans:
 		return (char *) ((n == 1) ? msgid1 : msgid2);
 	}
-	locname = lm->name;
-
-	catname = catnames[category];
-	catlen = catlens[category];
-	loclen = strlen(locname);
-
-	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
-	char name[namelen+1], *s = name;
-
-	memcpy(s, dirname, dirlen);
-	s[dirlen] = '/';
-	s += dirlen + 1;
-	memcpy(s, locname, loclen);
-	s[loclen] = '/';
-	s += loclen + 1;
-	memcpy(s, catname, catlen);
-	s[catlen] = '/';
-	s += catlen + 1;
-	memcpy(s, domainname, domlen);
-	s[domlen] = '.';
-	s[domlen+1] = 'm';
-	s[domlen+2] = 'o';
-	s[domlen+3] = 0;
 
 	for (p=cats; p; p=p->next)
-		if (!strcmp(p->name, name))
+		if (p->binding == q && p->lm == lm && p->cat == category)
 			break;
 
 	if (!p) {
+		const char *dirname, *locname, *catname;
+		size_t dirlen, loclen, catlen;
 		void *old_cats;
 		size_t map_size;
-		const void *map = __map_file(name, &map_size);
+
+		dirname = q->dirname;
+		locname = lm->name;
+		catname = catnames[category];
+
+		dirlen = q->dirlen;
+		loclen = strlen(locname);
+		catlen = catlens[category];
+
+		size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
+		char name[namelen+1];
+		char locbuf[loclen+1], *locp = locbuf;
+		const void *map;
+
+		memcpy(locbuf, locname, loclen);
+		locbuf[loclen] = 0;
+
+		for (;;) {
+			snprintf(name, namelen+1, "%s/%s/%s/%s.mo\0", dirname, locbuf, catname, domainname);
+			if (map = __map_file(name, &map_size)) break;
+
+			if (locp = strchr(locbuf, '.')) {
+				*locp = 0;
+			} else if (locp = strchr(locbuf, '@')) {
+				*locp = 0;
+				locbuf[loclen] = '@';
+			} else if (locp = strchr(locbuf, '_')) {
+				if (locbuf[loclen] == '@') {
+					locbuf[loclen] = 0;
+					*locp = '@';
+					strcat(locp+1, locbuf + strlen(locbuf) + 1);
+				} else *locp = 0;
+			} else {
+				break;
+			}
+		}
 		if (!map) goto notrans;
+
 		p = calloc(sizeof *p + namelen + 1, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
@@ -178,7 +195,9 @@
 		}
+ 		p->cat = category;
+ 		p->binding = q;
+ 		p->lm = lm;
 		p->map = map;
 		p->map_size = map_size;
-		memcpy(p->name, name, namelen+1);
 		do {
 			old_cats = cats;
 			p->next = old_cats;
--- musl-1.1.16/src/locale/locale_map.c	2017-01-01 03:27:17.000000000 +0000
+++ musl-1.1.16/src/locale/locale_map.c	2017-01-01 03:27:17.000000000 +0000
@@ -46,7 +46,8 @@
 	if (val[0]=='.' || val[n]) val = "C.UTF-8";
 	int builtin = (val[0]=='C' && !val[1])
 		|| !strcmp(val, "C.UTF-8")
-		|| !strcmp(val, "POSIX");
+		|| !strcmp(val, "POSIX")
+		|| strcmp(__strchrnul(val, '.'), ".UTF-8");
 
 	if (builtin) {
 		if (cat == LC_CTYPE && val[1]=='.')

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-03-04  8:02                                   ` He X
@ 2017-03-17 19:27                                     ` Rich Felker
  2017-03-17 19:37                                       ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-03-17 19:27 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 5428 bytes --]

As discussed on IRC, I'm preparing a version that fixes some issues. I
also found a few more problems after you logged off which I'll
describe inline:

On Sat, Mar 04, 2017 at 04:02:58PM +0800, He X wrote:
> --- a/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000 
> +++ b/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000
> @@ -100,7 +100,9 @@
>  	size_t map_size;
>  	void *volatile plural_rule;
>  	volatile int nplurals;
> -	char name[];
> +	struct binding *binding;
> +	struct __locale_map *lm;

This needs to be const struct __locale_map *lm because the value
assigned to it is pointer-to-const.

> +	int cat;
>  };
>  
>  static char *dummy_gettextdomain()
> @@ -120,8 +122,9 @@
> +	if (!msgid1) goto notrans;

This overlaps with the other patch so I'll factor it out.

>  	struct msgcat *p;
>  	struct __locale_struct *loc = CURRENT_LOCALE;
>  	const struct __locale_map *lm;
> -	const char *dirname, *locname, *catname;
> -	size_t dirlen, loclen, catlen, domlen;
> +	size_t domlen;
> +	struct binding *q;
>  
>  	if ((unsigned)category >= LC_ALL) goto notrans;
>  
> @@ -130,47 +132,64 @@
>  	domlen = strnlen(domainname, NAME_MAX+1);
>  	if (domlen > NAME_MAX) goto notrans;
>  
> -	dirname = gettextdir(domainname, &dirlen);
> -	if (!dirname) goto notrans;
> +	for (q=bindings; q; q=q->next)
> +		if (!strcmp(q->domainname, domainname) && q->active)
> +			break;
> +	if (!q) goto notrans;
>  
>  	lm = loc->cat[category];
>  	if (!lm) {
>  notrans:
>  		return (char *) ((n == 1) ? msgid1 : msgid2);
>  	}
> -	locname = lm->name;
> -
> -	catname = catnames[category];
> -	catlen = catlens[category];
> -	loclen = strlen(locname);
> -
> -	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> -	char name[namelen+1], *s = name;
> -
> -	memcpy(s, dirname, dirlen);
> -	s[dirlen] = '/';
> -	s += dirlen + 1;
> -	memcpy(s, locname, loclen);
> -	s[loclen] = '/';
> -	s += loclen + 1;
> -	memcpy(s, catname, catlen);
> -	s[catlen] = '/';
> -	s += catlen + 1;
> -	memcpy(s, domainname, domlen);
> -	s[domlen] = '.';
> -	s[domlen+1] = 'm';
> -	s[domlen+2] = 'o';
> -	s[domlen+3] = 0;
>  
>  	for (p=cats; p; p=p->next)
> -		if (!strcmp(p->name, name))
> +		if (p->binding == q && p->lm == lm && p->cat == category)
>  			break;
>  
>  	if (!p) {
> +		const char *dirname, *locname, *catname;
> +		size_t dirlen, loclen, catlen;
>  		void *old_cats;
>  		size_t map_size;
> -		const void *map = __map_file(name, &map_size);
> +
> +		dirname = q->dirname;
> +		locname = lm->name;
> +		catname = catnames[category];
> +
> +		dirlen = q->dirlen;
> +		loclen = strlen(locname);
> +		catlen = catlens[category];
> +
> +		size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
> +		char name[namelen+1];
> +		char locbuf[loclen+1], *locp = locbuf;
> +		const void *map;
> +
> +		memcpy(locbuf, locname, loclen);
> +		locbuf[loclen] = 0;
> +
> +		for (;;) {
> +			snprintf(name, namelen+1, "%s/%s/%s/%s.mo\0", dirname, locbuf, catname, domainname);
> +			if (map = __map_file(name, &map_size)) break;
> +
> +			if (locp = strchr(locbuf, '.')) {
> +				*locp = 0;

As discussed on irc, .charset suffixes should be dropped before the
loop even begins (never used in pathnames), and they occur before the
@mod, not after it, so the logic for dropping them is different.

> +			} else if (locp = strchr(locbuf, '@')) {
> +				*locp = 0;
> +				locbuf[loclen] = '@';
> +			} else if (locp = strchr(locbuf, '_')) {
> +				if (locbuf[loclen] == '@') {
> +					locbuf[loclen] = 0;
> +					*locp = '@';
> +					strcat(locp+1, locbuf + strlen(locbuf) + 1);

This invocation of strcat is UB because src and dest overlap; you need
memmove.

> +				} else *locp = 0;
> +			} else {
> +				break;
> +			}
> +		}

Aside from the above, I'm concerned what happens when there are an
unexpected mix of _'s and @'s in places you don't expect them to be.
I've worked out an (untested) alternative approach which I think is
cleaner, where the locale name is split up into components before the
loop (in-place, with no locbuf) and the loop controls which parts get
used. This way the logic for how the parts are broken up is static and
not dependent on what happened in previous loop iterations.

>  		if (!map) goto notrans;
> +
>  		p = calloc(sizeof *p + namelen + 1, 1);

namelen space is not needed here anymore since name[] was removed from
the struct.

>  		if (!p) {
>  			__munmap((void *)map, map_size);
> @@ -178,7 +195,9 @@
>  		}
> + 		p->cat = category;
> + 		p->binding = q;
> + 		p->lm = lm;
>  		p->map = map;
>  		p->map_size = map_size;
> -		memcpy(p->name, name, namelen+1);
>  		do {
>  			old_cats = cats;
>  			p->next = old_cats;
> --- musl-1.1.16/src/locale/locale_map.c	2017-01-01 03:27:17.000000000 +0000
> +++ musl-1.1.16/src/locale/locale_map.c	2017-01-01 03:27:17.000000000 +0000
> @@ -46,7 +46,8 @@
>  	if (val[0]=='.' || val[n]) val = "C.UTF-8";
>  	int builtin = (val[0]=='C' && !val[1])
>  		|| !strcmp(val, "C.UTF-8")
> -		|| !strcmp(val, "POSIX");
> +		|| !strcmp(val, "POSIX")
> +		|| strcmp(__strchrnul(val, '.'), ".UTF-8");
>  
>  	if (builtin) {
>  		if (cat == LC_CTYPE && val[1]=='.')

As discussed this is a separate change, and I don't think it's right
as-is. The right change requires adding an error path to the function
(and its callers), which I'll look into next.

Attached is my draft of modifications to your patch. I haven't tested
it yet but it does compile ok.

Rich

[-- Attachment #2: dcngettext.diff --]
[-- Type: text/plain, Size: 3574 bytes --]

diff --git a/src/locale/dcngettext.c b/src/locale/dcngettext.c
index b68e24b..abaa414 100644
--- a/src/locale/dcngettext.c
+++ b/src/locale/dcngettext.c
@@ -100,7 +100,9 @@ struct msgcat {
 	size_t map_size;
 	void *volatile plural_rule;
 	volatile int nplurals;
-	char name[];
+	struct binding *binding;
+	const struct __locale_map *lm;
+	int cat;
 };
 
 static char *dummy_gettextdomain()
@@ -120,8 +122,8 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2,
 	struct msgcat *p;
 	struct __locale_struct *loc = CURRENT_LOCALE;
 	const struct __locale_map *lm;
-	const char *dirname, *locname, *catname;
-	size_t dirlen, loclen, catlen, domlen;
+	size_t domlen;
+	struct binding *q;
 
 	if ((unsigned)category >= LC_ALL) goto notrans;
 
@@ -130,55 +132,76 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2,
 	domlen = strnlen(domainname, NAME_MAX+1);
 	if (domlen > NAME_MAX) goto notrans;
 
-	dirname = gettextdir(domainname, &dirlen);
-	if (!dirname) goto notrans;
+	for (q=bindings; q; q=q->next)
+		if (!strcmp(q->domainname, domainname) && q->active)
+			break;
+	if (!q) goto notrans;
 
 	lm = loc->cat[category];
 	if (!lm) {
 notrans:
 		return (char *) ((n == 1) ? msgid1 : msgid2);
 	}
-	locname = lm->name;
-
-	catname = catnames[category];
-	catlen = catlens[category];
-	loclen = strlen(locname);
-
-	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
-	char name[namelen+1], *s = name;
-
-	memcpy(s, dirname, dirlen);
-	s[dirlen] = '/';
-	s += dirlen + 1;
-	memcpy(s, locname, loclen);
-	s[loclen] = '/';
-	s += loclen + 1;
-	memcpy(s, catname, catlen);
-	s[catlen] = '/';
-	s += catlen + 1;
-	memcpy(s, domainname, domlen);
-	s[domlen] = '.';
-	s[domlen+1] = 'm';
-	s[domlen+2] = 'o';
-	s[domlen+3] = 0;
 
 	for (p=cats; p; p=p->next)
-		if (!strcmp(p->name, name))
+		if (p->binding == q && p->lm == lm && p->cat == category)
 			break;
 
 	if (!p) {
+		const char *dirname, *locname, *catname, *modname, *locp;
+		size_t dirlen, loclen, catlen, modlen, alt_modlen;
 		void *old_cats;
 		size_t map_size;
-		const void *map = __map_file(name, &map_size);
+
+		dirname = q->dirname;
+		locname = lm->name;
+		catname = catnames[category];
+
+		dirlen = q->dirlen;
+		loclen = strlen(locname);
+		catlen = catlens[category];
+
+		/* Logically split @mod suffix from locale name. */
+		modname = memchr(locname, '@', loclen);
+		if (!modname) modname = locname + loclen;
+		alt_modlen = modlen = loclen - (modname-locname);
+		loclen = modname-locname;
+
+		/* Drop .charset identifier; it is not used. */
+		const char *csp = memchr(locname, '.', loclen);
+		if (csp) loclen = csp-locname;
+
+		char name[dirlen+1 + loclen+1 + catlen+1 + domlen+3 + 1];
+		const void *map;
+
+		for (;;) {
+			snprintf(name, sizeof name, "%s/%.*s%.*s/%s/%s.mo\0",
+				dirname, (int)loclen, locname,
+				(int)alt_modlen, modname, catname, domainname);
+			if (map = __map_file(name, &map_size)) break;
+
+			/* Try dropping @mod, _YY, then both. */
+			if (alt_modlen) {
+				alt_modlen = 0;
+			} else if ((locp = memchr(locname, '_', loclen))) {
+				loclen = locp-locname;
+				alt_modlen = modlen;
+			} else {
+				break;
+			}
+		}
 		if (!map) goto notrans;
-		p = calloc(sizeof *p + namelen + 1, 1);
+
+		p = calloc(sizeof *p, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
 			goto notrans;
 		}
+		p->cat = category;
+		p->binding = q;
+		p->lm = lm;
 		p->map = map;
 		p->map_size = map_size;
-		memcpy(p->name, name, namelen+1);
 		do {
 			old_cats = cats;
 			p->next = old_cats;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-03-17 19:27                                     ` Rich Felker
@ 2017-03-17 19:37                                       ` Rich Felker
  2017-03-18  7:34                                         ` He X
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-03-17 19:37 UTC (permalink / raw)
  To: musl

On Fri, Mar 17, 2017 at 03:27:49PM -0400, Rich Felker wrote:
> +		/* Logically split @mod suffix from locale name. */
> +		modname = memchr(locname, '@', loclen);
> +		if (!modname) modname = locname + loclen;
> +		alt_modlen = modlen = loclen - (modname-locname);
> +		loclen = modname-locname;
> +
> +		/* Drop .charset identifier; it is not used. */
> +		const char *csp = memchr(locname, '.', loclen);
> +		if (csp) loclen = csp-locname;
> +
> +		char name[dirlen+1 + loclen+1 + catlen+1 + domlen+3 + 1];
> +		const void *map;

One bug in my version: loclen+1 should be loclen+modlen+1. Without
changing that, the pathname gets truncated if @mod is used. Otherwise
it seems to work.

One separate but related problem I noticed while testing is that
musl's locale name length limit of 16 is insufficient for something
like "en_GB.UTF-8@euro". If we want to allow users to have an explicit
".UTF-8" in the locale name (which, as we discussed a while back,
users may want to do if they plan to ssh into glibc-based systems)
then we should probably increase this limit a bit, perhaps to 24 or
so.

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-03-17 19:37                                       ` Rich Felker
@ 2017-03-18  7:34                                         ` He X
  2017-03-18 12:28                                           ` Rich Felker
  0 siblings, 1 reply; 38+ messages in thread
From: He X @ 2017-03-18  7:34 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 4037 bytes --]

> As discussed on irc, .charset suffixes should be dropped before the
loop even begins (never used in pathnames), and they occur before the
@mod, not after it, so the logic for dropping them is different.

1. drop .charset: Sorry for proposing it again, i forget this case after
around three weeks, as i said before, vim will generate three different .mo
files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po.
In that case, dropping is to generate a lots of junk.

I now found it's not a bug of msgfmt. That is charset is converted by:
iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e
's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname is
decided by softwares, msgfmt does not do charset converting at all, just a
format-translator. (btw, iconv.c is from alpine)

How do we deal with the case? Patch the special software(since these
special cases are rare, maybe other editors like emacs have similar
actions, general softwares only prepare UTF-8 files)?
Or put the drop back in the loop? I prefer the latter, it seems more
elegant and correct than ugly patches, my idea is: ```
tt_TT.XXX@mod
tt_TT.XXX
tt.XXX@mod
tt.XXX
tt_TT@mod
tt_TT
tt@mod
tt
```
I have made it in a similar way as @mod in the attached. Here's the output
from strace:
```
15:14:32.235463 open("./es_MX.utf8@euro/LC_MESSAGES/hellogt.mo",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
15:14:32.235495 open("./es_MX.utf8/LC_MESSAGES/hellogt.mo",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
15:14:32.235560 open("./es.utf8@euro/LC_MESSAGES/hellogt.mo",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
15:14:32.235585 open("./es.utf8/LC_MESSAGES/hellogt.mo",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
15:14:32.235621 open("./es_MX@euro/LC_MESSAGES/hellogt.mo",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
15:14:32.235656 open("./es_MX/LC_MESSAGES/hellogt.mo",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
15:14:32.235683 open("./es@euro/LC_MESSAGES/hellogt.mo",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = -1 ENOENT (No such file or directory)
15:14:32.235706 open("./es/LC_MESSAGES/hellogt.mo",
O_RDONLY|O_NONBLOCK|O_CLOEXEC) = 3
```
Other possible LANG is tested, too.

> One bug in my version: loclen+1 should be loclen+modlen+1. Without
changing that, the pathname gets truncated if @mod is used. Otherwise
it seems to work.
2. Yes, it is, made it in the attached.

> then we should probably increase this limit a bit, perhaps to 24 or
so.
3. the longest line is `tt_TT.iso885915@euro`, 21. 24 seems good for the
current standard. Made it in the attached.

2017-03-17 19:37 GMT+00:00 Rich Felker <dalias@libc.org>:

> On Fri, Mar 17, 2017 at 03:27:49PM -0400, Rich Felker wrote:
> > +             /* Logically split @mod suffix from locale name. */
> > +             modname = memchr(locname, '@', loclen);
> > +             if (!modname) modname = locname + loclen;
> > +             alt_modlen = modlen = loclen - (modname-locname);
> > +             loclen = modname-locname;
> > +
> > +             /* Drop .charset identifier; it is not used. */
> > +             const char *csp = memchr(locname, '.', loclen);
> > +             if (csp) loclen = csp-locname;
> > +
> > +             char name[dirlen+1 + loclen+1 + catlen+1 + domlen+3 + 1];
> > +             const void *map;
>
> One bug in my version: loclen+1 should be loclen+modlen+1. Without
> changing that, the pathname gets truncated if @mod is used. Otherwise
> it seems to work.
>
> One separate but related problem I noticed while testing is that
> musl's locale name length limit of 16 is insufficient for something
> like "en_GB.UTF-8@euro". If we want to allow users to have an explicit
> ".UTF-8" in the locale name (which, as we discussed a while back,
> users may want to do if they plan to ssh into glibc-based systems)
> then we should probably increase this limit a bit, perhaps to 24 or
> so.
>
> Rich
>

[-- Attachment #1.2: Type: text/html, Size: 6082 bytes --]

[-- Attachment #2: locale.diff --]
[-- Type: text/plain, Size: 4100 bytes --]

diff --git a/src/locale/dcngettext.c b/src/locale/dcngettext.c
index b68e24b..abaa414 100644
--- a/src/locale/dcngettext.c
+++ b/src/locale/dcngettext.c
@@ -100,7 +100,9 @@ struct msgcat {
 	size_t map_size;
 	void *volatile plural_rule;
 	volatile int nplurals;
-	char name[];
+	struct binding *binding;
+	const struct __locale_map *lm;
+	int cat;
 };

 static char *dummy_gettextdomain()
@@ -120,8 +122,8 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2,
 	struct msgcat *p;
 	struct __locale_struct *loc = CURRENT_LOCALE;
 	const struct __locale_map *lm;
-	const char *dirname, *locname, *catname;
-	size_t dirlen, loclen, catlen, domlen;
+	size_t domlen;
+	struct binding *q;

 	if ((unsigned)category >= LC_ALL) goto notrans;

@@ -130,55 +132,83 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2,
 	domlen = strnlen(domainname, NAME_MAX+1);
 	if (domlen > NAME_MAX) goto notrans;

-	dirname = gettextdir(domainname, &dirlen);
-	if (!dirname) goto notrans;
+	for (q=bindings; q; q=q->next)
+		if (!strcmp(q->domainname, domainname) && q->active)
+			break;
+	if (!q) goto notrans;

 	lm = loc->cat[category];
 	if (!lm) {
 notrans:
 		return (char *) ((n == 1) ? msgid1 : msgid2);
 	}
-	locname = lm->name;
-
-	catname = catnames[category];
-	catlen = catlens[category];
-	loclen = strlen(locname);
-
-	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
-	char name[namelen+1], *s = name;
-
-	memcpy(s, dirname, dirlen);
-	s[dirlen] = '/';
-	s += dirlen + 1;
-	memcpy(s, locname, loclen);
-	s[loclen] = '/';
-	s += loclen + 1;
-	memcpy(s, catname, catlen);
-	s[catlen] = '/';
-	s += catlen + 1;
-	memcpy(s, domainname, domlen);
-	s[domlen] = '.';
-	s[domlen+1] = 'm';
-	s[domlen+2] = 'o';
-	s[domlen+3] = 0;

 	for (p=cats; p; p=p->next)
-		if (!strcmp(p->name, name))
+		if (p->binding == q && p->lm == lm && p->cat == category)
 			break;

 	if (!p) {
+		const char *dirname, *locname, *catname, *modname, *csname, *locp;
+		size_t dirlen, loclen, catlen, modlen, alt_modlen, cslen, alt_charset;
 		void *old_cats;
 		size_t map_size;
-		const void *map = __map_file(name, &map_size);
+
+		dirname = q->dirname;
+		locname = lm->name;
+		catname = catnames[category];
+
+		dirlen = q->dirlen;
+		loclen = strlen(locname);
+		catlen = catlens[category];
+
+		/* Logically split @mod suffix from locale name. */
+		modname = memchr(locname, '@', loclen);
+		if (!modname) modname = locname + loclen;
+		alt_modlen = modlen = loclen - (modname-locname);
+		loclen -= modlen;
+
+		/* Split .charset suffix as @mod. */
+		csname = memchr(locname, '.', loclen);
+		if (!csname) csname = locname + loclen;
+		alt_charset = cslen = loclen - (csname-locname);
+		loclen -= cslen;
+
+		char name[dirlen+1 + loclen+modlen+cslen+1 + catlen+1 + domlen+3 + 1];
+		const void *map;
+
+		for (;;) {
+			snprintf(name, sizeof name, "%s/%.*s%.*s%.*s/%s/%s.mo\0",
+				dirname, (int)loclen, locname,
+                               (int)alt_charset, csname,
+				(int)alt_modlen, modname, catname, domainname);
+			if (map = __map_file(name, &map_size)) break;
+
+			/* Try dropping @mod, _YY, then both. */
+			if (alt_modlen) {
+				alt_modlen = 0;
+			} else if ((locp = memchr(locname, '_', loclen))) {
+				loclen = locp-locname;
+				alt_modlen = modlen;
+			} else if (alt_charset) {
+				alt_charset = 0;
+				alt_modlen = modlen;
+				loclen = strlen(locname) - modlen - cslen;
+			} else {
+				break;
+			}
+		}
 		if (!map) goto notrans;
-		p = calloc(sizeof *p + namelen + 1, 1);
+
+		p = calloc(sizeof *p, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
 			goto notrans;
 		}
+		p->cat = category;
+		p->binding = q;
+		p->lm = lm;
 		p->map = map;
 		p->map_size = map_size;
-		memcpy(p->name, name, namelen+1);
 		do {
 			old_cats = cats;
 			p->next = old_cats;
--- musl-1.1.16/src/internal/locale_impl.h
+++ musl-1.1.16/src/internal/locale_impl.h
@@ -6,7 +6,7 @@
 #include "libc.h"
 #include "pthread_impl.h"
 
-#define LOCALE_NAME_MAX 15
+#define LOCALE_NAME_MAX 23
 
 struct __locale_map {
 	const void *map;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-03-18  7:34                                         ` He X
@ 2017-03-18 12:28                                           ` Rich Felker
  2017-03-18 13:50                                             ` He X
  0 siblings, 1 reply; 38+ messages in thread
From: Rich Felker @ 2017-03-18 12:28 UTC (permalink / raw)
  To: musl

On Sat, Mar 18, 2017 at 07:34:58AM +0000, He X wrote:
> > As discussed on irc, .charset suffixes should be dropped before the
> loop even begins (never used in pathnames), and they occur before the
> @mod, not after it, so the logic for dropping them is different.
> 
> 1. drop .charset: Sorry for proposing it again, i forget this case after
> around three weeks, as i said before, vim will generate three different .mo
> files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po.
> In that case, dropping is to generate a lots of junk.
> 
> I now found it's not a bug of msgfmt. That is charset is converted by:
> iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e
> 's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname is
> decided by softwares, msgfmt does not do charset converting at all, just a
> format-translator. (btw, iconv.c is from alpine)

There are two things you seem to be missing:

1. musl does not, and won't, support non-UTF-8 locales, so there is no
point in trying to load translations for them. Moreover, with the
proposed changes to setlocale/locale_map.c, it will never be possible
for the locale name to contain a . with anything other than UTF-8 (or,
for compatibility, some variant like utf8) after it. So I don't see
how there's any point in iterating and trying with/without .charset
when the only possibilities are that .charset is blank, .UTF-8, or
some misspelling of .UTF-8. In the latter case, we'd even have to do
remapping of the misspellings to avoid having to have multiple
dirs/symlinks.

2. From my perspective, msgfmt's production of non-UTF-8 .mo files is
a bug. Yes the .po file can be something else, but msgfmt should be
transcoding it at 'compile' time. There's at least one other change
msgfmt needs for all features to work with musl's gettext -- expansion
of SYSDEP strings to all their possible format patterns -- so I don't
think it's a significant additional burden to ensure that the msgfmt
used on musl-based systems outputs UTF-8.

Of course software trying to do multiple encodings like you described
will still install duplicate files unless patched, but any of them
should work as long as msgfmt recoded them. In the mean time, distros
can just patch the build process for software that's still installing
non-UTF-8 locale files. AFAIK doing that is not a recommended practice
even by the GNU gettext project, so the patches might even make it
upstream.

One thing we could do for robustness is check the .mo header at load
time and, if it has a charset= specification with something other than
UTF-8, reject it. I mainly suggest this in case the program is running
on a non-musl system where a glibc-built version of the same program
(e.g. vi) with non-UTF-8 .mo files is present and they're using the
same textdomain dir (actually unlikely since prefix should be
different). But if we do this it should be a separate patch because
it's a separate functional change.

Rich

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
  2017-03-18 12:28                                           ` Rich Felker
@ 2017-03-18 13:50                                             ` He X
  0 siblings, 0 replies; 38+ messages in thread
From: He X @ 2017-03-18 13:50 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 3467 bytes --]

OK, i think there's no further needs of discussion. I got your idea, if
this is what musl want to be. I will try to make patches to vim later!
But for the checking of `charset=`, i can't help, i did not understand
what's up in __mo_lookup(). Hope you can make the patch. The attached has
deleted all things related to drop .charset.

2017-03-18 20:28 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sat, Mar 18, 2017 at 07:34:58AM +0000, He X wrote:
> > > As discussed on irc, .charset suffixes should be dropped before the
> > loop even begins (never used in pathnames), and they occur before the
> > @mod, not after it, so the logic for dropping them is different.
> >
> > 1. drop .charset: Sorry for proposing it again, i forget this case after
> > around three weeks, as i said before, vim will generate three different
> .mo
> > files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po.
> > In that case, dropping is to generate a lots of junk.
> >
> > I now found it's not a bug of msgfmt. That is charset is converted by:
> > iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e
> > 's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname
> is
> > decided by softwares, msgfmt does not do charset converting at all, just
> a
> > format-translator. (btw, iconv.c is from alpine)
>
> There are two things you seem to be missing:
>
> 1. musl does not, and won't, support non-UTF-8 locales, so there is no
> point in trying to load translations for them. Moreover, with the
> proposed changes to setlocale/locale_map.c, it will never be possible
> for the locale name to contain a . with anything other than UTF-8 (or,
> for compatibility, some variant like utf8) after it. So I don't see
> how there's any point in iterating and trying with/without .charset
> when the only possibilities are that .charset is blank, .UTF-8, or
> some misspelling of .UTF-8. In the latter case, we'd even have to do
> remapping of the misspellings to avoid having to have multiple
> dirs/symlinks.
>
> 2. From my perspective, msgfmt's production of non-UTF-8 .mo files is
> a bug. Yes the .po file can be something else, but msgfmt should be
> transcoding it at 'compile' time. There's at least one other change
> msgfmt needs for all features to work with musl's gettext -- expansion
> of SYSDEP strings to all their possible format patterns -- so I don't
> think it's a significant additional burden to ensure that the msgfmt
> used on musl-based systems outputs UTF-8.
>
> Of course software trying to do multiple encodings like you described
> will still install duplicate files unless patched, but any of them
> should work as long as msgfmt recoded them. In the mean time, distros
> can just patch the build process for software that's still installing
> non-UTF-8 locale files. AFAIK doing that is not a recommended practice
> even by the GNU gettext project, so the patches might even make it
> upstream.
>
> One thing we could do for robustness is check the .mo header at load
> time and, if it has a charset= specification with something other than
> UTF-8, reject it. I mainly suggest this in case the program is running
> on a non-musl system where a glibc-built version of the same program
> (e.g. vi) with non-UTF-8 .mo files is present and they're using the
> same textdomain dir (actually unlikely since prefix should be
> different). But if we do this it should be a separate patch because
> it's a separate functional change.
>
> Rich
>

[-- Attachment #1.2: Type: text/html, Size: 4138 bytes --]

[-- Attachment #2: locale.diff --]
[-- Type: text/plain, Size: 3824 bytes --]

diff --git a/src/locale/dcngettext.c b/src/locale/dcngettext.c
index b68e24b..abaa414 100644
--- a/src/locale/dcngettext.c
+++ b/src/locale/dcngettext.c
@@ -100,7 +100,9 @@ struct msgcat {
 	size_t map_size;
 	void *volatile plural_rule;
 	volatile int nplurals;
-	char name[];
+	struct binding *binding;
+	const struct __locale_map *lm;
+	int cat;
 };

 static char *dummy_gettextdomain()
@@ -120,8 +122,8 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2,
 	struct msgcat *p;
 	struct __locale_struct *loc = CURRENT_LOCALE;
 	const struct __locale_map *lm;
-	const char *dirname, *locname, *catname;
-	size_t dirlen, loclen, catlen, domlen;
+	size_t domlen;
+	struct binding *q;

 	if ((unsigned)category >= LC_ALL) goto notrans;

@@ -130,55 +132,76 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2,
 	domlen = strnlen(domainname, NAME_MAX+1);
 	if (domlen > NAME_MAX) goto notrans;

-	dirname = gettextdir(domainname, &dirlen);
-	if (!dirname) goto notrans;
+	for (q=bindings; q; q=q->next)
+		if (!strcmp(q->domainname, domainname) && q->active)
+			break;
+	if (!q) goto notrans;

 	lm = loc->cat[category];
 	if (!lm) {
 notrans:
 		return (char *) ((n == 1) ? msgid1 : msgid2);
 	}
-	locname = lm->name;
-
-	catname = catnames[category];
-	catlen = catlens[category];
-	loclen = strlen(locname);
-
-	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
-	char name[namelen+1], *s = name;
-
-	memcpy(s, dirname, dirlen);
-	s[dirlen] = '/';
-	s += dirlen + 1;
-	memcpy(s, locname, loclen);
-	s[loclen] = '/';
-	s += loclen + 1;
-	memcpy(s, catname, catlen);
-	s[catlen] = '/';
-	s += catlen + 1;
-	memcpy(s, domainname, domlen);
-	s[domlen] = '.';
-	s[domlen+1] = 'm';
-	s[domlen+2] = 'o';
-	s[domlen+3] = 0;

 	for (p=cats; p; p=p->next)
-		if (!strcmp(p->name, name))
+		if (p->binding == q && p->lm == lm && p->cat == category)
 			break;

 	if (!p) {
+		const char *dirname, *locname, *catname, *modname, *locp;
+		size_t dirlen, loclen, catlen, modlen, alt_modlen;
 		void *old_cats;
 		size_t map_size;
-		const void *map = __map_file(name, &map_size);
+
+		dirname = q->dirname;
+		locname = lm->name;
+		catname = catnames[category];
+
+		dirlen = q->dirlen;
+		loclen = strlen(locname);
+		catlen = catlens[category];
+
+		/* Logically split @mod suffix from locale name. */
+		modname = memchr(locname, '@', loclen);
+		if (!modname) modname = locname + loclen;
+		alt_modlen = modlen = loclen - (modname-locname);
+		loclen = modname-locname;
+
+		/* Drop .charset identifier; it is not used. */
+		const char *csp = memchr(locname, '.', loclen);
+		if (csp) loclen = csp-locname;
+
+		char name[dirlen+1 + loclen+modlen+1 + catlen+1 + domlen+3 + 1];
+		const void *map;
+
+		for (;;) {
+			snprintf(name, sizeof name, "%s/%.*s%.*s/%s/%s.mo\0",
+				dirname, (int)loclen, locname,
+				(int)alt_modlen, modname, catname, domainname);
+			if (map = __map_file(name, &map_size)) break;
+
+			/* Try dropping @mod, _YY, then both. */
+			if (alt_modlen) {
+				alt_modlen = 0;
+			} else if ((locp = memchr(locname, '_', loclen))) {
+				loclen = locp-locname;
+				alt_modlen = modlen;
+			} else {
+				break;
+			}
+		}
 		if (!map) goto notrans;
-		p = calloc(sizeof *p + namelen + 1, 1);
+
+		p = calloc(sizeof *p, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
 			goto notrans;
 		}
+		p->cat = category;
+		p->binding = q;
+		p->lm = lm;
 		p->map = map;
 		p->map_size = map_size;
-		memcpy(p->name, name, namelen+1);
 		do {
 			old_cats = cats;
 			p->next = old_cats;
--- musl-1.1.16/src/internal/locale_impl.h
+++ musl-1.1.16/src/internal/locale_impl.h
@@ -6,7 +6,7 @@
 #include "libc.h"
 #include "pthread_impl.h"
 
-#define LOCALE_NAME_MAX 15
+#define LOCALE_NAME_MAX 23
 
 struct __locale_map {
 	const void *map;

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2017-03-18 13:50 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-20 11:25 a bug in bindtextdomain() and strip '.UTF-8' He X
2017-01-29  4:52 ` He X
2017-01-29 13:39   ` Szabolcs Nagy
2017-01-29 14:07     ` Rich Felker
2017-01-29 14:48       ` He X
2017-01-29 15:55         ` Rich Felker
2017-01-29 16:14           ` He X
2017-01-29 16:33             ` Rich Felker
2017-02-08 10:13               ` He X
2017-02-08 14:31                 ` Rich Felker
2017-02-09  9:49                   ` He X
2017-02-11  2:36                     ` Rich Felker
2017-02-11  6:00                       ` He X
2017-02-11 23:59                         ` Rich Felker
2017-02-12  2:34                         ` Rich Felker
2017-02-12  6:56                           ` He X
2017-02-12  7:11                             ` He X
2017-02-13 17:08                             ` Rich Felker
2017-02-13  8:01                           ` He X
2017-02-13 13:28                             ` Rich Felker
2017-02-13 14:06                               ` He X
2017-02-13 17:12                                 ` Rich Felker
2017-03-04  8:02                                   ` He X
2017-03-17 19:27                                     ` Rich Felker
2017-03-17 19:37                                       ` Rich Felker
2017-03-18  7:34                                         ` He X
2017-03-18 12:28                                           ` Rich Felker
2017-03-18 13:50                                             ` He X
2017-02-13 14:12                               ` He X
2017-02-13 17:13                                 ` Rich Felker
2017-01-29 16:37         ` Rich Felker
2017-01-30  0:37           ` He X
2017-01-30 14:17           ` He X
2017-01-29 16:40         ` Szabolcs Nagy
2017-01-29 16:49           ` Rich Felker
2017-01-30 12:36             ` He X
2017-01-30 13:05               ` Szabolcs Nagy
2017-01-30  1:32           ` He X

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).