mailing list of musl libc
From: He X <xw897002528@gmail.com>
To: musl@lists.openwall.com
Subject: Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
Date: Sat, 4 Mar 2017 16:02:58 +0800	[thread overview]
Message-ID: <CAPG2z0-7JV8safi5rrYr0zeu+dKqZLGAikV549jrumy+=NLJxQ@mail.gmail.com> (raw)
In-Reply-To: <20170213171236.GI1520@brightrain.aerifal.cx>


[-- Attachment #1.1: Type: text/plain, Size: 4411 bytes --]

OK, I have been busy with school these days. I reread the mailing list thread
and cleaned things up. These are the remaining issues we need to solve from
the previous discussion:

1. About a null msgid1: I can confirm that glibc falls back to no
translation. It is equivalent to printf(""), so this should be OK:
@@ -120,8 +122,9 @@
+ if (!msgid1) goto notrans;

> but it should be a separate patch since it's an independent
> change.

(I added the check at the top of dcngettext(). I'll send a separate standalone
mail for it, but note that it is also included in this patch.)
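
A minimal sketch of the guard described in item 1, assuming the fallback simply returns the untranslated msgid selected by n. The names `untranslated` and `lookup_or_fallback` are hypothetical and are not part of musl; they only mirror the `notrans` path visible in the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the notrans path of dcngettext(): returns the
 * untranslated msgid selected by n. The result is a null pointer when the
 * selected msgid is null. */
static const char *untranslated(const char *msgid1, const char *msgid2,
                                unsigned long n)
{
	return n == 1 ? msgid1 : msgid2;
}

/* Hypothetical wrapper showing where the guard sits: before any catalog
 * lookup can dereference msgid1. */
static const char *lookup_or_fallback(const char *msgid1, const char *msgid2,
                                      unsigned long n)
{
	if (!msgid1) return untranslated(msgid1, msgid2, n);
	/* A real implementation would search the mapped .mo catalog here;
	 * this sketch always takes the untranslated path. */
	return untranslated(msgid1, msgid2, n);
}
```

With this shape, a caller that passes msgid1=0x0 and n=1, as in the tar backtrace quoted below, gets a null pointer back instead of a crash in the lookup.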

2.
> But if the locale name is explicitly
> non-UTF-8 like "zh_CN.GBK", we could opt to reject it without breaking
> anything, and this may give users better feedback about what's going
> wrong if they have such settings when ssh'ing into a musl-based
> system.

For .GBK (and any other non-UTF-8 charset), I currently ignore the charset by
treating the locale as C.UTF-8. Do we need to be stricter?
--- musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000
+++ musl-1.1.16/src/locale/locale_map.c 2017-01-01 03:27:17.000000000 +0000
@@ -46,7 +46,8 @@
  if (val[0]=='.' || val[n]) val = "C.UTF-8";
  int builtin = (val[0]=='C' && !val[1])
  || !strcmp(val, "C.UTF-8")
- || !strcmp(val, "POSIX");
+ || !strcmp(val, "POSIX")
+ || strcmp(__strchrnul(val, '.'), ".UTF-8");

  if (builtin) {
  if (cat == LC_CTYPE && val[1]=='.')
3.
> The autoconf text for gettext is supposed to be getting fixed not to
> do that anymore, but I'm not sure what the progress on upstreaming it
> is.

That was only a description of a workaround to use until they handle it. I am
not proposing any change in musl for this; I only patched it on my side.

4.
> Support for non-UTF-8 .mo files won't be added.
> msgfmt just needs to be fixed not to produce
> non-UTF-8 output.

I agree with you, so I hope '.UTF-8' can be kept. Rather than stripping it as
I thought before, '.UTF-8' should be kept until the code reaches
dcngettext(). What if the .mo files are downloaded from the web? Or what if
they are pre-generated and shipped in program releases? (I guess that is why
vim gave me GBK-encoded files; they must be pre-generated.)

And even if msgfmt generates UTF-8 output, what if programs still name the
directory zh_CN.UTF-8 instead of simply zh_CN? You can't say that is wrong,
right? How they name it is their choice.

Keeping the suffix is necessary for programs that use a full name like
zh_CN.UTF-8 instead of zh_CN. That is what I am trying to express.
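
A minimal sketch of the lookup order argued for above: try the full locale name first, then progressively less specific names. It assumes the order is "full name, then without the codeset, then without the territory" and leaves out @modifier handling entirely. The helper names `next_candidate` and `list_candidates` and the 32-byte name limit are hypothetical choices for illustration, not musl code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: strips one component from buf in place and returns 1,
 * or returns 0 when nothing is left to strip. Modifiers (@...) are
 * deliberately not handled in this sketch. */
static int next_candidate(char *buf)
{
	char *p;
	if ((p = strchr(buf, '.'))) { *p = 0; return 1; } /* drop .codeset */
	if ((p = strchr(buf, '_'))) { *p = 0; return 1; } /* drop _territory */
	return 0;
}

/* Hypothetical helper: fills out[] with at most max candidate directory
 * names, most specific first, and returns how many were written. Names
 * longer than 31 bytes are truncated. */
static int list_candidates(const char *locname, char out[][32], int max)
{
	char buf[32];
	int n = 0;

	snprintf(buf, sizeof buf, "%s", locname);
	do {
		if (n == max) break;
		snprintf(out[n++], 32, "%s", buf);
	} while (next_candidate(buf));
	return n;
}
```

For "zh_CN.UTF-8" this yields zh_CN.UTF-8, zh_CN, zh in that order, so a catalog installed under a directory named zh_CN.UTF-8 is found before the shorter names are tried.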

2017-02-14 1:12 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Mon, Feb 13, 2017 at 10:06:49PM +0800, He X wrote:
> > no, it's on musl, i just tested it with my patches, with vim, stripping
> > will lead to unknown characters.
>
> That's not a matter of the locale being non-UTF-8 (it's UTF-8) but of
> the application doing something broken. The locale is UTF-8 because
> nl_langinfo(CODESET) says it is and because mb/wc conversion functions
> process UTF-8. That's what it means for the locale to be UTF-8.
>
> > I mean, the .mo files under zh_CN/ of vim are GBK-encoded, while those under
> > zh_CN/ of other apps are UTF-8. That means there may be other apps like vim,
> > so we should be more cautious: add a check before mapping the .mo files, and
> > fail for non-UTF-8 charsets in setlocale.
>
> All musl locale files are required to be UTF-8. If an application has
> translation files that are not UTF-8, they're not usable. This could
> be fixed in the application or by using a fixed version of msgfmt that
> converts to UTF-8 before producing the .mo file.
>
> > Btw, _nl_msg_cat_cntr & _nl_domain_bindings will block apps compiling with
> > the native intl of musl, and after i added a dump for these two symbols,
>
> The autoconf text for gettext is supposed to be getting fixed not to
> do that anymore, but I'm not sure what the progress on upstreaming it
> is.
>
> > gnu tar showed me segfaults, because he passed a zero msgid1 causing
> > __mo_lookup segfault, we should add a check in dcngettext to avoid it(if
> > (!msgid1) goto notrans;):
> >
> >  #2  0x00007ffff7d82a6f in dcngettext (domainname=0x6737a0 "tar",
> > msgid1=0x0, msgid2=0x0, n=1,
> >     category=5) at src/locale/dcngettext.c:211
>
> Is it expecting gettext to return a null pointer in this case, or to
> return something else (like the "header", i.e. the translation of "")?
> I think it's acceptable to change this behavior as long as we do it
> right, but it should be a separate patch since it's an independent
> change.
>
> Rich
>
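
The quoted reply defines a locale as UTF-8 by what nl_langinfo(CODESET) reports, not by the locale's name. A minimal sketch of that check using only POSIX calls follows; the helper names `codeset_is_utf8` and `current_locale_is_utf8` are hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: compares a codeset name against "UTF-8". */
static int codeset_is_utf8(const char *codeset)
{
	return strcmp(codeset, "UTF-8") == 0;
}

/* Hypothetical helper: adopts LC_CTYPE from the environment, then asks the
 * C library which codeset the resulting locale uses. The answer depends on
 * the libc and on the environment, so no particular result is assumed. */
static int current_locale_is_utf8(void)
{
	setlocale(LC_CTYPE, "");
	return codeset_is_utf8(nl_langinfo(CODESET));
}
```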


[-- Attachment #2: locale.diff --]
[-- Type: text/plain, Size: 3636 bytes --]

--- a/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000 
+++ b/src/locale/dcngettext.c	2017-02-06 14:39:17.860482624 +0000
@@ -100,7 +100,9 @@
 	size_t map_size;
 	void *volatile plural_rule;
 	volatile int nplurals;
-	char name[];
+	struct binding *binding;
+	struct __locale_map *lm;
+	int cat;
 };
 
 static char *dummy_gettextdomain()
@@ -120,8 +122,9 @@
+	if (!msgid1) goto notrans;
 	struct msgcat *p;
 	struct __locale_struct *loc = CURRENT_LOCALE;
 	const struct __locale_map *lm;
-	const char *dirname, *locname, *catname;
-	size_t dirlen, loclen, catlen, domlen;
+	size_t domlen;
+	struct binding *q;
 
 	if ((unsigned)category >= LC_ALL) goto notrans;
 
@@ -130,47 +132,64 @@
 	domlen = strnlen(domainname, NAME_MAX+1);
 	if (domlen > NAME_MAX) goto notrans;
 
-	dirname = gettextdir(domainname, &dirlen);
-	if (!dirname) goto notrans;
+	for (q=bindings; q; q=q->next)
+		if (!strcmp(q->domainname, domainname) && q->active)
+			break;
+	if (!q) goto notrans;
 
 	lm = loc->cat[category];
 	if (!lm) {
 notrans:
 		return (char *) ((n == 1) ? msgid1 : msgid2);
 	}
-	locname = lm->name;
-
-	catname = catnames[category];
-	catlen = catlens[category];
-	loclen = strlen(locname);
-
-	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
-	char name[namelen+1], *s = name;
-
-	memcpy(s, dirname, dirlen);
-	s[dirlen] = '/';
-	s += dirlen + 1;
-	memcpy(s, locname, loclen);
-	s[loclen] = '/';
-	s += loclen + 1;
-	memcpy(s, catname, catlen);
-	s[catlen] = '/';
-	s += catlen + 1;
-	memcpy(s, domainname, domlen);
-	s[domlen] = '.';
-	s[domlen+1] = 'm';
-	s[domlen+2] = 'o';
-	s[domlen+3] = 0;
 
 	for (p=cats; p; p=p->next)
-		if (!strcmp(p->name, name))
+		if (p->binding == q && p->lm == lm && p->cat == category)
 			break;
 
 	if (!p) {
+		const char *dirname, *locname, *catname;
+		size_t dirlen, loclen, catlen;
 		void *old_cats;
 		size_t map_size;
-		const void *map = __map_file(name, &map_size);
+
+		dirname = q->dirname;
+		locname = lm->name;
+		catname = catnames[category];
+
+		dirlen = q->dirlen;
+		loclen = strlen(locname);
+		catlen = catlens[category];
+
+		size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
+		char name[namelen+1];
+		char locbuf[loclen+1], *locp = locbuf;
+		const void *map;
+
+		memcpy(locbuf, locname, loclen);
+		locbuf[loclen] = 0;
+
+		for (;;) {
+			snprintf(name, namelen+1, "%s/%s/%s/%s.mo", dirname, locbuf, catname, domainname);
+			if (map = __map_file(name, &map_size)) break;
+
+			if (locp = strchr(locbuf, '.')) {
+				*locp = 0;
+			} else if (locp = strchr(locbuf, '@')) {
+				*locp = 0;
+				locbuf[loclen] = '@';
+			} else if (locp = strchr(locbuf, '_')) {
+				if (locbuf[loclen] == '@') {
+					locbuf[loclen] = 0;
+					*locp = '@';
+					strcat(locp+1, locbuf + strlen(locbuf) + 1);
+				} else *locp = 0;
+			} else {
+				break;
+			}
+		}
 		if (!map) goto notrans;
+
 		p = calloc(sizeof *p + namelen + 1, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
@@ -178,7 +195,9 @@
 		}
+		p->cat = category;
+		p->binding = q;
+		p->lm = lm;
 		p->map = map;
 		p->map_size = map_size;
-		memcpy(p->name, name, namelen+1);
 		do {
 			old_cats = cats;
 			p->next = old_cats;
--- musl-1.1.16/src/locale/locale_map.c	2017-01-01 03:27:17.000000000 +0000
+++ musl-1.1.16/src/locale/locale_map.c	2017-01-01 03:27:17.000000000 +0000
@@ -46,7 +46,8 @@
 	if (val[0]=='.' || val[n]) val = "C.UTF-8";
 	int builtin = (val[0]=='C' && !val[1])
 		|| !strcmp(val, "C.UTF-8")
-		|| !strcmp(val, "POSIX");
+		|| !strcmp(val, "POSIX")
+		|| strcmp(__strchrnul(val, '.'), ".UTF-8");
 
 	if (builtin) {
 		if (cat == LC_CTYPE && val[1]=='.')
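
A standalone sketch of how the patched `builtin` condition in locale_map.c classifies locale names. `chr_or_end` is a hypothetical portable replacement for musl's internal `__strchrnul()`. `builtin_as_patched` copies the condition from the hunk above, and `builtin_explicit_only` is an assumed alternative, not something proposed in this thread. One thing the sketch makes visible: a name with no '.' at all, such as "zh_CN", also satisfies the new strcmp() term, because the comparison is then made against an empty string. Whether that is intended is a question for review.

```c
#include <string.h>

/* Hypothetical portable equivalent of musl's internal __strchrnul():
 * returns a pointer to the first c in s, or to the terminating NUL. */
static const char *chr_or_end(const char *s, int c)
{
	const char *p = strchr(s, c);
	return p ? p : s + strlen(s);
}

/* The condition exactly as written in the patch. */
static int builtin_as_patched(const char *val)
{
	return (val[0]=='C' && !val[1])
		|| !strcmp(val, "C.UTF-8")
		|| !strcmp(val, "POSIX")
		|| strcmp(chr_or_end(val, '.'), ".UTF-8");
}

/* Assumed alternative: fall back to the builtin locale only when a codeset
 * is present and it is not ".UTF-8". */
static int builtin_explicit_only(const char *val)
{
	const char *cs = chr_or_end(val, '.');
	return (val[0]=='C' && !val[1])
		|| !strcmp(val, "C.UTF-8")
		|| !strcmp(val, "POSIX")
		|| (*cs && strcmp(cs, ".UTF-8"));
}
```

Under the patched condition "zh_CN.GBK" and "zh_CN" are both treated as builtin, while "zh_CN.UTF-8" is not. The alternative keeps "zh_CN" eligible for a locale file lookup.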
