Re: Re: a bug in bindtextdomain() and strip '.UTF-8'

mailing list of musl libc
 help / color / mirror / code / Atom feed

From: He X <xw897002528@gmail.com>
To: musl@lists.openwall.com
Subject: Re: Re: a bug in bindtextdomain() and strip '.UTF-8'
Date: Sat, 18 Mar 2017 21:50:28 +0800	[thread overview]
Message-ID: <CAPG2z09edp7keF3ZfHxKR8_2LPu=nSj4aL-XE8VXtiL_+3LuHA@mail.gmail.com> (raw)
In-Reply-To: <20170318122833.GN1693@brightrain.aerifal.cx>


[-- Attachment #1.1: Type: text/plain, Size: 3467 bytes --]

OK, i think there's no further needs of discussion. I got your idea, if
this is what musl want to be. I will try to make patches to vim later!
But for the checking of `charset=`, i can't help, i did not understand
what's up in __mo_lookup(). Hope you can make the patch. The attached has
deleted all things related to drop .charset.

2017-03-18 20:28 GMT+08:00 Rich Felker <dalias@libc.org>:

> On Sat, Mar 18, 2017 at 07:34:58AM +0000, He X wrote:
> > > As discussed on irc, .charset suffixes should be dropped before the
> > loop even begins (never used in pathnames), and they occur before the
> > @mod, not after it, so the logic for dropping them is different.
> >
> > 1. drop .charset: Sorry for proposing it again, i forget this case after
> > around three weeks, as i said before, vim will generate three different
> .mo
> > files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po.
> > In that case, dropping is to generate a lots of junk.
> >
> > I now found it's not a bug of msgfmt. That is charset is converted by:
> > iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e
> > 's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname
> is
> > decided by softwares, msgfmt does not do charset converting at all, just
> a
> > format-translator. (btw, iconv.c is from alpine)
>
> There are two things you seem to be missing:
>
> 1. musl does not, and won't, support non-UTF-8 locales, so there is no
> point in trying to load translations for them. Moreover, with the
> proposed changes to setlocale/locale_map.c, it will never be possible
> for the locale name to contain a . with anything other than UTF-8 (or,
> for compatibility, some variant like utf8) after it. So I don't see
> how there's any point in iterating and trying with/without .charset
> when the only possibilities are that .charset is blank, .UTF-8, or
> some misspelling of .UTF-8. In the latter case, we'd even have to do
> remapping of the misspellings to avoid having to have multiple
> dirs/symlinks.
>
> 2. From my perspective, msgfmt's production of non-UTF-8 .mo files is
> a bug. Yes the .po file can be something else, but msgfmt should be
> transcoding it at 'compile' time. There's at least one other change
> msgfmt needs for all features to work with musl's gettext -- expansion
> of SYSDEP strings to all their possible format patterns -- so I don't
> think it's a significant additional burden to ensure that the msgfmt
> used on musl-based systems outputs UTF-8.
>
> Of course software trying to do multiple encodings like you described
> will still install duplicate files unless patched, but any of them
> should work as long as msgfmt recoded them. In the mean time, distros
> can just patch the build process for software that's still installing
> non-UTF-8 locale files. AFAIK doing that is not a recommended practice
> even by the GNU gettext project, so the patches might even make it
> upstream.
>
> One thing we could do for robustness is check the .mo header at load
> time and, if it has a charset= specification with something other than
> UTF-8, reject it. I mainly suggest this in case the program is running
> on a non-musl system where a glibc-built version of the same program
> (e.g. vi) with non-UTF-8 .mo files is present and they're using the
> same textdomain dir (actually unlikely since prefix should be
> different). But if we do this it should be a separate patch because
> it's a separate functional change.
>
> Rich
>

[-- Attachment #1.2: Type: text/html, Size: 4138 bytes --]

[-- Attachment #2: locale.diff --]
[-- Type: text/plain, Size: 3824 bytes --]

diff --git a/src/locale/dcngettext.c b/src/locale/dcngettext.c
index b68e24b..abaa414 100644
--- a/src/locale/dcngettext.c
+++ b/src/locale/dcngettext.c
@@ -100,7 +100,9 @@ struct msgcat {
 	size_t map_size;
 	void *volatile plural_rule;
 	volatile int nplurals;
-	char name[];
+	struct binding *binding;
+	const struct __locale_map *lm;
+	int cat;
 };

 static char *dummy_gettextdomain()
@@ -120,8 +122,8 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2,
 	struct msgcat *p;
 	struct __locale_struct *loc = CURRENT_LOCALE;
 	const struct __locale_map *lm;
-	const char *dirname, *locname, *catname;
-	size_t dirlen, loclen, catlen, domlen;
+	size_t domlen;
+	struct binding *q;

 	if ((unsigned)category >= LC_ALL) goto notrans;

@@ -130,55 +132,76 @@ char *dcngettext(const char *domainname, const char *msgid1, const char *msgid2,
 	domlen = strnlen(domainname, NAME_MAX+1);
 	if (domlen > NAME_MAX) goto notrans;

-	dirname = gettextdir(domainname, &dirlen);
-	if (!dirname) goto notrans;
+	for (q=bindings; q; q=q->next)
+		if (!strcmp(q->domainname, domainname) && q->active)
+			break;
+	if (!q) goto notrans;

 	lm = loc->cat[category];
 	if (!lm) {
 notrans:
 		return (char *) ((n == 1) ? msgid1 : msgid2);
 	}
-	locname = lm->name;
-
-	catname = catnames[category];
-	catlen = catlens[category];
-	loclen = strlen(locname);
-
-	size_t namelen = dirlen+1 + loclen+1 + catlen+1 + domlen+3;
-	char name[namelen+1], *s = name;
-
-	memcpy(s, dirname, dirlen);
-	s[dirlen] = '/';
-	s += dirlen + 1;
-	memcpy(s, locname, loclen);
-	s[loclen] = '/';
-	s += loclen + 1;
-	memcpy(s, catname, catlen);
-	s[catlen] = '/';
-	s += catlen + 1;
-	memcpy(s, domainname, domlen);
-	s[domlen] = '.';
-	s[domlen+1] = 'm';
-	s[domlen+2] = 'o';
-	s[domlen+3] = 0;

 	for (p=cats; p; p=p->next)
-		if (!strcmp(p->name, name))
+		if (p->binding == q && p->lm == lm && p->cat == category)
 			break;

 	if (!p) {
+		const char *dirname, *locname, *catname, *modname, *locp;
+		size_t dirlen, loclen, catlen, modlen, alt_modlen;
 		void *old_cats;
 		size_t map_size;
-		const void *map = __map_file(name, &map_size);
+
+		dirname = q->dirname;
+		locname = lm->name;
+		catname = catnames[category];
+
+		dirlen = q->dirlen;
+		loclen = strlen(locname);
+		catlen = catlens[category];
+
+		/* Logically split @mod suffix from locale name. */
+		modname = memchr(locname, '@', loclen);
+		if (!modname) modname = locname + loclen;
+		alt_modlen = modlen = loclen - (modname-locname);
+		loclen = modname-locname;
+
+		/* Drop .charset identifier; it is not used. */
+		const char *csp = memchr(locname, '.', loclen);
+		if (csp) loclen = csp-locname;
+
+		char name[dirlen+1 + loclen+modlen+1 + catlen+1 + domlen+3 + 1];
+		const void *map;
+
+		for (;;) {
+			snprintf(name, sizeof name, "%s/%.*s%.*s/%s/%s.mo\0",
+				dirname, (int)loclen, locname,
+				(int)alt_modlen, modname, catname, domainname);
+			if (map = __map_file(name, &map_size)) break;
+
+			/* Try dropping @mod, _YY, then both. */
+			if (alt_modlen) {
+				alt_modlen = 0;
+			} else if ((locp = memchr(locname, '_', loclen))) {
+				loclen = locp-locname;
+				alt_modlen = modlen;
+			} else {
+				break;
+			}
+		}
 		if (!map) goto notrans;
-		p = calloc(sizeof *p + namelen + 1, 1);
+
+		p = calloc(sizeof *p, 1);
 		if (!p) {
 			__munmap((void *)map, map_size);
 			goto notrans;
 		}
+		p->cat = category;
+		p->binding = q;
+		p->lm = lm;
 		p->map = map;
 		p->map_size = map_size;
-		memcpy(p->name, name, namelen+1);
 		do {
 			old_cats = cats;
 			p->next = old_cats;
--- musl-1.1.16/src/internal/locale_impl.h
+++ musl-1.1.16/src/internal/locale_impl.h
@@ -6,7 +6,7 @@
 #include "libc.h"
 #include "pthread_impl.h"
 
-#define LOCALE_NAME_MAX 15
+#define LOCALE_NAME_MAX 23
 
 struct __locale_map {
 	const void *map;

next prev parent reply	other threads:[~2017-03-18 13:50 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-20 11:25 He X
2017-01-29  4:52 ` He X
2017-01-29 13:39   ` Szabolcs Nagy
2017-01-29 14:07     ` Rich Felker
2017-01-29 14:48       ` He X
2017-01-29 15:55         ` Rich Felker
2017-01-29 16:14           ` He X
2017-01-29 16:33             ` Rich Felker
2017-02-08 10:13               ` He X
2017-02-08 14:31                 ` Rich Felker
2017-02-09  9:49                   ` He X
2017-02-11  2:36                     ` Rich Felker
2017-02-11  6:00                       ` He X
2017-02-11 23:59                         ` Rich Felker
2017-02-12  2:34                         ` Rich Felker
2017-02-12  6:56                           ` He X
2017-02-12  7:11                             ` He X
2017-02-13 17:08                             ` Rich Felker
2017-02-13  8:01                           ` He X
2017-02-13 13:28                             ` Rich Felker
2017-02-13 14:06                               ` He X
2017-02-13 17:12                                 ` Rich Felker
2017-03-04  8:02                                   ` He X
2017-03-17 19:27                                     ` Rich Felker
2017-03-17 19:37                                       ` Rich Felker
2017-03-18  7:34                                         ` He X
2017-03-18 12:28                                           ` Rich Felker
2017-03-18 13:50                                             ` He X [this message]
2017-02-13 14:12                               ` He X
2017-02-13 17:13                                 ` Rich Felker
2017-01-29 16:37         ` Rich Felker
2017-01-30  0:37           ` He X
2017-01-30 14:17           ` He X
2017-01-29 16:40         ` Szabolcs Nagy
2017-01-29 16:49           ` Rich Felker
2017-01-30 12:36             ` He X
2017-01-30 13:05               ` Szabolcs Nagy
2017-01-30  1:32           ` He X

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPG2z09edp7keF3ZfHxKR8_2LPu=nSj4aL-XE8VXtiL_+3LuHA@mail.gmail.com' \
    --to=xw897002528@gmail.com \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).