From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/11157 Path: news.gmane.org!.POSTED!not-for-mail From: He X Newsgroups: gmane.linux.lib.musl.general Subject: Re: Re: a bug in bindtextdomain() and strip '.UTF-8' Date: Sat, 18 Mar 2017 21:50:28 +0800 Message-ID: References: <20170212023422.GE1520@brightrain.aerifal.cx> <20170213132816.GG1520@brightrain.aerifal.cx> <20170213171236.GI1520@brightrain.aerifal.cx> <20170317192749.GL1693@brightrain.aerifal.cx> <20170317193740.GM1693@brightrain.aerifal.cx> <20170318122833.GN1693@brightrain.aerifal.cx> Reply-To: musl@lists.openwall.com NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: multipart/mixed; boundary=94eb2c04515cf0db9d054b01942f X-Trace: blaine.gmane.org 1489845064 6639 195.159.176.226 (18 Mar 2017 13:51:04 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Sat, 18 Mar 2017 13:51:04 +0000 (UTC) To: musl@lists.openwall.com Original-X-From: musl-return-11172-gllmg-musl=m.gmane.org@lists.openwall.com Sat Mar 18 14:51:00 2017 Return-path: Envelope-to: gllmg-musl@m.gmane.org Original-Received: from mother.openwall.net ([195.42.179.200]) by blaine.gmane.org with smtp (Exim 4.84_2) (envelope-from ) id 1cpEkz-00014I-1i for gllmg-musl@m.gmane.org; Sat, 18 Mar 2017 14:50:57 +0100 Original-Received: (qmail 23862 invoked by uid 550); 18 Mar 2017 13:51:01 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Original-Received: (qmail 23844 invoked from network); 18 Mar 2017 13:51:01 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=QpWla1zDcncDTim5kN18zUEBOWEGabjCJWOscuPCwPw=; b=PG+S+FuAi3jBmiSnZwLFClykRwRYLXWXYiaoqj+2unpPb+6UbNovjNHuDhwKYGQMxw OV7i/7EPmsWmVCRNG8xwAC+fYjPrMwLpA6ksFdxS3cbtbSW8IWG9dGI78bztGjSWexpv 
ogqPxlyVHJ9c2wZGkB0twhT4i0upiRZT3D9Azq7qDgXKufcFt6N7c0Y8Hxs3o2qU9uau ekVPl8/XzAtOYiU48rl6Vwoa7MuXXBmRJCzVFBOHWibX/wqhntoQptAoWudDypF0g3x+ YE1JmIgou7bnN3RJlZkqZdvP5KL86SUqIUn3VsieTiZ5q3LuV51Ikn9sPtryaRmYwVsG sjog== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=QpWla1zDcncDTim5kN18zUEBOWEGabjCJWOscuPCwPw=; b=PeAWou4pKQQju+c4QNCPgH0RqtbROGjNQVkDidQn30EA9dcxWTbcCRWp6ruysfW+tn m5aEyUEipCT0tVIT2WzYrUbMnFcFnw10LOGJX4ThU9FLOon0VLRKylFHwYNEgN5UV7YU Fw0XG9uXC7+YG56wU1PduERubrEypT14pKMwJrYqm1D45tx70hzBpa/wfWVTyE3wi7sM a8bWYWyAi4ysWBFtXQU0tjTeCUKr98I99oxqnQyf1kdqYEQi1n52Yko3QBz5EFzQBiX6 yzHMH61wRvYtSTDLbIgyRiJOyTskgW7CFM/sK+RhtMMiD/k6aioTIz3yYEOlb+R60ugN iYTw== X-Gm-Message-State: AFeK/H2BkGxa7PMbxtllMrvxb6MEJZPyq58h2N4uZgt5jfi6YMcD2KctFaULHIsJLDzClKn7bS98ftnzr1Iq8w== X-Received: by 10.176.6.233 with SMTP id g96mr8727150uag.68.1489845049395; Sat, 18 Mar 2017 06:50:49 -0700 (PDT) In-Reply-To: <20170318122833.GN1693@brightrain.aerifal.cx> Xref: news.gmane.org gmane.linux.lib.musl.general:11157 Archived-At: --94eb2c04515cf0db9d054b01942f Content-Type: multipart/alternative; boundary=94eb2c04515cf0db96054b01942d --94eb2c04515cf0db96054b01942d Content-Type: text/plain; charset=UTF-8 OK, i think there's no further needs of discussion. I got your idea, if this is what musl want to be. I will try to make patches to vim later! But for the checking of `charset=`, i can't help, i did not understand what's up in __mo_lookup(). Hope you can make the patch. The attached has deleted all things related to drop .charset. 2017-03-18 20:28 GMT+08:00 Rich Felker : > On Sat, Mar 18, 2017 at 07:34:58AM +0000, He X wrote: > > > As discussed on irc, .charset suffixes should be dropped before the > > loop even begins (never used in pathnames), and they occur before the > > @mod, not after it, so the logic for dropping them is different. > > > > 1. 
drop .charset: Sorry for proposing it again, i forget this case after
> > around three weeks, as i said before, vim will generate three different .mo
> > files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po.
> > In that case, dropping is to generate a lots of junk.
> >
> > I now found it's not a bug of msgfmt. That is charset is converted by:
> > iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e
> > 's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname is
> > decided by softwares, msgfmt does not do charset converting at all, just a
> > format-translator. (btw, iconv.c is from alpine)
>
> There are two things you seem to be missing:
>
> 1. musl does not, and won't, support non-UTF-8 locales, so there is no
> point in trying to load translations for them. Moreover, with the
> proposed changes to setlocale/locale_map.c, it will never be possible
> for the locale name to contain a . with anything other than UTF-8 (or,
> for compatibility, some variant like utf8) after it. So I don't see
> how there's any point in iterating and trying with/without .charset
> when the only possibilities are that .charset is blank, .UTF-8, or
> some misspelling of .UTF-8. In the latter case, we'd even have to do
> remapping of the misspellings to avoid having to have multiple
> dirs/symlinks.
>
> 2. From my perspective, msgfmt's production of non-UTF-8 .mo files is
> a bug. Yes the .po file can be something else, but msgfmt should be
> transcoding it at 'compile' time. There's at least one other change
> msgfmt needs for all features to work with musl's gettext -- expansion
> of SYSDEP strings to all their possible format patterns -- so I don't
> think it's a significant additional burden to ensure that the msgfmt
> used on musl-based systems outputs UTF-8.
>
> Of course software trying to do multiple encodings like you described
> will still install duplicate files unless patched, but any of them
> should work as long as msgfmt recoded them. In the mean time, distros
> can just patch the build process for software that's still installing
> non-UTF-8 locale files. AFAIK doing that is not a recommended practice
> even by the GNU gettext project, so the patches might even make it
> upstream.
>
> One thing we could do for robustness is check the .mo header at load
> time and, if it has a charset= specification with something other than
> UTF-8, reject it. I mainly suggest this in case the program is running
> on a non-musl system where a glibc-built version of the same program
> (e.g. vi) with non-UTF-8 .mo files is present and they're using the
> same textdomain dir (actually unlikely since prefix should be
> different). But if we do this it should be a separate patch because
> it's a separate functional change.
>
> Rich
>
--94eb2c04515cf0db96054b01942d
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
OK, I think there's no further need for discussion. I got your idea, if this is what musl wants to be. I will try to make patches to vim later!
But for the checking of `charset=`, I can't help; I did not understand what is going on in __mo_lookup(). Hope you can make the patch. The attached patch has removed everything related to dropping .charset.

2017-03-18 20:28 GMT+08:00 Rich Felker <dalias@libc.org>:
On Sat, Mar 18, 2017 at 07:34:58AM +0000, He X wrote:
> > As discussed on irc, .charset suffixes should be dropped before the
> loop even begins (never used in pathnames), and they occur before the
> @mod, not after it, so the logic for dropping them is different.
>
> 1. drop .charset: Sorry for proposing it again, i forget this case after
> around three weeks, as i said before, vim will generate three different .mo
> files with different charset -> zh_CN.UTF-8.po, zh_CN.cp936.po, zh_CN.po.
> In that case, dropping is to generate a lots of junk.
>
> I now found it's not a bug of msgfmt. That is charset is converted by:
> iconv -f UTF-8 -t cp936 zh_CN.UTF-8.po | sed -e
> 's/charset=utf-8/charset=gbk/ > ... So that means, charset and pathname is
> decided by softwares, msgfmt does not do charset converting at all, just a
> format-translator. (btw, iconv.c is from alpine)

There are two things you seem to be missing:

1. musl does not, and won't, support non-UTF-8 locales, so there is no
point in trying to load translations for them. Moreover, with the
proposed changes to setlocale/locale_map.c, it will never be possible
for the locale name to contain a . with anything other than UTF-8 (or,
for compatibility, some variant like utf8) after it. So I don't see
how there's any point in iterating and trying with/without .charset
when the only possibilities are that .charset is blank, .UTF-8, or
some misspelling of .UTF-8. In the latter case, we'd even have to do
remapping of the misspellings to avoid having to have multiple
dirs/symlinks.

2. From my perspective, msgfmt's production of non-UTF-8 .mo files is
a bug. Yes the .po file can be something else, but msgfmt should be
transcoding it at 'compile' time. There's at least one other change
msgfmt needs for all features to work with musl's gettext -- expansion
of SYSDEP strings to all their possible format patterns -- so I don't
think it's a significant additional burden to ensure that the msgfmt
used on musl-based systems outputs UTF-8.
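
To make the name handling in the two points above concrete, here is a minimal sketch of the lookup order, assuming the .charset part is dropped once before the loop and that @mod and _YY are then dropped in turn. It follows the idea of the attached locale.diff but is not that code: the helper name locale_candidates and the fixed 4x64 output buffer are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not musl code): fill out[] with the locale
 * directory names to try, most specific first, and return the count.
 * At most four names are produced: ll_YY@mod, ll_YY, ll@mod, ll. */
static int locale_candidates(const char *locname, char out[4][64])
{
	size_t loclen = strlen(locname);

	/* Logically split the @mod suffix; it is the last component. */
	const char *mod = memchr(locname, '@', loclen);
	if (!mod) mod = locname + loclen;
	size_t modlen = loclen - (size_t)(mod - locname);
	loclen = (size_t)(mod - locname);

	/* Drop .charset before the loop; it sits before @mod and is
	 * never used in pathnames. */
	const char *cs = memchr(locname, '.', loclen);
	if (cs) loclen = (size_t)(cs - locname);

	int n = 0;
	size_t alt_modlen = modlen;
	for (;;) {
		assert(n < 4);
		snprintf(out[n++], 64, "%.*s%.*s",
			(int)loclen, locname, (int)alt_modlen, mod);
		if (alt_modlen) {
			/* Next attempt: same name without @mod. */
			alt_modlen = 0;
		} else {
			/* Next attempt: drop _YY and restore @mod. */
			const char *terr = memchr(locname, '_', loclen);
			if (!terr) break;
			loclen = (size_t)(terr - locname);
			alt_modlen = modlen;
		}
	}
	return n;
}
```

With this sketch, sr_RS.UTF-8@latin yields sr_RS@latin, sr_RS, sr@latin, sr, and zh_CN.UTF-8 yields zh_CN, zh. No candidate ever contains a charset, so a single directory per locale is enough.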

Of course software trying to do multiple encodings like you described
will still install duplicate files unless patched, but any of them
should work as long as msgfmt recoded them. In the mean time, distros
can just patch the build process for software that's still installing
non-UTF-8 locale files. AFAIK doing that is not a recommended practice
even by the GNU gettext project, so the patches might even make it
upstream.

One thing we could do for robustness is check the .mo header at load
time and, if it has a charset= specification with something other than
UTF-8, reject it. I mainly suggest this in case the program is running
on a non-musl system where a glibc-built version of the same program
(e.g. vi) with non-UTF-8 .mo files is present and they're using the
same textdomain dir (actually unlikely since prefix should be
different). But if we do this it should be a separate patch because
it's a separate functional change.
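
A rough sketch of what such a load-time check might look like, assuming the caller has already obtained the .mo header entry (the translation of the empty msgid "") as a NUL-terminated string. The function name mo_header_is_utf8 is hypothetical, and accepting a header with no charset= field at all is an assumption of this sketch, not something decided in this thread.

```c
#include <string.h>
#include <strings.h>

/* Hypothetical helper, not part of musl: return 1 if the .mo header
 * names no charset or names UTF-8, and 0 if it names anything else. */
static int mo_header_is_utf8(const char *header)
{
	const char *cs = strstr(header, "charset=");
	if (!cs) return 1;	/* assumption: no charset field, accept */
	cs += strlen("charset=");

	/* The value ends at whitespace, ';' or the end of the string. */
	size_t len = strcspn(cs, " \t\r\n;");

	/* Accept the canonical spelling and the "utf8" variant,
	 * ignoring case. */
	if (len == 5 && !strncasecmp(cs, "UTF-8", 5)) return 1;
	if (len == 4 && !strncasecmp(cs, "utf8", 4)) return 1;
	return 0;
}
```

A loader could call this right after mapping the file and treat a 0 result the same as a missing file, so a stray charset=gbk catalog would simply be skipped.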

Rich

--94eb2c04515cf0db96054b01942d-- --94eb2c04515cf0db9d054b01942f Content-Type: text/plain; charset=US-ASCII; name="locale.diff" Content-Disposition: attachment; filename="locale.diff" Content-Transfer-Encoding: base64 X-Attachment-Id: f_j0fb9bku0 ZGlmZiAtLWdpdCBhL3NyYy9sb2NhbGUvZGNuZ2V0dGV4dC5jIGIvc3JjL2xvY2FsZS9kY25nZXR0 ZXh0LmMKaW5kZXggYjY4ZTI0Yi4uYWJhYTQxNCAxMDA2NDQKLS0tIGEvc3JjL2xvY2FsZS9kY25n ZXR0ZXh0LmMKKysrIGIvc3JjL2xvY2FsZS9kY25nZXR0ZXh0LmMKQEAgLTEwMCw3ICsxMDAsOSBA QCBzdHJ1Y3QgbXNnY2F0IHsKIAlzaXplX3QgbWFwX3NpemU7CiAJdm9pZCAqdm9sYXRpbGUgcGx1 cmFsX3J1bGU7CiAJdm9sYXRpbGUgaW50IG5wbHVyYWxzOwotCWNoYXIgbmFtZVtdOworCXN0cnVj dCBiaW5kaW5nICpiaW5kaW5nOworCWNvbnN0IHN0cnVjdCBfX2xvY2FsZV9tYXAgKmxtOworCWlu dCBjYXQ7CiB9OwoKIHN0YXRpYyBjaGFyICpkdW1teV9nZXR0ZXh0ZG9tYWluKCkKQEAgLTEyMCw4 ICsxMjIsOCBAQCBjaGFyICpkY25nZXR0ZXh0KGNvbnN0IGNoYXIgKmRvbWFpbm5hbWUsIGNvbnN0 IGNoYXIgKm1zZ2lkMSwgY29uc3QgY2hhciAqbXNnaWQyLAogCXN0cnVjdCBtc2djYXQgKnA7CiAJ c3RydWN0IF9fbG9jYWxlX3N0cnVjdCAqbG9jID0gQ1VSUkVOVF9MT0NBTEU7CiAJY29uc3Qgc3Ry dWN0IF9fbG9jYWxlX21hcCAqbG07Ci0JY29uc3QgY2hhciAqZGlybmFtZSwgKmxvY25hbWUsICpj YXRuYW1lOwotCXNpemVfdCBkaXJsZW4sIGxvY2xlbiwgY2F0bGVuLCBkb21sZW47CisJc2l6ZV90 IGRvbWxlbjsKKwlzdHJ1Y3QgYmluZGluZyAqcTsKCiAJaWYgKCh1bnNpZ25lZCljYXRlZ29yeSA+ PSBMQ19BTEwpIGdvdG8gbm90cmFuczsKCkBAIC0xMzAsNTUgKzEzMiw3NiBAQCBjaGFyICpkY25n ZXR0ZXh0KGNvbnN0IGNoYXIgKmRvbWFpbm5hbWUsIGNvbnN0IGNoYXIgKm1zZ2lkMSwgY29uc3Qg Y2hhciAqbXNnaWQyLAogCWRvbWxlbiA9IHN0cm5sZW4oZG9tYWlubmFtZSwgTkFNRV9NQVgrMSk7 CiAJaWYgKGRvbWxlbiA+IE5BTUVfTUFYKSBnb3RvIG5vdHJhbnM7CgotCWRpcm5hbWUgPSBnZXR0 ZXh0ZGlyKGRvbWFpbm5hbWUsICZkaXJsZW4pOwotCWlmICghZGlybmFtZSkgZ290byBub3RyYW5z OworCWZvciAocT1iaW5kaW5nczsgcTsgcT1xLT5uZXh0KQorCQlpZiAoIXN0cmNtcChxLT5kb21h aW5uYW1lLCBkb21haW5uYW1lKSAmJiBxLT5hY3RpdmUpCisJCQlicmVhazsKKwlpZiAoIXEpIGdv dG8gbm90cmFuczsKCiAJbG0gPSBsb2MtPmNhdFtjYXRlZ29yeV07CiAJaWYgKCFsbSkgewogbm90 cmFuczoKIAkJcmV0dXJuIChjaGFyICopICgobiA9PSAxKSA/IG1zZ2lkMSA6IG1zZ2lkMik7CiAJ 
fQotCWxvY25hbWUgPSBsbS0+bmFtZTsKLQotCWNhdG5hbWUgPSBjYXRuYW1lc1tjYXRlZ29yeV07 Ci0JY2F0bGVuID0gY2F0bGVuc1tjYXRlZ29yeV07Ci0JbG9jbGVuID0gc3RybGVuKGxvY25hbWUp OwotCi0Jc2l6ZV90IG5hbWVsZW4gPSBkaXJsZW4rMSArIGxvY2xlbisxICsgY2F0bGVuKzEgKyBk b21sZW4rMzsKLQljaGFyIG5hbWVbbmFtZWxlbisxXSwgKnMgPSBuYW1lOwotCi0JbWVtY3B5KHMs IGRpcm5hbWUsIGRpcmxlbik7Ci0Jc1tkaXJsZW5dID0gJy8nOwotCXMgKz0gZGlybGVuICsgMTsK LQltZW1jcHkocywgbG9jbmFtZSwgbG9jbGVuKTsKLQlzW2xvY2xlbl0gPSAnLyc7Ci0JcyArPSBs b2NsZW4gKyAxOwotCW1lbWNweShzLCBjYXRuYW1lLCBjYXRsZW4pOwotCXNbY2F0bGVuXSA9ICcv JzsKLQlzICs9IGNhdGxlbiArIDE7Ci0JbWVtY3B5KHMsIGRvbWFpbm5hbWUsIGRvbWxlbik7Ci0J c1tkb21sZW5dID0gJy4nOwotCXNbZG9tbGVuKzFdID0gJ20nOwotCXNbZG9tbGVuKzJdID0gJ28n OwotCXNbZG9tbGVuKzNdID0gMDsKCiAJZm9yIChwPWNhdHM7IHA7IHA9cC0+bmV4dCkKLQkJaWYg KCFzdHJjbXAocC0+bmFtZSwgbmFtZSkpCisJCWlmIChwLT5iaW5kaW5nID09IHEgJiYgcC0+bG0g PT0gbG0gJiYgcC0+Y2F0ID09IGNhdGVnb3J5KQogCQkJYnJlYWs7CgogCWlmICghcCkgeworCQlj b25zdCBjaGFyICpkaXJuYW1lLCAqbG9jbmFtZSwgKmNhdG5hbWUsICptb2RuYW1lLCAqbG9jcDsK KwkJc2l6ZV90IGRpcmxlbiwgbG9jbGVuLCBjYXRsZW4sIG1vZGxlbiwgYWx0X21vZGxlbjsKIAkJ dm9pZCAqb2xkX2NhdHM7CiAJCXNpemVfdCBtYXBfc2l6ZTsKLQkJY29uc3Qgdm9pZCAqbWFwID0g X19tYXBfZmlsZShuYW1lLCAmbWFwX3NpemUpOworCisJCWRpcm5hbWUgPSBxLT5kaXJuYW1lOwor CQlsb2NuYW1lID0gbG0tPm5hbWU7CisJCWNhdG5hbWUgPSBjYXRuYW1lc1tjYXRlZ29yeV07CisK KwkJZGlybGVuID0gcS0+ZGlybGVuOworCQlsb2NsZW4gPSBzdHJsZW4obG9jbmFtZSk7CisJCWNh dGxlbiA9IGNhdGxlbnNbY2F0ZWdvcnldOworCisJCS8qIExvZ2ljYWxseSBzcGxpdCBAbW9kIHN1 ZmZpeCBmcm9tIGxvY2FsZSBuYW1lLiAqLworCQltb2RuYW1lID0gbWVtY2hyKGxvY25hbWUsICdA JywgbG9jbGVuKTsKKwkJaWYgKCFtb2RuYW1lKSBtb2RuYW1lID0gbG9jbmFtZSArIGxvY2xlbjsK KwkJYWx0X21vZGxlbiA9IG1vZGxlbiA9IGxvY2xlbiAtIChtb2RuYW1lLWxvY25hbWUpOworCQls b2NsZW4gPSBtb2RuYW1lLWxvY25hbWU7CisKKwkJLyogRHJvcCAuY2hhcnNldCBpZGVudGlmaWVy OyBpdCBpcyBub3QgdXNlZC4gKi8KKwkJY29uc3QgY2hhciAqY3NwID0gbWVtY2hyKGxvY25hbWUs ICcuJywgbG9jbGVuKTsKKwkJaWYgKGNzcCkgbG9jbGVuID0gY3NwLWxvY25hbWU7CisKKwkJY2hh 
ciBuYW1lW2RpcmxlbisxICsgbG9jbGVuK21vZGxlbisxICsgY2F0bGVuKzEgKyBkb21sZW4rMyAr IDFdOworCQljb25zdCB2b2lkICptYXA7CisKKwkJZm9yICg7OykgeworCQkJc25wcmludGYobmFt ZSwgc2l6ZW9mIG5hbWUsICIlcy8lLipzJS4qcy8lcy8lcy5tb1wwIiwKKwkJCQlkaXJuYW1lLCAo aW50KWxvY2xlbiwgbG9jbmFtZSwKKwkJCQkoaW50KWFsdF9tb2RsZW4sIG1vZG5hbWUsIGNhdG5h bWUsIGRvbWFpbm5hbWUpOworCQkJaWYgKG1hcCA9IF9fbWFwX2ZpbGUobmFtZSwgJm1hcF9zaXpl KSkgYnJlYWs7CisKKwkJCS8qIFRyeSBkcm9wcGluZyBAbW9kLCBfWVksIHRoZW4gYm90aC4gKi8K KwkJCWlmIChhbHRfbW9kbGVuKSB7CisJCQkJYWx0X21vZGxlbiA9IDA7CisJCQl9IGVsc2UgaWYg KChsb2NwID0gbWVtY2hyKGxvY25hbWUsICdfJywgbG9jbGVuKSkpIHsKKwkJCQlsb2NsZW4gPSBs b2NwLWxvY25hbWU7CisJCQkJYWx0X21vZGxlbiA9IG1vZGxlbjsKKwkJCX0gZWxzZSB7CisJCQkJ YnJlYWs7CisJCQl9CisJCX0KIAkJaWYgKCFtYXApIGdvdG8gbm90cmFuczsKLQkJcCA9IGNhbGxv YyhzaXplb2YgKnAgKyBuYW1lbGVuICsgMSwgMSk7CisKKwkJcCA9IGNhbGxvYyhzaXplb2YgKnAs IDEpOwogCQlpZiAoIXApIHsKIAkJCV9fbXVubWFwKCh2b2lkICopbWFwLCBtYXBfc2l6ZSk7CiAJ CQlnb3RvIG5vdHJhbnM7CiAJCX0KKwkJcC0+Y2F0ID0gY2F0ZWdvcnk7CisJCXAtPmJpbmRpbmcg PSBxOworCQlwLT5sbSA9IGxtOwogCQlwLT5tYXAgPSBtYXA7CiAJCXAtPm1hcF9zaXplID0gbWFw X3NpemU7Ci0JCW1lbWNweShwLT5uYW1lLCBuYW1lLCBuYW1lbGVuKzEpOwogCQlkbyB7CiAJCQlv bGRfY2F0cyA9IGNhdHM7CiAJCQlwLT5uZXh0ID0gb2xkX2NhdHM7Ci0tLSBtdXNsLTEuMS4xNi9z cmMvaW50ZXJuYWwvbG9jYWxlX2ltcGwuaAorKysgbXVzbC0xLjEuMTYvc3JjL2ludGVybmFsL2xv Y2FsZV9pbXBsLmgKQEAgLTYsNyArNiw3IEBACiAjaW5jbHVkZSAibGliYy5oIgogI2luY2x1ZGUg InB0aHJlYWRfaW1wbC5oIgogCi0jZGVmaW5lIExPQ0FMRV9OQU1FX01BWCAxNQorI2RlZmluZSBM T0NBTEVfTkFNRV9NQVggMjMKIAogc3RydWN0IF9fbG9jYWxlX21hcCB7CiAJY29uc3Qgdm9pZCAq bWFwOwo= --94eb2c04515cf0db9d054b01942f--