From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [PATCH v2] IDNA support in name lookups
Date: Sun, 23 Apr 2017 12:56:14 -0400
Message-ID: <20170423165614.GP17319@brightrain.aerifal.cx>
In-Reply-To: <20170423163824.GB15554@wirbelwind>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Sun, Apr 23, 2017 at 06:38:24PM +0200, Joakim Sindholt wrote:
> > If I'm not mistaken, your patch as-is actually _breaks_ support for
> > literal utf-8 hostnames in the hosts file.
>
> No, it supports all combinations. It encodes every hostname found to the
> punycode equivalent and compares only punycode. The encoding is done on
> a per-label basis so ønskeskyen.dk is "xn--nskeskyen-k8a" and "dk" treated
> entirely separately.
> Suppose they had a subdomain, ønsker.ønskeskyen.dk, then this approach
> would work even if someone put "ønsker.xn--nskeskyen-k8a.dk" in their
> hosts file, even if you looked it up as xn--nsker-uua.ønskeskyen.dk.

OK, that makes more sense and it's non-harmful.

> There are a couple of reasons that I chose to do it this way, with the
> primary being that it meant I only had to include the encoding function
> and the secondary being that the encoded name will always fit in 254
> characters, whereas the decoded equivalent might be larger (see below).
>
> I don't have any ideological opposition to doing the comparison entirely
> in unicode though. This was just the more practical solution for the
> time being.

It's probably okay the way you're doing it, then. I failed to
understand that it accepts either. The main other consideration might
be whether it's noticeably slow on large hosts files, but I don't think
the existing code was terribly fast anyway.
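To make the "encode every label, compare only punycode" idea concrete, here is a minimal sketch of RFC 3492 encoding applied per label. It is not the patch's idnaenc (whose body is not shown in this thread); the names label_enc, name_enc and name_eq are hypothetical, the UTF-8 decoder is deliberately simplified, and no IDNA mapping or case folding is attempted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Punycode parameters from RFC 3492, section 5. */
static const uint32_t BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;

/* Append one byte; fail if the label would pass 63 bytes or overflow out. */
#define PUT(c) do { if (o >= 63 || o + 1 >= outsz) return -1; out[o++] = (char)(c); } while (0)

static char digit(uint32_t d) { return (char)(d < 26 ? 'a' + d : '0' + (d - 26)); }

/* Bias adaptation, RFC 3492 section 6.1. */
static uint32_t adapt(uint32_t delta, uint32_t npoints, int first)
{
	uint32_t k = 0;
	delta = first ? delta / DAMP : delta / 2;
	delta += delta / npoints;
	for (; delta > (BASE - TMIN) * TMAX / 2; k += BASE) delta /= BASE - TMIN;
	return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Decode one UTF-8 sequence; returns bytes used, 0 on error. Simplified: no overlong/surrogate checks. */
static size_t utf8_next(const unsigned char *s, size_t len, uint32_t *cp)
{
	size_t n, i;
	if (!len) return 0;
	n = s[0] < 0x80 ? 1 : (s[0] & 0xe0) == 0xc0 ? 2
	  : (s[0] & 0xf0) == 0xe0 ? 3 : (s[0] & 0xf8) == 0xf0 ? 4 : 0;
	if (!n || n > len) return 0;
	*cp = n == 1 ? s[0] : s[0] & (0x3f >> (n - 1));
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xc0) != 0x80) return 0;
		*cp = *cp << 6 | (s[i] & 0x3f);
	}
	return n;
}

/* Encode one label: ASCII is copied as-is, anything else becomes "xn--" + punycode. */
static int label_enc(char *out, size_t outsz, const char *in, size_t inlen)
{
	uint32_t cp[63], ncp = 0, b = 0, h, i, m, q, k, t, n = 128, delta = 0, bias = 72;
	size_t o = 0, pos, r;
	for (pos = 0; pos < inlen; pos += r) {
		if (ncp == 63) return -1;
		if (!(r = utf8_next((const unsigned char *)in + pos, inlen - pos, &cp[ncp]))) return -1;
		if (cp[ncp++] < 0x80) b++;
	}
	if (!ncp) return -1; /* empty label */
	if (b < ncp) { PUT('x'); PUT('n'); PUT('-'); PUT('-'); }
	for (i = 0; i < ncp; i++) if (cp[i] < 0x80) PUT(cp[i]);
	if (b && b < ncp) PUT('-');
	for (h = b; h < ncp; delta++, n++) {
		for (m = UINT32_MAX, i = 0; i < ncp; i++)
			if (cp[i] >= n && cp[i] < m) m = cp[i];
		delta += (m - n) * (h + 1);
		n = m;
		for (i = 0; i < ncp; i++) {
			if (cp[i] < n) delta++;
			if (cp[i] != n) continue;
			for (q = delta, k = BASE; ; k += BASE) {
				t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
				if (q < t) break;
				PUT(digit(t + (q - t) % (BASE - t)));
				q = (q - t) / (BASE - t);
			}
			PUT(digit(q));
			bias = adapt(delta, h + 1, h == b);
			delta = 0;
			h++;
		}
	}
	out[o] = 0;
	return (int)o;
}

/* Encode a dotted name label by label; empty labels (and so a trailing dot) are rejected. */
int name_enc(char *out, size_t outsz, const char *name)
{
	size_t o = 0, l;
	int r;
	for (;; out[o++] = '.', name++) {
		l = strcspn(name, ".");
		if ((r = label_enc(out + o, outsz - o, name, l)) < 0) return -1;
		o += (size_t)r;
		name += l;
		if (!*name) return (int)o;
		if (o + 1 >= outsz) return -1;
	}
}

/* Two spellings match if their encoded forms are byte-identical (case folding omitted). */
int name_eq(const char *a, const char *b)
{
	char ea[256], eb[256];
	return name_enc(ea, sizeof ea, a) > 0 && name_enc(eb, sizeof eb, b) > 0 && !strcmp(ea, eb);
}
```

With this sketch, name_eq("ønsker.xn--nskeskyen-k8a.dk", "xn--nsker-uua.ønskeskyen.dk") is nonzero, because both spellings normalize to xn--nsker-uua.xn--nskeskyen-k8a.dk. Labels that are already ASCII, including ones already in xn-- form, pass through untouched, which is what lets mixed spellings compare equal.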
> > > > > int __lookup_name(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, int flags)
> > > > > {
> > > > > +	char _name[256];
> > > > >  	int cnt = 0, i, j;
> > > > >
> > > > >  	*canon = 0;
> > > > >  	if (name) {
> > > > > -		/* reject empty name and check len so it fits into temp bufs */
> > > > > -		size_t l = strnlen(name, 255);
> > > > > -		if (l-1 >= 254)
> > > > > +		/* convert unicode name to RFC3492 punycode */
> > > > > +		ssize_t l;
> > > > > +		if ((l = idnaenc(_name, name)) <= 0)
> > > > >  			return EAI_NONAME;
> > > > > -		memcpy(canon, name, l+1);
> > > > > +		memcpy(canon, _name, l+1);
> > > > > +		name = _name;
> > > > >  	}
> > > >
> > > > If it's not needed for hosts backend, this code probably belongs
> > > > localized to the dns lookup, rather than at the top of __lookup_name.
> > > >
> > > > BTW there's perhaps also a need for the opposite-direction
> > > > translation, both for ai_canonname (when a CNAME points to IDN) and
> > > > for getnameinfo reverse lookups. But that can be added as a second
> > > > patch I think.
> > >
> > > I have already written the code for decoding as well if need be :)
> >
> > Yay!
> >
> > > The only problem as I see it is that a unicode name can be a hair under
> > > 4 times larger (in bytes) than the punycode equivalent. Select any 4
> > > byte UTF-8 character and make labels exclusively containing that. All
> > > subsequent characters to the first will be encoded as an 'a'.
> > >
> > > This, by the way, also means that we should probably mess with the
> > > buffering when reading the hosts file.
> >
> > You mean increase it? Since HOST_NAME_MAX/NI_MAXHOST is 255 and this
> > is ABI, I don't think we can return names longer than 255 bytes. Such
> > names probably just need to be left as punycode when coming from dns,
> > and not supported in hosts files.
>
> I'm not sure that's the correct interpretation.
> For example when doing
> HTTP requests the browser will fill in the HTTP Host header with the
> punycode name. I think the intent is to do everything with the punycoded
> name and ditch utf8 permanently :(

That's ugly and backwards but fits with the header encoding being
ASCII. However it's not universal. The most important other place
domain names are used, in certificates, they're proper unicode;
punycode is not used there.

> What I meant here though was when reading from /etc/hosts the line
> buffer used is 512 bytes which isn't necessarily enough for one full
> domain in utf8.

But it is enough assuming we honor HOST_NAME_MAX.

> Take any cuneiform character of your choice, repeat it
> 56 times and you get a label of 63 chars that looks something like
> xn--4c4daaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> but has a utf8 size of 224 bytes. You can have almost 4 of those in an
> otherwise completely valid FQDN. The last one would have to lose 2 'a's
> (8 bytes) making the total 63+1+63+1+63+1+61+1=254, decoding to a total
> of 224+1+224+1+224+1+216+1=892 bytes.
>
> Now I will freely admit that I don't know for sure if this is legal but
> I believe that it is, and that the maximum length of 254 applies solely
> to the encoded name.
> I can at least say that dig will happily accept that massive domain
> name and complain on a domain with one more cuneiform in the last label.
> It will also complain if any one label exceeds 63 bytes post-encoding.
>
> For the record, glibc has #define NI_MAXHOST 1025, if it matters at all.

Yeah, so far I've considered that a mistake. For non-DNS names it's
rather up to the implementation what they support, but I don't think
there's significant merit in supporting gratuitously/maliciously long
hostnames.

Rich
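For reference, the worst-case arithmetic quoted above can be rechecked with a few lines of C. This is only a sketch under the assumptions stated in the quoted example: each label repeats a single 4-byte UTF-8 character, the first copy costs four punycode digits, each further copy costs one 'a', and every label is followed by a dot. The struct and function names are hypothetical.

```c
#include <stddef.h>

struct sizes { size_t encoded, decoded; };

/* reps[i] is how many copies of one 4-byte UTF-8 character label i holds
 * (assumed to be at least 1). Per the quoted example, the first copy costs
 * 4 punycode digits and each further copy costs a single 'a'. */
struct sizes name_sizes(const size_t *reps, size_t nlabels)
{
	struct sizes s = { 0, 0 };
	for (size_t i = 0; i < nlabels; i++) {
		s.encoded += 4 + 4 + (reps[i] - 1) + 1; /* "xn--" + digits + 'a's + '.' */
		s.decoded += reps[i] * 4 + 1;           /* UTF-8 bytes + '.' */
	}
	return s;
}
```

Calling it with repetition counts 56, 56, 56 and 54 gives 254 encoded bytes and 892 decoded bytes, matching the totals in the quoted message.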