From: Joakim Sindholt
To: musl@lists.openwall.com
Subject: Re: [PATCH v2] IDNA support in name lookups
Date: Sun, 23 Apr 2017 18:38:24 +0200
Message-ID: <20170423163824.GB15554@wirbelwind>
In-Reply-To: <20170423150747.GN17319@brightrain.aerifal.cx>
References: <20170329112629.GA3506324@wirbelwind> <20170402073026.GA4177284@wirbelwind> <20170423010100.GM17319@brightrain.aerifal.cx> <20170423081424.GA15554@wirbelwind> <20170423150747.GN17319@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com

On Sun, Apr 23, 2017 at 11:07:47AM -0400, Rich Felker wrote:
> > > > @@ -61,12 +230,25 @@ static int name_from_hosts(struct address buf[static MAXADDRS], char canon[stati
> > > >  		return EAI_SYSTEM;
> > > >  	}
> > > >  	while (fgets(line, sizeof line, f) && cnt < MAXADDRS) {
> > > > -		char *p, *z;
> > > > +		char idna[256];
> > > > +		ssize_t r;
> > > > +		char *p, *z, c;
> > > >
> > > >  		if ((p=strchr(line, '#'))) *p++='\n', *p=0;
> > > > -		for(p=line+1; (p=strstr(p, name)) &&
> > > > -			(!isspace(p[-1]) || !isspace(p[l])); p++);
> > > > -		if (!p) continue;
> > > > +		/* skip ip address and canonicalize names */
> > > > +		for (p=line; *p && !isspace(*p); p++);
> > > > +		while (*p) {
> > > > +			for (; *p && isspace(*p); p++);
> > > > +			for (z=p; *z && !isspace(*z); z++);
> > > > +			c = *z;
> > > > +			*z = 0;
> > > > +			r = idnaenc(idna, p);
> > > > +			*z = c;
> > > > +			if (r == l && memcmp(idna, name, l) == 0)
> > > > +				break;
> > > > +			p = z;
> > > > +		}
> > > > +		if (!*p) continue;
> > > >
> > > >  		/* Isolate IP address to parse */
> > > >  		for (p=line; *p && !isspace(*p); p++);
> > > > @@ -86,7 +268,7 @@ static int name_from_hosts(struct address buf[static MAXADDRS], char canon[stati
> > > >  		for (; *p && isspace(*p); p++);
> > > >  		for (z=p; *z && !isspace(*z); z++);
> > > >  		*z = 0;
> > > > -		if (is_valid_hostname(p)) memcpy(canon, p, z-p+1);
> > > > +		if ((r = idnaenc(idna, p)) > 0) memcpy(canon, idna, r);
> > > >  	}
> > > >  	__fclose_ca(f);
> > > >  	return cnt ? cnt : badfam;
> > > > @@ -285,15 +467,17 @@ static int addrcmp(const void *_a, const void *_b)
> > > >
> > > Is there any reason this needs to be done, or should be done, for
> > > lookups from the hosts file? IDN/punycode is a hack for transporting
> > > unicode names on top of DNS protocol. For hosts file you can just put
> > > the proper unicode strings directly in the file.
> >
> > My logic was that some people might have/want to have punycode in their
> > hosts file, and some might even (accidentally or otherwise) have mixed
> > punycode-unicode names written down.
> > In any case I wanted it to Just
> > Work™ so decoding the host from punycode before comparing seemed to be
> > the easiest way to ensure it catches everything.
> > This was prompted by a paper wedding invitation I received where the
> > couple had listed their gift registry in punycode form. This says to me
> > that people just don't know or care about this, and since the hosts file
> > is used extensively by non-developers as well I would personally prefer
> > if this worked regardless of what deranged things people might put in
> > there.
> > In fact I could imagine a lot of people shoving the punycode form in
> > there under the assumption that that would work better.
> > Also, suppose you have callers from all over the system dialing out to a
> > server but some of them call xn--foo-bar.com and others dial the unicode
> > version. Is it really reasonable that you should need to list this
> > domain twice in the hosts file for it to work?
>
> If you want the punycode to resolve to the unicode name in hosts file,
> I would do the opposite: decode it to proper utf-8 and match that.
> Nobody should have to deal with wacky punycode encodings to make
> non-latin hostnames work. They should just work transparently and
> punycode should be an implementation detail inside layers that
> users/programmers don't see.
>
> If I'm not mistaken, your patch as-is actually _breaks_ support for
> literal utf-8 hostnames in the hosts file.

No, it supports all combinations. It encodes every hostname found to its
punycode equivalent and compares only punycode. The encoding is done on
a per-label basis, so ønskeskyen.dk becomes "xn--nskeskyen-k8a" and "dk",
each treated entirely separately. Suppose they had a subdomain,
ønsker.ønskeskyen.dk; then this approach would work even if someone put
"ønsker.xn--nskeskyen-k8a.dk" in their hosts file, and even if you looked
it up as xn--nsker-uua.ønskeskyen.dk.
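
To make the per-label behaviour concrete, here is a rough standalone sketch.
It is not the idnaenc() from the patch: the names sketch_idna_encode() and
same_host() are made up for illustration, and it only does the raw RFC 3492
encoding of each label, with no case folding or nameprep-style mapping and
only minimal UTF-8 validation. Labels that are already ASCII (including
existing xn-- ones) are passed through untouched, which is what lets mixed
spellings compare equal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Punycode parameters from RFC 3492, section 5. */
enum { BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700, BIAS0 = 72, N0 = 128 };
/* Append a byte to tmp[], failing once the label would exceed 63 bytes. */
#define PUT(ch) do { if (o >= 63) return -1; tmp[o++] = (char)(ch); } while (0)

static char digit(uint32_t d) { return (char)(d < 26 ? 'a' + d : '0' + (d - 26)); }

/* Bias adaptation (RFC 3492, section 6.1). */
static uint32_t adapt(uint32_t delta, uint32_t npoints, int first)
{
	uint32_t k = 0;
	delta = first ? delta / DAMP : delta / 2;
	delta += delta / npoints;
	for (; delta > ((BASE - TMIN) * TMAX) / 2; k += BASE) delta /= BASE - TMIN;
	return k + (BASE - TMIN + 1) * delta / (delta + SKEW);
}

/* Decode one UTF-8 sequence; returns bytes used, 0 if malformed (simplified). */
static size_t utf8_next(const unsigned char *s, size_t n, uint32_t *cp)
{
	size_t i, len = s[0] < 0x80 ? 1 : (s[0] & 0xe0) == 0xc0 ? 2 :
		(s[0] & 0xf0) == 0xe0 ? 3 : (s[0] & 0xf8) == 0xf0 ? 4 : 0;
	if (!len || len > n) return 0;
	*cp = len == 1 ? s[0] : s[0] & (0xffu >> (len + 1));
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xc0) != 0x80) return 0;
		*cp = *cp << 6 | (s[i] & 0x3f);
	}
	return len;
}

/* Encode one UTF-8 label of n bytes into out (capacity cap); length or -1. */
static int encode_label(char *out, size_t cap, const char *in, size_t n)
{
	uint32_t cp[63], m, q, t, k, cur = N0, delta = 0, bias = BIAS0;
	size_t len = 0, o = 4, h, b, i, used;
	char tmp[63] = "xn--";
	int ascii = 1;
	for (i = 0; i < n; i += used) {
		if (len == 63) return -1;
		if (!(used = utf8_next((const unsigned char *)in + i, n - i, &cp[len]))) return -1;
		if (cp[len++] >= 0x80) ascii = 0;
	}
	if (ascii) { /* ASCII labels (including existing xn-- ones) pass through */
		if (!n || n >= cap) return -1;
		memcpy(out, in, n);
		return (int)n;
	}
	for (i = 0; i < len; i++) if (cp[i] < 0x80) PUT(cp[i]);
	h = b = o - 4;
	if (b) PUT('-');
	while (h < len) {
		/* next code point to handle: the smallest one not yet encoded */
		for (m = UINT32_MAX, i = 0; i < len; i++)
			if (cp[i] >= cur && cp[i] < m) m = cp[i];
		delta += (m - cur) * (uint32_t)(h + 1);
		for (cur = m, i = 0; i < len; i++) {
			if (cp[i] < cur) delta++;
			if (cp[i] != cur) continue;
			/* emit delta as a variable-length base-36 integer */
			for (q = delta, k = BASE; ; k += BASE) {
				t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
				if (q < t) break;
				PUT(digit(t + (q - t) % (BASE - t)));
				q = (q - t) / (BASE - t);
			}
			PUT(digit(q));
			bias = adapt(delta, (uint32_t)(h + 1), h == b);
			delta = 0, h++;
		}
		delta++, cur++;
	}
	if (o >= cap) return -1;
	memcpy(out, tmp, o);
	return (int)o;
}

/* Encode a dotted name label by label into out[256]; empty labels are rejected. */
int sketch_idna_encode(char *out, const char *name)
{
	size_t o = 0, n;
	int r;
	assert(out && name);
	for (;; out[o++] = '.', name++) {
		n = strcspn(name, ".");
		if ((r = encode_label(out + o, 254 - o, name, n)) <= 0) return -1;
		o += (size_t)r;
		if (!*(name += n)) break;
	}
	out[o] = 0;
	return (int)o;
}

/* Two spellings name the same host if their encoded forms are identical. */
int same_host(const char *a, const char *b)
{
	char ea[256], eb[256];
	return sketch_idna_encode(ea, a) > 0 && sketch_idna_encode(eb, b) > 0 && !strcmp(ea, eb);
}
```

With this, same_host("ønsker.xn--nskeskyen-k8a.dk", "xn--nsker-uua.ønskeskyen.dk")
is true, because both sides normalise to the same all-punycode string before
the comparison.
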
There are a couple of reasons that I chose to do it this way, with the
primary being that it meant I only had to include the encoding function
and the secondary being that the encoded name will always fit in 254
characters, whereas the decoded equivalent might be larger (see below).

I don't have any ideological opposition to doing the comparison entirely
in unicode though. This was just the more practical solution for the
time being.

> > > >  int __lookup_name(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, int flags)
> > > >  {
> > > > +	char _name[256];
> > > >  	int cnt = 0, i, j;
> > > >
> > > >  	*canon = 0;
> > > >  	if (name) {
> > > > -		/* reject empty name and check len so it fits into temp bufs */
> > > > -		size_t l = strnlen(name, 255);
> > > > -		if (l-1 >= 254)
> > > > +		/* convert unicode name to RFC3492 punycode */
> > > > +		ssize_t l;
> > > > +		if ((l = idnaenc(_name, name)) <= 0)
> > > >  			return EAI_NONAME;
> > > > -		memcpy(canon, name, l+1);
> > > > +		memcpy(canon, _name, l+1);
> > > > +		name = _name;
> > > >  	}
> > >
> > > If it's not needed for hosts backend, this code probably belongs
> > > localized to the dns lookup, rather than at the top of __lookup_name.
> > >
> > > BTW there's perhaps also a need for the opposite-direction
> > > translation, both for ai_canonname (when a CNAME points to IDN) and
> > > for getnameinfo reverse lookups. But that can be added as a second
> > > patch I think.
> >
> > I have already written the code for decoding as well if need be :)
>
> Yay!
>
> > The only problem as I see it is that a unicode name can be a hair under
> > 4 times larger (in bytes) than the punycode equivalent. Select any 4
> > byte UTF-8 character and make labels exclusively containing that. All
> > subsequent characters to the first will be encoded as an 'a'.
> >
> > This, by the way, also means that we should probably mess with the
> > buffering when reading the hosts file.
>
> You mean increase it?
> Since HOST_NAME_MAX/NI_MAXHOST is 255 and this
> is ABI, I don't think we can return names longer than 255 bytes. Such
> names probably just need to be left as punycode when coming from dns,
> and not supported in hosts files.

I'm not sure that's the correct interpretation. For example, when doing
HTTP requests the browser will fill in the HTTP Host header with the
punycode name. I think the intent is to do everything with the
punycoded name and ditch utf8 permanently :(

What I meant here, though, is that when reading from /etc/hosts the line
buffer used is 512 bytes, which isn't necessarily enough for one full
domain in utf8. Take any cuneiform character of your choice, repeat it
56 times and you get a label of 63 chars that looks something like
xn--4c4daaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
but has a utf8 size of 224 bytes. You can have almost 4 of those in an
otherwise completely valid FQDN. The last one would have to lose 2 'a's
(8 bytes), making the total 63+1+63+1+63+1+61+1=254, decoding to a total
of 224+1+224+1+224+1+216+1=892 bytes.

Now I will freely admit that I don't know for sure if this is legal, but
I believe that it is, and that the maximum length of 254 applies solely
to the encoded name. I can at least say that dig will happily accept
that massive domain name and complain about a domain with one more
cuneiform character in the last label. It will also complain if any one
label exceeds 63 bytes post-encoding.

For the record, glibc has #define NI_MAXHOST 1025, if it matters at all.
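
The arithmetic above can be reproduced with a few lines of C. This is only a
back-of-the-envelope sketch: the function names are invented for
illustration, and it assumes, as in the example label, that the first
character costs 4 encoded digits ("4c4d") and every repeat costs a single
'a'. Other code points can cost a different number of leading digits.

```c
#include <assert.h>
#include <stddef.h>

enum {
	LABEL_MAX  = 63,  /* longest encoded label, in bytes */
	ACE_PREFIX = 4,   /* strlen("xn--") */
	FIRST_CHAR = 4,   /* ASSUMPTION: digits spent on the first character, as in "4c4d" */
	UTF8_BYTES = 4    /* UTF-8 bytes per character outside the BMP */
};

/* UTF-8 size of a label of `enc` encoded bytes built from one repeated character. */
size_t decoded_label_bytes(size_t enc)
{
	assert(enc >= ACE_PREFIX + FIRST_CHAR && enc <= LABEL_MAX);
	/* one character for the leading digits, one more per trailing 'a' */
	return (1 + enc - ACE_PREFIX - FIRST_CHAR) * UTF8_BYTES;
}

/* Pack labels as large as possible, each followed by one dot, into `budget`
 * encoded bytes. Returns the decoded size and stores the encoded size used. */
size_t worst_case_decoded(size_t budget, size_t *encoded)
{
	size_t enc = 0, dec = 0, label;
	while (budget - enc > ACE_PREFIX + FIRST_CHAR) {
		label = budget - enc - 1;      /* keep one byte for the dot */
		if (label > LABEL_MAX) label = LABEL_MAX;
		enc += label + 1;
		dec += decoded_label_bytes(label) + 1;
	}
	*encoded = enc;
	return dec;
}
```

Calling worst_case_decoded(254, &enc) gives 892 with enc set to 254, which is
well past the 512-byte line buffer used for /etc/hosts.
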