From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/11273
Path: news.gmane.org!.POSTED!not-for-mail
From: Joakim Sindholt <opensource@zhasha.com>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [PATCH v2] IDNA support in name lookups
Date: Sun, 23 Apr 2017 18:38:24 +0200
Message-ID: <20170423163824.GB15554@wirbelwind>
References: <20170329112629.GA3506324@wirbelwind>
 <20170402073026.GA4177284@wirbelwind>
 <20170423010100.GM17319@brightrain.aerifal.cx>
 <20170423081424.GA15554@wirbelwind>
 <20170423150747.GN17319@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: blaine.gmane.org 1492965520 18939 195.159.176.226 (23 Apr 2017 16:38:40 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Sun, 23 Apr 2017 16:38:40 +0000 (UTC)
User-Agent: Mutt/1.5.24 (2015-08-30)
To: musl@lists.openwall.com
Original-X-From: musl-return-11288-gllmg-musl=m.gmane.org@lists.openwall.com Sun Apr 23 18:38:35 2017
Return-path: <musl-return-11288-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.84_2)
	(envelope-from <musl-return-11288-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1d2KWv-0004mZ-Gq
	for gllmg-musl@m.gmane.org; Sun, 23 Apr 2017 18:38:33 +0200
Original-Received: (qmail 11422 invoked by uid 550); 23 Apr 2017 16:38:37 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 11398 invoked from network); 23 Apr 2017 16:38:37 -0000
Content-Disposition: inline
In-Reply-To: <20170423150747.GN17319@brightrain.aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:11273
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/11273>

On Sun, Apr 23, 2017 at 11:07:47AM -0400, Rich Felker wrote:
> > > > @@ -61,12 +230,25 @@ static int name_from_hosts(struct address buf[static MAXADDRS], char canon[stati
> > > >  		return EAI_SYSTEM;
> > > >  	}
> > > >  	while (fgets(line, sizeof line, f) && cnt < MAXADDRS) {
> > > > -		char *p, *z;
> > > > +		char idna[256];
> > > > +		ssize_t r;
> > > > +		char *p, *z, c;
> > > >  
> > > >  		if ((p=strchr(line, '#'))) *p++='\n', *p=0;
> > > > -		for(p=line+1; (p=strstr(p, name)) &&
> > > > -			(!isspace(p[-1]) || !isspace(p[l])); p++);
> > > > -		if (!p) continue;
> > > > +		/* skip ip address and canonicalize names */
> > > > +		for (p=line; *p && !isspace(*p); p++);
> > > > +		while (*p) {
> > > > +			for (; *p && isspace(*p); p++);
> > > > +			for (z=p; *z && !isspace(*z); z++);
> > > > +			c = *z;
> > > > +			*z = 0;
> > > > +			r = idnaenc(idna, p);
> > > > +			*z = c;
> > > > +			if (r == l && memcmp(idna, name, l) == 0)
> > > > +				break;
> > > > +			p = z;
> > > > +		}
> > > > +		if (!*p) continue;
> > > >  
> > > >  		/* Isolate IP address to parse */
> > > >  		for (p=line; *p && !isspace(*p); p++);
> > > > @@ -86,7 +268,7 @@ static int name_from_hosts(struct address buf[static MAXADDRS], char canon[stati
> > > >  		for (; *p && isspace(*p); p++);
> > > >  		for (z=p; *z && !isspace(*z); z++);
> > > >  		*z = 0;
> > > > -		if (is_valid_hostname(p)) memcpy(canon, p, z-p+1);
> > > > +		if ((r = idnaenc(idna, p)) > 0) memcpy(canon, idna, r);
> > > >  	}
> > > >  	__fclose_ca(f);
> > > >  	return cnt ? cnt : badfam;
> > > > @@ -285,15 +467,17 @@ static int addrcmp(const void *_a, const void *_b)
> > > 
> > > Is there any reason this needs to be done, or should be done, for
> > > lookups from the hosts file? IDN/punycode is a hack for transporting
> > > unicode names on top of DNS protocol. For hosts file you can just put
> > > the proper unicode strings directly in the file.
> > 
> > My logic was that some people might have/want to have punycode in their
> > hosts file, and some might even (accidentally or otherwise) have mixed
> > punycode-unicode names written down. In any case I wanted it to Just
> > Work™ so decoding the host from punycode before comparing seemed to be
> > the easiest way to ensure it catches everything.
> > This was prompted by a paper wedding invitaion I received where the
> > couple had listed their gift registry in punycode form. This says to me
> > that people just dont know or care about this, and since the hosts file
> > is used extensively by non-developers as well I would personally prefer
> > if this worked regardless of what deranged things people might put in
> > there.
> > In fact I could imagine a lot of people shoving the punycode form in
> > there under the assumption that that would work better.
> > Also, suppose you have callers from all over the system dialing out to a
> > server but some of them call xn--foo-bar.com and others dial the unicode
> > version. Is it really reasonable that you should need to list this
> > domain twice in the hosts file for it to work?
> 
> If you want the punycode to resolve to the unicode name in hosts file,
> I would do the opposite: decode it to proper utf-8 and match that.
> Nobody should have to deal with wacky punycode encodings to make
> non-latin hostnames work. They should just work transparently and
> punycode should be an implementation detail inside layers that
> users/programmers don't see.
> 
> If I'm not mistaken, your patch as-is actually _breaks_ support for
> literal utf-8 hostnames in the hosts file.

No, it supports all combinations. It encodes every hostname found to the
punycode equivalent and compares only punycode. The encoding is done on
a per-label basis so ønskeskyen.dk is "xn--nskeskyen-k8a" and "dk" treated
entirely separately.
Suppose they had a subdomain, ønsker.ønskeskyen.dk, then this approach
would work even if someone put "ønsker.xn--nskeskyen-k8a.dk" in their
hosts file, even if you looked it up as xn--nsker-uua.ønskeskyen.dk.

There are a couple of reasons that I chose to do it this way, with the
primary being that it meant I only had to include the encoding function
and the secondary being that the encoded name will always fit in 254
characters, whereas the decoded equivalent might be larger (see below).

I don't have any ideological opposition to doing the comparison entirely
in unicode though. This was just the more practical solution for the
time being.
 
> > > >  int __lookup_name(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, int flags)
> > > >  {
> > > > +	char _name[256];
> > > >  	int cnt = 0, i, j;
> > > >  
> > > >  	*canon = 0;
> > > >  	if (name) {
> > > > -		/* reject empty name and check len so it fits into temp bufs */
> > > > -		size_t l = strnlen(name, 255);
> > > > -		if (l-1 >= 254)
> > > > +		/* convert unicode name to RFC3492 punycode */
> > > > +		ssize_t l;
> > > > +		if ((l = idnaenc(_name, name)) <= 0)
> > > >  			return EAI_NONAME;
> > > > -		memcpy(canon, name, l+1);
> > > > +		memcpy(canon, _name, l+1);
> > > > +		name = _name;
> > > >  	}
> > > 
> > > If it's not needed for hosts backend, this code probably belongs
> > > localized to the dns lookup, rather than at the top of __lookup_name.
> > > 
> > > BTW there's perhaps also a need for the opposite-direction
> > > translation, both for ai_canonname (when a CNAME points to IDN) and
> > > for getnameinfo reverse lookups. But that can be added as a second
> > > patch I think.
> > 
> > I have already written the code for decoding as well if need be :)
> 
> Yay!
> 
> > The only problem as I see it is that a unicode name can be a hair under
> > 4 times larger (in bytes) than the punycode equivalent. Select any 4
> > byte UTF-8 character and make labels exclusively containing that. All
> > subsequent characters to the first will be encoded as an 'a'.
> > 
> > This, by the way, also means that we should probably mess with the
> > buffering when reading the hosts file.
> 
> You mean increase it? Since HOST_NAME_MAX/NI_MAXHOST is 255 and this
> is ABI, I don't think we can return names longer than 255 bytes. Such
> names probably just need to be left as punycode when coming from dns,
> and not supported in hosts files.

I'm not sure that's the correct interpretation. For example when doing
HTTP requests the browser will fill in the HTTP Host header with the
punycode name. I think the intent is to do everything with the punycoded
name and ditch utf8 permanently :(

What I meant here though was when reading from /etc/hosts the line
buffer used is 512 bytes which isn't necessarily enough for one full
domain in utf8. Take any cuneiform character of your choice, repeat it
56 times and you get a label of 63 chars that looks something like
xn--4c4daaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
but has a utf8 size of 224 bytes. You can have almost 4 of those in an
otherwise completely valid FQDN. The last one would have to lose 2 'a's
(8 bytes) making the total 63+1+63+1+63+1+61+1=254, decoding to a total
of 224+1+224+1+224+1+216+1=892 bytes.

Now I will freely admit that I don't know for sure if this is legal but
I believe that it is, and that the maximum length of 254 applies solely
to the encoded name.
I can at least say that dig will happily accept that massive domain
name and complain on a domain with one more cuneiform in the last label.
It will also complain if any one label exceeds 63 bytes post-encoding.

For the record, glibc has #define NI_MAXHOST 1025, if it matters at all.