From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/11274
Path: news.gmane.org!.POSTED!not-for-mail
From: Rich Felker <dalias@libc.org>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [PATCH v2] IDNA support in name lookups
Date: Sun, 23 Apr 2017 12:56:14 -0400
Message-ID: <20170423165614.GP17319@brightrain.aerifal.cx>
References: <20170329112629.GA3506324@wirbelwind>
 <20170402073026.GA4177284@wirbelwind>
 <20170423010100.GM17319@brightrain.aerifal.cx>
 <20170423081424.GA15554@wirbelwind>
 <20170423150747.GN17319@brightrain.aerifal.cx>
 <20170423163824.GB15554@wirbelwind>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: blaine.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Trace: blaine.gmane.org 1492966587 6896 195.159.176.226 (23 Apr 2017 16:56:27 GMT)
X-Complaints-To: usenet@blaine.gmane.org
NNTP-Posting-Date: Sun, 23 Apr 2017 16:56:27 +0000 (UTC)
User-Agent: Mutt/1.5.21 (2010-09-15)
To: musl@lists.openwall.com
Original-X-From: musl-return-11289-gllmg-musl=m.gmane.org@lists.openwall.com Sun Apr 23 18:56:22 2017
Return-path: <musl-return-11289-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.84_2)
	(envelope-from <musl-return-11289-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1d2KoA-0001iH-GB
	for gllmg-musl@m.gmane.org; Sun, 23 Apr 2017 18:56:22 +0200
Original-Received: (qmail 22263 invoked by uid 550); 23 Apr 2017 16:56:26 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 22243 invoked from network); 23 Apr 2017 16:56:26 -0000
Content-Disposition: inline
In-Reply-To: <20170423163824.GB15554@wirbelwind>
Original-Sender: Rich Felker <dalias@aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:11274
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/11274>

On Sun, Apr 23, 2017 at 06:38:24PM +0200, Joakim Sindholt wrote:
> > If I'm not mistaken, your patch as-is actually _breaks_ support for
> > literal utf-8 hostnames in the hosts file.
> 
> No, it supports all combinations. It encodes every hostname found to the
> punycode equivalent and compares only punycode. The encoding is done on
> a per-label basis so ønskeskyen.dk is "xn--nskeskyen-k8a" and "dk" treated
> entirely separately.
> Suppose they had a subdomain, ønsker.ønskeskyen.dk, then this approach
> would work even if someone put "ønsker.xn--nskeskyen-k8a.dk" in their
> hosts file, even if you looked it up as xn--nsker-uua.ønskeskyen.dk.

OK, that makes more sense and it's non-harmful.
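
For anyone following along, here is a minimal sketch of the per-label
approach described above, as I understand it. It is not the patch's
idnaenc(); encode_label, normalize and same_host are hypothetical names
made up for illustration. It assumes valid UTF-8 input and skips case
folding and the other IDNA mapping steps. Each label is run through the
RFC 3492 encoder, ASCII labels (including already-encoded "xn--" ones)
pass through untouched, and only the encoded forms are compared.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration; this is not the patch's idnaenc(). */

static char digit(uint32_t d) { return d < 26 ? 'a' + d : '0' + (d - 26); }

/* Bias adaptation from RFC 3492 section 6.1. */
static uint32_t adapt(uint32_t delta, uint32_t npoints, int first)
{
	uint32_t k = 0;
	delta = first ? delta / 700 : delta / 2;
	delta += delta / npoints;
	for (; delta > 35 * 26 / 2; k += 36) delta /= 35;
	return k + 36 * delta / (delta + 38);
}

/* Encode one label. Pure-ASCII labels (including ones that are already
 * "xn--..." encoded) are copied unchanged. Assumes valid UTF-8 input.
 * Returns the output length, or -1 if the result does not fit. */
static int encode_label(const char *in, size_t len, char *out, size_t outsz)
{
	uint32_t cp[64], cur = 128, delta = 0, bias = 72;
	size_t n = 0, i = 0, o = 0, b, h;
	int ascii = 1;

	assert(outsz >= 5);
	while (i < len) {               /* minimal UTF-8 decoder */
		unsigned char c = in[i++];
		int extra = c < 0x80 ? 0 : c < 0xe0 ? 1 : c < 0xf0 ? 2 : 3;
		uint32_t v = extra ? c & (0x3f >> extra) : c;
		while (extra-- > 0 && i < len) v = v << 6 | (in[i++] & 0x3f);
		if (n == 64) return -1;
		if (v >= 128) ascii = 0;
		cp[n++] = v;
	}
	if (ascii) {
		if (len >= outsz) return -1;
		memcpy(out, in, len);
		out[len] = 0;
		return (int)len;
	}
#define PUT(ch) do { if (o + 1 >= outsz) return -1; out[o++] = (ch); } while (0)
	memcpy(out, "xn--", 4);
	o = 4;
	for (i = 0; i < n; i++) if (cp[i] < 128) PUT((char)cp[i]);
	b = h = o - 4;                  /* number of basic code points */
	if (b) PUT('-');
	while (h < n) {
		uint32_t m = UINT32_MAX;    /* smallest code point not yet handled */
		for (i = 0; i < n; i++) if (cp[i] >= cur && cp[i] < m) m = cp[i];
		delta += (m - cur) * (uint32_t)(h + 1);
		cur = m;
		for (i = 0; i < n; i++) {
			if (cp[i] < cur) delta++;
			if (cp[i] != cur) continue;
			uint32_t q = delta, k, t;
			for (k = 36; ; k += 36) {   /* variable-length integer */
				t = k <= bias ? 1 : k >= bias + 26 ? 26 : k - bias;
				if (q < t) break;
				PUT(digit(t + (q - t) % (36 - t)));
				q = (q - t) / (36 - t);
			}
			PUT(digit(q));
			bias = adapt(delta, (uint32_t)(h + 1), h == b);
			delta = 0;
			h++;
		}
		delta++;
		cur++;
	}
#undef PUT
	out[o] = 0;
	return (int)o;
}

/* Encode every label of a dotted name. Returns the length, or -1 on an
 * empty or overlong label or if the result does not fit in out. */
int normalize(const char *name, char *out, size_t outsz)
{
	size_t o = 0;
	for (;;) {
		char lab[64];               /* 63 bytes per label plus NUL */
		size_t len = strcspn(name, ".");
		int l = encode_label(name, len, lab, sizeof lab);
		if (l <= 0 || o + l + 2 > outsz) return -1;
		memcpy(out + o, lab, l);
		o += l;
		name += len;
		if (!*name) break;
		out[o++] = '.';
		name++;
	}
	out[o] = 0;
	return (int)o;
}

/* Two spellings name the same host if their encoded forms match. */
int same_host(const char *a, const char *b)
{
	char ea[256], eb[256];
	return normalize(a, ea, sizeof ea) > 0 && normalize(b, eb, sizeof eb) > 0
		&& !strcmp(ea, eb);
}
```

With this, same_host("ønsker.xn--nskeskyen-k8a.dk",
"xn--nsker-uua.ønskeskyen.dk") is true, since both sides normalize to
the same all-ASCII name. A real version would also want to compare
case-insensitively.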

> There are a couple of reasons that I chose to do it this way, with the
> primary being that it meant I only had to include the encoding function
> and the secondary being that the encoded name will always fit in 254
> characters, whereas the decoded equivalent might be larger (see below).
> 
> I don't have any ideological opposition to doing the comparison entirely
> in unicode though. This was just the more practical solution for the
> time being.

It's probably okay the way you're doing it, then. I failed to
understand that it accepts either. The main other consideration might
be whether it's noticeably slow on large hosts files, but I don't think
the existing code was terribly fast anyway.

> > > > >  int __lookup_name(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, int flags)
> > > > >  {
> > > > > +	char _name[256];
> > > > >  	int cnt = 0, i, j;
> > > > >  
> > > > >  	*canon = 0;
> > > > >  	if (name) {
> > > > > -		/* reject empty name and check len so it fits into temp bufs */
> > > > > -		size_t l = strnlen(name, 255);
> > > > > -		if (l-1 >= 254)
> > > > > +		/* convert unicode name to RFC3492 punycode */
> > > > > +		ssize_t l;
> > > > > +		if ((l = idnaenc(_name, name)) <= 0)
> > > > >  			return EAI_NONAME;
> > > > > -		memcpy(canon, name, l+1);
> > > > > +		memcpy(canon, _name, l+1);
> > > > > +		name = _name;
> > > > >  	}
> > > > 
> > > > If it's not needed for hosts backend, this code probably belongs
> > > > localized to the dns lookup, rather than at the top of __lookup_name.
> > > > 
> > > > BTW there's perhaps also a need for the opposite-direction
> > > > translation, both for ai_canonname (when a CNAME points to IDN) and
> > > > for getnameinfo reverse lookups. But that can be added as a second
> > > > patch I think.
> > > 
> > > I have already written the code for decoding as well if need be :)
> > 
> > Yay!
> > 
> > > The only problem as I see it is that a unicode name can be a hair under
> > > 4 times larger (in bytes) than the punycode equivalent. Select any 4
> > > byte UTF-8 character and make labels exclusively containing that. All
> > > subsequent characters to the first will be encoded as an 'a'.
> > > 
> > > This, by the way, also means that we should probably mess with the
> > > buffering when reading the hosts file.
> > 
> > You mean increase it? Since HOST_NAME_MAX/NI_MAXHOST is 255 and this
> > is ABI, I don't think we can return names longer than 255 bytes. Such
> > names probably just need to be left as punycode when coming from dns,
> > and not supported in hosts files.
> 
> I'm not sure that's the correct interpretation. For example when doing
> HTTP requests the browser will fill in the HTTP Host header with the
> punycode name. I think the intent is to do everything with the punycoded
> name and ditch utf8 permanently :(

That's ugly and backwards but fits with the header encoding being
ASCII. However, it's not universal. In certificates, the most important
other place domain names are used, they're proper unicode; punycode is
not used there.

> What I meant here though was when reading from /etc/hosts the line
> buffer used is 512 bytes which isn't necessarily enough for one full
> domain in utf8.

But it is enough assuming we honor HOST_NAME_MAX.

> Take any cuneiform character of your choice, repeat it
> 56 times and you get a label of 63 chars that looks something like
> xn--4c4daaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
> but has a utf8 size of 224 bytes. You can have almost 4 of those in an
> otherwise completely valid FQDN. The last one would have to lose 2 'a's
> (8 bytes) making the total 63+1+63+1+63+1+61+1=254, decoding to a total
> of 224+1+224+1+224+1+216+1=892 bytes.
> 
> Now I will freely admit that I don't know for sure if this is legal but
> I believe that it is, and that the maximum length of 254 applies solely
> to the encoded name.
> I can at least say that dig will happily accept that massive domain
> name and complain on a domain with one more cuneiform in the last label.
> It will also complain if any one label exceeds 63 bytes post-encoding.
> 
> For the record, glibc has #define NI_MAXHOST 1025, if it matters at all.

Yeah, so far I've considered that a mistake. For non-DNS names it's
rather up to the implementation what they support, but I don't think
there's significant merit in supporting gratuitously/maliciously long
hostnames.
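
To put numbers on the worst case quoted above, here is a rough sketch
of the size arithmetic. All names in it are hypothetical. It assumes,
as in the quoted example, that the first copy of a 4-byte character
costs 4 punycode digits and every later copy costs a single 'a', and it
treats 255 as a plain byte limit without worrying about the terminator.

```c
#include <stddef.h>

enum { MAX_HOST = 255 };  /* HOST_NAME_MAX/NI_MAXHOST value cited in this thread */

/* Encoded size of a label made of n >= 1 copies of one 4-byte character:
 * "xn--" + 4 digits for the first copy + one 'a' per later copy. */
size_t encoded_label(size_t n) { return 4 + 4 + (n - 1); }

/* UTF-8 size of the same label: 4 bytes per copy. */
size_t decoded_label(size_t n) { return 4 * n; }

/* Sum label sizes, counting one '.' per label as the quoted sum does. */
size_t name_size(const size_t *copies, size_t count, size_t (*size)(size_t))
{
	size_t total = 0;
	for (size_t i = 0; i < count; i++) total += size(copies[i]) + 1;
	return total;
}

/* Nonzero if a name of this many bytes stays within MAX_HOST. */
int fits(size_t bytes) { return bytes <= MAX_HOST; }
```

For labels of 56, 56, 56 and 54 copies this gives 254 bytes encoded and
892 bytes decoded, so the encoded form fits and the decoded form does
not, which is the case where the name would have to stay as punycode.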

Rich