From: Florian Weimer
To: Rich Felker
Cc: musl@lists.openwall.com
Subject: Re: [musl] TCP fallback open questions
Date: Tue, 20 Sep 2022 11:42:04 +0200
Message-ID: <87wn9yo00j.fsf@oldenburg.str.redhat.com>
References: <20220916041435.GL9709@brightrain.aerifal.cx>

* Rich Felker:

> In principle we could end up using N*M TCP sockets for an exhaustive
> parallel query. N and M are small enough that this isn't huge, but
> it's also not really nice. Given that the switch to TCP was triggered
> by a truncated UDP response, we already know that the responding
> server *knows the answer* and just can't send it within the size
> limits. So a reasonable course of action is just to open a TCP
> connection to the nameserver that issued the truncated response. This
> is not necessarily going to be optimal -- it's possible that another
> nameserver has gotten a response in the mean time and that the
> round-trip for TCP handshake and payload would be much lower to that
> other server. But I'm doubtful that consuming extra kernel resources
> and producing extra network load to optimize the latency here is a
> reasonable tradeoff.

The large centralized load balancers typically do not share caches
between their UDP and TCP endpoints, at least not immediately, nor
between different UDP frontend servers behind the load balancer. So
the assumption that the cache is hot at the time of the TCP query is
probably not true in that case. But it probably does not matter.

> I'm assuming so far that each question at least would have its own TCP
> connection (if truncated as UDP). Using multiple nameservers in
> parallel with TCP would maybe be an option if we were doing multiple
> queries on the same connection, but I'm not aware of whether TCP DNS
> has any sort of "pipelining" that would make this perform reasonably.
> Maybe if "priming" the question via UDP it doesn't matter though and
> we could expect the queries to be processed immediately with cached
> results? I don't think I like this but I'm just raising it for
> completeness.

TCP DNS has pipelining. The glibc stub resolver exercises that,
sending the A and AAAA queries back-to-back over TCP, probably in the
same segment. I think it should even deal with reordered responses
(so no head-of-line blocking), but I'm not sure if recursive resolver
code actually exercises it by reordering replies.
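As a concrete illustration of the framing involved: DNS over TCP
prefixes every message with a two-byte length field (RFC 1035,
section 4.2.2), so pipelining the A and AAAA queries just means
writing both framed messages back-to-back. A minimal sketch, not
glibc's or musl's actual code; q_a and q_aaaa are assumed to hold
already-encoded query packets of at most 65535 bytes:

    /* Pipeline two DNS queries on one TCP connection.  Both
     * length-prefixed messages go out in one writev(), often in a
     * single segment.  The caller must handle short writes and match
     * responses to queries by message ID, since replies may arrive
     * in either order. */
    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    static ssize_t send_pipelined(int fd,
                                  const unsigned char *q_a, size_t alen,
                                  const unsigned char *q_aaaa, size_t blen)
    {
        unsigned char la[2] = { alen >> 8, alen };
        unsigned char lb[2] = { blen >> 8, blen };
        struct iovec iov[4] = {
            { la, 2 }, { (void *)q_a, alen },
            { lb, 2 }, { (void *)q_aaaa, blen },
        };
        return writev(fd, iov, 4);
    }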
Historically, it's unfriendly to keep TCP connections to recursive
resolvers open for extended periods of time. Furthermore, it
complicates the retry logic in the client: once you keep connections
open, an RST in response to a send does not indicate an error (the
server may just have dropped the question), so you need to retry in
that case with a fresh connection.

> TL;DR summary: my leaning is to do one TCP connection per question
> that needs fallback, to the nameserver that issued the truncated
> response for the question. Does this seem reasonable? Am I overlooking
> anything important?

It's certainly the most conservative approach.

> 3. Timeouts:
>
> UDP being datagram based, there is no condition where we have to worry
> about blocking and getting stuck in the middle of a partial read.
> Timeout occurs just at the loop level.
>
> Are there any special considerations for timeout here using TCP? My
> leaning is no, since we'll still be in a poll loop regime, and
> regardless of blocking state on the socket, recv should do partial
> reads in the absence of MSG_WAITALL.

Ideally, you'd also use a non-blocking connect with a shorter timeout
than the system default (which can be quite long).
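Concretely, that is a non-blocking connect() followed by poll() and
an SO_ERROR check. A sketch under the assumption that the caller
picks the timeout from its per-query budget; illustrative only, not
musl's actual code:

    /* Connect with an explicit timeout instead of waiting out the
     * system default, which can run to minutes.  Leaves the socket
     * non-blocking for the subsequent poll loop. */
    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <sys/socket.h>

    static int connect_timeout(int fd, const struct sockaddr *sa,
                               socklen_t salen, int timeout_ms)
    {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
        if (connect(fd, sa, salen) == 0)
            return 0;
        if (errno != EINPROGRESS)
            return -1;
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) <= 0)
            return -1;                  /* timed out, or poll failed */
        int err = 0;
        socklen_t errlen = sizeof err;
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err)
            return -1;
        return 0;
    }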
> 4. Logic for when fallback is needed:
>
> As noted in the thread "res_query/res_send contract findings",
> fallback is always needed by these functions when they get a response
> with the TC bit set because of the contract to return the size needed
> for the complete answer. But for high level (getaddrinfo, etc.)
> lookups, it's desirable to use truncated answers when we can. What
> should the condition for "when we can" be? My first leaning was that
> "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
> truncated response contains only the CNAME RR, not any records from
> the A or AAAA RRset.

Historically, TC=1 responses could be truncated in the middle of the
record set, or even in the middle of a record. Some middleboxes
probably do this still. You can still detect actual truncation after
a failed packet parse, but there could be other data there as well. I
expect most implementations to just discard TC=1 responses.

There's at least one implementation out there that tries a UDP EDNS0
query with a larger buffer size first when it encounters a TC=1
response, rather than going to TCP directly. But that probably needs
EDNS0-specific failure detection, so it's not ideal either.

Some virtualization software also violates the UDP 512-byte contract,
so you need to be prepared to receive larger responses over UDP as
well. (I think this case is particularly bad because TCP service is
broken as well.)

> Some possible conditions that could be used:
>
> - At least one RR of the type in the question. This seems to be the
>   choice to make maximal use of truncated responses, but could give
>   significantly fewer addresses than one might like if the nameserver
>   is badly behaved or if there's a very large CNAME consuming most of
>   the packet.
>
> - No CNAME and packet size is at least 512 minus the size of one RR.
>   This goes maximally in the other direction, never using results that
>   might be limited by the presence of a CNAME, and ensuring we always
>   have the number of answers we'd keep from a TCP response.

You really should only process the answer section if it is complete,
that is, if the number of records actually present matches the record
count in the header. More complex heuristics probably go wrong with
some slightly broken DNS servers.
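In code, that completeness check amounts to walking the packet and
verifying that all ANCOUNT records actually parse. A minimal sketch,
not musl's implementation; a real parser would also validate label
types and reject trailing garbage:

    #include <stddef.h>

    /* Advance *off past one encoded domain name; -1 if the packet
     * ends before the name does.  A compression pointer terminates
     * the name. */
    static int skip_name(const unsigned char *msg, size_t len,
                         size_t *off)
    {
        size_t i = *off;
        while (i < len) {
            if (msg[i] == 0) { *off = i + 1; return 0; }
            if ((msg[i] & 0xc0) == 0xc0) {
                if (i + 2 > len) return -1;
                *off = i + 2;
                return 0;
            }
            i += 1 + msg[i];            /* ordinary label */
        }
        return -1;
    }

    /* Advance *off past one resource record; -1 on truncation. */
    static int skip_rr(const unsigned char *msg, size_t len,
                       size_t *off)
    {
        if (skip_name(msg, len, off) < 0) return -1;
        if (len - *off < 10) return -1; /* type, class, TTL, RDLENGTH */
        size_t rdlen = msg[*off + 8] << 8 | msg[*off + 9];
        if (len - *off - 10 < rdlen) return -1;
        *off += 10 + rdlen;
        return 0;
    }

    /* A truncated response is usable only if every record promised
     * by the header's ANCOUNT parses out of the packet. */
    static int answer_section_complete(const unsigned char *msg,
                                       size_t len)
    {
        if (len < 12) return 0;         /* no full header */
        unsigned qdcount = msg[4] << 8 | msg[5];
        unsigned ancount = msg[6] << 8 | msg[7];
        size_t off = 12;
        for (unsigned i = 0; i < qdcount; i++) {
            /* question entry: name + QTYPE + QCLASS */
            if (skip_name(msg, len, &off) < 0 || len - off < 4)
                return 0;
            off += 4;
        }
        for (unsigned i = 0; i < ancount; i++)
            if (skip_rr(msg, len, &off) < 0) return 0;
        return ancount > 0;
    }

Thanks,
Florian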