From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.0 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no version=3.4.4
Received: (qmail 32131 invoked from network); 20 Sep 2022 12:53:33 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 20 Sep 2022 12:53:33 -0000
Received: (qmail 3176 invoked by uid 550); 20 Sep 2022 12:53:30 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 3150 invoked from network); 20 Sep 2022 12:53:30 -0000
Date: Tue, 20 Sep 2022 08:53:18 -0400
From: Rich Felker <dalias@libc.org>
To: Florian Weimer <fweimer@redhat.com>
Cc: musl@lists.openwall.com
Message-ID: <20220920125317.GN9709@brightrain.aerifal.cx>
References: <20220916041435.GL9709@brightrain.aerifal.cx>
 <87wn9yo00j.fsf@oldenburg.str.redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87wn9yo00j.fsf@oldenburg.str.redhat.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Subject: Re: [musl] TCP fallback open questions

On Tue, Sep 20, 2022 at 11:42:04AM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > In principle we could end up using N*M TCP sockets for an exhaustive
> > parallel query. N and M are small enough that this isn't huge, but
> > it's also not really nice. Given that the switch to TCP was triggered
> > by a truncated UDP response, we already know that the responding
> > server *knows the answer* and just can't send it within the size
> > limits. So a reasonable course of action is just to open a TCP
> > connection to the nameserver that issued the truncated response. This
> > is not necessarily going to be optimal -- it's possible that another
> > nameserver has gotten a response in the mean time and that the
> > round-trip for TCP handshake and payload would be much lower to that
> > other server. But I'm doubtful that consuming extra kernel resources
> > and producing extra network load to optimize the latency here is a
> > reasonable tradeoff.
> 
> The large centralized load balancers typically do not share caches
> between their UDP and TCP endpoints, at least not immediately, and
> neither between different UDP frontend servers behind the loadbalancer.
> So the assumption that the cache is hot at the time of the TCP query is
> probably not true in that case.  But it probably does not matter.

Thanks, this is good to know. For the most important case, a local
trusted validating nameserver on localhost, it should mean the result
is immediately available. For others, at least it's hopefully
indicative that the result is likely-obtainable. I would really hope
that the big servers like Google and CF don't go back to querying the
authoritative server a second time upon fallback to TCP, but use their
own common upstream cache or something, if for no other reason than
not putting unnecessary load on the rest of the internet. But maybe
this is naive...

> > I'm assuming so far that each question at least would have its own TCP
> > connection (if truncated as UDP). Using multiple nameservers in
> > parallel with TCP would maybe be an option if we were doing multiple
> > queries on the same connection, but I'm not aware of whether TCP DNS
> > has any sort of "pipelining" that would make this perform reasonably.
> > Maybe if "priming" the question via UDP it doesn't matter though and
> > we could expect the queries to be processed immediately with cached
> > results? I don't think I like this but I'm just raising it for
> > completeness.
> 
> TCP DNS has pipelining.

Has that always been a thing? If so, it might advise a different way
to do this.

> The glibc stub resolver exercises that, sending
> the A and AAAA queries back-to-back over TCP, probably in the same
> segment.  I think it should even deal with reordered responses (so no
> head-of-line blocking), but I'm not sure if recursive resolver code
> actually exercises it by reorder replies.

Do real-world servers reliably do out-of-order responding, starting
multiple queries received over the same connection in parallel and
responding to them in the order answers become available? If so, this
is a potentially appealing approach. It does require either reading
the 2-byte length as a discrete read/recv first or else performing
complex buffer shuffling (since without knowing length, answers to 2
different queries might come in the same read) but the syscall cost is
really inconsequential compared to the network latency costs anyway,
so it's probably best not to be concerned with optimizing number of
syscalls.
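
To make the discrete-read option concrete, here is a rough sketch in C.
The names struct tcp_rx and tcp_rx_step are invented for this example
and are not existing musl code. The sketch assumes one receive state
per socket and a caller that drives it from a poll loop. It never asks
for more bytes than the current message needs, so two answers arriving
in the same segment are never mixed in one buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical receive state for one length-prefixed DNS message. */
struct tcp_rx {
	unsigned char len[2];   /* 2-byte big-endian length prefix */
	size_t pos;             /* bytes consumed so far, prefix included */
};

/* Make progress on the current message with at most two recv calls
 * (one for the prefix, one for the body).
 * Returns 1 when a whole message is in buf (length in *msglen),
 * 0 when more data is needed, -1 on error, EOF, or oversized message. */
int tcp_rx_step(int fd, struct tcp_rx *rx,
                unsigned char *buf, size_t cap, size_t *msglen)
{
	ssize_t r;
	if (rx->pos < 2) {
		/* Request only the rest of the prefix, so we never read
		 * into the body or into a following message. */
		r = recv(fd, rx->len + rx->pos, 2 - rx->pos, 0);
		if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return 0;
		if (r <= 0) return -1;
		rx->pos += r;
		if (rx->pos < 2) return 0;
	}
	size_t need = rx->len[0] * 256u + rx->len[1];
	if (need > cap) return -1;
	size_t got = rx->pos - 2;
	if (got < need) {
		/* Request only the remainder of this one message. */
		r = recv(fd, buf + got, need - got, 0);
		if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return 0;
		if (r <= 0) return -1;
		rx->pos += r;
		got += r;
	}
	if (got < need) return 0;
	*msglen = need;
	rx->pos = 0;   /* ready for the next message on this socket */
	return 1;
}
```

The cost is one extra recv per answer, which is the tradeoff described
above: more syscalls in exchange for no buffer shuffling.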

> Historically, it's unfriendly to keep TCP connections to recursive
> resolvers open for extended periods of time.  Furthermore, it
> complicates the retry logic in the client because once you keep
> connections open, RST in response to a send does not indicate an error
> (the server may just have dropped the question), so you need to retry in
> that case with a fresh connection.

We definitely wouldn't keep them open for an extended period since the
expectation is that they won't normally be used, and since tying up
fds is bad. But if there's nasty corner case handling for when the
server decides it doesn't want to answer more questions on your
existing socket (possibly even within a single run), the prospect of
using a single connection for multiple queries becomes less appealing.

> > TL;DR summary: my leaning is to do one TCP connection per question
> > that needs fallback, to the nameserver that issued the truncated
> > response for the question. Does this seem reasonable? Am I overlooking
> > anything important?
> 
> It's certainly the most conservative approach.
> 
> > 3. Timeouts:
> >
> > UDP being datagram based, there is no condition where we have to worry
> > about blocking and getting stuck in the middle of a partial read.
> > Timeout occurs just at the loop level. 
> >
> > Are there any special considerations for timeout here using TCP? My
> > leaning is no, since we'll still be in a poll loop regime, and
> > regardless of blocking state on the socket, recv should do partial
> > reads in the absence of MSG_WAITALL.
> 
> Ideally, you'd also use a non-blocking connect with a shorter timeout
> than the system default (which can be quite long).

Yes, definitely nonblocking connect, or rather nonblocking sendmsg
with MSG_FASTOPEN as long as the kernel supports it. (If the server
supports it, this saves us a lot of latency, and even if not, the
kernel avoids waking us just to perform the send after the connect
completes.) I think keeping the logic for this clean and simple is
another motivation for using one socket per query.

> > 4. Logic for when fallback is needed:
> >
> > As noted in the thread "res_query/res_send contract findings",
> > fallback is always needed by these functions when they get a response
> > with the TC bit set because of the contract to return the size needed
> > for the complete answer. But for high level (getaddrinfo, etc.)
> > lookups, it's desirable to use truncated answers when we can. What
> > should the condition for "when we can" be? My first leaning was that
> > "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
> > truncated response contains only the CNAME RR, not any records from
> > the A or AAAA RRset.
> 
> Historically, TC=1 responses could have responses truncated in the
> middle of the record set, or even the record.  Some middleboxes probably
> do this still.  You can still detect this after a failing packet parse
> if it's actual truncation, but there could be other data there as well.
> I expect most implementations to just discard TC=1 responses.

By the time we parse the packet (except looking at RCODE) the query
machine has been discarded. I think TCP support means we want to do at
least some rudimentary parsing inside the machine (or as a callback
from it) as part of the predicate to decide whether to accept the
truncated response.

> There's at least one implementation out there that tries an UDP EDNS0
> query with a larger buffer space first when it encounters a TC=1
> response, rather than going to TCP directly.  But that probably needs
> EDNS0-specific failure detection, so not ideal either.

Yes, I basically ruled out doing anything with EDNS0 because it
requires layering violations, or rewriting the query packet to EDNS0,
then rewriting the answer back, so that it's in the form the caller
expects. This could be avoided internally but with res_* API it
becomes externally visible. On top of that, EDNS0 parsing has been a
source of vulns in various software in the past, which makes me want
to stay away from it. It also doesn't even provide a complete solution
to the problem, since for very large answers you'll still need a
second fallback to TCP.

> Some virtualization software also violates the UDP 512-byte contract, so
> you need to be prepared to receive larger responses over UDP as well.
> (I think this particular case is particularly bad because TCP service is
> broken as well.)

We generally don't support violations of the DNS contract. In this case it
doesn't really matter though; the large packets will just be silently
truncated by reading only 512 bytes. We could in theory request a
single extra byte via an iovec to detect this condition and
artificially add the TC bit if it's missing, but unless there's a
strong motivation to do this, the right action is probably to ignore
it along with the plethora of other utterly broken things wacky
nameservers can do.
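
For reference, the extra-byte idea would look roughly like the sketch
below. The function name recv_udp_answer and the UDP_MAX macro are made
up for illustration, and this is not a claim about what musl does or
will do. It assumes the caller supplies a 512-byte buffer and receives
on a datagram socket.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define UDP_MAX 512

/* Receive one datagram into buf (UDP_MAX bytes). A second 1-byte iovec
 * catches overflow: if it gets filled, the datagram was larger than
 * 512 bytes, so set TC in the header to trigger the normal fallback.
 * Returns the number of usable bytes in buf, or -1 on error. */
ssize_t recv_udp_answer(int fd, unsigned char *buf)
{
	unsigned char spill;
	struct iovec iov[2] = {
		{ .iov_base = buf, .iov_len = UDP_MAX },
		{ .iov_base = &spill, .iov_len = 1 },
	};
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };
	ssize_t r = recvmsg(fd, &mh, 0);
	if (r > UDP_MAX) {
		buf[2] |= 0x02;   /* TC is bit 0x02 of the third header byte */
		r = UDP_MAX;
	}
	return r;
}
```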

> > Some possible conditions that could be used:
> >
> > - At least one RR of the type in the question. This seems to be the
> >   choice to make maximal use of truncated responses, but could give
> >   significantly fewer addresses than one might like if the nameserver
> >   is badly behaved or if there's a very large CNAME consuming most of
> >   the packet.
> >
> > - No CNAME and packet size is at least 512 minus the size of one RR.
> >   This goes maximally in the other direction, never using results that
> >   might be limited by the presence of a CNAME, and ensuring we always
> >   have the number of answers we'd keep from a TCP response.
> 
> You really only should process the answer section if its record count
> indicates that it's complete (compared to the header).  More complex
> heuristics probably go wrong with some slightly broken DNS servers.

Yes, I was assuming it is complete with respect to the header and that
the nameserver is following the spec and truncating on RR granularity,
with the ANCOUNT reflecting the number of RRs actually present. But
this can be checked as another means to trigger TCP fallback, and
doing so is probably a good idea (and won't affect anyone with
non-broken nameservers).
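
As a rough sketch of what such a check could look like, combined with
the first candidate condition above (at least one RR of the question
type), consider the C below. The names skip_name and
truncated_answer_usable are hypothetical, and the choice of predicate
is only one of the options under discussion, not a decision. It walks
the question and answer sections, and reports the packet as unusable if
any of the ANCOUNT records is missing or cut short.

```c
#include <stddef.h>

/* Skip a (possibly compressed) name starting at offset p; return the
 * offset just past it, or 0 if it runs off the end of the packet. */
static size_t skip_name(const unsigned char *pkt, size_t len, size_t p)
{
	while (p < len) {
		unsigned c = pkt[p];
		if (c >= 0xc0) return p + 2 <= len ? p + 2 : 0; /* pointer ends name */
		if (c == 0) return p + 1;                       /* root label */
		if (c > 63) return 0;                           /* reserved label type */
		p += 1 + c;
	}
	return 0;
}

/* Return 1 if the response is well-formed through its answer section
 * (all ANCOUNT records present and intact) and has at least one RR of
 * type qtype; return 0 if TCP fallback is needed. */
int truncated_answer_usable(const unsigned char *pkt, size_t len, unsigned qtype)
{
	if (len < 12) return 0;
	unsigned qd = pkt[4] * 256u + pkt[5];
	unsigned an = pkt[6] * 256u + pkt[7];
	size_t p = 12;
	int found = 0;
	while (qd--) {
		p = skip_name(pkt, len, p);
		if (!p || len - p < 4) return 0;
		p += 4;   /* QTYPE + QCLASS */
	}
	while (an--) {
		p = skip_name(pkt, len, p);
		if (!p || len - p < 10) return 0;
		unsigned type = pkt[p] * 256u + pkt[p + 1];
		size_t rdlen = pkt[p + 8] * 256u + pkt[p + 9];
		p += 10;  /* TYPE, CLASS, TTL, RDLENGTH */
		if (len - p < rdlen) return 0;   /* record cut mid-RDATA */
		p += rdlen;
		if (type == qtype) found = 1;
	}
	return found;
}
```

A response carrying only the CNAME RR would be rejected by this
predicate for an A or AAAA query, since no RR of the question type is
present.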

So the question here is basically just about how to decide whether to
fall back when the packet is well-formed but lacks one or more RRs from
what the complete answer would have.

Rich