mailing list of musl libc
* [musl] TCP fallback open questions
@ 2022-09-16  4:14 Rich Felker
  2022-09-20  9:42 ` Florian Weimer
  0 siblings, 1 reply; 3+ messages in thread
From: Rich Felker @ 2022-09-16  4:14 UTC (permalink / raw)
  To: musl

I'm beginning to work on the long-awaited TCP fallback path in musl's
stub resolver, and have here a list of things I've found so far that
need some decisions about specifics of the behavior. For the most part
these "questions" already have a seemingly solid choice I'm leaning
towards, but since this is a topic that's complex and that's had lots
of dissatisfaction over it in the past, I want to put them out to the
community for feedback as soon as possible so that any major concerns
can be considered.



1. How to switch over in the middle of a (multi-)query:

In general, the DNS query core is performing M (1 or 2) concurrent
queries against N configured nameservers, all in parallel (N*M
in-flight operations). Any one of those might get back a truncated
response requiring fallback to TCP. We want to do this fallback in a
way that minimizes additional latency, resource usage, and logic
complexity, which may be conflicting goals.

As policy, we already assume the configured nameservers are redundant,
providing non-conflicting views of the same DNS namespace. So once we
see a single reply for a given question that requires TCP fallback,
it's reasonable to conclude that any other reply to that question
(from the other nameservers) would also require fallback, and to stop
further retries of that question over UDP and ignore further answers
to that question over UDP. The other question(s, but really only at
most one, the opposite A/AAAA) however may still have satisfactory UDP
answers, so ideally we want to keep listening for those, and keep
retrying them.

In principle we could end up using N*M TCP sockets for an exhaustive
parallel query. N and M are small enough that this isn't huge, but
it's also not really nice. Given that the switch to TCP was triggered
by a truncated UDP response, we already know that the responding
server *knows the answer* and just can't send it within the size
limits. So a reasonable course of action is just to open a TCP
connection to the nameserver that issued the truncated response. This
is not necessarily going to be optimal -- it's possible that another
nameserver has gotten a response in the mean time and that the
round-trip for TCP handshake and payload would be much lower to that
other server. But I'm doubtful that consuming extra kernel resources
and producing extra network load to optimize the latency here is a
reasonable tradeoff.

I'm assuming so far that each question at least would have its own TCP
connection (if truncated as UDP). Using multiple nameservers in
parallel with TCP would maybe be an option if we were doing multiple
queries on the same connection, but I'm not aware of whether TCP DNS
has any sort of "pipelining" that would make this perform reasonably.
Maybe if "priming" the question via UDP it doesn't matter though and
we could expect the queries to be processed immediately with cached
results? I don't think I like this but I'm just raising it for
completeness.

TL;DR summary: my leaning is to do one TCP connection per question
that needs fallback, to the nameserver that issued the truncated
response for the question. Does this seem reasonable? Am I overlooking
anything important?
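
For illustration only, the policy above could be sketched roughly as
follows. The names (struct question, on_udp_reply, the Q_* states) are
hypothetical and are not taken from musl's res_msend code; the sketch
also assumes any TC=1 reply triggers fallback, which is the res_send
case rather than the getaddrinfo case discussed in item 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-question state for illustration. */
enum qstate { Q_UDP, Q_TCP, Q_DONE };

struct question {
	enum qstate state;
	int tcp_ns; /* index of the nameserver to connect to, valid in Q_TCP */
};

/* Handle a UDP reply to question q that arrived from nameserver ns. */
static void on_udp_reply(struct question *q, int ns,
	const unsigned char *reply, size_t rlen)
{
	assert(q != NULL);
	if (q->state != Q_UDP) return; /* late UDP answers are ignored */
	if (rlen < 12) return;         /* too short to hold a DNS header */
	if (reply[2] & 0x02) {         /* TC bit set: answer was truncated */
		q->state = Q_TCP;      /* stop UDP retries for this question */
		q->tcp_ns = ns;        /* fall back to the server that replied */
		return;
	}
	q->state = Q_DONE;
}
```

The other question (the opposite A/AAAA) has its own struct question
and so keeps being retried over UDP independently.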





2. Buffer shuffling:

Presently, UDP reads take place into the first unanswered buffer slot
and then get moved if it wasn't the right place. This does not seem
like it will work well when there are potentially partial TCP reads
taking place into one or more slots. I think the simplest solution is
just to use an additional fixed-size 512-byte local buffer in
__res_msend_rc for receiving UDP and always move it into the right
slot afterwards. The increased stack usage is not wonderful, but
rather small relative to the whole call stack, and probably
worth it to avoid code complexity. It also gives us a place to perform
throwaway TCP reads into, when reading responses longer than the
destination buffer just to report the needed length to the caller.
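
A rough sketch of the scratch-buffer approach, under the assumption
that replies are matched to questions by the 16-bit ID in the first two
bytes. struct slot and place_udp_reply are made-up names for
illustration, not the actual layout used in __res_msend_rc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UDP_MAX 512

/* Hypothetical per-question state; not musl's real data layout. */
struct slot {
	const unsigned char *query; /* query packet; bytes 0-1 hold the ID */
	unsigned char *answer;      /* caller-provided answer buffer */
	size_t asize;               /* capacity of answer */
	size_t alen;                /* full reply length, 0 if unanswered */
};

/* Move a datagram already received into scratch[] to the slot whose
 * query ID matches. Returns the slot index, or -1 if no unanswered
 * slot matches. At most asize bytes are copied, but alen records the
 * full length so the caller can learn how much space was needed. */
static int place_udp_reply(struct slot *slots, int nslots,
	const unsigned char *scratch, size_t rlen)
{
	assert(rlen <= UDP_MAX);
	if (rlen < 12) return -1; /* shorter than a DNS header */
	for (int i = 0; i < nslots; i++) {
		if (slots[i].alen) continue;
		if (memcmp(scratch, slots[i].query, 2)) continue;
		size_t n = rlen < slots[i].asize ? rlen : slots[i].asize;
		memcpy(slots[i].answer, scratch, n);
		slots[i].alen = rlen;
		return i;
	}
	return -1;
}
```

Because the datagram never lands in a slot directly, a slot that is in
the middle of a partial TCP read is left untouched.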




3. Timeouts:

UDP being datagram based, there is no condition where we have to worry
about blocking and getting stuck in the middle of a partial read.
Timeout occurs just at the loop level. 

Are there any special considerations for timeout here using TCP? My
leaning is no, since we'll still be in a poll loop regime, and
regardless of blocking state on the socket, recv should do partial
reads in the absence of MSG_WAITALL.
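
To make the partial-read point concrete, here is a minimal sketch of
one step of a TCP read driven from a poll loop. It assumes the standard
2-byte big-endian length prefix of DNS over TCP and a platform that
provides MSG_DONTWAIT; struct tcp_rx and tcp_rx_step are hypothetical
names, and bytes beyond the destination buffer are simply discarded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical per-connection read state. */
struct tcp_rx {
	unsigned char lenbuf[2]; /* 2-byte big-endian length prefix */
	size_t got;              /* bytes consumed so far, prefix included */
	size_t msglen;           /* valid once got >= 2 */
};

/* Call when poll() reports the socket readable. Performs one recv.
 * Returns 1 once the whole message has been consumed (with up to
 * bufsize bytes of it stored in buf), 0 if more data is needed,
 * -1 on error or EOF. */
static int tcp_rx_step(int fd, struct tcp_rx *st,
	unsigned char *buf, size_t bufsize)
{
	unsigned char discard[512];
	unsigned char *dst;
	size_t want;
	ssize_t r;

	assert(st != NULL && buf != NULL);
	if (st->got < 2) {
		dst = st->lenbuf + st->got;
		want = 2 - st->got;
	} else {
		size_t off = st->got - 2;
		want = st->msglen - off;
		if (off < bufsize) {
			dst = buf + off;
			if (want > bufsize - off) want = bufsize - off;
		} else {
			dst = discard; /* throwaway read past the buffer */
			if (want > sizeof discard) want = sizeof discard;
		}
	}
	r = recv(fd, dst, want, MSG_DONTWAIT);
	if (r < 0)
		return (errno == EAGAIN || errno == EWOULDBLOCK
			|| errno == EINTR) ? 0 : -1;
	if (r == 0) return -1;
	st->got += r;
	if (st->got == 2) st->msglen = st->lenbuf[0] << 8 | st->lenbuf[1];
	return st->got >= 2 && st->got - 2 == st->msglen;
}
```

Each call returns promptly whether or not the message is complete, so
the timeout can stay at the loop level exactly as with UDP.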




4. Logic for when fallback is needed:

As noted in the thread "res_query/res_send contract findings",
fallback is always needed by these functions when they get a response
with the TC bit set because of the contract to return the size needed
for the complete answer. But for high level (getaddrinfo, etc.)
lookups, it's desirable to use truncated answers when we can. What
should the condition for "when we can" be? My first leaning was that
"nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
truncated response contains only the CNAME RR, not any records from
the A or AAAA RRset.

Some possible conditions that could be used:

- At least one RR of the type in the question. This seems to be the
  choice to make maximal use of truncated responses, but could give
  significantly fewer addresses than one might like if the nameserver
  is badly behaved or if there's a very large CNAME consuming most of
  the packet.

- No CNAME and packet size is at least 512 minus the size of one RR.
  This goes maximally in the other direction, never using results that
  might be limited by the presence of a CNAME, and ensuring we always
  have the number of answers we'd keep from a TCP response.

There are probably several other reasonable options on a spectrum
between these two.

Unless name_from_dns (lookup_name.c) is changed to use longer response
buffers, the only case in which switching to TCP will give us a better
answer is when the nameserver is being petulant in its truncation. But
it probably should be changed, since the case where the entire packet
is consumed by a CNAME can be hit. To avoid that, the buffer needs to
be at least just under 600 bytes.
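
As a sketch of the first condition listed above ("at least one RR of
the type in the question"), the check could look roughly like this. The
function names are hypothetical, the code assumes exactly one question
in the packet, and it is not how name_from_dns parses replies today.

```c
#include <assert.h>
#include <stddef.h>

/* Skip a possibly compressed name starting at offset p. Returns the
 * offset just past it, or 0 if it runs off the end of the packet. */
static size_t skip_name(const unsigned char *pkt, size_t len, size_t p)
{
	while (p < len) {
		unsigned c = pkt[p];
		if (c >= 0xc0) return len - p >= 2 ? p + 2 : 0; /* pointer */
		if (c >= 0x40) return 0; /* reserved label types */
		if (c == 0) return p + 1;
		p += c + 1;
	}
	return 0;
}

/* Count answer RRs whose type equals the question's type. Returns -1
 * if the packet is malformed or holds fewer RRs than ANCOUNT claims. */
static int count_qtype_answers(const unsigned char *pkt, size_t len)
{
	assert(pkt != NULL);
	if (len < 12) return -1;
	unsigned qdcount = pkt[4] << 8 | pkt[5];
	unsigned ancount = pkt[6] << 8 | pkt[7];
	if (qdcount != 1) return -1;
	size_t p = skip_name(pkt, len, 12);
	if (!p || len - p < 4) return -1;
	unsigned qtype = pkt[p] << 8 | pkt[p+1];
	p += 4;
	int n = 0;
	for (unsigned i = 0; i < ancount; i++) {
		p = skip_name(pkt, len, p);
		if (!p || len - p < 10) return -1;
		unsigned type = pkt[p] << 8 | pkt[p+1];
		unsigned rdlen = pkt[p+8] << 8 | pkt[p+9];
		p += 10;
		if (len - p < rdlen) return -1;
		p += rdlen;
		if (type == qtype) n++;
	}
	return n;
}

/* A TC=1 reply is treated as usable if it carries at least one RR of
 * the type that was asked for. */
static int truncated_reply_usable(const unsigned char *pkt, size_t len)
{
	return count_qtype_answers(pkt, len) > 0;
}
```

A reply holding only the CNAME RR yields a count of 0 and so would
still trigger fallback under this condition.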





* Re: [musl] TCP fallback open questions
  2022-09-16  4:14 [musl] TCP fallback open questions Rich Felker
@ 2022-09-20  9:42 ` Florian Weimer
  2022-09-20 12:53   ` Rich Felker
  0 siblings, 1 reply; 3+ messages in thread
From: Florian Weimer @ 2022-09-20  9:42 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

* Rich Felker:

> In principle we could end up using N*M TCP sockets for an exhaustive
> parallel query. N and M are small enough that this isn't huge, but
> it's also not really nice. Given that the switch to TCP was triggered
> by a truncated UDP response, we already know that the responding
> server *knows the answer* and just can't send it within the size
> limits. So a reasonable course of action is just to open a TCP
> connection to the nameserver that issued the truncated response. This
> is not necessarily going to be optimal -- it's possible that another
> nameserver has gotten a response in the mean time and that the
> round-trip for TCP handshake and payload would be much lower to that
> other server. But I'm doubtful that consuming extra kernel resources
> and producing extra network load to optimize the latency here is a
> reasonable tradeoff.

The large centralized load balancers typically do not share caches
between their UDP and TCP endpoints, at least not immediately, and
nor between different UDP frontend servers behind the load balancer.
So the assumption that the cache is hot at the time of the TCP query is
probably not true in that case.  But it probably does not matter.

> I'm assuming so far that each question at least would have its own TCP
> connection (if truncated as UDP). Using multiple nameservers in
> parallel with TCP would maybe be an option if we were doing multiple
> queries on the same connection, but I'm not aware of whether TCP DNS
> has any sort of "pipelining" that would make this perform reasonably.
> Maybe if "priming" the question via UDP it doesn't matter though and
> we could expect the queries to be processed immediately with cached
> results? I don't think I like this but I'm just raising it for
> completeness.

TCP DNS has pipelining.  The glibc stub resolver exercises that, sending
the A and AAAA queries back-to-back over TCP, probably in the same
segment.  I think it should even deal with reordered responses (so no
head-of-line blocking), but I'm not sure if recursive resolver code
actually exercises it by reordering replies.
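
As an illustration of what "back-to-back" means on the wire: each
message on a DNS TCP connection is preceded by a 2-byte big-endian
length, so two queries can be laid out in one buffer and handed to a
single send. frame_query is a hypothetical helper for this sketch, not
a glibc or musl function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append one query to out at offset off, preceded by its 2-byte
 * big-endian length. Returns the new offset, or 0 if the framed
 * query does not fit in the remaining space. */
static size_t frame_query(unsigned char *out, size_t outsize, size_t off,
	const unsigned char *q, size_t qlen)
{
	assert(off <= outsize);
	if (qlen > 0xffff || outsize - off < qlen + 2) return 0;
	out[off] = qlen >> 8;
	out[off+1] = qlen & 0xff;
	memcpy(out + off + 2, q, qlen);
	return off + 2 + qlen;
}
```

Calling it once for the A query and once for the AAAA query, with the
second call starting at the offset returned by the first, produces a
buffer that can go out in one write.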

Historically, it's unfriendly to keep TCP connections to recursive
resolvers open for extended periods of time.  Furthermore, it
complicates the retry logic in the client because once you keep
connections open, RST in response to a send does not indicate an error
(the server may just have dropped the question), so you need to retry in
that case with a fresh connection.

> TL;DR summary: my leaning is to do one TCP connection per question
> that needs fallback, to the nameserver that issued the truncated
> response for the question. Does this seem reasonable? Am I overlooking
> anything important?

It's certainly the most conservative approach.

> 3. Timeouts:
>
> UDP being datagram based, there is no condition where we have to worry
> about blocking and getting stuck in the middle of a partial read.
> Timeout occurs just at the loop level. 
>
> Are there any special considerations for timeout here using TCP? My
> leaning is no, since we'll still be in a poll loop regime, and
> regardless of blocking state on the socket, recv should do partial
> reads in the absence of MSG_WAITALL.

Ideally, you'd also use a non-blocking connect with a shorter timeout
than the system default (which can be quite long).

> 4. Logic for when fallback is needed:
>
> As noted in the thread "res_query/res_send contract findings",
> fallback is always needed by these functions when they get a response
> with the TC bit set because of the contract to return the size needed
> for the complete answer. But for high level (getaddrinfo, etc.)
> lookups, it's desirable to use truncated answers when we can. What
> should the condition for "when we can" be? My first leaning was that
> "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
> truncated response contains only the CNAME RR, not any records from
> the A or AAAA RRset.

Historically, TC=1 responses could have responses truncated in the
middle of the record set, or even the record.  Some middleboxes probably
do this still.  You can still detect this after a failing packet parse
if it's actual truncation, but there could be other data there as well.
I expect most implementations to just discard TC=1 responses.

There's at least one implementation out there that tries a UDP EDNS0
query with a larger buffer space first when it encounters a TC=1
response, rather than going to TCP directly.  But that probably needs
EDNS0-specific failure detection, so not ideal either.

Some virtualization software also violates the UDP 512-byte contract, so
you need to be prepared to receive larger responses over UDP as well.
(I think this particular case is particularly bad because TCP service is
broken as well.)

> Some possible conditions that could be used:
>
> - At least one RR of the type in the question. This seems to be the
>   choice to make maximal use of truncated responses, but could give
>   significantly fewer addresses than one might like if the nameserver
>   is badly behaved or if there's a very large CNAME consuming most of
>   the packet.
>
> - No CNAME and packet size is at least 512 minus the size of one RR.
>   This goes maximally in the other direction, never using results that
>   might be limited by the presence of a CNAME, and ensuring we always
>   have the number of answers we'd keep from a TCP response.

You really only should process the answer section if its record count
indicates that it's complete (compared to the header).  More complex
heuristics probably go wrong with some slightly broken DNS servers.

Thanks,
Florian



* Re: [musl] TCP fallback open questions
  2022-09-20  9:42 ` Florian Weimer
@ 2022-09-20 12:53   ` Rich Felker
  0 siblings, 0 replies; 3+ messages in thread
From: Rich Felker @ 2022-09-20 12:53 UTC (permalink / raw)
  To: Florian Weimer; +Cc: musl

On Tue, Sep 20, 2022 at 11:42:04AM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > In principle we could end up using N*M TCP sockets for an exhaustive
> > parallel query. N and M are small enough that this isn't huge, but
> > it's also not really nice. Given that the switch to TCP was triggered
> > by a truncated UDP response, we already know that the responding
> > server *knows the answer* and just can't send it within the size
> > limits. So a reasonable course of action is just to open a TCP
> > connection to the nameserver that issued the truncated response. This
> > is not necessarily going to be optimal -- it's possible that another
> > nameserver has gotten a response in the mean time and that the
> > round-trip for TCP handshake and payload would be much lower to that
> > other server. But I'm doubtful that consuming extra kernel resources
> > and producing extra network load to optimize the latency here is a
> > reasonable tradeoff.
> 
> The large centralized load balancers typically do not share caches
> between their UDP and TCP endpoints, at least not immediately, and
> neither between different UDP frontend servers behind the loadbalancer.
> So the assumption that the cache is hot at the time of the TCP query is
> probably not true in that case.  But it probably does not matter.

Thanks, this is good to know. For the most important case, a local
trusted validating nameserver on localhost, it should mean the result
is immediately available. For others, at least it's hopefully
indicative that the result is likely-obtainable. I would really hope
that the big servers like Google and CF don't go back to querying the
authoritative server a second time upon fallback to TCP, but use their
own common upstream cache or something, if for no other reason than
not putting unnecessary load on the rest of the internet. But maybe
this is naive...

> > I'm assuming so far that each question at least would have its own TCP
> > connection (if truncated as UDP). Using multiple nameservers in
> > parallel with TCP would maybe be an option if we were doing multiple
> > queries on the same connection, but I'm not aware of whether TCP DNS
> > has any sort of "pipelining" that would make this perform reasonably.
> > Maybe if "priming" the question via UDP it doesn't matter though and
> > we could expect the queries to be processed immediately with cached
> > results? I don't think I like this but I'm just raising it for
> > completeness.
> 
> TCP DNS has pipelining.

Has that always been a thing? If so, it might advise a different way
to do this.

> The glibc stub resolver exercises that, sending
> the A and AAAA queries back-to-back over TCP, probably in the same
> segment.  I think it should even deal with reordered responses (so no
> head-of-line blocking), but I'm not sure if recursive resolver code
> actually exercises it by reordering replies.

Do real-world servers reliably do out-of-order responding, starting
multiple queries received over the same connection in parallel and
responding to them in the order answers become available? If so, this
is a potentially appealing approach. It does require either reading
the 2-byte length as a discrete read/recv first or else performing
complex buffer shuffling (since without knowing length, answers to 2
different queries might come in the same read) but the syscall cost is
really inconsequential compared to the network latency costs anyway,
so it's probably best not to be concerned with optimizing number of
syscalls.

> Historically, it's unfriendly to keep TCP connections to recursive
> resolvers open for extended periods of time.  Furthermore, it
> complicates the retry logic in the client because once you keep
> connections open, RST in response to a send does not indicate an error
> (the server may just have dropped the question), so you need to retry in
> that case with a fresh connection.

We definitely wouldn't keep them open for an extended period since the
expectation is that they won't normally be used, and since tying up
fds is bad. But if there's nasty corner case handling for when the
server decides it doesn't want to answer more questions on your
existing socket (possibly even within a single run) the prospect of
using a single connection for multiple queries becomes less appealing.

> > TL;DR summary: my leaning is to do one TCP connection per question
> > that needs fallback, to the nameserver that issued the truncated
> > response for the question. Does this seem reasonable? Am I overlooking
> > anything important?
> 
> It's certainly the most conservative approach.
> 
> > 3. Timeouts:
> >
> > UDP being datagram based, there is no condition where we have to worry
> > about blocking and getting stuck in the middle of a partial read.
> > Timeout occurs just at the loop level. 
> >
> > Are there any special considerations for timeout here using TCP? My
> > leaning is no, since we'll still be in a poll loop regime, and
> > regardless of blocking state on the socket, recv should do partial
> > reads in the absence of MSG_WAITALL.
> 
> Ideally, you'd also use a non-blocking connect with a shorter timeout
> than the system default (which can be quite long).

Yes, definitely nonblocking connect, or rather nonblocking sendmsg
with MSG_FASTOPEN as long as the kernel supports it. (If the server
supports it, this saves us a lot of latency, and even if not, the
kernel avoids waking us just to perform the send after the connect
completes.) I think keeping the logic for this clean and simple is
another motivation for using one socket per query.

> > 4. Logic for when fallback is needed:
> >
> > As noted in the thread "res_query/res_send contract findings",
> > fallback is always needed by these functions when they get a response
> > with the TC bit set because of the contract to return the size needed
> > for the complete answer. But for high level (getaddrinfo, etc.)
> > lookups, it's desirable to use truncated answers when we can. What
> > should the condition for "when we can" be? My first leaning was that
> > "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
> > truncated response contains only the CNAME RR, not any records from
> > the A or AAAA RRset.
> 
> Historically, TC=1 responses could have responses truncated in the
> middle of the record set, or even the record.  Some middleboxes probably
> do this still.  You can still detect this after a failing packet parse
> if it's actual truncation, but there could be other data there as well.
> I expect most implementations to just discard TC=1 responses.

By the time we parse the packet (except looking at RCODE) the query
machine has been discarded. I think TCP support means we want to do at
least some rudimentary parsing inside the machine (or as a callback
from it) as part of the predicate to decide whether to accept the
truncated response.

> There's at least one implementation out there that tries an UDP EDNS0
> query with a larger buffer space first when it encounters a TC=1
> response, rather than going to TCP directly.  But that probably needs
> EDNS0-specific failure detection, so not ideal either.

Yes, I basically ruled out doing anything with EDNS0 because it
requires layering violations, or rewriting the query packet to EDNS0,
then rewriting the answer back, so that it's in the form the caller
expects. This could be avoided internally but with res_* API it
becomes externally visible. On top of that, EDNS0 parsing has been a
source of vulns in various software in the past, which makes me want
to stay away from it. It also doesn't even provide a complete solution
to the problem, since for very large answers you'll still need a
second fallback to TCP.

> Some virtualization software also violates the UDP 512-byte contract, so
> you need to be prepared to receive larger responses over UDP as well.
> (I think this particular case is particularly bad because TCP service is
> broken as well.)

We generally don't support violation of DNS contract. In this case it
doesn't really matter though; the large packets will just be silently
truncated by reading only 512 bytes. We could in theory request a
single extra byte via an iovec to detect this condition and
artificially add the TC bit if it's missing, but unless there's a
strong motivation to do this, the right action is probably to ignore
it along with the plethora of other utterly broken things wacky
nameservers can do.
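
For reference, the extra-byte trick would look something like the
sketch below. recv_udp_512 is a hypothetical name, and this is only an
illustration of the idea, not something proposed for inclusion.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Receive one datagram into buf, which must hold 512 bytes, while
 * asking for one spare byte through a second iovec. If the spare byte
 * gets filled, the datagram was longer than 512 bytes: set the TC bit
 * in the header and report 512. Otherwise return recvmsg's result. */
static ssize_t recv_udp_512(int fd, unsigned char *buf)
{
	unsigned char extra;
	struct iovec iov[2] = {
		{ .iov_base = buf, .iov_len = 512 },
		{ .iov_base = &extra, .iov_len = 1 },
	};
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };
	ssize_t r;

	assert(buf != NULL);
	r = recvmsg(fd, &mh, 0);
	if (r > 512) {
		buf[2] |= 0x02; /* TC is bit 0x02 of the third header byte */
		r = 512;
	}
	return r;
}
```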

> > Some possible conditions that could be used:
> >
> > - At least one RR of the type in the question. This seems to be the
> >   choice to make maximal use of truncated responses, but could give
> >   significantly fewer addresses than one might like if the nameserver
> >   is badly behaved or if there's a very large CNAME consuming most of
> >   the packet.
> >
> > - No CNAME and packet size is at least 512 minus the size of one RR.
> >   This goes maximally in the other direction, never using results that
> >   might be limited by the presence of a CNAME, and ensuring we always
> >   have the number of answers we'd keep from a TCP response.
> 
> You really only should process the answer section if its record count
> indicates that it's complete (compared to the header).  More complex
> heuristics probably go wrong with some slightly broken DNS servers.

Yes, I was assuming it is complete with respect to the header and that
the nameserver is following the spec and truncating on RR granularity,
with the ANCOUNT reflecting the number of RRs actually present. But
this can be checked as another means to trigger TCP fallback, and
doing so is probably a good idea (and won't affect anyone with
non-broken nameservers).

So the question here is basically just about how to decide whether to
fall back when the packet is well-formed but lacks one or more RRs
from what the complete answer would have.

Rich
