* [musl] TCP fallback open questions
From: Rich Felker @ 2022-09-16 4:14 UTC
To: musl

I'm beginning to work on the long-awaited TCP fallback path in musl's stub resolver, and have here a list of things I've found so far that need some decisions about specifics of the behavior. For the most part these "questions" already have a seemingly solid choice I'm leaning towards, but since this is a topic that's complex and that's had lots of dissatisfaction over it in the past, I want to put them out to the community for feedback as soon as possible so that any major concerns can be considered.

1. How to switch over in the middle of a (multi-)query:

In general, the DNS query core is performing M (1 or 2) concurrent queries against N configured nameservers, all in parallel (N*M in-flight operations). Any one of those might get back a truncated response requiring fallback to TCP. We want to do this fallback in a way that minimizes additional latency, resource usage, and logic complexity, which may be conflicting goals.

As policy, we already assume the configured nameservers are redundant, providing non-conflicting views of the same DNS namespace. So once we see a single reply for a given question that requires TCP fallback, it's reasonable to conclude that any other reply to that question (from the other nameservers) would also require fallback, and to stop further retries of that question over UDP and ignore further answers to that question over UDP.

The other question(s, but really only at most one, the opposite A/AAAA) however may still have satisfactory UDP answers, so ideally we want to keep listening for those, and keep retrying them.

In principle we could end up using N*M TCP sockets for an exhaustive parallel query. N and M are small enough that this isn't huge, but it's also not really nice.
Given that the switch to TCP was triggered by a truncated UDP response, we already know that the responding server *knows the answer* and just can't send it within the size limits. So a reasonable course of action is just to open a TCP connection to the nameserver that issued the truncated response. This is not necessarily going to be optimal -- it's possible that another nameserver has gotten a response in the mean time and that the round-trip for TCP handshake and payload would be much lower to that other server. But I'm doubtful that consuming extra kernel resources and producing extra network load to optimize the latency here is a reasonable tradeoff.

I'm assuming so far that each question at least would have its own TCP connection (if truncated as UDP). Using multiple nameservers in parallel with TCP would maybe be an option if we were doing multiple queries on the same connection, but I'm not aware of whether TCP DNS has any sort of "pipelining" that would make this perform reasonably. Maybe if "priming" the question via UDP it doesn't matter though and we could expect the queries to be processed immediately with cached results? I don't think I like this but I'm just raising it for completeness.

TL;DR summary: my leaning is to do one TCP connection per question that needs fallback, to the nameserver that issued the truncated response for the question. Does this seem reasonable? Am I overlooking anything important?

2. Buffer shuffling:

Presently, UDP reads take place into the first unanswered buffer slot and then get moved if it wasn't the right place. This does not seem like it will work well when there are potentially partial TCP reads taking place into one or more slots.

I think the simplest solution is just to use an additional fixed-size 512-byte local buffer in __res_msend_rc for receiving UDP and always move it into the right slot afterwards.
The increased stack usage is not wonderful, but rather small relative to the whole call stack, and probably worth it to avoid code complexity. It also gives us a place to perform throwaway TCP reads into, when reading responses longer than the destination buffer just to report the needed length to the caller.

3. Timeouts:

UDP being datagram based, there is no condition where we have to worry about blocking and getting stuck in the middle of a partial read. Timeout occurs just at the loop level.

Are there any special considerations for timeout here using TCP? My leaning is no, since we'll still be in a poll loop regime, and regardless of blocking state on the socket, recv should do partial reads in the absence of MSG_WAITALL.

4. Logic for when fallback is needed:

As noted in the thread "res_query/res_send contract findings", fallback is always needed by these functions when they get a response with the TC bit set because of the contract to return the size needed for the complete answer. But for high level (getaddrinfo, etc.) lookups, it's desirable to use truncated answers when we can. What should the condition for "when we can" be? My first leaning was that "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the truncated response contains only the CNAME RR, not any records from the A or AAAA RRset.

Some possible conditions that could be used:

- At least one RR of the type in the question. This seems to be the choice to make maximal use of truncated responses, but could give significantly fewer addresses than one might like if the nameserver is badly behaved or if there's a very large CNAME consuming most of the packet.

- No CNAME and packet size is at least 512 minus the size of one RR. This goes maximally in the other direction, never using results that might be limited by the presence of a CNAME, and ensuring we always have the number of answers we'd keep from a TCP response.

There are probably several other reasonable options on a spectrum between these two.

Unless name_from_dns (lookup_name.c) is changed to use longer response buffers, the only case in which switching to TCP will give us a better answer is when the nameserver is being petulant in its truncation. But it probably should be changed, since the case where the entire packet is consumed by a CNAME can be hit. To avoid that, the buffer needs to be at least just under 600 bytes.
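
[Editorial illustration, not part of the original message.] The first candidate condition in item 4 ("at least one RR of the type in the question") could be checked roughly as sketched below. This is only a sketch under stated assumptions: the function names (skip_name, dns_is_truncated, count_answers_of_type, truncated_answer_usable) are hypothetical and are not musl's actual code, and the walk simply treats any packet whose records run past the end of the buffer as unusable.

```c
#include <stddef.h>

/* Hypothetical helper: skip a (possibly compressed) name starting at
 * offset p. Returns the offset just past it, or -1 if it runs off the
 * end of the packet. */
static long skip_name(const unsigned char *pkt, size_t len, size_t p)
{
	while (p < len) {
		unsigned c = pkt[p];
		if (c == 0) return (long)(p + 1);        /* root label ends the name */
		if ((c & 0xc0) == 0xc0)                  /* 2-byte compression pointer */
			return p + 2 <= len ? (long)(p + 2) : -1;
		if (c & 0xc0) return -1;                 /* reserved label types */
		p += 1 + c;                              /* ordinary label */
	}
	return -1;
}

/* Nonzero if the TC bit (0x02 in the third header byte) is set. */
int dns_is_truncated(const unsigned char *pkt, size_t len)
{
	return len >= 12 && (pkt[2] & 0x02) != 0;
}

/* Count answer RRs of type qtype. Returns -1 if the packet is malformed
 * or holds fewer complete RRs than the header's ANCOUNT claims. */
int count_answers_of_type(const unsigned char *pkt, size_t len, int qtype)
{
	unsigned qdcount, ancount, i;
	long p = 12;
	int n = 0;

	if (len < 12) return -1;
	qdcount = pkt[4] << 8 | pkt[5];
	ancount = pkt[6] << 8 | pkt[7];
	for (i = 0; i < qdcount; i++) {              /* skip the question section */
		p = skip_name(pkt, len, (size_t)p);
		if (p < 0 || (size_t)p + 4 > len) return -1;
		p += 4;                                  /* QTYPE + QCLASS */
	}
	for (i = 0; i < ancount; i++) {
		unsigned type, rdlen;
		p = skip_name(pkt, len, (size_t)p);
		if (p < 0 || (size_t)p + 10 > len) return -1;
		type = pkt[p] << 8 | pkt[p + 1];
		rdlen = pkt[p + 8] << 8 | pkt[p + 9];
		p += 10;                                 /* TYPE, CLASS, TTL, RDLENGTH */
		if ((size_t)p + rdlen > len) return -1;  /* RDATA cut off */
		p += rdlen;
		if (type == (unsigned)qtype) n++;
	}
	return n;
}

/* Candidate predicate: a response is usable without TCP fallback if it
 * parses cleanly and carries at least one RR of the requested type. */
int truncated_answer_usable(const unsigned char *pkt, size_t len, int qtype)
{
	return count_answers_of_type(pkt, len, qtype) > 0;
}
```

With this predicate, a truncated reply that contains only a CNAME would trigger TCP fallback, while one carrying a CNAME plus at least one address record would be accepted as-is. The stricter second condition would need an additional size and CNAME check on top of the same walk.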

* Re: [musl] TCP fallback open questions
From: Florian Weimer @ 2022-09-20 9:42 UTC
To: Rich Felker; +Cc: musl

* Rich Felker:

> In principle we could end up using N*M TCP sockets for an exhaustive
> parallel query. N and M are small enough that this isn't huge, but
> it's also not really nice. Given that the switch to TCP was triggered
> by a truncated UDP response, we already know that the responding
> server *knows the answer* and just can't send it within the size
> limits. So a reasonable course of action is just to open a TCP
> connection to the nameserver that issued the truncated response. This
> is not necessarily going to be optimal -- it's possible that another
> nameserver has gotten a response in the mean time and that the
> round-trip for TCP handshake and payload would be much lower to that
> other server. But I'm doubtful that consuming extra kernel resources
> and producing extra network load to optimize the latency here is a
> reasonable tradeoff.

The large centralized load balancers typically do not share caches between their UDP and TCP endpoints, at least not immediately, and neither between different UDP frontend servers behind the loadbalancer. So the assumption that the cache is hot at the time of the TCP query is probably not true in that case. But it probably does not matter.

> I'm assuming so far that each question at least would have its own TCP
> connection (if truncated as UDP). Using multiple nameservers in
> parallel with TCP would maybe be an option if we were doing multiple
> queries on the same connection, but I'm not aware of whether TCP DNS
> has any sort of "pipelining" that would make this perform reasonably.
> Maybe if "priming" the question via UDP it doesn't matter though and
> we could expect the queries to be processed immediately with cached
> results? I don't think I like this but I'm just raising it for
> completeness.

TCP DNS has pipelining. The glibc stub resolver exercises that, sending the A and AAAA queries back-to-back over TCP, probably in the same segment. I think it should even deal with reordered responses (so no head-of-line blocking), but I'm not sure if recursive resolver code actually exercises it by reordering replies.

Historically, it's unfriendly to keep TCP connections to recursive resolvers open for extended periods of time. Furthermore, it complicates the retry logic in the client because once you keep connections open, RST in response to a send does not indicate an error (the server may just have dropped the question), so you need to retry in that case with a fresh connection.

> TL;DR summary: my leaning is to do one TCP connection per question
> that needs fallback, to the nameserver that issued the truncated
> response for the question. Does this seem reasonable? Am I overlooking
> anything important?

It's certainly the most conservative approach.

> 3. Timeouts:
>
> UDP being datagram based, there is no condition where we have to worry
> about blocking and getting stuck in the middle of a partial read.
> Timeout occurs just at the loop level.
>
> Are there any special considerations for timeout here using TCP? My
> leaning is no, since we'll still be in a poll loop regime, and
> regardless of blocking state on the socket, recv should do partial
> reads in the absence of MSG_WAITALL.

Ideally, you'd also use a non-blocking connect with a shorter timeout than the system default (which can be quite long).

> 4. Logic for when fallback is needed:
>
> As noted in the thread "res_query/res_send contract findings",
> fallback is always needed by these functions when they get a response
> with the TC bit set because of the contract to return the size needed
> for the complete answer. But for high level (getaddrinfo, etc.)
> lookups, it's desirable to use truncated answers when we can. What
> should the condition for "when we can" be? My first leaning was that
> "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
> truncated response contains only the CNAME RR, not any records from
> the A or AAAA RRset.

Historically, TC=1 responses could have responses truncated in the middle of the record set, or even the record. Some middleboxes probably do this still. You can still detect this after a failing packet parse if it's actual truncation, but there could be other data there as well. I expect most implementations to just discard TC=1 responses.

There's at least one implementation out there that tries a UDP EDNS0 query with a larger buffer space first when it encounters a TC=1 response, rather than going to TCP directly. But that probably needs EDNS0-specific failure detection, so not ideal either.

Some virtualization software also violates the UDP 512-byte contract, so you need to be prepared to receive larger responses over UDP as well. (I think this particular case is particularly bad because TCP service is broken as well.)

> Some possible conditions that could be used:
>
> - At least one RR of the type in the question. This seems to be the
> choice to make maximal use of truncated responses, but could give
> significantly fewer addresses than one might like if the nameserver
> is badly behaved or if there's a very large CNAME consuming most of
> the packet.
>
> - No CNAME and packet size is at least 512 minus the size of one RR.
> This goes maximally in the other direction, never using results that
> might be limited by the presence of a CNAME, and ensuring we always
> have the number of answers we'd keep from a TCP response.

You really only should process the answer section if its record count indicates that it's complete (compared to the header). More complex heuristics probably go wrong with some slightly broken DNS servers.

Thanks,
Florian
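
[Editorial illustration, not part of the original message.] The suggestion above of a non-blocking connect with a timeout shorter than the system default can be sketched as follows. This is a standalone sketch under assumptions: the names connect_timeout and connect_timeout_v4 are hypothetical, SOCK_NONBLOCK and SOCK_CLOEXEC are assumed to be available (Linux and the BSDs), and a resolver with its own poll loop would presumably fold the POLLOUT wait into that loop instead of polling separately.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Start a non-blocking TCP connect and wait at most timeout_ms for it
 * to finish. Returns a connected, still non-blocking fd, or -1 with
 * errno set (ETIMEDOUT if the wait expired). */
int connect_timeout(const struct sockaddr *sa, socklen_t salen, int timeout_ms)
{
	struct pollfd pfd;
	int fd, r, err = 0, saved;
	socklen_t elen = sizeof err;

	fd = socket(sa->sa_family, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);
	if (fd < 0) return -1;
	if (connect(fd, sa, salen) == 0) return fd;   /* completed immediately */
	if (errno != EINPROGRESS) goto fail;

	pfd.fd = fd;
	pfd.events = POLLOUT;                         /* writable means connect finished */
	pfd.revents = 0;
	/* Simplification: an EINTR restart waits the full timeout again. */
	do r = poll(&pfd, 1, timeout_ms);
	while (r < 0 && errno == EINTR);
	if (r == 0) errno = ETIMEDOUT;
	if (r <= 0) goto fail;

	/* The outcome of the asynchronous connect is reported via SO_ERROR. */
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) < 0) goto fail;
	if (err) { errno = err; goto fail; }
	return fd;
fail:
	saved = errno;
	close(fd);
	errno = saved;
	return -1;
}

/* Convenience wrapper for an IPv4 nameserver address in host byte order. */
int connect_timeout_v4(uint32_t ip_host_order, uint16_t port, int timeout_ms)
{
	struct sockaddr_in sin = { .sin_family = AF_INET };
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = htonl(ip_host_order);
	return connect_timeout((const struct sockaddr *)&sin, sizeof sin, timeout_ms);
}
```

For example, connect_timeout_v4(INADDR_LOOPBACK, 53, 1000) would try a local nameserver for at most about one second.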

* Re: [musl] TCP fallback open questions
From: Rich Felker @ 2022-09-20 12:53 UTC
To: Florian Weimer; +Cc: musl

On Tue, Sep 20, 2022 at 11:42:04AM +0200, Florian Weimer wrote:
> * Rich Felker:
>
> > In principle we could end up using N*M TCP sockets for an exhaustive
> > parallel query. N and M are small enough that this isn't huge, but
> > it's also not really nice. Given that the switch to TCP was triggered
> > by a truncated UDP response, we already know that the responding
> > server *knows the answer* and just can't send it within the size
> > limits. So a reasonable course of action is just to open a TCP
> > connection to the nameserver that issued the truncated response. This
> > is not necessarily going to be optimal -- it's possible that another
> > nameserver has gotten a response in the mean time and that the
> > round-trip for TCP handshake and payload would be much lower to that
> > other server. But I'm doubtful that consuming extra kernel resources
> > and producing extra network load to optimize the latency here is a
> > reasonable tradeoff.
>
> The large centralized load balancers typically do not share caches
> between their UDP and TCP endpoints, at least not immediately, and
> neither between different UDP frontend servers behind the loadbalancer.
> So the assumption that the cache is hot at the time of the TCP query is
> probably not true in that case. But it probably does not matter.

Thanks, this is good to know. For the most important case, a local trusted validating nameserver on localhost, it should mean the result is immediately available. For others, at least it's hopefully indicative that the result is likely-obtainable.
I would really hope that the big servers like Google and CF don't go back to querying the authoritative server a second time upon fallback to TCP, but use their own common upstream cache or something, if for no other reason than not putting unnecessary load on the rest of the internet. But maybe this is naive...

> > I'm assuming so far that each question at least would have its own TCP
> > connection (if truncated as UDP). Using multiple nameservers in
> > parallel with TCP would maybe be an option if we were doing multiple
> > queries on the same connection, but I'm not aware of whether TCP DNS
> > has any sort of "pipelining" that would make this perform reasonably.
> > Maybe if "priming" the question via UDP it doesn't matter though and
> > we could expect the queries to be processed immediately with cached
> > results? I don't think I like this but I'm just raising it for
> > completeness.
>
> TCP DNS has pipelining.

Has that always been a thing? If so, it might advise a different way to do this.

> The glibc stub resolver exercises that, sending
> the A and AAAA queries back-to-back over TCP, probably in the same
> segment. I think it should even deal with reordered responses (so no
> head-of-line blocking), but I'm not sure if recursive resolver code
> actually exercises it by reordering replies.

Do real-world servers reliably do out-of-order responding, starting multiple queries received over the same connection in parallel and responding to them in the order answers become available? If so, this is a potentially appealing approach. It does require either reading the 2-byte length as a discrete read/recv first or else performing complex buffer shuffling (since without knowing length, answers to 2 different queries might come in the same read) but the syscall cost is really inconsequential compared to the network latency costs anyway, so it's probably best not to be concerned with optimizing number of syscalls.
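
[Editorial illustration, not part of the original message.] The "read the 2-byte length as a discrete recv first" option can be sketched as a small resumable reader driven from a poll loop. This is a sketch under assumptions: struct tcp_reader and tcp_reader_step are hypothetical names, MSG_DONTWAIT is assumed to be available, and bytes beyond the caller's buffer are read into a throwaway scratch buffer so the full message length can still be reported.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Per-connection state; zero-initialize before each new message. */
struct tcp_reader {
	unsigned char lenbuf[2]; /* big-endian length prefix as received */
	size_t pos;              /* bytes consumed so far, prefix included */
	size_t msglen;           /* message length, valid once pos >= 2 */
};

/* Call when poll reports the socket readable. Returns 1 once a complete
 * message has been consumed (its first min(msglen, bufsize) bytes are in
 * buf), 0 if more data is needed, -1 on error or EOF. Never reads past
 * the end of the current message, so a pipelined answer that follows
 * stays in the socket buffer. */
int tcp_reader_step(struct tcp_reader *r, int fd,
                    unsigned char *buf, size_t bufsize)
{
	for (;;) {
		unsigned char scratch[512];
		void *dst;
		size_t want;
		ssize_t n;

		if (r->pos < 2) {                 /* still reading the length prefix */
			dst = r->lenbuf + r->pos;
			want = 2 - r->pos;
		} else {
			size_t off = r->pos - 2;
			if (off == r->msglen) return 1;
			want = r->msglen - off;
			if (off < bufsize) {          /* room left in the caller's buffer */
				dst = buf + off;
				if (want > bufsize - off) want = bufsize - off;
			} else {                      /* throwaway read of the excess */
				dst = scratch;
				if (want > sizeof scratch) want = sizeof scratch;
			}
		}
		n = recv(fd, dst, want, MSG_DONTWAIT);
		if (n < 0) {
			if (errno == EINTR) continue;
			return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
		}
		if (n == 0) return -1;            /* peer closed mid-message */
		r->pos += (size_t)n;
		if (r->pos == 2)
			r->msglen = (size_t)(r->lenbuf[0] << 8 | r->lenbuf[1]);
	}
}
```

After a return of 1, r->msglen holds the full answer size even when it exceeded bufsize, which matches the res_send-style contract of reporting the needed length.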

> Historically, it's unfriendly to keep TCP connections to recursive
> resolvers open for extended periods of time. Furthermore, it
> complicates the retry logic in the client because once you keep
> connections open, RST in response to a send does not indicate an error
> (the server may just have dropped the question), so you need to retry in
> that case with a fresh connection.

We definitely wouldn't keep them open for an extended period since the expectation is that they won't normally be used, and since tying up fds is bad. But if there's nasty corner case handling for when the server decides it doesn't want to answer more questions on your existing socket (possibly even within a single run) the prospect of using a single connection for multiple queries becomes less appealing.

> > TL;DR summary: my leaning is to do one TCP connection per question
> > that needs fallback, to the nameserver that issued the truncated
> > response for the question. Does this seem reasonable? Am I overlooking
> > anything important?
>
> It's certainly the most conservative approach.
>
> > 3. Timeouts:
> >
> > UDP being datagram based, there is no condition where we have to worry
> > about blocking and getting stuck in the middle of a partial read.
> > Timeout occurs just at the loop level.
> >
> > Are there any special considerations for timeout here using TCP? My
> > leaning is no, since we'll still be in a poll loop regime, and
> > regardless of blocking state on the socket, recv should do partial
> > reads in the absence of MSG_WAITALL.
>
> Ideally, you'd also use a non-blocking connect with a shorter timeout
> than the system default (which can be quite long).

Yes, definitely nonblocking connect, or rather nonblocking sendmsg with MSG_FASTOPEN as long as the kernel supports it. (If the server supports it, this saves us a lot of latency, and even if not, the kernel avoids waking us just to perform the send after the connect completes.)
I think keeping the logic for this clean and simple is another motivation for using one socket per query.

> > 4. Logic for when fallback is needed:
> >
> > As noted in the thread "res_query/res_send contract findings",
> > fallback is always needed by these functions when they get a response
> > with the TC bit set because of the contract to return the size needed
> > for the complete answer. But for high level (getaddrinfo, etc.)
> > lookups, it's desirable to use truncated answers when we can. What
> > should the condition for "when we can" be? My first leaning was that
> > "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
> > truncated response contains only the CNAME RR, not any records from
> > the A or AAAA RRset.
>
> Historically, TC=1 responses could have responses truncated in the
> middle of the record set, or even the record. Some middleboxes probably
> do this still. You can still detect this after a failing packet parse
> if it's actual truncation, but there could be other data there as well.
> I expect most implementations to just discard TC=1 responses.

By the time we parse the packet (except looking at RCODE) the query machine has been discarded. I think TCP support means we want to do at least some rudimentary parsing inside the machine (or as a callback from it) as part of the predicate to decide whether to accept the truncated response.

> There's at least one implementation out there that tries a UDP EDNS0
> query with a larger buffer space first when it encounters a TC=1
> response, rather than going to TCP directly. But that probably needs
> EDNS0-specific failure detection, so not ideal either.

Yes, I basically ruled out doing anything with EDNS0 because it requires layering violations, or rewriting the query packet to EDNS0, then rewriting the answer back, so that it's in the form the caller expects. This could be avoided internally but with res_* API it becomes externally visible.
On top of that, EDNS0 parsing has been a source of vulns in various software in the past, which makes me want to stay away from it. It also doesn't even provide a complete solution to the problem, since for very large answers you'll still need a second fallback to TCP.

> Some virtualization software also violates the UDP 512-byte contract, so
> you need to be prepared to receive larger responses over UDP as well.
> (I think this particular case is particularly bad because TCP service is
> broken as well.)

We generally don't support violation of DNS contract. In this case it doesn't really matter though; the large packets will just be silently truncated by reading only 512 bytes. We could in theory request a single extra byte via an iovec to detect this condition and artificially add the TC bit if it's missing, but unless there's a strong motivation to do this, the right action is probably to ignore it along with the plethora of other utterly broken things wacky nameservers can do.

> > Some possible conditions that could be used:
> >
> > - At least one RR of the type in the question. This seems to be the
> > choice to make maximal use of truncated responses, but could give
> > significantly fewer addresses than one might like if the nameserver
> > is badly behaved or if there's a very large CNAME consuming most of
> > the packet.
> >
> > - No CNAME and packet size is at least 512 minus the size of one RR.
> > This goes maximally in the other direction, never using results that
> > might be limited by the presence of a CNAME, and ensuring we always
> > have the number of answers we'd keep from a TCP response.
>
> You really only should process the answer section if its record count
> indicates that it's complete (compared to the header). More complex
> heuristics probably go wrong with some slightly broken DNS servers.

Yes, I was assuming it is complete with respect to the header and that the nameserver is following the spec and truncating on RR granularity, with the ANCOUNT reflecting the number of RRs actually present. But this can be checked as another means to trigger TCP fallback, and doing so is probably a good idea (and won't affect anyone with non-broken nameservers).

So the question here is basically just about how to decide whether to fall back when the packet is well-formed but lacks one or more RR from what the complete answer would have.

Rich

* Re: [musl] TCP fallback open questions
From: Florian Weimer @ 2022-10-10 15:07 UTC
To: Rich Felker; +Cc: musl

* Rich Felker:

>> TCP DNS has pipelining.
>
> Has that always been a thing? If so, it might advise a different way
> to do this.

Like most things DNS, it's controversial. There's certainly the glibc precedent. I found this:

| 6.2.1.1. Query Pipelining
|
| Due to the historical use of TCP primarily for zone transfer and
| truncated responses, no existing RFC discusses the idea of pipelining
| DNS queries over a TCP connection.
|
| In order to achieve performance on par with UDP, DNS clients SHOULD
| pipeline their queries. When a DNS client sends multiple queries to
| a server, it SHOULD NOT wait for an outstanding reply before sending
| the next query. Clients SHOULD treat TCP and UDP equivalently when
| considering the time at which to send a particular query.
|
| It is likely that DNS servers need to process pipelined queries
| concurrently and also send out-of-order responses over TCP in order
| to provide the level of performance possible with UDP transport. If
| TCP performance is of importance, clients might find it useful to use
| server processing times as input to server and transport selection
| algorithms.
|
| DNS servers (especially recursive) MUST expect to receive pipelined
| queries. The server SHOULD process TCP queries concurrently, just as
| it would for UDP. The server SHOULD answer all pipelined queries,
| even if they are received in quick succession. The handling of
| responses to pipelined queries is covered in Section 7.

<https://www.rfc-editor.org/rfc/rfc7766#section-6.2.1.1>

But I'm too out-of-touch to tell whether this is good advice (for client behavior).

>> The glibc stub resolver exercises that, sending
>> the A and AAAA queries back-to-back over TCP, probably in the same
>> segment. I think it should even deal with reordered responses (so no
>> head-of-line blocking), but I'm not sure if recursive resolver code
>> actually exercises it by reordering replies.
>
> Do real-world servers reliably do out-of-order responding, starting
> multiple queries received over the same connection in parallel and
> responding to them in the order answers become available?

There's a security requirement to collapse multiple outgoing queries for the same QNAME/QCLASS/QTYPE, so strict sequential processing without updating global state isn't really possible. It's a small step from there to sending out the responses in parallel. Section 7 in RFC 7766 recommends this as well.

>> Historically, it's unfriendly to keep TCP connections to recursive
>> resolvers open for extended periods of time. Furthermore, it
>> complicates the retry logic in the client because once you keep
>> connections open, RST in response to a send does not indicate an error
>> (the server may just have dropped the question), so you need to retry in
>> that case with a fresh connection.
>
> We definitely wouldn't keep them open for an extended period since the
> expectation is that they won't normally be used, and since tying up
> fds is bad. But if there's nasty corner case handling for when the
> server decides it doesn't want to answer more questions on your
> existing socket (possibly even within a single run) the prospect of
> using a single connection for multiple queries becomes less appealing.

It's about reviving a connection that's been dormant for a while, not the case of back-to-back or parallel queries for A/AAAA. TCP treats uncoordinated connection close like a network error, and DNS doesn't have a coordination protocol. The DNS resolver eventually needs to shut down connections from clients that are inactive, which requires an uncoordinated close. On the client side, the BSD sockets API isn't really good at telling this from a bona-fide networking problem even if it has additional data.

Thanks,
Florian
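
[Editorial illustration, not part of the original message.] The pipelining discussed in this thread (A and AAAA queries sent back-to-back on one TCP connection, replies possibly arriving out of order) can be sketched as below. This is a sketch under assumptions: mkquery, frame_pipelined_pair and match_response are hypothetical names, the query IDs are supplied by the caller, and this is not how glibc or musl actually structure their code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode a one-question query (class IN, RD set) into out.
 * Returns the message length, or -1 on a bad name or a short buffer. */
int mkquery(uint16_t id, const char *name, int qtype,
            unsigned char *out, size_t outsize)
{
	size_t nlen = strlen(name), p = 12, i = 0;

	if (nlen && name[nlen - 1] == '.') nlen--;   /* tolerate a trailing dot */
	if (nlen > 253 || outsize < 12 + nlen + 2 + 4) return -1;
	memset(out, 0, 12);
	out[0] = id >> 8; out[1] = id & 0xff;
	out[2] = 1;                                  /* RD */
	out[5] = 1;                                  /* QDCOUNT = 1 */
	while (i < nlen) {                           /* one length-prefixed label per loop */
		size_t j = i;
		while (j < nlen && name[j] != '.') j++;
		if (j == i || j - i > 63) return -1;     /* empty or oversized label */
		out[p++] = (unsigned char)(j - i);
		memcpy(out + p, name + i, j - i);
		p += j - i;
		i = j + 1;
	}
	out[p++] = 0;                                /* root label */
	out[p++] = qtype >> 8; out[p++] = qtype & 0xff;
	out[p++] = 0; out[p++] = 1;                  /* class IN */
	return (int)p;
}

/* Write the A and AAAA queries back-to-back into buf, each preceded by
 * its 2-byte big-endian length, so that a single send can carry both.
 * Returns the total number of bytes, or -1 on failure. */
int frame_pipelined_pair(const char *name, uint16_t id_a, uint16_t id_aaaa,
                         unsigned char *buf, size_t bufsize)
{
	const int types[2] = { 1, 28 };              /* A, AAAA */
	const uint16_t ids[2] = { id_a, id_aaaa };
	size_t pos = 0;
	int k, n;

	for (k = 0; k < 2; k++) {
		if (bufsize - pos < 2) return -1;
		n = mkquery(ids[k], name, types[k], buf + pos + 2, bufsize - pos - 2);
		if (n < 0) return -1;
		buf[pos] = n >> 8; buf[pos + 1] = n & 0xff;
		pos += 2 + (size_t)n;
	}
	return (int)pos;
}

/* Replies may arrive in either order, so match them to questions by ID.
 * Returns 0 for the first query, 1 for the second, -1 for no match. */
int match_response(const unsigned char *resp, size_t len, const uint16_t ids[2])
{
	uint16_t id;
	int k;

	if (len < 12) return -1;
	id = (uint16_t)(resp[0] << 8 | resp[1]);
	for (k = 0; k < 2; k++)
		if (ids[k] == id) return k;
	return -1;
}
```

Matching purely on the ID assumes the two in-flight queries were given distinct IDs; a more careful client would probably also compare the echoed question section.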