Date: Tue, 20 Sep 2022 08:53:18 -0400
From: Rich Felker
To: Florian Weimer
Cc: musl@lists.openwall.com
Message-ID: <20220920125317.GN9709@brightrain.aerifal.cx>
References: <20220916041435.GL9709@brightrain.aerifal.cx> <87wn9yo00j.fsf@oldenburg.str.redhat.com>
In-Reply-To: <87wn9yo00j.fsf@oldenburg.str.redhat.com>
Subject: Re: [musl] TCP fallback open questions

On Tue, Sep 20, 2022 at 11:42:04AM +0200, Florian Weimer wrote:
> * Rich Felker:
>
> > In principle we could end up using N*M TCP sockets for an exhaustive
> > parallel query. N and M are small enough that this isn't huge, but
> > it's also not really nice. Given that the switch to TCP was triggered
> > by a truncated UDP response, we already know that the responding
> > server *knows the answer* and just can't send it within the size
> > limits. So a reasonable course of action is just to open a TCP
> > connection to the nameserver that issued the truncated response.
> > This is not necessarily going to be optimal -- it's possible that another
> > nameserver has gotten a response in the mean time and that the
> > round-trip for TCP handshake and payload would be much lower to that
> > other server. But I'm doubtful that consuming extra kernel resources
> > and producing extra network load to optimize the latency here is a
> > reasonable tradeoff.
>
> The large centralized load balancers typically do not share caches
> between their UDP and TCP endpoints, at least not immediately, and
> neither between different UDP frontend servers behind the loadbalancer.
> So the assumption that the cache is hot at the time of the TCP query is
> probably not true in that case. But it probably does not matter.

Thanks, this is good to know. For the most important case, a local
trusted validating nameserver on localhost, it should mean the result
is immediately available. For others, at least it's hopefully
indicative that the result is likely-obtainable.

I would really hope that the big servers like Google and CF don't go
back to querying the authoritative server a second time upon fallback
to TCP, but use their own common upstream cache or something, if for
no other reason than not putting unnecessary load on the rest of the
internet. But maybe this is naive...

> > I'm assuming so far that each question at least would have its own TCP
> > connection (if truncated as UDP). Using multiple nameservers in
> > parallel with TCP would maybe be an option if we were doing multiple
> > queries on the same connection, but I'm not aware of whether TCP DNS
> > has any sort of "pipelining" that would make this perform reasonably.
> > Maybe if "priming" the question via UDP it doesn't matter though and
> > we could expect the queries to be processed immediately with cached
> > results? I don't think I like this but I'm just raising it for
> > completeness.
>
> TCP DNS has pipelining.

Has that always been a thing?
If so, it might advise a different way to do this.

> The glibc stub resolver exercises that, sending
> the A and AAAA queries back-to-back over TCP, probably in the same
> segment. I think it should even deal with reordered responses (so no
> head-of-line blocking), but I'm not sure if recursive resolver code
> actually exercises it by reordering replies.

Do real-world servers reliably do out-of-order responding, starting
multiple queries received over the same connection in parallel and
responding to them in the order answers become available? If so, this
is a potentially appealing approach. It does require either reading
the 2-byte length as a discrete read/recv first or else performing
complex buffer shuffling (since without knowing length, answers to 2
different queries might come in the same read) but the syscall cost is
really inconsequential compared to the network latency costs anyway,
so it's probably best not to be concerned with optimizing the number
of syscalls.

> Historically, it's unfriendly to keep TCP connections to recursive
> resolvers open for extended periods of time. Furthermore, it
> complicates the retry logic in the client because once you keep
> connections open, RST in response to a send does not indicate an error
> (the server may just have dropped the question), so you need to retry in
> that case with a fresh connection.

We definitely wouldn't keep them open for an extended period since the
expectation is that they won't normally be used, and since tying up
fds is bad. But if there's nasty corner case handling for when the
server decides it doesn't want to answer more questions on your
existing socket (possibly even within a single run) the prospect of
using a single connection for multiple queries becomes less appealing.

> > TL;DR summary: my leaning is to do one TCP connection per question
> > that needs fallback, to the nameserver that issued the truncated
> > response for the question. Does this seem reasonable?
> > Am I overlooking anything important?
>
> It's certainly the most conservative approach.
>
> > 3. Timeouts:
> >
> > UDP being datagram based, there is no condition where we have to worry
> > about blocking and getting stuck in the middle of a partial read.
> > Timeout occurs just at the loop level.
> >
> > Are there any special considerations for timeout here using TCP? My
> > leaning is no, since we'll still be in a poll loop regime, and
> > regardless of blocking state on the socket, recv should do partial
> > reads in the absence of MSG_WAITALL.
>
> Ideally, you'd also use a non-blocking connect with a shorter timeout
> than the system default (which can be quite long).

Yes, definitely nonblocking connect, or rather nonblocking sendmsg
with MSG_FASTOPEN as long as the kernel supports it. (If the server
supports it, this saves us a lot of latency, and even if not, the
kernel avoids waking us just to perform the send after the connect
completes.) I think keeping the logic for this clean and simple is
another motivation for using one socket per query.

> > 4. Logic for when fallback is needed:
> >
> > As noted in the thread "res_query/res_send contract findings",
> > fallback is always needed by these functions when they get a response
> > with the TC bit set because of the contract to return the size needed
> > for the complete answer. But for high level (getaddrinfo, etc.)
> > lookups, it's desirable to use truncated answers when we can. What
> > should the condition for "when we can" be? My first leaning was that
> > "nonzero ANCOUNT" suffices, but for CNAMEs, it's possible that the
> > truncated response contains only the CNAME RR, not any records from
> > the A or AAAA RRset.
>
> Historically, TC=1 responses could have responses truncated in the
> middle of the record set, or even the record. Some middleboxes probably
> do this still. You can still detect this after a failing packet parse
> if it's actual truncation, but there could be other data there as well.
> I expect most implementations to just discard TC=1 responses.

By the time we parse the packet (except looking at RCODE) the query
machine has been discarded. I think TCP support means we want to do at
least some rudimentary parsing inside the machine (or as a callback
from it) as part of the predicate to decide whether to accept the
truncated response.

> There's at least one implementation out there that tries an UDP EDNS0
> query with a larger buffer space first when it encounters a TC=1
> response, rather than going to TCP directly. But that probably needs
> EDNS0-specific failure detection, so not ideal either.

Yes, I basically ruled out doing anything with EDNS0 because it
requires layering violations, or rewriting the query packet to EDNS0,
then rewriting the answer back, so that it's in the form the caller
expects. This could be avoided internally but with res_* API it
becomes externally visible. On top of that, EDNS0 parsing has been a
source of vulns in various software in the past, which makes me want
to stay away from it. It also doesn't even provide a complete solution
to the problem, since for very large answers you'll still need a
second fallback to TCP.

> Some virtualization software also violates the UDP 512-byte contract, so
> you need to be prepared to receive larger responses over UDP as well.
> (I think this particular case is particularly bad because TCP service is
> broken as well.)

We generally don't support violation of DNS contract. In this case it
doesn't really matter though; the large packets will just be silently
truncated by reading only 512 bytes.
We could in theory request a single extra byte via an iovec to detect
this condition and artificially add the TC bit if it's missing, but
unless there's a strong motivation to do this, the right action is
probably to ignore it along with the plethora of other utterly broken
things wacky nameservers can do.

> > Some possible conditions that could be used:
> >
> > - At least one RR of the type in the question. This seems to be the
> >   choice to make maximal use of truncated responses, but could give
> >   significantly fewer addresses than one might like if the nameserver
> >   is badly behaved or if there's a very large CNAME consuming most of
> >   the packet.
> >
> > - No CNAME and packet size is at least 512 minus the size of one RR.
> >   This goes maximally in the other direction, never using results that
> >   might be limited by the presence of a CNAME, and ensuring we always
> >   have the number of answers we'd keep from a TCP response.
>
> You really only should process the answer section if its record count
> indicates that it's complete (compared to the header). More complex
> heuristics probably go wrong with some slightly broken DNS servers.

Yes, I was assuming it is complete with respect to the header and that
the nameserver is following the spec and truncating on RR granularity,
with the ANCOUNT reflecting the number of RRs actually present. But
this can be checked as another means to trigger TCP fallback, and
doing so is probably a good idea (and won't affect anyone with
non-broken nameservers).

So the question here is basically just about how to decide whether to
fall back when the packet is well-formed but lacks one or more RR from
what the complete answer would have.

Rich
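
For illustration only, here is a minimal C sketch of the completeness
check discussed in the last reply above. It is not taken from musl: the
function name usable_truncated_answers, its return convention, and the
helper skip_name are all hypothetical. It walks the question and answer
sections, returns -1 if the packet ends before all ANCOUNT records are
present (which would argue for TCP fallback), and otherwise returns how
many answer RRs have the queried type, so a caller could apply the "at
least one RR of the type in the question" condition.

```c
#include <stddef.h>

/* Hypothetical helper: skip one encoded domain name starting at offset p.
 * Returns the offset just past the name, or 0 if the name is malformed
 * or runs off the end of the packet. */
static size_t skip_name(const unsigned char *pkt, size_t len, size_t p)
{
    while (p < len) {
        unsigned c = pkt[p];
        if (c == 0) return p + 1;            /* root label ends the name */
        if (c >= 0xc0)                       /* compression pointer, 2 bytes */
            return p + 2 <= len ? p + 2 : 0;
        if (c >= 0x40) return 0;             /* reserved label types */
        p += 1 + c;                          /* ordinary label */
    }
    return 0;
}

/* Hypothetical predicate: returns -1 if the answer section is not
 * complete with respect to ANCOUNT in the header, otherwise the number
 * of answer RRs whose type equals qtype. */
int usable_truncated_answers(const unsigned char *pkt, size_t len,
                             unsigned qtype)
{
    if (len < 12) return -1;                 /* shorter than a DNS header */
    unsigned qdcount = pkt[4] << 8 | pkt[5];
    unsigned ancount = pkt[6] << 8 | pkt[7];
    size_t p = 12;
    int matches = 0;

    /* Each question is a name followed by 2-byte type and 2-byte class. */
    for (unsigned i = 0; i < qdcount; i++) {
        p = skip_name(pkt, len, p);
        if (!p || len - p < 4) return -1;
        p += 4;
    }

    /* Each RR is a name, then type(2) class(2) ttl(4) rdlength(2), rdata. */
    for (unsigned i = 0; i < ancount; i++) {
        p = skip_name(pkt, len, p);
        if (!p || len - p < 10) return -1;
        unsigned type = pkt[p] << 8 | pkt[p + 1];
        size_t rdlen = pkt[p + 8] << 8 | pkt[p + 9];
        p += 10;
        if (len - p < rdlen) return -1;      /* record cut off mid-rdata */
        p += rdlen;
        if (type == qtype) matches++;
    }
    return matches;
}
```

Under these assumptions, a caller holding a TC=1 response might treat a
result of -1 or 0 as "fall back to TCP" and a positive result as "usable
as-is"; whether a CNAME in the answer should also force fallback is the
open policy question from the thread and is deliberately left out.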