mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] [PATCH] [draft] dns tcp fallback
@ 2022-09-22 20:28 Rich Felker
  2022-09-26  3:12 ` Rich Felker
  0 siblings, 1 reply; 2+ messages in thread
From: Rich Felker @ 2022-09-22 20:28 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 970 bytes --]

Here is my first draft of the DNS TCP fallback support that seems to
be basically working. There may of course be bugs, and there are cases
where I probably don't want it to fall back like it's doing, but it's
at a stage where review and [smoke-]testing would be helpful.

Some known issues:

- getaddrinfo should use larger buffers (probably 640-768 bytes) or
  the "huge CNAME" problem iss not solved.

- If res_send is used to send EDNS0 queries crafted by the caller and
  the caller provides a buffer smaller than what the OPT header
  advertises, the answer will be silently truncated to the caller's
  buffer size.

I've tested querying TXT for locations.publicdns.goog (>7k RRset) and
basic non-fallback usage. Further large-answer query test cases would
be appreciated, especially names where one or both of the A or AAAA
RRset are large so that fallback in parallel lookup can be tested, and
later so that getaddrinfo buffer size increase can be tested.

Rich

[-- Attachment #2: 0001-adapt-res_msend-DNS-query-core-for-working-with-mult.patch --]
[-- Type: text/plain, Size: 2279 bytes --]

From c87d75f2aa6fde49a99cf3287e535a14f354f781 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Wed, 21 Sep 2022 18:49:53 -0400
Subject: [PATCH 1/3] adapt res_msend DNS query core for working with multiple
 sockets

this is groundwork for TCP fallback support, but does not itself
change behavior in any way.
---
 src/network/res_msend.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/src/network/res_msend.c b/src/network/res_msend.c
index 9adaea13..98e6d8a3 100644
--- a/src/network/res_msend.c
+++ b/src/network/res_msend.c
@@ -16,7 +16,9 @@
 
 static void cleanup(void *p)
 {
-	__syscall(SYS_close, (intptr_t)p);
+	struct pollfd *pfd = p;
+	for (int i=0; pfd[i].fd >= -1; i++)
+		if (pfd[i].fd >= 0) __syscall(SYS_close, pfd[i].fd);
 }
 
 static unsigned long mtime()
@@ -44,7 +46,7 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 	int next;
 	int i, j;
 	int cs;
-	struct pollfd pfd;
+	struct pollfd pfd[nqueries+2];
 	unsigned long t0, t1, t2;
 
 	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs);
@@ -92,7 +94,12 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 	 * yield either no reply (indicated by zero length) or an answer
 	 * packet which is up to the caller to interpret. */
 
-	pthread_cleanup_push(cleanup, (void *)(intptr_t)fd);
+	for (i=0; i<nqueries; i++) pfd[i].fd = -1;
+	pfd[nqueries].fd = fd;
+	pfd[nqueries].events = POLLIN;
+	pfd[nqueries+1].fd = -2;
+
+	pthread_cleanup_push(cleanup, pfd);
 	pthread_setcancelstate(cs, 0);
 
 	/* Convert any IPv4 addresses in a mixed environment to v4-mapped */
@@ -112,8 +119,6 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 
 	memset(alens, 0, sizeof *alens * nqueries);
 
-	pfd.fd = fd;
-	pfd.events = POLLIN;
 	retry_interval = timeout / attempts;
 	next = 0;
 	t0 = t2 = mtime();
@@ -133,7 +138,7 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 		}
 
 		/* Wait for a response, or until time to retry */
-		if (poll(&pfd, 1, t1+retry_interval-t2) <= 0) continue;
+		if (poll(pfd, nqueries+1, t1+retry_interval-t2) <= 0) continue;
 
 		while ((rlen = recvfrom(fd, answers[next], asize, 0,
 		  (void *)&sa, (socklen_t[1]){sl})) >= 0) {
-- 
2.21.0


[-- Attachment #3: 0002-res_send-use-a-temp-buffer-if-caller-s-buffer-is-und.patch --]
[-- Type: text/plain, Size: 1704 bytes --]

From e2e9517607f67c1e23c059769b19bf4270d22123 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Thu, 22 Sep 2022 12:41:23 -0400
Subject: [PATCH 2/3] res_send: use a temp buffer if caller's buffer is under
 512 bytes

for extremely small buffer sizes, the DNS query core in __res_msend
may malfunction completely, being unable to get even the headers to
determine the response code. but there is also a problem for
reasonable sizes under 512 bytes: __res_msend is unable to determine
if the udp answer was truncated at the recv layer, in which case it
may be incomplete, and res_send is then unable to honor its contract
to return the length of the full, non-truncated answer.

at present, res_send does not honor that contract anyway when the full
answer would exceed 512 bytes, since there is no tcp fallback, but
this change at least makes it consistent in a context where this is
the only "full answer" to be had.
---
 src/network/res_send.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/network/res_send.c b/src/network/res_send.c
index ee4abf1f..9593164d 100644
--- a/src/network/res_send.c
+++ b/src/network/res_send.c
@@ -1,8 +1,16 @@
 #include <resolv.h>
+#include <string.h>
 
 int __res_send(const unsigned char *msg, int msglen, unsigned char *answer, int anslen)
 {
-	int r = __res_msend(1, &msg, &msglen, &answer, &anslen, anslen);
+	int r;
+	if (anslen < 512) {
+		unsigned char buf[512];
+		r = __res_send(msg, msglen, buf, sizeof buf);
+		if (r >= 0) memcpy(answer, buf, r < anslen ? r : anslen);
+		return r;
+	}
+	r = __res_msend(1, &msg, &msglen, &answer, &anslen, anslen);
 	return r<0 || !anslen ? -1 : anslen;
 }
 
-- 
2.21.0


[-- Attachment #4: 0003-dns-implement-tcp-fallback-in-__res_msend-query-core.patch --]
[-- Type: text/plain, Size: 7399 bytes --]

From 51d4669fb97782f6a66606da852b5afd49a08001 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Thu, 22 Sep 2022 14:17:05 -0400
Subject: [PATCH 3/3] dns: implement tcp fallback in __res_msend query core

tcp fallback was originally deemed unwanted and unnecessary, since we
aim to return a bounded-size result from getaddrinfo anyway and
normally plenty of address records fit in the 512-byte udp dns limit.
however, this turned out to have several problems:

- some recursive nameservers truncate by omitting all the answers,
  rather than sending as many as can fit.

- a pathological worst-case CNAME for a worst-case name can fill the
  entire 512-byte space with just the two names, leaving no room for
  any addresses.

- the res_* family of interfaces allow querying of non-address records
  such as TLSA (DANE), TXT, etc. which can be very large. for many of
  these, it's critical that the caller see the whole RRset. also,
  res_send/res_query are specified to return the complete, untruncated
  length so that the caller can retry with an appropriately-sized
  buffer. determining this is not possible without tcp.

so, it's time to add tcp fallback.

the fallback strategy implemented here uses one tcp socket per
question (1 or 2 questions), initiated via tcp fastopen when possible.
the connection is made to the nameserver that issued the truncated
answer. right now, fallback happens unconditionally when truncation is
seen. this can, and may later be, relaxed for queries made by the
getaddrinfo system, since it will only use a bounded number of results
anyway.

retry is not attempted again after failure over tcp. the logic could
easily be adapted to do that, but it's of questionable value, since
the tcp stack automatically handles retransmission and the successs
answer with TC=1 over udp strongly suggests that the nameserver has
the full answer ready to give. further retry is likely just "take
longer to fail".
---
 src/network/res_msend.c | 119 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 117 insertions(+), 2 deletions(-)

diff --git a/src/network/res_msend.c b/src/network/res_msend.c
index 98e6d8a3..1e76886a 100644
--- a/src/network/res_msend.c
+++ b/src/network/res_msend.c
@@ -1,5 +1,6 @@
 #include <sys/socket.h>
 #include <netinet/in.h>
+#include <netinet/tcp.h>
 #include <netdb.h>
 #include <arpa/inet.h>
 #include <stdint.h>
@@ -29,6 +30,51 @@ static unsigned long mtime()
 		+ ts.tv_nsec / 1000000;
 }
 
+static int start_tcp(struct pollfd *pfd, int family, const void *sa, socklen_t sl, const unsigned char *q, int ql)
+{
+	struct msghdr mh = {
+		.msg_name = (void *)sa,
+		.msg_namelen = sl,
+		.msg_iovlen = 2,
+		.msg_iov = (struct iovec [2]){
+			{ .iov_base = (uint8_t[]){ ql>>8, ql }, .iov_len = 2 },
+			{ .iov_base = (void *)q, .iov_len = ql } }
+	};
+	int r;
+	int fd = socket(family, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0);
+	pfd->fd = fd;
+	pfd->events = POLLOUT;
+	if (!setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT,
+	    &(int){1}, sizeof(int))) {
+		r = sendmsg(fd, &mh, MSG_FASTOPEN|MSG_NOSIGNAL);
+		if (r == ql+2) pfd->events = POLLIN;
+		if (r >= 0) return r;
+		if (errno == EINPROGRESS) return 0;
+	}
+	r = connect(fd, sa, sl);
+	if (!r || errno == EINPROGRESS) return 0;
+	close(fd);
+	pfd->fd = -1;
+	return -1;
+}
+
+static void step_mh(struct msghdr *mh, size_t n)
+{
+	/* Adjust iovec in msghdr to skip first n bytes. */
+	while (mh->msg_iovlen && n >= mh->msg_iov->iov_len) {
+		n -= mh->msg_iov->iov_len;
+		mh->msg_iov++;
+		mh->msg_iovlen--;
+	}
+	if (!mh->msg_iovlen) return;
+	mh->msg_iov->iov_base = (char *)mh->msg_iov->iov_base + n;
+	mh->msg_iov->iov_len -= n;
+}
+
+/* Internal contract for __res_msend[_rc]: asize must be >=512, nqueries
+ * must be sufficiently small to be safe as VLA size. In practice it's
+ * either 1 or 2, anyway. */
+
 int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 	const int *qlens, unsigned char *const *answers, int *alens, int asize,
 	const struct resolvconf *conf)
@@ -47,6 +93,9 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 	int i, j;
 	int cs;
 	struct pollfd pfd[nqueries+2];
+	int qpos[nqueries], apos[nqueries];
+	unsigned char alen_buf[nqueries][2];
+	int r;
 	unsigned long t0, t1, t2;
 
 	pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs);
@@ -125,6 +174,11 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 	t1 = t2 - retry_interval;
 
 	for (; t2-t0 < timeout; t2=mtime()) {
+		/* This is the loop exit condition: that all queries
+		 * have an accepted answer. */
+		for (i=0; i<nqueries && alens[i]>0; i++);
+		if (i==nqueries) break;
+
 		if (t2-t1 >= retry_interval) {
 			/* Query all configured namservers in parallel */
 			for (i=0; i<nqueries; i++)
@@ -140,7 +194,8 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 		/* Wait for a response, or until time to retry */
 		if (poll(pfd, nqueries+1, t1+retry_interval-t2) <= 0) continue;
 
-		while ((rlen = recvfrom(fd, answers[next], asize, 0,
+		while (next < nqueries &&
+		  (rlen = recvfrom(fd, answers[next], asize, 0,
 		  (void *)&sa, (socklen_t[1]){sl})) >= 0) {
 
 			/* Ignore non-identifiable packets */
@@ -181,12 +236,72 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 			else
 				memcpy(answers[i], answers[next], rlen);
 
-			if (next == nqueries) goto out;
+			/* Ignore further UDP if all slots full or TCP-mode */
+			if (next == nqueries) pfd[nqueries].events = 0;
+
+			/* If answer is truncated (TC bit), fallback to TCP */
+			if (answers[i][2] & 2) {
+				alens[i] = -1;
+				pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, 0);
+				r = start_tcp(pfd+i, family, ns+j, sl, queries[i], qlens[i]);
+				pthread_setcancelstate(cs, 0);
+				if (r >= 0) {
+					qpos[i] = r;
+					apos[i] = 0;
+				}
+				continue;
+			}
+		}
+
+		for (i=0; i<nqueries; i++) if (pfd[i].revents & POLLOUT) {
+			struct msghdr mh = {
+				.msg_iovlen = 2,
+				.msg_iov = (struct iovec [2]){
+					{ .iov_base = (uint8_t[]){ qlens[i]>>8, qlens[i] }, .iov_len = 2 },
+					{ .iov_base = (void *)queries[i], .iov_len = qlens[i] } }
+			};
+			step_mh(&mh, qpos[i]);
+			r = sendmsg(pfd[i].fd, &mh, MSG_NOSIGNAL);
+			if (r < 0) goto out;
+			qpos[i] += r;
+			if (qpos[i] == qlens[i]+2)
+				pfd[i].events = POLLIN;
+		}
+
+		for (i=0; i<nqueries; i++) if (pfd[i].revents & POLLIN) {
+			struct msghdr mh = {
+				.msg_iovlen = 2,
+				.msg_iov = (struct iovec [2]){
+					{ .iov_base = alen_buf[i], .iov_len = 2 },
+					{ .iov_base = answers[i], .iov_len = asize } }
+			};
+			step_mh(&mh, apos[i]);
+			r = recvmsg(pfd[i].fd, &mh, 0);
+			if (r < 0) goto out;
+			apos[i] += r;
+			if (apos[i] < 2) continue;
+			int alen = alen_buf[i][0]*256 + alen_buf[i][1];
+			if (alen < 13) goto out;
+			if (apos[i] < alen+2 && apos[i] < asize+2)
+				continue;
+			int rcode = answers[i][3] & 15;
+			if (rcode != 0 && rcode != 3)
+				goto out;
+
+			/* Storing the length here commits the accepted answer.
+			 * Immediately close TCP socket so as not to consume
+			 * resources we no longer need. */
+			alens[i] = alen;
+			__syscall(SYS_close, pfd[i].fd);
+			pfd[i].fd = -1;
 		}
 	}
 out:
 	pthread_cleanup_pop(1);
 
+	/* Disregard any incomplete TCP results */
+	for (i=0; i<nqueries; i++) if (alens[i]<0) alens[i] = 0;
+
 	return 0;
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [musl] [PATCH] [draft] dns tcp fallback
  2022-09-22 20:28 [musl] [PATCH] [draft] dns tcp fallback Rich Felker
@ 2022-09-26  3:12 ` Rich Felker
  0 siblings, 0 replies; 2+ messages in thread
From: Rich Felker @ 2022-09-26  3:12 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1113 bytes --]

On Thu, Sep 22, 2022 at 04:28:34PM -0400, Rich Felker wrote:
> Here is my first draft of the DNS TCP fallback support that seems to
> be basically working. There may of course be bugs, and there are cases
> where I probably don't want it to fall back like it's doing, but it's
> at a stage where review and [smoke-]testing would be helpful.
> 
> Some known issues:
> 
> - getaddrinfo should use larger buffers (probably 640-768 bytes) or
>   the "huge CNAME" problem iss not solved.
> 
> - If res_send is used to send EDNS0 queries crafted by the caller and
>   the caller provides a buffer smaller than what the OPT header
>   advertises, the answer will be silently truncated to the caller's
>   buffer size.
> 
> I've tested querying TXT for locations.publicdns.goog (>7k RRset) and
> basic non-fallback usage. Further large-answer query test cases would
> be appreciated, especially names where one or both of the A or AAAA
> RRset are large so that fallback in parallel lookup can be tested, and
> later so that getaddrinfo buffer size increase can be tested.

OK, I have patches for both of the above now.


[-- Attachment #2: 0004-getaddrinfo-dns-lookup-use-larger-answer-buffer-to-h.patch --]
[-- Type: text/plain, Size: 2282 bytes --]

From 4b5570f57e0784611225968d47cfbe7babe02da4 Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Thu, 22 Sep 2022 19:11:48 -0400
Subject: [PATCH 4/5] getaddrinfo dns lookup: use larger answer buffer to
 handle long CNAMEs

the size of 512 is not sufficient to get at least one address in the
worst case where the name is at or near max length and resolves to a
CNAME at or near max length. prior to tcp fallback, there was nothing
we could do about this case anyway, but now it's fixable.

the new limit 768 is chosen so as to admit roughly the number of
addresses with a worst-case CNAME as could fit for a worst-case name
that's not a CNAME in the old 512-byte limit. outside of this
worst-case, the number of addresses that might be obtained is
increased.

MAXADDRS (48) was originally chosen as an upper bound on the combined
number of A and AAAA records that could fit in 512-byte packets (31
and 17, respectively). it is not increased at this time.

so as to prevent a situation where the A records consume almost all of
these slots (at 768 bytes, a "best-case" name can fit almost 47 A
records), the order of parsing is swapped to process AAAA first. this
ensures roughly half of the slots are available to each address
family.
---
 src/network/lookup_name.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
index 37d481f9..3eda65a7 100644
--- a/src/network/lookup_name.c
+++ b/src/network/lookup_name.c
@@ -137,7 +137,7 @@ static int dns_parse_callback(void *c, int rr, const void *data, int len, const
 
 static int name_from_dns(struct address buf[static MAXADDRS], char canon[static 256], const char *name, int family, const struct resolvconf *conf)
 {
-	unsigned char qbuf[2][280], abuf[2][512];
+	unsigned char qbuf[2][280], abuf[2][768];
 	const unsigned char *qp[2] = { qbuf[0], qbuf[1] };
 	unsigned char *ap[2] = { abuf[0], abuf[1] };
 	int qlens[2], alens[2];
@@ -171,7 +171,7 @@ static int name_from_dns(struct address buf[static MAXADDRS], char canon[static
 		if ((abuf[i][3] & 15) != 0) return EAI_FAIL;
 	}
 
-	for (i=0; i<nq; i++)
+	for (i=nq-1; i>=0; i--)
 		__dns_parse(abuf[i], alens[i], dns_parse_callback, &ctx);
 
 	if (ctx.cnt) return ctx.cnt;
-- 
2.21.0


[-- Attachment #3: 0005-dns-query-core-detect-udp-truncation-at-recv-time.patch --]
[-- Type: text/plain, Size: 2414 bytes --]

From 28960a5de11ecb39ce2d50dd1e3e97d55139052e Mon Sep 17 00:00:00 2001
From: Rich Felker <dalias@aerifal.cx>
Date: Sun, 25 Sep 2022 22:48:12 -0400
Subject: [PATCH 5/5] dns query core: detect udp truncation at recv time

we already attempt to preclude this case by having res_send use a
sufficiently large temporary buffer even if the caller did not provide
one as large as or larger than the udp dns max of 512 bytes. however,
it's possible that the caller passed a custom-crafted query packet
using EDNS0, e.g. to get detailed DNSSEC results, with a larger udp
size allowance.

I have also seen claims that there are some broken nameservers in the
wild that do not honor the dns udp limit of 512 and send large answers
without the TC bit set, when the query was not using EDNS.

we generally don't aim to support broken nameservers, but in this case
both problems, if the latter is even real, have a common solution:
using recvmsg instead of recvfrom so we can examine the MSG_TRUNC
flag.
---
 src/network/res_msend.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/src/network/res_msend.c b/src/network/res_msend.c
index 1e76886a..11c6aa0e 100644
--- a/src/network/res_msend.c
+++ b/src/network/res_msend.c
@@ -194,9 +194,18 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 		/* Wait for a response, or until time to retry */
 		if (poll(pfd, nqueries+1, t1+retry_interval-t2) <= 0) continue;
 
-		while (next < nqueries &&
-		  (rlen = recvfrom(fd, answers[next], asize, 0,
-		  (void *)&sa, (socklen_t[1]){sl})) >= 0) {
+		while (next < nqueries) {
+			struct msghdr mh = {
+				.msg_name = (void *)&sa,
+				.msg_namelen = sl,
+				.msg_iovlen = 1,
+				.msg_iov = (struct iovec []){
+					{ .iov_base = (void *)answers[next],
+					  .iov_len = asize }
+				}
+			};
+			rlen = recvmsg(fd, &mh, 0);
+			if (rlen < 0) break;
 
 			/* Ignore non-identifiable packets */
 			if (rlen < 4) continue;
@@ -240,7 +249,7 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries,
 			if (next == nqueries) pfd[nqueries].events = 0;
 
 			/* If answer is truncated (TC bit), fallback to TCP */
-			if (answers[i][2] & 2) {
+			if ((answers[i][2] & 2) || (mh.msg_flags & MSG_TRUNC)) {
 				alens[i] = -1;
 				pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, 0);
 				r = start_tcp(pfd+i, family, ns+j, sl, queries[i], qlens[i]);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2022-09-26  3:12 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-22 20:28 [musl] [PATCH] [draft] dns tcp fallback Rich Felker
2022-09-26  3:12 ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).