From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=5.0 tests=MAILING_LIST_MULTI, RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 3239 invoked from network); 22 Sep 2022 20:28:54 -0000 Received: from second.openwall.net (193.110.157.125) by inbox.vuxu.org with ESMTPUTF8; 22 Sep 2022 20:28:54 -0000 Received: (qmail 23628 invoked by uid 550); 22 Sep 2022 20:28:50 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Reply-To: musl@lists.openwall.com Received: (qmail 23593 invoked from network); 22 Sep 2022 20:28:49 -0000 Date: Thu, 22 Sep 2022 16:28:34 -0400 From: Rich Felker To: musl@lists.openwall.com Message-ID: <20220922202833.GY9709@brightrain.aerifal.cx> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="VMt1DrMGOVs3KQwf" Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) Subject: [musl] [PATCH] [draft] dns tcp fallback --VMt1DrMGOVs3KQwf Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Here is my first draft of the DNS TCP fallback support that seems to be basically working. There may of course be bugs, and there are cases where I probably don't want it to fall back like it's doing, but it's at a stage where review and [smoke-]testing would be helpful. Some known issues: - getaddrinfo should use larger buffers (probably 640-768 bytes) or the "huge CNAME" problem iss not solved. - If res_send is used to send EDNS0 queries crafted by the caller and the caller provides a buffer smaller than what the OPT header advertises, the answer will be silently truncated to the caller's buffer size. I've tested querying TXT for locations.publicdns.goog (>7k RRset) and basic non-fallback usage. Further large-answer query test cases would be appreciated, especially names where one or both of the A or AAAA RRset are large so that fallback in parallel lookup can be tested, and later so that getaddrinfo buffer size increase can be tested. Rich --VMt1DrMGOVs3KQwf Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="0001-adapt-res_msend-DNS-query-core-for-working-with-mult.patch" >From c87d75f2aa6fde49a99cf3287e535a14f354f781 Mon Sep 17 00:00:00 2001 From: Rich Felker Date: Wed, 21 Sep 2022 18:49:53 -0400 Subject: [PATCH 1/3] adapt res_msend DNS query core for working with multiple sockets this is groundwork for TCP fallback support, but does not itself change behavior in any way. --- src/network/res_msend.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/src/network/res_msend.c b/src/network/res_msend.c index 9adaea13..98e6d8a3 100644 --- a/src/network/res_msend.c +++ b/src/network/res_msend.c @@ -16,7 +16,9 @@ static void cleanup(void *p) { - __syscall(SYS_close, (intptr_t)p); + struct pollfd *pfd = p; + for (int i=0; pfd[i].fd >= -1; i++) + if (pfd[i].fd >= 0) __syscall(SYS_close, pfd[i].fd); } static unsigned long mtime() @@ -44,7 +46,7 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries, int next; int i, j; int cs; - struct pollfd pfd; + struct pollfd pfd[nqueries+2]; unsigned long t0, t1, t2; pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs); @@ -92,7 +94,12 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries, * yield either no reply (indicated by zero length) or an answer * packet which is up to the caller to interpret. */ - pthread_cleanup_push(cleanup, (void *)(intptr_t)fd); + for (i=0; i= 0) { -- 2.21.0 --VMt1DrMGOVs3KQwf Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="0002-res_send-use-a-temp-buffer-if-caller-s-buffer-is-und.patch" >From e2e9517607f67c1e23c059769b19bf4270d22123 Mon Sep 17 00:00:00 2001 From: Rich Felker Date: Thu, 22 Sep 2022 12:41:23 -0400 Subject: [PATCH 2/3] res_send: use a temp buffer if caller's buffer is under 512 bytes for extremely small buffer sizes, the DNS query core in __res_msend may malfunction completely, being unable to get even the headers to determine the response code. but there is also a problem for reasonable sizes under 512 bytes: __res_msend is unable to determine if the udp answer was truncated at the recv layer, in which case it may be incomplete, and res_send is then unable to honor its contract to return the length of the full, non-truncated answer. at present, res_send does not honor that contract anyway when the full answer would exceed 512 bytes, since there is no tcp fallback, but this change at least makes it consistent in a context where this is the only "full answer" to be had. --- src/network/res_send.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/src/network/res_send.c b/src/network/res_send.c index ee4abf1f..9593164d 100644 --- a/src/network/res_send.c +++ b/src/network/res_send.c @@ -1,8 +1,16 @@ #include +#include int __res_send(const unsigned char *msg, int msglen, unsigned char *answer, int anslen) { - int r = __res_msend(1, &msg, &msglen, &answer, &anslen, anslen); + int r; + if (anslen < 512) { + unsigned char buf[512]; + r = __res_send(msg, msglen, buf, sizeof buf); + if (r >= 0) memcpy(answer, buf, r < anslen ? r : anslen); + return r; + } + r = __res_msend(1, &msg, &msglen, &answer, &anslen, anslen); return r<0 || !anslen ? -1 : anslen; } -- 2.21.0 --VMt1DrMGOVs3KQwf Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="0003-dns-implement-tcp-fallback-in-__res_msend-query-core.patch" >From 51d4669fb97782f6a66606da852b5afd49a08001 Mon Sep 17 00:00:00 2001 From: Rich Felker Date: Thu, 22 Sep 2022 14:17:05 -0400 Subject: [PATCH 3/3] dns: implement tcp fallback in __res_msend query core tcp fallback was originally deemed unwanted and unnecessary, since we aim to return a bounded-size result from getaddrinfo anyway and normally plenty of address records fit in the 512-byte udp dns limit. however, this turned out to have several problems: - some recursive nameservers truncate by omitting all the answers, rather than sending as many as can fit. - a pathological worst-case CNAME for a worst-case name can fill the entire 512-byte space with just the two names, leaving no room for any addresses. - the res_* family of interfaces allow querying of non-address records such as TLSA (DANE), TXT, etc. which can be very large. for many of these, it's critical that the caller see the whole RRset. also, res_send/res_query are specified to return the complete, untruncated length so that the caller can retry with an appropriately-sized buffer. determining this is not possible without tcp. so, it's time to add tcp fallback. the fallback strategy implemented here uses one tcp socket per question (1 or 2 questions), initiated via tcp fastopen when possible. the connection is made to the nameserver that issued the truncated answer. right now, fallback happens unconditionally when truncation is seen. this can, and may later be, relaxed for queries made by the getaddrinfo system, since it will only use a bounded number of results anyway. retry is not attempted again after failure over tcp. the logic could easily be adapted to do that, but it's of questionable value, since the tcp stack automatically handles retransmission and the successs answer with TC=1 over udp strongly suggests that the nameserver has the full answer ready to give. further retry is likely just "take longer to fail". --- src/network/res_msend.c | 119 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 117 insertions(+), 2 deletions(-) diff --git a/src/network/res_msend.c b/src/network/res_msend.c index 98e6d8a3..1e76886a 100644 --- a/src/network/res_msend.c +++ b/src/network/res_msend.c @@ -1,5 +1,6 @@ #include #include +#include #include #include #include @@ -29,6 +30,51 @@ static unsigned long mtime() + ts.tv_nsec / 1000000; } +static int start_tcp(struct pollfd *pfd, int family, const void *sa, socklen_t sl, const unsigned char *q, int ql) +{ + struct msghdr mh = { + .msg_name = (void *)sa, + .msg_namelen = sl, + .msg_iovlen = 2, + .msg_iov = (struct iovec [2]){ + { .iov_base = (uint8_t[]){ ql>>8, ql }, .iov_len = 2 }, + { .iov_base = (void *)q, .iov_len = ql } } + }; + int r; + int fd = socket(family, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0); + pfd->fd = fd; + pfd->events = POLLOUT; + if (!setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN_CONNECT, + &(int){1}, sizeof(int))) { + r = sendmsg(fd, &mh, MSG_FASTOPEN|MSG_NOSIGNAL); + if (r == ql+2) pfd->events = POLLIN; + if (r >= 0) return r; + if (errno == EINPROGRESS) return 0; + } + r = connect(fd, sa, sl); + if (!r || errno == EINPROGRESS) return 0; + close(fd); + pfd->fd = -1; + return -1; +} + +static void step_mh(struct msghdr *mh, size_t n) +{ + /* Adjust iovec in msghdr to skip first n bytes. */ + while (mh->msg_iovlen && n >= mh->msg_iov->iov_len) { + n -= mh->msg_iov->iov_len; + mh->msg_iov++; + mh->msg_iovlen--; + } + if (!mh->msg_iovlen) return; + mh->msg_iov->iov_base = (char *)mh->msg_iov->iov_base + n; + mh->msg_iov->iov_len -= n; +} + +/* Internal contract for __res_msend[_rc]: asize must be >=512, nqueries + * must be sufficiently small to be safe as VLA size. In practice it's + * either 1 or 2, anyway. */ + int __res_msend_rc(int nqueries, const unsigned char *const *queries, const int *qlens, unsigned char *const *answers, int *alens, int asize, const struct resolvconf *conf) @@ -47,6 +93,9 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries, int i, j; int cs; struct pollfd pfd[nqueries+2]; + int qpos[nqueries], apos[nqueries]; + unsigned char alen_buf[nqueries][2]; + int r; unsigned long t0, t1, t2; pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs); @@ -125,6 +174,11 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries, t1 = t2 - retry_interval; for (; t2-t0 < timeout; t2=mtime()) { + /* This is the loop exit condition: that all queries + * have an accepted answer. */ + for (i=0; i0; i++); + if (i==nqueries) break; + if (t2-t1 >= retry_interval) { /* Query all configured namservers in parallel */ for (i=0; i= 0) { /* Ignore non-identifiable packets */ @@ -181,12 +236,72 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries, else memcpy(answers[i], answers[next], rlen); - if (next == nqueries) goto out; + /* Ignore further UDP if all slots full or TCP-mode */ + if (next == nqueries) pfd[nqueries].events = 0; + + /* If answer is truncated (TC bit), fallback to TCP */ + if (answers[i][2] & 2) { + alens[i] = -1; + pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, 0); + r = start_tcp(pfd+i, family, ns+j, sl, queries[i], qlens[i]); + pthread_setcancelstate(cs, 0); + if (r >= 0) { + qpos[i] = r; + apos[i] = 0; + } + continue; + } + } + + for (i=0; i>8, qlens[i] }, .iov_len = 2 }, + { .iov_base = (void *)queries[i], .iov_len = qlens[i] } } + }; + step_mh(&mh, qpos[i]); + r = sendmsg(pfd[i].fd, &mh, MSG_NOSIGNAL); + if (r < 0) goto out; + qpos[i] += r; + if (qpos[i] == qlens[i]+2) + pfd[i].events = POLLIN; + } + + for (i=0; i