Date: Tue, 20 Sep 2022 00:37:45 -0400
From: Rich Felker
To: musl@lists.openwall.com
Message-ID: <20220920043744.GJ9709@brightrain.aerifal.cx>
Subject: [musl] TCP DNS outline

What follow are some notes I've put together in preparation for TCP
fallback addition to DNS query core. Open to comments/concerns.


Current DNS query core state machine outline:

Single socket fd for sending/receiving UDP messages.

One or more answer buffers: answers[nqueries]

Lengths of accepted answers, 0 if none: alens[nqueries]

On each loop iteration, queries that don't yet have answers get
[re]sent if retry interval is exceeded.

UDP socket fd is poll()ed with timeout, then messages are recvfrom'd
and processed until no more are available. For each:

- It is read into the lowest-indexed answer buffer that's not yet
  filled.
- Query id is used to determine which question it goes with. If none
  or if that question already has an answer, it's discarded and
  processing continues to next packet.
- RCODE is checked; if non-conclusive the packet is ignored and
  processing continues to next packet (possibly with early retry, but
  this is not core to the logic).
- If the answer is accepted, it's moved to the right answer slot
  (possibly the one it's already in) and the size is recorded, marking
  the slot as having an accepted answer.

This process continues until all questions have answers or timeout.


Changes for TCP:

Additional state:

For polling multiple socket fds: struct pollfd [nqueries+1]. This
requires nqueries be well-bounded, but it's always 1 or 2 anyway. The
+1 is for the UDP socket. The pollfd array itself can store socket fds
so we don't need separate storage for them.

Buffer positions for handling partial read: pos[nqueries]

Logic changes:

An answer slot that's being queried by TCP is always considered in-use
for the purpose of receiving UDP. If all (both) queries have switched
to TCP, no UDP will be accepted. UDP received for a slot that's
already in "TCP mode" will be dropped just as if the slot already had
a UDP answer.

When checking the validity (see above: query id, RCODE) of a UDP
answer, if the TC bit is set, a predicate will be evaluated to decide
whether to accept the truncated answer. If so, there is no change. If
not, a new TCP socket will be opened to the nameserver address that
issued the TC'd answer, and added to the pollfd set. (This should
probably use TCP fastopen API if the kernel supports it, and only fall
back if not. I think this means that, if the MSG_FASTOPEN sendmsg
succeeds, we can go straight to polling for read, and only need to
connect and poll for write first if it doesn't.)

Buffer position is initialized to -2 (2 bytes prior to payload start)
for the prepended BE16 length field that's not part of the answer and
that will be read into a separate location via iovec.
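
To make the additional state and the TC check above concrete, here is
a minimal sketch in C. All names (struct tcp_state, state_init,
slot_in_tcp_mode, is_truncated, start_tcp) are hypothetical, and
keeping the UDP socket in the last pollfd slot is an assumption made
for the example, not something the outline specifies. Opening the TCP
socket (with or without fastopen) and the accept-truncated predicate
are left out; start_tcp just takes an already-opened fd.

```c
#include <poll.h>
#include <stddef.h>

#define MAXQ 2 /* nqueries is always 1 or 2 */

/* Hypothetical container for the extra TCP state. */
struct tcp_state {
	struct pollfd pfd[MAXQ+1]; /* [0..nqueries-1] TCP, [nqueries] UDP (assumed layout) */
	int pos[MAXQ];             /* per-query read position; -2 means length prefix unread */
	int nqueries;
};

/* No TCP sockets yet; poll() ignores entries whose fd is negative. */
static void state_init(struct tcp_state *st, int nqueries, int udpfd)
{
	st->nqueries = nqueries;
	for (int i = 0; i < nqueries; i++) {
		st->pfd[i].fd = -1;
		st->pfd[i].events = 0;
		st->pfd[i].revents = 0;
		st->pos[i] = 0;
	}
	st->pfd[nqueries].fd = udpfd;
	st->pfd[nqueries].events = POLLIN;
	st->pfd[nqueries].revents = 0;
}

/* A slot with a TCP fd is treated as in-use for UDP purposes. */
static int slot_in_tcp_mode(const struct tcp_state *st, int i)
{
	return st->pfd[i].fd >= 0;
}

/* TC is bit 0x02 of the third header byte; a DNS header is 12 bytes. */
static int is_truncated(const unsigned char *msg, size_t len)
{
	return len >= 12 && (msg[2] & 0x02) != 0;
}

/* Switch slot i to TCP: wait for writability first, and start the
 * read position 2 bytes before the payload for the BE16 length. */
static void start_tcp(struct tcp_state *st, int i, int fd)
{
	st->pfd[i].fd = fd;
	st->pfd[i].events = POLLOUT;
	st->pfd[i].revents = 0;
	st->pos[i] = -2;
}
```

With this layout, the UDP receive path would drop any packet whose
query slot satisfies slot_in_tcp_mode(), the same way it drops packets
for slots that already have an accepted answer.
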

When poll reports a TCP socket writable, send the corresponding query
payload (looping to send the whole thing; assume it won't block after
partial send of <280 bytes). Then switch the pollfd events to only
poll for read.

When poll reports a TCP socket ready for read, use buffer position and
readv or recvmsg to iovec-read into answer size and answer payload,
advancing buffer position accordingly. If we reach the answer length,
or hit the full answer buffer size without reaching it, close the
connection and deem the answer completed. If we get EOF or socket
error or see an inconclusive rcode, close the connection and deem the
query failed (alens[i]=0).
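
The read step above could look roughly like the following sketch in C.
The function name tcp_read_step, its parameters, and its return
convention are all hypothetical. It assumes the position starts at -2
as described earlier, treats EINTR/EAGAIN as "try again later", and
leaves the rcode check and closing the socket to the caller.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <errno.h>
#include <stddef.h>

/* One read step on a TCP DNS connection.
 * *pos runs from -2 (nothing read) through 0 (length prefix read) up
 * to the number of payload bytes stored in abuf.
 * Returns 1 when the answer is complete (*alen set), 0 if more data
 * is needed, -1 on EOF or socket error. */
static int tcp_read_step(int fd, unsigned char lenbuf[2],
	unsigned char *abuf, size_t asize, int *pos, int *alen)
{
	int p = *pos;
	size_t lendone = p < 0 ? (size_t)(p + 2) : 2; /* length bytes already read */
	size_t paydone = p < 0 ? 0 : (size_t)p;       /* payload bytes already read */
	struct iovec iov[2];

	/* First the rest of the BE16 length prefix, then the payload. */
	iov[0].iov_base = lenbuf + lendone;
	iov[0].iov_len = 2 - lendone;
	iov[1].iov_base = abuf + paydone;
	iov[1].iov_len = asize - paydone;

	ssize_t r = readv(fd, iov, 2);
	if (r < 0) return (errno == EINTR || errno == EAGAIN) ? 0 : -1;
	if (r == 0) return -1; /* EOF before the answer was complete */

	p += (int)r;
	*pos = p;
	if (p < 0) return 0; /* length prefix still incomplete */

	size_t want = (size_t)lenbuf[0] << 8 | lenbuf[1];
	if ((size_t)p >= want || (size_t)p == asize) {
		/* Reached the announced length, or filled the buffer first. */
		*alen = (size_t)p < want ? p : (int)want;
		return 1;
	}
	return 0;
}
```

The caller would invoke this each time poll reports the socket
readable, closing the connection on either a 1 or a -1 return and
setting alens[i]=0 in the failure case.
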