[-- Attachment #1: Type: text/plain, Size: 15149 bytes --] Hi everyone, I'm implementing a socket protocol with similar works to Emscripten - ok: Here's what I noticed - resolv.conf using multiple DNS servers options timeout:22 attempts:5 nameserver 8.8.4.4 nameserver 208.67.222.222 nameserver 9.9.9.9 nameserver 1.1.1.1 - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved void socketstest1_client_dnsonly(void) { struct addrinfo hints, * res; memset(&hints, 0, sizeof(hints)); printf("begin lookup1...\n"); if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) { perror("getaddrinfo1"); } printf("begin lookup2...\n"); if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) { perror("getaddrinfo2"); } printf("begin lookup3...\n"); if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) { perror("getaddrinfo3"); } } When repeating socketstest1_client_dnsonly often, it appears that IPV6 answers for IPV4 requests to another server [or vice-versa] responses can get mixed up and getaddrinfo reports sometimes only an error where none is. Below is a console dump of my test where you see the critical situation (end) 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP Proto), resulting in error getaddrinfo1: Address in use This happens in about 1/100 tests. When you specify IPV4 or IPV6 in hints, this issue does NOT show up. Many Thanks Sascha Console Data Dump: __syscall_recv begin EP[8.8.4.4:53] __syscall_recv'd_internal [020000350808040400000000000000008.8.4.4:53] f79e818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000004800102a0014504001081200000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 connect:0200ffff8efaba240000000000000000 connect:0a00ffff000000002a00145040010812000000000000200400000000 begin lookup1... __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 08740100000100000000000003777777037765620264650000010001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 08740100000100000000000003777777037765620264650000010001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 08740100000100000000000003777777037765620264650000010001 __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 087401000001000000000000037777770377656202646500001c0001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 087401000001000000000000037777770377656202646500001c0001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 087401000001000000000000037777770377656202646500001c0001 __syscall_recv begin EP[0.0.0.0:0] __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] 08748180000100020000000003777777037765620264650000010001c00c000500010000011a000f0377777708672d68612d776562c014c028000100010000000c000452a5e58a000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 recv: no data __syscall_recv begin EP[0.0.0.0:0] __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] 087481800001000100010000037777770377656202646500001c0001c00c00050001000000fb000f0377777708672d68612d776562c014c02c00060001000000210031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 begin lookup2... __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 4920010000010000000000000377777706676f6f676c650264650000010001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 4920010000010000000000000377777706676f6f676c650264650000010001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 4920010000010000000000000377777706676f6f676c650264650000010001 __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 4920010000010000000000000377777706676f6f676c6502646500001c0001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 4920010000010000000000000377777706676f6f676c6502646500001c0001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 4920010000010000000000000377777706676f6f676c6502646500001c0001 __syscall_recv begin EP[0.0.0.0:0] __syscall_recv'd_internal [020000000000000000000000000000008.8.4.4:53] 4920818000010001000000000377777706676f6f676c650264650000010001c00c00010001000000b200048efabaa3000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 __syscall_recv begin EP[8.8.4.4:53] __syscall_recv'd_internal [020000350808040400000000000000009.9.9.9:53] 4920818000010001000000000377777706676f6f676c650264650000010001c00c00010001000000ad00048efab9e3000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 begin lookup3... __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 9308010000010000000000000377777706676f6f676c6503636f6d0000010001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 9308010000010000000000000377777706676f6f676c6503636f6d0000010001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 9308010000010000000000000377777706676f6f676c6503636f6d0000010001 __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 9308010000010000000000000377777706676f6f676c6503636f6d00001c0001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 9308010000010000000000000377777706676f6f676c6503636f6d00001c0001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 9308010000010000000000000377777706676f6f676c6503636f6d00001c0001 __syscall_recv begin EP[0.0.0.0:0] __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] 9308818000010001000000000377777706676f6f676c6503636f6d0000010001c00c000100010000009f00048efab9440000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 recv: no data __syscall_recv begin EP[0.0.0.0:0] __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] 9308818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000002400102a0014504001082900000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 connect:0200ffff8efab9440000000000000000 connect:0a00ffff000000002a00145040010829000000000000200400000000 begin lookup1... __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] dcf00100000100000000000003777777037765620264650000010001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] dcf00100000100000000000003777777037765620264650000010001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] dcf00100000100000000000003777777037765620264650000010001 __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] dcf001000001000000000000037777770377656202646500001c0001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] dcf001000001000000000000037777770377656202646500001c0001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] dcf001000001000000000000037777770377656202646500001c0001 __syscall_recv begin EP[0.0.0.0:0] __syscall_recv'd_internal [020000000000000000000000000000008.8.4.4:53] dcf081800001000100010000037777770377656202646500001c0001c00c00050001000000fa000f0377777708672d68612d776562c014c02c000600010000000f0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 __syscall_recv begin EP[8.8.4.4:53] __syscall_recv'd_internal [020000350808040400000000000000009.9.9.9:53] dcf081800001000100010000037777770377656202646500001c0001c00c0005000100000129000f0377777708672d68612d776562c014c02c000600010000000e0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 getaddrinfo1: Address in use [-- Attachment #2: Type: text/html, Size: 18545 bytes --]
On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote:
> Hi everyone,
>
> I'm implementing a socket protocol with similar works to Emscripten - ok:
>
> Here's what I noticed
>
> - resolv.conf using multiple DNS servers
> options timeout:22 attempts:5
> nameserver 8.8.4.4
> nameserver 208.67.222.222
> nameserver 9.9.9.9
> nameserver 1.1.1.1
>
> - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
> void socketstest1_client_dnsonly(void) {
>
> struct addrinfo hints, * res;
> memset(&hints, 0, sizeof(hints));
> printf("begin lookup1...\n");
> if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
> perror("getaddrinfo1");
> }
> printf("begin lookup2...\n");
> if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
> perror("getaddrinfo2");
> }
> printf("begin lookup3...\n");
> if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
> perror("getaddrinfo3");
> }
>
> }
>
> When repeating socketstest1_client_dnsonly often, it appears that IPV6
> answers for IPV4 requests to another server [or vice-versa] responses can
> get mixed up and getaddrinfo reports sometimes only an error where none is.
>
> Below is a console dump of my test where you see the critical situation
> (end)
> 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP Proto),
> resulting in error getaddrinfo1: Address in use
>
> This happens in about 1/100 tests. When you specify IPV4 or IPV6 in hints,
> this issue does NOT show up.
Can you provide more information on how to interpret the console dump,
and in particular, which of your syscalls is returning the EADDRINUSE
error? I'm pretty sure this is just a bug in your socket
stack/emulation. It looks like it's coming from recvfrom, and
EADDRINUSE is not a valid error for recvfrom to produce.
Rich
[-- Attachment #1: Type: text/plain, Size: 2556 bytes --] Hi Rich, Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a glitch in the Emscripten/WASI modification for musl (there are different values for errnos per implementation): __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN. My question currently breaks down to: Is it common for this implementation of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if answers from DNS servers came in? If not, I need to investigate much further. I really could not find issues with my sockets implementation, although your thought is of course very reasonable. Regards S. Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>: > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote: > > Hi everyone, > > > > I'm implementing a socket protocol with similar works to Emscripten - ok: > > > > Here's what I noticed > > > > - resolv.conf using multiple DNS servers > > options timeout:22 attempts:5 > > nameserver 8.8.4.4 > > nameserver 208.67.222.222 > > nameserver 9.9.9.9 > > nameserver 1.1.1.1 > > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved > > void socketstest1_client_dnsonly(void) { > > > > struct addrinfo hints, * res; > > memset(&hints, 0, sizeof(hints)); > > printf("begin lookup1...\n"); > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) { > > perror("getaddrinfo1"); > > } > > printf("begin lookup2...\n"); > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) { > > perror("getaddrinfo2"); > > } > > printf("begin lookup3...\n"); > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) { > > perror("getaddrinfo3"); > > } > > > > } > > > > When repeating socketstest1_client_dnsonly often, it appears that IPV6 > > answers for IPV4 requests to another server [or vice-versa] responses can > > get mixed up and getaddrinfo reports sometimes only an error where none > is. > > > > Below is a console dump of my test where you see the critical situation > > (end) > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP Proto), > > resulting in error getaddrinfo1: Address in use > > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in > hints, > > this issue does NOT show up. > > Can you provide more information on how to interpret the console dump, > and in particular, which of your syscalls is returning the EADDRINUSE > error? I'm pretty sure this is just a bug in your socket > stack/emulation. It looks like it's coming from recvfrom, and > EADDRINUSE is not a valid error for recvfrom to produce. > > Rich > [-- Attachment #2: Type: text/html, Size: 4704 bytes --]
On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote: > Hi Rich, > > Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a > glitch in the Emscripten/WASI modification for musl (there are different > values for errnos per implementation): > __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN. > > My question currently breaks down to: Is it common for this implementation > of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if > answers from DNS servers came in? > If not, I need to investigate much further. I really could not find issues > with my sockets implementation, although your thought is of course very > reasonable. OK, looking at your code again, part of your problem is that you're misusing perror. You need to save the return value of getaddrinfo. Unless it's EAI_SYSTEM, errno is meaningless and you need to use gai_strerror to get a string for the error code (the value that was returned) rather than inspecting errno. If getaddrinfo actually returned EAI_AGAIN, this probably means you got an inconclusive result from one of the nameservers, probably ServFail. However if your debug output is packet dumps, I'm not seeing a ServFail there, and I'm not convinced it actually returned EAI_AGAIN since you're not saving the value to check it. The value that happens to be in errno is NOT the error getaddrinfo returned. Can you check what it's actually returning? > Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>: > > > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote: > > > Hi everyone, > > > > > > I'm implementing a socket protocol with similar works to Emscripten - ok: > > > > > > Here's what I noticed > > > > > > - resolv.conf using multiple DNS servers > > > options timeout:22 attempts:5 > > > nameserver 8.8.4.4 > > > nameserver 208.67.222.222 > > > nameserver 9.9.9.9 > > > nameserver 1.1.1.1 > > > > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved > > > void socketstest1_client_dnsonly(void) { > > > > > > struct addrinfo hints, * res; > > > memset(&hints, 0, sizeof(hints)); > > > printf("begin lookup1...\n"); > > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) { > > > perror("getaddrinfo1"); > > > } > > > printf("begin lookup2...\n"); > > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) { > > > perror("getaddrinfo2"); > > > } > > > printf("begin lookup3...\n"); > > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) { > > > perror("getaddrinfo3"); > > > } > > > > > > } > > > > > > When repeating socketstest1_client_dnsonly often, it appears that IPV6 > > > answers for IPV4 requests to another server [or vice-versa] responses can > > > get mixed up and getaddrinfo reports sometimes only an error where none > > is. > > > > > > Below is a console dump of my test where you see the critical situation > > > (end) > > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP Proto), > > > resulting in error getaddrinfo1: Address in use > > > > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in > > hints, > > > this issue does NOT show up. > > > > Can you provide more information on how to interpret the console dump, > > and in particular, which of your syscalls is returning the EADDRINUSE > > error? I'm pretty sure this is just a bug in your socket > > stack/emulation. It looks like it's coming from recvfrom, and > > EADDRINUSE is not a valid error for recvfrom to produce. > > > > Rich > >
[-- Attachment #1: Type: text/plain, Size: 4121 bytes --] I see - getaddrinfo always returns -2, EAI_NONAME in case of these rare errors. The errno value (EAGAIN) came from my syscall, recvfrom, which does not have data in these cases (all packets indicated by 'poll' were read previously by recvfrom calls). (Btw., I did try to return 0 for these cases in my recvfrom implementation, did not help either.) Thanks Am Mi., 1. Juni 2022 um 23:30 Uhr schrieb Rich Felker <dalias@libc.org>: > On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote: > > Hi Rich, > > > > Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a > > glitch in the Emscripten/WASI modification for musl (there are different > > values for errnos per implementation): > > __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN. > > > > My question currently breaks down to: Is it common for this > implementation > > of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if > > answers from DNS servers came in? > > If not, I need to investigate much further. I really could not find > issues > > with my sockets implementation, although your thought is of course very > > reasonable. > > OK, looking at your code again, part of your problem is that you're > misusing perror. You need to save the return value of getaddrinfo. > Unless it's EAI_SYSTEM, errno is meaningless and you need to use > gai_strerror to get a string for the error code (the value that was > returned) rather than inspecting errno. > > If getaddrinfo actually returned EAI_AGAIN, this probably means you > got an inconclusive result from one of the nameservers, probably > ServFail. However if your debug output is packet dumps, I'm not seeing > a ServFail there, and I'm not convinced it actually returned EAI_AGAIN > since you're not saving the value to check it. The value that happens > to be in errno is NOT the error getaddrinfo returned. Can you check > what it's actually returning? > > > > Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>: > > > > > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote: > > > > Hi everyone, > > > > > > > > I'm implementing a socket protocol with similar works to Emscripten > - ok: > > > > > > > > Here's what I noticed > > > > > > > > - resolv.conf using multiple DNS servers > > > > options timeout:22 attempts:5 > > > > nameserver 8.8.4.4 > > > > nameserver 208.67.222.222 > > > > nameserver 9.9.9.9 > > > > nameserver 1.1.1.1 > > > > > > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved > > > > void socketstest1_client_dnsonly(void) { > > > > > > > > struct addrinfo hints, * res; > > > > memset(&hints, 0, sizeof(hints)); > > > > printf("begin lookup1...\n"); > > > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) { > > > > perror("getaddrinfo1"); > > > > } > > > > printf("begin lookup2...\n"); > > > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) { > > > > perror("getaddrinfo2"); > > > > } > > > > printf("begin lookup3...\n"); > > > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) { > > > > perror("getaddrinfo3"); > > > > } > > > > > > > > } > > > > > > > > When repeating socketstest1_client_dnsonly often, it appears that > IPV6 > > > > answers for IPV4 requests to another server [or vice-versa] > responses can > > > > get mixed up and getaddrinfo reports sometimes only an error where > none > > > is. > > > > > > > > Below is a console dump of my test where you see the critical > situation > > > > (end) > > > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP > Proto), > > > > resulting in error getaddrinfo1: Address in use > > > > > > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in > > > hints, > > > > this issue does NOT show up. > > > > > > Can you provide more information on how to interpret the console dump, > > > and in particular, which of your syscalls is returning the EADDRINUSE > > > error? I'm pretty sure this is just a bug in your socket > > > stack/emulation. It looks like it's coming from recvfrom, and > > > EADDRINUSE is not a valid error for recvfrom to produce. > > > > > > Rich > > > > [-- Attachment #2: Type: text/html, Size: 6203 bytes --]
On Wed, Jun 01, 2022 at 11:52:46PM +0200, Sascha Braun wrote: > I see - getaddrinfo always returns -2, EAI_NONAME in case of these rare > errors. > > The errno value (EAGAIN) came from my syscall, recvfrom, which does not > have data in these cases (all packets indicated by 'poll' were read > previously by recvfrom calls). (Btw., I did try to return 0 for these cases > in my recvfrom implementation, did not help either.) It's normal to see these EAGAINs. They are not errors. Returning 0 would be wrong; a zero-length read means EOF not EAGAIN. I don't see an immediate cause visible for why you're getting EAI_NONAME. Note that your debug output does not seem to show the *length* recvfrom returned, so is it possible you're just showing old data that was already in the buffer, when recvfrom really malfunctioned and returned a short read? At the res_msend loop layer musl would possibly accept this (like you seem to be seeing) but then find no results later. Rich > > Thanks > > Am Mi., 1. Juni 2022 um 23:30 Uhr schrieb Rich Felker <dalias@libc.org>: > > > On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote: > > > Hi Rich, > > > > > > Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a > > > glitch in the Emscripten/WASI modification for musl (there are different > > > values for errnos per implementation): > > > __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN. > > > > > > My question currently breaks down to: Is it common for this > > implementation > > > of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if > > > answers from DNS servers came in? > > > If not, I need to investigate much further. I really could not find > > issues > > > with my sockets implementation, although your thought is of course very > > > reasonable. > > > > OK, looking at your code again, part of your problem is that you're > > misusing perror. You need to save the return value of getaddrinfo. > > Unless it's EAI_SYSTEM, errno is meaningless and you need to use > > gai_strerror to get a string for the error code (the value that was > > returned) rather than inspecting errno. > > > > If getaddrinfo actually returned EAI_AGAIN, this probably means you > > got an inconclusive result from one of the nameservers, probably > > ServFail. However if your debug output is packet dumps, I'm not seeing > > a ServFail there, and I'm not convinced it actually returned EAI_AGAIN > > since you're not saving the value to check it. The value that happens > > to be in errno is NOT the error getaddrinfo returned. Can you check > > what it's actually returning? > > > > > > > Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>: > > > > > > > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote: > > > > > Hi everyone, > > > > > > > > > > I'm implementing a socket protocol with similar works to Emscripten > > - ok: > > > > > > > > > > Here's what I noticed > > > > > > > > > > - resolv.conf using multiple DNS servers > > > > > options timeout:22 attempts:5 > > > > > nameserver 8.8.4.4 > > > > > nameserver 208.67.222.222 > > > > > nameserver 9.9.9.9 > > > > > nameserver 1.1.1.1 > > > > > > > > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved > > > > > void socketstest1_client_dnsonly(void) { > > > > > > > > > > struct addrinfo hints, * res; > > > > > memset(&hints, 0, sizeof(hints)); > > > > > printf("begin lookup1...\n"); > > > > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) { > > > > > perror("getaddrinfo1"); > > > > > } > > > > > printf("begin lookup2...\n"); > > > > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) { > > > > > perror("getaddrinfo2"); > > > > > } > > > > > printf("begin lookup3...\n"); > > > > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) { > > > > > perror("getaddrinfo3"); > > > > > } > > > > > > > > > > } > > > > > > > > > > When repeating socketstest1_client_dnsonly often, it appears that > > IPV6 > > > > > answers for IPV4 requests to another server [or vice-versa] > > responses can > > > > > get mixed up and getaddrinfo reports sometimes only an error where > > none > > > > is. > > > > > > > > > > Below is a console dump of my test where you see the critical > > situation > > > > > (end) > > > > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP > > Proto), > > > > > resulting in error getaddrinfo1: Address in use > > > > > > > > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in > > > > hints, > > > > > this issue does NOT show up. > > > > > > > > Can you provide more information on how to interpret the console dump, > > > > and in particular, which of your syscalls is returning the EADDRINUSE > > > > error? I'm pretty sure this is just a bug in your socket > > > > stack/emulation. It looks like it's coming from recvfrom, and > > > > EADDRINUSE is not a valid error for recvfrom to produce. > > > > > > > > Rich > > > > > >
[-- Attachment #1: Type: text/plain, Size: 12998 bytes --] Hi Rich, I think I narrowed the thing down. Below is a dump of what happens in a 'normal' situation and what happened when the sporadic issue appeared. Normal: begin lookup3... __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 __syscall_recv begin EP[0.0.0.0:0] __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 __syscall_recv begin EP[9.9.9.9:53] __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53] ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 connect:0200ffffd83ad4840000000000000000 connect:0a00ffff000000002a00145040010808000000000000200400000000 So we did sent the bytes ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and the other sequence (hex encoded, 2 characters per byte) to each of the DNS and we received back one 'short' and one 'long' reply from 9.9.9.9. I guess the short one is IPv4, long one IPv6(?). That's the case with all successful lookups, i.e. the 99% ok ones - (at least) one short - (at least) one long. Now the problematic one: begin lookup1... __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 94d90100000100000000000003777777037765620264650000010001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 94d90100000100000000000003777777037765620264650000010001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 94d90100000100000000000003777777037765620264650000010001 __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 94d901000001000000000000037777770377656202646500001c0001 __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 94d901000001000000000000037777770377656202646500001c0001 __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 94d901000001000000000000037777770377656202646500001c0001 __syscall_recv begin EP[0.0.0.0:0] __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 __syscall_recv begin EP[9.9.9.9:53] __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53] 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 getaddrinfo1: Address in use We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4 All occurrences of the problem show this constellation - two 'long' responses received. As a note, of course my implementation of recv returns the correct number of bytes received. The zeros you see are only from the dump function, it's dumping the 512 byte buffer. I hope this is helpful in some manner. I came across this, I seems glibc had a similar issue (I did not look in-depth, just want to share the link) https://bugzilla.redhat.com/show_bug.cgi?id=1044628 https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html Thanks Sascha Am Do., 2. Juni 2022 um 16:25 Uhr schrieb Rich Felker <dalias@libc.org>: > On Wed, Jun 01, 2022 at 11:52:46PM +0200, Sascha Braun wrote: > > I see - getaddrinfo always returns -2, EAI_NONAME in case of these rare > > errors. > > > > The errno value (EAGAIN) came from my syscall, recvfrom, which does not > > have data in these cases (all packets indicated by 'poll' were read > > previously by recvfrom calls). (Btw., I did try to return 0 for these > cases > > in my recvfrom implementation, did not help either.) > > It's normal to see these EAGAINs. They are not errors. Returning 0 > would be wrong; a zero-length read means EOF not EAGAIN. > > I don't see an immediate cause visible for why you're getting > EAI_NONAME. Note that your debug output does not seem to show the > *length* recvfrom returned, so is it possible you're just showing old > data that was already in the buffer, when recvfrom really > malfunctioned and returned a short read? At the res_msend loop layer > musl would possibly accept this (like you seem to be seeing) but then > find no results later. > > Rich > > > > > > > Thanks > > > > Am Mi., 1. Juni 2022 um 23:30 Uhr schrieb Rich Felker <dalias@libc.org>: > > > > > On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote: > > > > Hi Rich, > > > > > > > > Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this > is a > > > > glitch in the Emscripten/WASI modification for musl (there are > different > > > > values for errnos per implementation): > > > > __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN. > > > > > > > > My question currently breaks down to: Is it common for this > > > implementation > > > > of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even > if > > > > answers from DNS servers came in? > > > > If not, I need to investigate much further. I really could not find > > > issues > > > > with my sockets implementation, although your thought is of course > very > > > > reasonable. > > > > > > OK, looking at your code again, part of your problem is that you're > > > misusing perror. You need to save the return value of getaddrinfo. > > > Unless it's EAI_SYSTEM, errno is meaningless and you need to use > > > gai_strerror to get a string for the error code (the value that was > > > returned) rather than inspecting errno. > > > > > > If getaddrinfo actually returned EAI_AGAIN, this probably means you > > > got an inconclusive result from one of the nameservers, probably > > > ServFail. However if your debug output is packet dumps, I'm not seeing > > > a ServFail there, and I'm not convinced it actually returned EAI_AGAIN > > > since you're not saving the value to check it. The value that happens > > > to be in errno is NOT the error getaddrinfo returned. Can you check > > > what it's actually returning? > > > > > > > > > > Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker < > dalias@libc.org>: > > > > > > > > > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote: > > > > > > Hi everyone, > > > > > > > > > > > > I'm implementing a socket protocol with similar works to > Emscripten > > > - ok: > > > > > > > > > > > > Here's what I noticed > > > > > > > > > > > > - resolv.conf using multiple DNS servers > > > > > > options timeout:22 attempts:5 > > > > > > nameserver 8.8.4.4 > > > > > > nameserver 208.67.222.222 > > > > > > nameserver 9.9.9.9 > > > > > > nameserver 1.1.1.1 > > > > > > > > > > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved > > > > > > void socketstest1_client_dnsonly(void) { > > > > > > > > > > > > struct addrinfo hints, * res; > > > > > > memset(&hints, 0, sizeof(hints)); > > > > > > printf("begin lookup1...\n"); > > > > > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) { > > > > > > perror("getaddrinfo1"); > > > > > > } > > > > > > printf("begin lookup2...\n"); > > > > > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) { > > > > > > perror("getaddrinfo2"); > > > > > > } > > > > > > printf("begin lookup3...\n"); > > > > > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) { > > > > > > perror("getaddrinfo3"); > > > > > > } > > > > > > > > > > > > } > > > > > > > > > > > > When repeating socketstest1_client_dnsonly often, it appears that > > > IPV6 > > > > > > answers for IPV4 requests to another server [or vice-versa] > > > responses can > > > > > > get mixed up and getaddrinfo reports sometimes only an error > where > > > none > > > > > is. > > > > > > > > > > > > Below is a console dump of my test where you see the critical > > > situation > > > > > > (end) > > > > > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP > > > Proto), > > > > > > resulting in error getaddrinfo1: Address in use > > > > > > > > > > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 > in > > > > > hints, > > > > > > this issue does NOT show up. > > > > > > > > > > Can you provide more information on how to interpret the console > dump, > > > > > and in particular, which of your syscalls is returning the > EADDRINUSE > > > > > error? I'm pretty sure this is just a bug in your socket > > > > > stack/emulation. It looks like it's coming from recvfrom, and > > > > > EADDRINUSE is not a valid error for recvfrom to produce. > > > > > > > > > > Rich > > > > > > > > > [-- Attachment #2: Type: text/html, Size: 18556 bytes --]
On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote:
> Hi Rich,
>
> I think I narrowed the thing down. Below is a dump of what happens in a
> 'normal' situation and what happened when the sporadic issue appeared.
>
> Normal:
> begin lookup3...
> __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> __syscall_recv begin EP[0.0.0.0:0]
> __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> __syscall_recv begin EP[9.9.9.9:53]
> __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53]
> ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> connect:0200ffffd83ad4840000000000000000
> connect:0a00ffff000000002a00145040010808000000000000200400000000
>
> So we did sent the bytes
> ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and the
> other sequence (hex encoded, 2 characters per byte) to each of the DNS and
> we received back one 'short' and one 'long' reply from 9.9.9.9.
> I guess the short one is IPv4, long one IPv6(?). That's the case with all
> successful lookups, i.e. the 99% ok ones - (at least) one short - (at
> least) one long.
>
> Now the problematic one:
>
> begin lookup1...
> __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> 94d90100000100000000000003777777037765620264650000010001
> __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> 94d90100000100000000000003777777037765620264650000010001
> __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> 94d90100000100000000000003777777037765620264650000010001
> __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> 94d901000001000000000000037777770377656202646500001c0001
> __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> 94d901000001000000000000037777770377656202646500001c0001
> __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> 94d901000001000000000000037777770377656202646500001c0001
>
> __syscall_recv begin EP[0.0.0.0:0]
> __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> __syscall_recv begin EP[9.9.9.9:53]
> __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53]
> 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> getaddrinfo1: Address in use
>
> We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4
>
> All occurrences of the problem show this constellation - two 'long'
> responses received.
> As a note, of course my implementation of recv returns the correct number
> of bytes received. The zeros you see are only from the dump function, it's
> dumping the 512 byte buffer.
>
> I hope this is helpful in some manner.
>
> I came across this, I seems glibc had a similar issue (I did not look
> in-depth, just want to share the link)
> https://bugzilla.redhat.com/show_bug.cgi?id=1044628
> https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html
OK, I found your problem. It's that the query ids for both the A and
AAAA are the same, probably because you have a low-resolution or
non-working clock_gettime. If the host environment does not provide a
way to get a high resolution clock, I think you should still apply a
monotonic increasing increment of the nanoseconds on each call where
the host environment's time did not increase so that the clock is
strictly monotonic. However musl's resolver should also deal with this
case since it's always possible to get identical query ids (with low
probability). We should just check if they're equal, and if so,
increment the second one. I'll write a patch to do this.
Rich
[-- Attachment #1: Type: text/plain, Size: 8706 bytes --] On Fri, Jun 03, 2022 at 08:27:04AM -0400, Rich Felker wrote: > On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote: > > Hi Rich, > > > > I think I narrowed the thing down. Below is a dump of what happens in a > > 'normal' situation and what happened when the sporadic issue appeared. > > > > Normal: > > begin lookup3... > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > __syscall_recv begin EP[0.0.0.0:0] > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] > > ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > __syscall_recv begin EP[9.9.9.9:53] > > __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53] > > ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > connect:0200ffffd83ad4840000000000000000 > > connect:0a00ffff000000002a00145040010808000000000000200400000000 > > > > So we did sent the bytes > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and the > > other sequence (hex encoded, 2 characters per byte) to each of the DNS and > > we received back one 'short' and one 'long' reply from 9.9.9.9. > > I guess the short one is IPv4, long one IPv6(?). That's the case with all > > successful lookups, i.e. the 99% ok ones - (at least) one short - (at > > least) one long. > > > > Now the problematic one: > > > > begin lookup1... > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > 94d90100000100000000000003777777037765620264650000010001 > > __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] > > 94d90100000100000000000003777777037765620264650000010001 > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > 94d90100000100000000000003777777037765620264650000010001 > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > 94d901000001000000000000037777770377656202646500001c0001 > > __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] > > 94d901000001000000000000037777770377656202646500001c0001 > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > 94d901000001000000000000037777770377656202646500001c0001 > > > > __syscall_recv begin EP[0.0.0.0:0] > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] > > 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > __syscall_recv begin EP[9.9.9.9:53] > > __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53] > > 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > getaddrinfo1: Address in use > > > > We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4 > > > > All occurrences of the problem show this constellation - two 'long' > > responses received. > > As a note, of course my implementation of recv returns the correct number > > of bytes received. The zeros you see are only from the dump function, it's > > dumping the 512 byte buffer. > > > > I hope this is helpful in some manner. > > > > I came across this, I seems glibc had a similar issue (I did not look > > in-depth, just want to share the link) > > https://bugzilla.redhat.com/show_bug.cgi?id=1044628 > > https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html > > OK, I found your problem. It's that the query ids for both the A and > AAAA are the same, probably because you have a low-resolution or > non-working clock_gettime. If the host environment does not provide a > way to get a high resolution clock, I think you should still apply a > monotonic increasing increment of the nanoseconds on each call where > the host environment's time did not increase so that the clock is > strictly monotonic. However musl's resolver should also deal with this > case since it's always possible to get identical query ids (with low > probability). We should just check if they're equal, and if so, > increment the second one. I'll write a patch to do this. See if the attached patch fixes it. [-- Attachment #2: queryid.diff --] [-- Type: text/plain, Size: 438 bytes --] diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c index aa558c19..e20ad6db 100644 --- a/src/network/lookup_name.c +++ b/src/network/lookup_name.c @@ -155,6 +155,8 @@ static int name_from_dns(struct address buf[static MAXADDRS], char canon[static if (qlens[nq] == -1) return EAI_NONAME; qbuf[nq][3] = 0; /* don't need AD flag */ + if (nq && qbuf[nq][0] == qbuf[0][0]) + qbuf[nq][0]++; nq++; } }
[-- Attachment #1: Type: text/plain, Size: 9347 bytes --] Hi Rich, Great, Thank You! This explains everything, because - although I do have a high resolution timer - the time that arrived through CLOCK_REALTIME was milliseconds. I did improve the resolution down to 1/1000 of msec and things are working fine now. I will also test your patch this weekend and report back. Many Thanks Sascha Am Fr., 3. Juni 2022 um 14:34 Uhr schrieb Rich Felker <dalias@libc.org>: > On Fri, Jun 03, 2022 at 08:27:04AM -0400, Rich Felker wrote: > > On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote: > > > Hi Rich, > > > > > > I think I narrowed the thing down. Below is a dump of what happens in a > > > 'normal' situation and what happened when the sporadic issue appeared. > > > > > > Normal: > > > begin lookup3... > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > __syscall_send_internal 02000035d043dede00000000000000 > 00208.67.222.222:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > __syscall_send_internal 02000035d043dede00000000000000 > 00208.67.222.222:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > __syscall_recv begin EP[0.0.0.0:0] > > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] > > > > ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > __syscall_recv begin EP[9.9.9.9:53] > > > __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53] > > > > ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > connect:0200ffffd83ad4840000000000000000 > > > connect:0a00ffff000000002a00145040010808000000000000200400000000 > > > > > > So we did sent the bytes > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and > the > > > other sequence (hex encoded, 2 characters per byte) to each of the DNS > and > > > we received back one 'short' and one 'long' reply from 9.9.9.9. > > > I guess the short one is IPv4, long one IPv6(?). That's the case with > all > > > successful lookups, i.e. the 99% ok ones - (at least) one short - (at > > > least) one long. > > > > > > Now the problematic one: > > > > > > begin lookup1... > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > 94d90100000100000000000003777777037765620264650000010001 > > > __syscall_send_internal 02000035d043dede00000000000000 > 00208.67.222.222:53] > > > 94d90100000100000000000003777777037765620264650000010001 > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > 94d90100000100000000000003777777037765620264650000010001 > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > 94d901000001000000000000037777770377656202646500001c0001 > > > __syscall_send_internal 02000035d043dede00000000000000 > 00208.67.222.222:53] > > > 94d901000001000000000000037777770377656202646500001c0001 > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > 94d901000001000000000000037777770377656202646500001c0001 > > > > > > __syscall_recv begin EP[0.0.0.0:0] > > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] > > > > 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > __syscall_recv begin EP[9.9.9.9:53] > > > __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53] > > > > 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > getaddrinfo1: Address in use > > > > > > We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4 > > > > > > All occurrences of the problem show this constellation - two 'long' > > > responses received. > > > As a note, of course my implementation of recv returns the correct > number > > > of bytes received. The zeros you see are only from the dump function, > it's > > > dumping the 512 byte buffer. > > > > > > I hope this is helpful in some manner. > > > > > > I came across this, I seems glibc had a similar issue (I did not look > > > in-depth, just want to share the link) > > > https://bugzilla.redhat.com/show_bug.cgi?id=1044628 > > > https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html > > > > OK, I found your problem. It's that the query ids for both the A and > > AAAA are the same, probably because you have a low-resolution or > > non-working clock_gettime. If the host environment does not provide a > > way to get a high resolution clock, I think you should still apply a > > monotonic increasing increment of the nanoseconds on each call where > > the host environment's time did not increase so that the clock is > > strictly monotonic. However musl's resolver should also deal with this > > case since it's always possible to get identical query ids (with low > > probability). We should just check if they're equal, and if so, > > increment the second one. I'll write a patch to do this. > > See if the attached patch fixes it. > > [-- Attachment #2: Type: text/html, Size: 12546 bytes --]
[-- Attachment #1: Type: text/plain, Size: 9185 bytes --] Hi Rich, I tested your patch against my new (high res clock_realtime 100nsec) and my old (low res 1msec) code and in both cases it's working, no more problems. Many Thanks Sascha Am Fr., 3. Juni 2022 um 14:34 Uhr schrieb Rich Felker <dalias@libc.org>: > On Fri, Jun 03, 2022 at 08:27:04AM -0400, Rich Felker wrote: > > On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote: > > > Hi Rich, > > > > > > I think I narrowed the thing down. Below is a dump of what happens in a > > > 'normal' situation and what happened when the sporadic issue appeared. > > > > > > Normal: > > > begin lookup3... > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > __syscall_send_internal 02000035d043dede00000000000000 > 00208.67.222.222:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > __syscall_send_internal 02000035d043dede00000000000000 > 00208.67.222.222:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > __syscall_recv begin EP[0.0.0.0:0] > > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] > > > > ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > __syscall_recv begin EP[9.9.9.9:53] > > > __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53] > > > > ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > connect:0200ffffd83ad4840000000000000000 > > > connect:0a00ffff000000002a00145040010808000000000000200400000000 > > > > > > So we did sent the bytes > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and > the > > > other sequence (hex encoded, 2 characters per byte) to each of the DNS > and > > > we received back one 'short' and one 'long' reply from 9.9.9.9. > > > I guess the short one is IPv4, long one IPv6(?). That's the case with > all > > > successful lookups, i.e. the 99% ok ones - (at least) one short - (at > > > least) one long. > > > > > > Now the problematic one: > > > > > > begin lookup1... > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > 94d90100000100000000000003777777037765620264650000010001 > > > __syscall_send_internal 02000035d043dede00000000000000 > 00208.67.222.222:53] > > > 94d90100000100000000000003777777037765620264650000010001 > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > 94d90100000100000000000003777777037765620264650000010001 > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > 94d901000001000000000000037777770377656202646500001c0001 > > > __syscall_send_internal 02000035d043dede00000000000000 > 00208.67.222.222:53] > > > 94d901000001000000000000037777770377656202646500001c0001 > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > 94d901000001000000000000037777770377656202646500001c0001 > > > > > > __syscall_recv begin EP[0.0.0.0:0] > > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] > > > > 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > __syscall_recv begin EP[9.9.9.9:53] > > > __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53] > > > > 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > getaddrinfo1: Address in use > > > > > > We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4 > > > > > > All occurrences of the problem show this constellation - two 'long' > > > responses received. > > > As a note, of course my implementation of recv returns the correct > number > > > of bytes received. The zeros you see are only from the dump function, > it's > > > dumping the 512 byte buffer. > > > > > > I hope this is helpful in some manner. > > > > > > I came across this, I seems glibc had a similar issue (I did not look > > > in-depth, just want to share the link) > > > https://bugzilla.redhat.com/show_bug.cgi?id=1044628 > > > https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html > > > > OK, I found your problem. It's that the query ids for both the A and > > AAAA are the same, probably because you have a low-resolution or > > non-working clock_gettime. If the host environment does not provide a > > way to get a high resolution clock, I think you should still apply a > > monotonic increasing increment of the nanoseconds on each call where > > the host environment's time did not increase so that the clock is > > strictly monotonic. However musl's resolver should also deal with this > > case since it's always possible to get identical query ids (with low > > probability). We should just check if they're equal, and if so, > > increment the second one. I'll write a patch to do this. > > See if the attached patch fixes it. > > [-- Attachment #2: Type: text/html, Size: 12009 bytes --]
On Sun, Jun 05, 2022 at 08:14:49PM +0200, Sascha Braun wrote: > Hi Rich, > > I tested your patch against my new (high res clock_realtime 100nsec) and my > old (low res 1msec) code and in both cases it's working, no more problems. Great! Thanks for following up. I'll include the patch upstream. Rich > Am Fr., 3. Juni 2022 um 14:34 Uhr schrieb Rich Felker <dalias@libc.org>: > > > On Fri, Jun 03, 2022 at 08:27:04AM -0400, Rich Felker wrote: > > > On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote: > > > > Hi Rich, > > > > > > > > I think I narrowed the thing down. Below is a dump of what happens in a > > > > 'normal' situation and what happened when the sporadic issue appeared.. > > > > > > > > Normal: > > > > begin lookup3... > > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > > __syscall_send_internal 02000035d043dede00000000000000 > > 00208.67.222.222:53] > > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 > > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > > __syscall_send_internal 02000035d043dede00000000000000 > > 00208.67.222.222:53] > > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001 > > > > __syscall_recv begin EP[0.0.0.0:0] > > > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] > > > > > > ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > > __syscall_recv begin EP[9.9.9.9:53] > > > > __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53] > > > > > > ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > > connect:0200ffffd83ad4840000000000000000 > > > > connect:0a00ffff000000002a00145040010808000000000000200400000000 > > > > > > > > So we did sent the bytes > > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and > > the > > > > other sequence (hex encoded, 2 characters per byte) to each of the DNS > > and > > > > we received back one 'short' and one 'long' reply from 9.9.9.9. > > > > I guess the short one is IPv4, long one IPv6(?). That's the case with > > all > > > > successful lookups, i.e. the 99% ok ones - (at least) one short - (at > > > > least) one long. > > > > > > > > Now the problematic one: > > > > > > > > begin lookup1... > > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > > 94d90100000100000000000003777777037765620264650000010001 > > > > __syscall_send_internal 02000035d043dede00000000000000 > > 00208.67.222.222:53] > > > > 94d90100000100000000000003777777037765620264650000010001 > > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > > 94d90100000100000000000003777777037765620264650000010001 > > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53] > > > > 94d901000001000000000000037777770377656202646500001c0001 > > > > __syscall_send_internal 02000035d043dede00000000000000 > > 00208.67.222.222:53] > > > > 94d901000001000000000000037777770377656202646500001c0001 > > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53] > > > > 94d901000001000000000000037777770377656202646500001c0001 > > > > > > > > __syscall_recv begin EP[0.0.0.0:0] > > > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] > > > > > > 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > > __syscall_recv begin EP[9.9.9.9:53] > > > > __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53] > > > > > > 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 > > > > getaddrinfo1: Address in use > > > > > > > > We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4 > > > > > > > > All occurrences of the problem show this constellation - two 'long' > > > > responses received. > > > > As a note, of course my implementation of recv returns the correct > > number > > > > of bytes received. The zeros you see are only from the dump function, > > it's > > > > dumping the 512 byte buffer. > > > > > > > > I hope this is helpful in some manner. > > > > > > > > I came across this, I seems glibc had a similar issue (I did not look > > > > in-depth, just want to share the link) > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1044628 > > > > https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html > > > > > > OK, I found your problem. It's that the query ids for both the A and > > > AAAA are the same, probably because you have a low-resolution or > > > non-working clock_gettime. If the host environment does not provide a > > > way to get a high resolution clock, I think you should still apply a > > > monotonic increasing increment of the nanoseconds on each call where > > > the host environment's time did not increase so that the clock is > > > strictly monotonic. However musl's resolver should also deal with this > > > case since it's always possible to get identical query ids (with low > > > probability). We should just check if they're equal, and if so, > > > increment the second one. I'll write a patch to do this. > > > > See if the attached patch fixes it. > > > >