mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
@ 2022-05-31 10:24 Sascha Braun
  2022-06-01 14:14 ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Sascha Braun @ 2022-05-31 10:24 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 15149 bytes --]

Hi everyone,

I'm implementing a socket protocol with similar works to Emscripten - ok:

Here's what I noticed

- resolv.conf using multiple DNS servers
options timeout:22 attempts:5
nameserver 8.8.4.4
nameserver 208.67.222.222
nameserver 9.9.9.9
nameserver 1.1.1.1

- getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
void socketstest1_client_dnsonly(void) {

struct addrinfo hints, * res;
memset(&hints, 0, sizeof(hints));
printf("begin lookup1...\n");
if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
perror("getaddrinfo1");
}
printf("begin lookup2...\n");
if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
perror("getaddrinfo2");
}
printf("begin lookup3...\n");
if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
perror("getaddrinfo3");
}

}

When repeating socketstest1_client_dnsonly often, it appears that IPV6
answers for IPV4 requests to another server [or vice-versa] responses can
get mixed up and getaddrinfo reports sometimes only an error where none is.

Below is a console dump of my test where you see the critical situation
(end)
8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP Proto),
resulting in error getaddrinfo1: Address in use

This happens in about 1/100 tests. When you specify IPV4 or IPV6 in hints,
this issue does NOT show up.
Many Thanks
Sascha

Console Data Dump:

__syscall_recv begin EP[8.8.4.4:53]
__syscall_recv'd_internal [020000350808040400000000000000008.8.4.4:53]
f79e818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000004800102a0014504001081200000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
connect:0200ffff8efaba240000000000000000
connect:0a00ffff000000002a00145040010812000000000000200400000000
begin lookup1...
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
08740100000100000000000003777777037765620264650000010001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
08740100000100000000000003777777037765620264650000010001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
08740100000100000000000003777777037765620264650000010001
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
087401000001000000000000037777770377656202646500001c0001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
087401000001000000000000037777770377656202646500001c0001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
087401000001000000000000037777770377656202646500001c0001
__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
08748180000100020000000003777777037765620264650000010001c00c000500010000011a000f0377777708672d68612d776562c014c028000100010000000c000452a5e58a000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
recv: no data
__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
087481800001000100010000037777770377656202646500001c0001c00c00050001000000fb000f0377777708672d68612d776562c014c02c00060001000000210031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
begin lookup2...
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
4920010000010000000000000377777706676f6f676c650264650000010001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
4920010000010000000000000377777706676f6f676c650264650000010001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
4920010000010000000000000377777706676f6f676c650264650000010001
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
4920010000010000000000000377777706676f6f676c6502646500001c0001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
4920010000010000000000000377777706676f6f676c6502646500001c0001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
4920010000010000000000000377777706676f6f676c6502646500001c0001
__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000008.8.4.4:53]
4920818000010001000000000377777706676f6f676c650264650000010001c00c00010001000000b200048efabaa3000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
__syscall_recv begin EP[8.8.4.4:53]
__syscall_recv'd_internal [020000350808040400000000000000009.9.9.9:53]
4920818000010001000000000377777706676f6f676c650264650000010001c00c00010001000000ad00048efab9e3000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
begin lookup3...
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
9308010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
9308010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
9308010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
9308010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
9308010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
9308010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
9308818000010001000000000377777706676f6f676c6503636f6d0000010001c00c000100010000009f00048efab9440000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
recv: no data
__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
9308818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000002400102a0014504001082900000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
connect:0200ffff8efab9440000000000000000
connect:0a00ffff000000002a00145040010829000000000000200400000000
begin lookup1...
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
dcf00100000100000000000003777777037765620264650000010001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
dcf00100000100000000000003777777037765620264650000010001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
dcf00100000100000000000003777777037765620264650000010001
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
dcf001000001000000000000037777770377656202646500001c0001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
dcf001000001000000000000037777770377656202646500001c0001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
dcf001000001000000000000037777770377656202646500001c0001
__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000008.8.4.4:53]
dcf081800001000100010000037777770377656202646500001c0001c00c00050001000000fa000f0377777708672d68612d776562c014c02c000600010000000f0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
__syscall_recv begin EP[8.8.4.4:53]
__syscall_recv'd_internal [020000350808040400000000000000009.9.9.9:53]
dcf081800001000100010000037777770377656202646500001c0001c00c0005000100000129000f0377777708672d68612d776562c014c02c000600010000000e0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
getaddrinfo1: Address in use

[-- Attachment #2: Type: text/html, Size: 18545 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-05-31 10:24 [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c) Sascha Braun
@ 2022-06-01 14:14 ` Rich Felker
  2022-06-01 20:35   ` Sascha Braun
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2022-06-01 14:14 UTC (permalink / raw)
  To: Sascha Braun; +Cc: musl

On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote:
> Hi everyone,
> 
> I'm implementing a socket protocol with similar works to Emscripten - ok:
> 
> Here's what I noticed
> 
> - resolv.conf using multiple DNS servers
> options timeout:22 attempts:5
> nameserver 8.8.4.4
> nameserver 208.67.222.222
> nameserver 9.9.9.9
> nameserver 1.1.1.1
> 
> - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
> void socketstest1_client_dnsonly(void) {
> 
> struct addrinfo hints, * res;
> memset(&hints, 0, sizeof(hints));
> printf("begin lookup1...\n");
> if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
> perror("getaddrinfo1");
> }
> printf("begin lookup2...\n");
> if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
> perror("getaddrinfo2");
> }
> printf("begin lookup3...\n");
> if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
> perror("getaddrinfo3");
> }
> 
> }
> 
> When repeating socketstest1_client_dnsonly often, it appears that IPV6
> answers for IPV4 requests to another server [or vice-versa] responses can
> get mixed up and getaddrinfo reports sometimes only an error where none is.
> 
> Below is a console dump of my test where you see the critical situation
> (end)
> 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP Proto),
> resulting in error getaddrinfo1: Address in use
> 
> This happens in about 1/100 tests. When you specify IPV4 or IPV6 in hints,
> this issue does NOT show up.

Can you provide more information on how to interpret the console dump,
and in particular, which of your syscalls is returning the EADDRINUSE
error? I'm pretty sure this is just a bug in your socket
stack/emulation. It looks like it's coming from recvfrom, and
EADDRINUSE is not a valid error for recvfrom to produce.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-01 14:14 ` Rich Felker
@ 2022-06-01 20:35   ` Sascha Braun
  2022-06-01 21:30     ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Sascha Braun @ 2022-06-01 20:35 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 2556 bytes --]

Hi Rich,

Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a
glitch in the Emscripten/WASI modification for musl (there are different
values for errnos per implementation):
__WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN.

My question currently breaks down to: Is it common for this implementation
of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if
answers from DNS servers came in?
If not, I need to investigate much further. I really could not find issues
with my sockets implementation, although your thought is of course very
reasonable.

Regards
S.


Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>:

> On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote:
> > Hi everyone,
> >
> > I'm implementing a socket protocol with similar works to Emscripten - ok:
> >
> > Here's what I noticed
> >
> > - resolv.conf using multiple DNS servers
> > options timeout:22 attempts:5
> > nameserver 8.8.4.4
> > nameserver 208.67.222.222
> > nameserver 9.9.9.9
> > nameserver 1.1.1.1
> >
> > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
> > void socketstest1_client_dnsonly(void) {
> >
> > struct addrinfo hints, * res;
> > memset(&hints, 0, sizeof(hints));
> > printf("begin lookup1...\n");
> > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
> > perror("getaddrinfo1");
> > }
> > printf("begin lookup2...\n");
> > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
> > perror("getaddrinfo2");
> > }
> > printf("begin lookup3...\n");
> > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
> > perror("getaddrinfo3");
> > }
> >
> > }
> >
> > When repeating socketstest1_client_dnsonly often, it appears that IPV6
> > answers for IPV4 requests to another server [or vice-versa] responses can
> > get mixed up and getaddrinfo reports sometimes only an error where none
> is.
> >
> > Below is a console dump of my test where you see the critical situation
> > (end)
> > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP Proto),
> > resulting in error getaddrinfo1: Address in use
> >
> > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in
> hints,
> > this issue does NOT show up.
>
> Can you provide more information on how to interpret the console dump,
> and in particular, which of your syscalls is returning the EADDRINUSE
> error? I'm pretty sure this is just a bug in your socket
> stack/emulation. It looks like it's coming from recvfrom, and
> EADDRINUSE is not a valid error for recvfrom to produce.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 4704 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-01 20:35   ` Sascha Braun
@ 2022-06-01 21:30     ` Rich Felker
  2022-06-01 21:52       ` Sascha Braun
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2022-06-01 21:30 UTC (permalink / raw)
  To: Sascha Braun; +Cc: musl

On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote:
> Hi Rich,
> 
> Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a
> glitch in the Emscripten/WASI modification for musl (there are different
> values for errnos per implementation):
> __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN.
> 
> My question currently breaks down to: Is it common for this implementation
> of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if
> answers from DNS servers came in?
> If not, I need to investigate much further. I really could not find issues
> with my sockets implementation, although your thought is of course very
> reasonable.

OK, looking at your code again, part of your problem is that you're
misusing perror. You need to save the return value of getaddrinfo.
Unless it's EAI_SYSTEM, errno is meaningless and you need to use
gai_strerror to get a string for the error code (the value that was
returned) rather than inspecting errno.

If getaddrinfo actually returned EAI_AGAIN, this probably means you
got an inconclusive result from one of the nameservers, probably
ServFail. However if your debug output is packet dumps, I'm not seeing
a ServFail there, and I'm not convinced it actually returned EAI_AGAIN
since you're not saving the value to check it. The value that happens
to be in errno is NOT the error getaddrinfo returned. Can you check
what it's actually returning?


> Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>:
> 
> > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote:
> > > Hi everyone,
> > >
> > > I'm implementing a socket protocol with similar works to Emscripten - ok:
> > >
> > > Here's what I noticed
> > >
> > > - resolv.conf using multiple DNS servers
> > > options timeout:22 attempts:5
> > > nameserver 8.8.4.4
> > > nameserver 208.67.222.222
> > > nameserver 9.9.9.9
> > > nameserver 1.1.1.1
> > >
> > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
> > > void socketstest1_client_dnsonly(void) {
> > >
> > > struct addrinfo hints, * res;
> > > memset(&hints, 0, sizeof(hints));
> > > printf("begin lookup1...\n");
> > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
> > > perror("getaddrinfo1");
> > > }
> > > printf("begin lookup2...\n");
> > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
> > > perror("getaddrinfo2");
> > > }
> > > printf("begin lookup3...\n");
> > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
> > > perror("getaddrinfo3");
> > > }
> > >
> > > }
> > >
> > > When repeating socketstest1_client_dnsonly often, it appears that IPV6
> > > answers for IPV4 requests to another server [or vice-versa] responses can
> > > get mixed up and getaddrinfo reports sometimes only an error where none
> > is.
> > >
> > > Below is a console dump of my test where you see the critical situation
> > > (end)
> > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP Proto),
> > > resulting in error getaddrinfo1: Address in use
> > >
> > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in
> > hints,
> > > this issue does NOT show up.
> >
> > Can you provide more information on how to interpret the console dump,
> > and in particular, which of your syscalls is returning the EADDRINUSE
> > error? I'm pretty sure this is just a bug in your socket
> > stack/emulation. It looks like it's coming from recvfrom, and
> > EADDRINUSE is not a valid error for recvfrom to produce.
> >
> > Rich
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-01 21:30     ` Rich Felker
@ 2022-06-01 21:52       ` Sascha Braun
  2022-06-02 14:25         ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Sascha Braun @ 2022-06-01 21:52 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 4121 bytes --]

I see - getaddrinfo always returns -2, EAI_NONAME in case of these rare
errors.

The errno value (EAGAIN) came from my syscall, recvfrom, which does not
have data in these cases (all packets indicated by 'poll' were read
previously by recvfrom calls). (Btw., I did try to return 0 for these cases
in my recvfrom implementation, did not help either.)

Thanks

Am Mi., 1. Juni 2022 um 23:30 Uhr schrieb Rich Felker <dalias@libc.org>:

> On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote:
> > Hi Rich,
> >
> > Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a
> > glitch in the Emscripten/WASI modification for musl (there are different
> > values for errnos per implementation):
> > __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN.
> >
> > My question currently breaks down to: Is it common for this
> implementation
> > of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if
> > answers from DNS servers came in?
> > If not, I need to investigate much further. I really could not find
> issues
> > with my sockets implementation, although your thought is of course very
> > reasonable.
>
> OK, looking at your code again, part of your problem is that you're
> misusing perror. You need to save the return value of getaddrinfo.
> Unless it's EAI_SYSTEM, errno is meaningless and you need to use
> gai_strerror to get a string for the error code (the value that was
> returned) rather than inspecting errno.
>
> If getaddrinfo actually returned EAI_AGAIN, this probably means you
> got an inconclusive result from one of the nameservers, probably
> ServFail. However if your debug output is packet dumps, I'm not seeing
> a ServFail there, and I'm not convinced it actually returned EAI_AGAIN
> since you're not saving the value to check it. The value that happens
> to be in errno is NOT the error getaddrinfo returned. Can you check
> what it's actually returning?
>
>
> > Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>:
> >
> > > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote:
> > > > Hi everyone,
> > > >
> > > > I'm implementing a socket protocol with similar works to Emscripten
> - ok:
> > > >
> > > > Here's what I noticed
> > > >
> > > > - resolv.conf using multiple DNS servers
> > > > options timeout:22 attempts:5
> > > > nameserver 8.8.4.4
> > > > nameserver 208.67.222.222
> > > > nameserver 9.9.9.9
> > > > nameserver 1.1.1.1
> > > >
> > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
> > > > void socketstest1_client_dnsonly(void) {
> > > >
> > > > struct addrinfo hints, * res;
> > > > memset(&hints, 0, sizeof(hints));
> > > > printf("begin lookup1...\n");
> > > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
> > > > perror("getaddrinfo1");
> > > > }
> > > > printf("begin lookup2...\n");
> > > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
> > > > perror("getaddrinfo2");
> > > > }
> > > > printf("begin lookup3...\n");
> > > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
> > > > perror("getaddrinfo3");
> > > > }
> > > >
> > > > }
> > > >
> > > > When repeating socketstest1_client_dnsonly often, it appears that
> IPV6
> > > > answers for IPV4 requests to another server [or vice-versa]
> responses can
> > > > get mixed up and getaddrinfo reports sometimes only an error where
> none
> > > is.
> > > >
> > > > Below is a console dump of my test where you see the critical
> situation
> > > > (end)
> > > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP
> Proto),
> > > > resulting in error getaddrinfo1: Address in use
> > > >
> > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in
> > > hints,
> > > > this issue does NOT show up.
> > >
> > > Can you provide more information on how to interpret the console dump,
> > > and in particular, which of your syscalls is returning the EADDRINUSE
> > > error? I'm pretty sure this is just a bug in your socket
> > > stack/emulation. It looks like it's coming from recvfrom, and
> > > EADDRINUSE is not a valid error for recvfrom to produce.
> > >
> > > Rich
> > >
>

[-- Attachment #2: Type: text/html, Size: 6203 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-01 21:52       ` Sascha Braun
@ 2022-06-02 14:25         ` Rich Felker
  2022-06-02 22:28           ` Sascha Braun
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2022-06-02 14:25 UTC (permalink / raw)
  To: Sascha Braun; +Cc: musl

On Wed, Jun 01, 2022 at 11:52:46PM +0200, Sascha Braun wrote:
> I see - getaddrinfo always returns -2, EAI_NONAME in case of these rare
> errors.
> 
> The errno value (EAGAIN) came from my syscall, recvfrom, which does not
> have data in these cases (all packets indicated by 'poll' were read
> previously by recvfrom calls). (Btw., I did try to return 0 for these cases
> in my recvfrom implementation, did not help either.)

It's normal to see these EAGAINs. They are not errors. Returning 0
would be wrong; a zero-length read means EOF not EAGAIN.

I don't see an immediate cause visible for why you're getting
EAI_NONAME. Note that your debug output does not seem to show the
*length* recvfrom returned, so is it possible you're just showing old
data that was already in the buffer, when recvfrom really
malfunctioned and returned a short read? At the res_msend loop layer
musl would possibly accept this (like you seem to be seeing) but then
find no results later.

Rich



> 
> Thanks
> 
> Am Mi., 1. Juni 2022 um 23:30 Uhr schrieb Rich Felker <dalias@libc.org>:
> 
> > On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote:
> > > Hi Rich,
> > >
> > > Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a
> > > glitch in the Emscripten/WASI modification for musl (there are different
> > > values for errnos per implementation):
> > > __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN.
> > >
> > > My question currently breaks down to: Is it common for this
> > implementation
> > > of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if
> > > answers from DNS servers came in?
> > > If not, I need to investigate much further. I really could not find
> > issues
> > > with my sockets implementation, although your thought is of course very
> > > reasonable.
> >
> > OK, looking at your code again, part of your problem is that you're
> > misusing perror. You need to save the return value of getaddrinfo.
> > Unless it's EAI_SYSTEM, errno is meaningless and you need to use
> > gai_strerror to get a string for the error code (the value that was
> > returned) rather than inspecting errno.
> >
> > If getaddrinfo actually returned EAI_AGAIN, this probably means you
> > got an inconclusive result from one of the nameservers, probably
> > ServFail. However if your debug output is packet dumps, I'm not seeing
> > a ServFail there, and I'm not convinced it actually returned EAI_AGAIN
> > since you're not saving the value to check it. The value that happens
> > to be in errno is NOT the error getaddrinfo returned. Can you check
> > what it's actually returning?
> >
> >
> > > Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>:
> > >
> > > > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote:
> > > > > Hi everyone,
> > > > >
> > > > > I'm implementing a socket protocol with similar works to Emscripten
> > - ok:
> > > > >
> > > > > Here's what I noticed
> > > > >
> > > > > - resolv.conf using multiple DNS servers
> > > > > options timeout:22 attempts:5
> > > > > nameserver 8.8.4.4
> > > > > nameserver 208.67.222.222
> > > > > nameserver 9.9.9.9
> > > > > nameserver 1.1.1.1
> > > > >
> > > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
> > > > > void socketstest1_client_dnsonly(void) {
> > > > >
> > > > > struct addrinfo hints, * res;
> > > > > memset(&hints, 0, sizeof(hints));
> > > > > printf("begin lookup1...\n");
> > > > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
> > > > > perror("getaddrinfo1");
> > > > > }
> > > > > printf("begin lookup2...\n");
> > > > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
> > > > > perror("getaddrinfo2");
> > > > > }
> > > > > printf("begin lookup3...\n");
> > > > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
> > > > > perror("getaddrinfo3");
> > > > > }
> > > > >
> > > > > }
> > > > >
> > > > > When repeating socketstest1_client_dnsonly often, it appears that
> > IPV6
> > > > > answers for IPV4 requests to another server [or vice-versa]
> > responses can
> > > > > get mixed up and getaddrinfo reports sometimes only an error where
> > none
> > > > is.
> > > > >
> > > > > Below is a console dump of my test where you see the critical
> > situation
> > > > > (end)
> > > > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP
> > Proto),
> > > > > resulting in error getaddrinfo1: Address in use
> > > > >
> > > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in
> > > > hints,
> > > > > this issue does NOT show up.
> > > >
> > > > Can you provide more information on how to interpret the console dump,
> > > > and in particular, which of your syscalls is returning the EADDRINUSE
> > > > error? I'm pretty sure this is just a bug in your socket
> > > > stack/emulation. It looks like it's coming from recvfrom, and
> > > > EADDRINUSE is not a valid error for recvfrom to produce.
> > > >
> > > > Rich
> > > >
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-02 14:25         ` Rich Felker
@ 2022-06-02 22:28           ` Sascha Braun
  2022-06-03 12:27             ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Sascha Braun @ 2022-06-02 22:28 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 12998 bytes --]

Hi Rich,

I think I narrowed the thing down. Below is a dump of what happens in a
'normal' situation and what happened when the sporadic issue appeared.

Normal:
begin lookup3...
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
__syscall_recv begin EP[9.9.9.9:53]
__syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53]
ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
connect:0200ffffd83ad4840000000000000000
connect:0a00ffff000000002a00145040010808000000000000200400000000

So we did sent the bytes
ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and the
other sequence (hex encoded, 2 characters per byte) to each of the DNS and
we received back one 'short' and one 'long' reply from 9.9.9.9.
I guess the short one is IPv4, long one IPv6(?). That's the case with all
successful lookups, i.e. the 99% ok ones - (at least) one short - (at
least) one long.

Now the problematic one:

begin lookup1...
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
94d90100000100000000000003777777037765620264650000010001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
94d90100000100000000000003777777037765620264650000010001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
94d90100000100000000000003777777037765620264650000010001
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
94d901000001000000000000037777770377656202646500001c0001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
94d901000001000000000000037777770377656202646500001c0001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
94d901000001000000000000037777770377656202646500001c0001

__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
__syscall_recv begin EP[9.9.9.9:53]
__syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53]
94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
getaddrinfo1: Address in use

We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4

All occurrences of the problem show this constellation - two 'long'
responses received.
As a note, of course my implementation of recv returns the correct number
of bytes received. The zeros you see are only from the dump function, it's
dumping the 512 byte buffer.

I hope this is helpful in some manner.

I came across this, I seems glibc had a similar issue (I did not look
in-depth, just want to share the link)
https://bugzilla.redhat.com/show_bug.cgi?id=1044628
https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html

Thanks
Sascha



Am Do., 2. Juni 2022 um 16:25 Uhr schrieb Rich Felker <dalias@libc.org>:

> On Wed, Jun 01, 2022 at 11:52:46PM +0200, Sascha Braun wrote:
> > I see - getaddrinfo always returns -2, EAI_NONAME in case of these rare
> > errors.
> >
> > The errno value (EAGAIN) came from my syscall, recvfrom, which does not
> > have data in these cases (all packets indicated by 'poll' were read
> > previously by recvfrom calls). (Btw., I did try to return 0 for these
> cases
> > in my recvfrom implementation, did not help either.)
>
> It's normal to see these EAGAINs. They are not errors. Returning 0
> would be wrong; a zero-length read means EOF not EAGAIN.
>
> I don't see an immediate cause visible for why you're getting
> EAI_NONAME. Note that your debug output does not seem to show the
> *length* recvfrom returned, so is it possible you're just showing old
> data that was already in the buffer, when recvfrom really
> malfunctioned and returned a short read? At the res_msend loop layer
> musl would possibly accept this (like you seem to be seeing) but then
> find no results later.
>
> Rich
>
>
>
> >
> > Thanks
> >
> > Am Mi., 1. Juni 2022 um 23:30 Uhr schrieb Rich Felker <dalias@libc.org>:
> >
> > > On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote:
> > > > Hi Rich,
> > > >
> > > > Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this
> is a
> > > > glitch in the Emscripten/WASI modification for musl (there are
> different
> > > > values for errnos per implementation):
> > > > __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN.
> > > >
> > > > My question currently breaks down to: Is it common for this
> > > implementation
> > > > of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even
> if
> > > > answers from DNS servers came in?
> > > > If not, I need to investigate much further. I really could not find
> > > issues
> > > > with my sockets implementation, although your thought is of course
> very
> > > > reasonable.
> > >
> > > OK, looking at your code again, part of your problem is that you're
> > > misusing perror. You need to save the return value of getaddrinfo.
> > > Unless it's EAI_SYSTEM, errno is meaningless and you need to use
> > > gai_strerror to get a string for the error code (the value that was
> > > returned) rather than inspecting errno.
> > >
> > > If getaddrinfo actually returned EAI_AGAIN, this probably means you
> > > got an inconclusive result from one of the nameservers, probably
> > > ServFail. However if your debug output is packet dumps, I'm not seeing
> > > a ServFail there, and I'm not convinced it actually returned EAI_AGAIN
> > > since you're not saving the value to check it. The value that happens
> > > to be in errno is NOT the error getaddrinfo returned. Can you check
> > > what it's actually returning?
> > >
> > >
> > > > Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <
> dalias@libc.org>:
> > > >
> > > > > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > I'm implementing a socket protocol with similar works to
> Emscripten
> > > - ok:
> > > > > >
> > > > > > Here's what I noticed
> > > > > >
> > > > > > - resolv.conf using multiple DNS servers
> > > > > > options timeout:22 attempts:5
> > > > > > nameserver 8.8.4.4
> > > > > > nameserver 208.67.222.222
> > > > > > nameserver 9.9.9.9
> > > > > > nameserver 1.1.1.1
> > > > > >
> > > > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
> > > > > > void socketstest1_client_dnsonly(void) {
> > > > > >
> > > > > > struct addrinfo hints, * res;
> > > > > > memset(&hints, 0, sizeof(hints));
> > > > > > printf("begin lookup1...\n");
> > > > > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
> > > > > > perror("getaddrinfo1");
> > > > > > }
> > > > > > printf("begin lookup2...\n");
> > > > > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
> > > > > > perror("getaddrinfo2");
> > > > > > }
> > > > > > printf("begin lookup3...\n");
> > > > > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
> > > > > > perror("getaddrinfo3");
> > > > > > }
> > > > > >
> > > > > > }
> > > > > >
> > > > > > When repeating socketstest1_client_dnsonly often, it appears that
> > > IPV6
> > > > > > answers for IPV4 requests to another server [or vice-versa]
> > > responses can
> > > > > > get mixed up and getaddrinfo reports sometimes only an error
> where
> > > none
> > > > > is.
> > > > > >
> > > > > > Below is a console dump of my test where you see the critical
> > > situation
> > > > > > (end)
> > > > > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP
> > > Proto),
> > > > > > resulting in error getaddrinfo1: Address in use
> > > > > >
> > > > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6
> in
> > > > > hints,
> > > > > > this issue does NOT show up.
> > > > >
> > > > > Can you provide more information on how to interpret the console
> dump,
> > > > > and in particular, which of your syscalls is returning the
> EADDRINUSE
> > > > > error? I'm pretty sure this is just a bug in your socket
> > > > > stack/emulation. It looks like it's coming from recvfrom, and
> > > > > EADDRINUSE is not a valid error for recvfrom to produce.
> > > > >
> > > > > Rich
> > > > >
> > >
>

[-- Attachment #2: Type: text/html, Size: 18556 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-02 22:28           ` Sascha Braun
@ 2022-06-03 12:27             ` Rich Felker
  2022-06-03 12:34               ` Rich Felker
  0 siblings, 1 reply; 12+ messages in thread
From: Rich Felker @ 2022-06-03 12:27 UTC (permalink / raw)
  To: Sascha Braun; +Cc: musl

On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote:
> Hi Rich,
> 
> I think I narrowed the thing down. Below is a dump of what happens in a
> 'normal' situation and what happened when the sporadic issue appeared.
> 
> Normal:
> begin lookup3...
> __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> __syscall_recv begin EP[0.0.0.0:0]
> __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> __syscall_recv begin EP[9.9.9.9:53]
> __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53]
> ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> connect:0200ffffd83ad4840000000000000000
> connect:0a00ffff000000002a00145040010808000000000000200400000000
> 
> So we did sent the bytes
> ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and the
> other sequence (hex encoded, 2 characters per byte) to each of the DNS and
> we received back one 'short' and one 'long' reply from 9.9.9.9.
> I guess the short one is IPv4, long one IPv6(?). That's the case with all
> successful lookups, i.e. the 99% ok ones - (at least) one short - (at
> least) one long.
> 
> Now the problematic one:
> 
> begin lookup1...
> __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> 94d90100000100000000000003777777037765620264650000010001
> __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> 94d90100000100000000000003777777037765620264650000010001
> __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> 94d90100000100000000000003777777037765620264650000010001
> __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> 94d901000001000000000000037777770377656202646500001c0001
> __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> 94d901000001000000000000037777770377656202646500001c0001
> __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> 94d901000001000000000000037777770377656202646500001c0001
> 
> __syscall_recv begin EP[0.0.0.0:0]
> __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> __syscall_recv begin EP[9.9.9.9:53]
> __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53]
> 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> getaddrinfo1: Address in use
> 
> We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4
> 
> All occurrences of the problem show this constellation - two 'long'
> responses received.
> As a note, of course my implementation of recv returns the correct number
> of bytes received. The zeros you see are only from the dump function, it's
> dumping the 512 byte buffer.
> 
> I hope this is helpful in some manner.
> 
> I came across this, I seems glibc had a similar issue (I did not look
> in-depth, just want to share the link)
> https://bugzilla.redhat.com/show_bug.cgi?id=1044628
> https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html

OK, I found your problem. It's that the query ids for both the A and
AAAA are the same, probably because you have a low-resolution or
non-working clock_gettime. If the host environment does not provide a
way to get a high resolution clock, I think you should still apply a
monotonic increasing increment of the nanoseconds on each call where
the host environment's time did not increase so that the clock is
strictly monotonic. However musl's resolver should also deal with this
case since it's always possible to get identical query ids (with low
probability). We should just check if they're equal, and if so,
increment the second one. I'll write a patch to do this.

Rich

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-03 12:27             ` Rich Felker
@ 2022-06-03 12:34               ` Rich Felker
  2022-06-03 13:44                 ` Sascha Braun
  2022-06-05 18:14                 ` Sascha Braun
  0 siblings, 2 replies; 12+ messages in thread
From: Rich Felker @ 2022-06-03 12:34 UTC (permalink / raw)
  To: Sascha Braun; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 8706 bytes --]

On Fri, Jun 03, 2022 at 08:27:04AM -0400, Rich Felker wrote:
> On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote:
> > Hi Rich,
> > 
> > I think I narrowed the thing down. Below is a dump of what happens in a
> > 'normal' situation and what happened when the sporadic issue appeared.
> > 
> > Normal:
> > begin lookup3...
> > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > __syscall_recv begin EP[0.0.0.0:0]
> > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> > ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > __syscall_recv begin EP[9.9.9.9:53]
> > __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53]
> > ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > connect:0200ffffd83ad4840000000000000000
> > connect:0a00ffff000000002a00145040010808000000000000200400000000
> > 
> > So we did sent the bytes
> > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and the
> > other sequence (hex encoded, 2 characters per byte) to each of the DNS and
> > we received back one 'short' and one 'long' reply from 9.9.9.9.
> > I guess the short one is IPv4, long one IPv6(?). That's the case with all
> > successful lookups, i.e. the 99% ok ones - (at least) one short - (at
> > least) one long.
> > 
> > Now the problematic one:
> > 
> > begin lookup1...
> > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > 94d90100000100000000000003777777037765620264650000010001
> > __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> > 94d90100000100000000000003777777037765620264650000010001
> > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > 94d90100000100000000000003777777037765620264650000010001
> > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > 94d901000001000000000000037777770377656202646500001c0001
> > __syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53]
> > 94d901000001000000000000037777770377656202646500001c0001
> > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > 94d901000001000000000000037777770377656202646500001c0001
> > 
> > __syscall_recv begin EP[0.0.0.0:0]
> > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> > 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > __syscall_recv begin EP[9.9.9.9:53]
> > __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53]
> > 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > getaddrinfo1: Address in use
> > 
> > We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4
> > 
> > All occurrences of the problem show this constellation - two 'long'
> > responses received.
> > As a note, of course my implementation of recv returns the correct number
> > of bytes received. The zeros you see are only from the dump function, it's
> > dumping the 512 byte buffer.
> > 
> > I hope this is helpful in some manner.
> > 
> > I came across this, I seems glibc had a similar issue (I did not look
> > in-depth, just want to share the link)
> > https://bugzilla.redhat.com/show_bug.cgi?id=1044628
> > https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html
> 
> OK, I found your problem. It's that the query ids for both the A and
> AAAA are the same, probably because you have a low-resolution or
> non-working clock_gettime. If the host environment does not provide a
> way to get a high resolution clock, I think you should still apply a
> monotonic increasing increment of the nanoseconds on each call where
> the host environment's time did not increase so that the clock is
> strictly monotonic. However musl's resolver should also deal with this
> case since it's always possible to get identical query ids (with low
> probability). We should just check if they're equal, and if so,
> increment the second one. I'll write a patch to do this.

See if the attached patch fixes it.


[-- Attachment #2: queryid.diff --]
[-- Type: text/plain, Size: 438 bytes --]

diff --git a/src/network/lookup_name.c b/src/network/lookup_name.c
index aa558c19..e20ad6db 100644
--- a/src/network/lookup_name.c
+++ b/src/network/lookup_name.c
@@ -155,6 +155,8 @@ static int name_from_dns(struct address buf[static MAXADDRS], char canon[static
 			if (qlens[nq] == -1)
 				return EAI_NONAME;
 			qbuf[nq][3] = 0; /* don't need AD flag */
+			if (nq && qbuf[nq][0] == qbuf[0][0])
+				qbuf[nq][0]++;
 			nq++;
 		}
 	}

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-03 12:34               ` Rich Felker
@ 2022-06-03 13:44                 ` Sascha Braun
  2022-06-05 18:14                 ` Sascha Braun
  1 sibling, 0 replies; 12+ messages in thread
From: Sascha Braun @ 2022-06-03 13:44 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 9347 bytes --]

Hi Rich,

Great, Thank You! This explains everything, because - although I do have a
high resolution timer - the time that arrived through CLOCK_REALTIME was
milliseconds.
I did improve the resolution down to 1/1000 of msec and things are working
fine now.

I will also test your patch this weekend and report back.

Many Thanks
Sascha


Am Fr., 3. Juni 2022 um 14:34 Uhr schrieb Rich Felker <dalias@libc.org>:

> On Fri, Jun 03, 2022 at 08:27:04AM -0400, Rich Felker wrote:
> > On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote:
> > > Hi Rich,
> > >
> > > I think I narrowed the thing down. Below is a dump of what happens in a
> > > 'normal' situation and what happened when the sporadic issue appeared.
> > >
> > > Normal:
> > > begin lookup3...
> > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > __syscall_send_internal 02000035d043dede00000000000000
> 00208.67.222.222:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > __syscall_send_internal 02000035d043dede00000000000000
> 00208.67.222.222:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > __syscall_recv begin EP[0.0.0.0:0]
> > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> > >
> ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > __syscall_recv begin EP[9.9.9.9:53]
> > > __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53]
> > >
> ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > connect:0200ffffd83ad4840000000000000000
> > > connect:0a00ffff000000002a00145040010808000000000000200400000000
> > >
> > > So we did sent the bytes
> > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and
> the
> > > other sequence (hex encoded, 2 characters per byte) to each of the DNS
> and
> > > we received back one 'short' and one 'long' reply from 9.9.9.9.
> > > I guess the short one is IPv4, long one IPv6(?). That's the case with
> all
> > > successful lookups, i.e. the 99% ok ones - (at least) one short - (at
> > > least) one long.
> > >
> > > Now the problematic one:
> > >
> > > begin lookup1...
> > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > 94d90100000100000000000003777777037765620264650000010001
> > > __syscall_send_internal 02000035d043dede00000000000000
> 00208.67.222.222:53]
> > > 94d90100000100000000000003777777037765620264650000010001
> > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > 94d90100000100000000000003777777037765620264650000010001
> > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > 94d901000001000000000000037777770377656202646500001c0001
> > > __syscall_send_internal 02000035d043dede00000000000000
> 00208.67.222.222:53]
> > > 94d901000001000000000000037777770377656202646500001c0001
> > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > 94d901000001000000000000037777770377656202646500001c0001
> > >
> > > __syscall_recv begin EP[0.0.0.0:0]
> > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> > >
> 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > __syscall_recv begin EP[9.9.9.9:53]
> > > __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53]
> > >
> 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > getaddrinfo1: Address in use
> > >
> > > We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4
> > >
> > > All occurrences of the problem show this constellation - two 'long'
> > > responses received.
> > > As a note, of course my implementation of recv returns the correct
> number
> > > of bytes received. The zeros you see are only from the dump function,
> it's
> > > dumping the 512 byte buffer.
> > >
> > > I hope this is helpful in some manner.
> > >
> > > I came across this, I seems glibc had a similar issue (I did not look
> > > in-depth, just want to share the link)
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1044628
> > > https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html
> >
> > OK, I found your problem. It's that the query ids for both the A and
> > AAAA are the same, probably because you have a low-resolution or
> > non-working clock_gettime. If the host environment does not provide a
> > way to get a high resolution clock, I think you should still apply a
> > monotonic increasing increment of the nanoseconds on each call where
> > the host environment's time did not increase so that the clock is
> > strictly monotonic. However musl's resolver should also deal with this
> > case since it's always possible to get identical query ids (with low
> > probability). We should just check if they're equal, and if so,
> > increment the second one. I'll write a patch to do this.
>
> See if the attached patch fixes it.
>
>

[-- Attachment #2: Type: text/html, Size: 12546 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-03 12:34               ` Rich Felker
  2022-06-03 13:44                 ` Sascha Braun
@ 2022-06-05 18:14                 ` Sascha Braun
  2022-06-05 18:47                   ` Rich Felker
  1 sibling, 1 reply; 12+ messages in thread
From: Sascha Braun @ 2022-06-05 18:14 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 9185 bytes --]

Hi Rich,

I tested your patch against my new (high res clock_realtime 100nsec) and my
old (low res 1msec) code and in both cases it's working, no more problems.

Many Thanks
Sascha

Am Fr., 3. Juni 2022 um 14:34 Uhr schrieb Rich Felker <dalias@libc.org>:

> On Fri, Jun 03, 2022 at 08:27:04AM -0400, Rich Felker wrote:
> > On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote:
> > > Hi Rich,
> > >
> > > I think I narrowed the thing down. Below is a dump of what happens in a
> > > 'normal' situation and what happened when the sporadic issue appeared.
> > >
> > > Normal:
> > > begin lookup3...
> > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > __syscall_send_internal 02000035d043dede00000000000000
> 00208.67.222.222:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > __syscall_send_internal 02000035d043dede00000000000000
> 00208.67.222.222:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > __syscall_recv begin EP[0.0.0.0:0]
> > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> > >
> ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > __syscall_recv begin EP[9.9.9.9:53]
> > > __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53]
> > >
> ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > connect:0200ffffd83ad4840000000000000000
> > > connect:0a00ffff000000002a00145040010808000000000000200400000000
> > >
> > > So we did sent the bytes
> > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and
> the
> > > other sequence (hex encoded, 2 characters per byte) to each of the DNS
> and
> > > we received back one 'short' and one 'long' reply from 9.9.9.9.
> > > I guess the short one is IPv4, long one IPv6(?). That's the case with
> all
> > > successful lookups, i.e. the 99% ok ones - (at least) one short - (at
> > > least) one long.
> > >
> > > Now the problematic one:
> > >
> > > begin lookup1...
> > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > 94d90100000100000000000003777777037765620264650000010001
> > > __syscall_send_internal 02000035d043dede00000000000000
> 00208.67.222.222:53]
> > > 94d90100000100000000000003777777037765620264650000010001
> > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > 94d90100000100000000000003777777037765620264650000010001
> > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > 94d901000001000000000000037777770377656202646500001c0001
> > > __syscall_send_internal 02000035d043dede00000000000000
> 00208.67.222.222:53]
> > > 94d901000001000000000000037777770377656202646500001c0001
> > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > 94d901000001000000000000037777770377656202646500001c0001
> > >
> > > __syscall_recv begin EP[0.0.0.0:0]
> > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> > >
> 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > __syscall_recv begin EP[9.9.9.9:53]
> > > __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53]
> > >
> 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > getaddrinfo1: Address in use
> > >
> > > We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4
> > >
> > > All occurrences of the problem show this constellation - two 'long'
> > > responses received.
> > > As a note, of course my implementation of recv returns the correct
> number
> > > of bytes received. The zeros you see are only from the dump function,
> it's
> > > dumping the 512 byte buffer.
> > >
> > > I hope this is helpful in some manner.
> > >
> > > I came across this, I seems glibc had a similar issue (I did not look
> > > in-depth, just want to share the link)
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1044628
> > > https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html
> >
> > OK, I found your problem. It's that the query ids for both the A and
> > AAAA are the same, probably because you have a low-resolution or
> > non-working clock_gettime. If the host environment does not provide a
> > way to get a high resolution clock, I think you should still apply a
> > monotonic increasing increment of the nanoseconds on each call where
> > the host environment's time did not increase so that the clock is
> > strictly monotonic. However musl's resolver should also deal with this
> > case since it's always possible to get identical query ids (with low
> > probability). We should just check if they're equal, and if so,
> > increment the second one. I'll write a patch to do this.
>
> See if the attached patch fixes it.
>
>

[-- Attachment #2: Type: text/html, Size: 12009 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c)
  2022-06-05 18:14                 ` Sascha Braun
@ 2022-06-05 18:47                   ` Rich Felker
  0 siblings, 0 replies; 12+ messages in thread
From: Rich Felker @ 2022-06-05 18:47 UTC (permalink / raw)
  To: Sascha Braun; +Cc: musl

On Sun, Jun 05, 2022 at 08:14:49PM +0200, Sascha Braun wrote:
> Hi Rich,
> 
> I tested your patch against my new (high res clock_realtime 100nsec) and my
> old (low res 1msec) code and in both cases it's working, no more problems.

Great! Thanks for following up. I'll include the patch upstream.

Rich


> Am Fr., 3. Juni 2022 um 14:34 Uhr schrieb Rich Felker <dalias@libc.org>:
> 
> > On Fri, Jun 03, 2022 at 08:27:04AM -0400, Rich Felker wrote:
> > > On Fri, Jun 03, 2022 at 12:28:40AM +0200, Sascha Braun wrote:
> > > > Hi Rich,
> > > >
> > > > I think I narrowed the thing down. Below is a dump of what happens in a
> > > > 'normal' situation and what happened when the sporadic issue appeared..
> > > >
> > > > Normal:
> > > > begin lookup3...
> > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > > __syscall_send_internal 02000035d043dede00000000000000
> > 00208.67.222.222:53]
> > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
> > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > > __syscall_send_internal 02000035d043dede00000000000000
> > 00208.67.222.222:53]
> > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > > ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
> > > > __syscall_recv begin EP[0.0.0.0:0]
> > > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> > > >
> > ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > > __syscall_recv begin EP[9.9.9.9:53]
> > > > __syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53]
> > > >
> > ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > > connect:0200ffffd83ad4840000000000000000
> > > > connect:0a00ffff000000002a00145040010808000000000000200400000000
> > > >
> > > > So we did sent the bytes
> > > > ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and
> > the
> > > > other sequence (hex encoded, 2 characters per byte) to each of the DNS
> > and
> > > > we received back one 'short' and one 'long' reply from 9.9.9.9.
> > > > I guess the short one is IPv4, long one IPv6(?). That's the case with
> > all
> > > > successful lookups, i.e. the 99% ok ones - (at least) one short - (at
> > > > least) one long.
> > > >
> > > > Now the problematic one:
> > > >
> > > > begin lookup1...
> > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > > 94d90100000100000000000003777777037765620264650000010001
> > > > __syscall_send_internal 02000035d043dede00000000000000
> > 00208.67.222.222:53]
> > > > 94d90100000100000000000003777777037765620264650000010001
> > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > > 94d90100000100000000000003777777037765620264650000010001
> > > > __syscall_send_internal 020000350808040400000000000000008.8.4.4:53]
> > > > 94d901000001000000000000037777770377656202646500001c0001
> > > > __syscall_send_internal 02000035d043dede00000000000000
> > 00208.67.222.222:53]
> > > > 94d901000001000000000000037777770377656202646500001c0001
> > > > __syscall_send_internal 020000350909090900000000000000009.9.9.9:53]
> > > > 94d901000001000000000000037777770377656202646500001c0001
> > > >
> > > > __syscall_recv begin EP[0.0.0.0:0]
> > > > __syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53]
> > > >
> > 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > > __syscall_recv begin EP[9.9.9.9:53]
> > > > __syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53]
> > > >
> > 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
> > > > getaddrinfo1: Address in use
> > > >
> > > > We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4
> > > >
> > > > All occurrences of the problem show this constellation - two 'long'
> > > > responses received.
> > > > As a note, of course my implementation of recv returns the correct
> > number
> > > > of bytes received. The zeros you see are only from the dump function,
> > it's
> > > > dumping the 512 byte buffer.
> > > >
> > > > I hope this is helpful in some manner.
> > > >
> > > > I came across this, I seems glibc had a similar issue (I did not look
> > > > in-depth, just want to share the link)
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1044628
> > > > https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html
> > >
> > > OK, I found your problem. It's that the query ids for both the A and
> > > AAAA are the same, probably because you have a low-resolution or
> > > non-working clock_gettime. If the host environment does not provide a
> > > way to get a high resolution clock, I think you should still apply a
> > > monotonic increasing increment of the nanoseconds on each call where
> > > the host environment's time did not increase so that the clock is
> > > strictly monotonic. However musl's resolver should also deal with this
> > > case since it's always possible to get identical query ids (with low
> > > probability). We should just check if they're equal, and if so,
> > > increment the second one. I'll write a patch to do this.
> >
> > See if the attached patch fixes it.
> >
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-06-05 18:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-31 10:24 [musl] Problably Issue in name_from_dns // __res_msend_rc (lookup_name.c) Sascha Braun
2022-06-01 14:14 ` Rich Felker
2022-06-01 20:35   ` Sascha Braun
2022-06-01 21:30     ` Rich Felker
2022-06-01 21:52       ` Sascha Braun
2022-06-02 14:25         ` Rich Felker
2022-06-02 22:28           ` Sascha Braun
2022-06-03 12:27             ` Rich Felker
2022-06-03 12:34               ` Rich Felker
2022-06-03 13:44                 ` Sascha Braun
2022-06-05 18:14                 ` Sascha Braun
2022-06-05 18:47                   ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).