Hi Rich,

I think I narrowed the thing down. Below is a dump of what happens in a 'normal' situation and what happened when the sporadic issue appeared.

Normal:
begin lookup3...
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]        ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]        ecea010000010000000000000377777706676f6f676c6503636f6d0000010001
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53]        ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53]        ecea010000010000000000000377777706676f6f676c6503636f6d00001c0001
__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] ecea818000010001000000000377777706676f6f676c6503636f6d0000010001c00c00010001000000390004d83ad4840000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
__syscall_recv begin EP[9.9.9.9:53]
__syscall_recv'd_internal [020000350909090900000000000000009.9.9.9:53] ecea818000010001000000000377777706676f6f676c6503636f6d00001c0001c00c001c00010000003400102a0014504001080800000000000020040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
connect:0200ffffd83ad4840000000000000000
connect:0a00ffff000000002a00145040010808000000000000200400000000

So we did sent the bytes ecea010000010000000000000377777706676f6f676c6503636f6d0000010001 and the other sequence (hex encoded, 2 characters per byte) to each of the DNS and we received back one 'short' and one 'long' reply from 9.9.9.9.
I guess the short one is IPv4, long one IPv6(?). That's the case with all successful lookups, i.e. the 99% ok ones - (at least) one short - (at least) one long.

Now the problematic one:

begin lookup1...
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 94d90100000100000000000003777777037765620264650000010001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 94d90100000100000000000003777777037765620264650000010001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 94d90100000100000000000003777777037765620264650000010001
__syscall_send_internal 020000350808040400000000000000008.8.4.4:53] 94d901000001000000000000037777770377656202646500001c0001
__syscall_send_internal 02000035d043dede0000000000000000208.67.222.222:53] 94d901000001000000000000037777770377656202646500001c0001
__syscall_send_internal 020000350909090900000000000000009.9.9.9:53] 94d901000001000000000000037777770377656202646500001c0001

__syscall_recv begin EP[0.0.0.0:0]
__syscall_recv'd_internal [020000000000000000000000000000009.9.9.9:53] 94d981800001000100010000037777770377656202646500001c0001c00c000500010000004e000f0377777708672d68612d776562c014c02c00060001000000340031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
__syscall_recv begin EP[9.9.9.9:53]
__syscall_recv'd_internal [020000350909090900000000000000008.8.4.4:53] 94d981800001000100010000037777770377656202646500001c0001c00c0005000100000114000f0377777708672d68612d776562c014c02c000600010000002c0031036e733102706f0675692d646e73c0140a686f73746d6173746572c04378860f1200002a3000000e1000093a800000003c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
getaddrinfo1: Address in use

We received two 'long' responses, one from 9.9.9.9; one from 8.8.4.4

All occurrences of the problem show this constellation - two 'long' responses received.
As a note, of course my implementation of recv returns the correct number of bytes received. The zeros you see are only from the dump function, it's dumping the 512 byte buffer.

I hope this is helpful in some manner.

I came across this, I seems glibc had a similar issue (I did not look in-depth, just want to share the link)
https://bugzilla.redhat.com/show_bug.cgi?id=1044628
https://sourceware.org/legacy-ml/libc-alpha/2014-04/msg00321.html

Thanks
Sascha



Am Do., 2. Juni 2022 um 16:25 Uhr schrieb Rich Felker <dalias@libc.org>:
On Wed, Jun 01, 2022 at 11:52:46PM +0200, Sascha Braun wrote:
> I see - getaddrinfo always returns -2, EAI_NONAME in case of these rare
> errors.
>
> The errno value (EAGAIN) came from my syscall, recvfrom, which does not
> have data in these cases (all packets indicated by 'poll' were read
> previously by recvfrom calls). (Btw., I did try to return 0 for these cases
> in my recvfrom implementation, did not help either.)

It's normal to see these EAGAINs. They are not errors. Returning 0
would be wrong; a zero-length read means EOF not EAGAIN.

I don't see an immediate cause visible for why you're getting
EAI_NONAME. Note that your debug output does not seem to show the
*length* recvfrom returned, so is it possible you're just showing old
data that was already in the buffer, when recvfrom really
malfunctioned and returned a short read? At the res_msend loop layer
musl would possibly accept this (like you seem to be seeing) but then
find no results later.

Rich



>
> Thanks
>
> Am Mi., 1. Juni 2022 um 23:30 Uhr schrieb Rich Felker <dalias@libc.org>:
>
> > On Wed, Jun 01, 2022 at 10:35:52PM +0200, Sascha Braun wrote:
> > > Hi Rich,
> > >
> > > Thanks for your time. The EADDRINUSE is in reality an EAGAIN, this is a
> > > glitch in the Emscripten/WASI modification for musl (there are different
> > > values for errnos per implementation):
> > > __WASI_ERRNO_ADDRINUSE == -3 == EAI_AGAIN.
> > >
> > > My question currently breaks down to: Is it common for this
> > implementation
> > > of getaddrinfo to return EAGAIN sometimes (I experience 1:100), even if
> > > answers from DNS servers came in?
> > > If not, I need to investigate much further. I really could not find
> > issues
> > > with my sockets implementation, although your thought is of course very
> > > reasonable.
> >
> > OK, looking at your code again, part of your problem is that you're
> > misusing perror. You need to save the return value of getaddrinfo.
> > Unless it's EAI_SYSTEM, errno is meaningless and you need to use
> > gai_strerror to get a string for the error code (the value that was
> > returned) rather than inspecting errno.
> >
> > If getaddrinfo actually returned EAI_AGAIN, this probably means you
> > got an inconclusive result from one of the nameservers, probably
> > ServFail. However if your debug output is packet dumps, I'm not seeing
> > a ServFail there, and I'm not convinced it actually returned EAI_AGAIN
> > since you're not saving the value to check it. The value that happens
> > to be in errno is NOT the error getaddrinfo returned. Can you check
> > what it's actually returning?
> >
> >
> > > Am Mi., 1. Juni 2022 um 16:14 Uhr schrieb Rich Felker <dalias@libc.org>:
> > >
> > > > On Tue, May 31, 2022 at 12:24:19PM +0200, Sascha Braun wrote:
> > > > > Hi everyone,
> > > > >
> > > > > I'm implementing a socket protocol with similar works to Emscripten
> > - ok:
> > > > >
> > > > > Here's what I noticed
> > > > >
> > > > > - resolv.conf using multiple DNS servers
> > > > > options timeout:22 attempts:5
> > > > > nameserver 8.8.4.4
> > > > > nameserver 208.67.222.222
> > > > > nameserver 9.9.9.9
> > > > > nameserver 1.1.1.1
> > > > >
> > > > > - getaddrinfo with no hints, so that IPV6 and IPV4 is resolved
> > > > > void socketstest1_client_dnsonly(void) {
> > > > >
> > > > > struct addrinfo hints, * res;
> > > > > memset(&hints, 0, sizeof(hints));
> > > > > printf("begin lookup1...\n");
> > > > > if (getaddrinfo("www.web.de", "80", &hints, &res) != 0) {
> > > > > perror("getaddrinfo1");
> > > > > }
> > > > > printf("begin lookup2...\n");
> > > > > if (getaddrinfo("www.google.de", "80", &hints, &res) != 0) {
> > > > > perror("getaddrinfo2");
> > > > > }
> > > > > printf("begin lookup3...\n");
> > > > > if (getaddrinfo("www.google.com", "80", &hints, &res) != 0) {
> > > > > perror("getaddrinfo3");
> > > > > }
> > > > >
> > > > > }
> > > > >
> > > > > When repeating socketstest1_client_dnsonly often, it appears that
> > IPV6
> > > > > answers for IPV4 requests to another server [or vice-versa]
> > responses can
> > > > > get mixed up and getaddrinfo reports sometimes only an error where
> > none
> > > > is.
> > > > >
> > > > > Below is a console dump of my test where you see the critical
> > situation
> > > > > (end)
> > > > > 8.8.4.4:53 was queried, 9.9.9.9:53 responded for a (different IP
> > Proto),
> > > > > resulting in error getaddrinfo1: Address in use
> > > > >
> > > > > This happens in about 1/100 tests. When you specify IPV4 or IPV6 in
> > > > hints,
> > > > > this issue does NOT show up.
> > > >
> > > > Can you provide more information on how to interpret the console dump,
> > > > and in particular, which of your syscalls is returning the EADDRINUSE
> > > > error? I'm pretty sure this is just a bug in your socket
> > > > stack/emulation. It looks like it's coming from recvfrom, and
> > > > EADDRINUSE is not a valid error for recvfrom to produce.
> > > >
> > > > Rich
> > > >
> >