Looks like reverting that commit works, but interestingly, only stochastically. chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com getaddrinfo: Try again chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com AF_INET6: 2607:f8b0:4006:81c::200e AF_INET: 142.250.80.46 chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com AF_INET6: 2607:f8b0:4006:81c::200e AF_INET: 142.250.80.46 chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com getaddrinfo: Try again chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com AF_INET6: 2607:f8b0:4006:81c::200e AF_INET: 142.250.80.46 chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com AF_INET6: 2607:f8b0:4006:81c::200e AF_INET: 142.250.80.46 chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com getaddrinfo: Try again chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com AF_INET6: 2607:f8b0:4006:81c::200e AF_INET: 142.250.80.46 chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com getaddrinfo: Try again chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com getaddrinfo: Try again chronos@localhost /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $ ./musl_getaddrinfo_test google.com AF_INET6: 2607:f8b0:4006:81c::200e AF_INET: 142.250.80.46 On Wed, Feb 16, 2022 at 11:14 PM Satadru Pramanik wrote: > Oops. Looks like I reversed the patch wrong. Rebuilding again... > > On Wed, Feb 16, 2022 at 8:44 PM Satadru Pramanik > wrote: > >> Looks like I need to do more than just reverse that commit to get this to >> build... >> >> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro >> ‘__socketcall_cp’ 47 | #define socketcall_cp(nm,a,b,c,d,e,f) >> __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f)) >> | >> ^~~~~~~~~~~~~~~ >> ../src_musl/src/network/accept.c:6:16: note: in expansion of macro >> ‘socketcall_cp’ >> 6 | return socketcall_cp(accept, fd, addr, len, 0, 0, 0); >> | ^~~~~~~~~~~~~ >> ../src_musl/src/internal/syscall.h:62:54: note: each undeclared >> identifier is reported only once for each function it appears in >> 62 | #define >> __socketcall_cp(nm,a,b,c,d,e,f) __syscall_cp(SYS_##nm, a, b, c, d, e, f) >> | ^~~~ >> ../src_musl/src/internal/syscall.h:55:53: note: in definition of >> macro ‘__syscall_cp6’ 55 | #define __syscall_cp6(n,a,b,c,d,e,f) >> (__syscall_cp)(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f)) >> | >> ^ >> ../src_musl/src/internal/syscall.h:57:27: note: in expansion of macro >> ‘__SYSCALL_DISP’ 57 | #define __syscall_cp(...) >> __SYSCALL_DISP(__syscall_cp,__VA_ARGS__) | >> ^~~~~~~~~~~~~~ >> ../src_musl/src/internal/syscall.h:62:41: note: in expansion of macro >> ‘__syscall_cp’ 62 | #define __socketcall_cp(nm,a,b,c,d,e,f) >> __syscall_cp(SYS_##nm, a, b, c, d, e, f) | >> ^~~~~~~~~~~~ >> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro >> ‘__socketcall_cp’ >> 47 | #define socketcall_cp(nm,a,b,c,d,e,f) >> __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f)) >> | >> ^~~~~~~~~~~~~~~ >> ../src_musl/src/network/accept.c:6:16: note: in expansion of macro >> ‘socketcall_cp’ 6 | return socketcall_cp(accept, fd, addr, >> len, 0, 0, 0); | ^~~~~~~~~~~~~ >> ../src_musl/src/network/accept.c:7:1: warning: control reaches end of >> non-void function [-Wreturn-type] >> 7 | } >> | ^ >> make[2]: *** [Makefile:159: obj/src/network/accept.lo] Error >> 1 >> make[2]: Leaving directory >> '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl/obj_musl' >> make[1]: *** [Makefile:249: >> obj_musl/.lc_built] Error 2 make[1]: >> Leaving directory >> '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl' >> make: *** >> [Makefile:194: all] Error 2 >> >> On Wed, Feb 16, 2022, 4:53 PM Satadru Pramanik wrote: >> >>> I was looking at that commit too. I've started a build with that >>> reverted and should be able to check back on that tomorrow. >>> >>> On Wed, Feb 16, 2022 at 4:33 PM Rich Felker wrote: >>> >>>> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote: >>>> > The only change to socket.c I'm seeing is use __socketcall to simplify >>>> > socket() >>>> > < >>>> https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001 >>>> >, >>>> > so maybe it would make sense for me to try building with that >>>> reversed? >>>> >>>> That should not be a functional change, but you may be overlooking >>>> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the >>>> new (added in 4.3) individual socket syscalls instead of the legacy >>>> multiplexed SYS_socketcall. It's supposed to fall back to using the >>>> old ones, but perhaps something goes wrong on your kernel that's >>>> preventing it. I'm not sure what the mechanism by which it works when >>>> straced/single-stepped could be, though, but if it's a weird kernel >>>> bug anything is possible. >>>> >>>> Reverting that commit should be entirely safe, if it turns out to be >>>> what's triggering your problem, but I'd like to get to the root cause >>>> and see if there's anything we can do to ensure this doesn't come up >>>> again. >>>> >>>> >>>> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik >>>> wrote: >>>> > >>>> > > >>>> > >> >>>> > >> - Whether any network traffic occurs when it fails (in the real >>>> > >> environment not a replicated one elsewhere). >>>> > >> >>>> > >> >>>> > > There is no network traffic in the real environment. >>>> > > >>>> > > >>>> > >> - Whether it fails or succeeds under strace (in the real >>>> > >> environment not a replicated one elsewhere). >>>> > >> >>>> > >> It succeeds in strace (in the real environment) >>>> > > >>>> > > >>>> > > >>>> > >> - Whether the real environment involves Docker or not. >>>> > >> >>>> > >> The real environment does not involve docker. >>>> > > >>>> > > >>>> > > >>>> > >> - What's in resolv.conf (in the real environment not a replicated >>>> one >>>> > >> elsewhere) and what nameserver software (if known) is running on >>>> the >>>> > >> nameserver(s) listed in there. >>>> > >> >>>> > >> The nameserver is picked up from dhcp. The contents of the file >>>> are as >>>> > > follows: >>>> > > nameserver 192.168.0.1 >>>> > > search lan. >>>> > > options single-request timeout:1 attempts:5 >>>> > > >>>> > > >>>> > >> - Anything else that might be relevant. >>>> > >> >>>> > >> DNS server is dnsmasq running on a current OpenWRT device. >>>> > > >>>> > > >>>> > >> It's really hard to offer any productive advice when the problem is >>>> > >> unclear. >>>> > >> >>>> > >> Apologies for the confusion. >>>> > > I'm really just trying to debug this getaddrinfo breakage on this >>>> older >>>> > > hardware. The docker containers setup is something we use to build >>>> packages >>>> > > for this hardware, and our frustration is that the software works >>>> perfectly >>>> > > fine in the docker containers, but not on the hardware. >>>> > > >>>> > > > Any other suggestions on how to track down this issue? >>>> > >> >>>> > >> Rather than stepping through, I would put a single breakpoint at a >>>> > >> place you want to see whether execution reaches before running the >>>> > >> test program, then start it and see if the breakpoint fires or not. >>>> > >> Then remove the breakpoint, add a different one, and repeat. For >>>> > >> example, see if __res_msend is ever called, and if so, whether >>>> > >> particular lines of it are reached (or just put breakpoints on >>>> some of >>>> > >> the functions it calls, like socket, bind, recvfrom, poll, etc. to >>>> see >>>> > >> if they're called). >>>> > >> >>>> > >> It might also be useful to put a breakpoint on clock_gettime and >>>> then >>>> > >> 'finish' to see what it returns (in case the problem is something >>>> > >> time64-related). >>>> > >> >>>> > >> >>>> > > The only breakpoint which fixed the execution was for line 20 (which >>>> > > invokes getaddrinfo). Stepping through the __kernel_vsyscall and >>>> then >>>> > > continuing is the only way it does not result in failure. >>>> > > >>>> > > Any later breakpoints fail. >>>> > > >>>> > > I went though the other breakpoints as requested. >>>> > > clock_gettime did not fire. >>>> > > >>>> > > Breakpoint 1 at 0x5c2f7: file >>>> ../src_musl/compat/time32/clock_gettime32.c, >>>> > > line 9. >>>> > > __res_msend, setsockopt also did not fire. >>>> > > The ones that did fire were: socket, bind, recvfrom, poll, >>>> __res_msend_rc, >>>> > > memset, sendto, __get_resolv_conf, pthread_setcancelstate, >>>> > > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy >>>> > > >>>> > > When breaking on socket, stepping through the __kernel_vsyscall >>>> call after >>>> > > socket and then continuing succeeds. >>>> > > >>>> > > Is it possible that the socket is not waiting long enough for a >>>> response >>>> > > from __kernel_vsyscall? Has that changed? >>>> > > Breaking, stepping, and continuing on every other function above >>>> fails. >>>> > > >>>> > > The gdb log is attached. >>>> > > >>>> > > Regards, >>>> > > >>>> > > Satadru >>>> > > >>>> > > >>>> >>>