Looks like I need to do more than just reverse that commit to get this to build...

../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro ‘__socketcall_cp’
   47 | #define socketcall_cp(nm,a,b,c,d,e,f) __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
      |                                                     ^~~~~~~~~~~~~~~
../src_musl/src/network/accept.c:6:16: note: in expansion of macro ‘socketcall_cp’
    6 |         return socketcall_cp(accept, fd, addr, len, 0, 0, 0);
      |                ^~~~~~~~~~~~~
../src_musl/src/internal/syscall.h:62:54: note: each undeclared identifier is reported only once for each function it appears in
   62 | #define __socketcall_cp(nm,a,b,c,d,e,f) __syscall_cp(SYS_##nm, a, b, c, d, e, f)
      |                                                      ^~~~
../src_musl/src/internal/syscall.h:55:53: note: in definition of macro ‘__syscall_cp6’
   55 | #define __syscall_cp6(n,a,b,c,d,e,f) (__syscall_cp)(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f))
      |                                                     ^
../src_musl/src/internal/syscall.h:57:27: note: in expansion of macro ‘__SYSCALL_DISP’
   57 | #define __syscall_cp(...) __SYSCALL_DISP(__syscall_cp,__VA_ARGS__)
      |                           ^~~~~~~~~~~~~~
../src_musl/src/internal/syscall.h:62:41: note: in expansion of macro ‘__syscall_cp’
   62 | #define __socketcall_cp(nm,a,b,c,d,e,f) __syscall_cp(SYS_##nm, a, b, c, d, e, f)
      |                                         ^~~~~~~~~~~~
../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro ‘__socketcall_cp’
   47 | #define socketcall_cp(nm,a,b,c,d,e,f) __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
      |                                                     ^~~~~~~~~~~~~~~
../src_musl/src/network/accept.c:6:16: note: in expansion of macro ‘socketcall_cp’
    6 |         return socketcall_cp(accept, fd, addr, len, 0, 0, 0);
      |                ^~~~~~~~~~~~~
../src_musl/src/network/accept.c:7:1: warning: control reaches end of non-void function [-Wreturn-type]
    7 | }
      | ^
make[2]: *** [Makefile:159: obj/src/network/accept.lo] Error 1
make[2]: Leaving directory '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl/obj_musl'
make[1]: *** [Makefile:249: obj_musl/.lc_built] Error 2
make[1]: Leaving directory '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl'
make: *** [Makefile:194: all] Error 2

On Wed, Feb 16, 2022, 4:53 PM Satadru Pramanik wrote:

> I was looking at that commit too. I've started a build with that reverted
> and should be able to check back on that tomorrow.
>
> On Wed, Feb 16, 2022 at 4:33 PM Rich Felker wrote:
>
>> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
>> > The only change to socket.c I'm seeing is use __socketcall to simplify
>> > socket()
>> > <https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001>,
>> > so maybe it would make sense for me to try building with that reversed?
>>
>> That should not be a functional change, but you may be overlooking
>> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the
>> new (added in 4.3) individual socket syscalls instead of the legacy
>> multiplexed SYS_socketcall. It's supposed to fall back to using the
>> old ones, but perhaps something goes wrong on your kernel that's
>> preventing it. I'm not sure what the mechanism by which it works when
>> straced/single-stepped could be, though, but if it's a weird kernel
>> bug anything is possible.
>>
>> Reverting that commit should be entirely safe, if it turns out to be
>> what's triggering your problem, but I'd like to get to the root cause
>> and see if there's anything we can do to ensure this doesn't come up
>> again.
>>
>>
>> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik wrote:
>> >
>> > >
>> > >>
>> > >> - Whether any network traffic occurs when it fails (in the real
>> > >> environment not a replicated one elsewhere).
>> > >>
>> > >>
>> > > There is no network traffic in the real environment.
>> > >
>> > >
>> > >> - Whether it fails or succeeds under strace (in the real
>> > >> environment not a replicated one elsewhere).
>> > >>
>> > >> It succeeds in strace (in the real environment)
>> > >
>> > >
>> > >
>> > >> - Whether the real environment involves Docker or not.
>> > >>
>> > >> The real environment does not involve docker.
>> > >
>> > >
>> > >
>> > >> - What's in resolv.conf (in the real environment not a replicated one
>> > >> elsewhere) and what nameserver software (if known) is running on the
>> > >> nameserver(s) listed in there.
>> > >>
>> > >> The nameserver is picked up from dhcp. The contents of the file are as
>> > > follows:
>> > > nameserver 192.168.0.1
>> > > search lan.
>> > > options single-request timeout:1 attempts:5
>> > >
>> > >
>> > >> - Anything else that might be relevant.
>> > >>
>> > >> DNS server is dnsmasq running on a current OpenWRT device.
>> > >
>> > >
>> > >> It's really hard to offer any productive advice when the problem is
>> > >> unclear.
>> > >>
>> > >> Apologies for the confusion.
>> > > I'm really just trying to debug this getaddrinfo breakage on this older
>> > > hardware. The docker containers setup is something we use to build packages
>> > > for this hardware, and our frustration is that the software works perfectly
>> > > fine in the docker containers, but not on the hardware.
>> > >
>> > > > Any other suggestions on how to track down this issue?
>> > >>
>> > >> Rather than stepping through, I would put a single breakpoint at a
>> > >> place you want to see whether execution reaches before running the
>> > >> test program, then start it and see if the breakpoint fires or not.
>> > >> Then remove the breakpoint, add a different one, and repeat. For
>> > >> example, see if __res_msend is ever called, and if so, whether
>> > >> particular lines of it are reached (or just put breakpoints on some of
>> > >> the functions it calls, like socket, bind, recvfrom, poll, etc. to see
>> > >> if they're called).
>> > >>
>> > >> It might also be useful to put a breakpoint on clock_gettime and then
>> > >> 'finish' to see what it returns (in case the problem is something
>> > >> time64-related).
>> > >>
>> > >>
>> > > The only breakpoint which fixed the execution was for line 20 (which
>> > > invokes getaddrinfo). Stepping through the __kernel_vsyscall and then
>> > > continuing is the only way it does not result in failure.
>> > >
>> > > Any later breakpoints fail.
>> > >
>> > > I went through the other breakpoints as requested.
>> > > clock_gettime did not fire.
>> > >
>> > > Breakpoint 1 at 0x5c2f7: file ../src_musl/compat/time32/clock_gettime32.c,
>> > > line 9.
>> > > __res_msend, setsockopt also did not fire.
>> > > The ones that did fire were: socket, bind, recvfrom, poll, __res_msend_rc,
>> > > memset, sendto, __get_resolv_conf, pthread_setcancelstate,
>> > > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
>> > >
>> > > When breaking on socket, stepping through the __kernel_vsyscall call after
>> > > socket and then continuing succeeds.
>> > >
>> > > Is it possible that the socket is not waiting long enough for a response
>> > > from __kernel_vsyscall? Has that changed?
>> > > Breaking, stepping, and continuing on every other function above fails.
>> > >
>> > > The gdb log is attached.
>> > >
>> > > Regards,
>> > >
>> > > Satadru
>> > >
>> > >
>> >
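For readers following Rich's description of commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91: the idea is to call the individual socket syscalls first and fall back to the legacy multiplexed SYS_socketcall only when the kernel does not provide them. The following is a minimal userspace sketch of that pattern, not musl's actual code (musl does this internally through the __socketcall/__syscall_cp macros seen in the build log). It uses the generic syscall(2) wrapper and assumes an older kernel reports ENOSYS for the newer syscall numbers; socket_with_fallback and SOCKETCALL_SOCKET are hypothetical names.

```c
#define _GNU_SOURCE             /* for syscall(2), as its man page suggests */
#include <errno.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical constant: the socketcall(2) sub-call number for socket(),
 * i.e. the value of SYS_SOCKET in <linux/net.h>. */
#define SOCKETCALL_SOCKET 1

/* Hypothetical helper (not musl's code): try the direct socket syscall
 * first and fall back to the multiplexed socketcall only if the kernel
 * answers ENOSYS. Returns a file descriptor, or -1 with errno set. */
static int socket_with_fallback(int domain, int type, int protocol)
{
#ifdef SYS_socket
    long r = syscall(SYS_socket, domain, type, protocol);
    if (r >= 0 || errno != ENOSYS)
        return (int)r;          /* success, or an error other than ENOSYS */
#endif
#ifdef SYS_socketcall
    {
        /* Legacy path: socketcall(call, args) receives the arguments as an
         * array of unsigned long. It only exists on some 32-bit ABIs. */
        unsigned long args[3] = { (unsigned long)domain,
                                  (unsigned long)type,
                                  (unsigned long)protocol };
        return (int)syscall(SYS_socketcall, SOCKETCALL_SOCKET, args);
    }
#endif
    errno = ENOSYS;             /* no usable syscall known at build time */
    return -1;
}
```

On x86_64 there is no SYS_socketcall, so only the first branch is compiled; on i686 with reasonably recent kernel headers both branches are, which is the configuration where a kernel that mishandles the direct syscalls would matter.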
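As a complement to Rich's suggestion of breaking on clock_gettime and using 'finish', a small standalone program built with the same toolchain can show directly what clock_gettime returns. This is a sketch that assumes CLOCK_REALTIME and CLOCK_MONOTONIC are the clocks of interest; check_clocks and report_clock are hypothetical names. Since musl 1.2.0, time_t is 64 bits on all targets including i386, so the first line printed should report 8 on such a build.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime and CLOCK_MONOTONIC */
#include <stdio.h>
#include <time.h>

/* Print one clock's current value; returns 0 on success, -1 on failure. */
static int report_clock(clockid_t id, const char *name)
{
    struct timespec ts;
    if (clock_gettime(id, &ts) != 0) {
        perror(name);
        return -1;
    }
    printf("%s: %lld.%09ld\n", name, (long long)ts.tv_sec, (long)ts.tv_nsec);
    return 0;
}

/* Hypothetical check: report the width of time_t and the current values
 * of CLOCK_REALTIME and CLOCK_MONOTONIC. Returns 0 if both calls succeed. */
static int check_clocks(void)
{
    printf("sizeof(time_t) = %zu\n", sizeof(time_t));
    if (report_clock(CLOCK_REALTIME, "CLOCK_REALTIME") != 0)
        return -1;
    return report_clock(CLOCK_MONOTONIC, "CLOCK_MONOTONIC");
}
```

A wildly wrong CLOCK_REALTIME value, or a failure here, would point toward a time64 problem rather than the socket path.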