Oops. Looks like I reversed the patch wrong. Rebuilding again...

On Wed, Feb 16, 2022 at 8:44 PM Satadru Pramanik wrote:

> Looks like I need to do more than just reverse that commit to get this
> to build...
>
> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro ‘__socketcall_cp’
>    47 | #define socketcall_cp(nm,a,b,c,d,e,f) __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
>       |                                                     ^~~~~~~~~~~~~~~
> ../src_musl/src/network/accept.c:6:16: note: in expansion of macro ‘socketcall_cp’
>     6 |         return socketcall_cp(accept, fd, addr, len, 0, 0, 0);
>       |                ^~~~~~~~~~~~~
> ../src_musl/src/internal/syscall.h:62:54: note: each undeclared identifier is reported only once for each function it appears in
>    62 | #define __socketcall_cp(nm,a,b,c,d,e,f) __syscall_cp(SYS_##nm, a, b, c, d, e, f)
>       |                                                      ^~~~
> ../src_musl/src/internal/syscall.h:55:53: note: in definition of macro ‘__syscall_cp6’
>    55 | #define __syscall_cp6(n,a,b,c,d,e,f) (__syscall_cp)(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f))
>       |                                                     ^
> ../src_musl/src/internal/syscall.h:57:27: note: in expansion of macro ‘__SYSCALL_DISP’
>    57 | #define __syscall_cp(...) __SYSCALL_DISP(__syscall_cp,__VA_ARGS__)
>       |                           ^~~~~~~~~~~~~~
> ../src_musl/src/internal/syscall.h:62:41: note: in expansion of macro ‘__syscall_cp’
>    62 | #define __socketcall_cp(nm,a,b,c,d,e,f) __syscall_cp(SYS_##nm, a, b, c, d, e, f)
>       |                                         ^~~~~~~~~~~~
> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro ‘__socketcall_cp’
>    47 | #define socketcall_cp(nm,a,b,c,d,e,f) __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
>       |                                                     ^~~~~~~~~~~~~~~
> ../src_musl/src/network/accept.c:6:16: note: in expansion of macro ‘socketcall_cp’
>     6 |         return socketcall_cp(accept, fd, addr, len, 0, 0, 0);
>       |                ^~~~~~~~~~~~~
> ../src_musl/src/network/accept.c:7:1: warning: control reaches end of non-void function [-Wreturn-type]
>     7 | }
>       | ^
> make[2]: *** [Makefile:159: obj/src/network/accept.lo] Error 1
> make[2]: Leaving directory '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl/obj_musl'
> make[1]: *** [Makefile:249: obj_musl/.lc_built] Error 2
> make[1]: Leaving directory '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl'
> make: *** [Makefile:194: all] Error 2
>
> On Wed, Feb 16, 2022, 4:53 PM Satadru Pramanik wrote:
>
>> I was looking at that commit too. I've started a build with that
>> reverted and should be able to check back on that tomorrow.
>>
>> On Wed, Feb 16, 2022 at 4:33 PM Rich Felker wrote:
>>
>>> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
>>> > The only change to socket.c I'm seeing is "use __socketcall to
>>> > simplify socket()"
>>> > <https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001>,
>>> > so maybe it would make sense for me to try building with that
>>> > reversed?
>>>
>>> That should not be a functional change, but you may be overlooking
>>> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which switched to
>>> using the new (added in kernel 4.3) individual socket syscalls
>>> instead of the legacy multiplexed SYS_socketcall. It's supposed to
>>> fall back to using the old ones, but perhaps something goes wrong on
>>> your kernel that prevents that. I'm not sure what mechanism could
>>> make it work when straced/single-stepped, though, but if it's a
>>> weird kernel bug anything is possible.
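
(Side note, mostly for my own understanding: as far as I can tell, the
fallback Rich describes amounts to something like the sketch below. This
is only an illustration with a made-up helper name, not musl's actual
macros.)

    /* Illustration only -- hypothetical helper, not musl's real code.
     * Try the dedicated per-call socket syscall first; if the running
     * kernel is too old to know it (ENOSYS), fall back to the legacy
     * multiplexed SYS_socketcall.  Assumes an i386-like target where
     * SYS_socketcall and the SYS_SOCKET/SYS_BIND/... multiplex codes
     * from <linux/net.h> exist. */
    #include <errno.h>
    #include <unistd.h>        /* syscall() */
    #include <sys/syscall.h>   /* SYS_socket, SYS_socketcall */
    #include <linux/net.h>     /* SYS_SOCKET, SYS_BIND, ... */

    static long try_socket_syscall(long direct_nr, int mux_code,
                                   long a, long b, long c,
                                   long d, long e, long f)
    {
        long args[6] = { a, b, c, d, e, f };
        long r = syscall(direct_nr, a, b, c, d, e, f);    /* new-style call */
        if (r == -1 && errno == ENOSYS)
            r = syscall(SYS_socketcall, mux_code, args);  /* legacy fallback */
        return r;
    }

    /* e.g. try_socket_syscall(SYS_socket, SYS_SOCKET,
     *                         AF_INET, SOCK_DGRAM, 0, 0, 0, 0); */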
>>>
>>> Reverting that commit should be entirely safe, if it turns out to be
>>> what's triggering your problem, but I'd like to get to the root cause
>>> and see if there's anything we can do to ensure this doesn't come up
>>> again.
>>>
>>> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik wrote:
>>> >
>>> > >> - Whether any network traffic occurs when it fails (in the real
>>> > >>   environment, not a replicated one elsewhere).
>>> > >
>>> > > There is no network traffic in the real environment.
>>> > >
>>> > >> - Whether it fails or succeeds under strace (in the real
>>> > >>   environment, not a replicated one elsewhere).
>>> > >
>>> > > It succeeds under strace (in the real environment).
>>> > >
>>> > >> - Whether the real environment involves Docker or not.
>>> > >
>>> > > The real environment does not involve Docker.
>>> > >
>>> > >> - What's in resolv.conf (in the real environment, not a replicated
>>> > >>   one elsewhere) and what nameserver software (if known) is
>>> > >>   running on the nameserver(s) listed in there.
>>> > >
>>> > > The nameserver is picked up from DHCP. The contents of the file
>>> > > are as follows:
>>> > >
>>> > >     nameserver 192.168.0.1
>>> > >     search lan.
>>> > >     options single-request timeout:1 attempts:5
>>> > >
>>> > >> - Anything else that might be relevant.
>>> > >
>>> > > The DNS server is dnsmasq running on a current OpenWRT device.
>>> > >
>>> > >> It's really hard to offer any productive advice when the problem
>>> > >> is unclear.
>>> > >
>>> > > Apologies for the confusion. I'm really just trying to debug this
>>> > > getaddrinfo breakage on this older hardware. The Docker container
>>> > > setup is something we use to build packages for this hardware, and
>>> > > our frustration is that the software works perfectly fine in the
>>> > > Docker containers, but not on the hardware.
>>> > >
>>> > > > Any other suggestions on how to track down this issue?
>>> > >>
>>> > >> Rather than stepping through, I would put a single breakpoint at
>>> > >> a place you want to see whether execution reaches before running
>>> > >> the test program, then start it and see if the breakpoint fires
>>> > >> or not. Then remove the breakpoint, add a different one, and
>>> > >> repeat. For example, see if __res_msend is ever called, and if
>>> > >> so, whether particular lines of it are reached (or just put
>>> > >> breakpoints on some of the functions it calls, like socket, bind,
>>> > >> recvfrom, poll, etc., to see if they're called).
>>> > >>
>>> > >> It might also be useful to put a breakpoint on clock_gettime and
>>> > >> then 'finish' to see what it returns (in case the problem is
>>> > >> something time64-related).
>>> > >
>>> > > The only breakpoint which fixed the execution was the one on line
>>> > > 20 (which invokes getaddrinfo). Stepping through the
>>> > > __kernel_vsyscall and then continuing is the only way it does not
>>> > > result in failure.
>>> > >
>>> > > Any later breakpoints fail.
>>> > >
>>> > > I went through the other breakpoints as requested.
>>> > >
>>> > > clock_gettime did not fire:
>>> > >
>>> > >     Breakpoint 1 at 0x5c2f7: file
>>> > >     ../src_musl/compat/time32/clock_gettime32.c, line 9.
>>> > >
>>> > > __res_msend and setsockopt also did not fire.
>>> > > The ones that did fire were: socket, bind, recvfrom, poll,
>>> > > __res_msend_rc, memset, sendto, __get_resolv_conf,
>>> > > pthread_setcancelstate, __pthread_setcancelstate, __lookup_serv,
>>> > > __lookup_name, and memcpy.
>>> > >
>>> > > When breaking on socket, stepping through the __kernel_vsyscall
>>> > > call after socket and then continuing succeeds.
>>> > >
>>> > > Is it possible that the socket is not waiting long enough for a
>>> > > response from __kernel_vsyscall? Has that changed?
>>> > >
>>> > > Breaking, stepping, and continuing on every other function above
>>> > > fails.
>>> > >
>>> > > The gdb log is attached.
>>> > >
>>> > > Regards,
>>> > >
>>> > > Satadru
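
P.S. For anyone who wants to poke at this without wading through the
attached log: the test being stepped through boils down to a
getaddrinfo() call, so a minimal stand-in along the following lines
should exercise the same resolver path. (Hypothetical sketch -- not the
exact program or line numbers from the gdb log.)

    /* Hypothetical minimal reproducer, not the exact test from the gdb
     * log: resolve a name with getaddrinfo() and report either success
     * or the gai error, so a failure in the resolver path is visible. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netdb.h>

    int main(int argc, char **argv)
    {
        const char *host = argc > 1 ? argv[1] : "www.example.com";
        struct addrinfo hints, *res;

        memset(&hints, 0, sizeof hints);
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        int err = getaddrinfo(host, "80", &hints, &res);  /* the call at issue */
        if (err) {
            fprintf(stderr, "getaddrinfo(%s): %s\n", host, gai_strerror(err));
            return 1;
        }
        printf("getaddrinfo(%s) succeeded\n", host);
        freeaddrinfo(res);
        return 0;
    }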