On Sun, Apr 10, 2016 at 10:35:22PM -0400, Rich Felker wrote:
> On Mon, Apr 11, 2016 at 12:33:07AM +0200, Sebastian Gottschall wrote:
> > On 11.04.2016 at 00:29, Rich Felker wrote:
> > >On Mon, Apr 11, 2016 at 12:24:49AM +0200, Sebastian Gottschall wrote:
> > >>>I think what nsz was asking for, and what I'd like to see, is a way to
> > >>>reproduce the bug. I'm going to try building iproute2 for mips64 and
> > >>>running it on a prebuilt kernel from Aboriginal Linux under
> > >>>qemu-system-mips64, but I don't know what specific commands are needed
> > >>>to hit the affected code path.
> > >>any command since all is netlink based
> > >>ip add add 192.168.1.1/24 dev eth0
> > >>
> > >>you will see that nothing will happen. ip will just return an error
> > >>message (i wrote this message already in the first entry on this
> > >>mailinglist)
> > >>"EOF on netlink" is the error which is shown
> > >OK, I'll try this.
> > >
> > >>>>its all resulting in the same failing recvmsg / sendmsg call.. so
> > >>>>yes libnetlink.c does not work with musl on mips64 (it does work on
> > >>>>x64 and everything else, just not mips64) unless the hack i offered
> > >>>>was applied which again fixed all.
> > >>>>before you ask again for a problem description, just read again. it
> > >>>>wont change the description if you ask again and just makes people
> > >>>>tired on this list.
> > >>>Both versions of the struct (musl's and your modified one that matches
> > >>>the kernel) have the exact same layout, but due to having a member
> > >>>with 64-bit type, yours has 8-byte alignment and musl's only has
> > >>>4-byte alignment. This means, at least:
> > >>>
> > >>>1. When musl's sendmsg.c makes its copy to zero out the padding, the
> > >>>   copy may not be correctly aligned for 64-bit writes, and the kernel
> > >>>   faults or manually produces an error for this case, causing the
> > >>>   whole operation to fail.
> > >>>   However, I don't see where iproute2 is
> > >>>   actually passing control messages to sendmsg, so while this is a
> > >>>   problem, I don't think it's the cause. Maybe I'm missing the
> > >>>   affected call point; this is why I'd like steps to reproduce the
> > >>>   issue so I can see it.
> > >>>
> > >>>2. iproute2's libnetlink.c's rtnl_listen function does not properly
> > >>>   declare its cmsgbuf with the alignment of cmsghdr; it has type
> > >>>   char[] so the compiler is free not to align it at all. This is
> > >>>   presumably a bug in iproute2, but I can't find any good
> > >>>   documentation (in the standards or Linux-specific) for how you're
> > >>>   supposed to allocate this space, so maybe the kernel is able to
> > >>>   handle aligning the buffer itself. I don't see any way the
> > >>>   alignment of musl's cmsghdr type affects recvmsg though.
> > >>>
> > >>>Maybe there are other effects I'm missing? I'll follow up again once I
> > >>>get a test build/run of iproute2 and let you know whether I can see
> > >>>the problem.
> > >>okay. if you need a remote access to a octeon system using musl (my
> > >>fixed variant), just tell me.
> > >That would be really helpful. Something's wrong with the userspace for
> > >the Aboriginal mips64 binaries (SIGBUS in init) and debugging that
> > >would be a big distraction.
> > >
> > >BTW do you have gdb and strace available?
> > not on the system itself. i'm not sure if strace works on mips64.
> > never tried it.
> > but you're free to copy any binary to the /tmp dir. it has 2 gb ram.
> > so enough space for static binaries if you want to play with.
> > i will send you the ssh data in a private email
>
> I haven't been able to reproduce the error on your system. I've tried
> building my own static-linked version of the "ip" utility with a
> mips64-linux-musl softfloat compiler, and uploading my libc.so and
> using it to run both your version of ip and a dynamic-linked one I
> just built.
> They all work fine for adding/removing a 127.0.0.2 address
> to the "lo" interface.
>
> Next I'm going to try to get a minimal testcase that tries to
> intentionally misalign the control message buffers. I suspect I'm just
> "getting lucky" and my buffer happens to be aligned the way the kernel
> wants by chance.

I've managed to track down the cause of the breakage. Somehow your
iproute2 has been miscompiled. What I did was add debug logic to
libc.so to print the contents of the msghdr struct passed in before
fixups, after fixups, and after the syscall. The output I got was:

msghdr: 0xffffd58e08 12 0xffffd58df8 1 0 0 0 0 0
msghdr: 0xffffd58e08 12 0xffffd58df8 0 0 0 0 0 0
msghdr: 0xffffd58e08 12 0xffffd58df8 0 0 0 0 0 32

The fields (including __pad1 and __pad2) are printed in order. So as
you can see, ip passed in a structure with a 1 in __pad1 and a 0 in
msg_iovlen. The source (libnetlink.c) stores 1 to msg_iovlen, so my
guess is that somehow it ended up getting the wrong-endian version of
the structure definition. You could confirm this by adding #error to
the little-endian case in arch/mips64/bits/socket.h and recompiling.

I suspect it's going to take some additional work to track down the
cause, which is likely specific to something in your toolchain (it
didn't happen for me when I built my own iproute2).

In case you or anyone else would like to use the struct dumping in
testing, or just understand precisely what it's printing, I'm
attaching the patch I used.

Rich