[musl] Re: musl getaddr info breakage on older kernels

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [musl] Re: musl getaddr info breakage on older kernels
       [not found] <CAFrh3J9CK67S=Jn531zTK04QeR8yiZwro_ZThQD=3aUx5XghCA@mail.gmail.com>
@ 2022-02-06 21:30 ` Rich Felker
  2022-02-06 23:25   ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-06 21:30 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Sun, Feb 06, 2022 at 03:32:40PM -0500, Satadru Pramanik wrote:
> Hello,
> 
> I'm a dev for the Chromebrew linux distribution, and we've been trying to
> support older EOL i686 chromebooks stuck on older kernels.
> 
> We have noticed that running newer versions of musl on such machines breaks
> getaddrinfo in musl. This is a problem as we are trying to build
> musl-static builds of curl and git which will work on these older machines.
> (This is really irritating since these binaries work fine in i686
> containers on newer machines running newer kernels, but then fail when run
> on the hardware which has the older kernels.)

The problem is almost surely not the kernel or musl but either buggy
nameservers or buggy Docker. Can you post the strace log of the
failing test case? That would quickly determine which it is.

> I have been attempting to bisect where this happens, and have
> determined that the first commit which breaks DNS resolution on musl for
> these older machines is
> https://git.musl-libc.org/cgit/musl/commit/?id=2f2348c9588d61680123bbe438db38acf5dfea4c

This commit does not change anything in musl, which does not use those
macros at all internally, just the public headers. So the bisect
result must be wrong.

> (I emailed you since you are the person who made that commit.)

Please mail the list. I've CC'd it in my reply here.

> Just applying a reverse of that commit allows musl networking to work
> properly up to the (2020-03-14) 2b2c8aafce9d80f9d58652643538f4d58e82b856
> commit. I'm trying newer commits with a
> 2f2348c9588d61680123bbe438db38acf5dfea4c reversal patch to see if where
> networking breaks again, as later commits such as the (2020-05-18)
> fd7ec068efd590c0393a612599a4fab9bb0a8633 commit have broken ipv6 DNS
> resolving.
> 
> I'd like to be able to get this issue fixed, as we would like to use a
> newer version of musl not susceptible to CVE-2020-28928.
> 
> The test case I'm using is the code at
> https://www.openwall.com/lists/musl/2021/07/19/1
> 
> Our implementation is here:
> https://github.com/skycocker/chromebrew/blob/master/packages/musl_getaddrinfo_test.rb
> 
> Once I find other commits which are breaking this, I'd like to be able to
> get a patch integrated into musl so that we don't have to maintain a
> patchset for older machines.
> 
> Would that be feasible? And my apologies if this should have gone to one of
> the mailing lists first. I figured I should ask you since you might have
> the most insight into this breaking installs.
> 
> Here is the current reversal patch we are testing with:
> 
> --- b/include/arpa/inet.h
> +++ a/include/arpa/inet.h
> @@ -24,6 +24,11 @@
>  in_addr_t inet_lnaof(struct in_addr);
>  in_addr_t inet_netof(struct in_addr);
> 
> +#undef INET_ADDRSTRLEN
> +#undef INET6_ADDRSTRLEN
> +#define INET_ADDRSTRLEN  16
> +#define INET6_ADDRSTRLEN 46
> +
>  #ifdef __cplusplus
>  }
>  #endif
> --- b/include/netinet/in.h
> +++ a/include/netinet/in.h
> @@ -60,6 +60,8 @@
> 
>  extern const struct in6_addr in6addr_any, in6addr_loopback;
> 
> +#undef INET_ADDRSTRLEN
> +#undef INET6_ADDRSTRLEN
>  #define INET_ADDRSTRLEN  16
>  #define INET6_ADDRSTRLEN 46
> 
> Thanks,
> 
> Satadru Pramanik
> Dev Team
> Chromebrew

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [musl] Re: musl getaddr info breakage on older kernels
  2022-02-06 21:30 ` [musl] Re: musl getaddr info breakage on older kernels Rich Felker
@ 2022-02-06 23:25   ` Satadru Pramanik
  2022-02-06 23:44     ` Rich Felker
  2022-02-07  5:35     ` Markus Wichmann
  0 siblings, 2 replies; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-06 23:25 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 4309 bytes --]

So here's the funny thing. These test binaries always work when run through
strace, sometimes even working once the binary been run after it has been
run with strace, but tends to fail when musl is installed, and then run for
the first time after being built against musl. (It isn't just these test
case binaries which have this issue. I've seen the same issue with the test
binaries from c_ares.)

Does that ring a bell at all? Does it make sense for this to run
correctly when run with strace (built against glibc) but have problems when
run w/o strace?

Happy to provide binaries as a well as instructions for a container setup
to build musl like I'm doing, which can generate such binaries.

Of course this could be a buggy nameserver issue. My test setup uses an
OpenWRT dnsmasq name server forwarding to 9.9.9.9, 8.8.8.8, and 1.1.1.1

Satadru

On Sun, Feb 6, 2022 at 4:30 PM Rich Felker <dalias@aerifal.cx> wrote:

> On Sun, Feb 06, 2022 at 03:32:40PM -0500, Satadru Pramanik wrote:
> > Hello,
> >
> > I'm a dev for the Chromebrew linux distribution, and we've been trying to
> > support older EOL i686 chromebooks stuck on older kernels.
> >
> > We have noticed that running newer versions of musl on such machines
> breaks
> > getaddrinfo in musl. This is a problem as we are trying to build
> > musl-static builds of curl and git which will work on these older
> machines.
> > (This is really irritating since these binaries work fine in i686
> > containers on newer machines running newer kernels, but then fail when
> run
> > on the hardware which has the older kernels.)
>
> The problem is almost surely not the kernel or musl but either buggy
> nameservers or buggy Docker. Can you post the strace log of the
> failing test case? That would quickly determine which it is.
>
> > I have been attempting to bisect where this happens, and have
> > determined that the first commit which breaks DNS resolution on musl for
> > these older machines is
> >
> https://git.musl-libc.org/cgit/musl/commit/?id=2f2348c9588d61680123bbe438db38acf5dfea4c
>
> This commit does not change anything in musl, which does not use those
> macros at all internally, just the public headers. So the bisect
> result must be wrong.
>
> > (I emailed you since you are the person who made that commit.)
>
> Please mail the list. I've CC'd it in my reply here.
>
> > Just applying a reverse of that commit allows musl networking to work
> > properly up to the (2020-03-14) 2b2c8aafce9d80f9d58652643538f4d58e82b856
> > commit. I'm trying newer commits with a
> > 2f2348c9588d61680123bbe438db38acf5dfea4c reversal patch to see if where
> > networking breaks again, as later commits such as the (2020-05-18)
> > fd7ec068efd590c0393a612599a4fab9bb0a8633 commit have broken ipv6 DNS
> > resolving.
> >
> > I'd like to be able to get this issue fixed, as we would like to use a
> > newer version of musl not susceptible to CVE-2020-28928.
> >
> > The test case I'm using is the code at
> > https://www.openwall.com/lists/musl/2021/07/19/1
> >
> > Our implementation is here:
> >
> https://github.com/skycocker/chromebrew/blob/master/packages/musl_getaddrinfo_test.rb
> >
> > Once I find other commits which are breaking this, I'd like to be able to
> > get a patch integrated into musl so that we don't have to maintain a
> > patchset for older machines.
> >
> > Would that be feasible? And my apologies if this should have gone to one
> of
> > the mailing lists first. I figured I should ask you since you might have
> > the most insight into this breaking installs.
> >
> > Here is the current reversal patch we are testing with:
> >
> > --- b/include/arpa/inet.h
> > +++ a/include/arpa/inet.h
> > @@ -24,6 +24,11 @@
> >  in_addr_t inet_lnaof(struct in_addr);
> >  in_addr_t inet_netof(struct in_addr);
> >
> > +#undef INET_ADDRSTRLEN
> > +#undef INET6_ADDRSTRLEN
> > +#define INET_ADDRSTRLEN  16
> > +#define INET6_ADDRSTRLEN 46
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > --- b/include/netinet/in.h
> > +++ a/include/netinet/in.h
> > @@ -60,6 +60,8 @@
> >
> >  extern const struct in6_addr in6addr_any, in6addr_loopback;
> >
> > +#undef INET_ADDRSTRLEN
> > +#undef INET6_ADDRSTRLEN
> >  #define INET_ADDRSTRLEN  16
> >  #define INET6_ADDRSTRLEN 46
> >
> > Thanks,
> >
> > Satadru Pramanik
> > Dev Team
> > Chromebrew
>

[-- Attachment #2: Type: text/html, Size: 5613 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-06 23:25   ` Satadru Pramanik
@ 2022-02-06 23:44     ` Rich Felker
  2022-02-07  1:29       ` Satadru Pramanik
  2022-02-07  5:35     ` Markus Wichmann
  1 sibling, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-06 23:44 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Sun, Feb 06, 2022 at 06:25:34PM -0500, Satadru Pramanik wrote:
> So here's the funny thing. These test binaries always work when run through
> strace, sometimes even working once the binary been run after it has been
> run with strace, but tends to fail when musl is installed, and then run for
> the first time after being built against musl. (It isn't just these test
> case binaries which have this issue. I've seen the same issue with the test
> binaries from c_ares.)
> 
> Does that ring a bell at all? Does it make sense for this to run
> correctly when run with strace (built against glibc) but have problems when
> run w/o strace?

Can you include that strace, even if it's for a run that worked? It
might include evidence of the cause. My best guess based on this
symptom is a buggy Docker version blocking time64 syscalls with EPERM.

> Happy to provide binaries as a well as instructions for a container setup
> to build musl like I'm doing, which can generate such binaries.
> 
> Of course this could be a buggy nameserver issue. My test setup uses an
> OpenWRT dnsmasq name server forwarding to 9.9.9.9, 8.8.8.8, and 1.1.1.1
> 
> Satadru
> 
> On Sun, Feb 6, 2022 at 4:30 PM Rich Felker <dalias@aerifal.cx> wrote:
> 
> > On Sun, Feb 06, 2022 at 03:32:40PM -0500, Satadru Pramanik wrote:
> > > Hello,
> > >
> > > I'm a dev for the Chromebrew linux distribution, and we've been trying to
> > > support older EOL i686 chromebooks stuck on older kernels.
> > >
> > > We have noticed that running newer versions of musl on such machines
> > breaks
> > > getaddrinfo in musl. This is a problem as we are trying to build
> > > musl-static builds of curl and git which will work on these older
> > machines.
> > > (This is really irritating since these binaries work fine in i686
> > > containers on newer machines running newer kernels, but then fail when
> > run
> > > on the hardware which has the older kernels.)
> >
> > The problem is almost surely not the kernel or musl but either buggy
> > nameservers or buggy Docker. Can you post the strace log of the
> > failing test case? That would quickly determine which it is.
> >
> > > I have been attempting to bisect where this happens, and have
> > > determined that the first commit which breaks DNS resolution on musl for
> > > these older machines is
> > >
> > https://git.musl-libc.org/cgit/musl/commit/?id=2f2348c9588d61680123bbe438db38acf5dfea4c
> >
> > This commit does not change anything in musl, which does not use those
> > macros at all internally, just the public headers. So the bisect
> > result must be wrong.
> >
> > > (I emailed you since you are the person who made that commit.)
> >
> > Please mail the list. I've CC'd it in my reply here.
> >
> > > Just applying a reverse of that commit allows musl networking to work
> > > properly up to the (2020-03-14) 2b2c8aafce9d80f9d58652643538f4d58e82b856
> > > commit. I'm trying newer commits with a
> > > 2f2348c9588d61680123bbe438db38acf5dfea4c reversal patch to see if where
> > > networking breaks again, as later commits such as the (2020-05-18)
> > > fd7ec068efd590c0393a612599a4fab9bb0a8633 commit have broken ipv6 DNS
> > > resolving.
> > >
> > > I'd like to be able to get this issue fixed, as we would like to use a
> > > newer version of musl not susceptible to CVE-2020-28928.
> > >
> > > The test case I'm using is the code at
> > > https://www.openwall.com/lists/musl/2021/07/19/1
> > >
> > > Our implementation is here:
> > >
> > https://github.com/skycocker/chromebrew/blob/master/packages/musl_getaddrinfo_test.rb
> > >
> > > Once I find other commits which are breaking this, I'd like to be able to
> > > get a patch integrated into musl so that we don't have to maintain a
> > > patchset for older machines.
> > >
> > > Would that be feasible? And my apologies if this should have gone to one
> > of
> > > the mailing lists first. I figured I should ask you since you might have
> > > the most insight into this breaking installs.
> > >
> > > Here is the current reversal patch we are testing with:
> > >
> > > --- b/include/arpa/inet.h
> > > +++ a/include/arpa/inet.h
> > > @@ -24,6 +24,11 @@
> > >  in_addr_t inet_lnaof(struct in_addr);
> > >  in_addr_t inet_netof(struct in_addr);
> > >
> > > +#undef INET_ADDRSTRLEN
> > > +#undef INET6_ADDRSTRLEN
> > > +#define INET_ADDRSTRLEN  16
> > > +#define INET6_ADDRSTRLEN 46
> > > +
> > >  #ifdef __cplusplus
> > >  }
> > >  #endif
> > > --- b/include/netinet/in.h
> > > +++ a/include/netinet/in.h
> > > @@ -60,6 +60,8 @@
> > >
> > >  extern const struct in6_addr in6addr_any, in6addr_loopback;
> > >
> > > +#undef INET_ADDRSTRLEN
> > > +#undef INET6_ADDRSTRLEN
> > >  #define INET_ADDRSTRLEN  16
> > >  #define INET6_ADDRSTRLEN 46
> > >
> > > Thanks,
> > >
> > > Satadru Pramanik
> > > Dev Team
> > > Chromebrew
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-06 23:44     ` Rich Felker
@ 2022-02-07  1:29       ` Satadru Pramanik
  2022-02-07  2:40         ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-07  1:29 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl


[-- Attachment #1.1: Type: text/plain, Size: 7694 bytes --]

Here are illustrative logs of output and strace logs.

Note that while the musl toolchain is built in a container on a much more
powerful machine, this "musl_getaddrinfo_test" app is built locally on the
machine with the 3.8 kernel.

I ran the following to get the output on the smaller i686 machine
immediately after the app is built.
Apologies for the ruby code wrapping the shell commands.

    @musl_ver = `#{CREW_MUSL_PREFIX}/lib/libc.so 2>&1 >/dev/null | head -2
| tail -1 | awk '{print $2}'`.chomp
    puts 'Testing the musl resolver to see if it can resolve google.com:
'.lightblue
    system "./musl_getaddrinfo_test google.com set_ai_family 2>&1 |tee -a
/tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family.txt "
    system "./musl_getaddrinfo_test google.com 2>&1 |tee -a
/tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com.txt"
    system "strace -o
/tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family_STRACE.txt
./musl_getaddrinfo_test google.com set_ai_family"
    system "strace -o
/tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_STRACE.txt
./musl_getaddrinfo_test google.com"

And here is the output for each run before running again via strace. Note
how IPv6 addresses show up sporadically, and for 1.2.2 nothing at all shows
up, but everything works fine according to the strace logs. (Strace is
built against glibc 2.23.)

==>
musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com_set_ai_family.txt
<==
AF_INET: 142.251.40.110

==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com.txt <==
AF_INET: 142.251.40.110

==>
musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com_set_ai_family.txt
<==
AF_INET: 142.251.40.142

==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com.txt <==
getaddrinfo: Try again

==>
musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com_set_ai_family.txt
<==
AF_INET: 142.251.40.206

==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com.txt <==
AF_INET6: 2607:f8b0:4006:81f::200e
AF_INET: 142.251.40.206

==>
musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com_set_ai_family.txt <==
AF_INET: 142.250.65.206

==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com.txt <==
AF_INET: 142.250.65.206

==> musl_1.2.1_getaddrinfo_test_google.com_set_ai_family.txt <==
AF_INET: 142.251.40.110

==> musl_1.2.1_getaddrinfo_test_google.com.txt <==
getaddrinfo: Try again

==> musl_1.2.2_getaddrinfo_test_google.com_set_ai_family.txt <==
getaddrinfo: Try again

==> musl_1.2.2_getaddrinfo_test_google.com.txt <==
getaddrinfo: Try again

Regards,

Satadru


On Sun, Feb 6, 2022 at 6:44 PM Rich Felker <dalias@aerifal.cx> wrote:

> On Sun, Feb 06, 2022 at 06:25:34PM -0500, Satadru Pramanik wrote:
> > So here's the funny thing. These test binaries always work when run
> through
> > strace, sometimes even working once the binary been run after it has been
> > run with strace, but tends to fail when musl is installed, and then run
> for
> > the first time after being built against musl. (It isn't just these test
> > case binaries which have this issue. I've seen the same issue with the
> test
> > binaries from c_ares.)
> >
> > Does that ring a bell at all? Does it make sense for this to run
> > correctly when run with strace (built against glibc) but have problems
> when
> > run w/o strace?
>
> Can you include that strace, even if it's for a run that worked? It
> might include evidence of the cause. My best guess based on this
> symptom is a buggy Docker version blocking time64 syscalls with EPERM.
>
> > Happy to provide binaries as a well as instructions for a container setup
> > to build musl like I'm doing, which can generate such binaries.
> >
> > Of course this could be a buggy nameserver issue. My test setup uses an
> > OpenWRT dnsmasq name server forwarding to 9.9.9.9, 8.8.8.8, and 1.1.1.1
> >
> > Satadru
> >
> > On Sun, Feb 6, 2022 at 4:30 PM Rich Felker <dalias@aerifal.cx> wrote:
> >
> > > On Sun, Feb 06, 2022 at 03:32:40PM -0500, Satadru Pramanik wrote:
> > > > Hello,
> > > >
> > > > I'm a dev for the Chromebrew linux distribution, and we've been
> trying to
> > > > support older EOL i686 chromebooks stuck on older kernels.
> > > >
> > > > We have noticed that running newer versions of musl on such machines
> > > breaks
> > > > getaddrinfo in musl. This is a problem as we are trying to build
> > > > musl-static builds of curl and git which will work on these older
> > > machines.
> > > > (This is really irritating since these binaries work fine in i686
> > > > containers on newer machines running newer kernels, but then fail
> when
> > > run
> > > > on the hardware which has the older kernels.)
> > >
> > > The problem is almost surely not the kernel or musl but either buggy
> > > nameservers or buggy Docker. Can you post the strace log of the
> > > failing test case? That would quickly determine which it is.
> > >
> > > > I have been attempting to bisect where this happens, and have
> > > > determined that the first commit which breaks DNS resolution on musl
> for
> > > > these older machines is
> > > >
> > >
> https://git.musl-libc.org/cgit/musl/commit/?id=2f2348c9588d61680123bbe438db38acf5dfea4c
> > >
> > > This commit does not change anything in musl, which does not use those
> > > macros at all internally, just the public headers. So the bisect
> > > result must be wrong.
> > >
> > > > (I emailed you since you are the person who made that commit.)
> > >
> > > Please mail the list. I've CC'd it in my reply here.
> > >
> > > > Just applying a reverse of that commit allows musl networking to work
> > > > properly up to the (2020-03-14)
> 2b2c8aafce9d80f9d58652643538f4d58e82b856
> > > > commit. I'm trying newer commits with a
> > > > 2f2348c9588d61680123bbe438db38acf5dfea4c reversal patch to see if
> where
> > > > networking breaks again, as later commits such as the (2020-05-18)
> > > > fd7ec068efd590c0393a612599a4fab9bb0a8633 commit have broken ipv6 DNS
> > > > resolving.
> > > >
> > > > I'd like to be able to get this issue fixed, as we would like to use
> a
> > > > newer version of musl not susceptible to CVE-2020-28928.
> > > >
> > > > The test case I'm using is the code at
> > > > https://www.openwall.com/lists/musl/2021/07/19/1
> > > >
> > > > Our implementation is here:
> > > >
> > >
> https://github.com/skycocker/chromebrew/blob/master/packages/musl_getaddrinfo_test.rb
> > > >
> > > > Once I find other commits which are breaking this, I'd like to be
> able to
> > > > get a patch integrated into musl so that we don't have to maintain a
> > > > patchset for older machines.
> > > >
> > > > Would that be feasible? And my apologies if this should have gone to
> one
> > > of
> > > > the mailing lists first. I figured I should ask you since you might
> have
> > > > the most insight into this breaking installs.
> > > >
> > > > Here is the current reversal patch we are testing with:
> > > >
> > > > --- b/include/arpa/inet.h
> > > > +++ a/include/arpa/inet.h
> > > > @@ -24,6 +24,11 @@
> > > >  in_addr_t inet_lnaof(struct in_addr);
> > > >  in_addr_t inet_netof(struct in_addr);
> > > >
> > > > +#undef INET_ADDRSTRLEN
> > > > +#undef INET6_ADDRSTRLEN
> > > > +#define INET_ADDRSTRLEN  16
> > > > +#define INET6_ADDRSTRLEN 46
> > > > +
> > > >  #ifdef __cplusplus
> > > >  }
> > > >  #endif
> > > > --- b/include/netinet/in.h
> > > > +++ a/include/netinet/in.h
> > > > @@ -60,6 +60,8 @@
> > > >
> > > >  extern const struct in6_addr in6addr_any, in6addr_loopback;
> > > >
> > > > +#undef INET_ADDRSTRLEN
> > > > +#undef INET6_ADDRSTRLEN
> > > >  #define INET_ADDRSTRLEN  16
> > > >  #define INET6_ADDRSTRLEN 46
> > > >
> > > > Thanks,
> > > >
> > > > Satadru Pramanik
> > > > Dev Team
> > > > Chromebrew
> > >
>

[-- Attachment #1.2: Type: text/html, Size: 10278 bytes --]

[-- Attachment #2: musl_getaddrinfo_logs_i686.tgz --]
[-- Type: application/gzip, Size: 3783 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-07  1:29       ` Satadru Pramanik
@ 2022-02-07  2:40         ` Rich Felker
  2022-02-07 19:19           ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-07  2:40 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Sun, Feb 06, 2022 at 08:29:16PM -0500, Satadru Pramanik wrote:
> Here are illustrative logs of output and strace logs.
> 
> Note that while the musl toolchain is built in a container on a much more
> powerful machine, this "musl_getaddrinfo_test" app is built locally on the
> machine with the 3.8 kernel.
> 
> I ran the following to get the output on the smaller i686 machine
> immediately after the app is built.
> Apologies for the ruby code wrapping the shell commands.
> 
>     @musl_ver = `#{CREW_MUSL_PREFIX}/lib/libc.so 2>&1 >/dev/null | head -2
> | tail -1 | awk '{print $2}'`.chomp
>     puts 'Testing the musl resolver to see if it can resolve google.com:
> '.lightblue
>     system "./musl_getaddrinfo_test google.com set_ai_family 2>&1 |tee -a
> /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family.txt "
>     system "./musl_getaddrinfo_test google.com 2>&1 |tee -a
> /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com.txt"
>     system "strace -o
> /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family_STRACE.txt
> ../musl_getaddrinfo_test google.com set_ai_family"
>     system "strace -o
> /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_STRACE.txt
> ../musl_getaddrinfo_test google.com"
> 
> And here is the output for each run before running again via strace. Note
> how IPv6 addresses show up sporadically, and for 1.2.2 nothing at all shows
> up, but everything works fine according to the strace logs. (Strace is
> built against glibc 2.23.)
> 
> ==>
> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com_set_ai_family.txt
> <==
> AF_INET: 142.251.40.110
> 
> ==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com.txt <==
> AF_INET: 142.251.40.110
> 
> ==>
> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com_set_ai_family.txt
> <==
> AF_INET: 142.251.40.142
> 
> ==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com.txt <==
> getaddrinfo: Try again
> 
> ==>
> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com_set_ai_family.txt
> <==
> AF_INET: 142.251.40.206
> 
> ==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com.txt <==
> AF_INET6: 2607:f8b0:4006:81f::200e
> AF_INET: 142.251.40.206
> 
> ==>
> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com_set_ai_family.txt <==
> AF_INET: 142.250.65.206
> 
> ==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com.txt <==
> AF_INET: 142.250.65.206
> 
> ==> musl_1.2.1_getaddrinfo_test_google.com_set_ai_family.txt <==
> AF_INET: 142.251.40.110
> 
> ==> musl_1.2.1_getaddrinfo_test_google.com.txt <==
> getaddrinfo: Try again
> 
> ==> musl_1.2.2_getaddrinfo_test_google.com_set_ai_family.txt <==
> getaddrinfo: Try again
> 
> ==> musl_1.2.2_getaddrinfo_test_google.com.txt <==
> getaddrinfo: Try again
> 
> Regards,

OK, I don't see anything in the strace suggesting a cause. The kernel
version (or whether a container was used) present on the system where
you built musl or the test programs should make no difference
whatsoever; musl has no build dependencies on the host kernel or
kernel headers or anything like that (and doesn't even need to be
built on a Linux host).

A couple questions:

Are the test programs on the i686 machine running under Docker or any
other container environment?

Can you tcpdump the traffic between the test program and the dnsmasq
during a failing run, with verbose display of the packet contents
(-vvv or something like that)?

I don't see any plausible explanation for the result varying between
runs and with timing like this unless dnsmasq is doing something
odd/wrong. I thought it might be related to something blocking time64
syscalls but that doesn't seem to be the case -- according to the
strace logs they're getting ENOSYS as expected with fallback to the
legacy 32-bit clock_gettime etc. which is fine.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-06 23:25   ` Satadru Pramanik
  2022-02-06 23:44     ` Rich Felker
@ 2022-02-07  5:35     ` Markus Wichmann
  1 sibling, 0 replies; 37+ messages in thread
From: Markus Wichmann @ 2022-02-07  5:35 UTC (permalink / raw)
  To: musl

On Sun, Feb 06, 2022 at 06:25:34PM -0500, Satadru Pramanik wrote:
> Does that ring a bell at all? Does it make sense for this to run
> correctly when run with strace (built against glibc) but have problems when
> run w/o strace?

The libc strace is built against should not matter. strace uses a simple
debugger interface to halt the target process at system call boundaries,
and the system call interface does not depend on libc. The only
difference it does make is to the timing, but on a level where it really
should not matter for DNS.

That is, unless docker were to do something weird to the time inside of
the container that somehow is undone by strace running. But I don't know
of any kernel interface that would do that and still leave strace to
function.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-07  2:40         ` Rich Felker
@ 2022-02-07 19:19           ` Satadru Pramanik
  2022-02-07 21:02             ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-07 19:19 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 5647 bytes --]

The test programs are being run from...
glibc 2.23 -> bash (crosh shell)
crosh shell -> invokes ruby -> invokes bash to run the test programs.

tcpdump on the router shows no network activity at all when running
the test program with tcpdump -i any -vvv host (IP address)
When I run the test pogram with strace though I see this:
14:06:24.617860 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.121.46846 > office.lan.53: [udp sum ok] 16051+ A? google.com.
(28)
14:06:24.622352 IP (tos 0x0, ttl 64, id 15884, offset 0, flags [DF], proto
UDP (17), length 72)
    office.lan.53 > 192.168.0.121.46846: [bad udp cksum 0x8210 -> 0x7bc1!]
16051 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
14:06:24.688610 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.121.42267 > office.lan.53: [udp sum ok] 35406+ A? google.com.
(28)
14:06:24.688931 IP (tos 0x0, ttl 64, id 15887, offset 0, flags [DF], proto
UDP (17), length 72)
    office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x8210 -> 0x4209!]
35406 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
14:06:24.689018 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.121.42267 > office.lan.53: [udp sum ok] 13657+ AAAA?
google.com. (28)
14:06:24.689186 IP (tos 0x0, ttl 64, id 15888, offset 0, flags [DF], proto
UDP (17), length 84)
    office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x821c -> 0xc77e!]
13657 q: AAAA? google.com. 1/0/0 google.com. [20s] AAAA
2607:f8b0:4006:80b::200e (56)

On Sun, Feb 6, 2022 at 9:40 PM Rich Felker <dalias@aerifal.cx> wrote:

> On Sun, Feb 06, 2022 at 08:29:16PM -0500, Satadru Pramanik wrote:
> > Here are illustrative logs of output and strace logs.
> >
> > Note that while the musl toolchain is built in a container on a much more
> > powerful machine, this "musl_getaddrinfo_test" app is built locally on
> the
> > machine with the 3.8 kernel.
> >
> > I ran the following to get the output on the smaller i686 machine
> > immediately after the app is built.
> > Apologies for the ruby code wrapping the shell commands.
> >
> >     @musl_ver = `#{CREW_MUSL_PREFIX}/lib/libc.so 2>&1 >/dev/null | head
> -2
> > | tail -1 | awk '{print $2}'`.chomp
> >     puts 'Testing the musl resolver to see if it can resolve google.com:
> > '.lightblue
> >     system "./musl_getaddrinfo_test google.com set_ai_family 2>&1 |tee
> -a
> > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family.txt "
> >     system "./musl_getaddrinfo_test google.com 2>&1 |tee -a
> > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com.txt"
> >     system "strace -o
> >
> /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family_STRACE.txt
> > ../musl_getaddrinfo_test google.com set_ai_family"
> >     system "strace -o
> > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_STRACE.txt
> > ../musl_getaddrinfo_test google.com"
> >
> > And here is the output for each run before running again via strace. Note
> > how IPv6 addresses show up sporadically, and for 1.2.2 nothing at all
> shows
> > up, but everything works fine according to the strace logs. (Strace is
> > built against glibc 2.23.)
> >
> > ==>
> > musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com_set_ai_family.txt
> > <==
> > AF_INET: 142.251.40.110
> >
> > ==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com.txt <==
> > AF_INET: 142.251.40.110
> >
> > ==>
> > musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com_set_ai_family.txt
> > <==
> > AF_INET: 142.251.40.142
> >
> > ==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com.txt <==
> > getaddrinfo: Try again
> >
> > ==>
> > musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com_set_ai_family.txt
> > <==
> > AF_INET: 142.251.40.206
> >
> > ==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com.txt <==
> > AF_INET6: 2607:f8b0:4006:81f::200e
> > AF_INET: 142.251.40.206
> >
> > ==>
> > musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com_set_ai_family.txt
> <==
> > AF_INET: 142.250.65.206
> >
> > ==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com.txt <==
> > AF_INET: 142.250.65.206
> >
> > ==> musl_1.2.1_getaddrinfo_test_google.com_set_ai_family.txt <==
> > AF_INET: 142.251.40.110
> >
> > ==> musl_1.2.1_getaddrinfo_test_google.com.txt <==
> > getaddrinfo: Try again
> >
> > ==> musl_1.2.2_getaddrinfo_test_google.com_set_ai_family.txt <==
> > getaddrinfo: Try again
> >
> > ==> musl_1.2.2_getaddrinfo_test_google.com.txt <==
> > getaddrinfo: Try again
> >
> > Regards,
>
> OK, I don't see anything in the strace suggesting a cause. The kernel
> version (or whether a container was used) present on the system where
> you built musl or the test programs should make no difference
> whatsoever; musl has no build dependencies on the host kernel or
> kernel headers or anything like that (and doesn't even need to be
> built on a Linux host).
>
> A couple questions:
>
> Are the test programs on the i686 machine running under Docker or any
> other container environment?
>
> Can you tcpdump the traffic between the test program and the dnsmasq
> during a failing run, with verbose display of the packet contents
> (-vvv or something like that)?
>
> I don't see any plausible explanation for the result varying between
> runs and with timing like this unless dnsmasq is doing something
> odd/wrong. I thought it might be related to something blocking time64
> syscalls but that doesn't seem to be the case -- according to the
> strace logs they're getting ENOSYS as expected with fallback to the
> legacy 32-bit clock_gettime etc. which is fine.
>

[-- Attachment #2: Type: text/html, Size: 7742 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-07 19:19           ` Satadru Pramanik
@ 2022-02-07 21:02             ` Rich Felker
  2022-02-14 17:24               ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-07 21:02 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Mon, Feb 07, 2022 at 02:19:05PM -0500, Satadru Pramanik wrote:
> The test programs are being run from...
> glibc 2.23 -> bash (crosh shell)
> crosh shell -> invokes ruby -> invokes bash to run the test programs.
> 
> tcpdump on the router shows no network activity at all when running
> the test program with tcpdump -i any -vvv host (IP address)

There's reliably no network traffic when you run the test program not
under strace? Is there any difference in how you're invoking it other
than strace not being there? I'm running out of possible explanations
unless there's some hidden details we don't know about.

> When I run the test pogram with strace though I see this:
> 14:06:24.617860 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
> (17), length 56)
>     192.168.0.121.46846 > office.lan.53: [udp sum ok] 16051+ A? google.com.
> (28)
> 14:06:24.622352 IP (tos 0x0, ttl 64, id 15884, offset 0, flags [DF], proto
> UDP (17), length 72)
>     office.lan.53 > 192.168.0.121.46846: [bad udp cksum 0x8210 -> 0x7bc1!]
> 16051 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
> 14:06:24.688610 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
> (17), length 56)
>     192.168.0.121.42267 > office.lan.53: [udp sum ok] 35406+ A? google.com.
> (28)
> 14:06:24.688931 IP (tos 0x0, ttl 64, id 15887, offset 0, flags [DF], proto
> UDP (17), length 72)
>     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x8210 -> 0x4209!]
> 35406 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
> 14:06:24.689018 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
> (17), length 56)
>     192.168.0.121.42267 > office.lan.53: [udp sum ok] 13657+ AAAA?
> google.com. (28)
> 14:06:24.689186 IP (tos 0x0, ttl 64, id 15888, offset 0, flags [DF], proto
> UDP (17), length 84)
>     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x821c -> 0xc77e!]
> 13657 q: AAAA? google.com. 1/0/0 google.com. [20s] AAAA
> 2607:f8b0:4006:80b::200e (56)
> 
> On Sun, Feb 6, 2022 at 9:40 PM Rich Felker <dalias@aerifal.cx> wrote:
> 
> > On Sun, Feb 06, 2022 at 08:29:16PM -0500, Satadru Pramanik wrote:
> > > Here are illustrative logs of output and strace logs.
> > >
> > > Note that while the musl toolchain is built in a container on a much more
> > > powerful machine, this "musl_getaddrinfo_test" app is built locally on
> > the
> > > machine with the 3.8 kernel.
> > >
> > > I ran the following to get the output on the smaller i686 machine
> > > immediately after the app is built.
> > > Apologies for the ruby code wrapping the shell commands.
> > >
> > >     @musl_ver = `#{CREW_MUSL_PREFIX}/lib/libc.so 2>&1 >/dev/null | head
> > -2
> > > | tail -1 | awk '{print $2}'`.chomp
> > >     puts 'Testing the musl resolver to see if it can resolve google.com:
> > > '.lightblue
> > >     system "./musl_getaddrinfo_test google.com set_ai_family 2>&1 |tee
> > -a
> > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family.txt "
> > >     system "./musl_getaddrinfo_test google.com 2>&1 |tee -a
> > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com.txt"
> > >     system "strace -o
> > >
> > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family_STRACE.txt
> > > ../musl_getaddrinfo_test google.com set_ai_family"
> > >     system "strace -o
> > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_STRACE.txt
> > > ../musl_getaddrinfo_test google.com"
> > >
> > > And here is the output for each run before running again via strace. Note
> > > how IPv6 addresses show up sporadically, and for 1.2.2 nothing at all
> > shows
> > > up, but everything works fine according to the strace logs. (Strace is
> > > built against glibc 2.23.)
> > >
> > > ==>
> > > musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com_set_ai_family.txt
> > > <==
> > > AF_INET: 142.251.40.110
> > >
> > > ==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com.txt <==
> > > AF_INET: 142.251.40.110
> > >
> > > ==>
> > > musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com_set_ai_family.txt
> > > <==
> > > AF_INET: 142.251.40.142
> > >
> > > ==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com.txt <==
> > > getaddrinfo: Try again
> > >
> > > ==>
> > > musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com_set_ai_family.txt
> > > <==
> > > AF_INET: 142.251.40.206
> > >
> > > ==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com.txt <==
> > > AF_INET6: 2607:f8b0:4006:81f::200e
> > > AF_INET: 142.251.40.206
> > >
> > > ==>
> > > musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com_set_ai_family.txt
> > <==
> > > AF_INET: 142.250.65.206
> > >
> > > ==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com.txt <==
> > > AF_INET: 142.250.65.206
> > >
> > > ==> musl_1.2.1_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > AF_INET: 142.251.40.110
> > >
> > > ==> musl_1.2.1_getaddrinfo_test_google.com.txt <==
> > > getaddrinfo: Try again
> > >
> > > ==> musl_1.2.2_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > getaddrinfo: Try again
> > >
> > > ==> musl_1.2.2_getaddrinfo_test_google.com.txt <==
> > > getaddrinfo: Try again
> > >
> > > Regards,
> >
> > OK, I don't see anything in the strace suggesting a cause. The kernel
> > version (or whether a container was used) present on the system where
> > you built musl or the test programs should make no difference
> > whatsoever; musl has no build dependencies on the host kernel or
> > kernel headers or anything like that (and doesn't even need to be
> > built on a Linux host).
> >
> > A couple questions:
> >
> > Are the test programs on the i686 machine running under Docker or any
> > other container environment?
> >
> > Can you tcpdump the traffic between the test program and the dnsmasq
> > during a failing run, with verbose display of the packet contents
> > (-vvv or something like that)?
> >
> > I don't see any plausible explanation for the result varying between
> > runs and with timing like this unless dnsmasq is doing something
> > odd/wrong. I thought it might be related to something blocking time64
> > syscalls but that doesn't seem to be the case -- according to the
> > strace logs they're getting ENOSYS as expected with fallback to the
> > legacy 32-bit clock_gettime etc. which is fine.
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-07 21:02             ` Rich Felker
@ 2022-02-14 17:24               ` Satadru Pramanik
  2022-02-14 18:29                 ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-14 17:24 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl


[-- Attachment #1.1: Type: text/plain, Size: 7479 bytes --]

After multiple rebuilds of various versions, I'm hitting a wall. There is
no traffic when built against newer versions of musl, except when run under
strace. Strace (or ltrace) fixes the problem every single time. The program
also doesn't create any network traffic when run in gdb either.

(Apparently, I'm not the first to discover the "medicinal effects" of
strace though, as per this abomination:
https://github.com/strace/strace/issues/14 )

I've even tried building and running this in a docker container inside a
VirtualBox VM running CentOS 7 so as to get the 3.10 kernel involved, and
that works too!

[image: Screenshot 2022-02-14 at 12.22.33.png]
The screenshot is from a run on the actual hardware, in a crosh window. It
is not in a VM.

Is there anything else I should try?

Best,

Satadru

On Mon, Feb 7, 2022 at 4:02 PM Rich Felker <dalias@aerifal.cx> wrote:

> On Mon, Feb 07, 2022 at 02:19:05PM -0500, Satadru Pramanik wrote:
> > The test programs are being run from...
> > glibc 2.23 -> bash (crosh shell)
> > crosh shell -> invokes ruby -> invokes bash to run the test programs.
> >
> > tcpdump on the router shows no network activity at all when running
> > the test program with tcpdump -i any -vvv host (IP address)
>
> There's reliably no network traffic when you run the test program not
> under strace? Is there any difference in how you're invoking it other
> than strace not being there? I'm running out of possible explanations
> unless there's some hidden details we don't know about.
>
> > When I run the test pogram with strace though I see this:
> > 14:06:24.617860 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> UDP
> > (17), length 56)
> >     192.168.0.121.46846 > office.lan.53: [udp sum ok] 16051+ A?
> google.com.
> > (28)
> > 14:06:24.622352 IP (tos 0x0, ttl 64, id 15884, offset 0, flags [DF],
> proto
> > UDP (17), length 72)
> >     office.lan.53 > 192.168.0.121.46846: [bad udp cksum 0x8210 ->
> 0x7bc1!]
> > 16051 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
> > 14:06:24.688610 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> UDP
> > (17), length 56)
> >     192.168.0.121.42267 > office.lan.53: [udp sum ok] 35406+ A?
> google.com.
> > (28)
> > 14:06:24.688931 IP (tos 0x0, ttl 64, id 15887, offset 0, flags [DF],
> proto
> > UDP (17), length 72)
> >     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x8210 ->
> 0x4209!]
> > 35406 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
> > 14:06:24.689018 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> UDP
> > (17), length 56)
> >     192.168.0.121.42267 > office.lan.53: [udp sum ok] 13657+ AAAA?
> > google.com. (28)
> > 14:06:24.689186 IP (tos 0x0, ttl 64, id 15888, offset 0, flags [DF],
> proto
> > UDP (17), length 84)
> >     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x821c ->
> 0xc77e!]
> > 13657 q: AAAA? google.com. 1/0/0 google.com. [20s] AAAA
> > 2607:f8b0:4006:80b::200e (56)
> >
> > On Sun, Feb 6, 2022 at 9:40 PM Rich Felker <dalias@aerifal.cx> wrote:
> >
> > > On Sun, Feb 06, 2022 at 08:29:16PM -0500, Satadru Pramanik wrote:
> > > > Here are illustrative logs of output and strace logs.
> > > >
> > > > Note that while the musl toolchain is built in a container on a much
> more
> > > > powerful machine, this "musl_getaddrinfo_test" app is built locally
> on
> > > the
> > > > machine with the 3.8 kernel.
> > > >
> > > > I ran the following to get the output on the smaller i686 machine
> > > > immediately after the app is built.
> > > > Apologies for the ruby code wrapping the shell commands.
> > > >
> > > >     @musl_ver = `#{CREW_MUSL_PREFIX}/lib/libc.so 2>&1 >/dev/null |
> head
> > > -2
> > > > | tail -1 | awk '{print $2}'`.chomp
> > > >     puts 'Testing the musl resolver to see if it can resolve
> google.com:
> > > > '.lightblue
> > > >     system "./musl_getaddrinfo_test google.com set_ai_family 2>&1
> |tee
> > > -a
> > > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family.txt
> "
> > > >     system "./musl_getaddrinfo_test google.com 2>&1 |tee -a
> > > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com.txt"
> > > >     system "strace -o
> > > >
> > >
> /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family_STRACE.txt
> > > > ../musl_getaddrinfo_test google.com set_ai_family"
> > > >     system "strace -o
> > > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_STRACE.txt
> > > > ../musl_getaddrinfo_test google.com"
> > > >
> > > > And here is the output for each run before running again via strace.
> Note
> > > > how IPv6 addresses show up sporadically, and for 1.2.2 nothing at all
> > > shows
> > > > up, but everything works fine according to the strace logs. (Strace
> is
> > > > built against glibc 2.23.)
> > > >
> > > > ==>
> > > >
> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com_set_ai_family.txt
> > > > <==
> > > > AF_INET: 142.251.40.110
> > > >
> > > > ==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com.txt <==
> > > > AF_INET: 142.251.40.110
> > > >
> > > > ==>
> > > >
> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com_set_ai_family.txt
> > > > <==
> > > > AF_INET: 142.251.40.142
> > > >
> > > > ==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com.txt <==
> > > > getaddrinfo: Try again
> > > >
> > > > ==>
> > > >
> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com_set_ai_family.txt
> > > > <==
> > > > AF_INET: 142.251.40.206
> > > >
> > > > ==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com.txt <==
> > > > AF_INET6: 2607:f8b0:4006:81f::200e
> > > > AF_INET: 142.251.40.206
> > > >
> > > > ==>
> > > >
> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com_set_ai_family.txt
> > > <==
> > > > AF_INET: 142.250.65.206
> > > >
> > > > ==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com.txt <==
> > > > AF_INET: 142.250.65.206
> > > >
> > > > ==> musl_1.2.1_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > AF_INET: 142.251.40.110
> > > >
> > > > ==> musl_1.2.1_getaddrinfo_test_google.com.txt <==
> > > > getaddrinfo: Try again
> > > >
> > > > ==> musl_1.2.2_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > getaddrinfo: Try again
> > > >
> > > > ==> musl_1.2.2_getaddrinfo_test_google.com.txt <==
> > > > getaddrinfo: Try again
> > > >
> > > > Regards,
> > >
> > > OK, I don't see anything in the strace suggesting a cause. The kernel
> > > version (or whether a container was used) present on the system where
> > > you built musl or the test programs should make no difference
> > > whatsoever; musl has no build dependencies on the host kernel or
> > > kernel headers or anything like that (and doesn't even need to be
> > > built on a Linux host).
> > >
> > > A couple questions:
> > >
> > > Are the test programs on the i686 machine running under Docker or any
> > > other container environment?
> > >
> > > Can you tcpdump the traffic between the test program and the dnsmasq
> > > during a failing run, with verbose display of the packet contents
> > > (-vvv or something like that)?
> > >
> > > I don't see any plausible explanation for the result varying between
> > > runs and with timing like this unless dnsmasq is doing something
> > > odd/wrong. I thought it might be related to something blocking time64
> > > syscalls but that doesn't seem to be the case -- according to the
> > > strace logs they're getting ENOSYS as expected with fallback to the
> > > legacy 32-bit clock_gettime etc. which is fine.
> > >
>

[-- Attachment #1.2: Type: text/html, Size: 11016 bytes --]

[-- Attachment #2: Screenshot 2022-02-14 at 12.22.33.png --]
[-- Type: image/png, Size: 175485 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-14 17:24               ` Satadru Pramanik
@ 2022-02-14 18:29                 ` Rich Felker
  2022-02-14 19:00                   ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-14 18:29 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Mon, Feb 14, 2022 at 12:24:30PM -0500, Satadru Pramanik wrote:
> After multiple rebuilds of various versions, I'm hitting a wall. There is
> no traffic when built against newer versions of musl, except when run under
> strace. Strace (or ltrace) fixes the problem every single time. The program
> also doesn't create any network traffic when run in gdb either.
> 
> (Apparently, I'm not the first to discover the "medicinal effects" of
> strace though, as per this abomination:
> https://github.com/strace/strace/issues/14 )
> 
> I've even tried building and running this in a docker container inside a
> VirtualBox VM running CentOS 7 so as to get the 3.10 kernel involved, and
> that works too!
> 
> [image: Screenshot 2022-02-14 at 12.22.33.png]
> The screenshot is from a run on the actual hardware, in a crosh window. It
> is not in a VM.
> 
> Is there anything else I should try?

Are you sure the "running under strace" has no differences in how the
test program is invoked aside from just using strace? Rather than
running it via the ruby machinery, can you just test under plain
manual execution from the same shell instance for both?

Can you report what Docker version you're using, and try executing
with Docker's seccomp sandboxing disabled? This shouldn't happen, but
it's plausible that your old kernel has bugs where seccomp filtering
gets bypassed when the process is running under strace, thereby
working around a buggy seccomp filter in Docker.

If you know how to use gdb, you could also try setting some
breakpoints to see what code is or isn't reached.


> On Mon, Feb 7, 2022 at 4:02 PM Rich Felker <dalias@aerifal.cx> wrote:
> 
> > On Mon, Feb 07, 2022 at 02:19:05PM -0500, Satadru Pramanik wrote:
> > > The test programs are being run from...
> > > glibc 2.23 -> bash (crosh shell)
> > > crosh shell -> invokes ruby -> invokes bash to run the test programs.
> > >
> > > tcpdump on the router shows no network activity at all when running
> > > the test program with tcpdump -i any -vvv host (IP address)
> >
> > There's reliably no network traffic when you run the test program not
> > under strace? Is there any difference in how you're invoking it other
> > than strace not being there? I'm running out of possible explanations
> > unless there's some hidden details we don't know about.
> >
> > > When I run the test pogram with strace though I see this:
> > > 14:06:24.617860 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > UDP
> > > (17), length 56)
> > >     192.168.0.121.46846 > office.lan.53: [udp sum ok] 16051+ A?
> > google.com.
> > > (28)
> > > 14:06:24.622352 IP (tos 0x0, ttl 64, id 15884, offset 0, flags [DF],
> > proto
> > > UDP (17), length 72)
> > >     office.lan.53 > 192.168.0.121.46846: [bad udp cksum 0x8210 ->
> > 0x7bc1!]
> > > 16051 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
> > > 14:06:24.688610 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > UDP
> > > (17), length 56)
> > >     192.168.0.121.42267 > office.lan.53: [udp sum ok] 35406+ A?
> > google.com.
> > > (28)
> > > 14:06:24.688931 IP (tos 0x0, ttl 64, id 15887, offset 0, flags [DF],
> > proto
> > > UDP (17), length 72)
> > >     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x8210 ->
> > 0x4209!]
> > > 35406 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110 (44)
> > > 14:06:24.689018 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > UDP
> > > (17), length 56)
> > >     192.168.0.121.42267 > office.lan.53: [udp sum ok] 13657+ AAAA?
> > > google.com. (28)
> > > 14:06:24.689186 IP (tos 0x0, ttl 64, id 15888, offset 0, flags [DF],
> > proto
> > > UDP (17), length 84)
> > >     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x821c ->
> > 0xc77e!]
> > > 13657 q: AAAA? google.com. 1/0/0 google.com. [20s] AAAA
> > > 2607:f8b0:4006:80b::200e (56)
> > >
> > > On Sun, Feb 6, 2022 at 9:40 PM Rich Felker <dalias@aerifal.cx> wrote:
> > >
> > > > On Sun, Feb 06, 2022 at 08:29:16PM -0500, Satadru Pramanik wrote:
> > > > > Here are illustrative logs of output and strace logs.
> > > > >
> > > > > Note that while the musl toolchain is built in a container on a much
> > more
> > > > > powerful machine, this "musl_getaddrinfo_test" app is built locally
> > on
> > > > the
> > > > > machine with the 3.8 kernel.
> > > > >
> > > > > I ran the following to get the output on the smaller i686 machine
> > > > > immediately after the app is built.
> > > > > Apologies for the ruby code wrapping the shell commands.
> > > > >
> > > > >     @musl_ver = `#{CREW_MUSL_PREFIX}/lib/libc.so 2>&1 >/dev/null |
> > head
> > > > -2
> > > > > | tail -1 | awk '{print $2}'`.chomp
> > > > >     puts 'Testing the musl resolver to see if it can resolve
> > google.com:
> > > > > '.lightblue
> > > > >     system "./musl_getaddrinfo_test google.com set_ai_family 2>&1
> > |tee
> > > > -a
> > > > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family.txt
> > "
> > > > >     system "./musl_getaddrinfo_test google.com 2>&1 |tee -a
> > > > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com.txt"
> > > > >     system "strace -o
> > > > >
> > > >
> > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family_STRACE.txt
> > > > > ../musl_getaddrinfo_test google.com set_ai_family"
> > > > >     system "strace -o
> > > > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_STRACE.txt
> > > > > ../musl_getaddrinfo_test google.com"
> > > > >
> > > > > And here is the output for each run before running again via strace.
> > Note
> > > > > how IPv6 addresses show up sporadically, and for 1.2.2 nothing at all
> > > > shows
> > > > > up, but everything works fine according to the strace logs. (Strace
> > is
> > > > > built against glibc 2.23.)
> > > > >
> > > > > ==>
> > > > >
> > musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com_set_ai_family.txt
> > > > > <==
> > > > > AF_INET: 142.251.40.110
> > > > >
> > > > > ==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com.txt <==
> > > > > AF_INET: 142.251.40.110
> > > > >
> > > > > ==>
> > > > >
> > musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com_set_ai_family.txt
> > > > > <==
> > > > > AF_INET: 142.251.40.142
> > > > >
> > > > > ==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com.txt <==
> > > > > getaddrinfo: Try again
> > > > >
> > > > > ==>
> > > > >
> > musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com_set_ai_family.txt
> > > > > <==
> > > > > AF_INET: 142.251.40.206
> > > > >
> > > > > ==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com.txt <==
> > > > > AF_INET6: 2607:f8b0:4006:81f::200e
> > > > > AF_INET: 142.251.40.206
> > > > >
> > > > > ==>
> > > > >
> > musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com_set_ai_family.txt
> > > > <==
> > > > > AF_INET: 142.250.65.206
> > > > >
> > > > > ==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com.txt <==
> > > > > AF_INET: 142.250.65.206
> > > > >
> > > > > ==> musl_1.2.1_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > AF_INET: 142.251.40.110
> > > > >
> > > > > ==> musl_1.2.1_getaddrinfo_test_google.com.txt <==
> > > > > getaddrinfo: Try again
> > > > >
> > > > > ==> musl_1.2.2_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > getaddrinfo: Try again
> > > > >
> > > > > ==> musl_1.2.2_getaddrinfo_test_google.com.txt <==
> > > > > getaddrinfo: Try again
> > > > >
> > > > > Regards,
> > > >
> > > > OK, I don't see anything in the strace suggesting a cause. The kernel
> > > > version (or whether a container was used) present on the system where
> > > > you built musl or the test programs should make no difference
> > > > whatsoever; musl has no build dependencies on the host kernel or
> > > > kernel headers or anything like that (and doesn't even need to be
> > > > built on a Linux host).
> > > >
> > > > A couple questions:
> > > >
> > > > Are the test programs on the i686 machine running under Docker or any
> > > > other container environment?
> > > >
> > > > Can you tcpdump the traffic between the test program and the dnsmasq
> > > > during a failing run, with verbose display of the packet contents
> > > > (-vvv or something like that)?
> > > >
> > > > I don't see any plausible explanation for the result varying between
> > > > runs and with timing like this unless dnsmasq is doing something
> > > > odd/wrong. I thought it might be related to something blocking time64
> > > > syscalls but that doesn't seem to be the case -- according to the
> > > > strace logs they're getting ENOSYS as expected with fallback to the
> > > > legacy 32-bit clock_gettime etc. which is fine.
> > > >
> >



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-14 18:29                 ` Rich Felker
@ 2022-02-14 19:00                   ` Satadru Pramanik
  2022-02-14 22:00                     ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-14 19:00 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl


[-- Attachment #1.1: Type: text/plain, Size: 10861 bytes --]

>
>
> Are you sure the "running under strace" has no differences in how the
> test program is invoked aside from just using strace?


I'm running strace & ltrace from the command prompt. For what it is worth,
the process tree is a little weird on ChromeOS:
[image: Screenshot 2022-02-14 at 13.58.03.png]


> Rather than
> running it via the ruby machinery, can you just test under plain
> manual execution from the same shell instance for both?
>
> The previous email showed it running via manual execution from the same
shell instance, invoked without and with ltrace.
i.e. I have the same issue.


> Can you report what Docker version you're using, and try executing
> with Docker's seccomp sandboxing disabled?

docker version
Client: Docker Engine - Community
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.12
 Git commit:        e91ed57
 Built:             Mon Dec 13 11:45:41 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.12
  Git commit:       459d0df
  Built:            Mon Dec 13 11:44:05 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

This is the docker command I used, which may disable the seccomp sandboxing?

#!/bin/bash
docker pull satmandu/crewbuild:alex-i686.m58
docker pull tonistiigi/binfmt
docker run --privileged --rm tonistiigi/binfmt --install all
docker run --security-opt seccomp=unconfined  --platform linux/386
--cap-add SYS_PTRACE --rm -v $(pwd)/pkg_cache:/usr/local/tmp/packages -v
$(pwd):/output -h $(hostname)-i686 -it satmandu/crewbuild:alex-i686.m58
/usr/local/bin/setarch i686 sudo -i -u chronos /usr/local/bin/bash -i

../musl_getaddrinfo_test  google.com
AF_INET: 142.250.80.46
AF_INET6: 2607:f8b0:4006:80b::200e

Commands inside that docker invocation to test this:
crew upgrade ; yes | crew install ltrace
CREW_TESTING_REPO=https://github.com/satmandu/chromebrew.git
CREW_TESTING_BRANCH=musl_testing CREW_TESTING=1 crew update
crew upgrade
cd /usr/local/tmp
yes | crew build -k musl_getaddrinfo_test
cd crew/musl_getaddrinfo_test.*
../musl_getaddrinfo_test  google.com

When I ran that, I got this:
../musl_getaddrinfo_test  google.com
AF_INET: 142.250.80.46
AF_INET6: 2607:f8b0:4006:80b::200e


> This shouldn't happen, but
> it's plausible that your old kernel has bugs where seccomp filtering
> gets bypassed when the process is running under strace, thereby
> working around a buggy seccomp filter in Docker.
>
> Is it possible there are seccomp issues with the old kernel that just
weren't triggered by an older version of musl?


> If you know how to use gdb, you could also try setting some
> breakpoints to see what code is or isn't reached.
>
> This would be ideal. I'm still trying to get gdb built with musl on this
older setup. The gdb I have built against glibc doesn't seem to be
particularly helpful when I'm running it with this program built against
the musl libc.

Is there a static build of i686 gdb built against musl someone has
available which might be helpful here?

Satadru


> > On Mon, Feb 7, 2022 at 4:02 PM Rich Felker <dalias@aerifal.cx> wrote:
> >
> > > On Mon, Feb 07, 2022 at 02:19:05PM -0500, Satadru Pramanik wrote:
> > > > The test programs are being run from...
> > > > glibc 2.23 -> bash (crosh shell)
> > > > crosh shell -> invokes ruby -> invokes bash to run the test programs.
> > > >
> > > > tcpdump on the router shows no network activity at all when running
> > > > the test program with tcpdump -i any -vvv host (IP address)
> > >
> > > There's reliably no network traffic when you run the test program not
> > > under strace? Is there any difference in how you're invoking it other
> > > than strace not being there? I'm running out of possible explanations
> > > unless there's some hidden details we don't know about.
> > >
> > > > When I run the test pogram with strace though I see this:
> > > > 14:06:24.617860 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto
> > > UDP
> > > > (17), length 56)
> > > >     192.168.0.121.46846 > office.lan.53: [udp sum ok] 16051+ A?
> > > google.com.
> > > > (28)
> > > > 14:06:24.622352 IP (tos 0x0, ttl 64, id 15884, offset 0, flags [DF],
> > > proto
> > > > UDP (17), length 72)
> > > >     office.lan.53 > 192.168.0.121.46846: [bad udp cksum 0x8210 ->
> > > 0x7bc1!]
> > > > 16051 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110
> (44)
> > > > 14:06:24.688610 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto
> > > UDP
> > > > (17), length 56)
> > > >     192.168.0.121.42267 > office.lan.53: [udp sum ok] 35406+ A?
> > > google.com.
> > > > (28)
> > > > 14:06:24.688931 IP (tos 0x0, ttl 64, id 15887, offset 0, flags [DF],
> > > proto
> > > > UDP (17), length 72)
> > > >     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x8210 ->
> > > 0x4209!]
> > > > 35406 q: A? google.com. 1/0/0 google.com. [1m32s] A 142.251.40.110
> (44)
> > > > 14:06:24.689018 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto
> > > UDP
> > > > (17), length 56)
> > > >     192.168.0.121.42267 > office.lan.53: [udp sum ok] 13657+ AAAA?
> > > > google.com. (28)
> > > > 14:06:24.689186 IP (tos 0x0, ttl 64, id 15888, offset 0, flags [DF],
> > > proto
> > > > UDP (17), length 84)
> > > >     office.lan.53 > 192.168.0.121.42267: [bad udp cksum 0x821c ->
> > > 0xc77e!]
> > > > 13657 q: AAAA? google.com. 1/0/0 google.com. [20s] AAAA
> > > > 2607:f8b0:4006:80b::200e (56)
> > > >
> > > > On Sun, Feb 6, 2022 at 9:40 PM Rich Felker <dalias@aerifal.cx>
> wrote:
> > > >
> > > > > On Sun, Feb 06, 2022 at 08:29:16PM -0500, Satadru Pramanik wrote:
> > > > > > Here are illustrative logs of output and strace logs.
> > > > > >
> > > > > > Note that while the musl toolchain is built in a container on a
> much
> > > more
> > > > > > powerful machine, this "musl_getaddrinfo_test" app is built
> locally
> > > on
> > > > > the
> > > > > > machine with the 3.8 kernel.
> > > > > >
> > > > > > I ran the following to get the output on the smaller i686 machine
> > > > > > immediately after the app is built.
> > > > > > Apologies for the ruby code wrapping the shell commands.
> > > > > >
> > > > > >     @musl_ver = `#{CREW_MUSL_PREFIX}/lib/libc.so 2>&1 >/dev/null
> |
> > > head
> > > > > -2
> > > > > > | tail -1 | awk '{print $2}'`.chomp
> > > > > >     puts 'Testing the musl resolver to see if it can resolve
> > > google.com:
> > > > > > '.lightblue
> > > > > >     system "./musl_getaddrinfo_test google.com set_ai_family
> 2>&1
> > > |tee
> > > > > -a
> > > > > >
> /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family.txt
> > > "
> > > > > >     system "./musl_getaddrinfo_test google.com 2>&1 |tee -a
> > > > > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com.txt"
> > > > > >     system "strace -o
> > > > > >
> > > > >
> > >
> /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_set_ai_family_STRACE.txt
> > > > > > ../musl_getaddrinfo_test google.com set_ai_family"
> > > > > >     system "strace -o
> > > > > > /tmp/musl_#{@musl_ver}_getaddrinfo_test_google.com_STRACE.txt
> > > > > > ../musl_getaddrinfo_test google.com"
> > > > > >
> > > > > > And here is the output for each run before running again via
> strace.
> > > Note
> > > > > > how IPv6 addresses show up sporadically, and for 1.2.2 nothing
> at all
> > > > > shows
> > > > > > up, but everything works fine according to the strace logs.
> (Strace
> > > is
> > > > > > built against glibc 2.23.)
> > > > > >
> > > > > > ==>
> > > > > >
> > >
> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com_set_ai_family.txt
> > > > > > <==
> > > > > > AF_INET: 142.251.40.110
> > > > > >
> > > > > > ==> musl_1.2.0-git-17-g33338ebc_getaddrinfo_test_google.com.txt
> <==
> > > > > > AF_INET: 142.251.40.110
> > > > > >
> > > > > > ==>
> > > > > >
> > >
> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com_set_ai_family.txt
> > > > > > <==
> > > > > > AF_INET: 142.251.40.142
> > > > > >
> > > > > > ==> musl_1.2.0-git-39-g5cf1ac24_getaddrinfo_test_google.com.txt
> <==
> > > > > > getaddrinfo: Try again
> > > > > >
> > > > > > ==>
> > > > > >
> > >
> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com_set_ai_family.txt
> > > > > > <==
> > > > > > AF_INET: 142.251.40.206
> > > > > >
> > > > > > ==> musl_1.2.0-git-40-g1b4e84c5_getaddrinfo_test_google.com.txt
> <==
> > > > > > AF_INET6: 2607:f8b0:4006:81f::200e
> > > > > > AF_INET: 142.251.40.206
> > > > > >
> > > > > > ==>
> > > > > >
> > >
> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com_set_ai_family.txt
> > > > > <==
> > > > > > AF_INET: 142.250.65.206
> > > > > >
> > > > > > ==> musl_1.2.0-git-6-g2f2348c9_getaddrinfo_test_google.com.txt
> <==
> > > > > > AF_INET: 142.250.65.206
> > > > > >
> > > > > > ==> musl_1.2.1_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > > AF_INET: 142.251.40.110
> > > > > >
> > > > > > ==> musl_1.2.1_getaddrinfo_test_google.com.txt <==
> > > > > > getaddrinfo: Try again
> > > > > >
> > > > > > ==> musl_1.2.2_getaddrinfo_test_google.com_set_ai_family.txt <==
> > > > > > getaddrinfo: Try again
> > > > > >
> > > > > > ==> musl_1.2.2_getaddrinfo_test_google.com.txt <==
> > > > > > getaddrinfo: Try again
> > > > > >
> > > > > > Regards,
> > > > >
> > > > > OK, I don't see anything in the strace suggesting a cause. The
> kernel
> > > > > version (or whether a container was used) present on the system
> where
> > > > > you built musl or the test programs should make no difference
> > > > > whatsoever; musl has no build dependencies on the host kernel or
> > > > > kernel headers or anything like that (and doesn't even need to be
> > > > > built on a Linux host).
> > > > >
> > > > > A couple questions:
> > > > >
> > > > > Are the test programs on the i686 machine running under Docker or
> any
> > > > > other container environment?
> > > > >
> > > > > Can you tcpdump the traffic between the test program and the
> dnsmasq
> > > > > during a failing run, with verbose display of the packet contents
> > > > > (-vvv or something like that)?
> > > > >
> > > > > I don't see any plausible explanation for the result varying
> between
> > > > > runs and with timing like this unless dnsmasq is doing something
> > > > > odd/wrong. I thought it might be related to something blocking
> time64
> > > > > syscalls but that doesn't seem to be the case -- according to the
> > > > > strace logs they're getting ENOSYS as expected with fallback to the
> > > > > legacy 32-bit clock_gettime etc. which is fine.
> > > > >
> > >
>
>
>

[-- Attachment #1.2: Type: text/html, Size: 16813 bytes --]

[-- Attachment #2: Screenshot 2022-02-14 at 13.58.03.png --]
[-- Type: image/png, Size: 94083 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-14 19:00                   ` Satadru Pramanik
@ 2022-02-14 22:00                     ` Rich Felker
  2022-02-15 16:59                       ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-14 22:00 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

Note: you dropped the list from CC in your last reply; I've re-added
it and bounced your message to the list.

On Mon, Feb 14, 2022 at 02:00:24PM -0500, Satadru Pramanik wrote:
> >
> >
> > Are you sure the "running under strace" has no differences in how the
> > test program is invoked aside from just using strace?
> 
> 
> I'm running strace & ltrace from the command prompt. For what it is worth,
> the process tree is a little weird on ChromeOS:
> [image: Screenshot 2022-02-14 at 13.58.03.png]
> 
> 
> > Rather than
> > running it via the ruby machinery, can you just test under plain
> > manual execution from the same shell instance for both?
> >
> The previous email showed it running via manual execution from the same
> shell instance, invoked without and with ltrace.
> i.e. I have the same issue.
> 
> 
> > Can you report what Docker version you're using, and try executing
> > with Docker's seccomp sandboxing disabled?
> 
> docker version
> Client: Docker Engine - Community
>  Version:           20.10.12
>  API version:       1.41
>  Go version:        go1.16.12
>  Git commit:        e91ed57
>  Built:             Mon Dec 13 11:45:41 2021
>  OS/Arch:           linux/amd64
>  Context:           default
>  Experimental:      true
> 
> Server: Docker Engine - Community
>  Engine:
>   Version:          20.10.12
>   API version:      1.41 (minimum version 1.12)
>   Go version:       go1.16.12
>   Git commit:       459d0df
>   Built:            Mon Dec 13 11:44:05 2021
>   OS/Arch:          linux/amd64
>   Experimental:     false
>  containerd:
>   Version:          1.4.12
>   GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
>  runc:
>   Version:          1.0.2
>   GitCommit:        v1.0.2-0-g52b36a2
>  docker-init:
>   Version:          0.19.0
>   GitCommit:        de40ad0
> 
> This is the docker command I used, which may disable the seccomp sandboxing?
> 
> #!/bin/bash
> docker pull satmandu/crewbuild:alex-i686.m58
> docker pull tonistiigi/binfmt
> docker run --privileged --rm tonistiigi/binfmt --install all
> docker run --security-opt seccomp=unconfined  --platform linux/386
> --cap-add SYS_PTRACE --rm -v $(pwd)/pkg_cache:/usr/local/tmp/packages -v
> $(pwd):/output -h $(hostname)-i686 -it satmandu/crewbuild:alex-i686.m58
> /usr/local/bin/setarch i686 sudo -i -u chronos /usr/local/bin/bash -i
> 
> .../musl_getaddrinfo_test  google.com
> AF_INET: 142.250.80.46
> AF_INET6: 2607:f8b0:4006:80b::200e
> 
> Commands inside that docker invocation to test this:
> crew upgrade ; yes | crew install ltrace
> CREW_TESTING_REPO=https://github.com/satmandu/chromebrew.git
> CREW_TESTING_BRANCH=musl_testing CREW_TESTING=1 crew update
> crew upgrade
> cd /usr/local/tmp
> yes | crew build -k musl_getaddrinfo_test
> cd crew/musl_getaddrinfo_test.*
> .../musl_getaddrinfo_test  google.com
> 
> When I ran that, I got this:
> .../musl_getaddrinfo_test  google.com
> AF_INET: 142.250.80.46
> AF_INET6: 2607:f8b0:4006:80b::200e

Are you saying it works (resolves correctly) with the above command
line? If so that's indicative of buggy Docker seccomp.

> > This shouldn't happen, but
> > it's plausible that your old kernel has bugs where seccomp filtering
> > gets bypassed when the process is running under strace, thereby
> > working around a buggy seccomp filter in Docker.
> >
> Is it possible there are seccomp issues with the old kernel that just
> weren't triggered by an older version of musl?

I'm pretty sure it's not a kernel bug causing your problem. It's
possibly a Docker bug that should make it fail both with and without
strace, but that strange kernel behavior (differences when tracing vs
not tracing) made the problem go away as soon as you used strace.

> > If you know how to use gdb, you could also try setting some
> > breakpoints to see what code is or isn't reached.
> >
> This would be ideal. I'm still trying to get gdb built with musl on this
> older setup. The gdb I have built against glibc doesn't seem to be
> particularly helpful when I'm running it with this program built against
> the musl libc.
> 
> Is there a static build of i686 gdb built against musl someone has
> available which might be helpful here?

gdb should not need to be built against musl to debug programs built
against musl. Maybe you're just missing debug info?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-14 22:00                     ` Rich Felker
@ 2022-02-15 16:59                       ` Satadru Pramanik
  2022-02-15 17:44                         ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-15 16:59 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 1465 bytes --]

On Mon, Feb 14, 2022 at 5:00 PM Rich Felker <dalias@aerifal.cx> wrote:

> Note: you dropped the list from CC in your last reply; I've re-added
> it and bounced your message to the list.
>
> Mea culpa!


>
> > When I ran that, I got this:
> > .../musl_getaddrinfo_test  google.com
> > AF_INET: 142.250.80.46
> > AF_INET6: 2607:f8b0:4006:80b::200e
>
> Are you saying it works (resolves correctly) with the above command
> line? If so that's indicative of buggy Docker seccomp.
>
> I am saying it resolves correctly with the above command line.

Suggestions on what command line I should be using?


>
> > Is there a static build of i686 gdb built against musl someone has
> > available which might be helpful here?
>
> gdb should not need to be built against musl to debug programs built
> against musl. Maybe you're just missing debug info?
>
> This is entirely possible.
I'm building a musl toolchain with this:
https://git.zv.io/toolchains/musl-cross-make.git (which is itself a variant
of your https://github.com/richfelker/musl-cross-make )
My config.mak is generated inside here:
https://github.com/satmandu/chromebrew/blob/musl_testing/packages/musl_native_toolchain.rb

Is there a way to get the debug info for gdb saved into the musl libc, or
should I open an issue just for that on
https://github.com/richfelker/musl-cross-make ?

When I package the musl toolchain, I don't see a debug or .debug folder
being kept in the $(OUTPUT) folder.

Thanks,

Satadru

[-- Attachment #2: Type: text/html, Size: 2693 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-15 16:59                       ` Satadru Pramanik
@ 2022-02-15 17:44                         ` Rich Felker
  2022-02-15 22:56                           ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-15 17:44 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Tue, Feb 15, 2022 at 11:59:52AM -0500, Satadru Pramanik wrote:
> On Mon, Feb 14, 2022 at 5:00 PM Rich Felker <dalias@aerifal.cx> wrote:
> 
> > Note: you dropped the list from CC in your last reply; I've re-added
> > it and bounced your message to the list.
> >
> > Mea culpa!
> 
> 
> >
> > > When I ran that, I got this:
> > > .../musl_getaddrinfo_test  google.com
> > > AF_INET: 142.250.80.46
> > > AF_INET6: 2607:f8b0:4006:80b::200e
> >
> > Are you saying it works (resolves correctly) with the above command
> > line? If so that's indicative of buggy Docker seccomp.
> >
> I am saying it resolves correctly with the above command line.
> 
> Suggestions on what command line I should be using?

OK, then in that case it's surely Docker's seccomp filters that are
the problem. I think --security-opt seccomp=unconfined is the part you
need to work around it.

Alternatively, I'm pretty sure this is fixed in latest Docker (note:
this includes latest for its runtime components like runc, libseccomp,
etc.) so upgrading Docker might be an option too that would let you
keep the seccomp protections in place.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-15 17:44                         ` Rich Felker
@ 2022-02-15 22:56                           ` Satadru Pramanik
  2022-02-16  1:41                             ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-15 22:56 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1.1: Type: text/plain, Size: 1465 bytes --]

>
>
>
> OK, then in that case it's surely Docker's seccomp filters that are
> the problem. I think --security-opt seccomp=unconfined is the part you
> need to work around it.

 That's the command line I was using, which leads to the application NOT
breaking, and thus doesn't allow me to replicate the problem:
 docker run --security-opt seccomp=unconfined  --platform linux/386
--cap-add SYS_PTRACE --rm -v $(pwd)/pkg_cache:/usr/local/tmp/packages -v
$(pwd):/output -h $(hostname)-i686 -it satmandu/crewbuild:alex-i686.m58
/usr/local/bin/setarch i686 sudo -i -u chronos /usr/local/bin/bash -i

The goal with docker was to try to replicate the breakage on the actual
hardware, which is the place we are having this problem.

I ran the process through gdb on the hardware, and stepped through it with
the timeit function from here: https://stackoverflow.com/a/48412363

Of note perhaps is the very long time it takes for some of these calls to
return in gdb? (The program does run in gdb when stepping through the
function, but not when run without the break point)
my commands were in essence the following in gdb:
add symbol table from file "/usr/local/share/musl/lib/libc.so"
break main
run google.com 2>>gdb.out.txt
ti (repeated until the program exited)
(I ran this twice, and both runs succeed with long delays)
Then I ran (this, which fails):
clear main
run google.com 2>>gdb.out.txt

Any other suggestions on how to track down this issue?

Regards,

Satadru

[-- Attachment #1.2: Type: text/html, Size: 2096 bytes --]

[-- Attachment #2: gdb.out.txt --]
[-- Type: text/plain, Size: 5489 bytes --]

Reading symbols from ./musl_getaddrinfo_test...
add symbol table from file "/usr/local/share/musl/lib/libc.so"
Reading symbols from /usr/local/share/musl/lib/libc.so...
Breakpoint 1 at 0x7be: file musl_getaddrinfo_test.c, line 15.
Starting program: /usr/local/tmp/crew/musl_getaddrinfo_test.20220215221133.dir/musl_getaddrinfo_test google.com 2>> gdb.out
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

Breakpoint 1, main (argc=2, argv=0x7fffecb4) at musl_getaddrinfo_test.c:15
warning: Source file is more recent than executable.
15	        memset(&hints, 0, sizeof(hints));
16	        if (argc > 2)
Call took 0.0342 ms
18	        hints.ai_socktype = SOCK_DGRAM;
Call took 0.0074 ms
20	        r = getaddrinfo(argv[1], NULL, &hints, &result);
Call took 0.0050 ms
0x77f66414 in __kernel_vsyscall ()
Call took 0.4002 ms
Single stepping until exit from function __kernel_vsyscall,
which has no line number information.
main (argc=2, argv=0x7fffecb4) at musl_getaddrinfo_test.c:21
21	        if (r != 0) {
Call took 11.8751 ms
26	        for (rp = result; rp != NULL; rp = rp->ai_next) {
Call took 0.0070 ms
27	            switch (rp->ai_family) {
Call took 0.0073 ms
35	*)rp->ai_addr)->sin6_addr, s, INET6_ADDRSTRLEN);
Call took 0.0085 ms
34	                inet_ntop(AF_INET6, &((struct sockaddr_in6
Call took 0.0057 ms
36	                printf("AF_INET6: %s\n", s);
Call took 3.1300 ms
0x77f66414 in __kernel_vsyscall ()
Call took 0.6778 ms
Single stepping until exit from function __kernel_vsyscall,
which has no line number information.
main (argc=2, argv=0x7fffecb4) at musl_getaddrinfo_test.c:37
37	                break;
Call took 0.1246 ms
26	        for (rp = result; rp != NULL; rp = rp->ai_next) {
Call took 0.0046 ms
27	            switch (rp->ai_family) {
Call took 0.0076 ms
30	*)rp->ai_addr)->sin_addr, s, INET6_ADDRSTRLEN);
Call took 0.0049 ms
29	                inet_ntop(AF_INET, &((struct sockaddr_in
Call took 0.0054 ms
31	                printf("AF_INET: %s\n", s);
Call took 1.3367 ms
0x77f66414 in __kernel_vsyscall ()
Call took 0.6217 ms
Single stepping until exit from function __kernel_vsyscall,
which has no line number information.
main (argc=2, argv=0x7fffecb4) at musl_getaddrinfo_test.c:32
32	                break;
Call took 0.1150 ms
26	        for (rp = result; rp != NULL; rp = rp->ai_next) {
Call took 0.0031 ms
41	        freeaddrinfo(result);
Call took 0.0067 ms
42	    }
Call took 0.2755 ms
0x555556e0 in __do_global_dtors_aux ()
Call took 0.3166 ms
Single stepping until exit from function __do_global_dtors_aux,
which has no line number information.
[Inferior 1 (process 23325) exited normally]
Call took 0.0374 ms
Starting program: /usr/local/tmp/crew/musl_getaddrinfo_test.20220215221133.dir/musl_getaddrinfo_test google.com 2>> gdb.out
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

Breakpoint 1, main (argc=2, argv=0x7fffecb4) at musl_getaddrinfo_test.c:15
15	        memset(&hints, 0, sizeof(hints));
16	        if (argc > 2)
Call took 0.0387 ms
18	        hints.ai_socktype = SOCK_DGRAM;
Call took 0.0063 ms
20	        r = getaddrinfo(argv[1], NULL, &hints, &result);
Call took 0.0051 ms
0x77f66414 in __kernel_vsyscall ()
Call took 0.4215 ms
Single stepping until exit from function __kernel_vsyscall,
which has no line number information.
main (argc=2, argv=0x7fffecb4) at musl_getaddrinfo_test.c:21
21	        if (r != 0) {
Call took 12.0090 ms
26	        for (rp = result; rp != NULL; rp = rp->ai_next) {
Call took 0.0070 ms
27	            switch (rp->ai_family) {
Call took 0.0112 ms
35	*)rp->ai_addr)->sin6_addr, s, INET6_ADDRSTRLEN);
Call took 0.0071 ms
34	                inet_ntop(AF_INET6, &((struct sockaddr_in6
Call took 0.0075 ms
36	                printf("AF_INET6: %s\n", s);
Call took 3.2155 ms
0x77f66414 in __kernel_vsyscall ()
Call took 0.6847 ms
Single stepping until exit from function __kernel_vsyscall,
which has no line number information.
main (argc=2, argv=0x7fffecb4) at musl_getaddrinfo_test.c:37
37	                break;
Call took 0.1380 ms
26	        for (rp = result; rp != NULL; rp = rp->ai_next) {
Call took 0.0055 ms
27	            switch (rp->ai_family) {
Call took 0.0109 ms
30	*)rp->ai_addr)->sin_addr, s, INET6_ADDRSTRLEN);
Call took 0.0115 ms
29	                inet_ntop(AF_INET, &((struct sockaddr_in
Call took 0.0078 ms
31	                printf("AF_INET: %s\n", s);
Call took 1.3403 ms
0x77f66414 in __kernel_vsyscall ()
Call took 0.6378 ms
Single stepping until exit from function __kernel_vsyscall,
which has no line number information.
main (argc=2, argv=0x7fffecb4) at musl_getaddrinfo_test.c:32
32	                break;
Call took 0.0897 ms
26	        for (rp = result; rp != NULL; rp = rp->ai_next) {
Call took 0.0047 ms
41	        freeaddrinfo(result);
Call took 0.0085 ms
42	    }
Call took 0.2994 ms
0x555556e0 in __do_global_dtors_aux ()
Call took 0.3179 ms
Single stepping until exit from function __do_global_dtors_aux,
which has no line number information.
[Inferior 1 (process 23422) exited normally]
Call took 0.0414 ms
Deleted breakpoint 1 
Starting program: /usr/local/tmp/crew/musl_getaddrinfo_test.20220215221133.dir/musl_getaddrinfo_test google.com 2>> gdb.out
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
getaddrinfo: Try again
[Inferior 1 (process 23511) exited with code 01]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-15 22:56                           ` Satadru Pramanik
@ 2022-02-16  1:41                             ` Rich Felker
  2022-02-16 18:37                               ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-16  1:41 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Tue, Feb 15, 2022 at 05:56:53PM -0500, Satadru Pramanik wrote:
> >
> >
> >
> > OK, then in that case it's surely Docker's seccomp filters that are
> > the problem. I think --security-opt seccomp=unconfined is the part you
> > need to work around it.
> 
>  That's the command line I was using, which leads to the application NOT
> breaking, and thus doesn't allow me to replicate the problem:
>  docker run --security-opt seccomp=unconfined  --platform linux/386
> --cap-add SYS_PTRACE --rm -v $(pwd)/pkg_cache:/usr/local/tmp/packages -v
> $(pwd):/output -h $(hostname)-i686 -it satmandu/crewbuild:alex-i686.m58
> /usr/local/bin/setarch i686 sudo -i -u chronos /usr/local/bin/bash -i
> 
> The goal with docker was to try to replicate the breakage on the actual
> hardware, which is the place we are having this problem.

OK, you haven't been clear about where the problem actually happens
from the beginning. I was under the impression all along that the
problem happened only in a Docker environment. Before we continue, can
you please clarify the exact environment the problem happens in
including:

- Whether any network traffic occurs when it fails (in the real
  environment not a replicated one elsewhere).

- Whether it fails or succeeds under strace (in the real
  environment not a replicated one elsewhere).

- Whether the real environment involves Docker or not.

- What's in resolv.conf (in the real environment not a replicated one
  elsewhere) and what nameserver software (if known) is running on the
  nameserver(s) listed in there.

- Anything else that might be relevant.

It's really hard to offer any productive advice when the problem is
unclear.

> I ran the process through gdb on the hardware, and stepped through it with
> the timeit function from here: https://stackoverflow.com/a/48412363
> 
> Of note perhaps is the very long time it takes for some of these calls to
> return in gdb? (The program does run in gdb when stepping through the
> function, but not when run without the break point)
> my commands were in essence the following in gdb:
> add symbol table from file "/usr/local/share/musl/lib/libc.so"
> break main
> run google.com 2>>gdb.out.txt
> ti (repeated until the program exited)
> (I ran this twice, and both runs succeed with long delays)
> Then I ran (this, which fails):
> clear main
> run google.com 2>>gdb.out.txt
> 
> Any other suggestions on how to track down this issue?

Rather than stepping through, I would put a single breakpoint at a
place you want to see whether execution reaches before running the
test program, then start it and see if the breakpoint fires or not.
Then remove the breakpoint, add a different one, and repeat. For
example, see if __res_msend is ever called, and if so, whether
particular lines of it are reached (or just put breakpoints on some of
the functions it calls, like socket, bind, recvfrom, poll, etc. to see
if they're called).

It might also be useful to put a breakpoint on clock_gettime and then
'finish' to see what it returns (in case the problem is something
time64-related).


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-16  1:41                             ` Rich Felker
@ 2022-02-16 18:37                               ` Satadru Pramanik
  2022-02-16 18:44                                 ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-16 18:37 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl


[-- Attachment #1.1: Type: text/plain, Size: 3007 bytes --]

>
>
>
> - Whether any network traffic occurs when it fails (in the real
>   environment not a replicated one elsewhere).
>
>
There is no network traffic in the real environment.


> - Whether it fails or succeeds under strace (in the real
>   environment not a replicated one elsewhere).
>
> It succeeds in strace (in the real environment)



> - Whether the real environment involves Docker or not.
>
> The real environment does not involve docker.



> - What's in resolv.conf (in the real environment not a replicated one
>   elsewhere) and what nameserver software (if known) is running on the
>   nameserver(s) listed in there.
>
> The nameserver is picked up from dhcp. The contents of the file are as
follows:
nameserver 192.168.0.1
search lan.
options single-request timeout:1 attempts:5


> - Anything else that might be relevant.
>
> DNS server is dnsmasq running on a current OpenWRT device.


> It's really hard to offer any productive advice when the problem is
> unclear.
>
> Apologies for the confusion.
I'm really just trying to debug this getaddrinfo breakage on this older
hardware. The docker containers setup is something we use to build packages
for this hardware, and our frustration is that the software works perfectly
fine in the docker containers, but not on the hardware.

> Any other suggestions on how to track down this issue?
>
> Rather than stepping through, I would put a single breakpoint at a
> place you want to see whether execution reaches before running the
> test program, then start it and see if the breakpoint fires or not.
> Then remove the breakpoint, add a different one, and repeat. For
> example, see if __res_msend is ever called, and if so, whether
> particular lines of it are reached (or just put breakpoints on some of
> the functions it calls, like socket, bind, recvfrom, poll, etc. to see
> if they're called).
>
> It might also be useful to put a breakpoint on clock_gettime and then
> 'finish' to see what it returns (in case the problem is something
> time64-related).
>
>
The only breakpoint which fixed the execution was for line 20 (which
invokes getaddrinfo). Stepping through the __kernel_vsyscall and then
continuing is the only way it does not result in failure.

Any later breakpoints fail.

I went though the other breakpoints as requested.
clock_gettime did not fire.

Breakpoint 1 at 0x5c2f7: file ../src_musl/compat/time32/clock_gettime32.c,
line 9.
__res_msend, setsockopt also did not fire.
The ones that did fire were: socket, bind, recvfrom, poll, __res_msend_rc,
memset, sendto, __get_resolv_conf, pthread_setcancelstate,
__pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy

When breaking on socket, stepping through the __kernel_vsyscall call after
socket and then continuing succeeds.

Is it possible that the socket is not waiting long enough for a response
from __kernel_vsyscall? Has that changed?
Breaking, stepping, and continuing on every other function above fails.

The gdb log is attached.

Regards,

Satadru

[-- Attachment #1.2: Type: text/html, Size: 4651 bytes --]

[-- Attachment #2: gdb.out.txt.gz --]
[-- Type: application/x-gzip, Size: 11679 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-16 18:37                               ` Satadru Pramanik
@ 2022-02-16 18:44                                 ` Satadru Pramanik
  2022-02-16 21:33                                   ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-16 18:44 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 3459 bytes --]

The only change to socket.c I'm seeing is use __socketcall to simplify
socket()
<https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001>,
so maybe it would make sense for me to try building with that reversed?

satadru

On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik <satadru@gmail.com> wrote:

>
>>
>> - Whether any network traffic occurs when it fails (in the real
>>   environment not a replicated one elsewhere).
>>
>>
> There is no network traffic in the real environment.
>
>
>> - Whether it fails or succeeds under strace (in the real
>>   environment not a replicated one elsewhere).
>>
>> It succeeds in strace (in the real environment)
>
>
>
>> - Whether the real environment involves Docker or not.
>>
>> The real environment does not involve docker.
>
>
>
>> - What's in resolv.conf (in the real environment not a replicated one
>>   elsewhere) and what nameserver software (if known) is running on the
>>   nameserver(s) listed in there.
>>
>> The nameserver is picked up from dhcp. The contents of the file are as
> follows:
> nameserver 192.168.0.1
> search lan.
> options single-request timeout:1 attempts:5
>
>
>> - Anything else that might be relevant.
>>
>> DNS server is dnsmasq running on a current OpenWRT device.
>
>
>> It's really hard to offer any productive advice when the problem is
>> unclear.
>>
>> Apologies for the confusion.
> I'm really just trying to debug this getaddrinfo breakage on this older
> hardware. The docker containers setup is something we use to build packages
> for this hardware, and our frustration is that the software works perfectly
> fine in the docker containers, but not on the hardware.
>
> > Any other suggestions on how to track down this issue?
>>
>> Rather than stepping through, I would put a single breakpoint at a
>> place you want to see whether execution reaches before running the
>> test program, then start it and see if the breakpoint fires or not.
>> Then remove the breakpoint, add a different one, and repeat. For
>> example, see if __res_msend is ever called, and if so, whether
>> particular lines of it are reached (or just put breakpoints on some of
>> the functions it calls, like socket, bind, recvfrom, poll, etc. to see
>> if they're called).
>>
>> It might also be useful to put a breakpoint on clock_gettime and then
>> 'finish' to see what it returns (in case the problem is something
>> time64-related).
>>
>>
> The only breakpoint which fixed the execution was for line 20 (which
> invokes getaddrinfo). Stepping through the __kernel_vsyscall and then
> continuing is the only way it does not result in failure.
>
> Any later breakpoints fail.
>
> I went though the other breakpoints as requested.
> clock_gettime did not fire.
>
> Breakpoint 1 at 0x5c2f7: file ../src_musl/compat/time32/clock_gettime32.c,
> line 9.
> __res_msend, setsockopt also did not fire.
> The ones that did fire were: socket, bind, recvfrom, poll, __res_msend_rc,
> memset, sendto, __get_resolv_conf, pthread_setcancelstate,
> __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
>
> When breaking on socket, stepping through the __kernel_vsyscall call after
> socket and then continuing succeeds.
>
> Is it possible that the socket is not waiting long enough for a response
> from __kernel_vsyscall? Has that changed?
> Breaking, stepping, and continuing on every other function above fails.
>
> The gdb log is attached.
>
> Regards,
>
> Satadru
>
>

[-- Attachment #2: Type: text/html, Size: 5299 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-16 18:44                                 ` Satadru Pramanik
@ 2022-02-16 21:33                                   ` Rich Felker
  2022-02-16 21:53                                     ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-16 21:33 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
> The only change to socket.c I'm seeing is use __socketcall to simplify
> socket()
> <https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001>,
> so maybe it would make sense for me to try building with that reversed?

That should not be a functional change, but you may be overlooking
commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the
new (added in 4.3) individual socket syscalls instead of the legacy
multiplexed SYS_socketcall. It's supposed to fall back to using the
old ones, but perhaps something goes wrong on your kernel that's
preventing it. I'm not sure what the mechanism by which it works when
straced/single-stepped could be, though, but if it's a weird kernel
bug anything is possible.

Reverting that commit should be entirely safe, if it turns out to be
what's triggering your problem, but I'd like to get to the root cause
and see if there's anything we can do to ensure this doesn't come up
again.


> On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik <satadru@gmail.com> wrote:
> 
> >
> >>
> >> - Whether any network traffic occurs when it fails (in the real
> >>   environment not a replicated one elsewhere).
> >>
> >>
> > There is no network traffic in the real environment.
> >
> >
> >> - Whether it fails or succeeds under strace (in the real
> >>   environment not a replicated one elsewhere).
> >>
> >> It succeeds in strace (in the real environment)
> >
> >
> >
> >> - Whether the real environment involves Docker or not.
> >>
> >> The real environment does not involve docker.
> >
> >
> >
> >> - What's in resolv.conf (in the real environment not a replicated one
> >>   elsewhere) and what nameserver software (if known) is running on the
> >>   nameserver(s) listed in there.
> >>
> >> The nameserver is picked up from dhcp. The contents of the file are as
> > follows:
> > nameserver 192.168.0.1
> > search lan.
> > options single-request timeout:1 attempts:5
> >
> >
> >> - Anything else that might be relevant.
> >>
> >> DNS server is dnsmasq running on a current OpenWRT device.
> >
> >
> >> It's really hard to offer any productive advice when the problem is
> >> unclear.
> >>
> >> Apologies for the confusion.
> > I'm really just trying to debug this getaddrinfo breakage on this older
> > hardware. The docker containers setup is something we use to build packages
> > for this hardware, and our frustration is that the software works perfectly
> > fine in the docker containers, but not on the hardware.
> >
> > > Any other suggestions on how to track down this issue?
> >>
> >> Rather than stepping through, I would put a single breakpoint at a
> >> place you want to see whether execution reaches before running the
> >> test program, then start it and see if the breakpoint fires or not.
> >> Then remove the breakpoint, add a different one, and repeat. For
> >> example, see if __res_msend is ever called, and if so, whether
> >> particular lines of it are reached (or just put breakpoints on some of
> >> the functions it calls, like socket, bind, recvfrom, poll, etc. to see
> >> if they're called).
> >>
> >> It might also be useful to put a breakpoint on clock_gettime and then
> >> 'finish' to see what it returns (in case the problem is something
> >> time64-related).
> >>
> >>
> > The only breakpoint which fixed the execution was for line 20 (which
> > invokes getaddrinfo). Stepping through the __kernel_vsyscall and then
> > continuing is the only way it does not result in failure.
> >
> > Any later breakpoints fail.
> >
> > I went though the other breakpoints as requested.
> > clock_gettime did not fire.
> >
> > Breakpoint 1 at 0x5c2f7: file ../src_musl/compat/time32/clock_gettime32.c,
> > line 9.
> > __res_msend, setsockopt also did not fire.
> > The ones that did fire were: socket, bind, recvfrom, poll, __res_msend_rc,
> > memset, sendto, __get_resolv_conf, pthread_setcancelstate,
> > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
> >
> > When breaking on socket, stepping through the __kernel_vsyscall call after
> > socket and then continuing succeeds.
> >
> > Is it possible that the socket is not waiting long enough for a response
> > from __kernel_vsyscall? Has that changed?
> > Breaking, stepping, and continuing on every other function above fails.
> >
> > The gdb log is attached.
> >
> > Regards,
> >
> > Satadru
> >
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-16 21:33                                   ` Rich Felker
@ 2022-02-16 21:53                                     ` Satadru Pramanik
  2022-02-17  1:44                                       ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-16 21:53 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 4881 bytes --]

I was looking at that commit too. I've started a build with that reverted
and should be able to check back on that tomorrow.

On Wed, Feb 16, 2022 at 4:33 PM Rich Felker <dalias@aerifal.cx> wrote:

> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
> > The only change to socket.c I'm seeing is use __socketcall to simplify
> > socket()
> > <
> https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001
> >,
> > so maybe it would make sense for me to try building with that reversed?
>
> That should not be a functional change, but you may be overlooking
> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the
> new (added in 4.3) individual socket syscalls instead of the legacy
> multiplexed SYS_socketcall. It's supposed to fall back to using the
> old ones, but perhaps something goes wrong on your kernel that's
> preventing it. I'm not sure what the mechanism by which it works when
> straced/single-stepped could be, though, but if it's a weird kernel
> bug anything is possible.
>
> Reverting that commit should be entirely safe, if it turns out to be
> what's triggering your problem, but I'd like to get to the root cause
> and see if there's anything we can do to ensure this doesn't come up
> again.
>
>
> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik <satadru@gmail.com>
> wrote:
> >
> > >
> > >>
> > >> - Whether any network traffic occurs when it fails (in the real
> > >>   environment not a replicated one elsewhere).
> > >>
> > >>
> > > There is no network traffic in the real environment.
> > >
> > >
> > >> - Whether it fails or succeeds under strace (in the real
> > >>   environment not a replicated one elsewhere).
> > >>
> > >> It succeeds in strace (in the real environment)
> > >
> > >
> > >
> > >> - Whether the real environment involves Docker or not.
> > >>
> > >> The real environment does not involve docker.
> > >
> > >
> > >
> > >> - What's in resolv.conf (in the real environment not a replicated one
> > >>   elsewhere) and what nameserver software (if known) is running on the
> > >>   nameserver(s) listed in there.
> > >>
> > >> The nameserver is picked up from dhcp. The contents of the file are as
> > > follows:
> > > nameserver 192.168.0.1
> > > search lan.
> > > options single-request timeout:1 attempts:5
> > >
> > >
> > >> - Anything else that might be relevant.
> > >>
> > >> DNS server is dnsmasq running on a current OpenWRT device.
> > >
> > >
> > >> It's really hard to offer any productive advice when the problem is
> > >> unclear.
> > >>
> > >> Apologies for the confusion.
> > > I'm really just trying to debug this getaddrinfo breakage on this older
> > > hardware. The docker containers setup is something we use to build
> packages
> > > for this hardware, and our frustration is that the software works
> perfectly
> > > fine in the docker containers, but not on the hardware.
> > >
> > > > Any other suggestions on how to track down this issue?
> > >>
> > >> Rather than stepping through, I would put a single breakpoint at a
> > >> place you want to see whether execution reaches before running the
> > >> test program, then start it and see if the breakpoint fires or not.
> > >> Then remove the breakpoint, add a different one, and repeat. For
> > >> example, see if __res_msend is ever called, and if so, whether
> > >> particular lines of it are reached (or just put breakpoints on some of
> > >> the functions it calls, like socket, bind, recvfrom, poll, etc. to see
> > >> if they're called).
> > >>
> > >> It might also be useful to put a breakpoint on clock_gettime and then
> > >> 'finish' to see what it returns (in case the problem is something
> > >> time64-related).
> > >>
> > >>
> > > The only breakpoint which fixed the execution was for line 20 (which
> > > invokes getaddrinfo). Stepping through the __kernel_vsyscall and then
> > > continuing is the only way it does not result in failure.
> > >
> > > Any later breakpoints fail.
> > >
> > > I went though the other breakpoints as requested.
> > > clock_gettime did not fire.
> > >
> > > Breakpoint 1 at 0x5c2f7: file
> ../src_musl/compat/time32/clock_gettime32.c,
> > > line 9.
> > > __res_msend, setsockopt also did not fire.
> > > The ones that did fire were: socket, bind, recvfrom, poll,
> __res_msend_rc,
> > > memset, sendto, __get_resolv_conf, pthread_setcancelstate,
> > > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
> > >
> > > When breaking on socket, stepping through the __kernel_vsyscall call
> after
> > > socket and then continuing succeeds.
> > >
> > > Is it possible that the socket is not waiting long enough for a
> response
> > > from __kernel_vsyscall? Has that changed?
> > > Breaking, stepping, and continuing on every other function above fails.
> > >
> > > The gdb log is attached.
> > >
> > > Regards,
> > >
> > > Satadru
> > >
> > >
>

[-- Attachment #2: Type: text/html, Size: 6459 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-16 21:53                                     ` Satadru Pramanik
@ 2022-02-17  1:44                                       ` Satadru Pramanik
  2022-02-17  4:14                                         ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17  1:44 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 8142 bytes --]

Looks like I need to do more than just reverse that commit to get this to
build...

../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro
‘__socketcall_cp’    47 | #define socketcall_cp(nm,a,b,c,d,e,f)
__syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
                                                            |
                                       ^~~~~~~~~~~~~~~
 ../src_musl/src/network/accept.c:6:16: note: in expansion of macro
‘socketcall_cp’
    6 |         return socketcall_cp(accept, fd, addr, len, 0, 0, 0);
      |                ^~~~~~~~~~~~~
../src_musl/src/internal/syscall.h:62:54: note: each undeclared identifier
is reported only once for each function it appears in
                             62 | #define __socketcall_cp(nm,a,b,c,d,e,f)
__syscall_cp(SYS_##nm, a, b, c, d, e, f)      |
                          ^~~~
 ../src_musl/src/internal/syscall.h:55:53: note: in definition of macro
‘__syscall_cp6’     55 | #define __syscall_cp6(n,a,b,c,d,e,f)
(__syscall_cp)(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f))
                                                              |
                                         ^
 ../src_musl/src/internal/syscall.h:57:27: note: in expansion of macro
‘__SYSCALL_DISP’     57 | #define __syscall_cp(...)
__SYSCALL_DISP(__syscall_cp,__VA_ARGS__)                    |
             ^~~~~~~~~~~~~~
../src_musl/src/internal/syscall.h:62:41: note: in expansion of macro
‘__syscall_cp’       62 | #define __socketcall_cp(nm,a,b,c,d,e,f)
__syscall_cp(SYS_##nm, a, b, c, d, e, f)      |
             ^~~~~~~~~~~~
../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro
‘__socketcall_cp’
   47 | #define socketcall_cp(nm,a,b,c,d,e,f)
__syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
                                                            |
                                       ^~~~~~~~~~~~~~~
 ../src_musl/src/network/accept.c:6:16: note: in expansion of macro
‘socketcall_cp’          6 |         return socketcall_cp(accept, fd, addr,
len, 0, 0, 0);                         |                ^~~~~~~~~~~~~
../src_musl/src/network/accept.c:7:1: warning: control reaches end of
non-void function [-Wreturn-type]
                                   7 | }
      | ^
             make[2]: *** [Makefile:159: obj/src/network/accept.lo] Error 1
make[2]: Leaving directory
'/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl/obj_musl'
                                                make[1]: *** [Makefile:249:
obj_musl/.lc_built] Error 2                                 make[1]:
Leaving directory
'/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl'
                                                         make: ***
[Makefile:194: all] Error 2

On Wed, Feb 16, 2022, 4:53 PM Satadru Pramanik <satadru@gmail.com> wrote:

> I was looking at that commit too. I've started a build with that reverted
> and should be able to check back on that tomorrow.
>
> On Wed, Feb 16, 2022 at 4:33 PM Rich Felker <dalias@aerifal.cx> wrote:
>
>> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
>> > The only change to socket.c I'm seeing is use __socketcall to simplify
>> > socket()
>> > <
>> https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001
>> >,
>> > so maybe it would make sense for me to try building with that reversed?
>>
>> That should not be a functional change, but you may be overlooking
>> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the
>> new (added in 4.3) individual socket syscalls instead of the legacy
>> multiplexed SYS_socketcall. It's supposed to fall back to using the
>> old ones, but perhaps something goes wrong on your kernel that's
>> preventing it. I'm not sure what the mechanism by which it works when
>> straced/single-stepped could be, though, but if it's a weird kernel
>> bug anything is possible.
>>
>> Reverting that commit should be entirely safe, if it turns out to be
>> what's triggering your problem, but I'd like to get to the root cause
>> and see if there's anything we can do to ensure this doesn't come up
>> again.
>>
>>
>> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik <satadru@gmail.com>
>> wrote:
>> >
>> > >
>> > >>
>> > >> - Whether any network traffic occurs when it fails (in the real
>> > >>   environment not a replicated one elsewhere).
>> > >>
>> > >>
>> > > There is no network traffic in the real environment.
>> > >
>> > >
>> > >> - Whether it fails or succeeds under strace (in the real
>> > >>   environment not a replicated one elsewhere).
>> > >>
>> > >> It succeeds in strace (in the real environment)
>> > >
>> > >
>> > >
>> > >> - Whether the real environment involves Docker or not.
>> > >>
>> > >> The real environment does not involve docker.
>> > >
>> > >
>> > >
>> > >> - What's in resolv.conf (in the real environment not a replicated one
>> > >>   elsewhere) and what nameserver software (if known) is running on
>> the
>> > >>   nameserver(s) listed in there.
>> > >>
>> > >> The nameserver is picked up from dhcp. The contents of the file are
>> as
>> > > follows:
>> > > nameserver 192.168.0.1
>> > > search lan.
>> > > options single-request timeout:1 attempts:5
>> > >
>> > >
>> > >> - Anything else that might be relevant.
>> > >>
>> > >> DNS server is dnsmasq running on a current OpenWRT device.
>> > >
>> > >
>> > >> It's really hard to offer any productive advice when the problem is
>> > >> unclear.
>> > >>
>> > >> Apologies for the confusion.
>> > > I'm really just trying to debug this getaddrinfo breakage on this
>> older
>> > > hardware. The docker containers setup is something we use to build
>> packages
>> > > for this hardware, and our frustration is that the software works
>> perfectly
>> > > fine in the docker containers, but not on the hardware.
>> > >
>> > > > Any other suggestions on how to track down this issue?
>> > >>
>> > >> Rather than stepping through, I would put a single breakpoint at a
>> > >> place you want to see whether execution reaches before running the
>> > >> test program, then start it and see if the breakpoint fires or not.
>> > >> Then remove the breakpoint, add a different one, and repeat. For
>> > >> example, see if __res_msend is ever called, and if so, whether
>> > >> particular lines of it are reached (or just put breakpoints on some
>> of
>> > >> the functions it calls, like socket, bind, recvfrom, poll, etc. to
>> see
>> > >> if they're called).
>> > >>
>> > >> It might also be useful to put a breakpoint on clock_gettime and then
>> > >> 'finish' to see what it returns (in case the problem is something
>> > >> time64-related).
>> > >>
>> > >>
>> > > The only breakpoint which fixed the execution was for line 20 (which
>> > > invokes getaddrinfo). Stepping through the __kernel_vsyscall and then
>> > > continuing is the only way it does not result in failure.
>> > >
>> > > Any later breakpoints fail.
>> > >
>> > > I went though the other breakpoints as requested.
>> > > clock_gettime did not fire.
>> > >
>> > > Breakpoint 1 at 0x5c2f7: file
>> ../src_musl/compat/time32/clock_gettime32.c,
>> > > line 9.
>> > > __res_msend, setsockopt also did not fire.
>> > > The ones that did fire were: socket, bind, recvfrom, poll,
>> __res_msend_rc,
>> > > memset, sendto, __get_resolv_conf, pthread_setcancelstate,
>> > > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
>> > >
>> > > When breaking on socket, stepping through the __kernel_vsyscall call
>> after
>> > > socket and then continuing succeeds.
>> > >
>> > > Is it possible that the socket is not waiting long enough for a
>> response
>> > > from __kernel_vsyscall? Has that changed?
>> > > Breaking, stepping, and continuing on every other function above
>> fails.
>> > >
>> > > The gdb log is attached.
>> > >
>> > > Regards,
>> > >
>> > > Satadru
>> > >
>> > >
>>
>

[-- Attachment #2: Type: text/html, Size: 10951 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17  1:44                                       ` Satadru Pramanik
@ 2022-02-17  4:14                                         ` Satadru Pramanik
  2022-02-17 13:17                                           ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17  4:14 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 8544 bytes --]

Oops. Looks like I reversed the patch wrong. Rebuilding again...

On Wed, Feb 16, 2022 at 8:44 PM Satadru Pramanik <satadru@gmail.com> wrote:

> Looks like I need to do more than just reverse that commit to get this to
> build...
>
> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro
> ‘__socketcall_cp’    47 | #define socketcall_cp(nm,a,b,c,d,e,f)
> __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
>                                                             |
>                                        ^~~~~~~~~~~~~~~
>  ../src_musl/src/network/accept.c:6:16: note: in expansion of macro
> ‘socketcall_cp’
>     6 |         return socketcall_cp(accept, fd, addr, len, 0, 0, 0);
>       |                ^~~~~~~~~~~~~
> ../src_musl/src/internal/syscall.h:62:54: note: each undeclared identifier
> is reported only once for each function it appears in
>                              62 | #define __socketcall_cp(nm,a,b,c,d,e,f)
> __syscall_cp(SYS_##nm, a, b, c, d, e, f)      |
>                           ^~~~
>  ../src_musl/src/internal/syscall.h:55:53: note: in definition of macro
> ‘__syscall_cp6’     55 | #define __syscall_cp6(n,a,b,c,d,e,f)
> (__syscall_cp)(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f))
>                                                               |
>                                          ^
>  ../src_musl/src/internal/syscall.h:57:27: note: in expansion of macro
> ‘__SYSCALL_DISP’     57 | #define __syscall_cp(...)
> __SYSCALL_DISP(__syscall_cp,__VA_ARGS__)                    |
>              ^~~~~~~~~~~~~~
> ../src_musl/src/internal/syscall.h:62:41: note: in expansion of macro
> ‘__syscall_cp’       62 | #define __socketcall_cp(nm,a,b,c,d,e,f)
> __syscall_cp(SYS_##nm, a, b, c, d, e, f)      |
>              ^~~~~~~~~~~~
> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro
> ‘__socketcall_cp’
>    47 | #define socketcall_cp(nm,a,b,c,d,e,f)
> __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
>                                                             |
>                                        ^~~~~~~~~~~~~~~
>  ../src_musl/src/network/accept.c:6:16: note: in expansion of macro
> ‘socketcall_cp’          6 |         return socketcall_cp(accept, fd, addr,
> len, 0, 0, 0);                         |                ^~~~~~~~~~~~~
> ../src_musl/src/network/accept.c:7:1: warning: control reaches end of
> non-void function [-Wreturn-type]
>                                    7 | }
>       | ^
>              make[2]: *** [Makefile:159: obj/src/network/accept.lo] Error 1
> make[2]: Leaving directory
> '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl/obj_musl'
>                                                 make[1]: *** [Makefile:249:
> obj_musl/.lc_built] Error 2                                 make[1]:
> Leaving directory
> '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl'
>                                                          make: ***
> [Makefile:194: all] Error 2
>
> On Wed, Feb 16, 2022, 4:53 PM Satadru Pramanik <satadru@gmail.com> wrote:
>
>> I was looking at that commit too. I've started a build with that reverted
>> and should be able to check back on that tomorrow.
>>
>> On Wed, Feb 16, 2022 at 4:33 PM Rich Felker <dalias@aerifal.cx> wrote:
>>
>>> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
>>> > The only change to socket.c I'm seeing is use __socketcall to simplify
>>> > socket()
>>> > <
>>> https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001
>>> >,
>>> > so maybe it would make sense for me to try building with that reversed?
>>>
>>> That should not be a functional change, but you may be overlooking
>>> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the
>>> new (added in 4.3) individual socket syscalls instead of the legacy
>>> multiplexed SYS_socketcall. It's supposed to fall back to using the
>>> old ones, but perhaps something goes wrong on your kernel that's
>>> preventing it. I'm not sure what the mechanism by which it works when
>>> straced/single-stepped could be, though, but if it's a weird kernel
>>> bug anything is possible.
>>>
>>> Reverting that commit should be entirely safe, if it turns out to be
>>> what's triggering your problem, but I'd like to get to the root cause
>>> and see if there's anything we can do to ensure this doesn't come up
>>> again.
>>>
>>>
>>> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik <satadru@gmail.com>
>>> wrote:
>>> >
>>> > >
>>> > >>
>>> > >> - Whether any network traffic occurs when it fails (in the real
>>> > >>   environment not a replicated one elsewhere).
>>> > >>
>>> > >>
>>> > > There is no network traffic in the real environment.
>>> > >
>>> > >
>>> > >> - Whether it fails or succeeds under strace (in the real
>>> > >>   environment not a replicated one elsewhere).
>>> > >>
>>> > >> It succeeds in strace (in the real environment)
>>> > >
>>> > >
>>> > >
>>> > >> - Whether the real environment involves Docker or not.
>>> > >>
>>> > >> The real environment does not involve docker.
>>> > >
>>> > >
>>> > >
>>> > >> - What's in resolv.conf (in the real environment not a replicated
>>> one
>>> > >>   elsewhere) and what nameserver software (if known) is running on
>>> the
>>> > >>   nameserver(s) listed in there.
>>> > >>
>>> > >> The nameserver is picked up from dhcp. The contents of the file are
>>> as
>>> > > follows:
>>> > > nameserver 192.168.0.1
>>> > > search lan.
>>> > > options single-request timeout:1 attempts:5
>>> > >
>>> > >
>>> > >> - Anything else that might be relevant.
>>> > >>
>>> > >> DNS server is dnsmasq running on a current OpenWRT device.
>>> > >
>>> > >
>>> > >> It's really hard to offer any productive advice when the problem is
>>> > >> unclear.
>>> > >>
>>> > >> Apologies for the confusion.
>>> > > I'm really just trying to debug this getaddrinfo breakage on this
>>> older
>>> > > hardware. The docker containers setup is something we use to build
>>> packages
>>> > > for this hardware, and our frustration is that the software works
>>> perfectly
>>> > > fine in the docker containers, but not on the hardware.
>>> > >
>>> > > > Any other suggestions on how to track down this issue?
>>> > >>
>>> > >> Rather than stepping through, I would put a single breakpoint at a
>>> > >> place you want to see whether execution reaches before running the
>>> > >> test program, then start it and see if the breakpoint fires or not.
>>> > >> Then remove the breakpoint, add a different one, and repeat. For
>>> > >> example, see if __res_msend is ever called, and if so, whether
>>> > >> particular lines of it are reached (or just put breakpoints on some
>>> of
>>> > >> the functions it calls, like socket, bind, recvfrom, poll, etc. to
>>> see
>>> > >> if they're called).
>>> > >>
>>> > >> It might also be useful to put a breakpoint on clock_gettime and
>>> then
>>> > >> 'finish' to see what it returns (in case the problem is something
>>> > >> time64-related).
>>> > >>
>>> > >>
>>> > > The only breakpoint which fixed the execution was for line 20 (which
>>> > > invokes getaddrinfo). Stepping through the __kernel_vsyscall and then
>>> > > continuing is the only way it does not result in failure.
>>> > >
>>> > > Any later breakpoints fail.
>>> > >
>>> > > I went though the other breakpoints as requested.
>>> > > clock_gettime did not fire.
>>> > >
>>> > > Breakpoint 1 at 0x5c2f7: file
>>> ../src_musl/compat/time32/clock_gettime32.c,
>>> > > line 9.
>>> > > __res_msend, setsockopt also did not fire.
>>> > > The ones that did fire were: socket, bind, recvfrom, poll,
>>> __res_msend_rc,
>>> > > memset, sendto, __get_resolv_conf, pthread_setcancelstate,
>>> > > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
>>> > >
>>> > > When breaking on socket, stepping through the __kernel_vsyscall call
>>> after
>>> > > socket and then continuing succeeds.
>>> > >
>>> > > Is it possible that the socket is not waiting long enough for a
>>> response
>>> > > from __kernel_vsyscall? Has that changed?
>>> > > Breaking, stepping, and continuing on every other function above
>>> fails.
>>> > >
>>> > > The gdb log is attached.
>>> > >
>>> > > Regards,
>>> > >
>>> > > Satadru
>>> > >
>>> > >
>>>
>>

[-- Attachment #2: Type: text/html, Size: 11406 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17  4:14                                         ` Satadru Pramanik
@ 2022-02-17 13:17                                           ` Satadru Pramanik
  2022-02-17 13:24                                             ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17 13:17 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 10763 bytes --]

Looks like reverting that commit works, but interestingly, only
stochastically.

chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
getaddrinfo: Try again
chronos@localhost
/usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
./musl_getaddrinfo_test google.com
AF_INET6: 2607:f8b0:4006:81c::200e
AF_INET: 142.250.80.46

On Wed, Feb 16, 2022 at 11:14 PM Satadru Pramanik <satadru@gmail.com> wrote:

> Oops. Looks like I reversed the patch wrong. Rebuilding again...
>
> On Wed, Feb 16, 2022 at 8:44 PM Satadru Pramanik <satadru@gmail.com>
> wrote:
>
>> Looks like I need to do more than just reverse that commit to get this to
>> build...
>>
>> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro
>> ‘__socketcall_cp’    47 | #define socketcall_cp(nm,a,b,c,d,e,f)
>> __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
>>                                                             |
>>                                        ^~~~~~~~~~~~~~~
>>  ../src_musl/src/network/accept.c:6:16: note: in expansion of macro
>> ‘socketcall_cp’
>>     6 |         return socketcall_cp(accept, fd, addr, len, 0, 0, 0);
>>       |                ^~~~~~~~~~~~~
>> ../src_musl/src/internal/syscall.h:62:54: note: each undeclared
>> identifier is reported only once for each function it appears in
>>                                        62 | #define
>> __socketcall_cp(nm,a,b,c,d,e,f) __syscall_cp(SYS_##nm, a, b, c, d, e, f)
>>   |                                                      ^~~~
>>          ../src_musl/src/internal/syscall.h:55:53: note: in definition of
>> macro ‘__syscall_cp6’     55 | #define __syscall_cp6(n,a,b,c,d,e,f)
>> (__syscall_cp)(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f))
>>                                                               |
>>                                          ^
>>  ../src_musl/src/internal/syscall.h:57:27: note: in expansion of macro
>> ‘__SYSCALL_DISP’     57 | #define __syscall_cp(...)
>> __SYSCALL_DISP(__syscall_cp,__VA_ARGS__)                    |
>>              ^~~~~~~~~~~~~~
>> ../src_musl/src/internal/syscall.h:62:41: note: in expansion of macro
>> ‘__syscall_cp’       62 | #define __socketcall_cp(nm,a,b,c,d,e,f)
>> __syscall_cp(SYS_##nm, a, b, c, d, e, f)      |
>>              ^~~~~~~~~~~~
>> ../src_musl/src/internal/syscall.h:47:53: note: in expansion of macro
>> ‘__socketcall_cp’
>>    47 | #define socketcall_cp(nm,a,b,c,d,e,f)
>> __syscall_ret(__socketcall_cp(nm,a,b,c,d,e,f))
>>                                                             |
>>                                        ^~~~~~~~~~~~~~~
>>  ../src_musl/src/network/accept.c:6:16: note: in expansion of macro
>> ‘socketcall_cp’          6 |         return socketcall_cp(accept, fd, addr,
>> len, 0, 0, 0);                         |                ^~~~~~~~~~~~~
>> ../src_musl/src/network/accept.c:7:1: warning: control reaches end of
>> non-void function [-Wreturn-type]
>>                                    7 | }
>>       | ^
>>                make[2]: *** [Makefile:159: obj/src/network/accept.lo] Error
>> 1
>> make[2]: Leaving directory
>> '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl/obj_musl'
>>                                                 make[1]: *** [Makefile:249:
>> obj_musl/.lc_built] Error 2                                 make[1]:
>> Leaving directory
>> '/usr/local/tmp/crew/musl_native_toolchain.20220216215322.dir/build/local/i686-linux-musl'
>>                                                          make: ***
>> [Makefile:194: all] Error 2
>>
>> On Wed, Feb 16, 2022, 4:53 PM Satadru Pramanik <satadru@gmail.com> wrote:
>>
>>> I was looking at that commit too. I've started a build with that
>>> reverted and should be able to check back on that tomorrow.
>>>
>>> On Wed, Feb 16, 2022 at 4:33 PM Rich Felker <dalias@aerifal.cx> wrote:
>>>
>>>> On Wed, Feb 16, 2022 at 01:44:35PM -0500, Satadru Pramanik wrote:
>>>> > The only change to socket.c I'm seeing is use __socketcall to simplify
>>>> > socket()
>>>> > <
>>>> https://git.musl-libc.org/cgit/musl/commit/?id=7063c459e7dbd63c2c94e04413743abab5272001
>>>> >,
>>>> > so maybe it would make sense for me to try building with that
>>>> reversed?
>>>>
>>>> That should not be a functional change, but you may be overlooking
>>>> commit c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91, which was: using the
>>>> new (added in 4.3) individual socket syscalls instead of the legacy
>>>> multiplexed SYS_socketcall. It's supposed to fall back to using the
>>>> old ones, but perhaps something goes wrong on your kernel that's
>>>> preventing it. I'm not sure what the mechanism by which it works when
>>>> straced/single-stepped could be, though, but if it's a weird kernel
>>>> bug anything is possible.
>>>>
>>>> Reverting that commit should be entirely safe, if it turns out to be
>>>> what's triggering your problem, but I'd like to get to the root cause
>>>> and see if there's anything we can do to ensure this doesn't come up
>>>> again.
>>>>
>>>>
>>>> > On Wed, Feb 16, 2022 at 1:37 PM Satadru Pramanik <satadru@gmail.com>
>>>> wrote:
>>>> >
>>>> > >
>>>> > >>
>>>> > >> - Whether any network traffic occurs when it fails (in the real
>>>> > >>   environment not a replicated one elsewhere).
>>>> > >>
>>>> > >>
>>>> > > There is no network traffic in the real environment.
>>>> > >
>>>> > >
>>>> > >> - Whether it fails or succeeds under strace (in the real
>>>> > >>   environment not a replicated one elsewhere).
>>>> > >>
>>>> > >> It succeeds in strace (in the real environment)
>>>> > >
>>>> > >
>>>> > >
>>>> > >> - Whether the real environment involves Docker or not.
>>>> > >>
>>>> > >> The real environment does not involve docker.
>>>> > >
>>>> > >
>>>> > >
>>>> > >> - What's in resolv.conf (in the real environment not a replicated
>>>> one
>>>> > >>   elsewhere) and what nameserver software (if known) is running on
>>>> the
>>>> > >>   nameserver(s) listed in there.
>>>> > >>
>>>> > >> The nameserver is picked up from dhcp. The contents of the file
>>>> are as
>>>> > > follows:
>>>> > > nameserver 192.168.0.1
>>>> > > search lan.
>>>> > > options single-request timeout:1 attempts:5
>>>> > >
>>>> > >
>>>> > >> - Anything else that might be relevant.
>>>> > >>
>>>> > >> DNS server is dnsmasq running on a current OpenWRT device.
>>>> > >
>>>> > >
>>>> > >> It's really hard to offer any productive advice when the problem is
>>>> > >> unclear.
>>>> > >>
>>>> > >> Apologies for the confusion.
>>>> > > I'm really just trying to debug this getaddrinfo breakage on this
>>>> older
>>>> > > hardware. The docker containers setup is something we use to build
>>>> packages
>>>> > > for this hardware, and our frustration is that the software works
>>>> perfectly
>>>> > > fine in the docker containers, but not on the hardware.
>>>> > >
>>>> > > > Any other suggestions on how to track down this issue?
>>>> > >>
>>>> > >> Rather than stepping through, I would put a single breakpoint at a
>>>> > >> place you want to see whether execution reaches before running the
>>>> > >> test program, then start it and see if the breakpoint fires or not.
>>>> > >> Then remove the breakpoint, add a different one, and repeat. For
>>>> > >> example, see if __res_msend is ever called, and if so, whether
>>>> > >> particular lines of it are reached (or just put breakpoints on
>>>> some of
>>>> > >> the functions it calls, like socket, bind, recvfrom, poll, etc. to
>>>> see
>>>> > >> if they're called).
>>>> > >>
>>>> > >> It might also be useful to put a breakpoint on clock_gettime and
>>>> then
>>>> > >> 'finish' to see what it returns (in case the problem is something
>>>> > >> time64-related).
>>>> > >>
>>>> > >>
>>>> > > The only breakpoint which fixed the execution was for line 20 (which
>>>> > > invokes getaddrinfo). Stepping through the __kernel_vsyscall and
>>>> then
>>>> > > continuing is the only way it does not result in failure.
>>>> > >
>>>> > > Any later breakpoints fail.
>>>> > >
>>>> > > I went though the other breakpoints as requested.
>>>> > > clock_gettime did not fire.
>>>> > >
>>>> > > Breakpoint 1 at 0x5c2f7: file
>>>> ../src_musl/compat/time32/clock_gettime32.c,
>>>> > > line 9.
>>>> > > __res_msend, setsockopt also did not fire.
>>>> > > The ones that did fire were: socket, bind, recvfrom, poll,
>>>> __res_msend_rc,
>>>> > > memset, sendto, __get_resolv_conf, pthread_setcancelstate,
>>>> > > __pthread_setcancelstate, __lookup_serv, __lookup_name, memcpy
>>>> > >
>>>> > > When breaking on socket, stepping through the __kernel_vsyscall
>>>> call after
>>>> > > socket and then continuing succeeds.
>>>> > >
>>>> > > Is it possible that the socket is not waiting long enough for a
>>>> response
>>>> > > from __kernel_vsyscall? Has that changed?
>>>> > > Breaking, stepping, and continuing on every other function above
>>>> fails.
>>>> > >
>>>> > > The gdb log is attached.
>>>> > >
>>>> > > Regards,
>>>> > >
>>>> > > Satadru
>>>> > >
>>>> > >
>>>>
>>>

[-- Attachment #2: Type: text/html, Size: 14059 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 13:17                                           ` Satadru Pramanik
@ 2022-02-17 13:24                                             ` Rich Felker
  2022-02-17 13:30                                               ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-17 13:24 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Thu, Feb 17, 2022 at 08:17:36AM -0500, Satadru Pramanik wrote:
> Looks like reverting that commit works, but interestingly, only
> stochastically.
> 
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> getaddrinfo: Try again
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> AF_INET6: 2607:f8b0:4006:81c::200e
> AF_INET: 142.250.80.46
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> AF_INET6: 2607:f8b0:4006:81c::200e
> AF_INET: 142.250.80.46
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> getaddrinfo: Try again
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> AF_INET6: 2607:f8b0:4006:81c::200e
> AF_INET: 142.250.80.46
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> AF_INET6: 2607:f8b0:4006:81c::200e
> AF_INET: 142.250.80.46
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> getaddrinfo: Try again
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> AF_INET6: 2607:f8b0:4006:81c::200e
> AF_INET: 142.250.80.46
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> getaddrinfo: Try again
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> getaddrinfo: Try again
> chronos@localhost
> /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> ../musl_getaddrinfo_test google.com
> AF_INET6: 2607:f8b0:4006:81c::200e
> AF_INET: 142.250.80.46

Can you tcpdump these again and see if there's network traffic when it
fails? This sounds more like the nameserver being unreliable.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 13:24                                             ` Rich Felker
@ 2022-02-17 13:30                                               ` Satadru Pramanik
  2022-02-17 13:46                                                 ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17 13:30 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 8849 bytes --]

*This is a failure:*
tcpdump -i any -vvv host 192.168.0.115
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture
size 262144 bytes
08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A? google.com. (28)
08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags [DF], proto
UDP (17), length 72)
    office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a -> 0x5c7d!]
0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110 (44)
08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA? google.com.
(28)
08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto
UDP (17), length 84)
    office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 -> 0xb42f!]
0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
2607:f8b0:4006:80d::200e (56)
08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags [none],
proto ICMP (1), length 112)
    192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625
unreachable, length 92
        IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto UDP
(17), length 84)
    office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA? google.com.
1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags [DF], proto
TCP (6), length 52)
    192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.], cksum
0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options
[nop,nop,TS val 198422160 ecr 2351261566], length 0
08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags [none],
proto TCP (6), length 52)
    lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.], cksum
0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val 2351306585
ecr 198377148], length 0
^C
7 packets captured
7 packets received by filter
0 packets dropped by kernel

*Two successes and one failure:*

# tcpdump -i any -vvv host 192.168.0.115
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture
size 262144 bytes
08:28:16.906114 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.115.34033 > office.lan.53: [udp sum ok] 0+ A? google.com. (28)
08:28:16.906243 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.115.34033 > office.lan.53: [udp sum ok] 0+ AAAA? google.com.
(28)
08:28:16.906512 IP (tos 0x0, ttl 64, id 8977, offset 0, flags [DF], proto
UDP (17), length 72)
    office.lan.53 > 192.168.0.115.34033: [bad udp cksum 0x820a -> 0xc40b!]
0 q: A? google.com. 1/0/0 google.com. [3m37s] A 142.250.80.110 (44)
08:28:16.906696 IP (tos 0x0, ttl 64, id 8978, offset 0, flags [DF], proto
UDP (17), length 84)
    office.lan.53 > 192.168.0.115.34033: [bad udp cksum 0x8216 -> 0x1d02!]
0 q: AAAA? google.com. 1/0/0 google.com. [5s] AAAA 2607:f8b0:4006:820::200e
(56)
08:28:19.257368 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.115.53048 > office.lan.53: [udp sum ok] 0+ A? google.com. (28)
08:28:19.257466 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.115.53048 > office.lan.53: [udp sum ok] 0+ AAAA? google.com.
(28)
08:28:19.257680 IP (tos 0x0, ttl 64, id 9078, offset 0, flags [DF], proto
UDP (17), length 72)
    office.lan.53 > 192.168.0.115.53048: [bad udp cksum 0x820a -> 0x79c7!]
0 q: A? google.com. 1/0/0 google.com. [3m34s] A 142.250.80.110 (44)
08:28:19.257849 IP (tos 0x0, ttl 64, id 9079, offset 0, flags [DF], proto
UDP (17), length 84)
    office.lan.53 > 192.168.0.115.53048: [bad udp cksum 0x8216 -> 0xd2bd!]
0 q: AAAA? google.com. 1/0/0 google.com. [2s] AAAA 2607:f8b0:4006:820::200e
(56)
08:28:20.354244 IP (tos 0x0, ttl 64, id 52155, offset 0, flags [DF], proto
TCP (6), length 52)
    192.168.0.115.53400 > lga34s35-in-f4.1e100.net.443: Flags [.], cksum
0xf854 (correct), seq 2594363947, ack 50065344, win 135, options
[nop,nop,TS val 198343040 ecr 3566225862], length 0
08:28:20.358922 IP (tos 0x0, ttl 122, id 3787, offset 0, flags [none],
proto TCP (6), length 52)
    lga34s35-in-f4.1e100.net.443 > 192.168.0.115.53400: Flags [.], cksum
0xa7b0 (correct), seq 1, ack 1, win 276, options [nop,nop,TS val 3566270916
ecr 198252953], length 0
08:28:20.482219 IP (tos 0x0, ttl 64, id 27145, offset 0, flags [DF], proto
TCP (6), length 52)
    192.168.0.115.54184 > bc-in-f188.1e100.net.5228: Flags [.], cksum
0x4e35 (correct), seq 2783933939, ack 881888313, win 249, options
[nop,nop,TS val 198343168 ecr 855506216], length 0
08:28:20.492954 IP (tos 0x0, ttl 59, id 49690, offset 0, flags [none],
proto TCP (6), length 52)
    bc-in-f188.1e100.net.5228 > 192.168.0.115.54184: Flags [.], cksum
0xfdcb (correct), seq 1, ack 1, win 274, options [nop,nop,TS val 855551371
ecr 198253037], length 0
08:28:20.860748 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.115.49427 > office.lan.53: [udp sum ok] 0+ A? google.com. (28)
08:28:20.860857 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
(17), length 56)
    192.168.0.115.49427 > office.lan.53: [udp sum ok] 0+ AAAA? google.com.
(28)
08:28:20.861401 IP (tos 0x0, ttl 64, id 9095, offset 0, flags [DF], proto
UDP (17), length 72)
    office.lan.53 > 192.168.0.115.49427: [bad udp cksum 0x820a -> 0x87ed!]
0 q: A? google.com. 1/0/0 google.com. [3m33s] A 142.250.80.110 (44)
08:28:20.861738 IP (tos 0x0, ttl 64, id 9096, offset 0, flags [DF], proto
UDP (17), length 84)
    office.lan.53 > 192.168.0.115.49427: [bad udp cksum 0x8216 -> 0xe0e3!]
0 q: AAAA? google.com. 1/0/0 google.com. [1s] AAAA 2607:f8b0:4006:820::200e
(56)
08:28:20.862682 IP (tos 0xc0, ttl 64, id 59726, offset 0, flags [none],
proto ICMP (1), length 112)
    192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 49427
unreachable, length 92
        IP (tos 0x0, ttl 64, id 9096, offset 0, flags [DF], proto UDP (17),
length 84)
    office.lan.53 > 192.168.0.115.49427: [udp sum ok] 0 q: AAAA? google.com.
1/0/0 google.com. [1s] AAAA 2607:f8b0:4006:820::200e (56)
08:28:21.985759 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has
192.168.0.115 tell office.lan, length 28
08:28:22.133507 ARP, Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.115
is-at 20:68:9d:0f:33:44 (oui Unknown), length 46
^C
19 packets captured
19 packets received by filter
0 packets dropped by kernel

On Thu, Feb 17, 2022 at 8:24 AM Rich Felker <dalias@aerifal.cx> wrote:

> On Thu, Feb 17, 2022 at 08:17:36AM -0500, Satadru Pramanik wrote:
> > Looks like reverting that commit works, but interestingly, only
> > stochastically.
> >
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > getaddrinfo: Try again
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > AF_INET6: 2607:f8b0:4006:81c::200e
> > AF_INET: 142.250.80.46
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > AF_INET6: 2607:f8b0:4006:81c::200e
> > AF_INET: 142.250.80.46
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > getaddrinfo: Try again
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > AF_INET6: 2607:f8b0:4006:81c::200e
> > AF_INET: 142.250.80.46
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > AF_INET6: 2607:f8b0:4006:81c::200e
> > AF_INET: 142.250.80.46
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > getaddrinfo: Try again
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > AF_INET6: 2607:f8b0:4006:81c::200e
> > AF_INET: 142.250.80.46
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > getaddrinfo: Try again
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > getaddrinfo: Try again
> > chronos@localhost
> > /usr/local/tmp/crew/musl_getaddrinfo_test.20220217125953.dir $
> > ../musl_getaddrinfo_test google.com
> > AF_INET6: 2607:f8b0:4006:81c::200e
> > AF_INET: 142.250.80.46
>
> Can you tcpdump these again and see if there's network traffic when it
> fails? This sounds more like the nameserver being unreliable.
>

[-- Attachment #2: Type: text/html, Size: 11501 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 13:30                                               ` Satadru Pramanik
@ 2022-02-17 13:46                                                 ` Rich Felker
  2022-02-17 14:49                                                   ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-17 13:46 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> *This is a failure:*
> tcpdump -i any -vvv host 192.168.0.115
> tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture
> size 262144 bytes
> 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
> (17), length 56)
>     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A? google.com. (28)
> 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags [DF], proto
> UDP (17), length 72)
>     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a -> 0x5c7d!]
> 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110 (44)
> 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP
> (17), length 56)
>     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA? google.com.
> (28)
> 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto
> UDP (17), length 84)
>     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 -> 0xb42f!]
> 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
> 2607:f8b0:4006:80d::200e (56)
> 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags [none],
> proto ICMP (1), length 112)
>     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625
> unreachable, length 92
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

OK, this shows that the client has requested both answers and the
nameserver replied almost immediately (about 0.5ms later), but when
the second reply arrives (to the AAAA), the client has already closed
the listening port, despite only a few ms having passed. The only way
I see this could happen is by "timing out". This suggests that
something is wrong with telling time.

Can you either put a breakpoint in __clock_gettime64 (this is the name
you have to use for a breakpoint -- sorry I messed it up last time)
and then see what it returns when you "finish" it and what's in the
timespec struct after that? Or just write a test program to call
clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or want to
use the time64 symbol name here) and print the results (return value
and contents of the timespec struct).



>         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto UDP
> (17), length 84)
>     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA? google.com.
> 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags [DF], proto
> TCP (6), length 52)
>     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.], cksum
> 0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options
> [nop,nop,TS val 198422160 ecr 2351261566], length 0
> 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags [none],
> proto TCP (6), length 52)
>     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.], cksum
> 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val 2351306585
> ecr 198377148], length 0
> ^C
> 7 packets captured
> 7 packets received by filter
> 0 packets dropped by kernel

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 13:46                                                 ` Rich Felker
@ 2022-02-17 14:49                                                   ` Satadru Pramanik
  2022-02-17 15:53                                                     ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17 14:49 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl


[-- Attachment #1.1: Type: text/plain, Size: 3696 bytes --]

Apologies for not being as familiar with gdb as I ought to be.
I used the __clock_gettime64 breakpoint and did a backtrace and finish
repeatedly.
I couldn't figure out how to best get the timespec struct info.

Alternately if you want to throw out a sample test program for me to build
and run, and what gdb commands to run to get the right info, happy to do
that too.

gdb output is attached.



On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@aerifal.cx> wrote:

> On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> > *This is a failure:*
> > tcpdump -i any -vvv host 192.168.0.115
> > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture
> > size 262144 bytes
> > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> UDP
> > (17), length 56)
> >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A? google.com.
> (28)
> > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags [DF],
> proto
> > UDP (17), length 72)
> >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a ->
> 0x5c7d!]
> > 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110 (44)
> > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> UDP
> > (17), length 56)
> >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA?
> google.com.
> > (28)
> > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF],
> proto
> > UDP (17), length 84)
> >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 ->
> 0xb42f!]
> > 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
> > 2607:f8b0:4006:80d::200e (56)
> > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags [none],
> > proto ICMP (1), length 112)
> >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625
> > unreachable, length 92
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> OK, this shows that the client has requested both answers and the
> nameserver replied almost immediately (about 0.5ms later), but when
> the second reply arrives (to the AAAA), the client has already closed
> the listening port, despite only a few ms having passed. The only way
> I see this could happen is by "timing out". This suggests that
> something is wrong with telling time.
>
> Can you either put a breakpoint in __clock_gettime64 (this is the name
> you have to use for a breakpoint -- sorry I messed it up last time)
> and then see what it returns when you "finish" it and what's in the
> timespec struct after that? Or just write a test program to call
> clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or want to
> use the time64 symbol name here) and print the results (return value
> and contents of the timespec struct).
>
>
>
> >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto UDP
> > (17), length 84)
> >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA?
> google.com.
> > 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags [DF],
> proto
> > TCP (6), length 52)
> >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.], cksum
> > 0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options
> > [nop,nop,TS val 198422160 ecr 2351261566], length 0
> > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags [none],
> > proto TCP (6), length 52)
> >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.], cksum
> > 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val
> 2351306585
> > ecr 198377148], length 0
> > ^C
> > 7 packets captured
> > 7 packets received by filter
> > 0 packets dropped by kernel
>

[-- Attachment #1.2: Type: text/html, Size: 4975 bytes --]

[-- Attachment #2: gdb.out.txt.gz --]
[-- Type: application/gzip, Size: 1569 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 14:49                                                   ` Satadru Pramanik
@ 2022-02-17 15:53                                                     ` Rich Felker
  2022-02-17 16:05                                                       ` Rich Felker
  2022-02-17 16:17                                                       ` Satadru Pramanik
  0 siblings, 2 replies; 37+ messages in thread
From: Rich Felker @ 2022-02-17 15:53 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 4532 bytes --]

On Thu, Feb 17, 2022 at 09:49:45AM -0500, Satadru Pramanik wrote:
> Apologies for not being as familiar with gdb as I ought to be.
> I used the __clock_gettime64 breakpoint and did a backtrace and finish
> repeatedly.
> I couldn't figure out how to best get the timespec struct info.
> 
> Alternately if you want to throw out a sample test program for me to build
> and run, and what gdb commands to run to get the right info, happy to do
> that too.
> 
> gdb output is attached.

If gdb reported it correctly, clock_gettime returned 403, which should
be impossible. It can only return 0 or -1. Incidentally, 403 is the
syscall number for SYS_clock_gettime64, which suggests your kernel is
simply *returning the syscall number* instead of -ENOSYS for syscalls
that don't exist on it. Is this a stock kernel (3.8 IIRC) or does it
have any sort of weird vendor patching? Any LSMs loaded?

If you'd like to run a test just to make sure we're accurately seeing
what's happening, the attached should work. It should print 0 followed
by the current time in seconds and nanoseconds.



> On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@aerifal.cx> wrote:
> 
> > On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> > > *This is a failure:*
> > > tcpdump -i any -vvv host 192.168.0.115
> > > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture
> > > size 262144 bytes
> > > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > UDP
> > > (17), length 56)
> > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A? google.com.
> > (28)
> > > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags [DF],
> > proto
> > > UDP (17), length 72)
> > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a ->
> > 0x5c7d!]
> > > 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110 (44)
> > > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > UDP
> > > (17), length 56)
> > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA?
> > google.com.
> > > (28)
> > > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF],
> > proto
> > > UDP (17), length 84)
> > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 ->
> > 0xb42f!]
> > > 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
> > > 2607:f8b0:4006:80d::200e (56)
> > > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags [none],
> > > proto ICMP (1), length 112)
> > >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625
> > > unreachable, length 92
> >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > OK, this shows that the client has requested both answers and the
> > nameserver replied almost immediately (about 0.5ms later), but when
> > the second reply arrives (to the AAAA), the client has already closed
> > the listening port, despite only a few ms having passed. The only way
> > I see this could happen is by "timing out". This suggests that
> > something is wrong with telling time.
> >
> > Can you either put a breakpoint in __clock_gettime64 (this is the name
> > you have to use for a breakpoint -- sorry I messed it up last time)
> > and then see what it returns when you "finish" it and what's in the
> > timespec struct after that? Or just write a test program to call
> > clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or want to
> > use the time64 symbol name here) and print the results (return value
> > and contents of the timespec struct).
> >
> >
> >
> > >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto UDP
> > > (17), length 84)
> > >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA?
> > google.com.
> > > 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags [DF],
> > proto
> > > TCP (6), length 52)
> > >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.], cksum
> > > 0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options
> > > [nop,nop,TS val 198422160 ecr 2351261566], length 0
> > > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags [none],
> > > proto TCP (6), length 52)
> > >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.], cksum
> > > 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val
> > 2351306585
> > > ecr 198377148], length 0
> > > ^C
> > > 7 packets captured
> > > 7 packets received by filter
> > > 0 packets dropped by kernel
> >



[-- Attachment #2: cgt.c --]
[-- Type: text/plain, Size: 184 bytes --]

#include <time.h>
#include <stdio.h>
int main()
{
	struct timespec ts;
	printf("%d", clock_gettime(CLOCK_REALTIME, &ts));
	printf(" %lld %.9ld\n", (long long)ts.tv_sec, ts.tv_nsec);
}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 15:53                                                     ` Rich Felker
@ 2022-02-17 16:05                                                       ` Rich Felker
  2022-02-17 16:36                                                         ` Satadru Pramanik
  2022-02-17 16:17                                                       ` Satadru Pramanik
  1 sibling, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-17 16:05 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Thu, Feb 17, 2022 at 10:53:52AM -0500, Rich Felker wrote:
> On Thu, Feb 17, 2022 at 09:49:45AM -0500, Satadru Pramanik wrote:
> > Apologies for not being as familiar with gdb as I ought to be.
> > I used the __clock_gettime64 breakpoint and did a backtrace and finish
> > repeatedly.
> > I couldn't figure out how to best get the timespec struct info.
> > 
> > Alternately if you want to throw out a sample test program for me to build
> > and run, and what gdb commands to run to get the right info, happy to do
> > that too.
> > 
> > gdb output is attached.
> 
> If gdb reported it correctly, clock_gettime returned 403, which should
> be impossible. It can only return 0 or -1. Incidentally, 403 is the
> syscall number for SYS_clock_gettime64, which suggests your kernel is
> simply *returning the syscall number* instead of -ENOSYS for syscalls
> that don't exist on it. Is this a stock kernel (3.8 IIRC) or does it
> have any sort of weird vendor patching? Any LSMs loaded?
> 
> If you'd like to run a test just to make sure we're accurately seeing
> what's happening, the attached should work. It should print 0 followed
> by the current time in seconds and nanoseconds.

It looks like you hit the bug introduced in commit
554086d85e71f30abe46fc014fea31929a7c6a8a and fixed in commit
8142b215501f8b291a108a202b3a053a265b03dd. It looks like, since the
former was a CVE fix, somebody backported it to the kernel you're
using, but they failed to backport the fix-for-the-fix, so you have a
kernel that operates dangerously incorrectly for syscall numbers it's
unaware of.

This really needs to be fixed in the kernel if you can. On our side
(musl) we probably need to find out if such kernels are actually out
in the wild, and if so, whether there's any reasonable way to detect
the false success and treat it as failure.

> > On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@aerifal.cx> wrote:
> > 
> > > On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> > > > *This is a failure:*
> > > > tcpdump -i any -vvv host 192.168.0.115
> > > > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture
> > > > size 262144 bytes
> > > > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > > UDP
> > > > (17), length 56)
> > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A? google.com.
> > > (28)
> > > > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags [DF],
> > > proto
> > > > UDP (17), length 72)
> > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a ->
> > > 0x5c7d!]
> > > > 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110 (44)
> > > > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > > UDP
> > > > (17), length 56)
> > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA?
> > > google.com.
> > > > (28)
> > > > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF],
> > > proto
> > > > UDP (17), length 84)
> > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 ->
> > > 0xb42f!]
> > > > 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
> > > > 2607:f8b0:4006:80d::200e (56)
> > > > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags [none],
> > > > proto ICMP (1), length 112)
> > > >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625
> > > > unreachable, length 92
> > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > OK, this shows that the client has requested both answers and the
> > > nameserver replied almost immediately (about 0.5ms later), but when
> > > the second reply arrives (to the AAAA), the client has already closed
> > > the listening port, despite only a few ms having passed. The only way
> > > I see this could happen is by "timing out". This suggests that
> > > something is wrong with telling time.
> > >
> > > Can you either put a breakpoint in __clock_gettime64 (this is the name
> > > you have to use for a breakpoint -- sorry I messed it up last time)
> > > and then see what it returns when you "finish" it and what's in the
> > > timespec struct after that? Or just write a test program to call
> > > clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or want to
> > > use the time64 symbol name here) and print the results (return value
> > > and contents of the timespec struct).
> > >
> > >
> > >
> > > >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto UDP
> > > > (17), length 84)
> > > >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA?
> > > google.com.
> > > > 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > > > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags [DF],
> > > proto
> > > > TCP (6), length 52)
> > > >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.], cksum
> > > > 0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options
> > > > [nop,nop,TS val 198422160 ecr 2351261566], length 0
> > > > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags [none],
> > > > proto TCP (6), length 52)
> > > >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.], cksum
> > > > 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val
> > > 2351306585
> > > > ecr 198377148], length 0
> > > > ^C
> > > > 7 packets captured
> > > > 7 packets received by filter
> > > > 0 packets dropped by kernel
> > >
> 
> 

> #include <time.h>
> #include <stdio.h>
> int main()
> {
> 	struct timespec ts;
> 	printf("%d", clock_gettime(CLOCK_REALTIME, &ts));
> 	printf(" %lld %.9ld\n", (long long)ts.tv_sec, ts.tv_nsec);
> }


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 15:53                                                     ` Rich Felker
  2022-02-17 16:05                                                       ` Rich Felker
@ 2022-02-17 16:17                                                       ` Satadru Pramanik
  1 sibling, 0 replies; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17 16:17 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 4942 bytes --]

I get an output of "403 4 000000000" when built against musl.
When built against glibc I get "0 1645114577 573811224"

On Thu, Feb 17, 2022 at 10:53 AM Rich Felker <dalias@aerifal.cx> wrote:

> On Thu, Feb 17, 2022 at 09:49:45AM -0500, Satadru Pramanik wrote:
> > Apologies for not being as familiar with gdb as I ought to be.
> > I used the __clock_gettime64 breakpoint and did a backtrace and finish
> > repeatedly.
> > I couldn't figure out how to best get the timespec struct info.
> >
> > Alternately if you want to throw out a sample test program for me to
> build
> > and run, and what gdb commands to run to get the right info, happy to do
> > that too.
> >
> > gdb output is attached.
>
> If gdb reported it correctly, clock_gettime returned 403, which should
> be impossible. It can only return 0 or -1. Incidentally, 403 is the
> syscall number for SYS_clock_gettime64, which suggests your kernel is
> simply *returning the syscall number* instead of -ENOSYS for syscalls
> that don't exist on it. Is this a stock kernel (3.8 IIRC) or does it
> have any sort of weird vendor patching? Any LSMs loaded?
>
> If you'd like to run a test just to make sure we're accurately seeing
> what's happening, the attached should work. It should print 0 followed
> by the current time in seconds and nanoseconds.
>
>
>
> > On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@aerifal.cx> wrote:
> >
> > > On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> > > > *This is a failure:*
> > > > tcpdump -i any -vvv host 192.168.0.115
> > > > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1),
> capture
> > > > size 262144 bytes
> > > > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto
> > > UDP
> > > > (17), length 56)
> > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A?
> google.com.
> > > (28)
> > > > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags [DF],
> > > proto
> > > > UDP (17), length 72)
> > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a ->
> > > 0x5c7d!]
> > > > 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110 (44)
> > > > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto
> > > UDP
> > > > (17), length 56)
> > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA?
> > > google.com.
> > > > (28)
> > > > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF],
> > > proto
> > > > UDP (17), length 84)
> > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 ->
> > > 0xb42f!]
> > > > 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
> > > > 2607:f8b0:4006:80d::200e (56)
> > > > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags
> [none],
> > > > proto ICMP (1), length 112)
> > > >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625
> > > > unreachable, length 92
> > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > OK, this shows that the client has requested both answers and the
> > > nameserver replied almost immediately (about 0.5ms later), but when
> > > the second reply arrives (to the AAAA), the client has already closed
> > > the listening port, despite only a few ms having passed. The only way
> > > I see this could happen is by "timing out". This suggests that
> > > something is wrong with telling time.
> > >
> > > Can you either put a breakpoint in __clock_gettime64 (this is the name
> > > you have to use for a breakpoint -- sorry I messed it up last time)
> > > and then see what it returns when you "finish" it and what's in the
> > > timespec struct after that? Or just write a test program to call
> > > clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or want to
> > > use the time64 symbol name here) and print the results (return value
> > > and contents of the timespec struct).
> > >
> > >
> > >
> > > >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto
> UDP
> > > > (17), length 84)
> > > >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA?
> > > google.com.
> > > > 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > > > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags [DF],
> > > proto
> > > > TCP (6), length 52)
> > > >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.],
> cksum
> > > > 0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options
> > > > [nop,nop,TS val 198422160 ecr 2351261566], length 0
> > > > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags
> [none],
> > > > proto TCP (6), length 52)
> > > >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.],
> cksum
> > > > 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val
> > > 2351306585
> > > > ecr 198377148], length 0
> > > > ^C
> > > > 7 packets captured
> > > > 7 packets received by filter
> > > > 0 packets dropped by kernel
> > >
>
>
>

[-- Attachment #2: Type: text/html, Size: 6842 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 16:05                                                       ` Rich Felker
@ 2022-02-17 16:36                                                         ` Satadru Pramanik
  2022-02-17 18:17                                                           ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17 16:36 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 6973 bytes --]

 This machine is a EOL Samsung Series 5 Chromebook
<https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/samsung-series-5-chromebook/>
code
named Alex
<https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/#:~:text=Series%205%20Chromebook-,Alex,-x86%2Dalex%20%26%20x86>
.
It is the target device for our i686 builds for Chromebrew.

It is running a 3.8.11 kernel, and I believe the kernel source for that is
here:
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/chromeos-3.8

Getting a signed kernel update for an EOL kernel for an EOL machine is
close to impossible from Google, so we're just trying to work around these
issues in userspace to maintain some functionality for any users who may
still be using the device.

The simplest workaround possible would be ideal. It is interesting though
that the sample program works fine when built against near-stock glibc
2.23, no?

Satadru

On Thu, Feb 17, 2022 at 11:05 AM Rich Felker <dalias@aerifal.cx> wrote:

> On Thu, Feb 17, 2022 at 10:53:52AM -0500, Rich Felker wrote:
> > On Thu, Feb 17, 2022 at 09:49:45AM -0500, Satadru Pramanik wrote:
> > > Apologies for not being as familiar with gdb as I ought to be.
> > > I used the __clock_gettime64 breakpoint and did a backtrace and finish
> > > repeatedly.
> > > I couldn't figure out how to best get the timespec struct info.
> > >
> > > Alternately if you want to throw out a sample test program for me to
> build
> > > and run, and what gdb commands to run to get the right info, happy to
> do
> > > that too.
> > >
> > > gdb output is attached.
> >
> > If gdb reported it correctly, clock_gettime returned 403, which should
> > be impossible. It can only return 0 or -1. Incidentally, 403 is the
> > syscall number for SYS_clock_gettime64, which suggests your kernel is
> > simply *returning the syscall number* instead of -ENOSYS for syscalls
> > that don't exist on it. Is this a stock kernel (3.8 IIRC) or does it
> > have any sort of weird vendor patching? Any LSMs loaded?
> >
> > If you'd like to run a test just to make sure we're accurately seeing
> > what's happening, the attached should work. It should print 0 followed
> > by the current time in seconds and nanoseconds.
>
> It looks like you hit the bug introduced in commit
> 554086d85e71f30abe46fc014fea31929a7c6a8a and fixed in commit
> 8142b215501f8b291a108a202b3a053a265b03dd. It looks like, since the
> former was a CVE fix, somebody backported it to the kernel you're
> using, but they failed to backport the fix-for-the-fix, so you have a
> kernel that operates dangerously incorrectly for syscall numbers it's
> unaware of.
>
> This really needs to be fixed in the kernel if you can. On our side
> (musl) we probably need to find out if such kernels are actually out
> in the wild, and if so, whether there's any reasonable way to detect
> the false success and treat it as failure.
>
> > > On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@aerifal.cx> wrote:
> > >
> > > > On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> > > > > *This is a failure:*
> > > > > tcpdump -i any -vvv host 192.168.0.115
> > > > > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1),
> capture
> > > > > size 262144 bytes
> > > > > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto
> > > > UDP
> > > > > (17), length 56)
> > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A?
> google.com.
> > > > (28)
> > > > > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags
> [DF],
> > > > proto
> > > > > UDP (17), length 72)
> > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a ->
> > > > 0x5c7d!]
> > > > > 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110
> (44)
> > > > > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> proto
> > > > UDP
> > > > > (17), length 56)
> > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA?
> > > > google.com.
> > > > > (28)
> > > > > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags
> [DF],
> > > > proto
> > > > > UDP (17), length 84)
> > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 ->
> > > > 0xb42f!]
> > > > > 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
> > > > > 2607:f8b0:4006:80d::200e (56)
> > > > > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags
> [none],
> > > > > proto ICMP (1), length 112)
> > > > >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625
> > > > > unreachable, length 92
> > > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > >
> > > > OK, this shows that the client has requested both answers and the
> > > > nameserver replied almost immediately (about 0.5ms later), but when
> > > > the second reply arrives (to the AAAA), the client has already closed
> > > > the listening port, despite only a few ms having passed. The only way
> > > > I see this could happen is by "timing out". This suggests that
> > > > something is wrong with telling time.
> > > >
> > > > Can you either put a breakpoint in __clock_gettime64 (this is the
> name
> > > > you have to use for a breakpoint -- sorry I messed it up last time)
> > > > and then see what it returns when you "finish" it and what's in the
> > > > timespec struct after that? Or just write a test program to call
> > > > clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or want to
> > > > use the time64 symbol name here) and print the results (return value
> > > > and contents of the timespec struct).
> > > >
> > > >
> > > >
> > > > >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto
> UDP
> > > > > (17), length 84)
> > > > >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA?
> > > > google.com.
> > > > > 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > > > > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags
> [DF],
> > > > proto
> > > > > TCP (6), length 52)
> > > > >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.],
> cksum
> > > > > 0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options
> > > > > [nop,nop,TS val 198422160 ecr 2351261566], length 0
> > > > > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags
> [none],
> > > > > proto TCP (6), length 52)
> > > > >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.],
> cksum
> > > > > 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val
> > > > 2351306585
> > > > > ecr 198377148], length 0
> > > > > ^C
> > > > > 7 packets captured
> > > > > 7 packets received by filter
> > > > > 0 packets dropped by kernel
> > > >
> >
> >
>
> > #include <time.h>
> > #include <stdio.h>
> > int main()
> > {
> >       struct timespec ts;
> >       printf("%d", clock_gettime(CLOCK_REALTIME, &ts));
> >       printf(" %lld %.9ld\n", (long long)ts.tv_sec, ts.tv_nsec);
> > }
>
>

[-- Attachment #2: Type: text/html, Size: 9545 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 16:36                                                         ` Satadru Pramanik
@ 2022-02-17 18:17                                                           ` Rich Felker
  2022-02-17 21:39                                                             ` Satadru Pramanik
       [not found]                                                             ` <CAFrh3J9CBGagzmnn9E7W0H15iiOzZEtg4YKfJ1qRwCtrjnw6tA@mail.gmail.com>
  0 siblings, 2 replies; 37+ messages in thread
From: Rich Felker @ 2022-02-17 18:17 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 8601 bytes --]

On Thu, Feb 17, 2022 at 11:36:31AM -0500, Satadru Pramanik wrote:
>  This machine is a EOL Samsung Series 5 Chromebook
> <https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/samsung-series-5-chromebook/>
> code
> named Alex
> <https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/#:~:text=Series%205%20Chromebook-,Alex,-x86%2Dalex%20%26%20x86>
> ..
> It is the target device for our i686 builds for Chromebrew.
> 
> It is running a 3.8.11 kernel, and I believe the kernel source for that is
> here:
> https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/chromeos-3.8
> 
> Getting a signed kernel update for an EOL kernel for an EOL machine is
> close to impossible from Google, so we're just trying to work around these

If these are machines you're in control of, you may be able to load a
module to patch it. If this is something you're deploying to users
stuck on that kernel who don't want to fix their systems, then of
course that's not a practical option.

> issues in userspace to maintain some functionality for any users who may
> still be using the device.
> 
> The simplest workaround possible would be ideal.

If you're shipping binaries specifically for these devices, the
simplest fix is just to emulate the failure that should happen in the
kernel in userspace, using the attached patch. DO NOT deploy this
patch in binaries meant to be used on modern systems, since they will
break when Y2038 rolls around. (Your old Chromebooks will break then
too.)

> It is interesting though
> that the sample program works fine when built against near-stock glibc
> 2.23, no?

No. If your kernel has a bug that makes something behave wildly wrong,
whether you do or don't see that as visible breakage with a particular
piece of software is not particularly interesting.

I'm pretty sure, however, that you just haven't tested enough to see
any failures. glibc 2.23 is from 2016, so any functionality in it
using syscalls added after 2011 (3.8 kernel) is going to blow up
badly, thinking the syscall succeeded and returned some positive value
when actually the kernel lacks it.

In the particular case of clock_gettime, it's just that your glibc
2.23 has a hard Y2038 EOL and does not use/support the missing time64
syscalls.


> On Thu, Feb 17, 2022 at 11:05 AM Rich Felker <dalias@aerifal.cx> wrote:
> 
> > On Thu, Feb 17, 2022 at 10:53:52AM -0500, Rich Felker wrote:
> > > On Thu, Feb 17, 2022 at 09:49:45AM -0500, Satadru Pramanik wrote:
> > > > Apologies for not being as familiar with gdb as I ought to be.
> > > > I used the __clock_gettime64 breakpoint and did a backtrace and finish
> > > > repeatedly.
> > > > I couldn't figure out how to best get the timespec struct info.
> > > >
> > > > Alternately if you want to throw out a sample test program for me to
> > build
> > > > and run, and what gdb commands to run to get the right info, happy to
> > do
> > > > that too.
> > > >
> > > > gdb output is attached.
> > >
> > > If gdb reported it correctly, clock_gettime returned 403, which should
> > > be impossible. It can only return 0 or -1. Incidentally, 403 is the
> > > syscall number for SYS_clock_gettime64, which suggests your kernel is
> > > simply *returning the syscall number* instead of -ENOSYS for syscalls
> > > that don't exist on it. Is this a stock kernel (3.8 IIRC) or does it
> > > have any sort of weird vendor patching? Any LSMs loaded?
> > >
> > > If you'd like to run a test just to make sure we're accurately seeing
> > > what's happening, the attached should work. It should print 0 followed
> > > by the current time in seconds and nanoseconds.
> >
> > It looks like you hit the bug introduced in commit
> > 554086d85e71f30abe46fc014fea31929a7c6a8a and fixed in commit
> > 8142b215501f8b291a108a202b3a053a265b03dd. It looks like, since the
> > former was a CVE fix, somebody backported it to the kernel you're
> > using, but they failed to backport the fix-for-the-fix, so you have a
> > kernel that operates dangerously incorrectly for syscall numbers it's
> > unaware of.
> >
> > This really needs to be fixed in the kernel if you can. On our side
> > (musl) we probably need to find out if such kernels are actually out
> > in the wild, and if so, whether there's any reasonable way to detect
> > the false success and treat it as failure.
> >
> > > > On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@aerifal.cx> wrote:
> > > >
> > > > > On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> > > > > > *This is a failure:*
> > > > > > tcpdump -i any -vvv host 192.168.0.115
> > > > > > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1),
> > capture
> > > > > > size 262144 bytes
> > > > > > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> > proto
> > > > > UDP
> > > > > > (17), length 56)
> > > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A?
> > google.com.
> > > > > (28)
> > > > > > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags
> > [DF],
> > > > > proto
> > > > > > UDP (17), length 72)
> > > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a ->
> > > > > 0x5c7d!]
> > > > > > 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110
> > (44)
> > > > > > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF],
> > proto
> > > > > UDP
> > > > > > (17), length 56)
> > > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA?
> > > > > google.com.
> > > > > > (28)
> > > > > > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags
> > [DF],
> > > > > proto
> > > > > > UDP (17), length 84)
> > > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216 ->
> > > > > 0xb42f!]
> > > > > > 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
> > > > > > 2607:f8b0:4006:80d::200e (56)
> > > > > > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags
> > [none],
> > > > > > proto ICMP (1), length 112)
> > > > > >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port 60625
> > > > > > unreachable, length 92
> > > > >   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > >
> > > > > OK, this shows that the client has requested both answers and the
> > > > > nameserver replied almost immediately (about 0.5ms later), but when
> > > > > the second reply arrives (to the AAAA), the client has already closed
> > > > > the listening port, despite only a few ms having passed. The only way
> > > > > I see this could happen is by "timing out". This suggests that
> > > > > something is wrong with telling time.
> > > > >
> > > > > Can you either put a breakpoint in __clock_gettime64 (this is the
> > name
> > > > > you have to use for a breakpoint -- sorry I messed it up last time)
> > > > > and then see what it returns when you "finish" it and what's in the
> > > > > timespec struct after that? Or just write a test program to call
> > > > > clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or want to
> > > > > use the time64 symbol name here) and print the results (return value
> > > > > and contents of the timespec struct).
> > > > >
> > > > >
> > > > >
> > > > > >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF], proto
> > UDP
> > > > > > (17), length 84)
> > > > > >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q: AAAA?
> > > > > google.com.
> > > > > > 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > > > > > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags
> > [DF],
> > > > > proto
> > > > > > TCP (6), length 52)
> > > > > >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags [.],
> > cksum
> > > > > > 0xa666 (correct), seq 1466707759, ack 3358943837, win 115, options
> > > > > > [nop,nop,TS val 198422160 ecr 2351261566], length 0
> > > > > > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags
> > [none],
> > > > > > proto TCP (6), length 52)
> > > > > >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags [.],
> > cksum
> > > > > > 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS val
> > > > > 2351306585
> > > > > > ecr 198377148], length 0
> > > > > > ^C
> > > > > > 7 packets captured
> > > > > > 7 packets received by filter
> > > > > > 0 packets dropped by kernel
> > > > >
> > >
> > >
> >
> > > #include <time.h>
> > > #include <stdio.h>
> > > int main()
> > > {
> > >       struct timespec ts;
> > >       printf("%d", clock_gettime(CLOCK_REALTIME, &ts));
> > >       printf(" %lld %.9ld\n", (long long)ts.tv_sec, ts.tv_nsec);
> > > }
> >
> >

[-- Attachment #2: broken_chromeos_kernel_hack.diff --]
[-- Type: text/plain, Size: 2379 bytes --]

diff --git a/arch/i386/syscall_arch.h b/arch/i386/syscall_arch.h
index f92b7aa9..b7c77863 100644
--- a/arch/i386/syscall_arch.h
+++ b/arch/i386/syscall_arch.h
@@ -14,6 +14,7 @@
 
 static inline long __syscall0(long n)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 	__asm__ __volatile__ (SYSCALL_INSNS : "=a"(__ret) : "a"(n) : "memory");
 	return __ret;
@@ -21,6 +22,7 @@ static inline long __syscall0(long n)
 
 static inline long __syscall1(long n, long a1)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 	__asm__ __volatile__ (SYSCALL_INSNS_12 : "=a"(__ret) : "a"(n), "d"(a1) : "memory");
 	return __ret;
@@ -28,6 +30,7 @@ static inline long __syscall1(long n, long a1)
 
 static inline long __syscall2(long n, long a1, long a2)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 	__asm__ __volatile__ (SYSCALL_INSNS_12 : "=a"(__ret) : "a"(n), "d"(a1), "c"(a2) : "memory");
 	return __ret;
@@ -35,6 +38,7 @@ static inline long __syscall2(long n, long a1, long a2)
 
 static inline long __syscall3(long n, long a1, long a2, long a3)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 #if !defined(__PIC__) || !defined(BROKEN_EBX_ASM)
 	__asm__ __volatile__ (SYSCALL_INSNS : "=a"(__ret) : "a"(n), "b"(a1), "c"(a2), "d"(a3) : "memory");
@@ -46,6 +50,7 @@ static inline long __syscall3(long n, long a1, long a2, long a3)
 
 static inline long __syscall4(long n, long a1, long a2, long a3, long a4)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 #if !defined(__PIC__) || !defined(BROKEN_EBX_ASM)
 	__asm__ __volatile__ (SYSCALL_INSNS : "=a"(__ret) : "a"(n), "b"(a1), "c"(a2), "d"(a3), "S"(a4) : "memory");
@@ -57,6 +62,7 @@ static inline long __syscall4(long n, long a1, long a2, long a3, long a4)
 
 static inline long __syscall5(long n, long a1, long a2, long a3, long a4, long a5)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 #if !defined(__PIC__) || !defined(BROKEN_EBX_ASM)
 	__asm__ __volatile__ (SYSCALL_INSNS
@@ -70,6 +76,7 @@ static inline long __syscall5(long n, long a1, long a2, long a3, long a4, long a
 
 static inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 #if !defined(__PIC__) || !defined(BROKEN_EBX_ASM)
 	__asm__ __volatile__ ("pushl %7 ; push %%ebp ; mov 4(%%esp),%%ebp ; " SYSCALL_INSNS " ; pop %%ebp ; add $4,%%esp"

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 18:17                                                           ` Rich Felker
@ 2022-02-17 21:39                                                             ` Satadru Pramanik
  2022-02-17 21:42                                                               ` Satadru Pramanik
       [not found]                                                             ` <CAFrh3J9CBGagzmnn9E7W0H15iiOzZEtg4YKfJ1qRwCtrjnw6tA@mail.gmail.com>
  1 sibling, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17 21:39 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 10488 bytes --]

I'm noticing one small issue with this suggested patch:

In file included from ../src_musl/src/internal/syscall.h:6,
                 from ../src_musl/src/dirent/opendir.c:6:
../src_musl/arch/i386/syscall_arch.h: In function ‘__syscall0’:
../src_musl/arch/i386/syscall_arch.h:17:28: error: ‘ENOSYS’ undeclared
(first use in this function)
   17 |         if (n>350) return -ENOSYS;
      |                            ^~~~~~
../src_musl/arch/i386/syscall_arch.h:17:28: note: each undeclared
identifier is reported only once for each function it appears in
../src_musl/arch/i386/syscall_arch.h: In function ‘__syscall1’:
../src_musl/arch/i386/syscall_arch.h:25:28: error: ‘ENOSYS’ undeclared
(first use in this function)
   25 |         if (n>350) return -ENOSYS;
      |                            ^~~~~~
../src_musl/arch/i386/syscall_arch.h: In function ‘__syscall2’:
../src_musl/arch/i386/syscall_arch.h:33:28: error: ‘ENOSYS’ undeclared
(first use in this function)
   33 |         if (n>350) return -ENOSYS;

Should we be adding an include or just defining this locally?

Satadru

On Thu, Feb 17, 2022 at 1:17 PM Rich Felker <dalias@aerifal.cx> wrote:

> On Thu, Feb 17, 2022 at 11:36:31AM -0500, Satadru Pramanik wrote:
> >  This machine is a EOL Samsung Series 5 Chromebook
> > <
> https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/samsung-series-5-chromebook/
> >
> > code
> > named Alex
> > <
> https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/#:~:text=Series%205%20Chromebook-,Alex,-x86%2Dalex%20%26%20x86
> >
> > ..
> > It is the target device for our i686 builds for Chromebrew.
> >
> > It is running a 3.8.11 kernel, and I believe the kernel source for that
> is
> > here:
> >
> https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/chromeos-3.8
> >
> > Getting a signed kernel update for an EOL kernel for an EOL machine is
> > close to impossible from Google, so we're just trying to work around
> these
>
> If these are machines you're in control of, you may be able to load a
> module to patch it. If this is something you're deploying to users
> stuck on that kernel who don't want to fix their systems, then of
> course that's not a practical option.
>
> > issues in userspace to maintain some functionality for any users who may
> > still be using the device.
> >
> > The simplest workaround possible would be ideal.
>
> If you're shipping binaries specifically for these devices, the
> simplest fix is just to emulate the failure that should happen in the
> kernel in userspace, using the attached patch. DO NOT deploy this
> patch in binaries meant to be used on modern systems, since they will
> break when Y2038 rolls around. (Your old Chromebooks will break then
> too.)
>
> > It is interesting though
> > that the sample program works fine when built against near-stock glibc
> > 2.23, no?
>
> No. If your kernel has a bug that makes something behave wildly wrong,
> whether you do or don't see that as visible breakage with a particular
> piece of software is not particularly interesting.
>
> I'm pretty sure, however, that you just haven't tested enough to see
> any failures. glibc 2.23 is from 2016, so any functionality in it
> using syscalls added after 2011 (3.8 kernel) is going to blow up
> badly, thinking the syscall succeeded and returned some positive value
> when actually the kernel lacks it.
>
> In the particular case of clock_gettime, it's just that your glibc
> 2.23 has a hard Y2038 EOL and does not use/support the missing time64
> syscalls.
>
>
> > On Thu, Feb 17, 2022 at 11:05 AM Rich Felker <dalias@aerifal.cx> wrote:
> >
> > > On Thu, Feb 17, 2022 at 10:53:52AM -0500, Rich Felker wrote:
> > > > On Thu, Feb 17, 2022 at 09:49:45AM -0500, Satadru Pramanik wrote:
> > > > > Apologies for not being as familiar with gdb as I ought to be.
> > > > > I used the __clock_gettime64 breakpoint and did a backtrace and
> finish
> > > > > repeatedly.
> > > > > I couldn't figure out how to best get the timespec struct info.
> > > > >
> > > > > Alternately if you want to throw out a sample test program for me
> to
> > > build
> > > > > and run, and what gdb commands to run to get the right info, happy
> to
> > > do
> > > > > that too.
> > > > >
> > > > > gdb output is attached.
> > > >
> > > > If gdb reported it correctly, clock_gettime returned 403, which
> should
> > > > be impossible. It can only return 0 or -1. Incidentally, 403 is the
> > > > syscall number for SYS_clock_gettime64, which suggests your kernel is
> > > > simply *returning the syscall number* instead of -ENOSYS for syscalls
> > > > that don't exist on it. Is this a stock kernel (3.8 IIRC) or does it
> > > > have any sort of weird vendor patching? Any LSMs loaded?
> > > >
> > > > If you'd like to run a test just to make sure we're accurately seeing
> > > > what's happening, the attached should work. It should print 0
> followed
> > > > by the current time in seconds and nanoseconds.
> > >
> > > It looks like you hit the bug introduced in commit
> > > 554086d85e71f30abe46fc014fea31929a7c6a8a and fixed in commit
> > > 8142b215501f8b291a108a202b3a053a265b03dd. It looks like, since the
> > > former was a CVE fix, somebody backported it to the kernel you're
> > > using, but they failed to backport the fix-for-the-fix, so you have a
> > > kernel that operates dangerously incorrectly for syscall numbers it's
> > > unaware of.
> > >
> > > This really needs to be fixed in the kernel if you can. On our side
> > > (musl) we probably need to find out if such kernels are actually out
> > > in the wild, and if so, whether there's any reasonable way to detect
> > > the false success and treat it as failure.
> > >
> > > > > On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@aerifal.cx>
> wrote:
> > > > >
> > > > > > On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik wrote:
> > > > > > > *This is a failure:*
> > > > > > > tcpdump -i any -vvv host 192.168.0.115
> > > > > > > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked
> v1),
> > > capture
> > > > > > > size 262144 bytes
> > > > > > > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags
> [DF],
> > > proto
> > > > > > UDP
> > > > > > > (17), length 56)
> > > > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A?
> > > google.com.
> > > > > > (28)
> > > > > > > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags
> > > [DF],
> > > > > > proto
> > > > > > > UDP (17), length 72)
> > > > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x820a
> ->
> > > > > > 0x5c7d!]
> > > > > > > 0 q: A? google.com. 1/0/0 google.com. [2m15s] A 142.250.80.110
> > > (44)
> > > > > > > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags
> [DF],
> > > proto
> > > > > > UDP
> > > > > > > (17), length 56)
> > > > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA?
> > > > > > google.com.
> > > > > > > (28)
> > > > > > > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags
> > > [DF],
> > > > > > proto
> > > > > > > UDP (17), length 84)
> > > > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum 0x8216
> ->
> > > > > > 0xb42f!]
> > > > > > > 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
> > > > > > > 2607:f8b0:4006:80d::200e (56)
> > > > > > > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0, flags
> > > [none],
> > > > > > > proto ICMP (1), length 112)
> > > > > > >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port
> 60625
> > > > > > > unreachable, length 92
> > > > > >
>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > >
> > > > > > OK, this shows that the client has requested both answers and the
> > > > > > nameserver replied almost immediately (about 0.5ms later), but
> when
> > > > > > the second reply arrives (to the AAAA), the client has already
> closed
> > > > > > the listening port, despite only a few ms having passed. The
> only way
> > > > > > I see this could happen is by "timing out". This suggests that
> > > > > > something is wrong with telling time.
> > > > > >
> > > > > > Can you either put a breakpoint in __clock_gettime64 (this is the
> > > name
> > > > > > you have to use for a breakpoint -- sorry I messed it up last
> time)
> > > > > > and then see what it returns when you "finish" it and what's in
> the
> > > > > > timespec struct after that? Or just write a test program to call
> > > > > > clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or
> want to
> > > > > > use the time64 symbol name here) and print the results (return
> value
> > > > > > and contents of the timespec struct).
> > > > > >
> > > > > >
> > > > > >
> > > > > > >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF],
> proto
> > > UDP
> > > > > > > (17), length 84)
> > > > > > >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q:
> AAAA?
> > > > > > google.com.
> > > > > > > 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
> > > > > > > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags
> > > [DF],
> > > > > > proto
> > > > > > > TCP (6), length 52)
> > > > > > >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags
> [.],
> > > cksum
> > > > > > > 0xa666 (correct), seq 1466707759, ack 3358943837, win 115,
> options
> > > > > > > [nop,nop,TS val 198422160 ecr 2351261566], length 0
> > > > > > > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0, flags
> > > [none],
> > > > > > > proto TCP (6), length 52)
> > > > > > >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags
> [.],
> > > cksum
> > > > > > > 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS
> val
> > > > > > 2351306585
> > > > > > > ecr 198377148], length 0
> > > > > > > ^C
> > > > > > > 7 packets captured
> > > > > > > 7 packets received by filter
> > > > > > > 0 packets dropped by kernel
> > > > > >
> > > >
> > > >
> > >
> > > > #include <time.h>
> > > > #include <stdio.h>
> > > > int main()
> > > > {
> > > >       struct timespec ts;
> > > >       printf("%d", clock_gettime(CLOCK_REALTIME, &ts));
> > > >       printf(" %lld %.9ld\n", (long long)ts.tv_sec, ts.tv_nsec);
> > > > }
> > >
> > >
>

[-- Attachment #2: Type: text/html, Size: 14427 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 21:39                                                             ` Satadru Pramanik
@ 2022-02-17 21:42                                                               ` Satadru Pramanik
  0 siblings, 0 replies; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17 21:42 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 11019 bytes --]

Just adding this to arch/i386/syscall_arch.h solves the build...
#define ENOSYS          38      /* Invalid system call number */

On Thu, Feb 17, 2022 at 4:39 PM Satadru Pramanik <satadru@gmail.com> wrote:

> I'm noticing one small issue with this suggested patch:
>
> In file included from ../src_musl/src/internal/syscall.h:6,
>                  from ../src_musl/src/dirent/opendir.c:6:
> ../src_musl/arch/i386/syscall_arch.h: In function ‘__syscall0’:
> ../src_musl/arch/i386/syscall_arch.h:17:28: error: ‘ENOSYS’ undeclared
> (first use in this function)
>    17 |         if (n>350) return -ENOSYS;
>       |                            ^~~~~~
> ../src_musl/arch/i386/syscall_arch.h:17:28: note: each undeclared
> identifier is reported only once for each function it appears in
> ../src_musl/arch/i386/syscall_arch.h: In function ‘__syscall1’:
> ../src_musl/arch/i386/syscall_arch.h:25:28: error: ‘ENOSYS’ undeclared
> (first use in this function)
>    25 |         if (n>350) return -ENOSYS;
>       |                            ^~~~~~
> ../src_musl/arch/i386/syscall_arch.h: In function ‘__syscall2’:
> ../src_musl/arch/i386/syscall_arch.h:33:28: error: ‘ENOSYS’ undeclared
> (first use in this function)
>    33 |         if (n>350) return -ENOSYS;
>
> Should we be adding an include or just defining this locally?
>
> Satadru
>
> On Thu, Feb 17, 2022 at 1:17 PM Rich Felker <dalias@aerifal.cx> wrote:
>
>> On Thu, Feb 17, 2022 at 11:36:31AM -0500, Satadru Pramanik wrote:
>> >  This machine is a EOL Samsung Series 5 Chromebook
>> > <
>> https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/samsung-series-5-chromebook/
>> >
>> > code
>> > named Alex
>> > <
>> https://www.chromium.org/chromium-os/developer-information-for-chrome-os-devices/#:~:text=Series%205%20Chromebook-,Alex,-x86%2Dalex%20%26%20x86
>> >
>> > ..
>> > It is the target device for our i686 builds for Chromebrew.
>> >
>> > It is running a 3.8.11 kernel, and I believe the kernel source for that
>> is
>> > here:
>> >
>> https://chromium.googlesource.com/chromiumos/third_party/kernel/+/refs/heads/chromeos-3.8
>> >
>> > Getting a signed kernel update for an EOL kernel for an EOL machine is
>> > close to impossible from Google, so we're just trying to work around
>> these
>>
>> If these are machines you're in control of, you may be able to load a
>> module to patch it. If this is something you're deploying to users
>> stuck on that kernel who don't want to fix their systems, then of
>> course that's not a practical option.
>>
>> > issues in userspace to maintain some functionality for any users who may
>> > still be using the device.
>> >
>> > The simplest workaround possible would be ideal.
>>
>> If you're shipping binaries specifically for these devices, the
>> simplest fix is just to emulate the failure that should happen in the
>> kernel in userspace, using the attached patch. DO NOT deploy this
>> patch in binaries meant to be used on modern systems, since they will
>> break when Y2038 rolls around. (Your old Chromebooks will break then
>> too.)
>>
>> > It is interesting though
>> > that the sample program works fine when built against near-stock glibc
>> > 2.23, no?
>>
>> No. If your kernel has a bug that makes something behave wildly wrong,
>> whether you do or don't see that as visible breakage with a particular
>> piece of software is not particularly interesting.
>>
>> I'm pretty sure, however, that you just haven't tested enough to see
>> any failures. glibc 2.23 is from 2016, so any functionality in it
>> using syscalls added after 2011 (3.8 kernel) is going to blow up
>> badly, thinking the syscall succeeded and returned some positive value
>> when actually the kernel lacks it.
>>
>> In the particular case of clock_gettime, it's just that your glibc
>> 2.23 has a hard Y2038 EOL and does not use/support the missing time64
>> syscalls.
>>
>>
>> > On Thu, Feb 17, 2022 at 11:05 AM Rich Felker <dalias@aerifal.cx> wrote:
>> >
>> > > On Thu, Feb 17, 2022 at 10:53:52AM -0500, Rich Felker wrote:
>> > > > On Thu, Feb 17, 2022 at 09:49:45AM -0500, Satadru Pramanik wrote:
>> > > > > Apologies for not being as familiar with gdb as I ought to be.
>> > > > > I used the __clock_gettime64 breakpoint and did a backtrace and
>> finish
>> > > > > repeatedly.
>> > > > > I couldn't figure out how to best get the timespec struct info.
>> > > > >
>> > > > > Alternately if you want to throw out a sample test program for me
>> to
>> > > build
>> > > > > and run, and what gdb commands to run to get the right info,
>> happy to
>> > > do
>> > > > > that too.
>> > > > >
>> > > > > gdb output is attached.
>> > > >
>> > > > If gdb reported it correctly, clock_gettime returned 403, which
>> should
>> > > > be impossible. It can only return 0 or -1. Incidentally, 403 is the
>> > > > syscall number for SYS_clock_gettime64, which suggests your kernel
>> is
>> > > > simply *returning the syscall number* instead of -ENOSYS for
>> syscalls
>> > > > that don't exist on it. Is this a stock kernel (3.8 IIRC) or does it
>> > > > have any sort of weird vendor patching? Any LSMs loaded?
>> > > >
>> > > > If you'd like to run a test just to make sure we're accurately
>> seeing
>> > > > what's happening, the attached should work. It should print 0
>> followed
>> > > > by the current time in seconds and nanoseconds.
>> > >
>> > > It looks like you hit the bug introduced in commit
>> > > 554086d85e71f30abe46fc014fea31929a7c6a8a and fixed in commit
>> > > 8142b215501f8b291a108a202b3a053a265b03dd. It looks like, since the
>> > > former was a CVE fix, somebody backported it to the kernel you're
>> > > using, but they failed to backport the fix-for-the-fix, so you have a
>> > > kernel that operates dangerously incorrectly for syscall numbers it's
>> > > unaware of.
>> > >
>> > > This really needs to be fixed in the kernel if you can. On our side
>> > > (musl) we probably need to find out if such kernels are actually out
>> > > in the wild, and if so, whether there's any reasonable way to detect
>> > > the false success and treat it as failure.
>> > >
>> > > > > On Thu, Feb 17, 2022 at 8:46 AM Rich Felker <dalias@aerifal.cx>
>> wrote:
>> > > > >
>> > > > > > On Thu, Feb 17, 2022 at 08:30:47AM -0500, Satadru Pramanik
>> wrote:
>> > > > > > > *This is a failure:*
>> > > > > > > tcpdump -i any -vvv host 192.168.0.115
>> > > > > > > tcpdump: listening on any, link-type LINUX_SLL (Linux cooked
>> v1),
>> > > capture
>> > > > > > > size 262144 bytes
>> > > > > > > 08:29:38.043849 IP (tos 0x0, ttl 64, id 0, offset 0, flags
>> [DF],
>> > > proto
>> > > > > > UDP
>> > > > > > > (17), length 56)
>> > > > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ A?
>> > > google.com.
>> > > > > > (28)
>> > > > > > > 08:29:38.044237 IP (tos 0x0, ttl 64, id 11463, offset 0, flags
>> > > [DF],
>> > > > > > proto
>> > > > > > > UDP (17), length 72)
>> > > > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum
>> 0x820a ->
>> > > > > > 0x5c7d!]
>> > > > > > > 0 q: A? google.com. 1/0/0 google.com. [2m15s] A
>> 142.250.80.110
>> > > (44)
>> > > > > > > 08:29:38.047754 IP (tos 0x0, ttl 64, id 0, offset 0, flags
>> [DF],
>> > > proto
>> > > > > > UDP
>> > > > > > > (17), length 56)
>> > > > > > >     192.168.0.115.60625 > office.lan.53: [udp sum ok] 0+ AAAA?
>> > > > > > google.com.
>> > > > > > > (28)
>> > > > > > > 08:29:38.048078 IP (tos 0x0, ttl 64, id 11464, offset 0, flags
>> > > [DF],
>> > > > > > proto
>> > > > > > > UDP (17), length 84)
>> > > > > > >     office.lan.53 > 192.168.0.115.60625: [bad udp cksum
>> 0x8216 ->
>> > > > > > 0xb42f!]
>> > > > > > > 0 q: AAAA? google.com. 1/0/0 google.com. [4m26s] AAAA
>> > > > > > > 2607:f8b0:4006:80d::200e (56)
>> > > > > > > 08:29:38.048955 IP (tos 0xc0, ttl 64, id 59728, offset 0,
>> flags
>> > > [none],
>> > > > > > > proto ICMP (1), length 112)
>> > > > > > >     192.168.0.115 > office.lan: ICMP 192.168.0.115 udp port
>> 60625
>> > > > > > > unreachable, length 92
>> > > > > >
>>  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> > > > > >
>> > > > > > OK, this shows that the client has requested both answers and
>> the
>> > > > > > nameserver replied almost immediately (about 0.5ms later), but
>> when
>> > > > > > the second reply arrives (to the AAAA), the client has already
>> closed
>> > > > > > the listening port, despite only a few ms having passed. The
>> only way
>> > > > > > I see this could happen is by "timing out". This suggests that
>> > > > > > something is wrong with telling time.
>> > > > > >
>> > > > > > Can you either put a breakpoint in __clock_gettime64 (this is
>> the
>> > > name
>> > > > > > you have to use for a breakpoint -- sorry I messed it up last
>> time)
>> > > > > > and then see what it returns when you "finish" it and what's in
>> the
>> > > > > > timespec struct after that? Or just write a test program to call
>> > > > > > clock_gettime(CLOCK_REALTIME, &ts) (note: you do NOT need or
>> want to
>> > > > > > use the time64 symbol name here) and print the results (return
>> value
>> > > > > > and contents of the timespec struct).
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > >         IP (tos 0x0, ttl 64, id 11464, offset 0, flags [DF],
>> proto
>> > > UDP
>> > > > > > > (17), length 84)
>> > > > > > >     office.lan.53 > 192.168.0.115.60625: [udp sum ok] 0 q:
>> AAAA?
>> > > > > > google.com.
>> > > > > > > 1/0/0 google.com. [4m26s] AAAA 2607:f8b0:4006:80d::200e (56)
>> > > > > > > 08:29:39.476101 IP (tos 0x0, ttl 64, id 12690, offset 0, flags
>> > > [DF],
>> > > > > > proto
>> > > > > > > TCP (6), length 52)
>> > > > > > >     192.168.0.115.51204 > lga34s35-in-f3.1e100.net.80: Flags
>> [.],
>> > > cksum
>> > > > > > > 0xa666 (correct), seq 1466707759, ack 3358943837, win 115,
>> options
>> > > > > > > [nop,nop,TS val 198422160 ecr 2351261566], length 0
>> > > > > > > 08:29:39.478914 IP (tos 0x80, ttl 122, id 6227, offset 0,
>> flags
>> > > [none],
>> > > > > > > proto TCP (6), length 52)
>> > > > > > >     lga34s35-in-f3.1e100.net.80 > 192.168.0.115.51204: Flags
>> [.],
>> > > cksum
>> > > > > > > 0xa5b7 (correct), seq 1, ack 1, win 282, options [nop,nop,TS
>> val
>> > > > > > 2351306585
>> > > > > > > ecr 198377148], length 0
>> > > > > > > ^C
>> > > > > > > 7 packets captured
>> > > > > > > 7 packets received by filter
>> > > > > > > 0 packets dropped by kernel
>> > > > > >
>> > > >
>> > > >
>> > >
>> > > > #include <time.h>
>> > > > #include <stdio.h>
>> > > > int main()
>> > > > {
>> > > >       struct timespec ts;
>> > > >       printf("%d", clock_gettime(CLOCK_REALTIME, &ts));
>> > > >       printf(" %lld %.9ld\n", (long long)ts.tv_sec, ts.tv_nsec);
>> > > > }
>> > >
>> > >
>>
>

[-- Attachment #2: Type: text/html, Size: 14947 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Fwd: [musl] Re: musl getaddr info breakage on older kernels
       [not found]                                                             ` <CAFrh3J9CBGagzmnn9E7W0H15iiOzZEtg4YKfJ1qRwCtrjnw6tA@mail.gmail.com>
@ 2022-02-17 21:43                                                               ` Satadru Pramanik
  2022-02-17 22:48                                                                 ` Rich Felker
  0 siblings, 1 reply; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-17 21:43 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2563 bytes --]

>
>
> > Getting a signed kernel update for an EOL kernel for an EOL machine is
> > close to impossible from Google, so we're just trying to work around
> these
>
> If these are machines you're in control of, you may be able to load a
> module to patch it. If this is something you're deploying to users
> stuck on that kernel who don't want to fix their systems, then of
> course that's not a practical option.
>
> Deploying kernel modules to EOL machines is probably not worth doing at
this point. One might argue this is just a pointless exercise in preserving
legacy technology!


> issues in userspace to maintain some functionality for any users who may
> > still be using the device.
> >
> > The simplest workaround possible would be ideal.
>
> If you're shipping binaries specifically for these devices, the
> simplest fix is just to emulate the failure that should happen in the
> kernel in userspace, using the attached patch. DO NOT deploy this
> patch in binaries meant to be used on modern systems, since they will
> break when Y2038 rolls around. (Your old Chromebooks will break then
> too.)
>
> I'll let the next generation of preservationists worry about the Y2038
issues. Thanks for the patch. I'll build that now with the
https://git.musl-libc.org/cgit/musl/patch/?id=c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91
reversal patch for the i686 machines, and see if that fixes the issue. If
so, we can consider the matter closed, unless you want to suggest a
modification of the syscall patch to handle older working kernels which
might avoid the need for the patch when used with older systems.


> It is interesting though
> > that the sample program works fine when built against near-stock glibc
> > 2.23, no?
>
> No. If your kernel has a bug that makes something behave wildly wrong,
> whether you do or don't see that as visible breakage with a particular
> piece of software is not particularly interesting.
>
> I'm pretty sure, however, that you just haven't tested enough to see
> any failures. glibc 2.23 is from 2016, so any functionality in it
> using syscalls added after 2011 (3.8 kernel) is going to blow up
> badly, thinking the syscall succeeded and returned some positive value
> when actually the kernel lacks it.
>
> In the particular case of clock_gettime, it's just that your glibc
> 2.23 has a hard Y2038 EOL and does not use/support the missing time64
> syscalls.
>
>
Duly noted.

Thanks so much for being patient while I got you enough information to fix
this!

On behalf of the entire Chromebrew community, thank you!

Satadru

[-- Attachment #2: Type: text/html, Size: 3575 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Fwd: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 21:43                                                               ` Fwd: " Satadru Pramanik
@ 2022-02-17 22:48                                                                 ` Rich Felker
  2022-02-18  1:25                                                                   ` Satadru Pramanik
  0 siblings, 1 reply; 37+ messages in thread
From: Rich Felker @ 2022-02-17 22:48 UTC (permalink / raw)
  To: Satadru Pramanik; +Cc: musl

On Thu, Feb 17, 2022 at 04:43:51PM -0500, Satadru Pramanik wrote:
> > issues in userspace to maintain some functionality for any users who may
> > > still be using the device.
> > >
> > > The simplest workaround possible would be ideal.
> >
> > If you're shipping binaries specifically for these devices, the
> > simplest fix is just to emulate the failure that should happen in the
> > kernel in userspace, using the attached patch. DO NOT deploy this
> > patch in binaries meant to be used on modern systems, since they will
> > break when Y2038 rolls around. (Your old Chromebooks will break then
> > too.)
> >
> I'll let the next generation of preservationists worry about the Y2038
> issues. Thanks for the patch. I'll build that now with the
> https://git.musl-libc.org/cgit/musl/patch/?id=c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91
> reversal patch for the i686 machines, and see if that fixes the issue. If

That reversal should no longer be needed with the other patch, which
is a much bigger hammer. Reversing that just got rid of using the new
socket-related syscalls; the new hack patch gets rid of using all new
syscalls.

> so, we can consider the matter closed, unless you want to suggest a
> modification of the syscall patch to handle older working kernels which
> might avoid the need for the patch when used with older systems.

The patch is fine for older broken or older working systems. It just
suppresses the ability to use any syscall added after Linux 3.8, so
it's a bad idea to use on *newer* systems.

> > It is interesting though
> > > that the sample program works fine when built against near-stock glibc
> > > 2.23, no?
> >
> > No. If your kernel has a bug that makes something behave wildly wrong,
> > whether you do or don't see that as visible breakage with a particular
> > piece of software is not particularly interesting.
> >
> > I'm pretty sure, however, that you just haven't tested enough to see
> > any failures. glibc 2.23 is from 2016, so any functionality in it
> > using syscalls added after 2011 (3.8 kernel) is going to blow up
> > badly, thinking the syscall succeeded and returned some positive value
> > when actually the kernel lacks it.
> >
> > In the particular case of clock_gettime, it's just that your glibc
> > 2.23 has a hard Y2038 EOL and does not use/support the missing time64
> > syscalls.
> >
> >
> Duly noted.
> 
> Thanks so much for being patient while I got you enough information to fix
> this!
> 
> On behalf of the entire Chromebrew community, thank you!

And thanks for reporting, running tests, and following through on
this. It will be useful to know this is an issue others might hit, and
to be able to check other old systems that might have backported the
patch with the bug.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: Fwd: [musl] Re: musl getaddr info breakage on older kernels
  2022-02-17 22:48                                                                 ` Rich Felker
@ 2022-02-18  1:25                                                                   ` Satadru Pramanik
  0 siblings, 0 replies; 37+ messages in thread
From: Satadru Pramanik @ 2022-02-18  1:25 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl


[-- Attachment #1.1: Type: text/plain, Size: 3157 bytes --]

I have attached the working patch I have ended up using, which has added
only a definition for ENOSYS.

Thanks again. It's nice to be offering software for these machines with
working, current libc. :)

Satadru

On Thu, Feb 17, 2022 at 5:48 PM Rich Felker <dalias@libc.org> wrote:

> On Thu, Feb 17, 2022 at 04:43:51PM -0500, Satadru Pramanik wrote:
> > > issues in userspace to maintain some functionality for any users who
> may
> > > > still be using the device.
> > > >
> > > > The simplest workaround possible would be ideal.
> > >
> > > If you're shipping binaries specifically for these devices, the
> > > simplest fix is just to emulate the failure that should happen in the
> > > kernel in userspace, using the attached patch. DO NOT deploy this
> > > patch in binaries meant to be used on modern systems, since they will
> > > break when Y2038 rolls around. (Your old Chromebooks will break then
> > > too.)
> > >
> > I'll let the next generation of preservationists worry about the Y2038
> > issues. Thanks for the patch. I'll build that now with the
> >
> https://git.musl-libc.org/cgit/musl/patch/?id=c2feda4e2ea61f4da73f2f38b2be5e327a7d1a91
> > reversal patch for the i686 machines, and see if that fixes the issue. If
>
> That reversal should no longer be needed with the other patch, which
> is a much bigger hammer. Reversing that just got rid of using the new
> socket-related syscalls; the new hack patch gets rid of using all new
> syscalls.
>
> > so, we can consider the matter closed, unless you want to suggest a
> > modification of the syscall patch to handle older working kernels which
> > might avoid the need for the patch when used with older systems.
>
> The patch is fine for older broken or older working systems. It just
> suppresses the ability to use any syscall added after Linux 3.8, so
> it's a bad idea to use on *newer* systems.
>
> > > It is interesting though
> > > > that the sample program works fine when built against near-stock
> glibc
> > > > 2.23, no?
> > >
> > > No. If your kernel has a bug that makes something behave wildly wrong,
> > > whether you do or don't see that as visible breakage with a particular
> > > piece of software is not particularly interesting.
> > >
> > > I'm pretty sure, however, that you just haven't tested enough to see
> > > any failures. glibc 2.23 is from 2016, so any functionality in it
> > > using syscalls added after 2011 (3.8 kernel) is going to blow up
> > > badly, thinking the syscall succeeded and returned some positive value
> > > when actually the kernel lacks it.
> > >
> > > In the particular case of clock_gettime, it's just that your glibc
> > > 2.23 has a hard Y2038 EOL and does not use/support the missing time64
> > > syscalls.
> > >
> > >
> > Duly noted.
> >
> > Thanks so much for being patient while I got you enough information to
> fix
> > this!
> >
> > On behalf of the entire Chromebrew community, thank you!
>
> And thanks for reporting, running tests, and following through on
> this. It will be useful to know this is an issue others might hit, and
> to be able to check other old systems that might have backported the
> patch with the bug.
>

[-- Attachment #1.2: Type: text/html, Size: 4070 bytes --]

[-- Attachment #2: broken_chromeos_kernel_hack_2.diff --]
[-- Type: application/octet-stream, Size: 2653 bytes --]

--- a/arch/i386/syscall_arch.h	2022-02-17 16:45:37.398583011 -0500
+++ b/arch/i386/syscall_arch.h	2022-02-17 16:50:02.311265598 -0500
@@ -9,11 +9,21 @@
 #define SYSCALL_INSNS "call *%%gs:16"
 #endif
 
+/*
+ *  * This error code is special: arch syscall entry code will return
+ *   * -ENOSYS if users try to call a syscall that doesn't exist.  To keep
+ *    * failures of syscalls that really do exist distinguishable from
+ *     * failures due to attempts to use a nonexistent syscall, syscall
+ *      * implementations should refrain from returning -ENOSYS.
+ *       */
+#define ENOSYS          38      /* Invalid system call number */
+
 #define SYSCALL_INSNS_12 "xchg %%ebx,%%edx ; " SYSCALL_INSNS " ; xchg %%ebx,%%edx"
 #define SYSCALL_INSNS_34 "xchg %%ebx,%%edi ; " SYSCALL_INSNS " ; xchg %%ebx,%%edi"
 
 static inline long __syscall0(long n)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 	__asm__ __volatile__ (SYSCALL_INSNS : "=a"(__ret) : "a"(n) : "memory");
 	return __ret;
@@ -21,6 +31,7 @@
 
 static inline long __syscall1(long n, long a1)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 	__asm__ __volatile__ (SYSCALL_INSNS_12 : "=a"(__ret) : "a"(n), "d"(a1) : "memory");
 	return __ret;
@@ -28,6 +39,7 @@
 
 static inline long __syscall2(long n, long a1, long a2)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 	__asm__ __volatile__ (SYSCALL_INSNS_12 : "=a"(__ret) : "a"(n), "d"(a1), "c"(a2) : "memory");
 	return __ret;
@@ -35,6 +47,7 @@
 
 static inline long __syscall3(long n, long a1, long a2, long a3)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 #if !defined(__PIC__) || !defined(BROKEN_EBX_ASM)
 	__asm__ __volatile__ (SYSCALL_INSNS : "=a"(__ret) : "a"(n), "b"(a1), "c"(a2), "d"(a3) : "memory");
@@ -46,6 +59,7 @@
 
 static inline long __syscall4(long n, long a1, long a2, long a3, long a4)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 #if !defined(__PIC__) || !defined(BROKEN_EBX_ASM)
 	__asm__ __volatile__ (SYSCALL_INSNS : "=a"(__ret) : "a"(n), "b"(a1), "c"(a2), "d"(a3), "S"(a4) : "memory");
@@ -57,6 +71,7 @@
 
 static inline long __syscall5(long n, long a1, long a2, long a3, long a4, long a5)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 #if !defined(__PIC__) || !defined(BROKEN_EBX_ASM)
 	__asm__ __volatile__ (SYSCALL_INSNS
@@ -70,6 +85,7 @@
 
 static inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
 {
+	if (n>350) return -ENOSYS;
 	unsigned long __ret;
 #if !defined(__PIC__) || !defined(BROKEN_EBX_ASM)
 	__asm__ __volatile__ ("pushl %7 ; push %%ebp ; mov 4(%%esp),%%ebp ; " SYSCALL_INSNS " ; pop %%ebp ; add $4,%%esp"

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-02-18  1:25 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAFrh3J9CK67S=Jn531zTK04QeR8yiZwro_ZThQD=3aUx5XghCA@mail.gmail.com>
2022-02-06 21:30 ` [musl] Re: musl getaddr info breakage on older kernels Rich Felker
2022-02-06 23:25   ` Satadru Pramanik
2022-02-06 23:44     ` Rich Felker
2022-02-07  1:29       ` Satadru Pramanik
2022-02-07  2:40         ` Rich Felker
2022-02-07 19:19           ` Satadru Pramanik
2022-02-07 21:02             ` Rich Felker
2022-02-14 17:24               ` Satadru Pramanik
2022-02-14 18:29                 ` Rich Felker
2022-02-14 19:00                   ` Satadru Pramanik
2022-02-14 22:00                     ` Rich Felker
2022-02-15 16:59                       ` Satadru Pramanik
2022-02-15 17:44                         ` Rich Felker
2022-02-15 22:56                           ` Satadru Pramanik
2022-02-16  1:41                             ` Rich Felker
2022-02-16 18:37                               ` Satadru Pramanik
2022-02-16 18:44                                 ` Satadru Pramanik
2022-02-16 21:33                                   ` Rich Felker
2022-02-16 21:53                                     ` Satadru Pramanik
2022-02-17  1:44                                       ` Satadru Pramanik
2022-02-17  4:14                                         ` Satadru Pramanik
2022-02-17 13:17                                           ` Satadru Pramanik
2022-02-17 13:24                                             ` Rich Felker
2022-02-17 13:30                                               ` Satadru Pramanik
2022-02-17 13:46                                                 ` Rich Felker
2022-02-17 14:49                                                   ` Satadru Pramanik
2022-02-17 15:53                                                     ` Rich Felker
2022-02-17 16:05                                                       ` Rich Felker
2022-02-17 16:36                                                         ` Satadru Pramanik
2022-02-17 18:17                                                           ` Rich Felker
2022-02-17 21:39                                                             ` Satadru Pramanik
2022-02-17 21:42                                                               ` Satadru Pramanik
     [not found]                                                             ` <CAFrh3J9CBGagzmnn9E7W0H15iiOzZEtg4YKfJ1qRwCtrjnw6tA@mail.gmail.com>
2022-02-17 21:43                                                               ` Fwd: " Satadru Pramanik
2022-02-17 22:48                                                                 ` Rich Felker
2022-02-18  1:25                                                                   ` Satadru Pramanik
2022-02-17 16:17                                                       ` Satadru Pramanik
2022-02-07  5:35     ` Markus Wichmann

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).