mailing list of musl libc
* [musl] DNS resolver fails prematurely when server reports failure?
From: Mark Hills @ 2021-12-01 12:49 UTC
  To: musl

With multiple DNS servers in /etc/resolv.conf, the docs [1] are clear:

  "musl's resolver queries them all in parallel and accepts whichever 
   response arrives first."

So a dual-server configuration is expected to give greater resiliency:

  nameserver 213.186.33.99  # OVH
  nameserver 1.1.1.1        # Cloudflare

However, 1.1.1.1 appears quite prone to returning SERVFAIL (possibly
internal load shedding, though we are not making excessive DNS queries).

With glibc's cascading behaviour (or perhaps that of another OS), the
client may deal with this itself.

But reading the wiki literally: if the first response received is "this
server has failed", is a good response from another server then ignored?

And indeed this seems to be the behaviour we experience, as removing 
1.1.1.1 restored reliability.

I tried to confirm this in the source [2] but found I'd need more time to 
understand this code.

Also, diagnosis was made more difficult by a colleague diligently
following the resolv.conf(5) man page on the host (installed via
man-pages on Alpine Linux), which documents glibc rather than musl.
Perhaps musl could/should provide its own, but I expect there is a
policy for this and similar issues.

Thanks

[1] https://wiki.musl-libc.org/functional-differences-from-glibc.html
[2] https://git.musl-libc.org/cgit/musl/tree/src/network/lookup_name.c#n296

-- 
Mark


* Re: [musl] DNS resolver fails prematurely when server reports failure?
From: Rich Felker @ 2021-12-01 15:23 UTC
  To: Mark Hills; +Cc: musl

On Wed, Dec 01, 2021 at 12:49:07PM +0000, Mark Hills wrote:
> With multiple DNS servers in /etc/resolv.conf, the docs [1] are clear:
> 
>   "musl's resolver queries them all in parallel and accepts whichever 
>    response arrives first."
> 
> So dual configuration is expected to give greater resiliency:
> 
>   nameserver 213.186.33.99  # OVH
>   nameserver 1.1.1.1        # Cloudflare
> 
> However, 1.1.1.1 appears quite prone to some kind of internal SERVFAIL 
> (may be internal load shedding; though we are not making excessive DNS 
> queries)
> 
> With glibc's cascading behaviour (or perhaps another OS) this may be dealt 
> with by the client.
> 
> But if the wiki is read literally, the first response received is "this 
> server has failed" then a good response from another server is ignored?

No. ServFail is an inconclusive response, treated basically the same
as if no packet had arrived at all. (Slight difference: it triggers
immediate retry up to a limited number of times.)
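
To illustrate the distinction drawn above, here is a minimal sketch in C.
It is not musl's actual code (that lives in res_msend.c); the function
name classify_reply and the verdict enum are hypothetical, invented for
this example. It reads the RCODE from a raw DNS header as laid out in
RFC 1035 and sorts replies into conclusive (NoError, which covers NODATA,
and NxDomain) and inconclusive (ServFail). Treating every other RCODE as
inconclusive is an assumption of this sketch.

```c
#include <stddef.h>

/* DNS response codes from RFC 1035, section 4.1.1. */
enum { RCODE_NOERROR = 0, RCODE_SERVFAIL = 2, RCODE_NXDOMAIN = 3 };

/* Hypothetical verdicts, named for this sketch only. */
enum verdict { VERDICT_CONCLUSIVE, VERDICT_INCONCLUSIVE, VERDICT_MALFORMED };

/* Classify a raw DNS reply by the RCODE field in its header. */
enum verdict classify_reply(const unsigned char *pkt, size_t len)
{
    if (len < 12)                 /* a DNS header is 12 bytes long */
        return VERDICT_MALFORMED;
    if (!(pkt[2] & 0x80))         /* QR bit clear: a query, not a reply */
        return VERDICT_MALFORMED;

    switch (pkt[3] & 0x0f) {      /* RCODE is the low 4 bits of byte 3 */
    case RCODE_NOERROR:           /* includes NODATA (no answer records) */
    case RCODE_NXDOMAIN:
        return VERDICT_CONCLUSIVE;
    case RCODE_SERVFAIL:          /* keep waiting for another server */
    default:                      /* assumption: others are inconclusive */
        return VERDICT_INCONCLUSIVE;
    }
}
```

A caller would keep waiting (or retry) on VERDICT_INCONCLUSIVE and stop
only on VERDICT_CONCLUSIVE, which is why a false NxDomain from one server
is more harmful than a ServFail.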

> And indeed this seems to be the behaviour we experience, as removing 
> 1.1.1.1 restored reliability.

Have you looked at a packet capture of what's happening? Likely
1.1.1.1 was returning a false conclusive result (NxDomain or NODATA)
rather than ServFail.

Rich


* Re: [musl] DNS resolver fails prematurely when server reports failure?
From: Mark Hills @ 2021-12-01 15:57 UTC
  To: Rich Felker; +Cc: musl

On Wed, 1 Dec 2021, Rich Felker wrote:

> On Wed, Dec 01, 2021 at 12:49:07PM +0000, Mark Hills wrote:
> > With multiple DNS servers in /etc/resolv.conf, the docs [1] are clear:
> > 
> >   "musl's resolver queries them all in parallel and accepts whichever 
> >    response arrives first."
> > 
> > So dual configuration is expected to give greater resiliency:
> > 
> >   nameserver 213.186.33.99  # OVH
> >   nameserver 1.1.1.1        # Cloudflare
> > 
> > However, 1.1.1.1 appears quite prone to some kind of internal SERVFAIL 
> > (may be internal load shedding; though we are not making excessive DNS 
> > queries)
> > 
> > With glibc's cascading behaviour (or perhaps another OS) this may be dealt 
> > with by the client.
> > 
> > But if the wiki is read literally, the first response received is "this 
> > server has failed" then a good response from another server is ignored?
> 
> No. ServFail is an inconclusive response, treated basically the same
> as if no packet had arrived at all. (Slight difference: it triggers
> immediate retry up to a limited number of times.)

Ok, thanks. That sounds correct, and I realise now that the actual query
logic lives in this source file [1], which is why the code I was reading
looked so opaque.

Would it be worth making a small change to the wiki text? Perhaps
"conclusive answer" instead of "response":

  accepts whichever conclusive answer arrives first
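
The proposed wording can be sketched in C as a selection over replies in
arrival order. This is only an illustration under stated assumptions, not
musl's implementation: struct reply, is_conclusive and first_conclusive
are hypothetical names, and RCODE values 0 (NoError) and 3 (NxDomain) are
taken as the conclusive ones, per the explanation quoted above.

```c
#include <stddef.h>

/* Hypothetical record of one reply, in the order replies arrived. */
struct reply {
    int server;   /* index of the nameserver that answered */
    int rcode;    /* DNS RCODE: 0 NoError, 2 ServFail, 3 NxDomain */
};

/* Assumption: only NoError and NxDomain settle the lookup. */
static int is_conclusive(int rcode)
{
    return rcode == 0 || rcode == 3;
}

/* Return the index of the first conclusive reply, or -1 if none. */
int first_conclusive(const struct reply *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (is_conclusive(r[i].rcode))
            return (int)i;
    return -1;
}
```

With replies { ServFail, NoError } the second one is accepted, so an
early ServFail does no harm. With { NxDomain, NoError } the first one
wins, which is the "false conclusive result" case.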

> > And indeed this seems to be the behaviour we experience, as removing 
> > 1.1.1.1 restored reliability.
> 
> Have you looked at a packet capture of what's happening? Likely
> 1.1.1.1 was returning a false conclusive result (NxDomain or NODATA)
> rather than ServFail.

We caught the problem with tcpdump (which is how we first realised that
musl's behaviour differs from the man page) and reproduced it with "dig";
however, it doesn't seem to be reproducible now. My recollection is that
it was an instant response, and that this is where I first encountered
"ServFail", but I'll see if we have logged the actual run. I'm _fairly_
sure I'd have noticed a false but conclusive response.

I'm re-adding the "backup" DNS on a test system to see if we can get back 
to reproducing the problem.

[1] https://git.musl-libc.org/cgit/musl/tree/src/network/res_msend.c#n30

Thanks

-- 
Mark
