mailing list of musl libc
* DNS FQDN query issue in musl starting 1.1.13
From: Andrey Arapov @ 2019-09-13  7:43 UTC
  To: musl

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hello,

I've noticed that musl libc, starting with 1.1.13, does not try to resolve the FQDN first.
Instead it tries <FQDN>.<search_domain_found_in_/etc/resolv.conf_file> first, which differs from how
the GNU C library works.

Also, since musl is "never falling back to search, which glibc would do" according to
https://wiki.musl-libc.org/functional-differences-from-glibc.html#Name-Resolver/DNS

this poses a problem when a DNS server is misconfigured.

For example, when a DNS server returns SERVFAIL (no SOA), musl simply stops without
attempting the FQDN.

So a wrong entry in /etc/resolv.conf causes the musl resolver to break far too quickly.

Is this the expected behavior? And could it be changed so that musl tries the FQDN first?

This behavior leads some people to use short hostnames instead of FQDNs, for example by
ad-hoc patching ucp-metrics (an Alpine-based container):
https://forums.docker.com/t/ucp-dashboard-shows-no-data/72337/4


To expand on the ucp-metrics issue:

When resolv.conf contains the following configuration:
nameserver 10.96.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local some.brokendnsserver.com
options ndots:5

An attempt to resolve ucp-controller.kube-system.svc.cluster.local leads to an attempt to resolve ucp-controller.kube-system.svc.cluster.local.some.brokendnsserver.com before the FQDN itself is tried.

The workaround people use in the wild is: ucp-controller.kube-system.svc.cluster.local => ucp-controller

I've already informed Docker Support about this issue. They are working on a knowledge base article about it, so that people are aware of it and can decide whether to fix the DNS server for their search domain (should they have access to it) or their resolv.conf entry.


I think this should be fixed: even with a working DNS server for the search domain, the system is prone to errors should that server fail (or partially fail, returning SERVFAIL/[no SOA]) at any point in time.


Please kindly Cc me on replies.


Kind Regards,
Andrey Arapov

-----BEGIN PGP SIGNATURE-----

iI8EARYIADcWIQRDMZ/b1AtG/U4LjuKQdtXmsxrpnAUCXXtIjhkcYW5kcmV5LmFy
YXBvdkBuaXhhaWQuY29tAAoJEJB21eazGumcHjMBAP5Y6ZsoVCVd2VHN+Vf09+B7
SQRPtH3O5++Vu5R5vTDdAQCABCZVzZcl3R+KuZqHX0suZY6ddfYwTjZHtzensR7u
CQ==
=CCbe
-----END PGP SIGNATURE-----



* Re: DNS FQDN query issue in musl starting 1.1.13
From: Rich Felker @ 2019-09-13 12:15 UTC
  To: musl; +Cc: Andrey Arapov

On Fri, Sep 13, 2019 at 07:43:28AM +0000, Andrey Arapov wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> Hello,
> 
> I've noticed that musl libc, starting with 1.1.13, does not try to resolve the FQDN first.
> Instead it tries <FQDN>.<search_domain_found_in_/etc/resolv.conf_file> first, which differs from how
> the GNU C library works.

This is only the case if fqdn contains fewer than ndots dots and does
not end in a dot. This behavior should match all other resolvers.
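
The rule above can be sketched in a few lines of Python. This is only an
illustration of the stated condition, not musl's actual code, and the
function name search_applies is made up for the example.

```python
def search_applies(name: str, ndots: int) -> bool:
    """Return True if the search list would be consulted for `name`.

    Models only the rule stated above: search is used when the name does
    not end in a dot and contains fewer than `ndots` dots.
    """
    if name.endswith("."):
        return False  # a trailing dot marks the name as absolute
    return name.count(".") < ndots


NAME = "ucp-controller.kube-system.svc.cluster.local"  # contains 4 dots

print(search_applies(NAME, 5))        # 4 < 5, so search is used
print(search_applies(NAME + ".", 5))  # trailing dot, so no search
print(search_applies(NAME, 1))        # 4 >= 1, so no search
```

With the resolv.conf from the report (ndots:5), the four-dot name falls
under the search rule unless it is written with a trailing dot.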

> Also, since musl is "never falling back to search, which glibc would do" according to
> https://wiki.musl-libc.org/functional-differences-from-glibc.html#Name-Resolver/DNS
> 
> this poses a problem when a DNS server is misconfigured.
> 
> For example, when a DNS server returns SERVFAIL (no SOA), musl simply stops without
> attempting the FQDN.

If one lookup ends in ServFail, it's indeterminate and must be
reported as an error to the caller. Otherwise the successful result of
a lookup yields different values depending on transient failure of a
nameserver. This is dangerously wrong regardless of whether other
implementations do it.
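
To make the argument concrete, here is a small Python sketch with a fake
answer table instead of real DNS. The names (www.a.example,
www.b.example), the addresses, and the lookup helper are all hypothetical;
the point is only to compare "stop on ServFail" with "continue on
ServFail".

```python
NOERROR, NXDOMAIN, SERVFAIL = "NOERROR", "NXDOMAIN", "SERVFAIL"


def lookup(candidates, answers, continue_on_servfail):
    """Try candidates in order against a fake answer table.

    `answers` maps name -> (status, address). Names missing from the
    table are treated as NXDOMAIN.
    """
    for name in candidates:
        status, addr = answers.get(name, (NXDOMAIN, None))
        if status == NOERROR:
            return status, name, addr
        if status == SERVFAIL and not continue_on_servfail:
            # Indeterminate result: report the error instead of guessing.
            return SERVFAIL, name, None
    return NXDOMAIN, None, None


CANDIDATES = ["www.a.example", "www.b.example"]
HEALTHY = {
    "www.a.example": (NOERROR, "192.0.2.1"),
    "www.b.example": (NOERROR, "192.0.2.2"),
}
# Same zone data, but the first lookup hits a transient server failure.
FAILING = {
    "www.a.example": (SERVFAIL, None),
    "www.b.example": (NOERROR, "192.0.2.2"),
}

print(lookup(CANDIDATES, HEALTHY, continue_on_servfail=True))
print(lookup(CANDIDATES, FAILING, continue_on_servfail=True))
print(lookup(CANDIDATES, FAILING, continue_on_servfail=False))
```

With continue_on_servfail=True the same query returns 192.0.2.1 when the
server is healthy and 192.0.2.2 during the failure. With
continue_on_servfail=False the caller gets an error during the failure
rather than a different address.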

This was all discussed (with people involved in the Docker-related
projects using these kinds of search tricks, as I remember it) at the
time search was added. Addition of search was explicitly conditional
on *not* reproducing buggy/dangerous behavior other implementations
have.

> So a wrong entry in /etc/resolv.conf causes the musl resolver to
> break far too quickly.
> 
> Is this the expected behavior? And could it be changed so that musl
> tries the FQDN first?

Don't set ndots>1. ndots>1 has all sorts of problems.

> This behavior leads some people to use short hostnames instead of FQDNs, for example by
> ad-hoc patching ucp-metrics (an Alpine-based container):
> https://forums.docker.com/t/ucp-dashboard-shows-no-data/72337/4
> 
> 
> To expand on the ucp-metrics issue:
> 
> When resolv.conf contains the following configuration:
> nameserver 10.96.0.10
> search kube-system.svc.cluster.local svc.cluster.local cluster.local some.brokendnsserver.com
> options ndots:5
> 
> An attempt to resolve
> ucp-controller.kube-system.svc.cluster.local leads to an
> attempt to resolve
> ucp-controller.kube-system.svc.cluster.local.some.brokendnsserver.com
> before the FQDN itself is tried.
> 
> The workaround people use in the wild is: ucp-controller.kube-system.svc.cluster.local => ucp-controller
> 
> I've already informed Docker Support about this issue. They are
> working on a knowledge base article about it, so that people are
> aware of it and can decide whether to fix the DNS server for their
> search domain (should they have access to it) or their resolv.conf
> entry.

Ideally they just would not use ndots>1, since it necessarily yields
this and lots of other problems (like extra round trips and timeout
delay for each lookup, even if the lookup works). I'm not sure what to
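
A rough Python count of those extra round trips, using the search list
from the report. It assumes the worst case in which none of the search
candidates exist, so every one of them is queried before the name itself;
the helper name search_queries_first is made up for the example.

```python
SEARCH = [
    "kube-system.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "some.brokendnsserver.com",
]


def search_queries_first(name, search, ndots):
    """Number of search-list names queried before `name` itself is tried."""
    if name.endswith(".") or name.count(".") >= ndots:
        return 0  # name is used as-is, the search list is skipped
    return len(search)


for ndots in (5, 1):
    extra = search_queries_first("www.example.com", SEARCH, ndots)
    print(f"ndots={ndots}: {extra} search queries before www.example.com")
```

For an ordinary external name such as www.example.com, ndots:5 puts four
failing queries in front of the one that works, while ndots:1 puts none.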
recommend as an alternative since I don't entirely understand the
usage constraints here, but I know these issues were all known on the
Docker and Kubernetes side back when search was first implemented in
musl, and that folks understood that these uses of search domains with
multiple components were a problem and planned to phase them out. I
don't know what happened with that.

> I think this should be fixed: even with a working DNS server for the
> search domain, the system is prone to errors should that server
> fail (or partially fail, returning SERVFAIL/[no SOA]) at any point
> in time.

This is entirely intentional. If one of the servers fails, the
application needs to know that it can't get a meaningful result for
its query. Not silently get the wrong result.

Rich



* Re: DNS FQDN query issue in musl starting 1.1.13
From: Andrey Arapov @ 2019-09-13 14:19 UTC
  To: Rich Felker, musl

Hello Rich,

thank you for your prompt reply.

I agree that SERVFAIL must be reported as an error to the caller, and I have just realized
that "ucp-controller.kube-system.svc.cluster.local" has only 4 dots, so it isn't
tried as-is first unless a trailing (5th) dot is specified,
e.g. "ucp-controller.kube-system.svc.cluster.local.".

One difference, I presume, is that glibc treats a domain name as terminated
by a length byte of zero (RFC 1035, section 3.1, "Name space definitions") and so resolves
the FQDN with only 4 dots even though ndots is set to 5.
Please correct me if I am wrong.


Regarding usage constraints, it looks like the whole point of having ndots > 1 is
to make internal cluster lookups faster (the more dots, the faster), while external
DNS lookups are cached, so they are slow the first time but fast subsequently.

But setting ndots = 1 to work around musl's unexpected behavior (when ndots > 1)
makes all intra-cluster lookups slower, while making upstream FQDN lookups faster.


After reading through the discussions, it turns out that people initially resorted
to using Kubernetes's dnsConfig to set ndots to 1 (the default) as a workaround,
but later no longer needed it because Kubernetes/CoreDNS dropped Alpine
(I'm not sure for what reasons, though).

I guess the point is that projects using the musl C library (>= 1.1.13)
should clearly make people aware of this difference in hostname lookups,
which causes unexpected behavior compared to glibc.


Below is a brief timeline I gathered, which might be handy to anyone reading this.

### September-December 2016/2017

> - "We're smart enough to solve this for everyone" is not realistic. (c) BrianGallew

The rationale for having ndots=5 is explained at length in #33554
- https://github.com/kubernetes/kubernetes/issues/33554#issuecomment-266251056

dnsConfig:
- https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config
- https://github.com/kubernetes/website/pull/5978

### June-August 2018 ... April 2019

People have been struggling with this issue since upstream/downstream projects
updated (or switched) their base distro to Alpine 3.4 or higher (with musl >= 1.1.13).

ndots breaks DNS resolving
- https://github.com/kubernetes/kubernetes/issues/64924

Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances
- https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html

Docker: drop alpine
- https://github.com/coredns/coredns/pull/1843

Rebase container images from alpine to debian-base.
- https://github.com/kubernetes/dns/pull/294


Kind Regards,
Andrey Arapov




* Re: DNS FQDN query issue in musl starting 1.1.13
From: Rich Felker @ 2019-09-13 15:11 UTC
  To: Andrey Arapov; +Cc: musl

On Fri, Sep 13, 2019 at 02:19:48PM +0000, Andrey Arapov wrote:
> Hello Rich,
> 
> thank you for your prompt reply.
> 
> I agree that SERVFAIL must be reported as an error to the caller, and I have just realized
> that "ucp-controller.kube-system.svc.cluster.local" has only 4 dots, so it isn't
> tried as-is first unless a trailing (5th) dot is specified,
> e.g. "ucp-controller.kube-system.svc.cluster.local.".
> 
> One difference, I presume, is that glibc treats a domain name as terminated
> by a length byte of zero (RFC 1035, section 3.1, "Name space definitions") and so resolves
> the FQDN with only 4 dots even though ndots is set to 5.
> Please correct me if I am wrong.

You're mistaken here. The input to the resolver interfaces is a plain
string; there is no "length byte". Length byte is part of the DNS
protocol on the wire and is only involved at a lower layer past the
search processing, which is just transformations on strings.

On both glibc and musl, the search domains are tried initially if the
query string does not end in a dot and does not contain at least ndots
dots. So in your above example with four dots and ndots==5, both will
do the search. Adding the final dot suppresses the search regardless
of whether it takes the dots count to 5 or more.
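
As an illustration, the following Python sketch reads the resolv.conf
text from the original report and lists the names that would be queried,
in order. It is a simplified model of the string handling described
above, not code from either libc: parse_resolv_conf and candidates are
made-up helpers, only the "search" and "options ndots:N" lines are
parsed, and it assumes the name itself is tried last once the search list
is exhausted.

```python
RESOLV_CONF = """\
nameserver 10.96.0.10
search kube-system.svc.cluster.local svc.cluster.local cluster.local some.brokendnsserver.com
options ndots:5
"""


def parse_resolv_conf(text):
    """Extract the search list and ndots (default 1) from resolv.conf text."""
    search, ndots = [], 1
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "search":
            search = fields[1:]
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt[len("ndots:"):])
    return search, ndots


def candidates(name, search, ndots):
    """Names to query, in order, for the cases where glibc and musl agree."""
    if name.endswith("."):
        return [name]  # trailing dot: search is suppressed
    if name.count(".") < ndots:
        return [f"{name}.{domain}" for domain in search] + [name]
    # At least ndots dots: fallback behaviour differs between
    # implementations and is not modeled here.
    return [name]


search, ndots = parse_resolv_conf(RESOLV_CONF)
name = "ucp-controller.kube-system.svc.cluster.local"
for candidate in candidates(name, search, ndots):
    print(candidate)
print(candidates(name + ".", search, ndots))
```

The four-dot name yields four search candidates followed by the name
itself; the same name with a trailing dot yields a single query.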

The only difference between glibc and musl here is that, as I
understand it at least, glibc will continue searching after hitting an
error like ServFail, thus producing results that depend on transient
failures. musl stops and reports the error.

In the opposite case, with something like a.b.c.d.example.com, where
there are 5 dots, the behaviors differ in the way you're wondering
about, but I don't think it's relevant to your usage/problem. glibc
will first try to resolve it as a FQDN, and if that fails, it will run
through the whole search. musl does not perform search at all for
queries with at least ndots dots in them. The motivation here is the
same: if you depend on that behavior, your application/configuration
is subject to breakage by third parties registering new things in the
global DNS namespace. This is actually what happened with all the new
TLDs -- lots of networks using those names as fake local TLDs via
search with ndots==1 broke when they appeared as real TLDs.
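
The breakage scenario can be sketched in Python with a fake global
namespace. Everything here is hypothetical: the "newtld" label, the
example.com search domain, the addresses, and the two helper functions,
which model only the fallback behaviour as described in this message.

```python
def resolve_with_fallback(name, search, global_dns):
    """Try the name as-is first, then each search domain (fallback model)."""
    for candidate in [name] + [f"{name}.{domain}" for domain in search]:
        if candidate in global_dns:
            return candidate, global_dns[candidate]
    return None, None


def resolve_no_fallback(name, global_dns):
    """Try the name as-is only (no search for names with >= ndots dots)."""
    if name in global_dns:
        return name, global_dns[name]
    return None, None


SEARCH = ["example.com"]
NAME = "files.newtld"  # one dot, so with ndots==1 it is tried as-is first

# Before: only the internal record exists, reachable via the search domain.
before = {"files.newtld.example.com": "192.0.2.10"}
# After: a third party registers the same name in the global namespace.
after = dict(before, **{"files.newtld": "198.51.100.7"})

print(resolve_with_fallback(NAME, SEARCH, before))
print(resolve_with_fallback(NAME, SEARCH, after))
print(resolve_no_fallback(NAME, before))
```

With fallback, the lookup works at first and then silently starts
returning the third party's address once the new name appears. Without
fallback, the lookup never worked, so no configuration could have come to
depend on it.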

> Regarding usage constraints, it looks like the whole point of having ndots > 1 is
> to make internal cluster lookups faster (the more dots, the faster), while external
> DNS lookups are cached, so they are slow the first time but fast subsequently.

If all your internal cluster lookups use the .local fake TLD
explicitly, e.g. *.svc.cluster.local, etc., then search domain is not
needed whatsoever and is just making things slow and fragile. ndots==1
would avoid it getting used, though.

If your internal cluster lookups are looking up names like "foo.bar"
and expecting it to resolve to foo.bar.kube-system.svc.cluster.local,
foo.bar.svc.cluster.local, or foo.bar.cluster.local depending on which
first defines it, then your setup actually does depend on search and
having ndots be greater than the number of dots in the longest
"foo.bar" part you use. The number of dots in the search part
("foo.bar.svc.cluster.local") is not relevant.

One proposal I recall hearing from Kubernetes folks way back was to
use names like "foo-bar" instead of "foo.bar" in the above, so that
ndots==1 suffices and the search behavior does not clash with lookups
of real domains.
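
Put as a tiny Python calculation (the helper name min_ndots_for is made
up for the example): the smallest ndots that still sends every relative
name through the search list is one more than the largest dot count among
those names.

```python
def min_ndots_for(relative_names):
    """Smallest ndots for which every given name has fewer than ndots dots."""
    return max(name.count(".") for name in relative_names) + 1


print(min_ndots_for(["foo.bar"]))  # dotted short names need ndots >= 2
print(min_ndots_for(["foo-bar"]))  # dot-free short names work with ndots == 1
```

The length of the search domains themselves does not enter the
calculation; only the short names that applications actually look up do.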

> But setting ndots = 1 to work around musl's unexpected behavior (when ndots > 1)
> makes all intra-cluster lookups slower, while making upstream FQDN lookups faster.

I don't understand why that would be. They should either fail entirely
or be just as fast. Decreasing ndots should not be able to slow down
any lookup that works both before and after the decreasing.

> After reading through the discussions, it turns out that people initially resorted
> to using Kubernetes's dnsConfig to set ndots to 1 (the default) as a workaround,
> but later no longer needed it because Kubernetes/CoreDNS dropped Alpine
> (I'm not sure for what reasons, though).
> 
> I guess the point is that projects using the musl C library (>= 1.1.13)
> should clearly make people aware of this difference in hostname lookups,
> which causes unexpected behavior compared to glibc.

We've tried to do this on the wiki about functional differences. Open
to improvements.

> Below is a brief timeline I gathered, which might be handy to anyone reading this.
> 
> ### September-December 2016/2017
> 
> > - "We're smart enough to solve this for everyone" is not realistic. (c) BrianGallew
> 
> The rationale for having ndots=5 is explained at length in #33554
> - https://github.com/kubernetes/kubernetes/issues/33554#issuecomment-266251056
> 
> dnsConfig:
> - https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-config
> - https://github.com/kubernetes/website/pull/5978
> 
> ### June-August 2018 ... April 2019
> 
> People have been struggling with this issue since upstream/downstream projects
> updated (or switched) their base distro to Alpine 3.4 or higher (with musl >= 1.1.13).
> 
> ndots breaks DNS resolving
> - https://github.com/kubernetes/kubernetes/issues/64924
> 
> Kubernetes pods /etc/resolv.conf ndots:5 option and why it may negatively affect your application performances
> - https://pracucci.com/kubernetes-dns-resolution-ndots-options-and-why-it-may-affect-application-performances.html
> 
> Docker: drop alpine
> - https://github.com/coredns/coredns/pull/1843
> 
> Rebase container images from alpine to debian-base.
> - https://github.com/kubernetes/dns/pull/294

I'm not up for reading through all that right now, but thanks for
collecting the relevant information. It's frustrating that, despite
having known way back that what they were doing was wrong, they seem
to have decided to "drop support for Alpine/musl" rather than "fix the
stuff that's obviously wrong and hurting performance even on glibc
based dists"...

Rich

