Hi, I'm a developer from the performance monitoring company Dynatrace, and I've been recently investigating curious problems at our customers' environments where a call to musl's getaddrinfo appears to spuriously return ENOENT when called from a node.js application that is being monitored with the Dynatrace agent. I managed to pinpoint the problem to the code that performs the AI_ADDRCONFIG check. If an address family that is not enabled on the host is specified, a call to "connect" in that code fails, the socket fd is closed, and the value of "errno" is then evaluated. The problem is that the call to "close" can change the value of errno, which will break the switch-case that follows it. Especially if aio is used (which is the case when the Dynatrace agent is included in the application), the call to close will end up setting errno to ENOENT by default (even without a failure) within the "aio_cancel" function if an aio operation is active. In such a case getaddrinfo will then incorrectly return EAI_SYSTEM with errno set to ENOENT. (After some error code translations within libuv, node.js will then print an error message claiming that getaddrinfo failed with ENOENT which is rather confusing.) Even if aio is not used, the code might fail whenever "close" gets interrupted and returns with errno set to EINTR. As the return value of close is not checked, the errno might thus "silently" change before getting evaluated with the assumption that it still contains the value set when "connect" failed. Below is a simple patch that should take care of this problem. Let me know if I can provide any more information or if there is anything else I can help with. Thanks, Jussi ------------------------------------------------------------------------------- diff --git a/src/network/getaddrinfo.c b/src/network/getaddrinfo.c index efaab306..71809856 100644 --- a/src/network/getaddrinfo.c +++ b/src/network/getaddrinfo.c @@ -16,6 +16,7 @@ int getaddrinfo(const char *restrict host, const char *restrict serv, const stru char canon[256], *outcanon; int nservs, naddrs, nais, canon_len, i, j, k; int family = AF_UNSPEC, flags = 0, proto = 0, socktype = 0; + int saved_errno = 0; struct aibuf *out; if (!host && !serv) return EAI_NONAME; @@ -66,11 +67,14 @@ int getaddrinfo(const char *restrict host, const char *restrict serv, const stru pthread_setcancelstate( PTHREAD_CANCEL_DISABLE, &cs); int r = connect(s, ta[i], tl[i]); + /* The call to "close" might change errno, especially if aio is in use; + * save the value set by "connect" for the later comparison. */ + if (r < 0) saved_errno = errno; pthread_setcancelstate(cs, 0); close(s); if (!r) continue; } - switch (errno) { + switch (saved_errno) { case EADDRNOTAVAIL: case EAFNOSUPPORT: case EHOSTUNREACH: ------------------------------------------------------------------------------- This email may contain confidential information. If it appears this message was sent to you by mistake, please let us know of the error. In this case, we also ask that you do not further forward the content and delete it. Thank you for your cooperation and understanding. Dynatrace Austria GmbH (registration number FN 91482h) is a company registered in Linz whose registered office is at 4020 Linz, Austria, Am Fünfundzwanziger Turm 20.
On Tue, Nov 23, 2021 at 02:47:49PM +0000, Nieminen, Jussi wrote: > Hi, > > I'm a developer from the performance monitoring company Dynatrace, and I've been > recently investigating curious problems at our customers' environments where a > call to musl's getaddrinfo appears to spuriously return ENOENT when called from > a node.js application that is being monitored with the Dynatrace agent. > > I managed to pinpoint the problem to the code that performs the AI_ADDRCONFIG > check. If an address family that is not enabled on the host is specified, a call > to "connect" in that code fails, the socket fd is closed, and the value of > "errno" is then evaluated. > > The problem is that the call to "close" can change the value of errno, which > will break the switch-case that follows it. Especially if aio is used (which is > the case when the Dynatrace agent is included in the application), the call to > close will end up setting errno to ENOENT by default (even without a failure) > within the "aio_cancel" function if an aio operation is active. In such a case > getaddrinfo will then incorrectly return EAI_SYSTEM with errno set to ENOENT. > > (After some error code translations within libuv, node.js will then print an > error message claiming that getaddrinfo failed with ENOENT which is rather > confusing.) Indeed, this all makes sense. > Even if aio is not used, the code might fail whenever "close" gets interrupted > and returns with errno set to EINTR. As the return value of close is not > checked, the errno might thus "silently" change before getting evaluated with > the assumption that it still contains the value set when "connect" failed. close can't return with EINTR but can return with EINPROGRESS which would give the same result here, I think. > Below is a simple patch that should take care of this problem. Let me know if I > can provide any more information or if there is anything else I can help with. I think this patch is probably okay. I wondered if close was in the set of functions POSIX-future intends to require not to touch errno on success, but it doesn't seem to be, and moreover the EINPROGRESS semantics would undermine that anyway. So saving errno before calling close is probably the right thing to do here. > Thanks, > Jussi Thanks for the clear analysis and patch! > ------------------------------------------------------------------------------- > diff --git a/src/network/getaddrinfo.c b/src/network/getaddrinfo.c > index efaab306..71809856 100644 > --- a/src/network/getaddrinfo.c > +++ b/src/network/getaddrinfo.c > @@ -16,6 +16,7 @@ int getaddrinfo(const char *restrict host, const char *restrict serv, const stru > char canon[256], *outcanon; > int nservs, naddrs, nais, canon_len, i, j, k; > int family = AF_UNSPEC, flags = 0, proto = 0, socktype = 0; > + int saved_errno = 0; > struct aibuf *out; > > if (!host && !serv) return EAI_NONAME; > @@ -66,11 +67,14 @@ int getaddrinfo(const char *restrict host, const char *restrict serv, const stru > pthread_setcancelstate( > PTHREAD_CANCEL_DISABLE, &cs); > int r = connect(s, ta[i], tl[i]); > + /* The call to "close" might change errno, especially if aio is in use; > + * save the value set by "connect" for the later comparison. */ > + if (r < 0) saved_errno = errno; > pthread_setcancelstate(cs, 0); > close(s); > if (!r) continue; > } > - switch (errno) { > + switch (saved_errno) { > case EADDRNOTAVAIL: > case EAFNOSUPPORT: > case EHOSTUNREACH: > -------------------------------------------------------------------------------
On Tue, Nov 23, 2021 at 10:05:37AM -0500, Rich Felker wrote: > On Tue, Nov 23, 2021 at 02:47:49PM +0000, Nieminen, Jussi wrote: > > Hi, > > > > I'm a developer from the performance monitoring company Dynatrace, and I've been > > recently investigating curious problems at our customers' environments where a > > call to musl's getaddrinfo appears to spuriously return ENOENT when called from > > a node.js application that is being monitored with the Dynatrace agent. > > > > I managed to pinpoint the problem to the code that performs the AI_ADDRCONFIG > > check. If an address family that is not enabled on the host is specified, a call > > to "connect" in that code fails, the socket fd is closed, and the value of > > "errno" is then evaluated. > > > > The problem is that the call to "close" can change the value of errno, which > > will break the switch-case that follows it. Especially if aio is used (which is > > the case when the Dynatrace agent is included in the application), the call to > > close will end up setting errno to ENOENT by default (even without a failure) > > within the "aio_cancel" function if an aio operation is active. In such a case > > getaddrinfo will then incorrectly return EAI_SYSTEM with errno set to ENOENT. > > > > (After some error code translations within libuv, node.js will then print an > > error message claiming that getaddrinfo failed with ENOENT which is rather > > confusing.) > > Indeed, this all makes sense. > > > Even if aio is not used, the code might fail whenever "close" gets interrupted > > and returns with errno set to EINTR. As the return value of close is not > > checked, the errno might thus "silently" change before getting evaluated with > > the assumption that it still contains the value set when "connect" failed. > > close can't return with EINTR but can return with EINPROGRESS which > would give the same result here, I think. I was mistaken about this. EINPROGRESS was removed back in commit 6bea5dc69892cd9ff0c222474e7dd468c29dfa75 (2015) and is no longer possible. However, as the original report stated, ENOENT comes from __aio_close where errno is clobbered even on success. On looking at this again, I'm thinking we might should rework the stuff causing that to happen, but the patch submitted still looks reasonable since close does not have any contract not to clobber errno. > > Below is a simple patch that should take care of this problem. Let me know if I > > can provide any more information or if there is anything else I can help with. > > I think this patch is probably okay. I wondered if close was in the > set of functions POSIX-future intends to require not to touch errno on > success, but it doesn't seem to be, and moreover the EINPROGRESS > semantics would undermine that anyway. So saving errno before calling > close is probably the right thing to do here. > > > Thanks, > > Jussi > > Thanks for the clear analysis and patch! If nobody objects/sees anything wrong, I'll go ahead and take this now. Sorry for the long delay! Rich
On Tue, Nov 23, 2021 at 02:47:49PM +0000, Nieminen, Jussi wrote:
> Hi,
>
> I'm a developer from the performance monitoring company Dynatrace, and I've been
> recently investigating curious problems at our customers' environments where a
> call to musl's getaddrinfo appears to spuriously return ENOENT when called from
> a node.js application that is being monitored with the Dynatrace agent.
>
> I managed to pinpoint the problem to the code that performs the AI_ADDRCONFIG
> check. If an address family that is not enabled on the host is specified, a call
> to "connect" in that code fails, the socket fd is closed, and the value of
> "errno" is then evaluated.
>
> The problem is that the call to "close" can change the value of errno, which
> will break the switch-case that follows it. Especially if aio is used (which is
> the case when the Dynatrace agent is included in the application), the call to
> close will end up setting errno to ENOENT by default (even without a failure)
> within the "aio_cancel" function if an aio operation is active. In such a case
> getaddrinfo will then incorrectly return EAI_SYSTEM with errno set to ENOENT.
>
> (After some error code translations within libuv, node.js will then print an
> error message claiming that getaddrinfo failed with ENOENT which is rather
> confusing.)
>
> Even if aio is not used, the code might fail whenever "close" gets interrupted
> and returns with errno set to EINTR. As the return value of close is not
> checked, the errno might thus "silently" change before getting evaluated with
> the assumption that it still contains the value set when "connect" failed.
>
> Below is a simple patch that should take care of this problem. Let me know if I
> can provide any more information or if there is anything else I can help with.
>
> Thanks,
> Jussi
>
>
> -------------------------------------------------------------------------------
> diff --git a/src/network/getaddrinfo.c b/src/network/getaddrinfo.c
> index efaab306..71809856 100644
> --- a/src/network/getaddrinfo.c
> +++ b/src/network/getaddrinfo.c
> @@ -16,6 +16,7 @@ int getaddrinfo(const char *restrict host, const char *restrict serv, const stru
> char canon[256], *outcanon;
> int nservs, naddrs, nais, canon_len, i, j, k;
> int family = AF_UNSPEC, flags = 0, proto = 0, socktype = 0;
> + int saved_errno = 0;
> struct aibuf *out;
>
> if (!host && !serv) return EAI_NONAME;
> @@ -66,11 +67,14 @@ int getaddrinfo(const char *restrict host, const char *restrict serv, const stru
> pthread_setcancelstate(
> PTHREAD_CANCEL_DISABLE, &cs);
> int r = connect(s, ta[i], tl[i]);
> + /* The call to "close" might change errno, especially if aio is in use;
> + * save the value set by "connect" for the later comparison. */
> + if (r < 0) saved_errno = errno;
> pthread_setcancelstate(cs, 0);
> close(s);
> if (!r) continue;
> }
> - switch (errno) {
> + switch (saved_errno) {
> case EADDRNOTAVAIL:
> case EAFNOSUPPORT:
> case EHOSTUNREACH:
> -------------------------------------------------------------------------------
A couple minor problems with the patch:
- The errno from socket() is not used if the failure was from
socket(). I'm not sure yet if that matters but I think it may if
IPv6 was disabled in a way that makes socket() fail.
- In the case where EAI_SYSTEM is returned, the error was not restored
back into errno, so the caller cannot get the cause of error if it
was clobbered by close.
I'll work on a fixed version. I think the right thing to do is just
save/restore errno itself rather than switching on saved_errno.
Rich