From: Charlie Brady <charlieb-supervision@budge.apana.org.au>
To: supervision@list.skarnet.org
Subject: Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?
Date: Tue, 17 Aug 2010 16:51:43 -0400 (EDT) [thread overview]
Message-ID: <Pine.LNX.4.64.1008171529440.4362@e-smith.charlieb.ott.istop.com> (raw)
In-Reply-To: <20100817075720.GA9754@skarnet.org>
On Tue, 17 Aug 2010, Laurent Bercot wrote:
> Before accusing the Linux kernel, let's check the runsv code and see
> whether there's a possible execution path that leads to the situation
> you're describing...
>
> The pipe creation part looks correct.
> The part where the error occurs looks correct.
> Okay, so is there a place where the pipe might be closed? Sure enough,
> there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN,
> svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0]
> both get closed. And this is the only place where it can happen.
>
> My bet is that at some point, your runsv ran through that code, but
> somehow managed to live and the services didn't die, i.e. another control
> message was sent and processed before the exit condition was reached, and
> runsv is still trying to supervise things - but runs into trouble with the
> closed logpipe. I have no time to investigate further right now, but earlier
> in your strace, you should see stuff such as the control messages arriving,
> the logpipe getting closed, etc.
> If my bet is correct, then the bug is that there's a case where runsv can
> close the logpipe and still keep going, whereas it should exit as soon as
> the logger dies no matter what (or just exit on the spot and let the logger
> die on its own).
Some additional information. There have also been some runsv fatal startup
errors, due to failure to flock supervise/lock. The most recent of those
post-dates the most recent ./log/run failure.
[root@mbgvm1 ~]# tr \\0 \\n < /proc/1913/cmdline
runsvdir
-P
/service
log: : unable to lock supervise/lock: temporary failure
runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run:
file descriptor not open
runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure
runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run:
file descriptor not open
runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure
[root@mbgvm1 ~]#
Although we don't have timestamps on those log messages, I'm prepared to
bet the were generated close to Aug 10 15:21:54 - that's when new versions
of /service/tug* symlinks were installed via rpm:
[root@mbgvm1 ~]# ps -eo pid,lstart,cmd | grep runsv
1913 Mon Aug 9 14:02:24 2010 runsvdir -P /service log: : unable to lock
supervise/lock: temporary failure?runsv tug-metrics: fatal: unable to
setup filedescriptor for ./log/run: file descriptor not open?runsv
tug-metrics: fatal: unable to lock supervise/lock: temporary failure?runsv
tug-metrics: fatal: unable to setup filedescriptor for ./log/run: file
descriptor not open?runsv tug-metrics: fatal: unable to lock
supervise/lock: temporary failure?
1940 Mon Aug 9 14:02:24 2010 runsv oidentd
1942 Mon Aug 9 14:02:24 2010 runsv radiusd
1947 Mon Aug 9 14:02:24 2010 runsv eventq
1948 Mon Aug 9 14:02:24 2010 runsv haldaemon
1949 Mon Aug 9 14:02:24 2010 runsv dhcpd
1950 Mon Aug 9 14:02:24 2010 runsv qdiskd
1951 Mon Aug 9 14:02:24 2010 runsv replicate-master
1958 Mon Aug 9 14:02:24 2010 runsv qmail
1962 Mon Aug 9 14:02:24 2010 runsv squid
1963 Mon Aug 9 14:02:24 2010 runsv soap-wrapper
1964 Mon Aug 9 14:02:24 2010 runsv smtp-auth-proxy
1973 Mon Aug 9 14:02:24 2010 runsv httpd-admin
1984 Mon Aug 9 14:02:24 2010 runsv cvm-unix-local
1985 Mon Aug 9 14:02:24 2010 runsv dnscache.forwarder
1986 Mon Aug 9 14:02:24 2010 runsv ulogd
1987 Mon Aug 9 14:02:24 2010 runsv qpsmtpd
1992 Mon Aug 9 14:02:24 2010 runsv wan
2000 Mon Aug 9 14:02:24 2010 runsv sshd
2005 Mon Aug 9 14:02:24 2010 runsv tug-metrics
2006 Mon Aug 9 14:02:24 2010 runsv ntpd
2007 Mon Aug 9 14:02:24 2010 runsv tomcat
2010 Mon Aug 9 14:02:24 2010 runsv heartbeat
2011 Mon Aug 9 14:02:24 2010 runsv certmgmt
2012 Mon Aug 9 14:02:24 2010 runsv dnscache
2015 Mon Aug 9 14:02:24 2010 runsv snmpd
2016 Mon Aug 9 14:02:24 2010 runsv alarmd
2017 Mon Aug 9 14:02:24 2010 runsv vmware-guestd
2018 Mon Aug 9 14:02:24 2010 runsv mntftpserver
2019 Mon Aug 9 14:02:24 2010 runsv httpd-e-smith
2020 Mon Aug 9 14:02:24 2010 runsv ldap
2021 Mon Aug 9 14:02:24 2010 runsv clusterip
2022 Mon Aug 9 14:02:24 2010 runsv pptpd
2023 Mon Aug 9 14:02:24 2010 runsv tinydns
2024 Mon Aug 9 14:02:24 2010 runsv postgresql
2025 Mon Aug 9 14:02:24 2010 runsv raidmonitor
4506 Tue Aug 17 16:13:25 2010 grep runsv
14262 Tue Aug 10 15:21:54 2010 runsv tug
14263 Tue Aug 10 15:21:54 2010 runsv tug-eventd
[root@mbgvm1 ~]#
I suspect there is a race condition in runsvdir when a /service/xxx
symlink is replaced with another. If I read the code correctly, runsvdir
will detect that the inode number of the symlink is different, send
sigterm to the runsv process, and then forget about it. On the next scan,
it will start a new runsv process - which might fail if the old one still
persists, and has a lock on the supervise/lock file.
The runsv process, on receipt of the sigterm, will send itself the 'x'
control command. As Laurent points out, this will cause runsv to close the
logging pipe.
Larent also says:
whereas it should exit as soon as the logger dies no matter what (or just
exit on the spot and let the logger die on its own).
however I don't see that in the code. runsv will exit if it reaches this
code and has recorded the logger as having exited, via waitpid()
processing:
...
if ((svd[0].want == W_EXIT) && (svd[0].state == S_DOWN)) {
if (svd[1].pid == 0) _exit(0);
if (svd[1].want != W_EXIT) {
svd[1].want =W_EXIT;
/* stopservice(&svd[1]); */
update_status(&svd[1]);
if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]");
if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]");
}
}
...
I think it is depending on closing the pipe to terminate the logger, and
is depending on waitpid() processing to terminate the runsv process. Am I
reading this correctly?
prev parent reply other threads:[~2010-08-17 20:51 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-08-16 20:32 Charlie Brady
2010-08-16 23:30 ` Charlie Brady
2010-08-17 7:57 ` Laurent Bercot
2010-08-17 11:56 ` Charlie Brady
2010-08-17 20:51 ` Charlie Brady [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.64.1008171529440.4362@e-smith.charlieb.ott.istop.com \
--to=charlieb-supervision@budge.apana.org.au \
--cc=supervision@list.skarnet.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).