supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit
 help / color / mirror / Atom feed
From: Charlie Brady <charlieb-supervision@budge.apana.org.au>
To: supervision@list.skarnet.org
Subject: Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?
Date: Tue, 17 Aug 2010 16:51:43 -0400 (EDT)	[thread overview]
Message-ID: <Pine.LNX.4.64.1008171529440.4362@e-smith.charlieb.ott.istop.com> (raw)
In-Reply-To: <20100817075720.GA9754@skarnet.org>


On Tue, 17 Aug 2010, Laurent Bercot wrote:

>  Before accusing the Linux kernel, let's check the runsv code and see
> whether there's a possible execution path that leads to the situation
> you're describing...
> 
>  The pipe creation part looks correct.
>  The part where the error occurs looks correct.
>  Okay, so is there a place where the pipe might be closed? Sure enough,
> there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN,
> svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0]
> both get closed. And this is the only place where it can happen.
> 
>  My bet is that at some point, your runsv ran through that code, but
> somehow managed to live and the services didn't die, i.e. another control
> message was sent and processed before the exit condition was reached, and
> runsv is still trying to supervise things - but runs into trouble with the
> closed logpipe. I have no time to investigate further right now, but earlier
> in your strace, you should see stuff such as the control messages arriving,
> the logpipe getting closed, etc.
>  If my bet is correct, then the bug is that there's a case where runsv can
> close the logpipe and still keep going, whereas it should exit as soon as
> the logger dies no matter what (or just exit on the spot and let the logger
> die on its own).

Some additional information. There have also been some runsv fatal startup 
errors, due to failure to flock supervise/lock. The most recent of those 
post-dates the most recent ./log/run failure.

[root@mbgvm1 ~]# tr \\0 \\n < /proc/1913/cmdline 
runsvdir
-P
/service
log: : unable to lock supervise/lock: temporary failure
runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run: 
file descriptor not open
runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure
runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run: 
file descriptor not open
runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure

[root@mbgvm1 ~]# 

Although we don't have timestamps on those log messages, I'm prepared to 
bet the were generated close to Aug 10 15:21:54 - that's when new versions 
of /service/tug* symlinks were installed via rpm:

[root@mbgvm1 ~]# ps -eo pid,lstart,cmd | grep runsv
 1913 Mon Aug  9 14:02:24 2010 runsvdir -P /service log: : unable to lock 
supervise/lock: temporary failure?runsv tug-metrics: fatal: unable to 
setup filedescriptor for ./log/run: file descriptor not open?runsv 
tug-metrics: fatal: unable to lock supervise/lock: temporary failure?runsv 
tug-metrics: fatal: unable to setup filedescriptor for ./log/run: file 
descriptor not open?runsv tug-metrics: fatal: unable to lock 
supervise/lock: temporary failure?
 1940 Mon Aug  9 14:02:24 2010 runsv oidentd
 1942 Mon Aug  9 14:02:24 2010 runsv radiusd
 1947 Mon Aug  9 14:02:24 2010 runsv eventq
 1948 Mon Aug  9 14:02:24 2010 runsv haldaemon
 1949 Mon Aug  9 14:02:24 2010 runsv dhcpd
 1950 Mon Aug  9 14:02:24 2010 runsv qdiskd
 1951 Mon Aug  9 14:02:24 2010 runsv replicate-master
 1958 Mon Aug  9 14:02:24 2010 runsv qmail
 1962 Mon Aug  9 14:02:24 2010 runsv squid
 1963 Mon Aug  9 14:02:24 2010 runsv soap-wrapper
 1964 Mon Aug  9 14:02:24 2010 runsv smtp-auth-proxy
 1973 Mon Aug  9 14:02:24 2010 runsv httpd-admin
 1984 Mon Aug  9 14:02:24 2010 runsv cvm-unix-local
 1985 Mon Aug  9 14:02:24 2010 runsv dnscache.forwarder
 1986 Mon Aug  9 14:02:24 2010 runsv ulogd
 1987 Mon Aug  9 14:02:24 2010 runsv qpsmtpd
 1992 Mon Aug  9 14:02:24 2010 runsv wan
 2000 Mon Aug  9 14:02:24 2010 runsv sshd
 2005 Mon Aug  9 14:02:24 2010 runsv tug-metrics
 2006 Mon Aug  9 14:02:24 2010 runsv ntpd
 2007 Mon Aug  9 14:02:24 2010 runsv tomcat
 2010 Mon Aug  9 14:02:24 2010 runsv heartbeat
 2011 Mon Aug  9 14:02:24 2010 runsv certmgmt
 2012 Mon Aug  9 14:02:24 2010 runsv dnscache
 2015 Mon Aug  9 14:02:24 2010 runsv snmpd
 2016 Mon Aug  9 14:02:24 2010 runsv alarmd
 2017 Mon Aug  9 14:02:24 2010 runsv vmware-guestd
 2018 Mon Aug  9 14:02:24 2010 runsv mntftpserver
 2019 Mon Aug  9 14:02:24 2010 runsv httpd-e-smith
 2020 Mon Aug  9 14:02:24 2010 runsv ldap
 2021 Mon Aug  9 14:02:24 2010 runsv clusterip
 2022 Mon Aug  9 14:02:24 2010 runsv pptpd
 2023 Mon Aug  9 14:02:24 2010 runsv tinydns
 2024 Mon Aug  9 14:02:24 2010 runsv postgresql
 2025 Mon Aug  9 14:02:24 2010 runsv raidmonitor
 4506 Tue Aug 17 16:13:25 2010 grep runsv
14262 Tue Aug 10 15:21:54 2010 runsv tug
14263 Tue Aug 10 15:21:54 2010 runsv tug-eventd
[root@mbgvm1 ~]# 

I suspect there is a race condition in runsvdir when a /service/xxx 
symlink is replaced with another. If I read the code correctly, runsvdir 
will detect that the inode number of the symlink is different, send 
sigterm to the runsv process, and then forget about it. On the next scan, 
it will start a new runsv process - which might fail if the old one still 
persists, and has a lock on the supervise/lock file.

The runsv process, on receipt of the sigterm, will send itself the 'x' 
control command. As Laurent points out, this will cause runsv to close the 
logging pipe. 

Larent also says:

 whereas it should exit as soon as the logger dies no matter what (or just
 exit on the spot and let the logger die on its own).

however I don't see that in the code. runsv will exit if it reaches this 
code and has recorded the logger as having exited, via waitpid() 
processing:

...
    if ((svd[0].want == W_EXIT) && (svd[0].state == S_DOWN)) {
      if (svd[1].pid == 0) _exit(0);
      if (svd[1].want != W_EXIT) {
        svd[1].want =W_EXIT;
        /* stopservice(&svd[1]); */
        update_status(&svd[1]);
        if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]");
        if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]");
      }
    } 
...

I think it is depending on closing the pipe to terminate the logger, and 
is depending on waitpid() processing to terminate the runsv process. Am I 
reading this correctly?


      parent reply	other threads:[~2010-08-17 20:51 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-08-16 20:32 Charlie Brady
2010-08-16 23:30 ` Charlie Brady
2010-08-17  7:57   ` Laurent Bercot
2010-08-17 11:56     ` Charlie Brady
2010-08-17 20:51     ` Charlie Brady [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.1008171529440.4362@e-smith.charlieb.ott.istop.com \
    --to=charlieb-supervision@budge.apana.org.au \
    --cc=supervision@list.skarnet.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).