From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/2031 Path: news.gmane.org!not-for-mail From: Charlie Brady Newsgroups: gmane.comp.sysutils.supervision.general Subject: Re: runsv failing when starting up logger - missing pipe - failure of logpipe init? Date: Tue, 17 Aug 2010 16:51:43 -0400 (EDT) Message-ID: References: <20100817075720.GA9754@skarnet.org> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Trace: dough.gmane.org 1282078313 16386 80.91.229.12 (17 Aug 2010 20:51:53 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Tue, 17 Aug 2010 20:51:53 +0000 (UTC) To: supervision@list.skarnet.org Original-X-From: supervision-return-2266-gcsg-supervision=m.gmane.org@list.skarnet.org Tue Aug 17 22:51:52 2010 Return-path: Envelope-to: gcsg-supervision@lo.gmane.org Original-Received: from antah.skarnet.org ([212.85.147.14]) by lo.gmane.org with smtp (Exim 4.69) (envelope-from ) id 1OlT8K-0001eQ-AO for gcsg-supervision@lo.gmane.org; Tue, 17 Aug 2010 22:51:44 +0200 Original-Received: (qmail 3447 invoked by uid 76); 17 Aug 2010 20:53:47 -0000 Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-Archive: Original-Received: (qmail 3437 invoked from network); 17 Aug 2010 20:53:47 -0000 X-X-Sender: charlieb@e-smith.charlieb.ott.istop.com In-Reply-To: <20100817075720.GA9754@skarnet.org> Xref: news.gmane.org gmane.comp.sysutils.supervision.general:2031 Archived-At: On Tue, 17 Aug 2010, Laurent Bercot wrote: > Before accusing the Linux kernel, let's check the runsv code and see > whether there's a possible execution path that leads to the situation > you're describing... > > The pipe creation part looks correct. > The part where the error occurs looks correct. > Okay, so is there a place where the pipe might be closed? Sure enough, > there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN, > svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0] > both get closed. And this is the only place where it can happen. > > My bet is that at some point, your runsv ran through that code, but > somehow managed to live and the services didn't die, i.e. another control > message was sent and processed before the exit condition was reached, and > runsv is still trying to supervise things - but runs into trouble with the > closed logpipe. I have no time to investigate further right now, but earlier > in your strace, you should see stuff such as the control messages arriving, > the logpipe getting closed, etc. > If my bet is correct, then the bug is that there's a case where runsv can > close the logpipe and still keep going, whereas it should exit as soon as > the logger dies no matter what (or just exit on the spot and let the logger > die on its own). Some additional information. There have also been some runsv fatal startup errors, due to failure to flock supervise/lock. The most recent of those post-dates the most recent ./log/run failure. [root@mbgvm1 ~]# tr \\0 \\n < /proc/1913/cmdline runsvdir -P /service log: : unable to lock supervise/lock: temporary failure runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run: file descriptor not open runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run: file descriptor not open runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure [root@mbgvm1 ~]# Although we don't have timestamps on those log messages, I'm prepared to bet the were generated close to Aug 10 15:21:54 - that's when new versions of /service/tug* symlinks were installed via rpm: [root@mbgvm1 ~]# ps -eo pid,lstart,cmd | grep runsv 1913 Mon Aug 9 14:02:24 2010 runsvdir -P /service log: : unable to lock supervise/lock: temporary failure?runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run: file descriptor not open?runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure?runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run: file descriptor not open?runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure? 1940 Mon Aug 9 14:02:24 2010 runsv oidentd 1942 Mon Aug 9 14:02:24 2010 runsv radiusd 1947 Mon Aug 9 14:02:24 2010 runsv eventq 1948 Mon Aug 9 14:02:24 2010 runsv haldaemon 1949 Mon Aug 9 14:02:24 2010 runsv dhcpd 1950 Mon Aug 9 14:02:24 2010 runsv qdiskd 1951 Mon Aug 9 14:02:24 2010 runsv replicate-master 1958 Mon Aug 9 14:02:24 2010 runsv qmail 1962 Mon Aug 9 14:02:24 2010 runsv squid 1963 Mon Aug 9 14:02:24 2010 runsv soap-wrapper 1964 Mon Aug 9 14:02:24 2010 runsv smtp-auth-proxy 1973 Mon Aug 9 14:02:24 2010 runsv httpd-admin 1984 Mon Aug 9 14:02:24 2010 runsv cvm-unix-local 1985 Mon Aug 9 14:02:24 2010 runsv dnscache.forwarder 1986 Mon Aug 9 14:02:24 2010 runsv ulogd 1987 Mon Aug 9 14:02:24 2010 runsv qpsmtpd 1992 Mon Aug 9 14:02:24 2010 runsv wan 2000 Mon Aug 9 14:02:24 2010 runsv sshd 2005 Mon Aug 9 14:02:24 2010 runsv tug-metrics 2006 Mon Aug 9 14:02:24 2010 runsv ntpd 2007 Mon Aug 9 14:02:24 2010 runsv tomcat 2010 Mon Aug 9 14:02:24 2010 runsv heartbeat 2011 Mon Aug 9 14:02:24 2010 runsv certmgmt 2012 Mon Aug 9 14:02:24 2010 runsv dnscache 2015 Mon Aug 9 14:02:24 2010 runsv snmpd 2016 Mon Aug 9 14:02:24 2010 runsv alarmd 2017 Mon Aug 9 14:02:24 2010 runsv vmware-guestd 2018 Mon Aug 9 14:02:24 2010 runsv mntftpserver 2019 Mon Aug 9 14:02:24 2010 runsv httpd-e-smith 2020 Mon Aug 9 14:02:24 2010 runsv ldap 2021 Mon Aug 9 14:02:24 2010 runsv clusterip 2022 Mon Aug 9 14:02:24 2010 runsv pptpd 2023 Mon Aug 9 14:02:24 2010 runsv tinydns 2024 Mon Aug 9 14:02:24 2010 runsv postgresql 2025 Mon Aug 9 14:02:24 2010 runsv raidmonitor 4506 Tue Aug 17 16:13:25 2010 grep runsv 14262 Tue Aug 10 15:21:54 2010 runsv tug 14263 Tue Aug 10 15:21:54 2010 runsv tug-eventd [root@mbgvm1 ~]# I suspect there is a race condition in runsvdir when a /service/xxx symlink is replaced with another. If I read the code correctly, runsvdir will detect that the inode number of the symlink is different, send sigterm to the runsv process, and then forget about it. On the next scan, it will start a new runsv process - which might fail if the old one still persists, and has a lock on the supervise/lock file. The runsv process, on receipt of the sigterm, will send itself the 'x' control command. As Laurent points out, this will cause runsv to close the logging pipe. Larent also says: whereas it should exit as soon as the logger dies no matter what (or just exit on the spot and let the logger die on its own). however I don't see that in the code. runsv will exit if it reaches this code and has recorded the logger as having exited, via waitpid() processing: ... if ((svd[0].want == W_EXIT) && (svd[0].state == S_DOWN)) { if (svd[1].pid == 0) _exit(0); if (svd[1].want != W_EXIT) { svd[1].want =W_EXIT; /* stopservice(&svd[1]); */ update_status(&svd[1]); if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]"); if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]"); } } ... I think it is depending on closing the pipe to terminate the logger, and is depending on waitpid() processing to terminate the runsv process. Am I reading this correctly?