supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit
 help / color / mirror / Atom feed
* runsv failing when starting up logger - missing pipe - failure of logpipe init?
@ 2010-08-16 20:32 Charlie Brady
  2010-08-16 23:30 ` Charlie Brady
  0 siblings, 1 reply; 5+ messages in thread
From: Charlie Brady @ 2010-08-16 20:32 UTC (permalink / raw)
  To: supervision


I'm running runit 1.7.2, and have a sytem showing this anomolous behaviour 
(one service is failing to start).

Here is the bad service:

[root@mbgvm1 ~]# sv status /service/tug-metrics/
down: /service/tug-metrics/: 1316s, normally up; down: log: 0s, normally 
up, want up
[root@mbgvm1 ~]#

strace of the runsv process shows that the child is dying immediately 
because fd 5 is not valid:

2005  fork()                            = 8121
2005  gettimeofday({1281988874, 32865}, NULL) = 0
2005  open("supervise/pid.new", O_WRONLY|O_CREAT|O_TRUNC|O_NONBLOCK, 0644) 
= 5
2005  write(5, "8121\n", 5)             = 5
2005  close(5)                          = 0
2005  rename("supervise/pid.new", "log/supervise/pid") = 0
2005  open("supervise/stat.new", O_WRONLY|O_CREAT|O_TRUNC|O_NONBLOCK, 
0644) = 5
2005  write(5, "run\n", 4)              = 4
2005  close(5)                          = 0
2005  rename("supervise/stat.new", "log/supervise/stat") = 0
2005  open("supervise/status.new", O_WRONLY|O_CREAT|O_TRUNC|O_NONBLOCK, 
0644) = 5
2005  write(5, "@\0\0\0Li\231\24\1\365|\334\271\37\0\0\0u\0\1", 20) = 20
2005  close(5)                          = 0
2005  rename("supervise/status.new", "log/supervise/status") = 0
2005  gettimeofday({1281988874, 33467}, NULL) = 0
2005  rt_sigprocmask(SIG_UNBLOCK, [TERM], NULL, 8) = 0
2005  rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
2005  poll([{fd=3, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, 
events=POLLIN}], 3, 1000020 <unfinished ...>
8121  fcntl(5, F_GETFL)                 = -1 EBADF (Bad file descriptor)
8121  write(2, "runsv tug-metrics: fatal: unable to setup filedescriptor 
for ./log/run: file descriptor not open\n", 97) = 97
8121  _exit(111)                        = ?
2005  <... poll resumed> )              = -1 EINTR (Interrupted system 
call)


We can see there is no fd 5:

[root@mbgvm1 ~]# ls -l /proc/2005/fd/*
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/0 -> /dev/null
lrwx------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/1 -> /dev/console
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/10 -> 
/var/service/tug-metrics/supervise/control
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/11 -> 
/var/service/tug-metrics/log/supervise/control
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/12 -> 
/var/service/tug-metrics/log/supervise/control
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/13 -> 
/var/service/tug-metrics/supervise/ok
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/14 -> 
/var/service/tug-metrics/log/supervise/ok
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/2 -> pipe:[6121]
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/3 -> pipe:[6363]
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/4 -> pipe:[6363]
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/7 -> 
/var/service/tug-metrics/supervise/lock
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/8 -> 
/var/service/tug-metrics/log/supervise/lock
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2005/fd/9 -> 
/var/service/tug-metrics/supervise/control
[root@mbgvm1 ~]# 

If we look at a well behaved runsv, we see that fd 5 and fd 6 is a second 
pipe:

[root@mbgvm1 ~]# ls -l /proc/2023/fd/*
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/0 -> /dev/null
lrwx------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/1 -> /dev/console
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/10 -> 
/var/service/tinydns/supervise/control
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/11 -> 
/var/service/tinydns/log/supervise/control
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/12 -> 
/var/service/tinydns/log/supervise/control
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/13 -> 
/var/service/tinydns/supervise/ok
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/14 -> 
/var/service/tinydns/log/supervise/ok
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/2 -> pipe:[6121]
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/3 -> pipe:[6421]
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/4 -> pipe:[6421]
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/5 -> pipe:[6425]
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/6 -> pipe:[6425]
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/7 -> 
/var/service/tinydns/supervise/lock
l-wx------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/8 -> 
/var/service/tinydns/log/supervise/lock
lr-x------ 1 root root 64 Aug 16 11:37 /proc/2023/fd/9 -> 
/var/service/tinydns/supervise/control
[root@mbgvm1 ~]#

Here's the code in runsv where the child is dying:

...
  if (p == 0) {
    /* child */
    if (haslog) {
      if (s->islog) {
        if (fd_copy(0, logpipe[0]) == -1)
          fatal("unable to setup filedescriptor for ./log/run");
...

and here is where logpipe[0] should have been initialised:

...
      if (pipe(logpipe) == -1)
        fatal("unable to create log pipe");
...

Has anyone else seen this error condition or can posit a situation where 
it might be seen?


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?
  2010-08-16 20:32 runsv failing when starting up logger - missing pipe - failure of logpipe init? Charlie Brady
@ 2010-08-16 23:30 ` Charlie Brady
  2010-08-17  7:57   ` Laurent Bercot
  0 siblings, 1 reply; 5+ messages in thread
From: Charlie Brady @ 2010-08-16 23:30 UTC (permalink / raw)
  To: supervision


On Mon, 16 Aug 2010, Charlie Brady wrote:

> Here's the code in runsv where the child is dying:
> 
> ...
>   if (p == 0) {
>     /* child */
>     if (haslog) {
>       if (s->islog) {
>         if (fd_copy(0, logpipe[0]) == -1)
>           fatal("unable to setup filedescriptor for ./log/run");
> ...
> 
> and here is where logpipe[0] should have been initialised:
> 
> ...
>       if (pipe(logpipe) == -1)
>         fatal("unable to create log pipe");
> ...
> 
> Has anyone else seen this error condition or can posit a situation where 
> it might be seen?

The next question to ponder is where the bug lies. The runsv process here 
has no fd 5 and fd 6 - IOW, logpipe[0] is 5, but isn't a valid fd. Are 
there circumstances where a pipe can just cease to be? Should runsv have 
detected this issue (where pipe() did not return -1, but the fds returned 
were not valid)?

Is this a linux kernel bug?


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?
  2010-08-16 23:30 ` Charlie Brady
@ 2010-08-17  7:57   ` Laurent Bercot
  2010-08-17 11:56     ` Charlie Brady
  2010-08-17 20:51     ` Charlie Brady
  0 siblings, 2 replies; 5+ messages in thread
From: Laurent Bercot @ 2010-08-17  7:57 UTC (permalink / raw)
  To: supervision

>> Has anyone else seen this error condition or can posit a situation where 
>> it might be seen?
> 
> The next question to ponder is where the bug lies. The runsv process here 
> has no fd 5 and fd 6 - IOW, logpipe[0] is 5, but isn't a valid fd. Are 
> there circumstances where a pipe can just cease to be? Should runsv have 
> detected this issue (where pipe() did not return -1, but the fds returned 
> were not valid)?
> 
> Is this a linux kernel bug?

 Before accusing the Linux kernel, let's check the runsv code and see
whether there's a possible execution path that leads to the situation
you're describing...

 The pipe creation part looks correct.
 The part where the error occurs looks correct.
 Okay, so is there a place where the pipe might be closed? Sure enough,
there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN,
svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0]
both get closed. And this is the only place where it can happen.

 My bet is that at some point, your runsv ran through that code, but
somehow managed to live and the services didn't die, i.e. another control
message was sent and processed before the exit condition was reached, and
runsv is still trying to supervise things - but runs into trouble with the
closed logpipe. I have no time to investigate further right now, but earlier
in your strace, you should see stuff such as the control messages arriving,
the logpipe getting closed, etc.
 If my bet is correct, then the bug is that there's a case where runsv can
close the logpipe and still keep going, whereas it should exit as soon as
the logger dies no matter what (or just exit on the spot and let the logger
die on its own).

-- 
 Laurent


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?
  2010-08-17  7:57   ` Laurent Bercot
@ 2010-08-17 11:56     ` Charlie Brady
  2010-08-17 20:51     ` Charlie Brady
  1 sibling, 0 replies; 5+ messages in thread
From: Charlie Brady @ 2010-08-17 11:56 UTC (permalink / raw)
  To: Laurent Bercot; +Cc: supervision


On Tue, 17 Aug 2010, Laurent Bercot wrote:

>  The pipe creation part looks correct.
>  The part where the error occurs looks correct.
>  Okay, so is there a place where the pipe might be closed? Sure enough,
> there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN,
> svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0]
> both get closed. And this is the only place where it can happen.
> 
>  My bet is that at some point, your runsv ran through that code, but
> somehow managed to live and the services didn't die, i.e. another control
> message was sent and processed before the exit condition was reached, and
> runsv is still trying to supervise things - but runs into trouble with the
> closed logpipe. I have no time to investigate further right now, but earlier
> in your strace, you should see stuff such as the control messages arriving,
> the logpipe getting closed, etc.
>  If my bet is correct, then the bug is that there's a case where runsv can
> close the logpipe and still keep going, whereas it should exit as soon as
> the logger dies no matter what (or just exit on the spot and let the logger
> die on its own).

Thanks Laurent, for your pointer. Unfortunately the strace won't help, 
since it wasn't started until long after the runsv process was already 
malfunctioning.

Presumably Gerrit will have a good think about possible execution paths. 
I'll look further too.

---
Charlie


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: runsv failing when starting up logger - missing pipe - failure of logpipe init?
  2010-08-17  7:57   ` Laurent Bercot
  2010-08-17 11:56     ` Charlie Brady
@ 2010-08-17 20:51     ` Charlie Brady
  1 sibling, 0 replies; 5+ messages in thread
From: Charlie Brady @ 2010-08-17 20:51 UTC (permalink / raw)
  To: supervision


On Tue, 17 Aug 2010, Laurent Bercot wrote:

>  Before accusing the Linux kernel, let's check the runsv code and see
> whether there's a possible execution path that leads to the situation
> you're describing...
> 
>  The pipe creation part looks correct.
>  The part where the error occurs looks correct.
>  Okay, so is there a place where the pipe might be closed? Sure enough,
> there is: right at the end, if svd[0].want == W_EXIT, svd[0].state == DOWN,
> svd[1].pid != 0 and svd[1].want != W_EXIT, then logpipe[1] and logpipe[0]
> both get closed. And this is the only place where it can happen.
> 
>  My bet is that at some point, your runsv ran through that code, but
> somehow managed to live and the services didn't die, i.e. another control
> message was sent and processed before the exit condition was reached, and
> runsv is still trying to supervise things - but runs into trouble with the
> closed logpipe. I have no time to investigate further right now, but earlier
> in your strace, you should see stuff such as the control messages arriving,
> the logpipe getting closed, etc.
>  If my bet is correct, then the bug is that there's a case where runsv can
> close the logpipe and still keep going, whereas it should exit as soon as
> the logger dies no matter what (or just exit on the spot and let the logger
> die on its own).

Some additional information. There have also been some runsv fatal startup 
errors, due to failure to flock supervise/lock. The most recent of those 
post-dates the most recent ./log/run failure.

[root@mbgvm1 ~]# tr \\0 \\n < /proc/1913/cmdline 
runsvdir
-P
/service
log: : unable to lock supervise/lock: temporary failure
runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run: 
file descriptor not open
runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure
runsv tug-metrics: fatal: unable to setup filedescriptor for ./log/run: 
file descriptor not open
runsv tug-metrics: fatal: unable to lock supervise/lock: temporary failure

[root@mbgvm1 ~]# 

Although we don't have timestamps on those log messages, I'm prepared to 
bet the were generated close to Aug 10 15:21:54 - that's when new versions 
of /service/tug* symlinks were installed via rpm:

[root@mbgvm1 ~]# ps -eo pid,lstart,cmd | grep runsv
 1913 Mon Aug  9 14:02:24 2010 runsvdir -P /service log: : unable to lock 
supervise/lock: temporary failure?runsv tug-metrics: fatal: unable to 
setup filedescriptor for ./log/run: file descriptor not open?runsv 
tug-metrics: fatal: unable to lock supervise/lock: temporary failure?runsv 
tug-metrics: fatal: unable to setup filedescriptor for ./log/run: file 
descriptor not open?runsv tug-metrics: fatal: unable to lock 
supervise/lock: temporary failure?
 1940 Mon Aug  9 14:02:24 2010 runsv oidentd
 1942 Mon Aug  9 14:02:24 2010 runsv radiusd
 1947 Mon Aug  9 14:02:24 2010 runsv eventq
 1948 Mon Aug  9 14:02:24 2010 runsv haldaemon
 1949 Mon Aug  9 14:02:24 2010 runsv dhcpd
 1950 Mon Aug  9 14:02:24 2010 runsv qdiskd
 1951 Mon Aug  9 14:02:24 2010 runsv replicate-master
 1958 Mon Aug  9 14:02:24 2010 runsv qmail
 1962 Mon Aug  9 14:02:24 2010 runsv squid
 1963 Mon Aug  9 14:02:24 2010 runsv soap-wrapper
 1964 Mon Aug  9 14:02:24 2010 runsv smtp-auth-proxy
 1973 Mon Aug  9 14:02:24 2010 runsv httpd-admin
 1984 Mon Aug  9 14:02:24 2010 runsv cvm-unix-local
 1985 Mon Aug  9 14:02:24 2010 runsv dnscache.forwarder
 1986 Mon Aug  9 14:02:24 2010 runsv ulogd
 1987 Mon Aug  9 14:02:24 2010 runsv qpsmtpd
 1992 Mon Aug  9 14:02:24 2010 runsv wan
 2000 Mon Aug  9 14:02:24 2010 runsv sshd
 2005 Mon Aug  9 14:02:24 2010 runsv tug-metrics
 2006 Mon Aug  9 14:02:24 2010 runsv ntpd
 2007 Mon Aug  9 14:02:24 2010 runsv tomcat
 2010 Mon Aug  9 14:02:24 2010 runsv heartbeat
 2011 Mon Aug  9 14:02:24 2010 runsv certmgmt
 2012 Mon Aug  9 14:02:24 2010 runsv dnscache
 2015 Mon Aug  9 14:02:24 2010 runsv snmpd
 2016 Mon Aug  9 14:02:24 2010 runsv alarmd
 2017 Mon Aug  9 14:02:24 2010 runsv vmware-guestd
 2018 Mon Aug  9 14:02:24 2010 runsv mntftpserver
 2019 Mon Aug  9 14:02:24 2010 runsv httpd-e-smith
 2020 Mon Aug  9 14:02:24 2010 runsv ldap
 2021 Mon Aug  9 14:02:24 2010 runsv clusterip
 2022 Mon Aug  9 14:02:24 2010 runsv pptpd
 2023 Mon Aug  9 14:02:24 2010 runsv tinydns
 2024 Mon Aug  9 14:02:24 2010 runsv postgresql
 2025 Mon Aug  9 14:02:24 2010 runsv raidmonitor
 4506 Tue Aug 17 16:13:25 2010 grep runsv
14262 Tue Aug 10 15:21:54 2010 runsv tug
14263 Tue Aug 10 15:21:54 2010 runsv tug-eventd
[root@mbgvm1 ~]# 

I suspect there is a race condition in runsvdir when a /service/xxx 
symlink is replaced with another. If I read the code correctly, runsvdir 
will detect that the inode number of the symlink is different, send 
sigterm to the runsv process, and then forget about it. On the next scan, 
it will start a new runsv process - which might fail if the old one still 
persists, and has a lock on the supervise/lock file.

The runsv process, on receipt of the sigterm, will send itself the 'x' 
control command. As Laurent points out, this will cause runsv to close the 
logging pipe. 

Larent also says:

 whereas it should exit as soon as the logger dies no matter what (or just
 exit on the spot and let the logger die on its own).

however I don't see that in the code. runsv will exit if it reaches this 
code and has recorded the logger as having exited, via waitpid() 
processing:

...
    if ((svd[0].want == W_EXIT) && (svd[0].state == S_DOWN)) {
      if (svd[1].pid == 0) _exit(0);
      if (svd[1].want != W_EXIT) {
        svd[1].want =W_EXIT;
        /* stopservice(&svd[1]); */
        update_status(&svd[1]);
        if (close(logpipe[1]) == -1) warn("unable to close logpipe[1]");
        if (close(logpipe[0]) == -1) warn("unable to close logpipe[0]");
      }
    } 
...

I think it is depending on closing the pipe to terminate the logger, and 
is depending on waitpid() processing to terminate the runsv process. Am I 
reading this correctly?


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-08-17 20:51 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-16 20:32 runsv failing when starting up logger - missing pipe - failure of logpipe init? Charlie Brady
2010-08-16 23:30 ` Charlie Brady
2010-08-17  7:57   ` Laurent Bercot
2010-08-17 11:56     ` Charlie Brady
2010-08-17 20:51     ` Charlie Brady

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).