From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/2070
Path: news.gmane.org!not-for-mail
From: Laurent Bercot <ska-supervision@skarnet.org>
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: [LONG] Re: runit not collecting zombies
Date: Tue, 15 Feb 2011 14:12:18 +0100
Message-ID: <20110215131218.GA18284@skarnet.org>
References: <20070912172245.GF12043@home.power> <Pine.LNX.4.64.0709121329560.24457@e-smith.charlieb.ott.istop.com> <20070912181836.GG12043@home.power> <Pine.LNX.4.64.0709121451100.24457@e-smith.charlieb.ott.istop.com> <20070912191346.GH12043@home.power> <Pine.LNX.4.64.0709121514570.24457@e-smith.charlieb.ott.istop.com> <20070915133641.GA30650@home.power> <20070917075651.8280.qmail@f6989948e15a99.315fe32.mid.smarden.org> <20070917115924.GB1531@home.power> <20070918081441.20488.qmail@1a6f0ddc0befcc.315fe32.mid.smarden.org>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: dough.gmane.org 1297775414 27251 80.91.229.12 (15 Feb 2011 13:10:14 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Tue, 15 Feb 2011 13:10:14 +0000 (UTC)
To: supervision@list.skarnet.org
Original-X-From: supervision-return-2304-gcsg-supervision=m.gmane.org@list.skarnet.org Tue Feb 15 14:10:04 2011
Return-path: <supervision-return-2304-gcsg-supervision=m.gmane.org@list.skarnet.org>
Envelope-to: gcsg-supervision@lo.gmane.org
Original-Received: from antah.skarnet.org ([212.85.147.14])
	by lo.gmane.org with smtp (Exim 4.69)
	(envelope-from <supervision-return-2304-gcsg-supervision=m.gmane.org@list.skarnet.org>)
	id 1PpKfK-00028x-Qr
	for gcsg-supervision@lo.gmane.org; Tue, 15 Feb 2011 14:10:02 +0100
Original-Received: (qmail 31602 invoked by uid 76); 15 Feb 2011 13:12:19 -0000
Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm
List-Post: <mailto:supervision@list.skarnet.org>
List-Help: <mailto:supervision-help@list.skarnet.org>
List-Unsubscribe: <mailto:supervision-unsubscribe@list.skarnet.org>
List-Subscribe: <mailto:supervision-subscribe@list.skarnet.org>
List-Archive: <http://www.skarnet.org/lists/>
Original-Received: (qmail 31591 invoked by uid 1000); 15 Feb 2011 13:12:18 -0000
Mail-Followup-To: supervision@list.skarnet.org
Content-Disposition: inline
In-Reply-To: <20070918081441.20488.qmail@1a6f0ddc0befcc.315fe32.mid.smarden.org>
User-Agent: Mutt/1.4i
Xref: news.gmane.org gmane.comp.sysutils.supervision.general:2070
Archived-At: <http://permalink.gmane.org/gmane.comp.sysutils.supervision.general/2070>


 Four years later, I'm coming back to this thread, because something is
still bothering me.

 Quick summary: Radek Podgorny and Alex Efros both had an issue where
zombie processes would accumulate and *not* be reaped by runit as they
should have been. A long discussion ensued; it appeared that the problem
was caused by the following situation:

 * a process A forks a child, B ;
 * B dies, and a SIGCHLD is sent to A ;
 * A does not wait() for B and dies ;
 * so zombie B is reparented to 1, but no SIGCHLD is sent to 1 ;
 * zombie B remains there until runit's reaper is triggered, which can
be much, much later.

 Gerrit Pape concluded:

> runit tries to over-optimise, and only wakes up to reap zombies if it
> knows there are some, at least one.  Due to the fact that the mother
> process, which re-parented itself to pid 1, on the one hand receives a
> SIGCHLD, but on the other hand doesn't care about that, exits and leaves
> the dead child alone, the child gets re-parented to runit, but without
> any notification.
> 
> The situation would have been cleaned up on your systems once any child
> process gets re-parented to process 1 before it terminates, and then
> exits, causing runit to get a SIGCHLD; which apparently didn't happen.
> It's what the kill -CONT 1 I suggested fakes.  That seems to explain why
> this problem didn't show up for years.
> 
> I prepare a new version of runit that looks for and reaps zombies not
> only if it knows that there are some, but also after a 14 seconds
> timeout, there seems to be no way around that.


 And that is what bothers me. Something is not right.
 Unix should be able to function without polling at all.
 I'm building Linux environments for embedded platforms, on which
energy consumption matters. If something as basic as process 1 has
to poll, I might as well quit my job right now.

 runit ran perfectly without polling for lots of people; only Radek and
Alex had trouble, and Gerrit had to add a polling mechanism just for
them. What do other init systems do ?


 I straced sysvinit:
Process 1 attached - interrupt to quit
select(11, [10], NULL, NULL, {2, 902034}) = 0 (Timeout)
time(NULL)                              = 1297769034
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
select(11, [10], NULL, NULL, {5, 0})    = 0 (Timeout)
time(NULL)                              = 1297769039
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
select(11, [10], NULL, NULL, {5, 0})    = 0 (Timeout)
time(NULL)                              = 1297769044
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
select(11, [10], NULL, NULL, {5, 0}^C <unfinished ...>
Process 1 detached

 No luck here. sysvinit wakes up every 5 seconds. Don't ask me why:
it does not even reap children when it wakes up. Its only goal seems
to be to make sure that /dev/initctl is still there by stat()ing
it three times. Lol. sysvinit sucks - nothing new here.


 I straced Upstart:
Process 1 attached - interrupt to quit
select(11, [3 5 6 7 9 10], [], [7 9 10], NULL^C <unfinished ...>
Process 1 detached
 
 Aha. Upstart waits on notifications forever. It does not poll at all.
No, I'm definitely not going to install Upstart on embedded systems :)
but it's a good indication that it is possible to only reap children
when being triggered; it is *not necessary*, at least on Linux, to
have a timed reaping loop.

 So, where does the problem come from ?
 Do reparented zombies *really* cause no trigger ?

 I ran the following command while stracing my own process 1 (s4-svscan,
which does not poll) on a Linux 2.6.36.1 kernel:
$ execlineb -c "background { sleep 1 } s4-sleep 2"

 This little execline script will fork; the child will exec "sleep 1",
which will exit after 1 second. The parent will exec "s4-sleep 2", which will
sleep 2 seconds *without being interrupted by signals* and then exit
*without waiting for its dead child*. (I used my own version of "sleep"
just to make sure it slept for the full duration and did not wait().)
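
 For readers without execline, the same parent/child pattern can be
sketched in Python. This is only an illustration, not the command used
above: the function name and its parameters are hypothetical, and an
extra wrapper process (A) is forked so that the caller can wait for it,
the way the shell waits for the script.

```python
import os
import signal
import time


def run_orphaning_parent(child_delay=1.0, parent_delay=2.0):
    """Fork A; A forks B, outlives it, and exits without wait().

    Returns A's exit status. When A exits, the zombie B is
    reparented (normally to process 1).
    """
    pid_a = os.fork()
    if pid_a == 0:
        # Process A: plays the role of "s4-sleep 2".
        status = 1
        try:
            # Default disposition: SIGCHLD is discarded and does not
            # interrupt the sleep below.
            signal.signal(signal.SIGCHLD, signal.SIG_DFL)
            if os.fork() == 0:
                # Process B: plays the role of "sleep 1".
                try:
                    time.sleep(child_delay)
                finally:
                    os._exit(0)
            time.sleep(parent_delay)
            status = 0
        finally:
            # Deliberately no wait(): B is left behind as a zombie.
            os._exit(status)
    _, wait_status = os.waitpid(pid_a, 0)
    return os.WEXITSTATUS(wait_status)
```

 Calling run_orphaning_parent() with the default delays while stracing
process 1 should reproduce the situation described here, assuming
parent_delay stays larger than child_delay.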

 So, when the child dies, a SIGCHLD will be sent to the parent, which
is totally oblivious to it. One second later, the parent will die, and
its zombie child will then be inherited by process 1. What happens then ?

Process 1 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 1
gettimeofday({1297770104, 310225}, NULL) = 0
read(5, "\21\0\0\0\0\0\0\0\1\0\0\0:1\0\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
read(5, 0xbfa51d0c, 128)                = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 12602
wait4(-1, 0xbfa51e10, WNOHANG, NULL)    = 0
poll([{fd=5, events=POLLIN|POLLHUP}, {fd=4, events=POLLIN|POLLHUP}], 2, -1^C <unfinished ...>
Process 1 detached

 fd 5 is actually obtained via signalfd() when available, and listens to
signals such as SIGCHLD. (When signalfd() is not available, a selfpipe is
used instead.)
 The trace is crystal clear: when the parent dies and the zombie child
is reparented to 1, *process 1 does get notified with a SIGCHLD* even if
the former parent has already been notified before (and has done nothing).
Here, the signal shows up as the signalfd becoming readable, but it's
still a signal.
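
 The read / wait4 / poll sequence in the trace can be sketched as
follows. This is a minimal illustration of a reaper that blocks with no
timeout, not s4-svscan's actual code: it uses the self-pipe variant
(through Python's signal.set_wakeup_fd) rather than signalfd(), and the
function names are hypothetical.

```python
import os
import select
import signal


def install_sigchld_wakeup():
    """Set up a self-pipe that becomes readable on SIGCHLD."""
    rfd, wfd = os.pipe()
    os.set_blocking(rfd, False)
    os.set_blocking(wfd, False)
    # A Python-level handler must be installed for the wakeup fd
    # to be written to; the handler itself does nothing.
    signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    signal.set_wakeup_fd(wfd)
    return rfd


def reap_all():
    """Reap every available zombie without blocking."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none is a zombie
        reaped.append((pid, status))
    return reaped


def wait_and_reap(rfd):
    """Sleep until a signal arrives (no timeout), then reap."""
    select.select([rfd], [], [])
    try:
        while os.read(rfd, 128):
            pass  # drain the pipe
    except BlockingIOError:
        pass  # pipe is empty: the equivalent of EAGAIN in the trace
    return reap_all()
```

 A supervisor would call install_sigchld_wakeup() once and then call
wait_and_reap() in a loop; between signals the process stays asleep in
select() and never wakes up on a timer.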

 This is the normal, expected, sane behaviour, and Linux 2.6.36.1
exhibits it; it confirms my expectation that process 1 SHOULD NOT have
a timed reaping loop.

 Upstart does the right thing (as far as waiting for notifications is
concerned, I mean). runit did the right thing before the change.

 The problem Radek and Alex had was most likely caused by a kernel bug:
in some cases, when a zombie is reparented to process 1, process 1 does
not get notified with a SIGCHLD, as it should.

 I don't have the time or resources to explore this further; but the
test procedure is simple.

 - Make sure you can strace your process 1. If you cannot, patch it
so it writes something (to the system log, or to its own stderr, which
should point to the console) every time it receives a SIGCHLD. Upstart and
sysvinit are spaghetti monsters, but runit is trivial to patch.

 - Run the following script: sh -c "sleep 1 & exec sleep 2"
provided your sleep binary does not do anything fancy with signals.
Or replace the "sleep 2" with something that you know does not
catch signals and lasts more than one second.

 - Check what process 1 says after 2 seconds. If it received a SIGCHLD,
your kernel works. If it did not, you have found a kernel bug.

 runit's polling mechanism is a workaround for this bug, not the
solution to some Unix problem. Gerrit, please make it optional, so
functional systems can disable polling entirely.

-- 
 Laurent