From: Laurent Bercot
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: [LONG] Re: runit not collecting zombies
Date: Tue, 15 Feb 2011 14:12:18 +0100
Message-ID: <20110215131218.GA18284@skarnet.org>
In-Reply-To: <20070918081441.20488.qmail@1a6f0ddc0befcc.315fe32.mid.smarden.org>
To: supervision@list.skarnet.org

Four years later, I'm coming back to this thread, because something is still
bothering me.

Quick summary: Radek Podgorny and Alex Efros both had an issue where
zombie processes would accumulate and *not* be reaped by runit as they
should have been. A long discussion ensued; it appeared that the problem
was caused by the following situation:

 * a process A forks a child, B ;
 * B dies, and a SIGCHLD is sent to A ;
 * A does not wait() for B and dies ;
 * so zombie B is reparented to 1, but no SIGCHLD is sent to 1 ;
 * zombie B remains there until runit's reaper is triggered, which can
   be much, much later.

Gerrit Pape concluded:

> runit tries to over-optimise, and only wakes up to reap zombies if it
> knows there are some, at least one. Due to the fact that the mother
> process, which re-parented itself to pid 1, on the one hand receives a
> SIGCHLD, but on the other hand doesn't care about that, exits and leaves
> the dead child alone, the child gets re-parented to runit, but without
> any notification.
>
> The situation would have been cleaned up on your systems once any child
> process gets re-parented to process 1 before it terminates, and then
> exits, causing runit to get a SIGCHLD; which apparently didn't happen.
> It's what the kill -CONT 1 I suggested fakes. That seems to explain why
> this problem didn't show up for years.
>
> I prepare a new version of runit that looks for and reaps zombies not
> only if it knows that there are some, but also after a 14 seconds
> timeout, there seems to be no way around that.

And that is what bothers me. Something is not right. Unix should be able
to function without polling at all. I'm building Linux environments for
embedded platforms, on which energy consumption is an important thing.
If such a basic thing as process 1 has to do polling, I'm forfeiting my
job right now.

runit ran perfectly without polling for lots of people except Radek and
Alex. Until Gerrit had to add a polling mechanism just for them.

What do other init systems do ?
I straced sysvinit:

  Process 1 attached - interrupt to quit
  select(11, [10], NULL, NULL, {2, 902034}) = 0 (Timeout)
  time(NULL) = 1297769034
  stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  select(11, [10], NULL, NULL, {5, 0}) = 0 (Timeout)
  time(NULL) = 1297769039
  stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  select(11, [10], NULL, NULL, {5, 0}) = 0 (Timeout)
  time(NULL) = 1297769044
  stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  select(11, [10], NULL, NULL, {5, 0}^C
  Process 1 detached

No luck here. sysvinit wakes up every 5 seconds. Don't ask me why: it
does not even reap children when it wakes up. Its only goal seems to be
to make sure that /dev/initctl is still there by stat()ing it three
times. Lol. sysvinit sucks - nothing new here.

I straced Upstart:

  Process 1 attached - interrupt to quit
  select(11, [3 5 6 7 9 10], [], [7 9 10], NULL^C
  Process 1 detached

Aha. Upstart waits on notifications forever. It does not poll at all.
No, I'm definitely not going to install Upstart on embedded systems :)
but it's a good indication that it is possible to only reap children
when being triggered; it is *not necessary*, at least on Linux, to have
a timed reaping loop.

So, where does the problem come from ? Do reparented zombies *really*
cause no trigger ?

I ran the following command while stracing my own process 1 (s4-svscan,
which does not poll) on a Linux 2.6.36.1 kernel:

  $ execlineb -c "background { sleep 1 } s4-sleep 2"

This little execline script will fork; the child will exec "sleep 1",
which will exit after 1 second.
The parent will exec "s4-sleep 2", which will sleep 2 seconds *without
being interrupted by signals* and then exit *without waiting for its
dead child*. (I used my own version of "sleep" just to make sure it
slept for the full duration and did not wait().)

So, when the child dies, a SIGCHLD will be sent to the parent, which is
totally oblivious to it. One second later, the parent will die, and its
zombie child will then be inherited by process 1. What happens then ?

  Process 1 attached - interrupt to quit
  restart_syscall(<... resuming interrupted call ...>) = 1
  gettimeofday({1297770104, 310225}, NULL) = 0
  read(5, "\21\0\0\0\0\0\0\0\1\0\0\0:1\0\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
  read(5, 0xbfa51d0c, 128) = -1 EAGAIN (Resource temporarily unavailable)
  wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 12602
  wait4(-1, 0xbfa51e10, WNOHANG, NULL) = 0
  poll([{fd=5, events=POLLIN|POLLHUP}, {fd=4, events=POLLIN|POLLHUP}], 2, -1^C
  Process 1 detached

fd 5 is actually obtained via signalfd() when available, and listens to
signals such as SIGCHLD. (When signalfd() is not available, a selfpipe
is used instead.)

The trace is crystal clear: when the parent dies and the zombie child is
reparented to 1, *process 1 does get notified with a SIGCHLD* even if
the former parent has already been notified before (and has done
nothing). Here, the signal shows up as the signalfd becoming readable,
but it's still a signal.

This is a very normal, expected, sane behaviour that Linux 2.6.36.1
exhibits; and it confirms my expectation that process 1 SHOULD NOT have
a timed reaping loop. Upstart does the right thing (as far as waiting
for notifications is concerned, I mean). runit did the right thing
before the change.

The problem Radek and Alex had was most likely caused by a kernel bug:
in some cases, when a zombie is reparented to process 1, process 1 does
not get notified with a SIGCHLD, as it should be.
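
For illustration, the event-driven reaping pattern visible in the trace
above (sleep in poll() with no timeout, drain the signalfd, then call
wait with WNOHANG until nothing is left) could be sketched in C roughly
as follows. This is not s4-svscan's actual code: the function names
sigchld_fd_open, reap_all and wait_and_reap are made up for the example,
and it assumes a Linux system where signalfd() is available.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: block SIGCHLD and return a non-blocking signalfd
   that becomes readable whenever a SIGCHLD is pending. -1 on error. */
int sigchld_fd_open(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0) return -1;
    return signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
}

/* Hypothetical helper: reap every child that has already exited, without
   blocking. Several exits may be folded into one pending SIGCHLD, hence
   the loop. Returns the number of children reaped. */
int reap_all(void)
{
    int n = 0;
    for (;;) {
        pid_t r = waitpid(-1, NULL, WNOHANG);
        if (r > 0) { n++; continue; }
        if (r < 0 && errno == EINTR) continue;
        break; /* 0: no exited child left; -1 with ECHILD: no children */
    }
    return n;
}

/* Hypothetical helper: sleep until the signalfd is readable, then reap.
   timeout_ms == -1 means "wait forever", i.e. no polling at all.
   Returns the number of children reaped, 0 on timeout, -1 on error. */
int wait_and_reap(int sfd, int timeout_ms)
{
    struct pollfd p = { .fd = sfd, .events = POLLIN };
    struct signalfd_siginfo si;
    int r;

    do r = poll(&p, 1, timeout_ms);
    while (r < 0 && errno == EINTR);
    if (r <= 0) return r;

    /* Drain the signalfd until read() fails with EAGAIN. */
    while (read(sfd, &si, sizeof si) > 0) { }
    return reap_all();
}
```

A process 1 built this way would call sigchld_fd_open() once at startup
and then loop on wait_and_reap(sfd, -1); the finite timeout parameter is
only there so the sketch can be exercised in a test.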
I don't have the time or resources to explore this further; but the
modus operandi is simple.

 - Make sure you can strace your process 1. If you cannot, patch it so
   it writes something (to the system log or its own stderr which should
   point to the console) every time it receives a SIGCHLD. Upstart and
   sysvinit are spaghetti monsters, but runit is trivial to patch.

 - Run the following script:
     sh -c "sleep 1 & exec sleep 2"
   provided your sleep binary does not do anything fancy with signals.
   Or replace the "sleep 2" with something that you know does not catch
   signals and lasts more than one second.

 - Check what process 1 says after 2 seconds. If it received a SIGCHLD,
   your kernel works. If it did not, you have found a kernel bug.

runit's polling mechanism is a workaround for this bug, not the solution
to some Unix problem. Gerrit, please make it optional, so functional
systems can disable polling entirely.

-- 
 Laurent