From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/1453
Path: news.gmane.org!not-for-mail
From: Gerrit Pape
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit not collecting zombies
Date: Tue, 26 Jun 2007 09:59:20 +0000
Message-ID: <20070626095920.6195.qmail@3e147d410b1c2c.315fe32.mid.smarden.org>
References: <20070526103517.GD24895@home.power>
 <20070603111056.15978.qmail@3deb4a0e5d8414.315fe32.mid.smarden.org>
 <20070611131112.GA1576@home.power>
 <20070618134516.GA1560@home.power>
 <20070619181325.23252.qmail@a92f927aabd53f.315fe32.mid.smarden.org>
 <20070619190751.GC27090@home.power>
 <20070620162325.26345.qmail@7d91355cde742c.315fe32.mid.smarden.org>
 <20070620165736.GC12963@home.power>
 <20070620183532.4571.qmail@9f638fd8b69905.315fe32.mid.smarden.org>
 <20070623044205.GA1594@home.power>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="pWyiEgJYm5f9v55/"
X-Trace: sea.gmane.org 1182851952 26449 80.91.229.12 (26 Jun 2007 09:59:12 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Tue, 26 Jun 2007 09:59:12 +0000 (UTC)
To: supervision@list.skarnet.org
Original-X-From: supervision-return-1690-gcsg-supervision=m.gmane.org@list.skarnet.org Tue Jun 26 11:59:09 2007
Return-path:
Envelope-to: gcsg-supervision@gmane.org
Original-Received: from antah.skarnet.org ([212.85.147.14]) by lo.gmane.org with smtp (Exim 4.50) id 1I37p9-0001z5-7y for gcsg-supervision@gmane.org; Tue, 26 Jun 2007 11:59:03 +0200
Original-Received: (qmail 9970 invoked by uid 76); 26 Jun 2007 09:59:23 -0000
Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
List-Archive:
Original-Received: (qmail 9965 invoked from network); 26 Jun 2007 09:59:23 -0000
Mail-Followup-To: supervision@list.skarnet.org
Content-Disposition: inline
In-Reply-To: <20070623044205.GA1594@home.power>
Xref: news.gmane.org
 gmane.comp.sysutils.supervision.general:1453
Archived-At:

--pWyiEgJYm5f9v55/
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sat, Jun 23, 2007 at 07:42:05AM +0300, Alex Efros wrote:
> On Wed, Jun 20, 2007 at 06:35:32PM +0000, Gerrit Pape wrote:
> > Thanks for helping to try to track this down. You can use it on top of
> > the first, the first one should generally speed up reaping zombies.
>
> One of my servers now has ~700 zombie processes. Looks like these two
> patches don't fix this issue. :(

From reading the code, I can't see why the runit program shouldn't
collect these zombies on your system. But I may be blind, let's see
whether reaping zombies at least every 5 seconds helps.

Can you please apply the attached patch (it supersedes the previous
patches), install the resulting runit program into /sbin/, reboot the
machine, make sure that the new runit program is running as pid 1, and
see whether zombies are left over? Is anything printed to the console
when the zombie problem arises?

To be sure that runit is the problem, could you boot one of your systems
into sysvinit to see if it has the same problem?

Thanks, Gerrit.
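One way to check whether zombies are left over without reading ps
output by eye is sketched below. It is only an illustration and not
part of runit: it assumes a Linux-style /proc, and parse_stat and
count_zombies are names made up for this example.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Parse the state and parent pid out of a /proc/<pid>/stat line of the
   form "pid (comm) state ppid ...".  comm may itself contain spaces or
   ')', so scanning starts after the LAST ')'.  Returns 0 on success. */
int parse_stat(const char *line, char *state, long *ppid) {
  const char *p = strrchr(line, ')');
  if (!p) return -1;
  return sscanf(p + 1, " %c %ld", state, ppid) == 2 ? 0 : -1;
}

/* Count processes in state 'Z'.  If parent > 0, only zombies whose
   parent is `parent` are counted.  Returns -1 if /proc is unreadable. */
int count_zombies(pid_t parent) {
  DIR *d = opendir("/proc");
  struct dirent *e;
  int n = 0;

  if (!d) return -1;
  while ((e = readdir(d)) != NULL) {
    char path[280], line[512], state;
    long ppid;
    FILE *f;

    if (!isdigit((unsigned char)e->d_name[0])) continue; /* not a pid */
    snprintf(path, sizeof path, "/proc/%s/stat", e->d_name);
    f = fopen(path, "r");
    if (!f) continue;                 /* process vanished meanwhile */
    if (fgets(line, sizeof line, f) &&
        parse_stat(line, &state, &ppid) == 0 &&
        state == 'Z' && (parent <= 0 || ppid == (long)parent))
      n++;
    fclose(f);
  }
  closedir(d);
  return n;
}
```

Calling count_zombies(1) limits the count to zombies whose parent is
pid 1, which are the ones the runit program is expected to collect;
count_zombies(0) counts every zombie on the system.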
--pWyiEgJYm5f9v55/
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff

diff --git a/src/runit.c b/src/runit.c
index f7d6522..6f0793d 100644
--- a/src/runit.c
+++ b/src/runit.c
@@ -143,22 +143,28 @@ int main (int argc, const char * const *argv, char * const *envp) {
     FD_SET(x.fd, &rfds);
 #endif
     for (;;) {
-      int child;
+      int r, child;
 
       sig_unblock(sig_child);
       sig_unblock(sig_cont);
       sig_unblock(sig_int);
 #ifdef IOPAUSE_POLL
-      poll(&x, 1, -1);
+      r =poll(&x, 1, 5000);
 #else
-      select(x.fd +1, &rfds, (fd_set*)0, (fd_set*)0, (struct timeval*)0);
+      r =select(x.fd +1, &rfds, (fd_set*)0, (fd_set*)0, (struct timeval*)0);
 #endif
       sig_block(sig_cont);
       sig_block(sig_child);
       sig_block(sig_int);
 
-      read(selfpipe[0], &ch, 1);
-      child =wait_nohang(&wstat);
+      while (read(selfpipe[0], &ch, 1) == 1) {}
+      while ((child =wait_nohang(&wstat)) > 0)
+        if (child == pid) break;
+      if (child == -1) {
+        strerr_warn2(WARNING, "wait_nohang, pausing: ", &strerr_sys);
+        sleep(5);
+      }
 
+      if ((r == 0) && (child != pid)) continue;
       /* reget stderr */
       if ((ttyfd =open_write("/dev/console")) != -1) {
@@ -194,7 +200,7 @@ int main (int argc, const char * const *argv, char * const *envp) {
         strerr_warn3(INFO, "leave stage: ", stage[st], 0);
         break;
       }
-      if (child > 0) {
+      if (child != 0) {
         /* collect terminated children */
         write(selfpipe[1], "", 1);
         continue;

--pWyiEgJYm5f9v55/--
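The loop the patch moves toward (wake up on a self-pipe byte or after a
timeout, drain the pipe completely, then reap every exited child) can
be sketched standalone as below. This is a minimal illustration and not
runit's code: it uses plain waitpid() and poll() rather than runit's
wait_nohang(), and make_selfpipe, drain_selfpipe, reap_all and
wait_and_reap are hypothetical names. It assumes the read end of the
self-pipe is non-blocking, which draining with read() in a loop needs.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a pipe with both ends non-blocking.  Returns 0 on success. */
int make_selfpipe(int fds[2]) {
  if (pipe(fds) == -1) return -1;
  for (int i = 0; i < 2; i++) {
    int fl = fcntl(fds[i], F_GETFL);
    if (fl == -1 || fcntl(fds[i], F_SETFL, fl | O_NONBLOCK) == -1) return -1;
  }
  return 0;
}

/* Discard every byte currently queued on the non-blocking read end,
   so several coalesced notifications do not leave stale bytes behind.
   Returns the number of bytes discarded. */
int drain_selfpipe(int fd) {
  char ch;
  int n = 0;
  while (read(fd, &ch, 1) == 1) n++;
  return n;
}

/* Reap every child that has already exited, without blocking.
   Returns the number reaped, or -1 on a waitpid() error other than
   "no children left" (ECHILD). */
int reap_all(void) {
  int status, n = 0;

  for (;;) {
    pid_t pid = waitpid(-1, &status, WNOHANG);
    if (pid > 0) { n++; continue; }   /* one zombie collected, try again */
    if (pid == 0) return n;           /* children exist, none has exited */
    if (errno == EINTR) continue;
    if (errno == ECHILD) return n;    /* no children at all */
    return -1;
  }
}

/* One pass of the loop: sleep until the pipe is readable or timeout_ms
   elapses, then drain the pipe and reap.  Returns reap_all()'s result. */
int wait_and_reap(int pipe_rd, int timeout_ms) {
  struct pollfd x = { .fd = pipe_rd, .events = POLLIN };
  (void)poll(&x, 1, timeout_ms);
  drain_selfpipe(pipe_rd);
  return reap_all();
}
```

Because poll() here always returns after timeout_ms, children whose
SIGCHLD notification was lost are still collected on the next pass.
Note that in the patch only the IOPAUSE_POLL branch gets the 5000 ms
timeout; the select() branch still passes a null timeout and so keeps
blocking until a descriptor is ready or a signal arrives.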