From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gerrit Pape
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit not collecting zombies
Date: Wed, 20 Jun 2007 16:23:25 +0000
Message-ID: <20070620162325.26345.qmail@7d91355cde742c.315fe32.mid.smarden.org>
References: <46561ABE.7030008@podgorny.cz>
 <20070526103517.GD24895@home.power>
 <20070603111056.15978.qmail@3deb4a0e5d8414.315fe32.mid.smarden.org>
 <20070611131112.GA1576@home.power>
 <20070618134516.GA1560@home.power>
 <20070619181325.23252.qmail@a92f927aabd53f.315fe32.mid.smarden.org>
 <20070619190751.GC27090@home.power>
In-Reply-To: <20070619190751.GC27090@home.power>
To: supervision@list.skarnet.org
Mail-Followup-To: supervision@list.skarnet.org
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="FL5UXtIhxfXey3p5"
Content-Disposition: inline

--FL5UXtIhxfXey3p5
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Jun 19, 2007 at 10:07:51PM +0300,
Alex Efros wrote:
> On Tue, Jun 19, 2007 at 06:13:25PM +0000, Gerrit Pape wrote:
> > Hi Alex, after checking the code, I currently cannot say that or how
> > runit could fail reaping zombies that detached and re-parented to pid 1.
> > On Linux running strace on pid 1 isn't supported AFAIK. To be sure that
> > runit is at fault, can you please check the kernel versions on your two
> > machines, can it be that they have changed at the time the problem
> > popped up? Does upgrading the Linux kernel to a more recent version
> > change anything?
>
> Yeah, I also think this may be kernel<->runit bug. But this bug happens
> previously on 2.6.16, and now I've it on 2.6.20 (upgraded few days ago).
>
> Moreover, looks like runit not just 'stop reaping zombies' at some point.
> Looks like it continue reaping them, but not all of them. Look:
[...]
> This server generate a lot of short-living processes every minute, i.e. it
> generate new processes much faster than new non-reaped zombies arise.
>
> I may be wrong, but I've a feeling on 2.6.16 new non-reaped zombies arise
> much faster - in few hours I got 8192 use processes and was forced to reboot.
>
> Maybe kernel just don't send SIGCHLD in some situation? Maybe some race
> condition in runit or kernel? Maybe if runit will try to do waitpid()
> every few seconds this will solve issue?

That could solve the issue, yes, but runit should know when to do
waitpid(), and I would like to find out why that goes wrong.

I tried to reproduce the problem on a Debian/unstable ppc with Linux
2.6.20.7, but failed. Does this cause a problem on your system?

# cat >test.c <<EOT
#include <unistd.h>
int main () {
  int pid, i;

  /* fork 8193 children; each one detaches and exits at once */
  for (i =0; i < 8193; ++i) {
    pid =fork();
    if (pid == -1) {
      write(1, "f\n", 2); /* report a failed fork */
    }
    if (!pid) {
      daemon(0, 0); /* detach, so the process is re-parented to pid 1 */
      // sleep(1);
      _exit(0);
    }
  }
  sleep(14);
  _exit(0);
}
EOT
# gcc test.c
# ./a.out

If not, can you provide the service daemon that produces this number of
detached short-lived processes?

I have also attached another patch to try.

Thanks, Gerrit.

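For reference, a minimal sketch of a non-blocking reaping loop of the kind discussed above, assuming plain POSIX waitpid() with WNOHANG. This is not runit's code; the helper names spawn_short_lived() and reap_terminated() are made up for illustration. The point it shows is that waitpid() has three distinct outcomes: a positive pid when a child was collected, 0 when children exist but none has exited yet, and -1 on error.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork n children that exit immediately.
   Returns the number of forks that succeeded. */
int spawn_short_lived(int n) {
  int i, started = 0;

  for (i = 0; i < n; ++i) {
    pid_t pid = fork();
    if (pid == -1) continue;   /* fork failed, skip this one */
    if (pid == 0) _exit(0);    /* child: terminate at once */
    ++started;
  }
  return started;
}

/* Hypothetical helper: collect every child that has already terminated,
   without blocking. Returns the number of children reaped by this call. */
int reap_terminated(void) {
  int reaped = 0;

  for (;;) {
    int wstat;
    pid_t child = waitpid(-1, &wstat, WNOHANG);

    if (child > 0) {           /* one zombie collected, look for more */
      ++reaped;
      continue;
    }
    if (child == -1 && errno == EINTR)
      continue;                /* interrupted by a signal, retry */
    /* child == 0: children exist, but none has exited yet;
       child == -1 with ECHILD: there are no children at all */
    break;
  }
  return reaped;
}
```

Calling reap_terminated() in a loop until it returns 0 matters because several children can exit before the parent gets to run, and pending SIGCHLD signals are not queued per child. A caller would typically invoke it whenever SIGCHLD arrives, and could additionally call it on a timer as a safety net.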
--FL5UXtIhxfXey3p5
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=diff

Index: src/runit.c
===================================================================
RCS file: /cvs/runit/src/runit.c,v
retrieving revision 1.14
diff -u -r1.14 runit.c
--- src/runit.c	21 Nov 2006 15:09:18 -0000	1.14
+++ src/runit.c	20 Jun 2007 16:21:46 -0000
@@ -194,7 +194,7 @@
         strerr_warn3(INFO, "leave stage: ", stage[st], 0);
         break;
       }
-      if (child > 0) {
+      if (child != 0) {
         /* collect terminated children */
         write(selfpipe[1], "", 1);
         continue;

--FL5UXtIhxfXey3p5--