From: Laurent Bercot
To: supervision@list.skarnet.org
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: [LONG] Re: runit not collecting zombies
Date: Tue, 15 Feb 2011 16:22:00 +0100
Message-ID: <20110215152200.GA13420@skarnet.org>
In-Reply-To: <20110215150025.GB3430@home.power>

> AFAIR this polling mechanism doesn't solved issue for me, but I may be
> wrong because that was long time ago. Anyway, I'm still using `kill -CONT 1`
> hack in /etc/cron.hourly/ to work around this issue on all my systems.

 As far as I can tell, the polling mechanism should have solved the issue
with a recent runit (one that reaps *all* its zombies every time it's
triggered, not just one), because it does exactly the same thing as your
SIGCONT crontab entry: it manually triggers the reaper at regular intervals.
 Your crontab entry triggers the reaper every hour. The integrated polling
mechanism triggers it every 14 seconds. That should have worked. ^^

> Again, this was long time ago and I may be wrong, but AFAIR this simple
> trick with two processes wasn't correct example to reproduce this issue.

 That is what I gathered when reading the thread again. The cause of your
zombie attack was parents not reaping their dead children and then dying,
which handed their zombies to process 1 *without* triggering process 1's
reaper. My little script is a minimal example of this.

> Ok, there no harm to trying. I repeated your test with strace - on my
> current system runit got SIGCHLD. I'm using kernel 2.6.36-hardened-r9 and
> runit 2.0.0. I've just switched off hourly `kill -CONT 1` workaround, so
> we'll see is everything fine in a couple of days.

 Looks like 2.6.36-hardened-r9 is exempt from the bug. If runit got SIGCHLD,
its reaper mechanism was triggered and you should be okay.

-- 
 Laurent
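
 For illustration, here is a minimal sketch of what "reaping *all* zombies
every time the reaper is triggered, not just one" amounts to. This is not
runit's code (runit is written in C); it is a Python approximation built on
the standard os.waitpid call, and the names reap_all and spawn_short_lived
are hypothetical helpers made up for this example.

```python
import os
import time


def reap_all():
    """Collect every already-dead child without blocking.

    Returns a list of (pid, status) pairs. Hypothetical helper, not runit code.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG means "do not block".
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # ECHILD: this process has no children left at all.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        reaped.append((pid, status))
    return reaped


def spawn_short_lived(n):
    """Fork n children that exit immediately; return their pids."""
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once and become a zombie
        pids.append(pid)
    return pids


if __name__ == "__main__":
    spawn_short_lived(3)
    time.sleep(0.5)  # give the children time to exit
    print("reaped:", reap_all())
```

 The point of the loop is that a single trigger (SIGCHLD, SIGCONT, or a
polling timer) may correspond to any number of dead children, so the reaper
keeps calling waitpid until nothing is left to collect. A reaper that calls
it only once per trigger falls behind whenever zombies arrive without their
own trigger.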