From mboxrd@z Thu Jan 1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/1536
Path: news.gmane.org!not-for-mail
From: Gerrit Pape
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit not collecting zombies
Date: Tue, 18 Sep 2007 08:14:41 +0000
Message-ID: <20070918081441.20488.qmail@1a6f0ddc0befcc.315fe32.mid.smarden.org>
References: <20070912172245.GF12043@home.power> <20070912181836.GG12043@home.power> <20070912191346.GH12043@home.power> <20070915133641.GA30650@home.power> <20070917075651.8280.qmail@f6989948e15a99.315fe32.mid.smarden.org> <20070917115924.GB1531@home.power>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: sea.gmane.org 1190103267 31320 80.91.229.12 (18 Sep 2007 08:14:27 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Tue, 18 Sep 2007 08:14:27 +0000 (UTC)
To: supervision@list.skarnet.org
Original-X-From: supervision-return-1771-gcsg-supervision=m.gmane.org@list.skarnet.org Tue Sep 18 10:14:26 2007
Return-path:
Envelope-to: gcsg-supervision@gmane.org
Original-Received: from antah.skarnet.org ([212.85.147.14]) by lo.gmane.org with smtp (Exim 4.50) id 1IXYDu-0005rE-4j for gcsg-supervision@gmane.org; Tue, 18 Sep 2007 10:14:22 +0200
Original-Received: (qmail 20774 invoked by uid 76); 18 Sep 2007 08:14:41 -0000
Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
List-Archive:
Original-Received: (qmail 20769 invoked from network); 18 Sep 2007 08:14:41 -0000
Mail-Followup-To: supervision@list.skarnet.org
Content-Disposition: inline
In-Reply-To: <20070917115924.GB1531@home.power>
Xref: news.gmane.org gmane.comp.sysutils.supervision.general:1536
Archived-At:

On Mon, Sep 17, 2007 at 11:07:30AM +0200, Radek Podgorny wrote:
> Wow! This helped on one of my machines (havent tried others)!

On Mon, Sep 17, 2007 at 02:59:24PM +0300, Alex Efros wrote:
> Yeah! It works!!!

Nice.
Actually, the "check for zombies every five seconds" patch should already have worked, then. After all, it's a programming error in runit. Charlie is completely right with

On Sat, Sep 15, 2007 at 11:47:02AM -0400, Charlie Brady wrote:
> You won't see zombies if process 14925 reads exit status of process
> 14926 before it exits.
>
> Yes, runit should reap that status, but that doesn't change the fact
> that ssh is wrong. Note also that SIGCHLD is delivered to sshd
> process, not to runit, because 14926 terminates before 14925.

runit tries to over-optimise, and only wakes up to reap zombies if it knows there is at least one. The mother process, which re-parented itself to pid 1, receives a SIGCHLD but doesn't care about it; it exits and leaves the dead child alone. The child then gets re-parented to runit, but without any notification.

The situation would have been cleaned up on your systems as soon as any child process got re-parented to process 1 before terminating and then exited, causing runit to get a SIGCHLD; apparently that didn't happen. That is what the kill -CONT 1 I suggested fakes. That seems to explain why this problem didn't show up for years.

I am preparing a new version of runit that looks for and reaps zombies not only when it knows that there are some, but also after a 14-second timeout; there seems to be no way around that.

Thanks, Gerrit.

--And now fix those bad mother processes that see their children fade away, but don't care about that; that's not good behavior of a Mom ,-).
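
For illustration only, here is a minimal sketch in C of the approach described above: reap whenever a wake-up arrives, and also reap unconditionally once a timeout expires, so that an already-dead child inherited without any SIGCHLD still gets collected. This is not runit's actual code. The names reap_zombies, wait_then_reap and REAP_TIMEOUT_MS are hypothetical, and the wake-up descriptor is assumed to be something like the read end of a self-pipe written by a SIGCHLD handler.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical constant: the 14-second timeout mentioned in the mail. */
#define REAP_TIMEOUT_MS (14 * 1000)

/* Reap every child that has already terminated, without blocking.
 * Returns the number of children reaped. */
static int reap_zombies(void)
{
    int reaped = 0;
    int status;

    for (;;) {
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            reaped++;            /* collected one zombie, look for more */
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        break;                   /* 0: nothing has exited yet; -1: no children */
    }
    return reaped;
}

/* One iteration of a supervisor loop (assumed design, not runit's):
 * sleep until wake_fd is readable or timeout_ms elapses, then reap
 * regardless of which of the two happened. */
static int wait_then_reap(int wake_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = wake_fd, .events = POLLIN };
    char buf[64];

    if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN)) {
        ssize_t n = read(wake_fd, buf, sizeof buf);  /* drain wake-up bytes */
        (void)n;
    }
    return reap_zombies();
}
```

A process 1 built this way would call wait_then_reap(selfpipe_fd, REAP_TIMEOUT_MS) in its main loop, where selfpipe_fd is again a hypothetical name. The timeout costs a periodic wake-up, but it bounds how long an unannounced zombie can linger.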