From: Charlie Brady
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit not collecting zombies
Date: Wed, 12 Sep 2007 15:07:59 -0400 (EDT)
To: Alex Efros
Cc: supervision@list.skarnet.org

On Wed, 12 Sep 2007, Alex Efros wrote:

> On Wed, Sep 12, 2007 at 01:40:07PM -0400, Charlie Brady wrote:
>> Indeed. Please remember that we haven't seen your ps output.
>
> Oh, really?
> How about this one:
> http://article.gmane.org/gmane.comp.sysutils.supervision.general/1422
> and this:
> http://article.gmane.org/gmane.comp.sysutils.supervision.general/1482

Yep, you've got me there.

>> If the parent sshd continues to run, then it can fork lots of children, all
>> or many of which exit very quickly, and there will still be no zombies
>> reparented to init. There's something more going on here. You would be well
>> advised to report the problem to whoever maintains the ssh which you use
>> and/or the ssh maintainers.
>
> Hmm. This sounds reasonable enough, I haven't think about this.
> Actually, parent ssh never exit - I never see /var/service/ssh/run restarted!

OK, so that means that any zombie process must be at least a child of a
child. If we look at this:

...
sshd      2421     1  0 Jul14 ?        Z      0:00 [sshd]
sshd      2423     1  0 Jul14 ?        Z      0:00 [sshd]
sshd      2425     1  0 Jul14 ?        Z      0:00 [sshd]
sshd      2427     1  0 Jul14 ?        Z      0:00 [sshd]
sshd      2429     1  0 Jul14 ?        Z      0:00 [sshd]
sshd      2431     1  0 Jul14 ?        Z      0:00 [sshd]
...

you'll see every second pid is a zombie. This could occur if the ancestor
sshd forks, then the child forks again, and the child (the parent of the
grandchild) exits without waiting for the grandchild. The orphaned
grandchild is reparented to process 1, so once it exits it will be a zombie
until process 1 reaps its status.

In the example shown, let's make the ancestor sshd process 100. Then it
forks and produces process 2420. 2420 forks to produce 2421, then exits.
100 reaps the exit status of 2420, so 2420 disappears from the process
table. Then 2421 exits, and appears as a zombie until its status is reaped
by proc 1.

100 forks again and produces process 2422. 2422 forks to produce 2423, then
exits. 100 reaps the exit status of 2422, so 2422 disappears from the
process table. Then 2423 exits, and appears as a zombie until its status is
reaped by proc 1.

etc.

The ssh maintainers should be interested in your process table.
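The fork/fork/exit sequence described above can be played through in a toy model. This is only a sketch of the bookkeeping, not sshd's code and not a real kernel interface: the `ProcessTable` class, the `simulate` function and the `init_reaps` flag are all made up for illustration, and it assumes pids are handed out sequentially with nothing else forking in between. Pid 100 for the ancestor sshd and 2420 as the first child pid are the example numbers from the message.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    ppid: int
    state: str = "R"  # "R" = running, "Z" = exited but not yet reaped


class ProcessTable:
    """Toy process table (hypothetical model, not a kernel API)."""

    def __init__(self, next_pid):
        self.procs = {1: Proc(1, 0)}  # process 1 is always present
        self.next_pid = next_pid

    def spawn(self, pid, ppid):
        # Place a process with a chosen pid (used for the ancestor sshd).
        self.procs[pid] = Proc(pid, ppid)

    def fork(self, ppid):
        # Assumption: pids are allocated sequentially.
        pid = self.next_pid
        self.next_pid += 1
        self.procs[pid] = Proc(pid, ppid)
        return pid

    def exit(self, pid):
        # Children of the exiting process are reparented to process 1.
        for p in self.procs.values():
            if p.ppid == pid:
                p.ppid = 1
        # The exiting process stays in the table as a zombie until reaped.
        self.procs[pid].state = "Z"

    def reap(self, parent, pid):
        # Only the current parent can collect the exit status.
        p = self.procs[pid]
        if p.ppid == parent and p.state == "Z":
            del self.procs[pid]
            return True
        return False

    def zombies(self):
        return sorted((p.pid, p.ppid) for p in self.procs.values()
                      if p.state == "Z")


def simulate(connections, init_reaps=False):
    table = ProcessTable(next_pid=2420)
    table.spawn(100, 1)                 # ancestor sshd
    for _ in range(connections):
        child = table.fork(100)         # 2420, 2422, ...
        grandchild = table.fork(child)  # 2421, 2423, ...
        table.exit(child)               # exits without waiting for grandchild
        table.reap(100, child)          # ancestor reaps the child
        table.exit(grandchild)          # now a child of process 1
        if init_reaps:
            table.reap(1, grandchild)   # what process 1 is expected to do
    return table.zombies()


if __name__ == "__main__":
    for pid, ppid in simulate(6):
        print(pid, ppid, "Z [sshd]")
```

With `init_reaps=False` the model leaves only the odd pids behind as zombies with parent 1, the same pattern as in the ps output; with `init_reaps=True` the table ends up with no zombies.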
If you mention what version you are running, someone might be interested
enough to go looking through the ssh code to find out how the scenario you
show could have occurred. An strace which captures the fork/fork/exit
sequence as it happens would be very useful.

---
Charlie