From: Alex Efros
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit not collecting zombies
Date: Wed, 12 Sep 2007 17:35:57 +0300
Organization: asdfGroup Inc., http://powerman.asdfGroup.com/
Message-ID: <20070912143557.GC12043@home.power>
To: supervision@list.skarnet.org

Hi!

On Wed, Sep 12, 2007 at 09:55:16AM -0400, Charlie Brady wrote:
>> Hi! Any progress on this? Alex, have you found at least a workaround? This
>> is getting really annoying as I have to reboot my servers manually ...

Nope. Chances are I'll write a script that checks the number of zombies every
10 minutes and reboots if there are more than 100. :~( I'm tired of monitoring
servers manually and rebooting them every 2-7 days.

> You can make the problem (whatever it is) a non-issue for you, as it is for
> nearly everyone else, if you can fix whichever run script is generating
> zombies. It's possible, believe me.
>
> [I've still seen no evidence that openssh generates zombies.]

I'm so happy that you see no evidence but, unfortunately for me, I see that
evidence in my `ps` output about every week. Please stop repeating yourself.
We all already know what you think about this issue. There IS a bug somewhere
(runit, the kernel, or somewhere else), and you aren't helping us fix it.

The idea is: no matter what the user is doing, the number of unreaped zombies
in the system shouldn't keep growing. If it does, that is a bug, and it should
be fixed. Asking the user not to do something (don't run chpst -L from cron)
that merely increases the _probability_ of hitting the bug isn't a solution at
all, because there is other software that also produces unreaped zombies (like
ssh). It isn't a solution because chpst doesn't do anything wrong, just like
ssh and the other software.

Your recommendation sounds like 'start fewer short-lived processes', which is
idiocy! A server should work, and if its work is to run a lot of short-lived
processes, then it should do so reliably, without requiring a reboot every
several days.

Sorry for my emotions. I now have a lot of Linux servers that work just like
Windows, from reboot to reboot, and that makes me a little angry...

>>>> So. If this is a race condition bug in linux kernel 2.6.20, how to debug it?
>>> Have a look at SystemTap.

Sadly, I've had a lot of work these last months, so I haven't tried to debug
the kernel myself. (I've asked the gentoo kernel devs to look into this issue,
but it looks like they don't believe the problem is in glibc/kernel, and they
point me back to runit.)

-- 
WBR, Alex.
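
For reference, the zombie check mentioned at the top of this message could look roughly like the sketch below. It is only an illustration, not a script from this thread: it assumes a Linux /proc filesystem, and the names `parse_state`, `count_zombies`, `too_many_zombies` and `ZOMBIE_LIMIT` are made up for the example. What to do when the limit is exceeded (alert, reboot) is left to whatever runs it from cron.

```python
import os

ZOMBIE_LIMIT = 100  # hypothetical name; the ">100 zombies" threshold from the message


def parse_state(stat_text):
    """Return the one-letter process state from the text of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain spaces
    or ')' characters, so split on the last ')' instead of on whitespace.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_zombies(proc_root="/proc"):
    """Count processes under proc_root whose state is 'Z' (zombie)."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories describe processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if parse_state(f.read()) == "Z":
                    zombies += 1
        except (OSError, IndexError):
            continue  # process vanished during the scan, or unreadable entry
    return zombies


def too_many_zombies(count, limit=ZOMBIE_LIMIT):
    """True when the zombie count is strictly above the limit."""
    return count > limit
```

A cron job run every 10 minutes could call `count_zombies()` and act only when `too_many_zombies()` returns True. This only works around the symptom; it does nothing to find out why the zombies are not being reaped.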