From: Alex Efros
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit not collecting zombies
Date: Tue, 19 Jun 2007 22:07:51 +0300
Organization: asdfGroup Inc., http://powerman.asdfGroup.com/
Message-ID: <20070619190751.GC27090@home.power>
References: <46561ABE.7030008@podgorny.cz>
 <20070526103517.GD24895@home.power>
 <20070603111056.15978.qmail@3deb4a0e5d8414.315fe32.mid.smarden.org>
 <20070611131112.GA1576@home.power>
 <20070618134516.GA1560@home.power>
 <20070619181325.23252.qmail@a92f927aabd53f.315fe32.mid.smarden.org>
In-Reply-To: <20070619181325.23252.qmail@a92f927aabd53f.315fe32.mid.smarden.org>
To: supervision@list.skarnet.org
User-Agent: Mutt/1.5.13 (2006-08-11)

Hi!

On Tue, Jun 19, 2007 at 06:13:25PM +0000, Gerrit Pape wrote:
> Hi Alex, after checking the code, I currently cannot say that or how
> runit could fail reaping zombies that detached and re-parented to pid 1.
> On Linux running strace on pid 1 isn't supported AFAIK.  To be sure that
> runit is at fault, can you please check the kernel versions on your two
> machines, can it be that they have changed at the time the problem
> popped up?  Does upgrading the Linux kernel to a more recent version
> change anything?

Yeah, I also think this may be a kernel<->runit bug. But it happened
previously on 2.6.16, and now I have it on 2.6.20 (upgraded a few days
ago).

Moreover, it looks like runit doesn't simply stop reaping zombies at some
point. It looks like it keeps reaping them, just not all of them. Look:

# date; ps ax | grep Z | wc
Mon Jun 18 13:46:02 GMT 2007
    162     973    7155
# date; ps ax | grep Z | wc
Mon Jun 18 22:41:58 GMT 2007
    406    2437   17894
# date; ps ax | grep Z | wc
Tue Jun 19 18:59:46 GMT 2007
    770    4621   33939

This server generates a lot of short-lived processes every minute, i.e.
it generates new processes much faster than new unreaped zombies
accumulate. I may be wrong, but I have a feeling that on 2.6.16 unreaped
zombies accumulated much faster: within a few hours I got to 8192
processes in use and was forced to reboot.

Maybe the kernel just doesn't send SIGCHLD in some situations?
Maybe there is a race condition in runit or in the kernel?
Would it solve the issue if runit called waitpid() every few seconds?

-- 
WBR, Alex.
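
P.S. For reference, one common way a SIGCHLD-driven reaper can miss children: standard signals are not queued, so several children exiting close together may produce only one SIGCHLD. The reaper therefore has to loop on waitpid(-1, WNOHANG) until nothing is left, not call it once per signal. Below is a minimal sketch of that loop plus the "poll every few seconds" fallback, written in Python only to keep it short. It is not runit's code and I am not claiming runit does this wrong; reap_all, spawn_short_lived and periodic_reaper are made-up names for illustration.

```python
import os
import time


def reap_all():
    """Reap every child that has already exited, without blocking.

    Returns the list of PIDs that were reaped in this pass.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call non-blocking.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # This process has no children at all.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        reaped.append(pid)
    return reaped


def spawn_short_lived(count):
    """Fork `count` children that exit immediately (hypothetical workload)."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once, becoming a zombie until reaped
        pids.append(pid)
    return pids


def periodic_reaper(interval, rounds):
    """Fallback: poll at a fixed interval instead of relying only on SIGCHLD."""
    total = []
    for _ in range(rounds):
        total.extend(reap_all())
        time.sleep(interval)
    return total
```

Calling reap_all() from the SIGCHLD handler (or the main loop it wakes up) covers the coalesced-signal case; periodic_reaper() only models the extra safety net, which costs one cheap system call per interval when there is nothing to reap.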