From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 11969 invoked from network); 25 May 2008 22:27:14 -0000
X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on f.primenet.com.au
X-Spam-Level:
X-Spam-Status: No, score=-2.6 required=5.0 tests=AWL,BAYES_00 autolearn=ham
  version=3.2.4
Received: from news.dotsrc.org (HELO a.mx.sunsite.dk) (130.225.247.88)
  by ns1.primenet.com.au with SMTP; 25 May 2008 22:27:14 -0000
Received-SPF: none (ns1.primenet.com.au: domain at sunsite.dk does not designate permitted sender hosts)
Received: (qmail 84432 invoked from network); 25 May 2008 22:27:08 -0000
Received: from sunsite.dk (130.225.247.90)
  by a.mx.sunsite.dk with SMTP; 25 May 2008 22:27:08 -0000
Received: (qmail 13076 invoked by alias); 25 May 2008 22:27:05 -0000
Mailing-List: contact zsh-workers-help@sunsite.dk; run by ezmlm
Precedence: bulk
X-No-Archive: yes
X-Seq: 25107
Received: (qmail 13061 invoked from network); 25 May 2008 22:27:05 -0000
Received: from bifrost.dotsrc.org (130.225.254.106)
  by sunsite.dk with SMTP; 25 May 2008 22:27:05 -0000
Received: from mx.spodhuis.org (redoubt.spodhuis.org [193.202.115.177])
  by bifrost.dotsrc.org (Postfix) with ESMTP id C810C80589A4
  for ; Mon, 26 May 2008 00:27:00 +0200 (CEST)
DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=d200803; d=spodhuis.org;
  h=Received:Date:From:To:Subject:Message-ID:Mail-Followup-To:References:MIME-Version:Content-Type:Content-Disposition:In-Reply-To;
  b=vIx0nAWQYAojOVYqzejLsz0z8ZB0hws1FAV4KPTwnc0ZBMQvXl/zXldZvcPwN1/ighgOTFbr+6uL7nyKkDoF87iDOBBmQYh4aE9yaJ84YvbxA+Q3Ajn7QRzJFbktdNeTRDdUMDqckB/8nN6tJ3JaBdM0J2snm6nh99vwWHdtAEs=;
Received: by smtp.spodhuis.org with local id 1K0Og6-000Ky2-3t; Sun, 25 May 2008 22:26:58 +0000
Date: Sun, 25 May 2008 15:26:58 -0700
From: Phil Pennock
To: zsh-workers@sunsite.dk, 482346@bugs.debian.org
Subject: Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
Message-ID: <20080525222657.GA80449@redoubt.spodhuis.org>
Mail-Followup-To: zsh-workers@sunsite.dk, 482346@bugs.debian.org
References: <20080522233327.GA24953@scru.org>
  <080523073940.ZM13804@torch.brasslantern.com>
  <20080523145722.GA12096@scru.org>
  <20080523224305.GN7056@prunille.vinc17.org>
  <20080524025556.GA30511@scru.org>
  <20080524124445.GQ7056@prunille.vinc17.org>
  <20080524234002.GA35143@redoubt.spodhuis.org>
  <20080525004101.GT7056@prunille.vinc17.org>
  <20080525012321.GA7438@redoubt.spodhuis.org>
  <20080525213719.GV7056@prunille.vinc17.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20080525213719.GV7056@prunille.vinc17.org>
X-Virus-Scanned: ClamAV 0.91.2/7238/Sun May 25 22:37:51 2008 on bifrost
X-Virus-Status: Clean

On 2008-05-25 at 23:37 +0200, Vincent Lefevre wrote:
> On 2008-05-24 18:23:21 -0700, Phil Pennock wrote:
> > Then I'd be inclined to start looking into hardware issues, since
> > _something's_ probably getting stuck in disk IO; I'll suspect that
> > before kernel bugs, but it might also be worth seeing if there are other
> > problems with threaded programs on powerpc, if init really can't reap
> > something that has already become a zombie.
>
> I've looked at /var/log/kern.log and there's something each time
> I interrupted vlc, e.g.
>
> May 24 14:33:36 ay kernel: Unable to handle kernel paging request for data at address 0x481e7000
> May 24 14:33:36 ay kernel: Faulting instruction address: 0xc00131e8
> May 24 14:33:36 ay kernel: Oops: Kernel access of bad area, sig: 11 [#1]

That's a segfault; the kernel's then oopsing whilst trying to page in
memory to write the coredump; looks like a problem in the MMU logic for
the powerpc.

So, the problems are:
 * vlc is segfaulting when it receives SIGINT;
 * the powerpc Linux kernel has a bug whereby it's ending up not letting
   the parent wait on it (from what I understand of the details so far)
   in some cases, so it looks like the process isn't actually ending and
   transitioning to zombie status; it might be worth talking to the
   architecture maintainers for your distribution, to see about known
   issues; note that even init is unable to reclaim these processes;
   have you tried sending a SIGKILL to force-exit the vlc, to see if
   either zsh or init can reap the process then?
 * zsh is somehow tickling the kernel bug and it might be worth having
   configure logic to deal with this, even after the problem is fixed,
   once we know what it is that's tickling this.

> May 24 14:33:36 ay kernel: note: vlc[21850] exited with preempt_count 1

My nasty suspicious mind thinks that special kernel logic for handling a
weird exit condition, and logging it, is less tested code that's already
doing something different to the default, so this is likely close to the
root cause; no powerpc available for me to test, though.

It seems unlikely that there'd be enough bugs to also have a zombie
contributing to load average, so I suspect that the process has not in
fact exited yet, it's still running, that's where the load comes from.
Does ps(1) actually show the 'Z' for zombie?

-Phil
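
For comparison with the SIGKILL and 'Z' questions above, here is a minimal Python sketch of what a healthy Linux kernel should do. It is not from the thread and was not run on the affected powerpc machine. The names proc_state and kill_and_reap are hypothetical, a sleeping child stands in for vlc, and the state letter is read from /proc/<pid>/stat (the same letter ps(1) reports in its STAT column), so it assumes a mounted /proc.

```python
import os
import signal
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter is the first field after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def kill_and_reap():
    """SIGKILL a child, observe it as a zombie, then reap it (hypothetical helper)."""
    pid = os.fork()
    if pid == 0:
        # Child: a stand-in for the stuck program; it only sleeps.
        time.sleep(60)
        os._exit(0)

    os.kill(pid, signal.SIGKILL)
    # Block until the child has terminated, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before_reap = proc_state(pid)  # a terminated, unreaped child is a zombie

    _, status = os.waitpid(pid, 0)       # the parent reaps the zombie here
    state_after_reap = proc_state(pid)   # the /proc entry disappears once reaped
    return state_before_reap, status, state_after_reap
```

If the killed process never shows 'Z' and the waitid call never returns, that would fit the suspicion that the process has not actually finished exiting, rather than being a zombie that nobody has reaped.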