zsh-workers
* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
       [not found] ` <20080521235930.GW7056@prunille.vinc17.org>
@ 2008-05-22 23:33   ` Clint Adams
  2008-05-23 14:39     ` Bart Schaefer
  0 siblings, 1 reply; 24+ messages in thread
From: Clint Adams @ 2008-05-22 23:33 UTC (permalink / raw)
  To: zsh-workers; +Cc: Vincent Lefevre, 482346

Any ideas what might be going on? Race condition?

On Thu, May 22, 2008 at 01:50:08AM +0200, Vincent Lefevre wrote:
> I started vlc from the zsh command line. Some time later, I decided
> to kill vlc with Ctrl-C. As I didn't get the prompt, I tried Ctrl-C
> a few more times, with no change. I can see that vlc is now a zombie:
> 
> ay:~> ps -ft pts/5
> UID        PID  PPID  C STIME TTY          TIME CMD
> lefevre   4277 20147  1 01:16 pts/5    00:00:26 [vlc] <defunct>
> lefevre  20147 20126  0 May06 pts/5    00:00:00 zsh

On Thu, May 22, 2008 at 01:59:30AM +0200, Vincent Lefevre wrote:
> Additional information that may be useful:
> 
> ay:~> ps -lt pts/5
> F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
> 0 Z  1000  4277 20147  1  80   0 -     0 exit   pts/5    00:00:26 vlc <defunct>
> 0 S  1000 20147 20126  0  80   0 -  2269 rt_sig pts/5    00:00:00 zsh
> 
> I also ran gdb on the zsh running process and got:
> 
> 0x0fd312d4 in sigsuspend () from /lib/libc.so.6
> (gdb) bt
> #0  0x0fd312d4 in sigsuspend () from /lib/libc.so.6
> #1  0x10071ae4 in signal_suspend ()
> #2  0x10042268 in ?? ()
> #3  0x100423d4 in waitjobs ()
> #4  0x100274e0 in ?? ()
> #5  0x10027d20 in execlist ()
> #6  0x100283fc in execode ()
> #7  0x1003bcec in loop ()
> #8  0x1003cc30 in zsh_main ()
> #9  0x1000dc70 in main ()


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-22 23:33   ` Bug#482346: zsh doesn't always wait for its children (-> zombie) Clint Adams
@ 2008-05-23 14:39     ` Bart Schaefer
  2008-05-23 14:57       ` Clint Adams
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Schaefer @ 2008-05-23 14:39 UTC (permalink / raw)
  To: zsh-workers; +Cc: 482346

On May 22, 11:33pm, Clint Adams wrote:
}
} Any ideas what might be going on? Race condition?

What version of the shell are we dealing with here?  Does it have PWS's
patch from users/12815?  I've been a little worried that that is too
simple.  On the other hand, if it does not have users/12815, it's
possible that zsh is actually waiting for some other job that doesn't
exist any more, and interrupting vlc is a red herring.

I note from the ps output that PIDs have rolled over since the shell
was started, which is exactly the case where that patch would come
into play.

} On Thu, May 22, 2008 at 01:50:08AM +0200, Vincent Lefevre wrote:
} > I started vlc from the zsh command line. Some time later, I decided
} > to kill vlc with Ctrl-C. As I didn't get the prompt, I tried Ctrl-C
} > a few more times, with no change. I can see that vlc is now a zombie:
} > 
} > ay:~> ps -ft pts/5
} > UID        PID  PPID  C STIME TTY          TIME CMD
} > lefevre   4277 20147  1 01:16 pts/5    00:00:26 [vlc] <defunct>
} > lefevre  20147 20126  0 May06 pts/5    00:00:00 zsh



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 14:39     ` Bart Schaefer
@ 2008-05-23 14:57       ` Clint Adams
  2008-05-23 22:43         ` Vincent Lefevre
  0 siblings, 1 reply; 24+ messages in thread
From: Clint Adams @ 2008-05-23 14:57 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers, 482346, Vincent Lefevre

On Fri, May 23, 2008 at 07:39:40AM -0700, Bart Schaefer wrote:
> What version of the shell are we dealing with here?  Does it have PWS's
> patch from users/12815?  I've been a little worried that that is too
> simple.  On the other hand, if it does not have users/12815, it's
> possible that zsh is actually waiting for some other job that doesn't
> exist any more, and interrupting vlc is a red herring.

> I note from the ps output that PIDs have rolled over since the shell
> was started, which is exactly the case where that patch would come
> into play.

He's using what is just about stock 4.3.6.

Vincent, could you try reproducing this with the zsh-beta package?



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 14:57       ` Clint Adams
@ 2008-05-23 22:43         ` Vincent Lefevre
  2008-05-23 22:45           ` Vincent Lefevre
  2008-05-24  2:55           ` Clint Adams
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-23 22:43 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers, 482346

On 2008-05-23 14:57:22 +0000, Clint Adams wrote:
> Vincent, could you try reproducing this with the zsh-beta package?

Same problem with zsh-beta 4.3.6-dev-0+20080506-1.

zshenv...
zshrc...
Shell level: 4
The tty is frozen
ay:~> exec zsh-beta                                                    <0:36:50
zshenv...
zshrc...
Shell level: 4
The tty is frozen
ay:~> echo $ZSH_VERSION                                                <0:37:05
4.3.6-dev-0+0506
ay:~> vlc                                                              <0:37:06
VLC media player 0.8.6e Janus
signal 2 received, terminating vlc - do it again in case it gets stuck
user insisted too much, dying badly

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 22:43         ` Vincent Lefevre
@ 2008-05-23 22:45           ` Vincent Lefevre
  2008-05-23 23:04             ` Vincent Lefevre
  2008-05-24  2:55           ` Clint Adams
  1 sibling, 1 reply; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-23 22:45 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers, 482346

On 2008-05-24 00:43:05 +0200, Vincent Lefevre wrote:
> On 2008-05-23 14:57:22 +0000, Clint Adams wrote:
> > Vincent, could you try reproducing this with the zsh-beta package?
> 
> Same problem with zsh-beta 4.3.6-dev-0+20080506-1.

Ditto with zsh-beta -f:

zshenv...
zshrc...
Shell level: 4
The tty is frozen
ay:~> exec zsh-beta -f                                                 <0:43:21
ay% vlc
VLC media player 0.8.6e Janus
signal 2 received, terminating vlc - do it again in case it gets stuck

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed
user insisted too much, dying badly

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 22:45           ` Vincent Lefevre
@ 2008-05-23 23:04             ` Vincent Lefevre
  0 siblings, 0 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-23 23:04 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers, 482346

I've done a "strace -f". The last system call from zsh is:

21344 rt_sigsuspend([] <unfinished ...>

which occurred at about the same time vlc was started. Then, beginning
at the first SIGINT:

21345 --- SIGINT (Interrupt) @ 0 (0) ---
21345 time(NULL)                        = 1211583212
21345 sigreturn()                       = ? (mask now [ILL ABRT BUS FPE USR1 USR2 ALRM TERM CHLD CONT STOP TSTP TTIN URG VTALRM PROF WINCH])
21345 ioctl(7, FIONREAD, [0])           = 0
21345 poll( <unfinished ...>
21348 <... nanosleep resumed> NULL)     = 0
21347 <... nanosleep resumed> NULL)     = 0
21348 nanosleep({0, 50000000},  <unfinished ...>
21347 nanosleep({0, 50000000},  <unfinished ...>
21346 <... nanosleep resumed> NULL)     = 0
[...]
21345 ioctl(7, FIONREAD, [0])           = 0
21345 poll( <unfinished ...>
21347 <... nanosleep resumed> NULL)     = 0
21347 nanosleep({0, 50000000},  <unfinished ...>
21346 <... nanosleep resumed> NULL)     = 0
21346 nanosleep({0, 25000000},  <unfinished ...>
21349 <... nanosleep resumed> NULL)     = 0
21349 nanosleep({0, 100000000},  <unfinished ...>
21348 <... nanosleep resumed> NULL)     = 0
21348 nanosleep({0, 50000000},  <unfinished ...>
21346 <... nanosleep resumed> NULL)     = 0
21346 nanosleep({0, 25000000},  <unfinished ...>
21347 <... nanosleep resumed> NULL)     = 0
21347 nanosleep({0, 50000000},  <unfinished ...>
21346 <... nanosleep resumed> NULL)     = 0
21346 nanosleep({0, 25000000},  <unfinished ...>
21348 <... nanosleep resumed> NULL)     = 0
21348 nanosleep({0, 50000000},  <unfinished ...>
21346 <... nanosleep resumed> NULL)     = 0
21346 nanosleep({0, 25000000},  <unfinished ...>
21345 <... poll resumed> [{fd=5, events=POLLIN}, {fd=7, events=POLLIN}], 2, 99) = ? ERESTART_RESTARTBLOCK (To be restarted)
21345 --- SIGINT (Interrupt) @ 0 (0) ---
21345 time(NULL)                        = 1211583216
21345 rt_sigaction(SIGINT, {SIG_DFL}, {0x10001e9c, [INT], SA_RESTART}, 8) = 0
21345 rt_sigaction(SIGHUP, {SIG_DFL}, {0x10001e9c, [HUP], SA_RESTART}, 8) = 0
21345 rt_sigaction(SIGQUIT, {SIG_DFL}, {0x10001e9c, [QUIT], SA_RESTART}, 8) = 0
21345 rt_sigaction(SIGALRM, {SIG_DFL}, {SIG_IGN}, 8) = 0
21345 rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
21345 write(2, "user insisted too much, dying ba"..., 36) = 36
21345 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
21345 tgkill(21345, 21345, SIGABRT)     = 0
21345 --- SIGABRT (Aborted) @ 0 (0) ---
21349 <... nanosleep resumed> 0)        = ? ERESTART_RESTARTBLOCK (To be restarted)
21347 <... nanosleep resumed> 0)        = ? ERESTART_RESTARTBLOCK (To be restarted)
21348 <... nanosleep resumed> 0)        = ? ERESTART_RESTARTBLOCK (To be restarted)
21346 <... nanosleep resumed> 0)        = ? ERESTART_RESTARTBLOCK (To be restarted)

The strace output ends here.

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 22:43         ` Vincent Lefevre
  2008-05-23 22:45           ` Vincent Lefevre
@ 2008-05-24  2:55           ` Clint Adams
  2008-05-24 12:44             ` Vincent Lefevre
  1 sibling, 1 reply; 24+ messages in thread
From: Clint Adams @ 2008-05-24  2:55 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers, 482346

On Sat, May 24, 2008 at 12:43:05AM +0200, Vincent Lefevre wrote:
> Same problem with zsh-beta 4.3.6-dev-0+20080506-1.

Just to keep things clear, that has users/12815 applied, and the
original report did not.



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24  2:55           ` Clint Adams
@ 2008-05-24 12:44             ` Vincent Lefevre
  2008-05-24 14:25               ` Peter Stephenson
  2008-05-24 23:40               ` Phil Pennock
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-24 12:44 UTC (permalink / raw)
  To: Bart Schaefer, zsh-workers, 482346

severity 482346 important
thanks

On 2008-05-24 02:55:56 +0000, Clint Adams wrote:
> On Sat, May 24, 2008 at 12:43:05AM +0200, Vincent Lefevre wrote:
> > Same problem with zsh-beta 4.3.6-dev-0+20080506-1.
> 
> Just to keep things clear, that has users/12815 applied, and the
> original report did not.

A bit more information:

This is 100% reproducible with both zsh and zsh-beta. It drives
the load average very high (e.g. up to 26) and ends in a DoS
(the news server complains about the load average and shuts the
connection down). The only solution seems to be to reboot the machine.
(That's why I've raised the bug severity to important.)

Note: when I kill zsh, the zombie remains there and gets attached
to init. The load average remains very high.

I do not have such a problem when I start vlc from bash and quit
it in the same way.

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 12:44             ` Vincent Lefevre
@ 2008-05-24 14:25               ` Peter Stephenson
  2008-05-24 15:27                 ` Stephane Chazelas
  2008-05-24 23:40               ` Phil Pennock
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Stephenson @ 2008-05-24 14:25 UTC (permalink / raw)
  To: 482346, zsh-workers

On Sat, 24 May 2008 14:44:45 +0200
Vincent Lefevre <vincent@vinc17.org> wrote:
> This is 100% reproducible with both zsh and zsh-beta.

If it's just a matter of starting vlc and trying to kill it for you,
then there's something more to track down since this doesn't happen for
me (Fedora 9).

There must be more to it than simply zsh not waiting for children:  that
has no effect on whether vlc is executing code.  The propagation of the
signal to the programme seems to me a likely suspect.  Of course these
could be related to the same fundamental problem.  vlc does appear to do
some fairly odd things with signals which may be interacting in
unexpected ways with the shell.

It would be useful to know whether the vlc job is listed in the shell's
job table at this point, and to see what else is there.  The array
jobtab contains the jobs, which contain linked lists of process
structures.  One of the jobs (it should be the one indexed
by curjob) should include the vlc process.  The previous bug
happened because some other job (marked as defunct) had
a process with the same number.
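
To make the PID-collision scenario concrete, here is a minimal sketch of a job table whose entries hold linked lists of processes. It is not taken from the zsh source: the struct names, the in_use flag and sketch_find_job() are all hypothetical, and the real jobtab layout may differ. It only shows how a stale entry holding a recycled PID can shadow the live job unless the lookup skips dead slots.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified types; NOT zsh's real job/process structures. */
struct sketch_proc {
    pid_t pid;
    struct sketch_proc *next;
};

struct sketch_job {
    int in_use;                 /* nonzero while the slot describes a live job */
    struct sketch_proc *procs;  /* linked list of processes in this job */
};

/* Return the index of the first job whose process list contains pid, or -1.
 * If only_live is nonzero, slots with in_use == 0 are skipped, so a stale
 * entry that still holds a recycled PID cannot shadow the current job. */
int sketch_find_job(const struct sketch_job *tab, int n, pid_t pid, int only_live)
{
    for (int i = 0; i < n; i++) {
        if (only_live && !tab[i].in_use)
            continue;
        for (const struct sketch_proc *p = tab[i].procs; p; p = p->next)
            if (p->pid == pid)
                return i;
    }
    return -1;
}
```

With a stale slot 0 and a live slot 1 that both list PID 4277, a lookup that ignores in_use returns 0 (the wrong job), while a lookup restricted to live slots returns 1.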

pws



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 14:25               ` Peter Stephenson
@ 2008-05-24 15:27                 ` Stephane Chazelas
  2008-05-24 15:58                   ` Stephane Chazelas
  2008-05-24 17:41                   ` Bart Schaefer
  0 siblings, 2 replies; 24+ messages in thread
From: Stephane Chazelas @ 2008-05-24 15:27 UTC (permalink / raw)
  To: Peter Stephenson; +Cc: 482346, zsh-workers

On Sat, May 24, 2008 at 03:25:04PM +0100, Peter Stephenson wrote:
> On Sat, 24 May 2008 14:44:45 +0200
> Vincent Lefevre <vincent@vinc17.org> wrote:
> > This is 100% reproducible with both zsh and zsh-beta.
> 
> If it's just a matter of starting vlc and trying to kill it for you,
> then there's something more to track down since this doesn't happen for
> me (Fedora 9).
[...]

For information, I cannot reproduce it here on a debian with the
same versions of the packages as Vincent's but on x86 (Vincent
is on powerpc according to bugs.debian.org/482346).

From the straces, we see that zsh is not receiving any SIGCHLD.

It may be a good idea to check if the same can be observed with
other shells that use sigsuspend (pdksh, mksh) instead of
wait4/waitpid (bash, ksh93, posh, ash).

Note that both mksh and posh are meant to derive from pdksh. It
would be interesting to know why posh switched from sigsuspend
to wait4.
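
For comparison, the waitpid style of waiting can be sketched as below. This is a generic POSIX illustration, not code from bash or any other shell named above; sketch_wait_with_waitpid() is a hypothetical name. The parent blocks inside waitpid() itself, so it does not depend on a SIGCHLD handler ever running.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, then block in waitpid() until it is
 * reaped. Returns the child's exit status, or -1 on error or abnormal exit. */
int sketch_wait_with_waitpid(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: terminate immediately */

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)     /* retry only when interrupted by a signal */
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```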

-- 
Stéphane



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 15:27                 ` Stephane Chazelas
@ 2008-05-24 15:58                   ` Stephane Chazelas
  2008-05-24 17:53                     ` Clint Adams
  2008-05-24 17:41                   ` Bart Schaefer
  1 sibling, 1 reply; 24+ messages in thread
From: Stephane Chazelas @ 2008-05-24 15:58 UTC (permalink / raw)
  To: Peter Stephenson, 482346, zsh-workers

On Sat, May 24, 2008 at 04:27:04PM +0100, Stephane Chazelas wrote:
[...]
> Note that both mksh and posh are meant to derive from pdksh. It
> would be interesting to know why posh switched from sigsuspend
> to wait4.
[...]

I'm under the impression that it is by accident/mistake, the
conf-end.h that would enable the sigsuspend is there but not
included in posh, and there's no mention of the change in the
changelog. Clint may confirm/infirm

-- 
Stéphane



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 15:27                 ` Stephane Chazelas
  2008-05-24 15:58                   ` Stephane Chazelas
@ 2008-05-24 17:41                   ` Bart Schaefer
  2008-05-24 18:25                     ` Stephane Chazelas
  1 sibling, 1 reply; 24+ messages in thread
From: Bart Schaefer @ 2008-05-24 17:41 UTC (permalink / raw)
  To: zsh-workers; +Cc: 482346

On May 24,  4:27pm, Stephane Chazelas wrote:
}
} From the straces, we see that zsh is not receiving any SIGCHLD.

If that were the only problem, then opening another shell window and
performing "kill -CHLD ..." on the original shell should clear it all
up.  But as PWS pointed out, the spiking load indicates that either
vlc or zsh is actively doing something, which may have to do with the
way zsh propagates signals, or how it sets the signal masks before
forking vlc in the first place.

If it's zsh that's looping, then Vincent's stack trace indicates it
must be here in zwaitjob:

	while (!errflag && jn->stat &&
	       !(jn->stat & STAT_DONE) &&
	       !(interact && (jn->stat & STAT_STOPPED))) {
	    signal_suspend(SIGCHLD);
	    /* job handling stuff elided */
	    child_block();
	}

But if *that* were a tight loop, it would mean that signal_suspend()
isn't working.  It'd be nice to know what process or processes send
the load so high; 100% CPU usage is one thing, but 26+ processes in
runnable state sounds like another thing entirely.
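
The general pattern behind a loop like the one quoted from zwaitjob can be sketched as follows. This is a generic POSIX illustration under stated assumptions, not zsh's signal_suspend() or its SIGCHLD handler; every sketch_ name is hypothetical. SIGCHLD is blocked before fork(), and sigsuspend() atomically unblocks it and sleeps. If the signal is never delivered, the loop sleeps forever rather than spinning, which is what a process sitting in sigsuspend looks like from gdb.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t sketch_child_done;
static volatile sig_atomic_t sketch_child_status;

/* SIGCHLD handler: reap every child that has changed state, without blocking. */
static void sketch_on_sigchld(int sig)
{
    int saved_errno = errno;    /* waitpid() may clobber errno */
    int status;
    (void)sig;
    while (waitpid(-1, &status, WNOHANG) > 0) {
        sketch_child_status = status;
        sketch_child_done = 1;
    }
    errno = saved_errno;
}

/* Fork a child that exits with `code` and wait for it using sigsuspend().
 * Returns the child's exit status, or -1 on error or abnormal exit. */
int sketch_wait_with_sigsuspend(int code)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;

    sa.sa_handler = sketch_on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
        return -1;

    /* Block SIGCHLD before forking so it cannot arrive before we wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);

    sketch_child_done = 0;
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);            /* child: terminate immediately */

    if (pid > 0) {
        wait_mask = old_mask;
        sigdelset(&wait_mask, SIGCHLD);
        while (!sketch_child_done)
            sigsuspend(&wait_mask);  /* atomically unblock SIGCHLD and sleep */
    }

    /* Restore the caller's signal mask and SIGCHLD disposition. */
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    if (pid < 0)
        return -1;

    int status = sketch_child_status;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the handler reaps any child of the process, so this sketch assumes no other children are outstanding.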



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 15:58                   ` Stephane Chazelas
@ 2008-05-24 17:53                     ` Clint Adams
  0 siblings, 0 replies; 24+ messages in thread
From: Clint Adams @ 2008-05-24 17:53 UTC (permalink / raw)
  To: 482346, zsh-workers

On Sat, May 24, 2008 at 04:58:20PM +0100, Stephane Chazelas wrote:
> I'm under the impression that it is by accident/mistake, the
> conf-end.h that would enable the sigsuspend is there but not
> included in posh, and there's no mention of the change in the
> changelog. Clint may confirm/infirm

Yes, that was unintentional, but it needs to be redone anyway.



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 17:41                   ` Bart Schaefer
@ 2008-05-24 18:25                     ` Stephane Chazelas
  0 siblings, 0 replies; 24+ messages in thread
From: Stephane Chazelas @ 2008-05-24 18:25 UTC (permalink / raw)
  To: Bart Schaefer; +Cc: zsh-workers, 482346

On Sat, May 24, 2008 at 10:41:07AM -0700, Bart Schaefer wrote:
[...]
> But if *that* were a tight loop, it would mean that signal_suspend()
> isn't working.  It'd be nice to know what process or processes send
> the load so high; 100% CPU usage is one thing, but 26+ processes in
> runnable state sounds like another thing entirely.

From the strace and from gdb, it would seem that zsh's
sigsuspend doesn't return, so zsh can't be in any kind of loop
unless I missed something.

Might be some multithreading issue. top -H might help.

-- 
Stéphane



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 12:44             ` Vincent Lefevre
  2008-05-24 14:25               ` Peter Stephenson
@ 2008-05-24 23:40               ` Phil Pennock
  2008-05-25  0:41                 ` Vincent Lefevre
  1 sibling, 1 reply; 24+ messages in thread
From: Phil Pennock @ 2008-05-24 23:40 UTC (permalink / raw)
  To: Vincent Lefevre; +Cc: zsh-workers, 482346

On 2008-05-24 at 14:44 +0200, Vincent Lefevre wrote:
> Note: when I kill zsh, the zombie remains there and gets attached
> to init. The load average remains very high.

If the zombie is reparented to init but still stays a zombie, then
there's something worse wrong with your system.  If init can't reap its
children then it's understandable that zsh might have troubles too.

Since you're on a rarer architecture that doesn't see so much Linux
kernel debugging, I'd be inclined to look at what has changed in the
kernel's architecture-specific signal handling code.  (But see below).

Further, it's strange that zombies are contributing to load average; if
zsh is gone (killed off and no longer even possibly stuck in a tight
loop)  and there's the zombie and init left, then there shouldn't be
anything contributing to load avg.

If you use tools such as top(1), what processes are they attributing the
load to?  Is the high load average confirmed by vmstat reports of idle
CPU, or is the load avg really out of sync with CPU reality?  Linux is
unusual in counting processes blocked on storage IO towards the load
average, so if the problem is something like a flaky disk underneath the
root filesystem, that might be complicating your problem.
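
One way to check which tasks could be feeding the load average is to look at the state letter in each /proc/<pid>/stat line. The sketch below assumes the Linux proc(5) layout "pid (comm) state ..." and that runnable (R) and uninterruptible-sleep (D) tasks are the ones counted; the function names are hypothetical and reading the files themselves is left out.

```c
#include <string.h>

/* Return the state letter from a /proc/<pid>/stat line, or '?' if malformed.
 * The command name is wrapped in parentheses and may itself contain spaces
 * or ')' characters, so search for the last ')' instead of splitting on
 * spaces. */
char sketch_state_from_stat(const char *line)
{
    const char *close = strrchr(line, ')');
    if (!close || close[1] != ' ' || close[2] == '\0')
        return '?';
    return close[2];
}

/* Nonzero for states assumed to count towards the Linux load average:
 * 'R' (runnable) and 'D' (uninterruptible sleep, typically disk IO). */
int sketch_counts_toward_load(char state)
{
    return state == 'R' || state == 'D';
}
```

For a line such as "4277 (vlc) Z 20147 4277", the state is Z (zombie), which this check does not count towards the load.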

-Phil



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 23:40               ` Phil Pennock
@ 2008-05-25  0:41                 ` Vincent Lefevre
  2008-05-25  1:23                   ` Phil Pennock
  2008-05-25 10:08                   ` Stephane Chazelas
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-25  0:41 UTC (permalink / raw)
  To: zsh-workers; +Cc: 482346

On 2008-05-24 16:40:02 -0700, Phil Pennock wrote:
> Since you're on a rarer architecture that doesn't see so much Linux
> kernel debugging, I'd be inclined to look at what has changed in the
> kernel's architecture-specific signal handling code.  (But see below).

I don't know if the bug is new. I've never killed vlc in such a way
in the past. The current kernel is linux-image-2.6.24-1-powerpc
2.6.24-7.

> Further, it's strange that zombies are contributing to load average; if
> zsh is gone (killed off and no longer even possibly stuck in a tight
> loop)  and there's the zombie and init left, then there shouldn't be
> anything contributing to load avg.

Yes, that's strange.

> If you use tools such as top(1), what processes are they attributing the
> load to?

None, and the CPU is idle:

top - 02:31:55 up 22:19,  6 users,  load average: 4.09, 4.24, 4.25
Tasks: 117 total,   1 running, 115 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.7%us,  0.7%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Mem:    255432k total,   233456k used,    21976k free,     3004k buffers
Swap:   524280k total,    22820k used,   501460k free,   105548k cached

(The zombie is vlc.)

>  Is the high load average confirmed by vmstat reports of idle
> CPU, or is the load avg really out of sync with CPU reality?

Here's vmstat output:

ay:~> vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  22820  26456   5520  99688    0    0    15    35   30  196  8  1 88  3

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25  0:41                 ` Vincent Lefevre
@ 2008-05-25  1:23                   ` Phil Pennock
  2008-05-25 21:37                     ` Vincent Lefevre
  2008-05-25 10:08                   ` Stephane Chazelas
  1 sibling, 1 reply; 24+ messages in thread
From: Phil Pennock @ 2008-05-25  1:23 UTC (permalink / raw)
  To: zsh-workers, 482346

On 2008-05-25 at 02:41 +0200, Vincent Lefevre wrote:
> None, and the CPU is idle:

Then I'd be inclined to start looking into hardware issues, since
_something's_ probably getting stuck in disk IO; I'll suspect that
before kernel bugs, but it might also be worth seeing if there are other
problems with threaded programs on powerpc, if init really can't reap
something that has already become a zombie.

> Here's vmstat output:

First line of vmstat is average since system boot, you need to do
something like "vmstat 1", ignore the first line, and look at what's
happening at the current time.
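
The difference between a since-boot average and a current reading comes down to which pair of cumulative counters is subtracted. A minimal sketch, assuming tick counters like those a tool might read from /proc/stat (the struct and function names are hypothetical, and this is not vmstat's actual code):

```c
/* Cumulative CPU tick counters since boot (hypothetical, simplified). */
struct sketch_cpu_ticks {
    unsigned long long busy;
    unsigned long long idle;
};

/* Percentage of idle time between two samples. Passing an all-zero
 * `earlier` sample gives the average since boot; passing two recent
 * samples gives the figure for that interval only. */
double sketch_idle_pct(struct sketch_cpu_ticks earlier,
                       struct sketch_cpu_ticks later)
{
    unsigned long long busy = later.busy - earlier.busy;
    unsigned long long idle = later.idle - earlier.idle;
    unsigned long long total = busy + idle;
    return total ? 100.0 * (double)idle / (double)total : 0.0;
}
```

With samples {120, 880} and then {120, 980}, the since-boot figure at the first sample is 88% idle, while the interval between the two samples is 100% idle.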

-Phil



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25  0:41                 ` Vincent Lefevre
  2008-05-25  1:23                   ` Phil Pennock
@ 2008-05-25 10:08                   ` Stephane Chazelas
  2008-05-25 21:54                     ` Vincent Lefevre
  1 sibling, 1 reply; 24+ messages in thread
From: Stephane Chazelas @ 2008-05-25 10:08 UTC (permalink / raw)
  To: zsh-workers, 482346

On Sun, May 25, 2008 at 02:41:01AM +0200, Vincent Lefevre wrote:
> On 2008-05-24 16:40:02 -0700, Phil Pennock wrote:
> > Since you're on a rarer architecture that doesn't see so much Linux
> > kernel debugging, I'd be inclined to look at what has changed in the
> > kernel's architecture-specific signal handling code.  (But see below).
> 
> I don't know if the bug is new. I've never killed vlc in such a way
> in the past. The current kernel is linux-image-2.6.24-1-powerpc
> 2.6.24-7.
> 
> > Further, it's strange that zombies are contributing to load average; if
> > zsh is gone (killed off and no longer even possibly stuck in a tight
> > loop)  and there's the zombie and init left, then there shouldn't be
> > anything contributing to load avg.
> 
> Yes, that's strange.

Now, what happens if the main thread in a multithreaded
process dies and other threads remain?

Anyway, one process can contribute at most 1 point to the load
average, so there must be more of them. If they don't show up in
top, either they are created and destroyed too fast or they are
threads.

zombies are not processes so cannot contribute to load average.

> 
> > If you use tools such as top(1), what processes are they attributing the
> > load to?
> 
> None, and the CPU is idle:
> 
> top - 02:31:55 up 22:19,  6 users,  load average: 4.09, 4.24, 4.25
> Tasks: 117 total,   1 running, 115 sleeping,   0 stopped,   1 zombie

Could you try top -H (to see threads) and the same thing with
pdksh?

I'd bet on a kernel and/or pthread and/or vlc issue.

-- 
Stéphane



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25  1:23                   ` Phil Pennock
@ 2008-05-25 21:37                     ` Vincent Lefevre
  2008-05-25 22:26                       ` Phil Pennock
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-25 21:37 UTC (permalink / raw)
  To: zsh-workers, 482346

On 2008-05-24 18:23:21 -0700, Phil Pennock wrote:
> Then I'd be inclined to start looking into hardware issues, since
> _something's_ probably getting stuck in disk IO; I'll suspect that
> before kernel bugs, but it might also be worth seeing if there are other
> problems with threaded programs on powerpc, if init really can't reap
> something that has already become a zombie.

I've looked at /var/log/kern.log and there's something each time
I interrupted vlc, e.g.

May 24 14:33:36 ay kernel: Unable to handle kernel paging request for data at address 0x481e7000
May 24 14:33:36 ay kernel: Faulting instruction address: 0xc00131e8
May 24 14:33:36 ay kernel: Oops: Kernel access of bad area, sig: 11 [#1]
May 24 14:33:36 ay kernel: PowerMac
May 24 14:33:36 ay kernel: Modules linked in: snd_powermac snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd snd_page_alloc soundcore xt_multiport iptable_filter ip_tables x_tables ipv6 ide_cd cdrom sungem sungem_phy firewire_ohci firewire_core crc_itu_t yenta_socket rsrc_nonstatic pcmcia_core uninorth_agp agpgart sd_mod scsi_mod dm_snapshot dm_mirror dm_mod ext3 jbd mbcache ide_disk evdev i2c_powermac windfarm_core
May 24 14:33:36 ay kernel: NIP: c00131e8 LR: c0017780 CTR: 00000080
May 24 14:33:36 ay kernel: REGS: c2769b60 TRAP: 0300   Not tainted  (2.6.24-1-powerpc)
May 24 14:33:36 ay kernel: MSR: 00009032 <EE,ME,IR,DR>  CR: 24004422  XER: 00000000
May 24 14:33:36 ay kernel: DAR: 481e7000, DSISR: 40000000
May 24 14:33:36 ay kernel: TASK = c2d7ca80[21850] 'vlc' THREAD: c2768000
May 24 14:33:36 ay kernel: GPR00: c3356060 c2769c10 c2d7ca80 481e7000 00000080 0c57a181 481e7000 00000000 
May 24 14:33:36 ay kernel: GPR08: 0c57a181 c3356060 00000000 c03fe000 44004422 1001a728 bfffffff cf3d2140 
May 24 14:33:36 ay kernel: GPR16: 0000000d c3356060 00000030 00000000 c2769ccc c272e480 c3356060 00000001 
May 24 14:33:36 ay kernel: GPR24: 00000000 481e7000 0000079c c245e860 c245e860 0c57a181 481e7000 c0588f40 
May 24 14:33:36 ay kernel: NIP [c00131e8] __flush_dcache_icache+0x14/0x40
May 24 14:33:36 ay kernel: LR [c0017780] update_mmu_cache+0x84/0x108
May 24 14:33:36 ay kernel: Call Trace:
May 24 14:33:36 ay kernel: [c2769c10] [481e7000] 0x481e7000 (unreliable)
May 24 14:33:36 ay kernel: [c2769c30] [c0085888] handle_mm_fault+0xc70/0xd70
May 24 14:33:36 ay kernel: [c2769c70] [c0085d48] get_user_pages+0x3c0/0x4d0
May 24 14:33:36 ay kernel: [c2769cc0] [c00d008c] elf_core_dump+0xa28/0xce4
May 24 14:33:36 ay kernel: [c2769d60] [c009fe28] do_coredump+0x664/0x6cc
May 24 14:33:36 ay kernel: [c2769e50] [c003df64] get_signal_to_deliver+0x390/0x3c0
May 24 14:33:36 ay kernel: [c2769e80] [c00096ac] do_signal+0x50/0x268
May 24 14:33:36 ay kernel: [c2769f40] [c0013ef4] do_user_signal+0x7c/0xcc
May 24 14:33:36 ay kernel: --- Exception: c00 at 0xfbcce5c
May 24 14:33:36 ay kernel:     LR = 0xfbceaf4
May 24 14:33:36 ay kernel: Instruction dump:
May 24 14:33:36 ay kernel: 4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000 
May 24 14:33:36 ay kernel: 54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8 7c0004ac 
May 24 14:33:36 ay kernel: ---[ end trace 6343c960c4d55920 ]---
May 24 14:33:36 ay kernel: note: vlc[21850] exited with preempt_count 1

> > Here's vmstat output:
> 
> First line of vmstat is average since system boot, you need to do
> something like "vmstat 1", ignore the first line, and look at what's
> happening at the current time.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0 120044  17616  54288 124632    4    4    33    44   32  261  8  1 87  4
 0  0 120044  17616  54288 124632    0    0     0     0   24  299  0  0 100  0
 0  0 120044  17616  54288 124632    0    0     0     0   25  303  0  0 100  0
 0  0 120044  17616  54288 124632    0    0     0     0   24  298  0  0 100  0
 0  0 120044  17616  54288 124632    0    0     0     0   24  290  0  0 100  0
 0  0 120044  17376  54288 124632    0    0     0     0   23  300 73  7 20  0
 0  0 120044  17436  54296 124632    0    0     0    32   27  297  0  0 99  1
 1  0 120044  17376  54296 124604    0    0     0     0   24  298 11  0 89  0
 0  0 120044  17616  54296 124632    0    0     0     0   23  299 51  2 47  0
 0  0 120044  17616  54296 124632    0    0     0     0   24  292  0  0 100  0
 0  0 120044  17556  54300 124632    0    0     0   340   65  365  0  0 80 20

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)
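
A note on the vmstat advice quoted in the message above: the first data row is the average since boot, so only the later rows describe current activity. The following is a minimal Python sketch of that filtering, not something posted in the thread. The function name `parse_vmstat` and the `SAMPLE` excerpt are illustrative assumptions; the column names are copied from the header shown above.

```python
# Column names as printed in the vmstat header quoted in this thread.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat(text):
    """Return per-interval samples, dropping the first (since-boot) row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the two header lines and anything that is not a full data row.
        if len(fields) != len(VMSTAT_COLUMNS):
            continue
        if not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(VMSTAT_COLUMNS, (int(f) for f in fields))))
    # The first data row is the since-boot average, so it is discarded.
    return rows[1:]


# Three data rows copied from the output above (hypothetical excerpt).
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0 120044  17616  54288 124632    4    4    33    44   32  261  8  1 87  4
 0  0 120044  17616  54288 124632    0    0     0     0   24  299  0  0 100  0
 0  0 120044  17376  54288 124632    0    0     0     0   23  300 73  7 20  0
"""

current = parse_vmstat(SAMPLE)
# The sample with the lowest idle percentage is the busiest interval.
busiest = min(current, key=lambda row: row["id"])
print(len(current), busiest["us"], busiest["id"])
```

In practice the text would come from something like `vmstat 1`, as suggested in the quoted advice; the parser assumes the 16-column layout shown here and would need adjusting for other vmstat versions.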



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25 10:08                   ` Stephane Chazelas
@ 2008-05-25 21:54                     ` Vincent Lefevre
  0 siblings, 0 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-25 21:54 UTC (permalink / raw)
  To: zsh-workers, 482346

On 2008-05-25 11:08:44 +0100, Stephane Chazelas wrote:
> zombies are not processes so cannot contribute to load average.

Except in case of kernel bugs, I suppose.

> Could you try top -H (to see threads)

I don't see anything special.

> and the same thing with pdksh?

Just like with bash, no problem when I start vlc from pdksh and
interrupt it in the same way. No kernel oops in /var/log/kern.log.

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25 21:37                     ` Vincent Lefevre
@ 2008-05-25 22:26                       ` Phil Pennock
  2008-05-25 23:34                         ` Vincent Lefevre
  0 siblings, 1 reply; 24+ messages in thread
From: Phil Pennock @ 2008-05-25 22:26 UTC (permalink / raw)
  To: zsh-workers, 482346

On 2008-05-25 at 23:37 +0200, Vincent Lefevre wrote:
> On 2008-05-24 18:23:21 -0700, Phil Pennock wrote:
> > Then I'd be inclined to start looking into hardware issues, since
> > _something's_ probably getting stuck in disk IO; I'll suspect that
> > before kernel bugs, but it might also be worth seeing if there are other
> > problems with threaded programs on powerpc, if init really can't reap
> > something that has already become a zombie.
> 
> I've looked at /var/log/kern.log and there's something each time
> I interrupted vlc, e.g.
> 
> May 24 14:33:36 ay kernel: Unable to handle kernel paging request for data at address 0x481e7000
> May 24 14:33:36 ay kernel: Faulting instruction address: 0xc00131e8
> May 24 14:33:36 ay kernel: Oops: Kernel access of bad area, sig: 11 [#1]

That's a segfault; the kernel's then oopsing whilst trying to page in
memory to write the coredump; looks like a problem in the MMU logic for
the powerpc.

So, the problems are:

 * vlc is segfaulting when it receives SIGINT;

 * the powerpc Linux kernel has a bug whereby it's ending up not letting
   the parent wait on it (from what I understand of the details so far)
   in some cases, so it looks like the process isn't actually ending and
   transitioning to zombie status; it might be worth talking to the
   architecture maintainers for your distribution, to see about known
   issues; note that even init is unable to reclaim these processes;
   have you tried sending a SIGKILL to force-exit the vlc, to see if
   either zsh or init can reap the process then?

 * zsh is somehow tickling the kernel bug and it might be worth having
   configure logic to deal with this, even after the problem is fixed,
   once we know what it is that's tickling this.

> May 24 14:33:36 ay kernel: note: vlc[21850] exited with preempt_count 1

My nasty suspicious mind thinks that special kernel logic for handling a
weird exit condition, and logging it, is less tested code that's already
doing something different to the default, so this is likely close to the
root cause; no powerpc available for me to test, though.

It seems unlikely that there'd be enough bugs to also have a zombie
contributing to load average, so I suspect that the process has not in
fact exited yet, it's still running, that's where the load comes from.
Does ps(1) actually show the 'Z' for zombie?

-Phil
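
On the question of whether ps(1) shows 'Z': the check can be done mechanically by reading the S column of `ps -l` output. The sketch below is an illustration only, not code from the thread. `find_zombies` is a hypothetical helper, and the sample text is the `ps -lt pts/5` output posted earlier in this thread. It assumes CMD is the last column and that no column is empty.

```python
def find_zombies(ps_output):
    """Return (pid, ppid, cmd) for every row whose S column is 'Z'."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    header = lines[0].split()
    s_col = header.index("S")
    pid_col = header.index("PID")
    ppid_col = header.index("PPID")
    cmd_col = header.index("CMD")
    zombies = []
    for line in lines[1:]:
        # Limit the split so a CMD containing spaces stays in one field.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue
        if fields[s_col] == "Z":
            zombies.append(
                (int(fields[pid_col]), int(fields[ppid_col]), fields[cmd_col])
            )
    return zombies


# Output of "ps -lt pts/5" as posted earlier in the thread.
PS_OUTPUT = """\
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 Z  1000  4277 20147  1  80   0 -     0 exit   pts/5    00:00:26 vlc <defunct>
0 S  1000 20147 20126  0  80   0 -  2269 rt_sig pts/5    00:00:00 zsh
"""

zombies = find_zombies(PS_OUTPUT)
for pid, ppid, cmd in zombies:
    print(f"zombie pid={pid} parent={ppid} cmd={cmd}")
```

The PPID is reported alongside the PID because it is the parent (here the zsh process) that would normally be expected to reap the child.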



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25 22:26                       ` Phil Pennock
@ 2008-05-25 23:34                         ` Vincent Lefevre
  2008-05-25 23:43                           ` Vincent Lefevre
  2008-05-26  0:31                           ` Bart Schaefer
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-25 23:34 UTC (permalink / raw)
  To: zsh-workers, 482346

On 2008-05-25 15:26:58 -0700, Phil Pennock wrote:
>    have you tried sending a SIGKILL to force-exit the vlc, to see if
>    either zsh or init can reap the process then?

This has no effect.

> Does ps(1) actually show the 'Z' for zombie?

Yes (BTW, this was in my second message).

Now, I could reproduce the problem with both bash and pdksh.
I don't know why I couldn't reproduce it the first times while it is
100% reproducible under zsh (perhaps timings are a bit different).

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25 23:34                         ` Vincent Lefevre
@ 2008-05-25 23:43                           ` Vincent Lefevre
  2008-05-26  0:31                           ` Bart Schaefer
  1 sibling, 0 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-25 23:43 UTC (permalink / raw)
  To: zsh-workers, 482346

reassign 482346 linux-image-2.6.24-1-powerpc
retitle 482346 kernel 2.6.24-1-powerpc oops after vlc segfaults
thanks

because it became clear that zsh has nothing to do with this bug.

I forgot to say that the screen brightness is set to 100% (the machine
is an early PowerBook G4) when the kernel oops occurs.

I suppose that next time zsh-workers could be dropped from To|Cc.
Those interested in this bug could subscribe to bug 482346. See:

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=482346

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)



* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25 23:34                         ` Vincent Lefevre
  2008-05-25 23:43                           ` Vincent Lefevre
@ 2008-05-26  0:31                           ` Bart Schaefer
  1 sibling, 0 replies; 24+ messages in thread
From: Bart Schaefer @ 2008-05-26  0:31 UTC (permalink / raw)
  To: 482346, zsh-workers

On May 26,  1:34am, Vincent Lefevre wrote:
} 
} Now, I could reproduce the problem with both bash and pdksh.
} I don't know why I couldn't reproduce it the first times while it is
} 100% reproducible under zsh (perhaps timings are a bit different).

It's possible that zsh is passing a larger *envp or some similar
difference, that makes it more likely the bug is tickled.
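
One way to test the environment-size idea above would be to compare how many bytes of environment each shell hands to its children. The sketch below is a rough illustration under stated assumptions, not something from the thread. `env_block_size` is a hypothetical helper that counts only the `NAME=value` strings and their terminating NUL bytes, and ignores the pointer array and argv. Whether the size relates to the kernel bug at all is speculation.

```python
import os


def env_block_size(env):
    """Approximate bytes taken by the 'NAME=value\\0' strings in env."""
    # One byte for '=' and one for the terminating NUL per entry.
    return sum(len(name) + 1 + len(value) + 1 for name, value in env.items())


# Encode the current environment the way it would be passed to a child.
encoded_env = {os.fsencode(k): os.fsencode(v) for k, v in os.environ.items()}
current_size = env_block_size(encoded_env)
print(f"{len(encoded_env)} variables, about {current_size} bytes")
```

Running the same script from zsh, bash and pdksh would show whether the environments differ noticeably in size.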



end of thread

Thread overview: 24+ messages
     [not found] <20080521235008.GA5600@ay.vinc17.org>
     [not found] ` <20080521235930.GW7056@prunille.vinc17.org>
2008-05-22 23:33   ` Bug#482346: zsh doesn't always wait for its children (-> zombie) Clint Adams
2008-05-23 14:39     ` Bart Schaefer
2008-05-23 14:57       ` Clint Adams
2008-05-23 22:43         ` Vincent Lefevre
2008-05-23 22:45           ` Vincent Lefevre
2008-05-23 23:04             ` Vincent Lefevre
2008-05-24  2:55           ` Clint Adams
2008-05-24 12:44             ` Vincent Lefevre
2008-05-24 14:25               ` Peter Stephenson
2008-05-24 15:27                 ` Stephane Chazelas
2008-05-24 15:58                   ` Stephane Chazelas
2008-05-24 17:53                     ` Clint Adams
2008-05-24 17:41                   ` Bart Schaefer
2008-05-24 18:25                     ` Stephane Chazelas
2008-05-24 23:40               ` Phil Pennock
2008-05-25  0:41                 ` Vincent Lefevre
2008-05-25  1:23                   ` Phil Pennock
2008-05-25 21:37                     ` Vincent Lefevre
2008-05-25 22:26                       ` Phil Pennock
2008-05-25 23:34                         ` Vincent Lefevre
2008-05-25 23:43                           ` Vincent Lefevre
2008-05-26  0:31                           ` Bart Schaefer
2008-05-25 10:08                   ` Stephane Chazelas
2008-05-25 21:54                     ` Vincent Lefevre
