* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
       [not found] ` <20080521235930.GW7056@prunille.vinc17.org>
@ 2008-05-22 23:33   ` Clint Adams
  2008-05-23 14:39     ` Bart Schaefer
  0 siblings, 1 reply; 24+ messages in thread
From: Clint Adams @ 2008-05-22 23:33 UTC
To: zsh-workers; +Cc: Vincent Lefevre, 482346

Any ideas what might be going on? Race condition?

On Thu, May 22, 2008 at 01:50:08AM +0200, Vincent Lefevre wrote:
> I started vlc from the zsh command line. Some time later, I decided
> to kill vlc with Ctrl-C. As I didn't get the prompt, I tried Ctrl-C
> a few more times, with no change. I can see that vlc is now a zombie:
>
> ay:~> ps -ft pts/5
> UID        PID  PPID  C STIME TTY          TIME CMD
> lefevre   4277 20147  1 01:16 pts/5    00:00:26 [vlc] <defunct>
> lefevre  20147 20126  0 May06 pts/5    00:00:00 zsh

On Thu, May 22, 2008 at 01:59:30AM +0200, Vincent Lefevre wrote:
> Additional information that may be useful:
>
> ay:~> ps -lt pts/5
> F S   UID   PID  PPID  C PRI  NI ADDR    SZ WCHAN  TTY          TIME CMD
> 0 Z  1000  4277 20147  1  80   0 -        0 exit   pts/5    00:00:26 vlc <defunct>
> 0 S  1000 20147 20126  0  80   0 -     2269 rt_sig pts/5    00:00:00 zsh
>
> I also ran gdb on the zsh running process and got:
>
> 0x0fd312d4 in sigsuspend () from /lib/libc.so.6
> (gdb) bt
> #0  0x0fd312d4 in sigsuspend () from /lib/libc.so.6
> #1  0x10071ae4 in signal_suspend ()
> #2  0x10042268 in ?? ()
> #3  0x100423d4 in waitjobs ()
> #4  0x100274e0 in ?? ()
> #5  0x10027d20 in execlist ()
> #6  0x100283fc in execode ()
> #7  0x1003bcec in loop ()
> #8  0x1003cc30 in zsh_main ()
> #9  0x1000dc70 in main ()

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-22 23:33 ` Bug#482346: zsh doesn't always wait for its children (-> zombie) Clint Adams
@ 2008-05-23 14:39   ` Bart Schaefer
  2008-05-23 14:57     ` Clint Adams
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Schaefer @ 2008-05-23 14:39 UTC
To: zsh-workers; +Cc: 482346

On May 22, 11:33pm, Clint Adams wrote:
}
} Any ideas what might be going on? Race condition?

What version of the shell are we dealing with here? Does it have PWS's
patch from users/12815? I've been a little worried that that is too
simple. On the other hand, if it does not have users/12815, it's
possible that zsh is actually waiting for some other job that doesn't
exist any more, and interrupting vlc is a red herring.

I note from the ps output that PIDs have rolled over since the shell
was started, which is exactly the case where that patch would come
into play.

} On Thu, May 22, 2008 at 01:50:08AM +0200, Vincent Lefevre wrote:
} > I started vlc from the zsh command line. Some time later, I decided
} > to kill vlc with Ctrl-C. As I didn't get the prompt, I tried Ctrl-C
} > a few more times, with no change. I can see that vlc is now a zombie:
} >
} > ay:~> ps -ft pts/5
} > UID        PID  PPID  C STIME TTY          TIME CMD
} > lefevre   4277 20147  1 01:16 pts/5    00:00:26 [vlc] <defunct>
} > lefevre  20147 20126  0 May06 pts/5    00:00:00 zsh

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 14:39 ` Bart Schaefer
@ 2008-05-23 14:57   ` Clint Adams
  2008-05-23 22:43     ` Vincent Lefevre
  0 siblings, 1 reply; 24+ messages in thread
From: Clint Adams @ 2008-05-23 14:57 UTC
To: Bart Schaefer; +Cc: zsh-workers, 482346, Vincent Lefevre

On Fri, May 23, 2008 at 07:39:40AM -0700, Bart Schaefer wrote:
> What version of the shell are we dealing with here? Does it have PWS's
> patch from users/12815? I've been a little worried that that is too
> simple. On the other hand, if it does not have users/12815, it's
> possible that zsh is actually waiting for some other job that doesn't
> exist any more, and interrupting vlc is a red herring.
> I note from the ps output that PIDs have rolled over since the shell
> was started, which is exactly the case where that patch would come
> into play.

He's using what is just about stock 4.3.6.

Vincent, could you try reproducing this with the zsh-beta package?

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 14:57 ` Clint Adams
@ 2008-05-23 22:43   ` Vincent Lefevre
  2008-05-23 22:45     ` Vincent Lefevre
  2008-05-24  2:55     ` Clint Adams
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-23 22:43 UTC
To: Bart Schaefer, zsh-workers, 482346

On 2008-05-23 14:57:22 +0000, Clint Adams wrote:
> Vincent, could you try reproducing this with the zsh-beta package?

Same problem with zsh-beta 4.3.6-dev-0+20080506-1.

zshenv...
zshrc...
Shell level: 4
The tty is frozen
ay:~> exec zsh-beta                                           <0:36:50
zshenv...
zshrc...
Shell level: 4
The tty is frozen
echo $ZSH_V %
ay:~> echo $ZSH_VERSION                                       <0:37:05
4.3.6-dev-0+0506
ay:~> vlc                                                     <0:37:06
VLC media player 0.8.6e Janus
signal 2 received, terminating vlc - do it again in case it gets stuck
user insisted too much, dying badly

-- 
Vincent Lefèvre <vincent@vinc17.org> - Web: <http://www.vinc17.org/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.org/blog/>
Work: CR INRIA - computer arithmetic / Arenaire project (LIP, ENS-Lyon)

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 22:43 ` Vincent Lefevre
@ 2008-05-23 22:45   ` Vincent Lefevre
  2008-05-23 23:04     ` Vincent Lefevre
  2008-05-24  2:55   ` Clint Adams
  1 sibling, 1 reply; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-23 22:45 UTC
To: Bart Schaefer, zsh-workers, 482346

On 2008-05-24 00:43:05 +0200, Vincent Lefevre wrote:
> On 2008-05-23 14:57:22 +0000, Clint Adams wrote:
> > Vincent, could you try reproducing this with the zsh-beta package?
>
> Same problem with zsh-beta 4.3.6-dev-0+20080506-1.

Ditto with zsh-beta -f:

zshenv...
zshrc...
Shell level: 4
The tty is frozen
ay:~> exec zsh-beta -f                                        <0:43:21
ay% vlc
VLC media player 0.8.6e Janus
signal 2 received, terminating vlc - do it again in case it gets stuck

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed

(.:21103): Gtk-CRITICAL **: gtk_container_remove: assertion `GTK_IS_TOOLBAR (container) || widget->parent == GTK_WIDGET (container)' failed
user insisted too much, dying badly

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 22:45 ` Vincent Lefevre
@ 2008-05-23 23:04   ` Vincent Lefevre
  0 siblings, 0 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-23 23:04 UTC
To: Bart Schaefer, zsh-workers, 482346

I've done a "strace -f". The last system call from zsh is:

21344 rt_sigsuspend([] <unfinished ...>

which occurred at about the same time vlc was started. Then, beginning
at the first SIGINT:

21345 --- SIGINT (Interrupt) @ 0 (0) ---
21345 time(NULL) = 1211583212
21345 sigreturn() = ? (mask now [ILL ABRT BUS FPE USR1 USR2 ALRM TERM CHLD CONT STOP TSTP TTIN URG VTALRM PROF WINCH])
21345 ioctl(7, FIONREAD, [0]) = 0
21345 poll( <unfinished ...>
21348 <... nanosleep resumed> NULL) = 0
21347 <... nanosleep resumed> NULL) = 0
21348 nanosleep({0, 50000000}, <unfinished ...>
21347 nanosleep({0, 50000000}, <unfinished ...>
21346 <... nanosleep resumed> NULL) = 0
[...]
21345 ioctl(7, FIONREAD, [0]) = 0
21345 poll( <unfinished ...>
21347 <... nanosleep resumed> NULL) = 0
21347 nanosleep({0, 50000000}, <unfinished ...>
21346 <... nanosleep resumed> NULL) = 0
21346 nanosleep({0, 25000000}, <unfinished ...>
21349 <... nanosleep resumed> NULL) = 0
21349 nanosleep({0, 100000000}, <unfinished ...>
21348 <... nanosleep resumed> NULL) = 0
21348 nanosleep({0, 50000000}, <unfinished ...>
21346 <... nanosleep resumed> NULL) = 0
21346 nanosleep({0, 25000000}, <unfinished ...>
21347 <... nanosleep resumed> NULL) = 0
21347 nanosleep({0, 50000000}, <unfinished ...>
21346 <... nanosleep resumed> NULL) = 0
21346 nanosleep({0, 25000000}, <unfinished ...>
21348 <... nanosleep resumed> NULL) = 0
21348 nanosleep({0, 50000000}, <unfinished ...>
21346 <... nanosleep resumed> NULL) = 0
21346 nanosleep({0, 25000000}, <unfinished ...>
21345 <... poll resumed> [{fd=5, events=POLLIN}, {fd=7, events=POLLIN}], 2, 99) = ? ERESTART_RESTARTBLOCK (To be restarted)
21345 --- SIGINT (Interrupt) @ 0 (0) ---
21345 time(NULL) = 1211583216
21345 rt_sigaction(SIGINT, {SIG_DFL}, {0x10001e9c, [INT], SA_RESTART}, 8) = 0
21345 rt_sigaction(SIGHUP, {SIG_DFL}, {0x10001e9c, [HUP], SA_RESTART}, 8) = 0
21345 rt_sigaction(SIGQUIT, {SIG_DFL}, {0x10001e9c, [QUIT], SA_RESTART}, 8) = 0
21345 rt_sigaction(SIGALRM, {SIG_DFL}, {SIG_IGN}, 8) = 0
21345 rt_sigaction(SIGPIPE, {SIG_DFL}, {SIG_IGN}, 8) = 0
21345 write(2, "user insisted too much, dying ba"..., 36) = 36
21345 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
21345 tgkill(21345, 21345, SIGABRT) = 0
21345 --- SIGABRT (Aborted) @ 0 (0) ---
21349 <... nanosleep resumed> 0) = ? ERESTART_RESTARTBLOCK (To be restarted)
21347 <... nanosleep resumed> 0) = ? ERESTART_RESTARTBLOCK (To be restarted)
21348 <... nanosleep resumed> 0) = ? ERESTART_RESTARTBLOCK (To be restarted)
21346 <... nanosleep resumed> 0) = ? ERESTART_RESTARTBLOCK (To be restarted)

The strace output ends here.

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-23 22:43 ` Vincent Lefevre
  2008-05-23 22:45   ` Vincent Lefevre
@ 2008-05-24  2:55   ` Clint Adams
  2008-05-24 12:44     ` Vincent Lefevre
  1 sibling, 1 reply; 24+ messages in thread
From: Clint Adams @ 2008-05-24 2:55 UTC
To: Bart Schaefer, zsh-workers, 482346

On Sat, May 24, 2008 at 12:43:05AM +0200, Vincent Lefevre wrote:
> Same problem with zsh-beta 4.3.6-dev-0+20080506-1.

Just to keep things clear, that has users/12815 applied, and the
original report did not.

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24  2:55 ` Clint Adams
@ 2008-05-24 12:44   ` Vincent Lefevre
  2008-05-24 14:25     ` Peter Stephenson
  2008-05-24 23:40     ` Phil Pennock
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-24 12:44 UTC
To: Bart Schaefer, zsh-workers, 482346

severity 482346 important
thanks

On 2008-05-24 02:55:56 +0000, Clint Adams wrote:
> On Sat, May 24, 2008 at 12:43:05AM +0200, Vincent Lefevre wrote:
> > Same problem with zsh-beta 4.3.6-dev-0+20080506-1.
>
> Just to keep things clear, that has users/12815 applied, and the
> original report did not.

A bit more information: This is 100% reproducible with both zsh and
zsh-beta. It makes the load average go very high (e.g. up to 26) and
ends in a DoS (the news server complains about the load average and
shuts the connection down). The only solution seems to be to reboot
the machine. (That's why I've increased the bug severity to important.)

Note: when I kill zsh, the zombie remains there and gets attached
to init. The load average remains very high.

I do not have such a problem when I start vlc from bash and quit it
in the same way.

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 12:44 ` Vincent Lefevre
@ 2008-05-24 14:25   ` Peter Stephenson
  2008-05-24 15:27     ` Stephane Chazelas
  2008-05-24 23:40   ` Phil Pennock
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Stephenson @ 2008-05-24 14:25 UTC
To: 482346, zsh-workers

On Sat, 24 May 2008 14:44:45 +0200
Vincent Lefevre <vincent@vinc17.org> wrote:
> This is 100% reproducible with both zsh and zsh-beta.

If it's just a matter of starting vlc and trying to kill it for you,
then there's something more to track down since this doesn't happen for
me (Fedora 9).

There must be more to it than simply zsh not waiting for children: that
has no effect on whether vlc is executing code. The propagation of the
signal to the programme seems to me a likely suspect. Of course these
could be related to the same fundamental problem. vlc does appear to do
some fairly odd things with signals which may be interacting in
unexpected ways with the shell.

It would be useful to know whether the vlc job is listed in the shell's
job table at this point, and to see what else is there. The array
jobtab contains the jobs, which contain linked lists of process
structures. One of the jobs (it should be the one indexed by curjob)
should include the vlc process. The previous bug happened because some
other job (marked as defunct) had a process with the same number.

pws

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 14:25 ` Peter Stephenson
@ 2008-05-24 15:27   ` Stephane Chazelas
  2008-05-24 15:58     ` Stephane Chazelas
  2008-05-24 17:41     ` Bart Schaefer
  0 siblings, 2 replies; 24+ messages in thread
From: Stephane Chazelas @ 2008-05-24 15:27 UTC
To: Peter Stephenson; +Cc: 482346, zsh-workers

On Sat, May 24, 2008 at 03:25:04PM +0100, Peter Stephenson wrote:
> On Sat, 24 May 2008 14:44:45 +0200
> Vincent Lefevre <vincent@vinc17.org> wrote:
> > This is 100% reproducible with both zsh and zsh-beta.
>
> If it's just a matter of starting vlc and trying to kill it for you,
> then there's something more to track down since this doesn't happen for
> me (Fedora 9).
[...]

For information, I cannot reproduce it here on a Debian system with the
same versions of the packages as Vincent's, but on x86 (Vincent is on
powerpc according to bugs.debian.org/482346).

From the straces, we see that zsh is not receiving any SIGCHLD.

It may be a good idea to check whether the same can be observed with
other shells that use sigsuspend (pdksh, mksh) instead of wait4/waitpid
(bash, ksh93, posh, ash).

Note that both mksh and posh are meant to derive from pdksh. It
would be interesting to know why posh switched from sigsuspend
to wait4.

-- 
Stéphane

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 15:27 ` Stephane Chazelas
@ 2008-05-24 15:58   ` Stephane Chazelas
  2008-05-24 17:53     ` Clint Adams
  2008-05-24 17:41   ` Bart Schaefer
  1 sibling, 1 reply; 24+ messages in thread
From: Stephane Chazelas @ 2008-05-24 15:58 UTC
To: Peter Stephenson, 482346, zsh-workers

On Sat, May 24, 2008 at 04:27:04PM +0100, Stephane Chazelas wrote:
[...]
> Note that both mksh and posh are meant to derive from pdksh. It
> would be interesting to know why posh switched from sigsuspend
> to wait4.
[...]

I'm under the impression that it happened by accident: the
conf-end.h that would enable sigsuspend is there but is not
included in posh, and there's no mention of the change in the
changelog. Clint may be able to confirm or refute this.

-- 
Stéphane

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 15:58 ` Stephane Chazelas
@ 2008-05-24 17:53   ` Clint Adams
  0 siblings, 0 replies; 24+ messages in thread
From: Clint Adams @ 2008-05-24 17:53 UTC
To: 482346, zsh-workers

On Sat, May 24, 2008 at 04:58:20PM +0100, Stephane Chazelas wrote:
> I'm under the impression that it is by accident/mistake, the
> conf-end.h that would enable the sigsuspend is there but not
> included in posh, and there's no mention of the change in the
> changelog. Clint may confirm/infirm

Yes, that was unintentional, but it needs to be redone anyway.

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 15:27 ` Stephane Chazelas
  2008-05-24 15:58   ` Stephane Chazelas
@ 2008-05-24 17:41   ` Bart Schaefer
  2008-05-24 18:25     ` Stephane Chazelas
  1 sibling, 1 reply; 24+ messages in thread
From: Bart Schaefer @ 2008-05-24 17:41 UTC
To: zsh-workers; +Cc: 482346

On May 24, 4:27pm, Stephane Chazelas wrote:
}
} From the straces, we see that zsh is not receiving any SIGCHLD.

If that were the only problem, then opening another shell window and
performing "kill -CHLD ..." on the original shell should clear it all
up. But as PWS pointed out, the spiking load indicates that either vlc
or zsh is actively doing something, which may have to do with the way
zsh propagates signals, or how it sets the signal masks before forking
vlc in the first place.

If it's zsh that's looping, then Vincent's stack trace indicates it
must be here in zwaitjob:

    while (!errflag && jn->stat &&
           !(jn->stat & STAT_DONE) &&
           !(interact && (jn->stat & STAT_STOPPED))) {
        signal_suspend(SIGCHLD);
        /* job handling stuff elided */
        child_block();
    }

But if *that* were a tight loop, it would mean that signal_suspend()
isn't working. It'd be nice to know what process or processes send
the load so high; 100% CPU usage is one thing, but 26+ processes in
runnable state sounds like another thing entirely.

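[Editorial sketch, not part of the original message.] The wait pattern discussed above (block SIGCHLD, fork, sleep until the signal arrives, then check whether the job is done) can be illustrated roughly as follows. This is Python rather than zsh's C, the name run_and_wait is hypothetical, and it assumes Linux and a call from the main thread. A short timeout is added as a safety net so that a SIGCHLD that never arrives cannot hang the caller, which is exactly the failure mode a bare sigsuspend loop is exposed to.

```python
import os
import signal


def run_and_wait(argv, poll_timeout=0.2):
    """Fork argv, sleep until SIGCHLD (or a timeout), then reap the child.

    Returns the exit status, or the negated signal number if the child
    was killed by a signal.
    """
    # Block SIGCHLD before forking so it stays pending instead of being
    # lost if the child exits before the parent starts waiting.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            # Child: restore the inherited mask, then exec the command.
            signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if exec failed
        while True:
            # Sleep until SIGCHLD is pending; returns None on timeout.
            signal.sigtimedwait({signal.SIGCHLD}, poll_timeout)
            reaped, status = os.waitpid(pid, os.WNOHANG)
            if reaped == pid:
                if os.WIFEXITED(status):
                    return os.WEXITSTATUS(status)
                return -os.WTERMSIG(status)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

For example, run_and_wait(["sh", "-c", "exit 3"]) should return 3. If the kernel never marks the child as reapable, as appears to happen in this report, the loop keeps polling rather than returning, so the sketch does not work around the underlying problem.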
* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 17:41 ` Bart Schaefer
@ 2008-05-24 18:25   ` Stephane Chazelas
  0 siblings, 0 replies; 24+ messages in thread
From: Stephane Chazelas @ 2008-05-24 18:25 UTC
To: Bart Schaefer; +Cc: zsh-workers, 482346

On Sat, May 24, 2008 at 10:41:07AM -0700, Bart Schaefer wrote:
[...]
> But if *that* were a tight loop, it would mean that signal_suspend()
> isn't working. It'd be nice to know what process or processes send
> the load so high; 100% CPU usage is one thing, but 26+ processes in
> runnable state sounds like another thing entirely.

From the strace and from gdb, it would seem that zsh's sigsuspend
doesn't return, so zsh can't be in any kind of loop, unless I missed
something.

Might be some multithreading issue. top -H might help.

-- 
Stéphane

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 12:44 ` Vincent Lefevre
  2008-05-24 14:25   ` Peter Stephenson
@ 2008-05-24 23:40   ` Phil Pennock
  2008-05-25  0:41     ` Vincent Lefevre
  1 sibling, 1 reply; 24+ messages in thread
From: Phil Pennock @ 2008-05-24 23:40 UTC
To: Vincent Lefevre; +Cc: zsh-workers, 482346

On 2008-05-24 at 14:44 +0200, Vincent Lefevre wrote:
> Note: when I kill zsh, the zombie remains there and gets attached
> to init. The load average remains very high.

If the zombie is reparented to init but still stays a zombie, then
there's something more seriously wrong with your system. If init can't
reap its children then it's understandable that zsh might have troubles
too.

Since you're on a rarer architecture that doesn't see so much Linux
kernel debugging, I'd be inclined to look at what has changed in the
kernel's architecture-specific signal handling code. (But see below).

Further, it's strange that zombies are contributing to load average; if
zsh is gone (killed off and no longer even possibly stuck in a tight
loop) and there's the zombie and init left, then there shouldn't be
anything contributing to load avg.

If you use tools such as top(1), what processes are they attributing the
load to? Is the high load average confirmed by vmstat reports of idle
CPU, or is the load avg really out of sync with CPU reality?

Linux is unusual in counting processes blocked on storage IO towards the
load average, so if the problem is something like a flaky disk
underneath the root filesystem, that might be complicating your problem.

-Phil

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-24 23:40 ` Phil Pennock
@ 2008-05-25  0:41   ` Vincent Lefevre
  2008-05-25  1:23     ` Phil Pennock
  2008-05-25 10:08     ` Stephane Chazelas
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-25 0:41 UTC
To: zsh-workers; +Cc: 482346

On 2008-05-24 16:40:02 -0700, Phil Pennock wrote:
> Since you're on a rarer architecture that doesn't see so much Linux
> kernel debugging, I'd be inclined to look at what has changed in the
> kernel's architecture-specific signal handling code. (But see below).

I don't know if the bug is new. I've never killed vlc in such a way
in the past. The current kernel is linux-image-2.6.24-1-powerpc
2.6.24-7.

> Further, it's strange that zombies are contributing to load average; if
> zsh is gone (killed off and no longer even possibly stuck in a tight
> loop) and there's the zombie and init left, then there shouldn't be
> anything contributing to load avg.

Yes, that's strange.

> If you use tools such as top(1), what processes are they attributing the
> load to?

None, and the CPU is idle:

top - 02:31:55 up 22:19,  6 users,  load average: 4.09, 4.24, 4.25
Tasks: 117 total,   1 running, 115 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.7%us,  0.7%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.3%hi,  0.0%si,  0.0%st
Mem:    255432k total,   233456k used,    21976k free,     3004k buffers
Swap:   524280k total,    22820k used,   501460k free,   105548k cached

(The zombie is vlc.)

> Is the high load average confirmed by vmstat reports of idle
> CPU, or is the load avg really out of sync with CPU reality?

Here's vmstat output:

ay:~> vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  22820  26456   5520  99688    0    0    15    35   30  196  8  1 88  3

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25  0:41 ` Vincent Lefevre
@ 2008-05-25  1:23   ` Phil Pennock
  2008-05-25 21:37     ` Vincent Lefevre
  2008-05-25 10:08   ` Stephane Chazelas
  1 sibling, 1 reply; 24+ messages in thread
From: Phil Pennock @ 2008-05-25 1:23 UTC
To: zsh-workers, 482346

On 2008-05-25 at 02:41 +0200, Vincent Lefevre wrote:
> None, and the CPU is idle:

Then I'd be inclined to start looking into hardware issues, since
_something's_ probably getting stuck in disk IO; I'll suspect that
before kernel bugs, but it might also be worth seeing if there are other
problems with threaded programs on powerpc, if init really can't reap
something that has already become a zombie.

> Here's vmstat output:

First line of vmstat is average since system boot, you need to do
something like "vmstat 1", ignore the first line, and look at what's
happening at the current time.

-Phil

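[Editorial sketch, not part of the original message.] The advice above (run "vmstat 1" and ignore the first sample, which is an average since boot) can be illustrated with a small parser. This is a minimal sketch; the function name current_samples is hypothetical, and it assumes the classic procps column header starting with "r b swpd ..." as seen elsewhere in this thread.

```python
def current_samples(vmstat_text):
    """Parse `vmstat N` output and drop the since-boot first sample.

    Returns a list of dicts mapping column names (r, b, swpd, ...) to
    integers, one per sample after the first.
    """
    header = None
    rows = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column header line looks like: r b swpd free buff cache ...
        if fields[0] == "r" and "swpd" in fields:
            header = fields
            continue
        # Data rows are all-numeric and as wide as the header.
        if header and len(fields) == len(header) and all(
            f.isdigit() for f in fields
        ):
            rows.append(dict(zip(header, map(int, fields))))
    # The first data row is the average since boot, not current activity.
    return rows[1:]
```

Feeding it the text captured from something like "vmstat 1 10" and looking at the "id" and "wa" values of the returned samples gives the current idle and I/O-wait percentages, rather than the boot-time averages.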
* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25  1:23 ` Phil Pennock
@ 2008-05-25 21:37   ` Vincent Lefevre
  2008-05-25 22:26     ` Phil Pennock
  0 siblings, 1 reply; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-25 21:37 UTC
To: zsh-workers, 482346

On 2008-05-24 18:23:21 -0700, Phil Pennock wrote:
> Then I'd be inclined to start looking into hardware issues, since
> _something's_ probably getting stuck in disk IO; I'll suspect that
> before kernel bugs, but it might also be worth seeing if there are other
> problems with threaded programs on powerpc, if init really can't reap
> something that has already become a zombie.

I've looked at /var/log/kern.log and there's something each time
I interrupted vlc, e.g.

May 24 14:33:36 ay kernel: Unable to handle kernel paging request for data at address 0x481e7000
May 24 14:33:36 ay kernel: Faulting instruction address: 0xc00131e8
May 24 14:33:36 ay kernel: Oops: Kernel access of bad area, sig: 11 [#1]
May 24 14:33:36 ay kernel: PowerMac
May 24 14:33:36 ay kernel: Modules linked in: snd_powermac snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd snd_page_alloc soundcore xt_multiport iptable_filter ip_tables x_tables ipv6 ide_cd cdrom sungem sungem_phy firewire_ohci firewire_core crc_itu_t yenta_socket rsrc_nonstatic pcmcia_core uninorth_agp agpgart sd_mod scsi_mod dm_snapshot dm_mirror dm_mod ext3 jbd mbcache ide_disk evdev i2c_powermac windfarm_core
May 24 14:33:36 ay kernel: NIP: c00131e8 LR: c0017780 CTR: 00000080
May 24 14:33:36 ay kernel: REGS: c2769b60 TRAP: 0300 Not tainted (2.6.24-1-powerpc)
May 24 14:33:36 ay kernel: MSR: 00009032 <EE,ME,IR,DR> CR: 24004422 XER: 00000000
May 24 14:33:36 ay kernel: DAR: 481e7000, DSISR: 40000000
May 24 14:33:36 ay kernel: TASK = c2d7ca80[21850] 'vlc' THREAD: c2768000
May 24 14:33:36 ay kernel: GPR00: c3356060 c2769c10 c2d7ca80 481e7000 00000080 0c57a181 481e7000 00000000
May 24 14:33:36 ay kernel: GPR08: 0c57a181 c3356060 00000000 c03fe000 44004422 1001a728 bfffffff cf3d2140
May 24 14:33:36 ay kernel: GPR16: 0000000d c3356060 00000030 00000000 c2769ccc c272e480 c3356060 00000001
May 24 14:33:36 ay kernel: GPR24: 00000000 481e7000 0000079c c245e860 c245e860 0c57a181 481e7000 c0588f40
May 24 14:33:36 ay kernel: NIP [c00131e8] __flush_dcache_icache+0x14/0x40
May 24 14:33:36 ay kernel: LR [c0017780] update_mmu_cache+0x84/0x108
May 24 14:33:36 ay kernel: Call Trace:
May 24 14:33:36 ay kernel: [c2769c10] [481e7000] 0x481e7000 (unreliable)
May 24 14:33:36 ay kernel: [c2769c30] [c0085888] handle_mm_fault+0xc70/0xd70
May 24 14:33:36 ay kernel: [c2769c70] [c0085d48] get_user_pages+0x3c0/0x4d0
May 24 14:33:36 ay kernel: [c2769cc0] [c00d008c] elf_core_dump+0xa28/0xce4
May 24 14:33:36 ay kernel: [c2769d60] [c009fe28] do_coredump+0x664/0x6cc
May 24 14:33:36 ay kernel: [c2769e50] [c003df64] get_signal_to_deliver+0x390/0x3c0
May 24 14:33:36 ay kernel: [c2769e80] [c00096ac] do_signal+0x50/0x268
May 24 14:33:36 ay kernel: [c2769f40] [c0013ef4] do_user_signal+0x7c/0xcc
May 24 14:33:36 ay kernel: --- Exception: c00 at 0xfbcce5c
May 24 14:33:36 ay kernel: LR = 0xfbceaf4
May 24 14:33:36 ay kernel: Instruction dump:
May 24 14:33:36 ay kernel: 4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 60000000
May 24 14:33:36 ay kernel: 54630026 38800080 7c8903a6 7c661b78 <7c00186c> 38630020 4200fff8 7c0004ac
May 24 14:33:36 ay kernel: ---[ end trace 6343c960c4d55920 ]---
May 24 14:33:36 ay kernel: note: vlc[21850] exited with preempt_count 1

> > Here's vmstat output:
>
> First line of vmstat is average since system boot, you need to do
> something like "vmstat 1", ignore the first line, and look at what's
> happening at the current time.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy  id wa
 2  0 120044  17616  54288 124632    4    4    33    44   32  261  8  1  87  4
 0  0 120044  17616  54288 124632    0    0     0     0   24  299  0  0 100  0
 0  0 120044  17616  54288 124632    0    0     0     0   25  303  0  0 100  0
 0  0 120044  17616  54288 124632    0    0     0     0   24  298  0  0 100  0
 0  0 120044  17616  54288 124632    0    0     0     0   24  290  0  0 100  0
 0  0 120044  17376  54288 124632    0    0     0     0   23  300 73  7  20  0
 0  0 120044  17436  54296 124632    0    0     0    32   27  297  0  0  99  1
 1  0 120044  17376  54296 124604    0    0     0     0   24  298 11  0  89  0
 0  0 120044  17616  54296 124632    0    0     0     0   23  299 51  2  47  0
 0  0 120044  17616  54296 124632    0    0     0     0   24  292  0  0 100  0
 0  0 120044  17556  54300 124632    0    0     0   340   65  365  0  0  80 20

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25 21:37 ` Vincent Lefevre
@ 2008-05-25 22:26   ` Phil Pennock
  2008-05-25 23:34     ` Vincent Lefevre
  0 siblings, 1 reply; 24+ messages in thread
From: Phil Pennock @ 2008-05-25 22:26 UTC
To: zsh-workers, 482346

On 2008-05-25 at 23:37 +0200, Vincent Lefevre wrote:
> On 2008-05-24 18:23:21 -0700, Phil Pennock wrote:
> > Then I'd be inclined to start looking into hardware issues, since
> > _something's_ probably getting stuck in disk IO; I'll suspect that
> > before kernel bugs, but it might also be worth seeing if there are other
> > problems with threaded programs on powerpc, if init really can't reap
> > something that has already become a zombie.
>
> I've looked at /var/log/kern.log and there's something each time
> I interrupted vlc, e.g.
>
> May 24 14:33:36 ay kernel: Unable to handle kernel paging request for data at address 0x481e7000
> May 24 14:33:36 ay kernel: Faulting instruction address: 0xc00131e8
> May 24 14:33:36 ay kernel: Oops: Kernel access of bad area, sig: 11 [#1]

That's a segfault; the kernel's then oopsing whilst trying to page in
memory to write the coredump; looks like a problem in the MMU logic for
the powerpc.

So, the problems are:
 * vlc is segfaulting when it receives SIGINT;
 * the powerpc Linux kernel has a bug whereby it's ending up not letting
   the parent wait on it (from what I understand of the details so far)
   in some cases, so it looks like the process isn't actually ending and
   transitioning to zombie status; it might be worth talking to the
   architecture maintainers for your distribution, to see about known
   issues; note that even init is unable to reclaim these processes;
   have you tried sending a SIGKILL to force-exit the vlc, to see if
   either zsh or init can reap the process then?
 * zsh is somehow tickling the kernel bug and it might be worth having
   configure logic to deal with this, even after the problem is fixed,
   once we know what it is that's tickling this.

> May 24 14:33:36 ay kernel: note: vlc[21850] exited with preempt_count 1

My nasty suspicious mind thinks that special kernel logic for handling a
weird exit condition, and logging it, is less tested code that's already
doing something different to the default, so this is likely close to the
root cause; no powerpc available for me to test, though.

It seems unlikely that there'd be enough bugs to also have a zombie
contributing to load average, so I suspect that the process has not in
fact exited yet, it's still running, that's where the load comes from.
Does ps(1) actually show the 'Z' for zombie?

-Phil

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
  2008-05-25 22:26 ` Phil Pennock
@ 2008-05-25 23:34   ` Vincent Lefevre
  2008-05-25 23:43     ` Vincent Lefevre
  2008-05-26  0:31     ` Bart Schaefer
  0 siblings, 2 replies; 24+ messages in thread
From: Vincent Lefevre @ 2008-05-25 23:34 UTC
To: zsh-workers, 482346

On 2008-05-25 15:26:58 -0700, Phil Pennock wrote:
> have you tried sending a SIGKILL to force-exit the vlc, to see if
> either zsh or init can reap the process then?

This has no effect.

> Does ps(1) actually show the 'Z' for zombie?

Yes (BTW, this was in my second message).

Now, I could reproduce the problem with both bash and pdksh.
I don't know why I couldn't reproduce it the first times while it is
100% reproducible under zsh (perhaps timings are a bit different).

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
From: Vincent Lefevre @ 2008-05-25 23:43 UTC
To: zsh-workers, 482346

reassign 482346 linux-image-2.6.24-1-powerpc
retitle 482346 kernel 2.6.24-1-powerpc oops after vlc segfaults
thanks

because it became clear that zsh has nothing to do with this bug.

I forgot to say that the screen brightness is set to 100% (the machine
is an early PowerBook G4) when the kernel oops occurs.

I suppose that next time zsh-workers could be dropped from To|Cc.
Those interested in this bug could subscribe to bug 482346. See:

  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=482346

--
Vincent Lefèvre <vincent@vinc17.org>

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
From: Bart Schaefer @ 2008-05-26 0:31 UTC
To: 482346, zsh-workers

On May 26, 1:34am, Vincent Lefevre wrote:
}
} Now, I could reproduce the problem with both bash and pdksh.
} I don't know why I couldn't reproduce it the first times while it is
} 100% reproducible under zsh (perhaps timings are a bit different).

It's possible that zsh is passing a larger *envp or some similar
difference, that makes it more likely the bug is tickled.

* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
From: Stephane Chazelas @ 2008-05-25 10:08 UTC
To: zsh-workers, 482346

On Sun, May 25, 2008 at 02:41:01AM +0200, Vincent Lefevre wrote:
> On 2008-05-24 16:40:02 -0700, Phil Pennock wrote:
> > Since you're on a rarer architecture that doesn't see so much Linux
> > kernel debugging, I'd be inclined to look at what has changed in the
> > kernel's architecture-specific signal handling code. (But see below).
>
> I don't know if the bug is new. I've never killed vlc in such a way
> in the past. The current kernel is linux-image-2.6.24-1-powerpc
> 2.6.24-7.
>
> > Further, it's strange that zombies are contributing to load average; if
> > zsh is gone (killed off and no longer even possibly stuck in a tight
> > loop) and there's the zombie and init left, then there shouldn't be
> > anything contributing to load avg.
>
> Yes, that's strange.

Now, what happens if the main thread in a multithread process
dies and other threads remain?

Anyway one process can at most contribute 1 point to the load
average, so there must be more, if they don't show in top, that
could be either that they are created and destroyed too fast or
that they are threads.

zombies are not processes so cannot contribute to load average.

>
> > If you use tools such as top(1), what processes are they attributing the
> > load to?
>
> None, and the CPU is idle:
>
> top - 02:31:55 up 22:19, 6 users, load average: 4.09, 4.24, 4.25
> Tasks: 117 total, 1 running, 115 sleeping, 0 stopped, 1 zombie

Could you try top -H (to see threads) and the same thing with
pdksh? I'd bet for a kernel and/or pthread and/or vlc issue.

--
Stéphane

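The thread hypothesis above can also be checked numerically. On Linux the load average counts tasks that are runnable (state R) or in uninterruptible sleep (state D), and each thread is a task of its own, so leftover threads stuck in state D would explain a load of about 4 with an idle CPU. A minimal sketch, assuming a procps ps that supports -L to list threads; count_load_tasks is a hypothetical name chosen for this example.

```shell
# count_load_tasks: hypothetical helper for illustration.
# Reads one STAT value per line on stdin (first line is the header) and
# prints how many entries are in state R or D, the two states that the
# Linux load average is computed from.
count_load_tasks() {
    awk 'NR > 1 && $1 ~ /^[RD]/ { n++ } END { print n + 0 }'
}
```

With `ps -eLo stat | count_load_tasks` every thread is counted separately. Comparing that figure with the first field of /proc/loadavg gives a quick indication of whether invisible threads account for the reported load.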
* Re: Bug#482346: zsh doesn't always wait for its children (-> zombie)
From: Vincent Lefevre @ 2008-05-25 21:54 UTC
To: zsh-workers, 482346

On 2008-05-25 11:08:44 +0100, Stephane Chazelas wrote:
> zombies are not processes so cannot contribute to load average.

Except in case of kernel bugs, I suppose.

> Could you try top -H (to see threads)

I don't see anything special.

> and the same thing with pdksh?

Just like with bash, no problem when I start vlc from pdksh and
interrupt it in the same way. No kernel oops in /var/log/kern.log.

--
Vincent Lefèvre <vincent@vinc17.org>
