On Tue, Jun 19, 2007 at 10:07:51PM +0300, Alex Efros wrote:
> On Tue, Jun 19, 2007 at 06:13:25PM +0000, Gerrit Pape wrote:
> > Hi Alex, after checking the code, I currently cannot say that or how
> > runit could fail reaping zombies that detached and re-parented to pid 1.
> > On Linux running strace on pid 1 isn't supported AFAIK. To be sure that
> > runit is at fault, can you please check the kernel versions on your two
> > machines, can it be that they have changed at the time the problem
> > popped up? Does upgrading the Linux kernel to a more recent version
> > change anything?
>
> Yeah, I also think this may be kernel<->runit bug. But this bug happens
> previously on 2.6.16, and now I've it on 2.6.20 (upgraded few days ago).
>
> Moreover, looks like runit not just 'stop reaping zombies' at some point.
> Looks like it continue reaping them, but not all of them. Look:
[...]
> This server generate a lot of short-living processes every minute, i.e. it
> generate new processes much faster than new non-reaped zombies arise.
>
> I may be wrong, but I've a feeling on 2.6.16 new non-reaped zombies arise
> much faster - in few hours I got 8192 use processes and was forced to reboot.
>
> Maybe kernel just don't send SIGCHLD in some situation? Maybe some race
> condition in runit or kernel? Maybe if runit will try to do waitpid()
> every few seconds this will solve issue?

That could solve the issue, yes, but runit should know when to do
waitpid(), and I would like to find out why that goes wrong. I tried to
reproduce the problem on Debian/unstable ppc with Linux 2.6.20.7, but
failed. Does this cause a problem on your system?:

  # cat >test.c <<EOT
  #include <unistd.h>  /* fork, daemon, write, sleep, _exit */

  int main () {
    int pid, i;

    for (i =0; i < 8193; ++i) {
      pid =fork();
      if (pid == -1) {
        /* fork failed, e.g. process limit reached */
        write(1, "f\n", 2);
      }
      if (!pid) {
        /* child: daemon() forks again and the intermediate process exits,
           so the process continuing here is re-parented to pid 1 */
        daemon(0, 0);
        // sleep(1);
        _exit(0);  /* exits right away; pid 1 has to reap it */
      }
    }
    sleep(14);
    _exit(0);
  }
  EOT
  # gcc test.c
  # ./a.out

If not, can you provide the service daemon that produced this number of
detached short-living processes?
And I have another patch to try attached.

Thanks, Gerrit.