supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit
* runit not collecting zombies
@ 2007-05-24 23:07 Radek Podgorny
  2007-05-26 10:35 ` Alex Efros
                   ` (3 more replies)
  0 siblings, 4 replies; 113+ messages in thread
From: Radek Podgorny @ 2007-05-24 23:07 UTC (permalink / raw)
  To: supervision

Hi,

I'm experiencing a weird behaviour where runit (1.5.0) does not collect
zombies. AFAIK the init (pid=1) process should take care of these and
make them quit "correctly". I know that when zombies show up, the
initiating application is to blame but in the real world, I need a
stable server where I don't run out of PIDs after just a few hours :-(.
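
As background for the expectation above: a process that has exited stays in the process table as a zombie until its parent collects its exit status, and orphaned processes are handed to pid 1, which is why init is expected to reap them. A minimal sketch of such a reaping loop in C, using only POSIX `waitpid()`; the function name `reap_all` is made up for illustration and this is not runit's actual code:

```c
#define _POSIX_C_SOURCE 200809L  /* ask libc headers for the POSIX.1-2008 API */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Collect every child that has already exited, without blocking.
   Returns the number of zombies that were reaped. */
int reap_all(void) {
  int status, reaped = 0;
  pid_t pid;

  /* WNOHANG: return 0 instead of blocking when no child has exited yet */
  while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
    ++reaped;
  /* 0: children exist but are still running; -1: error, usually ECHILD */
  assert(pid == 0 || pid == -1);
  return reaped;
}
```

A process running as pid 1 would call something like this whenever it learns that a child has exited, for example after SIGCHLD.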

Someone else was having the same problem with lighttpd. See the text and 
patch at http://trac.lighttpd.net/trac/ticket/978 to get more info.

Sincerely
Radek Podgorny


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-05-24 23:07 runit not collecting zombies Radek Podgorny
@ 2007-05-26 10:35 ` Alex Efros
  2007-05-26 10:45   ` Alex Efros
                     ` (3 more replies)
  2008-02-25  7:25 ` Alex Efros
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 113+ messages in thread
From: Alex Efros @ 2007-05-26 10:35 UTC (permalink / raw)
  To: supervision

Hi!

On Fri, May 25, 2007 at 01:07:42AM +0200, Radek Podgorny wrote:
> I'm experiencing a weird behaviour where runit (1.5.0) does not collect
> zombies. AFAIK the init (pid=1) process should take care of these and
> make them quit "correctly". I know that when zombies show up, the
> initiating application is to blame but in the real world, I need a
> stable server where I don't run out of PIDs after just a few hours :-(.

I've just hit the same issue. This is the first time I've seen runit fail to
collect zombies. Server uptime is 28 days, runit 1.5.0 (upgraded from 1.4.1 on
21 Apr). The server generates a huge number of short-lived processes, so maybe
an integer overflow in runit or something similar causes this issue.

Right now I have 8114 zombies, and the user that runs all these scripts is now
unable to start new processes:
    bash: fork: Resource temporarily unavailable

Looks like I'll have to reboot to get rid of this issue for a while. :-(
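
The `fork: Resource temporarily unavailable` message is the text for `EAGAIN`, which `fork()` returns when a process limit is hit; unreaped zombies still occupy process-table slots, so they count toward that limit. A small sketch of how a program might report this condition together with the per-user limit. The helper names `try_fork` and `report_fork_error` are hypothetical, and `RLIMIT_NPROC` is a Linux/BSD resource limit rather than a POSIX one:

```c
#define _POSIX_C_SOURCE 200809L  /* ask libc headers for the POSIX.1-2008 API */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork one short-lived child and reap it.
   Returns 0 on success, or the errno value of a failed fork(). */
int try_fork(void) {
  pid_t pid = fork();

  if (pid == -1) return errno;
  if (pid == 0) _exit(0);                      /* child: exit at once */
  while (waitpid(pid, 0, 0) == -1 && errno == EINTR) {}
  return 0;
}

/* Print why fork() failed; for EAGAIN also show the per-user process limit. */
void report_fork_error(int err) {
  struct rlimit lim;

  assert(err > 0);
  fprintf(stderr, "fork: %s\n", strerror(err));
  if (err == EAGAIN && getrlimit(RLIMIT_NPROC, &lim) == 0) {
    if (lim.rlim_cur == RLIM_INFINITY)
      fprintf(stderr, "RLIMIT_NPROC soft limit: unlimited\n");
    else
      fprintf(stderr, "RLIMIT_NPROC soft limit: %llu\n",
              (unsigned long long)lim.rlim_cur);
  }
}
```

Typical use would be `int err = try_fork(); if (err) report_fork_error(err);`.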

I've no idea how to provide more information; strace doesn't work:
    # strace -p 1
    attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
So all I can provide is partial `ps axf` output:


  PID TTY      STAT   TIME COMMAND
    1 ?        T      2:59 runit
    2 ?        SN     0:00 [ksoftirqd/0]
    3 ?        S<     0:00 [events/0]
    4 ?        S<     0:00 [khelper]
    5 ?        S<     0:00 [kthread]
    7 ?        S<     0:34  \_ [kblockd/0]
   61 ?        S<     0:00  \_ [aio/0]
  132 ?        S<     0:00  \_ [kseriod]
  161 ?        S<     0:00  \_ [kpsmoused]
  174 ?        S<     0:01  \_ [reiserfs/0]
16498 ?        S      0:08  \_ [pdflush]
26144 ?        S      0:03  \_ [pdflush]
   60 ?        S      1:53 [kswapd0]
 1091 ?        Ss     0:00 /bin/sh /etc/runit/2
24888 ?        S      0:00  \_ runsvdir /var/service log: ......................
...
17454 ?        ZN     0:00 [SpiderAuto] <defunct>
 5187 ?        ZN     0:00 [SpiderAuto] <defunct>
22364 ?        ZN     0:00 [SpiderAuto] <defunct>
18907 ?        ZN     0:00 [SpiderAuto] <defunct>
...

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-05-26 10:35 ` Alex Efros
@ 2007-05-26 10:45   ` Alex Efros
  2007-05-26 12:55   ` Charlie Brady
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-05-26 10:45 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, May 26, 2007 at 01:35:17PM +0300, Alex Efros wrote:
> an integer overflow in runit or something similar causes this issue.

After a reboot it looks like runit collects zombies again:

$ ps axf | grep Z
13913 ?        Z      0:00 [sh] <defunct>
 3227 ?        Z      0:00 [sh] <defunct>
10764 ?        Z      0:00 [sh] <defunct>
11112 pts/4    S+     0:00      \_ grep --colour=auto Z
$ ps axf | grep Z
 2201 pts/4    S+     0:00      \_ grep --colour=auto Z

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-05-26 10:35 ` Alex Efros
  2007-05-26 10:45   ` Alex Efros
@ 2007-05-26 12:55   ` Charlie Brady
  2007-05-26 13:03     ` Alex Efros
  2007-05-26 17:01   ` Paul Jarc
  2007-06-03 11:10   ` Gerrit Pape
  3 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-05-26 12:55 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 26 May 2007, Alex Efros wrote:

> I've no idea how to provide more information, ...

Post the run script of the service which is creating zombies.

> 24888 ?        S      0:00  \_ runsvdir /var/service log: ......................
> ...
> 17454 ?        ZN     0:00 [SpiderAuto] <defunct>
> 5187 ?        ZN     0:00 [SpiderAuto] <defunct>
> 22364 ?        ZN     0:00 [SpiderAuto] <defunct>
> 18907 ?        ZN     0:00 [SpiderAuto] <defunct>
> ...

What does "sv status SpiderAuto" (or whatever your service name which 
starts SpiderAuto is) say?

Do you have a reason for believing that it is runit which is creating and 
not reaping zombies rather than a specific service daemon (e.g. 
SpiderAuto)?



* Re: runit not collecting zombies
  2007-05-26 12:55   ` Charlie Brady
@ 2007-05-26 13:03     ` Alex Efros
  0 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-05-26 13:03 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, May 26, 2007 at 08:55:33AM -0400, Charlie Brady wrote:
> Post the run script of the service which is creating zombies.
 
SpiderAuto isn't a service. It's just a perl script which is executed
by cron.

> Do you have a reason for believing that it is runit which is creating and 
> not reaping zombies rather than a specific service daemon (e.g. 
> SpiderAuto)?

SpiderAuto generates zombies by design. It works this way:
1) the main script is started from cron every minute
2) the main script analyzes the current situation: it finds hung worker
   processes and kills them, then finds the current tasks in the queue and
   starts several worker processes to do these tasks
3) the main script exits (and so leaves several child/worker processes
   without a parent)
Worker processes may run for several minutes.
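
The three steps above can be sketched in C: a launcher starts a worker and exits without waiting for it, so the worker is orphaned and handed to another process (traditionally pid 1; on newer Linux kernels possibly a designated "subreaper"). The function name `orphan_is_reparented` and the pipe-based reporting are assumptions made for this illustration, not part of SpiderAuto, which is a perl script:

```c
#define _POSIX_C_SOURCE 200809L  /* ask libc headers for the POSIX.1-2008 API */
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a worker whose parent exited got a new parent,
   0 if it did not within about 5 seconds, -1 on error. */
int orphan_is_reparented(void) {
  int fds[2];
  char result = 0;
  pid_t launcher;

  if (pipe(fds) == -1) return -1;
  launcher = fork();
  if (launcher == -1) { close(fds[0]); close(fds[1]); return -1; }
  if (launcher == 0) {
    /* "main script": start one worker, then exit without waiting for it */
    pid_t me = getpid();
    if (fork() == 0) {
      int tries;
      close(fds[0]);
      /* worker: wait until the kernel has assigned a new parent */
      for (tries = 0; tries < 5000 && getppid() == me; ++tries)
        poll(0, 0, 1);                         /* sleep for 1 ms */
      result = (getppid() != me) ? '1' : '0';
      if (write(fds[1], &result, 1) != 1) _exit(1);
      _exit(0);
    }
    _exit(0);
  }
  assert(launcher > 0);
  close(fds[1]);
  waitpid(launcher, 0, 0);                     /* the launcher itself is reaped here */
  if (read(fds[0], &result, 1) != 1) result = 0;
  close(fds[0]);
  return result == '1' ? 1 : (result == '0' ? 0 : -1);
}
```

When the worker finally exits, it becomes a zombie that only its new parent can reap, which is the situation described in this thread.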

I don't think this architecture is good, but it has worked for more than 4
years without trouble, and I have never seen zombie processes because runit
collected them.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-05-26 10:35 ` Alex Efros
  2007-05-26 10:45   ` Alex Efros
  2007-05-26 12:55   ` Charlie Brady
@ 2007-05-26 17:01   ` Paul Jarc
  2007-06-02 14:55     ` Alex Efros
  2007-06-03 11:10   ` Gerrit Pape
  3 siblings, 1 reply; 113+ messages in thread
From: Paul Jarc @ 2007-05-26 17:01 UTC (permalink / raw)
  To: supervision

Alex Efros <powerman@powerman.asdfGroup.com> wrote:
> So all I can provide is partial `ps axf` output:

Check "ps -ef" to verify that these zombies have 1 as their PPID.
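
On Linux the same check can be made programmatically by reading `/proc/<pid>/stat`, where the fields after the parenthesised command name are the state letter and the parent pid. A minimal sketch under that assumption; `parse_stat` and `count_zombies_with_ppid` are hypothetical helper names, not part of ps or runit:

```c
#define _POSIX_C_SOURCE 200809L  /* ask libc headers for the POSIX.1-2008 API */
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Extract the state letter and parent pid from one /proc/<pid>/stat line.
   Returns 0 on success, -1 if the line cannot be parsed. */
int parse_stat(const char *line, char *state, int *ppid) {
  const char *p;

  assert(line && state && ppid);
  /* the command name may contain spaces or ')', so start after the last ')' */
  p = strrchr(line, ')');
  if (!p) return -1;
  return sscanf(p + 1, " %c %d", state, ppid) == 2 ? 0 : -1;
}

/* Count processes in state 'Z' whose parent is ppid; -1 if /proc is unreadable. */
int count_zombies_with_ppid(pid_t ppid) {
  DIR *proc = opendir("/proc");
  struct dirent *e;
  int count = 0;

  if (!proc) return -1;
  while ((e = readdir(proc)) != NULL) {
    char path[300], buf[512], state;
    int parent;
    size_t n;
    FILE *f;

    if (!isdigit((unsigned char)e->d_name[0])) continue;  /* not a pid entry */
    snprintf(path, sizeof path, "/proc/%s/stat", e->d_name);
    f = fopen(path, "r");
    if (!f) continue;                          /* process may have gone away */
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    if (parse_stat(buf, &state, &parent) == 0 &&
        state == 'Z' && parent == (int)ppid)
      ++count;
  }
  closedir(proc);
  return count;
}
```

Calling `count_zombies_with_ppid(1)` would then give the number of zombies currently waiting for pid 1.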


paul



* Re: runit not collecting zombies
  2007-05-26 17:01   ` Paul Jarc
@ 2007-06-02 14:55     ` Alex Efros
  0 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-06-02 14:55 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, May 26, 2007 at 01:01:14PM -0400, Paul Jarc wrote:
> Alex Efros <powerman@powerman.asdfGroup.com> wrote:
> > So all I can provide is partial `ps axf` output:
> 
> Check "ps -ef" to verify that these zombies have 1 as their PPID.

It just happened again!! :-(

And yeah, all these processes have PPID=1:

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 May26 ?        00:00:40 runit
...
bets     16454     1  0 04:14 ?        00:00:00 [SpiderAuto] <defunct>
bets     22692     1  0 04:15 ?        00:00:00 [SpiderAuto] <defunct>
bets      2027     1  0 04:15 ?        00:00:00 [SpiderAuto] <defunct>
bets     17471     1  0 04:15 ?        00:00:00 [SpiderAuto] <defunct>
...
bets     21649     1  0 09:25 ?        00:00:01 [SpiderAuto] <defunct>
bets     22188     1  0 09:25 ?        00:00:00 [SpiderAuto] <defunct>
ebook     4304     1  0 09:46 ?        00:00:00 [chpst] <defunct>
ebook    30650     1  0 09:51 ?        00:00:00 [chpst] <defunct>
...
ebook     5492     1  0 12:46 ?        00:00:00 [chpst] <defunct>
sshd     20961     1  0 13:01 ?        00:00:00 [sshd] <defunct>
sshd     20915     1  0 13:01 ?        00:00:00 [sshd] <defunct>
sshd      4653     1  0 13:01 ?        00:00:00 [sshd] <defunct>
...
sshd     18475     1  0 13:05 ?        00:00:00 [sshd] <defunct>
sshd     18954     1  0 13:05 ?        00:00:00 [sshd] <defunct>
ebook    13994     1  0 13:11 ?        00:00:00 [chpst] <defunct>
ebook    27178     1  0 13:31 ?        00:00:00 [chpst] <defunct>
...

I've no idea what to do... rebooting the server every week isn't a good
idea... As for rolling back to runit-1.4.1, I'm not sure it will work well
with Linux kernel 2.6.20 (I remember there was some discussion about issues
with runit and newer Linux kernels which were fixed in 1.5.0, if I remember
correctly).

Maybe somebody can give me instructions on how to debug this issue
next time it happens (strace can't attach to runit)?


P.S. Not sure whether it's important, but the kernel currently in use is
2.6.16, and I want to upgrade to 2.6.20 soon.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-05-26 10:35 ` Alex Efros
                     ` (2 preceding siblings ...)
  2007-05-26 17:01   ` Paul Jarc
@ 2007-06-03 11:10   ` Gerrit Pape
  2007-06-03 14:33     ` Alex Efros
  2007-06-11 13:11     ` Alex Efros
  3 siblings, 2 replies; 113+ messages in thread
From: Gerrit Pape @ 2007-06-03 11:10 UTC (permalink / raw)
  To: supervision

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Sat, May 26, 2007 at 01:35:17PM +0300, Alex Efros wrote:
> On Fri, May 25, 2007 at 01:07:42AM +0200, Radek Podgorny wrote:
> > I'm experiencing a weird behaviour where runit (1.5.0) does not collect
> > zombies. AFAIK the init (pid=1) process should take care of these and
> > make them quit "correctly". I know that when zombies show up, the
> > initiating application is to blame but in the real world, I need a
> > stable server where I don't run out of PIDs after just a few hours :-(.
> 
> I've just hit the same issue. This is the first time I've seen runit fail to
> collect zombies. Server uptime is 28 days, runit 1.5.0 (upgraded from 1.4.1 on
> 21 Apr). The server generates a huge number of short-lived processes, so maybe
> an integer overflow in runit or something similar causes this issue.
> 
> Right now I have 8114 zombies, and the user that runs all these scripts is now
> unable to start new processes:
>     bash: fork: Resource temporarily unavailable
> 
> Looks like I'll have to reboot to get rid of this issue for a while. :-(

Hi Alex, the runit program didn't change from 1.4.1 to 1.5.0 or 1.5.1,
so downgrading should not help.  I'm not yet completely sure what the
actual problem is, but have an idea.  Can you please test the patch
below?:

 cd /package/admin/runit
 patch -p0 </the/attached/diff
 package/install
 ln /sbin/runit /sbin/runit.old
 install -m0755 command/runit /sbin/runit
 reboot

Thanks, Gerrit.

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 628 bytes --]

Index: src/runit.c
===================================================================
RCS file: /cvs/runit/src/runit.c,v
retrieving revision 1.14
diff -u -r1.14 runit.c
--- src/runit.c	21 Nov 2006 15:09:18 -0000	1.14
+++ src/runit.c	3 Jun 2007 10:59:19 -0000
@@ -157,8 +157,9 @@
       sig_block(sig_child);
       sig_block(sig_int);
       
-      read(selfpipe[0], &ch, 1);
-      child =wait_nohang(&wstat);
+      while (read(selfpipe[0], &ch, 1) == 1) {}
+      while ((child =wait_nohang(&wstat)) > 0)
+        if (child == pid) break;
 
       /* reget stderr */
       if ((ttyfd =open_write("/dev/console")) != -1) {


* Re: runit not collecting zombies
  2007-06-03 11:10   ` Gerrit Pape
@ 2007-06-03 14:33     ` Alex Efros
  2007-06-03 16:31       ` Gerrit Pape
  2007-06-11 13:11     ` Alex Efros
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-06-03 14:33 UTC (permalink / raw)
  To: supervision

Hi!

On Sun, Jun 03, 2007 at 11:10:56AM +0000, Gerrit Pape wrote:
> Hi Alex, the runit program didn't change from 1.4.1 to 1.5.0 or 1.5.1,
> so downgrading should not help.  I'm not yet completely sure what the
> actual problem is, but have an idea.  Can you please test the patch
> below?:

Maybe it's better to upgrade to runit-1.7.2? Or does that latest version
also need to be tested with that patch?

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-06-03 14:33     ` Alex Efros
@ 2007-06-03 16:31       ` Gerrit Pape
  0 siblings, 0 replies; 113+ messages in thread
From: Gerrit Pape @ 2007-06-03 16:31 UTC (permalink / raw)
  To: supervision

On Sun, Jun 03, 2007 at 05:33:11PM +0300, Alex Efros wrote:
> On Sun, Jun 03, 2007 at 11:10:56AM +0000, Gerrit Pape wrote:
> > Hi Alex, the runit program didn't change from 1.4.1 to 1.5.0 or 1.5.1,
> > so downgrading should not help.  I'm not yet completely sure what the
> > actual problem is, but have an idea.  Can you please test the patch
> > below?:
> 
> Maybe it's better to upgrade to runit-1.7.2? Or does that latest version
> also need to be tested with that patch?

I don't think there are relevant changes to src/runit.c in version 1.7.2
either.  So, yes, the patch should be tested to see if it fixes your
problem.

Regards, Gerrit.



* Re: runit not collecting zombies
  2007-06-03 11:10   ` Gerrit Pape
  2007-06-03 14:33     ` Alex Efros
@ 2007-06-11 13:11     ` Alex Efros
  2007-06-18 13:45       ` Alex Efros
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-06-11 13:11 UTC (permalink / raw)
  To: supervision

Hi!

On Sun, Jun 03, 2007 at 11:10:56AM +0000, Gerrit Pape wrote:
> Can you please test the patch below?:

I've just installed runit-1.7.2 with that patch on my servers. I think if
there are no zombies after a week or two, then it's working. I didn't install
it last week because I was gathering some statistics: it looks like my
servers start producing uncollected zombies after ~3 days of uptime.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-06-11 13:11     ` Alex Efros
@ 2007-06-18 13:45       ` Alex Efros
  2007-06-19 18:13         ` Gerrit Pape
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-06-18 13:45 UTC (permalink / raw)
  To: supervision

Hi!

On Mon, Jun 11, 2007 at 04:11:12PM +0300, Alex Efros wrote:
> > Can you please test the patch below?:
> 
> I've just installed runit-1.7.2 with that patch on my servers. I think if
> there are no zombies after a week or two, then it's working. I didn't install
> it last week because I was gathering some statistics: it looks like my
> servers start producing uncollected zombies after ~3 days of uptime.

No, the patch didn't fix this bug. :(

I've just rebooted my home workstation because of this issue (~6 days
uptime), and my servers (~2.5 days uptime) have already started generating
uncollected zombies (640 zombies on one server, 160 on another), so I
expect I'll have to reboot them in about 10-12 hours.

Rebooting servers every 2-3 days is unacceptable! I need instructions
on how to help you debug and fix this issue.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-06-18 13:45       ` Alex Efros
@ 2007-06-19 18:13         ` Gerrit Pape
  2007-06-19 19:07           ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Gerrit Pape @ 2007-06-19 18:13 UTC (permalink / raw)
  To: supervision

On Mon, Jun 18, 2007 at 04:45:16PM +0300, Alex Efros wrote:
> On Mon, Jun 11, 2007 at 04:11:12PM +0300, Alex Efros wrote:
> > > Can you please test the patch below?:
> > 
> > I've just installed runit-1.7.2 with that patch on my servers. I think if
> > there are no zombies after a week or two, then it's working. I didn't install
> > it last week because I was gathering some statistics: it looks like my
> > servers start producing uncollected zombies after ~3 days of uptime.
> 
> No, the patch didn't fix this bug. :(
> 
> I've just rebooted my home workstation because of this issue (~6 days
> uptime), and my servers (~2.5 days uptime) have already started generating
> uncollected zombies (640 zombies on one server, 160 on another), so I
> expect I'll have to reboot them in about 10-12 hours.
> 
> Rebooting servers every 2-3 days is unacceptable! I need instructions
> on how to help you debug and fix this issue.

Hi Alex, after checking the code, I currently cannot say whether or how
runit could fail to reap zombies that detached and were re-parented to pid 1.
On Linux running strace on pid 1 isn't supported AFAIK.  To be sure that
runit is at fault, can you please check the kernel versions on your two
machines?  Could they have changed around the time the problem
popped up?  Does upgrading the Linux kernel to a more recent version
change anything?

Thanks, Gerrit.



* Re: runit not collecting zombies
  2007-06-19 18:13         ` Gerrit Pape
@ 2007-06-19 19:07           ` Alex Efros
  2007-06-20 16:23             ` Gerrit Pape
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-06-19 19:07 UTC (permalink / raw)
  To: supervision

Hi!

On Tue, Jun 19, 2007 at 06:13:25PM +0000, Gerrit Pape wrote:
> Hi Alex, after checking the code, I currently cannot say whether or how
> runit could fail to reap zombies that detached and were re-parented to pid 1.
> On Linux running strace on pid 1 isn't supported AFAIK.  To be sure that
> runit is at fault, can you please check the kernel versions on your two
> machines?  Could they have changed around the time the problem
> popped up?  Does upgrading the Linux kernel to a more recent version
> change anything?

Yeah, I also think this may be a kernel<->runit bug. But this bug happened
previously on 2.6.16, and now I have it on 2.6.20 (upgraded a few days ago).

Moreover, it looks like runit doesn't just 'stop reaping zombies' at some
point. It looks like it continues reaping them, but not all of them. Look:

# date; ps ax | grep Z | wc
Mon Jun 18 13:46:02 GMT 2007
    162     973    7155
# date; ps ax | grep Z | wc
Mon Jun 18 22:41:58 GMT 2007
    406    2437   17894
# date; ps ax | grep Z | wc
Tue Jun 19 18:59:46 GMT 2007
    770    4621   33939

This server generates a lot of short-lived processes every minute, i.e. it
generates new processes much faster than new non-reaped zombies arise.

I may be wrong, but I have a feeling that on 2.6.16 new non-reaped zombies
arose much faster: within a few hours I had 8192 user processes and was
forced to reboot.

Maybe the kernel just doesn't send SIGCHLD in some situations? Maybe there is
a race condition in runit or the kernel? Maybe having runit call waitpid()
every few seconds would solve the issue?

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-06-19 19:07           ` Alex Efros
@ 2007-06-20 16:23             ` Gerrit Pape
  2007-06-20 16:57               ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Gerrit Pape @ 2007-06-20 16:23 UTC (permalink / raw)
  To: supervision

[-- Attachment #1: Type: text/plain, Size: 2120 bytes --]

On Tue, Jun 19, 2007 at 10:07:51PM +0300, Alex Efros wrote:
> On Tue, Jun 19, 2007 at 06:13:25PM +0000, Gerrit Pape wrote:
> > Hi Alex, after checking the code, I currently cannot say whether or how
> > runit could fail to reap zombies that detached and were re-parented to pid 1.
> > On Linux running strace on pid 1 isn't supported AFAIK.  To be sure that
> > runit is at fault, can you please check the kernel versions on your two
> > machines?  Could they have changed around the time the problem
> > popped up?  Does upgrading the Linux kernel to a more recent version
> > change anything?
> 
> Yeah, I also think this may be a kernel<->runit bug. But this bug happened
> previously on 2.6.16, and now I have it on 2.6.20 (upgraded a few days ago).
> 
> Moreover, it looks like runit doesn't just 'stop reaping zombies' at some
> point. It looks like it continues reaping them, but not all of them. Look:
[...]
> This server generates a lot of short-lived processes every minute, i.e. it
> generates new processes much faster than new non-reaped zombies arise.
> 
> I may be wrong, but I have a feeling that on 2.6.16 new non-reaped zombies
> arose much faster: within a few hours I had 8192 user processes and was
> forced to reboot.
> 
> Maybe the kernel just doesn't send SIGCHLD in some situations? Maybe there is
> a race condition in runit or the kernel? Maybe having runit call waitpid()
> every few seconds would solve the issue?

That could solve the issue, yes, but runit should know when to do
waitpid(), and I would like to find out why that goes wrong. I tried to
reproduce the problem on a Debian/unstable ppc with Linux 2.6.20.7, but
failed.  Does this cause a problem on your system?:

# cat >test.c <<EOT
#include <unistd.h>

int main () {
  int pid, i;
  for (i =0; i < 8193; ++i) {
    pid =fork();
    if (pid == -1) { write(1, "f\n", 2); }
    if (!pid) {
      daemon(0, 0);
      // sleep(1);
      _exit(0);
    }
  }
  sleep(14);
  _exit(0);
}
EOT
# gcc test.c
# ./a.out

If not, can you provide the service daemon that produced this number
of detached short-lived processes?

And I have another patch to try attached.

Thanks, Gerrit.

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 513 bytes --]

Index: src/runit.c
===================================================================
RCS file: /cvs/runit/src/runit.c,v
retrieving revision 1.14
diff -u -r1.14 runit.c
--- src/runit.c	21 Nov 2006 15:09:18 -0000	1.14
+++ src/runit.c	20 Jun 2007 16:21:46 -0000
@@ -194,7 +194,7 @@
         strerr_warn3(INFO, "leave stage: ", stage[st], 0);
         break;
       }
-      if (child > 0) {
+      if (child != 0) {
         /* collect terminated children */
         write(selfpipe[1], "", 1);
         continue;


* Re: runit not collecting zombies
  2007-06-20 16:23             ` Gerrit Pape
@ 2007-06-20 16:57               ` Alex Efros
  2007-06-20 18:35                 ` Gerrit Pape
  2007-06-20 19:57                 ` Charlie Brady
  0 siblings, 2 replies; 113+ messages in thread
From: Alex Efros @ 2007-06-20 16:57 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Jun 20, 2007 at 04:23:25PM +0000, Gerrit Pape wrote:
> # gcc test.c
> # ./a.out
 
This test exits without leaving zombies and doesn't output anything on my
home workstation (if you remember, I had to reboot the workstation because of
the same issue a few days ago). But for now the issue hasn't happened on the
workstation (yet, I think: uptime is just 2 days and it doesn't generate
new processes as often as the servers).

Then I ran this test on a server which already has this issue, but which
doesn't yet have 8192 zombies for a single user account, so I haven't
rebooted it yet. Before running the test the server had:

    # date; ps ax | grep Z | wc
    Wed Jun 20 16:42:18 GMT 2007
       1259    7555   55496

The test printed several 'f'; here is the full output:

    $ ./a.out
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    f
    $ 

and now there are a lot of zombies:

    # date; ps ax | grep Z | wc
    Wed Jun 20 16:42:39 GMT 2007
      17586  105517  790218

Several minutes later the situation hasn't changed:

    # date; ps ax | grep Z | wc
    Wed Jun 20 16:49:04 GMT 2007
      17587  105523  790263

> If not, can you provide the service daemon that produced this number
> of detached short-lived processes?

On my home workstation most of the zombie processes were 'chpst' executed by
dcron every minute using lines like this one:

    */1  * * * *    ( cd /var/www/soft.p/html && exec chpst -L .lib/var/.lock.service runsvdir .lib/service/ &>/dev/null ) &

(I use runsvdir to run services in my web projects, and the only way to
guarantee these services will be started after a reboot is a cron
configuration like this one. I don't like to use root access to start
services for web projects.)

Also I see a lot of zombie 'sshd' processes on my servers. So I don't think
this issue is in my perl scripts or other applications; it's somewhere in
runit and/or the kernel.

> And I have another patch to try attached.

Thanks, I'll try it. If I understand correctly, I should try this patch
instead of the previous one, not together with it..?

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-06-20 16:57               ` Alex Efros
@ 2007-06-20 18:35                 ` Gerrit Pape
  2007-06-23  4:42                   ` Alex Efros
  2007-07-01  8:43                   ` Radek Podgorny
  2007-06-20 19:57                 ` Charlie Brady
  1 sibling, 2 replies; 113+ messages in thread
From: Gerrit Pape @ 2007-06-20 18:35 UTC (permalink / raw)
  To: supervision

On Wed, Jun 20, 2007 at 07:57:36PM +0300, Alex Efros wrote:
> Thanks, I'll try it. If I understand correctly, I should try this patch
> instead of the previous one, not together with it..?

Thanks for helping to try to track this down.  You can use it on top of
the first, the first one should generally speed up reaping zombies.
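
What the first patch changes is that both the wakeup pipe and the set of exited children are emptied in loops, rather than handling one byte and one child per wakeup. Because several children exiting close together can be reported by a single SIGCHLD, one signal may stand for many zombies. A minimal sketch of that self-pipe pattern with plain POSIX calls; `waitpid(-1, ..., WNOHANG)` stands in for runit's `wait_nohang()` helper (an assumption about what it wraps), and the function names are invented for illustration:

```c
#define _POSIX_C_SOURCE 200809L  /* ask libc headers for the POSIX.1-2008 API */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int selfpipe[2] = { -1, -1 };

/* SIGCHLD handler: only writes one wakeup byte, which is async-signal-safe */
static void on_sigchld(int sig) {
  int saved = errno;
  (void)sig;
  if (write(selfpipe[1], "", 1) == -1) {
    /* pipe full: a wakeup is already pending, nothing more to do */
  }
  errno = saved;
}

/* Create the non-blocking pipe and install the handler; 0 on success. */
int selfpipe_init(void) {
  struct sigaction sa;
  int i;

  if (pipe(selfpipe) == -1) return -1;
  for (i = 0; i < 2; ++i) {
    int fl = fcntl(selfpipe[i], F_GETFL);
    if (fl == -1 || fcntl(selfpipe[i], F_SETFL, fl | O_NONBLOCK) == -1)
      return -1;
  }
  memset(&sa, 0, sizeof sa);
  sa.sa_handler = on_sigchld;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
  return sigaction(SIGCHLD, &sa, 0);
}

/* Drain every pending wakeup byte, then reap every exited child.
   Returns the number of children reaped. */
int drain_and_reap(void) {
  char ch;
  int status, reaped = 0;
  pid_t pid;

  while (read(selfpipe[0], &ch, 1) == 1) {}    /* stops when the pipe is empty */
  while ((pid = waitpid(-1, &status, WNOHANG)) > 0) ++reaped;
  assert(pid == 0 || pid == -1);
  return reaped;
}
```

A main loop would block in `poll()` or `select()` on `selfpipe[0]` and call `drain_and_reap()` each time it becomes readable.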

Regards, Gerrit.



* Re: runit not collecting zombies
  2007-06-20 16:57               ` Alex Efros
  2007-06-20 18:35                 ` Gerrit Pape
@ 2007-06-20 19:57                 ` Charlie Brady
  1 sibling, 0 replies; 113+ messages in thread
From: Charlie Brady @ 2007-06-20 19:57 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Wed, 20 Jun 2007, Alex Efros wrote:

> Also I see a lot of zombie 'sshd' processes on my servers. So I don't think
> this issue is in my perl scripts or other applications; it's somewhere in
> runit and/or the kernel.

I don't think that's been established. The fact that the zombies aren't 
reaped seems to be an issue with runit and/or kernel. The fact that you 
have such applications re-parented to process 1 when/if they should be 
supervised is likely to be an issue in your perl scripts or other 
applications.




* Re: runit not collecting zombies
  2007-06-20 18:35                 ` Gerrit Pape
@ 2007-06-23  4:42                   ` Alex Efros
  2007-06-26  9:59                     ` Gerrit Pape
  2007-07-01  8:43                   ` Radek Podgorny
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-06-23  4:42 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Jun 20, 2007 at 06:35:32PM +0000, Gerrit Pape wrote:
> Thanks for helping to try to track this down.  You can use it on top of
> the first, the first one should generally speed up reaping zombies.

One of my servers now has ~700 zombie processes. Looks like these two
patches don't fix this issue. :(


# date; ps ax | grep Z | wc
Sat Jun 23 04:39:32 GMT 2007
    711    4267   35332
# date; ps ax | grep Z | wc
Sat Jun 23 04:40:27 GMT 2007
    743    4459   36927
# date; ps ax | grep Z | wc
Sat Jun 23 04:41:15 GMT 2007
    755    4531   37522

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-06-23  4:42                   ` Alex Efros
@ 2007-06-26  9:59                     ` Gerrit Pape
  2007-07-07  7:16                       ` Alex Efros
  2007-07-15 14:47                       ` Alex Efros
  0 siblings, 2 replies; 113+ messages in thread
From: Gerrit Pape @ 2007-06-26  9:59 UTC (permalink / raw)
  To: supervision

[-- Attachment #1: Type: text/plain, Size: 1027 bytes --]

On Sat, Jun 23, 2007 at 07:42:05AM +0300, Alex Efros wrote:
> On Wed, Jun 20, 2007 at 06:35:32PM +0000, Gerrit Pape wrote:
> > Thanks for helping to try to track this down.  You can use it on top of
> > the first, the first one should generally speed up reaping zombies.
> 
> One of my servers now has ~700 zombie processes. Looks like these two
> patches don't fix this issue. :(

From reading the code, I can't see why the runit program shouldn't
collect these zombies on your system.  But I may be blind, let's see
whether reaping zombies at least every 5 seconds helps.  Can you please
apply the attached patch (it supersedes the previous patches), install
the resulting runit program into /sbin/, reboot the machine, make sure
that the new runit program is running as pid 1, and see whether zombies
are left over?  Is anything printed to the console when the zombie
problem arises?
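
The idea of the attached patch is a safety net: `poll()` gets a 5000 ms timeout, so the loop wakes up and tries to reap even if no wakeup byte ever arrives. (In the diff as posted, only the `poll()` branch gains the timeout; the `select()` branch still passes a null `timeval`.) A minimal sketch of one such pass, assuming `waitpid(-1, ..., WNOHANG)` in place of runit's `wait_nohang()`; `reap_pass` is a hypothetical name:

```c
#define _POSIX_C_SOURCE 200809L  /* ask libc headers for the POSIX.1-2008 API */
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait until wake_fd is readable or timeout_ms has passed, then reap
   exited children either way. Returns the number reaped, or -1 if
   poll() failed for a reason other than an interrupting signal. */
int reap_pass(int wake_fd, int timeout_ms) {
  struct pollfd x;
  int r, status, reaped = 0;
  pid_t pid;

  x.fd = wake_fd;
  x.events = POLLIN;
  x.revents = 0;
  r = poll(&x, 1, timeout_ms);
  if (r == -1 && errno != EINTR) return -1;
  /* r == 0 means the timeout expired: reap anyway, in case a wakeup was missed */
  while ((pid = waitpid(-1, &status, WNOHANG)) > 0) ++reaped;
  assert(pid == 0 || pid == -1);
  return reaped;
}
```

Calling `reap_pass(fd, 5000)` in a loop bounds the lifetime of any zombie parented to the caller at roughly five seconds, at the cost of periodic wakeups.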

To be sure that runit is the problem, could you boot one of your systems
into sysvinit to see if it has the same problem?

Thanks, Gerrit.

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1488 bytes --]

diff --git a/src/runit.c b/src/runit.c
index f7d6522..6f0793d 100644
--- a/src/runit.c
+++ b/src/runit.c
@@ -143,22 +143,28 @@ int main (int argc, const char * const *argv, char * const *envp) {
     FD_SET(x.fd, &rfds);
 #endif
     for (;;) {
-      int child;
+      int r, child;
 
       sig_unblock(sig_child);
       sig_unblock(sig_cont);
       sig_unblock(sig_int);
 #ifdef IOPAUSE_POLL
-      poll(&x, 1, -1);
+      r =poll(&x, 1, 5000);
 #else
-      select(x.fd +1, &rfds, (fd_set*)0, (fd_set*)0, (struct timeval*)0);
+      r =select(x.fd +1, &rfds, (fd_set*)0, (fd_set*)0, (struct timeval*)0);
 #endif
       sig_block(sig_cont);
       sig_block(sig_child);
       sig_block(sig_int);
       
-      read(selfpipe[0], &ch, 1);
-      child =wait_nohang(&wstat);
+      while (read(selfpipe[0], &ch, 1) == 1) {}
+      while ((child =wait_nohang(&wstat)) > 0)
+        if (child == pid) break;
+      if (child == -1) {
+        strerr_warn2(WARNING, "wait_nohang, pausing: ", &strerr_sys);
+        sleep(5);
+      }
+      if ((r == 0) && (child != pid)) continue;
 
       /* reget stderr */
       if ((ttyfd =open_write("/dev/console")) != -1) {
@@ -194,7 +200,7 @@ int main (int argc, const char * const *argv, char * const *envp) {
         strerr_warn3(INFO, "leave stage: ", stage[st], 0);
         break;
       }
-      if (child > 0) {
+      if (child != 0) {
         /* collect terminated children */
         write(selfpipe[1], "", 1);
         continue;


* Re: runit not collecting zombies
  2007-06-20 18:35                 ` Gerrit Pape
  2007-06-23  4:42                   ` Alex Efros
@ 2007-07-01  8:43                   ` Radek Podgorny
  2007-07-02  8:28                     ` Gerrit Pape
  1 sibling, 1 reply; 113+ messages in thread
From: Radek Podgorny @ 2007-07-01  8:43 UTC (permalink / raw)
  To: supervision

Hi! What is the status of this?  Did the "reap zombies every 5secs"
help? One of my servers just passed away again and I really need this
issue to be fixed... :-(

Radek Podgorny




* Re: runit not collecting zombies
  2007-07-01  8:43                   ` Radek Podgorny
@ 2007-07-02  8:28                     ` Gerrit Pape
  2007-07-02 11:23                       ` Radek Podgorny
  2007-07-07  4:54                       ` Alex Efros
  0 siblings, 2 replies; 113+ messages in thread
From: Gerrit Pape @ 2007-07-02  8:28 UTC (permalink / raw)
  To: supervision

On Sun, Jul 01, 2007 at 10:43:19AM +0200, Radek Podgorny wrote:
> Hi! What is the status of this?  Did the "reap zombies every 5secs"
> help? One of my servers just passed away again and I really need this
> issue to be fixed... :-(

Not sure, no response from Alex yet.  I didn't know that someone else
is having this problem.  What triggers the problem on your system?  Is it
also a huge number of short-running processes that detached to have
parent pid 1?  Did you check the ppid of the zombies?  Can you give some
information on how it can be reproduced?

Thanks, Gerrit.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-02  8:28                     ` Gerrit Pape
@ 2007-07-02 11:23                       ` Radek Podgorny
  2007-07-02 12:14                         ` Gerrit Pape
  2007-07-07  4:54                       ` Alex Efros
  1 sibling, 1 reply; 113+ messages in thread
From: Radek Podgorny @ 2007-07-02 11:23 UTC (permalink / raw)
  To: supervision

Well, actually I'm the original poster. Unfortunately I can't do any
tests since the affected machines are all in production, and on my testing
setups the problem does not occur. :-(

The number of processes doesn't have to be "huge" and they don't need to
be "short lived" either (AFAIK). The parent pid of the zombies is 1.

As I said before, I can't do any thorough testing so I can't give you
much feedback about the patches. I can only gather "passive" info
(versions, ...).

All my systems are Gentoo. Some of them are amd64, some x86. The problem
appears on both architectures. My "unstable" laptop is amd64 and does
not suffer from the problem. So it may seem to be a problem of different
versions of packages (glibc, whatever...). Unfortunately, some of the
"stable" systems do not have the problem. :-( So the only difference may
be the kernel which I can check if you want... ...or something
completely different. :-(

Sincerely
Radek Podgorny


Gerrit Pape wrote:
> On Sun, Jul 01, 2007 at 10:43:19AM +0200, Radek Podgorny wrote:
>> Hi! What is the status of this?  Did the "reap zombies every 5secs"
>> help? One of my servers just passed away again and I really need this
>> issue to be fixed... :-(
> 
> Not sure, no response from Alex yet.  I didn't know that someone else
> is having this problem.  What triggers the problem on your system?  Is it
> also a huge number of short-running processes that detached to have
> parent pid 1?  Did you check the ppid of the zombies?  Can you give some
> information on how it can be reproduced?
> 
> Thanks, Gerrit.
> 


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-02 11:23                       ` Radek Podgorny
@ 2007-07-02 12:14                         ` Gerrit Pape
  2007-07-02 12:42                           ` Radek Podgorny
  0 siblings, 1 reply; 113+ messages in thread
From: Gerrit Pape @ 2007-07-02 12:14 UTC (permalink / raw)
  To: supervision

On Mon, Jul 02, 2007 at 01:23:23PM +0200, Radek Podgorny wrote:
> Well, actually I'm the original poster. Unfortunately I can't do any

Oops, sorry.

> The number of processes doesn't have to be "huge" and they don't need to
> be "short lived" either (AFAIK). The parent pid of the zombies is 1.

When reading your initial post and http://trac.lighttpd.net/trac/ticket/978
I concluded that this is not a runit problem, otherwise the patch posted
to the link above should not work.  As I see it, lighttpd version 1.4.15
has a problem when run with the -D switch.  It doesn't wait() for its
children that are spawned on startup.  If run without -D, it detaches
afterwards through a double fork which makes the zombies go, but it
doesn't with -D:

 # lighttpd -D -f /etc/lighttpd/lighttpd.conf &
 [1] 10362
 # ps --ppid 10362
   PID TTY          TIME CMD
 10363 pts/1    00:00:00 create-mime.ass <defunct>
 10364 pts/1    00:00:00 include-conf-en <defunct>
 # 

Please check again the parent pid of the zombies through 'ps -ef' and/or
/proc/<pid>/status.
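
Editor's note: the same check can be done from a program. This is a
Linux-only sketch that reads the "PPid:" line of /proc/<pid>/status; the
function names are hypothetical and error handling is deliberately thin.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the parent pid recorded in /proc/<pid>/status, or -1 if the
 * file cannot be read or has no PPid line. Linux-specific. */
long read_ppid(pid_t pid)
{
    char path[64];
    char line[256];
    long ppid = -1;
    FILE *fp;

    assert(pid > 0);
    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        /* The line looks like "PPid:<tab>1234". */
        if (sscanf(line, "PPid: %ld", &ppid) == 1)
            break;
    }
    fclose(fp);
    return ppid;
}

/* Sanity check: for the calling process the value from procfs should
 * match what getppid() reports. Returns 1 if they agree. */
int proc_agrees_with_getppid(void)
{
    return read_ppid(getpid()) == (long)getppid();
}
```

A zombie whose read_ppid() is 1 is waiting for init to reap it; any other
value points at the program that forgot to wait() for it.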

Thanks, Gerrit.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-02 12:14                         ` Gerrit Pape
@ 2007-07-02 12:42                           ` Radek Podgorny
  0 siblings, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2007-07-02 12:42 UTC (permalink / raw)
  To: supervision

Yeah, the PPID of the zombies is 1 for sure.

Actually, I'm not experiencing it with lighttpd (I've just found a
similar problem on the web with better explanation in hope it helps).

There are basically two types of zombies on my system. Lots of sshd
zombies (I don't know where they come from, maybe automated login
attempts...) and lots of arp zombies. arp does not fork at all AFAIK.
It's there because I have a Python script which executes arp, and I run
that script from cron.

Radek P.


Gerrit Pape wrote:
> On Mon, Jul 02, 2007 at 01:23:23PM +0200, Radek Podgorny wrote:
>> Well, actually I'm the original poster. Unfortunately I can't do any
> 
> Oops, sorry.
> 
>> The number of processes doesn't have to be "huge" and they don't need to
>> be "short lived" either (AFAIK). The parent pid of the zombies is 1.
> 
> When reading your initial post and http://trac.lighttpd.net/trac/ticket/978
> I concluded that this is not a runit problem, otherwise the patch posted
> to the link above should not work.  As I see it, lighttpd version 1.4.15
> has a problem when run with the -D switch.  It doesn't wait() for its
> children that are spawned on startup.  If run without -D, it detaches
> afterwards through a double fork which makes the zombies go, but it
> doesn't with -D:
> 
>  # lighttpd -D -f /etc/lighttpd/lighttpd.conf &
>  [1] 10362
>  # ps --ppid 10362
>    PID TTY          TIME CMD
>  10363 pts/1    00:00:00 create-mime.ass <defunct>
>  10364 pts/1    00:00:00 include-conf-en <defunct>
>  # 
> 
> Please check again the parent pid of the zombies through 'ps -ef' and/or
> /proc/<pid>/status.
> 
> Thanks, Gerrit.
> 


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-02  8:28                     ` Gerrit Pape
  2007-07-02 11:23                       ` Radek Podgorny
@ 2007-07-07  4:54                       ` Alex Efros
  1 sibling, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-07-07  4:54 UTC (permalink / raw)
  To: supervision

Hi!

On Mon, Jul 02, 2007 at 08:28:01AM +0000, Gerrit Pape wrote:
> Not sure, no response from Alex yet.

I was away on vacation. I've just come home, so I'll probably try this
patch today or tomorrow, and we'll know results in about a week.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-06-26  9:59                     ` Gerrit Pape
@ 2007-07-07  7:16                       ` Alex Efros
  2007-07-07 18:13                         ` Charlie Brady
  2007-07-15 14:47                       ` Alex Efros
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-07-07  7:16 UTC (permalink / raw)
  To: supervision

Hi!

On Tue, Jun 26, 2007 at 09:59:20AM +0000, Gerrit Pape wrote:
> But I may be blind, let's see whether reaping zombies at least every 5
> seconds helps.

One difference I have already noticed. Every minute cron runs several
`chpst -L` commands like this one:

*/1  * * * *    ( cd /var/www/soft.p/html && exec chpst -L .lib/var/.lock.service runsvdir .lib/service/ &>/dev/null ) &

and with this patch I notice that the zombies produced by this command are
collected with a delay of about 5 seconds:

home ~ # date; ps ax | grep Z
Sat Jul  7 10:06:00 EEST 2007
home ~ # date; ps ax | grep Z
Sat Jul  7 10:06:01 EEST 2007
 2544 ?        Z      0:00 [sh] <defunct>
 2545 ?        Z      0:00 [sh] <defunct>
 2548 ?        Z      0:00 [sh] <defunct>
 2550 ?        Z      0:00 [sh] <defunct>
 2552 ?        Z      0:00 [sh] <defunct>
 2553 ?        Z      0:00 [sh] <defunct>
 2556 ?        Z      0:00 [sh] <defunct>
home ~ # date; ps ax | grep Z
Sat Jul  7 10:06:06 EEST 2007
 2544 ?        Z      0:00 [sh] <defunct>
 2545 ?        Z      0:00 [sh] <defunct>
 2548 ?        Z      0:00 [sh] <defunct>
 2550 ?        Z      0:00 [sh] <defunct>
 2552 ?        Z      0:00 [sh] <defunct>
 2553 ?        Z      0:00 [sh] <defunct>
 2556 ?        Z      0:00 [sh] <defunct>
home ~ # date; ps ax | grep Z
Sat Jul  7 10:06:08 EEST 2007

This repeats every minute, which is why I think it's related to the
crontab line shown above.

I noticed this right after a reboot, so it can't be related to the
'non-reaping zombies' issue discussed in this thread. It looks like it's
just the behaviour of this patch.

> Is anything printed to the console when the zombie problem arises?

This problem usually arises on remote servers, so I can't check the
console... but I'll look at the kernel log.

> To be sure that runit is the problem, could you boot one of your systems
> into sysvinit to see if it has the same problem?

No, sorry. I have not configured /etc/inittab and /etc/{init.d,conf.d}/ on
my servers, only /etc/runit/{1,2,3}.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-07  7:16                       ` Alex Efros
@ 2007-07-07 18:13                         ` Charlie Brady
  2007-07-07 19:12                           ` Alex Efros
  2007-07-12 14:42                           ` Charlie Brady
  0 siblings, 2 replies; 113+ messages in thread
From: Charlie Brady @ 2007-07-07 18:13 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 7 Jul 2007, Alex Efros wrote:

> On Tue, Jun 26, 2007 at 09:59:20AM +0000, Gerrit Pape wrote:
>> But I may be blind, let's see whether reaping zombies at least every 5
>> seconds helps.
>
> One difference I have already noticed. Every minute cron runs several
> `chpst -L` commands like this one:

Why? That looks like a very strange thing to do.

> */1  * * * *    ( cd /var/www/soft.p/html && exec chpst -L .lib/var/.lock.service runsvdir .lib/service/ &>/dev/null ) &

So every minute cron will run a shell script, and then wait for it to 
finish. Each shell script forks a subshell in the background and then 
exits, so cron no longer waits. The subshell is reparented to process 1. 
When it exits, it will become a zombie until process 1 reaps its status.
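
Editor's note: the reparenting described above can be observed with a
double fork. This is a sketch with hypothetical names, not the cron job
itself: the middle process plays the role of the shell that backgrounds a
subshell and exits, and the grandchild reports which pid it sees as its
parent afterwards. That pid is typically 1, but it may be a subreaper on
systems that use one.

```c
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a middle process that forks a grandchild and exits without
 * waiting. Returns the pid the orphaned grandchild sees as its parent
 * afterwards, or -1 on error. *intermediate receives the middle pid. */
pid_t orphan_new_parent(pid_t *intermediate)
{
    int fds[2];
    pid_t child, reported = -1;
    ssize_t n;

    assert(intermediate != NULL);
    if (pipe(fds) == -1)
        return -1;

    child = fork();
    if (child == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (child == 0) {
        pid_t self = getpid();
        pid_t grandchild;

        close(fds[0]);
        grandchild = fork();
        if (grandchild == 0) {
            pid_t now;

            /* Wait until the kernel has handed us to a new parent. */
            while (getppid() == self)
                sched_yield();
            now = getppid();
            if (write(fds[1], &now, sizeof now) != (ssize_t)sizeof now)
                _exit(1);
            _exit(0);   /* now a zombie that the new parent must reap */
        }
        _exit(grandchild == -1 ? 1 : 0);    /* middle exits, no wait() */
    }

    close(fds[1]);
    *intermediate = child;
    n = read(fds[0], &reported, sizeof reported);
    close(fds[0]);
    waitpid(child, NULL, 0);                /* reap the middle process */
    return n == (ssize_t)sizeof reported ? reported : -1;
}
```

The caller reaps only the middle process. The grandchild's exit status
can be collected only by whichever process inherited it, which is why a
pid 1 that misses it leaves a zombie behind.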

Why are you running the subshell? Why do you background it? Why are you 
throwing away any error output from chpst/runsvdir?

What happens if your cron line is:

*/1  * * * *   chpst -L /var/www/soft.p/html/.lib/var/.lock.service runsvdir /var/www/soft.p/html/.lib/var/.lib/service/

?

Why are you starting a new runsvdir every minute?

What are you actually trying to achieve? - perhaps someone can suggest a 
less "unusual" design.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-07 18:13                         ` Charlie Brady
@ 2007-07-07 19:12                           ` Alex Efros
  2007-07-12 14:21                             ` Charlie Brady
  2007-07-12 14:42                           ` Charlie Brady
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-07-07 19:12 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, Jul 07, 2007 at 02:13:48PM -0400, Charlie Brady wrote:
> What happens if your cron line is:
> 
> */1  * * * *   chpst -L /var/www/soft.p/html/.lib/var/.lock.service 
> runsvdir /var/www/soft.p/html/.lib/var/service/
> 
> ?
 
According to crontab(1):

    No attempt will be made to issue commands lost due to a reboot, and
    commands are not reissued if the previously issued command is still
    running.

it may be enough to do even this:

*/1  * * * *	runsvdir /var/www/soft.p/html/.lib/var/service/

But I don't know how reliable this feature is in my cron daemon (I'm using
`dcron') or how it is implemented. I do know `chpst -L` is reliable and I know
how it works. So I'd prefer to use known reliable solution.
(Moreover, cron daemon can be any, not only dcron - I'd like to use same
crontab configuration line which will work with any cron daemon.)

Anyway, my code may be 'overkill' or ugly, but it works correctly and
does nothing really 'wrong'. Yes, it produces a zombie, but that zombie
must be reaped by process 1.

> Why are you starting a new runsvdir every minute?
 
Because I need to restart user-controlled services under runsvdir after
reboot without any special configuration done by 'root'.

> What are you actually trying to achieve? - perhaps someone can suggest a 
> less "unusual" design.

Okay. I have several independent web projects. These projects are complex
enough, and contain not only CGI scripts and database, but also cron
scripts, command line utilities, and network daemons. All these
applications should not require 'root' to setup and should be 100%
manageable by user. 

To have single point for managing log files for all these scripts I run
separate `socklog unix /custom/path` for each of these web projects, and
all scripts (CGI, cron, network daemons, etc.) in each project use that
'/custom/path' (different for each project, of course) UNIX socket for logging.

So, I need at least single 'service' for each web project: socklog.
Some projects also have other custom services like network daemons.
All these services should run as reliably as the usual system services
in /var/service/* and should be restarted after system reboot.

I solved this by using runsvdir to run services and using user's crontab
to start (and restart after system reboot or killing runsvdir) runsvdir.

To avoid multiple execution of runsvdir for same directory with services
I use `chpst -L`.
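
Editor's note: a sketch of the "take the lock or give up at once"
behaviour that `chpst -L` is used for here. It uses flock(2) on a lock
file; whether chpst uses flock() or another locking call on a given build
is an assumption, and try_lock() is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take an exclusive lock on path without waiting.
 * Returns an open descriptor that holds the lock, or -1 if the lock is
 * already held or the file cannot be opened. The lock lasts until the
 * descriptor is closed, including across exec in the same process. */
int try_lock(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd == -1)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
        close(fd);                  /* somebody else holds the lock */
        return -1;
    }
    return fd;
}
```

A second caller gets -1 immediately while the first descriptor stays
open, which matches the cron use case: the every-minute job exits at once
while an earlier runsvdir still holds the lock.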

That's all.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-07 19:12                           ` Alex Efros
@ 2007-07-12 14:21                             ` Charlie Brady
  2007-07-12 14:41                               ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-07-12 14:21 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 7 Jul 2007, Alex Efros wrote:

> it may be enough to do even this:
>
> */1  * * * *	runsvdir /var/www/soft.p/html/.lib/var/service/
>
> But I don't know how reliable this feature is in my cron daemon (I'm using
> `dcron') or how it is implemented.

So why are you using cron? Why don't you have a "soft.p.runsvdir" service, 
with a run script which does:

#! /bin/sh
exec runsvdir /var/www/soft.p/html/.lib/var/service/

?

> So I'd prefer to use known reliable solution.

Indeed.

> (Moreover, cron daemon can be any, not only dcron - I'd like to use same
> crontab configuration line which will work with any cron daemon.)

>> Why are you starting a new runsvdir every minute?
>
> Because I need to restart user-controlled services under runsvdir after
> reboot without any special configuration done by 'root'.

That only says that you need to have runsvdir running, not that you need 
to restart it every minute.

> All these services should run as reliably as the usual system services
> in /var/service/* and should be restarted after system reboot.

runit can do that (without cron).

> To avoid multiple execution of runsvdir for same directory with services
> I use `chpst -L`.

runsv guarantees singleton processes.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-12 14:21                             ` Charlie Brady
@ 2007-07-12 14:41                               ` Alex Efros
  2007-07-12 14:45                                 ` Charlie Brady
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-07-12 14:41 UTC (permalink / raw)
  To: supervision

Hi!

On Thu, Jul 12, 2007 at 10:21:55AM -0400, Charlie Brady wrote:
> So why are you using cron? Why don't you have a "soft.p.runsvdir" service, 
> with a run script which does:

Because this should be configured as root. Actually, I wish to be able to
run my web projects on average good web hosting. Here 'average good' means
the hosting must provide things like ssh and gcc. Having these, I'll be able
to compile runit in my home directory and use it as part of my project.
But there is a 0.0001% chance the web hosting admins will know about runit,
even less chance they will use it, and even less chance they will agree to
add runsvdir for my project to their system services. So the only way to
(re)start my own services automatically after a reboot is cron.

> runsv guarantees singleton processes.

runsv - yes, but does runsvdir guarantee this too?

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-07 18:13                         ` Charlie Brady
  2007-07-07 19:12                           ` Alex Efros
@ 2007-07-12 14:42                           ` Charlie Brady
  2007-07-12 14:43                             ` Charlie Brady
  2007-07-12 14:49                             ` Alex Efros
  1 sibling, 2 replies; 113+ messages in thread
From: Charlie Brady @ 2007-07-12 14:42 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 7 Jul 2007, Charlie Brady wrote:

>>  */1  * * * *    ( cd /var/www/soft.p/html && exec chpst -L
>>  .lib/var/.lock.service runsvdir .lib/service/ &>/dev/null ) &
>
> So every minute cron will run a shell script, and then wait for it to finish. 
> Each shell script forks a subshell in the background and then exits, so cron 
> no longer waits. The subshell is reparented to process 1. When it exits, it 
> will become a zombie until process 1 reaps its status.

Note a process forks you do not know whether parent or child runs first. 
With the above job, you are creating a situation where both parent and 
child will exit very quickly - the parent (shell) by reading EOF and then 
exiting, the child (subshell) by execing chpst which exits when it fails 
to get the lock.

A common method of avoiding zombie processes is for a SIGCHLD handler in 
the parent to reap the status. I wonder whether there is a possibility for 
SIGCHLD to be queued to the wrong process (due to a race during 
reparenting). Does runit as process 1 depend on SIGCHLD to reap zombies?

Alex, are you running an SMP system (which would allow parent and child 
to both be scheduled simultaneously)?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-12 14:42                           ` Charlie Brady
@ 2007-07-12 14:43                             ` Charlie Brady
  2007-07-12 14:49                             ` Alex Efros
  1 sibling, 0 replies; 113+ messages in thread
From: Charlie Brady @ 2007-07-12 14:43 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Thu, 12 Jul 2007, Charlie Brady wrote:

> Note a process forks you do not know whether parent or child runs first. With

s/Note a/Note when a/


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-12 14:41                               ` Alex Efros
@ 2007-07-12 14:45                                 ` Charlie Brady
  2007-07-12 14:57                                   ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-07-12 14:45 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Thu, 12 Jul 2007, Alex Efros wrote:

> On Thu, Jul 12, 2007 at 10:21:55AM -0400, Charlie Brady wrote:
>> So why are you using cron? Why don't you have a "soft.p.runsvdir" service,
>> with a run script which does:
>
> Because this should be configured as root. Actually, I wish to be able to
> run my web projects on average good web hosting.

I thought that we were discussing a system where runit is process 1, not 
some "average good web hosting" system.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-12 14:42                           ` Charlie Brady
  2007-07-12 14:43                             ` Charlie Brady
@ 2007-07-12 14:49                             ` Alex Efros
  2007-07-12 15:11                               ` Charlie Brady
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-07-12 14:49 UTC (permalink / raw)
  To: supervision

Hi!

On Thu, Jul 12, 2007 at 10:42:18AM -0400, Charlie Brady wrote:
> A common method of avoiding zombie processes is for a SIGCHLD handler in 
> the parent to reap the status. I wonder whether there is a possibility for 
> SIGCHLD to be queued to the wrong process (due to a race during 
> reparenting). Does runit as process 1 depend on SIGCHLD to reap zombies?

Yes, this is possible. But either way it's a race in the kernel or in runit,
and it must be fixed there.

With the last patch runit tries reaping every 5 seconds instead of depending
only on SIGCHLD. So far (uptime 5 days) there are no zombies on my servers.
This is an ugly workaround, of course.

> Alex, are you running an SMP system (which would allow parent and child 
> to both be scheduled simultaneously)?

My home workstation is SMP (Core2Duo), while the servers are not SMP
(nowadays it's usual to have a workstation much more powerful than servers :)).
This issue arises on both the workstation and the servers, but much more
often on the servers (because they work much more intensively and generate
many more processes).

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-12 14:45                                 ` Charlie Brady
@ 2007-07-12 14:57                                   ` Alex Efros
  0 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-07-12 14:57 UTC (permalink / raw)
  To: supervision

Hi!

On Thu, Jul 12, 2007 at 10:45:30AM -0400, Charlie Brady wrote:
> I thought that we were discussing a system where runit is process 1, not 
> some "average good web hosting" system.

I'm discussing my 'solution' for bringing runit's reliability to an
"average good web hosting" system. :-P

Actually, thinking about "average good web hosting" may be a mistake on my
side. In recent years I've been working on complex tasks which require
dedicated servers or clusters of dedicated servers. I install my custom
Hardened Gentoo with runit on these servers anyway, and I'm root on them
anyway. So, from this point of view, using cron isn't really required anymore.

But this "cron+runsvdir" solution was born many years ago, when some of my
projects were installed on ordinary web hosting, and I continue using it
because I don't see why I should lose the ability to run my projects on
such hosting. Running, say, 5 'chpst -L' every minute doesn't impact the
server's performance in any way. Generating zombies also shouldn't be a
problem. So...

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-12 14:49                             ` Alex Efros
@ 2007-07-12 15:11                               ` Charlie Brady
  2007-07-12 15:15                                 ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-07-12 15:11 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Thu, 12 Jul 2007, Alex Efros wrote:

> On Thu, Jul 12, 2007 at 10:42:18AM -0400, Charlie Brady wrote:
>> A common method of avoiding zombie processes is for a SIGCHLD handler in
>> the parent to reap the status. I wonder whether there is a possibility for
>> SIGCHLD to be queued to the wrong process (due to a race during
>> reparenting). Does runit as process 1 depend on SIGCHLD to reap zombies?
>
> Yes, this is possible. But either way it's a race in the kernel or in runit,
> and it must be fixed there.
>
> With the last patch runit tries reaping every 5 seconds instead of depending
> only on SIGCHLD. So far (uptime 5 days) there are no zombies on my servers.
> This is an ugly workaround, of course.

If you have fixed your cron job as I suggested, you will no longer have 
the reaping race, so you should never see any such zombies.




^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-12 15:11                               ` Charlie Brady
@ 2007-07-12 15:15                                 ` Alex Efros
  2007-07-12 15:40                                   ` Charlie Brady
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-07-12 15:15 UTC (permalink / raw)
  To: supervision

Hi!

On Thu, Jul 12, 2007 at 11:11:56AM -0400, Charlie Brady wrote:
> If you have fixed your cron job as I suggested, you will no longer have 
> the reaping race, so you should never see any such zombies.

Nope. As reported by me and other people in this thread, such zombies
are also produced by sshd and other software.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-12 15:15                                 ` Alex Efros
@ 2007-07-12 15:40                                   ` Charlie Brady
  0 siblings, 0 replies; 113+ messages in thread
From: Charlie Brady @ 2007-07-12 15:40 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Thu, 12 Jul 2007, Alex Efros wrote:

> On Thu, Jul 12, 2007 at 11:11:56AM -0400, Charlie Brady wrote:
>> If you have fixed your cron job as I suggested, you will no longer have
>> the reaping race, so you should never see any such zombies.
>
> Nope. As reported by me and other people in this thread, such zombies
> are also produced by sshd and other software.

I'm not suggesting that it's impossible to create zombies reparented to 
process 1 or that runit doesn't have a problem in dealing with such 
zombies. I was merely asserting that if you fix your cron job, it will no 
longer create zombies reparented to process 1.

Your cron job, as posted, is a good test case for generating such zombies, 
because of the pointless (AFAICT) fork/ignore child/exit shell script.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-06-26  9:59                     ` Gerrit Pape
  2007-07-07  7:16                       ` Alex Efros
@ 2007-07-15 14:47                       ` Alex Efros
  2007-07-15 19:07                         ` Alex Efros
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-07-15 14:47 UTC (permalink / raw)
  To: supervision

Hi!

On Tue, Jun 26, 2007 at 09:59:20AM +0000, Gerrit Pape wrote:
> From reading the code, I can't see why the runit program shouldn't
> collect these zombies on your system.  But I may be blind, let's see
> whether reaping zombies at least every 5 seconds helps.  Can you please
> apply the attached patch (it supersedes the previous patches), install
> the resulting runit program into /sbin/, reboot the machine, make sure
> that the new runit program is running as pid 1, and see whether zombies
> are left over?  Is anything printed to the console when the zombie
> problem arises?

This patch didn't fix the issue. But it looks like it takes longer for
the issue to show up: about 8 days instead of 2-3 days. Right now I have
~150 zombies on one server, ~300 zombies on another, and a third server has
already been rebooted because there were more than 8000 zombies and fork()
stopped working.

Most of the zombies are [sshd] and [chpst] processes, and there are a few
others.

There are no unusual messages in the kernel log or in syslog.


# uptime
 14:44:36 up 8 days,  7:19,  1 user,  load average: 0.90, 0.50, 0.36

# date; ps ax | grep Z | wc
Sun Jul 15 14:31:15 GMT 2007
    323    1939   14389

# date; ps ax | grep Z | wc
Sun Jul 15 14:41:36 GMT 2007
    323    1939   14389

# ps -ef ax | grep Z
feed       623     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed       866     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      1019     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      1078     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      1147     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      1320     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      1370     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      1460     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      1619     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      1798     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
sshd      1878     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1880     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1882     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1884     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1886     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1888     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1890     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1892     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1894     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1896     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1898     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1900     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1902     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1904     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1906     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1908     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1910     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1912     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1914     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      1916     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
feed      1930     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      2002     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      2092     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      2093     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      2145     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      2181     1  0 13:10 ?        Z      0:00 [chpst] <defunct>
sshd      2235     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2237     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2239     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2241     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2243     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2245     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2247     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2249     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2251     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2257     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2259     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2261     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2263     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2265     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2267     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2269     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2271     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2273     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2275     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2279     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2281     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2283     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2285     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2287     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2293     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2295     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2297     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2299     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2301     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2303     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2305     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2307     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2309     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2311     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2313     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2315     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2317     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2319     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2321     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2323     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
feed      2329     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
sshd      2344     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2354     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2356     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2358     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2360     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2362     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2364     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2366     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2369     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2373     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2375     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2377     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2379     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2381     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2387     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2389     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2391     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2393     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2395     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2397     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2399     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2401     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2403     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2405     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2407     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2409     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2411     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2413     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2415     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2421     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2423     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2425     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2427     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2429     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2431     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
feed      2454     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      2802     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      2900     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3109     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3243     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3336     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3380     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
qmailq    3604     1  0 Jul14 ?        Z      0:00 [qmail-queue] <defunct>
feed      3626     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3670     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3685     1  0 13:15 ?        Z      0:00 [chpst] <defunct>
feed      3804     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3897     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3921     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      3942     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      4233     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      4367     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      4461     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      4793     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      4954     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      5046     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      5180     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      5377     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      5467     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      5512     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      5607     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      5657     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      5955     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6089     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6225     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6357     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6587     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6637     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6775     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6795     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6898     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      6982     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      7084     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      7415     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      7507     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      7551     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      7712     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      7795     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      7840     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      7872     1  0 13:30 ?        Z      0:00 [chpst] <defunct>
feed      7974     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      8066     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      8110     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      8397     1  0 00:15 ?        Z      0:00 [chpst] <defunct>
feed      8487     1  0 00:25 ?        Z      0:00 [chpst] <defunct>
feed      8532     1  0 00:30 ?        Z      0:00 [chpst] <defunct>
feed      8669     1  0 00:45 ?        Z      0:00 [chpst] <defunct>
feed      8967     1  0 01:15 ?        Z      0:00 [chpst] <defunct>
feed      9102     1  0 01:30 ?        Z      0:00 [chpst] <defunct>
feed      9195     1  0 01:40 ?        Z      0:00 [chpst] <defunct>
feed      9240     1  0 01:45 ?        Z      0:00 [chpst] <defunct>
feed      9539     1  0 02:15 ?        Z      0:00 [chpst] <defunct>
feed      9629     1  0 02:25 ?        Z      0:00 [chpst] <defunct>
feed      9674     1  0 02:30 ?        Z      0:00 [chpst] <defunct>
feed      9677     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed      9811     1  0 02:45 ?        Z      0:00 [chpst] <defunct>
feed     10639     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     11088     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     11463     1  0 13:45 ?        Z      0:00 [chpst] <defunct>
feed     13320     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     14072     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     14082     1  0 13:55 ?        Z      0:00 [chpst] <defunct>
feed     15707     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     18128     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     18735     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     18787     1  0 14:15 ?        Z      0:00 [chpst] <defunct>
feed     20377     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     21157     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     22537     1  0 14:30 ?        Z      0:00 [chpst] <defunct>
feed     23654     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     24968     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
root     25773  1230  0 14:42 pts/1    R+     0:00 grep --colour=auto Z
feed     25849     1  0 03:10 ?        Z      0:00 [chpst] <defunct>
feed     25893     1  0 03:15 ?        Z      0:00 [chpst] <defunct>
feed     26028     1  0 03:30 ?        Z      0:00 [chpst] <defunct>
feed     26120     1  0 03:40 ?        Z      0:00 [chpst] <defunct>
feed     26165     1  0 03:45 ?        Z      0:00 [chpst] <defunct>
feed     26255     1  0 03:55 ?        Z      0:00 [chpst] <defunct>
feed     26472     1  0 04:15 ?        Z      0:00 [chpst] <defunct>
feed     26562     1  0 04:25 ?        Z      0:00 [chpst] <defunct>
feed     26755     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     26763     1  0 04:45 ?        Z      0:00 [chpst] <defunct>
feed     26853     1  0 04:55 ?        Z      0:00 [chpst] <defunct>
feed     27006     1  0 05:10 ?        Z      0:00 [chpst] <defunct>
feed     27050     1  0 05:15 ?        Z      0:00 [chpst] <defunct>
feed     27140     1  0 05:25 ?        Z      0:00 [chpst] <defunct>
feed     27185     1  0 05:30 ?        Z      0:00 [chpst] <defunct>
feed     27277     1  0 05:40 ?        Z      0:00 [chpst] <defunct>
feed     27336     1  0 05:45 ?        Z      0:00 [chpst] <defunct>
feed     27426     1  0 05:55 ?        Z      0:00 [chpst] <defunct>
feed     27516     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     27580     1  0 06:10 ?        Z      0:00 [chpst] <defunct>
feed     27759     1  0 06:30 ?        Z      0:00 [chpst] <defunct>
feed     27902     1  0 06:45 ?        Z      0:00 [chpst] <defunct>
feed     27992     1  0 06:55 ?        Z      0:00 [chpst] <defunct>
feed     28170     1  0 07:10 ?        Z      0:00 [chpst] <defunct>
feed     28214     1  0 07:15 ?        Z      0:00 [chpst] <defunct>
feed     28304     1  0 07:25 ?        Z      0:00 [chpst] <defunct>
feed     28348     1  0 07:30 ?        Z      0:00 [chpst] <defunct>
feed     28441     1  0 07:40 ?        Z      0:00 [chpst] <defunct>
feed     28789     1  0 08:15 ?        Z      0:00 [chpst] <defunct>
feed     28938     1  0 08:30 ?        Z      0:00 [chpst] <defunct>
feed     29045     1  0 08:40 ?        Z      0:00 [chpst] <defunct>
feed     29089     1  0 08:45 ?        Z      0:00 [chpst] <defunct>
feed     29179     1  0 08:55 ?        Z      0:00 [chpst] <defunct>
feed     29378     1  0 09:15 ?        Z      0:00 [chpst] <defunct>
feed     29468     1  0 09:25 ?        Z      0:00 [chpst] <defunct>
feed     29490     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     29513     1  0 09:30 ?        Z      0:00 [chpst] <defunct>
feed     29619     1  0 09:40 ?        Z      0:00 [chpst] <defunct>
feed     29675     1  0 09:45 ?        Z      0:00 [chpst] <defunct>
feed     29765     1  0 09:55 ?        Z      0:00 [chpst] <defunct>
feed     29961     1  0 10:15 ?        Z      0:00 [chpst] <defunct>
sshd     30031     1  0 10:20 ?        Z      0:00 [sshd] <defunct>
feed     30053     1  0 10:25 ?        Z      0:00 [chpst] <defunct>
feed     30097     1  0 10:30 ?        Z      0:00 [chpst] <defunct>
feed     30204     1  0 10:40 ?        Z      0:00 [chpst] <defunct>
feed     30267     1  0 10:45 ?        Z      0:00 [chpst] <defunct>
feed     30364     1  0 10:55 ?        Z      0:00 [chpst] <defunct>
feed     30516     1  0 11:10 ?        Z      0:00 [chpst] <defunct>
feed     30560     1  0 11:15 ?        Z      0:00 [chpst] <defunct>
feed     30695     1  0 11:30 ?        Z      0:00 [chpst] <defunct>
feed     30787     1  0 11:40 ?        Z      0:00 [chpst] <defunct>
feed     30831     1  0 11:45 ?        Z      0:00 [chpst] <defunct>
feed     30921     1  0 11:55 ?        Z      0:00 [chpst] <defunct>
feed     31119     1  0 12:15 ?        Z      0:00 [chpst] <defunct>
feed     31252     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     31255     1  0 12:30 ?        Z      0:00 [chpst] <defunct>
feed     31348     1  0 12:40 ?        Z      0:00 [chpst] <defunct>
feed     31392     1  0 12:45 ?        Z      0:00 [chpst] <defunct>
feed     31539     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     31584     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     31676     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     31720     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     31810     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
sshd     31851     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31853     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31855     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31857     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31859     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31861     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31889     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31891     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31893     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31895     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31897     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31899     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31909     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31912     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31919     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31927     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31929     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31939     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31941     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31943     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31945     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31947     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31949     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31951     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31953     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31955     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31957     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31959     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31961     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31963     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31965     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31967     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31969     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31971     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31973     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31975     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31977     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31979     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31981     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31983     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31985     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31987     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31989     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31991     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31993     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31995     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31997     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     31999     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32001     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32003     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32005     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32007     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32009     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32011     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32013     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32015     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32017     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32019     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32021     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32023     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32025     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32027     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32029     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32031     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32033     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd     32035     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
feed     32142     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     32262     1  9 Jul13 ?        Z    306:17 [ParserEngine] <defunct>
feed     32280     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     32374     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>
feed     32509     1  0 Jul14 ?        Z      0:00 [chpst] <defunct>

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-15 14:47                       ` Alex Efros
@ 2007-07-15 19:07                         ` Alex Efros
  2007-07-15 20:18                           ` George Georgalis
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-07-15 19:07 UTC (permalink / raw)
  To: supervision

[-- Attachment #1: Type: text/plain, Size: 1451 bytes --]

Hi!

# date; ps ax | grep Z | wc
Sun Jul 15 19:00:29 GMT 2007
    371    2227   16523

# ps -ef ax | grep perl | grep Z
root      9072     1  0 18:58 pts/1    Z      0:00 [perl] <defunct>
root      9094     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9183     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9192     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9261     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9267     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9273     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>

# perl -e 'fork && exit; sleep 1; print "$$ done\n"'
# 9392 done

# date; ps ax | grep Z | wc
Sun Jul 15 19:01:12 GMT 2007
    372    2233   16567

# ps -ef ax | grep perl | grep Z
root      9072     1  0 18:58 pts/1    Z      0:00 [perl] <defunct>
root      9094     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9183     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9192     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9261     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9267     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9273     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
root      9392     1  0 19:01 pts/1    Z      0:00 [perl] <defunct>

Can anybody help me debug this issue?
I've attached a tar file with the contents of /proc/9392/; maybe it helps.
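
For anyone who wants to repeat the test above without perl, here is a
rough Python sketch of the same idea. It is Linux-only (it polls /proc),
and the function name orphan_is_reaped is made up for illustration. It
creates an orphan the same way as the one-liner (the intermediate parent
exits first), then checks whether the orphan's /proc entry disappears,
which would mean whoever adopted it has wait()ed for it.

```python
import os
import time

def orphan_is_reaped(timeout=3.0):
    """Create an orphan, let it exit, and return (orphan_pid, reaped_in_time)."""
    r, w = os.pipe()
    child = os.fork()
    if child == 0:
        os.close(r)
        grandchild = os.fork()
        if grandchild == 0:
            # Orphan-to-be: give the intermediate parent time to exit first.
            time.sleep(0.2)
            os._exit(0)
        # Tell the original process the orphan's PID, then exit right away,
        # so the grandchild is re-parented (normally to process 1).
        os.write(w, str(grandchild).encode())
        os._exit(0)
    os.close(w)
    orphan = int(os.read(r, 32))
    os.close(r)
    os.waitpid(child, 0)  # reap our own direct child
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Assumption: once the adopting process has reaped the orphan,
        # /proc/<pid> no longer exists.
        if not os.path.exists(f"/proc/{orphan}"):
            return orphan, True
        time.sleep(0.05)
    return orphan, False
```

On a system where the adopting process reaps its children this should
return True as the second value; on the box described above I would
expect False, with the orphan showing up as <defunct> in ps.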

-- 
			WBR, Alex.

[-- Attachment #2: 9392.tar --]
[-- Type: application/x-tar, Size: 20480 bytes --]

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-15 19:07                         ` Alex Efros
@ 2007-07-15 20:18                           ` George Georgalis
  2007-07-15 20:31                             ` Paul Jarc
  0 siblings, 1 reply; 113+ messages in thread
From: George Georgalis @ 2007-07-15 20:18 UTC (permalink / raw)
  To: supervision

On Sun, Jul 15, 2007 at 10:07:57PM +0300, Alex Efros wrote:
>Hi!
>
># date; ps ax | grep Z | wc
>Sun Jul 15 19:00:29 GMT 2007
>    371    2227   16523
>
># ps -ef ax | grep perl | grep Z
>root      9072     1  0 18:58 pts/1    Z      0:00 [perl] <defunct>
>root      9094     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9183     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9192     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9261     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9267     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9273     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>
># perl -e 'fork && exit; sleep 1; print "$$ done\n"'
># 9392 done
>
># date; ps ax | grep Z | wc
>Sun Jul 15 19:01:12 GMT 2007
>    372    2233   16567
>
># ps -ef ax | grep perl | grep Z
>root      9072     1  0 18:58 pts/1    Z      0:00 [perl] <defunct>
>root      9094     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9183     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9192     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9261     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9267     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9273     1  0 18:59 pts/1    Z      0:00 [perl] <defunct>
>root      9392     1  0 19:01 pts/1    Z      0:00 [perl] <defunct>
>
>Can anybody help me debug this issue?
>I've attached tar file with contents of /proc/9392/, maybe this helps.

Try using lsof to determine which file descriptors are open, and
focus on attaching them somewhere (e.g. /dev/null) when you fork the
process. Your defunct perl processes are probably waiting for EOF
from the fork. Maybe you could close stdout/stderr of the fork?

// George


-- 
George Georgalis, information systems scientist <IXOYE><


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-15 20:18                           ` George Georgalis
@ 2007-07-15 20:31                             ` Paul Jarc
  2007-07-15 22:35                               ` George Georgalis
  0 siblings, 1 reply; 113+ messages in thread
From: Paul Jarc @ 2007-07-15 20:31 UTC (permalink / raw)
  To: supervision

"George Georgalis" <george@galis.org> wrote:
> Your defunct perl process are probably waiting for EOF from the
> fork. maybe you could close stdout/stderr of the fork?

No, a defunct process has already exited.  The only reason it still
shows up in the process list is that its parent (in this case, process
1) hasn't wait()ed for it yet.  I don't think anything about the
defunct process, at least not its file descriptors, can influence
whether the parent waits for it.
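
To make that concrete, here is a minimal sketch in Python. It is
Linux-only because it reads /proc/<pid>/stat, and the helper names
proc_state and zombie_demo are made up for illustration. The child
exits immediately, stays in state "Z" for as long as the parent does
nothing, and disappears as soon as the parent calls waitpid().

```python
import os
import time

def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # Format is "pid (comm) state ..."; comm may itself contain ")" or spaces,
    # so split on the last closing parenthesis.
    return data.rsplit(")", 1)[1].split()[0]

def zombie_demo(timeout=5.0):
    """Return the child's state before and after the parent wait()s for it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, closing all its descriptors
    deadline = time.monotonic() + timeout
    # Parent: do NOT wait yet, just watch the child turn into a zombie.
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = proc_state(pid)
    os.waitpid(pid, 0)  # this is the only thing that removes the zombie
    after = proc_state(pid)
    return before, after
```

Note that the child has no open file descriptors left once it has
exited, so nothing on the child's side is involved in whether it gets
collected.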


paul


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-15 20:31                             ` Paul Jarc
@ 2007-07-15 22:35                               ` George Georgalis
  2007-07-15 23:06                                 ` Paul Jarc
  0 siblings, 1 reply; 113+ messages in thread
From: George Georgalis @ 2007-07-15 22:35 UTC (permalink / raw)
  To: supervision

On Sun, Jul 15, 2007 at 04:31:55PM -0400, Paul Jarc wrote:
>"George Georgalis" <george@galis.org> wrote:
>> Your defunct perl process are probably waiting for EOF from the
>> fork. maybe you could close stdout/stderr of the fork?
>
>No, a defunct process has already exited.  The only reason it still
>shows up in the process list is that its parent (in this case, process
>1) hasn't wait()ed for it yet.  I don't think anything about the
>defunct process, at least not its file descriptors, can influence
>whether the parent waits for it.


That's elucidating.

But in practice, isn't the best way to deal with defunct entries to
attach the fds to a file or socket and then exec the child (which may
fork), so the parent no longer has an fd open to the child?

// George


-- 
George Georgalis, information systems scientist <IXOYE><


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-15 22:35                               ` George Georgalis
@ 2007-07-15 23:06                                 ` Paul Jarc
  2007-07-15 23:23                                   ` Charlie Brady
  2007-07-16  2:24                                   ` George Georgalis
  0 siblings, 2 replies; 113+ messages in thread
From: Paul Jarc @ 2007-07-15 23:06 UTC (permalink / raw)
  To: supervision

"George Georgalis" <george@galis.org> wrote:
> but in practice isn't the best way to deal with defunct entries by
> attaching fd to a file or socket then exec the child (which may
> fork) so the parent no longer has a fd open to the child?

It sounds like you're thinking of something like fghack, but that's a
solution to a different problem: a service that forks itself into the
background, which makes it difficult to supervise.  The problem here
is different: some processes outlive their parents, so they are
adopted by process 1, but process 1 is not wait()ing for them.
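
As a rough illustration of what "wait()ing for them" means for a
process that can have many children it never started itself, here is a
sketch of a non-blocking reap loop in Python. This is not runit's
actual code, only the general pattern; the names reap_children and
spawn_and_reap are hypothetical.

```python
import os
import time

def reap_children():
    """Collect every child that has already exited; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call return at once
            # instead of blocking when no child has exited yet.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped

def spawn_and_reap(n, timeout=5.0):
    """Fork n short-lived children, reap them all; return (spawned, reaped)."""
    spawned = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        spawned.append(pid)
    reaped = []
    deadline = time.monotonic() + timeout
    while len(reaped) < n and time.monotonic() < deadline:
        reaped.extend(reap_children())
        time.sleep(0.01)
    return sorted(spawned), sorted(reaped)
```

The loop matters because several children can exit before the parent
gets around to looking; a single waitpid() call per wake-up would leave
the rest behind as zombies.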


paul


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-15 23:06                                 ` Paul Jarc
@ 2007-07-15 23:23                                   ` Charlie Brady
  2007-07-16  0:09                                     ` Alex Efros
  2007-07-16  2:24                                   ` George Georgalis
  1 sibling, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-07-15 23:23 UTC (permalink / raw)
  To: supervision


On Sun, 15 Jul 2007, Paul Jarc wrote:

> "George Georgalis" <george@galis.org> wrote:
>> but in practice isn't the best way to deal with defunct entries by
>> attaching fd to a file or socket then exec the child (which may
>> fork) so the parent no longer has a fd open to the child?
>
> It sounds like you're thinking of something like fghack, but that's a
> solution to a different problem: a service that forks itself into the
> background, which makes it difficult to supervise.  The problem here
> is different: some processes outlive their parents, so they are
> adopted by process 1, but process 1 is not wait()ing for them.

So there are two problems there - the processes which are outliving their 
parents, and runit as process 1. Most people here seem to be ignoring the 
first problem, and instead are just looking for a magic fix by someone 
solving problem 2.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-15 23:23                                   ` Charlie Brady
@ 2007-07-16  0:09                                     ` Alex Efros
  2007-07-16  2:11                                       ` Charlie Brady
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-07-16  0:09 UTC (permalink / raw)
  To: supervision

Hi!

On Sun, Jul 15, 2007 at 07:23:13PM -0400, Charlie Brady wrote:
> So there are two problems there - the processes which are outliving their 
> parents, and runit as process 1. Most people here seem to be ignoring the 
> first problem, and instead are just looking for a magic fix by someone 
> solving problem 2.

Ohh. Okay, okay, I think we all agree with you that 'generating zombies'
is a Bad Thing (tm). But the real world is slightly different from the
ideal world. In the real world we have 'zombie processes', which are
part of the *NIX architecture, and the problem can't be solved by just
no longer generating zombies: there are a lot of existing applications
(like OpenSSH) which already generate zombies, and there are cases
where zombies may and will be generated anyway.

In this situation, the Right Thing is to solve this issue between runit
and the Linux kernel.

So, if this is a race condition bug in Linux kernel 2.6.20, how do we
debug it? Maybe with some sort of patch which adds debug printf()s to
both the kernel AND runit? Maybe the bug is not in the kernel, but in
glibc's wrapper for wait() or somewhere else? I'm not a C programmer,
so it's hard for me to debug this myself. :(

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-16  0:09                                     ` Alex Efros
@ 2007-07-16  2:11                                       ` Charlie Brady
  2007-09-12 12:53                                         ` Radek Podgorny
       [not found]                                         ` <47939.::ffff:77.75.72.5.1189601606.squirrel@mail.podgorny.cz>
  0 siblings, 2 replies; 113+ messages in thread
From: Charlie Brady @ 2007-07-16  2:11 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision



On Mon, 16 Jul 2007, Alex Efros wrote:

> On Sun, Jul 15, 2007 at 07:23:13PM -0400, Charlie Brady wrote:
>> So there are two problems there - the processes which are outliving their
>> parents, and runit as process 1. Most people here seem to be ignoring the
>> first problem, and instead are just looking for a magic fix by someone
>> solving problem 2.
>
> Ohh. Okay, okay, I think we all agree with you about 'generating zombies'
> is a Bad Thing (tm). But real world is slightly different from ideal world.
> In real world we've a 'zombie processes', which are part of *NIX
> architecture, and which can't be solved by just stopping generating
> zombies - because there a lot of existing applications (like OpenSSH)
> which already generate zombies, and because there exists some cases when
> zombies may and will be generated anyway.

Sure they will. But in every case except a daemon which was given a term 
signal I'd say it is a bug.

I've seen no evidence that openssh generates zombies.

> In this situation, the Right Thing is solve this issue between runit and
> linux kernel.

That's one of the right things, yes.

> So. If this is a race condition bug in linux kernel 2.6.20, how to debug it?

Have a look at SystemTap.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-15 23:06                                 ` Paul Jarc
  2007-07-15 23:23                                   ` Charlie Brady
@ 2007-07-16  2:24                                   ` George Georgalis
  1 sibling, 0 replies; 113+ messages in thread
From: George Georgalis @ 2007-07-16  2:24 UTC (permalink / raw)
  To: supervision

On Sun, Jul 15, 2007 at 07:06:18PM -0400, Paul Jarc wrote:
>"George Georgalis" <george@galis.org> wrote:
>> but in practice isn't the best way to deal with defunct entries by
>> attaching fd to a file or socket then exec the child (which may
>> fork) so the parent no longer has a fd open to the child?
>
>It sounds like you're thinking of something like fghack, but that's a
>solution to a different problem: a service that forks itself into the
>background, which makes it difficult to supervise.  The problem here
>is different: some processes outlive their parents, so they are
>adopted by process 1, but process 1 is not wait()ing for them.

I've never tried fghack, nor had problems with
defunct supervised processes (nor PID 1 adoptions).

What I have seen is emacs sessions run in screen
(for weeks at a time) which invoke an interactive
R process, which forks non-interactive R processes
that do the heavy lifting and eventually write their
stdout to the R process that forked them. *sigh*

That approach to invocation creates a defunct R
process for every heavy-lifting R process that
completes. On the other hand, the interactive R is
able to aggregate output (invoked within the emacs
tools/env) and do post-processing when the child
completes with stdout.
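
For a long-lived parent like that, one application-side option (when
the exit status of the children is not needed) is to ignore SIGCHLD,
in which case Linux discards exited children instead of keeping them
as zombies. A minimal Python sketch, Linux-only because it polls /proc,
with a made-up function name; I am not claiming this is how R or emacs
should be changed, only showing the mechanism:

```python
import os
import signal
import time

def child_vanishes_without_wait(timeout=3.0):
    """Fork a child with SIGCHLD ignored; return True if no zombie is left."""
    previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            # The parent never calls wait(); the entry should still go away.
            if not os.path.exists(f"/proc/{pid}"):
                return True
            time.sleep(0.01)
        return False
    finally:
        # Restore the earlier disposition (fall back to the default if the
        # previous handler was not installed from Python).
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)
```

If the parent does need the exit status or has to do post-processing
when a child finishes, it has to wait() for each child itself instead.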

I've never heard of adoption when a child outlives
its parent.

// George


-- 
George Georgalis, information systems scientist <IXOYE><


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-07-16  2:11                                       ` Charlie Brady
@ 2007-09-12 12:53                                         ` Radek Podgorny
       [not found]                                         ` <47939.::ffff:77.75.72.5.1189601606.squirrel@mail.podgorny.cz>
  1 sibling, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2007-09-12 12:53 UTC (permalink / raw)
  To: Alex Efros, supervision

Hi! Any progress on this? Alex, have you found at least a workaround? This
is getting really annoying as I have to reboot my servers manually (ssh
can't fork for remote login)... :-(

Radek P.






^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
       [not found]                                         ` <47939.::ffff:77.75.72.5.1189601606.squirrel@mail.podgorny.cz>
@ 2007-09-12 13:55                                           ` Charlie Brady
  2007-09-12 14:35                                             ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-09-12 13:55 UTC (permalink / raw)
  To: Radek Podgorny; +Cc: Alex Efros, supervision


On Wed, 12 Sep 2007, Radek Podgorny wrote:

> Hi! Any progress on this? Alex, have you found at least a workaround? This
> is getting really annoying as I have to reboot my servers manually ...

You can make the problem (whatever it is) a non-issue for you, as it is 
for nearly everyone else, if you can fix whichever run script is 
generating zombies. It's possible, believe me.

[I've still seen no evidence that openssh generates zombies.]

> (ssh
> can't fork for remote login)... :-(
>
> Radek P.
>
>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 13:55                                           ` Charlie Brady
@ 2007-09-12 14:35                                             ` Alex Efros
  2007-09-12 14:55                                               ` Charlie Brady
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-12 14:35 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 09:55:16AM -0400, Charlie Brady wrote:
>> Hi! Any progress on this? Alex, have you found at least a workaround? This
>> is getting really annoying as I have to reboot my servers manually ...

Nope. Chances are I'll write a script to check the number of zombies every
10 minutes and reboot if there are more than 100. :~( I'm tired of manually
monitoring and rebooting servers every 2-7 days.
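
A minimal sketch of such a check, assuming a Linux /proc filesystem. The
function names are illustrative only, and the reboot step is deliberately
left out; the script just reports whether the count exceeds the limit:

```python
import os


def process_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_zombies(proc_root="/proc"):
    """Count processes whose state is 'Z' (zombie) under proc_root."""
    if not os.path.isdir(proc_root):
        return 0  # no procfs here (assumption: treat as "nothing to count")
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                if process_state(f.read()) == "Z":
                    zombies += 1
        except (OSError, IndexError):
            continue  # process vanished or the line could not be parsed
    return zombies


if __name__ == "__main__":
    LIMIT = 100  # threshold mentioned in the message
    n = count_zombies()
    print(f"{n} zombies" + (" (over limit)" if n > LIMIT else ""))
```

Run from cron every 10 minutes, this would only report the count; what to
do when the limit is exceeded is a separate decision.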

> You can make the problem (whatever it is) a non-issue for you, as it is for 
> nearly everyone else, if you can fix whichever run script is generating 
> zombies. It's possible, believe me.
>
> [I've still seen no evidence that openssh generates zombies.]

I'm glad that you see no evidence, but, unfortunately for me, I see this
evidence in my `ps` output every week or so. Please stop repeating yourself.
We all already know what you think about this issue. There IS a bug
somewhere (runit/kernel/somewhere else) and you aren't helping us fix it.
The idea is: no matter what the user is doing, there shouldn't be an
increasing number of unreaped zombies in the system. If that doesn't hold,
then it is a bug, and it should be fixed. Asking the user not to do
something (don't run chpst -L from cron) which merely increases the
_probability_ of hitting that bug isn't a solution at all, because there is
other software which also produces unreaped zombies (like ssh), and because
chpst doesn't do anything wrong - just like ssh and other software.
Your recommendation sounds like 'start fewer short-lived processes', which
is idiocy! A server should work, and if its work is to run a lot of
short-lived processes, then it should do this reliably, without
requiring a reboot every few days. Sorry for my emotions - I now have a
lot of Linux servers which work just like Windows - from reboot to reboot -
and that makes me a little angry...

>>>> So. If this is a race condition bug in linux kernel 2.6.20, how to debug it?
>>> Have a look at SystemTap.

Sadly, I've had a lot of work these last months, so I haven't tried to debug
the kernel myself. (I've tried asking the Gentoo kernel devs to look into
this issue, but it looks like they don't believe the problem is in
glibc/kernel, and they point me back to runit.)

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 14:35                                             ` Alex Efros
@ 2007-09-12 14:55                                               ` Charlie Brady
  2007-09-12 15:00                                                 ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-09-12 14:55 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Wed, 12 Sep 2007, Alex Efros wrote:

> We all already know what you think about this issue. There IS a bug
> somewhere (runit/kernel/somewhere else) and you don't help us to fix it.
> The idea is: no matter what user are doing, there shouldn't be increasing
> number of unreaped zombies in the system.

Sure, but until more detail is known about the exact circumstances in which 
such unreaped zombies appear, there's little chance that anyone can fix the 
bug.

> ... because there different software which also
> produce unreaped zombies (like ssh).

You keep saying that, but I continue to doubt it. If you can document that 
that occurs, I'm sure that the ssh maintainers will want to fix the bug.

> Your recommendation sounds like 'start less short-living processes', which
> is idiocy!

No, that's not my recommendation. My recommendation is that you do not 
deploy software which creates zombies.

> Server should work, and if it work is to run a lot of
> short-living processes - then it should do this in reliable manner without
> requiring reboot every several days.

Agreed. Short-living processes are fine, and if their parent process reaps 
their status, they won't become zombies.
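
As a rough illustration of a parent reaping its short-lived children (a
sketch only, not taken from runit or sshd; the helper names are made up
for this example):

```python
import os
import time


def spawn_short_lived(count):
    """Fork `count` children that exit immediately; return their PIDs."""
    pids = []
    for i in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(i)  # child: exit at once with a distinct status
        pids.append(pid)
    return pids


def reap_children():
    """Collect the status of every exited child without blocking.

    Returns {pid: exit_code}; a negative value means killed by a signal.
    """
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        else:
            reaped[pid] = -os.WTERMSIG(status)
    return reaped


def run_demo(count=3, timeout=5.0):
    """Spawn children and keep reaping until all are collected."""
    pids = spawn_short_lived(count)
    reaped = {}
    deadline = time.monotonic() + timeout
    while len(reaped) < count and time.monotonic() < deadline:
        reaped.update(reap_children())
        time.sleep(0.01)
    return pids, reaped


if __name__ == "__main__":
    pids, reaped = run_demo()
    print(f"spawned {len(pids)}, reaped {len(reaped)}")
```

As long as the parent keeps calling something like reap_children(), each
child is a zombie only for the short interval between its exit and the
next waitpid() call.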

> Sorry for my emotions - now I've a
> lot of Linux servers which work just like Windows - from reboot to reboot -
> and that makes me a little angry...

My advice is that you don't get angry, but you fix the problem. Please go 
back to the discussion of your cron script on June 12. I still can't see 
any reason why you are using cron. Just run runsvdir as a supervised 
process. Your process tree will be something like:

runit
\_ runsvdir -P /service log: ..................
   \_ runsv soft.p
     \_ runsvdir -P /var/www/soft.p/html/.lib/service
       \_ runsv soft.p.service1
         \_ service1
       \_ runsv soft.p.service2
         \_ service2
       \_ runsv soft.p.service3
         \_ service3
   \_ runsv soft.q
     \_ runsvdir -P /var/www/soft.q/html/.lib/service
   \_ runsv dnscache
...

Trying to start a new unsupervised runsvdir in 
/var/www/soft.p/html/.lib/service via cron (as you were doing) is just 
asking for trouble - as well as doing lots of unnecessary work.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 14:55                                               ` Charlie Brady
@ 2007-09-12 15:00                                                 ` Alex Efros
  2007-09-12 16:02                                                   ` Charlie Brady
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-12 15:00 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 10:55:18AM -0400, Charlie Brady wrote:
>> ... because there different software which also
>> produce unreaped zombies (like ssh).
> You keep saying that, but I continue to doubt it. If you can document that 
> that occurs, I'm sure that the ssh maintainers will want to fix the bug.

Are you listening to me? My setup with cron just INCREASES the PROBABILITY,
nothing more. What about Radek Podgorny - I think he doesn't use cron to
start runsvdir, and he has the issue with ssh..?

And what does 'I continue to doubt' mean - do you think we're lying to you?!

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 15:00                                                 ` Alex Efros
@ 2007-09-12 16:02                                                   ` Charlie Brady
  2007-09-12 16:10                                                     ` Radek Podgorny
  2007-09-12 17:22                                                     ` Alex Efros
  2007-09-12 16:04                                                   ` Radek Podgorny
       [not found]                                                   ` <35517.::ffff:77.75.72.5.1189613042.squirrel@mail.podgorny.cz>
  2 siblings, 2 replies; 113+ messages in thread
From: Charlie Brady @ 2007-09-12 16:02 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Wed, 12 Sep 2007, Alex Efros wrote:

> On Wed, Sep 12, 2007 at 10:55:18AM -0400, Charlie Brady wrote:
>>> ... because there different software which also
>>> produce unreaped zombies (like ssh).
>> You keep saying that, but I continue to doubt it. If you can document that
>> that occurs, I'm sure that the ssh maintainers will want to fix the bug.
>
> Are you listen to me? My solution with cron is just INCREASE PROBABILITY,
> nothing more.

Sure. And if you decrease the probability to zero, you don't have a 
problem any more.

> What about Radek Podgorny - I think he doesn't use cron to
> start runsvdir, and he has issue with ssh..?

I don't know the details of his problem.

> And what does mean 'I continue to doubt' - you think we're lying to you?!

No, I just haven't seen any evidence. I suspect you are misinterpreting 
the misbehaviour of some program started from ssh, and attributing that 
program's failures to ssh. ssh is always used to start other programs, and 
other programs can always generate zombies. There's nothing ssh can do to 
prevent a child program of it from creating zombies. If ssh is at fault, 
details would be useful, because then someone can find the fault in ssh 
and fix it. Until someone provides evidence that ssh is creating zombies, 
then it's quite reasonable for me to assume that it isn't doing so.

I remain convinced that your problem can be fixed by using runsvdir and 
runsv as they are designed to be used. We can advise you how to do that. 
But if you'd prefer to do strange things with cron, and continue to have 
problems, and point fingers at runit/glibc/kernel, then you have a free 
choice to do that.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 15:00                                                 ` Alex Efros
  2007-09-12 16:02                                                   ` Charlie Brady
@ 2007-09-12 16:04                                                   ` Radek Podgorny
       [not found]                                                   ` <35517.::ffff:77.75.72.5.1189613042.squirrel@mail.podgorny.cz>
  2 siblings, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2007-09-12 16:04 UTC (permalink / raw)
  To: supervision

No offense but I think cooling down a little bit would bring us closer to
the solution (that applies to both Alex and Charlie ;-) )...

Charlie, is there any specific info that would help you? I'm not skilled
enough to debug this myself, so all I can offer is to be someone's (your)
hands and eyes...

Alex, did I get it right that you use Gentoo? On what architecture? Stable
or unstable? I use Gentoo on all my machines (stable/unstable mix, x86/amd64
mix, different kernels, ...) and some machines are OK, others are not.
Maybe this is Gentoo-specific somehow (exotic USE flags for glibc, wrong
gcc?...). I'll get the versions from my machines and post them here; could
you please do the same? Let's find what's common...

Radek P.






^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 16:02                                                   ` Charlie Brady
@ 2007-09-12 16:10                                                     ` Radek Podgorny
  2007-09-12 17:22                                                     ` Alex Efros
  1 sibling, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2007-09-12 16:10 UTC (permalink / raw)
  To: supervision

>
> On Wed, 12 Sep 2007, Alex Efros wrote:
>
>> On Wed, Sep 12, 2007 at 10:55:18AM -0400, Charlie Brady wrote:
>>>> ... because there different software which also
>>>> produce unreaped zombies (like ssh).
>>> You keep saying that, but I continue to doubt it. If you can document
>>> that
>>> that occurs, I'm sure that the ssh maintainers will want to fix the
>>> bug.
>>
>> Are you listen to me? My solution with cron is just INCREASE
>> PROBABILITY,
>> nothing more.
>
> Sure. And if you decrease the probability to zero, you don't have a
> problem any more.

The problem is you can't push it to zero. :-( Imagine the system as a car,
zombies being accidents and runit (or init in general - reaping zombies)
being the seatbelts and airbag. You can be the best driver in the world
but still, would you buy a car without seatbelts and airbags? ;-)
Accidents shouldn't happen (we have rules, right?) but actually, they
do... :-(





^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
       [not found]                                                   ` <35517.::ffff:77.75.72.5.1189613042.squirrel@mail.podgorny.cz>
@ 2007-09-12 17:04                                                     ` Alex Efros
  2007-09-12 19:38                                                       ` Mike Buland
                                                                         ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-12 17:04 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 06:04:02PM +0200, Radek Podgorny wrote:
> Alex, did I get it right you use gentoo? On what architecture? Stable or

Stable x86 (except a few ~x86 packages like runit and svlogd), all 32-bit.

I use Hardened Gentoo, and one idea is that its GrSecurity/PaX patches
introduce the bug - this would explain why a lot of vanilla kernel users
don't see it.
Another idea is that some other Gentoo-specific kernel patch is responsible.
To test this I would have to stop using GrSecurity/PaX on production servers
for weeks, and I dislike that idea.

> unstable? I use gentoo on all my machines (stable/unstable mix, x86/amd64
> mix, different kernels, ...) and some machines are OK, others are not.

Yeah, I have one server which doesn't have this issue. Its admin made a
mistake many months ago - he installed a gcc that is too new (one which
doesn't support the hardened patches yet - SSP and PIE), and he is afraid to
downgrade it on a production server. He is waiting until hardened patches
are released for that gcc version to come back to hardened land. This is the
only noticeable difference between our servers.

> Maybe this is gentoo specific somehow (exotic USE for glibc, wrong
> gcc?...). I'll get the versions from my machines and post it here, could
> you please do the same? Let's find what's common...

My servers and workstation use (unique lines) (all of them have this issue):
  2.6.20-hardened-r6 SMP i686 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel
  2.6.20-hardened-r6     i686 Intel(R) Pentium(R) 4 CPU 2.80GHz GenuineIntel
  2.6.20-hardened-r6     i686 Intel(R) Pentium(R) 4 CPU 3.00GHz GenuineIntel
  2.6.20-hardened-r6     i686 AMD Athlon(tm) 64 Processor 3500+ AuthenticAMD
Server without zombie issue use:
  2.6.20-hardened-r6     i686 Intel(R) Celeron(R) CPU 2.00GHz GenuineIntel
The kernel configuration is identical on the server without zombies and on my P4 servers.

All servers use:
  sys-libs/glibc-2.5-r4
  sys-devel/binutils-2.17

My servers use:
  sys-devel/gcc-3.4.6-r2  (with SSP and PIE)
Server without zombie issue use:
  sys-devel/gcc-4.1.1-r3

I've tried runit from 1.5.0 to 1.7.2, with patches from this mailing list,
on my servers. The server without this issue runs runit 1.5.0.

USE-flags on all servers are same:
  sys-kernel/hardened-sources-2.6.20-r6
    USE="-build -symlink"
  sys-libs/glibc-2.5-r4
    USE="hardened nls nptl nptlonly -build -debug -glibc-compat20 -glibc-omitfp -multilib -profile (-selinux)"
  sys-devel/binutils-2.17
    USE="nls -multislot -multitarget -test -vanilla"
  sys-devel/gcc-3.4.6-r2
    USE="hardened nls (-altivec) -bootstrap -boundschecking -build -d -doc -fortran -gcj -gtk -ip28 -ip32r10k -multilib -multislot (-n32) (-n64) -nocxx -nopie -nossp -objc -test -vanilla"
  sys-process/runit-1.7.2
    USE="-static"

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 16:02                                                   ` Charlie Brady
  2007-09-12 16:10                                                     ` Radek Podgorny
@ 2007-09-12 17:22                                                     ` Alex Efros
  2007-09-12 17:40                                                       ` Charlie Brady
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-12 17:22 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 12:02:48PM -0400, Charlie Brady wrote:
> No, I just haven't seen any evidence. I suspect you are misinterpreting the 
> misbehaviour of some program started from ssh, and attributing that 
> program's failures to ssh. ssh is always used to start other programs, and 
> other programs can always generate zombies. There's nothing ssh can do to 

I don't think it's 'other programs'. If this issue were caused by
'other programs', I would probably see their names in the `ps` output,
whereas what I see is '[sshd]'. I think this is the reason for ssh zombies:

(14)   auth.err:  Sep  5 09:02:00 sshd[3133]: error: channel 0: chan_read_failed for istate 3
(29)   auth.info: Sep  5 18:13:37 sshd[1022]: Did not receive identification string from 85.17.106.138
(3789) auth.info: Sep  6 13:27:18 sshd[5016]: Invalid user apple from 81.228.45.11
(102)  auth.info: Sep  6 13:27:52 sshd[5144]: User mysql not allowed because account is locked
(576)  auth.info: Sep 11 16:24:04 sshd[1210]: Address 66.236.207.196 maps to intra-works.com, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
(1)    auth.info: Sep 11 16:39:13 sshd[1323]: User ldap not allowed because shell /dev/null is not executable

The number in parentheses is the count of lines in my log similar to the one shown.

This is common enough nowadays: SSH worms trying to hack our systems.
My sshd has password authentication disabled, so I'm not too worried about
these worms... but it looks like they force sshd to fork and exit very
quickly because of failed auth, and so sshd starts producing unreaped
zombies at some point.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 17:22                                                     ` Alex Efros
@ 2007-09-12 17:40                                                       ` Charlie Brady
  2007-09-12 18:18                                                         ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-09-12 17:40 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Wed, 12 Sep 2007, Alex Efros wrote:

> On Wed, Sep 12, 2007 at 12:02:48PM -0400, Charlie Brady wrote:
>> No, I just haven't seen any evidence. I suspect you are misinterpreting the
>> misbehaviour of some program started from ssh, and attributing that
>> program's failures to ssh. ssh is always used to start other programs, and
>> other programs can always generate zombies. There's nothing ssh can do to
>
> I don't think it's 'other programs'. If this issue happens with
> 'other programs', then I'll probably see 'other programs' names in `ps`
> output, while I have seen '[sshd]'.

Indeed. Please remember that we haven't seen your ps output.

> I think this is the reason for ssh zombies:
>
> (14)   auth.err:  Sep  5 09:02:00 sshd[3133]: error: channel 0: chan_read_failed for istate 3
> (29)   auth.info: Sep  5 18:13:37 sshd[1022]: Did not receive identification string from 85.17.106.138
> (3789) auth.info: Sep  6 13:27:18 sshd[5016]: Invalid user apple from 81.228.45.11
> (102)  auth.info: Sep  6 13:27:52 sshd[5144]: User mysql not allowed because account is locked
> (576)  auth.info: Sep 11 16:24:04 sshd[1210]: Address 66.236.207.196 maps to intra-works.com, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
> (1)    auth.info: Sep 11 16:39:13 sshd[1323]: User ldap not allowed because shell /dev/null is not executable
>
> The number in a parens is amount of lines in my log similar to shown above.

Well, they're not the reason for ssh zombies. They're just sshd log
messages, which won't cause zombies. Zombies are caused by programming errors.

> This is usual enough nowadays. SSH worms trying to hack our systems.

Yep, everyone sees them. Not everyone sees sshd zombies.

> My sshd has password authentication disabled, so I'm not worry much about
> these worms... but looks like they force sshd to fork and exit very
> quickly because of failed auth, and so sshd start producing unreaped
> zombies at some point.

If the parent sshd continues to run, then it can fork lots of children, 
all or many of which exit very quickly, and there will still be no zombies 
reparented to init. There's something more going on here. You would be 
well advised to report the problem to whoever maintains the ssh which you 
use and/or the ssh maintainers.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 17:40                                                       ` Charlie Brady
@ 2007-09-12 18:18                                                         ` Alex Efros
  2007-09-12 19:07                                                           ` Charlie Brady
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-12 18:18 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 01:40:07PM -0400, Charlie Brady wrote:
> Indeed. Please remember that we haven't seen your ps output.

Oh, really? How about this one:
http://article.gmane.org/gmane.comp.sysutils.supervision.general/1422
and this:
http://article.gmane.org/gmane.comp.sysutils.supervision.general/1482

> If the parent sshd continues to run, then it can fork lots of children, all 
> or many of which exit very quickly, and there will still be no zombies 
> reparented to init. There's something more going on here. You would be well 
> advised to report the problem to whoever maintains the ssh which you use 
> and/or the ssh maintainers.

Hmm. This sounds reasonable enough; I hadn't thought about this.
Actually, the parent sshd never exits - I never see /var/service/ssh/run
restarted! But in my `ps` output the PPID of the zombie [sshd] is 1...
Maybe sshd is doing setsid() to detach from the parent sshd? But usually
this isn't the case:

  869 ?        Ss     0:00  \_ runsv ssh
23079 ?        S      0:00  |   \_ /usr/sbin/sshd -D
26899 ?        Ss     0:00  |       \_ sshd: root@pts/0    
26901 pts/0    Ss+    0:00  |           \_ -bash
26907 pts/0    S+     0:00  |               \_ screen -dR

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 18:18                                                         ` Alex Efros
@ 2007-09-12 19:07                                                           ` Charlie Brady
  2007-09-12 19:13                                                             ` Alex Efros
  0 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-09-12 19:07 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Wed, 12 Sep 2007, Alex Efros wrote:

> On Wed, Sep 12, 2007 at 01:40:07PM -0400, Charlie Brady wrote:
>> Indeed. Please remember that we haven't seen your ps output.
>
> Oh, really? How about this one:
> http://article.gmane.org/gmane.comp.sysutils.supervision.general/1422
> and this:
> http://article.gmane.org/gmane.comp.sysutils.supervision.general/1482

Yep, you've got me there.

>> If the parent sshd continues to run, then it can fork lots of children, all
>> or many of which exit very quickly, and there will still be no zombies
>> reparented to init. There's something more going on here. You would be well
>> advised to report the problem to whoever maintains the ssh which you use
>> and/or the ssh maintainers.
>
> Hmm. This sounds reasonable enough, I haven't think about this.
> Actually, parent ssh never exit - I never see /var/service/ssh/run restarted!

OK, so that means that any zombie process must be at least a child of a 
child.

If we look at this:

...
sshd      2421     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2423     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2425     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2427     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2429     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
sshd      2431     1  0 Jul14 ?        Z      0:00 [sshd] <defunct>
...

you'll see every second pid is a zombie. This could occur if the ancestor 
sshd forks, then the child forks again, and the child (the parent of the 
grandchild) exits without waiting for the grandchild. Once the grandchild 
exits, it will be a zombie until process 1 reaps its status.

In the example shown, let's make the ancestor sshd process 100. Then it 
forks and produces process 2420. 2420 forks to produce 2421, then exits. 
100 reaps the exit status of 2420, so 2420 disappears from the process 
table. Then 2421 exits, and appears as a zombie until its status is reaped 
by proc 1.

100 forks again and produces process 2422. 2422 forks to produce 2423, 
then exits. 100 reaps the exit status of 2422, so 2422 disappears from the 
process table. Then 2423 exits, and appears as a zombie until its status 
is reaped by proc 1.

etc.
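
The fork/fork/exit sequence above can be replayed in a toy model. This is
only a sketch of the bookkeeping, not real kernel behaviour; the class and
function names are hypothetical, and the ancestor sshd (pid 100) is
attached directly to process 1 for simplicity:

```python
class ProcessTable:
    """Toy model of parent/child bookkeeping; not a real kernel."""

    def __init__(self):
        # pid -> {"ppid": parent pid, "state": "R" (running) or "Z" (zombie)}
        self.procs = {1: {"ppid": 0, "state": "R"}}

    def fork(self, ppid, pid):
        """Record a new running process `pid` with parent `ppid`."""
        self.procs[pid] = {"ppid": ppid, "state": "R"}

    def exit(self, pid):
        """Mark `pid` as exited and hand its children to process 1."""
        for proc in self.procs.values():
            if proc["ppid"] == pid:
                proc["ppid"] = 1
        # The entry stays in the table as a zombie until its parent waits.
        self.procs[pid]["state"] = "Z"

    def wait(self, ppid):
        """Parent `ppid` reaps all its zombie children; return their pids."""
        reaped = [pid for pid, proc in self.procs.items()
                  if proc["ppid"] == ppid and proc["state"] == "Z"]
        for pid in reaped:
            del self.procs[pid]
        return reaped

    def zombies(self):
        """Return the sorted pids of all zombies still in the table."""
        return sorted(pid for pid, proc in self.procs.items()
                      if proc["state"] == "Z")


def replay(children, reaping_init):
    """Replay fork/fork/exit once per child pid; return leftover zombies."""
    table = ProcessTable()
    table.fork(1, 100)              # ancestor sshd
    for child in children:
        grandchild = child + 1
        table.fork(100, child)      # ancestor forks a child
        table.fork(child, grandchild)
        table.exit(child)           # child exits without waiting
        table.wait(100)             # ancestor reaps the child
        table.exit(grandchild)      # grandchild's parent is now process 1
        if reaping_init:
            table.wait(1)           # a working init collects it here
    return table.zombies()


if __name__ == "__main__":
    print(replay([2420, 2422, 2424], reaping_init=False))
    print(replay([2420, 2422, 2424], reaping_init=True))
```

With reaping_init=False the odd-numbered grandchildren pile up with
parent 1, matching the every-second-pid pattern in the ps listing; with
reaping_init=True none remain.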

The ssh maintainers should be interested in your process table.

If you mention what version you are running, someone might be interested 
to go looking through the ssh code to find out how the scenario you show 
could have occurred. An strace which captures the fork/fork/exit sequence 
as it happens would be very useful.

---
Charlie


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 19:07                                                           ` Charlie Brady
@ 2007-09-12 19:13                                                             ` Alex Efros
  2007-09-12 19:18                                                               ` Charlie Brady
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-12 19:13 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 03:07:59PM -0400, Charlie Brady wrote:
> If you mention what version you are running, someone might be interested to 
> go looking through the ssh code to find out how the scenario you show could 
> have occurred. An strace which captures the fork/fork/exit sequence as it 
> happens would be very useful.

This is Gentoo... it's a moving target. :) OpenSSH was upgraded several
times on my servers in the last few months. In Jun it was 4.5_p1, then in
Aug 4.6_p1, and 3 days ago it was upgraded to 4.7_p1.

I don't see how fixing ssh will solve my issue with the servers, but I'll
try to gather more information about ssh the next time this issue happens
on my servers.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-09-12 19:13                                                             ` Alex Efros
@ 2007-09-12 19:18                                                               ` Charlie Brady
  2007-09-12 19:30                                                                 ` Alex Efros
  2007-09-15 13:36                                                                 ` Alex Efros
  0 siblings, 2 replies; 113+ messages in thread
From: Charlie Brady @ 2007-09-12 19:18 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Wed, 12 Sep 2007, Alex Efros wrote:

> I don't see how fixing ssh will solve my issue with servers, but I'll try
> to gather more information about ssh next time this issue happens on my
> servers.

It won't, but if you can fix it, it will reduce the severity of your 
problem with runit process 1. If you fix your runsvdir-related cron job 
problem (which leaves all the chpst zombies), then that will further 
reduce the severity of your problem.

Have you considered not running runit as process 1, until someone finds 
and fixes the zombie reaping problem?



* Re: runit not collecting zombies
  2007-09-12 19:18                                                               ` Charlie Brady
@ 2007-09-12 19:30                                                                 ` Alex Efros
  2007-09-12 19:37                                                                   ` Charlie Brady
  2007-09-15 13:36                                                                 ` Alex Efros
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-12 19:30 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 03:18:02PM -0400, Charlie Brady wrote:
> Have you considered not running runit as process 1, until someone finds and 
> fixes the zombie reaping problem?

I don't think it would solve this issue, because I don't believe this is a
runit bug. After trying the last patch for 1.7.2 (which tries to do waitpid()
every 5 seconds, no matter whether SIGCHLD was received or not), I'm 99.9%
sure this bug is outside runit - and switching to sysvinit won't solve it.
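
For readers who haven't seen that patch, the idea is roughly the following
(a sketch of the technique in Python, not the actual patch, which is C
inside runit; reap_all and reap_periodically are names made up for this
illustration). waitpid(-1, WNOHANG) is called in a loop until nothing is
left to collect, and that loop runs on a timer rather than only when
SIGCHLD arrives.

```python
import os
import time


def reap_all():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, status) pairs.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # ECHILD: this process has no children at all
        if pid == 0:
            break  # children exist, but none of them has exited yet
        reaped.append((pid, status))
    return reaped


def reap_periodically(interval, rounds):
    """Run reap_all() every `interval` seconds, SIGCHLD or not."""
    collected = []
    for _ in range(rounds):
        collected.extend(reap_all())
        time.sleep(interval)
    return collected


if __name__ == "__main__":
    for _ in range(3):
        if os.fork() == 0:
            os._exit(0)  # child exits at once; a zombie until it is reaped
    print(reap_periodically(interval=0.1, rounds=5))
```

If zombies accumulate even with a loop like this running in process 1, a
missed SIGCHLD alone would not explain it.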

But I should try this - just to be able to go back to the Gentoo devs and
say: I have the same bug with sysvinit - NOW will you try to fix it, or not?
The problem is that I use custom /etc/runit/{1,2,3} files, and don't use
/etc/init.d/* scripts at all. So, switching to sysvinit will not be easy.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-09-12 19:30                                                                 ` Alex Efros
@ 2007-09-12 19:37                                                                   ` Charlie Brady
  0 siblings, 0 replies; 113+ messages in thread
From: Charlie Brady @ 2007-09-12 19:37 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Wed, 12 Sep 2007, Alex Efros wrote:

> On Wed, Sep 12, 2007 at 03:18:02PM -0400, Charlie Brady wrote:
>> Have you considered not running runit as process 1, until someone finds and
>> fixes the zombie reaping problem?
>
> I don't think it solve this issue because I don't believe this is runit bug.

I haven't seen evidence that it's not a runit bug. If you have the same 
problem with another init, then you do have some evidence.

> But I should try to do this - just to be able to return to gentoo devs and
> say: I've same bug with sysvinit - NOW you'll try to fix it, or not?
> The problem with this is what I use custom /etc/runit/{1,2,3} files, and
> don't use /etc/init.d/* scripts at all. So, switching to sysvinit will not
> be ease.

Running your /etc/runit/{1,2} scripts in the correct order shouldn't be 
difficult with /etc/inittab. You won't use /etc/init.d/* at all unless you 
reference them from /etc/inittab.

/etc/runit/3 isn't relevant for the problem you are seeing.



* Re: runit not collecting zombies
  2007-09-12 17:04                                                     ` Alex Efros
@ 2007-09-12 19:38                                                       ` Mike Buland
  2007-09-12 20:28                                                         ` Alex Efros
  2007-09-13  8:58                                                       ` Radek Podgorny
       [not found]                                                       ` <50411.::ffff:77.75.72.5.1189673890.squirrel@mail.podgorny.cz>
  2 siblings, 1 reply; 113+ messages in thread
From: Mike Buland @ 2007-09-12 19:38 UTC (permalink / raw)
  To: supervision

Hello,

On Wednesday 12 September 2007 11:04:50 am Alex Efros wrote:
> Yeah, I've one server which don't have this issue. His admin made a
> mistake many months ago - he installed too new gcc (which isn't support
> hardened patches yet - SSP and PIE), and afraid to disgrade it on
> production server. He wait until hardened patches will be released for
> that gcc version to come back to hardened land. This is only noticeable
> difference between our servers.

I'm just curious, but doesn't it sound like this is the first place to look 
for the trouble?  Unfortunately, as you point out, there are two differences 
between the two systems, the one that works isn't using two of the hardened 
patches, and is using a newer gcc.  Have you reported these facts to the 
maintainers of the hardened patches?  (I'm sure they know the patches don't 
work with gcc 4.1.1, but not reaping zombies is an issue.)  Also, anyone who 
could hope to fix this in runit should have these patches applied and be 
using an older gcc.

Obviously these patches don't completely ruin the kernel/libc's ability to 
reap zombies, or this would have been found before now, but it does seem to 
affect it.  I think debugging efforts should probably be focused on these 
modifications to the system, and not general runit (I've never seen this 
problem on any of my machines).

I'd be happy to build out a gentoo system and hack around with all this...in 
october : ).  Before then...I can only offer observation.

Good luck.

P.S.  Doing some quick scans through the patches for references to 
wait-related changes could be a good, first clue...maybe?  It could be where 
I'd start, that or gdb :)



* Re: runit not collecting zombies
  2007-09-12 19:38                                                       ` Mike Buland
@ 2007-09-12 20:28                                                         ` Alex Efros
  2007-09-12 20:38                                                           ` Alex Efros
  2007-09-13  1:05                                                           ` Mike Buland
  0 siblings, 2 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-12 20:28 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 01:38:54PM -0600, Mike Buland wrote:
> I'm just curious, but doesn't it sound like this is the first place to look 
> for the trouble?  Unfortunately, as you point out, there are two differences 
> between the two systems, the one that works isn't using two of the hardened 
> patches, and is using a newer gcc.  Have you reported these facts to the 

Yep. I dislike the idea of using a non-hardened gcc on production servers,
and I dislike even more the idea of upgrading to gcc-4.1.1 without being
able to safely downgrade after testing. Remember, this issue happens about
once a week, so I would have to wait at least 3 weeks before saying 'huh,
changing gcc solved the issue'.

I've tried to analyze this from the other side. As I noted here:
    http://bugs.gentoo.org/show_bug.cgi?id=190261#c1
this issue first appears at some point, and then repeats every 2-10 days.
So it looks like something was changed on all my servers 2-10 days BEFORE
this issue happened for the first time. The only thing that changed was the
usual upgrade of some Gentoo packages. And I know when this issue first
happened for me: 2007-05-26. I have logs of all package upgrades and server
reboots for that period:

Fri Apr 21 19:18:39 2006 >>> sys-process/runit-1.5.0
...
Kernel 2.6.16-hardened-r11 was used from Sep 10 12:46:47 GMT 2006
...
Sun Sep 10 17:42:41 2006 >>> sys-devel/gcc-3.4.6-r1
...
Mon Dec 18 02:25:38 2006 >>> sys-libs/glibc-2.3.6-r5
...
reboot (2.6.16-hardened-r11) at Sat Dec 23 23:58:49 GMT 2006
...
Mon Jan  1 21:35:05 2007 >>> sys-devel/gcc-3.4.6-r2
...
Sat Mar 31 01:45:24 2007 >>> sys-devel/gcc-3.4.6-r2
...
Sun Apr  1 13:37:43 2007 >>> dev-lang/perl-5.8.8-r2
Sun Apr  1 13:41:18 2007 >>> dev-lang/perl-5.8.8-r2
Sun Apr  1 13:41:49 2007 >>> dev-perl/Net-Daemon-0.39
Sun Apr  1 13:41:54 2007 >>> dev-perl/PlRPC-0.2018
Sun Apr  1 13:42:09 2007 >>> dev-perl/DBI-1.53
Sun Apr  1 13:42:26 2007 >>> dev-perl/DBD-mysql-3.0008
Sun Apr  1 17:59:45 2007 >>> app-misc/mime-types-7
Sun Apr  1 18:00:57 2007 >>> sys-apps/man-1.6e-r1
Sun Apr  1 18:07:55 2007 >>> sys-libs/db-4.3.29-r2
Sun Apr  1 18:08:07 2007 >>> app-portage/gentoolkit-0.2.3-r1
Sun Apr  8 18:12:28 2007 >>> sys-libs/ncurses-5.6
Sun Apr  8 18:13:15 2007 >>> sys-apps/file-4.20-r1
Wed Apr 11 03:08:33 2007 >>> sys-apps/man-pages-2.44
reboot (2.6.16-hardened-r11) at Fri Apr 27 21:55:13 GMT 2007
Sun May  6 19:05:48 2007 >>> sys-apps/debianutils-2.17.5
Sun May  6 19:08:07 2007 >>> dev-libs/apr-0.9.12
Sun May  6 19:11:34 2007 >>> dev-util/pkgconfig-0.21-r1
Sun May  6 19:11:54 2007 >>> sys-libs/timezone-data-2007d
Sun May  6 19:12:48 2007 >>> dev-lang/spidermonkey-1.5-r2
Sun May  6 19:13:17 2007 >>> sys-devel/patch-2.5.9-r1
Sun May  6 19:13:24 2007 >>> sys-apps/hdparm-6.9-r1
Sun May  6 19:14:28 2007 >>> net-misc/rsync-2.6.9-r2
Sun May  6 19:15:34 2007 >>> dev-libs/pth-2.0.6
Sun May  6 19:15:37 2007 >>> sys-devel/binutils-config-1.9-r4
Sun May  6 19:19:57 2007 >>> app-shells/bash-3.2_p15-r1
Sun May  6 19:20:31 2007 >>> dev-util/dialog-1.1.20070227
Sun May  6 19:20:59 2007 >>> sys-apps/man-1.6e-r3
Sun May  6 19:22:14 2007 >>> media-libs/libpng-1.2.16
Sun May  6 19:23:48 2007 >>> media-libs/freetype-2.1.10-r3
Sun May  6 19:23:58 2007 >>> app-misc/ca-certificates-20070303-r1
Sun May  6 19:26:04 2007 >>> sys-libs/readline-5.2_p2
Sun May  6 19:27:49 2007 >>> dev-libs/libgpg-error-1.5
Sun May  6 19:28:43 2007 >>> sys-devel/m4-1.4.9
Sun May  6 19:30:40 2007 >>> sys-fs/e2fsprogs-1.39-r2
Sun May  6 19:31:19 2007 >>> app-editors/nano-2.0.4
Sun May  6 19:32:19 2007 >>> net-mail/fetchmail-6.3.8
Sun May  6 19:32:55 2007 >>> sys-devel/flex-2.5.33-r2
Sun May  6 19:33:25 2007 >>> sys-apps/baselayout-1.12.9-r2
Sun May  6 19:36:02 2007 >>> sys-apps/util-linux-2.12r-r6
Sun May  6 19:37:14 2007 >>> app-editors/vim-core-7.0.235
Sun May  6 19:38:27 2007 >>> dev-libs/libksba-1.0.0
Sun May  6 19:40:22 2007 >>> dev-libs/libxslt-1.1.20
Sun May  6 19:41:04 2007 >>> sys-apps/module-init-tools-3.2.2-r3
Sun May  6 19:47:27 2007 >>> app-editors/vim-7.0.235
Sun May  6 19:53:14 2007 >>> sys-kernel/hardened-sources-2.6.20-r2
Sun May  6 19:56:14 2007 >>> net-misc/curl-7.15.1-r1
Sun May  6 20:36:14 2007 >>> dev-db/mysql-5.0.38
Sun May  6 20:53:01 2007 >>> media-gfx/imagemagick-6.3.3
Sun May  6 20:53:19 2007 >>> sys-devel/gcc-config-1.3.16
Sun May  6 21:11:35 2007 >>> sys-libs/libstdc++-v3-3.3.6
Wed May  9 15:53:39 2007 >>> media-libs/freetype-2.3.3
Wed May  9 15:59:54 2007 >>> dev-lang/python-2.4.4
Wed May  9 16:04:33 2007 >>> mail-mta/netqmail-1.05-r8
Wed May 23 13:51:19 2007 >>> sys-apps/portage-2.1.2.7
Wed May 23 13:51:45 2007 >>> sys-libs/timezone-data-2007e
Wed May 23 13:51:55 2007 >>> app-forensics/chkrootkit-0.47
Wed May 23 13:52:28 2007 >>> sys-libs/zlib-1.2.3-r1
Wed May 23 13:53:50 2007 >>> media-libs/libpng-1.2.18
Wed May 23 13:56:07 2007 >>> media-libs/freetype-2.3.4-r2
Wed May 23 14:44:45 2007 >>> dev-db/mysql-5.0.40
Wed May 23 14:50:22 2007 >>> dev-lang/python-2.4.4-r4
Wed May 23 14:50:24 2007 >>> app-admin/python-updater-0.2
Wed May 23 14:51:36 2007 >>> sys-apps/util-linux-2.12r-r7
Wed May 23 14:52:04 2007 >>> sys-apps/gradm-2.1.10.200702231759
Wed May 23 14:59:26 2007 >>> app-crypt/gnupg-1.4.7-r1
reboot (2.6.16-hardened-r11) at Sat May 26 10:37:41 GMT 2007
reboot (2.6.16-hardened-r11) at Sat Jun  2 14:54:20 GMT 2007
Sun Jun  3 13:04:10 2007 >>> app-portage/eix-0.9.1
reboot (2.6.16-hardened-r11) at Sat Jun  9 14:56:38 GMT 2007
Mon Jun 11 13:00:46 2007 >>> sys-process/runit-1.7.2
reboot (2.6.16-hardened-r11) at Mon Jun 11 13:06:48 GMT 2007
reboot (2.6.16-hardened-r11) at Sat Jun 16 03:59:03 GMT 2007
reboot (2.6.20-hardened-r2)  at Sat Jun 16 04:26:30 GMT 2007
...
Thu Jun 14 23:12:38 2007 >>> sys-libs/glibc-2.5-r3

Neither runit, nor gcc, glibc or the kernel was upgraded in May (the kernel
sources were unpacked in /usr/src on May 6, but the kernel was compiled and
booted only on Jun 16).

Any ideas how upgrading THESE packages could affect zombie reaping in an
already running runit? :-/

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-09-12 20:28                                                         ` Alex Efros
@ 2007-09-12 20:38                                                           ` Alex Efros
  2007-09-13  1:05                                                           ` Mike Buland
  1 sibling, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-12 20:38 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 11:28:42PM +0300, Alex Efros wrote:
> Any ideas how upgrading THESE packages may affect zombie reaping in
> already running runit? :-/

BTW, Radek Podgorny also uses Gentoo, and he got the same issue with zombies
one day before me (2007-05-25)! Looks like he upgrades Gentoo slightly more
often than I do. :)

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-09-12 20:28                                                         ` Alex Efros
  2007-09-12 20:38                                                           ` Alex Efros
@ 2007-09-13  1:05                                                           ` Mike Buland
  1 sibling, 0 replies; 113+ messages in thread
From: Mike Buland @ 2007-09-13  1:05 UTC (permalink / raw)
  To: supervision

Quite honestly, I don't know which ones would or wouldn't.  It doesn't seem 
likely, but it's the only thing that you've mentioned yet that seems like it 
could be contributing.  If I were going to try to find this issue, I would 
start with an older set of packages, something from hardened gentoo from back 
when you say it worked, from April or so (I guess), and see if I can 
reproduce this problem.  Upgrade packages and see what affects runit.

Yes, they're all live servers and you can't stand to have them down or 
non-hardened, and that's fine.  Pull an old desktop and use it, just for some 
testing, you shouldn't need to wait too long if you can get anything to 
produce zombies, and there you go.  If you can get closer to pinpointing what 
on your list may have an effect, then it will be easier to solve.

Good Luck,
--Mike

On Wednesday 12 September 2007 02:28:42 pm Alex Efros wrote:
> Hi!
>
> On Wed, Sep 12, 2007 at 01:38:54PM -0600, Mike Buland wrote:
> > I'm just curious, but doesn't it sound like this is the first place to
> > look for the trouble?  Unfortunately, as you point out, there are two
> > differences between the two systems, the one that works isn't using two
> > of the hardened patches, and is using a newer gcc.  Have you reported
> > these facts to the
>
> Yep. I dislike both idea to use non-hardened gcc on production servers and
> even more dislike idea to upgrade to gcc-4.1.1 without ability to safely
> disgrade after testing. Remember, this issue happens every ~week, so I
> should wait at least 3 weeks because saying 'huh, changing gcc solved
> issue'.
>
> I've tried to analyze this from other side. As I noted here:
>     http://bugs.gentoo.org/show_bug.cgi?id=190261#c1
> this issue happens at some time, and then repeated every 2-10 days.
> So, looks like something was changed on all my servers 2-10 days BEFORE
> this issue happens for the first time. Only changed thing was usual
> upgrade for some Gentoo packages. And I know when this issue happens for
> me first time: 2007-05-26. And I've logs for all package upgrades and
> server reboots for that period:
>
> Fri Apr 21 19:18:39 2006 >>> sys-process/runit-1.5.0
> ...
> Kernel 2.6.16-hardened-r11 was used from Sep 10 12:46:47 GMT 2006
> ...
> Sun Sep 10 17:42:41 2006 >>> sys-devel/gcc-3.4.6-r1
> ...
> Mon Dec 18 02:25:38 2006 >>> sys-libs/glibc-2.3.6-r5
> ...
> reboot (2.6.16-hardened-r11) at Sat Dec 23 23:58:49 GMT 2006
> ...
> Mon Jan  1 21:35:05 2007 >>> sys-devel/gcc-3.4.6-r2
> ...
> Sat Mar 31 01:45:24 2007 >>> sys-devel/gcc-3.4.6-r2
> ...
> Sun Apr  1 13:37:43 2007 >>> dev-lang/perl-5.8.8-r2
> Sun Apr  1 13:41:18 2007 >>> dev-lang/perl-5.8.8-r2
> Sun Apr  1 13:41:49 2007 >>> dev-perl/Net-Daemon-0.39
> Sun Apr  1 13:41:54 2007 >>> dev-perl/PlRPC-0.2018
> Sun Apr  1 13:42:09 2007 >>> dev-perl/DBI-1.53
> Sun Apr  1 13:42:26 2007 >>> dev-perl/DBD-mysql-3.0008
> Sun Apr  1 17:59:45 2007 >>> app-misc/mime-types-7
> Sun Apr  1 18:00:57 2007 >>> sys-apps/man-1.6e-r1
> Sun Apr  1 18:07:55 2007 >>> sys-libs/db-4.3.29-r2
> Sun Apr  1 18:08:07 2007 >>> app-portage/gentoolkit-0.2.3-r1
> Sun Apr  8 18:12:28 2007 >>> sys-libs/ncurses-5.6
> Sun Apr  8 18:13:15 2007 >>> sys-apps/file-4.20-r1
> Wed Apr 11 03:08:33 2007 >>> sys-apps/man-pages-2.44
> reboot (2.6.16-hardened-r11) at Fri Apr 27 21:55:13 GMT 2007
> Sun May  6 19:05:48 2007 >>> sys-apps/debianutils-2.17.5
> Sun May  6 19:08:07 2007 >>> dev-libs/apr-0.9.12
> Sun May  6 19:11:34 2007 >>> dev-util/pkgconfig-0.21-r1
> Sun May  6 19:11:54 2007 >>> sys-libs/timezone-data-2007d
> Sun May  6 19:12:48 2007 >>> dev-lang/spidermonkey-1.5-r2
> Sun May  6 19:13:17 2007 >>> sys-devel/patch-2.5.9-r1
> Sun May  6 19:13:24 2007 >>> sys-apps/hdparm-6.9-r1
> Sun May  6 19:14:28 2007 >>> net-misc/rsync-2.6.9-r2
> Sun May  6 19:15:34 2007 >>> dev-libs/pth-2.0.6
> Sun May  6 19:15:37 2007 >>> sys-devel/binutils-config-1.9-r4
> Sun May  6 19:19:57 2007 >>> app-shells/bash-3.2_p15-r1
> Sun May  6 19:20:31 2007 >>> dev-util/dialog-1.1.20070227
> Sun May  6 19:20:59 2007 >>> sys-apps/man-1.6e-r3
> Sun May  6 19:22:14 2007 >>> media-libs/libpng-1.2.16
> Sun May  6 19:23:48 2007 >>> media-libs/freetype-2.1.10-r3
> Sun May  6 19:23:58 2007 >>> app-misc/ca-certificates-20070303-r1
> Sun May  6 19:26:04 2007 >>> sys-libs/readline-5.2_p2
> Sun May  6 19:27:49 2007 >>> dev-libs/libgpg-error-1.5
> Sun May  6 19:28:43 2007 >>> sys-devel/m4-1.4.9
> Sun May  6 19:30:40 2007 >>> sys-fs/e2fsprogs-1.39-r2
> Sun May  6 19:31:19 2007 >>> app-editors/nano-2.0.4
> Sun May  6 19:32:19 2007 >>> net-mail/fetchmail-6.3.8
> Sun May  6 19:32:55 2007 >>> sys-devel/flex-2.5.33-r2
> Sun May  6 19:33:25 2007 >>> sys-apps/baselayout-1.12.9-r2
> Sun May  6 19:36:02 2007 >>> sys-apps/util-linux-2.12r-r6
> Sun May  6 19:37:14 2007 >>> app-editors/vim-core-7.0.235
> Sun May  6 19:38:27 2007 >>> dev-libs/libksba-1.0.0
> Sun May  6 19:40:22 2007 >>> dev-libs/libxslt-1.1.20
> Sun May  6 19:41:04 2007 >>> sys-apps/module-init-tools-3.2.2-r3
> Sun May  6 19:47:27 2007 >>> app-editors/vim-7.0.235
> Sun May  6 19:53:14 2007 >>> sys-kernel/hardened-sources-2.6.20-r2
> Sun May  6 19:56:14 2007 >>> net-misc/curl-7.15.1-r1
> Sun May  6 20:36:14 2007 >>> dev-db/mysql-5.0.38
> Sun May  6 20:53:01 2007 >>> media-gfx/imagemagick-6.3.3
> Sun May  6 20:53:19 2007 >>> sys-devel/gcc-config-1.3.16
> Sun May  6 21:11:35 2007 >>> sys-libs/libstdc++-v3-3.3.6
> Wed May  9 15:53:39 2007 >>> media-libs/freetype-2.3.3
> Wed May  9 15:59:54 2007 >>> dev-lang/python-2.4.4
> Wed May  9 16:04:33 2007 >>> mail-mta/netqmail-1.05-r8
> Wed May 23 13:51:19 2007 >>> sys-apps/portage-2.1.2.7
> Wed May 23 13:51:45 2007 >>> sys-libs/timezone-data-2007e
> Wed May 23 13:51:55 2007 >>> app-forensics/chkrootkit-0.47
> Wed May 23 13:52:28 2007 >>> sys-libs/zlib-1.2.3-r1
> Wed May 23 13:53:50 2007 >>> media-libs/libpng-1.2.18
> Wed May 23 13:56:07 2007 >>> media-libs/freetype-2.3.4-r2
> Wed May 23 14:44:45 2007 >>> dev-db/mysql-5.0.40
> Wed May 23 14:50:22 2007 >>> dev-lang/python-2.4.4-r4
> Wed May 23 14:50:24 2007 >>> app-admin/python-updater-0.2
> Wed May 23 14:51:36 2007 >>> sys-apps/util-linux-2.12r-r7
> Wed May 23 14:52:04 2007 >>> sys-apps/gradm-2.1.10.200702231759
> Wed May 23 14:59:26 2007 >>> app-crypt/gnupg-1.4.7-r1
> reboot (2.6.16-hardened-r11) at Sat May 26 10:37:41 GMT 2007
> reboot (2.6.16-hardened-r11) at Sat Jun  2 14:54:20 GMT 2007
> Sun Jun  3 13:04:10 2007 >>> app-portage/eix-0.9.1
> reboot (2.6.16-hardened-r11) at Sat Jun  9 14:56:38 GMT 2007
> Mon Jun 11 13:00:46 2007 >>> sys-process/runit-1.7.2
> reboot (2.6.16-hardened-r11) at Mon Jun 11 13:06:48 GMT 2007
> reboot (2.6.16-hardened-r11) at Sat Jun 16 03:59:03 GMT 2007
> reboot (2.6.20-hardened-r2)  at Sat Jun 16 04:26:30 GMT 2007
> ...
> Thu Jun 14 23:12:38 2007 >>> sys-libs/glibc-2.5-r3
>
> Neither runit, nor gcc, glibc or kernel was upgraded in May (kernel
> sources was unpacked in /usr/src at May 6, but it was compiled and boot
> only Jun 16).
>
> Any ideas how upgrading THESE packages may affect zombie reaping in
> already running runit? :-/




* Re: runit not collecting zombies
  2007-09-12 17:04                                                     ` Alex Efros
  2007-09-12 19:38                                                       ` Mike Buland
@ 2007-09-13  8:58                                                       ` Radek Podgorny
       [not found]                                                       ` <50411.::ffff:77.75.72.5.1189673890.squirrel@mail.podgorny.cz>
  2 siblings, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2007-09-13  8:58 UTC (permalink / raw)
  To: supervision

So, my systems are listed at http://podgorny.cz/moin/RunitBug (I will
polish the list soon).

The interesting things:

* I don't use hardened at all so I think this is not to be blamed.
* It is not a 2.6.20 kernel bug as I experience this on 2.6.19.1 and even
something like 2.6.18 (can't look right now).
* All my kernels are vanilla.
* The only two systems that run fine may be fucked up, too. One of them is
a laptop so the uptime may be too short to notice. The other one has
uptime of something like 160 days so it may screw up on next reboot.

What about CFLAGS? I'll get them from my machines and post them. I suspect
they are mostly -O3 and arch set to specific processor (not generic i686
or so)...

Maybe it's some kind of time overflow bug, can you find out when you started
experiencing the trouble?

Radek P.


> Hi!
>
> On Wed, Sep 12, 2007 at 06:04:02PM +0200, Radek Podgorny wrote:
>> Alex, did I get it right you use gentoo? On what architecture? Stable or
>
> Stable x86 (except few ~x86 packages like runit and svlogd), all 32bit.
>
> I use Hardened Gentoo, and one of ideas is it's GrSecurity/PaX patches
> introduce that bug - this may explain why a lot of vanilla kernel users
> don't see this bug.
> Another idea - some of other gentoo-specific kernel patches.
> To test this I should stop using GrSecurity/PaX on production servers for
> a weeks, and I dislike this idea.
>
>> unstable? I use gentoo on all my machines (stable/unstable mix,
>> x86/amd64
>> mix, different kernels, ...) and some machines are OK, others are not.
>
> Yeah, I've one server which don't have this issue. His admin made a
> mistake many months ago - he installed too new gcc (which isn't support
> hardened patches yet - SSP and PIE), and afraid to disgrade it on
> production server. He wait until hardened patches will be released for
> that gcc version to come back to hardened land. This is only noticeable
> difference between our servers.
>
>> Maybe this is gentoo specific somehow (exotic USE for glibc, wrong
>> gcc?...). I'll get the versions from my machines and post it here, could
>> you please do the same? Let's find what's common...
>
> My servers and workstation use (unique lines) (all of them have this
> issue):
>   2.6.20-hardened-r6 SMP i686 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
> GenuineIntel
>   2.6.20-hardened-r6     i686 Intel(R) Pentium(R) 4 CPU 2.80GHz
> GenuineIntel
>   2.6.20-hardened-r6     i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
> GenuineIntel
>   2.6.20-hardened-r6     i686 AMD Athlon(tm) 64 Processor 3500+
> AuthenticAMD
> Server without zombie issue use:
>   2.6.20-hardened-r6     i686 Intel(R) Celeron(R) CPU 2.00GHz GenuineIntel
> Kernel configuration is 100% equal on server without zombies and my P4
> servers.
>
> All servers use:
>   sys-libs/glibc-2.5-r4
>   sys-devel/binutils-2.17
>
> My servers use:
>   sys-devel/gcc-3.4.6-r2  (with SSP and PIE)
> Server without zombie issue use:
>   sys-devel/gcc-4.1.1-r3
>
> I've tried runit from 1.5.0 to 1.7.2 with patches from this maillist on my
> servers. Server without this issue work on runit 1.5.0.
>
> USE-flags on all servers are same:
>   sys-kernel/hardened-sources-2.6.20-r6
>     USE="-build -symlink"
>   sys-libs/glibc-2.5-r4
>     USE="hardened nls nptl nptlonly -build -debug -glibc-compat20
> -glibc-omitfp -multilib -profile (-selinux)"
>   sys-devel/binutils-2.17
>     USE="nls -multislot -multitarget -test -vanilla"
>   sys-devel/gcc-3.4.6-r2
>     USE="hardened nls (-altivec) -bootstrap -boundschecking -build -d -doc
> -fortran -gcj -gtk -ip28 -ip32r10k -multilib -multislot (-n32) (-n64)
> -nocxx -nopie -nossp -objc -test -vanilla"
>   sys-process/runit-1.7.2
>     USE="-static"
>
> --
> 			WBR, Alex.
>
>





* Re: runit not collecting zombies
       [not found]                                                       ` <50411.::ffff:77.75.72.5.1189673890.squirrel@mail.podgorny.cz>
@ 2007-09-13 10:57                                                         ` Alex Efros
  2007-09-13 12:06                                                           ` Alex Efros
                                                                             ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-13 10:57 UTC (permalink / raw)
  To: supervision

Hi!

On Thu, Sep 13, 2007 at 10:58:10AM +0200, Radek Podgorny wrote:
> * I don't use hardened at all so I think this is not to be blamed.
> * It is not a 2.6.20 kernel bug as I experience this on 2.6.19.1 and even
> something like 2.6.18 (can't look right now).

I first got this issue on 2.6.16 (hardened packages are released with
some delay).

> * All my kernels are vanilla.
> * The only two systems that run fine may be fucked up, too. One of them is
> a laptop so the uptime may be too short to notice. The other one has
> uptime of something like 160 days so it may screw up on next reboot.

Hmm. Can you reboot that system with 160 days uptime, just to be sure it
makes sense to compare it to the other systems?

> What about CFLAGS? I'll get them from my machines and post them. I suspect
> they are mostly -O3 and arch set to specific processor (not generic i686
> or so)...

Hardened insists on very stable CFLAGS.
My servers/workstation:
    CFLAGS="-march=pentium-m -O2 -pipe"
    CFLAGS="-O2 -march=pentium4 -pipe"
    CFLAGS="-march=k8 -O2 -pipe -fforce-addr"
Server without zombie issue:
    CFLAGS="-march=pentium4 -O2 -pipe -ftracer -fprefetch-loop-arrays"

> Maybe it's some kind of time overflow bug, can you find when did you start
> experiencing the trouble?

Time overflow? In 2-10 days? Don't think so.

Important question: do you recompile the whole system after upgrading any of
the toolchain packages (linux-headers/glibc/gcc/binutils)? I see you're using
a newer gcc (4.x.x), but did you recompile everything with it?

P.S. I think it's time for me to try sysvinit.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-09-13 10:57                                                         ` Alex Efros
@ 2007-09-13 12:06                                                           ` Alex Efros
  2007-09-13 14:31                                                           ` Radek Podgorny
       [not found]                                                           ` <51910.::ffff:77.75.72.5.1189693860.squirrel@mail.podgorny.cz>
  2 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-13 12:06 UTC (permalink / raw)
  To: supervision

Hi!

On Thu, Sep 13, 2007 at 01:57:31PM +0300, Alex Efros wrote:
> P.S. I think it's time to try sysvinit for me. 

OK, I booted 3 different servers using sysvinit-2.86 with this /etc/inittab:
---cut---
id:3:initdefault:
rc::bootwait:/etc/runit/1
l0:0:wait:/bin/sh -c '/etc/runit/3; exec /sbin/halt'
l3:3:once:/etc/runit/2
l6:6:wait:/bin/sh -c '/etc/runit/3; exec /sbin/reboot'
ca:12345:ctrlaltdel:/sbin/shutdown -r now
---cut---
and left the other servers under runit-init.

So, in ~3 weeks we'll know whether it solved the zombie issue.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2007-09-13 10:57                                                         ` Alex Efros
  2007-09-13 12:06                                                           ` Alex Efros
@ 2007-09-13 14:31                                                           ` Radek Podgorny
       [not found]                                                           ` <51910.::ffff:77.75.72.5.1189693860.squirrel@mail.podgorny.cz>
  2 siblings, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2007-09-13 14:31 UTC (permalink / raw)
  To: supervision

> Hi!
>
> On Thu, Sep 13, 2007 at 10:58:10AM +0200, Radek Podgorny wrote:
>> * I don't use hardened at all so I think this is not to be blamed.
>> * It is not a 2.6.20 kernel bug as I experience this on 2.6.19.1 and
>> even
>> something like 2.6.18 (can't look right now).
>
> I got this issue first time on 2.6.16 (hardened packages released with
> some delay).

So, it's 2.6.16.19.

>
>> * All my kernels are vanilla.
>> * The only two systems that run fine may be fucked up, too. One of them
>> is
>> a laptop so the uptime may be too short to notice. The other one has
>> uptime of something like 160 days so it may screw up on next reboot.
>
> Hmm. Can you reboot that system with 160 days update, just to be sure it
> is have sense this compare it to other systems?

I must admit I'm afraid to reboot the system. At least I have one running
machine... :-(

>> What about CFLAGS? I'll get them from my machines and post them. I
>> suspect
>> they are mostly -O3 and arch set to specific processor (not generic i686
>> or so)...
>
> Hardened insists on very stable CFLAGS.
> My servers/workstation:
>     CFLAGS="-march=pentium-m -O2 -pipe"
>     CFLAGS="-O2 -march=pentium4 -pipe"
>     CFLAGS="-march=k8 -O2 -pipe -fforce-addr"
> Server without zombie issue:
>     CFLAGS="-march=pentium4 -O2 -pipe -ftracer -fprefetch-loop-arrays"

Hmmm, so this looks like CFLAGS are out, too...

>> Maybe it's some kind of time overflow bug, can you find when did you
>> start
>> experiencing the trouble?
>
> Time overflow? In 2-10 days? Don't think so.

I meant a "global" time overflow (not relative to bootup). Something like a
Unix timestamp 32-bit overflow kind of thing...
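
For reference, the usual 32-bit case is a signed count of seconds since
1970-01-01 UTC. A quick calculation (a Python sketch, helper names made up
here) shows where that wraps and how far away it was at the time of the
first report in this thread (2007-05-24 23:07 UTC):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def int32_rollover():
    """Last moment a signed 32-bit count of seconds since the epoch can hold."""
    return EPOCH + timedelta(seconds=2**31 - 1)


def time_left(when):
    """Time remaining between `when` and the signed 32-bit rollover."""
    return int32_rollover() - when


if __name__ == "__main__":
    first_report = datetime(2007, 5, 24, 23, 7, tzinfo=timezone.utc)
    print("rollover:", int32_rollover().isoformat())
    print("time left at first report:", time_left(first_report))
```

So a signed 32-bit timestamp was still about three decades from wrapping; a
counter with a different width or starting point would have to be checked
separately.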

> Important question: do you recompile overall system after upgrading any of
> toolchain packages (linux-headers/glibc/gcc/binutils)? I see you're using
> newer gcc (4.x.x), but do you recompiled everything with it?

No I don't. I just upgrade the compiler and as the packages are to be
upgraded, the new compiler is used. So the system is compiled with
different versions. Do you recompile the entire system?

> P.S. I think it's time to try sysvinit for me.

Looking forward to seeing the results...

>
> --
> 			WBR, Alex.
>
>




^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
       [not found]                                                           ` <51910.::ffff:77.75.72.5.1189693860.squirrel@mail.podgorny.cz>
@ 2007-09-13 14:51                                                             ` Alex Efros
  0 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-13 14:51 UTC (permalink / raw)
  To: supervision

Hi!

On Thu, Sep 13, 2007 at 04:31:00PM +0200, Radek Podgorny wrote:
> > Important question: do you recompile the whole system after upgrading any of
> > the toolchain packages (linux-headers/glibc/gcc/binutils)? I see you're using
> > a newer gcc (4.x.x), but did you recompile everything with it?
> 
> No I don't. I just upgrade the compiler and as the packages are to be
> upgraded, the new compiler is used. So the system is compiled with
> different versions. Do you recompile the entire system?

Of course. Every time one of the toolchain packages is upgraded, I do this:

# To be able to safely use `emerge -k` we should first clear the directory
# with current binary packages (e.g. move it to /tmp/portage-packages):
pkgdir=$(portageq pkgdir)
mv "$pkgdir" /tmp/portage-packages
install -d -o portage -g portage "$pkgdir"
# First compilation of toolchain:
emerge linux-headers glibc binutils gcc-config gcc
    # if a newer gcc was installed in a separate SLOT, then choose it now:
    gcc-config NAME_OR_NUMBER_OF_NEW_GCC    # see `gcc-config -l`
    source /etc/profile
emerge -1 libtool
# Second compilation of toolchain, creating binary packages:
emerge -b glibc binutils gcc portage
# Rebuild 'system'; the toolchain will be quickly unpacked from binary packages:
emerge -bke system
# Rebuild 'world'; all 'system' packages will be unpacked from binary packages:
emerge -ke world

This takes some time (about 5-6 hours on my current servers), but ensures
that all changes in the toolchain affect the whole system.

P.S. It may seem like overkill, but it's the only known 100% __safe__ way to
upgrade the toolchain.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-12 19:18                                                               ` Charlie Brady
  2007-09-12 19:30                                                                 ` Alex Efros
@ 2007-09-15 13:36                                                                 ` Alex Efros
  2007-09-15 13:57                                                                   ` Alex Efros
                                                                                     ` (2 more replies)
  1 sibling, 3 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-15 13:36 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Sep 12, 2007 at 03:18:02PM -0400, Charlie Brady wrote:
>> I don't see how fixing ssh will solve my issue with servers, but I'll try
>> to gather more information about ssh next time this issue happens on my
>> servers.
> It won't, but if you can fix it it will reduce the severity of your problem 
> with runit process 1. If you fix your runsvdir related cron job problem 
> (which leaves all the chpst zombies), then that will further reduce the 
> severity of your problem.

OK, here are the first results. I have two unused dedicated servers: we
bought them, I installed Gentoo, and they are waiting until I install our
projects there. I installed sysvinit on one of these servers and
rebooted BOTH servers, so they have the same uptime, they're on the same
hosting, and they are 100% equal except for different IP/MAC and sysvinit/runit.

Now, the server with runit has 350 ssh zombies (it has only ssh zombies
because I've not installed our project with cron/chpst, etc.). The server
with sysvinit has no zombies yet.

Full `ps -ef axf` output here: http://powerman.asdfgroup.com/tmp/ps.txt

I've started `strace -f -ff -p PID` (you can see it in the `ps` output), but
so far there have been no connections to ssh, so its output is still empty.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 13:36                                                                 ` Alex Efros
@ 2007-09-15 13:57                                                                   ` Alex Efros
  2007-09-15 15:20                                                                     ` Charlie Brady
  2007-09-15 14:03                                                                   ` Alex Efros
  2007-09-17  7:56                                                                   ` Gerrit Pape
  2 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-15 13:57 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, Sep 15, 2007 at 04:36:42PM +0300, Alex Efros wrote:
> Full `ps -ef axf` output here: http://powerman.asdfgroup.com/tmp/ps.txt

Here is full syslog for Sep 15: http://powerman.asdfgroup.com/tmp/syslog.txt

According to the `ps` output, all zombies were created between 04:19 and 04:21,
plus a few at 07:51. There are no records in the kernel log for that period, and
in syslog all records for Sep 15 are ssh-related.

It looks like the last zombie and the last record in the log (user mysql) were
created by my test attempt to connect. In this case we have a chance to get
clean strace output for such a simple connect attempt which creates an unreaped
zombie! I'll restart strace and try to connect as the mysql user again now ...

GOT IT!!!

# date ; ps -ef axf | tail -n 1
Sat Sep 15 13:51:38 GMT 2007
sshd     14804     1  0 13:50 ?        Z      0:00 [sshd] <defunct>
# date ; ps -ef axf | tail -n 1
Sat Sep 15 13:51:53 GMT 2007
sshd     14804     1  0 13:50 ?        Z      0:00 [sshd] <defunct>

# tail -n 1 /var/log/syslog/all/current 
auth.info: Sep 15 13:50:43 sshd[14803]: User mysql not allowed because account is locked

Strace output with all details about PIDs 939 (ssh server), 14803 and 14804 
(unreaped zombie) is here: http://powerman.asdfgroup.com/tmp/ssh_strace.txt

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 13:36                                                                 ` Alex Efros
  2007-09-15 13:57                                                                   ` Alex Efros
@ 2007-09-15 14:03                                                                   ` Alex Efros
  2007-09-17  7:56                                                                   ` Gerrit Pape
  2 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-15 14:03 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, Sep 15, 2007 at 04:36:42PM +0300, Alex Efros wrote:
> Now, server with runit has 350 ssh zombies (it has only ssh zombies
> because I've not installed our project with cron/chpst, etc.). Server with
> sysvinit has no zombies yet.

Just checked two other servers. These two servers are under high load (loadavg
between 0.3 and 1.5), and they have the same project installed (cluster), so
their load is nearly equal. Again, one runs runit and the other sysvinit.
The server with runit now has 69 zombies (63 sshd and 6 chpst); the one with
sysvinit has no zombies yet.

Uptime for all 4 (!) servers is 2 days and 2 hours.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 13:57                                                                   ` Alex Efros
@ 2007-09-15 15:20                                                                     ` Charlie Brady
  2007-09-15 15:28                                                                       ` Alex Efros
  2007-09-15 15:36                                                                       ` Alex Efros
  0 siblings, 2 replies; 113+ messages in thread
From: Charlie Brady @ 2007-09-15 15:20 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 15 Sep 2007, Alex Efros wrote:

> Strace output with all details about PIDs 939 (ssh server), 14803 and 14804
> (unreaped zombie) is here: http://powerman.asdfgroup.com/tmp/ssh_strace.txt

Here's (at least part of) your problem:

...
[pid 14804] socket(PF_FILE, SOCK_DGRAM, 0) = 6
[pid 14804] fcntl64(6, F_SETFD, FD_CLOEXEC) = 0
[pid 14804] connect(6, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 ENOENT (No such file or directory)
[pid 14804] close(6)                    = 0
[pid 14804] exit_group(255)             = ?
Process 14804 detached
[pid 14803] <... read resumed> 0x5f9b54fc, 4) = ? ERESTARTSYS (To be restarted)
[pid 14803] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 14803] read(6, "", 4)              = 0
[pid 14803] exit_group(255)             = ?
Process 14803 detached
...

You are running sshd with privilege separation. Process 14804 is running
chrooted into /var/empty. It's trying to syslog to /dev/log in the chroot,
and failing, then exiting. Its parent exits without doing waitpid (when
it gets a 0 byte read from the pipe to the child). Tell syslog to listen on
/var/empty/dev/log and you'll learn more.
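
The pattern in that trace can be reproduced without sshd. What follows is
only a minimal sketch, not anything taken from OpenSSH: the function names
are made up for illustration, and it assumes a POSIX sh plus the usual
`true` and `sleep` binaries.

```shell
#!/bin/sh
# Illustration only: these helpers are hypothetical, not part of sshd or runit.

# Broken pattern: the child (true) dies first, the parent never calls wait,
# and then the parent exits. "exec sleep 1" replaces the shell with a program
# that does not reap children, so the dead child is still a zombie when the
# parent exits and is handed over to process 1 in that state.
leave_dead_child_behind() {
    sh -c 'true & exec sleep 1'
}

# Correct pattern: the parent collects the child's exit status before it
# exits, so nothing is left over for process 1 to clean up.
reap_child_before_exit() {
    sh -c 'true & wait "$!"'
}
```

After leave_dead_child_behind returns, `ps` should show a <defunct> entry
with PPID 1 for as long as process 1 does not reap it; after
reap_child_before_exit there should be nothing left to see.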


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 15:20                                                                     ` Charlie Brady
@ 2007-09-15 15:28                                                                       ` Alex Efros
  2007-09-15 15:47                                                                         ` Charlie Brady
  2007-09-15 15:49                                                                         ` Charlie Brady
  2007-09-15 15:36                                                                       ` Alex Efros
  1 sibling, 2 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-15 15:28 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, Sep 15, 2007 at 11:20:57AM -0400, Charlie Brady wrote:
> You are running sshd with privilege separation. Process 14804 is running 
> chrooted into /var/empty. It's trying to syslog to /dev/log in the chroot, 
> and failing, then exiting. Its parent exits without doing waitpid (when it 
> gets a 0 byte read from the pipe to the child). Tell syslog to listen on 
> /var/empty/dev/log and you'll learn more.

I think this is normal ssh behaviour and isn't related to the zombie issue.
But:

# mkdir /var/empty/dev
# mount -o bind /dev/ /var/empty/dev/
# strace -f -ff -p 939 &>/tmp/ssh_strace3 &
# ssh mysql@my.host

# tail /var/log/syslog/all/current
auth.info: Sep 15 15:23:15 sshd[14925]: User mysql not allowed because account is locked
auth.info: Sep 15 15:23:15 sshd[14926]: input_userauth_request: invalid user mysql
auth.info: Sep 15 15:23:15 sshd[14926]: Connection closed by 85.90.198.1

# tail /tmp/ssh_strace3
[pid 14926] connect(6, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
[pid 14926] send(6, "<38>Sep 15 15:23:15 sshd[14926]:"..., 65, MSG_NOSIGNAL) = 65
[pid 14926] close(6)                    = 0
[pid 14926] exit_group(255)             = ?
Process 14926 detached
[pid 14925] <... read resumed> 0x5baa81ac, 4) = ? ERESTARTSYS (To be restarted)
[pid 14925] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 14925] read(6, "", 4)              = 0
[pid 14925] exit_group(255)             = ?
Process 14925 detached
<... select resumed> )                  = 1 (in [5])
--- SIGCHLD (Child exited) @ 0 (0) ---
waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 255}], WNOHANG) = 14925
waitpid(-1, 0x5809d3ac, WNOHANG)        = 0
rt_sigaction(SIGCHLD, NULL, {0x17c5a4f0, [], 0}, 8) = 0
sigreturn()                             = ? (mask now [])
close(5)                                = 0
select(6, [3], NULL, NULL, NULL <unfinished ...>
Process 939 detached

So the only difference is the successful output to the log.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 15:20                                                                     ` Charlie Brady
  2007-09-15 15:28                                                                       ` Alex Efros
@ 2007-09-15 15:36                                                                       ` Alex Efros
  2007-09-15 15:58                                                                         ` Charlie Brady
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-15 15:36 UTC (permalink / raw)
  To: supervision

Hi!

And, Charlie, I wish to repeat this again: forget about ssh. This issue
isn't related to ssh itself. Look:

# ps -ef axf | tail -n 2
sshd     14804     1  0 13:50 ?        Z      0:00 [sshd] <defunct>
sshd     14926     1  0 15:23 ?        Z      0:00 [sshd] <defunct>

# perl -e 'fork || sleep 1; print "pid $$ exit\n"'
pid 14953 exit
# pid 14954 exit

# ps -ef axf | tail -n 3
sshd     14804     1  0 13:50 ?        Z      0:00 [sshd] <defunct>
sshd     14926     1  0 15:23 ?        Z      0:00 [sshd] <defunct>
root     14954     1  0 15:31 pts/1    Z      0:00 [perl] <defunct>

Starting from some point (usually after 2-7 days of uptime), process N1 stops
reaping zombies. Any zombies. After that point. That's all. Nothing about
ssh in this equation.

It looks like that point is reached on some Gentoo installations after 25 May,
and is not reached on many other systems around the world.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 15:28                                                                       ` Alex Efros
@ 2007-09-15 15:47                                                                         ` Charlie Brady
  2007-09-15 16:02                                                                           ` Alex Efros
  2007-09-15 15:49                                                                         ` Charlie Brady
  1 sibling, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-09-15 15:47 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 15 Sep 2007, Alex Efros wrote:

> I think this is normal ssh behaviour and doesn't related to zombie issue.

No. It might be usual ssh behaviour, but it's wrong behaviour and is 
related to the zombie issue. Please re-read my earlier posts to the 
thread.

> # tail /tmp/ssh_strace3
> [pid 14926] connect(6, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
> [pid 14926] send(6, "<38>Sep 15 15:23:15 sshd[14926]:"..., 65, MSG_NOSIGNAL) = 65
> [pid 14926] close(6)                    = 0
> [pid 14926] exit_group(255)             = ?
> Process 14926 detached
> [pid 14925] <... read resumed> 0x5baa81ac, 4) = ? ERESTARTSYS (To be restarted)
> [pid 14925] --- SIGCHLD (Child exited) @ 0 (0) ---
> [pid 14925] read(6, "", 4)              = 0
> [pid 14925] exit_group(255)             = ?
> Process 14925 detached

You won't see zombies if process 14925 reads exit status of process 14926 
before it exits.

Yes, runit should reap that status, but that doesn't change the fact that 
ssh is wrong. Note also that SIGCHLD is delivered to sshd process, not to 
runit, because 14926 terminates before 14925.

IMO this is a bug in the privilege separation code in openssh.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 15:28                                                                       ` Alex Efros
  2007-09-15 15:47                                                                         ` Charlie Brady
@ 2007-09-15 15:49                                                                         ` Charlie Brady
  2007-09-15 15:55                                                                           ` Alex Efros
  1 sibling, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2007-09-15 15:49 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 15 Sep 2007, Alex Efros wrote:

> But:
>
> # mkdir /var/empty/dev
> # mount -o bind /dev/ /var/empty/dev/

BTW, you don't want to do that. You are exposing all device nodes inside 
/var/empty. You only want the syslog socket, and syslogd will create that 
if you tell it to.



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 15:49                                                                         ` Charlie Brady
@ 2007-09-15 15:55                                                                           ` Alex Efros
  2007-09-15 16:02                                                                             ` Charlie Brady
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-15 15:55 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, Sep 15, 2007 at 11:49:53AM -0400, Charlie Brady wrote:
>> # mkdir /var/empty/dev
>> # mount -o bind /dev/ /var/empty/dev/
>
> BTW, you don't want to do that. You are exposing all device nodes inside 
> /var/empty. You only want the syslog socket, and syslogd will create that 
> if you tell it to.

Yep. I know. I unmounted it after the experiment. I don't wish to provide
/var/empty/dev/log for ssh - it's ssh's responsibility to have access to
/dev/log if it needs to log. For example, ssh can open /dev/log before fork
and pass that fd to the chroot'ed child.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 15:36                                                                       ` Alex Efros
@ 2007-09-15 15:58                                                                         ` Charlie Brady
  0 siblings, 0 replies; 113+ messages in thread
From: Charlie Brady @ 2007-09-15 15:58 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 15 Sep 2007, Alex Efros wrote:

> And, Charlie, I wish to repeat this again: forget about ssh. This issue
> isn't related to ssh itself.

Obviously I need to repeat myself as well. You have two problems. Various 
programs, including sshd and your silly cron script, are generating 
zombies, and also runit as proc 1 is not reaping those zombies. Since 
nobody is solving the runit problem (which could lie in the kernel), you 
can minimise your inconvenience by reducing the incidence of zombies 
reparented to proc 1.

Note that some zombies may exist temporarily which are not and will not be 
reparented to proc 1. Those zombies are processes which have exited whose 
parents are still running and have not yet reaped the child's status.

> # ps -ef axf | tail -n 3
> sshd     14804     1  0 13:50 ?        Z      0:00 [sshd] <defunct>
> sshd     14926     1  0 15:23 ?        Z      0:00 [sshd] <defunct>
> root     14954     1  0 15:31 pts/1    Z      0:00 [perl] <defunct>
>
> Starting from some point (usually after 2-7 days uptime), process N1 stop
> reaping zombies. Any zombies. After that point. That's all. Nothing about
> ssh in this equation.

I assume you mean process 1 when you say process N1.

I'm not denying that runit as proc 1 seems to have a problem on your 
system. But since your problem is accumulated zombies, if you stop 
generating them, that problem becomes unimportant. If you reduce the 
generation of zombies, the runit problem becomes less important (or at 
least less urgent).

Have I now made myself clear?


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 15:47                                                                         ` Charlie Brady
@ 2007-09-15 16:02                                                                           ` Alex Efros
  0 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-15 16:02 UTC (permalink / raw)
  To: supervision

Hi!

On Sat, Sep 15, 2007 at 11:47:02AM -0400, Charlie Brady wrote:
> Yes, runit should reap that status, but that doesn't change the fact that 
> ssh is wrong. Note also that SIGCHLD is delivered to sshd process, not to 
> runit, because 14926 terminates before 14925.
>
> IMO this is a bug in the privilege separation code in openssh.

Yep. But this isn't the ssh mailing list, and I'm not much worried about this
ssh bug. I'm worried about zombies. Using this bug as my chance to fix all
software in the world which may generate zombies is a cool idea, but I've no
time for this.

Right now, I've taken another server, also a fresh Gentoo installation without
any load, which has 2 days 23 hours uptime and NO zombies, and run this:

    perl -e '$i=1000; $i-- || exit while fork(); sleep 1'

And I got 1000 unreaped zombies.

I've rebooted this server and run this again, again and again:

    perl -e '$i=1000; $i-- || exit while fork(); sleep 1'
    perl -e '$i=1000; $i-- || exit while fork(); sleep 1'
    for i in $(seq 1 100); do perl -e '$i=1000; $i-- || exit while fork(); sleep 1'; done
    for i in $(seq 1 100); do perl -e '$i=1000; $i-- || exit while fork(); sleep 1'; sleep 1; done

No effect. All zombies were reaped by runit. Probably I should try to run
this every hour... :( maybe this experiment will give us additional information.

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 15:55                                                                           ` Alex Efros
@ 2007-09-15 16:02                                                                             ` Charlie Brady
  0 siblings, 0 replies; 113+ messages in thread
From: Charlie Brady @ 2007-09-15 16:02 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Sat, 15 Sep 2007, Alex Efros wrote:

> On Sat, Sep 15, 2007 at 11:49:53AM -0400, Charlie Brady wrote:
>>> # mkdir /var/empty/dev
>>> # mount -o bind /dev/ /var/empty/dev/
>>
>> BTW, you don't want to do that. You are exposing all device nodes inside
>> /var/empty. You only want the syslog socket, and syslogd will create that
>> if you tell it to.
>
> Yep. I know. I've unmounted it after experiment. I don't wish to provide
> /var/empty/dev/log for ssh - it's ssh responsibility to have access to
> /dev/log if it need log. For example, ssh can open /dev/log before fork
> and provide that fd for chroot'ed child.

No, that doesn't work. syslogd might be restarted while sshd continues to 
run.

Anyway, your problem if you want to throw away log messages ...


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-15 13:36                                                                 ` Alex Efros
  2007-09-15 13:57                                                                   ` Alex Efros
  2007-09-15 14:03                                                                   ` Alex Efros
@ 2007-09-17  7:56                                                                   ` Gerrit Pape
  2007-09-17  9:07                                                                     ` Radek Podgorny
  2007-09-17 11:59                                                                     ` Alex Efros
  2 siblings, 2 replies; 113+ messages in thread
From: Gerrit Pape @ 2007-09-17  7:56 UTC (permalink / raw)
  To: supervision

On Sat, Sep 15, 2007 at 04:36:42PM +0300, Alex Efros wrote:
> Now, server with runit has 350 ssh zombies (it has only ssh zombies
> because I've not installed our project with cron/chpst, etc.). Server with
> sysvinit has no zombies yet.

Hi Alex, thanks a lot for your ongoing tests.

On the server that has runit as pid 1, and those zombies currently in
the process table, can you please do the following as root?:

Count the currently unreaped zombies
 # chmod 0 /etc/runit/stopit
 # kill -CONT 1
Count the unreaped zombies again

Does the number change?

Thanks, Gerrit.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-17  7:56                                                                   ` Gerrit Pape
@ 2007-09-17  9:07                                                                     ` Radek Podgorny
  2007-09-17 11:59                                                                     ` Alex Efros
  1 sibling, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2007-09-17  9:07 UTC (permalink / raw)
  To: supervision

Wow! This helped on one of my machines (haven't tried others)!

Radek Podgorny


> On Sat, Sep 15, 2007 at 04:36:42PM +0300, Alex Efros wrote:
>> Now, server with runit has 350 ssh zombies (it has only ssh zombies
>> because I've not installed our project with cron/chpst, etc.). Server
>> with
>> sysvinit has no zombies yet.
>
> Hi Alex, thanks a lot for your ongoing tests.
>
> On the server that has runit as pid 1, and those zombies currently in
> the process table, can you please do the following as root?:
>
> Count the currently unreaped zombies
>  # chmod 0 /etc/runit/stopit
>  # kill -CONT 1
> Count the unreaped zombies again
>
> Does the number change?
>
> Thanks, Gerrit.
>
>




^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-17  7:56                                                                   ` Gerrit Pape
  2007-09-17  9:07                                                                     ` Radek Podgorny
@ 2007-09-17 11:59                                                                     ` Alex Efros
  2007-09-18  8:14                                                                       ` Gerrit Pape
  1 sibling, 1 reply; 113+ messages in thread
From: Alex Efros @ 2007-09-17 11:59 UTC (permalink / raw)
  To: supervision

Hi!

On Mon, Sep 17, 2007 at 07:56:51AM +0000, Gerrit Pape wrote:
> Count the currently unreaped zombies
>  # chmod 0 /etc/runit/stopit
>  # kill -CONT 1
> Count the unreaped zombies again
> 
> Does the number change?

Yeah! It works!!!

# date; uptime; ps ax | grep Z | wc
Mon Sep 17 11:57:24 GMT 2007
 11:57:24 up 3 days, 23:57,  1 user,  load average: 0.00, 0.00, 0.00
   1897   11383   83474
# chmod 0 /etc/runit/stopit
# kill -CONT 1
# date; uptime; ps ax | grep Z | wc
Mon Sep 17 11:57:56 GMT 2007
 11:57:56 up 3 days, 23:58,  1 user,  load average: 0.00, 0.00, 0.00
      1       7      48
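
A side note on the counting: `ps ax | grep Z` matches any line that contains
a capital Z, which is presumably why one line (the grep itself) is still left
after the cleanup. A slightly stricter count, as a sketch only: the function
names are mine, and it assumes a procps-style `ps` that accepts
`-o stat= -o ppid=`.

```shell
#!/bin/sh
# Sketch only; the function names are made up for this example.

# Reads "STAT PPID" pairs on stdin and prints how many of them are zombies
# (state starts with Z) whose parent is process 1.
filter_init_zombies() {
    awk '$1 ~ /^Z/ && $2 == 1 { n++ } END { print n + 0 }'
}

# Assumes a ps that supports -e and -o with empty headers (procps does).
count_init_zombies() {
    ps -e -o stat= -o ppid= | filter_init_zombies
}
```

Running count_init_zombies before and after the `kill -CONT 1` gives the two
numbers to compare, and zombies that still have a living parent are left out.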

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-17 11:59                                                                     ` Alex Efros
@ 2007-09-18  8:14                                                                       ` Gerrit Pape
  2007-09-18 11:33                                                                         ` Alex Efros
                                                                                           ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: Gerrit Pape @ 2007-09-18  8:14 UTC (permalink / raw)
  To: supervision

On Mon, Sep 17, 2007 at 11:07:30AM +0200, Radek Podgorny wrote:
> Wow! This helped on one of my machines (havent tried others)!
On Mon, Sep 17, 2007 at 02:59:24PM +0300, Alex Efros wrote:
> Yeah! It works!!!

Nice.  Actually, the patch that checks for zombies every five seconds should
then have worked already.

After all it's a programming error in runit.  Charlie is completely
right with

On Sat, Sep 15, 2007 at 11:47:02AM -0400, Charlie Brady wrote:
> You won't see zombies if process 14925 reads exit status of process
> 14926 before it exits.
>
> Yes, runit should reap that status, but that doesn't change the fact
> that ssh is wrong. Note also that SIGCHLD is delivered to sshd
> process, not to runit, because 14926 terminates before 14925.

runit tries to over-optimise, and only wakes up to reap zombies if it
knows there are some, at least one.  Due to the fact that the mother
process, which re-parented itself to pid 1, on the one hand receives a
SIGCHLD, but on the other hand doesn't care about that, exits and leaves
the dead child alone, the child gets re-parented to runit, but without
any notification.

The situation would have been cleaned up on your systems once any child
process gets re-parented to process 1 before it terminates, and then
exits, causing runit to get a SIGCHLD; which apparently didn't happen.
It's what the kill -CONT 1 I suggested fakes.  That seems to explain why
this problem didn't show up for years.

I'm preparing a new version of runit that looks for and reaps zombies not
only if it knows that there are some, but also after a 14-second
timeout; there seems to be no way around that.
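
Until such a version is installed, the nudge suggested earlier in the thread
can be repeated by hand or from cron. A rough sketch only: the function name
and the DRY_RUN switch are invented for this example, and it has to run as
root. The chmod matters, because according to the runit man page a CONT
signal tells runit to shut the system down if /etc/runit/stopit exists and is
executable by its owner.

```shell
#!/bin/sh
# Workaround sketch, not part of runit. nudge_runit and DRY_RUN are
# hypothetical names used only here.

# Wakes runit (process 1) so that it goes through its reaping code.
# With DRY_RUN=1 the commands are only printed, not executed.
nudge_runit() {
    stopit=/etc/runit/stopit
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "chmod 0 $stopit"
        echo "kill -CONT 1"
        return 0
    fi
    # Remove the execute bit first so that CONT is not taken as a
    # shutdown request, then send the signal. If the file is missing,
    # chmod fails and no signal is sent, which is deliberately conservative.
    chmod 0 "$stopit" && kill -CONT 1
}
```

`DRY_RUN=1 nudge_runit` only prints the two commands, which is a cheap way to
check the script before letting it touch process 1.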

Thanks, Gerrit.

--And now fix those bad mother processes that see their children fade
away, but don't care about that; that's not good behavior of a Mom ,-).


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-18  8:14                                                                       ` Gerrit Pape
@ 2007-09-18 11:33                                                                         ` Alex Efros
  2007-09-18 11:45                                                                         ` Laurent Bercot
  2011-02-15 13:12                                                                         ` [LONG] " Laurent Bercot
  2 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2007-09-18 11:33 UTC (permalink / raw)
  To: supervision

Hi!

On Tue, Sep 18, 2007 at 08:14:41AM +0000, Gerrit Pape wrote:
> The situation would have been cleaned up on your systems once any child
> process gets re-parented to process 1 before it terminates, and then
> exits, causing runit to get a SIGCHLD; which apparently didn't happen.

The interesting question is: why didn't it happen? Or: why did it stop
happening after 25 May 2007 on Gentoo systems?

For example, in this case the parent process exits before the child, so it
doesn't have a chance to intercept SIGCHLD, and SIGCHLD must be delivered to runit:

# ps ax | grep Z | wc
    280    1681   12360

# perl -e 'fork || sleep 1; print "pid $$ exit\n"'
pid 4977 exit
# pid 4979 exit

# sleep 15; ps ax | grep 4979         
  4979 pts/1    Z      0:00 [perl] <defunct>

# ps ax | grep Z | wc
    283    1699   12488

> I prepare a new version of runit that looks for and reaps zombies not
> only if it knows that there are some, but also after a 14 seconds
> timeout, there seems to be no way around that.

Maybe it makes sense to check how sysvinit handles zombies?

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-09-18  8:14                                                                       ` Gerrit Pape
  2007-09-18 11:33                                                                         ` Alex Efros
@ 2007-09-18 11:45                                                                         ` Laurent Bercot
  2011-02-15 13:12                                                                         ` [LONG] " Laurent Bercot
  2 siblings, 0 replies; 113+ messages in thread
From: Laurent Bercot @ 2007-09-18 11:45 UTC (permalink / raw)
  To: supervision

> I prepare a new version of runit that looks for and reaps zombies not
> only if it knows that there are some, but also after a 14 seconds
> timeout, there seems to be no way around that.

 I'm curious. How do you guys come up with your constants ?
 DJB's svscan checks for new services (and reaps zombies) every 5 seconds.
 You're thinking about a check every 14 seconds.
 Is 14 the result of some clever calculation, or is it just something like
a "reasonable guesstimate" ?

 What I'd really like to see is an option to configure the timeout.
 For instance, runit -t 10 would reap zombies every 10 seconds.
 The default could, of course, be whatever reasonable guesstimate, or
clever result, you want.

 Congratulations to you all for analyzing and finally solving the
problem, anyway. And my apologies again for the 'hardware interrupt'
that occurred during the process.

-- 
 Laurent


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2007-05-24 23:07 runit not collecting zombies Radek Podgorny
  2007-05-26 10:35 ` Alex Efros
@ 2008-02-25  7:25 ` Alex Efros
  2008-02-25 14:57   ` Charlie Brady
                     ` (3 more replies)
       [not found] ` <F694D808C0BB4890A12C565F68B9A691@home.internal>
  2008-10-21 21:46 ` Alex Efros
  3 siblings, 4 replies; 113+ messages in thread
From: Alex Efros @ 2008-02-25  7:25 UTC (permalink / raw)
  To: supervision

Hi!

I've just got new information. One of our servers didn't have the zombie
issue. But it hadn't been updated for about 6 months because its previous
admin retired. Now I've begun updating it. First I updated simple tools and
libraries, which shouldn't break anything, and now I'm working on updating
more critical things like the toolchain, LAMP and a few system tools like
PAM and udev. Current uptime is 128 days. And you know what? I have just
noticed a lot of zombies there! The usual trick
'chmod 0 /etc/runit/stopit; kill -CONT 1' did the job, but...

    I have neither updated runit (or other critical packages like the
    toolchain) nor rebooted the system - this issue with zombies has arisen
    after an update of some simple utils and libraries!!!!!
    WTF??? How can this be possible?!

If runit doesn't reload dynamic libraries on the fly or execute
external utils (which I believe it doesn't), then how can an update of
some libraries and tools result in runit failing to collect zombies?

The only possible "explanation" that comes to mind is that some CPU/RAM
usage pattern that occurs while compiling all these packages somehow
affects runit-init. And this isn't just "high load", because this server
is under fairly high load most of the time, in every respect (CPU+RAM+HDD).

Below is the list of updated packages; I can also provide the versions
they were updated from, if anybody needs them.

Sat Feb 23 17:01:46 2008 >>> sys-apps/sandbox-1.2.18.1-r2
Sat Feb 23 17:02:21 2008 >>> sys-apps/portage-2.1.4.4
Sat Feb 23 21:48:03 2008 >>> app-misc/colordiff-1.0.6-r1
Sun Feb 24 02:17:13 2008 >>> sys-libs/ncurses-5.6-r2
Sun Feb 24 02:17:57 2008 >>> app-arch/bzip2-1.0.4-r1
Sun Feb 24 02:18:16 2008 >>> sys-devel/gnuconfig-20070724
Sun Feb 24 02:21:45 2008 >>> media-libs/freetype-2.3.5-r2
Sun Feb 24 02:24:49 2008 >>> dev-util/pkgconfig-0.22
Sun Feb 24 02:26:01 2008 >>> sys-libs/cracklib-2.8.12
Sun Feb 24 02:26:33 2008 >>> app-misc/pax-utils-0.1.16
Sun Feb 24 02:26:59 2008 >>> sys-devel/gcc-config-1.4.0-r4
Sun Feb 24 02:27:49 2008 >>> sys-libs/timezone-data-2007k
Sun Feb 24 02:29:33 2008 >>> media-libs/t1lib-5.0.2-r1
Sun Feb 24 02:32:08 2008 >>> dev-libs/libmcrypt-2.5.8
Sun Feb 24 02:35:28 2008 >>> dev-db/sqlite-3.5.3
Sun Feb 24 02:36:52 2008 >>> sys-fs/sysfsutils-2.1.0
Sun Feb 24 02:37:13 2008 >>> net-analyzer/netselect-0.3-r2
Sun Feb 24 02:37:30 2008 >>> sys-process/cronbase-0.3.2-r1
Sun Feb 24 02:40:02 2008 >>> dev-lang/spidermonkey-1.7.0
Sun Feb 24 02:42:02 2008 >>> app-arch/cpio-2.9-r1
Sun Feb 24 02:42:43 2008 >>> sys-process/acct-6.3.5-r2
Sun Feb 24 02:43:16 2008 >>> app-portage/portage-utils-0.1.29
Sun Feb 24 02:56:51 2008 >>> app-arch/p7zip-4.57
Sun Feb 24 02:58:10 2008 >>> sys-fs/reiserfsprogs-3.6.19-r2
Sun Feb 24 02:58:53 2008 >>> net-proxy/3proxy-0.5.3j
Sun Feb 24 02:59:19 2008 >>> net-analyzer/traceroute-2.0.9-r1
Sun Feb 24 03:01:08 2008 >>> sys-kernel/linux-headers-2.6.23-r3
Sun Feb 24 03:04:43 2008 >>> app-shells/bash-3.2_p17-r1
Sun Feb 24 03:07:02 2008 >>> media-libs/libpng-1.2.24
Sun Feb 24 03:09:36 2008 >>> dev-libs/libpcre-7.6-r1
Sun Feb 24 03:13:22 2008 >>> media-libs/tiff-3.8.2-r3
Sun Feb 24 03:14:18 2008 >>> sys-apps/less-418
Sun Feb 24 03:24:46 2008 >>> app-portage/eix-0.10.2
Sun Feb 24 03:26:09 2008 >>> sys-apps/pciutils-2.2.9
Sun Feb 24 03:27:51 2008 >>> net-misc/netkit-telnetd-0.17-r8
Sun Feb 24 03:34:33 2008 >>> sys-libs/readline-5.2_p12-r1
Sun Feb 24 03:36:50 2008 >>> sys-apps/sysvinit-2.86-r10
Sun Feb 24 03:38:53 2008 >>> sys-apps/ed-0.8
Sun Feb 24 03:41:09 2008 >>> net-misc/iputils-20070202
Sun Feb 24 04:06:57 2008 >>> dev-libs/libxml2-2.6.30-r1
Sun Feb 24 04:53:51 2008 >>> sys-devel/gettext-0.17
Sun Feb 24 04:55:04 2008 >>> sys-apps/sed-4.1.5
Sun Feb 24 04:56:47 2008 >>> sys-devel/m4-1.4.10
Sun Feb 24 04:57:34 2008 >>> sys-apps/man-1.6f
Sun Feb 24 04:58:44 2008 >>> sys-devel/flex-2.5.33-r3
Sun Feb 24 04:59:48 2008 >>> dev-libs/libgpg-error-1.6
Sun Feb 24 05:02:31 2008 >>> sys-apps/findutils-4.3.11
Sun Feb 24 05:04:23 2008 >>> sys-apps/gawk-3.1.5-r5
Sun Feb 24 05:05:38 2008 >>> app-editors/nano-2.0.7
Sun Feb 24 05:06:49 2008 >>> dev-util/dialog-1.1.20071028
Sun Feb 24 05:07:51 2008 >>> sys-apps/kbd-1.13-r1
Sun Feb 24 05:10:45 2008 >>> app-arch/tar-1.19-r1
Sun Feb 24 05:11:23 2008 >>> sys-apps/ucspi-tcp-0.88-r16
Sun Feb 24 05:12:49 2008 >>> sys-apps/man-pages-2.76
Sun Feb 24 05:13:23 2008 >>> sys-devel/bc-1.06-r6
Sun Feb 24 05:13:34 2008 >>> virtual/editor-0
Sun Feb 24 05:26:40 2008 >>> sys-devel/binutils-2.18-r1
Sun Feb 24 05:28:49 2008 >>> sys-libs/com_err-1.40.4
Sun Feb 24 05:41:56 2008 >>> sys-libs/db-4.5.20_p2
Sun Feb 24 05:43:16 2008 >>> sys-libs/ss-1.40.4
Sun Feb 24 05:43:38 2008 >>> sys-apps/paxctl-0.5
Sun Feb 24 05:46:54 2008 >>> sys-fs/e2fsprogs-1.40.4
Sun Feb 24 05:49:59 2008 >>> sys-apps/util-linux-2.13-r2
Sun Feb 24 05:54:34 2008 >>> sys-apps/parted-1.8.8
Sun Feb 24 05:56:03 2008 >>> net-mail/fetchmail-6.3.8-r1
Sun Feb 24 06:01:18 2008 >>> net-misc/ntp-4.2.4_p4
Sun Feb 24 06:15:26 2008 >>> dev-lang/perl-5.8.8-r4
Sun Feb 24 06:16:04 2008 >>> sys-devel/autoconf-2.61-r1
Sun Feb 24 06:16:20 2008 >>> app-admin/perl-cleaner-1.05
Sun Feb 24 06:18:05 2008 >>> dev-libs/libtasn1-1.2
Sun Feb 24 06:18:25 2008 >>> sys-apps/help2man-1.36.4
Sun Feb 24 06:20:09 2008 >>> net-misc/rsync-2.6.9-r5
Sun Feb 24 06:22:37 2008 >>> sys-devel/libtool-1.5.26
Sun Feb 24 07:56:47 2008 >>> sys-apps/busybox-1.8.2
Sun Feb 24 07:57:13 2008 >>> sys-apps/slocate-3.1-r1
Sun Feb 24 07:58:20 2008 >>> net-libs/libpcap-0.9.8
Sun Feb 24 08:02:15 2008 >>> dev-libs/libgcrypt-1.4.0-r1
Sun Feb 24 08:03:07 2008 >>> dev-perl/DBI-1.601
Sun Feb 24 08:03:47 2008 >>> x11-misc/read-edid-1.4.1-r1
Sun Feb 24 08:05:52 2008 >>> net-dns/libidn-1.0-r1
Sun Feb 24 08:07:13 2008 >>> sys-process/psmisc-22.6
Sun Feb 24 08:28:13 2008 >>> net-fs/samba-3.0.28
Sun Feb 24 08:30:00 2008 >>> dev-util/strace-4.5.16
Sun Feb 24 08:35:23 2008 >>> net-analyzer/rrdtool-1.2.23-r1
Sun Feb 24 08:38:41 2008 >>> dev-libs/libxslt-1.1.22
Sun Feb 24 08:40:32 2008 >>> app-crypt/opencdk-0.6.6
Sun Feb 24 08:41:20 2008 >>> net-analyzer/lft-3.0
Sun Feb 24 08:45:39 2008 >>> net-analyzer/nmap-4.53
Sun Feb 24 08:46:01 2008 >>> net-misc/whois-4.7.24
Sun Feb 24 08:47:51 2008 >>> net-analyzer/tcpdump-3.9.8
Sun Feb 24 08:57:19 2008 >>> net-libs/gnutls-2.0.4
Sun Feb 24 09:01:41 2008 >>> net-misc/curl-7.17.1
Sun Feb 24 09:08:55 2008 >>> dev-lang/python-2.4.4-r6
Sun Feb 24 09:10:28 2008 >>> sys-apps/file-4.23
Sun Feb 24 09:10:44 2008 >>> app-admin/eselect-vi-1.1.5
Sun Feb 24 09:10:55 2008 >>> app-admin/eselect-ctags-1.3
Sun Feb 24 09:11:55 2008 >>> dev-util/ctags-5.7
Sun Feb 24 09:12:25 2008 >>> sys-apps/debianutils-2.28.2
Sun Feb 24 09:12:45 2008 >>> app-portage/gentoolkit-0.2.3-r1
Sun Feb 24 09:13:18 2008 >>> sys-apps/baselayout-1.12.11.1
Sun Feb 24 09:14:03 2008 >>> sys-apps/module-init-tools-3.4
Sun Feb 24 09:30:04 2008 >>> sys-kernel/hardened-sources-2.6.23-r7
Sun Feb 24 09:31:40 2008 >>> net-dialup/ppp-2.4.4-r14
Sun Feb 24 09:33:01 2008 >>> sys-apps/lm_sensors-2.10.4
Sun Feb 24 09:34:21 2008 >>> net-firewall/iptables-1.3.8-r3
Sun Feb 24 09:36:33 2008 >>> app-editors/vim-core-7.1.123
Sun Feb 24 09:41:04 2008 >>> app-editors/vim-7.1.123
Sun Feb 24 10:08:14 2008 >>> dev-lang/php-5.2.4_pre200708051230-r2
Sun Feb 24 10:09:42 2008 >>> dev-php/PEAR-PEAR-1.6.2-r1
Sun Feb 24 10:09:53 2008 >>> app-admin/eselect-fontconfig-1.0
Sun Feb 24 10:13:11 2008 >>> media-libs/fontconfig-2.5.0-r1
Sun Feb 24 10:13:49 2008 >>> media-fonts/corefonts-1-r4
Sun Feb 24 10:35:59 2008 >>> media-gfx/imagemagick-6.3.7.9
Sun Feb 24 11:02:04 2008 >>> dev-libs/glib-2.14.6
Sun Feb 24 11:02:30 2008 >>> app-arch/rar-3.7.1

These packages were installed, not updated:
Sun Feb 24 18:10:56 2008 >>> app-misc/mc-mp-4.1.40_pre9-r1
Sun Feb 24 23:47:46 2008 >>> app-forensics/chkrootkit-0.47
Sun Feb 24 23:48:45 2008 >>> sys-process/lsof-4.78-r1
Sun Feb 24 23:49:00 2008 >>> app-forensics/rkhunter-1.2.9

P.S. Gerrit: runit is really cool, but this bug (unfixed for about a year!)
drives me crazy... :(

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2008-02-25  7:25 ` Alex Efros
@ 2008-02-25 14:57   ` Charlie Brady
  2008-02-25 15:23     ` Radek Podgorny
       [not found]     ` <59012.::ffff:77.75.72.226.1203952988.squirrel@mail.podgorny.cz>
  2008-02-25 15:27   ` Radek Podgorny
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 113+ messages in thread
From: Charlie Brady @ 2008-02-25 14:57 UTC (permalink / raw)
  To: Alex Efros; +Cc: supervision


On Mon, 25 Feb 2008, Alex Efros wrote:

> The only possible "explanation" that comes to mind is that some CPU/RAM
> usage pattern that occurs while compiling all these packages somehow
> affects runit-init.

... or the kernel.

> P.S. Gerrit: runit is really cool, but this bug (unfixed for about a year!)
> drives me crazy... :(

If it's a kernel bug, then Gerrit won't be able to fix it.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2008-02-25 14:57   ` Charlie Brady
@ 2008-02-25 15:23     ` Radek Podgorny
       [not found]     ` <59012.::ffff:77.75.72.226.1203952988.squirrel@mail.podgorny.cz>
  1 sibling, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2008-02-25 15:23 UTC (permalink / raw)
  To: Charlie Brady; +Cc: Alex Efros, supervision

>
> On Mon, 25 Feb 2008, Alex Efros wrote:
>
>> The only possible "explanation" that comes to mind is that some CPU/RAM
>> usage pattern that occurs while compiling all these packages somehow
>> affects runit-init.
>
> ... or the kernel.

How can it be the kernel when the system was not rebooted?

>
>> P.S. Gerrit: runit is really cool, but this bug (unfixed for about a year!)
>> drives me crazy... :(
>
> If it's a kernel bug, then Gerrit won't be able to fix it.
>
>




^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
       [not found]     ` <59012.::ffff:77.75.72.226.1203952988.squirrel@mail.podgorny.cz>
@ 2008-02-25 15:26       ` George Georgalis
  2008-02-25 15:32       ` Charlie Brady
  2008-02-25 17:20       ` Mike Buland
  2 siblings, 0 replies; 113+ messages in thread
From: George Georgalis @ 2008-02-25 15:26 UTC (permalink / raw)
  To: supervision

On Mon, Feb 25, 2008 at 04:23:08PM +0100, Radek Podgorny wrote:
>>
>> On Mon, 25 Feb 2008, Alex Efros wrote:
>>
>>> The only possible "explanation" that comes to mind is that some CPU/RAM
>>> usage pattern that occurs while compiling all these packages somehow
>>> affects runit-init.
>>
>> ... or the kernel.
>
>How can it be the kernel when the system was not rebooted?

I think the idea is some new package is wedging the
kernel into a state with the issue as a consequence.

// George


-- 
George Georgalis, information system scientist <IXOYE><


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2008-02-25  7:25 ` Alex Efros
  2008-02-25 14:57   ` Charlie Brady
@ 2008-02-25 15:27   ` Radek Podgorny
       [not found]   ` <34616.::ffff:77.75.72.226.1203953244.squirrel@mail.podgorny.cz>
  2008-02-27  8:19   ` Bernhard Graf
  3 siblings, 0 replies; 113+ messages in thread
From: Radek Podgorny @ 2008-02-25 15:27 UTC (permalink / raw)
  To: supervision

Hi! I'm interested in the original versions. Could you please post them?

Thanks
Radek Podgorny


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
       [not found]     ` <59012.::ffff:77.75.72.226.1203952988.squirrel@mail.podgorny.cz>
  2008-02-25 15:26       ` George Georgalis
@ 2008-02-25 15:32       ` Charlie Brady
  2008-02-25 16:17         ` Alex Efros
  2008-02-25 17:20       ` Mike Buland
  2 siblings, 1 reply; 113+ messages in thread
From: Charlie Brady @ 2008-02-25 15:32 UTC (permalink / raw)
  To: Radek Podgorny; +Cc: Alex Efros, supervision


On Mon, 25 Feb 2008, Radek Podgorny wrote:

>>
>> On Mon, 25 Feb 2008, Alex Efros wrote:
>>
>>> The only possible "explanation" that comes to mind is that some CPU/RAM
>>> usage pattern that occurs while compiling all these packages somehow
>>> affects runit-init.
>>
>> ... or the kernel.
>
> How can it be the kernel when the system was not rebooted?

It could be a kernel bug only triggered by memory pressure ...


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
       [not found]   ` <34616.::ffff:77.75.72.226.1203953244.squirrel@mail.podgorny.cz>
@ 2008-02-25 16:15     ` Alex Efros
  0 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2008-02-25 16:15 UTC (permalink / raw)
  To: supervision

Hi!

On Mon, Feb 25, 2008 at 04:27:24PM +0100, Radek Podgorny wrote:
> Hi! I'm interested in the original versions. Could you please post them?

app-admin/eselect-vi                    1.1.4 -> 1.1.5
app-admin/perl-cleaner                 1.04.3 -> 1.05
app-arch/bzip2                          1.0.4 -> 1.0.4-r1
app-arch/cpio                             2.9 -> 2.9-r1
app-arch/p7zip                           4.48 -> 4.57
app-arch/rar                            3.7.0 -> 3.7.1
app-arch/tar                          1.18-r1 -> 1.19-r1
app-crypt/opencdk                       0.5.7 -> 0.6.6
app-editors/nano                        2.0.6 -> 2.0.7
app-editors/vim                       7.1.042 -> 7.1.123
app-editors/vim-core                  7.1.042 -> 7.1.123
app-misc/mc                          4.6.1-r3 -> 4.6.1-r4
app-misc/pax-utils                     0.1.15 -> 0.1.16
app-portage/eix                        0.9.10 -> 0.10.2
app-portage/portage-utils              0.1.28 -> 0.1.29
app-shells/bash                       3.2_p17 -> 3.2_p17-r1
dev-db/sqlite                          3.3.17 -> 3.5.3
dev-lang/perl                        5.8.8-r2 -> 5.8.8-r4
dev-lang/python                      2.3.5-r3 -> 2.4.4-r6
dev-lang/spidermonkey                  1.5-r2 -> 1.7.0
dev-libs/glib                         2.12.12 -> 2.14.6
dev-libs/libgcrypt                   1.2.2-r1 -> 1.4.0-r1
dev-libs/libgpg-error                     1.5 -> 1.6
dev-libs/libmcrypt                      2.5.7 -> 2.5.8
dev-libs/libpcre                          6.6 -> 7.6-r1
dev-libs/libtasn1                       0.3.5 -> 1.2
dev-libs/libxml2                       2.6.28 -> 2.6.30-r1
dev-libs/libxslt                       1.1.20 -> 1.1.22
dev-perl/DBI                             1.58 -> 1.601
dev-php/PEAR-PEAR                      1.4.11 -> 1.6.2-r1
dev-util/ctags                       5.5.4-r2 -> 5.7
dev-util/dialog                  1.1.20070704 -> 1.1.20071028
dev-util/pkgconfig                    0.21-r1 -> 0.22
dev-util/strace                        4.5.15 -> 4.5.16
media-fonts/corefonts                    1-r2 -> 1-r4
media-gfx/imagemagick                6.3.4-r1 -> 6.3.7.9
media-libs/fontconfig                   2.4.2 -> 2.5.0-r1
media-libs/freetype                  2.3.4-r2 -> 2.3.5-r2
media-libs/libpng                      1.2.18 -> 1.2.24
media-libs/t1lib                        5.0.2 -> 5.0.2-r1
media-libs/tiff                      3.8.2-r2 -> 3.8.2-r3
net-analyzer/lft                         2.31 -> 3.0
net-analyzer/netselect                 0.3-r1 -> 0.3-r2
net-analyzer/nmap                        4.20 -> 4.53
net-analyzer/rrdtool                1.2.15-r3 -> 1.2.23-r1
net-analyzer/tcpdump                 3.9.6-r1 -> 3.9.8
net-analyzer/traceroute            1.4_p12-r5 -> 2.0.9-r1
net-dialup/ppp                       2.4.4-r9 -> 2.4.4-r14
net-dns/libidn                       0.6.9-r1 -> 1.0-r1
net-firewall/iptables                1.3.5-r4 -> 1.3.8-r3
net-fs/samba                        3.0.24-r3 -> 3.0.28
net-libs/gnutls                      1.4.4-r1 -> 2.0.4
net-libs/libpcap                        0.9.6 -> 0.9.8
net-mail/fetchmail                      6.3.8 -> 6.3.8-r1
net-misc/curl                          7.16.4 -> 7.17.1
net-misc/iputils                     20060512 -> 20070202
net-misc/netkit-telnetd               0.17-r6 -> 0.17-r8
net-misc/ntp                         4.2.4_p0 -> 4.2.4_p4
net-misc/rsync                       2.6.9-r2 -> 2.6.9-r5
net-misc/whois                         4.7.21 -> 4.7.24
net-proxy/3proxy                       0.5.3h -> 0.5.3j
sys-apps/baselayout                 1.12.9-r2 -> 1.12.11.1
sys-apps/busybox                        1.6.1 -> 1.8.2
sys-apps/debianutils                   2.22.1 -> 2.28.2
sys-apps/ed                               0.5 -> 0.8
sys-apps/file                         4.21-r1 -> 4.23
sys-apps/findutils                      4.3.8 -> 4.3.11
sys-apps/gawk                        3.1.5-r3 -> 3.1.5-r5
sys-apps/kbd                          1.12-r8 -> 1.13-r1
sys-apps/less                             406 -> 418
sys-apps/lm_sensors                    2.10.1 -> 2.10.4
sys-apps/man                          1.6e-r3 -> 1.6f
sys-apps/man-pages                       2.63 -> 2.76
sys-apps/module-init-tools           3.2.2-r3 -> 3.4
sys-apps/parted                      1.7.1-r1 -> 1.8.8
sys-apps/paxctl                           0.4 -> 0.5
sys-apps/pciutils                    2.2.4-r3 -> 2.2.9
sys-apps/portage                     2.1.2.11 -> 2.1.4.4
sys-apps/sandbox                       1.2.17 -> 1.2.18.1-r2
sys-apps/slocate                       2.7-r8 -> 3.1-r1
sys-apps/sysvinit                     2.86-r8 -> 2.86-r10
sys-apps/ucspi-tcp                   0.88-r15 -> 0.88-r16
sys-apps/util-linux                  2.12r-r7 -> 2.13-r2
sys-devel/autoconf                       2.13 -> 2.61-r1
sys-devel/binutils                       2.17 -> 2.18-r1
sys-devel/flex                      2.5.33-r2 -> 2.5.33-r3
sys-devel/gcc-config                   1.3.16 -> 1.4.0-r4
sys-devel/gettext                   0.16.1-r1 -> 0.17
sys-devel/gnuconfig                  20070118 -> 20070724
sys-devel/libtool                     1.5.23b -> 1.5.26
sys-devel/m4                         1.4.9-r1 -> 1.4.10
sys-fs/e2fsprogs                      1.39-r2 -> 1.40.4
sys-fs/reiserfsprogs                3.6.19-r1 -> 3.6.19-r2
sys-fs/sysfsutils                    1.3.0-r1 -> 2.1.0
sys-kernel/hardened-sources         2.6.20-r2 -> 2.6.23-r7
sys-kernel/linux-headers            2.6.17-r2 -> 2.6.23-r3
sys-libs/com_err                         1.39 -> 1.40.4
sys-libs/cracklib                    2.8.9-r1 -> 2.8.12
sys-libs/db                         3.2.9-r11 -> 4.5.20_p2
sys-libs/ncurses                       5.6-r1 -> 5.6-r2
sys-libs/readline                      5.2_p4 -> 5.2_p12-r1
sys-libs/ss                              1.39 -> 1.40.4
sys-libs/timezone-data                  2007f -> 2007k
sys-process/acct                     6.3.5-r1 -> 6.3.5-r2
sys-process/cronbase                    0.3.2 -> 0.3.2-r1
sys-process/psmisc                    22.5-r1 -> 22.6

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: runit not collecting zombies
  2008-02-25 15:32       ` Charlie Brady
@ 2008-02-25 16:17         ` Alex Efros
  0 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2008-02-25 16:17 UTC (permalink / raw)
  To: supervision

Hi!

On Mon, Feb 25, 2008 at 10:32:45AM -0500, Charlie Brady wrote:
>>> ... or the kernel.
>> How can it be the kernel when the system was not rebooted?
> It could be a kernel bug only triggered by memory pressure ...

If this is a kernel bug, then why did this command fix it for about a week?
    chmod 0 /etc/runit/stopit; kill -CONT 1
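
For readers unfamiliar with the trick: as far as I understand runit(8), a CONT signal only starts a shutdown when /etc/runit/stopit has the owner execute bit set, which is why the chmod comes first. Below is a hedged sketch of the same two steps with a safety check in between, in Python. The function name and its parameters are invented for illustration, and whether the wake-up really makes runit reap is what this thread reports, not something the sketch proves.

```python
import os
import signal
import stat

def wake_init_without_shutdown(pid=1, stopit="/etc/runit/stopit"):
    """Clear all permissions on `stopit`, then send SIGCONT to `pid`.

    Refuses to send the signal if the owner execute bit is somehow still
    set, because in that case the CONT signal would request a shutdown.
    """
    os.chmod(stopit, 0)
    mode = stat.S_IMODE(os.stat(stopit).st_mode)
    if mode & stat.S_IXUSR:
        raise RuntimeError("stopit is still executable; not sending SIGCONT")
    os.kill(pid, signal.SIGCONT)
```

Run as root with the defaults, this does the same as the command above; the pid and stopit parameters exist only so the function can be tried against a scratch file and a harmless process.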

-- 
			WBR, Alex.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: runit not collecting zombies
       [not found] ` <F694D808C0BB4890A12C565F68B9A691@home.internal>
@ 2008-02-25 16:24   ` rehan khan
  2008-02-25 16:27     ` Charlie Brady
                       ` (2 more replies)
  0 siblings, 3 replies; 113+ messages in thread
From: rehan khan @ 2008-02-25 16:24 UTC (permalink / raw)
  To: Alex Efros, supervision

Actually a number of the updated packages *could* be the cause of this
issue. The package that I would look at closest is the bash upgrade as
runit scripts depend on the built-in exec command properly replacing the
shell with the same process id. Next see if the util-linux package
upgrade caused the problem. Last and most unlikely is the sysVinit
upgrade. I would guess downgrading one package at a time and checking
for zombies is the simplest methodology. I don't think the kernel is
implicated if you haven't restarted the machine. Again if I was to point
a finger it would be vaguely in the direction of the bash upgrade. As
bashtastic as it is, it is not unheard of for bash to have some quirks
from version to version.
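
The exec behaviour mentioned above can be checked directly. This is a minimal sketch in Python, assuming a POSIX `sh` on the PATH; pid_after_exec is a hypothetical helper name. It starts a shell that immediately execs a second shell and compares the process id of the first with the one the second reports.

```python
import subprocess

def pid_after_exec():
    """Return (pid of the shell we started, pid reported after its exec)."""
    proc = subprocess.Popen(
        # The outer shell replaces itself with an inner shell that prints $$.
        ["sh", "-c", 'exec sh -c "echo \\$\\$"'],
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate()
    return proc.pid, int(out.strip())

if __name__ == "__main__":
    before, after = pid_after_exec()
    print("same pid" if before == after else "pid changed: exec did not replace the shell")
```

If the two numbers ever differed, a supervisor would be watching a shell rather than the daemon it exec'd, which is the kind of breakage being speculated about here.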

R

-----Original Message-----
From: Alex Efros [mailto:powerman@powerman.name] 
Sent: 25 February 2008 07:31
To: supervision@list.skarnet.org
Subject: Re: runit not collecting zombies

Hi!

I've just got new information. One of our servers didn't have the zombie
issue. But it hadn't been updated for about 6 months because its previous
admin retired. Now I've begun updating it. First I updated simple tools and
libraries, which shouldn't break anything, and now I'm working on updating
more critical things like the toolchain, LAMP and a few system tools like
PAM and udev. Current uptime is 128 days. And you know what? I have just
noticed a lot of zombies there! The usual trick
'chmod 0 /etc/runit/stopit; kill -CONT 1' did the job, but...

    I have neither updated runit (or other critical packages like the
    toolchain) nor rebooted the system - this issue with zombies has arisen
    after an update of some simple utils and libraries!!!!!
    WTF??? How can this be possible?!

If runit doesn't reload dynamic libraries on the fly or execute
external utils (which I believe it doesn't), then how can an update of
some libraries and tools result in runit failing to collect zombies?

The only possible "explanation" that comes to mind is that some CPU/RAM
usage pattern that occurs while compiling all these packages somehow
affects runit-init. And this isn't just "high load", because this server
is under fairly high load most of the time, in every respect (CPU+RAM+HDD).

Below is the list of updated packages; I can also provide the versions
they were updated from, if anybody needs them.

Sat Feb 23 17:01:46 2008 >>> sys-apps/sandbox-1.2.18.1-r2
Sat Feb 23 17:02:21 2008 >>> sys-apps/portage-2.1.4.4
Sat Feb 23 21:48:03 2008 >>> app-misc/colordiff-1.0.6-r1
Sun Feb 24 02:17:13 2008 >>> sys-libs/ncurses-5.6-r2
Sun Feb 24 02:17:57 2008 >>> app-arch/bzip2-1.0.4-r1
Sun Feb 24 02:18:16 2008 >>> sys-devel/gnuconfig-20070724
Sun Feb 24 02:21:45 2008 >>> media-libs/freetype-2.3.5-r2
Sun Feb 24 02:24:49 2008 >>> dev-util/pkgconfig-0.22
Sun Feb 24 02:26:01 2008 >>> sys-libs/cracklib-2.8.12
Sun Feb 24 02:26:33 2008 >>> app-misc/pax-utils-0.1.16
Sun Feb 24 02:26:59 2008 >>> sys-devel/gcc-config-1.4.0-r4
Sun Feb 24 02:27:49 2008 >>> sys-libs/timezone-data-2007k
Sun Feb 24 02:29:33 2008 >>> media-libs/t1lib-5.0.2-r1
Sun Feb 24 02:32:08 2008 >>> dev-libs/libmcrypt-2.5.8
Sun Feb 24 02:35:28 2008 >>> dev-db/sqlite-3.5.3
Sun Feb 24 02:36:52 2008 >>> sys-fs/sysfsutils-2.1.0
Sun Feb 24 02:37:13 2008 >>> net-analyzer/netselect-0.3-r2
Sun Feb 24 02:37:30 2008 >>> sys-process/cronbase-0.3.2-r1
Sun Feb 24 02:40:02 2008 >>> dev-lang/spidermonkey-1.7.0
Sun Feb 24 02:42:02 2008 >>> app-arch/cpio-2.9-r1
Sun Feb 24 02:42:43 2008 >>> sys-process/acct-6.3.5-r2
Sun Feb 24 02:43:16 2008 >>> app-portage/portage-utils-0.1.29
Sun Feb 24 02:56:51 2008 >>> app-arch/p7zip-4.57
Sun Feb 24 02:58:10 2008 >>> sys-fs/reiserfsprogs-3.6.19-r2
Sun Feb 24 02:58:53 2008 >>> net-proxy/3proxy-0.5.3j
Sun Feb 24 02:59:19 2008 >>> net-analyzer/traceroute-2.0.9-r1
Sun Feb 24 03:01:08 2008 >>> sys-kernel/linux-headers-2.6.23-r3
Sun Feb 24 03:04:43 2008 >>> app-shells/bash-3.2_p17-r1
Sun Feb 24 03:07:02 2008 >>> media-libs/libpng-1.2.24
Sun Feb 24 03:09:36 2008 >>> dev-libs/libpcre-7.6-r1
Sun Feb 24 03:13:22 2008 >>> media-libs/tiff-3.8.2-r3
Sun Feb 24 03:14:18 2008 >>> sys-apps/less-418
Sun Feb 24 03:24:46 2008 >>> app-portage/eix-0.10.2
Sun Feb 24 03:26:09 2008 >>> sys-apps/pciutils-2.2.9
Sun Feb 24 03:27:51 2008 >>> net-misc/netkit-telnetd-0.17-r8
Sun Feb 24 03:34:33 2008 >>> sys-libs/readline-5.2_p12-r1
Sun Feb 24 03:36:50 2008 >>> sys-apps/sysvinit-2.86-r10
Sun Feb 24 03:38:53 2008 >>> sys-apps/ed-0.8
Sun Feb 24 03:41:09 2008 >>> net-misc/iputils-20070202
Sun Feb 24 04:06:57 2008 >>> dev-libs/libxml2-2.6.30-r1
Sun Feb 24 04:53:51 2008 >>> sys-devel/gettext-0.17
Sun Feb 24 04:55:04 2008 >>> sys-apps/sed-4.1.5
Sun Feb 24 04:56:47 2008 >>> sys-devel/m4-1.4.10
Sun Feb 24 04:57:34 2008 >>> sys-apps/man-1.6f
Sun Feb 24 04:58:44 2008 >>> sys-devel/flex-2.5.33-r3
Sun Feb 24 04:59:48 2008 >>> dev-libs/libgpg-error-1.6
Sun Feb 24 05:02:31 2008 >>> sys-apps/findutils-4.3.11
Sun Feb 24 05:04:23 2008 >>> sys-apps/gawk-3.1.5-r5
Sun Feb 24 05:05:38 2008 >>> app-editors/nano-2.0.7
Sun Feb 24 05:06:49 2008 >>> dev-util/dialog-1.1.20071028
Sun Feb 24 05:07:51 2008 >>> sys-apps/kbd-1.13-r1
Sun Feb 24 05:10:45 2008 >>> app-arch/tar-1.19-r1
Sun Feb 24 05:11:23 2008 >>> sys-apps/ucspi-tcp-0.88-r16
Sun Feb 24 05:12:49 2008 >>> sys-apps/man-pages-2.76
Sun Feb 24 05:13:23 2008 >>> sys-devel/bc-1.06-r6
Sun Feb 24 05:13:34 2008 >>> virtual/editor-0
Sun Feb 24 05:26:40 2008 >>> sys-devel/binutils-2.18-r1
Sun Feb 24 05:28:49 2008 >>> sys-libs/com_err-1.40.4
Sun Feb 24 05:41:56 2008 >>> sys-libs/db-4.5.20_p2
Sun Feb 24 05:43:16 2008 >>> sys-libs/ss-1.40.4
Sun Feb 24 05:43:38 2008 >>> sys-apps/paxctl-0.5
Sun Feb 24 05:46:54 2008 >>> sys-fs/e2fsprogs-1.40.4
Sun Feb 24 05:49:59 2008 >>> sys-apps/util-linux-2.13-r2
Sun Feb 24 05:54:34 2008 >>> sys-apps/parted-1.8.8
Sun Feb 24 05:56:03 2008 >>> net-mail/fetchmail-6.3.8-r1
Sun Feb 24 06:01:18 2008 >>> net-misc/ntp-4.2.4_p4
Sun Feb 24 06:15:26 2008 >>> dev-lang/perl-5.8.8-r4
Sun Feb 24 06:16:04 2008 >>> sys-devel/autoconf-2.61-r1
Sun Feb 24 06:16:20 2008 >>> app-admin/perl-cleaner-1.05
Sun Feb 24 06:18:05 2008 >>> dev-libs/libtasn1-1.2
Sun Feb 24 06:18:25 2008 >>> sys-apps/help2man-1.36.4
Sun Feb 24 06:20:09 2008 >>> net-misc/rsync-2.6.9-r5
Sun Feb 24 06:22:37 2008 >>> sys-devel/libtool-1.5.26
Sun Feb 24 07:56:47 2008 >>> sys-apps/busybox-1.8.2
Sun Feb 24 07:57:13 2008 >>> sys-apps/slocate-3.1-r1
Sun Feb 24 07:58:20 2008 >>> net-libs/libpcap-0.9.8
Sun Feb 24 08:02:15 2008 >>> dev-libs/libgcrypt-1.4.0-r1
Sun Feb 24 08:03:07 2008 >>> dev-perl/DBI-1.601
Sun Feb 24 08:03:47 2008 >>> x11-misc/read-edid-1.4.1-r1
Sun Feb 24 08:05:52 2008 >>> net-dns/libidn-1.0-r1
Sun Feb 24 08:07:13 2008 >>> sys-process/psmisc-22.6
Sun Feb 24 08:28:13 2008 >>> net-fs/samba-3.0.28
Sun Feb 24 08:30:00 2008 >>> dev-util/strace-4.5.16
Sun Feb 24 08:35:23 2008 >>> net-analyzer/rrdtool-1.2.23-r1
Sun Feb 24 08:38:41 2008 >>> dev-libs/libxslt-1.1.22
Sun Feb 24 08:40:32 2008 >>> app-crypt/opencdk-0.6.6
Sun Feb 24 08:41:20 2008 >>> net-analyzer/lft-3.0
Sun Feb 24 08:45:39 2008 >>> net-analyzer/nmap-4.53
Sun Feb 24 08:46:01 2008 >>> net-misc/whois-4.7.24
Sun Feb 24 08:47:51 2008 >>> net-analyzer/tcpdump-3.9.8
Sun Feb 24 08:57:19 2008 >>> net-libs/gnutls-2.0.4
Sun Feb 24 09:01:41 2008 >>> net-misc/curl-7.17.1
Sun Feb 24 09:08:55 2008 >>> dev-lang/python-2.4.4-r6
Sun Feb 24 09:10:28 2008 >>> sys-apps/file-4.23
Sun Feb 24 09:10:44 2008 >>> app-admin/eselect-vi-1.1.5
Sun Feb 24 09:10:55 2008 >>> app-admin/eselect-ctags-1.3
Sun Feb 24 09:11:55 2008 >>> dev-util/ctags-5.7
Sun Feb 24 09:12:25 2008 >>> sys-apps/debianutils-2.28.2
Sun Feb 24 09:12:45 2008 >>> app-portage/gentoolkit-0.2.3-r1
Sun Feb 24 09:13:18 2008 >>> sys-apps/baselayout-1.12.11.1
Sun Feb 24 09:14:03 2008 >>> sys-apps/module-init-tools-3.4
Sun Feb 24 09:30:04 2008 >>> sys-kernel/hardened-sources-2.6.23-r7
Sun Feb 24 09:31:40 2008 >>> net-dialup/ppp-2.4.4-r14
Sun Feb 24 09:33:01 2008 >>> sys-apps/lm_sensors-2.10.4
Sun Feb 24 09:34:21 2008 >>> net-firewall/iptables-1.3.8-r3
Sun Feb 24 09:36:33 2008 >>> app-editors/vim-core-7.1.123
Sun Feb 24 09:41:04 2008 >>> app-editors/vim-7.1.123
Sun Feb 24 10:08:14 2008 >>> dev-lang/php-5.2.4_pre200708051230-r2
Sun Feb 24 10:09:42 2008 >>> dev-php/PEAR-PEAR-1.6.2-r1
Sun Feb 24 10:09:53 2008 >>> app-admin/eselect-fontconfig-1.0
Sun Feb 24 10:13:11 2008 >>> media-libs/fontconfig-2.5.0-r1
Sun Feb 24 10:13:49 2008 >>> media-fonts/corefonts-1-r4
Sun Feb 24 10:35:59 2008 >>> media-gfx/imagemagick-6.3.7.9
Sun Feb 24 11:02:04 2008 >>> dev-libs/glib-2.14.6
Sun Feb 24 11:02:30 2008 >>> app-arch/rar-3.7.1

These packages were installed, not updated:
Sun Feb 24 18:10:56 2008 >>> app-misc/mc-mp-4.1.40_pre9-r1
Sun Feb 24 23:47:46 2008 >>> app-forensics/chkrootkit-0.47
Sun Feb 24 23:48:45 2008 >>> sys-process/lsof-4.78-r1
Sun Feb 24 23:49:00 2008 >>> app-forensics/rkhunter-1.2.9

P.S. Gerrit: runit is really cool, but this bug (unfixed for about a year!)
drives me crazy... :(

-- 
			WBR, Alex.





* RE: runit not collecting zombies
  2008-02-25 16:24   ` rehan khan
@ 2008-02-25 16:27     ` Charlie Brady
       [not found]     ` <54B6D6D6D32D4DB685F8CA9A836076D7@home.internal>
  2008-02-25 19:13     ` Charlie Brady
  2 siblings, 0 replies; 113+ messages in thread
From: Charlie Brady @ 2008-02-25 16:27 UTC (permalink / raw)
  To: rehan khan; +Cc: Alex Efros, supervision


On Mon, 25 Feb 2008, rehan khan wrote:

> Actually a number of the updated packages *could* be the cause of this
> issue. The package that I would look at closest is the bash upgrade as
> runit scripts depend on the built-in exec command properly replacing the
> shell with the same process id.

That's a function of the exec* system calls, and doesn't need anything 
special in bash.
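
This can be checked from any POSIX shell. The snippet below is only a
minimal sketch and not something posted in the thread; the variable names
are made up for illustration. The outer shell prints its PID and then execs
a second shell, which prints its own PID. If exec replaces the process
image in place, both numbers are the same.

```shell
#!/bin/sh
# The outer sh prints $$ and then execs an inner sh that prints its own $$.
# exec reuses the existing process, so the two PIDs should match.
out=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

before=$(printf '%s\n' "$out" | sed -n 1p)   # PID before exec
after=$(printf '%s\n' "$out" | sed -n 2p)    # PID after exec

echo "before exec: $before, after exec: $after"
```

The same holds for a runit run script that ends in "exec some-daemon": the
supervisor keeps watching the PID it originally started.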

> I don't think the kernel is implicated if you haven't restarted the 
> machine.

I think it would be imprudent to make that assumption.



* RE: runit not collecting zombies
       [not found]     ` <54B6D6D6D32D4DB685F8CA9A836076D7@home.internal>
@ 2008-02-25 17:11       ` rehan khan
  0 siblings, 0 replies; 113+ messages in thread
From: rehan khan @ 2008-02-25 17:11 UTC (permalink / raw)
  To: Charlie Brady; +Cc: Alex Efros, supervision


>> Actually a number of the updated packages *could* be the cause of this
>> issue. The package that I would look at closest is the bash upgrade as
>> runit scripts depend on the built-in exec command properly replacing the
>> shell with the same process id.

>That's a function of the exec* system calls, and doesn't need anything 
>special in bash.

Nevertheless, going with the available information I personally would be
suspicious of the bash upgrade until proven otherwise.

>> I don't think the kernel is implicated if you haven't restarted the 
>> machine.

>I think it would be imprudent to make that assumption.

Nevertheless, going with the available information I personally would be
suspicious of the bash upgrade until proven otherwise :)

You're not going to beat me up over this as well are you Charlie? :P




* Re: runit not collecting zombies
       [not found]     ` <59012.::ffff:77.75.72.226.1203952988.squirrel@mail.podgorny.cz>
  2008-02-25 15:26       ` George Georgalis
  2008-02-25 15:32       ` Charlie Brady
@ 2008-02-25 17:20       ` Mike Buland
  2 siblings, 0 replies; 113+ messages in thread
From: Mike Buland @ 2008-02-25 17:20 UTC (permalink / raw)
  To: supervision

On Monday 25 February 2008 08:23:08 am Radek Podgorny wrote:
> > On Mon, 25 Feb 2008, Alex Efros wrote:
> >> The only possible "explanation" that comes to my mind is that some
> >> CPU/RAM usage pattern which occurs while compiling all these packages
> >> somehow affects runit-init.
> >
> > ... or the kernel.
>
> How can it be the kernel when the system was not rebooted?

I doubt this is the case, but the kernel has supported "kexec" for a while.
It's bizarre, but it allows a kernel to be restarted or replaced while
running.  I seriously doubt that this was used, but I think it's an
interesting feature.  Think of it as a fun fact :)

>
> >> P.S. Gerrit: runit is really cool, but this bug (unfixed for about a
> >> year!)
> >> drives me crazy... :(
> >
> > If it's a kernel bug, then Gerrit won't be able to fix it.




* RE: runit not collecting zombies
  2008-02-25 16:24   ` rehan khan
  2008-02-25 16:27     ` Charlie Brady
       [not found]     ` <54B6D6D6D32D4DB685F8CA9A836076D7@home.internal>
@ 2008-02-25 19:13     ` Charlie Brady
  2 siblings, 0 replies; 113+ messages in thread
From: Charlie Brady @ 2008-02-25 19:13 UTC (permalink / raw)
  To: rehan khan; +Cc: Alex Efros, supervision


On Mon, 25 Feb 2008, rehan khan wrote:

> I don't think the kernel is implicated if you haven't restarted the 
> machine.

Well *I* don't think that runit-init is implicated, since it hasn't been 
restarted.

So where does that leave us?

My understanding is that there are only two items of software involved in 
the reaping of zombies - the kernel, and process 1 (i.e. runit-init).

Given their relative complexities, I'd say that a bug in the kernel is 
more likely than a bug in runit-init. Gerrit has looked for a bug in 
runit-init and not found one. Other people run the same runit-init as 
Alex, but a different kernel, and don't see the problem. That also makes 
me suspect a kernel problem.

If you were to supply me with a malicious bash, how would it be able to 
create zombies which my runit-init did not reap? IOW, I don't see how bash 
could be implicated.





* Re: runit not collecting zombies
  2008-02-25  7:25 ` Alex Efros
                     ` (2 preceding siblings ...)
       [not found]   ` <34616.::ffff:77.75.72.226.1203953244.squirrel@mail.podgorny.cz>
@ 2008-02-27  8:19   ` Bernhard Graf
  2008-02-27  8:36     ` Alex Efros
  3 siblings, 1 reply; 113+ messages in thread
From: Bernhard Graf @ 2008-02-27  8:19 UTC (permalink / raw)
  To: supervision

On Monday 25 February 2008, Alex Efros wrote:

>     I have neither updated runit (or other critical packages like the
>     toolchain) nor rebooted the system - this issue with zombies arose
>     after an update of some simple utils and libraries!!!!!
>     WTF??? How can this be possible?!

Just taking a stab in the dark:

Are you using any process memory limits for the supervised processes 
(chpst, softlimit, ulimit)? Have you tried increasing them 
(significantly)?
-- 
Bernhard Graf



* Re: runit not collecting zombies
  2008-02-27  8:19   ` Bernhard Graf
@ 2008-02-27  8:36     ` Alex Efros
  2008-02-27  8:58       ` Bernhard Graf
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2008-02-27  8:36 UTC (permalink / raw)
  To: supervision

Hi!

On Wed, Feb 27, 2008 at 09:19:27AM +0100, Bernhard Graf wrote:
> >     I have neither updated runit (or other critical packages like the
> >     toolchain) nor rebooted the system - this issue with zombies arose
> >     after an update of some simple utils and libraries!!!!!
> >     WTF??? How can this be possible?!
> 
> Just taking a stab in the dark:
> 
> Are you using any process memory limits for the supervised processes 
> (chpst, softlimit, ulimit)? Have you tried increasing them 
> (significantly)?

Hmm. Of course I use them in some places (but not everywhere, just because
I haven't configured this yet). But the chpst & ssh processes which generate
most of the zombies don't use these limits, so this isn't the case.

-- 
			WBR, Alex.



* Re: runit not collecting zombies
  2008-02-27  8:36     ` Alex Efros
@ 2008-02-27  8:58       ` Bernhard Graf
  0 siblings, 0 replies; 113+ messages in thread
From: Bernhard Graf @ 2008-02-27  8:58 UTC (permalink / raw)
  To: supervision

Alex Efros wrote:

> > Are you using any process memory limits for the supervised
> > processes (chpst, softlimit, ulimit)? Have you tried increasing
> > them (significantly)?
>
> Hmm. Of course I use them in some places (but not everywhere, just
> because I haven't configured this yet). But the chpst & ssh processes
> which generate most of the zombies don't use these limits, so this
> isn't the case.

So do you use ssh with chpst? But not with any memory limits?
You are saying it is mainly ssh. Are only certain supervised processes 
affected or does it happen to all?

Why not post your ssh run script here? Maybe someone has an idea...
-- 
Bernhard Graf



* Re: runit not collecting zombies
  2007-05-24 23:07 runit not collecting zombies Radek Podgorny
                   ` (2 preceding siblings ...)
       [not found] ` <F694D808C0BB4890A12C565F68B9A691@home.internal>
@ 2008-10-21 21:46 ` Alex Efros
  3 siblings, 0 replies; 113+ messages in thread
From: Alex Efros @ 2008-10-21 21:46 UTC (permalink / raw)
  To: supervision

Hi!

BTW, the issue from the subject line still exists. Just a short reminder.

AFAIK Gerrit was unable to reproduce this issue, and I was unable to debug
and fix it myself. So now I use this small nasty thing:

    # cat /etc/cron.hourly/runit-zombie-fix.sh 
    #!/bin/bash
    chmod 0 /etc/runit/stopit
    kill -CONT 1
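
A variant that only pokes runit when there is actually something to reap
might look like the sketch below. This is not what Alex ran: the function
name is hypothetical, and the script assumes a ps that accepts -e together
with "-o ppid= -o stat=" (as procps does). It counts zombies whose parent
is process 1 and applies the same chmod + CONT workaround only if that
count is non-zero and /etc/runit/stopit exists.

```shell
#!/bin/sh
# Count zombie processes (state starts with "Z") whose parent is PID 1.
# Assumes a procps-style ps; count_init_zombies is a made-up helper name.
count_init_zombies() {
    ps -e -o ppid= -o stat= |
        awk '$1 == 1 && $2 ~ /^Z/ { n++ } END { print n + 0 }'
}

zombies=$(count_init_zombies)
echo "zombies parented to PID 1: $zombies"

# Same workaround as the cron job, but only when it is needed.
# With stopit at mode 0, CONT does not start a shutdown; it only wakes runit.
if [ "$zombies" -gt 0 ] && [ -e /etc/runit/stopit ]; then
    chmod 0 /etc/runit/stopit
    kill -CONT 1
fi
```

Logging the count before each wake-up would also show how quickly the
zombies build up between cron runs.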

A few minutes ago I got an idea: I have a VMware image with Gentoo
configured in the same way as on all my servers, so it probably suffers
from the same issue. I can test this (by keeping it running for 7-10 days),
and if this is the case, I can send this VMware image to Gerrit. But before
spending time on this I would like to hear from Gerrit: is this acceptable
for you, and are you willing to debug this issue in this way when you have
spare time?

-- 
			WBR, Alex.



* [LONG] Re: runit not collecting zombies
  2007-09-18  8:14                                                                       ` Gerrit Pape
  2007-09-18 11:33                                                                         ` Alex Efros
  2007-09-18 11:45                                                                         ` Laurent Bercot
@ 2011-02-15 13:12                                                                         ` Laurent Bercot
  2011-02-15 15:00                                                                           ` Alex Efros
  2 siblings, 1 reply; 113+ messages in thread
From: Laurent Bercot @ 2011-02-15 13:12 UTC (permalink / raw)
  To: supervision


 Four years later, I'm coming back to this thread, because something is
still bothering me.

 Quick summary: Radek Podgorny and Alex Efros both had an issue where
zombie processes would accumulate and *not* be reaped by runit as they
should have been. A long discussion ensued; it appeared that the problem
was caused by the following situation:

 * a process A forks a child, B ;
 * B dies, and a SIGCHLD is sent to A ;
 * A does not wait() for B and dies ;
 * so zombie B is reparented to 1, but no SIGCHLD is sent to 1 ;
 * zombie B remains there until runit's reaper is triggered, which can
be much, much later.
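
 The sequence above can be watched from a shell on Linux. What follows is
only a sketch under stated assumptions: it reads process state from
/proc/PID/status, the variable names are invented for the example, and it
runs as an ordinary user, so it shows the zombie and its hand-over but not
what process 1 does internally.

```shell
#!/bin/sh
# A forks B (sleep 1), prints B's PID, then execs into "sleep 3", which
# never calls wait(). B exits after 1s and stays a zombie while A lives.
tmp=$(mktemp)
sh -c 'sleep 1 & echo $!; exec sleep 3' >"$tmp" &
a_pid=$!

sleep 2                       # B is dead by now, A is still sleeping
b_pid=$(cat "$tmp")
state_while_parent_alive=$(sed -n 's/^State:[[:space:]]*//p' "/proc/$b_pid/status")
echo "B while A is alive: $state_while_parent_alive"

# Once A exits, B is handed to process 1 (or to a subreaper, if one is set).
# If that process is notified and reaps, B disappears from /proc.
wait "$a_pid"
sleep 1
if [ -e "/proc/$b_pid" ]; then
    echo "B after A exited: still present (not reaped yet)"
else
    echo "B after A exited: gone (reaped)"
fi
rm -f "$tmp"
```

 On a system showing the problem described in this thread, the last
message would stay at "still present" until something wakes process 1.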

 Gerrit Pape concluded:

> runit tries to over-optimise, and only wakes up to reap zombies if it
> knows there are some, at least one.  Due to the fact that the mother
> process, which re-parented itself to pid 1, on the one hand receives a
> SIGCHLD, but on the other hand doesn't care about that, exits and leaves
> the dead child alone, the child gets re-parented to runit, but without
> any notification.
> 
> The situation would have been cleaned up on your systems once any child
> process gets re-parented to process 1 before it terminates, and then
> exits, causing runit to get a SIGCHLD; which apparently didn't happen.
> It's what the kill -CONT 1 I suggested fakes.  That seems to explain why
> this problem didn't show up for years.
> 
> I prepare a new version of runit that looks for and reaps zombies not
> only if it knows that there are some, but also after a 14 seconds
> timeout, there seems to be no way around that.


 And that is what bothers me. Something is not right.
 Unix should be able to function without polling at all.
 I'm building Linux environments for embedded platforms, on which
energy consumption is an important thing. If such a basic thing as
process 1 has to do polling, I'm forfeiting my job right now.

 runit ran perfectly without polling for lots of people except Radek and
Alex. Until Gerrit had to add a polling mechanism just for them. What do
other init systems do ?


 I straced sysvinit:
Process 1 attached - interrupt to quit
select(11, [10], NULL, NULL, {2, 902034}) = 0 (Timeout)
time(NULL)                              = 1297769034
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
select(11, [10], NULL, NULL, {5, 0})    = 0 (Timeout)
time(NULL)                              = 1297769039
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
select(11, [10], NULL, NULL, {5, 0})    = 0 (Timeout)
time(NULL)                              = 1297769044
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
fstat64(10, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat64("/dev/initctl", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
select(11, [10], NULL, NULL, {5, 0}^C <unfinished ...>
Process 1 detached

 No luck here. sysvinit wakes up every 5 seconds. Don't ask me why:
it does not even reap children when it wakes up. Its only goal seems
to be to make sure that /dev/initctl is still there by stat()ing
it three times. Lol. sysvinit sucks - nothing new here.


 I straced Upstart:
Process 1 attached - interrupt to quit
select(11, [3 5 6 7 9 10], [], [7 9 10], NULL^C <unfinished ...>
Process 1 detached
 
 Aha. Upstart waits on notifications forever. It does not poll at all.
No, I'm definitely not going to install Upstart on embedded systems :)
but it's a good indication that it is possible to only reap children
when being triggered; it is *not necessary*, at least on Linux, to
have a timed reaping loop.

 So, where does the problem come from ?
 Do reparented zombies *really* cause no trigger ?

 I ran the following command while stracing my own process 1 (s4-svscan,
which does not poll) on a Linux 2.6.36.1 kernel:
$ execlineb -c "background { sleep 1 } s4-sleep 2"

 This little execline script will fork; the child will exec "sleep 1",
which will exit after 1 second. The parent will exec "s4-sleep 2", which will
sleep 2 seconds *without being interrupted by signals* and then exit
*without waiting for its dead child*. (I used my own version of "sleep"
just to make sure it slept for the full duration and did not wait().)

 So, when the child dies, a SIGCHLD will be sent to the parent, which
is totally oblivious to it. One second later, the parent will die, and
its zombie child will then be inherited by process 1. What happens then ?

Process 1 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 1
gettimeofday({1297770104, 310225}, NULL) = 0
read(5, "\21\0\0\0\0\0\0\0\1\0\0\0:1\0\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
read(5, 0xbfa51d0c, 128)                = -1 EAGAIN (Resource temporarily unavailable)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 12602
wait4(-1, 0xbfa51e10, WNOHANG, NULL)    = 0
poll([{fd=5, events=POLLIN|POLLHUP}, {fd=4, events=POLLIN|POLLHUP}], 2, -1^C <unfinished ...>
Process 1 detached

 fd 5 is actually obtained via signalfd() when available, and listens to
signals such as SIGCHLD. (When signalfd() is not available, a selfpipe is
used instead.)
 The trace is crystal clear: when the parent dies and the zombie child
is reparented to 1, *process 1 does get notified with a SIGCHLD* even if
the former parent has already been notified before (and has done nothing).
Here, the signal is seen as the signalfd being available, but it's still
a signal.

 This is a very normal, expectable, sane behaviour that Linux 2.6.36.1
exhibits; and it confirms my expectation that process 1 SHOULD NOT have
a timed reaping loop.

 Upstart does the right thing (as far as waiting for notifications is
concerned, I mean). runit did the right thing before the change.

 The problem Radek and Alex had was most likely caused by a kernel bug:
in some cases, when a zombie is reparented to process 1, process 1 does
not get notified with a SIGCHLD, as it should be.

 I don't have the time or resources to explore this further; but the
modus operandi is simple.

 - Make sure you can strace your process 1. If you cannot, patch it
so it writes something (to the system log or its own stderr which should
point to the console) every time it receives a SIGCHLD. Upstart and
sysvinit are spaghetti monsters, but runit is trivial to patch.

 - Run the following script: sh -c "sleep 1 & exec sleep 2"
provided your sleep binary does not do anything fancy with signals.
Or replace the "sleep 2" with something that you know does not
catch signals and lasts more than one second.

 - Check what process 1 says after 2 seconds. If it received a SIGCHLD,
your kernel works. If it did not, you have found a kernel bug.

 runit's polling mechanism is a workaround to this bug, not the
solution to some Unix problem. Gerrit, please make it optional, so
functional systems can disable polling entirely.

-- 
 Laurent



* Re: [LONG] Re: runit not collecting zombies
  2011-02-15 13:12                                                                         ` [LONG] " Laurent Bercot
@ 2011-02-15 15:00                                                                           ` Alex Efros
  2011-02-15 15:22                                                                             ` Laurent Bercot
  0 siblings, 1 reply; 113+ messages in thread
From: Alex Efros @ 2011-02-15 15:00 UTC (permalink / raw)
  To: supervision

Hi!

On Tue, Feb 15, 2011 at 02:12:18PM +0100, Laurent Bercot wrote:
>  runit ran perfectly without polling for lots of people except Radek and
> Alex. Until Gerrit had to add a polling mechanism just for them.

AFAIR this polling mechanism didn't solve the issue for me, but I may be
wrong because that was a long time ago. Anyway, I'm still using the
`kill -CONT 1` hack in /etc/cron.hourly/ to work around this issue on all
my systems.

>  I ran the following command while stracing my own process 1 (s4-svscan,
> which does not poll) on a Linux 2.6.36.1 kernel:

Again, this was a long time ago and I may be wrong, but AFAIR this simple
trick with two processes wasn't a correct example to reproduce this issue.

>  The problem Radek and Alex had was most likely caused by a kernel bug:
> in some cases, when a zombie is reparented to process 1, process 1 does
> not get notified with a SIGCHLD, as it should be.

Ok, there's no harm in trying. I repeated your test with strace - on my
current system runit got SIGCHLD. I'm using kernel 2.6.36-hardened-r9 and
runit 2.0.0. I've just switched off the hourly `kill -CONT 1` workaround,
so we'll see whether everything is fine in a couple of days.

If there is no growing army of zombies on my system after that, I'll be
glad to test a modified runit version without polling if someone sends me
the patch.

-- 
			WBR, Alex.



* Re: [LONG] Re: runit not collecting zombies
  2011-02-15 15:00                                                                           ` Alex Efros
@ 2011-02-15 15:22                                                                             ` Laurent Bercot
  0 siblings, 0 replies; 113+ messages in thread
From: Laurent Bercot @ 2011-02-15 15:22 UTC (permalink / raw)
  To: supervision

> AFAIR this polling mechanism didn't solve the issue for me, but I may be
> wrong because that was a long time ago. Anyway, I'm still using the
> `kill -CONT 1` hack in /etc/cron.hourly/ to work around this issue on all
> my systems.

 As far as I can tell, the polling mechanism should have solved the
issue with a recent runit (one that reaps *all* its zombies every time
it's triggered, not just one), because it does the exact same thing as
your SIGCONT crontab entry: manually trigger the reaper every amount of
time.
 Your crontab entry triggers the reaper every hour. The integrated polling
mechanism triggers it every 14 seconds. That should have been working. ^^


> Again, this was a long time ago and I may be wrong, but AFAIR this simple
> trick with two processes wasn't a correct example to reproduce this issue.

 It was what I gathered when reading the thread again. The cause of your
zombie attack was parents not reaping their dead children and then dying,
giving their zombies to process 1 *without* triggering process 1's reaper.
My little script is a minimal example of this.


> Ok, there's no harm in trying. I repeated your test with strace - on my
> current system runit got SIGCHLD. I'm using kernel 2.6.36-hardened-r9 and
> runit 2.0.0. I've just switched off the hourly `kill -CONT 1` workaround,
> so we'll see whether everything is fine in a couple of days.

 Looks like 2.6.36-hardened-r9 is exempt from the bug. If runit got SIGCHLD,
its reaper mechanism was triggered and you should be okay.

-- 
 Laurent



end of thread, other threads:[~2011-02-15 15:22 UTC | newest]

Thread overview: 113+ messages
2007-05-24 23:07 runit not collecting zombies Radek Podgorny
2007-05-26 10:35 ` Alex Efros
2007-05-26 10:45   ` Alex Efros
2007-05-26 12:55   ` Charlie Brady
2007-05-26 13:03     ` Alex Efros
2007-05-26 17:01   ` Paul Jarc
2007-06-02 14:55     ` Alex Efros
2007-06-03 11:10   ` Gerrit Pape
2007-06-03 14:33     ` Alex Efros
2007-06-03 16:31       ` Gerrit Pape
2007-06-11 13:11     ` Alex Efros
2007-06-18 13:45       ` Alex Efros
2007-06-19 18:13         ` Gerrit Pape
2007-06-19 19:07           ` Alex Efros
2007-06-20 16:23             ` Gerrit Pape
2007-06-20 16:57               ` Alex Efros
2007-06-20 18:35                 ` Gerrit Pape
2007-06-23  4:42                   ` Alex Efros
2007-06-26  9:59                     ` Gerrit Pape
2007-07-07  7:16                       ` Alex Efros
2007-07-07 18:13                         ` Charlie Brady
2007-07-07 19:12                           ` Alex Efros
2007-07-12 14:21                             ` Charlie Brady
2007-07-12 14:41                               ` Alex Efros
2007-07-12 14:45                                 ` Charlie Brady
2007-07-12 14:57                                   ` Alex Efros
2007-07-12 14:42                           ` Charlie Brady
2007-07-12 14:43                             ` Charlie Brady
2007-07-12 14:49                             ` Alex Efros
2007-07-12 15:11                               ` Charlie Brady
2007-07-12 15:15                                 ` Alex Efros
2007-07-12 15:40                                   ` Charlie Brady
2007-07-15 14:47                       ` Alex Efros
2007-07-15 19:07                         ` Alex Efros
2007-07-15 20:18                           ` George Georgalis
2007-07-15 20:31                             ` Paul Jarc
2007-07-15 22:35                               ` George Georgalis
2007-07-15 23:06                                 ` Paul Jarc
2007-07-15 23:23                                   ` Charlie Brady
2007-07-16  0:09                                     ` Alex Efros
2007-07-16  2:11                                       ` Charlie Brady
2007-09-12 12:53                                         ` Radek Podgorny
     [not found]                                         ` <47939.::ffff:77.75.72.5.1189601606.squirrel@mail.podgorny.cz>
2007-09-12 13:55                                           ` Charlie Brady
2007-09-12 14:35                                             ` Alex Efros
2007-09-12 14:55                                               ` Charlie Brady
2007-09-12 15:00                                                 ` Alex Efros
2007-09-12 16:02                                                   ` Charlie Brady
2007-09-12 16:10                                                     ` Radek Podgorny
2007-09-12 17:22                                                     ` Alex Efros
2007-09-12 17:40                                                       ` Charlie Brady
2007-09-12 18:18                                                         ` Alex Efros
2007-09-12 19:07                                                           ` Charlie Brady
2007-09-12 19:13                                                             ` Alex Efros
2007-09-12 19:18                                                               ` Charlie Brady
2007-09-12 19:30                                                                 ` Alex Efros
2007-09-12 19:37                                                                   ` Charlie Brady
2007-09-15 13:36                                                                 ` Alex Efros
2007-09-15 13:57                                                                   ` Alex Efros
2007-09-15 15:20                                                                     ` Charlie Brady
2007-09-15 15:28                                                                       ` Alex Efros
2007-09-15 15:47                                                                         ` Charlie Brady
2007-09-15 16:02                                                                           ` Alex Efros
2007-09-15 15:49                                                                         ` Charlie Brady
2007-09-15 15:55                                                                           ` Alex Efros
2007-09-15 16:02                                                                             ` Charlie Brady
2007-09-15 15:36                                                                       ` Alex Efros
2007-09-15 15:58                                                                         ` Charlie Brady
2007-09-15 14:03                                                                   ` Alex Efros
2007-09-17  7:56                                                                   ` Gerrit Pape
2007-09-17  9:07                                                                     ` Radek Podgorny
2007-09-17 11:59                                                                     ` Alex Efros
2007-09-18  8:14                                                                       ` Gerrit Pape
2007-09-18 11:33                                                                         ` Alex Efros
2007-09-18 11:45                                                                         ` Laurent Bercot
2011-02-15 13:12                                                                         ` [LONG] " Laurent Bercot
2011-02-15 15:00                                                                           ` Alex Efros
2011-02-15 15:22                                                                             ` Laurent Bercot
2007-09-12 16:04                                                   ` Radek Podgorny
     [not found]                                                   ` <35517.::ffff:77.75.72.5.1189613042.squirrel@mail.podgorny.cz>
2007-09-12 17:04                                                     ` Alex Efros
2007-09-12 19:38                                                       ` Mike Buland
2007-09-12 20:28                                                         ` Alex Efros
2007-09-12 20:38                                                           ` Alex Efros
2007-09-13  1:05                                                           ` Mike Buland
2007-09-13  8:58                                                       ` Radek Podgorny
     [not found]                                                       ` <50411.::ffff:77.75.72.5.1189673890.squirrel@mail.podgorny.cz>
2007-09-13 10:57                                                         ` Alex Efros
2007-09-13 12:06                                                           ` Alex Efros
2007-09-13 14:31                                                           ` Radek Podgorny
     [not found]                                                           ` <51910.::ffff:77.75.72.5.1189693860.squirrel@mail.podgorny.cz>
2007-09-13 14:51                                                             ` Alex Efros
2007-07-16  2:24                                   ` George Georgalis
2007-07-01  8:43                   ` Radek Podgorny
2007-07-02  8:28                     ` Gerrit Pape
2007-07-02 11:23                       ` Radek Podgorny
2007-07-02 12:14                         ` Gerrit Pape
2007-07-02 12:42                           ` Radek Podgorny
2007-07-07  4:54                       ` Alex Efros
2007-06-20 19:57                 ` Charlie Brady
2008-02-25  7:25 ` Alex Efros
2008-02-25 14:57   ` Charlie Brady
2008-02-25 15:23     ` Radek Podgorny
     [not found]     ` <59012.::ffff:77.75.72.226.1203952988.squirrel@mail.podgorny.cz>
2008-02-25 15:26       ` George Georgalis
2008-02-25 15:32       ` Charlie Brady
2008-02-25 16:17         ` Alex Efros
2008-02-25 17:20       ` Mike Buland
2008-02-25 15:27   ` Radek Podgorny
     [not found]   ` <34616.::ffff:77.75.72.226.1203953244.squirrel@mail.podgorny.cz>
2008-02-25 16:15     ` Alex Efros
2008-02-27  8:19   ` Bernhard Graf
2008-02-27  8:36     ` Alex Efros
2008-02-27  8:58       ` Bernhard Graf
     [not found] ` <F694D808C0BB4890A12C565F68B9A691@home.internal>
2008-02-25 16:24   ` rehan khan
2008-02-25 16:27     ` Charlie Brady
     [not found]     ` <54B6D6D6D32D4DB685F8CA9A836076D7@home.internal>
2008-02-25 17:11       ` rehan khan
2008-02-25 19:13     ` Charlie Brady
2008-10-21 21:46 ` Alex Efros
