supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit
* sv term handling with a slow child
@ 2008-01-16 22:41 Ryan Woodrum
  2008-01-16 23:04 ` Mike Buland
  0 siblings, 1 reply; 6+ messages in thread
From: Ryan Woodrum @ 2008-01-16 22:41 UTC (permalink / raw)
  To: supervision

Hello!

I believe I have found a possible bug/oddity in the behavior of sv
using runsv.  I happened upon this particular scenario in a test
environment, but was actually able to repro it in my production
environment as well as in a primitive case.  The issue involves slow
children or children whose TERM handler isn't registered soon enough.

Here's the setup:
I create a simplistic base service configuration under which I will
run a ruby application.  The ruby app looks like so:
slow_signal.rb
---
sleep(10)

puts "registering term handler..."
trap("TERM") do
  puts "got term"
  exit
end

while(true) do
  puts "looping and sleeping..."
  sleep 2
end
---

I run this from the run script in my service directory:
#!/bin/sh
exec 2>&1
exec /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb


The premise of the primitive ruby application is to emulate a slow-ish
loading base of code: even a TERM handler registered early in the
program is not in place until some time after the process starts.

If I invoke:
/etc/init.d/slow_signal start

followed within the 10 second sleep period by:
/etc/init.d/slow_signal stop

(/etc/init.d/slow_signal is a symlink to /usr/bin/sv)
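
For anyone who wants to try the timing without runit, here is a minimal
sketch of the same start/stop sequence.  It assumes a current ruby (not
the 1.8 series) and a platform with fork; the method name and the
shortened 3 second delay are illustrative, not part of my setup.

```ruby
# Stand-alone sketch of "start, then stop before the handler exists".
# No runit involved; names and timings are illustrative only.

def now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

def stop_before_handler_registered(load_delay: 3, wait_limit: 2)
  pid = fork do
    sleep load_delay               # "slow load": no TERM handler yet
    Signal.trap("TERM") { exit }   # registered too late for an early TERM
    loop { sleep 1 }
  end

  sleep 0.5                        # still inside the child's initial sleep
  Process.kill("TERM", pid)

  # Poll for the child's exit, giving up before the handler would exist.
  deadline = now + wait_limit
  reaped = nil
  while reaped.nil? && now < deadline
    reaped = Process.wait2(pid, Process::WNOHANG)
    sleep 0.1 if reaped.nil?
  end
  return :terminated if reaped

  # The child outlived the early TERM: the situation reported in this mail.
  Process.kill("KILL", pid)
  Process.wait(pid)
  :survived_term
end

puts stop_before_handler_registered
```

A result of :survived_term would correspond to the stuck state described
below; :terminated means the early TERM ended the child.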

The process does not handle the signal but its state is set to 'd';
down.  In subsequent calls to control() within sv.c, it will no longer
write to the pipe because it thinks there is no need.  With no further
writes to the pipe, another TERM will never get sent and so the
process cannot be shut down via sv/runsv, at least not with TERM.

It took me a while to learn how everything works and to track down
just where this check was happening.  The source I worked against was
the source from the debian package v1.8.0 (`apt-get source runit`
under debian sid).  (I looked for a repo but did not find a public
one.)

I can think of two solutions.  The first is not to set svstatus[17]
unless you're sure the process actually went down, but this is more
complicated (perhaps more correct?) than the second.  The second is a
modification inside control() in sv.c to always send a TERM, like so:
-----
247c247,249
<   if (svstatus[17] == *a) return(0);
---
>   /* Write a TERM to the pipe even if we already have.  Slow TERM
>   handler perhaps?  What about other cases?*/
>   if (svstatus[17] == *a && *a != 'd') return(0);
-----

In this case, we simply decide that if we want to issue a TERM via sv
stop, down, etc., we will go ahead and write to the pipe again, even
if we think we don't need to.  This way, we're not stuck in "want down,
got TERM."
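
To make the difference between the two checks concrete, here is a small
model of just the early-return condition.  It is a sketch in ruby, not
runit's C code: `recorded` stands in for svstatus[17], `command` for *a,
and both method names are made up for illustration.

```ruby
# Simplified model of the early-return test in control(); not runit source.
# recorded: stands in for svstatus[17]; command: stands in for *a.

# Original check: skip the write whenever the recorded state already
# matches the requested command.
def skip_write_original?(recorded, command)
  recorded == command
end

# Patched check: same as above, except that a 'd' (down) command is
# never skipped, so the request reaches the pipe again.
def skip_write_patched?(recorded, command)
  recorded == command && command != "d"
end

[["u", "d"], ["d", "d"], ["u", "u"]].each do |recorded, command|
  puts "recorded=#{recorded} command=#{command} " \
       "original_skips=#{skip_write_original?(recorded, command)} " \
       "patched_skips=#{skip_write_patched?(recorded, command)}"
end
```

The only row that differs is recorded='d', command='d': the original
check returns early, the patched one writes to the pipe again.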

So with an answer in hand... is this behavior by design?  It seems
that a particularly slow child shouldn't immunize itself from a TERM
because of a slow load time or late signal handler registration.

Thoughts appreciated!  Thanks!

-ryan woodrum



* Re: sv term handling with a slow child
  2008-01-16 22:41 sv term handling with a slow child Ryan Woodrum
@ 2008-01-16 23:04 ` Mike Buland
  2008-01-17  0:25   ` Ryan Woodrum
                     ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Mike Buland @ 2008-01-16 23:04 UTC (permalink / raw)
  To: supervision

Hi

I went ahead and ran a few tests, including your ruby script.  I
apparently can't reproduce the behaviour you describe.

On Linux (and other POSIX systems) there is a default action for each
signal.  For the terminate signal (TERM) it is to end the process.  At
least in my tests the ruby program is indeed terminated: the process ends
and the status in runit is set to 'd', or down.  It is set to down, but
the program is gone.
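
A quick way to see the default action from ruby, without runit, is
sketched below.  It assumes a POSIX `sleep` binary on the PATH and a
current ruby; the method name is made up for illustration.

```ruby
# Sketch: a child with no TERM handler is ended by the signal itself.
# Assumes an external `sleep` command is available on the PATH.

def term_plain_child
  pid = Process.spawn("sleep", "30")   # child keeps the default TERM action
  Process.kill("TERM", pid)
  _, status = Process.wait2(pid)       # reap the child, collect its status
  status
end

status = term_plain_child
puts "signaled=#{status.signaled?} termsig=#{status.termsig}"
```

Process::Status#signaled? reports whether the child was ended by a
signal rather than by calling exit, and #termsig gives the signal number.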

When I wrote my own test in C:
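----
#include <unistd.h>	/* sleep() is declared here, not in <stdlib.h> */

int main(void)
{
	/* No TERM handler is installed, so the default action applies. */
	sleep( 50000 );
	return 0;
}
----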

to test the behaviour of TERM, everything worked as expected.  No TERM
handler is registered, so sending the program a TERM from the command
line (kill -15 $pid) terminates it.  Then I tried ignoring TERM:

----
#include <signal.h>
#include <unistd.h>	/* sleep() */

int main(void)
{
	/* Ignore TERM (signal 15) for the lifetime of the process. */
	signal( SIGTERM, SIG_IGN );
	sleep( 50000 );
	return 0;
}
----

And the program kept running.  Testing both of these programs with runit
gave the expected results.  The program using the default signal action
exited as soon as runit sent it TERM, and the status of the service was
set accordingly.  For the second program, TERM was ignored and runit went
into "want down, got TERM" state.
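
The ignoring case can also be checked from ruby without runit.  This is
a sketch under the assumption of a current ruby on a platform with fork;
the method name is illustrative, and the pipe is only there so the parent
does not send TERM before the child has started ignoring it.

```ruby
# Sketch: a child that ignores TERM keeps running after the signal.

def child_survives_ignored_term?
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    Signal.trap("TERM", "IGNORE")   # same effect as signal(SIGTERM, SIG_IGN)
    writer.close                    # EOF tells the parent we are ready
    sleep 30
  end
  writer.close
  reader.read                       # returns once the child closed its end
  reader.close

  Process.kill("TERM", pid)
  sleep 0.2                         # give the signal time to be delivered
  still_running = Process.wait(pid, Process::WNOHANG).nil?

  if still_running                  # clean up with a signal that cannot be ignored
    Process.kill("KILL", pid)
    Process.wait(pid)
  end
  still_running
end

puts "survived TERM: #{child_survives_ignored_term?}"
```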

On your system, are you 100% sure that the ruby test program you're using
isn't just exiting appropriately?  I can't find anything that mimics the
described behaviour.  That is, runit behaves the way you describe, but the
process does end.

--Mike


* Re: sv term handling with a slow child
  2008-01-16 23:04 ` Mike Buland
@ 2008-01-17  0:25   ` Ryan Woodrum
  2008-01-17  0:35   ` Ryan Woodrum
  2008-01-17  8:25   ` Ryan Woodrum
  2 siblings, 0 replies; 6+ messages in thread
From: Ryan Woodrum @ 2008-01-17  0:25 UTC (permalink / raw)
  To: supervision

Hi, Mike,

Thanks much for the prompt reply.

I am 100% certain the process is not exiting.  Interestingly enough,
as you described, if I create an equivalent C program, the problem
does not repro.
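-----
#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */

/* printf() and exit() are not async-signal-safe; tolerated here only
   because this is a throwaway test program. */
void handler(int signo) {
   printf("Entered signal handler...\n");
   exit(signo);
}

int main(void) {
   printf("Doing initial sleep for 10...\n");
   sleep(10);

   /* Registered late, mirroring the ruby script. */
   signal(SIGTERM, handler);

   while(1) {
      printf("Looping and sleeping...\n");
      sleep(2);
   }
   return 0;
}
-----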

But here is some output with the ruby script again.

Well behaved scenario with a long sleep before the first TERM:
-----
ops1test:/home/rwoodrum/tmp# /etc/init.d/slow_signal start \
> && ps ax | grep slow \
> && sleep 12 \
> && /etc/init.d/slow_signal stop
ok: run: slow_signal: (pid 30008) 0s
28866 ?        Ss     0:00 runsv slow_signal
30008 ?        S      0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb
30010 ttyp0    S+     0:00 grep slow
ok: down: slow_signal: 0s, normally up



And now the quick start/stop scenario with extras:
-----
ops1test:/home/rwoodrum/tmp# /etc/init.d/slow_signal start \
> && ps ax | grep slow \
> && sleep 4 \
> && /etc/init.d/slow_signal stop
ok: run: slow_signal: (pid 30549) 0s
30229 ?        Ss     0:00 runsv slow_signal
30549 ?        S      0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb
30551 ttyp0    S+     0:00 grep slow
timeout: run: slow_signal: (pid 30549) 12s, want down, got TERM
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp# ps ax | grep slow
30229 ?        Ss     0:00 runsv slow_signal
30549 ?        S      0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb
30557 ttyp0    S+     0:00 grep slow
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp# /etc/init.d/slow_signal stop
timeout: run: slow_signal: (pid 30549) 27s, want down, got TERM
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp# ps ax | grep slow
30229 ?        Ss     0:00 runsv slow_signal
30549 ?        S      0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb
30560 ttyp0    S+     0:00 grep slow
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp# kill -TERM 30549
ops1test:/home/rwoodrum/tmp#
ops1test:/home/rwoodrum/tmp# ps ax | grep slow
30229 ?        Ss     0:00 runsv slow_signal
30566 ttyp0    S+     0:00 grep slow
ops1test:/home/rwoodrum/tmp#


So... maybe this is a ruby thing?  What I'm certain of is that if I
strace the process in the second, misbehaving scenario, I never see
another TERM being delivered to slow_signal.rb.  If I send it a TERM via
kill, I see the TERM arrive and the process handles it correctly.

Something still isn't adding up.

-ryan woodrum



* Re: sv term handling with a slow child
  2008-01-16 23:04 ` Mike Buland
  2008-01-17  0:25   ` Ryan Woodrum
@ 2008-01-17  0:35   ` Ryan Woodrum
  2008-01-17  8:25   ` Ryan Woodrum
  2 siblings, 0 replies; 6+ messages in thread
From: Ryan Woodrum @ 2008-01-17  0:35 UTC (permalink / raw)
  To: supervision

For clarity, here is my first, well-behaved example again, extended to
show that the process does indeed exit:

ops1test:/home/rwoodrum/tmp# /etc/init.d/slow_signal start \
> && ps ax | grep slow \
> && sleep 12 \
> && /etc/init.d/slow_signal stop \
> && ps ax | grep slow
ok: run: slow_signal: (pid 31434) 58s
30229 ?        Ss     0:00 runsv slow_signal
31434 ?        S      0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb
31446 ttyp0    S+     0:00 grep slow
ok: down: slow_signal: 0s, normally up
30229 ?        Ss     0:00 runsv slow_signal
31456 ttyp0    S+     0:00 grep slow



* Re: sv term handling with a slow child
  2008-01-16 23:04 ` Mike Buland
  2008-01-17  0:25   ` Ryan Woodrum
  2008-01-17  0:35   ` Ryan Woodrum
@ 2008-01-17  8:25   ` Ryan Woodrum
  2008-01-17 19:16     ` Mike Buland
  2 siblings, 1 reply; 6+ messages in thread
From: Ryan Woodrum @ 2008-01-17  8:25 UTC (permalink / raw)
  To: supervision

Not to get into the habit of replying to myself, but....

I found the problem with this and it actually appears to have nothing to do
with runit.  (Sorry, all!)  While I was on the way home from work turning
this over in my head, it occurred to me to test the default handler itself
by sending a SIGTERM to a ruby script that was simply executing a
sleep.

I tried this on my home box running ruby v1.8.6 and it terminated as expected.  
I tried it in the environment where I was experiencing the problems (and 
where it is running ruby v1.8.5) and the process did not terminate.  In 
retrospect I don't know why I didn't test this most basic of cases.

So the answer is that it is, in fact, a bug in ruby.  Or was, rather.  See 
this thread where Matsumoto chimed in in a seemingly related situation:
http://www.ruby-forum.com/topic/85485

An strace of a more recent version of ruby shows the term coming in and then 
being handled by default:
-----
)    = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
sigprocmask(SIG_SETMASK, [], NULL)      = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
rt_sigaction(SIGINT, {SIG_DFL}, {0xb7f56960, [], 0}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL}, {0xb7f56960, [], 0}, 8) = 0
kill(9481, SIGTERM)                     = 0
--- SIGTERM (Terminated) @ 0 (0) ---
+++ killed by SIGTERM +++
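
The last few lines of that trace (reset SIGTERM to SIG_DFL, then send
SIGTERM to your own pid) are a common way for a process to finish its
own handling and still die "by the signal".  Below is a minimal sketch
of the same pattern at the ruby level.  It assumes a current ruby that
accepts "SYSTEM_DEFAULT" in Signal.trap; the method name is made up, and
it is not a claim about what the interpreter does internally.

```ruby
# Sketch: handle TERM, then restore the OS default and re-send it.

def cleanup_then_die_by_term
  ready_r, ready_w = IO.pipe
  pid = fork do
    ready_r.close
    Signal.trap("TERM") do
      # Application cleanup would go here.
      Signal.trap("TERM", "SYSTEM_DEFAULT")   # back to the OS default action
      Process.kill("TERM", Process.pid)       # re-send TERM to ourselves
    end
    ready_w.close                             # EOF: the trap is installed
    loop { sleep 1 }
  end
  ready_w.close
  ready_r.read                                # wait until the child is ready
  ready_r.close

  Process.kill("TERM", pid)
  _, status = Process.wait2(pid)
  status
end

status = cleanup_then_die_by_term
puts "signaled=#{status.signaled?} termsig=#{status.termsig}"
```

The parent then sees a child that was killed by TERM rather than one
that called exit, which is what a supervisor would also observe.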


The older version doesn't seem to be doing this.  Is the call to sigreturn 
indicative of the default handler not doing anything...?
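
As far as I understand it, sigreturn is what a process calls when a
signal handler it installed returns, so seeing it suggests that some
handler ran and came back rather than the default action being taken
(the default action for TERM never returns to the program).  The sketch
below shows that situation on purpose: a handler that records the TERM
and then simply returns.  It assumes a current ruby with fork, and every
name in it is illustrative.

```ruby
# Sketch: a TERM handler that returns leaves the process running.

def survive_term_with_returning_handler
  ready_r, ready_w = IO.pipe
  seen_r, seen_w = IO.pipe
  pid = fork do
    ready_r.close
    seen_r.close
    # The handler notes the signal and returns; nothing ends the process.
    Signal.trap("TERM") { seen_w.syswrite("T") }
    ready_w.close                   # EOF: the trap is installed
    loop { sleep 1 }
  end
  ready_w.close
  seen_w.close
  ready_r.read                      # wait until the child is ready

  Process.kill("TERM", pid)
  handler_ran = (seen_r.read(1) == "T")
  still_running = Process.wait(pid, Process::WNOHANG).nil?

  if still_running                  # clean up the surviving child
    Process.kill("KILL", pid)
    Process.wait(pid)
  end
  { handler_ran: handler_ran, still_running: still_running }
end

p survive_term_with_returning_handler
```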
-----
)    = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
sigreturn()                             = ? (mask now [])
select(0, NULL, NULL, NULL, {13, 616000}) = 0 (Timeout)
time(NULL)                              = 1200558046
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
rt_sigaction(SIGINT, {SIG_DFL}, {0xb7f890d0, [], 0}, 8) = 0
exit_group(0)                           = ?


I don't believe I understand it 100% yet, but regardless, it is not a runit 
problem.

-ryan woodrum



* Re: sv term handling with a slow child
  2008-01-17  8:25   ` Ryan Woodrum
@ 2008-01-17 19:16     ` Mike Buland
  0 siblings, 0 replies; 6+ messages in thread
From: Mike Buland @ 2008-01-17 19:16 UTC (permalink / raw)
  To: supervision


Whew, thanks for responding, I was curious about the outcome myself.  That was 
the next thing that I was going to suggest (ruby version comparison), and you 
beat me to it.  Well done finding that bug :)

--Mike

On Thursday 17 January 2008 01:25:27 am Ryan Woodrum wrote:
> Not to get into the habit of replying to myself, but....
>
> I found the problem with this and it actually appears to have nothing to do
> with runit. (Sorry, all!)   While I was on the way home from work turning
> this over in my head, it occurred to me to indeed test the default handler
> by attempting to send a sig_term to a ruby script that was simply executing
> a sleep.
>
> I tried this on my home box running ruby v1.8.6 and it terminated as
> expected. I tried it in the environment where I was experiencing the
> problems (and where it is running ruby v1.8.5) and the process did not
> terminate.  In retrospect I don't know why I didn't test this most basic of
> cases.
>
> So the answer is that it is, in fact, a bug in ruby.  Or was, rather.  See
> this thread where Matsumoto chimed in in a seemingly related situation:
> http://www.ruby-forum.com/topic/85485
>
> An strace of a more recent version of ruby shows the term coming in and
> then being handled by default:
> -----
> )    = ? ERESTARTNOHAND (To be restarted)
> --- SIGTERM (Terminated) @ 0 (0) ---
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> sigprocmask(SIG_SETMASK, [], NULL)      = 0
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> rt_sigaction(SIGINT, {SIG_DFL}, {0xb7f56960, [], 0}, 8) = 0
> rt_sigaction(SIGTERM, {SIG_DFL}, {0xb7f56960, [], 0}, 8) = 0
> kill(9481, SIGTERM)                     = 0
> --- SIGTERM (Terminated) @ 0 (0) ---
> +++ killed by SIGTERM +++
>
>
> The older version doesn't seem to be doing this.  Is the call to sigreturn
> indicative of the default handler not doing anything...?
> -----
> )    = ? ERESTARTNOHAND (To be restarted)
> --- SIGTERM (Terminated) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])
> select(0, NULL, NULL, NULL, {13, 616000}
>
>
> ) = 0 (Timeout)
> time(NULL)                              = 1200558046
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> rt_sigaction(SIGINT, {SIG_DFL}, {0xb7f890d0, [], 0}, 8) = 0
> exit_group(0)                           = ?
>
>
> I don't believe I understand it 100% yet, but regardless, it is not a runit
> problem.
>
> -ryan woodrum
>
> On Wednesday 16 January 2008 03:04:45 pm Mike Buland wrote:
> > Hi
> >
> > I went ahead and ran a few tests, including your ruby script.  I can't
> > apparently repreduce the behaviour you describe.
> >
> > On linux (and POSIX systems) there is a default signal handler for many
> > of the signals.  The terminate signal normally ends the process.  At
> > least in my tests the ruby program is indeed terminated, the process
> > ends, and the status in runit is set to 'd' or down.  It is set to down,
> > but the program is gone.
> >
> > When I wrote my own test in C:
> > ----
> > #include <unistd.h>	/* sleep() */
> >
> > int main(void)
> > {
> > 	/* No TERM handler installed: the default action applies. */
> > 	sleep( 50000 );
> > 	return 0;
> > }
> > ----
> >
> > to test the behaviour of TERM, everything works as expected.  No term
> > signal handler is registered, so sending the program a term on the
> > command line (kill -15 $pid) terminates it.  Then I tried ignoring term:
> >
> > ----
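> > #include <signal.h>	/* signal(), SIGTERM, SIG_IGN */
> > #include <unistd.h>	/* sleep() */
> >
> > int main(void)
> > {
> > 	/* Ignore TERM (signal 15) for the life of the process. */
> > 	signal( SIGTERM, SIG_IGN );
> > 	sleep( 50000 );
> > 	return 0;
> > }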
> > ----
> >
> > And the program kept running.  Testing both of these programs with runit
> > gave the expected results.  The program using the default signal handler
> > exited as soon as runit sent it term, and the status of the service was
> > set accordingly.  For the second program, term was ignored and runit
> > went into the "want down, got TERM" state.
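The same two experiments can be run without runit by having a parent process send the TERM itself.  A minimal sketch, assuming a POSIX system; term_ends_child() is a hypothetical helper, and the one-second grace period is an arbitrary choice:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Fork a child that either keeps the default TERM action or ignores TERM,
 * send it TERM, and wait up to about one second for it to go away.
 * Returns 1 if TERM ended the child, 0 if the child survived (it is then
 * cleaned up with KILL), -1 on error.
 */
int term_ends_child(int ignore_term)
{
	int ready[2], status, i;
	char byte;
	pid_t pid, r;
	struct timespec tick = { 0, 50 * 1000 * 1000 };	/* 50 ms */

	if (pipe(ready) != 0) return -1;
	pid = fork();
	if (pid < 0) { close(ready[0]); close(ready[1]); return -1; }

	if (pid == 0) {
		close(ready[0]);
		signal(SIGTERM, ignore_term ? SIG_IGN : SIG_DFL);
		/* Tell the parent the disposition is set. */
		if (write(ready[1], "r", 1) != 1) _exit(2);
		for (;;) pause();
	}

	close(ready[1]);
	if (read(ready[0], &byte, 1) != 1) {
		kill(pid, SIGKILL);
		waitpid(pid, NULL, 0);
		close(ready[0]);
		return -1;
	}
	close(ready[0]);

	kill(pid, SIGTERM);
	for (i = 0; i < 20; i++) {
		r = waitpid(pid, &status, WNOHANG);
		if (r == pid)
			return (WIFSIGNALED(status) &&
				WTERMSIG(status) == SIGTERM) ? 1 : -1;
		if (r < 0) return -1;
		nanosleep(&tick, NULL);
	}

	/* Still alive after the grace period: TERM was not enough. */
	kill(pid, SIGKILL);
	waitpid(pid, NULL, 0);
	return 0;
}
```

With the default action the child is reaped almost immediately; with SIG_IGN it is still running after the grace period, which corresponds to the "want down, got TERM" case.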
> >
> > On your system, are you 100% sure that the ruby test program you're using
> > isn't just exiting appropriately?  I can't find anything that mimics the
> > described behaviour.  I.e., runit is behaving the way you describe, but
> > the process does end.
> >
> > --Mike
> >
> > On Wednesday 16 January 2008 03:41:29 pm Ryan Woodrum wrote:
> > > Hello!
> > >
> > > I believe I have found a possible bug/oddity in the behavior of sv
> > > using runsv.  I happened upon this particular scenario in a test
> > > environment, but was actually able to repro it in my production
> > > environment as well as in a primitive case.  The issue involves slow
> > > children or children whose TERM handler isn't registered soon enough.
> > >
> > > Here's the setup:
> > > I create a simplistic base service configuration under which I will
> > > run a ruby application.  The ruby app looks like so:
> > > slow_signal.rb
> > > ---
> > > sleep(10)
> > >
> > > puts "registering term handler..."
> > > trap("TERM") do
> > >   puts "got term"
> > >   exit
> > > end
> > >
> > > while(true) do
> > >   puts "looping and sleeping..."
> > >   sleep 2
> > > end
> > > ---
> > >
> > > I run this under my run svdir with:
> > > #!/bin/sh
> > > exec 2>&1
> > > exec /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb
> > >
> > >
> > > The premise of the primitive ruby application is to emulate a slow-ish
> > > loading base of code that has a term handler registered early in the
> > > life of the process.
> > >
> > > If I invoke:
> > > /etc/init.d/slow_signal start
> > >
> > > followed within the 10 second sleep period by:
> > > /etc/init.d/slow_signal stop
> > >
> > > (/etc/init.d/slow_signal is a symlink to /usr/bin/sv)
> > >
> > > The process does not handle the signal but its state is set to 'd';
> > > down.  In subsequent calls to control() within sv.c, it will no longer
> > > write to the pipe because it thinks there is no need.  With no further
> > > writes to the pipe, another TERM will never get sent and so the
> > > process cannot be shut down via sv/runsv, at least not with TERM.
> > >
> > > It took me a while to learn how everything works and to track down
> > > just where this check was happening.  The source I worked against was
> > > the source available via the debian package v1.8.0 (`apt-get source
> > > runit` under debian sid).  (I looked for a repo but did not find a
> > > public one.)
> > >
> > > I can think of two solutions.  One is not to set svstatus[17] unless
> > > you're sure the process actually went down, but this is more
> > > complicated (perhaps more correct?) than the second.  Inside of
> > > control() in sv.c, a modification to always send a TERM can be made
> > > like so:
> > > -----
> > > 247c247,248
> > > <   if (svstatus[17] == *a) return(0);
> > > ---
> > > >   /* Write a TERM to the pipe even if we already have.  Slow TERM
> > > >   handler perhaps?  What about other cases?*/
> > > >   if (svstatus[17] == *a && *a != 'd') return(0);
> > > -----
> > >
> > > In this case, we simply decide that, if we want to issue a TERM via sv
> > > stop, down etc., we will go ahead and write again to the pipe.  Even
> > > if we think we don't need to.  This way, we're not stuck in "want down,
> > > got TERM."
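The effect of that one-line change can be isolated in a small predicate.  This is a sketch only: skip_control_write() and its parameters are hypothetical names, with recorded standing in for svstatus[17] and requested for *a.  It is not the code in sv.c.

```c
#include <stdbool.h>

/*
 * Decide whether a control request can be skipped because the recorded
 * state already matches it.  With resend_down set (the proposed patch),
 * a 'd' request is never skipped, so TERM is written to the pipe again.
 */
bool skip_control_write(char recorded, char requested, bool resend_down)
{
	if (recorded != requested)
		return false;		/* state differs: always write */
	if (resend_down && requested == 'd')
		return false;		/* patched: repeat the down request */
	return true;			/* nothing new to ask for */
}
```

With resend_down false this reduces to the original comparison; with it true, only a repeated 'd' request behaves differently.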
> > >
> > > So with an answer in hand... is this behavior by design?  It seems
> > > that a particularly slow child shouldn't immunize itself from a TERM
> > > because of a slow load time or late signal handler registration.
> > >
> > > Thoughts appreciated!  Thanks!
> > >
> > > -ryan woodrum




end of thread, other threads:[~2008-01-17 19:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-16 22:41 sv term handling with a slow child Ryan Woodrum
2008-01-16 23:04 ` Mike Buland
2008-01-17  0:25   ` Ryan Woodrum
2008-01-17  0:35   ` Ryan Woodrum
2008-01-17  8:25   ` Ryan Woodrum
2008-01-17 19:16     ` Mike Buland
