From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/1635 Path: news.gmane.org!not-for-mail From: Ryan Woodrum Newsgroups: gmane.comp.sysutils.supervision.general Subject: Re: sv term handling with a slow child Date: Wed, 16 Jan 2008 16:25:29 -0800 Organization: Avvo, Inc. Message-ID: <200801161625.29625.rwoodrum@avvo.com> References: <200801161441.29193.rwoodrum@avvo.com> <200801161604.45554.mike@geekgene.com> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit X-Trace: ger.gmane.org 1200529551 10769 80.91.229.12 (17 Jan 2008 00:25:51 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Thu, 17 Jan 2008 00:25:51 +0000 (UTC) To: supervision@list.skarnet.org Original-X-From: supervision-return-1870-gcsg-supervision=m.gmane.org@list.skarnet.org Thu Jan 17 01:26:09 2008 Return-path: Envelope-to: gcsg-supervision@gmane.org Original-Received: from antah.skarnet.org ([212.85.147.14]) by lo.gmane.org with smtp (Exim 4.50) id 1JFIa6-0002pO-EQ for gcsg-supervision@gmane.org; Thu, 17 Jan 2008 01:26:06 +0100 Original-Received: (qmail 1611 invoked by uid 76); 17 Jan 2008 00:25:48 -0000 Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-Archive: Original-Received: (qmail 1602 invoked from network); 17 Jan 2008 00:25:47 -0000 X-Spam-Checker-Version: SpamAssassin 3.2.1 (2007-05-02) on dns1.hcs.net X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.4 tests=none autolearn=failed version=3.2.1 User-Agent: KMail/1.9.7 In-Reply-To: <200801161604.45554.mike@geekgene.com> Content-Disposition: inline Xref: news.gmane.org gmane.comp.sysutils.supervision.general:1635 Archived-At: Hi, Mike, Thanks much for the prompt reply. I am 100% certain the process is not exiting. Interestingly enough, as you described, if I create an equivalent C program, the problem does not repro. ----- #include #include #include void handler(int signal) { printf("Entered signal handler...\n"); exit(signal); } int main() { printf("Doing initial sleep for 10...\n"); sleep(10); signal(SIGTERM, handler); while(1) { printf("Looping and sleeping...\n"); sleep(2); } return 0; } ----- But here is some output with the ruby script again. Well behaved scenario with a long sleep before the first TERM: ----- ops1test:/home/rwoodrum/tmp# /etc/init.d/slow_signal start \ > && ps ax | grep slow \ > && sleep 12 \ > && /etc/init.d/slow_signal stop ok: run: slow_signal: (pid 30008) 0s 28866 ? Ss 0:00 runsv slow_signal 30008 ? S 0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb 30010 ttyp0 S+ 0:00 grep slow ok: down: slow_signal: 0s, normally up And now the quick start/stop scenario with extras: ----- ops1test:/home/rwoodrum/tmp# /etc/init.d/slow_signal start \ > && ps ax | grep slow \ > && sleep 4 \ > && /etc/init.d/slow_signal stop ok: run: slow_signal: (pid 30549) 0s 30229 ? Ss 0:00 runsv slow_signal 30549 ? S 0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb 30551 ttyp0 S+ 0:00 grep slow timeout: run: slow_signal: (pid 30549) 12s, want down, got TERM ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# ps ax | grep slow 30229 ? Ss 0:00 runsv slow_signal 30549 ? S 0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb 30557 ttyp0 S+ 0:00 grep slow ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# /etc/init.d/slow_signal stop timeout: run: slow_signal: (pid 30549) 27s, want down, got TERM ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# ps ax | grep slow 30229 ? Ss 0:00 runsv slow_signal 30549 ? S 0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb 30560 ttyp0 S+ 0:00 grep slow ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# kill -TERM 30549 ops1test:/home/rwoodrum/tmp# ops1test:/home/rwoodrum/tmp# ps ax | grep slow 30229 ? Ss 0:00 runsv slow_signal 30566 ttyp0 S+ 0:00 grep slow ops1test:/home/rwoodrum/tmp# So... maybe this is a ruby thing? What I'm certain of is if in the second, misbehaved scenario I strace the process, I never see another TERM getting delivered to slow_signal.rb. If I send it a term via kill, I see the term and it handles it correctly. Something still isn't adding up. -ryan woodrum On Wednesday 16 January 2008 03:04:45 pm Mike Buland wrote: > Hi > > I went ahead and ran a few tests, including your ruby script. I can't > apparently repreduce the behaviour you describe. > > On linux (and POSIX systems) there is a default signal handler for many of > the signals. The terminate signal normally ends the process. At least in > my tests the ruby program is indeed terminated, the process ends, and the > status in runit is set to 'd' or down. It is set to down, but the program > is gone. > > When I wrote my own test in C: > ---- > #include > > int main() > { > sleep( 50000 ); > } > ---- > > to test the behaviour of TERM everything works as expected. No term signal > handler is registered, sending the program a term on the command line > (kill -15 $pid) terminates the program. Then I tried ignoring term: > > ---- > #include > #include > #include > > int main() > { > signal( 15, SIG_IGN ); > sleep( 50000 ); > } > ---- > > And the program kept running. Testing both of these programs with runit > gave the expected results. The program using the default signal handler > exited as soon as runit sent it term, and the status of the service was set > accordingly, for the second program term was ignored and runit went > into "want down, got TERM" state. > > On your system, are you 100% sure that the ruby test program you're using > isn't just exiting appropriately? I can't find anything that mimics the > described bahaviour. I.E. runit is behaving the way you describe, but the > process does end. > > --Mike > > On Wednesday 16 January 2008 03:41:29 pm Ryan Woodrum wrote: > > Hello! > > > > I believe I have found a possible bug/oddity in the behavior of sv > > using runsv. I happened upon this particular scenario in a test > > environment, but was actually able to repro it in my production > > environment as well as in a primitive case. The issue involves slow > > children or children whose TERM handler isn't registered soon enough. > > > > Here's the setup: > > I create a simplistic base service configuration under which I will > > run a ruby application. The ruby app looks like so: > > slow_signal.rb > > --- > > sleep(10) > > > > puts "registering term handler..." > > trap("TERM") do > > puts "got term" > > exit > > end > > > > while(true) do > > puts "looping and sleeping..." > > sleep 2 > > end > > --- > > > > I run this under my run svdir with: > > #!/bin/sh > > exec 2>&1 > > exec /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb > > > > > > The premise of the primitive ruby application is to emulate a slow-ish > > loading base of code that has a term handler registered early in the > > life of the process. > > > > If I invoke: > > /etc/init.d/slow_signal start > > > > followed within the 10 second sleep period by: > > /etc/init.d/slow_signal stop > > > > (/etc/init.d/slow_signal is a symlink to /usr/bin/sv) > > > > The process does not handle the signal but its state is set to 'd'; > > down. In subsequent calls to control() within sv.c, it will no longer > > write to the pipe because it thinks there is no need. With no further > > writes to the pipe, another TERM will never get sent and so the > > process cannot be shut down via sv/runsv, at least not with TERM. > > > > It took me awhile to learn how everything was work and to track down > > just where this check was happening. The source I worked against was > > the source available via the debian package v1.8.0 (`apt-get source > > runit` under debian sid). (I looked for a repo but did not find a public > > one.) > > > > Two solutions I can think of are not to set svstatus[17] unless you're > > sure the process actually went down, but this is more complicated > > (perhaps more correct?) than a second solution. Inside of control() in > > sv.c, a modification to always send a TERM can be made like so: > > ----- > > 247c247,248 > > < if (svstatus[17] == *a) return(0); > > --- > > > > > /* Write a TERM to the pipe even if we already have. Slow TERM > > > handler perhaps? What about other cases?*/ > > > if (svstatus[17] == *a && *a != 'd') return(0); > > > > ----- > > > > In this case, we simply decide that, if we want to issue a TERM via sv > > stop, down etc., we will go ahead and write again to the pipe. Even > > if we think we don't need to. This way, we're not stuck in "want down, > > got TERM." > > > > So with an answer in hand... is this behavior by design? It seems > > that a particularly slow child shouldn't immunize itself from a TERM > > because of a slow load time or late signal handler registration. > > > > Thoughts appreciated! Thanks! > > > > -ryan woodrum