From: Mike Buland
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: sv term handling with a slow child
Date: Thu, 17 Jan 2008 12:16:25 -0700
Organization: Geek Gene
Message-ID: <200801171216.25539.mike@geekgene.com>
References: <200801161441.29193.rwoodrum@avvo.com> <200801161604.45554.mike@geekgene.com> <200801170025.27102.rwoodrum@avvo.com>
In-Reply-To: <200801170025.27102.rwoodrum@avvo.com>
To: supervision@list.skarnet.org

Whew, thanks for responding, I was curious about the outcome myself.  That
was the next thing that I was going to suggest (ruby version comparison),
and you beat me to it.
Well done finding that bug :)

--Mike

On Thursday 17 January 2008 01:25:27 am Ryan Woodrum wrote:
> Not to get into the habit of replying to myself, but....
>
> I found the problem with this and it actually appears to have nothing to do
> with runit.  (Sorry, all!)  While I was on the way home from work turning
> this over in my head, it occurred to me to indeed test the default handler
> by attempting to send a sig_term to a ruby script that was simply executing
> a sleep.
>
> I tried this on my home box running ruby v1.8.6 and it terminated as
> expected.  I tried it in the environment where I was experiencing the
> problems (and where it is running ruby v1.8.5) and the process did not
> terminate.  In retrospect I don't know why I didn't test this most basic of
> cases.
>
> So the answer is that it is, in fact, a bug in ruby.  Or was, rather.  See
> this thread where Matsumoto chimed in in a seemingly related situation:
> http://www.ruby-forum.com/topic/85485
>
> An strace of a more recent version of ruby shows the term coming in and
> then being handled by default:
> -----
> ) = ? ERESTARTNOHAND (To be restarted)
> --- SIGTERM (Terminated) @ 0 (0) ---
> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> sigprocmask(SIG_SETMASK, [], NULL) = 0
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> rt_sigaction(SIGINT, {SIG_DFL}, {0xb7f56960, [], 0}, 8) = 0
> rt_sigaction(SIGTERM, {SIG_DFL}, {0xb7f56960, [], 0}, 8) = 0
> kill(9481, SIGTERM) = 0
> --- SIGTERM (Terminated) @ 0 (0) ---
> +++ killed by SIGTERM +++
>
>
> The older version doesn't seem to be doing this.  Is the call to sigreturn
> indicative of the default handler not doing anything...?
> -----
> ) = ? ERESTARTNOHAND (To be restarted)
> --- SIGTERM (Terminated) @ 0 (0) ---
> sigreturn() = ? (mask now [])
> select(0, NULL, NULL, NULL, {13, 616000}
>
>
> ) = 0 (Timeout)
> time(NULL) = 1200558046
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> rt_sigaction(SIGINT, {SIG_DFL}, {0xb7f890d0, [], 0}, 8) = 0
> exit_group(0) = ?
>
>
> I don't believe I understand it 100% yet, but regardless, it is not a runit
> problem.
>
> -ryan woodrum
>
> On Wednesday 16 January 2008 03:04:45 pm Mike Buland wrote:
> > Hi
> >
> > I went ahead and ran a few tests, including your ruby script.  I can't
> > apparently reproduce the behaviour you describe.
> >
> > On linux (and POSIX systems) there is a default signal handler for many
> > of the signals.  The terminate signal normally ends the process.  At
> > least in my tests the ruby program is indeed terminated, the process
> > ends, and the status in runit is set to 'd' or down.  It is set to down,
> > but the program is gone.
> >
> > When I wrote my own test in C:
> > ----
> > #include <unistd.h>
> >
> > int main()
> > {
> >     sleep( 50000 );
> > }
> > ----
> >
> > to test the behaviour of TERM everything works as expected.  No term
> > signal handler is registered, sending the program a term on the command
> > line (kill -15 $pid) terminates the program.  Then I tried ignoring term:
> >
> > ----
> > #include <signal.h>
> > #include <unistd.h>
> >
> > int main()
> > {
> >     signal( 15, SIG_IGN );
> >     sleep( 50000 );
> > }
> > ----
> >
> > And the program kept running.  Testing both of these programs with runit
> > gave the expected results.  The program using the default signal handler
> > exited as soon as runit sent it term, and the status of the service was
> > set accordingly, for the second program term was ignored and runit went
> > into "want down, got TERM" state.
> >
> > On your system, are you 100% sure that the ruby test program you're using
> > isn't just exiting appropriately?  I can't find anything that mimics the
> > described behaviour.  I.E. runit is behaving the way you describe, but
> > the process does end.
> >
> > --Mike
> >
> > On Wednesday 16 January 2008 03:41:29 pm Ryan Woodrum wrote:
> > > Hello!
> > >
> > > I believe I have found a possible bug/oddity in the behavior of sv
> > > using runsv.  I happened upon this particular scenario in a test
> > > environment, but was actually able to repro it in my production
> > > environment as well as in a primitive case.  The issue involves slow
> > > children or children whose TERM handler isn't registered soon enough.
> > >
> > > Here's the setup:
> > > I create a simplistic base service configuration under which I will
> > > run a ruby application.  The ruby app looks like so:
> > > slow_signal.rb
> > > ---
> > > sleep(10)
> > >
> > > puts "registering term handler..."
> > > trap("TERM") do
> > >   puts "got term"
> > >   exit
> > > end
> > >
> > > while(true) do
> > >   puts "looping and sleeping..."
> > >   sleep 2
> > > end
> > > ---
> > >
> > > I run this under my run svdir with:
> > > #!/bin/sh
> > > exec 2>&1
> > > exec /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb
> > >
> > >
> > > The premise of the primitive ruby application is to emulate a slow-ish
> > > loading base of code that has a term handler registered early in the
> > > life of the process.
> > >
> > > If I invoke:
> > > /etc/init.d/slow_signal start
> > >
> > > followed within the 10 second sleep period by:
> > > /etc/init.d/slow_signal stop
> > >
> > > (/etc/init.d/slow_signal is a symlink to /usr/bin/sv)
> > >
> > > The process does not handle the signal but its state is set to 'd';
> > > down.  In subsequent calls to control() within sv.c, it will no longer
> > > write to the pipe because it thinks there is no need.  With no further
> > > writes to the pipe, another TERM will never get sent and so the
> > > process cannot be shut down via sv/runsv, at least not with TERM.
> > >
> > > It took me a while to learn how everything worked and to track down
> > > just where this check was happening.
> > > The source I worked against was
> > > the source available via the debian package v1.8.0 (`apt-get source
> > > runit` under debian sid).  (I looked for a repo but did not find a
> > > public one.)
> > >
> > > Two solutions I can think of are not to set svstatus[17] unless you're
> > > sure the process actually went down, but this is more complicated
> > > (perhaps more correct?) than a second solution.  Inside of control() in
> > > sv.c, a modification to always send a TERM can be made like so:
> > > -----
> > > 247c247,248
> > > < if (svstatus[17] == *a) return(0);
> > > ---
> > >
> > > > /* Write a TERM to the pipe even if we already have.  Slow TERM
> > > > handler perhaps?  What about other cases?*/
> > > > if (svstatus[17] == *a && *a != 'd') return(0);
> > >
> > > -----
> > >
> > > In this case, we simply decide that, if we want to issue a TERM via sv
> > > stop, down etc., we will go ahead and write again to the pipe.  Even
> > > if we think we don't need to.  This way, we're not stuck in "want down,
> > > got TERM."
> > >
> > > So with an answer in hand... is this behavior by design?  It seems
> > > that a particularly slow child shouldn't immunize itself from a TERM
> > > because of a slow load time or late signal handler registration.
> > >
> > > Thoughts appreciated!  Thanks!
> > >
> > > -ryan woodrum
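
For anyone who wants to repeat the basic check described in this thread
(send TERM to a ruby process that is doing nothing but sleeping, then see
how it ended), here is a minimal sketch.  The method name
term_kills_sleeper? and its two parameters are made up for illustration and
do not come from the thread or from runit.  It assumes a ruby recent enough
to have Kernel#spawn, so it would have to be adapted before running it on
the 1.8.x interpreters discussed above.

```ruby
require "rbconfig"

# Hypothetical helper, not from the thread: start a second ruby that only
# sleeps, send it TERM, and report whether the signal ended the process.
def term_kills_sleeper?(sleep_seconds = 30, settle = 0.5)
  # The child registers no trap("TERM"), so the default handling applies.
  pid = spawn(RbConfig.ruby, "-e", "sleep #{sleep_seconds}")
  sleep(settle)                      # let the child reach its sleep
  Process.kill("TERM", pid)
  _, status = Process.wait2(pid)     # blocks until the child is gone
  # True only if the child died from TERM rather than exiting on its own.
  status.signaled? && status.termsig == Signal.list["TERM"]
end

if term_kills_sleeper?
  puts "child was killed by TERM"
else
  puts "child was not killed by TERM"
end
```

On an interpreter that behaves like the older strace above, the child would
resume its sleep and exit normally, so the helper would only return false
once the full sleep has elapsed.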