From: Ryan Woodrum
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: sv term handling with a slow child
Date: Thu, 17 Jan 2008 00:25:27 -0800
Organization: Avvo, Inc.
Message-ID: <200801170025.27102.rwoodrum@avvo.com>
References: <200801161441.29193.rwoodrum@avvo.com> <200801161604.45554.mike@geekgene.com>
In-Reply-To: <200801161604.45554.mike@geekgene.com>
To: supervision@list.skarnet.org

Not to get into the habit of replying to myself, but....

I found the problem with this and it actually appears to have nothing to do
with runit.  (Sorry, all!)

While I was on the way home from work turning this over in my head, it
occurred to me to test the default handler by sending a SIGTERM to a ruby
script that was simply executing a sleep.  I tried this on my home box
running ruby v1.8.6 and it terminated as expected.  I tried it in the
environment where I was experiencing the problems (which runs ruby v1.8.5)
and the process did not terminate.  In retrospect I don't know why I didn't
test this most basic of cases.

So the answer is that it is, in fact, a bug in ruby.  Or was, rather.  See
this thread where Matsumoto chimed in on a seemingly related situation:
http://www.ruby-forum.com/topic/85485

An strace of a more recent version of ruby shows the TERM coming in and then
being handled by default:
-----
) = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
sigprocmask(SIG_SETMASK, [], NULL) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {0xb7f56960, [], 0}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL}, {0xb7f56960, [], 0}, 8) = 0
kill(9481, SIGTERM) = 0
--- SIGTERM (Terminated) @ 0 (0) ---
+++ killed by SIGTERM +++

The older version doesn't seem to be doing this.  Is the call to sigreturn
indicative of the default handler not doing anything...?
-----
) = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
sigreturn() = ? (mask now [])
select(0, NULL, NULL, NULL, {13, 616000} ) = 0 (Timeout)
time(NULL) = 1200558046
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {0xb7f890d0, [], 0}, 8) = 0
exit_group(0) = ?

I don't believe I understand it 100% yet, but regardless, it is not a runit
problem.

-ryan woodrum

On Wednesday 16 January 2008 03:04:45 pm Mike Buland wrote:
> Hi
>
> I went ahead and ran a few tests, including your ruby script.  I apparently
> can't reproduce the behaviour you describe.
>
> On linux (and POSIX systems) there is a default signal handler for many of
> the signals.  The terminate signal normally ends the process.  At least in
> my tests the ruby program is indeed terminated, the process ends, and the
> status in runit is set to 'd' or down.  It is set to down, but the program
> is gone.
>
> When I wrote my own test in C:
> ----
> #include <unistd.h>   /* sleep() */
>
> int main()
> {
>     sleep( 50000 );
> }
> ----
>
> to test the behaviour of TERM everything works as expected.  No term signal
> handler is registered, and sending the program a term on the command line
> (kill -15 $pid) terminates the program.  Then I tried ignoring term:
>
> ----
> #include <signal.h>   /* signal(), SIG_IGN */
> #include <unistd.h>   /* sleep() */
>
> int main()
> {
>     signal( 15, SIG_IGN );   /* 15 is SIGTERM */
>     sleep( 50000 );
> }
> ----
>
> And the program kept running.  Testing both of these programs with runit
> gave the expected results.  The program using the default signal handler
> exited as soon as runit sent it term, and the status of the service was set
> accordingly; for the second program term was ignored and runit went
> into "want down, got TERM" state.
>
> On your system, are you 100% sure that the ruby test program you're using
> isn't just exiting appropriately?  I can't find anything that mimics the
> described behaviour.  I.e., runit is behaving the way you describe, but the
> process does end.
>
> --Mike
>
> On Wednesday 16 January 2008 03:41:29 pm Ryan Woodrum wrote:
> > Hello!
> >
> > I believe I have found a possible bug/oddity in the behavior of sv
> > using runsv.  I happened upon this particular scenario in a test
> > environment, but was actually able to repro it in my production
> > environment as well as in a primitive case.  The issue involves slow
> > children or children whose TERM handler isn't registered soon enough.
> >
> > Here's the setup:
> > I create a simplistic base service configuration under which I will
> > run a ruby application.
> > The ruby app looks like so:
> > slow_signal.rb
> > ---
> > sleep(10)
> >
> > puts "registering term handler..."
> > trap("TERM") do
> >   puts "got term"
> >   exit
> > end
> >
> > while(true) do
> >   puts "looping and sleeping..."
> >   sleep 2
> > end
> > ---
> >
> > I run this under my svdir with this run script:
> > #!/bin/sh
> > exec 2>&1
> > exec /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb
> >
> >
> > The premise of the primitive ruby application is to emulate a slow-ish
> > loading base of code that has a term handler registered early in the
> > life of the process.
> >
> > If I invoke:
> > /etc/init.d/slow_signal start
> >
> > followed within the 10 second sleep period by:
> > /etc/init.d/slow_signal stop
> >
> > (/etc/init.d/slow_signal is a symlink to /usr/bin/sv)
> >
> > The process does not handle the signal but its state is set to 'd';
> > down.  In subsequent calls to control() within sv.c, it will no longer
> > write to the pipe because it thinks there is no need.  With no further
> > writes to the pipe, another TERM will never get sent and so the
> > process cannot be shut down via sv/runsv, at least not with TERM.
> >
> > It took me a while to learn how everything worked and to track down
> > just where this check was happening.  The source I worked against was
> > the source available via the debian package v1.8.0 (`apt-get source
> > runit` under debian sid).  (I looked for a repo but did not find a public
> > one.)
> >
> > I can think of two solutions.  One is not to set svstatus[17] unless
> > you're sure the process actually went down, but this is more complicated
> > (perhaps more correct?) than the second.  Inside of control() in
> > sv.c, a modification to always send a TERM can be made like so:
> > -----
> > 247c247,248
> > <   if (svstatus[17] == *a) return(0);
> > ---
> > >   /* Write a TERM to the pipe even if we already have.  Slow TERM handler perhaps?  What about other cases?*/
> > >   if (svstatus[17] == *a && *a != 'd') return(0);
> >
> > -----
> >
> > In this case, we simply decide that, if we want to issue a TERM via sv
> > stop, down etc., we will go ahead and write again to the pipe, even
> > if we think we don't need to.  This way, we're not stuck in "want down,
> > got TERM."
> >
> > So with an answer in hand... is this behavior by design?  It seems
> > that a particularly slow child shouldn't immunize itself from a TERM
> > because of a slow load time or late signal handler registration.
> >
> > Thoughts appreciated!  Thanks!
> >
> > -ryan woodrum
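
The basic check described at the top of this message (send TERM to a ruby
process that is doing nothing but sleeping, then see whether it actually died
from the signal) can be sketched roughly as below.  This is only a minimal
sketch, assuming a reasonably recent Ruby: Process.spawn does not exist in the
1.8 series discussed in this thread.  The helper name term_sleeping_child, the
30 second child sleep and the settle delay are all made up for illustration.

```ruby
require "rbconfig"

# Hypothetical helper (not from the thread): start a child ruby that only
# sleeps, send it TERM, and return the child's Process::Status.
def term_sleeping_child(settle = 1.0)
  # Run the child with the same interpreter that is running this script.
  pid = Process.spawn(RbConfig.ruby, "-e", "sleep 30")
  sleep settle                    # give the child time to reach its sleep
  Process.kill("TERM", pid)       # same as `kill -15 <pid>` from a shell
  _, status = Process.wait2(pid)  # blocks until the child is really gone
  status
end

status = term_sleeping_child
if status.signaled?
  puts "child was killed by signal #{status.termsig}"
else
  puts "child exited on its own with status #{status.exitstatus}"
end
```

On an interpreter showing the behaviour in the second strace above, one would
expect the wait to last until the child's own sleep ends and the status to
report a normal exit rather than a death by signal.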