From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/1634 Path: news.gmane.org!not-for-mail From: Mike Buland Newsgroups: gmane.comp.sysutils.supervision.general Subject: Re: sv term handling with a slow child Date: Wed, 16 Jan 2008 16:04:45 -0700 Organization: Geek Gene Message-ID: <200801161604.45554.mike@geekgene.com> References: <200801161441.29193.rwoodrum@avvo.com> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Trace: ger.gmane.org 1200524760 29051 80.91.229.12 (16 Jan 2008 23:06:00 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Wed, 16 Jan 2008 23:06:00 +0000 (UTC) To: supervision@list.skarnet.org Original-X-From: supervision-return-1869-gcsg-supervision=m.gmane.org@list.skarnet.org Thu Jan 17 00:06:18 2008 Return-path: Envelope-to: gcsg-supervision@gmane.org Original-Received: from antah.skarnet.org ([212.85.147.14]) by lo.gmane.org with smtp (Exim 4.50) id 1JFHK6-0008Nk-QZ for gcsg-supervision@gmane.org; Thu, 17 Jan 2008 00:05:48 +0100 Original-Received: (qmail 24950 invoked by uid 76); 16 Jan 2008 23:05:12 -0000 Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-Archive: Original-Received: (qmail 24944 invoked from network); 16 Jan 2008 23:05:12 -0000 User-Agent: KMail/1.9.6 In-Reply-To: <200801161441.29193.rwoodrum@avvo.com> Content-Disposition: inline Xref: news.gmane.org gmane.comp.sysutils.supervision.general:1634 Archived-At: Hi I went ahead and ran a few tests, including your ruby script. I can't apparently repreduce the behaviour you describe. On linux (and POSIX systems) there is a default signal handler for many of the signals. The terminate signal normally ends the process. At least in my tests the ruby program is indeed terminated, the process ends, and the status in runit is set to 'd' or down. It is set to down, but the program is gone. When I wrote my own test in C: ---- #include int main() { sleep( 50000 ); } ---- to test the behaviour of TERM everything works as expected. No term signal handler is registered, sending the program a term on the command line (kill -15 $pid) terminates the program. Then I tried ignoring term: ---- #include #include #include int main() { signal( 15, SIG_IGN ); sleep( 50000 ); } ---- And the program kept running. Testing both of these programs with runit gave the expected results. The program using the default signal handler exited as soon as runit sent it term, and the status of the service was set accordingly, for the second program term was ignored and runit went into "want down, got TERM" state. On your system, are you 100% sure that the ruby test program you're using isn't just exiting appropriately? I can't find anything that mimics the described bahaviour. I.E. runit is behaving the way you describe, but the process does end. --Mike On Wednesday 16 January 2008 03:41:29 pm Ryan Woodrum wrote: > Hello! > > I believe I have found a possible bug/oddity in the behavior of sv > using runsv. I happened upon this particular scenario in a test > environment, but was actually able to repro it in my production > environment as well as in a primitive case. The issue involves slow > children or children whose TERM handler isn't registered soon enough. > > Here's the setup: > I create a simplistic base service configuration under which I will > run a ruby application. The ruby app looks like so: > slow_signal.rb > --- > sleep(10) > > puts "registering term handler..." > trap("TERM") do > puts "got term" > exit > end > > while(true) do > puts "looping and sleeping..." > sleep 2 > end > --- > > I run this under my run svdir with: > #!/bin/sh > exec 2>&1 > exec /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb > > > The premise of the primitive ruby application is to emulate a slow-ish > loading base of code that has a term handler registered early in the > life of the process. > > If I invoke: > /etc/init.d/slow_signal start > > followed within the 10 second sleep period by: > /etc/init.d/slow_signal stop > > (/etc/init.d/slow_signal is a symlink to /usr/bin/sv) > > The process does not handle the signal but its state is set to 'd'; > down. In subsequent calls to control() within sv.c, it will no longer > write to the pipe because it thinks there is no need. With no further > writes to the pipe, another TERM will never get sent and so the > process cannot be shut down via sv/runsv, at least not with TERM. > > It took me awhile to learn how everything was work and to track down > just where this check was happening. The source I worked against was > the source available via the debian package v1.8.0 (`apt-get source runit` > under debian sid). (I looked for a repo but did not find a public > one.) > > Two solutions I can think of are not to set svstatus[17] unless you're > sure the process actually went down, but this is more complicated > (perhaps more correct?) than a second solution. Inside of control() in > sv.c, a modification to always send a TERM can be made like so: > ----- > 247c247,248 > < if (svstatus[17] == *a) return(0); > --- > > > /* Write a TERM to the pipe even if we already have. Slow TERM > > handler perhaps? What about other cases?*/ > > if (svstatus[17] == *a && *a != 'd') return(0); > > ----- > > In this case, we simply decide that, if we want to issue a TERM via sv > stop, down etc., we will go ahead and write again to the pipe. Even > if we think we don't need to. This way, we're not stuck in "want down, > got TERM." > > So with an answer in hand... is this behavior by design? It seems > that a particularly slow child shouldn't immunize itself from a TERM > because of a slow load time or late signal handler registration. > > Thoughts appreciated! Thanks! > > -ryan woodrum