From mboxrd@z Thu Jan 1 00:00:00 1970 X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/1636 Path: news.gmane.org!not-for-mail From: Ryan Woodrum Newsgroups: gmane.comp.sysutils.supervision.general Subject: Re: sv term handling with a slow child Date: Wed, 16 Jan 2008 16:35:40 -0800 Organization: Avvo, Inc. Message-ID: <200801161635.40876.rwoodrum@avvo.com> References: <200801161441.29193.rwoodrum@avvo.com> <200801161604.45554.mike@geekgene.com> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit X-Trace: ger.gmane.org 1200530155 12412 80.91.229.12 (17 Jan 2008 00:35:55 GMT) X-Complaints-To: usenet@ger.gmane.org NNTP-Posting-Date: Thu, 17 Jan 2008 00:35:55 +0000 (UTC) To: supervision@list.skarnet.org Original-X-From: supervision-return-1871-gcsg-supervision=m.gmane.org@list.skarnet.org Thu Jan 17 01:36:13 2008 Return-path: Envelope-to: gcsg-supervision@gmane.org Original-Received: from antah.skarnet.org ([212.85.147.14]) by lo.gmane.org with smtp (Exim 4.50) id 1JFIjs-000613-HU for gcsg-supervision@gmane.org; Thu, 17 Jan 2008 01:36:12 +0100 Original-Received: (qmail 2753 invoked by uid 76); 17 Jan 2008 00:35:54 -0000 Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-Archive: Original-Received: (qmail 2742 invoked from network); 17 Jan 2008 00:35:54 -0000 X-Spam-Checker-Version: SpamAssassin 3.2.1 (2007-05-02) on dns1.hcs.net X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.4 tests=none autolearn=failed version=3.2.1 User-Agent: KMail/1.9.7 In-Reply-To: <200801161604.45554.mike@geekgene.com> Content-Disposition: inline Xref: news.gmane.org gmane.comp.sysutils.supervision.general:1636 Archived-At: I should add for clarity to my first, well behaved example showing that the process does indeed exit: ops1test:/home/rwoodrum/tmp# /etc/init.d/slow_signal start \ > && ps ax | grep slow \ > && sleep 12 \ > && /etc/init.d/slow_signal stop \ > && ps ax | grep slow ok: run: slow_signal: (pid 31434) 58s 30229 ? Ss 0:00 runsv slow_signal 31434 ? S 0:00 /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb 31446 ttyp0 S+ 0:00 grep slow ok: down: slow_signal: 0s, normally up 30229 ? Ss 0:00 runsv slow_signal 31456 ttyp0 S+ 0:00 grep slow On Wednesday 16 January 2008 03:04:45 pm Mike Buland wrote: > Hi > > I went ahead and ran a few tests, including your ruby script. I can't > apparently repreduce the behaviour you describe. > > On linux (and POSIX systems) there is a default signal handler for many of > the signals. The terminate signal normally ends the process. At least in > my tests the ruby program is indeed terminated, the process ends, and the > status in runit is set to 'd' or down. It is set to down, but the program > is gone. > > When I wrote my own test in C: > ---- > #include > > int main() > { > sleep( 50000 ); > } > ---- > > to test the behaviour of TERM everything works as expected. No term signal > handler is registered, sending the program a term on the command line > (kill -15 $pid) terminates the program. Then I tried ignoring term: > > ---- > #include > #include > #include > > int main() > { > signal( 15, SIG_IGN ); > sleep( 50000 ); > } > ---- > > And the program kept running. Testing both of these programs with runit > gave the expected results. The program using the default signal handler > exited as soon as runit sent it term, and the status of the service was set > accordingly, for the second program term was ignored and runit went > into "want down, got TERM" state. > > On your system, are you 100% sure that the ruby test program you're using > isn't just exiting appropriately? I can't find anything that mimics the > described bahaviour. I.E. runit is behaving the way you describe, but the > process does end. > > --Mike > > On Wednesday 16 January 2008 03:41:29 pm Ryan Woodrum wrote: > > Hello! > > > > I believe I have found a possible bug/oddity in the behavior of sv > > using runsv. I happened upon this particular scenario in a test > > environment, but was actually able to repro it in my production > > environment as well as in a primitive case. The issue involves slow > > children or children whose TERM handler isn't registered soon enough. > > > > Here's the setup: > > I create a simplistic base service configuration under which I will > > run a ruby application. The ruby app looks like so: > > slow_signal.rb > > --- > > sleep(10) > > > > puts "registering term handler..." > > trap("TERM") do > > puts "got term" > > exit > > end > > > > while(true) do > > puts "looping and sleeping..." > > sleep 2 > > end > > --- > > > > I run this under my run svdir with: > > #!/bin/sh > > exec 2>&1 > > exec /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb > > > > > > The premise of the primitive ruby application is to emulate a slow-ish > > loading base of code that has a term handler registered early in the > > life of the process. > > > > If I invoke: > > /etc/init.d/slow_signal start > > > > followed within the 10 second sleep period by: > > /etc/init.d/slow_signal stop > > > > (/etc/init.d/slow_signal is a symlink to /usr/bin/sv) > > > > The process does not handle the signal but its state is set to 'd'; > > down. In subsequent calls to control() within sv.c, it will no longer > > write to the pipe because it thinks there is no need. With no further > > writes to the pipe, another TERM will never get sent and so the > > process cannot be shut down via sv/runsv, at least not with TERM. > > > > It took me awhile to learn how everything was work and to track down > > just where this check was happening. The source I worked against was > > the source available via the debian package v1.8.0 (`apt-get source > > runit` under debian sid). (I looked for a repo but did not find a public > > one.) > > > > Two solutions I can think of are not to set svstatus[17] unless you're > > sure the process actually went down, but this is more complicated > > (perhaps more correct?) than a second solution. Inside of control() in > > sv.c, a modification to always send a TERM can be made like so: > > ----- > > 247c247,248 > > < if (svstatus[17] == *a) return(0); > > --- > > > > > /* Write a TERM to the pipe even if we already have. Slow TERM > > > handler perhaps? What about other cases?*/ > > > if (svstatus[17] == *a && *a != 'd') return(0); > > > > ----- > > > > In this case, we simply decide that, if we want to issue a TERM via sv > > stop, down etc., we will go ahead and write again to the pipe. Even > > if we think we don't need to. This way, we're not stuck in "want down, > > got TERM." > > > > So with an answer in hand... is this behavior by design? It seems > > that a particularly slow child shouldn't immunize itself from a TERM > > because of a slow load time or late signal handler registration. > > > > Thoughts appreciated! Thanks! > > > > -ryan woodrum