From: Ryan Woodrum
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: sv term handling with a slow child
Date: Wed, 16 Jan 2008 14:41:29 -0800
Organization: Avvo, Inc.
Message-ID: <200801161441.29193.rwoodrum@avvo.com>
To: supervision@list.skarnet.org

Hello!

I believe I have found a possible bug/oddity in the behavior of sv using runsv. I happened upon this particular scenario in a test environment, but was actually able to repro it in my production environment as well as in a primitive case. The issue involves slow children or children whose TERM handler isn't registered soon enough.

Here's the setup: I create a simplistic base service configuration under which I will run a ruby application. The ruby app looks like so:

slow_signal.rb
---
sleep(10)
puts "registering term handler..."
trap("TERM") do
  puts "got term"
  exit
end
while(true) do
  puts "looping and sleeping..."
  sleep 2
end
---

I run this under my svdir with the following run script:

#!/bin/sh
exec 2>&1
exec /usr/bin/ruby /home/rwoodrum/tmp/slow_signal.rb

The premise of the primitive ruby application is to emulate a slow-ish loading base of code that has a term handler registered early in the life of the process.

If I invoke:

/etc/init.d/slow_signal start

followed within the 10 second sleep period by:

/etc/init.d/slow_signal stop

(/etc/init.d/slow_signal is a symlink to /usr/bin/sv)

The process does not handle the signal, but its state is set to 'd' (down). In subsequent calls to control() within sv.c, sv no longer writes to the pipe because it thinks there is no need. With no further writes to the pipe, another TERM never gets sent, and so the process cannot be shut down via sv/runsv, at least not with TERM.

It took me a while to learn how everything worked and to track down just where this check was happening. The source I worked against was the source available via the debian package v1.8.0 (`apt-get source runit` under debian sid). (I looked for a repo but did not find a public one.)

I can think of two solutions. The first is not to set svstatus[17] unless you're sure the process actually went down, but this is more complicated (perhaps more correct?) than the second. The second is a modification inside control() in sv.c to always send a TERM, like so:

-----
247c247,248
< if (svstatus[17] == *a) return(0);
---
> /* Write a TERM to the pipe even if we already have. Slow TERM
> handler perhaps? What about other cases?*/
> if (svstatus[17] == *a && *a != 'd') return(0);
-----

In this case, we simply decide that, if we want to issue a TERM via sv stop, down etc., we will go ahead and write to the pipe again, even if we think we don't need to. This way, we're not stuck in "want down, got TERM."

So with an answer in hand... is this behavior by design? It seems that a particularly slow child shouldn't immunize itself from a TERM because of a slow load time or late signal handler registration.

Thoughts appreciated! Thanks!

-ryan woodrum
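
P.S. To show the effect of the check in isolation, here is a minimal sketch in Ruby. It is a toy model, not the sv.c code: ToySupervisor, its @want field (standing in for svstatus[17]) and the always_resend_down flag are hypothetical names made up for illustration. It also folds the status byte and the pipe write into one object, whereas the real status is presumably maintained by runsv. It only counts how many "down" requests would reach the control pipe under the original check and under the patched one.

```ruby
# Toy model of the check discussed above; NOT the real sv.c code.
# ToySupervisor and all of its fields are hypothetical names.
class ToySupervisor
  attr_reader :terms_sent

  def initialize(always_resend_down:)
    @want = "u"          # stands in for svstatus[17]
    @terms_sent = 0      # "down" requests that reached the control pipe
    @always_resend_down = always_resend_down
  end

  # Same shape as the check in control(): skip the write when the
  # recorded state already matches the request. With the patched rule,
  # a "d" request is written again regardless.
  def control(cmd)
    resend = @always_resend_down && cmd == "d"
    return 0 if @want == cmd && !resend

    @want = cmd
    @terms_sent += 1 if cmd == "d"
    1
  end
end

original = ToySupervisor.new(always_resend_down: false)
patched  = ToySupervisor.new(always_resend_down: true)

# Two "stop" requests in a row, as when the first TERM was not handled.
2.times do
  original.control("d")
  patched.control("d")
end

puts "original check: #{original.terms_sent} down request(s) written"
puts "patched check:  #{patched.terms_sent} down request(s) written"
```

With the original check the second request is dropped, which matches what I see: once the state reads 'd', nothing further is written. With the patched check every "down" request goes through.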