From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/903
Path: news.gmane.org!not-for-mail
From: Charlie Brady <charlieb-supervision@budge.apana.org.au>
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit under sysvinit - coping when runsvdir dies
Date: Tue, 15 Nov 2005 22:50:39 -0500 (EST)
Message-ID: <Pine.LNX.4.61.0511152202500.21197@e-smith.charlieb.ott.istop.com>
References: <dlajh8$eqj$1@sea.gmane.org>
NNTP-Posting-Host: main.gmane.org
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Trace: sea.gmane.org 1132113103 27118 80.91.229.2 (16 Nov 2005 03:51:43 GMT)
X-Complaints-To: usenet@sea.gmane.org
NNTP-Posting-Date: Wed, 16 Nov 2005 03:51:43 +0000 (UTC)
Cc: supervision@list.skarnet.org
Original-X-From: supervision-return-1139-gcsg-supervision=m.gmane.org@list.skarnet.org Wed Nov 16 04:51:39 2005
Return-path: <supervision-return-1139-gcsg-supervision=m.gmane.org@list.skarnet.org>
Original-Received: from antah.skarnet.org ([212.85.147.14])
	by ciao.gmane.org with smtp (Exim 4.43)
	id 1EcEJq-0001Rp-Jz
	for gcsg-supervision@gmane.org; Wed, 16 Nov 2005 04:50:47 +0100
Original-Received: (qmail 26590 invoked by uid 76); 16 Nov 2005 03:51:04 -0000
Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm
List-Post: <mailto:supervision@list.skarnet.org>
List-Help: <mailto:supervision-help@list.skarnet.org>
List-Unsubscribe: <mailto:supervision-unsubscribe@list.skarnet.org>
List-Subscribe: <mailto:supervision-subscribe@list.skarnet.org>
List-Archive: <http://www.skarnet.org/lists/>
Original-Received: (qmail 26584 invoked from network); 16 Nov 2005 03:51:04 -0000
X-X-Sender: charlieb@e-smith.charlieb.ott.istop.com
Original-To: Charles Duffy <cduffy@spamcop.net>
In-Reply-To: <dlajh8$eqj$1@sea.gmane.org>
Xref: news.gmane.org gmane.comp.sysutils.supervision.general:903
Archived-At: <http://permalink.gmane.org/gmane.comp.sysutils.supervision.general/903>


On Mon, 14 Nov 2005, Charles Duffy wrote:

> I recently had a situation on a fielded server where runsvdir (from runit 
> 1.2.3) apparently died; in any event, all the runsv processes which would 
> typically be directly under runsvdir were instead inherited by sysvinit and 
> running there. However, since runsvdir was respawned by sysvinit, a great 
> deal of CPU time was being spent continuously trying to start new runsv 
> instances under the fresh runsvdir -- attempts which failed because there 
> were still runsv instances alive and holding open the relevant locks. I 
> killed the old runsv instances with "runsvctrl e", and fresh children of the 
> new runsvdir took their place -- but there are still some questions raised:
>
> - How could this have happened? The system's message log doesn't show the OOM 
> killer taking down runsvdir or any segfault on the part of the same.

If any of the run scripts (or programs exec'd by the run scripts) did not 
create a new process group, but sent a kill or term signal to its own 
process group (pppd used to do this in the past, and maybe still does), 
then I think it could have brought down the whole pack of cards. In order 
to prevent this, I think that each runsv should start a new process group 
for the run script to execute in.
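
To illustrate the isolation suggested above, here is a minimal sketch in Python. It is not runsv's code. The child command is a hypothetical stand-in for a misbehaving run script that signals its own process group, and `run_isolated` is a name made up for this example. It uses `start_new_session=True`, which makes the child call setsid() and so gives it a new session as well as a new process group; a bare setpgid(0, 0) in the child would be the narrower equivalent.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for a misbehaving run script: it sends SIGTERM to
# its whole process group (pid 0 means "every process in my group").
CHILD_CODE = "import os, signal; os.kill(0, signal.SIGTERM)"


def run_isolated(code):
    """Run `code` in a child that leads its own process group.

    Assumption for illustration: the supervisor is this Python process.
    start_new_session=True makes the child call setsid() before exec,
    so a signal it sends to "its group" cannot reach the supervisor.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", code],
        start_new_session=True,
    )
    return proc.wait()


status = run_isolated(CHILD_CODE)

# A negative status from wait() means the child was killed by that signal.
print("child status:", status)
print("supervisor still running, pid", os.getpid())
```

Without the new group, the same `os.kill(0, ...)` call in the child would also be delivered to the parent and to every sibling sharing the group, which is the failure mode described above.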

Note that for any service with a 'down' file, runsv crashing could mean 
that the service changes from an up state to a down state - which is 
almost certainly not what you want to happen (you're running runit to 
prevent such unrequested transitions).