From: Charlie Brady
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit under sysvinit - coping when runsvdir dies
Date: Tue, 15 Nov 2005 22:50:39 -0500 (EST)
Cc: supervision@list.skarnet.org

On Mon, 14 Nov 2005, Charles Duffy wrote:

> I recently had a situation on a fielded server where runsvdir (from runit
> 1.2.3) apparently died; in any event, all the runsv processes which would
> typically be directly under runsvdir were instead inherited by sysvinit and
> running there.
> However, since runsvdir was respawned by sysvinit, a great
> deal of CPU time was being spent continuously trying to start new runsv
> instances under the fresh runsvdir -- attempts which failed because there
> were still runsv instances alive and holding open the relevant locks. I
> killed the old runsv instances with "runsvctrl e", and fresh children of the
> new runsvdir took their place -- but there are still some questions raised:
>
> - How could this have happened? The system's message log doesn't show the OOM
> killer taking down runsvdir or any segfault on the part of the same.

If any of the run scripts (or programs exec'd by the run scripts) did not
create a new process group, but sent a KILL or TERM signal to its own
process group (pppd used to do this in the past, and maybe still does),
then I think it could have brought down the whole pack of cards, runsvdir
included.

In order to prevent this, I think that each runsv should start a new
process group for the run script to execute in.

Note that for any service with a 'down' file, runsv crashing could mean
that the service changes from an up state to a down state (a freshly
started runsv sees the 'down' file and leaves the service stopped), which
is almost certainly not what you want to happen (you're running runit to
prevent such unrequested transitions).
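
The process-group suggestion above can be illustrated with a small sketch.
This is not runsv's code; it is a Python stand-in, assuming a POSIX system,
and the names run_in_own_group and misbehaving_service are made up for the
example. The child moves itself into a fresh process group before doing
anything else, so a later kill(0, SIGTERM) reaches only that group and the
supervising parent survives.

```python
import os
import signal


def run_in_own_group(child_fn):
    """Fork, move the child into its own process group, run child_fn there.

    Returns the raw wait status of the child. Hypothetical helper, standing
    in for what a supervisor could do before exec'ing a run script.
    """
    pid = os.fork()
    if pid == 0:
        # Child: become the leader of a new process group first, so that
        # any signal later sent to "my process group" stays in here.
        os.setpgid(0, 0)
        try:
            child_fn()
        finally:
            # Never fall back into the parent's code path.
            os._exit(0)
    # Parent (the supervisor stand-in): wait for the child to finish.
    _, status = os.waitpid(pid, 0)
    return status


def misbehaving_service():
    # Signals every process in the caller's process group, which is the
    # behaviour attributed to old pppd versions in the text above.
    os.kill(0, signal.SIGTERM)


if __name__ == "__main__":
    status = run_in_own_group(misbehaving_service)
    if os.WIFSIGNALED(status):
        print("child killed by signal", os.WTERMSIG(status))
    print("supervisor still alive, pid", os.getpid())
```

Without the os.setpgid(0, 0) call the child would stay in the parent's
process group, and the same kill(0, SIGTERM) would be delivered to the
parent and its other children as well.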