From: Gerrit Pape
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: runit under sysvinit - coping when runsvdir dies
Date: Thu, 17 Nov 2005 09:08:51 +0000
Message-ID: <20051117090319.29817.qmail@f69151e2a0a700.315fe32.mid.smarden.org>
Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm

On Mon, Nov 14, 2005 at 12:03:20PM -0600, Charles Duffy wrote:
> I recently had a situation on a fielded server where runsvdir (from
> runit 1.2.3) apparently died; in any event, all the runsv processes
> which would typically be directly under runsvdir were instead inherited
> by sysvinit and running there.
> However, since runsvdir was respawned by
> sysvinit, a great deal of CPU time was being spent continuously trying
> to start new runsv instances under the fresh runsvdir -- attempts which
> failed because there were still runsv instances alive and holding open
> the relevant locks. I killed the old runsv instances with "runsvctrl e",
> and fresh children of the new runsvdir took their place -- but there are
> still some questions raised:
>
> - How could this have happened? The system's message log doesn't show
> the OOM killer taking down runsvdir or any segfault on the part of the same.

Charlie answered that.

> - How could such situations be more gracefully handled in the future?
> Having the customer call and complain because their server was unusably
> slow was a less-than-ideal way to find out about this issue.

If runsvdir receives the HUP signal, it sends a term signal to all runsv
processes it manages. On the TERM signal, it simply exits, leaving the
runsv processes alone. Maybe I should switch that, so that TERM signals
'by mistake' are handled better. It then would re-init the complete
system, stopping all services plus supervisors on TERM, and starting
them up again after being re-spawned through inittab. I'm not yet sure
about side-effects of this change though.

Regards, Gerrit.
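
For illustration only, the difference between the behaviour described above and the change being considered can be written down as a small toy model. This is a sketch in Python, not runit's actual C code: the function name `runsv_pids_to_term` and the flag `stop_children_on_term` are hypothetical, and the model only records which runsv processes would be sent TERM for a given incoming signal.

```python
import signal


def runsv_pids_to_term(received, runsv_pids, stop_children_on_term=False):
    """Toy model: which managed runsv pids get TERM for an incoming signal.

    received              -- signal delivered to the (modelled) runsvdir
    runsv_pids            -- pids of the runsv processes it manages
    stop_children_on_term -- hypothetical switch for the considered change
    """
    if received == signal.SIGHUP:
        # Behaviour as described in the mail: HUP means TERM to every runsv.
        return list(runsv_pids)
    if received == signal.SIGTERM:
        if stop_children_on_term:
            # Considered change: TERM also stops all supervisors, so a
            # runsvdir respawned through inittab starts from a clean slate.
            return list(runsv_pids)
        # Behaviour as described in the mail: runsvdir just exits and
        # leaves the runsv processes alone (they end up under init).
        return []
    # Other signals are outside the scope of this sketch.
    return []


if __name__ == "__main__":
    pids = [101, 102, 103]  # made-up pids for the example
    print(runsv_pids_to_term(signal.SIGHUP, pids))
    print(runsv_pids_to_term(signal.SIGTERM, pids))
    print(runsv_pids_to_term(signal.SIGTERM, pids, stop_children_on_term=True))
```

With the flag left off, a stray TERM leaves every runsv running without a parent runsvdir, which is the situation reported in the quoted mail; with it on, TERM would behave like HUP in this model.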