From: Charlie Brady
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: hello - hanging services
Date: Wed, 18 Aug 2010 11:23:41 -0400 (EDT)
To: Jean-Michel Bruenn
Cc: supervision@list.skarnet.org

On Wed, 18 Aug 2010, Jean-Michel Bruenn wrote:

> In fact i was thinking about something more simple, i guess you guys
> know nagios?
> similar to nagios - Just run a command, check for output
> or timeout, for example for apache, you write a script called
> "hangcheck" which gets run every X seconds by runit. This script
> contains something like:
>
> #!/bin/sh
> service=apache2
> curl 127.0.0.1 || sv restart $service
>
> e.g:
> wdp@localhost:~$ curl 127.0.0.1 || echo "didnt work"
> curl: (7) couldn't connect to host
> didnt work
>
> So the idea is, you can define with which command to test a service
> (you don't need to use this at all) and runit is just periodically
> running the hangcheck script - the hangcheck script itself is just
> running a command, and deciding by exit-code whether to do something or
> not (so this can be used to mail someone about a hanging/not responding
> service, or to restart this service).
>
> So there's no need for any special scripting or any special algorithm.
> And if i'm right there's not much work to be done in runit - just: if
> [ -f hangcheck ]; then ./hangcheck; fi (of course with the periodic
> timer set, let's say 10 seconds? 1 minute?)

There are many complications that you probably haven't thought of. What
would runit do if the hangcheck script itself hangs? How might you
compromise the existing functionality of runit by adding this feature?
How many race conditions do you introduce?

IMO you don't need to have this functionality in runit, which is already
doing its specified task well. You want something else to act as a
service watchdog. Use another tool to do that - for instance, nagios.
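
As a rough sketch of the "something else" approach: a watchdog can live
entirely outside runit and put a time limit on the check, so that a hung
check cannot block anything. This is only an illustration, not a tested
production script. It assumes the timeout utility from GNU coreutils is
installed; the watchdog function, its argument order, and the SV_CMD
override are names made up for this example and are not part of runit.

```shell
#!/bin/sh
# Hypothetical external watchdog. Run it from cron or as its own
# supervised service; it is not a runit feature.

# watchdog SERVICE SECONDS COMMAND...
# Runs COMMAND with a time limit of SECONDS. If the command fails or is
# killed for running too long, asks sv to restart SERVICE.
watchdog() {
    service=$1
    limit=$2
    shift 2

    status=0
    # Assumption: GNU coreutils timeout, which exits 124 when it has to
    # kill the command for exceeding the limit.
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?

    if [ "$status" -eq 0 ]; then
        echo "$service: ok"
        return 0
    fi

    if [ "$status" -eq 124 ]; then
        echo "$service: check timed out after ${limit}s"
    else
        echo "$service: check failed with status $status"
    fi

    # SV_CMD is a made-up override so the restart step can be dry-run
    # (for example SV_CMD=echo); by default this calls runit's sv.
    ${SV_CMD:-sv} restart "$service"
}

# When called with arguments, treat them as: SERVICE SECONDS COMMAND...
if [ "$#" -gt 0 ]; then
    watchdog "$@"
fi
```

Saved under a name of your choosing (say, watchdog.sh), it could be
called from cron as: watchdog.sh apache2 10 curl -fsS http://127.0.0.1/
Setting SV_CMD=echo first shows what would be restarted without
touching the service.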