From: Jean-Michel Bruenn
To: supervision@list.skarnet.org
Subject: Re: hello - hanging services
Date: Wed, 18 Aug 2010 17:06:35 +0200
Message-ID: <20100818170635.a5a24d3f.jean.bruenn@ip-minds.de>
In-Reply-To: <20100818105735.GA13364@skarnet.org>

> >>
> >> Difficult to implement?
> >
> > Yes.
>
> More precisely, it's not so much "difficult to implement" (I've done
> it for a paying customer's project) as "impossible to do without specific
> support in the service you're trying to manage".
> In other words, what Jean-Michel wants is a software watchdog; it can
> be done, but it's pretty intrusive. It requires having a library, a daemon,
> and making library calls in the managed process' source, sending messages
> to the daemon by doing so. The daemon is configured with a certain policy
> that decides "the service is running fine" or "the service has hung"
> depending on the frequency of the messages it receives.
>
> It's doable, and a watchdog library/daemon may even have its place in
> a supervision suite (I'll think about it), but it certainly has nothing
> to do with purely external process management tools such as runsvdir/runsv
> or svscan/supervise. It's a whole piece of software on its own.
>
> I'm certain that a lot of open source software watchdogs already exist
> out there. I'm also certain that none of them is as lightweight and easy
> to use as I'd like, but that's another story.

In fact I was thinking about something simpler. I guess you guys know
Nagios? Something similar to Nagios: just run a command and check for
output or a timeout. For example, for apache you write a script called
"hangcheck" which gets run every X seconds by runit. This script contains
something like:

    #!/bin/sh
    service=apache2
    # restart the service if the HTTP check fails
    curl 127.0.0.1 || sv restart "$service"

e.g.:

    wdp@localhost:~$ curl 127.0.0.1 || echo "didnt work"
    curl: (7) couldn't connect to host
    didnt work

So the idea is: you can define which command to use to test a service
(you don't need to use this at all), and runit just runs the hangcheck
script periodically. The hangcheck script itself just runs a command and
decides by its exit code whether to do something or not, so it can be used
to mail someone about a hanging or unresponsive service, or to restart
that service. So there's no need for any special scripting or any special
algorithm.
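
To make the idea a bit more concrete, here is a minimal sketch of such a
check written as a reusable shell function. The two-argument hangcheck
function, the 5-second timeout and the curl options in the example call
are only my illustration of the approach; none of this is an existing
runit feature.

```shell
#!/bin/sh
# Hypothetical helper, not part of runit: run a probe command and, only
# if the probe fails, run a recovery action.

# usage: hangcheck PROBE ACTION
# PROBE and ACTION are command strings executed with "sh -c".
# Returns 0 if the probe succeeded, 1 if the action had to be run.
hangcheck() {
    probe=$1
    action=$2
    # The probe's own output is discarded; only its exit code matters.
    if sh -c "$probe" >/dev/null 2>&1; then
        return 0
    fi
    sh -c "$action"
    return 1
}

# Example call for apache2 (assumed service name). --max-time makes a
# server that accepts the connection but never answers count as a failure:
# hangcheck 'curl --silent --fail --max-time 5 http://127.0.0.1/' 'sv restart apache2'
```

The action could just as well be a command that mails someone instead of
restarting the service; the function only looks at the probe's exit code.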

And if I'm right, there's not much work to be done in runit, just:

    if [ -f hangcheck ]; then ./hangcheck; fi

(of course with a periodic timer set, let's say 10 seconds? 1 minute?)

Cheers