From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/2036
Path: news.gmane.org!not-for-mail
From: Laurent Bercot <ska-supervision@skarnet.org>
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: hello - hanging services
Date: Thu, 19 Aug 2010 07:46:35 +0200
Message-ID: <20100819054635.GA14146@skarnet.org>
References: <20100817190803.41e8257f.jean.bruenn@ip-minds.de> <Pine.LNX.4.64.1008171311210.4362@e-smith.charlieb.ott.istop.com> <20100817192422.a157e85f.jean.bruenn@ip-minds.de> <Pine.LNX.4.64.1008171335070.4362@e-smith.charlieb.ott.istop.com> <20100818105735.GA13364@skarnet.org> <20100818170635.a5a24d3f.jean.bruenn@ip-minds.de> <Pine.LNX.4.64.1008181118260.18955@e-smith.charlieb.ott.istop.com> <20100818180205.5be254c7.jean.bruenn@ip-minds.de>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: dough.gmane.org 1282196676 31035 80.91.229.12 (19 Aug 2010 05:44:36 GMT)
X-Complaints-To: usenet@dough.gmane.org
NNTP-Posting-Date: Thu, 19 Aug 2010 05:44:36 +0000 (UTC)
To: supervision@list.skarnet.org
Original-X-From: supervision-return-2271-gcsg-supervision=m.gmane.org@list.skarnet.org Thu Aug 19 07:44:35 2010
Return-path: <supervision-return-2271-gcsg-supervision=m.gmane.org@list.skarnet.org>
Envelope-to: gcsg-supervision@lo.gmane.org
Original-Received: from antah.skarnet.org ([212.85.147.14])
	by lo.gmane.org with smtp (Exim 4.69)
	(envelope-from <supervision-return-2271-gcsg-supervision=m.gmane.org@list.skarnet.org>)
	id 1OlxvX-0008E9-8U
	for gcsg-supervision@lo.gmane.org; Thu, 19 Aug 2010 07:44:35 +0200
Original-Received: (qmail 21855 invoked by uid 76); 19 Aug 2010 05:46:35 -0000
Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm
List-Post: <mailto:supervision@list.skarnet.org>
List-Help: <mailto:supervision-help@list.skarnet.org>
List-Unsubscribe: <mailto:supervision-unsubscribe@list.skarnet.org>
List-Subscribe: <mailto:supervision-subscribe@list.skarnet.org>
List-Archive: <http://www.skarnet.org/lists/>
Original-Received: (qmail 21847 invoked by uid 1000); 19 Aug 2010 05:46:35 -0000
Mail-Followup-To: supervision@list.skarnet.org
Content-Disposition: inline
In-Reply-To: <20100818180205.5be254c7.jean.bruenn@ip-minds.de>
User-Agent: Mutt/1.4i
Xref: news.gmane.org gmane.comp.sysutils.supervision.general:2036
Archived-At: <http://permalink.gmane.org/gmane.comp.sysutils.supervision.general/2036>

> I understand your thoughts about this, and yes i have thought about
> this, too. But let's make it clear: This can happen with runit as it
> is now, also: a weird written run-script or a broken log-script might
> compromise the existing functionality of runit (if it doesnt, adding a
> new variant like a hangcheck-script wouldn't do so neither). I mean:
> what happens currently if one of the services which you're trying to
> start hangs? I havent tried yet, so i guess only the service which
> you're trying to start would be compromised - not whole runit. And this
> wouldn't be the case with my suggestion neither.

 The problem is that your suggestion affects the reliability of the
service you want to check.
 If ./run hangs, well, the service hangs. ./run IS the service: of
course you need to write the script properly if you want the service
to function properly. There's nothing we can do about that. Same for
the logger, which is also a service (albeit a special one).
 If ./hangcheck hangs, then what should be the default policy? To be
congruent with a watchdog's purpose, you should restart the service.
But then, you might have a buggy ./hangcheck script and a perfectly
functional service, and restarting it for no good reason is a decrease
in service availability and reliability (and a waste of resources).
By adding ./hangcheck support, you are adding a dependency, and making
the service architecture more fragile. That's what Charlie meant
(I think).
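
 To make the policy problem concrete, here is a minimal sketch in
Python. It is not anything runit provides: should_restart, check_argv
and timeout_s are hypothetical names, and the check is assumed to be
an external command that exits 0 when the service looks healthy.

```python
import subprocess
import sys

def should_restart(check_argv, timeout_s):
    """Hypothetical watchdog policy: restart unless the check exits 0 in time."""
    try:
        result = subprocess.run(check_argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The check itself hung. The policy cannot tell whether the
        # service or the check is at fault, so it restarts anyway.
        return True
    return result.returncode != 0

if __name__ == "__main__":
    # Stand-in for a buggy ./hangcheck that never returns.
    hung_check = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(should_restart(hung_check, timeout_s=0.5))
```

 Nothing in this function looks at the service at all: a hung check is
enough to trigger a restart of a service that may be working fine.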

 Small is beautiful for a reason: small has fewer hidden costs.
Every time you want to add a feature, look for the hidden costs.
Sometimes the feature is worth paying them. Most of the time it's not.


> Probably you're right, though i don't exactly understand your
> argumentation because: Runit is starting crashed processes (this
> shouldn't be the job of an Init-System - the job of an init-system is
> starting processes, not making sure that they're up and running -
> thats the job of a software-watchdog).

 Please read the list archives; this has been discussed at length.
What it comes down to is the duties of process 1, and process 1 *has*
to restart processes (at least one), in order to keep the system in
a usable state no matter what happens, no matter what dies. A supervision
architecture such as runit is a natural consequence of properly
implementing process 1's duties.

 You are mixing two different notions of 'up and running'.
 What runit does (and what any init system *should* do) is make sure
that the *process* corresponding to a given service has been properly
forked and exec'ed. As long as the process is there, runit is happy.
It's an external process management tool.
 What a software watchdog does is make sure that said process actually
does what is expected of it, as opposed to e.g. hanging or busylooping. This
is more complex, because it requires knowledge of what the service is
supposed to do, and it generally can't be done without access to the
service's source code.
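
 The first notion can be sketched in a few lines of Python. This is an
illustration only, not how runit is implemented: supervise and
max_restarts are hypothetical, and a real supervisor would loop forever
instead of stopping after a fixed number of restarts.

```python
import subprocess
import sys

def supervise(argv, max_restarts):
    """Start argv, wait for it to die, start it again; return the exit codes."""
    exit_codes = []
    for _ in range(max_restarts):
        child = subprocess.Popen(argv)   # fork + exec the service
        exit_codes.append(child.wait())  # block until the process is gone
    return exit_codes

if __name__ == "__main__":
    # Stand-in for a service that crashes right away with status 3.
    crashing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(crashing, max_restarts=3))
```

 The loop needs no knowledge of what the service does. It only reacts
to the process going away.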
 

> BUT: runit is doing exactly
> this. Runit is taking care that your service is up and running, by
> restarting it if its crashing - By argueing that "checking whether a
> service is responding (and thus working) is not the job of runit", you
> might also argue that "restarting a crashed job is not runit's job".

 No. runit's job is process management. runit is there to ensure that
the process tree you want is always there; and that includes restarting
crashed processes if needed. But runit's job is not making sure that
every process in the process tree is doing exactly what it's supposed
to do. Again, there's a difference between "process A is up", which runit
can and should control, and "process A is behaving as expected", which
can only be controlled by some A-specific watchdog.
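
 As a rough Python sketch of the two questions (all names here are
hypothetical, and probe stands in for whatever A-specific test a
watchdog would perform):

```python
import subprocess
import sys

def process_is_up(child):
    """What a supervisor can know: the process exists and has not exited."""
    return child.poll() is None

def behaves_as_expected(child, probe):
    """What only a service-specific watchdog can know, via its own probe."""
    return process_is_up(child) and probe()

def demo():
    # Stand-in for a hung service: alive, but doing no useful work.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        up = process_is_up(child)
        healthy = behaves_as_expected(child, probe=lambda: False)
    finally:
        child.kill()
        child.wait()
    return up, healthy

if __name__ == "__main__":
    print(demo())
```

 The first function is generic; the second cannot be written without
knowing what A is supposed to do.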

 And since this is Unix, two different things should be handled by two
different tools.

-- 
 Laurent