From: Laurent Bercot
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: hello - hanging services
Date: Thu, 19 Aug 2010 07:46:35 +0200
Message-ID: <20100819054635.GA14146@skarnet.org>
In-Reply-To: <20100818180205.5be254c7.jean.bruenn@ip-minds.de>
To: supervision@list.skarnet.org

> I understand your thoughts about this, and yes i have thought about
> this, too.
> But let's make it clear: This can happen with runit as it
> is now, also: a weird written run-script or a broken log-script might
> compromise the existing functionality of runit (if it doesnt, adding a
> new variant like a hangcheck-script wouldn't do so neither). I mean:
> what happens currently if one of the services which you're trying to
> start hangs? I havent tried yet, so i guess only the service which
> you're trying to start would be compromised - not whole runit. And this
> wouldn't be the case with my suggestion neither.

The problem is that your suggestion affects the reliability of the
service you want to check.

If ./run hangs, well, the service hangs. ./run IS the service: of course
you need to write the script properly if you want the service to function
properly. There's nothing we can do about that. Same for the logger,
which is also a service (albeit a special one).

If ./hangcheck hangs, then what should be the default policy? To be
congruent with a watchdog's purpose, you should restart the service. But
then, you might have a buggy ./hangcheck script and a perfectly
functional service, and restarting it for no good reason is a decrease in
service availability and reliability (and a waste of resources).

By adding ./hangcheck support, you are adding a dependency, and making
the service architecture more fragile. That's what Charlie meant (I
think). Small is beautiful for a reason: small has fewer hidden costs.
Every time you want to add a feature, look for the hidden costs. Sometimes
the feature is worth paying them. Most of the time it's not.

> Probably you're right, though i don't exactly understand your
> argumentation because: Runit is starting crashed processes (this
> shouldn't be the job of an Init-System - the job of an init-system is
> starting processes, not making sure that they're up and running -
> thats the job of a software-watchdog).

Please read the list archives; this has been discussed at length.
What it comes down to is the duties of process 1, and process 1 *has*
to restart processes (at least one), in order to keep the system in a
usable state no matter what happens, no matter what dies. A supervision
architecture such as runit is a natural consequence of properly
implementing process 1's duties.

You are mixing two different notions of 'up and running'.

What runit does (and what any init system *should* do) is make sure that
the *process* corresponding to a given service has been properly forked
and exec'ed. As long as the process is there, runit is happy. It's an
external process management tool.

What a software watchdog does is make sure that said process actually
does what is expected of it, as opposed to, e.g., hanging or busy-looping.
This is more complex, because it requires knowledge of what the service
is supposed to do, and it generally can't be done without access to the
service's source code.

> BUT: runit is doing exactly
> this. Runit is taking care that your service is up and running, by
> restarting it if its crashing - By argueing that "checking whether a
> service is responding (and thus working) is not the job of runit", you
> might also argue that "restarting a crashed job is not runit's job".

No. runit's job is process management. runit is there to ensure that the
process tree you want is always there; and that includes restarting
crashed processes if needed. But runit's job is not making sure that
every process in the process tree is doing exactly what it's supposed to
do.

Again, there's a difference between "process A is up", which runit can
and should control, and "process A is behaving as expected", which can
only be controlled by some A-specific watchdog. And since this is Unix,
two different things should be handled by two different tools.

--
Laurent
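
To make the distinction between "process A is up" and "process A is
behaving as expected" concrete, here is a minimal sketch in Python. It is
not how runit is implemented; the function names, the max_starts
parameter, and the three result strings are assumptions made up for this
illustration. The first function only manages a process; the second is a
separate, service-specific check whose own failure mode (the check
hanging) is exactly the ambiguous case discussed above.

```python
import subprocess
import sys


def supervise(cmd, max_starts):
    """Process management only: start cmd and restart it whenever it exits.

    Returns how many times the process was started. A real supervisor
    would loop forever; max_starts (hypothetical) only bounds the sketch.
    """
    starts = 0
    while starts < max_starts:
        proc = subprocess.Popen(cmd)
        starts += 1
        # wait() returns when the process is gone. While the process
        # exists, the supervisor is satisfied, even if it is hung.
        proc.wait()
    return starts


def run_check(check_cmd, timeout):
    """Watchdog side: run a service-specific check with a time limit.

    Returns "ok", "failed", or "check-hung" (hypothetical labels).
    "check-hung" is ambiguous: the check may be buggy while the
    service itself is perfectly functional.
    """
    try:
        result = subprocess.run(check_cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "check-hung"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    py = sys.executable
    # A toy "service" that exits at once, so it gets restarted each time.
    print(supervise([py, "-c", "raise SystemExit(1)"], 3))
    # A toy check that never answers within the time limit.
    print(run_check([py, "-c", "import time; time.sleep(30)"], 0.5))
```

Keeping the two functions apart mirrors the argument: whatever policy is
attached to "check-hung" belongs to the watchdog tool, and a mistake
there cannot affect the restart loop.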