From: Nicolás de la Torre
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: hello - hanging services
Date: Fri, 20 Aug 2010 09:24:01 -0300
To: supervision@list.skarnet.org
In-Reply-To: <20100819054635.GA14146@skarnet.org>
References: <20100817190803.41e8257f.jean.bruenn@ip-minds.de>
 <20100817192422.a157e85f.jean.bruenn@ip-minds.de>
 <20100818105735.GA13364@skarnet.org>
 <20100818170635.a5a24d3f.jean.bruenn@ip-minds.de>
 <20100818180205.5be254c7.jean.bruenn@ip-minds.de>
 <20100819054635.GA14146@skarnet.org>

I must be on this list by mistake, please unsubscribe.

2010/8/19 Laurent Bercot
>
> > I understand your thoughts about this, and yes, I have thought about
> > this, too. But let's make it clear: this can happen with runit as it
> > is now, also: a weirdly written run-script or a broken log-script might
> > compromise the existing functionality of runit (if it doesn't, adding a
> > new variant like a hangcheck-script wouldn't do so either). I mean:
> > what happens currently if one of the services which you're trying to
> > start hangs? I haven't tried yet, so I guess only the service which
> > you're trying to start would be compromised, not the whole of runit.
> > And this wouldn't be the case with my suggestion either.
>
>  The problem is that your suggestion affects the reliability of the
> service you want to check.
>  If ./run hangs, well, the service hangs. ./run IS the service: of
> course you need to write the script properly if you want the service
> to function properly. There's nothing we can do about that. Same for
> the logger, which is also a service (albeit a special one).
>  If ./hangcheck hangs, then what should be the default policy? To be
> congruent with a watchdog's purpose, you should restart the service.
> But then, you might have a buggy ./hangcheck script and a perfectly
> functional service, and restarting it for no good reason is a decrease
> in service availability and reliability (and a waste of resources).
> By adding ./hangcheck support, you are adding a dependency, and making
> the service architecture more fragile. That's what Charlie meant
> (I think).
>
>  Small is beautiful for a reason: small has fewer hidden costs.
> Every time you want to add a feature, look for the hidden costs.
> Sometimes the feature is worth paying them. Most of the time it's not.
>
>
> > Probably you're right, though I don't exactly understand your
> > argumentation because: runit is restarting crashed processes (this
> > shouldn't be the job of an init system - the job of an init system is
> > starting processes, not making sure that they're up and running -
> > that's the job of a software watchdog).
>
>  Please read the list archives; this has been discussed at length.
> What it comes down to is the duties of process 1, and process 1 *has*
> to restart processes (at least one), in order to keep the system in
> a usable state no matter what happens, no matter what dies. A supervision
> architecture such as runit is a natural consequence of properly
> implementing process 1's duties.
>
>  You are mixing two different notions of 'up and running'.
>  What runit does (and what any init system *should* do) is make sure
> that the *process* corresponding to a given service has been properly
> forked and exec'ed. As long as the process is there, runit is happy.
> It's an external process management tool.
>  What a software watchdog does is make sure that said process actually
> does what is expected of it, as opposed to, e.g., hanging or busy-looping.
> This is more complex, because it requires knowledge of what the service is
> supposed to do, and it generally can't be done without access to the
> service's source code.
>
>
> > BUT: runit is doing exactly this.
> > Runit is taking care that your service is up and running, by
> > restarting it if it's crashing. By arguing that "checking whether a
> > service is responding (and thus working) is not the job of runit", you
> > might also argue that "restarting a crashed job is not runit's job".
>
>  No. runit's job is process management. runit is there to ensure that
> the process tree you want is always there; and that includes restarting
> crashed processes if needed. But runit's job is not making sure that
> every process in the process tree is doing exactly what it's supposed
> to do. Again, there's a difference between "process A is up", which runit
> can and should control, and "process A is behaving as expected", which
> can only be controlled by some A-specific watchdog.
>
>  And since this is Unix, two different things should be handled by two
> different tools.
>
> --
>  Laurent
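
For readers of the archive, the distinction drawn above between "process A is up" and "process A is behaving as expected" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how runit is implemented: the names supervise and probe_says_healthy, the max_starts bound, and the example commands are all hypothetical. The first function restarts a child whenever it exits and knows nothing else about it; the second runs a separate probe command with a timeout, and shows why a hung probe forces a policy decision.

```python
import subprocess
import sys


def supervise(argv, max_starts):
    """Start argv, wait for it to exit, then start it again.

    Returns the list of exit codes. A real supervisor would loop forever;
    max_starts is an assumption that only exists so this sketch terminates.
    """
    exit_codes = []
    for _ in range(max_starts):
        child = subprocess.Popen(argv)   # fork + exec the service
        exit_codes.append(child.wait())  # block until the process dies
        # All the supervisor learns is that the process is gone. A child
        # that hangs or busy-loops never returns from wait(), so this loop
        # would never notice it.
    return exit_codes


def probe_says_healthy(probe_argv, timeout_s):
    """Run a hypothetical service-specific probe; True only if it exits 0 in time."""
    try:
        return subprocess.run(probe_argv, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        # The probe itself hung. Reporting "unhealthy" here is a policy
        # choice: a buggy probe is indistinguishable from a hung service.
        return False


if __name__ == "__main__":
    # Hypothetical service that crashes immediately with status 3.
    crashing_service = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(crashing_service, max_starts=3))

    # Hypothetical probe that hangs although nothing is wrong with the service.
    hung_probe = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(probe_says_healthy(hung_probe, timeout_s=0.2))
```

The timeout branch in probe_says_healthy is where the hidden cost shows up: whatever value it returns, the health check has become one more component that can fail independently of the service it watches.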