From: Laurent Bercot
To: supervision@list.skarnet.org
Subject: Re: [announce] perp-2.03: persistent process supervision
Date: Mon, 14 Mar 2011 17:47:41 +0100
Message-ID: <20110314164741.GA7248@skarnet.org>
In-Reply-To: <20110314153425.34ed16dc@b0llix.net>

> First, perpd(8) will not die (TM).

Of course it will not - not in normal circumstances. Neither will svscan, or runsvdir, or s6-svscan.
I trust your programming ability in that matter as much as mine - this is not a concern at all.

The concern is that you don't always have a say. There's this playful thing called the Linux OOM killer. I hear the heuristics have been fixed in recent kernel releases, but for a long time, the OOM killer had the amusing habit of shooting processes at random, and very much failing to locate the process that is actually responsible for the memory shortage. There are still a whole lot of broken OOM killers out there.

Of course, this is not a normal condition, and under careful administration it never happens. But the point is, when you are designing a supervision tool, you should assume that you can get a random SIGKILL (Headshot. Do not pass Go. Do not call your cleanup routines.) at any time. Because if a supervision tool can't recover from an OOM event and keep vital services running until the sysadmin finishes his coffee and can manually repair things, then what is it good for?

That is why I asked my question. In other supervision schemes, tasks are decentralized, so if one process randomly dies, it generally does not have much impact on the rest of the system. (If runsvdir dies, it's annoying, but things keep working until the admin can come clean things up.) perpd, however, looks like a neural hub, centralizing a lot of info into its memory. IOW, a SPOF, and you can be sure that the next broken system tool will love to play Doom with it.

Is your supervision chain SIGKILL-resistant?

-- 
Laurent
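
To make the "do not call your cleanup routines" point concrete, here is a minimal Python sketch. It is not code from perp, runit, daemontools or s6; the names CHILD, spawn and demo are made up for illustration, and it assumes a POSIX system. A child registers an exit hook, gets a SIGKILL, and the hook never runs; all the parent can do is observe the death and start a fresh instance.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical child: registers a cleanup hook, reports readiness, then idles.
CHILD = r"""
import atexit, sys, time
marker = sys.argv[1]
atexit.register(lambda: open(marker, "w").close())  # cleanup hook
print("ready", flush=True)
time.sleep(60)
"""


def spawn(marker):
    """Start one instance of the child with its stdout connected to a pipe."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD, marker],
        stdout=subprocess.PIPE,
        text=True,
    )


def demo():
    """SIGKILL a child, check whether its cleanup ran, then restart it."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "cleanup-ran")

        first = spawn(marker)
        first.stdout.readline()            # wait until the hook is registered
        first.send_signal(signal.SIGKILL)  # cannot be caught or ignored
        status = first.wait()              # negative signal number on POSIX
        first.stdout.close()
        cleanup_ran = os.path.exists(marker)

        # All the parent has left is the exit status; it restarts from scratch.
        second = spawn(marker)
        line = second.stdout.readline().strip()
        second.kill()
        second.wait()
        second.stdout.close()

        return status, cleanup_ran, line


if __name__ == "__main__":
    print(demo())
```

The same reasoning applies one level up: if the process being shot is the supervisor itself, anything it kept only in memory is gone, so recovery has to rely on state that survives outside that process.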