From: Laurent Bercot
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: pidsig 0.11 - a fghack like de-daemonisation tool
Date: Thu, 3 Jun 2010 21:25:30 +0200
To: supervision@list.skarnet.org

> These kinds of problems are not that theoretical - just recently I saw
> svscan/svscanboot crashing on a >1y uptime box, taking many of the
> processes with it, including most of the supervise infrastructure,
> very likely not due to any fault in them - could be oom gone wild,
> cosmic rays hitting svscan memory, whatever).
That's a typical case of "weak" supervision, as opposed to a "strong"
supervision chain. "Strong" supervision makes sure that all the
infrastructure is connected to init.

* svscan achieves strong supervision *if* svscanboot is flagged as
"respawn" in /etc/inittab on System V-style inits, in /etc/event.d/
with Upstart, or in /etc/gettys on BSD. It does *not* achieve it if
svscanboot is started via some rc.local script (as the stock
daemontools instructions tell you to do, shame on DJB! :))
* perp is in the same boat, depending on how you start perpboot.
* Paul Jarc has instructions on how to directly run svscan as
process 1.
* runit achieves strong supervision if you're using runit-init.
* s4, my own supervision suite (to be released next summer), can be
run from a respawner, but was also designed so s4-svscan can run as
process 1.

Strong supervision makes sure that your supervisor process tree is
*always* alive and complete, unless process 1 itself crashes, in
which case you're doomed to reboot anyway.

> The host stayed up and admin access was possible (thanks mainly to
> sshd being a long running daemon), but the state the processes were
> left in was nothing to be glad about.

This will, unfortunately, be the case even with a strong supervision
chain: if a branch of the supervision tree is broken, the whole
subtree is reconstructed from the breaking point, but the old subtree
might still be alive and locking resources, preventing the new
subtree from being fully functional (and filling your logs with
warning messages). An intervention from the administrator is always
necessary. The advantage is that all the admin has to do is kill the
old subtree (including services) and everything will be working
perfectly again. I'm not sure it's possible to design a supervision
suite that addresses this problem cleanly without endangering service
reliability.
> Another question would be if there are more ways to reliably connect
> to any given process detecting it being gone - but all the current
> daemons that I run can be handled now :)

Unfortunately, no; not without support from the process you want to
monitor. There are only two ways of being notified of a process'
death:

- getting a SIGCHLD if you're the process' parent. That's what a
supervisor uses (supervise, runsv, perpetrate, s4-supervise all work
on this model).
- getting an EOF on a pipe or socket you're listening to, when the
monitored process is the only writer on the other side. That's what
fghack uses (and pidsig too, I presume).

The EOF method does not require the monitorer to be the monitoree's
parent, so it's more flexible; but it does require the monitoree not
to go out of its way to arbitrarily close fds. If some daemon forks
itself and resists fghack and pidsig, you're out of luck: it
definitely won't be supervised.

If you have the process' pid, you can poll the process table, but
polling - as opposed to notification - is evil. Also, a fundamental
problem is that pids do NOT uniquely identify a process (which is the
main flaw with .pid files), unless you're blessed with an OS where
pids are 64 bits long.

If you're willing to be non-portable, Linux should allow you to set
up an inotify fd listening for /proc/$pid's disappearance (if the
/proc filesystem supports inotify, but there's no reason why it
shouldn't). BSD might have a similar mechanism with kevent/kqueue.
And all of this is definitely too much work for a lousy daemon.

-- 
 Laurent