From: Laurent Bercot
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: pidsig 0.11 - a fghack like de-daemonisation tool
Date: Thu, 3 Jun 2010 21:25:30 +0200
To: supervision@list.skarnet.org

> These kinds of problems are not that theoretical - just recently I saw
> svscan/svscanboot crashing on a >1y uptime box, taking many of the
> processes with it, including most of the supervise infrastructure,
> very likely not due to any fault in them - could be oom gone wild,
> cosmic rays hitting svscan memory, whatever).
That's a typical case of "weak" supervision, as opposed to a "strong"
supervision chain. "Strong" supervision makes sure that all the
infrastructure is connected to init.

* svscan achieves strong supervision *if* svscanboot is flagged as
"respawn" in /etc/inittab on System V-style inits, in /etc/event.d/
with Upstart, or in /etc/gettys on BSD. It does *not* achieve it if
svscanboot is started via some rc.local script (as the stock
daemontools instructions tell you to do, shame on DJB! :))
* perp is in the same boat, depending on how you start perpboot.
* Paul Jarc has instructions on how to directly run svscan as
process 1.
* runit achieves strong supervision if you're using runit-init.
* s4, my own supervision suite (to be released next summer), can be
run from a respawner, but was also designed so s4-svscan can run as
process 1.

Strong supervision makes sure that your supervisor process tree is
*always* alive and complete, unless process 1 itself crashes, in
which case you're doomed to reboot anyway.

> The host stayed up and admin access was possible (thanks mainly to
> sshd being a long running daemon), but the state the processes were
> left in was nothing to be glad about.

This will, unfortunately, be the case even with a strong supervision
chain: if a branch of the supervision tree is broken, the whole
subtree is reconstructed from the breaking point, but the old subtree
might still be alive and locking resources, preventing the new
subtree from being fully functional (and filling your logs with
warning messages). An intervention from the administrator is always
necessary. The advantage is that all the admin has to do is kill the
old subtree (including services) and everything will be working
perfectly again. I'm not sure it's possible to design a supervision
suite that addresses this problem cleanly without endangering service
reliability.
> Another question would be if there are more ways to reliably connect
> to any given process detecting it being gone - but all the current
> daemons that I run can be handled now :)

Unfortunately, no; not without support from the process you want to
monitor. There are only two ways of being notified of a process'
death:

- getting a SIGCHLD if you're the process' parent. That's what a
supervisor uses (supervise, runsv, perpetrate, s4-supervise all work
on this model).
- getting an EOF on a pipe or socket you're listening to, when the
monitored process is the only writer on the other side. That's what
fghack uses (and pidsig too, I presume).

The EOF method does not require the monitorer to be the monitoree's
parent, so it's more flexible; but it does require the monitoree not
to go out of its way to arbitrarily close fds. If some daemon forks
itself and resists fghack and pidsig, you're out of luck: it
definitely won't be supervised.

If you have the process' pid, you can poll the process table, but
polling - as opposed to notification - is evil. Also, a fundamental
problem is that pids do NOT uniquely identify a process (which is the
main flaw with .pid files), unless you're blessed with an OS where
pids are 64 bits long.

If you're willing to be non-portable, Linux should allow you to set
up an inotify fd listening for /proc/$pid's disappearance (if the
/proc filesystem supports inotify, but there's no reason why it
shouldn't). BSD might have a similar mechanism with kevent/kqueue.
And all of this is definitely too much work for a lousy daemon.

-- 
 Laurent