From: Jeff
To: supervision
Subject: supervising the supervisor
Date: Fri, 03 May 2019 06:13:00 +0200
Message-ID: <6084291556856780@sas2-a1efad875d04.qloud-c.yandex.net>

from the last replies we have the following possibilities regarding
process #1's supervision capabilities:

- no supervision/respawning at all (maybe not even handling system
  shutdown): this simplifies the process #1 implementation (especially
  in the latter case). supervision can be delegated to a subprocess,
  which also simplifies that supervisor's implementation, since it does
  not need to handle process #1 specific duties (assuming there are any
  beyond reaping zombies as the default subreaper and successfully
  starting at least one necessary child process, to which the remaining
  duties are delegated).
  disadvantage: "incorrect" behaviour when all other processes die,
  which leads to a bricked system, deep shit ahead.

- respawning given services/daemons (at least one), possibly even with
  log output redirection to logger processes (s6-svscan et al).

- a compromise between the above 2 solutions: process #1 supervises
  (i.e. respawns, possibly only under certain conditions) at most
  2 subprocesses: a real "supervisor" and maybe a separate, dedicated
  logger to which the supervisor's output is redirected via pipe(2).
  in that case those child processes should only be respawned under
  certain conditions (respawn throttling maybe, i.e. stop respawning if
  one of the 2 fails repeatedly within a certain amount of time). if
  those conditions are not met, process #1 should start a single user
  rescue shell (possibly via sulogin) and/or reboot.
  only if the logger child process fails repeatedly: do not redirect
  the supervisor's output; give the supervisor child process our own
  output fds (possibly opened by the kernel, probably the console
  device) instead of the pipe fds.
  it could also be a good idea to close all of process #1's stdio fds
  and only open the console device for output when the need arises.
  this has the advantage that we do not keep this device open all the
  time (in case /dev needs to get re/unmounted).

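the respawn throttling in the compromise variant could look roughly like the sketch below. it is only an illustration in python (a real process #1 would more likely be written in C), and every name in it (RespawnThrottle, next_action, the action strings, the default limits) is made up for this example, not taken from any existing init or supervision suite. the assumption is a sliding time window: a child may die at most max_deaths times within window seconds before process #1 gives up respawning it.

```python
import time
from collections import deque


class RespawnThrottle:
    """Hypothetical helper: tolerate at most max_deaths child deaths
    within a sliding window of `window` seconds."""

    def __init__(self, max_deaths=5, window=10.0):
        self.max_deaths = max_deaths
        self.window = window
        self.deaths = deque()  # timestamps of recent deaths, oldest first

    def record_death(self, now=None):
        """Record one death; return True if respawning is still allowed."""
        now = time.monotonic() if now is None else now
        self.deaths.append(now)
        # forget deaths that have fallen out of the window
        while self.deaths and now - self.deaths[0] > self.window:
            self.deaths.popleft()
        return len(self.deaths) <= self.max_deaths


def next_action(role, throttle, now=None):
    """Decide what process #1 does after the child with the given role
    ("supervisor" or "logger") has died."""
    if throttle.record_death(now):
        return "respawn"
    if role == "logger":
        # logger keeps failing: drop the pipe and let the supervisor
        # write to the console instead
        return "run_supervisor_on_console"
    # supervisor keeps failing: rescue shell (e.g. sulogin) and/or reboot
    return "rescue_or_reboot"
```

each child would get its own throttle, so a flapping logger only costs us the log redirection while a flapping supervisor triggers the rescue shell and/or reboot.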
again (as we are at it ;-): in the last case, when said "supervisor" is
s6-svscan (or perpd for that matter), it would be helpful for the
process #1 implementor (me) if it could manage its own output logger via
a command line option (akin to daemontools-encore's "svscan"). that
would save him from opening the pipe, comparing terminated child PIDs
with an additional (the logger's) PID, and handling additional emergency
situations caused by the logger's failure himself (especially since
s6-svscan already does a lot of additional stuff like catching signals
and running the corresponding scripts anyway). :PP
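
to make the extra bookkeeping concrete, here is a rough sketch of what process #1 has to do itself when the supervisor cannot manage its own logger: open the pipe, start both children on it, and tell their PIDs apart when one of them terminates. again this is python purely for brevity, and spawn, start_pair and classify are hypothetical names for this example; the command lines passed in are placeholders, not real s6 or perp invocations.

```python
import os


def spawn(argv, stdin_fd=None, stdout_fd=None):
    """Fork and exec argv, optionally wiring stdin or stdout/stderr
    to the given fds. Returns the child's PID."""
    pid = os.fork()
    if pid == 0:
        try:
            if stdin_fd is not None:
                os.dup2(stdin_fd, 0)
            if stdout_fd is not None:
                os.dup2(stdout_fd, 1)
                os.dup2(stdout_fd, 2)
            # fds from os.pipe() are close-on-exec by default, so the
            # original pipe fds vanish in the child; only 0/1/2 remain
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if setup or exec failed
    return pid


def start_pair(supervisor_argv, logger_argv):
    """Start the logger and the supervisor, connected by one pipe.
    Process #1 keeps both pipe ends open so that a respawned child
    can be attached to the same pipe again."""
    r, w = os.pipe()
    logger_pid = spawn(logger_argv, stdin_fd=r)
    supervisor_pid = spawn(supervisor_argv, stdout_fd=w)
    return supervisor_pid, logger_pid, r, w


def classify(pid, supervisor_pid, logger_pid):
    """Tell which child a reaped PID belonged to. As the default
    reaper, process #1 also collects unrelated orphans."""
    if pid == supervisor_pid:
        return "supervisor"
    if pid == logger_pid:
        return "logger"
    return "orphan"
```

the main loop of process #1 would then wait for any child, pass the reaped PID through classify(), and respawn or fall back depending on the result. with a supervisor that starts its own logger, start_pair() shrinks to a single spawn() and the "logger" branch disappears.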