From: Jeff
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: interesting claims
Date: Thu, 16 May 2019 19:10:50 +0200
Message-ID: <1190281558026650@sas2-c434f6e124b6.qloud-c.yandex.net>
References: <11997211556565598@myt6-27270b78ac4f.qloud-c.yandex.net>
 <20190501033355.6e41e707@mydesk.domain.cxm>
 <20190515132206.03f9736e@mydesk.domain.cxm>
 <20190516012214.15ffcf2e@dickeberta>
 <20190515210717.27b002ba@mydesk.domain.cxm>
To: supervision@list.skarnet.org

16.05.2019, 10:31, "Laurent Bercot":
>> The Question: As a newbie outsider I wonder, after following the
>> discussion of supervision and tasks on stages (1,2,3), that there is a
>> restrictive linear progression that prevents reversal. In terms of pid1
>> that I may not totally understand, is there a way that an admin can
>> reduce the system back to pid1 and restart processes instead of taking
>> the system down and restarting? If a glitch is found, usually it is
>> corrected and we find it simple to just do a reboot. What if you can
>> fix the problem and do it on the fly. The question would be why (or why
>> not), and I am not sure I can answer it, but if you theoretically can do
>> so, then can you also kill pid2 while pid10 is still running. With my
>> limited vision I see stages as one-way check valves in a series of fluid
>> linear flow.

take a look at (the now defunct) depinit:

http://sf.net/p/depinit/
http://depinit.sf.net/

it is said to provide a very extensive rollback of dependencies
(so extensive that gettys will not work with it, according to the docs).

> Stage 1 isn't reversible; once it's done, you never touch it again,
> you don't need to "reverse" it. It would be akin to also unloading
> the kernel from memory before shutting down - it's just not necessary.

indeed. and when something fails in that first stage, a super-user
rescue shell should be started to fix it, instead of starting any
services that depend on the failed step.
(stupid example: sethostname failed for some reason, spawn a rescue
shell for the admin to do something about it ;-).

in such cases one has to consider whether the failure is important
enough to justify interrupting the boot phase. if not: start as many
other services as possible, output/log an error message, keep calm,
and carry on. things can be handled once a getty is up.
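
to make the policy above a bit more concrete, here is a minimal python
sketch. it is not taken from any existing init system: the function name
run_stage1, the (name, argv, critical) step format and the /bin/sh rescue
command are all assumptions made up for illustration.

```python
import subprocess
import sys


def run_stage1(steps, rescue_argv=("/bin/sh",), run=subprocess.call, log=sys.stderr):
    """Run stage 1 steps in order and return the names of the failed ones.

    steps: iterable of (name, argv, critical) tuples (hypothetical format).
    run:   callable that executes argv and returns its exit status; it is
           injectable so the policy can be exercised without real commands.
    """
    failed = []
    for name, argv, critical in steps:
        try:
            status = run(list(argv))
        except OSError:
            status = 127  # command missing or not executable
        if status == 0:
            continue
        failed.append(name)
        # Always report the failure, whatever happens next.
        print(f"stage 1: {name} failed (status {status})", file=log)
        if critical:
            # Important enough to interrupt the boot: hand the admin a shell
            # and continue with the next step once that shell exits.
            run(list(rescue_argv))
        # Non-critical failure: keep calm and carry on.
    return failed
```

a step such as ("sethostname", ["hostname", "myhost"], False) would only
be logged on failure, while the same step marked critical would drop into
the rescue shell first. whether the boot should continue after the rescue
shell exits is a design choice; this sketch simply carries on.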
> stage 4

i would prefer to call it "stage 3b", since a real stage 4 would be
started after stage 3a + b, i. e. process #1 execs into another
executable, which may be required in connection with an initramfs.
anopa provides such a stage 4 execline script.

> - If you want to kill every process but pid 1 and have the system
> reconstruct itself from there, then yes, it is possible, and that is
> the whole point of having a supervision tree rooted in pid 1. When
> you kill every process, the supervision tree respawns, so you always
> have a certain set of services running, and the system can always
> recover from whatever you throw at it. Try it: grab a machine with
> a supervision tree and a root shell, run "kill -9 -1", see what happens.

i wonder what happens if process #1 reacts to, say, SIGTERM by starting
the shutdown phase and rebooting afterwards. what if process #1 is
signaled "accidentally" by kill -TERM 1 (as we saw in preceding posts,
-1 will not reach it)? nothing is restarted and the system goes down
instead, since it is assumed that the signal was not sent "accidentally".

in the case of a process #1 that does not supervise anything (the
supervisor runs with PID > 1): when everything is killed "accidentally"
(via kill ( -1, SIGKILL ) for example), the system is bricked and the
reset button has to be used. only a privileged process can reach
everything with PID > 1 that way, so something seems to be wrong that
should be fixed ASAP.

in the case of a process #1 that respawns the supervisor: it restarts
everything, maybe the "accident" happens again, and so on ... this could
leave the system caught in an "endless" loop. maybe this, too, can only
be fixed by powering down ...

with a non-supervising process #1 it is the same, but worse: the reset
button has to be used, state is lost, filesystems are not unmounted
cleanly, and what not. but even with a supervising process #1 one can be
prevented from entering the shutdown phase cleanly.
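
the kill ( -1, SIGKILL ) cases above can be put into a toy model. this is
only a sketch under assumptions: the process table is a plain dict, the
function names are made up, and the only rule it encodes is the one
discussed in this thread, namely that such a signal does not reach
process #1. (on linux the sending process is spared as well, which the
model ignores.)

```python
def survivors_of_kill_all(table):
    """Model kill(-1, SIGKILL) sent by a privileged process.

    table maps PID -> process name. Process #1 is not reached by the
    signal; every other process in the table dies.
    """
    return {pid: name for pid, name in table.items() if pid == 1}


def system_recovers(table, respawner_pid):
    """True if the process responsible for respawning is still alive."""
    return respawner_pid in survivors_of_kill_all(table)


table = {1: "init", 2: "supervisor", 10: "getty"}

# process #1 supervises (or at least respawns the supervisor)
print(system_recovers(table, respawner_pid=1))  # True

# process #1 supervises nothing, only the supervisor with PID 2 respawns
print(system_recovers(table, respawner_pid=2))  # False
```

the model says nothing about the "endless" loop: it only shows why the
second setup needs the reset button while the first one comes back.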