From: "Laurent Bercot"
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: s6 bites noob
Date: Sun, 03 Feb 2019 10:19:26 +0000
To: "supervision@list.skarnet.org"

> s6-supervise aborts on startup
> if foo/supervise/control is already open, but perpetually retries if foo/run doesn't exist. Both of those problems indicate the user is doing something wrong. Wouldn't it make more sense for both problems to result in the same behavior (either retry or abort, preferably the latter)?

foo/supervise/control being already open indicates there's already a s6-supervise process monitoring foo - in which case spawning another one makes no sense, so s6-supervise aborts.

foo/run not existing is a temporary error condition that can happen at any time, not only at the start of s6-supervise. This is a very different case: the supervisor is already running and the user is relying on its monitoring foo. At that point, the supervisor really should not die, unless explicitly asked to; and "nonexistent foo/run" is perfectly recoverable: you just have to warn the user and try again later.

It's simply the difference between a fatal error and a recoverable error. In most simple programs, all errors can be treated as fatal: if you're not in the nominal case, just abort and let the user deal with it. But in a supervisor, the difference is important, because surviving all kinds of trouble is precisely what a supervisor is there for.

> https://cr.yp.to/daemontools/supervise.html indicates the original version of supervise aborts in both cases.

That's what it suggests, but it is unclear ("may exit"). I have forgotten what daemontools' supervise does when foo/run doesn't exist, but I don't think it dies. I think it loops, just as s6-supervise does. You should test it.

> I also don't understand the reason for svscan and supervise being different. Supervise's job is to watch one daemon. Svscan's job is to watch a collection of supervise procs. Why not omit supervise, and have svscan directly watch the daemons? Surely this is a common question.

You said it yourself: supervise's job is to watch one daemon, and svscan's job is to watch a collection of supervise processes.
That is not the same job at all. And if it's not the same job, a Unix guideline says they should be different programs: one function = one tool. With experience, I've found this guideline to be 100% justified, and extremely useful.

Look at s6-svscan's and s6-supervise's source code. You will find they share very few library functions - there's basically no code duplication, no functionality duplication, between them.

Supervising several daemons from one unique process is obviously possible. That's for instance what perpd, sysvinit and systemd do. But if you look at perpd's source code (which is functionally and stylistically the closest to svscan+supervise) you'll see that it's almost as long as the source code of s6-svscan plus s6-supervise combined, while not being a perfectly nonblocking state machine as s6-supervise is.

Combining functionality into a single process adds complexity. Putting separate functionality in separate processes reduces complexity, because it takes advantage of the natural boundaries provided by the OS. It allows you to do just as much with much less code.

> I understand svscan must be as simple as possible, for reliability, because it must not die. But I don't see how combining it with supervise would really make it more complex. It already has supervise's functionality built in (watch a target proc, and restart it when it dies).

No, the functionality isn't the same at all, and "restart a process when it dies" is an excessively simplified view of what s6-supervise does. If that was all there is to it, a "while true ; do ./run ; done" shell script would do the job; but if you've had to deal with that approach once in a production environment, you intimately and painfully know how terrible it is.

s6-svscan knows how s6-supervise behaves, and can trust it and rely on an interface between the two programs since they're part of the same package.
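The fatal/recoverable distinction made earlier, and the gap between a supervisor and a bare "while true" loop, can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not s6-supervise's actual code (which is written in C): the supervise/lock file, the AlreadySupervised exception, the max_cycles bound and the returned events list are all hypothetical names invented for the example.

```python
import fcntl
import os
import subprocess
import time


class AlreadySupervised(Exception):
    """Fatal: another supervisor already holds this service directory."""


def supervise(service_dir, max_cycles, retry_delay=1.0):
    """Run service_dir/run repeatedly and return a list of events.

    max_cycles bounds the loop so that this sketch terminates; a real
    supervisor would keep going until explicitly asked to exit.
    """
    # Hypothetical lock file, standing in for the "control is already
    # open" check described in the text.
    os.makedirs(os.path.join(service_dir, "supervise"), exist_ok=True)
    lock = open(os.path.join(service_dir, "supervise", "lock"), "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Fatal error: a second supervisor makes no sense, so abort.
        lock.close()
        raise AlreadySupervised(service_dir)

    events = []
    run = os.path.join(service_dir, "run")
    try:
        for _ in range(max_cycles):
            try:
                status = subprocess.call([run])
                events.append(("exited", status))
            except OSError as e:
                # Recoverable error: run is missing or not executable
                # right now. Record a warning, wait, and try again.
                events.append(("cannot-run", e.errno))
                time.sleep(retry_delay)
    finally:
        lock.close()  # closing the file releases the flock
    return events
```

With this sketch, a lock that is already held raises immediately (the fatal case), while a missing or non-executable run file only produces a "cannot-run" event and another attempt after retry_delay (the recoverable case).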
Spawning and watching a s6-supervise process is easy, as easy as calling a function; s6-svscan's complexity comes from the fact that it needs to manage a *collection* of s6-supervise processes. (Actually, the brunt of its complexity comes from supporting pipes between a service and a logger, but that's beside the point.)

On the other hand, s6-supervise does not know how ./run behaves, can make no assumption about it, cannot trust it, must babysit it no matter how bad it gets, and must remain stable no matter how much shit it throws at it. This is a totally different job - and a much harder job than watching a thousand nice, friendly s6-supervise programs. Part of the proof is that s6-supervise's source code is bigger than s6-svscan's.

By all means, if you want a single supervisor for all your services, try perp. It may suit you. But I don't think having fewer processes in your "ps" output is a worthwhile goal: it's purely cosmetic, and you have to balance that against the real benefits that separating processes provides.

--
Laurent