From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.io/gmane.comp.sysutils.supervision.general/2496
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: "Laurent Bercot" <ska-supervision@skarnet.org>
Newsgroups: gmane.comp.sysutils.supervision.general
Subject: Re: s6 bites noob
Date: Sun, 03 Feb 2019 10:19:26 +0000
Message-ID: <emd0ae6a6a-ae00-47e8-8e20-253c9734342a@elzian>
References: <AQz0GiuJWvL9Jh5xW6l1akGGS63ktlq7lidemXydLTY@local>
 <em4e294a5f-49fc-4789-b813-54b60b6acd90@elzian>
 <bqbqlmzSNJzsTvSRV5wSp6oRvkh4CATkPg6ZWSuBcAL@local>
Reply-To: "Laurent Bercot" <ska-supervision@skarnet.org>
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="207599"; mail-complaints-to="usenet@blaine.gmane.org"
User-Agent: eM_Client/7.2.33939.0
To: "supervision@list.skarnet.org" <supervision@list.skarnet.org>
Original-X-From: supervision-return-2086-gcsg-supervision=m.gmane.org@list.skarnet.org Sun Feb 03 11:19:28 2019
Return-path: <supervision-return-2086-gcsg-supervision=m.gmane.org@list.skarnet.org>
Envelope-to: gcsg-supervision@m.gmane.org
Original-Received: from alyss.skarnet.org ([95.142.172.232])
	by blaine.gmane.org with smtp (Exim 4.89)
	(envelope-from <supervision-return-2086-gcsg-supervision=m.gmane.org@list.skarnet.org>)
	id 1gqEs3-000rvC-Vc
	for gcsg-supervision@m.gmane.org; Sun, 03 Feb 2019 11:19:28 +0100
Original-Received: (qmail 17968 invoked by uid 89); 3 Feb 2019 10:19:54 -0000
Mailing-List: contact supervision-help@list.skarnet.org; run by ezmlm
Original-Sender: <supervision@list.skarnet.org>
Precedence: bulk
List-Post: <mailto:supervision@list.skarnet.org>
List-Help: <mailto:supervision-help@list.skarnet.org>
List-Unsubscribe: <mailto:supervision-unsubscribe@list.skarnet.org>
List-Subscribe: <mailto:supervision-subscribe@list.skarnet.org>
Original-Received: (qmail 17961 invoked from network); 3 Feb 2019 10:19:54 -0000
In-Reply-To: <bqbqlmzSNJzsTvSRV5wSp6oRvkh4CATkPg6ZWSuBcAL@local>
Xref: news.gmane.org gmane.comp.sysutils.supervision.general:2496
Archived-At: <http://permalink.gmane.org/gmane.comp.sysutils.supervision.general/2496>

>s6-supervise aborts on startup if foo/supervise/control is already open,
>but perpetually retries if foo/run doesn't exist. Both of those problems
>indicate the user is doing something wrong. Wouldn't it make more sense
>for both problems to result in the same behavior (either retry or abort,
>preferably the latter)?

foo/supervise/control being already open indicates there's already a
s6-supervise process monitoring foo - in which case spawning another
one makes no sense, so s6-supervise aborts.

foo/run not existing is a temporary error condition that can happen
at any time, not only at the start of s6-supervise. This is a very
different case: the supervisor is already running and the user is
relying on its monitoring foo. At that point, the supervisor really
should not die, unless explicitly asked to; and "nonexistent foo/run"
is perfectly recoverable, you just have to warn the user and try
again later.

It's simply the difference between a fatal error and a recoverable
error. In most simple programs, all errors can be treated as fatal:
if you're not in the nominal case, just abort and let the user deal
with it. But in a supervisor, the difference is important, because
surviving all kinds of trouble is precisely what a supervisor is
there for.
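
To make the distinction concrete, here is a minimal sketch in Python.
It is not how s6-supervise is implemented (s6 is written in C); the
"lock" file name, the FatalError class and both function names are
made up for the illustration. A lock that is already held is treated
as fatal, while a ./run that cannot be started is treated as
recoverable.

```python
import fcntl
import os
import subprocess
import sys


class FatalError(Exception):
    """Raised when continuing makes no sense; the caller should exit."""


def take_lock(svc_dir):
    """Return a file descriptor holding an exclusive lock on svc_dir/lock.

    The file name "lock" is an assumption made for this sketch.
    """
    fd = os.open(os.path.join(svc_dir, "lock"), os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        # Someone else is already supervising: a second supervisor is pointless.
        raise FatalError(f"{svc_dir}: already supervised") from None
    return fd


def try_spawn(svc_dir):
    """Start svc_dir/run; return the Popen object, or None to retry later."""
    run = os.path.abspath(os.path.join(svc_dir, "run"))
    try:
        return subprocess.Popen([run], cwd=svc_dir)
    except OSError as exc:
        # Missing or non-executable run file: warn, stay alive, retry later.
        print(f"warning: cannot start {run}: {exc}", file=sys.stderr)
        return None
```

A caller would let FatalError end the process, and would call
try_spawn again after a delay whenever it returns None.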


>https://cr.yp.to/daemontools/supervise.html indicates the original
>version of supervise aborts in both cases.

That's what it suggests, but it is unclear ("may exit"). I have
forgotten what daemontools' supervise does when foo/run doesn't
exist, but I don't think it dies. I think it loops, just as
s6-supervise does. You should test it.


>  I also don't understand the reason for svscan and supervise being
>different. Supervise's job is to watch one daemon. Svscan's job is to
>watch a collection of supervise procs. Why not omit supervise, and have
>svscan directly watch the daemons? Surely this is a common question.

You said it yourself: supervise's job is to watch one daemon, and
svscan's job is to watch a collection of supervise processes. That is
not the same job at all. And if it's not the same job, a Unix guideline
says they should be different programs: one function = one tool. With
experience, I've found this guideline to be 100% justified, and
extremely useful.
Look at s6-svscan's and s6-supervise's source code. You will find
they share very few library functions - there's basically no code
duplication, no functionality duplication, between them.

Supervising several daemons from one unique process is obviously
possible. That's for instance what perpd, sysvinit and systemd do.
But if you look at perpd's source code (which is functionally and
stylistically the closest to svscan+supervise) you'll see that
it's almost as long as the source code of s6-svscan plus s6-supervise
combined, while not being a perfectly nonblocking state machine as
s6-supervise is.

Combining functionality into a single process adds complexity.
Putting separate functionality in separate processes reduces
complexity, because it takes advantage of the natural boundaries
provided by the OS. It allows you to do just as much with much less
code.


>I understand svscan must be as simple as possible, for reliability,
>because it must not die. But I don't see how combining it with supervise
>would really make it more complex. It already has supervise's
>functionality built in (watch a target proc, and restart it when it dies).

No, the functionality isn't the same at all, and "restart a process
when it dies" is an excessively simplified view of what s6-supervise
does. If that were all there was to it, a "while true ; do ./run ; done"
shell script would do the job; but if you've had to deal with that
approach once in a production environment, you intimately and
painfully know how terrible it is.
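
One weakness of that loop, given here only as an illustration and not
as a list of what s6-supervise handles, is that it respawns as fast as
it can when ./run exits immediately. Below is a minimal Python sketch
of a loop that enforces a minimum delay between starts. The name
restart_loop, its parameters and the max_starts bound are all
hypothetical, and the sketch still lacks most of what a real
supervisor has to provide.

```python
import subprocess
import sys
import time


def restart_loop(argv, min_interval=1.0, max_starts=None,
                 clock=time.monotonic, sleep=time.sleep):
    """Run argv again and again, at most one start per min_interval seconds.

    max_starts bounds the loop so the sketch can terminate; a real
    supervisor would loop until told to stop. Returns the exit codes seen
    (None where the program could not be started at all).
    """
    codes = []
    while max_starts is None or len(codes) < max_starts:
        started = clock()
        try:
            codes.append(subprocess.call(argv))
        except OSError as exc:
            # Could not even start the program: warn and retry after the delay.
            print(f"warning: {argv[0]}: {exc}", file=sys.stderr)
            codes.append(None)
        elapsed = clock() - started
        if elapsed < min_interval:
            # The program died quickly: wait out the rest of the interval.
            sleep(min_interval - elapsed)
    return codes
```

The clock and sleep parameters exist only so the delay can be observed
or replaced in a test; calling restart_loop(["./run"]) uses the real
ones and loops forever.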

s6-svscan knows how s6-supervise behaves, and can trust it and rely
on an interface between the two programs since they're part of the
same package. Spawning and watching a s6-supervise process is easy,
as easy as calling a function; s6-svscan's complexity comes from the
fact that it needs to manage a *collection* of s6-supervise
processes. (Actually, the brunt of its complexity comes from supporting
pipes between a service and a logger, but that's beside the point.)
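
As a rough illustration of the "collection" part, the bookkeeping can
be sketched as a function that compares the supervisors already known
with the directories currently present. This is a Python sketch under
stated assumptions, not s6-svscan's algorithm: list_services,
reconcile and the rule of skipping dot-names are hypothetical here.

```python
import os


def list_services(scandir):
    """Return the names of the subdirectories of scandir, sorted.

    Skipping names that start with a dot is an assumption of this sketch.
    """
    with os.scandir(scandir) as entries:
        return sorted(
            entry.name
            for entry in entries
            if entry.is_dir() and not entry.name.startswith(".")
        )


def reconcile(known, present):
    """Compare supervised names with the directories found on disk.

    Returns (to_start, vanished): names needing a new supervisor, and names
    whose directory is gone. What to do with vanished ones is policy and is
    deliberately left to the caller.
    """
    known, present = set(known), set(present)
    return sorted(present - known), sorted(known - present)
```

The scanner would then spawn one supervisor process per name in
to_start, which is the "as easy as calling a function" part.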

On the other hand, s6-supervise does not know how ./run behaves, can
make no assumption about it, cannot trust it, must babysit it no matter
how bad it gets, and must remain stable no matter how much shit ./run
throws at it. This is a totally different job - and a much harder job
than watching a thousand nice, friendly s6-supervise processes.
Part of the proof is that s6-supervise's source code is bigger than
s6-svscan's.

By all means, if you want a single supervisor for all your services,
try perp. It may suit you. But I don't think having fewer processes
in your "ps" output is a worthwhile goal: it's purely cosmetic, and
you have to balance that against the real benefits that separating
processes provides.

--
Laurent