Process supervision is something I'm very opinionated about. In a number of high-availability production environments, it's a necessary evil.

However, it should *never* be an out-of-the-box default for any network-exposed service. Service failures should be extraordinary events, and we should strive to keep treating them as such, so that we continue to pursue stability. Restarting a service automatically doesn't improve the stability of that software; it works around an instability rather than addressing the root cause - it's a band-aid over a festering wound.

The failure of a service is analogous in my eyes to the tripping of a circuit breaker - it happened for a reason, and that underlying reason is probably serious. Circuit breakers in houses generally don't reset themselves, and neither should network-facing services.

The biggest concern in any service failure is that it was caused by an exploit attempt - attacks that exploit bad memory management tend to crash whatever they are targeting, even on a failed attempt. In an environment where such an event has been reduced to routine and automatic restarts are the norm, the attacker gets as many attempts as they need, and one of the first signs of an intrusion is reduced to barely a blip on the radar, if the systems are even being monitored at all.


The second reason is that it will reduce the number of high-quality bug reports developers receive - if failure is part of the routine, it tends not to get investigated very thoroughly, if at all.

A third reason is convention and expectation. We've lived without process supervision in the *nix world for almost four decades now, and admins with decades of experience generally expect to be able to kill off a process and have it stay down.

Please consider these factors in any implementation of process supervision - while it's certainly a needed improvement for many organizations, it's not something that should just be on by default.
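
To make the "off by default" point concrete, here is a rough sketch of what I have in mind. It is not taken from any existing supervisor; the names supervise, describe_exit, restart and max_restarts are all made up for illustration. Restarts only happen if the admin explicitly asks for them, they are capped, and every abnormal exit is reported along with the signal that caused it (if any), so a segfault doesn't vanish into the routine.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode


def supervise(argv, restart=False, max_restarts=0):
    """Run argv; restart only if the operator explicitly opted in.

    Hypothetical policy for illustration: a clean exit (status 0) is
    never restarted, and a failing service is restarted at most
    max_restarts times. Returns the return codes seen, one per run.
    """
    codes = []
    while True:
        code = subprocess.call(argv)
        codes.append(code)
        if code == 0:
            break
        # Always report the failure, whether or not we restart.
        print("service %r %s" % (argv[0], describe_exit(code)), file=sys.stderr)
        if not restart or len(codes) > max_restarts:
            break
    return codes
```

With the defaults, supervise(["/usr/sbin/somedaemon"]) (a placeholder path) runs the service once and leaves it down after a failure. An admin who really needs the band-aid can pass restart=True with a small max_restarts, which at least bounds how many attempts an attacker gets for free.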





On Wed, May 4, 2016 at 12:35 PM Steve Litt <slitt@troubleshooters.com> wrote:
On Tue, 3 May 2016 22:41:48 -1000
Joel Roth <joelz@pobox.com> wrote:

> We're not the first people to think about supporting
> alternative init systems. There are collections of the
> init scripts already available.
>
> https://bitbucket.org/avery_payne/supervision-scripts
> https://github.com/tokiclover/supervision

Those can serve as references and starting points, but I don't think
either one is complete, and in Avery's case, that can mean you don't
know whether a given daemon's run script and environment was complete
or not. In tokiclover's case, that github page implies that the only run
scripts he had were for the gettys, and that's pretty straightforward
(and well known) anyway.

As I remember, before he had to put it aside for a while, Avery was
working on new ways of testing whether needed daemons (like the
network) were really functional. That would have been huge.

Another source of daemon startup scripts is here:

https://universe2.us/collector/epoch.conf

SteveT

Steve Litt