supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit
* Respawn limit for runsv?
@ 2005-02-11 12:56 Lars Kellogg-Stedman
  2005-02-12  0:14 ` Charlie Brady
  0 siblings, 1 reply; 12+ messages in thread
From: Lars Kellogg-Stedman @ 2005-02-11 12:56 UTC (permalink / raw)


Howdy,

Has anyone investigated adding some sort of respawn limiter to runit?  I 
figure that if a program is crashing immediately every time it runs, it 
probably doesn't need to be restarted -- which just floods the logs with 
error messages, perhaps overwriting critical log data about what 
precipitated the problem in the first place.

I'm thinking of some sort of setting that will down a service if it dies 
more than m times in n seconds.  This would be controlled via a file in 
the service directory...maybe we can introduce a 'config' file, similar 
to what svlogd uses, to allow per-service configuration?

Alternatively, this could perhaps be farmed out to a wrapper program, so 
that a run script might look like:

  exec respawn_wrapper ...

Any thoughts?
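
The "more than m times in n seconds" rule can be sketched as a sliding window over recent death times. This is only an illustration in Python, not runit code: the function name record_death and its parameters are hypothetical, and a real limiter (inside runsv or in a wrapper) would also have to decide how long the service stays down.

```python
from collections import deque


def record_death(deaths, now, max_deaths, window):
    """Record a death at time `now` (seconds).

    `deaths` is a deque of earlier death times, oldest first.
    Returns True when the service has died more than `max_deaths`
    times within the last `window` seconds.
    """
    # Forget deaths that have fallen out of the window.
    while deaths and now - deaths[0] >= window:
        deaths.popleft()
    deaths.append(now)
    # "more than m times": trip only once the count exceeds the limit.
    return len(deaths) > max_deaths
```

With max_deaths=3 and window=10, the fourth death inside ten seconds trips the limiter, while a death long after the others does not.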




* Re: Respawn limit for runsv?
  2005-02-11 12:56 Respawn limit for runsv? Lars Kellogg-Stedman
@ 2005-02-12  0:14 ` Charlie Brady
  2005-02-12  1:14   ` Lars Kellogg-Stedman
  0 siblings, 1 reply; 12+ messages in thread
From: Charlie Brady @ 2005-02-12  0:14 UTC (permalink / raw)
  Cc: supervision


On Fri, 11 Feb 2005, Lars Kellogg-Stedman wrote:

> Has anyone investigated adding some sort of respawn limiter to runit?  I 
> figure that if a program is crashing immediately every time it runs, it 
> probably doesn't need to be restarted -- which just floods the logs with 
> error messages, perhaps overwriting critical log data about what 
> precipitated the problem in the first place.

The problem is obviously ongoing, so it can be diagnosed without 
earlier logs. Certainly the direct cause will be diagnosable. If there's 
an indirect cause, it won't necessarily be logged in this process's log 
files.

Do you have any concrete example that illustrates the "problem"? I 
wouldn't be happy to see additional complexity without strong 
justification.

---
Charlie




* Re: Respawn limit for runsv?
  2005-02-12  0:14 ` Charlie Brady
@ 2005-02-12  1:14   ` Lars Kellogg-Stedman
  2005-02-12  3:13     ` Charlie Brady
  2005-02-13  3:59     ` Lars Kellogg-Stedman
  0 siblings, 2 replies; 12+ messages in thread
From: Lars Kellogg-Stedman @ 2005-02-12  1:14 UTC (permalink / raw)


> Do you have any concrete example that illustrates the "problem"?

Thanks, but I'm looking at providing a solution to a general class of 
problems.  My question is simply whether or not anyone has investigated 
adding this sort of feature to runsv before -- if not, I'll probably 
spend some of my time trying to add this behavior, since respawning 
software is a reasonably common problem (for example, this is why the 
SysV Init package under Linux will stop respawning a process for a few 
minutes in this situation).

-- Lars




* Re: Respawn limit for runsv?
  2005-02-12  1:14   ` Lars Kellogg-Stedman
@ 2005-02-12  3:13     ` Charlie Brady
  2005-02-12  5:21       ` Charles Duffy
  2005-02-13  3:59     ` Lars Kellogg-Stedman
  1 sibling, 1 reply; 12+ messages in thread
From: Charlie Brady @ 2005-02-12  3:13 UTC (permalink / raw)
  Cc: supervision


On Fri, 11 Feb 2005, Lars Kellogg-Stedman wrote:

> > Do you have any concrete example that illustrates the "problem"?
> 
> Thanks, but I'm looking at providing a solution to a general class of 
> problems.

There is no general class of problem unless you can provide a single 
instance.

>  My question is simply whether or not anyone has investigated 
> adding this sort of feature to runsv before -- if not, I'll probably 

Why would they, if there isn't a real problem?

> spend some of my time trying to add this behavior, since respawning 
> software is a reasonably common problem (for example, this is why the 
> SysV Init package under Linux will stop respawning a process for a few 
> minutes in this situation).

One of the reasons people chose to use daemontools (and subsequently 
runit) was to get away from such ad-hoc behaviour ...

You say that respawning software is a reasonably common problem. Whenever 
I've seen it, it was a configuration problem. Or rather, there was a 
configuration problem, and the respawning was a symptom. I didn't see the 
respawning as a problem. The automated respawning is a feature.

You asked for thoughts. I've given them.

---
Charlie




* Re: Respawn limit for runsv?
  2005-02-12  3:13     ` Charlie Brady
@ 2005-02-12  5:21       ` Charles Duffy
  2005-02-12 13:04         ` Lars Kellogg-Stedman
  2005-02-13 13:48         ` Gerrit Pape
  0 siblings, 2 replies; 12+ messages in thread
From: Charles Duffy @ 2005-02-12  5:21 UTC (permalink / raw)


On Fri, 11 Feb 2005 22:13:43 -0500, Charlie Brady wrote:

> There is no general class of problem unless you can provide a single
> instance.

Let me jump in here.

I have a web server. It's running a big, complex servlet container (which
provides only one of the many services available) that eats a bunch of CPU
time and writes a bunch of logs trying to start up.

Should the servlet container be misconfigured in such a way as to die
immediately after startup (and some such misconfigurations could occur
based on issues on other systems, e.g. the DNS server), the automated
respawning would bring everything else on the web server to its knees.

> You say that respawning software is a reasonably common problem. Whenever
> I've seen it, it was a configuration problem. Or rather, there was a
> configuration problem, and the respawning was a symptom. I didn't see the
> respawning as a problem. The automated respawning is a feature.

Certainly, when it occurs, it's a symptom of a larger problem. That's not
to say that it wouldn't be useful to have more configurable respawning --
ranging from the simple "no more than N times in M minutes" to backoff
algorithms akin to those used for TCP.

Putting these more complex algorithms into runit, I agree, isn't
necessarily appropriate -- but that's why he mentioned having a second
process responsible for implementing them.




* Re: Respawn limit for runsv?
  2005-02-12  5:21       ` Charles Duffy
@ 2005-02-12 13:04         ` Lars Kellogg-Stedman
  2005-02-13 18:42           ` Charlie Brady
  2005-02-13 13:48         ` Gerrit Pape
  1 sibling, 1 reply; 12+ messages in thread
From: Lars Kellogg-Stedman @ 2005-02-12 13:04 UTC (permalink / raw)


> > There is no general class of problem unless you can provide a single
> > instance.

Seriously, I would like to live in your perfect world, but in the past 
couple of decades I've seen a variety of situations that would have been 
easier to deal with (or that, in fact, *were* easier to deal with) 
because of a limiter.  Let's make up some examples:

- A piece of hardware on which a program depends goes bad, causing the 
program to exit immediately upon startup.  Suppose that this program has 
a high initial startup cost -- so not only is it respawning pointlessly, 
but it's driving up system load.

- The disk fills up, causing your X startup to fail.  But because of the 
continuous respawning, you can't log in on the console!

- A program bug causes a crash and concomitant database corruption.  
Subsequent startups fail immediately, but since you're logging through 
svlogd, the original crash messages disappear into the ether because 
the roughly 3600 respawns/hour have pushed it out of the logs.

- Or in any of the above scenarios, maybe you're *not* logging through 
svlogd, and the error messages fill up a partition and bring the system 
to a screeching halt, or at least give it a noticeable limp.
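
The arithmetic behind the svlogd scenario can be sketched as follows. The figures are assumptions, not measurements: one restart per second matches the "roughly 3600 respawns/hour" above, the 2000 bytes logged per failed start is invented for illustration, and the 10 files of 1,000,000 bytes each are assumed retention settings that a log directory's config may well override.

```python
def respawns_per_hour(restarts_per_second=1.0):
    """Restarts accumulated in one hour at a steady restart rate."""
    return 3600 * restarts_per_second


def seconds_until_rotated_out(bytes_per_failed_start,
                              restarts_per_second=1.0,
                              file_size=1_000_000,
                              num_files=10):
    """Rough time until the first crash message is rotated away.

    Assumes the logger retains about file_size * num_files bytes and
    that every failed start writes the same amount of output.
    """
    retained_bytes = file_size * num_files
    bytes_per_second = bytes_per_failed_start * restarts_per_second
    return retained_bytes / bytes_per_second
```

Under these assumed numbers, seconds_until_rotated_out(2000) comes to 5000 seconds, so the original crash message would be gone in under an hour and a half.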

Sure, yes, the root problem here is not unlimited respawning, but this 
behavior exacerbates the problem.  Diagnosis and resource consumption 
are both aided by some sort of limit.

Exponential back-off would probably be just fine, and represents a 
fairly common solution to this class of problem.  SysV init simply 
pauses for a few minutes if the respawn rate exceeds a certain 
threshold, and then tries again.
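
A capped exponential back-off could be sketched like this. The one-second base delay, the 300-second cap, and the rule that a run lasting at least 60 seconds resets the failure count are all assumptions chosen for illustration; neither SysV init nor runit is claimed to work this way.

```python
def backoff_delay(consecutive_failures, base=1.0, cap=300.0):
    """Seconds to wait before the next restart attempt.

    The delay doubles with each consecutive failure and is capped,
    so a persistently failing service is still retried regularly.
    """
    if consecutive_failures <= 0:
        return 0.0
    # Clamp the exponent so the power cannot overflow a float.
    exponent = min(consecutive_failures - 1, 32)
    return min(cap, base * 2 ** exponent)


def next_failure_count(previous, uptime, healthy_after=60.0):
    """Update the failure streak after the service exits.

    A run that stayed up for `healthy_after` seconds is treated as
    healthy and resets the streak; a shorter run extends it.
    """
    if uptime >= healthy_after:
        return 0
    return previous + 1
```

With these defaults the waits are 1, 2, 4, 8, ... seconds, levelling off at five minutes, which lands close to the SysV init pause described above once the streak is long enough.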

Either behavior would be helpful.  If you've never encountered a 
situation in which this would be useful, then by all means, don't 
partake of whatever, if anything, I ultimately manage to produce.




* Re: Respawn limit for runsv?
  2005-02-12  1:14   ` Lars Kellogg-Stedman
  2005-02-12  3:13     ` Charlie Brady
@ 2005-02-13  3:59     ` Lars Kellogg-Stedman
  1 sibling, 0 replies; 12+ messages in thread
From: Lars Kellogg-Stedman @ 2005-02-13  3:59 UTC (permalink / raw)


My first stab at a respawn limiter is available here:

  http://www.oddbit.com/software/respawn/respawn.tar.gz

This is a very simple implementation of a respawn limiter as a wrapper
program; you run it like this:

  respawn -- /my/program

If /my/program starts more than m times in n seconds (default 10 times in
120 seconds), respawn will sleep for x seconds (default 300).  Respawn
keeps a table of start times in a file called 'spawntab' in its current
working directory.
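
For illustration only, a wrapper of this kind could be sketched as below. This is not the code in respawn.tar.gz: only the file name 'spawntab' and the defaults (10 starts, 120 seconds, 300-second sleep) come from the description above, while the one-timestamp-per-line file format and all function names are assumptions.

```python
import os
import sys
import time


def load_starts(path):
    """Read one start timestamp per line; a missing file means no history."""
    try:
        with open(path) as f:
            return [float(line) for line in f if line.strip()]
    except FileNotFoundError:
        return []


def save_starts(path, starts):
    """Write the table via a temporary file and an atomic rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.writelines(f"{t}\n" for t in starts)
    os.replace(tmp, path)


def check_and_record(path, now, max_starts=10, window=120.0):
    """Record a start at `now`; return True if the limit is exceeded."""
    starts = [t for t in load_starts(path) if now - t < window]
    starts.append(now)
    save_starts(path, starts)
    return len(starts) > max_starts


def main(argv, sleep_seconds=300):
    """Hypothetical entry point: <wrapper> -- /my/program [args...]"""
    command = argv[argv.index("--") + 1:]
    if check_and_record("spawntab", time.time()):
        time.sleep(sleep_seconds)  # hold off before trying again
    os.execvp(command[0], command)  # replace the wrapper with the program
```

Calling main(sys.argv) at the bottom of the file would turn the sketch into something a run script could exec. The state has to live in a file because the wrapper itself is restarted along with the program each time.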

It ain't pretty, but it appears to work.

Let me know what you think.

-- Lars

-- 
Lars Kellogg-Stedman <lars@oddbit.com>





* Re: Respawn limit for runsv?
  2005-02-12  5:21       ` Charles Duffy
  2005-02-12 13:04         ` Lars Kellogg-Stedman
@ 2005-02-13 13:48         ` Gerrit Pape
  2005-02-13 18:19           ` Lars Kellogg-Stedman
  1 sibling, 1 reply; 12+ messages in thread
From: Gerrit Pape @ 2005-02-13 13:48 UTC (permalink / raw)


On Fri, Feb 11, 2005 at 11:21:20PM -0600, Charles Duffy wrote:
> On Fri, 11 Feb 2005 22:13:43 -0500, Charlie Brady wrote:
> > You say that respawning software is a reasonably common problem. Whenever
> > I've seen it, it was a configuration problem. Or rather, there was a
> > configuration problem, and the respawning was a symptom. I didn't see the
> > respawning as a problem. The automated respawning is a feature.
> 
> Certainly, when it occurs, it's a symptom of a larger problem. That's not
> to say that it wouldn't be useful to have more configurable respawning --
> ranging from the simple "no more than N times in M minutes" to backoff
> algorithms akin to those used for TCP.
> 
> Putting these more complex algorithms into runit, I agree, isn't
> necessarily appropriate -- but that's why he mentioned having a second
> process responsible for implementing them.

I also think a separate program is appropriate for such special
services.  Another solution to the problem with log messages getting
out-rotated is to have the startup and failing messages of the service
written to a different log directory maintained by the same svlogd
service.

Regards, Gerrit.



* Re: Respawn limit for runsv?
  2005-02-13 13:48         ` Gerrit Pape
@ 2005-02-13 18:19           ` Lars Kellogg-Stedman
  2005-02-14 16:23             ` Clemens Fischer
  0 siblings, 1 reply; 12+ messages in thread
From: Lars Kellogg-Stedman @ 2005-02-13 18:19 UTC (permalink / raw)


> I also think a separate program is appropriate for such special
> services.

I'm not sure this should really be thought of as a special service; since
the supervisor is responsible for restarting the target program, this would
also seem to be the logical place to handle respawn limits.

Having said that, an external implementation does work, although the code
is a little more complicated since it becomes necessary to maintain some
sort of external state.  If the limiter were implemented in runsv, for
example, it could all be managed internally without the necessity of
reading and writing an external file.

But with a working external solution, I'm probably too lazy to pursue
anything else.

-- Lars

-- 
Lars Kellogg-Stedman <lars@oddbit.com>





* Re: Respawn limit for runsv?
  2005-02-12 13:04         ` Lars Kellogg-Stedman
@ 2005-02-13 18:42           ` Charlie Brady
  2005-02-13 21:21             ` Thomas Schwinge
  0 siblings, 1 reply; 12+ messages in thread
From: Charlie Brady @ 2005-02-13 18:42 UTC (permalink / raw)
  Cc: supervision


On Sat, 12 Feb 2005, Lars Kellogg-Stedman wrote:

> > > There is no general class of problem unless you can provide a single
> > > instance.
> 
> Seriously, I would like to live in your perfect world, but in the past 
> couple of decades I've seen a variety of situations that would have been 
> easier to deal with (or that, in fact, *were* easier to deal with) 
> because of a limiter. 

I only asked for one :-)

> Let's make up some examples:

I didn't want made up examples :-)

> - A piece of hardware on which a program depends goes bad, causing the 
> program to exit immediately upon startup.  Suppose that this program has 
> a high initial startup cost -- so not only is it respawning pointlessly, 
> but it's driving up system load.
> 
> - The disk fills up, causing your X startup to fail.  But because of the 
> continuous respawning, you can't log in on the console!

OK, I'm convinced.

> Exponential back-off would probably be just fine, and represents a 
> fairly common solution to this class of problem.  SysV init simply 
> pauses for a few minutes if the respawn rate exceeds a certain 
> threshold, and then tries again.
> 
> Either behavior would be helpful.

But perhaps both are unnecessarily complicated. Would it not be sufficient 
for runsv to have a configurable "dead time" after starting the supervised 
process before it was again prepared to respawn the child? We want it to 
start a new child quickly, but we don't want it to do so often. So how 
about "within one second", but "at most every ten seconds".
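
Such a "dead time" reduces to one subtraction. A minimal sketch, assuming the supervisor remembers when it last started the child; restart_delay and min_interval are hypothetical names, not runsv options.

```python
def restart_delay(last_start, now, min_interval=10.0):
    """Seconds to wait before starting the child again.

    A child that ran longer than `min_interval` is restarted at once;
    one that died early waits out the rest of the interval.
    """
    earliest_next_start = last_start + min_interval
    return max(0.0, earliest_next_start - now)
```

A child started at t=100 that dies at t=102 would wait 8 more seconds, while one that dies at t=500 would be restarted immediately.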

It would also be useful to have a mechanism to distinguish between a
process dying in response to a request from runsv, and a program dying
unexpectedly. Perhaps have "finish" and "unexpected_finish" scripts. I'd
certainly like to have a mechanism to run a finish script if a service is
taken down, but not if it just died unexpectedly. The "unexpected_finish"
script could introduce the programmed delay you want, notify the admin,
preserve any essential logs, etc.

---
Charlie




* Re: Respawn limit for runsv?
  2005-02-13 18:42           ` Charlie Brady
@ 2005-02-13 21:21             ` Thomas Schwinge
  0 siblings, 0 replies; 12+ messages in thread
From: Thomas Schwinge @ 2005-02-13 21:21 UTC (permalink / raw)
  Cc: Lars Kellogg-Stedman, supervision

On Sun, Feb 13, 2005 at 01:42:01PM -0500, Charlie Brady wrote:
> On Sat, 12 Feb 2005, Lars Kellogg-Stedman wrote:
> > - The disk fills up, causing your X startup to fail.  But because of the 
> > continuous respawning, you can't log in on the console!

This has happened to me, too.  :-(
(And of course no telnetd, sshd or similar was running...)


> It would also be useful to have a mechanism to distinguish between a
> process dying in response to a request from runsv, and a program dying
> unexpectedly. Perhaps have "finish" and "unexpected_finish" scripts. I'd
> certainly like to have a mechanism to run a finish script if a service is
> taken down, but not if it just died unexpectedly. The "unexpected_finish"
> script could introduce the programmed delay you want, notify the admin,
> preserve any essential logs, etc.

I'd suggest having only 'finish', but passing it (a) how the process
was terminated (i.e. either manually, using svc, runsvctrl, or a write to
'SERVICE/supervise/control', or because the process terminated on its
own) and (b) the process's exit code.  This information could be passed
either as positional parameters or via environment variables.
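
A finish script written against such an interface might begin as sketched below. Everything here is hypothetical: the reason strings "requested" and "died", the argument order, and the environment variable names FINISH_REASON and FINISH_EXITCODE are invented for illustration.

```python
def termination_info(argv, environ):
    """Return (reason, exit_code) for the process that just ended.

    Positional parameters are preferred; the environment is the fallback.
    `argv` is the script's argument list and `environ` a mapping such
    as os.environ.
    """
    if len(argv) >= 3:
        return argv[1], int(argv[2])
    return environ["FINISH_REASON"], int(environ["FINISH_EXITCODE"])


def plan_finish(reason, exit_code):
    """Choose the follow-up actions for one termination."""
    if reason == "requested":
        # The admin took the service down: ordinary cleanup only.
        return ["cleanup"]
    # The process died by itself: keep evidence and tell someone.
    actions = ["preserve-logs", "notify-admin"]
    if exit_code != 0:
        actions.append("delay-restart")
    return actions
```

A single script can then cover both the "finish" and "unexpected_finish" cases by branching on the reason it is given.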


Regards,
 Thomas



* Re: Respawn limit for runsv?
  2005-02-13 18:19           ` Lars Kellogg-Stedman
@ 2005-02-14 16:23             ` Clemens Fischer
  0 siblings, 0 replies; 12+ messages in thread
From: Clemens Fischer @ 2005-02-14 16:23 UTC (permalink / raw)


* Lars Kellogg-Stedman:

> Having said that, an external implementation does work, although the code is
> a little more complicated since it becomes necessary to maintain some sort
> of external state.

this is the same problem as making a connection limiter for tcpserver/tcpsvd.
this can become important for services under DoS attacks.  i have such a
beast, written in guile-scheme (for experimenting), and the database part
works, but logging by writing to stderr does not.  the correct place to
implement a simple algorithm (don't let some IP connect more than N times in
T seconds) would be the servers themselves.
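
The rule in parentheses can be sketched with an in-memory map from address to recent connection times. This is an assumption-laden illustration, not tcpserver or tcpsvd code: the class and method names are hypothetical, and a limiter that runs as a separate process per connection would need the table on disk instead.

```python
from collections import defaultdict, deque


class ConnectionLimiter:
    """Allow at most `max_connections` per address in `window` seconds."""

    def __init__(self, max_connections, window):
        self.max_connections = max_connections
        self.window = window
        self._seen = defaultdict(deque)  # address -> accepted times

    def allow(self, ip, now):
        """Return True and record the connection if `ip` is under the limit."""
        times = self._seen[ip]
        # Drop connections older than the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.max_connections:
            return False
        times.append(now)
        return True

    def prune(self, now):
        """Forget addresses with no recent connections to bound the table."""
        idle = [ip for ip, times in self._seen.items()
                if not times or now - times[-1] >= self.window]
        for ip in idle:
            del self._seen[ip]
```

Rejected attempts are deliberately not recorded here, so a client that keeps hammering is let back in as soon as its accepted connections age out; counting rejections too would be a stricter policy.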

> But with a working external solution, I'm probably too lazy to pursue
> anything else.

you could still make a patch and send it to the list.  me, i don't dare to
until i've thought out all the issues with a simple but probably big and
ever-changing hash table on file.

  clemens

