Re: "Back off" setting for crashing services with s6+openrc?


From: "Laurent Bercot" <ska-supervision@skarnet.org>
To: "supervision@list.skarnet.org" <supervision@list.skarnet.org>
Subject: Re: "Back off" setting for crashing services with s6+openrc?
Date: Fri, 30 Sep 2022 13:21:06 +0000
Message-ID: <emb087a60d-4590-4260-bfba-cf19c1c83ee5@5e65708a.com>
In-Reply-To: <20220930113440.72c09a4f@flunder.oschad.de>

  I feel like this whole thread comes from mismatched expectations of
how s6 should behave.

  s6 always waits for one second between two successive starts of a
service. This ensures it never hogs the CPU by spamming a crashing
service. (With an asterisk, see below.)

  It does not wait for one second if the service has been started for
more than one second. The idea is, if the service has been running for
a while, and dies, you want it up _immediately_, and it's okay because
it was running for a while, and either was somewhat idle (in which case
you're not hurting for resources) or was hot (in which case you don't
want a 1-second delay).
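
  As a rough illustration of the rule described above (this is not the
actual s6-supervise code, which is written in C), the decision can be
sketched as a small shell function. The function name and the use of
milliseconds are made up for the example, and it assumes the wait only
covers the remainder of the one-second window since the last start:

```shell
#!/bin/sh
# Hypothetical sketch of the current restart rule; not s6-supervise code.
# Argument: how long the service ran before dying, in milliseconds.
# Prints how long to wait before the next start, in milliseconds.
restart_delay_ms() {
    uptime_ms=$1
    min_interval_ms=1000    # one second between two successive starts

    if [ "$uptime_ms" -ge "$min_interval_ms" ]; then
        # Ran for a second or more: restart immediately.
        echo 0
    else
        # Died early: wait out the rest of the one-second window
        # (assumption: remainder, not a full extra second).
        echo $((min_interval_ms - uptime_ms))
    fi
}

restart_delay_ms 300     # service died 300 ms after starting
restart_delay_ms 5000    # service ran for 5 s before dying
```

  With that rule a service that dies right away is started at most once
per second, while one that has been up for a while comes back without
any delay.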

  The point of s6 is to maximize service uptime; this is why it does
not have a configurable backoff mechanism. When you want a service up,
you want it *up*, and that's the #1 priority. A service that keeps
crashing is an abnormal condition and is supposed to be handled by the
admin quickly enough - ideally, before getting to production.

  If the CPU is still being hogged by s6 despite the 1-second delay
and it's not that the service is running hot, then it means it's
crashing while still initializing, and the initialization takes more
than one second while using a lot of resources. In other words, you
got yourself a pretty heavy service that is crashing while starting.

  That should definitely be caught before it goes in production. But
if it's not possible, the least ugly workaround is indeed to sleep
in the finish script, and to increase timeout-finish if needed.
(The "run_service || sleep 10" approach leaves a shell between
s6-supervise and run_service, so it's not good.)
./finish is generally supposed to be very short-lived, because the
"finishing" state is generally confusing to an observer, but in this
case it does not matter: it's an abnormal situation anyway.
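
  A minimal sketch of such a finish script, under a few assumptions: the
10-second value is only the figure used in this thread, the
BACKOFF_SECONDS variable and the function name are made up for the
example, and skipping the sleep after a clean exit is my own choice, not
something s6 requires. s6-supervise runs ./finish with the exit code of
./run as its first argument (256 if ./run was killed by a signal) and
the signal number as its second.

```shell
#!/bin/sh
# Hypothetical ./finish script for a service that crashes while starting.
# $1: exit code of ./run (256 if it was killed by a signal)
# $2: signal number (0 if ./run exited normally)

# Assumption: 10 seconds, as in the thread; can be overridden from the
# environment.
BACKOFF_SECONDS=${BACKOFF_SECONDS:-10}

backoff_after_exit() {
    exit_code=${1:-0}
    # Only slow down restarts after an abnormal exit.
    if [ "$exit_code" -ne 0 ]; then
        sleep "$BACKOFF_SECONDS"
    fi
}

backoff_after_exit "$@"
```

  The finish script is killed once timeout-finish expires. That file
holds a number of milliseconds and, as far as I remember, defaults to
5000, so a 10-second sleep needs something like
"echo 15000 > timeout-finish" in the service directory.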

  There is, however, one improvement I think I can safely make.
  Currently, the 1-second delay is computed from when the service
*starts*: if it has been running for more than one second, and crashes,
it restarts immediately, even if it has only been busy initializing,
which causes the resource hog OP is experiencing.
  I could change it to being computed from when the service is *ready*:
if the service dies before being ready, s6-supervise *always* waits for
1 second before restarting. The delay is only skipped if the service has
been *ready* for 1 second or more, which means it is really serving and
either idle (i.e. chill resource-wise) or hot (i.e. you don't want to
delay it).
  Does that sound like a valid improvement to you folks?
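
  To make the proposal concrete, here is a hedged sketch of the
readiness-based rule as a shell function. It is an illustration only,
not a patch: the function name, the millisecond unit and the use of -1
for "never became ready" are made up, and it assumes that a service
which was ready for less than one second also gets the full delay.

```shell
#!/bin/sh
# Hypothetical sketch of the proposed rule; not s6-supervise code.
# Argument: how long the service had been *ready* when it died, in
# milliseconds, or -1 if it died before ever being ready.
# Prints how long to wait before the next start, in milliseconds.
proposed_restart_delay_ms() {
    ready_for_ms=$1
    delay_ms=1000

    if [ "$ready_for_ms" -ge 1000 ]; then
        # Ready for a second or more: it was really serving, so
        # restart immediately.
        echo 0
    else
        # Died before being ready, or shortly after (assumption for
        # the latter case): always wait.
        echo "$delay_ms"
    fi
}

proposed_restart_delay_ms -1      # crashed during initialization
proposed_restart_delay_ms 30000   # had been ready for 30 s
```

  The difference from the current behaviour is the first case: a service
that spends several seconds initializing and then crashes would no
longer be restarted immediately.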

  Note that I don't think making the restart delay configurable is a good
trade-off. It adds complexity, size and failure cases to the s6-supervise
code, it adds another file to a service directory for users to remember,
it adds another avenue for configuration mistakes causing downtime, all
that to save resources for a pathological case. The difference between
0 seconds and 1 second of free CPU is significant; longer delays have
diminishing returns.

--
  Laurent
