s6-rc: timeout questions

supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit

* s6-rc: timeout questions
@ 2020-11-17 15:53 Xavier Stonestreet
  2020-11-17 21:53 ` Laurent Bercot
  0 siblings, 1 reply; 5+ messages in thread
From: Xavier Stonestreet @ 2020-11-17 15:53 UTC (permalink / raw)
  To: supervision

Hello,

- Am I correct in thinking that if a service has properly configured
timeout-kill and timeout-finish, timeout-down becomes unnecessary and
even undesirable as it can leave services in an undefined state limbo?
I know the documentation pretty much says so, but I'm still a bit
confused by the existence of timeout-down to begin with, if it's
redundant and unhelpful.
- Can you confirm that timeout-up and timeout-down are also used with
oneshots? They are defined in the s6-rc-compile documentation, but the
s6-rc documentation doesn't specifically mention them for oneshots
state transitions.

Thanks,
X.


* Re: s6-rc: timeout questions
  2020-11-17 15:53 s6-rc: timeout questions Xavier Stonestreet
@ 2020-11-17 21:53 ` Laurent Bercot
  2020-11-18 17:49   ` Xavier Stonestreet
  0 siblings, 1 reply; 5+ messages in thread
From: Laurent Bercot @ 2020-11-17 21:53 UTC (permalink / raw)
  To: Xavier Stonestreet, supervision

>- Am I correct in thinking that if a service has properly configured
>timeout-kill and timeout-finish, timeout-down becomes unnecessary and
>even undesirable as it can leave services in an undefined state limbo?
>I know the documentation pretty much says so, but I'm still a bit
>confused by the existence of timeout-down to begin with, if it's
>redundant and unhelpful.

  timeout-kill and timeout-finish are an s6 thing: if present, they're
just copied as is to the service directory that will be managed by
the s6 supervision tree.
  timeout-up and timeout-down are specific to s6-rc: they will be
embedded into the compiled database. They do not interact with s6 at
all, they're just a rule for the s6-rc state machine: if the
service does not report being up (resp. down) by the timeout, then
s6-rc marks the transition as failed and stops looking at what
happens with the service.

  For longruns, yes, timeout-kill ensures that the service will
eventually be brought down no matter what. But there are cases where
you *do not want* to kill -9 a daemon (and need a timeout-kill of 0).
timeout-down is useful here, even if it's a pretty niche case.
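
  A minimal sketch of what such a longrun definition could look like in
an s6-rc source directory. The service name "mydaemon", its binary path
and the chosen values are hypothetical; all timeout files hold
milliseconds. The script only writes files into a temporary directory,
so it can be tried without s6-rc installed.

```shell
#!/bin/sh
# Sketch: source definition for a longrun that must never receive SIGKILL.
# "mydaemon" and /usr/local/bin/mydaemon are hypothetical placeholders.
set -eu

src=$(mktemp -d)              # stand-in for a real s6-rc source directory
svc="$src/mydaemon"
mkdir -p "$svc"

echo longrun > "$svc/type"

# run script: starts the (hypothetical) daemon in the foreground
cat > "$svc/run" <<'EOF'
#!/bin/sh
exec /usr/local/bin/mydaemon --foreground
EOF
chmod +x "$svc/run"

# s6 side: copied as is into the live service directory
echo 0    > "$svc/timeout-kill"    # 0: never escalate to SIGKILL
echo 5000 > "$svc/timeout-finish"  # only relevant if a finish script exists

# s6-rc side: embedded into the compiled database
echo 10000 > "$svc/timeout-down"   # stop waiting after 10 s, mark the transition failed

ls "$svc"
```

  With timeout-kill at 0 nothing forces the daemon down, so timeout-down
is the only thing that keeps an s6-rc transition from waiting forever.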

  And then, of course, the point is that it's needed for oneshots,
which do not have the s6 mechanisms.

>- Can you confirm that timeout-up and timeout-down are also used with
>oneshots? They are defined in the s6-rc-compile documentation, but the
>s6-rc documentation doesn't specifically mention them for oneshots
>state transitions.

  Yes, I confirm that they're also (and primarily) used with oneshots.
They're defined in the "atomic services" section, which comprises
longruns *and* oneshots.
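
  For illustration, a oneshot definition carrying both timeouts might be
laid out as in the sketch below. The service name "mount-data" and the
two helper paths are hypothetical. The up and down files hold a single
command line that s6-rc-compile parses with the execlineb lexer, so they
are not shell scripts; pointing them at an external program sidesteps
quoting differences.

```shell
#!/bin/sh
# Sketch: source definition for a oneshot with timeout-up and timeout-down.
# "mount-data" and the /usr/local/libexec paths are hypothetical placeholders.
set -eu

src=$(mktemp -d)              # stand-in for a real s6-rc source directory
svc="$src/mount-data"
mkdir -p "$svc"

echo oneshot > "$svc/type"

# One command line each, parsed by the execlineb lexer at compile time
echo "/usr/local/libexec/mount-data-up"   > "$svc/up"
echo "/usr/local/libexec/mount-data-down" > "$svc/down"

# Both values are in milliseconds; 0 or a missing file means no limit
echo 30000 > "$svc/timeout-up"
echo 15000 > "$svc/timeout-down"

ls "$svc"
```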

--
  Laurent


* Re: s6-rc: timeout questions
  2020-11-17 21:53 ` Laurent Bercot
@ 2020-11-18 17:49   ` Xavier Stonestreet
  2020-11-18 19:06     ` Laurent Bercot
  0 siblings, 1 reply; 5+ messages in thread
From: Xavier Stonestreet @ 2020-11-18 17:49 UTC (permalink / raw)
  To: supervision

Thank you for your clarifications.

>   And then, of course, the point is that it's needed for oneshots,
> which do not have the s6 mechanisms.

Right. Understood.

> >- Can you confirm that timeout-up and timeout-down are also used with
> >oneshots? They are defined in the s6-rc-compile documentation, but the
> >s6-rc documentation doesn't specifically mention them for oneshots
> >state transitions.
>
>   Yes, I confirm that they're also (and primarily) used with oneshots.
> They're defined in the "atomic services" section, which comprises
> longruns *and* oneshots.

Could you elaborate a little more about the state transition failures
of oneshots caused by timeouts?

Let's say for example the oneshot's up script times out, so the
transition fails. From s6-rc's point of view the oneshot is still
down. What actually happens to the process running the up script? Is
it left running in the background? If yes, is it correct to assume
that since s6-rc considers it down, another invocation of the s6-rc -u
change command on the same oneshot will spawn another instance of the
up script? If not, is it killed, and how?

I stumbled upon some unexpected behavior while doing some tests
yesterday. I'd like to know if this is by design or unintended.
Consider this scenario:

2 longruns s1 and s2
s2 depends on s1
s2 takes less than 1 sec to start and become ready on its own
s2 has a timeout-up of 5 secs
s1 has a run script containing an artificial delay of 6 secs

Test 1:
s1 is up and ready
s2 is down
s6-rc -u change s2

Success. OK.

Test 2:
s1 is down
s2 is down
s6-rc -u change s2
s6-rc: fatal: timed out
s6-svlisten1: fatal: timed out

Timeout failure. Unexpected. I thought timeout-up and timeout-down
applied to each atomic service individually, not to the entire
dependency chain to bring it up or down.
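
For reference, here is one guess at how the scenario above could be
written as an s6-rc source directory. The use of notification-fd, the
placeholder "sleep 86400" daemons and the plain "dependencies" file are
assumptions, not taken from the actual test setup; newer s6-rc releases
may prefer a dependencies.d directory, so check the s6-rc-compile
documentation for your version. The s6-rc commands are left commented
out because they need a live supervision tree.

```shell
#!/bin/sh
# Sketch: s1 becomes ready after 6 s, s2 depends on s1 and has timeout-up 5 s.
set -eu

src=$(mktemp -d)              # stand-in for a real s6-rc source directory
mkdir -p "$src/s1" "$src/s2"

# s1: artificial 6 s delay before announcing readiness on fd 3
echo longrun > "$src/s1/type"
echo 3       > "$src/s1/notification-fd"
cat > "$src/s1/run" <<'EOF'
#!/bin/sh
sleep 6
echo >&3          # readiness notification: a newline on fd 3
exec 3>&-
exec sleep 86400  # placeholder for the real daemon
EOF

# s2: ready almost immediately, depends on s1, 5000 ms to come up
echo longrun > "$src/s2/type"
echo 3       > "$src/s2/notification-fd"
echo s1      > "$src/s2/dependencies"
echo 5000    > "$src/s2/timeout-up"
cat > "$src/s2/run" <<'EOF'
#!/bin/sh
echo >&3
exec 3>&-
exec sleep 86400  # placeholder for the real daemon
EOF

chmod +x "$src/s1/run" "$src/s2/run"

# With s6-rc installed and a live database built from this source:
#   s6-rc-compile /path/to/compiled "$src"
#   s6-rc -u change s2
```

If timeout-up applies per service, s1 has no limit and takes about 6
seconds, after which s2 gets its own 5 second budget and needs less than
1 second of it, so Test 2 would be expected to succeed.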


* Re: s6-rc: timeout questions
  2020-11-18 17:49   ` Xavier Stonestreet
@ 2020-11-18 19:06     ` Laurent Bercot
  2020-11-20 10:30       ` Xavier Stonestreet
  0 siblings, 1 reply; 5+ messages in thread
From: Laurent Bercot @ 2020-11-18 19:06 UTC (permalink / raw)
  To: Xavier Stonestreet, supervision

>Could you elaborate a little more about the state transition failures
>of oneshots caused by timeouts?
>
>Let's say for example the oneshot's up script times out, so the
>transition fails. From s6-rc's point of view the oneshot is still
>down. What actually happens to the process running the up script? Is
>it left running in the background? If yes, is it correct to assume
>that since s6-rc considers it down, another invocation of the s6-rc -u
>change command on the same oneshot will spawn another instance of the
>up script? If not, is it killed, and how?

  It is correct to assume that another instance will be spawned, yes.
It was a difficult decision to make, and I'm still not sure it is the
right one. There are advantages and drawbacks to both approaches, but
at the end of the day it all comes down to: what set of actions will
leave the system in the *least* unknown state?

  s6-rc's design assumes that timeouts, if they exist, are properly
calibrated; if a service times out, then it's not that the timeout is
too short, it's that something is really going wrong. So it considers
the transition failed. Now what should it do about the existing
process? kill it or not?
  If the process is allowed to live on, it may succeed, in which case
s6-rc's vision of the service will be wrong, but 1. it doesn't matter
because services should always be written as idempotent, and 2. it means
that the timeout was badly calibrated in the first place. Or it may
fail and s6-rc's vision will be correct.
  If the process is killed, chances are that it will add to the problem
instead of solving it. For instance, if the process is hanging in D
state, killing it won't do anything except make the system more unstable.
If the process is doing some complex operation and not properly
sequencing its operations, sending it a signal may trigger a bug, and so on.
  In the end I judged that sending a signal would potentially cause more
harm than good, but I don't think using the opposite approach would be
wrong either.
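
  The policy described above can be imitated in a few lines of shell.
This is only a toy model, not s6-rc's implementation: the background
subshell stands in for a slow up script, the marker file stands in for
its success report, and the timeout is in seconds rather than
milliseconds to keep the polling loop simple.

```shell
#!/bin/sh
# Toy model: time out on a slow "up script" without sending it any signal.
set -eu

work=$(mktemp -d)
timeout_s=1                   # hypothetical timeout-up, in seconds here

# Stand-in for a slow up script: reports success by creating a marker file
( sleep 3; : > "$work/done" ) &
pid=$!

# Wait for the marker, but never longer than the timeout
waited=0
while [ ! -e "$work/done" ] && [ "$waited" -lt "$timeout_s" ]; do
  sleep 1
  waited=$((waited + 1))
done

# Record the outcome once and stop looking, as the state machine does
if [ -e "$work/done" ]; then state=up; else state=down; fi

# The script is deliberately left alone: no kill, no cleanup
if kill -0 "$pid" 2>/dev/null; then still_running=yes; else still_running=no; fi
echo "recorded state: $state, script still running: $still_running"

# Later the script may well finish; the recorded state is then stale
wait "$pid"
if [ -e "$work/done" ]; then finished=yes; else finished=no; fi
echo "recorded state: $state, script finished anyway: $finished"
```

  Here the recorded state stays "down" even though the script completes
two seconds later. Running the transition again would start a second
copy of the script, which is why up scripts should be idempotent.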

>Test 2:
>s1 is down
>s2 is down
>s6-rc -u change s2
>s6-rc: fatal: timed out
>s6-svlisten1: fatal: timed out
>
>Timeout failure. Unexpected. I thought timeout-up and timeout-down
>applied to each atomic service individually, not to the entire
>dependency chain to bring it up or down.

  Yes, it should be behaving as you say, and I suspect you have
uncovered a bug - not in the timeout management for a dependency chain,
but in the management of s6-rc's *global* timeout, which is the one
that is triggering here. I suspect I'm taking incorrect shortcuts wrt
timeout management, and will take a look. Thanks!

--
  Laurent


* Re: s6-rc: timeout questions
  2020-11-18 19:06     ` Laurent Bercot
@ 2020-11-20 10:30       ` Xavier Stonestreet
  0 siblings, 0 replies; 5+ messages in thread
From: Xavier Stonestreet @ 2020-11-20 10:30 UTC (permalink / raw)
  To: supervision

Thanks for taking the time to respond and for all your detailed
answers, much appreciated.

On Wed, Nov 18, 2020 at 8:06 PM Laurent Bercot
<ska-supervision@skarnet.org> wrote:
[...]
