supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit
* "Back off" setting for crashing services with s6+openrc?
@ 2022-09-22 19:45 Tor Rune Skoglund
  2022-09-22 20:21 ` John W Higgins
  0 siblings, 1 reply; 11+ messages in thread
From: Tor Rune Skoglund @ 2022-09-22 19:45 UTC (permalink / raw)
  To: supervision

Hello List,

I am new to the list, so sorry if this question has been raised before. 
However, I've made an honest attempt to figure this out before posting 
here now.

I am using s6 with openrc on Gentoo, together with a couple of
custom-made services which are not fully stable under some conditions;
i.e. they sometimes simply crash and keep crashing, which causes s6 to
instantly restart them, with a high-CPU-usage
crash-restart-crash-restart... loop as a result.

Of course, the correct thing would be to fix the crashing services, but
that is not an option right now.

As a generic question, is there any setting with this s6+openrc config 
that would make s6 "back off" a configurable number of seconds before 
doing the restart?

Any help appreciated. Thanks.

BR, Tor Rune Skoglund, trs@fourc.eu



* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-22 19:45 "Back off" setting for crashing services with s6+openrc? Tor Rune Skoglund
@ 2022-09-22 20:21 ` John W Higgins
  2022-09-23  7:29   ` Tor Rune Skoglund
                     ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: John W Higgins @ 2022-09-22 20:21 UTC (permalink / raw)
  To: supervision

Good Day,

On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu> wrote:
...

> As a generic question, is there any setting with this s6+openrc config
> that would make s6 "back off" a configurable number of seconds before
> doing the restart?
>
>
Does something as simple as changing your run script to be something like

run_my_crashing_app || sleep 10

Work? The run script will sit there for 10 seconds if your app fails. Not
built in - but should accomplish the task pretty easily.

John W Higgins


* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-22 20:21 ` John W Higgins
@ 2022-09-23  7:29   ` Tor Rune Skoglund
  2022-09-23 16:44   ` Alyssa Ross
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Tor Rune Skoglund @ 2022-09-23  7:29 UTC (permalink / raw)
  To: supervision

Hello,


On 22.09.2022 22:21, John W Higgins wrote:
> On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu> wrote:
>> As a generic question, is there any setting with this s6+openrc config
>> that would make s6 "back off" a configurable number of seconds before
>> doing the restart?
> Does something as simple as changing your run script to be something like
>
> run_my_crashing_app || sleep 10
>
> Work? The run script will sit there for 10 seconds if your app fails. Not
> built in - but should accomplish the task pretty easily.

Thanks, yes, that would work.

However, I was just hoping this could be set on a more global s6 level 
so that it would catch any misbehaving service in general.

- Tor Rune Skoglund



* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-22 20:21 ` John W Higgins
  2022-09-23  7:29   ` Tor Rune Skoglund
@ 2022-09-23 16:44   ` Alyssa Ross
  2022-09-24 12:33   ` Oliver Schad
  2022-09-26 23:00   ` Colin Booth
  3 siblings, 0 replies; 11+ messages in thread
From: Alyssa Ross @ 2022-09-23 16:44 UTC (permalink / raw)
  To: John W Higgins; +Cc: Tor Rune Skoglund, supervision

John W Higgins <wishdev@gmail.com> writes:

> Good Day,
>
> On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu> wrote:
> ...
>
>> As a generic question, is there any setting with this s6+openrc config
>> that would make s6 "back off" a configurable number of seconds before
>> doing the restart?
>>
>>
> Does something as simple as changing your run script to be something like
>
> run_my_crashing_app || sleep 10
>
> Work? The run script will sit there for 10 seconds if your app fails. Not
> built in - but should accomplish the task pretty easily.

I think there's a correctness problem with this approach.  An
s6-supervise process supervises its direct child, which is supposed to
be the daemon under supervision.  But in this case it would be the
shell, so s6 doesn't know the daemon's PID, and so can't signal it.  If
s6 tries to stop the service, it will signal the shell, which won't
necessarily do the right thing and stop the application.  An s6 service
needs to ensure that the service is stopped if s6-supervise's direct
child dies.
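
To illustrate the point about the direct child: when a run script ends
with exec, the shell is replaced by the daemon and the PID stays the
same, so the process s6-supervise signals is the daemon itself. A
minimal sketch, where a nested sh stands in for the daemon (that
stand-in is an assumption for demonstration only):

```shell
#!/bin/sh
# The outer sh prints its own PID, then execs a second sh which prints
# its PID. Because exec replaces the process instead of forking, both
# numbers are the same.
out=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

shell_pid=$(printf '%s\n' "$out" | sed -n 1p)
daemon_pid=$(printf '%s\n' "$out" | sed -n 2p)

echo "PID before exec: $shell_pid, PID after exec: $daemon_pid"
```

In a real run script this would simply mean ending the script with
"exec run_my_crashing_app" rather than running the daemon as a child of
the shell.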


* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-22 20:21 ` John W Higgins
  2022-09-23  7:29   ` Tor Rune Skoglund
  2022-09-23 16:44   ` Alyssa Ross
@ 2022-09-24 12:33   ` Oliver Schad
  2022-09-26 23:00   ` Colin Booth
  3 siblings, 0 replies; 11+ messages in thread
From: Oliver Schad @ 2022-09-24 12:33 UTC (permalink / raw)
  To: supervision

On Thu, 22 Sep 2022 13:21:46 -0700
John W Higgins <wishdev@gmail.com> wrote:

> Good Day,
> 
> On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu>
> wrote: ...
> 
> > As a generic question, is there any setting with this s6+openrc
> > config that would make s6 "back off" a configurable number of
> > seconds before doing the restart?
> >
> >  
> Does something as simple as changing your run script to be something
> like
> 
> run_my_crashing_app || sleep 10
> 
> Work? The run script will sit there for 10 seconds if your app fails.
> Not built in - but should accomplish the task pretty easily.

You could probably outsource the backoff mechanism to a separate
program, which can also handle some statistics.

Something like this:
https://pastebin.com/aH3EDGLG

You would use it in your run script as:

exec with_backoff my_daemon
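
The linked script is not reproduced here. As a rough idea of what such
a wrapper could look like, here is a minimal sketch. The doubling
policy, the 60-second reset, the 64-second cap and the BACKOFF_STATE
file are all assumptions made for illustration; none of them is taken
from the linked script.

```shell
#!/bin/sh
# Hypothetical with_backoff sketch. Usage: exec with_backoff my_daemon

# next_delay PREV_DELAY UPTIME
#   PREV_DELAY: delay (seconds) applied before the previous start
#   UPTIME:     seconds the previous run lasted
# Prints the number of seconds to sleep before this start.
next_delay() {
    prev=$1
    uptime=$2
    if [ "$uptime" -ge 60 ]; then
        echo 0                  # previous run was stable: no delay
    elif [ "$prev" -eq 0 ]; then
        echo 1                  # first quick crash: short delay
    elif [ "$prev" -ge 32 ]; then
        echo 64                 # cap the delay
    else
        echo $((prev * 2))      # repeated quick crashes: double it
    fi
}

# Only act when a command to run was given.
if [ "$#" -gt 0 ]; then
    state=${BACKOFF_STATE:-./backoff.state}
    now=$(date +%s)
    prev=0
    last=0
    if [ -r "$state" ]; then
        read -r prev last < "$state"
    fi
    prev=${prev:-0}
    last=${last:-0}
    delay=$(next_delay "$prev" $((now - last)))
    # Record the delay and the time the daemon will actually start.
    echo "$delay $((now + delay))" > "$state"
    sleep "$delay"
    exec "$@"                   # the daemon replaces this wrapper
fi
```

Because the wrapper ends with exec, the daemon still ends up as the
supervised process once the sleep is over.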

Best Regards
Oli

-- 
Automatic-Server AG •••••
Oliver Schad
Geschäftsführer
Hardstr. 46
9434 Au | Schweiz

www.automatic-server.com | oliver.schad@automatic-server.com
Tel: +41 71 511 31 11 | Mobile: +41 76 330 03 47


* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-22 20:21 ` John W Higgins
                     ` (2 preceding siblings ...)
  2022-09-24 12:33   ` Oliver Schad
@ 2022-09-26 23:00   ` Colin Booth
  2022-09-30  9:34     ` Oliver Schad
  3 siblings, 1 reply; 11+ messages in thread
From: Colin Booth @ 2022-09-26 23:00 UTC (permalink / raw)
  To: supervision

On Thu, Sep 22, 2022 at 01:21:46PM -0700, John W Higgins wrote:
> Good Day,
> 
> On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu> wrote:
> ...
> 
> > As a generic question, is there any setting with this s6+openrc config
> > that would make s6 "back off" a configurable number of seconds before
> > doing the restart?
> >
> >
> Does something as simple as changing your run script to be something like
> 
> run_my_crashing_app || sleep 10
> 
> Work? The run script will sit there for 10 seconds if your app fails. Not
> built in - but should accomplish the task pretty easily.
> 
> John W Higgins
Put the backoff in the finish script. The API is to run finish with the
exit code and signal (if appropriate) as $1 and $2 respectively. With
that information you can have finish make decisions about if it should
delay restarting, set the service down in the face of a permanent error,
or so on. Note that by default finish has a five second deadline so if
you want to delay a restart for ten seconds you'll need to increase that
deadline with a timeout-finish file.
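
A minimal sketch of such a finish script follows. The 10-second value
and the rule "a clean exit restarts immediately" are illustrative
assumptions, not s6 defaults; only the meaning of $1 and $2 comes from
the description above.

```shell
#!/bin/sh
# Hypothetical ./finish script for one service directory.
# $1 is the exit code of the service, $2 the signal number (if any).

# backoff_seconds EXIT_CODE: print how long to wait before the restart.
backoff_seconds() {
    exit_code=${1:-0}
    if [ "$exit_code" -eq 0 ]; then
        echo 0      # clean exit: let the service restart right away
    else
        echo 10     # crash or signal: back off (assumed value)
    fi
}

# The service is restarted only after finish exits, so sleeping here
# delays the restart.
sleep "$(backoff_seconds "$1")"
```

For a 10-second sleep the service directory would also need a
timeout-finish file holding a larger value; that file is expressed in
milliseconds, so for example 15000.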

-- 
Colin Booth


* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-26 23:00   ` Colin Booth
@ 2022-09-30  9:34     ` Oliver Schad
  2022-09-30 13:21       ` Laurent Bercot
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Schad @ 2022-09-30  9:34 UTC (permalink / raw)
  Cc: supervision

On Mon, 26 Sep 2022 23:00:48 +0000
Colin Booth <colin@heliocat.net> wrote:

> On Thu, Sep 22, 2022 at 01:21:46PM -0700, John W Higgins wrote:
> > run_my_crashing_app || sleep 10
> > 
> > Work? The run script will sit there for 10 seconds if your app
> > fails. Not built in - but should accomplish the task pretty easily.
> > 
> > John W Higgins  
> Put the backoff in the finish script. The API is to run finish with
> the exit code and signal (if appropriate) as $1 and $2 respectively.
> With that information you can have finish make decisions about if it
> should delay restarting, set the service down in the face of a
> permanent error, or so on. Note that by default finish has a five
> second deadline so if you want to delay a restart for ten seconds
> you'll need to increase that deadline with a timeout-finish file.
 
That sounds good in theory, but in practice I would split the whole
thing into 2 parts:

- define start timings in "finish"
- do start timings in "run"

Why? I expect stopping to happen more or less immediately, and the
5-second timeout of "finish" meets my expectations.

Of course, semantically it is not quite right for the delay to be
part of starting. However, with the current features of s6 it is more
logical to extend the starting phase, because that is where the sleep
happens.

But I'm still not happy with that.

Hmm, I don't know if it would work: could we model that delay as a
service dependency? Only once the dependency is up can the service
itself be started.

Inside the finish script of the real service, you would write the
stats which control the behaviour of the delay service.

So we would have a "delay service" and a real service, and you wouldn't
have to model some invisible magic inside the run scripts. I.e. if you
called the service "delay-myapp", you would see exactly why your
service isn't starting right now. I was thinking about this question:
if you aren't familiar with a setup, what would you expect as an
administrator?

I wouldn't expect delay magic in run or finish scripts. I mean, imagine
you have to debug why a service doesn't start. Reading the run or
finish scripts for that is not much fun.

Best Regards
Oli


-- 
Automatic-Server AG •••••
Oliver Schad
Geschäftsführer
Hardstr. 46
9434 Au | Schweiz

www.automatic-server.com | oliver.schad@automatic-server.com
Tel: +41 71 511 31 11 | Mobile: +41 76 330 03 47


* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-30  9:34     ` Oliver Schad
@ 2022-09-30 13:21       ` Laurent Bercot
  2022-10-04 14:16         ` Tor Rune Skoglund
  2022-10-11  0:38         ` Dewayne
  0 siblings, 2 replies; 11+ messages in thread
From: Laurent Bercot @ 2022-09-30 13:21 UTC (permalink / raw)
  To: supervision


  I feel like this whole thread comes from mismatched expectations of
how s6 should behave.

  s6 always waits for one second between two successive starts of a
service. This ensures it never hogs the CPU by spamming a crashing
service. (With an asterisk, see below.)

  It does not wait for one second if the service has been started for
more than one second. The idea is, if the service has been running for
a while, and dies, you want it up _immediately_, and it's okay because
it was running for a while, and either was somewhat idle (in which case
you're not hurting for resources) or was hot (in which case you don't
want a 1-second delay).
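
  The rule described above can be sketched as a small function. This is
only a model of the behaviour as described, not the s6-supervise code;
in particular, waiting for the remainder of the second (rather than a
full second) is an assumption, and the function name is hypothetical.

```shell
#!/bin/sh
# restart_wait_ms UPTIME_MS
#   UPTIME_MS: how long the service ran before it died, in milliseconds.
# Prints how long to wait before the next start so that two successive
# starts are at least one second apart.
restart_wait_ms() {
    uptime_ms=$1
    if [ "$uptime_ms" -gt 1000 ]; then
        echo 0                          # ran for a while: restart now
    else
        echo $((1000 - uptime_ms))      # died quickly: wait out the second
    fi
}

restart_wait_ms 200     # service crashed 200 ms after starting
restart_wait_ms 1500    # service ran for 1.5 s before dying
```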

  The point of s6 is to maximize service uptime; this is why it does
not have a configurable backoff mechanism. When you want a service up,
you want it *up*, and that's the #1 priority. A service that keeps
crashing is an abnormal condition and is supposed to be handled by the
admin quickly enough - ideally, before getting to production.

  If the CPU is still being hogged by s6 despite the 1-second delay
and it's not that the service is running hot, then it means it's
crashing while still initializing, and the initialization takes more
than one second while using a lot of resources. In other words, you
got yourself a pretty heavy service that is crashing while starting.

  That should definitely be caught before it goes into production. But
if it's not possible, the least ugly workaround is indeed to sleep
in the finish script, and increase timeout-finish if needed.
(The "run_service || sleep 10" approach leaves a shell between
s6-supervise and run_service, so it's not good.)
./finish is generally supposed to be very short-lived, because the
"finishing" state is generally confusing to an observer, but in this
case it does not matter: it's an abnormal situation anyway.

  There is, however, one improvement I think I can safely make.
  Currently, the 1-second delay is computed from when the service
*starts*: if it has been running for more than one second, and crashes,
it restarts immediately, even if it has only been busy initializing,
which causes the resource hog OP is experiencing.
  I could change it to being computed from when the service is *ready*:
if the service dies before being ready, s6-supervise *always* waits for
1 second before restarting. The delay is only skipped if the service has
been *ready* for 1 second or more, which means it is really serving and
either idle (i.e. chill resource-wise) or hot (i.e. you don't want to
delay it).
  Does that sound like a valid improvement to you folks?
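
  As a model of the readiness-based rule proposed here (a sketch of the
described behaviour only, not actual s6-supervise code; the function
name and the -1 convention for "never became ready" are hypothetical):

```shell
#!/bin/sh
# restart_delay_ms READY_MS
#   READY_MS: how long the service had been ready before it died, in
#             milliseconds; -1 if it died before becoming ready.
# Prints the delay before the restart.
restart_delay_ms() {
    ready_ms=$1
    if [ "$ready_ms" -ge 1000 ]; then
        echo 0          # was really serving: restart immediately
    else
        echo 1000       # never ready, or ready only briefly: wait 1 s
    fi
}

restart_delay_ms -1     # crashed during initialization
restart_delay_ms 5000   # had been ready for 5 seconds
```

With this rule a heavy service that crashes while initializing is
always delayed, however long the initialization itself took.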

  Note that I don't think making the restart delay configurable is a good
trade-off. It adds complexity, size and failure cases to the s6-supervise
code, it adds another file to a service directory for users to remember,
it adds another avenue for configuration mistakes causing downtime, all
that to save resources for a pathological case. The difference between
0 second and 1 second of free CPU is significant; longer delays have
diminishing returns.

--
  Laurent



* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-30 13:21       ` Laurent Bercot
@ 2022-10-04 14:16         ` Tor Rune Skoglund
  2022-10-10 16:20           ` Laurent Bercot
  2022-10-11  0:38         ` Dewayne
  1 sibling, 1 reply; 11+ messages in thread
From: Tor Rune Skoglund @ 2022-10-04 14:16 UTC (permalink / raw)
  To: Laurent Bercot, supervision

On 30.09.2022 15:21, Laurent Bercot wrote:
>  There is, however, one improvement I think I can safely make.
>  Currently, the 1-second delay is computed from when the service 
> *starts*:
> if it has been running for more than one second, and crashes, it restarts
> immediately, even if it has only been busy initializing, which causes
> the resource hog OP is experiencing.
>  I could change it to being computed from when the service is *ready*:
> if the service dies before being ready, s6-supervise *always* waits for
> 1 second before restarting. The delay is only skipped if the service has
> been *ready* for 1 second or more, which means it really serving and
> either idle (i.e. chill resource-wise) or hot (i.e. you don't want to
> delay it).
>  Does that sound like a valid improvement to you folks?

To me this seems like a relevant improvement that would catch a 
problematic edge case issue.

- Tor Rune


* Re: "Back off" setting for crashing services with s6+openrc?
  2022-10-04 14:16         ` Tor Rune Skoglund
@ 2022-10-10 16:20           ` Laurent Bercot
  0 siblings, 0 replies; 11+ messages in thread
From: Laurent Bercot @ 2022-10-10 16:20 UTC (permalink / raw)
  To: supervision


>To me this seems like a relevant improvement that would catch a problematic edge case issue.

  I pushed such a change to the s6 git. A new numbered release should
be cut soon-ish.

--
  Laurent



* Re: "Back off" setting for crashing services with s6+openrc?
  2022-09-30 13:21       ` Laurent Bercot
  2022-10-04 14:16         ` Tor Rune Skoglund
@ 2022-10-11  0:38         ` Dewayne
  1 sibling, 0 replies; 11+ messages in thread
From: Dewayne @ 2022-10-11  0:38 UTC (permalink / raw)
  To: supervision

On 30/09/2022 11:21 pm, Laurent Bercot wrote:
>  A service that keeps crashing is an abnormal condition

Yes - sometimes that's what villains like to do :)

> Note that I don't think making the restart delay configurable is a good
> trade-off. It adds complexity, size and failure cases to the s6-supervise
> code, it adds another file to a service directory for users to remember,
> it adds another avenue for configuration mistakes causing downtime, all
> that to save resources for a pathological case. The difference between
> 0 second and 1 second of free CPU is significant; longer delays have
> diminishing returns.

I'm not sure that it's entirely pathological, as I also use 'finish' with
a 'sleep' and timeout-finish in an effort to reduce SROP issues.  It's
also fairly common for us to have a loadavg of 4x ncores.

In the general case, yes if you want a process running then it should be 
up asap, and for that I'm very appreciative.  :)

Ref:
For ROP https://en.wikipedia.org/wiki/Return-oriented_programming
SROP https://www.cs.vu.nl/~herbertb/papers/srop_sp14.pdf
