* "Back off" setting for crashing services with s6+openrc?
From: Tor Rune Skoglund @ 2022-09-22 19:45 UTC
To: supervision

Hello List,

I am new to the list, so sorry if this question has been raised before. However, I've made an honest attempt to figure this out before posting here.

I am using s6 with openrc on Gentoo, along with a couple of custom-made services which are not fully stable under some conditions; i.e. they sometimes simply crash and stop, over and over, which causes s6 to instantly restart them. The result is a crash-restart-crash-restart... loop with high CPU usage.

Of course, the correct thing would be to fix the crashing service, but that is not an option right now.

As a generic question, is there any setting with this s6+openrc config that would make s6 "back off" a configurable number of seconds before doing the restart?

Any help appreciated. Thanks.

BR,
Tor Rune Skoglund, trs@fourc.eu
* Re: "Back off" setting for crashing services with s6+openrc?
From: John W Higgins @ 2022-09-22 20:21 UTC
To: supervision

Good Day,

On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu> wrote:
...
> As a generic question, is there any setting with this s6+openrc config
> that would make s6 "back off" a configurable number of seconds before
> doing the restart?

Does something as simple as changing your run script to be something like

    run_my_crashing_app || sleep 10

Work? The run script will sit there for 10 seconds if your app fails. Not built in - but should accomplish the task pretty easily.

John W Higgins
* Re: "Back off" setting for crashing services with s6+openrc?
From: Tor Rune Skoglund @ 2022-09-23 7:29 UTC
To: supervision

Hello,

On 22.09.2022 22:21, John W Higgins wrote:
> On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu> wrote:
>> As a generic question, is there any setting with this s6+openrc config
>> that would make s6 "back off" a configurable number of seconds before
>> doing the restart?
> Does something as simple as changing your run script to be something like
>
>     run_my_crashing_app || sleep 10
>
> Work? The run script will sit there for 10 seconds if your app fails. Not
> built in - but should accomplish the task pretty easily.

Thanks, yes, that would work. However, I was hoping this could be set at a more global s6 level so that it would catch any misbehaving service in general.

- Tor Rune Skoglund
* Re: "Back off" setting for crashing services with s6+openrc?
From: Alyssa Ross @ 2022-09-23 16:44 UTC
To: John W Higgins; +Cc: Tor Rune Skoglund, supervision

John W Higgins <wishdev@gmail.com> writes:

> Good Day,
>
> On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu> wrote:
> ...
>
>> As a generic question, is there any setting with this s6+openrc config
>> that would make s6 "back off" a configurable number of seconds before
>> doing the restart?
>
> Does something as simple as changing your run script to be something like
>
>     run_my_crashing_app || sleep 10
>
> Work? The run script will sit there for 10 seconds if your app fails. Not
> built in - but should accomplish the task pretty easily.

I think there's a correctness problem with this approach. An s6-supervise process supervises its direct child, which is supposed to be the daemon under supervision. But in this case it would be the shell, so s6 doesn't know the daemon's PID, and so can't signal it. If s6 tries to stop the service, it will signal the shell, which won't necessarily do the right thing and stop the application. An s6 service needs to ensure that the service is stopped if s6-supervise's direct child dies.
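To illustrate the direct-child point in the message above, here is a minimal sketch (not taken from the thread) showing that a program started with exec keeps the wrapper shell's PID, while a program started without exec gets a new one. The function names with_exec and without_exec are hypothetical and exist only for this demonstration.

```shell
#!/bin/sh
# Each function prints two PIDs: first the wrapper shell's, then the PID of
# the program the wrapper starts (a second shell stands in for the daemon).

# With exec: the program replaces the wrapper shell, so both PIDs match.
with_exec() {
    sh -c 'echo $$; exec sh -c "echo \$\$"'
}

# Without exec: the program runs as a child of the wrapper, under a new PID.
# The trailing "true" keeps the shell from turning the last command into an
# implicit exec, which some shells do as an optimization.
without_exec() {
    sh -c 'echo $$; sh -c "echo \$\$"; true'
}

with_exec
without_exec
```

A run script that ends in `exec my_daemon` (a placeholder name) therefore leaves the daemon as the supervisor's direct child. The catch is that nothing after an exec ever runs, so `exec my_daemon || sleep 10` would not sleep on a crash.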
* Re: "Back off" setting for crashing services with s6+openrc?
From: Oliver Schad @ 2022-09-24 12:33 UTC
To: supervision

On Thu, 22 Sep 2022 13:21:46 -0700 John W Higgins <wishdev@gmail.com> wrote:

> Good Day,
>
> On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu>
> wrote: ...
>
> > As a generic question, is there any setting with this s6+openrc
> > config that would make s6 "back off" a configurable number of
> > seconds before doing the restart?
>
> Does something as simple as changing your run script to be something
> like
>
>     run_my_crashing_app || sleep 10
>
> Work? The run script will sit there for 10 seconds if your app fails.
> Not built in - but should accomplish the task pretty easily.

You could probably outsource the backoff mechanism to a wrapper that can keep some statistics. Something like this:

https://pastebin.com/aH3EDGLG

You would use it in your run script as:

    exec with_backoff my_daemon

Best Regards
Oli
* Re: "Back off" setting for crashing services with s6+openrc?
From: Colin Booth @ 2022-09-26 23:00 UTC
To: supervision

On Thu, Sep 22, 2022 at 01:21:46PM -0700, John W Higgins wrote:
> Good Day,
>
> On Thu, Sep 22, 2022 at 1:13 PM Tor Rune Skoglund <trs@fourc.eu> wrote:
> ...
>
> > As a generic question, is there any setting with this s6+openrc config
> > that would make s6 "back off" a configurable number of seconds before
> > doing the restart?
>
> Does something as simple as changing your run script to be something like
>
>     run_my_crashing_app || sleep 10
>
> Work? The run script will sit there for 10 seconds if your app fails. Not
> built in - but should accomplish the task pretty easily.
>
> John W Higgins

Put the backoff in the finish script. The API is to run finish with the exit code and signal (if appropriate) as $1 and $2 respectively. With that information you can have finish make decisions about if it should delay restarting, set the service down in the face of a permanent error, or so on. Note that by default finish has a five second deadline so if you want to delay a restart for ten seconds you'll need to increase that deadline with a timeout-finish file.

--
Colin Booth
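The finish-script approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a script from the thread: the function name backoff_seconds and the policy (no delay after a clean exit, 10 seconds otherwise) are hypothetical.

```shell
#!/bin/sh
# Sketch of a ./finish script. Per the message above, s6-supervise runs it
# with the exit code as $1 and the signal (if appropriate) as $2.

# Hypothetical policy: print the number of seconds to wait before the next
# restart. $2 is accepted so a finer policy could also look at the signal.
backoff_seconds() {
    exit_code=$1
    if [ "$exit_code" -eq 0 ]; then
        echo 0      # clean exit: no extra delay
    else
        echo 10     # crash or other failure: hold the restart back
    fi
}

# Sleep only when called with the arguments a supervisor would pass.
if [ "$#" -ge 2 ]; then
    sleep "$(backoff_seconds "$1" "$2")"
fi
```

A 10-second sleep does not fit in the default five-second deadline, so the service directory would also need a timeout-finish file with a larger value, for example `echo 15000 > timeout-finish`. The value is understood to be in milliseconds; that unit is an assumption here and is worth checking against the s6 service directory documentation.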
* Re: "Back off" setting for crashing services with s6+openrc?
From: Oliver Schad @ 2022-09-30 9:34 UTC
Cc: supervision

On Mon, 26 Sep 2022 23:00:48 +0000 Colin Booth <colin@heliocat.net> wrote:

> On Thu, Sep 22, 2022 at 01:21:46PM -0700, John W Higgins wrote:
> >     run_my_crashing_app || sleep 10
> >
> > Work? The run script will sit there for 10 seconds if your app
> > fails. Not built in - but should accomplish the task pretty easily.
> >
> > John W Higgins
> Put the backoff in the finish script. The API is to run finish with
> the exit code and signal (if appropriate) as $1 and $2 respectively.
> With that information you can have finish make decisions about if it
> should delay restarting, set the service down in the face of a
> permanent error, or so on. Note that by default finish has a five
> second deadline so if you want to delay a restart for ten seconds
> you'll need to increase that deadline with a timeout-finish file.

Sounds good in theory, but in practice I would split the whole thing into 2 parts:

- define start timings in "finish"
- do start timings in "run"

Why? I expect stopping to happen more or less immediately, and the 5-second timeout of "finish" meets my expectations.

Of course, making the delay part of the start is not quite the right semantics for starting. However, with the current features of s6 it's more logical to extend the starting phase, because it sleeps for some time.

But I'm still not happy with that. Hmm, I don't know if it would work: could we model that delay as a service dependency? If the dependency is up, the service itself can be started. Inside the finish script of the real service, you would write the stats that control the behaviour of the delay service.

So we would have a "delay service" and a real service, and you wouldn't have to model some invisible magic inside run scripts. E.g. if you called the service "delay-myapp", you would see exactly why your service isn't starting right now.

I was thinking about this issue from an administrator's point of view: if you aren't familiar with a setup, what would you expect? I wouldn't expect delay magic in run or finish scripts. I mean, imagine you have to debug why a service doesn't start. Reading the run or finish scripts for that is not that fancy.

Best Regards
Oli
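The two-part split proposed in this message (finish defines the timing, run applies it) might look like the sketch below. Everything named here is an assumption for illustration: the state file path, the helper names record_delay and pending_delay, and the doubling policy with a cap are not from the thread.

```shell
#!/bin/sh
# Sketch: ./finish records a delay, ./run applies it before exec'ing the
# daemon. STATE_FILE, MAX_DELAY and the doubling policy are hypothetical.
STATE_FILE=${STATE_FILE:-./backoff-delay}
MAX_DELAY=60

# Print the delay (in seconds) currently stored, or 0 if none is stored.
pending_delay() {
    stored=$(cat "$STATE_FILE" 2>/dev/null)
    echo "${stored:-0}"
}

# Called from ./finish with the exit code: reset the delay after a clean
# exit, otherwise start at 1 second and double it up to MAX_DELAY.
record_delay() {
    exit_code=$1
    previous=$(pending_delay)
    if [ "$exit_code" -eq 0 ]; then
        next=0
    elif [ "$previous" -eq 0 ]; then
        next=1
    else
        next=$((previous * 2))
        if [ "$next" -gt "$MAX_DELAY" ]; then
            next=$MAX_DELAY
        fi
    fi
    echo "$next" > "$STATE_FILE"
}

# Intended use (my_daemon is a placeholder):
#   in ./finish:  record_delay "$1"
#   in ./run:     sleep "$(pending_delay)"; exec my_daemon
```

As the message itself points out, sleeping in run stretches the start phase rather than the stop phase, so the service may look started while it is only waiting. In this sketch the stored delay is only reset by a clean exit.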
* Re: "Back off" setting for crashing services with s6+openrc?
From: Laurent Bercot @ 2022-09-30 13:21 UTC
To: supervision

I feel like this whole thread comes from mismatched expectations of how s6 should behave.

s6 always waits for one second before two successive starts of a service. This ensures it never hogs the CPU by spamming a crashing service. (With an asterisk, see below.)

It does not wait for one second if the service has been started for more than one second. The idea is, if the service has been running for a while, and dies, you want it up _immediately_, and it's okay because it was running for a while, and either was somewhat idle (in which case you're not hurting for resources) or was hot (in which case you don't want a 1-second delay).

The point of s6 is to maximize service uptime; this is why it does not have a configurable backoff mechanism. When you want a service up, you want it *up*, and that's the #1 priority. A service that keeps crashing is an abnormal condition and is supposed to be handled by the admin quickly enough - ideally, before getting to production.

If the CPU is still being hogged by s6 despite the 1-second delay and it's not that the service is running hot, then it means it's crashing while still initializing, and the initialization takes more than one second while using a lot of resources. In other words, you got yourself a pretty heavy service that is crashing while starting. That should definitely be caught before it goes in production. But if it's not possible, the least ugly workaround is indeed to sleep in the finish script, and increasing timeout-finish if needed. (The "run_service || sleep 10" approach leaves a shell between s6-supervise and run_service, so it's not good.)

./finish is generally supposed to be very short-lived, because the "finishing" state is generally confusing to an observer, but in this case it does not matter: it's an abnormal situation anyway.

There is, however, one improvement I think I can safely make. Currently, the 1-second delay is computed from when the service *starts*: if it has been running for more than one second, and crashes, it restarts immediately, even if it has only been busy initializing, which causes the resource hog OP is experiencing.

I could change it to being computed from when the service is *ready*: if the service dies before being ready, s6-supervise *always* waits for 1 second before restarting. The delay is only skipped if the service has been *ready* for 1 second or more, which means it is really serving and either idle (i.e. chill resource-wise) or hot (i.e. you don't want to delay it).

Does that sound like a valid improvement to you folks?

Note that I don't think making the restart delay configurable is a good trade-off. It adds complexity, size and failure cases to the s6-supervise code, it adds another file to a service directory for users to remember, it adds another avenue for configuration mistakes causing downtime, all that to save resources for a pathological case. The difference between 0 second and 1 second of free CPU is significant; longer delays have diminishing returns.

--
Laurent
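The difference between the current rule and the proposed one can be shown with a toy model. This is only an illustration of the rules as described in the message above, not s6-supervise's actual code; the function names and the convention of passing -1 for "never became ready" are hypothetical.

```shell
#!/bin/sh
# Toy model of the restart-delay decision. Times are in milliseconds.

# Current rule: the restart is delayed only if the service had been running
# for less than 1 second when it died.
delays_by_start() {
    uptime_ms=$1
    [ "$uptime_ms" -lt 1000 ]
}

# Proposed rule: the restart is delayed unless the service had been *ready*
# for at least 1 second. Pass -1 if it died before becoming ready.
delays_by_ready() {
    ready_ms=$1
    [ "$ready_ms" -lt 1000 ]
}

# A heavy service that initializes for 3 seconds and crashes before readiness:
if delays_by_start 3000; then
    echo "current rule: restart delayed"
else
    echo "current rule: immediate restart"
fi
if delays_by_ready -1; then
    echo "proposed rule: restart delayed"
else
    echo "proposed rule: immediate restart"
fi
```

Under this model the heavy, slow-starting service restarts immediately with the current rule and is held back with the proposed one, which matches the case the original poster ran into.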
* Re: "Back off" setting for crashing services with s6+openrc?
From: Tor Rune Skoglund @ 2022-10-04 14:16 UTC
To: Laurent Bercot, supervision

On 30.09.2022 15:21, Laurent Bercot wrote:
> There is, however, one improvement I think I can safely make.
> Currently, the 1-second delay is computed from when the service
> *starts*:
> if it has been running for more than one second, and crashes, it restarts
> immediately, even if it has only been busy initializing, which causes
> the resource hog OP is experiencing.
> I could change it to being computed from when the service is *ready*:
> if the service dies before being ready, s6-supervise *always* waits for
> 1 second before restarting. The delay is only skipped if the service has
> been *ready* for 1 second or more, which means it really serving and
> either idle (i.e. chill resource-wise) or hot (i.e. you don't want to
> delay it).
> Does that sound like a valid improvement to you folks?

To me this seems like a relevant improvement that would catch a problematic edge case.

- Tor Rune
* Re: "Back off" setting for crashing services with s6+openrc?
From: Laurent Bercot @ 2022-10-10 16:20 UTC
To: supervision

>To me this seems like a relevant improvement that would catch a problematic edge case issue.

I pushed such a change to the s6 git. A new numbered release should be cut soon-ish.

--
Laurent
* Re: "Back off" setting for crashing services with s6+openrc?
From: Dewayne @ 2022-10-11 0:38 UTC
To: supervision

On 30/09/2022 11:21 pm, Laurent Bercot wrote:
> A service that keeps crashing is an abnormal condition

Yes - sometimes that's what villains like to do :)

> Note that I don't think making the restart delay configurable is a good
> trade-off. It adds complexity, size and failure cases to the s6-supervise
> code, it adds another file to a service directory for users to remember,
> it adds another avenue for configuration mistakes causing downtime, all
> that to save resources for a pathological case. The difference between
> 0 second and 1 second of free CPU is significant; longer delays have
> diminishing returns.

I'm not sure that it's entirely pathological, as I also use 'finish' with a 'sleep' and timeout-finish in an effort to reduce SROP issues. It's also fairly common for us to have a loadavg of 4x ncores.

In the general case, yes, if you want a process running then it should be up asap, and for that I'm very appreciative. :)

Ref:
For ROP https://en.wikipedia.org/wiki/Return-oriented_programming
SROP https://www.cs.vu.nl/~herbertb/papers/srop_sp14.pdf