From: "Laurent Bercot"
To: "supervision@list.skarnet.org"
Subject: Re: "Back off" setting for crashing services with s6+openrc?
Date: Fri, 30 Sep 2022 13:21:06 +0000
In-Reply-To: <20220930113440.72c09a4f@flunder.oschad.de>
References: <76856b27-1653-4e3c-28a5-737b63dea1f0@fourc.eu> <20220930113440.72c09a4f@flunder.oschad.de>

I feel like this whole thread comes from mismatched expectations of
how s6 should behave.

s6 always waits one second between two successive starts of a service.
This ensures it never hogs the CPU by spamming a crashing service.
(With an asterisk; see below.)

It does not wait if the service has already been up for more than one
second. The idea is: if the service has been running for a while and
dies, you want it back up _immediately_, and that is okay because it was
running for a while and was either somewhat idle (in which case you're
not hurting for resources) or hot (in which case you don't want a
1-second delay). The point of s6 is to maximize service uptime; this is
why it does not have a configurable backoff mechanism.
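The current rule can be sketched in shell (a simplification for
illustration only, not the actual s6-supervise logic; the function name
is made up):

```shell
#!/bin/sh
# Sketch of s6-supervise's current restart-delay rule, simplified:
# if the service died less than 1 second after being started, wait out
# a full second before restarting; otherwise restart at once.
restart_policy() {
  # $1 = whole seconds the service stayed up before dying
  if [ "$1" -lt 1 ]; then
    echo "wait 1 second, then restart"
  else
    echo "restart immediately"
  fi
}

restart_policy 0    # crashed during startup
restart_policy 30   # died after running for a while
```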
When you want a service up, you want it *up*, and that's the #1
priority. A service that keeps crashing is an abnormal condition and is
supposed to be handled by the admin quickly enough, ideally before it
reaches production.

If the CPU is still being hogged by s6 despite the 1-second delay, and
it's not that the service is running hot, then the service is crashing
while still initializing, and its initialization takes more than one
second while using a lot of resources. In other words, you got yourself
a pretty heavy service that is crashing while starting. That should
definitely be caught before it goes into production. But if that's not
possible, the least ugly workaround is indeed to sleep in the finish
script, increasing timeout-finish if needed. (The "run_service || sleep 10"
approach leaves a shell running between s6-supervise and run_service,
so it's not good.) ./finish is generally supposed to be very
short-lived, because the "finishing" state is generally confusing to an
observer, but in this case it does not matter: it's an abnormal
situation anyway.

There is, however, one improvement I think I can safely make.
Currently, the 1-second delay is computed from when the service
*starts*: if it has been running for more than one second and crashes,
it restarts immediately, even if it has only been busy initializing,
which causes the resource hog the OP is experiencing. I could change it
to be computed from when the service is *ready*: if the service dies
before being ready, s6-supervise *always* waits one second before
restarting. The delay is only skipped if the service has been *ready*
for one second or more, which means it is really serving and is either
idle (i.e. chill resource-wise) or hot (i.e. you don't want to delay
it).

Does that sound like a valid improvement to you folks?

Note that I don't think making the restart delay configurable is a
good trade-off.
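As a concrete sketch of that workaround (the service directory name
and the 10-second delay are placeholder choices, not anything
prescribed), this sets up a sleeping ./finish plus a matching
timeout-finish in a service directory:

```shell
#!/bin/sh
# Sketch: install a back-off workaround into a service directory.
# "myservice" and the 10-second delay are arbitrary placeholders.
svcdir=./myservice
mkdir -p "$svcdir"

# ./finish runs after ./run exits (s6 passes it the exit code and
# signal number as arguments); sleeping here delays the next start.
cat > "$svcdir/finish" <<'EOF'
#!/bin/sh
# Delay the next start attempt by 10 seconds on every exit.
sleep 10
EOF
chmod +x "$svcdir/finish"

# s6-supervise kills ./finish after 5 seconds by default, so raise
# timeout-finish (a duration in milliseconds) above the sleep.
echo 15000 > "$svcdir/timeout-finish"
```

Since s6-supervise counts the supervised process as "finishing" for the
whole sleep, the longer timeout-finish is what keeps the workaround
from being cut short.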
It adds complexity, size and failure cases to the s6-supervise code;
it adds another file to a service directory for users to remember; it
adds another avenue for configuration mistakes causing downtime; all
that to save resources in a pathological case. The difference between
0 seconds and 1 second of free CPU is significant; longer delays have
diminishing returns.

-- 
Laurent