From: "Laurent Bercot"
To: supervision
Subject: Re: s6 daemon restart feature enhancement suggestion
Date: Sun, 26 May 2024 19:53:39 +0000

>Let me say that a $daemon i.e. wpa_supplicant or iwd providing
>$service=WiFi{wpa} has been pulled into the s6-rc compiled db and
>started in the supervision tree.
>But the system doesn't have the hardware to support that, or some
>important resource is unavailable.

 So, here's my question: if the system doesn't have the hardware to
support that, why is the daemon in the database in the first place?

 s6-rc, in its current incarnation, is very static when it comes to
its service database; this is by design. The point is that when you
have a compiled service database, you know what's in there, you know
what it does, and you know what services will be running when you boot
your system. Adding dynamism goes against that design.
I understand the value of flexibility (this is why most distributions
won't use s6-rc as is: they need more flexibility in their service
manager) but there's a trade-off with reliability, and s6-rc weighs
heavily on the reliability side.

 If you are building a distribution aimed at supporting several kinds
of hardware, I suggest adding flexibility at the *source database*
level, and building the compiled database at system configuration time
(or, in extreme cases, at boot time, though I do not recommend that if
you can avoid it, since you lose the static bootability guarantee).

 If your machine can't run wpa_supplicant, then the service manager
should not attempt to run wpa_supplicant in the first place, so the
wpa_supplicant service should not appear in the top bundle.

 Lacking resources is a different issue: it's a temporary error, and
it makes sense for the service to fail (and be restarted) if it cannot
reserve the resources it needs. If you want to report permanent
failure, and stop trying to bring the service up, after a certain
amount of time, you can write a timeout-up file, or have a finish
script exit 125, see below.

>A mechanism should be prepared, to let $daemon inform it's instance of
>s6-supervise that it can't run, or can't provide $service / it's
>services.

 If you have the information before the machine boots, you should use
the information to prune your service database, and compile a database
that you know will work with your system.

 If you don't have the information before the machine boots, then a
service failing to start is a normal temporary failure, and s6 will
attempt to restart the service until it reports permanent failure.
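
 As a rough illustration of pruning at the source database level, here
is a minimal sketch in Python. It assumes a Linux system where Wi-Fi
interfaces expose a "wireless" or "phy80211" entry under /sys/class/net,
and it assumes the bundle is described by a "type" file and a "contents"
file listing one service name per line. The function names, the "default"
bundle name and the "syslogd" service are hypothetical and only serve
the example.

```python
from pathlib import Path


def has_wireless_interface(sysfs_net="/sys/class/net"):
    """Return True if any network interface looks like a Wi-Fi device.

    Assumption: wireless interfaces expose a 'wireless' or 'phy80211'
    entry under <sysfs_net>/<iface>/ on Linux.
    """
    root = Path(sysfs_net)
    if not root.is_dir():
        return False
    return any(
        (iface / "wireless").exists() or (iface / "phy80211").exists()
        for iface in root.iterdir()
    )


def select_services(always, conditional):
    """Keep unconditional services, plus each conditional service whose
    hardware check passed. 'conditional' is a list of (name, present)."""
    return list(always) + [name for name, present in conditional if present]


def write_bundle(source_dir, bundle_name, services):
    """Write a bundle definition (type + contents) into the source
    database directory and return the bundle path."""
    bundle = Path(source_dir) / bundle_name
    bundle.mkdir(parents=True, exist_ok=True)
    (bundle / "type").write_text("bundle\n")
    (bundle / "contents").write_text("".join(s + "\n" for s in services))
    return bundle


# Hypothetical use at system configuration time:
#   services = select_services(
#       ["syslogd"],
#       [("wpa_supplicant", has_wireless_interface())],
#   )
#   write_bundle("/etc/s6-rc/source", "default", services)
```

 Once the source directory has been written this way, running
s6-rc-compile on it at configuration time gives a compiled database
that only names services the machine can actually run, so the static
guarantee is kept.
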
You have several ways of marking a service as permanently failed:

 - (only with s6-rc) you can have a timeout-up file, see
https://skarnet.org/software/s6-rc/s6-rc-compile.html and look for
"timeout-up"
 - (generic s6) you can have a finish script that uses data that has
been collected by s6-supervise to determine whether a permanent
failure should be reported or not. A finish script can report
permanent failure by exiting 125. For instance, using s6-permafailon,
see https://skarnet.org/software/s6/s6-permafailon.html , allows you
to tell s6 that if the service exits nonzero too many times in a given
number of seconds, then it's hopeless.

 Does this help?

--
 Laurent
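
 To illustrate the generic s6 option above: the finish script receives
the run script's exit code as its first argument, so a run script can
signal "this can never work here" with an agreed exit code, and the
finish script can turn that into exit 125. A minimal sketch in Python,
assuming a hypothetical convention where the run script exits 78 when
the hardware is missing (that value and the function names are
illustrative, not something s6 defines):

```python
import sys

# Hypothetical convention: the run script exits with this code when it
# detects that the daemon can never work on this machine.
UNSUPPORTED_EXIT_CODE = 78

# A finish script exiting 125 reports permanent failure to s6-supervise.
PERMANENT_FAILURE = 125


def finish_exit_code(run_exit_code):
    """Map the run script's exit code to the finish script's exit code."""
    if run_exit_code == UNSUPPORTED_EXIT_CODE:
        return PERMANENT_FAILURE
    # Anything else is treated as a temporary failure: exit 0 and let
    # s6-supervise restart the service as usual.
    return 0


def main(argv):
    # argv[1] is the run script's exit code (256 if killed by a signal).
    return finish_exit_code(int(argv[1]))


# A real finish script would end with:
#   sys.exit(main(sys.argv))
```

 In practice a finish script is more likely to be a few lines of
execline or shell; the point is only that the decision between a
temporary and a permanent failure lives in the finish script, not in
s6-supervise.
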