From: "Laurent Bercot"
To: "Xavier Stonestreet", supervision@list.skarnet.org
Subject: Re: s6-rc: timeout questions
Date: Wed, 18 Nov 2020 19:06:09 +0000

>Could you elaborate a little more about the state transition failures
>of oneshots caused by timeouts?
>
>Let's say for example the oneshot's up script times out, so the
>transition fails. From s6-rc's point of view the oneshot is still
>down. What actually happens to the process running the up script?
>Is it left running in the background? If yes, is it correct to assume
>that since s6-rc considers it down, another invocation of the s6-rc -u
>change command on the same oneshot will spawn another instance of the
>up script? If not, is it killed, and how?

It is correct to assume that another instance will be spawned, yes.
It was a difficult decision to make, and I'm still not sure it is the
right one. There are advantages and drawbacks to both approaches, but
at the end of the day it all comes down to: what set of actions will
leave the system in the *least* unknown state?

s6-rc's design assumes that timeouts, if they exist, are properly
calibrated; if a service times out, then it's not that the timeout is
too short, it's that something is really going wrong. So it considers
the transition failed. Now what should it do about the existing
process? Kill it or not?

If the process is allowed to live on, it may succeed, in which case
s6-rc's vision of the service will be wrong, but 1. it doesn't matter
because services should always be written as idempotent, and 2. it
means that the timeout was badly calibrated in the first place.
Or it may fail and s6-rc's vision will be correct.

If the process is killed, chances are that it will add to the problem
instead of solving it. For instance, if the process is hanging in
D state, killing it won't do anything except make the system more
unstable. If the process is doing some complex operation and not
properly sequencing its operations, sending it a signal may trigger
a bug. etc.

In the end I weighed that sending a signal would potentially cause
more harm than good, but I don't think using the opposite approach
would be wrong either.

>Test 2:
>s1 is down
>s2 is down
>s6-rc -u change s2
>s6-rc: fatal: timed out
>s6-svlisten1: fatal: timed out
>
>Timeout failure. Unexpected. I thought timeout-up and timeout-down
>applied to each atomic service individually, not to the entire
>dependency chain to bring it up or down.

Yes, it should be behaving as you say, and I suspect you have
uncovered a bug - not in the timeout management for a dependency
chain, but in the management of s6-rc's *global* timeout, which is the
one that is triggering here. I suspect I'm taking incorrect shortcuts
wrt timeout management, and will take a look. Thanks!

--
Laurent