On Fri, Jul 26, 2024 at 2:50 PM Andy Fiddaman wrote:

> On Fri, 26 Jul 2024, Peter Tribble wrote:
>
> > On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman wrote:
> >
> > > Please can you review the following change?
> > >
> > > 15665 svc:/network/loopback exits successfully even if it fails
> > > https://www.illumos.org/issues/15665
> > > https://code.illumos.org/c/illumos-gate/+/3610
> >
> > When this first came up I expressed my belief that making this change
> > is the wrong thing to do, and I'll express it again.
>
> Apologies Peter. I had recalled that your objection to the original
> change was mostly around the addition of the extra dependency to the
> service, which I've removed in this new patch set (that is
> https://www.illumos.org/issues/15664 which remains open).
>
> > If this service fails, I think the best thing to do is drive on so
> > that the system can come up as far as possible to maximise the chance
> > that the system comes up far enough for an administrator to be able to
> > get in and fix it. Not putting the service into maintenance is a
> > feature, not a bug.
>
> The impetus for this change is that over the past couple of years we've
> had a number of occasions where we've had to debug networking problems
> that have had their root in the fact that the loopback interfaces were
> not created for one reason or another. It happened again yesterday in a
> non-global zone. In all of these, it would have been really useful and
> expedited diagnosis if the service had gone into maintenance. I
> understand the perspective of allowing the system to come up as far as
> possible - to the point of remote access even - but it still seems wrong
> for a service to report success where it has not actually achieved its
> goal. Is there some middle ground here?
>
> > I think generally it would be wrong for a single voice to veto any
> > change, which means I would generally be uncomfortable sticking a -1
> > on it, but if this does get into the gate it will be reverted in
> > Tribblix.
>
> Understood. This definitely warrants further discussion.

As I mentioned in my other reply, it seems that what we're after is some
way to mark a service as having generated an error without bringing the
system down by going into maintenance. Some sort of degraded mode.

We have a couple of SMF exit codes that look interesting -
SMF_EXIT_MON_DEGRADE and SMF_EXIT_MON_OFFLINE, but I'm sure they were
never implemented. There's even an issue in this area -
https://www.illumos.org/issues/7711 (which refers back to 8891 which is
another case of something dropping into maintenance breaking the entire
system).

Interestingly, looking at the ssh method script for S11

https://github.com/oracle/solaris-userland/blob/master/components/openssh/sources/sshd.sh#L132

you see the following:

    # Put the service into degraded mode in case some of previous
    # configuration tasks failed.
    # We do not let the service enter maintenance mode, since
    # we want to keep the system as much operating as feasible.
    #
    if [ $ret1 -ne 0 ]; then
        smf_method_exit $SMF_EXIT_DEGRADED "hostkey_configuration" \
            "Failed to generate missing host keys."
    fi

So the equivalent of SMF_EXIT_DEGRADED might be what we're looking for?

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/