On Fri, Jul 26, 2024 at 2:50 PM Andy Fiddaman <andy@omnios.org> wrote:

>
> On Fri, 26 Jul 2024, Peter Tribble wrote:
>
> > On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net> wrote:
> >
> > > Please can you review the following change?
> > >
> > >     15665 svc:/network/loopback exits successfully even if it fails
> > >     https://www.illumos.org/issues/15665
> > >     https://code.illumos.org/c/illumos-gate/+/3610
> > >
> >
> > When this first came up I expressed my belief that making this change is
> > the wrong thing to do, and I'll express it again.
>
> Apologies Peter. I had recalled that your objection to the original change
> was mostly around the addition of the extra dependency to the service,
> which I've removed in this new patch set (that is
> https://www.illumos.org/issues/15664 which remains open).
>
> > If this service fails, I think the best thing to do is drive on so that
> > the system can come up as far as possible to maximise the chance that the
> > system comes up far enough for an administrator to be able to get in and
> > fix it. Not putting the service into maintenance is a feature, not a bug.
>
> The impetus for this change is that over the past couple of years we've had
> a number of occasions where we've had to debug networking problems that
> have had their root in the fact that the loopback interfaces were not
> created for one reason or another. It happened again yesterday in a
> non-global zone. In all of these, it would have been really useful and
> expedited diagnosis if the service had gone into maintenance. I understand
> the perspective of allowing the system to come up as far as possible - to
> the point of remote access even - but it still seems wrong for a service to
> report success where it has not actually achieved its goal. Is there some
> middle ground here?
>
> > I think generally it would be wrong for a single voice to veto any
> > change, which means I would generally be uncomfortable sticking a -1 on
> > it, but if this does get into the gate it will be reverted in Tribblix.
>
> Understood. This definitely warrants further discussion.
>

As I mentioned in my other reply, it seems that what we're after is some
way to mark a service as having generated an error without bringing the
system down by going into maintenance. Some sort of degraded mode.

We have a couple of SMF exit codes that look interesting -
SMF_EXIT_MON_DEGRADE and SMF_EXIT_MON_OFFLINE, but I'm sure they were never
implemented. There's even an issue in this area -
https://www.illumos.org/issues/7711 (which refers back to 8891 which is
another case of something dropping into maintenance breaking the entire
system).

Interestingly, looking at the ssh method script for S11
https://github.com/oracle/solaris-userland/blob/master/components/openssh/sources/sshd.sh#L132
you see the following:

# Put the service into degraded mode in case some of previous
# configuration tasks failed.
# We do not let the service enter maintenance mode, since
# we want to keep the system as much operating as feasible.
#
if [ $ret1 -ne 0 ]; then
    smf_method_exit $SMF_EXIT_DEGRADED "hostkey_configuration" \
        "Failed to generate missing host keys."
fi

So the equivalent of SMF_EXIT_DEGRADED might be what we're looking for?
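
To make that concrete, here is a rough sketch of how a loopback method
script might pick its exit code if such a degraded exit existed. This is
illustration only, not the proposed patch: SMF_EXIT_DEGRADED is not defined
on illumos today, the value used for it is a placeholder I have assumed so
that the sketch runs, and loopback_exit_code, create_lo0_v4 and
create_lo0_v6 are hypothetical names standing in for the real plumbing.

```shell
#!/bin/sh
# Sketch only. A real method script would take its exit codes from
# /lib/svc/share/smf_include.sh rather than defining them here.
SMF_EXIT_OK=0
# ASSUMPTION: placeholder value; illumos has no SMF_EXIT_DEGRADED yet.
SMF_EXIT_DEGRADED=${SMF_EXIT_DEGRADED:-103}

# Hypothetical helper: map the number of failed configuration steps to
# the exit code the method would finish with.
loopback_exit_code() {
    if [ "$1" -eq 0 ]; then
        echo "$SMF_EXIT_OK"
    else
        # Report the failure without dropping into maintenance.
        echo "$SMF_EXIT_DEGRADED"
    fi
}

# Stand-ins for the real interface creation; the second one simulates
# a failure so the degraded path is exercised.
create_lo0_v4() { return 0; }
create_lo0_v6() { return 1; }

failures=0
create_lo0_v4 || failures=$((failures + 1))
create_lo0_v6 || failures=$((failures + 1))

code=$(loopback_exit_code "$failures")
echo "method would exit with $code"
```

The point is only that the method still reports the failure, so it shows up
in svcs output, while dependent services are allowed to carry on.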

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/