On Fri, Jul 26, 2024 at 2:50 PM Andy Fiddaman <andy@omnios.org> wrote:

On Fri, 26 Jul 2024, Peter Tribble wrote:

> On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net> wrote:
>
> > Please can you review the following change?
> >
> > 15665 svc:/network/loopback exits successfully even if it fails
> > https://www.illumos.org/issues/15665
> > https://code.illumos.org/c/illumos-gate/+/3610
> >
>
> When this first came up I expressed my belief that making this change is
> the wrong thing to do, and I'll express it again.

Apologies Peter. I had recalled that your objection to the original change
was mostly around the addition of the extra dependency to the service, which
I've removed in this new patch set (that is
https://www.illumos.org/issues/15664 which remains open).

> If this service fails, I think the best thing to do is drive on so that the
> system can come up as far as possible to maximise the chance that the system
> comes up far enough for an administrator to be able to get in and fix it. Not
> putting the service into maintenance is a feature, not a bug.

The impetus for this change is that over the past couple of years we've had
a number of occasions where we've had to debug networking problems that
have had their root in the fact that the loopback interfaces were not created
for one reason or another. It happened again yesterday in a non-global zone. In
all of these, it would have been really useful and expedited diagnosis if the
service had gone into maintenance. I understand the perspective of allowing the
system to come up as far as possible - to the point of remote access even - but
it still seems wrong for a service to report success where it has not actually
achieved its goal. Is there some middle ground here?

> I think generally it would be wrong for a single voice to veto any change,
> which means I would generally be uncomfortable sticking a -1 on it, but if
> this does get into the gate it will be reverted in Tribblix.

Understood. This definitely warrants further discussion.

As I mentioned in my other reply, it seems that what we're after is some way
to mark a service as having generated an error without bringing the system
down by going into maintenance. Some sort of degraded mode.

We have a couple of SMF exit codes that look interesting - SMF_EXIT_MON_DEGRADE
and SMF_EXIT_MON_OFFLINE, but I'm sure they were never implemented. There's
even an issue in this area - https://www.illumos.org/issues/7711 (which refers
back to 8891, which is another case of something dropping into maintenance
breaking the entire system).

Interestingly, looking at the ssh method script for S11
https://github.com/oracle/solaris-userland/blob/master/components/openssh/sources/sshd.sh#L132
you see the following:

# Put the service into degraded mode in case some of previous
# configuration tasks failed.
# We do not let the service enter maintenance mode, since
# we want to keep the system as much operating as feasible.
#
if [ $ret1 -ne 0 ]; then
	smf_method_exit $SMF_EXIT_DEGRADED "hostkey_configuration" \
	    "Failed to generate missing host keys."
fi

So the equivalent of SMF_EXIT_DEGRADED might be what we're looking for?
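
To make that concrete, here is a rough sketch of what a start method could
look like if such an exit code existed. This is only an illustration under
stated assumptions: SMF_EXIT_DEGRADED is not something illumos defines today
as far as I can tell, so the value below is a placeholder, and
create_loopback is a hypothetical stand-in for the real interface plumbing.

```shell
#!/bin/sh
# Sketch only. A real method script would source
# /lib/svc/share/smf_include.sh instead of defining these itself.
SMF_EXIT_OK=0
# ASSUMPTION: placeholder value for illustration; not an illumos constant.
SMF_EXIT_DEGRADED=103

# Hypothetical stand-in for the real loopback setup. Setting
# SIMULATE_FAILURE to a non-zero value makes it report failure.
create_loopback() {
	return "${SIMULATE_FAILURE:-0}"
}

# Report success only if the loopback interfaces were created;
# otherwise log the error and ask for degraded rather than maintenance.
loopback_start() {
	if create_loopback; then
		return $SMF_EXIT_OK
	fi
	echo "loopback: failed to create loopback interfaces" >&2
	return $SMF_EXIT_DEGRADED
}
```

A method script built on this would end with "loopback_start; exit $?", so
the service reports the error while the rest of the system still comes up.
Whether the restarter would honour such a code is exactly the open question.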

-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/