* Review - 15665 svc:/network/loopback exits successfully even if it fails
@ 2024-07-26 8:20 Andy Fiddaman
2024-07-26 12:44 ` [developer] " Peter Tribble
0 siblings, 1 reply; 13+ messages in thread
From: Andy Fiddaman @ 2024-07-26 8:20 UTC (permalink / raw)
To: developer
Please can you review the following change?
15665 svc:/network/loopback exits successfully even if it fails
https://www.illumos.org/issues/15665
https://code.illumos.org/c/illumos-gate/+/3610
Thanks,
Andy
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 8:20 Review - 15665 svc:/network/loopback exits successfully even if it fails Andy Fiddaman
@ 2024-07-26 12:44 ` Peter Tribble
2024-07-26 13:41 ` Toomas Soome
2024-07-26 13:50 ` Andy Fiddaman
0 siblings, 2 replies; 13+ messages in thread
From: Peter Tribble @ 2024-07-26 12:44 UTC (permalink / raw)
To: illumos-developer
On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net> wrote:
> Please can you review the following change?
>
> 15665 svc:/network/loopback exits successfully even if it fails
> https://www.illumos.org/issues/15665
> https://code.illumos.org/c/illumos-gate/+/3610
>
When this first came up I expressed my belief that making this change is the
wrong thing to do, and I'll express it again.
If this service fails, I think the best thing to do is drive on so that the
system can come up as far as possible to maximise the chance that the system
comes up far enough for an administrator to be able to get in and fix it. Not
putting the service into maintenance is a feature, not a bug.
(If it fails, then there's something deeper wrong with the system, and those
fundamental causes should be dealt with.)
I think generally it would be wrong for a single voice to veto any change,
which means I would generally be uncomfortable sticking a -1 on it, but if this
does get into the gate it will be reverted in Tribblix.
Regards,
--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 12:44 ` [developer] " Peter Tribble
@ 2024-07-26 13:41 ` Toomas Soome
2024-07-26 14:05 ` Peter Tribble
2024-07-26 13:50 ` Andy Fiddaman
1 sibling, 1 reply; 13+ messages in thread
From: Toomas Soome @ 2024-07-26 13:41 UTC (permalink / raw)
To: illumos-developer
> On 26. Jul 2024, at 15:44, Peter Tribble <peter.tribble@gmail.com> wrote:
>
> On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net <mailto:illumos@fiddaman.net>> wrote:
>> Please can you review the following change?
>>
>> 15665 svc:/network/loopback exits successfully even if it fails
>> https://www.illumos.org/issues/15665
>> https://code.illumos.org/c/illumos-gate/+/3610
>
> When this first came up I expressed my belief that making this change is the wrong
> thing to do, and I'll express it again.
>
> If this service fails, I think the best thing to do is drive on so that the system can come
> up as far as possible to maximise the chance that the system comes up far enough for
> an administrator to be able to get in and fix it. Not putting the service into maintenance
> is a feature, not a bug.
>
> (If it fails, then there's something deeper wrong with the system, and those fundamental
> causes should be dealt with.)
>
> I think generally it would be wrong for a single voice to veto any change, which means I
> would generally be uncomfortable sticking a -1 on it, but if this does get into the gate
> it will be reverted in Tribblix.
>
Hm, OK, you mean that a startup error in network/loopback will block too many other services? But then again, if there is an error, those dependent services are not functional either, are they? Otherwise those dependent services should not depend on network/loopback ;)
rgds,
toomas
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 12:44 ` [developer] " Peter Tribble
2024-07-26 13:41 ` Toomas Soome
@ 2024-07-26 13:50 ` Andy Fiddaman
2024-07-26 15:14 ` Peter Tribble
1 sibling, 1 reply; 13+ messages in thread
From: Andy Fiddaman @ 2024-07-26 13:50 UTC (permalink / raw)
To: illumos-developer
On Fri, 26 Jul 2024, Peter Tribble wrote:
> On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net> wrote:
>
> > Please can you review the following change?
> >
> > 15665 svc:/network/loopback exits successfully even if it fails
> > https://www.illumos.org/issues/15665
> > https://code.illumos.org/c/illumos-gate/+/3610
> >
>
> When this first came up I expressed my belief that making this change is
> the wrong
> thing to do, and I'll express it again.
Apologies Peter. I had recalled that your objection to the original change
was mostly around the addition of the extra dependency to the service, which
I've removed in this new patch set (that is
https://www.illumos.org/issues/15664 which remains open).
> If this service fails, I think the best thing to do is drive on so that the
> system can come up as far as possible to maximise the chance that the system
> comes up far enough for an administrator to be able to get in and fix it. Not
> putting the service into maintenance is a feature, not a bug.
The impetus for this change is that over the past couple of years we've had
a number of occasions where we've had to debug networking problems that
have had their root in the fact that the loopback interfaces were not created
for one reason or another. It happened again yesterday in a non-global zone. In
all of these, it would have been really useful and expedited diagnosis if the
service had gone into maintenance. I understand the perspective of allowing the
system to come up as far as possible - to the point of remote access even - but
it still seems wrong for a service to report success where it has not actually
achieved its goal. Is there some middle ground here?
> I think generally it would be wrong for a single voice to veto any change,
> which means I would generally be uncomfortable sticking a -1 on it, but if
> this does get into the gate it will be reverted in Tribblix.
Understood. This definitely warrants further discussion.
Andy
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 13:41 ` Toomas Soome
@ 2024-07-26 14:05 ` Peter Tribble
0 siblings, 0 replies; 13+ messages in thread
From: Peter Tribble @ 2024-07-26 14:05 UTC (permalink / raw)
To: illumos-developer
On Fri, Jul 26, 2024 at 2:41 PM Toomas Soome via illumos-developer <
developer@lists.illumos.org> wrote:
>
>
> On 26. Jul 2024, at 15:44, Peter Tribble <peter.tribble@gmail.com> wrote:
>
> On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net>
> wrote:
>
>> Please can you review the following change?
>>
>> 15665 svc:/network/loopback exits successfully even if it fails
>> https://www.illumos.org/issues/15665
>> https://code.illumos.org/c/illumos-gate/+/3610
>>
>
> When this first came up I expressed my belief that making this change is
> the wrong
> thing to do, and I'll express it again.
>
> If this service fails, I think the best thing to do is drive on so that
> the system can come
> up as far as possible to maximise the chance that the system comes up far
> enough for
> an administrator to be able to get in and fix it. Not putting the service
> into maintenance
> is a feature, not a bug.
>
> (If it fails, then there's something deeper wrong with the system, and
> those fundamental
> causes should be dealt with.)
>
> I think generally it would be wrong for a single voice to veto any change,
> which means I
> would generally be uncomfortable sticking a -1 on it, but if this does get
> into the gate
> it will be reverted in Tribblix.
>
>
> hm, ok, you mean that service startup error in case of network/loopback
> will block too many other services? but then again, if there is an error,
> those depending services are also not functional, aren't they? Otherwise
> those depending services should not depend on network/loopback ;)
>
Well yes, the whole thing gets completely blocked. If it's a cloud or remote
server, it won't boot, and if you don't have console access (which isn't
universal) then the system is a total loss.
One of the issues is that SMF doesn't have a very rich vocabulary for service
states or dependencies - it's very much black and white. And this is a case
where you would normally want loopback to be up (I think many users and
applications make unstated assumptions here, although actually quite a lot is
completely unaffected if the loopback isn't up), and you definitely want it up
*before* the other applications - and generally for loopback it's the ordering
that's important. So when everything is working the dependency is correct.
When something goes wrong, though, in many cases what you want is to carry on,
so that as much as possible works and you get a chance to fix it. So what I see
missing here is some way of exposing that the service has an error, without
bluntly dropping it into maintenance and killing all the dependent services
with it.
--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 13:50 ` Andy Fiddaman
@ 2024-07-26 15:14 ` Peter Tribble
2024-07-26 15:55 ` Jorge Schrauwen
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Peter Tribble @ 2024-07-26 15:14 UTC (permalink / raw)
To: illumos-developer
On Fri, Jul 26, 2024 at 2:50 PM Andy Fiddaman <andy@omnios.org> wrote:
>
> On Fri, 26 Jul 2024, Peter Tribble wrote:
>
> > On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net>
> wrote:
> >
> > > Please can you review the following change?
> > >
> > > 15665 svc:/network/loopback exits successfully even if it fails
> > > https://www.illumos.org/issues/15665
> > > https://code.illumos.org/c/illumos-gate/+/3610
> > >
> >
> > When this first came up I expressed my belief that making this change is
> > the wrong
> > thing to do, and I'll express it again.
>
> Apologies Peter. I had recalled that your objection to the original change
> was mostly around the addition of the extra dependency to the service,
> which
> I've removed in this new patch set (that is
> https://www.illumos.org/issues/15664 which remains open).
>
> > If this service fails, I think the best thing to do is drive on so that
> the
> > system can come up as far as possible to maximise the chance that the
> system
> > comes up far enough for an administrator to be able to get in and fix
> it. Not
> > putting the service into maintenance is a feature, not a bug.
>
> The impetus for this change is that over the past couple of years we've had
> a number of occasions where we've had to debug networking problems that
> have had their root in the fact that the loopback interfaces were not
> created
> for one reason or another. It happened again yesterday in a non-global
> zone. In
> all of these, it would have been really useful and expedited diagnosis if
> the
> service had gone into maintenance. I understand the perspective of
> allowing the
> system to come up as far as possible - to the point of remote access even
> - but
> it still seems wrong for a service to report success where it has not
> actually
> achieved its goal. Is there some middle ground here.
>
> > I think generally it would be wrong for a single voice to veto any
> change,
> > which means I would generally be uncomfortable sticking a -1 on it, but
> if
> > this does get into the gate it will be reverted in Tribblix.
>
> Understood. This definitely warrants further discussion.
>
As I mentioned in my other reply, it seems that what we're after is some way to
mark a service as having generated an error without bringing the system down by
going into maintenance. Some sort of degraded mode.
We have a couple of SMF exit codes that look interesting - SMF_EXIT_MON_DEGRADE
and SMF_EXIT_MON_OFFLINE, but I'm sure they were never implemented. There's even
an issue in this area - https://www.illumos.org/issues/7711 (which refers back
to 8891 which is another case of something dropping into maintenance breaking
the entire system).
Interestingly, looking at the ssh method script for S11
https://github.com/oracle/solaris-userland/blob/master/components/openssh/sources/sshd.sh#L132
you see the following:
    # Put the service into degraded mode in case some of previous
    # configuration tasks failed.
    # We do not let the service enter maintenance mode, since
    # we want to keep the system as much operating as feasible.
    #
    if [ $ret1 -ne 0 ]; then
            smf_method_exit $SMF_EXIT_DEGRADED "hostkey_configuration" \
                "Failed to generate missing host keys."
    fi
So the equivalent of SMF_EXIT_DEGRADED might be what we're looking for?
--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 15:14 ` Peter Tribble
@ 2024-07-26 15:55 ` Jorge Schrauwen
2024-07-30 22:46 ` Gordon Ross
2024-07-26 18:08 ` Alan Coopersmith
2024-08-07 20:04 ` Andy Fiddaman
2 siblings, 1 reply; 13+ messages in thread
From: Jorge Schrauwen @ 2024-07-26 15:55 UTC (permalink / raw)
To: illumos-developer
[-- Attachment #1: Type: text/html, Size: 6264 bytes --]
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 15:14 ` Peter Tribble
2024-07-26 15:55 ` Jorge Schrauwen
@ 2024-07-26 18:08 ` Alan Coopersmith
2024-08-07 20:04 ` Andy Fiddaman
2 siblings, 0 replies; 13+ messages in thread
From: Alan Coopersmith @ 2024-07-26 18:08 UTC (permalink / raw)
To: developer
On 7/26/24 08:14, Peter Tribble wrote:
> As I mentioned in my other reply, it seems that what we're after is some way to mark
> a service as having generated an error without bringing the system down by going
> into maintenance. Some sort of degraded mode.
>
> So the equivalent of SMF_EXIT_DEGRADED might be what we're looking for?
SMF_EXIT_DEGRADED was added in Solaris 11.4, but the degraded mode existed long
before that - see for instance smf_degrade_instance() on
https://illumos.org/man/3SCF/smf_degrade_instance and the states section of
https://illumos.org/man/7/smf .
The addition of SMF_EXIT_DEGRADED allows start, stop, and refresh methods to
put the service into degraded mode, since otherwise they'd have to return an
exit code that put the service into one of the other modes (online, offline,
maintenance), and then have something else wait for the method to finish and
then call smf_degrade_instance(), which could lead to race conditions or
incorrect actions being taken during the brief moment the service is in the
wrong state before going to degraded mode.
-alan-
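To illustrate the point about method exit codes: the sketch below shows how a
start method might choose its exit status so that a failed setup step leaves
the service degraded rather than in maintenance. It is a minimal sketch under
stated assumptions, not the code in the change under review. The helper name
start_exit_code is hypothetical, the constants are defined inline only so the
fragment runs outside SMF, and the value 103 for SMF_EXIT_DEGRADED is an
assumption; a real method script would take them from
/lib/svc/share/smf_include.sh.

```shell
#!/bin/sh
# Stand-in constants so the sketch runs outside SMF. A real method
# script would source /lib/svc/share/smf_include.sh instead.
SMF_EXIT_OK=0
SMF_EXIT_DEGRADED=103	# assumed value, not confirmed by the thread

# start_exit_code <failed_steps>   (hypothetical helper)
# Prints the exit status a start method could return: OK when every
# setup step worked, DEGRADED when at least one step failed. No path
# here leads to maintenance, so dependents are not blocked.
start_exit_code() {
	if [ "$1" -eq 0 ]; then
		echo "$SMF_EXIT_OK"
	else
		echo "$SMF_EXIT_DEGRADED"
	fi
}

# Typical use at the end of a method script (left as a comment so the
# fragment can be sourced without terminating the calling shell):
#   exit "$(start_exit_code "$failures")"
```

Because the method itself returns the degraded status, nothing has to watch
for the method to finish and then call smf_degrade_instance(), which is the
race described above.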
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 15:55 ` Jorge Schrauwen
@ 2024-07-30 22:46 ` Gordon Ross
2024-07-31 9:44 ` Peter Tribble
0 siblings, 1 reply; 13+ messages in thread
From: Gordon Ross @ 2024-07-30 22:46 UTC (permalink / raw)
To: illumos-developer
Optional dependency does that in SMF, right?
On Tue, Jul 30, 2024 at 12:56 PM Jorge Schrauwen via illumos-developer <
developer@lists.illumos.org> wrote:
> This last reply from Peter made me think of the difference between
> requires vs after in systemd speak.
>
> Although that is probably a lot of work, as one would need those features
> and would somehow have to fix all manifests that express a dependency on
> loopback.
>
> Admittedly I sometimes miss a softer dependency in SMF in general.
>
> ~ sjorge
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-30 22:46 ` Gordon Ross
@ 2024-07-31 9:44 ` Peter Tribble
2024-08-01 8:48 ` Joshua M. Clulow
0 siblings, 1 reply; 13+ messages in thread
From: Peter Tribble @ 2024-07-31 9:44 UTC (permalink / raw)
To: illumos-developer
On Tue, Jul 30, 2024 at 11:46 PM Gordon Ross <gordon.w.ross@gmail.com>
wrote:
> Optional dependency does that in SMF, right?
>
Well no, that's a rather different case. That is the "I don't care if it's
enabled or not, but if it is I'll have a hard dependency on it".
What we're after here is the "I do care that it's enabled, and must run after
it, but I'm prepared to live with errors".
--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-31 9:44 ` Peter Tribble
@ 2024-08-01 8:48 ` Joshua M. Clulow
0 siblings, 0 replies; 13+ messages in thread
From: Joshua M. Clulow @ 2024-08-01 8:48 UTC (permalink / raw)
To: illumos-developer
On Wed, 31 Jul 2024 at 02:45, Peter Tribble <peter.tribble@gmail.com> wrote:
> On Tue, Jul 30, 2024 at 11:46 PM Gordon Ross <gordon.w.ross@gmail.com> wrote:
>> Optional dependency does that in SMF, right?
> Well no, that's a rather different case. That is the "I don't care if it's enabled or not,
> but if it is I'll have a hard dependency on it".
That's not what "optional_all" does. As per smf(7):
optional_all
Satisfied if the cited services are running (online or
degraded) or do not run without administrative action
(disabled, maintenance, not present, or offline
waiting for dependencies which do not start without
administrative action).
Note, most critically, if a service is in the maintenance state, it
will not run without administrative action. This allows a dependency
with the "optional_all" group to proceed with starting up.
> What we're after here is the "I do care that it's enabled, and must run after it, but I'm
> prepared to live with errors".
That's what "optional_all" does. It's effectively just for startup
sequencing; i.e., "as long as you believe you're going to start the
service, I'll wait, but if that changes just start me straight away".
I think the important thing to recognise is that dependencies aren't
considering what's marked enabled or disabled, they're considering the
_state_ of services and potentially the _next state_ for services in
transition (e.g., being started or stopped). It's a subtle but
important distinction because there _is_ a state called "disabled" as
well, but that's separate from whether the service is marked as being
enabled or disabled by the administrator. After you mark a service as
disabled, it might not hit the "disabled" state for quite some time,
depending on what state it was in to begin with.
The keen reader will note that "not present" is also not a state, but
it's treated like one in the mental model: it's the state implicitly
assumed for services we don't appear to have on the system. Obviously
services that do not exist will not be able to run without some
administrative action either.
I agree that the degraded state is interesting in the limit, but I do
not believe it is necessary to implement more of it merely to get the
behaviour you're asking for here.
Cheers.
--
Joshua M. Clulow
http://blog.sysmgr.org
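The dependency rules described in this message can be sketched as a small
shell model. This is only an illustration of the smf(7) wording quoted above,
not how svc.startd is implemented. The function name dep_satisfied is
hypothetical, "absent" stands for "not present", and a plain "offline" state is
treated as unsatisfied, which simplifies the "offline waiting for dependencies
which do not start without administrative action" case.

```shell
#!/bin/sh
# dep_satisfied <grouping> <state>   (hypothetical helper, not an SMF tool)
# Returns 0 if a dependency with the given grouping is satisfied by a
# single cited service in the given state, 1 otherwise.
dep_satisfied() {
	case "$1:$2" in
	# Both groupings are satisfied by a running service.
	require_all:online|require_all:degraded)
		return 0 ;;
	optional_all:online|optional_all:degraded)
		return 0 ;;
	# optional_all is also satisfied by a service that will not run
	# without administrative action.
	optional_all:disabled|optional_all:maintenance|optional_all:absent)
		return 0 ;;
	*)
		return 1 ;;
	esac
}

# Example: what happens to a dependent if loopback is in maintenance?
for grouping in require_all optional_all; do
	if dep_satisfied "$grouping" maintenance; then
		echo "$grouping: dependent may start"
	else
		echo "$grouping: dependent is blocked"
	fi
done
```

In this model a service in maintenance blocks require_all dependents but not
optional_all ones, while a degraded service satisfies both groupings.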
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-07-26 15:14 ` Peter Tribble
2024-07-26 15:55 ` Jorge Schrauwen
2024-07-26 18:08 ` Alan Coopersmith
@ 2024-08-07 20:04 ` Andy Fiddaman
2024-09-09 17:36 ` Andy Fiddaman
2 siblings, 1 reply; 13+ messages in thread
From: Andy Fiddaman @ 2024-08-07 20:04 UTC (permalink / raw)
To: illumos-developer
On Fri, 26 Jul 2024, Peter Tribble wrote:
> As I mentioned in my other reply, it seems that what we're after is some
> way to mark
> a service as having generated an error without bringing the system down by
> going
> into maintenance. Some sort of degraded mode.
>
> We have a couple of SMF exit codes that look interesting -
> SMF_EXIT_MON_DEGRADE
> and SMF_EXIT_MON_OFFLINE, but I'm sure they were never implemented. There's
> even an issue in this area - https://www.illumos.org/issues/7711 (which
> refers back to 8891
> which is another case of something dropping into maintenance breaking the
> entire system).
I have put up https://code.illumos.org/c/illumos-gate/+/3621 which is
resurrecting work by Andy Stormont to add the missing pieces to support
degraded services, and then re-worked
https://code.illumos.org/c/illumos-gate/+/3610/
on top of that.
This does seem like the best solution - the service does what it can and
doesn't block others from coming up, yet it shows up as degraded in
`svcs -x` and co.
My testing's going well so I'll get that sent out for review in the next
couple of days.
Thanks for all the discussion and feedback on this change!
Andy
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
2024-08-07 20:04 ` Andy Fiddaman
@ 2024-09-09 17:36 ` Andy Fiddaman
0 siblings, 0 replies; 13+ messages in thread
From: Andy Fiddaman @ 2024-09-09 17:36 UTC (permalink / raw)
To: illumos-developer
On Wed, 7 Aug 2024, Andy Fiddaman wrote:
>
> On Fri, 26 Jul 2024, Peter Tribble wrote:
>
> > As I mentioned in my other reply, it seems that what we're after is some
> > way to mark
> > a service as having generated an error without bringing the system down by
> > going
> > into maintenance. Some sort of degraded mode.
> >
> > We have a couple of SMF exit codes that look interesting -
> > SMF_EXIT_MON_DEGRADE
> > and SMF_EXIT_MON_OFFLINE, but I'm sure they were never implemented. There's
> > even an issue in this area - https://www.illumos.org/issues/7711 (which
> > refers back to 8891
> > which is another case of something dropping into maintenance breaking the
> > entire system).
>
> I have put up https://code.illumos.org/c/illumos-gate/+/3621 which is
> resurrecting work by Andy Stormont to add the missing pieces to support
> degraded services, and then re-worked
> https://code.illumos.org/c/illumos-gate/+/3610/
> on top of that.
>
> This does seem like the best solution - the service does what it can and
> doesn't block others from coming up, yet it shows up as degraded in
> `svcs -x` and co.
>
> My testing's going well so I'll get that sent out for review in the next
> couple of days.
>
> Thanks for all the discussion and feedback on this change!
Now that https://www.illumos.org/issues/7711 has integrated, I am planning
to send the latest version of this for RTI:
15665 svc:/network/loopback exits successfully even if it fails
https://www.illumos.org/issues/15665
https://code.illumos.org/c/illumos-gate/+/3610
This now uses the SMF degraded state rather than maintenance as the original
change did. I think this should address the concerns that were raised and,
again, thanks for the discussions around this.
Andy