* Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Andy Fiddaman @ 2024-07-26 8:20 UTC
To: developer

Please can you review the following change?

15665 svc:/network/loopback exits successfully even if it fails
https://www.illumos.org/issues/15665
https://code.illumos.org/c/illumos-gate/+/3610

Thanks,

Andy
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Peter Tribble @ 2024-07-26 12:44 UTC
To: illumos-developer

On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net> wrote:

> Please can you review the following change?
>
> 15665 svc:/network/loopback exits successfully even if it fails
> https://www.illumos.org/issues/15665
> https://code.illumos.org/c/illumos-gate/+/3610

When this first came up I expressed my belief that making this change is
the wrong thing to do, and I'll express it again.

If this service fails, I think the best thing to do is drive on so that the
system can come up as far as possible to maximise the chance that the system
comes up far enough for an administrator to be able to get in and fix it.
Not putting the service into maintenance is a feature, not a bug.

(If it fails, then there's something deeper wrong with the system, and those
fundamental causes should be dealt with.)

I think generally it would be wrong for a single voice to veto any change,
which means I would generally be uncomfortable sticking a -1 on it, but if
this does get into the gate it will be reverted in Tribblix.

Regards,

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Toomas Soome @ 2024-07-26 13:41 UTC
To: illumos-developer

> On 26. Jul 2024, at 15:44, Peter Tribble <peter.tribble@gmail.com> wrote:
>
> When this first came up I expressed my belief that making this change is
> the wrong thing to do, and I'll express it again.
>
> If this service fails, I think the best thing to do is drive on so that the
> system can come up as far as possible to maximise the chance that the system
> comes up far enough for an administrator to be able to get in and fix it.
> Not putting the service into maintenance is a feature, not a bug.
>
> (If it fails, then there's something deeper wrong with the system, and those
> fundamental causes should be dealt with.)
>
> I think generally it would be wrong for a single voice to veto any change,
> which means I would generally be uncomfortable sticking a -1 on it, but if
> this does get into the gate it will be reverted in Tribblix.

hm, ok, you mean that service startup error in case of network/loopback
will block too many other services? but then again, if there is an error,
those depending services are also not functional, aren't they? Otherwise
those depending services should not depend on network/loopback ;)

rgds,
toomas
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Peter Tribble @ 2024-07-26 14:05 UTC
To: illumos-developer

On Fri, Jul 26, 2024 at 2:41 PM Toomas Soome via illumos-developer
<developer@lists.illumos.org> wrote:

> hm, ok, you mean that service startup error in case of network/loopback
> will block too many other services? but then again, if there is an error,
> those depending services are also not functional, aren't they? Otherwise
> those depending services should not depend on network/loopback ;)

Well yes, the whole thing gets completely blocked. If it's a cloud or remote
server, it won't boot, and if you don't have console access (which isn't
universal) then the system is a total loss.

One of the issues is that SMF doesn't have a very rich vocabulary for
service states or dependencies - it's very much black and white. And this is
a case where you would normally want loopback to be up (I think many users
and applications make unstated assumptions here, although actually quite a
lot is completely unaffected if the loopback isn't up), and you definitely
want it up *before* the other applications - and generally for loopback it's
the ordering that's important.

So when everything is working the dependency is correct. When something goes
wrong, though, in many cases what you want is to carry on, so that as much
as possible works and you get a chance to fix it.

So what I see missing here is some way of exposing that the service has an
error, without bluntly dropping it into maintenance and killing all the
dependent services with it.

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Andy Fiddaman @ 2024-07-26 13:50 UTC
To: illumos-developer

On Fri, 26 Jul 2024, Peter Tribble wrote:

> On Fri, Jul 26, 2024 at 9:21 AM Andy Fiddaman <illumos@fiddaman.net> wrote:
>
> > Please can you review the following change?
> >
> > 15665 svc:/network/loopback exits successfully even if it fails
> > https://www.illumos.org/issues/15665
> > https://code.illumos.org/c/illumos-gate/+/3610
>
> When this first came up I expressed my belief that making this change is
> the wrong thing to do, and I'll express it again.

Apologies Peter. I had recalled that your objection to the original change
was mostly around the addition of the extra dependency to the service, which
I've removed in this new patch set (that is
https://www.illumos.org/issues/15664 which remains open).

> If this service fails, I think the best thing to do is drive on so that the
> system can come up as far as possible to maximise the chance that the system
> comes up far enough for an administrator to be able to get in and fix it. Not
> putting the service into maintenance is a feature, not a bug.

The impetus for this change is that over the past couple of years we've had
a number of occasions where we've had to debug networking problems that
have had their root in the fact that the loopback interfaces were not created
for one reason or another. It happened again yesterday in a non-global zone.
In all of these, it would have been really useful and expedited diagnosis if
the service had gone into maintenance. I understand the perspective of
allowing the system to come up as far as possible - to the point of remote
access even - but it still seems wrong for a service to report success where
it has not actually achieved its goal. Is there some middle ground here?

> I think generally it would be wrong for a single voice to veto any change,
> which means I would generally be uncomfortable sticking a -1 on it, but if
> this does get into the gate it will be reverted in Tribblix.

Understood. This definitely warrants further discussion.

Andy
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Peter Tribble @ 2024-07-26 15:14 UTC
To: illumos-developer

On Fri, Jul 26, 2024 at 2:50 PM Andy Fiddaman <andy@omnios.org> wrote:

> Apologies Peter. I had recalled that your objection to the original change
> was mostly around the addition of the extra dependency to the service, which
> I've removed in this new patch set (that is
> https://www.illumos.org/issues/15664 which remains open).
>
> The impetus for this change is that over the past couple of years we've had
> a number of occasions where we've had to debug networking problems that
> have had their root in the fact that the loopback interfaces were not created
> for one reason or another. It happened again yesterday in a non-global zone.
> In all of these, it would have been really useful and expedited diagnosis if
> the service had gone into maintenance. I understand the perspective of
> allowing the system to come up as far as possible - to the point of remote
> access even - but it still seems wrong for a service to report success where
> it has not actually achieved its goal. Is there some middle ground here?
>
> Understood. This definitely warrants further discussion.

As I mentioned in my other reply, it seems that what we're after is some way
to mark a service as having generated an error without bringing the system
down by going into maintenance. Some sort of degraded mode.

We have a couple of SMF exit codes that look interesting -
SMF_EXIT_MON_DEGRADE and SMF_EXIT_MON_OFFLINE, but I'm sure they were never
implemented. There's even an issue in this area -
https://www.illumos.org/issues/7711 (which refers back to 8891 which is
another case of something dropping into maintenance breaking the entire
system).

Interestingly, looking at the ssh method script for S11
https://github.com/oracle/solaris-userland/blob/master/components/openssh/sources/sshd.sh#L132
you see the following:

    # Put the service into degraded mode in case some of previous
    # configuration tasks failed.
    # We do not let the service enter maintenance mode, since
    # we want to keep the system as much operating as feasible.
    #
    if [ $ret1 -ne 0 ]; then
        smf_method_exit $SMF_EXIT_DEGRADED "hostkey_configuration" \
            "Failed to generate missing host keys."
    fi

So the equivalent of SMF_EXIT_DEGRADED might be what we're looking for?

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
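The degraded-exit idea in the quoted sshd method can be sketched for a generic start method roughly as follows. This is a minimal illustration rather than the loopback method itself: `choose_exit_status` is a hypothetical helper, and the numeric fallback used for `SMF_EXIT_DEGRADED` is an assumption that the thread does not state. A real method script would normally source `/lib/svc/share/smf_include.sh` and use whatever value is defined there.

```shell
#!/bin/sh
# Sketch only. In a real method script these values would come from
# /lib/svc/share/smf_include.sh. The fallback for SMF_EXIT_DEGRADED is an
# assumption for illustration; an already-defined value takes precedence.
SMF_EXIT_OK=${SMF_EXIT_OK:-0}
SMF_EXIT_DEGRADED=${SMF_EXIT_DEGRADED:-103}

# Hypothetical helper: pick the method's exit status from the number of
# configuration steps that failed. Any failure yields "degraded" rather
# than a fatal error, so dependent services are not blocked.
choose_exit_status() {
	failures=$1
	if [ "$failures" -gt 0 ]; then
		echo "$SMF_EXIT_DEGRADED"
	else
		echo "$SMF_EXIT_OK"
	fi
}

# A start method would typically finish with something like:
#	exit "$(choose_exit_status "$failures")"
```

The point of the design is that the failure stays visible to the administrator while the method avoids the fatal exit that would put the service into maintenance.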
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Jorge Schrauwen @ 2024-07-26 15:55 UTC
To: illumos-developer

[-- Attachment #1: Type: text/html, Size: 6264 bytes --]
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Gordon Ross @ 2024-07-30 22:46 UTC
To: illumos-developer

Optional dependency does that in SMF, right?

On Tue, Jul 30, 2024 at 12:56 PM Jorge Schrauwen via illumos-developer
<developer@lists.illumos.org> wrote:

> This last reply from Peter made me think of the difference between
> requires vs after in systemd speak.
>
> Although that is probably a lot of work as one would need those features
> and somehow fix all manifests that express a dependency on loopback.
>
> Admittedly I sometimes miss a more soft dependency in smf in general.
>
> ~ sjorge
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Peter Tribble @ 2024-07-31 9:44 UTC
To: illumos-developer

On Tue, Jul 30, 2024 at 11:46 PM Gordon Ross <gordon.w.ross@gmail.com> wrote:

> Optional dependency does that in SMF, right?

Well no, that's a rather different case. That is the "I don't care if it's
enabled or not, but if it is I'll have a hard dependency on it".

What we're after here is the "I do care that it's enabled, and must run
after it, but I'm prepared to live with errors".

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Joshua M. Clulow @ 2024-08-01 8:48 UTC
To: illumos-developer

On Wed, 31 Jul 2024 at 02:45, Peter Tribble <peter.tribble@gmail.com> wrote:
> On Tue, Jul 30, 2024 at 11:46 PM Gordon Ross <gordon.w.ross@gmail.com> wrote:
>> Optional dependency does that in SMF, right?
> Well no, that's a rather different case. That is the "I don't care if it's
> enabled or not, but if it is I'll have a hard dependency on it".

That's not what "optional_all" does. As per smf(7):

    optional_all
        Satisfied if the cited services are running (online or
        degraded) or do not run without administrative action
        (disabled, maintenance, not present, or offline waiting for
        dependencies which do not start without administrative action).

Note, most critically, if a service is in the maintenance state, it will not
run without administrative action. This allows a dependency with the
"optional_all" group to proceed with starting up.

> What we're after here is the "I do care that it's enabled, and must run
> after it, but I'm prepared to live with errors".

That's what "optional_all" does. It's effectively just for startup
sequencing; i.e., "as long as you believe you're going to start the service,
I'll wait, but if that changes just start me straight away".

I think the important thing to recognise is that dependencies aren't
considering what's marked enabled or disabled, they're considering the
_state_ of services and potentially the _next state_ for services in
transition (e.g., being started or stopped). It's a subtle but important
distinction because there _is_ a state called "disabled" as well, but that's
separate from whether the service is marked as being enabled or disabled by
the administrator. After you mark a service as disabled, it might not hit
the "disabled" state for quite some time, depending on what state it was in
to begin with.

The keen reader will note that "not present" is also not a state, but it's
treated like one in the mental model: it's the state implicitly assumed for
services we don't appear to have on the system. Obviously services that do
not exist will not be able to run without some administrative action either.

I agree that the degraded state is interesting in the limit, but I do not
believe it is necessary to implement more of it merely to get the behaviour
you're asking for here.

Cheers.

--
Joshua M. Clulow
http://blog.sysmgr.org
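To make the quoted smf(7) rule concrete, here is a toy model of the "optional_all" test in shell. It is only a sketch of the rule as worded in the man page excerpt, not of how SMF evaluates dependencies internally: `optional_all_satisfied` is a hypothetical helper, "absent" is a made-up token standing in for "not present", and the "offline waiting for dependencies which do not start without administrative action" case is deliberately left out.

```shell
#!/bin/sh
# Toy model of the optional_all rule from the smf(7) excerpt. It works on
# state names passed as arguments and does not query a real repository.

# Hypothetical helper: succeed only if every given state satisfies the rule.
optional_all_satisfied() {
	for state in "$@"; do
		case "$state" in
		online|degraded) ;;             # running
		disabled|maintenance|absent) ;; # will not run without admin action
		*) return 1 ;;                  # may still start, so keep waiting
		esac
	done
	return 0
}

# A dependency in maintenance does not hold up the dependent service.
if optional_all_satisfied maintenance online; then
	echo "dependency satisfied"
fi
```

Under this model a dependency that is merely "offline" (and expected to start) makes the dependent wait, which matches the startup-sequencing reading given in the message.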
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Alan Coopersmith @ 2024-07-26 18:08 UTC
To: developer

On 7/26/24 08:14, Peter Tribble wrote:
> As I mentioned in my other reply, it seems that what we're after is some
> way to mark a service as having generated an error without bringing the
> system down by going into maintenance. Some sort of degraded mode.
>
> So the equivalent of SMF_EXIT_DEGRADED might be what we're looking for?

SMF_EXIT_DEGRADED was added in Solaris 11.4, but the degraded mode existed
long before that - see for instance smf_degrade_instance() on
https://illumos.org/man/3SCF/smf_degrade_instance and the states section of
https://illumos.org/man/7/smf .

The addition of SMF_EXIT_DEGRADED allows start, stop, and refresh methods to
put the service into degraded mode, since otherwise they'd have to return an
exit code that put the service into one of the other modes (online, offline,
maintenance), and then have something else wait for the method to finish and
then call smf_degrade_instance(), which could lead to race conditions or
incorrect actions being taken during the brief moment the service is in the
wrong state before going to degraded mode.

-alan-
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Andy Fiddaman @ 2024-08-07 20:04 UTC
To: illumos-developer

On Fri, 26 Jul 2024, Peter Tribble wrote:

> As I mentioned in my other reply, it seems that what we're after is some
> way to mark a service as having generated an error without bringing the
> system down by going into maintenance. Some sort of degraded mode.
>
> We have a couple of SMF exit codes that look interesting -
> SMF_EXIT_MON_DEGRADE and SMF_EXIT_MON_OFFLINE, but I'm sure they were never
> implemented. There's even an issue in this area -
> https://www.illumos.org/issues/7711 (which refers back to 8891 which is
> another case of something dropping into maintenance breaking the entire
> system).

I have put up https://code.illumos.org/c/illumos-gate/+/3621 which is
resurrecting work by Andy Stormont to add the missing pieces to support
degraded services, and then re-worked
https://code.illumos.org/c/illumos-gate/+/3610/ on top of that.

This does seem like the best solution - the service does what it can and
doesn't block others from coming up, yet it shows up as degraded in
`svcs -x` and co.

My testing's going well so I'll get that sent out for review in the next
couple of days.

Thanks for all the discussion and feedback on this change!

Andy
* Re: [developer] Review - 15665 svc:/network/loopback exits successfully even if it fails
From: Andy Fiddaman @ 2024-09-09 17:36 UTC
To: illumos-developer

On Wed, 7 Aug 2024, Andy Fiddaman wrote:

> I have put up https://code.illumos.org/c/illumos-gate/+/3621 which is
> resurrecting work by Andy Stormont to add the missing pieces to support
> degraded services, and then re-worked
> https://code.illumos.org/c/illumos-gate/+/3610/
> on top of that.
>
> This does seem like the best solution - the service does what it can and
> doesn't block others from coming up, yet it shows up as degraded in
> `svcs -x` and co.
>
> My testing's going well so I'll get that sent out for review in the next
> couple of days.
>
> Thanks for all the discussion and feedback on this change!

Now that https://www.illumos.org/issues/7711 has integrated, I am planning
to send the latest version of this for RTI:

15665 svc:/network/loopback exits successfully even if it fails
https://www.illumos.org/issues/15665
https://code.illumos.org/c/illumos-gate/+/3610

This now uses the SMF degraded state rather than maintenance as the original
change did. I think this should address the concerns that were raised and,
again, thanks for the discussions around this.

Andy