Service watchdog

supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit

* Service watchdog
@ 2021-10-19  7:20 Petr Malat
  2021-10-19  7:24 ` Ellenor Bjornsdottir
  0 siblings, 1 reply; 6+ messages in thread
From: Petr Malat @ 2021-10-19  7:20 UTC
  To: supervision

Hi,
I'm using the busybox implementation of runit to manage services and I
miss some kind of watchdog in runsv. I thought about extending the
supervise/control pipe with a status command that would let a service
publish a status, for example 's Running'. Runsv would then append the
passed string, together with a monotonic timestamp of when it was
received, to its argv[0], making it visible in the process listing.
This could be used by "check" to verify that the service is up, and also
by a watchdog to see whether it has made progress since the last run.
Any opinions on that?
BR,
  Petr

* Re: Service watchdog
  2021-10-19  7:20 Service watchdog Petr Malat
@ 2021-10-19  7:24 ` Ellenor Bjornsdottir
  2021-10-19  7:41   ` Petr Malat
  0 siblings, 1 reply; 6+ messages in thread
From: Ellenor Bjornsdottir @ 2021-10-19  7:24 UTC
  To: supervision

Is this some genre of continuous readiness notification, or so?

* Re: Service watchdog
  2021-10-19  7:24 ` Ellenor Bjornsdottir
@ 2021-10-19  7:41   ` Petr Malat
  2021-10-19  9:47     ` Laurent Bercot
  2021-10-19 17:46     ` Steve Litt
  0 siblings, 2 replies; 6+ messages in thread
From: Petr Malat @ 2021-10-19  7:41 UTC
  To: supervision

Yes, in my use case this would be called at the point where sd_notify()
is called when the service runs under systemd. A periodically executed
watchdog could then check that the service is making progress and react
if it isn't.

The question is then how to implement the watchdog: it could be either
a global service or another executable in the service directory, which
would be started periodically by runsv.
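
A minimal sketch of the service side of such a scheme that needs no
change to runsv, assuming the service updates a heartbeat file at the
point where it would otherwise call sd_notify(). The function names and
the idea of a heartbeat file are illustrative assumptions, not part of
runit or busybox.

```python
import os
import time


def beat(path):
    """Record liveness by refreshing the heartbeat file's mtime."""
    # Create the file if it does not exist yet (hypothetical location).
    with open(path, "a"):
        pass
    # Passing None sets both access and modification time to "now".
    os.utime(path, None)


def heartbeat_age(path, now=None):
    """Return seconds since the last beat, or infinity if there was none."""
    now = time.time() if now is None else now
    try:
        return now - os.stat(path).st_mtime
    except FileNotFoundError:
        return float("inf")
```

The service would call beat() from its main loop, and a watchdog would
compare heartbeat_age() against its own threshold.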

* Re: Service watchdog
  2021-10-19  7:41   ` Petr Malat
@ 2021-10-19  9:47     ` Laurent Bercot
  2021-10-21  9:20       ` Petr Malat
  2021-10-19 17:46     ` Steve Litt
  1 sibling, 1 reply; 6+ messages in thread
From: Laurent Bercot @ 2021-10-19  9:47 UTC
  To: supervision

>Yes, in my usecase this would be used at the place where sd_notify()
>is used if the service runs under systemd. Then periodically executed
>watchdog could check the service makes progress and react if it
>doesn't.
>
>The question is how to implement the watchdog then - it could be either
>a global service or another executable in service directory, which
>would be started periodically by runsv.

  If a single notification step is enough for you, i.e. the service
goes from a "preparing" state to a "ready" state and remains ready
until the process dies, then what you want is implemented in the s6
process supervisor: https://skarnet.org/software/s6/notifywhenup.html

  Then you can synchronously wait for service readiness
(s6-svwait $service) or, if you have a watchdog service, periodically
poll for readiness (s6-svstat -r $service).
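
  A minimal sketch of the daemon side of that notification, assuming the
daemon was started with a writable descriptor whose number matches the
one declared in the service directory's notification-fd file. The
function name is hypothetical.

```python
import os


def notify_ready(fd):
    """Tell the supervisor the service is ready.

    The readiness protocol is a single newline written to the
    notification descriptor, which is then closed.
    """
    os.write(fd, b"\n")
    os.close(fd)
```

  The daemon would call this exactly once, after its initialization is
complete, for example notify_ready(3) if notification-fd contains 3.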

  But that's only valid if your service can only change states once
(from "not ready" to "ready"). If you need anything more complex, s6
won't support it intrinsically.

  The reason why there isn't more advanced support for this in any
supervision suite (save systemd but even there it's pretty minimal)
is that service states other than "not ready yet" and "ready" are
very much service-dependent and it's impossible for a generic process
supervisor to support enough states for every possible existing service.
Daemons that need complex states usually come with their own
monitoring software that handles their specific states, with integrated
health checks etc.

  So my advice would be:
  - if what you need is just readiness notification, switch to s6.
It's very similar to runit and I think you'll find it has other
benefits as well. The drawback, obviously, is that it's not in busybox
and the required effort to switch may not be worth it.
  - if you need anything more complex, you can stick to runit, but you
will kinda need to write your own monitor for your daemon, because
that's what everyone does.

  Depending on the details of the monitoring you need, the monitoring
software can be implemented as another service (e.g. to receive
heartbeats from your daemon), or as a polling client (e.g. to do
periodic health checks). Both approaches are valid.
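
  The first approach (a separate monitoring service that receives
heartbeats) could be sketched roughly as follows. This assumes each
daemon sends its own name as a datagram to a socket owned by the
monitor; the class, the function, and the message format are
illustrative assumptions, not an existing tool.

```python
import socket
import time


class HeartbeatMonitor:
    """Track when each daemon was last heard from, on a monotonic clock."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated
        self.clock = clock          # injectable for testing
        self.last_seen = {}

    def record(self, name):
        self.last_seen[name] = self.clock()

    def stale(self):
        """Return the sorted names of daemons silent for longer than timeout."""
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout)


def receive_once(sock, monitor):
    """Read one datagram (payload: daemon name) and record the heartbeat."""
    data = sock.recv(256)
    monitor.record(data.decode("utf-8", "replace").strip())
```

  A daemon that has never sent a heartbeat is not tracked by this
sketch, so a real monitor would also need a list of expected daemons.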

  Don't hack on runit, especially the control pipe thing. It will not
end well.
  (runit's control pipe feature is super dangerous, because it allows a
service to hijack the control flow of its supervisor, which endangers
the supervisor's safety. That's why s6 does not implement it; it
provides similar - albeit slightly less powerful - control features
via ways that never give the service any power over the supervisor.)

--
  Laurent

* Re: Service watchdog
  2021-10-19  7:41   ` Petr Malat
  2021-10-19  9:47     ` Laurent Bercot
@ 2021-10-19 17:46     ` Steve Litt
  1 sibling, 0 replies; 6+ messages in thread
From: Steve Litt @ 2021-10-19 17:46 UTC
  To: supervision

Petr Malat said on Tue, 19 Oct 2021 09:41:19 +0200

>Yes, in my usecase this would be used at the place where sd_notify()
>is used if the service runs under systemd. Then periodically executed
>watchdog could check the service makes progress and react if it
>doesn't.
>
>The question is how to implement the watchdog then - it could be either
>a global service or another executable in service directory, which
>would be started periodically by runsv.

LOL, I'll tell you how I did it on my reminder system, and you can
decide whether or not to do it my way...

I have a reminder system written by me in Perl early this century, when
I still used Perl. It runs 5 times a day via cron, popping a window up
on the screen telling me of my appointments. Some consider it
intrusive, I like it that way (which is why I wrote it that way).

After a few years of using my reminder system, it became apparent that
sometimes it was failing silently, and I wouldn't notice the
absence of popup windows, causing me to miss appointments and the like.

So I wrote another program (by this time I'd switched to Python), run
as a runit service:

========================================
#!/bin/sh
cd /d/at/python/reminder_check
exec chpst -u slitt:slitt /d/at/python/reminder_check/reminder_check.py
========================================

The main routine of the Python program follows:

========================================
while True:
    if tooOld(LOGFILE, TOO_OLD_HOURS):
        alarm_all()
    time.sleep(SLEEP_SECONDS)
========================================

So every SLEEP_SECONDS seconds, it checks logfile LOGFILE, which is
written by the reminder program itself, to see whether it's more than
TOO_OLD_HOURS hours old, and if it is, it throws up a big old green and
purple window proclaiming the alarm system is broken.
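
The tooOld() helper is not shown in the excerpt. A minimal sketch of
what such a check might look like, assuming it compares the file's
modification time against the current time and treats a missing file
as a failure (both are assumptions, not the original code):

```python
import os
import time


def tooOld(path, max_hours, now=None):
    """Return True if path is missing or older than max_hours hours."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # Assumption: no log file at all also means the job is broken.
        return True
    return (now - mtime) > max_hours * 3600
```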

In my case, SLEEP_SECONDS is 3600. Yeah, it's polling instead of
interrupt driven, but I make no apology for polling once an hour.
Matter of fact, I'd make no apologies for 10 second polling, given that
if everything's OK all it's going to do is check a file date.

It seems to me the key question is how quickly do you need to be
informed of the failure of the watched daemon. If being informed a
minute later is OK, I'd say my method is fine. If being informed a
second later is OK, I'd rewrite the time check in C and then if it
flunks, system() the "on error" program. If you need subsecond warning,
my method is probably not what you want.

By the way, when I test for a daemon functioning, I typically don't use
svstatus or that other program that just returns a 1 or 0, because I
don't care if the program is running: I want to know that it's
*functioning*, so I test the functionality of the running program. So
for the network, I'd do a quick single-iteration ping; for PostgreSQL I
might run a simple SELECT statement, etc.
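
A minimal sketch of such a functional test, assuming the check can be
expressed as an external command whose exit status says whether the
service is working. The probe() name and the example commands in the
comments are illustrative, not taken from any existing tool.

```python
import subprocess


def probe(cmd, timeout=5.0):
    """Run a check command; True only if it exits 0 within timeout seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hung: treat as failure.
        return False
    return result.returncode == 0


# Hypothetical uses; the host and database details are placeholders:
#   probe(["ping", "-c", "1", "192.0.2.1"])
#   probe(["psql", "-c", "SELECT 1"])
```

The timeout matters: a hung daemon often makes the check hang as well,
and a watchdog that blocks forever is no watchdog.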

Best of luck.

SteveT

Steve Litt 
Spring 2021 featured book: Troubleshooting Techniques of the Successful
Technologist http://www.troubleshooters.com/techniques

* Re: Service watchdog
  2021-10-19  9:47     ` Laurent Bercot
@ 2021-10-21  9:20       ` Petr Malat
  0 siblings, 0 replies; 6+ messages in thread
From: Petr Malat @ 2021-10-21  9:20 UTC
  To: supervision

Hi!

> > Yes, in my usecase this would be used at the place where sd_notify()
> > is used if the service runs under systemd. Then periodically executed
> > watchdog could check the service makes progress and react if it
> > doesn't.
> 
>  If a single notification step is enough for you, i.e. the service
> goes from a "preparing" state to a "ready" state and remains ready
> until the process dies, then what you want is implemented in the s6
> process supervisor: https://skarnet.org/software/s6/notifywhenup.html
> 
>  Then you can synchronously wait for service readiness
> (s6-svwait $service) or, if you have a watchdog service, periodically
> poll for readiness (s6-svstat -r $service).
> 
>  But that's only valid if your service can only change states once
> (from "not ready" to "ready"). If you need anything more complex, s6
> won't support it intrinsically.
No, I need to monitor that the service is alive: my watchdog script
would test whether the status message is older than a defined threshold,
in which case it would kill the service (the rest would be handled in
the finish script).

>  The reason why there isn't more advanced support for this in any
> supervision suite (save systemd but even there it's pretty minimal)
> is that service states other than "not ready yet" and "ready" are
> very much service-dependent and it's impossible for a generic process
> supervisor to support enough states for every possible existing service.
> Daemons that need complex states usually come with their own
> monitoring software that handles their specific states, with integrated
> health checks etc.
> 
>  So my advice would be:
>  - if what you need is just readiness notification, switch to s6.
> It's very similar to runit and I think you'll find it has other
> benefits as well. The drawback, obviously, is that it's not in busybox
> and the required effort to switch may not be worth it.
>  - if you need anything more complex, you can stick to runit, but you
> will kinda need to write your own monitor for your daemon, because
> that's what everyone does.
> 
>  Depending on the details of the monitoring you need, the monitoring
> software can be implemented as another service (e.g. to receive
> heartbeats from your daemon), or as a polling client (e.g. to do
> periodic health checks). Both approaches are valid.
That's what I thought of as well, but keeping this completely outside
of runsv opens a race window in which the watchdog can kill a service
that has just restarted itself. This could be avoided if the check were
serialized with the other steps (run/finish execution) within runsv.
So far the needless restart of the service doesn't seem to cause me any
problems, so I'm not much bothered by it.
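
An external watchdog can narrow this window, though not close it, by
remembering which PID it judged to be stuck and checking that the
supervisor still reports the same PID right before signalling. A rough
sketch, assuming runsv keeps the current PID in supervise/pid and that
the watchdog is allowed to read that file; the function names are
hypothetical.

```python
import os
import signal


def read_pid(service_dir):
    """Return the PID from supervise/pid, or None if down or unreadable."""
    try:
        with open(os.path.join(service_dir, "supervise", "pid")) as f:
            text = f.read().strip()
    except OSError:
        return None
    return int(text) if text.isdigit() else None


def kill_if_unchanged(service_dir, suspect_pid, kill=os.kill):
    """Signal suspect_pid only if it is still the supervised process."""
    if suspect_pid is None or read_pid(service_dir) != suspect_pid:
        # The service already restarted or is down: leave it alone.
        return False
    kill(suspect_pid, signal.SIGKILL)
    return True
```

A small gap remains between the final read and the signal, so only
serialization inside the supervisor would remove the race entirely.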

>  Don't hack on runit, especially the control pipe thing. It will not
> end well.
>  (runit's control pipe feature is super dangerous, because it allows a
> service to hijack the control flow of its supervisor, which endangers
> the supervisor's safety. That's why s6 does not implement it; it
> provides similar - albeit slightly less powerful - control features
> via ways that never give the service any power over the supervisor.)
The main reason I wanted to use the control pipe for it was the
possibility of seeing the service status in the process tree, which
would be a nice benefit.

BR,
  Petr
