supervision - discussion about system services, daemon supervision, init, runlevel management, and tools such as s6 and runit
* pidsig 0.11 - a fghack like de-daemonisation tool
@ 2010-06-02  6:08 Janos Farkas
  2010-06-02 18:46 ` Laurent Bercot
  0 siblings, 1 reply; 10+ messages in thread
From: Janos Farkas @ 2010-06-02  6:08 UTC (permalink / raw)
  To: supervision

Hi,

I've been using a tool to replace fghack in Bernstein chains.  It
overcomes at least one limitation of fghack, namely that fghack
doesn't have a way to pass on signals to daemons that it started.

pidsig works very similarly to fghack: it creates (at least) one pipe
that the newly started daemon inherits, but in addition to that, it
also keeps track of the pid of the child it started.  With the recorded
pid, pidsig is able to pass on (some of) the signals that it receives itself.
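
A minimal sketch of the signal pass-through idea, in Python for
illustration only; this is not pidsig's code. The function name
run_and_forward and the default set of forwarded signals are assumptions
made for the example:

```python
import os
import signal

# Assumed default set; the post only says that "(some)" signals are passed on.
DEFAULT_SIGNALS = (signal.SIGTERM, signal.SIGHUP, signal.SIGINT)


def run_and_forward(argv, signals=DEFAULT_SIGNALS):
    """Run argv as a child, relay `signals` to it, and return its exit code.

    Returns the child's exit status, or minus the signal number if the
    child was killed by a signal.
    """
    pid = os.fork()
    if pid == 0:
        # Child: become the requested program; exit 127 if exec fails.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)

    def relay(signum, _frame):
        # Pass the signal we received on to the recorded child pid.
        try:
            os.kill(pid, signum)
        except ProcessLookupError:
            pass  # child already gone

    previous = {s: signal.signal(s, relay) for s in signals}
    try:
        # Python retries waitpid() after running a signal handler.
        _, status = os.waitpid(pid, 0)
    finally:
        for s, handler in previous.items():
            if handler is not None:
                signal.signal(s, handler)

    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Calling run_and_forward(["sleep", "30"]) and then sending SIGTERM to the
wrapper terminates the sleep as well. This only covers a child that
stays in the foreground; a daemon that forks away needs the pipe or pid
file handling described in the rest of this message.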

In the absence of (or in addition to) the child pid, it can send signals
to processes that have recorded their pids in pid files.

Although it's still best to modify daemons to not actually put
themselves to background, there are occasionally exceptions that are
difficult to handle otherwise.

One such example is nginx, which has its own "worker" scheme and
supports online code replacement; as a result it can occasionally have
several master processes, with a separate pid file for each while the
replacement is in progress.

pidsig will be able to work with two kinds of daemons:
- those that don't go out of their way to close all open fds (just like fghack)
- those that don't go out of their way to detach from their parent by
forking too many times

Furthermore:
- it can chroot (although if still running as root, it may not be much
more "secure")
- it can run as a specific user (but then it may be limited in what
processes it can kill)
- it can read several(!) pid files and pass signals on to the processes
they name, to signify quit, configuration reload, etc.
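
The pid file handling can be pictured roughly as follows. This is a
hedged Python sketch, not pidsig's implementation; read_pid and
signal_pidfiles are hypothetical names, and the behaviour on missing or
malformed files (skip them silently) is an assumption:

```python
import os


def read_pid(path):
    """Return the pid stored in `path`, or None if missing or malformed."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None
    # 0 and negative values would address process groups, not one process.
    return pid if pid > 0 else None


def signal_pidfiles(directory, names, signum):
    """Send `signum` to each process named by directory/name.

    Returns the list of pids that were actually signalled.
    """
    sent = []
    for name in names:
        pid = read_pid(os.path.join(directory, name))
        if pid is None:
            continue
        try:
            os.kill(pid, signum)
        except (ProcessLookupError, PermissionError):
            continue  # stale pid file, or not allowed to signal it
        sent.append(pid)
    return sent
```

For the nginx case this would be something like
signal_pidfiles("/var/run", ["nginx.pid.oldbin", "nginx.pid"], signal.SIGHUP).
A pid read from a file may belong to an unrelated process by the time
the signal is sent, which is the usual weakness of pid files.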

I've been using it successfully to manage a few nginx configurations
with daemontools; it also works well with the atd from the Linux at
package.

Sample usage for nginx:

pidsig -d/var/run -pnginx.pid.oldbin -pnginx.pid /usr/sbin/nginx

There are some obvious, and some quirky features that I have plans to
implement - it all depends on the interest.  Please drop me a note if
any of the above sounds interesting and/or check out its project page
at github:

  http://github.com/chexum/pidsig

Janos



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-02  6:08 pidsig 0.11 - a fghack like de-daemonisation tool Janos Farkas
@ 2010-06-02 18:46 ` Laurent Bercot
  2010-06-03 16:53   ` Janos Farkas
  0 siblings, 1 reply; 10+ messages in thread
From: Laurent Bercot @ 2010-06-02 18:46 UTC (permalink / raw)
  To: supervision

> pidsig works very similarly to fghack, will create (at least) one pipe
> in the newly started daemon, but in addition to that, it will also
> keep track of the pid for the child it started.  With the recorded
> pid, pidsig is able to pass on (some) signals that it receives itself.

 Nice !


> There are some obvious, and some quirky features that I have plans to
> implement - it all depends on the interest.

 I would advise to implement just the features you need. Such a tool,
like fghack, is very handy to have when there's no other choice; however,
we don't want people to rely on them, to use them as excuses for bad
coding practices. And if you make pidsig too powerful, you *know* it's
going to happen.

 Working in professional environments has made me very disillusioned
about the way people design software (when they design it) and understand
Unix (when they understand it). Having workarounds is good, but if you
make it too easy for people to misbehave, *they will*.

 So... scratch your itch, but don't go out of your way to accommodate
all kinds of ugly practices. :)

-- 
 Laurent



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-02 18:46 ` Laurent Bercot
@ 2010-06-03 16:53   ` Janos Farkas
  2010-06-03 19:25     ` Laurent Bercot
  0 siblings, 1 reply; 10+ messages in thread
From: Janos Farkas @ 2010-06-03 16:53 UTC (permalink / raw)
  To: supervision

On Wed, Jun 2, 2010 at 19:46, Laurent Bercot
<ska-supervision@skarnet.org> wrote:
> [kind words]

Thanks!

>  I would advise to implement just the features you need. Such a tool,
> like fghack, is very handy to have when there's no other choice; however,
> we don't want people to rely on them, to use them as excuses for bad
> coding practices. And if you make pidsig too powerful, you *know* it's
> going to happen.

In principle, I agree. It does what I originally wanted, though I have
some corner cases in mind that can be difficult to tackle (what if pidsig
itself crashes, but the daemon below it doesn't: should it try to do
something with the pid file?).

These kinds of problems are not that theoretical - just recently I saw
svscan/svscanboot crashing on a >1y uptime box, taking many of the
processes with it, including most of the supervise infrastructure,
very likely not due to any fault in them (could be OOM gone wild,
cosmic rays hitting svscan memory, whatever).

The host stayed up and admin access was possible (thanks mainly to
sshd being a long running daemon), but the state the processes were
left in was nothing to be glad about.

Another question is whether there are more ways to reliably attach to
any given process so as to detect when it is gone - but all the daemons
that I currently run can be handled now :)

>  Working in professional environments has made me very disillusioned
> about the way people design software (when they design it) and understand
> Unix (when they understand it). Having workarounds is good, but if you
> make it too easy for people to misbehave, *they will*.
>
>  So... scratch your itch, but don't go out of your way to accommodate
> all kinds of ugly practices. :)

I hear you, points taken :)  Please return to your regularly scheduled
supervision :)

Janos



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-03 16:53   ` Janos Farkas
@ 2010-06-03 19:25     ` Laurent Bercot
  2010-06-04 16:26       ` Wayne Marshall
  0 siblings, 1 reply; 10+ messages in thread
From: Laurent Bercot @ 2010-06-03 19:25 UTC (permalink / raw)
  To: supervision

> These kinds of problems are not that theoretical - just recently I saw
> svscan/svscanboot crashing on a >1y uptime box, taking many of the
> processes with it, including most of the supervise infrastructure,
> very likely not due to any fault in them - could be oom gone wild,
> cosmic rays hitting svscan memory, whatever).

 That's a typical case of "weak" supervision, as opposed to a "strong"
supervision chain. "Strong" supervision makes sure that all the
infrastructure is connected to init.

 * svscan achieves strong supervision *if* svscanboot is flagged as
"respawn" in /etc/inittab on System V-style inits, in /etc/event.d/
with Upstart, or in /etc/ttys on BSD. It does *not* achieve it if
svscanboot is started via some rc.local script (as the stock daemontools
instructions tell you to do, shame on DJB! :))
 * perp is in the same boat, depending on how you start perpboot.
 * Paul Jarc has instructions on how to directly run svscan as
process 1.
 * runit achieves strong supervision if you're using runit-init.
 * s4, my own supervision suite (to be released next summer) can be
run from a respawner, but was also designed so s4-svscan can run as
process 1.

 Strong supervision makes sure that your supervisor process tree is
*always* alive and complete, unless process 1 itself crashes, in which
case you're doomed to reboot anyway.


> The host stayed up and admin access was possible (thanks mainly to
> sshd being a long running daemon), but the state the processes was
> left in was nothing to be glad about.

 This will, unfortunately, be the case even with a strong supervision
chain : if a branch of the supervision tree is broken, the whole
subtree is reconstructed from the breaking point, but the old subtree
might still be alive and locking resources, preventing the new
subtree from being fully functional (and filling your logs with warning
messages). An intervention from the administrator is always necessary.
The advantage is, all the admin has to do is kill the old subtree
(including services) and everything will be working perfectly again.
 I'm not sure it's possible to design a supervision suite that addresses
this problem cleanly without endangering service reliability.


> Another question would be if there are more ways to reliably connect
> to any given process detecting it being gone - but all the current
> daemons that I run can be handled now :)

 Unfortunately, no; not without support from the process you want to
monitor. There are only two ways of being notified of a process'
death:
 - getting a SIGCHLD if you're the process' parent. That's what a
supervisor uses (supervise, runsv, perpetrate, s4-supervise all work
on this model).
 - getting an EOF on a pipe or socket you're listening to, when the
monitored process is the only writer on the other side. That's what
fghack uses (and pidsig too, I presume).

 The EOF method does not require the monitorer to be the monitoree's
parent, so it's more flexible; but it does require the monitoree to
not go out of its way to arbitrarily close fds. If some daemon forks
itself and resists fghack and pidsig, you're out of luck: it definitely
won't be supervised.
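
 To make the EOF method concrete, here is a minimal Python sketch of the
trick. It is not fghack's or pidsig's source; the names
spawn_with_death_pipe and wait_for_eof are hypothetical. The watcher
keeps only the read end of a pipe, the daemon and everything it forks
inherit the write end, and read() returns EOF once the last holder has
exited or closed it:

```python
import os


def spawn_with_death_pipe(child_main):
    """Fork a child that holds the write end of a pipe.

    Returns (pid, read_fd). Descendants of the child inherit the write
    end too, so EOF on read_fd means that all of them are gone (or that
    they closed the descriptor on purpose).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        # Python marks pipe fds close-on-exec; undo that so the write
        # end also survives if child_main() execs a real daemon.
        os.set_inheritable(write_fd, True)
        try:
            child_main()
        finally:
            os._exit(0)
    # The watcher must not keep a write end open, or EOF never arrives.
    os.close(write_fd)
    return pid, read_fd


def wait_for_eof(read_fd):
    """Block until every holder of the write end has exited or closed it."""
    while os.read(read_fd, 4096):
        pass  # discard anything written to the pipe
    os.close(read_fd)
```

 Note that wait_for_eof() keeps blocking after the direct child has
exited, as long as a forked-off descendant still holds the write end.
That is exactly what lets the watcher outlive the daemon's own fork, and
also why a daemon that closes all its fds defeats the method.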

 If you have the process' pid, you can poll the process table, but
polling - as opposed to notification - is evil. Also, a fundamental
problem is that pids do NOT uniquely identify a process (which is the
main flaw with .pid files), unless you're blessed with an OS where
pids are 64 bits long.
 If you're willing to be non-portable, the obvious idea on Linux would
be an inotify fd listening for /proc/$pid's disappearance, but inotify(7)
documents that pseudo-filesystems such as /proc are not monitorable, so
that does not work. BSD does have such a mechanism in kevent/kqueue
(EVFILT_PROC with NOTE_EXIT). And all of this is definitely too much work
for a lousy daemon.
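
 For completeness, the polling approach looks roughly like this in
Python. It is a sketch of the technique being warned against, with
hypothetical names (pid_exists, poll_until_gone) and an arbitrary
default interval; it inherits the pid-reuse problem described above:

```python
import os
import time


def pid_exists(pid):
    """Probe the process table with signal 0 (nothing is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # it exists, we just may not signal it
    return True


def poll_until_gone(pid, interval=0.1, timeout=None):
    """Poll until `pid` disappears.

    Returns True once the pid is gone, False if `timeout` seconds pass
    first. A recycled pid is indistinguishable from the original one.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while pid_exists(pid):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

 A zombie that has not been reaped by its parent still counts as
existing for this probe, which is one more reason to prefer SIGCHLD or
the EOF method when either is available.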

-- 
 Laurent



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-03 19:25     ` Laurent Bercot
@ 2010-06-04 16:26       ` Wayne Marshall
  2010-06-04 16:54         ` Charlie Brady
  0 siblings, 1 reply; 10+ messages in thread
From: Wayne Marshall @ 2010-06-04 16:26 UTC (permalink / raw)
  To: Laurent Bercot, supervision

On Thu, 3 Jun 2010 21:25:30 +0200
Laurent Bercot <ska-supervision@skarnet.org> wrote:

> > These kinds of problems are not that theoretical - just
> > recently I saw svscan/svscanboot crashing on a >1y uptime
> > box, taking many of the processes with it, including most of
> > the supervise infrastructure, very likely not due to any
> > fault in them - could be oom gone wild, cosmic rays hitting
> > svscan memory, whatever).
> 
>  That's a typical case of "weak" supervision, as opposed to a
> "strong" supervision chain. "Strong" supervision makes sure
> that all the infrastructure is connected to init.
> 
>  * svscan achieves strong supervision *if* svscanboot is
> flagged as "respawn" in /etc/inittab on System V-style inits,
> in /etc/event.d/ with Upstart, or in /etc/gettys on BSD. It
> does *not* achieve it if svscanboot is started via some
> rc.local script (as the stock daemontools instructions tell
> you to do, shame on DJB! :))
>  * perp is in the same boat, depending on how you start
> perpboot.
> ...
> Strong supervision makes sure that your supervisor process
> tree is *always* alive and complete, unless process 1 itself
> crashes, in which case you're doomed to reboot anyway.
>

FWIW, the perp-setup(8)/perpboot(8) utilities do indeed enable
such "strong supervision" in the default configurations on both
BSD and Linux systems.  Let me know if you have any questions.

> > Another question would be if there are more ways to reliably
> > connect to any given process detecting it being gone - but
> > all the current daemons that I run can be handled now :)
> 
>  Unfortunately, no; not without support from the process you
> want to monitor. There are only two ways of being notified of
> a process' death:
>  - getting a SIGCHLD if you're the process' parent. That's
> what a supervisor uses (supervise, runsv, perpetrate,
> s4-supervise all work on this model).
>  - getting an EOF on a pipe or socket you're listening to,
> when the monitored process is the only writer on the other
> side. That's what fghack uses (and pidsig too, I presume).
>

Also FWIW, the minit/ninit suites offer a "pidfilehack"
utility that enables the supervisor to watch for SIGCHLD from
non-progeny processes.  It is clever and effective, but only
works as intended if running minit/ninit as process 1.  (The
trick is based on the fact that process 1 inherits processes
without parents.)

Cheers,

Wayne



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-04 16:26       ` Wayne Marshall
@ 2010-06-04 16:54         ` Charlie Brady
  2010-06-04 17:17           ` Wayne Marshall
  2010-06-04 18:43           ` Laurent Bercot
  0 siblings, 2 replies; 10+ messages in thread
From: Charlie Brady @ 2010-06-04 16:54 UTC (permalink / raw)
  To: Wayne Marshall; +Cc: supervision


On Fri, 4 Jun 2010, Wayne Marshall wrote:

> On Thu, 3 Jun 2010 21:25:30 +0200
> Laurent Bercot <ska-supervision@skarnet.org> wrote:
> 
> > > These kinds of problems are not that theoretical - just
> > > recently I saw svscan/svscanboot crashing on a >1y uptime
> > > box, taking many of the processes with it, including most of
> > > the supervise infrastructure, very likely not due to any
> > > fault in them - could be oom gone wild, cosmic rays hitting
> > > svscan memory, whatever).
> > 
> >  That's a typical case of "weak" supervision, as opposed to a
> > "strong" supervision chain. "Strong" supervision makes sure
> > that all the infrastructure is connected to init.
> > 
> >  * svscan achieves strong supervision *if* svscanboot is
> > flagged as "respawn" in /etc/inittab on System V-style inits,
> > in /etc/event.d/ with Upstart, or in /etc/gettys on BSD. It
> > does *not* achieve it if svscanboot is started via some
> > rc.local script (as the stock daemontools instructions tell
> > you to do, shame on DJB! :))
> >  * perp is in the same boat, depending on how you start
> > perpboot.
> > ...
> > Strong supervision makes sure that your supervisor process
> > tree is *always* alive and complete, unless process 1 itself
> > crashes, in which case you're doomed to reboot anyway.

There is a weakness in this "strong supervision" model. Any service with a 
'down' file will not be restarted if its supervise/runsv or 
svscan/runsvdir is replaced.



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-04 16:54         ` Charlie Brady
@ 2010-06-04 17:17           ` Wayne Marshall
  2010-06-04 17:21             ` Charlie Brady
  2010-06-04 18:43           ` Laurent Bercot
  1 sibling, 1 reply; 10+ messages in thread
From: Wayne Marshall @ 2010-06-04 17:17 UTC (permalink / raw)
  To: Charlie Brady; +Cc: supervision

On Fri, 4 Jun 2010 12:54:46 -0400 (EDT)
Charlie Brady <charlieb-supervision@budge.apana.org.au> wrote:

> > On Thu, 3 Jun 2010 21:25:30 +0200
> > Laurent Bercot <ska-supervision@skarnet.org> wrote:
> > 
> > > Strong supervision makes sure that your supervisor process
> > > tree is *always* alive and complete, unless process 1
> > > itself crashes, in which case you're doomed to reboot
> > > anyway.
> 
> There is a weakness in this "strong supervision" model. Any
> service with a 'down' file will not be restarted if its
> supervise/runsv or svscan/runsvdir is replaced.
> 

Why do you describe this as a "weakness"?  The down flagfile is
consulted only on startup of the supervisor.  If the
administrator has configured the service to be down on startup,
presumably she wants it to be down on startup.
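
The rule can be written down in a few lines. This is a Python sketch of
the convention as described here, not code from supervise or runsv, and
the function name is hypothetical:

```python
import os


def start_on_supervisor_launch(service_dir):
    """Decide whether a freshly started supervisor should run the service.

    The 'down' flag file is looked at only at this point. The supervisor
    cannot tell a system boot from a later restart of itself, so the
    answer is the same in both cases.
    """
    return not os.path.exists(os.path.join(service_dir, "down"))
```

A service that was brought up by hand after boot therefore still has its
down file in place, and a replacement supervisor started later will read
it exactly as the first one did.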

Cheers,

Wayne



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-04 17:17           ` Wayne Marshall
@ 2010-06-04 17:21             ` Charlie Brady
  2010-06-04 20:00               ` Wayne Marshall
  0 siblings, 1 reply; 10+ messages in thread
From: Charlie Brady @ 2010-06-04 17:21 UTC (permalink / raw)
  To: Wayne Marshall; +Cc: supervision


On Fri, 4 Jun 2010, Wayne Marshall wrote:

> On Fri, 4 Jun 2010 12:54:46 -0400 (EDT)
> Charlie Brady <charlieb-supervision@budge.apana.org.au> wrote:
> 
> > > On Thu, 3 Jun 2010 21:25:30 +0200
> > > Laurent Bercot <ska-supervision@skarnet.org> wrote:
> > > 
> > > > Strong supervision makes sure that your supervisor process
> > > > tree is *always* alive and complete, unless process 1
> > > > itself crashes, in which case you're doomed to reboot
> > > > anyway.
> > 
> > There is a weakness in this "strong supervision" model. Any
> > service with a 'down' file will not be restarted if its
> > supervise/runsv or svscan/runsvdir is replaced.
> 
> Why do you describe this as a "weakness"?  The down flagfile is
> consulted only on startup of the supervisor.  If the
> administrator has configured the service to be down on startup,
> presumably she wants it to be down on startup.

I thought that we were discussing here the situation where the supervisor
dies and is automatically restarted. That is not the 'on startup' where
the administrator intends the service to be down. "on startup" is long
gone, and the administrator has started the service, and wants it to
continue running. The automated restart of the supervisor shouldn't change
that running state.



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-04 16:54         ` Charlie Brady
  2010-06-04 17:17           ` Wayne Marshall
@ 2010-06-04 18:43           ` Laurent Bercot
  1 sibling, 0 replies; 10+ messages in thread
From: Laurent Bercot @ 2010-06-04 18:43 UTC (permalink / raw)
  To: supervision

> There is a weakness in this "strong supervision" model. Any service with a 
> 'down' file will not be restarted if its supervise/runsv or 
> svscan/runsvdir is replaced.

 If a branch of the supervision tree dies, the old subtree, including
leaves (i.e. services) is still alive. Manual admin intervention is
necessary to kill it off and recreate a new subtree, connected to init.
If there are any services with down files, but that need to be alive,
the admin can take care of them at that time.
 Now, if a service has a down file, and its supervisor dies, *and then*
the service dies too, then the service won't be restarted indeed; but
we're talking about a double failure, which should be uncommon.

 Nevertheless, down files are a decrease in reliability. They're practical
for test and manual intervention purposes, but I've never met a real-life
case where they are necessary. It's always possible to boot the machine
with a nearly-empty svscan directory and populate it during the later
initialization phases.

-- 
 Laurent



* Re: pidsig 0.11 - a fghack like de-daemonisation tool
  2010-06-04 17:21             ` Charlie Brady
@ 2010-06-04 20:00               ` Wayne Marshall
  0 siblings, 0 replies; 10+ messages in thread
From: Wayne Marshall @ 2010-06-04 20:00 UTC (permalink / raw)
  To: supervision

On Fri, 4 Jun 2010 13:21:18 -0400 (EDT)
Charlie Brady <charlieb-supervision@budge.apana.org.au> wrote:

> 
> On Fri, 4 Jun 2010, Wayne Marshall wrote:
> 
> > On Fri, 4 Jun 2010 12:54:46 -0400 (EDT)
> > Charlie Brady <charlieb-supervision@budge.apana.org.au>
> > wrote:
> > 
> > > > On Thu, 3 Jun 2010 21:25:30 +0200
> > > > Laurent Bercot <ska-supervision@skarnet.org> wrote:
> > > > 
> > > > > Strong supervision makes sure that your supervisor
> > > > > process tree is *always* alive and complete, unless
> > > > > process 1 itself crashes, in which case you're doomed
> > > > > to reboot anyway.
> > > 
> > > There is a weakness in this "strong supervision" model. Any
> > > service with a 'down' file will not be restarted if its
> > > supervise/runsv or svscan/runsvdir is replaced.
> > 
> > Why do you describe this as a "weakness"?  The down flagfile
> > is consulted only on startup of the supervisor.  If the
> > administrator has configured the service to be down on
> > startup, presumably she wants it to be down on startup.
> 
> I thought that we were discussing here the situation where the
> supervisor dies and is automatically restarted. That is not
> the 'on startup' where the administrator intends the service to
> be down. "on startup" is long gone, and the administrator has
> started the service, and wants it to continue running. The
> automated restart of the supervisor shouldn't change that
> running state.
> 

Well, the down flagfile is relevant on startup of the
supervisor.  Starting/restarting the supervisor may occur at
system boot, and/or any number of times thereafter.  Supervisors
do not normally need to know or care about the difference between
system boot and "thereafter".

If the administrator needs to differentiate between system boot
and "thereafter", she will probably need to effect that
differentiation through her system boot/shutdown scripts.  But
that is her problem, rather than a weakness in the supervisory
model.

Cheers,

Wayne


