development list for leahutils (lr, xe, snooze, ...)
From: Leah Neukirchen <>
Subject: outage post mortem
Date: Thu, 14 Feb 2019 22:54:07 +0100
Message-ID: <>

At 00:20 CET tonight (2019-02-14), a virtual machine host that runs,
among other things, the aarch64 builder for Void Linux went down and
only came back up around 11:00 CET.

This was entirely my fault, but how it happened is interesting:

I was informed that a user-space port forwarding was not working.  It
was realized using socat, supervised by runit (the init system of Void
Linux):

	socat TCP4-LISTEN:3722,fork,su=nobody

However, starting this showed the address was already in use:

	2019/02/14 00:20:44 socat[5049] E bind(5, {AF=2}, 16): Address already in use

My assumption was there was a runaway instance of socat running (for
unknown reasons), and I decided to kill all socat instances.  My usual
tool of choice would have been `killall socat`, but as there were other
socat instances running on the machine, I only wanted to kill the port
3722 ones.

A quick test with `pgrep` showed a plausible list of PIDs, so I ran

	kill $(pgrep -f socat.*3722)

which seemed to work fine at first.

Several seconds later I was greeted with this message:

	Connection to closed by remote host.
	Connection to closed.

And the box didn't ping anymore...

As an experienced SSH user, I took this to mean that the host had shut
down in some controlled way; otherwise I would have gotten a `broken
pipe` message.

But how could that `kill` shut down the machine?  I could not come up
with any plausible theory, but it was already late and there was alcohol
involved as well.  So I decided to let it rest until the morning.

In the morning, the box still was not up and there was no evidence of
a network issue or anything.  I decided to enter the Hetzner Control
Panel and trigger an "automated reset".  Nothing changed, the box
still didn't ping.  I tried to activate the "vKVM rescue system", to
no avail.

At this point I actually assumed some hardware issue, and I called for
a "manual reset", which means someone has to get up, walk to the
machine, restart it, and watch a bit whether it seems to boot.

Of course, the true reason was much simpler: the box was powered off.

Unfortunately, nothing about the Hetzner Control Panel shows you this
simple fact, so I guess I'm not the only one to send poor support
folks to go boot other people's machines.

The box booted fine and all services were restored within minutes.

The remaining question is how it's possible that the command shut down
the machine, and it's easy to answer too:
`runsvdir`, the main runit process that controls "stage 2", i.e.
while the system is up, displays error messages of all direct
child processes in its `argv[0]`, so you can check for unlogged
messages with `ps`:

	runsvdir -P /run/runit/runsvdir/current log: ....logs here....

Unfortunately, in the above situation this resulted in both "socat" and
"3722" appearing in the error messages, and thus in the process title,
which made `pgrep -f` match it, so that `kill`, as commanded, killed
`runsvdir`.  That makes runit exit stage 2 and perform an orderly
shutdown of the system.  Duh.
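
The matching problem can be reproduced without touching any real
process.  The following is only a sketch: the two process titles are
made up for illustration (the `runsvdir` line merely imitates a title
that carries a socat error mentioning the port), and `grep -E` stands in
for the unanchored extended-regexp match that `pgrep -f` performs on
full command lines.

```shell
#!/bin/sh
# Made-up process titles for illustration; not real output from the host.
titles='socat TCP4-LISTEN:3722,fork,su=nobody
runsvdir -P /run/runit/runsvdir/current log: socat[5049] E ... 3722 ...'

# Unanchored, as in the incident: matches anywhere in the command line.
match_loose() { printf '%s\n' "$titles" | grep -E 'socat.*3722'; }

# Anchored: the command line has to start with "socat ".
match_anchored() { printf '%s\n' "$titles" | grep -E '^socat .*3722'; }

echo "loose pattern matches:";    match_loose
echo "anchored pattern matches:"; match_anchored
```

The loose pattern also selects the `runsvdir` line, the anchored one
does not.  On a real system `argv[0]` may be a full path such as
`/usr/bin/socat`, so the anchor would have to allow for that.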

Lessons learned:
- The first intuition is often right, even if it's not plausible at first.
- Don't use `pkill -f` (or `kill` on the output of `pgrep -f`) as root,
  at least not without careful checking and regexp anchoring.
- If a box doesn't react to reset requests, try sending wake-on-lan to
  turn it on.
- runit should reboot by default, not shut down!
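
One way the "careful checking" could be scripted is sketched below.
`check_matches` is a hypothetical helper, not part of runit or procps;
it assumes input in the "PID command line" format that `pgrep -a -f`
prints, and it only hands back PIDs when every matched command has the
expected name.

```shell
#!/bin/sh
# check_matches NAME  (hypothetical helper)
# Reads "PID COMMAND ARGS..." lines on stdin, the format printed by
# `pgrep -a -f PATTERN`.  Prints the PIDs on one line only if the basename
# of every matched command equals NAME; otherwise reports the stray
# matches on stderr and returns non-zero.  No matches also counts as failure.
check_matches() {
    awk -v want="$1" '
        {
            name = $2
            sub(/.*\//, "", name)        # strip a leading path, if any
            pids = pids " " $1
            if (name != want) {
                bad = 1
                print "unexpected match: " $0 > "/dev/stderr"
            }
        }
        END {
            if (bad || NR == 0) exit 1
            print substr(pids, 2)
        }
    '
}

# Intended use (left commented out so the sketch has no side effects):
#   pids=$(pgrep -a -f 'socat.*3722' | check_matches socat) && kill $pids
```

With a process title like the `runsvdir` one above in the match list,
the helper refuses and `kill` is never run.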

Leah Neukirchen  <>
