Void Linux discussion
* epona.vuxu.org outage post mortem
@ 2019-02-14 21:54 Leah Neukirchen
  2019-02-14 22:07 ` hiro
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Leah Neukirchen @ 2019-02-14 21:54 UTC (permalink / raw)
  To: leahutils; +Cc: voidlinux


At 00:20 CET tonight (2019-02-14) epona.vuxu.org, a virtual machine
host that runs, among others, git.vuxu.org/inbox.vuxu.org and
hestia.vuxu.org, the aarch64 builder for Void Linux, went down and
only came back up around 11:00 CET.

This was entirely my fault, but how it happened is interesting:

I was informed that a user-space port forward was not working.  It was
implemented with socat, supervised by runit (the init system of Void
Linux):

	socat TCP4-LISTEN:3722,fork,su=nobody TCP6:hestia.vuxu.org:22

However, starting this showed the address was already in use:

	2019/02/14 00:20:44 socat[5049] E bind(5, {AF=2 0.0.0.0:3722}, 16): Address already in use

My assumption was there was a runaway instance of socat running (for
unknown reasons), and I decided to kill all socat instances.  My usual
tool of choice would have been `killall socat`, but as there were other
socat instances running on the machine, I only wanted to kill the port
3722 ones.

A quick test with `pgrep` showed a plausible list of PIDs, so I ran

	kill $(pgrep -f socat.*3722)

which seemed to work fine at first.

Several seconds later I was greeted with this message:

	Connection to epona.vuxu.org closed by remote host.
	Connection to epona.vuxu.org closed.

And the box didn't ping anymore...

As an experienced SSH user, I took this to mean that the host had shut
down in some controlled way; otherwise I would have gotten a
`broken pipe` message.

But how could that `kill` shut down the machine?  I could not come up
with any plausible theory, but it was already late and there was alcohol
involved as well.  So I decided to let it rest until the morning.

In the morning, the box still was not up and there was no evidence of
a network issue or anything.  I decided to enter the Hetzner Control
Panel and trigger an "automated reset".  Nothing changed, the box
still didn't ping.  I tried to activate the "vKVM rescue system", to
no avail.

At this point I actually assumed some hardware issue, and I called for
a "manual reset", which means someone has to get up, walk to the
machine, restart it, and watch a bit whether it seems to boot
properly.

Of course, the true reason was much simpler: the box was powered off.

Unfortunately, nothing about the Hetzner Control Panel shows you this
simple fact, so I guess I'm not the only one to send poor support
folks to go boot other people's machines.

The box booted fine and all services were restored within minutes.

The remaining question is how the command could shut down the machine,
and that is easy to answer too: `runsvdir`, the main runit process that
controls "stage 2" (i.e. while the system is up), displays the error
messages of all its direct child processes in its own command line (the
trailing `log:` argument), so you can check for unlogged messages with
`ps`:

	runsvdir -P /run/runit/runsvdir/current log: ....logs here....

Unfortunately, in the above situation this caused both "socat" and
"3722" to appear in the error messages, and thus in the process title.
That made `pgrep -f` match `runsvdir`, and `kill`, as commanded,
terminated it, which makes runit leave stage 2 and perform an orderly
shutdown of the system.  Duh.
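
The match can be reproduced without touching any real process.  The
following is a minimal sketch, not the commands from the incident: it
runs `grep -E` over two sample command lines, which approximates how
`pgrep -f` applies an extended regular expression to the full command
line.  The `runsvdir` line is a made-up reconstruction built from the
socat error shown earlier, and the anchored pattern is only one possible
choice.

```shell
#!/bin/sh
# Two sample command lines. The runsvdir one is hypothetical: it only
# illustrates a process title that carries the socat error message.
titles='socat TCP4-LISTEN:3722,fork,su=nobody TCP6:hestia.vuxu.org:22
runsvdir -P /run/runit/runsvdir/current log: socat[5049] E bind(5, {AF=2 0.0.0.0:3722}, 16): Address already in use'

# Unanchored pattern, as used in the incident: matches anywhere in the line.
loose=$(printf '%s\n' "$titles" | grep -cE 'socat.*3722')

# Anchored at the start of the command line and tied to the listen address.
strict=$(printf '%s\n' "$titles" | grep -cE '^socat TCP4-LISTEN:3722[, ]')

echo "unanchored pattern matches $loose line(s)"
echo "anchored pattern matches $strict line(s)"
```

Only the anchored pattern leaves the `runsvdir` line alone.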

Lessons learned:
- The first intuition is often right, even if it's not plausible at first.
- Don't use `pkill -f` as root, at least not without careful checking
  and regexp anchoring.
- If a box doesn't react to reset requests, try sending wake-on-lan to
  turn it on.
- runit should reboot by default, not shut down!
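
One way to do the "careful checking" is to look at the full command
lines before any signal is sent.  A rough sketch, assuming a pgrep that
supports `-a` (as the procps-ng one does); `kill_matching` is a
hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# kill_matching PATTERN
# Hypothetical helper (not part of procps or runit): list every process whose
# full command line matches PATTERN, and send SIGTERM only after an explicit "y".
kill_matching() {
    pattern=$1
    # -f matches against the full command line, -a prints it next to the PID.
    matches=$(pgrep -a -f -- "$pattern") || {
        echo "no process matches: $pattern" >&2
        return 1
    }
    printf '%s\n' "$matches"
    printf 'Kill these processes? [y/N] '
    read -r answer
    [ "$answer" = y ] || return 1
    # The first field of each pgrep -a line is the PID.
    printf '%s\n' "$matches" | awk '{print $1}' | xargs kill
}
```

Called as `kill_matching '^socat TCP4-LISTEN:3722'`, it prints the
candidates first, so a stray `runsvdir` line would be visible before
answering.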

-- 
Leah Neukirchen  <leah@vuxu.org>  http://leah.zone

-- 
You received this message because you are subscribed to the Google Groups "voidlinux" group.
To unsubscribe from this group and stop receiving emails from it, send an email to voidlinux+unsubscribe@googlegroups.com.
To post to this group, send email to voidlinux@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/voidlinux/87mumyvxxc.fsf%40vuxu.org.
For more options, visit https://groups.google.com/d/optout.


* Re: epona.vuxu.org outage post mortem
  2019-02-14 21:54 epona.vuxu.org outage post mortem Leah Neukirchen
@ 2019-02-14 22:07 ` hiro
  2019-02-15  8:11 ` Quentin Rameau
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: hiro @ 2019-02-14 22:07 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: leahutils, voidlinux

next time, consider starting by grepping the netstat output ;)


* Re: epona.vuxu.org outage post mortem
  2019-02-14 21:54 epona.vuxu.org outage post mortem Leah Neukirchen
  2019-02-14 22:07 ` hiro
@ 2019-02-15  8:11 ` Quentin Rameau
  2019-02-15  8:59   ` Leah Neukirchen
  2019-02-16  2:15 ` Diego Augusto Molina
  2019-02-23  1:14 ` Chris Brannon
  3 siblings, 1 reply; 7+ messages in thread
From: Quentin Rameau @ 2019-02-15  8:11 UTC (permalink / raw)
  To: voidlinux

Hi Leah,

> My assumption was there was a runaway instance of socat running (for
> unknown reasons), and I decided to kill all socat instances.  My usual
> tool of choice would have been `killall socat`, but as there were other
> socat instances running on the machine, I only wanted to kill the port
> 3722 ones.
>
> Lessons learned:
> - The first intuition is often right, even if it's not plausible at first.
> - Don't use `pkill -f` as root, at least not without careful checking
>   and regexp anchoring.
> - If a box doesn't react to reset requests, try sending wake-on-lan to
>   turn it on.
> - runit should reboot by default, not shutdown!

Or track the origin of the issue (runaway socat instance), did you find
anything?


* Re: epona.vuxu.org outage post mortem
  2019-02-15  8:11 ` Quentin Rameau
@ 2019-02-15  8:59   ` Leah Neukirchen
  0 siblings, 0 replies; 7+ messages in thread
From: Leah Neukirchen @ 2019-02-15  8:59 UTC (permalink / raw)
  To: voidlinux

Quentin Rameau <quinq@fifth.space> writes:

> Hi Leah,
>
>> My assumption was there was a runaway instance of socat running (for
>> unknown reasons), and I decided to kill all socat instances.  My usual
>> tool of choice would have been `killall socat`, but as there were other
>> socat instances running on the machine, I only wanted to kill the port
>> 3722 ones.
>>
>> Lessons learned:
>> - The first intuition is often right, even if it's not plausible at first.
>> - Don't use `pkill -f` as root, at least not without careful checking
>>   and regexp anchoring.
>> - If a box doesn't react to reset requests, try sending wake-on-lan to
>>   turn it on.
>> - runit should reboot by default, not shutdown!
>
> Or track the origin of the issue (runaway socat instance), did you find
> anything?

My current theory is that the socat listener should use `reuseaddr` so
it can restart immediately if needed.
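
A sketch of what that could look like, assuming the forwarder runs from
a runit `run` script (the script below is hypothetical; the thread does
not show the real one).  `reuseaddr` is the socat address option that
sets SO_REUSEADDR on the listening socket, which lets a new listener
bind while old connections on the port are still in TIME-WAIT:

```shell
#!/bin/sh
# Hypothetical runit run script for the port forward.
# Same socat invocation as in the original report, with reuseaddr added.
exec socat TCP4-LISTEN:3722,fork,reuseaddr,su=nobody TCP6:hestia.vuxu.org:22
```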

-- 
Leah Neukirchen  <leah@vuxu.org>  http://leah.zone


* Re: epona.vuxu.org outage post mortem
  2019-02-14 21:54 epona.vuxu.org outage post mortem Leah Neukirchen
  2019-02-14 22:07 ` hiro
  2019-02-15  8:11 ` Quentin Rameau
@ 2019-02-16  2:15 ` Diego Augusto Molina
  2019-02-16 21:07   ` Leah Neukirchen
  2019-02-23  1:14 ` Chris Brannon
  3 siblings, 1 reply; 7+ messages in thread
From: Diego Augusto Molina @ 2019-02-16  2:15 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: leahutils, voidlinux

On 2/14/19, Leah Neukirchen <leah@vuxu.org> wrote:
>
> At 00:20 CET tonight (2019-02-14) epona.vuxu.org, a virtual machine
> host that runs, among others, git.vuxu.org/inbox.vuxu.org and
> hestia.vuxu.org, the aarch64 builder for Void Linux, went down and
> only came up around 11 CET again.
>
> This was entirely my fault, but how it happened is interesting:
>
> I was informed a user-space port forwarding was not working.  It was
> realized using socat, supervised by runit (the init system of Void
> Linux):
>
> 	socat TCP4-LISTEN:3722,fork,su=nobody TCP6:hestia.vuxu.org:22
>
> However, starting this showed the address was already in use:
>
> 	2019/02/14 00:20:44 socat[5049] E bind(5, {AF=2 0.0.0.0:3722}, 16): Address already in use
>
> My assumption was there was a runaway instance of socat running (for
> unknown reasons), and I decided to kill all socat instances.  My usual
> tool of choice would have been `killall socat`, but as there were other
> socat instances running on the machine, I only wanted to kill the port
> 3722 ones.
>
> A quick test with `pgrep` showed a plausible list of PIDs, so I ran
>
> 	kill $(pgrep -f socat.*3722)
>
> which seemed to work fine at first.
>
> Several seconds later I was greeted with this message:
>
> 	Connection to epona.vuxu.org closed by remote host.
> 	Connection to epona.vuxu.org closed.
>
> And the box didn't ping anymore...
>
> As experienced SSH user, this indicated that the host shut down in
> some controlled way, else I would have gotten a `broken pipe` message.
>
> But how could the `pkill` shut down the machine?  I could not come up
> with any plausible theory, but it was already late and there was alcohol
> involved as well.  So I decided to leave it at rest until the morning.


"Oh, no! How could someone possibly drink and sysadmin!"
Said no other sysadmin ever.

>
> In the morning, the box still was not up and there was no evidence of
> a network issue or anything.  I decided to enter the Hetzner Control
> Panel and trigger an "automated reset".  Nothing changed, the box
> still didn't ping.  I tried to activate the "vKVM rescue system", to
> no avail.
>
> At this point I actually assumed some hardware issue, and I called for
> a "manual reset", which means someone has to get up, walk to the
> machine, restart it, and watch a bit whether it seems to boot
> properly.
>
> Of course, the true reason was much simpler: the box was powered off.
>
> Unfortunately, nothing about the Hetzner Control Panel shows you this
> simple fact, so I guess I'm not the only one to send poor support
> folks to go boot other people's machines.
>
> The box booted fine and all services were restored within minutes.
>
> The remaining question is how it's possible that the command shut down
> the machine, and it's easy to answer too:
> `runsvdir`, the main runit process that controls "stage 2", i.e.
> while the system is up, displays error messages of all direct
> child processes in its `argv[0]`, so you can check for unlogged
> messages with `ps`:
>
> 	runsvdir -P /run/runit/runsvdir/current log: ....logs here....
>
> Unfortunately, in above situation this resulted in both "socat" and
> "3722" to appear in the error messages, and thus the process title,
> which made `pkill -f` match it and, as commanded, kill `runsvdir`,
> which results in exiting stage 2 and runit performing an orderly
> shutdown of the system.  Duh.
>
> Lessons learned:
> - The first intuition is often right, even if it's not plausible at first.
> - Don't use `pkill -f` as root, at least not without careful checking
>   and regexp anchoring.
> - If a box doesn't react to reset requests, try sending wake-on-lan to
>   turn it on.
> - runit should reboot by default, not shutdown!
>
> --
> Leah Neukirchen  <leah@vuxu.org>  http://leah.zone
>

Here's my suggestion:

# ss -nlpt | grep 3722

That should include your offending instance of socat listening on TCP
3722, stating the PID that has the resource (a.k.a., the socat process
that opened the port). Killing that PID blindly might not always do
the trick (e.g. "while true; do socat ...; sleep 1; done") so you may
want to kill parents/children too. With that PID in mind use "ps faux"
to navigate through the process tree. My way is:

# ps faux | grep -vF \[ | less -SRI

The grep is to remove kernel processes which drown the output.

Bye.


* Re: epona.vuxu.org outage post mortem
  2019-02-16  2:15 ` Diego Augusto Molina
@ 2019-02-16 21:07   ` Leah Neukirchen
  0 siblings, 0 replies; 7+ messages in thread
From: Leah Neukirchen @ 2019-02-16 21:07 UTC (permalink / raw)
  To: Diego Augusto Molina; +Cc: leahutils, voidlinux

Diego Augusto Molina <diegoaugustomolina@gmail.com> writes:

> Here's my suggestion:
>
> # ss -nlpt | grep 3722
>
> That should include your offending instance of socat listening on TCP
> 3722, stating the PID that has the resource (a.k.a., the socat process
> that opened the port). Killing that PID blindly might not always do
> the trick (e.g. "while true; do socat ...; sleep 1; done") so you may
> want to kill parents/children too. With that PID in mind use "ps faux"
> to navigate through the process tree. My way is:

Yes, this didn't help here as the socket was in TIME-WAIT.

ss -napt however works.
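
For reference, a small sketch of the difference, assuming the iproute2
`ss` filter syntax; `show_port` is a hypothetical wrapper, not an
existing command.  Note that TIME-WAIT sockets no longer belong to any
process, so `-p` prints no PID for them:

```shell
#!/bin/sh
# show_port PORT
# Hypothetical helper: print TCP sockets bound to local port PORT.
show_port() {
    port=$1
    echo "listening only (what -l shows):"
    ss -nlpt "sport = :$port"
    echo "all states, including TIME-WAIT:"
    ss -napt "sport = :$port"
    echo "TIME-WAIT only:"
    ss -nat state time-wait "sport = :$port"
}
```

`show_port 3722` would then show whether the port is held by a live
listener or only by leftover TIME-WAIT entries.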

-- 
Leah Neukirchen  <leah@vuxu.org>  http://leah.zone


* Re: epona.vuxu.org outage post mortem
  2019-02-14 21:54 epona.vuxu.org outage post mortem Leah Neukirchen
                   ` (2 preceding siblings ...)
  2019-02-16  2:15 ` Diego Augusto Molina
@ 2019-02-23  1:14 ` Chris Brannon
  3 siblings, 0 replies; 7+ messages in thread
From: Chris Brannon @ 2019-02-23  1:14 UTC (permalink / raw)
  To: voidlinux

Leah Neukirchen writes:
> 
> At 00:20 CET tonight (2019-02-14) epona.vuxu.org, a virtual machine
> host that runs, among others, git.vuxu.org/inbox.vuxu.org and
> hestia.vuxu.org, the aarch64 builder for Void Linux, went down and
> only came up around 11 CET again.
>
> This was entirely my fault, but how it happened is interesting:

This is an excellent writeup.  I found it fascinating and informative.
Thank you for sharing.

-- Chris

