Development discussion of WireGuard
* Re: Wireguard connection lost between peers
@ 2021-05-12  5:19 Raoul Bhatia
  2021-05-30 13:20 ` Raoul Bhatia
  0 siblings, 1 reply; 4+ messages in thread
From: Raoul Bhatia @ 2021-05-12  5:19 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: wireguard

Hi Jason

Apologies for taking some time to get back to you.
We tried to verify a few things and to see whether we could spot anything unusual,
and waited for a few more instances to occur in order to collect sufficient data.

> That's surprising behavior. Thanks for debugging it. Can you see if
> you can reproduce with dynamic logging enabled? That'll give some
> useful information in dmesg:
>
>            # modprobe wireguard && echo module wireguard +p >
> /sys/kernel/debug/dynamic_debug/control

I enabled the debug control, also set
  sysctl -w net.core.message_cost=0
and extracted a sample of the issue.
Please find it here: https://nem3d.net/wireguard_20210512a.txt

From my observation, the symptoms are always the following:
1. Everything is WORKING:
LXC container d1-h sends handshake initiation.
Host wg0 receives, re-creates keypair, answers
d1-h receives, re-creates keypair, sends keepalive
wg0 receives keepalive
etc.


2. At some point it BREAKS:
d1-h stops hearing back after 15 seconds.
Handshake initiation loop as described above.
d1-h stops hearing back after 15 seconds.
etc.
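
To quantify how often each peer runs into this state, the kernel log can be
summarized per peer. The following is a minimal sketch, assuming that the
relevant debug lines contain the phrase "stopped hearing back" and a
"peer <id>" token; count_stalls is a hypothetical helper name, not part of
WireGuard.

```sh
#!/bin/sh
# count_stalls: read kernel log lines on stdin and print "<count> peer <id>"
# for every peer that logged a "stopped hearing back" message.
# Assumption: each such line contains the token "peer <numeric id>".
count_stalls() {
    grep 'stopped hearing back' |
        sed -n 's/.*peer \([0-9][0-9]*\).*/peer \1/p' |
        sort | uniq -c | awk '{ print $1, $2, $3 }'
}
```

It could be fed from the kernel journal, e.g. "journalctl -k | count_stalls".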

As mentioned, the resolution is to dump the config,
remove the peer, and syncconf to restore.
This time, I used "nsenter -n" to apply this procedure to the
unprivileged container interface d1-h.

Lastly, we also saw similar behavior even between 2 physical hosts.
I will try to gather similar debug information.

Please let me know if further information is needed to
better understand the problem.

Raoul



* Re: Wireguard connection lost between peers
  2021-05-12  5:19 Wireguard connection lost between peers Raoul Bhatia
@ 2021-05-30 13:20 ` Raoul Bhatia
  0 siblings, 0 replies; 4+ messages in thread
From: Raoul Bhatia @ 2021-05-30 13:20 UTC (permalink / raw)
  To: Jason A. Donenfeld, wireguard

(Apologies for top posting; my current email client is not mailing list friendly)

Hi Jason and community,

We continue to observe the behavior described earlier.
The error seems to be triggered mostly under high(er) load,
i.e. 500-2000 Mbit/s, while we move 200GB+ chunks of data
via TCP over WireGuard.
For a while we managed to keep things more stable,
but then we faced another outage.



Observations:
A. Connectivity is more likely to break between the same components, i.e.
I saw the same container fail 1-2 times per day within a week of monitoring [1];
I see the issue more often between the _same_ physical hosts of the cluster
(however, these are the more traffic-heavy ones).



B. I've also seen an interesting behavior where I had to
reset a connection between 2 physical servers twice
within a short amount of time, first from node2 and then from node1.

Please find the related config snippets and 
journalctl -k messages (filtered for the peers) here:
* https://nem3d.net/node1.txt
* https://nem3d.net/node2.txt

Approx timeline of events:
1. 05/26: One container failed on node2 and was recovered from the host [1].
2. 05/26: Almost exactly 1.5 hours later, the connection between node2 and node1
                broke with "stopped hearing back after 15 seconds".
3. 05/27 06:02: On node2, I manually used peer remove node1/syncconf
                to re-establish the connection.
4. 05/27 06:25: node1 logged "stopped hearing back after 15 seconds".
5. 05/27 06:31: On node1, I used peer remove node2/syncconf
                to re-establish the connection.

Afterwards, things worked smoothly again.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Environment:
Debian 9 with kernel from package and wireguard from buster-backports

$ uname -r -v -m -o
4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

$ modinfo wireguard
filename:       /lib/modules/4.9.0-11-amd64/updates/dkms/wireguard.ko
intree:         Y
alias:          net-pf-16-proto-16-family-wireguard
alias:          rtnl-link-wireguard
version:        1.0.20210124
author:         Jason A. Donenfeld <Jason@zx2c4.com>
description:    WireGuard secure network tunnel
license:        GPL v2
srcversion:     507BE23A7368F016AEBAF94
depends:        udp_tunnel,ip6_udp_tunnel
retpoline:      Y
vermagic:       4.9.0-11-amd64 SMP mod_unload modversions

Thanks,
Raoul

[1] I created a small cron job that checks host/container connectivity via ICMP ping
and initiates the wg dump/peer remove/syncconf recovery procedure in case of an error.
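
A watchdog of this kind could look roughly like the sketch below. This is an
illustration under stated assumptions, not the actual cron job: watch_peer and
peer_reachable are hypothetical names, the ping count and timeout are arbitrary,
and the recovery command is supplied by the caller.

```sh
#!/bin/sh
# Hypothetical watchdog sketch; not the cron job used in production.
# PING is overridable so the logic can be exercised without network access.
PING="${PING:-ping}"

# peer_reachable ADDR: succeed if ADDR answers at least one of 3 echo requests.
peer_reachable() {
    "$PING" -c 3 -W 2 "$1" > /dev/null 2>&1
}

# watch_peer ADDR RECOVERY_COMMAND [ARGS...]
# Runs the recovery command only when ADDR does not answer.
watch_peer() {
    addr="$1"
    shift
    if peer_reachable "$addr"; then
        return 0
    fi
    echo "peer $addr unreachable, running: $*" >&2
    "$@"
}
```

A cron entry would then call watch_peer with the container's tunnel address and
a script that performs the dump/peer remove/syncconf steps.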




* Re: Wireguard connection lost between peers
  2021-04-29 10:30 Raoul Bhatia
@ 2021-04-30 13:42 ` Jason A. Donenfeld
  0 siblings, 0 replies; 4+ messages in thread
From: Jason A. Donenfeld @ 2021-04-30 13:42 UTC (permalink / raw)
  To: raoul.bhatia; +Cc: wireguard, Velimir Iveljic

Hi Raoul,

That's surprising behavior. Thanks for debugging it. Can you see if
you can reproduce with dynamic logging enabled? That'll give some
useful information in dmesg:

           # modprobe wireguard && echo module wireguard +p > /sys/kernel/debug/dynamic_debug/control

Thanks,
Jason


* Wireguard connection lost between peers
@ 2021-04-29 10:30 Raoul Bhatia
  2021-04-30 13:42 ` Jason A. Donenfeld
  0 siblings, 1 reply; 4+ messages in thread
From: Raoul Bhatia @ 2021-04-29 10:30 UTC (permalink / raw)
  To: wireguard; +Cc: Velimir Iveljic

Dear List,

We are experiencing an unusual issue where WireGuard connectivity between peers suddenly stops working.
The connection itself seems to be up, but the peers cannot communicate with each other (more details below).
Any insight would be greatly appreciated.

Software versions:
- Debian Stretch 4.9.0-11-amd64 (4.9.189-3+deb9u2)
- LXC version: 2.0.7
- Wireguard: 1.0.20210124 (from buster-backports)
 
Environment:
A Debian host serves as LXC hypervisor for unprivileged containers.
WireGuard is used as a network layer for the containers, which means that on the host we create a physical WG interface for each container.
Inside the containers, we run a distributed cluster that spans multiple containers on multiple physical servers interconnected via 10G links (i.e. 6 physical servers w/ 8-10 containers each).
So the network load can get comparatively high: on average 500Mbit/s, with peaks of ~3Gbit/s.
 
Problem description:
The outlined setup works fine in most cases.
Occasionally, however, one container completely loses connectivity, and is not reachable _even_ from the underlying host.
We have not been able to identify what triggers this, but we have observed it happening when network traffic is high.
NOTE: We also had a similar (same?) issue between two physical hosts.
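
One possible way to spot an affected peer from the host is to look at the age
of its last completed handshake, as printed by "wg show <interface>
latest-handshakes" (one "<public key> <unix timestamp>" pair per line). The
sketch below is an assumption-laden illustration, not something we run:
stale_peers is a hypothetical helper, and the age threshold is an arbitrary
choice based on the expectation that a peer with active traffic completes a new
handshake roughly every two minutes.

```sh
#!/bin/sh
# stale_peers MAX_AGE [NOW]
# Reads "<public-key> <unix-timestamp>" lines on stdin and prints the public
# keys whose last handshake is older than MAX_AGE seconds.
# NOW defaults to the current time; it is a parameter to allow offline checks.
stale_peers() {
    max_age="$1"
    now="${2:-$(date +%s)}"
    awk -v max="$max_age" -v now="$now" '
        # A timestamp of 0 means that no handshake has completed yet.
        $2 == 0 || now - $2 > max { print $1 }
    '
}
```

For example: wg show wg0 latest-handshakes | stale_peers 180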

So far, we have identified two ways to restore the service:
1. Restart the wg-quick@wg0 service on the host, which is _not_ sustainable because this resets the connectivity of all containers, impacting the cluster.
2. Dump the WG conf, manually remove the unreachable peer public key from the interface, and then re-sync the dumped conf.
--- SNIP ---
$ wg-quick strip wg0 > wg0_peers
$ wg show wg0 dump
$ wg set wg0 peer $PEER_PUB_KEY remove
$ wg syncconf wg0 wg0_peers
--- SNIP ---
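
The same steps could be wrapped in a small function so that they can be applied
to a single peer in one go. This is a minimal sketch under assumptions:
recover_peer is a hypothetical name, the informational "wg show wg0 dump" step
is omitted, and WG/WG_QUICK are overridable (e.g. WG=echo) to allow a dry run.

```sh
#!/bin/sh
# Sketch of the per-peer recovery procedure; not a tested production script.
WG="${WG:-wg}"
WG_QUICK="${WG_QUICK:-wg-quick}"

# recover_peer IFACE PEER_PUB_KEY
recover_peer() {
    iface="$1"
    peer="$2"
    if [ -z "$iface" ] || [ -z "$peer" ]; then
        echo "usage: recover_peer IFACE PEER_PUB_KEY" >&2
        return 2
    fi
    conf="$(mktemp)" || return 1
    # 1. Save the wg(8)-compatible configuration of the interface.
    "$WG_QUICK" strip "$iface" > "$conf" || { rm -f "$conf"; return 1; }
    # 2. Remove only the unreachable peer from the running interface.
    "$WG" set "$iface" peer "$peer" remove || { rm -f "$conf"; return 1; }
    # 3. Re-sync the saved configuration, which re-adds the removed peer.
    "$WG" syncconf "$iface" "$conf"
    rc=$?
    rm -f "$conf"
    return "$rc"
}
```

For example: recover_peer wg0 "$PEER_PUB_KEY"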
 
Additional notes:
1. So far, we have not managed to reproduce the issue in a test environment.
2. We cannot easily upgrade the versions that we run in production.
3. We suspected that time settings on the host could be the issue, so we made sure timesyncd is configured properly. Since then, we observe the issue less frequently, but it is not fully gone.
4. The dynamic kernel log doesn't provide much information other than sending/receiving handshake requests, keepalive packets, and re-creating keypairs.
 
WireGuard module version
---
filename: /lib/modules/4.9.0-11-amd64/updates/dkms/wireguard.ko
intree: Y
alias: net-pf-16-proto-16-family-wireguard
alias: rtnl-link-wireguard
version: 1.0.20210124
author: Jason A. Donenfeld <Jason@zx2c4.com>
description: WireGuard secure network tunnel
license: GPL v2
srcversion: 507BE23A7368F016AEBAF94
depends: udp_tunnel,ip6_udp_tunnel
retpoline: Y
vermagic: 4.9.0-11-amd64 SMP mod_unload modversions
---

** The issue was also observed on a (now decommissioned) host running Debian 4.9.0-14-amd64 (4.9.246-2) with WireGuard module 1.0.20210124.

Any insight would be appreciated.  Happy to share more debug information as requested / offline.

Thanks,
Raoul
PS. Please reply to me / reply all, as I am currently not subscribed to the mailing list.


