(Apologies for top posting; my current email client is not mailing-list friendly.)

Hi Jason and community,

We are continuing to observe the behavior described earlier. The error seems to be triggered mostly under higher load, i.e. 500-2000 Mbit/s, while we move chunks of 200 GB or more via TCP over WireGuard. We had managed to keep things more stable until now, but then faced another outage.

Observations:

A. Connectivity is more likely to break between the same components: I saw the same container fail once or twice per day within a week of monitoring [1], and I see the issue more often between the _same_ physical hosts of the cluster (however, these are the more traffic-heavy ones).

B. I have also seen an interesting behavior where I had to reset a connection twice between two physical servers within a short amount of time, first from node2 and then from node1.

Please find the related config snippets and journalctl -k messages (filtered for the peers) here:
* https://nem3d.net/node1.txt
* https://nem3d.net/node2.txt

Approximate timeline of events:
1. 05/26: One container failed on node2 and was recovered from the host [1].
2. 05/26: Almost exactly 1.5 h later, the connection between node2 and node1 broke with "stopped hearing back after 15 seconds".
3. 05/27 06:02: On node2, I manually used peer remove node1/syncconf to re-establish the connection.
4. 05/27 06:25: node1 logged "stopped hearing back after 15 seconds".
5. 05/27 06:31: On node1, I used peer remove node2/syncconf to re-establish the connection.

Afterwards, things worked smoothly again.
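
For anyone who wants to try the same workaround, here is a minimal sketch of the manual recovery used in steps 3 and 5. It is not my exact script. It assumes that "dump the config" maps to "wg showconf", that the interface is called wg0, and that the peer is identified by its base64 public key. The function name reset_peer and the WG override variable are hypothetical helpers added for illustration.

```shell
#!/bin/sh
# Sketch only: reset one WireGuard peer by removing it and re-applying
# the configuration that was dumped just before the removal.
# WG can be overridden (e.g. WG=echo) to dry-run the sequence.
WG="${WG:-wg}"

reset_peer() {
    iface="$1"    # interface name, e.g. wg0 (assumption)
    pubkey="$2"   # base64 public key of the stuck peer

    # mktemp creates the file with mode 0600; the dump contains the private key.
    conf="$(mktemp)" || return 1

    # 1. Dump the running configuration (still includes the stuck peer).
    $WG showconf "$iface" > "$conf" || { rm -f "$conf"; return 1; }

    # 2. Remove the stuck peer from the live interface.
    $WG set "$iface" peer "$pubkey" remove || { rm -f "$conf"; return 1; }

    # 3. Re-apply the dump; syncconf adds the peer back with fresh state.
    $WG syncconf "$iface" "$conf"
    rc=$?

    rm -f "$conf"
    return $rc
}
```

A call would look like "reset_peer wg0 <public key of the other node>". For an interface that lives inside a container's network namespace, the same three wg calls would have to be run in that namespace, for example via nsenter.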
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Environment: Debian 9 with kernel from package and wireguard from buster-backports

$ uname -r -v -m -o
4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

$ modinfo wireguard
filename:       /lib/modules/4.9.0-11-amd64/updates/dkms/wireguard.ko
intree:         Y
alias:          net-pf-16-proto-16-family-wireguard
alias:          rtnl-link-wireguard
version:        1.0.20210124
author:         Jason A. Donenfeld
description:    WireGuard secure network tunnel
license:        GPL v2
srcversion:     507BE23A7368F016AEBAF94
depends:        udp_tunnel,ip6_udp_tunnel
retpoline:      Y
vermagic:       4.9.0-11-amd64 SMP mod_unload modversions

Thanks,
Raoul

[1] I created a small cronjob that checks host/container connectivity via ICMP ping and initiates the wg dump/peer remove/syncconf recovery procedure in case of an error.

On 12.05.21, 07:19, "Raoul Bhatia" wrote:

> [...]
>> That's surprising behavior. Thanks for debugging it. Can you see if
>> you can reproduce with dynamic logging enabled? That'll give some
>> useful information in dmesg:
>>
>> # modprobe wireguard && echo module wireguard +p >
>> /sys/kernel/debug/dynamic_debug/control
>
> I did enable the debug control and also set
> sysctl -w net.core.message_cost=0
> and have extracted a sample of the issue.
> Please find it here https://nem3d.net/wireguard_20210512a.txt
>
> From my observation, it is always the following symptoms:
> 1. Everything is WORKING:
> LXC container d1-h sends handshake initiation.
> Host wg0 receives, re-creates keypair, answers
> d1-h receives, re-creates keypair, sends keepalive
> wg0 receives keepalive
> etc.
>
>
> 2. At some point it BREAKS
> d1-h stops hearing back after 15 seconds.
> Initialization loop like mentioned above
> d1-h stops hearing back after 15 seconds.
> etc.
>
> As mentioned, the resolution is to dump the config,
> remove the peer, and syncconf to restore.
> This time, I used "nsenter -n" to apply this procedure to the
> unprivileged container interface d1-h.
>
> Lastly, we also saw similar behavior even between 2 physical hosts.
> I will try to gather similar debug information.
>
> Please let me know if further information is needed to
> better understand the problem.
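
For reference, the connectivity check from footnote [1] could be sketched roughly as follows. This is not the actual cronjob, only an illustration of the idea: ping the peer's tunnel address and, if there is no reply, run a recovery command. The function name check_peer, the PING override variable, the ping count and timeout, and the address in the usage note are all assumptions.

```shell
#!/bin/sh
# Sketch only: ping a peer and run a recovery command when it does not answer.
# PING can be overridden (e.g. PING=true or PING=false) for a dry run.
PING="${PING:-ping}"

check_peer() {
    addr="$1"   # tunnel address of the peer to probe (hypothetical)
    shift       # remaining arguments: recovery command to run on failure

    # Three echo requests, waiting up to 2 seconds for a reply.
    if $PING -c 3 -W 2 "$addr" >/dev/null 2>&1; then
        return 0
    fi

    echo "peer $addr unreachable, running recovery" >&2
    "$@"
}
```

From cron this might be invoked as "check_peer 10.0.0.2 /usr/local/sbin/recover-peer", where both the address and the recovery script path are placeholders for the dump/peer remove/syncconf procedure described above.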