(Appologies for top posting; my current email client is not mailing list friendly)

Hi Jason and community,

We are continuing to observe the mentioned behavior
and it seems that the error we are seeing is mostly triggered
during high(er) load situations i.e. 500Mbit-2000Mbit and as we move
around 200GB+ chunks of data via TCP over WireGuard.
Until now, we managed to keep things more stable but
then again faced an outage.



Observations:
A. Connectivity is more likely to break between the same components, i.e.
I saw the same container fail 1-2x per day within a week of monitoring [1];
I see the issue more often between the _same_ physical hosts of the cluster
(however, these are the more traffic heavy ones)



B. I've also seen an interesting behavior where I had to
reset a connection 2x between 2 physical servers
within a short amount of time and first from node2 and then from node1.

Please find the related config snippets and 
journalctl -k messages (filtered for the peers) here:
* https://nem3d.net/node1.txt
* https://nem3d.net/node2.txt

Approx timeline of events:
1. 05/26 One container failed on node2 and was recovered from the host [1]
2. 05/26 Almost exactly 1,5h later, the connection between node2 and node1
                broke with "stopped hearing back after 15 seconds"
3. 05/27 06.02h on node2 I manually used peer remove node1/syncconf
                to re-establish the connection
4. 05/27 06.25h on node1 logged "stopped hearing back after 15 seconds"
5. 05/27 06.31h on node1 I used peer remove node2/syncconf
                to re-establish the connection

Afterwards, things worked smooth again.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Environment:
Debian 9 with kernel from package and wireguard from buster-backports

$ uname -r -v -m -o
4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

$ modinfo wireguard
filename:       /lib/modules/4.9.0-11-amd64/updates/dkms/wireguard.ko
intree:         Y
alias:          net-pf-16-proto-16-family-wireguard
alias:          rtnl-link-wireguard
version:        1.0.20210124
author:         Jason A. Donenfeld <Jason@zx2c4.com>
description:    WireGuard secure network tunnel
license:        GPL v2
srcversion:     507BE23A7368F016AEBAF94
depends:        udp_tunnel,ip6_udp_tunnel
retpoline:      Y
vermagic:       4.9.0-11-amd64 SMP mod_unload modversions

Thanks,
Raoul

[1] I created a small cronjob that checks host/container connectivty via icmp/ping
and initiates the wg dump/peer remove/syncconf recovery procedure in case of an error.

On 12.05.21, 07:19, "Raoul Bhatia" <raoul.bhatia@radarcs.com> wrote:
> [...]
>> That's surprising behavior. Thanks for debugging it. Can you see if
>> you can reproduce with dynamic logging enabled? That'll give some
>> useful information in dmesg:
>>
>>            # modprobe wireguard && echo module wireguard +p >
>> /sys/kernel/debug/dynamic_debug/control
> 
> I did enable the debug control and also set
>   sysctl -w net.core.message_cost=0
> and have extracted a sample of the issue.
> Please find it here https://nem3d.net/wireguard_20210512a.txt
> 
> From my observation, it is always the following symptoms:
> 1. Everything is WORKING:
> LXC container d1-h sends handshake initiation.
> Host wg0 receives, re-creates keypair, answers
> d1-h receives, re-creates keypair, sends keepalive
> wg0 receives keepalive
> etc.
> 
> 
> 2. Somewhen it BREAKS
> d1-h stopps hearing back after 15 seconds.
> Initialization loop like mentioned above
> d1-h stopps hearing back after 15 seconds.
> etc.
> 
> As mentioned, the resolution is to dump the config, 
> remove the peer, and syncconf to restore.
> This time,  I used "nsenter -n" to apply this procedure to the
> unprivileged container interface d1-h.
> 
> Lastly, we also saw similar behavior even between 2 physical hosts.
> I will try to gather similar debug information.
> 
> Please let me know if further information is needed to
> better understand the problem.