Development discussion of WireGuard
From: Raoul Bhatia <raoul.bhatia@radarcs.com>
To: "Jason A. Donenfeld" <Jason@zx2c4.com>,
	"wireguard@lists.zx2c4.com" <wireguard@lists.zx2c4.com>
Subject: Re: Wireguard connection lost between peers
Date: Sun, 30 May 2021 13:20:21 +0000
Message-ID: <31839237-F1FE-4760-82B1-7C01E93753E1@radarcs.com>
In-Reply-To: <1CCB5519-611B-4377-97A9-0F9179E3C6F2@radarcs.com>

(Apologies for top posting; my current email client is not mailing list friendly)

Hi Jason and community,

We are continuing to observe the mentioned behavior,
and it seems that the error is mostly triggered
during high(er) load situations, i.e. 500-2000 Mbit/s, as we move
chunks of 200 GB+ of data via TCP over WireGuard.
Until now, we had managed to keep things more stable, but
then we faced an outage again.



Observations:
A. Connectivity is more likely to break between the same components, i.e.
I saw the same container fail 1-2x per day within a week of monitoring [1];
I see the issue more often between the _same_ physical hosts of the cluster
(however, these are the more traffic heavy ones)



B. I've also seen an interesting behavior where I had to
reset a connection twice between 2 physical servers
within a short amount of time, first from node2 and then from node1.

Please find the related config snippets and 
journalctl -k messages (filtered for the peers) here:
* https://nem3d.net/node1.txt
* https://nem3d.net/node2.txt

Approx timeline of events:
1. 05/26 One container failed on node2 and was recovered from the host [1]
2. 05/26 Almost exactly 1.5 h later, the connection between node2 and node1
                broke with "stopped hearing back after 15 seconds"
3. 05/27 06:02 on node2, I manually used peer remove node1/syncconf
                to re-establish the connection
4. 05/27 06:25 node1 logged "stopped hearing back after 15 seconds"
5. 05/27 06:31 on node1, I used peer remove node2/syncconf
                to re-establish the connection

Afterwards, things worked smoothly again.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Environment:
Debian 9 with kernel from package and wireguard from buster-backports

$ uname -r -v -m -o
4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

$ modinfo wireguard
filename:       /lib/modules/4.9.0-11-amd64/updates/dkms/wireguard.ko
intree:         Y
alias:          net-pf-16-proto-16-family-wireguard
alias:          rtnl-link-wireguard
version:        1.0.20210124
author:         Jason A. Donenfeld <Jason@zx2c4.com>
description:    WireGuard secure network tunnel
license:        GPL v2
srcversion:     507BE23A7368F016AEBAF94
depends:        udp_tunnel,ip6_udp_tunnel
retpoline:      Y
vermagic:       4.9.0-11-amd64 SMP mod_unload modversions

Thanks,
Raoul

[1] I created a small cronjob that checks host/container connectivity via ICMP ping
and initiates the wg dump/peer remove/syncconf recovery procedure in case of an error.
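
For illustration, a minimal sketch of what such a check could look like. This is
not the actual cronjob; the interface name, peer address, and the variable and
function names (IFACE, PEER_IP, PEER_KEY, check_peer, recover_peer) are
assumptions made up for this example. It only uses the wg(8) subcommands
showconf, set ... peer ... remove, and syncconf.

```shell
#!/bin/sh
# Sketch of a ping-based watchdog with the dump/remove/syncconf recovery.
# All names and default values below are illustrative assumptions.
IFACE="${IFACE:-wg0}"            # assumed WireGuard interface name
PEER_IP="${PEER_IP:-10.0.0.2}"   # hypothetical tunnel address of the peer
PEER_KEY="${PEER_KEY:-}"         # base64 public key of the peer (must be set)

recover_peer() {
    # Refuse to run without a peer key.
    [ -n "$PEER_KEY" ] || return 1
    # mktemp creates the file with mode 0600; the dump contains the private key.
    conf=$(mktemp) || return 1
    # 1. Dump the running configuration.
    wg showconf "$IFACE" > "$conf" || { rm -f "$conf"; return 1; }
    # 2. Remove the peer that stopped responding.
    wg set "$IFACE" peer "$PEER_KEY" remove
    # 3. Re-apply the dumped configuration, which re-adds the peer.
    wg syncconf "$IFACE" "$conf"
    rc=$?
    rm -f "$conf"
    return "$rc"
}

check_peer() {
    # Three ICMP echo requests, 2 seconds timeout each (iputils ping).
    if ping -c 3 -W 2 "$PEER_IP" > /dev/null 2>&1; then
        return 0
    fi
    recover_peer
}

# Only act when invoked explicitly, e.g. from cron: watchdog.sh --run
if [ "${1:-}" = "--run" ]; then
    check_peer
fi
```

For a container interface, the same functions would have to run inside the
container's network namespace (e.g. via "nsenter -n").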

On 12.05.21, 07:19, "Raoul Bhatia" <raoul.bhatia@radarcs.com> wrote:
> [...]
>> That's surprising behavior. Thanks for debugging it. Can you see if
>> you can reproduce with dynamic logging enabled? That'll give some
>> useful information in dmesg:
>>
>>            # modprobe wireguard && echo module wireguard +p > /sys/kernel/debug/dynamic_debug/control
> 
> I did enable the debug control and also set
>   sysctl -w net.core.message_cost=0
> and have extracted a sample of the issue.
> Please find it here https://nem3d.net/wireguard_20210512a.txt
> 
> From my observation, it is always the following symptoms:
> 1. Everything is WORKING:
> LXC container d1-h sends handshake initiation.
> Host wg0 receives, re-creates keypair, answers
> d1-h receives, re-creates keypair, sends keepalive
> wg0 receives keepalive
> etc.
> 
> 
> 2. At some point it BREAKS:
> d1-h stops hearing back after 15 seconds.
> Initialization loop like mentioned above
> d1-h stops hearing back after 15 seconds.
> etc.
> 
> As mentioned, the resolution is to dump the config,
> remove the peer, and syncconf to restore.
> This time, I used "nsenter -n" to apply this procedure to the
> unprivileged container interface d1-h.
> 
> Lastly, we also saw similar behavior even between 2 physical hosts.
> I will try to gather similar debug information.
> 
> Please let me know if further information is needed to
> better understand the problem.

