From: Ignat Korchagin <ignat@cloudflare.com>
To: Jason@zx2c4.com, "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
wireguard@lists.zx2c4.com, netdev <netdev@vger.kernel.org>,
linux-kernel <linux-kernel@vger.kernel.org>,
jiri@resnulli.us,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
Lorenzo Bianconi <lorenzo@kernel.org>
Cc: kernel-team <kernel-team@cloudflare.com>
Subject: Re: wireguard/napi stuck in napi_disable
Date: Mon, 23 Sep 2024 22:33:15 +0100 [thread overview]
Message-ID: <CALrw=nEoT2emQ0OAYCjM1d_6Xe_kNLSZ6dhjb5FxrLFYh4kozA@mail.gmail.com> (raw)
In-Reply-To: <CALrw=nGoSW=M-SApcvkP4cfYwWRj=z7WonKi6fEksWjMZTs81A@mail.gmail.com>
On Mon, Sep 23, 2024 at 7:23 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> Hello,
>
> We run calico on our Kubernetes cluster, which uses Wireguard to
> encrypt in-cluster traffic [1]. Recently we tried to improve the
> throughput of the cluster and eliminate some packet drops we’re seeing
> by switching on threaded NAPI [2] on these managed Wireguard
> interfaces. However, our Kubernetes hosts started to lock up once in a
> while.
>
> Analyzing one stuck host with drgn we were able to confirm that the
> code is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to
> be cleared for the Wireguard peer napi instance, but that never
> happens for some reason. For context the full state of the stuck napi
> instance is 0b100110111. What makes things worse - this happens when
> calico removes a Wireguard peer, which happens while holding the
> global rtnl_mutex, so all the other tasks requiring that mutex get
> stuck as well.
>
> Full stacktrace of the “looping” task:
>
> #0 context_switch (linux/kernel/sched/core.c:5380:2)
> #1 __schedule (linux/kernel/sched/core.c:6698:8)
> #2 schedule (linux/kernel/sched/core.c:6772:3)
> #3 schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
> #4 usleep_range_state (linux/kernel/time/timer.c:2363:8)
> #5 usleep_range (linux/include/linux/delay.h:68:2)
> #6 napi_disable (linux/net/core/dev.c:6477:4)
> #7 peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
> #8 set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
> #9 wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
> #10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
> #11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
> #12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
> #13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
> #14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
> #15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
> #16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
> #17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
> #18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
> #19 __sock_sendmsg (linux/net/socket.c:745:16)
> #20 ____sys_sendmsg (linux/net/socket.c:2560:8)
> #21 ___sys_sendmsg (linux/net/socket.c:2614:8)
> #22 __sys_sendmsg (linux/net/socket.c:2643:8)
> #23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
> #24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
> #25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)
>
> We have also noticed that a similar issue is observed, when we switch
> Wireguard threaded NAPI back to off: removing a Wireguard peer task
> may still spend a considerable amount of time in the above loop (and
> hold rtnl_mutex), however the host eventually recovers from this
> state.
>
> So the questions are:
> 1. Any ideas why NAPI_STATE_SCHED bit never gets cleared for the
> threaded NAPI case in Wireguard?
> 2. Is it generally a good idea for Wireguard to loop for an
> indeterminate amount of time, while holding the rtnl_mutex? Or can it
> be refactored?
I've been also trying to reproduce this issue with a script [1]. While
I could not reproduce the complete lockup I've been able to confirm
that peer_remove_after_dead() may take multiple seconds to execute -
all while holding the rtnl_mutex. Below is bcc-tools funclatency
output from a freshly compiled mainline (6.11):
# /usr/share/bcc/tools/funclatency peer_remove_after_dead
Tracing 1 functions for "peer_remove_after_dead"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 0 | |
16384 -> 32767 : 0 | |
32768 -> 65535 : 0 | |
65536 -> 131071 : 0 | |
131072 -> 262143 : 0 | |
262144 -> 524287 : 68 |** |
524288 -> 1048575 : 658 |********************|
1048576 -> 2097151 : 267 |******** |
2097152 -> 4194303 : 68 |** |
4194304 -> 8388607 : 124 |*** |
8388608 -> 16777215 : 182 |***** |
16777216 -> 33554431 : 72 |** |
33554432 -> 67108863 : 34 |* |
67108864 -> 134217727 : 22 | |
134217728 -> 268435455 : 11 | |
268435456 -> 536870911 : 2 | |
536870912 -> 1073741823 : 2 | |
1073741824 -> 2147483647 : 1 | |
2147483648 -> 4294967295 : 0 | |
4294967296 -> 8589934591 : 1 | |
avg = 14251705 nsecs, total: 21548578415 nsecs, count: 1512
Detaching...
So we have cases where it takes 2 or even 8 seconds to remove a single
peer, which is definitely not great considering we're holding a global
lock.
> We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
> downgrade the kernel a couple of patch revisions, but it did not help
> and our logs indicate that at least the non-threaded prolonged holding
> of the rtnl_mutex is happening for a while now.
>
> [1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
> [2]: https://docs.kernel.org/networking/napi.html#threaded
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476
Ignat
[1]: https://gist.githubusercontent.com/ignatk/4505d96e02815de3aa5649c4aa7c3fca/raw/177e4eab9f491024db6488cd0ea1cbba2d5579b4/wg.sh
next prev parent reply other threads:[~2024-11-18 13:24 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-23 18:23 Ignat Korchagin
2024-09-23 18:46 ` Eric Dumazet
2024-09-23 21:33 ` Ignat Korchagin [this message]
2024-09-25 15:06 ` Daniel Dao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CALrw=nEoT2emQ0OAYCjM1d_6Xe_kNLSZ6dhjb5FxrLFYh4kozA@mail.gmail.com' \
--to=ignat@cloudflare.com \
--cc=Jason@zx2c4.com \
--cc=bigeasy@linutronix.de \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=jiri@resnulli.us \
--cc=kernel-team@cloudflare.com \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lorenzo@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=wireguard@lists.zx2c4.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).