Development discussion of WireGuard
From: Ignat Korchagin <ignat@cloudflare.com>
To: Jason@zx2c4.com, "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>,
	Paolo Abeni <pabeni@redhat.com>, wireguard@lists.zx2c4.com,
	netdev <netdev@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>, jiri@resnulli.us,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Lorenzo Bianconi <lorenzo@kernel.org>
Cc: kernel-team <kernel-team@cloudflare.com>
Subject: wireguard/napi stuck in napi_disable
Date: Mon, 23 Sep 2024 19:23:14 +0100
Message-ID: <CALrw=nGoSW=M-SApcvkP4cfYwWRj=z7WonKi6fEksWjMZTs81A@mail.gmail.com>

Hello,

We run Calico on our Kubernetes cluster, which uses WireGuard to
encrypt in-cluster traffic [1]. Recently we tried to improve the
throughput of the cluster and eliminate some packet drops we were
seeing by switching on threaded NAPI [2] on these managed WireGuard
interfaces. However, our Kubernetes hosts then started to lock up
once in a while.
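
For reference, threaded NAPI is toggled per network device through
sysfs, as described in [2]. Below is a minimal sketch of what
effectively happens on our hosts; "wg-cali0" is a made-up placeholder,
the real Calico-managed interface is named differently:

/*
 * Sketch: enable threaded NAPI on one interface via the sysfs knob
 * documented in [2]. The interface name here is hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/class/net/wg-cali0/threaded";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* "1" enables threaded NAPI; "0" switches it back off. */
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}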

Analyzing one stuck host with drgn, we were able to confirm that the
code is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to
be cleared for the WireGuard peer's napi instance, but for some
reason that never happens. For context, the full state of the stuck
napi instance is 0b100110111. What makes things worse is that this
happens when Calico removes a WireGuard peer, which it does while
holding the global rtnl_mutex, so all the other tasks requiring that
mutex get stuck as well.
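
To decode that value, here is a quick sketch using the NAPI_STATE_*
bit order from include/linux/netdevice.h as of v6.6 (the positions
below are copied from that enum):

/* Decode the stuck napi->state value observed with drgn. */
#include <stdio.h>

int main(void)
{
	static const char *bits[] = {
		"SCHED", "MISSED", "DISABLE", "NPSVC", "LISTED",
		"NO_BUSY_POLL", "IN_BUSY_POLL", "PREFER_BUSY_POLL",
		"THREADED", "SCHED_THREADED",
	};
	unsigned long state = 0x137;	/* == 0b100110111 */
	unsigned int i;

	for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
		if (state & (1UL << i))
			printf("NAPI_STATE_%s\n", bits[i]);
	return 0;
}

This prints SCHED, MISSED, DISABLE, LISTED, NO_BUSY_POLL and
THREADED: disable is pending and the instance is still marked as
scheduled, but nothing ever clears SCHED.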

Full stacktrace of the “looping” task:

#0  context_switch (linux/kernel/sched/core.c:5380:2)
#1  __schedule (linux/kernel/sched/core.c:6698:8)
#2  schedule (linux/kernel/sched/core.c:6772:3)
#3  schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
#4  usleep_range_state (linux/kernel/time/timer.c:2363:8)
#5  usleep_range (linux/include/linux/delay.h:68:2)
#6  napi_disable (linux/net/core/dev.c:6477:4)
#7  peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
#8  set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
#9  wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
#10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
#11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
#12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
#13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
#14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
#15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
#16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
#17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
#18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
#19 __sock_sendmsg (linux/net/socket.c:745:16)
#20 ____sys_sendmsg (linux/net/socket.c:2560:8)
#21 ___sys_sendmsg (linux/net/socket.c:2614:8)
#22 __sys_sendmsg (linux/net/socket.c:2643:8)
#23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
#24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
#25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)
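
For reference, the wait loop in frame #6 looks like this (lightly
paraphrased from napi_disable() in net/core/dev.c as of v6.6 [3];
this is kernel code, not a standalone example):

void napi_disable(struct napi_struct *n)
{
	unsigned long val, new;

	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	val = READ_ONCE(n->state);
	do {
		while (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
			/* we never get past this inner loop */
			usleep_range(20, 200);
			val = READ_ONCE(n->state);
		}

		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
		new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
	} while (!try_cmpxchg(&n->state, &val, new));

	hrtimer_cancel(&n->timer);

	clear_bit(NAPI_STATE_DISABLE, &n->state);
}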

We have also noticed a similar issue when we switch WireGuard
threaded NAPI back off: the task removing a WireGuard peer may still
spend a considerable amount of time in the above loop (and hold
rtnl_mutex while doing so); however, in that case the host eventually
recovers.

So the questions are:
1. Any ideas why the NAPI_STATE_SCHED bit never gets cleared for the
threaded NAPI case in WireGuard?
2. Is it generally a good idea for WireGuard to loop for an
indeterminate amount of time while holding the rtnl_mutex? Or can it
be refactored?

We have observed the problem on Linux 6.6.47 and 6.6.48. We did try
downgrading the kernel by a couple of patch revisions, but it did not
help, and our logs indicate that at least the non-threaded prolonged
holding of the rtnl_mutex has been happening for a while now.

[1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
[2]: https://docs.kernel.org/networking/napi.html#threaded
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476

Thread overview: 4+ messages
2024-09-23 18:23 Ignat Korchagin [this message]
2024-09-23 18:46 ` Eric Dumazet
2024-09-23 21:33 ` Ignat Korchagin
2024-09-25 15:06   ` Daniel Dao
