Development discussion of WireGuard
* wireguard/napi stuck in napi_disable
@ 2024-09-23 18:23 Ignat Korchagin
  2024-09-23 18:46 ` Eric Dumazet
  2024-09-23 21:33 ` Ignat Korchagin
  0 siblings, 2 replies; 4+ messages in thread
From: Ignat Korchagin @ 2024-09-23 18:23 UTC (permalink / raw)
  To: Jason, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, wireguard, netdev, linux-kernel, jiri,
	Sebastian Andrzej Siewior, Lorenzo Bianconi
  Cc: kernel-team

Hello,

We run calico on our Kubernetes cluster, which uses Wireguard to
encrypt in-cluster traffic [1]. Recently we tried to improve the
throughput of the cluster and eliminate some packet drops we’re seeing
by switching on threaded NAPI [2] on these managed Wireguard
interfaces. However, our Kubernetes hosts started to lock up once in a
while.

Analyzing one stuck host with drgn, we were able to confirm that the
code is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to
be cleared for the Wireguard peer napi instance, but that never
happens for some reason. For context, the full state of the stuck napi
instance is 0b100110111. What makes things worse is that this happens
when calico removes a Wireguard peer, which is done while holding the
global rtnl_mutex, so all the other tasks requiring that mutex get
stuck as well.
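
For reference, the loop in question looks roughly like this (a
paraphrased sketch of napi_disable() from v6.6 net/core/dev.c, slightly
simplified, not the literal source):

void napi_disable(struct napi_struct *n)
{
	unsigned long val, new;

	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	val = READ_ONCE(n->state);
	do {
		/* This is where our task is stuck: as long as SCHED (or
		 * NPSVC) stays set in n->state, we keep sleeping and
		 * re-reading the state word.
		 */
		while (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
			usleep_range(20, 200);
			val = READ_ONCE(n->state);
		}

		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
		new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
	} while (!try_cmpxchg(&n->state, &val, new));

	hrtimer_cancel(&n->timer);
	clear_bit(NAPI_STATE_DISABLE, &n->state);
}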

Full stacktrace of the “looping” task:

#0  context_switch (linux/kernel/sched/core.c:5380:2)
#1  __schedule (linux/kernel/sched/core.c:6698:8)
#2  schedule (linux/kernel/sched/core.c:6772:3)
#3  schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
#4  usleep_range_state (linux/kernel/time/timer.c:2363:8)
#5  usleep_range (linux/include/linux/delay.h:68:2)
#6  napi_disable (linux/net/core/dev.c:6477:4)
#7  peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
#8  set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
#9  wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
#10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
#11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
#12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
#13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
#14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
#15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
#16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
#17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
#18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
#19 __sock_sendmsg (linux/net/socket.c:745:16)
#20 ____sys_sendmsg (linux/net/socket.c:2560:8)
#21 ___sys_sendmsg (linux/net/socket.c:2614:8)
#22 __sys_sendmsg (linux/net/socket.c:2643:8)
#23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
#24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
#25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)

We have also noticed a similar issue when we switch Wireguard threaded
NAPI back off: the task removing a Wireguard peer may still spend a
considerable amount of time in the above loop (and hold the
rtnl_mutex), however the host eventually recovers from this state.

So the questions are:
1. Any ideas why NAPI_STATE_SCHED bit never gets cleared for the
threaded NAPI case in Wireguard?
2. Is it generally a good idea for Wireguard to loop for an
indeterminate amount of time, while holding the rtnl_mutex? Or can it
be refactored?

We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
downgrade the kernel a couple of patch revisions, but it did not help,
and our logs indicate that at least the non-threaded prolonged holding
of the rtnl_mutex has been happening for a while now.

[1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
[2]: https://docs.kernel.org/networking/napi.html#threaded
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476


* Re: wireguard/napi stuck in napi_disable
  2024-09-23 18:23 wireguard/napi stuck in napi_disable Ignat Korchagin
@ 2024-09-23 18:46 ` Eric Dumazet
  2024-09-23 21:33 ` Ignat Korchagin
  1 sibling, 0 replies; 4+ messages in thread
From: Eric Dumazet @ 2024-09-23 18:46 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: Jason, David S. Miller, Jakub Kicinski, Paolo Abeni, wireguard,
	netdev, linux-kernel, jiri, Sebastian Andrzej Siewior,
	Lorenzo Bianconi, kernel-team

On Mon, Sep 23, 2024 at 8:23 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> Hello,
>
> We run calico on our Kubernetes cluster, which uses Wireguard to
> encrypt in-cluster traffic [1]. Recently we tried to improve the
> throughput of the cluster and eliminate some packet drops we’re seeing
> by switching on threaded NAPI [2] on these managed Wireguard
> interfaces. However, our Kubernetes hosts started to lock up once in a
> while.
>
> Analyzing one stuck host with drgn we were able to confirm that the
> code is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to
> be cleared for the Wireguard peer napi instance, but that never
> happens for some reason. For context the full state of the stuck napi
> instance is 0b100110111. What makes things worse - this happens when
> calico removes a Wireguard peer, which happens while holding the
> global rtnl_mutex, so all the other tasks requiring that mutex get
> stuck as well.
>
> Full stacktrace of the “looping” task:
>
> #0  context_switch (linux/kernel/sched/core.c:5380:2)
> #1  __schedule (linux/kernel/sched/core.c:6698:8)
> #2  schedule (linux/kernel/sched/core.c:6772:3)
> #3  schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
> #4  usleep_range_state (linux/kernel/time/timer.c:2363:8)
> #5  usleep_range (linux/include/linux/delay.h:68:2)
> #6  napi_disable (linux/net/core/dev.c:6477:4)
> #7  peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
> #8  set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
> #9  wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
> #10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
> #11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
> #12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
> #13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
> #14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
> #15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
> #16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
> #17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
> #18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
> #19 __sock_sendmsg (linux/net/socket.c:745:16)
> #20 ____sys_sendmsg (linux/net/socket.c:2560:8)
> #21 ___sys_sendmsg (linux/net/socket.c:2614:8)
> #22 __sys_sendmsg (linux/net/socket.c:2643:8)
> #23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
> #24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
> #25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)
>
> We have also noticed that a similar issue is observed, when we switch
> Wireguard threaded NAPI back to off: removing a Wireguard peer task
> may still spend a considerable amount of time in the above loop (and
> hold rtnl_mutex), however the host eventually recovers from this
> state.
>
> So the questions are:
> 1. Any ideas why NAPI_STATE_SCHED bit never gets cleared for the
> threaded NAPI case in Wireguard?
> 2. Is it generally a good idea for Wireguard to loop for an
> indeterminate amount of time, while holding the rtnl_mutex? Or can it
> be refactored?
>
> We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
> downgrade the kernel a couple of patch revisions, but it did not help
> and our logs indicate that at least the non-threaded prolonged holding
> of the rtnl_mutex is happening for a while now.
>
> [1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
> [2]: https://docs.kernel.org/networking/napi.html#threaded
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476

Somehow wireguard continuously feeds packets without checking whether
it should not (IFF_UP or some other bit).

napi_schedule() detects NAPIF_STATE_DISABLE, and
napi_disable_pending() is also used
from __napi_poll() to avoid adding back the napi if the whole budget
was consumed.
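
Roughly, the two checks I mean look like this (paraphrased sketches,
not the literal source):

/* napi_schedule_prep(): refuses to (re)arm a NAPI that is being
 * disabled, and records MISSED if SCHED was already set.
 */
static inline bool napi_schedule_prep(struct napi_struct *n)
{
	unsigned long new, val = READ_ONCE(n->state);

	do {
		if (unlikely(val & NAPIF_STATE_DISABLE))
			return false;
		new = val | NAPIF_STATE_SCHED;
		if (val & NAPIF_STATE_SCHED)
			new |= NAPIF_STATE_MISSED;
	} while (!try_cmpxchg(&n->state, &val, new));

	return !(val & NAPIF_STATE_SCHED);
}

/* __napi_poll(), after the poll callback consumed the whole budget: a
 * NAPI with a pending disable is completed instead of being polled
 * again.
 */
	if (unlikely(napi_disable_pending(n))) {
		napi_complete(n);
		return work;
	}
	*repoll = true;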

Not sure, more debugging might be needed.


* Re: wireguard/napi stuck in napi_disable
  2024-09-23 18:23 wireguard/napi stuck in napi_disable Ignat Korchagin
  2024-09-23 18:46 ` Eric Dumazet
@ 2024-09-23 21:33 ` Ignat Korchagin
  2024-09-25 15:06   ` Daniel Dao
  1 sibling, 1 reply; 4+ messages in thread
From: Ignat Korchagin @ 2024-09-23 21:33 UTC (permalink / raw)
  To: Jason, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, wireguard, netdev, linux-kernel, jiri,
	Sebastian Andrzej Siewior, Lorenzo Bianconi
  Cc: kernel-team

On Mon, Sep 23, 2024 at 7:23 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> Hello,
>
> We run calico on our Kubernetes cluster, which uses Wireguard to
> encrypt in-cluster traffic [1]. Recently we tried to improve the
> throughput of the cluster and eliminate some packet drops we’re seeing
> by switching on threaded NAPI [2] on these managed Wireguard
> interfaces. However, our Kubernetes hosts started to lock up once in a
> while.
>
> Analyzing one stuck host with drgn we were able to confirm that the
> code is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to
> be cleared for the Wireguard peer napi instance, but that never
> happens for some reason. For context the full state of the stuck napi
> instance is 0b100110111. What makes things worse - this happens when
> calico removes a Wireguard peer, which happens while holding the
> global rtnl_mutex, so all the other tasks requiring that mutex get
> stuck as well.
>
> Full stacktrace of the “looping” task:
>
> #0  context_switch (linux/kernel/sched/core.c:5380:2)
> #1  __schedule (linux/kernel/sched/core.c:6698:8)
> #2  schedule (linux/kernel/sched/core.c:6772:3)
> #3  schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
> #4  usleep_range_state (linux/kernel/time/timer.c:2363:8)
> #5  usleep_range (linux/include/linux/delay.h:68:2)
> #6  napi_disable (linux/net/core/dev.c:6477:4)
> #7  peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
> #8  set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
> #9  wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
> #10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
> #11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
> #12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
> #13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
> #14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
> #15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
> #16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
> #17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
> #18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
> #19 __sock_sendmsg (linux/net/socket.c:745:16)
> #20 ____sys_sendmsg (linux/net/socket.c:2560:8)
> #21 ___sys_sendmsg (linux/net/socket.c:2614:8)
> #22 __sys_sendmsg (linux/net/socket.c:2643:8)
> #23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
> #24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
> #25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)
>
> We have also noticed that a similar issue is observed, when we switch
> Wireguard threaded NAPI back to off: removing a Wireguard peer task
> may still spend a considerable amount of time in the above loop (and
> hold rtnl_mutex), however the host eventually recovers from this
> state.
>
> So the questions are:
> 1. Any ideas why NAPI_STATE_SCHED bit never gets cleared for the
> threaded NAPI case in Wireguard?
> 2. Is it generally a good idea for Wireguard to loop for an
> indeterminate amount of time, while holding the rtnl_mutex? Or can it
> be refactored?

I've also been trying to reproduce this issue with a script [1]. While
I could not reproduce the complete lockup, I've been able to confirm
that peer_remove_after_dead() may take multiple seconds to execute -
all while holding the rtnl_mutex. Below is bcc-tools funclatency
output from a freshly compiled mainline kernel (6.11):

# /usr/share/bcc/tools/funclatency peer_remove_after_dead
Tracing 1 functions for "peer_remove_after_dead"... Hit Ctrl-C to end.
^C
               nsecs                         : count     distribution
                   0 -> 1                    : 0        |                    |
                   2 -> 3                    : 0        |                    |
                   4 -> 7                    : 0        |                    |
                   8 -> 15                   : 0        |                    |
                  16 -> 31                   : 0        |                    |
                  32 -> 63                   : 0        |                    |
                  64 -> 127                  : 0        |                    |
                 128 -> 255                  : 0        |                    |
                 256 -> 511                  : 0        |                    |
                 512 -> 1023                 : 0        |                    |
                1024 -> 2047                 : 0        |                    |
                2048 -> 4095                 : 0        |                    |
                4096 -> 8191                 : 0        |                    |
                8192 -> 16383                : 0        |                    |
               16384 -> 32767                : 0        |                    |
               32768 -> 65535                : 0        |                    |
               65536 -> 131071               : 0        |                    |
              131072 -> 262143               : 0        |                    |
              262144 -> 524287               : 68       |**                  |
              524288 -> 1048575              : 658      |********************|
             1048576 -> 2097151              : 267      |********            |
             2097152 -> 4194303              : 68       |**                  |
             4194304 -> 8388607              : 124      |***                 |
             8388608 -> 16777215             : 182      |*****               |
            16777216 -> 33554431             : 72       |**                  |
            33554432 -> 67108863             : 34       |*                   |
            67108864 -> 134217727            : 22       |                    |
           134217728 -> 268435455            : 11       |                    |
           268435456 -> 536870911            : 2        |                    |
           536870912 -> 1073741823           : 2        |                    |
          1073741824 -> 2147483647           : 1        |                    |
          2147483648 -> 4294967295           : 0        |                    |
          4294967296 -> 8589934591           : 1        |                    |

avg = 14251705 nsecs, total: 21548578415 nsecs, count: 1512

Detaching...

So we have cases where it takes 2 or even 8 seconds to remove a single
peer, which is definitely not great considering we're holding a global
lock.

> We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
> downgrade the kernel a couple of patch revisions, but it did not help
> and our logs indicate that at least the non-threaded prolonged holding
> of the rtnl_mutex is happening for a while now.
>
> [1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
> [2]: https://docs.kernel.org/networking/napi.html#threaded
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476

Ignat

[1]: https://gist.githubusercontent.com/ignatk/4505d96e02815de3aa5649c4aa7c3fca/raw/177e4eab9f491024db6488cd0ea1cbba2d5579b4/wg.sh


* Re: wireguard/napi stuck in napi_disable
  2024-09-23 21:33 ` Ignat Korchagin
@ 2024-09-25 15:06   ` Daniel Dao
  0 siblings, 0 replies; 4+ messages in thread
From: Daniel Dao @ 2024-09-25 15:06 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: Jason, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, wireguard, netdev, linux-kernel, jiri,
	Sebastian Andrzej Siewior, Lorenzo Bianconi, kernel-team,
	Tejun Heo

On Mon, Sep 23, 2024 at 10:33 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> On Mon, Sep 23, 2024 at 7:23 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> >
> > Hello,
> >
> > We run calico on our Kubernetes cluster, which uses Wireguard to
> > encrypt in-cluster traffic [1]. Recently we tried to improve the
> > throughput of the cluster and eliminate some packet drops we’re seeing
> > by switching on threaded NAPI [2] on these managed Wireguard
> > interfaces. However, our Kubernetes hosts started to lock up once in a
> > while.
> >
> > Analyzing one stuck host with drgn we were able to confirm that the
> > code is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to
> > be cleared for the Wireguard peer napi instance, but that never
> > happens for some reason. For context the full state of the stuck napi
> > instance is 0b100110111. What makes things worse - this happens when
> > calico removes a Wireguard peer, which happens while holding the
> > global rtnl_mutex, so all the other tasks requiring that mutex get
> > stuck as well.
> >
> > Full stacktrace of the “looping” task:
> >
> > #0  context_switch (linux/kernel/sched/core.c:5380:2)
> > #1  __schedule (linux/kernel/sched/core.c:6698:8)
> > #2  schedule (linux/kernel/sched/core.c:6772:3)
> > #3  schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
> > #4  usleep_range_state (linux/kernel/time/timer.c:2363:8)
> > #5  usleep_range (linux/include/linux/delay.h:68:2)
> > #6  napi_disable (linux/net/core/dev.c:6477:4)
> > #7  peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
> > #8  set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
> > #9  wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
> > #10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
> > #11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
> > #12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
> > #13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
> > #14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
> > #15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
> > #16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
> > #17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
> > #18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
> > #19 __sock_sendmsg (linux/net/socket.c:745:16)
> > #20 ____sys_sendmsg (linux/net/socket.c:2560:8)
> > #21 ___sys_sendmsg (linux/net/socket.c:2614:8)
> > #22 __sys_sendmsg (linux/net/socket.c:2643:8)
> > #23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
> > #24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
> > #25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)
> >

Looking at this further to understand why the napi state would be 0b100110111,
which decodes to SCHED | MISSED | DISABLE | LISTED | NO_BUSY_POLL | THREADED,
I think this is a problem with the MISSED wakeup in napi threaded mode.

napi_complete_done() calls __napi_schedule() when the state has NAPIF_STATE_MISSED,
but in 6.6 it does not set NAPI_STATE_SCHED_THREADED when the napi thread is
already running, and since the thread is running, the wakeup does not do anything.
Therefore we miss the chance to do another poll, and if we race with napi_disable()
the state ends up as SCHED | MISSED | DISABLE, and we are stuck.
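
For reference, napi_thread_wait() in 6.6 looks roughly like this
(paraphrased, slightly simplified): without SCHED_THREADED, and without
an explicit wakeup while it is sleeping here, the thread never considers
itself the owner of the napi and never polls it again, so SCHED is never
cleared:

static int napi_thread_wait(struct napi_struct *napi)
{
	bool woken = false;

	set_current_state(TASK_INTERRUPTIBLE);

	while (!kthread_should_stop()) {
		/* Only proceed to poll if this kthread owns the napi:
		 * SCHED_THREADED is set, or it was explicitly woken while
		 * sleeping below. A plain SCHED bit is not enough.
		 */
		if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) {
			WARN_ON(!list_empty(&napi->poll_list));
			__set_current_state(TASK_RUNNING);
			return 0;
		}

		schedule();
		/* woken being true indicates this thread owns this napi. */
		woken = true;
		set_current_state(TASK_INTERRUPTIBLE);
	}
	__set_current_state(TASK_RUNNING);
	return -1;
}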

However, it looks like the following commit resolves that situation for us:

commit 56364c910691f6d10ba88c964c9041b9ab777bd6
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Mar 25 08:40:28 2024 +0100

    net: Remove conditional threaded-NAPI wakeup based on task state.

As long as we always set SCHED_THREADED, the next iteration of the napi
thread loop will do the right thing and clear SCHED, letting napi_disable()
complete.
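
Roughly, the relevant part of ____napi_schedule() (a paraphrased sketch;
the first variant is my reading of 6.6, the second is after the commit
above):

	/* 6.6, before the commit: SCHED_THREADED is only set if the napi
	 * thread does not appear to be sleeping, so a wakeup racing with
	 * a running thread records nothing.
	 */
	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
		thread = READ_ONCE(napi->thread);
		if (thread) {
			if (READ_ONCE(thread->__state) != TASK_INTERRUPTIBLE)
				set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
			wake_up_process(thread);
			return;
		}
	}

	/* After commit 56364c910691: SCHED_THREADED is always recorded, so
	 * the next iteration of the napi thread loop polls the instance and
	 * eventually clears SCHED.
	 */
	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
		thread = READ_ONCE(napi->thread);
		if (thread) {
			set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
			wake_up_process(thread);
			return;
		}
	}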

We are testing 6.6 with this patch, and haven't seen any similar lockups so far.

> > We have also noticed that a similar issue is observed, when we switch
> > Wireguard threaded NAPI back to off: removing a Wireguard peer task
> > may still spend a considerable amount of time in the above loop (and
> > hold rtnl_mutex), however the host eventually recovers from this
> > state.

However, these lockups have become much more prominent now. Here is the
stack trace of peer_remove_after_dead:

#0  context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5380:2)
#1  __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6699:8)
#2  schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6773:3)
#3  schedule_timeout (/cfsetup_build/build/linux/kernel/time/timer.c:2143:3)
#4  do_wait_for_common (/cfsetup_build/build/linux/kernel/sched/completion.c:95:14)
#5  __wait_for_common (/cfsetup_build/build/linux/kernel/sched/completion.c:116:12)
#6  wait_for_common (/cfsetup_build/build/linux/kernel/sched/completion.c:127:9)
#7  wait_for_completion (/cfsetup_build/build/linux/kernel/sched/completion.c:148:2)
#8  __flush_workqueue (/cfsetup_build/build/linux/kernel/workqueue.c:3196:2)
#9  peer_remove_after_dead (/cfsetup_build/build/linux/drivers/net/wireguard/peer.c:116:2)
#10 set_peer (/cfsetup_build/build/linux/drivers/net/wireguard/netlink.c:425:3)
#11 wg_set_device (/cfsetup_build/build/linux/drivers/net/wireguard/netlink.c:592:10)
#12 genl_family_rcv_msg_doit (/cfsetup_build/build/linux/net/netlink/genetlink.c:971:8)
#13 genl_family_rcv_msg (/cfsetup_build/build/linux/net/netlink/genetlink.c:1051:10)
#14 genl_rcv_msg (/cfsetup_build/build/linux/net/netlink/genetlink.c:1066:8)
#15 netlink_rcv_skb (/cfsetup_build/build/linux/net/netlink/af_netlink.c:2544:9)
#16 genl_rcv (/cfsetup_build/build/linux/net/netlink/genetlink.c:1075:2)
#17 netlink_unicast_kernel (/cfsetup_build/build/linux/net/netlink/af_netlink.c:1342:3)
#18 netlink_unicast (/cfsetup_build/build/linux/net/netlink/af_netlink.c:1368:10)
#19 netlink_sendmsg (/cfsetup_build/build/linux/net/netlink/af_netlink.c:1910:8)
#20 sock_sendmsg_nosec (/cfsetup_build/build/linux/net/socket.c:730:12)
#21 __sock_sendmsg (/cfsetup_build/build/linux/net/socket.c:745:16)
#22 ____sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2590:8)
#23 ___sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2644:8)
#24 __sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2673:8)
#25 do_syscall_x64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:51:14)
#26 do_syscall_64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:81:7)
#27 entry_SYSCALL_64+0x9c/0x184 (/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:121)
#28 0x41262e

drgn shows that we are waiting for completion of a wg_packet_tx_worker
work item, which is destined for a completely different peer than the
peer we want to remove:

*(struct worker *)0xffff888107f640c0 = {
        .entry = (struct list_head){
                .next = (struct list_head *)0x0,
                .prev = (struct list_head *)0xffff8897e0cb1f50,
        },
        .hentry = (struct hlist_node){
                .next = (struct hlist_node *)0x0,
                .pprev = (struct hlist_node **)0xffff8897e0cb1f50,
        },
        .current_work = (struct work_struct *)0xffff8881a32638d0,
        .current_func = (work_func_t)wg_packet_tx_worker+0x0 = 0xffffffffc0f6ca40,
        .current_pwq = (struct pool_workqueue *)0xffff88812bca6400,
        .current_at = (u64)3491257913,
        .current_color = (unsigned int)4,
        .sleeping = (int)0,
        .last_func = (work_func_t)wg_packet_tx_worker+0x0 = 0xffffffffc0f6ca40,
        .scheduled = (struct list_head){
                .next = (struct list_head *)0xffff8881a32638d8,
                .prev = (struct list_head *)0xffff8881a32638d8,
        },
        .task = (struct task_struct *)0xffff888472a08000,
        .pool = (struct worker_pool *)0xffff8897e0cb1cc0,
        .node = (struct list_head){
                .next = (struct list_head *)0xffff888107f64360,
                .prev = (struct list_head *)0xffff888107f645a0,
        },
        .last_active = (unsigned long)4409382646,
        .flags = (unsigned int)64,
        .id = (int)1,
        .desc = (char [32])"wg-crypt-wireguard.cali",
        .rescue_wq = (struct workqueue_struct *)0x0,
}
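
For context, the ordering in peer_remove_after_dead() is roughly the
following (a simplified sketch, not the literal source). The workqueue
being flushed is the per-device packet_crypt_wq ("wg-crypt-%s"), shared
by all peers of the interface, so the flush has to wait for crypt/tx
work queued on behalf of every peer, not just the one being removed:

static void peer_remove_after_dead(struct wg_peer *peer)
{
	WARN_ON(!peer->is_dead);

	wg_noise_keypairs_clear(&peer->keypairs);
	wg_timers_stop(peer);

	/* The shared per-device crypt workqueue is flushed here (twice in
	 * the real code, to cover both crypto generations). This waits for
	 * wg_packet_tx_worker items queued for unrelated peers too, which
	 * is what drgn shows us blocked on above.
	 */
	flush_workqueue(peer->device->packet_crypt_wq);

	/* Only then do we reach the napi_disable() from the original
	 * report - all of it under the rtnl_mutex.
	 */
	napi_disable(&peer->napi);
	netif_napi_del(&peer->napi);

	/* ... remaining teardown and final peer_put() ... */
}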

This can take a very long time, especially if the peers produce/receive
as fast as they can, as in our test setup. We set up some metrics with
bpftrace to measure the duration of peer_remove_after_dead and
wg_packet_tx_worker and got the following measurements after a long
wait:

@duration_ms[peer_remove_after_dead]:
[512K, 1M)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@duration_ms[wg_packet_tx_worker]:
[0]               744612 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                   20 |                                                    |
[2, 4)                 2 |                                                    |
[4, 8)                 2 |                                                    |
[16, 32)               2 |                                                    |
[256K, 512K)           1 |                                                    |
[2M, 4M)               1 |                                                    |

We can see that peer_remove_after_dead took between 512s and 1000s to
complete, while wg_packet_tx_worker can take as long as 2000-4000s to
complete, which is an awfully long time.

Daniel

> >
> > So the questions are:
> > 1. Any ideas why NAPI_STATE_SCHED bit never gets cleared for the
> > threaded NAPI case in Wireguard?
> > 2. Is it generally a good idea for Wireguard to loop for an
> > indeterminate amount of time, while holding the rtnl_mutex? Or can it
> > be refactored?
>
> I've been also trying to reproduce this issue with a script [1]. While
> I could not reproduce the complete lockup I've been able to confirm
> that peer_remove_after_dead() may take multiple seconds to execute -
> all while holding the rtnl_mutex. Below is bcc-tools funclatency
> output from a freshly compiled mainline (6.11):
>
> # /usr/share/bcc/tools/funclatency peer_remove_after_dead
> Tracing 1 functions for "peer_remove_after_dead"... Hit Ctrl-C to end.
> ^C
>                nsecs                         : count     distribution
>                    0 -> 1                    : 0        |                    |
>                    2 -> 3                    : 0        |                    |
>                    4 -> 7                    : 0        |                    |
>                    8 -> 15                   : 0        |                    |
>                   16 -> 31                   : 0        |                    |
>                   32 -> 63                   : 0        |                    |
>                   64 -> 127                  : 0        |                    |
>                  128 -> 255                  : 0        |                    |
>                  256 -> 511                  : 0        |                    |
>                  512 -> 1023                 : 0        |                    |
>                 1024 -> 2047                 : 0        |                    |
>                 2048 -> 4095                 : 0        |                    |
>                 4096 -> 8191                 : 0        |                    |
>                 8192 -> 16383                : 0        |                    |
>                16384 -> 32767                : 0        |                    |
>                32768 -> 65535                : 0        |                    |
>                65536 -> 131071               : 0        |                    |
>               131072 -> 262143               : 0        |                    |
>               262144 -> 524287               : 68       |**                  |
>               524288 -> 1048575              : 658      |********************|
>              1048576 -> 2097151              : 267      |********            |
>              2097152 -> 4194303              : 68       |**                  |
>              4194304 -> 8388607              : 124      |***                 |
>              8388608 -> 16777215             : 182      |*****               |
>             16777216 -> 33554431             : 72       |**                  |
>             33554432 -> 67108863             : 34       |*                   |
>             67108864 -> 134217727            : 22       |                    |
>            134217728 -> 268435455            : 11       |                    |
>            268435456 -> 536870911            : 2        |                    |
>            536870912 -> 1073741823           : 2        |                    |
>           1073741824 -> 2147483647           : 1        |                    |
>           2147483648 -> 4294967295           : 0        |                    |
>           4294967296 -> 8589934591           : 1        |                    |
>
> avg = 14251705 nsecs, total: 21548578415 nsecs, count: 1512
>
> Detaching...
>
> So we have cases where it takes 2 or even 8 seconds to remove a single
> peer, which is definitely not great considering we're holding a global
> lock.
>
> > We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
> > downgrade the kernel a couple of patch revisions, but it did not help
> > and our logs indicate that at least the non-threaded prolonged holding
> > of the rtnl_mutex is happening for a while now.
> >
> > [1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
> > [2]: https://docs.kernel.org/networking/napi.html#threaded
> > [3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476
>
> Ignat
>
> [1]: https://gist.githubusercontent.com/ignatk/4505d96e02815de3aa5649c4aa7c3fca/raw/177e4eab9f491024db6488cd0ea1cbba2d5579b4/wg.sh

