* wireguard/napi stuck in napi_disable
@ 2024-09-23 18:23 Ignat Korchagin
  2024-09-23 18:46 ` Eric Dumazet
  2024-09-23 21:33 ` Ignat Korchagin
  0 siblings, 2 replies; 4+ messages in thread

From: Ignat Korchagin @ 2024-09-23 18:23 UTC (permalink / raw)
To: Jason, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
  wireguard, netdev, linux-kernel, jiri, Sebastian Andrzej Siewior,
  Lorenzo Bianconi
Cc: kernel-team

Hello,

We run calico on our Kubernetes cluster, which uses Wireguard to encrypt
in-cluster traffic [1]. Recently we tried to improve the throughput of the
cluster and eliminate some packet drops we’re seeing by switching on
threaded NAPI [2] on these managed Wireguard interfaces. However, our
Kubernetes hosts started to lock up once in a while.

Analyzing one stuck host with drgn we were able to confirm that the code is
just waiting in this loop [3] for the NAPI_STATE_SCHED bit to be cleared for
the Wireguard peer napi instance, but that never happens for some reason. For
context, the full state of the stuck napi instance is 0b100110111. What makes
things worse: this happens when calico removes a Wireguard peer, which it
does while holding the global rtnl_mutex, so all the other tasks requiring
that mutex get stuck as well.
Full stacktrace of the “looping” task:

#0  context_switch (linux/kernel/sched/core.c:5380:2)
#1  __schedule (linux/kernel/sched/core.c:6698:8)
#2  schedule (linux/kernel/sched/core.c:6772:3)
#3  schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
#4  usleep_range_state (linux/kernel/time/timer.c:2363:8)
#5  usleep_range (linux/include/linux/delay.h:68:2)
#6  napi_disable (linux/net/core/dev.c:6477:4)
#7  peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
#8  set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
#9  wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
#10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
#11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
#12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
#13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
#14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
#15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
#16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
#17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
#18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
#19 __sock_sendmsg (linux/net/socket.c:745:16)
#20 ____sys_sendmsg (linux/net/socket.c:2560:8)
#21 ___sys_sendmsg (linux/net/socket.c:2614:8)
#22 __sys_sendmsg (linux/net/socket.c:2643:8)
#23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
#24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
#25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)

We have also noticed that a similar issue is observed when we switch
Wireguard threaded NAPI back off: a task removing a Wireguard peer may still
spend a considerable amount of time in the above loop (and hold rtnl_mutex);
however, the host eventually recovers from this state.

So the questions are:
1. Any ideas why the NAPI_STATE_SCHED bit never gets cleared for the
   threaded NAPI case in Wireguard?
2. Is it generally a good idea for Wireguard to loop for an indeterminate
   amount of time while holding the rtnl_mutex? Or can it be refactored?

We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
downgrade the kernel a couple of patch revisions, but it did not help, and
our logs indicate that at least the non-threaded prolonged holding of the
rtnl_mutex has been happening for a while now.

[1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
[2]: https://docs.kernel.org/networking/napi.html#threaded
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476
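For anyone decoding such a state by hand: the bitmask maps onto the bit order of `enum napi_state` in include/linux/netdevice.h (v6.6 layout assumed here). A quick sketch, runnable in a plain Python (or drgn) shell:

```python
# Flag names in enum napi_state bit order (Linux v6.6); bit i <=> 1 << i.
NAPI_STATE_BITS = [
    "SCHED", "MISSED", "DISABLE", "NPSVC", "LISTED",
    "NO_BUSY_POLL", "IN_BUSY_POLL", "PREFER_BUSY_POLL",
    "THREADED", "SCHED_THREADED",
]

def decode_napi_state(state: int) -> list:
    """Return the names of all napi state flags set in `state`."""
    return [name for bit, name in enumerate(NAPI_STATE_BITS)
            if state & (1 << bit)]

print(decode_napi_state(0b100110111))
# -> ['SCHED', 'MISSED', 'DISABLE', 'LISTED', 'NO_BUSY_POLL', 'THREADED']
```

So the stuck instance is mid-disable (DISABLE set) with SCHED still held and a MISSED poll recorded, on a threaded napi.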
* Re: wireguard/napi stuck in napi_disable
  2024-09-23 18:23 wireguard/napi stuck in napi_disable Ignat Korchagin
@ 2024-09-23 18:46 ` Eric Dumazet
  2024-09-23 21:33 ` Ignat Korchagin
  1 sibling, 0 replies; 4+ messages in thread

From: Eric Dumazet @ 2024-09-23 18:46 UTC (permalink / raw)
To: Ignat Korchagin
Cc: Jason, David S. Miller, Jakub Kicinski, Paolo Abeni, wireguard, netdev,
  linux-kernel, jiri, Sebastian Andrzej Siewior, Lorenzo Bianconi,
  kernel-team

On Mon, Sep 23, 2024 at 8:23 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> Hello,
>
> We run calico on our Kubernetes cluster, which uses Wireguard to encrypt
> in-cluster traffic [1]. Recently we tried to improve the throughput of the
> cluster and eliminate some packet drops we’re seeing by switching on
> threaded NAPI [2] on these managed Wireguard interfaces. However, our
> Kubernetes hosts started to lock up once in a while.
>
> Analyzing one stuck host with drgn we were able to confirm that the code
> is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to be
> cleared for the Wireguard peer napi instance, but that never happens for
> some reason. For context, the full state of the stuck napi instance is
> 0b100110111. What makes things worse: this happens when calico removes a
> Wireguard peer, which happens while holding the global rtnl_mutex, so all
> the other tasks requiring that mutex get stuck as well.
>
> Full stacktrace of the “looping” task:
>
> [...]
>
> We have also noticed that a similar issue is observed when we switch
> Wireguard threaded NAPI back off: a task removing a Wireguard peer may
> still spend a considerable amount of time in the above loop (and hold
> rtnl_mutex); however, the host eventually recovers from this state.
>
> So the questions are:
> 1. Any ideas why the NAPI_STATE_SCHED bit never gets cleared for the
>    threaded NAPI case in Wireguard?
> 2. Is it generally a good idea for Wireguard to loop for an indeterminate
>    amount of time while holding the rtnl_mutex? Or can it be refactored?
>
> We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
> downgrade the kernel a couple of patch revisions, but it did not help, and
> our logs indicate that at least the non-threaded prolonged holding of the
> rtnl_mutex has been happening for a while now.
>
> [1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
> [2]: https://docs.kernel.org/networking/napi.html#threaded
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476

Somehow wireguard continuously feeds packets without checking that it should
not (IFF_UP or some other bit).

napi_schedule() detects NAPIF_STATE_DISABLE, and napi_disable_pending() is
also used from __napi_poll() to avoid re-adding the napi if the whole budget
was consumed.

Not sure; more debugging might be needed.
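The guard Eric mentions can be sketched as a toy model of napi_schedule_prep() (the cmpxchg loop in net/core/dev.c), with the atomics deliberately omitted; this is an illustrative stand-in, not the kernel code:

```python
NAPIF_STATE_SCHED = 1 << 0
NAPIF_STATE_MISSED = 1 << 1
NAPIF_STATE_DISABLE = 1 << 2

def napi_schedule_prep(state: int):
    """Toy model: return (acquired, new_state).

    Refuse to schedule while DISABLE is set; if SCHED is already owned by
    another poller, record MISSED so that poller runs one more round.
    """
    if state & NAPIF_STATE_DISABLE:
        return False, state  # napi_disable() in progress: never re-arm
    new = state | NAPIF_STATE_SCHED
    if state & NAPIF_STATE_SCHED:
        new |= NAPIF_STATE_MISSED  # someone is polling: request a re-poll
    return not (state & NAPIF_STATE_SCHED), new

# Once DISABLE is set, schedule attempts are no-ops,
# so napi_disable() should eventually win the SCHED bit:
print(napi_schedule_prep(NAPIF_STATE_DISABLE))  # -> (False, 4)
```

The open question in the thread is why, despite this guard, SCHED stays set forever in the threaded case.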
* Re: wireguard/napi stuck in napi_disable
  2024-09-23 18:23 wireguard/napi stuck in napi_disable Ignat Korchagin
  2024-09-23 18:46 ` Eric Dumazet
@ 2024-09-23 21:33 ` Ignat Korchagin
  2024-09-25 15:06   ` Daniel Dao
  1 sibling, 1 reply; 4+ messages in thread

From: Ignat Korchagin @ 2024-09-23 21:33 UTC (permalink / raw)
To: Jason, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
  wireguard, netdev, linux-kernel, jiri, Sebastian Andrzej Siewior,
  Lorenzo Bianconi
Cc: kernel-team

On Mon, Sep 23, 2024 at 7:23 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> Hello,
>
> We run calico on our Kubernetes cluster, which uses Wireguard to encrypt
> in-cluster traffic [1]. Recently we tried to improve the throughput of the
> cluster and eliminate some packet drops we’re seeing by switching on
> threaded NAPI [2] on these managed Wireguard interfaces. However, our
> Kubernetes hosts started to lock up once in a while.
>
> [...]
>
> So the questions are:
> 1. Any ideas why the NAPI_STATE_SCHED bit never gets cleared for the
>    threaded NAPI case in Wireguard?
> 2. Is it generally a good idea for Wireguard to loop for an indeterminate
>    amount of time while holding the rtnl_mutex? Or can it be refactored?

I have also been trying to reproduce this issue with a script [1]. While I
could not reproduce the complete lockup, I have been able to confirm that
peer_remove_after_dead() may take multiple seconds to execute - all while
holding the rtnl_mutex. Below is bcc-tools funclatency output from a freshly
compiled mainline (6.11):

# /usr/share/bcc/tools/funclatency peer_remove_after_dead
Tracing 1 functions for "peer_remove_after_dead"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                    |
         2 -> 3          : 0        |                    |
         4 -> 7          : 0        |                    |
         8 -> 15         : 0        |                    |
        16 -> 31         : 0        |                    |
        32 -> 63         : 0        |                    |
        64 -> 127        : 0        |                    |
       128 -> 255        : 0        |                    |
       256 -> 511        : 0        |                    |
       512 -> 1023       : 0        |                    |
      1024 -> 2047       : 0        |                    |
      2048 -> 4095       : 0        |                    |
      4096 -> 8191       : 0        |                    |
      8192 -> 16383      : 0        |                    |
     16384 -> 32767      : 0        |                    |
     32768 -> 65535      : 0        |                    |
     65536 -> 131071     : 0        |                    |
    131072 -> 262143     : 0        |                    |
    262144 -> 524287     : 68       |**                  |
    524288 -> 1048575    : 658      |********************|
   1048576 -> 2097151    : 267      |********            |
   2097152 -> 4194303    : 68       |**                  |
   4194304 -> 8388607    : 124      |***                 |
   8388608 -> 16777215   : 182      |*****               |
  16777216 -> 33554431   : 72       |**                  |
  33554432 -> 67108863   : 34       |*                   |
  67108864 -> 134217727  : 22       |                    |
 134217728 -> 268435455  : 11       |                    |
 268435456 -> 536870911  : 2        |                    |
 536870912 -> 1073741823 : 2        |                    |
1073741824 -> 2147483647 : 1        |                    |
2147483648 -> 4294967295 : 0        |                    |
4294967296 -> 8589934591 : 1        |                    |

avg = 14251705 nsecs, total: 21548578415 nsecs, count: 1512

Detaching...

So we have cases where it takes 2 or even 8 seconds to remove a single peer,
which is definitely not great considering we are holding a global lock.

> We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
> downgrade the kernel a couple of patch revisions, but it did not help, and
> our logs indicate that at least the non-threaded prolonged holding of the
> rtnl_mutex has been happening for a while now.
>
> [1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
> [2]: https://docs.kernel.org/networking/napi.html#threaded
> [3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476

Ignat

[1]: https://gist.githubusercontent.com/ignatk/4505d96e02815de3aa5649c4aa7c3fca/raw/177e4eab9f491024db6488cd0ea1cbba2d5579b4/wg.sh
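As a sanity check on the funclatency summary above, the reported average and the worst bucket can be reproduced directly from the printed totals:

```python
# Values copied from the funclatency output above.
total_ns, count = 21_548_578_415, 1512

avg_ns = total_ns // count
print(avg_ns)  # -> 14251705, i.e. ~14.25 ms per peer removal on average

# The single slowest sample landed in the 4294967296 -> 8589934591 ns bucket,
# i.e. roughly 4.3 - 8.6 seconds for removing one peer under rtnl_mutex.
print(round(8589934591 / 1e9, 1))  # -> 8.6
```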
* Re: wireguard/napi stuck in napi_disable
  2024-09-23 21:33 ` Ignat Korchagin
@ 2024-09-25 15:06   ` Daniel Dao
  0 siblings, 0 replies; 4+ messages in thread

From: Daniel Dao @ 2024-09-25 15:06 UTC (permalink / raw)
To: Ignat Korchagin
Cc: Jason, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
  wireguard, netdev, linux-kernel, jiri, Sebastian Andrzej Siewior,
  Lorenzo Bianconi, kernel-team, Tejun Heo

On Mon, Sep 23, 2024 at 10:33 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> On Mon, Sep 23, 2024 at 7:23 PM Ignat Korchagin <ignat@cloudflare.com> wrote:
> >
> > Hello,
> >
> > We run calico on our Kubernetes cluster, which uses Wireguard to encrypt
> > in-cluster traffic [1]. Recently we tried to improve the throughput of
> > the cluster and eliminate some packet drops we’re seeing by switching on
> > threaded NAPI [2] on these managed Wireguard interfaces. However, our
> > Kubernetes hosts started to lock up once in a while.
> >
> > Analyzing one stuck host with drgn we were able to confirm that the code
> > is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to be
> > cleared for the Wireguard peer napi instance, but that never happens for
> > some reason. For context, the full state of the stuck napi instance is
> > 0b100110111. What makes things worse: this happens when calico removes a
> > Wireguard peer, which happens while holding the global rtnl_mutex, so
> > all the other tasks requiring that mutex get stuck as well.
> >
> > Full stacktrace of the “looping” task:
> >
> > [...]

Looking at this further to understand why the napi state would be
0b100110111, which decodes to SCHED | MISSED | DISABLE | LISTED |
NO_BUSY_POLL | THREADED, I think this is a problem with a MISSED wakeup in
napi threaded mode.

napi_complete_done() calls __napi_schedule() when the state has
NAPIF_STATE_MISSED, but in 6.6 it does not set NAPI_STATE_SCHED_THREADED
when the napi thread is running, and since the thread is running, the wakeup
does not do anything. Therefore we miss the chance to do another poll, and
if we race with napi_disable() the state will be SCHED | MISSED | DISABLE,
and we are stuck.

However, it looks like the following commit resolves that situation for us:

commit 56364c910691f6d10ba88c964c9041b9ab777bd6
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Mar 25 08:40:28 2024 +0100

    net: Remove conditional threaded-NAPI wakeup based on task state.

As long as we always set SCHED_THREADED, the next loop of the napi thread
will do the right thing and clear SCHED, letting napi_disable() complete. We
are testing 6.6 with this patch, and haven't seen any similar lockups so far.

> > We have also noticed that a similar issue is observed when we switch
> > Wireguard threaded NAPI back off: a task removing a Wireguard peer may
> > still spend a considerable amount of time in the above loop (and hold
> > rtnl_mutex); however, the host eventually recovers from this state.

However, these lockups have become much more prominent now.
Here is the stack trace of peer_remove_after_dead:

#0  context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5380:2)
#1  __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6699:8)
#2  schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6773:3)
#3  schedule_timeout (/cfsetup_build/build/linux/kernel/time/timer.c:2143:3)
#4  do_wait_for_common (/cfsetup_build/build/linux/kernel/sched/completion.c:95:14)
#5  __wait_for_common (/cfsetup_build/build/linux/kernel/sched/completion.c:116:12)
#6  wait_for_common (/cfsetup_build/build/linux/kernel/sched/completion.c:127:9)
#7  wait_for_completion (/cfsetup_build/build/linux/kernel/sched/completion.c:148:2)
#8  __flush_workqueue (/cfsetup_build/build/linux/kernel/workqueue.c:3196:2)
#9  peer_remove_after_dead (/cfsetup_build/build/linux/drivers/net/wireguard/peer.c:116:2)
#10 set_peer (/cfsetup_build/build/linux/drivers/net/wireguard/netlink.c:425:3)
#11 wg_set_device (/cfsetup_build/build/linux/drivers/net/wireguard/netlink.c:592:10)
#12 genl_family_rcv_msg_doit (/cfsetup_build/build/linux/net/netlink/genetlink.c:971:8)
#13 genl_family_rcv_msg (/cfsetup_build/build/linux/net/netlink/genetlink.c:1051:10)
#14 genl_rcv_msg (/cfsetup_build/build/linux/net/netlink/genetlink.c:1066:8)
#15 netlink_rcv_skb (/cfsetup_build/build/linux/net/netlink/af_netlink.c:2544:9)
#16 genl_rcv (/cfsetup_build/build/linux/net/netlink/genetlink.c:1075:2)
#17 netlink_unicast_kernel (/cfsetup_build/build/linux/net/netlink/af_netlink.c:1342:3)
#18 netlink_unicast (/cfsetup_build/build/linux/net/netlink/af_netlink.c:1368:10)
#19 netlink_sendmsg (/cfsetup_build/build/linux/net/netlink/af_netlink.c:1910:8)
#20 sock_sendmsg_nosec (/cfsetup_build/build/linux/net/socket.c:730:12)
#21 __sock_sendmsg (/cfsetup_build/build/linux/net/socket.c:745:16)
#22 ____sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2590:8)
#23 ___sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2644:8)
#24 __sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2673:8)
#25 do_syscall_x64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:51:14)
#26 do_syscall_64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:81:7)
#27 entry_SYSCALL_64+0x9c/0x184 (/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:121)
#28 0x41262e

drgn shows that we are waiting for completion of a work item running
wg_packet_tx_worker, which is destined for a completely different peer than
the one we want to remove.

*(struct worker *)0xffff888107f640c0 = {
        .entry = (struct list_head){
                .next = (struct list_head *)0x0,
                .prev = (struct list_head *)0xffff8897e0cb1f50,
        },
        .hentry = (struct hlist_node){
                .next = (struct hlist_node *)0x0,
                .pprev = (struct hlist_node **)0xffff8897e0cb1f50,
        },
        .current_work = (struct work_struct *)0xffff8881a32638d0,
        .current_func = (work_func_t)wg_packet_tx_worker+0x0 = 0xffffffffc0f6ca40,
        .current_pwq = (struct pool_workqueue *)0xffff88812bca6400,
        .current_at = (u64)3491257913,
        .current_color = (unsigned int)4,
        .sleeping = (int)0,
        .last_func = (work_func_t)wg_packet_tx_worker+0x0 = 0xffffffffc0f6ca40,
        .scheduled = (struct list_head){
                .next = (struct list_head *)0xffff8881a32638d8,
                .prev = (struct list_head *)0xffff8881a32638d8,
        },
        .task = (struct task_struct *)0xffff888472a08000,
        .pool = (struct worker_pool *)0xffff8897e0cb1cc0,
        .node = (struct list_head){
                .next = (struct list_head *)0xffff888107f64360,
                .prev = (struct list_head *)0xffff888107f645a0,
        },
        .last_active = (unsigned long)4409382646,
        .flags = (unsigned int)64,
        .id = (int)1,
        .desc = (char [32])"wg-crypt-wireguard.cali",
        .rescue_wq = (struct workqueue_struct *)0x0,
}

This can take a very long time, especially if the peers produce/receive as
fast as they can, as in our test setup. We set up some metrics with bpftrace
to measure the duration of peer_remove_after_dead and wg_packet_tx_worker,
and got the following measurements after a long wait:
@duration_ms[peer_remove_after_dead]:
[512K, 1M)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@duration_ms[wg_packet_tx_worker]:
[0]             744612 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                 20 |                                                    |
[2, 4)               2 |                                                    |
[4, 8)               2 |                                                    |
[16, 32)             2 |                                                    |
[256K, 512K)         1 |                                                    |
[2M, 4M)             1 |                                                    |

We can see that peer_remove_after_dead took between 512 s and 1000 s to
complete, while wg_packet_tx_worker can take up to [2000 s, 4000 s) to
complete, which is an awfully long time.

Daniel

> > So the questions are:
> > 1. Any ideas why the NAPI_STATE_SCHED bit never gets cleared for the
> >    threaded NAPI case in Wireguard?
> > 2. Is it generally a good idea for Wireguard to loop for an
> >    indeterminate amount of time while holding the rtnl_mutex? Or can it
> >    be refactored?
>
> I've been also trying to reproduce this issue with a script [1]. While I
> could not reproduce the complete lockup, I've been able to confirm that
> peer_remove_after_dead() may take multiple seconds to execute - all while
> holding the rtnl_mutex. Below is bcc-tools funclatency output from a
> freshly compiled mainline (6.11):
>
> # /usr/share/bcc/tools/funclatency peer_remove_after_dead
> Tracing 1 functions for "peer_remove_after_dead"... Hit Ctrl-C to end.
>
> [...]
>
> avg = 14251705 nsecs, total: 21548578415 nsecs, count: 1512
>
> So we have cases where it takes 2 or even 8 seconds to remove a single
> peer, which is definitely not great considering we're holding a global
> lock.
>
> [...]
>
> Ignat
>
> [1]: https://gist.githubusercontent.com/ignatk/4505d96e02815de3aa5649c4aa7c3fca/raw/177e4eab9f491024db6488cd0ea1cbba2d5579b4/wg.sh
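For reference, the @duration_ms bucket labels in the bpftrace histograms above use binary suffixes (K = 1024, M = 1024²) on millisecond bounds, which is where the "512 s to 1000 s" reading comes from; a small conversion sketch:

```python
K, M = 1 << 10, 1 << 20  # binary suffixes as printed by bpftrace hist()

def ms_bucket_to_s(lo_ms: int, hi_ms: int):
    """Convert a [lo_ms, hi_ms) millisecond bucket to seconds."""
    return lo_ms / 1000, hi_ms / 1000

# peer_remove_after_dead landed in [512K, 1M) ms:
print(ms_bucket_to_s(512 * K, 1 * M))  # -> (524.288, 1048.576)
# worst wg_packet_tx_worker bucket, [2M, 4M) ms:
print(ms_bucket_to_s(2 * M, 4 * M))    # -> (2097.152, 4194.304)
```

So a single flush of the tx worker can plausibly stall the rtnl_mutex holder for over half an hour at the tail.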
end of thread, other threads: [~2024-11-18 13:47 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-23 18:23 wireguard/napi stuck in napi_disable Ignat Korchagin
2024-09-23 18:46 ` Eric Dumazet
2024-09-23 21:33 ` Ignat Korchagin
2024-09-25 15:06   ` Daniel Dao