* [PATCH] Enabling Threaded NAPI by Default
@ 2025-05-27 9:08 Mirco Barone
2025-05-27 12:17 ` Jason A. Donenfeld
0 siblings, 1 reply; 6+ messages in thread
From: Mirco Barone @ 2025-05-27 9:08 UTC (permalink / raw)
To: wireguard
Hi everyone,
While testing WireGuard with a large number of tunnels, we expected throughput to scale linearly with the number of active tunnels. Instead, we observed very poor performance due to a bottleneck caused by multiple NAPI functions stacking on the same CPU core, preventing the system from scaling effectively.
More details are provided in this paper on page 3:
https://netdevconf.info/0x18/docs/netdev-0x18-paper23-talk-paper.pdf
Since each peer has its own NAPI struct, the problem can potentially occur whenever many peers are created on the same machine. The simple solution we found is to enable threaded NAPI, which considerably improves throughput under our test conditions while showing no drawbacks in traditional deployment scenarios (i.e., a single tunnel). Hence, we feel we could slightly modify the code and make threaded NAPI the new default.
Any comment?
The option to revert to NAPI handled by a softirq is still preserved, by simply changing the `/sys/class/net/<iface>/threaded` flag.
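As a quick illustration, the flag can also be flipped from a small user-space program. The following is a minimal sketch of our own (the helper name, the program name, and the "wg0" interface in the usage comment are just examples, and error handling is kept to a minimum); it simply writes "1" or "0" to the sysfs attribute:
#include <stdio.h>
#include <stdlib.h>
/* Write "1" to /sys/class/net/<iface>/threaded to enable threaded NAPI,
 * or "0" to fall back to softirq-based polling. */
static int set_napi_threaded(const char *iface, int enable)
{
	char path[128];
	FILE *f;
	snprintf(path, sizeof(path), "/sys/class/net/%s/threaded", iface);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", enable);
	return fclose(f);
}
int main(int argc, char **argv)
{
	/* Usage: ./napi-threaded <iface> <0|1>, e.g. ./napi-threaded wg0 0 */
	if (argc != 3)
		return 1;
	return set_napi_threaded(argv[1], atoi(argv[2])) ? 1 : 0;
}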
-----------------------------------------------------------------------
CHANGES
-----------------------------------------------------------------------
drivers/net/wireguard/device.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
index 45e9b908dbfb..bb77f54d7526 100644
--- a/drivers/net/wireguard/device.c
+++ b/drivers/net/wireguard/device.c
@@ -363,6 +363,7 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
 	ret = wg_ratelimiter_init();
 	if (ret < 0)
 		goto err_free_handshake_queue;
+	dev_set_threaded(dev, true);
 	ret = register_netdevice(dev);
 	if (ret < 0)
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Enabling Threaded NAPI by Default
2025-05-27 9:08 [PATCH] Enabling Threaded NAPI by Default Mirco Barone
@ 2025-05-27 12:17 ` Jason A. Donenfeld
2025-05-28 17:26 ` R: " Mirco Barone
0 siblings, 1 reply; 6+ messages in thread
From: Jason A. Donenfeld @ 2025-05-27 12:17 UTC (permalink / raw)
To: Mirco Barone; +Cc: wireguard
Hi,
Indeed I'm interested in this, but I need this as a proper git
formatted patch with a commit message that has real information:
- What kind of speedups and under which circumstances?
- Are there any known performance regressions? Small packets? Bursty traffic?
- Why is this not enabled by default everywhere and what makes
WireGuard special?
And so forth. All of this should be in a normally written git message,
with the patch sent using `git send-email`.
Thanks,
Jason
^ permalink raw reply [flat|nested] 6+ messages in thread
* R: [PATCH] Enabling Threaded NAPI by Default
2025-05-27 12:17 ` Jason A. Donenfeld
@ 2025-05-28 17:26 ` Mirco Barone
2025-05-28 18:10 ` Jason A. Donenfeld
2025-05-29 10:22 ` Mirco Barone
0 siblings, 2 replies; 6+ messages in thread
From: Mirco Barone @ 2025-05-28 17:26 UTC (permalink / raw)
To: Jason A. Donenfeld; +Cc: wireguard
This patch enables threaded NAPI by default for WireGuard devices in
response to low performance behavior that we observed when multiple
tunnels (and thus multiple wg devices) are deployed on a single host.
This affects any kind of multi-tunnel deployment, regardless of whether
the tunnels share the same endpoints or not (i.e., a VPN concentrator
type of gateway would also be affected).
The problem is caused by the fact that, in case of a traffic surge that
involves multiple tunnels at the same time, the polling of the NAPI
instance of all these wg devices tends to converge onto the same core,
causing underutilization of the CPU and bottlenecking performance.
This happens because NAPI polling is hosted by default in softirq
context, but the WireGuard driver only raises this softirq after the rx
peer queue has been drained, which doesn't happen during high traffic.
In this case, the softirq already active on a core is reused instead of
raising a new one.
As a result, once two or more tunnel softirqs have been scheduled on
the same core, they remain pinned there until the surge ends.
In our experiments, this almost always leads to all tunnel NAPIs being
handled on a single core shortly after a surge begins, limiting
scalability to less than 3× the performance of a single tunnel, despite
plenty of unused CPU cores being available.
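To make the pinning mechanism above concrete, here is a simplified sketch of the enqueue/poll pattern. This is our own illustration with hypothetical names (example_peer, example_enqueue_rx, example_napi_poll), not the actual WireGuard code:
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
/* Hypothetical, simplified peer: one NAPI instance plus its rx queue. */
struct example_peer {
	struct napi_struct napi;
	struct sk_buff_head rx_queue;
};
/* Producer side: hand work to the peer's NAPI instance.  napi_schedule()
 * is a no-op while the instance is already scheduled; otherwise it puts
 * the instance on the softirq poll list of the CPU it is called from. */
static void example_enqueue_rx(struct example_peer *peer, struct sk_buff *skb)
{
	skb_queue_tail(&peer->rx_queue, skb);
	napi_schedule(&peer->napi);
}
/* Poll side: only when the queue is fully drained does napi_complete_done()
 * let a later napi_schedule() raise the softirq again (possibly on another
 * CPU).  Under a sustained surge the queue never drains, so this instance
 * keeps being polled on the core where it was first scheduled, together
 * with the instances of every other tunnel that landed there. */
static int example_napi_poll(struct napi_struct *napi, int budget)
{
	struct example_peer *peer = container_of(napi, struct example_peer, napi);
	struct sk_buff *skb;
	int work = 0;
	while (work < budget && (skb = skb_dequeue(&peer->rx_queue)) != NULL) {
		consume_skb(skb); /* stand-in for actual rx processing */
		work++;
	}
	if (work < budget)
		napi_complete_done(napi, work);
	return work;
}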
The proposed mitigation is to enable threaded NAPI for all WireGuard
devices. This moves the NAPI polling context to a dedicated per-device
kernel thread, allowing the scheduler to balance the load across all
available cores.
On our 32-core gateways, enabling threaded NAPI yields a ~4× performance
improvement with 16 tunnels, increasing throughput from ~13 Gbps to
~48 Gbps. Meanwhile, CPU usage on the receiver (which is the bottleneck)
jumps from 20% to 100%.
We have found no performance regressions in any scenario we tested.
Single-tunnel throughput remains unchanged.
More details are available in our Netdev paper:
https://netdevconf.info/0x18/docs/netdev-0x18-paper23-talk-paper.pdf
---
drivers/net/wireguard/device.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
index 45e9b908dbfb..bb77f54d7526 100644
--- a/drivers/net/wireguard/device.c
+++ b/drivers/net/wireguard/device.c
@@ -363,6 +363,7 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
 	ret = wg_ratelimiter_init();
 	if (ret < 0)
 		goto err_free_handshake_queue;
+	dev_set_threaded(dev, true);
 	ret = register_netdevice(dev);
 	if (ret < 0)
--
2.34.1
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: R: [PATCH] Enabling Threaded NAPI by Default
2025-05-28 17:26 ` R: " Mirco Barone
@ 2025-05-28 18:10 ` Jason A. Donenfeld
2025-05-29 10:22 ` Mirco Barone
1 sibling, 0 replies; 6+ messages in thread
From: Jason A. Donenfeld @ 2025-05-28 18:10 UTC (permalink / raw)
To: Mirco Barone; +Cc: wireguard
On Wed, May 28, 2025 at 05:26:34PM +0000, Mirco Barone wrote:
> This happens because NAPI polling is hosted by default in softirq
> context, but the WireGuard driver only raises this softirq after the rx
> peer queue has been drained, which doesn't happen during high traffic.
> In this case, the softirq already active on a core is reused instead of
> raising a new one.
>
> As a result, once two or more tunnel softirqs have been scheduled on
> the same core, they remain pinned there until the surge ends.
>
> In our experiments, this almost always leads to all tunnel NAPIs being
> handled on a single core shortly after a surge begins, limiting
> scalability to less than 3× the performance of a single tunnel, despite
> plenty of unused CPU cores being available.
So *that's* what's been going on! Holy Moses, nice discovery.
> On our 32-core gateways, enabling threaded NAPI yields a ~4× performance
> improvement with 16 tunnels, increasing throughput from ~13 Gbps to
> ~48 Gbps. Meanwhile, CPU usage on the receiver (which is the bottleneck)
> jumps from 20% to 100%.
Shut up and take my money! Patch applied.
> ---
> drivers/net/wireguard/device.c | 1 +
> 1 file changed, 1 insertion(+)
Actually, no, wait, sorry, this needs your Signed-off-by line, per the
kernel contribution guidelines, for me to be able to push it. Can you
just reply to your initial patch email, quote all of the text, and
append the string:
Signed-off-by: Mirco Barone <mirco.barone@polito.it>
And then I'll push this up.
Sorry for the hassle; kernel development has its particularities.
Jason
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Enabling Threaded NAPI by Default
2025-05-28 17:26 ` R: " Mirco Barone
2025-05-28 18:10 ` Jason A. Donenfeld
@ 2025-05-29 10:22 ` Mirco Barone
1 sibling, 0 replies; 6+ messages in thread
From: Mirco Barone @ 2025-05-29 10:22 UTC (permalink / raw)
To: Jason A. Donenfeld; +Cc: wireguard
On 5/28/2025 7:26 PM, Mirco Barone wrote:
> This patch enables threaded NAPI by default for WireGuard devices in
> response to low performance behavior that we observed when multiple
> tunnels (and thus multiple wg devices) are deployed on a single host.
> This affects any kind of multi-tunnel deployment, regardless of whether
> the tunnels share the same endpoints or not (i.e., a VPN concentrator
> type of gateway would also be affected).
>
> The problem is caused by the fact that, in case of a traffic surge that
> involves multiple tunnels at the same time, the polling of the NAPI
> instance of all these wg devices tends to converge onto the same core,
> causing underutilization of the CPU and bottlenecking performance.
>
> This happens because NAPI polling is hosted by default in softirq
> context, but the WireGuard driver only raises this softirq after the rx
> peer queue has been drained, which doesn't happen during high traffic.
> In this case, the softirq already active on a core is reused instead of
> raising a new one.
>
> As a result, once two or more tunnel softirqs have been scheduled on
> the same core, they remain pinned there until the surge ends.
>
> In our experiments, this almost always leads to all tunnel NAPIs being
> handled on a single core shortly after a surge begins, limiting
> scalability to less than 3× the performance of a single tunnel, despite
> plenty of unused CPU cores being available.
>
> The proposed mitigation is to enable threaded NAPI for all WireGuard
> devices. This moves the NAPI polling context to a dedicated per-device
> kernel thread, allowing the scheduler to balance the load across all
> available cores.
>
> On our 32-core gateways, enabling threaded NAPI yields a ~4× performance
> improvement with 16 tunnels, increasing throughput from ~13 Gbps to
> ~48 Gbps. Meanwhile, CPU usage on the receiver (which is the bottleneck)
> jumps from 20% to 100%.
>
> We have found no performance regressions in any scenario we tested.
> Single-tunnel throughput remains unchanged.
>
> More details are available in our Netdev paper:
> https://netdevconf.info/0x18/docs/netdev-0x18-paper23-talk-paper.pdf
> ---
> drivers/net/wireguard/device.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
> index 45e9b908dbfb..bb77f54d7526 100644
> --- a/drivers/net/wireguard/device.c
> +++ b/drivers/net/wireguard/device.c
> @@ -363,6 +363,7 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
>  	ret = wg_ratelimiter_init();
>  	if (ret < 0)
>  		goto err_free_handshake_queue;
> +	dev_set_threaded(dev, true);
>
>  	ret = register_netdevice(dev);
>  	if (ret < 0)
Signed-off-by: Mirco Barone <mirco.barone@polito.it>
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH] Enabling Threaded NAPI by Default
@ 2024-09-10 6:33 Mirco Barone
0 siblings, 0 replies; 6+ messages in thread
From: Mirco Barone @ 2024-09-10 6:33 UTC (permalink / raw)
To: wireguard
Hi everyone,
While testing WireGuard with a large number of tunnels, we noticed a bottleneck caused by multiple NAPI functions piling up on the same CPU core, preventing the system from scaling effectively.
More details are described in this paper on page 3:
https://netdevconf.info/0x18/docs/netdev-0x18-paper23-talk-paper.pdf
Since each peer has its own NAPI struct, the problem can potentially occur whenever many peers are created on the same machine. The simple solution we found is to enable threaded NAPI, which considerably improves throughput under our test conditions while showing no drawbacks in traditional deployment scenarios (i.e., a single tunnel). Hence, we feel we could slightly modify the code and make threaded NAPI the new default.
Any comment?
The option to revert to NAPI handled by a softirq is still preserved,
by simply changing the `/sys/class/net/<iface>/threaded` flag.
diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
old mode 100644
new mode 100755
index 3feb36ee5bfb..60554b7c405a
--- a/drivers/net/wireguard/device.c
+++ b/drivers/net/wireguard/device.c
@@ -363,6 +363,8 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
 	ret = wg_ratelimiter_init();
 	if (ret < 0)
 		goto err_free_handshake_queue;
+
+	dev_set_threaded(dev, true);
 	ret = register_netdevice(dev);
 	if (ret < 0)
Kind regards
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-05-29 10:22 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-27 9:08 [PATCH] Enabling Threaded NAPI by Default Mirco Barone
2025-05-27 12:17 ` Jason A. Donenfeld
2025-05-28 17:26 ` R: " Mirco Barone
2025-05-28 18:10 ` Jason A. Donenfeld
2025-05-29 10:22 ` Mirco Barone
-- strict thread matches above, loose matches on Subject: below --
2024-09-10 6:33 Mirco Barone