Development discussion of WireGuard
* [PATCH] WireGuard: restrict packet handling to non-isolated CPUs.
@ 2022-04-05 21:21 Charles-Francois Natali
  2022-04-22  0:02 ` Jason A. Donenfeld
  0 siblings, 1 reply; 5+ messages in thread
From: Charles-Francois Natali @ 2022-04-05 21:21 UTC (permalink / raw)
  To: wireguard; +Cc: Jason, Charles-Francois Natali

WireGuard currently uses round-robin to dispatch the handling of
packets, handling them on all online CPUs, including isolated ones
(isolcpus).

This is unfortunate because it causes significant latency on isolated
CPUs - see e.g. the trace below, showing a stall of over 240 usec:

kworker/47:1-2373323 [047] 243644.756405: funcgraph_entry:                | process_one_work() {
    kworker/47:1-2373323 [047] 243644.756406: funcgraph_entry:            |   wg_packet_decrypt_worker() {
[...]
    kworker/47:1-2373323 [047] 243644.756647: funcgraph_exit: 0.591 us    |   }
    kworker/47:1-2373323 [047] 243644.756647: funcgraph_exit: ! 242.655 us | }

Instead, restrict to non-isolated CPUs.

Example:

~# cat /sys/devices/system/cpu/isolated
3
~# /usr/share/doc/wireguard-tools/examples/ncat-client-server/client.sh
~# ping 192.168.4.1

Before - corresponding workqueues are executed on all CPUs:

~# trace-cmd record -p function -l wg_packet_decrypt_worker -- sleep 10
  plugin 'function'
CPU0 data recorded at offset=0x7d6000
    4096 bytes in size
CPU1 data recorded at offset=0x7d7000
    4096 bytes in size
CPU2 data recorded at offset=0x7d8000
    4096 bytes in size
CPU3 data recorded at offset=0x7d9000
    4096 bytes in size
~# trace-cmd report
cpus=4
     kworker/3:1-52    [003]    49.784353: function:             wg_packet_decrypt_worker
     kworker/0:1-17    [000]    50.782879: function:             wg_packet_decrypt_worker
     kworker/1:3-162   [001]    51.783044: function:             wg_packet_decrypt_worker
     kworker/2:1-56    [002]    52.782159: function:             wg_packet_decrypt_worker
     kworker/3:1-52    [003]    53.780919: function:             wg_packet_decrypt_worker
     kworker/0:0-6     [000]    54.781755: function:             wg_packet_decrypt_worker
     kworker/1:3-162   [001]    55.781273: function:             wg_packet_decrypt_worker
     kworker/2:1-56    [002]    56.781946: function:             wg_packet_decrypt_worker
     kworker/3:1-52    [003]    57.781010: function:             wg_packet_decrypt_worker
     kworker/0:0-6     [000]    58.782097: function:             wg_packet_decrypt_worker
~#

After - isolated CPU 3 is excluded:

~# trace-cmd record -p function -l wg_packet_decrypt_worker -- sleep 10
  plugin 'function'
CPU0 data recorded at offset=0x7d7000
    4096 bytes in size
CPU1 data recorded at offset=0x7d8000
    4096 bytes in size
CPU2 data recorded at offset=0x7d9000
    4096 bytes in size
CPU3 data recorded at offset=0x7da000
    0 bytes in size
~# trace-cmd report
CPU 3 is empty
cpus=4
     kworker/1:2-66    [001]   291.800063: function:             wg_packet_decrypt_worker
     kworker/2:2-143   [002]   292.800266: function:             wg_packet_decrypt_worker
     kworker/0:2-145   [000]   293.801778: function:             wg_packet_decrypt_worker
     kworker/1:4-261   [001]   294.803411: function:             wg_packet_decrypt_worker
     kworker/2:2-143   [002]   295.804068: function:             wg_packet_decrypt_worker
     kworker/0:2-145   [000]   296.806057: function:             wg_packet_decrypt_worker
     kworker/1:2-66    [001]   297.810686: function:             wg_packet_decrypt_worker
     kworker/2:2-143   [002]   298.811602: function:             wg_packet_decrypt_worker
     kworker/0:2-145   [000]   299.812790: function:             wg_packet_decrypt_worker
     kworker/1:4-261   [001]   300.813076: function:             wg_packet_decrypt_worker
~#

Signed-off-by: Charles-Francois Natali <cf.natali@gmail.com>
---
 drivers/net/wireguard/queueing.h | 59 +++++++++++++++++++++++++-------
 drivers/net/wireguard/receive.c  |  2 +-
 2 files changed, 48 insertions(+), 13 deletions(-)

diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h
index 583adb37e..106a2686c 100644
--- a/drivers/net/wireguard/queueing.h
+++ b/drivers/net/wireguard/queueing.h
@@ -11,6 +11,7 @@
 #include <linux/skbuff.h>
 #include <linux/ip.h>
 #include <linux/ipv6.h>
+#include <linux/sched/isolation.h>
 #include <net/ip_tunnels.h>
 
 struct wg_device;
@@ -102,16 +103,50 @@ static inline void wg_reset_packet(struct sk_buff *skb, bool encapsulating)
 	skb_reset_inner_headers(skb);
 }
 
-static inline int wg_cpumask_choose_online(int *stored_cpu, unsigned int id)
+/* We only want to dispatch work to housekeeping CPUs, ignoring isolated ones.
+ */
+static inline const struct cpumask *wg_cpumask_housekeeping(void)
+{
+	return housekeeping_cpumask(HK_FLAG_DOMAIN);
+}
+
+static inline int wg_cpumask_test_cpu(int cpu)
+{
+	return cpumask_test_cpu(cpu, cpu_online_mask) &&
+		cpumask_test_cpu(cpu, wg_cpumask_housekeeping());
+}
+
+static inline unsigned int wg_cpumask_first(void)
+{
+	return cpumask_first_and(cpu_online_mask, wg_cpumask_housekeeping());
+}
+
+static inline unsigned int wg_cpumask_next(int n)
+{
+	return cpumask_next_and(n, cpu_online_mask, wg_cpumask_housekeeping());
+}
+
+static inline unsigned int wg_cpumask_weight(void)
+{
+	int cpu;
+	int weight = 0;
+
+	for_each_cpu_and(cpu, cpu_online_mask, wg_cpumask_housekeeping()) {
+		++weight;
+	}
+
+	return weight;
+}
+
+static inline int wg_cpumask_choose_eligible(int *stored_cpu, unsigned int id)
 {
 	unsigned int cpu = *stored_cpu, cpu_index, i;
 
-	if (unlikely(cpu == nr_cpumask_bits ||
-		     !cpumask_test_cpu(cpu, cpu_online_mask))) {
-		cpu_index = id % cpumask_weight(cpu_online_mask);
-		cpu = cpumask_first(cpu_online_mask);
+	if (unlikely(cpu == nr_cpumask_bits || !wg_cpumask_test_cpu(cpu))) {
+		cpu_index = id % wg_cpumask_weight();
+		cpu = wg_cpumask_first();
 		for (i = 0; i < cpu_index; ++i)
-			cpu = cpumask_next(cpu, cpu_online_mask);
+			cpu = wg_cpumask_next(cpu);
 		*stored_cpu = cpu;
 	}
 	return cpu;
@@ -124,13 +159,13 @@ static inline int wg_cpumask_choose_online(int *stored_cpu, unsigned int id)
  * a bit slower, and it doesn't seem like this potential race actually
  * introduces any performance loss, so we live with it.
  */
-static inline int wg_cpumask_next_online(int *next)
+static inline int wg_cpumask_next_eligible(int *next)
 {
 	int cpu = *next;
 
-	while (unlikely(!cpumask_test_cpu(cpu, cpu_online_mask)))
-		cpu = cpumask_next(cpu, cpu_online_mask) % nr_cpumask_bits;
-	*next = cpumask_next(cpu, cpu_online_mask) % nr_cpumask_bits;
+	while (unlikely(!wg_cpumask_test_cpu(cpu)))
+		cpu = wg_cpumask_next(cpu) % nr_cpumask_bits;
+	*next = wg_cpumask_next(cpu) % nr_cpumask_bits;
 	return cpu;
 }
 
@@ -173,7 +208,7 @@ static inline int wg_queue_enqueue_per_device_and_peer(
 	/* Then we queue it up in the device queue, which consumes the
 	 * packet as soon as it can.
 	 */
-	cpu = wg_cpumask_next_online(next_cpu);
+	cpu = wg_cpumask_next_eligible(next_cpu);
 	if (unlikely(ptr_ring_produce_bh(&device_queue->ring, skb)))
 		return -EPIPE;
 	queue_work_on(cpu, wq, &per_cpu_ptr(device_queue->worker, cpu)->work);
@@ -188,7 +223,7 @@ static inline void wg_queue_enqueue_per_peer_tx(struct sk_buff *skb, enum packet
 	struct wg_peer *peer = wg_peer_get(PACKET_PEER(skb));
 
 	atomic_set_release(&PACKET_CB(skb)->state, state);
-	queue_work_on(wg_cpumask_choose_online(&peer->serial_work_cpu, peer->internal_id),
+	queue_work_on(wg_cpumask_choose_eligible(&peer->serial_work_cpu, peer->internal_id),
 		      peer->device->packet_crypt_wq, &peer->transmit_packet_work);
 	wg_peer_put(peer);
 }
diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index 7b8df406c..2d5d903d0 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -572,7 +572,7 @@ void wg_packet_receive(struct wg_device *wg, struct sk_buff *skb)
 			goto err;
 		}
 		atomic_inc(&wg->handshake_queue_len);
-		cpu = wg_cpumask_next_online(&wg->handshake_queue.last_cpu);
+		cpu = wg_cpumask_next_eligible(&wg->handshake_queue.last_cpu);
 		/* Queues up a call to packet_process_queued_handshake_packets(skb): */
 		queue_work_on(cpu, wg->handshake_receive_wq,
 			      &per_cpu_ptr(wg->handshake_queue.worker, cpu)->work);
-- 
2.30.2



* Re: [PATCH] WireGuard: restrict packet handling to non-isolated CPUs.
  2022-04-05 21:21 [PATCH] WireGuard: restrict packet handling to non-isolated CPUs Charles-Francois Natali
@ 2022-04-22  0:02 ` Jason A. Donenfeld
  2022-04-22  0:40   ` Stephen Hemminger
  2022-04-22 22:23   ` Charles-François Natali
  0 siblings, 2 replies; 5+ messages in thread
From: Jason A. Donenfeld @ 2022-04-22  0:02 UTC (permalink / raw)
  To: Charles-Francois Natali
  Cc: wireguard, netdev, linux-crypto, Daniel Jordan, Steffen Klassert

netdev@ - Original thread is at
https://lore.kernel.org/wireguard/20220405212129.2270-1-cf.natali@gmail.com/

Hi Charles-François,

On Tue, Apr 05, 2022 at 10:21:29PM +0100, Charles-Francois Natali wrote:
> WireGuard currently uses round-robin to dispatch the handling of
> packets, handling them on all online CPUs, including isolated ones
> (isolcpus).
> 
> This is unfortunate because it causes significant latency on isolated
> CPUs - see e.g. the trace below, showing a stall of over 240 usec:
> 
> kworker/47:1-2373323 [047] 243644.756405: funcgraph_entry:                | process_one_work() {
>     kworker/47:1-2373323 [047] 243644.756406: funcgraph_entry:            |   wg_packet_decrypt_worker() {
> [...]
>     kworker/47:1-2373323 [047] 243644.756647: funcgraph_exit: 0.591 us    |   }
>     kworker/47:1-2373323 [047] 243644.756647: funcgraph_exit: ! 242.655 us | }
> 
> Instead, restrict to non-isolated CPUs.

Huh, interesting... I haven't seen this feature before. What's the
intended use case? To never run _anything_ on those cores except
processes you choose? To run some things but not intensive things? Is it
sort of an RT-lite?

I took a look in padata/pcrypt and it doesn't look like they're
examining the housekeeping mask at all. Grepping for
housekeeping_cpumask doesn't appear to show many results in things like
workqueues, but rather in core scheduling stuff. So I'm not quite sure
what to make of this patch.

I suspect the thing to do might be to patch both wireguard and padata,
and send a patch series to me, the padata people, and
netdev@vger.kernel.org, and we can all hash this out together.

Regarding your patch, is there a way to make that a bit more succinct,
without introducing all of those helper functions? It seems awfully
verbose for something that seems like a matter of replacing the online
mask with the housekeeping mask.

Jason


* Re: [PATCH] WireGuard: restrict packet handling to non-isolated CPUs.
  2022-04-22  0:02 ` Jason A. Donenfeld
@ 2022-04-22  0:40   ` Stephen Hemminger
  2022-04-22 22:23   ` Charles-François Natali
  1 sibling, 0 replies; 5+ messages in thread
From: Stephen Hemminger @ 2022-04-22  0:40 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Charles-Francois Natali, wireguard, netdev, linux-crypto,
	Daniel Jordan, Steffen Klassert

On Fri, 22 Apr 2022 02:02:21 +0200
"Jason A. Donenfeld" <Jason@zx2c4.com> wrote:

> netdev@ - Original thread is at
> https://lore.kernel.org/wireguard/20220405212129.2270-1-cf.natali@gmail.com/
> 
> Hi Charles-François,
> 
> On Tue, Apr 05, 2022 at 10:21:29PM +0100, Charles-Francois Natali wrote:
> > WireGuard currently uses round-robin to dispatch the handling of
> > packets, handling them on all online CPUs, including isolated ones
> > (isolcpus).
> > 
> > This is unfortunate because it causes significant latency on isolated
> > CPUs - see e.g. the trace below, showing a stall of over 240 usec:
> > 
> > kworker/47:1-2373323 [047] 243644.756405: funcgraph_entry:                | process_one_work() {
> >     kworker/47:1-2373323 [047] 243644.756406: funcgraph_entry:            |   wg_packet_decrypt_worker() {
> > [...]
> >     kworker/47:1-2373323 [047] 243644.756647: funcgraph_exit: 0.591 us    |   }
> >     kworker/47:1-2373323 [047] 243644.756647: funcgraph_exit: ! 242.655 us | }
> > 
> > Instead, restrict to non-isolated CPUs.  
> 
> Huh, interesting... I haven't seen this feature before. What's the
> intended use case? To never run _anything_ on those cores except
> processes you choose? To run some things but not intensive things? Is it
> sort of an RT-lite?
> 
> I took a look in padata/pcrypt and it doesn't look like they're
> examining the housekeeping mask at all. Grepping for
> housekeeping_cpumask doesn't appear to show many results in things like
> workqueues, but rather in core scheduling stuff. So I'm not quite sure
> what to make of this patch.
> 
> I suspect the thing to do might be to patch both wireguard and padata,
> and send a patch series to me, the padata people, and
> netdev@vger.kernel.org, and we can all hash this out together.
> 
> Regarding your patch, is there a way to make that a bit more succinct,
> without introducing all of those helper functions? It seems awfully
> verbose for something that seems like a matter of replacing the online
> mask with the housekeeping mask.
> 
> Jason

Applications like DPDK that do polling often use isolcpus or cgroups
to keep unwanted rabble off of their CPUs.  Having WireGuard use those
CPUs seems bad.


* Re: [PATCH] WireGuard: restrict packet handling to non-isolated CPUs.
  2022-04-22  0:02 ` Jason A. Donenfeld
  2022-04-22  0:40   ` Stephen Hemminger
@ 2022-04-22 22:23   ` Charles-François Natali
  2022-04-23  1:08     ` Jason A. Donenfeld
  1 sibling, 1 reply; 5+ messages in thread
From: Charles-François Natali @ 2022-04-22 22:23 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: wireguard, netdev, linux-crypto, Daniel Jordan, Steffen Klassert

Hi,

On Fri, 22 Apr 2022 at 01:02, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> netdev@ - Original thread is at
> https://lore.kernel.org/wireguard/20220405212129.2270-1-cf.natali@gmail.com/
>
> Hi Charles-François,
>
> On Tue, Apr 05, 2022 at 10:21:29PM +0100, Charles-Francois Natali wrote:
> > WireGuard currently uses round-robin to dispatch the handling of
> > packets, handling them on all online CPUs, including isolated ones
> > (isolcpus).
> >
> > This is unfortunate because it causes significant latency on isolated
> > CPUs - see e.g. the trace below, showing a stall of over 240 usec:
> >
> > kworker/47:1-2373323 [047] 243644.756405: funcgraph_entry:                | process_one_work() {
> >     kworker/47:1-2373323 [047] 243644.756406: funcgraph_entry:            |   wg_packet_decrypt_worker() {
> > [...]
> >     kworker/47:1-2373323 [047] 243644.756647: funcgraph_exit: 0.591 us    |   }
> >     kworker/47:1-2373323 [047] 243644.756647: funcgraph_exit: ! 242.655 us | }
> >
> > Instead, restrict to non-isolated CPUs.
>
> Huh, interesting... I haven't seen this feature before. What's the
> intended use case? To never run _anything_ on those cores except
> processes you choose? To run some things but not intensive things? Is it
> sort of an RT-lite?

Yes, the idea is to not run anything on those cores: no user tasks, no unbound
workqueues, etc.
Typically one would also set IRQ affinity to steer IRQs away from those cores,
since (soft)IRQs cause significant latency as well.
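
As an illustration - these boot parameters are a plausible sketch, not our
exact configuration - isolating CPU 3 might look like:

    isolcpus=domain,managed_irq,3 nohz_full=3 rcu_nocbs=3 irqaffinity=0-2

i.e. remove CPU 3 from the scheduler domains and managed-IRQ spreading,
stop its periodic tick, offload its RCU callbacks, and steer device IRQs
to the remaining CPUs.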

This series by Frederic Weisbecker is a good introduction:
https://www.suse.com/c/cpu-isolation-introduction-part-1/

The idea is to achieve low latency and jitter.
With a reasonably tuned kernel one can reach around 10 usec latency - however,
as soon as we start using WireGuard, the bound workqueues used for
round-robin dispatch cause stalls of up to 1 ms on the isolated CPUs, which
is just not acceptable for us.
Currently our only option is to either patch the WireGuard code or stop
using it, which would be a shame :).

> I took a look in padata/pcrypt and it doesn't look like they're
> examining the housekeeping mask at all. Grepping for
> housekeeping_cpumask doesn't appear to show many results in things like
> workqueues, but rather in core scheduling stuff. So I'm not quite sure
> what to make of this patch.

Thanks, I didn't know about padata, but after skimming through the code it does
seem that it would suffer from the same issue.

> I suspect the thing to do might be to patch both wireguard and padata,
> and send a patch series to me, the padata people, and
> netdev@vger.kernel.org, and we can all hash this out together.

Sure, I'll try to have a look at the padata code and write something up.

> Regarding your patch, is there a way to make that a bit more succinct,
> without introducing all of those helper functions? It seems awfully
> verbose for something that seems like a matter of replacing the online
> mask with the housekeeping mask.

Indeed, I wasn't really happy about that.
The reason I wrote those helper functions is that the housekeeping mask
is built from possible CPUs (cpu_possible_mask), so unfortunately it's not
just a matter of e.g. replacing cpu_online_mask with
housekeeping_cpumask(HK_FLAG_DOMAIN): we have to perform an AND with the
online mask whenever we compute the weight, find the next CPU in the mask, etc.
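
Concretely, every lookup ends up ANDing the two masks - a minimal sketch,
assuming HK_FLAG_DOMAIN is the right flag:

	/* Next online housekeeping CPU after @cpu. */
	int next = cpumask_next_and(cpu, cpu_online_mask,
				    housekeeping_cpumask(HK_FLAG_DOMAIN));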

And I'd rather have the operations and mask in a single location instead of
scattered throughout the code, to make it easier to understand and maintain.

Happy to change to something more inline though, or open to suggestions.

Cheers,

Charles


>
> Jason


* Re: [PATCH] WireGuard: restrict packet handling to non-isolated CPUs.
  2022-04-22 22:23   ` Charles-François Natali
@ 2022-04-23  1:08     ` Jason A. Donenfeld
  0 siblings, 0 replies; 5+ messages in thread
From: Jason A. Donenfeld @ 2022-04-23  1:08 UTC (permalink / raw)
  To: Charles-François Natali
  Cc: wireguard, netdev, linux-crypto, Daniel Jordan, Steffen Klassert

Hi Charles,

On Fri, Apr 22, 2022 at 11:23:01PM +0100, Charles-François Natali wrote:
> > Regarding your patch, is there a way to make that a bit more succinct,
> > without introducing all of those helper functions? It seems awfully
> > verbose for something that seems like a matter of replacing the online
> > mask with the housekeeping mask.
> 
> Indeed, I wasn't really happy about that.
> The reason I've written those helper functions is that the housekeeping mask
> includes possible CPUs (cpu_possible_mask), so unfortunately it's not just a
> matter of e.g. replacing cpu_online_mask with
> housekeeping_cpumask(HK_FLAG_DOMAIN), we have to perform an AND
> whenever we compute the weight, find the next CPU in the mask etc.
> 
> And I'd rather have the operations and mask in a single location instead of
> scattered throughout the code, to make it easier to understand and maintain.
> 
> Happy to change to something more inline though, or open to suggestions.

Probably more inlined, yea. A simpler version of your patch would
probably be something like this, right?

diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h
index 583adb37ee1e..b3117cdd647d 100644
--- a/drivers/net/wireguard/queueing.h
+++ b/drivers/net/wireguard/queueing.h
@@ -112,6 +112,8 @@ static inline int wg_cpumask_choose_online(int *stored_cpu, unsigned int id)
 		cpu = cpumask_first(cpu_online_mask);
 		for (i = 0; i < cpu_index; ++i)
 			cpu = cpumask_next(cpu, cpu_online_mask);
+		while (!housekeeping_test_cpu(cpu, HK_???))
+			cpu = cpumask_next(cpu, cpu_online_mask);
 		*stored_cpu = cpu;
 	}
 	return cpu;
@@ -128,7 +130,7 @@ static inline int wg_cpumask_next_online(int *next)
 {
 	int cpu = *next;

-	while (unlikely(!cpumask_test_cpu(cpu, cpu_online_mask)))
+	while (unlikely(!cpumask_test_cpu(cpu, cpu_online_mask) || !housekeeping_test_cpu(cpu, HK_???)))
 		cpu = cpumask_next(cpu, cpu_online_mask) % nr_cpumask_bits;
 	*next = cpumask_next(cpu, cpu_online_mask) % nr_cpumask_bits;
 	return cpu;

However, from looking at kernel/sched/isolation.c a bit, I noticed that
indeed you're right that most of these functions (save one) are based on
cpu_possible_mask rather than cpu_online_mask. This is frustrating
because the code makes smart use of static branches to remain quick, but
ANDing housekeeping_cpumask() with cpu_online_mask would, in the fast
path, wind up ANDing cpu_online_mask with cpu_possible_mask, which is
silly and pointless. That makes me suspect that maybe the best approach
would be adding a relevant helper to kernel/sched/isolation.c, so that
the helper can then do the `if (static_branch_unlikely(&housekeeping_overridden))`
stuff internally.
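
Roughly something like this - just a sketch, the helper name is invented
and I haven't thought about the exact flag or the wrap-around behavior:

	/* kernel/sched/isolation.c */
	int housekeeping_next_online_cpu(int cpu, enum hk_flags flags)
	{
		if (static_branch_unlikely(&housekeeping_overridden))
			return cpumask_next_and(cpu, cpu_online_mask,
						housekeeping_cpumask(flags));
		/* Isolation not in use: every online CPU is a housekeeping CPU. */
		return cpumask_next(cpu, cpu_online_mask);
	}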

Or maybe you'll do some measurements and decide that just [ab]using
housekeeping_test_cpu() like above is actually optimal? Not really sure
myself.

Anyway, I'll keep an eye out for your joint wireguard/padata series. Be
sure to CC the people who wrote the isolation & housekeeping code, as
they likely have opinions about this stuff (and certainly know more than
me about it).

Jason

