[PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers

Development discussion of WireGuard

* [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
@ 2021-02-08 13:38 Jason A. Donenfeld
  2021-02-09  8:24 ` Dmitry Vyukov
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jason A. Donenfeld @ 2021-02-08 13:38 UTC (permalink / raw)
  To: wireguard; +Cc: Jason A. Donenfeld, Dmitry Vyukov

Having two ring buffers per-peer means that every peer results in two
massive ring allocations. On an 8-core x86_64 machine, this commit
reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
is a 90% reduction. Ninety percent! With some single-machine
deployments approaching 400,000 peers, we're talking about a reduction
from 7 gigs of memory down to 700 megs of memory.
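
The arithmetic behind those figures can be checked with a small sketch.
The helper names below are hypothetical and only illustrate the
calculation; the byte counts and the peer count are the ones quoted
above. With 400,000 peers, 18,688 bytes per peer comes to roughly
7.5 GB and 1,856 bytes per peer to roughly 742 MB (decimal units).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total bytes consumed by per-peer allocations. */
static uint64_t total_bytes(uint64_t per_peer_bytes, uint64_t peers)
{
	return per_peer_bytes * peers;
}

/* Hypothetical helper: reduction in tenths of a percent, rounded down,
 * computed with integers only. */
static unsigned int reduction_permille(uint64_t before, uint64_t after)
{
	return (unsigned int)(((before - after) * 1000) / before);
}
```

Calling total_bytes(18688, 400000) and total_bytes(1856, 400000) gives
the before and after totals, and reduction_permille(18688, 1856) gives
the reduction in tenths of a percent.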

In order to get rid of these per-peer allocations, this commit switches
to using a list-based queueing approach. Currently GSO fragments are
chained together using the skb->next pointer, so we form the per-peer
queue around the unused skb->prev pointer, which makes sense because the
links are pointing backwards. Multiple cores can write into the queue at
any given time, because its writes occur in the start_xmit path or in
the udp_recv path. But reads happen in a single workqueue item per-peer,
amounting to a multi-producer, single-consumer paradigm.

The MPSC queue is implemented locklessly and never blocks. However, it
is not linearizable (though it is serializable), with a very tight and
unlikely race on writes, which, when hit (about 0.15% of the time on a
fully loaded 16-core x86_64 system), causes the queue reader to
terminate early. However, because every packet sent queues up the same
workqueue item after it is fully added, the queue resumes again, and
stopping early isn't actually a problem, since at that point the packet
wouldn't have yet been added to the encryption queue. These properties
allow us to avoid disabling interrupts or spinning.
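
To make the early-termination window concrete, here is a minimal
userspace sketch of an intrusive MPSC queue with a stub node, written
with C11 atomics. It is an illustration under assumptions, not the
kernel code in this patch: the names struct node, struct mpsc_queue and
mpsc_* are hypothetical, the memory orderings are chosen conservatively,
and the push is split into two steps only so that the window between
them can be exercised deterministically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
	_Atomic(struct node *) next;
};

struct mpsc_queue {
	_Atomic(struct node *) head;	/* producers swap themselves in here */
	struct node *tail;		/* touched only by the consumer */
	struct node stub;		/* placeholder so the list is never empty */
};

static void mpsc_init(struct mpsc_queue *q)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->head, &q->stub);
	q->tail = &q->stub;
}

/* Step 1 of a push: become the new head and return the previous head. */
static struct node *mpsc_push_begin(struct mpsc_queue *q, struct node *n)
{
	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	return atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
}

/* Step 2 of a push: link the previous head to n, making n reachable. */
static void mpsc_push_finish(struct node *prev, struct node *n)
{
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Multi producer: a complete push is both steps back to back. */
static void mpsc_push(struct mpsc_queue *q, struct node *n)
{
	mpsc_push_finish(mpsc_push_begin(q, n), n);
}

/* Single consumer: returns NULL if empty or if a push is still in flight. */
static struct node *mpsc_pop(struct mpsc_queue *q)
{
	struct node *tail = q->tail;
	struct node *next = atomic_load_explicit(&tail->next, memory_order_acquire);

	if (tail == &q->stub) {
		if (!next)
			return NULL;
		q->tail = next;
		tail = next;
		next = atomic_load_explicit(&tail->next, memory_order_acquire);
	}
	if (next) {
		q->tail = next;
		return tail;
	}
	/* tail has no successor yet; if it is not the head, a producer is
	 * between its two steps, so give up for now. */
	if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
		return NULL;
	/* tail is the last real node: re-insert the stub behind it. */
	mpsc_push(q, &q->stub);
	next = atomic_load_explicit(&tail->next, memory_order_acquire);
	if (next) {
		q->tail = next;
		return tail;
	}
	return NULL;
}
```

If one producer has run only mpsc_push_begin() while a second producer
has completed mpsc_push(), mpsc_pop() returns NULL even though the
second node is fully queued. Once the first producer runs
mpsc_push_finish(), both nodes come out in order. That NULL is the
"terminate early" case described above.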

Performance-wise, ordinarily list-based queues aren't preferable to
ringbuffers, because of cache misses when following pointers around.
However, we *already* have to follow the adjacent pointers when working
through fragments, so there shouldn't actually be any change there. A
potential downside is that dequeueing is a bit more complicated, but the
ptr_ring structure used prior had a spinlock when dequeueing, so all in
all the difference appears to be a wash.

Actually, from profiling, the biggest performance hit, by far, of this
commit winds up being atomic_add_unless(count, 1, max) and
atomic_dec(count), which account for the majority of CPU time, according to
perf. In that sense, the previous ring buffer was superior in that it
could check if it was full by head==tail, which the list-based approach
cannot do.
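
For readers unfamiliar with the primitive, a bounded increment of this
kind amounts to a compare-and-swap retry loop on a counter that every
producer touches. The following is a userspace model in C11 atomics,
offered as a sketch of the semantics rather than as the kernel's
implementation; model_add_unless is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add a to *v unless *v == u; returns true if the add happened. */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int c = atomic_load_explicit(v, memory_order_relaxed);

	do {
		if (c == u)
			return false;
		/* On failure, c is refreshed with the current value and the
		 * loop retries. */
	} while (!atomic_compare_exchange_weak(v, &c, c + a));
	return true;
}
```

Every successful enqueue performs at least one such read-modify-write
on a shared cache line, and every dequeue performs another, which is
consistent with the counter showing up prominently in profiles.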

Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
Hoping to get some feedback here from people running massive deployments
and running into ram issues, as well as Dmitry on the queueing semantics
(the mpsc queue is his design), before I send this to Dave for merging.
These changes are quite invasive, so I don't want to get anything wrong.

 drivers/net/wireguard/device.c   | 12 ++---
 drivers/net/wireguard/device.h   | 15 +++---
 drivers/net/wireguard/peer.c     | 29 ++++-------
 drivers/net/wireguard/peer.h     |  4 +-
 drivers/net/wireguard/queueing.c | 82 +++++++++++++++++++++++++-------
 drivers/net/wireguard/queueing.h | 45 +++++++++++++-----
 drivers/net/wireguard/receive.c  | 16 +++----
 drivers/net/wireguard/send.c     | 31 +++++-------
 8 files changed, 141 insertions(+), 93 deletions(-)

diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
index cd51a2afa28e..d744199823b3 100644
--- a/drivers/net/wireguard/device.c
+++ b/drivers/net/wireguard/device.c
@@ -234,8 +234,8 @@ static void wg_destruct(struct net_device *dev)
 	destroy_workqueue(wg->handshake_receive_wq);
 	destroy_workqueue(wg->handshake_send_wq);
 	destroy_workqueue(wg->packet_crypt_wq);
-	wg_packet_queue_free(&wg->decrypt_queue, true);
-	wg_packet_queue_free(&wg->encrypt_queue, true);
+	wg_packet_queue_free(&wg->decrypt_queue);
+	wg_packet_queue_free(&wg->encrypt_queue);
 	rcu_barrier(); /* Wait for all the peers to be actually freed. */
 	wg_ratelimiter_uninit();
 	memzero_explicit(&wg->static_identity, sizeof(wg->static_identity));
@@ -337,12 +337,12 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
 		goto err_destroy_handshake_send;
 
 	ret = wg_packet_queue_init(&wg->encrypt_queue, wg_packet_encrypt_worker,
-				   true, MAX_QUEUED_PACKETS);
+				   MAX_QUEUED_PACKETS);
 	if (ret < 0)
 		goto err_destroy_packet_crypt;
 
 	ret = wg_packet_queue_init(&wg->decrypt_queue, wg_packet_decrypt_worker,
-				   true, MAX_QUEUED_PACKETS);
+				   MAX_QUEUED_PACKETS);
 	if (ret < 0)
 		goto err_free_encrypt_queue;
 
@@ -367,9 +367,9 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
 err_uninit_ratelimiter:
 	wg_ratelimiter_uninit();
 err_free_decrypt_queue:
-	wg_packet_queue_free(&wg->decrypt_queue, true);
+	wg_packet_queue_free(&wg->decrypt_queue);
 err_free_encrypt_queue:
-	wg_packet_queue_free(&wg->encrypt_queue, true);
+	wg_packet_queue_free(&wg->encrypt_queue);
 err_destroy_packet_crypt:
 	destroy_workqueue(wg->packet_crypt_wq);
 err_destroy_handshake_send:
diff --git a/drivers/net/wireguard/device.h b/drivers/net/wireguard/device.h
index 4d0144e16947..cb919f2ad1f8 100644
--- a/drivers/net/wireguard/device.h
+++ b/drivers/net/wireguard/device.h
@@ -27,13 +27,14 @@ struct multicore_worker {
 
 struct crypt_queue {
 	struct ptr_ring ring;
-	union {
-		struct {
-			struct multicore_worker __percpu *worker;
-			int last_cpu;
-		};
-		struct work_struct work;
-	};
+	struct multicore_worker __percpu *worker;
+	int last_cpu;
+};
+
+struct prev_queue {
+	struct sk_buff *head, *tail, *peeked;
+	struct { struct sk_buff *next, *prev; } empty;
+	atomic_t count;
 };
 
 struct wg_device {
diff --git a/drivers/net/wireguard/peer.c b/drivers/net/wireguard/peer.c
index b3b6370e6b95..1969fc22d47e 100644
--- a/drivers/net/wireguard/peer.c
+++ b/drivers/net/wireguard/peer.c
@@ -32,27 +32,22 @@ struct wg_peer *wg_peer_create(struct wg_device *wg,
 	peer = kzalloc(sizeof(*peer), GFP_KERNEL);
 	if (unlikely(!peer))
 		return ERR_PTR(ret);
-	peer->device = wg;
+	if (dst_cache_init(&peer->endpoint_cache, GFP_KERNEL))
+		goto err;
 
+	peer->device = wg;
 	wg_noise_handshake_init(&peer->handshake, &wg->static_identity,
 				public_key, preshared_key, peer);
-	if (dst_cache_init(&peer->endpoint_cache, GFP_KERNEL))
-		goto err_1;
-	if (wg_packet_queue_init(&peer->tx_queue, wg_packet_tx_worker, false,
-				 MAX_QUEUED_PACKETS))
-		goto err_2;
-	if (wg_packet_queue_init(&peer->rx_queue, NULL, false,
-				 MAX_QUEUED_PACKETS))
-		goto err_3;
-
 	peer->internal_id = atomic64_inc_return(&peer_counter);
 	peer->serial_work_cpu = nr_cpumask_bits;
 	wg_cookie_init(&peer->latest_cookie);
 	wg_timers_init(peer);
 	wg_cookie_checker_precompute_peer_keys(peer);
 	spin_lock_init(&peer->keypairs.keypair_update_lock);
-	INIT_WORK(&peer->transmit_handshake_work,
-		  wg_packet_handshake_send_worker);
+	INIT_WORK(&peer->transmit_handshake_work, wg_packet_handshake_send_worker);
+	INIT_WORK(&peer->transmit_packet_work, wg_packet_tx_worker);
+	wg_prev_queue_init(&peer->tx_queue);
+	wg_prev_queue_init(&peer->rx_queue);
 	rwlock_init(&peer->endpoint_lock);
 	kref_init(&peer->refcount);
 	skb_queue_head_init(&peer->staged_packet_queue);
@@ -68,11 +63,7 @@ struct wg_peer *wg_peer_create(struct wg_device *wg,
 	pr_debug("%s: Peer %llu created\n", wg->dev->name, peer->internal_id);
 	return peer;
 
-err_3:
-	wg_packet_queue_free(&peer->tx_queue, false);
-err_2:
-	dst_cache_destroy(&peer->endpoint_cache);
-err_1:
+err:
 	kfree(peer);
 	return ERR_PTR(ret);
 }
@@ -197,8 +188,8 @@ static void rcu_release(struct rcu_head *rcu)
 	struct wg_peer *peer = container_of(rcu, struct wg_peer, rcu);
 
 	dst_cache_destroy(&peer->endpoint_cache);
-	wg_packet_queue_free(&peer->rx_queue, false);
-	wg_packet_queue_free(&peer->tx_queue, false);
+	WARN_ON(wg_prev_queue_dequeue(&peer->tx_queue) || peer->tx_queue.peeked);
+	WARN_ON(wg_prev_queue_dequeue(&peer->rx_queue) || peer->rx_queue.peeked);
 
 	/* The final zeroing takes care of clearing any remaining handshake key
 	 * material and other potentially sensitive information.
diff --git a/drivers/net/wireguard/peer.h b/drivers/net/wireguard/peer.h
index aaff8de6e34b..8d53b687a1d1 100644
--- a/drivers/net/wireguard/peer.h
+++ b/drivers/net/wireguard/peer.h
@@ -36,7 +36,7 @@ struct endpoint {
 
 struct wg_peer {
 	struct wg_device *device;
-	struct crypt_queue tx_queue, rx_queue;
+	struct prev_queue tx_queue, rx_queue;
 	struct sk_buff_head staged_packet_queue;
 	int serial_work_cpu;
 	bool is_dead;
@@ -46,7 +46,7 @@ struct wg_peer {
 	rwlock_t endpoint_lock;
 	struct noise_handshake handshake;
 	atomic64_t last_sent_handshake;
-	struct work_struct transmit_handshake_work, clear_peer_work;
+	struct work_struct transmit_handshake_work, clear_peer_work, transmit_packet_work;
 	struct cookie latest_cookie;
 	struct hlist_node pubkey_hash;
 	u64 rx_bytes, tx_bytes;
diff --git a/drivers/net/wireguard/queueing.c b/drivers/net/wireguard/queueing.c
index 71b8e80b58e1..a72380ce97dd 100644
--- a/drivers/net/wireguard/queueing.c
+++ b/drivers/net/wireguard/queueing.c
@@ -9,8 +9,7 @@ struct multicore_worker __percpu *
 wg_packet_percpu_multicore_worker_alloc(work_func_t function, void *ptr)
 {
 	int cpu;
-	struct multicore_worker __percpu *worker =
-		alloc_percpu(struct multicore_worker);
+	struct multicore_worker __percpu *worker = alloc_percpu(struct multicore_worker);
 
 	if (!worker)
 		return NULL;
@@ -23,7 +22,7 @@ wg_packet_percpu_multicore_worker_alloc(work_func_t function, void *ptr)
 }
 
 int wg_packet_queue_init(struct crypt_queue *queue, work_func_t function,
-			 bool multicore, unsigned int len)
+			 unsigned int len)
 {
 	int ret;
 
@@ -31,25 +30,74 @@ int wg_packet_queue_init(struct crypt_queue *queue, work_func_t function,
 	ret = ptr_ring_init(&queue->ring, len, GFP_KERNEL);
 	if (ret)
 		return ret;
-	if (function) {
-		if (multicore) {
-			queue->worker = wg_packet_percpu_multicore_worker_alloc(
-				function, queue);
-			if (!queue->worker) {
-				ptr_ring_cleanup(&queue->ring, NULL);
-				return -ENOMEM;
-			}
-		} else {
-			INIT_WORK(&queue->work, function);
-		}
+	queue->worker = wg_packet_percpu_multicore_worker_alloc(function, queue);
+	if (!queue->worker) {
+		ptr_ring_cleanup(&queue->ring, NULL);
+		return -ENOMEM;
 	}
 	return 0;
 }
 
-void wg_packet_queue_free(struct crypt_queue *queue, bool multicore)
+void wg_packet_queue_free(struct crypt_queue *queue)
 {
-	if (multicore)
-		free_percpu(queue->worker);
+	free_percpu(queue->worker);
 	WARN_ON(!__ptr_ring_empty(&queue->ring));
 	ptr_ring_cleanup(&queue->ring, NULL);
 }
+
+#define NEXT(skb) ((skb)->prev)
+#define STUB(queue) ((struct sk_buff *)&queue->empty)
+
+void wg_prev_queue_init(struct prev_queue *queue)
+{
+	NEXT(STUB(queue)) = NULL;
+	queue->head = queue->tail = STUB(queue);
+	queue->peeked = NULL;
+	atomic_set(&queue->count, 0);
+}
+
+static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
+{
+	WRITE_ONCE(NEXT(skb), NULL);
+	smp_wmb();
+	WRITE_ONCE(NEXT(xchg_relaxed(&queue->head, skb)), skb);
+}
+
+bool wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
+{
+	if (!atomic_add_unless(&queue->count, 1, MAX_QUEUED_PACKETS))
+		return false;
+	__wg_prev_queue_enqueue(queue, skb);
+	return true;
+}
+
+struct sk_buff *wg_prev_queue_dequeue(struct prev_queue *queue)
+{
+	struct sk_buff *tail = queue->tail, *next = smp_load_acquire(&NEXT(tail));
+
+	if (tail == STUB(queue)) {
+		if (!next)
+			return NULL;
+		queue->tail = next;
+		tail = next;
+		next = smp_load_acquire(&NEXT(next));
+	}
+	if (next) {
+		queue->tail = next;
+		atomic_dec(&queue->count);
+		return tail;
+	}
+	if (tail != READ_ONCE(queue->head))
+		return NULL;
+	__wg_prev_queue_enqueue(queue, STUB(queue));
+	next = smp_load_acquire(&NEXT(tail));
+	if (next) {
+		queue->tail = next;
+		atomic_dec(&queue->count);
+		return tail;
+	}
+	return NULL;
+}
+
+#undef NEXT
+#undef STUB
diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h
index dfb674e03076..4ef2944a68bc 100644
--- a/drivers/net/wireguard/queueing.h
+++ b/drivers/net/wireguard/queueing.h
@@ -17,12 +17,13 @@ struct wg_device;
 struct wg_peer;
 struct multicore_worker;
 struct crypt_queue;
+struct prev_queue;
 struct sk_buff;
 
 /* queueing.c APIs: */
 int wg_packet_queue_init(struct crypt_queue *queue, work_func_t function,
-			 bool multicore, unsigned int len);
-void wg_packet_queue_free(struct crypt_queue *queue, bool multicore);
+			 unsigned int len);
+void wg_packet_queue_free(struct crypt_queue *queue);
 struct multicore_worker __percpu *
 wg_packet_percpu_multicore_worker_alloc(work_func_t function, void *ptr);
 
@@ -135,8 +136,31 @@ static inline int wg_cpumask_next_online(int *next)
 	return cpu;
 }
 
+void wg_prev_queue_init(struct prev_queue *queue);
+
+/* Multi producer */
+bool wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb);
+
+/* Single consumer */
+struct sk_buff *wg_prev_queue_dequeue(struct prev_queue *queue);
+
+/* Single consumer */
+static inline struct sk_buff *wg_prev_queue_peek(struct prev_queue *queue)
+{
+	if (queue->peeked)
+		return queue->peeked;
+	queue->peeked = wg_prev_queue_dequeue(queue);
+	return queue->peeked;
+}
+
+/* Single consumer */
+static inline void wg_prev_queue_drop_peeked(struct prev_queue *queue)
+{
+	queue->peeked = NULL;
+}
+
 static inline int wg_queue_enqueue_per_device_and_peer(
-	struct crypt_queue *device_queue, struct crypt_queue *peer_queue,
+	struct crypt_queue *device_queue, struct prev_queue *peer_queue,
 	struct sk_buff *skb, struct workqueue_struct *wq, int *next_cpu)
 {
 	int cpu;
@@ -145,8 +169,9 @@ static inline int wg_queue_enqueue_per_device_and_peer(
 	/* We first queue this up for the peer ingestion, but the consumer
 	 * will wait for the state to change to CRYPTED or DEAD before.
 	 */
-	if (unlikely(ptr_ring_produce_bh(&peer_queue->ring, skb)))
+	if (unlikely(!wg_prev_queue_enqueue(peer_queue, skb)))
 		return -ENOSPC;
+
 	/* Then we queue it up in the device queue, which consumes the
 	 * packet as soon as it can.
 	 */
@@ -157,9 +182,7 @@ static inline int wg_queue_enqueue_per_device_and_peer(
 	return 0;
 }
 
-static inline void wg_queue_enqueue_per_peer(struct crypt_queue *queue,
-					     struct sk_buff *skb,
-					     enum packet_state state)
+static inline void wg_queue_enqueue_per_peer_tx(struct sk_buff *skb, enum packet_state state)
 {
 	/* We take a reference, because as soon as we call atomic_set, the
 	 * peer can be freed from below us.
@@ -167,14 +190,12 @@ static inline void wg_queue_enqueue_per_peer(struct crypt_queue *queue,
 	struct wg_peer *peer = wg_peer_get(PACKET_PEER(skb));
 
 	atomic_set_release(&PACKET_CB(skb)->state, state);
-	queue_work_on(wg_cpumask_choose_online(&peer->serial_work_cpu,
-					       peer->internal_id),
-		      peer->device->packet_crypt_wq, &queue->work);
+	queue_work_on(wg_cpumask_choose_online(&peer->serial_work_cpu, peer->internal_id),
+		      peer->device->packet_crypt_wq, &peer->transmit_packet_work);
 	wg_peer_put(peer);
 }
 
-static inline void wg_queue_enqueue_per_peer_napi(struct sk_buff *skb,
-						  enum packet_state state)
+static inline void wg_queue_enqueue_per_peer_rx(struct sk_buff *skb, enum packet_state state)
 {
 	/* We take a reference, because as soon as we call atomic_set, the
 	 * peer can be freed from below us.
diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
index 2c9551ea6dc7..7dc84bcca261 100644
--- a/drivers/net/wireguard/receive.c
+++ b/drivers/net/wireguard/receive.c
@@ -444,7 +444,6 @@ static void wg_packet_consume_data_done(struct wg_peer *peer,
 int wg_packet_rx_poll(struct napi_struct *napi, int budget)
 {
 	struct wg_peer *peer = container_of(napi, struct wg_peer, napi);
-	struct crypt_queue *queue = &peer->rx_queue;
 	struct noise_keypair *keypair;
 	struct endpoint endpoint;
 	enum packet_state state;
@@ -455,11 +454,10 @@ int wg_packet_rx_poll(struct napi_struct *napi, int budget)
 	if (unlikely(budget <= 0))
 		return 0;
 
-	while ((skb = __ptr_ring_peek(&queue->ring)) != NULL &&
+	while ((skb = wg_prev_queue_peek(&peer->rx_queue)) != NULL &&
 	       (state = atomic_read_acquire(&PACKET_CB(skb)->state)) !=
 		       PACKET_STATE_UNCRYPTED) {
-		__ptr_ring_discard_one(&queue->ring);
-		peer = PACKET_PEER(skb);
+		wg_prev_queue_drop_peeked(&peer->rx_queue);
 		keypair = PACKET_CB(skb)->keypair;
 		free = true;
 
@@ -508,7 +506,7 @@ void wg_packet_decrypt_worker(struct work_struct *work)
 		enum packet_state state =
 			likely(decrypt_packet(skb, PACKET_CB(skb)->keypair)) ?
 				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
-		wg_queue_enqueue_per_peer_napi(skb, state);
+		wg_queue_enqueue_per_peer_rx(skb, state);
 		if (need_resched())
 			cond_resched();
 	}
@@ -531,12 +529,10 @@ static void wg_packet_consume_data(struct wg_device *wg, struct sk_buff *skb)
 	if (unlikely(READ_ONCE(peer->is_dead)))
 		goto err;
 
-	ret = wg_queue_enqueue_per_device_and_peer(&wg->decrypt_queue,
-						   &peer->rx_queue, skb,
-						   wg->packet_crypt_wq,
-						   &wg->decrypt_queue.last_cpu);
+	ret = wg_queue_enqueue_per_device_and_peer(&wg->decrypt_queue, &peer->rx_queue, skb,
+						   wg->packet_crypt_wq, &wg->decrypt_queue.last_cpu);
 	if (unlikely(ret == -EPIPE))
-		wg_queue_enqueue_per_peer_napi(skb, PACKET_STATE_DEAD);
+		wg_queue_enqueue_per_peer_rx(skb, PACKET_STATE_DEAD);
 	if (likely(!ret || ret == -EPIPE)) {
 		rcu_read_unlock_bh();
 		return;
diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
index f74b9341ab0f..5368f7c35b4b 100644
--- a/drivers/net/wireguard/send.c
+++ b/drivers/net/wireguard/send.c
@@ -239,8 +239,7 @@ void wg_packet_send_keepalive(struct wg_peer *peer)
 	wg_packet_send_staged_packets(peer);
 }
 
-static void wg_packet_create_data_done(struct sk_buff *first,
-				       struct wg_peer *peer)
+static void wg_packet_create_data_done(struct wg_peer *peer, struct sk_buff *first)
 {
 	struct sk_buff *skb, *next;
 	bool is_keepalive, data_sent = false;
@@ -262,22 +261,19 @@ static void wg_packet_create_data_done(struct sk_buff *first,
 
 void wg_packet_tx_worker(struct work_struct *work)
 {
-	struct crypt_queue *queue = container_of(work, struct crypt_queue,
-						 work);
+	struct wg_peer *peer = container_of(work, struct wg_peer, transmit_packet_work);
 	struct noise_keypair *keypair;
 	enum packet_state state;
 	struct sk_buff *first;
-	struct wg_peer *peer;
 
-	while ((first = __ptr_ring_peek(&queue->ring)) != NULL &&
+	while ((first = wg_prev_queue_peek(&peer->tx_queue)) != NULL &&
 	       (state = atomic_read_acquire(&PACKET_CB(first)->state)) !=
 		       PACKET_STATE_UNCRYPTED) {
-		__ptr_ring_discard_one(&queue->ring);
-		peer = PACKET_PEER(first);
+		wg_prev_queue_drop_peeked(&peer->tx_queue);
 		keypair = PACKET_CB(first)->keypair;
 
 		if (likely(state == PACKET_STATE_CRYPTED))
-			wg_packet_create_data_done(first, peer);
+			wg_packet_create_data_done(peer, first);
 		else
 			kfree_skb_list(first);
 
@@ -306,16 +302,14 @@ void wg_packet_encrypt_worker(struct work_struct *work)
 				break;
 			}
 		}
-		wg_queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first,
-					  state);
+		wg_queue_enqueue_per_peer_tx(first, state);
 		if (need_resched())
 			cond_resched();
 	}
 }
 
-static void wg_packet_create_data(struct sk_buff *first)
+static void wg_packet_create_data(struct wg_peer *peer, struct sk_buff *first)
 {
-	struct wg_peer *peer = PACKET_PEER(first);
 	struct wg_device *wg = peer->device;
 	int ret = -EINVAL;
 
@@ -323,13 +317,10 @@ static void wg_packet_create_data(struct sk_buff *first)
 	if (unlikely(READ_ONCE(peer->is_dead)))
 		goto err;
 
-	ret = wg_queue_enqueue_per_device_and_peer(&wg->encrypt_queue,
-						   &peer->tx_queue, first,
-						   wg->packet_crypt_wq,
-						   &wg->encrypt_queue.last_cpu);
+	ret = wg_queue_enqueue_per_device_and_peer(&wg->encrypt_queue, &peer->tx_queue, first,
+						   wg->packet_crypt_wq, &wg->encrypt_queue.last_cpu);
 	if (unlikely(ret == -EPIPE))
-		wg_queue_enqueue_per_peer(&peer->tx_queue, first,
-					  PACKET_STATE_DEAD);
+		wg_queue_enqueue_per_peer_tx(first, PACKET_STATE_DEAD);
 err:
 	rcu_read_unlock_bh();
 	if (likely(!ret || ret == -EPIPE))
@@ -393,7 +384,7 @@ void wg_packet_send_staged_packets(struct wg_peer *peer)
 	packets.prev->next = NULL;
 	wg_peer_get(keypair->entry.peer);
 	PACKET_CB(packets.next)->keypair = keypair;
-	wg_packet_create_data(packets.next);
+	wg_packet_create_data(peer, packets.next);
 	return;
 
 out_invalid:
-- 
2.30.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-08 13:38 [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers Jason A. Donenfeld
@ 2021-02-09  8:24 ` Dmitry Vyukov
  2021-02-09 15:44   ` Jason A. Donenfeld
  2021-02-17 18:36 ` Toke Høiland-Jørgensen
  2021-02-18 13:49 ` Björn Töpel
  2 siblings, 1 reply; 12+ messages in thread
From: Dmitry Vyukov @ 2021-02-09  8:24 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

On Mon, Feb 8, 2021 at 2:38 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Having two ring buffers per-peer means that every peer results in two
> massive ring allocations. On an 8-core x86_64 machine, this commit
> reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
> is a 90% reduction. Ninety percent! With some single-machine
> deployments approaching 400,000 peers, we're talking about a reduction
> from 7 gigs of memory down to 700 megs of memory.
>
> In order to get rid of these per-peer allocations, this commit switches
> to using a list-based queueing approach. Currently GSO fragments are
> chained together using the skb->next pointer, so we form the per-peer
> queue around the unused skb->prev pointer, which makes sense because the
> links are pointing backwards. Multiple cores can write into the queue at
> any given time, because its writes occur in the start_xmit path or in
> the udp_recv path. But reads happen in a single workqueue item per-peer,
> amounting to a multi-producer, single-consumer paradigm.
>
> The MPSC queue is implemented locklessly and never blocks. However, it
> is not linearizable (though it is serializable), with a very tight and
> unlikely race on writes, which, when hit (about 0.15% of the time on a
> fully loaded 16-core x86_64 system), causes the queue reader to
> terminate early. However, because every packet sent queues up the same
> workqueue item after it is fully added, the queue resumes again, and
> stopping early isn't actually a problem, since at that point the packet
> wouldn't have yet been added to the encryption queue. These properties
> allow us to avoid disabling interrupts or spinning.

Hi Jason,

Exciting! I reviewed only the queue code itself.

Strictly speaking, the 0.15% figure covers only the case where the newly
added item itself is delayed. This is not a problem: we can just
consider that the push has not finished yet. You can get this with any
queue. The consumer has merely peeked at a producer that has started an
enqueue but has not finished it yet. In a mutex-protected queue,
consumers just don't have the opportunity to peek; they block until the
enqueue has completed. The problem arises only when a partially queued
item blocks subsequent, completely queued items. That should be some
small fraction of the 0.15%.


> Performance-wise, ordinarily list-based queues aren't preferable to
> ringbuffers, because of cache misses when following pointers around.
> However, we *already* have to follow the adjacent pointers when working
> through fragments, so there shouldn't actually be any change there. A
> potential downside is that dequeueing is a bit more complicated, but the
> ptr_ring structure used prior had a spinlock when dequeueing, so all in
> all the difference appears to be a wash.
>
> Actually, from profiling, the biggest performance hit, by far, of this
> commit winds up being atomic_add_unless(count, 1, max) and
> atomic_dec(count), which account for the majority of CPU time, according to
> perf. In that sense, the previous ring buffer was superior in that it
> could check if it was full by head==tail, which the list-based approach
> cannot do.

We could try to cheat a bit here.
We could split the counter into:

atomic_t enqueued;
unsigned dequeued;

then, consumer will do just dequeued++.
Producers can do (depending on how precise you want them to be):

if ((int)(atomic_read(&enqueued) - dequeued) >= MAX)
    return false;
atomic_inc(&enqueued);

or, for more precise counting we could do a CAS loop on enqueued.
Since any modifications to dequeued can only lead to reduction of
size, we don't need to double check it before CAS, thus the CAS loop
should provide a precise upper bound on size.
Or, we could check, opportunistically increment, and then decrement if
overflow, but that looks the least favorable option.
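
The CAS-loop variant of the split counter could look roughly like the
following userspace sketch in C11 atomics. The names split_count,
count_init, count_try_inc and count_dec are hypothetical, and dequeued
is modeled as a relaxed atomic here only so that the cross-thread read
is well defined in C11. A stale read of dequeued can only make the
check stricter, so it is read once before the loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct split_count {
	atomic_uint enqueued;	/* bumped by producers via CAS */
	atomic_uint dequeued;	/* written only by the single consumer */
};

static void count_init(struct split_count *c)
{
	atomic_init(&c->enqueued, 0);
	atomic_init(&c->dequeued, 0);
}

/* Multi producer: reserve one slot unless max are already in flight. */
static bool count_try_inc(struct split_count *c, unsigned int max)
{
	unsigned int deq = atomic_load_explicit(&c->dequeued, memory_order_relaxed);
	unsigned int enq = atomic_load_explicit(&c->enqueued, memory_order_relaxed);

	for (;;) {
		/* Unsigned subtraction keeps this correct across wraparound. */
		if ((int)(enq - deq) >= (int)max)
			return false;
		/* On failure, enq is refreshed and the bound is rechecked. */
		if (atomic_compare_exchange_weak_explicit(&c->enqueued, &enq, enq + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;
	}
}

/* Single consumer: a plain load and store, no read-modify-write. */
static void count_dec(struct split_count *c)
{
	unsigned int deq = atomic_load_explicit(&c->dequeued, memory_order_relaxed);

	atomic_store_explicit(&c->dequeued, deq + 1, memory_order_relaxed);
}
```

The design point is that the consumer no longer issues an atomic
read-modify-write per packet; only producers contend on enqueued.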


> Cc: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

The queue logic looks correct to me.
I did not spot any significant algorithmic differences with my algorithm:
https://groups.google.com/g/lock-free/c/Vd9xuHrLggE/m/B9-URa3B37MJ

Reviewed-by: Dmitry Vyukov <dvyukov@google.com>

> ---
> Hoping to get some feedback here from people running massive deployments
> and running into ram issues, as well as Dmitry on the queueing semantics
> (the mpsc queue is his design), before I send this to Dave for merging.
> These changes are quite invasive, so I don't want to get anything wrong.



> +struct prev_queue {
> +       struct sk_buff *head, *tail, *peeked;
> +       struct { struct sk_buff *next, *prev; } empty;
> +       atomic_t count;
>  };


This would benefit from a comment explaining that empty needs to match
sk_buff up to prev (and a corresponding BUILD_BUG_ON checking that the
offset of prev matches in empty and sk_buff), and why we use prev
instead of next (I don't know).
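
A compile-time layout check of this kind can be sketched in userspace
with C11 static_assert and offsetof. The types below are stand-ins:
fake_skb and fake_prev_queue are hypothetical and mirror only the
leading link fields, so this shows the shape of the check rather than
the actual kernel assertion.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct sk_buff: only the leading link fields matter. */
struct fake_skb {
	struct fake_skb *next, *prev;
	unsigned int len;
};

struct fake_prev_queue {
	struct fake_skb *head, *tail, *peeked;
	struct { struct fake_skb *next, *prev; } empty;
};

/* Fail the build if the stub's link fields drift away from the layout
 * of the real structure it is cast to. */
static_assert(offsetof(struct fake_skb, next) ==
	      offsetof(struct fake_prev_queue, empty.next) -
	      offsetof(struct fake_prev_queue, empty),
	      "empty.next must sit where fake_skb.next does");
static_assert(offsetof(struct fake_skb, prev) ==
	      offsetof(struct fake_prev_queue, empty.prev) -
	      offsetof(struct fake_prev_queue, empty),
	      "empty.prev must sit where fake_skb.prev does");

/* The stub is the embedded pair of pointers viewed as a fake_skb. */
static struct fake_skb *queue_stub(struct fake_prev_queue *q)
{
	return (struct fake_skb *)&q->empty;
}
```

With the assertions in place, taking the address of prev through the
stub pointer is guaranteed to land on empty.prev.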


> +#define NEXT(skb) ((skb)->prev)
> +#define STUB(queue) ((struct sk_buff *)&queue->empty)
> +
> +void wg_prev_queue_init(struct prev_queue *queue)
> +{
> +       NEXT(STUB(queue)) = NULL;
> +       queue->head = queue->tail = STUB(queue);
> +       queue->peeked = NULL;
> +       atomic_set(&queue->count, 0);
> +}
> +
> +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> +{
> +       WRITE_ONCE(NEXT(skb), NULL);
> +       smp_wmb();
> +       WRITE_ONCE(NEXT(xchg_relaxed(&queue->head, skb)), skb);
> +}
> +
> +bool wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> +{
> +       if (!atomic_add_unless(&queue->count, 1, MAX_QUEUED_PACKETS))
> +               return false;
> +       __wg_prev_queue_enqueue(queue, skb);
> +       return true;
> +}
> +
> +struct sk_buff *wg_prev_queue_dequeue(struct prev_queue *queue)
> +{
> +       struct sk_buff *tail = queue->tail, *next = smp_load_acquire(&NEXT(tail));
> +
> +       if (tail == STUB(queue)) {
> +               if (!next)
> +                       return NULL;
> +               queue->tail = next;
> +               tail = next;
> +               next = smp_load_acquire(&NEXT(next));
> +       }
> +       if (next) {
> +               queue->tail = next;
> +               atomic_dec(&queue->count);
> +               return tail;
> +       }
> +       if (tail != READ_ONCE(queue->head))
> +               return NULL;
> +       __wg_prev_queue_enqueue(queue, STUB(queue));
> +       next = smp_load_acquire(&NEXT(tail));
> +       if (next) {
> +               queue->tail = next;
> +               atomic_dec(&queue->count);
> +               return tail;
> +       }
> +       return NULL;
> +}
> +
> +#undef NEXT
> +#undef STUB


> +void wg_prev_queue_init(struct prev_queue *queue);
> +
> +/* Multi producer */
> +bool wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb);
> +
> +/* Single consumer */
> +struct sk_buff *wg_prev_queue_dequeue(struct prev_queue *queue);
> +
> +/* Single consumer */
> +static inline struct sk_buff *wg_prev_queue_peek(struct prev_queue *queue)
> +{
> +       if (queue->peeked)
> +               return queue->peeked;
> +       queue->peeked = wg_prev_queue_dequeue(queue);
> +       return queue->peeked;
> +}
> +
> +/* Single consumer */
> +static inline void wg_prev_queue_drop_peeked(struct prev_queue *queue)
> +{
> +       queue->peeked = NULL;
> +}


> @@ -197,8 +188,8 @@ static void rcu_release(struct rcu_head *rcu)
>         struct wg_peer *peer = container_of(rcu, struct wg_peer, rcu);
>
>         dst_cache_destroy(&peer->endpoint_cache);
> -       wg_packet_queue_free(&peer->rx_queue, false);
> -       wg_packet_queue_free(&peer->tx_queue, false);
> +       WARN_ON(wg_prev_queue_dequeue(&peer->tx_queue) || peer->tx_queue.peeked);
> +       WARN_ON(wg_prev_queue_dequeue(&peer->rx_queue) || peer->rx_queue.peeked);

This could use just wg_prev_queue_peek.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-09  8:24 ` Dmitry Vyukov
@ 2021-02-09 15:44   ` Jason A. Donenfeld
  2021-02-09 16:20     ` Dmitry Vyukov
  0 siblings, 1 reply; 12+ messages in thread
From: Jason A. Donenfeld @ 2021-02-09 15:44 UTC (permalink / raw)
  To: Dmitry Vyukov; +Cc: WireGuard mailing list

Hi Dmitry,

Thanks for the review.

On Tue, Feb 9, 2021 at 9:24 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> Strictly speaking, the 0.15% figure covers only the case where the newly
> added item itself is delayed. This is not a problem: we can just
> consider that the push has not finished yet. You can get this with any
> queue. The consumer has merely peeked at a producer that has started an
> enqueue but has not finished it yet. In a mutex-protected queue,
> consumers just don't have the opportunity to peek; they block until the
> enqueue has completed. The problem arises only when a partially queued
> item blocks subsequent, completely queued items. That should be some
> small fraction of the 0.15%.

Ah right. I'll make that clear in the commit message.

> We could try to cheat a bit here.
> We could split the counter into:
>
> atomic_t enqueued;
> unsigned dequeued;
>
> then, consumer will do just dequeued++.
> Producers can do (depending on how precise you want them to be):
>
> if ((int)(atomic_read(&enqueued) - dequeued) >= MAX)
>     return false;
> atomic_inc(&enqueued);
>
> or, for more precise counting we could do a CAS loop on enqueued.

I guess the CAS case would look like
`if (!atomic_add_unless(&enqueued, 1, MAX + dequeued))` or similar,
though `>=` might be safer than `==`, so writing out the loop manually
wouldn't be a bad idea.

But... I would probably need smp_load/smp_store helpers around
dequeued, right? Unless we argue some degree of coarseness doesn't
matter.

> Or, we could check, opportunistically increment, and then decrement if
> overflow, but that looks the least favorable option.

I had originally done something like that, but I didn't like the idea
of it being able to grow beyond the limit by the number of CPU cores.

The other option, of course, is to just do nothing, and keep the
atomic as-is. There's already ~high overhead from kref_get, so I could
always revisit this after I move from kref.h over to
percpu-refcount.h.

>
> > +struct prev_queue {
> > +       struct sk_buff *head, *tail, *peeked;
> > +       struct { struct sk_buff *next, *prev; } empty;
> > +       atomic_t count;
> >  };
>
>
> This would benefit from a comment explaining that empty needs to match
> sk_buff up to prev (and a corresponding BUILD_BUG_ON checking that the
> offset of prev matches in empty and sk_buff), and why we use prev
> instead of next (I don't know).

That's a good idea. Will do.
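
A rough userspace sketch of the layout check suggested above, for readers following along. The names fake_skb, fake_prev_queue and fake_stub are hypothetical stand-ins, not the kernel's struct sk_buff or the patch's struct prev_queue, and the payload member is only a placeholder. In kernel code such a check would presumably be written with BUILD_BUG_ON rather than static_assert.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the leading members of struct sk_buff. */
struct fake_skb {
	struct fake_skb *next;	/* taken: chains GSO fragments together */
	struct fake_skb *prev;	/* free: used here as the queue link */
	char payload[64];	/* placeholder for the rest of the struct */
};

/* Hypothetical stand-in for struct prev_queue (counter omitted). */
struct fake_prev_queue {
	struct fake_skb *head, *tail, *peeked;
	struct { struct fake_skb *next, *prev; } empty;	/* stub node */
};

/*
 * The stub is accessed through a fake_skb pointer, so its two link
 * fields must sit at the same offsets as in the real node type.
 */
static_assert(offsetof(struct fake_prev_queue, empty.next) -
	      offsetof(struct fake_prev_queue, empty) ==
	      offsetof(struct fake_skb, next),
	      "stub next must overlay skb next");
static_assert(offsetof(struct fake_prev_queue, empty.prev) -
	      offsetof(struct fake_prev_queue, empty) ==
	      offsetof(struct fake_skb, prev),
	      "stub prev must overlay skb prev");

/* Mirrors the idea of the STUB() macro: view the stub as a node. */
struct fake_skb *fake_stub(struct fake_prev_queue *queue)
{
	return (struct fake_skb *)&queue->empty;
}
```

If either assertion failed, the build would stop, which is the point: the cast in fake_stub() is only safe while the two layouts agree up to and including prev.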


> > @@ -197,8 +188,8 @@ static void rcu_release(struct rcu_head *rcu)
> >         struct wg_peer *peer = container_of(rcu, struct wg_peer, rcu);
> >
> >         dst_cache_destroy(&peer->endpoint_cache);
> > -       wg_packet_queue_free(&peer->rx_queue, false);
> > -       wg_packet_queue_free(&peer->tx_queue, false);
> > +       WARN_ON(wg_prev_queue_dequeue(&peer->tx_queue) || peer->tx_queue.peeked);
> > +       WARN_ON(wg_prev_queue_dequeue(&peer->rx_queue) || peer->rx_queue.peeked);
>
> This could use just wg_prev_queue_peek.

Nice catch, thanks.

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-09 15:44   ` Jason A. Donenfeld
@ 2021-02-09 16:20     ` Dmitry Vyukov
  0 siblings, 0 replies; 12+ messages in thread
From: Dmitry Vyukov @ 2021-02-09 16:20 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

On Tue, Feb 9, 2021 at 4:44 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Dmitry,
>
> Thanks for the review.
>
> On Tue, Feb 9, 2021 at 9:24 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > Strictly speaking, 0.15% is for delaying the newly added item only. This
> > is not a problem, we can just consider that push has not finished yet
> > in this case. You can get this with any queue. It's just that consumer
> > has peeked on producer that it started enqueue but has not finished
> > yet. In a mutex-protected queue consumers just don't have the
> > opportunity to peek, they just block until enqueue has completed.
> > The problem is only when a partially queued item blocks subsequent
> > completely queued items. That should be some small fraction of 0.15%.
>
> Ah right. I'll make that clear in the commit message.
>
> > We could try to cheat a bit here.
> > We could split the counter into:
> >
> > atomic_t enqueued;
> > unsigned dequeued;
> >
> > then, consumer will do just dequeued++.
> > Producers can do (depending on how precise you want them to be):
> >
> > if ((int)(atomic_read(&enqueued) - dequeued) >= MAX)
> >     return false;
> > atomic_add(&enqueued, 1);
> >
> > or, for more precise counting we could do a CAS loop on enqueued.
>
> I guess the CAS case would look like `if
> (!atomic_add_unless(&enqueued, 1, MAX + dequeued))` or similar, though
> >= might be safer than ==, so writing out the loop manually wouldn't
> be a bad idea.

What I had in mind is:

int e = READ_ONCE(q->enqueued);
for (;;) {
  int d = READ_ONCE(q->dequeued);
  if (e - d >= MAX)
    return false;
  int x = CAS(&q->enqueued, e, e+1);
  if (x == e)
    break;
  e = x;
}
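
For reference, the loop above might look roughly like this with C11 atomics in userspace. This is only a sketch under assumptions: struct split_counter, counter_init, counter_try_enqueue, counter_dequeue and the small MAX_QUEUED value are hypothetical names chosen for illustration, relaxed atomics stand in for READ_ONCE, and unsigned arithmetic is used so that the difference stays well defined when the counters wrap.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_QUEUED 4	/* hypothetical limit, kept small for illustration */

struct split_counter {
	atomic_uint enqueued;	/* incremented by any producer via CAS */
	atomic_uint dequeued;	/* written only by the single consumer */
};

void counter_init(struct split_counter *c)
{
	atomic_init(&c->enqueued, 0);
	atomic_init(&c->dequeued, 0);
}

/* Producer side: reserve a slot, or report that the queue is full. */
bool counter_try_enqueue(struct split_counter *c)
{
	unsigned int e = atomic_load_explicit(&c->enqueued, memory_order_relaxed);

	for (;;) {
		unsigned int d = atomic_load_explicit(&c->dequeued,
						      memory_order_relaxed);

		if ((int)(e - d) >= MAX_QUEUED)
			return false;
		/* On failure, e is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(&c->enqueued, &e, e + 1,
							  memory_order_relaxed,
							  memory_order_relaxed))
			return true;
	}
}

/* Consumer side: single writer, so no read-modify-write is needed. */
void counter_dequeue(struct split_counter *c)
{
	unsigned int d = atomic_load_explicit(&c->dequeued, memory_order_relaxed);

	atomic_store_explicit(&c->dequeued, d + 1, memory_order_relaxed);
}
```

A producer may read a stale dequeued value and refuse an enqueue that would just have fit, but it can never exceed the limit, since dequeued only grows.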

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-08 13:38 [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers Jason A. Donenfeld
  2021-02-09  8:24 ` Dmitry Vyukov
@ 2021-02-17 18:36 ` Toke Høiland-Jørgensen
  2021-02-17 22:28   ` Jason A. Donenfeld
  2021-02-18 13:49 ` Björn Töpel
  2 siblings, 1 reply; 12+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-02-17 18:36 UTC (permalink / raw)
  To: Jason A. Donenfeld, wireguard; +Cc: Jason A. Donenfeld, Dmitry Vyukov

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Having two ring buffers per-peer means that every peer results in two
> massive ring allocations. On an 8-core x86_64 machine, this commit
> reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
> is a 90% reduction. Ninety percent! With some single-machine
> deployments approaching 400,000 peers, we're talking about a reduction
> from 7 gigs of memory down to 700 megs of memory.
>
> In order to get rid of these per-peer allocations, this commit switches
> to using a list-based queueing approach. Currently GSO fragments are
> chained together using the skb->next pointer, so we form the per-peer
> queue around the unused skb->prev pointer, which makes sense because the
> links are pointing backwards.

"which makes sense because the links are pointing backwards" - huh?

> Multiple cores can write into the queue at any given time, because its
> writes occur in the start_xmit path or in the udp_recv path. But reads
> happen in a single workqueue item per-peer, amounting to a
> multi-producer, single-consumer paradigm.
>
> The MPSC queue is implemented locklessly and never blocks. However, it
> is not linearizable (though it is serializable), with a very tight and
> unlikely race on writes, which, when hit (about 0.15% of the time on a
> fully loaded 16-core x86_64 system), causes the queue reader to
> terminate early. However, because every packet sent queues up the same
> workqueue item after it is fully added, the queue resumes again, and
> stopping early isn't actually a problem, since at that point the packet
> wouldn't have yet been added to the encryption queue. These properties
> allow us to avoid disabling interrupts or spinning.

Wow, so this was a fascinating rabbit hole into the concurrent algorithm
realm, thanks to Dmitry's link to his original posting of the algorithm.
Maybe referencing the origin of the algorithm would be nice for context
and posterity (as well as commenting it so the original properties are
not lost if the source should disappear)?

> Performance-wise, ordinarily list-based queues aren't preferable to
> ringbuffers, because of cache misses when following pointers around.
> However, we *already* have to follow the adjacent pointers when working
> through fragments, so there shouldn't actually be any change there. A
> potential downside is that dequeueing is a bit more complicated, but the
> ptr_ring structure used prior had a spinlock when dequeueing, so all in
> all the difference appears to be a wash.
>
> Actually, from profiling, the biggest performance hit, by far, of this
> commit winds up being atomic_add_unless(count, 1, max) and
> atomic_dec(count), which account for the majority of CPU time, according to
> perf. In that sense, the previous ring buffer was superior in that it
> could check if it was full by head==tail, which the list-based approach
> cannot do.

Are these performance measurements based on micro-benchmarks of the
queueing structure, or overall wireguard performance? Do you see any
measurable difference in the overall performance (i.e., throughput
drop)? And what about relative to using one of the existing skb queueing
primitives in the kernel? Including some actual numbers would be nice to
justify adding yet-another skb queueing scheme to the kernel :)

I say this also because the actual queueing of the packets has never
really shown up on any performance radar in the qdisc and mac80211
layers, which both use traditional spinlock-protected queueing
structures. Now Wireguard does have a somewhat unusual structure with
the MPSC pattern, so it may of course be different here. But quantifying
that would be good; also for figuring out if this algorithm might be
useful in other areas as well (and don't get me wrong, I'm fascinated by
it!).

> Cc: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Hoping to get some feedback here from people running massive deployments
> and running into ram issues, as well as Dmitry on the queueing semantics
> (the mpsc queue is his design), before I send this to Dave for merging.
> These changes are quite invasive, so I don't want to get anything wrong.
>
>  drivers/net/wireguard/device.c   | 12 ++---
>  drivers/net/wireguard/device.h   | 15 +++---
>  drivers/net/wireguard/peer.c     | 29 ++++-------
>  drivers/net/wireguard/peer.h     |  4 +-
>  drivers/net/wireguard/queueing.c | 82 +++++++++++++++++++++++++-------
>  drivers/net/wireguard/queueing.h | 45 +++++++++++++-----
>  drivers/net/wireguard/receive.c  | 16 +++----
>  drivers/net/wireguard/send.c     | 31 +++++-------
>  8 files changed, 141 insertions(+), 93 deletions(-)
>
> diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
> index cd51a2afa28e..d744199823b3 100644
> --- a/drivers/net/wireguard/device.c
> +++ b/drivers/net/wireguard/device.c
> @@ -234,8 +234,8 @@ static void wg_destruct(struct net_device *dev)
>  	destroy_workqueue(wg->handshake_receive_wq);
>  	destroy_workqueue(wg->handshake_send_wq);
>  	destroy_workqueue(wg->packet_crypt_wq);
> -	wg_packet_queue_free(&wg->decrypt_queue, true);
> -	wg_packet_queue_free(&wg->encrypt_queue, true);
> +	wg_packet_queue_free(&wg->decrypt_queue);
> +	wg_packet_queue_free(&wg->encrypt_queue);
>  	rcu_barrier(); /* Wait for all the peers to be actually freed. */
>  	wg_ratelimiter_uninit();
>  	memzero_explicit(&wg->static_identity, sizeof(wg->static_identity));
> @@ -337,12 +337,12 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
>  		goto err_destroy_handshake_send;
>  
>  	ret = wg_packet_queue_init(&wg->encrypt_queue, wg_packet_encrypt_worker,
> -				   true, MAX_QUEUED_PACKETS);
> +				   MAX_QUEUED_PACKETS);
>  	if (ret < 0)
>  		goto err_destroy_packet_crypt;
>  
>  	ret = wg_packet_queue_init(&wg->decrypt_queue, wg_packet_decrypt_worker,
> -				   true, MAX_QUEUED_PACKETS);
> +				   MAX_QUEUED_PACKETS);
>  	if (ret < 0)
>  		goto err_free_encrypt_queue;
>  
> @@ -367,9 +367,9 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
>  err_uninit_ratelimiter:
>  	wg_ratelimiter_uninit();
>  err_free_decrypt_queue:
> -	wg_packet_queue_free(&wg->decrypt_queue, true);
> +	wg_packet_queue_free(&wg->decrypt_queue);
>  err_free_encrypt_queue:
> -	wg_packet_queue_free(&wg->encrypt_queue, true);
> +	wg_packet_queue_free(&wg->encrypt_queue);
>  err_destroy_packet_crypt:
>  	destroy_workqueue(wg->packet_crypt_wq);
>  err_destroy_handshake_send:
> diff --git a/drivers/net/wireguard/device.h b/drivers/net/wireguard/device.h
> index 4d0144e16947..cb919f2ad1f8 100644
> --- a/drivers/net/wireguard/device.h
> +++ b/drivers/net/wireguard/device.h
> @@ -27,13 +27,14 @@ struct multicore_worker {
>  
>  struct crypt_queue {
>  	struct ptr_ring ring;
> -	union {
> -		struct {
> -			struct multicore_worker __percpu *worker;
> -			int last_cpu;
> -		};
> -		struct work_struct work;
> -	};
> +	struct multicore_worker __percpu *worker;
> +	int last_cpu;
> +};
> +
> +struct prev_queue {
> +	struct sk_buff *head, *tail, *peeked;
> +	struct { struct sk_buff *next, *prev; } empty;
> +	atomic_t count;
>  };
>  
>  struct wg_device {
> diff --git a/drivers/net/wireguard/peer.c b/drivers/net/wireguard/peer.c
> index b3b6370e6b95..1969fc22d47e 100644
> --- a/drivers/net/wireguard/peer.c
> +++ b/drivers/net/wireguard/peer.c
> @@ -32,27 +32,22 @@ struct wg_peer *wg_peer_create(struct wg_device *wg,
>  	peer = kzalloc(sizeof(*peer), GFP_KERNEL);
>  	if (unlikely(!peer))
>  		return ERR_PTR(ret);
> -	peer->device = wg;
> +	if (dst_cache_init(&peer->endpoint_cache, GFP_KERNEL))
> +		goto err;
>  
> +	peer->device = wg;
>  	wg_noise_handshake_init(&peer->handshake, &wg->static_identity,
>  				public_key, preshared_key, peer);
> -	if (dst_cache_init(&peer->endpoint_cache, GFP_KERNEL))
> -		goto err_1;
> -	if (wg_packet_queue_init(&peer->tx_queue, wg_packet_tx_worker, false,
> -				 MAX_QUEUED_PACKETS))
> -		goto err_2;
> -	if (wg_packet_queue_init(&peer->rx_queue, NULL, false,
> -				 MAX_QUEUED_PACKETS))
> -		goto err_3;
> -
>  	peer->internal_id = atomic64_inc_return(&peer_counter);
>  	peer->serial_work_cpu = nr_cpumask_bits;
>  	wg_cookie_init(&peer->latest_cookie);
>  	wg_timers_init(peer);
>  	wg_cookie_checker_precompute_peer_keys(peer);
>  	spin_lock_init(&peer->keypairs.keypair_update_lock);
> -	INIT_WORK(&peer->transmit_handshake_work,
> -		  wg_packet_handshake_send_worker);
> +	INIT_WORK(&peer->transmit_handshake_work, wg_packet_handshake_send_worker);
> +	INIT_WORK(&peer->transmit_packet_work, wg_packet_tx_worker);

It's not quite clear to me why changing the queue primitives requires
adding another work queue?

> +	wg_prev_queue_init(&peer->tx_queue);
> +	wg_prev_queue_init(&peer->rx_queue);
>  	rwlock_init(&peer->endpoint_lock);
>  	kref_init(&peer->refcount);
>  	skb_queue_head_init(&peer->staged_packet_queue);
> @@ -68,11 +63,7 @@ struct wg_peer *wg_peer_create(struct wg_device *wg,
>  	pr_debug("%s: Peer %llu created\n", wg->dev->name, peer->internal_id);
>  	return peer;
>  
> -err_3:
> -	wg_packet_queue_free(&peer->tx_queue, false);
> -err_2:
> -	dst_cache_destroy(&peer->endpoint_cache);
> -err_1:
> +err:
>  	kfree(peer);
>  	return ERR_PTR(ret);
>  }
> @@ -197,8 +188,8 @@ static void rcu_release(struct rcu_head *rcu)
>  	struct wg_peer *peer = container_of(rcu, struct wg_peer, rcu);
>  
>  	dst_cache_destroy(&peer->endpoint_cache);
> -	wg_packet_queue_free(&peer->rx_queue, false);
> -	wg_packet_queue_free(&peer->tx_queue, false);
> +	WARN_ON(wg_prev_queue_dequeue(&peer->tx_queue) || peer->tx_queue.peeked);
> +	WARN_ON(wg_prev_queue_dequeue(&peer->rx_queue) || peer->rx_queue.peeked);
>  
>  	/* The final zeroing takes care of clearing any remaining handshake key
>  	 * material and other potentially sensitive information.
> diff --git a/drivers/net/wireguard/peer.h b/drivers/net/wireguard/peer.h
> index aaff8de6e34b..8d53b687a1d1 100644
> --- a/drivers/net/wireguard/peer.h
> +++ b/drivers/net/wireguard/peer.h
> @@ -36,7 +36,7 @@ struct endpoint {
>  
>  struct wg_peer {
>  	struct wg_device *device;
> -	struct crypt_queue tx_queue, rx_queue;
> +	struct prev_queue tx_queue, rx_queue;
>  	struct sk_buff_head staged_packet_queue;
>  	int serial_work_cpu;
>  	bool is_dead;
> @@ -46,7 +46,7 @@ struct wg_peer {
>  	rwlock_t endpoint_lock;
>  	struct noise_handshake handshake;
>  	atomic64_t last_sent_handshake;
> -	struct work_struct transmit_handshake_work, clear_peer_work;
> +	struct work_struct transmit_handshake_work, clear_peer_work, transmit_packet_work;
>  	struct cookie latest_cookie;
>  	struct hlist_node pubkey_hash;
>  	u64 rx_bytes, tx_bytes;
> diff --git a/drivers/net/wireguard/queueing.c b/drivers/net/wireguard/queueing.c
> index 71b8e80b58e1..a72380ce97dd 100644
> --- a/drivers/net/wireguard/queueing.c
> +++ b/drivers/net/wireguard/queueing.c
> @@ -9,8 +9,7 @@ struct multicore_worker __percpu *
>  wg_packet_percpu_multicore_worker_alloc(work_func_t function, void *ptr)
>  {
>  	int cpu;
> -	struct multicore_worker __percpu *worker =
> -		alloc_percpu(struct multicore_worker);
> +	struct multicore_worker __percpu *worker = alloc_percpu(struct multicore_worker);
>  
>  	if (!worker)
>  		return NULL;
> @@ -23,7 +22,7 @@ wg_packet_percpu_multicore_worker_alloc(work_func_t function, void *ptr)
>  }
>  
>  int wg_packet_queue_init(struct crypt_queue *queue, work_func_t function,
> -			 bool multicore, unsigned int len)
> +			 unsigned int len)
>  {
>  	int ret;
>  
> @@ -31,25 +30,74 @@ int wg_packet_queue_init(struct crypt_queue *queue, work_func_t function,
>  	ret = ptr_ring_init(&queue->ring, len, GFP_KERNEL);
>  	if (ret)
>  		return ret;
> -	if (function) {
> -		if (multicore) {
> -			queue->worker = wg_packet_percpu_multicore_worker_alloc(
> -				function, queue);
> -			if (!queue->worker) {
> -				ptr_ring_cleanup(&queue->ring, NULL);
> -				return -ENOMEM;
> -			}
> -		} else {
> -			INIT_WORK(&queue->work, function);
> -		}
> +	queue->worker = wg_packet_percpu_multicore_worker_alloc(function, queue);
> +	if (!queue->worker) {
> +		ptr_ring_cleanup(&queue->ring, NULL);
> +		return -ENOMEM;
>  	}
>  	return 0;
>  }
>  
> -void wg_packet_queue_free(struct crypt_queue *queue, bool multicore)
> +void wg_packet_queue_free(struct crypt_queue *queue)
>  {
> -	if (multicore)
> -		free_percpu(queue->worker);
> +	free_percpu(queue->worker);
>  	WARN_ON(!__ptr_ring_empty(&queue->ring));
>  	ptr_ring_cleanup(&queue->ring, NULL);
>  }
> +

It would be nice to add a comment block here explaining the algorithm,
with a link to the original implementation and the same reasoning as you
have in the commit message. And some of the explanation from the
original thread would be nice. But if you do copy that, please for the
love of $DEITY, expand the acronyms - it took me half an hour of
extremely frustrating Googling to figure out what PDR means! :D

(I finally found out that it means "Partial copy-on-write Deferred
Reclamation" in another of Dmitry's replies here:
https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Non-blocking-data-structures-vs-garbage-collection/td-p/847215 )

> +#define NEXT(skb) ((skb)->prev)

In particular, please explain this oxymoronic define :)

> +#define STUB(queue) ((struct sk_buff *)&queue->empty)
> +
> +void wg_prev_queue_init(struct prev_queue *queue)
> +{
> +	NEXT(STUB(queue)) = NULL;
> +	queue->head = queue->tail = STUB(queue);
> +	queue->peeked = NULL;
> +	atomic_set(&queue->count, 0);
> +}
> +
> +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> +{
> +	WRITE_ONCE(NEXT(skb), NULL);
> +	smp_wmb();
> +	WRITE_ONCE(NEXT(xchg_relaxed(&queue->head, skb)), skb);

While this is nice and compact it's also really hard to read. It's also
hiding the "race condition" between the xchg() and setting the next ptr.
So why not split it between two lines and make the race explicit with a
comment?
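
To illustrate what such a split might look like, here is a userspace sketch of the same queue shape using C11 atomics. It is not the patch's code: struct node, struct mpsc_queue and the mpsc_* functions are hypothetical names, a plain next field replaces skb->prev, the count limit is left out, and the memory orderings are a conservative guess rather than the kernel's exact barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
	_Atomic(struct node *) next;
	int value;
};

struct mpsc_queue {
	_Atomic(struct node *) head;	/* producers swap themselves in here */
	struct node *tail;		/* touched only by the single consumer */
	struct node stub;		/* placeholder node for the empty queue */
};

void mpsc_init(struct mpsc_queue *q)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->head, &q->stub);
	q->tail = &q->stub;
}

void mpsc_enqueue(struct mpsc_queue *q, struct node *n)
{
	struct node *prev;

	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	/* Step 1: become the new head. */
	prev = atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
	/*
	 * Race window: n is the head but is not yet reachable from the
	 * tail. A consumer arriving now sees prev->next == NULL and stops.
	 */
	/* Step 2: link the old head to n, making n visible to the reader. */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

struct node *mpsc_dequeue(struct mpsc_queue *q)
{
	struct node *tail = q->tail;
	struct node *next = atomic_load_explicit(&tail->next, memory_order_acquire);

	if (tail == &q->stub) {		/* skip over the stub */
		if (!next)
			return NULL;
		q->tail = next;
		tail = next;
		next = atomic_load_explicit(&tail->next, memory_order_acquire);
	}
	if (next) {
		q->tail = next;
		return tail;
	}
	if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
		return NULL;		/* a producer is between steps 1 and 2 */
	mpsc_enqueue(q, &q->stub);	/* re-add the stub so tail can be released */
	next = atomic_load_explicit(&tail->next, memory_order_acquire);
	if (next) {
		q->tail = next;
		return tail;
	}
	return NULL;
}
```

The early NULL return in the middle of mpsc_dequeue() is the "terminate early" case from the commit message: the item is not lost, the reader simply has to be run again once the producer finishes step 2.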

> +}
> +
> +bool wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> +{
> +	if (!atomic_add_unless(&queue->count, 1, MAX_QUEUED_PACKETS))
> +		return false;
> +	__wg_prev_queue_enqueue(queue, skb);
> +	return true;
> +}
> +
> +struct sk_buff *wg_prev_queue_dequeue(struct prev_queue *queue)
> +{
> +	struct sk_buff *tail = queue->tail, *next = smp_load_acquire(&NEXT(tail));
> +
> +	if (tail == STUB(queue)) {
> +		if (!next)
> +			return NULL;
> +		queue->tail = next;
> +		tail = next;
> +		next = smp_load_acquire(&NEXT(next));
> +	}
> +	if (next) {
> +		queue->tail = next;
> +		atomic_dec(&queue->count);
> +		return tail;
> +	}
> +	if (tail != READ_ONCE(queue->head))
> +		return NULL;
> +	__wg_prev_queue_enqueue(queue, STUB(queue));
> +	next = smp_load_acquire(&NEXT(tail));
> +	if (next) {
> +		queue->tail = next;
> +		atomic_dec(&queue->count);
> +		return tail;
> +	}
> +	return NULL;
> +}

I don't see anywhere that you're clearing the next pointer (or prev, as
it were). Which means you'll likely end up passing packets up or down
the stack with that pointer still set, right? See this commit for a
previous instance where something like this has led to issues:

22f6bbb7bcfc ("net: use skb_list_del_init() to remove from RX sublists")

> +#undef NEXT
> +#undef STUB
> diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h
> index dfb674e03076..4ef2944a68bc 100644
> --- a/drivers/net/wireguard/queueing.h
> +++ b/drivers/net/wireguard/queueing.h
> @@ -17,12 +17,13 @@ struct wg_device;
>  struct wg_peer;
>  struct multicore_worker;
>  struct crypt_queue;
> +struct prev_queue;
>  struct sk_buff;
>  
>  /* queueing.c APIs: */
>  int wg_packet_queue_init(struct crypt_queue *queue, work_func_t function,
> -			 bool multicore, unsigned int len);
> -void wg_packet_queue_free(struct crypt_queue *queue, bool multicore);
> +			 unsigned int len);
> +void wg_packet_queue_free(struct crypt_queue *queue);
>  struct multicore_worker __percpu *
>  wg_packet_percpu_multicore_worker_alloc(work_func_t function, void *ptr);
>  
> @@ -135,8 +136,31 @@ static inline int wg_cpumask_next_online(int *next)
>  	return cpu;
>  }
>  
> +void wg_prev_queue_init(struct prev_queue *queue);
> +
> +/* Multi producer */
> +bool wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb);
> +
> +/* Single consumer */
> +struct sk_buff *wg_prev_queue_dequeue(struct prev_queue *queue);
> +
> +/* Single consumer */
> +static inline struct sk_buff *wg_prev_queue_peek(struct prev_queue *queue)
> +{
> +	if (queue->peeked)
> +		return queue->peeked;
> +	queue->peeked = wg_prev_queue_dequeue(queue);
> +	return queue->peeked;
> +}
> +
> +/* Single consumer */
> +static inline void wg_prev_queue_drop_peeked(struct prev_queue *queue)
> +{
> +	queue->peeked = NULL;
> +}
> +
>  static inline int wg_queue_enqueue_per_device_and_peer(
> -	struct crypt_queue *device_queue, struct crypt_queue *peer_queue,
> +	struct crypt_queue *device_queue, struct prev_queue *peer_queue,
>  	struct sk_buff *skb, struct workqueue_struct *wq, int *next_cpu)
>  {
>  	int cpu;
> @@ -145,8 +169,9 @@ static inline int wg_queue_enqueue_per_device_and_peer(
>  	/* We first queue this up for the peer ingestion, but the consumer
>  	 * will wait for the state to change to CRYPTED or DEAD before.
>  	 */
> -	if (unlikely(ptr_ring_produce_bh(&peer_queue->ring, skb)))
> +	if (unlikely(!wg_prev_queue_enqueue(peer_queue, skb)))
>  		return -ENOSPC;
> +
>  	/* Then we queue it up in the device queue, which consumes the
>  	 * packet as soon as it can.
>  	 */
> @@ -157,9 +182,7 @@ static inline int wg_queue_enqueue_per_device_and_peer(
>  	return 0;
>  }
>  
> -static inline void wg_queue_enqueue_per_peer(struct crypt_queue *queue,
> -					     struct sk_buff *skb,
> -					     enum packet_state state)
> +static inline void wg_queue_enqueue_per_peer_tx(struct sk_buff *skb, enum packet_state state)
>  {
>  	/* We take a reference, because as soon as we call atomic_set, the
>  	 * peer can be freed from below us.
> @@ -167,14 +190,12 @@ static inline void wg_queue_enqueue_per_peer(struct crypt_queue *queue,
>  	struct wg_peer *peer = wg_peer_get(PACKET_PEER(skb));
>  
>  	atomic_set_release(&PACKET_CB(skb)->state, state);
> -	queue_work_on(wg_cpumask_choose_online(&peer->serial_work_cpu,
> -					       peer->internal_id),
> -		      peer->device->packet_crypt_wq, &queue->work);
> +	queue_work_on(wg_cpumask_choose_online(&peer->serial_work_cpu, peer->internal_id),
> +		      peer->device->packet_crypt_wq, &peer->transmit_packet_work);
>  	wg_peer_put(peer);
>  }
>  
> -static inline void wg_queue_enqueue_per_peer_napi(struct sk_buff *skb,
> -						  enum packet_state state)
> +static inline void wg_queue_enqueue_per_peer_rx(struct sk_buff *skb, enum packet_state state)
>  {
>  	/* We take a reference, because as soon as we call atomic_set, the
>  	 * peer can be freed from below us.
> diff --git a/drivers/net/wireguard/receive.c b/drivers/net/wireguard/receive.c
> index 2c9551ea6dc7..7dc84bcca261 100644
> --- a/drivers/net/wireguard/receive.c
> +++ b/drivers/net/wireguard/receive.c
> @@ -444,7 +444,6 @@ static void wg_packet_consume_data_done(struct wg_peer *peer,
>  int wg_packet_rx_poll(struct napi_struct *napi, int budget)
>  {
>  	struct wg_peer *peer = container_of(napi, struct wg_peer, napi);
> -	struct crypt_queue *queue = &peer->rx_queue;
>  	struct noise_keypair *keypair;
>  	struct endpoint endpoint;
>  	enum packet_state state;
> @@ -455,11 +454,10 @@ int wg_packet_rx_poll(struct napi_struct *napi, int budget)
>  	if (unlikely(budget <= 0))
>  		return 0;
>  
> -	while ((skb = __ptr_ring_peek(&queue->ring)) != NULL &&
> +	while ((skb = wg_prev_queue_peek(&peer->rx_queue)) != NULL &&
>  	       (state = atomic_read_acquire(&PACKET_CB(skb)->state)) !=
>  		       PACKET_STATE_UNCRYPTED) {
> -		__ptr_ring_discard_one(&queue->ring);
> -		peer = PACKET_PEER(skb);
> +		wg_prev_queue_drop_peeked(&peer->rx_queue);
>  		keypair = PACKET_CB(skb)->keypair;
>  		free = true;
>  
> @@ -508,7 +506,7 @@ void wg_packet_decrypt_worker(struct work_struct *work)
>  		enum packet_state state =
>  			likely(decrypt_packet(skb, PACKET_CB(skb)->keypair)) ?
>  				PACKET_STATE_CRYPTED : PACKET_STATE_DEAD;
> -		wg_queue_enqueue_per_peer_napi(skb, state);
> +		wg_queue_enqueue_per_peer_rx(skb, state);
>  		if (need_resched())
>  			cond_resched();
>  	}
> @@ -531,12 +529,10 @@ static void wg_packet_consume_data(struct wg_device *wg, struct sk_buff *skb)
>  	if (unlikely(READ_ONCE(peer->is_dead)))
>  		goto err;
>  
> -	ret = wg_queue_enqueue_per_device_and_peer(&wg->decrypt_queue,
> -						   &peer->rx_queue, skb,
> -						   wg->packet_crypt_wq,
> -						   &wg->decrypt_queue.last_cpu);
> +	ret = wg_queue_enqueue_per_device_and_peer(&wg->decrypt_queue, &peer->rx_queue, skb,
> +						   wg->packet_crypt_wq, &wg->decrypt_queue.last_cpu);
>  	if (unlikely(ret == -EPIPE))
> -		wg_queue_enqueue_per_peer_napi(skb, PACKET_STATE_DEAD);
> +		wg_queue_enqueue_per_peer_rx(skb, PACKET_STATE_DEAD);
>  	if (likely(!ret || ret == -EPIPE)) {
>  		rcu_read_unlock_bh();
>  		return;
> diff --git a/drivers/net/wireguard/send.c b/drivers/net/wireguard/send.c
> index f74b9341ab0f..5368f7c35b4b 100644
> --- a/drivers/net/wireguard/send.c
> +++ b/drivers/net/wireguard/send.c
> @@ -239,8 +239,7 @@ void wg_packet_send_keepalive(struct wg_peer *peer)
>  	wg_packet_send_staged_packets(peer);
>  }
>  
> -static void wg_packet_create_data_done(struct sk_buff *first,
> -				       struct wg_peer *peer)
> +static void wg_packet_create_data_done(struct wg_peer *peer, struct sk_buff *first)
>  {
>  	struct sk_buff *skb, *next;
>  	bool is_keepalive, data_sent = false;
> @@ -262,22 +261,19 @@ static void wg_packet_create_data_done(struct sk_buff *first,
>  
>  void wg_packet_tx_worker(struct work_struct *work)
>  {
> -	struct crypt_queue *queue = container_of(work, struct crypt_queue,
> -						 work);
> +	struct wg_peer *peer = container_of(work, struct wg_peer, transmit_packet_work);
>  	struct noise_keypair *keypair;
>  	enum packet_state state;
>  	struct sk_buff *first;
> -	struct wg_peer *peer;
>  
> -	while ((first = __ptr_ring_peek(&queue->ring)) != NULL &&
> +	while ((first = wg_prev_queue_peek(&peer->tx_queue)) != NULL &&
>  	       (state = atomic_read_acquire(&PACKET_CB(first)->state)) !=
>  		       PACKET_STATE_UNCRYPTED) {
> -		__ptr_ring_discard_one(&queue->ring);
> -		peer = PACKET_PEER(first);
> +		wg_prev_queue_drop_peeked(&peer->tx_queue);
>  		keypair = PACKET_CB(first)->keypair;
>  
>  		if (likely(state == PACKET_STATE_CRYPTED))
> -			wg_packet_create_data_done(first, peer);
> +			wg_packet_create_data_done(peer, first);
>  		else
>  			kfree_skb_list(first);
>  
> @@ -306,16 +302,14 @@ void wg_packet_encrypt_worker(struct work_struct *work)
>  				break;
>  			}
>  		}
> -		wg_queue_enqueue_per_peer(&PACKET_PEER(first)->tx_queue, first,
> -					  state);
> +		wg_queue_enqueue_per_peer_tx(first, state);
>  		if (need_resched())
>  			cond_resched();
>  	}
>  }
>  
> -static void wg_packet_create_data(struct sk_buff *first)
> +static void wg_packet_create_data(struct wg_peer *peer, struct sk_buff *first)
>  {
> -	struct wg_peer *peer = PACKET_PEER(first);
>  	struct wg_device *wg = peer->device;
>  	int ret = -EINVAL;
>  
> @@ -323,13 +317,10 @@ static void wg_packet_create_data(struct sk_buff *first)
>  	if (unlikely(READ_ONCE(peer->is_dead)))
>  		goto err;
>  
> -	ret = wg_queue_enqueue_per_device_and_peer(&wg->encrypt_queue,
> -						   &peer->tx_queue, first,
> -						   wg->packet_crypt_wq,
> -						   &wg->encrypt_queue.last_cpu);
> +	ret = wg_queue_enqueue_per_device_and_peer(&wg->encrypt_queue, &peer->tx_queue, first,
> +						   wg->packet_crypt_wq, &wg->encrypt_queue.last_cpu);
>  	if (unlikely(ret == -EPIPE))
> -		wg_queue_enqueue_per_peer(&peer->tx_queue, first,
> -					  PACKET_STATE_DEAD);
> +		wg_queue_enqueue_per_peer_tx(first, PACKET_STATE_DEAD);
>  err:
>  	rcu_read_unlock_bh();
>  	if (likely(!ret || ret == -EPIPE))
> @@ -393,7 +384,7 @@ void wg_packet_send_staged_packets(struct wg_peer *peer)
>  	packets.prev->next = NULL;
>  	wg_peer_get(keypair->entry.peer);
>  	PACKET_CB(packets.next)->keypair = keypair;
> -	wg_packet_create_data(packets.next);
> +	wg_packet_create_data(peer, packets.next);
>  	return;
>  
>  out_invalid:
> -- 
> 2.30.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-17 18:36 ` Toke Høiland-Jørgensen
@ 2021-02-17 22:28   ` Jason A. Donenfeld
  2021-02-17 23:41     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 12+ messages in thread
From: Jason A. Donenfeld @ 2021-02-17 22:28 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: WireGuard mailing list, Dmitry Vyukov

On Wed, Feb 17, 2021 at 7:36 PM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> Are these performance measurements based on micro-benchmarks of the
> queueing structure, or overall wireguard performance? Do you see any
> measurable difference in the overall performance (i.e., throughput
> drop)?

These are from counting cycles per instruction using perf and seeing
which instructions are hotspots that take a greater or smaller
percentage of the overall time.

> And what about relative to using one of the existing skb queueing
> primitives in the kernel? Including some actual numbers would be nice to
> justify adding yet-another skb queueing scheme to the kernel :)

If you're referring to skb_queue_* and friends, those very much will
not work in any way, shape, or form here. Aside from the fact that the
MPSC nature of it is problematic for performance, those functions use
a doubly linked list. In wireguard's case, there is only one pointer
available (skb->prev), as skb->next is used to create the singly
linked skb_list (see skb_list_walk_safe) of gso frags. And in fact, by
having these two pointers next to each other for the separate lists,
it doesn't need to pull in another cache line. This isn't "yet-another
queueing scheme" in the kernel. This is just a singly linked list
queue.

> I say this also because the actual queueing of the packets has never
> really shown up on any performance radar in the qdisc and mac80211
> layers, which both use traditional spinlock-protected queueing
> structures.

Those are single threaded and the locks aren't really contended much.

> that would be good; also for figuring out if this algorithm might be
> useful in other areas as well (and don't get me wrong, I'm fascinated by
> it!).

If I find the motivation -- and if the mailing list conversations
don't become overly miserable -- I might try to fashion the queueing
mechanism into a general header-only data structure in include/linux/.
But that'd take a bit of work to see if there are actually places
where it matters and where it's useful. WireGuard can get away with it
because of its workqueue design, but other things probably aren't as
lucky like that. So I'm on the fence about generality.

> > -     if (wg_packet_queue_init(&peer->tx_queue, wg_packet_tx_worker, false,
> > -                              MAX_QUEUED_PACKETS))
> > -             goto err_2;
> > +     INIT_WORK(&peer->transmit_packet_work, wg_packet_tx_worker);
>
> It's not quite clear to me why changing the queue primitives requires
> adding another work queue?

It doesn't require a new workqueue. It's just that a workqueue was
init'd earlier in the call to "wg_packet_queue_init", which allocated
a ring buffer at the same time. We're not going through that
infrastructure anymore, but I still want the workqueue it used, so I
init it there instead. I truncated the diff in my quoted reply -- take
a look at that quote above and you'll see more clearly what I mean.

> > +#define NEXT(skb) ((skb)->prev)
>
> In particular, please explain this oxymoronic define :)

I can write more about that, sure. But it's what I wrote earlier in
this email -- the next pointer is taken; the prev one is free. So,
this uses the prev one.

> While this is nice and compact it's also really hard to read.

Actually I've already reworked that a bit in master to get the memory
barrier better.

> I don't see anywhere that you're clearing the next pointer (or prev, as
> it were). Which means you'll likely end up passing packets up or down
> the stack with that pointer still set, right? See this commit for a
> previous instance where something like this has led to issues:
>
> 22f6bbb7bcfc ("net: use skb_list_del_init() to remove from RX sublists")

The prev pointer is never used for anything or initialized to NULL
anywhere. skb_mark_not_on_list concerns skb->next.

Thanks for the review.

Jason


* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-17 22:28   ` Jason A. Donenfeld
@ 2021-02-17 23:41     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 12+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-02-17 23:41 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list, Dmitry Vyukov

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> On Wed, Feb 17, 2021 at 7:36 PM Toke Høiland-Jørgensen <toke@toke.dk> wrote:
>> Are these performance measurements based on micro-benchmarks of the
>> queueing structure, or overall wireguard performance? Do you see any
>> measurable difference in the overall performance (i.e., throughput
>> drop)?
>
> These are from counting cycles per instruction using perf and seeing
> which instructions are hotspots that take a greater or smaller
> percentage of the overall time.

Right. Would still love to see some actual numbers if you have them.
I.e., how much overhead do the queueing operations add compared to the
rest of the wg data path, and how much of that comes from the hotspot
operations? Even better if you have a comparison with a spinlock
version, but I do realise that may be asking too much :)

>> And what about relative to using one of the existing skb queueing
>> primitives in the kernel? Including some actual numbers would be nice to
>> justify adding yet-another skb queueing scheme to the kernel :)
>
> If you're referring to skb_queue_* and friends, those very much will
> not work in any way, shape, or form here. Aside from the fact that the
> MPSC nature of it is problematic for performance, those functions use
> a doubly linked list. In wireguard's case, there is only one pointer
> available (skb->prev), as skb->next is used to create the singly
> linked skb_list (see skb_list_walk_safe) of gso frags. And in fact, by
> having these two pointers next to each other for the separate lists,
> it doesn't need to pull in another cache line. This isn't "yet-another
> queueing scheme" in the kernel. This is just a singly linked list
> queue.

Having this clearly articulated in the commit message would be good, and
could prevent others from pushing back against what really does appear
at first glance to be "yet-another queueing scheme"...

I.e., in the version you posted you go "the ring buffer is too much
memory, so here's a new linked-list queueing algorithm", skipping the
"and this is why we can't use any of the existing ones" in-between.

>> I say this also because the actual queueing of the packets has never
>> really shown up on any performance radar in the qdisc and mac80211
>> layers, which both use traditional spinlock-protected queueing
>> structures.
>
> Those are single threaded and the locks aren't really contended much.
>
>> that would be good; also for figuring out if this algorithm might be
>> useful in other areas as well (and don't get me wrong, I'm fascinated by
>> it!).
>
> If I find the motivation -- and if the mailing list conversations
> don't become overly miserable -- I might try to fashion the queueing
> mechanism into a general header-only data structure in include/linux/.
> But that'd take a bit of work to see if there are actually places
> where it matters and where it's useful. WireGuard can get away with it
> because of its workqueue design, but other things probably aren't as
> lucky like that. So I'm on the fence about generality.

Yeah, I can't think of any off the top of my head either. But I'll
definitely keep this in mind if I do run into any. If there are no obvious
contenders, IMO it would be fine to just keep it internal to wg until
such a use case shows up, and then generalise it at that time. Although
that does give it less visibility for other users, it also saves you
some potentially-redundant work :)

>> > -     if (wg_packet_queue_init(&peer->tx_queue, wg_packet_tx_worker, false,
>> > -                              MAX_QUEUED_PACKETS))
>> > -             goto err_2;
>> > +     INIT_WORK(&peer->transmit_packet_work, wg_packet_tx_worker);
>>
>> It's not quite clear to me why changing the queue primitives requires
>> adding another work queue?
>
> It doesn't require a new workqueue. It's just that a workqueue was
> init'd earlier in the call to "wg_packet_queue_init", which allocated
> a ring buffer at the same time. We're not going through that
> infrastructure anymore, but I still want the workqueue it used, so I
> init it there instead. I truncated the diff in my quoted reply -- take
> a look at that quote above and you'll see more clearly what I mean.

Ah, right, it's moving things from wg_packet_queue_init() - missed that.
Thanks!

>> > +#define NEXT(skb) ((skb)->prev)
>>
>> In particular, please explain this oxymoronic define :)
>
> I can write more about that, sure. But it's what I wrote earlier in
> this email -- the next pointer is taken; the prev one is free. So,
> this uses the prev one.

Yeah, I just meant to duplicate the explanation and references in
comments as well as the commit message, to save the people looking at
the code in the future some head scratching, and to make the origins
of the algorithm clear (credit where credit is due and all that).

>> While this is nice and compact it's also really hard to read.
>
> Actually I've already reworked that a bit in master to get the memory
> barrier better.

That version still hides the possible race inside a nested macro
expansion, though. Not doing your readers any favours.

>> I don't see anywhere that you're clearing the next pointer (or prev, as
>> it were). Which means you'll likely end up passing packets up or down
>> the stack with that pointer still set, right? See this commit for a
>> previous instance where something like this has led to issues:
>>
>> 22f6bbb7bcfc ("net: use skb_list_del_init() to remove from RX sublists")
>
> The prev pointer is never used for anything or initialized to NULL
> anywhere. skb_mark_not_on_list concerns skb->next.

I was more concerned with stepping on the 'struct list_head' that shares
the space with the next and prev pointers, actually. But if you audited
that there are no other users of the pointer space at all, great! Please
do note this somewhere, though.
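
For anyone who wants to see the overlap being discussed, here is a
simplified sketch. The struct names are hypothetical, and the real
struct sk_buff carries more members in this union than are shown; the
only point is that a list_head placed in a union with the next/prev
pair occupies exactly the same two words, so a store through ->prev is
also visible as list.prev.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the kernel's two-pointer list_head. */
struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

/* Simplified: only the members relevant to the aliasing question. */
struct skb_sketch {
	union {
		struct {
			struct skb_sketch *next;
			struct skb_sketch *prev;
		};
		struct list_head_sketch list;
	};
};

/* The two views of the storage line up word for word. */
static_assert(offsetof(struct skb_sketch, next) ==
	      offsetof(struct skb_sketch, list.next), "next aliases list.next");
static_assert(offsetof(struct skb_sketch, prev) ==
	      offsetof(struct skb_sketch, list.prev), "prev aliases list.prev");

/* Returns 1 if a queue link stored in ->prev shows up as list.prev. */
static int prev_aliases_list_prev(void)
{
	struct skb_sketch a = {0}, b = {0};

	a.prev = &b;
	return (void *)a.list.prev == (void *)&b;
}
```

So any code that treats the skb as a list_head member while it sits on
the prev-linked queue would see, or clobber, the queue link.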

> Thanks for the review.

You're welcome - feel free to add my:

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>

-Toke


* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-08 13:38 [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers Jason A. Donenfeld
  2021-02-09  8:24 ` Dmitry Vyukov
  2021-02-17 18:36 ` Toke Høiland-Jørgensen
@ 2021-02-18 13:49 ` Björn Töpel
  2021-02-18 13:53   ` Jason A. Donenfeld
  2 siblings, 1 reply; 12+ messages in thread
From: Björn Töpel @ 2021-02-18 13:49 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: wireguard, Dmitry Vyukov

On Mon, 8 Feb 2021 at 14:47, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Having two ring buffers per-peer means that every peer results in two
> massive ring allocations. On an 8-core x86_64 machine, this commit
> reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
> is a 90% reduction. Ninety percent! With some single-machine
> deployments approaching 400,000 peers, we're talking about a reduction
> from 7 gigs of memory down to 700 megs of memory.
>
> In order to get rid of these per-peer allocations, this commit switches
> to using a list-based queueing approach. Currently GSO fragments are
> chained together using the skb->next pointer, so we form the per-peer
> queue around the unused skb->prev pointer, which makes sense because the
> links are pointing backwards. Multiple cores can write into the queue at
> any given time, because its writes occur in the start_xmit path or in
> the udp_recv path. But reads happen in a single workqueue item per-peer,
> amounting to a multi-producer, single-consumer paradigm.
>
> The MPSC queue is implemented locklessly and never blocks. However, it
> is not linearizable (though it is serializable), with a very tight and
> unlikely race on writes, which, when hit (about 0.15% of the time on a
> fully loaded 16-core x86_64 system), causes the queue reader to
> terminate early. However, because every packet sent queues up the same
> workqueue item after it is fully added, the queue resumes again, and
> stopping early isn't actually a problem, since at that point the packet
> wouldn't have yet been added to the encryption queue. These properties
> allow us to avoid disabling interrupts or spinning.
>
> Performance-wise, ordinarily list-based queues aren't preferable to
> ringbuffers, because of cache misses when following pointers around.
> However, we *already* have to follow the adjacent pointers when working
> through fragments, so there shouldn't actually be any change there. A
> potential downside is that dequeueing is a bit more complicated, but the
> ptr_ring structure used prior had a spinlock when dequeueing, so all in
> all the difference appears to be a wash.
>
> Actually, from profiling, the biggest performance hit, by far, of this
> commit winds up being atomic_add_unless(count, 1, max) and
> atomic_dec(count), which account for the majority of CPU time, according to
> perf. In that sense, the previous ring buffer was superior in that it
> could check if it was full by head==tail, which the list-based approach
> cannot do.
>
> Cc: Dmitry Vyukov <dvyukov@google.com>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> Hoping to get some feedback here from people running massive deployments
> and running into ram issues, as well as Dmitry on the queueing semantics
> (the mpsc queue is his design), before I send this to Dave for merging.
> These changes are quite invasive, so I don't want to get anything wrong.
>

[...]

> diff --git a/drivers/net/wireguard/queueing.c b/drivers/net/wireguard/queueing.c
> index 71b8e80b58e1..a72380ce97dd 100644
> --- a/drivers/net/wireguard/queueing.c
> +++ b/drivers/net/wireguard/queueing.c

[...]

> +
> +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> +{
> +       WRITE_ONCE(NEXT(skb), NULL);
> +       smp_wmb();
> +       WRITE_ONCE(NEXT(xchg_relaxed(&queue->head, skb)), skb);
> +}
> +

I'll chime in with Toke; This MPSC and Dmitry's links really took me
to the "verify with pen/paper"-level! Thanks!

I'd replace the smp_wmb()/_relaxed above with a xchg_release(), which
might perform better on some platforms. Also, it'll be a nicer pair
with the ldacq below. :-P In general, it would be nice with some
wording how the fences pair. It would help the readers (like me!) a
lot.
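
As a rough illustration of the pairing, here is a userspace C11 sketch
of a stub-based intrusive MPSC queue of the kind discussed in this
thread. It is an assumption-laden analogue, not the patch: "struct
node", "struct mpsc" and the mpsc_*() names are made up, the count
accounting is omitted, and the store to the previous node's link is
made a release store as a conservative choice for the C11 model, where
the kernel code uses a plain WRITE_ONCE(). The release exchange on head
and the release link store are meant to pair with the acquire loads in
the single consumer.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	_Atomic(struct node *) link;	/* plays the role of NEXT(skb) */
	int value;
};

struct mpsc {
	_Atomic(struct node *) head;	/* producers swap themselves in here */
	struct node *tail;		/* touched by the single consumer only */
	struct node stub;		/* placeholder so the list is never empty */
};

static void mpsc_init(struct mpsc *q)
{
	atomic_init(&q->stub.link, NULL);
	atomic_init(&q->head, &q->stub);
	q->tail = &q->stub;
}

static void mpsc_enqueue(struct mpsc *q, struct node *n)
{
	struct node *prev;

	atomic_store_explicit(&n->link, NULL, memory_order_relaxed);
	/* Release: the NULL store above is ordered before n is published. */
	prev = atomic_exchange_explicit(&q->head, n, memory_order_release);
	/* Until this store lands, the consumer sees a short chain and stops. */
	atomic_store_explicit(&prev->link, n, memory_order_release);
}

static struct node *mpsc_dequeue(struct mpsc *q)
{
	struct node *tail = q->tail;
	struct node *next = atomic_load_explicit(&tail->link, memory_order_acquire);

	if (tail == &q->stub) {
		if (!next)
			return NULL;	/* empty, or a producer is mid-enqueue */
		q->tail = next;
		tail = next;
		next = atomic_load_explicit(&tail->link, memory_order_acquire);
	}
	if (next) {
		q->tail = next;
		return tail;
	}
	if (tail != atomic_load_explicit(&q->head, memory_order_relaxed))
		return NULL;	/* head was swapped but the link is not set yet */
	mpsc_enqueue(q, &q->stub);	/* re-add the stub so tail can be detached */
	next = atomic_load_explicit(&tail->link, memory_order_acquire);
	if (next) {
		q->tail = next;
		return tail;
	}
	return NULL;
}
```

The early NULL returns are the "reader terminates early" case from the
commit message: the consumer gives up rather than spinning, and relies
on being scheduled again once the producer finishes linking.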


Cheers,
Björn

[...]


* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-18 13:49 ` Björn Töpel
@ 2021-02-18 13:53   ` Jason A. Donenfeld
  2021-02-18 14:04     ` Björn Töpel
  0 siblings, 1 reply; 12+ messages in thread
From: Jason A. Donenfeld @ 2021-02-18 13:53 UTC (permalink / raw)
  To: Björn Töpel; +Cc: WireGuard mailing list, Dmitry Vyukov

Hey Bjorn,

On Thu, Feb 18, 2021 at 2:50 PM Björn Töpel <bjorn@kernel.org> wrote:
> > +
> > +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> > +{
> > +       WRITE_ONCE(NEXT(skb), NULL);
> > +       smp_wmb();
> > +       WRITE_ONCE(NEXT(xchg_relaxed(&queue->head, skb)), skb);
> > +}
> > +
>
> I'll chime in with Toke; This MPSC and Dmitry's links really took me
> to the "verify with pen/paper"-level! Thanks!
>
> I'd replace the smp_wmb()/_relaxed above with a xchg_release(), which
> might perform better on some platforms. Also, it'll be a nicer pair
> with the ldacq below. :-P In general, it would be nice with some
> wording how the fences pair. It would help the readers (like me!) a
> lot.

Exactly. This is what's been in my dev tree for the last week or so:

+static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
+{
+       WRITE_ONCE(NEXT(skb), NULL);
+       WRITE_ONCE(NEXT(xchg_release(&queue->head, skb)), skb);
+}

Look good?

Jason


* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-18 13:53   ` Jason A. Donenfeld
@ 2021-02-18 14:04     ` Björn Töpel
  2021-02-18 14:15       ` Jason A. Donenfeld
  0 siblings, 1 reply; 12+ messages in thread
From: Björn Töpel @ 2021-02-18 14:04 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list, Dmitry Vyukov

On Thu, 18 Feb 2021 at 14:53, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hey Bjorn,
>
> On Thu, Feb 18, 2021 at 2:50 PM Björn Töpel <bjorn@kernel.org> wrote:
> > > +
> > > +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> > > +{
> > > +       WRITE_ONCE(NEXT(skb), NULL);
> > > +       smp_wmb();
> > > +       WRITE_ONCE(NEXT(xchg_relaxed(&queue->head, skb)), skb);
> > > +}
> > > +
> >
> > I'll chime in with Toke; This MPSC and Dmitry's links really took me
> > to the "verify with pen/paper"-level! Thanks!
> >
> > I'd replace the smp_wmb()/_relaxed above with a xchg_release(), which
> > might perform better on some platforms. Also, it'll be a nicer pair
> > with the ldacq below. :-P In general, it would be nice with some
> > wording how the fences pair. It would help the readers (like me!) a
> > lot.
>
> Exactly. This is what's been in my dev tree for the last week or so:
>

Ah, nice!

> +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> +{
> +       WRITE_ONCE(NEXT(skb), NULL);
> +       WRITE_ONCE(NEXT(xchg_release(&queue->head, skb)), skb);
> +}
>
> Look good?
>

Yes, exactly like that!


Cheers,
Björn


* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-18 14:04     ` Björn Töpel
@ 2021-02-18 14:15       ` Jason A. Donenfeld
  2021-02-18 15:12         ` Björn Töpel
  0 siblings, 1 reply; 12+ messages in thread
From: Jason A. Donenfeld @ 2021-02-18 14:15 UTC (permalink / raw)
  To: Björn Töpel; +Cc: WireGuard mailing list, Dmitry Vyukov

On Thu, Feb 18, 2021 at 3:04 PM Björn Töpel <bjorn@kernel.org> wrote:
>
> On Thu, 18 Feb 2021 at 14:53, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > Hey Bjorn,
> >
> > On Thu, Feb 18, 2021 at 2:50 PM Björn Töpel <bjorn@kernel.org> wrote:
> > > > +
> > > > +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> > > > +{
> > > > +       WRITE_ONCE(NEXT(skb), NULL);
> > > > +       smp_wmb();
> > > > +       WRITE_ONCE(NEXT(xchg_relaxed(&queue->head, skb)), skb);
> > > > +}
> > > > +
> > >
> > > I'll chime in with Toke; This MPSC and Dmitry's links really took me
> > > to the "verify with pen/paper"-level! Thanks!
> > >
> > > I'd replace the smp_wmb()/_relaxed above with a xchg_release(), which
> > > might perform better on some platforms. Also, it'll be a nicer pair
> > > with the ldacq below. :-P In general, it would be nice with some
> > > wording how the fences pair. It would help the readers (like me!) a
> > > lot.
> >
> > Exactly. This is what's been in my dev tree for the last week or so:
> >
>
> Ah, nice!
>
> > +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> > +{
> > +       WRITE_ONCE(NEXT(skb), NULL);
> > +       WRITE_ONCE(NEXT(xchg_release(&queue->head, skb)), skb);
> > +}
> >
> > Look good?
> >
>
> Yes, exactly like that!

The downside is that on armv7, this becomes a dmb(ish) instead of a
dmb(ishst). But I was unable to measure any actual difference anyway,
and the atomic bounded increment is already more expensive, so I think
it's okay.
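
For reference, the "atomic bounded increment" is the
atomic_add_unless(count, 1, max) call from the commit message. A
userspace C11 approximation, under the assumption that a
compare-and-swap loop is a fair model of it, looks roughly like this.
The add_unless(), claim_slot() and release_slot() names are
hypothetical and not taken from the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Add a to *v unless *v == u. Returns true if the add happened.
 * This mimics the semantics of the kernel's atomic_add_unless().
 */
static bool add_unless(atomic_int *v, int a, int u)
{
	int cur = atomic_load_explicit(v, memory_order_relaxed);

	do {
		if (cur == u)
			return false;	/* already at the bound, do not add */
		/* On failure, cur is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak_explicit(v, &cur, cur + a,
							memory_order_seq_cst,
							memory_order_relaxed));
	return true;
}

/* Hypothetical enqueue gate: claim one slot or report that we are full. */
static bool claim_slot(atomic_int *count, int max)
{
	return add_unless(count, 1, max);
}

/* Hypothetical dequeue side: give the slot back. */
static void release_slot(atomic_int *count)
{
	atomic_fetch_sub_explicit(count, 1, memory_order_relaxed);
}
```

The loop has to re-read and retry under contention, which is one
plausible reason it shows up in profiles more than a ring buffer's
head==tail comparison would.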

Jason


* Re: [PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers
  2021-02-18 14:15       ` Jason A. Donenfeld
@ 2021-02-18 15:12         ` Björn Töpel
  0 siblings, 0 replies; 12+ messages in thread
From: Björn Töpel @ 2021-02-18 15:12 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list, Dmitry Vyukov

On Thu, 18 Feb 2021 at 15:15, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>

[...]

> >
> > > +static void __wg_prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
> > > +{
> > > +       WRITE_ONCE(NEXT(skb), NULL);
> > > +       WRITE_ONCE(NEXT(xchg_release(&queue->head, skb)), skb);
> > > +}
> > >
> > > Look good?
> > >
> >
> > Yes, exactly like that!
>
> The downside is that on armv7, this becomes a dmb(ish) instead of a
> dmb(ishst). But I was unable to measure any actual difference anyway,
> and the atomic bounded increment is already more expensive, so I think
> it's okay.
>

Who cares about armv7!? The world is moving to Armv8/LSE, where we'll
end up with one fine "swpl" in this case, w/o any explicit (well...)
fence. ;-P

On a more serious note, it does make sense to base the decision on
benchmarks. OTOH I'd guess that the systems that mostly benefit from
this memory saving patch are x86_64, where the
smp_wmb()/xchg_relaxed() and xchg_release() are identical.


Björn

