Development discussion of WireGuard
* [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
@ 2016-09-30 18:41 Jason A. Donenfeld
  2016-09-30 19:18 ` Dave Taht
  0 siblings, 1 reply; 4+ messages in thread
From: Jason A. Donenfeld @ 2016-09-30 18:41 UTC (permalink / raw)
  To: Dave Taht; +Cc: WireGuard mailing list

Hey Dave,

I've been comparing graphs and bandwidth and so forth with flent's
rrul and iperf3, trying to figure out what's going on. Here's my
present understanding of the queuing and buffering issues. I sort of
suspect these are issues that might not translate entirely well to the
work you've been doing, but maybe I'm wrong. Here goes...

1. For each peer, there is a separate queue, called peer_queue. Each
peer corresponds to a specific UDP endpoint, which means that a peer
is a "flow".
2. When certain crypto handshake requirements haven't yet been met,
packets pile up in peer_queue. Then when a handshake completes, all
the packets that piled up are released. Because handshakes might take
a while, peer_queue is quite big -- 1024 packets (dropping the oldest
packets when full). In this context, that's not huge bufferbloat, but
rather just a queue of packets held while the setup operation is
occurring.
3. WireGuard is a net_device interface, which means packets from
userspace are handed to its transmit routine in softirq. It's
advertised as accepting GSO "super packets", so sometimes it is asked
to transmit a packet that is 65k in length. When this happens, it
splits that packet up into MTU-sized packets, puts them in
peer_queue, and then immediately processes the entire queue at once
(see the sketch right after this list).
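
To make (1)-(3) concrete, here's a rough sketch of that transmit path
in kernel-style C. The names (wg_xmit, peer->tx_queue,
process_peer_queue, and so on) are illustrative only, not the actual
symbols in my tree:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch of the transmit path in (1)-(3); everything except the
 * standard skb/netdev helpers is a made-up name. */
static netdev_tx_t wg_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct wg_peer *peer = peer_lookup(dev, skb);   /* hypothetical */
        struct sk_buff *segs = skb, *seg;

        /* Split a GSO "super packet" (up to ~65k) into MTU-sized segments. */
        if (skb_is_gso(skb)) {
                segs = skb_gso_segment(skb, 0);
                if (IS_ERR(segs)) {
                        kfree_skb(skb);
                        return NETDEV_TX_OK;
                }
                consume_skb(skb);
        }

        /* Queue every segment on this peer's 1024-entry peer_queue,
         * dropping the oldest packet if it's full. */
        while ((seg = segs) != NULL) {
                segs = seg->next;
                seg->next = NULL;
                peer_queue_enqueue(&peer->tx_queue, seg);  /* hypothetical */
        }

        /* Immediately process the whole queue: with a valid handshake the
         * packets get encrypted and sent; otherwise they sit in peer_queue
         * until the handshake completes. */
        process_peer_queue(peer);                          /* hypothetical */
        return NETDEV_TX_OK;
}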

If that were the totality of things, I believe it would work quite
well: packets would be encrypted and sent immediately in the softirq
device transmit handler, just like the mac80211 stack does things.
The above existence of peer_queue wouldn't equate to any form of
bufferbloat or latency
issues, because it would just act as a simple data structure for
immediately transmitting packets. Similarly, when receiving a packet
from the UDP socket, we _could_ simply decrypt in softirq, again
like mac80211, as the packet comes in. This makes all the expensive
crypto operations blocking to the initiator of the operation -- the
userspace application calling send() or the udp socket receiving an
encrypted packet. All is well.
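
For comparison, that fully synchronous receive side would look
roughly like the following udp_tunnel encap_rcv handler. Again, just
a sketch: decrypt_packet() and wg_dev are made-up names, and the real
thing would also need to strip the outer headers and set up the skb
properly.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sock.h>

/* Sketch of "decrypt directly in softirq as the packet comes in". */
static int wg_encap_rcv(struct sock *sk, struct sk_buff *skb)
{
        /* Runs in softirq as the encrypted UDP datagram arrives. */
        if (decrypt_packet(skb) < 0) {          /* hypothetical */
                kfree_skb(skb);
                return 0;                       /* consumed (and dropped) */
        }

        /* Hand the decrypted inner packet straight to the stack. */
        skb->dev = wg_dev;                      /* hypothetical net_device */
        netif_rx(skb);
        return 0;                               /* consumed */
}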

However, things get complicated and ugly when we add multi-core
encryption and decryption. We add on to the above as follows:

4. The kernel has a library called padata (kernel/padata.c). You
submit asynchronous jobs, which are then sent off to various CPUs in
parallel, and then you're notified when the jobs are done, with the
nice feature that you get these notifications in the same order that
you submitted the jobs, so that packets don't get reordered. padata
has a hard-coded maximum of 1000 in-progress operations. We can
artificially make this lower, if we want (currently we don't), but we
can't make it higher.
5. We continue from the above-described peer_queue, only this time
instead of encrypting immediately in softirq, we simply send all of
peer_queue off to padata. Since the actual work happens
asynchronously, we return immediately, not spending cycles in softirq.
When that batch of encryption jobs completes, we transmit the
resultant encrypted packets. When we send those jobs off, it's
possible padata already has 1000 operations in progress, in which case
we get "-EBUSY", and can take one of two options: (a) put that packet
back at the top of peer_queue, return from sending, and try again to
send all of peer_queue the next time the user submits a packet, or (b)
discard that packet, and keep trying to queue up the ones after.
Currently we go with behavior (a).
6. Likewise, when receiving an encrypted packet from a UDP socket, we
decrypt it asynchronously using padata. If there are already 1000
operations in place, we drop the packet.
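
To make (4)-(6) concrete, this is roughly what a padata submission
looks like. Only struct padata_priv, padata_do_parallel(), and
padata_do_serial() are the real kernel API here; the job wrapper and
the helper names are illustrative:

#include <linux/padata.h>
#include <linux/skbuff.h>
#include <linux/slab.h>

/* Illustrative wrapper around one encryption job. */
struct encryption_job {
        struct padata_priv padata;
        struct sk_buff *skb;
        /* keys, nonce, destination endpoint, ... */
};

/* Runs on whatever CPU padata picks, in parallel with other jobs. */
static void encrypt_parallel(struct padata_priv *padata)
{
        struct encryption_job *job =
                container_of(padata, struct encryption_job, padata);

        do_expensive_crypto(job->skb);          /* hypothetical */
        padata_do_serial(padata);               /* hand back in order */
}

/* Called back in the original submission order, so no reordering. */
static void encrypt_serial(struct padata_priv *padata)
{
        struct encryption_job *job =
                container_of(padata, struct encryption_job, padata);

        send_encrypted_packet(job->skb);        /* hypothetical */
        kfree(job);
}

static int submit_encryption_job(struct padata_instance *pinst,
                                 struct encryption_job *job, int cb_cpu)
{
        int ret;

        job->padata.parallel = encrypt_parallel;
        job->padata.serial = encrypt_serial;

        ret = padata_do_parallel(pinst, &job->padata, cb_cpu);
        if (ret == -EBUSY) {
                /* padata already has its hard-coded 1000 jobs in flight;
                 * this is where we pick (a) requeue on peer_queue and retry
                 * later, or (b) drop the packet. */
        }
        return ret;
}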

If I change the length of peer_queue from 1024 to something small like
16, it has some effect when combined with choice (a) as opposed to
choice (b), but I think this knob isn't so important, and I can leave
it at 1024. However, if I change padata's maximum from 1000 to
something small like 16, I immediately get much lower latency.
However, bandwidth suffers greatly, no matter choice (a) or choice
(b). Padata's maximum seems to be the relevant knob. But I'm not sure
the best way to tune it, nor am I sure the best way to interact with
everything else here.
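
For what it's worth, "artificially making it lower" could be as
simple as keeping our own in-flight counter in front of
padata_do_parallel(). The limit and the counter below are
illustrative, not what's actually in the tree:

#include <linux/atomic.h>
#include <linux/padata.h>

/* Hypothetical cap below padata's hard-coded 1000 (MAX_OBJ_NUM in
 * kernel/padata.c). */
#define MAX_INFLIGHT 16

static atomic_t inflight = ATOMIC_INIT(0);

static int submit_capped(struct padata_instance *pinst,
                         struct padata_priv *padata, int cb_cpu)
{
        int ret;

        if (atomic_inc_return(&inflight) > MAX_INFLIGHT) {
                atomic_dec(&inflight);
                return -EBUSY;  /* same choice as before: requeue or drop */
        }

        ret = padata_do_parallel(pinst, padata, cb_cpu);
        if (ret)
                atomic_dec(&inflight);
        return ret;
        /* the serial completion callback does atomic_dec(&inflight) */
}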

I'm open to all suggestions, as at the moment I'm a bit in the dark on
how to proceed. Simply saying "just throw fq_codel at it!" or "change
your buffer lengths!" doesn't really help me much, as I believe the
design is a bit more nuanced.

Thanks,
Jason

* [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
@ 2016-09-30 23:21 Jason A. Donenfeld
  0 siblings, 0 replies; 4+ messages in thread
From: Jason A. Donenfeld @ 2016-09-30 23:21 UTC (permalink / raw)
  To: cake, make-wifi-fast, WireGuard mailing list, Dave Taht

Hi all,

On Fri, Sep 30, 2016 at 9:18 PM, Dave Taht <dave.taht@gmail.com> wrote:
> All: I've always dreamed of a vpn that could fq and - when it was
> bottlenecking on cpu - throw away packets intelligently. Wireguard,
> which is what jason & co are working on, is a really simple, elegant
> set of newer vpn ideas that currently has a queuing model designed to
> optimize for multi-cpu encryption, and not, so much, for managing
> worst case network behaviors, or fairness, or working on lower end
> hardware.

Would love any feedback and support for working on the queuing model
with WireGuard. I hear the bufferbloat folks are geniuses at that...

> Do do a git clone of the code, and take a look... somewhere on the
> wireguard list, or privately, jason'd pointed me at the relevant bits
> of the queuing model.

It was this post:
https://lists.zx2c4.com/pipermail/wireguard/2016-August/000378.html
Start reading from "There are a couple reasons" and finish at
"chunking them somehow." The rest can be disregarded.

Hope to hear from y'all soon!

Thanks,
Jason


