Development discussion of WireGuard
* [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
@ 2016-09-30 18:41 Jason A. Donenfeld
  2016-09-30 19:18 ` Dave Taht
  0 siblings, 1 reply; 4+ messages in thread
From: Jason A. Donenfeld @ 2016-09-30 18:41 UTC (permalink / raw)
  To: Dave Taht; +Cc: WireGuard mailing list

Hey Dave,

I've been comparing graphs and bandwidth and so forth with flent's
rrul and iperf3, trying to figure out what's going on. Here's my
present understanding of the queuing and buffering issues. I sort of
suspect these are issues that might not translate entirely well to the
work you've been doing, but maybe I'm wrong. Here goes...

1. For each peer, there is a separate queue, called peer_queue. Each
peer corresponds to a specific UDP endpoint, which means that a peer
is a "flow".
2. When certain crypto handshake requirements haven't yet been met,
packets pile up in peer_queue. Then when a handshake completes, all
the packets that piled up are released. Because handshakes might take
a while, peer_queue is quite big -- 1024 packets (dropping the oldest
packets when full). In this context, that's not huge bufferbloat, but
rather just a queue of packets held while the setup operation is
occurring.
3. WireGuard is a net_device interface, which means packets from
userspace are handed to its transmit routine in softirq. It's
advertised as accepting GSO "super packets", so sometimes it is asked
to transmit a packet that is 65k in length. When this happens, it
splits that packet up into MTU-sized packets, puts them in
peer_queue, and then immediately processes the entire queue at once
(see the sketch right after this list).
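
To make (1)-(3) concrete, here's a rough sketch of that transmit path
in kernel-style C. The names (wg_xmit, peer->tx_queue,
process_peer_queue, and so on) are illustrative only, not the actual
symbols in my tree:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch of the transmit path in (1)-(3); everything except the
 * standard skb/netdev helpers is a made-up name. */
static netdev_tx_t wg_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct wg_peer *peer = peer_lookup(dev, skb);   /* hypothetical */
        struct sk_buff *segs = skb, *seg;

        /* Split a GSO "super packet" (up to ~65k) into MTU-sized segments. */
        if (skb_is_gso(skb)) {
                segs = skb_gso_segment(skb, 0);
                if (IS_ERR(segs)) {
                        kfree_skb(skb);
                        return NETDEV_TX_OK;
                }
                consume_skb(skb);
        }

        /* Queue every segment on this peer's 1024-entry peer_queue,
         * dropping the oldest packet if it's full. */
        while ((seg = segs) != NULL) {
                segs = seg->next;
                seg->next = NULL;
                peer_queue_enqueue(&peer->tx_queue, seg);  /* hypothetical */
        }

        /* Immediately process the whole queue: with a valid handshake the
         * packets get encrypted and sent; otherwise they sit in peer_queue
         * until the handshake completes. */
        process_peer_queue(peer);                          /* hypothetical */
        return NETDEV_TX_OK;
}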

If that were the totality of things, I believe it would work quite
well: packets would be encrypted and sent immediately in the softirq
device transmit handler, just like the mac80211 stack does things.
The above existence of peer_queue wouldn't equate to any form of
bufferbloat or latency
issues, because it would just act as a simple data structure for
immediately transmitting packets. Similarly, when receiving a packet
from the UDP socket, we _could_ simply decrypt in softirq, again
like mac80211, as the packet comes in. This makes all the expensive
crypto operations blocking to the initiator of the operation -- the
userspace application calling send() or the udp socket receiving an
encrypted packet. All is well.
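
For comparison, that fully synchronous receive side would look
roughly like the following udp_tunnel encap_rcv handler. Again, just
a sketch: decrypt_packet() and wg_dev are made-up names, and the real
thing would also need to strip the outer headers and set up the skb
properly.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sock.h>

/* Sketch of "decrypt directly in softirq as the packet comes in". */
static int wg_encap_rcv(struct sock *sk, struct sk_buff *skb)
{
        /* Runs in softirq as the encrypted UDP datagram arrives. */
        if (decrypt_packet(skb) < 0) {          /* hypothetical */
                kfree_skb(skb);
                return 0;                       /* consumed (and dropped) */
        }

        /* Hand the decrypted inner packet straight to the stack. */
        skb->dev = wg_dev;                      /* hypothetical net_device */
        netif_rx(skb);
        return 0;                               /* consumed */
}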

However, things get complicated and ugly when we add multi-core
encryption and decryption. We add on to the above as follows:

4. The kernel has a library called padata (kernel/padata.c). You
submit asynchronous jobs, which are then sent off to various CPUs in
parallel, and then you're notified when the jobs are done, with the
nice feature that you get these notifications in the same order that
you submitted the jobs, so that packets don't get reordered. padata
has a hard-coded maximum of 1000 in-progress operations. We can
artificially make this lower, if we want (currently we don't), but we
can't make it higher.
5. We continue from the above-described peer_queue, only this time
instead of encrypting immediately in softirq, we simply send all of
peer_queue off to padata. Since the actual work happens
asynchronously, we return immediately, not spending cycles in softirq.
When that batch of encryption jobs completes, we transmit the
resultant encrypted packets. When we send those jobs off, it's
possible padata already has 1000 operations in progress, in which case
we get "-EBUSY", and can take one of two options: (a) put that packet
back at the top of peer_queue, return from sending, and try again to
send all of peer_queue the next time the user submits a packet, or (b)
discard that packet, and keep trying to queue up the ones after.
Currently we go with behavior (a).
6. Likewise, when receiving an encrypted packet from a UDP socket, we
decrypt it asynchronously using padata. If there are already 1000
operations in place, we drop the packet.
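
To make (4)-(6) concrete, this is roughly what a padata submission
looks like. Only struct padata_priv, padata_do_parallel(), and
padata_do_serial() are the real kernel API here; the job wrapper and
the helper names are illustrative:

#include <linux/padata.h>
#include <linux/skbuff.h>
#include <linux/slab.h>

/* Illustrative wrapper around one encryption job. */
struct encryption_job {
        struct padata_priv padata;
        struct sk_buff *skb;
        /* keys, nonce, destination endpoint, ... */
};

/* Runs on whatever CPU padata picks, in parallel with other jobs. */
static void encrypt_parallel(struct padata_priv *padata)
{
        struct encryption_job *job =
                container_of(padata, struct encryption_job, padata);

        do_expensive_crypto(job->skb);          /* hypothetical */
        padata_do_serial(padata);               /* hand back in order */
}

/* Called back in the original submission order, so no reordering. */
static void encrypt_serial(struct padata_priv *padata)
{
        struct encryption_job *job =
                container_of(padata, struct encryption_job, padata);

        send_encrypted_packet(job->skb);        /* hypothetical */
        kfree(job);
}

static int submit_encryption_job(struct padata_instance *pinst,
                                 struct encryption_job *job, int cb_cpu)
{
        int ret;

        job->padata.parallel = encrypt_parallel;
        job->padata.serial = encrypt_serial;

        ret = padata_do_parallel(pinst, &job->padata, cb_cpu);
        if (ret == -EBUSY) {
                /* padata already has its hard-coded 1000 jobs in flight;
                 * this is where we pick (a) requeue on peer_queue and retry
                 * later, or (b) drop the packet. */
        }
        return ret;
}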

If I change the length of peer_queue from 1024 to something small like
16, it has some effect when combined with choice (a) as opposed to
choice (b), but I think this knob isn't so important, and I can leave
it at 1024. However, if I change padata's maximum from 1000 to
something small like 16, I immediately get much lower latency.
However, bandwidth suffers greatly, no matter choice (a) or choice
(b). Padata's maximum seems to be the relevant knob. But I'm not sure
the best way to tune it, nor am I sure the best way to interact with
everything else here.
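
For what it's worth, "artificially making it lower" could be as
simple as keeping our own in-flight counter in front of
padata_do_parallel(). The limit and the counter below are
illustrative, not what's actually in the tree:

#include <linux/atomic.h>
#include <linux/padata.h>

/* Hypothetical cap below padata's hard-coded 1000 (MAX_OBJ_NUM in
 * kernel/padata.c). */
#define MAX_INFLIGHT 16

static atomic_t inflight = ATOMIC_INIT(0);

static int submit_capped(struct padata_instance *pinst,
                         struct padata_priv *padata, int cb_cpu)
{
        int ret;

        if (atomic_inc_return(&inflight) > MAX_INFLIGHT) {
                atomic_dec(&inflight);
                return -EBUSY;  /* same choice as before: requeue or drop */
        }

        ret = padata_do_parallel(pinst, padata, cb_cpu);
        if (ret)
                atomic_dec(&inflight);
        return ret;
        /* the serial completion callback does atomic_dec(&inflight) */
}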

I'm open to all suggestions, as at the moment I'm a bit in the dark on
how to proceed. Simply saying "just throw fq_codel at it!" or "change
your buffer lengths!" doesn't really help me much, as I believe the
design is a bit more nuanced.

Thanks,
Jason

* [WireGuard] WireGuard Queuing, Bufferbloat, Performance, Latency, and related issues
@ 2016-09-30 23:21 Jason A. Donenfeld
  0 siblings, 0 replies; 4+ messages in thread
From: Jason A. Donenfeld @ 2016-09-30 23:21 UTC (permalink / raw)
  To: cake, make-wifi-fast, WireGuard mailing list, Dave Taht

Hi all,

On Fri, Sep 30, 2016 at 9:18 PM, Dave Taht <dave.taht@gmail.com> wrote:
> All: I've always dreamed of a vpn that could fq and - when it was
> bottlenecking on cpu - throw away packets intelligently. Wireguard,
> which is what jason & co are working on, is a really simple, elegant
> set of newer vpn ideas that currently has a queuing model designed to
> optimize for multi-cpu encryption, and not, so much, for managing
> worst case network behaviors, or fairness, or working on lower end
> hardware.

Would love any feedback and support for working on the queuing model
with WireGuard. I hear the bufferbloat folks are geniuses at that...

> Do do a git clone of the code, and take a look... somewhere on the
> wireguard list, or privately, jason'd pointed me at the relevant bits
> of the queuing model.

It was this post:
https://lists.zx2c4.com/pipermail/wireguard/2016-August/000378.html
Start reading from "There are a couple reasons" and finish at
"chunking them somehow." The rest can be disregarded.

Hope to hear from y'all soon!

Thanks,
Jason


