Development discussion of WireGuard
* [WireGuard] Major Queueing Algorithm Simplification
From: Jason A. Donenfeld @ 2016-11-04 13:24 UTC
  To: Toke Høiland-Jørgensen; +Cc: WireGuard mailing list

Hey,

This might be of interest...

Before, every time I got a GSO superpacket from the kernel, I'd split
it into little packets, and then queue each little packet as a
different parallel job.

Now, every time I get a GSO superpacket from the kernel, I split it
into little packets and queue up that whole bundle of packets as a
single parallel job. This means that each GSO superpacket expansion
gets processed on a single CPU. This greatly simplifies the algorithm
and delivers impressive throughput gains.
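
To make the difference concrete, here's a rough self-contained sketch
in plain userspace C. Every name in it (struct packet,
encrypt_and_send, the xmit_* functions) is made up for illustration;
it isn't the actual WireGuard code, just the shape of the two
strategies:

    /* Illustrative only: one parallel job per ~1.5k packet versus one
     * job for the whole bundle split out of a GSO superpacket. */
    #include <stdio.h>
    #include <stddef.h>

    #define MTU      1500
    #define GSO_SIZE 65536

    struct packet { size_t offset, len; };

    static int next_job_id;

    /* Stand-in for handing a packet to a worker: just print which job
     * (and hence, in the real design, which CPU) would encrypt it. */
    static void encrypt_and_send(const struct packet *p, int job_id)
    {
        printf("job %3d encrypts bytes [%6zu, %6zu)\n",
               job_id, p->offset, p->offset + p->len);
    }

    /* Old scheme: each little packet becomes its own parallel job, so
     * the packets of one superpacket are spread across many CPUs. */
    static void xmit_per_packet(size_t superpacket_len)
    {
        for (size_t off = 0; off < superpacket_len; off += MTU) {
            struct packet p = { off, off + MTU <= superpacket_len
                                     ? MTU : superpacket_len - off };
            encrypt_and_send(&p, next_job_id++);
        }
    }

    /* New scheme: the whole bundle is a single parallel job, so every
     * packet from the same superpacket is encrypted on one CPU. */
    static void xmit_per_bundle(size_t superpacket_len)
    {
        int job_id = next_job_id++;
        for (size_t off = 0; off < superpacket_len; off += MTU) {
            struct packet p = { off, off + MTU <= superpacket_len
                                     ? MTU : superpacket_len - off };
            encrypt_and_send(&p, job_id);
        }
    }

    int main(void)
    {
        puts("old: per-packet jobs");
        xmit_per_packet(GSO_SIZE);
        puts("new: one job per superpacket");
        xmit_per_bundle(GSO_SIZE);
        return 0;
    }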

In practice, this means that if you call send(tcp_socket_fd, buffer,
biglength), each 65k contiguous chunk of the buffer will be encrypted
on the same CPU. Before, only each 1.5k contiguous chunk would be
encrypted on the same CPU.
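
As a rough illustration of what that change in granularity means for
the number of parallel jobs (assuming a 1 MiB send and the usual
~1.5k / ~65k chunk sizes; illustrative numbers only):

    /* How many parallel encryption jobs a single large send() turns
     * into, old scheme vs. new scheme (illustrative arithmetic). */
    #include <stdio.h>

    int main(void)
    {
        const long biglength = 1L << 20;  /* e.g. send(fd, buffer, 1 MiB) */
        const long mtu_chunk = 1500;      /* ~1.5k per on-the-wire packet */
        const long gso_chunk = 65536;     /* ~65k per GSO superpacket     */

        long old_jobs = (biglength + mtu_chunk - 1) / mtu_chunk;  /* 700 */
        long new_jobs = (biglength + gso_chunk - 1) / gso_chunk;  /*  16 */

        printf("old scheme: %ld jobs of ~%ld bytes each\n", old_jobs, mtu_chunk);
        printf("new scheme: %ld jobs of ~%ld bytes each\n", new_jobs, gso_chunk);
        return 0;
    }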

I had thought about doing this a long time ago, but didn't, due to
reasons that are now fuzzy to me. I believe it had something to do
with latency. But at the moment, I think this solution will actually
reduce latency on systems with lots of cores, since it means those
cores don't all have to be synchronized before a bundle can be sent
out. I haven't measured this yet, and I welcome any such tests. The
magic commit for this is [1], if you'd like to compare before and
after.

Are there any obvious objections I've overlooked with this simplified approach?

Thanks,
Jason

[1] https://git.zx2c4.com/WireGuard/commit/?id=7901251422e55bcd55ab04afb7fb390983593e39


* Re: [WireGuard] Major Queueing Algorithm Simplification
From: Toke Høiland-Jørgensen @ 2016-11-04 14:45 UTC
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

"Jason A. Donenfeld" <Jason@zx2c4.com> writes:

> Hey,
>
> This might be of interest...
>
> Before, every time I got a GSO superpacket from the kernel, I'd split
> it into little packets, and then queue each little packet as a
> different parallel job.
>
> Now, every time I get a GSO superpacket from the kernel, I split it
> into little packets and queue up that whole bundle of packets as a
> single parallel job. This means that each GSO superpacket expansion
> gets processed on a single CPU. This greatly simplifies the algorithm
> and delivers impressive throughput gains.
>
> In practice, this means that if you call send(tcp_socket_fd, buffer,
> biglength), each 65k contiguous chunk of the buffer will be encrypted
> on the same CPU. Before, only each 1.5k contiguous chunk would be
> encrypted on the same CPU.
>
> I had thought about doing this a long time ago, but didn't, due to
> reasons that are now fuzzy to me. I believe it had something to do
> with latency. But at the moment, I think this solution will actually
> reduce latency on systems with lots of cores, since it means those
> cores don't all have to be synchronized before a bundle can be sent
> out. I haven't measured this yet, and I welcome any such tests. The
> magic commit for this is [1], if you'd like to compare before and
> after.
>
> Are there any obvious objections I've overlooked with this simplified
> approach?

My guess would be that it will worsen latency. You now basically have
head-of-line blocking, where an entire superpacket needs to be
processed before another flow gets to transmit a single packet.

I guess this also means that the total amount of data being processed
at any one time increases? I.e., before you would have (max number of
jobs * 1.5K) bytes queued up for encryption at once, whereas now you
will have (max number of jobs * 65K) bytes? That can be a substantial
amount of latency.
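
Rough numbers, just to illustrate the concern (assuming, say, 8
parallel job slots and a 1 Gbit/s link; I don't know the actual
values):

    /* Worst-case bytes sitting in the encryption queue, and how long
     * that backlog takes to drain at line rate (assumed numbers). */
    #include <stdio.h>

    int main(void)
    {
        const double max_jobs  = 8;       /* assumed parallel job slots */
        const double small_job = 1500;    /* bytes per job, old scheme  */
        const double big_job   = 65536;   /* bytes per job, new scheme  */
        const double rate_Bps  = 1e9 / 8; /* 1 Gbit/s in bytes/second   */

        double old_bytes = max_jobs * small_job;
        double new_bytes = max_jobs * big_job;

        printf("old: %8.0f bytes queued -> %.2f ms to drain\n",
               old_bytes, old_bytes / rate_Bps * 1e3);
        printf("new: %8.0f bytes queued -> %.2f ms to drain\n",
               new_bytes, new_bytes / rate_Bps * 1e3);
        return 0;
    }

So at those (assumed) numbers the backlog goes from draining in a
fraction of a millisecond to taking several milliseconds.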

But really, instead of guessing, why not measure?

Simply run `flent tcp_4up <hostname>` through the tunnel (this
requires a netperf server instance running on <hostname>) and look at
the latency graph. The TCP flows will start five seconds after the
ping flow; this shouldn't cause the ping RTT to rise by more than
~10ms or so. And of course, it's also important to try this on a
machine that does *not* have a gazillion megafast cores :)

There's an Ubuntu PPA to get Flent, or you can just `pip install flent`.
See https://flent.org/intro.html#installing-flent

-Toke


* Re: [WireGuard] Major Queueing Algorithm Simplification
From: Jason A. Donenfeld @ 2016-11-04 16:36 UTC
  To: Toke Høiland-Jørgensen; +Cc: WireGuard mailing list

On Fri, Nov 4, 2016 at 3:45 PM, Toke Høiland-Jørgensen <toke@toke.dk> wrote:
> But really, instead of guessing why not measure?
>
> Simply run `flent tcp_4up <hostname>` through the tunnel (this
> requires a netperf server instance running on <hostname>) and look at
> the latency graph. The TCP flows will start five seconds after the
> ping flow; this shouldn't cause the ping RTT to rise by more than
> ~10ms or so. And of course, it's also important to try this on a
> machine that does *not* have a gazillion megafast cores :)
>
> There's an Ubuntu PPA to get Flent, or you can just `pip install flent`.
> See https://flent.org/intro.html#installing-flent

I maintain the Gentoo package for Flent, actually. I'm doing some
benchmarks now...

