Development discussion of WireGuard
* [WireGuard] fq, ecn, etc with wireguard
@ 2016-08-27 21:03 Dave Taht
  2016-08-27 21:33 ` jens
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Dave Taht @ 2016-08-27 21:03 UTC (permalink / raw)
  To: wireguard

I have been running a set of tinc based vpns for a long time now, and
based on the complexity of the codebase, and some general flakiness
and slowness, I am considering fiddling with wireguard for a
replacement of it. The review of it over on
https://plus.google.com/+gregkroahhartman/posts/NoGTVYbBtiP?hl=en was
pretty inspiring.

My principal work is on queueing algorithms (like fq_codel, and cake),
and what I'm working on now is primarily adding these algos to wifi,
but I do need a working vpn, and have longed to improve latency and
loss recovery on vpns for quite some time now.

A) does wireguard handle ecn encapsulation/decapsulation?

https://tools.ietf.org/html/draft-ietf-tsvwg-ecn-encap-guidelines-07

Doing ecn "right" through vpn with a bottleneck router with a fq_codel
enabled qdisc allows for zero induced packet loss and good congestion
control.

B) I see that "noqueue" is the default qdisc for wireguard. What is
the maximum outstanding queue depth held internally? How is it
configured? I imagine it is a strict fifo queue, and that wireguard
bottlenecks on the crypto step and drops on reads... eventually.
Managing the queue length looks to be helpful especially in the
openwrt/lede case.

(we have managed to successfully apply something fq_codel-like within
the mac80211 layer, see various blog entries of mine and the ongoing
work on the make-wifi-fast mailing list)

So managing the inbound queue for wireguard well, to hold induced
latencies down to bare minimums when going from 1Gbit to XMbit, and
it's bottlenecked on wireguard, rather than an external router, is on
my mind. Got a pretty nice hammer in the fq_codel code, not sure why
you have noqueue as the default.

C) One flaw of fq_codel is that multiplexing multiple outbound flows
over a single connection endpoint degrades that aggregate flow to
codel's behavior, and the vpn "flow" competes evenly with all other
flows. A classic pure aqm solution would be more fair to vpn
encapsulated flows than fq_codel is.

An answer to that would be to expose "fq" properties to the underlying
vpn protocol. For example, being able to specify an endpoint
identifier of 2001:db8:1234::1/118:udp_port would allow for a one to
one mapping for external fq_codel queues to internal vpn queues, and
thus vpn traffic would compete equally with non-vpn traffic at the
router. While this does expose more per flow information, the
corresponding decrease for e2e latency under load, especially for
"sparse" flows, like voip and dns, strikes me as a potential major win
(and one way to use up a bunch of ipv6 addresses in a good cause).
Doing that "right" however probably involves negotiating perfect
forward secrecy for a ton of mostly idle channels (with a separate
seqno base for each), (but I could live with merely having a /123 on
the task)
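
As a concrete sketch of the mapping I mean (hypothetical; nothing like
this exists yet in fq_codel or wireguard): hash the inner flow and use
the hash to pick one of the 1024 addresses in the /118, so that the
bottleneck router's fq_codel sees each inner flow as a distinct
5-tuple:

/* base address = the peer's endpoint /118, e.g. 2001:db8:1234::/118.
 * A /118 leaves the low 10 bits of the address free, i.e. 1024
 * distinct endpoints to spread inner flows across. */
static void flow_endpoint(unsigned char addr[16], unsigned int flow_hash)
{
        unsigned int slot = flow_hash & 0x3ff;      /* 10 bits -> 1024 slots */

        addr[14] = (addr[14] & 0xfc) | (slot >> 8); /* top 2 of the 10 bits */
        addr[15] = slot & 0xff;                     /* low 8 bits */
}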

C1) (does the current codebase work with ipv6?)

D) my end goal would be to somehow replicate the meshy characteristics
of tinc, and choosing good paths through multiple potential
connections, leveraging source specific routing and another layer 3
routing protocol like babel, but I do grok that doing that right would
take a ton more work...

Anyway, I'll go off and read some more docs and code to see if I can
answer a few of these questions myself. I am impressed by what little
I understand so far.

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-27 21:03 [WireGuard] fq, ecn, etc with wireguard Dave Taht
@ 2016-08-27 21:33 ` jens
  2016-08-27 22:03 ` Dave Taht
  2016-08-29 17:16 ` Jason A. Donenfeld
  2 siblings, 0 replies; 12+ messages in thread
From: jens @ 2016-08-27 21:33 UTC (permalink / raw)
  To: wireguard


we have done some tests with wireguard with l2tp_v3 and batman-adv on
top; this way batman-adv handles the routing.

on monday we may change from our tinc tunnel to this setup.
in test scenarios we got a stable 500-600 Mbit on a 1 Gbit cable.
Everything seems pretty stable.


On 27.08.2016 23:03, Dave Taht wrote:
> I have been running a set of tinc based vpns for a long time now, and
> based on the complexity of the codebase, and some general flakiness
> and slowness, I am considering fiddling with wireguard for a
> replacement of it. The review of it over on
> https://plus.google.com/+gregkroahhartman/posts/NoGTVYbBtiP?hl=en was
> pretty inspiring.
>
> My principal work is on queueing algorithms (like fq_codel, and cake),
> and what I'm working on now is primarily adding these algos to wifi,
> but I do need a working vpn, and have longed to improve latency and
> loss recovery on vpns for quite some time now.
>
> A) does wireguard handle ecn encapsulation/decapsulation?
>
> https://tools.ietf.org/html/draft-ietf-tsvwg-ecn-encap-guidelines-07
>
> Doing ecn "right" through vpn with a bottleneck router with a fq_codel
> enabled qdisc allows for zero induced packet loss and good congestion
> control.
>
> B) I see that "noqueue" is the default qdisc for wireguard. What is
> the maximum outstanding queue depth held internally? How is it
> configured? I imagine it is a strict fifo queue, and that wireguard
> bottlenecks on the crypto step and drops on reads... eventually.
> Managing the queue length looks to be helpful especially in the
> openwrt/lede case.
>
> (we have managed to successfully apply something fq_codel-like within
> the mac80211 layer, see various blog entries of mine and the ongoing
> work on the make-wifi-fast mailing list)
>
> So managing the inbound queue for wireguard well, to hold induced
> latencies down to bare minimums when going from 1Gbit to XMbit, and
> it's bottlenecked on wireguard, rather than an external router, is on
> my mind. Got a pretty nice hammer in the fq_codel code, not sure why
> you have noqueue as the default.
>
> C) One flaw of fq_codel is that multiplexing multiple outbound flows
> over a single connection endpoint degrades that aggregate flow to
> codel's behavior, and the vpn "flow" competes evenly with all other
> flows. A classic pure aqm solution would be more fair to vpn
> encapsulated flows than fq_codel is.
>
> An answer to that would be to expose "fq" properties to the underlying
> vpn protocol. For example, being able to specify an endpoint
> identifier of 2001:db8:1234::1/118:udp_port would allow for a one to
> one mapping for external fq_codel queues to internal vpn queues, and
> thus vpn traffic would compete equally with non-vpn traffic at the
> router. While this does expose more per flow information, the
> corresponding decrease for e2e latency under load, especially for
> "sparse" flows, like voip and dns, strikes me as a potential major win
> (and one way to use up a bunch of ipv6 addresses in a good cause).
> Doing that "right" however probably involves negotiating perfect
> forward secrecy for a ton of mostly idle channels (with a separate
> seqno base for each), (but I could live with merely having a /123 on
> the task)
>
> C1) (does the current codebase work with ipv6?)
>
> D) my end goal would be to somehow replicate the meshy characteristics
> of tinc, and choosing good paths through multiple potential
> connections, leveraging source specific routing and another layer 3
> routing protocol like babel, but I do grok that doing that right would
> take a ton more work...
>
> Anyway, I'll go off and read some more docs and code to see if I can
> answer a few of these questions myself. I am impressed by what little
> I understand so far.
>

-- 
make the world nicer, please use PGP encryption




* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-27 21:03 [WireGuard] fq, ecn, etc with wireguard Dave Taht
  2016-08-27 21:33 ` jens
@ 2016-08-27 22:03 ` Dave Taht
  2016-08-29 17:16 ` Jason A. Donenfeld
  2 siblings, 0 replies; 12+ messages in thread
From: Dave Taht @ 2016-08-27 22:03 UTC (permalink / raw)
  To: wireguard

I cited the wrong ecn draft:

https://tools.ietf.org/id/draft-ietf-tsvwg-ecn-tunnel-10.txt

Despite the complexity of the draft, basically copying the inner ecn
bits to the outer header on encap, and or-ing the ecn bits into the
inner header (except when they are 00) on decap, seems to be the
"right thing" nowadays.
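
In rough C, that rule is the following (a sketch only; I've written
the masks out instead of using the kernel's INET_ECN_* macros):

/* The low two bits of the TOS byte carry the ECN field:
 * 00 Not-ECT, 01 ECT(1), 10 ECT(0), 11 CE. */
#define ECN_MASK 0x03
#define ECN_CE   0x03

/* encap: copy the inner ECN bits into the (otherwise zeroed) outer field */
static unsigned char ecn_encap(unsigned char inner_tos)
{
        return inner_tos & ECN_MASK;
}

/* decap: if the outer header picked up CE in transit, propagate it to
 * the inner header -- but only if the inner packet declared itself
 * ECN-capable (bits not 00) */
static unsigned char ecn_decap(unsigned char outer_tos, unsigned char inner_tos)
{
        if ((outer_tos & ECN_MASK) == ECN_CE && (inner_tos & ECN_MASK) != 0)
                inner_tos |= ECN_CE;
        return inner_tos;
}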


On Sat, Aug 27, 2016 at 2:03 PM, Dave Taht <dave.taht@gmail.com> wrote:
> I have been running a set of tinc based vpns for a long time now, and
> based on the complexity of the codebase, and some general flakiness
> and slowness, I am considering fiddling with wireguard for a
> replacement of it. The review of it over on
> https://plus.google.com/+gregkroahhartman/posts/NoGTVYbBtiP?hl=en was
> pretty inspiring.
>
> My principal work is on queueing algorithms (like fq_codel, and cake),
> and what I'm working on now is primarily adding these algos to wifi,
> but I do need a working vpn, and have longed to improve latency and
> loss recovery on vpns for quite some time now.
>
> A) does wireguard handle ecn encapsulation/decapsulation?
>
> https://tools.ietf.org/html/draft-ietf-tsvwg-ecn-encap-guidelines-07
>
> Doing ecn "right" through vpn with a bottleneck router with a fq_codel
> enabled qdisc allows for zero induced packet loss and good congestion
> control.
>
> B) I see that "noqueue" is the default qdisc for wireguard. What is
> the maximum outstanding queue depth held internally? How is it
> configured? I imagine it is a strict fifo queue, and that wireguard
> bottlenecks on the crypto step and drops on reads... eventually.
> Managing the queue length looks to be helpful especially in the
> openwrt/lede case.
>
> (we have managed to successfully apply something fq_codel-like within
> the mac80211 layer, see various blog entries of mine and the ongoing
> work on the make-wifi-fast mailing list)
>
> So managing the inbound queue for wireguard well, to hold induced
> latencies down to bare minimums when going from 1Gbit to XMbit, and
> it's bottlenecked on wireguard, rather than an external router, is on
> my mind. Got a pretty nice hammer in the fq_codel code, not sure why
> you have noqueue as the default.
>
> C) One flaw of fq_codel is that multiplexing multiple outbound flows
> over a single connection endpoint degrades that aggregate flow to
> codel's behavior, and the vpn "flow" competes evenly with all other
> flows. A classic pure aqm solution would be more fair to vpn
> encapsulated flows than fq_codel is.
>
> An answer to that would be to expose "fq" properties to the underlying
> vpn protocol. For example, being able to specify an endpoint
> identifier of 2001:db8:1234::1/118:udp_port would allow for a one to
> one mapping for external fq_codel queues to internal vpn queues, and
> thus vpn traffic would compete equally with non-vpn traffic at the
> router. While this does expose more per flow information, the
> corresponding decrease for e2e latency under load, especially for
> "sparse" flows, like voip and dns, strikes me as a potential major win
> (and one way to use up a bunch of ipv6 addresses in a good cause).
> Doing that "right" however probably involves negotiating perfect
> forward secrecy for a ton of mostly idle channels (with a separate
> seqno base for each), (but I could live with merely having a /123 on
> the task)
>
> C1) (does the current codebase work with ipv6?)
>
> D) my end goal would be to somehow replicate the meshy characteristics
> of tinc, and choosing good paths through multiple potential
> connections, leveraging source specific routing and another layer 3
> routing protocol like babel, but I do grok that doing that right would
> take a ton more work...
>
> Anyway, I'll go off and read some more docs and code to see if I can
> answer a few of these questions myself. I am impressed by what little
> I understand so far.
>
> --
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> http://blog.cerowrt.org



-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-27 21:03 [WireGuard] fq, ecn, etc with wireguard Dave Taht
  2016-08-27 21:33 ` jens
  2016-08-27 22:03 ` Dave Taht
@ 2016-08-29 17:16 ` Jason A. Donenfeld
  2016-08-29 19:23   ` Jason A. Donenfeld
  2016-08-30  0:24   ` Dave Taht
  2 siblings, 2 replies; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-08-29 17:16 UTC (permalink / raw)
  To: Dave Taht; +Cc: WireGuard mailing list

Hey Dave,

You're exactly the sort of person I've been hoping would appear during the
last several months. Indeed there are a lot of interesting queueing things
happening with WireGuard. I'll detail them inline below.

> I have been running a set of tinc based vpns for a long time now, and
> based on the complexity of the codebase, and some general flakiness
> and slowness, I am considering fiddling with wireguard for a
> replacement of it. The review of it over on
> https://plus.google.com/+gregkroahhartman/posts/NoGTVYbBtiP?hl=en was
> pretty inspiring.

Indeed this seems to be a very common use case of WireGuard -- replacing
complex userspace things with something fast and simple. You've come to the
right place. :-P

> My principal work is on queueing algorithms (like fq_codel, and cake),
> and what I'm working on now is primarily adding these algos to wifi,
> but I do need a working vpn, and have longed to improve latency and
> loss recovery on vpns for quite some time now.

Great.

>
> A) does wireguard handle ecn encapsulation/decapsulation?
>
> https://tools.ietf.org/html/draft-ietf-tsvwg-ecn-encap-guidelines-07
>
> Doing ecn "right" through vpn with a bottleneck router with a fq_codel
> enabled qdisc allows for zero induced packet loss and good congestion
> control.

At the moment I don't do anything special with DSCP or ECN. I set a high
priority DSCP for the handshake messages, but for the actual transport
packets, I leave it at zero:

https://git.zx2c4.com/WireGuard/tree/src/send.c#n137

This has been a TODO item for quite some time; it's on wireguard.io/roadmap
too. The reason I've left it at zero, thus far, is that I didn't want to
infoleak anything about the underlying data. Is there a case to be made,
however, that ECN doesn't leak data like DSCP does, and so I'd be okay just
copying those top bits? I'll read the IETF draft you sent and see if I can
come up with something. It does have an important utility; you're right.

> B) I see that "noqueue" is the default qdisc for wireguard. What is
> the maximum outstanding queue depth held internally? How is it
> configured? I imagine it is a strict fifo queue, and that wireguard
> bottlenecks on the crypto step and drops on reads... eventually.
> Managing the queue length looks to be helpful especially in the
> openwrt/lede case.
>
> (we have managed to successfully apply something fq_codel-like within
> the mac80211 layer, see various blog entries of mine and the ongoing
> work on the make-wifi-fast mailing list)
>
> So managing the inbound queue for wireguard well, to hold induced
> latencies down to bare minimums when going from 1Gbit to XMbit, and
> it's bottlenecked on wireguard, rather than an external router, is on
> my mind. Got a pretty nice hammer in the fq_codel code, not sure why
> you have noqueue as the default.

There are a couple reasons. Originally I used multiqueue and had a separate
subqueue for each peer. I then abused starting and stopping these subqueues as
the various peers negotiated handshakes. This worked, but it was quite
limiting for a number of reasons, leading me to ultimately switch to noqueue.

Using noqueue gives me a couple benefits. First, packet transmission calls my
xmit function directly, which means I can trivially check for routing loops
using dev_recursion_level(). Second, it allows me to return things like
`-ENOKEY` from the xmit function, which gets directly passed up to userspace,
giving more interesting error messages than ICMP handles (though I also
support ICMP). But the main reason is because it fits the current queuing
design of WireGuard. I'll explain:

A WireGuard device has multiple peers. Either there's an active session for a
peer, in which case the packet can be encrypted and sent, or there isn't, in
which case it's queued up until a session is established. If a peer doesn't
have a session, after queuing up that packet, the session handshake occurs,
and immediately following, the queue is released and the packet is sent. This
has the effect of making WireGuard appear "stateless" to userspace. The
administrator set up all the peer details, and then typed `ping peer`, and
then it just worked. Where did the connection happen? That's what happens
behind the scenes in WireGuard. So each peer has its own queue. I limit each queue
to 1024 packets, somewhat arbitrarily. As the queue exceeds 1024, the oldest
packets are dropped first.
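
As a toy model, that drop-oldest behavior amounts to the following
(illustrative C only; the real code works on kernel sk_buff lists, so
these types and names are made up):

#include <stddef.h>

#define QUEUE_LIMIT 1024

struct pkt { struct pkt *next; /* payload omitted */ };

struct peer_queue {
        struct pkt *head, *tail;
        unsigned int len;
};

/* Enqueue at the tail; once over the limit, drop from the head, so
 * the oldest packets go first. */
static void enqueue_drop_oldest(struct peer_queue *q, struct pkt *p)
{
        p->next = NULL;
        if (q->tail)
                q->tail->next = p;
        else
                q->head = p;
        q->tail = p;
        ++q->len;
        while (q->len > QUEUE_LIMIT) {
                struct pkt *old = q->head;
                q->head = old->next;
                if (!q->head)
                        q->tail = NULL;
                --q->len;
                /* free/drop 'old' here */
        }
}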

There's another hitch: Linux supports "super packets" for GSO. Essentially,
the kernel will hand off a massive TCP packet -- 65k -- to the device driver,
if requested, expecting the device driver to then segment this into MTU-sized
bites. This was designed for hardware that has built-in TCP segmentation and
such. I found it was very performant to do the same with WireGuard. The reason
is that every time a final encrypted packet is transmitted, it has to traverse
the big complicated Linux networking stack. In order to reduce cache misses, I
prefer to transmit a bunch of packets at once. Please read this LKML post
where I detail this a bit more (Ctrl+F for "It makes use of a few tricks"),
and then return to this email:

http://lkml.iu.edu/hypermail/linux/kernel/1606.3/02833.html
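
In rough code, the segmentation step looks something like this (a
sketch, not the actual source: skb_gso_segment() is the stock kernel
helper for splitting a superpacket, while peer_enqueue() is a made-up
stand-in for the per-peer queueing described above):

static void queue_super_packet(struct peer *peer, struct sk_buff *skb)
{
        struct sk_buff *segs, *next;

        /* split the 65k superpacket into a list of MTU-sized skbs */
        segs = skb_gso_segment(skb, 0);
        if (IS_ERR(segs))
                return; /* drop on segmentation failure */
        dev_kfree_skb(skb); /* done with the original superpacket */

        for (; segs; segs = next) {
                next = segs->next;
                segs->next = NULL;
                peer_enqueue(peer, segs); /* may briefly exceed the cap */
        }
}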

The next thing is that I support parallel encryption, which means encrypting
these bundles of packets is asynchronous.

All these requirements would lead you to think that this is all super
complicated and horrible, but I actually managed to put this together in a
decently simple way. Here's the queuing algorithm all together:
https://git.zx2c4.com/WireGuard/tree/src/device.c#n101

1. user sends a packet. xmit() in device.c is called.
2. look up to which peer we're sending this packet.
3. if we have >1024 packets in that peer's queue, remove the oldest ones.
4. segment the super packet into normal MTU-sized packets, and queue those
   up. note that this may allow the queue to temporarily exceed 1024 packets,
   which is fine.
5. try to encrypt&send the entire queue.

Here's what step 5 looks like, found in packet_send_queue() in send.c:
https://git.zx2c4.com/WireGuard/tree/src/send.c#n159

1. immediately empty the entire queue, putting it into a local temp queue.
2. if the queue is empty, return. if the queue only has one packet that's
   less than or equal to 256 bytes, don't parallelize it.
3. for each packet in the queue, send it off to the asynchronous encryption
   a. if that returns '-ENOKEY', it means we don't have a valid session, so
      we should initiate one, and then do (b) too.
   b. if that returns '-ENOKEY' or '-EBUSY' (workqueue is at kernel limit),
      we put that packet and all the ones after it from the local queue back
      into the peer's queue.
   c. if we fail for any other reason, we drop that packet, and then keep
      processing the rest of the queue.
4. we tell userspace "ok! sent!"
5. when the packets that were successfully submitted finish encrypting
   (asynchronously), we transmit the encrypted packets in a tight loop
   to reduce cache misses in the networking stack.
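
Condensed into C-flavored pseudocode (the skb_queue_* calls are stock
kernel list helpers; steal_entire_queue(), encrypt_async(),
initiate_handshake(), and requeue_from_here() are made-up stand-ins
for the real symbols in device.c and send.c, linked above):

static void peer_send_queue(struct peer *peer)
{
        struct sk_buff_head local;
        struct sk_buff *skb;

        skb_queue_head_init(&local);
        steal_entire_queue(peer, &local);         /* 1: empty into temp queue */
        if (skb_queue_empty(&local))
                return;                           /* 2: nothing to send */

        while ((skb = skb_dequeue(&local))) {     /* 3: async encryption */
                int ret = encrypt_async(peer, skb);
                if (ret == -ENOKEY)
                        initiate_handshake(peer); /* 3a: no session yet */
                if (ret == -ENOKEY || ret == -EBUSY) {
                        requeue_from_here(peer, skb, &local); /* 3b */
                        break;
                }
                if (ret < 0)
                        kfree_skb(skb);           /* 3c: drop, keep going */
        }
        /* 4: report success; 5: finished packets are transmitted in a
         * tight loop from the crypto completion callback */
}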

That's it! It's pretty basic. I do wonder if this has some problems, and if
you have some suggestions on how to improve it, or what to replace it with.
I'm open to all suggestions here.

One thing, for example, that I haven't yet worked out is better scheduling for
submitting packets to different threads for encryption. Right now I just evenly
distribute them, one by one, and then wait until they're finished. Clearly
better performance could be achieved by chunking them somehow.

> C) One flaw of fq_codel is that multiplexing multiple outbound flows
> over a single connection endpoint degrades that aggregate flow to
> codel's behavior, and the vpn "flow" competes evenly with all other
> flows. A classic pure aqm solution would be more fair to vpn
> encapsulated flows than fq_codel is.
>
> An answer to that would be to expose "fq" properties to the underlying
> vpn protocol. For example, being able to specify an endpoint
> identifier of 2001:db8:1234::1/118:udp_port would allow for a one to
> one mapping for external fq_codel queues to internal vpn queues, and
> thus vpn traffic would compete equally with non-vpn traffic at the
> router. While this does expose more per flow information, the
> corresponding decrease for e2e latency under load, especially for
> "sparse" flows, like voip and dns, strikes me as a potential major win
> (and one way to use up a bunch of ipv6 addresses in a good cause).
> Doing that "right" however probably involves negotiating perfect
> forward secrecy for a ton of mostly idle channels (with a separate
> seqno base for each), (but I could live with merely having a /123 on
> the task)

Do you mean to suggest that there be a separate wireguard session for each
4-tuple?

> C1) (does the current codebase work with ipv6?)

Yes, very well, out of the box, from day 1. You can do v6-in-v6, v4-in-v4,
v4-in-v6, and v6-in-v4.

> D) my end goal would be to somehow replicate the meshy characteristics
> of tinc, and choosing good paths through multiple potential
> connections, leveraging source specific routing and another layer 3
> routing protocol like babel, but I do grok that doing that right would
> take a ton more work...

That'd be great. I'm trying to find a chance to sit down with the fella behind
Babel one of these days. I'd love to get these working well together.

> Anyway, I'll go off and read some more docs and code to see if I can
> answer a few of these questions myself. I am impressed by what little
> I understand so far.

Great! Let me know what you find. Feel free to find me in IRC (zx2c4 in
#wireguard on freenode) if you'd like to chat about this all in realtime.

Regards,
Jason


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-29 17:16 ` Jason A. Donenfeld
@ 2016-08-29 19:23   ` Jason A. Donenfeld
  2016-08-29 19:50     ` Dave Taht
  2016-08-29 20:15     ` Dave Taht
  2016-08-30  0:24   ` Dave Taht
  1 sibling, 2 replies; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-08-29 19:23 UTC (permalink / raw)
  To: Dave Taht; +Cc: WireGuard mailing list

Hi again,

So I implemented a first stab of this, which I intend to refine with
your feedback:

    https://git.zx2c4.com/WireGuard/commit/?id=a2dfc902e942cce8d5da4a42d6aa384413e7fc81


On the way out, the ECN is set to:

outgoing_skb->tos = encap_ecn(0, inner_skb->tos);

where encap_ecn is defined as:

u8 encap_ecn(u8 outer, u8 inner)
{
        outer &= ~INET_ECN_MASK;
        outer |= !INET_ECN_is_ce(inner) ? (inner & INET_ECN_MASK) :
                                          INET_ECN_ECT_0;
        return outer;
}

Since outer goes in as 0, this function can be reduced to simply:

outgoing_skb->tos = !INET_ECN_is_ce(inner_skb->tos) ? (inner_skb->tos
& INET_ECN_MASK) : INET_ECN_ECT_0;

QUESTION A: is 0 a good value to use here as outer? Or, in fact,
should I use the tos value that comes from the routing table for the
outer route?


On the way in, the ECN is set to:

if (INET_ECN_is_ce(outer_skb->tos))
        IP_ECN_set_ce(inner_skb->tos)

I do NOT compute the following:

        if (INET_ECN_is_not_ect(inner)) {
                switch (outer & INET_ECN_MASK) {
                case INET_ECN_NOT_ECT:
                        return EVERYTHING_IS_OKAY;
                case INET_ECN_ECT_0:
                case INET_ECN_ECT_1:
                        return BROKEN_SO_LOG_PACKET;
                case INET_ECN_CE:
                        return BROKEN_SO_DROP_PACKET;
                }
        }

QUESTION B: is it okay that I do not compute the above checks? Or is
this potentially very problematic?


I await your answer on questions A and B.

Thanks,
Jason


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-29 19:23   ` Jason A. Donenfeld
@ 2016-08-29 19:50     ` Dave Taht
  2016-08-29 20:15     ` Dave Taht
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Taht @ 2016-08-29 19:50 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

Nice to see you so quickly being productive. I am still constructing a
reply to your previous message.

Rather than try to expand your macros, my mental model on encode is

if (inner_dscp & 3) outer_dscp = (outer_dscp & ~3) | (inner_dscp & 3);

decode is different. A bad actor could, for example, flip the outer
ecn bits from ect(1) to ect(0) (which have different meanings in the
l4s effort in the ietf),  or set the outer to CE (one evil ISP did
this until the worldwide test by apple last year for ecn capability
got them to fix it), when the inner is not ECN capable at all.

itos = inner_dscp & 3; otos = outer_dscp & 3;
if (itos != 0 && otos == 3)
        inner_dscp |= 3; /* propagate CE to the inner header */

I see you are using the cb to temporarily store these bits. If we end
up sneaking fq_codel into there, you'll also need space for a
timestamp and another field.....

I didn't know ip_tunnel_get_dsfield() even existed!


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-29 19:23   ` Jason A. Donenfeld
  2016-08-29 19:50     ` Dave Taht
@ 2016-08-29 20:15     ` Dave Taht
  2016-08-29 21:00       ` Jason A. Donenfeld
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Taht @ 2016-08-29 20:15 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

To try and answer your actual questions...

On Mon, Aug 29, 2016 at 12:23 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hi again,
>
> So I implemented a first stab of this, which I intend to refine with
> your feedback:
>
>     https://git.zx2c4.com/WireGuard/commit/?id=a2dfc902e942cce8d5da4a42d6aa384413e7fc81
>
>
> On the way out, the ECN is set to:
>
> outgoing_skb->tos = encap_ecn(0, inner_skb->tos);
>
> where encap_ecn is defined as:
>
> u8 encap_ecn(u8 outer, u8 inner)
> {
>         outer &= ~INET_ECN_MASK;
>         outer |= !INET_ECN_is_ce(inner) ? (inner & INET_ECN_MASK) :
>                                           INET_ECN_ECT_0;
>         return outer;
> }
>
> Since outer goes in as 0, this function can be reduced to simply:
>
> outgoing_skb->tos = !INET_ECN_is_ce(inner_skb->tos) ? (inner_skb->tos
> & INET_ECN_MASK) : INET_ECN_ECT_0;
>
> QUESTION A: is 0 a good value to use here as outer? Or, in fact,
> should I use the tos value that comes from the routing table for the
> outer route?

The outer routing table is consulted, based on the packet, to make the
routing decision in the first place.

As dscp values are in general not preserved end to end, and can cause
re-ordering when they are, it's best to use your own dscp value
consistently for the outer header and not vary it within the vpn flow
based on the inner header.

There is a keyword in the ip command (inherit) that can be applied to
switch on or off these behaviors.

Short answer is - stick with 0.

>
> On the way in, the ECN is set to:
>
> if (INET_ECN_is_ce(outer_skb->tos))
>         IP_ECN_set_ce(inner_skb->tos)

This is not correct. (I think my definitions of in and out are different)

if (INET_ECN_is_ce(outer_skb->tos) && (inner_skb->tos & 3) != 0) //
sorry don't have the macro in my head
           IP_ECN_set_ce(inner_skb->tos)



>
> I do NOT compute the following:
>
>         if (INET_ECN_is_not_ect(inner)) {
>                 switch (outer & INET_ECN_MASK) {
>                 case INET_ECN_NOT_ECT:
>                         return EVERYTHING_IS_OKAY;
>                 case INET_ECN_ECT_0:
>                 case INET_ECN_ECT_1:
>                         return BROKEN_SO_LOG_PACKET;
>                 case INET_ECN_CE:
>                         return BROKEN_SO_DROP_PACKET;
>                 }
>         }
>
> QUESTION B: is it okay that I do not compute the above checks? Or is
> this potentially very problematic?
>
>
> I await your answer on questions A and B.
>
> Thanks,
> Jason



-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-29 20:15     ` Dave Taht
@ 2016-08-29 21:00       ` Jason A. Donenfeld
  2016-08-29 21:11         ` Dave Taht
  2016-08-29 23:24         ` Dave Taht
  0 siblings, 2 replies; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-08-29 21:00 UTC (permalink / raw)
  To: Dave Taht; +Cc: WireGuard mailing list

> Nice to see you so quickly being productive. I am still constructing a
> reply to your previous message.

Awaiting its arrival :)

> In re-reading over your message, I think not dropping the packet when
> there is an outer CE marking and no ecn enabling in the inner
> packet is probably the right thing, by Postel's law, if not, by the
> RFCs.

Vxlan, geneve, ipip, and sit all log & drop for the last condition.
Xfrm (IPsec) does not.

The RFCs seem to indicate that it should be dropped though. Check out
the function here, used by vxlan, geneve, ipip, and sit:

(A) https://git.zx2c4.com/linux/tree/include/net/inet_ecn.h#n166

IPsec uses this much shorter function here, on which I've modeled mine:

(B) https://git.zx2c4.com/linux/tree/net/ipv4/xfrm4_mode_tunnel.c#n18

> Are you in a position to test this? (pie and fq_codel both support
> ecn. My go-to script for testing stuff like this locally are the
> sqm-scripts, or cake, and enabling ecn in /etc/sysctl.conf
>
> https://www.bufferbloat.net/projects/codel/wiki/CakeTechnical/
>
> tc qdisc add dev eth0 root cake bandwidth 10mbit # or ratelimit with
> the sqm-scripts and fq_codel or pie with ecn enabled
>
> and enabling ecn in /etc/sysctl.conf
>
> sysctl -w net.ipv4.tcp_ecn=1
>
> And aggh, there's another part of your message I missed, and I haven't
> answered the first one yet.

Cool. I didn't even have the qdisc functions compiled into my kernel! But
anyway I went ahead and compiled your module and modified iproute2, and then
modified src/tests/netns.sh as follows:

    diff --git a/src/tests/netns.sh b/src/tests/netns.sh
    index 1c638d4..294dea6 100755
    --- a/src/tests/netns.sh
    +++ b/src/tests/netns.sh
    @@ -58,6 +58,11 @@ ip netns del $netns2 2>/dev/null || true
     pp ip netns add $netns0
     pp ip netns add $netns1
     pp ip netns add $netns2
    +n0 sysctl -w net.ipv4.tcp_ecn=1
    +n1 sysctl -w net.ipv4.tcp_ecn=1
    +n2 sysctl -w net.ipv4.tcp_ecn=1
    +n0 /home/zx2c4/iproute2-cake/tc/tc qdisc add dev lo root cake bandwidth 10mbit
    +
     ip0 link set up dev lo

     ip0 link add dev wg0 type wireguard

After that, I ran it and then looked at tcpdump on the lo device that connects
the two namespaces (see netns.sh for an explanation of how this works). I saw
lots of things like this:

22:41:40.386446 IP (tos 0x0, ttl 64, id 56003, offset 0, flags [none],
proto UDP (17), length 121)
    127.0.0.1.2 > 127.0.0.1.1: UDP, length 93
22:41:40.386552 IP (tos 0x2,ECT(0), ttl 64, id 56005, offset 0, flags
[none], proto UDP (17), length 1480)
    127.0.0.1.1 > 127.0.0.1.2: UDP, length 1452
22:41:40.387776 IP (tos 0x2,ECT(0), ttl 64, id 56006, offset 0, flags
[none], proto UDP (17), length 1480)
    127.0.0.1.1 > 127.0.0.1.2: UDP, length 1452

These are the outer encrypted UDP packets. I assume that the decrypted data
inside is an ACK followed by two data packets. ECT is marked for the data
packets, then.

Does this mean it works? How precisely do I test correct behavior?

> Short answer is - stick with 0.

Okay. In that case, when outgoing, the ECN calculation will always be:

outgoing_skb->tos = !INET_ECN_is_ce(inner_skb->tos) ? (inner_skb->tos
& INET_ECN_MASK) : INET_ECN_ECT_0;

Can you verify this is correct?

> This is not correct. (I think my definition of in and out are different)
>
> if (INET_ECN_is_ce(outer_skb->tos) && inner_skb->tos & 3 != 0) //
> sorry don't have the macro in my head

See (A) and (B) above. They seem to do what I'm doing.


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-29 21:00       ` Jason A. Donenfeld
@ 2016-08-29 21:11         ` Dave Taht
  2016-08-29 23:24         ` Dave Taht
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Taht @ 2016-08-29 21:11 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

well, you should see ect(3) if you pound the network interface. Things
like tcp small queues get in the way so you won't see it with a simple
single flow test against cake/codel/etc.

something like 4 netperfs will do it.

Since you are so fast at getting code running, I think you'll like
flent as a test tool. It's pretty easy to install:

sudo apt-get install python-matplotlib python-qt4 netperf fping # some
distros don't include netperf tho

# you only need netperf's netserver running on the clients

git clone https://github.com/tohojo/flent.git

cd flent; sudo make install
netserver -N # wherever you have targets

It has a zillion tests, lets you plot the results, over time, etc,
etc. The rrul test, in particular, would be a good stress test of your
code. You can see usage of flent (formerly known as netperf-wrapper)
all over the web now....

example test script (with a title of what I'm testing now)

#!/bin/sh

for i in 1 2 4 8 12 16 24;
do
flent -t "unencrypted-ht40-$i-flows-osx-ether" -H 192.168.1.201 -l 30
--test-parameter=upload_streams=$i tcp_nup
flent -t "unencrypted-ht40-$i-flows-osx-ether" -H 192.168.1.201 -l 30
--test-parameter=download_streams=$i tcp_ndown
done

flent -t "unencrypted-ht40-$i-flows-osx-ether" -H 192.168.1.201 -l 30 rrul_be
flent -t "unencrypted-ht40-$i-flows-osx-ether" -H 192.168.1.201 -l 30 rrul

flent-gui *.gz

# Anyway, I'll join you in irc to look over what you're doing.....


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-29 21:00       ` Jason A. Donenfeld
  2016-08-29 21:11         ` Dave Taht
@ 2016-08-29 23:24         ` Dave Taht
  2016-08-29 23:57           ` Jason A. Donenfeld
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Taht @ 2016-08-29 23:24 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

On Mon, Aug 29, 2016 at 2:00 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> Nice to see you so quickly being productive. I am still constructing a
>> reply to your previous message.
>
> Awaiting its arrival :)
>
>> In re-reading over your message, I think not dropping the packet when
>> there is an outer CE marking and no ecn enabling in the inner
>> packet is probably the right thing, by Postel's law, if not, by the
>> RFCs.
>
> Vxlan, geneve, ipip, and sit all log & drop for the last condition.
> Xfrm (IPsec) does not.
>
> The RFCs seem to indicate that it should be dropped though. Check out
> the function here, used by vxlan, geneve, ipip, and sit:
>
> (A) https://git.zx2c4.com/linux/tree/include/net/inet_ecn.h#n166
>
> IPsec uses this much shorter function here, on which I've modeled mine:
>
> (B) https://git.zx2c4.com/linux/tree/net/ipv4/xfrm4_mode_tunnel.c#n18

Then by all means follow the existing latest code.

Postel is long dead, and the internet is a far more hostile place.

>> Are you in a position to test this? (pie and fq_codel both support
>> ecn. My go-to script for testing stuff like this locally are the
>> sqm-scripts, or cake, and enabling ecn in /etc/sysctl.conf
>>
>> https://www.bufferbloat.net/projects/codel/wiki/CakeTechnical/
>>
>> tc qdisc add dev eth0 root cake bandwidth 10mbit # or ratelimit with
>> the sqm-scripts and fq_codel or pie with ecn enabled
>>
>> and enabling ecn in /etc/sysctl.conf
>>
>> sysctl -w net.ipv4.tcp_ecn=1
>>
>> And aggh, there's another part of your message I missed, and I haven't
>> answered the first one yet.
>
> Cool. I didn't even have the qdisc functions compiled into my kernel! But
> anyway I went ahead and compiled your module and modified iproute2, and then
> modified src/tests/netns.sh as follows:
>
>     diff --git a/src/tests/netns.sh b/src/tests/netns.sh
>     index 1c638d4..294dea6 100755
>     --- a/src/tests/netns.sh
>     +++ b/src/tests/netns.sh
>     @@ -58,6 +58,11 @@ ip netns del $netns2 2>/dev/null || true
>      pp ip netns add $netns0
>      pp ip netns add $netns1
>      pp ip netns add $netns2
>     +n0 sysctl -w net.ipv4.tcp_ecn=1
>     +n1 sysctl -w net.ipv4.tcp_ecn=1
>     +n2 sysctl -w net.ipv4.tcp_ecn=1
>     +n0 /home/zx2c4/iproute2-cake/tc/tc qdisc add dev lo root cake bandwidth 10mbit
>     +
>      ip0 link set up dev lo
>
>      ip0 link add dev wg0 type wireguard

so did cake manage to successfully ratelimit the output to 10Mbit/s in
this configuration?

cake's stats also show marks and drops. (tc -s qdisc show dev lo)

> After that, I ran it and then looked at tcpdump on the lo device that connects
> the two namespaces (see netns.sh for an explanation of how this works). I saw
> lots of things like this:
>
> 22:41:40.386446 IP (tos 0x0, ttl 64, id 56003, offset 0, flags [none],
> proto UDP (17), length 121)
>     127.0.0.1.2 > 127.0.0.1.1: UDP, length 93

acks do not get ECN marks.

(I note that mosh is also ecn enabled, last I looked)

> 22:41:40.386552 IP (tos 0x2,ECT(0), ttl 64, id 56005, offset 0, flags
> [none], proto UDP (17), length 1480)
>     127.0.0.1.1 > 127.0.0.1.2: UDP, length 1452
> 22:41:40.387776 IP (tos 0x2,ECT(0), ttl 64, id 56006, offset 0, flags
> [none], proto UDP (17), length 1480)

This shows the marking making it to the outer header. But you should
see 0x3 whenever the qdisc engages it.

>     127.0.0.1.1 > 127.0.0.1.2: UDP, length 1452
>
> These are the outer encrypted UDP packets. I assume that the decrypted data
> inside is an ACK followed by two data packets. ECT is marked for the data
> packets, then.
>
> Does this mean it works? How precisely do I test correct behavior?

I am a big fan of flent to generate tests with, and tcptrace -G on the
capture of the decrypted interface and looking at the output via
xplot.org (not xplot). Looking at the resulting capture on one end,
you will see the CE going out; on the other you will see just the CWR
and little dots showing the acks acknowledging the CE has been heard
and acted upon.

example of the latter:

http://www.taht.net/~d/typical_ecn_response.png
http://www.taht.net/~d/typica_ecn_response_closeupt.png

Don't have a pic of the former handy. A *really good intro* to
tcptrace and xplot is in Apple's presentation on ecn here, starting 16
(or was it 24?) minutes in:

https://developer.apple.com/videos/play/wwdc2015/719/

(without a mac, you can just download the video)

>> Short answer is - stick with 0.
>
> Okay. In that case, when outgoing, the ECN calculation will always be:
>
> outgoing_skb->tos = !INET_ECN_is_ce(inner_skb->tos) ? (inner_skb->tos
> & INET_ECN_MASK) : INET_ECN_ECT_0;
>
> Can you verify this is correct?
>
>> This is not correct. (I think my definition of in and out are different)
>>
>> if (INET_ECN_is_ce(outer_skb->tos) && (inner_skb->tos & 3) != 0) //
>> sorry don't have the macro in my head
>
> See (A) and (B) above. They seem to do what I'm doing.

I will. I got very busy today getting a "final" version of the
fq_codel code for ath9k wifi tested. It's lovely. :) If you are into
openwrt, we've got builds available at:

https://lists.bufferbloat.net/pipermail/make-wifi-fast/2016-August/000940.html

and patches due out tomorrow. It would be great fun to also start
fiddling with wireguard with this new stuff. I think the right thing -
now that you've found it - is to ape what the newer protocols do...
(and someone should fix linux ipsec)

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-29 23:24         ` Dave Taht
@ 2016-08-29 23:57           ` Jason A. Donenfeld
  0 siblings, 0 replies; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-08-29 23:57 UTC (permalink / raw)
  To: Dave Taht; +Cc: WireGuard mailing list

> well, you should see ect(3) if you pound the network interface. Things
> like tcp small queues get in the way so you won't see it with a simple
> single flow test against cake/codel/etc.
>
> something like 4 netperfs will do it.

It works!

01:40:57.962131 IP (tos 0x3,CE, ttl 64, id 51647, offset 0, flags
[none], proto UDP (17), length 1480)

I made this happen by giving `-P 50 -N -w 500M` to iperf3.

> sudo apt-get install python-matplotlib python-qt4 netperf fping # some
> git clone https://github.com/tohojo/flent.git
> cd flent; sudo make install
> flent-gui *.gz
> I am a big fan of flent to generate tests with, and tcptrace -G on the
> capture of the decrypted interface and looking at the output via
> xplot.org (not xplot). Looking at the resulting capture on one end,
> you will see the CE going out, on the other you will see just the CWR
> and little dots showing the acks acknowledging the CE has been heard
> and acted upon.


Excellent, thanks for the suggestions on testing tools. I've been using iperf3
nearly exclusively, and indeed it seems like netperf and others are
substantially more powerful. I've also got a directory of horrible .c programs
generating packets for stress testing and printing out text for me to pop into
Mathematica... Clearly my homebrewed rig isn't going to be useful for much
longer. I'll look into getting these set up.

> # Anyway, I'll join you in irc to look over what you're doing.....

Sorry -- when you messaged me in there, I was just getting off a big 14 hour
international flight. I've got a long layover now, and probably I'll be in IRC
again if you want to pop in there. Though, I'm quite exhausted and might
collapse on some airport benches...

> Then by all means follow the existing latest code.

Well that was my question. Should I follow (A) or (B)? IPsec does (B);
everything else does (A). Does IPsec have a security reason for doing (B)? Is
there some denial of service attack that can be mounted, or some information
disclosure, or some sort of oracle attack? Or is (A) pretty much clearly
better and more robust under abuse, and IPsec is just behind the times?
Probably I should just read the ECN RFCs to actually understand what all this
is doing and make up my mind.

> Postel is long dead, and the internet is a far more hostile place.
> so did cake manage to successfully ratelimit the output to 10Mbit's in
> this configuration?

It did. I rate limited the actual link to 10mbps, and iperf3 got around
9mbps over TCP, which seems about right considering TCP and considering the
packet encapsulation overhead.

>
> cake's stats also show marks and drops. (tc -s qdisc show dev lo)

Seems to work:

    qdisc cake 8008: root refcnt 2 bandwidth 10Mbit diffserv4 flows rtt 100.0ms raw
     Sent 82329355 bytes 95473 pkt (dropped 5295, overlimits 195966 requeues 0)
     backlog 137124b 150p requeues 0
    memory used: 502566b of 500000b
    capacity estimate: 10Mbit
                 Tin 0       Tin 1       Tin 2       Tin 3
      thresh      10Mbit    9375Kbit    7500Kbit    2500Kbit
      target      78.7ms      83.9ms     104.9ms     314.6ms
    interval     173.7ms     178.9ms     209.8ms     629.3ms
    Pk-delay         0us      93.9ms         0us         0us
    Av-delay         0us      91.2ms         0us         0us
    Sp-delay         0us      83.8ms         0us         0us
      pkts             0      100916           2           0
      bytes            0    84254676         318           0
    way-inds           0           0           0           0
    way-miss           0           1           1           0
    way-cols           0           0           0           0
      drops            0        5295           0           0
      marks            0        4660           0           0
    Sp-flows           0           0           0           0
    Bk-flows           0           1           0           0
    last-len       65536           0           0           0
    max-len            0        1494         187           0

> acks do not get ECN marks.

I'll need to think carefully about the infoleak aspects of allowing ECN. It
seems nearly as bad as the DSCP infoleak. Is this information that's okay to
be sent in the clear? I'm not quite sure yet.

> (I note that mosh is also ecn enabled, last I looked)

Cool. I recall when they added the DSCP priority, but I don't recall ECN. I'll
try that out. I'm a big Mosh user -- and the roaming feature of WireGuard was
inspired by it.

> I will. I got very busy today getting a "final" version of the
> fq_codel code for ath9k wifi tested. It's lovely. :) If you are into
> openwrt, we've got builds available at:
>
> https://lists.bufferbloat.net/pipermail/make-wifi-fast/2016-August/000940.html
>
> and patches due out tomorrow. It would be great fun to also start
> fiddling with wireguard with this new stuff. I think the right thing -
> now that you've found it - is to ape what the newer protocols do...
> (and someone shoul fix linux ipsec)

Excellent. I've just arrived in the mountains for a week and a half with my
family, so I have limited access to hardware, but I am very interested in
OpenWRT and WireGuard performance. Baptiste, an OpenWRT developer, hangs out on
this mailing list and got a WireGuard package in LEDE, which is quite cool.
That's excellent about a faster ath9k driver.


* Re: [WireGuard] fq, ecn, etc with wireguard
  2016-08-29 17:16 ` Jason A. Donenfeld
  2016-08-29 19:23   ` Jason A. Donenfeld
@ 2016-08-30  0:24   ` Dave Taht
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Taht @ 2016-08-30  0:24 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

:whew:

On Mon, Aug 29, 2016 at 10:16 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hey Dave,
>
> You're exactly the sort of person I've been hoping would appear during the
> last several months.

The bufferbloat project has had a lot of people randomly show up at
the party to make a contribution; getting a little PR in the right
places always helps. Glad to have shown up, and sorry to be so
scattered today and not reviewing detailed code.

>> A) does wireguard handle ecn encapsulation/decapsulation?
>>
>> https://tools.ietf.org/html/draft-ietf-tsvwg-ecn-encap-guidelines-07
>>
>> Doing ecn "right" through vpn with a bottleneck router with a fq_codel
>> enabled qdisc allows for zero induced packet loss and good congestion
>> control.
>
> At the moment I don't do anything special with DSCP or ECN. I set a high
> priority DSCP for the handshake messages, but for the actual transport
> packets, I leave it at zero:
>
> https://git.zx2c4.com/WireGuard/tree/src/send.c#n137
>
> This has been a TODO item for quite some time; it's on wireguard.io/roadmap
> too. The reason I've left it at zero, thus far, is that I didn't want to
> infoleak anything about the underlying data. Is there a case to be made,
> however, that ECN doesn't leak data like DSCP does, and so I'd be okay just
> copying those top bits? I'll read the IETF draft you sent and see if I can
> come up with something. It does have an important utility; you're right.

The ietf consensus was that a 2 bit covert channel wasn't useful and
that being able to expose congestion control information was ok.

>> B) I see that "noqueue" is the default qdisc for wireguard. What is
>> the maximum outstanding queue depth held internally? How is it
>> configured? I imagine it is a strict fifo queue, and that wireguard
>> bottlenecks on the crypto step and drops on reads... eventually.
>> Managing the queue length looks to be helpful especially in the
>> openwrt/lede case.
>>
>> (we have managed to successfully apply something fq_codel-like within
>> the mac80211 layer, see various blog entries of mine and the ongoing
>> work on the make-wifi-fast mailing list)
>>
>> So managing the inbound queue for wireguard well, to hold induced
>> latencies down to bare minimums when going from 1Gbit to XMbit, and
>> it's bottlenecked on wireguard, rather than an external router, is on
>> my mind. Got a pretty nice hammer in the fq_codel code, not sure why
>> you have noqueue as the default.
>
> There are a couple reasons. Originally I used multiqueue and had a separate
> subqueue for each peer. I then abused starting and stopping these subqueues as
> the various peers negotiated handshakes. This worked, but it was quite
> limiting for a number of reasons, leading me to ultimately switch to noqueue.
>
> Using noqueue gives me a couple benefits. First, packet transmission calls my
> xmit function directly, which means I can trivially check for routing loops
> using dev_recursion_level(). Second, it allows me to return things like
> `-ENOKEY` from the xmit function, which gets directly passed up to userspace,
> giving more interesting error messages than ICMP handles (though I also
> support ICMP). But the main reason is because it fits the current queuing
> design of WireGuard. I'll explain:
>
> A WireGuard device has multiple peers. Either there's an active session for a
> peer, in which case the packet can be encrypted and sent, or there isn't, in
> which case it's queued up until a session is established. If a peer doesn't
> have a session, after queuing up that packet, the session handshake occurs,
> and immediately following, the queue is released and the packet is sent. This
> has the effect of making WireGuard appear "stateless" to userspace. The
> administrator set up all the peer details, and then typed `ping peer`, and
> then it just worked. Where did the connection happen? That's what happens
> behind the scenes in WireGuard. So each peer has its own queue. I limit each
> queue to 1024 packets, somewhat arbitrarily. As the queue exceeds 1024, the
> oldest packets are dropped first.

OK, well, 1024 packets is quite a lot. Assuming TSO is in use, running
at 10Mbit for the sake of example, that's a worst case latency of ~85
*seconds* at that speed, and 98,304,000 bytes of buffering.

Even 1024 packets is a lot at a gbit, when TSO/GRO are in use: 850ms.
Devices that use soft GRO can also accumulate up to 64K packets,
although the spec is 24k; several devices violate it.

Thankfully TSO and GRO are not always invoked, and our use of fq
tends to start reducing the maximally sized burst at the endpoints to
reasonable values - like 2 superpackets.

But, even if you only have 1024 normal sized packets, that's a worst
case delay of 1.5 seconds...
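
For the record, here is the arithmetic behind those figures (assuming
64 segments of 1500 bytes per superpacket, which is what the
98,304,000-byte total implies; per-packet overhead pushes the wire
times a bit higher):

#include <stdio.h>

int main(void)
{
        double superpacket = 64 * 1500.0;     /* bytes per GSO superpacket */
        double q_super = 1024 * superpacket;  /* 98,304,000 bytes queued */
        double q_plain = 1024 * 1500.0;       /* plain MTU-sized packets */

        printf("superpackets at 10Mbit: %.1f s\n", q_super * 8 / 10e6);  /* ~78.6 s */
        printf("superpackets at 1Gbit: %.3f s\n", q_super * 8 / 1000e6); /* ~0.786 s */
        printf("plain packets at 10Mbit: %.2f s\n", q_plain * 8 / 10e6); /* ~1.23 s */
        return 0;
}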

Now, there is no "right number" for buffering, but various rules of
thumb. What we like about the new AQM designs (codel and pie) is that
they try to establish a minimal "standing queue", measured in time
(which is a proxy for bytes), not packets - 5ms in the case of codel,
16ms in the case of pie.

and they do it dynamically, based on induced latency. A typical figure
for codel's standing queue at 10 mbit is *2* full size packets,
moderated a bit by whatever BQL sets, which is 2-3k bytes.
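
The heart of codel's control law is small enough to sketch here
(simplified from the published algorithm, which among other things
also carries the drop count over between dropping episodes):

#include <math.h>

#define TARGET_US   5000    /* 5ms standing-queue target */
#define INTERVAL_US 100000  /* 100ms initial interval */

struct codel {
        int dropping;              /* in the dropping state? */
        unsigned int count;        /* drops so far in this state */
        unsigned long drop_next;   /* when the next drop is scheduled */
        unsigned long first_above; /* when sojourn first exceeded target */
};

/* Returns nonzero if this packet should be dropped (or ECN-marked).
 * 'now' is the current time and 'sojourn' is how long the packet sat
 * in the queue, both in microseconds. */
static int codel_should_drop(struct codel *c, unsigned long now,
                             unsigned long sojourn)
{
        if (sojourn < TARGET_US) {
                c->first_above = 0; /* queue drained below target */
                c->dropping = 0;
                return 0;
        }
        if (!c->first_above) {
                c->first_above = now + INTERVAL_US;
                return 0;
        }
        if (!c->dropping && now >= c->first_above) {
                c->dropping = 1;    /* above target for a full interval */
                c->count = 1;
                c->drop_next = now + INTERVAL_US;
                return 1;
        }
        if (c->dropping && now >= c->drop_next) {
                c->count++;         /* drop faster the longer it persists */
                c->drop_next = now +
                        (unsigned long)(INTERVAL_US / sqrt(c->count));
                return 1;
        }
        return 0;
}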

There are several great papers/presentations on codel in ACM Queue,
and I'm pretty fond of Van's, my, and Stephen Hemminger's talks on
the subject, linked to off of here:

https://www.bufferbloat.net/projects/cerowrt/wiki/Bloat-videos/

Anyway, having a shared queue for all peers would be more sensible,
and limiting it by bytes rather than packets (as cake does) would be
helpful. So would trying to come up with a better initial estimate
for how big the queue should be based on what outgoing interfaces are
available (e.g. is a gigE interface available? 10GigE?), and then
letting it be moderated by the aqm.

>
> There's another hitch: Linux supports "super packets" for GSO. Essentially,
> the kernel will hand off a massive TCP packet -- 65k -- to the device driver,
> if requested, expecting the device driver to then segment this into MTU-sized
> bites. This was designed for hardware that has built-in TCP segmentation and
> such. I found it was very performant to do the same with WireGuard. The reason
> is that every time a final encrypted packet is transmitted, it has to traverse
> the big complicated Linux networking stack. In order to reduce cache misses, I
> prefer to transmit a bunch of packets at once.

Well, we also break up superpackets in cake, but we do it with the
express intent of allowing other flows through. Staying with my 10Mbit
example, a single 64k superpacket would take 54ms to transmit, which blows
the jitter budget on a competing voip call.

I'm painfully aware that this costs cpu, but having shorter queues in
the first place helps, and we have experimented with breaking up
superpackets less based on the workload and bandwidth, in cake, but
haven't settled on a scheme to do so.

> Please read this LKML post
> where I detail this a bit more (Ctrl+F for "It makes use of a few tricks"),
> and then return to this email:
>
> http://lkml.iu.edu/hypermail/linux/kernel/1606.3/02833.html
>
> The next thing is that I support parallel encryption, which means encrypting
> these bundles of packets is asynchronous.

All packets in the broken up superpacket are handed to be encrypted in
parallel? cool.

Can I encourage you to try the rrul test and think about encrypting
different flows in parallel also? :)

real network traffic, particularly over a network-to-network oriented
vpn, is *never* a single bulk flow.

> All these requirements would lead you to think that this is all super
> complicated and horrible, but I actually managed to put this together in a
> decently simple way. Here's the queuing algorithm all together:
> https://git.zx2c4.com/WireGuard/tree/src/device.c#n101
>
> 1. user sends a packet. xmit() in device.c is called.
> 2. look up to which peer we're sending this packet.
> 3. if we have >1024 packets in that peer's queue, remove the oldest ones.

More than 200 is really a crazy number for a fixed length fifo at 1gbit or less.

> 4. segment the super packet into normal MTU-sized packets, and queue those
>    up. note that this may allow the queue to temporarily exceed 1024 packets,
>    which is fine.
> 5. try to encrypt&send the entire queue.
>
> Here's what step 5 looks like, found in packet_send_queue() in send.c:
> https://git.zx2c4.com/WireGuard/tree/src/send.c#n159
>
> 1. immediately empty the entire queue, putting it into a local temp queue.
> 2. if the queue is empty, return. if the queue only has one packet that's
>    less than or equal to 256 bytes, don't parallelize it.
> 3. for each packet in the queue, send it off to the asynchronous encryption
>    a. if that returns '-ENOKEY', it means we don't have a valid session, so
>       we should initiate one, and then do (b) too.
>    b. if that returns '-ENOKEY' or '-EBUSY' (workqueue is at kernel limit),
>       we put that packet and all the ones after it from the local queue back
>       into the peer's queue.
>    c. if we fail for any other reason, we drop that packet, and then keep
>       processing the rest of the queue.
> 4. we tell userspace "ok! sent!"
> 5. when the packets that were successfully submitted finish encrypting
>    (asynchronously), we transmit the encrypted packets in a tight loop
>    to reduce cache misses in the networking stack.
>
> That's it! It's pretty basic. I do wonder if this has some problems, and if
> you have some suggestions on how to improve it, or what to replace it with.
> I'm open to all suggestions here.

Well the idea of fq_codel is to break things up into 1024 different
flows. The code base is now generalized so that it can be used by the
fq_codel qdisc and the new stuff for mac80211.

But! The concept of those flows is still serialized in the end in this
codebase; you need to keep pulling stuff out of it until you are
done... using merely the idea of fq_codel and explicitly parallelizing
enqueuing would let you defer nexthop lookup and handle multiple flows
in parallel on multiple cpus.
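
The fq half of that is just hashing the inner flow's 5-tuple into one
of 1024 buckets, along these lines (a toy mixer for illustration;
fq_codel proper uses a perturbed Jenkins hash):

#define NUM_FLOWS 1024

static unsigned int flow_bucket(unsigned int saddr, unsigned int daddr,
                                unsigned short sport, unsigned short dport,
                                unsigned char proto)
{
        unsigned long long h = saddr;

        /* FNV-style mixing of the 5-tuple */
        h = h * 0x100000001b3ULL ^ daddr;
        h = h * 0x100000001b3ULL ^ ((unsigned int)sport << 16 | dport);
        h = h * 0x100000001b3ULL ^ proto;
        return (unsigned int)(h ^ (h >> 32)) & (NUM_FLOWS - 1);
}

Each bucket could then be enqueued - and encrypted - independently,
which is what would let multiple cpus work on distinct flows at once.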

> One thing, for example, that I haven't yet worked out is better scheduling for
> submitting packets to different threads for encryption. Right now I just evenly
> distribute them, one by one, and then wait until they're finished. Clearly
> better performance could be achieved by chunking them somehow.

Better crypto performance, not network performance. :) The war between
bulking up stuff to save cpu and breaking things back down again into
packets so packet theory actually works is ongoing.

>> C) One flaw of fq_codel is that multiplexing multiple outbound flows
>> over a single connection endpoint degrades that aggregate flow to
>> codel's behavior, and the vpn "flow" competes evenly with all other
>> flows. A classic pure aqm solution would be more fair to vpn
>> encapsulated flows than fq_codel is.
>>
>> An answer to that would be to expose "fq" properties to the underlying
>> vpn protocol. For example, being able to specify an endpoint
>> identifier of 2001:db8:1234::1/118:udp_port would allow for a one to
>> one mapping for external fq_codel queues to internal vpn queues, and
>> thus vpn traffic would compete equally with non-vpn traffic at the
>> router. While this does expose more per flow information, the
>> corresponding decrease for e2e latency under load, especially for
>> "sparse" flows, like voip and dns, strikes me as a potential major win
>> (and one way to use up a bunch of ipv6 addresses in a good cause).
>> Doing that "right" however probably involves negotiating perfect
>> forward secrecy for a ton of mostly idle channels (with a separate
>> seqno base for each), (but I could live with merely having a /123 on
>> the task)
>
> Do you mean to suggest that there be a separate wireguard session for each
> 4-tuple?

Sorta. Instead, you can share an IV seqno among these queues so
long as your replay protection buffer is big enough relative to the
buffering and RTT; no need to negotiate a separate connection for
each. Then you are semi-serializing the seqno access/increment, but
that's not a big issue.
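
The replay protection in question is the standard sliding-window
bitmap, sketched below with a 64-bit window (for this use the window
would have to be much larger, scaled to the extra buffering and RTT):

#define REPLAY_BITS 64

struct replay {
        unsigned long long max_seq; /* highest seqno accepted so far */
        unsigned long long bitmap;  /* bit i set => (max_seq - i) seen */
};

/* Returns nonzero if the packet is fresh; zero if it is a replay or
 * too old to judge. */
static int replay_check_update(struct replay *r, unsigned long long seq)
{
        if (seq > r->max_seq) { /* newer than anything seen: slide window */
                unsigned long long shift = seq - r->max_seq;

                r->bitmap = shift >= REPLAY_BITS ? 0 : r->bitmap << shift;
                r->bitmap |= 1;
                r->max_seq = seq;
                return 1;
        }
        if (r->max_seq - seq >= REPLAY_BITS)
                return 0; /* fell off the back of the window */
        if (r->bitmap & (1ULL << (r->max_seq - seq)))
                return 0; /* already seen: replay */
        r->bitmap |= 1ULL << (r->max_seq - seq);
        return 1;
}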

There are issues with hole punching on this, regardless, and I wasn't
suggesting even trying for ipv4! But we have a deployment window for
ipv6 where we could have fun using up tons of addresses for a noble
purpose (0 latency for sparse flows!), and routing a set of 1024
addresses into a vpn's endpoint design is possible with your
architecture. Linode gives me 4096 addresses to play with - Comcast,
a /60 or /56....

Have you seen the mosh-multipath paper? It sort of ties into your
design as well, except that since you are creating a routable device,
it makes listening on a ton of ip addresses a snap....

https://arxiv.org/pdf/1502.02402.pdf

>> C1) (does the current codebase work with ipv6?)
>
> Yes, very well, out of the box, from day 1. You can do v6-in-v6, v4-in-v4,
> v4-in-v6, and v6-in-v4.

I tried to get it going yesterday, ipv6 to ipv6, but failed with 4 tx
errors on one side and 3 on the other, reported by ifconfig, no error
messages. I'll try harder once I come down from fixing up the fq_codel
wifi code....

>> D) my end goal would be to somehow replicate the meshy characteristics
>> of tinc, and choosing good paths through multiple potential
>> connections, leveraging source specific routing and another layer 3
>> routing protocol like babel, but I do grok that doing that right would
>> take a ton more work...
>
> That'd be great. I'm trying to find a chance to sit down with the fella behind
> Babel one of these days. I'd love to get these working well together.

Juliusz hangs out on #babel on freenode, Paris time.

batman-adv is also a good choice, and bmx7 has some nice
characteristics. I'm mostly familiar with babel - can you carry ipv6
link layer multicast? (if not, we've been nagging Juliusz to add a
unicast-only mode)

One common use case for babel IS to manage a set of gre tunnels, for
which wireguard could be a drop-in replacement. You set up 30
tunnels to everywhere that can all route to everywhere, and let babel
figure out the right one. It should be reasonably robust in the face
of a nuclear holocaust, a zombie invasion, or the upcoming US election.

https://tools.ietf.org/html/draft-jonglez-babel-rtt-extension-01

BTW:

To what extent would source specific routing help solve your oif issue?

https://arxiv.org/pdf/1403.0445.pdf

we use that extensively to do things that we used to do with policy
routing, and it's a snap to use... and nearly all devices we've
played with are built with IPv6 subtrees.

but it's an ipv6-only feature; getting that into ipv4 would be nice.

>
>> Anyway, I'll go off and read some more docs and code to see if I can
>> answer a few of these questions myself. I am impressed by what little
>> I understand so far.
>
> Great! Let me know what you find. Feel free to find me in IRC (zx2c4 in
> #wireguard on freenode) if you'd like to chat about this all in realtime.
>
> Regards,
> Jason



-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org


end of thread

Thread overview: 12+ messages
2016-08-27 21:03 [WireGuard] fq, ecn, etc with wireguard Dave Taht
2016-08-27 21:33 ` jens
2016-08-27 22:03 ` Dave Taht
2016-08-29 17:16 ` Jason A. Donenfeld
2016-08-29 19:23   ` Jason A. Donenfeld
2016-08-29 19:50     ` Dave Taht
2016-08-29 20:15     ` Dave Taht
2016-08-29 21:00       ` Jason A. Donenfeld
2016-08-29 21:11         ` Dave Taht
2016-08-29 23:24         ` Dave Taht
2016-08-29 23:57           ` Jason A. Donenfeld
2016-08-30  0:24   ` Dave Taht
