On Wed, 2021-04-07 at 16:15 -0700, Daniel Lenski wrote:
> On Wed, Apr 7, 2021 at 4:49 AM David Woodhouse wrote:
> > If WintunSendPacket took an additional 'offset' argument to disregard a
> > certain number of bytes at the beginning of the buffer, that would
> > probably suffice. Or is it possible to simply add to the pointer
> > returned by WintunAllocateSendPacket()?
>
> To expand on this possibility a little bit, I had proposed kludging a
> “shift” into the allocation for the outgoing packet:
>
> /* Always use this for an outgoing packet, instead of
>  * malloc(sizeof(struct pkt) + payload_len) */
> BYTE *tun_pkt = WintunAllocateSendPacket(
>     vpninfo->wintun_session,
>     payload_len /* packet payload size */
>     + sizeof(struct pkt) /* OpenConnect's internal packet header size */
> );
>
> /* Then after we build and populate the outgoing packet, just tell
>  * Wintun to send from an offset that's NOT at the beginning of the
>  * buffer it allocated for us. No memcpy! */
> WintunSendPacket(vpninfo->wintun_session, tun_pkt + sizeof(struct pkt));
>
> The concern here is that Wintun may not have been written with this
> possibility in mind, and might not always like sending from an address
> other than the exact start of a buffer it's allocated for us.

If we have to, we could probably ditch a lot of what's in the DLL's
api/session.c and manage the ring for ourselves; the fundamental
limitation right now is how the kernel driver itself handles the shared
ring.

It uses a TUN_PACKET header which is basically just a 'ULONG Size'
followed immediately by the data. To allow the application to process
packets out of order, the *userspace* part sets the top bit of ->Size to
manage ownership between the DLL and the application, but the kernel
side doesn't see that, because it's cleared by the time the DLL bumps
the ring tail to allow the kernel to see the packet.
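To make that layout concrete, here's a rough sketch in C. The TUN_PACKET
name matches the driver's; the ownership-flag name and the helper are my
own shorthand for what api/session.c does with the top bit, so treat the
details as illustrative rather than the driver's actual declarations:

```c
#include <assert.h>
#include <stdint.h>

/* Top bit of Size: set by userspace to mark the slot as still owned by
 * the application (name is mine, not necessarily the DLL's). */
#define TUN_PACKET_OWNED_BY_APP 0x80000000u

struct tun_packet {
    uint32_t size;    /* payload length; userspace abuses the MSB */
    uint8_t  data[];  /* packet bytes follow immediately */
};

/* What the kernel sees: the DLL clears the ownership bit before it
 * bumps the ring Tail, so only the plain length ever reaches it. */
static uint32_t tun_packet_kernel_size(uint32_t size_field)
{
    return size_field & ~TUN_PACKET_OWNED_BY_APP;
}
```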
For OpenConnect I think it's reasonable to declare that we don't care
about optimising the TCP fallback data path; it's the UDP data path that
we really want to be zero-copy, decrypting directly into the tun ring
and encrypting directly from it. Let's take a look at the protocols we
need to support...

==========
Cisco DTLS
==========

Cisco DTLS has only a single byte of header, for the packet type. So for
VPN→tun traffic we could actually just receive the packet (with
gnutls_record_recv() or SSL_read()) directly into the tun ring, with the
first byte *overwriting* the MSB of the Size field of the TUN_PACKET
header. If it's a packet type of anything other than AC_PKT_DATA we'd
then handle it immediately (as we do now) and end up re-using the same
location in the ring next time. If it's AC_PKT_DATA we just reinstate
the Size field of the header and bump the ring Tail to let the kernel
see it.

If it weren't for the fact that we need to allocate the full possible
MTU and then reduce the actual packet size once we see it, this would
even be compatible with the existing ring management in wintun.dll. It's
certainly compatible with the kernel driver.

For tun→VPN traffic we can do something similar, abusing that top byte
of the Size field. And this *would* be compatible with the existing DLL,
since we know that byte is going to be zero (which is AC_PKT_DATA)
anyway. (Here's where it would be nice to have gnutls_record_sendv() or
SSL_writev() methods which do scatter/gather, but we don't.)

The bare minimum we need from Wintun here is a function that would
resize the last allocated send packet. As long as nothing has been
allocated *after* it, we could reduce its size and adjust
Session->Receive.Tail accordingly.

===
ESP
===

For ESP we do our own encrypt/decrypt operations directly and don't need
a header.
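A hedged sketch of that VPN→tun trick, assuming a little-endian host (so
the MSB of Size is the byte immediately before the payload). The struct
and helper names are illustrative, not Wintun's or OpenConnect's actual
API, and the receive call is left abstract:

```c
#include <assert.h>
#include <stdint.h>

#define AC_PKT_DATA 0x00   /* Cisco DTLS data-packet type */

struct tun_packet {
    uint32_t size;
    uint8_t  data[];       /* payload lands here, untouched */
};

/* Where to point gnutls_record_recv()/SSL_read(): one byte before the
 * payload, overlaying the most significant byte of the Size field. */
static uint8_t *dtls_recv_ptr(struct tun_packet *pkt)
{
    return (uint8_t *)&pkt->size + sizeof(pkt->size) - 1;
}

/* After receiving recv_len bytes there: if it's a data packet,
 * reinstate Size with the payload length and tell the caller to bump
 * the ring Tail; otherwise leave the slot to be re-used next time. */
static int dtls_packet_ready(struct tun_packet *pkt, int recv_len)
{
    if (recv_len < 1 || *dtls_recv_ptr(pkt) != AC_PKT_DATA)
        return 0;
    pkt->size = (uint32_t)(recv_len - 1);
    return 1;
}
```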
But we do actually need a *tail*, since the packet is padded to the
crypto block size and has a 'next header' field in the final encrypted
byte, e.g.:

 DD DD DD DD DD DD DD DD DD DD DD DD DD DD DD DD
 DD DD DD 01 02 03 04 05 06 07 08 09 0a 0b 0b 29
          <.......... padding ...........> ↑  ↑
                                      padlen  next_hdr

So, much as with Cisco DTLS, for VPN→tun traffic we'd want to allocate
*more* space in the Wintun ring than we actually need, then shrink the
Size field later to match precisely the size of the packet. Which again
is compatible with the kernel driver but not with the ring management
currently in api/session.c.

For tun→VPN traffic it's a little more complex, as we ideally want the
kernel driver to leave space *after* the packet for us to append the
tail bytes without scribbling over the subsequent packet in the ring.
But it isn't the end of the world; we can encrypt *most* of the packet
directly from the tun ring, and we only need to copy *less* than a
single crypto block (those first three DD DD DD bytes in the final line
of the above example) into a buffer elsewhere to handle the padding etc.

=============
PPP over DTLS
=============

We just added support for the PPP-based protocols (Fortinet, F5) and I'm
not sure we even know what the DTLS-based version looks like on the
wire, do we?

If the header is 4 bytes or fewer, the same nasty trick works that I
suggest for Cisco DTLS above. And a PPP header even with accomp and
pfcomp *would* fit in 4 bytes. For the TCP transports we have additional
framing, but I'm hoping that isn't there in DTLS.

If we do need a header larger than 4 bytes, then we are forced to do
things properly by adding support in the kernel driver, instead of just
abusing the existing header while we know the kernel isn't looking at
it.

So, what do we want, and what's the bare minimum we actually *need* from
Wintun to be able to avoid those memcpys?
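The tail arithmetic for that partial copy is worth spelling out. A small
sketch (helper names are mine) of how many bytes can be encrypted in
place from the tun ring versus how many trailing payload bytes share the
final block with the padding, pad-length byte and next-header byte:

```c
#include <assert.h>

/* Total ESP plaintext length: payload + padding + the 2 trailer bytes
 * (pad length and next header), rounded up to the cipher block size. */
static int esp_total_len(int payload_len, int blk)
{
    return (payload_len + 2 + blk - 1) / blk * blk;
}

/* Payload bytes that fall inside the final block, i.e. the bytes we
 * must copy out of the ring into a side buffer; everything before them
 * is whole blocks that can be encrypted in place. */
static int esp_tail_copy_len(int payload_len, int blk)
{
    int in_place = esp_total_len(payload_len, blk) - blk;
    return payload_len - (in_place < 0 ? 0 : in_place);
}
```

For the 19-byte payload in the diagram above with a 16-byte block, this
gives a 32-byte total and exactly the three DD bytes to copy.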
The bare minimum is either exposing enough of the TUN_SESSION to let us
manage the rings for ourselves, or a function which can resize the
*last* allocated packet from the Tx ring before we call
WintunSendPacket() on it. That's purely userspace, in wintun.dll.

The next request would be to expand the TUN_PACKET header to include
head/tail space, with a parameter in the TUN_REGISTER_RINGS structure
which configures the amount of head/tail space to leave between received
packets. That's a change in the kernel API and is more complex to
manage. As noted, we *could* live without it for now, although it's kind
of ugly, still involves *some* copying at the tail of outbound ESP
packets, and depends on those PPP headers not exceeding the 4 bytes that
are currently available for us to abuse :)
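For the first, purely-userspace option, something like the following
would do. This is a hypothetical sketch against a simplified stand-in
for the DLL's session state, not Wintun's real structures: the field
names, the function name and the 4-byte alignment granularity are all
guesses at what api/session.c would need:

```c
#include <assert.h>
#include <stdint.h>

#define TUN_ALIGNMENT 4u   /* assumed ring allocation granularity */

struct mock_session {
    uint32_t tail;         /* next free offset in the send ring */
    uint32_t last_pkt;     /* offset of the most recent allocation */
    uint32_t last_size;    /* its current payload size */
};

/* Shrink the last allocated send packet to new_size and pull the ring
 * Tail back accordingly.  Only valid while nothing has been allocated
 * after it, and before it has been handed to the kernel. */
static int shrink_last_send_packet(struct mock_session *s, uint32_t new_size)
{
    if (new_size > s->last_size)
        return -1;         /* can only shrink, never grow */
    uint32_t aligned = (sizeof(uint32_t) /* the Size field */ + new_size
                        + TUN_ALIGNMENT - 1) & ~(TUN_ALIGNMENT - 1);
    s->tail = s->last_pkt + aligned;
    s->last_size = new_size;
    return 0;
}
```

This is exactly the "allocate the full MTU, then shrink once we've seen
the packet" primitive the DTLS and ESP receive paths above need.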