On Wed, 2021-04-07 at 16:15 -0700, Daniel Lenski wrote:
> On Wed, Apr 7, 2021 at 4:49 AM David Woodhouse wrote:
> > If WintunSendPacket took an additional 'offset' argument to disregard a
> > certain number of bytes at the beginning of the buffer, that would
> > probably suffice. Or is it possible to simply add to the pointer
> > returned by WintunAllocateSendPacket()?
>
> To expand on this possibility a little bit, I had proposed kludging a
> “shift” into the allocation for the outgoing packet:
>
> /* Always use this for an outgoing packet, instead of
>  * malloc(sizeof(struct pkt) + payload_len) */
> BYTE *tun_pkt = WintunAllocateSendPacket(
>     vpninfo->wintun_session,
>     payload_len /* packet payload size */
>     + sizeof(struct pkt) /* OpenConnect's internal packet header size */
> );
>
> /* Then after we build and populate the outgoing packet, just tell
>  * Wintun to send from an offset that's NOT at the beginning of the
>  * buffer it allocated for us. No memcpy! */
> WintunSendPacket(vpninfo->wintun_session, tun_pkt + sizeof(struct pkt));
>
> The concern here is that Wintun may not have been written with this
> possibility in mind, and might not always like sending from an address
> other than the exact start of a buffer it's allocated for us.

If we have to, we could probably ditch a lot of what's in the DLL's
api/session.c and manage the ring for ourselves; the fundamental
limitation right now is how the kernel driver itself handles the shared
ring.

It uses a TUN_PACKET header which is basically just a 'ULONG Size'
followed immediately by the data. To allow the application to process
packets out of order, the *userspace* part sets the top bit of ->Size to
manage ownership between the DLL and the application, but the kernel
side doesn't see that, because it's cleared by the time the DLL bumps
the ring tail to allow the kernel to see the packet.
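To make that layout concrete, here's a rough sketch in C. The TUN_PACKET
name matches the driver's; the ownership-flag name and the helper are my
own shorthand for what api/session.c does with the top bit, so treat the
details as illustrative rather than the driver's actual declarations:

```c
#include <assert.h>
#include <stdint.h>

/* Top bit of Size: set by userspace to mark the slot as still owned by
 * the application (name is mine, not necessarily the DLL's). */
#define TUN_PACKET_OWNED_BY_APP 0x80000000u

struct tun_packet {
    uint32_t size;    /* payload length; userspace abuses the MSB */
    uint8_t  data[];  /* packet bytes follow immediately */
};

/* What the kernel sees: the DLL clears the ownership bit before it
 * bumps the ring Tail, so only the plain length ever reaches it. */
static uint32_t tun_packet_kernel_size(uint32_t size_field)
{
    return size_field & ~TUN_PACKET_OWNED_BY_APP;
}
```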
For OpenConnect I think it's reasonable to declare that we don't care
about optimising the TCP fallback data path; it's the UDP data path that
we really want to be zero-copy, decrypting directly into the tun ring
and encrypting directly from it. Let's take a look at the protocols we
need to support...

==========
Cisco DTLS
==========

Cisco DTLS has only a single byte of header, for the packet type. So for
VPN→tun traffic we could actually just receive the packet (with
gnutls_record_recv() or SSL_read()) directly into the tun ring, with the
first byte *overwriting* the MSB of the Size field of the TUN_PACKET
header. If it's a packet type of anything other than AC_PKT_DATA we'd
then handle it immediately (as we do now) and end up re-using the same
location in the ring next time. If it's AC_PKT_DATA we just reinstate
the Size field of the header and bump the ring Tail to let the kernel
see it.

If it weren't for the fact that we need to allocate the full possible
MTU and then reduce the actual packet size once we see it, this would
even be compatible with the existing ring management in wintun.dll. It's
certainly compatible with the kernel driver.

For tun→VPN traffic we can do something similar, abusing that top byte
of the Size field. And this *would* be compatible with the existing DLL,
since we know that byte is going to be zero (which is AC_PKT_DATA)
anyway. (Here's where it would be nice to have gnutls_record_sendv() or
SSL_writev() methods which do scatter/gather, but we don't.)

The bare minimum we need from Wintun here is a function that would
resize the last allocated send packet. As long as nothing has been
allocated *after* it, we could reduce its size and adjust
Session->Receive.Tail accordingly.

===
ESP
===

For ESP we do our own encrypt/decrypt operations directly and don't need
a header.
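A hedged sketch of that VPN→tun trick, assuming a little-endian host (so
the MSB of Size is the byte immediately before the payload). The struct
and helper names are illustrative, not Wintun's or OpenConnect's actual
API, and the receive call is left abstract:

```c
#include <assert.h>
#include <stdint.h>

#define AC_PKT_DATA 0x00   /* Cisco DTLS data-packet type */

struct tun_packet {
    uint32_t size;
    uint8_t  data[];       /* payload lands here, untouched */
};

/* Where to point gnutls_record_recv()/SSL_read(): one byte before the
 * payload, overlaying the most significant byte of the Size field. */
static uint8_t *dtls_recv_ptr(struct tun_packet *pkt)
{
    return (uint8_t *)&pkt->size + sizeof(pkt->size) - 1;
}

/* After receiving recv_len bytes there: if it's a data packet,
 * reinstate Size with the payload length and tell the caller to bump
 * the ring Tail; otherwise leave the slot to be re-used next time. */
static int dtls_packet_ready(struct tun_packet *pkt, int recv_len)
{
    if (recv_len < 1 || *dtls_recv_ptr(pkt) != AC_PKT_DATA)
        return 0;
    pkt->size = (uint32_t)(recv_len - 1);
    return 1;
}
```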
But we do actually need a *tail*, since the packet is padded to the
crypto block size and has a 'next header' field in the final encrypted
byte, e.g.:

 DD DD DD DD DD DD DD DD DD DD DD DD DD DD DD DD
 DD DD DD 01 02 03 04 05 06 07 08 09 0a 0b 0b 29
          <.......... padding ...........> ↑  ↑
                                      padlen  next_hdr

So, much as with Cisco DTLS, for VPN→tun traffic we'd want to allocate
*more* space in the Wintun ring than we actually need, then shrink the
Size field later to match precisely the size of the packet. Which again
is compatible with the kernel driver but not with the ring management
currently in api/session.c.

For tun→VPN traffic it's a little more complex, as we ideally want the
kernel driver to leave space *after* the packet for us to append the
tail bytes without scribbling over the subsequent packet in the ring.
But it isn't the end of the world; we can encrypt *most* of the packet
directly from the tun ring, and we only need to copy *less* than a
single crypto block (those first three DD DD DD bytes in the final line
of the above example) into a buffer elsewhere to handle the padding etc.

=============
PPP over DTLS
=============

We just added support for the PPP-based protocols (Fortinet, F5) and I'm
not sure we even know what the DTLS-based version looks like on the
wire, do we?

If the header is 4 bytes or fewer, the same nasty trick works that I
suggest for Cisco DTLS above. And a PPP header even with accomp and
pfcomp *would* fit in 4 bytes. For the TCP transports we have additional
framing, but I'm hoping that isn't there in DTLS.

If we do need a header larger than 4 bytes, then we are forced to do
things properly by adding support in the kernel driver, instead of just
abusing the existing header while we know the kernel isn't looking at
it.

So, what do we want, and what's the bare minimum we actually *need* from
Wintun to be able to avoid those memcpys?
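The tail arithmetic for that partial copy is worth spelling out. A small
sketch (helper names are mine) of how many bytes can be encrypted in
place from the tun ring versus how many trailing payload bytes share the
final block with the padding, pad-length byte and next-header byte:

```c
#include <assert.h>

/* Total ESP plaintext length: payload + padding + the 2 trailer bytes
 * (pad length and next header), rounded up to the cipher block size. */
static int esp_total_len(int payload_len, int blk)
{
    return (payload_len + 2 + blk - 1) / blk * blk;
}

/* Payload bytes that fall inside the final block, i.e. the bytes we
 * must copy out of the ring into a side buffer; everything before them
 * is whole blocks that can be encrypted in place. */
static int esp_tail_copy_len(int payload_len, int blk)
{
    int in_place = esp_total_len(payload_len, blk) - blk;
    return payload_len - (in_place < 0 ? 0 : in_place);
}
```

For the 19-byte payload in the diagram above with a 16-byte block, this
gives a 32-byte total and exactly the three DD bytes to copy.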
The bare minimum is either exposing enough of the TUN_SESSION to let us
manage the rings for ourselves, or a function which can resize the
*last* allocated packet from the Tx ring before we call
WintunSendPacket() on it. That's purely userspace, in wintun.dll.

The next request would be to expand the TUN_PACKET header to include
head/tail space, with a parameter in the TUN_REGISTER_RINGS structure
which configures the amount of head/tail space to leave between received
packets. That's a change in the kernel API and is more complex to
manage. As noted, we *could* live without it for now, although it's kind
of ugly, still involves *some* copying at the tail of outbound ESP
packets, and depends on those PPP headers not exceeding the 4 bytes that
are currently available for us to abuse :)
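For the first, purely-userspace option, something like the following
would do. This is a hypothetical sketch against a simplified stand-in
for the DLL's session state, not Wintun's real structures: the field
names, the function name and the 4-byte alignment granularity are all
guesses at what api/session.c would need:

```c
#include <assert.h>
#include <stdint.h>

#define TUN_ALIGNMENT 4u   /* assumed ring allocation granularity */

struct mock_session {
    uint32_t tail;         /* next free offset in the send ring */
    uint32_t last_pkt;     /* offset of the most recent allocation */
    uint32_t last_size;    /* its current payload size */
};

/* Shrink the last allocated send packet to new_size and pull the ring
 * Tail back accordingly.  Only valid while nothing has been allocated
 * after it, and before it has been handed to the kernel. */
static int shrink_last_send_packet(struct mock_session *s, uint32_t new_size)
{
    if (new_size > s->last_size)
        return -1;         /* can only shrink, never grow */
    uint32_t aligned = (sizeof(uint32_t) /* the Size field */ + new_size
                        + TUN_ALIGNMENT - 1) & ~(TUN_ALIGNMENT - 1);
    s->tail = s->last_pkt + aligned;
    s->last_size = new_size;
    return 0;
}
```

This is exactly the "allocate the full MTU, then shrink once we've seen
the packet" primitive the DTLS and ESP receive paths above need.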