* [PATCH] wg-quick: linux: fix MTU calculation (use PMTUD)
@ 2023-10-29 19:22 Thomas Brierley
  2023-11-20  1:17 ` Daniel Gröber
From: Thomas Brierley @ 2023-10-29 19:22 UTC
To: wireguard; +Cc: Thomas Brierley

Currently MTU calculation fails to successfully utilise the kernel's
built-in path MTU discovery mechanism (PMTUD). Fixing this required a
re-write of the set_mtu_up() function, which also addresses two related
MTU issues as a side effect...

1. Trigger PMTUD Before Query

Currently the endpoint path MTU acquired from `ip route get` will almost
definitely be empty, because this only queries the routing cache. To
trigger PMTUD on the endpoint and fill this cache, it is necessary to
send an ICMP with the DF bit set.

We now perform a ping beforehand with a total packet size equal to the
interface MTU, larger will not trigger PMTUD, and smaller can miss a
bottleneck. To calculate the ping payload, the device MTU and IP header
size must be determined first.

2. Consider IPv6/4 Header Size

Currently an 80 byte header size is assumed i.e. IPv6=40 + WireGuard=40.
However this is not optimal in the case of IPv4. Since determining the
IP header size is required for PMTUD anyway, this is now optimised as a
side effect of endpoint MTU calculation.

3. Use Smallest Endpoint MTU

Currently in the case of multiple endpoints the largest endpoint path
MTU is used. However WireGuard will dynamically switch between endpoints
when e.g. one fails, so the smallest MTU is now used to ensure all
endpoints will function correctly.

Signed-off-by: Thomas Brierley <tomxor@gmail.com>
---
 src/wg-quick/linux.bash | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/src/wg-quick/linux.bash b/src/wg-quick/linux.bash
index 4193ce5..5aba2cb 100755
--- a/src/wg-quick/linux.bash
+++ b/src/wg-quick/linux.bash
@@ -123,22 +123,33 @@ add_addr() {
 }
 
 set_mtu_up() {
-	local mtu=0 endpoint output
-	if [[ -n $MTU ]]; then
-		cmd ip link set mtu "$MTU" up dev "$INTERFACE"
-		return
-	fi
-	while read -r _ endpoint; do
-		[[ $endpoint =~ ^\[?([a-z0-9:.]+)\]?:[0-9]+$ ]] || continue
-		output="$(ip route get "${BASH_REMATCH[1]}" || true)"
-		[[ ( $output =~ mtu\ ([0-9]+) || ( $output =~ dev\ ([^ ]+) && $(ip link show dev "${BASH_REMATCH[1]}") =~ mtu\ ([0-9]+) ) ) && ${BASH_REMATCH[1]} -gt $mtu ]] && mtu="${BASH_REMATCH[1]}"
-	done < <(wg show "$INTERFACE" endpoints)
-	if [[ $mtu -eq 0 ]]; then
-		read -r output < <(ip route show default || true) || true
-		[[ ( $output =~ mtu\ ([0-9]+) || ( $output =~ dev\ ([^ ]+) && $(ip link show dev "${BASH_REMATCH[1]}") =~ mtu\ ([0-9]+) ) ) && ${BASH_REMATCH[1]} -gt $mtu ]] && mtu="${BASH_REMATCH[1]}"
+	local dev devmtu end endmtu iph=40 wgh=40 mtu
+	# Device MTU
+	if [[ -n $(ip route show default) ]]; then
+		[[ $(ip route show default) =~ dev\ ([^ ]+) ]]
+		dev=${BASH_REMATCH[1]}
+		[[ $(ip addr show $dev scope global) =~ inet6 ]] &&
+			iph=40 || iph=20
+		if [[ $(ip link show dev $dev) =~ mtu\ ([0-9]+) ]]; then
+			devmtu=${BASH_REMATCH[1]}
+			[[ $(( devmtu - iph - wgh )) -gt $mtu ]] &&
+				mtu=$(( devmtu - iph - wgh ))
+		fi
+		# Endpoint MTU
+		while read -r _ end; do
+			[[ $end =~ ^\[?([a-f0-9:.]+)\]?:[0-9]+$ ]]
+			end=${BASH_REMATCH[1]}
+			[[ $end =~ [:] ]] &&
+				iph=40 || iph=20
+			ping -w 1 -M do -s $(( devmtu - iph - 8 )) $end &> /dev/null || true
+			if [[ $(ip route get $end) =~ mtu\ ([0-9]+) ]]; then
+				endmtu=${BASH_REMATCH[1]}
+				[[ $(( endmtu - iph - wgh )) -lt $mtu ]] &&
+					mtu=$(( endmtu - iph - wgh ))
+			fi
+		done < <(wg show "$INTERFACE" endpoints)
 	fi
-	[[ $mtu -gt 0 ]] || mtu=1500
-	cmd ip link set mtu $(( mtu - 80 )) up dev "$INTERFACE"
+	cmd ip link set mtu ${MTU:-${mtu:-1420}} up dev "$INTERFACE"
 }
 
 resolvconf_iface_prefix() {
-- 
2.30.2

* Re: [PATCH] wg-quick: linux: fix MTU calculation (use PMTUD)
  2023-10-29 19:22 [PATCH] wg-quick: linux: fix MTU calculation (use PMTUD) Thomas Brierley
@ 2023-11-20  1:17 ` Daniel Gröber
  2023-11-23  3:33   ` Thomas Brierley
From: Daniel Gröber @ 2023-11-20  1:17 UTC
To: Thomas Brierley; +Cc: wireguard

Hi Thomas,

On Sun, Oct 29, 2023 at 07:22:10PM +0000, Thomas Brierley wrote:
> Currently MTU calculation fails to successfully utilise the kernel's
> built-in path MTU discovery mechanism (PMTUD). Fixing this required a
> re-write of the set_mtu_up() function, which also addresses two related
> MTU issues as a side effect...
>
> 1. Trigger PMTUD Before Query
>
> Currently the endpoint path MTU acquired from `ip route get` will almost
> definitely be empty,

This is not entirely true: routes can specify the `mtu` route attribute
explicitly and this will show up here. Something like:

    $ ip route add 2001:db8::1 dev eth0 via fe80::1 mtu 9000
    $ ip route get 2001:db8::1
    ~ 2001:db8::1 from :: via fe80::1 dev eth0 src 2001:db8:2 metric 1 mtu 9000 pref medium

So this is useful even when not taking PMTU into account. The only
concerning problem I see here is the case where ip-route-get returns a
cached PMTU, so the MTU selection isn't fully deterministic.

> because this only queries the routing cache. To
> trigger PMTUD on the endpoint and fill this cache, it is necessary to
> send an ICMP with the DF bit set.

I don't think this is useful. Path MTU may change; doing this only once
when the interface comes up just makes wg-quick less predictable IMO.

> We now perform a ping beforehand with a total packet size equal to the
> interface MTU, larger will not trigger PMTUD, and smaller can miss a
> bottleneck. To calculate the ping payload, the device MTU and IP header
> size must be determined first.
>
> 2. Consider IPv6/4 Header Size
>
> Currently an 80 byte header size is assumed i.e. IPv6=40 + WireGuard=40.
> However this is not optimal in the case of IPv4. Since determining the
> IP header size is required for PMTUD anyway, this is now optimised as a
> side effect of endpoint MTU calculation.

This is not a good idea. Consider what happens when a peer roams from an
IPv4 to an IPv6 endpoint address. It's better to be conservative and
assume IPv6-sized overhead; besides, IPv4 is legacy anyway ;)

> 3. Use Smallest Endpoint MTU
>
> Currently in the case of multiple endpoints the largest endpoint path
> MTU is used. However WireGuard will dynamically switch between endpoints
> when e.g. one fails, so the smallest MTU is now used to ensure all
> endpoints will function correctly.

"function correctly". Do note that WireGuard lets its UDP packets be
fragmented, so connectivity will still work even when the wg device MTU
doesn't match the (current) PMTU. The only downsides to this mismatch
are performance-related:

 - additional header overhead for fragments,
 - less than half max packets-per-second performance and
 - additional latency for tunnel packets hit by IPv6 PMTU discovery

   I was surprised to learn that this would happen periodically, every
   time the PMTU cache expires. Seems inherent in the IPv6 design as
   there's no way (AFAICT) for the kernel to validate the PMTU before
   the cache expires (like is done for NDP for example).

> Signed-off-by: Thomas Brierley <tomxor@gmail.com>
> ---
>  src/wg-quick/linux.bash | 41 ++++++++++++++++++++++++++---------------
>  1 file changed, 26 insertions(+), 15 deletions(-)
>
> diff --git a/src/wg-quick/linux.bash b/src/wg-quick/linux.bash
> index 4193ce5..5aba2cb 100755
> --- a/src/wg-quick/linux.bash
> +++ b/src/wg-quick/linux.bash
> @@ -123,22 +123,33 @@ add_addr() {
>  }
>
>  set_mtu_up() {
> -	local mtu=0 endpoint output
> -	if [[ -n $MTU ]]; then
> -		cmd ip link set mtu "$MTU" up dev "$INTERFACE"
> -		return
> -	fi
> -	while read -r _ endpoint; do
> -		[[ $endpoint =~ ^\[?([a-z0-9:.]+)\]?:[0-9]+$ ]] || continue
> -		output="$(ip route get "${BASH_REMATCH[1]}" || true)"
> -		[[ ( $output =~ mtu\ ([0-9]+) || ( $output =~ dev\ ([^ ]+) && $(ip link show dev "${BASH_REMATCH[1]}") =~ mtu\ ([0-9]+) ) ) && ${BASH_REMATCH[1]} -gt $mtu ]] && mtu="${BASH_REMATCH[1]}"
> -	done < <(wg show "$INTERFACE" endpoints)
> -	if [[ $mtu -eq 0 ]]; then
> -		read -r output < <(ip route show default || true) || true
> -		[[ ( $output =~ mtu\ ([0-9]+) || ( $output =~ dev\ ([^ ]+) && $(ip link show dev "${BASH_REMATCH[1]}") =~ mtu\ ([0-9]+) ) ) && ${BASH_REMATCH[1]} -gt $mtu ]] && mtu="${BASH_REMATCH[1]}"
> +	local dev devmtu end endmtu iph=40 wgh=40 mtu
> +	# Device MTU
> +	if [[ -n $(ip route show default) ]]; then
> +		[[ $(ip route show default) =~ dev\ ([^ ]+) ]]
> +		dev=${BASH_REMATCH[1]}
> +		[[ $(ip addr show $dev scope global) =~ inet6 ]] &&
> +			iph=40 || iph=20
> +		if [[ $(ip link show dev $dev) =~ mtu\ ([0-9]+) ]]; then
> +			devmtu=${BASH_REMATCH[1]}
> +			[[ $(( devmtu - iph - wgh )) -gt $mtu ]] &&
> +				mtu=$(( devmtu - iph - wgh ))
> +		fi
> +		# Endpoint MTU
> +		while read -r _ end; do
> +			[[ $end =~ ^\[?([a-f0-9:.]+)\]?:[0-9]+$ ]]
> +			end=${BASH_REMATCH[1]}
> +			[[ $end =~ [:] ]] &&
> +				iph=40 || iph=20
> +			ping -w 1 -M do -s $(( devmtu - iph - 8 )) $end &> /dev/null || true
> +			if [[ $(ip route get $end) =~ mtu\ ([0-9]+) ]]; then
> +				endmtu=${BASH_REMATCH[1]}
> +				[[ $(( endmtu - iph - wgh )) -lt $mtu ]] &&
> +					mtu=$(( endmtu - iph - wgh ))
> +			fi
> +		done < <(wg show "$INTERFACE" endpoints)
>  	fi
> -	[[ $mtu -gt 0 ]] || mtu=1500
> -	cmd ip link set mtu $(( mtu - 80 )) up dev "$INTERFACE"
> +	cmd ip link set mtu ${MTU:-${mtu:-1420}} up dev "$INTERFACE"
>  }
>
>  resolvconf_iface_prefix() {
> --
> 2.30.2
>

--Daniel

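The header arithmetic discussed in the two messages above can be worked through with a minimal bash sketch. It is not part of the patch: the helper names (`ip_header_size`, `tunnel_mtu`, `conservative_mtu`, `ipv4_fragments`) are hypothetical, the 40-byte WireGuard overhead is the `wgh` value from the patch, and the fragment count assumes a 20-byte IPv4 header without options and ignores any padding WireGuard may add.

```shell
#!/usr/bin/env bash
WG_OVERHEAD=40   # "wgh" in the patch

# Hypothetical helper: 40 for an address containing ':' (IPv6), else 20 (IPv4).
ip_header_size() {
    if [[ $1 == *:* ]]; then echo 40; else echo 20; fi
}

# Tunnel MTU for a given path MTU and endpoint address (per-family, as in the patch).
tunnel_mtu() {
    echo $(( $1 - $(ip_header_size "$2") - WG_OVERHEAD ))
}

# Conservative variant: always assume an IPv6 outer header (80 bytes in total).
conservative_mtu() {
    echo $(( $1 - 40 - WG_OVERHEAD ))
}

# Number of IPv4 fragments needed for one full-sized tunnel packet.
# Every fragment except the last carries a payload that is a multiple of 8 bytes.
ipv4_fragments() {
    local wg_mtu=$1 path_mtu=$2
    local payload=$(( wg_mtu + WG_OVERHEAD ))           # bytes behind the outer IPv4 header
    local per_fragment=$(( (path_mtu - 20) / 8 * 8 ))   # usable payload per fragment
    echo $(( (payload + per_fragment - 1) / per_fragment ))
}

tunnel_mtu 1500 192.0.2.1     # prints 1440
tunnel_mtu 1500 2001:db8::1   # prints 1420
conservative_mtu 1500         # prints 1420
ipv4_fragments 1420 1380      # prints 2
ipv4_fragments 1320 1380      # prints 1
```

Under these assumptions, a tunnel MTU of 1440 chosen for an IPv4 endpoint produces 1440 + 80 = 1520 byte outer packets once the peer roams to IPv6, which no longer fits a 1500 byte path. The two-fragment case illustrates why a mismatch costs roughly half the packets-per-second rate.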
* Re: [PATCH] wg-quick: linux: fix MTU calculation (use PMTUD)
  2023-11-20  1:17 ` Daniel Gröber
@ 2023-11-23  3:33   ` Thomas Brierley
  2023-11-27 14:19     ` Wg fragment packet blackholing issue (Was: [PATCH] wg-quick: linux: fix MTU calculation (use PMTUD)) Daniel Gröber
From: Thomas Brierley @ 2023-11-23  3:33 UTC
To: Daniel Gröber; +Cc: wireguard

Hi Daniel

Thanks for having a look at this.

On Mon, 20 Nov 2023 at 01:17, Daniel Gröber <dxld@darkboxed.org> wrote:
> > because this only queries the routing cache. To
> > trigger PMTUD on the endpoint and fill this cache, it is necessary to
> > send an ICMP with the DF bit set.
>
> I don't think this is useful. Path MTU may change, doing this only once
> when the interface comes up just makes wg-quick less predictable IMO.

Yes, I understand PMTU may change, usually when changing internet
connection. There is also the issue of bringing up an interface without
a connection, such as when using the wg-quick startup service.
Accommodating dynamic PMTU is probably out of scope of the wg-quick
script, but is something I would like to look into separately.

I still think it would be beneficial to set the MTU optimally, if only
upon bringing an interface up, because PMTU is usually stable for a
particular gateway and having this built in makes it far easier for
users to automatically obtain the appropriate MTU. I think it also more
accurately reflects the man page, which suggests automatic discovery.

> > 2. Consider IPv6/4 Header Size
> >
> > Currently an 80 byte header size is assumed i.e. IPv6=40 + WireGuard=40.
> > However this is not optimal in the case of IPv4. Since determining the
> > IP header size is required for PMTUD anyway, this is now optimised as a
> > side effect of endpoint MTU calculation.
>
> This is not a good idea. Consider what happens when a peer roams from an
> IPv4 to an IPv6 endpoint address. It's better to be conservative and assume
> IPv6 sized overhead, besides IPv4 is legacy anyway ;)

MTU calculation is performed independently for each endpoint, with
separate header size calculation accommodating both IPv4 and IPv6
addresses alongside each other. The smallest MTU of all endpoints is
used, so switching from an IPv4 to an IPv6 endpoint should not result in
an MTU which is too large due to IP header size differences. In my case
the current behaviour is not conservative enough, but due to the absence
of PMTUD rather than assumed IP header sizes.

> > 3. Use Smallest Endpoint MTU
> >
> > Currently in the case of multiple endpoints the largest endpoint path
> > MTU is used. However WireGuard will dynamically switch between endpoints
> > when e.g. one fails, so the smallest MTU is now used to ensure all
> > endpoints will function correctly.
>
> "function correctly". Do note that WireGuard lets its UDP packets be
> fragmented. So connectivity will still work even when the wg device MTU
> doesn't match the (current) PMTU. The only downsides to this mismatch being
> performance:
>
> - additional header overhead for fragments,
> - less than half max packets-per-second performance and
> - additional latency for tunnel packets hit by IPv6 PMTU discovery
>
>   I was surprised to learn that this would happen periodically, every time
>   the PMTU cache expires. Seems inherent in the IPv6 design as there's no
>   way (AFAICT) for the kernel to validate the PMTU before the cache
>   expires (like is done for NDP for example).

So, the reason I ended up tinkering with WireGuard MTU is due to real
world reliability issues. Although the risk in setting it optimally
based on PMTU remains unclear to me, marginal performance gains are not
what brought me here. Networking is not my area of expertise, so the
best I can do is lay out my experience and see if you think it adds any
weight in favour of this change in behaviour, because I haven't done a
full root cause analysis:

I found that browsing the web over WireGuard with an MTU set larger than
the PMTU resulted in randomly stalled HTTP requests. This is noticeable
even with a single stalled HTTP request due to the HTTP/1.1 head-of-line
blocking issue. I tested this manually with individual HTTP requests
with a large enough payload, verifying that it only occurs over
WireGuard connections.

With naked HTTP/TCP the network seems happy, I assume it is fragmenting
packets; but over WireGuard, somehow, some packets just seem to get
dropped. Maybe UDP is getting treated differently, or maybe what's
actually happening is the network is blackholing in both cases but PMTUD
is figuring this out in the case of TCP (RFC 2923), and maybe that stops
working when encapsulated in UDP?... But this is pure speculation, I'm
out of my depth here, and haven't dug any deeper.

This behaviour is probably network operator dependent, or specific to
LTE networks, which I use for permanent internet access, and which
commonly use a lower than average MTU. For example my current ISP uses
1380, and the current wg-quick behaviour is to set the MTU to the
default route interface MTU less 80 bytes (1420 for regular interfaces),
which results in the above behaviour.

I've used all four of the major mobile network operators in my country
and experienced this on two of them (separate physical networks, not
virtual operators). The other two used an MTU of 1500 anyway.

Just to prove I'm not entirely on my own, this issue also appears to be
known to WireGuard VPN providers, e.g. from Mullvad's FAQ:

> The default MTU (maximum transmission unit) for WireGuard in the Mullvad
> app is 1380. You can set it to 1280 if the WireGuard connection stops
> working. This may be necessary in some mobile networks.

I suppose it could be argued this is not a WireGuard concern, mobile
networks are behaving weirdly. Also IME it's not entirely unreliable
above the optimal MTU, it's just *less* reliable.

I had not anticipated such a patch would have any downsides, I saw this
as a general deficiency - although I appreciate, as you pointed out, it
is not a 100% complete solution. I'm interested more in what your
concerns are and what you think of the above, but will move along if you
still think it's not suitable.

Cheers
Tom

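For readers who want to measure the path MTU by hand before choosing a fixed `MTU` value for wg-quick, a probe-and-search approach can be sketched as follows. This is an illustration only, not something proposed in the thread: `ping_probe`, `find_pmtu` and the `TARGET` variable are hypothetical names, the probe relies on the iputils `ping` options `-c`, `-W`, `-M do` and `-s`, it handles IPv4 only (28 = 20-byte IPv4 header + 8-byte ICMP header), and the search assumes that every size up to the path MTU succeeds and every larger size fails.

```shell
#!/usr/bin/env bash
# Hypothetical probe: succeeds if an IPv4 packet of total size $1 reaches
# $TARGET without fragmentation (DF bit set via -M do).
ping_probe() {
    ping -c 1 -W 1 -M do -s $(( $1 - 28 )) "$TARGET" &> /dev/null
}

# Binary search for the largest size in [lo, hi] for which the probe succeeds.
# $1 is the name of the probe function, $2 and $3 are the search bounds.
find_pmtu() {
    local probe=$1 lo=$2 hi=$3 mid
    "$probe" "$lo" || return 1          # give up if even the lower bound fails
    while (( lo < hi )); do
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$mid"; then
            lo=$mid                     # mid fits, search upwards
        else
            hi=$(( mid - 1 ))           # mid is too large, search downwards
        fi
    done
    echo "$lo"
}

# Usage (needs network access; the address is a placeholder):
#   TARGET=192.0.2.1 find_pmtu ping_probe 576 1500
```

A lost probe is indistinguishable from an oversized one here, so the result can underestimate the real path MTU; and as noted in the thread, the value is only valid for the path at the time of measurement.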
* Wg fragment packet blackholing issue (Was: [PATCH] wg-quick: linux: fix MTU calculation (use PMTUD))
  2023-11-23  3:33   ` Thomas Brierley
@ 2023-11-27 14:19     ` Daniel Gröber
From: Daniel Gröber @ 2023-11-27 14:19 UTC
To: Thomas Brierley; +Cc: wireguard

Hi Tom,

On Thu, Nov 23, 2023 at 03:33:39AM +0000, Thomas Brierley wrote:
> On Mon, 20 Nov 2023 at 01:17, Daniel Gröber <dxld@darkboxed.org> wrote:
> > > because this only queries the routing cache. To
> > > trigger PMTUD on the endpoint and fill this cache, it is necessary to
> > > send an ICMP with the DF bit set.
> >
> > I don't think this is useful. Path MTU may change, doing this only once
> > when the interface comes up just makes wg-quick less predictable IMO.
>
> Yes, I understand PMTU may change, usually when changing internet
> connection.

Just to make sure you understand the distinction, PMTU may change at any
time. It's a property of the *path* IP packets take to get to their
destination, specifically the minimum *link* MTU involved in forwarding
your packet on the routers along the path.

> > > 2. Consider IPv6/4 Header Size
> > >
> > > Currently an 80 byte header size is assumed i.e. IPv6=40 + WireGuard=40.
> > > However this is not optimal in the case of IPv4. Since determining the
> > > IP header size is required for PMTUD anyway, this is now optimised as a
> > > side effect of endpoint MTU calculation.
> >
> > This is not a good idea. Consider what happens when a peer roams from an
> > IPv4 to an IPv6 endpoint address. It's better to be conservative and assume
> > IPv6 sized overhead, besides IPv4 is legacy anyway ;)
>
> MTU calculation is performed independently for each endpoint

I think you misunderstand, this is not about different peers having
different endpoint IPs, it's about one peer (say a laptop) moving from
an IPv4 network to an IPv6 one. In this case doing the PMTU
"optimization" on the v4 one would cause fragmentation on the v6
network.

> > "function correctly". Do note that WireGuard lets its UDP packets be
> > fragmented. So connectivity will still work even when the wg device MTU
> > doesn't match the (current) PMTU. The only downsides to this mismatch being
> > performance:
> >
> > - additional header overhead for fragments,
> > - less than half max packets-per-second performance and
> > - additional latency for tunnel packets hit by IPv6 PMTU discovery
> >
> >   I was surprised to learn that this would happen periodically, every time
> >   the PMTU cache expires. Seems inherent in the IPv6 design as there's no
> >   way (AFAICT) for the kernel to validate the PMTU before the cache
> >   expires (like is done for NDP for example).
>
> So, the reason I ended up tinkering with WireGuard MTU is due to real
> world reliability issues. Although the risk in setting it optimally
> based on PMTU remains unclear to me, marginal performance gains are not
> what brought me here. Networking is not my area of expertise, so the
> best I can do is lay out my experience and see if you think it adds any
> weight in favour of this change in behaviour, because I haven't done a
> full root cause analysis:
>
> I found that browsing the web over WireGuard with an MTU set larger than
> the PMTU resulted in randomly stalled HTTP requests. This is noticeable
> even with a single stalled HTTP request due to the HTTP 1.1 head of line
> blocking issue. I tested this manually with individual HTTP requests
> with a large enough payload, verifying that it only occurs over
> WireGuard connections.

Hanging TCP/HTTP connections are a typical symptom of broken IP
fragmentation; there are a number of possible causes. There's a browser
based test for the common issues here (thanks majek):

    http://icmpcheck.popcount.org/  (IPv4)
    http://icmpcheck6.popcount.org/ (IPv6)

Try it (without any active VPNs or tunnels obv.) to see if your internet
provider's network is fundamentally broken.

Note: The IPv6 test can give wrong results for networks deploying
DNS64+NAT64. To make sure this doesn't affect you, you should ensure
ipv4only.arpa fails to resolve. Expected:

    $ getent ahostsv6 ipv4only.arpa; echo rv=$?
    rv=2

If broken fragmentation is the root problem you can work around this in
a number of ways. The easiest is probably by implementing TCP MSS
clamping on your gateway or host(s).

> With naked HTTP/TCP the network seems happy, I assume it is fragmenting
> packets; but over WireGuard, somehow, some packets just seem to get
> dropped. Maybe UDP is getting treated differently

That's entirely possible. UDP is (unfortunately) commonly abused for
DDoS and so network operators may filter it in desperation and pray no
customer notices. So it's up to you to notice, complain loudly and
switch providers if all else fails :)

> or maybe what's actually happening is the network is blackholing in both
> cases but PMTUD is figuring this out in the case of TCP (RFC 2923), and
> maybe that stops working when encapsulated in UDP?...

PMTUD is independent of the L4 protocol so it normally works either way,
but that always assumes your network operator is doing their job
correctly. It's possible to filter (e.g.) only ICMP-PTB errors involving
UDP packets. While this sounds pointless to me it could be some attempt
at UDP fragment attack DDoS mitigation.

Speaking of RFC 2923, another possible workaround is enabling RFC 4821
(Packetization Layer PMTU) behaviour on your hosts, but MSS clamping has
the advantage of letting you control this on a gateway i.e. without
touching (all) your hosts.

> This behaviour is probably network operator dependent, or specific to
> LTE networks, which I use for permanent internet access, and which
> commonly use a lower than average MTU. For example my current ISP uses
> 1380, and the current wg-quick behaviour is to set the MTU to the
> default route interface MTU less 80 bytes (1420 for regular interfaces),
> which results in the above behaviour.
>
> I've used all four of the major mobile network operators in my country
> and experienced this on two of them (separate physical networks, not
> virtual operators). The other two used an MTU of 1500 anyway.
>
> Just to prove I'm not entirely on my own, this issue also appears to be
> known to WireGuard VPN providers, e.g. from Mullvad's FAQ:

I have no doubt the problem you're facing is real, depending on the
results from the test above you may just be complaining to the wrong
party ;)

Happy to help you figure out how to work around any ISP bullshit once we
figure out what it is that's happening.

--Daniel

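As a rough illustration of the MSS clamping workaround mentioned above, the following bash sketch computes the TCP MSS that fits a given MTU. The `tcp_mss` helper is hypothetical and assumes no IP or TCP options (20-byte TCP header, 20-byte IPv4 or 40-byte IPv6 header). The `iptables` and `sysctl` commands in the comments are commonly used forms of the two workarounds; they need root and are not executed by the sketch, and whether they suit a particular gateway is for the reader to judge.

```shell
#!/usr/bin/env bash
# Hypothetical helper: largest TCP MSS that fits in one packet of the given MTU.
# $1 is the MTU, $2 is the IP version (4 or 6).
tcp_mss() {
    local mtu=$1 family=$2
    case $family in
        4) echo $(( mtu - 20 - 20 )) ;;   # IPv4 header + TCP header
        6) echo $(( mtu - 40 - 20 )) ;;   # IPv6 header + TCP header
        *) return 1 ;;
    esac
}

tcp_mss 1420 4   # prints 1380
tcp_mss 1420 6   # prints 1360

# MSS clamping on a forwarding gateway (needs root, not executed here):
#   iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
#            -j TCPMSS --clamp-mss-to-pmtu
#
# RFC 4821 style probing on a Linux host (needs root, not executed here):
#   sysctl -w net.ipv4.tcp_mtu_probing=1
```

Clamping only affects TCP connections that pass through the gateway; it does nothing for the WireGuard UDP packets themselves.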