Development discussion of WireGuard
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] Introduce per-peer MTU setting
@ 2021-12-28 23:45 leon
  2021-12-28 23:45 ` [RFC PATCH 1/4] netdevice: add ndo_lookup_mtu for dynamically determining MTU leon
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: leon @ 2021-12-28 23:45 UTC (permalink / raw)
  To: wireguard; +Cc: Leon Schuermann

From: Leon Schuermann <leon@is.currently.online>

This patch series is an attempt to integrate a per-peer MTU setting
into WireGuard. With matching changes to the wireguard-tools,
individual MTU values can be set and retrieved for each registered
peer.

While Linux supports setting an MTU metric for specific FIB route
entries [which I've only found out after implementing this :)], and
thus allows to lower the MTU for individual peers, this appears to
disable regular path MTU discovery (PMTUD) entirely on the
route. While regular PMTUD does not work over the tunnel link, it
should still be usable on the rest of the route.

Furthermore, with the goal of eventually introducing an in-band
per-peer PMTUD mechanism, keeping an internal per-peer MTU value does
not require modifying the FIB and thus potentially interfere with
userspace.

In an effort to solve this issues, this patch series introduces a
rather generic framework for implementing these kinds of dynamic MTU
policies. By providing a hook in the netdevice implementation to
decide on the applicable MTU, very flexible designs can be built. I
suppose that these changes are rather controversial, or at least
require some more discussion. I'm sending this patchset to the
WireGuard development list in hopes to get some initial feedback on
the idea and implementation and would like to eventually submit the
non-WireGuard changes directly to the netdev ML.

The patches are currently based on v5.10, as that happens to be
what I was developing on. I'll gladly rebase to the latest
revision / wireguard-devel if requested.

Thanks!

Leon

Leon Schuermann (4):
  netdevice: add ndo_lookup_mtu for dynamically determining MTU
  net/ipv4: respect MTU determined by `ndo_lookup_mtu`
  net/ipv6: respect MTU determined by `ndo_lookup_mtu`
  net/wireguard: add per-peer MTU setting

 drivers/net/wireguard/allowedips.c |  2 +-
 drivers/net/wireguard/allowedips.h |  2 +-
 drivers/net/wireguard/device.c     | 20 ++++++++++++++++--
 drivers/net/wireguard/netlink.c    |  8 +++++++
 drivers/net/wireguard/peer.c       |  1 +
 drivers/net/wireguard/peer.h       |  1 +
 drivers/net/wireguard/queueing.h   |  2 +-
 include/linux/netdevice.h          | 12 +++++++++++
 include/net/ip.h                   | 34 ++++++++++++++++++++++--------
 include/net/ip6_route.h            | 14 ++++++++++--
 include/uapi/linux/wireguard.h     |  5 +++++
 net/ipv4/ip_forward.c              |  2 +-
 net/netfilter/nf_flow_table_core.c |  2 +-
 13 files changed, 87 insertions(+), 18 deletions(-)


base-commit: 2c85ebc57b3e1817b6ce1a6b703928e113a90442
-- 
2.33.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 1/4] netdevice: add ndo_lookup_mtu for dynamically determining MTU
  2021-12-28 23:45 [RFC PATCH 0/4] Introduce per-peer MTU setting leon
@ 2021-12-28 23:45 ` leon
  2021-12-28 23:45 ` [RFC PATCH 2/4] net/ipv4: respect MTU determined by `ndo_lookup_mtu` leon
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: leon @ 2021-12-28 23:45 UTC (permalink / raw)
  To: wireguard; +Cc: Leon Schuermann

From: Leon Schuermann <leon@is.currently.online>

Add an optional function `ndo_lookup_mtu` to the `struct
net_device_ops`. This function can be used to allow other parts of the
network stack to let the destination netdevice determine the allowed
packet MTU. This is done on a per-packet basis, providing the `struct
sk_buff` holding the packet contents.

The information obtained through this method may be cached by other
parts of the network stack, such as for instance the path MTU
discovery (PMTUD) mechanism. It is not guaranteed that this function
will be called for every packet, not even that is called on a single
packet of a given flow. When this function is not implemented or when
it returns -ENODATA no statement about the permitted MTU is made and
the networking stack will resort to the device MTU values. These
properties make this mechanism capable of providing a "suggestion" for
a packet's MTU, deviating from the default device MTU.

The device is allowed to announce MTU values lower or higher than the
minimum and maximum device MTU respectively. Whether such MTU values
will be respected is up to the implementation.

Still, even with this being a non-mandatory to implement or respect
mechanism, it has some interesting consequences. Being able to inspect
the entire packet buffer, the destination netdevice implementation can
control MTUs on a flow granularity. For instance, it could be used to
allow two devices on a shared Ethernet segment to communicate with
each other using a large (> 1500 byte) MTU, while using a lower MTU
for other devices.

The immediate motivation for these changes provide another example of
this mechanism being useful: when using WireGuard, peers can reside
behind paths of varying MTU restrictions. PMTUD does not work across
these tunnel links however, as WireGuard cannot accept unauthenticated
ICMP responses. Thus it will continue to send too large packets over
lower-MTU links. With this mechanism WireGuard can, on a per-peer
granularity, reduce the MTU, without limiting the overall device
MTU. Furthermore, it can employ in-band PMTUD mechanisms to resolve
these values automatically. While an MTU metric can be set for
specific FIB routes and thus lower the MTU for individual peers, as a
consequence this completely disables PMTUD on the entire route. While
regular PMTUD does not work over the tunnel link, it should still be
usable on the rest of the route. Furthermore, when employing an
in-band per-peer PMTUD mechanism, modifying the FIB to store the
detected MTU is inelegant at best.

Signed-off-by: Leon Schuermann <leon@is.currently.online>
---
 include/linux/netdevice.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7c3da0e1ea9d..d9d59b756f57 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1279,6 +1279,16 @@ struct netdev_net_notifier {
  * struct net_device *(*ndo_get_peer_dev)(struct net_device *dev);
  *	If a device is paired with a peer device, return the peer instance.
  *	The caller must be under RCU read context.
+ * int (*ndo_lookup_mtu)(const struct sk_buff *skb,
+ *			 const struct net_device *dev);
+ *	For devices supporting dynamic lookup of the MTU for individual
+ *	skb packets, this function returns the MTU for the passed skb.
+ *	A return value of -ENODATA must be treated as if the device does
+ *	not support this feature. It is not guaranteed that this function will
+ *	be called for every packet presented to the ndo_start_xmit function.
+ *	A device must always accept packets of the announced min/max device MTU.
+ *	This function may be used to potentially allow MTU sizes lower/higher
+ *	than the min/max device MTU respectively.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1487,6 +1497,8 @@ struct net_device_ops {
 	int			(*ndo_tunnel_ctl)(struct net_device *dev,
 						  struct ip_tunnel_parm *p, int cmd);
 	struct net_device *	(*ndo_get_peer_dev)(struct net_device *dev);
+	int			(*ndo_lookup_mtu)(const struct sk_buff *skb,
+						  const struct net_device *dev);
 };
 
 /**
-- 
2.33.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 2/4] net/ipv4: respect MTU determined by `ndo_lookup_mtu`
  2021-12-28 23:45 [RFC PATCH 0/4] Introduce per-peer MTU setting leon
  2021-12-28 23:45 ` [RFC PATCH 1/4] netdevice: add ndo_lookup_mtu for dynamically determining MTU leon
@ 2021-12-28 23:45 ` leon
  2021-12-28 23:45 ` [RFC PATCH 3/4] net/ipv6: " leon
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: leon @ 2021-12-28 23:45 UTC (permalink / raw)
  To: wireguard; +Cc: Leon Schuermann

From: Leon Schuermann <leon@is.currently.online>

This integrates the newly introduced dynamic MTU lookup mechanism with
the IPv4 stack. It will attempt to query the destination netdevice for
the individual packet MTU and, if not found or the mechanism is not
implemented, fall back to the device MTU.

`ndo_lookup_mtu` will not be queried and respected for every
packet. For instance, flow offloading with netfilter will only take
the device MTU into account.

Signed-off-by: Leon Schuermann <leon@is.currently.online>
---
 include/net/ip.h                   | 34 ++++++++++++++++++++++--------
 net/ipv4/ip_forward.c              |  2 +-
 net/netfilter/nf_flow_table_core.c |  2 +-
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 2d6b985d11cc..5232d0c07dea 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -433,34 +433,50 @@ static inline bool ip_sk_ignore_df(const struct sock *sk)
 }
 
 static inline unsigned int ip_dst_mtu_maybe_forward(const struct dst_entry *dst,
-						    bool forwarding)
+						    bool forwarding,
+						    const struct sk_buff *skb)
 {
-	struct net *net = dev_net(dst->dev);
+	int err;
 	unsigned int mtu;
+	struct net_device *dev = dst->dev;
 
-	if (net->ipv4.sysctl_ip_fwd_use_pmtu ||
-	    ip_mtu_locked(dst) ||
-	    !forwarding)
-		return dst_mtu(dst);
+	err = -ENODATA;
+	if (skb && dev->netdev_ops->ndo_lookup_mtu)
+		err = dev->netdev_ops->ndo_lookup_mtu(skb, dev);
+	mtu = (err >= 0) ? err : READ_ONCE(dst->dev->mtu);
+
+	if (dev_net(dev)->ipv4.sysctl_ip_fwd_use_pmtu ||
+		ip_mtu_locked(dst) ||
+		!forwarding)
+		return min(mtu, dst_mtu(dst));
 
 	/* 'forwarding = true' case should always honour route mtu */
 	mtu = dst_metric_raw(dst, RTAX_MTU);
 	if (mtu)
 		return mtu;
 
-	return min(READ_ONCE(dst->dev->mtu), IP_MAX_MTU);
+	return min(mtu, IP_MAX_MTU);
 }
 
 static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
 					  const struct sk_buff *skb)
 {
+	int err;
+	unsigned int mtu;
+	struct net_device *dev = skb_dst(skb)->dev;
+
 	if (!sk || !sk_fullsock(sk) || ip_sk_use_pmtu(sk)) {
 		bool forwarding = IPCB(skb)->flags & IPSKB_FORWARDED;
 
-		return ip_dst_mtu_maybe_forward(skb_dst(skb), forwarding);
+		return ip_dst_mtu_maybe_forward(skb_dst(skb), forwarding, skb);
 	}
 
-	return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU);
+	err = -ENODATA;
+	if (dev->netdev_ops->ndo_lookup_mtu)
+		err = dev->netdev_ops->ndo_lookup_mtu(skb, dev);
+	mtu = (err >= 0) ? err : READ_ONCE(dev->mtu);
+
+	return min(mtu, IP_MAX_MTU);
 }
 
 struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx,
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 00ec819f949b..7a7ec3643c37 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -127,7 +127,7 @@ int ip_forward(struct sk_buff *skb)
 		goto sr_failed;
 
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
-	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
+	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true, skb);
 	if (ip_exceeds_mtu(skb, mtu)) {
 		IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c
index 513f78db3cb2..95bd7a87066a 100644
--- a/net/netfilter/nf_flow_table_core.c
+++ b/net/netfilter/nf_flow_table_core.c
@@ -87,7 +87,7 @@ static int flow_offload_fill_route(struct flow_offload *flow,
 
 	switch (flow_tuple->l3proto) {
 	case NFPROTO_IPV4:
-		flow_tuple->mtu = ip_dst_mtu_maybe_forward(dst, true);
+		flow_tuple->mtu = ip_dst_mtu_maybe_forward(dst, true, NULL);
 		break;
 	case NFPROTO_IPV6:
 		flow_tuple->mtu = ip6_dst_mtu_forward(dst);
-- 
2.33.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 3/4] net/ipv6: respect MTU determined by `ndo_lookup_mtu`
  2021-12-28 23:45 [RFC PATCH 0/4] Introduce per-peer MTU setting leon
  2021-12-28 23:45 ` [RFC PATCH 1/4] netdevice: add ndo_lookup_mtu for dynamically determining MTU leon
  2021-12-28 23:45 ` [RFC PATCH 2/4] net/ipv4: respect MTU determined by `ndo_lookup_mtu` leon
@ 2021-12-28 23:45 ` leon
  2021-12-28 23:45 ` [RFC PATCH 4/4] net/wireguard: add per-peer MTU setting leon
  2022-01-04 21:34 ` [RFC PATCH 0/4] Introduce " Toke Høiland-Jørgensen
  4 siblings, 0 replies; 7+ messages in thread
From: leon @ 2021-12-28 23:45 UTC (permalink / raw)
  To: wireguard; +Cc: Leon Schuermann

From: Leon Schuermann <leon@is.currently.online>

This integrates the newly introduced dynamic MTU lookup mechanism with
the IPv6 stack. It will attempt to query the destination netdevice for
the individual packet MTU and, if not found or the mechanism is not
implemented, fall back to the device MTU.

`ndo_lookup_mtu` will not be queried and respected for every
packet. For instance, flow offloading with netfilter will only take
the device MTU into account.

Signed-off-by: Leon Schuermann <leon@is.currently.online>
---
 include/net/ip6_route.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 2a5277758379..97e304291d1f 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -264,11 +264,21 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 
 static inline int ip6_skb_dst_mtu(struct sk_buff *skb)
 {
+	int mtu;
+	int err;
+	struct net_device *dev = skb_dst(skb)->dev;
 	struct ipv6_pinfo *np = skb->sk && !dev_recursion_level() ?
 				inet6_sk(skb->sk) : NULL;
 
-	return (np && np->pmtudisc >= IPV6_PMTUDISC_PROBE) ?
-	       skb_dst(skb)->dev->mtu : dst_mtu(skb_dst(skb));
+	err = -ENODATA;
+	if (dev->netdev_ops->ndo_lookup_mtu)
+		err = dev->netdev_ops->ndo_lookup_mtu(skb, dev);
+	mtu = (err >= 0) ? err : READ_ONCE(dev->mtu);
+
+	if (np && np->pmtudisc < IPV6_PMTUDISC_PROBE)
+		mtu = min_t(int, mtu, dst_mtu(skb_dst(skb)));
+
+	return mtu;
 }
 
 static inline bool ip6_sk_accept_pmtu(const struct sock *sk)
-- 
2.33.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH 4/4] net/wireguard: add per-peer MTU setting
  2021-12-28 23:45 [RFC PATCH 0/4] Introduce per-peer MTU setting leon
                   ` (2 preceding siblings ...)
  2021-12-28 23:45 ` [RFC PATCH 3/4] net/ipv6: " leon
@ 2021-12-28 23:45 ` leon
  2022-01-04 21:34 ` [RFC PATCH 0/4] Introduce " Toke Høiland-Jørgensen
  4 siblings, 0 replies; 7+ messages in thread
From: leon @ 2021-12-28 23:45 UTC (permalink / raw)
  To: wireguard; +Cc: Leon Schuermann

From: Leon Schuermann <leon@is.currently.online>

WireGuard is, as of now, unable to detect an insufficient link MTU and
adjust the MTU of transmitted frames accordingly, as ICMP messages
which would indicate this are unauthenticated. This poses a problem in
an environment where multiple peers connect to a shared endpoint,
through various links with different MTU constraints. A workaround is
to reduce the shared endpoint's MTU to the largest MTU supported by
every peer link. Naturally, this adds additional overhead for all
peers with more relaxed MTU constraints.

Thus this introduces a per-peer MTU setting into WireGuard. This value
can be set statically through the userspace API. It is announced to
the rest of the network stack through the `ndo_lookup_mtu` module,
which is (partially) respected by both IPv4 and IPv6.

While Linux supports setting an MTU metric for specific FIB route
entries and thus allows to lower the MTU for individual peers, this
causes regular path MTU discovery (PMTUD) to be completely disabled on
the entire route. While regular PMTUD does not work over the tunnel
link, it should still be usable on the rest of the route. Furthermore,
keeping an internal per-peer MTU value paves the way for integrating
an in-band PMTUD mechanism, as it does not require modifying the FIB
to reflect new discovered MTU values.

Signed-off-by: Leon Schuermann <leon@is.currently.online>
---
 drivers/net/wireguard/allowedips.c |  2 +-
 drivers/net/wireguard/allowedips.h |  2 +-
 drivers/net/wireguard/device.c     | 20 ++++++++++++++++++--
 drivers/net/wireguard/netlink.c    |  8 ++++++++
 drivers/net/wireguard/peer.c       |  1 +
 drivers/net/wireguard/peer.h       |  1 +
 drivers/net/wireguard/queueing.h   |  2 +-
 include/uapi/linux/wireguard.h     |  5 +++++
 8 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
index 3725e9cd85f4..8870a96b223b 100644
--- a/drivers/net/wireguard/allowedips.c
+++ b/drivers/net/wireguard/allowedips.c
@@ -354,7 +354,7 @@ int wg_allowedips_read_node(struct allowedips_node *node, u8 ip[16], u8 *cidr)
 
 /* Returns a strong reference to a peer */
 struct wg_peer *wg_allowedips_lookup_dst(struct allowedips *table,
-					 struct sk_buff *skb)
+					 const struct sk_buff *skb)
 {
 	if (skb->protocol == htons(ETH_P_IP))
 		return lookup(table->root4, 32, &ip_hdr(skb)->daddr);
diff --git a/drivers/net/wireguard/allowedips.h b/drivers/net/wireguard/allowedips.h
index e5c83cafcef4..366f42c04e25 100644
--- a/drivers/net/wireguard/allowedips.h
+++ b/drivers/net/wireguard/allowedips.h
@@ -48,7 +48,7 @@ int wg_allowedips_read_node(struct allowedips_node *node, u8 ip[16], u8 *cidr);
 
 /* These return a strong reference to a peer: */
 struct wg_peer *wg_allowedips_lookup_dst(struct allowedips *table,
-					 struct sk_buff *skb);
+					 const struct sk_buff *skb);
 struct wg_peer *wg_allowedips_lookup_src(struct allowedips *table,
 					 struct sk_buff *skb);
 
diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
index c9f65e96ccb0..52460a04cffb 100644
--- a/drivers/net/wireguard/device.c
+++ b/drivers/net/wireguard/device.c
@@ -211,11 +211,27 @@ static netdev_tx_t wg_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }
 
+static int wg_lookup_mtu(const struct sk_buff *skb, const struct net_device *dev)
+{
+	struct wg_device *wg = netdev_priv(dev);
+	struct wg_peer *peer;
+
+	if (unlikely(!wg_check_packet_protocol(skb)))
+		return -ENODATA;
+
+	peer = wg_allowedips_lookup_dst(&wg->peer_allowedips, skb);
+	if (unlikely(!peer))
+		return -ENODATA;
+
+	return (peer->peer_mtu != 0) ? peer->peer_mtu : -ENODATA;
+}
+
 static const struct net_device_ops netdev_ops = {
 	.ndo_open		= wg_open,
 	.ndo_stop		= wg_stop,
-	.ndo_start_xmit		= wg_xmit,
-	.ndo_get_stats64	= ip_tunnel_get_stats64
+	.ndo_start_xmit	= wg_xmit,
+	.ndo_get_stats64	= ip_tunnel_get_stats64,
+	.ndo_lookup_mtu	= wg_lookup_mtu,
 };
 
 static void wg_destruct(struct net_device *dev)
diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c
index d0f3b6d7f408..ad636f893027 100644
--- a/drivers/net/wireguard/netlink.c
+++ b/drivers/net/wireguard/netlink.c
@@ -40,6 +40,7 @@ static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
 	[WGPEER_A_RX_BYTES]				= { .type = NLA_U64 },
 	[WGPEER_A_TX_BYTES]				= { .type = NLA_U64 },
 	[WGPEER_A_ALLOWEDIPS]				= { .type = NLA_NESTED },
+	[WGPEER_A_PEER_MTU]				= { .type = NLA_U32 },
 	[WGPEER_A_PROTOCOL_VERSION]			= { .type = NLA_U32 }
 };
 
@@ -142,6 +143,7 @@ get_peer(struct wg_peer *peer, struct sk_buff *skb, struct dump_ctx *ctx)
 				      WGPEER_A_UNSPEC) ||
 		    nla_put_u64_64bit(skb, WGPEER_A_RX_BYTES, peer->rx_bytes,
 				      WGPEER_A_UNSPEC) ||
+		    nla_put_u32(skb, WGPEER_A_PEER_MTU, peer->peer_mtu) ||
 		    nla_put_u32(skb, WGPEER_A_PROTOCOL_VERSION, 1))
 			goto err;
 
@@ -480,6 +482,12 @@ static int set_peer(struct wg_device *wg, struct nlattr **attrs)
 			wg_packet_send_keepalive(peer);
 	}
 
+	if (attrs[WGPEER_A_PEER_MTU]) {
+		const u32 peer_mtu = nla_get_u32(attrs[WGPEER_A_PEER_MTU]);
+
+		peer->peer_mtu = peer_mtu;
+	}
+
 	if (netif_running(wg->dev))
 		wg_packet_send_staged_packets(peer);
 
diff --git a/drivers/net/wireguard/peer.c b/drivers/net/wireguard/peer.c
index b3b6370e6b95..c21704176216 100644
--- a/drivers/net/wireguard/peer.c
+++ b/drivers/net/wireguard/peer.c
@@ -46,6 +46,7 @@ struct wg_peer *wg_peer_create(struct wg_device *wg,
 		goto err_3;
 
 	peer->internal_id = atomic64_inc_return(&peer_counter);
+	peer->peer_mtu = 0;
 	peer->serial_work_cpu = nr_cpumask_bits;
 	wg_cookie_init(&peer->latest_cookie);
 	wg_timers_init(peer);
diff --git a/drivers/net/wireguard/peer.h b/drivers/net/wireguard/peer.h
index 23af40922997..2ba51f29e3da 100644
--- a/drivers/net/wireguard/peer.h
+++ b/drivers/net/wireguard/peer.h
@@ -64,6 +64,7 @@ struct wg_peer {
 	u64 internal_id;
 	struct napi_struct napi;
 	bool is_dead;
+	u32 peer_mtu;
 };
 
 struct wg_peer *wg_peer_create(struct wg_device *wg,
diff --git a/drivers/net/wireguard/queueing.h b/drivers/net/wireguard/queueing.h
index dfb674e03076..f7fc7a32abf6 100644
--- a/drivers/net/wireguard/queueing.h
+++ b/drivers/net/wireguard/queueing.h
@@ -66,7 +66,7 @@ struct packet_cb {
 #define PACKET_CB(skb) ((struct packet_cb *)((skb)->cb))
 #define PACKET_PEER(skb) (PACKET_CB(skb)->keypair->entry.peer)
 
-static inline bool wg_check_packet_protocol(struct sk_buff *skb)
+static inline bool wg_check_packet_protocol(const struct sk_buff *skb)
 {
 	__be16 real_protocol = ip_tunnel_parse_protocol(skb);
 	return real_protocol && skb->protocol == real_protocol;
diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h
index ae88be14c947..b86ab0320fa5 100644
--- a/include/uapi/linux/wireguard.h
+++ b/include/uapi/linux/wireguard.h
@@ -49,6 +49,7 @@
  *                    ...
  *                ...
  *            WGPEER_A_PROTOCOL_VERSION: NLA_U32
+ *            WGPEER_A_PEER_MTU: NLA_U32
  *        0: NLA_NESTED
  *            ...
  *        ...
@@ -111,6 +112,9 @@
  *                                       most recent protocol will be used when
  *                                       this is unset. Otherwise, must be set
  *                                       to 1.
+ *            WGPEER_A_PEER_MTU: per-peer TX MTU (IP header + payload), or 0 to
+ *                                       use the interface MTU. Applies only if
+ *                                       lower than the interface MTU.
  *        0: NLA_NESTED
  *            ...
  *        ...
@@ -180,6 +184,7 @@ enum wgpeer_attribute {
 	WGPEER_A_TX_BYTES,
 	WGPEER_A_ALLOWEDIPS,
 	WGPEER_A_PROTOCOL_VERSION,
+	WGPEER_A_PEER_MTU,
 	__WGPEER_A_LAST
 };
 #define WGPEER_A_MAX (__WGPEER_A_LAST - 1)
-- 
2.33.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/4] Introduce per-peer MTU setting
  2021-12-28 23:45 [RFC PATCH 0/4] Introduce per-peer MTU setting leon
                   ` (3 preceding siblings ...)
  2021-12-28 23:45 ` [RFC PATCH 4/4] net/wireguard: add per-peer MTU setting leon
@ 2022-01-04 21:34 ` Toke Høiland-Jørgensen
  2022-01-07 22:13   ` Leon Schuermann
  4 siblings, 1 reply; 7+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-01-04 21:34 UTC (permalink / raw)
  To: leon, wireguard; +Cc: Leon Schuermann

leon@is.currently.online writes:

> From: Leon Schuermann <leon@is.currently.online>
>
> This patch series is an attempt to integrate a per-peer MTU setting
> into WireGuard. With matching changes to the wireguard-tools,
> individual MTU values can be set and retrieved for each registered
> peer.
>
> While Linux supports setting an MTU metric for specific FIB route
> entries [which I've only found out after implementing this :)], and
> thus allows to lower the MTU for individual peers, this appears to
> disable regular path MTU discovery (PMTUD) entirely on the
> route. While regular PMTUD does not work over the tunnel link, it
> should still be usable on the rest of the route.

I'm not sure I understand the use case? Either PMTUD works through the
tunnel and you can just let that do its job, or it doesn't and you have
to do out-of-band discovery anyway in which case you can just use the
FIB route MTU? Or what do you mean by "usable on the rest of the route"?

> Furthermore, with the goal of eventually introducing an in-band
> per-peer PMTUD mechanism, keeping an internal per-peer MTU value does
> not require modifying the FIB and thus potentially interfere with
> userspace.

What "in-band per-peer PMTUD mechanism"? And why does it need this?

-Toke

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH 0/4] Introduce per-peer MTU setting
  2022-01-04 21:34 ` [RFC PATCH 0/4] Introduce " Toke Høiland-Jørgensen
@ 2022-01-07 22:13   ` Leon Schuermann
  0 siblings, 0 replies; 7+ messages in thread
From: Leon Schuermann @ 2022-01-07 22:13 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen; +Cc: wireguard


Toke Høiland-Jørgensen <toke@toke.dk> writes:
> I'm not sure I understand the use case? Either PMTUD works through the
> tunnel and you can just let that do its job, or it doesn't and you
> have to do out-of-band discovery anyway in which case you can just use
> the FIB route MTU?

For traffic _through_ the WireGuard tunnel, that is correct. As
WireGuard in general does not do any funny business with the traffic it
forwards, path MTU discovery through the tunnel works just fine. I'll
call that end-to-end PMTUD. If this does not work for any reason, one
has to fall back onto specifying the MTU in FIB, or some other
mechanism.

I am however concerned about the link(s) _underneath_ the WireGuard
tunnel (where the encrypted + authenticated packets are forwarded), so
the endpoint-to-endpoint link. Regular path MTU discovery does not work
here. As far as I understand, the reasoning behind this is that even if
the WireGuard endpoint does receive ICMP Fragmentation Needed / Packet
Too Big messages from a host on the path the tunnel traverses, these
messages are not and cannot be authenticated. This means that this
information cannot be forwarded to the sender of the original packet,
outside of the tunnel.

This is a real-word issue I am experiencing in WireGuard setups. For
instance, I administer the WireGuard instance of a small student ISP.
Clients connect from a variety of networks to this endpoint, such as DSL
links (PPPoE) which commonly have 1492 bytes MTU, or connections using
Dual-Stack Lite, having 1460 bytes MTU due to the encapsulation
overhead. Essentially no residential providers fragment packets, and
some do not even send ICMP responses. Sometimes people use a tunnel
inside another tunnel, further decreasing MTU.

While reducing the server and client MTUs to the maximum MTU supported
by all supported link types technically works, it increases IP, tunnel
and transport header overhead. It is thus desirable to be able to
specify an individual MTU per WireGuard peer, to use the available MTU
on the respective routes. This is also on the WireGuard project's todo
[1] and has been discussed before [2].

> what do you mean by "usable on the rest of the route"?

Actually, I think I might be wrong here. Initial tests have suggested me
that if the route MTU is specified in the FIB, Linux would not take any
ICMP Fragmentation Needed / Packet Too Big responses into account. I've
tested this again, and it seems to indeed perform proper path MTU
discovery even if the route MTU is specified. This is important as a
route to the destination host might first go through a WireGuard tunnel
to a peer, and then forwarded over paths which might have an even lower
MTU.

Thus the FIB entry MTU is a viable solution for setting individual
peer's route limits, but it might be rather inelegant to modify the
route's MTU values in the FIB from within kernel space, which might be
needed for an in-band PMTUD mechanism.

>> Furthermore, with the goal of eventually introducing an in-band
>> per-peer PMTUD mechanism, keeping an internal per-peer MTU value does
>> not require modifying the FIB and thus potentially interfere with
>> userspace.
>
> What "in-band per-peer PMTUD mechanism"? And why does it need this?

As outlined above, WireGuard cannot utilize the regular ICMP-based PMTUD
mechanism over the endpoint-to-endpoint path. It is however not great to
default to a low MTU to accomodate for low-MTU links on this path, and
very inconvenient to manually adjust the tunnel MTUs.

A solution to this issue could be a PMTUD mechanism through the tunnel
link itself. It would circumvent the security considerations with
ICMP-based PMTUD by relying exclusively on an encrypted + authenticated
message exchange. For instance, a naive approach could be to send ICMP
echo messages with increasing/decreasing payload size to the peer and
discover the usable tunnel MTU based on the (lost) responses. While this
can be implemented outside of the WireGuard kernel module, it makes
certain assumptions about the tunnel and endpoint configuration, such as
the endpoints having an IP assigned, this IP being in the AllowedIPs
(not a given), responding to ICMP echo packets, etc. If such a mechanism
were to be (optionally) integrated into WireGuard directly, it could
have the potential to reduce these kinds of headaches significantly.

#+BEGIN_EXAMPLE
Here is an illustration of these issues using a hacky Mininet test
setup[3], which has the following topology (all traffic from h5 being
routed over the tunnel between h1 and h4), with fragmentation disabled:

      /--- wireguard ---\
     /                   \
    /  eth    eth    eth  \
    h1 <-> h2 <-> h3 <-> h4 <-> h5

The route from h1 to h4 has an MTU of 1500 bytes:

    mininet> h1 ping -c1 -Mdo -s1472 h4
    1480 bytes from 10.0.2.2: icmp_seq=1 ttl=62 time=0.508 ms

The route from h1 to h5 (through the WireGuard tunnel, via h4) has an
MTU of 1420 bytes:

    mininet> h1 ping -c1 -Mdo -s1392 h5
    1400 bytes from 192.168.1.2: icmp_seq=1 ttl=63 time=7.44 ms

When decreasing the MTU of the h2 to h3 link, we can observe that PMTUD
works on the route of h1 to h4:

    mininet> h2 ip link set h2-eth1 mtu 1492
    mininet> h3 ip link set h3-eth0 mtu 1492
    mininet> h1 ping -c1 -Mdo -s1472 h4
    From 10.0.0.2 icmp_seq=1 Frag needed and DF set (mtu = 1492)

However, when trying to ping h5 from h1 through the WireGuard tunnel,
the packet is silently dropped:

    mininet> h1 ping -c1 -Mdo -s1392 -W1 h5
    PING 192.168.1.2 (192.168.1.2) 1392(1420) bytes of data.

    --- 192.168.1.2 ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms

We can change the appropriate FIB entry of the route _through_ the
tunnel to make Linux aware of the lower MTU:

    mininet> h1 ip route change 192.168.1.0/24 dev wg0 mtu 1412
    mininet> h1 ping -c1 -Mdo -s1392 -W1 h5
    ping: local error: message too long, mtu=1412
    mininet> h1 ping -c1 -Mdo -s1384 -W1 h5
    1392 bytes from 192.168.1.2: icmp_seq=1 ttl=63 time=10.8 ms

When lowering the MTU of the h4 to h5 link even further (not part of the
endpoint-to-endpoint link, but the route), PMTUD does work, which is
good:

    mininet> h4 ip link set h4-eth1 mtu 1400
    mininet> h5 ip link set h5-eth0 mtu 1400
    mininet> h1 ping -c1 -Mdo -s1384 -W1 h5
    PING 192.168.1.2 (192.168.1.2) 1384(1412) bytes of data.
    From 192.168.0.2 icmp_seq=1 Frag needed and DF set (mtu = 1400)
#+END_EXAMPLE

Let me know if that made things any clearer. :)

- Leon

[1]: https://www.wireguard.com/todo/#per-peer-pmtu
[2]: https://lists.zx2c4.com/pipermail/wireguard/2018-April/002651.html
[3]: https://gist.github.com/lschuermann/7e5de6e00358d1312c86e2144d7352b4

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-01-07 22:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-28 23:45 [RFC PATCH 0/4] Introduce per-peer MTU setting leon
2021-12-28 23:45 ` [RFC PATCH 1/4] netdevice: add ndo_lookup_mtu for dynamically determining MTU leon
2021-12-28 23:45 ` [RFC PATCH 2/4] net/ipv4: respect MTU determined by `ndo_lookup_mtu` leon
2021-12-28 23:45 ` [RFC PATCH 3/4] net/ipv6: " leon
2021-12-28 23:45 ` [RFC PATCH 4/4] net/wireguard: add per-peer MTU setting leon
2022-01-04 21:34 ` [RFC PATCH 0/4] Introduce " Toke Høiland-Jørgensen
2022-01-07 22:13   ` Leon Schuermann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).