* [WireGuard] [PATCH] Add support for platforms which has no efficient unaligned memory access @ 2016-09-10 12:50 René van Dorst 2016-09-10 12:57 ` René van Dorst 2016-09-11 12:06 ` [WireGuard] [PATCHv2] " René van Dorst 0 siblings, 2 replies; 6+ messages in thread From: René van Dorst @ 2016-09-10 12:50 UTC (permalink / raw) To: wireguard Here by my patch to support platforms which has no efficient unaligned memory access. Without it, it caused 55.2% slowdown in throughput at TP-Link WR1043ND, MIPS32r2@400Mhz. Benchmarks before: root@lede:~# iperf3 -c 10.0.0.1 -i 10 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 202 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 sender [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec receiver root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G [ ID] Interval Transfer Bandwidth Total Datagrams [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 3982 - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 0.049 ms 0/3982 (0%) [ 4] Sent 3982 datagrams Benchmarks with aligned memory fetching: root@lede:~# iperf3 -c 10.0.0.1 -i 10 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 145 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 sender [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec receiver iperf Done. root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G [ ID] Interval Transfer Bandwidth Total Datagrams [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7207 - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.041 ms 0/7207 (0%) [ 4] Sent 7207 datagrams Greats, René van Dorst. ^ permalink raw reply [flat|nested] 6+ messages in thread
* [WireGuard] [PATCH] Add support for platforms which has no efficient unaligned memory access 2016-09-10 12:50 [WireGuard] [PATCH] Add support for platforms which has no efficient unaligned memory access René van Dorst @ 2016-09-10 12:57 ` René van Dorst 2016-09-11 12:06 ` [WireGuard] [PATCHv2] " René van Dorst 1 sibling, 0 replies; 6+ messages in thread From: René van Dorst @ 2016-09-10 12:57 UTC (permalink / raw) To: wireguard From 35cef72b38756e111a6bf7a04641bd0f8d5eef61 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ren=C3=A9=20van=20Dorst?= <opensource@vdorst.com> Date: Sat, 10 Sep 2016 10:58:58 +0200 Subject: [PATCH] Add support for platforms which has no efficient unaligned memory access Without it, it caused 55.2% slowdown in throughput at TP-Link WR1043ND, MIPS32r2@400Mhz. Simply check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS at compile time. Test on TP-Link WR1043ND, MIPS32r2@400Mhz. Setup: https://lists.zx2c4.com/pipermail/wireguard/2016-August/000331.html Benchmarks before: root@lede:~# iperf3 -c 10.0.0.1 -i 10 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 202 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 sender [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec receiver root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G [ ID] Interval Transfer Bandwidth Total Datagrams [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 3982 - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 0.049 ms 0/3982 (0%) [ 4] Sent 3982 datagrams Benchmarks with aligned memory fetching: root@lede:~# iperf3 -c 10.0.0.1 -i 10 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 145 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 sender [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec receiver iperf Done. root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G [ ID] Interval Transfer Bandwidth Total Datagrams [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7207 - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.041 ms 0/7207 (0%) [ 4] Sent 7207 datagrams --- src/crypto/chacha20poly1305.c | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) mode change 100644 => 100755 src/crypto/chacha20poly1305.c diff --git a/src/crypto/chacha20poly1305.c b/src/crypto/chacha20poly1305.c old mode 100644 new mode 100755 index 5190894..f4ef356 --- a/src/crypto/chacha20poly1305.c +++ b/src/crypto/chacha20poly1305.c @@ -248,13 +248,29 @@ struct poly1305_ctx { static void poly1305_init(struct poly1305_ctx *ctx, const u8 key[static POLY1305_KEY_SIZE]) { +#ifndef HAVE_EFFICIENT_UNALIGNED_ACCESS + u32 t0, t1, t2, t3; +#endif + memset(ctx, 0, sizeof(struct poly1305_ctx)); /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */ +#ifdef HAVE_EFFICIENT_UNALIGNED_ACCESS ctx->r[0] = (le32_to_cpuvp(key + 0) >> 0) & 0x3ffffff; ctx->r[1] = (le32_to_cpuvp(key + 3) >> 2) & 0x3ffff03; ctx->r[2] = (le32_to_cpuvp(key + 6) >> 4) & 0x3ffc0ff; ctx->r[3] = (le32_to_cpuvp(key + 9) >> 6) & 0x3f03fff; ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff; +#else + t0 = le32_to_cpuvp(key + 0); + t1 = le32_to_cpuvp(key + 4); + t2 = le32_to_cpuvp(key + 8); + t3 = le32_to_cpuvp(key +12); + ctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6; + ctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12; + ctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18; + ctx->r[3] = t2 & 0x3f03fff; t3 >>= 8; + ctx->r[4] = t3 & 0x00fffff; +#endif ctx->s[0] = le32_to_cpuvp(key + 16); ctx->s[1] = le32_to_cpuvp(key + 20); ctx->s[2] = le32_to_cpuvp(key + 24); @@ -267,6 +283,9 @@ static unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 * u32 s1, s2, s3, s4; u32 h0, h1, h2, h3, h4; u64 d0, d1, d2, d3, d4; +#ifndef HAVE_EFFICIENT_UNALIGNED_ACCESS + u32 t0, t1, t2, t3; +#endif r0 = ctx->r[0]; r1 = ctx->r[1]; @@ -287,11 +306,23 @@ static unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 * while (likely(srclen >= POLY1305_BLOCK_SIZE)) { /* h += m[i] */ +#ifdef HAVE_EFFICIENT_UNALIGNED_ACCESS h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff; h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff; h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff; h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff; h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit; +#else + t0 = le32_to_cpuvp(src + 0); + t1 = le32_to_cpuvp(src + 4); + t2 = le32_to_cpuvp(src + 8); + t3 = le32_to_cpuvp(src + 12); + h0 += t0 & 0x3ffffff; + h1 += sr((((u64)t1 << 32) | t0), 26) & 0x3ffffff; + h2 += sr((((u64)t2 << 32) | t1), 20) & 0x3ffffff; + h3 += sr((((u64)t3 << 32) | t2), 14) & 0x3ffffff; + h4 += (t3 >> 8) | hibit; +#endif /* h *= r */ d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) + mlt(h3, s2) + mlt(h4, s1); -- 2.5.5 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [WireGuard] [PATCHv2] Add support for platforms which has no efficient unaligned memory access 2016-09-10 12:50 [WireGuard] [PATCH] Add support for platforms which has no efficient unaligned memory access René van Dorst 2016-09-10 12:57 ` René van Dorst @ 2016-09-11 12:06 ` René van Dorst 2016-09-20 19:58 ` Jason A. Donenfeld 1 sibling, 1 reply; 6+ messages in thread From: René van Dorst @ 2016-09-11 12:06 UTC (permalink / raw) To: wireguard Typo HAVE_EFFICIENT_UNALIGNED_ACCESS --> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. From 13fae657624aac6b9c1f411aa6472a91aae7fcc3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ren=C3=A9=20van=20Dorst?= <opensource@vdorst.com> Date: Sat, 10 Sep 2016 10:58:58 +0200 Subject: [PATCH] Add support for platforms which has no efficient unaligned memory access Without it, it caused 55.2% slowdown in throughput at TP-Link WR1043ND, MIPS32r2@400Mhz. Simply check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS at compile time. Test on TP-Link WR1043ND, MIPS32r2@400Mhz. Setup: https://lists.zx2c4.com/pipermail/wireguard/2016-August/000331.html Benchmarks before: root@lede:~# iperf3 -c 10.0.0.1 -i 10 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 202 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 sender [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec receiver root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G [ ID] Interval Transfer Bandwidth Total Datagrams [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 3982 - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 0.049 ms 0/3982 (0%) [ 4] Sent 3982 datagrams Benchmarks with aligned memory fetching: root@lede:~# iperf3 -c 10.0.0.1 -i 10 [ ID] Interval Transfer Bandwidth Retr Cwnd [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 145 KBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Retr [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 sender [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec receiver iperf Done. root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G [ ID] Interval Transfer Bandwidth Total Datagrams [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7207 - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.041 ms 0/7207 (0%) [ 4] Sent 7207 datagrams --- src/crypto/chacha20poly1305.c | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/src/crypto/chacha20poly1305.c b/src/crypto/chacha20poly1305.c index 5190894..294cbf6 100644 --- a/src/crypto/chacha20poly1305.c +++ b/src/crypto/chacha20poly1305.c @@ -248,13 +248,29 @@ struct poly1305_ctx { static void poly1305_init(struct poly1305_ctx *ctx, const u8 key[static POLY1305_KEY_SIZE]) { +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + u32 t0, t1, t2, t3; +#endif + memset(ctx, 0, sizeof(struct poly1305_ctx)); /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */ +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS ctx->r[0] = (le32_to_cpuvp(key + 0) >> 0) & 0x3ffffff; ctx->r[1] = (le32_to_cpuvp(key + 3) >> 2) & 0x3ffff03; ctx->r[2] = (le32_to_cpuvp(key + 6) >> 4) & 0x3ffc0ff; ctx->r[3] = (le32_to_cpuvp(key + 9) >> 6) & 0x3f03fff; ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff; +#else + t0 = le32_to_cpuvp(key + 0); + t1 = le32_to_cpuvp(key + 4); + t2 = le32_to_cpuvp(key + 8); + t3 = le32_to_cpuvp(key +12); + ctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6; + ctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12; + ctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18; + ctx->r[3] = t2 & 0x3f03fff; t3 >>= 8; + ctx->r[4] = t3 & 0x00fffff; +#endif ctx->s[0] = le32_to_cpuvp(key + 16); ctx->s[1] = le32_to_cpuvp(key + 20); ctx->s[2] = le32_to_cpuvp(key + 24); @@ -267,6 +283,9 @@ static unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 * u32 s1, s2, s3, s4; u32 h0, h1, h2, h3, h4; u64 d0, d1, d2, d3, d4; +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS + u32 t0, t1, t2, t3; +#endif r0 = ctx->r[0]; r1 = ctx->r[1]; @@ -287,11 +306,23 @@ static unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 * while (likely(srclen >= POLY1305_BLOCK_SIZE)) { /* h += m[i] */ +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff; h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff; h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff; h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff; h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit; +#else + t0 = le32_to_cpuvp(src + 0); + t1 = le32_to_cpuvp(src + 4); + t2 = le32_to_cpuvp(src + 8); + t3 = le32_to_cpuvp(src + 12); + h0 += t0 & 0x3ffffff; + h1 += sr((((u64)t1 << 32) | t0), 26) & 0x3ffffff; + h2 += sr((((u64)t2 << 32) | t1), 20) & 0x3ffffff; + h3 += sr((((u64)t3 << 32) | t2), 14) & 0x3ffffff; + h4 += (t3 >> 8) | hibit; +#endif /* h *= r */ d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) + mlt(h3, s2) + mlt(h4, s1); -- 2.5.5 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [WireGuard] [PATCHv2] Add support for platforms which has no efficient unaligned memory access 2016-09-11 12:06 ` [WireGuard] [PATCHv2] " René van Dorst @ 2016-09-20 19:58 ` Jason A. Donenfeld 2016-09-20 20:36 ` Jason A. Donenfeld 2016-09-21 6:45 ` René van Dorst 0 siblings, 2 replies; 6+ messages in thread From: Jason A. Donenfeld @ 2016-09-20 19:58 UTC (permalink / raw) To: René van Dorst; +Cc: WireGuard mailing list [-- Attachment #1: Type: text/plain, Size: 6612 bytes --] Hey René, This is an excellent find. Thanks. Pretty significant speed improvements. I wonder where else this is happening too. Have you tested this on both endians? The main thing I'm wondering here is why exactly the compiler can't generate more efficient code itself. I'll review this and merge soon if it looks good. Regards, Jason On Sun, Sep 11, 2016 at 2:06 PM, René van Dorst <opensource@vdorst.com> wrote: > Typo HAVE_EFFICIENT_UNALIGNED_ACCESS --> CONFIG_HAVE_EFFICIENT_UNALIGNE > D_ACCESS. > > From 13fae657624aac6b9c1f411aa6472a91aae7fcc3 Mon Sep 17 00:00:00 2001 > From: =?UTF-8?q?Ren=C3=A9=20van=20Dorst?= <opensource@vdorst.com> > Date: Sat, 10 Sep 2016 10:58:58 +0200 > Subject: [PATCH] Add support for platforms which has no efficient unaligned > memory access > > Without it, it caused 55.2% slowdown in throughput at TP-Link WR1043ND, > MIPS32r2@400Mhz. > > Simply check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS at compile time. > > Test on TP-Link WR1043ND, MIPS32r2@400Mhz. > Setup: https://lists.zx2c4.com/pipermail/wireguard/2016-August/000331.html > > Benchmarks before: > > root@lede:~# iperf3 -c 10.0.0.1 -i 10 > [ ID] Interval Transfer Bandwidth Retr Cwnd > [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 202 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bandwidth Retr > [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 > sender > [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec > receiver > > root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G > [ ID] Interval Transfer Bandwidth Total Datagrams > [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 3982 > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bandwidth Jitter Lost/Total > Datagrams > [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 0.049 ms 0/3982 (0%) > [ 4] Sent 3982 datagrams > > Benchmarks with aligned memory fetching: > > root@lede:~# iperf3 -c 10.0.0.1 -i 10 > [ ID] Interval Transfer Bandwidth Retr Cwnd > [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 145 KBytes > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bandwidth Retr > [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 > sender > [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec > receiver > > iperf Done. > root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G > [ ID] Interval Transfer Bandwidth Total Datagrams > [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7207 > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID] Interval Transfer Bandwidth Jitter Lost/Total > Datagrams > [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.041 ms 0/7207 (0%) > [ 4] Sent 7207 datagrams > --- > src/crypto/chacha20poly1305.c | 31 +++++++++++++++++++++++++++++++ > 1 file changed, 31 insertions(+) > > diff --git a/src/crypto/chacha20poly1305.c b/src/crypto/chacha20poly1305.c > index 5190894..294cbf6 100644 > --- a/src/crypto/chacha20poly1305.c > +++ b/src/crypto/chacha20poly1305.c > @@ -248,13 +248,29 @@ struct poly1305_ctx { > > static void poly1305_init(struct poly1305_ctx *ctx, const u8 key[static > POLY1305_KEY_SIZE]) > { > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > + u32 t0, t1, t2, t3; > +#endif > + > memset(ctx, 0, sizeof(struct poly1305_ctx)); > /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */ > +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > ctx->r[0] = (le32_to_cpuvp(key + 0) >> 0) & 0x3ffffff; > ctx->r[1] = (le32_to_cpuvp(key + 3) >> 2) & 0x3ffff03; > ctx->r[2] = (le32_to_cpuvp(key + 6) >> 4) & 0x3ffc0ff; > ctx->r[3] = (le32_to_cpuvp(key + 9) >> 6) & 0x3f03fff; > ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff; > +#else > + t0 = le32_to_cpuvp(key + 0); > + t1 = le32_to_cpuvp(key + 4); > + t2 = le32_to_cpuvp(key + 8); > + t3 = le32_to_cpuvp(key +12); > + ctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6; > + ctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12; > + ctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18; > + ctx->r[3] = t2 & 0x3f03fff; t3 >>= 8; > + ctx->r[4] = t3 & 0x00fffff; > +#endif > ctx->s[0] = le32_to_cpuvp(key + 16); > ctx->s[1] = le32_to_cpuvp(key + 20); > ctx->s[2] = le32_to_cpuvp(key + 24); > @@ -267,6 +283,9 @@ static unsigned int poly1305_generic_blocks(struct > poly1305_ctx *ctx, const u8 * > u32 s1, s2, s3, s4; > u32 h0, h1, h2, h3, h4; > u64 d0, d1, d2, d3, d4; > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > + u32 t0, t1, t2, t3; > +#endif > > r0 = ctx->r[0]; > r1 = ctx->r[1]; > @@ -287,11 +306,23 @@ static unsigned int poly1305_generic_blocks(struct > poly1305_ctx *ctx, const u8 * > > while (likely(srclen >= POLY1305_BLOCK_SIZE)) { > /* h += m[i] */ > +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff; > h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff; > h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff; > h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff; > h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit; > +#else > + t0 = le32_to_cpuvp(src + 0); > + t1 = le32_to_cpuvp(src + 4); > + t2 = le32_to_cpuvp(src + 8); > + t3 = le32_to_cpuvp(src + 12); > + h0 += t0 & 0x3ffffff; > + h1 += sr((((u64)t1 << 32) | t0), 26) & 0x3ffffff; > + h2 += sr((((u64)t2 << 32) | t1), 20) & 0x3ffffff; > + h3 += sr((((u64)t3 << 32) | t2), 14) & 0x3ffffff; > + h4 += (t3 >> 8) | hibit; > +#endif > > /* h *= r */ > d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) + mlt(h3, s2) > + mlt(h4, s1); > -- > 2.5.5 > > > _______________________________________________ > WireGuard mailing list > WireGuard@lists.zx2c4.com > http://lists.zx2c4.com/mailman/listinfo/wireguard > -- Jason A. Donenfeld Deep Space Explorer fr: +33 6 51 90 82 66 us: +1 513 476 1200 www.jasondonenfeld.com www.zx2c4.com zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc [-- Attachment #2: Type: text/html, Size: 9106 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [WireGuard] [PATCHv2] Add support for platforms which has no efficient unaligned memory access 2016-09-20 19:58 ` Jason A. Donenfeld @ 2016-09-20 20:36 ` Jason A. Donenfeld 2016-09-21 6:45 ` René van Dorst 1 sibling, 0 replies; 6+ messages in thread From: Jason A. Donenfeld @ 2016-09-20 20:36 UTC (permalink / raw) To: René van Dorst; +Cc: WireGuard mailing list [-- Attachment #1: Type: text/plain, Size: 7385 bytes --] Hey again, That commit is tentatively living here while I examine it: https://git.zx2c4.com/WireGuard/commit/?id=7a6abc928ea082d34d703d4097bcc06f6a2117e0 By the way, what you sent didn't actually apply, so I had to retype it. Next time, please use git-send-email(1). Jason On Tue, Sep 20, 2016 at 9:58 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > Hey René, > > This is an excellent find. Thanks. Pretty significant speed improvements. > I wonder where else this is happening too. > > Have you tested this on both endians? > > The main thing I'm wondering here is why exactly the compiler can't > generate more efficient code itself. > > I'll review this and merge soon if it looks good. > > Regards, > Jason > > On Sun, Sep 11, 2016 at 2:06 PM, René van Dorst <opensource@vdorst.com> > wrote: > >> Typo HAVE_EFFICIENT_UNALIGNED_ACCESS --> CONFIG_HAVE_EFFICIENT_UNALIGNE >> D_ACCESS. >> >> From 13fae657624aac6b9c1f411aa6472a91aae7fcc3 Mon Sep 17 00:00:00 2001 >> From: =?UTF-8?q?Ren=C3=A9=20van=20Dorst?= <opensource@vdorst.com> >> Date: Sat, 10 Sep 2016 10:58:58 +0200 >> Subject: [PATCH] Add support for platforms which has no efficient >> unaligned >> memory access >> >> Without it, it caused 55.2% slowdown in throughput at TP-Link WR1043ND, >> MIPS32r2@400Mhz. >> >> Simply check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS at compile time. >> >> Test on TP-Link WR1043ND, MIPS32r2@400Mhz. >> Setup: https://lists.zx2c4.com/pipermail/wireguard/2016-August/0003 >> 31.html >> >> Benchmarks before: >> >> root@lede:~# iperf3 -c 10.0.0.1 -i 10 >> [ ID] Interval Transfer Bandwidth Retr Cwnd >> [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 202 KBytes >> - - - - - - - - - - - - - - - - - - - - - - - - - >> [ ID] Interval Transfer Bandwidth Retr >> [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 >> sender >> [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec >> receiver >> >> root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G >> [ ID] Interval Transfer Bandwidth Total Datagrams >> [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 3982 >> - - - - - - - - - - - - - - - - - - - - - - - - - >> [ ID] Interval Transfer Bandwidth Jitter >> Lost/Total Datagrams >> [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 0.049 ms 0/3982 >> (0%) >> [ 4] Sent 3982 datagrams >> >> Benchmarks with aligned memory fetching: >> >> root@lede:~# iperf3 -c 10.0.0.1 -i 10 >> [ ID] Interval Transfer Bandwidth Retr Cwnd >> [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 145 KBytes >> - - - - - - - - - - - - - - - - - - - - - - - - - >> [ ID] Interval Transfer Bandwidth Retr >> [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 >> sender >> [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec >> receiver >> >> iperf Done. >> root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G >> [ ID] Interval Transfer Bandwidth Total Datagrams >> [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7207 >> - - - - - - - - - - - - - - - - - - - - - - - - - >> [ ID] Interval Transfer Bandwidth Jitter >> Lost/Total Datagrams >> [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.041 ms 0/7207 >> (0%) >> [ 4] Sent 7207 datagrams >> --- >> src/crypto/chacha20poly1305.c | 31 +++++++++++++++++++++++++++++++ >> 1 file changed, 31 insertions(+) >> >> diff --git a/src/crypto/chacha20poly1305.c b/src/crypto/chacha20poly1305. >> c >> index 5190894..294cbf6 100644 >> --- a/src/crypto/chacha20poly1305.c >> +++ b/src/crypto/chacha20poly1305.c >> @@ -248,13 +248,29 @@ struct poly1305_ctx { >> >> static void poly1305_init(struct poly1305_ctx *ctx, const u8 key[static >> POLY1305_KEY_SIZE]) >> { >> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> + u32 t0, t1, t2, t3; >> +#endif >> + >> memset(ctx, 0, sizeof(struct poly1305_ctx)); >> /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */ >> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> ctx->r[0] = (le32_to_cpuvp(key + 0) >> 0) & 0x3ffffff; >> ctx->r[1] = (le32_to_cpuvp(key + 3) >> 2) & 0x3ffff03; >> ctx->r[2] = (le32_to_cpuvp(key + 6) >> 4) & 0x3ffc0ff; >> ctx->r[3] = (le32_to_cpuvp(key + 9) >> 6) & 0x3f03fff; >> ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff; >> +#else >> + t0 = le32_to_cpuvp(key + 0); >> + t1 = le32_to_cpuvp(key + 4); >> + t2 = le32_to_cpuvp(key + 8); >> + t3 = le32_to_cpuvp(key +12); >> + ctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6; >> + ctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12; >> + ctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18; >> + ctx->r[3] = t2 & 0x3f03fff; t3 >>= 8; >> + ctx->r[4] = t3 & 0x00fffff; >> +#endif >> ctx->s[0] = le32_to_cpuvp(key + 16); >> ctx->s[1] = le32_to_cpuvp(key + 20); >> ctx->s[2] = le32_to_cpuvp(key + 24); >> @@ -267,6 +283,9 @@ static unsigned int poly1305_generic_blocks(struct >> poly1305_ctx *ctx, const u8 * >> u32 s1, s2, s3, s4; >> u32 h0, h1, h2, h3, h4; >> u64 d0, d1, d2, d3, d4; >> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> + u32 t0, t1, t2, t3; >> +#endif >> >> r0 = ctx->r[0]; >> r1 = ctx->r[1]; >> @@ -287,11 +306,23 @@ static unsigned int poly1305_generic_blocks(struct >> poly1305_ctx *ctx, const u8 * >> >> while (likely(srclen >= POLY1305_BLOCK_SIZE)) { >> /* h += m[i] */ >> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff; >> h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff; >> h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff; >> h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff; >> h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit; >> +#else >> + t0 = le32_to_cpuvp(src + 0); >> + t1 = le32_to_cpuvp(src + 4); >> + t2 = le32_to_cpuvp(src + 8); >> + t3 = le32_to_cpuvp(src + 12); >> + h0 += t0 & 0x3ffffff; >> + h1 += sr((((u64)t1 << 32) | t0), 26) & 0x3ffffff; >> + h2 += sr((((u64)t2 << 32) | t1), 20) & 0x3ffffff; >> + h3 += sr((((u64)t3 << 32) | t2), 14) & 0x3ffffff; >> + h4 += (t3 >> 8) | hibit; >> +#endif >> >> /* h *= r */ >> d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) + mlt(h3, >> s2) + mlt(h4, s1); >> -- >> 2.5.5 >> >> >> _______________________________________________ >> WireGuard mailing list >> WireGuard@lists.zx2c4.com >> http://lists.zx2c4.com/mailman/listinfo/wireguard >> > > > > -- > Jason A. Donenfeld > Deep Space Explorer > fr: +33 6 51 90 82 66 > us: +1 513 476 1200 > www.jasondonenfeld.com > www.zx2c4.com > zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc > -- Jason A. Donenfeld Deep Space Explorer fr: +33 6 51 90 82 66 us: +1 513 476 1200 www.jasondonenfeld.com www.zx2c4.com zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc [-- Attachment #2: Type: text/html, Size: 10563 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [WireGuard] [PATCHv2] Add support for platforms which has no efficient unaligned memory access 2016-09-20 19:58 ` Jason A. Donenfeld 2016-09-20 20:36 ` Jason A. Donenfeld @ 2016-09-21 6:45 ` René van Dorst 1 sibling, 0 replies; 6+ messages in thread From: René van Dorst @ 2016-09-21 6:45 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: WireGuard mailing list [-- Attachment #1: Plaintext Message --] [-- Type: text/plain, Size: 7764 bytes --] Hi Jason, I searched a bit if I could find it in other places but I could not find it. > Have you tested this on both endians? No, my hardware only supports big endian. I am not experienced enough to run it in Qemu. I see it is already applied. Great! Greats, René van Dorst. Quoting "Jason A. Donenfeld" <Jason@zx2c4.com>: > Hey René, > This is an excellent find. Thanks. Pretty significant speed > improvements. I wonder where else this is happening too. > > Have you tested this on both endians? > > The main thing I'm wondering here is why exactly the compiler > can't generate more efficient code itself. > > I'll review this and merge soon if it looks good. > > Regards, > Jason > > > On Sun, Sep 11, 2016 at 2:06 PM, René van Dorst > <opensource@vdorst.com> wrote: > >> Typo HAVE_EFFICIENT_UNALIGNED_ACCESS --> >> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. >> >> From 13fae657624aac6b9c1f411aa6472a91aae7fcc3 Mon Sep 17 00:00:00 2001 >> From: =?UTF-8?q?Ren=C3=A9=20van=20Dorst?= <opensource@vdorst.com> >> Date: Sat, 10 Sep 2016 10:58:58 +0200 >> Subject: [PATCH] Add support for platforms which has no efficient unaligned >> memory access >> >> Without it, it caused 55.2% slowdown in throughput at TP-Link >> WR1043ND, MIPS32r2@400Mhz. >> >> Simply check for CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS at compile time. >> >> Test on TP-Link WR1043ND, MIPS32r2@400Mhz. >> Setup: https://lists.zx2c4.com/pipermail/wireguard/2016-August/000331.html >> >> Benchmarks before: >> >> root@lede:~# iperf3 -c 10.0.0.1 -i 10 >> [ ID] Interval Transfer Bandwidth Retr Cwnd >> [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 202 KBytes >> - - - - - - - - - - - - - - - - - - - - - - - - - >> [ ID] Interval Transfer Bandwidth Retr >> [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec 0 sender >> [ 4] 0.00-10.13 sec 28.8 MBytes 23.8 Mbits/sec >> receiver >> >> root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G >> [ ID] Interval Transfer Bandwidth Total Datagrams >> [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 3982 >> - - - - - - - - - - - - - - - - - - - - - - - - - >> [ ID] Interval Transfer Bandwidth Jitter >> Lost/Total Datagrams >> [ 4] 0.00-10.00 sec 31.1 MBytes 26.1 Mbits/sec 0.049 ms 0/3982 (0%) >> [ 4] Sent 3982 datagrams >> >> Benchmarks with aligned memory fetching: >> >> root@lede:~# iperf3 -c 10.0.0.1 -i 10 >> [ ID] Interval Transfer Bandwidth Retr Cwnd >> [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 145 KBytes >> - - - - - - - - - - - - - - - - - - - - - - - - - >> [ ID] Interval Transfer Bandwidth Retr >> [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec 0 sender >> [ 4] 0.00-10.22 sec 52.5 MBytes 43.1 Mbits/sec >> receiver >> >> iperf Done. >> root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G >> [ ID] Interval Transfer Bandwidth Total Datagrams >> [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 7207 >> - - - - - - - - - - - - - - - - - - - - - - - - - >> [ ID] Interval Transfer Bandwidth Jitter >> Lost/Total Datagrams >> [ 4] 0.00-10.00 sec 56.3 MBytes 47.2 Mbits/sec 0.041 ms 0/7207 (0%) >> [ 4] Sent 7207 datagrams >> >> --- >> src/crypto/chacha20poly1305.c | 31 +++++++++++++++++++++++++++++++ >> 1 file changed, 31 insertions(+) >> >> diff --git a/src/crypto/chacha20poly1305.c b/src/crypto/chacha20poly1305.c >> index 5190894..294cbf6 100644 >> --- a/src/crypto/chacha20poly1305.c >> +++ b/src/crypto/chacha20poly1305.c >> @@ -248,13 +248,29 @@ struct poly1305_ctx { >> >> static void poly1305_init(struct poly1305_ctx *ctx, const u8 >> key[static POLY1305_KEY_SIZE]) >> { >> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> + u32 t0, t1, t2, t3; >> +#endif >> + >> memset(ctx, 0, sizeof(struct poly1305_ctx)); >> /* r &= 0xffffffc0ffffffc0ffffffc0fffffff */ >> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> ctx->r[0] = (le32_to_cpuvp(key + 0) >> 0) & 0x3ffffff; >> ctx->r[1] = (le32_to_cpuvp(key + 3) >> 2) & 0x3ffff03; >> ctx->r[2] = (le32_to_cpuvp(key + 6) >> 4) & 0x3ffc0ff; >> ctx->r[3] = (le32_to_cpuvp(key + 9) >> 6) & 0x3f03fff; >> ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff; >> +#else >> + t0 = le32_to_cpuvp(key + 0); >> + t1 = le32_to_cpuvp(key + 4); >> + t2 = le32_to_cpuvp(key + 8); >> + t3 = le32_to_cpuvp(key +12); >> + ctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6; >> + ctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12; >> + ctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18; >> + ctx->r[3] = t2 & 0x3f03fff; t3 >>= 8; >> + ctx->r[4] = t3 & 0x00fffff; >> +#endif >> ctx->s[0] = le32_to_cpuvp(key + 16); >> ctx->s[1] = le32_to_cpuvp(key + 20); >> ctx->s[2] = le32_to_cpuvp(key + 24); >> @@ -267,6 +283,9 @@ static unsigned int >> poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 * >> u32 s1, s2, s3, s4; >> u32 h0, h1, h2, h3, h4; >> u64 d0, d1, d2, d3, d4; >> +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> + u32 t0, t1, t2, t3; >> +#endif >> >> r0 = ctx->r[0]; >> r1 = ctx->r[1]; >> @@ -287,11 +306,23 @@ static unsigned int >> poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 * >> >> while (likely(srclen >= POLY1305_BLOCK_SIZE)) { >> /* h += m[i] */ >> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff; >> h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff; >> h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff; >> h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff; >> h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit; >> +#else >> + t0 = le32_to_cpuvp(src + 0); >> + t1 = le32_to_cpuvp(src + 4); >> + t2 = le32_to_cpuvp(src + 8); >> + t3 = le32_to_cpuvp(src + 12); >> + h0 += t0 & 0x3ffffff; >> + h1 += sr((((u64)t1 << 32) | t0), 26) & 0x3ffffff; >> + h2 += sr((((u64)t2 << 32) | t1), 20) & 0x3ffffff; >> + h3 += sr((((u64)t3 << 32) | t2), 14) & 0x3ffffff; >> + h4 += (t3 >> 8) | hibit; >> +#endif >> >> /* h *= r */ >> d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) + >> mlt(h3, s2) + mlt(h4, s1); >> -- >> 2.5.5 >> >> _______________________________________________ >> WireGuard mailing list >> WireGuard@lists.zx2c4.com >> http://lists.zx2c4.com/mailman/listinfo/wireguard > > > -- > Jason A. Donenfeld > Deep Space Explorer > fr: +33 6 51 90 82 66 > us: +1 513 476 1200 > www.jasondonenfeld.com[1] > www.zx2c4.com[2] > zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc[3] Links: ------ [1] http://www.jasondonenfeld.com [2] http://www.zx2c4.com [3] http://zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc [-- Attachment #2: HTML Message --] [-- Type: text/html, Size: 11797 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2016-09-21 6:36 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-09-10 12:50 [WireGuard] [PATCH] Add support for platforms which has no efficient unaligned memory access René van Dorst 2016-09-10 12:57 ` René van Dorst 2016-09-11 12:06 ` [WireGuard] [PATCHv2] " René van Dorst 2016-09-20 19:58 ` Jason A. Donenfeld 2016-09-20 20:36 ` Jason A. Donenfeld 2016-09-21 6:45 ` René van Dorst
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).