* [WireGuard] News about MIPS and ARM optimized code?
From: René van Dorst @ 2016-08-08 13:23 UTC
To: wireguard

Is there any news about MIPS- and ARM-optimized code?

Greets,

René van Dorst.
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: Jason A. Donenfeld @ 2016-08-08 14:29 UTC
To: René van Dorst; +Cc: WireGuard mailing list

Would you like to write it?
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: René van Dorst @ 2016-09-08 11:57 UTC
Cc: WireGuard mailing list

I tried writing some MIPS32r2 code.
I wrote chacha20_keysetup, chacha20_generic_block and poly1305_generic_blocks in assembly.
I tried to keep all the needed variables in registers, which should reduce the memory overhead.
But it is very difficult for me to profile the code, or to isolate it and build benchmark programs like SUPERCOP.
So testing was simple: cross-compile the code, copy and load the module on the target, then run the setup script and iperf.

#ifdef CONFIG_CPU_MIPS32_R2
asmlinkage void chacha20_keysetup(struct chacha20_ctx *ctx, const u8 key[static 32], const u8 nonce[static 8]);
asmlinkage void chacha20_generic_block(struct chacha20_ctx *ctx);
asmlinkage unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 *src, unsigned int srclen, u32 hibit);
#endif

But the speed is the same or lower on my TP WR1043ND device, which is a big-endian MIPS32r2 24Kc.
So GCC does a good job. The 24Kc also has no special coprocessors or FPU.

The biggest improvement I got came from changing the buildroot default optimization from -Os to -O2.
This gives a speed improvement of around 1-3%.

Ideas:
- Remove the little-endian conversions on MIPS.
  Of course this has to be done on the other side as well.
  On this device I can't switch endianness.
  But I did not see any improvement. Swapping a 32-bit register needs 2 instructions.
  A quick calculation suggests it could save around 0.4%, which is ~0.1 MBit/s on this device.

Greets,

René van Dorst.
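The two-instruction register swap mentioned in the message above is presumably the MIPS32 Release 2 pair `wsbh` (swap the bytes within each halfword) followed by `rotr` by 16; the message does not name the instructions, so that pairing is an assumption. A minimal C sketch of the same two steps, with hypothetical helper names that are not taken from the WireGuard sources:

```c
#include <assert.h>
#include <stdint.h>

/* Step 1, the effect of wsbh: swap the two bytes inside each 16-bit half.
 * 0xAABBCCDD becomes 0xBBAADDCC. */
static uint32_t swap_bytes_in_halfwords(uint32_t x)
{
    return ((x & 0x00ff00ffu) << 8) | ((x & 0xff00ff00u) >> 8);
}

/* Step 2, the effect of rotr: rotate right by n bits. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32); /* avoid an undefined shift by 32 */
    return (x >> n) | (x << (32 - n));
}

/* Full 32-bit byte swap built from the two steps above. */
static uint32_t bswap32_two_step(uint32_t x)
{
    return rotr32(swap_bytes_in_halfwords(x), 16);
}
```

For example, `bswap32_two_step(0x11223344)` yields `0x44332211`. Every little-endian word that ChaCha20 or Poly1305 reads on a big-endian CPU pays for this pair, which is the cost the 0.4% estimate refers to.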
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: René van Dorst @ 2016-09-09 13:46 UTC
To: wireguard

Due to the misaligned data fetching, functions like poly1305 cause a performance regression on MIPS.

h0 += (le32_to_cpuvp(src + 0) >> 0) & 0x3ffffff;
h1 += (le32_to_cpuvp(src + 3) >> 2) & 0x3ffffff;
h2 += (le32_to_cpuvp(src + 6) >> 4) & 0x3ffffff;
h3 += (le32_to_cpuvp(src + 9) >> 6) & 0x3ffffff;
h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;

Had 26 MBit/s, now 42+.

root@lede:~# iperf3 -c 10.0.0.1 -i 10
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 36216 connected to 10.0.0.1 port 5201
[ ID] Interval          Transfer     Bandwidth       Retr  Cwnd
[  4]  0.00-10.08 sec   51.2 MBytes  42.7 Mbits/sec  0     171 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bandwidth       Retr
[  4]  0.00-10.08 sec   51.2 MBytes  42.7 Mbits/sec  0     sender
[  4]  0.00-10.08 sec   51.2 MBytes  42.7 Mbits/sec        receiver

iperf Done.
root@lede:~# iperf3 -c 10.0.0.1 -u -b 1G -i 10
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 60714 connected to 10.0.0.1 port 5201
[ ID] Interval          Transfer     Bandwidth       Total Datagrams
[  4]  0.00-10.00 sec   56.3 MBytes  47.2 Mbits/sec  7209
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]  0.00-10.00 sec   56.3 MBytes  47.2 Mbits/sec  0.034 ms  0/7209 (0%)
[  4] Sent 7209 datagrams

iperf Done.
root@lede:~#

The work is not done yet, but it is a good start.

Greets,

René van Dorst.
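The five Poly1305 loads quoted above read 32-bit words at byte offsets 0, 3, 6, 9 and 12, so three of them can never be word-aligned. One way to avoid the odd offsets is to read the block at offsets 0, 4, 8 and 12 and build the same 26-bit limbs with shifts. The sketch below only shows that the two extractions agree; it is not taken from René's patch, the function names are hypothetical, and `load_le32` is a portable byte-wise stand-in for the kernel's `le32_to_cpuvp`.

```c
#include <stdint.h>

/* Portable little-endian 32-bit load (stand-in for le32_to_cpuvp). */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Limb extraction as quoted in the thread: loads at offsets 0, 3, 6, 9, 12. */
static void limbs_from_odd_offsets(const uint8_t src[16], uint32_t hibit,
                                   uint32_t out[5])
{
    out[0] = (load_le32(src + 0) >> 0) & 0x3ffffff;
    out[1] = (load_le32(src + 3) >> 2) & 0x3ffffff;
    out[2] = (load_le32(src + 6) >> 4) & 0x3ffffff;
    out[3] = (load_le32(src + 9) >> 6) & 0x3ffffff;
    out[4] = (load_le32(src + 12) >> 8) | hibit;
}

/* Same limbs from four loads at offsets 0, 4, 8, 12, combined with shifts. */
static void limbs_from_word_loads(const uint8_t src[16], uint32_t hibit,
                                  uint32_t out[5])
{
    uint32_t w0 = load_le32(src + 0);
    uint32_t w1 = load_le32(src + 4);
    uint32_t w2 = load_le32(src + 8);
    uint32_t w3 = load_le32(src + 12);

    out[0] = w0 & 0x3ffffff;                         /* bits   0..25  */
    out[1] = ((w0 >> 26) | (w1 << 6)) & 0x3ffffff;   /* bits  26..51  */
    out[2] = ((w1 >> 20) | (w2 << 12)) & 0x3ffffff;  /* bits  52..77  */
    out[3] = ((w2 >> 14) | (w3 << 18)) & 0x3ffffff;  /* bits  78..103 */
    out[4] = (w3 >> 8) | hibit;                      /* bits 104..127 */
}
```

When `src` itself is 4-byte aligned, the four loads in the second variant can be plain word loads on the target, which is where a speed-up would come from. Whether the packet data is actually aligned at that point depends on the caller and is not established by this sketch.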
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: Baptiste Jonglez @ 2016-09-09 13:52 UTC
To: René van Dorst; +Cc: wireguard

Nice work! I had tried to write chacha20_generic_block in MIPS assembly,
but I got confused with endianness issues and the code didn't work in the
end.

Is your code available somewhere? I'd be happy to test on a variety of
MIPS routers.
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: René van Dorst @ 2016-09-09 15:22 UTC
To: Baptiste Jonglez; +Cc: wireguard

Not yet.
But I think more platforms suffer from this misaligned memory fetching.
So if someone also fixes this in the C code, it will boost performance even without the assembly version.

Greets,

René
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: René van Dorst @ 2016-09-09 19:49 UTC
To: wireguard

Here is my latest source code:
https://github.com/vDorst/wireguard/tree/mips32r2
It includes the long history of trial and error ;).
But it also contains good ideas, like reordering the code for better data dependencies,
which makes the code less readable but more efficient.

This is the assembly part:
https://github.com/vDorst/wireguard/blob/mips32r2/src/crypto/chacha20-mips32r2.S

Created functions:
* asmlinkage void chacha20_keysetup(struct chacha20_ctx *ctx, const u8 key[static 32], const u8 nonce[static 8]);
* asmlinkage void chacha20_generic_block(struct chacha20_ctx *ctx);
* asmlinkage unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx, const u8 *src, unsigned int srclen, u32 hibit);

poly1305_generic_blocks is fixed in the last commit.

The code is written for big-endian MIPS32r2.
It has some defines for __ORDER_BIG_ENDIAN__ which enable the endian swap for that data, but it has not been tested on little endian.

Todo:
* Change the C code to see how fast that works and set a benchmark baseline.
* See whether I can optimize the assembler version even more.

Greets,

René van Dorst.
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: René van Dorst @ 2016-09-14 7:16 UTC
To: wireguard

An update on my current findings.

The biggest improvement I have seen so far comes from writing and optimizing the poly1305_generic_blocks function.
This gives an improvement of more than 1%.
I also noticed that the ping time does not change.

The improvement at the moment is around UDP: ~1.47%, TCP: ~1.68% on large transfers like iperf.

Wireguard mix of Asm and C variant:
https://github.com/vDorst/wireguard/commit/6f9187c325ee883b1f2b9f9da3deb0a61655b504

root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 47996 connected to 10.0.0.1 port 5201
[ ID] Interval          Transfer     Bandwidth       Total Datagrams
[  4]  0.00-10.00 sec   57.5 MBytes  48.2 Mbits/sec  7354
[  4] 10.00-20.00 sec   57.4 MBytes  48.2 Mbits/sec  7350
[  4] 20.00-30.00 sec   57.4 MBytes  48.2 Mbits/sec  7353
[  4] 30.00-40.00 sec   57.5 MBytes  48.2 Mbits/sec  7356
[  4] 40.00-50.00 sec   57.5 MBytes  48.2 Mbits/sec  7357
[  4] 50.00-60.00 sec   57.5 MBytes  48.2 Mbits/sec  7358
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]  0.00-60.00 sec   345 MBytes   48.2 Mbits/sec  0.037 ms  0/44128 (0%)
[  4] Sent 44128 datagrams

root@lede:~# iperf3 -c 10.0.0.1 -i 10 -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 37950 connected to 10.0.0.1 port 5201
[ ID] Interval          Transfer     Bandwidth       Retr  Cwnd
[  4]  0.00-10.14 sec   52.5 MBytes  43.4 Mbits/sec  0     147 KBytes
[  4] 10.14-20.02 sec   51.2 MBytes  43.5 Mbits/sec  0     147 KBytes
[  4] 20.02-30.14 sec   52.5 MBytes  43.5 Mbits/sec  0     147 KBytes
[  4] 30.14-40.01 sec   51.2 MBytes  43.5 Mbits/sec  0     147 KBytes
[  4] 40.01-50.16 sec   52.5 MBytes  43.4 Mbits/sec  0     220 KBytes
[  4] 50.16-60.01 sec   42.5 MBytes  36.2 Mbits/sec  0     220 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bandwidth       Retr
[  4]  0.00-60.01 sec   302 MBytes   42.3 Mbits/sec  0     sender
[  4]  0.00-60.01 sec   302 MBytes   42.3 Mbits/sec        receiver

Wireguard C variant:
https://github.com/vDorst/wireguard/commit/13fae657624aac6b9c1f411aa6472a91aae7fcc3

root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 40439 connected to 10.0.0.1 port 5201
[ ID] Interval          Transfer     Bandwidth       Total Datagrams
[  4]  0.00-10.00 sec   56.6 MBytes  47.5 Mbits/sec  7246
[  4] 10.00-20.00 sec   56.6 MBytes  47.5 Mbits/sec  7243
[  4] 20.00-30.00 sec   56.6 MBytes  47.5 Mbits/sec  7244
[  4] 30.00-40.00 sec   56.6 MBytes  47.5 Mbits/sec  7245
[  4] 40.00-50.00 sec   56.6 MBytes  47.5 Mbits/sec  7245
[  4] 50.00-60.00 sec   56.6 MBytes  47.5 Mbits/sec  7247
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]  0.00-60.00 sec   340 MBytes   47.5 Mbits/sec  0.039 ms  0/43470 (0%)
[  4] Sent 43470 datagrams

root@lede:~# iperf3 -c 10.0.0.1 -i 10 -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 37956 connected to 10.0.0.1 port 5201
[ ID] Interval          Transfer     Bandwidth       Retr  Cwnd
[  4]  0.00-10.02 sec   49.6 MBytes  41.5 Mbits/sec  0     137 KBytes
[  4] 10.02-20.00 sec   49.6 MBytes  41.7 Mbits/sec  0     209 KBytes
[  4] 20.00-30.02 sec   49.6 MBytes  41.6 Mbits/sec  0     209 KBytes
[  4] 30.02-40.01 sec   49.2 MBytes  41.3 Mbits/sec  0     209 KBytes
[  4] 40.01-50.02 sec   49.6 MBytes  41.6 Mbits/sec  0     209 KBytes
[  4] 50.02-60.02 sec   49.6 MBytes  41.6 Mbits/sec  0     209 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bandwidth       Retr
[  4]  0.00-60.02 sec   297 MBytes   41.6 Mbits/sec  0     sender
[  4]  0.00-60.02 sec   297 MBytes   41.6 Mbits/sec        receiver

Greets,

René van Dorst.
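The quoted percentages follow from the 60-second averages in the iperf output above: 48.2 against 47.5 Mbits/sec for UDP and 42.3 against 41.6 Mbits/sec for TCP. A small C helper makes the arithmetic explicit; `percent_gain` is a hypothetical name used only for this illustration.

```c
#include <assert.h>

/* Relative gain of `measured` over `baseline`, in percent. */
static double percent_gain(double measured, double baseline)
{
    assert(baseline > 0.0); /* a zero baseline has no meaningful gain */
    return (measured - baseline) / baseline * 100.0;
}
```

`percent_gain(48.2, 47.5)` comes to roughly 1.47 and `percent_gain(42.3, 41.6)` to roughly 1.68, matching the UDP and TCP figures in the message. Both differences are 0.7 Mbits/sec, so the larger TCP percentage only reflects the lower TCP baseline.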
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: Jason A. Donenfeld @ 2016-09-20 20:39 UTC
To: René van Dorst; +Cc: WireGuard mailing list

Hey René,

That's excellent. Thanks for writing that. I'll review this implementation.

Is your speed up compared to your unaligned optimization from the
other patch? Or is that against vanilla?

With only a 1% increase, I'm first interested to see where precisely
that improvement is coming from, and if we could squeeze that out of
gcc instead, so that they're producing more or less the same code.

Regards,
Jason
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: René van Dorst @ 2016-09-22 18:27 UTC
To: Jason A. Donenfeld; +Cc: WireGuard mailing list

Hi Jason,

I am using the LEDE project's default kernel.
My comparison is only between the patched C version with the aligned memory reads and the module built with my assembly version.

I think the code is too complex for GCC to optimize, so it follows the code to the letter.
This results in a lot of data hazards.
By scheduling the instructions by hand you can avoid many of them.
The trick is to do two things at once by weaving the code together,
which results in less maintainable code.

Greets,

René van Dorst.
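The "weaving" described above can be pictured in C with two ChaCha column quarter rounds. Within one quarter round every step needs the result of the previous step, but the two columns touch disjoint state words, so their steps can alternate and each instruction has an independent neighbour. This is only a sketch of the idea with hypothetical function names; it is not René's assembly, and an optimizing compiler is free to reorder either version.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    /* n is always 16, 12, 8 or 7 here, so both shifts are well defined. */
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter round: a chain of dependent steps. */
static void quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] ^= x[a]; x[d] = rotl32(x[d], 16);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = rotl32(x[b], 12);
    x[a] += x[b]; x[d] ^= x[a]; x[d] = rotl32(x[d], 8);
    x[c] += x[d]; x[b] ^= x[c]; x[b] = rotl32(x[b], 7);
}

/* Columns 0 and 1, one after the other. */
static void two_columns_sequential(uint32_t x[16])
{
    quarter_round(x, 0, 4, 8, 12);
    quarter_round(x, 1, 5, 9, 13);
}

/* The same two columns woven together: each step of column 0 is
 * followed by the matching, independent step of column 1. */
static void two_columns_interleaved(uint32_t x[16])
{
    x[0] += x[4];              x[1] += x[5];
    x[12] ^= x[0];             x[13] ^= x[1];
    x[12] = rotl32(x[12], 16); x[13] = rotl32(x[13], 16);
    x[8] += x[12];             x[9] += x[13];
    x[4] ^= x[8];              x[5] ^= x[9];
    x[4] = rotl32(x[4], 12);   x[5] = rotl32(x[5], 12);
    x[0] += x[4];              x[1] += x[5];
    x[12] ^= x[0];             x[13] ^= x[1];
    x[12] = rotl32(x[12], 8);  x[13] = rotl32(x[13], 8);
    x[8] += x[12];             x[9] += x[13];
    x[4] ^= x[8];              x[5] ^= x[9];
    x[4] = rotl32(x[4], 7);    x[5] = rotl32(x[5], 7);
}
```

Both functions leave the state in exactly the same condition; only the order of independent operations differs. The interleaved form also shows the maintainability cost mentioned in the message: the structure of a single quarter round is no longer visible at a glance.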
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: Jason A. Donenfeld @ 2016-09-27 1:48 UTC
To: René van Dorst; +Cc: WireGuard mailing list

Hey René,

I've begun trying to integrate your excellent work into WireGuard in
the branch rvh/mips:
https://git.zx2c4.com/WireGuard/commit/?h=rvd/mips

It seems like there's still a bit of cleaning up and polishing to do,
but it's headed in a great direction. There's a lot of weird
formatting and general inconsistency to clean up. I'll do a review of
the crypto as we get rolling here.

To make things easier, I gave you commit access to the rvh/mips branch
in the repo. Feel free to do with this what you like, and when we're
ready, I'll merge it to master.

$ git clone ssh://git@git.zx2c4.com/WireGuard
$ cd WireGuard
$ git checkout -b rvh/mips origin/rvh/mips
$ edit code...
$ git commit...
$ git push

That general flow should work for you, using your Github SSH key. Let
me know if there are any issues, and feel free to poke me on irc
(zx2c4 on freenode -- #wireguard).

Talk soon,
Jason
* Re: [WireGuard] News about MIPS and ARM optimized code?
From: jens @ 2016-09-14 8:10 UTC
To: wireguard

On 09.09.2016 15:52, Baptiste Jonglez wrote:
> Nice work! I had tried to write chacha20_generic_block in MIPS assembly,
> but I got confused with endianness issues and the code didn't work in the
> end.
>
> Is your code available somewhere? I'd be happy to test on a variety of
> MIPS routers.

I built some LEDE images with René van Dorst's patch, but have no time to actually
test them. If someone has ... here are the links. The 841-v11 is the one we
specifically want to test, and there is also a link for more devices (only
built with the patch applied).

patch
openfreiburg.de/freifunk/firmware/lede/chacha20poly1305.c_patch1

841 stuff
openfreiburg.de/freifunk/firmware/lede/

More LEDE build stuff is also there (other images and packages).

jens
end of thread, other threads:[~2016-09-27 1:38 UTC | newest]

Thread overview: 12+ messages
2016-08-08 13:23 [WireGuard] News about MIPS and ARM optimized code? René van Dorst
2016-08-08 14:29 ` Jason A. Donenfeld
2016-09-08 11:57   ` René van Dorst
2016-09-09 13:46     ` René van Dorst
2016-09-09 13:52       ` Baptiste Jonglez
2016-09-09 15:22         ` René van Dorst
2016-09-09 19:49           ` René van Dorst
2016-09-14  7:16             ` René van Dorst
2016-09-20 20:39               ` Jason A. Donenfeld
2016-09-22 18:27                 ` René van Dorst
2016-09-27  1:48               ` Jason A. Donenfeld
2016-09-14  8:10         ` jens