[WireGuard] News about MIPS and ARM optimized code?

Development discussion of WireGuard

* [WireGuard] News about MIPS and ARM optimized code?
@ 2016-08-08 13:23 René van Dorst
  2016-08-08 14:29 ` Jason A. Donenfeld
  0 siblings, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-08-08 13:23 UTC (permalink / raw)
  To: wireguard


News about MIPS and ARM optimized code?

Greetings,

René van Dorst.


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-08-08 13:23 [WireGuard] News about MIPS and ARM optimized code? René van Dorst
@ 2016-08-08 14:29 ` Jason A. Donenfeld
  2016-09-08 11:57   ` René van Dorst
  0 siblings, 1 reply; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-08-08 14:29 UTC (permalink / raw)
  To: René van Dorst; +Cc: WireGuard mailing list

Would you like to write it?


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-08-08 14:29 ` Jason A. Donenfeld
@ 2016-09-08 11:57   ` René van Dorst
  2016-09-09 13:46     ` René van Dorst
  0 siblings, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-09-08 11:57 UTC (permalink / raw)
  Cc: WireGuard mailing list

I tried writing some MIPS32r2 code.
I wrote chacha20_keysetup, chacha20_generic_block and
poly1305_generic_blocks in assembly.
I tried to keep all needed variables in registers, which should
reduce the memory overhead.
But it is very difficult for me to profile the code, or to isolate it
and build benchmark programs like SUPERCOP.
So testing was simple: cross-compile the code, copy and load the module
on the target, then run the setup script and iperf.

#ifdef CONFIG_CPU_MIPS32_R2
asmlinkage void chacha20_keysetup(struct chacha20_ctx *ctx,
                                  const u8 key[static 32],
                                  const u8 nonce[static 8]);
asmlinkage void chacha20_generic_block(struct chacha20_ctx *ctx);
asmlinkage unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx,
                                                const u8 *src,
                                                unsigned int srclen,
                                                u32 hibit);
#endif

But the speed is equal or lower on my TP-Link WR1043ND, which is a
big-endian MIPS32r2 24Kc.
So GCC does a good job. Also, the 24Kc has no special coprocessors or FPU.

The biggest improvement I got came from changing the buildroot default
optimization from -Os to -O2.
This gives around a 1-3% speed improvement.

Ideas:
- Remove the little-endian conversions on MIPS.
   Of course, this would have to be done on the other side as well.
   On this device I can't switch endianness.
   But I did not see any improvement. Swapping a 32-bit register
   needs 2 instructions.
   A quick calculation suggests it could save around 0.4%, which is
   ~0.1 Mbit/s on this device.
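
To illustrate why a 32-bit byte swap costs two operations, here is a minimal C sketch. It assumes the two instructions are MIPS32r2's WSBH (swap bytes within each halfword) followed by a 16-bit rotate (ROTR); that pairing is an assumption, not something stated above, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Step 1: swap the two bytes inside each 16-bit half (the effect of WSBH). */
static inline uint32_t swap_bytes_in_halfwords(uint32_t v)
{
    return ((v & 0x00ff00ffu) << 8) | ((v & 0xff00ff00u) >> 8);
}

/* Step 2: rotate the word right by 16 bits (the effect of ROTR by 16). */
static inline uint32_t rotate_right_16(uint32_t v)
{
    return (v >> 16) | (v << 16);
}

/* Full 32-bit endian swap built from the two steps above. */
static inline uint32_t swap32(uint32_t v)
{
    return rotate_right_16(swap_bytes_in_halfwords(v));
}
```

As a rough check of the estimate: 0.4% of about 26 Mbit/s (the throughput reported later in this thread before the alignment fix) is about 0.1 Mbit/s.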

Greetings,

René van Dorst.


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-08 11:57   ` René van Dorst
@ 2016-09-09 13:46     ` René van Dorst
  2016-09-09 13:52       ` Baptiste Jonglez
  0 siblings, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-09-09 13:46 UTC (permalink / raw)
  To: wireguard

Due to misaligned data fetching, functions like poly1305 cause a
regression on MIPS.

	h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
	h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
	h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
	h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
	h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
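
The loads at src + 3, src + 6 and src + 9 are not 4-byte aligned. One portable way to avoid misaligned 32-bit loads is to assemble each little-endian word from single bytes. The following is a minimal userspace sketch of that idea, not necessarily how the patch in this thread does it; load_le32_bytewise and poly1305_add_block are hypothetical names (in the kernel, get_unaligned_le32() serves this purpose).

```c
#include <stdint.h>

/* Read a little-endian 32-bit word one byte at a time, so the CPU never
 * issues a 32-bit load from an address that is not 4-byte aligned. */
static inline uint32_t load_le32_bytewise(const uint8_t *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Add one 16-byte block to the five 26-bit limbs, mirroring the
 * expressions quoted above. */
static void poly1305_add_block(uint32_t h[5], const uint8_t src[16],
                               uint32_t hibit)
{
    h[0] += (load_le32_bytewise(src +  0) >> 0) & 0x3ffffff;
    h[1] += (load_le32_bytewise(src +  3) >> 2) & 0x3ffffff;
    h[2] += (load_le32_bytewise(src +  6) >> 4) & 0x3ffffff;
    h[3] += (load_le32_bytewise(src +  9) >> 6) & 0x3ffffff;
    h[4] += (load_le32_bytewise(src + 12) >> 8) | hibit;
}
```

The byte-wise form also gives the same result on big-endian and little-endian CPUs, so no separate swap is needed.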


Before: 26 Mbit/s. Now: 42+ Mbit/s.

root@lede:~# iperf3 -c 10.0.0.1 -i 10
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 36216 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.08  sec  51.2 MBytes  42.7 Mbits/sec    0    171 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.08  sec  51.2 MBytes  42.7 Mbits/sec    0             sender
[  4]   0.00-10.08  sec  51.2 MBytes  42.7 Mbits/sec                  receiver

iperf Done.
root@lede:~# iperf3 -c 10.0.0.1 -u -b 1G -i 10
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 60714 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-10.00  sec  56.3 MBytes  47.2 Mbits/sec  7209
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  56.3 MBytes  47.2 Mbits/sec  0.034 ms  0/7209 (0%)
[  4] Sent 7209 datagrams

iperf Done.
root@lede:~#


The work is not done yet, but it is a good start.

Greetings,

René van Dorst.


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-09 13:46     ` René van Dorst
@ 2016-09-09 13:52       ` Baptiste Jonglez
  2016-09-09 15:22         ` René van Dorst
  2016-09-14  8:10         ` jens
  0 siblings, 2 replies; 12+ messages in thread
From: Baptiste Jonglez @ 2016-09-09 13:52 UTC (permalink / raw)
  To: René van Dorst; +Cc: wireguard

Nice work!  I had tried to write chacha20_generic_block in MIPS assembly,
but I got confused with endianness issues and the code didn't work in the
end.

Is your code available somewhere?  I'd be happy to test on a variety of
MIPS routers.


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-09 13:52       ` Baptiste Jonglez
@ 2016-09-09 15:22         ` René van Dorst
  2016-09-09 19:49           ` René van Dorst
  2016-09-14  8:10         ` jens
  1 sibling, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-09-09 15:22 UTC (permalink / raw)
  To: Baptiste Jonglez; +Cc: wireguard

Not yet.

But I think more platforms suffer from this misaligned memory fetching.

So if someone also fixes this in the C code, it will boost
performance even without the assembly version.

Greetings,

René


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-09 15:22         ` René van Dorst
@ 2016-09-09 19:49           ` René van Dorst
  2016-09-14  7:16             ` René van Dorst
  0 siblings, 1 reply; 12+ messages in thread
From: René van Dorst @ 2016-09-09 19:49 UTC (permalink / raw)
  To: wireguard

Here is my latest source code: https://github.com/vDorst/wireguard/tree/mips32r2
It includes the long history of trial and error ;).
But it also contains good ideas, such as reordering the code for better
data dependencies, which makes the code less readable but more efficient.

This is the assembly part  
https://github.com/vDorst/wireguard/blob/mips32r2/src/crypto/chacha20-mips32r2.S

Created functions:
* asmlinkage void chacha20_keysetup(struct chacha20_ctx *ctx,
    const u8 key[static 32], const u8 nonce[static 8]);
* asmlinkage void chacha20_generic_block(struct chacha20_ctx *ctx);
* asmlinkage unsigned int poly1305_generic_blocks(struct poly1305_ctx *ctx,
    const u8 *src, unsigned int srclen, u32 hibit);

poly1305_generic_blocks is fixed in the latest commit.

The code is written for big-endian MIPS32r2.
It has some defines for __ORDER_BIG_ENDIAN__ that enable the endian
swap for the data, but it has not been tested on little endian.
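
As an illustration of this kind of compile-time switch, here is a minimal C sketch that byte-swaps only on big-endian builds. It is not taken from the repository above: load_le32 is a hypothetical helper, and it relies on the GCC/Clang predefined macros __BYTE_ORDER__ and __ORDER_BIG_ENDIAN__ and on __builtin_bswap32.

```c
#include <stdint.h>
#include <string.h>

/* Load a little-endian 32-bit value from a byte buffer.
 * The swap is compiled in only when the target is big endian. */
static inline uint32_t load_le32(const uint8_t *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v)); /* safe for any alignment */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = __builtin_bswap32(v); /* big endian: convert from wire order */
#endif
    return v;
}
```

On a little-endian build the function reduces to a plain load, which is why that path needs separate testing.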

Todo:
* Change the C code to see how fast it runs and to set a benchmark baseline.
* See if I can optimize the assembly version even more.

Greetings,

René van Dorst.


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-09 19:49           ` René van Dorst
@ 2016-09-14  7:16             ` René van Dorst
  2016-09-20 20:39               ` Jason A. Donenfeld
  2016-09-27  1:48               ` Jason A. Donenfeld
  0 siblings, 2 replies; 12+ messages in thread
From: René van Dorst @ 2016-09-14  7:16 UTC (permalink / raw)
  To: wireguard

An update on my current findings.

The biggest improvement I have seen so far comes from writing and
optimizing the poly1305_generic_blocks function.
This gives an improvement of more than 1%.
I also noticed that the ping time does not change.

The improvement at the moment is around UDP: ~1.47%, TCP: ~1.68% on large
transfers like iperf.

WireGuard, mixed assembly and C variant:
https://github.com/vDorst/wireguard/commit/6f9187c325ee883b1f2b9f9da3deb0a61655b504

root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 47996 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-10.00  sec  57.5 MBytes  48.2 Mbits/sec  7354
[  4]  10.00-20.00  sec  57.4 MBytes  48.2 Mbits/sec  7350
[  4]  20.00-30.00  sec  57.4 MBytes  48.2 Mbits/sec  7353
[  4]  30.00-40.00  sec  57.5 MBytes  48.2 Mbits/sec  7356
[  4]  40.00-50.00  sec  57.5 MBytes  48.2 Mbits/sec  7357
[  4]  50.00-60.00  sec  57.5 MBytes  48.2 Mbits/sec  7358
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-60.00  sec   345 MBytes  48.2 Mbits/sec  0.037 ms  0/44128 (0%)
[  4] Sent 44128 datagrams

root@lede:~# iperf3 -c 10.0.0.1 -i 10 -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 37950 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.14  sec  52.5 MBytes  43.4 Mbits/sec    0    147 KBytes
[  4]  10.14-20.02  sec  51.2 MBytes  43.5 Mbits/sec    0    147 KBytes
[  4]  20.02-30.14  sec  52.5 MBytes  43.5 Mbits/sec    0    147 KBytes
[  4]  30.14-40.01  sec  51.2 MBytes  43.5 Mbits/sec    0    147 KBytes
[  4]  40.01-50.16  sec  52.5 MBytes  43.4 Mbits/sec    0    220 KBytes
[  4]  50.16-60.01  sec  42.5 MBytes  36.2 Mbits/sec    0    220 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.01  sec   302 MBytes  42.3 Mbits/sec    0             sender
[  4]   0.00-60.01  sec   302 MBytes  42.3 Mbits/sec                  receiver


WireGuard, C variant:
https://github.com/vDorst/wireguard/commit/13fae657624aac6b9c1f411aa6472a91aae7fcc3

root@lede:~# iperf3 -c 10.0.0.1 -i 10 -u -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 40439 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Total Datagrams
[  4]   0.00-10.00  sec  56.6 MBytes  47.5 Mbits/sec  7246
[  4]  10.00-20.00  sec  56.6 MBytes  47.5 Mbits/sec  7243
[  4]  20.00-30.00  sec  56.6 MBytes  47.5 Mbits/sec  7244
[  4]  30.00-40.00  sec  56.6 MBytes  47.5 Mbits/sec  7245
[  4]  40.00-50.00  sec  56.6 MBytes  47.5 Mbits/sec  7245
[  4]  50.00-60.00  sec  56.6 MBytes  47.5 Mbits/sec  7247
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
[  4]   0.00-60.00  sec   340 MBytes  47.5 Mbits/sec  0.039 ms  0/43470 (0%)
[  4] Sent 43470 datagrams

root@lede:~# iperf3 -c 10.0.0.1 -i 10 -b 1G -t 60
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 37956 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-10.02  sec  49.6 MBytes  41.5 Mbits/sec    0    137 KBytes
[  4]  10.02-20.00  sec  49.6 MBytes  41.7 Mbits/sec    0    209 KBytes
[  4]  20.00-30.02  sec  49.6 MBytes  41.6 Mbits/sec    0    209 KBytes
[  4]  30.02-40.01  sec  49.2 MBytes  41.3 Mbits/sec    0    209 KBytes
[  4]  40.01-50.02  sec  49.6 MBytes  41.6 Mbits/sec    0    209 KBytes
[  4]  50.02-60.02  sec  49.6 MBytes  41.6 Mbits/sec    0    209 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-60.02  sec   297 MBytes  41.6 Mbits/sec    0             sender
[  4]   0.00-60.02  sec   297 MBytes  41.6 Mbits/sec                  receiver
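
The percentages quoted at the top of this message can be reproduced from the 60-second summary lines of the four runs. A small sketch of the arithmetic follows; percent_gain is a hypothetical helper name.

```c
/* Relative gain of `improved` over `baseline`, in percent. */
static double percent_gain(double baseline, double improved)
{
    return (improved - baseline) / baseline * 100.0;
}
```

With the numbers above, UDP goes from 47.5 to 48.2 Mbit/s, so (48.2 - 47.5) / 47.5 is about 1.47%; TCP goes from 41.6 to 42.3 Mbit/s, so (42.3 - 41.6) / 41.6 is about 1.68%.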


Greetings,

René van Dorst.


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-14  7:16             ` René van Dorst
@ 2016-09-20 20:39               ` Jason A. Donenfeld
  2016-09-22 18:27                 ` René van Dorst
  2016-09-27  1:48               ` Jason A. Donenfeld
  1 sibling, 1 reply; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-09-20 20:39 UTC (permalink / raw)
  To: René van Dorst; +Cc: WireGuard mailing list

Hey René,

That's excellent. Thanks for writing that. I'll review this implementation.

Is your speed up compared to your unaligned optimization from the
other patch? Or is that against vanilla?

With only a 1% increase, I'm first interested to see where precisely
that improvement is coming from, and if we could squeeze that out of
gcc instead, so that they're producing more or less the same code.

Regards,
Jason


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-20 20:39               ` Jason A. Donenfeld
@ 2016-09-22 18:27                 ` René van Dorst
  0 siblings, 0 replies; 12+ messages in thread
From: René van Dorst @ 2016-09-22 18:27 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: WireGuard mailing list

Hi Jason,

I am using the LEDE project's default kernel.
My comparison is only between the patched C version with the aligned
memory reads and my assembly version of the module.

I think it is too complex for GCC to optimize, so it follows the code to
the letter.
This results in a lot of data hazards.

Doing it by hand lets you prevent many data hazards.
The trick is to do two things at once by weaving the code together,
which results in less maintainable code.
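
To make the weaving idea concrete, here is a minimal C sketch using the ChaCha20 quarter round. Inside one quarter round every step depends on the previous one, but two quarter rounds on different columns are independent, so their steps can be alternated. This is only an illustration of the technique, not the assembly from this thread, and whether it helps on a given CPU or compiler is not shown here; all function names are hypothetical.

```c
#include <stdint.h>

static inline uint32_t rotl32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32 - n));
}

/* One ChaCha20 quarter round: a strict chain of dependent operations. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d = rotl32(*d ^ *a, 16);
    *c += *d; *b = rotl32(*b ^ *c, 12);
    *a += *b; *d = rotl32(*d ^ *a, 8);
    *c += *d; *b = rotl32(*b ^ *c, 7);
}

/* Columns 0 and 1 woven together: each statement for column 0 is followed
 * by the same statement for column 1, which does not depend on it. */
static void two_quarter_rounds_interleaved(uint32_t x[16])
{
    uint32_t a0 = x[0], b0 = x[4], c0 = x[8], d0 = x[12];
    uint32_t a1 = x[1], b1 = x[5], c1 = x[9], d1 = x[13];

    a0 += b0;                 a1 += b1;
    d0 = rotl32(d0 ^ a0, 16); d1 = rotl32(d1 ^ a1, 16);
    c0 += d0;                 c1 += d1;
    b0 = rotl32(b0 ^ c0, 12); b1 = rotl32(b1 ^ c1, 12);
    a0 += b0;                 a1 += b1;
    d0 = rotl32(d0 ^ a0, 8);  d1 = rotl32(d1 ^ a1, 8);
    c0 += d0;                 c1 += d1;
    b0 = rotl32(b0 ^ c0, 7);  b1 = rotl32(b1 ^ c1, 7);

    x[0] = a0; x[4] = b0; x[8] = c0; x[12] = d0;
    x[1] = a1; x[5] = b1; x[9] = c1; x[13] = d1;
}
```

The woven version computes exactly the same values as two sequential quarter_round calls; only the instruction order changes, which is also why it is harder to read.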

Greetings,

René van Dorst.


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-14  7:16             ` René van Dorst
  2016-09-20 20:39               ` Jason A. Donenfeld
@ 2016-09-27  1:48               ` Jason A. Donenfeld
  1 sibling, 0 replies; 12+ messages in thread
From: Jason A. Donenfeld @ 2016-09-27  1:48 UTC (permalink / raw)
  To: René van Dorst; +Cc: WireGuard mailing list

Hey René,

I've begun trying to integrate your excellent work into WireGuard in
the branch rvh/mips:
https://git.zx2c4.com/WireGuard/commit/?h=rvd/mips

It seems like there's still a bit of cleaning up and polishing to do,
but it's headed in a great direction. There's a lot of weird
formatting and general inconsistency to clean up. I'll do a review of
the crypto as we get rolling here.

To make things easier, I gave you commit access to the rvh/mips branch
in the repo. Feel free to do with this what you like, and when we're
ready, I'll merge it to master.

$ git clone ssh://git@git.zx2c4.com/WireGuard
$ cd WireGuard
$ git checkout -b rvh/mips origin/rvh/mips
$ edit code...
$ git commit...
$ git push

That general flow should work for you, using your Github SSH key. Let
me know if there are any issues, and feel free to poke me on irc
(zx2c4 on freenode -- #wireguard).

Talk soon,
Jason


* Re: [WireGuard] News about MIPS and ARM optimized code?
  2016-09-09 13:52       ` Baptiste Jonglez
  2016-09-09 15:22         ` René van Dorst
@ 2016-09-14  8:10         ` jens
  1 sibling, 0 replies; 12+ messages in thread
From: jens @ 2016-09-14  8:10 UTC (permalink / raw)
  To: wireguard

On 09.09.2016 15:52, Baptiste Jonglez wrote:
> Nice work!  I had tried to write chacha20_generic_block in MIPS assembly,
> but I got confused with endianness issues and the code didn't work in the
> end.
>
> Is your code available somewhere?  I'd be happy to test on a variety of
> MIPS routers.

I built some LEDE images with René van Dorst's patch, but I have no time
to actually test them. If someone has ...
Here are the links for the 841-v11, which we specifically want to test,
and the link for more devices (only built with the patch applied).

patch openfreiburg.de/freifunk/firmware/lede/chacha20poly1305.c_patch1
841 stuff openfreiburg.de/freifunk/firmware/lede/
More LEDE build stuff is also there (other images and packages).

jens
