Mini PCIE HW accelerator for ChaCha20

Development discussion of WireGuard

* Mini PCIE HW accelerator for ChaCha20
@ 2024-06-12 14:11 Germano Massullo
  2024-06-16 13:47 ` Max Schulze
  0 siblings, 1 reply; 11+ messages in thread
From: Germano Massullo @ 2024-06-12 14:11 UTC (permalink / raw)
  To: WireGuard mailing list

Hello, I would like to ask whether you are aware of any mini PCI Express
card that provides hardware acceleration for the ChaCha20 algorithm. I
would need it to improve WireGuard throughput on the Turris Omnia.
Cheers!


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-12 14:11 Mini PCIE HW accelerator for ChaCha20 Germano Massullo
@ 2024-06-16 13:47 ` Max Schulze
  2024-06-16 14:59   ` Germano Massullo
  0 siblings, 1 reply; 11+ messages in thread
From: Max Schulze @ 2024-06-16 13:47 UTC (permalink / raw)
  To: Germano Massullo, WireGuard mailing list

Hi,

On 12.06.24 16:11, Germano Massullo wrote:
> Hello, I would like to ask if you are aware of any mini PCI express
> card that provides hardware acceleration for ChaCha20 algorithm. I
> would need it to improve Turris Omnia Wireguard throughput


Why do you think this is the bottleneck, and at what speed are you hitting a limit?

Curious, as I always found wg performance to be excellent, even on ARM.


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-16 13:47 ` Max Schulze
@ 2024-06-16 14:59   ` Germano Massullo
  2024-06-16 19:00     ` Max Schulze
  0 siblings, 1 reply; 11+ messages in thread
From: Germano Massullo @ 2024-06-16 14:59 UTC (permalink / raw)
  To: Max Schulze, WireGuard mailing list

On 16/06/24 15:47, Max Schulze wrote:
> why do you think this is the bottleneck and at what speed are you hitting a limit?

I get ~550 Mbit/s of throughput over the LAN between a Ryzen 5 3600 and
the Turris Omnia, whose CPU goes to 100% load during the iperf3 test.


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-16 14:59   ` Germano Massullo
@ 2024-06-16 19:00     ` Max Schulze
  2024-06-17  9:21       ` Germano Massullo
  0 siblings, 1 reply; 11+ messages in thread
From: Max Schulze @ 2024-06-16 19:00 UTC (permalink / raw)
  To: Germano Massullo, WireGuard mailing list



On 16.06.24 16:59, Germano Massullo wrote:
> On 16/06/24 15:47, Max Schulze wrote:
>> why do you think this is the bottleneck and at what speed are you hitting a limit?
>
> I get ~550 Mbit/s throughput in LAN between a Ryzen 5 3600 and the Turris Omnia which CPU goes to 100% load during iperf3 test


OK, then I think you really are maxing out the CPU. I have not heard of any acceleration card. Overall, I think it's not too bad.


Some notes:

Per [1], my stock Raspberry Pi 4 B (BCM2711, quad-core) has roughly 1.5x the CPU power of the dual-core Marvell Armada 385.

Are you running iperf3 with --bidir?

I get:
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID][Role] Interval           Transfer     Bitrate         Retr
> [  5][TX-C]   0.00-120.00 sec  10.4 GBytes   744 Mbits/sec  1839             sender
> [  5][TX-C]   0.00-120.00 sec  10.4 GBytes   744 Mbits/sec                  receiver
> [  7][RX-C]   0.00-120.00 sec  9.08 GBytes   650 Mbits/sec  151             sender
> [  7][RX-C]   0.00-120.00 sec  9.08 GBytes   650 Mbits/sec                  receiver
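
The bitrates in the summary above follow directly from the transfer
sizes and the 120 s interval. A minimal sketch of that conversion,
assuming iperf3's "GBytes" are binary units (1024**3 bytes) and its
"Mbits/sec" are decimal (10**6 bits); the function name bitrate_mbit is
only illustrative:

```python
GIB = 2**30  # assumption: iperf3 reports "GBytes" in binary units


def bitrate_mbit(transfer_gbytes: float, seconds: float) -> float:
    """Average bitrate in Mbit/s (10**6 bits per second)."""
    return transfer_gbytes * GIB * 8 / seconds / 1e6


# Figures taken from the --bidir summary quoted above.
tx = bitrate_mbit(10.4, 120.0)
rx = bitrate_mbit(9.08, 120.0)
print(f"TX {tx:.0f} Mbit/s, RX {rx:.0f} Mbit/s, combined {tx + rx:.0f} Mbit/s")
```

With --bidir both directions run at the same time, so the combined
figure is the one to compare against a unidirectional run.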

Keep in mind that iperf3 itself uses some cpu.

You could instead serve a static file and transfer it via HTTP.

(For example, create the file with
dd if=/dev/urandom of=/dev/shm/test.rand bs=1M count=300
serve it with [2], and download it with "wget -O /dev/null [...]".)


I get 848 Mbit/s when downloading to the pi and 728 Mbit/s when downloading from it (everything via wireguard).
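
If wget is not available on the downloading host, the same kind of
measurement can be approximated in a few lines of Python. This is only a
sketch, not what was used for the numbers above: measure_mbit is a
hypothetical helper, and the URL in the trailing comment (address, port
and file name) is an assumed example. It should run on the LAN client,
not on the router, since the script itself costs CPU.

```python
import io
import time


def measure_mbit(stream, chunk_size=1 << 20):
    """Read `stream` to EOF, discarding the data.

    Returns (bytes_read, average_rate_in_Mbit_per_s).
    """
    total = 0
    start = time.perf_counter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total * 8 / elapsed / 1e6 if elapsed > 0 else float("inf")
    return total, rate


# Offline self-check with an in-memory stream (no network involved).
n, rate = measure_mbit(io.BytesIO(b"\0" * (4 << 20)))
print(n, "bytes read")

# Against a real server (hypothetical URL, adjust to your setup):
# from urllib.request import urlopen
# with urlopen("http://10.0.50.1:8080/test.rand") as resp:
#     print(measure_mbit(resp))
```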






[1] https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md
[2] https://github.com/svenstaro/miniserve


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-16 19:00     ` Max Schulze
@ 2024-06-17  9:21       ` Germano Massullo
  2024-06-17  9:45         ` Antonio Quartulli
  0 siblings, 1 reply; 11+ messages in thread
From: Germano Massullo @ 2024-06-17  9:21 UTC (permalink / raw)
  To: Max Schulze, WireGuard mailing list

On 16/06/24 21:00, Max Schulze wrote:
> Ok then I think you really max out the cpu. I have not heard of any acceleration card. Overall I think it's not too bad.

The problem is that this is far below what my internet connection can
handle (1 Gbit/s upload).

> Are you running iperf3 with --bidir?

That flag halves the throughput: I get ~280 Mbit/s, compared to the
previous value, which I got with
iperf3 -c 10.0.50.1 -P 4 -Z bbr
(using the Ryzen as client).

> Keep in mind that iperf3 itself uses some cpu.
> You could test serving a static file and transfer via http.

The iperf3 CPU usage is not that high, so switching to an HTTP transfer
would not change much.


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-17  9:21       ` Germano Massullo
@ 2024-06-17  9:45         ` Antonio Quartulli
  2024-06-17 11:08           ` Germano Massullo
  0 siblings, 1 reply; 11+ messages in thread
From: Antonio Quartulli @ 2024-06-17  9:45 UTC (permalink / raw)
  To: Germano Massullo, Max Schulze, WireGuard mailing list

On 17/06/2024 11:21, Germano Massullo wrote:
> On 16/06/24 21:00, Max Schulze wrote:
>> Ok then I think you really max out the cpu. I have not heard of any 
>> acceleration card. Overall I think it's not too bad.
> 
> The problem is that is far under my internet connection capabilities (1 
> Gbit/s upload)
> 
>> Are you running iperf3 with --bidir?
> 
> Such flag halves the throughput, I am getting ~280 Mbit/s compared to 
> the previous value I got by using
> iperf3 -c 10.0.50.1 -P 4 -Z bbr
> (using the Ryzen as client)
>> Keep in mind that iperf3 itself uses some cpu.
>> You could test serving a static file and transfer via http.
> 
> The iperf3 CPU usage is not so high, it wouldn't change much to use the 
> http transfer


Have you tried running the test between a client behind the Turris
Omnia and the wg server?
I am asking because such embedded devices are not necessarily fast at
generating the traffic that iperf requires, so using a different
client may give you a better estimate.

Regards,

-- 
Antonio Quartulli


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-17  9:45         ` Antonio Quartulli
@ 2024-06-17 11:08           ` Germano Massullo
  2024-06-17 11:42             ` Antonio Quartulli
  0 siblings, 1 reply; 11+ messages in thread
From: Germano Massullo @ 2024-06-17 11:08 UTC (permalink / raw)
  To: Antonio Quartulli, WireGuard mailing list

On 17/06/24 11:45, Antonio Quartulli wrote:
> Have you tried running the test between a client, behind the omnia 
> turris and the wg server?

Do you mean a configuration where the Turris Omnia is not acting as the
WireGuard peer/gateway? I could do that, but I would prefer not to.


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-17 11:08           ` Germano Massullo
@ 2024-06-17 11:42             ` Antonio Quartulli
  2024-06-17 12:32               ` Germano Massullo
  0 siblings, 1 reply; 11+ messages in thread
From: Antonio Quartulli @ 2024-06-17 11:42 UTC (permalink / raw)
  To: Germano Massullo, WireGuard mailing list

Hi,

On 17/06/2024 13:08, Germano Massullo wrote:
> On 17/06/24 11:45, Antonio Quartulli wrote:
>> Have you tried running the test between a client, behind the omnia 
>> turris and the wg server?
> 
> Do you mean a configuration where the Turris Omnia is not acting as 
> Wireguard peer/gateway? I could do it but I prefer not

No, no. Sorry, I might have used the wrong words.

Basically, you should keep the wg setup as it is, but instead of running
the iperf client on the Turris, you run it on another host that uses the
Turris as its gateway (as if the Turris were the gateway of a LAN).

This way the tunnel is still established between the Turris and the
server (which is what you want to test), but you move the traffic
generation to a different host (which is most likely what you will have
in a real scenario).

I hope that clears up your doubt.

Regards,

-- 
Antonio Quartulli


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-17 11:42             ` Antonio Quartulli
@ 2024-06-17 12:32               ` Germano Massullo
  2024-06-17 12:41                 ` Roman Mamedov
  0 siblings, 1 reply; 11+ messages in thread
From: Germano Massullo @ 2024-06-17 12:32 UTC (permalink / raw)
  To: Antonio Quartulli, WireGuard mailing list

On 17/06/24 13:42, Antonio Quartulli wrote:
> Hi,
>
> On 17/06/2024 13:08, Germano Massullo wrote:
>> On 17/06/24 11:45, Antonio Quartulli wrote:
>>> Have you tried running the test between a client, behind the omnia 
>>> turris and the wg server?
>>
>> Do you mean a configuration where the Turris Omnia is not acting as 
>> Wireguard peer/gateway? I could do it but I prefer not
>
> No no. Sorry I might have used the wrong words.
>
> Basically you should keep the wg setup as it is, but instead of 
> running the iperf client on the turris, you run it on another host 
> that uses the turris as gateway (as if the turris was the gateway of a 
> LAN).
>
> This way the tunnel is still established between the turris and the 
> server (which is what you want to test), but you move the traffic 
> generation to a different host (which is most likely what you will 
> have in a real scenario).
>
> I hope I clarified your doubt.
>
> Regards,
>
>
Got it. That configuration will not improve the throughput, because I
started this benchmark precisely to find the bottleneck in my own
configuration, which is very similar to the one you described.


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-17 12:32               ` Germano Massullo
@ 2024-06-17 12:41                 ` Roman Mamedov
  2024-06-17 14:31                   ` Germano Massullo
  0 siblings, 1 reply; 11+ messages in thread
From: Roman Mamedov @ 2024-06-17 12:41 UTC (permalink / raw)
  To: Germano Massullo; +Cc: Antonio Quartulli, WireGuard mailing list

On Mon, 17 Jun 2024 14:32:19 +0200
Germano Massullo <germano.massullo@gmail.com> wrote:

> Got it. That configuration will not improve the throughput cause the 
> reason why I started this benchmark is finding out the bottleneck in my 
> configuration, which is very similar to the one you described

The point is that iperf itself uses a huge amount of CPU. You can run your
test and launch "top" in another SSH window. In my experience, on slow CPUs
iperf may account for something like 60% of the CPU use during such tests.
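
As an alternative to watching "top", the same share can be computed from
two samples of the Linux /proc counters. This is a rough sketch under
stated assumptions: it relies on the Linux-specific /proc/stat and
/proc/<pid>/stat layouts, the helper names are hypothetical, and kernel
threads doing the WireGuard crypto are not attributed to the iperf
process. The share is relative to all cores together, whereas "top"
usually shows a per-core percentage.

```python
def cpu_share(proc_before, proc_after, total_before, total_after):
    """Fraction of all CPU time used by one process between two samples.

    All arguments are cumulative tick counts (USER_HZ units).
    """
    total_delta = total_after - total_before
    if total_delta <= 0:
        return 0.0
    return (proc_after - proc_before) / total_delta


def read_total_ticks(path="/proc/stat"):
    """Sum user..steal from the aggregate 'cpu' line (Linux only)."""
    with open(path) as f:
        fields = f.readline().split()[1:9]
    return sum(int(x) for x in fields)


def read_proc_ticks(pid):
    """utime + stime of one process (fields 14 and 15 of its stat file)."""
    with open(f"/proc/{pid}/stat") as f:
        # The command name may contain spaces, so split after the last ')'.
        rest = f.read().rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])


# Made-up sample values: the process used 600 of the 1000 ticks elapsed.
print(f"{cpu_share(100, 700, 1000, 2000):.0%}")
```

On the router one would call read_total_ticks() and read_proc_ticks(pid)
for the iperf3 process before and after a test run and pass the four
values to cpu_share().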

If your typical scenario is the router just forwarding packets between
networks and into the WG tunnel, without providing any network services
itself (such as Samba), then testing with iperf launched on the router will
not be representative of the real-world bottleneck, or lack thereof.

-- 
With respect,
Roman


* Re: Mini PCIE HW accelerator for ChaCha20
  2024-06-17 12:41                 ` Roman Mamedov
@ 2024-06-17 14:31                   ` Germano Massullo
  0 siblings, 0 replies; 11+ messages in thread
From: Germano Massullo @ 2024-06-17 14:31 UTC (permalink / raw)
  To: WireGuard mailing list, Roman Mamedov; +Cc: Antonio Quartulli

After checking that iperf3 was indeed consuming a large share of a CPU
core on the Turris Omnia, I changed the WireGuard topology so that the
router acts only as the WireGuard gateway between two LAN computers
( [A] <--wireguard--> [C] <--wireguard--> [B] ), and I ran iperf3
between those computers with
iperf3 -c x.x.x.x -P 4 -Z bbr
The throughput was ~320 Mbit/s. Considering that the router had to
handle two WireGuard tunnels, one could guess (without any claim of
accuracy, due to the lack of more precise tests) that the maximum
WireGuard throughput such a router can handle is
~2 x 320 Mbit/s = ~640 Mbit/s.

[A]: Ryzen 5 3600 - kernel 5.14.0-427.18.1.el9_4.x86_64
[B]: Ryzen 7 PRO 6850U -  kernel 6.8.11-300.fc40.x86_64
[C]: Turris Omnia - TurrisOS 7.0.0, kernel 5.15.148
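
The estimate above can be written out as a small calculation. This is a
minimal sketch, assuming the router is CPU-bound on WireGuard crypto and
that the cost scales linearly with the number of tunnels each packet
crosses; both are assumptions, not measurements. The function name is
only illustrative, and the 1000 Mbit/s figure is the 1 Gbit/s upload
mentioned earlier in the thread.

```python
def single_tunnel_estimate(measured_mbit: float, tunnels_crossed: int = 2) -> float:
    """Extrapolate single-tunnel throughput from a multi-tunnel measurement.

    Assumes every packet costs one crypto operation per tunnel it crosses
    and that this cost is the only bottleneck (a rough linear model).
    """
    return measured_mbit * tunnels_crossed


estimate = single_tunnel_estimate(320)  # ~320 Mbit/s measured through [C]
uplink = 1000                           # Mbit/s, the 1 Gbit/s upload
print(f"~{estimate:.0f} Mbit/s estimated, {estimate / uplink:.0%} of a {uplink} Mbit/s uplink")
```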
