* ARM multithreaded?
From: René van Dorst @ 2017-11-21 9:25 UTC
To: WireGuard list
Hi Jason,
iperf3 -c 10.0.0.1 -t 10 -Z -i 40 -P 3
1 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||87.7%] Tasks: 29, 9 thr, 83 kthr; 6 running
2 [||||||||||||||||||||28.5%] Load average: 0.86 0.64 0.87
3 [|||||||||||||||||||27.3%] Uptime: 4 days, 14:22:07
4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
Mem[||||||||||||||||||||||||||||||||||||||||||||| 85.9M/1000M]
Swp[ 0K/244M]
iperf3 -c 10.0.0.1 -t 10 -Z -i 40
htop output
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||74.0%] Tasks: 29, 9 thr, 83 kthr; 4 running
2 [|||||||||||||||20.5%] Load average: 1.20 0.73 0.90
3 [||||||||||||||19.5%] Uptime: 4 days, 14:22:22
4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||96.8%]
Mem[||||||||||||||||||||||||||||||||||||||||||||| 86.0M/1000M]
Swp[ 0K/244M]
* Re: ARM multithreaded?
From: René van Dorst @ 2017-11-21 9:40 UTC
To: WireGuard list
Hi Jason,
Part 2 ;)
I was expecting my imx6 quad core at 933MHz to outperform my
single-core Dove at 800MHz by a large margin.
Dove (Cubox-es) iperf results:
root@cubox-es:~# iperf3 -c 10.0.0.1 -t 10 -Z -i 10
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.4 port 43600 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.00 sec 194 MBytes 163 Mbits/sec 0 820 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 194 MBytes 163 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 192 MBytes 161 Mbits/sec receiver
iperf Done.
root@cubox-es:~# iperf3 -c 10.0.0.1 -t 10 -Z -i 10 -P 3
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.4 port 43604 connected to 10.0.0.1 port 5201
[ 6] local 10.0.0.4 port 43606 connected to 10.0.0.1 port 5201
[ 8] local 10.0.0.4 port 43608 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.00 sec 89.3 MBytes 74.9 Mbits/sec 0 354 KBytes
[ 6] 0.00-10.00 sec 38.8 MBytes 32.6 Mbits/sec 0 227 KBytes
[ 8] 0.00-10.00 sec 54.3 MBytes 45.5 Mbits/sec 0 235 KBytes
[SUM] 0.00-10.00 sec 182 MBytes 153 Mbits/sec 0
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 89.3 MBytes 74.9 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 88.5 MBytes 74.2 Mbits/sec receiver
[ 6] 0.00-10.00 sec 38.8 MBytes 32.6 Mbits/sec 0 sender
[ 6] 0.00-10.00 sec 38.4 MBytes 32.2 Mbits/sec receiver
[ 8] 0.00-10.00 sec 54.3 MBytes 45.5 Mbits/sec 0 sender
[ 8] 0.00-10.00 sec 53.6 MBytes 44.9 Mbits/sec receiver
[SUM] 0.00-10.00 sec 182 MBytes 153 Mbits/sec 0 sender
[SUM] 0.00-10.00 sec 180 MBytes 151 Mbits/sec receiver
Imx6 (Utilite) iperf results:
[root@utilite ~]# iperf3 -c 10.0.0.1 -t 10 -Z -i 10
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.5 port 40336 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.00 sec 216 MBytes 181 Mbits/sec 0 382 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 216 MBytes 181 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 215 MBytes 181 Mbits/sec receiver
iperf Done.
[root@utilite ~]# iperf3 -c 10.0.0.1 -t 10 -Z -i 10 -P 3
Connecting to host 10.0.0.1, port 5201
[ 4] local 10.0.0.5 port 40340 connected to 10.0.0.1 port 5201
[ 6] local 10.0.0.5 port 40342 connected to 10.0.0.1 port 5201
[ 8] local 10.0.0.5 port 40344 connected to 10.0.0.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-10.00 sec 93.5 MBytes 78.4 Mbits/sec 0 270 KBytes
[ 6] 0.00-10.00 sec 76.1 MBytes 63.9 Mbits/sec 1 224 KBytes
[ 8] 0.00-10.00 sec 88.9 MBytes 74.6 Mbits/sec 0 270 KBytes
[SUM] 0.00-10.00 sec 259 MBytes 217 Mbits/sec 1
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 93.5 MBytes 78.4 Mbits/sec 0 sender
[ 4] 0.00-10.00 sec 93.0 MBytes 78.0 Mbits/sec receiver
[ 6] 0.00-10.00 sec 76.1 MBytes 63.9 Mbits/sec 1 sender
[ 6] 0.00-10.00 sec 75.5 MBytes 63.3 Mbits/sec receiver
[ 8] 0.00-10.00 sec 88.9 MBytes 74.6 Mbits/sec 0 sender
[ 8] 0.00-10.00 sec 88.4 MBytes 74.1 Mbits/sec receiver
[SUM] 0.00-10.00 sec 259 MBytes 217 Mbits/sec 1 sender
[SUM] 0.00-10.00 sec 257 MBytes 215 Mbits/sec receiver
iperf Done.
I looked at the CPU usage on the imx6 while running iperf.
There I see that the iperf process itself only uses around 2-10% CPU,
but the kernel threads use a lot more.
Below is the typical CPU usage (htop CPU bars output).
Running: iperf3 -c 10.0.0.1 -t 10 -Z -i 40 -P 3
1 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||87.7%] Tasks: 29, 9 thr, 83 kthr; 6 running
2 [||||||||||||||||||||28.5%] Load average: 0.86 0.64 0.87
3 [|||||||||||||||||||27.3%] Uptime: 4 days, 14:22:07
4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
Mem[||||||||||||||||||||||||||||||||||||||||||||| 85.9M/1000M]
Swp[ 0K/244M]
Running: iperf3 -c 10.0.0.1 -t 10 -Z -i 40
htop output
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||74.0%] Tasks: 29, 9 thr, 83 kthr; 4 running
2 [|||||||||||||||20.5%] Load average: 1.20 0.73 0.90
3 [||||||||||||||19.5%] Uptime: 4 days, 14:22:22
4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||96.8%]
Mem[||||||||||||||||||||||||||||||||||||||||||||| 86.0M/1000M]
Swp[ 0K/244M]
So it seems that one of the processes in the chain is a bottleneck.
htop only shows "kworker" as the name, which is not really useful for
debugging. See below.
1 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||79.1%] Tasks: 29, 9 thr, 82 kthr; 5 running
2 [||||||||||||||||||24.5%] Load average: 2.07 1.33 1.35
3 [||||||||||||||||23.2%] Uptime: 4 days, 14:34:57
4 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||99.4%]
Mem[||||||||||||||||||||||||||||||||||||||||||||| 86.3M/1000M]
Swp[ 0K/244M]
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
13706 root 20 0 0 0 0 R 61.8 0.0 1:20.60 kworker/3:6
7 root 20 0 0 0 0 R 20.6 0.0 2:39.03 ksoftirqd/0
13743 root 20 0 0 0 0 S 19.9 0.0 0:10.00 kworker/2:0
13755 root 20 0 0 0 0 R 17.9 0.0 0:18.32 kworker/3:3
13707 root 20 0 0 0 0 S 15.9 0.0 0:24.29 kworker/1:3
13747 root 20 0 0 0 0 S 14.6 0.0 0:03.73 kworker/3:0
13753 root 20 0 0 0 0 S 13.3 0.0 0:01.68 kworker/0:1
13754 root 20 0 0 0 0 R 7.3 0.0 0:03.91 kworker/0:2
13752 root 20 0 0 0 0 S 4.7 0.0 0:02.97 kworker/1:0
13751 root 20 0 0 0 0 S 4.0 0.0 0:03.97 kworker/3:2
13748 root 20 0 2944 608 536 S 2.7 0.1 0:01.14 iperf3 -c 10.0.0.1 -t 1000 -Z -i 40
13749 root 20 0 0 0 0 S 2.7 0.0 0:02.61 kworker/2:1
13733 root 20 0 12860 3252 2368 R 2.0 0.3 0:16.53 htop
13757 root 20 0 0 0 0 S 0.7 0.0 0:01.54 kworker/2:2
13684 root 20 0 0 0 0 S 0.0 0.0 0:25.83 kworker/1:1
13750 root 20 0 0 0 0 S 0.0 0.0 0:04.12 kworker/3:1
13756 root 20 0 0 0 0 S 0.0 0.0 0:01.21 kworker/1:2
Any idea how to debug it and to improve the performance?
Greets,
René van Dorst.
* Re: ARM multithreaded?
From: Jason A. Donenfeld @ 2017-11-21 10:02 UTC
To: René van Dorst; +Cc: Toke Høiland-Jørgensen, WireGuard list
Hi René,
There are a few bottlenecks in the existing queuing code:
- transmission of packets is limited to one core, even if encryption
is multicore, to avoid out-of-order packets.
- packet queues use a ring buffer with two spinlocks, which causes
contention on systems with copious numbers of CPUs (not your case).
- CPU autoscaling - sometimes using all the cores isn't useful if that
lowers the clockrate or if there are few packets, but we don't have an
auto scale-up/scale-down algorithm right now. Instead we blast out to
all cores, always.
- CPU locality - packets might be created on one core and encrypted on
another. There's not much we can do about this with a multicore
algorithm, unless there are "hints" or dual per-CPU and per-device
queues with scheduling between them, which is complicated and would
need lots of thought.
- the transmission core is also used as an encryption core. In some
environments this is a benefit, in others a detriment.
- there's a slightly expensive bitmask operation to determine which
CPU should be used for the next packet; a rough sketch of that idea
follows below.
- other challenging puzzles from queue-theory land.
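For illustration, here's a rough userspace sketch of that round-robin
bitmask walk. This is toy code, not the actual kernel implementation
(see queueing.h linked below for the real thing), and the fixed 4-CPU
online mask is an assumption made just for the example:

#include <stdio.h>

static const unsigned long online_mask = 0xFUL; /* assume CPUs 0-3 online */
#define NBITS ((int)(8 * sizeof(unsigned long)))

/* Walk to the next set bit after *last, wrapping around, so that
 * successive packets land on successive online CPUs. */
static int next_online_cpu(int *last)
{
    int cpu = *last;
    do {
        cpu = (cpu + 1) % NBITS;
    } while (!(online_mask & (1UL << cpu)));
    *last = cpu;
    return cpu;
}

int main(void)
{
    int cursor = -1;
    for (int pkt = 0; pkt < 8; pkt++)
        printf("packet %d -> cpu %d\n", pkt, next_online_cpu(&cursor));
    return 0;
}

Every transmitted packet pays for one such walk over the mask, which
is where the "slightly expensive" part comes from.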
I've CCd Samuel and Toke in case they want to jump in on this thread
and complain some about other aspects of the multicore algorithm. It's
certainly much better than it was during the padata era, but there's still
a lot to be done. The implementation lives here:
From these lines on down, best read from bottom to top.
https://git.zx2c4.com/WireGuard/tree/src/send.c#n185
https://git.zx2c4.com/WireGuard/tree/src/receive.c#n281
Utility functions:
https://git.zx2c4.com/WireGuard/tree/src/queueing.c
https://git.zx2c4.com/WireGuard/tree/src/queueing.h
Let me know if you have further ideas for improving performance.
Jason
* Re: ARM multithreaded?
From: Jason A. Donenfeld @ 2017-11-22 23:20 UTC
To: Dave Taht; +Cc: Toke Høiland-Jørgensen, WireGuard list
On Thu, Nov 23, 2017 at 12:19 AM, Dave Taht <dave@taht.net> wrote:
> Not moi? :)
You too, of course!
>
> My take on things was basically not even try to do multicore on single
> flows but to start by divvying up things into tons of queues, and try to
> keep those flows entirely on the core they started with.
Yes, this is the traditional approach taken, but it doesn't work well
for accelerating the crypto of, say, a single file download or
somebody watching YouTube. I don't want these pegged to a single core.
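To make that concrete, here is a minimal userspace sketch of the
alternative: spread one flow's packets across worker threads for
encryption, but have a single consumer release them strictly in
order. It's a toy model with placeholders, not the real code (which
uses kernel workqueues; see send.c linked earlier in the thread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NPKT 16
#define NWORKERS 4

static atomic_int next_job;      /* next packet index to "encrypt" */
static atomic_bool done[NPKT];   /* per-packet encryption-finished flag */

static void *encrypt_worker(void *arg)
{
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_job, 1);
        if (i >= NPKT)
            return NULL;
        /* ...the expensive ChaCha20-Poly1305 work would go here... */
        atomic_store(&done[i], true);
    }
}

int main(void)
{
    pthread_t w[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&w[i], NULL, encrypt_worker, NULL);

    /* The single "transmission core": release packets in sequence,
     * so parallel encryption never reorders the wire. */
    for (int i = 0; i < NPKT; i++) {
        while (!atomic_load(&done[i]))
            ; /* spin; the kernel re-queues a tx worker instead */
        printf("tx packet %d\n", i);
    }

    for (int i = 0; i < NWORKERS; i++)
        pthread_join(w[i], NULL);
    return 0;
}

This way even a single TCP flow gets all the cores for crypto, at the
cost of the ordering machinery described earlier in the thread.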
Jason