From: 张飞 <zhangfei@nj.iscas.ac.cn>
To: musl@lists.openwall.com
Subject: Re: Re: [musl] memset_riscv64
Date: Wed, 19 Apr 2023 13:33:08 +0800 (GMT+08:00) [thread overview]
Message-ID: <658c32ae.2348c.187980096c9.Coremail.zhangfei@nj.iscas.ac.cn> (raw)
In-Reply-To: <CAKbZUD2Rfd9Lg37GY+N_bMeJOQJ=84yZ=SW9+vHMRdByU0CZ+A@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 6261 bytes --]
Hi!
I have revised the implementation of memset, which includes two versions of the basic instruction set and
vector instruction implementation, and used macro definitions to determine whether riscv hardware
supports vector extension.
The reason for implementing two versions is to hope that the memset implemented
using the basic instruction set can be applicable to all RISCV architecture CPUs, and the vector version can
accelerate the hardware supporting vector expansion.
At present, the riscv vector extension instruction set is in a frozen state, and the instruction set is stable.
In other open source libraries, such as openssl and openCV, riscv vector optimization is available.
I conducted tests on different data volumes and compared the performance of memset functions implemented
in C language, basic instruction set, and vector instruction set.The test case is test_memset.c
Performance comparison between C language implementation and assembly implementation was tested on
Sifive chips(RISC-V SiFive U74 Dual Core 64 Bit RV64GC ISA Chip Platform).
The test results are as follows.when it is less than 16 bytes, the performance of C language implementation is
better; From 16 bytes to 32768 bytes, the basic instruction implementation performance is better, with an average
improvement of over 30%; When it is greater than 32768 bytes, the performance of both is equivalent.
--------------------------------------------------------------------------------
length(byte) C language implementation(s) Basic instruction implementation(s)
--------------------------------------------------------------------------------
4 0.00000352 0.000004001
8 0.000004001 0.000005441
16 0.000006241 0.00000464
32 0.00000752 0.00000448
64 0.000008481 0.000005281
128 0.000009281 0.000005921
256 0.000011201 0.000007041
512 0.000014402 0.000010401
1024 0.000022563 0.000016962
2048 0.000039205 0.000030724
4096 0.000072809 0.000057768
8192 0.000153459 0.000132793
16384 0.000297157 0.000244992
32768 0.000784416 0.000735298
65536 0.005005252 0.004987382
131072 0.011286821 0.011256855
262144 0.023295169 0.022932165
524288 0.04647724 0.046084839
1048576 0.094114058 0.0932383
--------------------------------------------------------------------------------
Due to the lack of a chip that supports vector extension, I conducted a performance comparison test of memset
using C language and vector implementation on the Spike simulator, which has certain reference value.
It can be clearly seen that vector implementation is more efficient than C language implementation,
with an average performance improvement of over 200%.
--------------------------------------------------------------------------------
length(byte) C language implementation(s) Vector instruction implementation(s)
--------------------------------------------------------------------------------
4 0.000002839 0.000002939
8 0.000003239 0.000002939
16 0.000005239 0.000002939
32 0.000007039 0.000002939
64 0.000008039 0.000002939
128 0.000009239 0.000002939
256 0.000011639 0.000003539
512 0.000016439 0.000004739
1024 0.000026039 0.000007139
2048 0.000045239 0.000011939
4096 0.000083639 0.000021539
8192 0.000162298 0.000042598
16384 0.000317757 0.000082857
32768 0.000628675 0.000163375
65536 0.001250511 0.000324411
131072 0.002494183 0.000646483
262144 0.004981527 0.001290627
524288 0.009956215 0.002578915
1048576 0.019905591 0.005155491
--------------------------------------------------------------------------------
So I hope that a more efficient assembly implementation of riscv64 memset functions can be
integrated into musl and compatible with different riscv architectures' CPUs.
Fei Zhang
> -----原始邮件-----
> 发件人: "Pedro Falcato" <pedro.falcato@gmail.com>
> 发送时间: 2023-04-11 17:48:31 (星期二)
> 收件人: musl@lists.openwall.com
> 抄送:
> 主题: Re: [musl] memset_riscv64
>
> On Tue, Apr 11, 2023 at 3:18 AM 张飞 <zhangfei@nj.iscas.ac.cn> wrote:
> >
> > Hello,
> >
> > Currently, there is no assembly implementation of the memset function for riscv64 in Musl.
> > This patch is a riscv64 assembly implementation of the memset function, which is implemented using the basic instruction set and
> > has better performance than the c language implementation in Musl. I hope it can be integrated into Musl.
>
> Hi!
>
> Do you have performance measurements here? What exactly is the difference?
> As far as I know, no one is actively optimizing on riscv yet, only
> some movements in the upstream kernel (to prepare for vector extension
> stuff, unaligned loads/stores) and the corresponding glibc patches.
> Mainly because it's still super unobtanium and in very early stages,
> so optimizing is very hard.
>
> So what hardware did you use? Is there a large gain here? Given that
> your memset looks so simple, wouldn't it just be easier to write this
> in C?
>
> --
> Pedro
</zhangfei@nj.iscas.ac.cn></pedro.falcato@gmail.com>
[-- Attachment #2: memset_riscv64.patch --]
[-- Type: application/octet-stream, Size: 2646 bytes --]
diff -uprN src/string/riscv64/memset.S src/string/riscv64/memset.S
--- src/string/riscv64/memset.S 1970-01-01 08:00:00.000000000 +0800
+++ src/string/riscv64/memset.S 2023-04-18 13:37:22.548056966 +0800
@@ -0,0 +1,125 @@
+.global memset
+.type memset,@function
+
+#ifndef __riscv_vector
+#define SZREG 8
+#define REG_S sd
+#endif
+
+memset:
+#ifdef __riscv_vector
+ mv t0, a0
+ beqz a2, 3f
+
+ csrr t1, vlenb
+ addi t1, t1, -1
+ add a3, t0, t1
+ not t1, t1
+ and a3, a3, t1
+ beq a3, t0, 2f
+ sub a4, a3, t0
+ add a5, t0, a2
+
+1:
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ beq t0, a5, 3f
+ bltu t0, a3, 1b
+ sub a2, a2, a4
+
+2:
+ vsetvli t1, a2, e8, m8, ta, ma
+ vmv.v.x v0, a1
+ sub a2, a2, t1
+ vse8.v v0, (t0)
+ add t0, t0, t1
+ bnez a2, 2b
+3:
+ ret
+
+#else
+ mv t0, a0
+
+ sltiu a3, a2, 16
+ bnez a3, 4f
+
+ addi a3, t0, SZREG-1
+ andi a3, a3, ~(SZREG-1)
+ beq a3, t0, 2f
+ sub a4, a3, t0
+1:
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 1b
+ sub a2, a2, a4
+
+2:
+ andi a1, a1, 0xff
+ slli a3, a1, 8
+ or a1, a3, a1
+ slli a3, a1, 16
+ or a1, a3, a1
+ slli a3, a1, 32
+ or a1, a3, a1
+
+ andi a4, a2, ~(SZREG-1)
+ add a3, t0, a4
+
+ andi a4, a4, 31*SZREG
+ beqz a4, 3f
+ neg a4, a4
+ addi a4, a4, 32*SZREG
+
+ sub t0, t0, a4
+
+ la a5, 3f
+ srli a4, a4, 1
+ add a5, a5, a4
+ jr a5
+3:
+ REG_S a1, 0(t0)
+ REG_S a1, SZREG(t0)
+ REG_S a1, 2*SZREG(t0)
+ REG_S a1, 3*SZREG(t0)
+ REG_S a1, 4*SZREG(t0)
+ REG_S a1, 5*SZREG(t0)
+ REG_S a1, 6*SZREG(t0)
+ REG_S a1, 7*SZREG(t0)
+ REG_S a1, 8*SZREG(t0)
+ REG_S a1, 9*SZREG(t0)
+ REG_S a1, 10*SZREG(t0)
+ REG_S a1, 11*SZREG(t0)
+ REG_S a1, 12*SZREG(t0)
+ REG_S a1, 13*SZREG(t0)
+ REG_S a1, 14*SZREG(t0)
+ REG_S a1, 15*SZREG(t0)
+ REG_S a1, 16*SZREG(t0)
+ REG_S a1, 17*SZREG(t0)
+ REG_S a1, 18*SZREG(t0)
+ REG_S a1, 19*SZREG(t0)
+ REG_S a1, 20*SZREG(t0)
+ REG_S a1, 21*SZREG(t0)
+ REG_S a1, 22*SZREG(t0)
+ REG_S a1, 23*SZREG(t0)
+ REG_S a1, 24*SZREG(t0)
+ REG_S a1, 25*SZREG(t0)
+ REG_S a1, 26*SZREG(t0)
+ REG_S a1, 27*SZREG(t0)
+ REG_S a1, 28*SZREG(t0)
+ REG_S a1, 29*SZREG(t0)
+ REG_S a1, 30*SZREG(t0)
+ REG_S a1, 31*SZREG(t0)
+ addi t0, t0, 32*SZREG
+ bltu t0, a3, 3b
+ andi a2, a2, SZREG-1
+
+4:
+ beqz a2, 6f
+ add a3, t0, a2
+5:
+ sb a1, 0(t0)
+ addi t0, t0, 1
+ bltu t0, a3, 5b
+6:
+ ret
+#endif
[-- Attachment #3: test_memset.c --]
[-- Type: text/plain, Size: 929 bytes --]
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#define DATA_SIZE 5*1024*1024
#define MAX_LEN 1*1024*1024
#define OFFSET 0
#define LOOP_TIMES 100
int main(){
char *str1,*src1;
str1 = (char *)mmap(NULL, DATA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
printf("function test start\n");
src1 = str1+OFFSET;
struct timespec tv0,tv;
for(int len=2; len<=MAX_LEN; len*=2){
clock_gettime(CLOCK_REALTIME, &tv0);
for(int k=0; k<LOOP_TIMES; k++){
memset(src1, 'a', len);
}
clock_gettime(CLOCK_REALTIME, &tv);
tv.tv_sec -= tv0.tv_sec;
if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
tv.tv_nsec += 1000000000;
tv.tv_sec--;
}
printf("len: %d time: %ld.%.9ld\n",len, (long)tv.tv_sec, (long)tv.tv_nsec);
}
printf("function test end\n");
munmap(str1,DATA_SIZE);
return 0;
}
next prev parent reply other threads:[~2023-04-19 5:33 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-11 2:17 张飞
2023-04-11 9:48 ` Pedro Falcato
2023-04-19 5:33 ` 张飞 [this message]
2023-04-19 9:02 ` Szabolcs Nagy
2023-04-20 8:17 ` 张飞
2023-04-21 13:30 ` Szabolcs Nagy
2023-04-21 14:50 ` Pedro Falcato
2023-04-21 16:54 ` Rich Felker
2023-04-21 17:01 ` enh
2023-04-26 7:25 ` 张飞
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=658c32ae.2348c.187980096c9.Coremail.zhangfei@nj.iscas.ac.cn \
--to=zhangfei@nj.iscas.ac.cn \
--cc=musl@lists.openwall.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).