From: 张飞 <zhangfei@nj.iscas.ac.cn>
To: musl@lists.openwall.com
Subject: Re: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
Date: Wed, 19 Apr 2023 15:22:23 +0800 (GMT+08:00) [thread overview]
Message-ID: <484084b3.20c61.18798649e40.Coremail.zhangfei@nj.iscas.ac.cn> (raw)
In-Reply-To: <20230411124822.GK3630668@port70.net>
[-- Attachment #1: Type: text/plain, Size: 7393 bytes --]
I did replace the C strlen code with a slower one except when
musl is built for "#ifdef __riscv_vector" isa extension.So I referred
to the C strlen code and implemented it with the basic instruction
set, and the performance of both is basically the same.
The reason for implementing two versions is to hope that the memset implemented
using the basic instruction set can be applicable to all RISCV architecture CPUs,
and the vector version can accelerate the hardware supporting vector expansion.
When the compiler adds vector extensions through --with-arch=rv64gcv,
__riscv_vector will also open by default.Similar macro definitions are common in
riscv, such as setjmp/riscv64/setjmp.S in musl, which includes
__riscv_float_abi_soft macro definitions.
At present, the riscv vector extension instruction set is in a frozen state, and
the instruction set is stable. In other open source libraries, such as openssl
and openCV, riscv vector optimization is available.We know that the assembly generated
by the compiler is often not the most efficient, and the automatic vectorization
scenarios are limited, so we need to optimize the function by manual vectorization.
For riscv, compiler automatic vectorization is still in its infancy.
I conducted tests on different data volumes and compared the performance of memset
functions implemented in C language, basic instruction set, and vector instruction
set.The test case is test_strlen.c
Performance comparison between C language implementation and assembly implementation was
tested on Sifive chips(RISC-V SiFive U74 Dual Core 64 Bit RV64GC ISA Chip Platform).
The test results are as follows.Due to the consistent algorithm between the two, there
is basically no difference in performance.
--------------------------------------------------------------------------------
length(byte) C language implementation(s) Basic instruction implementation(s)
--------------------------------------------------------------------------------
2 0.00000528 0.000005441
4 0.00000544 0.000005437
8 0.00000464 0.00000496
16 0.00000544 0.00000512
32 0.0000064 0.00000592
64 0.000007994 0.000007841
128 0.000012 0.000012
256 0.000020321 0.000020481
512 0.000037282 0.000037762
1024 0.000069924 0.000070244
2048 0.000135046 0.000135528
4096 0.000264491 0.000264816
8192 0.000524342 0.000525631
16384 0.001069965 0.001047742
32768 0.002180252 0.002142207
65536 0.005921251 0.005883868
131072 0.012508934 0.012392895
262144 0.02503915 0.024896995
524288 0.049879091 0.049821832
1048576 0.09973658 0.099969603
--------------------------------------------------------------------------------
Due to the lack of a chip that supports vector extension, I conducted a performance
comparison test of strlen using C language and vector implementation on the Spike
simulator, which has certain reference value. It can be clearly seen that vector
implementation is more efficient than C language implementation, with an average
performance improvement of over 800%.
--------------------------------------------------------------------------------
length(byte) C language implementation(s) Vector instruction implementation(s)
--------------------------------------------------------------------------------
2 0.000003639 0.000003339
4 0.000004239 0.000003339
8 0.000003639 0.000003339
16 0.000004339 0.000003339
32 0.000005739 0.000003339
64 0.000008539 0.000003339
128 0.000014139 0.000004039
256 0.000025339 0.000004739
512 0.000047739 0.000006139
1024 0.000092539 0.000008939
2048 0.000182139 0.000014539
4096 0.000361339 0.000025739
8192 0.000719739 0.000048139
16384 0.001436539 0.000092939
32768 0.002870139 0.000182539
65536 0.005737339 0.000361739
131072 0.011471739 0.000720139
262144 0.022940539 0.001436939
524288 0.045878139 0.002870539
1048576 0.091753339 0.005737739
--------------------------------------------------------------------------------
So I hope to pass __riscv_vector, which enables hardware that does not support vector
extension to execute the basic instruction set implementation of strlen, has the same
performance as the C language implementation. For support vector extended hardware,
strlen implemented by vector instruction set is executed to achieve acceleration effect.
Fei Zhang
> -----原始邮件-----
> 发件人: "Szabolcs Nagy" <nsz@port70.net>
> 发送时间: 2023-04-11 20:48:22 (星期二)
> 收件人: "张飞" <zhangfei@nj.iscas.ac.cn>
> 抄送: musl@lists.openwall.com
> 主题: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
>
> * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]:
> > I have made modifications to the assembly implementation of the riscv64 strlen function, mainly
> > focusing on address alignment processing to avoid the problem of data crossing
> > pages during vector instruction memory access.
> >
> > I think the assembly implementation of strlen is necessary. In glibc,
>
> if the c definition is not correct then you have to explain why.
> if it's very slow then please tell us so.
>
> > X86_64, aarch64, alpha, and others all have assembly implementations of this function,
> > while for riscv64, it is blank.
> > I have also analyzed the test sets of Spec2006 and Spec2017, and the strlen function is also a hot topic.
>
> an asm implementation has significant maintenance cost so you should
> provide some benchmark data or other evidence/reasoning for us to
> decide if it's worth the cost.
>
> it seems you replaced the c strlen code with a slower one except when
> musl is built for "#ifdef __riscv_vector" isa extension. what cpus
> does this affect? are linux distros expected to use this as baseline?
> do different riscv cpus have similar simd performance properties? who
> will tweak the asm if not?
>
> in principle what you did can be done by the compiler auto vectorizer
> so maybe contributing to the compiler is more useful.
>
> note that glibc has cpu specific implementations that it can select
> at runtime, but musl uses one generic implementation for all cpus.
</zhangfei@nj.iscas.ac.cn></zhangfei@nj.iscas.ac.cn></nsz@port70.net>
[-- Attachment #2: strlen_riscv64.patch --]
[-- Type: application/octet-stream, Size: 2007 bytes --]
diff -uprN src/string/riscv64/strlen.S src/string/riscv64/strlen.S
--- src/string/riscv64/strlen.S 1970-01-01 08:00:00.000000000 +0800
+++ src/string/riscv64/strlen.S 2023-04-18 13:37:22.644057680 +0800
@@ -0,0 +1,82 @@
+# size_t strlen(const char *str)
+# a0 holds *str
+.global strlen
+.type strlen,@function
+strlen:
+#ifdef __riscv_vector
+ mv t0, a0 # Save start
+ csrr t1, vlenb
+ addi t1, t1, -1
+ add a3, t0, t1
+ not t1, t1
+ and a3, a3, t1
+ sub a4, a3, t0
+ beq a3, t0, loop # if already aligned
+
+unaligned:
+ lbu t1, 0(t0)
+ beqz t1, found
+ addi t0, t0, 1
+ blt t0, a3, unaligned
+
+loop:
+ vsetvli a1, x0, e8, m8, ta, ma # Vector of bytes of maximum length
+ vle8ff.v v8, (t0) # Load bytes
+ csrr a1, vl # Get bytes read
+ vmseq.vi v0, v8, 0 # Set v0[i] where v8[i] = 0
+ vfirst.m a2, v0 # Find first set bit
+ add t0, t0, a1 # Bump pointer
+ bltz a2, loop # Not found?
+
+ add a3, a3, a1 # Sum start + bump
+ add t0, t0, a2 # Add index
+ sub a3, t0, a3 # Subtract start address+bump
+ add a0, a3, a4
+ ret
+
+found:
+ sub a0, t0, a0
+ ret
+
+#else
+ mv a5, a0
+ andi a4, a0, 7
+ beqz a4, aligned # if already aligned
+
+unaligned:
+ lbu a4, 0(a5)
+ beqz a4, count
+ addi a5, a5, 1
+ andi a4, a5, 7
+ bnez a4, unaligned
+
+aligned:
+ la t0, magic
+ ld a1, 0(t0)
+ ld a2, 8(t0)
+
+loop:
+ ld a3, 0(a5)
+ add a4, a3, a1
+ not a3, a3
+ and a4, a4, a3
+ and a4, a4, a2
+ bnez a4, found
+ addi a5, a5, 8
+ j loop
+
+found:
+ lbu a4, 0(a5)
+ beqz a4, count
+ addi a5, a5, 1
+ j found
+
+count:
+ sub a0, a5, a0
+ ret
+
+.section .data
+magic:
+ .dword 0xfefefefefefefeff
+ .dword 0x8080808080808080
+#endif
[-- Attachment #3: test_strlen.c --]
[-- Type: text/plain, Size: 1035 bytes --]
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#define DATA_SIZE 5*1024*1024
#define MAX_LEN 1*1024*1024
#define LOOP_TIMES 100
int main(){
unsigned int len,ans;
char *str1,*src1;
str1 = (char *)mmap(NULL, DATA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
struct timespec tv0,tv;
for(len=1; len<=MAX_LEN; len*=2) {
memset(str1, 'a', DATA_SIZE);
src1 = str1; // +offset
src1[len] = '\0';
clock_gettime(CLOCK_REALTIME, &tv0);
for(int k=0; k<LOOP_TIMES; k++){
ans = strlen(src1);
}
clock_gettime(CLOCK_REALTIME, &tv);
tv.tv_sec -= tv0.tv_sec;
if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
tv.tv_nsec += 1000000000;
tv.tv_sec--;
}
printf("length: %u time: %ld.%.9ld\n",ans, (long)tv.tv_sec, (long)tv.tv_nsec);
if( ans != len)
printf("ERROR! len is %u,ans is %u\n",len,ans); //verify length
}
munmap(str1,DATA_SIZE);
return 0;
}
next prev parent reply other threads:[~2023-04-19 7:22 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-22 6:21 张飞
2023-03-22 6:27 ` A. Wilcox
2023-03-22 12:15 ` Rich Felker
2023-04-11 12:57 ` Szabolcs Nagy
2023-04-10 5:59 ` 张飞
2023-04-11 12:48 ` Szabolcs Nagy
2023-04-19 7:22 ` 张飞 [this message]
2023-04-19 22:39 ` enh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=484084b3.20c61.18798649e40.Coremail.zhangfei@nj.iscas.ac.cn \
--to=zhangfei@nj.iscas.ac.cn \
--cc=musl@lists.openwall.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).