* [musl] [PATCH]Implementation of strlen function in riscv64 architecture @ 2023-03-22 6:21 张飞 2023-03-22 6:27 ` A. Wilcox 0 siblings, 1 reply; 8+ messages in thread From: 张飞 @ 2023-03-22 6:21 UTC (permalink / raw) To: musl [-- Attachment #1.1: Type: text/plain, Size: 481 bytes --] Hi: Iimplementedvectorizationofthestrlenfunctionintheriscv64architecture, whichiscontrolledby__riscv_vectordefinition.Duetolackofsupportforrisc-vVexpansioninhardware, Iconductedperformancetestsonasimulator, whichwasmorethan10timestheperformanceachievedinClanguage. Intermsoffunctionality, Itestedthestringlengthfrom1byteto64Mb, andthealignmentofdifferentaddressesatthebeginningofthestring. Please review it.I'm Looking forward to your reply,thanks. Fei Zhang [-- Attachment #1.2: Type: text/html, Size: 7076 bytes --] [-- Attachment #2: strlen.S --] [-- Type: application/octet-stream, Size: 754 bytes --] .text .balign 4 .global strlen # size_t strlen(const char *str) # a0 holds *str strlen: mv t1, a0 # Save start #ifdef __riscv_vector loop: vsetvli t0, x0, e8, m8, ta, ma # Vector of bytes of maximum length vle8ff.v v8, (t1) # Load bytes csrr t0, vl # Get bytes read vmseq.vi v0, v8, 0 # Set v0[i] where v8[i] = 0 vfirst.m t2, v0 # Find first set bit add t1, t1, t0 # Bump pointer bltz t2, loop # Not found? add a0, a0, t0 # Sum start + bump add t1, t1, t2 # Add index sub a0, t1, a0 # Subtract start address+bump #else 1: lbu t0, 0(t1) beqz t0, 2f addi t1, t1, 1 j 1b 2: sub a0, t1, a0 #endif ret ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture 2023-03-22 6:21 [musl] [PATCH]Implementation of strlen function in riscv64 architecture 张飞 @ 2023-03-22 6:27 ` A. Wilcox 2023-03-22 12:15 ` Rich Felker 2023-04-10 5:59 ` 张飞 0 siblings, 2 replies; 8+ messages in thread From: A. Wilcox @ 2023-03-22 6:27 UTC (permalink / raw) To: musl The content of the message was sent as an image. For those who cannot view images, I've reproduced the text below: On Mar 22, 2023, at 1:21 AM, 张飞 <zhangfei@nj.iscas.ac.cn> wrote: > > Hi: > > I implemented vectorization of the strlen function in the riscv64 > architecture, which is controlled by __riscv_vector definition. Due > to lack of support for risc-v V expansion in hardware, I conducted > performance tests on a simulator, which was more than 10 times the > performance achieved in C language. In terms of functionality, I > tested the string length from 1 byte to 64 Mb, and the alignment of > different addresses at the beginning of the string. > > > Please review it.I'm Looking forward to your reply,thanks. > > > > Fei Zhang > <strlen.S> ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture 2023-03-22 6:27 ` A. Wilcox @ 2023-03-22 12:15 ` Rich Felker 2023-04-11 12:57 ` Szabolcs Nagy 2023-04-10 5:59 ` 张飞 1 sibling, 1 reply; 8+ messages in thread From: Rich Felker @ 2023-03-22 12:15 UTC (permalink / raw) To: 张飞, A. Wilcox; +Cc: musl On Wed, Mar 22, 2023 at 01:27:33AM -0500, A. Wilcox wrote: > The content of the message was sent as an image. > > For those who cannot view images, I've reproduced the text below: > > On Mar 22, 2023, at 1:21 AM, 张飞 <zhangfei@nj.iscas.ac.cn> wrote: > > > > Hi: > > > > I implemented vectorization of the strlen function in the riscv64 > > architecture, which is controlled by __riscv_vector definition. Due > > to lack of support for risc-v V expansion in hardware, I conducted > > performance tests on a simulator, which was more than 10 times the > > performance achieved in C language. In terms of functionality, I > > tested the string length from 1 byte to 64 Mb, and the alignment of > > different addresses at the beginning of the string. > > > > > > Please review it.I'm Looking forward to your reply,thanks. The riscv64 target does not assume presence of vector extensions, and as it's generally not a bottleneck, strlen isn't one of the functions for which we generally have existing per-arch asm. If we were going to introduce this kind of thing for strlen, the preferable approach would probably be something like what I've suggested we change memcpy/memset to: having the arch definition provide only the minimal inline fragment needed to do the actual work (something like: loading a vector, optionally xor'ing it with a mask for the byte to search for, and reporting if it's found or the offset at which it's found) with the actual control logic all in C. Regarding the code submitted for review, I'm pretty sure it's buggy because it doesn't seem to do anything with alignment. If you pass it a pointer to the last byte of a page whose contents are zero, it will attempt to load the rest of the vector from the next page, and fault. Since strlen has no a priori way to know how long the object it's inspecting is, I don't believe there's any way to do a vectorized approach without pre-alignment to the size of the read you will be performing, processing everything up to the aligned start separately. Having to check for this kind of bug on a per-arch basis is one of the motivations for not wanting whole functions written in asm, but instead just minimal fragments, with this sort of common logic in C where you know, once it's been reviewed once, it's correct for all archs. Rich ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture 2023-03-22 12:15 ` Rich Felker @ 2023-04-11 12:57 ` Szabolcs Nagy 0 siblings, 0 replies; 8+ messages in thread From: Szabolcs Nagy @ 2023-04-11 12:57 UTC (permalink / raw) To: Rich Felker; +Cc: 张飞, A. Wilcox, musl * Rich Felker <dalias@libc.org> [2023-03-22 08:15:30 -0400]: > Regarding the code submitted for review, I'm pretty sure it's buggy > because it doesn't seem to do anything with alignment. If you pass it > a pointer to the last byte of a page whose contents are zero, it will > attempt to load the rest of the vector from the next page, and fault. the aarch64 sve isa extension has 'first faulting register' mask and there are load/store instructions that set it instead of actually faulting when a vector goes off at the end of a page. i suspect riscv copied this piece of architecture (as well as the variable vector length). ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture 2023-03-22 6:27 ` A. Wilcox 2023-03-22 12:15 ` Rich Felker @ 2023-04-10 5:59 ` 张飞 2023-04-11 12:48 ` Szabolcs Nagy 1 sibling, 1 reply; 8+ messages in thread From: 张飞 @ 2023-04-10 5:59 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 1879 bytes --] I have made modifications to the assembly implementation of the riscv64 strlen function, mainly focusing on address alignment processing to avoid the problem of data crossing pages during vector instruction memory access. I think the assembly implementation of strlen is necessary. In glibc, X86_64, aarch64, alpha, and others all have assembly implementations of this function, while for riscv64, it is blank. I have also analyzed the test sets of Spec2006 and Spec2017, and the strlen function is also a hot topic. Please review the patch again and look forward to your reply. Fei Zhang > -----原始邮件----- > 发件人: "A. Wilcox" <awilfox@adelielinux.org> > 发送时间: 2023-03-22 14:27:33 (星期三) > 收件人: musl@lists.openwall.com > 抄送: > 主题: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture > > The content of the message was sent as an image. > > For those who cannot view images, I've reproduced the text below: > > On Mar 22, 2023, at 1:21 AM, 张飞 <zhangfei@nj.iscas.ac.cn> wrote: > > > > Hi: > > > > I implemented vectorization of the strlen function in the riscv64 > > architecture, which is controlled by __riscv_vector definition. Due > > to lack of support for risc-v V expansion in hardware, I conducted > > performance tests on a simulator, which was more than 10 times the > > performance achieved in C language. In terms of functionality, I > > tested the string length from 1 byte to 64 Mb, and the alignment of > > different addresses at the beginning of the string. > > > > > > Please review it.I'm Looking forward to your reply,thanks. > > > > > > > > Fei Zhang > > <strlen.s> </strlen.s></zhangfei@nj.iscas.ac.cn></awilfox@adelielinux.org> [-- Attachment #2: strlen_riscv64.patch --] [-- Type: application/octet-stream, Size: 1323 bytes --] diff -uprN src/string/riscv64/strlen.S src/string/riscv64/strlen.S --- src/string/riscv64/strlen.S 1970-01-01 08:00:00.000000000 +0800 +++ src/string/riscv64/strlen.S 2023-04-10 11:28:45.301698194 +0800 @@ -0,0 +1,46 @@ +# size_t strlen(const char *str) +# a0 holds *str +.global strlen +.type strlen,@function +strlen: + mv t0, a0 # Save start +#ifdef __riscv_vector + csrr t1, vlenb + addi t1, t1, -1 + add a3, t0, t1 + not t1, t1 + and a3, a3, t1 + sub a4, a3, t0 + beq a3, t0, loop /* if already aligned*/ + +unaligned: + lbu t1, 0(t0) + beqz t1, found + addi t0, t0, 1 + blt t0, a3, unaligned + +loop: + vsetvli a1, x0, e8, m8, ta, ma # Vector of bytes of maximum length + vle8ff.v v8, (t0) # Load bytes + csrr a1, vl # Get bytes read + vmseq.vi v0, v8, 0 # Set v0[i] where v8[i] = 0 + vfirst.m a2, v0 # Find first set bit + add t0, t0, a1 # Bump pointer + bltz a2, loop # Not found? + + add a3, a3, a1 # Sum start + bump + add t0, t0, a2 # Add index + sub a3, t0, a3 # Subtract start address+bump + add a0, a3, a4 + ret +#else +loop: + lbu t1, 0(t0) + beqz t1, found + addi t0, t0, 1 + j loop +#endif + +found: + sub a0, t0, a0 + ret ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture 2023-04-10 5:59 ` 张飞 @ 2023-04-11 12:48 ` Szabolcs Nagy 2023-04-19 7:22 ` 张飞 0 siblings, 1 reply; 8+ messages in thread From: Szabolcs Nagy @ 2023-04-11 12:48 UTC (permalink / raw) To: 张飞; +Cc: musl * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]: > I have made modifications to the assembly implementation of the riscv64 strlen function, mainly > focusing on address alignment processing to avoid the problem of data crossing > pages during vector instruction memory access. > > I think the assembly implementation of strlen is necessary. In glibc, if the c definition is not correct then you have to explain why. if it's very slow then please tell us so. > X86_64, aarch64, alpha, and others all have assembly implementations of this function, > while for riscv64, it is blank. > I have also analyzed the test sets of Spec2006 and Spec2017, and the strlen function is also a hot topic. an asm implementation has significant maintenance cost so you should provide some benchmark data or other evidence/reasoning for us to decide if it's worth the cost. it seems you replaced the c strlen code with a slower one except when musl is built for "#ifdef __riscv_vector" isa extension. what cpus does this affect? are linux distros expected to use this as baseline? do different riscv cpus have similar simd performance properties? who will tweak the asm if not? in principle what you did can be done by the compiler auto vectorizer so maybe contributing to the compiler is more useful. note that glibc has cpu specific implementations that it can select at runtime, but musl uses one generic implementation for all cpus. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture 2023-04-11 12:48 ` Szabolcs Nagy @ 2023-04-19 7:22 ` 张飞 2023-04-19 22:39 ` enh 0 siblings, 1 reply; 8+ messages in thread From: 张飞 @ 2023-04-19 7:22 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 7393 bytes --] I did replace the C strlen code with a slower one except when musl is built for "#ifdef __riscv_vector" isa extension.So I referred to the C strlen code and implemented it with the basic instruction set, and the performance of both is basically the same. The reason for implementing two versions is to hope that the memset implemented using the basic instruction set can be applicable to all RISCV architecture CPUs, and the vector version can accelerate the hardware supporting vector expansion. When the compiler adds vector extensions through --with-arch=rv64gcv, __riscv_vector will also open by default.Similar macro definitions are common in riscv, such as setjmp/riscv64/setjmp.S in musl, which includes __riscv_float_abi_soft macro definitions. At present, the riscv vector extension instruction set is in a frozen state, and the instruction set is stable. In other open source libraries, such as openssl and openCV, riscv vector optimization is available.We know that the assembly generated by the compiler is often not the most efficient, and the automatic vectorization scenarios are limited, so we need to optimize the function by manual vectorization. For riscv, compiler automatic vectorization is still in its infancy. I conducted tests on different data volumes and compared the performance of memset functions implemented in C language, basic instruction set, and vector instruction set.The test case is test_strlen.c Performance comparison between C language implementation and assembly implementation was tested on Sifive chips(RISC-V SiFive U74 Dual Core 64 Bit RV64GC ISA Chip Platform). The test results are as follows.Due to the consistent algorithm between the two, there is basically no difference in performance. -------------------------------------------------------------------------------- length(byte) C language implementation(s) Basic instruction implementation(s) -------------------------------------------------------------------------------- 2 0.00000528 0.000005441 4 0.00000544 0.000005437 8 0.00000464 0.00000496 16 0.00000544 0.00000512 32 0.0000064 0.00000592 64 0.000007994 0.000007841 128 0.000012 0.000012 256 0.000020321 0.000020481 512 0.000037282 0.000037762 1024 0.000069924 0.000070244 2048 0.000135046 0.000135528 4096 0.000264491 0.000264816 8192 0.000524342 0.000525631 16384 0.001069965 0.001047742 32768 0.002180252 0.002142207 65536 0.005921251 0.005883868 131072 0.012508934 0.012392895 262144 0.02503915 0.024896995 524288 0.049879091 0.049821832 1048576 0.09973658 0.099969603 -------------------------------------------------------------------------------- Due to the lack of a chip that supports vector extension, I conducted a performance comparison test of strlen using C language and vector implementation on the Spike simulator, which has certain reference value. It can be clearly seen that vector implementation is more efficient than C language implementation, with an average performance improvement of over 800%. -------------------------------------------------------------------------------- length(byte) C language implementation(s) Vector instruction implementation(s) -------------------------------------------------------------------------------- 2 0.000003639 0.000003339 4 0.000004239 0.000003339 8 0.000003639 0.000003339 16 0.000004339 0.000003339 32 0.000005739 0.000003339 64 0.000008539 0.000003339 128 0.000014139 0.000004039 256 0.000025339 0.000004739 512 0.000047739 0.000006139 1024 0.000092539 0.000008939 2048 0.000182139 0.000014539 4096 0.000361339 0.000025739 8192 0.000719739 0.000048139 16384 0.001436539 0.000092939 32768 0.002870139 0.000182539 65536 0.005737339 0.000361739 131072 0.011471739 0.000720139 262144 0.022940539 0.001436939 524288 0.045878139 0.002870539 1048576 0.091753339 0.005737739 -------------------------------------------------------------------------------- So I hope to pass __riscv_vector, which enables hardware that does not support vector extension to execute the basic instruction set implementation of strlen, has the same performance as the C language implementation. For support vector extended hardware, strlen implemented by vector instruction set is executed to achieve acceleration effect. Fei Zhang > -----原始邮件----- > 发件人: "Szabolcs Nagy" <nsz@port70.net> > 发送时间: 2023-04-11 20:48:22 (星期二) > 收件人: "张飞" <zhangfei@nj.iscas.ac.cn> > 抄送: musl@lists.openwall.com > 主题: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture > > * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]: > > I have made modifications to the assembly implementation of the riscv64 strlen function, mainly > > focusing on address alignment processing to avoid the problem of data crossing > > pages during vector instruction memory access. > > > > I think the assembly implementation of strlen is necessary. In glibc, > > if the c definition is not correct then you have to explain why. > if it's very slow then please tell us so. > > > X86_64, aarch64, alpha, and others all have assembly implementations of this function, > > while for riscv64, it is blank. > > I have also analyzed the test sets of Spec2006 and Spec2017, and the strlen function is also a hot topic. > > an asm implementation has significant maintenance cost so you should > provide some benchmark data or other evidence/reasoning for us to > decide if it's worth the cost. > > it seems you replaced the c strlen code with a slower one except when > musl is built for "#ifdef __riscv_vector" isa extension. what cpus > does this affect? are linux distros expected to use this as baseline? > do different riscv cpus have similar simd performance properties? who > will tweak the asm if not? > > in principle what you did can be done by the compiler auto vectorizer > so maybe contributing to the compiler is more useful. > > note that glibc has cpu specific implementations that it can select > at runtime, but musl uses one generic implementation for all cpus. </zhangfei@nj.iscas.ac.cn></zhangfei@nj.iscas.ac.cn></nsz@port70.net> [-- Attachment #2: strlen_riscv64.patch --] [-- Type: application/octet-stream, Size: 2007 bytes --] diff -uprN src/string/riscv64/strlen.S src/string/riscv64/strlen.S --- src/string/riscv64/strlen.S 1970-01-01 08:00:00.000000000 +0800 +++ src/string/riscv64/strlen.S 2023-04-18 13:37:22.644057680 +0800 @@ -0,0 +1,82 @@ +# size_t strlen(const char *str) +# a0 holds *str +.global strlen +.type strlen,@function +strlen: +#ifdef __riscv_vector + mv t0, a0 # Save start + csrr t1, vlenb + addi t1, t1, -1 + add a3, t0, t1 + not t1, t1 + and a3, a3, t1 + sub a4, a3, t0 + beq a3, t0, loop # if already aligned + +unaligned: + lbu t1, 0(t0) + beqz t1, found + addi t0, t0, 1 + blt t0, a3, unaligned + +loop: + vsetvli a1, x0, e8, m8, ta, ma # Vector of bytes of maximum length + vle8ff.v v8, (t0) # Load bytes + csrr a1, vl # Get bytes read + vmseq.vi v0, v8, 0 # Set v0[i] where v8[i] = 0 + vfirst.m a2, v0 # Find first set bit + add t0, t0, a1 # Bump pointer + bltz a2, loop # Not found? + + add a3, a3, a1 # Sum start + bump + add t0, t0, a2 # Add index + sub a3, t0, a3 # Subtract start address+bump + add a0, a3, a4 + ret + +found: + sub a0, t0, a0 + ret + +#else + mv a5, a0 + andi a4, a0, 7 + beqz a4, aligned # if already aligned + +unaligned: + lbu a4, 0(a5) + beqz a4, count + addi a5, a5, 1 + andi a4, a5, 7 + bnez a4, unaligned + +aligned: + la t0, magic + ld a1, 0(t0) + ld a2, 8(t0) + +loop: + ld a3, 0(a5) + add a4, a3, a1 + not a3, a3 + and a4, a4, a3 + and a4, a4, a2 + bnez a4, found + addi a5, a5, 8 + j loop + +found: + lbu a4, 0(a5) + beqz a4, count + addi a5, a5, 1 + j found + +count: + sub a0, a5, a0 + ret + +.section .data +magic: + .dword 0xfefefefefefefeff + .dword 0x8080808080808080 +#endif [-- Attachment #3: test_strlen.c --] [-- Type: text/plain, Size: 1035 bytes --] #include <stdio.h> #include <sys/mman.h> #include <string.h> #include <time.h> #include <stdlib.h> #define DATA_SIZE 5*1024*1024 #define MAX_LEN 1*1024*1024 #define LOOP_TIMES 100 int main(){ unsigned int len,ans; char *str1,*src1; str1 = (char *)mmap(NULL, DATA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); struct timespec tv0,tv; for(len=1; len<=MAX_LEN; len*=2) { memset(str1, 'a', DATA_SIZE); src1 = str1; // +offset src1[len] = '\0'; clock_gettime(CLOCK_REALTIME, &tv0); for(int k=0; k<LOOP_TIMES; k++){ ans = strlen(src1); } clock_gettime(CLOCK_REALTIME, &tv); tv.tv_sec -= tv0.tv_sec; if ((tv.tv_nsec -= tv0.tv_nsec) < 0) { tv.tv_nsec += 1000000000; tv.tv_sec--; } printf("length: %u time: %ld.%.9ld\n",ans, (long)tv.tv_sec, (long)tv.tv_nsec); if( ans != len) printf("ERROR! len is %u,ans is %u\n",len,ans); //verify length } munmap(str1,DATA_SIZE); return 0; } ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture 2023-04-19 7:22 ` 张飞 @ 2023-04-19 22:39 ` enh 0 siblings, 0 replies; 8+ messages in thread From: enh @ 2023-04-19 22:39 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 8728 bytes --] On Wed, Apr 19, 2023 at 12:22 AM 张飞 <zhangfei@nj.iscas.ac.cn> wrote: > I did replace the C strlen code with a slower one except when > musl is built for "#ifdef __riscv_vector" isa extension.So I referred > to the C strlen code and implemented it with the basic instruction > set, and the performance of both is basically the same. > > The reason for implementing two versions is to hope that the memset > implemented > using the basic instruction set can be applicable to all RISCV > architecture CPUs, > and the vector version can accelerate the hardware supporting vector > expansion. > When the compiler adds vector extensions through --with-arch=rv64gcv, > __riscv_vector will also open by default.Similar macro definitions are > common in > riscv, such as setjmp/riscv64/setjmp.S in musl, which includes > __riscv_float_abi_soft macro definitions. > > At present, the riscv vector extension instruction set is in a frozen > state, and > the instruction set is stable. In other open source libraries, such as > openssl > and openCV, riscv vector optimization is available. is that actually checked in to openssl? the linux kernel patches to save/restore vector state still haven't been merged to linux-next afaik, and there's still no hwcaps support for V either. or are they using `__riscv_vector` too, and not detecting V at runtime? (the kernel's own use of V and Zb* seems to be based on an internal-only hwcap mechanism for now.) > We know that the assembly generated > by the compiler is often not the most efficient, and the automatic > vectorization > scenarios are limited, so we need to optimize the function by manual > vectorization. > For riscv, compiler automatic vectorization is still in its infancy. > have you tried sifive's autovectorization patches? do they help for this code? > I conducted tests on different data volumes and compared the performance > of memset > functions implemented in C language, basic instruction set, and vector > instruction > set.The test case is test_strlen.c > > Performance comparison between C language implementation and assembly > implementation was > tested on Sifive chips(RISC-V SiFive U74 Dual Core 64 Bit RV64GC ISA Chip > Platform). > > The test results are as follows.Due to the consistent algorithm between > the two, there > is basically no difference in performance. > > > -------------------------------------------------------------------------------- > length(byte) C language implementation(s) Basic instruction > implementation(s) > > -------------------------------------------------------------------------------- > 2 0.00000528 0.000005441 > 4 0.00000544 0.000005437 > 8 0.00000464 0.00000496 > 16 0.00000544 0.00000512 > 32 0.0000064 0.00000592 > 64 0.000007994 0.000007841 > 128 0.000012 0.000012 > 256 0.000020321 0.000020481 > 512 0.000037282 0.000037762 > 1024 0.000069924 0.000070244 > 2048 0.000135046 0.000135528 > 4096 0.000264491 0.000264816 > 8192 0.000524342 0.000525631 > 16384 0.001069965 0.001047742 > 32768 0.002180252 0.002142207 > 65536 0.005921251 0.005883868 > 131072 0.012508934 0.012392895 > 262144 0.02503915 0.024896995 > 524288 0.049879091 0.049821832 > 1048576 0.09973658 0.099969603 > > -------------------------------------------------------------------------------- > > Due to the lack of a chip that supports vector extension, I conducted a > performance > comparison test of strlen using C language and vector implementation on > the Spike > simulator, which has certain reference value. It can be clearly seen that > vector > implementation is more efficient than C language implementation, with an > average > performance improvement of over 800%. > > > -------------------------------------------------------------------------------- > length(byte) C language implementation(s) Vector instruction > implementation(s) > > -------------------------------------------------------------------------------- > 2 0.000003639 0.000003339 > 4 0.000004239 0.000003339 > 8 0.000003639 0.000003339 > 16 0.000004339 0.000003339 > 32 0.000005739 0.000003339 > 64 0.000008539 0.000003339 > 128 0.000014139 0.000004039 > 256 0.000025339 0.000004739 > 512 0.000047739 0.000006139 > 1024 0.000092539 0.000008939 > 2048 0.000182139 0.000014539 > 4096 0.000361339 0.000025739 > 8192 0.000719739 0.000048139 > 16384 0.001436539 0.000092939 > 32768 0.002870139 0.000182539 > 65536 0.005737339 0.000361739 > 131072 0.011471739 0.000720139 > 262144 0.022940539 0.001436939 > 524288 0.045878139 0.002870539 > 1048576 0.091753339 0.005737739 > > -------------------------------------------------------------------------------- > > So I hope to pass __riscv_vector, which enables hardware that does not > support vector > extension to execute the basic instruction set implementation of strlen, > has the same > performance as the C language implementation. For support vector extended > hardware, > strlen implemented by vector instruction set is executed to achieve > acceleration effect. > > Fei Zhang > > > -----原始邮件----- > > 发件人: "Szabolcs Nagy" <nsz@port70.net> > > 发送时间: 2023-04-11 20:48:22 (星期二) > > 收件人: "张飞" <zhangfei@nj.iscas.ac.cn> > > 抄送: musl@lists.openwall.com > > 主题: Re: Re: [musl] [PATCH]Implementation of strlen function in > riscv64 architecture > > > > * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]: > > > I have made modifications to the assembly implementation of the > riscv64 strlen function, mainly > > > focusing on address alignment processing to avoid the problem of > data crossing > > > pages during vector instruction memory access. > > > > > > I think the assembly implementation of strlen is necessary. In > glibc, > > > > if the c definition is not correct then you have to explain why. > > if it's very slow then please tell us so. > > > > > X86_64, aarch64, alpha, and others all have assembly > implementations of this function, > > > while for riscv64, it is blank. > > > I have also analyzed the test sets of Spec2006 and Spec2017, and > the strlen function is also a hot topic. > > > > an asm implementation has significant maintenance cost so you should > > provide some benchmark data or other evidence/reasoning for us to > > decide if it's worth the cost. > > > > it seems you replaced the c strlen code with a slower one except when > > musl is built for "#ifdef __riscv_vector" isa extension. what cpus > > does this affect? are linux distros expected to use this as baseline? > > do different riscv cpus have similar simd performance properties? who > > will tweak the asm if not? > > > > in principle what you did can be done by the compiler auto vectorizer > > so maybe contributing to the compiler is more useful. > > > > note that glibc has cpu specific implementations that it can select > > at runtime, but musl uses one generic implementation for all cpus. > </zhangfei@nj.iscas.ac.cn></zhangfei@nj.iscas.ac.cn></nsz@port70.net> [-- Attachment #2: Type: text/html, Size: 11231 bytes --] ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2023-04-19 22:39 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-03-22 6:21 [musl] [PATCH]Implementation of strlen function in riscv64 architecture 张飞 2023-03-22 6:27 ` A. Wilcox 2023-03-22 12:15 ` Rich Felker 2023-04-11 12:57 ` Szabolcs Nagy 2023-04-10 5:59 ` 张飞 2023-04-11 12:48 ` Szabolcs Nagy 2023-04-19 7:22 ` 张飞 2023-04-19 22:39 ` enh
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).