On Wed, Apr 19, 2023 at 12:22 AM 张飞 wrote:

> I did replace the C strlen code with a slower one except when
> musl is built for the "#ifdef __riscv_vector" ISA extension. So I referred
> to the C strlen code and implemented it with the basic instruction
> set, and the performance of both is basically the same.
>
> The reason for implementing two versions is the hope that the strlen
> implemented using the basic instruction set can be applicable to all
> RISC-V CPUs, and that the vector version can accelerate hardware that
> supports the vector extension.
> When the compiler adds vector extensions through --with-arch=rv64gcv,
> __riscv_vector will also be defined by default. Similar macro definitions
> are common in riscv, such as setjmp/riscv64/setjmp.S in musl, which
> includes __riscv_float_abi_soft macro definitions.
>
> At present, the riscv vector extension instruction set is in a frozen
> state, and the instruction set is stable. In other open source libraries,
> such as openssl and openCV, riscv vector optimization is available.

is that actually checked in to openssl? the linux kernel patches to
save/restore vector state still haven't been merged to linux-next afaik,
and there's still no hwcaps support for V either. or are they using
`__riscv_vector` too, and not detecting V at runtime? (the kernel's own use
of V and Zb* seems to be based on an internal-only hwcap mechanism for now.)

> We know that the assembly generated by the compiler is often not the most
> efficient, and the automatic vectorization scenarios are limited, so we
> need to optimize the function by manual vectorization.
> For riscv, compiler automatic vectorization is still in its infancy.

have you tried sifive's autovectorization patches? do they help for this code?

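To make the compile-time versus runtime question above concrete, here is a minimal sketch, assuming a GCC- or Clang-style toolchain that predefines `__riscv_vector` when V is enabled for the build. The function name `sketch_strlen_variant` is hypothetical and not part of musl or the patch; it only reports which branch the preprocessor kept.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a musl symbol): reports which strlen variant
 * this particular build would contain.
 */
const char *sketch_strlen_variant(void)
{
#if defined(__riscv_vector)
	/*
	 * The compiler was told V is available (for example via
	 * -march=rv64gcv), so this binary carries only the vector path.
	 * Nothing here asks the running cpu whether it implements V.
	 */
	return "vector";
#else
	/* No V in the target ISA string: only the scalar path exists. */
	return "scalar";
#endif
}
```

Built without V enabled, the function returns "scalar". Either way the answer is fixed when the library is compiled and never depends on the cpu the binary later runs on, which is the difference from runtime detection.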
> I conducted tests on different data volumes and compared the performance
> of strlen functions implemented in C language, basic instruction set, and
> vector instruction set. The test case is test_strlen.c
>
> Performance comparison between the C language implementation and the
> assembly implementation was tested on a SiFive chip (RISC-V SiFive U74
> Dual Core 64 Bit RV64GC ISA Chip Platform).
>
> The test results are as follows. Due to the consistent algorithm between
> the two, there is basically no difference in performance.
>
> --------------------------------------------------------------------------------
> length(byte)   C language implementation(s)   Basic instruction implementation(s)
> --------------------------------------------------------------------------------
> 2              0.00000528                     0.000005441
> 4              0.00000544                     0.000005437
> 8              0.00000464                     0.00000496
> 16             0.00000544                     0.00000512
> 32             0.0000064                      0.00000592
> 64             0.000007994                    0.000007841
> 128            0.000012                       0.000012
> 256            0.000020321                    0.000020481
> 512            0.000037282                    0.000037762
> 1024           0.000069924                    0.000070244
> 2048           0.000135046                    0.000135528
> 4096           0.000264491                    0.000264816
> 8192           0.000524342                    0.000525631
> 16384          0.001069965                    0.001047742
> 32768          0.002180252                    0.002142207
> 65536          0.005921251                    0.005883868
> 131072         0.012508934                    0.012392895
> 262144         0.02503915                     0.024896995
> 524288         0.049879091                    0.049821832
> 1048576        0.09973658                     0.099969603
> --------------------------------------------------------------------------------
>
> Due to the lack of a chip that supports the vector extension, I conducted
> a performance comparison test of strlen using the C language and vector
> implementations on the Spike simulator, which has certain reference value.
> It can be clearly seen that the vector implementation is more efficient
> than the C language implementation, with an average performance
> improvement of over 800%.
>
> --------------------------------------------------------------------------------
> length(byte)   C language implementation(s)   Vector instruction implementation(s)
> --------------------------------------------------------------------------------
> 2              0.000003639                    0.000003339
> 4              0.000004239                    0.000003339
> 8              0.000003639                    0.000003339
> 16             0.000004339                    0.000003339
> 32             0.000005739                    0.000003339
> 64             0.000008539                    0.000003339
> 128            0.000014139                    0.000004039
> 256            0.000025339                    0.000004739
> 512            0.000047739                    0.000006139
> 1024           0.000092539                    0.000008939
> 2048           0.000182139                    0.000014539
> 4096           0.000361339                    0.000025739
> 8192           0.000719739                    0.000048139
> 16384          0.001436539                    0.000092939
> 32768          0.002870139                    0.000182539
> 65536          0.005737339                    0.000361739
> 131072         0.011471739                    0.000720139
> 262144         0.022940539                    0.001436939
> 524288         0.045878139                    0.002870539
> 1048576        0.091753339                    0.005737739
> --------------------------------------------------------------------------------
>
> So I hope that, by testing __riscv_vector, hardware that does not support
> the vector extension executes the basic instruction set implementation of
> strlen, which has the same performance as the C language implementation.
> On hardware that supports the vector extension, the strlen implemented
> with the vector instruction set is executed to achieve acceleration.
>
> Fei Zhang
>
> > -----Original Message-----
> > From: "Szabolcs Nagy"
> > Sent: 2023-04-11 20:48:22 (Tuesday)
> > To: "张飞"
> > Cc: musl@lists.openwall.com
> > Subject: Re: Re: [musl] [PATCH]Implementation of strlen function in
> > riscv64 architecture
> >
> > * 张飞 [2023-04-10 13:59:22 +0800]:
> > > I have made modifications to the assembly implementation of the
> > > riscv64 strlen function, mainly focusing on address alignment
> > > processing to avoid the problem of data crossing pages during
> > > vector instruction memory access.
> > >
> > > I think the assembly implementation of strlen is necessary. In glibc,
> >
> > if the c definition is not correct then you have to explain why.
> > if it's very slow then please tell us so.
> >
> > > X86_64, aarch64, alpha, and others all have assembly implementations
> > > of this function, while for riscv64, it is blank.
> > > I have also analyzed the test sets of Spec2006 and Spec2017, and
> > > the strlen function is also a hot spot.
> >
> > an asm implementation has significant maintenance cost so you should
> > provide some benchmark data or other evidence/reasoning for us to
> > decide if it's worth the cost.
> >
> > it seems you replaced the c strlen code with a slower one except when
> > musl is built for "#ifdef __riscv_vector" isa extension. what cpus
> > does this affect? are linux distros expected to use this as baseline?
> > do different riscv cpus have similar simd performance properties? who
> > will tweak the asm if not?
> >
> > in principle what you did can be done by the compiler auto vectorizer
> > so maybe contributing to the compiler is more useful.
> >
> > note that glibc has cpu specific implementations that it can select
> > at runtime, but musl uses one generic implementation for all cpus.
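
For readers without the source at hand, the word-at-a-time technique that generic C strlen implementations commonly use (and that a scalar asm version with a "consistent algorithm" would mirror) can be sketched as below. This is an illustration only, not the patch under discussion and not a copy of any library's code. `sketch_strlen`, `sk_word`, and the `SK_*` macros are hypothetical names. The aligned word loads may read a few bytes past the terminator, which assumes a GCC-compatible compiler and page-granular memory protection; ISO C does not promise that this is valid.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Word type allowed to alias char data (GCC/Clang extension, assumed here). */
#ifdef __GNUC__
typedef size_t __attribute__((__may_alias__)) sk_word;
#else
typedef size_t sk_word;
#endif

/* Bit tricks for spotting a zero byte inside one machine word. */
#define SK_ONES  ((size_t)-1 / UCHAR_MAX)         /* 0x0101...01 */
#define SK_HIGHS (SK_ONES * (UCHAR_MAX / 2 + 1))  /* 0x8080...80 */
#define SK_HASZERO(x) (((x) - SK_ONES) & ~(x) & SK_HIGHS)

/* Hypothetical name; not a musl symbol. */
size_t sketch_strlen(const char *s)
{
	const char *a = s;
	const sk_word *w;

	/* Step byte by byte until s is word-aligned, so that the word
	 * loads below never straddle a page boundary. */
	for (; (uintptr_t)s % sizeof(size_t); s++)
		if (!*s) return (size_t)(s - a);

	/* Scan one aligned word per iteration until it contains a zero byte. */
	for (w = (const void *)s; !SK_HASZERO(*w); w++)
		;

	/* Find the exact position of the zero byte within that word. */
	for (s = (const void *)w; *s; s++)
		;
	return (size_t)(s - a);
}
```

The alignment prologue is the scalar counterpart of the page-crossing concern raised for vector loads: once the pointer is aligned to the load width, a load cannot touch a page the string itself does not reach.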