On Wed, Apr 19, 2023 at 12:22 AM 张飞 <zhangfei@nj.iscas.ac.cn> wrote:

> I did replace the C strlen code with a slower one except when
> musl is built for "#ifdef __riscv_vector" isa extension.So I referred
> to the C strlen code and implemented it with the basic instruction
> set, and the performance of both is basically the same.
>
> The reason for implementing two versions is to hope that the memset
> implemented
> using the basic instruction set can be applicable to all RISCV
> architecture CPUs,
> and the vector version can accelerate the hardware supporting vector
> expansion.
> When the compiler adds vector extensions through --with-arch=rv64gcv,
> __riscv_vector will also open by default.Similar macro definitions are
> common in
> riscv, such as setjmp/riscv64/setjmp.S in musl, which includes
> __riscv_float_abi_soft macro definitions.
>
> At present, the riscv vector extension instruction set is in a frozen
> state, and
> the instruction set is stable. In other open source libraries, such as
> openssl
> and openCV, riscv vector optimization is available.


is that actually checked in to openssl? the linux kernel patches to
save/restore vector state still haven't been merged to linux-next afaik,
and there's still no hwcaps support for V either. or are they using
`__riscv_vector` too, and not detecting V at runtime? (the kernel's own use
of V and Zb* seems to be based on an internal-only hwcap mechanism for now.)


> We know that the assembly generated
> by the compiler is often not the most efficient, and the automatic
> vectorization
> scenarios are limited, so we need to optimize the function by manual
> vectorization.
> For riscv, compiler automatic vectorization is still in its infancy.
>

have you tried sifive's autovectorization patches? do they help for this
code?


> I conducted tests on different data volumes and compared the performance
> of memset
> functions implemented in C language, basic instruction set, and vector
> instruction
> set.The test case is test_strlen.c
>
> Performance comparison between C language implementation and assembly
> implementation was
> tested on Sifive chips(RISC-V SiFive U74 Dual Core 64 Bit RV64GC ISA Chip
> Platform).
>
> The test results are as follows.Due to the consistent algorithm between
> the two, there
> is basically no difference in performance.
>
>
> --------------------------------------------------------------------------------
> length(byte)  C language implementation(s)   Basic instruction
> implementation(s)
>
> --------------------------------------------------------------------------------
> 2                    0.00000528                      0.000005441
> 4                    0.00000544                      0.000005437
> 8                    0.00000464                      0.00000496
> 16                   0.00000544                      0.00000512
> 32                   0.0000064                       0.00000592
> 64                   0.000007994                     0.000007841
> 128                  0.000012                        0.000012
> 256                  0.000020321                     0.000020481
> 512                  0.000037282                     0.000037762
> 1024                 0.000069924                     0.000070244
> 2048                 0.000135046                     0.000135528
> 4096                 0.000264491                     0.000264816
> 8192                 0.000524342                     0.000525631
> 16384                0.001069965                     0.001047742
> 32768                0.002180252                     0.002142207
> 65536                0.005921251                     0.005883868
> 131072               0.012508934                     0.012392895
> 262144               0.02503915                      0.024896995
> 524288               0.049879091                     0.049821832
> 1048576              0.09973658                      0.099969603
>
> --------------------------------------------------------------------------------
>
> Due to the lack of a chip that supports vector extension, I conducted a
> performance
> comparison test of strlen using C language and vector implementation on
> the Spike
> simulator, which has certain reference value. It can be clearly seen that
> vector
> implementation is more efficient than C language implementation, with an
> average
> performance improvement of over 800%.
>
>
> --------------------------------------------------------------------------------
> length(byte)  C language implementation(s)   Vector instruction
> implementation(s)
>
> --------------------------------------------------------------------------------
> 2                    0.000003639                     0.000003339
> 4                    0.000004239                     0.000003339
> 8                    0.000003639                     0.000003339
> 16                   0.000004339                     0.000003339
> 32                   0.000005739                     0.000003339
> 64                   0.000008539                     0.000003339
> 128                  0.000014139                     0.000004039
> 256                  0.000025339                     0.000004739
> 512                  0.000047739                     0.000006139
> 1024                 0.000092539                     0.000008939
> 2048                 0.000182139                     0.000014539
> 4096                 0.000361339                     0.000025739
> 8192                 0.000719739                     0.000048139
> 16384                0.001436539                     0.000092939
> 32768                0.002870139                     0.000182539
> 65536                0.005737339                     0.000361739
> 131072               0.011471739                     0.000720139
> 262144               0.022940539                     0.001436939
> 524288               0.045878139                     0.002870539
> 1048576              0.091753339                     0.005737739
>
> --------------------------------------------------------------------------------
>
> So I hope to pass __riscv_vector, which enables hardware that does not
> support vector
> extension to execute the basic instruction set implementation of strlen,
> has the same
> performance as the C language implementation. For support vector extended
> hardware,
> strlen implemented by vector instruction set is executed to achieve
> acceleration effect.
>
> Fei Zhang
>
> &gt; -----原始邮件-----
> &gt; 发件人: "Szabolcs Nagy" <nsz@port70.net>
> &gt; 发送时间: 2023-04-11 20:48:22 (星期二)
> &gt; 收件人: "张飞" <zhangfei@nj.iscas.ac.cn>
> &gt; 抄送: musl@lists.openwall.com
> &gt; 主题: Re: Re: [musl] [PATCH]Implementation of strlen function in
> riscv64 architecture
> &gt;
> &gt; * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]:
> &gt; &gt; I have made modifications to the assembly implementation of the
> riscv64 strlen function, mainly
> &gt; &gt; focusing on address alignment processing to avoid the problem of
> data crossing
> &gt; &gt; pages during vector instruction memory access.
> &gt; &gt;
> &gt; &gt; I think the assembly implementation of strlen is necessary. In
> glibc,
> &gt;
> &gt; if the c definition is not correct then you have to explain why.
> &gt; if it's very slow then please tell us so.
> &gt;
> &gt; &gt; X86_64, aarch64, alpha, and others all have assembly
> implementations of this function,
> &gt; &gt; while for riscv64, it is blank.
> &gt; &gt; I have also analyzed the test sets of Spec2006 and Spec2017, and
> the strlen function is also a hot topic.
> &gt;
> &gt; an asm implementation has significant maintenance cost so you should
> &gt; provide some benchmark data or other evidence/reasoning for us to
> &gt; decide if it's worth the cost.
> &gt;
> &gt; it seems you replaced the c strlen code with a slower one except when
> &gt; musl is built for "#ifdef __riscv_vector" isa extension. what cpus
> &gt; does this affect? are linux distros expected to use this as baseline?
> &gt; do different riscv cpus have similar simd performance properties? who
> &gt; will tweak the asm if not?
> &gt;
> &gt; in principle what you did can be done by the compiler auto vectorizer
> &gt; so maybe contributing to the compiler is more useful.
> &gt;
> &gt; note that glibc has cpu specific implementations that it can select
> &gt; at runtime, but musl uses one generic implementation for all cpus.
> </zhangfei@nj.iscas.ac.cn></zhangfei@nj.iscas.ac.cn></nsz@port70.net>