From mboxrd@z Thu Jan  1 00:00:00 1970
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
Reply-To: musl@lists.openwall.com
MIME-Version: 1.0
References: <3e056826.7b26.18707fad3e2.Coremail.zhangfei@nj.iscas.ac.cn>
 <27cfcd14.778c.18769bf603d.Coremail.zhangfei@nj.iscas.ac.cn>
 <20230411124822.GK3630668@port70.net>
 <484084b3.20c61.18798649e40.Coremail.zhangfei@nj.iscas.ac.cn>
In-Reply-To: <484084b3.20c61.18798649e40.Coremail.zhangfei@nj.iscas.ac.cn>
From: enh
Date: Wed, 19 Apr 2023 15:39:06 -0700
To: musl@lists.openwall.com
Content-Type: multipart/alternative; boundary="0000000000001ea9a205f9b81866"
Subject: Re: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture

--0000000000001ea9a205f9b81866
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

On Wed, Apr 19, 2023 at 12:22 AM 张飞 <zhangfei@nj.iscas.ac.cn> wrote:

> I did replace the C strlen code with a slower one except when
> musl is built for "#ifdef __riscv_vector" isa extension. So I referred
> to the C strlen code and implemented it with the basic instruction
> set, and the performance of both is basically the same.
>
> The reason for implementing two versions is the hope that the strlen
> implemented using the basic instruction set can be applicable to all
> RISC-V architecture CPUs, and the vector version can accelerate the
> hardware supporting the vector extension.
> When the compiler adds vector extensions through --with-arch=rv64gcv,
> __riscv_vector will also be defined by default. Similar macro checks
> are common in riscv, such as setjmp/riscv64/setjmp.S in musl, which
> uses the __riscv_float_abi_soft macro.
>
> At present, the riscv vector extension instruction set is in a frozen
> state, and the instruction set is stable. In other open source
> libraries, such as openssl and openCV, riscv vector optimization is
> available.

is that actually checked in to openssl? the linux kernel patches to
save/restore vector state still haven't been merged to linux-next afaik,
and there's still no hwcaps support for V either. or are they using
`__riscv_vector` too, and not detecting V at runtime? (the kernel's own
use of V and Zb* seems to be based on an internal-only hwcap mechanism
for now.)

> We know that the assembly generated by the compiler is often not the
> most efficient, and the automatic vectorization scenarios are limited,
> so we need to optimize the function by manual vectorization. For
> riscv, compiler automatic vectorization is still in its infancy.

have you tried sifive's autovectorization patches? do they help for this
code?

> I conducted tests on different data volumes and compared the
> performance of strlen functions implemented in C language, basic
> instruction set, and vector instruction set. The test case is
> test_strlen.c
>
> Performance comparison between C language implementation and assembly
> implementation was tested on SiFive chips (RISC-V SiFive U74 Dual Core
> 64 Bit RV64GC ISA Chip Platform).
>
> The test results are as follows. Due to the consistent algorithm
> between the two, there is basically no difference in performance.
>
> ----------------------------------------------------------------------------------
> length(byte)   C language implementation(s)    Basic instruction implementation(s)
> ----------------------------------------------------------------------------------
> 2              0.00000528                      0.000005441
> 4              0.00000544                      0.000005437
> 8              0.00000464                      0.00000496
> 16             0.00000544                      0.00000512
> 32             0.0000064                       0.00000592
> 64             0.000007994                     0.000007841
> 128            0.000012                        0.000012
> 256            0.000020321                     0.000020481
> 512            0.000037282                     0.000037762
> 1024           0.000069924                     0.000070244
> 2048           0.000135046                     0.000135528
> 4096           0.000264491                     0.000264816
> 8192           0.000524342                     0.000525631
> 16384          0.001069965                     0.001047742
> 32768          0.002180252                     0.002142207
> 65536          0.005921251                     0.005883868
> 131072         0.012508934                     0.012392895
> 262144         0.02503915                      0.024896995
> 524288         0.049879091                     0.049821832
> 1048576        0.09973658                      0.099969603
> ----------------------------------------------------------------------------------
>
> Due to the lack of a chip that supports vector extension, I conducted a
> performance comparison test of strlen using C language and vector
> implementation on the Spike simulator, which has certain reference
> value. It can be clearly seen that vector implementation is more
> efficient than C language implementation, with an average performance
> improvement of over 800%.
>
> ----------------------------------------------------------------------------------
> length(byte)   C language implementation(s)    Vector instruction implementation(s)
> ----------------------------------------------------------------------------------
> 2              0.000003639                     0.000003339
> 4              0.000004239                     0.000003339
> 8              0.000003639                     0.000003339
> 16             0.000004339                     0.000003339
> 32             0.000005739                     0.000003339
> 64             0.000008539                     0.000003339
> 128            0.000014139                     0.000004039
> 256            0.000025339                     0.000004739
> 512            0.000047739                     0.000006139
> 1024           0.000092539                     0.000008939
> 2048           0.000182139                     0.000014539
> 4096           0.000361339                     0.000025739
> 8192           0.000719739                     0.000048139
> 16384          0.001436539                     0.000092939
> 32768          0.002870139                     0.000182539
> 65536          0.005737339                     0.000361739
> 131072         0.011471739                     0.000720139
> 262144         0.022940539                     0.001436939
> 524288         0.045878139                     0.002870539
> 1048576        0.091753339                     0.005737739
> ----------------------------------------------------------------------------------
>
> So I hope to select on __riscv_vector. Hardware that does not support
> the vector extension executes the basic instruction set implementation
> of strlen, which has the same performance as the C language
> implementation. On hardware that supports the vector extension, the
> strlen implemented with the vector instruction set is executed to
> achieve an acceleration effect.
>
> Fei Zhang
>
> > -----Original Message-----
> > From: "Szabolcs Nagy" <nsz@port70.net>
> > Sent: 2023-04-11 20:48:22 (Tuesday)
> > To: "张飞" <zhangfei@nj.iscas.ac.cn>
> > Cc: musl@lists.openwall.com
> > Subject: Re: Re: [musl] [PATCH]Implementation of strlen function in
> > riscv64 architecture
> >
> > * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]:
> > > I have made modifications to the assembly implementation of the
> > > riscv64 strlen function, mainly focusing on address alignment
> > > processing to avoid the problem of data crossing pages during
> > > vector instruction memory access.
> > >
> > > I think the assembly implementation of strlen is necessary. In
> > > glibc,
> >
> > if the c definition is not correct then you have to explain why.
> > if it's very slow then please tell us so.
> >
> > > X86_64, aarch64, alpha, and others all have assembly
> > > implementations of this function, while for riscv64, it is blank.
> > > I have also analyzed the test sets of Spec2006 and Spec2017, and
> > > the strlen function is also a hot spot.
> >
> > an asm implementation has significant maintenance cost so you should
> > provide some benchmark data or other evidence/reasoning for us to
> > decide if it's worth the cost.
> >
> > it seems you replaced the c strlen code with a slower one except when
> > musl is built for "#ifdef __riscv_vector" isa extension. what cpus
> > does this affect? are linux distros expected to use this as baseline?
> > do different riscv cpus have similar simd performance properties? who
> > will tweak the asm if not?
> >
> > in principle what you did can be done by the compiler auto vectorizer
> > so maybe contributing to the compiler is more useful.
> >
> > note that glibc has cpu specific implementations that it can select
> > at runtime, but musl uses one generic implementation for all cpus.
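
For reference, the build-time selection discussed in this thread can be
sketched roughly as below. This is only an illustration under stated
assumptions, not code from the patch: sketch_strlen, sketch_strlen_variant
and strlen_bytewise are hypothetical names, and the vector branch is a
placeholder that falls back to the scalar loop so the sketch stays
runnable. __riscv_vector is the macro the compiler predefines when the V
extension is part of the target (for example with -march=rv64gcv).

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical and exist only for this sketch. */

/* Portable scalar fallback: scan one byte at a time until the NUL. */
static size_t strlen_bytewise(const char *s)
{
	const char *p = s;
	while (*p)
		p++;
	return (size_t)(p - s);
}

/*
 * __riscv_vector is predefined by the compiler only when the target
 * includes the V extension, so the choice is fixed when the library
 * is built. Nothing here asks the running cpu whether V is usable.
 */
#if defined(__riscv_vector)
#define SKETCH_STRLEN_VARIANT "vector"
#else
#define SKETCH_STRLEN_VARIANT "scalar"
#endif

/* Reports which variant this build selected. */
const char *sketch_strlen_variant(void)
{
	return SKETCH_STRLEN_VARIANT;
}

size_t sketch_strlen(const char *s)
{
	assert(s != NULL);
#if defined(__riscv_vector)
	/* Placeholder: a real build would branch to the vector asm here.
	 * This sketch reuses the scalar loop so it stays runnable. */
	return strlen_bytewise(s);
#else
	return strlen_bytewise(s);
#endif
}
```

A library built this way for rv64gcv would take the vector path
unconditionally, which is why the questions above about distro baselines
and runtime detection of V matter.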
--0000000000001ea9a205f9b81866--