From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIM_ADSP_CUSTOM_MED,
	DKIM_INVALID,DKIM_SIGNED,HTML_MESSAGE,MAILING_LIST_MULTI,
	RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no version=3.4.4
Received: (qmail 6812 invoked from network); 19 Apr 2023 22:39:35 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 19 Apr 2023 22:39:35 -0000
Received: (qmail 30441 invoked by uid 550); 19 Apr 2023 22:39:30 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 30405 invoked from network); 19 Apr 2023 22:39:29 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20221208; t=1681943957; x=1684535957;
        h=to:subject:message-id:date:from:in-reply-to:references:mime-version
         :from:to:cc:subject:date:message-id:reply-to;
        bh=0JkYbGrJVFJJcTVcKNC3QT/ppWyoUc3HcP3m18wTPJw=;
        b=RMiKou9oqe+sC7Cd/9fzjyJbwh4IAtQJwpDTkY4bOtaIigRv6VAMIIE68ZR41lmDri
         MG4PJz9/x1IHmo4kIRy3rOknDqyYhCRbYsvHVvtilpqbttJEP6pnkWWe52jWzBmFy8ME
         zLFHEfO7HuZpjEr4HMaONORBJNyqQZO5Eq91PEdo3lT9qSn2ubfCKK5hTqIXvJSUPUNK
         lIw4yf5CkxioQjC+tXrH+nmodzgMoCx8bA4iSUlJ9iC7nrZgeU2JjVlBfwXRf2PvOMiU
         NS0l7+u/aS0gpy5oSbXxb/ibhTbcQ6y+9r0oBx3VcDQiPpfxQkcQ/TvXxMbsuXLAEi8y
         syIA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1681943957; x=1684535957;
        h=to:subject:message-id:date:from:in-reply-to:references:mime-version
         :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=0JkYbGrJVFJJcTVcKNC3QT/ppWyoUc3HcP3m18wTPJw=;
        b=RWxXoIi1b+eKlClABpIM+DssyCexH6xa/Q7r047A1n18+1OzUnLXoF01t36AV2Z1ag
         khy2OaEWRdE0+Zi9DWdjWfmmYB6ctri+gv3sSykys8if/0r+h3JrYdxa2oXwx1maSxeU
         zf2SSfVmwemJfxTyRrANxsPOLfn/9U9kM/saQSA7NfNvZnSh6lQdc2PhKeH2gs6X+J2S
         Hev4tVNqVtkltiezBuVo/x+FgEFBBFtKp1sOzY9s0xlDylOdTiQxoIsWPm6KBo3gG4jJ
         Z+EudKDOfpK50lOSOiOhVpxCec7CCJjWITNq2ybVkJz8NXblE6J9RJn0SdiPOwTyAcZn
         Wc6A==
X-Gm-Message-State: AAQBX9cPgsZPkMU8RGzGL3wK9NhZNGtRZOds0HZZQkIRuRgMRUBNv8VA
	UtoEG6cyWNX+ko+XCcD8CAzj0Zjrqg/PQX02KYN6oOPmFsKmFPyFZv78rQ==
X-Google-Smtp-Source: AKy350YVknmJBGGCRPNPuJ6Tb0dLe4etStPpInfOlieobBMw2Fhdtueu8nY8FjW1tvHb+RoJbAHdBcCzSxEBp4QOne4=
X-Received: by 2002:a05:6214:509e:b0:5ed:9b61:88d1 with SMTP id
 kk30-20020a056214509e00b005ed9b6188d1mr256160qvb.12.1681943957484; Wed, 19
 Apr 2023 15:39:17 -0700 (PDT)
MIME-Version: 1.0
References: <3e056826.7b26.18707fad3e2.Coremail.zhangfei@nj.iscas.ac.cn>
 <D5D86A43-EF57-4AAE-98A5-81740EA3C6F3@adelielinux.org> <27cfcd14.778c.18769bf603d.Coremail.zhangfei@nj.iscas.ac.cn>
 <20230411124822.GK3630668@port70.net> <484084b3.20c61.18798649e40.Coremail.zhangfei@nj.iscas.ac.cn>
In-Reply-To: <484084b3.20c61.18798649e40.Coremail.zhangfei@nj.iscas.ac.cn>
From: enh <enh@google.com>
Date: Wed, 19 Apr 2023 15:39:06 -0700
Message-ID: <CAJgzZooxj+5AKPg1O=kFW6pNQmY5sHuosG3LonOEyOd9aQBK7A@mail.gmail.com>
To: musl@lists.openwall.com
Content-Type: multipart/alternative; boundary="0000000000001ea9a205f9b81866"
Subject: Re: Re: Re: [musl] [PATCH]Implementation of strlen function in
 riscv64 architecture

--0000000000001ea9a205f9b81866
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, Apr 19, 2023 at 12:22=E2=80=AFAM =E5=BC=A0=E9=A3=9E <zhangfei@nj.is=
cas.ac.cn> wrote:

> I did replace the C strlen code with a slower one except when musl is
> built with the vector ISA extension ("#ifdef __riscv_vector"). So I
> referred to the C strlen code and reimplemented it with the base
> instruction set; the performance of the two is basically the same.
>
> The reason for implementing two versions is that the strlen implemented
> with the base instruction set is applicable to all RISC-V CPUs, while
> the vector version can accelerate hardware that supports the vector
> extension. When the compiler is configured with --with-arch=3Drv64gcv,
> __riscv_vector is defined by default. Similar macros are common on
> riscv: for example, setjmp/riscv64/setjmp.S in musl uses the
> __riscv_float_abi_soft macro.
>
> At present, the riscv vector extension instruction set is frozen and
> stable. Other open source libraries, such as openssl and openCV,
> already have riscv vector optimizations.


is that actually checked in to openssl? the linux kernel patches to
save/restore vector state still haven't been merged to linux-next afaik,
and there's still no hwcaps support for V either. or are they using
`__riscv_vector` too, and not detecting V at runtime? (the kernel's own use
of V and Zb* seems to be based on an internal-only hwcap mechanism for now.=
)
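
to make the build-time vs run-time distinction concrete, here's a rough
sketch in c. it is not musl or kernel code: the function names and
HWCAP_ISA_V_ASSUMED are made up for illustration, and the bit position
just follows the riscv linux convention of reporting single-letter
extensions in AT_HWCAP as bit (letter - 'A'). whether a given kernel
actually sets a V bit is exactly the open question above.

```c
/* Hypothetical name. Assumes the riscv linux convention that a
 * single-letter ISA extension is bit (letter - 'A') of AT_HWCAP;
 * that the kernel reports V this way is an assumption. */
#define HWCAP_ISA_V_ASSUMED (1UL << ('V' - 'A'))

/* Build-time choice: fixed when libc itself is compiled. */
static const char *strlen_impl_buildtime(void)
{
#ifdef __riscv_vector
	return "vector";
#else
	return "scalar";
#endif
}

/* Run-time choice: decided per machine from a hwcap word, such as
 * the value getauxval(AT_HWCAP) returns on linux. */
static const char *strlen_impl_runtime(unsigned long hwcap)
{
	return (hwcap & HWCAP_ISA_V_ASSUMED) ? "vector" : "scalar";
}
```

with `__riscv_vector` the choice is baked into libc when it's built;
with a hwcap bit, one binary could pick per machine, e.g. by passing in
getauxval(AT_HWCAP).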


> We know that the assembly generated by the compiler is often not the
> most efficient, and the cases that automatic vectorization handles are
> limited, so we need to optimize the function by manual vectorization.
> For riscv, compiler automatic vectorization is still in its infancy.
>

have you tried sifive's autovectorization patches? do they help for this
code?


> I ran tests over a range of data sizes and compared the performance of
> strlen implemented in C, in base-ISA assembly, and in vector assembly.
> The test case is test_strlen.c
>
> The C implementation and the base-ISA assembly implementation were
> compared on a SiFive chip (RISC-V SiFive U74 Dual Core 64 Bit RV64GC
> ISA Chip Platform).
>
> The test results are as follows. Because the two use the same
> algorithm, there is basically no difference in performance.
>
>
> -------------------------------------------------------------------------=
-------
> length (bytes)       C implementation (s)            Base ISA asm (s)
> -------------------------------------------------------------------------=
-------
> 2                    0.00000528                      0.000005441
> 4                    0.00000544                      0.000005437
> 8                    0.00000464                      0.00000496
> 16                   0.00000544                      0.00000512
> 32                   0.0000064                       0.00000592
> 64                   0.000007994                     0.000007841
> 128                  0.000012                        0.000012
> 256                  0.000020321                     0.000020481
> 512                  0.000037282                     0.000037762
> 1024                 0.000069924                     0.000070244
> 2048                 0.000135046                     0.000135528
> 4096                 0.000264491                     0.000264816
> 8192                 0.000524342                     0.000525631
> 16384                0.001069965                     0.001047742
> 32768                0.002180252                     0.002142207
> 65536                0.005921251                     0.005883868
> 131072               0.012508934                     0.012392895
> 262144               0.02503915                      0.024896995
> 524288               0.049879091                     0.049821832
> 1048576              0.09973658                      0.099969603
>
> -------------------------------------------------------------------------=
-------
>
> Because I lack a chip that supports the vector extension, I compared
> the C and vector implementations of strlen on the Spike simulator,
> which has some reference value. The vector implementation is clearly
> more efficient than the C implementation, with an average performance
> improvement of over 800%.
>
>
> -------------------------------------------------------------------------=
-------
> length (bytes)       C implementation (s)            Vector asm (s)
> -------------------------------------------------------------------------=
-------
> 2                    0.000003639                     0.000003339
> 4                    0.000004239                     0.000003339
> 8                    0.000003639                     0.000003339
> 16                   0.000004339                     0.000003339
> 32                   0.000005739                     0.000003339
> 64                   0.000008539                     0.000003339
> 128                  0.000014139                     0.000004039
> 256                  0.000025339                     0.000004739
> 512                  0.000047739                     0.000006139
> 1024                 0.000092539                     0.000008939
> 2048                 0.000182139                     0.000014539
> 4096                 0.000361339                     0.000025739
> 8192                 0.000719739                     0.000048139
> 16384                0.001436539                     0.000092939
> 32768                0.002870139                     0.000182539
> 65536                0.005737339                     0.000361739
> 131072               0.011471739                     0.000720139
> 262144               0.022940539                     0.001436939
> 524288               0.045878139                     0.002870539
> 1048576              0.091753339                     0.005737739
>
> -------------------------------------------------------------------------=
-------
>
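
for anyone wanting to sanity-check the "over 800%" figure against the
Spike table above, this is the arithmetic i'd assume is behind it (a
sketch; the helper names are mine, not from the patch). using the
table's own numbers, the ratio is about 1.09x at 2 bytes and just under
16x at 1048576 bytes, and the plain mean of the 20 per-row ratios comes
out around 9.4x, which would be roughly an 840% improvement.

```c
/* How many times faster the vector version is for one table row;
 * both arguments are times in seconds. */
static double speedup(double c_s, double vec_s)
{
	return c_s / vec_s;
}

/* The same figure as a "percent improvement": 0 means no change,
 * 100 means twice as fast. */
static double improvement_percent(double c_s, double vec_s)
{
	return (speedup(c_s, vec_s) - 1.0) * 100.0;
}
```

note these are Spike timings, not hardware measurements.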
> So I hope to key on __riscv_vector: hardware without the vector
> extension runs the base-ISA implementation of strlen, which performs
> the same as the C implementation, while hardware with the vector
> extension runs the vector implementation of strlen for the speedup.
>
> Fei Zhang
>
> > -----Original Message-----
> > From: "Szabolcs Nagy" <nsz@port70.net>
> > Sent: 2023-04-11 20:48:22 (Tuesday)
> > To: "=E5=BC=A0=E9=A3=9E" <zhangfei@nj.iscas.ac.cn>
> > Cc: musl@lists.openwall.com
> > Subject: Re: Re: [musl] [PATCH]Implementation of strlen function in
> > riscv64 architecture
> >
> > * =E5=BC=A0=E9=A3=9E <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 =
+0800]:
> > > I have made modifications to the assembly implementation of the
> > > riscv64 strlen function, mainly focusing on address alignment
> > > processing to avoid the problem of data crossing pages during
> > > vector instruction memory access.
> > >
> > > I think the assembly implementation of strlen is necessary. In
> > > glibc,
> >
> > if the c definition is not correct then you have to explain why.
> > if it's very slow then please tell us so.
> >
> > > X86_64, aarch64, alpha, and others all have assembly
> > > implementations of this function, while for riscv64, it is blank.
> > > I have also analyzed the test sets of Spec2006 and Spec2017, and
> > > the strlen function is also a hot spot.
> >
> > an asm implementation has significant maintenance cost so you should
> > provide some benchmark data or other evidence/reasoning for us to
> > decide if it's worth the cost.
> >
> > it seems you replaced the c strlen code with a slower one except when
> > musl is built for "#ifdef __riscv_vector" isa extension. what cpus
> > does this affect? are linux distros expected to use this as baseline?
> > do different riscv cpus have similar simd performance properties? who
> > will tweak the asm if not?
> >
> > in principle what you did can be done by the compiler auto vectorizer
> > so maybe contributing to the compiler is more useful.
> >
> > note that glibc has cpu specific implementations that it can select
> > at runtime, but musl uses one generic implementation for all cpus.

--0000000000001ea9a205f9b81866--