[musl] [PATCH]Implementation of strlen function in riscv64 architecture

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [musl] [PATCH]Implementation of strlen function in riscv64 architecture
@ 2023-03-22  6:21 张飞
  2023-03-22  6:27 ` A. Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: 张飞 @ 2023-03-22  6:21 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 481 bytes --]







Hi：
Iimplementedvectorizationofthestrlenfunctionintheriscv64architecture, whichiscontrolledby__riscv_vectordefinition.Duetolackofsupportforrisc-vVexpansioninhardware, Iconductedperformancetestsonasimulator, whichwasmorethan10timestheperformanceachievedinClanguage. Intermsoffunctionality, Itestedthestringlengthfrom1byteto64Mb, andthealignmentofdifferentaddressesatthebeginningofthestring.




Please review it.I'm Looking forward to your reply,thanks.




Fei Zhang







[-- Attachment #1.2: Type: text/html, Size: 7076 bytes --]

[-- Attachment #2: strlen.S --]
[-- Type: application/octet-stream, Size: 754 bytes --]

    .text
    .balign 4
    .global strlen
# size_t strlen(const char *str)
# a0 holds *str

strlen:
    mv t1, a0             # Save start

#ifdef __riscv_vector
loop:
    vsetvli t0, x0, e8, m8, ta, ma  # Vector of bytes of maximum length
    vle8ff.v v8, (t1)      # Load bytes
    csrr t0, vl           # Get bytes read
    vmseq.vi v0, v8, 0    # Set v0[i] where v8[i] = 0
    vfirst.m t2, v0       # Find first set bit
    add t1, t1, t0        # Bump pointer
    bltz t2, loop         # Not found?

    add a0, a0, t0        # Sum start + bump
    add t1, t1, t2        # Add index
    sub a0, t1, a0        # Subtract start address+bump
#else
1:
    lbu t0, 0(t1)
    beqz t0, 2f
    addi t1, t1, 1
    j 1b
2:
    sub a0, t1, a0
#endif
    ret


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
  2023-03-22  6:21 [musl] [PATCH]Implementation of strlen function in riscv64 architecture 张飞
@ 2023-03-22  6:27 ` A. Wilcox
  2023-03-22 12:15   ` Rich Felker
  2023-04-10  5:59   ` 张飞
  0 siblings, 2 replies; 8+ messages in thread
From: A. Wilcox @ 2023-03-22  6:27 UTC (permalink / raw)
  To: musl

The content of the message was sent as an image.

For those who cannot view images, I've reproduced the text below:

On Mar 22, 2023, at 1:21 AM, 张飞 <zhangfei@nj.iscas.ac.cn> wrote:
> 
> Hi:
> 
> I implemented vectorization of the strlen function in the riscv64
> architecture, which is controlled by __riscv_vector definition. Due
> to lack of support for risc-v V expansion in hardware, I conducted
> performance tests on a simulator, which was more than 10 times the
> performance achieved in C language. In terms of functionality, I
> tested the string length from 1 byte to 64 Mb, and the alignment of
> different addresses at the beginning of the string.
> 
> 
> Please review it.I'm Looking forward to your reply,thanks.
> 
> 
> 
> Fei Zhang
> <strlen.S>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
  2023-03-22  6:27 ` A. Wilcox
@ 2023-03-22 12:15   ` Rich Felker
  2023-04-11 12:57     ` Szabolcs Nagy
  2023-04-10  5:59   ` 张飞
  1 sibling, 1 reply; 8+ messages in thread
From: Rich Felker @ 2023-03-22 12:15 UTC (permalink / raw)
  To: 张飞, A. Wilcox; +Cc: musl

On Wed, Mar 22, 2023 at 01:27:33AM -0500, A. Wilcox wrote:
> The content of the message was sent as an image.
> 
> For those who cannot view images, I've reproduced the text below:
> 
> On Mar 22, 2023, at 1:21 AM, 张飞 <zhangfei@nj.iscas.ac.cn> wrote:
> > 
> > Hi:
> > 
> > I implemented vectorization of the strlen function in the riscv64
> > architecture, which is controlled by __riscv_vector definition. Due
> > to lack of support for risc-v V expansion in hardware, I conducted
> > performance tests on a simulator, which was more than 10 times the
> > performance achieved in C language. In terms of functionality, I
> > tested the string length from 1 byte to 64 Mb, and the alignment of
> > different addresses at the beginning of the string.
> > 
> > 
> > Please review it.I'm Looking forward to your reply,thanks.

The riscv64 target does not assume presence of vector extensions, and
as it's generally not a bottleneck, strlen isn't one of the functions
for which we generally have existing per-arch asm.

If we were going to introduce this kind of thing for strlen, the
preferable approach would probably be something like what I've
suggested we change memcpy/memset to: having the arch definition
provide only the minimal inline fragment needed to do the actual work
(something like: loading a vector, optionally xor'ing it with a mask
for the byte to search for, and reporting if it's found or the offset
at which it's found) with the actual control logic all in C.

Regarding the code submitted for review, I'm pretty sure it's buggy
because it doesn't seem to do anything with alignment. If you pass it
a pointer to the last byte of a page whose contents are zero, it will
attempt to load the rest of the vector from the next page, and fault.
Since strlen has no a priori way to know how long the object it's
inspecting is, I don't believe there's any way to do a vectorized
approach without pre-alignment to the size of the read you will be
performing, processing everything up to the aligned start separately.
Having to check for this kind of bug on a per-arch basis is one of the
motivations for not wanting whole functions written in asm, but
instead just minimal fragments, with this sort of common logic in C
where you know, once it's been reviewed once, it's correct for all
archs.

Rich

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
  2023-03-22 12:15   ` Rich Felker
@ 2023-04-11 12:57     ` Szabolcs Nagy
  0 siblings, 0 replies; 8+ messages in thread
From: Szabolcs Nagy @ 2023-04-11 12:57 UTC (permalink / raw)
  To: Rich Felker; +Cc: 张飞, A. Wilcox, musl

* Rich Felker <dalias@libc.org> [2023-03-22 08:15:30 -0400]:
> Regarding the code submitted for review, I'm pretty sure it's buggy
> because it doesn't seem to do anything with alignment. If you pass it
> a pointer to the last byte of a page whose contents are zero, it will
> attempt to load the rest of the vector from the next page, and fault.

the aarch64 sve isa extension has 'first faulting register' mask
and there are load/store instructions that set it instead of
actually faulting when a vector goes off at the end of a page.

i suspect riscv copied this piece of architecture (as well as the
variable vector length).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
  2023-03-22  6:27 ` A. Wilcox
  2023-03-22 12:15   ` Rich Felker
@ 2023-04-10  5:59   ` 张飞
  2023-04-11 12:48     ` Szabolcs Nagy
  1 sibling, 1 reply; 8+ messages in thread
From: 张飞 @ 2023-04-10  5:59 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1879 bytes --]

I have made modifications to the assembly implementation of the riscv64 strlen function, mainly 
focusing on address alignment processing to avoid the problem of data crossing 
pages during vector instruction memory access.

I think the assembly implementation of strlen is necessary. In glibc, 
X86_64, aarch64, alpha, and others all have assembly implementations of this function, 
while for riscv64, it is blank.
I have also analyzed the test sets of Spec2006 and Spec2017, and the strlen function is also a hot topic.

Please review the patch again and look forward to your reply.

Fei Zhang

&gt; -----原始邮件-----
&gt; 发件人: "A. Wilcox" <awilfox@adelielinux.org>
&gt; 发送时间: 2023-03-22 14:27:33 (星期三)
&gt; 收件人: musl@lists.openwall.com
&gt; 抄送: 
&gt; 主题: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
&gt; 
&gt; The content of the message was sent as an image.
&gt; 
&gt; For those who cannot view images, I've reproduced the text below:
&gt; 
&gt; On Mar 22, 2023, at 1:21 AM, 张飞 <zhangfei@nj.iscas.ac.cn> wrote:
&gt; &gt; 
&gt; &gt; Hi:
&gt; &gt; 
&gt; &gt; I implemented vectorization of the strlen function in the riscv64
&gt; &gt; architecture, which is controlled by __riscv_vector definition. Due
&gt; &gt; to lack of support for risc-v V expansion in hardware, I conducted
&gt; &gt; performance tests on a simulator, which was more than 10 times the
&gt; &gt; performance achieved in C language. In terms of functionality, I
&gt; &gt; tested the string length from 1 byte to 64 Mb, and the alignment of
&gt; &gt; different addresses at the beginning of the string.
&gt; &gt; 
&gt; &gt; 
&gt; &gt; Please review it.I'm Looking forward to your reply,thanks.
&gt; &gt; 
&gt; &gt; 
&gt; &gt; 
&gt; &gt; Fei Zhang
&gt; &gt; <strlen.s>


</strlen.s></zhangfei@nj.iscas.ac.cn></awilfox@adelielinux.org>

[-- Attachment #2: strlen_riscv64.patch --]
[-- Type: application/octet-stream, Size: 1323 bytes --]

diff -uprN src/string/riscv64/strlen.S src/string/riscv64/strlen.S
--- src/string/riscv64/strlen.S	1970-01-01 08:00:00.000000000 +0800
+++ src/string/riscv64/strlen.S	2023-04-10 11:28:45.301698194 +0800
@@ -0,0 +1,46 @@
+# size_t strlen(const char *str)
+# a0 holds *str
+.global strlen
+.type strlen,@function
+strlen:
+    mv t0, a0             # Save start
+#ifdef __riscv_vector
+    csrr t1, vlenb 
+    addi t1, t1, -1
+    add  a3, t0, t1
+    not  t1, t1
+    and  a3, a3, t1
+    sub  a4, a3, t0
+    beq  a3, t0, loop /* if already aligned*/
+
+unaligned:
+    lbu     t1, 0(t0)
+    beqz    t1, found
+    addi    t0, t0, 1
+    blt     t0, a3, unaligned
+
+loop:
+    vsetvli a1, x0, e8, m8, ta, ma  # Vector of bytes of maximum length
+    vle8ff.v v8, (t0)      # Load bytes
+    csrr a1, vl           # Get bytes read
+    vmseq.vi v0, v8, 0    # Set v0[i] where v8[i] = 0
+    vfirst.m a2, v0       # Find first set bit
+    add t0, t0, a1        # Bump pointer
+    bltz a2, loop         # Not found?
+
+    add a3, a3, a1        # Sum start + bump
+    add t0, t0, a2        # Add index
+    sub a3, t0, a3        # Subtract start address+bump
+    add a0, a3, a4
+    ret
+#else
+loop:
+    lbu t1, 0(t0)
+    beqz t1, found
+    addi t0, t0, 1
+    j loop
+#endif
+
+found:
+    sub a0, t0, a0
+    ret

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
  2023-04-10  5:59   ` 张飞
@ 2023-04-11 12:48     ` Szabolcs Nagy
  2023-04-19  7:22       ` 张飞
  0 siblings, 1 reply; 8+ messages in thread
From: Szabolcs Nagy @ 2023-04-11 12:48 UTC (permalink / raw)
  To: 张飞; +Cc: musl

* 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]:
> I have made modifications to the assembly implementation of the riscv64 strlen function, mainly 
> focusing on address alignment processing to avoid the problem of data crossing 
> pages during vector instruction memory access.
> 
> I think the assembly implementation of strlen is necessary. In glibc, 

if the c definition is not correct then you have to explain why.
if it's very slow then please tell us so.

> X86_64, aarch64, alpha, and others all have assembly implementations of this function, 
> while for riscv64, it is blank.
> I have also analyzed the test sets of Spec2006 and Spec2017, and the strlen function is also a hot topic.

an asm implementation has significant maintenance cost so you should
provide some benchmark data or other evidence/reasoning for us to
decide if it's worth the cost.

it seems you replaced the c strlen code with a slower one except when
musl is built for "#ifdef __riscv_vector" isa extension. what cpus
does this affect? are linux distros expected to use this as baseline?
do different riscv cpus have similar simd performance properties? who
will tweak the asm if not?

in principle what you did can be done by the compiler auto vectorizer
so maybe contributing to the compiler is more useful.

note that glibc has cpu specific implementations that it can select
at runtime, but musl uses one generic implementation for all cpus.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
  2023-04-11 12:48     ` Szabolcs Nagy
@ 2023-04-19  7:22       ` 张飞
  2023-04-19 22:39         ` enh
  0 siblings, 1 reply; 8+ messages in thread
From: 张飞 @ 2023-04-19  7:22 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 7393 bytes --]

I did replace the C strlen code with a slower one except when
musl is built for "#ifdef __riscv_vector" isa extension.So I referred 
to the C strlen code and implemented it with the basic instruction 
set, and the performance of both is basically the same.

The reason for implementing two versions is to hope that the memset implemented 
using the basic instruction set can be applicable to all RISCV architecture CPUs, 
and the vector version can accelerate the hardware supporting vector expansion.
When the compiler adds vector extensions through --with-arch=rv64gcv,
__riscv_vector will also open by default.Similar macro definitions are common in 
riscv, such as setjmp/riscv64/setjmp.S in musl, which includes 
__riscv_float_abi_soft macro definitions.

At present, the riscv vector extension instruction set is in a frozen state, and 
the instruction set is stable. In other open source libraries, such as openssl 
and openCV, riscv vector optimization is available.We know that the assembly generated 
by the compiler is often not the most efficient, and the automatic vectorization 
scenarios are limited, so we need to optimize the function by manual vectorization.
For riscv, compiler automatic vectorization is still in its infancy.

I conducted tests on different data volumes and compared the performance of memset 
functions implemented in C language, basic instruction set, and vector instruction 
set.The test case is test_strlen.c

Performance comparison between C language implementation and assembly implementation was 
tested on Sifive chips(RISC-V SiFive U74 Dual Core 64 Bit RV64GC ISA Chip Platform).

The test results are as follows.Due to the consistent algorithm between the two, there 
is basically no difference in performance.

--------------------------------------------------------------------------------
length(byte)  C language implementation(s)   Basic instruction implementation(s)
--------------------------------------------------------------------------------
2	             0.00000528	                     0.000005441	
4	             0.00000544	                     0.000005437	
8	             0.00000464	                     0.00000496	
16	             0.00000544	                     0.00000512	
32	             0.0000064	                     0.00000592	
64	             0.000007994	             0.000007841	
128	             0.000012	                     0.000012	
256	             0.000020321	             0.000020481	
512	             0.000037282	             0.000037762	
1024	             0.000069924	             0.000070244	
2048	             0.000135046	             0.000135528	
4096	             0.000264491	             0.000264816	
8192	             0.000524342	             0.000525631	
16384	             0.001069965	             0.001047742	
32768	             0.002180252	             0.002142207	
65536	             0.005921251	             0.005883868	
131072	             0.012508934	             0.012392895	
262144	             0.02503915	                     0.024896995	
524288	             0.049879091	             0.049821832	
1048576	             0.09973658	                     0.099969603	
--------------------------------------------------------------------------------

Due to the lack of a chip that supports vector extension, I conducted a performance 
comparison test of strlen using C language and vector implementation on the Spike 
simulator, which has certain reference value. It can be clearly seen that vector 
implementation is more efficient than C language implementation, with an average 
performance improvement of over 800%.

--------------------------------------------------------------------------------
length(byte)  C language implementation(s)   Vector instruction implementation(s)
--------------------------------------------------------------------------------
2	             0.000003639	             0.000003339
4	             0.000004239	             0.000003339
8	             0.000003639	             0.000003339
16	             0.000004339	             0.000003339
32	             0.000005739	             0.000003339
64	             0.000008539	             0.000003339
128	             0.000014139	             0.000004039
256	             0.000025339	             0.000004739
512	             0.000047739	             0.000006139
1024	             0.000092539	             0.000008939
2048	             0.000182139	             0.000014539
4096	             0.000361339	             0.000025739
8192	             0.000719739	             0.000048139
16384	             0.001436539	             0.000092939
32768	             0.002870139	             0.000182539
65536	             0.005737339	             0.000361739
131072	             0.011471739	             0.000720139
262144	             0.022940539	             0.001436939
524288	             0.045878139	             0.002870539
1048576              0.091753339	             0.005737739
--------------------------------------------------------------------------------

So I hope to pass __riscv_vector, which enables hardware that does not support vector 
extension to execute the basic instruction set implementation of strlen, has the same 
performance as the C language implementation. For support vector extended hardware, 
strlen implemented by vector instruction set is executed to achieve acceleration effect.

Fei Zhang

&gt; -----原始邮件-----
&gt; 发件人: "Szabolcs Nagy" <nsz@port70.net>
&gt; 发送时间: 2023-04-11 20:48:22 (星期二)
&gt; 收件人: "张飞" <zhangfei@nj.iscas.ac.cn>
&gt; 抄送: musl@lists.openwall.com
&gt; 主题: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
&gt; 
&gt; * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]:
&gt; &gt; I have made modifications to the assembly implementation of the riscv64 strlen function, mainly 
&gt; &gt; focusing on address alignment processing to avoid the problem of data crossing 
&gt; &gt; pages during vector instruction memory access.
&gt; &gt; 
&gt; &gt; I think the assembly implementation of strlen is necessary. In glibc, 
&gt; 
&gt; if the c definition is not correct then you have to explain why.
&gt; if it's very slow then please tell us so.
&gt; 
&gt; &gt; X86_64, aarch64, alpha, and others all have assembly implementations of this function, 
&gt; &gt; while for riscv64, it is blank.
&gt; &gt; I have also analyzed the test sets of Spec2006 and Spec2017, and the strlen function is also a hot topic.
&gt; 
&gt; an asm implementation has significant maintenance cost so you should
&gt; provide some benchmark data or other evidence/reasoning for us to
&gt; decide if it's worth the cost.
&gt; 
&gt; it seems you replaced the c strlen code with a slower one except when
&gt; musl is built for "#ifdef __riscv_vector" isa extension. what cpus
&gt; does this affect? are linux distros expected to use this as baseline?
&gt; do different riscv cpus have similar simd performance properties? who
&gt; will tweak the asm if not?
&gt; 
&gt; in principle what you did can be done by the compiler auto vectorizer
&gt; so maybe contributing to the compiler is more useful.
&gt; 
&gt; note that glibc has cpu specific implementations that it can select
&gt; at runtime, but musl uses one generic implementation for all cpus.
</zhangfei@nj.iscas.ac.cn></zhangfei@nj.iscas.ac.cn></nsz@port70.net>

[-- Attachment #2: strlen_riscv64.patch --]
[-- Type: application/octet-stream, Size: 2007 bytes --]

diff -uprN src/string/riscv64/strlen.S src/string/riscv64/strlen.S
--- src/string/riscv64/strlen.S	1970-01-01 08:00:00.000000000 +0800
+++ src/string/riscv64/strlen.S	2023-04-18 13:37:22.644057680 +0800
@@ -0,0 +1,82 @@
+# size_t strlen(const char *str)
+# a0 holds *str
+.global strlen
+.type strlen,@function
+strlen:
+#ifdef __riscv_vector
+    mv      t0, a0      # Save start
+    csrr    t1, vlenb 
+    addi    t1, t1, -1
+    add     a3, t0, t1
+    not     t1, t1
+    and     a3, a3, t1
+    sub     a4, a3, t0
+    beq     a3, t0, loop # if already aligned
+
+unaligned:
+    lbu     t1, 0(t0)
+    beqz    t1, found
+    addi    t0, t0, 1
+    blt     t0, a3, unaligned
+
+loop:
+    vsetvli  a1, x0, e8, m8, ta, ma  # Vector of bytes of maximum length
+    vle8ff.v v8, (t0)      # Load bytes
+    csrr     a1, vl           # Get bytes read
+    vmseq.vi v0, v8, 0    # Set v0[i] where v8[i] = 0
+    vfirst.m a2, v0       # Find first set bit
+    add      t0, t0, a1        # Bump pointer
+    bltz     a2, loop         # Not found?
+
+    add    a3, a3, a1        # Sum start + bump
+    add    t0, t0, a2        # Add index
+    sub    a3, t0, a3        # Subtract start address+bump
+    add    a0, a3, a4
+    ret
+
+found:
+    sub    a0, t0, a0
+    ret
+
+#else
+    mv     a5, a0      
+    andi   a4, a0, 7         
+    beqz   a4, aligned  # if already aligned
+
+unaligned:
+    lbu    a4, 0(a5)
+    beqz   a4, count
+    addi   a5, a5, 1
+    andi   a4, a5, 7
+    bnez   a4, unaligned
+
+aligned:
+    la     t0, magic
+    ld     a1, 0(t0)
+    ld     a2, 8(t0)
+
+loop:
+    ld     a3, 0(a5)   
+    add    a4, a3, a1
+    not    a3, a3
+    and    a4, a4, a3
+    and    a4, a4, a2
+    bnez   a4, found
+    addi   a5, a5, 8
+    j      loop
+
+found:
+    lbu    a4, 0(a5)
+    beqz   a4, count
+    addi   a5, a5, 1
+    j      found
+
+count:
+    sub    a0, a5, a0
+    ret
+
+.section .data
+magic:
+    .dword 0xfefefefefefefeff
+    .dword 0x8080808080808080
+#endif

[-- Attachment #3: test_strlen.c --]
[-- Type: text/plain, Size: 1035 bytes --]

#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>

#define DATA_SIZE 5*1024*1024
#define MAX_LEN 1*1024*1024
#define LOOP_TIMES 100
int main(){
   unsigned int len,ans;
   char *str1,*src1;
   str1 = (char *)mmap(NULL, DATA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

   struct timespec tv0,tv;
   for(len=1; len<=MAX_LEN; len*=2) {
      memset(str1, 'a', DATA_SIZE);
      src1 = str1;  // +offset
      src1[len] = '\0';

      clock_gettime(CLOCK_REALTIME, &tv0);
      for(int k=0; k<LOOP_TIMES; k++){
         ans = strlen(src1);
      }
      clock_gettime(CLOCK_REALTIME, &tv);
      tv.tv_sec -= tv0.tv_sec;
      if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
              tv.tv_nsec += 1000000000;
              tv.tv_sec--;
      }
      printf("length: %u  time: %ld.%.9ld\n",ans, (long)tv.tv_sec, (long)tv.tv_nsec);

      if( ans != len)
         printf("ERROR! len is %u,ans is %u\n",len,ans); //verify length
   }
   munmap(str1,DATA_SIZE);
   return 0;
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
  2023-04-19  7:22       ` 张飞
@ 2023-04-19 22:39         ` enh
  0 siblings, 0 replies; 8+ messages in thread
From: enh @ 2023-04-19 22:39 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 8728 bytes --]

On Wed, Apr 19, 2023 at 12:22 AM 张飞 <zhangfei@nj.iscas.ac.cn> wrote:

> I did replace the C strlen code with a slower one except when
> musl is built for "#ifdef __riscv_vector" isa extension.So I referred
> to the C strlen code and implemented it with the basic instruction
> set, and the performance of both is basically the same.
>
> The reason for implementing two versions is to hope that the memset
> implemented
> using the basic instruction set can be applicable to all RISCV
> architecture CPUs,
> and the vector version can accelerate the hardware supporting vector
> expansion.
> When the compiler adds vector extensions through --with-arch=rv64gcv,
> __riscv_vector will also open by default.Similar macro definitions are
> common in
> riscv, such as setjmp/riscv64/setjmp.S in musl, which includes
> __riscv_float_abi_soft macro definitions.
>
> At present, the riscv vector extension instruction set is in a frozen
> state, and
> the instruction set is stable. In other open source libraries, such as
> openssl
> and openCV, riscv vector optimization is available.


is that actually checked in to openssl? the linux kernel patches to
save/restore vector state still haven't been merged to linux-next afaik,
and there's still no hwcaps support for V either. or are they using
`__riscv_vector` too, and not detecting V at runtime? (the kernel's own use
of V and Zb* seems to be based on an internal-only hwcap mechanism for now.)


> We know that the assembly generated
> by the compiler is often not the most efficient, and the automatic
> vectorization
> scenarios are limited, so we need to optimize the function by manual
> vectorization.
> For riscv, compiler automatic vectorization is still in its infancy.
>

have you tried sifive's autovectorization patches? do they help for this
code?


> I conducted tests on different data volumes and compared the performance
> of memset
> functions implemented in C language, basic instruction set, and vector
> instruction
> set.The test case is test_strlen.c
>
> Performance comparison between C language implementation and assembly
> implementation was
> tested on Sifive chips(RISC-V SiFive U74 Dual Core 64 Bit RV64GC ISA Chip
> Platform).
>
> The test results are as follows.Due to the consistent algorithm between
> the two, there
> is basically no difference in performance.
>
>
> --------------------------------------------------------------------------------
> length(byte)  C language implementation(s)   Basic instruction
> implementation(s)
>
> --------------------------------------------------------------------------------
> 2                    0.00000528                      0.000005441
> 4                    0.00000544                      0.000005437
> 8                    0.00000464                      0.00000496
> 16                   0.00000544                      0.00000512
> 32                   0.0000064                       0.00000592
> 64                   0.000007994                     0.000007841
> 128                  0.000012                        0.000012
> 256                  0.000020321                     0.000020481
> 512                  0.000037282                     0.000037762
> 1024                 0.000069924                     0.000070244
> 2048                 0.000135046                     0.000135528
> 4096                 0.000264491                     0.000264816
> 8192                 0.000524342                     0.000525631
> 16384                0.001069965                     0.001047742
> 32768                0.002180252                     0.002142207
> 65536                0.005921251                     0.005883868
> 131072               0.012508934                     0.012392895
> 262144               0.02503915                      0.024896995
> 524288               0.049879091                     0.049821832
> 1048576              0.09973658                      0.099969603
>
> --------------------------------------------------------------------------------
>
> Due to the lack of a chip that supports vector extension, I conducted a
> performance
> comparison test of strlen using C language and vector implementation on
> the Spike
> simulator, which has certain reference value. It can be clearly seen that
> vector
> implementation is more efficient than C language implementation, with an
> average
> performance improvement of over 800%.
>
>
> --------------------------------------------------------------------------------
> length(byte)  C language implementation(s)   Vector instruction
> implementation(s)
>
> --------------------------------------------------------------------------------
> 2                    0.000003639                     0.000003339
> 4                    0.000004239                     0.000003339
> 8                    0.000003639                     0.000003339
> 16                   0.000004339                     0.000003339
> 32                   0.000005739                     0.000003339
> 64                   0.000008539                     0.000003339
> 128                  0.000014139                     0.000004039
> 256                  0.000025339                     0.000004739
> 512                  0.000047739                     0.000006139
> 1024                 0.000092539                     0.000008939
> 2048                 0.000182139                     0.000014539
> 4096                 0.000361339                     0.000025739
> 8192                 0.000719739                     0.000048139
> 16384                0.001436539                     0.000092939
> 32768                0.002870139                     0.000182539
> 65536                0.005737339                     0.000361739
> 131072               0.011471739                     0.000720139
> 262144               0.022940539                     0.001436939
> 524288               0.045878139                     0.002870539
> 1048576              0.091753339                     0.005737739
>
> --------------------------------------------------------------------------------
>
> So I hope to pass __riscv_vector, which enables hardware that does not
> support vector
> extension to execute the basic instruction set implementation of strlen,
> has the same
> performance as the C language implementation. For support vector extended
> hardware,
> strlen implemented by vector instruction set is executed to achieve
> acceleration effect.
>
> Fei Zhang
>
> &gt; -----原始邮件-----
> &gt; 发件人: "Szabolcs Nagy" <nsz@port70.net>
> &gt; 发送时间: 2023-04-11 20:48:22 (星期二)
> &gt; 收件人: "张飞" <zhangfei@nj.iscas.ac.cn>
> &gt; 抄送: musl@lists.openwall.com
> &gt; 主题: Re: Re: [musl] [PATCH]Implementation of strlen function in
> riscv64 architecture
> &gt;
> &gt; * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-10 13:59:22 +0800]:
> &gt; &gt; I have made modifications to the assembly implementation of the
> riscv64 strlen function, mainly
> &gt; &gt; focusing on address alignment processing to avoid the problem of
> data crossing
> &gt; &gt; pages during vector instruction memory access.
> &gt; &gt;
> &gt; &gt; I think the assembly implementation of strlen is necessary. In
> glibc,
> &gt;
> &gt; if the c definition is not correct then you have to explain why.
> &gt; if it's very slow then please tell us so.
> &gt;
> &gt; &gt; X86_64, aarch64, alpha, and others all have assembly
> implementations of this function,
> &gt; &gt; while for riscv64, it is blank.
> &gt; &gt; I have also analyzed the test sets of Spec2006 and Spec2017, and
> the strlen function is also a hot topic.
> &gt;
> &gt; an asm implementation has significant maintenance cost so you should
> &gt; provide some benchmark data or other evidence/reasoning for us to
> &gt; decide if it's worth the cost.
> &gt;
> &gt; it seems you replaced the c strlen code with a slower one except when
> &gt; musl is built for "#ifdef __riscv_vector" isa extension. what cpus
> &gt; does this affect? are linux distros expected to use this as baseline?
> &gt; do different riscv cpus have similar simd performance properties? who
> &gt; will tweak the asm if not?
> &gt;
> &gt; in principle what you did can be done by the compiler auto vectorizer
> &gt; so maybe contributing to the compiler is more useful.
> &gt;
> &gt; note that glibc has cpu specific implementations that it can select
> &gt; at runtime, but musl uses one generic implementation for all cpus.
> </zhangfei@nj.iscas.ac.cn></zhangfei@nj.iscas.ac.cn></nsz@port70.net>

[-- Attachment #2: Type: text/html, Size: 11231 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-04-19 22:39 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-22  6:21 [musl] [PATCH]Implementation of strlen function in riscv64 architecture 张飞
2023-03-22  6:27 ` A. Wilcox
2023-03-22 12:15   ` Rich Felker
2023-04-11 12:57     ` Szabolcs Nagy
2023-04-10  5:59   ` 张飞
2023-04-11 12:48     ` Szabolcs Nagy
2023-04-19  7:22       ` 张飞
2023-04-19 22:39         ` enh

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).