mailing list of musl libc
* [musl] memset_riscv64
@ 2023-04-11  2:17 张飞
  2023-04-11  9:48 ` Pedro Falcato
  0 siblings, 1 reply; 10+ messages in thread
From: 张飞 @ 2023-04-11  2:17 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 399 bytes --]

Hello,

Currently, there is no assembly implementation of the memset function for riscv64 in musl.
This patch adds a riscv64 assembly implementation of memset, written with the base instruction set,
which performs better than the C implementation in musl. I hope it can be integrated into musl.

Please review it; I look forward to your reply.

Fei Zhang



[-- Attachment #2: memset_riscv64.patch --]
[-- Type: application/octet-stream, Size: 1845 bytes --]

diff -uprN src/string/riscv64/memset.S src/string/riscv64/memset.S
--- src/string/riscv64/memset.S	1970-01-01 08:00:00.000000000 +0800
+++ src/string/riscv64/memset.S	2023-04-11 09:43:05.733274437 +0800
@@ -0,0 +1,95 @@
+.global memset
+.type memset,@function
+
+#define SZREG 8
+#define REG_S sd
+
+memset:
+        mv t0, a0  
+
+	sltiu a3, a2, 16
+	bnez a3, 4f
+
+	addi a3, t0, SZREG-1
+	andi a3, a3, ~(SZREG-1)
+	beq a3, t0, 2f  
+	sub a4, a3, t0
+
+1:
+	sb a1, 0(t0)
+	addi t0, t0, 1
+	bltu t0, a3, 1b
+	sub a2, a2, a4  
+
+2: 
+	andi a1, a1, 0xff
+	slli a3, a1, 8
+	or a1, a3, a1
+	slli a3, a1, 16
+	or a1, a3, a1
+	slli a3, a1, 32
+	or a1, a3, a1
+
+	andi a4, a2, ~(SZREG-1)
+	add a3, t0, a4
+
+	andi a4, a4, 31*SZREG  
+	beqz a4, 3f            
+	neg a4, a4
+	addi a4, a4, 32*SZREG  
+
+	sub t0, t0, a4
+
+	la a5, 3f
+	srli a4, a4, 1
+	add a5, a5, a4
+	jr a5
+
+3:
+	REG_S a1,        0(t0)
+	REG_S a1,    SZREG(t0)
+	REG_S a1,  2*SZREG(t0)
+	REG_S a1,  3*SZREG(t0)
+	REG_S a1,  4*SZREG(t0)
+	REG_S a1,  5*SZREG(t0)
+	REG_S a1,  6*SZREG(t0)
+	REG_S a1,  7*SZREG(t0)
+	REG_S a1,  8*SZREG(t0)
+	REG_S a1,  9*SZREG(t0)
+	REG_S a1, 10*SZREG(t0)
+	REG_S a1, 11*SZREG(t0)
+	REG_S a1, 12*SZREG(t0)
+	REG_S a1, 13*SZREG(t0)
+	REG_S a1, 14*SZREG(t0)
+	REG_S a1, 15*SZREG(t0)
+	REG_S a1, 16*SZREG(t0)
+	REG_S a1, 17*SZREG(t0)
+	REG_S a1, 18*SZREG(t0)
+	REG_S a1, 19*SZREG(t0)
+	REG_S a1, 20*SZREG(t0)
+	REG_S a1, 21*SZREG(t0)
+	REG_S a1, 22*SZREG(t0)
+	REG_S a1, 23*SZREG(t0)
+	REG_S a1, 24*SZREG(t0)
+	REG_S a1, 25*SZREG(t0)
+	REG_S a1, 26*SZREG(t0)
+	REG_S a1, 27*SZREG(t0)
+	REG_S a1, 28*SZREG(t0)
+	REG_S a1, 29*SZREG(t0)
+	REG_S a1, 30*SZREG(t0)
+	REG_S a1, 31*SZREG(t0)
+	addi t0, t0, 32*SZREG
+	bltu t0, a3, 3b
+	andi a2, a2, SZREG-1  
+
+4:
+	beqz a2, 6f
+	add a3, t0, a2
+
+5:
+	sb a1, 0(t0)
+	addi t0, t0, 1
+	bltu t0, a3, 5b
+
+6:
+	ret


* Re: [musl] memset_riscv64
  2023-04-11  2:17 [musl] memset_riscv64 张飞
@ 2023-04-11  9:48 ` Pedro Falcato
  2023-04-19  5:33   ` 张飞
  0 siblings, 1 reply; 10+ messages in thread
From: Pedro Falcato @ 2023-04-11  9:48 UTC (permalink / raw)
  To: musl

On Tue, Apr 11, 2023 at 3:18 AM 张飞 <zhangfei@nj.iscas.ac.cn> wrote:
>
> Hello,
>
> Currently, there is no assembly implementation of the memset function for riscv64 in musl.
> This patch adds a riscv64 assembly implementation of memset, written with the base instruction set,
> which performs better than the C implementation in musl. I hope it can be integrated into musl.

Hi!

Do you have performance measurements here? What exactly is the difference?
As far as I know, no one is actively optimizing on riscv yet, only
some movements in the upstream kernel (to prepare for vector extension
stuff, unaligned loads/stores) and the corresponding glibc patches.
Mainly because it's still super unobtanium and in very early stages,
so optimizing is very hard.

So what hardware did you use? Is there a large gain here? Given that
your memset looks so simple, wouldn't it just be easier to write this
in C?

-- 
Pedro


* Re: Re: [musl] memset_riscv64
  2023-04-11  9:48 ` Pedro Falcato
@ 2023-04-19  5:33   ` 张飞
  2023-04-19  9:02     ` Szabolcs Nagy
  0 siblings, 1 reply; 10+ messages in thread
From: 张飞 @ 2023-04-19  5:33 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 6261 bytes --]

Hi!

I have revised the memset implementation. It now contains two versions, one written with the
base instruction set and one with the vector instruction set, and uses the __riscv_vector macro
to determine whether the target supports the vector extension.

The reason for providing two versions is that the base-instruction-set memset should work on
all RISC-V CPUs, while the vector version can accelerate hardware that supports the vector
extension.

The RISC-V vector extension is now frozen, so the instruction set is stable. Other open source
libraries, such as OpenSSL and OpenCV, already ship RISC-V vector optimizations.

I ran tests over a range of sizes and compared the performance of memset implemented in C,
in base-instruction-set assembly, and in vector assembly. The test case is test_memset.c.

The comparison between the C implementation and the base-instruction-set assembly was run on
a SiFive chip (RISC-V SiFive U74 dual-core 64-bit RV64GC platform). The results are as follows:
under 16 bytes the C implementation is faster; from 16 bytes to 32768 bytes the base-instruction
implementation is faster, with an average improvement of over 30%; above 32768 bytes the two
are equivalent.

--------------------------------------------------------------------------------
length(byte)  C language implementation(s)   Basic instruction implementation(s)
--------------------------------------------------------------------------------	
4	          0.00000352	                    0.000004001	
8	          0.000004001	                    0.000005441	
16	          0.000006241	                    0.00000464	
32	          0.00000752	                    0.00000448	
64	          0.000008481	                    0.000005281	
128	          0.000009281	                    0.000005921	
256	          0.000011201	                    0.000007041	
512	          0.000014402	                    0.000010401	
1024	          0.000022563	                    0.000016962	
2048	          0.000039205	                    0.000030724	
4096	          0.000072809	                    0.000057768	
8192	          0.000153459	                    0.000132793	
16384	          0.000297157	                    0.000244992	
32768	          0.000784416	                    0.000735298	
65536	          0.005005252	                    0.004987382	
131072	          0.011286821	                    0.011256855	
262144	          0.023295169	                    0.022932165	
524288	          0.04647724	                    0.046084839	
1048576	          0.094114058	                    0.0932383	
--------------------------------------------------------------------------------

Since I do not have a chip that supports the vector extension, I compared the C implementation
and the vector implementation of memset on the Spike simulator, which still has some reference
value. The vector implementation is clearly more efficient than the C implementation, with an
average performance improvement of over 200%.

--------------------------------------------------------------------------------
length(byte)  C language implementation(s)   Vector instruction implementation(s)
--------------------------------------------------------------------------------
4	          0.000002839	                    0.000002939	
8	          0.000003239	                    0.000002939	
16	          0.000005239	                    0.000002939	
32	          0.000007039	                    0.000002939	
64	          0.000008039	                    0.000002939	
128	          0.000009239	                    0.000002939	
256	          0.000011639	                    0.000003539	
512	          0.000016439	                    0.000004739	
1024	          0.000026039	                    0.000007139	
2048	          0.000045239	                    0.000011939	
4096	          0.000083639	                    0.000021539	
8192	          0.000162298	                    0.000042598	
16384	          0.000317757	                    0.000082857	
32768	          0.000628675	                    0.000163375	
65536	          0.001250511	                    0.000324411	
131072	          0.002494183	                    0.000646483	
262144	          0.004981527	                    0.001290627	
524288	          0.009956215	                    0.002578915	
1048576	          0.019905591	                    0.005155491	
--------------------------------------------------------------------------------

So I hope that a more efficient riscv64 assembly implementation of memset, compatible with
the different RISC-V CPUs, can be integrated into musl.

Fei Zhang

> -----Original Message-----
> From: "Pedro Falcato" <pedro.falcato@gmail.com>
> Sent: 2023-04-11 17:48:31 (Tuesday)
> To: musl@lists.openwall.com
> Cc: 
> Subject: Re: [musl] memset_riscv64
> 
> On Tue, Apr 11, 2023 at 3:18 AM 张飞 <zhangfei@nj.iscas.ac.cn> wrote:
> >
> > Hello,
> >
> > Currently, there is no assembly implementation of the memset function for riscv64 in musl.
> > This patch adds a riscv64 assembly implementation of memset, written with the base instruction set,
> > which performs better than the C implementation in musl. I hope it can be integrated into musl.
> 
> Hi!
> 
> Do you have performance measurements here? What exactly is the difference?
> As far as I know, no one is actively optimizing on riscv yet, only
> some movements in the upstream kernel (to prepare for vector extension
> stuff, unaligned loads/stores) and the corresponding glibc patches.
> Mainly because it's still super unobtanium and in very early stages,
> so optimizing is very hard.
> 
> So what hardware did you use? Is there a large gain here? Given that
> your memset looks so simple, wouldn't it just be easier to write this
> in C?
> 
> -- 
> Pedro

[-- Attachment #2: memset_riscv64.patch --]
[-- Type: application/octet-stream, Size: 2646 bytes --]

diff -uprN src/string/riscv64/memset.S src/string/riscv64/memset.S
--- src/string/riscv64/memset.S	1970-01-01 08:00:00.000000000 +0800
+++ src/string/riscv64/memset.S	2023-04-18 13:37:22.548056966 +0800
@@ -0,0 +1,125 @@
+.global memset
+.type memset,@function
+
+#ifndef __riscv_vector
+#define SZREG 8
+#define REG_S sd
+#endif
+
+memset:
+#ifdef __riscv_vector
+        mv      t0, a0
+        beqz    a2, 3f
+    
+        csrr    t1, vlenb
+        addi    t1, t1, -1
+        add     a3, t0, t1
+        not     t1, t1
+        and     a3, a3, t1
+        beq     a3, t0, 2f
+        sub     a4, a3, t0
+        add     a5, t0, a2        
+
+1:
+        sb      a1, 0(t0)
+        addi    t0, t0, 1
+        beq     t0, a5, 3f
+        bltu    t0, a3, 1b
+        sub     a2, a2, a4  
+
+2:
+        vsetvli t1, a2, e8, m8, ta, ma
+        vmv.v.x v0, a1 
+        sub     a2, a2, t1
+        vse8.v  v0, (t0)
+        add     t0, t0, t1
+        bnez    a2, 2b
+3:
+        ret
+
+#else
+        mv      t0, a0  
+
+	sltiu   a3, a2, 16
+	bnez    a3, 4f
+
+	addi    a3, t0, SZREG-1
+	andi    a3, a3, ~(SZREG-1)
+	beq     a3, t0, 2f  
+	sub     a4, a3, t0
+1:
+	sb      a1, 0(t0)
+	addi    t0, t0, 1
+	bltu    t0, a3, 1b
+	sub     a2, a2, a4  
+
+2: 
+	andi    a1, a1, 0xff
+	slli    a3, a1, 8
+	or      a1, a3, a1
+	slli    a3, a1, 16
+	or      a1, a3, a1
+	slli    a3, a1, 32
+	or      a1, a3, a1
+
+	andi    a4, a2, ~(SZREG-1)
+	add     a3, t0, a4
+
+	andi    a4, a4, 31*SZREG  
+	beqz    a4, 3f            
+	neg     a4, a4
+	addi    a4, a4, 32*SZREG  
+
+	sub     t0, t0, a4
+
+	la      a5, 3f
+	srli    a4, a4, 1
+	add     a5, a5, a4
+	jr      a5
+3:
+	REG_S a1,        0(t0)
+	REG_S a1,    SZREG(t0)
+	REG_S a1,  2*SZREG(t0)
+	REG_S a1,  3*SZREG(t0)
+	REG_S a1,  4*SZREG(t0)
+	REG_S a1,  5*SZREG(t0)
+	REG_S a1,  6*SZREG(t0)
+	REG_S a1,  7*SZREG(t0)
+	REG_S a1,  8*SZREG(t0)
+	REG_S a1,  9*SZREG(t0)
+	REG_S a1, 10*SZREG(t0)
+	REG_S a1, 11*SZREG(t0)
+	REG_S a1, 12*SZREG(t0)
+	REG_S a1, 13*SZREG(t0)
+	REG_S a1, 14*SZREG(t0)
+	REG_S a1, 15*SZREG(t0)
+	REG_S a1, 16*SZREG(t0)
+	REG_S a1, 17*SZREG(t0)
+	REG_S a1, 18*SZREG(t0)
+	REG_S a1, 19*SZREG(t0)
+	REG_S a1, 20*SZREG(t0)
+	REG_S a1, 21*SZREG(t0)
+	REG_S a1, 22*SZREG(t0)
+	REG_S a1, 23*SZREG(t0)
+	REG_S a1, 24*SZREG(t0)
+	REG_S a1, 25*SZREG(t0)
+	REG_S a1, 26*SZREG(t0)
+	REG_S a1, 27*SZREG(t0)
+	REG_S a1, 28*SZREG(t0)
+	REG_S a1, 29*SZREG(t0)
+	REG_S a1, 30*SZREG(t0)
+	REG_S a1, 31*SZREG(t0)
+	addi t0, t0, 32*SZREG
+	bltu t0, a3, 3b
+	andi a2, a2, SZREG-1  
+
+4:
+	beqz  a2, 6f
+	add   a3, t0, a2
+5:
+	sb    a1, 0(t0)
+	addi  t0, t0, 1
+	bltu  t0, a3, 5b
+6:
+	ret
+#endif

[-- Attachment #3: test_memset.c --]
[-- Type: text/plain, Size: 929 bytes --]

#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

#define DATA_SIZE 5*1024*1024
#define MAX_LEN 1*1024*1024
#define OFFSET 0
#define LOOP_TIMES 100
int main(){
   char *str1,*src1;
   str1 = (char *)mmap(NULL, DATA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

   printf("function test start\n");
   
   src1 = str1+OFFSET;
   struct timespec tv0,tv;
   for(int len=2; len<=MAX_LEN; len*=2){
      clock_gettime(CLOCK_REALTIME, &tv0);
      for(int k=0; k<LOOP_TIMES; k++){
          memset(src1, 'a', len);
      }
      clock_gettime(CLOCK_REALTIME, &tv);
      tv.tv_sec -= tv0.tv_sec;
      if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
	      tv.tv_nsec += 1000000000;
	      tv.tv_sec--;
      }
      printf("len: %d  time: %ld.%.9ld\n",len, (long)tv.tv_sec, (long)tv.tv_nsec);
   }

   printf("function test end\n");
   munmap(str1,DATA_SIZE);
   return 0;
}



* Re: Re: [musl] memset_riscv64
  2023-04-19  5:33   ` 张飞
@ 2023-04-19  9:02     ` Szabolcs Nagy
  2023-04-20  8:17       ` 张飞
  0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2023-04-19  9:02 UTC (permalink / raw)
  To: 张飞; +Cc: musl

* 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-19 13:33:08 +0800]:
> --------------------------------------------------------------------------------
> length(byte)  C language implementation(s)   Basic instruction implementation(s)
> --------------------------------------------------------------------------------	
> 4	          0.00000352	                    0.000004001	
> 8	          0.000004001	                    0.000005441	
> 16	          0.000006241	                    0.00000464	
> 32	          0.00000752	                    0.00000448	
> 64	          0.000008481	                    0.000005281	
> 128	          0.000009281	                    0.000005921	
> 256	          0.000011201	                    0.000007041	

i don't think these numbers can be trusted.

> #include <stdio.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <stdlib.h>
> #include <time.h>
> 
> #define DATA_SIZE 5*1024*1024
> #define MAX_LEN 1*1024*1024
> #define OFFSET 0
> #define LOOP_TIMES 100
> int main(){
>    char *str1,*src1;
>    str1 = (char *)mmap(NULL, DATA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> 
>    printf("function test start\n");
>    
>    src1 = str1+OFFSET;
>    struct timespec tv0,tv;
>    for(int len=2; len<=MAX_LEN; len*=2){
>       clock_gettime(CLOCK_REALTIME, &tv0);
>       for(int k=0; k<LOOP_TIMES; k++){
>           memset(src1, 'a', len);
>       }
>       clock_gettime(CLOCK_REALTIME, &tv);
>       tv.tv_sec -= tv0.tv_sec;
>       if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
> 	      tv.tv_nsec += 1000000000;
> 	      tv.tv_sec--;
>       }
>       printf("len: %d  time: %ld.%.9ld\n",len, (long)tv.tv_sec, (long)tv.tv_nsec);


this repeatedly calls memset with exact same len, alignment and value.
so it favours branch heavy code since those are correctly predicted.

but even if you care about a branch-predicted microbenchmark, you
made a single measurement per size so you cannot tell how much the
time varies, you should do several measurements and take the min
so noise from system effects and cpu internal state are reduced
(also that state needs to be warmed up). and likely the LOOP_TIMES
should be bigger too for small sizes for reliable timing.

benchmarking string functions is tricky especially for a target arch
with many implementations.
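
a minimal sketch of that measurement pattern (function name and counts are
made up, and the inner count would need to grow for small sizes): warm up
first, time an inner loop to amortize clock overhead, and keep the minimum
of several runs.

	#define _POSIX_C_SOURCE 199309L
	#include <string.h>
	#include <time.h>

	static double now_s(void)
	{
		struct timespec ts;
		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec/1e9;
	}

	/* time one (buf,len) configuration: the inner loop amortizes timer
	   overhead, the outer loop keeps the minimum over repeated runs,
	   and the untimed first pass warms up caches and predictors */
	static double bench_min(char *buf, size_t len, int inner, int reps)
	{
		double best = 1e9;
		for (int i = 0; i < inner; i++) memset(buf, 0, len);
		for (int r = 0; r < reps; r++) {
			double t0 = now_s();
			for (int i = 0; i < inner; i++) memset(buf, 0, len);
			double t1 = now_s();
			if (t1 - t0 < best) best = t1 - t0;
		}
		return best;
	}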

>    }
> 
>    printf("function test end\n");
>    munmap(str1,DATA_SIZE);
>    return 0;
> }
> 



* Re: Re: Re: [musl] memset_riscv64
  2023-04-19  9:02     ` Szabolcs Nagy
@ 2023-04-20  8:17       ` 张飞
  2023-04-21 13:30         ` Szabolcs Nagy
  0 siblings, 1 reply; 10+ messages in thread
From: 张飞 @ 2023-04-20  8:17 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 7671 bytes --]

Hi!
Following your suggestions, I referred to string.c in musl's test set (libc-bench)
and modified my test case. Since BUFLEN is a fixed value in strlen.c, I turned it into
a variable that is passed to the memset function as a parameter. I increased LOOP_TIMES
to 500, sorted the measured run times, and recorded only the middle 300 of them.

I ran the two programs alternately on the SiFive chip, three times each; the results
are shown below.
                             First run result
--------------------------------------------------------------------------------
length(byte)  C language implementation(s)   Basic instruction implementation(s)
--------------------------------------------------------------------------------
100                 0.002208102                     0.002304056
200                 0.005053208                     0.004629598
400                 0.008666684                     0.007739176
800                 0.014065196                     0.012372702
1600                0.023377685                     0.020090966
3200                0.040221849                     0.034059631
6400                0.072095377                     0.060028906
12800               0.134040475                     0.110039387
25600               0.257426806                     0.210710952
51200               1.173755160                     1.121833227
102400              3.693170402                     3.637194098
204800              8.919975455                     8.865504460
409600             19.410922418                    19.360956493
--------------------------------------------------------------------------------

                             Second run result 
--------------------------------------------------------------------------------
length(byte)  C language implementation(s)   Basic instruction implementation(s)
--------------------------------------------------------------------------------
100                 0.002208109                     0.002293857
200                 0.005057374                     0.004640669
400                 0.008674218                     0.007760795
800                 0.014068582                     0.012417084
1600                0.023381095                     0.020124496
3200                0.040225138                     0.034093181
6400                0.072098744                     0.060069574
12800               0.134043954                     0.110088141
25600               0.256453187                     0.208578633
51200               1.166602505                     1.118972796
102400              3.684957231                     3.635116808
204800              8.916302592                     8.861590734
409600             19.411057216                    19.358777670
--------------------------------------------------------------------------------

                             Third run result 
--------------------------------------------------------------------------------
length(byte)  C language implementation(s)   Basic instruction implementation(s)
--------------------------------------------------------------------------------
100                 0.002208111                     0.002293227
200                 0.005056101                     0.004628539
400                 0.008677756                     0.007748687
800                 0.014085242                     0.012404443
1600                0.023397782                     0.020115710
3200                0.040242985                     0.034084435
6400                0.072116665                     0.060063767
12800               0.134060262                     0.110082427
25600               0.257865186                     0.209101754
51200               1.174257177                     1.117753408
102400              3.696518162                     3.635417503
204800              8.929357747                     8.858765915
409600             19.426520562                     19.356515671
--------------------------------------------------------------------------------

From the test results, the runtime of the memset implemented in base-instruction-set assembly
is generally shorter than that of the C implementation.
May I ask whether these results are convincing?


> -----Original Message-----
> From: "Szabolcs Nagy" <nsz@port70.net>
> Sent: 2023-04-19 17:02:10 (Wednesday)
> To: "张飞" <zhangfei@nj.iscas.ac.cn>
> Cc: musl@lists.openwall.com
> Subject: Re: Re: [musl] memset_riscv64
> 
> * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-19 13:33:08 +0800]:
> > --------------------------------------------------------------------------------
> > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > --------------------------------------------------------------------------------
> > 4	          0.00000352	                    0.000004001
> > 8	          0.000004001	                    0.000005441
> > 16	          0.000006241	                    0.00000464
> > 32	          0.00000752	                    0.00000448
> > 64	          0.000008481	                    0.000005281
> > 128	          0.000009281	                    0.000005921
> > 256	          0.000011201	                    0.000007041
> 
> i don't think these numbers can be trusted.
> 
> > #include <stdio.h>
> > #include <sys/mman.h>
> > #include <string.h>
> > #include <stdlib.h>
> > #include <time.h>
> > 
> > #define DATA_SIZE 5*1024*1024
> > #define MAX_LEN 1*1024*1024
> > #define OFFSET 0
> > #define LOOP_TIMES 100
> > int main(){
> >    char *str1,*src1;
> >    str1 = (char *)mmap(NULL, DATA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> > 
> >    printf("function test start\n");
> > 
> >    src1 = str1+OFFSET;
> >    struct timespec tv0,tv;
> >    for(int len=2; len<=MAX_LEN; len*=2){
> >       clock_gettime(CLOCK_REALTIME, &tv0);
> >       for(int k=0; k<LOOP_TIMES; k++){
> >           memset(src1, 'a', len);
> >       }
> >       clock_gettime(CLOCK_REALTIME, &tv);
> >       tv.tv_sec -= tv0.tv_sec;
> >       if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
> > 	      tv.tv_nsec += 1000000000;
> > 	      tv.tv_sec--;
> >       }
> >       printf("len: %d  time: %ld.%.9ld\n",len, (long)tv.tv_sec, (long)tv.tv_nsec);
> 
> 
> this repeatedly calls memset with exact same len, alignment and value.
> so it favours branch heavy code since those are correctly predicted.
> 
> but even if you care about a branch-predicted microbenchmark, you
> made a single measurement per size so you cannot tell how much the
> time varies, you should do several measurements and take the min
> so noise from system effects and cpu internal state are reduced
> (also that state needs to be warmed up). and likely the LOOP_TIMES
> should be bigger too for small sizes for reliable timing.
> 
> benchmarking string functions is tricky especially for a target arch
> with many implementations.
> 
> >    }
> > 
> >    printf("function test end\n");
> >    munmap(str1,DATA_SIZE);
> >    return 0;
> > }
> > 

[-- Attachment #2: test_memset2.c --]
[-- Type: text/plain, Size: 1364 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFLEN 500000
#define LOOP_TIMES 500

int cmp(const void *a, const void *b) {
    double x = *(double *)a;
    double y = *(double *)b;
    if (x < y) return -1;
    if (x > y) return 1;
    return 0;
}

int main(){
        char *buf = malloc(BUFLEN);
	double *arr = malloc(sizeof(double) * LOOP_TIMES);
        size_t i,j,k;
        struct timespec tv0,tv;
	double times;

        for(j=100; j<BUFLEN; j*=2){
          for(k=0; k<LOOP_TIMES; k++){
            for (i=0; i<100; i++)
                  memset(buf+i, i, j-i);
          }
        }

        for(j=100; j<BUFLEN; j*=2){
          for(k=0; k<LOOP_TIMES; k++){
            clock_gettime(CLOCK_REALTIME, &tv0);
            for (i=0; i<100; i++)
                  memset(buf+i, i, j-i);
            clock_gettime(CLOCK_REALTIME, &tv);
            tv.tv_sec -= tv0.tv_sec;
            if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
                tv.tv_nsec += 1000000000;
                tv.tv_sec--;
            }
	    arr[k] = tv.tv_sec + (double)tv.tv_nsec/1000000000;
          }
          qsort(arr, LOOP_TIMES, sizeof(double), cmp);

	  times = 0;  /* reset the sum for this length */
	  for (int m = 100; m < LOOP_TIMES - 100; m++) {
              times += arr[m];
          }
	  printf("len: %ld  time: %.9lf\n",j, times);
	}
        free(buf);
        return 0;
}


* Re: Re: Re: [musl] memset_riscv64
  2023-04-20  8:17       ` 张飞
@ 2023-04-21 13:30         ` Szabolcs Nagy
  2023-04-21 14:50           ` Pedro Falcato
  2023-04-26  7:25           ` 张飞
  0 siblings, 2 replies; 10+ messages in thread
From: Szabolcs Nagy @ 2023-04-21 13:30 UTC (permalink / raw)
  To: 张飞; +Cc: musl

* 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> Hi!
> Following your suggestions, I referred to string.c in musl's test set (libc-bench)
> and modified my test case. Since BUFLEN is a fixed value in strlen.c, I turned it into
> a variable that is passed to the memset function as a parameter. I increased LOOP_TIMES
> to 500, sorted the measured run times, and recorded only the middle 300 of them.
> 
> I ran the two programs alternately on the SiFive chip, three times each; the results
> are shown below.
>                              First run result
> --------------------------------------------------------------------------------
> length(byte)  C language implementation(s)   Basic instruction implementation(s)
> --------------------------------------------------------------------------------
> 100                 0.002208102                     0.002304056
> 200                 0.005053208                     0.004629598
> 400                 0.008666684                     0.007739176
> 800                 0.014065196                     0.012372702
> 1600                0.023377685                     0.020090966
> 3200                0.040221849                     0.034059631
> 6400                0.072095377                     0.060028906
> 12800               0.134040475                     0.110039387
> 25600               0.257426806                     0.210710952
> 51200               1.173755160                     1.121833227
> 102400              3.693170402                     3.637194098
> 204800              8.919975455                     8.865504460
> 409600             19.410922418                    19.360956493
> --------------------------------------------------------------------------------
> 
>                              Second run result 
> --------------------------------------------------------------------------------
> length(byte)  C language implementation(s)   Basic instruction implementation(s)
> --------------------------------------------------------------------------------
> 100                 0.002208109                     0.002293857
> 200                 0.005057374                     0.004640669
> 400                 0.008674218                     0.007760795
> 800                 0.014068582                     0.012417084
> 1600                0.023381095                     0.020124496
> 3200                0.040225138                     0.034093181
> 6400                0.072098744                     0.060069574
> 12800               0.134043954                     0.110088141
> 25600               0.256453187                     0.208578633
> 51200               1.166602505                     1.118972796
> 102400              3.684957231                     3.635116808
> 204800              8.916302592                     8.861590734
> 409600             19.411057216                    19.358777670
> --------------------------------------------------------------------------------
> 
>                              Third run result 
> --------------------------------------------------------------------------------
> length(byte)  C language implementation(s)   Basic instruction implementation(s)
> --------------------------------------------------------------------------------
> 100                 0.002208111                     0.002293227
> 200                 0.005056101                     0.004628539
> 400                 0.008677756                     0.007748687
> 800                 0.014085242                     0.012404443
> 1600                0.023397782                     0.020115710
> 3200                0.040242985                     0.034084435
> 6400                0.072116665                     0.060063767
> 12800               0.134060262                     0.110082427
> 25600               0.257865186                     0.209101754
> 51200               1.174257177                     1.117753408
> 102400              3.696518162                     3.635417503
> 204800              8.929357747                     8.858765915
> 409600             19.426520562                     19.356515671
> --------------------------------------------------------------------------------
> 
> From the test results, the runtime of the memset implemented in base-instruction-set assembly
> is generally shorter than that of the C implementation.
> May I ask whether these results are convincing?

small sizes are much more common than large sizes, memsets can be
distributed such that sizes [0,100), [100,1000), [1000,inf) are
used for 1/3 of all memsets each (not the call count, but the
amount of bytes memset using such sizes), i.e. if you speed up
the size = [100,1000) and [1000,inf) cases by 10% but regress the
[0,100) case by 20% then the overall performance roughly stays
the same. (of course this is very workload dependent, but across
a system this is what i'd expect, probably even more skewed to
smaller sizes).

so we need to know what happens in the [0,100) range. what i see
is a ~4% regression there while there is a ~10% improvement in
the [100,1000) case and ~15% improvement in the [1000,inf) case
(it would be nice to know why the 25k case is so much faster and
why that speed up only applies to that size, we don't want to
optimize for some obscure cpu bug that will go away next year)

on practical workloads i would expect < 10% speedup overall from
the asm code (but we need more data in the [0,100) range to tell).
this may not be enough to justify the asm code.

rich already said he prefers a different style of implementation
(where the body of the function is in c but the inner loop is in
asm if that helps e.g. via simd).

here is an example of a benchmark that takes input distribution
into account from a workload:
https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53

> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <time.h>
> 
> #define BUFLEN 500000
> #define LOOP_TIMES 500
> 
> int cmp(const void *a, const void *b) {
>     double x = *(double *)a;
>     double y = *(double *)b;
>     if (x < y) return -1;
>     if (x > y) return 1;
>     return 0;
> }
> 
> int main(){
>         char *buf = malloc(BUFLEN);
> 	double *arr = malloc(sizeof(double) * LOOP_TIMES);
>         size_t i,j,k;
>         struct timespec tv0,tv;
> 	double times;
> 
>         for(j=100; j<BUFLEN; j*=2){
>           for(k=0; k<LOOP_TIMES; k++){
>             for (i=0; i<100; i++)
>                   memset(buf+i, i, j-i);
>           }
>         }
> 
>         for(j=100; j<BUFLEN; j*=2){
>           for(k=0; k<LOOP_TIMES; k++){
>             clock_gettime(CLOCK_REALTIME, &tv0);
>             for (i=0; i<100; i++)
>                   memset(buf+i, i, j-i);

alignment only matters up to 64 byte alignment and usually inputs
are at least 8byte aligned.

value is almost always 0. (we probably don't even need to test
non-0 case: a 0 check is correctly predicted in practice.)

i think length should have a small variation, just enough to add
penalty to small size checks where implementations may use many
branches.

so something like this may be better (madeup off,al numbers):

	buf = malloc((1<<16)+32);
	size_t sz[] = {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
	size_t off[16] = {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12};
	size_t al[16] = {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};
	for (j=0; sz[j]; j++)
		for (k=0; k<20; k++) {
			t0 = tic();
			// large loop count is important for small sizes
			for (i=0; i<256; i++)
				memset(buf + al[i%16], 0, sz[j] + off[i%16]);
			t1 = tic();
			tmin = min(tmin,t1-t0);
		}

large memset (>=1k) can be tested separately (no need to add off,al
variation then, a smaller inner loop count is enough, but it should not
hurt to include them here).
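
a compilable version of that sketch could look like this (tic() filled in
with clock_gettime, the off/al tables and loop counts are the made-up
numbers from above, and GB/s is 256*sz[j]/tmin):

	#define _POSIX_C_SOURCE 199309L
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	static double tic(void)
	{
		struct timespec ts;
		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec/1e9;
	}

	int main(void)
	{
		char *buf = malloc((1<<16)+32);
		size_t sz[] = {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11,
		               1<<12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
		long   off[16] = {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12};
		size_t al[16]  = {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};

		for (size_t j = 0; sz[j]; j++) {
			double tmin = 1e9;
			for (int k = 0; k < 20; k++) {
				double t0 = tic();
				/* large loop count is important for small sizes */
				for (int i = 0; i < 256; i++)
					memset(buf + al[i%16], 0, sz[j] + off[i%16]);
				double t1 = tic();
				if (t1 - t0 < tmin) tmin = t1 - t0;
			}
			printf("size %6zu: min %.9f s  %.3f GB/s\n",
			       sz[j], tmin, 256*sz[j]/tmin/1e9);
		}
		free(buf);
		return 0;
	}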

>             clock_gettime(CLOCK_REALTIME, &tv);
>             tv.tv_sec -= tv0.tv_sec;
>             if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
>                 tv.tv_nsec += 1000000000;
>                 tv.tv_sec--;
>             }
> 	    arr[k] = tv.tv_sec + (double)tv.tv_nsec/1000000000;
>           }
>           qsort(arr, 500, sizeof(double), cmp); 

just take the minimum. we want to know the fastest execution.

>           
> 	  for (int m = 100; m < LOOP_TIMES - 100; m++) {
>               times += arr[m];
>           }
> 	  printf("len: %ld  time: %.9lf\n",j, times);

you can also print GB/s which is 256*sz[j]/tmin in my example.

> 	}
>         free(buf);
>         return 0;
> }



* Re: Re: Re: [musl] memset_riscv64
  2023-04-21 13:30         ` Szabolcs Nagy
@ 2023-04-21 14:50           ` Pedro Falcato
  2023-04-21 16:54             ` Rich Felker
  2023-04-26  7:25           ` 张飞
  1 sibling, 1 reply; 10+ messages in thread
From: Pedro Falcato @ 2023-04-21 14:50 UTC (permalink / raw)
  To: musl, 张飞; +Cc: nsz

On Fri, Apr 21, 2023 at 2:37 PM Szabolcs Nagy <nsz@port70.net> wrote:
>
> * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> > Hi!
> > Following your suggestions, I referred to string.c in musl's test set (libc-bench)
> > and modified my test case. Since BUFLEN is a fixed value in strlen.c, I turned it into
> > a variable that is passed to the memset function as a parameter. I increased LOOP_TIMES
> > to 500, sorted the measured run times, and recorded only the middle 300 of them.
> >
> > I ran the two programs alternately on the SiFive chip, three times each; the results
> > are shown below.
> >                              First run result
> > --------------------------------------------------------------------------------
> > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > --------------------------------------------------------------------------------
> > 100                 0.002208102                     0.002304056
> > 200                 0.005053208                     0.004629598
> > 400                 0.008666684                     0.007739176
> > 800                 0.014065196                     0.012372702
> > 1600                0.023377685                     0.020090966
> > 3200                0.040221849                     0.034059631
> > 6400                0.072095377                     0.060028906
> > 12800               0.134040475                     0.110039387
> > 25600               0.257426806                     0.210710952
> > 51200               1.173755160                     1.121833227
> > 102400              3.693170402                     3.637194098
> > 204800              8.919975455                     8.865504460
> > 409600             19.410922418                    19.360956493
> > --------------------------------------------------------------------------------
> >
> >                              Second run result
> > --------------------------------------------------------------------------------
> > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > --------------------------------------------------------------------------------
> > 100                 0.002208109                     0.002293857
> > 200                 0.005057374                     0.004640669
> > 400                 0.008674218                     0.007760795
> > 800                 0.014068582                     0.012417084
> > 1600                0.023381095                     0.020124496
> > 3200                0.040225138                     0.034093181
> > 6400                0.072098744                     0.060069574
> > 12800               0.134043954                     0.110088141
> > 25600               0.256453187                     0.208578633
> > 51200               1.166602505                     1.118972796
> > 102400              3.684957231                     3.635116808
> > 204800              8.916302592                     8.861590734
> > 409600             19.411057216                    19.358777670
> > --------------------------------------------------------------------------------
> >
> >                              Third run result
> > --------------------------------------------------------------------------------
> > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > --------------------------------------------------------------------------------
> > 100                 0.002208111                     0.002293227
> > 200                 0.005056101                     0.004628539
> > 400                 0.008677756                     0.007748687
> > 800                 0.014085242                     0.012404443
> > 1600                0.023397782                     0.020115710
> > 3200                0.040242985                     0.034084435
> > 6400                0.072116665                     0.060063767
> > 12800               0.134060262                     0.110082427
> > 25600               0.257865186                     0.209101754
> > 51200               1.174257177                     1.117753408
> > 102400              3.696518162                     3.635417503
> > 204800              8.929357747                     8.858765915
> > 409600             19.426520562                     19.356515671
> > --------------------------------------------------------------------------------
> >
> > From the test results, the runtime of the memset implemented in base-instruction-set assembly
> > is generally shorter than that of the C implementation.
> > May I ask whether these results are convincing?
>
> small sizes are much more common than large sizes, memsets can be
> distributed such that sizes [0,100), [100,1000), [1000,inf) are
> used for 1/3 of all memsets each (not the call count, but the
> amount of bytes memset using such sizes), i.e. if you speed up
> the size = [100,1000) and [1000,inf) cases by 10% but regress the
> [0,100) case by 20% then the overall performance roughly stays
> the same. (of course this is very workload dependent, but across
> a system this is what i'd expect, probably even more skewed to
> smaller sizes).
>
> so we need to know what happens in the [0,100) range. what i see
> is a ~4% regression there while there is a ~10% improvement in
> the [100,1000) case and ~15% improvement in the [1000,inf) case
> (it would be nice to know why the 25k case is so much faster and
> why that speed up only applies to that size, we don't want to
> optimize for some obscure cpu bug that will go away next year)
>
> on practical workloads i would expect < 10% speedup overall from
> the asm code (but we need more data in the [0,100) range to tell).
> this may not be enough to justify the asm code.
>
> rich already said he prefers a different style of implementation
> (where the body of the function is in c but the inner loop is in
> asm if that helps e.g. via simd).

I don't think writing it all in C is viable, at least if you want to
squeeze every last bit of performance out of it (while avoiding
idiotic codegen that sometimes pops up).
Even with inline asm, I severely question its effectiveness. As I see
it, we have two major roadblocks for fast stringops support (and one
more for riscv):

1) Support GNU_IFUNC (as glibc, FreeBSD, etc do) to automatically
dispatch stringops functions to the best implementation according to
the CPU feature set (a rough sketch of this follows at the end of this
message). I have no good solution for static linking folks.
2) (Optional) Play around with C codegen that could add SIMD, inline
asm to try to make it fast-ish. LLVM folks have played around with
string ops written entirely in C++ through
__builtin_memcpy_inline (which makes smart choices wrt overlapping
loads/stores, SIMD, etc depending on the size). Sadly,
__builtin_memcpy_inline is/was not available in GCC.

Testing the performance of C+inline asm vs pure asm would be interesting.

3) Write riscv stringops code in assembly once CPUs get more advanced
and we finally get a good idea on how the things perform. I still
think it's too new to optimize specifically for.
Extensions are popping up left and right, vector extensions aren't yet
properly supported in the kernel, and (most importantly) we don't have
a proper way to detect riscv features just yet.
For instance, doing unaligned accesses may either have little to no
performance penalty, they may have a big performance penalty (trapped
to M mode and emulated), or they may just not be supported at all.
AFAIK, atm Linux userspace has no way of finding this out (patchset
for this is still pending I think?), and things like the existence of
cheap unaligned accesses are a make-or-break for stringops as you get
to avoid *soooo* many branches.

In the RISCV case, you probably want to end up with at least 3 mem*
variants (no-unaligned, unaligned, vector).

Anyway, this was just a general brain-dump of my thoughts. The lack of
fast stringops is, AIUI, a problem in most architectures musl
supports.
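
For illustration, the ifunc dispatch mentioned in point 1 looks roughly
like this (implementation and probe names are made up; the resolver runs
once, at relocation time):

	#include <stddef.h>

	/* two hypothetical implementations, defined elsewhere */
	void *memset_generic(void *d, int c, size_t n);
	void *memset_rvv(void *d, int c, size_t n);

	/* placeholder: riscv userspace has no stable feature probe yet,
	   so this always falls back to the generic version */
	static int cpu_has_rvv(void) { return 0; }

	/* resolver: returns the function the dynamic linker should bind to */
	static void *(*resolve_memset(void))(void *, int, size_t)
	{
		return cpu_has_rvv() ? memset_rvv : memset_generic;
	}

	void *memset(void *, int, size_t) __attribute__((ifunc("resolve_memset")));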

> here is an example of a benchmark that takes input distribution
> into account from a workload:
> https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53

For some more data, here's the distribution on a workload (kernel
compiling I think?) on a FreeBSD kernel:
https://people.freebsd.org/~mjg/bufsizes.txt. The sizes will ofc vary
between software, but this is just a cute datapoint on kernel stuff.
Userspace probably has lots of bigger copies/memsets.

>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <time.h>
> >
> > #define BUFLEN 500000
> > #define LOOP_TIMES 500
> >
> > int cmp(const void *a, const void *b) {
> >     double x = *(double *)a;
> >     double y = *(double *)b;
> >     if (x < y) return -1;
> >     if (x > y) return 1;
> >     return 0;
> > }
> >
> > int main(){
> >         char *buf = malloc(BUFLEN);
> >       double *arr = malloc(sizeof(double) * LOOP_TIMES);
> >         size_t i,j,k;
> >         struct timespec tv0,tv;
> >       double times;
> >
> >         for(j=100; j<BUFLEN; j*=2){
> >           for(k=0; k<LOOP_TIMES; k++){
> >             for (i=0; i<100; i++)
> >                   memset(buf+i, i, j-i);
> >           }
> >         }
> >
> >         for(j=100; j<BUFLEN; j*=2){
> >           for(k=0; k<LOOP_TIMES; k++){
> >             clock_gettime(CLOCK_REALTIME, &tv0);
> >             for (i=0; i<100; i++)
> >                   memset(buf+i, i, j-i);
>
> alignment only matters up to 64 byte alignment and usually inputs
> are at least 8byte aligned.
>
> value is almost always 0. (we probably don't even need to test
> non-0 case: a 0 check is correctly predicted in practice.)
>
> i think length should have a small variation, just enough to add
> penalty to small size checks where implementations may use many
> branches.
>
> so something like this may be better (madeup off,al numbers):
>
>         buf = malloc((1<<16)+32);
>         size_t sz[] = {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
>         size_t off[16] = {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12};
>         size_t al[16] = {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};
>         for (j=0; sz[j]; j++)
>                 for (k=0; k<20; k++) {
>                         t0 = tic();
>                         // large loop count is important for small sizes
>                         for (i=0; i<256; i++)
>                                 memset(buf + al[i%16], 0, sz[j] + off[i%16]);
>                         t1 = tic();
>                         tmin = min(tmin,t1-t0);
>                 }
>
> large memset (>=1k) can be tested separately (no.need to add off,al
> variaion then, less inner loop is enough, but it should not hurt to
> include them here).
>
> >             clock_gettime(CLOCK_REALTIME, &tv);
> >             tv.tv_sec -= tv0.tv_sec;
> >             if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
> >                 tv.tv_nsec += 1000000000;
> >                 tv.tv_sec--;
> >             }
> >           arr[k] = tv.tv_sec + (double)tv.tv_nsec/1000000000;
> >           }
> >           qsort(arr, 500, sizeof(double), cmp);
>
> just take the minimum. we want to know the fastest execution.
>
> >
> >         for (int m = 100; m < LOOP_TIMES - 100; m++) {
> >               times += arr[m];
> >           }
> >         printf("len: %ld  time: %.9lf\n",j, times);
>
> you can also print GB/s which is 256*sz[j]/tmin in my example.
>
> >       }
> >         free(buf);
> >         return 0;
> > }
>

I know folks are somewhat allergic to C++ here, but I wholeheartedly
recommend Google Benchmark for microbenchmarking needs.
See https://gist.github.com/heatd/165b70a4e0b75e815b82d723c01637dc for
something I used to benchmark my memcpy (partially taken from
AOSP/bionic).

-- 
Pedro


* Re: Re: Re: [musl] memset_riscv64
  2023-04-21 14:50           ` Pedro Falcato
@ 2023-04-21 16:54             ` Rich Felker
  2023-04-21 17:01               ` enh
  0 siblings, 1 reply; 10+ messages in thread
From: Rich Felker @ 2023-04-21 16:54 UTC (permalink / raw)
  To: Pedro Falcato; +Cc: musl, 张飞, nsz

On Fri, Apr 21, 2023 at 03:50:45PM +0100, Pedro Falcato wrote:
> On Fri, Apr 21, 2023 at 2:37 PM Szabolcs Nagy <nsz@port70.net> wrote:
> >
> > * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> > > Hi!
> > > Following your suggestions, I referred to string.c in musl's test set (libc-bench)
> > > and modified my test case. Since BUFLEN is a fixed value in strlen.c, I turned it into
> > > a variable that is passed to the memset function as a parameter. I increased LOOP_TIMES
> > > to 500, sorted the measured run times, and recorded only the middle 300 of them.
> > >
> > > I ran the two programs alternately on the SiFive chip, three times each; the results
> > > are shown below.
> > >                              First run result
> > > --------------------------------------------------------------------------------
> > > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > > --------------------------------------------------------------------------------
> > > 100                 0.002208102                     0.002304056
> > > 200                 0.005053208                     0.004629598
> > > 400                 0.008666684                     0.007739176
> > > 800                 0.014065196                     0.012372702
> > > 1600                0.023377685                     0.020090966
> > > 3200                0.040221849                     0.034059631
> > > 6400                0.072095377                     0.060028906
> > > 12800               0.134040475                     0.110039387
> > > 25600               0.257426806                     0.210710952
> > > 51200               1.173755160                     1.121833227
> > > 102400              3.693170402                     3.637194098
> > > 204800              8.919975455                     8.865504460
> > > 409600             19.410922418                    19.360956493
> > > --------------------------------------------------------------------------------
> > >
> > >                              Second run result
> > > --------------------------------------------------------------------------------
> > > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > > --------------------------------------------------------------------------------
> > > 100                 0.002208109                     0.002293857
> > > 200                 0.005057374                     0.004640669
> > > 400                 0.008674218                     0.007760795
> > > 800                 0.014068582                     0.012417084
> > > 1600                0.023381095                     0.020124496
> > > 3200                0.040225138                     0.034093181
> > > 6400                0.072098744                     0.060069574
> > > 12800               0.134043954                     0.110088141
> > > 25600               0.256453187                     0.208578633
> > > 51200               1.166602505                     1.118972796
> > > 102400              3.684957231                     3.635116808
> > > 204800              8.916302592                     8.861590734
> > > 409600             19.411057216                    19.358777670
> > > --------------------------------------------------------------------------------
> > >
> > >                              Third run result
> > > --------------------------------------------------------------------------------
> > > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > > --------------------------------------------------------------------------------
> > > 100                 0.002208111                     0.002293227
> > > 200                 0.005056101                     0.004628539
> > > 400                 0.008677756                     0.007748687
> > > 800                 0.014085242                     0.012404443
> > > 1600                0.023397782                     0.020115710
> > > 3200                0.040242985                     0.034084435
> > > 6400                0.072116665                     0.060063767
> > > 12800               0.134060262                     0.110082427
> > > 25600               0.257865186                     0.209101754
> > > 51200               1.174257177                     1.117753408
> > > 102400              3.696518162                     3.635417503
> > > 204800              8.929357747                     8.858765915
> > > 409600             19.426520562                     19.356515671
> > > --------------------------------------------------------------------------------
> > >
> > > From the test results, the runtime of the memset implemented in base-instruction-set assembly
> > > is generally shorter than that of the C implementation.
> > > May I ask whether these results are convincing?
> >
> > small sizes are much more common than large sizes, memsets can be
> > distributed such that sizes [0,100), [100,1000), [1000,inf) are
> > used for 1/3 of all memsets each (not the call count, but the
> > amount of bytes memset using such sizes), i.e. if you speed up
> > the size = [100,1000) and [1000,inf) cases by 10% but regress the
> > [0,100) case by 20% then the overall performance roughly stays
> > the same. (of course this is very workload dependent, but across
> > a system this is what i'd expect, probably even more skewed to
> > smaller sizes).
> >
> > so we need to know what happens in the [0,100) range. what i see
> > is a ~4% regression there while there is a ~10% improvement in
> > the [100,1000) case and ~15% improvement in the [1000,inf) case
> > (it would be nice to know why the 25k case is so much faster and
> > why that speed up only applies to that size, we don't want to
> > optimize for some obscure cpu bug that will go away next year)
> >
> > on practical workloads i would expect < 10% speedup overall from
> > the asm code (but we need more data in the [0,100) range to tell).
> > this may not be enough to justify the asm code.
> >
> > rich already said he prefers a different style of implementation
> > (where the body of the function is in c but the inner loop is in
> > asm if that helps e.g. via simd).
> 
> I don't think writing it all in C is viable, at least if you want to
> squeeze every last bit of performance out of it (while avoiding
> idiotic codegen that sometimes pops up).
> Even with inline asm, I severely question its effectiveness. As I see

I don't see any good reason for this doubt. If you claim it's not
viable, you should show cases where you really can't get the compiler
to do something reasonable with this type of code.

If the loop body were tiny and the loop control were a significant
portion of the loop execution overhead, then I could see this
potentially being a problem. But the main/only interesting case for
asm is where you're operating on largeish blocks.

> it, we have two major roadblocks for fast stringops support (and one
> more for riscv):
> 
> 1) Support GNU_IFUNC (as glibc, FreeBSD, etc do) to automatically
> dispatch stringops functions to the best implementation according to
> the CPU feature set. I have no good solution for static linking folks.

Of course this is not an option, but it's also not needed. There is
only relevant dispatch cost when size is small, but you don't want or
need to dispatch to asm variants when size is small, so the dispatch
goes in the branch for large sizes, and the cost is effectively zero.

> 2) (Optional) Play around with C codegen that could add SIMD, inline
> asm to try to make it fast-ish. LLVM folks have played around with
> string ops written entirely in C++ through
> __builtin_memcpy_inline (which does smart choices wrt overlapping
> loads/stores, SIMD, etc depending on the size). Sadly,
> __builtin_memcpy_inline is/was not available in GCC.

The basic strategy here is to do head/tail of the operation with plain
portable C, in a minimal-length, minimal-branch fast path. Then, if
any middle that wasn't covered by the head/tail remains, either use an
arch-provided block operation primitive that's (for example; subject
to tuning) allowed to assume alignment of size and either src or dest,
or dispatch to a hwcap-specific bulk operation in asm that can make
similar assumptions.
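
A rough sketch of that shape (illustrative only, not musl's actual code;
__memset_bulk_aligned is a made-up name for the arch/hwcap-specific bulk
routine, and the 32-byte cut-off would need tuning):

	#include <stdint.h>
	#include <stddef.h>

	/* hypothetical bulk routine (asm or hwcap-dispatched); it may assume
	   the destination pointer and the length are both 8-byte aligned */
	void __memset_bulk_aligned(void *dst, uint64_t v, size_t n);

	void *my_memset(void *dest, int c, size_t n)
	{
		unsigned char *d = dest;

		if (n < 32) {
			/* short sizes: stay in plain, branch-light C */
			for (size_t i = 0; i < n; i++) d[i] = c;
			return dest;
		}

		/* unconditional unaligned head and tail */
		for (size_t i = 0; i < 8; i++) d[i] = c;
		for (size_t i = n - 8; i < n; i++) d[i] = c;

		/* aligned middle: the dispatch (and any hwcap check) sits only
		   on this large-size path, so its cost is negligible */
		uint64_t v = 0x0101010101010101ull * (unsigned char)c;
		uintptr_t a   = ((uintptr_t)d + 7) & (uintptr_t)-8;
		uintptr_t end = ((uintptr_t)d + n) & (uintptr_t)-8;
		__memset_bulk_aligned((void *)a, v, end - a);
		return dest;
	}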

> Testing the performance of C+inline asm vs pure asm would be interesting.

Yes but I don't think we'll find anything unexpected. In theory you
can probably shave a couple cycles writing the asm by hand, but that
has a lot of costs that aren't sustainable, and pessimizes things like
LTO (for example, in LTO, the short-size fast paths may be able to be
inlined when the exact size isn't known but value range analysis
determines it's always in the small range).

> 3) Write riscv stringops code in assembly once CPUs get more advanced
> and we finally get a good idea on how the things perform. I still
> think it's too new to optimize specifically for.
> Extensions are popping up left and right, vector extensions aren't yet
> properly supported in the kernel, and (most importantly) we don't have
> a proper way to detect riscv features just yet.
> For instance, doing unaligned accesses may either have little to no
> performance penalty, they may have a big performance penalty (trapped
> to M mode and emulated), or they may just not be supported at all.
> AFAIK, atm Linux userspace has no way of finding this out (patchset
> for this is still pending I think?), and things like the existence of
> cheap unaligned accesses are a make-or-break for stringops as you get
> to avoid *soooo* many branches.

Yes, for RISC-V there is no way forward on vector or other ISA
extensions until the specs are firmed up and the framework for
detecting their presence is in place.

> In the RISCV case, you probably want to end up with at least 3 mem*
> variants (no-unaligned, unaligned, vector).

For memset, there's hardly a reason to waste effort on unaligned
versions. The middle is always aligned. For memcpy/memmove, where you
have src and dest misaligned modulo each other, the ability to do
unaligned loads or stores is valuable, and something the general
framework (in C) should allow us to take advantage of. I hadn't really
considered the possibility that we might want to support
unaligned-access support that's only known at runtime, rather than
part of the ISA level you're building for, so perhaps this is
something we should consider.
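
Purely as an illustration of that last point (the flag and helper below
are hypothetical, not an existing musl mechanism), the bulk path of a C
memcpy framework could branch once on such a runtime capability:

#include <stddef.h>
#include <stdint.h>

/* hypothetical flag, filled in once at startup from whatever detection
   mechanism (hwcap/hwprobe) eventually lands */
static int misaligned_ok;

/* bulk copy where dest is already aligned by the caller but src may not
   be; the wide-load branch stands in for what the asm would do */
static void copy_bulk(unsigned char *d, const unsigned char *s, size_t n)
{
	if (misaligned_ok || ((uintptr_t)s & 7) == 0) {
		for (; n >= 8; d += 8, s += 8, n -= 8)
			/* the source load may be misaligned; this path is only
			   taken when that is known to be cheap */
			*(uint64_t *)(void *)d = *(const uint64_t *)(const void *)s;
	}
	while (n--) *d++ = *s++; /* tail, or the whole copy otherwise */
}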

Rich

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: Re: [musl] memset_riscv64
  2023-04-21 16:54             ` Rich Felker
@ 2023-04-21 17:01               ` enh
  0 siblings, 0 replies; 10+ messages in thread
From: enh @ 2023-04-21 17:01 UTC (permalink / raw)
  To: musl; +Cc: Pedro Falcato, 张飞, nsz

[-- Attachment #1: Type: text/plain, Size: 11477 bytes --]

On Fri, Apr 21, 2023 at 9:54 AM Rich Felker <dalias@libc.org> wrote:

> On Fri, Apr 21, 2023 at 03:50:45PM +0100, Pedro Falcato wrote:
> > On Fri, Apr 21, 2023 at 2:37 PM Szabolcs Nagy <nsz@port70.net> wrote:
> > >
> > > * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> > > > Hi!
> > > > I listened to your suggestions and referred to string.c in Musl's test set(libc-bench),
> > > > and then modified the test cases. Since BUFLEN is a fixed value in strlen.c, I modified
> > > > it to a variable as a parameter in my own test case and passed it to the memset function.
> > > > I adjusted the LOOP_TIMES has been counted up to 500 times and the running time has been
> > > > sorted, only recording the running time of the middle 300 times.
> > > >
> > > > I took turns executing two programs on the SiFive chip three times each, and the results
> > > > are shown below.
> > > >                              First run result
> > > > --------------------------------------------------------------------------------
> > > > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > > > --------------------------------------------------------------------------------
> > > > 100                 0.002208102                     0.002304056
> > > > 200                 0.005053208                     0.004629598
> > > > 400                 0.008666684                     0.007739176
> > > > 800                 0.014065196                     0.012372702
> > > > 1600                0.023377685                     0.020090966
> > > > 3200                0.040221849                     0.034059631
> > > > 6400                0.072095377                     0.060028906
> > > > 12800               0.134040475                     0.110039387
> > > > 25600               0.257426806                     0.210710952
> > > > 51200               1.173755160                     1.121833227
> > > > 102400              3.693170402                     3.637194098
> > > > 204800              8.919975455                     8.865504460
> > > > 409600             19.410922418                    19.360956493
> > > > --------------------------------------------------------------------------------
> > > >
> > > >                              Second run result
> > > > --------------------------------------------------------------------------------
> > > > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > > > --------------------------------------------------------------------------------
> > > > 100                 0.002208109                     0.002293857
> > > > 200                 0.005057374                     0.004640669
> > > > 400                 0.008674218                     0.007760795
> > > > 800                 0.014068582                     0.012417084
> > > > 1600                0.023381095                     0.020124496
> > > > 3200                0.040225138                     0.034093181
> > > > 6400                0.072098744                     0.060069574
> > > > 12800               0.134043954                     0.110088141
> > > > 25600               0.256453187                     0.208578633
> > > > 51200               1.166602505                     1.118972796
> > > > 102400              3.684957231                     3.635116808
> > > > 204800              8.916302592                     8.861590734
> > > > 409600             19.411057216                    19.358777670
> > > > --------------------------------------------------------------------------------
> > > >
> > > >                              Third run result
> > > > --------------------------------------------------------------------------------
> > > > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > > > --------------------------------------------------------------------------------
> > > > 100                 0.002208111                     0.002293227
> > > > 200                 0.005056101                     0.004628539
> > > > 400                 0.008677756                     0.007748687
> > > > 800                 0.014085242                     0.012404443
> > > > 1600                0.023397782                     0.020115710
> > > > 3200                0.040242985                     0.034084435
> > > > 6400                0.072116665                     0.060063767
> > > > 12800               0.134060262                     0.110082427
> > > > 25600               0.257865186                     0.209101754
> > > > 51200               1.174257177                     1.117753408
> > > > 102400              3.696518162                     3.635417503
> > > > 204800              8.929357747                     8.858765915
> > > > 409600             19.426520562                     19.356515671
> > > > --------------------------------------------------------------------------------
> > > >
> > > > From the test results, it can be seen that the runtime of memset implemented using the basic
> > > > instruction set assembly is basically shorter than that implemented using the C language.
> > > > May I ask if the test results are convincing?
> > >
> > > small sizes are much more common than large sizes, memsets can be
> > > distributed such that sizes [0,100), [100,1000), [1000,inf) are
> > > used for 1/3 of all memsets each (not the call count, but the
> > > amount of bytes memset using such sizes), i.e. if you speed up
> > > the size = [100,1000) and [1000,inf) cases by 10% but regress the
> > > [0,100) case by 20% then the overall performance roughly stays
> > > the same. (of course this is very workload dependent, but across
> > > a system this is what i'd expect, probably even more skewed to
> > > smaller sizes).
> > >
> > > so we need to know what happens in the [0,100) range. what i see
> > > is a ~4% regression there while there is a ~10% improvement in
> > > the [100,1000) case and ~15% improvement in the [1000,inf) case
> > > (it would be nice to know why the 25k case is so much faster and
> > > why that speed up only applies to that size, we don't want to
> > > optimize for some obscure cpu bug that will go away next year)
> > >
> > > on practical workloads i would expect < 10% speedup overall from
> > > the asm code (but we need more data in the [0,100) range to tell).
> > > this may not be enough to justify the asm code.
> > >
> > > rich already said he prefers a different style of implementation
> > > (where the body of the function is in c but the inner loop is in
> > > asm if that helps e.g. via simd).
> >
> > I don't think writing it all in C is viable, at least if you want to
> > squeeze every last bit of performance out of it (while avoiding
> > idiotic codegen that sometimes pops up).
> > Even with inline asm, I severely question its effectiveness. As I see
>
> I don't see any good reason for this doubt. If you claim it's not
> viable, you should show cases where you really can't get the compiler
> to do something reasonable with this type of code.
>
> If the loop body were tiny and the loop control were a significant
> portion of the loop execution overhead, then I could see this
> potentially being a problem. But the main/only interesting case for
> asm is where you're operating on largeish blocks.
>
> > it, we have two major roadblocks for fast stringops support (and one
> > more for riscv):
> >
> > 1) Support GNU_IFUNC (as glibc, FreeBSD, etc do) to automatically
> > dispatch stringops functions to the best implementation according to
> > the CPU feature set. I have no good solution for static linking folks.
>
> Of course this is not an option, but it's also not needed. There is
> only relevant dispatch cost when size is small, but you don't want or
> need to dispatch to asm variants when size is small, so the dispatch
> goes in the branch for large sizes, and the cost is effectively zero.
>
> > 2) (Optional) Play around with C codegen that could add SIMD, inline
> > asm to try to make it fast-ish. LLVM folks have played around with
> > string ops written entirely in C++ through
> > __builtin_memcpy_inline (which does smart choices wrt overlapping
> > loads/stores, SIMD, etc depending on the size). Sadly,
> > __builtin_memcpy_inline is/was not available in GCC.
>
> The basic strategy here is to do head/tail of the operation with plain
> portable C, in a minimal-length, minimal-branch fast path. Then, if
> any middle that wasn't covered by the head/tail remains, either use an
> arch-provided block operation primitive that's (for example; subject
> to tuning) allowed to assume alignment of size and either src or dest,
> or dispatch to a hwcap-specific bulk operation in asm that can make
> similar assumptions.
>
> > Testing the performance of C+inline asm vs pure asm would be interesting.
>
> Yes but I don't think we'll find anything unexpected. In theory you
> can probably shave a couple cycles writing the asm by hand, but that
> has a lot of costs that aren't sustainable, and pessimizes things like
> LTO (for example, in LTO, the short-size fast paths may be able to be
> inlined when the exact size isn't known but value range analysis
> determines it's always in the small range).
>
> > 3) Write riscv stringops code in assembly once CPUs get more advanced
> > and we finally get a good idea on how the things perform. I still
> > think it's too new to optimize specifically for.
> > Extensions are popping up left and right, vector extensions aren't yet
> > properly supported in the kernel, and (most importantly) we don't have
> > a proper way to detect riscv features just yet.
> > For instance, doing unaligned accesses may either have little to no
> > performance penalty, they may have a big performance penalty (trapped
> > to M mode and emulated), or they may just not be supported at all.
> > AFAIK, atm Linux userspace has no way of finding this out (patchset
> > for this is still pending I think?), and things like the existence of
> > cheap unaligned accesses are a make-or-break for stringops as you get
> > to avoid *soooo* many branches.
>
> Yes, for RISC-V there is no way forward on vector or other ISA
> extensions until the specs are firmed up and the framework for
> detecting their presence is in place.
>

(the vector state save/restore stuff still isn't in yet, which is making me
worry it won't make linux 6.4, but fwiw the risc-v hwprobe patches went
into linux-next earlier this week, so there's some progress there at
least...)


> > In the RISCV case, you probably want to end up with at least 3 mem*
> > variants (no-unaligned, unaligned, vector).
>
> For memset, there's hardly a reason to waste effort on unaligned
> versions. The middle is always aligned. For memcpy/memmove, where you
> have src and dest misaligned modulo each other, the ability to do
> unaligned loads or stores is valuable, and something the general
> framework (in C) should allow us to take advantage of. I hadn't really
> considered the possibility that we might want to support
> unaligned-access support that's only known at runtime, rather than
> part of the ISA level you're building for, so perhaps this is
> something we should consider.
>
> Rich
>

[-- Attachment #2: Type: text/html, Size: 14167 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: Re: Re: [musl] memset_riscv64
  2023-04-21 13:30         ` Szabolcs Nagy
  2023-04-21 14:50           ` Pedro Falcato
@ 2023-04-26  7:25           ` 张飞
  1 sibling, 0 replies; 10+ messages in thread
From: 张飞 @ 2023-04-26  7:25 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 17550 bytes --]

Hi!
I used your test example and modified the code from the link you provided so that it can
run on the riscv platform; see the attached test_memset1.c and test_memset2.c.

Here are the test results for test_memset1.c:
                                 First run result
------------------------------------------------------------------------------------
              |   C language implementation      | Basic instruction implementation
------------------------------------------------------------------------------------
size(bytes)   |  min_time(s)       speed(GB/s)   |  min_time(s)       speed(GB/s)
------------------------------------------------------------------------------------
16            |  0.000022152          0.18       |  0.000020818          0.20       
32            |  0.000026968          0.30       |  0.000023380          0.35       
48            |  0.000027450          0.45       |  0.000023860          0.52       
64            |  0.000029215          0.56       |  0.000024341          0.67       
96            |  0.000030178          0.81       |  0.000025302          0.97       
200           |  0.000033228          1.54       |  0.000027864          1.84       
300           |  0.000036279          2.12       |  0.000031227          2.46       
400           |  0.000039810          2.57       |  0.000033949          3.02       
600           |  0.000047836          3.21       |  0.000040515          3.79       
1024          |  0.000064531          4.06       |  0.000054127          4.84      
2048          |  0.000106910          4.90       |  0.000089998          5.83      
4096          |  0.000191828          5.47       |  0.000155656          6.74       
8192          |  0.000356046          5.89       |  0.000286811          7.31       
16384         |  0.000685124          6.12       |  0.000549441          7.63       
32768         |  0.001460304          5.74       |  0.001222189          6.86      
65536         |  0.012082280          1.39       |  0.012054872          1.39
------------------------------------------------------------------------------------

                                 Second run result 
------------------------------------------------------------------------------------
              |   C language implementation      | Basic instruction implementation
------------------------------------------------------------------------------------
size(bytes)   |  min_time(s)       speed(GB/s)   |  min_time(s)       speed(GB/s)
------------------------------------------------------------------------------------
16            |  0.000021755          0.19       |  0.000020750          0.20      
32            |  0.000026484          0.31       |  0.000022810          0.36      
48            |  0.000026957          0.46       |  0.000023601          0.52       
64            |  0.000028692          0.57       |  0.000023918          0.69       
96            |  0.000029638          0.83       |  0.000024868          0.99       
200           |  0.000032633          1.57       |  0.000027403          1.87       
300           |  0.000035628          2.16       |  0.000030887          2.49       
400           |  0.000038781          2.64       |  0.000033580          3.05        
600           |  0.000046979          3.27       |  0.000040233          3.82       
1024          |  0.000063532          4.13       |  0.000053538          4.90     
2048          |  0.000104993          4.99       |  0.000088861          5.90    
4096          |  0.000188389          5.57       |  0.000153804          6.82        
8192          |  0.000349664          6.00       |  0.000283691          7.39       
16384         |  0.000673000          6.23       |  0.000543464          7.72       
32768         |  0.001433181          5.85       |  0.001217448          6.89     
65536         |  0.011850111          1.42       |  0.011945281          1.40
------------------------------------------------------------------------------------

                                 Third run result 
------------------------------------------------------------------------------------
              |   C language implementation      | Basic instruction implementation
------------------------------------------------------------------------------------
size(bytes)   |  min_time(s)       speed(GB/s)   |  min_time(s)       speed(GB/s)
------------------------------------------------------------------------------------
16            |  0.000021885          0.19       |  0.000020816          0.20      
32            |  0.000026642          0.31       |  0.000023040          0.36      
48            |  0.000027118          0.45       |  0.000023676          0.52       
64            |  0.000028863          0.57       |  0.000024311          0.67      
96            |  0.000029814          0.82       |  0.000024947          0.99      
200           |  0.000034413          1.49       |  0.000027648          1.85       
300           |  0.000035841          2.14       |  0.000031144          2.47       
400           |  0.000039329          2.60       |  0.000034005          3.01        
600           |  0.000047259          3.25       |  0.000040360          3.81       
1024          |  0.000063752          4.11       |  0.000053867          4.87     
2048          |  0.000105620          4.96       |  0.000089302          5.87   
4096          |  0.000189513          5.53       |  0.000154610          6.78       
8192          |  0.000351749          5.96       |  0.000284591          7.37        
16384         |  0.000676855          6.20       |  0.000545187          7.69        
32768         |  0.001440141          5.82       |  0.001208756          6.94      
65536         |  0.011974218          1.40       |  0.011976172          1.40
------------------------------------------------------------------------------------

Here are the test results for test_memset2.c:

                        C language implementation
Random memset (bytes/ns):
           memset_call 32K: 0.36 64K: 0.29 128K: 0.25 256K: 0.23 512K: 0.22 1024K: 0.21 avg 0.25

Medium memset (bytes/ns):
           memset_call 8B: 0.28 16B: 0.30 32B: 0.48 64B: 0.86 128B: 1.55 256B: 2.60 512B: 3.86
Large memset (bytes/ns):
           memset_call 1K: 4.82 2K: 5.40 4K: 5.83 8K: 6.09 16K: 6.22 32K: 6.14 64K: 1.39

                        Basic instruction implementation
Random memset (bytes/ns):
           memset_call 32K: 0.45 64K: 0.35 128K: 0.30 256K: 0.28 512K: 0.27 1024K: 0.25 avg 0.30

Medium memset (bytes/ns):
           memset_call 8B: 0.18 16B: 0.48 32B: 0.91 64B: 1.63 128B: 2.71 256B: 4.40 512B: 5.67
Large memset (bytes/ns):
           memset_call 1K: 6.62 2K: 7.03 4K: 7.47 8K: 7.71 16K: 7.83 32K: 7.64 64K: 1.40

From the test results, it can be seen that the memset implemented in C performs better
at around 8 bytes, while at all other sizes the assembly implementation performs better.

Fei Zhang
> -----Original Message-----
> From: "Szabolcs Nagy" <nsz@port70.net>
> Sent: 2023-04-21 21:30:34 (Friday)
> To: "张飞" <zhangfei@nj.iscas.ac.cn>
> Cc: musl@lists.openwall.com
> Subject: Re: Re: Re: [musl] memset_riscv64
> 
> * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> > Hi!
> > I listened to your suggestions and referred to string.c in Musl's test set(libc-bench),
> > and then modified the test cases. Since BUFLEN is a fixed value in strlen.c, I modified
> > it to a variable as a parameter in my own test case and passed it to the memset function.
> > I adjusted the LOOP_TIMES has been counted up to 500 times and the running time has been
> > sorted, only recording the running time of the middle 300 times.
> > 
> > I took turns executing two programs on the SiFive chip three times each, and the results
> > are shown below.
> >                              First run result
> > --------------------------------------------------------------------------------
> > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > --------------------------------------------------------------------------------
> > 100                 0.002208102                     0.002304056
> > 200                 0.005053208                     0.004629598
> > 400                 0.008666684                     0.007739176
> > 800                 0.014065196                     0.012372702
> > 1600                0.023377685                     0.020090966
> > 3200                0.040221849                     0.034059631
> > 6400                0.072095377                     0.060028906
> > 12800               0.134040475                     0.110039387
> > 25600               0.257426806                     0.210710952
> > 51200               1.173755160                     1.121833227
> > 102400              3.693170402                     3.637194098
> > 204800              8.919975455                     8.865504460
> > 409600             19.410922418                    19.360956493
> > --------------------------------------------------------------------------------
> > 
> >                              Second run result
> > --------------------------------------------------------------------------------
> > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > --------------------------------------------------------------------------------
> > 100                 0.002208109                     0.002293857
> > 200                 0.005057374                     0.004640669
> > 400                 0.008674218                     0.007760795
> > 800                 0.014068582                     0.012417084
> > 1600                0.023381095                     0.020124496
> > 3200                0.040225138                     0.034093181
> > 6400                0.072098744                     0.060069574
> > 12800               0.134043954                     0.110088141
> > 25600               0.256453187                     0.208578633
> > 51200               1.166602505                     1.118972796
> > 102400              3.684957231                     3.635116808
> > 204800              8.916302592                     8.861590734
> > 409600             19.411057216                    19.358777670
> > --------------------------------------------------------------------------------
> > 
> >                              Third run result
> > --------------------------------------------------------------------------------
> > length(byte)  C language implementation(s)   Basic instruction implementation(s)
> > --------------------------------------------------------------------------------
> > 100                 0.002208111                     0.002293227
> > 200                 0.005056101                     0.004628539
> > 400                 0.008677756                     0.007748687
> > 800                 0.014085242                     0.012404443
> > 1600                0.023397782                     0.020115710
> > 3200                0.040242985                     0.034084435
> > 6400                0.072116665                     0.060063767
> > 12800               0.134060262                     0.110082427
> > 25600               0.257865186                     0.209101754
> > 51200               1.174257177                     1.117753408
> > 102400              3.696518162                     3.635417503
> > 204800              8.929357747                     8.858765915
> > 409600             19.426520562                     19.356515671
> > --------------------------------------------------------------------------------
> > 
> > From the test results, it can be seen that the runtime of memset implemented using the basic
> > instruction set assembly is basically shorter than that implemented using the C language.
> > May I ask if the test results are convincing?
> 
> small sizes are much more common than large sizes, memsets can be
> distributed such that sizes [0,100), [100,1000), [1000,inf) are
> used for 1/3 of all memsets each (not the call count, but the
> amount of bytes memset using such sizes), i.e. if you speed up
> the size = [100,1000) and [1000,inf) cases by 10% but regress the
> [0,100) case by 20% then the overall performance roughly stays
> the same. (of course this is very workload dependent, but across
> a system this is what i'd expect, probably even more skewed to
> smaller sizes).
> 
> so we need to know what happens in the [0,100) range. what i see
> is a ~4% regression there while there is a ~10% improvement in
> the [100,1000) case and ~15% improvement in the [1000,inf) case
> (it would be nice to know why the 25k case is so much faster and
> why that speed up only applies to that size, we don't want to
> optimize for some obscure cpu bug that will go away next year)
> 
> on practical workloads i would expect < 10% speedup overall from
> the asm code (but we need more data in the [0,100) range to tell).
> this may not be enough to justify the asm code.
> 
> rich already said he prefers a different style of implementation
> (where the body of the function is in c but the inner loop is in
> asm if that helps e.g. via simd).
> 
> here is an example of a benchmark that takes input distribution
> into account from a workload:
> https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
> 
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <time.h>
> > 
> > #define BUFLEN 500000
> > #define LOOP_TIMES 500
> > 
> > int cmp(const void *a, const void *b) {
> >     double x = *(double *)a;
> >     double y = *(double *)b;
> >     if (x < y) return -1;
> >     if (x > y) return 1;
> >     return 0;
> > }
> > 
> > int main(){
> >         char *buf = malloc(BUFLEN);
> >         double *arr = malloc(sizeof(double) * LOOP_TIMES);
> >         size_t i,j,k;
> >         struct timespec tv0,tv;
> >         double times;
> > 
> >         for(j=100; j<BUFLEN; j*=2){
> >           for(k=0; k<LOOP_TIMES; k++){
> >             for (i=0; i<100; i++)
> >                   memset(buf+i, i, j-i);
> >           }
> >         }
> > 
> >         for(j=100; j<BUFLEN; j*=2){
> >           for(k=0; k<LOOP_TIMES; k++){
> >             clock_gettime(CLOCK_REALTIME, &tv0);
> >             for (i=0; i<100; i++)
> >                   memset(buf+i, i, j-i);
> 
> alignment only matters up to 64 byte alignment and usually inputs
> are at least 8-byte aligned.
> 
> value is almost always 0. (we probably don't even need to test
> non-0 case: a 0 check is correctly predicted in practice.)
> 
> i think length should have a small variation, just enough to add
> penalty to small size checks where implementations may use many
> branches.
> 
> so something like this may be better (made-up off,al numbers):
> 
> 	buf = malloc((1<<16)+32);
> 	size_t sz[] = {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
> 	size_t off[16] = {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12};
> 	size_t al[16] = {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};
> 	for (j=0; sz[j]; j++)
> 		for (k=0; k<20; k++) {
> 			t0 = tic();
> 			// large loop count is important for small sizes
> 			for (i=0; i<256; i++)
> 				memset(buf + al[i%16], 0, sz[j] + off[i%16]);
> 			t1 = tic();
> 			tmin = min(tmin,t1-t0);
> 		}
> 
> large memset (>=1k) can be tested separately (no need to add off,al
> variation then, less inner loop is enough, but it should not hurt to
> include them here).
> 
> >             clock_gettime(CLOCK_REALTIME, &tv);
> >             tv.tv_sec -= tv0.tv_sec;
> >             if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
> >                 tv.tv_nsec += 1000000000;
> >                 tv.tv_sec--;
> >             }
> >             arr[k] = tv.tv_sec + (double)tv.tv_nsec/1000000000;
> >           }
> >           qsort(arr, 500, sizeof(double), cmp);
> 
> just take the minimum. we want to know the fastest execution.
> 
> > 
> >           for (int m = 100; m < LOOP_TIMES - 100; m++) {
> >               times += arr[m];
> >           }
> >           printf("len: %ld  time: %.9lf\n",j, times);
> 
> you can also print GB/s which is 256*sz[j]/tmin in my example.
> 
> >         }
> >         free(buf);
> >         return 0;
> > }

[-- Attachment #2: test_memset1.c --]
[-- Type: text/plain, Size: 1395 bytes --]

#include <float.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
int main(){
	int j,k,i;
	double times;
	struct timespec tv0,tv;
	int sz[] = {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
	int off[16] = {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12};
	int al[16] = {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};

	char *buf = malloc((1<<16)+32);
	if (buf == NULL) {
		printf("malloc failed\n");
		exit(1);
	}
	// warm up
	for (j=0; sz[j]; j++)
		for (k=0; k<20; k++)
			for (i=0; i<256; i++)
				memset(buf + al[i%16], 0, sz[j] + off[i%16]);

	printf("%-15s %-20s %-20s\n", "size(bytes)", "min_time(s)", "speed(GB/s)");
	for (j=0; sz[j]; j++) {
	        double min_time = DBL_MAX;
		for (k=0; k<20; k++) {
			clock_gettime(CLOCK_REALTIME, &tv0);
			// large loop count is important for small sizes
			for (i=0; i<256; i++)
				memset(buf + al[i%16], 0, sz[j] + off[i%16]);
			clock_gettime(CLOCK_REALTIME, &tv);
			tv.tv_sec -= tv0.tv_sec;
			if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
				tv.tv_nsec += 1000000000;
				tv.tv_sec--;
			}
			times = tv.tv_sec + (double)tv.tv_nsec/1e9;
			min_time = min_time < times ? min_time : times;
		}
                printf("%-15d %-20.9lf %-20.2lf\n",sz[j], min_time, 256*sz[j] / (min_time * 1e9));
        }
	return 0;
}

[-- Attachment #3: test_memset2.c --]
[-- Type: text/plain, Size: 7461 bytes --]

/*
 * memset benchmark.
 *
 * Copyright (c) 2021, Arm Limited.
 * SPDX-License-Identifier: MIT OR Apache-2.0 WITH LLVM-exception
 */

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>
//#include "stringlib.h"
#include "benchlib.h"

#define ITERS  5000
#define ITERS2 20000000
#define ITERS3 1000000
#define NUM_TESTS 16384
#define MIN_SIZE 32768
#define MAX_SIZE (1024 * 1024)

static uint8_t a[MAX_SIZE + 4096] __attribute__((__aligned__(64)));

/*
#define F(x) {#x, x},

static const struct fun
{
  const char *name;
  void *(*fun)(void *, int, size_t);
} funtab[] =
{
#if __aarch64__
  F(__memset_aarch64)
#elif __arm__
  F(__memset_arm)
#endif
  F(memset)
#undef F
  {0, 0}
};
*/
typedef struct { uint32_t offset : 20, len : 12; } memset_test_t;
static memset_test_t test_arr[NUM_TESTS];

typedef struct { uint16_t size; uint16_t freq; } freq_data_t;
typedef struct { uint8_t align; uint16_t freq; } align_data_t;

#define SIZE_NUM 65536
#define SIZE_MASK (SIZE_NUM-1)
static uint8_t len_arr[SIZE_NUM];

/* Frequency data for memset sizes up to 4096 based on SPEC2017.  */
static freq_data_t memset_len_freq[] =
{
{40,28817}, {32,15336}, { 16,3823}, {296,3545}, { 24,3454}, {  8,1412},
{292,1202}, { 48, 927}, { 12, 613}, { 11, 539}, {284, 493}, {108, 414},
{ 88, 380}, { 20, 295}, {312, 271}, { 72, 233}, {  2, 200}, {  4, 192},
{ 15, 180}, { 14, 174}, { 13, 160}, { 56, 151}, { 36, 144}, { 64, 140},
{4095,133}, { 10, 130}, {  9, 124}, {  3, 124}, { 28, 120}, {  0, 118},
{288, 110}, {1152, 96}, {104,  90}, {  1,  86}, {832,  76}, {248,  74},
{1024, 69}, {120,  64}, {512,  63}, {384,  60}, {  6,  59}, { 80,  54},
{ 17,  50}, {  7,  49}, {520,  47}, {2048, 39}, {256,  37}, {864,  33},
{1440, 28}, { 22,  27}, {2056, 24}, {260,  23}, { 68,  23}, {  5,  22},
{ 18,  21}, {200,  18}, {2120, 18}, { 60,  17}, { 52,  16}, {336,  15},
{ 44,  13}, {192,  13}, {160,  12}, {2064, 12}, {128,  12}, { 76,  11},
{164,  11}, {152,  10}, {136,   9}, {488,   7}, { 96,   6}, {560,   6},
{1016,  6}, {112,   5}, {232,   5}, {168,   5}, {952,   5}, {184,   5},
{144,   4}, {252,   4}, { 84,   3}, {960,   3}, {3808,  3}, {244,   3},
{280,   3}, {224,   3}, {156,   3}, {1088,  3}, {440,   3}, {216,   2},
{304,   2}, { 23,   2}, { 25,   2}, { 26,   2}, {264,   2}, {328,   2},
{1096,  2}, {240,   2}, {1104,  2}, {704,   2}, {1664,  2}, {360,   2},
{808,   1}, {544,   1}, {236,   1}, {720,   1}, {368,   1}, {424,   1},
{640,   1}, {1112,  1}, {552,   1}, {272,   1}, {776,   1}, {376,   1},
{ 92,   1}, {536,   1}, {824,   1}, {496,   1}, {760,   1}, {792,   1},
{504,   1}, {344,   1}, {1816,  1}, {880,   1}, {176,   1}, {320,   1},
{352,   1}, {2008,  1}, {208,   1}, {408,   1}, {228,   1}, {2072,  1},
{568,   1}, {220,   1}, {616,   1}, {600,   1}, {392,   1}, {696,   1},
{2144,  1}, {1280,  1}, {2136,  1}, {632,   1}, {584,   1}, {456,   1},
{472,   1}, {3440,  1}, {2088,  1}, {680,   1}, {2928,  1}, {212,   1},
{648,   1}, {1752,  1}, {664,   1}, {3512,  1}, {1032,  1}, {528,   1},
{4072,  1}, {204,   1}, {2880,  1}, {3392,  1}, {712,   1}, { 59,   1},
{736,   1}, {592,   1}, {2520,  1}, {744,   1}, {196,   1}, {172,   1},
{728,   1}, {2040,  1}, {1192,  1}, {3600,  1}, {0, 0}
};

#define ALIGN_NUM 1024
#define ALIGN_MASK (ALIGN_NUM-1)
static uint8_t align_arr[ALIGN_NUM];

/* Alignment data for memset based on SPEC2017.  */
static align_data_t memset_align_freq[] =
{
 {16, 338}, {8, 307}, {32, 148}, {64, 131}, {4, 72}, {1, 23}, {2, 5}, {0, 0}
};

static void
init_memset_distribution (void)
{
  int i, j, freq, size, n;

  for (n = i = 0; (freq = memset_len_freq[i].freq) != 0; i++)
    for (j = 0, size = memset_len_freq[i].size; j < freq; j++)
      len_arr[n++] = size;
  assert (n == SIZE_NUM);

  for (n = i = 0; (freq = memset_align_freq[i].freq) != 0; i++)
    for (j = 0, size = memset_align_freq[i].align; j < freq; j++)
      align_arr[n++] = size - 1;
  assert (n == ALIGN_NUM);
}

static size_t
init_memset (size_t max_size)
{
  size_t total = 0;
  /* Create a random set of memsets with the given size and alignment
     distributions.  */
  for (int i = 0; i < NUM_TESTS; i++)
    {
      test_arr[i].offset = (rand32 (0) & (max_size - 1));
      test_arr[i].offset &= ~align_arr[rand32 (0) & ALIGN_MASK];
      test_arr[i].len = len_arr[rand32 (0) & SIZE_MASK];
      total += test_arr[i].len;
    }

  return total;
}


int main (void)
{
  init_memset_distribution ();

  memset (a, 1, sizeof (a));

  printf("Random memset (bytes/ns):\n");
  /*
  for (int f = 0; funtab[f].name != 0; f++)
    {
      size_t total_size = 0;
      uint64_t tsum = 0;
      printf ("%22s ", funtab[f].name);
      rand32 (0x12345678);

      for (int size = MIN_SIZE; size <= MAX_SIZE; size *= 2)
	{
	  size_t memset_size = init_memset (size) * ITERS;

	  for (int c = 0; c < NUM_TESTS; c++)
	    funtab[f].fun (a + test_arr[c].offset, 0, test_arr[c].len);

	  uint64_t t = clock_get_ns ();
	  for (int i = 0; i < ITERS; i++)
	    for (int c = 0; c < NUM_TESTS; c++)
	      funtab[f].fun (a + test_arr[c].offset, 0, test_arr[c].len);
	  t = clock_get_ns () - t;
	  total_size += memset_size;
	  tsum += t;
	  printf ("%dK: %.2f ", size / 1024, (double)memset_size / t);
	}
      printf( "avg %.2f\n", (double)total_size / tsum);
    }
    */
  size_t total_size = 0;
  uint64_t tsum = 0;
  printf ("%22s ", "memset_call");
  rand32 (0x12345678);

  for (int size = MIN_SIZE; size <= MAX_SIZE; size *= 2)
    {
      size_t memset_size = init_memset (size) * ITERS;

      for (int c = 0; c < NUM_TESTS; c++)
	memset (a + test_arr[c].offset, 0, test_arr[c].len);

      uint64_t t = clock_get_ns ();
      for (int i = 0; i < ITERS; i++)
	for (int c = 0; c < NUM_TESTS; c++)
	  memset (a + test_arr[c].offset, 0, test_arr[c].len);
      t = clock_get_ns () - t;
      total_size += memset_size;
      tsum += t;
      printf ("%dK: %.2f ", size / 1024, (double)memset_size / t);
    }
  printf( "avg %.2f\n", (double)total_size / tsum);


  printf ("\nMedium memset (bytes/ns):\n");
  /*
  for (int f = 0; funtab[f].name != 0; f++)
    {
      printf ("%22s ", funtab[f].name);

      for (int size = 8; size <= 512; size *= 2)
	{
	  uint64_t t = clock_get_ns ();
	  for (int i = 0; i < ITERS2; i++)
	    funtab[f].fun (a, 0, size);
	  t = clock_get_ns () - t;
	  printf ("%dB: %.2f ", size, (double)size * ITERS2 / t);
	}
      printf ("\n");
    }
    */
  printf ("%22s ", "memset_call");
  for (int size = 8; size <= 512; size *= 2)
    {
      uint64_t t = clock_get_ns ();
      for (int i = 0; i < ITERS2; i++)
	memset (a, 0, size);
      t = clock_get_ns () - t;
      printf ("%dB: %.2f ", size, (double)size * ITERS2 / t);
    }


  printf ("\nLarge memset (bytes/ns):\n");
  /*
  for (int f = 0; funtab[f].name != 0; f++)
    {
      printf ("%22s ", funtab[f].name);

      for (int size = 1024; size <= 65536; size *= 2)
	{
	  uint64_t t = clock_get_ns ();
	  for (int i = 0; i < ITERS3; i++)
	    funtab[f].fun (a, 0, size);
	  t = clock_get_ns () - t;
	  printf ("%dK: %.2f ", size / 1024, (double)size * ITERS3 / t);
	}
      printf ("\n");
    }
    */
  printf ("%22s ", "memset_call");
  for (int size = 1024; size <= 65536; size *= 2)
    {
      uint64_t t = clock_get_ns ();
      for (int i = 0; i < ITERS3; i++)
	memset (a, 0, size);
      t = clock_get_ns () - t;
      printf ("%dK: %.2f ", size / 1024, (double)size * ITERS3 / t);
    }
  printf ("\n\n");

  return 0;
}

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-04-26  7:25 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-11  2:17 [musl] memset_riscv64 张飞
2023-04-11  9:48 ` Pedro Falcato
2023-04-19  5:33   ` 张飞
2023-04-19  9:02     ` Szabolcs Nagy
2023-04-20  8:17       ` 张飞
2023-04-21 13:30         ` Szabolcs Nagy
2023-04-21 14:50           ` Pedro Falcato
2023-04-21 16:54             ` Rich Felker
2023-04-21 17:01               ` enh
2023-04-26  7:25           ` 张飞

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).