Hi!

I followed your suggestions, referred to string.c in musl's test set
(libc-bench), and modified the test cases accordingly. Since BUFLEN is a
fixed value in string.c, I turned it into a variable in my own test case
and passed it to the memset function as the length parameter. I increased
LOOP_TIMES to 500, sorted the running times, and recorded only the running
time of the middle 300 runs.

I ran the two programs alternately on the SiFive chip, three times each,
and the results are shown below.

First run result
--------------------------------------------------------------------------------
length(byte)  C language implementation(s)  Basic instruction implementation(s)
--------------------------------------------------------------------------------
100           0.002208102                   0.002304056
200           0.005053208                   0.004629598
400           0.008666684                   0.007739176
800           0.014065196                   0.012372702
1600          0.023377685                   0.020090966
3200          0.040221849                   0.034059631
6400          0.072095377                   0.060028906
12800         0.134040475                   0.110039387
25600         0.257426806                   0.210710952
51200         1.173755160                   1.121833227
102400        3.693170402                   3.637194098
204800        8.919975455                   8.865504460
409600        19.410922418                  19.360956493
--------------------------------------------------------------------------------

Second run result
--------------------------------------------------------------------------------
length(byte)  C language implementation(s)  Basic instruction implementation(s)
--------------------------------------------------------------------------------
100           0.002208109                   0.002293857
200           0.005057374                   0.004640669
400           0.008674218                   0.007760795
800           0.014068582                   0.012417084
1600          0.023381095                   0.020124496
3200          0.040225138                   0.034093181
6400          0.072098744                   0.060069574
12800         0.134043954                   0.110088141
25600         0.256453187                   0.208578633
51200         1.166602505                   1.118972796
102400        3.684957231                   3.635116808
204800        8.916302592                   8.861590734
409600        19.411057216                  19.358777670
--------------------------------------------------------------------------------

Third run result
--------------------------------------------------------------------------------
length(byte)  C language implementation(s)  Basic instruction implementation(s)
--------------------------------------------------------------------------------
100           0.002208111                   0.002293227
200           0.005056101                   0.004628539
400           0.008677756                   0.007748687
800           0.014085242                   0.012404443
1600          0.023397782                   0.020115710
3200          0.040242985                   0.034084435
6400          0.072116665                   0.060063767
12800         0.134060262                   0.110082427
25600         0.257865186                   0.209101754
51200         1.174257177                   1.117753408
102400        3.696518162                   3.635417503
204800        8.929357747                   8.858765915
409600        19.426520562                  19.356515671
--------------------------------------------------------------------------------

The test results show that the memset implemented in assembly using the
basic instruction set runs faster than the C implementation at every
length except 100 bytes, where the C implementation is slightly faster in
all three runs.

Are these test results convincing?

> -----Original Message-----
> From: "Szabolcs Nagy"
> Sent: 2023-04-19 17:02:10 (Wednesday)
> To: "张飞"
> Cc: musl@lists.openwall.com
> Subject: Re: Re: [musl] memset_riscv64
>
> * 张飞 [2023-04-19 13:33:08 +0800]:
> > --------------------------------------------------------------------------------
> > length(byte)  C language implementation(s)  Basic instruction implementation(s)
> > --------------------------------------------------------------------------------
> > 4             0.00000352                    0.000004001
> > 8             0.000004001                   0.000005441
> > 16            0.000006241                   0.00000464
> > 32            0.00000752                    0.00000448
> > 64            0.000008481                   0.000005281
> > 128           0.000009281                   0.000005921
> > 256           0.000011201                   0.000007041
>
> i don't think these numbers can be trusted.
>
> > #include <stdio.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #include <time.h>
> >
> > #define DATA_SIZE 5*1024*1024
> > #define MAX_LEN 1*1024*1024
> > #define OFFSET 0
> > #define LOOP_TIMES 100
> > int main(){
> >     char *str1,*src1;
> >     str1 = (char *)mmap(NULL, DATA_SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> >
> >     printf("function test start\n");
> >
> >     src1 = str1+OFFSET;
> >     struct timespec tv0,tv;
> >     for(int len=2; len<=MAX_LEN; len*=2){
> >         clock_gettime(CLOCK_REALTIME, &tv0);
> >         for(int k=0; k<LOOP_TIMES; k++){
> >             memset(src1, 'a', len);
> >         }
> >         clock_gettime(CLOCK_REALTIME, &tv);
> >         tv.tv_sec -= tv0.tv_sec;
> >         if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
> >             tv.tv_nsec += 1000000000;
> >             tv.tv_sec--;
> >         }
> >         printf("len: %d time: %ld.%.9ld\n",len, (long)tv.tv_sec, (long)tv.tv_nsec);
>
> this repeatedly calls memset with exact same len, alignment and value.
> so it favours branch heavy code since those are correctly predicted.
>
> but even if you care about a branch-predicted microbenchmark, you
> made a single measurement per size so you cannot tell how much the
> time varies, you should do several measurements and take the min
> so noise from system effects and cpu internal state are reduced
> (also that state needs to be warmed up). and likely the LOOP_TIMES
> should be bigger too for small sizes for reliable timing.
>
> benchmarking string functions is tricky especially for a target arch
> with many implementations.
>
> >     }
> >
> >     printf("function test end\n");
> >     munmap(str1,DATA_SIZE);
> >     return 0;
> > }
> >
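
For reference, the measurement procedure described at the top of this
message (500 timed calls per length, sorted, middle 300 kept) could look
roughly like the sketch below. This is not the actual test program. The
names trimmed_sum, elapsed_ns, time_memset_trimmed and KEEP are
hypothetical, and three things are assumptions: the middle 300 samples
are summed, CLOCK_MONOTONIC replaces the CLOCK_REALTIME used in the
quoted program, and one untimed warm-up call is made first.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical constants mirroring the description in this message. */
#define LOOP_TIMES 500   /* timed calls per length */
#define KEEP       300   /* middle samples that are kept */

/* Calling memset through a volatile pointer keeps the compiler from
   inlining or eliminating the calls being timed. */
static void *(*volatile memset_fn)(void *, int, size_t) = memset;

/* qsort comparator for uint64_t nanosecond samples. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort n samples in place and return the sum of the middle `keep`. */
static uint64_t trimmed_sum(uint64_t *samples, size_t n, size_t keep)
{
    assert(keep <= n);
    qsort(samples, n, sizeof *samples, cmp_u64);
    size_t skip = (n - keep) / 2;   /* dropped from the low end */
    uint64_t sum = 0;
    for (size_t i = skip; i < skip + keep; i++)
        sum += samples[i];
    return sum;
}

/* Nanoseconds elapsed from *t0 to *t1 (t1 must not precede t0). */
static uint64_t elapsed_ns(const struct timespec *t0,
                           const struct timespec *t1)
{
    int64_t ns = (int64_t)(t1->tv_sec - t0->tv_sec) * 1000000000
               + (t1->tv_nsec - t0->tv_nsec);
    return (uint64_t)ns;
}

/* Time LOOP_TIMES memset calls of `len` bytes on buf and return the
   trimmed sum in nanoseconds. */
static uint64_t time_memset_trimmed(char *buf, size_t len)
{
    static uint64_t samples[LOOP_TIMES];
    struct timespec t0, t1;

    memset_fn(buf, 'a', len);       /* warm-up call, not timed */
    for (int k = 0; k < LOOP_TIMES; k++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memset_fn(buf, 'a', len);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        samples[k] = elapsed_ns(&t0, &t1);
    }
    return trimmed_sum(samples, LOOP_TIMES, KEEP);
}
```

A driver would call time_memset_trimmed(buf, len) once per length and
print the result divided by 1e9 to get seconds. Timing every call
separately adds clock_gettime overhead to each sample, which matters
most for the small lengths.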