From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=5.0 tests=MAILING_LIST_MULTI, RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 22505 invoked from network); 21 Apr 2023 13:37:36 -0000 Received: from second.openwall.net (193.110.157.125) by inbox.vuxu.org with ESMTPUTF8; 21 Apr 2023 13:37:36 -0000 Received: (qmail 24181 invoked by uid 550); 21 Apr 2023 13:37:32 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Reply-To: musl@lists.openwall.com Received: (qmail 24146 invoked from network); 21 Apr 2023 13:37:32 -0000 Date: Fri, 21 Apr 2023 15:30:34 +0200 From: Szabolcs Nagy To: =?utf-8?B?5byg6aOe?= Cc: musl@lists.openwall.com Message-ID: <20230421133034.GS3630668@port70.net> Mail-Followup-To: =?utf-8?B?5byg6aOe?= , musl@lists.openwall.com References: <7ab4e713.9fae.1876e1ac122.Coremail.zhangfei@nj.iscas.ac.cn> <658c32ae.2348c.187980096c9.Coremail.zhangfei@nj.iscas.ac.cn> <20230419090210.GR3630668@port70.net> <4deb3986.247e1.1879dbd21b4.Coremail.zhangfei@nj.iscas.ac.cn> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable In-Reply-To: <4deb3986.247e1.1879dbd21b4.Coremail.zhangfei@nj.iscas.ac.cn> Subject: Re: Re: Re: [musl] memset_riscv64 * =E5=BC=A0=E9=A3=9E [2023-04-20 16:17:10 +0800]: > Hi! > I listened to your suggestions and referred to string.c in Musl's test se= t(libc-bench),=20 > and then modified the test cases. Since BUFLEN is a fixed value in strlen= =2Ec, I modified=20 > it to a variable as a parameter in my own test case and passed it to the = memset function.=20 > I adjusted the LOOP_TIMES has been counted up to 500 times and the runnin= g time has been=20 > sorted, only recording the running time of the middle 300 times. >=20 > I took turns executing two programs on the SiFive chip three times each, = and the results=20 > are shown below. > First run result > -------------------------------------------------------------------------= ------- > length(byte) C language implementation(s) Basic instruction implementa= tion(s) > -------------------------------------------------------------------------= ------- > 100 0.002208102 0.002304056 > 200 0.005053208 0.004629598 > 400 0.008666684 0.007739176 > 800 0.014065196 0.012372702 > 1600 0.023377685 0.020090966 > 3200 0.040221849 0.034059631 > 6400 0.072095377 0.060028906 > 12800 0.134040475 0.110039387 > 25600 0.257426806 0.210710952 > 51200 1.173755160 1.121833227 > 102400 3.693170402 3.637194098 > 204800 8.919975455 8.865504460 > 409600 19.410922418 19.360956493 > -------------------------------------------------------------------------= ------- >=20 > Second run result=20 > -------------------------------------------------------------------------= ------- > length(byte) C language implementation(s) Basic instruction implementa= tion(s) > -------------------------------------------------------------------------= ------- > 100 0.002208109 0.002293857 > 200 0.005057374 0.004640669 > 400 0.008674218 0.007760795 > 800 0.014068582 0.012417084 > 1600 0.023381095 0.020124496 > 3200 0.040225138 0.034093181 > 6400 0.072098744 0.060069574 > 12800 0.134043954 0.110088141 > 25600 0.256453187 0.208578633 > 51200 1.166602505 1.118972796 > 102400 3.684957231 3.635116808 > 204800 8.916302592 8.861590734 > 409600 19.411057216 19.358777670 > -------------------------------------------------------------------------= ------- >=20 > Third run result=20 > -------------------------------------------------------------------------= ------- > length(byte) C language implementation(s) Basic instruction implementa= tion(s) > -------------------------------------------------------------------------= ------- > 100 0.002208111 0.002293227 > 200 0.005056101 0.004628539 > 400 0.008677756 0.007748687 > 800 0.014085242 0.012404443 > 1600 0.023397782 0.020115710 > 3200 0.040242985 0.034084435 > 6400 0.072116665 0.060063767 > 12800 0.134060262 0.110082427 > 25600 0.257865186 0.209101754 > 51200 1.174257177 1.117753408 > 102400 3.696518162 3.635417503 > 204800 8.929357747 8.858765915 > 409600 19.426520562 19.356515671 > -------------------------------------------------------------------------= ------- >=20 > From the test results, it can be seen that the runtime of memset implemen= ted using the basic=20 > instruction set assembly is basically shorter than that implemented using= the C language.=20 > May I ask if the test results are convincing? small sizes are much more common than large sizes, memsets can be distributed such that sizes [0,100), [100,1000), [1000,inf) are used for 1/3 of all memsets each (not the call count, but the amount of bytes memset using such sizes), i.e. if you speed up the size =3D [100,1000) and [1000,inf) cases by 10% but regress the [0,100) case by 20% then the overall performance roughly stays the same. (of course this is very workload dependent, but across a system this is what i'd expect, probably even more skewed to smaller sizes). so we need to know what happens in the [0,100) range. what i see is a ~4% regression there while there is a ~10% improvement in the [100,1000) case and ~15% improvement in the [1000,inf) case (it would be nice to know why the 25k case is so much faster and why that speed up only applies to that size, we don't want to optimize for some obscure cpu bug that will go away next year) on practical workloads i would expect < 10% speedup overall from the asm code (but we need more data in the [0,100) range to tell). this may not be enough to justify the asm code. rich already said he prefers a different style of implementation (where the body of the function is in c but the inner loop is in asm if that helps e.g. via simd). here is an example of a benchmark that takes input distribution into account from a workload: https://github.com/ARM-software/optimized-routines/blob/master/string/bench= /memset.c#L53 > #include > #include > #include > #include >=20 > #define BUFLEN 500000 > #define LOOP_TIMES 500 >=20 > int cmp(const void *a, const void *b) { > double x =3D *(double *)a; > double y =3D *(double *)b; > if (x < y) return -1; > if (x > y) return 1; > return 0; > } >=20 > int main(){ > char *buf =3D malloc(BUFLEN); > double *arr =3D malloc(sizeof(double) * LOOP_TIMES); > size_t i,j,k; > struct timespec tv0,tv; > double times; >=20 > for(j=3D100; j for(k=3D0; k for (i=3D0; i<100; i++) > memset(buf+i, i, j-i); > } > } >=20 > for(j=3D100; j for(k=3D0; k clock_gettime(CLOCK_REALTIME, &tv0); > for (i=3D0; i<100; i++) > memset(buf+i, i, j-i); alignment only matters up to 64 byte alignment and usually inputs are at least 8byte aligned. value is almost always 0. (we probably don't even need to test non-0 case: a 0 check is correctly predicted in practice.) i think length should have a small variation, just enough to add penalty to small size checks where implementations may use many branches. so something like this may be better (madeup off,al numbers): buf =3D malloc((1<<16)+32); size_t sz[] =3D {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11, 1<<= 12, 1<<13, 1<<14, 1<<15, 1<<16, 0}; size_t off[16] =3D {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12= }; size_t al[16] =3D {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1}; for (j=3D0; sz[j]; j++) for (k=3D0; k<20; k++) { t0 =3D tic(); // large loop count is important for small sizes for (i=3D0; i<256; i++) memset(buf + al[i%16], 0, sz[j] + off[i%16]); t1 =3D tic(); tmin =3D min(tmin,t1-t0); } large memset (>=3D1k) can be tested separately (no.need to add off,al variaion then, less inner loop is enough, but it should not hurt to include them here). > clock_gettime(CLOCK_REALTIME, &tv); > tv.tv_sec -=3D tv0.tv_sec; > if ((tv.tv_nsec -=3D tv0.tv_nsec) < 0) { > tv.tv_nsec +=3D 1000000000; > tv.tv_sec--; > } > arr[k] =3D tv.tv_sec + (double)tv.tv_nsec/1000000000; > } > qsort(arr, 500, sizeof(double), cmp);=20 just take the minimum. we want to know the fastest execution. > =20 > for (int m =3D 100; m < LOOP_TIMES - 100; m++) { > times +=3D arr[m]; > } > printf("len: %ld time: %.9lf\n",j, times); you can also print GB/s which is 256*sz[j]/tmin in my example. > } > free(buf); > return 0; > }