From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.0 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_MSPIKE_H2 autolearn=ham autolearn_force=no version=3.4.4
Received: (qmail 22505 invoked from network); 21 Apr 2023 13:37:36 -0000
Received: from second.openwall.net (193.110.157.125)
  by inbox.vuxu.org with ESMTPUTF8; 21 Apr 2023 13:37:36 -0000
Received: (qmail 24181 invoked by uid 550); 21 Apr 2023 13:37:32 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 24146 invoked from network); 21 Apr 2023 13:37:32 -0000
Date: Fri, 21 Apr 2023 15:30:34 +0200
From: Szabolcs Nagy <nsz@port70.net>
To: =?utf-8?B?5byg6aOe?= <zhangfei@nj.iscas.ac.cn>
Cc: musl@lists.openwall.com
Message-ID: <20230421133034.GS3630668@port70.net>
Mail-Followup-To: =?utf-8?B?5byg6aOe?= <zhangfei@nj.iscas.ac.cn>,
	musl@lists.openwall.com
References: <7ab4e713.9fae.1876e1ac122.Coremail.zhangfei@nj.iscas.ac.cn>
 <CAKbZUD2Rfd9Lg37GY+N_bMeJOQJ=84yZ=SW9+vHMRdByU0CZ+A@mail.gmail.com>
 <658c32ae.2348c.187980096c9.Coremail.zhangfei@nj.iscas.ac.cn>
 <20230419090210.GR3630668@port70.net>
 <4deb3986.247e1.1879dbd21b4.Coremail.zhangfei@nj.iscas.ac.cn>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <4deb3986.247e1.1879dbd21b4.Coremail.zhangfei@nj.iscas.ac.cn>
Subject: Re: Re: Re: [musl] memset_riscv64

* =E5=BC=A0=E9=A3=9E <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> Hi!
> I listened to your suggestions and referred to string.c in Musl's test se=
t(libc-bench),=20
> and then modified the test cases. Since BUFLEN is a fixed value in strlen=
=2Ec, I modified=20
> it to a variable as a parameter in my own test case and passed it to the =
memset function.=20
> I adjusted the LOOP_TIMES has been counted up to 500 times and the runnin=
g time has been=20
> sorted, only recording the running time of the middle 300 times.
>=20
> I took turns executing two programs on the SiFive chip three times each, =
and the results=20
> are shown below.
>                              First run result
> -------------------------------------------------------------------------=
-------
> length(byte)  C language implementation(s)   Basic instruction implementa=
tion(s)
> -------------------------------------------------------------------------=
-------
> 100                 0.002208102                     0.002304056
> 200                 0.005053208                     0.004629598
> 400                 0.008666684                     0.007739176
> 800                 0.014065196                     0.012372702
> 1600                0.023377685                     0.020090966
> 3200                0.040221849                     0.034059631
> 6400                0.072095377                     0.060028906
> 12800               0.134040475                     0.110039387
> 25600               0.257426806                     0.210710952
> 51200               1.173755160                     1.121833227
> 102400              3.693170402                     3.637194098
> 204800              8.919975455                     8.865504460
> 409600             19.410922418                    19.360956493
> -------------------------------------------------------------------------=
-------
>=20
>                              Second run result=20
> -------------------------------------------------------------------------=
-------
> length(byte)  C language implementation(s)   Basic instruction implementa=
tion(s)
> -------------------------------------------------------------------------=
-------
> 100                 0.002208109                     0.002293857
> 200                 0.005057374                     0.004640669
> 400                 0.008674218                     0.007760795
> 800                 0.014068582                     0.012417084
> 1600                0.023381095                     0.020124496
> 3200                0.040225138                     0.034093181
> 6400                0.072098744                     0.060069574
> 12800               0.134043954                     0.110088141
> 25600               0.256453187                     0.208578633
> 51200               1.166602505                     1.118972796
> 102400              3.684957231                     3.635116808
> 204800              8.916302592                     8.861590734
> 409600             19.411057216                    19.358777670
> -------------------------------------------------------------------------=
-------
>=20
>                              Third run result=20
> -------------------------------------------------------------------------=
-------
> length(byte)  C language implementation(s)   Basic instruction implementa=
tion(s)
> -------------------------------------------------------------------------=
-------
> 100                 0.002208111                     0.002293227
> 200                 0.005056101                     0.004628539
> 400                 0.008677756                     0.007748687
> 800                 0.014085242                     0.012404443
> 1600                0.023397782                     0.020115710
> 3200                0.040242985                     0.034084435
> 6400                0.072116665                     0.060063767
> 12800               0.134060262                     0.110082427
> 25600               0.257865186                     0.209101754
> 51200               1.174257177                     1.117753408
> 102400              3.696518162                     3.635417503
> 204800              8.929357747                     8.858765915
> 409600             19.426520562                     19.356515671
> -------------------------------------------------------------------------=
-------
>=20
> From the test results, it can be seen that the runtime of memset implemen=
ted using the basic=20
> instruction set assembly is basically shorter than that implemented using=
 the C language.=20
> May I ask if the test results are convincing?

small sizes are much more common than large sizes, memsets can be
distributed such that sizes [0,100), [100,1000), [1000,inf) are
used for 1/3 of all memsets each (not the call count, but the
amount of bytes memset using such sizes), i.e. if you speed up
the size =3D [100,1000) and [1000,inf) cases by 10% but regress the
[0,100) case by 20% then the overall performance roughly stays
the same. (of course this is very workload dependent, but across
a system this is what i'd expect, probably even more skewed to
smaller sizes).

so we need to know what happens in the [0,100) range. what i see
is a ~4% regression there while there is a ~10% improvement in
the [100,1000) case and ~15% improvement in the [1000,inf) case
(it would be nice to know why the 25k case is so much faster and
why that speed up only applies to that size, we don't want to
optimize for some obscure cpu bug that will go away next year)

on practical workloads i would expect < 10% speedup overall from
the asm code (but we need more data in the [0,100) range to tell).
this may not be enough to justify the asm code.

rich already said he prefers a different style of implementation
(where the body of the function is in c but the inner loop is in
asm if that helps e.g. via simd).

here is an example of a benchmark that takes input distribution
into account from a workload:
https://github.com/ARM-software/optimized-routines/blob/master/string/bench=
/memset.c#L53

> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <time.h>
>=20
> #define BUFLEN 500000
> #define LOOP_TIMES 500
>=20
> int cmp(const void *a, const void *b) {
>     double x =3D *(double *)a;
>     double y =3D *(double *)b;
>     if (x < y) return -1;
>     if (x > y) return 1;
>     return 0;
> }
>=20
> int main(){
>         char *buf =3D malloc(BUFLEN);
> 	double *arr =3D malloc(sizeof(double) * LOOP_TIMES);
>         size_t i,j,k;
>         struct timespec tv0,tv;
> 	double times;
>=20
>         for(j=3D100; j<BUFLEN; j*=3D2){
>           for(k=3D0; k<LOOP_TIMES; k++){
>             for (i=3D0; i<100; i++)
>                   memset(buf+i, i, j-i);
>           }
>         }
>=20
>         for(j=3D100; j<BUFLEN; j*=3D2){
>           for(k=3D0; k<LOOP_TIMES; k++){
>             clock_gettime(CLOCK_REALTIME, &tv0);
>             for (i=3D0; i<100; i++)
>                   memset(buf+i, i, j-i);

alignment only matters up to 64 byte alignment and usually inputs
are at least 8byte aligned.

value is almost always 0. (we probably don't even need to test
non-0 case: a 0 check is correctly predicted in practice.)

i think length should have a small variation, just enough to add
penalty to small size checks where implementations may use many
branches.

so something like this may be better (madeup off,al numbers):

	buf =3D malloc((1<<16)+32);
	size_t sz[] =3D {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11, 1<<=
12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
	size_t off[16] =3D {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12=
};
	size_t al[16] =3D {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};
	for (j=3D0; sz[j]; j++)
		for (k=3D0; k<20; k++) {
			t0 =3D tic();
			// large loop count is important for small sizes
			for (i=3D0; i<256; i++)
				memset(buf + al[i%16], 0, sz[j] + off[i%16]);
			t1 =3D tic();
			tmin =3D min(tmin,t1-t0);
		}

large memset (>=3D1k) can be tested separately (no.need to add off,al
variaion then, less inner loop is enough, but it should not hurt to
include them here).

>             clock_gettime(CLOCK_REALTIME, &tv);
>             tv.tv_sec -=3D tv0.tv_sec;
>             if ((tv.tv_nsec -=3D tv0.tv_nsec) < 0) {
>                 tv.tv_nsec +=3D 1000000000;
>                 tv.tv_sec--;
>             }
> 	    arr[k] =3D tv.tv_sec + (double)tv.tv_nsec/1000000000;
>           }
>           qsort(arr, 500, sizeof(double), cmp);=20

just take the minimum. we want to know the fastest execution.

>          =20
> 	  for (int m =3D 100; m < LOOP_TIMES - 100; m++) {
>               times +=3D arr[m];
>           }
> 	  printf("len: %ld  time: %.9lf\n",j, times);

you can also print GB/s which is 256*sz[j]/tmin in my example.

> 	}
>         free(buf);
>         return 0;
> }