From: Pedro Falcato
Date: Fri, 21 Apr 2023 15:50:45 +0100
To: musl@lists.openwall.com, 张飞
Cc: nsz@port70.net
Subject: Re: Re: Re: [musl] memset_riscv64

On Fri, Apr 21, 2023 at 2:37 PM Szabolcs Nagy wrote:
>
> * 张飞 [2023-04-20 16:17:10 +0800]:
> > Hi!
> > I took your suggestions, referred to string.c in musl's test set
> > (libc-bench), and modified my test cases accordingly.
> > Since BUFLEN is a fixed value in strlen.c, I turned it into a
> > variable in my own test case and passed it to the memset function
> > as a parameter. I raised LOOP_TIMES to 500, sorted the running
> > times, and recorded only the middle 300 runs.
> >
> > I ran the two programs alternately on the SiFive chip, three times
> > each; the results are shown below.
> >
> > First run result
> > ------------------------------------------------------------------------
> > length (bytes)   C implementation (s)   Basic-ISA asm implementation (s)
> > ------------------------------------------------------------------------
> > 100              0.002208102            0.002304056
> > 200              0.005053208            0.004629598
> > 400              0.008666684            0.007739176
> > 800              0.014065196            0.012372702
> > 1600             0.023377685            0.020090966
> > 3200             0.040221849            0.034059631
> > 6400             0.072095377            0.060028906
> > 12800            0.134040475            0.110039387
> > 25600            0.257426806            0.210710952
> > 51200            1.173755160            1.121833227
> > 102400           3.693170402            3.637194098
> > 204800           8.919975455            8.865504460
> > 409600           19.410922418           19.360956493
> > ------------------------------------------------------------------------
> >
> > Second run result
> > ------------------------------------------------------------------------
> > length (bytes)   C implementation (s)   Basic-ISA asm implementation (s)
> > ------------------------------------------------------------------------
> > 100              0.002208109            0.002293857
> > 200              0.005057374            0.004640669
> > 400              0.008674218            0.007760795
> > 800              0.014068582            0.012417084
> > 1600             0.023381095            0.020124496
> > 3200             0.040225138            0.034093181
> > 6400             0.072098744            0.060069574
> > 12800            0.134043954            0.110088141
> > 25600            0.256453187            0.208578633
> > 51200            1.166602505            1.118972796
> > 102400           3.684957231            3.635116808
> > 204800           8.916302592            8.861590734
> > 409600           19.411057216           19.358777670
> > ------------------------------------------------------------------------
> >
> > Third run result
> > ------------------------------------------------------------------------
> > length (bytes)   C implementation (s)   Basic-ISA asm implementation (s)
> > ------------------------------------------------------------------------
> > 100              0.002208111            0.002293227
> > 200              0.005056101            0.004628539
> > 400              0.008677756            0.007748687
> > 800              0.014085242            0.012404443
> > 1600             0.023397782            0.020115710
> > 3200             0.040242985            0.034084435
> > 6400             0.072116665            0.060063767
> > 12800            0.134060262            0.110082427
> > 25600            0.257865186            0.209101754
> > 51200            1.174257177            1.117753408
> > 102400           3.696518162            3.635417503
> > 204800           8.929357747            8.858765915
> > 409600           19.426520562           19.356515671
> > ------------------------------------------------------------------------
> >
> > From these results, the memset implemented in basic-ISA assembly
> > generally runs faster than the one implemented in C.
> > Are the test results convincing?
>
> small sizes are much more common than large sizes. memsets can be
> distributed such that sizes [0,100), [100,1000), [1000,inf) each
> account for 1/3 of all memset traffic (not the call count, but the
> amount of bytes memset using such sizes), i.e. if you speed up the
> [100,1000) and [1000,inf) cases by 10% but regress the [0,100) case
> by 20%, then the overall performance roughly stays the same.
> (of course this is very workload dependent, but across a system this
> is what i'd expect, probably even more skewed to smaller sizes).
>
> so we need to know what happens in the [0,100) range. what i see is a
> ~4% regression there, while there is a ~10% improvement in the
> [100,1000) case and a ~15% improvement in the [1000,inf) case. (it
> would be nice to know why the 25k case is so much faster and why that
> speed-up only applies to that size; we don't want to optimize for
> some obscure cpu bug that will go away next year.)
>
> on practical workloads i would expect < 10% speedup overall from the
> asm code (but we need more data in the [0,100) range to tell). this
> may not be enough to justify the asm code.
>
> rich already said he prefers a different style of implementation
> (where the body of the function is in c but the inner loop is in asm,
> if that helps, e.g. via simd).

I don't think writing it all in C is viable, at least if you want to
squeeze every last bit of performance out of it (while avoiding the
idiotic codegen that sometimes pops up). Even with inline asm, I
seriously question its effectiveness.

As I see it, we have two major roadblocks for fast stringops support
(and one more for riscv):

1) Support GNU IFUNC (as glibc, FreeBSD, etc. do) to automatically
dispatch the stringops functions to the best implementation for the
CPU's feature set (rough sketch below). I have no good solution for
static linking folks.

2) (Optional) Play around with C codegen that adds SIMD or inline asm
to make it fast-ish. The LLVM folks have experimented with string ops
written entirely in C++ on top of __builtin_memcpy_inline, which makes
smart choices wrt overlapping loads/stores, SIMD, etc. depending on
the size (second sketch below). Sadly, __builtin_memcpy_inline is/was
not available in GCC. Testing the performance of C + inline asm
against pure asm would be interesting.

3) Write the riscv stringops in assembly once the CPUs get more
advanced and we finally have a good idea of how things perform. I
still think riscv is too new to optimize for specifically: extensions
are popping up left and right, the vector extension isn't yet properly
supported in the kernel, and (most importantly) we don't have a proper
way to detect riscv features just yet. For instance, unaligned
accesses may carry little to no performance penalty, a big penalty
(trapped to M-mode and emulated), or may not be supported at all.
AFAIK, Linux userspace currently has no way to find this out (a
patchset for this is still pending, I think?), and cheap unaligned
accesses are make-or-break for stringops, since they let you avoid
*soooo* many branches (third sketch below). For riscv you probably
want to end up with at least 3 mem* variants (no-unaligned, unaligned,
vector).

Anyway, this was just a general brain-dump of my thoughts. The lack of
fast stringops is, AIUI, a problem on most architectures musl
supports.

> here is an example of a benchmark that takes input distribution into
> account from a workload:
> https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53

For some more data, here's the size distribution from a workload
(kernel compile, I think?) on a FreeBSD kernel:
https://people.freebsd.org/~mjg/bufsizes.txt. The sizes will of course
vary between programs, but it's a nice datapoint for kernel-side
behavior. Userspace probably has lots of bigger copies/memsets.
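To make 1) concrete, the dispatch mechanics look roughly like this. A
minimal sketch, not musl code: my_memset, memset_generic,
memset_vector and the hwcap probe are all made-up stand-ins; only the
ifunc attribute itself is real GCC/Clang machinery on ELF targets, and
a real resolver has to be careful about what it calls, since it runs
before relocation processing is finished.

#include <stddef.h>
#include <sys/auxv.h>

/* Placeholder implementations; real ones would be the C and asm
   versions discussed in this thread. */
static void *memset_generic(void *dst, int c, size_t n)
{
	unsigned char *d = dst;
	while (n--)
		*d++ = (unsigned char)c;
	return dst;
}

static void *memset_vector(void *dst, int c, size_t n)
{
	return memset_generic(dst, c, n); /* stand-in for an RVV version */
}

static int cpu_has_vector(void)
{
	/* Assumption: the kernel reports single-letter ISA extensions
	   in AT_HWCAP as bit (letter - 'A'); whether 'V' is actually
	   reported depends on the kernel support that, as said above,
	   is still pending. */
	return getauxval(AT_HWCAP) & (1UL << ('V' - 'A'));
}

/* The resolver runs once, at relocation time; afterwards every call
   jumps straight to the implementation it picked, with no per-call
   branch. */
static void *(*resolve_memset(void))(void *, int, size_t)
{
	return cpu_has_vector() ? memset_vector : memset_generic;
}

void *my_memset(void *, int, size_t)
	__attribute__((ifunc("resolve_memset")));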
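And the __builtin_memcpy_inline style from 2) looks something like
this: a hypothetical helper for sizes up to 16 bytes, in the spirit of
the overlapping fixed-size copies LLVM's libc uses. The builtin is
Clang-only and requires a compile-time-constant size; dispatching on n
in plain C is what lets it stay portable-ish.

#include <stddef.h>

static void copy_le_16(char *dst, const char *src, size_t n)
{
	if (n >= 8) {
		/* two possibly-overlapping 8-byte copies cover every
		   n in [8,16] with no further branching */
		__builtin_memcpy_inline(dst, src, 8);
		__builtin_memcpy_inline(dst + n - 8, src + n - 8, 8);
	} else if (n >= 4) {
		__builtin_memcpy_inline(dst, src, 4);
		__builtin_memcpy_inline(dst + n - 4, src + n - 4, 4);
	} else if (n) {
		/* covers n == 1, 2 and 3 */
		dst[0] = src[0];
		dst[n / 2] = src[n / 2];
		dst[n - 1] = src[n - 1];
	}
}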
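On the unaligned-access point in 3), the branch savings come from
overlapping stores. A made-up example, assuming unaligned 64-bit
stores are cheap; the fixed-size memcpy calls compile down to single,
possibly misaligned, stores:

#include <stdint.h>
#include <string.h>

static void memset_8_to_16(void *dst, int c, size_t n) /* 8 <= n <= 16 */
{
	uint64_t v = (unsigned char)c * 0x0101010101010101ull;
	char *d = dst;

	/* the two stores overlap whenever n < 16; no alignment
	   fix-up loop and no per-size branching */
	memcpy(d, &v, 8);
	memcpy(d + n - 8, &v, 8);
}

Without cheap unaligned stores, each of those needs an alignment
prologue plus byte/halfword/word special cases, which is exactly the
branchy code that the small-size numbers above are sensitive to.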
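And on the input-distribution point: a harness in that spirit might
draw (size, offset) pairs from a weighted table up front and then time
the whole stream, so each size class counts in proportion to how often
a real workload hits it. Everything below (sizes, weights, counts) is
made up for illustration, not taken from the ARM harness:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 100000

int main(void)
{
	static const size_t sizes[] = {8, 16, 32, 64, 256, 4096};
	/* made-up weights (percent); a real harness would replay a
	   trace like the FreeBSD data above */
	static const int weights[] = {30, 25, 20, 15, 7, 3};
	static size_t sz[N], off[N];
	char *buf = malloc(1 << 16);
	struct timespec t0, t1;

	srand(1); /* fixed seed so runs are comparable */
	for (size_t i = 0; i < N; i++) {
		int r = rand() % 100, w = 0, j = 0;
		while (r >= (w += weights[j]))
			j++;
		sz[i] = sizes[j];
		off[i] = rand() % 64; /* vary alignment a little */
	}

	/* time the whole weighted stream, not one size at a time */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t i = 0; i < N; i++)
		memset(buf + off[i], 0, sz[i]);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double s = (t1.tv_sec - t0.tv_sec) +
	           (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.9f s for %d weighted memsets\n", s, N);
	free(buf);
	return 0;
}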
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <time.h>
> >
> > #define BUFLEN 500000
> > #define LOOP_TIMES 500
> >
> > int cmp(const void *a, const void *b) {
> >     double x = *(double *)a;
> >     double y = *(double *)b;
> >     if (x < y) return -1;
> >     if (x > y) return 1;
> >     return 0;
> > }
> >
> > int main() {
> >     char *buf = malloc(BUFLEN);
> >     double *arr = malloc(sizeof(double) * LOOP_TIMES);
> >     size_t i, j, k;
> >     struct timespec tv0, tv;
> >     double times;
> >
> >     /* warm-up pass */
> >     for (j = 100; j < BUFLEN; j *= 2) {
> >         for (k = 0; k < LOOP_TIMES; k++) {
> >             for (i = 0; i < 100; i++)
> >                 memset(buf + i, i, j - i);
> >         }
> >     }
> >
> >     for (j = 100; j < BUFLEN; j *= 2) {
> >         for (k = 0; k < LOOP_TIMES; k++) {
> >             clock_gettime(CLOCK_REALTIME, &tv0);
> >             for (i = 0; i < 100; i++)
> >                 memset(buf + i, i, j - i);
>
> alignment only matters up to 64-byte alignment, and usually inputs
> are at least 8-byte aligned.
>
> value is almost always 0. (we probably don't even need to test the
> non-0 case: a 0 check is correctly predicted in practice.)
>
> i think length should have a small variation, just enough to add a
> penalty to the small-size checks where implementations may use many
> branches.
>
> so something like this may be better (made-up off,al numbers):
>
>     buf = malloc((1<<16)+32);
>     size_t sz[] = {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10,
>             1<<11, 1<<12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
>     size_t off[16] = {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12};
>     size_t al[16] = {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};
>     for (j = 0; sz[j]; j++)
>     for (k = 0; k < 20; k++) {
>             t0 = tic();
>             // large loop count is important for small sizes
>             for (i = 0; i < 256; i++)
>                     memset(buf + al[i%16], 0, sz[j] + off[i%16]);
>             t1 = tic();
>             tmin = min(tmin, t1 - t0);
>     }
>
> large memset (>= 1k) can be tested separately (no need to add the
> off,al variation then, a smaller inner loop count is enough, but it
> should not hurt to include them here).
>
> >             clock_gettime(CLOCK_REALTIME, &tv);
> >             tv.tv_sec -= tv0.tv_sec;
> >             if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
> >                 tv.tv_nsec += 1000000000;
> >                 tv.tv_sec--;
> >             }
> >             arr[k] = tv.tv_sec + (double)tv.tv_nsec / 1000000000;
> >         }
> >         qsort(arr, LOOP_TIMES, sizeof(double), cmp);
>
> just take the minimum. we want to know the fastest execution.
>
> >         times = 0;
> >         for (int m = 100; m < LOOP_TIMES - 100; m++) {
> >             times += arr[m];
> >         }
> >         printf("len: %zu time: %.9lf\n", j, times);
>
> you can also print GB/s, which is 256*sz[j]/tmin in my example.
>
> >     }
> >     free(buf);
> >     return 0;
> > }

I know folks are somewhat allergic to C++ here, but I wholeheartedly
recommend Google Benchmark for microbenchmarking needs. See
https://gist.github.com/heatd/165b70a4e0b75e815b82d723c01637dc for
something I used to benchmark my memcpy (partially taken from
AOSP/bionic).

-- 
Pedro