From: enh
Date: Fri, 21 Apr 2023 10:01:05 -0700
To: musl@lists.openwall.com
Cc: Pedro Falcato, 张飞, nsz@port70.net
Subject: Re: Re: Re: [musl] memset_riscv64

On Fri, Apr 21, 2023 at 9:54 AM Rich Felker <dalias@libc.org> wrote:
> On Fri, Apr 21, 2023 at 03:50:45PM +0100, Pedro Falcato wrote:
> > On Fri, Apr 21, 2023 at 2:37 PM Szabolcs Nagy <nsz@port70.net> wrote:
> > >
> > > * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> > > > Hi!
> > > > I listened to your suggestions and referred to string.c in musl's
> > > > test set (libc-bench), and then modified the test cases. Since
> > > > BUFLEN is a fixed value in strlen.c, I changed it into a variable
> > > > that is passed to the memset function as a parameter. I raised
> > > > LOOP_TIMES to 500, sorted the measured running times, and recorded
> > > > only the middle 300 runs.
> > > >
> > > > I took turns executing the two programs on the SiFive chip three
> > > > times each, and the results are shown below.
> > > >
> > > >                         First run result
> > > > ----------------------------------------------------------------------
> > > > length (bytes)   C implementation (s)   Basic-instruction asm (s)
> > > > ----------------------------------------------------------------------
> > > > 100                   0.002208102             0.002304056
> > > > 200                   0.005053208             0.004629598
> > > > 400                   0.008666684             0.007739176
> > > > 800                   0.014065196             0.012372702
> > > > 1600                  0.023377685             0.020090966
> > > > 3200                  0.040221849             0.034059631
> > > > 6400                  0.072095377             0.060028906
> > > > 12800                 0.134040475             0.110039387
> > > > 25600                 0.257426806             0.210710952
> > > > 51200                 1.173755160             1.121833227
> > > > 102400                3.693170402             3.637194098
> > > > 204800                8.919975455             8.865504460
> > > > 409600               19.410922418            19.360956493
> > > > ----------------------------------------------------------------------
> > > >
> > > >                         Second run result
> > > > ----------------------------------------------------------------------
> > > > length (bytes)   C implementation (s)   Basic-instruction asm (s)
> > > > ----------------------------------------------------------------------
> > > > 100                   0.002208109             0.002293857
> > > > 200                   0.005057374             0.004640669
> > > > 400                   0.008674218             0.007760795
> > > > 800                   0.014068582             0.012417084
> > > > 1600                  0.023381095             0.020124496
> > > > 3200                  0.040225138             0.034093181
> > > > 6400                  0.072098744             0.060069574
> > > > 12800                 0.134043954             0.110088141
> > > > 25600                 0.256453187             0.208578633
> > > > 51200                 1.166602505             1.118972796
> > > > 102400                3.684957231             3.635116808
> > > > 204800                8.916302592             8.861590734
> > > > 409600               19.411057216            19.358777670
> > > > ----------------------------------------------------------------------
> > > >
> > > >                         Third run result
> > > > ----------------------------------------------------------------------
> > > > length (bytes)   C implementation (s)   Basic-instruction asm (s)
> > > > ----------------------------------------------------------------------
> > > > 100                   0.002208111             0.002293227
> > > > 200                   0.005056101             0.004628539
> > > > 400                   0.008677756             0.007748687
> > > > 800                   0.014085242             0.012404443
> > > > 1600                  0.023397782             0.020115710
> > > > 3200                  0.040242985             0.034084435
> > > > 6400                  0.072116665             0.060063767
> > > > 12800                 0.134060262             0.110082427
> > > > 25600                 0.257865186             0.209101754
> > > > 51200                 1.174257177             1.117753408
> > > > 102400                3.696518162             3.635417503
> > > > 204800                8.929357747             8.858765915
> > > > 409600               19.426520562            19.356515671
> > > > ----------------------------------------------------------------------
> > > >
> > > > From the test results, it can be seen that the runtime of the memset
> > > > implemented in basic-instruction assembly is generally shorter than
> > > > that of the C implementation. May I ask if the test results are
> > > > convincing?
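
[Purely for illustration, a minimal sketch of the timing harness described
above: run each size LOOP_TIMES times, sort the per-iteration times, and sum
only the middle 300. The names (bench_memset, KEEP) and the direct call to
libc memset are placeholders, not the code that was actually benchmarked.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LOOP_TIMES 500
#define KEEP       300

static int cmp(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

/* Time one memset implementation on a buffer of buflen bytes and return
 * the sum of the middle KEEP out of LOOP_TIMES sorted iteration times. */
static double bench_memset(void *buf, size_t buflen)
{
	static double t[LOOP_TIMES];
	for (int i = 0; i < LOOP_TIMES; i++) {
		struct timespec t0, t1;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		memset(buf, 0, buflen);   /* or the asm variant under test */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		t[i] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
	}
	qsort(t, LOOP_TIMES, sizeof t[0], cmp);
	double sum = 0;
	for (int i = (LOOP_TIMES - KEEP) / 2; i < (LOOP_TIMES + KEEP) / 2; i++)
		sum += t[i];
	return sum;
}

int main(void)
{
	/* Same size ladder as the tables above: 100 bytes doubling to 409600. */
	for (size_t len = 100; len <= 409600; len *= 2) {
		void *buf = malloc(len);
		if (!buf) return 1;
		printf("%zu\t%.9f\n", len, bench_memset(buf, len));
		free(buf);
	}
	return 0;
}
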
> > >
> > > small sizes are much more common than large sizes, memsets can be
> > > distributed such that sizes [0,100), [100,1000), [1000,inf) are
> > > used for 1/3 of all memsets each (not the call count, but the
> > > amount of bytes memset using such sizes), i.e. if you speed up
> > > the size = [100,1000) and [1000,inf) cases by 10% but regress the
> > > [0,100) case by 20% then the overall performance roughly stays
> > > the same. (of course this is very workload dependent, but across
> > > a system this is what i'd expect, probably even more skewed to
> > > smaller sizes).
> > >
> > > so we need to know what happens in the [0,100) range. what i see
> > > is a ~4% regression there while there is a ~10% improvement in
> > > the [100,1000) case and ~15% improvement in the [1000,inf) case
> > > (it would be nice to know why the 25k case is so much faster and
> > > why that speed up only applies to that size, we don't want to
> > > optimize for some obscure cpu bug that will go away next year)
> > >
> > > on practical workloads i would expect < 10% speedup overall from
> > > the asm code (but we need more data in the [0,100) range to tell).
> > > this may not be enough to justify the asm code.
> > >
> > > rich already said he prefers a different style of implementation
> > > (where the body of the function is in c but the inner loop is in
> > > asm if that helps e.g. via simd).
> >
> > I don't think writing it all in C is viable, at least if you want to
> > squeeze every last bit of performance out of it (while avoiding
> > idiotic codegen that sometimes pops up).
> > Even with inline asm, I severely question its effectiveness. As I see
>
> I don't see any good reason for this doubt. If you claim it's not
> viable, you should show cases where you really can't get the compiler
> to do something reasonable with this type of code.
>
> If the loop body were tiny and the loop control were a significant
> portion of the loop execution overhead, then I could see this
> potentially being a problem. But the main/only interesting case for
> asm is where you're operating on largeish blocks.
>
> > it, we have two major roadblocks for fast stringops support (and one
> > more for riscv):
> >
> > 1) Support GNU_IFUNC (as glibc, FreeBSD, etc do) to automatically
> > dispatch stringops functions to the best implementation according to
> > the CPU feature set. I have no good solution for static linking folks.
>
> Of course this is not an option, but it's also not needed. There is
> only relevant dispatch cost when size is small, but you don't want or
> need to dispatch to asm variants when size is small, so the dispatch
> goes in the branch for large sizes, and the cost is effectively zero.
>
> > 2) (Optional) Play around with C codegen that could add SIMD, inline
> > asm to try to make it fast-ish. LLVM folks have played around with
> > string ops written entirely in C++ through
> > __builtin_memcpy_inline (which does smart choices wrt overlapping
> > loads/stores, SIMD, etc depending on the size). Sadly,
> > __builtin_memcpy_inline is/was not available in GCC.
>
> The basic strategy here is to do head/tail of the operation with plain
> portable C, in a minimal-length, minimal-branch fast path. Then, if
> any middle that wasn't covered by the head/tail remains, either use an
> arch-provided block operation primitive that's (for example; subject
> to tuning) allowed to assume alignment of size and either src or dest,
> or dispatch to a hwcap-specific bulk operation in asm that can make
> similar assumptions.
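
[Again only an illustrative sketch of the shape being described, not musl
code: a branch-light portable-C head/tail, with the aligned middle handed to
a bulk primitive. The names my_memset and memset_bulk_aligned, and the
16/128-byte thresholds, are made up for the example; the stand-in bulk
routine is where an asm or hwcap-dispatched implementation could go.]

#include <stdint.h>
#include <string.h>

/* Stand-in for an arch-provided bulk primitive: it may assume dest is
 * 8-byte aligned and n is a multiple of 8, and could be asm or SIMD. */
static void memset_bulk_aligned(unsigned char *dest, uint64_t word, size_t n)
{
	for (size_t i = 0; i < n; i += 8)
		memcpy(dest + i, &word, 8);
}

void *my_memset(void *dest, int c, size_t n)
{
	unsigned char *d = dest;
	uint64_t word = 0x0101010101010101ull * (unsigned char)c;

	/* Small sizes: short byte loop, no dispatch cost at all. */
	if (n < 16) {
		for (size_t i = 0; i < n; i++)
			d[i] = (unsigned char)c;
		return dest;
	}

	/* Head and tail in plain portable C (memcpy handles any alignment);
	 * overlapping the middle is harmless for memset. */
	memcpy(d, &word, 8);
	memcpy(d + n - 8, &word, 8);

	/* Whatever middle remains is 8-byte aligned and a multiple of 8. */
	size_t skip = -(uintptr_t)d & 7;
	unsigned char *mid = d + skip;
	size_t midlen = (n - skip) & ~(size_t)7;

	if (midlen >= 128) {
		/* Only large sizes reach this branch, so a hwcap/asm dispatch
		 * placed here costs the small-size path nothing. */
		memset_bulk_aligned(mid, word, midlen);
	} else {
		for (size_t i = 0; i < midlen; i += 8)
			memcpy(mid + i, &word, 8);
	}
	return dest;
}
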
>
> > Testing the performance of C+inline asm vs pure asm would be
> > interesting.
>
> Yes but I don't think we'll find anything unexpected. In theory you
> can probably shave a couple cycles writing the asm by hand, but that
> has a lot of costs that aren't sustainable, and pessimizes things like
> LTO (for example, in LTO, the short-size fast paths may be able to be
> inlined when the exact size isn't known but value range analysis
> determines it's always in the small range).
>
> > 3) Write riscv stringops code in assembly once CPUs get more advanced
> > and we finally get a good idea on how the things perform. I still
> > think it's too new to optimize specifically for.
> > Extensions are popping up left and right, vector extensions aren't yet
> > properly supported in the kernel, and (most importantly) we don't have
> > a proper way to detect riscv features just yet.
> > For instance, doing unaligned accesses may either have little to no
> > performance penalty, they may have a big performance penalty (trapped
> > to M mode and emulated), or they may just not be supported at all.
> > AFAIK, atm Linux userspace has no way of finding this out (patchset
> > for this is still pending I think?), and things like the existence of
> > cheap unaligned accesses are a make-or-break for stringops as you get
> > to avoid *soooo* many branches.
>
> Yes, for RISC-V there is no way forward on vector or other ISA
> extensions until the specs are firmed up and the framework for
> detecting their presence is in place.

(the vector state save/restore stuff still isn't in yet, which is making
me worry it won't make linux 6.4, but fwiw the risc-v hwprobe patches
went into linux-next earlier this week, so there's some progress there
at least...)

> > In the RISCV case, you probably want to end up with at least 3 mem*
> > variants (no-unaligned, unaligned, vector).
>
> For memset, there's hardly a reason to waste effort on unaligned
> versions. The middle is always aligned. For memcpy/memmove, where you
> have src and dest misaligned modulo each other, the ability to do
> unaligned loads or stores is valuable, and something the general
> framework (in C) should allow us to take advantage of. I hadn't really
> considered the possibility that we might want to support unaligned
> accesses that are only known about at runtime, rather than as part of
> the ISA level you're building for, so perhaps this is something we
> should consider.
>
> Rich
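
[To illustrate that last point, a rough sketch of how a C framework might let
memcpy's middle loop use unaligned loads only when that is known to be safe at
runtime. The unaligned_ok flag stands in for whatever hwcap/hwprobe mechanism
eventually exists, and my_memcpy and its thresholds are invented for the
example; none of this is existing musl code.]

#include <stdint.h>
#include <string.h>

/* Set once at startup from runtime feature detection (hwprobe, hwcaps, ...);
 * 0 is the conservative default. */
static int unaligned_ok;

void *my_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Small sizes: plain byte loop, no dispatch. */
	if (n < 32) {
		for (size_t i = 0; i < n; i++)
			d[i] = s[i];
		return dest;
	}

	/* Align the destination; the source alignment is whatever it is. */
	size_t skip = -(uintptr_t)d & 7;
	for (size_t i = 0; i < skip; i++)
		d[i] = s[i];
	d += skip; s += skip; n -= skip;

	if (((uintptr_t)s & 7) == 0 || unaligned_ok) {
		/* Word-at-a-time middle: either the source happens to be
		 * aligned too, or unaligned loads are known to be cheap.
		 * memcpy of a fixed 8 bytes compiles down to a single
		 * (possibly unaligned) load plus an aligned store. */
		uint64_t w;
		while (n >= 8) {
			memcpy(&w, s, 8);
			memcpy(d, &w, 8);
			d += 8; s += 8; n -= 8;
		}
	}

	/* Tail, and the whole middle when unaligned loads are not usable. */
	while (n--)
		*d++ = *s++;
	return dest;
}
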