From: enh
Date: Fri, 21 Apr 2023 10:01:05 -0700
To: musl@lists.openwall.com
Cc: Pedro Falcato, 张飞, nsz@port70.net
Subject: Re: Re: Re: [musl] memset_riscv64

On Fri, Apr 21, 2023 at 9:54 AM Rich Felker <dalias@libc.org> wrote:
> On Fri, Apr 21, 2023 at 03:50:45PM +0100, Pedro Falcato wrote:
> > On Fri, Apr 21, 2023 at 2:37 PM Szabolcs Nagy <nsz@port70.net> wrote:
> > >
> > > * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> > > > Hi!
> > > > I listened to your suggestions and referred to string.c in musl's
> > > > test set (libc-bench), and then modified the test cases. Since
> > > > BUFLEN is a fixed value in strlen.c, I changed it into a variable
> > > > that is passed to the memset function as a parameter. I raised
> > > > LOOP_TIMES to 500, sorted the measured running times, and recorded
> > > > only the middle 300 runs.
> > > >
> > > > I took turns executing the two programs on the SiFive chip three
> > > > times each, and the results are shown below.
> > > >
> > > >                         First run result
> > > > ----------------------------------------------------------------------
> > > > length (bytes)   C implementation (s)   Basic-instruction asm (s)
> > > > ----------------------------------------------------------------------
> > > > 100                   0.002208102             0.002304056
> > > > 200                   0.005053208             0.004629598
> > > > 400                   0.008666684             0.007739176
> > > > 800                   0.014065196             0.012372702
> > > > 1600                  0.023377685             0.020090966
> > > > 3200                  0.040221849             0.034059631
> > > > 6400                  0.072095377             0.060028906
> > > > 12800                 0.134040475             0.110039387
> > > > 25600                 0.257426806             0.210710952
> > > > 51200                 1.173755160             1.121833227
> > > > 102400                3.693170402             3.637194098
> > > > 204800                8.919975455             8.865504460
> > > > 409600               19.410922418            19.360956493
> > > > ----------------------------------------------------------------------
> > > >
> > > >                         Second run result
> > > > ----------------------------------------------------------------------
> > > > length (bytes)   C implementation (s)   Basic-instruction asm (s)
> > > > ----------------------------------------------------------------------
> > > > 100                   0.002208109             0.002293857
> > > > 200                   0.005057374             0.004640669
> > > > 400                   0.008674218             0.007760795
> > > > 800                   0.014068582             0.012417084
> > > > 1600                  0.023381095             0.020124496
> > > > 3200                  0.040225138             0.034093181
> > > > 6400                  0.072098744             0.060069574
> > > > 12800                 0.134043954             0.110088141
> > > > 25600                 0.256453187             0.208578633
> > > > 51200                 1.166602505             1.118972796
> > > > 102400                3.684957231             3.635116808
> > > > 204800                8.916302592             8.861590734
> > > > 409600               19.411057216            19.358777670
> > > > ----------------------------------------------------------------------
> > > >
> > > >                         Third run result
> > > > ----------------------------------------------------------------------
> > > > length (bytes)   C implementation (s)   Basic-instruction asm (s)
> > > > ----------------------------------------------------------------------
> > > > 100                   0.002208111             0.002293227
> > > > 200                   0.005056101             0.004628539
> > > > 400                   0.008677756             0.007748687
> > > > 800                   0.014085242             0.012404443
> > > > 1600                  0.023397782             0.020115710
> > > > 3200                  0.040242985             0.034084435
> > > > 6400                  0.072116665             0.060063767
> > > > 12800                 0.134060262             0.110082427
> > > > 25600                 0.257865186             0.209101754
> > > > 51200                 1.174257177             1.117753408
> > > > 102400                3.696518162             3.635417503
> > > > 204800                8.929357747             8.858765915
> > > > 409600               19.426520562            19.356515671
> > > > ----------------------------------------------------------------------
> > > >
> > > > From the test results, it can be seen that the runtime of the memset
> > > > implemented in basic-instruction assembly is generally shorter than
> > > > that of the C implementation. May I ask if the test results are
> > > > convincing?
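
[Purely for illustration, a minimal sketch of the timing harness described
above: run each size LOOP_TIMES times, sort the per-iteration times, and sum
only the middle 300. The names (bench_memset, KEEP) and the direct call to
libc memset are placeholders, not the code that was actually benchmarked.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LOOP_TIMES 500
#define KEEP       300

static int cmp(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

/* Time one memset implementation on a buffer of buflen bytes and return
 * the sum of the middle KEEP out of LOOP_TIMES sorted iteration times. */
static double bench_memset(void *buf, size_t buflen)
{
	static double t[LOOP_TIMES];
	for (int i = 0; i < LOOP_TIMES; i++) {
		struct timespec t0, t1;
		clock_gettime(CLOCK_MONOTONIC, &t0);
		memset(buf, 0, buflen);   /* or the asm variant under test */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		t[i] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
	}
	qsort(t, LOOP_TIMES, sizeof t[0], cmp);
	double sum = 0;
	for (int i = (LOOP_TIMES - KEEP) / 2; i < (LOOP_TIMES + KEEP) / 2; i++)
		sum += t[i];
	return sum;
}

int main(void)
{
	/* Same size ladder as the tables above: 100 bytes doubling to 409600. */
	for (size_t len = 100; len <= 409600; len *= 2) {
		void *buf = malloc(len);
		if (!buf) return 1;
		printf("%zu\t%.9f\n", len, bench_memset(buf, len));
		free(buf);
	}
	return 0;
}
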
> > >
> > > small sizes are much more common than large sizes, memsets can be
> > > distributed such that sizes [0,100), [100,1000), [1000,inf) are
> > > used for 1/3 of all memsets each (not the call count, but the
> > > amount of bytes memset using such sizes), i.e. if you speed up
> > > the size = [100,1000) and [1000,inf) cases by 10% but regress the
> > > [0,100) case by 20% then the overall performance roughly stays
> > > the same. (of course this is very workload dependent, but across
> > > a system this is what i'd expect, probably even more skewed to
> > > smaller sizes).
> > >
> > > so we need to know what happens in the [0,100) range. what i see
> > > is a ~4% regression there while there is a ~10% improvement in
> > > the [100,1000) case and ~15% improvement in the [1000,inf) case
> > > (it would be nice to know why the 25k case is so much faster and
> > > why that speed up only applies to that size, we don't want to
> > > optimize for some obscure cpu bug that will go away next year)
> > >
> > > on practical workloads i would expect < 10% speedup overall from
> > > the asm code (but we need more data in the [0,100) range to tell).
> > > this may not be enough to justify the asm code.
> > >
> > > rich already said he prefers a different style of implementation
> > > (where the body of the function is in c but the inner loop is in
> > > asm if that helps e.g. via simd).
> >
> > I don't think writing it all in C is viable, at least if you want to
> > squeeze every last bit of performance out of it (while avoiding
> > idiotic codegen that sometimes pops up).
> > Even with inline asm, I severely question its effectiveness. As I see
>
> I don't see any good reason for this doubt. If you claim it's not
> viable, you should show cases where you really can't get the compiler
> to do something reasonable with this type of code.
>
> If the loop body were tiny and the loop control were a significant
> portion of the loop execution overhead, then I could see this
> potentially being a problem. But the main/only interesting case for
> asm is where you're operating on largeish blocks.
>
> > it, we have two major roadblocks for fast stringops support (and one
> > more for riscv):
> >
> > 1) Support GNU_IFUNC (as glibc, FreeBSD, etc do) to automatically
> > dispatch stringops functions to the best implementation according to
> > the CPU feature set. I have no good solution for static linking folks.
>
> Of course this is not an option, but it's also not needed. There is
> only relevant dispatch cost when size is small, but you don't want or
> need to dispatch to asm variants when size is small, so the dispatch
> goes in the branch for large sizes, and the cost is effectively zero.
>
> > 2) (Optional) Play around with C codegen that could add SIMD, inline
> > asm to try to make it fast-ish. LLVM folks have played around with
> > string ops written entirely in C++ through
> > __builtin_memcpy_inline (which does smart choices wrt overlapping
> > loads/stores, SIMD, etc depending on the size). Sadly,
> > __builtin_memcpy_inline is/was not available in GCC.
>
> The basic strategy here is to do head/tail of the operation with plain
> portable C, in a minimal-length, minimal-branch fast path. Then, if
> any middle that wasn't covered by the head/tail remains, either use an
> arch-provided block operation primitive that's (for example; subject
> to tuning) allowed to assume alignment of size and either src or dest,
> or dispatch to a hwcap-specific bulk operation in asm that can make
> similar assumptions.
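
[Again only an illustrative sketch of the shape being described, not musl
code: a branch-light portable-C head/tail, with the aligned middle handed to
a bulk primitive. The names my_memset and memset_bulk_aligned, and the
16/128-byte thresholds, are made up for the example; the stand-in bulk
routine is where an asm or hwcap-dispatched implementation could go.]

#include <stdint.h>
#include <string.h>

/* Stand-in for an arch-provided bulk primitive: it may assume dest is
 * 8-byte aligned and n is a multiple of 8, and could be asm or SIMD. */
static void memset_bulk_aligned(unsigned char *dest, uint64_t word, size_t n)
{
	for (size_t i = 0; i < n; i += 8)
		memcpy(dest + i, &word, 8);
}

void *my_memset(void *dest, int c, size_t n)
{
	unsigned char *d = dest;
	uint64_t word = 0x0101010101010101ull * (unsigned char)c;

	/* Small sizes: short byte loop, no dispatch cost at all. */
	if (n < 16) {
		for (size_t i = 0; i < n; i++)
			d[i] = (unsigned char)c;
		return dest;
	}

	/* Head and tail in plain portable C (memcpy handles any alignment);
	 * overlapping the middle is harmless for memset. */
	memcpy(d, &word, 8);
	memcpy(d + n - 8, &word, 8);

	/* Whatever middle remains is 8-byte aligned and a multiple of 8. */
	size_t skip = -(uintptr_t)d & 7;
	unsigned char *mid = d + skip;
	size_t midlen = (n - skip) & ~(size_t)7;

	if (midlen >= 128) {
		/* Only large sizes reach this branch, so a hwcap/asm dispatch
		 * placed here costs the small-size path nothing. */
		memset_bulk_aligned(mid, word, midlen);
	} else {
		for (size_t i = 0; i < midlen; i += 8)
			memcpy(mid + i, &word, 8);
	}
	return dest;
}
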
>
> > Testing the performance of C+inline asm vs pure asm would be
> > interesting.
>
> Yes but I don't think we'll find anything unexpected. In theory you
> can probably shave a couple cycles writing the asm by hand, but that
> has a lot of costs that aren't sustainable, and pessimizes things like
> LTO (for example, in LTO, the short-size fast paths may be able to be
> inlined when the exact size isn't known but value range analysis
> determines it's always in the small range).
>
> > 3) Write riscv stringops code in assembly once CPUs get more advanced
> > and we finally get a good idea on how the things perform. I still
> > think it's too new to optimize specifically for.
> > Extensions are popping up left and right, vector extensions aren't yet
> > properly supported in the kernel, and (most importantly) we don't have
> > a proper way to detect riscv features just yet.
> > For instance, doing unaligned accesses may either have little to no
> > performance penalty, they may have a big performance penalty (trapped
> > to M mode and emulated), or they may just not be supported at all.
> > AFAIK, atm Linux userspace has no way of finding this out (patchset
> > for this is still pending I think?), and things like the existence of
> > cheap unaligned accesses are a make-or-break for stringops as you get
> > to avoid *soooo* many branches.
>
> Yes, for RISC-V there is no way forward on vector or other ISA
> extensions until the specs are firmed up and the framework for
> detecting their presence is in place.

(the vector state save/restore stuff still isn't in yet, which is making
me worry it won't make linux 6.4, but fwiw the risc-v hwprobe patches
went into linux-next earlier this week, so there's some progress there
at least...)

> > In the RISCV case, you probably want to end up with at least 3 mem*
> > variants (no-unaligned, unaligned, vector).
>
> For memset, there's hardly a reason to waste effort on unaligned
> versions. The middle is always aligned. For memcpy/memmove, where you
> have src and dest misaligned modulo each other, the ability to do
> unaligned loads or stores is valuable, and something the general
> framework (in C) should allow us to take advantage of. I hadn't really
> considered the possibility that we might want to support unaligned
> accesses that are only known about at runtime, rather than as part of
> the ISA level you're building for, so perhaps this is something we
> should consider.
>
> Rich
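
[To illustrate that last point, a rough sketch of how a C framework might let
memcpy's middle loop use unaligned loads only when that is known to be safe at
runtime. The unaligned_ok flag stands in for whatever hwcap/hwprobe mechanism
eventually exists, and my_memcpy and its thresholds are invented for the
example; none of this is existing musl code.]

#include <stdint.h>
#include <string.h>

/* Set once at startup from runtime feature detection (hwprobe, hwcaps, ...);
 * 0 is the conservative default. */
static int unaligned_ok;

void *my_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Small sizes: plain byte loop, no dispatch. */
	if (n < 32) {
		for (size_t i = 0; i < n; i++)
			d[i] = s[i];
		return dest;
	}

	/* Align the destination; the source alignment is whatever it is. */
	size_t skip = -(uintptr_t)d & 7;
	for (size_t i = 0; i < skip; i++)
		d[i] = s[i];
	d += skip; s += skip; n -= skip;

	if (((uintptr_t)s & 7) == 0 || unaligned_ok) {
		/* Word-at-a-time middle: either the source happens to be
		 * aligned too, or unaligned loads are known to be cheap.
		 * memcpy of a fixed 8 bytes compiles down to a single
		 * (possibly unaligned) load plus an aligned store. */
		uint64_t w;
		while (n >= 8) {
			memcpy(&w, s, 8);
			memcpy(d, &w, 8);
			d += 8; s += 8; n -= 8;
		}
	}

	/* Tail, and the whole middle when unaligned loads are not usable. */
	while (n--)
		*d++ = *s++;
	return dest;
}
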