On Fri, Apr 21, 2023 at 9:54 AM Rich Felker wrote: > On Fri, Apr 21, 2023 at 03:50:45PM +0100, Pedro Falcato wrote: > > On Fri, Apr 21, 2023 at 2:37 PM Szabolcs Nagy wrote: > > > > > > * 张飞 [2023-04-20 16:17:10 +0800]: > > > > Hi! > > > > I listened to your suggestions and referred to string.c in Musl's > test set(libc-bench), > > > > and then modified the test cases. Since BUFLEN is a fixed value in > strlen.c, I modified > > > > it to a variable as a parameter in my own test case and passed it to > the memset function. > > > > I adjusted the LOOP_TIMES has been counted up to 500 times and the > running time has been > > > > sorted, only recording the running time of the middle 300 times. > > > > > > > > I took turns executing two programs on the SiFive chip three times > each, and the results > > > > are shown below. > > > > First run result > > > > > -------------------------------------------------------------------------------- > > > > length(byte) C language implementation(s) Basic instruction > implementation(s) > > > > > -------------------------------------------------------------------------------- > > > > 100 0.002208102 0.002304056 > > > > 200 0.005053208 0.004629598 > > > > 400 0.008666684 0.007739176 > > > > 800 0.014065196 0.012372702 > > > > 1600 0.023377685 0.020090966 > > > > 3200 0.040221849 0.034059631 > > > > 6400 0.072095377 0.060028906 > > > > 12800 0.134040475 0.110039387 > > > > 25600 0.257426806 0.210710952 > > > > 51200 1.173755160 1.121833227 > > > > 102400 3.693170402 3.637194098 > > > > 204800 8.919975455 8.865504460 > > > > 409600 19.410922418 19.360956493 > > > > > -------------------------------------------------------------------------------- > > > > > > > > Second run result > > > > > -------------------------------------------------------------------------------- > > > > length(byte) C language implementation(s) Basic instruction > implementation(s) > > > > > -------------------------------------------------------------------------------- > > > > 100 0.002208109 0.002293857 > > > > 200 0.005057374 0.004640669 > > > > 400 0.008674218 0.007760795 > > > > 800 0.014068582 0.012417084 > > > > 1600 0.023381095 0.020124496 > > > > 3200 0.040225138 0.034093181 > > > > 6400 0.072098744 0.060069574 > > > > 12800 0.134043954 0.110088141 > > > > 25600 0.256453187 0.208578633 > > > > 51200 1.166602505 1.118972796 > > > > 102400 3.684957231 3.635116808 > > > > 204800 8.916302592 8.861590734 > > > > 409600 19.411057216 19.358777670 > > > > > -------------------------------------------------------------------------------- > > > > > > > > Third run result > > > > > -------------------------------------------------------------------------------- > > > > length(byte) C language implementation(s) Basic instruction > implementation(s) > > > > > -------------------------------------------------------------------------------- > > > > 100 0.002208111 0.002293227 > > > > 200 0.005056101 0.004628539 > > > > 400 0.008677756 0.007748687 > > > > 800 0.014085242 0.012404443 > > > > 1600 0.023397782 0.020115710 > > > > 3200 0.040242985 0.034084435 > > > > 6400 0.072116665 0.060063767 > > > > 12800 0.134060262 0.110082427 > > > > 25600 0.257865186 0.209101754 > > > > 51200 1.174257177 1.117753408 > > > > 102400 3.696518162 3.635417503 > > > > 204800 8.929357747 8.858765915 > > > > 409600 19.426520562 19.356515671 > > > > > -------------------------------------------------------------------------------- > > > > > > > > From the test results, it can be seen that the runtime of memset > implemented using the basic > > > > instruction set assembly is basically shorter than that implemented > using the C language. > > > > May I ask if the test results are convincing? > > > > > > small sizes are much more common than large sizes, memsets can be > > > distributed such that sizes [0,100), [100,1000), [1000,inf) are > > > used for 1/3 of all memsets each (not the call count, but the > > > amount of bytes memset using such sizes), i.e. if you speed up > > > the size = [100,1000) and [1000,inf) cases by 10% but regress the > > > [0,100) case by 20% then the overall performance roughly stays > > > the same. (of course this is very workload dependent, but across > > > a system this is what i'd expect, probably even more skewed to > > > smaller sizes). > > > > > > so we need to know what happens in the [0,100) range. what i see > > > is a ~4% regression there while there is a ~10% improvement in > > > the [100,1000) case and ~15% improvement in the [1000,inf) case > > > (it would be nice to know why the 25k case is so much faster and > > > why that speed up only applies to that size, we don't want to > > > optimize for some obscure cpu bug that will go away next year) > > > > > > on practical workloads i would expect < 10% speedup overall from > > > the asm code (but we need more data in the [0,100) range to tell). > > > this may not be enough to justify the asm code. > > > > > > rich already said he prefers a different style of implementation > > > (where the body of the function is in c but the inner loop is in > > > asm if that helps e.g. via simd). > > > > I don't think writing it all in C is viable, at least if you want to > > squeeze every last bit of performance out of it (while avoiding > > idiotic codegen that sometimes pops up). > > Even with inline asm, I severely question its effectiveness. As I see > > I don't see any good reason for this doubt. If you claim it's not > viable, you should show cases where you really can't get the compiler > to do something reasonable with this type of code. > > If the loop body were tiny and the loop control were a significant > portion of the loop execution overhead, then I could see this > potentially being a problem. But the main/only interesting case for > asm is where you're operating on largeish blocks. > > > it, we have two major roadblocks for fast stringops support (and one > > more for riscv): > > > > 1) Support GNU_IFUNC (as glibc, FreeBSD, etc do) to automatically > > dispatch stringops functions to the best implementation according to > > the CPU feature set. I have no good solution for static linking folks. > > Of course this is not an option, but it's also not needed. There is > only relevant dispatch cost when size is small, but you don't want or > need to dispatch to asm variants when size is small, so the dispatch > goes in the branch for large sizes, and the cost is effectively zero. > > > 2) (Optional) Play around with C codegen that could add SIMD, inline > > asm to try to make it fast-ish. LLVM folks have played around with > > string ops written entirely in C++ through > > __builtin_memcpy_inline (which does smart choices wrt overlapping > > loads/stores, SIMD, etc depending on the size). Sadly, > > __builtin_memcpy_inline is/was not available in GCC. > > The basic strategy here is to do head/tail of the operation with plain > portable C, in a minimal-length, minimal-branch fast path. Then, if > any middle that wasn't covered by the head/tail remains, either use an > arch-provided block operation primitive that's (for example; subject > to tuning) allowed to assume alignment of size and either src or dest, > or dispatch to a hwcap-specific bulk operation in asm that can make > similar assumptions. > > > Testing the performance of C+inline asm vs pure asm would be interesting. > > Yes but I don't think we'll find anything unexpected. In theory you > can probaby shave a couple cycles writing the asm by hand, but that > has a lot of costs that aren't sustainable, and pessimizes things like > LTO (for example, in LTO, the short-size fast paths may be able to be > inlined when the exact size isn't known but value range analysis > determines it's always in the small range). > > > 3) Write riscv stringops code in assembly once CPUs get more advanced > > and we finally get a good idea on how the things perform. I still > > think it's too new to optimize specifically for. > > Extensions are popping up left and right, vector extensions aren't yet > > properly supported in the kernel, and (most importantly) we don't have > > a proper way to detect riscv features just yet. > > For instance, doing unaligned accesses may either have little to no > > performance penalty, they may have a big performance penalty (trapped > > to M mode and emulated), or they may just not be supported at all. > > AFAIK, atm Linux userspace has no way of finding this out (patchset > > for this is still pending I think?), and things like the existence of > > cheap unaligned accesses are a make-or-break for stringops as you get > > to avoid *soooo* many branches. > > Yes, for RISC-V there is no way forward on vector or other ISA > extensions until the specs are firmed up and the framework for > detecting their presence is in place. > (the vector state save/restore stuff still isn't in yet, which is making me worry it won't make linux 6.4, but fwiw the risc-v hwprobe patches went into linux-next earlier this week, so there's some progress there at least...) > > In the RISCV case, you probably want to end up with at least 3 mem* > > variants (no-unaligned, unaligned, vector). > > For memset, there's hardly a reason to waste effort on unaligned > versions. The middle is always aligned. For memcpy/memmove, where you > have src and dest misaligned modulo each other, the ability to do > unaligned loads or stores is valuable, and something the general > framework (in C) should allow us to take advantage of. I hadn't really > considered the possibility that we might want to support > unaligned-access support that's only known at runtime, rather than > part of the ISA level you're building for, so perhaps this is > something we should consider. > > Rich >