On Fri, Apr 21, 2023 at 9:54 AM Rich Felker <dalias@libc.org> wrote:

> On Fri, Apr 21, 2023 at 03:50:45PM +0100, Pedro Falcato wrote:
> > On Fri, Apr 21, 2023 at 2:37 PM Szabolcs Nagy <nsz@port70.net> wrote:
> > >
> > > * 张飞 <zhangfei@nj.iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> > > > Hi!
> > > > I listened to your suggestions and referred to string.c in Musl's
> test set(libc-bench),
> > > > and then modified the test cases. Since BUFLEN is a fixed value in
> strlen.c, I modified
> > > > it to a variable as a parameter in my own test case and passed it to
> the memset function.
> > > > I adjusted the LOOP_TIMES has been counted up to 500 times and the
> running time has been
> > > > sorted, only recording the running time of the middle 300 times.
> > > >
> > > > I took turns executing two programs on the SiFive chip three times
> each, and the results
> > > > are shown below.
> > > >                              First run result
> > > >
> --------------------------------------------------------------------------------
> > > > length(byte)  C language implementation(s)   Basic instruction
> implementation(s)
> > > >
> --------------------------------------------------------------------------------
> > > > 100                 0.002208102                     0.002304056
> > > > 200                 0.005053208                     0.004629598
> > > > 400                 0.008666684                     0.007739176
> > > > 800                 0.014065196                     0.012372702
> > > > 1600                0.023377685                     0.020090966
> > > > 3200                0.040221849                     0.034059631
> > > > 6400                0.072095377                     0.060028906
> > > > 12800               0.134040475                     0.110039387
> > > > 25600               0.257426806                     0.210710952
> > > > 51200               1.173755160                     1.121833227
> > > > 102400              3.693170402                     3.637194098
> > > > 204800              8.919975455                     8.865504460
> > > > 409600             19.410922418                    19.360956493
> > > >
> --------------------------------------------------------------------------------
> > > >
> > > >                              Second run result
> > > >
> --------------------------------------------------------------------------------
> > > > length(byte)  C language implementation(s)   Basic instruction
> implementation(s)
> > > >
> --------------------------------------------------------------------------------
> > > > 100                 0.002208109                     0.002293857
> > > > 200                 0.005057374                     0.004640669
> > > > 400                 0.008674218                     0.007760795
> > > > 800                 0.014068582                     0.012417084
> > > > 1600                0.023381095                     0.020124496
> > > > 3200                0.040225138                     0.034093181
> > > > 6400                0.072098744                     0.060069574
> > > > 12800               0.134043954                     0.110088141
> > > > 25600               0.256453187                     0.208578633
> > > > 51200               1.166602505                     1.118972796
> > > > 102400              3.684957231                     3.635116808
> > > > 204800              8.916302592                     8.861590734
> > > > 409600             19.411057216                    19.358777670
> > > >
> --------------------------------------------------------------------------------
> > > >
> > > >                              Third run result
> > > >
> --------------------------------------------------------------------------------
> > > > length(byte)  C language implementation(s)   Basic instruction
> implementation(s)
> > > >
> --------------------------------------------------------------------------------
> > > > 100                 0.002208111                     0.002293227
> > > > 200                 0.005056101                     0.004628539
> > > > 400                 0.008677756                     0.007748687
> > > > 800                 0.014085242                     0.012404443
> > > > 1600                0.023397782                     0.020115710
> > > > 3200                0.040242985                     0.034084435
> > > > 6400                0.072116665                     0.060063767
> > > > 12800               0.134060262                     0.110082427
> > > > 25600               0.257865186                     0.209101754
> > > > 51200               1.174257177                     1.117753408
> > > > 102400              3.696518162                     3.635417503
> > > > 204800              8.929357747                     8.858765915
> > > > 409600             19.426520562                     19.356515671
> > > >
> --------------------------------------------------------------------------------
> > > >
> > > > From the test results, it can be seen that the runtime of memset
> implemented using the basic
> > > > instruction set assembly is basically shorter than that implemented
> using the C language.
> > > > May I ask if the test results are convincing?
> > >
> > > small sizes are much more common than large sizes, memsets can be
> > > distributed such that sizes [0,100), [100,1000), [1000,inf) are
> > > used for 1/3 of all memsets each (not the call count, but the
> > > amount of bytes memset using such sizes), i.e. if you speed up
> > > the size = [100,1000) and [1000,inf) cases by 10% but regress the
> > > [0,100) case by 20% then the overall performance roughly stays
> > > the same. (of course this is very workload dependent, but across
> > > a system this is what i'd expect, probably even more skewed to
> > > smaller sizes).
> > >
> > > so we need to know what happens in the [0,100) range. what i see
> > > is a ~4% regression there while there is a ~10% improvement in
> > > the [100,1000) case and ~15% improvement in the [1000,inf) case
> > > (it would be nice to know why the 25k case is so much faster and
> > > why that speed up only applies to that size, we don't want to
> > > optimize for some obscure cpu bug that will go away next year)
> > >
> > > on practical workloads i would expect < 10% speedup overall from
> > > the asm code (but we need more data in the [0,100) range to tell).
> > > this may not be enough to justify the asm code.
> > >
> > > rich already said he prefers a different style of implementation
> > > (where the body of the function is in c but the inner loop is in
> > > asm if that helps e.g. via simd).
> >
> > I don't think writing it all in C is viable, at least if you want to
> > squeeze every last bit of performance out of it (while avoiding
> > idiotic codegen that sometimes pops up).
> > Even with inline asm, I severely question its effectiveness. As I see
>
> I don't see any good reason for this doubt. If you claim it's not
> viable, you should show cases where you really can't get the compiler
> to do something reasonable with this type of code.
>
> If the loop body were tiny and the loop control were a significant
> portion of the loop execution overhead, then I could see this
> potentially being a problem. But the main/only interesting case for
> asm is where you're operating on largeish blocks.
>
> > it, we have two major roadblocks for fast stringops support (and one
> > more for riscv):
> >
> > 1) Support GNU_IFUNC (as glibc, FreeBSD, etc do) to automatically
> > dispatch stringops functions to the best implementation according to
> > the CPU feature set. I have no good solution for static linking folks.
>
> Of course this is not an option, but it's also not needed. There is
> only relevant dispatch cost when size is small, but you don't want or
> need to dispatch to asm variants when size is small, so the dispatch
> goes in the branch for large sizes, and the cost is effectively zero.
>
> > 2) (Optional) Play around with C codegen that could add SIMD, inline
> > asm to try to make it fast-ish. LLVM folks have played around with
> > string ops written entirely in C++ through
> > __builtin_memcpy_inline (which does smart choices wrt overlapping
> > loads/stores, SIMD, etc depending on the size). Sadly,
> > __builtin_memcpy_inline is/was not available in GCC.
>
> The basic strategy here is to do head/tail of the operation with plain
> portable C, in a minimal-length, minimal-branch fast path. Then, if
> any middle that wasn't covered by the head/tail remains, either use an
> arch-provided block operation primitive that's (for example; subject
> to tuning) allowed to assume alignment of size and either src or dest,
> or dispatch to a hwcap-specific bulk operation in asm that can make
> similar assumptions.
>
> > Testing the performance of C+inline asm vs pure asm would be interesting.
>
> Yes but I don't think we'll find anything unexpected. In theory you
> can probaby shave a couple cycles writing the asm by hand, but that
> has a lot of costs that aren't sustainable, and pessimizes things like
> LTO (for example, in LTO, the short-size fast paths may be able to be
> inlined when the exact size isn't known but value range analysis
> determines it's always in the small range).
>
> > 3) Write riscv stringops code in assembly once CPUs get more advanced
> > and we finally get a good idea on how the things perform. I still
> > think it's too new to optimize specifically for.
> > Extensions are popping up left and right, vector extensions aren't yet
> > properly supported in the kernel, and (most importantly) we don't have
> > a proper way to detect riscv features just yet.
> > For instance, doing unaligned accesses may either have little to no
> > performance penalty, they may have a big performance penalty (trapped
> > to M mode and emulated), or they may just not be supported at all.
> > AFAIK, atm Linux userspace has no way of finding this out (patchset
> > for this is still pending I think?), and things like the existence of
> > cheap unaligned accesses are a make-or-break for stringops as you get
> > to avoid *soooo* many branches.
>
> Yes, for RISC-V there is no way forward on vector or other ISA
> extensions until the specs are firmed up and the framework for
> detecting their presence is in place.
>

(the vector state save/restore stuff still isn't in yet, which is making me
worry it won't make linux 6.4, but fwiw the risc-v hwprobe patches went
into linux-next earlier this week, so there's some progress there at
least...)


> > In the RISCV case, you probably want to end up with at least 3 mem*
> > variants (no-unaligned, unaligned, vector).
>
> For memset, there's hardly a reason to waste effort on unaligned
> versions. The middle is always aligned. For memcpy/memmove, where you
> have src and dest misaligned modulo each other, the ability to do
> unaligned loads or stores is valuable, and something the general
> framework (in C) should allow us to take advantage of. I hadn't really
> considered the possibility that we might want to support
> unaligned-access support that's only known at runtime, rather than
> part of the ISA level you're building for, so perhaps this is
> something we should consider.
>
> Rich
>