At 2023-02-11 17:22:49, "Markus Wichmann" <nullplan@gmx.net> wrote:

>
>Getting back to this, with the benchmarks recently performed, I notice
>one area where C++ is able to significantly outperform C, and that is in
>the area of generic functions. C++ std::sort is a template function, and
>the compiler can optimize the code for the situation. For example, in
>your case, you only ever need to swap two elements by way of copying
>four bytes, which the compiler can just inline.
>

Yes, I tried -O2 with g++ sort, its performance is much better.

>In C, the same code can only call memcpy(), and a call to cycle() will
>involve many calls to memcpy(), and they cannot be inlined since the
>compiler will not know the copy size at the time it is compiling the
>calls. We end up with tons of calls to memcpy() with a size of four,
>when memcpy() is optimized for large sizes.
>
>Merge sort, as glibc implements it, allows at least a few memcpy() calls
>with larger sizes. Glibc also avoids calling memcpy by duplicating the
>code for common sizes, only resorting to memcpy() for the default case.
>That may be an avenue worth pursuing, at least for cycle().
>

Totally agree with this, lots of memcpy function call overhead could be avoid.

>I'm not quite certain why glibc chose the sizes they chose for
>msort_with_tmp(), but they are branching out for 32-bit integers, 64-bit
>integers, longs (which in a System V ABI are always one of the former
>two) and void* (which in a System V ABI are always the same as a long).
>I think special casing 32-bit and 64-bit integers ought to get us most
>of the way there, since those are the most common sizes that can still
>be copied simply inline.
>
>Also note that glibc further reduces the need for memcpy() by sorting
>indirectly if the size of a single element exceeds 32 bytes. That is of
>course something they can do, since they already are allocating memory
>and have a fall-back strategy in place. Not sure if this is worthwhile
>for musl. At least in the static linking case, it would suddenly pull
>malloc() and free() into executables that previously did not have those.
>
>That last optimization would also not have any bearing on the current
>batch of benchmarks, since in those, a size of four is set in stone.
>
>Ciao,
>Markus