At 2023-02-11 17:22:49, "Markus Wichmann" wrote: > >Getting back to this, with the benchmarks recently performed, I notice >one area where C++ is able to significantly outperform C, and that is in >the area of generic functions. C++ std::sort is a template function, and >the compiler can optimize the code for the situation. For example, in >your case, you only ever need to swap two elements by way of copying >four bytes, which the compiler can just inline. > Yes, I tried -O2 with g++ sort, its performance is much better. >In C, the same code can only call memcpy(), and a call to cycle() will >involve many calls to memcpy(), and they cannot be inlined since the >compiler will not know the copy size at the time it is compiling the >calls. We end up with tons of calls to memcpy() with a size of four, >when memcpy() is optimized for large sizes. > >Merge sort, as glibc implements it, allows at least a few memcpy() calls >with larger sizes. Glibc also avoids calling memcpy by duplicating the >code for common sizes, only resorting to memcpy() for the default case. >That may be an avenue worth pursuing, at least for cycle(). > Totally agree with this, lots of memcpy function call overhead could be avoid. >I'm not quite certain why glibc chose the sizes they chose for >msort_with_tmp(), but they are branching out for 32-bit integers, 64-bit >integers, longs (which in a System V ABI are always one of the former >two) and void* (which in a System V ABI are always the same as a long). >I think special casing 32-bit and 64-bit integers ought to get us most >of the way there, since those are the most common sizes that can still >be copied simply inline. > >Also note that glibc further reduces the need for memcpy() by sorting >indirectly if the size of a single element exceeds 32 bytes. That is of >course something they can do, since they already are allocating memory >and have a fall-back strategy in place. Not sure if this is worthwhile >for musl. At least in the static linking case, it would suddenly pull >malloc() and free() into executables that previously did not have those. > >That last optimization would also not have any bearing on the current >batch of benchmarks, since in those, a size of four is set in stone. > >Ciao, >Markus