Which modern CPUs have a penalty for double precision floating point
arithmetic on scalars compared to single precision once the operands are
in a register, i.e. ignoring memory fetch issues?
I have Agner Fog's excellent document for X86-64 which basically says that 32
bit and 64 bit operations for scalars take the same amount of time.
I am looking for the same type of information for ARM and RISC-V. I found the
data for 32-bit in the online documentation, but nothing about 64-bit.
I cannot find anything on this topic on RISC-V or POWER10.
Maybe I am not searching on the right terms.
Note that I am after the raw performance of the instructions, not, say,
the relative performance of the MUSL sin() routine compared with the MUSL sinf().
Have you looked at the scheduler descriptions for ARM, RISC-V and POWER in GCC or LLVM?
Those per-CPU machine models include instruction latencies.
David