* [musl] x86 fma with run-time switch? @ 2024-03-15 17:53 Markus Wichmann 2024-03-15 21:36 ` Rich Felker 0 siblings, 1 reply; 5+ messages in thread From: Markus Wichmann @ 2024-03-15 17:53 UTC (permalink / raw) To: musl Hi all, in commit e9016138, Szabolcs wrote into the message that we really should be using the single-instruction versions if possible, and we should be switching at run time. I have an idea for how to do that without losing all of the history of the generic fma.c: - Rename src/math/fma.c to src/math/fma-soft.h. Rename the fma function inside to fma_soft and make it static (inline?). - Create a new src/math/fma.c that includes fma-soft.h and just calls fma_soft(). - In src/math/x86_64/fma.c: Unconditionally define fma_fma() and fma_fma4() (which are the current assembler versions) and include fma-soft.h. Create a dispatcher to figure out which version to call, and call that from fma(). Yeah, I know, the header file with stuff in it that takes memory is not exactly great, but I can't think of another way to define the generic version such that it is accessible to the arch-specific versions under a different name and linkage. The file must not be a .c file, or else it will confuse the build system. Question I have right out the gate is whether this would be interesting to the group. Second question is whether it is better to be running cpuid every time fma() is called, or to use a function pointer? I am partial to the dispatcher pattern myself. In that case, the function pointer is initialized at load time to point to the dispatcher, which then selects the best implementation and updates the function pointer. The main function only unconditionally calls the function pointer. With a bit of preprocessor magic, I can also ensure that if __FMA__ or __FMA4__ are set, the dispatcher is not included, and only the given function is called. Although that may incur a warning of an unused static function. I suppose that is a problem that can be fixed with more preprocessor magic. From my preliminary research, the fma3 and fma4 ISA extensions require no kernel support, so this will be the first time a CPUID call is needed. fma3 support is signalled with bit 12 of ECX in CPUID function 1. fma4 support is signalled with bit 16 of ECX in CPUID function 0x80000001 - on AMD CPUs. Intel has the bit reserved, so to be extra safe, the CPU vendor ought to be checked, too. Doing the same for i386 requires also verifying kernel SSE support in hwcap (that implies CPUID support in the CPU, since the baseline 80486 does not necessarily have that, but all CPUs with SSE have it) and also support for extended CPUID in case of fma4. Since the CPUID challenges would be shared between fma and fmaf, I would like to put them into a new header file in src/include (maybe create src/include/x86_64? Or should it be added to arch/x86_64?) So what are your thoughts on this? Ciao, Markus ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [musl] x86 fma with run-time switch? 2024-03-15 17:53 [musl] x86 fma with run-time switch? Markus Wichmann @ 2024-03-15 21:36 ` Rich Felker 2024-03-16 3:37 ` Markus Wichmann 0 siblings, 1 reply; 5+ messages in thread From: Rich Felker @ 2024-03-15 21:36 UTC (permalink / raw) To: Markus Wichmann; +Cc: musl On Fri, Mar 15, 2024 at 06:53:59PM +0100, Markus Wichmann wrote: > Hi all, > > in commit e9016138, Szabolcs wrote into the message that we really > should be using the single-instruction versions if possible, and we > should be switching at run time. I have an idea for how to do that > without losing all of the history of the generic fma.c: > > - Rename src/math/fma.c to src/math/fma-soft.h. Rename the fma function > inside to fma_soft and make it static (inline?). > - Create a new src/math/fma.c that includes fma-soft.h and just calls > fma_soft(). > - In src/math/x86_64/fma.c: Unconditionally define fma_fma() and > fma_fma4() (which are the current assembler versions) and include > fma-soft.h. Create a dispatcher to figure out which version to call, > and call that from fma(). You're making it too complicated. Just #define fma __soft_fma #include "../fma.c" or similar. > Yeah, I know, the header file with stuff in it that takes memory is not > exactly great, but I can't think of another way to define the generic > version such that it is accessible to the arch-specific versions under a > different name and linkage. The file must not be a .c file, or else it > will confuse the build system. > > Question I have right out the gate is whether this would be interesting > to the group. Second question is whether it is better to be running > cpuid every time fma() is called, or to use a function pointer? I am > partial to the dispatcher pattern myself. In that case, the function > pointer is initialized at load time to point to the dispatcher, which > then selects the best implementation and updates the function pointer. > The main function only unconditionally calls the function pointer. My expectation was that you would just use __hwcap, whereby it would be a hidden global access no more expensive than accessing a function pointer for indirect branch, and likely cheaper to do local direct branches based on testing bits of it, something like: if (__hwcap & WHATEVER) { __asm__(...); return ...; } else return __soft_fma(...); However, it looks like x86_64 lacks usable __hwcap. > With a bit of preprocessor magic, I can also ensure that if __FMA__ or > __FMA4__ are set, the dispatcher is not included, and only the given > function is called. Although that may incur a warning of an unused > static function. I suppose that is a problem that can be fixed with more > preprocessor magic. We already do that. > From my preliminary research, the fma3 and fma4 ISA extensions require > no kernel support, so this will be the first time a CPUID call is > needed. fma3 support is signalled with bit 12 of ECX in CPUID function > 1. fma4 support is signalled with bit 16 of ECX in CPUID function > 0x80000001 - on AMD CPUs. Intel has the bit reserved, so to be extra > safe, the CPU vendor ought to be checked, too. > > Doing the same for i386 requires also verifying kernel SSE support in > hwcap (that implies CPUID support in the CPU, since the baseline 80486 > does not necessarily have that, but all CPUs with SSE have it) and also > support for extended CPUID in case of fma4. That seems a lot less likely to be worthwhile since it involves shuffling data back and forth between x87 and sse registers, but maybe it's still a big enough win to want to do it? The logic for how you determine availability seems right. > Since the CPUID challenges would be shared between fma and fmaf, I would > like to put them into a new header file in src/include (maybe create > src/include/x86_64? Or should it be added to arch/x86_64?) > > So what are your thoughts on this? I'm somewhat skeptical of what value there is to doing this particularly for fma. There're probably a lot more places we don't do any runtime-conditional optimized code that have higher returns (memcpy etc. being the most obvious) and it seems likely that programs that care about fma performance are themselves compiled with the right ISA levels and using the compiler builtin, never calling the function at all. Regardless, I wonder if we should have the x86_64 startup code store cpuid result somewhere we can use, so that we don't have to do nasty atomic stuff determining it late, and could just branch on the bits of a runtime-constant like we would with __hwcap. This would set the stage for being able to do with more-impactful things like mem* too. (That's a big project that involves designing the system for how archs define the large-block-ops the generic C functions would use, so not immediately applicable, but useful to be working towards it.) Rich ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [musl] x86 fma with run-time switch? 2024-03-15 21:36 ` Rich Felker @ 2024-03-16 3:37 ` Markus Wichmann 2024-03-16 9:16 ` Markus Wichmann 2024-03-18 16:15 ` Markus Wichmann 0 siblings, 2 replies; 5+ messages in thread From: Markus Wichmann @ 2024-03-16 3:37 UTC (permalink / raw) To: musl Am Fri, Mar 15, 2024 at 05:36:23PM -0400 schrieb Rich Felker: > You're making it too complicated. Just > > #define fma __soft_fma > #include "../fma.c" > > or similar. > But that leaves a __soft_fma() symbol with external linkage in there. OK, I suppose this is still less of a hack. > My expectation was that you would just use __hwcap, whereby it would > be a hidden global access no more expensive than accessing a function > pointer for indirect branch, and likely cheaper to do local direct > branches based on testing bits of it, something like: > > if (__hwcap & WHATEVER) { > __asm__(...); > return ...; > } else return __soft_fma(...); > > However, it looks like x86_64 lacks usable __hwcap. > My understanding was that __hwcap is a contract between kernel and userspace about what ISA extentions both kernel and CPU support. Whereas FMA on x86 bypasses the kernel. On other architectures, such as PowerPC, userspace has no way to negotiate ISA version with the CPU, so __hwcap does that too, but that is only by necessity. > We already do that. > My point was that I'd try to continue doing that with a minimum of repeated code. > > Doing the same for i386 requires also verifying kernel SSE support in > > hwcap (that implies CPUID support in the CPU, since the baseline 80486 > > does not necessarily have that, but all CPUs with SSE have it) and also > > support for extended CPUID in case of fma4. > > That seems a lot less likely to be worthwhile since it involves > shuffling data back and forth between x87 and sse registers, but maybe > it's still a big enough win to want to do it? > Right. Though FP parameters are passed on stack on i386, they are returned in %st(0). So the calling code would likely have the numbers in FPU, then spill them to memory for the call, then we load it into SSE, spill back to memory, load into FPU, and return. However, just a glance at the generic fma() code makes me doubt a bunch of spilling to memory and running a single SSE instruction can ever be slower than all of what is happening in there. > Regardless, I wonder if we should have the x86_64 startup code store > cpuid result somewhere we can use, so that we don't have to do nasty > atomic stuff determining it late, and could just branch on the bits of > a runtime-constant like we would with __hwcap. This would set the > stage for being able to do with more-impactful things like mem* too. > (That's a big project that involves designing the system for how archs > define the large-block-ops the generic C functions would use, so not > immediately applicable, but useful to be working towards it.) > > Rich The problem with that idea is that CPUID returns an enormous amount of information, but right now we only care about two bits on x86_64 (namely FMA and FMA4 support). So we could have some internal word such as __cpuid, that basically contains a digest of the CPUID information, namely only the bits we care about. That would be extensible for up to 64 bits we want to look at, but it would require complex control flow. And I thought you were against complex control flow in assembler. I am also unsure whether you need much more than __hwcap (and possibly __hwcap2) for the string functions. I don't know what optimizations you have in mind for those, but if you need AVX support, for example, then CPUID information does not help you. You need confirmation from the kernel that it supports AVX, or else you will catch a SIGILL in the attempt to use it. Very few ISA extensions can make do without kernel support, after all. Ciao, Markus ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [musl] x86 fma with run-time switch? 2024-03-16 3:37 ` Markus Wichmann @ 2024-03-16 9:16 ` Markus Wichmann 2024-03-18 16:15 ` Markus Wichmann 1 sibling, 0 replies; 5+ messages in thread From: Markus Wichmann @ 2024-03-16 9:16 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 1399 bytes --] Am Sat, Mar 16, 2024 at 04:37:29AM +0100 schrieb Markus Wichmann: > The problem with that idea is that CPUID returns an enormous amount of > information, but right now we only care about two bits on x86_64 (namely > FMA and FMA4 support). So we could have some internal word such as > __cpuid, that basically contains a digest of the CPUID information, > namely only the bits we care about. That would be extensible for up to > 64 bits we want to look at, but it would require complex control flow. > And I thought you were against complex control flow in assembler. > OK, I may have misunderstood for a moment here. The benefits of fourty winks and a cup of joe, I suppose. But I am attaching an initial proposal to this one, where the initialization is in C. One problem that came to me while making the commits is that this is raising the required ISA level of the assembler. And I don't really know whether that is a problem. Dealing with it requires yet more work: We would have to add a configure test, and then not emit the new instructions if the assembler can't handle them. I don't think we will be able to just emit these instructions as numbers while still using named input and output constraints. On the other hand, these instructions (and support for them) have been around for one and a half decades by now, so it ought to be fine, right? Ciao, Markus [-- Attachment #2: 0001-Add-internal-CPUID-machinery.patch --] [-- Type: text/x-diff, Size: 3559 bytes --] From 9b89d49cd9dae3eb1fd9745e7fc4b91f52d1659f Mon Sep 17 00:00:00 2001 From: Markus Wichmann <nullplan@gmx.net> Date: Sat, 16 Mar 2024 09:51:37 +0100 Subject: [PATCH 1/2] Add internal CPUID machinery. This is meant to provide a way for implementation-internal optimizations to be enabled without a __hwcap flag. The CPUID provides an enormous amount of information, and capturing all of it unconditionally would be incredibly wasteful. Especially with the diversity of implementation out there. So instead I condense exactly the information needed down to one bit per feature, for each interesting feature. For starters, the only features in here are the FMA and FMA4 extentions, but this leaves another 62 bits for other miscellaneous enhancements. I had initially planned to put the call to __init_cpuid() into the arch-specific CRT code, but it is not valid in there. I need access to static variables, and this is not possible in the PIE and dynamic linking cases directly after _start. Only after the relocations were processed. So now I have put it in __libc_start_main, which every process will call between having processed the relocations and running application code. --- src/env/__libc_start_main.c | 2 ++ src/internal/x86_64/cpuid.c | 40 +++++++++++++++++++++++++++++++++++++ src/internal/x86_64/cpuid.h | 12 +++++++++++ 3 files changed, 54 insertions(+) create mode 100644 src/internal/x86_64/cpuid.c create mode 100644 src/internal/x86_64/cpuid.h diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c index c5b277bd..7d7e9f9b 100644 --- a/src/env/__libc_start_main.c +++ b/src/env/__libc_start_main.c @@ -9,6 +9,7 @@ static void dummy(void) {} weak_alias(dummy, _init); +weak_alias(dummy, __init_cpuid); extern weak hidden void (*const __init_array_start)(void), (*const __init_array_end)(void); @@ -38,6 +39,7 @@ void __init_libc(char **envp, char *pn) __init_tls(aux); __init_ssp((void *)aux[AT_RANDOM]); + __init_cpuid(); if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID] && !aux[AT_SECURE]) return; diff --git a/src/internal/x86_64/cpuid.c b/src/internal/x86_64/cpuid.c new file mode 100644 index 00000000..6218ad62 --- /dev/null +++ b/src/internal/x86_64/cpuid.c @@ -0,0 +1,40 @@ +#include "x86_64/cpuid.h" + +uint64_t __cpuid; + +struct regs { + uint32_t ax, bx, cx, dx; +}; + +static inline struct regs cpuid(uint32_t fn) +{ + struct regs ret; + __asm__("cpuid" : "=a"(ret.ax), "=b"(ret.bx), "=c"(ret.cx), "=d"(ret.dx) : "a"(fn)); + return ret; +} + +static inline int cpu_has_fma(void) +{ + struct regs r = cpuid(1); + return r.cx & 0x1000; +} + +static inline int cpu_is_amd(void) +{ + struct regs r = cpuid(0); + return r.bx == 0x68747541 && r.cx == 0x444d4163 && r.dx == 0x69746e65; +} + +static inline int cpu_has_fma4(void) +{ + struct regs r = cpuid(0x80000001); + return r.cx & 0x10000; +} + +void __init_cpuid(void) +{ + if (cpu_has_fma()) + __cpuid |= X86_FEAT_FMA; + if (cpu_is_amd() && cpu_has_fma4()) + __cpuid |= X86_FEAT_FMA4; +} diff --git a/src/internal/x86_64/cpuid.h b/src/internal/x86_64/cpuid.h new file mode 100644 index 00000000..40b66d3e --- /dev/null +++ b/src/internal/x86_64/cpuid.h @@ -0,0 +1,12 @@ +#ifndef X86_64_CPUID_H +#define X86_64_CPUID_H + +#include <features.h> +#include <stdint.h> +extern hidden uint64_t __cpuid; +void __init_cpuid(void); + +#define X86_FEAT_FMA 1 +#define X86_FEAT_FMA4 2 + +#endif -- 2.39.2 [-- Attachment #3: 0002-Runtime-switch-hardware-fma-on-x86_64.patch --] [-- Type: text/x-diff, Size: 2963 bytes --] From 121aae8cbd37396fd3b0e4e6f6d42b70b9966671 Mon Sep 17 00:00:00 2001 From: Markus Wichmann <nullplan@gmx.net> Date: Sat, 16 Mar 2024 10:02:11 +0100 Subject: [PATCH 2/2] Runtime switch hardware fma on x86_64. Instead of only using hardware fma instructions if enabled at compile time (i.e. if compiling at an ISA level that requires these to be present), we can now switch them in at runtime. Compile time switches are still effective and eliminate the other implementations, so the semantics don't change there. But even at baseline ISA level, we can now use the hardware FMA if __cpuid says it's OK. --- src/math/x86_64/fma.c | 28 ++++++++++++++++++++-------- src/math/x86_64/fmaf.c | 28 ++++++++++++++++++++-------- 2 files changed, 40 insertions(+), 16 deletions(-) diff --git a/src/math/x86_64/fma.c b/src/math/x86_64/fma.c index 4dd53f2a..04c6064a 100644 --- a/src/math/x86_64/fma.c +++ b/src/math/x86_64/fma.c @@ -1,23 +1,35 @@ #include <math.h> -#if __FMA__ - -double fma(double x, double y, double z) +static inline double fma_fma(double x, double y, double z) { __asm__ ("vfmadd132sd %1, %2, %0" : "+x" (x) : "x" (y), "x" (z)); return x; } -#elif __FMA4__ - -double fma(double x, double y, double z) +static inline double fma4_fma(double x, double y, double z) { __asm__ ("vfmaddsd %3, %2, %1, %0" : "=x" (x) : "x" (x), "x" (y), "x" (z)); return x; } -#else - +#if !__FMA__ && !__FMA4__ +#include "x86_64/cpuid.h" +#define fma __soft_fma #include "../fma.c" +#undef fma +#endif +double fma(double x, double y, double z) +{ +#if __FMA__ + return fma_fma(x, y, z); +#elif __FMA4__ + return fma4_fma(x, y, z); +#else + if (__cpuid & X86_FEAT_FMA) + return fma_fma(x, y, z); + if (__cpuid & X86_FEAT_FMA4) + return fma4_fma(x, y, z); + return __soft_fma(x, y, z); #endif +} diff --git a/src/math/x86_64/fmaf.c b/src/math/x86_64/fmaf.c index 30b971ff..b4d9b714 100644 --- a/src/math/x86_64/fmaf.c +++ b/src/math/x86_64/fmaf.c @@ -1,23 +1,35 @@ #include <math.h> -#if __FMA__ - -float fmaf(float x, float y, float z) +static inline float fma_fmaf(float x, float y, float z) { __asm__ ("vfmadd132ss %1, %2, %0" : "+x" (x) : "x" (y), "x" (z)); return x; } -#elif __FMA4__ - -float fmaf(float x, float y, float z) +static inline float fma4_fmaf(float x, float y, float z) { __asm__ ("vfmaddss %3, %2, %1, %0" : "=x" (x) : "x" (x), "x" (y), "x" (z)); return x; } -#else - +#if !__FMA__ && !__FMA4__ +#include "x86_64/cpuid.h" +#define fmaf __soft_fmaf #include "../fmaf.c" +#undef fmaf +#endif +float fmaf(float x, float y, float z) +{ +#if __FMA__ + return fma_fmaf(x, y, z); +#elif __FMA4__ + return fma4_fmaf(x, y, z); +#else + if (__cpuid & X86_FEAT_FMA) + return fma_fmaf(x, y, z); + if (__cpuid & X86_FEAT_FMA4) + return fma4_fmaf(x, y, z); + return __soft_fmaf(x, y, z); #endif +} -- 2.39.2 ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [musl] x86 fma with run-time switch? 2024-03-16 3:37 ` Markus Wichmann 2024-03-16 9:16 ` Markus Wichmann @ 2024-03-18 16:15 ` Markus Wichmann 1 sibling, 0 replies; 5+ messages in thread From: Markus Wichmann @ 2024-03-18 16:15 UTC (permalink / raw) To: musl Am Sat, Mar 16, 2024 at 04:37:29AM +0100 schrieb Markus Wichmann: > I am also unsure whether you need much more than __hwcap (and possibly > __hwcap2) for the string functions. I don't know what optimizations you > have in mind for those, but if you need AVX support, for example, then > CPUID information does not help you. You need confirmation from the > kernel that it supports AVX, or else you will catch a SIGILL in the > attempt to use it. Very few ISA extensions can make do without kernel > support, after all. > *sigh* I had another look at how this all is supposed to work now. So __hwcap on x86_64 is just the contents of EDX from CPUID function 1. That's not very helpful at all. There is __hwcap2, which at least contains a bit that tells us whether wrfsbase is OK to use (which we might want to integrate into x86_64's version of __set_thread_area()), but that is it. For AVX and AVX512, the way you are supposed to do it now is: 1. Test the OSXFSR bit from CPUID 2. If set, use XGETBV to read XCR0 and test the YMM/ZMM bit. And that only tells you whether you can access the registers. But every other instruction is actually part of another ISA extension, so now you need to also check the corresponding CPUID bit for the ISA extension. Incidentally, even when __hwcap is useful, it doesn't help too much if the interface is dumb. Because I was also looking up when fsel became non-optional in PowerPC, and it turns out it was at the same time as fsqrt: Namely in Power ISA 2.03. (PowerPC 2.02 still lists it as optional). Right, so we only need to check the v2.03 bit in __hwcap, right? Nope. Because for one, there is no such bit defined, and for two, its closest proxy, the POWER5+ bit, is not set on the following implementations. It took until 2.06 for them to keep the bit around for good. So you have to test for three different bits. Thankfully not spilling into __hwcap2. Ciao, Markus ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-03-18 16:15 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-03-15 17:53 [musl] x86 fma with run-time switch? Markus Wichmann 2024-03-15 21:36 ` Rich Felker 2024-03-16 3:37 ` Markus Wichmann 2024-03-16 9:16 ` Markus Wichmann 2024-03-18 16:15 ` Markus Wichmann
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).