[musl] x86 fma with run-time switch?

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [musl] x86 fma with run-time switch?
@ 2024-03-15 17:53 Markus Wichmann
  2024-03-15 21:36 ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Markus Wichmann @ 2024-03-15 17:53 UTC (permalink / raw)
  To: musl

Hi all,

in commit e9016138, Szabolcs wrote into the message that we really
should be using the single-instruction versions if possible, and we
should be switching at run time. I have an idea for how to do that
without losing all of the history of the generic fma.c:

- Rename src/math/fma.c to src/math/fma-soft.h. Rename the fma function
  inside to fma_soft and make it static (inline?).
- Create a new src/math/fma.c that includes fma-soft.h and just calls
  fma_soft().
- In src/math/x86_64/fma.c: Unconditionally define fma_fma() and
  fma_fma4() (which are the current assembler versions) and include
  fma-soft.h. Create a dispatcher to figure out which version to call,
  and call that from fma().

Yeah, I know, the header file with stuff in it that takes memory is not
exactly great, but I can't think of another way to define the generic
version such that it is accessible to the arch-specific versions under a
different name and linkage. The file must not be a .c file, or else it
will confuse the build system.

Question I have right out the gate is whether this would be interesting
to the group. Second question is whether it is better to be running
cpuid every time fma() is called, or to use a function pointer? I am
partial to the dispatcher pattern myself. In that case, the function
pointer is initialized at load time to point to the dispatcher, which
then selects the best implementation and updates the function pointer.
The main function only unconditionally calls the function pointer.

With a bit of preprocessor magic, I can also ensure that if __FMA__ or
__FMA4__ are set, the dispatcher is not included, and only the given
function is called. Although that may incur a warning of an unused
static function. I suppose that is a problem that can be fixed with more
preprocessor magic.

From my preliminary research, the fma3 and fma4 ISA extensions require
no kernel support, so this will be the first time a CPUID call is
needed. fma3 support is signalled with bit 12 of ECX in CPUID function
1. fma4 support is signalled with bit 16 of ECX in CPUID function
0x80000001 - on AMD CPUs. Intel has the bit reserved, so to be extra
safe, the CPU vendor ought to be checked, too.

Doing the same for i386 requires also verifying kernel SSE support in
hwcap (that implies CPUID support in the CPU, since the baseline 80486
does not necessarily have that, but all CPUs with SSE have it) and also
support for extended CPUID in case of fma4.

Since the CPUID challenges would be shared between fma and fmaf, I would
like to put them into a new header file in src/include (maybe create
src/include/x86_64? Or should it be added to arch/x86_64?)

So what are your thoughts on this?

Ciao,
Markus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [musl] x86 fma with run-time switch?
  2024-03-15 17:53 [musl] x86 fma with run-time switch? Markus Wichmann
@ 2024-03-15 21:36 ` Rich Felker
  2024-03-16  3:37   ` Markus Wichmann
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2024-03-15 21:36 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl

On Fri, Mar 15, 2024 at 06:53:59PM +0100, Markus Wichmann wrote:
> Hi all,
> 
> in commit e9016138, Szabolcs wrote into the message that we really
> should be using the single-instruction versions if possible, and we
> should be switching at run time. I have an idea for how to do that
> without losing all of the history of the generic fma.c:
> 
> - Rename src/math/fma.c to src/math/fma-soft.h. Rename the fma function
>   inside to fma_soft and make it static (inline?).
> - Create a new src/math/fma.c that includes fma-soft.h and just calls
>   fma_soft().
> - In src/math/x86_64/fma.c: Unconditionally define fma_fma() and
>   fma_fma4() (which are the current assembler versions) and include
>   fma-soft.h. Create a dispatcher to figure out which version to call,
>   and call that from fma().

You're making it too complicated. Just

#define fma __soft_fma
#include "../fma.c"

or similar.

> Yeah, I know, the header file with stuff in it that takes memory is not
> exactly great, but I can't think of another way to define the generic
> version such that it is accessible to the arch-specific versions under a
> different name and linkage. The file must not be a .c file, or else it
> will confuse the build system.
> 
> Question I have right out the gate is whether this would be interesting
> to the group. Second question is whether it is better to be running
> cpuid every time fma() is called, or to use a function pointer? I am
> partial to the dispatcher pattern myself. In that case, the function
> pointer is initialized at load time to point to the dispatcher, which
> then selects the best implementation and updates the function pointer.
> The main function only unconditionally calls the function pointer.

My expectation was that you would just use __hwcap, whereby it would
be a hidden global access no more expensive than accessing a function
pointer for indirect branch, and likely cheaper to do local direct
branches based on testing bits of it, something like:

	if (__hwcap & WHATEVER) {
		__asm__(...);
		return ...;
	} else return __soft_fma(...);

However, it looks like x86_64 lacks usable __hwcap.

> With a bit of preprocessor magic, I can also ensure that if __FMA__ or
> __FMA4__ are set, the dispatcher is not included, and only the given
> function is called. Although that may incur a warning of an unused
> static function. I suppose that is a problem that can be fixed with more
> preprocessor magic.

We already do that.

> From my preliminary research, the fma3 and fma4 ISA extensions require
> no kernel support, so this will be the first time a CPUID call is
> needed. fma3 support is signalled with bit 12 of ECX in CPUID function
> 1. fma4 support is signalled with bit 16 of ECX in CPUID function
> 0x80000001 - on AMD CPUs. Intel has the bit reserved, so to be extra
> safe, the CPU vendor ought to be checked, too.
> 
> Doing the same for i386 requires also verifying kernel SSE support in
> hwcap (that implies CPUID support in the CPU, since the baseline 80486
> does not necessarily have that, but all CPUs with SSE have it) and also
> support for extended CPUID in case of fma4.

That seems a lot less likely to be worthwhile since it involves
shuffling data back and forth between x87 and sse registers, but maybe
it's still a big enough win to want to do it?

The logic for how you determine availability seems right.

> Since the CPUID challenges would be shared between fma and fmaf, I would
> like to put them into a new header file in src/include (maybe create
> src/include/x86_64? Or should it be added to arch/x86_64?)
> 
> So what are your thoughts on this?

I'm somewhat skeptical of what value there is to doing this
particularly for fma. There're probably a lot more places we don't do
any runtime-conditional optimized code that have higher returns
(memcpy etc. being the most obvious) and it seems likely that programs
that care about fma performance are themselves compiled with the right
ISA levels and using the compiler builtin, never calling the function
at all.

Regardless, I wonder if we should have the x86_64 startup code store
cpuid result somewhere we can use, so that we don't have to do nasty
atomic stuff determining it late, and could just branch on the bits of
a runtime-constant like we would with __hwcap. This would set the
stage for being able to do with more-impactful things like mem* too.
(That's a big project that involves designing the system for how archs
define the large-block-ops the generic C functions would use, so not
immediately applicable, but useful to be working towards it.)

Rich

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [musl] x86 fma with run-time switch?
  2024-03-15 21:36 ` Rich Felker
@ 2024-03-16  3:37   ` Markus Wichmann
  2024-03-16  9:16     ` Markus Wichmann
  2024-03-18 16:15     ` Markus Wichmann
  0 siblings, 2 replies; 5+ messages in thread
From: Markus Wichmann @ 2024-03-16  3:37 UTC (permalink / raw)
  To: musl

Am Fri, Mar 15, 2024 at 05:36:23PM -0400 schrieb Rich Felker:
> You're making it too complicated. Just
>
> #define fma __soft_fma
> #include "../fma.c"
>
> or similar.
>

But that leaves a __soft_fma() symbol with external linkage in there.
OK, I suppose this is still less of a hack.

> My expectation was that you would just use __hwcap, whereby it would
> be a hidden global access no more expensive than accessing a function
> pointer for indirect branch, and likely cheaper to do local direct
> branches based on testing bits of it, something like:
>
> 	if (__hwcap & WHATEVER) {
> 		__asm__(...);
> 		return ...;
> 	} else return __soft_fma(...);
>
> However, it looks like x86_64 lacks usable __hwcap.
>

My understanding was that __hwcap is a contract between kernel and
userspace about what ISA extentions both kernel and CPU support. Whereas
FMA on x86 bypasses the kernel.

On other architectures, such as PowerPC, userspace has no way to
negotiate ISA version with the CPU, so __hwcap does that too, but that
is only by necessity.

> We already do that.
>

My point was that I'd try to continue doing that with a minimum of
repeated code.

> > Doing the same for i386 requires also verifying kernel SSE support in
> > hwcap (that implies CPUID support in the CPU, since the baseline 80486
> > does not necessarily have that, but all CPUs with SSE have it) and also
> > support for extended CPUID in case of fma4.
>
> That seems a lot less likely to be worthwhile since it involves
> shuffling data back and forth between x87 and sse registers, but maybe
> it's still a big enough win to want to do it?
>

Right. Though FP parameters are passed on stack on i386, they are
returned in %st(0). So the calling code would likely have the numbers in
FPU, then spill them to memory for the call, then we load it into SSE,
spill back to memory, load into FPU, and return. However, just a glance
at the generic fma() code makes me doubt a bunch of spilling to memory
and running a single SSE instruction can ever be slower than all of what
is happening in there.

> Regardless, I wonder if we should have the x86_64 startup code store
> cpuid result somewhere we can use, so that we don't have to do nasty
> atomic stuff determining it late, and could just branch on the bits of
> a runtime-constant like we would with __hwcap. This would set the
> stage for being able to do with more-impactful things like mem* too.
> (That's a big project that involves designing the system for how archs
> define the large-block-ops the generic C functions would use, so not
> immediately applicable, but useful to be working towards it.)
>
> Rich

The problem with that idea is that CPUID returns an enormous amount of
information, but right now we only care about two bits on x86_64 (namely
FMA and FMA4 support). So we could have some internal word such as
__cpuid, that basically contains a digest of the CPUID information,
namely only the bits we care about. That would be extensible for up to
64 bits we want to look at, but it would require complex control flow.
And I thought you were against complex control flow in assembler.

I am also unsure whether you need much more than __hwcap (and possibly
__hwcap2) for the string functions. I don't know what optimizations you
have in mind for those, but if you need AVX support, for example, then
CPUID information does not help you. You need confirmation from the
kernel that it supports AVX, or else you will catch a SIGILL in the
attempt to use it. Very few ISA extensions can make do without kernel
support, after all.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [musl] x86 fma with run-time switch?
  2024-03-16  3:37   ` Markus Wichmann
@ 2024-03-16  9:16     ` Markus Wichmann
  2024-03-18 16:15     ` Markus Wichmann
  1 sibling, 0 replies; 5+ messages in thread
From: Markus Wichmann @ 2024-03-16  9:16 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1399 bytes --]

Am Sat, Mar 16, 2024 at 04:37:29AM +0100 schrieb Markus Wichmann:
> The problem with that idea is that CPUID returns an enormous amount of
> information, but right now we only care about two bits on x86_64 (namely
> FMA and FMA4 support). So we could have some internal word such as
> __cpuid, that basically contains a digest of the CPUID information,
> namely only the bits we care about. That would be extensible for up to
> 64 bits we want to look at, but it would require complex control flow.
> And I thought you were against complex control flow in assembler.
>

OK, I may have misunderstood for a moment here. The benefits of fourty
winks and a cup of joe, I suppose. But I am attaching an initial
proposal to this one, where the initialization is in C.

One problem that came to me while making the commits is that this is
raising the required ISA level of the assembler. And I don't really know
whether that is a problem. Dealing with it requires yet more work: We
would have to add a configure test, and then not emit the new
instructions if the assembler can't handle them. I don't think we will
be able to just emit these instructions as numbers while still using
named input and output constraints.

On the other hand, these instructions (and support for them) have been
around for one and a half decades by now, so it ought to be fine, right?

Ciao,
Markus

[-- Attachment #2: 0001-Add-internal-CPUID-machinery.patch --]
[-- Type: text/x-diff, Size: 3559 bytes --]

From 9b89d49cd9dae3eb1fd9745e7fc4b91f52d1659f Mon Sep 17 00:00:00 2001
From: Markus Wichmann <nullplan@gmx.net>
Date: Sat, 16 Mar 2024 09:51:37 +0100
Subject: [PATCH 1/2] Add internal CPUID machinery.

This is meant to provide a way for implementation-internal optimizations
to be enabled without a __hwcap flag. The CPUID provides an enormous
amount of information, and capturing all of it unconditionally would be
incredibly wasteful. Especially with the diversity of implementation out
there. So instead I condense exactly the information needed down to one
bit per feature, for each interesting feature. For starters, the only
features in here are the FMA and FMA4 extentions, but this leaves
another 62 bits for other miscellaneous enhancements.

I had initially planned to put the call to __init_cpuid() into the
arch-specific CRT code, but it is not valid in there. I need access to
static variables, and this is not possible in the PIE and dynamic
linking cases directly after _start. Only after the relocations were
processed. So now I have put it in __libc_start_main, which every
process will call between having processed the relocations and running
application code.
---
 src/env/__libc_start_main.c |  2 ++
 src/internal/x86_64/cpuid.c | 40 +++++++++++++++++++++++++++++++++++++
 src/internal/x86_64/cpuid.h | 12 +++++++++++
 3 files changed, 54 insertions(+)
 create mode 100644 src/internal/x86_64/cpuid.c
 create mode 100644 src/internal/x86_64/cpuid.h

diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
index c5b277bd..7d7e9f9b 100644
--- a/src/env/__libc_start_main.c
+++ b/src/env/__libc_start_main.c
@@ -9,6 +9,7 @@

 static void dummy(void) {}
 weak_alias(dummy, _init);
+weak_alias(dummy, __init_cpuid);

 extern weak hidden void (*const __init_array_start)(void), (*const __init_array_end)(void);

@@ -38,6 +39,7 @@ void __init_libc(char **envp, char *pn)

 	__init_tls(aux);
 	__init_ssp((void *)aux[AT_RANDOM]);
+        __init_cpuid();

 	if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
 		&& !aux[AT_SECURE]) return;
diff --git a/src/internal/x86_64/cpuid.c b/src/internal/x86_64/cpuid.c
new file mode 100644
index 00000000..6218ad62
--- /dev/null
+++ b/src/internal/x86_64/cpuid.c
@@ -0,0 +1,40 @@
+#include "x86_64/cpuid.h"
+
+uint64_t __cpuid;
+
+struct regs {
+	uint32_t ax, bx, cx, dx;
+};
+
+static inline struct regs cpuid(uint32_t fn)
+{
+	struct regs ret;
+	__asm__("cpuid" : "=a"(ret.ax), "=b"(ret.bx), "=c"(ret.cx), "=d"(ret.dx) : "a"(fn));
+	return ret;
+}
+
+static inline int cpu_has_fma(void)
+{
+	struct regs r = cpuid(1);
+	return r.cx & 0x1000;
+}
+
+static inline int cpu_is_amd(void)
+{
+	struct regs r = cpuid(0);
+	return r.bx == 0x68747541 && r.cx == 0x444d4163 && r.dx == 0x69746e65;
+}
+
+static inline int cpu_has_fma4(void)
+{
+	struct regs r = cpuid(0x80000001);
+	return r.cx & 0x10000;
+}
+
+void __init_cpuid(void)
+{
+	if (cpu_has_fma())
+		__cpuid |= X86_FEAT_FMA;
+	if (cpu_is_amd() && cpu_has_fma4())
+		__cpuid |= X86_FEAT_FMA4;
+}
diff --git a/src/internal/x86_64/cpuid.h b/src/internal/x86_64/cpuid.h
new file mode 100644
index 00000000..40b66d3e
--- /dev/null
+++ b/src/internal/x86_64/cpuid.h
@@ -0,0 +1,12 @@
+#ifndef X86_64_CPUID_H
+#define X86_64_CPUID_H
+
+#include <features.h>
+#include <stdint.h>
+extern hidden uint64_t __cpuid;
+void __init_cpuid(void);
+
+#define X86_FEAT_FMA	    1
+#define X86_FEAT_FMA4	    2
+
+#endif
--
2.39.2


[-- Attachment #3: 0002-Runtime-switch-hardware-fma-on-x86_64.patch --]
[-- Type: text/x-diff, Size: 2963 bytes --]

From 121aae8cbd37396fd3b0e4e6f6d42b70b9966671 Mon Sep 17 00:00:00 2001
From: Markus Wichmann <nullplan@gmx.net>
Date: Sat, 16 Mar 2024 10:02:11 +0100
Subject: [PATCH 2/2] Runtime switch hardware fma on x86_64.

Instead of only using hardware fma instructions if enabled at compile
time (i.e. if compiling at an ISA level that requires these to be
present), we can now switch them in at runtime. Compile time switches
are still effective and eliminate the other implementations, so the
semantics don't change there. But even at baseline ISA level, we can now
use the hardware FMA if __cpuid says it's OK.
---
 src/math/x86_64/fma.c  | 28 ++++++++++++++++++++--------
 src/math/x86_64/fmaf.c | 28 ++++++++++++++++++++--------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/src/math/x86_64/fma.c b/src/math/x86_64/fma.c
index 4dd53f2a..04c6064a 100644
--- a/src/math/x86_64/fma.c
+++ b/src/math/x86_64/fma.c
@@ -1,23 +1,35 @@
 #include <math.h>

-#if __FMA__
-
-double fma(double x, double y, double z)
+static inline double fma_fma(double x, double y, double z)
 {
 	__asm__ ("vfmadd132sd %1, %2, %0" : "+x" (x) : "x" (y), "x" (z));
 	return x;
 }

-#elif __FMA4__
-
-double fma(double x, double y, double z)
+static inline double fma4_fma(double x, double y, double z)
 {
 	__asm__ ("vfmaddsd %3, %2, %1, %0" : "=x" (x) : "x" (x), "x" (y), "x" (z));
 	return x;
 }

-#else
-
+#if !__FMA__ && !__FMA4__
+#include "x86_64/cpuid.h"
+#define fma __soft_fma
 #include "../fma.c"
+#undef fma
+#endif

+double fma(double x, double y, double z)
+{
+#if __FMA__
+	return fma_fma(x, y, z);
+#elif __FMA4__
+	return fma4_fma(x, y, z);
+#else
+	if (__cpuid & X86_FEAT_FMA)
+		return fma_fma(x, y, z);
+	if (__cpuid & X86_FEAT_FMA4)
+		return fma4_fma(x, y, z);
+	return __soft_fma(x, y, z);
 #endif
+}
diff --git a/src/math/x86_64/fmaf.c b/src/math/x86_64/fmaf.c
index 30b971ff..b4d9b714 100644
--- a/src/math/x86_64/fmaf.c
+++ b/src/math/x86_64/fmaf.c
@@ -1,23 +1,35 @@
 #include <math.h>

-#if __FMA__
-
-float fmaf(float x, float y, float z)
+static inline float fma_fmaf(float x, float y, float z)
 {
 	__asm__ ("vfmadd132ss %1, %2, %0" : "+x" (x) : "x" (y), "x" (z));
 	return x;
 }

-#elif __FMA4__
-
-float fmaf(float x, float y, float z)
+static inline float fma4_fmaf(float x, float y, float z)
 {
 	__asm__ ("vfmaddss %3, %2, %1, %0" : "=x" (x) : "x" (x), "x" (y), "x" (z));
 	return x;
 }

-#else
-
+#if !__FMA__ && !__FMA4__
+#include "x86_64/cpuid.h"
+#define fmaf __soft_fmaf
 #include "../fmaf.c"
+#undef fmaf
+#endif

+float fmaf(float x, float y, float z)
+{
+#if __FMA__
+	return fma_fmaf(x, y, z);
+#elif __FMA4__
+	return fma4_fmaf(x, y, z);
+#else
+	if (__cpuid & X86_FEAT_FMA)
+		return fma_fmaf(x, y, z);
+	if (__cpuid & X86_FEAT_FMA4)
+		return fma4_fmaf(x, y, z);
+	return __soft_fmaf(x, y, z);
 #endif
+}
--
2.39.2


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [musl] x86 fma with run-time switch?
  2024-03-16  3:37   ` Markus Wichmann
  2024-03-16  9:16     ` Markus Wichmann
@ 2024-03-18 16:15     ` Markus Wichmann
  1 sibling, 0 replies; 5+ messages in thread
From: Markus Wichmann @ 2024-03-18 16:15 UTC (permalink / raw)
  To: musl

Am Sat, Mar 16, 2024 at 04:37:29AM +0100 schrieb Markus Wichmann:
> I am also unsure whether you need much more than __hwcap (and possibly
> __hwcap2) for the string functions. I don't know what optimizations you
> have in mind for those, but if you need AVX support, for example, then
> CPUID information does not help you. You need confirmation from the
> kernel that it supports AVX, or else you will catch a SIGILL in the
> attempt to use it. Very few ISA extensions can make do without kernel
> support, after all.
>

*sigh* I had another look at how this all is supposed to work now. So
__hwcap on x86_64 is just the contents of EDX from CPUID function 1.
That's not very helpful at all. There is __hwcap2, which at least
contains a bit that tells us whether wrfsbase is OK to use (which we
might want to integrate into x86_64's version of __set_thread_area()),
but that is it. For AVX and AVX512, the way you are supposed to do it
now is:

1. Test the OSXFSR bit from CPUID
2. If set, use XGETBV to read XCR0 and test the YMM/ZMM bit.

And that only tells you whether you can access the registers. But every
other instruction is actually part of another ISA extension, so now you
need to also check the corresponding CPUID bit for the ISA extension.

Incidentally, even when __hwcap is useful, it doesn't help too much
if the interface is dumb. Because I was also looking up when fsel became
non-optional in PowerPC, and it turns out it was at the same time as
fsqrt: Namely in Power ISA 2.03. (PowerPC 2.02 still lists it as
optional). Right, so we only need to check the v2.03 bit in __hwcap,
right? Nope. Because for one, there is no such bit defined, and for two,
its closest proxy, the POWER5+ bit, is not set on the following
implementations. It took until 2.06 for them to keep the bit around for
good. So you have to test for three different bits. Thankfully not
spilling into __hwcap2.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2024-03-18 16:15 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-15 17:53 [musl] x86 fma with run-time switch? Markus Wichmann
2024-03-15 21:36 ` Rich Felker
2024-03-16  3:37   ` Markus Wichmann
2024-03-16  9:16     ` Markus Wichmann
2024-03-18 16:15     ` Markus Wichmann

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).