From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 16 Mar 2024 10:16:08 +0100
From: Markus Wichmann
To: musl@lists.openwall.com
Reply-To: musl@lists.openwall.com
Subject: Re: [musl] x86 fma with run-time switch?
References: <20240315213622.GG4163@brightrain.aerifal.cx>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="OD6RltQkVR5nm4Iy"
Content-Disposition: inline

--OD6RltQkVR5nm4Iy
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Sat, Mar 16, 2024 at 04:37:29AM +0100, Markus Wichmann wrote:
> The problem with that idea is that CPUID returns an enormous amount of
> information, but right now we only care about two bits on x86_64 (namely
> FMA and FMA4 support). So we could have some internal word such as
> __cpuid, that basically contains a digest of the CPUID information,
> namely only the bits we care about. That would be extensible for up to
> 64 bits we want to look at, but it would require complex control flow.
> And I thought you were against complex control flow in assembler.
>

OK, I may have misunderstood for a moment here. The benefits of forty
winks and a cup of joe, I suppose.
But I am attaching an initial proposal to this one, where the
initialization is in C.

One problem that came to me while making the commits is that this
raises the required ISA level of the assembler. And I don't really know
whether that is a problem. Dealing with it requires yet more work: We
would have to add a configure test, and then not emit the new
instructions if the assembler can't handle them. I don't think we will
be able to just emit these instructions as numbers while still using
named input and output constraints.

On the other hand, these instructions (and support for them) have been
around for one and a half decades by now, so it ought to be fine, right?

Ciao,
Markus

--OD6RltQkVR5nm4Iy
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="0001-Add-internal-CPUID-machinery.patch"

From 9b89d49cd9dae3eb1fd9745e7fc4b91f52d1659f Mon Sep 17 00:00:00 2001
From: Markus Wichmann
Date: Sat, 16 Mar 2024 09:51:37 +0100
Subject: [PATCH 1/2] Add internal CPUID machinery.

This is meant to provide a way for implementation-internal optimizations
to be enabled without a __hwcap flag.

The CPUID provides an enormous amount of information, and capturing all
of it unconditionally would be incredibly wasteful. Especially with the
diversity of implementations out there. So instead I condense exactly
the information needed down to one bit per feature, for each interesting
feature. For starters, the only features in here are the FMA and FMA4
extensions, but this leaves another 62 bits for other miscellaneous
enhancements.

I had initially planned to put the call to __init_cpuid() into the
arch-specific CRT code, but it is not valid in there. I need access to
static variables, and this is not possible in the PIE and dynamic
linking cases directly after _start. Only after the relocations were
processed. So now I have put it in __libc_start_main, which every
process will call between having processed the relocations and running
application code.
---
 src/env/__libc_start_main.c |  2 ++
 src/internal/x86_64/cpuid.c | 40 +++++++++++++++++++++++++++++++++++++
 src/internal/x86_64/cpuid.h | 12 +++++++++++
 3 files changed, 54 insertions(+)
 create mode 100644 src/internal/x86_64/cpuid.c
 create mode 100644 src/internal/x86_64/cpuid.h

diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
index c5b277bd..7d7e9f9b 100644
--- a/src/env/__libc_start_main.c
+++ b/src/env/__libc_start_main.c
@@ -9,6 +9,7 @@
 
 static void dummy(void) {}
 weak_alias(dummy, _init);
+weak_alias(dummy, __init_cpuid);
 
 extern weak hidden void (*const __init_array_start)(void), (*const __init_array_end)(void);
 
@@ -38,6 +39,7 @@ void __init_libc(char **envp, char *pn)
 
 	__init_tls(aux);
 	__init_ssp((void *)aux[AT_RANDOM]);
+	__init_cpuid();
 
 	if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID]
 		&& !aux[AT_SECURE]) return;
diff --git a/src/internal/x86_64/cpuid.c b/src/internal/x86_64/cpuid.c
new file mode 100644
index 00000000..6218ad62
--- /dev/null
+++ b/src/internal/x86_64/cpuid.c
@@ -0,0 +1,40 @@
+#include "x86_64/cpuid.h"
+
+uint64_t __cpuid;
+
+struct regs {
+	uint32_t ax, bx, cx, dx;
+};
+
+static inline struct regs cpuid(uint32_t fn)
+{
+	struct regs ret;
+	__asm__("cpuid" : "=a"(ret.ax), "=b"(ret.bx), "=c"(ret.cx), "=d"(ret.dx) : "a"(fn));
+	return ret;
+}
+
+static inline int cpu_has_fma(void)
+{
+	struct regs r = cpuid(1);
+	return r.cx & 0x1000;
+}
+
+static inline int cpu_is_amd(void)
+{
+	struct regs r = cpuid(0);
+	return r.bx == 0x68747541 && r.cx == 0x444d4163 && r.dx == 0x69746e65;
+}
+
+static inline int cpu_has_fma4(void)
+{
+	struct regs r = cpuid(0x80000001);
+	return r.cx & 0x10000;
+}
+
+void __init_cpuid(void)
+{
+	if (cpu_has_fma())
+		__cpuid |= X86_FEAT_FMA;
+	if (cpu_is_amd() && cpu_has_fma4())
+		__cpuid |= X86_FEAT_FMA4;
+}
diff --git a/src/internal/x86_64/cpuid.h b/src/internal/x86_64/cpuid.h
new file mode 100644
index 00000000..40b66d3e
--- /dev/null
+++ b/src/internal/x86_64/cpuid.h
@@ -0,0 +1,12 @@
+#ifndef X86_64_CPUID_H
+#define X86_64_CPUID_H
+
+#include <stdint.h>
+#include <features.h>
+extern hidden uint64_t __cpuid;
+void __init_cpuid(void);
+
+#define X86_FEAT_FMA 1
+#define X86_FEAT_FMA4 2
+
+#endif
-- 
2.39.2

--OD6RltQkVR5nm4Iy
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="0002-Runtime-switch-hardware-fma-on-x86_64.patch"

From 121aae8cbd37396fd3b0e4e6f6d42b70b9966671 Mon Sep 17 00:00:00 2001
From: Markus Wichmann
Date: Sat, 16 Mar 2024 10:02:11 +0100
Subject: [PATCH 2/2] Runtime switch hardware fma on x86_64.

Instead of only using hardware fma instructions if enabled at compile
time (i.e. if compiling at an ISA level that requires these to be
present), we can now switch them in at runtime. Compile time switches
are still effective and eliminate the other implementations, so the
semantics don't change there. But even at baseline ISA level, we can now
use the hardware FMA if __cpuid says it's OK.
---
 src/math/x86_64/fma.c  | 28 ++++++++++++++++++++--------
 src/math/x86_64/fmaf.c | 28 ++++++++++++++++++++--------
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/src/math/x86_64/fma.c b/src/math/x86_64/fma.c
index 4dd53f2a..04c6064a 100644
--- a/src/math/x86_64/fma.c
+++ b/src/math/x86_64/fma.c
@@ -1,23 +1,35 @@
 #include <math.h>
 
-#if __FMA__
-
-double fma(double x, double y, double z)
+static inline double fma_fma(double x, double y, double z)
 {
 	__asm__ ("vfmadd132sd %1, %2, %0" : "+x" (x) : "x" (y), "x" (z));
 	return x;
 }
 
-#elif __FMA4__
-
-double fma(double x, double y, double z)
+static inline double fma4_fma(double x, double y, double z)
 {
 	__asm__ ("vfmaddsd %3, %2, %1, %0" : "=x" (x) : "x" (x), "x" (y), "x" (z));
 	return x;
 }
 
-#else
-
+#if !__FMA__ && !__FMA4__
+#include "x86_64/cpuid.h"
+#define fma __soft_fma
 #include "../fma.c"
+#undef fma
+#endif
 
+double fma(double x, double y, double z)
+{
+#if __FMA__
+	return fma_fma(x, y, z);
+#elif __FMA4__
+	return fma4_fma(x, y, z);
+#else
+	if (__cpuid & X86_FEAT_FMA)
+		return fma_fma(x, y, z);
+	if (__cpuid & X86_FEAT_FMA4)
+		return fma4_fma(x, y, z);
+	return __soft_fma(x, y, z);
 #endif
+}
diff --git a/src/math/x86_64/fmaf.c b/src/math/x86_64/fmaf.c
index 30b971ff..b4d9b714 100644
--- a/src/math/x86_64/fmaf.c
+++ b/src/math/x86_64/fmaf.c
@@ -1,23 +1,35 @@
 #include <math.h>
 
-#if __FMA__
-
-float fmaf(float x, float y, float z)
+static inline float fma_fmaf(float x, float y, float z)
 {
 	__asm__ ("vfmadd132ss %1, %2, %0" : "+x" (x) : "x" (y), "x" (z));
 	return x;
 }
 
-#elif __FMA4__
-
-float fmaf(float x, float y, float z)
+static inline float fma4_fmaf(float x, float y, float z)
 {
 	__asm__ ("vfmaddss %3, %2, %1, %0" : "=x" (x) : "x" (x), "x" (y), "x" (z));
 	return x;
 }
 
-#else
-
+#if !__FMA__ && !__FMA4__
+#include "x86_64/cpuid.h"
+#define fmaf __soft_fmaf
 #include "../fmaf.c"
+#undef fmaf
+#endif
 
+float fmaf(float x, float y, float z)
+{
+#if __FMA__
+	return fma_fmaf(x, y, z);
+#elif __FMA4__
+	return fma4_fmaf(x, y, z);
+#else
+	if (__cpuid & X86_FEAT_FMA)
+		return fma_fmaf(x, y, z);
+	if (__cpuid & X86_FEAT_FMA4)
+		return fma4_fmaf(x, y, z);
+	return __soft_fmaf(x, y, z);
 #endif
+}
-- 
2.39.2

--OD6RltQkVR5nm4Iy--
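
For anyone who wants to play with the idea outside of musl, here is a
minimal standalone sketch of the feature digest from patch 1 and the
selection order from patch 2. It is not part of the attached patches.
The names digest_features(), select_fma_impl(), probe_host_features()
and the FEAT_*/FMA_IMPL_* constants are hypothetical, and the sketch
uses __get_cpuid() from the GCC/Clang <cpuid.h> header instead of the
inline asm in the patch. The OSXSAVE/XCR0 gate is an extra assumption of
this sketch: FMA and FMA4 are VEX-encoded, so they are only usable once
the OS has enabled the YMM state, which the patch as posted does not
check.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper header providing __get_cpuid() */
#endif

/* One bit per feature, mirroring X86_FEAT_FMA / X86_FEAT_FMA4 in the patch. */
#define FEAT_FMA   (UINT64_C(1) << 0)
#define FEAT_FMA4  (UINT64_C(1) << 1)

/* Raw bit positions in the CPUID output and in XCR0. */
#define CPUID1_ECX_FMA      (UINT32_C(1) << 12)	/* leaf 1, ECX */
#define CPUID1_ECX_OSXSAVE  (UINT32_C(1) << 27)	/* leaf 1, ECX */
#define CPUIDX1_ECX_FMA4    (UINT32_C(1) << 16)	/* leaf 0x80000001, ECX */
#define XCR0_SSE_AVX_STATE  UINT64_C(0x6)	/* XCR0 bits 1 and 2 */

enum fma_impl { FMA_IMPL_SOFT, FMA_IMPL_FMA3, FMA_IMPL_FMA4 };

/* Hypothetical helper: condense raw register values into the digest.
 * It touches no hardware, so it can be checked with made-up inputs. */
uint64_t digest_features(uint32_t leaf1_ecx, uint32_t ext1_ecx,
                         int vendor_is_amd, uint64_t xcr0)
{
	uint64_t digest = 0;

	/* Assumption beyond the patch: the OS must have enabled XSAVE and
	 * the SSE+AVX register state, else VEX instructions would fault. */
	if (!(leaf1_ecx & CPUID1_ECX_OSXSAVE))
		return 0;
	if ((xcr0 & XCR0_SSE_AVX_STATE) != XCR0_SSE_AVX_STATE)
		return 0;

	if (leaf1_ecx & CPUID1_ECX_FMA)
		digest |= FEAT_FMA;
	/* As in the patch, FMA4 is only trusted on AMD parts. */
	if (vendor_is_amd && (ext1_ecx & CPUIDX1_ECX_FMA4))
		digest |= FEAT_FMA4;
	return digest;
}

/* Same priority order as fma() in patch 2: FMA, then FMA4, then soft. */
enum fma_impl select_fma_impl(uint64_t digest)
{
	if (digest & FEAT_FMA)
		return FMA_IMPL_FMA3;
	if (digest & FEAT_FMA4)
		return FMA_IMPL_FMA4;
	return FMA_IMPL_SOFT;
}

/* Read the real registers on x86; report no features elsewhere. */
uint64_t probe_host_features(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int a, b, c, d;
	uint32_t leaf1_ecx = 0, ext1_ecx = 0;
	uint64_t xcr0 = 0;
	int amd = 0;

	if (__get_cpuid(0, &a, &b, &c, &d))	/* "AuthenticAMD" */
		amd = b == 0x68747541 && c == 0x444d4163 && d == 0x69746e65;
	if (__get_cpuid(1, &a, &b, &c, &d))
		leaf1_ecx = c;
	if (__get_cpuid(0x80000001, &a, &b, &c, &d))
		ext1_ecx = c;
	if (leaf1_ecx & CPUID1_ECX_OSXSAVE) {
		/* XGETBV is only valid once OSXSAVE is set. */
		uint32_t lo, hi;
		__asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
		xcr0 = ((uint64_t)hi << 32) | lo;
	}
	return digest_features(leaf1_ecx, ext1_ecx, amd, xcr0);
#else
	return 0;
#endif
}
```

Keeping digest_features() free of any CPUID access means the bit logic
can be checked with made-up register values on any machine; only
probe_host_features() touches the hardware.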