[-- Attachment #1: Type: text/plain, Size: 1066 bytes --] Yeah I also just went over the C99 spec as well, section 6.7.8, and I have to agree with clang developer's interpretation, that "{ 0 }" only initializes the first member of the union. "{ }" apparently is added in C23 as the "universal zero initializer". So changing the order moving sin6 up is the only way to be C99 conformant. On Mon, Mar 18, 2024 at 3:22 PM NRK <nrk@disroot.org> wrote: > On Mon, Mar 18, 2024 at 05:34:42PM -0400, Rich Felker wrote: > > If the clang interpretation is going to be this, we can just reorder > > the union members so that the largest one is first. > > Another option is to utilize implicit static initializer rules: > > | if it is a union, the first named member is initialized (recursively) > | according to these rules, and any padding is initialized to zero bits; > > So something like: > > static union u zero; > union u u = zero; > > Though, the "padding bits" part was added in C11 and wasn't present in > C99 in case you want to be pedantic about C99 conformance. > > - NRK > [-- Attachment #2: Type: text/html, Size: 1467 bytes --]
NRK dixit:
>So something like:
>
> static union u zero;
> union u u = zero;
Even that gets funny when you have:
union {
struct {
char *foo;
uintptr_t bar;
} a;
struct {
uintptr_t baz;
char *bla;
} b;
} u;
In this case, u.b.baz can very well be 0x55555555
when you initialise u.a as {} (C23+) or {0}, e.g.
on TenDRA with the option to have nil pointers be
not a binary 0-bits representation.
bye,
//mirabilos
--
22:20⎜<asarch> The crazy that persists in his craziness becomes a master
22:21⎜<asarch> And the distance between the craziness and geniality is
only measured by the success 18:35⎜<asarch> "Psychotics are consistently
inconsistent. The essence of sanity is to be inconsistently inconsistent
On Mon, Mar 18, 2024 at 05:34:42PM -0400, Rich Felker wrote:
> If the clang interpretation is going to be this, we can just reorder
> the union members so that the largest one is first.
Another option is to utilize implicit static initializer rules:
| if it is a union, the first named member is initialized (recursively)
| according to these rules, and any padding is initialized to zero bits;
So something like:
static union u zero;
union u u = zero;
Though, the "padding bits" part was added in C11 and wasn't present in
C99 in case you want to be pedantic about C99 conformance.
- NRK
On Mon, Mar 18, 2024 at 12:56:41PM -0700, Mike Cui wrote: > I recently upgraded to clang-18, and after compiling musl with it, I > noticed that all my getaddrinfo() calls are failing. I tracked this to be > an issue in __res_msend_rc(), where the 'sin6' member of union 'sa' is > initialized to garbage, rather than 0. Then later bind() fails > with EADDRNOTAVAIL. > > I reported this bug on clang discourse: > https://discourse.llvm.org/t/union-initialization-and-aliasing-clang-18-seems-to-miscompile-musl/77724, > and the discussion seems to suggest that there is potentially a bug in musl > as well. TL;DR: > > - According to strict interpretation of the C standard, initializing a > union with '{0}', only initializes the first member of the union to 0 (in > this case, sin4), and "default" initializes the rest. This interpretation > is still up for debate. The proper way to initialize the entire union is '{ > }' not '{ 0 }'. No, { } is a constraint violation. It may be valid in C++ or C23, but it's not in C99, which is the source language for musl. { 0 } has always been the universal zero-initializer for C. Moreover, the C language has no such thing as a "partially initialized object". I guess it's possible for an implementor to interpret zero-initialization of a union to produce something other than zero bits in the storage that is not part of the first union member, but it's rather hostile. > - There is currently a bug in clang-18 that treats '{ }' to be the same as > '{ 0 }'. The proposed fix is to just zero out the entire union for both "{ > 0 }" and "{ }". However we cannot rely on "{ 0 }" to always zero out the > entire union in the future. > > musl should be fixed to use "{ }" for initialization. And to work around > the current buggy release of clang-18, perhaps flip the order to make sin6 > the first member of the struct? I've attached a patch that works for me. > There may be other instances of the same bug in the musl code base. If the clang interpretation is going to be this, we can just reorder the union members so that the largest one is first. This should avoid dependency on how the compiler decided to interpret the universal zero initializer for unions. Rich
[-- Attachment #1: Type: text/plain, Size: 1918 bytes --] I recently upgraded to clang-18, and after compiling musl with it, I noticed that all my getaddrinfo() calls are failing. I tracked this to be an issue in __res_msend_rc(), where the 'sin6' member of union 'sa' is initialized to garbage, rather than 0. Then later bind() fails with EADDRNOTAVAIL. I reported this bug on clang discourse: https://discourse.llvm.org/t/union-initialization-and-aliasing-clang-18-seems-to-miscompile-musl/77724, and the discussion seems to suggest that there is potentially a bug in musl as well. TL;DR: - According to strict interpretation of the C standard, initializing a union with '{0}', only initializes the first member of the union to 0 (in this case, sin4), and "default" initializes the rest. This interpretation is still up for debate. The proper way to initialize the entire union is '{ }' not '{ 0 }'. - There is currently a bug in clang-18 that treats '{ }' to be the same as '{ 0 }'. The proposed fix is to just zero out the entire union for both "{ 0 }" and "{ }". However we cannot rely on "{ 0 }" to always zero out the entire union in the future. musl should be fixed to use "{ }" for initialization. And to work around the current buggy release of clang-18, perhaps flip the order to make sin6 the first member of the struct? I've attached a patch that works for me. There may be other instances of the same bug in the musl code base. --- a/src/network/res_msend.c +++ b/src/network/res_msend.c @@ -83,9 +83,9 @@ int __res_msend_rc(int nqueries, const unsigned char *const *queries, int fd; int timeout, attempts, retry_interval, servfail_retry; union { - struct sockaddr_in sin; struct sockaddr_in6 sin6; - } sa = {0}, ns[MAXNS] = {{0}}; + struct sockaddr_in sin; + } sa = {}, ns[MAXNS] = {{}}; socklen_t sl = sizeof sa.sin; int nns = 0; int family = AF_INET; [-- Attachment #2: Type: text/html, Size: 2376 bytes --]
Am Sat, Mar 16, 2024 at 04:37:29AM +0100 schrieb Markus Wichmann:
> I am also unsure whether you need much more than __hwcap (and possibly
> __hwcap2) for the string functions. I don't know what optimizations you
> have in mind for those, but if you need AVX support, for example, then
> CPUID information does not help you. You need confirmation from the
> kernel that it supports AVX, or else you will catch a SIGILL in the
> attempt to use it. Very few ISA extensions can make do without kernel
> support, after all.
>
*sigh* I had another look at how this all is supposed to work now. So
__hwcap on x86_64 is just the contents of EDX from CPUID function 1.
That's not very helpful at all. There is __hwcap2, which at least
contains a bit that tells us whether wrfsbase is OK to use (which we
might want to integrate into x86_64's version of __set_thread_area()),
but that is it. For AVX and AVX512, the way you are supposed to do it
now is:
1. Test the OSXFSR bit from CPUID
2. If set, use XGETBV to read XCR0 and test the YMM/ZMM bit.
And that only tells you whether you can access the registers. But every
other instruction is actually part of another ISA extension, so now you
need to also check the corresponding CPUID bit for the ISA extension.
Incidentally, even when __hwcap is useful, it doesn't help too much
if the interface is dumb. Because I was also looking up when fsel became
non-optional in PowerPC, and it turns out it was at the same time as
fsqrt: Namely in Power ISA 2.03. (PowerPC 2.02 still lists it as
optional). Right, so we only need to check the v2.03 bit in __hwcap,
right? Nope. Because for one, there is no such bit defined, and for two,
its closest proxy, the POWER5+ bit, is not set on the following
implementations. It took until 2.06 for them to keep the bit around for
good. So you have to test for three different bits. Thankfully not
spilling into __hwcap2.
Ciao,
Markus
[-- Attachment #1: Type: text/plain, Size: 1399 bytes --] Am Sat, Mar 16, 2024 at 04:37:29AM +0100 schrieb Markus Wichmann: > The problem with that idea is that CPUID returns an enormous amount of > information, but right now we only care about two bits on x86_64 (namely > FMA and FMA4 support). So we could have some internal word such as > __cpuid, that basically contains a digest of the CPUID information, > namely only the bits we care about. That would be extensible for up to > 64 bits we want to look at, but it would require complex control flow. > And I thought you were against complex control flow in assembler. > OK, I may have misunderstood for a moment here. The benefits of fourty winks and a cup of joe, I suppose. But I am attaching an initial proposal to this one, where the initialization is in C. One problem that came to me while making the commits is that this is raising the required ISA level of the assembler. And I don't really know whether that is a problem. Dealing with it requires yet more work: We would have to add a configure test, and then not emit the new instructions if the assembler can't handle them. I don't think we will be able to just emit these instructions as numbers while still using named input and output constraints. On the other hand, these instructions (and support for them) have been around for one and a half decades by now, so it ought to be fine, right? Ciao, Markus [-- Attachment #2: 0001-Add-internal-CPUID-machinery.patch --] [-- Type: text/x-diff, Size: 3559 bytes --] From 9b89d49cd9dae3eb1fd9745e7fc4b91f52d1659f Mon Sep 17 00:00:00 2001 From: Markus Wichmann <nullplan@gmx.net> Date: Sat, 16 Mar 2024 09:51:37 +0100 Subject: [PATCH 1/2] Add internal CPUID machinery. This is meant to provide a way for implementation-internal optimizations to be enabled without a __hwcap flag. The CPUID provides an enormous amount of information, and capturing all of it unconditionally would be incredibly wasteful. Especially with the diversity of implementation out there. So instead I condense exactly the information needed down to one bit per feature, for each interesting feature. For starters, the only features in here are the FMA and FMA4 extentions, but this leaves another 62 bits for other miscellaneous enhancements. I had initially planned to put the call to __init_cpuid() into the arch-specific CRT code, but it is not valid in there. I need access to static variables, and this is not possible in the PIE and dynamic linking cases directly after _start. Only after the relocations were processed. So now I have put it in __libc_start_main, which every process will call between having processed the relocations and running application code. --- src/env/__libc_start_main.c | 2 ++ src/internal/x86_64/cpuid.c | 40 +++++++++++++++++++++++++++++++++++++ src/internal/x86_64/cpuid.h | 12 +++++++++++ 3 files changed, 54 insertions(+) create mode 100644 src/internal/x86_64/cpuid.c create mode 100644 src/internal/x86_64/cpuid.h diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c index c5b277bd..7d7e9f9b 100644 --- a/src/env/__libc_start_main.c +++ b/src/env/__libc_start_main.c @@ -9,6 +9,7 @@ static void dummy(void) {} weak_alias(dummy, _init); +weak_alias(dummy, __init_cpuid); extern weak hidden void (*const __init_array_start)(void), (*const __init_array_end)(void); @@ -38,6 +39,7 @@ void __init_libc(char **envp, char *pn) __init_tls(aux); __init_ssp((void *)aux[AT_RANDOM]); + __init_cpuid(); if (aux[AT_UID]==aux[AT_EUID] && aux[AT_GID]==aux[AT_EGID] && !aux[AT_SECURE]) return; diff --git a/src/internal/x86_64/cpuid.c b/src/internal/x86_64/cpuid.c new file mode 100644 index 00000000..6218ad62 --- /dev/null +++ b/src/internal/x86_64/cpuid.c @@ -0,0 +1,40 @@ +#include "x86_64/cpuid.h" + +uint64_t __cpuid; + +struct regs { + uint32_t ax, bx, cx, dx; +}; + +static inline struct regs cpuid(uint32_t fn) +{ + struct regs ret; + __asm__("cpuid" : "=a"(ret.ax), "=b"(ret.bx), "=c"(ret.cx), "=d"(ret.dx) : "a"(fn)); + return ret; +} + +static inline int cpu_has_fma(void) +{ + struct regs r = cpuid(1); + return r.cx & 0x1000; +} + +static inline int cpu_is_amd(void) +{ + struct regs r = cpuid(0); + return r.bx == 0x68747541 && r.cx == 0x444d4163 && r.dx == 0x69746e65; +} + +static inline int cpu_has_fma4(void) +{ + struct regs r = cpuid(0x80000001); + return r.cx & 0x10000; +} + +void __init_cpuid(void) +{ + if (cpu_has_fma()) + __cpuid |= X86_FEAT_FMA; + if (cpu_is_amd() && cpu_has_fma4()) + __cpuid |= X86_FEAT_FMA4; +} diff --git a/src/internal/x86_64/cpuid.h b/src/internal/x86_64/cpuid.h new file mode 100644 index 00000000..40b66d3e --- /dev/null +++ b/src/internal/x86_64/cpuid.h @@ -0,0 +1,12 @@ +#ifndef X86_64_CPUID_H +#define X86_64_CPUID_H + +#include <features.h> +#include <stdint.h> +extern hidden uint64_t __cpuid; +void __init_cpuid(void); + +#define X86_FEAT_FMA 1 +#define X86_FEAT_FMA4 2 + +#endif -- 2.39.2 [-- Attachment #3: 0002-Runtime-switch-hardware-fma-on-x86_64.patch --] [-- Type: text/x-diff, Size: 2963 bytes --] From 121aae8cbd37396fd3b0e4e6f6d42b70b9966671 Mon Sep 17 00:00:00 2001 From: Markus Wichmann <nullplan@gmx.net> Date: Sat, 16 Mar 2024 10:02:11 +0100 Subject: [PATCH 2/2] Runtime switch hardware fma on x86_64. Instead of only using hardware fma instructions if enabled at compile time (i.e. if compiling at an ISA level that requires these to be present), we can now switch them in at runtime. Compile time switches are still effective and eliminate the other implementations, so the semantics don't change there. But even at baseline ISA level, we can now use the hardware FMA if __cpuid says it's OK. --- src/math/x86_64/fma.c | 28 ++++++++++++++++++++-------- src/math/x86_64/fmaf.c | 28 ++++++++++++++++++++-------- 2 files changed, 40 insertions(+), 16 deletions(-) diff --git a/src/math/x86_64/fma.c b/src/math/x86_64/fma.c index 4dd53f2a..04c6064a 100644 --- a/src/math/x86_64/fma.c +++ b/src/math/x86_64/fma.c @@ -1,23 +1,35 @@ #include <math.h> -#if __FMA__ - -double fma(double x, double y, double z) +static inline double fma_fma(double x, double y, double z) { __asm__ ("vfmadd132sd %1, %2, %0" : "+x" (x) : "x" (y), "x" (z)); return x; } -#elif __FMA4__ - -double fma(double x, double y, double z) +static inline double fma4_fma(double x, double y, double z) { __asm__ ("vfmaddsd %3, %2, %1, %0" : "=x" (x) : "x" (x), "x" (y), "x" (z)); return x; } -#else - +#if !__FMA__ && !__FMA4__ +#include "x86_64/cpuid.h" +#define fma __soft_fma #include "../fma.c" +#undef fma +#endif +double fma(double x, double y, double z) +{ +#if __FMA__ + return fma_fma(x, y, z); +#elif __FMA4__ + return fma4_fma(x, y, z); +#else + if (__cpuid & X86_FEAT_FMA) + return fma_fma(x, y, z); + if (__cpuid & X86_FEAT_FMA4) + return fma4_fma(x, y, z); + return __soft_fma(x, y, z); #endif +} diff --git a/src/math/x86_64/fmaf.c b/src/math/x86_64/fmaf.c index 30b971ff..b4d9b714 100644 --- a/src/math/x86_64/fmaf.c +++ b/src/math/x86_64/fmaf.c @@ -1,23 +1,35 @@ #include <math.h> -#if __FMA__ - -float fmaf(float x, float y, float z) +static inline float fma_fmaf(float x, float y, float z) { __asm__ ("vfmadd132ss %1, %2, %0" : "+x" (x) : "x" (y), "x" (z)); return x; } -#elif __FMA4__ - -float fmaf(float x, float y, float z) +static inline float fma4_fmaf(float x, float y, float z) { __asm__ ("vfmaddss %3, %2, %1, %0" : "=x" (x) : "x" (x), "x" (y), "x" (z)); return x; } -#else - +#if !__FMA__ && !__FMA4__ +#include "x86_64/cpuid.h" +#define fmaf __soft_fmaf #include "../fmaf.c" +#undef fmaf +#endif +float fmaf(float x, float y, float z) +{ +#if __FMA__ + return fma_fmaf(x, y, z); +#elif __FMA4__ + return fma4_fmaf(x, y, z); +#else + if (__cpuid & X86_FEAT_FMA) + return fma_fmaf(x, y, z); + if (__cpuid & X86_FEAT_FMA4) + return fma4_fmaf(x, y, z); + return __soft_fmaf(x, y, z); #endif +} -- 2.39.2
Am Fri, Mar 15, 2024 at 05:36:23PM -0400 schrieb Rich Felker: > You're making it too complicated. Just > > #define fma __soft_fma > #include "../fma.c" > > or similar. > But that leaves a __soft_fma() symbol with external linkage in there. OK, I suppose this is still less of a hack. > My expectation was that you would just use __hwcap, whereby it would > be a hidden global access no more expensive than accessing a function > pointer for indirect branch, and likely cheaper to do local direct > branches based on testing bits of it, something like: > > if (__hwcap & WHATEVER) { > __asm__(...); > return ...; > } else return __soft_fma(...); > > However, it looks like x86_64 lacks usable __hwcap. > My understanding was that __hwcap is a contract between kernel and userspace about what ISA extentions both kernel and CPU support. Whereas FMA on x86 bypasses the kernel. On other architectures, such as PowerPC, userspace has no way to negotiate ISA version with the CPU, so __hwcap does that too, but that is only by necessity. > We already do that. > My point was that I'd try to continue doing that with a minimum of repeated code. > > Doing the same for i386 requires also verifying kernel SSE support in > > hwcap (that implies CPUID support in the CPU, since the baseline 80486 > > does not necessarily have that, but all CPUs with SSE have it) and also > > support for extended CPUID in case of fma4. > > That seems a lot less likely to be worthwhile since it involves > shuffling data back and forth between x87 and sse registers, but maybe > it's still a big enough win to want to do it? > Right. Though FP parameters are passed on stack on i386, they are returned in %st(0). So the calling code would likely have the numbers in FPU, then spill them to memory for the call, then we load it into SSE, spill back to memory, load into FPU, and return. However, just a glance at the generic fma() code makes me doubt a bunch of spilling to memory and running a single SSE instruction can ever be slower than all of what is happening in there. > Regardless, I wonder if we should have the x86_64 startup code store > cpuid result somewhere we can use, so that we don't have to do nasty > atomic stuff determining it late, and could just branch on the bits of > a runtime-constant like we would with __hwcap. This would set the > stage for being able to do with more-impactful things like mem* too. > (That's a big project that involves designing the system for how archs > define the large-block-ops the generic C functions would use, so not > immediately applicable, but useful to be working towards it.) > > Rich The problem with that idea is that CPUID returns an enormous amount of information, but right now we only care about two bits on x86_64 (namely FMA and FMA4 support). So we could have some internal word such as __cpuid, that basically contains a digest of the CPUID information, namely only the bits we care about. That would be extensible for up to 64 bits we want to look at, but it would require complex control flow. And I thought you were against complex control flow in assembler. I am also unsure whether you need much more than __hwcap (and possibly __hwcap2) for the string functions. I don't know what optimizations you have in mind for those, but if you need AVX support, for example, then CPUID information does not help you. You need confirmation from the kernel that it supports AVX, or else you will catch a SIGILL in the attempt to use it. Very few ISA extensions can make do without kernel support, after all. Ciao, Markus
On Fri, Mar 15, 2024 at 06:53:59PM +0100, Markus Wichmann wrote: > Hi all, > > in commit e9016138, Szabolcs wrote into the message that we really > should be using the single-instruction versions if possible, and we > should be switching at run time. I have an idea for how to do that > without losing all of the history of the generic fma.c: > > - Rename src/math/fma.c to src/math/fma-soft.h. Rename the fma function > inside to fma_soft and make it static (inline?). > - Create a new src/math/fma.c that includes fma-soft.h and just calls > fma_soft(). > - In src/math/x86_64/fma.c: Unconditionally define fma_fma() and > fma_fma4() (which are the current assembler versions) and include > fma-soft.h. Create a dispatcher to figure out which version to call, > and call that from fma(). You're making it too complicated. Just #define fma __soft_fma #include "../fma.c" or similar. > Yeah, I know, the header file with stuff in it that takes memory is not > exactly great, but I can't think of another way to define the generic > version such that it is accessible to the arch-specific versions under a > different name and linkage. The file must not be a .c file, or else it > will confuse the build system. > > Question I have right out the gate is whether this would be interesting > to the group. Second question is whether it is better to be running > cpuid every time fma() is called, or to use a function pointer? I am > partial to the dispatcher pattern myself. In that case, the function > pointer is initialized at load time to point to the dispatcher, which > then selects the best implementation and updates the function pointer. > The main function only unconditionally calls the function pointer. My expectation was that you would just use __hwcap, whereby it would be a hidden global access no more expensive than accessing a function pointer for indirect branch, and likely cheaper to do local direct branches based on testing bits of it, something like: if (__hwcap & WHATEVER) { __asm__(...); return ...; } else return __soft_fma(...); However, it looks like x86_64 lacks usable __hwcap. > With a bit of preprocessor magic, I can also ensure that if __FMA__ or > __FMA4__ are set, the dispatcher is not included, and only the given > function is called. Although that may incur a warning of an unused > static function. I suppose that is a problem that can be fixed with more > preprocessor magic. We already do that. > From my preliminary research, the fma3 and fma4 ISA extensions require > no kernel support, so this will be the first time a CPUID call is > needed. fma3 support is signalled with bit 12 of ECX in CPUID function > 1. fma4 support is signalled with bit 16 of ECX in CPUID function > 0x80000001 - on AMD CPUs. Intel has the bit reserved, so to be extra > safe, the CPU vendor ought to be checked, too. > > Doing the same for i386 requires also verifying kernel SSE support in > hwcap (that implies CPUID support in the CPU, since the baseline 80486 > does not necessarily have that, but all CPUs with SSE have it) and also > support for extended CPUID in case of fma4. That seems a lot less likely to be worthwhile since it involves shuffling data back and forth between x87 and sse registers, but maybe it's still a big enough win to want to do it? The logic for how you determine availability seems right. > Since the CPUID challenges would be shared between fma and fmaf, I would > like to put them into a new header file in src/include (maybe create > src/include/x86_64? Or should it be added to arch/x86_64?) > > So what are your thoughts on this? I'm somewhat skeptical of what value there is to doing this particularly for fma. There're probably a lot more places we don't do any runtime-conditional optimized code that have higher returns (memcpy etc. being the most obvious) and it seems likely that programs that care about fma performance are themselves compiled with the right ISA levels and using the compiler builtin, never calling the function at all. Regardless, I wonder if we should have the x86_64 startup code store cpuid result somewhere we can use, so that we don't have to do nasty atomic stuff determining it late, and could just branch on the bits of a runtime-constant like we would with __hwcap. This would set the stage for being able to do with more-impactful things like mem* too. (That's a big project that involves designing the system for how archs define the large-block-ops the generic C functions would use, so not immediately applicable, but useful to be working towards it.) Rich
Hi all, in commit e9016138, Szabolcs wrote into the message that we really should be using the single-instruction versions if possible, and we should be switching at run time. I have an idea for how to do that without losing all of the history of the generic fma.c: - Rename src/math/fma.c to src/math/fma-soft.h. Rename the fma function inside to fma_soft and make it static (inline?). - Create a new src/math/fma.c that includes fma-soft.h and just calls fma_soft(). - In src/math/x86_64/fma.c: Unconditionally define fma_fma() and fma_fma4() (which are the current assembler versions) and include fma-soft.h. Create a dispatcher to figure out which version to call, and call that from fma(). Yeah, I know, the header file with stuff in it that takes memory is not exactly great, but I can't think of another way to define the generic version such that it is accessible to the arch-specific versions under a different name and linkage. The file must not be a .c file, or else it will confuse the build system. Question I have right out the gate is whether this would be interesting to the group. Second question is whether it is better to be running cpuid every time fma() is called, or to use a function pointer? I am partial to the dispatcher pattern myself. In that case, the function pointer is initialized at load time to point to the dispatcher, which then selects the best implementation and updates the function pointer. The main function only unconditionally calls the function pointer. With a bit of preprocessor magic, I can also ensure that if __FMA__ or __FMA4__ are set, the dispatcher is not included, and only the given function is called. Although that may incur a warning of an unused static function. I suppose that is a problem that can be fixed with more preprocessor magic. From my preliminary research, the fma3 and fma4 ISA extensions require no kernel support, so this will be the first time a CPUID call is needed. fma3 support is signalled with bit 12 of ECX in CPUID function 1. fma4 support is signalled with bit 16 of ECX in CPUID function 0x80000001 - on AMD CPUs. Intel has the bit reserved, so to be extra safe, the CPU vendor ought to be checked, too. Doing the same for i386 requires also verifying kernel SSE support in hwcap (that implies CPUID support in the CPU, since the baseline 80486 does not necessarily have that, but all CPUs with SSE have it) and also support for extended CPUID in case of fma4. Since the CPUID challenges would be shared between fma and fmaf, I would like to put them into a new header file in src/include (maybe create src/include/x86_64? Or should it be added to arch/x86_64?) So what are your thoughts on this? Ciao, Markus
在 2024/3/14 下午9:44, Leah Neukirchen 写道:
> lixing <lixing@loongson.cn> writes:
>
>> we checked the objdump binaries for -Os and -O2, the a_cas_p implement
>> looks ok for both.
>>
>> Also, we cross build the musl and mksh with -Os, the binary hang with
>> qemu user mode emulation , but not hang in the real hardware.
>>
>> so, maybe this is a qemu problem, we will let our qemu guys to check
>> this problem.
> Pretty surely it's a bug in QEMU. qemu 8.1.5 works with mksh.Os,
> qemu 8.2.1 hangs. I have bisected the issue down to:
>
> commit c5af6628f4be5d30800233e59ba3842ca19a12e6 (HEAD)
> Author: Jiajie Chen <c@jia.je>
> Date: Tue Aug 22 09:13:52 2023 +0200
>
> target/loongarch: Extract make_address_i() helper
>
> Reverting this hunk fixes it:
>
> diff --git a/target/loongarch/insn_trans/trans_atomic.c.inc b/target/loongarch/insn_trans/trans_atomic.c.inc
> index fbc081448d..bff3e7a74c 100644
> --- a/target/loongarch/insn_trans/trans_atomic.c.inc
> +++ b/target/loongarch/insn_trans/trans_atomic.c.inc
> @@ -7,8 +7,9 @@ static bool gen_ll(DisasContext *ctx, arg_rr_i *a, MemOp mop)
> {
> TCGv dest = gpr_dst(ctx, a->rd, EXT_NONE);
> TCGv src1 = gpr_src(ctx, a->rj, EXT_NONE);
> - TCGv t0 = make_address_i(ctx, src1, a->imm);
> + TCGv t0 = tcg_temp_new();
>
> + tcg_gen_addi_tl(t0, src1, a->imm);
> tcg_gen_qemu_ld_i64(dest, t0, ctx->mem_idx, mop);
> tcg_gen_st_tl(t0, cpu_env, offsetof(CPULoongArchState, lladdr));
> tcg_gen_st_tl(dest, cpu_env, offsetof(CPULoongArchState, llval));
>
> I think the issue is that make_address_i optimizes the addition away
> if a->imm is zero, but that's just from looking at the code for 15min.
>
> hth,
Our qemu guys debuged the binary find that "ll.d $t0, $t0, 0" make the
t0 reg turn to 0,that should be the qemu code problem.
thanks.
XingLi
[-- Attachment #1: Type: text/plain, Size: 839 bytes --] On Sat, Mar 02, 2024 at 03:57:02PM +0100, Szabolcs Nagy wrote: > * Mark Brown <broonie@kernel.org> [2024-02-21 17:36:12 +0000]: > > > I said NOP but there's no reason it strictly needs to be a NOP. It > > > could instead do something reasonable to convey the state of racing > > > with shadow stack being disabled. > > This feels like it's getting complicated and I fear it may be an uphill > > struggle to get such code merged, at least for arm64. My instinct is > the aarch64 behaviour is already nop > for gcs instructions when gcs is disabled. > the isa was designed so async disable is > possible. Yeah, we'd need to handle GCSPR_EL0 somehow (currently it's inaccessible when GCS is disabled) and userspace would need to take care it's not doing something that could get stuck if for example a pop didn't actually *do* anything. [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --]
lixing <lixing@loongson.cn> writes: > we checked the objdump binaries for -Os and -O2, the a_cas_p implement > looks ok for both. > > Also, we cross build the musl and mksh with -Os, the binary hang with > qemu user mode emulation , but not hang in the real hardware. > > so, maybe this is a qemu problem, we will let our qemu guys to check > this problem. Pretty surely it's a bug in QEMU. qemu 8.1.5 works with mksh.Os, qemu 8.2.1 hangs. I have bisected the issue down to: commit c5af6628f4be5d30800233e59ba3842ca19a12e6 (HEAD) Author: Jiajie Chen <c@jia.je> Date: Tue Aug 22 09:13:52 2023 +0200 target/loongarch: Extract make_address_i() helper Reverting this hunk fixes it: diff --git a/target/loongarch/insn_trans/trans_atomic.c.inc b/target/loongarch/insn_trans/trans_atomic.c.inc index fbc081448d..bff3e7a74c 100644 --- a/target/loongarch/insn_trans/trans_atomic.c.inc +++ b/target/loongarch/insn_trans/trans_atomic.c.inc @@ -7,8 +7,9 @@ static bool gen_ll(DisasContext *ctx, arg_rr_i *a, MemOp mop) { TCGv dest = gpr_dst(ctx, a->rd, EXT_NONE); TCGv src1 = gpr_src(ctx, a->rj, EXT_NONE); - TCGv t0 = make_address_i(ctx, src1, a->imm); + TCGv t0 = tcg_temp_new(); + tcg_gen_addi_tl(t0, src1, a->imm); tcg_gen_qemu_ld_i64(dest, t0, ctx->mem_idx, mop); tcg_gen_st_tl(t0, cpu_env, offsetof(CPULoongArchState, lladdr)); tcg_gen_st_tl(dest, cpu_env, offsetof(CPULoongArchState, llval)); I think the issue is that make_address_i optimizes the addition away if a->imm is zero, but that's just from looking at the code for 15min. hth, -- Leah Neukirchen <leah@vuxu.org> https://leahneukirchen.org/
在 2024/3/14 下午1:44, Waldemar Brodkorb 写道:
> Hi,
> Rich Felker wrote,
>
>>> Yes I do. I found the reason why my mksh didn't worked.
>>> I compiled everything with -Os and then I get the deadlock.
>>> When I compile everything with -O2 musl mksh is working.
>>>
>>> So it seems some gcc problem code compiled with -Os.
>> Let's look at the generated asm for the function using a_cas_p and see
>> if this is a gcc bug or if the asm argument constraints as written
>> allowed a transformation that's not actually valid.
> Broken mksh with -Os:
>
> 000000012003b970 <cgt_init>:
> 12003b970: 02ff8063 addi.d $sp, $sp, -32
> 12003b974: 29c02077 st.d $s0, $sp, 8
> 12003b978: 27000078 stptr.d $s1, $sp, 0
> 12003b97c: 00150097 move $s0, $a0
> 12003b980: 001500b8 move $s1, $a1
> 12003b984: 1a000084 pcalau12i $a0, 4
> 12003b988: 1a000085 pcalau12i $a1, 4
> 12003b98c: 29c04076 st.d $fp, $sp, 16
> 12003b990: 29c06061 st.d $ra, $sp, 24
> 12003b994: 02c08076 addi.d $fp, $sp, 32
> 12003b998: 02c060a5 addi.d $a1, $a1, 24
> 12003b99c: 02c0c084 addi.d $a0, $a0, 48
> 12003b9a0: 54021400 bl 532 # 12003bbb4 <__vdsosym>
> 12003b9a4: 0015008e move $t2, $a0
> 12003b9a8: 38720000 dbar 0x0
> 12003b9ac: 1a00002d pcalau12i $t1, 1
> 12003b9b0: 1a00032f pcalau12i $t3, 25
> 12003b9b4: 02e5c1ad addi.d $t1, $t1, -1680
> 12003b9b8: 02ca21ec addi.d $t0, $t3, 648
> 12003b9bc: 2200018c ll.d $t0, $t0, 0
> 12003b9c0: 5c00198d bne $t0, $t1, 24 # 12003b9d8 <cgt_init+0x68>
> 12003b9c4: 001501cc move $t0, $t2
> 12003b9c8: 02ca21f0 addi.d $t4, $t3, 648
> 12003b9cc: 2300020c sc.d $t0, $t4, 0
> 12003b9d0: 0040818c slli.w $t0, $t0, 0x0
> 12003b9d4: 43ffe59f beqz $t0, -28 # 12003b9b8 <cgt_init+0x48>
> 12003b9d8: 38720000 dbar 0x0
> 12003b9dc: 400025c0 beqz $t2, 36 # 12003ba00 <cgt_init+0x90>
> 12003b9e0: 28c04076 ld.d $fp, $sp, 16
> 12003b9e4: 28c06061 ld.d $ra, $sp, 24
> 12003b9e8: 00150305 move $a1, $s1
> 12003b9ec: 001502e4 move $a0, $s0
> 12003b9f0: 26000078 ldptr.d $s1, $sp, 0
> 12003b9f4: 28c02077 ld.d $s0, $sp, 8
> 12003b9f8: 02c08063 addi.d $sp, $sp, 32
> 12003b9fc: 4c0001c0 jr $t2
> 12003ba00: 28c06061 ld.d $ra, $sp, 24
> 12003ba04: 28c04076 ld.d $fp, $sp, 16
> 12003ba08: 28c02077 ld.d $s0, $sp, 8
> 12003ba0c: 26000078 ldptr.d $s1, $sp, 0
> 12003ba10: 02bf6804 li.w $a0, -38
> 12003ba14: 02c08063 addi.d $sp, $sp, 32
> 12003ba18: 4c000020 ret
>
> Working mksh with -O2:
> 0000000120045f50 <cgt_init>:
> 120045f50: 02ff8063 addi.d $sp, $sp, -32
> 120045f54: 29c02077 st.d $s0, $sp, 8
> 120045f58: 27000078 stptr.d $s1, $sp, 0
> 120045f5c: 00150097 move $s0, $a0
> 120045f60: 001500b8 move $s1, $a1
> 120045f64: 1a000084 pcalau12i $a0, 4
> 120045f68: 1a000085 pcalau12i $a1, 4
> 120045f6c: 29c04076 st.d $fp, $sp, 16
> 120045f70: 29c06061 st.d $ra, $sp, 24
> 120045f74: 02c08076 addi.d $fp, $sp, 32
> 120045f78: 02dbc0a5 addi.d $a1, $a1, 1776
> 120045f7c: 02dc2084 addi.d $a0, $a0, 1800
> 120045f80: 54024c00 bl 588 # 1200461cc <__vdsosym>
> 120045f84: 0015008f move $t3, $a0
> 120045f88: 38720000 dbar 0x0
> 120045f8c: 1a000030 pcalau12i $t4, 1
> 120045f90: 1a00036d pcalau12i $t1, 27
> 120045f94: 02fd4210 addi.d $t4, $t4, -176
> 120045f98: 50001400 b 20 # 120045fac <cgt_init+0x5c>
> 120045f9c: 02ca21ae addi.d $t2, $t1, 648
> 120045fa0: 230001cc sc.d $t0, $t2, 0
> 120045fa4: 0040818c slli.w $t0, $t0, 0x0
> 120045fa8: 44001580 bnez $t0, 20 # 120045fbc <cgt_init+0x6c>
> 120045fac: 02ca21ac addi.d $t0, $t1, 648
> 120045fb0: 2200018e ll.d $t2, $t0, 0
> 120045fb4: 001501ec move $t0, $t3
> 120045fb8: 5bffe60e beq $t4, $t2, -28 # 120045f9c <cgt_init+0x4c>
> 120045fbc: 38720000 dbar 0x0
> 120045fc0: 400025e0 beqz $t3, 36 # 120045fe4 <cgt_init+0x94>
> 120045fc4: 28c04076 ld.d $fp, $sp, 16
> 120045fc8: 28c06061 ld.d $ra, $sp, 24
> 120045fcc: 00150305 move $a1, $s1
> 120045fd0: 001502e4 move $a0, $s0
> 120045fd4: 26000078 ldptr.d $s1, $sp, 0
> 120045fd8: 28c02077 ld.d $s0, $sp, 8
> 120045fdc: 02c08063 addi.d $sp, $sp, 32
> 120045fe0: 4c0001e0 jr $t3
> 120045fe4: 28c06061 ld.d $ra, $sp, 24
> 120045fe8: 28c04076 ld.d $fp, $sp, 16
> 120045fec: 28c02077 ld.d $s0, $sp, 8
> 120045ff0: 26000078 ldptr.d $s1, $sp, 0
> 120045ff4: 02bf6804 li.w $a0, -38
> 120045ff8: 02c08063 addi.d $sp, $sp, 32
> 120045ffc: 4c000020 ret
>
> Does that help? Both binaries can be found on https://debug.openadk.org.
>
> best regards
> Waldemar
we checked the objdump binaries for -Os and -O2, the a_cas_p implement
looks ok for both.
Also, we cross build the musl and mksh with -Os, the binary hang with
qemu user mode emulation , but not hang in the real hardware.
so, maybe this is a qemu problem, we will let our qemu guys to check
this problem.
Hi, Rich Felker wrote, > > Yes I do. I found the reason why my mksh didn't worked. > > I compiled everything with -Os and then I get the deadlock. > > When I compile everything with -O2 musl mksh is working. > > > > So it seems some gcc problem code compiled with -Os. > > Let's look at the generated asm for the function using a_cas_p and see > if this is a gcc bug or if the asm argument constraints as written > allowed a transformation that's not actually valid. Broken mksh with -Os: 000000012003b970 <cgt_init>: 12003b970: 02ff8063 addi.d $sp, $sp, -32 12003b974: 29c02077 st.d $s0, $sp, 8 12003b978: 27000078 stptr.d $s1, $sp, 0 12003b97c: 00150097 move $s0, $a0 12003b980: 001500b8 move $s1, $a1 12003b984: 1a000084 pcalau12i $a0, 4 12003b988: 1a000085 pcalau12i $a1, 4 12003b98c: 29c04076 st.d $fp, $sp, 16 12003b990: 29c06061 st.d $ra, $sp, 24 12003b994: 02c08076 addi.d $fp, $sp, 32 12003b998: 02c060a5 addi.d $a1, $a1, 24 12003b99c: 02c0c084 addi.d $a0, $a0, 48 12003b9a0: 54021400 bl 532 # 12003bbb4 <__vdsosym> 12003b9a4: 0015008e move $t2, $a0 12003b9a8: 38720000 dbar 0x0 12003b9ac: 1a00002d pcalau12i $t1, 1 12003b9b0: 1a00032f pcalau12i $t3, 25 12003b9b4: 02e5c1ad addi.d $t1, $t1, -1680 12003b9b8: 02ca21ec addi.d $t0, $t3, 648 12003b9bc: 2200018c ll.d $t0, $t0, 0 12003b9c0: 5c00198d bne $t0, $t1, 24 # 12003b9d8 <cgt_init+0x68> 12003b9c4: 001501cc move $t0, $t2 12003b9c8: 02ca21f0 addi.d $t4, $t3, 648 12003b9cc: 2300020c sc.d $t0, $t4, 0 12003b9d0: 0040818c slli.w $t0, $t0, 0x0 12003b9d4: 43ffe59f beqz $t0, -28 # 12003b9b8 <cgt_init+0x48> 12003b9d8: 38720000 dbar 0x0 12003b9dc: 400025c0 beqz $t2, 36 # 12003ba00 <cgt_init+0x90> 12003b9e0: 28c04076 ld.d $fp, $sp, 16 12003b9e4: 28c06061 ld.d $ra, $sp, 24 12003b9e8: 00150305 move $a1, $s1 12003b9ec: 001502e4 move $a0, $s0 12003b9f0: 26000078 ldptr.d $s1, $sp, 0 12003b9f4: 28c02077 ld.d $s0, $sp, 8 12003b9f8: 02c08063 addi.d $sp, $sp, 32 12003b9fc: 4c0001c0 jr $t2 12003ba00: 28c06061 ld.d $ra, $sp, 24 12003ba04: 28c04076 ld.d $fp, $sp, 16 12003ba08: 28c02077 ld.d $s0, $sp, 8 12003ba0c: 26000078 ldptr.d $s1, $sp, 0 12003ba10: 02bf6804 li.w $a0, -38 12003ba14: 02c08063 addi.d $sp, $sp, 32 12003ba18: 4c000020 ret Working mksh with -O2: 0000000120045f50 <cgt_init>: 120045f50: 02ff8063 addi.d $sp, $sp, -32 120045f54: 29c02077 st.d $s0, $sp, 8 120045f58: 27000078 stptr.d $s1, $sp, 0 120045f5c: 00150097 move $s0, $a0 120045f60: 001500b8 move $s1, $a1 120045f64: 1a000084 pcalau12i $a0, 4 120045f68: 1a000085 pcalau12i $a1, 4 120045f6c: 29c04076 st.d $fp, $sp, 16 120045f70: 29c06061 st.d $ra, $sp, 24 120045f74: 02c08076 addi.d $fp, $sp, 32 120045f78: 02dbc0a5 addi.d $a1, $a1, 1776 120045f7c: 02dc2084 addi.d $a0, $a0, 1800 120045f80: 54024c00 bl 588 # 1200461cc <__vdsosym> 120045f84: 0015008f move $t3, $a0 120045f88: 38720000 dbar 0x0 120045f8c: 1a000030 pcalau12i $t4, 1 120045f90: 1a00036d pcalau12i $t1, 27 120045f94: 02fd4210 addi.d $t4, $t4, -176 120045f98: 50001400 b 20 # 120045fac <cgt_init+0x5c> 120045f9c: 02ca21ae addi.d $t2, $t1, 648 120045fa0: 230001cc sc.d $t0, $t2, 0 120045fa4: 0040818c slli.w $t0, $t0, 0x0 120045fa8: 44001580 bnez $t0, 20 # 120045fbc <cgt_init+0x6c> 120045fac: 02ca21ac addi.d $t0, $t1, 648 120045fb0: 2200018e ll.d $t2, $t0, 0 120045fb4: 001501ec move $t0, $t3 120045fb8: 5bffe60e beq $t4, $t2, -28 # 120045f9c <cgt_init+0x4c> 120045fbc: 38720000 dbar 0x0 120045fc0: 400025e0 beqz $t3, 36 # 120045fe4 <cgt_init+0x94> 120045fc4: 28c04076 ld.d $fp, $sp, 16 120045fc8: 28c06061 ld.d $ra, $sp, 24 120045fcc: 00150305 move $a1, $s1 120045fd0: 001502e4 move $a0, $s0 120045fd4: 26000078 ldptr.d $s1, $sp, 0 120045fd8: 28c02077 ld.d $s0, $sp, 8 120045fdc: 02c08063 addi.d $sp, $sp, 32 120045fe0: 4c0001e0 jr $t3 120045fe4: 28c06061 ld.d $ra, $sp, 24 120045fe8: 28c04076 ld.d $fp, $sp, 16 120045fec: 28c02077 ld.d $s0, $sp, 8 120045ff0: 26000078 ldptr.d $s1, $sp, 0 120045ff4: 02bf6804 li.w $a0, -38 120045ff8: 02c08063 addi.d $sp, $sp, 32 120045ffc: 4c000020 ret Does that help? Both binaries can be found on https://debug.openadk.org. best regards Waldemar
if x!=0, y!=0, z==0 then fma(x,y,z) == x*y in all rounding modes, while adding z can ruin the sign of 0 if x*y rounds to -0. --- src/math/fma.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/math/fma.c b/src/math/fma.c index 0c6f90c9..adfadca8 100644 --- a/src/math/fma.c +++ b/src/math/fma.c @@ -53,7 +53,7 @@ double fma(double x, double y, double z) return x*y + z; if (nz.e >= ZEROINFNAN) { if (nz.e > ZEROINFNAN) /* z==0 */ - return x*y + z; + return x*y; return z; } -- 2.43.0
* Colen Garoutte-Carson <coleng@microsoft.com> [2024-03-13 01:16:47 +0000]: > Hi, > > We're using MUSL cross-compilation toolchains to build fully statically linked binaries, targeting x86_64, aarch64, and armel. (Which allows us to work around some issues with glibc incompatibilities) > > If we connect GDB to our process and get a backtrace, we get very little. i.e.: > > __syscall_cp_c (Unknown Source:0) > __timedwait_cp (Unknown Source:0) > [Unknown/Just-In-Time compiled code] (Unknown Source:0) > > Or: > > __stdio_read (Unknown Source:0) > [Unknown/Just-In-Time compiled code] (Unknown Source:0) > > We'd really like to get the rest of these stacks to investigate issues such as deadlocks, for example (among several other reasons, such as our own crash reporting handler). > > I found related info on StackOverflow, such as: debugging - Why does gdb backtrace show only one frame when catching syscall? - Stack Overflow<https://stackoverflow.com/questions/29764951/why-does-gdb-backtrace-show-only-one-frame-when-catching-syscall> > And: c - Stack frame NULL in backtrace log - Stack Overflow<https://stackoverflow.com/questions/26214936/stack-frame-null-in-backtrace-log> > > My hypothesis is that we're calling into code that does not provide the stack unwind descriptors needed by GDB. The first of the above links refers specifically to MUSL and patching it (but for MIPS). > > I've tried building a fresh MUSL (1.2.4) GCC (13.2.0) toolchain, and the issue persists with a binary built with it. > > Can someone explain what we're running into here, and whether there is a potential solution (even if we may have to contribute it ourselves)? > the first thing to note is that --enable-debug is not the default, so make sure you configure musl with that when building the cross toolchain. (e.g. if you use musl-cross-make then you need MUSL_CONFIG += --enable-debug in your config.mak) the second thing to note is that musl only has debug info for asm files on x86 targets (the cfi directives are generated by a script, tools/add-cfi.*.awk, if you provide that for another target that will work too), this may affect backtrace from __syscall_cp_asm (although i expect gdb to figure this one out without cfi, since gdb should use heuristics when cfi is not available). there can be subtle issues with unwinding across signal handlers, but otherwise backtrace across c code should work if your compiler emits debug info (gcc -g). > Thanks!, > > * Colen >
On Wed, Mar 13, 2024 at 03:45:31PM +0100, Waldemar Brodkorb wrote:
> Hi,
> lixing wrote,
>
> >
> > 在 2024/3/13 上午1:18, Waldemar Brodkorb 写道:
> > > Hi Thorsten,
> > > Thorsten Glaser wrote,
> > >
> > > > Waldemar Brodkorb dixit:
> > > >
> > > > > lixing wrote,
> > > > > > We've tested static and dynamic build mksh with commit id
> > > > > > cbb8a0196aab53165a35339fd91ade599d184f both works ok.
> > > > That’s great to hear!
> > > >
> > > > > Can you check with qemu-system-loongarch64? I use gcc 13.2.0,
> > > > > binutils 2.42 and Linux 6.6.18 with the defconfig provided.
> > > > > Qemu is also 8.2.2. Can you provide your QEMU_EFI.fd for download?
> >
> > I just only use qemu user mode, but my colleague give me the address for
> > download the EFI as follow
> >
> > https://github.com/lixianglai/LoongarchVirtFirmware.
> >
> > Also, can you post the qemu-system-loongarch64 startup command to me?
> >
> > > > Perhaps to distinguish between emulation/runtime and toolchain
> > > > faults you could send each other the mksh binaries (if statically
> > > > linked, otherwise probably them plus musl’s dynamic parts)?
> >
> > I update the static binary to my repo on github, you can download and try,
> > we both tested with real hardware
> >
> > and qemu-loongarch64 user mode is ok. This mksh built with gcc 13.2.1,
> > binutils 2.42, linux 6.6.15 in the loongarch environment.
> >
> > https://github.com/lixing-star/binarys
> >
> > > Statically linked mksh from me is here:
> > > https://debug.openadk.org/mksh
> > we've tested your binary on 3c5000 is ok, but hanged with
> > qemu-loongarch64. Are you cross compiled the mksh for use?
>
> Yes I do. I found the reason why my mksh didn't worked.
> I compiled everything with -Os and then I get the deadlock.
> When I compile everything with -O2 musl mksh is working.
>
> So it seems some gcc problem code compiled with -Os.
Let's look at the generated asm for the function using a_cas_p and see
if this is a gcc bug or if the asm argument constraints as written
allowed a transformation that's not actually valid.
Rich
Hi,
lixing wrote,
>
> 在 2024/3/13 上午1:18, Waldemar Brodkorb 写道:
> > Hi Thorsten,
> > Thorsten Glaser wrote,
> >
> > > Waldemar Brodkorb dixit:
> > >
> > > > lixing wrote,
> > > > > We've tested static and dynamic build mksh with commit id
> > > > > cbb8a0196aab53165a35339fd91ade599d184f both works ok.
> > > That’s great to hear!
> > >
> > > > Can you check with qemu-system-loongarch64? I use gcc 13.2.0,
> > > > binutils 2.42 and Linux 6.6.18 with the defconfig provided.
> > > > Qemu is also 8.2.2. Can you provide your QEMU_EFI.fd for download?
>
> I just only use qemu user mode, but my colleague give me the address for
> download the EFI as follow
>
> https://github.com/lixianglai/LoongarchVirtFirmware.
>
> Also, can you post the qemu-system-loongarch64 startup command to me?
>
> > > Perhaps to distinguish between emulation/runtime and toolchain
> > > faults you could send each other the mksh binaries (if statically
> > > linked, otherwise probably them plus musl’s dynamic parts)?
>
> I update the static binary to my repo on github, you can download and try,
> we both tested with real hardware
>
> and qemu-loongarch64 user mode is ok. This mksh built with gcc 13.2.1,
> binutils 2.42, linux 6.6.15 in the loongarch environment.
>
> https://github.com/lixing-star/binarys
>
> > Statically linked mksh from me is here:
> > https://debug.openadk.org/mksh
> we've tested your binary on 3c5000 is ok, but hanged with
> qemu-loongarch64. Are you cross compiled the mksh for use?
Yes I do. I found the reason why my mksh didn't worked.
I compiled everything with -Os and then I get the deadlock.
When I compile everything with -O2 musl mksh is working.
So it seems some gcc problem code compiled with -Os.
best regards
Waldemar
[-- Attachment #1: Type: text/plain, Size: 1731 bytes --] Hi, We're using MUSL cross-compilation toolchains to build fully statically linked binaries, targeting x86_64, aarch64, and armel. (Which allows us to work around some issues with glibc incompatibilities) If we connect GDB to our process and get a backtrace, we get very little. i.e.: __syscall_cp_c (Unknown Source:0) __timedwait_cp (Unknown Source:0) [Unknown/Just-In-Time compiled code] (Unknown Source:0) Or: __stdio_read (Unknown Source:0) [Unknown/Just-In-Time compiled code] (Unknown Source:0) We'd really like to get the rest of these stacks to investigate issues such as deadlocks, for example (among several other reasons, such as our own crash reporting handler). I found related info on StackOverflow, such as: debugging - Why does gdb backtrace show only one frame when catching syscall? - Stack Overflow<https://stackoverflow.com/questions/29764951/why-does-gdb-backtrace-show-only-one-frame-when-catching-syscall> And: c - Stack frame NULL in backtrace log - Stack Overflow<https://stackoverflow.com/questions/26214936/stack-frame-null-in-backtrace-log> My hypothesis is that we're calling into code that does not provide the stack unwind descriptors needed by GDB. The first of the above links refers specifically to MUSL and patching it (but for MIPS). I've tried building a fresh MUSL (1.2.4) GCC (13.2.0) toolchain, and the issue persists with a binary built with it. Can someone explain what we're running into here, and whether there is a potential solution (even if we may have to contribute it ourselves)? Thanks!, * Colen [-- Attachment #2: Type: text/html, Size: 10100 bytes --]
* Zack Weinberg:
> I do fully agree that this is a design error in NFS and in close(2)
> more generally [like all destructors, it should be _impossible_ for
> close to fail], but there is no realistic prospect of changing it, and
> I've been burned a few times by programs that didn't notice delayed
> write errors.
There is fsync to avoid delayed write errors. Historically, it's been
very bad for performance. What coreutils et al. aim to do is to deal
with non-catastrophic failures (mainly out of space errors) without
incurring the fsync overhead.
Thanks,
Florian
在 2024/3/13 上午1:18, Waldemar Brodkorb 写道: > Hi Thorsten, > Thorsten Glaser wrote, > >> Waldemar Brodkorb dixit: >> >>> lixing wrote, >>>> We've tested static and dynamic build mksh with commit id >>>> cbb8a0196aab53165a35339fd91ade599d184f both works ok. >> That’s great to hear! >> >>> Can you check with qemu-system-loongarch64? I use gcc 13.2.0, >>> binutils 2.42 and Linux 6.6.18 with the defconfig provided. >>> Qemu is also 8.2.2. Can you provide your QEMU_EFI.fd for download? I just only use qemu user mode, but my colleague give me the address for download the EFI as follow https://github.com/lixianglai/LoongarchVirtFirmware. Also, can you post the qemu-system-loongarch64 startup command to me? >> Perhaps to distinguish between emulation/runtime and toolchain >> faults you could send each other the mksh binaries (if statically >> linked, otherwise probably them plus musl’s dynamic parts)? I update the static binary to my repo on github, you can download and try, we both tested with real hardware and qemu-loongarch64 user mode is ok. This mksh built with gcc 13.2.1, binutils 2.42, linux 6.6.15 in the loongarch environment. https://github.com/lixing-star/binarys > Statically linked mksh from me is here: > https://debug.openadk.org/mksh we've tested your binary on 3c5000 is ok, but hanged with qemu-loongarch64. Are you cross compiled the mksh for use? > > best regards > Waldemar > best reards, XingLi
On Tue, Mar 12, 2024 at 03:25:31PM -0400, Zack Weinberg wrote: > > > On Tue, Mar 12, 2024, at 10:42 AM, Rich Felker wrote: > > On Tue, Mar 12, 2024 at 03:31:04PM +0100, Florian Weimer wrote: > >> * Zack Weinberg: > >> > >> > On Tue, Mar 12, 2024, at 9:54 AM, Florian Weimer wrote: > >> >>> Doing this would break many programs, such as: > >> >>> - most of coreutils, e.g. programs like ls, cat or head, since they > >> >>> always `close` their input and output descriptors (when they've > >> >>> written or read something) to make sure to diagnose all errors > >> >> > >> >> A slightly better way to do this is to do fflush (stdout) followed by > >> >> error checking on close (dup (fileno (stdout))). > >> > > >> > Does that actually report delayed write errors? As you have it, > >> > the close() just drops the fd created by the dup(), the OFD is > >> > still referenced by fd 1 and therefore remains open. > >> > >> I don't think the VFS close action is subject to reference counting. > >> Otherwise the current coreutils error checking wouldn't work because in > >> many cases, another process retains a reference to the OFD. > > > > It is. close only reports errors if it's the last fd referring to the > > ofd. It's an incredibly stupid design choice by NFS that mismatches > > expected fd behavior. > > I do fully agree that this is a design error in NFS and in close(2) > more generally [like all destructors, it should be _impossible_ for > close to fail], but there is no realistic prospect of changing it, > and I've been burned a few times by programs that didn't notice > delayed write errors. The risk of missing an error because another > process still has a ref to the OFD is real, but it comes up rarely > enough that I don't think it should deter us from trying to get the > "only one process has this file open" case right. (Think shell `>` > redirection.) > > Also, I have written programs that expected fds 0 and/or 1 to be > pipes, and closed them well before exiting as a cheap way to send > one-shot events to the process on the other end of the pipe. I think > that's a legitimate way to use inherited fds, and it would be > significantly more difficult to do in some contexts if you had to > use a fd > 2 to do it (e.g. the process on the other end used > popen() to start things). There's a perfectly safe way to do that: you dup2 something else over fd 0/1/2 as needed, rather than closing it. > If there's something we can do to make it easier for programmers to > do the Right Thing with closing standard fds, IMHO we should. For > stdio, we perfectly well _could_ special case fclose(), when the > fileno is 0/1/2, to do the dup/close dance and leave both the FILE > and the fd in a valid, stubbed state. (For stdin, open("/dev/null", > O_RDONLY) is clearly an appropriate stub semantic; for stdout and > stderr I am unsure whether "discard all writes silently" or "fail > all writes" would be better.) I don't think that's at all conforming. fclose is specified to close the fd too, and while that's generally bad behavior, it is the expected behavior that a program can be depending on. (A hardening level to trap if a voluntary constraint is violated is a lot different than making a standard function silently do the wrong thing because something else would be "better" to do.) FWIW, accessing any of the standard FILEs after fclose on them (e.g. calling printf after fclose(stdout) or getchar() after fclose(stdin)) is UB, and would potentially be worth trapping too. Zero-overhead trapping could probably be arranged just by fiddling with the buffer pointers. > I don't think it's safe to make > close() special case 0/1/2, though—not least because it's supposed > to be atomic and async signal safe. And we should maybe give some > thought to runtimes for languages other than C, which probably have > their own buffering wrappers for fds 0/1/2 and might care about > cross-language interop. Indeed, the hypothetical thing was not intended as a change of semantics, just as a hardening mode to catch programming errors. Rich
On Tue, Mar 12, 2024, at 10:42 AM, Rich Felker wrote:
> On Tue, Mar 12, 2024 at 03:31:04PM +0100, Florian Weimer wrote:
>> * Zack Weinberg:
>>
>> > On Tue, Mar 12, 2024, at 9:54 AM, Florian Weimer wrote:
>> >>> Doing this would break many programs, such as:
>> >>> - most of coreutils, e.g. programs like ls, cat or head, since they
>> >>> always `close` their input and output descriptors (when they've
>> >>> written or read something) to make sure to diagnose all errors
>> >>
>> >> A slightly better way to do this is to do fflush (stdout) followed by
>> >> error checking on close (dup (fileno (stdout))).
>> >
>> > Does that actually report delayed write errors? As you have it,
>> > the close() just drops the fd created by the dup(), the OFD is
>> > still referenced by fd 1 and therefore remains open.
>>
>> I don't think the VFS close action is subject to reference counting.
>> Otherwise the current coreutils error checking wouldn't work because in
>> many cases, another process retains a reference to the OFD.
>
> It is. close only reports errors if it's the last fd referring to the
> ofd. It's an incredibly stupid design choice by NFS that mismatches
> expected fd behavior.
I do fully agree that this is a design error in NFS and in close(2) more generally [like all destructors, it should be _impossible_ for close to fail], but there is no realistic prospect of changing it, and I've been burned a few times by programs that didn't notice delayed write errors. The risk of missing an error because another process still has a ref to the OFD is real, but it comes up rarely enough that I don't think it should deter us from trying to get the "only one process has this file open" case right. (Think shell `>` redirection.)
Also, I have written programs that expected fds 0 and/or 1 to be pipes, and closed them well before exiting as a cheap way to send one-shot events to the process on the other end of the pipe. I think that's a legitimate way to use inherited fds, and it would be significantly more difficult to do in some contexts if you had to use a fd > 2 to do it (e.g. the process on the other end used popen() to start things).
If there's something we can do to make it easier for programmers to do the Right Thing with closing standard fds, IMHO we should. For stdio, we perfectly well _could_ special case fclose(), when the fileno is 0/1/2, to do the dup/close dance and leave both the FILE and the fd in a valid, stubbed state. (For stdin, open("/dev/null", O_RDONLY) is clearly an appropriate stub semantic; for stdout and stderr I am unsure whether "discard all writes silently" or "fail all writes" would be better.) I don't think it's safe to make close() special case 0/1/2, though—not least because it's supposed to be atomic and async signal safe. And we should maybe give some thought to runtimes for languages other than C, which probably have their own buffering wrappers for fds 0/1/2 and might care about cross-language interop.
zw
Hi Thorsten, Thorsten Glaser wrote, > Waldemar Brodkorb dixit: > > >lixing wrote, > > >> We've tested static and dynamic build mksh with commit id > >> cbb8a0196aab53165a35339fd91ade599d184f both works ok. > > That’s great to hear! > > >Can you check with qemu-system-loongarch64? I use gcc 13.2.0, > >binutils 2.42 and Linux 6.6.18 with the defconfig provided. > >Qemu is also 8.2.2. Can you provide your QEMU_EFI.fd for download? > > Perhaps to distinguish between emulation/runtime and toolchain > faults you could send each other the mksh binaries (if statically > linked, otherwise probably them plus musl’s dynamic parts)? Statically linked mksh from me is here: https://debug.openadk.org/mksh best regards Waldemar