* Model specific optimizations? @ 2016-09-29 14:21 Markus Wichmann 2016-09-29 14:57 ` Szabolcs Nagy 2016-09-29 15:23 ` Rich Felker 0 siblings, 2 replies; 14+ messages in thread From: Markus Wichmann @ 2016-09-29 14:21 UTC (permalink / raw) To: musl Hi there, I wanted to ask if there is any wish for the near future to support model-specific optimizations. What I mean by that is multiple implementations of the same function, where the best implementation is decided at run-time. One simple example would be PowerPC's fsqrt instruction. The PowerPC Book 1 defines it as optional and provides no specific way to know whether the currently running processor supports this instruction, besides executing it and seeing if you get a SIGILL. A cursory DuckDuckGo search revealed that Apple uses the instruction as its sqrt implementation if it detects the CPU capability for that, however it only detects that capability by checking the PVR for known-good bit patterns (Currently, the only known PowerPC cores to support this instruction are the 970 and 970FX, which have a version field of 0x39 and 0x3c, respectively). x86 and x86-derived architectures at least have the cpuid instruction to check for some features, and admittedly, there's a lot of defined bits. However, glibc's ifunc-initialization function (which selects the implementation) also does a lot of work finding out the precise make and model of the CPU to set some more flags. The reason I ask is that lots of ISAs define optional parts that grow in popularity more and more until they're seen in all current practical implementations. Like how x87 started out as a separate device but has been a fixed part of x86 since the later days of the 486. Same with MMX, SSE, SSE2. None of these are mandatory by the ABI, but available in all practical implementations. And musl is never going to be able to utilize that in its current form. Oh, alright, the compiler might support it, but that's different.
I also suspect the fsqrt instruction will be available in more future PowerPC implementations. If we were to go this route, the question is how to go about it. First the detection method: Stuff like cpuid or AT_HWCAP are pretty nice, because they allow for the detection of a feature, whereas version checking only allows one to find known-good implementations. The latter means there's a list of known-good values, and that list has to be kept up-to-date. However, the latter is also pretty much always possible, while the former isn't always available. The kernel doesn't check for fsqrt availability, for example. Then organization: Are we going the glibc route, which gathers all indirect functions in a single section and initializes all of the pointers at startup (__libc_init_main()), or do we put these checks separately in each function? To make a practical example, we could implement sqrt() for PowerPC like this:

static double soft_sqrt(double);
static double hard_sqrt(double);
static double init_sqrt(double);
static double (*sqrtfn)(double) = init_sqrt;

double sqrt(double x) {
    return sqrtfn(x);
}

static double init_sqrt(double x) {
    unsigned long pvr;
    unsigned long ver;
    asm ("mfspr pvr, r0" : "=r"(pvr));
    ver = (pvr >> 16) & 0xffff;
    /* XXX: Add more values for cores with the fsqrt instruction here */
    if (0
        || ver == 0x39 /* PowerPC 970 */
        || ver == 0x3c /* PowerPC 970FX */
       )
        sqrtfn = hard_sqrt;
    else
        sqrtfn = soft_sqrt;

    return sqrtfn(x);
}

static double hard_sqrt(double x) {
    double r;
    asm ("fsqrt %0, %1": "=d"(r) : "d"(x));
    return r;
}

#define sqrt soft_sqrt
#include "../sqrt.c"

Problem with this is: The same thing would have to be repeated for sqrtf(), the same list of known values would have to be maintained twice, although we could make it a real list (an array, I mean), and get rid of that issue. But it does add quite a bit of code, and the overhead of an indirect function call, and at the moment isn't going to be useful to anyone but a few people.
Also, the inclusion here is a hack. But I couldn't think of a better way. Thoughts? Ciao, Markus ^ permalink raw reply [flat|nested] 14+ messages in thread
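The "real list" idea from this message could look roughly like the following. This is only a sketch: the names fsqrt_pvr_versions and pvr_version_has_fsqrt are made up for illustration, and the two version values are just the ones cited above (970 and 970FX). Reading the PVR itself is left out, so the function takes the raw register value as an argument and could be shared by sqrt() and sqrtf().

```c
#include <stddef.h>

/* Known-good PVR version fields; only the two values named in the
 * message above. Any further entries would have to be verified. */
static const unsigned fsqrt_pvr_versions[] = {
    0x39, /* PowerPC 970 */
    0x3c, /* PowerPC 970FX */
};

/* Hypothetical helper: returns 1 if the version field (upper 16 bits)
 * of the given raw PVR value is in the known-good list, 0 otherwise. */
int pvr_version_has_fsqrt(unsigned long pvr)
{
    unsigned ver = (pvr >> 16) & 0xffff;
    size_t n = sizeof fsqrt_pvr_versions / sizeof fsqrt_pvr_versions[0];

    for (size_t i = 0; i < n; i++)
        if (fsqrt_pvr_versions[i] == ver)
            return 1;
    return 0;
}
```

The lower 16 bits (the revision field) are ignored, matching the shift and mask in the init_sqrt() example.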
* Re: Model specific optimizations? 2016-09-29 14:21 Model specific optimizations? Markus Wichmann @ 2016-09-29 14:57 ` Szabolcs Nagy 2016-09-29 15:23 ` Rich Felker 1 sibling, 0 replies; 14+ messages in thread From: Szabolcs Nagy @ 2016-09-29 14:57 UTC (permalink / raw) To: musl * Markus Wichmann <nullplan@gmx.net> [2016-09-29 16:21:26 +0200]: > I wanted to ask if there is any wish for the near future to support > model-specific optimizations. What I mean by that is multiple > implementations of the same function, where the best implementation is > decided at run-time. musl already does some runtime selection based on hw/kernel features (arm atomics, vdso). it could use similar approaches for micro-architecture specific optimizations. this has a maintenance cost (hard to test, hard to benchmark), code size cost (all variants have to be present at runtime) and dispatch cost (it has to happen at startup or lazily). these costs are rarely justified. (there are secondary effects: glibc dispatches memcpy at runtime so the compilers have a hard time deciding when to inline it, as a consequence sometimes -O0 gives better performance than -O3 on x86 with glibc.) > One simple example would be PowerPC's fsqrt instruction. The PowerPC > Book 1 defines it as optional and provides no way to know specifically, > if the currently running processor supports this instruction besides > executing it and seeing if you get a SIGILL. if there is no linux hwcap bit for this then we can't do much about it. runtime dispatch only works if there is a reasonable way to detect hw features (hwcap, cpuid instruction, vdso something). e.g. parsing /proc/cpuinfo to figure out the cpu and guessing features from that or registering sigill signal handlers are not ok. > Then organization: Are we going the glibc route, which gathers all > indirect functions in a single section and initializes all of the > pointers at startup (__libc_init_main()), or do we put these checks > separately in each function?
glibc uses ifunc for this, musl does not support ifunc at this point.
* Re: Model specific optimizations? 2016-09-29 14:21 Model specific optimizations? Markus Wichmann 2016-09-29 14:57 ` Szabolcs Nagy @ 2016-09-29 15:23 ` Rich Felker 2016-09-29 17:08 ` Markus Wichmann 1 sibling, 1 reply; 14+ messages in thread From: Rich Felker @ 2016-09-29 15:23 UTC (permalink / raw) To: musl On Thu, Sep 29, 2016 at 04:21:26PM +0200, Markus Wichmann wrote: > Hi there, > > I wanted to ask if there is any wish for the near future to support > model-specific optimizations. What I mean by that is multiple > implementations of the same function, where the best implementation is > decided at run-time. musl's general approach to this is to use non-mandatory ISA extensions only when compiled with the right -march to assume their availability. What I mean by "non-mandatory" is that some extensions must be used when supported in order to have proper behavior. For example if a baseline ISA does not have atomics (ARM, SH) but emulates them with help from the kernel that only works on UP machines (SH), code running on models that support SMP, even if it was just compiled for the baseline ISA, _must_ detect availability of the atomics and use them or it would not properly synchronize with other cpus. Another example is setjmp/longjmp handling of floating point registers on softfloat ARM ABIs. Per the ABI, if the registers exist, some of them are defined as call-saved, so setjmp/longjmp have to save and restore them even when the ABI libc was built for doesn't use them itself. It's possible we could also apply runtime detection on a case-by-case basis for things that are not mandatory but performance-critical; however, I think we would want to see convincing evidence that the performance gains would be worth the complexity/maintenance costs. For most cases, a better solution is probably just to build libc.so optimized for your machine with the appropriate -march. > One simple example would be PowerPC's fsqrt instruction. 
The PowerPC > Book 1 defines it as optional and provides no way to know specifically, > if the currently running processor supports this instruction besides > executing it and seeing if you get a SIGILL. > > A cursory DuckDuckGo search revealed that Apple uses the instruction as > sqrt implementation if it detects the CPU capability for that, however > it only detects that capability by checking the PVR for known-good bit > patterns (Currently, the only known PowerPC cores to support this > instruction are the 970 and 970FX, which have a version field if 0x39 > and 0x3c, respectively). x86 and -derived architectures at least have > the cpuid instruction to check for some features, and admittedly, > there's a lot of defined bits. However, glibc's ifunc-initialization > function (which selects the implementation) also does a lot of work > finding out the precise make and model of the CPU to set some more > flags. > > The reason I ask is that lots of ISAs define optional parts that aren't > mandatory, but grow in popularity more and more until they're seen in > all current practical implementations. Like how x87 started out as a > separate device but is a fixed part of x86 since the later days of the > 486. Same with MMX, SSE, SSE2. None of these are mandatory by the ABI, > but available in all practical implementations. And musl is never going > to be able to utilize that in its current form. Oh, alright, the > compiler might support it, but that's different. For vector instructions, generation by the compiler is almost always the only way we want to be using them. Hand-written vector code is a huge liability in terms of maintenance, readability, bug-surface, etc. The aim in musl is to have no asm beyond what's absolutely necessary to run and to make performance-critical functions (mainly memcpy and the like) acceptably fast. > I also suspect the > fsqrt instruction will be available in more future PowerPC > implementations.
> > If we were to go this route, the question is how to go about it. First > the detection method: Stuff like cpuid or AT_HWCAP are pretty nice, > because they allow for the detection of a feature, whereas version > checking only allows one to find known-good implementations. The latter > means there's a list of known-good values, and that list has to be kept > up-to-date. However, the latter is also pretty much always possible, > while the former isn't always available. The kernel doesn't check for > fsqrt availability, for example. What kind of version-checking? Not all systems even give you a way to version-check. > Then organization: Are we going the glibc route, which gathers all > indirect functions in a single section and initializes all of the > pointers at startup (__libc_init_main()), or do we put these checks > separately in each function? Branch based on hwcap or similar in the functions themselves. > To make a practical example, we could implement sqrt() for PowerPC like > this: > > static double soft_sqrt(double); > static double hard_sqrt(double); > static double init_sqrt(double); > static double (*sqrtfn)(double) = init_sqrt; > > double sqrt(double x) { > return sqrtfn(x); > } > > static double init_sqrt(double x) { > unsigned long pvr; > unsigned long ver; > asm ("mfspr pvr, r0" : "=r"(pvr)); > ver = (pvr >> 16) & 0xffff; > /* XXX: Add more values for cores with the fsqrt instruction here */ > if (0 > || ver == 0x39 /* PowerPC 970 */ > || ver == 0x3c /* PowerPC 970FX */ > ) > sqrtfn = hard_sqrt; > else > sqrtfn = soft_sqrt; > > return sqrtfn(x); > } This code contains data races. In order to be safe under musl's memory model, sqrtfn would have to be volatile and should probably be written via a_cas_p. It also then has to have type void* and be cast to/from function pointer type. See clock_gettime.c. 
> static double hard_sqrt(double x) { > double r; > asm ("fsqrt %0, %1": "=d"(r) : "d"(x)); > return r; > } For some archs, gas produces an error or tags the .o file as needing a certain ISA level if you use an instruction that's not present in the baseline ISA. I'm not sure if this is an issue here or not. > #define sqrt soft_sqrt > #include "../sqrt.c" > > > Problem with this is: The same thing would have to be repeated for > sqrtf(), the same list of known values would have to be maintained > twice, although we could make it a real list (an array, I mean), and get > rid of that issue. But it does add quite a bit of code, and the overhead > of an indirect function call, and at the moment isn't going to be useful > to all but a few people. > > Also, the inclusion here is a hack. But I couldn't think of a better > way. I think it's the #define sqrt soft_sqrt that's a hack. The inclusion itself is okay and would be the right way to do this for sure if it were just a compile-time check and not a runtime one. Rich ^ permalink raw reply [flat|nested] 14+ messages in thread
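To illustrate the data-race point above, here is a minimal sketch of a lazily resolved function pointer that is read and written only through atomics. It uses standard C11 <stdatomic.h> rather than musl's internal a_cas_p, so it is not what musl itself would do. All names (unary_fn, impl_generic, impl_fast, have_fast_path, dispatch) are hypothetical stand-ins, and relaxed ordering is assumed to be enough only because nothing besides the pointer value is being published.

```c
#include <stdatomic.h>

typedef double (*unary_fn)(double);

/* Hypothetical stand-ins for the two real implementations; both must
 * return the same result for the same input. */
static double impl_generic(double x) { return x + 1.0; }
static double impl_fast(double x)    { return x + 1.0; }

/* Hypothetical feature probe; a real one would consult hwcap or similar. */
static int have_fast_path(void) { return 0; }

static double resolve_and_call(double x);

/* Starts out pointing at the resolver and is replaced on first use. */
static _Atomic(unary_fn) selected = resolve_and_call;

static double resolve_and_call(double x)
{
    unary_fn chosen = have_fast_path() ? impl_fast : impl_generic;
    unary_fn expected = resolve_and_call;

    /* Threads racing here all compute the same answer, so it does not
     * matter which one wins the compare-and-swap. */
    atomic_compare_exchange_strong_explicit(&selected, &expected, chosen,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
    return chosen(x);
}

/* Public entry point: one relaxed atomic load plus an indirect call. */
double dispatch(double x)
{
    unary_fn f = atomic_load_explicit(&selected, memory_order_relaxed);
    return f(x);
}
```

The first call to dispatch() goes through the resolver; later calls load the stored pointer and jump straight to the chosen implementation.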
* Re: Model specific optimizations? 2016-09-29 15:23 ` Rich Felker @ 2016-09-29 17:08 ` Markus Wichmann 2016-09-29 18:13 ` Rich Felker 0 siblings, 1 reply; 14+ messages in thread From: Markus Wichmann @ 2016-09-29 17:08 UTC (permalink / raw) To: musl On Thu, Sep 29, 2016 at 11:23:54AM -0400, Rich Felker wrote: > What kind of version-checking? Not all systems even give you a way to > version-check. > To the extent that they don't, they also don't give you a way to check for features (again, except for executing the instructions and seeing if you get SIGILL). PowerPC (sorry, but that's where I spent a lot of time recently) for instance only has the PVR (Processor Version Register). No software I could find online uses another way to detect the features of the CPU. And for systems to not give you a way of detecting system version at runtime and then define optional parts of the ISA would be very dickish, in my opinion. That basically guarantees optional functions won't be used at all. > This code contains data races. In order to be safe under musl's memory > model, sqrtfn would have to be volatile and should probably be written > via a_cas_p. It also then has to have type void* and be cast to/from > function pointer type. See clock_gettime.c. > Well, yes, I was just throwing shit at a wall to see what sticks. We could also move the function pointer dispatch into a pthread_once block or something. I don't know if any caches need to be cleared then or not. But yes, there are better examples. > For some archs, gas produces an error or tags the .o file as needing a > certain ISA level if you use an instruction that's not present in the > baseline ISA. I'm not sure if this is an issue here or not. > As I said, fsqrt is defined in the baseline ISA, just marked as optional. So any PowerPC implementation is free to include it or not. There are a lot of optional features, and if the gas people made a different subarch for each combination of them, they'd be here all day.
Not just instructions, too. Sometimes the optional thing is a register, and sometimes just bits in a register. > I think it's the #define sqrt soft_sqrt that's a hack. The inclusion > itself is okay and would be the right way to do this for sure if it > were just a compile-time check and not a runtime one. > I meant the define. While it is hacky, it does mean no code duplication and only one externally facing symbol regarding sqrt(), which is the one defined by the standard. Although I am abusing the little known rule about C that if a function is declared as static in its prototype, and the function definition doesn't have an explicit storage class specifier, then the function will be static. Most style guides (rightly) say to have the storage class specifier in the prototype and the definition be the same, because otherwise this gets confusing fast. I guess it goes to show that you should know your language even in the parts you barely ever use (because forbidden), because they might come in handy at some point. > Rich Ciao, Markus ^ permalink raw reply [flat|nested] 14+ messages in thread
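The linkage rule mentioned in this message can be shown in a few lines. In this sketch the names soft_twice and twice are invented for illustration. A prior static declaration gives the identifier internal linkage, and a later definition without a storage-class specifier inherits it (C11 6.2.2, paragraphs 4 and 5), which is what lets the #define trick turn an included "double sqrt(double)" definition into a static function.

```c
/* Prior declaration: gives soft_twice internal linkage. */
static double soft_twice(double);

/* Rename the public-looking definition that follows, the same way the
 * "#define sqrt soft_sqrt" line renames the definition in sqrt.c. */
#define twice soft_twice

/* No storage-class specifier here. After macro expansion this defines
 * soft_twice, and because a static declaration is already visible the
 * function keeps internal linkage. */
double twice(double x)
{
    return 2.0 * x;
}

#undef twice

/* Externally visible wrapper that uses the now-static function. */
double call_twice(double x)
{
    return soft_twice(x);
}
```

The translation unit exports only call_twice; no external symbol named twice or soft_twice is produced.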
* Re: Model specific optimizations? 2016-09-29 17:08 ` Markus Wichmann @ 2016-09-29 18:13 ` Rich Felker 2016-09-29 18:52 ` Adhemerval Zanella 2016-09-30 4:56 ` Markus Wichmann 0 siblings, 2 replies; 14+ messages in thread From: Rich Felker @ 2016-09-29 18:13 UTC (permalink / raw) To: musl On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote: > On Thu, Sep 29, 2016 at 11:23:54AM -0400, Rich Felker wrote: > > What kind of version-checking? Not all systems even give you a way to > > version-check. > > To the extent that they don't, they also don't give you a way to check > for features (again, except for executing the instructions and seeing if > you get SIGILL). PowerPC (sorry, but that's where I spent a lot of time > on recently) for instance only has the PVR (Processor Version Register). > No software I could find online uses another way to detect the features > of the CPU. > > And for systems to not give you a way of detecting system version at > runtime and then define optional parts of the ISA would be very dickish, > in my opinion. That basically guarentees optional functions won't be > used at all. On Linux it's supposed to be the kernel which detects availability of features (either by feature-specific cpu flags or translating a model to flags) but I don't see anything for fsqrt on ppc. :-( How/why did they botch this? > > This code contains data races. In order to be safe under musl's memory > > model, sqrtfn would have to be volatile and should probably be written > > via a_cas_p. It also then has to have type void* and be cast to/from > > function pointer type. See clock_gettime.c. > > Well, yes, I was just throwing shit at a wall to see what sticks. We > could also move the function pointer dispatch into a pthread_once block > or something. I don't know if any caches need to be cleared then or not. 
pthread_once/call_once would be the nice clean abstraction to use, but it's mildly to considerably more expensive, currently involving a full barrier. There's a nice technical report on how that can be eliminated but it requires TLS, which is also expensive on some archs. In cases like this where there's no state other than the function pointer, relaxed atomics can simply be used on the reading end and then they're always fast. > > For some archs, gas produces an error or tags the .o file as needing a > > certain ISA level if you use an instruction that's not present in the > > baseline ISA. I'm not sure if this is an issue here or not. > > As I said, fsqrt is defined in the baseline ISA, just marked as > optional. We're just using words differently. To me, baseline ISA means the part of the ISA that all models (or at least all usable models; e.g. for x86, pre-486 is not usable without trap-and-emulate of cmpxchg so we consider 486 the baseline ISA) support. > So any PowerPC implementation is free to include it or not. > There are a lot of optional features, and if the gas people made a > different subarch for each combination of them, they'd be here all day. They've actually done that for some archs... > > I think it's the #define sqrt soft_sqrt that's a hack. The inclusion > > itself is okay and would be the right way to do this for sure if it > > were just a compile-time check and not a runtime one. > > I meant the define. While it is hacky, it does mean no code duplication > and only one externally facing symbol regarding sqrt(), which is the one > defined by the standard. Although I am abusing the little known rule > about C that if a function is declared as static in its prototype, and > the function definition doesn't have an explicit storage class > specifier, then the function will be static. Most style guides (rightly) > say to have the storage class specifier in the prototype and the > definition be the same, because otherwise this gets confusing fast. 
> > I guess it goes to show that you should know your language even in the > parts you barely ever use (because forbidden), because they might come > in handy at some point. Yes, I was a bit surprised first and had to recall the rule, but I knew the code was either valid or a constraint violation right away. Anyway, I would have no objection right away to doing a patch like this that's decided at compile-time based on predefined macros set by -march. For runtime choice I think we need to discuss motivation. Are you trying to do a powerpc-based distro where you need a universal libc.so that works optimally on various models? Or would just compiling for the right -march meet your needs? Rich ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Model specific optimizations? 2016-09-29 18:13 ` Rich Felker @ 2016-09-29 18:52 ` Adhemerval Zanella 2016-09-29 22:05 ` Szabolcs Nagy 2016-09-30 4:56 ` Markus Wichmann 1 sibling, 1 reply; 14+ messages in thread From: Adhemerval Zanella @ 2016-09-29 18:52 UTC (permalink / raw) To: musl On 29/09/2016 11:13, Rich Felker wrote: > On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote: >> On Thu, Sep 29, 2016 at 11:23:54AM -0400, Rich Felker wrote: >>> What kind of version-checking? Not all systems even give you a way to >>> version-check. >> >> To the extent that they don't, they also don't give you a way to check >> for features (again, except for executing the instructions and seeing if >> you get SIGILL). PowerPC (sorry, but that's where I spent a lot of time >> on recently) for instance only has the PVR (Processor Version Register). >> No software I could find online uses another way to detect the features >> of the CPU. >> >> And for systems to not give you a way of detecting system version at >> runtime and then define optional parts of the ISA would be very dickish, >> in my opinion. That basically guarentees optional functions won't be >> used at all. > > On Linux it's supposed to be the kernel which detects availability of > features (either by feature-specific cpu flags or translating a model > to flags) but I don't see anything for fsqrt on ppc. :-( How/why did > they botch this? Maybe because recent powerpc work on the kernel is POWER-oriented, where fsqrt has been defined since POWER4. However, some more recent Freescale chips (such as the e5500 and e6500) also decided not to add the fsqrt instruction. With GCC you can check for _ARCH_PPCSQ to see if the current arch flags allow fsqrt. At runtime I presume programs can check for the hwcap bit PPC_FEATURE_POWER4; however, that does not help on non-POWER chips which do support fsqrt. Another option, and a bit hacky, would be to issue fsqrt and trap on SIGILL... > >>> This code contains data races.
In order to be safe under musl's memory >>> model, sqrtfn would have to be volatile and should probably be written >>> via a_cas_p. It also then has to have type void* and be cast to/from >>> function pointer type. See clock_gettime.c. >> >> Well, yes, I was just throwing shit at a wall to see what sticks. We >> could also move the function pointer dispatch into a pthread_once block >> or something. I don't know if any caches need to be cleared then or not. > > pthread_once/call_once would be the nice clean abstraction to use, but > it's mildly to considerably more expensive, currently involving a full > barrier. There's a nice technical report on how that can be eliminated > but it requires TLS, which is also expensive on some archs. In cases > like this where there's no state other than the function pointer, > relaxed atomics can simply be used on the reading end and then they're > always fast. > >>> For some archs, gas produces an error or tags the .o file as needing a >>> certain ISA level if you use an instruction that's not present in the >>> baseline ISA. I'm not sure if this is an issue here or not. >> >> As I said, fsqrt is defined in the baseline ISA, just marked as >> optional. > > We're just using words differently. To me, baseline ISA means the part > of the ISA that all models (or at least all usable models; e.g. for > x86, pre-486 is not usable without trap-and-emulate of cmpxchg so we > consider 486 the baseline ISA) support. > >> So any PowerPC implementation is free to include it or not. >> There are a lot of optional features, and if the gas people made a >> different subarch for each combination of them, they'd be here all day. > > They've actually done that for some archs... > >>> I think it's the #define sqrt soft_sqrt that's a hack. The inclusion >>> itself is okay and would be the right way to do this for sure if it >>> were just a compile-time check and not a runtime one. >> >> I meant the define. 
While it is hacky, it does mean no code duplication >> and only one externally facing symbol regarding sqrt(), which is the one >> defined by the standard. Although I am abusing the little known rule >> about C that if a function is declared as static in its prototype, and >> the function definition doesn't have an explicit storage class >> specifier, then the function will be static. Most style guides (rightly) >> say to have the storage class specifier in the prototype and the >> definition be the same, because otherwise this gets confusing fast. >> >> I guess it goes to show that you should know your language even in the >> parts you barely ever use (because forbidden), because they might come >> in handy at some point. > > Yes, I was a bit surprised first and had to recall the rule, but I > knew the code was either valid or a constraint violation right away. > > Anyway, I would have no objection right away to doing a patch like > this that's decided at compile-time based on predefined macros set by > -march. For runtime choice I think we need to discuss motivation. Are > you trying to do a powerpc-based distro where you need a universal > libc.so that works optimally on various models? Or would just > compiling for the right -march meet your needs? > > Rich > ^ permalink raw reply [flat|nested] 14+ messages in thread
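The runtime hwcap check suggested in this message might be sketched as follows, assuming Linux and getauxval() from <sys/auxv.h>. The helper names hwcap_has and cpu_reports_power4 are hypothetical. The fallback value for PPC_FEATURE_POWER4 is an assumption taken from the kernel's powerpc <asm/cputable.h> and should be checked against the real headers. As noted above, this bit says nothing about non-POWER cores that do implement fsqrt.

```c
#if defined(__linux__)
#include <sys/auxv.h> /* getauxval(), AT_HWCAP */
#endif

/* Assumed value, used only if the libc headers did not define it;
 * it is meaningful on powerpc only. */
#ifndef PPC_FEATURE_POWER4
#define PPC_FEATURE_POWER4 0x00080000UL
#endif

/* Hypothetical helper: test a hwcap word against a feature mask. */
int hwcap_has(unsigned long hwcap, unsigned long mask)
{
    return (hwcap & mask) == mask;
}

/* Hypothetical helper: nonzero if the kernel reports a POWER4-or-later
 * core. On other architectures the bit means something else, so the
 * answer there is always 0. */
int cpu_reports_power4(void)
{
#if defined(__linux__) && (defined(__powerpc__) || defined(__powerpc64__))
    return hwcap_has(getauxval(AT_HWCAP), PPC_FEATURE_POWER4);
#else
    return 0;
#endif
}
```

A dispatcher would call cpu_reports_power4() once and fall back to the software sqrt whenever it returns 0, accepting that some fsqrt-capable cores are then left on the slow path.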
* Re: Model specific optimizations? 2016-09-29 18:52 ` Adhemerval Zanella @ 2016-09-29 22:05 ` Szabolcs Nagy 2016-09-29 23:14 ` Adhemerval Zanella 0 siblings, 1 reply; 14+ messages in thread From: Szabolcs Nagy @ 2016-09-29 22:05 UTC (permalink / raw) To: musl * Adhemerval Zanella <adhemerval.zanella@linaro.org> [2016-09-29 11:52:44 -0700]: > On 29/09/2016 11:13, Rich Felker wrote: > > On Linux it's supposed to be the kernel which detects availability of > > features (either by feature-specific cpu flags or translating a model > > to flags) but I don't see anything for fsqrt on ppc. :-( How/why did > > they botch this? > > Maybe because recent power work on kernel is POWER oriented, where fsqrt > is define since POWER4. However some more recent freescale chips (such > as e5500 and e6500) also decided to not add fsqrt instruction. > > With GCC you can check for _ARCH_PPCSQ to see if current arch flags > allows fsqrt. From runtine I presume programs can check for hwcap bit > PPC_FEATURE_POWER4, however it does not help on non-POWER chips which > do support fsqrt. > how can distros deal with this? do they require POWER4? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Model specific optimizations? 2016-09-29 22:05 ` Szabolcs Nagy @ 2016-09-29 23:14 ` Adhemerval Zanella 0 siblings, 0 replies; 14+ messages in thread From: Adhemerval Zanella @ 2016-09-29 23:14 UTC (permalink / raw) To: musl On 29/09/2016 15:05, Szabolcs Nagy wrote: > * Adhemerval Zanella <adhemerval.zanella@linaro.org> [2016-09-29 11:52:44 -0700]: >> On 29/09/2016 11:13, Rich Felker wrote: >>> On Linux it's supposed to be the kernel which detects availability of >>> features (either by feature-specific cpu flags or translating a model >>> to flags) but I don't see anything for fsqrt on ppc. :-( How/why did >>> they botch this? >> >> Maybe because recent power work on kernel is POWER oriented, where fsqrt >> is define since POWER4. However some more recent freescale chips (such >> as e5500 and e6500) also decided to not add fsqrt instruction. >> >> With GCC you can check for _ARCH_PPCSQ to see if current arch flags >> allows fsqrt. From runtine I presume programs can check for hwcap bit >> PPC_FEATURE_POWER4, however it does not help on non-POWER chips which >> do support fsqrt. >> > > how can distros deal with this? > > do they require POWER4? > I do not really know what the current approach is for powerpc{32,64} distros, but I recall that both RHEL and SLES used to provide arch-specific libc.so builds optimized for each chip (default, power4, powerX). powerpc64le has a current minimum ISA of 2.07 (power8) with both complete fp and VSX, so it should not have this issue. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Model specific optimizations? 2016-09-29 18:13 ` Rich Felker 2016-09-29 18:52 ` Adhemerval Zanella @ 2016-09-30 4:56 ` Markus Wichmann 2016-10-01 5:50 ` Rich Felker 1 sibling, 1 reply; 14+ messages in thread From: Markus Wichmann @ 2016-09-30 4:56 UTC (permalink / raw) To: musl On Thu, Sep 29, 2016 at 02:13:36PM -0400, Rich Felker wrote: > On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote: > > [...] > On Linux it's supposed to be the kernel which detects availability of > features (either by feature-specific cpu flags or translating a model > to flags) but I don't see anything for fsqrt on ppc. :-( How/why did > they botch this? > Maybe it's a new extension? I only know version 2.2 of the PowerPC Book. Or maybe it goes back to the single core thing. (Only the 970 supports it, and that's pretty new.) Or maybe Linux kernel developers aren't interested in this problem, because a manual sqrt exists, and if need be, anyone can just implement the Babylonian method for speed. On PPC, it can be implemented in a loop consisting of four instructions, namely:

; .rodata
half: .double 0.5
; assuming positive finite argument
; if that can't be assumed, go through memory to inspect argument
    fmr 1, 0 ; yes, halving the exponent would be a better estimate
             ; requires going through memory, though
    lfd 2, half(13)
    li 0, 6 ; or more for more accuracy
    mtctr 0

1:  ; fr0 = x, fr1 = a
    fdiv 3, 1, 0 ; fr3 = a/x
    fadd 3, 3, 0 ; fr3 = x + a/x
    fmul 0, 3, 2 ; fr0 = 0.5(x + a/x)
    bdnz 1b

So maybe there wasn't a lot of need for the hardware sqrt. > > Well, yes, I was just throwing shit at a wall to see what sticks. We > > could also move the function pointer dispatch into a pthread_once block > > or something. I don't know if any caches need to be cleared then or not. > > pthread_once/call_once would be the nice clean abstraction to use, but > it's mildly to considerably more expensive, currently involving a full > barrier.
There's a nice technical report on how that can be eliminated > but it requires TLS, which is also expensive on some archs. In cases > like this where there's no state other than the function pointer, > relaxed atomics can simply be used on the reading end and then they're > always fast. > Hmmm... not on PPC, though. TLS on Linux PPC just uses r2 as TLS pointer. So the entire thing could be used almost as-is by making sqrtfn thread-local? > > So any PowerPC implementation is free to include it or not. > > There are a lot of optional features, and if the gas people made a > > different subarch for each combination of them, they'd be here all day. > > They've actually done that for some archs... > That actually made me check if they did it here, but thankfully not. gas assembles the instruction without flags, without warning, and without a note or anything on the output file. > Anyway, I would have no objection right away to doing a patch like > this that's decided at compile-time based on predefined macros set by > -march. For runtime choice I think we need to discuss motivation. Are > you trying to do a powerpc-based distro where you need a universal > libc.so that works optimally on various models? Or would just > compiling for the right -march meet your needs? > Just idle musings. I was reading sqrt.c, which has a flowerbox saying "Use hardware sqrt if available" and recalled that there is a hardware sqrt on PPC and started doing research from there. And that ended up in the OP. > Rich Ciao, Markus ^ permalink raw reply [flat|nested] 14+ messages in thread
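For readers who do not read PowerPC assembly, the Babylonian loop from this message corresponds roughly to the C below. This is a sketch only: babylonian_sqrt and its iterations parameter are invented names, the argument is assumed positive and finite as in the assembly, and the iteration by itself does not guarantee a correctly rounded result.

```c
/* Babylonian (Newton) iteration for sqrt(a), with the same structure as
 * the assembly loop: start from x = a and repeat x = 0.5 * (x + a/x).
 * Assumes a is positive and finite. */
double babylonian_sqrt(double a, int iterations)
{
    double x = a; /* initial estimate, as set up by the fmr instruction */

    for (int i = 0; i < iterations; i++)
        x = 0.5 * (x + a / x);
    return x;
}
```

With the fixed count of six iterations the quality of the result depends heavily on the starting estimate: for a = 4 the value is already very close to 2, while for a = 1e6 it is still far above the true root of 1000. That is the reason the assembly comment calls halving the exponent a better estimate.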
* Re: Model specific optimizations?
  2016-09-30 4:56 ` Markus Wichmann
@ 2016-10-01 5:50 ` Rich Felker
  2016-10-01 8:52 ` Markus Wichmann
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2016-10-01 5:50 UTC
To: musl

On Fri, Sep 30, 2016 at 06:56:15AM +0200, Markus Wichmann wrote:
> On Thu, Sep 29, 2016 at 02:13:36PM -0400, Rich Felker wrote:
> > On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote:
> > > [...]
> > On Linux it's supposed to be the kernel which detects availability of
> > features (either by feature-specific cpu flags or translating a model
> > to flags) but I don't see anything for fsqrt on ppc. :-( How/why did
> > they botch this?
> >
>
> Maybe it's a new extension? I only know version 2.2 of the PowerPC Book.
>
> Or maybe it goes back to the single core thing. (Only the 970 supports
> it, and that's pretty new.) Or maybe Linux kernel developers aren't
> interested in this problem, because a manual sqrt exists, and if need
> be, anyone can just implement the Babylonian method for speed. On PPC,
> it can be implemented in a loop consisting of four instructions, namely:
>
> ; .rodata
> half:   .double 0.5
> ; assuming positive finite argument
> ; if that can't be assumed, go through memory to inspect argument
>     fmr 1, 0 ; yes, halving the exponent would be a better estimate
>              ; requires going through memory, though
>     lfd 2, half(13)
>     li 0, 6 ; or more for more accuracy
>     mtctr 0
>
> 1:  ; fr0 = x, fr1 = a
>     fdiv 3, 1, 0 ; fr3 = a/x
>     fadd 3, 3, 0 ; fr3 = x + a/x
>     fmul 0, 3, 2 ; fr0 = 0.5(x + a/x)
>     bdnz 1b
>
> So maybe there wasn't a lot of need for the hardware sqrt.

I don't think this works at all. sqrt() is required to be
correctly-rounded; that's the whole reason sqrt.c is costly.

> > > Well, yes, I was just throwing shit at a wall to see what sticks. We
> > > could also move the function pointer dispatch into a pthread_once block
> > > or something. I don't know if any caches need to be cleared then or not.
> >
> > pthread_once/call_once would be the nice clean abstraction to use, but
> > it's mildly to considerably more expensive, currently involving a full
> > barrier. There's a nice technical report on how that can be eliminated
> > but it requires TLS, which is also expensive on some archs. In cases
> > like this where there's no state other than the function pointer,
> > relaxed atomics can simply be used on the reading end and then they're
> > always fast.
>
> Hmmm... not on PPC, though. TLS on Linux PPC just uses r2 as TLS
> pointer. So the entire thing could be used almost as-is by making sqrtfn
> thread-local?

Yes and no. Not in musl because we don't use _Thread_local; it would
require allocating space in the thread structure which is not
appropriate for something like this. The right and most efficient
solution is the one I described above.

Rich
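One possible reading of the relaxed-atomics suggestion above, as a minimal sketch rather than musl code. Everything named here is an assumption for illustration: cpu_has_fsqrt() is a hypothetical probe, my_sqrt() stands in for the public sqrt(), soft_sqrt() is a plain Newton loop that is not correctly rounded (it only replaces sqrt.c so the example is self-contained), and hard_sqrt() stands in for the inline-asm fsqrt wrapper from the original post.

```c
#include <stdatomic.h>

typedef double (*sqrt_fn)(double);

static double init_sqrt(double x);

/* Starts out pointing at the resolver and is overwritten on the first call. */
static _Atomic(sqrt_fn) sqrtfn = init_sqrt;

/* Hypothetical probe: real code would consult AT_HWCAP or a list of PVR values. */
static int cpu_has_fsqrt(void)
{
    return 0;
}

/* Stand-in for the portable sqrt.c: a Newton loop, NOT correctly rounded. */
static double soft_sqrt(double x)
{
    double a = x > 1 ? x : 1, next;
    if (!(x > 0))
        return x;   /* stub behaviour for zero, negative and NaN inputs */
    while ((next = 0.5 * (a + x / a)) < a)
        a = next;
    return a;
}

/* Stand-in for the inline-asm fsqrt wrapper. */
static double hard_sqrt(double x)
{
    return soft_sqrt(x);
}

static double init_sqrt(double x)
{
    sqrt_fn f = cpu_has_fsqrt() ? hard_sqrt : soft_sqrt;
    /* Racing threads all store the same value and no other data is
     * published through the pointer, so a relaxed store is enough. */
    atomic_store_explicit(&sqrtfn, f, memory_order_relaxed);
    return f(x);
}

double my_sqrt(double x)
{
    /* Relaxed load: the reader sees either the resolver or the final choice. */
    sqrt_fn f = atomic_load_explicit(&sqrtfn, memory_order_relaxed);
    return f(x);
}
```

The only shared state is the pointer itself, which is why no barrier or pthread_once is needed in this shape: a thread that still sees init_sqrt simply repeats the detection and stores the same result.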
* Re: Model specific optimizations?
  2016-10-01 5:50 ` Rich Felker
@ 2016-10-01 8:52 ` Markus Wichmann
  2016-10-01 15:10 ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-10-01 8:52 UTC
To: musl

On Sat, Oct 01, 2016 at 01:50:23AM -0400, Rich Felker wrote:
> I don't think this works at all. sqrt() is required to be
> correctly-rounded; that's the whole reason sqrt.c is costly.

It's an approximation, at least, which was rather my point.

As I've come to realize over the course of this discussion, the fsqrt
instruction is useless here and pretty much everywhere out there:

- If you are looking for accuracy over speed, the standard C library has
  got you covered.
- If you are looking for speed over accuracy, you can code up the
  Babylonian method inside five minutes. You can even tune it to suit
  your needs to an extent (mainly, number of rounds and method of first
  approximation). This method is also portable to other architectures, and
  can be done entirely in C (requiring IEEE floating point, but then, most
  serious FP code does that).

Also, at least according to Apple, which were the only ones actually
looking at the thing, such as I could find, it was only ever supported
by the 970 and the 970FX cores, released in 2002 and 2004, respectively.
I highly doubt they'll have much relevance. Chalk up my suspicions from
the OP to not having researched enough.

In closing: Nice discussion, but I'm sorry for the noise.

Ciao,
Markus
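For reference, the Babylonian method mentioned above can be written in a few lines of C. This is a sketch under the same assumptions as the assembly earlier in the thread (positive, finite argument; the argument itself as first estimate); the name babylonian_sqrt and the rounds parameter are made up for illustration. It is an approximation and is not correctly rounded, so it is not a substitute for sqrt().

```c
/* Babylonian (Newton) iteration x <- 0.5 * (x + a / x).
 * Assumes a is positive and finite, like the assembly in the thread. */
double babylonian_sqrt(double a, int rounds)
{
    double x = a;   /* crude first estimate: the argument itself */
    for (int i = 0; i < rounds; i++)
        x = 0.5 * (x + a / x);
    return x;
}
```

With the argument as first estimate, each early round roughly halves the estimate, so large arguments need many more than six rounds before the iteration gets close. That is where "method of first approximation" (for example, halving the exponent) matters.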
* Re: Model specific optimizations?
  2016-10-01 8:52 ` Markus Wichmann
@ 2016-10-01 15:10 ` Rich Felker
  2016-10-01 19:53 ` Markus Wichmann
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2016-10-01 15:10 UTC
To: musl

On Sat, Oct 01, 2016 at 10:52:14AM +0200, Markus Wichmann wrote:
> On Sat, Oct 01, 2016 at 01:50:23AM -0400, Rich Felker wrote:
> > I don't think this works at all. sqrt() is required to be
> > correctly-rounded; that's the whole reason sqrt.c is costly.
>
> It's an approximation, at least, which was rather my point.
>
> As I've come to realize over the course of this discussion, the fsqrt
> instruction is useless here and pretty much everywhere out there:

I don't think that conclusion is correct. It certainly makes sense for
libc to use it in targets that have it, assuming it safely produces
correct results, and for compilers to generate it in place of a call
to sqrt.

> - If you are looking for accuracy over speed, the standard C library has
> got you covered.

Yes.

> - If you are looking for speed over accuracy, you can code up the
> Babylonian method inside five minutes. You can even tune it to suit
> your needs to an extent (mainly, number of rounds and method of first
> approximation). This method is also portable to other architectures, and
> can be done entirely in C (requiring IEEE floating point, but then, most
> serious FP code does that).

This is not going to give you speed. If you want fast sqrt
approximations, there are lots out there that are actually fast. And
if the final result you need is 1/sqrt there are even faster ones.

> Also, at least according to Apple, which were the only ones actually
> looking at the thing, such as I could find, it was only ever supported
> by the 970 and the 970FX cores, released in 2002 and 2004, respectively.
> I highly doubt they'll have much relevance. Chalk up my suspicions from
> the OP to not having researched enough.

Do you mean these are the only non-POWER line models that have fsqrt?

> In closing: Nice discussion, but I'm sorry for the noise.

I don't think it's noise. It's been informative. And it does suggest
that we should add static, compile-time support for using fsqrt on
POWER and perhaps on these specific models that have it. That's useful
information for making it a better-supported target under musl.

Rich
* Re: Model specific optimizations?
  2016-10-01 15:10 ` Rich Felker
@ 2016-10-01 19:53 ` Markus Wichmann
  2016-10-02 13:59 ` Adhemerval Zanella
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-10-01 19:53 UTC
To: musl

On Sat, Oct 01, 2016 at 11:10:12AM -0400, Rich Felker wrote:
> On Sat, Oct 01, 2016 at 10:52:14AM +0200, Markus Wichmann wrote:
> > On Sat, Oct 01, 2016 at 01:50:23AM -0400, Rich Felker wrote:
> > > I don't think this works at all. sqrt() is required to be
> > > correctly-rounded; that's the whole reason sqrt.c is costly.
> >
> > It's an approximation, at least, which was rather my point.
> >
> > As I've come to realize over the course of this discussion, the fsqrt
> > instruction is useless here and pretty much everywhere out there:
>
> I don't think that conclusion is correct. It certainly makes sense for
> libc to use it in targets that have it, assuming it safely produces
> correct results, and for compilers to generate it in place of a call
> to sqrt.
>

But again, that requires the appropriate flags.

> > Also, at least according to Apple, which were the only ones actually
> > looking at the thing, such as I could find, it was only ever supported
> > by the 970 and the 970FX cores, released in 2002 and 2004, respectively.
> > I highly doubt they'll have much relevance. Chalk up my suspicions from
> > the OP to not having researched enough.
>
> Do you mean these are the only non-POWER line models that have fsqrt?
>

The more I research this, the more confused I get!

So, I was looking for real-world users of fsqrt, to look at how they
determine availability. The first such user I found was Apple's libm.
Tracing back to where they set their feature flags, I found this file

http://opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/start.s

If you search for _cpu_capabilities, around line 180 you'll find a
comment saying the feature flags in this file are only defaults and may
be changed by initialization code. But I couldn't find anything setting
more flags; if anything, flags got removed. And the only models that
have the flag kHasFsqrt are the 970 and the 970FX.

But then I noticed that their processor list is kind of small, so I
continued the search. I found this e-mail claiming the 604 supports the
instruction:

http://aps.anl.gov/epics/tech-talk/2011/msg01247.php

But if you look at datasheets of the 604, they say nothing either way.
But alright, the 604 is an old model (introduced in 1994), maybe fsqrt
wasn't defined then.

I personally work with the e300 (at my day job), and at least their
datasheet makes it clear that fsqrt is not supported. Actually,
apparently Freescale aren't big fans of this instruction at all,
according to this comment:

https://github.com/ibmruntimes/v8ppc/issues/119#issuecomment-72705975

Wikipedia claims, however, that it wasn't until the 620 that the square
root instruction was put into hardware. I tried to find a 620 datasheet,
but no luck so far.

Next family on the list would be the 4xx. 403 can be discounted
immediately as it lacks an FPU. Since the 401 is stripped down even
further, it also has no FPU. From the 405 onward it gets dicey as they
went the way of the x87: You could connect an external FPU if desired.

I found one for 405 here:

http://www.xilinx.com/support/documentation/ip_documentation/apu_fpu.pdf

That one doesn't support fsqrt, at least not enough for our purposes,
but it does support fsqrts (that's the single precision variant). That's
a whole new level of weird.

As for the rest: I hope Apple got it right, because after the 970,
nothing more is listed in Wikipedia.

So, as you can see, the whole thing is a mess.

> Rich

Ciao,
Markus
* Re: Model specific optimizations?
  2016-10-01 19:53 ` Markus Wichmann
@ 2016-10-02 13:59 ` Adhemerval Zanella
  0 siblings, 0 replies; 14+ messages in thread
From: Adhemerval Zanella @ 2016-10-02 13:59 UTC
To: musl

On 01/10/2016 16:53, Markus Wichmann wrote:
> On Sat, Oct 01, 2016 at 11:10:12AM -0400, Rich Felker wrote:
>> On Sat, Oct 01, 2016 at 10:52:14AM +0200, Markus Wichmann wrote:
>>> On Sat, Oct 01, 2016 at 01:50:23AM -0400, Rich Felker wrote:
>>>> I don't think this works at all. sqrt() is required to be
>>>> correctly-rounded; that's the whole reason sqrt.c is costly.
>>>
>>> It's an approximation, at least, which was rather my point.
>>>
>>> As I've come to realize over the course of this discussion, the fsqrt
>>> instruction is useless here and pretty much everywhere out there:
>>
>> I don't think that conclusion is correct. It certainly makes sense for
>> libc to use it in targets that have it, assuming it safely produces
>> correct results, and for compilers to generate it in place of a call
>> to sqrt.
>>
>
> But again, that requires the appropriate flags.
>
>>> Also, at least according to Apple, which were the only ones actually
>>> looking at the thing, such as I could find, it was only ever supported
>>> by the 970 and the 970FX cores, released in 2002 and 2004, respectively.
>>> I highly doubt they'll have much relevance. Chalk up my suspicions from
>>> the OP to not having researched enough.
>>
>> Do you mean these are the only non-POWER line models that have fsqrt?
>>
>
> The more I research this, the more confused I get!
>
> So, I was looking for real-world users of fsqrt, to look at how they
> determine availability. The first such user I found was Apple's libm.
> Tracing back to where they set their feature flags, I found this file
>
> http://opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/start.s
>
> If you search for _cpu_capabilities, around line 180 you'll find a
> comment saying the feature flags in this file are only defaults and may
> be changed by initialization code. But I couldn't find anything setting
> more flags; if anything, flags got removed. And the only models that
> have the flag kHasFsqrt are the 970 and the 970FX.
>
> But then I noticed that their processor list is kind of small, so I
> continued the search. I found this e-mail claiming the 604 supports the
> instruction:
>
> http://aps.anl.gov/epics/tech-talk/2011/msg01247.php
>
> But if you look at datasheets of the 604, they say nothing either way.
> But alright, the 604 is an old model (introduced in 1994), maybe fsqrt
> wasn't defined then.
>
> I personally work with the e300 (at my day job), and at least their
> datasheet makes it clear that fsqrt is not supported. Actually,
> apparently Freescale aren't big fans of this instruction at all,
> according to this comment:
>
> https://github.com/ibmruntimes/v8ppc/issues/119#issuecomment-72705975
>
> Wikipedia claims, however, that it wasn't until the 620 that the square
> root instruction was put into hardware. I tried to find a 620 datasheet,
> but no luck so far.
>
> Next family on the list would be the 4xx. 403 can be discounted
> immediately as it lacks an FPU. Since the 401 is stripped down even
> further, it also has no FPU. From the 405 onward it gets dicey as they
> went the way of the x87: You could connect an external FPU if desired.
>
> I found one for 405 here:
>
> http://www.xilinx.com/support/documentation/ip_documentation/apu_fpu.pdf
>
> That one doesn't support fsqrt, at least not enough for our purposes,
> but it does support fsqrts (that's the single precision variant). That's
> a whole new level of weird.
>
> As for the rest: I hope Apple got it right, because after the 970,
> nothing more is listed in Wikipedia.
>
> So, as you can see, the whole thing is a mess.

Since the kernel does not track it, I have found GCC's internal
implementation to be the most reliable way to find out whether a chip
implementation actually supports a given ppc instruction. For fsqrt,
gcc will define _ARCH_PPCSQ, and internally this flag is controlled by
OPTION_MASK_PPC_GPOPT. The GCC cpu definition file has some information
about it [1]:

    /* For ISA 2.05, do not add MFPGPR, since it isn't in ISA 2.06, and don't add
       ALTIVEC, since in general it isn't a win on power6. In ISA 2.04, fsel,
       fre, fsqrt, etc. were no longer documented as optional. Group masks by
       server and embedded. */
    #define ISA_2_5_MASKS_EMBEDDED (ISA_2_4_MASKS \
                                    | OPTION_MASK_CMPB \
                                    | OPTION_MASK_RECIP_PRECISION \
                                    | OPTION_MASK_PPC_GFXOPT \
                                    | OPTION_MASK_PPC_GPOPT)

    #define ISA_2_5_MASKS_SERVER (ISA_2_5_MASKS_EMBEDDED | OPTION_MASK_DFP)

Later in the file you can check that, besides all POWER chips from
power4 onward, only the 970, cell, and G5 support fsqrt. The problem is
that there are newer powerpc cores, such as the e6500, that are supposed
to follow the ISA 2.06 embedded profile but do not implement fsqrt,
even in ppc64 mode.

[1] gcc/config/rs6000/rs6000-cpus.def
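The compile-time route discussed in this thread could be keyed on the _ARCH_PPCSQ macro mentioned above. The following is a rough sketch under assumptions, not a musl patch: HAVE_HW_SQRT, hw_sqrt() and try_hw_sqrt() are hypothetical names, and the inline-asm constraints are copied from the original post. Inside libc the fallback branch would more likely just include the portable sqrt.c, as the original post did.

```c
#if defined(_ARCH_PPCSQ)
#define HAVE_HW_SQRT 1
/* Wraps the fsqrt instruction; constraints follow the original post. */
static inline double hw_sqrt(double x)
{
    double r;
    __asm__ ("fsqrt %0, %1" : "=d"(r) : "d"(x));
    return r;
}
#else
#define HAVE_HW_SQRT 0
#endif

/* Hypothetical helper: stores the root in *out and returns 1 when fsqrt was
 * compiled in; returns 0 so the caller can take the portable path instead. */
static inline int try_hw_sqrt(double x, double *out)
{
#if HAVE_HW_SQRT
    *out = hw_sqrt(x);
    return 1;
#else
    (void)x;
    (void)out;
    return 0;
#endif
}
```

Because the decision is made entirely by the preprocessor, there is no indirect call and no detection code at run time. The cost is that a binary built for a target with fsqrt will fault on a core without it, such as the e6500 case described above.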