Model specific optimizations?

mailing list of musl libc

* Model specific optimizations?
@ 2016-09-29 14:21 Markus Wichmann
  2016-09-29 14:57 ` Szabolcs Nagy
  2016-09-29 15:23 ` Rich Felker
  0 siblings, 2 replies; 14+ messages in thread
From: Markus Wichmann @ 2016-09-29 14:21 UTC (permalink / raw)
  To: musl

Hi there,

I wanted to ask if there is any wish for the near future to support
model-specific optimizations. What I mean by that is multiple
implementations of the same function, where the best implementation is
decided at run-time.

One simple example would be PowerPC's fsqrt instruction. The PowerPC
Book 1 defines it as optional and provides no way to know specifically,
if the currently running processor supports this instruction besides
executing it and seeing if you get a SIGILL.

A cursory DuckDuckGo search revealed that Apple uses the instruction as
sqrt implementation if it detects the CPU capability for that, however
it only detects that capability by checking the PVR for known-good bit
patterns (Currently, the only known PowerPC cores to support this
instruction are the 970 and 970FX, which have a version field of 0x39
and 0x3c, respectively). x86 and -derived architectures at least have
the cpuid instruction to check for some features, and admittedly,
there's a lot of defined bits. However, glibc's ifunc-initialization
function (which selects the implementation) also does a lot of work
finding out the precise make and model of the CPU to set some more
flags.

The reason I ask is that lots of ISAs define optional parts that aren't
mandatory, but grow in popularity more and more until they're seen in
all current practical implementations. Like how x87 started out as a
separate device but is a fixed part of x86 since the later days of the
486. Same with MMX, SSE, SSE2. None of these are mandatory by the ABI,
but available in all practical implementations. And musl is never going
to be able to utilize that in its current form. Oh, alright, the
compiler might support it, but that's different. I also suspect the
fsqrt instruction will be available in more future PowerPC
implementations.

If we were to go this route, the question is how to go about it. First
the detection method: Stuff like cpuid or AT_HWCAP are pretty nice,
because they allow for the detection of a feature, whereas version
checking only allows one to find known-good implementations. The latter
means there's a list of known-good values, and that list has to be kept
up-to-date. However, the latter is also pretty much always possible,
while the former isn't always available. The kernel doesn't check for
fsqrt availability, for example.

Then organization: Are we going the glibc route, which gathers all
indirect functions in a single section and initializes all of the
pointers at startup (__libc_init_main()), or do we put these checks
separately in each function?

To make a practical example, we could implement sqrt() for PowerPC like
this:

static double soft_sqrt(double);
static double hard_sqrt(double);
static double init_sqrt(double);
static double (*sqrtfn)(double) = init_sqrt;

double sqrt(double x) {
    return sqrtfn(x);
}

static double init_sqrt(double x) {
    unsigned long pvr;
    unsigned long ver;
    /* Read the Processor Version Register into pvr. */
    asm ("mfpvr %0" : "=r"(pvr));
    ver = (pvr >> 16) & 0xffff;
    /* XXX: Add more values for cores with the fsqrt instruction here */
    if (0
        || ver == 0x39  /* PowerPC 970 */
        || ver == 0x3c  /* PowerPC 970FX */
    )
        sqrtfn = hard_sqrt;
    else
        sqrtfn = soft_sqrt;

    return sqrtfn(x);
}

static double hard_sqrt(double x) {
    double r;
    asm ("fsqrt %0, %1": "=d"(r) : "d"(x));
    return r;
}

#define sqrt soft_sqrt
#include "../sqrt.c"

Problem with this is: The same thing would have to be repeated for
sqrtf(), the same list of known values would have to be maintained
twice, although we could make it a real list (an array, I mean), and get
rid of that issue. But it does add quite a bit of code, and the overhead
of an indirect function call, and at the moment is only going to be useful
to a few people.
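
A minimal sketch of the array variant mentioned above, so that sqrt() and
sqrtf() could share one list. The names fsqrt_pvr_versions and pvr_has_fsqrt
are hypothetical; the two version values are the ones given in this message,
and the list is not claimed to be complete.

```c
#include <stddef.h>

/* PVR version fields (upper 16 bits of the PVR) of cores known to
 * implement fsqrt. Only the two cores named in this message are listed;
 * this table is an illustration, not an exhaustive list. */
static const unsigned short fsqrt_pvr_versions[] = {
    0x39, /* PowerPC 970 */
    0x3c, /* PowerPC 970FX */
};

/* Hypothetical helper: returns 1 if the version field of pvr is in the
 * known-good table, 0 otherwise. */
static int pvr_has_fsqrt(unsigned long pvr)
{
    unsigned long ver = (pvr >> 16) & 0xffff;
    size_t n = sizeof fsqrt_pvr_versions / sizeof fsqrt_pvr_versions[0];

    for (size_t i = 0; i < n; i++)
        if (fsqrt_pvr_versions[i] == ver)
            return 1;
    return 0;
}
```

Both init functions could then call pvr_has_fsqrt() with the value read from
the PVR instead of repeating the comparison chain.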

Also, the inclusion here is a hack. But I couldn't think of a better
way.

Thoughts?

Ciao,
Markus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Model specific optimizations?
  2016-09-29 14:21 Model specific optimizations? Markus Wichmann
@ 2016-09-29 14:57 ` Szabolcs Nagy
  2016-09-29 15:23 ` Rich Felker
  1 sibling, 0 replies; 14+ messages in thread
From: Szabolcs Nagy @ 2016-09-29 14:57 UTC (permalink / raw)
  To: musl

* Markus Wichmann <nullplan@gmx.net> [2016-09-29 16:21:26 +0200]:
> I wanted to ask if there is any wish for the near future to support
> model-specific optimizations. What I mean by that is multiple
> implementations of the same function, where the best implementation is
> decided at run-time.

musl already does some runtime selection based on hw/kernel
features (arm atomics, vdso).

it could use similar approaches for micro-architecture specific
optimizations.

this has a maintenance cost (hard to test, hard to benchmark),
code size cost (all variants have to be present at runtime)
and dispatch cost (it has to happen at startup or lazily). these
costs are rarely justified.

(there are secondary effects: glibc dispatches memcpy at runtime
so the compilers have a hard time deciding when to inline it,
as a consequence sometimes -O0 gives better performance than -O3
on x86 with glibc.)

> One simple example would be PowerPC's fsqrt instruction. The PowerPC
> Book 1 defines it as optional and provides no way to know specifically,
> if the currently running processor supports this instruction besides
> executing it and seeing if you get a SIGILL.

if there is no linux hwcap bit for this then we can't do much about it.

runtime dispatch only works if there is a reasonable way to detect
hw features (hwcap, cpuid instruction, vdso something). e.g. parsing
/proc/cpuinfo to figure out the cpu and guessing features from that,
or registering sigill signal handlers, is not ok.

> Then organization: Are we going the glibc route, which gathers all
> indirect functions in a single section and initializes all of the
> pointers at startup (__libc_init_main()), or do we put these checks
> separately in each function?

glibc uses ifunc for this, musl does not support ifunc at this point.


* Re: Model specific optimizations?
  2016-09-29 14:21 Model specific optimizations? Markus Wichmann
  2016-09-29 14:57 ` Szabolcs Nagy
@ 2016-09-29 15:23 ` Rich Felker
  2016-09-29 17:08   ` Markus Wichmann
  1 sibling, 1 reply; 14+ messages in thread
From: Rich Felker @ 2016-09-29 15:23 UTC (permalink / raw)
  To: musl

On Thu, Sep 29, 2016 at 04:21:26PM +0200, Markus Wichmann wrote:
> Hi there,
> 
> I wanted to ask if there is any wish for the near future to support
> model-specific optimizations. What I mean by that is multiple
> implementations of the same function, where the best implementation is
> decided at run-time.

musl's general approach to this is to use non-mandatory ISA extensions
only when compiled with the right -march to assume their availability.

What I mean by "non-mandatory" is that some extensions must be used
when supported in order to have proper behavior. For example if a
baseline ISA does not have atomics (ARM, SH) but emulates them with
help from the kernel that only works on UP machines (SH), code running
on models that support SMP, even if it was just compiled for the
baseline ISA, _must_ detect availability of the atomics and use them
or it would not properly synchronize with other cpus.

Another example is setjmp/longjmp handling of floating point registers
on softfloat ARM ABIs. Per the ABI, if the registers exist, some of
them are defined as call-saved, so setjmp/longjmp have to save and
restore them even when the ABI libc was built for doesn't use them
itself.

It's possible we could also apply runtime detection on a case-by-case
basis for things that are not mandatory but performance-critical;
however, I think we would want to see convincing evidence that the
performance gains would be worth the complexity/maintenance costs.

For most cases, a better solution is probably just to build libc.so
optimized for your machine with the appropriate -march.

> One simple example would be PowerPC's fsqrt instruction. The PowerPC
> Book 1 defines it as optional and provides no way to know specifically,
> if the currently running processor supports this instruction besides
> executing it and seeing if you get a SIGILL.
> 
> A cursory DuckDuckGo search revealed that Apple uses the instruction as
> sqrt implementation if it detects the CPU capability for that, however
> it only detects that capability by checking the PVR for known-good bit
> patterns (Currently, the only known PowerPC cores to support this
> instruction are the 970 and 970FX, which have a version field of 0x39
> and 0x3c, respectively). x86 and -derived architectures at least have
> the cpuid instruction to check for some features, and admittedly,
> there's a lot of defined bits. However, glibc's ifunc-initialization
> function (which selects the implementation) also does a lot of work
> finding out the precise make and model of the CPU to set some more
> flags.
> 
> The reason I ask is that lots of ISAs define optional parts that aren't
> mandatory, but grow in popularity more and more until they're seen in
> all current practical implementations. Like how x87 started out as a
> separate device but is a fixed part of x86 since the later days of the
> 486. Same with MMX, SSE, SSE2. None of these are mandatory by the ABI,
> but available in all practical implementations. And musl is never going
> to be able to utilize that in its current form. Oh, alright, the
> compiler might support it, but that's different.

For vector instructions, generation by the compiler is almost always
the only way we want to be using them. Hand-written vector code is
a huge liability in terms of maintenance, readability, bug-surface, etc.
The aim in musl is to have no asm beyond what's absolutely necessary
to run and to make performance-critical functions (mainly memcpy and
the like) acceptably fast.

> I also suspect the
> fsqrt instruction will be available in more future PowerPC
> implementations.
> 
> If we were to go this route, the question is how to go about it. First
> the detection method: Stuff like cpuid or AT_HWCAP are pretty nice,
> because they allow for the detection of a feature, whereas version
> checking only allows one to find known-good implementations. The latter
> means there's a list of known-good values, and that list has to be kept
> up-to-date. However, the latter is also pretty much always possible,
> while the former isn't always available. The kernel doesn't check for
> fsqrt availability, for example.

What kind of version-checking? Not all systems even give you a way to
version-check.

> Then organization: Are we going the glibc route, which gathers all
> indirect functions in a single section and initializes all of the
> pointers at startup (__libc_init_main()), or do we put these checks
> separately in each function?

Branch based on hwcap or similar in the functions themselves.

> To make a practical example, we could implement sqrt() for PowerPC like
> this:
> 
> static double soft_sqrt(double);
> static double hard_sqrt(double);
> static double init_sqrt(double);
> static double (*sqrtfn)(double) = init_sqrt;
> 
> double sqrt(double x) {
>     return sqrtfn(x);
> }
> 
> static double init_sqrt(double x) {
>     unsigned long pvr;
>     unsigned long ver;
>     /* Read the Processor Version Register into pvr. */
>     asm ("mfpvr %0" : "=r"(pvr));
>     ver = (pvr >> 16) & 0xffff;
>     /* XXX: Add more values for cores with the fsqrt instruction here */
>     if (0
>         || ver == 0x39  /* PowerPC 970 */
>         || ver == 0x3c  /* PowerPC 970FX */
>     )
>         sqrtfn = hard_sqrt;
>     else
>         sqrtfn = soft_sqrt;
> 
>     return sqrtfn(x);
> }

This code contains data races. In order to be safe under musl's memory
model, sqrtfn would have to be volatile and should probably be written
via a_cas_p. It also then has to have type void* and be cast to/from
function pointer type. See clock_gettime.c.

> static double hard_sqrt(double x) {
>     double r;
>     asm ("fsqrt %0, %1": "=d"(r) : "d"(x));
>     return r;
> }

For some archs, gas produces an error or tags the .o file as needing a
certain ISA level if you use an instruction that's not present in the
baseline ISA. I'm not sure if this is an issue here or not.

> #define sqrt soft_sqrt
> #include "../sqrt.c"
> 
> 
> Problem with this is: The same thing would have to be repeated for
> sqrtf(), the same list of known values would have to be maintained
> twice, although we could make it a real list (an array, I mean), and get
> rid of that issue. But it does add quite a bit of code, and the overhead
> of an indirect function call, and at the moment is only going to be useful
> to a few people.
> 
> Also, the inclusion here is a hack. But I couldn't think of a better
> way.

I think it's the #define sqrt soft_sqrt that's a hack. The inclusion
itself is okay and would be the right way to do this for sure if it
were just a compile-time check and not a runtime one.

Rich


* Re: Model specific optimizations?
  2016-09-29 15:23 ` Rich Felker
@ 2016-09-29 17:08   ` Markus Wichmann
  2016-09-29 18:13     ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-09-29 17:08 UTC (permalink / raw)
  To: musl

On Thu, Sep 29, 2016 at 11:23:54AM -0400, Rich Felker wrote:
> What kind of version-checking? Not all systems even give you a way to
> version-check.
> 

To the extent that they don't, they also don't give you a way to check
for features (again, except for executing the instructions and seeing if
you get SIGILL). PowerPC (sorry, but that's where I've spent a lot of time
recently) for instance only has the PVR (Processor Version Register).
No software I could find online uses another way to detect the features
of the CPU.

And for systems to not give you a way of detecting system version at
runtime and then define optional parts of the ISA would be very dickish,
in my opinion. That basically guarantees optional functions won't be
used at all.

> This code contains data races. In order to be safe under musl's memory
> model, sqrtfn would have to be volatile and should probably be written
> via a_cas_p. It also then has to have type void* and be cast to/from
> function pointer type. See clock_gettime.c.
> 

Well, yes, I was just throwing shit at a wall to see what sticks. We
could also move the function pointer dispatch into a pthread_once block
or something. I don't know if any caches need to be cleared then or not.

But yes, there are better examples.

> For some archs, gas produces an error or tags the .o file as needing a
> certain ISA level if you use an instruction that's not present in the
> baseline ISA. I'm not sure if this is an issue here or not.
> 

As I said, fsqrt is defined in the baseline ISA, just marked as
optional. So any PowerPC implementation is free to include it or not.
There are a lot of optional features, and if the gas people made a
different subarch for each combination of them, they'd be here all day.

It's not just instructions, either. Sometimes the optional thing is a
register, and sometimes just bits in a register.

> I think it's the #define sqrt soft_sqrt that's a hack. The inclusion
> itself is okay and would be the right way to do this for sure if it
> were just a compile-time check and not a runtime one.
> 

I meant the define. While it is hacky, it does mean no code duplication
and only one externally facing symbol regarding sqrt(), which is the one
defined by the standard. Although I am abusing the little known rule
about C that if a function is declared as static in its prototype, and
the function definition doesn't have an explicit storage class
specifier, then the function will be static. Most style guides (rightly)
say to have the storage class specifier in the prototype and the
definition be the same, because otherwise this gets confusing fast.
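
The linkage rule described above can be shown in isolation. This is a small
sketch with hypothetical names (twice, call_twice), not code from musl: the
definition carries no storage-class specifier, yet it inherits internal
linkage from the earlier static declaration.

```c
/* Earlier declaration with static: gives the identifier internal linkage. */
static int twice(int);

/* Definition without a storage-class specifier. Because the static
 * declaration above is visible, this function still has internal linkage
 * and is not exported from the translation unit. */
int twice(int x)
{
    return 2 * x;
}

/* The only externally visible symbol defined here. */
int call_twice(int x)
{
    return twice(x);
}
```

Applied to the sqrt case, the static prototype of soft_sqrt plays the role of
the first declaration, and the definition pulled in by the #include plays the
role of the second.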

I guess it goes to show that you should know your language even in the
parts you barely ever use (because forbidden), because they might come
in handy at some point.

> Rich

Ciao,
Markus


* Re: Model specific optimizations?
  2016-09-29 17:08   ` Markus Wichmann
@ 2016-09-29 18:13     ` Rich Felker
  2016-09-29 18:52       ` Adhemerval Zanella
  2016-09-30  4:56       ` Markus Wichmann
  0 siblings, 2 replies; 14+ messages in thread
From: Rich Felker @ 2016-09-29 18:13 UTC (permalink / raw)
  To: musl

On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote:
> On Thu, Sep 29, 2016 at 11:23:54AM -0400, Rich Felker wrote:
> > What kind of version-checking? Not all systems even give you a way to
> > version-check.
> 
> To the extent that they don't, they also don't give you a way to check
> for features (again, except for executing the instructions and seeing if
> you get SIGILL). PowerPC (sorry, but that's where I spent a lot of time
> on recently) for instance only has the PVR (Processor Version Register).
> No software I could find online uses another way to detect the features
> of the CPU.
> 
> And for systems to not give you a way of detecting system version at
> runtime and then define optional parts of the ISA would be very dickish,
> in my opinion. That basically guarantees optional functions won't be
> used at all.

On Linux it's supposed to be the kernel which detects availability of
features (either by feature-specific cpu flags or translating a model
to flags) but I don't see anything for fsqrt on ppc. :-( How/why did
they botch this?

> > This code contains data races. In order to be safe under musl's memory
> > model, sqrtfn would have to be volatile and should probably be written
> > via a_cas_p. It also then has to have type void* and be cast to/from
> > function pointer type. See clock_gettime.c.
> 
> Well, yes, I was just throwing shit at a wall to see what sticks. We
> could also move the function pointer dispatch into a pthread_once block
> or something. I don't know if any caches need to be cleared then or not.

pthread_once/call_once would be the nice clean abstraction to use, but
it's mildly to considerably more expensive, currently involving a full
barrier. There's a nice technical report on how that can be eliminated
but it requires TLS, which is also expensive on some archs. In cases
like this where there's no state other than the function pointer,
relaxed atomics can simply be used on the reading end and then they're
always fast.
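
A rough sketch of that idea using C11 <stdatomic.h> rather than musl's
internal a_cas_p, which is not reproduced here. The names dispatch_sqrt,
cpu_has_fsqrt and newton_sqrt are hypothetical stand-ins so that the example
is self-contained; a real version would wrap the fsqrt instruction and the
portable C implementation.

```c
#include <stdatomic.h>

typedef double (*sqrt_fn)(double);

/* Stand-in for a real square root: plain Newton iteration, adequate only
 * for positive inputs of moderate magnitude. */
static double newton_sqrt(double a)
{
    if (!(a > 0))
        return a;   /* zero, negative and NaN inputs are not handled here */
    double x = a < 1 ? 1 : a;
    for (int i = 0; i < 64; i++)
        x = 0.5 * (x + a / x);
    return x;
}

static double hard_sqrt(double x) { return newton_sqrt(x); } /* would be fsqrt */
static double soft_sqrt(double x) { return newton_sqrt(x); } /* would be sqrt.c */
static int cpu_has_fsqrt(void) { return 0; }                 /* assumed detector */

static double init_sqrt(double x);
static _Atomic(sqrt_fn) sqrtfn = init_sqrt;

static double init_sqrt(double x)
{
    sqrt_fn chosen = cpu_has_fsqrt() ? hard_sqrt : soft_sqrt;
    sqrt_fn expected = init_sqrt;

    /* Every racing thread computes the same choice, so losing the
     * compare-and-swap is harmless. */
    atomic_compare_exchange_strong_explicit(&sqrtfn, &expected, chosen,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
    return chosen(x);
}

double dispatch_sqrt(double x)
{
    /* No other state is published through the pointer, so a relaxed
     * load is sufficient on the reading side. */
    sqrt_fn f = atomic_load_explicit(&sqrtfn, memory_order_relaxed);
    return f(x);
}
```

The first call goes through init_sqrt and installs the chosen
implementation; later calls load the pointer and jump straight to it.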

> > For some archs, gas produces an error or tags the .o file as needing a
> > certain ISA level if you use an instruction that's not present in the
> > baseline ISA. I'm not sure if this is an issue here or not.
> 
> As I said, fsqrt is defined in the baseline ISA, just marked as
> optional.

We're just using words differently. To me, baseline ISA means the part
of the ISA that all models (or at least all usable models; e.g. for
x86, pre-486 is not usable without trap-and-emulate of cmpxchg so we
consider 486 the baseline ISA) support.

> So any PowerPC implementation is free to include it or not.
> There are a lot of optional features, and if the gas people made a
> different subarch for each combination of them, they'd be here all day.

They've actually done that for some archs...

> > I think it's the #define sqrt soft_sqrt that's a hack. The inclusion
> > itself is okay and would be the right way to do this for sure if it
> > were just a compile-time check and not a runtime one.
> 
> I meant the define. While it is hacky, it does mean no code duplication
> and only one externally facing symbol regarding sqrt(), which is the one
> defined by the standard. Although I am abusing the little known rule
> about C that if a function is declared as static in its prototype, and
> the function definition doesn't have an explicit storage class
> specifier, then the function will be static. Most style guides (rightly)
> say to have the storage class specifier in the prototype and the
> definition be the same, because otherwise this gets confusing fast.
> 
> I guess it goes to show that you should know your language even in the
> parts you barely ever use (because forbidden), because they might come
> in handy at some point.

Yes, I was a bit surprised at first and had to recall the rule, but I
knew the code was either valid or a constraint violation right away.

Anyway, I would have no objection right away to doing a patch like
this that's decided at compile-time based on predefined macros set by
-march. For runtime choice I think we need to discuss motivation. Are
you trying to do a powerpc-based distro where you need a universal
libc.so that works optimally on various models? Or would just
compiling for the right -march meet your needs?
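
For illustration, a compile-time variant might look like the sketch below.
It assumes GCC's powerpc macro _ARCH_PPCSQ, which is predefined when the
selected target allows fsqrt; my_sqrt and the Newton fallback are
hypothetical stand-ins, since the real fallback would be the portable C
implementation.

```c
#if defined(_ARCH_PPCSQ)

/* The compiler was told that the target has fsqrt: use it directly. */
double my_sqrt(double x)
{
    double r;
    __asm__ ("fsqrt %0, %1" : "=d"(r) : "d"(x));
    return r;
}

#else

/* Stand-in for the portable C implementation: plain Newton iteration,
 * adequate only for positive inputs of moderate magnitude. */
static double soft_sqrt(double a)
{
    if (!(a > 0))
        return a;   /* zero, negative and NaN inputs are not handled here */
    double x = a < 1 ? 1 : a;
    for (int i = 0; i < 64; i++)
        x = 0.5 * (x + a / x);
    return x;
}

double my_sqrt(double x)
{
    return soft_sqrt(x);
}

#endif
```

There is no runtime dispatch and no indirect call here; the choice is fixed
when libc is built.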

Rich



* Re: Model specific optimizations?
  2016-09-29 18:13     ` Rich Felker
@ 2016-09-29 18:52       ` Adhemerval Zanella
  2016-09-29 22:05         ` Szabolcs Nagy
  2016-09-30  4:56       ` Markus Wichmann
  1 sibling, 1 reply; 14+ messages in thread
From: Adhemerval Zanella @ 2016-09-29 18:52 UTC (permalink / raw)
  To: musl



On 29/09/2016 11:13, Rich Felker wrote:
> On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote:
>> On Thu, Sep 29, 2016 at 11:23:54AM -0400, Rich Felker wrote:
>>> What kind of version-checking? Not all systems even give you a way to
>>> version-check.
>>
>> To the extent that they don't, they also don't give you a way to check
>> for features (again, except for executing the instructions and seeing if
>> you get SIGILL). PowerPC (sorry, but that's where I spent a lot of time
>> on recently) for instance only has the PVR (Processor Version Register).
>> No software I could find online uses another way to detect the features
>> of the CPU.
>>
>> And for systems to not give you a way of detecting system version at
>> runtime and then define optional parts of the ISA would be very dickish,
>> in my opinion. That basically guarantees optional functions won't be
>> used at all.
> 
> On Linux it's supposed to be the kernel which detects availability of
> features (either by feature-specific cpu flags or translating a model
> to flags) but I don't see anything for fsqrt on ppc. :-( How/why did
> they botch this?

Maybe because recent power work on the kernel is POWER oriented, where fsqrt
has been defined since POWER4.  However, some more recent Freescale chips
(such as e5500 and e6500) also decided not to add the fsqrt instruction.

With GCC you can check for _ARCH_PPCSQ to see if the current arch flags
allow fsqrt.  At runtime I presume programs can check for the hwcap bit
PPC_FEATURE_POWER4; however, it does not help on non-POWER chips which
do support fsqrt.
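
A sketch of such a runtime check on Linux, using getauxval() from
<sys/auxv.h>. The helper names are hypothetical, and the fallback value for
PPC_FEATURE_POWER4 is copied from the Linux powerpc headers as an assumption
for builds where the libc headers do not provide it. The result is only
meaningful on powerpc, because other architectures assign different meanings
to the AT_HWCAP bits.

```c
#include <sys/auxv.h>

#ifndef PPC_FEATURE_POWER4
#define PPC_FEATURE_POWER4 0x00080000 /* assumed value, see lead-in */
#endif

/* Pure helper: tests the POWER4 bit in a given hwcap word. */
static int hwcap_has_power4(unsigned long hwcap)
{
    return (hwcap & PPC_FEATURE_POWER4) != 0;
}

/* Reads AT_HWCAP from the auxiliary vector supplied by the kernel. */
static int cpu_has_power4(void)
{
    return hwcap_has_power4(getauxval(AT_HWCAP));
}
```

As noted above, a zero result does not prove that fsqrt is missing: cores
outside the POWER line may implement it without setting this bit.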

Another option, a bit hacky, would be to issue fsqrt and trap the SIGILL...

> 
>>> This code contains data races. In order to be safe under musl's memory
>>> model, sqrtfn would have to be volatile and should probably be written
>>> via a_cas_p. It also then has to have type void* and be cast to/from
>>> function pointer type. See clock_gettime.c.
>>
>> Well, yes, I was just throwing shit at a wall to see what sticks. We
>> could also move the function pointer dispatch into a pthread_once block
>> or something. I don't know if any caches need to be cleared then or not.
> 
> pthread_once/call_once would be the nice clean abstraction to use, but
> it's mildly to considerably more expensive, currently involving a full
> barrier. There's a nice technical report on how that can be eliminated
> but it requires TLS, which is also expensive on some archs. In cases
> like this where there's no state other than the function pointer,
> relaxed atomics can simply be used on the reading end and then they're
> always fast.
> 
>>> For some archs, gas produces an error or tags the .o file as needing a
>>> certain ISA level if you use an instruction that's not present in the
>>> baseline ISA. I'm not sure if this is an issue here or not.
>>
>> As I said, fsqrt is defined in the baseline ISA, just marked as
>> optional.
> 
> We're just using words differently. To me, baseline ISA means the part
> of the ISA that all models (or at least all usable models; e.g. for
> x86, pre-486 is not usable without trap-and-emulate of cmpxchg so we
> consider 486 the baseline ISA) support.
> 
>> So any PowerPC implementation is free to include it or not.
>> There are a lot of optional features, and if the gas people made a
>> different subarch for each combination of them, they'd be here all day.
> 
> They've actually done that for some archs...
> 
>>> I think it's the #define sqrt soft_sqrt that's a hack. The inclusion
>>> itself is okay and would be the right way to do this for sure if it
>>> were just a compile-time check and not a runtime one.
>>
>> I meant the define. While it is hacky, it does mean no code duplication
>> and only one externally facing symbol regarding sqrt(), which is the one
>> defined by the standard. Although I am abusing the little known rule
>> about C that if a function is declared as static in its prototype, and
>> the function definition doesn't have an explicit storage class
>> specifier, then the function will be static. Most style guides (rightly)
>> say to have the storage class specifier in the prototype and the
>> definition be the same, because otherwise this gets confusing fast.
>>
>> I guess it goes to show that you should know your language even in the
>> parts you barely ever use (because forbidden), because they might come
>> in handy at some point.
> 
> Yes, I was a bit surprised first and had to recall the rule, but I
> knew the code was either valid or a constraint violation right away.
> 
> Anyway, I would have no objection right away to doing a patch like
> this that's decided at compile-time based on predefined macros set by
> -march. For runtime choice I think we need to discuss motivation. Are
> you trying to do a powerpc-based distro where you need a universal
> libc.so that works optimally on various models? Or would just
> compiling for the right -march meet your needs?
> 
> Rich
> 



* Re: Model specific optimizations?
  2016-09-29 18:52       ` Adhemerval Zanella
@ 2016-09-29 22:05         ` Szabolcs Nagy
  2016-09-29 23:14           ` Adhemerval Zanella
  0 siblings, 1 reply; 14+ messages in thread
From: Szabolcs Nagy @ 2016-09-29 22:05 UTC (permalink / raw)
  To: musl

* Adhemerval Zanella <adhemerval.zanella@linaro.org> [2016-09-29 11:52:44 -0700]:
> On 29/09/2016 11:13, Rich Felker wrote:
> > On Linux it's supposed to be the kernel which detects availability of
> > features (either by feature-specific cpu flags or translating a model
> > to flags) but I don't see anything for fsqrt on ppc. :-( How/why did
> > they botch this?
> 
> Maybe because recent power work on the kernel is POWER oriented, where fsqrt
> has been defined since POWER4.  However, some more recent Freescale chips
> (such as e5500 and e6500) also decided not to add the fsqrt instruction.
> 
> With GCC you can check for _ARCH_PPCSQ to see if the current arch flags
> allow fsqrt.  At runtime I presume programs can check for the hwcap bit
> PPC_FEATURE_POWER4; however, it does not help on non-POWER chips which
> do support fsqrt.
> 

how can distros deal with this?

do they require POWER4?



* Re: Model specific optimizations?
  2016-09-29 22:05         ` Szabolcs Nagy
@ 2016-09-29 23:14           ` Adhemerval Zanella
  0 siblings, 0 replies; 14+ messages in thread
From: Adhemerval Zanella @ 2016-09-29 23:14 UTC (permalink / raw)
  To: musl



On 29/09/2016 15:05, Szabolcs Nagy wrote:
> * Adhemerval Zanella <adhemerval.zanella@linaro.org> [2016-09-29 11:52:44 -0700]:
>> On 29/09/2016 11:13, Rich Felker wrote:
>>> On Linux it's supposed to be the kernel which detects availability of
>>> features (either by feature-specific cpu flags or translating a model
>>> to flags) but I don't see anything for fsqrt on ppc. :-( How/why did
>>> they botch this?
>>
>> Maybe because recent power work on the kernel is POWER oriented, where fsqrt
>> has been defined since POWER4.  However, some more recent Freescale chips
>> (such as e5500 and e6500) also decided not to add the fsqrt instruction.
>>
>> With GCC you can check for _ARCH_PPCSQ to see if the current arch flags
>> allow fsqrt.  At runtime I presume programs can check for the hwcap bit
>> PPC_FEATURE_POWER4; however, it does not help on non-POWER chips which
>> do support fsqrt.
>>
> 
> how can distros deal with this?
> 
> do they require POWER4?
> 

I do not really know what the current approach is for powerpc{32,64} distros,
but I recall that both RHEL and SLES used to provide arch-specific
libc.so builds optimized for each chip (default, power4, powerX).

powerpc64le has a current minimum ISA of 2.07 (power8) with both
complete fp and VSX, so it should not have this issue.



* Re: Model specific optimizations?
  2016-09-29 18:13     ` Rich Felker
  2016-09-29 18:52       ` Adhemerval Zanella
@ 2016-09-30  4:56       ` Markus Wichmann
  2016-10-01  5:50         ` Rich Felker
  1 sibling, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-09-30  4:56 UTC (permalink / raw)
  To: musl

On Thu, Sep 29, 2016 at 02:13:36PM -0400, Rich Felker wrote:
> On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote:
> > [...]
> On Linux it's supposed to be the kernel which detects availability of
> features (either by feature-specific cpu flags or translating a model
> to flags) but I don't see anything for fsqrt on ppc. :-( How/why did
> they botch this?
> 

Maybe it's a new extension? I only know version 2.2 of the PowerPC Book.

Or maybe it goes back to the single core thing. (Only the 970 supports
it, and that's pretty new.) Or maybe Linux kernel developers aren't
interested in this problem, because a manual sqrt exists, and if need
be, anyone can just implement the Babylonian method for speed. On PPC,
it can be implemented in a loop consisting of four instructions, namely:

; .rodata
half: .double 0.5
; assuming positive finite argument
; if that can't be assumed, go through memory to inspect argument
fmr 0, 1    ; yes, halving the exponent would be a better estimate
; requires going through memory, though
lfd 2, half(13)
li 0, 6 ; or more for more accuracy
mtctr 0

1:  ; fr0 = x, fr1 = a
    fdiv 3, 1, 0    ; fr3 = a/x
    fadd 3, 3, 0    ; fr3 = x + a/x
    fmul 0, 3, 2    ; fr0 = 0.5(x + a/x)
    bdnz 1b

So maybe there wasn't a lot of need for the hardware sqrt.
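
For reference, the same iteration written in C. This is only a sketch of the
loop above, with the hypothetical name babylonian_sqrt; like the assembly, it
assumes a positive finite argument and uses the argument itself as the first
estimate.

```c
/* a: positive, finite input. iterations: number of Newton steps. */
double babylonian_sqrt(double a, int iterations)
{
    double x = a;                   /* initial estimate x = a */

    for (int i = 0; i < iterations; i++)
        x = 0.5 * (x + a / x);      /* fdiv, fadd, fmul in the loop above */
    return x;
}
```

With six steps the result for a = 2 is accurate to well below 1e-9, but for
a = 100 it is still off by roughly 5e-5, which is why a better initial
estimate or more iterations would be needed in practice.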

> > Well, yes, I was just throwing shit at a wall to see what sticks. We
> > could also move the function pointer dispatch into a pthread_once block
> > or something. I don't know if any caches need to be cleared then or not.
> 
> pthread_once/call_once would be the nice clean abstraction to use, but
> it's mildly to considerably more expensive, currently involving a full
> barrier. There's a nice technical report on how that can be eliminated
> but it requires TLS, which is also expensive on some archs. In cases
> like this where there's no state other than the function pointer,
> relaxed atomics can simply be used on the reading end and then they're
> always fast.
> 

Hmmm... not on PPC, though. TLS on Linux PPC just uses r2 as TLS
pointer. So the entire thing could be used almost as-is by making sqrtfn
thread-local?

> > So any PowerPC implementation is free to include it or not.
> > There are a lot of optional features, and if the gas people made a
> > different subarch for each combination of them, they'd be here all day.
> 
> They've actually done that for some archs...
> 

That actually made me check if they did it here, but thankfully not. gas
assembles the instruction without flags, without warning, and without a
note or anything on the output file.

> Anyway, I would have no objection right away to doing a patch like
> this that's decided at compile-time based on predefined macros set by
> -march. For runtime choice I think we need to discuss motivation. Are
> you trying to do a powerpc-based distro where you need a universal
> libc.so that works optimally on various models? Or would just
> compiling for the right -march meet your needs?
> 

Just idle musings. I was reading sqrt.c, which has a flowerbox saying
"Use hardware sqrt if available" and recalled that there is a hardware
sqrt on PPC and started doing research from there. And that ended up in
the OP.

> Rich

Ciao,
Markus



* Re: Model specific optimizations?
  2016-09-30  4:56       ` Markus Wichmann
@ 2016-10-01  5:50         ` Rich Felker
  2016-10-01  8:52           ` Markus Wichmann
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2016-10-01  5:50 UTC (permalink / raw)
  To: musl

On Fri, Sep 30, 2016 at 06:56:15AM +0200, Markus Wichmann wrote:
> On Thu, Sep 29, 2016 at 02:13:36PM -0400, Rich Felker wrote:
> > On Thu, Sep 29, 2016 at 07:08:01PM +0200, Markus Wichmann wrote:
> > > [...]
> > On Linux it's supposed to be the kernel which detects availability of
> > features (either by feature-specific cpu flags or translating a model
> > to flags) but I don't see anything for fsqrt on ppc. :-( How/why did
> > they botch this?
> > 
> 
> Maybe it's a new extension? I only know version 2.2 of the PowerPC Book.
> 
> Or maybe it goes back to the single core thing. (Only the 970 supports
> it, and that's pretty new.) Or maybe Linux kernel developers aren't
> interested in this problem, because a manual sqrt exists, and if need
> be, anyone can just implement the Babylonian method for speed. On PPC,
> it can be implemented in a loop consisting of four instructions, namely:
> 
> ; .rodata
> half: .double 0.5
> ; assuming positive finite argument
> ; if that can't be assumed, go through memory to inspect argument
> fmr 1, 0    ; yes, halving the exponent would be a better estimate
> ; requires going through memory, though
> lfd 2, half(13)
> li 0, 6 ;or more for more accuracy
> mtctr 0
> 
> 1:  ; fr0 = x, fr1 = a
>     fdiv 3, 1, 0    ; fr3 = a/x
>     fadd 3, 3, 0    ; fr3 = x + a/x
>     fmul 0, 3, 2    ; fr0 = 0.5(x + a/x)
>     bdnz 1b
> 
> So maybe there wasn't a lot of need for the hardware sqrt.

I don't think this works at all. sqrt() is required to be
correctly-rounded; that's the whole reason sqrt.c is costly.

> > > Well, yes, I was just throwing shit at a wall to see what sticks. We
> > > could also move the function pointer dispatch into a pthread_once block
> > > or something. I don't know if any caches need to be cleared then or not.
> > 
> > pthread_once/call_once would be the nice clean abstraction to use, but
> > it's mildly to considerably more expensive, currently involving a full
> > barrier. There's a nice technical report on how that can be eliminated
> > but it requires TLS, which is also expensive on some archs. In cases
> > like this where there's no state other than the function pointer,
> > relaxed atomics can simply be used on the reading end and then they're
> > always fast.
> 
> Hmmm... not on PPC, though. TLS on Linux PPC just uses r2 as TLS
> pointer. So the entire thing could be used almost as-is by making sqrtfn
> thread-local?

Yes and no. Not in musl because we don't use _Thread_local; it would
require allocating space in the thread structure which is not
appropriate for something like this. The right and most efficient
solution is the one I described above.
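
A minimal sketch of what a relaxed-atomic function pointer dispatch could look like, written with C11 <stdatomic.h> for illustration (an assumption; musl has its own internal atomics). All names here (sqrt_impl, sqrt_resolve, sqrt_soft, sqrt_hw, cpu_has_fsqrt, dispatched_sqrt) are hypothetical. The detection hook is a placeholder that always answers "no", and sqrt_soft is a stand-in Newton loop that is not correctly rounded; it is only there so the sketch links without libm.

```c
#include <stdatomic.h>

typedef double (*sqrt_fn)(double);

/* Stand-in for the portable software sqrt: Newton iteration from above
 * until the estimate stops decreasing. NOT correctly rounded. */
static double sqrt_soft(double a)
{
    if (!(a > 0)) return a;          /* 0, negatives, NaN: passed through */
    double x = a > 1 ? a : 1;        /* any start >= sqrt(a) works */
    for (;;) {
        double n = 0.5 * (x + a / x);
        if (!(n < x)) return x;      /* also stops on NaN (a = infinity) */
        x = n;
    }
}

/* Hypothetical hardware path. Assumes the assembler accepts fsqrt on any
 * PowerPC build; elsewhere it forwards so the sketch still compiles. */
static double sqrt_hw(double x)
{
#if defined(__powerpc__)
    __asm__ ("fsqrt %0, %1" : "=d"(x) : "d"(x));
    return x;
#else
    return sqrt_soft(x);
#endif
}

/* Hypothetical detection hook; placeholder that never reports fsqrt. */
static int cpu_has_fsqrt(void) { return 0; }

static double sqrt_resolve(double x);

/* Starts out pointing at the resolver and is replaced on the first call. */
static _Atomic(sqrt_fn) sqrt_impl = sqrt_resolve;

static double sqrt_resolve(double x)
{
    sqrt_fn f = cpu_has_fsqrt() ? sqrt_hw : sqrt_soft;
    /* Racing threads all store the same value and the pointer is the only
     * state, so relaxed ordering is enough. */
    atomic_store_explicit(&sqrt_impl, f, memory_order_relaxed);
    return f(x);
}

double dispatched_sqrt(double x)
{
    /* Fast path: one relaxed load, no barrier, no TLS access. */
    sqrt_fn f = atomic_load_explicit(&sqrt_impl, memory_order_relaxed);
    return f(x);
}
```

A thread that still sees the resolver simply repeats the detection and stores the same pointer again, so no once-style synchronization is needed.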

Rich



* Re: Model specific optimizations?
  2016-10-01  5:50         ` Rich Felker
@ 2016-10-01  8:52           ` Markus Wichmann
  2016-10-01 15:10             ` Rich Felker
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-10-01  8:52 UTC (permalink / raw)
  To: musl

On Sat, Oct 01, 2016 at 01:50:23AM -0400, Rich Felker wrote:
> I don't think this works at all. sqrt() is required to be
> correctly-rounded; that's the whole reason sqrt.c is costly.

It's an approximation, at least, which was rather my point.

As I've come to realize over the course of this discussion, the fsqrt
instruction is useless here and pretty much everywhere out there:

- If you are looking for accuracy over speed, the standard C library has
  got you covered.
- If you are looking for speed over accuracy, you can code up the
  Babylonian method inside five minutes. You can even tune it to suit
  your needs to an extent (mainly, number of rounds and method of first
  approximation). This method is also portable to other architectures,
  and can be done entirely in C (requiring IEEE floating point, but
  then, most serious FP code does that).
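
A minimal C sketch of that iteration, mirroring the assembly loop from earlier in the thread (first estimate equal to the argument, fixed number of rounds). The name babylonian_sqrt and the rounds parameter are made up for illustration; the sketch assumes a positive finite argument and its result is not correctly rounded.

```c
/* Babylonian (Newton) square root approximation.
 * Assumes a is positive and finite; the result is NOT correctly rounded.
 * babylonian_sqrt and rounds are illustrative names, not a library API. */
double babylonian_sqrt(double a, int rounds)
{
    double x = a;                 /* crude first estimate, as in the asm loop */
    for (int i = 0; i < rounds; i++)
        x = 0.5 * (x + a / x);    /* x_next = (x + a/x) / 2 */
    return x;
}
```

With six rounds this is accurate to many digits for arguments near 1, but for an argument like 1e6 six rounds starting from x = a are nowhere near enough, which is why the method of first approximation matters.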

Also, at least according to Apple, which were the only ones actually
looking at the thing, such as I could find, it was only ever supported
by the 970 and the 970FX cores, released in 2002 and 2004, respectively.
I highly doubt they'll have much relevance. Chalk up my suspicions from
the OP to not having researched enough.

In closing: Nice discussion, but I'm sorry for the noise.

Ciao,
Markus


* Re: Model specific optimizations?
  2016-10-01  8:52           ` Markus Wichmann
@ 2016-10-01 15:10             ` Rich Felker
  2016-10-01 19:53               ` Markus Wichmann
  0 siblings, 1 reply; 14+ messages in thread
From: Rich Felker @ 2016-10-01 15:10 UTC (permalink / raw)
  To: musl

On Sat, Oct 01, 2016 at 10:52:14AM +0200, Markus Wichmann wrote:
> On Sat, Oct 01, 2016 at 01:50:23AM -0400, Rich Felker wrote:
> > I don't think this works at all. sqrt() is required to be
> > correctly-rounded; that's the whole reason sqrt.c is costly.
> 
> It's an approximation, at least, which was rather my point.
> 
> As I've come to realize over the course of this discussion, the fsqrt
> instruction is useless here and pretty much everywhere out there:

I don't think that conclusion is correct. It certainly makes sense for
libc to use it in targets that have it, assuming it safely produces
correct results, and for compilers to generate it in place of a call
to sqrt.

> - If you are looking for accuracy over speed, the standard C library has
>   got you covered.

Yes.

> - If you are looking for speed over accuracy, you can code up the
>   Babylonian method inside five minutes. You can even tune it to suit
>   your needs to an extent (mainly, number of rounds and method of first
>   approximation). This method is also portable to other architectures,
>   and can be done entirely in C (requiring IEEE floating point, but
>   then, most serious FP code does that).

This is not going to give you speed. If you want fast sqrt
approximations, there are lots out there that are actually fast. And
if the final result you need is 1/sqrt there are even faster ones.

> Also, at least according to Apple, which were the only ones actually
> looking at the thing, such as I could find, it was only ever supported
> by the 970 and the 970FX cores, released in 2002 and 2004, respectively.
> I highly doubt they'll have much relevance. Chalk up my suspicions from
> the OP to not having researched enough.

Do you mean these are the only non-POWER line models that have fsqrt?

> In closing: Nice discussion, but I'm sorry for the noise.

I don't think it's noise. It's been informative. And it does suggest
that we should add static, compile-time support for using fsqrt on
POWER and perhaps on these specific models that have it. That's useful
information for making it a better-supported target under musl.

Rich


* Re: Model specific optimizations?
  2016-10-01 15:10             ` Rich Felker
@ 2016-10-01 19:53               ` Markus Wichmann
  2016-10-02 13:59                 ` Adhemerval Zanella
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Wichmann @ 2016-10-01 19:53 UTC (permalink / raw)
  To: musl

On Sat, Oct 01, 2016 at 11:10:12AM -0400, Rich Felker wrote:
> On Sat, Oct 01, 2016 at 10:52:14AM +0200, Markus Wichmann wrote:
> > On Sat, Oct 01, 2016 at 01:50:23AM -0400, Rich Felker wrote:
> > > I don't think this works at all. sqrt() is required to be
> > > correctly-rounded; that's the whole reason sqrt.c is costly.
> > 
> > It's an approximation, at least, which was rather my point.
> > 
> > As I've come to realize over the course of this discussion, the fsqrt
> > instruction is useless here and pretty much everywhere out there:
> 
> I don't think that conclusion is correct. It certainly makes sense for
> libc to use it in targets that have it, assuming it safely produces
> correct results, and for compilers to generate it in place of a call
> to sqrt.
> 

But again, that requires the appropriate flags.

> > Also, at least according to Apple, which were the only ones actually
> > looking at the thing, such as I could find, it was only ever supported
> > by the 970 and the 970FX cores, released in 2002 and 2004, respectively.
> > I highly doubt they'll have much relevance. Chalk up my suspicions from
> > the OP to not having researched enough.
> 
> Do you mean these are the only non-POWER line models that have fsqrt?
> 

The more I research this, the more confused I get!

So, I was looking for real-world users of fsqrt, to look at how they
determine availability. The first such user I found was Apple's libm.
Tracing back to where they set their feature flags, I found this file

http://opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/start.s

If you search for _cpu_capabilities, around line 180 you'll find a
comment saying the feature flags in this file are only defaults and may
be changed by initialization code. But I couldn't find anything setting
more flags, if anything, flags got removed. And the only models that
have the flag kHasFsqrt are the 970 and the 970FX.

But then I noticed that their processor list is kind of small, so I
continued the search. I found this e-mail claiming the 604 supports the
instruction:

http://aps.anl.gov/epics/tech-talk/2011/msg01247.php

But if you look at datasheets of the 604, they say nothing either way.
But alright, the 604 is an old model (introduced in 1994), maybe fsqrt
wasn't defined then.

I personally work with the e300 (at my day job), and at least their
datasheet makes it clear that fsqrt is not supported. Actually,
apparently Freescale aren't big fans of this instruction at all,
according to this comment:

https://github.com/ibmruntimes/v8ppc/issues/119#issuecomment-72705975

Wikipedia claims, however, that it wasn't until the 620 that the square
root instruction was put into hardware. I tried to find a 620 datasheet,
but no luck so far.

Next family on the list would be the 4xx. 403 can be discounted
immediately as it lacks an FPU. Since the 401 is stripped down even
further, it also has no FPU. From the 405 onward it gets dicey as they
went the way of the x87: You could connect an external FPU if desired.

I found one for 405 here:

http://www.xilinx.com/support/documentation/ip_documentation/apu_fpu.pdf

That one doesn't support fsqrt, at least not enough for our purposes,
but it does support fsqrts (that's the single precision variant). That's
a whole new level of weird.

As for the rest: I hope Apple got it right, because after the 970,
nothing more is listed in Wikipedia.

So, as you can see, the whole thing is a mess.

> Rich

Ciao,
Markus


* Re: Model specific optimizations?
  2016-10-01 19:53               ` Markus Wichmann
@ 2016-10-02 13:59                 ` Adhemerval Zanella
  0 siblings, 0 replies; 14+ messages in thread
From: Adhemerval Zanella @ 2016-10-02 13:59 UTC (permalink / raw)
  To: musl



On 01/10/2016 16:53, Markus Wichmann wrote:
> On Sat, Oct 01, 2016 at 11:10:12AM -0400, Rich Felker wrote:
>> On Sat, Oct 01, 2016 at 10:52:14AM +0200, Markus Wichmann wrote:
>>> On Sat, Oct 01, 2016 at 01:50:23AM -0400, Rich Felker wrote:
>>>> I don't think this works at all. sqrt() is required to be
>>>> correctly-rounded; that's the whole reason sqrt.c is costly.
>>>
>>> It's an approximation, at least, which was rather my point.
>>>
>>> As I've come to realize over the course of this discussion, the fsqrt
>>> instruction is useless here and pretty much everywhere out there:
>>
>> I don't think that conclusion is correct. It certainly makes sense for
>> libc to use it in targets that have it, assuming it safely produces
>> correct results, and for compilers to generate it in place of a call
>> to sqrt.
>>
> 
> But again, that requires the appropriate flags.
> 
>>> Also, at least according to Apple, which were the only ones actually
>>> looking at the thing, such as I could find, it was only ever supported
>>> by the 970 and the 970FX cores, released in 2002 and 2004, respectively.
>>> I highly doubt they'll have much relevance. Chalk up my suspicions from
>>> the OP to not having researched enough.
>>
>> Do you mean these are the only non-POWER line models that have fsqrt?
>>
> 
> The more I research this, the more confused I get!
> 
> So, I was looking for real-world users of fsqrt, to look at how they
> determine availability. The first such user I found was Apple's libm.
> Tracing back to where they set their feature flags, I found this file
> 
> http://opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/start.s
> 
> If you search for _cpu_capabilities, around line 180 you'll find a
> comment saying the feature flags in this file are only defaults and may
> be changed by initialization code. But I couldn't find anything setting
> more flags, if anything, flags got removed. And the only models that
> have the flag kHasFsqrt are the 970 and the 970FX.
> 
> But then I noticed that their processor list is kind of small, so I
> continued the search. I found this e-mail claiming the 604 supports the
> instruction:
> 
> http://aps.anl.gov/epics/tech-talk/2011/msg01247.php
> 
> But if you look at datasheets of the 604, they say nothing either way.
> But alright, the 604 is an old model (introduced in 1994), maybe fsqrt
> wasn't defined then.
> 
> I personally work with the e300 (at my day job), and at least their
> datasheet makes it clear that fsqrt is not supported. Actually,
> apparently Freescale aren't big fans of this instruction at all,
> according to this comment:
> 
> https://github.com/ibmruntimes/v8ppc/issues/119#issuecomment-72705975
> 
> Wikipedia claims, however, that it wasn't until the 620 that the square
> root instruction was put into hardware. I tried to find a 620 datasheet,
> but no luck so far.
> 
> Next family on the list would be the 4xx. 403 can be discounted
> immediately as it lacks an FPU. Since the 401 is stripped down even
> further, it also has no FPU. From the 405 onward it gets dicey as they
> went the way of the x87: You could connect an external FPU if desired.
> 
> I found one for 405 here:
> 
> http://www.xilinx.com/support/documentation/ip_documentation/apu_fpu.pdf
> 
> That one doesn't support fsqrt, at least not enough for our purposes,
> but it does support fsqrts (that's the single precision variant). That's
> a whole new level of weird.
> 
> As for the rest: I hope Apple got it right, because after the 970,
> nothing more is listed in Wikipedia.
> 
> So, as you can see, the whole thing is a mess.

Since the kernel does not track it, I have found GCC's internal
implementation to be the most reliable way to find out whether a chip
implementation actually supports a given ppc instruction.

For fsqrt, gcc defines _ARCH_PPCSQ, and internally this flag is
controlled by OPTION_MASK_PPC_GPOPT. The GCC cpu definition file has some
information about it [1]:

  /* For ISA 2.05, do not add MFPGPR, since it isn't in ISA 2.06, and don't add
     ALTIVEC, since in general it isn't a win on power6.  In ISA 2.04, fsel,
     fre, fsqrt, etc. were no longer documented as optional.  Group masks by
     server and embedded. */
  #define ISA_2_5_MASKS_EMBEDDED  (ISA_2_4_MASKS                          \
                                   | OPTION_MASK_CMPB                     \
                                   | OPTION_MASK_RECIP_PRECISION          \
                                   | OPTION_MASK_PPC_GFXOPT               \
                                   | OPTION_MASK_PPC_GPOPT)

  #define ISA_2_5_MASKS_SERVER    (ISA_2_5_MASKS_EMBEDDED | OPTION_MASK_DFP)

Later in the file you can check that, apart from all POWER chips from
power4 onward, only the 970, cell, and G5 support fsqrt.

The problem is that there are newer powerpc cores, such as the e6500, that
are supposed to follow the ISA 2.06 embedded profile but do not implement
fsqrt, even in ppc64 mode.
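
A minimal sketch of how a compile-time choice keyed on _ARCH_PPCSQ could look. The names sqrt_selected, sqrt_soft and SQRT_BACKEND are hypothetical, and sqrt_soft is only a stand-in Newton loop (not correctly rounded) so that the sketch builds without libm; a real libc would fall back to its correctly-rounded software sqrt instead.

```c
#ifdef _ARCH_PPCSQ
/* GCC predefines _ARCH_PPCSQ when the selected cpu includes fsqrt. */
#define SQRT_BACKEND "fsqrt"

double sqrt_selected(double x)
{
    __asm__ ("fsqrt %0, %1" : "=d"(x) : "d"(x));
    return x;
}

#else
#define SQRT_BACKEND "software"

/* Stand-in fallback: Newton iteration from above until the estimate
 * stops decreasing. NOT correctly rounded; illustration only. */
static double sqrt_soft(double a)
{
    if (!(a > 0)) return a;          /* 0, negatives, NaN: passed through */
    double x = a > 1 ? a : 1;        /* any start >= sqrt(a) works */
    for (;;) {
        double n = 0.5 * (x + a / x);
        if (!(n < x)) return x;      /* also stops on NaN (a = infinity) */
        x = n;
    }
}

double sqrt_selected(double x)
{
    return sqrt_soft(x);
}
#endif
```

Built with a -mcpu that sets OPTION_MASK_PPC_GPOPT, the first branch is taken; on cores like the e6500 the macro should only be trusted if the cpu definition in gcc really leaves the flag off.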


[1] gcc/config/rs6000/rs6000-cpus.def


Thread overview: 14+ messages
2016-09-29 14:21 Model specific optimizations? Markus Wichmann
2016-09-29 14:57 ` Szabolcs Nagy
2016-09-29 15:23 ` Rich Felker
2016-09-29 17:08   ` Markus Wichmann
2016-09-29 18:13     ` Rich Felker
2016-09-29 18:52       ` Adhemerval Zanella
2016-09-29 22:05         ` Szabolcs Nagy
2016-09-29 23:14           ` Adhemerval Zanella
2016-09-30  4:56       ` Markus Wichmann
2016-10-01  5:50         ` Rich Felker
2016-10-01  8:52           ` Markus Wichmann
2016-10-01 15:10             ` Rich Felker
2016-10-01 19:53               ` Markus Wichmann
2016-10-02 13:59                 ` Adhemerval Zanella
