* More patches for math subtree
@ 2019-12-10 16:57 Stefan Kanthak
2019-12-10 19:35 ` Rich Felker
0 siblings, 1 reply; 7+ messages in thread
From: Stefan Kanthak @ 2019-12-10 16:57 UTC (permalink / raw)
To: musl
Some more optimisations: the current implementations of ceil(), floor()
and trunc() for i386 change the rounding control using fldcw instructions,
which are SLOW; these patches provide faster and smaller branch-free (!)
implementations.
JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
--- -/src/math/i386/floor.s
+++ +/src/math/i386/floor.s
@@ -1,67 +1,26 @@
.global floorf
.type floorf,@function
floorf:
flds 4(%esp)
jmp 1f
.global floorl
.type floorl,@function
floorl:
fldt 4(%esp)
jmp 1f
.global floor
.type floor,@function
floor:
fldl 4(%esp)
+1: fld %st(0)
+ frndint
+ fxch %st(1)
+ fucomip %st(1),%st(0)
+ fld1
+ fldz
+ fcmovb %st(1),%st(0)
+ fsubp %st(0),%st(2)
+ fstp %st(0)
+ ret
-1: mov $0x7,%al
-1: fstcw 4(%esp)
- mov 5(%esp),%ah
- mov %al,5(%esp)
- fldcw 4(%esp)
- frndint
- mov %ah,5(%esp)
- fldcw 4(%esp)
- ret
-
-.global ceil
-.type ceil,@function
-ceil:
- fldl 4(%esp)
- mov $0xb,%al
- jmp 1b
-
-.global ceilf
-.type ceilf,@function
-ceilf:
- flds 4(%esp)
- mov $0xb,%al
- jmp 1b
-
-.global ceill
-.type ceill,@function
-ceill:
- fldt 4(%esp)
- mov $0xb,%al
- jmp 1b
-
-.global trunc
-.type trunc,@function
-trunc:
- fldl 4(%esp)
- mov $0xf,%al
- jmp 1b
-
-.global truncf
-.type truncf,@function
-truncf:
- flds 4(%esp)
- mov $0xf,%al
- jmp 1b
-
-.global truncl
-.type truncl,@function
-truncl:
- fldt 4(%esp)
- mov $0xf,%al
- jmp 1b
--- -/src/math/i386/ceilf.s
+++ +/src/math/i386/ceilf.s
@@ -1,1 +1,1 @@
-# see floor.s
+# see ceil.s
--- -/src/math/i386/ceill.s
+++ +/src/math/i386/ceill.s
@@ -1,1 +1,1 @@
-# see floor.s
+# see ceil.s
--- -/src/math/i386/ceil.s
+++ +/src/math/i386/ceil.s
@@ -1,1 +1,26 @@
-# see floor.s
+.global ceilf
+.type ceilf,@function
+ceilf:
+ flds 4(%esp)
+ jmp 1f
+
+.global ceill
+.type ceill,@function
+ceill:
+ fldt 4(%esp)
+ jmp 1f
+
+.global ceil
+.type ceil,@function
+ceil:
+ fldl 4(%esp)
+1: fld %st(0)
+ frndint
+ fxch %st(1)
+ fucomip %st(1),%st(0)
+ fld1
+ fldz
+ fcmovnbe %st(1),%st(0)
+ faddp %st(0),%st(1)
+ fstp %st(0)
+ ret
--- -/src/math/i386/truncf.s
+++ +/src/math/i386/truncf.s
@@ -1,1 +1,1 @@
-# see floor.s
+# see trunc.s
--- -/src/math/i386/truncl.s
+++ +/src/math/i386/truncl.s
@@ -1,1 +1,1 @@
-# see floor.s
+# see trunc.s
--- -/src/math/i386/trunc.s
+++ +/src/math/i386/trunc.s
@@ -1,1 +1,32 @@
-# see floor.s
+.global truncf
+.type truncf,@function
+truncf:
+ flds 4(%esp)
+ jmp 1f
+
+.global truncl
+.type truncl,@function
+truncl:
+ fldt 4(%esp)
+ jmp 1f
+
+.global trunc
+.type trunc,@function
+trunc:
+ fldl 4(%esp)
+1: fld %st(0)
+ fabs
+ fld %st(0)
+ frndint
+ fxch %st(1)
+ fucomip %st(1),%st(0)
+ fldz
+ fld1
+ fcmovnb %st(1),%st(0)
+ fsubp %st(0),%st(2)
+ fucomip %st(2),%st(0)
+ fst %st(1)
+ fchs
+ fcmovbe %st(1),%st(0)
+ fstp %st(1)
+ ret
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: More patches for math subtree
2019-12-10 16:57 More patches for math subtree Stefan Kanthak
@ 2019-12-10 19:35 ` Rich Felker
2019-12-10 21:32 ` Stefan Kanthak
0 siblings, 1 reply; 7+ messages in thread
From: Rich Felker @ 2019-12-10 19:35 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: musl
On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
> Some more optimisations: the current implementations of ceil(), floor()
> and trunc() for i386 change the rounding control using fldcw instructions,
> which are SLOW; these patches provide faster and smaller branch-free (!)
> implementations.
>
> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
>
> --- -/src/math/i386/floor.s
> +++ +/src/math/i386/floor.s
> @@ -1,67 +1,26 @@
> .global floorf
> .type floorf,@function
> floorf:
> flds 4(%esp)
> jmp 1f
>
> .global floorl
> .type floorl,@function
> floorl:
> fldt 4(%esp)
> jmp 1f
>
> .global floor
> .type floor,@function
> floor:
> fldl 4(%esp)
> +1: fld %st(0)
> + frndint
> + fxch %st(1)
> + fucomip %st(1),%st(0)
> + fld1
> + fldz
> + fcmovb %st(1),%st(0)
^^^^^^
fcmovb is not in the baseline ISA. Otherwise, I *think* the idea of
this patch looks good, provided I'm not missing anything with respect
to how status flags are affected.
As noted in the other email (sorry about not CC'ing you before; I've
got you on CC now), I really want to get rid of all these .s files in
favor of __asm__ statements with proper constraints in C source files.
That makes them inlineable with LTO, and makes it possible for the
compiler to select to use an instruction like fcmovb conditionally
based on the targeted ISA level rather than having to do a .S file
with hard-coded preprocessor conditionals. It also precludes x87 stack
imbalance bugs like CVE-2019-14697, which make me really wary of
manual changes to these files.
Would you be interested in working on converting over the files you
want to optimize (or even others too) to that form at the same time as
doing the optimizations? It would really help with review process and
with improving the overall code state.
Rich
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: More patches for math subtree
2019-12-10 19:35 ` Rich Felker
@ 2019-12-10 21:32 ` Stefan Kanthak
2019-12-10 22:17 ` Rich Felker
0 siblings, 1 reply; 7+ messages in thread
From: Stefan Kanthak @ 2019-12-10 21:32 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
"Rich Felker" <dalias@libc.org> wrote:
> On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
>> Some more optimisations: the current implementations of ceil(), floor()
>> and trunc() for i386 change the rounding control using fldcw instructions,
>> which are SLOW; these patches provide faster and smaller branch-free (!)
>> implementations.
>>
>> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
>>
>> --- -/src/math/i386/floor.s
>> +++ +/src/math/i386/floor.s
>> @@ -1,67 +1,26 @@
>> .global floorf
>> .type floorf,@function
>> floorf:
>> flds 4(%esp)
>> jmp 1f
>>
>> .global floorl
>> .type floorl,@function
>> floorl:
>> fldt 4(%esp)
>> jmp 1f
>>
>> .global floor
>> .type floor,@function
>> floor:
>> fldl 4(%esp)
>> +1: fld %st(0)
>> + frndint
>> + fxch %st(1)
>> + fucomip %st(1),%st(0)
>> + fld1
>> + fldz
>> + fcmovb %st(1),%st(0)
> ^^^^^^
>
> fcmovb is not in the baseline ISA.
This is but irrelevant or inconsequent: FCMOV* as well as FCOMI* and
FUCOMI* were introduced with the PentiumPro. If you allow the use of
the latter you can safely use the former too. And FCOMI* and FUCOMI*
are already used in other .S files.
> Otherwise, I *think* the idea of this patch looks good, provided I'm
> not missing anything with respect to how status flags are affected.
FRNDINT takes care of them!
> As noted in the other email (sorry about not CC'ing you before; I've
> got you on CC now), I really want to get rid of all these .s files in
> favor of __asm__ statements with proper constraints in C source files.
> That makes them inlineable with LTO, and makes it possible for the
> compiler to select to use an instruction like fcmovb conditionally
> based on the targeted ISA level rather than having to do a .S file
> with hard-coded preprocessor conditionals.
While this is generally good idea, there's no guarantee that a compiler
will emit a branch-free instruction sequence like those shown above.
I also doubt that a compiler will produce the 5 instruction sequence
shown in my patch for src/math/i386/remquo.S which collects the FPU
flags C0, C3 and C1 set by FPREM.
I noticed that you provide .S files for "long double" on x86-64, but
not for "double" and "float". I therefore assume that you use the
SSE floating-point instructions there, respectively let the compiler
use them.
Does any compiler emit branch-free instruction sequences like the
following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
.code ; Intel syntax
ceil proc public
extern __real@8000000000000000:real8
movsd xmm1, __real@8000000000000000
extern __real@3ff0000000000000:real8
movsd xmm2, __real@3ff0000000000000
extern __real@4330000000000000:real8
movsd xmm3, __real@4330000000000000
movsd xmm4, xmm1
andnpd xmm1, xmm0
andpd xmm4, xmm0
cmpltsd xmm1, xmm3
andpd xmm1, xmm3
orpd xmm1, xmm4
movsd xmm3, xmm0
addsd xmm0, xmm1
subsd xmm0, xmm1
movsd xmm1, xmm0
cmpltsd xmm0, xmm3
andpd xmm0, xmm2
addsd xmm0, xmm1
orpd xmm0, xmm4
ret
ceil endp
Or instruction sequences like
.code ; Intel syntax
copysign proc public
movd rcx, xmm0
movd rdx, xmm1
shld rcx, rdx, 1
ror rcx, 1
movd xmm0, rcx
ret
copysign endp
.code ; Intel syntax
fdim proc public
movsd xmm2, xmm0
cmpsd xmm0, xmm1, 6
subsd xmm2, xmm1
andpd xmm0, xmm2
ret
fdim endp
> It also precludes x87 stack imbalance bugs like CVE-2019-14697, which
> make me really wary of manual changes to these files.
>
> Would you be interested in working on converting over the files you
> want to optimize (or even others too) to that form at the same time as
> doing the optimizations?
I don't use musl-libc; I also don't use an OS or a compiler/assembler
which can be used to build it.
I just stumbled upon the functions for which I sent in patches while
searching for code which uses Intel's FPU.
> It would really help with review process and with improving the overall
> code state.
If I start using musl-libc I'd be interested and rewrite these parts.
regards
Stefan
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: More patches for math subtree
2019-12-10 21:32 ` Stefan Kanthak
@ 2019-12-10 22:17 ` Rich Felker
2019-12-11 1:13 ` Rosen Penev
2019-12-11 9:53 ` Stefan Kanthak
0 siblings, 2 replies; 7+ messages in thread
From: Rich Felker @ 2019-12-10 22:17 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: musl
On Tue, Dec 10, 2019 at 10:32:26PM +0100, Stefan Kanthak wrote:
> "Rich Felker" <dalias@libc.org> wrote:
>
>
> > On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
> >> Some more optimisations: the current implementations of ceil(), floor()
> >> and trunc() for i386 change the rounding control using fldcw instructions,
> >> which are SLOW; these patches provide faster and smaller branch-free (!)
> >> implementations.
> >>
> >> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
> >>
> >> --- -/src/math/i386/floor.s
> >> +++ +/src/math/i386/floor.s
> >> @@ -1,67 +1,26 @@
> >> .global floorf
> >> .type floorf,@function
> >> floorf:
> >> flds 4(%esp)
> >> jmp 1f
> >>
> >> .global floorl
> >> .type floorl,@function
> >> floorl:
> >> fldt 4(%esp)
> >> jmp 1f
> >>
> >> .global floor
> >> .type floor,@function
> >> floor:
> >> fldl 4(%esp)
> >> +1: fld %st(0)
> >> + frndint
> >> + fxch %st(1)
> >> + fucomip %st(1),%st(0)
> >> + fld1
> >> + fldz
> >> + fcmovb %st(1),%st(0)
> > ^^^^^^
> >
> > fcmovb is not in the baseline ISA.
>
> This is but irrelevant or inconsequent: FCMOV* as well as FCOMI* and
> FUCOMI* were introduced with the PentiumPro. If you allow the use of
> the latter you can safely use the former too. And FCOMI* and FUCOMI*
> are already used in other .S files.
This is why we're not using them. I think you're looking at x86_64
where they are in the baseline ISA.
> > Otherwise, I *think* the idea of this patch looks good, provided I'm
> > not missing anything with respect to how status flags are affected.
>
> FRNDINT takes care of them!
OK.
> > As noted in the other email (sorry about not CC'ing you before; I've
> > got you on CC now), I really want to get rid of all these .s files in
> > favor of __asm__ statements with proper constraints in C source files.
> > That makes them inlineable with LTO, and makes it possible for the
> > compiler to select to use an instruction like fcmovb conditionally
> > based on the targeted ISA level rather than having to do a .S file
> > with hard-coded preprocessor conditionals.
>
> While this is generally good idea, there's no guarantee that a compiler
> will emit a branch-free instruction sequence like those shown above.
> I also doubt that a compiler will produce the 5 instruction sequence
> shown in my patch for src/math/i386/remquo.S which collects the FPU
> flags C0, C3 and C1 set by FPREM.
For that you'd probably put the collection of bits inside the asm. It
still makes just a few instructions of asm, with no need for external
call ABI logic in the asm.
> I noticed that you provide .S files for "long double" on x86-64, but
> not for "double" and "float". I therefore assume that you use the
> SSE floating-point instructions there, respectively let the compiler
> use them.
On the x86_64 ABI, float and double arithmetic are performed in SSE
rather than in excess precision with the x87 unit.
> Does any compiler emit branch-free instruction sequences like the
> following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
>
> .code ; Intel syntax
> ceil proc public
> extern __real@8000000000000000:real8
> movsd xmm1, __real@8000000000000000
> extern __real@3ff0000000000000:real8
> movsd xmm2, __real@3ff0000000000000
> extern __real@4330000000000000:real8
> movsd xmm3, __real@4330000000000000
> movsd xmm4, xmm1
> andnpd xmm1, xmm0
> andpd xmm4, xmm0
> cmpltsd xmm1, xmm3
> andpd xmm1, xmm3
> orpd xmm1, xmm4
> movsd xmm3, xmm0
> addsd xmm0, xmm1
> subsd xmm0, xmm1
> movsd xmm1, xmm0
> cmpltsd xmm0, xmm3
> andpd xmm0, xmm2
> addsd xmm0, xmm1
> orpd xmm0, xmm4
> ret
> ceil endp
>
> Or instruction sequences like
>
> .code ; Intel syntax
> copysign proc public
> movd rcx, xmm0
> movd rdx, xmm1
> shld rcx, rdx, 1
> ror rcx, 1
> movd xmm0, rcx
> ret
> copysign endp
Not quite (but it might be possible to write the C in terms of shifts
instead of masks such that it does), but I also don't think it's clear
which version is better. Yours here is mildly smaller and might
perform better, but when making changes that aren't clearly better
there should be some evidence that it's actually an improvement --
especially if it's not just improving existing arch optimizations but
adding new ones where the C was formerly used. Generally musl avoids
asm and arch-specific files as much as possible, using them only for
things that aren't representable in C or where the C is a lot larger
or slower or both.
> .code ; Intel syntax
> fdim proc public
> movsd xmm2, xmm0
> cmpsd xmm0, xmm1, 6
> subsd xmm2, xmm1
> andpd xmm0, xmm2
> ret
> fdim endp
Does this handle nans correctly?
> > It also precludes x87 stack imbalance bugs like CVE-2019-14697, which
> > make me really wary of manual changes to these files.
> >
> > Would you be interested in working on converting over the files you
> > want to optimize (or even others too) to that form at the same time as
> > doing the optimizations?
>
> I don't use musl-libc; I also don't use an OS or a compiler/assembler
> which can be used to build it.
> I just stumbled upon the functions for which I sent in patches while
> searching for code which uses Intel's FPU.
>
> > It would really help with review process and with improving the overall
> > code state.
>
> If I start using musl-libc I'd be interested and rewrite these parts.
OK. I don't mind looking at these patches further as-is, and I'll try
to continue offering constructive comments now, but it'll be after
this release cycle (hopefully wrapping that up in the next week or so)
before consideration for merging. musl 1.2.0 is already going to be a
release with big changes (time64) and I don't want to risk subtle
breakage with new changes that haven't been reviewed in detail yet or
had time for users to test.
Rich
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: More patches for math subtree
2019-12-10 22:17 ` Rich Felker
@ 2019-12-11 1:13 ` Rosen Penev
2019-12-11 9:53 ` Stefan Kanthak
1 sibling, 0 replies; 7+ messages in thread
From: Rosen Penev @ 2019-12-11 1:13 UTC (permalink / raw)
To: musl; +Cc: Stefan Kanthak
On Tue, Dec 10, 2019 at 2:17 PM Rich Felker <dalias@libc.org> wrote:
>
> On Tue, Dec 10, 2019 at 10:32:26PM +0100, Stefan Kanthak wrote:
> > "Rich Felker" <dalias@libc.org> wrote:
> >
> >
> > > On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
> > >> Some more optimisations: the current implementations of ceil(), floor()
> > >> and trunc() for i386 change the rounding control using fldcw instructions,
> > >> which are SLOW; these patches provide faster and smaller branch-free (!)
> > >> implementations.
> > >>
> > >> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
> > >>
> > >> --- -/src/math/i386/floor.s
> > >> +++ +/src/math/i386/floor.s
> > >> @@ -1,67 +1,26 @@
> > >> .global floorf
> > >> .type floorf,@function
> > >> floorf:
> > >> flds 4(%esp)
> > >> jmp 1f
> > >>
> > >> .global floorl
> > >> .type floorl,@function
> > >> floorl:
> > >> fldt 4(%esp)
> > >> jmp 1f
> > >>
> > >> .global floor
> > >> .type floor,@function
> > >> floor:
> > >> fldl 4(%esp)
> > >> +1: fld %st(0)
> > >> + frndint
> > >> + fxch %st(1)
> > >> + fucomip %st(1),%st(0)
> > >> + fld1
> > >> + fldz
> > >> + fcmovb %st(1),%st(0)
> > > ^^^^^^
> > >
> > > fcmovb is not in the baseline ISA.
> >
> > This is but irrelevant or inconsequent: FCMOV* as well as FCOMI* and
> > FUCOMI* were introduced with the PentiumPro. If you allow the use of
> > the latter you can safely use the former too. And FCOMI* and FUCOMI*
> > are already used in other .S files.
>
> This is why we're not using them. I think you're looking at x86_64
> where they are in the baseline ISA.
>
> > > Otherwise, I *think* the idea of this patch looks good, provided I'm
> > > not missing anything with respect to how status flags are affected.
> >
> > FRNDINT takes care of them!
>
> OK.
>
> > > As noted in the other email (sorry about not CC'ing you before; I've
> > > got you on CC now), I really want to get rid of all these .s files in
> > > favor of __asm__ statements with proper constraints in C source files.
> > > That makes them inlineable with LTO, and makes it possible for the
> > > compiler to select to use an instruction like fcmovb conditionally
> > > based on the targeted ISA level rather than having to do a .S file
> > > with hard-coded preprocessor conditionals.
> >
> > While this is generally good idea, there's no guarantee that a compiler
> > will emit a branch-free instruction sequence like those shown above.
> > I also doubt that a compiler will produce the 5 instruction sequence
> > shown in my patch for src/math/i386/remquo.S which collects the FPU
> > flags C0, C3 and C1 set by FPREM.
>
> For that you'd probably put the collection of bits inside the asm. It
> still makes just a few instructions of asm, with no need for external
> call ABI logic in the asm.
>
> > I noticed that you provide .S files for "long double" on x86-64, but
> > not for "double" and "float". I therefore assume that you use the
> > SSE floating-point instructions there, respectively let the compiler
> > use them.
>
> On the x86_64 ABI, float and double arithmetic are performed in SSE
> rather than in excess precision with the x87 unit.
>
> > Does any compiler emit branch-free instruction sequences like the
> > following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
> >
> > .code ; Intel syntax
> > ceil proc public
> > extern __real@8000000000000000:real8
> > movsd xmm1, __real@8000000000000000
> > extern __real@3ff0000000000000:real8
> > movsd xmm2, __real@3ff0000000000000
> > extern __real@4330000000000000:real8
> > movsd xmm3, __real@4330000000000000
> > movsd xmm4, xmm1
> > andnpd xmm1, xmm0
> > andpd xmm4, xmm0
> > cmpltsd xmm1, xmm3
> > andpd xmm1, xmm3
> > orpd xmm1, xmm4
> > movsd xmm3, xmm0
> > addsd xmm0, xmm1
> > subsd xmm0, xmm1
> > movsd xmm1, xmm0
> > cmpltsd xmm0, xmm3
> > andpd xmm0, xmm2
> > addsd xmm0, xmm1
> > orpd xmm0, xmm4
> > ret
> > ceil endp
> >
> > Or instruction sequences like
> >
> > .code ; Intel syntax
> > copysign proc public
> > movd rcx, xmm0
> > movd rdx, xmm1
> > shld rcx, rdx, 1
> > ror rcx, 1
> > movd xmm0, rcx
> > ret
> > copysign endp
>
> Not quite (but it might be possible to write the C in terms of shifts
> instead of masks such that it does), but I also don't think it's clear
> which version is better. Yours here is mildly smaller and might
> perform better, but when making changes that aren't clearly better
> there should be some evidence that it's actually an improvement --
> especially if it's not just improving existing arch optimizations but
> adding new ones where the C was formerly used. Generally musl avoids
> asm and arch-specific files as much as possible, using them only for
> things that aren't representable in C or where the C is a lot larger
> or slower or both.
>
> > .code ; Intel syntax
> > fdim proc public
> > movsd xmm2, xmm0
> > cmpsd xmm0, xmm1, 6
> > subsd xmm2, xmm1
> > andpd xmm0, xmm2
> > ret
> > fdim endp
>
> Does this handle nans correctly?
>
> > > It also precludes x87 stack imbalance bugs like CVE-2019-14697, which
> > > make me really wary of manual changes to these files.
> > >
> > > Would you be interested in working on converting over the files you
> > > want to optimize (or even others too) to that form at the same time as
> > > doing the optimizations?
> >
> > I don't use musl-libc; I also don't use an OS or a compiler/assembler
> > which can be used to build it.
> > I just stumbled upon the functions for which I sent in patches while
> > searching for code which uses Intel's FPU.
> >
> > > It would really help with review process and with improving the overall
> > > code state.
> >
> > If I start using musl-libc I'd be interested and rewrite these parts.
>
> OK. I don't mind looking at these patches further as-is, and I'll try
> to continue offering constructive comments now, but it'll be after
> this release cycle (hopefully wrapping that up in the next week or so)
> before consideration for merging. musl 1.2.0 is already going to be a
> release with big changes (time64) and I don't want to risk subtle
> breakage with new changes that haven't been reviewed in detail yet or
> had time for users to test.
Since you guys are discussing math optimizations, here's another one:
https://www.openwall.com/lists/musl/2019/11/08/1
>
>
> Rich
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: More patches for math subtree
2019-12-10 22:17 ` Rich Felker
2019-12-11 1:13 ` Rosen Penev
@ 2019-12-11 9:53 ` Stefan Kanthak
2019-12-11 10:28 ` Szabolcs Nagy
1 sibling, 1 reply; 7+ messages in thread
From: Stefan Kanthak @ 2019-12-11 9:53 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
"Rich Felker" <dalias@libc.org> wrote:
> On Tue, Dec 10, 2019 at 10:32:26PM +0100, Stefan Kanthak wrote:
[ asm vs. C ]
>> Does any compiler emit branch-free instruction sequences like the
>> following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
>>
>> .code ; Intel syntax
>> ceil proc public
>> extern __real@8000000000000000:real8
>> movsd xmm1, __real@8000000000000000
>> extern __real@3ff0000000000000:real8
>> movsd xmm2, __real@3ff0000000000000
>> extern __real@4330000000000000:real8
>> movsd xmm3, __real@4330000000000000
>> movsd xmm4, xmm1
>> andnpd xmm1, xmm0
>> andpd xmm4, xmm0
>> cmpltsd xmm1, xmm3
>> andpd xmm1, xmm3
>> orpd xmm1, xmm4
>> movsd xmm3, xmm0
>> addsd xmm0, xmm1
>> subsd xmm0, xmm1
>> movsd xmm1, xmm0
>> cmpltsd xmm0, xmm3
>> andpd xmm0, xmm2
>> addsd xmm0, xmm1
>> orpd xmm0, xmm4
>> ret
>> ceil endp
>>
>> Or instruction sequences like
>>
>> .code ; Intel syntax
>> copysign proc public
>> movd rcx, xmm0
>> movd rdx, xmm1
>> shld rcx, rdx, 1
>> ror rcx, 1
>> movd xmm0, rcx
>> ret
>> copysign endp
>
> Not quite (but it might be possible to write the C in terms of shifts
> instead of masks such that it does), but I also don't think it's clear
> which version is better. Yours here is mildly smaller and might
> perform better, but when making changes that aren't clearly better
> there should be some evidence that it's actually an improvement --
> especially if it's not just improving existing arch optimizations but
> adding new ones where the C was formerly used.
Correct.
I expect the compiler to emit such properly optimised code instead of
calls to the library for standard functions like copysign(), fdim(),
etc. which can be written with just a few instructions ... what the
compiler but not (always) does.
JFTR: I don't know whether GCC or clang either provide intrinsics or
__builtin_* for such (or all those) small standard functions.
> Generally musl avoids asm and arch-specific files as much as possible,
> using them only for things that aren't representable in C or where
> the C is a lot larger or slower or both.
>
>> .code ; Intel syntax
>> fdim proc public
>> movsd xmm2, xmm0
>> cmpsd xmm0, xmm1, 6
>> subsd xmm2, xmm1
>> andpd xmm0, xmm2
>> ret
>> fdim endp
>
> Does this handle nans correctly?
Of course! It's equivalent to
double fdim(double a, double b)
{
uint64_t mask = (a <= b) ? 0ull : ~0ull;
union {double dbl; uint64_t ull;} u = {a - b};
u.ull &= mask;
return u.dbl;
}
[...]
> OK. I don't mind looking at these patches further as-is, and I'll try
> to continue offering constructive comments now, but it'll be after
> this release cycle (hopefully wrapping that up in the next week or so)
> before consideration for merging. musl 1.2.0 is already going to be a
> release with big changes (time64) and I don't want to risk subtle
> breakage with new changes that haven't been reviewed in detail yet or
> had time for users to test.
That's OK.
Stefan
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: More patches for math subtree
2019-12-11 9:53 ` Stefan Kanthak
@ 2019-12-11 10:28 ` Szabolcs Nagy
0 siblings, 0 replies; 7+ messages in thread
From: Szabolcs Nagy @ 2019-12-11 10:28 UTC (permalink / raw)
To: Stefan Kanthak; +Cc: Rich Felker, musl
* Stefan Kanthak <stefan.kanthak@nexgo.de> [2019-12-11 10:53:41 +0100]:
> JFTR: I don't know whether GCC or clang either provide intrinsics or
> __builtin_* for such (or all those) small standard functions.
they do.
they may not inline them, but that's a compiler decision
and should be addressed there.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2019-12-11 10:28 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-10 16:57 More patches for math subtree Stefan Kanthak
2019-12-10 19:35 ` Rich Felker
2019-12-10 21:32 ` Stefan Kanthak
2019-12-10 22:17 ` Rich Felker
2019-12-11 1:13 ` Rosen Penev
2019-12-11 9:53 ` Stefan Kanthak
2019-12-11 10:28 ` Szabolcs Nagy
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).