* remquo - underlying logic
@ 2017-11-30 18:11 Damian McGuckin
2017-11-30 18:59 ` Szabolcs Nagy
0 siblings, 1 reply; 10+ messages in thread
From: Damian McGuckin @ 2017-11-30 18:11 UTC (permalink / raw)
To: musl
Is there anywhere this is described in detail.
As part of some research, I was studying an implementation
https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c
which uses ideas which avoids the fixed point arithmetic.
While there are some issues with this implementation, some of its ideas
may produce cleaner and likely shorter code than the current remquo.
maybe not so much with double but for anything else high, yes.
Has anybody tried this before?
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-11-30 18:11 remquo - underlying logic Damian McGuckin
@ 2017-11-30 18:59 ` Szabolcs Nagy
2017-11-30 20:29 ` Damian McGuckin
0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2017-11-30 18:59 UTC (permalink / raw)
To: musl
* Damian McGuckin <damianm@esi.com.au> [2017-12-01 05:11:40 +1100]:
> Is there anywhere this is described in detail.
>
> As part of some research, I was studying an implementation
>
> https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c
>
> which uses ideas which avoids the fixed point arithmetic.
>
> While there are some issues with this implementation, some of its ideas may
> produce cleaner and likely shorter code than the current remquo.
> maybe not so much with double but for anything else high, yes.
>
> Has anybody tried this before?
i think it does similar things than the current
implementation but uses float arithmetics instead
of ints.
i hadn't considered using floats, it may be faster
i don't know, you have to benchmark etc.
i can add this to my math todo items, but i can't
promise to look into it soon unless there is a
good reason to.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-11-30 18:59 ` Szabolcs Nagy
@ 2017-11-30 20:29 ` Damian McGuckin
2017-11-30 21:16 ` John Reiser
2017-11-30 21:17 ` Szabolcs Nagy
0 siblings, 2 replies; 10+ messages in thread
From: Damian McGuckin @ 2017-11-30 20:29 UTC (permalink / raw)
To: musl
On Thu, 30 Nov 2017, Szabolcs Nagy wrote:
>> https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c
>>
>> Has anybody tried this before?
> i think it does similar things than the current implementation but uses
> float arithmetics instead of ints.
Yes.
> i hadn't considered using floats, it may be faster i don't know, you
> have to benchmark etc.
That's what we plan to do as part of our research.
> i can add this to my math todo items, but i can't promise to look into
> it soon unless there is a good reason to.
It is on MY own TO-DO list. I was asking the question in case anybody had
done it before and found it useless, could think of major pitfalls, or had
any pearls of wisdom before we started. I will keep you posted.
Thanks - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-11-30 20:29 ` Damian McGuckin
@ 2017-11-30 21:16 ` John Reiser
2017-11-30 21:17 ` Szabolcs Nagy
1 sibling, 0 replies; 10+ messages in thread
From: John Reiser @ 2017-11-30 21:16 UTC (permalink / raw)
To: musl
> I was asking the question in case anybody had done it before and found it useless,
> could think of major pitfalls, or had any pearls of wisdom before we started.
The range of a floating-point exponent will limit the applicability.
Packing and unpacking floating-point format (logb, scalb, etc.) are non-trivial costs,
as are mucking around with NaN, +inf, -inf, denormals, etc.
The "big-O" efficiency is the same: find the difference in exponents,
scale both operands to have the same exponent, perform "ordinary long division"
with the number of steps equal to the difference in exponents;
take care to preserve enough precision. Floating-point division
of the scaled operands (perhaps using a narrower precision) can be a contender,
especially because mis-predicted branches in long division are VERY expensive
(typically: 70% of CPU pipeline depth [==> a dozen or more cycles].)
Newton-Raphson iteration can be a contender for some range of difference in exponents.
--
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-11-30 20:29 ` Damian McGuckin
2017-11-30 21:16 ` John Reiser
@ 2017-11-30 21:17 ` Szabolcs Nagy
2017-12-04 8:10 ` Damian McGuckin
2017-12-06 1:17 ` Damian McGuckin
1 sibling, 2 replies; 10+ messages in thread
From: Szabolcs Nagy @ 2017-11-30 21:17 UTC (permalink / raw)
To: musl
* Damian McGuckin <damianm@esi.com.au> [2017-12-01 07:29:05 +1100]:
> It is on MY own TO-DO list. I was asking the question in case anybody had
> done it before and found it useless, could think of major pitfalls, or had
> any pearls of wisdom before we started. I will keep you posted.
well, one of the difficulties is to get correct behaviour
in case of fenv access (e.g. remquo has exact result in
all rounding modes so intermediate results should be exact
too, raising underflow correctly is sometimes nontrivial,
getting the sign right when the result is zero in downward
rounding mode sometimes needs special handling, etc),
if you need to save/restore the fenv to achieve that then
it's unlikely to be faster than using ints, otherwise floats
are probably faster, but it's hard to tell.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-11-30 21:17 ` Szabolcs Nagy
@ 2017-12-04 8:10 ` Damian McGuckin
2017-12-06 1:17 ` Damian McGuckin
1 sibling, 0 replies; 10+ messages in thread
From: Damian McGuckin @ 2017-12-04 8:10 UTC (permalink / raw)
To: musl
Floats are not going to be faster for remquo(x, y, &q) except for a very
small range where
1 < x / y < 1000 (or so)
But I was hoping for only a moderate overhead.
And John Reiser mentioned
> The range of a floating-point exponent will limit the applicability.
No.
> Packing and unpacking floating-point format (logb, scalb, etc.) are
> non-trivial costs, as are mucking around with NaN, +inf, -inf,
> denormals, etc.
Not really.
> The "big-O" efficiency is the same: find the difference in exponents,
> scale both operands to have the same exponent,
The above are trivial compared to the operation where you
> perform "ordinary long division" with the number of steps equal to the
> difference in exponents; take care to preserve enough precision.
This is the killer. On a Xeon, the floating point alternative is generally
a factor of 4-6 worse. Actually, in the domain
y * 2^(p) < x < 2^(w-1)
where w is the word size in bits, the penalty is higher still, about 10.
but after that it drops down to a factor of 4 and increases slowly to a
factor of 6 where the difference in exponents is the same as the bias
of the exponent.
I must admit that I found the penalty incredible. A bit scary that faking
floating point is so much faster, although admittedly with quite simple
operations.
The only thing is that the code is infinitely more readable, an important
but not the dominant concern.
An interesting exercise, albiet a bit fruitless.
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-11-30 21:17 ` Szabolcs Nagy
2017-12-04 8:10 ` Damian McGuckin
@ 2017-12-06 1:17 ` Damian McGuckin
2017-12-06 10:37 ` Szabolcs Nagy
1 sibling, 1 reply; 10+ messages in thread
From: Damian McGuckin @ 2017-12-06 1:17 UTC (permalink / raw)
To: musl
While my exploration with floating point numbers was less than stellar,
I did notice that when
ex - ey < p (where p is the digits in the significant)
you can use the
fma
routine to compute some appropriately rounded/truncated version of the
quotient for both remquo and fmod. And this appears to not loose any
precision for the obvious reasons.
From my limited testing, the speed gain for this extremely limited range
of exponent difference is huge over the standard routine in MUSL. I will
do some more testing and report in detail but it seems to be orders of
magnitude.
Somebody might want to comment on that sort of approach.
In 99% of the floating point work I do, the calculations involve physical
stresses and strains and loads and such within digital models. They differ
in exponent range between 10^6 through to 10^12 unless I have screwed up
in my model. This is a lot less than 2^52 for a double and mostly is still
under the 2^23 for a float which is just above 10^6. So, in my sort of
calculations, the speed gain can be quite significant. The burden of the
extra branch seems to be trivial, even for the cases where the FMA is not
used.
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-12-06 1:17 ` Damian McGuckin
@ 2017-12-06 10:37 ` Szabolcs Nagy
2017-12-07 1:09 ` Damian McGuckin
0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2017-12-06 10:37 UTC (permalink / raw)
To: musl
* Damian McGuckin <damianm@esi.com.au> [2017-12-06 12:17:40 +1100]:
> While my exploration with floating point numbers was less than stellar,
> I did notice that when
>
> ex - ey < p (where p is the digits in the significant)
>
> you can use the
>
> fma
>
> routine to compute some appropriately rounded/truncated version of the
> quotient for both remquo and fmod. And this appears to not loose any
> precision for the obvious reasons.
>
> > From my limited testing, the speed gain for this extremely limited range
> of exponent difference is huge over the standard routine in MUSL. I will do
> some more testing and report in detail but it seems to be orders of
> magnitude.
>
> Somebody might want to comment on that sort of approach.
it's not clear to me how you use fma (x-(int)(x/y)*y ?),
but efficient fma instruction is not available on all
targets and the software implementation can be very slow.
and i suspect such approach would break fenv correctness.
assuming you rely on exact fma result you may need two
different implementation depending on FP_FAST_FMA
(which is currently missing in musl).
(musl is compiled with -std=c99 so x*y+z is not contracted
to fma(x,y,z) automatically when the instruction is
available, you have to add -ffp-contract=fast if that's
what you want, but it might break some code in musl that
relies on exact arithmetics, most math code should work
either way though.)
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-12-06 10:37 ` Szabolcs Nagy
@ 2017-12-07 1:09 ` Damian McGuckin
2017-12-08 0:42 ` Szabolcs Nagy
0 siblings, 1 reply; 10+ messages in thread
From: Damian McGuckin @ 2017-12-07 1:09 UTC (permalink / raw)
To: musl
On Wed, 6 Dec 2017, Szabolcs Nagy wrote:
> it's not clear to me how you use fma (x-(int)(x/y)*y ?), but efficient
> fma instruction is not available on all targets and the software
> implementation can be very slow. and i suspect such approach would break
> fenv correctness.
Quite likely. Besides, my testing was flawed and FMA is NOT the answer I
thought it was.
> (musl is compiled with -std=c99 so x*y+z is not contracted to fma(x,y,z)
> automatically when the instruction is available, you have to add
> -ffp-contract=fast if that's what you want, but it might break some code
> in musl that relies on exact arithmetics, most math code should work
> either way though.)
For a lot of reasons, FMA is not the answer.
That said, over that range, I am experimenting using a simplistic form of
double-double arithmetic for that calculation.
Would you agree that when
(int) (x / y) < 2^52
the computation (int) (x / y) is accurate to within epsilon, i.e. if it
should be at most be incorrect by +/- 1.?
If so, and using the same sort of logic that log.c uses to split the
calculation of
k * log(2.0)
into a high and low component, or maybe into 4 components, would you agree
that it is possible to come up with an accurate computation of
x - y * (int) (x / y)
It should be much quicker than long division.
Regards - Damian
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic
2017-12-07 1:09 ` Damian McGuckin
@ 2017-12-08 0:42 ` Szabolcs Nagy
0 siblings, 0 replies; 10+ messages in thread
From: Szabolcs Nagy @ 2017-12-08 0:42 UTC (permalink / raw)
To: musl
* Damian McGuckin <damianm@esi.com.au> [2017-12-07 12:09:23 +1100]:
> That said, over that range, I am experimenting using a simplistic form of
> double-double arithmetic for that calculation.
>
> Would you agree that when
>
> (int) (x / y) < 2^52
>
> the computation (int) (x / y) is accurate to within epsilon, i.e. if it
> should be at most be incorrect by +/- 1.?
>
this part is not the issue:
if the exact mathematical (int)(x/y) is k, i.e.
k <= x/y < k+1
then the rounded double prec x/y is
k <= x/y <= k+1
because close to k+1 some values get rounded up,
so sometimes you would compute x-(k+1)*y instead
of x-k*y, but this is easy to correct: just add +y
in the end if the result is out of range.
(same is true when other round-to-int method is
used instead of trunc, assumes k and k+1 are
representable)
> If so, and using the same sort of logic that log.c uses to split the
> calculation of
>
> k * log(2.0)
>
> into a high and low component, or maybe into 4 components, would you agree
> that it is possible to come up with an accurate computation of
>
> x - y * (int) (x / y)
>
> It should be much quicker than long division.
the real problem is doing x-k*y exactly (it is
representable as double).
when evaluated without fma, there is a rounding
error on the mul (the sub is exact).
one possibility is if k < 2^26 and the bottom 26
bits of y are zeroed out in yz then
x - k*yz - k*(y-yz)
is exact, another way is to use veltkamp/dekker
exact multiplication algorithm.
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2017-12-08 0:42 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-30 18:11 remquo - underlying logic Damian McGuckin
2017-11-30 18:59 ` Szabolcs Nagy
2017-11-30 20:29 ` Damian McGuckin
2017-11-30 21:16 ` John Reiser
2017-11-30 21:17 ` Szabolcs Nagy
2017-12-04 8:10 ` Damian McGuckin
2017-12-06 1:17 ` Damian McGuckin
2017-12-06 10:37 ` Szabolcs Nagy
2017-12-07 1:09 ` Damian McGuckin
2017-12-08 0:42 ` Szabolcs Nagy
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).