remquo - underlying logic

mailing list of musl libc
 help / color / mirror / code / Atom feed

* remquo - underlying logic
@ 2017-11-30 18:11 Damian McGuckin
  2017-11-30 18:59 ` Szabolcs Nagy
  0 siblings, 1 reply; 10+ messages in thread
From: Damian McGuckin @ 2017-11-30 18:11 UTC (permalink / raw)
  To: musl

Is there anywhere this is described in detail.

As part of some research, I was studying an implementation

https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c

which uses ideas which avoids the fixed point arithmetic.

While there are some issues with this implementation, some of its ideas 
may produce cleaner and likely shorter code than the current remquo.
maybe not so much with double but for anything else high, yes.

Has anybody tried this before?

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-11-30 18:11 remquo - underlying logic Damian McGuckin
@ 2017-11-30 18:59 ` Szabolcs Nagy
  2017-11-30 20:29   ` Damian McGuckin
  0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2017-11-30 18:59 UTC (permalink / raw)
  To: musl

* Damian McGuckin <damianm@esi.com.au> [2017-12-01 05:11:40 +1100]:
> Is there anywhere this is described in detail.
> 
> As part of some research, I was studying an implementation
> 
> https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c
> 
> which uses ideas which avoids the fixed point arithmetic.
> 
> While there are some issues with this implementation, some of its ideas may
> produce cleaner and likely shorter code than the current remquo.
> maybe not so much with double but for anything else high, yes.
> 
> Has anybody tried this before?

i think it does similar things than the current
implementation but uses float arithmetics instead
of ints.

i hadn't considered using floats, it may be faster
i don't know, you have to benchmark etc.

i can add this to my math todo items, but i can't
promise to look into it soon unless there is a
good reason to.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-11-30 18:59 ` Szabolcs Nagy
@ 2017-11-30 20:29   ` Damian McGuckin
  2017-11-30 21:16     ` John Reiser
  2017-11-30 21:17     ` Szabolcs Nagy
  0 siblings, 2 replies; 10+ messages in thread
From: Damian McGuckin @ 2017-11-30 20:29 UTC (permalink / raw)
  To: musl

On Thu, 30 Nov 2017, Szabolcs Nagy wrote:

>> https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c
>>
>> Has anybody tried this before?

> i think it does similar things than the current implementation but uses 
> float arithmetics instead of ints.

Yes.

> i hadn't considered using floats, it may be faster i don't know, you 
> have to benchmark etc.

That's what we plan to do as part of our research.

> i can add this to my math todo items, but i can't promise to look into 
> it soon unless there is a good reason to.

It is on MY own TO-DO list. I was asking the question in case anybody had 
done it before and found it useless, could think of major pitfalls, or had 
any pearls of wisdom before we started.  I will keep you posted.

Thanks - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-11-30 20:29   ` Damian McGuckin
@ 2017-11-30 21:16     ` John Reiser
  2017-11-30 21:17     ` Szabolcs Nagy
  1 sibling, 0 replies; 10+ messages in thread
From: John Reiser @ 2017-11-30 21:16 UTC (permalink / raw)
  To: musl

> I was asking the question in case anybody had done it before and found it useless,
> could think of major pitfalls, or had any pearls of wisdom before we started.

The range of a floating-point exponent will limit the applicability.
Packing and unpacking floating-point format (logb, scalb, etc.) are non-trivial costs,
as are mucking around with NaN, +inf, -inf, denormals, etc.

The "big-O" efficiency is the same: find the difference in exponents,
scale both operands to have the same exponent, perform "ordinary long division"
with the number of steps equal to the difference in exponents;
take care to preserve enough precision.  Floating-point division
of the scaled operands (perhaps using a narrower precision) can be a contender,
especially because mis-predicted branches in long division are VERY expensive
(typically: 70% of CPU pipeline depth [==> a dozen or more cycles].)
Newton-Raphson iteration can be a contender for some range of difference in exponents.

-- 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-11-30 20:29   ` Damian McGuckin
  2017-11-30 21:16     ` John Reiser
@ 2017-11-30 21:17     ` Szabolcs Nagy
  2017-12-04  8:10       ` Damian McGuckin
  2017-12-06  1:17       ` Damian McGuckin
  1 sibling, 2 replies; 10+ messages in thread
From: Szabolcs Nagy @ 2017-11-30 21:17 UTC (permalink / raw)
  To: musl

* Damian McGuckin <damianm@esi.com.au> [2017-12-01 07:29:05 +1100]:
> It is on MY own TO-DO list. I was asking the question in case anybody had
> done it before and found it useless, could think of major pitfalls, or had
> any pearls of wisdom before we started.  I will keep you posted.

well, one of the difficulties is to get correct behaviour
in case of fenv access (e.g. remquo has exact result in
all rounding modes so intermediate results should be exact
too, raising underflow correctly is sometimes nontrivial,
getting the sign right when the result is zero in downward
rounding mode sometimes needs special handling, etc),
if you need to save/restore the fenv to achieve that then
it's unlikely to be faster than using ints, otherwise floats
are probably faster, but it's hard to tell.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-11-30 21:17     ` Szabolcs Nagy
@ 2017-12-04  8:10       ` Damian McGuckin
  2017-12-06  1:17       ` Damian McGuckin
  1 sibling, 0 replies; 10+ messages in thread
From: Damian McGuckin @ 2017-12-04  8:10 UTC (permalink / raw)
  To: musl

Floats are not going to be faster for remquo(x, y, &q) except for a very 
small range where

 	1 < x / y < 1000 (or so)

But I was hoping for only a moderate overhead.

And John Reiser mentioned

> The range of a floating-point exponent will limit the applicability.

No.

> Packing and unpacking floating-point format (logb, scalb, etc.) are 
> non-trivial costs, as are mucking around with NaN, +inf, -inf, 
> denormals, etc.

Not really.

> The "big-O" efficiency is the same: find the difference in exponents, 
> scale both operands to have the same exponent,

The above are trivial compared to the operation where you

> perform "ordinary long division" with the number of steps equal to the 
> difference in exponents; take care to preserve enough precision.

This is the killer. On a Xeon, the floating point alternative is generally 
a factor of 4-6 worse. Actually, in the domain

 	y * 2^(p) < x < 2^(w-1)

where w is the word size in bits, the penalty is higher still, about 10.
but after that it drops down to a factor of 4 and increases slowly to a
factor of 6 where the difference in exponents is the same as the bias
of the exponent.

I must admit that I found the penalty incredible. A bit scary that faking 
floating point is so much faster, although admittedly with quite simple 
operations.

The only thing is that the code is infinitely more readable, an important 
but not the dominant concern.

An interesting exercise, albiet a bit fruitless.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-11-30 21:17     ` Szabolcs Nagy
  2017-12-04  8:10       ` Damian McGuckin
@ 2017-12-06  1:17       ` Damian McGuckin
  2017-12-06 10:37         ` Szabolcs Nagy
  1 sibling, 1 reply; 10+ messages in thread
From: Damian McGuckin @ 2017-12-06  1:17 UTC (permalink / raw)
  To: musl

While my exploration with floating point numbers was less than stellar,
I did notice that when

 	ex - ey < p (where p is the digits in the significant)

you can use the

 	fma

routine to compute some appropriately rounded/truncated version of the 
quotient for both remquo and fmod. And this appears to not loose any 
precision for the obvious reasons.

From my limited testing, the speed gain for this extremely limited range 
of exponent difference is huge over the standard routine in MUSL. I will 
do some more testing and report in detail but it seems to be orders of 
magnitude.

Somebody might want to comment on that sort of approach.

In 99% of the floating point work I do, the calculations involve physical 
stresses and strains and loads and such within digital models. They differ 
in exponent range between 10^6 through to 10^12 unless I have screwed up 
in my model. This is a lot less than 2^52 for a double and mostly is still 
under the 2^23 for a float which is just above 10^6. So, in my sort of 
calculations, the speed gain can be quite significant.  The burden of the 
extra branch seems to be trivial, even for the cases where the FMA is not 
used.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-12-06  1:17       ` Damian McGuckin
@ 2017-12-06 10:37         ` Szabolcs Nagy
  2017-12-07  1:09           ` Damian McGuckin
  0 siblings, 1 reply; 10+ messages in thread
From: Szabolcs Nagy @ 2017-12-06 10:37 UTC (permalink / raw)
  To: musl

* Damian McGuckin <damianm@esi.com.au> [2017-12-06 12:17:40 +1100]:
> While my exploration with floating point numbers was less than stellar,
> I did notice that when
> 
> 	ex - ey < p (where p is the digits in the significant)
> 
> you can use the
> 
> 	fma
> 
> routine to compute some appropriately rounded/truncated version of the
> quotient for both remquo and fmod. And this appears to not loose any
> precision for the obvious reasons.
> 
> > From my limited testing, the speed gain for this extremely limited range
> of exponent difference is huge over the standard routine in MUSL. I will do
> some more testing and report in detail but it seems to be orders of
> magnitude.
> 
> Somebody might want to comment on that sort of approach.

it's not clear to me how you use fma (x-(int)(x/y)*y ?),
but efficient fma instruction is not available on all
targets and the software implementation can be very slow.
and i suspect such approach would break fenv correctness.

assuming you rely on exact fma result you may need two
different implementation depending on FP_FAST_FMA
(which is currently missing in musl).

(musl is compiled with -std=c99 so x*y+z is not contracted
to fma(x,y,z) automatically when the instruction is
available, you have to add -ffp-contract=fast if that's
what you want, but it might break some code in musl that
relies on exact arithmetics, most math code should work
either way though.)


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-12-06 10:37         ` Szabolcs Nagy
@ 2017-12-07  1:09           ` Damian McGuckin
  2017-12-08  0:42             ` Szabolcs Nagy
  0 siblings, 1 reply; 10+ messages in thread
From: Damian McGuckin @ 2017-12-07  1:09 UTC (permalink / raw)
  To: musl

On Wed, 6 Dec 2017, Szabolcs Nagy wrote:

> it's not clear to me how you use fma (x-(int)(x/y)*y ?), but efficient 
> fma instruction is not available on all targets and the software 
> implementation can be very slow. and i suspect such approach would break 
> fenv correctness.

Quite likely. Besides, my testing was flawed and FMA is NOT the answer I 
thought it was.

> (musl is compiled with -std=c99 so x*y+z is not contracted to fma(x,y,z) 
> automatically when the instruction is available, you have to add 
> -ffp-contract=fast if that's what you want, but it might break some code 
> in musl that relies on exact arithmetics, most math code should work 
> either way though.)

For a lot of reasons, FMA is not the answer.

That said, over that range, I am experimenting using a simplistic form of 
double-double arithmetic for that calculation.

Would you agree that when

 	(int) (x / y) < 2^52

the computation (int) (x / y) is accurate to within epsilon, i.e. if it
should be at most be incorrect by +/- 1.?

If so, and using the same sort of logic that log.c uses to split the 
calculation of

 	k * log(2.0)

into a high and low component, or maybe into 4 components, would you agree 
that it is possible to come up with an accurate computation of

 	x - y * (int) (x / y)

It should be much quicker than long division.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: remquo - underlying logic
  2017-12-07  1:09           ` Damian McGuckin
@ 2017-12-08  0:42             ` Szabolcs Nagy
  0 siblings, 0 replies; 10+ messages in thread
From: Szabolcs Nagy @ 2017-12-08  0:42 UTC (permalink / raw)
  To: musl

* Damian McGuckin <damianm@esi.com.au> [2017-12-07 12:09:23 +1100]:
> That said, over that range, I am experimenting using a simplistic form of
> double-double arithmetic for that calculation.
> 
> Would you agree that when
> 
> 	(int) (x / y) < 2^52
> 
> the computation (int) (x / y) is accurate to within epsilon, i.e. if it
> should be at most be incorrect by +/- 1.?
> 

this part is not the issue:

if the exact mathematical (int)(x/y) is k, i.e.

k <= x/y < k+1

then the rounded double prec x/y is

k <= x/y <= k+1

because close to k+1 some values get rounded up,
so sometimes you would compute x-(k+1)*y instead
of x-k*y, but this is easy to correct: just add +y
in the end if the result is out of range.
(same is true when other round-to-int method is
used instead of trunc, assumes k and k+1 are
representable)

> If so, and using the same sort of logic that log.c uses to split the
> calculation of
> 
> 	k * log(2.0)
> 
> into a high and low component, or maybe into 4 components, would you agree
> that it is possible to come up with an accurate computation of
> 
> 	x - y * (int) (x / y)
> 
> It should be much quicker than long division.

the real problem is doing x-k*y exactly (it is
representable as double).

when evaluated without fma, there is a rounding
error on the mul (the sub is exact).

one possibility is if k < 2^26 and the bottom 26
bits of y are zeroed out in yz then

x - k*yz - k*(y-yz)

is exact, another way is to use veltkamp/dekker
exact multiplication algorithm.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2017-12-08  0:42 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-30 18:11 remquo - underlying logic Damian McGuckin
2017-11-30 18:59 ` Szabolcs Nagy
2017-11-30 20:29   ` Damian McGuckin
2017-11-30 21:16     ` John Reiser
2017-11-30 21:17     ` Szabolcs Nagy
2017-12-04  8:10       ` Damian McGuckin
2017-12-06  1:17       ` Damian McGuckin
2017-12-06 10:37         ` Szabolcs Nagy
2017-12-07  1:09           ` Damian McGuckin
2017-12-08  0:42             ` Szabolcs Nagy

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).