* remquo - underlying logic @ 2017-11-30 18:11 Damian McGuckin 2017-11-30 18:59 ` Szabolcs Nagy 0 siblings, 1 reply; 10+ messages in thread From: Damian McGuckin @ 2017-11-30 18:11 UTC (permalink / raw) To: musl Is there anywhere this is described in detail. As part of some research, I was studying an implementation https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c which uses ideas which avoids the fixed point arithmetic. While there are some issues with this implementation, some of its ideas may produce cleaner and likely shorter code than the current remquo. maybe not so much with double but for anything else high, yes. Has anybody tried this before? Regards - Damian Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037 Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here Views & opinions here are mine and not those of any past or present employer ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-11-30 18:11 remquo - underlying logic Damian McGuckin @ 2017-11-30 18:59 ` Szabolcs Nagy 2017-11-30 20:29 ` Damian McGuckin 0 siblings, 1 reply; 10+ messages in thread From: Szabolcs Nagy @ 2017-11-30 18:59 UTC (permalink / raw) To: musl * Damian McGuckin <damianm@esi.com.au> [2017-12-01 05:11:40 +1100]: > Is there anywhere this is described in detail. > > As part of some research, I was studying an implementation > > https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c > > which uses ideas which avoids the fixed point arithmetic. > > While there are some issues with this implementation, some of its ideas may > produce cleaner and likely shorter code than the current remquo. > maybe not so much with double but for anything else high, yes. > > Has anybody tried this before? i think it does similar things than the current implementation but uses float arithmetics instead of ints. i hadn't considered using floats, it may be faster i don't know, you have to benchmark etc. i can add this to my math todo items, but i can't promise to look into it soon unless there is a good reason to. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-11-30 18:59 ` Szabolcs Nagy @ 2017-11-30 20:29 ` Damian McGuckin 2017-11-30 21:16 ` John Reiser 2017-11-30 21:17 ` Szabolcs Nagy 0 siblings, 2 replies; 10+ messages in thread From: Damian McGuckin @ 2017-11-30 20:29 UTC (permalink / raw) To: musl On Thu, 30 Nov 2017, Szabolcs Nagy wrote: >> https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/remquo.c >> >> Has anybody tried this before? > i think it does similar things than the current implementation but uses > float arithmetics instead of ints. Yes. > i hadn't considered using floats, it may be faster i don't know, you > have to benchmark etc. That's what we plan to do as part of our research. > i can add this to my math todo items, but i can't promise to look into > it soon unless there is a good reason to. It is on MY own TO-DO list. I was asking the question in case anybody had done it before and found it useless, could think of major pitfalls, or had any pearls of wisdom before we started. I will keep you posted. Thanks - Damian Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037 Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here Views & opinions here are mine and not those of any past or present employer ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-11-30 20:29 ` Damian McGuckin @ 2017-11-30 21:16 ` John Reiser 2017-11-30 21:17 ` Szabolcs Nagy 1 sibling, 0 replies; 10+ messages in thread From: John Reiser @ 2017-11-30 21:16 UTC (permalink / raw) To: musl > I was asking the question in case anybody had done it before and found it useless, > could think of major pitfalls, or had any pearls of wisdom before we started. The range of a floating-point exponent will limit the applicability. Packing and unpacking floating-point format (logb, scalb, etc.) are non-trivial costs, as are mucking around with NaN, +inf, -inf, denormals, etc. The "big-O" efficiency is the same: find the difference in exponents, scale both operands to have the same exponent, perform "ordinary long division" with the number of steps equal to the difference in exponents; take care to preserve enough precision. Floating-point division of the scaled operands (perhaps using a narrower precision) can be a contender, especially because mis-predicted branches in long division are VERY expensive (typically: 70% of CPU pipeline depth [==> a dozen or more cycles].) Newton-Raphson iteration can be a contender for some range of difference in exponents. -- ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-11-30 20:29 ` Damian McGuckin 2017-11-30 21:16 ` John Reiser @ 2017-11-30 21:17 ` Szabolcs Nagy 2017-12-04 8:10 ` Damian McGuckin 2017-12-06 1:17 ` Damian McGuckin 1 sibling, 2 replies; 10+ messages in thread From: Szabolcs Nagy @ 2017-11-30 21:17 UTC (permalink / raw) To: musl * Damian McGuckin <damianm@esi.com.au> [2017-12-01 07:29:05 +1100]: > It is on MY own TO-DO list. I was asking the question in case anybody had > done it before and found it useless, could think of major pitfalls, or had > any pearls of wisdom before we started. I will keep you posted. well, one of the difficulties is to get correct behaviour in case of fenv access (e.g. remquo has exact result in all rounding modes so intermediate results should be exact too, raising underflow correctly is sometimes nontrivial, getting the sign right when the result is zero in downward rounding mode sometimes needs special handling, etc), if you need to save/restore the fenv to achieve that then it's unlikely to be faster than using ints, otherwise floats are probably faster, but it's hard to tell. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-11-30 21:17 ` Szabolcs Nagy @ 2017-12-04 8:10 ` Damian McGuckin 2017-12-06 1:17 ` Damian McGuckin 1 sibling, 0 replies; 10+ messages in thread From: Damian McGuckin @ 2017-12-04 8:10 UTC (permalink / raw) To: musl Floats are not going to be faster for remquo(x, y, &q) except for a very small range where 1 < x / y < 1000 (or so) But I was hoping for only a moderate overhead. And John Reiser mentioned > The range of a floating-point exponent will limit the applicability. No. > Packing and unpacking floating-point format (logb, scalb, etc.) are > non-trivial costs, as are mucking around with NaN, +inf, -inf, > denormals, etc. Not really. > The "big-O" efficiency is the same: find the difference in exponents, > scale both operands to have the same exponent, The above are trivial compared to the operation where you > perform "ordinary long division" with the number of steps equal to the > difference in exponents; take care to preserve enough precision. This is the killer. On a Xeon, the floating point alternative is generally a factor of 4-6 worse. Actually, in the domain y * 2^(p) < x < 2^(w-1) where w is the word size in bits, the penalty is higher still, about 10. but after that it drops down to a factor of 4 and increases slowly to a factor of 6 where the difference in exponents is the same as the bias of the exponent. I must admit that I found the penalty incredible. A bit scary that faking floating point is so much faster, although admittedly with quite simple operations. The only thing is that the code is infinitely more readable, an important but not the dominant concern. An interesting exercise, albiet a bit fruitless. Regards - Damian Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037 Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here Views & opinions here are mine and not those of any past or present employer ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-11-30 21:17 ` Szabolcs Nagy 2017-12-04 8:10 ` Damian McGuckin @ 2017-12-06 1:17 ` Damian McGuckin 2017-12-06 10:37 ` Szabolcs Nagy 1 sibling, 1 reply; 10+ messages in thread From: Damian McGuckin @ 2017-12-06 1:17 UTC (permalink / raw) To: musl While my exploration with floating point numbers was less than stellar, I did notice that when ex - ey < p (where p is the digits in the significant) you can use the fma routine to compute some appropriately rounded/truncated version of the quotient for both remquo and fmod. And this appears to not loose any precision for the obvious reasons. From my limited testing, the speed gain for this extremely limited range of exponent difference is huge over the standard routine in MUSL. I will do some more testing and report in detail but it seems to be orders of magnitude. Somebody might want to comment on that sort of approach. In 99% of the floating point work I do, the calculations involve physical stresses and strains and loads and such within digital models. They differ in exponent range between 10^6 through to 10^12 unless I have screwed up in my model. This is a lot less than 2^52 for a double and mostly is still under the 2^23 for a float which is just above 10^6. So, in my sort of calculations, the speed gain can be quite significant. The burden of the extra branch seems to be trivial, even for the cases where the FMA is not used. Regards - Damian Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037 Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here Views & opinions here are mine and not those of any past or present employer ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-12-06 1:17 ` Damian McGuckin @ 2017-12-06 10:37 ` Szabolcs Nagy 2017-12-07 1:09 ` Damian McGuckin 0 siblings, 1 reply; 10+ messages in thread From: Szabolcs Nagy @ 2017-12-06 10:37 UTC (permalink / raw) To: musl * Damian McGuckin <damianm@esi.com.au> [2017-12-06 12:17:40 +1100]: > While my exploration with floating point numbers was less than stellar, > I did notice that when > > ex - ey < p (where p is the digits in the significant) > > you can use the > > fma > > routine to compute some appropriately rounded/truncated version of the > quotient for both remquo and fmod. And this appears to not loose any > precision for the obvious reasons. > > > From my limited testing, the speed gain for this extremely limited range > of exponent difference is huge over the standard routine in MUSL. I will do > some more testing and report in detail but it seems to be orders of > magnitude. > > Somebody might want to comment on that sort of approach. it's not clear to me how you use fma (x-(int)(x/y)*y ?), but efficient fma instruction is not available on all targets and the software implementation can be very slow. and i suspect such approach would break fenv correctness. assuming you rely on exact fma result you may need two different implementation depending on FP_FAST_FMA (which is currently missing in musl). (musl is compiled with -std=c99 so x*y+z is not contracted to fma(x,y,z) automatically when the instruction is available, you have to add -ffp-contract=fast if that's what you want, but it might break some code in musl that relies on exact arithmetics, most math code should work either way though.) ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-12-06 10:37 ` Szabolcs Nagy @ 2017-12-07 1:09 ` Damian McGuckin 2017-12-08 0:42 ` Szabolcs Nagy 0 siblings, 1 reply; 10+ messages in thread From: Damian McGuckin @ 2017-12-07 1:09 UTC (permalink / raw) To: musl On Wed, 6 Dec 2017, Szabolcs Nagy wrote: > it's not clear to me how you use fma (x-(int)(x/y)*y ?), but efficient > fma instruction is not available on all targets and the software > implementation can be very slow. and i suspect such approach would break > fenv correctness. Quite likely. Besides, my testing was flawed and FMA is NOT the answer I thought it was. > (musl is compiled with -std=c99 so x*y+z is not contracted to fma(x,y,z) > automatically when the instruction is available, you have to add > -ffp-contract=fast if that's what you want, but it might break some code > in musl that relies on exact arithmetics, most math code should work > either way though.) For a lot of reasons, FMA is not the answer. That said, over that range, I am experimenting using a simplistic form of double-double arithmetic for that calculation. Would you agree that when (int) (x / y) < 2^52 the computation (int) (x / y) is accurate to within epsilon, i.e. if it should be at most be incorrect by +/- 1.? If so, and using the same sort of logic that log.c uses to split the calculation of k * log(2.0) into a high and low component, or maybe into 4 components, would you agree that it is possible to come up with an accurate computation of x - y * (int) (x / y) It should be much quicker than long division. Regards - Damian Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037 Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here Views & opinions here are mine and not those of any past or present employer ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: remquo - underlying logic 2017-12-07 1:09 ` Damian McGuckin @ 2017-12-08 0:42 ` Szabolcs Nagy 0 siblings, 0 replies; 10+ messages in thread From: Szabolcs Nagy @ 2017-12-08 0:42 UTC (permalink / raw) To: musl * Damian McGuckin <damianm@esi.com.au> [2017-12-07 12:09:23 +1100]: > That said, over that range, I am experimenting using a simplistic form of > double-double arithmetic for that calculation. > > Would you agree that when > > (int) (x / y) < 2^52 > > the computation (int) (x / y) is accurate to within epsilon, i.e. if it > should be at most be incorrect by +/- 1.? > this part is not the issue: if the exact mathematical (int)(x/y) is k, i.e. k <= x/y < k+1 then the rounded double prec x/y is k <= x/y <= k+1 because close to k+1 some values get rounded up, so sometimes you would compute x-(k+1)*y instead of x-k*y, but this is easy to correct: just add +y in the end if the result is out of range. (same is true when other round-to-int method is used instead of trunc, assumes k and k+1 are representable) > If so, and using the same sort of logic that log.c uses to split the > calculation of > > k * log(2.0) > > into a high and low component, or maybe into 4 components, would you agree > that it is possible to come up with an accurate computation of > > x - y * (int) (x / y) > > It should be much quicker than long division. the real problem is doing x-k*y exactly (it is representable as double). when evaluated without fma, there is a rounding error on the mul (the sub is exact). one possibility is if k < 2^26 and the bottom 26 bits of y are zeroed out in yz then x - k*yz - k*(y-yz) is exact, another way is to use veltkamp/dekker exact multiplication algorithm. ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2017-12-08 0:42 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-11-30 18:11 remquo - underlying logic Damian McGuckin 2017-11-30 18:59 ` Szabolcs Nagy 2017-11-30 20:29 ` Damian McGuckin 2017-11-30 21:16 ` John Reiser 2017-11-30 21:17 ` Szabolcs Nagy 2017-12-04 8:10 ` Damian McGuckin 2017-12-06 1:17 ` Damian McGuckin 2017-12-06 10:37 ` Szabolcs Nagy 2017-12-07 1:09 ` Damian McGuckin 2017-12-08 0:42 ` Szabolcs Nagy
Code repositories for project(s) associated with this public inbox https://git.vuxu.org/mirror/musl/ This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).