mailing list of musl libc
* Possible Mistype in exp.c
@ 2019-01-28  6:59 Damian McGuckin
  2019-01-29 11:01 ` Szabolcs Nagy
  0 siblings, 1 reply; 18+ messages in thread
From: Damian McGuckin @ 2019-01-28  6:59 UTC (permalink / raw)
  To: musl


In 1.1.21 in exp.c

        if (x < -708.39641853226410622) {
             /* underflow if x!=-inf */
             FORCE_EVAL((float)(-0x1p-149/x));
             if (x < -745.13321910194110842)

The constant also appears in expf.c.

For floats, the constant

 	0x1p-149

is the smallest non-zero finite number and will do the right thing.

Using that with doubles will not trigger an underflow exception.

I believe that for doubles, this needs to be

 	0x1p-1074

i.e. 0x1p-1022 * 0x1p-52 if you want a conceptual match.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Possible Mistype in exp.c
  2019-01-28  6:59 Possible Mistype in exp.c Damian McGuckin
@ 2019-01-29 11:01 ` Szabolcs Nagy
  2019-01-29 11:26   ` Damian McGuckin
  0 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2019-01-29 11:01 UTC (permalink / raw)
  To: musl; +Cc: Damian McGuckin

* Damian McGuckin <damianm@esi.com.au> [2019-01-28 17:59:53 +1100]:
> In 1.1.21 in exp.c
> 
>        if (x < -708.39641853226410622) {
>             /* underflow if x!=-inf */
>             FORCE_EVAL((float)(-0x1p-149/x));
                         ^^^^^^^
>             if (x < -745.13321910194110842)
> 
> The constant also appears in expf.c.
> 
> For floats, the constant
> 
> 	0x1p-149
> 
> is the smallest non-zero finite number and will do the right thing.
> 
> Using that with doubles will not trigger an underflow exception.

it's not the division that's supposed to raise the exception,
but the cast (although the division might in case of an x with
sufficiently large magnitude).

> 
> I believe that for doubles, this needs to be
> 
> 	0x1p-1074

i don't remember the reason why i wrote the code the way it is,
probably i wanted to avoid subnormal range division as it can
be a lot slower than subnormal range cast (but this code will
be slow either way which is ok since it's a rare special case)
it can also save space on hw that can load float values into
double (powerpc?), but really this is just bikeshedding.

in any case i plan to completely rewrite exp and other
important math functions:

https://www.openwall.com/lists/musl/2018/12/08/1

that implementation is >2x faster in general than the old
fdlibm code as well as more precise.



* Re: Possible Mistype in exp.c
  2019-01-29 11:01 ` Szabolcs Nagy
@ 2019-01-29 11:26   ` Damian McGuckin
  2019-01-29 11:43     ` Szabolcs Nagy
  0 siblings, 1 reply; 18+ messages in thread
From: Damian McGuckin @ 2019-01-29 11:26 UTC (permalink / raw)
  To: musl


On Tue, 29 Jan 2019, Szabolcs Nagy wrote:

> it's not the division that's supposed to raise the exception,
> but the cast (although the division might in case of an x with
> sufficiently large magnitude).

I did not think about that. An interesting approach.

> in any case i plan to completely rewrite exp and other
> important math functions:
>
> https://www.openwall.com/lists/musl/2018/12/08/1

Thanks for the pointer.

> that implementation is >2x faster in general than the old fdlibm code as 
> well as more precise.

It is hard to argue with 2x faster. Mind you, I certainly found the ARM 
code much harder to read. Interesting nonetheless.

Regards - Damian




* Re: Possible Mistype in exp.c
  2019-01-29 11:26   ` Damian McGuckin
@ 2019-01-29 11:43     ` Szabolcs Nagy
  2019-01-30  1:18       ` Damian McGuckin
  0 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2019-01-29 11:43 UTC (permalink / raw)
  To: musl; +Cc: Damian McGuckin

* Damian McGuckin <damianm@esi.com.au> [2019-01-29 22:26:23 +1100]:
> On Tue, 29 Jan 2019, Szabolcs Nagy wrote:
> 
> > it's not the division that's supposed to raise the exception,
> > but the cast (although the division might in case of an x with
> > sufficiently large magnitude).
> 
> I did not think about that. An interesting approach.
> 
> > in any case i plan to completely rewrite exp and other
> > important math functions:
> > 
> > https://www.openwall.com/lists/musl/2018/12/08/1
> 
> Thanks for the pointer.
> 
> > that implementation is >2x faster in general than the old fdlibm code as
> > well as more precise.
> 
> It is hard to argue with 2x faster. Mind you, I certainly found the ARM code
> much harder to read. Interesting nonetheless.

i did some cleanups compared to the arm repository:

https://www.openwall.com/lists/musl/2018/12/08/3/7

so e.g. some of the ifdef config variants got removed.

if you have some particular idea what could be improved
then let us know.

(the special case handling is more complicated because
it tries to avoid double rounding in the subnormal range,
the old code had >0.75 ulp error in this case, the new
code is always < 0.6 ulp error, it is probably not worth
the trouble, but i tried to guarantee "almost always
correctly rounded and only near half-way results may get
incorrectly rounded" for the new implementations)



* Re: Possible Mistype in exp.c
  2019-01-29 11:43     ` Szabolcs Nagy
@ 2019-01-30  1:18       ` Damian McGuckin
  2019-01-30  9:37         ` Szabolcs Nagy
  0 siblings, 1 reply; 18+ messages in thread
From: Damian McGuckin @ 2019-01-30  1:18 UTC (permalink / raw)
  To: musl


On Tue, 29 Jan 2019, Szabolcs Nagy wrote:

>>> https://www.openwall.com/lists/musl/2018/12/08/1

For some reason I missed this on the list.

You were certainly busy integrating this.

Is there any documentation on the Remez polynomials and the associated 
argument reductions done?  Or is it a case of 'read the code'. ARM have 
certainly done a lot of work. It will be really interesting when that CPU 
gets a lot closer to the high end Xeon.

Regards - Damian




* Re: Possible Mistype in exp.c
  2019-01-30  1:18       ` Damian McGuckin
@ 2019-01-30  9:37         ` Szabolcs Nagy
  2019-01-30 11:14           ` Damian McGuckin
  2019-01-30 12:56           ` Possible Mistype in exp.c Damian McGuckin
  0 siblings, 2 replies; 18+ messages in thread
From: Szabolcs Nagy @ 2019-01-30  9:37 UTC (permalink / raw)
  To: musl; +Cc: Damian McGuckin

* Damian McGuckin <damianm@esi.com.au> [2019-01-30 12:18:54 +1100]:
> On Tue, 29 Jan 2019, Szabolcs Nagy wrote:
> 
> > > > https://www.openwall.com/lists/musl/2018/12/08/1
> 
> For some reason I missed this on the list.
> 
> You were certainly busy integrating this.
> 
> Is there any documentation on the Remez polynomials and the associated
> argument reductions done?  Or is it a case of 'read the code'. ARM have
> certainly done a lot of work. It will be really interesting when that CPU
> gets a lot closer to the high end Xeon.

i work at arm and i did this work there so i know the background.

there is not much published about the algorithms (although there aren't
many new ideas in the code, it's mostly about optimizing for modern cpus),
but i plan to at least put the polynomial and lookup table generation
code in that repo.

the polynomial coefficients were generated with the sollya tool from
inria.



* Re: Possible Mistype in exp.c
  2019-01-30  9:37         ` Szabolcs Nagy
@ 2019-01-30 11:14           ` Damian McGuckin
  2019-01-30 12:19             ` Szabolcs Nagy
  2019-01-30 12:56           ` Possible Mistype in exp.c Damian McGuckin
  1 sibling, 1 reply; 18+ messages in thread
From: Damian McGuckin @ 2019-01-30 11:14 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: musl

On Wed, 30 Jan 2019, Szabolcs Nagy wrote:

> i work at arm and i did this work there so i know the background.

That explains everything.

> there is not much published about the algorithms (although there aren't 
> many new ideas in the code,

Sadly, a lot of the existing ideas are not documented well. Even the texts 
on this topic seem to be lacking all the information needed to actually 
implement a fast low-level routine. They cover fundamentals well, and 
important topics like error analysis, but omit the little details and 
basic techniques that are needed to make it fast and still achieve that 
1*ULP (or better) error bound.

> it's mostly about optimizing for modern cpus), but i plan to at least 
> put the polynomial and lookup table generation code in that repo.

Thanks. For some reason, I have this aversion to table lookup because 
suddenly there are lots more numbers hanging about. But, it certainly
helps to achieve optimal performance. I had a look at the GLIBC exp.c
which comes out of the Ultimate Library done by IBM, and I find it
impossible to read/comprehend, especially with its table lookups.

> the polynomial coefficients were generated with the sollya tool from 
> inria.

Yes, nice tool.  What range reduction is used for those routines and why 
are there several polynomials?  Just curious.

Regards - Damian




* Re: Possible Mistype in exp.c
  2019-01-30 11:14           ` Damian McGuckin
@ 2019-01-30 12:19             ` Szabolcs Nagy
  2019-01-30 12:44               ` Alexander Monakov
  0 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2019-01-30 12:19 UTC (permalink / raw)
  To: musl

* Damian McGuckin <damianm@esi.com.au> [2019-01-30 22:14:01 +1100]:
> On Wed, 30 Jan 2019, Szabolcs Nagy wrote:
> > there is not much published about the algorithms (although there aren't
> > many new ideas in the code,
> 
> Sadly, a lot of the existing ideas are not documented well. Even the texts
> on this topic seem to be lacking all the information needed to actually
> implement a fast low-level routine. They cover fundamentals well, and
> important topics like error analysis, but omit the little details and basic
> techniques that are needed to make it fast and still achieve that 1*ULP
> (or better) error bound.

i found "Jean-Michel Muller - Elementary Functions" useful for the math.

for < 1 ulp the main trick is computing the result as x+y where x is
exact and y is inexact but much smaller than x in magnitude. (this is
the only way to get worst case error < 1 ulp with an fp operation that
does a rounding and at least one of the inputs is inexact.)

the bigger the distance between x and y in magnitude, the closer you
get to 0.5 ulp (rounding error of the final add), and to get exact x
you sometimes need exact add and mul algorithms.

> > it's mostly about optimizing for modern cpus), but i plan to at least
> > put the polynomial and lookup table generation code in that repo.
> 
> Thanks. For some reason, I have this aversion to table lookup because
> suddenly there are lots more numbers hanging about. But, it certainly
> helps to achieve optimal performance. I had a look at the GLIBC exp.c
> which comes out of the Ultimate Libray done by IBM and I find that
> inpossible to read/comprehend especially with its table lookups.

table lookup is needed

1) to make the approximation efficient:

the main reason why fdlibm is slow is that it prefers using div or
large polynomials to table lookups. (the other reason is that it
uses a long chain of dependency between operations, while modern
cpus can evaluate many operations in parallel if there are no
dependencies between them)

2) and to get close to 0.5 ulp errors:

usually there is a polynomial evaluation like c0 + c1*x + ... where
c1*x must be much smaller than c0 to get close to 0.5 ulp error,
which requires reducing x into a very small range around 0, which
usually requires a table lookup (otherwise several terms in the
polynomial would need to be evaluated with multi-precision tricks
which is a lot of operations).

> > the polynomial coefficients were generated with the sollya tool from
> > inria.
> 
> Yes, nice tool.  What range reduction is used for those routines and why are
> there several polynomials?  Just curious.

the comment in the code says

  /* exp(x) = 2^(k/N) * exp(r), with exp(r) in [2^(-1/2N),2^(1/2N)].  */
  /* x = ln2/N*k + r, with int k and r in [-ln2/2N, ln2/2N].  */

so k = toint(N/ln2*x); r = x - ln2/N*k; to compute 2^(k/N) a lookup
table is used (N is the table size), and exp(r) is a polynomial.

the different polynomials were computed for different precision and table
size requirements and because some targets lack efficient toint in
non-nearest rounding mode so the reduction interval is larger.
in the end the same polynomial is used in glibc, bionic, newlib and
musl (where the algorithms were upstreamed) and the polynomial
error is small enough in the wider interval so the non-nearest toint
never gives more than 1 ulp error in non-nearest rounding mode.



* Re: Possible Mistype in exp.c
  2019-01-30 12:19             ` Szabolcs Nagy
@ 2019-01-30 12:44               ` Alexander Monakov
  2019-01-30 14:10                 ` Szabolcs Nagy
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Monakov @ 2019-01-30 12:44 UTC (permalink / raw)
  To: musl

On Wed, 30 Jan 2019, Szabolcs Nagy wrote:
> for < 1 ulp the main trick is computing the result as x+y where x is
> exact and y is inexact but much smaller than x in magnitude. (this is
> the only way to get worst case error < 1 ulp with an fp operation that
> does a rounding and at least one of the inputs is inexact.)

As I recall this was covered in your GNU Cauldron presentation.
Can you upload the slides of that talk to the GCC wiki?

(I can help with adding your account to EditorGroup if needed
or upload the slides on your behalf if you send me the pdf)

Thanks.
Alexander



* Re: Possible Mistype in exp.c
  2019-01-30  9:37         ` Szabolcs Nagy
  2019-01-30 11:14           ` Damian McGuckin
@ 2019-01-30 12:56           ` Damian McGuckin
  2019-01-31  0:04             ` Szabolcs Nagy
  1 sibling, 1 reply; 18+ messages in thread
From: Damian McGuckin @ 2019-01-30 12:56 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: musl


I have Muller's book. Good for the math. Not for any ideas about the 
actual implementation.

As a matter of interest, what was the benchmark against which you get a 2x 
speed gain?

I got 1.75 against GLIBC, for what that is worth. I used a faster scaling 
routine. But I was not chasing an improved ULP performance like you were, as 
that was too much extra work. Your work there sounds like seriously smart 
stuff to me.

I used super-scalar friendly code which adds an extra multiplication. It 
made a minuscule net benefit on the Xeons (not Xeon Gold).

I had 2 versions of the fast scaling routine replacing ldexp. One used a 
single ternary if/then/else and the other grabbed the sign and did a table 
lookup, which meant one extra multiplication all the time but no branches. 
The one extra multiplication instead of a branch in the 2-line scaling 
routine made no difference.

I saw a tiny but measurable difference when I used a Xeon with an FMA 
compared to one which did not.

Discarding the last term in the SUN routine, a net loss of one 
multiplication, still made no serious difference to the timing and, of 
course, the results were affected.

My timings showed on my reworked code (for doubles)

 	21+%	the preliminary comparisons
 	43+%	polynomial computation super-scalar friendly way
 	35+%	y = 1 + (x*c/(2-c) - lo + hi);
 		return k == 0 ? y : scalbn-FAST(y, k);

I have slightly increased the work load in the comparisons because I avoid 
pulling 'x' apart into 'hx'. I used only doubles or floats.

Regards - Damian




* Re: Possible Mistype in exp.c
  2019-01-30 12:44               ` Alexander Monakov
@ 2019-01-30 14:10                 ` Szabolcs Nagy
  2019-02-01 23:26                   ` Morten Welinder
  0 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2019-01-30 14:10 UTC (permalink / raw)
  To: musl

* Alexander Monakov <amonakov@ispras.ru> [2019-01-30 15:44:23 +0300]:
> On Wed, 30 Jan 2019, Szabolcs Nagy wrote:
> > for < 1 ulp the main trick is computing the result as x+y where x is
> > exact and y is inexact but much smaller than x in magnitude. (this is
> > the only way to get worst case error < 1 ulp with an fp operation that
> > does a rounding and at least one of the inputs is inexact.)
> 
> As I recall this was covered in your GNU Cauldron presentation.
> Can you upload the slides of that talk to the GCC wiki?
> 
> (I can help with adding your account to EditorGroup if needed
> or upload the slides on your behalf if you send me the pdf)

please upload it on my behalf if you can:
http://port70.net/~nsz/articles/other/cauldron2018-slides.pdf
(that presentation did not go into algorithmic details though)

for others:
https://gcc.gnu.org/wiki/cauldron2018#Slides.2C_Videos_and_Notes



* Re: Possible Mistype in exp.c
  2019-01-30 12:56           ` Possible Mistype in exp.c Damian McGuckin
@ 2019-01-31  0:04             ` Szabolcs Nagy
  2019-01-31  0:39               ` Damian McGuckin
  0 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2019-01-31  0:04 UTC (permalink / raw)
  To: musl

* Damian McGuckin <damianm@esi.com.au> [2019-01-30 23:56:05 +1100]:
> As a matter of interest, what was the benchmark against which you get a 2x
> speed gain?

i benchmarked a tight loop around a call where
(1) calls are independent so can be evaluated in parallel
(2) calls depend on the previous iteration result.
(so (1) measures maximum call throughput and (2) measures latency)

usually (1) can be improved significantly over old math code.
and such usage (e.g. calling a single function over an array) is
common (e.g. in fortran code) so rewriting old code is useful.

> I got 1.75 against GLIBC, for what that is worth. I used a faster scaling
> routine. But I was not chasing an improved ULP performance like you were, as
> that was too much extra work. Your work there sounds like seriously smart
> stuff to me.

glibc <= 2.26 is old fdlibm and ibm libultim code
glibc 2.27 has single prec improvements
glibc 2.28 has double prec correct rounding slow paths removed
glibc 2.29 will use my code

so glibc version matters when you benchmark.

i did most of my benchmarks on various 64bit arm cores.
(which e.g. always have fma and int rounding instructions,
so some things are designed differently than on x86, but
the isa does not matter too much, the cpu internals matter
more and those seem to be similar across targets)

> I used super-scalar friendly code which adds an extra multiplication. It
> made a minuscule net benefit on the Xeons (not Xeon Gold).
> 
> I had 2 versions of the fast scaling routine replacing ldexp. One used a
> single ternary if/then/else and the other grabbed the sign and did a table
> lookup, which meant one extra multiplication all the time but no branches.
> The one extra multiplication instead of a branch in the 2-line scaling
> routine made no difference.
> 
> I saw a tiny but measurable difference when I used a Xeon with an FMA
> compared to one which did not.
> 
> Discarding the last term in the SUN routine, a net loss of one
> multiplication, still made no serious difference to the timing and, of
> course, the results were affected.
> 
> My timings showed on my reworked code (for doubles)
> 
> 	21+%	the preliminary comparisons
> 	43+%	polynomial computation super-scalar friendly way
> 	35+%	y = 1 + (x*c/(2-c) - lo + hi);
> 		return k == 0 ? y : scalbn-FAST(y, k);

here the killer is the division really.

scalbn(1+p, k) can be s=2^k; s + s*p; (single fma, just be
careful not to overflow s), but using a table lookup instead
of div should help the most.

> I have slightly increased the work load in the comparisons because I avoid
> pulling 'x' apart into 'hx'. I used only doubles or floats.



* Re: Possible Mistype in exp.c
  2019-01-31  0:04             ` Szabolcs Nagy
@ 2019-01-31  0:39               ` Damian McGuckin
  0 siblings, 0 replies; 18+ messages in thread
From: Damian McGuckin @ 2019-01-31  0:39 UTC (permalink / raw)
  To: musl

On Thu, 31 Jan 2019, Szabolcs Nagy wrote:

> i did most of my benchmarks on various 64bit arm cores.
> (which e.g. always have fma and int rounding instructions,
> so some things are designed differently than on x86, but
> the isa does not matter too much, the cpu internals matter
> more and those seem to be similar across targets)

I was/am impressed, especially by the accuracy improvement.

Thanks for indulging me with the explanations.

Regards - Damian




* Re: Possible Mistype in exp.c
  2019-01-30 14:10                 ` Szabolcs Nagy
@ 2019-02-01 23:26                   ` Morten Welinder
  2019-02-02  3:39                     ` Szabolcs Nagy
  0 siblings, 1 reply; 18+ messages in thread
From: Morten Welinder @ 2019-02-01 23:26 UTC (permalink / raw)
  To: musl

Is this code guaranteed to produce correctly-rounded results (like crlibm)?

I ask because having different machines produce different results is a
no-go for my use case.  The preferred way of solving that is
correctly-rounded results, but overriding libc's exp with a specific,
reasonable implementation isn't out of the question.

Also, what is the state of tests for the code?

Morten



* Re: Possible Mistype in exp.c
  2019-02-01 23:26                   ` Morten Welinder
@ 2019-02-02  3:39                     ` Szabolcs Nagy
  2019-02-03  2:05                       ` Morten Welinder
  2019-04-03  1:13                       ` Floating Point Accuracy Damian McGuckin
  0 siblings, 2 replies; 18+ messages in thread
From: Szabolcs Nagy @ 2019-02-02  3:39 UTC (permalink / raw)
  To: musl

* Morten Welinder <mwelinder@gmail.com> [2019-02-01 18:26:59 -0500]:
> Is this code guaranteed to produce correctly-rounded results (like crlibm)?
> 
> I ask because having different machines produce different results is a
> no-go for my use case.  The preferred way of solving that is
> correctly-rounded results, but overriding libc's exp with a specific,
> reasonable implementation isn't out of the question.

libc implementations normally do not guarantee portable behaviour.

correct rounding is not practical for the libc. (i think it can have
similar average performance to an almost-cr implementation, but the
worst case will be 100x slower and it will require many big tables.
glibc used to be an exception: it had several double precision
functions impemented with correct rounding in nearest rounding mode,
but they were hard to maintain and nobody liked the >10000x latency
spikes, so it was decided long ago that cr should not be the goal
of glibc. meanwhile the ts 18661-4 spec introduced separate cr
prefixed names for correctly rounded math functions so if glibc
ever gets cr implementations again they will use different names.)

my new implementations are generic c code, and it's possible to get
portable results with the fp semantics of c across targets which
implement annex F with FLT_EVAL_METHOD==0: you have to turn fp
contraction off, disable builtins and other compiler smartness
(musl already does these) and remove any target specific ifdefs or
asm that may introduce non-portable results (in the new code only the
__FP_FAST_FMA checks).
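with gcc or clang that roughly corresponds to flags like these (the exact set musl uses may differ, and spellings vary by compiler version):

```shell
# pin down C floating-point semantics for a translation unit:
# no contracting a*b+c into fma, no builtin math folding
CFLAGS="-std=c99 -ffp-contract=off -fno-builtin"
```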

that said, in practice there may be many reasons why a libc does not
provide completely portable behaviour:

- long double cannot be made portable since the format is different
  across targets.

- there are FLT_EVAL_METHOD==2 targets (musl supports i386 and m68k)
  which won't give the same results as ieee754 targets (without a
  correctly-rounding implementation).

- musl does not support other non-ieee754 targets, but they exist
  (and sometimes quite popular e.g. the flush-to-zero fpu mode).
  supporting such targets may become interesting at some point but
  portable results cannot be guaranteed.

- a target may have its own asm implementation if it has special
  instructions (currently only i386 has asm that may give different
  results).

- nan results may be different across targets since they are not
  fully specified by ieee754. there are other target variations such
  as when underflow is raised. so even on ieee754 targets, exactly
  matching behaviour is not always possible.

- fenv status may not be the same across targets (or compilers)
  because musl (and c99) does not care about inexact in many cases
  (so e.g. const folding of (int)1.2 may change it) and because
  some compilers don't support fenv access (e.g. clang, although
  the new code tries to fix that up as much as possible).

- compiler can introduce different results in user code anyway: libc
  only controls the libc cflags, not what's used for user code and
  e.g. gcc can const-fold math calls with correct rounding if it
  knows the inputs.

- not all targets have fma instruction, but fma usage can give
  significantly better performance on some targets and it's not nice
  to make things slower because of other targets. emulating fma is
  not realistic on non-fma targets (slow), but all modern cpus that
  are used for heavy numeric computation have fma so within those
  targets the results can be made portable without compromising
  performance. (it requires some changes to the new code not to rely
  on contraction for optimal fma usage which is non-deterministic
  across compilers)

- int rounding instructions can also make a significant difference
  in performance or quality if non-nearest rounding modes need to be
  supported (e.g. i currently don't have a satisfactory solution for
  >= double precision trigonometric argument reduction on targets
  that have no rounding mode independent nearest toint instruction).
  the fdlibm code is broken in non-nearest rounding modes, the fix i
  have is expensive and breaks useful properties, such fix is not
  necessary on some targets if we allow different results.

- i don't know other instructions that make a big difference across
  targets, but new ones may come up (e.g. 1/x and 1/sqrt estimate
  instructions may be useful in some algorithms, or target specific
  cutoff values depending on faster int vs fp cmp) and then someone
  may want to provide target specific optimizations and the libc
  has to evaluate the maintenance cost vs benefits.

> Also, what is the state of tests for the code?

there are no universal tests for math functions, i still maintain
detailed special case checks in libc-test, i ran statistical tests
that verify that the calculated ulp error bounds hold over many
millions of inputs and the glibc math tests pass (with and without
fma). there are a few regressions (e.g. exp2(-1023) raises underflow,
pow(4,1.5) raises inexact exception with the new code) and several
improvements (mostly precision and non-nearest rounding behaviour).



* Re: Possible Mistype in exp.c
  2019-02-02  3:39                     ` Szabolcs Nagy
@ 2019-02-03  2:05                       ` Morten Welinder
  2019-02-03  3:07                         ` Damian McGuckin
  2019-04-03  1:13                       ` Floating Point Accuracy Damian McGuckin
  1 sibling, 1 reply; 18+ messages in thread
From: Morten Welinder @ 2019-02-03  2:05 UTC (permalink / raw)
  To: musl

Thanks.

> there are no universal tests for math functions [...]

I'm painfully aware of that, :-/  However, there are a few collections
of test values out there.  Of varying size and quality.

crlibm has fairly large tables of test values.  See, for example,
https://github.com/taschini/crlibm/blob/master/tests/sin.testdata

John Burkardt  has a pile at
http://people.sc.fsu.edu/~jburkardt/c_src/test_values/test_values.c
These tests are not ieee-854 based, but base 10.  For arctanh, for
example, the test value 0.999999 really is supposed to be fraction
999999/1000000 and not whatever that rounds to as a double.  That
reference value reflects that.

Testing is important.  Cases in point: crlibm had a crazy return value
for expm1( 2 * -2.079401950488110);  glibc 2.23 was released with a
broken cos function.

Morten



* Re: Possible Mistype in exp.c
  2019-02-03  2:05                       ` Morten Welinder
@ 2019-02-03  3:07                         ` Damian McGuckin
  0 siblings, 0 replies; 18+ messages in thread
From: Damian McGuckin @ 2019-02-03  3:07 UTC (permalink / raw)
  To: musl


Just for reference, I have summarized the errors from a refactored version
of exp(x) from MUSL 1.1.21, the SUN fdlibm variant. Your table-driven code
will run faster than my rework and have smaller errors than this rewrite of
exp.c.  But I thought I would just put this error summary out there.

I took the original, made the polynomial calculation more super-scalar
friendly, improved the comparisons at the start to reduce their average
impact which also means I could trash using GET_HIGH_WORD, and finally
discarded the scalbn at the end of exp(x), and computed the final result
of exp(x) as

 	return h * f * (one + y)

where y is the variable in the existing MUSL exp(x). After your wise
suggestion, I have changed it to

 	return h * (f + f * y)

FMA or non-FMA seems to make little difference.

The 'f' is a constant of the form 2^j, either 2^(+(p+2)) or 2^(-(p+2))
where 'p' is the precision, depending solely on the sign of 'k' from
the MUSL routine so either 0x1.0p+55 or 0x1.0p-55. No branching is
used in its computation. That 'h' is a normal number of the form 2^i
where i is equal to k-j, i.e. k as per MUSL and 'f' from above, with
the exponent i lying within the range [-1020,+969].

I compare against the results of GLIBC's expl(x). I hope that is OK.

 	exp(x) for x in [-745.0..600.0] by (1.25e-06) 1073741824 cases
 	*epsilon  frequency  %of-total   seen within
 	[0.0,0.1) 307561520  28.64390%  -745 .. +600
 	[0.1,0.2) 304143157  28.32554%  -744 .. +600
 	[0.2,0.3) 231766165  21.58491%  -744 .. +600
 	[0.3,0.4) 146794197  13.67127%  -744 .. +600
 	[0.4,0.5)  80455276   7.49298%  -744 .. +600
 	[0.5,0.6)   2932759   0.27313%  -723 .. +600
 	[0.6,0.7)     80629   0.00751%  -710 .. +600
 	[0.7,0.8)      8121   0.00076%  -709 .. -708
 	0.8+above         0   0.00000%  +N/A .. -N/A
 	WORST ERROR < 0.8*ULP

There is no problem with the error in exp(x) for x > +600, so I
stopped including that range in my tests.

I never compare exp(x) for x < -745. In fact, exp(-745.00) returns
0x1.0p-1074 whereas expl(x) returns 0.57125014747105418 times that.
As 0.57125 will round up to 1.0, I treat them as matching anyway.
Note that my revised exp(x) underflows destructively to zero at a
value of x = -745.133219101941165264, i.e. calling the refactored
version exp', I see

 	exp'(-745.133219101941165263) = 4.94065646e-324 = 0x1.0p-1074;
 	exp'(-745.133219101941165264) = 0.0 (oops, too tiny)
while
 	expl(-745.133219101941165263) = 0.50000000000002121 * 0x1.0p-1074
 	expl(-745.133219101941165264) = 0.50000000000002121 * 0x1.0p-1074

This is better than in the original version, but still not quite at the
theoretical limit, which is -745.133219101941207623 = -1075 * ln(2.0).

Even though I see correct rounding at the lower limit, I do not explicitly
try to perform correct rounding, as the old version did not either.

Also, I found that the computation time varies little between an early
model Xeon E5-1660 @ 3.3 GHz and a Xeon E5-2650-v4 @ 2.1 GHz. Interesting.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Floating Point Accuracy
  2019-02-02  3:39                     ` Szabolcs Nagy
  2019-02-03  2:05                       ` Morten Welinder
@ 2019-04-03  1:13                       ` Damian McGuckin
  1 sibling, 0 replies; 18+ messages in thread
From: Damian McGuckin @ 2019-04-03  1:13 UTC (permalink / raw)
  To: musl


Hi,

On Sat, 2 Feb 2019, Szabolcs Nagy wrote:

> correct rounding is not practical for the libc. (i think it can have 
> similar average performance to an almost-cr implementation, but the 
> worst case will be 100x slower and it will require many big tables. 
> glibc used to be an exception: it had several double precision functions 
> implemented with correct rounding in nearest rounding mode, but they were 
> hard to maintain and nobody liked the >10000x latency spikes, so it was 
> decided long ago that cr should not be the goal of glibc.

While I understand how hard it is to do CR, what are the current accuracy 
goals for libc? Are they documented? Specifically, does it guarantee

 	under 1.5*ULP or under 2*ULP?

I have read some claims that it is ALWAYS under 1*ULP, which I used to 
believe. But that is just wrong.

Is this documented anywhere for MUSL please?

Or, is there a test suite to check for this?
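As a rough sketch of the measurement such a check needs, ulp error against a
higher-precision reference can be computed as below (the helper is my own,
not libc-test, and it assumes long double is wider than double):

```c
#include <math.h>

/* Error of a double result y against a higher-precision reference
   r, measured in ulps of the reference's binade.  A rough sketch
   only: the ulp at exact powers of two is taken from the binade
   above, and it bottoms out at 2^-1074 in the subnormal range. */
static double ulp_error(double y, long double r)
{
	int e;
	frexpl(r, &e);            /* r = m * 2^e with 0.5 <= |m| < 1 */
	if (e - 53 < -1074)
		e = -1074 + 53;   /* ulp floors in the subnormals */
	long double ulp = ldexpl(1.0L, e - 53);
	return (double)(fabsl((long double)y - r) / ulp);
}
```

Evaluating ulp_error(exp(x), expl((long double)x)) over many x is exactly
the kind of sweep that produces the error histogram shown earlier in the
thread.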

> meanwhile the ts 18661-4 spec introduced separate cr prefixed names for 
> correctly rounded math functions so if glibc ever gets cr 
> implementations again they will use different names.)

Yes. I need to keep on top of these experimental C libraries. Is MUSL on 
the formal committee for this at all? Just curious.

> my new implementations are generic c code, and it's possible to get 
> portable results with the fp semantics of c across targets which 
> implement annex F with FLT_EVAL_METHOD==0: you have to turn fp 
> contraction off, disable builtins and other compiler smartness (musl 
> already does these) and remove any target specific ifdefs or asm that 
> may introduce non-portable results (in the new code only the 
> __FP_FAST_FMA checks).

Sometimes I curse FLT_EVAL_METHOD and sometimes I adore it. How is GLIBC 
built in this context, or does this vary between Linux, AIX, and Solaris? 
How do most people build/use MUSL? As somebody who often uses libraries 
from, say, Chapel or even Fortran, anything other than FLT_EVAL_METHOD==0 
can lead to interesting results.

> that said, in practice there may be many reasons why a libc does not
> provide completely portable behaviour:
>
> - long double cannot be made portable since the format is different
>  across targets.

Besides the new Power9 CPUs and IBM's big iron, what other CPUs support 
128-bit IEEE 754 arithmetic and integers in hardware? Also, the same 
question about support at the assembler level in software, as in the 
Sparc CPUs. Any plans for that on ARM that you can share?

You mention,

> - int rounding instructions can also make a significant difference
>  in performance or quality if non-nearest rounding modes need to be
>  supported (e.g. i currently don't have a satisfactory solution for
>  >= double precision trigonometric argument reduction on targets
>  that have no rounding mode independent nearest toint instruction).
>  the fdlibm code is broken in non-nearest rounding modes, the fix i
>  have is expensive and breaks useful properties,

Is this documented in any of the MUSL routines? I would be interested
to know the details.

> such fix is not necessary on some targets if we allow different results.
>
> - i don't know other instructions that make a big difference across
>  targets, but new ones may come up (e.g. 1/x and 1/sqrt estimate
>  instructions may be useful in some algorithms,

Until they are universal, they just complicate things. Do we need portable 
table lookups to provide these solutions to the masses?

>  or target specific cutoff values depending on faster int vs fp cmp)

Playing with 'remquo' a while ago, I was amazed at the speed of an integer 
comparison of 2 significands. It was so much faster than an FP equivalent.
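A minimal illustration of that kind of integer comparison (a sketch of my
own; it assumes IEEE-754 doubles and finite, non-NaN inputs):

```c
#include <stdint.h>
#include <string.h>

/* For finite, non-NaN IEEE-754 doubles, the ordering of |x| and
   |y| matches the ordering of their bit patterns with the sign
   bit masked off, so one integer compare replaces an FP compare
   of the magnitudes. */
static int abs_cmp(double x, double y)
{
	uint64_t ux, uy;
	memcpy(&ux, &x, sizeof ux);  /* type-pun without UB */
	memcpy(&uy, &y, sizeof uy);
	ux &= ~(1ULL << 63);         /* drop the sign bits */
	uy &= ~(1ULL << 63);
	return (ux > uy) - (ux < uy);  /* -1, 0 or +1 */
}
```

This works because, sign aside, the exponent field sits above the
significand, so the encoding is monotonic in magnitude.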

>  and then someone may want to provide target specific optimizations and
>  the libc has to evaluate the maintenance cost vs benefits.

> there are no universal tests for math functions, i still maintain
> detailed special case checks in libc-test,

Where is that located please?

Is MUSL used much on hardware like the TI signal-processing chips which 
have only 32-bit IEEE floats? Those chips do have a non-IEEE754 extended 
40-bit float, which is just 8 extra bits in the significand.

Thanks again - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-04-03  1:13 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-28  6:59 Possible Mistype in exp.c Damian McGuckin
2019-01-29 11:01 ` Szabolcs Nagy
2019-01-29 11:26   ` Damian McGuckin
2019-01-29 11:43     ` Szabolcs Nagy
2019-01-30  1:18       ` Damian McGuckin
2019-01-30  9:37         ` Szabolcs Nagy
2019-01-30 11:14           ` Damian McGuckin
2019-01-30 12:19             ` Szabolcs Nagy
2019-01-30 12:44               ` Alexander Monakov
2019-01-30 14:10                 ` Szabolcs Nagy
2019-02-01 23:26                   ` Morten Welinder
2019-02-02  3:39                     ` Szabolcs Nagy
2019-02-03  2:05                       ` Morten Welinder
2019-02-03  3:07                         ` Damian McGuckin
2019-04-03  1:13                       ` Floating Point Accuracy Damian McGuckin
2019-01-30 12:56           ` Possible Mistype in exp.c Damian McGuckin
2019-01-31  0:04             ` Szabolcs Nagy
2019-01-31  0:39               ` Damian McGuckin

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).