the freebsd fma code failed to raise underflow exception in some
cases in nearest rounding mode (affects fmal too) e.g.

  fma(-0x1p-1000, 0x1.000001p-74, 0x1p-1022)

and the inexact exception may be raised spuriously since the fenv
is not saved/restored around the exact multiplication algorithm
(affects x86 fma too).

another issue is that the underflow behaviour when the rounded result
is the minimal normal number is target dependent, ieee754 allows two
ways to raise underflow for inexact results: raise if the result before
rounding is in the subnormal range (e.g. aarch64, arm, powerpc) or if
the result after rounding with infinite exponent range is in the
subnormal range (e.g. x86, mips, sh).

to avoid all these issues the algorithm was rewritten with mostly int
arithmetics and float arithmetics is only used to get correct rounding
and raise exceptions according to the behaviour of the target without
any fenv.h dependency. it also unifies x86 and non-x86 fma.

fmaf is not affected, fmal need to be fixed too.

this algorithm depends on a_clz_64 and it required a nasty volatile
hack: gcc seems to miscompile the FORCE_EVAL macro of libm.h on i386.
---
 src/math/fma.c | 582 ++++++++++++++++-----------------------------------------
 1 file changed, 158 insertions(+), 424 deletions(-)

attaching the new fma.c instead of a diff, it's more readable.

depends on the a_clz_64 patch and previous scalbn fix.

fmal should be possible to do in a similar way.

i expect it to be faster than the previous code on most targets as
the rounding mode is not changed and has less multiplications
(it is faster on x86_64 and i386), the code size is a bit bigger
though.