From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/13859
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Szabolcs Nagy <nsz@port70.net>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [PATCH 00/18]  math updates
Date: Sun, 24 Feb 2019 22:00:31 +0100
Message-ID: <20190224210016.GA21289@port70.net>
References: <20181208125009.GY21289@port70.net>
 <20190223233602.GO23599@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="51900"; mail-complaints-to="usenet@blaine.gmane.org"
User-Agent: Mutt/1.10.1 (2018-07-13)
To: musl@lists.openwall.com
Original-X-From: musl-return-13875-gllmg-musl=m.gmane.org@lists.openwall.com Sun Feb 24 22:00:49 2019
Return-path: <musl-return-13875-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.89)
	(envelope-from <musl-return-13875-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1gy0tC-000DJZ-4s
	for gllmg-musl@m.gmane.org; Sun, 24 Feb 2019 22:00:46 +0100
Original-Received: (qmail 1253 invoked by uid 550); 24 Feb 2019 21:00:44 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 1235 invoked from network); 24 Feb 2019 21:00:43 -0000
Mail-Followup-To: musl@lists.openwall.com
Content-Disposition: inline
In-Reply-To: <20190223233602.GO23599@brightrain.aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:13859
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/13859>

* Rich Felker <dalias@libc.org> [2019-02-23 18:36:02 -0500]:
> On Sat, Dec 08, 2018 at 01:50:10PM +0100, Szabolcs Nagy wrote:
> > add new code from
> > 
> >  https://github.com/ARM-software/optimized-routines
> 
> I think when I merge this, I should include a sha1 hash to indicate
> what revision it was based on. This documents both code history and
> license change. There should also be an update to the COPYRIGHT file
> indicating ARM copyright w/MIT license on new math code.

ok

> > with small modifications:
> > - remove all code paths related to errno handling (WANT_ERRNO cases)
> > - reformat with clang-format to linux style instead of gnu style.
> > - drop non-default configs (polynomial and table size settings)
> > - remove SNaN support (but keep the code so adding it back is easy)
> > - use __FP_FAST_FMA feature test with __builtin_fma
> 
> I think this should also be conditioned on defined(__GNUC__) or
> __GNUC__>=something or similar. It's not clear that __FP_FAST_FMA is
> supposed to imply that. But it looks like it's also used to condition
> presence of data in headers, so perhaps libm.h should define something
> indicating whether to use __builtin_fma to put the logic all in one
> place.

in gcc __FP_FAST_FMA is only defined if __builtin_fma is
inlined as a single instruction, i assumed other
implementations would not define __FP_FAST_FMA unless
they also supported __builtin_fma and able to inline it.

(clang does not define it, icc seems to, they both support
__builtin_fma with inlining)

> > - some macros got renamed: barriers, unlikely, HIDDEN
> > - kept TOINT_INTRINSICS code paths, but it's never set
> >   (requires __builtin_round and __builtin_lround support as single insn)
> 
> That's ok. Do you know if it helps enough that we should consider
> adding these later?

the generic c solution is problematic for non-nearest
rounding mode, because there is no way to efficiently
compute a remainder that works in all rounding modes:

  r = x - toint(x) // r in [-0.5, 0.5]

e.g. i don't yet know how to do this reasonably for
double prec trig functions.

it turns out in exp/exp2/pow it's ok to have r in [-1,1]
if we accept slightly worse quality for non-nearest rm.
then x+Shift-Shift works as toint and it should not be
much slower than a toint(x) instruction (the latency
will be a bit higher, i'd say ~5% for an exp call)

but in trig functions you really want r in
[-0.5-tiny, 0.5+tiny], and you also want -x to give
-r so trig function symmetry is not broken (ideally
in all rounding modes).

this is easy to achieve with a rounding mode independent
toint instruction, but not with normal arithmetic
operators. (e.g. one approach is to just call round and
lround and on targets where there is single instruction
it will be very fast, on others it will be very slow,
but at least the behaviour will be consistent.
in trig functions the arg reduction is not in the critical
path for common inputs so the cost may be justified, but
in exp such approach would be bad)

> > - error handling is split up across several translation units.
> > - data layout declarations are split into several _data.h headers
> > 
> > todo:
> > - fp_barrier implementation for various targets
> 
> Isn't the only thing that's needed the asm constraint letter for float
> regs? This would be easier than duplicating the code per-target. What
> are the exact barrier semantics? Hiding how the output depends on the
> input from buggy optimizer? Or forcing that the evaluation happen?

it's mostly just constraint letter, but e.g. on soft-fp
it is probably not needed (fenv does not work anyway).

there are several use-cases for the barrier:
- ensure a value is computed so the side-effect happens
  even if the result is otherwise unused.
- ensure constant computation is not folded i.e. the
  value should be hidden from the compiler.
- ensure an operation is not hoisted from a conditional
  path. (the current barrier does not guarantee this
  i think, but at least the inline asm prevents if
  conversions to single x=a?b:c; instruction with
  both b and c evaluated unconditionally, with
  x=a?barrier(b):c the compiler in principle can still
  do the evals unconditionally, but in practice it does
  not want to)

i didn't want to introduce many barriers for the different
use-cases, in general things work (the compiler cannot
do fenv breaking transformations without whole program
analysis) and only a few places need some little tweaks.

trying to provide correct fenv behaviour assuming the
compiler does not know anything about fenv is a bit
fragile no matter how it is done, i'm not sure what's
a reasonable definition for the barriers (the definition
would likely rely on no-lto) or if there is a cleaner
way to achieve the same result (ideally compilers would
provide a __builtin_fp_barrier thingy).

> > - musl does not enable fma contraction, new code would be better with it
> 
> Yes, we could do it selectively if gcc honored the pragma...

it honors the cflag, e.g. we can add -ffp-contract=fast to
math/*.c (i think this should be ok, if we run into trouble
we can remve the optimization)

> > - musl disables fabs etc inlining, using builtins would help
> 
> Yes. I've thought of trying to make the src/include/*.h patch this up
> to use the builtins that we can. The problem is not math-specific;
> string functions (esp memcpy/memset) are an even bigger issue.

i agree.

> > - FENV_ACCESS pragma should be set in some top level header in principle
> >   (like features.h)
> 
> Is there a reason? I like it as documentation of what actually wants
> it, even if gcc currently isn't honoring it. libm.h might make sense
> since all of libm needs it.

the iso c definition is "all or nothing" kind, so any
code that may run under non-default rounding mode must
be compiled with FENV_ACCESS "on" (otherwise ub) so if
we want to support libc calls with non-default rounding
mode then everything needs it on.

unfortunately "on" also implies that any call (and
possibly inline asm?) may change the rounding mode
which prevents some useful optimizations. (the same
story holds for status flag checks: any call may
modify or check the flags), which is why compiler
devs don't like this pragma thing.

> > - use the new helper functions/macros in existing code.
> > 
> > overall libc.so code size increase on x86_64: +8540 bytes
> > 
> > (i'll send the patches as attachments in two parts, because they are too
> > big for one mail)
> > 
> > Szabolcs Nagy (18):
> >   define FP_FAST_FMA* when fma* can be inlined
> >   math: move complex math out of libm.h
> >   math: add asuint, asuint64, asfloat and asdouble
> >   math: remove sun copyright from libm.h
> >   math: add fp_arch.h with fp_barrier and fp_force_eval
> >   math: add eval_as_float and eval_as_double
> >   math: add single precision error handling functions
> >   math: add double precision error handling functions
> >   math: add macros for static branch prediction hints
> >   math: add configuration macros
> 
> A few of these things (esp the ones that facilitate broken compilers
> that don't implement excess-precision right) I'm not really a fan of,
> or at least don't know if I am (I might be convinced if I'd worked on
> the code they affect). But as long as they're what work well for you
> developing and improving this code, and make it easier to share with
> other projects, I have no objection to going forward with them.

eval_as_float assumes the compiler will honor the
assignment, in practice that's not enough for gcc
with default excess precision handling but if we
have the eval_as_float call in all the relevant places,
it is easy to reuse the code externally with non-standard
build flags: only the eval_as_float definition needs
to be changed (e.g. by adding a barrier).

> 
> Sorry it took a while to get to this. I know sometime last release
> cycle I said I'd do it in the next cycle, but didn't remember to look
> back at it right away. The performance figures and code improvements
> look nice and I definitely want to have this in 1.1.22.

ok.