From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/13698
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Szabolcs Nagy <nsz@port70.net>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Possible Mistype in exp.c
Date: Thu, 31 Jan 2019 01:04:26 +0100
Message-ID: <20190131000425.GH21289@port70.net>
References: <alpine.LRH.2.02.1901281753250.16424@key0.esi.com.au>
 <20190129110135.GC21289@port70.net>
 <alpine.LRH.2.02.1901292203340.799@key0.esi.com.au>
 <20190129114308.GD21289@port70.net>
 <alpine.LRH.2.02.1901301214230.828@key0.esi.com.au>
 <20190130093738.GE21289@port70.net>
 <alpine.LRH.2.02.1901302320110.28837@key0.esi.com.au>
Reply-To: musl@lists.openwall.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="137549"; mail-complaints-to="usenet@blaine.gmane.org"
User-Agent: Mutt/1.10.1 (2018-07-13)
To: musl@lists.openwall.com
Original-X-From: musl-return-13714-gllmg-musl=m.gmane.org@lists.openwall.com Thu Jan 31 01:04:41 2019
Return-path: <musl-return-13714-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.89)
	(envelope-from <musl-return-13714-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1gozqT-000ZfI-CT
	for gllmg-musl@m.gmane.org; Thu, 31 Jan 2019 01:04:41 +0100
Original-Received: (qmail 20408 invoked by uid 550); 31 Jan 2019 00:04:38 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 20334 invoked from network); 31 Jan 2019 00:04:37 -0000
Mail-Followup-To: musl@lists.openwall.com
Content-Disposition: inline
In-Reply-To: <alpine.LRH.2.02.1901302320110.28837@key0.esi.com.au>
Xref: news.gmane.org gmane.linux.lib.musl.general:13698
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/13698>

* Damian McGuckin <damianm@esi.com.au> [2019-01-30 23:56:05 +1100]:
> As a matter of interest, what was the benchmark against which you get a 2x
> speed gain?

i benchmarked a tight loop around a call in two setups:
(1) calls are independent, so they can be evaluated in parallel
(2) each call depends on the result of the previous iteration.
(so (1) measures maximum call throughput and (2) measures latency)

usually (1) can be improved significantly over old math code.
and such usage (e.g. calling a single function over an array) is
common (e.g. in fortran code) so rewriting old code is useful.

> I got 1.75 against GLIBC for what that is worth. I used a faster scaling
> routine. But I was not chasing an improved ULP performance like you were as
> that was too much extra work. Your work there sounds like seriously smart
> stuff to me.

glibc <= 2.26 is old fdlibm and ibm libultim code
glibc 2.27 has single prec improvements
glibc 2.28 has double prec correct rounding slow paths removed
glibc 2.29 will use my code

so glibc version matters when you benchmark.

i did most of my benchmarks on various 64bit arm cores.
(which e.g. always have fma and int rounding instructions,
so some things are designed differently than on x86, but
the isa does not matter too much, the cpu internals matter
more and those seem to be similar across targets)

> I used super-scalar friendly code which adds an extra multiplication. It
> made a minuscule net benefit on the Xeons (not Xeon Gold).
> 
> I had 2 versions of the fast scaling routine replacing ldexp. One used a
> single ternary if/then/else and the other grabbed the sign and did a table
> lookup which meant one extra multiplication all the time but no branches.
> The one extra multiplication instead of a branch in the 2-line scaling
> routine made no difference.
> 
> I saw a tiny but measurable difference when I used a Xeon with an FMA
> compared to one which did not.
> 
> Discarding the last term in the SUN routine and that net loss of one
> multiplication still made no serious difference to the timing and of course,
> the results were affected.
> 
> My timings on my reworked code (for doubles) showed
> 
> 	21+%	the preliminary comparisons
> 	43+%	polynomial computation super-scalar friendly way
> 	35+%	y = 1 + (x*c/(2-c) - lo + hi);
> 		return k == 0 ? y : scalbn-FAST(y, k);

here the killer is the division really.

scalbn(1+p, k) can be computed as s = 2^k; s + s*p (a single fma,
just be careful not to overflow s), but using a table lookup instead
of the div should help the most.

> I have slightly increased the work load in the comparisons because I avoid
> pulling 'x' apart into 'hx'. I used only doubles or floats.