From: Szabolcs Nagy
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Possible Mistype in exp.c
Date: Thu, 31 Jan 2019 01:04:26 +0100
Message-ID: <20190131000425.GH21289@port70.net>
References: <20190129110135.GC21289@port70.net> <20190129114308.GD21289@port70.net> <20190130093738.GE21289@port70.net>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

* Damian McGuckin [2019-01-30 23:56:05 +1100]:
> As a matter of interest, what was the benchmark against which you get a 2x
> speed gain?

i benchmarked a tight loop around a call where
(1) calls are independent so can be evaluated in parallel
(2) calls depend on the previous iteration result.
(so (1) measures maximum call throughput and (2) measures latency)

usually (1) can be improved significantly over old math code.
and such usage (e.g. calling a single function over an array)
is common (e.g. in fortran code) so rewriting old code is useful.

> I got 1.75 against GLIBC for what that is worth. I used a faster scaling
> routine. But I was not chasing an improved ULP performance like you were as
> that was too much extra work. Your work there sounds like seriously smart
> stuff to me.

glibc <= 2.26 is old fdlibm and ibm libultim code
glibc 2.27 has single prec improvements
glibc 2.28 has double prec correct rounding slow paths removed
glibc 2.29 will use my code

so glibc version matters when you benchmark.

i did most of my benchmarks on various 64bit arm cores.
(which e.g. always have fma and int rounding instructions,
so some things are designed differently than on x86, but
the isa does not matter too much, the cpu internals matter
more and those seem to be similar across targets)

> I used super-scalar friendly code which adds an extra multiplication. It
> made a minuscule net benefit on the Xeons (not Xeon Gold).
>
> I had 2 versions of the fast scaling routine replacing ldexp. One used a
> single ternary if/then/else and the other grabbed the sign and did a table
> lookup which meant one extra multiplication all the time but no branches.
> The one extra multiplication instead of a branch in the 2-line scaling
> routine made no difference.
>
> I saw a tiny but measurable difference when I used a Xeon with an FMA
> compared to one which did not.
>
> Discarding the last term in the SUN routine and that net loss of one
> multiplication still made no serious difference to the timing and of course,
> the results were affected.
>
> My timings showed on my reworked code (for doubles)
>
>     21+%    the preliminary comparisons
>     43+%    polynomial computation super-scalar friendly way
>     35+%    y = 1 + (x*c/(2-c) - lo + hi);
>             return k == 0 ?
>                 y : scalbn-FAST(y, k);

here the killer is the division really.

scalbn(1+p, k) can be s=2^k; s + s*p; (single fma, just be
careful not to overflow s), but using a table lookup instead
of div should help the most.

> I have slightly increased the work load in the comparisons because I avoid
> pulling 'x' apart into 'hx'. I used only doubles or floats.