Date: Sun, 15 Aug 2021 02:46:58 -0500 (CDT)
From: Ariadne Conill
To: musl@lists.openwall.com
cc: Szabolcs Nagy
Subject: Re: [musl] [PATCH #2] Properly simplified nextafter()
Message-ID: <4272846f-eb89-2856-af9-38571037a924@dereferenced.org>
In-Reply-To: <367A4018B58A4E308E2A95404362CBFB@H270>
References: <0C6AAAD55DA44C6189B2FF4F5FB2C3E7@H270> <20210810213455.GB37904@port70.net> <20210814234612.GH37904@port70.net> <367A4018B58A4E308E2A95404362CBFB@H270>

Hi,

On Sun, 15 Aug 2021, Stefan Kanthak wrote:

> Szabolcs Nagy wrote:
>
>> * Stefan Kanthak [2021-08-13 14:04:51 +0200]:
>>> Szabolcs Nagy wrote on 2021-08-10 at 23:34:
>
>>>> (the i386 machine where i originally tested this preferred int
>>>> cmp and float cmp was very slow in the subnormal range
>>>
>>> This also and still holds for i386 FPU fadd/fmul as well as SSE
>>> addsd/addss/mulss/mulsd additions/multiplies!
>>
>> they are avoided in the common case, and only used to create
>> fenv side-effects.
>
> Unfortunately but for hard & SOFT-float, where no fenv exists, as
> Rich wrote.

My admittedly rudimentary understanding of how soft-float is
implemented in musl leads me to believe that this doesn't really
matter that much.

>>> --- -/src/math/nextafter.c
>>> +++ +/src/math/nextafter.c
>>> @@ -10,13 +10,13 @@
>>>                  return x + y;
>>>          if (ux.i == uy.i)
>>>                  return y;
>>> -        ax = ux.i & -1ULL/2;
>>> -        ay = uy.i & -1ULL/2;
>>> +        ax = ux.i << 2;
>>> +        ay = uy.i << 2;
>>
>> the << 2 looks wrong, the top bit of the exponent is lost.
>
> It IS wrong, but only in the post, not in the code I tested.

So... in other words, you are testing code that is different than the
code you are submitting?

>
>>>          if (ax == 0) {
>>>                  if (ay == 0)
>>>                          return y;
>>>                  ux.i = (uy.i & 1ULL<<63) | 1;
>>> -        } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
>>> +        } else if ((ax < ay) == ((int64_t) ux.i < 0))
>>>                  ux.i--;
>>>          else
>>>                  ux.i++;
>> ...
>>> How do you compare these 60 instructions/252 bytes to the code I posted
>>> (23 instructions/72 bytes)?
>>
>> you should benchmark, but the second best is to look
>> at the longest dependency chain in the hot path and
>> add up the instruction latencies.
>
> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> run 1 against glibc,                         8.58 ns/call
> run 2 against musl original,                 3.59
> run 3 against musl patched,                  0.52
> run 4 the pure floating-point variant from   0.72
>       my initial post in this thread,
> run 5 the assembly variant I posted.         0.28 ns/call
>
> Now hurry up and patch your slowmotion code!

And how do these benchmarks look on non-x86 architectures, like
aarch64 or riscv64?

I would rather have a portable math library with functions that cost
3.59 nsec per call than one where the portable bits are not exercised
on x86.

> Stefan
>
> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
>
> #ifdef PATCH
> #define isnan(x) ( \
>         sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
>         sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
>         __fpclassifyl(x) == FP_NAN)
> #else
> #define isnan(x) ( \
>         sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
>         sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
>         __fpclassifyl(x) == FP_NAN)
> #endif // PATCH
>
> PPS: and of course the log from the benchmarks...
>
> [stefan@rome ~]$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             AuthenticAMD
> CPU family:            23
> Model:                 49
> Model name:            AMD EPYC 7262 8-Core Processor
> Stepping:              0
> CPU MHz:               3194.323
> BogoMIPS:              6388.64
> Virtualization:        AMD-V
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              512K
> L3 cache:              16384K
> ...
> [stefan@rome ~]$ gcc --version
> gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

GCC 8 is quite old at this point; GCC 9 and 10 have considerably more
capable optimizers.  Indeed, on my Alpine x86_64 system with GCC
10.3.1, nextafter() is compiled to SSE2 instructions, and if I rebuild
musl with `-march=znver2` it uses AVX instructions for nextafter(),
which seems more than sufficiently optimized to me.
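
As a side note on the isnan() macros quoted in the PS: the shifted and
masked forms should classify the same bit patterns, provided the float
constant has eight hex digits (`0xff000000U`). The quoted patch shows
`0xff00000U`, which is assumed here to be a transcription slip; taken
literally, it would classify 1.0f as NaN. The following is only a
minimal sketch of a check, assuming IEEE 754 float and double. The
helper names (`dbits`, `fbits`, `isnan_mask64`, `isnan_shift64`, and so
on) are hypothetical stand-ins, not musl's internal `__FLOAT_BITS` and
`__DOUBLE_BITS`.

```c
#include <assert.h>   /* for callers that want to assert on the result */
#include <float.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: copy out the raw bit pattern (not musl internals). */
static uint64_t dbits(double d) { uint64_t u; memcpy(&u, &d, sizeof u); return u; }
static uint32_t fbits(float f)  { uint32_t u; memcpy(&u, &f, sizeof u); return u; }

/* Masked form, following the unpatched macro quoted above. */
static int isnan_mask64(double d) { return (dbits(d) & -1ULL>>1) > 0x7ffULL<<52; }
static int isnan_mask32(float f)  { return (fbits(f) & 0x7fffffff) > 0x7f800000; }

/* Shifted form. ASSUMPTION: 0xff000000U is what was meant for float;
 * the post as quoted shows 0xff00000U. */
static int isnan_shift64(double d) { return (dbits(d) << 1) > 0xffe0000000000000ULL; }
static int isnan_shift32(float f)  { return (uint32_t)(fbits(f) << 1) > 0xff000000U; }

/* Count disagreements between the two forms (and libc's isnan)
 * over a handful of edge cases. */
static int isnan_mismatches(void)
{
	const double dv[] = { 0.0, -0.0, 1.0, -1.0, 0x1p-1074, DBL_MIN,
	                      DBL_MAX, INFINITY, -INFINITY, NAN, -NAN };
	const float fv[] = { 0.0f, -0.0f, 1.0f, -1.0f, 0x1p-149f, FLT_MIN,
	                     FLT_MAX, INFINITY, -INFINITY, NAN, -NAN };
	int bad = 0;

	for (size_t i = 0; i < sizeof dv / sizeof *dv; i++)
		bad += isnan_mask64(dv[i]) != isnan_shift64(dv[i])
		    || isnan_mask64(dv[i]) != !!isnan(dv[i]);
	for (size_t i = 0; i < sizeof fv / sizeof *fv; i++)
		bad += isnan_mask32(fv[i]) != isnan_shift32(fv[i])
		    || isnan_mask32(fv[i]) != !!isnan(fv[i]);
	return bad;
}
```

A caller would expect `isnan_mismatches()` to return 0 on an IEEE 754
target. This checks only a few edge cases and says nothing about which
form is faster on any particular architecture.
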
Speaking in a personal capacity only, I would rather musl's math
routines remain to the point instead of going down the rabbit hole of
manually optimized routines like glibc has done.  That also includes
hand-optimizing the C routines to exploit optimal behavior for some
specific microarch.  GCC already has knowledge of what optimizations
are good for a specific microarch (this is the whole point of `-march`
and `-mtune` after all); if something is missing, it should be fixed
there.

Ariadne
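
For anyone following the `<< 2` exchange quoted above, here is a
minimal sketch of why the shift count matters. It assumes IEEE 754
binary64 doubles, and the helper names (`bits`, `mag_mask`, `mag_shl1`,
`mag_shl2`, `mag_less`) are hypothetical, not part of musl. Masking
with `-1ULL/2` and shifting left by 1 both discard only the sign bit,
so they order magnitudes the same way. Shifting left by 2 also discards
the top exponent bit, so 2.0 collapses to 0 and compares below 1.0.

```c
#include <assert.h>   /* for callers that want to assert on the results */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw bit pattern of a double. */
static uint64_t bits(double d) { uint64_t u; memcpy(&u, &d, sizeof u); return u; }

/* Three ways to derive a "magnitude key" from the bit pattern. */
static uint64_t mag_mask(double d) { return bits(d) & -1ULL/2; } /* clears the sign bit */
static uint64_t mag_shl1(double d) { return bits(d) << 1; }      /* shifts the sign bit out */
static uint64_t mag_shl2(double d) { return bits(d) << 2; }      /* also loses the top exponent bit */

typedef uint64_t (*keyfn)(double);

/* 1 if the key of a is less than the key of b, else 0. */
static int mag_less(keyfn key, double a, double b)
{
	return key(a) < key(b);
}
```

With these definitions, `mag_less(mag_mask, 1.0, 2.0)` and
`mag_less(mag_shl1, 1.0, 2.0)` both evaluate to 1, while
`mag_less(mag_shl2, 1.0, 2.0)` evaluates to 0 because `mag_shl2(2.0)`
is 0. That matches the remark that the top bit of the exponent is lost.
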