Date: Wed, 11 Aug 2021 12:09:39 -0400
From: Rich Felker
To: Stefan Kanthak
Cc: Szabolcs Nagy, musl@lists.openwall.com
Subject: Re: [musl] [PATCH] Properly simplified nextafter()
Message-ID: <20210811160938.GB13220@brightrain.aerifal.cx>
In-Reply-To: <7143269BEC424DE6A3B0218C4268C4C8@H270>

On Wed, Aug 11, 2021 at 05:44:28PM +0200, Stefan Kanthak wrote:
> Rich Felker wrote:
>
> > On Wed, Aug 11, 2021 at 12:53:37AM +0200, Stefan Kanthak wrote:
> >> Szabolcs Nagy wrote:
> >>
> >>>* Stefan Kanthak [2021-08-10 08:23:46 +0200]:
> >>>>
> >>>> has quite some superfluous statements:
> >>>>
> >>>> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
> >>>> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
> >>>>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
> >>>
> >>> you replaced 4 int cmps with 4 float cmps (among other things).
> >>
> >> and hinted that the result of the second pair of comparisons is
> >> already known from the first pair.
> >>
> >>> it's target dependent if float compares are fast or not.
> >>
> >> It's also target dependent whether the floating-point registers
> >> can be accessed by integer instructions, or need to be copied:
> >> some win, some lose!
> >> Just let the compiler/optimizer do its job!
> >
> > The values have been copied already to perform isnan,
>
> NOT necessary: the compiler may have inlined isnan() and performed
> the test, for example using FXAM, FUCOM or FUCOMI on i386, or
> UCOMISD on AMD64, without copying the arguments.

musl's isnan() is a macro that inspects the representation directly:

static __inline unsigned __FLOAT_BITS(float __f)
{
	union {float __f; unsigned __i;} __u;
	__u.__f = __f;
	return __u.__i;
}

/* __DOUBLE_BITS(x) is the analogous helper returning the uint64_t
   representation of a double */
#define isnan(x) ( \
	sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
	sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
	__fpclassifyl(x) == FP_NAN)

So, nope. Unless it's doing some extremely high-level rewriting of
this inspection of the representation.

> >> 0. Doesn't musl provide target specific routines for targets with
> >>    soft FP?
> >
> > No, quite the opposite. Targets with hard fp and native insns for
> > particular ops have target-specific versions,
>
> That's why I assumed that this may also be the case for soft FP.

I don't see how that follows. Targets with soft fp are exactly the
opposite of having a specialized native way to perform the operation;
they do it in the most general way.
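For reference, the implementation being argued over is essentially the
following, a lightly simplified sketch of musl's src/math/nextafter.c
(renamed here so it does not clash with libm, and with the FORCE_EVAL
expressions that raise the overflow/underflow flags reduced to a
comment):

#include <math.h>    /* isnan */
#include <stdint.h>

double nextafter_sketch(double x, double y)
{
	union {double f; uint64_t i;} ux = {x}, uy = {y};
	uint64_t ax, ay;

	if (isnan(x) || isnan(y))
		return x + y;              /* propagate NaN */
	if (ux.i == uy.i)
		return y;                  /* bit-identical: nothing to do */
	ax = ux.i & -1ULL/2;               /* |x| -- the two uint64_t at issue */
	ay = uy.i & -1ULL/2;               /* |y| */
	if (ax == 0) {                     /* x is +0 or -0 */
		if (ay == 0)
			return y;          /* the (x == y) case for signed zeros */
		ux.i = (uy.i & 1ULL<<63) | 1;  /* smallest subnormal, sign of y */
	} else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
		ux.i--;                    /* stepping toward y shrinks |x| */
	else
		ux.i++;                    /* stepping toward y grows |x| */
	/* the real code re-checks the exponent here and force-evaluates
	   expressions that raise overflow/underflow as appropriate */
	return ux.f;
}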
> >> The code is of course smaller ... but not as small and fast as a
> >> proper i386 or AMD64 assembly implementation ... which I can
> >> post upon request.
> >
> > Full asm functions are not wanted; it's something we're trying to get
> > rid of in favor of just using very small/single-insn asm statements
> > with proper constraints, where it's sufficiently beneficial to have
> > asm at all. But I'm not even clear how you could make this function
> > more efficient with asm. The overall logic would be exactly the same
> > as the C. Maybe on x86_64 there'd be some SSE instructions to let you
> > elide a few things?
>
> No, just what the instruction set offers: 23 instructions in 72 bytes.
>
> nextafter:
>     comisd   xmm1, xmm0   # CF = (from > to)
>     jp       .Lmxcsr      # from or to INDEFINITE?
>     je       .Lequal      # from = to?
>     sbb      rdx, rdx     # rdx = (from > to) ? -1 : 0
>     movq     rcx, xmm0    # rcx = from
>     mov      rax, rcx
>     add      rax, rax     # CF = (from & -0.0)
>     jz       .Lzero       # from = ±0.0?
> .Lstep:
>     sbb      rax, rax     # rax = (from < 0.0) ? -1 : 0
>     xor      rax, rdx     # rax = (from < 0.0) ^ (from > to) ? -1 : 0
>     or       rax, 1       # rax = (from < 0.0) ^ (from > to) ? -1 : 1
>     add      rax, rcx     # rax = nextafter(from, to)
>     movq     xmm0, rax    # xmm0 = nextafter(from, to)
>     xorpd    xmm1, xmm1
> .Lmxcsr:
>     addsd    xmm0, xmm1   # set MXCSR flags
>     ret
> .Lequal:
>     movsd    xmm0, xmm1   # xmm0 = to
>     ret
> .Lzero:
>     movmskpd eax, xmm1    # rax = (to & -0.0) ? 0b?1 : 0b?0
>     or       eax, 2       # rax = (to & -0.0) ? 0b11 : 0b10
>     ror      rax, 1       # rax = (to & -0.0) ? 0x8000000000000001 : 1
>     movq     xmm0, rax    # xmm0 = (to & -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
>     ret
>
> GCC generates at least 12 more (and longer) instructions here,
> including 2 movabs to load 0x8000000000000000 and 0x7FFFFFFFFFFFFFFF,
> so its code is more than 50% fatter; it also mixes integer SSE and FP
> SSE instructions, which incurs a 2-cycle penalty on many Intel CPUs,
> and has WAY TOO MANY not-so-predictable (un)conditional branches.

We don't use asm to optimize out 2 cycles. If the compiler is choosing
a bad way to perform these loads, the compiler should be fixed. But I
don't think it matters in any measurable way in real usage.

Rich
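As a sanity check of the edge cases argued over in this thread (signed
zeros, NaN propagation, and the x == y shortcut), the following small
program exercises the system nextafter; the expected results follow
from IEEE-754 and C99 (build with, e.g., cc -std=c99 test.c -lm):

#include <assert.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
	/* -0.0 == +0.0, so nextafter(+0.0, -0.0) simply returns y (-0.0) */
	assert(nextafter(0.0, -0.0) == 0.0);
	assert(signbit(nextafter(0.0, -0.0)));

	/* stepping off zero yields the smallest subnormal, ±0x1p-1074 */
	assert(nextafter(0.0, 1.0) == 0x1p-1074);
	assert(nextafter(-0.0, -1.0) == -0x1p-1074);

	/* a NaN in either argument propagates */
	assert(isnan(nextafter(NAN, 1.0)));
	assert(isnan(nextafter(1.0, NAN)));

	/* x == y returns y unchanged */
	assert(nextafter(1.0, 1.0) == 1.0);

	puts("nextafter edge cases OK");
	return 0;
}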