From: Rosen Penev
To: musl@lists.openwall.com
Cc: Stefan Kanthak
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: More patches for math subtree
Date: Tue, 10 Dec 2019 17:13:00 -0800
Reply-To: musl@lists.openwall.com
In-Reply-To: <20191210221738.GL1666@brightrain.aerifal.cx>
References: <2C3325A208DA4260A1A0F7B4517D6DFA@H270> <20191210193558.GK1666@brightrain.aerifal.cx> <20191210221738.GL1666@brightrain.aerifal.cx>

On Tue, Dec 10, 2019 at 2:17 PM Rich Felker wrote:
>
> On Tue, Dec 10, 2019 at 10:32:26PM +0100, Stefan Kanthak wrote:
> > "Rich Felker" wrote:
> >
> >
> > > On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
> > >> Some more optimisations: the current implementations of ceil(), floor()
> > >> and trunc() for i386 change the rounding control using fldcw instructions,
> > >> which are SLOW; these patches provide faster and smaller branch-free (!)
> > >> implementations.
> > >>
> > >> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
> > >>
> > >> --- -/src/math/i386/floor.s
> > >> +++ +/src/math/i386/floor.s
> > >> @@ -1,67 +1,26 @@
> > >>  .global floorf
> > >>  .type floorf,@function
> > >>  floorf:
> > >>          flds 4(%esp)
> > >>          jmp 1f
> > >>
> > >>  .global floorl
> > >>  .type floorl,@function
> > >>  floorl:
> > >>          fldt 4(%esp)
> > >>          jmp 1f
> > >>
> > >>  .global floor
> > >>  .type floor,@function
> > >>  floor:
> > >>          fldl 4(%esp)
> > >> +1:      fld %st(0)
> > >> +        frndint
> > >> +        fxch %st(1)
> > >> +        fucomip %st(1),%st(0)
> > >> +        fld1
> > >> +        fldz
> > >> +        fcmovb %st(1),%st(0)
> > >             ^^^^^^
> > >
> > > fcmovb is not in the baseline ISA.
> >
> > This is but irrelevant or inconsequent: FCMOV* as well as FCOMI* and
> > FUCOMI* were introduced with the PentiumPro. If you allow the use of
> > the latter you can safely use the former too. And FCOMI* and FUCOMI*
> > are already used in other .S files.
>
> This is why we're not using them. I think you're looking at x86_64
> where they are in the baseline ISA.
>
> > > Otherwise, I *think* the idea of this patch looks good, provided I'm
> > > not missing anything with respect to how status flags are affected.
> >
> > FRNDINT takes care of them!
>
> OK.
>
> > > As noted in the other email (sorry about not CC'ing you before; I've
> > > got you on CC now), I really want to get rid of all these .s files in
> > > favor of __asm__ statements with proper constraints in C source files.
> > > That makes them inlineable with LTO, and makes it possible for the
> > > compiler to select to use an instruction like fcmovb conditionally
> > > based on the targeted ISA level rather than having to do a .S file
> > > with hard-coded preprocessor conditionals.
> >
> > While this is generally good idea, there's no guarantee that a compiler
> > will emit a branch-free instruction sequence like those shown above.
> > I also doubt that a compiler will produce the 5 instruction sequence
> > shown in my patch for src/math/i386/remquo.S which collects the FPU
> > flags C0, C3 and C1 set by FPREM.
>
> For that you'd probably put the collection of bits inside the asm. It
> still makes just a few instructions of asm, with no need for external
> call ABI logic in the asm.
>
> > I noticed that you provide .S files for "long double" on x86-64, but
> > not for "double" and "float". I therefore assume that you use the
> > SSE floating-point instructions there, respectively let the compiler
> > use them.
>
> On the x86_64 ABI, float and double arithmetic are performed in SSE
> rather than in excess precision with the x87 unit.
>
> > Does any compiler emit branch-free instruction sequences like the
> > following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
> >
> >         .code                   ; Intel syntax
> > ceil    proc    public
> >         extern  __real@8000000000000000:real8
> >         movsd   xmm1, __real@8000000000000000
> >         extern  __real@3ff0000000000000:real8
> >         movsd   xmm2, __real@3ff0000000000000
> >         extern  __real@4330000000000000:real8
> >         movsd   xmm3, __real@4330000000000000
> >         movsd   xmm4, xmm1
> >         andnpd  xmm1, xmm0
> >         andpd   xmm4, xmm0
> >         cmpltsd xmm1, xmm3
> >         andpd   xmm1, xmm3
> >         orpd    xmm1, xmm4
> >         movsd   xmm3, xmm0
> >         addsd   xmm0, xmm1
> >         subsd   xmm0, xmm1
> >         movsd   xmm1, xmm0
> >         cmpltsd xmm0, xmm3
> >         andpd   xmm0, xmm2
> >         addsd   xmm0, xmm1
> >         orpd    xmm0, xmm4
> >         ret
> > ceil    endp
> >
> > Or instruction sequences like
> >
> >         .code                   ; Intel syntax
> > copysign proc   public
> >         movd    rcx, xmm0
> >         movd    rdx, xmm1
> >         shld    rcx, rdx, 1
> >         ror     rcx, 1
> >         movd    xmm0, rcx
> >         ret
> > copysign endp
>
> Not quite (but it might be possible to write the C in terms of shifts
> instead of masks such that it does), but I also don't think it's clear
> which version is better. Yours here is mildly smaller and might
> perform better, but when making changes that aren't clearly better
> there should be some evidence that it's actually an improvement --
> especially if it's not just improving existing arch optimizations but
> adding new ones where the C was formerly used. Generally musl avoids
> asm and arch-specific files as much as possible, using them only for
> things that aren't representable in C or where the C is a lot larger
> or slower or both.
>
> >         .code                   ; Intel syntax
> > fdim    proc    public
> >         movsd   xmm2, xmm0
> >         cmpsd   xmm0, xmm1, 6
> >         subsd   xmm2, xmm1
> >         andpd   xmm0, xmm2
> >         ret
> > fdim    endp
>
> Does this handle nans correctly?
>
> > > It also precludes x87 stack imbalance bugs like CVE-2019-14697, which
> > > make me really wary of manual changes to these files.
> > >
> > > Would you be interested in working on converting over the files you
> > > want to optimize (or even others too) to that form at the same time as
> > > doing the optimizations?
> >
> > I don't use musl-libc; I also don't use an OS or a compiler/assembler
> > which can be used to build it.
> > I just stumbled upon the functions for which I sent in patches while
> > searching for code which uses Intel's FPU.
> >
> > > It would really help with review process and with improving the overall
> > > code state.
> >
> > If I start using musl-libc I'd be interested and rewrite these parts.
>
> OK. I don't mind looking at these patches further as-is, and I'll try
> to continue offering constructive comments now, but it'll be after
> this release cycle (hopefully wrapping that up in the next week or so)
> before consideration for merging. musl 1.2.0 is already going to be a
> release with big changes (time64) and I don't want to risk subtle
> breakage with new changes that haven't been reviewed in detail yet or
> had time for users to test.
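
On the copysign sequence quoted above: the remark that the C might be
written "in terms of shifts instead of masks" can be illustrated with a
rough sketch. The function name copysign_shifts is hypothetical and this
is not musl's copysign(); it only mirrors the shld/ror pair in C, assuming
a 64-bit IEEE 754 double. Whether a given compiler actually turns it into
shld/ror is not something this sketch shows.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of musl: copysign() written with
 * shifts only, following the quoted shld/ror sequence.
 * Assumes double is a 64-bit IEEE 754 value. */
double copysign_shifts(double x, double y)
{
    uint64_t ux, uy;

    memcpy(&ux, &x, sizeof ux);   /* bit pattern of x (movd rcx, xmm0) */
    memcpy(&uy, &y, sizeof uy);   /* bit pattern of y (movd rdx, xmm1) */

    /* Drop x's sign bit off the top and bring y's sign bit in at the
     * bottom (shld rcx, rdx, 1). */
    ux = (ux << 1) | (uy >> 63);

    /* Rotate right by one so y's sign bit ends up back on top
     * (ror rcx, 1). */
    ux = (ux >> 1) | (ux << 63);

    memcpy(&x, &ux, sizeof x);
    return x;
}
```

Calling copysign_shifts(3.0, -1.0) gives -3.0, and a zero keeps its
magnitude while taking the sign of the second argument.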
Since you guys are discussing math optimizations, here's another one:
https://www.openwall.com/lists/musl/2019/11/08/1

>
>
> Rich
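
On the open question about NaNs in the quoted fdim sequence: immediate 6
for cmpsd selects the "not less-or-equal" predicate, which is also true
when either operand is a NaN, so the mask stays all-ones and the NaN
produced by the subtraction passes through. A minimal C rendering of the
same steps follows. The name fdim_masked is hypothetical and this is not
musl's fdim(); it assumes a 64-bit IEEE 754 double, and it says nothing
about whether the floating-point exception flags match what fdim() should
raise.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of musl: fdim() as compare, subtract
 * and mask, following the quoted SSE2 sequence step by step. */
double fdim_masked(double x, double y)
{
    /* cmpsd xmm0, xmm1, 6: all-ones when NOT (x <= y), which covers
     * both x > y and the unordered (NaN) case. */
    uint64_t mask = !(x <= y) ? UINT64_MAX : 0;

    double d = x - y;             /* subsd xmm2, xmm1 */
    uint64_t bits;

    memcpy(&bits, &d, sizeof bits);
    bits &= mask;                 /* andpd xmm0, xmm2 */
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

With x > y this returns x - y, with x <= y it returns +0.0, and with a
NaN in either argument the result is a NaN.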