From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/14992
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Rosen Penev <rosenp@gmail.com>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: More patches for math subtree
Date: Tue, 10 Dec 2019 17:13:00 -0800
Message-ID: <CAKxU2N9qpU9=7qNFavd=UjtLfH0UHrHN95E448oKk7H3w4OOQw@mail.gmail.com>
References: <2C3325A208DA4260A1A0F7B4517D6DFA@H270> <20191210193558.GK1666@brightrain.aerifal.cx>
 <FAF063FA3F8F4F1B883929708874E3AA@H270> <20191210221738.GL1666@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="153562"; mail-complaints-to="usenet@blaine.gmane.org"
Cc: Stefan Kanthak <stefan.kanthak@nexgo.de>
To: musl@lists.openwall.com
Original-X-From: musl-return-15008-gllmg-musl=m.gmane.org@lists.openwall.com Wed Dec 11 02:13:31 2019
Return-path: <musl-return-15008-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@m.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by blaine.gmane.org with smtp (Exim 4.89)
	(envelope-from <musl-return-15008-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1ieqZD-000dky-Ob
	for gllmg-musl@m.gmane.org; Wed, 11 Dec 2019 02:13:27 +0100
Original-Received: (qmail 1663 invoked by uid 550); 11 Dec 2019 01:13:25 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Original-Received: (qmail 1639 invoked from network); 11 Dec 2019 01:13:24 -0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to
         :cc;
        bh=PimNu0DGsUl7EbytdGT7drCnFHA0ROzxl0M+91B7lBU=;
        b=mUx7LBGlEsqnc9z6GYHJwgiygACb3ROWZOPUSDcgqX+qykiT5SFARdMCMbCgG9nIVE
         QaZPnSuK/ycJVwPhWAzfTT4doepVPD7obbA9z7JLB5GiCDAzj5lh3ADa/d5G2JNAnFIH
         3sTGK1eiC/vkCNS4I/DbKIwo4vIIEHqHCF24aVFGQS0W2gZ7FM/5U4MRtHZUWZnm8FAz
         GrZB62kTocMQ1LbEl6b3O/qswW7+h8ffUJgZxqVJSh4/qRfjvYaWMMFASU2yTf/Xpp25
         NnfqQbODp5//HI23N004R5DJ63WbaYgxlu5LM4p9+2q0TIvbN9qNAJM2lUdXF2YE+vho
         1/uw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to:cc;
        bh=PimNu0DGsUl7EbytdGT7drCnFHA0ROzxl0M+91B7lBU=;
        b=D+De1Fg61hRfmIzi3cUU9oH4w1Ga+vsVQo4HLUeKoIIe7+3vNfI/uOijPVp3Sc1t0C
         ahZ4FZixF6pwLaOOjwRBOojvPTLgYQD7fq7Y+vwNBpvhE8Hl/dBda7nN+f75wkC7A1Lv
         VWuNb2X48vP44qT4xfRDioEskDlGsL6DYk7d1iQ5HSoVuQhH/c1cqKWvQkXjdoRQweNT
         WYiKHxHqW8XpUM3Ju1VLjg/RFlX2hvv7XaZeSrvACA3aV5ladB3SgtcamvfUTaM0Z8+W
         qOdw8GwPClqrpMwRp+fLrW3dVCcb/QtDIexjCHJUu94JoSVGPDOQ+nMAObETRojAoS0+
         bUqQ==
X-Gm-Message-State: APjAAAXoa9RaZaDNwQ4cfWUt/SDRw/QWR7i9xijaVGJBrafytwJpvYsM
	hWk/bt6wjwSRM22prKf9+FwcvP5TAN/b2YJq7MQL4bl5oik=
X-Google-Smtp-Source: APXvYqyrzpePuGErVNZTCB/auutnj58EAx9Im5SUrJeG8lyiQK1FmDgCmeIL7AdGr9kn03U5M6RjKAWeQRsJqGbyaL8=
X-Received: by 2002:aca:5fc1:: with SMTP id t184mr869056oib.20.1576026791894;
 Tue, 10 Dec 2019 17:13:11 -0800 (PST)
In-Reply-To: <20191210221738.GL1666@brightrain.aerifal.cx>
Xref: news.gmane.org gmane.linux.lib.musl.general:14992
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/14992>

On Tue, Dec 10, 2019 at 2:17 PM Rich Felker <dalias@libc.org> wrote:
>
> On Tue, Dec 10, 2019 at 10:32:26PM +0100, Stefan Kanthak wrote:
> > "Rich Felker" <dalias@libc.org> wrote:
> >
> >
> > > On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
> > >> Some more optimisations: the current implementations of ceil(), floor()
> > >> and trunc() for i386 change the rounding control using fldcw instructions,
> > >> which are SLOW; these patches provide faster and smaller branch-free (!)
> > >> implementations.
> > >>
> > >> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
> > >>
> > >> --- -/src/math/i386/floor.s
> > >> +++ +/src/math/i386/floor.s
> > >> @@ -1,67 +1,26 @@
> > >>  .global floorf
> > >>  .type floorf,@function
> > >>  floorf:
> > >>          flds 4(%esp)
> > >>          jmp 1f
> > >>
> > >>  .global floorl
> > >>  .type floorl,@function
> > >>  floorl:
> > >>          fldt 4(%esp)
> > >>          jmp 1f
> > >>
> > >>  .global floor
> > >>  .type floor,@function
> > >>  floor:
> > >>          fldl 4(%esp)
> > >> +1:      fld %st(0)
> > >> +        frndint
> > >> +        fxch %st(1)
> > >> +        fucomip %st(1),%st(0)
> > >> +        fld1
> > >> +        fldz
> > >> +        fcmovb %st(1),%st(0)
> > >           ^^^^^^
> > >
> > > fcmovb is not in the baseline ISA.
> >
> > This is but irrelevant or inconsequent: FCMOV* as well as FCOMI* and
> > FUCOMI* were introduced with the PentiumPro. If you allow the use of
> > the latter you can safely use the former too. And FCOMI* and FUCOMI*
> > are already used in other .S files.
>
> This is why we're not using them. I think you're looking at x86_64
> where they are in the baseline ISA.
>
> > > Otherwise, I *think* the idea of this patch looks good, provided I'm
> > > not missing anything with respect to how status flags are affected.
> >
> > FRNDINT takes care of them!
>
> OK.
>
> > > As noted in the other email (sorry about not CC'ing you before; I've
> > > got you on CC now), I really want to get rid of all these .s files in
> > > favor of __asm__ statements with proper constraints in C source files.
> > > That makes them inlineable with LTO, and makes it possible for the
> > > compiler to select to use an instruction like fcmovb conditionally
> > > based on the targeted ISA level rather than having to do a .S file
> > > with hard-coded preprocessor conditionals.
> >
> > While this is generally good idea, there's no guarantee that a compiler
> > will emit a branch-free instruction sequence like those shown above.
> > I also doubt that a compiler will produce the 5 instruction sequence
> > shown in my patch for src/math/i386/remquo.S which collects the FPU
> > flags C0, C3 and C1 set by FPREM.
>
> For that you'd probably put the collection of bits inside the asm. It
> still makes just a few instructions of asm, with no need for external
> call ABI logic in the asm.
>
> > I noticed that you provide .S files for "long double" on x86-64, but
> > not for "double" and "float". I therefore assume that you use the
> > SSE floating-point instructions there, respectively let the compiler
> > use them.
>
> On the x86_64 ABI, float and double arithmetic are performed in SSE
> rather than in excess precision with the x87 unit.
>
> > Does any compiler emit branch-free instruction sequences like the
> > following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
> >
> >         .code   ; Intel syntax
> > ceil    proc    public
> >         extern  __real@8000000000000000:real8
> >         movsd   xmm1, __real@8000000000000000
> >         extern  __real@3ff0000000000000:real8
> >         movsd   xmm2, __real@3ff0000000000000
> >         extern  __real@4330000000000000:real8
> >         movsd   xmm3, __real@4330000000000000
> >         movsd   xmm4, xmm1
> >         andnpd  xmm1, xmm0
> >         andpd   xmm4, xmm0
> >         cmpltsd xmm1, xmm3
> >         andpd   xmm1, xmm3
> >         orpd    xmm1, xmm4
> >         movsd   xmm3, xmm0
> >         addsd   xmm0, xmm1
> >         subsd   xmm0, xmm1
> >         movsd   xmm1, xmm0
> >         cmpltsd xmm0, xmm3
> >         andpd   xmm0, xmm2
> >         addsd   xmm0, xmm1
> >         orpd    xmm0, xmm4
> >         ret
> > ceil    endp
> >
> > Or instruction sequences like
> >
> >         .code   ; Intel syntax
> > copysign proc   public
> >         movd    rcx, xmm0
> >         movd    rdx, xmm1
> >         shld    rcx, rdx, 1
> >         ror     rcx, 1
> >         movd    xmm0, rcx
> >         ret
> > copysign endp
>
> Not quite (but it might be possible to write the C in terms of shifts
> instead of masks such that it does), but I also don't think it's clear
> which version is better. Yours here is mildly smaller and might
> perform better, but when making changes that aren't clearly better
> there should be some evidence that it's actually an improvement --
> especially if it's not just improving existing arch optimizations but
> adding new ones where the C was formerly used. Generally musl avoids
> asm and arch-specific files as much as possible, using them only for
> things that aren't representable in C or where the C is a lot larger
> or slower or both.
>
> >         .code   ; Intel syntax
> > fdim    proc    public
> >         movsd   xmm2, xmm0
> >         cmpsd   xmm0, xmm1, 6
> >         subsd   xmm2, xmm1
> >         andpd   xmm0, xmm2
> >         ret
> > fdim    endp
>
> Does this handle nans correctly?
>
> > > It also precludes x87 stack imbalance bugs like CVE-2019-14697, which
> > > make me really wary of manual changes to these files.
> > >
> > > Would you be interested in working on converting over the files you
> > > want to optimize (or even others too) to that form at the same time as
> > > doing the optimizations?
> >
> > I don't use musl-libc; I also don't use an OS or a compiler/assembler
> > which can be used to build it.
> > I just stumbled upon the functions for which I sent in patches while
> > searching for code which uses Intel's FPU.
> >
> > > It would really help with review process and with improving the overall
> > > code state.
> >
> > If I start using musl-libc I'd be interested and rewrite these parts.
>
> OK. I don't mind looking at these patches further as-is, and I'll try
> to continue offering constructive comments now, but it'll be after
> this release cycle (hopefully wrapping that up in the next week or so)
> before consideration for merging. musl 1.2.0 is already going to be a
> release with big changes (time64) and I don't want to risk subtle
> breakage with new changes that haven't been reviewed in detail yet or
> had time for users to test.
Since you guys are discussing math optimizations, here's another one:
https://www.openwall.com/lists/musl/2019/11/08/1
>
>
> Rich