Date: Wed, 11 Aug 2021 12:09:39 -0400
From: Rich Felker
To: Stefan Kanthak
Cc: Szabolcs Nagy, musl@lists.openwall.com
Subject: Re: [musl] [PATCH] Properly simplified nextafter()
Message-ID: <20210811160938.GB13220@brightrain.aerifal.cx>
In-Reply-To: <7143269BEC424DE6A3B0218C4268C4C8@H270>

On Wed, Aug 11, 2021 at 05:44:28PM +0200, Stefan Kanthak wrote:
> Rich Felker wrote:
>
> > On Wed, Aug 11, 2021 at 12:53:37AM +0200, Stefan Kanthak wrote:
> >> Szabolcs Nagy wrote:
> >>
> >>>* Stefan Kanthak [2021-08-10 08:23:46 +0200]:
> >>>>
> >>>> has quite some superfluous statements:
> >>>>
> >>>> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
> >>>> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
> >>>>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
> >>>
> >>> you replaced 4 int cmps with 4 float cmps (among other things).
> >>
> >> and hinted that the result of the second pair of comparisons is
> >> already known from the first pair.
> >>
> >>> it's target dependent if float compares are fast or not.
> >>
> >> It's also target dependent whether the floating-point registers
> >> can be accessed by integer instructions, or need to be copied:
> >> some win, some lose!
> >> Just let the compiler/optimizer do its job!
> >
> > The values have been copied already to perform isnan,
>
> NOT necessary: the compiler may have inlined isnan() and performed
> the test, for example using FXAM, FUCOM or FUCOMI on i386, or
> UCOMISD on AMD64, without copying the arguments.

musl's isnan() is a macro that inspects the representation directly:

static __inline unsigned __FLOAT_BITS(float __f)
{
	union {float __f; unsigned __i;} __u;
	__u.__f = __f;
	return __u.__i;
}

/* __DOUBLE_BITS(x) is the analogous helper returning the uint64_t
   representation of a double */
#define isnan(x) ( \
	sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
	sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
	__fpclassifyl(x) == FP_NAN)

So, nope. Unless it's doing some extremely high-level rewriting of
this inspection of the representation.

> >> 0. Doesn't musl provide target specific routines for targets with
> >>    soft FP?
> >
> > No, quite the opposite. Targets with hard fp and native insns for
> > particular ops have target-specific versions,
>
> That's why I assumed that this may also be the case for soft FP.

I don't see how that follows. Targets with soft fp are exactly the
opposite of having a specialized native way to perform the operation;
they do it in the most general way.
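For reference, the implementation being argued over is essentially the
following, a lightly simplified sketch of musl's src/math/nextafter.c
(renamed here so it does not clash with libm, and with the FORCE_EVAL
expressions that raise the overflow/underflow flags reduced to a
comment):

#include <math.h>    /* isnan */
#include <stdint.h>

double nextafter_sketch(double x, double y)
{
	union {double f; uint64_t i;} ux = {x}, uy = {y};
	uint64_t ax, ay;

	if (isnan(x) || isnan(y))
		return x + y;              /* propagate NaN */
	if (ux.i == uy.i)
		return y;                  /* bit-identical: nothing to do */
	ax = ux.i & -1ULL/2;               /* |x| -- the two uint64_t at issue */
	ay = uy.i & -1ULL/2;               /* |y| */
	if (ax == 0) {                     /* x is +0 or -0 */
		if (ay == 0)
			return y;          /* the (x == y) case for signed zeros */
		ux.i = (uy.i & 1ULL<<63) | 1;  /* smallest subnormal, sign of y */
	} else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
		ux.i--;                    /* stepping toward y shrinks |x| */
	else
		ux.i++;                    /* stepping toward y grows |x| */
	/* the real code re-checks the exponent here and force-evaluates
	   expressions that raise overflow/underflow as appropriate */
	return ux.f;
}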
> >> The code is of course smaller ... but not as small and fast as a
> >> proper i386 or AMD64 assembly implementation ... which I can
> >> post upon request.
> >
> > Full asm functions are not wanted; it's something we're trying to get
> > rid of in favor of just using very small/single-insn asm statements
> > with proper constraints, where it's sufficiently beneficial to have
> > asm at all. But I'm not even clear how you could make this function
> > more efficient with asm. The overall logic would be exactly the same
> > as the C. Maybe on x86_64 there'd be some SSE instructions to let you
> > elide a few things?
>
> No, just what the instruction set offers: 23 instructions in 72 bytes.
>
> nextafter:
>     comisd   xmm1, xmm0   # CF = (from > to)
>     jp       .Lmxcsr      # from or to INDEFINITE?
>     je       .Lequal      # from = to?
>     sbb      rdx, rdx     # rdx = (from > to) ? -1 : 0
>     movq     rcx, xmm0    # rcx = from
>     mov      rax, rcx
>     add      rax, rax     # CF = (from & -0.0)
>     jz       .Lzero       # from = ±0.0?
> .Lstep:
>     sbb      rax, rax     # rax = (from < 0.0) ? -1 : 0
>     xor      rax, rdx     # rax = (from < 0.0) ^ (from > to) ? -1 : 0
>     or       rax, 1       # rax = (from < 0.0) ^ (from > to) ? -1 : 1
>     add      rax, rcx     # rax = nextafter(from, to)
>     movq     xmm0, rax    # xmm0 = nextafter(from, to)
>     xorpd    xmm1, xmm1
> .Lmxcsr:
>     addsd    xmm0, xmm1   # set MXCSR flags
>     ret
> .Lequal:
>     movsd    xmm0, xmm1   # xmm0 = to
>     ret
> .Lzero:
>     movmskpd eax, xmm1    # rax = (to & -0.0) ? 0b?1 : 0b?0
>     or       eax, 2       # rax = (to & -0.0) ? 0b11 : 0b10
>     ror      rax, 1       # rax = (to & -0.0) ? 0x8000000000000001 : 1
>     movq     xmm0, rax    # xmm0 = (to & -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
>     ret
>
> GCC generates at least 12 more (and longer) instructions here,
> including 2 movabs to load 0x8000000000000000 and 0x7FFFFFFFFFFFFFFF,
> so its code is more than 50% fatter; it also mixes integer SSE and FP
> SSE instructions, which incurs a 2-cycle penalty on many Intel CPUs,
> and has WAY TOO MANY not-so-predictable (un)conditional branches.

We don't use asm to optimize out 2 cycles. If the compiler is choosing
a bad way to perform these loads, the compiler should be fixed. But I
don't think it matters in any measurable way in real usage.

Rich
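As a sanity check of the edge cases argued over in this thread (signed
zeros, NaN propagation, and the x == y shortcut), the following small
program exercises the system nextafter; the expected results follow
from IEEE-754 and C99 (build with, e.g., cc -std=c99 test.c -lm):

#include <assert.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
	/* -0.0 == +0.0, so nextafter(+0.0, -0.0) simply returns y (-0.0) */
	assert(nextafter(0.0, -0.0) == 0.0);
	assert(signbit(nextafter(0.0, -0.0)));

	/* stepping off zero yields the smallest subnormal, ±0x1p-1074 */
	assert(nextafter(0.0, 1.0) == 0x1p-1074);
	assert(nextafter(-0.0, -1.0) == -0x1p-1074);

	/* a NaN in either argument propagates */
	assert(isnan(nextafter(NAN, 1.0)));
	assert(isnan(nextafter(1.0, NAN)));

	/* x == y returns y unchanged */
	assert(nextafter(1.0, 1.0) == 1.0);

	puts("nextafter edge cases OK");
	return 0;
}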