mailing list of musl libc
 help / color / mirror / Atom feed
* [musl] [PATCH] Properly simplified nextafter()
@ 2021-08-10  6:23 Stefan Kanthak
  2021-08-10 21:34 ` Szabolcs Nagy
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-10  6:23 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1426 bytes --]

<https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
has quite some superfluous statements:

1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
   (ax == 0) && (ay == 0): the latter 2 tests can be removed;
3. there's absolutely no need to compare the signs of x and y
   with the sign of the direction: its sufficient to test that
   direction and sign of x match;
4. a proper compiler/optimizer should be able to reuse the results
   of the comparision (x == y) for (x < y) or (x > y) and
   (x == 0.0) for (x < 0.0) or (x > 0.0).

   JFTR: if ((x < 0.0) == (x < y)) is equivalent to
         if ((x > 0.0) == (x > y))

--- -/src/math/nextafter.c
+++ +/src/math/nextafter.c
@@ -3,20 +3,15 @@
 double nextafter(double x, double y)
 {
        union {double f; uint64_t i;} ux={x}, uy={y};
-       uint64_t ax, ay;
        int e;

        if (isnan(x) || isnan(y))
                return x + y;
-       if (ux.i == uy.i)
+       if (x == y)
                return y;
-       ax = ux.i & -1ULL/2;
-       ay = uy.i & -1ULL/2;
-       if (ax == 0) {
-               if (ay == 0)
-                       return y;
+       if (x == 0.0)
                ux.i = (uy.i & 1ULL<<63) | 1;
-       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
+       else if ((x < 0.0) == (x < y))
                ux.i--;
        else
                ux.i++;

[-- Attachment #2: nextafter.patch --]
[-- Type: application/octet-stream, Size: 565 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-10  6:23 [musl] [PATCH] Properly simplified nextafter() Stefan Kanthak
@ 2021-08-10 21:34 ` Szabolcs Nagy
  2021-08-10 22:53   ` Stefan Kanthak
  2021-08-13 12:04   ` [musl] [PATCH #2] " Stefan Kanthak
  0 siblings, 2 replies; 34+ messages in thread
From: Szabolcs Nagy @ 2021-08-10 21:34 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: musl

* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-10 08:23:46 +0200]:
> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
> has quite some superfluous statements:
> 
> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;

you replaced 4 int cmps with 4 float cmps (among other things).

it's target dependent if float compares are fast or not.
(the i386 machine where i originally tested this preferred int
cmp and float cmp was very slow in the subnormal range and
iirc it also raises the non-standard input denormal exception,
which is fine i guess. of course soft float abis much prefer int
cmp so your code is likely much slower and bigger there).

but i'm not against the change, it is likely better on modern
machines. did you try to benchmark it? or check the code size?

> 3. there's absolutely no need to compare the signs of x and y
>    with the sign of the direction: its sufficient to test that
>    direction and sign of x match;
> 4. a proper compiler/optimizer should be able to reuse the results
>    of the comparision (x == y) for (x < y) or (x > y) and
>    (x == 0.0) for (x < 0.0) or (x > 0.0).
> 
>    JFTR: if ((x < 0.0) == (x < y)) is equivalent to
>          if ((x > 0.0) == (x > y))
> 
> --- -/src/math/nextafter.c
> +++ +/src/math/nextafter.c
> @@ -3,20 +3,15 @@
>  double nextafter(double x, double y)
>  {
>         union {double f; uint64_t i;} ux={x}, uy={y};
> -       uint64_t ax, ay;
>         int e;
> 
>         if (isnan(x) || isnan(y))
>                 return x + y;
> -       if (ux.i == uy.i)
> +       if (x == y)
>                 return y;
> -       ax = ux.i & -1ULL/2;
> -       ay = uy.i & -1ULL/2;
> -       if (ax == 0) {
> -               if (ay == 0)
> -                       return y;
> +       if (x == 0.0)
>                 ux.i = (uy.i & 1ULL<<63) | 1;
> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
> +       else if ((x < 0.0) == (x < y))
>                 ux.i--;
>         else
>                 ux.i++;



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-10 21:34 ` Szabolcs Nagy
@ 2021-08-10 22:53   ` Stefan Kanthak
  2021-08-11  2:40     ` Rich Felker
  2021-08-11  8:23     ` Szabolcs Nagy
  2021-08-13 12:04   ` [musl] [PATCH #2] " Stefan Kanthak
  1 sibling, 2 replies; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-10 22:53 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: musl

Szabolcs Nagy <nsz@port70.net> wrote:

>* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-10 08:23:46 +0200]:
>> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
>> has quite some superfluous statements:
>> 
>> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
>> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
>>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
> 
> you replaced 4 int cmps with 4 float cmps (among other things).

and hinted that the result of the second pair of comparisions is
already known from the first pair.

> it's target dependent if float compares are fast or not.

It's also target dependent whether the floating-point registers
can be accessed by integer instructions, or need to be copied:
some win, some loose!
Just let the compiler/optimizer do its job!

> (the i386 machine where i originally tested this preferred int
> cmp and float cmp was very slow in the subnormal range and
> iirc it also raises the non-standard input denormal exception,
> which is fine i guess.

This exception resp. the (sticky) flag is explicitly raised/set
in the part following the patch.

> of course soft float abis much prefer int cmp so your code is
> likely much slower and bigger there).

0. Doesn't musl provide target specific routines for targets with
   soft FP?

1. If not: the compiler knows the target ABI and SHOULD generate
   the proper integer comparisions there.
 
> but i'm not against the change, it is likely better on modern
> machines. did you try to benchmark it? or check the code size?

I STILL don't run a system supported by musl.
The code is of course smaller ... but not as small and fast as a
proper i386 or AMD64 assembly implementation ... which I can
post upon request.

regards
Stefan

>> 3. there's absolutely no need to compare the signs of x and y
>>    with the sign of the direction: its sufficient to test that
>>    direction and sign of x match;
>> 4. a proper compiler/optimizer should be able to reuse the results
>>    of the comparision (x == y) for (x < y) or (x > y) and
>>    (x == 0.0) for (x < 0.0) or (x > 0.0).
>> 
>>    JFTR: if ((x < 0.0) == (x < y)) is equivalent to
>>          if ((x > 0.0) == (x > y))
>> 
>> --- -/src/math/nextafter.c
>> +++ +/src/math/nextafter.c
>> @@ -3,20 +3,15 @@
>>  double nextafter(double x, double y)
>>  {
>>         union {double f; uint64_t i;} ux={x}, uy={y};
>> -       uint64_t ax, ay;
>>         int e;
>> 
>>         if (isnan(x) || isnan(y))
>>                 return x + y;
>> -       if (ux.i == uy.i)
>> +       if (x == y)
>>                 return y;
>> -       ax = ux.i & -1ULL/2;
>> -       ay = uy.i & -1ULL/2;
>> -       if (ax == 0) {
>> -               if (ay == 0)
>> -                       return y;
>> +       if (x == 0.0)
>>                 ux.i = (uy.i & 1ULL<<63) | 1;
>> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
>> +       else if ((x < 0.0) == (x < y))
>>                 ux.i--;
>>         else
>>                 ux.i++;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-10 22:53   ` Stefan Kanthak
@ 2021-08-11  2:40     ` Rich Felker
  2021-08-11 15:44       ` Stefan Kanthak
  2021-08-11  8:23     ` Szabolcs Nagy
  1 sibling, 1 reply; 34+ messages in thread
From: Rich Felker @ 2021-08-11  2:40 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, musl

On Wed, Aug 11, 2021 at 12:53:37AM +0200, Stefan Kanthak wrote:
> Szabolcs Nagy <nsz@port70.net> wrote:
> 
> >* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-10 08:23:46 +0200]:
> >> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
> >> has quite some superfluous statements:
> >> 
> >> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
> >> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
> >>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
> > 
> > you replaced 4 int cmps with 4 float cmps (among other things).
> 
> and hinted that the result of the second pair of comparisions is
> already known from the first pair.
> 
> > it's target dependent if float compares are fast or not.
> 
> It's also target dependent whether the floating-point registers
> can be accessed by integer instructions, or need to be copied:
> some win, some loose!
> Just let the compiler/optimizer do its job!

The values have been copied already to perform isnan, so continuing to
access them does not incur any further cost.

> > (the i386 machine where i originally tested this preferred int
> > cmp and float cmp was very slow in the subnormal range and
> > iirc it also raises the non-standard input denormal exception,
> > which is fine i guess.
> 
> This exception resp. the (sticky) flag is explicitly raised/set
> in the part following the patch.
> 
> > of course soft float abis much prefer int cmp so your code is
> > likely much slower and bigger there).
> 
> 0. Doesn't musl provide target specific routines for targets with
>    soft FP?

No, quite the opposite. Targets with hard fp and native insns for
particular ops have target-specific versions, but in general musl
strongly prefers use of common implementation across all targets when
there is not an obvious [nearly-]single-insn candidate for a
specialized version.

> 1. If not: the compiler knows the target ABI and SHOULD generate
>    the proper integer comparisions there.

Here it would require the compiler to recognize that the nan case was
already ruled out, and to special-case ±0 comparison on the
representation. Of course this is possible in theory, but it's almost
surely not happening now or any time soon. I'm pretty sure soft float
targets just end up calling the libgcc function for floating point
comparison if you do that.

Now, is the difference something actually worth caring about? I'm not
sure. I'm not particularly for or against this change to begin with,
but I do want changes to be well-motivated in the sense of improving
something in some measurable way. Otherwise they add to the review
burden (not just for me but for anyone who's following changes and
trying to verify they're safe/correct) without getting any value for
that. I did note there's already some inconsistency with nexttoward*,
which do it at least partly more like what you're proposing, so making
this entire family of functions consistent is a possible motivation.

> > but i'm not against the change, it is likely better on modern
> > machines. did you try to benchmark it? or check the code size?
> 
> I STILL don't run a system supported by musl.
> The code is of course smaller ... but not as small and fast as a
> proper i386 or AMD64 assembly implementation ... which I can
> post upon request.

Full asm functions are not wanted; it's something we're trying to get
rid of in favor of just using very small/single-insn asm statements
with proper constraints, where it's sufficiently beneficial to have
asm at all. But I'm not even clear how you could make this function
more efficient with asm. The overall logic would be exactly the same
as the C. Maybe on x86_64 there'd be some SSE instructions to let you
elide a few things?

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-10 22:53   ` Stefan Kanthak
  2021-08-11  2:40     ` Rich Felker
@ 2021-08-11  8:23     ` Szabolcs Nagy
  1 sibling, 0 replies; 34+ messages in thread
From: Szabolcs Nagy @ 2021-08-11  8:23 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: musl

* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-11 00:53:37 +0200]:
> Szabolcs Nagy <nsz@port70.net> wrote:
> > it's target dependent if float compares are fast or not.
> 
> It's also target dependent whether the floating-point registers
> can be accessed by integer instructions, or need to be copied:
> some win, some loose!

as Rich said, fp values are moved to int regs no matter what,
but float cmp can be avoided.

> Just let the compiler/optimizer do its job!

compilers cannot do this kind of refactoring.
(if they could, your change would not be needed)

> > (the i386 machine where i originally tested this preferred int
> > cmp and float cmp was very slow in the subnormal range and
> > iirc it also raises the non-standard input denormal exception,
> > which is fine i guess.
> 
> This exception resp. the (sticky) flag is explicitly raised/set
> in the part following the patch.

that is raised if the result is subnormal, but not if y is.
but we don't care about x86 specific denormal exception
only about standard underflow, so fcmp is fine.

> > of course soft float abis much prefer int cmp so your code is
> > likely much slower and bigger there).
> 
> 0. Doesn't musl provide target specific routines for targets with
>    soft FP?

no.

> 
> 1. If not: the compiler knows the target ABI and SHOULD generate
>    the proper integer comparisions there.

it can't, you get extern calls with register spills etc.

> > but i'm not against the change, it is likely better on modern
> > machines. did you try to benchmark it? or check the code size?
> 
> I STILL don't run a system supported by musl.

then how did you test it?

> The code is of course smaller ... but not as small and fast as a
> proper i386 or AMD64 assembly implementation ... which I can
> post upon request.

if this is an optimization patch then give us numbers
(and add them to the commit message).

at least you can look up instruction timings if you don't
want to benchmark:

e.g. aarch64 cortex-a72 optimization guide says

fcmp: 3 latency, 1 throughput
 cmp: 1 latency, 2 throughput

i.e. 1 fcmp ~ 6 cmp in the ideal case. some of the
removed instructions can execute in parallel so i'd
expect the original code to be faster on a cortex a72
for most inputs.

but on many other cores it will be faster or you can
optimize for code size or

as Rich said it makes sense to have consistent
algorithm with nexttoward (which does not use int
representation to avoid dealing with many different
long double formats). but you have to explain why
you are doing a change.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-11  2:40     ` Rich Felker
@ 2021-08-11 15:44       ` Stefan Kanthak
  2021-08-11 16:09         ` Rich Felker
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-11 15:44 UTC (permalink / raw)
  To: Rich Felker; +Cc: Szabolcs Nagy, musl

Rich Felker <dalias@libc.org> wrote:

> On Wed, Aug 11, 2021 at 12:53:37AM +0200, Stefan Kanthak wrote:
>> Szabolcs Nagy <nsz@port70.net> wrote:
>>
>>>* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-10 08:23:46 +0200]:
>>>> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
>>>> has quite some superfluous statements:
>>>>
>>>> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
>>>> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
>>>>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
>>>
>>> you replaced 4 int cmps with 4 float cmps (among other things).
>>
>> and hinted that the result of the second pair of comparisions is
>> already known from the first pair.
>>
>>> it's target dependent if float compares are fast or not.
>>
>> It's also target dependent whether the floating-point registers
>> can be accessed by integer instructions, or need to be copied:
>> some win, some loose!
>> Just let the compiler/optimizer do its job!
>
> The values have been copied already to perform isnan,

NOT necessary: the compiler may have inlined isnan() and perform
the test for example using FXAM, FUCOM or FUCOMI on i386, or
UCOMISD on AMD64, without copying the arguments.
I recommend to inspect the code GCC generates for AMD64, for example.

> so continuing to access them does not incur any further cost.

Non sequitur: see above.

[...]

>> 0. Doesn't musl provide target specific routines for targets with
>>    soft FP?
>
> No, quite the opposite. Targets with hard fp and native insns for
> particular ops have target-specific versions,

That's why I assumed that this may also be the case for soft FP.

> but in general musl strongly prefers use of common implementation
> across all targets when there is not an obvious [nearly-]single-insn
> candidate for a specialized version.

That's one of the reason why I submitted this patch: FP hardware is
mainstream.

>> 1. If not: the compiler knows the target ABI and SHOULD generate
>>    the proper integer comparisions there.
>
> Here it would require the compiler to recognize that the nan case was
> already ruled out, and to special-case ±0 comparison on the
> representation. Of course this is possible in theory, but it's almost
> surely not happening now or any time soon. I'm pretty sure soft float
> targets just end up calling the libgcc function for floating point
> comparison if you do that.

|     if (isnan(x) || isnan(y))
|          return x + y;

The 4 instructions I mentioned above set flags for all cases: see
below.

>> The code is of course smaller ... but not as small and fast as a
>> proper i386 or AMD64 assembly implementation ... which I can
>> post upon request.
>
> Full asm functions are not wanted; it's something we're trying to get
> rid of in favor of just using very small/single-insn asm statements
> with proper constraints, where it's sufficiently beneficial to have
> asm at all. But I'm not even clear how you could make this function
> more efficient with asm. The overall logic would be exactly the same
> as the C. Maybe on x86_64 there'd be some SSE instructions to let you
> elide a few things?

No, just what the instruction set offers: 23 instructions in 72 bytes.

nextafter:
        comisd  xmm1, xmm0              # CF = (from > to)
        jp      .Lmxcsr                 # from or to INDEFINITE?
        je      .Lequal                 # from = to?
        sbb     rdx, rdx                # rdx = (from > to) ? -1 : 0
        movq    rcx, xmm0               # rcx = from
        mov     rax, rcx
        add     rax, rax                # CF = (from & -0.0)
        jz      .Lzero                  # from = ±0.0?
.Lstep:
        sbb     rax, rax                # rax = (from < 0.0) ? -1 : 0
        xor     rax, rdx                # rax = (from < 0.0) ^ (from > to) ? -1 : 0
        or      rax, 1                  # rax = (from < 0.0) ^ (from > to) ? -1 : 1
        add     rax, rcx                # rax = nextafter(from, to)
        movq    xmm0, rax               # xmm0 = nextafter(from, to)
        xorpd   xmm1, xmm1
.Lmxcsr:
        addsd   xmm0, xmm1              # set MXCSR flags
        ret
.Lequal:
        movsd   xmm0, xmm1              # xmm0 = to
        ret
.Lzero:
        movmskpd eax, xmm1              # rax = (to & -0.0) ? 0b?1 : 0b?0
        or      eax, 2                  # rax = (to & -0.0) ? 0b11 : 0b10
        ror     rax, 1                  # rax = (to & -0.0) ? 0x8000000000000001 : 1
        movq    xmm0, rax               # xmm0 = (to & -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
        ret

GCC generates here at least 12 instructions more, also longer ones,
including 2 movabs to load 0x8000000000000000 and 0x7FFFFFFFFFFFFFFF,
so the code is more than 50% fatter, mixes integer SSE and FP SSE
instructions which incur 2 cycles penalty on many Intel CPUs, with
WAY TOO MANY not so predictable (un)conditional branches.

JFTR: it's almost always easy to beat the compiler!

Stefan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-11 15:44       ` Stefan Kanthak
@ 2021-08-11 16:09         ` Rich Felker
  2021-08-11 16:50           ` Stefan Kanthak
  0 siblings, 1 reply; 34+ messages in thread
From: Rich Felker @ 2021-08-11 16:09 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, musl

On Wed, Aug 11, 2021 at 05:44:28PM +0200, Stefan Kanthak wrote:
> Rich Felker <dalias@libc.org> wrote:
> 
> > On Wed, Aug 11, 2021 at 12:53:37AM +0200, Stefan Kanthak wrote:
> >> Szabolcs Nagy <nsz@port70.net> wrote:
> >>
> >>>* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-10 08:23:46 +0200]:
> >>>> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
> >>>> has quite some superfluous statements:
> >>>>
> >>>> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
> >>>> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
> >>>>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
> >>>
> >>> you replaced 4 int cmps with 4 float cmps (among other things).
> >>
> >> and hinted that the result of the second pair of comparisions is
> >> already known from the first pair.
> >>
> >>> it's target dependent if float compares are fast or not.
> >>
> >> It's also target dependent whether the floating-point registers
> >> can be accessed by integer instructions, or need to be copied:
> >> some win, some loose!
> >> Just let the compiler/optimizer do its job!
> >
> > The values have been copied already to perform isnan,
> 
> NOT necessary: the compiler may have inlined isnan() and perform
> the test for example using FXAM, FUCOM or FUCOMI on i386, or
> UCOMISD on AMD64, without copying the arguments.

static __inline unsigned __FLOAT_BITS(float __f)
{
	union {float __f; unsigned __i;} __u;
	__u.__f = __f;
	return __u.__i;
}

#define isnan(x) ( \
	sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
	sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
	__fpclassifyl(x) == FP_NAN)

So, nope. Unless it's doing some extremely high level rewriting of
this inspection of the representation.

> >> 0. Doesn't musl provide target specific routines for targets with
> >>    soft FP?
> >
> > No, quite the opposite. Targets with hard fp and native insns for
> > particular ops have target-specific versions,
> 
> That's why I assumed that this may also be the case for soft FP.

I don't see how that follows. Targets with soft fp are exactly the
opposite of having a specialized native way to perform the operation;
they do it in the most general way.

> >> The code is of course smaller ... but not as small and fast as a
> >> proper i386 or AMD64 assembly implementation ... which I can
> >> post upon request.
> >
> > Full asm functions are not wanted; it's something we're trying to get
> > rid of in favor of just using very small/single-insn asm statements
> > with proper constraints, where it's sufficiently beneficial to have
> > asm at all. But I'm not even clear how you could make this function
> > more efficient with asm. The overall logic would be exactly the same
> > as the C. Maybe on x86_64 there'd be some SSE instructions to let you
> > elide a few things?
> 
> No, just what the instruction set offers: 23 instructions in 72 bytes.
> 
> nextafter:
>         comisd  xmm1, xmm0              # CF = (from > to)
>         jp      .Lmxcsr                 # from or to INDEFINITE?
>         je      .Lequal                 # from = to?
>         sbb     rdx, rdx                # rdx = (from > to) ? -1 : 0
>         movq    rcx, xmm0               # rcx = from
>         mov     rax, rcx
>         add     rax, rax                # CF = (from & -0.0)
>         jz      .Lzero                  # from = ±0.0?
> ..Lstep:
>         sbb     rax, rax                # rax = (from < 0.0) ? -1 : 0
>         xor     rax, rdx                # rax = (from < 0.0) ^ (from > to) ? -1 : 0
>         or      rax, 1                  # rax = (from < 0.0) ^ (from > to) ? -1 : 1
>         add     rax, rcx                # rax = nextafter(from, to)
>         movq    xmm0, rax               # xmm0 = nextafter(from, to)
>         xorpd   xmm1, xmm1
> ..Lmxcsr:
>         addsd   xmm0, xmm1              # set MXCSR flags
>         ret
> ..Lequal:
>         movsd   xmm0, xmm1              # xmm0 = to
>         ret
> ..Lzero:
>         movmskpd eax, xmm1              # rax = (to & -0.0) ? 0b?1 : 0b?0
>         or      eax, 2                  # rax = (to & -0.0) ? 0b11 : 0b10
>         ror     rax, 1                  # rax = (to & -0.0) ? 0x8000000000000001 : 1
>         movq    xmm0, rax               # xmm0 = (to & -0.0) ? -0x1.0p-1074 : 0x1.0p-1074
>         ret
> 
> GCC generates here at least 12 instructions more, also longer ones,
> including 2 movabs to load 0x8000000000000000 and 0x7FFFFFFFFFFFFFFF,
> so the code is more than 50% fatter, mixes integer SSE and FP SSE
> instructions which incur 2 cycles penalty on many Intel CPUs, with
> WAY TOO MANY not so predictable (un)conditional branches.

We don't use asm to optimize out 2 cycles. If the compiler is choosing
a bad way to perform these loads the compiler should be fixed. But I
don't think it matters in any measurable way in real usage.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-11 16:09         ` Rich Felker
@ 2021-08-11 16:50           ` Stefan Kanthak
  2021-08-11 17:57             ` Rich Felker
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-11 16:50 UTC (permalink / raw)
  To: Rich Felker; +Cc: Szabolcs Nagy, musl

Rich Felker <dalias@libc.org> wrote:

[...]

> static __inline unsigned __FLOAT_BITS(float __f)
> {
> union {float __f; unsigned __i;} __u;
> __u.__f = __f;
> return __u.__i;
> }
>
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> __fpclassifyl(x) == FP_NAN)
>
> So, nope.

GCC typically uses its __builtin_isnan() for isnan(), which doesn't
use integer instructions or reloads:

$ cat isnan.c
int foo(double x) {
    return isnan(x);
}
int bar(double x) {
    return __builtin_isnan(x);
}
$ gcc -S -O3 -o- isnan.c
...
        xorl    %eax, %eax
        ucomisd %xmm0, %xmm0
        setp    %al
        ret
...

> Unless it's doing some extremely high level rewriting of
> this inspection of the representation.

It performs the high-level substitution of isnan with __builtin_isnan

[...]

>> GCC generates here at least 12 instructions more, also longer ones,
>> including 2 movabs to load 0x8000000000000000 and 0x7FFFFFFFFFFFFFFF,
>> so the code is more than 50% fatter, mixes integer SSE and FP SSE
>> instructions which incur 2 cycles penalty on many Intel CPUs, with
>> WAY TOO MANY not so predictable (un)conditional branches.
>
> We don't use asm to optimize out 2 cycles.

This is just ONE of the many deficiencies of the code GCC generates.

> If the compiler is choosing a bad way to perform these loads the compiler
> should be fixed. But I don't think it matters in any measurable way in real usage.

On several families of Intel Core-i processors this 1 cycle penalty occurs
EVERY time an SSE register is accessed by a FP instruction AFTER an integer
instruction and vice versa!

BAD:
        pxor     xmm1, xmm1
        cmpsd    xmm0, xmm1

good:
        xorpd    xmm1, xmm1
        cmpsd    xmm0, xmm1

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-11 16:50           ` Stefan Kanthak
@ 2021-08-11 17:57             ` Rich Felker
  2021-08-11 22:16               ` Szabolcs Nagy
  0 siblings, 1 reply; 34+ messages in thread
From: Rich Felker @ 2021-08-11 17:57 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, musl

On Wed, Aug 11, 2021 at 06:50:28PM +0200, Stefan Kanthak wrote:
> Rich Felker <dalias@libc.org> wrote:
> 
> [...]
> 
> > static __inline unsigned __FLOAT_BITS(float __f)
> > {
> > union {float __f; unsigned __i;} __u;
> > __u.__f = __f;
> > return __u.__i;
> > }
> >
> > #define isnan(x) ( \
> > sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> > sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> > __fpclassifyl(x) == FP_NAN)
> >
> > So, nope.
> 
> GCC typically uses its __builtin_isnan() for isnan(), which doesn't
> use integer instructions or reloads:

That's only if you #define isnan(x) __builtin_isnan(x)

> $ cat isnan.c
> int foo(double x) {
>     return isnan(x);
> }
> int bar(double x) {
>     return __builtin_isnan(x);
> }
> $ gcc -S -O3 -o- isnan.c
> ....
>         xorl    %eax, %eax
>         ucomisd %xmm0, %xmm0
>         setp    %al
>         ret
> ....

Which glibc, which is what you're using, does.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-11 17:57             ` Rich Felker
@ 2021-08-11 22:16               ` Szabolcs Nagy
  2021-08-11 22:43                 ` Stefan Kanthak
  0 siblings, 1 reply; 34+ messages in thread
From: Szabolcs Nagy @ 2021-08-11 22:16 UTC (permalink / raw)
  To: Rich Felker; +Cc: Stefan Kanthak, musl

* Rich Felker <dalias@libc.org> [2021-08-11 13:57:23 -0400]:
> On Wed, Aug 11, 2021 at 06:50:28PM +0200, Stefan Kanthak wrote:
> > Rich Felker <dalias@libc.org> wrote:
> > > static __inline unsigned __FLOAT_BITS(float __f)
> > > {
> > > union {float __f; unsigned __i;} __u;
> > > __u.__f = __f;
> > > return __u.__i;
> > > }
> > >
> > > #define isnan(x) ( \
> > > sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> > > sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> > > __fpclassifyl(x) == FP_NAN)
> > >
> > > So, nope.
> > 
> > GCC typically uses its __builtin_isnan() for isnan(), which doesn't
> > use integer instructions or reloads:
> 
> That's only if you #define isnan(x) __builtin_isnan(x)

even then it should use int arithmetics, see below

> 
> > $ cat isnan.c
> > int foo(double x) {
> >     return isnan(x);
> > }
> > int bar(double x) {
> >     return __builtin_isnan(x);
> > }
> > $ gcc -S -O3 -o- isnan.c
> > ....
> >         xorl    %eax, %eax
> >         ucomisd %xmm0, %xmm0
> >         setp    %al
> >         ret
> > ....
> 
> Which glibc, which is what you're using, does.

it is also wrong: isnan must not signal for signaling nan.

this is a gcc bug, it fails even with -fsignaling-nans
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66462

once that's fixed glibc will behave like musl.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-11 22:16               ` Szabolcs Nagy
@ 2021-08-11 22:43                 ` Stefan Kanthak
  2021-08-12  0:59                   ` Rich Felker
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-11 22:43 UTC (permalink / raw)
  To: Szabolcs Nagy, Rich Felker; +Cc: musl

Szabolcs Nagy <nsz@port70.net> wrote:

>* Rich Felker <dalias@libc.org> [2021-08-11 13:57:23 -0400]:
>> On Wed, Aug 11, 2021 at 06:50:28PM +0200, Stefan Kanthak wrote:
>> > Rich Felker <dalias@libc.org> wrote:
>> > > static __inline unsigned __FLOAT_BITS(float __f)
>> > > {
>> > > union {float __f; unsigned __i;} __u;
>> > > __u.__f = __f;
>> > > return __u.__i;
>> > > }
>> > >
>> > > #define isnan(x) ( \
>> > > sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
>> > > sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
>> > > __fpclassifyl(x) == FP_NAN)
>> > >
>> > > So, nope.
>> > 
>> > GCC typically uses its __builtin_isnan() for isnan(), which doesn't
>> > use integer instructions or reloads:
>> 
>> That's only if you #define isnan(x) __builtin_isnan(x)
> 
> even then it should use int arithmetics, see below
[...]
> it is also wrong: isnan must not signal for signaling nan.

ARGH: I recommend to read the subject first, then the C standard and
      recall how nextafter() is supposed to handle signalling NANs!

<https://pubs.opengroup.org/onlinepubs/9699919799/functions/nextafter.html>

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH] Properly simplified nextafter()
  2021-08-11 22:43                 ` Stefan Kanthak
@ 2021-08-12  0:59                   ` Rich Felker
  0 siblings, 0 replies; 34+ messages in thread
From: Rich Felker @ 2021-08-12  0:59 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, musl

On Thu, Aug 12, 2021 at 12:43:36AM +0200, Stefan Kanthak wrote:
> Szabolcs Nagy <nsz@port70.net> wrote:
> 
> >* Rich Felker <dalias@libc.org> [2021-08-11 13:57:23 -0400]:
> >> On Wed, Aug 11, 2021 at 06:50:28PM +0200, Stefan Kanthak wrote:
> >> > Rich Felker <dalias@libc.org> wrote:
> >> > > static __inline unsigned __FLOAT_BITS(float __f)
> >> > > {
> >> > > union {float __f; unsigned __i;} __u;
> >> > > __u.__f = __f;
> >> > > return __u.__i;
> >> > > }
> >> > >
> >> > > #define isnan(x) ( \
> >> > > sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> >> > > sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> >> > > __fpclassifyl(x) == FP_NAN)
> >> > >
> >> > > So, nope.
> >> > 
> >> > GCC typically uses its __builtin_isnan() for isnan(), which doesn't
> >> > use integer instructions or reloads:
> >> 
> >> That's only if you #define isnan(x) __builtin_isnan(x)
> > 
> > even then it should use int arithmetics, see below
> [...]
> > it is also wrong: isnan must not signal for signaling nan.
> 
> ARGH: I recommend to read the subject first, then the C standard and
>       recall how nextafter() is supposed to handle signalling NANs!
> 
> <https://pubs.opengroup.org/onlinepubs/9699919799/functions/nextafter.html>

I believe he was saying the implementation of isnan is wrong because
isnan isn't allowed to do that. Not that the behavior conflicts with
the contract for nextafter.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-10 21:34 ` Szabolcs Nagy
  2021-08-10 22:53   ` Stefan Kanthak
@ 2021-08-13 12:04   ` Stefan Kanthak
  2021-08-13 15:59     ` Rich Felker
  2021-08-14 23:46     ` Szabolcs Nagy
  1 sibling, 2 replies; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-13 12:04 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: musl

[-- Attachment #1: Type: text/plain, Size: 5307 bytes --]

Szabolcs Nagy <nsz@port70.net> wrote on 2021-08-10 at 23:34:

>* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-10 08:23:46 +0200]:
>> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
>> has quite some superfluous statements:
>> 
>> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
>> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
>>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
> 
> you replaced 4 int cmps with 4 float cmps (among other things).
> 
> it's target dependent if float compares are fast or not.

It's also target dependent whether the FP additions and multiplies
used to raise overflow/underflow are SLOOOWWW: how can you justify
them, especially for targets using soft-float?

|        /* raise overflow if ux.f is infinite and x is finite */
|        if (e == 0x7ff)
|                FORCE_EVAL(x+x);
|        /* raise underflow if ux.f is subnormal or zero */
|        if (e == 0)
|                FORCE_EVAL(x*x + ux.f*ux.f);

> (the i386 machine where i originally tested this preferred int
> cmp and float cmp was very slow in the subnormal range

This also and still holds for i386 FPU fadd/fmul as well as SSE
addsd/addss/mulss/mulsd additions/multiplies!

Second version:

--- -/src/math/nextafter.c
+++ +/src/math/nextafter.c
@@ -10,13 +10,13 @@
                return x + y;
        if (ux.i == uy.i)
                return y;
-       ax = ux.i & -1ULL/2;
-       ay = uy.i & -1ULL/2;
+       ax = ux.i << 2;
+       ay = uy.i << 2;
        if (ax == 0) {
                if (ay == 0)
                        return y;
                ux.i = (uy.i & 1ULL<<63) | 1;
-       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
+       } else if ((ax < ay) == ((int64_t) ux.i < 0))
                ux.i--;
        else
                ux.i++;

For AMD64, GCC generates the following ABSOLUTELY HORRIBLE CRAP
(the original code compiles even worse):

0000000000000000 <nextafter>:
   0:   48 83 ec 38             sub    $0x38,%rsp
   4:   0f 29 74 24 20          movaps %xmm6,0x20(%rsp)
   9:   49 b8 ff ff ff ff ff    movabs $0x7fffffffffffffff,%r8
  10:   ff ff 7f
  13:   49 b9 00 00 00 00 00    movabs $0x7ff0000000000000,%r9
  1a:   00 f0 7f
  1d:   66 49 0f 7e c2          movq   %xmm0,%r10
  22:   66 48 0f 7e c2          movq   %xmm0,%rdx
  27:   66 48 0f 7e c8          movq   %xmm1,%rax
  2c:   4d 21 c2                and    %r8,%r10
  2f:   66 48 0f 7e c1          movq   %xmm0,%rcx
  34:   4d 39 ca                cmp    %r9,%r10
  37:   0f 87 83 00 00 00       ja     bb <nextafter+0xbb>
  3d:   49 21 c0                and    %rax,%r8
  40:   66 49 0f 7e ca          movq   %xmm1,%r10
  45:   4d 39 c8                cmp    %r9,%r8
  48:   77 76                   ja     bb <nextafter+0xbb>
  4a:   66 0f 28 f1             movapd %xmm1,%xmm6
  4e:   48 39 d0                cmp    %rdx,%rax
  51:   74 7b                   je     c9 <nextafter+0xc9>
  53:   66 49 0f 7e c0          movq   %xmm0,%r8
  58:   48 8d 04 85 00 00 00    lea    0x0(,%rax,4),%rax
  5f:   00
  60:   49 c1 e0 02             shl    $0x2,%r8
  64:   74 7a                   je     db <nextafter+0xd7>
  66:   49 39 c0                cmp    %rax,%r8
  69:   66 49 0f 7e c0          movq   %xmm0,%r8
  6e:   48 8d 42 ff             lea    -0x1(%rdx),%rax
  72:   41 0f 93 c1             setae  %r9b
  76:   49 c1 e8 3f             shr    $0x3f,%r8
  7a:   48 83 c1 01             add    $0x1,%rcx
  7e:   45 38 c1                cmp    %r8b,%r9b
  81:   48 0f 44 c1             cmove  %rcx,%rax
  85:   48 89 c1                mov    %rax,%rcx
  88:   66 48 0f 6e f0          movq   %rax,%xmm6
  8d:   48 c1 e9 34             shr    $0x34,%rcx
  91:   81 e1 ff 07 00 00       and    $0x7ff,%ecx
  97:   81 f9 ff 07 00 00       cmp    $0x7ff,%ecx
  9d:   74 61                   je     ef <nextafter+0xef>
  9f:   85 c9                   test   %ecx,%ecx
  a1:   75 2b                   jne    c9 <nextafter+0xc9>
  a3:   66 48 0f 6e c2          movq   %rdx,%xmm0
  a8:   66 48 0f 6e c8          movq   %rax,%xmm1
  ad:   f2 0f 59 ce             mulsd  %xmm6,%xmm1
  b1:   f2 0f 59 c0             mulsd  %xmm0,%xmm0
  b5:   f2 0f 58 c1             addsd  %xmm1,%xmm0
  b9:   eb 0e                   jmp    c9 <nextafter+0xc9>
  bb:   66 48 0f 6e f2          movq   %rdx,%xmm6
  c0:   66 48 0f 6e d0          movq   %rax,%xmm2
  c5:   f2 0f 58 f2             addsd  %xmm2,%xmm6
  c9:   66 0f 28 c6             movapd %xmm6,%xmm0
  cd:   0f 28 74 24 20          movaps 0x20(%rsp),%xmm6
  d2:   48 83 c4 38             add    $0x38,%rsp
  d6:   c3                      retq
  d7:   48 85 c0                test   %rax,%rax
  da:   74 e9                   je     c9 <nextafter+0xc9>
  dc:   48 b8 00 00 00 00 00    movabs $0x8000000000000000,%rax
  e3:   00 00 80
  e6:   4c 21 d0                and    %r10,%rax
  ea:   48 83 c8 01             or     $0x1,%rax
  ed:   eb 8d                   jmp    85 <nextafter+0x85>
  ef:   66 48 0f 6e c2          movq   %rdx,%xmm0
  f4:   f2 0f 58 c0             addsd  %xmm0,%xmm0
  f8:   eb be                   jmp    c9 <nextafter+0xc9>

How do you compare these 60 instructions/252 bytes to the code I posted
(23 instructions/72 bytes)?

not amused about such HORRIBLE machine code!
Stefan

[-- Attachment #2: nextafter.patch --]
[-- Type: application/octet-stream, Size: 434 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-13 12:04   ` [musl] [PATCH #2] " Stefan Kanthak
@ 2021-08-13 15:59     ` Rich Felker
  2021-08-13 18:30       ` Stefan Kanthak
  2021-08-14 23:46     ` Szabolcs Nagy
  1 sibling, 1 reply; 34+ messages in thread
From: Rich Felker @ 2021-08-13 15:59 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, musl

On Fri, Aug 13, 2021 at 02:04:51PM +0200, Stefan Kanthak wrote:
> Szabolcs Nagy <nsz@port70.net> wrote on 2021-08-10 at 23:34:
> 
> >* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-10 08:23:46 +0200]:
> >> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
> >> has quite some superfluous statements:
> >> 
> >> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
> >> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
> >>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
> > 
> > you replaced 4 int cmps with 4 float cmps (among other things).
> > 
> > it's target dependent if float compares are fast or not.
> 
> It's also target dependent whether the FP additions and multiplies
> used to raise overflow/underflow are SLOOOWWW: how can you justify
> them, especially for targets using soft-float?

On a target with fenv, raising the exception flags is mandatory and
something must be done to make that happen. Performing the arithmetic
is far faster than prodding at the status flag registers explicitly.

On a target without fenv (all the existing soft float targets), it
should be valid for the compiler just not to emit them, since they
have no side effects. I think we could make FORCE_EVAL expand to
nothing on such targets (if the FE_* macros are not defined). This is
probably worth doing.

(Note that FORCE_EVAL only exists because compilers are buggy and
don't respect the fact that floating point expressions have side
effects even if the result is not used; with a properly working
compiler, it would not be needed, and non-fenv targets would
automatically omit the no-side-effect code.)

> |        /* raise overflow if ux.f is infinite and x is finite */
> |        if (e == 0x7ff)
> |                FORCE_EVAL(x+x);
> |        /* raise underflow if ux.f is subnormal or zero */
> |        if (e == 0)
> |                FORCE_EVAL(x*x + ux.f*ux.f);
> 
> > (the i386 machine where i originally tested this preferred int
> > cmp and float cmp was very slow in the subnormal range
> 
> This also and still holds for i386 FPU fadd/fmul as well as SSE
> addsd/addss/mulss/mulsd additions/multiplies!

It may be possible to reduce the number of such ops too; not sure. But
there's no way to eliminate them.

> Second version:
> 
> --- -/src/math/nextafter.c
> +++ +/src/math/nextafter.c
> @@ -10,13 +10,13 @@
>                 return x + y;
>         if (ux.i == uy.i)
>                 return y;
> -       ax = ux.i & -1ULL/2;
> -       ay = uy.i & -1ULL/2;
> +       ax = ux.i << 2;
> +       ay = uy.i << 2;
>         if (ax == 0) {
>                 if (ay == 0)
>                         return y;
>                 ux.i = (uy.i & 1ULL<<63) | 1;
> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
> +       } else if ((ax < ay) == ((int64_t) ux.i < 0))
>                 ux.i--;
>         else
>                 ux.i++;
> 
> For AMD64, GCC generates the following ABSOLUTELY HORRIBLE CRAP
> (the original code compiles even worse):
> 
> 0000000000000000 <nextafter>:
>    0:   48 83 ec 38             sub    $0x38,%rsp
>    4:   0f 29 74 24 20          movaps %xmm6,0x20(%rsp)
>    9:   49 b8 ff ff ff ff ff    movabs $0x7fffffffffffffff,%r8
>   10:   ff ff 7f
>   13:   49 b9 00 00 00 00 00    movabs $0x7ff0000000000000,%r9
>   1a:   00 f0 7f
>   1d:   66 49 0f 7e c2          movq   %xmm0,%r10
>   22:   66 48 0f 7e c2          movq   %xmm0,%rdx
>   27:   66 48 0f 7e c8          movq   %xmm1,%rax
>   2c:   4d 21 c2                and    %r8,%r10
>   2f:   66 48 0f 7e c1          movq   %xmm0,%rcx
>   34:   4d 39 ca                cmp    %r9,%r10
>   37:   0f 87 83 00 00 00       ja     bb <nextafter+0xbb>
>   3d:   49 21 c0                and    %rax,%r8
>   40:   66 49 0f 7e ca          movq   %xmm1,%r10
>   45:   4d 39 c8                cmp    %r9,%r8
>   48:   77 76                   ja     bb <nextafter+0xbb>
>   4a:   66 0f 28 f1             movapd %xmm1,%xmm6
>   4e:   48 39 d0                cmp    %rdx,%rax
>   51:   74 7b                   je     c9 <nextafter+0xc9>
>   53:   66 49 0f 7e c0          movq   %xmm0,%r8
>   58:   48 8d 04 85 00 00 00    lea    0x0(,%rax,4),%rax
>   5f:   00
>   60:   49 c1 e0 02             shl    $0x2,%r8
>   64:   74 7a                   je     db <nextafter+0xd7>
>   66:   49 39 c0                cmp    %rax,%r8
>   69:   66 49 0f 7e c0          movq   %xmm0,%r8
>   6e:   48 8d 42 ff             lea    -0x1(%rdx),%rax
>   72:   41 0f 93 c1             setae  %r9b
>   76:   49 c1 e8 3f             shr    $0x3f,%r8
>   7a:   48 83 c1 01             add    $0x1,%rcx
>   7e:   45 38 c1                cmp    %r8b,%r9b
>   81:   48 0f 44 c1             cmove  %rcx,%rax
>   85:   48 89 c1                mov    %rax,%rcx
>   88:   66 48 0f 6e f0          movq   %rax,%xmm6
>   8d:   48 c1 e9 34             shr    $0x34,%rcx
>   91:   81 e1 ff 07 00 00       and    $0x7ff,%ecx
>   97:   81 f9 ff 07 00 00       cmp    $0x7ff,%ecx
>   9d:   74 61                   je     ef <nextafter+0xef>
>   9f:   85 c9                   test   %ecx,%ecx
>   a1:   75 2b                   jne    c9 <nextafter+0xc9>
>   a3:   66 48 0f 6e c2          movq   %rdx,%xmm0
>   a8:   66 48 0f 6e c8          movq   %rax,%xmm1
>   ad:   f2 0f 59 ce             mulsd  %xmm6,%xmm1
>   b1:   f2 0f 59 c0             mulsd  %xmm0,%xmm0
>   b5:   f2 0f 58 c1             addsd  %xmm1,%xmm0
>   b9:   eb 0e                   jmp    c9 <nextafter+0xc9>
>   bb:   66 48 0f 6e f2          movq   %rdx,%xmm6
>   c0:   66 48 0f 6e d0          movq   %rax,%xmm2
>   c5:   f2 0f 58 f2             addsd  %xmm2,%xmm6
>   c9:   66 0f 28 c6             movapd %xmm6,%xmm0
>   cd:   0f 28 74 24 20          movaps 0x20(%rsp),%xmm6
>   d2:   48 83 c4 38             add    $0x38,%rsp
>   d6:   c3                      retq
>   d7:   48 85 c0                test   %rax,%rax
>   da:   74 e9                   je     c9 <nextafter+0xc9>
>   dc:   48 b8 00 00 00 00 00    movabs $0x8000000000000000,%rax
>   e3:   00 00 80
>   e6:   4c 21 d0                and    %r10,%rax
>   ea:   48 83 c8 01             or     $0x1,%rax
>   ed:   eb 8d                   jmp    85 <nextafter+0x85>
>   ef:   66 48 0f 6e c2          movq   %rdx,%xmm0
>   f4:   f2 0f 58 c0             addsd  %xmm0,%xmm0
>   f8:   eb be                   jmp    c9 <nextafter+0xc9>
> 
> How do you compare these 60 instructions/252 bytes to the code I posted
> (23 instructions/72 bytes)?
> 
> not amused about such HORRIBLE machine code!

Then submit patched to GCC. We are not writing a libc in assembly
language just because of a compiler that can't generate sufficiently
good code.

Have you traced the different execution paths through the above to see
if the cases that matter (non-exceptional ones) are actually terribly
bad? I haven't trawled through it to do this, but that's the first
thing I'd do if I wanted to evaluate it. From there it's easier to see
what's actually going wrong (if anything) and use that to reason about
whether there are reasonable changes to be made to the C code that
would produce better output.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-13 15:59     ` Rich Felker
@ 2021-08-13 18:30       ` Stefan Kanthak
  2021-08-14  4:07         ` Damian McGuckin
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-13 18:30 UTC (permalink / raw)
  To: Rich Felker; +Cc: Szabolcs Nagy, musl

Rich Felker <dalias@libc.org> wrote:

> On Fri, Aug 13, 2021 at 02:04:51PM +0200, Stefan Kanthak wrote:
>> Szabolcs Nagy <nsz@port70.net> wrote on 2021-08-10 at 23:34:
>> 
>> >* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-10 08:23:46 +0200]:
>> >> <https://git.musl-libc.org/cgit/musl/plain/src/math/nextafter.c>
>> >> has quite some superfluous statements:
>> >> 
>> >> 1. there's absolutely no need for 2 uint64_t holding |x| and |y|;
>> >> 2. IEEE-754 specifies -0.0 == +0.0, so (x == y) is equivalent to
>> >>    (ax == 0) && (ay == 0): the latter 2 tests can be removed;
>> > 
>> > you replaced 4 int cmps with 4 float cmps (among other things).
>> > 
>> > it's target dependent if float compares are fast or not.
>> 
>> It's also target dependent whether the FP additions and multiplies
>> used to raise overflow/underflow are SLOOOWWW: how can you justify
>> them, especially for targets using soft-float?
> 
> On a target with fenv, raising the exception flags is mandatory and
> something must be done to make that happen. Performing the arithmetic
> is far faster than prodding at the status flag registers explicitly.

I did NOT propose the latter; I just questioned whether/why you use
two multiplications to detect underflow.

JFTR: FORCE_EVAL(dummy+0.0); raises both overflow and underflow,
      without multiplication.

> On a target without fenv (all the existing soft float targets), it
> should be valid for the compiler just not to emit them, since they
> have no side effects. I think we could make FORCE_EVAL expand to
> nothing on such targets (if the FE_* macros are not defined). This is
> probably worth doing.

Correct, that's the right thing to do there.

[ buggy compiler ]

> It may be possible to reduce the number of such ops too; not sure. But
> there's no way to eliminate them.

It's definitely possible to do a strength reduction and get rid of the
multiplications.

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-13 18:30       ` Stefan Kanthak
@ 2021-08-14  4:07         ` Damian McGuckin
  2021-08-14 22:45           ` Szabolcs Nagy
  0 siblings, 1 reply; 34+ messages in thread
From: Damian McGuckin @ 2021-08-14  4:07 UTC (permalink / raw)
  To: musl

On Fri, 13 Aug 2021, Stefan Kanthak wrote:

>> It may be possible to reduce the number of such ops too; not sure. But
>> there's no way to eliminate them.

Replacing the second FORCE'd expression with

 	FORCE((ux.f + x) * (0x1.0p-52 * 0.25));

eliminates one floating point OP, assuming the optimiser does the 
right thing to

 	epsilon / 4

in the expression. Some preliminary testing seems to suggest that the same 
exceptions get raised.

> It's definitely possible to do a strength reduction and get rid of the
> multiplications.

But then how is the exception still raised, or are we talking soft FP?

Stay safe - Damian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-14  4:07         ` Damian McGuckin
@ 2021-08-14 22:45           ` Szabolcs Nagy
  0 siblings, 0 replies; 34+ messages in thread
From: Szabolcs Nagy @ 2021-08-14 22:45 UTC (permalink / raw)
  To: Damian McGuckin; +Cc: musl

* Damian McGuckin <damianm@esi.com.au> [2021-08-14 14:07:34 +1000]:
> On Fri, 13 Aug 2021, Stefan Kanthak wrote:
> > > It may be possible to reduce the number of such ops too; not sure. But
> > > there's no way to eliminate them.
> 
> Replacing the second FORCE'd expression with
> 
> 	FORCE((ux.f + x) * (0x1.0p-52 * 0.25));
> 
> eliminates one floating point OP, assuming the optimiser does the right
> thing to
> 
> 	epsilon / 4
> 
> in the expression. Some preliminary testing seems to suggest that the same
> exceptions get raised.
> 
> > It's definitely possible to do a strength reduction and get rid of the
> > multiplications.
> 
> But then how is the exception still raised, or are we talking soft FP?

that does not work, i think.

but the subnormal result case is not worth optimizing.
(or rather, i prefer to optimize cold code paths for size)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-13 12:04   ` [musl] [PATCH #2] " Stefan Kanthak
  2021-08-13 15:59     ` Rich Felker
@ 2021-08-14 23:46     ` Szabolcs Nagy
  2021-08-15  7:04       ` Stefan Kanthak
  1 sibling, 1 reply; 34+ messages in thread
From: Szabolcs Nagy @ 2021-08-14 23:46 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: musl

* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-13 14:04:51 +0200]:
> Szabolcs Nagy <nsz@port70.net> wrote on 2021-08-10 at 23:34:
> > it's target dependent if float compares are fast or not.
> 
> It's also target dependent whether the FP additions and multiplies
> used to raise overflow/underflow are SLOOOWWW: how can you justify
> them, especially for targets using soft-float?

for fenv side-effects, using fp arithmetic or conversion
operations are ideal.

i pointed out the new subnormal handling cases your patch
introduced because it was not clear that you considered it.
i'm mainly concerned about the performance of the common
cases, not rare special cases.

> > (the i386 machine where i originally tested this preferred int
> > cmp and float cmp was very slow in the subnormal range
> 
> This also and still holds for i386 FPU fadd/fmul as well as SSE
> addsd/addss/mulss/mulsd additions/multiplies!

they are avoided in the common case, and only used to create
fenv side-effects.

> --- -/src/math/nextafter.c
> +++ +/src/math/nextafter.c
> @@ -10,13 +10,13 @@
>                 return x + y;
>         if (ux.i == uy.i)
>                 return y;
> -       ax = ux.i & -1ULL/2;
> -       ay = uy.i & -1ULL/2;
> +       ax = ux.i << 2;
> +       ay = uy.i << 2;

the << 2 looks wrong, the top bit of the exponent is lost.

>         if (ax == 0) {
>                 if (ay == 0)
>                         return y;
>                 ux.i = (uy.i & 1ULL<<63) | 1;
> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
> +       } else if ((ax < ay) == ((int64_t) ux.i < 0))
>                 ux.i--;
>         else
>                 ux.i++;
...
> How do you compare these 60 instructions/252 bytes to the code I posted
> (23 instructions/72 bytes)?

you should benchmark, but the second best is to look
at the longest dependency chain in the hot path and
add up the instruction latencies.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-14 23:46     ` Szabolcs Nagy
@ 2021-08-15  7:04       ` Stefan Kanthak
  2021-08-15  7:46         ` Ariadne Conill
                           ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-15  7:04 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: musl

Szabolcs Nagy <nsz@port70.net> wrote:

>* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-13 14:04:51 +0200]:
>> Szabolcs Nagy <nsz@port70.net> wrote on 2021-08-10 at 23:34:

>> > (the i386 machine where i originally tested this preferred int
>> > cmp and float cmp was very slow in the subnormal range
>>
>> This also and still holds for i386 FPU fadd/fmul as well as SSE
>> addsd/addss/mulss/mulsd additions/multiplies!
>
> they are avoided in the common case, and only used to create
> fenv side-effects.

Unfortunately but for hard & SOFT-float, where no fenv exists, as
Rich wrote.

>> --- -/src/math/nextafter.c
>> +++ +/src/math/nextafter.c
>> @@ -10,13 +10,13 @@
>>                 return x + y;
>>         if (ux.i == uy.i)
>>                 return y;
>> -       ax = ux.i & -1ULL/2;
>> -       ay = uy.i & -1ULL/2;
>> +       ax = ux.i << 2;
>> +       ay = uy.i << 2;
>
> the << 2 looks wrong, the top bit of the exponent is lost.

It IS wrong, but only in the post, not in the code I tested.

>>         if (ax == 0) {
>>                 if (ay == 0)
>>                         return y;
>>                 ux.i = (uy.i & 1ULL<<63) | 1;
>> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
>> +       } else if ((ax < ay) == ((int64_t) ux.i < 0))
>>                 ux.i--;
>>         else
>>                 ux.i++;
> ...
>> How do you compare these 60 instructions/252 bytes to the code I posted
>> (23 instructions/72 bytes)?
>
> you should benchmark, but the second best is to look
> at the longest dependency chain in the hot path and
> add up the instruction latencies.

1 billion calls to nextafter(), with random from, and to either 0 or +INF:
run 1 against glibc,                         8.58 ns/call
run 2 against musl original,                 3.59
run 3 against musl patched,                  0.52
run 4 the pure floating-point variant from   0.72
      my initial post in this thread,
run 5 the assembly variant I posted.         0.28 ns/call

Now hurry up and patch your slowmotion code!

Stefan

PS: I cheated a very tiny little bit: the isnan() macro of musl patched is

#ifdef PATCH
#define isnan(x) ( \
sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
__fpclassifyl(x) == FP_NAN)
#else
#define isnan(x) ( \
sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
__fpclassifyl(x) == FP_NAN)
#endif // PATCH

PPS: and of course the log from the benchmarks...

[stefan@rome ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7262 8-Core Processor
Stepping:              0
CPU MHz:               3194.323
BogoMIPS:              6388.64
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
...
[stefan@rome ~]$ gcc --version
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[stefan@rome ~]$ cat patch.c
#include <math.h>

static __inline unsigned __FLOAT_BITS(float __f)
{
 union {float __f; unsigned __i;} __u;
 __u.__f = __f;
 return __u.__i;
}

static __inline unsigned long long __DOUBLE_BITS(double __f)
{
 union {double __f; unsigned long long __i;} __u;
 __u.__f = __f;
 return __u.__i;
}

#ifdef PATCH
#define isnan(x) ( \
sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
__fpclassifyl(x) == FP_NAN)
#else
#define isnan(x) ( \
sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
__fpclassifyl(x) == FP_NAN)
#endif // PATCH

__attribute__((noinline))
double nextafter(double x, double y)
{
 union {double f; unsigned long long i;} ux={x}, uy={y};
 unsigned long long ax, ay;
 int e;

 if (isnan(x) || isnan(y))
  return x + y;
 if (ux.i == uy.i)
  return y;
#ifdef PATCH
 ax = ux.i << 1;
 ay = uy.i << 1;
#else
 ax = ux.i & -1ULL/2;
 ay = uy.i & -1ULL/2;
#endif
 if (ax == 0) {
  if (ay == 0)
   return y;
  ux.i = (uy.i & 1ULL<<63) | 1;
#ifdef PATCH
 } else if (ax < ay == (long long) ux.i < 0)
#else
 } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
#endif
  ux.i--;
 else
  ux.i++;
 e = ux.i >> 52 & 0x7ff;
 /* raise overflow if ux.f is infinite and x is finite */
 if (e == 0x7ff)
  FORCE_EVAL(x + x);
 /* raise underflow if ux.f is subnormal or zero */
 if (e == 0)
  FORCE_EVAL(x*x + ux.f*ux.f);
 return ux.f;
}
[stefan@rome ~]$ cat bench.c
// Copyright © 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

__attribute__((noinline))
double nop(double foo, double bar)
{
    return foo + bar;
}

inline static
double lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    //   initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const  uint64_t sign = (int64_t) lfsr >> 63;

    lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);

    return *(double *) &lfsr;
}

inline static
double random64(void)
{
    static uint64_t seed = 0x0123456789ABCDEF;

    seed = seed * 6364136223846793005 + 1442695040888963407;

    return *(double *) &seed;
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t n;
    volatile double result;

    t0 = clock();

    for (n = 500000000u; n > 0u; n--) {
        result = nop(lfsr64(), 0.0);
        result = nop(random64(), 1.0 / 0.0);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--) {
        result = nextafter(lfsr64(), 0.0);
        result = nextafter(random64(), 1.0 / 0.0);
    }

    t2 = clock();
    tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;

    printf("\n"
           "nop()         %4lu.%06lu       0\n"
           "nextafter()   %4lu.%06lu    %4lu.%06lu\n"
           "              %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
[stefan@rome ~]$ gcc -O3 -lm bench.c
[stefan@rome ~]$ perf stat ./a.out

nop()            1.480000       0
nextafter()     10.060000       8.580000
                11.540000 nano-seconds

 Performance counter stats for './a.out':

         11,548.78 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               145      page-faults:u             #    0.013 K/sec
    38,917,213,536      cycles:u                  #    3.370 GHz                      (83.33%)
    15,647,656,615      stalled-cycles-frontend:u #   40.21% frontend cycles idle     (83.33%)
    10,746,044,422      stalled-cycles-backend:u  #   27.61% backend cycles idle      (83.33%)
    69,739,403,870      instructions:u            #    1.79  insn per cycle
                                                  #    0.22  stalled cycles per insn  (83.33%)
    16,495,748,110      branches:u                # 1428.354 M/sec                    (83.33%)
       500,701,246      branch-misses:u           #    3.04% of all branches          (83.33%)

      11.550414029 seconds time elapsed

      11.548265000 seconds user
       0.000999000 seconds sys


[stefan@rome ~]$ gcc -O3 bench.c patch.c
patch.c:23: warning: "isnan" redefined
 #define isnan(x) ( \

In file included from patch.c:1:
/usr/include/math.h:254: note: this is the location of the previous definition
 #  define isnan(x) \

[stefan@rome ~]$ perf stat ./a.out

nop()            1.480000       0
nextafter()      5.070000       3.590000
                 6.550000 nano-seconds

 Performance counter stats for './a.out':

          6,558.44 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               122      page-faults:u             #    0.019 K/sec
    22,038,054,931      cycles:u                  #    3.360 GHz                      (83.33%)
           204,849      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
     5,497,340,276      stalled-cycles-backend:u  #   24.94% backend cycles idle      (83.34%)
    39,751,746,482      instructions:u            #    1.80  insn per cycle
                                                  #    0.14  stalled cycles per insn  (83.34%)
     9,747,428,086      branches:u                # 1486.242 M/sec                    (83.34%)
       250,407,409      branch-misses:u           #    2.57% of all branches          (83.33%)

       6.559550924 seconds time elapsed

       6.558918000 seconds user
       0.000000000 seconds sys


[stefan@rome ~]$ gcc -O3 bench.c -DPATCH patch.c
patch.c:18: warning: "isnan" redefined
 #define isnan(x) ( \

In file included from patch.c:1:
/usr/include/math.h:254: note: this is the location of the previous definition
 #  define isnan(x) \

[stefan@rome ~]$ perf stat ./a.out

nop()            1.480000       0
nextafter()      2.000000       0.520000
                 3.480000 nano-seconds

 Performance counter stats for './a.out':

          3,489.59 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               123      page-faults:u             #    0.035 K/sec
    11,721,075,764      cycles:u                  #    3.359 GHz                      (83.32%)
           132,680      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.32%)
     4,500,051,993      stalled-cycles-backend:u  #   38.39% backend cycles idle      (83.32%)
    40,501,721,908      instructions:u            #    3.46  insn per cycle
                                                  #    0.11  stalled cycles per insn  (83.33%)
     8,494,571,027      branches:u                # 2434.258 M/sec                    (83.35%)
           497,230      branch-misses:u           #    0.01% of all branches          (83.34%)

       3.490437579 seconds time elapsed

       3.490053000 seconds user
       0.000000000 seconds sys


[stefan@rome ~]$ gcc -O3 patch.c nextafter-fp.c
[stefan@rome ~]$ perf stat ./a.out

nop()            1.490000       0
nextafter()      2.210000       0.720000
                 3.700000 nano-seconds

 Performance counter stats for './a.out':

          3,702.89 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               122      page-faults:u             #    0.033 K/sec
    12,407,345,183      cycles:u                  #    3.351 GHz                      (83.32%)
           135,817      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
     5,498,895,906      stalled-cycles-backend:u  #   44.32% backend cycles idle      (83.34%)
    38,002,430,460      instructions:u            #    3.06  insn per cycle
                                                  #    0.14  stalled cycles per insn  (83.34%)
     7,497,381,393      branches:u                # 2024.735 M/sec                    (83.34%)
           497,462      branch-misses:u           #    0.01% of all branches          (83.32%)

       3.703648875 seconds time elapsed

       3.703294000 seconds user
       0.000000000 seconds sys


[stefan@rome ~]$ gcc -O3 bench.c nextafter.s
[stefan@rome ~]$ perf stat ./a.out

nop()            1.630000       0
nextafter()      1.910000       0.280000
                 3.540000 nano-seconds

 Performance counter stats for './a.out':

          3,547.12 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               123      page-faults:u             #    0.035 K/sec
    11,949,840,797      cycles:u                  #    3.369 GHz                      (83.32%)
           129,627      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
     3,998,960,716      stalled-cycles-backend:u  #   33.46% backend cycles idle      (83.34%)
    37,493,523,285      instructions:u            #    3.14  insn per cycle
                                                  #    0.11  stalled cycles per insn  (83.34%)
     7,998,559,192      branches:u                # 2254.945 M/sec                    (83.34%)
           497,565      branch-misses:u           #    0.01% of all branches          (83.32%)

       3.547999008 seconds time elapsed

       3.546671000 seconds user
       0.000999000 seconds sys


[stefan@rome ~]$


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15  7:04       ` Stefan Kanthak
@ 2021-08-15  7:46         ` Ariadne Conill
  2021-08-15 13:59           ` Rich Felker
  2021-08-15  8:24         ` Damian McGuckin
  2021-08-15 14:56         ` Szabolcs Nagy
  2 siblings, 1 reply; 34+ messages in thread
From: Ariadne Conill @ 2021-08-15  7:46 UTC (permalink / raw)
  To: musl; +Cc: Szabolcs Nagy

Hi,

On Sun, 15 Aug 2021, Stefan Kanthak wrote:

> Szabolcs Nagy <nsz@port70.net> wrote:
>
>> * Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-13 14:04:51 +0200]:
>>> Szabolcs Nagy <nsz@port70.net> wrote on 2021-08-10 at 23:34:
>
>>>> (the i386 machine where i originally tested this preferred int
>>>> cmp and float cmp was very slow in the subnormal range
>>>
>>> This also and still holds for i386 FPU fadd/fmul as well as SSE
>>> addsd/addss/mulss/mulsd additions/multiplies!
>>
>> they are avoided in the common case, and only used to create
>> fenv side-effects.
>
> Unfortunately but for hard & SOFT-float, where no fenv exists, as
> Rich wrote.

My admittedly rudementary understanding of how soft-float is implemented 
in musl leads me to believe that this doesn't really matter that much.

>>> --- -/src/math/nextafter.c
>>> +++ +/src/math/nextafter.c
>>> @@ -10,13 +10,13 @@
>>>                 return x + y;
>>>         if (ux.i == uy.i)
>>>                 return y;
>>> -       ax = ux.i & -1ULL/2;
>>> -       ay = uy.i & -1ULL/2;
>>> +       ax = ux.i << 2;
>>> +       ay = uy.i << 2;
>>
>> the << 2 looks wrong, the top bit of the exponent is lost.
>
> It IS wrong, but only in the post, not in the code I tested.

So... in other words, you are testing code that is different than the code 
you are submitting?

>
>>>         if (ax == 0) {
>>>                 if (ay == 0)
>>>                         return y;
>>>                 ux.i = (uy.i & 1ULL<<63) | 1;
>>> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
>>> +       } else if ((ax < ay) == ((int64_t) ux.i < 0))
>>>                 ux.i--;
>>>         else
>>>                 ux.i++;
>> ...
>>> How do you compare these 60 instructions/252 bytes to the code I posted
>>> (23 instructions/72 bytes)?
>>
>> you should benchmark, but the second best is to look
>> at the longest dependency chain in the hot path and
>> add up the instruction latencies.
>
> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> run 1 against glibc,                         8.58 ns/call
> run 2 against musl original,                 3.59
> run 3 against musl patched,                  0.52
> run 4 the pure floating-point variant from   0.72
>      my initial post in this thread,
> run 5 the assembly variant I posted.         0.28 ns/call
>
> Now hurry up and patch your slowmotion code!

And how do these benchmarks look on non-x86 architectures, like aarch64 or 
riscv64?

I would rather have a portable math library with functions that cost 3.59 
nsec per call than one where the portable bits are not exercised on x86.

> Stefan
>
> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
>
> #ifdef PATCH
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
> __fpclassifyl(x) == FP_NAN)
> #else
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> __fpclassifyl(x) == FP_NAN)
> #endif // PATCH
>
> PPS: and of course the log from the benchmarks...
>
> [stefan@rome ~]$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             AuthenticAMD
> CPU family:            23
> Model:                 49
> Model name:            AMD EPYC 7262 8-Core Processor
> Stepping:              0
> CPU MHz:               3194.323
> BogoMIPS:              6388.64
> Virtualization:        AMD-V
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              512K
> L3 cache:              16384K
> ...
> [stefan@rome ~]$ gcc --version
> gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc 8 is quite old at this point.  gcc 9 and 10 have much better 
optimizers that are much more capable.

Indeed, on my system with GCC 10.3.1, nextafter() is using SSE2 
instructions on Alpine x86_64, and if I rebuild musl with `-march=znver2` 
it uses AVX instructions for nextafter(), which seems more than 
sufficiently optimized to me.

Speaking in personal capacity only, I would rather musl's math routines 
remain to the point instead of going down the rabbit hole of manually 
optimized routines like glibc has done.  That also includes 
hand-optimizing the C routines to exploit optimal behavior for some 
specific microarch.  GCC already has knowledge of what optimizations are 
good for a specific microarch (this is the whole point of `-march` and 
`-mtune` after all), if something is missing, it should be fixed there.

Ariadne

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15  7:04       ` Stefan Kanthak
  2021-08-15  7:46         ` Ariadne Conill
@ 2021-08-15  8:24         ` Damian McGuckin
  2021-08-15 14:03           ` Rich Felker
  2021-08-15 14:56         ` Szabolcs Nagy
  2 siblings, 1 reply; 34+ messages in thread
From: Damian McGuckin @ 2021-08-15  8:24 UTC (permalink / raw)
  To: musl; +Cc: Szabolcs Nagy


Hi Stefan,

On Sun, 15 Aug 2021, Stefan Kanthak wrote:

> __attribute__((noinline))
> double nextafter(double x, double y)
> {
> union {double f; unsigned long long i;} ux={x}, uy={y};
> unsigned long long ax, ay;
> int e;
>
> if (isnan(x) || isnan(y))
>  return x + y;
> if (ux.i == uy.i)
>  return y;
> #ifdef PATCH
> ax = ux.i << 1;
> ay = uy.i << 1;
> #else
> ax = ux.i & -1ULL/2;
> ay = uy.i & -1ULL/2;
> #endif
> if (ax == 0) {
>  if (ay == 0)
>   return y;
>  ux.i = (uy.i & 1ULL<<63) | 1;
> #ifdef PATCH
> } else if (ax < ay == (long long) ux.i < 0)
> #else
> } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
> #endif
>  ux.i--;
> else
>  ux.i++;
> e = ux.i >> 52 & 0x7ff;
> /* raise overflow if ux.f is infinite and x is finite */
> if (e == 0x7ff)
>  FORCE_EVAL(x + x);
> /* raise underflow if ux.f is subnormal or zero */
> if (e == 0)
>  FORCE_EVAL(x*x + ux.f*ux.f);
> return ux.f;
> }

Maybe I am missing something and my brain is in weekend-mode ...

I did a quick check and ran the above code for some test cases:

nextafter(-9.7500000000e+01, 3.5000000000e+01) = -9.7500000000e+01
yourpatch(-9.7500000000e+01, 3.5000000000e+01) = -9.7500000000e+01

The error is 2.8421709430e-14

nextafter(-9.7500000000e+01, -3.5000000000e+01) = -9.7500000000e+01
yourpatch(-9.7500000000e+01, -3.5000000000e+01) = -9.7500000000e+01

The error is 2.8421709430e-14

nextafter(-inf, inf) = -1.7976931349e+308 Correct
yourpatch(-inf, inf) = -nan Incorrect

This is against standard GLIB.

Comparing the normal MUSL code against GLIB, there are no errors.

If I change this line:

 	} else if (ax < ay == (long long) ux.i < 0)

to
 	} else if (x < y == (long long) ux.i < 0)

Your code works flawlessly. But it introduces a floating point comparison.

- Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15  7:46         ` Ariadne Conill
@ 2021-08-15 13:59           ` Rich Felker
  2021-08-15 14:57             ` Ariadne Conill
  0 siblings, 1 reply; 34+ messages in thread
From: Rich Felker @ 2021-08-15 13:59 UTC (permalink / raw)
  To: Ariadne Conill; +Cc: musl, Szabolcs Nagy

On Sun, Aug 15, 2021 at 02:46:58AM -0500, Ariadne Conill wrote:
> On Sun, 15 Aug 2021, Stefan Kanthak wrote:
> >[stefan@rome ~]$ gcc --version
> >gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
> >Copyright (C) 2018 Free Software Foundation, Inc.
> >This is free software; see the source for copying conditions.  There is NO
> >warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> gcc 8 is quite old at this point.  gcc 9 and 10 have much better
> optimizers that are much more capable.
> 
> Indeed, on my system with GCC 10.3.1, nextafter() is using SSE2
> instructions on Alpine x86_64, and if I rebuild musl with
> `-march=znver2` it uses AVX instructions for nextafter(), which
> seems more than sufficiently optimized to me.

As far as I can tell, the instructions used are not the issue here,
and there are no specialized instructions that help make it faster. If
GCC is doing a bad job, it's more a matter of the high level flow,
choice of how to load constants, how branches are implemented, etc.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15  8:24         ` Damian McGuckin
@ 2021-08-15 14:03           ` Rich Felker
  2021-08-15 15:10             ` Damian McGuckin
  0 siblings, 1 reply; 34+ messages in thread
From: Rich Felker @ 2021-08-15 14:03 UTC (permalink / raw)
  To: Damian McGuckin; +Cc: musl, Szabolcs Nagy

On Sun, Aug 15, 2021 at 06:24:31PM +1000, Damian McGuckin wrote:
> 
> Hi Stefan,
> 
> On Sun, 15 Aug 2021, Stefan Kanthak wrote:
> 
> >__attribute__((noinline))
> >double nextafter(double x, double y)
> >{
> >union {double f; unsigned long long i;} ux={x}, uy={y};
> >unsigned long long ax, ay;
> >int e;
> >
> >if (isnan(x) || isnan(y))
> > return x + y;
> >if (ux.i == uy.i)
> > return y;
> >#ifdef PATCH
> >ax = ux.i << 1;
> >ay = uy.i << 1;
> >#else
> >ax = ux.i & -1ULL/2;
> >ay = uy.i & -1ULL/2;
> >#endif
> >if (ax == 0) {
> > if (ay == 0)
> >  return y;
> > ux.i = (uy.i & 1ULL<<63) | 1;
> >#ifdef PATCH
> >} else if (ax < ay == (long long) ux.i < 0)
> >#else
> >} else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
> >#endif
> > ux.i--;
> >else
> > ux.i++;
> >e = ux.i >> 52 & 0x7ff;
> >/* raise overflow if ux.f is infinite and x is finite */
> >if (e == 0x7ff)
> > FORCE_EVAL(x + x);
> >/* raise underflow if ux.f is subnormal or zero */
> >if (e == 0)
> > FORCE_EVAL(x*x + ux.f*ux.f);
> >return ux.f;
> >}
> 
> Maybe I am missing something and my brain is in weekend-mode ...
> 
> I did a quick check and ran the above code for some test cases:
> 
> nextafter(-9.7500000000e+01, 3.5000000000e+01) = -9.7500000000e+01
> yourpatch(-9.7500000000e+01, 3.5000000000e+01) = -9.7500000000e+01
> 
> The error is 2.8421709430e-14
> 
> nextafter(-9.7500000000e+01, -3.5000000000e+01) = -9.7500000000e+01
> yourpatch(-9.7500000000e+01, -3.5000000000e+01) = -9.7500000000e+01
> 
> The error is 2.8421709430e-14

I don't follow; are you claiming Stefan's patch introduces an error
here? The outputs you printed show the exact same behavior before and
after but it's possible you printed them wrong. The nextafter function
is bit-exact; it does not have floating point error (inexactness)
unless the implementation is buggy.

> nextafter(-inf, inf) = -1.7976931349e+308 Correct
> yourpatch(-inf, inf) = -nan Incorrect
> 
> This is against standard GLIB.

glibc I assume you mean? In any case yes this looks like a bug in the
patch.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15  7:04       ` Stefan Kanthak
  2021-08-15  7:46         ` Ariadne Conill
  2021-08-15  8:24         ` Damian McGuckin
@ 2021-08-15 14:56         ` Szabolcs Nagy
  2021-08-15 15:19           ` Stefan Kanthak
  2 siblings, 1 reply; 34+ messages in thread
From: Szabolcs Nagy @ 2021-08-15 14:56 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: musl

* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-15 09:04:55 +0200]:
> Szabolcs Nagy <nsz@port70.net> wrote:
> > you should benchmark, but the second best is to look
> > at the longest dependency chain in the hot path and
> > add up the instruction latencies.
> 
> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> run 1 against glibc,                         8.58 ns/call
> run 2 against musl original,                 3.59
> run 3 against musl patched,                  0.52
> run 4 the pure floating-point variant from   0.72
>       my initial post in this thread,
> run 5 the assembly variant I posted.         0.28 ns/call

thanks for the numbers. it's not the best measurment
but shows some interesting effects.

> 
> Now hurry up and patch your slowmotion code!
> 
> Stefan
> 
> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
> 
> #ifdef PATCH
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
> __fpclassifyl(x) == FP_NAN)
> #else
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> __fpclassifyl(x) == FP_NAN)
> #endif // PATCH

i think on x86 this only changes an and to an add
(or nothing at all if the compiler is smart)

if this is measurable that's an uarch issue of your cpu.

> 
> PPS: and of course the log from the benchmarks...
> 
> [stefan@rome ~]$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             AuthenticAMD
> CPU family:            23
> Model:                 49
> Model name:            AMD EPYC 7262 8-Core Processor
> Stepping:              0
> CPU MHz:               3194.323
> BogoMIPS:              6388.64
> Virtualization:        AMD-V
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              512K
> L3 cache:              16384K
> ...
> [stefan@rome ~]$ gcc --version
> gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> [stefan@rome ~]$ cat patch.c
> #include <math.h>
> 
> static __inline unsigned __FLOAT_BITS(float __f)
> {
>  union {float __f; unsigned __i;} __u;
>  __u.__f = __f;
>  return __u.__i;
> }
> 
> static __inline unsigned long long __DOUBLE_BITS(double __f)
> {
>  union {double __f; unsigned long long __i;} __u;
>  __u.__f = __f;
>  return __u.__i;
> }
> 
> #ifdef PATCH
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
> __fpclassifyl(x) == FP_NAN)
> #else
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> __fpclassifyl(x) == FP_NAN)
> #endif // PATCH
> 
> __attribute__((noinline))
> double nextafter(double x, double y)
> {
>  union {double f; unsigned long long i;} ux={x}, uy={y};
>  unsigned long long ax, ay;
>  int e;
> 
>  if (isnan(x) || isnan(y))
>   return x + y;
>  if (ux.i == uy.i)
>   return y;
> #ifdef PATCH
>  ax = ux.i << 1;
>  ay = uy.i << 1;
> #else
>  ax = ux.i & -1ULL/2;
>  ay = uy.i & -1ULL/2;
> #endif
>  if (ax == 0) {
>   if (ay == 0)
>    return y;
>   ux.i = (uy.i & 1ULL<<63) | 1;
> #ifdef PATCH
>  } else if (ax < ay == (long long) ux.i < 0)
> #else
>  } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
> #endif

this change is wrong.

>   ux.i--;
>  else
>   ux.i++;
>  e = ux.i >> 52 & 0x7ff;
>  /* raise overflow if ux.f is infinite and x is finite */
>  if (e == 0x7ff)
>   FORCE_EVAL(x + x);
>  /* raise underflow if ux.f is subnormal or zero */
>  if (e == 0)
>   FORCE_EVAL(x*x + ux.f*ux.f);
>  return ux.f;
> }
> [stefan@rome ~]$ cat bench.c
> // Copyright © 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
> 
> #include <math.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <time.h>
> 
> __attribute__((noinline))
> double nop(double foo, double bar)
> {
>     return foo + bar;
> }
> 
> inline static
> double lfsr64(void)
> {
>     // 64-bit linear feedback shift register (Galois form) using
>     //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
>     //   initialised with 2**64 / golden ratio
> 
>     static uint64_t lfsr = 0x9E3779B97F4A7C15;
>     const  uint64_t sign = (int64_t) lfsr >> 63;
> 
>     lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
> 
>     return *(double *) &lfsr;
> }
> 
> inline static
> double random64(void)
> {
>     static uint64_t seed = 0x0123456789ABCDEF;
> 
>     seed = seed * 6364136223846793005 + 1442695040888963407;
> 
>     return *(double *) &seed;
> }

this is ok, but:

use union for type punning, casts can be miscompiled.

i don't think you need two prngs for this benchmark.

uniform random over all floats is rarely what we want:
either exercise branches with similar distribution as
in real code, or try to exercise specific branches
uniformly to completely neutralize branch prediction
effects.

> 
> int main(void)
> {
>     clock_t t0, t1, t2, tt;
>     uint32_t n;
>     volatile double result;
> 
>     t0 = clock();

clock() is not ideal because it has to do a syscall.
(CLOCK_PROCESS_CPUTIME_ID is not handled in the vdso.)

using clock_gettime(CLOCK_MONOTONIC, &ts) is better.

> 
>     for (n = 500000000u; n > 0u; n--) {
>         result = nop(lfsr64(), 0.0);
>         result = nop(random64(), 1.0 / 0.0);
>     }
> 
>     t1 = clock();
> 
>     for (n = 500000000u; n > 0u; n--) {
>         result = nextafter(lfsr64(), 0.0);
>         result = nextafter(random64(), 1.0 / 0.0);
>     }

you created two dependency chains in the loop: lfsr64 and
random64 with many int operations.

nextafter calls are independent so many of those can go on
in parallel as long as the inputs are available.

so this measures how much your prng latency is affected by
computing nextafters in parallel. (i.e. not measuring
nextafter latency at all.) on a big enough cpu, nextafter
in this loop should have almost 0 effect.

the performance will break down if speculation fails (e.g.
branch-misspredicts or the cpu runs out of resources to
run enough nextafters in parallel to keep up with the prng).

the effect of branch misses are nicely visible in the
results: i expect the y=inf case mispredicts half the
time in the orig code which is 250M misses. i assume
your code compiled to

  u.i += sign

or similar, while the original had a branch on the sign.

that extra branch rarely has this dramatic effect in real
code though and i think it is arch dependent if it can be
removed at all.

for actually measuring the latency i suggest

  double r = 1.0;
  uint64_t t1, t0;
  t0 = tic();
  for (n = N; n > 0; n--)
    r = nextafter(r, 0.0);
  t1 = tic();

this should give the best-case latency of the hot path
(perfectly predicted).

i hope you learnt something.

> 
>     t2 = clock();
>     tt = t2 - t0; t2 -= t1; t1 -= t0; t0 = t2 - t1;
> 
>     printf("\n"
>            "nop()         %4lu.%06lu       0\n"
>            "nextafter()   %4lu.%06lu    %4lu.%06lu\n"
>            "              %4lu.%06lu nano-seconds\n",
>            t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
>            tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
> }
> [stefan@rome ~]$ gcc -O3 -lm bench.c
> [stefan@rome ~]$ perf stat ./a.out
> 
> nop()            1.480000       0
> nextafter()     10.060000       8.580000
>                 11.540000 nano-seconds
> 
>  Performance counter stats for './a.out':
> 
>          11,548.78 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                145      page-faults:u             #    0.013 K/sec
>     38,917,213,536      cycles:u                  #    3.370 GHz                      (83.33%)
>     15,647,656,615      stalled-cycles-frontend:u #   40.21% frontend cycles idle     (83.33%)
>     10,746,044,422      stalled-cycles-backend:u  #   27.61% backend cycles idle      (83.33%)
>     69,739,403,870      instructions:u            #    1.79  insn per cycle
>                                                   #    0.22  stalled cycles per insn  (83.33%)
>     16,495,748,110      branches:u                # 1428.354 M/sec                    (83.33%)
>        500,701,246      branch-misses:u           #    3.04% of all branches          (83.33%)
> 
>       11.550414029 seconds time elapsed
> 
>       11.548265000 seconds user
>        0.000999000 seconds sys
> 
> 
> [stefan@rome ~]$ gcc -O3 bench.c patch.c
> patch.c:23: warning: "isnan" redefined
>  #define isnan(x) ( \
> 
> In file included from patch.c:1:
> /usr/include/math.h:254: note: this is the location of the previous definition
>  #  define isnan(x) \
> 
> [stefan@rome ~]$ perf stat ./a.out
> 
> nop()            1.480000       0
> nextafter()      5.070000       3.590000
>                  6.550000 nano-seconds
> 
>  Performance counter stats for './a.out':
> 
>           6,558.44 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                122      page-faults:u             #    0.019 K/sec
>     22,038,054,931      cycles:u                  #    3.360 GHz                      (83.33%)
>            204,849      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      5,497,340,276      stalled-cycles-backend:u  #   24.94% backend cycles idle      (83.34%)
>     39,751,746,482      instructions:u            #    1.80  insn per cycle
>                                                   #    0.14  stalled cycles per insn  (83.34%)
>      9,747,428,086      branches:u                # 1486.242 M/sec                    (83.34%)
>        250,407,409      branch-misses:u           #    2.57% of all branches          (83.33%)
> 
>        6.559550924 seconds time elapsed
> 
>        6.558918000 seconds user
>        0.000000000 seconds sys
> 
> 
> [stefan@rome ~]$ gcc -O3 bench.c -DPATCH patch.c
> patch.c:18: warning: "isnan" redefined
>  #define isnan(x) ( \
> 
> In file included from patch.c:1:
> /usr/include/math.h:254: note: this is the location of the previous definition
>  #  define isnan(x) \
> 
> [stefan@rome ~]$ perf stat ./a.out
> 
> nop()            1.480000       0
> nextafter()      2.000000       0.520000
>                  3.480000 nano-seconds
> 
>  Performance counter stats for './a.out':
> 
>           3,489.59 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                123      page-faults:u             #    0.035 K/sec
>     11,721,075,764      cycles:u                  #    3.359 GHz                      (83.32%)
>            132,680      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.32%)
>      4,500,051,993      stalled-cycles-backend:u  #   38.39% backend cycles idle      (83.32%)
>     40,501,721,908      instructions:u            #    3.46  insn per cycle
>                                                   #    0.11  stalled cycles per insn  (83.33%)
>      8,494,571,027      branches:u                # 2434.258 M/sec                    (83.35%)
>            497,230      branch-misses:u           #    0.01% of all branches          (83.34%)
> 
>        3.490437579 seconds time elapsed
> 
>        3.490053000 seconds user
>        0.000000000 seconds sys
> 
> 
> [stefan@rome ~]$ gcc -O3 patch.c nextafter-fp.c
> [stefan@rome ~]$ perf stat ./a.out
> 
> nop()            1.490000       0
> nextafter()      2.210000       0.720000
>                  3.700000 nano-seconds
> 
>  Performance counter stats for './a.out':
> 
>           3,702.89 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                122      page-faults:u             #    0.033 K/sec
>     12,407,345,183      cycles:u                  #    3.351 GHz                      (83.32%)
>            135,817      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      5,498,895,906      stalled-cycles-backend:u  #   44.32% backend cycles idle      (83.34%)
>     38,002,430,460      instructions:u            #    3.06  insn per cycle
>                                                   #    0.14  stalled cycles per insn  (83.34%)
>      7,497,381,393      branches:u                # 2024.735 M/sec                    (83.34%)
>            497,462      branch-misses:u           #    0.01% of all branches          (83.32%)
> 
>        3.703648875 seconds time elapsed
> 
>        3.703294000 seconds user
>        0.000000000 seconds sys
> 
> 
> [stefan@rome ~]$ gcc -O3 bench.c nextafter.s
> [stefan@rome ~]$ perf stat ./a.out
> 
> nop()            1.630000       0
> nextafter()      1.910000       0.280000
>                  3.540000 nano-seconds
> 
>  Performance counter stats for './a.out':
> 
>           3,547.12 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                123      page-faults:u             #    0.035 K/sec
>     11,949,840,797      cycles:u                  #    3.369 GHz                      (83.32%)
>            129,627      stalled-cycles-frontend:u #    0.00% frontend cycles idle     (83.34%)
>      3,998,960,716      stalled-cycles-backend:u  #   33.46% backend cycles idle      (83.34%)
>     37,493,523,285      instructions:u            #    3.14  insn per cycle
>                                                   #    0.11  stalled cycles per insn  (83.34%)
>      7,998,559,192      branches:u                # 2254.945 M/sec                    (83.34%)
>            497,565      branch-misses:u           #    0.01% of all branches          (83.32%)
> 
>        3.547999008 seconds time elapsed
> 
>        3.546671000 seconds user
>        0.000999000 seconds sys
> 
> 
> [stefan@rome ~]$

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 13:59           ` Rich Felker
@ 2021-08-15 14:57             ` Ariadne Conill
  0 siblings, 0 replies; 34+ messages in thread
From: Ariadne Conill @ 2021-08-15 14:57 UTC (permalink / raw)
  To: Rich Felker; +Cc: Ariadne Conill, musl, Szabolcs Nagy

Hi,

On Sun, 15 Aug 2021, Rich Felker wrote:

> On Sun, Aug 15, 2021 at 02:46:58AM -0500, Ariadne Conill wrote:
>> On Sun, 15 Aug 2021, Stefan Kanthak wrote:
>>> [stefan@rome ~]$ gcc --version
>>> gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
>>> Copyright (C) 2018 Free Software Foundation, Inc.
>>> This is free software; see the source for copying conditions.  There is NO
>>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>>
>> gcc 8 is quite old at this point.  gcc 9 and 10 have much better
>> optimizers that are much more capable.
>>
>> Indeed, on my system with GCC 10.3.1, nextafter() is using SSE2
>> instructions on Alpine x86_64, and if I rebuild musl with
>> `-march=znver2` it uses AVX instructions for nextafter(), which
>> seems more than sufficiently optimized to me.
>
> As far as I can tell, the instructions used are not the issue here,
> and there are no specialized instructions that help make it faster. If
> GCC is doing a bad job, it's more a matter of the high level flow,
> choice of how to load constants, how branches are implemented, etc.

Right, I'm just more saying that at least from what I see at glancing at 
the disassembly in both cases, the code generated by GCC 10 does not seem 
particularly bad.

(And nitpicking in the aggressive way Stefan is doing over 3 nsec per call 
in something that really is not generally a fastpath is kind of silly 
anyway.)

Ariadne

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 14:03           ` Rich Felker
@ 2021-08-15 15:10             ` Damian McGuckin
  0 siblings, 0 replies; 34+ messages in thread
From: Damian McGuckin @ 2021-08-15 15:10 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, Szabolcs Nagy


On Sun, 15 Aug 2021, Rich Felker wrote:

>> I did a quick check and ran the above code for some test cases:
>>
>> nextafter(-9.7500000000e+01, 3.5000000000e+01) = -9.7500000000e+01
>> yourpatch(-9.7500000000e+01, 3.5000000000e+01) = -9.7500000000e+01
>>
>> The error is 2.8421709430e-14
>>
>> nextafter(-9.7500000000e+01, -3.5000000000e+01) = -9.7500000000e+01
>> yourpatch(-9.7500000000e+01, -3.5000000000e+01) = -9.7500000000e+01
>>
>> The error is 2.8421709430e-14
>
> I don't follow; are you claiming Stefan's patch introduces an error
> here?

Yes.  I have suggested a change which solves the problem.

> The outputs you printed show the exact same behavior before and
> after but it's possible you printed them wrong. The nextafter function
> is bit-exact;

The bits from nextafter from GLIBC match those of MUSL for that case.
But the bits from nextafter from GLIBC do not match those of Stefan's
patch for the same case.

> it does not have floating point error (inexactness)
> unless the implementation is buggy.

>> nextafter(-inf, inf) = -1.7976931349e+308 Correct
>> yourpatch(-inf, inf) = -nan Incorrect
>>
>> This is against standard GLIB.
>
> glibc I assume you mean?

Yes. Sorry. FInger<->Brain miscommunications.

> In any case yes this looks like a bug in the patch.

I believe so.  Which I think Szabolcs has also suggested is the case.

Stay safe - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 14:56         ` Szabolcs Nagy
@ 2021-08-15 15:19           ` Stefan Kanthak
  2021-08-15 15:48             ` Rich Felker
  2021-08-15 15:52             ` Ariadne Conill
  0 siblings, 2 replies; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-15 15:19 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: musl

Szabolcs Nagy <nsz@port70.net> wrote:

> * Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-15 09:04:55 +0200]:
>> Szabolcs Nagy <nsz@port70.net> wrote:
>>> you should benchmark, but the second best is to look
>>> at the longest dependency chain in the hot path and
>>> add up the instruction latencies.
>> 
>> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
>> run 1 against glibc,                         8.58 ns/call
>> run 2 against musl original,                 3.59
>> run 3 against musl patched,                  0.52
>> run 4 the pure floating-point variant from   0.72
>>       my initial post in this thread,
>> run 5 the assembly variant I posted.         0.28 ns/call
>
> thanks for the numbers. it's not the best measurment

IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!

> but shows some interesting effects.

It clearly shows that musl's current implementation SUCKS, at least
on AMD64.

>> 
>> Now hurry up and patch your slowmotion code!
>> 
>> Stefan
>> 
>> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
>> 
>> #ifdef PATCH
>> #define isnan(x) ( \
>> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
>> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
>> __fpclassifyl(x) == FP_NAN)
>> #else
>> #define isnan(x) ( \
>> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
>> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
>> __fpclassifyl(x) == FP_NAN)
>> #endif // PATCH
>
> i think on x86 this only changes an and to an add
> (or nothing at all if the compiler is smart)

BETTER THINK TWICE: where does the mask needed for the and come from?
Does it need an extra register?
How do you (for example) build it on ARM?

> if this is measurable that's an uarch issue of your cpu.

ARGH: it's not the and that makes the difference!

JFTR: movabs $0x7ff0000000000000, %r*x is a 10 byte instruction
      I recommend to read Intel's and AMD's processor optimisation
      manuals and learn just a little bit!

[braindead fullquote removed]

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 15:19           ` Stefan Kanthak
@ 2021-08-15 15:48             ` Rich Felker
  2021-08-15 16:29               ` Stefan Kanthak
  2021-08-15 15:52             ` Ariadne Conill
  1 sibling, 1 reply; 34+ messages in thread
From: Rich Felker @ 2021-08-15 15:48 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, musl

On Sun, Aug 15, 2021 at 05:19:05PM +0200, Stefan Kanthak wrote:
> Szabolcs Nagy <nsz@port70.net> wrote:
> 
> > * Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-15 09:04:55 +0200]:
> >> Szabolcs Nagy <nsz@port70.net> wrote:
> >>> you should benchmark, but the second best is to look
> >>> at the longest dependency chain in the hot path and
> >>> add up the instruction latencies.
> >> 
> >> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> >> run 1 against glibc,                         8.58 ns/call
> >> run 2 against musl original,                 3.59
> >> run 3 against musl patched,                  0.52
> >> run 4 the pure floating-point variant from   0.72
> >>       my initial post in this thread,
> >> run 5 the assembly variant I posted.         0.28 ns/call
> >
> > thanks for the numbers. it's not the best measurment
> 
> IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!

The burden of performing a meaningful measurement is on the party who
says there's something that needs to be changed.

> > but shows some interesting effects.
> 
> It clearly shows that musl's current implementation SUCKS, at least
> on AMD64.

Hardly. According to you it's faster than glibc, and looks
sufficiently fast never to be a bottleneck.

> >> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
> >> 
> >> #ifdef PATCH
> >> #define isnan(x) ( \
> >> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> >> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
> >> __fpclassifyl(x) == FP_NAN)
> >> #else
> >> #define isnan(x) ( \
> >> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> >> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> >> __fpclassifyl(x) == FP_NAN)
> >> #endif // PATCH
> >
> > i think on x86 this only changes an and to an add
> > (or nothing at all if the compiler is smart)
> 
> BETTER THINK TWICE: where does the mask needed for the and come from?
> Does it need an extra register?
> How do you (for example) build it on ARM?
> 
> > if this is measurable that's an uarch issue of your cpu.
> 
> ARGH: it's not the and that makes the difference!
> 
> JFTR: movabs $0x7ff0000000000000, %r*x is a 10 byte instruction
>       I recommend to read Intel's and AMD's processor optimisation
>       manuals and learn just a little bit!

If you have a general reason (not specific to specific
microarchitectural considerartions) for why one form is preferred,
please state that from the beginning. I don't entirely understand your
argument here since in both the original version and yours, there's a
value on the RHS of the > operator that's in some sense nontrivial to
generate.

Ideally the compiler would be able to emit whichever form is preferred
for the target, since there's a clear transformation that can be made
either direction for this kind of thing. But since that's presently
not the case, if there's a version that can be expected, based on some
reasoning not just "what GCC happens to do", to be faster on most
targets, we should use that.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 15:19           ` Stefan Kanthak
  2021-08-15 15:48             ` Rich Felker
@ 2021-08-15 15:52             ` Ariadne Conill
  2021-08-15 16:09               ` Rich Felker
  1 sibling, 1 reply; 34+ messages in thread
From: Ariadne Conill @ 2021-08-15 15:52 UTC (permalink / raw)
  To: musl; +Cc: Szabolcs Nagy

Hi,

On Sun, 15 Aug 2021, Stefan Kanthak wrote:

> Szabolcs Nagy <nsz@port70.net> wrote:
>
>> * Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-15 09:04:55 +0200]:
>>> Szabolcs Nagy <nsz@port70.net> wrote:
>>>> you should benchmark, but the second best is to look
>>>> at the longest dependency chain in the hot path and
>>>> add up the instruction latencies.
>>>
>>> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
>>> run 1 against glibc,                         8.58 ns/call
>>> run 2 against musl original,                 3.59
>>> run 3 against musl patched,                  0.52
>>> run 4 the pure floating-point variant from   0.72
>>>       my initial post in this thread,
>>> run 5 the assembly variant I posted.         0.28 ns/call
>>
>> thanks for the numbers. it's not the best measurment
>
> IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!
>
>> but shows some interesting effects.
>
> It clearly shows that musl's current implementation SUCKS, at least
> on AMD64.

I would rather have an implementation that is 3.59 ns/call and is 
maintained by somebody who is actually pleasant to talk to.  In the grand 
scheme of things 3.59 ns/call, and even 8.58 ns/call are not a big deal 
for a function like nextafter().

If musl does wind up merging this, I intend to revert that merge in Alpine 
because I cannot trust the correctness of any code written by somebody 
with this attitude.

Ariadne

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 15:52             ` Ariadne Conill
@ 2021-08-15 16:09               ` Rich Felker
  0 siblings, 0 replies; 34+ messages in thread
From: Rich Felker @ 2021-08-15 16:09 UTC (permalink / raw)
  To: Ariadne Conill; +Cc: musl, Szabolcs Nagy

On Sun, Aug 15, 2021 at 10:52:13AM -0500, Ariadne Conill wrote:
> Hi,
> 
> On Sun, 15 Aug 2021, Stefan Kanthak wrote:
> 
> >Szabolcs Nagy <nsz@port70.net> wrote:
> >
> >>* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-15 09:04:55 +0200]:
> >>>Szabolcs Nagy <nsz@port70.net> wrote:
> >>>>you should benchmark, but the second best is to look
> >>>>at the longest dependency chain in the hot path and
> >>>>add up the instruction latencies.
> >>>
> >>>1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> >>>run 1 against glibc,                         8.58 ns/call
> >>>run 2 against musl original,                 3.59
> >>>run 3 against musl patched,                  0.52
> >>>run 4 the pure floating-point variant from   0.72
> >>>      my initial post in this thread,
> >>>run 5 the assembly variant I posted.         0.28 ns/call
> >>
> >>thanks for the numbers. it's not the best measurment
> >
> >IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!
> >
> >>but shows some interesting effects.
> >
> >It clearly shows that musl's current implementation SUCKS, at least
> >on AMD64.
> 
> I would rather have an implementation that is 3.59 ns/call and is
> maintained by somebody who is actually pleasant to talk to.  In the
> grand scheme of things 3.59 ns/call, and even 8.58 ns/call are not a
> big deal for a function like nextafter().
> 
> If musl does wind up merging this, I intend to revert that merge in
> Alpine because I cannot trust the correctness of any code written by
> somebody with this attitude.

Don't worry, we will evaluate the correctness of anything we do merge
and stand by it. I'm open to well-reasoned changes, especially if they
improve things more broadly like the isnan comparisons, and have
already noted that there's some inconsistency in the existing
nextafter/nexttoward functions. But I'm not going to accept any patch
that's not making a well-reasoned, isolated, evaluatable, and testable
change. And I would very much appreciate dropping exaggerated and
hostile claims that something "sucks" or is "brain damaged" in further
discussion of the topic.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 15:48             ` Rich Felker
@ 2021-08-15 16:29               ` Stefan Kanthak
  2021-08-15 16:49                 ` Rich Felker
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-15 16:29 UTC (permalink / raw)
  To: Rich Felker; +Cc: Szabolcs Nagy, musl

Rich Felker <dalias@libc.org> wrote:

> On Sun, Aug 15, 2021 at 05:19:05PM +0200, Stefan Kanthak wrote:
>> Szabolcs Nagy <nsz@port70.net> wrote:
>> 
>> > * Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-15 09:04:55 +0200]:
>> >> Szabolcs Nagy <nsz@port70.net> wrote:
>> >>> you should benchmark, but the second best is to look
>> >>> at the longest dependency chain in the hot path and
>> >>> add up the instruction latencies.
>> >> 
>> >> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
>> >> run 1 against glibc,                         8.58 ns/call
>> >> run 2 against musl original,                 3.59
>> >> run 3 against musl patched,                  0.52
>> >> run 4 the pure floating-point variant from   0.72
>> >>       my initial post in this thread,
>> >> run 5 the assembly variant I posted.         0.28 ns/call
>> >
>> > thanks for the numbers. it's not the best measurment
>> 
>> IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!
> 
> The burden of performing a meaningful measurement is on the party who
> says there's something that needs to be changed.

I offered you two patches which speed a rather simple function by a
measured factor of 5 and 7 respectively. IF YOU DOUBT THESE NUMBERS,
PROVIDE YOUR OWN!

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 16:29               ` Stefan Kanthak
@ 2021-08-15 16:49                 ` Rich Felker
  2021-08-15 20:52                   ` Stefan Kanthak
  0 siblings, 1 reply; 34+ messages in thread
From: Rich Felker @ 2021-08-15 16:49 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, musl

On Sun, Aug 15, 2021 at 06:29:08PM +0200, Stefan Kanthak wrote:
> Rich Felker <dalias@libc.org> wrote:
> 
> > On Sun, Aug 15, 2021 at 05:19:05PM +0200, Stefan Kanthak wrote:
> >> Szabolcs Nagy <nsz@port70.net> wrote:
> >> 
> >> > * Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-15 09:04:55 +0200]:
> >> >> Szabolcs Nagy <nsz@port70.net> wrote:
> >> >>> you should benchmark, but the second best is to look
> >> >>> at the longest dependency chain in the hot path and
> >> >>> add up the instruction latencies.
> >> >> 
> >> >> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> >> >> run 1 against glibc,                         8.58 ns/call
> >> >> run 2 against musl original,                 3.59
> >> >> run 3 against musl patched,                  0.52
> >> >> run 4 the pure floating-point variant from   0.72
> >> >>       my initial post in this thread,
> >> >> run 5 the assembly variant I posted.         0.28 ns/call
> >> >
> >> > thanks for the numbers. it's not the best measurment
> >> 
> >> IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!
> > 
> > The burden of performing a meaningful measurement is on the party who
> > says there's something that needs to be changed.
> 
> I offered you two patches which speed a rather simple function by a
> measured factor of 5 and 7 respectively. IF YOU DOUBT THESE NUMBERS,
> PROVIDE YOUR OWN!

I really have better things to be doing than putting up with repeated
toxic interactions for the sake of a supposed miniscule improvement in
something nobody has identified as having any problem to begin with.

If you want to engage constructively, you're welcome to. This is not
it.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 16:49                 ` Rich Felker
@ 2021-08-15 20:52                   ` Stefan Kanthak
  2021-08-15 21:48                     ` Rich Felker
  0 siblings, 1 reply; 34+ messages in thread
From: Stefan Kanthak @ 2021-08-15 20:52 UTC (permalink / raw)
  To: Rich Felker; +Cc: Szabolcs Nagy, musl

Rich Felker <dalias@libc.org> wrote:

> I really have better things to be doing than putting up with repeated
> toxic interactions for the sake of a supposed miniscule improvement in
> something nobody has identified as having any problem to begin with.

Nobody as in "Niemand hat die Absicht eine Mauer zu bauen"?
Or just nobody EXCEPT ME bothered to take a look at the code of your
nextafter() and noticed its performance deficit (at least on AMD64
and i386)?

JFTR: your implementation is NON-COMPLIANT!
      I recommend to read the ISO C standard and follow it by the word.

> If you want to engage constructively, you're welcome to. This is not
> it.

You are most obviously NOT interested in performance improvement, or
just to stubborn: choose what you like better.

Stefan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] [PATCH #2] Properly simplified nextafter()
  2021-08-15 20:52                   ` Stefan Kanthak
@ 2021-08-15 21:48                     ` Rich Felker
  0 siblings, 0 replies; 34+ messages in thread
From: Rich Felker @ 2021-08-15 21:48 UTC (permalink / raw)
  To: Stefan Kanthak; +Cc: Szabolcs Nagy, musl

On Sun, Aug 15, 2021 at 10:52:24PM +0200, Stefan Kanthak wrote:
> Rich Felker <dalias@libc.org> wrote:
> 
> > I really have better things to be doing than putting up with repeated
> > toxic interactions for the sake of a supposed miniscule improvement in
> > something nobody has identified as having any problem to begin with.
> 
> Nobody as in "Niemand hat die Absicht eine Mauer zu bauen"?
> Or just nobody EXCEPT ME bothered to take a look at the code of your
> nextafter() and noticed its performance deficit (at least on AMD64
> and i386)?

I guess I missed your initial report about what software you were
running that was spending more than 1% of its execution time in
nextafter.

As I understand it, what happened was that you were reading the code
and thought "I bet I could do better". Maybe you can. But can you do
sufficiently better that it makes any practical difference, to be
worth the time spent reviewing that it's correct? Probably not.

But now, in your case, the bar is even higher. It's: Can you do
sufficiently better to be worth subjecting our good contributors to
someone who wants to insult and abuse them and demand their time doing
unproductive things? Given how fast the function already is, and how
improbable it is that it makes any difference to the run time of any
real code, the answer is most certainly no. You've raised the bar for
your own contribution so high that it can't be met.

> JFTR: your implementation is NON-COMPLIANT!
>       I recommend to read the ISO C standard and follow it by the word.

Making unspecific claims lik that to waste our time trying to figure
out what you mean is not making that bar any lower.

> > If you want to engage constructively, you're welcome to. This is not
> > it.
> 
> You are most obviously NOT interested in performance improvement, or
> just to stubborn: choose what you like better.

I am not interested in further subjecting the authors of our existing
code to your abuse. Please leave.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-08-15 21:49 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-10  6:23 [musl] [PATCH] Properly simplified nextafter() Stefan Kanthak
2021-08-10 21:34 ` Szabolcs Nagy
2021-08-10 22:53   ` Stefan Kanthak
2021-08-11  2:40     ` Rich Felker
2021-08-11 15:44       ` Stefan Kanthak
2021-08-11 16:09         ` Rich Felker
2021-08-11 16:50           ` Stefan Kanthak
2021-08-11 17:57             ` Rich Felker
2021-08-11 22:16               ` Szabolcs Nagy
2021-08-11 22:43                 ` Stefan Kanthak
2021-08-12  0:59                   ` Rich Felker
2021-08-11  8:23     ` Szabolcs Nagy
2021-08-13 12:04   ` [musl] [PATCH #2] " Stefan Kanthak
2021-08-13 15:59     ` Rich Felker
2021-08-13 18:30       ` Stefan Kanthak
2021-08-14  4:07         ` Damian McGuckin
2021-08-14 22:45           ` Szabolcs Nagy
2021-08-14 23:46     ` Szabolcs Nagy
2021-08-15  7:04       ` Stefan Kanthak
2021-08-15  7:46         ` Ariadne Conill
2021-08-15 13:59           ` Rich Felker
2021-08-15 14:57             ` Ariadne Conill
2021-08-15  8:24         ` Damian McGuckin
2021-08-15 14:03           ` Rich Felker
2021-08-15 15:10             ` Damian McGuckin
2021-08-15 14:56         ` Szabolcs Nagy
2021-08-15 15:19           ` Stefan Kanthak
2021-08-15 15:48             ` Rich Felker
2021-08-15 16:29               ` Stefan Kanthak
2021-08-15 16:49                 ` Rich Felker
2021-08-15 20:52                   ` Stefan Kanthak
2021-08-15 21:48                     ` Rich Felker
2021-08-15 15:52             ` Ariadne Conill
2021-08-15 16:09               ` Rich Felker

mailing list of musl libc

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://inbox.vuxu.org/musl

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 musl musl/ https://inbox.vuxu.org/musl \
		musl@inbox.vuxu.org
	public-inbox-index musl

Example config snippet for mirrors.
Newsgroup available over NNTP:
	nntp://inbox.vuxu.org/vuxu.archive.musl


code repositories for the project(s) associated with this inbox:

	https://git.vuxu.org/mirror/musl/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git