From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org
X-Spam-Level: 
X-Spam-Status: No, score=-3.3 required=5.0 tests=MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL autolearn=ham
	autolearn_force=no version=3.4.4
Received: (qmail 3442 invoked from network); 15 Aug 2021 14:56:28 -0000
Received: from mother.openwall.net (195.42.179.200)
  by inbox.vuxu.org with ESMTPUTF8; 15 Aug 2021 14:56:28 -0000
Received: (qmail 32411 invoked by uid 550); 15 Aug 2021 14:56:26 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 32393 invoked from network); 15 Aug 2021 14:56:26 -0000
Date: Sun, 15 Aug 2021 16:56:14 +0200
From: Szabolcs Nagy <nsz@port70.net>
To: Stefan Kanthak <stefan.kanthak@nexgo.de>
Cc: musl@lists.openwall.com
Message-ID: <20210815145614.GI37904@port70.net>
Mail-Followup-To: Stefan Kanthak <stefan.kanthak@nexgo.de>,
	musl@lists.openwall.com
References: <0C6AAAD55DA44C6189B2FF4F5FB2C3E7@H270>
 <20210810213455.GB37904@port70.net>
 <E2423D1F1F3848848AEA933048174858@H270>
 <20210814234612.GH37904@port70.net>
 <367A4018B58A4E308E2A95404362CBFB@H270>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <367A4018B58A4E308E2A95404362CBFB@H270>
Subject: Re: [musl] [PATCH #2] Properly simplified nextafter()

* Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-15 09:04:55 +0200]:
> Szabolcs Nagy <nsz@port70.net> wrote:
> > you should benchmark, but the second best is to look
> > at the longest dependency chain in the hot path and
> > add up the instruction latencies.
>=20
> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> run 1 against glibc,                         8.58 ns/call
> run 2 against musl original,                 3.59
> run 3 against musl patched,                  0.52
> run 4 the pure floating-point variant from   0.72
>       my initial post in this thread,
> run 5 the assembly variant I posted.         0.28 ns/call

thanks for the numbers. it's not the best measurment
but shows some interesting effects.

>=20
> Now hurry up and patch your slowmotion code!
>=20
> Stefan
>=20
> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
>=20
> #ifdef PATCH
> #define isnan(x) ( \
> sizeof(x) =3D=3D sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> sizeof(x) =3D=3D sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe00000000=
00000ULL : \
> __fpclassifyl(x) =3D=3D FP_NAN)
> #else
> #define isnan(x) ( \
> sizeof(x) =3D=3D sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800=
000 : \
> sizeof(x) =3D=3D sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffUL=
L<<52 : \
> __fpclassifyl(x) =3D=3D FP_NAN)
> #endif // PATCH

i think on x86 this only changes an and to an add
(or nothing at all if the compiler is smart)

if this is measurable that's an uarch issue of your cpu.

>=20
> PPS: and of course the log from the benchmarks...
>=20
> [stefan@rome ~]$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             AuthenticAMD
> CPU family:            23
> Model:                 49
> Model name:            AMD EPYC 7262 8-Core Processor
> Stepping:              0
> CPU MHz:               3194.323
> BogoMIPS:              6388.64
> Virtualization:        AMD-V
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              512K
> L3 cache:              16384K
> ...
> [stefan@rome ~]$ gcc --version
> gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOS=
E.
>=20
> [stefan@rome ~]$ cat patch.c
> #include <math.h>
>=20
> static __inline unsigned __FLOAT_BITS(float __f)
> {
>  union {float __f; unsigned __i;} __u;
>  __u.__f =3D __f;
>  return __u.__i;
> }
>=20
> static __inline unsigned long long __DOUBLE_BITS(double __f)
> {
>  union {double __f; unsigned long long __i;} __u;
>  __u.__f =3D __f;
>  return __u.__i;
> }
>=20
> #ifdef PATCH
> #define isnan(x) ( \
> sizeof(x) =3D=3D sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> sizeof(x) =3D=3D sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe00000000=
00000ULL : \
> __fpclassifyl(x) =3D=3D FP_NAN)
> #else
> #define isnan(x) ( \
> sizeof(x) =3D=3D sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800=
000 : \
> sizeof(x) =3D=3D sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffUL=
L<<52 : \
> __fpclassifyl(x) =3D=3D FP_NAN)
> #endif // PATCH
>=20
> __attribute__((noinline))
> double nextafter(double x, double y)
> {
>  union {double f; unsigned long long i;} ux=3D{x}, uy=3D{y};
>  unsigned long long ax, ay;
>  int e;
>=20
>  if (isnan(x) || isnan(y))
>   return x + y;
>  if (ux.i =3D=3D uy.i)
>   return y;
> #ifdef PATCH
>  ax =3D ux.i << 1;
>  ay =3D uy.i << 1;
> #else
>  ax =3D ux.i & -1ULL/2;
>  ay =3D uy.i & -1ULL/2;
> #endif
>  if (ax =3D=3D 0) {
>   if (ay =3D=3D 0)
>    return y;
>   ux.i =3D (uy.i & 1ULL<<63) | 1;
> #ifdef PATCH
>  } else if (ax < ay =3D=3D (long long) ux.i < 0)
> #else
>  } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
> #endif

this change is wrong.

>   ux.i--;
>  else
>   ux.i++;
>  e =3D ux.i >> 52 & 0x7ff;
>  /* raise overflow if ux.f is infinite and x is finite */
>  if (e =3D=3D 0x7ff)
>   FORCE_EVAL(x + x);
>  /* raise underflow if ux.f is subnormal or zero */
>  if (e =3D=3D 0)
>   FORCE_EVAL(x*x + ux.f*ux.f);
>  return ux.f;
> }
> [stefan@rome ~]$ cat bench.c
> // Copyright =C2=A9 2005-2021, Stefan Kanthak <stefan.kanthak@nexgo.de>
>=20
> #include <math.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <time.h>
>=20
> __attribute__((noinline))
> double nop(double foo, double bar)
> {
>     return foo + bar;
> }
>=20
> inline static
> double lfsr64(void)
> {
>     // 64-bit linear feedback shift register (Galois form) using
>     //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
>     //   initialised with 2**64 / golden ratio
>=20
>     static uint64_t lfsr =3D 0x9E3779B97F4A7C15;
>     const  uint64_t sign =3D (int64_t) lfsr >> 63;
>=20
>     lfsr =3D (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
>=20
>     return *(double *) &lfsr;
> }
>=20
> inline static
> double random64(void)
> {
>     static uint64_t seed =3D 0x0123456789ABCDEF;
>=20
>     seed =3D seed * 6364136223846793005 + 1442695040888963407;
>=20
>     return *(double *) &seed;
> }

this is ok, but:

use union for type punning, casts can be miscompiled.

i don't think you need two prngs for this benchmark.

uniform random over all floats is rarely what we want:
either exercise branches with similar distribution as
in real code, or try to exercise specific branches
uniformly to completely neutralize branch prediction
effects.

>=20
> int main(void)
> {
>     clock_t t0, t1, t2, tt;
>     uint32_t n;
>     volatile double result;
>=20
>     t0 =3D clock();

clock() is not ideal because it has to do a syscall.
(CLOCK_PROCESS_CPUTIME_ID is not handled in the vdso.)

using clock_gettime(CLOCK_MONOTONIC, &ts) is better.

>=20
>     for (n =3D 500000000u; n > 0u; n--) {
>         result =3D nop(lfsr64(), 0.0);
>         result =3D nop(random64(), 1.0 / 0.0);
>     }
>=20
>     t1 =3D clock();
>=20
>     for (n =3D 500000000u; n > 0u; n--) {
>         result =3D nextafter(lfsr64(), 0.0);
>         result =3D nextafter(random64(), 1.0 / 0.0);
>     }

you created two dependency chains in the loop: lfsr64 and
random64 with many int operations.

nextafter calls are independent so many of those can go on
in parallel as long as the inputs are available.

so this measures how much your prng latency is affected by
computing nextafters in parallel. (i.e. not measuring
nextafter latency at all.) on a big enough cpu, nextafter
in this loop should have almost 0 effect.

the performance will break down if speculation fails (e.g.
branch-misspredicts or the cpu runs out of resources to
run enough nextafters in parallel to keep up with the prng).

the effect of branch misses are nicely visible in the
results: i expect the y=3Dinf case mispredicts half the
time in the orig code which is 250M misses. i assume
your code compiled to

  u.i +=3D sign

or similar, while the original had a branch on the sign.

that extra branch rarely has this dramatic effect in real
code though and i think it is arch dependent if it can be
removed at all.

for actually measuring the latency i suggest

  double r =3D 1.0;
  uint64_t t1, t0;
  t0 =3D tic();
  for (n =3D N; n > 0; n--)
    r =3D nextafter(r, 0.0);
  t1 =3D tic();

this should give the best-case latency of the hot path
(perfectly predicted).

i hope you learnt something.

>=20
>     t2 =3D clock();
>     tt =3D t2 - t0; t2 -=3D t1; t1 -=3D t0; t0 =3D t2 - t1;
>=20
>     printf("\n"
>            "nop()         %4lu.%06lu       0\n"
>            "nextafter()   %4lu.%06lu    %4lu.%06lu\n"
>            "              %4lu.%06lu nano-seconds\n",
>            t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS=
_PER_SEC,
>            t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS=
_PER_SEC,
>            t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS=
_PER_SEC,
>            tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS=
_PER_SEC);
> }
> [stefan@rome ~]$ gcc -O3 -lm bench.c
> [stefan@rome ~]$ perf stat ./a.out
>=20
> nop()            1.480000       0
> nextafter()     10.060000       8.580000
>                 11.540000 nano-seconds
>=20
>  Performance counter stats for './a.out':
>=20
>          11,548.78 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                145      page-faults:u             #    0.013 K/sec
>     38,917,213,536      cycles:u                  #    3.370 GHz         =
             (83.33%)
>     15,647,656,615      stalled-cycles-frontend:u #   40.21% frontend cyc=
les idle     (83.33%)
>     10,746,044,422      stalled-cycles-backend:u  #   27.61% backend cycl=
es idle      (83.33%)
>     69,739,403,870      instructions:u            #    1.79  insn per cyc=
le
>                                                   #    0.22  stalled cycl=
es per insn  (83.33%)
>     16,495,748,110      branches:u                # 1428.354 M/sec       =
             (83.33%)
>        500,701,246      branch-misses:u           #    3.04% of all branc=
hes          (83.33%)
>=20
>       11.550414029 seconds time elapsed
>=20
>       11.548265000 seconds user
>        0.000999000 seconds sys
>=20
>=20
> [stefan@rome ~]$ gcc -O3 bench.c patch.c
> patch.c:23: warning: "isnan" redefined
>  #define isnan(x) ( \
>=20
> In file included from patch.c:1:
> /usr/include/math.h:254: note: this is the location of the previous defin=
ition
>  #  define isnan(x) \
>=20
> [stefan@rome ~]$ perf stat ./a.out
>=20
> nop()            1.480000       0
> nextafter()      5.070000       3.590000
>                  6.550000 nano-seconds
>=20
>  Performance counter stats for './a.out':
>=20
>           6,558.44 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                122      page-faults:u             #    0.019 K/sec
>     22,038,054,931      cycles:u                  #    3.360 GHz         =
             (83.33%)
>            204,849      stalled-cycles-frontend:u #    0.00% frontend cyc=
les idle     (83.34%)
>      5,497,340,276      stalled-cycles-backend:u  #   24.94% backend cycl=
es idle      (83.34%)
>     39,751,746,482      instructions:u            #    1.80  insn per cyc=
le
>                                                   #    0.14  stalled cycl=
es per insn  (83.34%)
>      9,747,428,086      branches:u                # 1486.242 M/sec       =
             (83.34%)
>        250,407,409      branch-misses:u           #    2.57% of all branc=
hes          (83.33%)
>=20
>        6.559550924 seconds time elapsed
>=20
>        6.558918000 seconds user
>        0.000000000 seconds sys
>=20
>=20
> [stefan@rome ~]$ gcc -O3 bench.c -DPATCH patch.c
> patch.c:18: warning: "isnan" redefined
>  #define isnan(x) ( \
>=20
> In file included from patch.c:1:
> /usr/include/math.h:254: note: this is the location of the previous defin=
ition
>  #  define isnan(x) \
>=20
> [stefan@rome ~]$ perf stat ./a.out
>=20
> nop()            1.480000       0
> nextafter()      2.000000       0.520000
>                  3.480000 nano-seconds
>=20
>  Performance counter stats for './a.out':
>=20
>           3,489.59 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                123      page-faults:u             #    0.035 K/sec
>     11,721,075,764      cycles:u                  #    3.359 GHz         =
             (83.32%)
>            132,680      stalled-cycles-frontend:u #    0.00% frontend cyc=
les idle     (83.32%)
>      4,500,051,993      stalled-cycles-backend:u  #   38.39% backend cycl=
es idle      (83.32%)
>     40,501,721,908      instructions:u            #    3.46  insn per cyc=
le
>                                                   #    0.11  stalled cycl=
es per insn  (83.33%)
>      8,494,571,027      branches:u                # 2434.258 M/sec       =
             (83.35%)
>            497,230      branch-misses:u           #    0.01% of all branc=
hes          (83.34%)
>=20
>        3.490437579 seconds time elapsed
>=20
>        3.490053000 seconds user
>        0.000000000 seconds sys
>=20
>=20
> [stefan@rome ~]$ gcc -O3 patch.c nextafter-fp.c
> [stefan@rome ~]$ perf stat ./a.out
>=20
> nop()            1.490000       0
> nextafter()      2.210000       0.720000
>                  3.700000 nano-seconds
>=20
>  Performance counter stats for './a.out':
>=20
>           3,702.89 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                122      page-faults:u             #    0.033 K/sec
>     12,407,345,183      cycles:u                  #    3.351 GHz         =
             (83.32%)
>            135,817      stalled-cycles-frontend:u #    0.00% frontend cyc=
les idle     (83.34%)
>      5,498,895,906      stalled-cycles-backend:u  #   44.32% backend cycl=
es idle      (83.34%)
>     38,002,430,460      instructions:u            #    3.06  insn per cyc=
le
>                                                   #    0.14  stalled cycl=
es per insn  (83.34%)
>      7,497,381,393      branches:u                # 2024.735 M/sec       =
             (83.34%)
>            497,462      branch-misses:u           #    0.01% of all branc=
hes          (83.32%)
>=20
>        3.703648875 seconds time elapsed
>=20
>        3.703294000 seconds user
>        0.000000000 seconds sys
>=20
>=20
> [stefan@rome ~]$ gcc -O3 bench.c nextafter.s
> [stefan@rome ~]$ perf stat ./a.out
>=20
> nop()            1.630000       0
> nextafter()      1.910000       0.280000
>                  3.540000 nano-seconds
>=20
>  Performance counter stats for './a.out':
>=20
>           3,547.12 msec task-clock:u              #    1.000 CPUs utilized
>                  0      context-switches:u        #    0.000 K/sec
>                  0      cpu-migrations:u          #    0.000 K/sec
>                123      page-faults:u             #    0.035 K/sec
>     11,949,840,797      cycles:u                  #    3.369 GHz         =
             (83.32%)
>            129,627      stalled-cycles-frontend:u #    0.00% frontend cyc=
les idle     (83.34%)
>      3,998,960,716      stalled-cycles-backend:u  #   33.46% backend cycl=
es idle      (83.34%)
>     37,493,523,285      instructions:u            #    3.14  insn per cyc=
le
>                                                   #    0.11  stalled cycl=
es per insn  (83.34%)
>      7,998,559,192      branches:u                # 2254.945 M/sec       =
             (83.34%)
>            497,565      branch-misses:u           #    0.01% of all branc=
hes          (83.32%)
>=20
>        3.547999008 seconds time elapsed
>=20
>        3.546671000 seconds user
>        0.000999000 seconds sys
>=20
>=20
> [stefan@rome ~]$