From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on inbox.vuxu.org X-Spam-Level: X-Spam-Status: No, score=-3.3 required=5.0 tests=MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL autolearn=ham autolearn_force=no version=3.4.4 Received: (qmail 3442 invoked from network); 15 Aug 2021 14:56:28 -0000 Received: from mother.openwall.net (195.42.179.200) by inbox.vuxu.org with ESMTPUTF8; 15 Aug 2021 14:56:28 -0000 Received: (qmail 32411 invoked by uid 550); 15 Aug 2021 14:56:26 -0000 Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-ID: Reply-To: musl@lists.openwall.com Received: (qmail 32393 invoked from network); 15 Aug 2021 14:56:26 -0000 Date: Sun, 15 Aug 2021 16:56:14 +0200 From: Szabolcs Nagy To: Stefan Kanthak Cc: musl@lists.openwall.com Message-ID: <20210815145614.GI37904@port70.net> Mail-Followup-To: Stefan Kanthak , musl@lists.openwall.com References: <0C6AAAD55DA44C6189B2FF4F5FB2C3E7@H270> <20210810213455.GB37904@port70.net> <20210814234612.GH37904@port70.net> <367A4018B58A4E308E2A95404362CBFB@H270> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable In-Reply-To: <367A4018B58A4E308E2A95404362CBFB@H270> Subject: Re: [musl] [PATCH #2] Properly simplified nextafter() * Stefan Kanthak [2021-08-15 09:04:55 +0200]: > Szabolcs Nagy wrote: > > you should benchmark, but the second best is to look > > at the longest dependency chain in the hot path and > > add up the instruction latencies. >=20 > 1 billion calls to nextafter(), with random from, and to either 0 or +INF: > run 1 against glibc, 8.58 ns/call > run 2 against musl original, 3.59 > run 3 against musl patched, 0.52 > run 4 the pure floating-point variant from 0.72 > my initial post in this thread, > run 5 the assembly variant I posted. 0.28 ns/call thanks for the numbers. it's not the best measurment but shows some interesting effects. >=20 > Now hurry up and patch your slowmotion code! >=20 > Stefan >=20 > PS: I cheated a very tiny little bit: the isnan() macro of musl patched is >=20 > #ifdef PATCH > #define isnan(x) ( \ > sizeof(x) =3D=3D sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \ > sizeof(x) =3D=3D sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe00000000= 00000ULL : \ > __fpclassifyl(x) =3D=3D FP_NAN) > #else > #define isnan(x) ( \ > sizeof(x) =3D=3D sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800= 000 : \ > sizeof(x) =3D=3D sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffUL= L<<52 : \ > __fpclassifyl(x) =3D=3D FP_NAN) > #endif // PATCH i think on x86 this only changes an and to an add (or nothing at all if the compiler is smart) if this is measurable that's an uarch issue of your cpu. >=20 > PPS: and of course the log from the benchmarks... >=20 > [stefan@rome ~]$ lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 16 > On-line CPU(s) list: 0-15 > Thread(s) per core: 2 > Core(s) per socket: 8 > Socket(s): 1 > NUMA node(s): 1 > Vendor ID: AuthenticAMD > CPU family: 23 > Model: 49 > Model name: AMD EPYC 7262 8-Core Processor > Stepping: 0 > CPU MHz: 3194.323 > BogoMIPS: 6388.64 > Virtualization: AMD-V > L1d cache: 32K > L1i cache: 32K > L2 cache: 512K > L3 cache: 16384K > ... > [stefan@rome ~]$ gcc --version > gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3) > Copyright (C) 2018 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOS= E. >=20 > [stefan@rome ~]$ cat patch.c > #include >=20 > static __inline unsigned __FLOAT_BITS(float __f) > { > union {float __f; unsigned __i;} __u; > __u.__f =3D __f; > return __u.__i; > } >=20 > static __inline unsigned long long __DOUBLE_BITS(double __f) > { > union {double __f; unsigned long long __i;} __u; > __u.__f =3D __f; > return __u.__i; > } >=20 > #ifdef PATCH > #define isnan(x) ( \ > sizeof(x) =3D=3D sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \ > sizeof(x) =3D=3D sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe00000000= 00000ULL : \ > __fpclassifyl(x) =3D=3D FP_NAN) > #else > #define isnan(x) ( \ > sizeof(x) =3D=3D sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800= 000 : \ > sizeof(x) =3D=3D sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffUL= L<<52 : \ > __fpclassifyl(x) =3D=3D FP_NAN) > #endif // PATCH >=20 > __attribute__((noinline)) > double nextafter(double x, double y) > { > union {double f; unsigned long long i;} ux=3D{x}, uy=3D{y}; > unsigned long long ax, ay; > int e; >=20 > if (isnan(x) || isnan(y)) > return x + y; > if (ux.i =3D=3D uy.i) > return y; > #ifdef PATCH > ax =3D ux.i << 1; > ay =3D uy.i << 1; > #else > ax =3D ux.i & -1ULL/2; > ay =3D uy.i & -1ULL/2; > #endif > if (ax =3D=3D 0) { > if (ay =3D=3D 0) > return y; > ux.i =3D (uy.i & 1ULL<<63) | 1; > #ifdef PATCH > } else if (ax < ay =3D=3D (long long) ux.i < 0) > #else > } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63)) > #endif this change is wrong. > ux.i--; > else > ux.i++; > e =3D ux.i >> 52 & 0x7ff; > /* raise overflow if ux.f is infinite and x is finite */ > if (e =3D=3D 0x7ff) > FORCE_EVAL(x + x); > /* raise underflow if ux.f is subnormal or zero */ > if (e =3D=3D 0) > FORCE_EVAL(x*x + ux.f*ux.f); > return ux.f; > } > [stefan@rome ~]$ cat bench.c > // Copyright =C2=A9 2005-2021, Stefan Kanthak >=20 > #include > #include > #include > #include >=20 > __attribute__((noinline)) > double nop(double foo, double bar) > { > return foo + bar; > } >=20 > inline static > double lfsr64(void) > { > // 64-bit linear feedback shift register (Galois form) using > // primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"), > // initialised with 2**64 / golden ratio >=20 > static uint64_t lfsr =3D 0x9E3779B97F4A7C15; > const uint64_t sign =3D (int64_t) lfsr >> 63; >=20 > lfsr =3D (lfsr << 1) ^ (0xAD93D23594C935A9 & sign); >=20 > return *(double *) &lfsr; > } >=20 > inline static > double random64(void) > { > static uint64_t seed =3D 0x0123456789ABCDEF; >=20 > seed =3D seed * 6364136223846793005 + 1442695040888963407; >=20 > return *(double *) &seed; > } this is ok, but: use union for type punning, casts can be miscompiled. i don't think you need two prngs for this benchmark. uniform random over all floats is rarely what we want: either exercise branches with similar distribution as in real code, or try to exercise specific branches uniformly to completely neutralize branch prediction effects. >=20 > int main(void) > { > clock_t t0, t1, t2, tt; > uint32_t n; > volatile double result; >=20 > t0 =3D clock(); clock() is not ideal because it has to do a syscall. (CLOCK_PROCESS_CPUTIME_ID is not handled in the vdso.) using clock_gettime(CLOCK_MONOTONIC, &ts) is better. >=20 > for (n =3D 500000000u; n > 0u; n--) { > result =3D nop(lfsr64(), 0.0); > result =3D nop(random64(), 1.0 / 0.0); > } >=20 > t1 =3D clock(); >=20 > for (n =3D 500000000u; n > 0u; n--) { > result =3D nextafter(lfsr64(), 0.0); > result =3D nextafter(random64(), 1.0 / 0.0); > } you created two dependency chains in the loop: lfsr64 and random64 with many int operations. nextafter calls are independent so many of those can go on in parallel as long as the inputs are available. so this measures how much your prng latency is affected by computing nextafters in parallel. (i.e. not measuring nextafter latency at all.) on a big enough cpu, nextafter in this loop should have almost 0 effect. the performance will break down if speculation fails (e.g. branch-misspredicts or the cpu runs out of resources to run enough nextafters in parallel to keep up with the prng). the effect of branch misses are nicely visible in the results: i expect the y=3Dinf case mispredicts half the time in the orig code which is 250M misses. i assume your code compiled to u.i +=3D sign or similar, while the original had a branch on the sign. that extra branch rarely has this dramatic effect in real code though and i think it is arch dependent if it can be removed at all. for actually measuring the latency i suggest double r =3D 1.0; uint64_t t1, t0; t0 =3D tic(); for (n =3D N; n > 0; n--) r =3D nextafter(r, 0.0); t1 =3D tic(); this should give the best-case latency of the hot path (perfectly predicted). i hope you learnt something. >=20 > t2 =3D clock(); > tt =3D t2 - t0; t2 -=3D t1; t1 -=3D t0; t0 =3D t2 - t1; >=20 > printf("\n" > "nop() %4lu.%06lu 0\n" > "nextafter() %4lu.%06lu %4lu.%06lu\n" > " %4lu.%06lu nano-seconds\n", > t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS= _PER_SEC, > t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS= _PER_SEC, > t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS= _PER_SEC, > tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS= _PER_SEC); > } > [stefan@rome ~]$ gcc -O3 -lm bench.c > [stefan@rome ~]$ perf stat ./a.out >=20 > nop() 1.480000 0 > nextafter() 10.060000 8.580000 > 11.540000 nano-seconds >=20 > Performance counter stats for './a.out': >=20 > 11,548.78 msec task-clock:u # 1.000 CPUs utilized > 0 context-switches:u # 0.000 K/sec > 0 cpu-migrations:u # 0.000 K/sec > 145 page-faults:u # 0.013 K/sec > 38,917,213,536 cycles:u # 3.370 GHz = (83.33%) > 15,647,656,615 stalled-cycles-frontend:u # 40.21% frontend cyc= les idle (83.33%) > 10,746,044,422 stalled-cycles-backend:u # 27.61% backend cycl= es idle (83.33%) > 69,739,403,870 instructions:u # 1.79 insn per cyc= le > # 0.22 stalled cycl= es per insn (83.33%) > 16,495,748,110 branches:u # 1428.354 M/sec = (83.33%) > 500,701,246 branch-misses:u # 3.04% of all branc= hes (83.33%) >=20 > 11.550414029 seconds time elapsed >=20 > 11.548265000 seconds user > 0.000999000 seconds sys >=20 >=20 > [stefan@rome ~]$ gcc -O3 bench.c patch.c > patch.c:23: warning: "isnan" redefined > #define isnan(x) ( \ >=20 > In file included from patch.c:1: > /usr/include/math.h:254: note: this is the location of the previous defin= ition > # define isnan(x) \ >=20 > [stefan@rome ~]$ perf stat ./a.out >=20 > nop() 1.480000 0 > nextafter() 5.070000 3.590000 > 6.550000 nano-seconds >=20 > Performance counter stats for './a.out': >=20 > 6,558.44 msec task-clock:u # 1.000 CPUs utilized > 0 context-switches:u # 0.000 K/sec > 0 cpu-migrations:u # 0.000 K/sec > 122 page-faults:u # 0.019 K/sec > 22,038,054,931 cycles:u # 3.360 GHz = (83.33%) > 204,849 stalled-cycles-frontend:u # 0.00% frontend cyc= les idle (83.34%) > 5,497,340,276 stalled-cycles-backend:u # 24.94% backend cycl= es idle (83.34%) > 39,751,746,482 instructions:u # 1.80 insn per cyc= le > # 0.14 stalled cycl= es per insn (83.34%) > 9,747,428,086 branches:u # 1486.242 M/sec = (83.34%) > 250,407,409 branch-misses:u # 2.57% of all branc= hes (83.33%) >=20 > 6.559550924 seconds time elapsed >=20 > 6.558918000 seconds user > 0.000000000 seconds sys >=20 >=20 > [stefan@rome ~]$ gcc -O3 bench.c -DPATCH patch.c > patch.c:18: warning: "isnan" redefined > #define isnan(x) ( \ >=20 > In file included from patch.c:1: > /usr/include/math.h:254: note: this is the location of the previous defin= ition > # define isnan(x) \ >=20 > [stefan@rome ~]$ perf stat ./a.out >=20 > nop() 1.480000 0 > nextafter() 2.000000 0.520000 > 3.480000 nano-seconds >=20 > Performance counter stats for './a.out': >=20 > 3,489.59 msec task-clock:u # 1.000 CPUs utilized > 0 context-switches:u # 0.000 K/sec > 0 cpu-migrations:u # 0.000 K/sec > 123 page-faults:u # 0.035 K/sec > 11,721,075,764 cycles:u # 3.359 GHz = (83.32%) > 132,680 stalled-cycles-frontend:u # 0.00% frontend cyc= les idle (83.32%) > 4,500,051,993 stalled-cycles-backend:u # 38.39% backend cycl= es idle (83.32%) > 40,501,721,908 instructions:u # 3.46 insn per cyc= le > # 0.11 stalled cycl= es per insn (83.33%) > 8,494,571,027 branches:u # 2434.258 M/sec = (83.35%) > 497,230 branch-misses:u # 0.01% of all branc= hes (83.34%) >=20 > 3.490437579 seconds time elapsed >=20 > 3.490053000 seconds user > 0.000000000 seconds sys >=20 >=20 > [stefan@rome ~]$ gcc -O3 patch.c nextafter-fp.c > [stefan@rome ~]$ perf stat ./a.out >=20 > nop() 1.490000 0 > nextafter() 2.210000 0.720000 > 3.700000 nano-seconds >=20 > Performance counter stats for './a.out': >=20 > 3,702.89 msec task-clock:u # 1.000 CPUs utilized > 0 context-switches:u # 0.000 K/sec > 0 cpu-migrations:u # 0.000 K/sec > 122 page-faults:u # 0.033 K/sec > 12,407,345,183 cycles:u # 3.351 GHz = (83.32%) > 135,817 stalled-cycles-frontend:u # 0.00% frontend cyc= les idle (83.34%) > 5,498,895,906 stalled-cycles-backend:u # 44.32% backend cycl= es idle (83.34%) > 38,002,430,460 instructions:u # 3.06 insn per cyc= le > # 0.14 stalled cycl= es per insn (83.34%) > 7,497,381,393 branches:u # 2024.735 M/sec = (83.34%) > 497,462 branch-misses:u # 0.01% of all branc= hes (83.32%) >=20 > 3.703648875 seconds time elapsed >=20 > 3.703294000 seconds user > 0.000000000 seconds sys >=20 >=20 > [stefan@rome ~]$ gcc -O3 bench.c nextafter.s > [stefan@rome ~]$ perf stat ./a.out >=20 > nop() 1.630000 0 > nextafter() 1.910000 0.280000 > 3.540000 nano-seconds >=20 > Performance counter stats for './a.out': >=20 > 3,547.12 msec task-clock:u # 1.000 CPUs utilized > 0 context-switches:u # 0.000 K/sec > 0 cpu-migrations:u # 0.000 K/sec > 123 page-faults:u # 0.035 K/sec > 11,949,840,797 cycles:u # 3.369 GHz = (83.32%) > 129,627 stalled-cycles-frontend:u # 0.00% frontend cyc= les idle (83.34%) > 3,998,960,716 stalled-cycles-backend:u # 33.46% backend cycl= es idle (83.34%) > 37,493,523,285 instructions:u # 3.14 insn per cyc= le > # 0.11 stalled cycl= es per insn (83.34%) > 7,998,559,192 branches:u # 2254.945 M/sec = (83.34%) > 497,565 branch-misses:u # 0.01% of all branc= hes (83.32%) >=20 > 3.547999008 seconds time elapsed >=20 > 3.546671000 seconds user > 0.000999000 seconds sys >=20 >=20 > [stefan@rome ~]$