From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (qmail 21915 invoked from network); 15 Aug 2021 07:47:15 -0000
Received: from mother.openwall.net (195.42.179.200)
  by inbox.vuxu.org with ESMTPUTF8; 15 Aug 2021 07:47:15 -0000
Received: (qmail 32222 invoked by uid 550); 15 Aug 2021 07:47:13 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
List-ID: <musl.lists.openwall.com>
Reply-To: musl@lists.openwall.com
Received: (qmail 32195 invoked from network); 15 Aug 2021 07:47:12 -0000
Date: Sun, 15 Aug 2021 02:46:58 -0500 (CDT)
From: Ariadne Conill <ariadne@dereferenced.org>
To: musl@lists.openwall.com
cc: Szabolcs Nagy <nsz@port70.net>
In-Reply-To: <367A4018B58A4E308E2A95404362CBFB@H270>
Message-ID: <4272846f-eb89-2856-af9-38571037a924@dereferenced.org>
References: <0C6AAAD55DA44C6189B2FF4F5FB2C3E7@H270> <20210810213455.GB37904@port70.net> <E2423D1F1F3848848AEA933048174858@H270> <20210814234612.GH37904@port70.net> <367A4018B58A4E308E2A95404362CBFB@H270>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
Subject: Re: [musl] [PATCH #2] Properly simplified nextafter()

Hi,

On Sun, 15 Aug 2021, Stefan Kanthak wrote:

> Szabolcs Nagy <nsz@port70.net> wrote:
>
>> * Stefan Kanthak <stefan.kanthak@nexgo.de> [2021-08-13 14:04:51 +0200]:
>>> Szabolcs Nagy <nsz@port70.net> wrote on 2021-08-10 at 23:34:
>
>>>> (the i386 machine where i originally tested this preferred int
>>>> cmp and float cmp was very slow in the subnormal range
>>>
>>> This also and still holds for i386 FPU fadd/fmul as well as SSE
>>> addsd/addss/mulss/mulsd additions/multiplies!
>>
>> they are avoided in the common case, and only used to create
>> fenv side-effects.
>
> Unfortunately but for hard & SOFT-float, where no fenv exists, as
> Rich wrote.

My admittedly rudimentary understanding of how soft-float is implemented 
in musl leads me to believe that this doesn't really matter that much.

>>> --- -/src/math/nextafter.c
>>> +++ +/src/math/nextafter.c
>>> @@ -10,13 +10,13 @@
>>>                 return x + y;
>>>         if (ux.i == uy.i)
>>>                 return y;
>>> -       ax = ux.i & -1ULL/2;
>>> -       ay = uy.i & -1ULL/2;
>>> +       ax = ux.i << 2;
>>> +       ay = uy.i << 2;
>>
>> the << 2 looks wrong, the top bit of the exponent is lost.
>
> It IS wrong, but only in the post, not in the code I tested.

So... in other words, you are testing code that is different from the code 
you are submitting?
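
To make the concern about the diff as posted concrete, here is a small 
sketch.  The helper names are mine and hypothetical, not anything in musl, 
and the `<< 1` variant is only my assumption about what a shift-based key 
would need to look like; I have not seen the code that was actually tested.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its IEEE 754 binary64 bit pattern. */
static uint64_t double_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof u);
	return u;
}

/* Key used by the existing code: clear the sign bit only. */
static uint64_t key_mask(double d) { return double_bits(d) & -1ULL/2; }

/* Key from the diff as posted: drops the sign bit AND the top exponent bit. */
static uint64_t key_shl2(double d) { return double_bits(d) << 2; }

/* Assumed alternative: drops only the sign bit, so magnitude order is kept. */
static uint64_t key_shl1(double d) { return double_bits(d) << 1; }

/* Self-check: 0.5 is 0x3FE0000000000000, 2.0 is 0x4000000000000000. */
void check_keys(void)
{
	assert(key_mask(0.5) < key_mask(2.0));   /* correct magnitude order */
	assert(key_shl1(0.5) < key_shl1(2.0));   /* order preserved */
	assert(key_shl2(2.0) == 0);              /* 2.0 collapses to "zero" */
	assert(key_shl2(0.5) > key_shl2(2.0));   /* order inverted */
	assert(key_mask(-2.0) == key_mask(2.0)); /* sign is ignored */
}
```

With `<< 2`, the key for 2.0 becomes 0, so the `ax == 0` branch would treat 
2.0 as if it were a zero.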

>
>>>         if (ax == 0) {
>>>                 if (ay == 0)
>>>                         return y;
>>>                 ux.i = (uy.i & 1ULL<<63) | 1;
>>> -       } else if (ax > ay || ((ux.i ^ uy.i) & 1ULL<<63))
>>> +       } else if ((ax < ay) == ((int64_t) ux.i < 0))
>>>                 ux.i--;
>>>         else
>>>                 ux.i++;
>> ...
>>> How do you compare these 60 instructions/252 bytes to the code I posted
>>> (23 instructions/72 bytes)?
>>
>> you should benchmark, but the second best is to look
>> at the longest dependency chain in the hot path and
>> add up the instruction latencies.
>
> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
> run 1 against glibc,                         8.58 ns/call
> run 2 against musl original,                 3.59
> run 3 against musl patched,                  0.52
> run 4 the pure floating-point variant from   0.72
>      my initial post in this thread,
> run 5 the assembly variant I posted.         0.28 ns/call
>
> Now hurry up and patch your slowmotion code!

And how do these benchmarks look on non-x86 architectures, like aarch64 or 
riscv64?

I would rather have a portable math library with functions that cost 3.59 
nsec per call than one where the portable bits are not exercised on x86.
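
For anyone who wants to collect numbers on other targets, a harness along 
these lines should build anywhere with a POSIX `clock_gettime()`.  This is 
only a sketch under my own assumptions: the function names are 
hypothetical, the inputs are drawn from [0,1) rather than from whatever 
distribution the quoted runs used, and the PRNG and loop overhead are 
included in what gets timed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* xorshift64: tiny deterministic PRNG so every target sees the same inputs. */
static uint64_t xorshift64(uint64_t *s)
{
	uint64_t x = *s;
	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	return *s = x;
}

/* Do-nothing stand-in, used to measure the overhead of the harness itself. */
double bench_baseline(double x, double y)
{
	(void)y;
	return x;
}

/* Mean wall-clock nanoseconds per call of fn(from, to) over `calls` calls. */
double bench_ns_per_call(double (*fn)(double, double), double to, uint64_t calls)
{
	struct timespec t0, t1;
	uint64_t seed = 0x9e3779b97f4a7c15ULL;
	volatile double sink = 0; /* keeps the calls from being optimized away */

	assert(calls > 0);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (uint64_t i = 0; i < calls; i++) {
		/* top 53 bits of the PRNG output scaled into [0,1) */
		double from = (double)(xorshift64(&seed) >> 11) * 0x1p-53;
		sink = fn(from, to);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	(void)sink;
	return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec))
		/ (double)calls;
}
```

Passing `nextafter` (with `<math.h>` included) in place of 
`bench_baseline`, and subtracting the baseline figure, gives a rough 
per-call cost that can be compared across aarch64, riscv64 and x86_64.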

> Stefan
>
> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
>
> #ifdef PATCH
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff00000U : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
> __fpclassifyl(x) == FP_NAN)
> #else
> #define isnan(x) ( \
> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
> __fpclassifyl(x) == FP_NAN)
> #endif // PATCH
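
A quick way to sanity-check a replacement like this is to compare both 
forms on a few known bit patterns.  The sketch below does that for the 
float case only; the helper names are hypothetical, and the 8-digit 
constant in `isnan_shift` is my assumption about what was intended, since 
the constant as quoted above has only 7 hex digits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t float_bits(float f)
{
	uint32_t u;
	memcpy(&u, &f, sizeof u);
	return u;
}

static float float_from_bits(uint32_t u)
{
	float f;
	memcpy(&f, &u, sizeof f);
	return f;
}

/* Existing form: magnitude strictly above the infinity pattern. */
static int isnan_mask(float f)
{
	return (float_bits(f) & 0x7fffffff) > 0x7f800000;
}

/* Shift form with the full 8-digit constant (assumed intent). */
static int isnan_shift(float f)
{
	return (float_bits(f) << 1) > 0xff000000U;
}

/* Shift form with the constant exactly as quoted. */
static int isnan_posted(float f)
{
	return (float_bits(f) << 1) > 0xff00000U;
}

void check_isnan(void)
{
	float qnan = float_from_bits(0x7fc00000);
	float inf = float_from_bits(0x7f800000);

	assert(isnan_mask(qnan) && isnan_shift(qnan));
	assert(!isnan_mask(inf) && !isnan_shift(inf));
	assert(!isnan_mask(1.0f) && !isnan_shift(1.0f));
	assert(isnan_posted(1.0f)); /* 1.0f is misclassified as NaN */
}
```

The double branch is the one the benchmark exercises, so the timings are 
not affected by this, but the float branch as quoted would report 1.0f as 
a NaN.
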
>
> PPS: and of course the log from the benchmarks...
>
> [stefan@rome ~]$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                16
> On-line CPU(s) list:   0-15
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             AuthenticAMD
> CPU family:            23
> Model:                 49
> Model name:            AMD EPYC 7262 8-Core Processor
> Stepping:              0
> CPU MHz:               3194.323
> BogoMIPS:              6388.64
> Virtualization:        AMD-V
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              512K
> L3 cache:              16384K
> ...
> [stefan@rome ~]$ gcc --version
> gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

GCC 8 is quite old at this point.  GCC 9 and 10 have considerably more 
capable optimizers.

Indeed, on my system with GCC 10.3.1, nextafter() is using SSE2 
instructions on Alpine x86_64, and if I rebuild musl with `-march=znver2` 
it uses AVX instructions for nextafter(), which seems more than 
sufficiently optimized to me.

Speaking in a personal capacity only, I would rather musl's math routines 
remain to the point instead of going down the rabbit hole of manually 
optimized routines like glibc has done.  That also includes 
hand-optimizing the C routines to exploit optimal behavior for some 
specific microarch.  GCC already has knowledge of what optimizations are 
good for a specific microarch (this is the whole point of `-march` and 
`-mtune` after all); if something is missing, it should be fixed there.

Ariadne