[musl] roundf() (and round(), and ...)

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [musl] roundf() (and round(), and ...)
@ 2024-06-17  1:48 Damian McGuckin
  2024-06-18 12:23 ` Szabolcs Nagy
  0 siblings, 1 reply; 25+ messages in thread
From: Damian McGuckin @ 2024-06-17  1:48 UTC (permalink / raw)
  To: MUSL

Before I submit any suggestion for a change to roundf/round/.., could I 
get a critique of the style to make sure I compy with the style guide.

Also, I have not used anything except for the now default FLT_EVAL_METHOD
so I need guidance as to where to use 'float_t'. However, the only place
where such arithmetic (here) might be affected is the expression 'a + a'
which is exact anyway so I think it is irrelevant.

Thanks - Damian

Code follows:

float roundf(float x)
{
     static const int b = 0x7f;
     const float a = fabsf(x);
     union { float f; uint32_t _f; } r = { a }, _x = { x };

     if (((int) (r._f >> 23)) < b + 23) /* i.e. effectively |x| < 2^(p-1) */
     {
         /*
          * capture the sign of the argument
          */
         const uint32_t s = _x._f - r._f;
         /*
          * this should achieve rounding to nearest with any
          * ties (half-way cases) being rounded away-from-zero.
          * (is it wise to use uint32_t instead of int32_t here?)
          */
         const uint32_t rf = ((uint32_t) (a + a)) - (uint32_t) a;

         x = (r.f = (float) rf, r._f |= s, r.f);
     }
     return x;
}

Note that as per the latest IEEE-754 standard, the above does not raise an 
exception in the event of the rounding being inexact. This is not backwards
compatible with the existing MUSL equivalents. While it is another issue
altogether, this latest standard also drops the nearbyint() in favour of
routines called (as per the C standard) 'roundevenf()/roundeven()/...'.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-17  1:48 [musl] roundf() (and round(), and ...) Damian McGuckin
@ 2024-06-18 12:23 ` Szabolcs Nagy
  2024-06-18 12:40   ` Damian McGuckin
                     ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Szabolcs Nagy @ 2024-06-18 12:23 UTC (permalink / raw)
  To: Damian McGuckin; +Cc: MUSL

* Damian McGuckin <damianm@esi.com.au> [2024-06-17 11:48:38 +1000]:
> Before I submit any suggestion for a change to roundf/round/.., could I get
> a critique of the style to make sure I compy with the style guide.
> 
> Also, I have not used anything except for the now default FLT_EVAL_METHOD
> so I need guidance as to where to use 'float_t'. However, the only place
> where such arithmetic (here) might be affected is the expression 'a + a'
> which is exact anyway so I think it is irrelevant.

note that it is unspecified if your algorithm raises inexact or not.
(iirc musl asm implementations don't raise inexact, the c code does,
c23 now requires no inexact which i guess is what you tried to follow,
but that is hard to do in c)

otherwise i think your style is fine, but i added some comments.

FLT_EVAL_METHOD is not relevant for this code.

benchmark data may be useful (or code size e.g. on a soft-float target)
because that may be a valid justification to use this implementation.

> 
> Thanks - Damian
> 
> Code follows:
> 
> float roundf(float x)
> {
>     static const int b = 0x7f;
>     const float a = fabsf(x);
>     union { float f; uint32_t _f; } r = { a }, _x = { x };
> 
>     if (((int) (r._f >> 23)) < b + 23) /* i.e. effectively |x| < 2^(p-1) */
>     {
>         /*
>          * capture the sign of the argument
>          */
>         const uint32_t s = _x._f - r._f;

i used to use explicit unions then switched to helper function asuint(x)
because i found that a bit clearer and compiler optimizes it just fine.
and some ppl complained that memcpy is more correct than union, so it is
better hidden away behind a helper function if somebody wants to switch.

nowadays i'd probably write

  if (asuint(a) >> 23 < asuint(0x1p23f) >> 23)

but e.g. i find it just as clear writing

  const uint32_t abits = asuint(a);
  if (abits >> 23 < 0x7f + 23)

>         /*
>          * this should achieve rounding to nearest with any
>          * ties (half-way cases) being rounded away-from-zero.
>          * (is it wise to use uint32_t instead of int32_t here?)
>          */
>         const uint32_t rf = ((uint32_t) (a + a)) - (uint32_t) a;
> 
>         x = (r.f = (float) rf, r._f |= s, r.f);

it is isa dependent if int32_t or uint32_t is better, but i'd
expect signed convert is more widely supported than unsigned.

so i'd write this as

  const float r = (float)((int32_t)(a + a) - (int32_t)a);
  x = asfloat(asuint(r) | s);

>     }
>     return x;
> }
> 
> Note that as per the latest IEEE-754 standard, the above does not raise an
> exception in the event of the rounding being inexact. This is not backwards

it is unspecified if (uint32_t)a raises inexact exception,
so this will be target and compiler dependent in practice.

ieee-754-2008:

  5.8 Details of conversions from floating-point to integer formats
  ...
  A language standard that permits implicit conversions or expressions
  involving mixed types should require that these be implemented with
  the inexact-signaling conversion operations below.

c23:

  F.4 Floating to integer conversion
  ...
  ... whether conversion of a non-integral floating value raises the
  "inexact" floating-point exception is unspecified.[note]
  ...
  note: In those cases where it matters, library functions can be used
  to effect such conversions with or without raising the "inexact"
  floating-point exception.

> compatible with the existing MUSL equivalents. While it is another issue
> altogether, this latest standard also drops the nearbyint() in favour of
> routines called (as per the C standard) 'roundevenf()/roundeven()/...'.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-18 12:23 ` Szabolcs Nagy
@ 2024-06-18 12:40   ` Damian McGuckin
  2024-06-19  7:09   ` Damian McGuckin
  2024-06-19  8:17   ` Damian McGuckin
  2 siblings, 0 replies; 25+ messages in thread
From: Damian McGuckin @ 2024-06-18 12:40 UTC (permalink / raw)
  To: MUSL


Thanks for the thorough review

I will look at it in more tomorrow.

Almost midnight here now - Damian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-18 12:23 ` Szabolcs Nagy
  2024-06-18 12:40   ` Damian McGuckin
@ 2024-06-19  7:09   ` Damian McGuckin
  2024-06-19  8:17   ` Damian McGuckin
  2 siblings, 0 replies; 25+ messages in thread
From: Damian McGuckin @ 2024-06-19  7:09 UTC (permalink / raw)
  To: MUSL


Commenting on only part of that:

On Tue, 18 Jun 2024, Szabolcs Nagy wrote:

> it is unspecified if (uint32_t)a raises inexact exception,
> so this will be target and compiler dependent in practice.

Yes. Scary.

> ieee-754-2008:
>
>  5.8 Details of conversions from floating-point to integer formats
>  ...
>  A language standard that permits implicit conversions or expressions
>  involving mixed types should require that these be implemented with
>  the inexact-signaling conversion operations below.

IEEE-754 2019 now specifies:

 	The operations for conversion from floating-point to a specific
 	signed or unsigned integer format without signaling the inexact
 	exception are:

 	One of which is:

 		intFormatOf-convertToIntegerTowardZero(source)

 		convertToIntegerTowardZero(x) rounds x to an integral
 		value toward zero.

AND

 	The operations for conversion from floating-point to a specific
 	signed or unsigned integer format, signaling if inexact, are:

 		intFormatOf-convertToIntegerExactTowardZero(source)

 		convertToIntegerExactTowardZero(x) rounds x to an integral
 		value toward zero.

So, it depends as you say, which of those two operations the compiler 
uses when it translates:

 	(integral-type) floating-point-expression

If indeed the target supports both.

> c23:
>
>  F.4 Floating to integer conversion
>  ...
>  ... whether conversion of a non-integral floating value raises the
>  "inexact" floating-point exception is unspecified.[note]
>  ...
>  note: In those cases where it matters, library functions can be used
>  to effect such conversions with or without raising the "inexact"
>  floating-point exception.

Ah, but then such functions are lower down those that we currently name.

Thanks - Damian

Pacific Engineering Systems International ..... 20D Grose St, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-18 12:23 ` Szabolcs Nagy
  2024-06-18 12:40   ` Damian McGuckin
  2024-06-19  7:09   ` Damian McGuckin
@ 2024-06-19  8:17   ` Damian McGuckin
  2024-06-19 18:58     ` Szabolcs Nagy
  2 siblings, 1 reply; 25+ messages in thread
From: Damian McGuckin @ 2024-06-19  8:17 UTC (permalink / raw)
  To: MUSL

On Tue, 18 Jun 2024, Szabolcs Nagy wrote:

>> Also, I have not used anything except for the now default 
>> FLT_EVAL_METHOD so I need guidance as to where to use 'float_t'. 
>> However, the only place where such arithmetic (here) might be affected 
>> is the expression 'a + a' which is exact anyway so I think it is 
>> irrelevant.
>
> note that it is unspecified if your algorithm raises inexact or not.
> (iirc musl asm implementations don't raise inexact, the c code does,
> c23 now requires no inexact which i guess is what you tried to follow,

Yes.

> but that is hard to do in c)

Yes, well for certain (and across multiple architectures).

> otherwise i think your style is fine, but i added some comments.
>
> FLT_EVAL_METHOD is not relevant for this code.

That was my placeholder as in 'Memo to Self'.

I have not used anything other than 'FLT_EVAL_METHOD=0' for decades so
I am likely to not handle the other scenarios properly, i.e. using the
types 'float_t' and 'double_t' as appropriate. So I am worried that my
code is buggy in that context.

> benchmark data may be useful (or code size e.g. on a soft-float target)
> because that may be a valid justification to use this implementation.

It is much less assembler size on a hard-float target.

I have no experience on a soft-float target.

In my testing on X86-64, it is faster than either existing MUSL code or 
GLIBM, but not as fast as assembler using 'ROUNDSS/ROUNDSD'.  Not that 
these routines are going to be performance killers.

> i used to use explicit unions then switched to helper function asuint(x)
> because i found that a bit clearer and compiler optimizes it just fine.
> and some ppl complained that memcpy is more correct than union, so it is
> better hidden away behind a helper function if somebody wants to switch.

I initially wrote these routines using these helper functions but went 
back to unions because I had some issues. Unions are the only things which 
work at compile-time.

But I am happy to go back to helper function. I just redid them routines 
and I notices that they read quite well. They also map nicely into things 
like Rust's or Chapel's 'transmute' feature although they work at compile 
time and these helper functions do not.

I had also tried implementing these routines with <ieee754.h> from Linux 
to see how well that worked. I have very little experience with the use of 
'memcpy" as this also has problems at compile time.

> nowadays i'd probably write
>
>  if (asuint(a) >> 23 < asuint(0x1p23f) >> 23)

I wish it was asuint32 rather than asuint() but maybe I am too picky.

I would prefer to define the bias 'b' and precision 'p' and write it as

 	if ((asuint(a) >> (p - 1)) < b + (p - 1))

But I was worried that would be a change too far. Also, I use different 
languages in a given week so I probably us more coercions (or casts) and 
parentheses than MUSL prefers for C.

>>         /*
>>          * this should achieve rounding to nearest with any
>>          * ties (half-way cases) being rounded away-from-zero.
>>          * (is it wise to use uint32_t instead of int32_t here?)
>>          */
>>         const uint32_t rf = ((uint32_t) (a + a)) - (uint32_t) a;
>>
>>         x = (r.f = (float) rf, r._f |= s, r.f);
>
> it is isa dependent if int32_t or uint32_t is better, but i'd
> expect signed convert is more widely supported than unsigned.

Thanks for the advice - I am just testing that change now - Damian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-19  8:17   ` Damian McGuckin
@ 2024-06-19 18:58     ` Szabolcs Nagy
  2024-06-19 23:20       ` Rich Felker
  2024-06-21  8:12       ` Damian McGuckin
  0 siblings, 2 replies; 25+ messages in thread
From: Szabolcs Nagy @ 2024-06-19 18:58 UTC (permalink / raw)
  To: Damian McGuckin; +Cc: MUSL

* Damian McGuckin <damianm@esi.com.au> [2024-06-19 18:17:21 +1000]:
> On Tue, 18 Jun 2024, Szabolcs Nagy wrote:
> > FLT_EVAL_METHOD is not relevant for this code.
> 
> That was my placeholder as in 'Memo to Self'.
> 
> I have not used anything other than 'FLT_EVAL_METHOD=0' for decades so
> I am likely to not handle the other scenarios properly, i.e. using the
> types 'float_t' and 'double_t' as appropriate. So I am worried that my
> code is buggy in that context.

you can change all 'float x' declarations to 'float_t x' and
sometimes this allows saving a store+load on i386 and m68k
due to iso c rules.

e.g. consider

float foo1(float x)
{
	T y = x*x;
	T z = x+y;
	return z;
}

with T==float and FLT_EVAL_METHOD!=0 this is not the same as

float foo2(float x)
{
	retrun x + x*x;
}

because iso c requires assignment (and return statement) to
round away the excess precision, so in the former there are two
float stores while in the latter only one (for the return).
(with vs without the extra store the result is slightly different
but they are both different from ieee754 arithmetics anyway.
if the difference matters we use explicit eval_as_float.
i believe the standard -ffp-contract=on mode only contracts
foo2 and would give yet another different result, but musl keeps
contraction off so there is no implicit fma.)

there are two approaches to deal with the extra store issue
(only a performance issue, not correctness):

the gnu approach is that by default assignment does not round
excess precision (-fexcess-precision=fast) so T==float works,
but there is no guarantee where exactly the rounding happens,
any register spill may do it.

the musl approach is to use -fexcess-precision=standard and
use T==float_t and then it is guaranteed that intermediate
results are evaluated with excess precision (even if spilled).

i don't think optimizing for i386 or m68k is very relevant today,
so if you get this wrong it will not be a big issue.

> > benchmark data may be useful (or code size e.g. on a soft-float target)
> > because that may be a valid justification to use this implementation.
> 
> It is much less assembler size on a hard-float target.
> 
> I have no experience on a soft-float target.
> 
> In my testing on X86-64, it is faster than either existing MUSL code or
> GLIBM, but not as fast as assembler using 'ROUNDSS/ROUNDSD'.  Not that these
> routines are going to be performance killers.

ok, this sounds good.

note that in musl fabsf() is an actual call because we build musl
with -ffrestanding (builtins are turned off)

at some point we should fix this to ensure important functions are
optimized internally, e.g. have a header with

#define fabsf(x) __builtin_fabsf(x)

> > i used to use explicit unions then switched to helper function asuint(x)
> > because i found that a bit clearer and compiler optimizes it just fine.
> > and some ppl complained that memcpy is more correct than union, so it is
> > better hidden away behind a helper function if somebody wants to switch.
> 
> I initially wrote these routines using these helper functions but went back
> to unions because I had some issues. Unions are the only things which work
> at compile-time.
> 
> But I am happy to go back to helper function. I just redid them routines and
> I notices that they read quite well. They also map nicely into things like
> Rust's or Chapel's 'transmute' feature although they work at compile time
> and these helper functions do not.
> 
> I had also tried implementing these routines with <ieee754.h> from Linux to
> see how well that worked. I have very little experience with the use of
> 'memcpy" as this also has problems at compile time.

not sure what you mean by 'compile-time', in musl the helper functions
are actually macros now using unions (iirc with older compilers there
were some optimization issues when i used static inline function)

but with recent compiler and -O2

static inline unsigned asuint(float x)
{
	unsigned u;
	memcpy(&u, &x, sizeof x);
	return u;
}

should work just as well. (and this is pedantically correct even on c++)

> 
> > nowadays i'd probably write
> > 
> >  if (asuint(a) >> 23 < asuint(0x1p23f) >> 23)
> 
> I wish it was asuint32 rather than asuint() but maybe I am too picky.
> 
> I would prefer to define the bias 'b' and precision 'p' and write it as
> 
> 	if ((asuint(a) >> (p - 1)) < b + (p - 1))
> 
> But I was worried that would be a change too far. Also, I use different
> languages in a given week so I probably us more coercions (or casts) and
> parentheses than MUSL prefers for C.

i'm used to writing out the bias term and precision as int literal
but using names is fine too i guess.

the naming does not always help as the c and ieee float models differ:
they both use x=m*2^e, but in c m is in [0.5,1), in ieee m is in [1,2)
so the e is off by one (1024 vs 1023 emax). and 'precision bits'
includes the leading 1 in both models, but the representation does not
store that bit so i often don't count that (53 vs 52 bits).

> 
> > >         /*
> > >          * this should achieve rounding to nearest with any
> > >          * ties (half-way cases) being rounded away-from-zero.
> > >          * (is it wise to use uint32_t instead of int32_t here?)
> > >          */
> > >         const uint32_t rf = ((uint32_t) (a + a)) - (uint32_t) a;
> > > 
> > >         x = (r.f = (float) rf, r._f |= s, r.f);
> > 
> > it is isa dependent if int32_t or uint32_t is better, but i'd
> > expect signed convert is more widely supported than unsigned.
> 
> Thanks for the advice - I am just testing that change now - Damian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-19 18:58     ` Szabolcs Nagy
@ 2024-06-19 23:20       ` Rich Felker
  2024-06-21  8:12       ` Damian McGuckin
  1 sibling, 0 replies; 25+ messages in thread
From: Rich Felker @ 2024-06-19 23:20 UTC (permalink / raw)
  To: Damian McGuckin, MUSL

On Wed, Jun 19, 2024 at 08:58:37PM +0200, Szabolcs Nagy wrote:
> * Damian McGuckin <damianm@esi.com.au> [2024-06-19 18:17:21 +1000]:
> > On Tue, 18 Jun 2024, Szabolcs Nagy wrote:
> > > FLT_EVAL_METHOD is not relevant for this code.
> > 
> > That was my placeholder as in 'Memo to Self'.
> > 
> > I have not used anything other than 'FLT_EVAL_METHOD=0' for decades so
> > I am likely to not handle the other scenarios properly, i.e. using the
> > types 'float_t' and 'double_t' as appropriate. So I am worried that my
> > code is buggy in that context.
> 
> you can change all 'float x' declarations to 'float_t x' and
> sometimes this allows saving a store+load on i386 and m68k
> due to iso c rules.
> 
> e.g. consider
> 
> float foo1(float x)
> {
> 	T y = x*x;
> 	T z = x+y;
> 	return z;
> }
> 
> with T==float and FLT_EVAL_METHOD!=0 this is not the same as
> 
> float foo2(float x)
> {
> 	retrun x + x*x;
> }
> 
> because iso c requires assignment (and return statement) to
> round away the excess precision, so in the former there are two
> float stores while in the latter only one (for the return).
> (with vs without the extra store the result is slightly different
> but they are both different from ieee754 arithmetics anyway.
> if the difference matters we use explicit eval_as_float.
> i believe the standard -ffp-contract=on mode only contracts
> foo2 and would give yet another different result, but musl keeps
> contraction off so there is no implicit fma.)
> 
> there are two approaches to deal with the extra store issue
> (only a performance issue, not correctness):
> 
> the gnu approach is that by default assignment does not round
> excess precision (-fexcess-precision=fast) so T==float works,
> but there is no guarantee where exactly the rounding happens,
> any register spill may do it.
> 
> the musl approach is to use -fexcess-precision=standard and
> use T==float_t and then it is guaranteed that intermediate
> results are evaluated with excess precision (even if spilled).
> 
> i don't think optimizing for i386 or m68k is very relevant today,
> so if you get this wrong it will not be a big issue.

The code needs to be correct even on archs with excess precision. But
there's usually only a correctess distinction for functions like this
with bit-exact/correctly-rounded results, which we don't require for
most of libm.

> > > benchmark data may be useful (or code size e.g. on a soft-float target)
> > > because that may be a valid justification to use this implementation.
> > 
> > It is much less assembler size on a hard-float target.
> > 
> > I have no experience on a soft-float target.
> > 
> > In my testing on X86-64, it is faster than either existing MUSL code or
> > GLIBM, but not as fast as assembler using 'ROUNDSS/ROUNDSD'.  Not that these
> > routines are going to be performance killers.
> 
> ok, this sounds good.
> 
> note that in musl fabsf() is an actual call because we build musl
> with -ffrestanding (builtins are turned off)
> 
> at some point we should fix this to ensure important functions are
> optimized internally, e.g. have a header with
> 
> #define fabsf(x) __builtin_fabsf(x)

Yes. I think the src/include/*.h functions should do this where
appropriate, but we probably need detection unless __GNUC__ is
sufficient to assume the existence of all the ones we want to use.

We also need a list of functions it would be helpful for (mainly
memcpy, memset, strlen & friends, and some math functions, I think)
and some review to make sure there are no places where using the
builtins would be breaking.

> > > nowadays i'd probably write
> > > 
> > >  if (asuint(a) >> 23 < asuint(0x1p23f) >> 23)
> > 
> > I wish it was asuint32 rather than asuint() but maybe I am too picky.
> > 
> > I would prefer to define the bias 'b' and precision 'p' and write it as
> > 
> > 	if ((asuint(a) >> (p - 1)) < b + (p - 1))
> > 
> > But I was worried that would be a change too far. Also, I use different
> > languages in a given week so I probably us more coercions (or casts) and
> > parentheses than MUSL prefers for C.
> 
> i'm used to writing out the bias term and precision as int literal
> but using names is fine too i guess.

I prefer having it written out where it's used rather than having to
go look somewhere else to see the value.

Rich

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-19 18:58     ` Szabolcs Nagy
  2024-06-19 23:20       ` Rich Felker
@ 2024-06-21  8:12       ` Damian McGuckin
  2024-06-21  8:16         ` Damian McGuckin
  2024-06-23 18:33         ` Rich Felker
  1 sibling, 2 replies; 25+ messages in thread
From: Damian McGuckin @ 2024-06-21  8:12 UTC (permalink / raw)
  To: MUSL

There is an optimization in roundf() (and an equivalent in round())
which handles the case of numbers 0 < |x| < 0.5

     if (e < 0x7f-1) {
         return 0*u.f;
     }

If one has lots of numbers close to zero (as you would if you were testing 
even possible 32-bit floating point number or for any other reason), this
can mean a total run time is 5% faster overall (at the expense of the 
performance of cases for |x| >= 0.5.

Is such an optimization worth it? And yes, we re not talking about lots of
compute cycles here!

Thanks - Damian

Pacific Engineering Systems International ..... 20D Grose St, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-21  8:12       ` Damian McGuckin
@ 2024-06-21  8:16         ` Damian McGuckin
  2024-06-23 18:33         ` Rich Felker
  1 sibling, 0 replies; 25+ messages in thread
From: Damian McGuckin @ 2024-06-21  8:16 UTC (permalink / raw)
  To: MUSL

On Fri, 21 Jun 2024, Damian McGuckin wrote:

> There is an optimization in roundf() (and an equivalent in round())
> which handles the case of numbers 0 < |x| < 0.5
>
>     if (e < 0x7f-1) {
>         return 0*u.f;
>     }
>
> If one has lots of numbers close to zero (as you would if you were testing 
> even possible 32-bit floating point number or for any other reason), this
> can mean a total run time is 5% faster overall (at the expense of the 
> performance of cases for |x| >= 0.5.
>
> Is such an optimization worth it? And yes, we re not talking about lots of
> compute cycles here!

This issue also arises in an implementation roundeven() which will be 
needed for C2Y compliance.

- Damian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-21  8:12       ` Damian McGuckin
  2024-06-21  8:16         ` Damian McGuckin
@ 2024-06-23 18:33         ` Rich Felker
  2024-06-25  3:41           ` Damian McGuckin
  1 sibling, 1 reply; 25+ messages in thread
From: Rich Felker @ 2024-06-23 18:33 UTC (permalink / raw)
  To: Damian McGuckin; +Cc: MUSL

On Fri, Jun 21, 2024 at 06:12:22PM +1000, Damian McGuckin wrote:
> 
> There is an optimization in roundf() (and an equivalent in round())
> which handles the case of numbers 0 < |x| < 0.5
> 
>     if (e < 0x7f-1) {
>         return 0*u.f;
>     }
> 
> If one has lots of numbers close to zero (as you would if you were
> testing even possible 32-bit floating point number or for any other
> reason), this
> can mean a total run time is 5% faster overall (at the expense of
> the performance of cases for |x| >= 0.5.
> 
> Is such an optimization worth it? And yes, we re not talking about lots of
> compute cycles here!

I think this is not a good tradeoff. It seems unlikely anyone would be
using round in a performance-critical context with numbers that are in
a range where an inline compare would be far better, and the costs are
increased code size, decreased performance in useful cases, and
exacerbating timing dependency on data (which is generally a negative
thing).

Rich

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-23 18:33         ` Rich Felker
@ 2024-06-25  3:41           ` Damian McGuckin
  2024-06-25 19:09             ` Markus Wichmann
  0 siblings, 1 reply; 25+ messages in thread
From: Damian McGuckin @ 2024-06-25  3:41 UTC (permalink / raw)
  To: MUSL

On Sun, 23 Jun 2024, Rich Felker wrote:

> I think this is not a good tradeoff. It seems unlikely anyone would be
> using round in a performance-critical context with numbers that are in
> a range where an inline compare would be far better, and the costs are
> increased code size, decreased performance in useful cases, and
> exacerbating timing dependency on data (which is generally a negative
> thing).

I was actually playing devil's advocate. I do not like the tradeoff either 
but I was looking for some words of wisdom (as per yours above) to justify
it.

Anyway... I am not convinced of the benefits of changing the existing 
rounding routines in MUSL to reflect any modification I might have made 
to the internal algorithms. And before any changes, I think there are 
other questions which need to be answered. So I might park it for now.

Here is a summary of the work for those who are interested. They reduce
the number of lines of GCC-11 generated assembler code by 33%. I am sure
that others have thought of (and used) the same incremental changes long
ago but never published them. Only the C 'float' routines are in a good
shape. The project itself was actually done in another language.  My
modifications assumes that a call likes 'fabsf' is inlined to assembler
which I hope we can assume is the rule these days.

This is the code size in GCC11 assembler instructions for C 'float' 
routines:

 	rintf....	32/16 - metric was 5% faster
 	truncf...	19/21 - metric was 10% faster
 	roundf...	47/30 - no speed difference
 	ceilf....	38/29 - no speed difference
 	floorf...	38/29 - no speed difference

This line count does not include the startproc and endproc lines, but
does include any line with just a label. The first number is that of
the MUSL 2.5 routine, and the second is that of the revised routine.
Overall, 174 lines down to 115 lines or a >33% reduction.

The performance metric was the computation and comparison of

 	GLibc routine + MUSL routine
or
 	Glibc routine + Revised routine

The GLIBC routines is there for comparison.  Where this showed the
revised algorithm was faster by 5% or more, I have noted it. Note
than no revision algorithm is slower. The CPU was a Xeon E5-2650v4.

A 'roundeven' routine exists and it is 44 lines of assembler long.

CLANG's line count is slightly longer,

I can recode the revised roundf() to be >5% faster than MUSL's roundf() 
but then I run the risk that I am have a timing dependency on data, in 
this case, the set of all the 32-bit floating point numbers.

- Damian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-25  3:41           ` Damian McGuckin
@ 2024-06-25 19:09             ` Markus Wichmann
  2024-06-25 22:24               ` Damian McGuckin
  2024-06-25 22:55               ` Damian McGuckin
  0 siblings, 2 replies; 25+ messages in thread
From: Markus Wichmann @ 2024-06-25 19:09 UTC (permalink / raw)
  To: musl; +Cc: Damian McGuckin

Am Tue, Jun 25, 2024 at 01:41:45PM +1000 schrieb Damian McGuckin:
> My
> modifications assumes that a call likes 'fabsf' is inlined to assembler
> which I hope we can assume is the rule these days.
>

If GCC is working according to the documentation, then not inside of
musl at the moment. These inlinings happen when a builtin instrinsic
is lowered into an assembler instruction, and by default, fabs* is
recognized as a builtin. But not when building musl, because musl builds
with -ffreestanding, which includes -fno-builtin.

Rich has stated he wants to work around that with an
implementation-internal header file that defines macros such as

#define fabs(x) __builtin_fabs(x)

But as far as I can tell, this hasn't happened yet.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-25 19:09             ` Markus Wichmann
@ 2024-06-25 22:24               ` Damian McGuckin
  2024-06-25 22:28                 ` enh
  2024-06-25 22:55               ` Damian McGuckin
  1 sibling, 1 reply; 25+ messages in thread
From: Damian McGuckin @ 2024-06-25 22:24 UTC (permalink / raw)
  To: musl

On Tue, 25 Jun 2024, Markus Wichmann wrote:

> #define fabs(x) __builtin_fabs(x)
>
> But as far as I can tell, this hasn't happened yet.

Thanks for the insight - Damian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-25 22:24               ` Damian McGuckin
@ 2024-06-25 22:28                 ` enh
  2024-06-25 23:35                   ` Rich Felker
  2024-06-26  1:49                   ` Thorsten Glaser
  0 siblings, 2 replies; 25+ messages in thread
From: enh @ 2024-06-25 22:28 UTC (permalink / raw)
  To: musl

i don't know about gcc, but iirc for clang you don't even need to do
that. it assumes it knows what various functions mean, and inlines
trivial stuff like this anyway...

/tmp$ cat x.c
#include <math.h>

double foo(double x) { return fabs(x); }
/tmp$ cc -O2 -S -o - x.c
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 14, 0 sdk_version 14, 4
.globl _foo                            ; -- Begin function foo
.p2align 2
_foo:                                   ; @foo
.cfi_startproc
; %bb.0:
fabs d0, d0
ret
.cfi_endproc
                                        ; -- End function
.subsections_via_symbols
/tmp$

(the same is true even at -O0, there's just more boilerplate around it
that way.)

bionic actually uses __builtin_fabs() [and friends] to _implement_
these functions, should someone be foolish enough to be using a
function pointer to call them, but we don't actually expect these
functions to ever be called.

On Tue, Jun 25, 2024 at 6:24 PM Damian McGuckin <damianm@esi.com.au> wrote:
>
> On Tue, 25 Jun 2024, Markus Wichmann wrote:
>
> > #define fabs(x) __builtin_fabs(x)
> >
> > But as far as I can tell, this hasn't happened yet.
>
> Thanks for the insight - Damian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-25 19:09             ` Markus Wichmann
  2024-06-25 22:24               ` Damian McGuckin
@ 2024-06-25 22:55               ` Damian McGuckin
  2024-06-25 23:37                 ` Rich Felker
  1 sibling, 1 reply; 25+ messages in thread
From: Damian McGuckin @ 2024-06-25 22:55 UTC (permalink / raw)
  To: musl

On Tue, 25 Jun 2024, Markus Wichmann wrote:

> If GCC is working according to the documentation, then not inside of 
> musl at the moment. These inlinings happen when a builtin instrinsic is 
> lowered into an assembler instruction, and by default, fabs* is 
> recognized as a builtin. But not when building musl, because musl builds 
> with -ffreestanding, which includes -fno-builtin.

As a general rule, I think that is a very wise decison.

> Rich has stated he wants to work around that with an
> implementation-internal header file that defines macros such as

> #define fabs(x) __builtin_fabs(x)

With IEEE 754 (in 5.5.1) defining all of copy, negate, abs (and copysign) 
as sign-bit operations, i.e. they are not just recommended operatons, I 
think this is a smart move, at least for the absolute value and sign copy
functionality.

- Damian

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-25 22:28                 ` enh
@ 2024-06-25 23:35                   ` Rich Felker
  2024-06-26 10:36                     ` enh
  2024-06-26  1:49                   ` Thorsten Glaser
  1 sibling, 1 reply; 25+ messages in thread
From: Rich Felker @ 2024-06-25 23:35 UTC (permalink / raw)
  To: enh; +Cc: musl

On Tue, Jun 25, 2024 at 06:28:52PM -0400, enh wrote:
> i don't know about gcc, but iirc for clang you don't even need to do
> that. it assumes it knows what various functions mean, and inlines
> trivial stuff like this anyway...

Not with -ffreestanding. If it does that with -ffreestanding, it's a
bug.

Rich

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-25 22:55               ` Damian McGuckin
@ 2024-06-25 23:37                 ` Rich Felker
  0 siblings, 0 replies; 25+ messages in thread
From: Rich Felker @ 2024-06-25 23:37 UTC (permalink / raw)
  To: Damian McGuckin; +Cc: musl

On Wed, Jun 26, 2024 at 08:55:08AM +1000, Damian McGuckin wrote:
> On Tue, 25 Jun 2024, Markus Wichmann wrote:
> 
> >If GCC is working according to the documentation, then not inside
> >of musl at the moment. These inlinings happen when a builtin
> >instrinsic is lowered into an assembler instruction, and by
> >default, fabs* is recognized as a builtin. But not when building
> >musl, because musl builds with -ffreestanding, which includes
> >-fno-builtin.
> 
> As a general rule, I think that is a very wise decison.
> 
> >Rich has stated he wants to work around that with an
> >implementation-internal header file that defines macros such as
> 
> >#define fabs(x) __builtin_fabs(x)
> 
> With IEEE 754 (in 5.5.1) defining all of copy, negate, abs (and
> copysign) as sign-bit operations, i.e. they are not just recommended
> operatons, I think this is a smart move, at least for the absolute
> value and sign copy
> functionality.

I don't see how that's relevant. The external function call versions
are obligated to follow the IEEE rules too, which we do, and
__builtin_fabs does not imply a contract that the compiler can expand
fabs inline, just an allowance that it can know this means "the
standard fabs function semantics" to inline that if it wants, or emit
an external call. So doing this will purely be an optimization, no
semantic difference.

Rich

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-25 22:28                 ` enh
  2024-06-25 23:35                   ` Rich Felker
@ 2024-06-26  1:49                   ` Thorsten Glaser
  2024-06-26  3:19                     ` Markus Wichmann
  2024-06-26 10:38                     ` enh
  1 sibling, 2 replies; 25+ messages in thread
From: Thorsten Glaser @ 2024-06-26  1:49 UTC (permalink / raw)
  To: musl

enh dixit:

>bionic actually uses __builtin_fabs() [and friends] to _implement_
>these functions, should someone be foolish enough to be using a

That’ll be fun should the compiler decide to insert a call to
fabs() at the call site instead, which it’d be allowed to do ☺

Stories of compilers doing that abound… IIRC it was the compiler
recognising the implementation pattern of memcpy and replacing
that by a call to memcpy… in some libc’s that was not careful to
use -ffreestanding memcpy implementation.

bye,
//mirabilos
-- 
15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-26  1:49                   ` Thorsten Glaser
@ 2024-06-26  3:19                     ` Markus Wichmann
  2024-06-26  3:43                       ` Thorsten Glaser
  2024-06-26 10:38                     ` enh
  1 sibling, 1 reply; 25+ messages in thread
From: Markus Wichmann @ 2024-06-26  3:19 UTC (permalink / raw)
  To: musl

Am Wed, Jun 26, 2024 at 01:49:09AM +0000 schrieb Thorsten Glaser:
> Stories of compilers doing that abound… IIRC it was the compiler
> recognising the implementation pattern of memcpy and replacing
> that by a call to memcpy… in some libc’s that was not careful to
> use -ffreestanding memcpy implementation.

memcpy() is special, in that GCC reserves the right to create calls to
it (and memmove(), memcmp(), and memset()) even in freestanding mode. It
doesn't do that for any other builtin. Your only options are to write
those functions in assembler, or write them in C and constantly curate
the options needed to compile the functions so that no recursive calls
happen.

> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
$ ./hw > /dev/full && echo FAIL || echo PASS

And you wouldn't believe how many implementations fail that.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-26  3:19                     ` Markus Wichmann
@ 2024-06-26  3:43                       ` Thorsten Glaser
  2024-06-26 14:11                         ` Markus Wichmann
  2024-06-26 15:25                         ` Rich Felker
  0 siblings, 2 replies; 25+ messages in thread
From: Thorsten Glaser @ 2024-06-26  3:43 UTC (permalink / raw)
  To: musl

Markus Wichmann dixit:

>memcpy() is special, in that GCC reserves the right to create calls to
>it (and memmove(), memcmp(), and memset()) even in freestanding mode. It
>doesn't do that for any other builtin.

But not to replace code that implements a memcpy, only to implement
things like struct copying.

>Your only options are to write
>those functions in assembler, or write them in C and constantly curate
>the options needed to compile the functions so that no recursive calls
>happen.

I’m personally happy writing .S files, but that’s a dying craft.

>> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)
>$ ./hw > /dev/full && echo FAIL || echo PASS
>
>And you wouldn't believe how many implementations fail that.

W: mksh: /dev/full: create: Permission denied
PASS

I don’t know what you mean… *g*

bye,
//mirabilos
(Yes, of course I know all nuances of what you mean, I’m on the list.)
-- 
“It is inappropriate to require that a time represented as
 seconds since the Epoch precisely represent the number of
 seconds between the referenced time and the Epoch.”
	-- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-25 23:35                   ` Rich Felker
@ 2024-06-26 10:36                     ` enh
  2024-06-26 15:23                       ` Rich Felker
  0 siblings, 1 reply; 25+ messages in thread
From: enh @ 2024-06-26 10:36 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Tue, Jun 25, 2024 at 7:35 PM Rich Felker <dalias@libc.org> wrote:
>
> On Tue, Jun 25, 2024 at 06:28:52PM -0400, enh wrote:
> > i don't know about gcc, but iirc for clang you don't even need to do
> > that. it assumes it knows what various functions mean, and inlines
> > trivial stuff like this anyway...
>
> Not with -ffreestanding. If it does that with -ffreestanding, it's a
> bug.

no, but my point is that most of your _users_ will never see the libm
functions anyway. (at least not on architectures new enough that their
fp instructions are basically "we went through <math.h> and added an
instruction for everything that was easy", like arm64 or riscv64.)

> Rich

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-26  1:49                   ` Thorsten Glaser
  2024-06-26  3:19                     ` Markus Wichmann
@ 2024-06-26 10:38                     ` enh
  1 sibling, 0 replies; 25+ messages in thread
From: enh @ 2024-06-26 10:38 UTC (permalink / raw)
  To: musl

On Tue, Jun 25, 2024 at 9:49 PM Thorsten Glaser <tg@mirbsd.de> wrote:
>
> enh dixit:
>
> >bionic actually uses __builtin_fabs() [and friends] to _implement_
> >these functions, should someone be foolish enough to be using a
>
> That’ll be fun should the compiler decide to insert a call to
> fabs() at the call site instead, which it’d be allowed to do ☺

yeah, even for our limited set of supported architectures we can't do
all the functions for all the architectures. arm32 in particular is
missing a lot of stuff.

> Stories of compilers doing that abound… IIRC it was the compiler
> recognising the implementation pattern of memcpy and replacing
> that by a call to memcpy… in some libc’s that was not careful to
> use -ffreestanding memcpy implementation.

well, unlike musl we don't implement those functions in C. (but, yes,
bionic's had similar fun. stack protector trying to protect stuff
before we've set up the stack protector infrastructure is our usual
one.)

> bye,
> //mirabilos
> --
> 15:41⎜<Lo-lan-do:#fusionforge> Somebody write a testsuite for helloworld :-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-26  3:43                       ` Thorsten Glaser
@ 2024-06-26 14:11                         ` Markus Wichmann
  2024-06-26 15:25                         ` Rich Felker
  1 sibling, 0 replies; 25+ messages in thread
From: Markus Wichmann @ 2024-06-26 14:11 UTC (permalink / raw)
  To: musl

Am Wed, Jun 26, 2024 at 03:43:27AM +0000 schrieb Thorsten Glaser:
> Markus Wichmann dixit:
>
> >memcpy() is special, in that GCC reserves the right to create calls to
> >it (and memmove(), memcmp(), and memset()) even in freestanding mode. It
> >doesn't do that for any other builtin.
>
> But not to replace code that implements a memcpy, only to implement
> things like struct copying.
>

No, there have been versions that would recognize you trying to write a
memcpy() and replacing it with a call to memcpy(). Which is fine, unless
you are actually writing a memcpy(). That is why musl now builds with
-fno-tree-loop-distribute-patterns if that is supported. Because that
was the optimizer making that decision, so it is now turned off.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-26 10:36                     ` enh
@ 2024-06-26 15:23                       ` Rich Felker
  0 siblings, 0 replies; 25+ messages in thread
From: Rich Felker @ 2024-06-26 15:23 UTC (permalink / raw)
  To: enh; +Cc: musl

On Wed, Jun 26, 2024 at 06:36:09AM -0400, enh wrote:
> On Tue, Jun 25, 2024 at 7:35 PM Rich Felker <dalias@libc.org> wrote:
> >
> > On Tue, Jun 25, 2024 at 06:28:52PM -0400, enh wrote:
> > > i don't know about gcc, but iirc for clang you don't even need to do
> > > that. it assumes it knows what various functions mean, and inlines
> > > trivial stuff like this anyway...
> >
> > Not with -ffreestanding. If it does that with -ffreestanding, it's a
> > bug.
> 
> no, but my point is that most of your _users_ will never see the libm
> functions anyway. (at least not on architectures new enough that their
> fp instructions are basically "we went through <math.h> and added an
> instruction for everything that was easy", like arm64 or riscv64.)

They will if they take the address of those functions and call them
indirectly, or if they're using (still default I think; if so that
should be fixed) -fmath-errno, etc.

Inlining is an optimization; it's not by itself a conforming
implementation.

Rich

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [musl] roundf() (and round(), and ...)
  2024-06-26  3:43                       ` Thorsten Glaser
  2024-06-26 14:11                         ` Markus Wichmann
@ 2024-06-26 15:25                         ` Rich Felker
  1 sibling, 0 replies; 25+ messages in thread
From: Rich Felker @ 2024-06-26 15:25 UTC (permalink / raw)
  To: Thorsten Glaser; +Cc: musl

On Wed, Jun 26, 2024 at 03:43:27AM +0000, Thorsten Glaser wrote:
> Markus Wichmann dixit:
> 
> >memcpy() is special, in that GCC reserves the right to create calls to
> >it (and memmove(), memcmp(), and memset()) even in freestanding mode. It
> >doesn't do that for any other builtin.
> 
> But not to replace code that implements a memcpy, only to implement
> things like struct copying.

FWIW "struct copying" (with __may_alias__ attribute) is, or would be,
a reasonable way to write a portable "efficiently copy blocks of the
middle at a time" part of memcpy -- outsourcing the strategy for doing
so to the compiler's logic for what insn patterns are efficient.

Unfortunately you can't because it will just emit a call to memcpy on
many archs.

Rich

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-06-26 15:26 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-17  1:48 [musl] roundf() (and round(), and ...) Damian McGuckin
2024-06-18 12:23 ` Szabolcs Nagy
2024-06-18 12:40   ` Damian McGuckin
2024-06-19  7:09   ` Damian McGuckin
2024-06-19  8:17   ` Damian McGuckin
2024-06-19 18:58     ` Szabolcs Nagy
2024-06-19 23:20       ` Rich Felker
2024-06-21  8:12       ` Damian McGuckin
2024-06-21  8:16         ` Damian McGuckin
2024-06-23 18:33         ` Rich Felker
2024-06-25  3:41           ` Damian McGuckin
2024-06-25 19:09             ` Markus Wichmann
2024-06-25 22:24               ` Damian McGuckin
2024-06-25 22:28                 ` enh
2024-06-25 23:35                   ` Rich Felker
2024-06-26 10:36                     ` enh
2024-06-26 15:23                       ` Rich Felker
2024-06-26  1:49                   ` Thorsten Glaser
2024-06-26  3:19                     ` Markus Wichmann
2024-06-26  3:43                       ` Thorsten Glaser
2024-06-26 14:11                         ` Markus Wichmann
2024-06-26 15:25                         ` Rich Felker
2024-06-26 10:38                     ` enh
2024-06-25 22:55               ` Damian McGuckin
2024-06-25 23:37                 ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).