Do not use 64 bit division if possible

mailing list of musl libc

* Do not use 64 bit division if possible
@ 2017-11-25 20:52 David Guillen Fandos
  2017-11-25 23:15 ` Michael Clark
  0 siblings, 1 reply; 11+ messages in thread
From: David Guillen Fandos @ 2017-11-25 20:52 UTC (permalink / raw)
  To: musl

Hey there,

I just noticed that my binary was pulling in some gcc helper functions
for integer division from a few places in musl. I checked, and it seems
that even though musl assumes PAGE_SIZE is always a power of two, we
divide by it instead of using a shift. This adds overhead and results in
slow division on platforms that do not have a 64 bit divider (even the
ones that do have a 32 bit divider).

So I propose a patch here; let me know what you think.

David

diff --git a/src/conf/sysconf.c b/src/conf/sysconf.c
index b8b761d0..aa9fc9d1 100644
--- a/src/conf/sysconf.c
+++ b/src/conf/sysconf.c
@@ -4,6 +4,7 @@ long sysconf(int name)
  #include <sys/sysinfo.h>
  #include "syscall.h"
  #include "libc.h"
+#include "atomic.h"

  #define JT(x) (-256|(x))
  #define VER JT(1)
@@ -206,7 +206,7 @@ long sysconf(int name)
  		if (name==_SC_PHYS_PAGES) mem = si.totalram;
  		else mem = si.freeram + si.bufferram;
  		mem *= si.mem_unit;
-		mem /= PAGE_SIZE;
+		mem >>= (unsigned)(a_ctz_l(PAGE_SIZE));
  		return (mem > LONG_MAX) ? LONG_MAX : mem;
  		case JT_ZERO & 255:
  		return 0;
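
For illustration, here is a minimal standalone sketch of the arithmetic
this patch changes, assuming the page size is a nonzero power of two.
The names shift_for and pages_from_units are hypothetical, and
__builtin_ctzl is a GCC/Clang builtin used here in place of musl's
internal a_ctz_l.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: log2 of a power-of-two page size.
 * Uses the GCC/Clang builtin, not musl's internal a_ctz_l. */
static unsigned shift_for(unsigned long page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (unsigned)__builtin_ctzl(page_size);
}

/* Hypothetical stand-in for the _SC_PHYS_PAGES arithmetic:
 * scale by mem_unit, convert bytes to pages with a shift,
 * then clamp the result to LONG_MAX. */
static long pages_from_units(unsigned long long units, unsigned mem_unit,
                             unsigned long page_size)
{
    unsigned long long mem = units * mem_unit;
    mem >>= shift_for(page_size);
    return (mem > LONG_MAX) ? LONG_MAX : (long)mem;
}
```

Because mem is unsigned, the right shift is a plain logical shift and
gives the same quotient as the division for any power-of-two page size.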


* Re: Do not use 64 bit division if possible
  2017-11-25 20:52 Do not use 64 bit division if possible David Guillen Fandos
@ 2017-11-25 23:15 ` Michael Clark
  2017-11-25 23:46   ` David Guillen Fandos
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Clark @ 2017-11-25 23:15 UTC (permalink / raw)
  To: musl

At -O0 and above, clang and gcc strength reduce division by a constant power of two into a right shift (arithmetic or logical depending on signedness of the types).

- https://cx.rv8.io/g/kDrEkB

a_ctz_l is not exactly inexpensive, given that it involves a multiply, an AND, a negate, a shift, and a table load (which may miss the cache).

We’d be better off defining PAGE_SHIFT if we want to be certain the code uses a shift when optimisation is disabled. However, I trust the compilers to turn the division into a shift.

#ifndef a_ctz_l
#define a_ctz_l a_ctz_l
static inline int a_ctz_l(unsigned long x)
{
        static const char debruijn32[32] = {
                0, 1, 23, 2, 29, 24, 19, 3, 30, 27, 25, 11, 20, 8, 4, 13,
                31, 22, 28, 18, 26, 10, 7, 12, 21, 17, 9, 6, 16, 5, 15, 14
        };
        if (sizeof(long) == 8) return a_ctz_64(x);
        return debruijn32[(x&-x)*0x076be629 >> 27];
}
#endif
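
The 32-bit fallback path above can be tried in isolation. This is a
sketch that reuses the same table and multiplier but fixes the argument
type to uint32_t; the name ctz32_debruijn is hypothetical and the
function is not part of musl.

```c
#include <assert.h>
#include <stdint.h>

/* Count trailing zeros of a nonzero 32-bit value with a de Bruijn
 * multiply: x & -x isolates the lowest set bit, the multiply shifts
 * the de Bruijn constant left by that bit's index, and the top five
 * bits of the product select the table entry. */
static int ctz32_debruijn(uint32_t x)
{
    static const char debruijn32[32] = {
        0, 1, 23, 2, 29, 24, 19, 3, 30, 27, 25, 11, 20, 8, 4, 13,
        31, 22, 28, 18, 26, 10, 7, 12, 21, 17, 9, 6, 16, 5, 15, 14
    };
    assert(x != 0); /* the result is meaningless for zero */
    return debruijn32[(uint32_t)((x & -x) * 0x076be629u) >> 27];
}
```

For a page size of 4096 this returns 12, which is the shift count the
original patch computes on every call.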

If you study the codegen then this might be a better change (including to all other archs).

$ git diff arch/x86_64/bits/limits.h
diff --git a/arch/x86_64/bits/limits.h b/arch/x86_64/bits/limits.h
index 792a30b..32f29bf 100644
--- a/arch/x86_64/bits/limits.h
+++ b/arch/x86_64/bits/limits.h
@@ -1,6 +1,6 @@
 #if defined(_POSIX_SOURCE) || defined(_POSIX_C_SOURCE) \
  || defined(_XOPEN_SOURCE) || defined(_GNU_SOURCE) || defined(_BSD_SOURCE)
-#define PAGE_SIZE 4096
+#define PAGE_SIZE 4096UL
 #define LONG_BIT 64
 #endif
 
 Try removing the UL suffix from the constant in the compiler explorer example above and see the change in codegen.
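
As a small sketch of the constant case discussed here: when the divisor
is an unsigned compile-time power of two, the division and the explicit
shift are equivalent, so the compiler is free to emit a shift for both.
The macro and function names below are illustrative, not musl's, and
whether a given compiler actually emits identical code should be checked
in the generated assembly for the target.

```c
#include <assert.h>

#define PAGE_SIZE_CONST  4096UL  /* illustrative constant, not musl's macro */
#define PAGE_SHIFT_CONST 12      /* log2(4096) */

_Static_assert((1UL << PAGE_SHIFT_CONST) == PAGE_SIZE_CONST,
               "shift and size must describe the same page size");

/* Division by an unsigned constant power of two. */
static unsigned long long div_by_const(unsigned long long bytes)
{
    return bytes / PAGE_SIZE_CONST;
}

/* The explicit shift that the division can be reduced to. */
static unsigned long long shift_by_const(unsigned long long bytes)
{
    return bytes >> PAGE_SHIFT_CONST;
}

/* Sanity check that both forms agree for a given input. */
static void check_equivalence(unsigned long long bytes)
{
    assert(div_by_const(bytes) == shift_by_const(bytes));
}
```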


* Re: Do not use 64 bit division if possible
  2017-11-25 23:15 ` Michael Clark
@ 2017-11-25 23:46   ` David Guillen Fandos
  2017-11-25 23:53     ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: David Guillen Fandos @ 2017-11-25 23:46 UTC (permalink / raw)
  To: musl

Thanks for your response.
Please note that PAGE_SIZE is not a constant but an alias to 
libc.page_size which is a variable of type size_t (signed).
That's why at O1+ gcc doesn't generate a shift.

I also created a patch to include libc.page_shift, but as far as I can
see no other functions would benefit from it, since there are no other
divisions there (only negations, additions and subtractions).

And yeah, I agree that a_ctz_l is not exactly inexpensive, but I guess it
is better than a full 64 bit signed division (that's why I cast to
unsigned; otherwise the right shift is not trivial due to the sign).

Thanks!
David


* Re: Do not use 64 bit division if possible
  2017-11-25 23:46   ` David Guillen Fandos
@ 2017-11-25 23:53     ` Rich Felker
  2017-11-26  0:10       ` Michael Clark
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2017-11-25 23:53 UTC (permalink / raw)
  To: musl

On Sun, Nov 26, 2017 at 12:46:56AM +0100, David Guillen Fandos wrote:
> Thanks for your response.
> Please note that PAGE_SIZE is not a constant but an alias to
> libc.page_size which is a variable of type size_t (signed).
> That's why at O1+ gcc doesn't generate a shift.

Indeed; this varies by arch.

> I also created a patch to include libc.page_shift, but as far as I
> can see no other functions would benefit from it, since there's no
> other divides there (only negations, additions and subtractions).

Adding infrastructure complexity except in cases where it makes a
significant improvement to size or performance is generally not
desirable. mmap() is one other place where, in principle, division by
PAGE_SIZE might take place, but in practice the size is constant 4096
or 8192 on all archs.

> And yeah I agree, a_ctz_l is not exactly inexpensive but I guess it
> is better than full 64 bit signed division (that's why I cast
> unsigned otherwise the shift right is not trivial due to the sign).

The cost here is more a matter of adding a reading complexity
dependency on musl internals (a_*) where it's not needed. I wonder if
GCC could optimize it if we instead of /PAGE_SIZE wrote
/(PAGE_SIZE&-PAGE_SIZE). Or if we did something like define PAGE_SIZE
as ((libc.page_size&-libc.page_size)==libc.page_size ? libc.page_size
: 1/0) so that "PAGE_SIZE is not a power of 2" would become an
unreachable case.

Rich
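
The two bit tricks mentioned above can be sketched in standalone form.
The helper names are hypothetical. Note that dividing by (n & -n) still
requires the compiler to recognize the divisor as a power of two, which
this sketch does not guarantee; it only shows that the expressions are
arithmetically equivalent.

```c
#include <assert.h>
#include <stddef.h>

/* x & -x isolates the lowest set bit of an unsigned value. */
static size_t lowest_set_bit(size_t x)
{
    return x & -x;
}

/* A nonzero value is a power of two exactly when its lowest set bit
 * is the value itself. */
static int is_power_of_two(size_t x)
{
    return x != 0 && lowest_set_bit(x) == x;
}

/* Same quotient as bytes / page_size whenever page_size is a power
 * of two, since lowest_set_bit(page_size) == page_size in that case. */
static unsigned long long bytes_to_pages(unsigned long long bytes,
                                         size_t page_size)
{
    assert(is_power_of_two(page_size));
    return bytes / lowest_set_bit(page_size);
}
```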




* Re: Do not use 64 bit division if possible
  2017-11-25 23:53     ` Rich Felker
@ 2017-11-26  0:10       ` Michael Clark
  2017-11-26  0:49         ` David Guillen Fandos
  2017-11-26  0:49         ` Rich Felker
  0 siblings, 2 replies; 11+ messages in thread
From: Michael Clark @ 2017-11-26  0:10 UTC (permalink / raw)
  To: musl



> On 26/11/2017, at 12:53 PM, Rich Felker <dalias@libc.org> wrote:
> 
> On Sun, Nov 26, 2017 at 12:46:56AM +0100, David Guillen Fandos wrote:
>> Thanks for your response.
>> Please note that PAGE_SIZE is not a constant but an alias to
>> libc.page_size which is a variable of type size_t (signed).
>> That's why at O1+ gcc doesn't generate a shift.
> 
> Indeed; this varies by arch.

Oh, I wasn’t aware of that.

>> I also created a patch to include libc.page_shift, but as far as I
>> can see no other functions would benefit from it, since there's no
>> other divides there (only negations, additions and subtractions).
> 
> Adding infrastructure complexity except in cases where it makes a
> significant improvement to size or performance is generally not
> desirable. mmap() is one other place where, in principle, division by
> PAGE_SIZE might take place, but in practice the size is constant 4096
> or 8192 on all archs.
> 
>> And yeah I agree, a_ctz_l is not exactly inexpensive but I guess it
>> is better than full 64 bit signed division (that's why I cast
>> unsigned otherwise the shift right is not trivial due to the sign).
> 
> The cost here is more a matter of adding a reading complexity
> dependency on musl internals (a_*) where it's not needed. I wonder if
> GCC could optimize it if we instead of /PAGE_SIZE wrote
> /(PAGE_SIZE&-PAGE_SIZE). Or if we did something like define PAGE_SIZE
> as ((libc.page_size&-libc.page_size)==libc.page_size ? libc.page_size
> : 1/0) so that "PAGE_SIZE is not a power of 2" would become an
> unreachable case.

Interesting. It seems GCC figures out that the division by zero is unreachable, but the (n&-n) expression yields a power of two, not log2(n), so the ctz is still required.

- https://cx.rv8.io/g/eHf2Ah

One could compute the shift once at initialisation time, add PAGE_SHIFT, and on architectures with variable page sizes do this:

#define PAGE_SHIFT libc.page_shift

diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
index 2d758af..f24d10a 100644
--- a/src/env/__libc_start_main.c
+++ b/src/env/__libc_start_main.c
@@ -29,6 +29,7 @@ void __init_libc(char **envp, char *pn)
        __hwcap = aux[AT_HWCAP];
        __sysinfo = aux[AT_SYSINFO];
        libc.page_size = aux[AT_PAGESZ];
+       libc.page_shift = a_ctz_l(libc.page_size);
 
        if (!pn) pn = (void*)aux[AT_EXECFN];
        if (!pn) pn = "";

That isolates the a_ctz_l to one place.
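
A standalone sketch of that idea, assuming a nonzero power-of-two page
size: compute the shift once at start-up, then use only shifts
afterwards. struct fake_libc and both functions are hypothetical
stand-ins for musl's internal state, and __builtin_ctzl is the
GCC/Clang builtin rather than a_ctz_l.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of musl's libc struct. */
struct fake_libc {
    size_t page_size;
    unsigned page_shift;
};

/* Done once at initialisation: record the page size and its log2. */
static void init_page_info(struct fake_libc *lc, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    lc->page_size = page_size;
    lc->page_shift = (unsigned)__builtin_ctzl(page_size);
}

/* Later conversions need only the cached shift, no ctz or division. */
static unsigned long long to_pages(const struct fake_libc *lc,
                                   unsigned long long bytes)
{
    return bytes >> lc->page_shift;
}
```

The trade-off is one extra field of state in exchange for removing the
ctz and the division from every later call.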


* Re: Do not use 64 bit division if possible
  2017-11-26  0:10       ` Michael Clark
@ 2017-11-26  0:49         ` David Guillen Fandos
  2017-11-26  0:59           ` Rich Felker
  2017-11-26  0:49         ` Rich Felker
  1 sibling, 1 reply; 11+ messages in thread
From: David Guillen Fandos @ 2017-11-26  0:49 UTC (permalink / raw)
  To: musl

Hey,

Wow that's an awesome optimization (the a&-a), didn't know gcc was smart 
enough to figure that out by itself :D
I just realized that PAGE_SIZE does indeed seem to be defined as a
constant on some architectures; I did not notice since I was running on
MIPS, which has a different page size for each uarch.

I'd say the (a&-a) trick is a very simple optimization and we should use
it, since it adds almost no complexity and saves some cycles and some
.text bytes, and .text space is sometimes a bit tight.

Something like this? Doesn't hurt constants, improves some arches :)

diff --git a/src/conf/sysconf.c b/src/conf/sysconf.c
index b8b761d0..aa9fc9d1 100644
--- a/src/conf/sysconf.c
+++ b/src/conf/sysconf.c
@@ -206,7 +206,7 @@ long sysconf(int name)
		if (name==_SC_PHYS_PAGES) mem = si.totalram;
		else mem = si.freeram + si.bufferram;
		mem *= si.mem_unit;
-		mem /= PAGE_SIZE;
+		mem /= (unsigned)(PAGE_SIZE & -PAGE_SIZE);
		return (mem > LONG_MAX) ? LONG_MAX : mem;
		case JT_ZERO & 255:
		return 0;


* Re: Do not use 64 bit division if possible
  2017-11-26  0:49         ` David Guillen Fandos
@ 2017-11-26  0:59           ` Rich Felker
  2017-11-26  1:12             ` David Guillen Fandos
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2017-11-26  0:59 UTC (permalink / raw)
  To: musl

On Sun, Nov 26, 2017 at 01:49:09AM +0100, David Guillen Fandos wrote:
> Hey,
> 
> Wow that's an awesome optimization (the a&-a), didn't know gcc was
> smart enough to figure that out by itself :D

It doesn't seem to be doing any optimizing for me. What it *should* do
is optimize the div to ctz+shift.

BTW please don't top-reply; it makes threads hard to follow and hard
to meaningfully reply to inline.

Rich



* Re: Do not use 64 bit division if possible
  2017-11-26  0:59           ` Rich Felker
@ 2017-11-26  1:12             ` David Guillen Fandos
  2017-11-26  1:23               ` Rich Felker
  0 siblings, 1 reply; 11+ messages in thread
From: David Guillen Fandos @ 2017-11-26  1:12 UTC (permalink / raw)
  To: musl


On 26/11/17 01:59, Rich Felker wrote:
> On Sun, Nov 26, 2017 at 01:49:09AM +0100, David Guillen Fandos wrote:
>> Hey,
>>
>> Wow that's an awesome optimization (the a&-a), didn't know gcc was
>> smart enough to figure that out by itself :D
> 
> It doesn't seem to be doing any optimizing for me. What it *should* do
> is optimize the div to ctz+shift.
> 
> BTW please don't top-reply; it makes threads hard to follow and hard
> to meaningfully reply to inline.
> 
> Rich
> 
> 
>> I just realized that PAGE_SIZE seems indeed to be defined to a
>> constant for some architectures, did not notice since I was running
>> on MIPS which has a page size different for each uarch.
>>
>> I'd say the (a&-a) is a very simple optimization and we should use
>> it, since it adds almost no complexity and saves some cycles and
>> some .text bytes, which is sometimes a bit tight.
>>
>> Something like this? Doesn't hurt constants, improves some arches :)
>>
>> diff --git a/src/conf/sysconf.c b/src/conf/sysconf.c
>> index b8b761d0..aa9fc9d1 100644
>> --- a/src/conf/sysconf.c
>> +++ b/src/conf/sysconf.c
>> @@ -206,7 +206,7 @@ long sysconf(int name)
>> 		if (name==_SC_PHYS_PAGES) mem = si.totalram;
>> 		else mem = si.freeram + si.bufferram;
>> 		mem *= si.mem_unit;
>> -		mem /= PAGE_SIZE;
>> +		mem /= (unsigned)(PAGE_SIZE & -PAGE_SIZE);
>> 		return (mem > LONG_MAX) ? LONG_MAX : mem;
>> 		case JT_ZERO & 255:
>> 		return 0;

Sorry for that, default settings you know :)

Well, the main reason is that on MIPS it requires pulling in __divdi3,
which is around 1KB of code. Hey, it's not much, but why would we need
it, right? It makes a difference in embedded tools with statically
linked musl.

Thanks for your interest!

David





* Re: Do not use 64 bit division if possible
  2017-11-26  1:12             ` David Guillen Fandos
@ 2017-11-26  1:23               ` Rich Felker
  2017-11-26  1:40                 ` David Guillen Fandos
  0 siblings, 1 reply; 11+ messages in thread
From: Rich Felker @ 2017-11-26  1:23 UTC (permalink / raw)
  To: musl

On Sun, Nov 26, 2017 at 02:12:58AM +0100, David Guillen Fandos wrote:
> 
> On 26/11/17 01:59, Rich Felker wrote:
> >On Sun, Nov 26, 2017 at 01:49:09AM +0100, David Guillen Fandos wrote:
> >>Hey,
> >>
> >>Wow that's an awesome optimization (the a&-a), didn't know gcc was
> >>smart enough to figure that out by itself :D
> >
> >It doesn't seem to be doing any optimizing for me. What it *should* do
> >is optimize the div to ctz+shift.
> >
> >BTW please don't top-reply; it makes threads hard to follow and hard
> >to meaningfully reply to inline.
> >
> >Rich
> >
> >
> >>I just realized that PAGE_SIZE seems indeed to be defined to a
> >>constant for some architectures, did not notice since I was running
> >>on MIPS which has a page size different for each uarch.
> >>
> >>I'd say the (a&-a) is a very simple optimization and we should use
> >>it, since it adds almost no complexity and saves some cycles and
> >>some .text bytes, which is sometimes a bit tight.
> >>
> >>Something like this? Doesn't hurt constants, improves some arches :)
> >>
> >>diff --git a/src/conf/sysconf.c b/src/conf/sysconf.c
> >>index b8b761d0..aa9fc9d1 100644
> >>--- a/src/conf/sysconf.c
> >>+++ b/src/conf/sysconf.c
> >>@@ -206,7 +206,7 @@ long sysconf(int name)
> >>		if (name==_SC_PHYS_PAGES) mem = si.totalram;
> >>		else mem = si.freeram + si.bufferram;
> >>		mem *= si.mem_unit;
> >>-		mem /= PAGE_SIZE;
> >>+		mem /= (unsigned)(PAGE_SIZE & -PAGE_SIZE);
> >>		return (mem > LONG_MAX) ? LONG_MAX : mem;
> >>		case JT_ZERO & 255:
> >>		return 0;
> 
> Sorry for that, default settings you know :)
> 
> Well the main reason is cause in MIPS it requires adding __divdi3
> which is around 1KB of code, which hey, it's not much, but why would
> we need it right? It makes a difference in embedded tools with
> statically linked musl.
> 
> Thanks for your interest!

If this is a real problem you're hitting, I'm interested in helping,
but it seems unlikely. If your program uses printf or other common
functions it will already be pulling in __divdi3 I think.

Rich



* Re: Do not use 64 bit division if possible
  2017-11-26  1:23               ` Rich Felker
@ 2017-11-26  1:40                 ` David Guillen Fandos
  0 siblings, 0 replies; 11+ messages in thread
From: David Guillen Fandos @ 2017-11-26  1:40 UTC (permalink / raw)
  To: musl

On 26/11/17 02:23, Rich Felker wrote:
> On Sun, Nov 26, 2017 at 02:12:58AM +0100, David Guillen Fandos wrote:
>>
>> On 26/11/17 01:59, Rich Felker wrote:
>>> On Sun, Nov 26, 2017 at 01:49:09AM +0100, David Guillen Fandos wrote:
>>>> Hey,
>>>>
>>>> Wow that's an awesome optimization (the a&-a), didn't know gcc was
>>>> smart enough to figure that out by itself :D
>>>
>>> It doesn't seem to be doing any optimizing for me. What it *should* do
>>> is optimize the div to ctz+shift.
>>>
>>> BTW please don't top-reply; it makes threads hard to follow and hard
>>> to meaningfully reply to inline.
>>>
>>> Rich
>>>
>>>
>>>> I just realized that PAGE_SIZE seems indeed to be defined to a
>>>> constant for some architectures, did not notice since I was running
>>>> on MIPS which has a page size different for each uarch.
>>>>
>>>> I'd say the (a&-a) is a very simple optimization and we should use
>>>> it, since it adds almost no complexity and saves some cycles and
>>>> some .text bytes, which is sometimes a bit tight.
>>>>
>>>> Something like this? Doesn't hurt constants, improves some arches :)
>>>>
>>>> diff --git a/src/conf/sysconf.c b/src/conf/sysconf.c
>>>> index b8b761d0..aa9fc9d1 100644
>>>> --- a/src/conf/sysconf.c
>>>> +++ b/src/conf/sysconf.c
>>>> @@ -206,7 +206,7 @@ long sysconf(int name)
>>>> 		if (name==_SC_PHYS_PAGES) mem = si.totalram;
>>>> 		else mem = si.freeram + si.bufferram;
>>>> 		mem *= si.mem_unit;
>>>> -		mem /= PAGE_SIZE;
>>>> +		mem /= (unsigned)(PAGE_SIZE & -PAGE_SIZE);
>>>> 		return (mem > LONG_MAX) ? LONG_MAX : mem;
>>>> 		case JT_ZERO & 255:
>>>> 		return 0;
>>
>> Sorry for that, default settings you know :)
>>
>> Well the main reason is cause in MIPS it requires adding __divdi3
>> which is around 1KB of code, which hey, it's not much, but why would
>> we need it right? It makes a difference in embedded tools with
>> statically linked musl.
>>
>> Thanks for your interest!
> 
> If this is a real problem you're hitting, I'm interested in helping,
> but it seems unlikely. If your program uses printf or other common
> functions it will already be pulling in __divdi3 I think.
> 
> Rich
> 

Not a real problem really, other than binary size. I'm not using printf,
which is why I was chasing all the big-ish functions that seemed
unnecessary in my binary, and I was curious about why sysconf actually
needed a 64 bit division on MIPS.

Also, the (a&-a) doesn't seem to help; gcc is not that smart :) It seems
there's no easy way to hint to it that a number is a power of two. I
guess that's why the kernel uses PAGE_SHIFT.

Given that the page size is not constant on some arches, it might be
useful to have a page shift, since some operations might be faster? Like
page aligning [ & ~(PAGE_SIZE-1) ]? Not sure if we care that much, even
though we use that in malloc.

Thanks for the interest!
David
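
For reference, the page-aligning expression mentioned above can be
sketched as follows, assuming a nonzero power-of-two page size. The
function names are hypothetical. The mask form uses only a subtraction,
a complement and an AND on the page size itself, so it needs neither a
division nor a precomputed shift.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to the start of its page. */
static uintptr_t page_align_down(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/* Round an address up to the next page boundary; an address that is
 * already aligned is returned unchanged. */
static uintptr_t page_align_up(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr + page_size - 1) & ~(page_size - 1);
}
```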







* Re: Do not use 64 bit division if possible
  2017-11-26  0:10       ` Michael Clark
  2017-11-26  0:49         ` David Guillen Fandos
@ 2017-11-26  0:49         ` Rich Felker
  1 sibling, 0 replies; 11+ messages in thread
From: Rich Felker @ 2017-11-26  0:49 UTC (permalink / raw)
  To: musl

On Sun, Nov 26, 2017 at 01:10:15PM +1300, Michael Clark wrote:
> 
> 
> > On 26/11/2017, at 12:53 PM, Rich Felker <dalias@libc.org> wrote:
> > 
> > On Sun, Nov 26, 2017 at 12:46:56AM +0100, David Guillen Fandos wrote:
> >> Thanks for your response.
> >> Please note that PAGE_SIZE is not a constant but an alias to
> >> libc.page_size which is a variable of type size_t (signed).
> >> That's why at O1+ gcc doesn't generate a shift.
> > 
> > Indeed; this varies by arch.
> 
> Oh, I wasn’t aware of that.
> 
> >> I also created a patch to include libc.page_shift, but as far as I
> >> can see no other functions would benefit from it, since there's no
> >> other divides there (only negations, additions and subtractions).
> > 
> > Adding infrastructure complexity except in cases where it makes a
> > significant improvement to size or performance is generally not
> > desirable. mmap() is one other place where, in principle, division by
> > PAGE_SIZE might take place, but in practice the size is constant 4096
> > or 8192 on all archs.
> > 
> >> And yeah I agree, a_ctz_l is not exactly inexpensive but I guess it
> >> is better than full 64 bit signed division (that's why I cast
> >> unsigned otherwise the shift right is not trivial due to the sign).
> > 
> > The cost here is more a matter of adding a reading complexity
> > dependency on musl internals (a_*) where it's not needed. I wonder if
> > GCC could optimize it if we instead of /PAGE_SIZE wrote
> > /(PAGE_SIZE&-PAGE_SIZE). Or if we did something like define PAGE_SIZE
> > as ((libc.page_size&-libc.page_size)==libc.page_size ? libc.page_size
> > : 1/0) so that "PAGE_SIZE is not a power of 2" would become an
> > unreachable case.
> 
> Interesting. It seems GCC figures out the division by zero is unreachable but the (n&-n) expression leads to a power of two, not to a  log2 n so the ctz is still required.
> 
> - https://cx.rv8.io/g/eHf2Ah
> 
>  One could do so once at initialisation time and add PAGE_SHIFT and on architectures with variable page sizes do this:
> 
> #define PAGE_SHIFT libc.page_shift
> 
> diff --git a/src/env/__libc_start_main.c b/src/env/__libc_start_main.c
> index 2d758af..f24d10a 100644
> --- a/src/env/__libc_start_main.c
> +++ b/src/env/__libc_start_main.c
> @@ -29,6 +29,7 @@ void __init_libc(char **envp, char *pn)
>         __hwcap = aux[AT_HWCAP];
>         __sysinfo = aux[AT_SYSINFO];
>         libc.page_size = aux[AT_PAGESZ];
> +       libc.page_shift = a_ctz_l(libc.page_size);
>  
>         if (!pn) pn = (void*)aux[AT_EXECFN];
>         if (!pn) pn = "";
> 
> That isolates the a_ctz_l to one place.

Is there a reason it makes a difference? The operation involves a
syscall so the cost of a division is going to be dominated by the
syscall. If you're calling this repeatedly/in a loop, your program is
going to be super slow with or without the division.

Rich

