mailing list of musl libc
* [PATCH] configure: add gcc flags for better link-time optimization
@ 2015-10-23 12:30 Denys Vlasenko
  2015-10-23 13:12 ` Szabolcs Nagy
  2015-11-01 19:56 ` Rich Felker
  0 siblings, 2 replies; 16+ messages in thread
From: Denys Vlasenko @ 2015-10-23 12:30 UTC (permalink / raw)
  To: Rich Felker; +Cc: Denys Vlasenko, musl

libc.so size reduction:

   text	   data	    bss	    dec	    hex	filename
 564099	   1944	  11768	 577811	  8d113	libc.so.before
 562277	   1924	  11576	 575777	  8c921	libc.so

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
CC: musl <musl@lists.openwall.com>
---
 configure | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/configure b/configure
index 4564ad8..a9ed159 100755
--- a/configure
+++ b/configure
@@ -440,6 +440,44 @@ tryflag CFLAGS_AUTO -fno-unwind-tables
 tryflag CFLAGS_AUTO -fno-asynchronous-unwind-tables
 
 #
+# When linker merges sections, a tiny section (such as one resulting
+# from "static char flag_var") with no alignment restrictions
+# can end up lodged between two more strongly aligned ones (say,
+# "static int global_cnt1/2", both of which want 32-bit alignment).
+# Then this byte-sized "flag_var" gets 3 bytes of padding.
+#
+# With section sorting by alignment, one-byte flag variables have
+# higher chance of being grouped together and not require padding.
+# (It can be made even better. Linker is too dumb.
+# ld needs to grow -Wl,--pack-sections-optimally)
+#
+# For us, this affects the size of only one file: libc.so
+#
+tryldflag LDFLAGS_AUTO -Wl,--sort-section=alignment
+tryldflag LDFLAGS_AUTO -Wl,--sort-common
+
+#
+# Put every function and data object into its own section:
+# .text.funcname, .data.var, .rodata.const_struct, .bss.zerovar
+#
+# Previous optimization isn't working too well by itself
+# because data objects aren't living in separate sections,
+# they are all grouped in one .data and one .bss section per *.o file.
+# With -ffunction/data-sections, section sorting eliminates more padding.
+#
+# Object files in static *.a files will also have their functions
+# and data objects each in its own section.
+#
+# This enables programs statically linked with -Wl,--gc-sections
+# to perform "section garbage collection": drop unused code and data
+# not on per-*.o-file basis, but on per-function and per-object basis.
+# This is a big thing: --gc-sections sometimes eliminates several percent
+# of unreachable code and data in final executable.
+#
+tryflag CFLAGS_AUTO -ffunction-sections
+tryflag CFLAGS_AUTO -fdata-sections
+
+#
 # The GNU toolchain defaults to assuming unmarked files need an
 # executable stack, potentially exposing vulnerabilities in programs
 # linked with such object files. Fix this.
-- 
1.8.1.4




* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-23 12:30 [PATCH] configure: add gcc flags for better link-time optimization Denys Vlasenko
@ 2015-10-23 13:12 ` Szabolcs Nagy
  2015-10-23 14:41   ` Denys Vlasenko
  2015-10-27  1:21   ` Rich Felker
  2015-11-01 19:56 ` Rich Felker
  1 sibling, 2 replies; 16+ messages in thread
From: Szabolcs Nagy @ 2015-10-23 13:12 UTC (permalink / raw)
  To: musl; +Cc: Rich Felker, Denys Vlasenko

* Denys Vlasenko <vda.linux@googlemail.com> [2015-10-23 14:30:26 +0200]:
> libc.so size reduction:
> 
>    text	   data	    bss	    dec	    hex	filename
>  564099	   1944	  11768	 577811	  8d113	libc.so.before
>  562277	   1924	  11576	 575777	  8c921	libc.so
> 

i assume this is x86_64, nice improvement.

> +# When linker merges sections, a tiny section (such as one resulting
> +# from "static char flag_var") with no alignment restrictions
> +# can end up lodged between two more strongly aligned ones (say,
> +# "static int global_cnt1/2", both of which want 32-bit alignment).
> +# Then this byte-sized "flag_var" gets 3 bytes of padding.
> +#
> +# With section sorting by alignment, one-byte flag variables have
> +# higher chance of being grouped together and not require padding.
> +# (It can be made even better. Linker is too dumb.
> +# ld needs to grow -Wl,--pack-sections-optimally)
> +#
> +# For us, this affects the size of only one file: libc.so
> +#
> +tryldflag LDFLAGS_AUTO -Wl,--sort-section=alignment
> +tryldflag LDFLAGS_AUTO -Wl,--sort-common

i think this came up before
https://sourceware.org/bugzilla/show_bug.cgi?id=14156

it was also noted at some point that the optimal sorting
is 'sort by use' so all the unused legacy functions end
up on the same page so they never need to be loaded.

probably config knobs would be useful that turn off libc
features, but not by dropping them, just moving them to a
different part of libc.so that is assumed to be never needed.



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-23 13:12 ` Szabolcs Nagy
@ 2015-10-23 14:41   ` Denys Vlasenko
  2015-10-23 14:48     ` Rich Felker
  2015-10-27  1:21   ` Rich Felker
  1 sibling, 1 reply; 16+ messages in thread
From: Denys Vlasenko @ 2015-10-23 14:41 UTC (permalink / raw)
  To: musl, Rich Felker, Denys Vlasenko

On Fri, Oct 23, 2015 at 3:12 PM, Szabolcs Nagy <nsz@port70.net> wrote:
>> +# When linker merges sections, a tiny section (such as one resulting
>> +# from "static char flag_var") with no alignment restrictions
>> +# can end up lodged between two more strongly aligned ones (say,
>> +# "static int global_cnt1/2", both of which want 32-bit alignment).
>> +# Then this byte-sized "flag_var" gets 3 bytes of padding.
>> +#
>> +# With section sorting by alignment, one-byte flag variables have
>> +# higher chance of being grouped together and not require padding.
>> +# (It can be made even better. Linker is too dumb.
>> +# ld needs to grow -Wl,--pack-sections-optimally)
>> +#
>> +# For us, this affects the size of only one file: libc.so
>> +#
>> +tryldflag LDFLAGS_AUTO -Wl,--sort-section=alignment
>> +tryldflag LDFLAGS_AUTO -Wl,--sort-common
>
> i think this came up before
> https://sourceware.org/bugzilla/show_bug.cgi?id=14156
>
> it was also noted at some point that the optimal sorting
> is 'sort by use' so all the unused legacy functions end
> up on the same page so they never need to be loaded.

Sure, but that would be quite hard to do.
How would you reliably know who uses which part of libc
code?

OTOH, we don't _need_ to kill ourselves trying to optimize
that. Optimizing code size is not the big thing here.
Even though data and bss shrinkage is smaller,
it is more important.

Minimizing the number of data pages is more important
than text pages. A text page is shared among all processes linked
to this libc.so; data page is allocated in every process
(as soon as even one byte in this page is written to.
With only 4 pages in total like in this example, I'm pretty sure
all of them get dirtied by libc init, use of stdio or malloc).

Make libc (.data + .bss) fit into one page less and you get about
as many pages saved as you have processes running.



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-23 14:41   ` Denys Vlasenko
@ 2015-10-23 14:48     ` Rich Felker
  2015-10-23 22:00       ` Denys Vlasenko
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2015-10-23 14:48 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: musl

On Fri, Oct 23, 2015 at 04:41:17PM +0200, Denys Vlasenko wrote:
> On Fri, Oct 23, 2015 at 3:12 PM, Szabolcs Nagy <nsz@port70.net> wrote:
> >> +# When linker merges sections, a tiny section (such as one resulting
> >> +# from "static char flag_var") with no alignment restrictions
> >> +# can end up lodged between two more strongly aligned ones (say,
> >> +# "static int global_cnt1/2", both of which want 32-bit alignment).
> >> +# Then this byte-sized "flag_var" gets 3 bytes of padding.
> >> +#
> >> +# With section sorting by alignment, one-byte flag variables have
> >> +# higher chance of being grouped together and not require padding.
> >> +# (It can be made even better. Linker is too dumb.
> >> +# ld needs to grow -Wl,--pack-sections-optimally)
> >> +#
> >> +# For us, this affects the size of only one file: libc.so
> >> +#
> >> +tryldflag LDFLAGS_AUTO -Wl,--sort-section=alignment
> >> +tryldflag LDFLAGS_AUTO -Wl,--sort-common
> >
> > i think this came up before
> > https://sourceware.org/bugzilla/show_bug.cgi?id=14156
> >
> > it was also noted at some point that the optimal sorting
> > is 'sort by use' so all the unused legacy functions end
> > up on the same page so they never need to be loaded.
> 
> Sure, but that would be quite hard to do.
> How would you reliably know who uses which part of libc
> code?
> 
> OTOH, we don't _need_ to kill ourselves trying to optimize
> that. Optimizing code size is not the big thing here.
> Even though data and bss shrinkage is smaller,
> it is more important.

I agree, data is a lot more important than code here.

> Minimizing the number of data pages is more important
> than text pages. A text page is shared among all processes linked
> to this libc.so; data page is allocated in every process
> (as soon as even one byte in this page is written to.
> With only 4 pages in total like in this example, I'm pretty sure
> all of them get dirtied by libc init, use of stdio or malloc).
> 
> Make libc (.data + .bss) fit into one page less and you get about
> as many pages saved as you have processes running.

FYI all the data/bss in libc except a few large objects _easily_ fits
in a single page. Unfortunately a couple of those large ones (malloc
state & stdio buffers) are used by the majority of programs. I'm still
not sure of the best way to achieve a particular sorting without awful
hacks. Sort by alignment may be a decent approximation of best
behavior but I need to check it out.

Thanks for working on this topic.

Rich



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-23 14:48     ` Rich Felker
@ 2015-10-23 22:00       ` Denys Vlasenko
  2015-10-24 12:43         ` Szabolcs Nagy
  0 siblings, 1 reply; 16+ messages in thread
From: Denys Vlasenko @ 2015-10-23 22:00 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Fri, Oct 23, 2015 at 4:48 PM, Rich Felker <dalias@libc.org> wrote:
>> Minimizing the number of data pages is more important
>> than text pages. A text page is shared among all processes linked
>> to this libc.so; data page is allocated in every process
>> (as soon as even one byte in this page is written to.
>> With only 4 pages in total like in this example, I'm pretty sure
>> all of them get dirtied by libc init, use of stdio or malloc).
>>
>> Make libc (.data + .bss) fit into one page less and you get about
>> as many pages saved as you have processes running.
>
> FYI all the data/bss in libc except a few large objects _easily_ fits
> in a single page.

What's important is how many pages are dirtied.
Here's a test with Aboriginal's x86_64 static busybox:

# sleep 9999 | sh -c 'echo $$; exec ./busybox dd bs=1'
16290

# pmap 16290
16290:   ./busybox dd bs 1
0000000000400000    320K r-x--
/home/srcdevel/aboriginal/a.0/build/root-filesystem-x86_64/usr/bin/busybox
^^^^ text + rodata
000000000064f000      4K rw---
/home/srcdevel/aboriginal/a.0/build/root-filesystem-x86_64/usr/bin/busybox
^^^^ data + start of bss
0000000000650000      8K rw---    [ anon ]
^^^^ the rest of bss

0000000001655000      4K rw---    [ anon ]
^^^^ brk

00007fff57196000    132K rw---    [ stack ]
00007fff571df000      8K r----    [ anon ]
00007fff571e1000      8K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total              488K


Thus, for this binary, three RW pages are mapped immediately for .data and .bss,
for any applet.
Are all these pages touched?

# cat /proc/16290/smaps
00400000-00450000 r-xp 00000000 08:02 1810890
  /home/srcdevel/aboriginal/a.0/build/root-filesystem-x86_64/usr/bin/busybox
Size:                320 kB
...
0064f000-00650000 rw-p 0004f000 08:02 1810890
  /home/srcdevel/aboriginal/a.0/build/root-filesystem-x86_64/usr/bin/busybox
Size:                  4 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
...
00650000-00652000 rw-p 00000000 00:00 0
Size:                  8 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB

No. Only two of the three mapped pages are touched.

This is pretty impressive. However, this is a small busybox config:
only 31 applets.

I have a complete (~320 applets) 32-bit static busybox config built
against uclibc, and it has only 2 pages of .data+.bss.

Will test and see how close to that musl can get.

I'll continue sending patches that carry over some
data size reductions from uclibc to musl.



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-23 22:00       ` Denys Vlasenko
@ 2015-10-24 12:43         ` Szabolcs Nagy
  2015-10-24 19:20           ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread
From: Szabolcs Nagy @ 2015-10-24 12:43 UTC (permalink / raw)
  To: musl; +Cc: Rich Felker

* Denys Vlasenko <vda.linux@googlemail.com> [2015-10-24 00:00:56 +0200]:
> What's important is how many pages are dirtied.
> Here's a test with Aboriginal's x86_64 static busybox:
> 
...
> 
> No. Only two pages are mapped, not three.
> 
> This is pretty impressive. However, this is a small busybox config:
> only 31 applet.
> 
> I have a complete (~320 applets) 32-bit static busybox config built
> against uclibc, and it has only 2 pages .data+.bss
> 
> Will test & see how close to that can musl get.
> 

http://www.etalabs.net/compare_libcs.html

hm this says min dirty is 3 pages on uclibc with static linking.

in case you repeat the experiments with glibc
a 'dirty pages with static busybox' entry would be
useful on that comparison page.

> I'll continue sending patches which allow to carry over some
> data size reductions from uclibc to musl.

nice



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-24 12:43         ` Szabolcs Nagy
@ 2015-10-24 19:20           ` Rich Felker
  0 siblings, 0 replies; 16+ messages in thread
From: Rich Felker @ 2015-10-24 19:20 UTC (permalink / raw)
  To: musl

On Sat, Oct 24, 2015 at 02:43:12PM +0200, Szabolcs Nagy wrote:
> * Denys Vlasenko <vda.linux@googlemail.com> [2015-10-24 00:00:56 +0200]:
> > What's important is how many pages are dirtied.
> > Here's a test with Aboriginal's x86_64 static busybox:
> > 
> ....
> > 
> > No. Only two pages are mapped, not three.
> > 
> > This is pretty impressive. However, this is a small busybox config:
> > only 31 applet.
> > 
> > I have a complete (~320 applets) 32-bit static busybox config built
> > against uclibc, and it has only 2 pages .data+.bss
> > 
> > Will test & see how close to that can musl get.
> > 
> 
> http://www.etalabs.net/compare_libcs.html
> 
> hm this says min dirty is 3pages on uclibc with static linking.

I believe I counted one page of stack in these figures.

> in case you repeat the experiments with glibc
> a 'dirty pages with static busybox' entry would be
> useful on that comparison page.

Yes. Before doing much more on the comparison page though I think we
should develop a good reproducible way to run these tests. I was ok
with something casual when musl was first launched but it feels
inappropriate to continue without something more rigorous.

Rich



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-23 13:12 ` Szabolcs Nagy
  2015-10-23 14:41   ` Denys Vlasenko
@ 2015-10-27  1:21   ` Rich Felker
  2015-10-27 19:09     ` Denys Vlasenko
  1 sibling, 1 reply; 16+ messages in thread
From: Rich Felker @ 2015-10-27  1:21 UTC (permalink / raw)
  To: musl; +Cc: Denys Vlasenko

On Fri, Oct 23, 2015 at 03:12:02PM +0200, Szabolcs Nagy wrote:
> * Denys Vlasenko <vda.linux@googlemail.com> [2015-10-23 14:30:26 +0200]:
> > libc.so size reduction:
> > 
> >    text	   data	    bss	    dec	    hex	filename
> >  564099	   1944	  11768	 577811	  8d113	libc.so.before
> >  562277	   1924	  11576	 575777	  8c921	libc.so
> > 
> 
> i assume this is x86_64, nice improvement.

I suspect all of this difference comes from optimizing out dummy weak
functions that are replaced by strong versions in other files. The
same savings could be achieved by eliminating them with #ifndef
SHARED, but as noted elsewhere I actually want to eliminate all such
#ifdefs and build only one set of .o files to use for libc.so and
libc.a. So I think this looks like a nice way to get the same benefit
without #ifdef.

BTW I noticed that __simple_malloc (in lite_malloc.c) is external
despite there being no need for it to be external. If we make it
static, I think gc-sections will be able to remove it too. (Denys, if
you want to try the numbers with it made static before I get around to
it, I'd be happy to hear results.)

> > +# When linker merges sections, a tiny section (such as one resulting
> > +# from "static char flag_var") with no alignment restrictions
> > +# can end up lodged between two more strongly aligned ones (say,
> > +# "static int global_cnt1/2", both of which want 32-bit alignment).
> > +# Then this byte-sized "flag_var" gets 3 bytes of padding.
> > +#
> > +# With section sorting by alignment, one-byte flag variables have
> > +# higher chance of being grouped together and not require padding.
> > +# (It can be made even better. Linker is too dumb.
> > +# ld needs to grow -Wl,--pack-sections-optimally)
> > +#
> > +# For us, this affects the size of only one file: libc.so
> > +#
> > +tryldflag LDFLAGS_AUTO -Wl,--sort-section=alignment
> > +tryldflag LDFLAGS_AUTO -Wl,--sort-common
> 
> i think this came up before
> https://sourceware.org/bugzilla/show_bug.cgi?id=14156

I don't think this bug affects linking musl itself, since we don't have
init or fini sections, so we probably don't have to worry about it.
Does this sound correct? If there's a risk of these options causing
breakage with some binutils versions then we probably have to detect
that, but hopefully there's not.

Rich



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-27  1:21   ` Rich Felker
@ 2015-10-27 19:09     ` Denys Vlasenko
  2015-10-27 21:01       ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread
From: Denys Vlasenko @ 2015-10-27 19:09 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl

On Tue, Oct 27, 2015 at 2:21 AM, Rich Felker <dalias@libc.org> wrote:
> On Fri, Oct 23, 2015 at 03:12:02PM +0200, Szabolcs Nagy wrote:
>> * Denys Vlasenko <vda.linux@googlemail.com> [2015-10-23 14:30:26 +0200]:
>> > libc.so size reduction:
>> >
>> >    text        data     bss     dec     hex filename
>> >  564099        1944   11768  577811   8d113 libc.so.before
>> >  562277        1924   11576  575777   8c921 libc.so
>> >
>>
>> i assume this is x86_64, nice improvement.
>
> I suspect all of this difference comes from optimizing out dummy weak
> functions that are replaced by strong versions in other files.

No, it does not look that way. These options don't give linker
any extra freedom to eliminate anything.

"nm --size-sort -D libc.so" shows no differences after

   --sort-section=alignment --sort-common

were added to ld command line.

After -ffunction-sections -fdata-sections are added to gcc command line,
"nm" output does change, the entire difference is as follows:

--- libc.so.nm.OLD       2015-10-27 19:57:52.971964518 +0100
+++ libc.so.nm  2015-10-27 19:58:28.544115009 +0100
@@ -18,7 +18,6 @@
 0000000000000001 T setutxent
 0000000000000001 W updwtmp
 0000000000000001 T updwtmpx
-0000000000000002 T dlclose
 0000000000000002 T __stack_chk_fail
 0000000000000003 T catclose
 0000000000000003 T dirfd
@@ -84,6 +83,7 @@
 0000000000000005 T catopen
 0000000000000005 T creall
 0000000000000005 T dladdr
+0000000000000005 T dlclose
 0000000000000005 T dlinfo
 0000000000000005 T __fbufsize
 0000000000000005 T __freadptrinc
@@ -328,7 +328,6 @@
 000000000000000b T recv
 000000000000000b T send
 000000000000000b T setprotoent
-000000000000000b T tfind
 000000000000000b T __xstat
 000000000000000b W __xstat64
 000000000000000c T __acquire_ptc
@@ -417,6 +416,7 @@
 000000000000000e T tcflow
 000000000000000e T tcflush
 000000000000000e T tcsendbreak
+000000000000000e T tfind
 000000000000000f T dcgettext
 000000000000000f T execv
 000000000000000f T execvp
@@ -986,8 +986,6 @@
 0000000000000032 T pthread_rwlock_init
 0000000000000032 T putchar_unlocked
 0000000000000032 T sem_init
-0000000000000032 T __stdio_exit
-0000000000000032 W __stdio_exit_needed
 0000000000000032 T wcschr
 0000000000000033 T pthread_rwlock_tryrdlock
 0000000000000033 T pthread_setcanceltype
@@ -1011,6 +1009,8 @@
 0000000000000035 T pthread_sigmask
 0000000000000035 T pwritev
 0000000000000035 W pwritev64
+0000000000000035 T __stdio_exit
+0000000000000035 W __stdio_exit_needed
 0000000000000035 W thrd_detach
 0000000000000035 T __tre_mem_destroy
 0000000000000035 T tsearch

As you see, not a single label was eliminated.


The visible small differences in size for a few functions are caused
by the need to always use "near" jumps (not "short" ones)
for tail call optimizations now, since they now jump across sections:

Before:
00000000000274a6 <dlclose>:
   274a6:       eb c8                   jmp    27470 <__reset_tls+0x1f0>
After:
000000000002795f <dlclose>:
   2795f:       e9 c5 ff ff ff          jmpq   27929 <__reset_tls+0x1f0>



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-27 19:09     ` Denys Vlasenko
@ 2015-10-27 21:01       ` Rich Felker
  2015-10-28  9:53         ` Denys Vlasenko
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2015-10-27 21:01 UTC (permalink / raw)
  To: musl

On Tue, Oct 27, 2015 at 08:09:28PM +0100, Denys Vlasenko wrote:
> On Tue, Oct 27, 2015 at 2:21 AM, Rich Felker <dalias@libc.org> wrote:
> > On Fri, Oct 23, 2015 at 03:12:02PM +0200, Szabolcs Nagy wrote:
> >> * Denys Vlasenko <vda.linux@googlemail.com> [2015-10-23 14:30:26 +0200]:
> >> > libc.so size reduction:
> >> >
> >> >    text        data     bss     dec     hex filename
> >> >  564099        1944   11768  577811   8d113 libc.so.before
> >> >  562277        1924   11576  575777   8c921 libc.so
> >> >
> >>
> >> i assume this is x86_64, nice improvement.
> >
> > I suspect all of this difference comes from optimizing out dummy weak
> > functions that are replaced by strong versions in other files.
> 
> No, it does not look that way. These options don't give linker
> any extra freedom to eliminate anything.
> 
> "nm --size-sort -D libc.so" shows no differences after
> 
>    --sort-section=alignment --sort-common
> 
> were added to ld command line.

Oh, sorry, I misread which change that difference was coming from. In
that case it's a pleasant surprise.

> After -ffunction-sections -fdata-sections are added to gcc command line,
> "nm" output does change, the entire difference is as follows:
> 
> --- libc.so.nm.OLD       2015-10-27 19:57:52.971964518 +0100
> +++ libc.so.nm  2015-10-27 19:58:28.544115009 +0100
> @@ -18,7 +18,6 @@
>  0000000000000001 T setutxent
>  0000000000000001 W updwtmp
>  0000000000000001 T updwtmpx
> -0000000000000002 T dlclose
>  0000000000000002 T __stack_chk_fail
>  0000000000000003 T catclose
>  0000000000000003 T dirfd
> @@ -84,6 +83,7 @@
>  0000000000000005 T catopen
>  0000000000000005 T creall
>  0000000000000005 T dladdr
> +0000000000000005 T dlclose
>  0000000000000005 T dlinfo
>  0000000000000005 T __fbufsize
>  0000000000000005 T __freadptrinc
> @@ -328,7 +328,6 @@
>  000000000000000b T recv
>  000000000000000b T send
>  000000000000000b T setprotoent
> -000000000000000b T tfind
>  000000000000000b T __xstat
>  000000000000000b W __xstat64
>  000000000000000c T __acquire_ptc
> @@ -417,6 +416,7 @@
>  000000000000000e T tcflow
>  000000000000000e T tcflush
>  000000000000000e T tcsendbreak
> +000000000000000e T tfind
>  000000000000000f T dcgettext
>  000000000000000f T execv
>  000000000000000f T execvp
> @@ -986,8 +986,6 @@
>  0000000000000032 T pthread_rwlock_init
>  0000000000000032 T putchar_unlocked
>  0000000000000032 T sem_init
> -0000000000000032 T __stdio_exit
> -0000000000000032 W __stdio_exit_needed
>  0000000000000032 T wcschr
>  0000000000000033 T pthread_rwlock_tryrdlock
>  0000000000000033 T pthread_setcanceltype
> @@ -1011,6 +1009,8 @@
>  0000000000000035 T pthread_sigmask
>  0000000000000035 T pwritev
>  0000000000000035 W pwritev64
> +0000000000000035 T __stdio_exit
> +0000000000000035 W __stdio_exit_needed
>  0000000000000035 W thrd_detach
>  0000000000000035 T __tre_mem_destroy
>  0000000000000035 T tsearch
> 
> As you see, not a single label was eliminated.
> 
> 
> The visible small differences in size for a few functions are caused
> by the need to always use "near" jumps (not "short" ones)
> for tail call optimizations now, since they now jump across sections:
> 
> Before:
> 00000000000274a6 <dlclose>:
>    274a6:       eb c8                   jmp    27470 <__reset_tls+0x1f0>
> After:
> 000000000002795f <dlclose>:
>    2795f:       e9 c5 ff ff ff          jmpq   27929 <__reset_tls+0x1f0>

I see. That's probably not a big deal.

Did you see any symbols disappear when adding --gc-sections?

Rich



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-27 21:01       ` Rich Felker
@ 2015-10-28  9:53         ` Denys Vlasenko
  2015-10-28 10:05           ` Denys Vlasenko
  0 siblings, 1 reply; 16+ messages in thread
From: Denys Vlasenko @ 2015-10-28  9:53 UTC (permalink / raw)
  To: musl, Rich Felker

On Tue, Oct 27, 2015 at 10:01 PM, Rich Felker <dalias@libc.org> wrote:
>> After -ffunction-sections -fdata-sections are added to gcc command line,
>> "nm" output does change, the entire difference is as follows:
>>
>> --- libc.so.nm.OLD       2015-10-27 19:57:52.971964518 +0100
>> +++ libc.so.nm  2015-10-27 19:58:28.544115009 +0100
>> @@ -18,7 +18,6 @@
>>  0000000000000001 T setutxent
>>  0000000000000001 W updwtmp
>>  0000000000000001 T updwtmpx
>> -0000000000000002 T dlclose
>>  0000000000000002 T __stack_chk_fail
>>  0000000000000003 T catclose
>>  0000000000000003 T dirfd
>> @@ -84,6 +83,7 @@
>>  0000000000000005 T catopen
>>  0000000000000005 T creall
>>  0000000000000005 T dladdr
>> +0000000000000005 T dlclose
>>  0000000000000005 T dlinfo
>>  0000000000000005 T __fbufsize
>>  0000000000000005 T __freadptrinc
>> @@ -328,7 +328,6 @@
>>  000000000000000b T recv
>>  000000000000000b T send
>>  000000000000000b T setprotoent
>> -000000000000000b T tfind
>>  000000000000000b T __xstat
>>  000000000000000b W __xstat64
>>  000000000000000c T __acquire_ptc
>> @@ -417,6 +416,7 @@
>>  000000000000000e T tcflow
>>  000000000000000e T tcflush
>>  000000000000000e T tcsendbreak
>> +000000000000000e T tfind
>>  000000000000000f T dcgettext
>>  000000000000000f T execv
>>  000000000000000f T execvp
>> @@ -986,8 +986,6 @@
>>  0000000000000032 T pthread_rwlock_init
>>  0000000000000032 T putchar_unlocked
>>  0000000000000032 T sem_init
>> -0000000000000032 T __stdio_exit
>> -0000000000000032 W __stdio_exit_needed
>>  0000000000000032 T wcschr
>>  0000000000000033 T pthread_rwlock_tryrdlock
>>  0000000000000033 T pthread_setcanceltype
>> @@ -1011,6 +1009,8 @@
>>  0000000000000035 T pthread_sigmask
>>  0000000000000035 T pwritev
>>  0000000000000035 W pwritev64
>> +0000000000000035 T __stdio_exit
>> +0000000000000035 W __stdio_exit_needed
>>  0000000000000035 W thrd_detach
>>  0000000000000035 T __tre_mem_destroy
>>  0000000000000035 T tsearch
>>
>> As you see, not a single label was eliminated.
>>
>>
>> The visible small differences in size for a few functions are caused
>> by the need to always use "near" jumps (not "short" ones)
>> for tail call optimizations now, since they now jump across sections:
>>
>> Before:
>> 00000000000274a6 <dlclose>:
>>    274a6:       eb c8                   jmp    27470 <__reset_tls+0x1f0>
>> After:
>> 000000000002795f <dlclose>:
>>    2795f:       e9 c5 ff ff ff          jmpq   27929 <__reset_tls+0x1f0>
>
> I see. That's probably not a big deal.
>
> Did you see any symbols disappear when adding --gc-sections?

Yes, I do.

$ nm --size-sort busybox_unstripped >busybox_unstripped.nm
$ nm --size-sort busybox_unstripped--gc-sections >busybox_unstripped--gc-sections.nm
$ diff -u busybox_unstripped.nm busybox_unstripped--gc-sections.nm | grep '^[^ @]'

--- busybox_unstripped.nm    2015-10-28 10:48:16.362304813 +0100
+++ busybox_unstripped--gc-sections.nm    2015-10-28 10:48:26.056294599 +0100
-0000000000000001 t reinit_unicode_for_ash
-0000000000000001 t reinit_unicode_for_hush
-0000000000000007 T xmalloc_sockaddr2host
-0000000000000008 b cur.1926
-0000000000000008 b dummy
-0000000000000008 b dummy_file
-0000000000000008 b end.1927
-0000000000000008 b lock.1928
-0000000000000008 T xstrtoi_range
-0000000000000008 T xstrtoll_range
-000000000000000a T bb_internal_getpwnam_r
-000000000000000a T ipneigh_main
-000000000000000c T xsocket_stream
-000000000000000e T xgid2group
-0000000000000010 T selinux_or_die
-0000000000000011 T xatoi_range_sfx
-0000000000000011 T xatou_range_sfx
-0000000000000012 T xstrtoi
-0000000000000013 T xatoll_range_sfx
-0000000000000015 T replace
-0000000000000017 T xatoi_sfx
-0000000000000017 T xspawn
-0000000000000018 T replace_underscores
-000000000000001a T bb_iswspace
-000000000000001b T bb_internal_setpwent
-000000000000001b T xgetgrgid
-000000000000001c T llist_rev
-000000000000001c T xstrtoll
-000000000000001d T xread_char
-000000000000001e T monotonic_ns
-0000000000000021 T xatoll_sfx
-0000000000000021 T xmalloc_fgetline_str
-0000000000000022 T bb_iswpunct
-0000000000000023 T bb_iswalnum
-000000000000002f T bb_internal_endpwent
-000000000000002f T isrv_want_wr
-0000000000000033 T bb_delete_module
-000000000000003c T index_in_str_array
-000000000000003e T rewind
-0000000000000043 T is_suffixed_with
-000000000000004b T moderror
-0000000000000054 T executable_exists
-000000000000005a T rta_addattr_l
-0000000000000062 T string_to_llist
-0000000000000088 T bb_init_module
-00000000000000ae T bb_herror_msg
-00000000000000c3 T parse_cmdline_module_options
-000000000000010d T __simple_malloc



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-28  9:53         ` Denys Vlasenko
@ 2015-10-28 10:05           ` Denys Vlasenko
  0 siblings, 0 replies; 16+ messages in thread
From: Denys Vlasenko @ 2015-10-28 10:05 UTC (permalink / raw)
  To: musl, Rich Felker

On Wed, Oct 28, 2015 at 10:53 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
>> Did you see any symbols disappear when adding --gc-sections?
>
> Yes, I do.
>
> $ nm --size-sort busybox_unstripped >busybox_unstripped.nm
> $ nm --size-sort busybox_unstripped--gc-sections >busybox_unstripped--gc-sections.nm
> $ diff -u busybox_unstripped.nm busybox_unstripped--gc-sections.nm | grep '^[^ @]'
>
> --- busybox_unstripped.nm    2015-10-28 10:48:16.362304813 +0100
> +++ busybox_unstripped--gc-sections.nm    2015-10-28 10:48:26.056294599 +0100
> -0000000000000001 t reinit_unicode_for_ash
> -0000000000000001 t reinit_unicode_for_hush
> -0000000000000007 T xmalloc_sockaddr2host
> -0000000000000008 b cur.1926
> -0000000000000008 b dummy
> -0000000000000008 b dummy_file
> -0000000000000008 b end.1927
> -0000000000000008 b lock.1928
> -0000000000000008 T xstrtoi_range
> -0000000000000008 T xstrtoll_range
> -000000000000000a T bb_internal_getpwnam_r
> -000000000000000a T ipneigh_main
> -000000000000000c T xsocket_stream
> -000000000000000e T xgid2group
> -0000000000000010 T selinux_or_die
> -0000000000000011 T xatoi_range_sfx
> -0000000000000011 T xatou_range_sfx
> -0000000000000012 T xstrtoi
> -0000000000000013 T xatoll_range_sfx
> -0000000000000015 T replace
> -0000000000000017 T xatoi_sfx
> -0000000000000017 T xspawn
> -0000000000000018 T replace_underscores
> -000000000000001a T bb_iswspace
> -000000000000001b T bb_internal_setpwent
> -000000000000001b T xgetgrgid
> -000000000000001c T llist_rev
> -000000000000001c T xstrtoll
> -000000000000001d T xread_char
> -000000000000001e T monotonic_ns
> -0000000000000021 T xatoll_sfx
> -0000000000000021 T xmalloc_fgetline_str
> -0000000000000022 T bb_iswpunct
> -0000000000000023 T bb_iswalnum
> -000000000000002f T bb_internal_endpwent
> -000000000000002f T isrv_want_wr
> -0000000000000033 T bb_delete_module
> -000000000000003c T index_in_str_array
> -000000000000003e T rewind
> -0000000000000043 T is_suffixed_with
> -000000000000004b T moderror
> -0000000000000054 T executable_exists
> -000000000000005a T rta_addattr_l
> -0000000000000062 T string_to_llist
> -0000000000000088 T bb_init_module
> -00000000000000ae T bb_herror_msg
> -00000000000000c3 T parse_cmdline_module_options
> -000000000000010d T __simple_malloc

This was with Rob's precompiled system-image-x86_64.

Now with musl built with -ffunction-sections -fdata-sections:

--- busybox_unstripped.nm    2015-10-28 11:02:13.047555187 +0100
+++ busybox_unstripped--gc-sections.nm    2015-10-28 11:02:04.290531243 +0100
-0000000000000001 T __cxa_finalize
-0000000000000001 t dummy
-0000000000000001 t dummy
-0000000000000001 t dummy
-0000000000000001 t dummy
-0000000000000001 t reinit_unicode_for_ash
-0000000000000001 t reinit_unicode_for_hush
-0000000000000003 t dummy
-0000000000000004 T ether_line
-0000000000000004 T ether_ntohost
-0000000000000005 T fseek
-0000000000000005 T ftell
-0000000000000005 T __isalnum_l
-0000000000000005 W isalnum_l
-0000000000000005 T __iswalnum_l
-0000000000000005 W iswalnum_l
-0000000000000005 T __iswalpha_l
-0000000000000005 W iswalpha_l
-0000000000000005 T __iswblank_l
-0000000000000005 W iswblank_l
-0000000000000005 T __iswcntrl_l
-0000000000000005 W iswcntrl_l
-0000000000000005 T __iswctype_l
-0000000000000005 W iswctype_l
-0000000000000005 T __iswgraph_l
-0000000000000005 W iswgraph_l
-0000000000000005 T __iswlower_l
-0000000000000005 W iswlower_l
-0000000000000005 T __iswprint_l
-0000000000000005 W iswprint_l
-0000000000000005 T __iswpunct_l
-0000000000000005 W iswpunct_l
-0000000000000005 T __iswspace_l
-0000000000000005 W iswspace_l
-0000000000000005 T __iswupper_l
-0000000000000005 W iswupper_l
-0000000000000005 T __iswxdigit_l
-0000000000000005 W iswxdigit_l
-0000000000000005 T strtoimax
-0000000000000005 W __strtoimax_internal
-0000000000000005 T strtoumax
-0000000000000005 W __strtoumax_internal
-0000000000000005 T __toread_needs_stdio_exit
-0000000000000005 T __towlower_l
-0000000000000005 W towlower_l
-0000000000000005 T __towrite_needs_stdio_exit
-0000000000000005 T __towupper_l
-0000000000000005 W towupper_l
-0000000000000005 T __wctype_l
-0000000000000005 W wctype_l
-0000000000000006 b a.2175
-0000000000000007 T xmalloc_sockaddr2host
-0000000000000008 b cur.1926
-0000000000000008 b dummy
-0000000000000008 b dummy_file
-0000000000000008 b end.1927
-0000000000000008 b lock.1928
-0000000000000008 D __stderr_used
-0000000000000008 T xstrtoi_range
-0000000000000008 T xstrtoll_range
-000000000000000a T bb_internal_getpwnam_r
-000000000000000a T ether_aton
-000000000000000a T ipneigh_main
-000000000000000a T strtold
-000000000000000a W __strtold_l
-000000000000000a W strtold_l
-000000000000000c T xsocket_stream
-000000000000000e T posix_openpt
-000000000000000e T xgid2group
-0000000000000010 T selinux_or_die
-0000000000000011 T __tolower_l
-0000000000000011 W tolower_l
-0000000000000011 T xatoi_range_sfx
-0000000000000011 T xatou_range_sfx
-0000000000000012 T __isblank_l
-0000000000000012 W isblank_l
-0000000000000012 T xstrtoi
-0000000000000013 T xatoll_range_sfx
-0000000000000015 T replace
-0000000000000017 T umount
-0000000000000017 T xatoi_sfx
-0000000000000017 T xspawn
-0000000000000018 T replace_underscores
-000000000000001a T bb_iswspace
-000000000000001b T bb_internal_setpwent
-000000000000001b T xgetgrgid
-000000000000001c T llist_rev
-000000000000001c T xstrtoll
-000000000000001d T xread_char
-000000000000001e T monotonic_ns
-0000000000000021 T xatoll_sfx
-0000000000000021 T xmalloc_fgetline_str
-0000000000000022 T bb_iswpunct
-0000000000000022 T endusershell
-0000000000000022 T strtof
-0000000000000022 W __strtof_l
-0000000000000022 W strtof_l
-0000000000000023 T bb_iswalnum
-000000000000002f T bb_internal_endpwent
-000000000000002f T __do_orphaned_stdio_locks
-000000000000002f T isrv_want_wr
-0000000000000033 T bb_delete_module
-0000000000000034 T fstatvfs
-0000000000000034 W fstatvfs64
-0000000000000034 T statvfs
-0000000000000034 W statvfs64
-0000000000000035 T setlogmask
-000000000000003c T index_in_str_array
-000000000000003e T rewind
-0000000000000043 T is_suffixed_with
-000000000000004b T moderror
-000000000000004f W fstatfs
-000000000000004f T __fstatfs
-000000000000004f W fstatfs64
-0000000000000054 T addmntent
-0000000000000054 T executable_exists
-000000000000005a T rta_addattr_l
-000000000000005b T __timedwait
-0000000000000062 T string_to_llist
-0000000000000070 T __strcasecmp_l
-0000000000000070 W strcasecmp_l
-0000000000000088 T bb_init_module
-000000000000008b T __strncasecmp_l
-000000000000008b W strncasecmp_l
-000000000000009c t fixup
-00000000000000ae T bb_herror_msg
-00000000000000c3 T parse_cmdline_module_options
-000000000000010c T __simple_malloc

Because each function and object now lives in its own section,
ld can drop more of them.
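
A minimal sketch of the comparison behind both listings, wrapped in a helper so it can be rerun on other binaries. The function name `removed_symbols` and the sample listings are assumptions made for illustration: only the three removed lines are taken from the output above, while `main` and its size are placeholders. The filter is slightly stricter than the original `grep '^[^ @]'`, so that only removed lines are printed.

```sh
#!/bin/sh
# removed_symbols BEFORE.nm AFTER.nm
# Print the "size type name" lines that appear only in BEFORE.nm, i.e. the
# symbols that disappeared between two `nm --size-sort` listings.
removed_symbols() {
    # Keep lines removed by the diff ("-...") but not the "--- file" header.
    diff -u "$1" "$2" | grep '^-[^-]' | sed 's/^-//'
}

# Demo on hand-written listings; real input would come from something like
#   nm --size-sort busybox_unstripped > before.nm
tmp=$(mktemp -d)
cat > "$tmp/before.nm" <<'EOF'
0000000000000001 t reinit_unicode_for_ash
0000000000000005 T fseek
000000000000003e T rewind
00000000000000f0 T main
EOF
cat > "$tmp/after.nm" <<'EOF'
00000000000000f0 T main
EOF
removed_symbols "$tmp/before.nm" "$tmp/after.nm"
rm -rf "$tmp"
```

On the sample input this prints the three lines for `reinit_unicode_for_ash`, `fseek` and `rewind`, and nothing for `main`, which is present in both listings.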



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-10-23 12:30 [PATCH] configure: add gcc flags for better link-time optimization Denys Vlasenko
  2015-10-23 13:12 ` Szabolcs Nagy
@ 2015-11-01 19:56 ` Rich Felker
  2015-11-02 22:36   ` Rich Felker
  1 sibling, 1 reply; 16+ messages in thread
From: Rich Felker @ 2015-11-01 19:56 UTC (permalink / raw)
  To: musl

On Fri, Oct 23, 2015 at 02:30:26PM +0200, Denys Vlasenko wrote:
> +#
> +# Put every function and data object into its own section:
> +# .text.funcname, .data.var, .rodata.const_struct, .bss.zerovar
> +#
> +# Previous optimization isn't working too well by itself
> +# because data objects aren't living in separate sections,
> +# they are all grouped in one .data and one .bss section per *.o file.
> +# With -ffunction/data-sections, section sorting eliminates more padding.
> +#
> +# Object files in static *.a files will also have their functions
> +# and data objects each in its own section.
> +#
> +# This enables programs statically linked with -Wl,--gc-sections
> +# to perform "section garbage collection": drop unused code and data
> +# not on per-*.o-file basis, but on per-function and per-object basis.
> +# This is a big thing: --gc-sections sometimes eliminates several percent
> +# of unreachable code and data in final executable.
> +#
> +tryflag CFLAGS_AUTO -ffunction-sections
> +tryflag CFLAGS_AUTO -fdata-sections
> +
> +#

This is not just an optimization; it is going to save us from a horrible
class of compiler/assembler bugs that threatened to force dropping
support for all non-bleeding-edge toolchains:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68178
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66609
https://sourceware.org/bugzilla/show_bug.cgi?id=18561

By putting functions/objects in their own sections, the illegal but
widespread assembler 'optimization' of resolving differences between
symbols to a constant when one or both of the symbols has a weak
definition is suppressed, simply because differences of this form are
never constants when they cross sections.
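
To see why symbol differences start crossing sections, it can help to look at where the compiler places each function. The sketch below assumes a GCC-compatible `cc` targeting ELF and binutils `readelf` on PATH. The helper name `text_sections` and the functions `foo` and `bar` are made up for illustration, and the script only shows section placement, not the assembler bug itself.

```sh
#!/bin/sh
# Assumptions: a GCC-compatible `cc` targeting ELF and binutils `readelf`
# are on PATH; file and function names are hypothetical.

# text_sections FLAGS...
# Compile a two-function file with the given flags and print the names of
# the .text* sections found in the resulting object file.
text_sections() {
    dir=$(mktemp -d) || return 1
    printf '%s\n' 'int foo(void) { return 1; }' \
                  'int bar(void) { return 2; }' > "$dir/t.c"
    cc "$@" -c -o "$dir/t.o" "$dir/t.c" &&
        readelf -S -W "$dir/t.o" | grep -o '\.text[A-Za-z0-9_.]*' | sort -u
    status=$?
    rm -rf "$dir"
    return $status
}

# Only run the demo when the assumed tools are actually available.
if command -v cc >/dev/null 2>&1 && command -v readelf >/dev/null 2>&1; then
    echo "default:"
    text_sections -O2
    echo "with -ffunction-sections:"
    text_sections -O2 -ffunction-sections
fi
```

With such a toolchain, the second call should list `.text.foo` and `.text.bar` in addition to `.text`, so a difference like `foo - bar` is no longer a difference between two symbols in the same section.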

As such I want to go ahead and apply this regardless of optimization
issues, but I think we should update the comments and commit message
to reflect that this is also working around serious toolchain issues.
I hope to get to it soon now; working on some other things at the
moment.

BTW thanks a lot for raising the idea of using these options. If it
hadn't been for your pending patch I probably would never have thought
of this as a solution to the toolchain problems above.

Rich



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-11-01 19:56 ` Rich Felker
@ 2015-11-02 22:36   ` Rich Felker
  2015-11-03  0:01     ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2015-11-02 22:36 UTC (permalink / raw)
  To: musl

On Sun, Nov 01, 2015 at 02:56:58PM -0500, Rich Felker wrote:
> On Fri, Oct 23, 2015 at 02:30:26PM +0200, Denys Vlasenko wrote:
> > +#
> > +# Put every function and data object into its own section:
> > +# .text.funcname, .data.var, .rodata.const_struct, .bss.zerovar
> > +#
> > +# Previous optimization isn't working too well by itself
> > +# because data objects aren't living in separate sections,
> > +# they are all grouped in one .data and one .bss section per *.o file.
> > +# With -ffunction/data-sections, section sorting eliminates more padding.
> > +#
> > +# Object files in static *.a files will also have their functions
> > +# and data objects each in its own section.
> > +#
> > +# This enables programs statically linked with -Wl,--gc-sections
> > +# to perform "section garbage collection": drop unused code and data
> > +# not on per-*.o-file basis, but on per-function and per-object basis.
> > +# This is a big thing: --gc-sections sometimes eliminates several percent
> > +# of unreachable code and data in final executable.
> > +#
> > +tryflag CFLAGS_AUTO -ffunction-sections
> > +tryflag CFLAGS_AUTO -fdata-sections
> > +
> > +#
> 
> This is not just an optimization but going to save us from a horrible
> class of compiler/assembler bugs that threatened to force dropping
> support for all non-bleeding-edge toolchains:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68178
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66609
> https://sourceware.org/bugzilla/show_bug.cgi?id=18561
> 
> By putting functions/objects in their own sections, the illegal but
> widespread assembler 'optimization' of resolving differences between
> symbols to a constant when one or both of the symbols has a weak
> definition is suppressed, simply because differences of this form are
> never constants when they cross sections.
> 
> As such I want to go ahead and apply this regardless of optimization
> issues, but I think we should update the comments and commit message
> to reflect that this is also working around serious toolchain issues.
> I hope to get to it soon now; working on some other things at the
> moment.
> 
> BTW thanks a lot for raising the idea of using these options. If it
> hadn't been for your pending patch I probably would never have thought
> of this as a solution to the toolchain problems above.

Unfortunately there's an issue blocking this patch: some archs'
crt_arch.h asm fragments have code that assumes a "short" branch can
reach _start_c/_dlstart_c. With -ffunction-sections that's not the
case; the entry point and C start code can be moved arbitrarily far
apart by the linker. To fix this we either need to use a fully-general
branch to reach the C code, or have file-specific suppression of
-ffunction-sections for crt1, dlstart, etc. I'd rather just fix the
asm not to make assumptions about shortness -- some of these
assumptions are dangerously close to being wrong at -O0 anyway -- but
to do that I need to audit all the crt_arch.h files, find the affected
ones, and fix them. I'll start taking a look and see how bad it looks.
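
The per-file suppression option could look roughly like the sketch below in a shell-driven build. This is not what musl does (its build is driven by a Makefile); the function name `cflags_for` and the path patterns are hypothetical and only illustrate filtering one flag for the startup files.

```sh
#!/bin/sh
# cflags_for SOURCE FLAGS...
# Print FLAGS, dropping -ffunction-sections when SOURCE is one of the
# startup files whose asm assumes the C entry code is close by.
# The path patterns below are assumptions for this sketch.
cflags_for() {
    src=$1
    shift
    out=
    for flag in "$@"; do
        case "${src}:${flag}" in
            crt/*:-ffunction-sections|*/dlstart.c:-ffunction-sections)
                continue ;;
        esac
        # Append with a single space separator, no leading space.
        out="${out:+$out }$flag"
    done
    printf '%s\n' "$out"
}

cflags_for crt/crt1.c -Os -ffunction-sections -fdata-sections
cflags_for src/string/strlen.c -Os -ffunction-sections -fdata-sections
```

The first call prints `-Os -fdata-sections`, while the second leaves all three flags in place. The drawback of this route is that the list of special files has to be maintained by hand, which is one reason to prefer fixing the asm instead.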

Rich



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-11-02 22:36   ` Rich Felker
@ 2015-11-03  0:01     ` Rich Felker
  2015-11-05  2:58       ` Rich Felker
  0 siblings, 1 reply; 16+ messages in thread
From: Rich Felker @ 2015-11-03  0:01 UTC (permalink / raw)
  To: musl

On Mon, Nov 02, 2015 at 05:36:49PM -0500, Rich Felker wrote:
> On Sun, Nov 01, 2015 at 02:56:58PM -0500, Rich Felker wrote:
> > On Fri, Oct 23, 2015 at 02:30:26PM +0200, Denys Vlasenko wrote:
> > > +#
> > > +# Put every function and data object into its own section:
> > > +# .text.funcname, .data.var, .rodata.const_struct, .bss.zerovar
> > > +#
> > > +# Previous optimization isn't working too well by itself
> > > +# because data objects aren't living in separate sections,
> > > +# they are all grouped in one .data and one .bss section per *.o file.
> > > +# With -ffunction/data-sections, section sorting eliminates more padding.
> > > +#
> > > +# Object files in static *.a files will also have their functions
> > > +# and data objects each in its own section.
> > > +#
> > > +# This enables programs statically linked with -Wl,--gc-sections
> > > +# to perform "section garbage collection": drop unused code and data
> > > +# not on per-*.o-file basis, but on per-function and per-object basis.
> > > +# This is a big thing: --gc-sections sometimes eliminates several percent
> > > +# of unreachable code and data in final executable.
> > > +#
> > > +tryflag CFLAGS_AUTO -ffunction-sections
> > > +tryflag CFLAGS_AUTO -fdata-sections
> > > +
> > > +#
> > 
> > This is not just an optimization but going to save us from a horrible
> > class of compiler/assembler bugs that threatened to force dropping
> > support for all non-bleeding-edge toolchains:
> > 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68178
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66609
> > https://sourceware.org/bugzilla/show_bug.cgi?id=18561
> > 
> > By putting functions/objects in their own sections, the illegal but
> > widespread assembler 'optimization' of resolving differences between
> > symbols to a constant when one or both of the symbols has a weak
> > definition is suppressed, simply because differences of this form are
> > never constants when they cross sections.
> > 
> > As such I want to go ahead and apply this regardless of optimization
> > issues, but I think we should update the comments and commit message
> > to reflect that this is also working around serious toolchain issues.
> > I hope to get to it soon now; working on some other things at the
> > moment.
> > 
> > BTW thanks a lot for raising the idea of using these options. If it
> > hadn't been for your pending patch I probably would never have thought
> > of this as a solution to the toolchain problems above.
> 
> Unfortunately there's an issue blocking this patch: some archs'
> crt_arch.h asm fragments have code that assumes a "short" branch can
> reach _start_c/_dlstart_c. With -ffunction-sections that's not the
> case; the entry point and C start code can be moved arbitrarily far
> apart by the linker. To fix this we either need to use a fully-general
> branch to reach the C code, or have file-specific suppression of
> -ffunction-sections for crt1, dlstart, etc. I'd rather just fix the
> asm not to make assumptions about shortness -- some of these
> assumptions are dangerously close to being wrong at -O0 anyway -- but
> to do that I need to audit all the crt_arch.h files, find the affected
> ones, and fix them. I'll start taking a look and see how bad it looks.

The only affected arch was actually sh, so I just fixed it. So I think
it's safe to use these options now.

Rich



* Re: [PATCH] configure: add gcc flags for better link-time optimization
  2015-11-03  0:01     ` Rich Felker
@ 2015-11-05  2:58       ` Rich Felker
  0 siblings, 0 replies; 16+ messages in thread
From: Rich Felker @ 2015-11-05  2:58 UTC (permalink / raw)
  To: musl

On Mon, Nov 02, 2015 at 07:01:07PM -0500, Rich Felker wrote:
> On Mon, Nov 02, 2015 at 05:36:49PM -0500, Rich Felker wrote:
> > On Sun, Nov 01, 2015 at 02:56:58PM -0500, Rich Felker wrote:
> > > On Fri, Oct 23, 2015 at 02:30:26PM +0200, Denys Vlasenko wrote:
> > > > +#
> > > > +# Put every function and data object into its own section:
> > > > +# .text.funcname, .data.var, .rodata.const_struct, .bss.zerovar
> > > > +#
> > > > +# Previous optimization isn't working too well by itself
> > > > +# because data objects aren't living in separate sections,
> > > > +# they are all grouped in one .data and one .bss section per *.o file.
> > > > +# With -ffunction/data-sections, section sorting eliminates more padding.
> > > > +#
> > > > +# Object files in static *.a files will also have their functions
> > > > +# and data objects each in its own section.
> > > > +#
> > > > +# This enables programs statically linked with -Wl,--gc-sections
> > > > +# to perform "section garbage collection": drop unused code and data
> > > > +# not on per-*.o-file basis, but on per-function and per-object basis.
> > > > +# This is a big thing: --gc-sections sometimes eliminates several percent
> > > > +# of unreachable code and data in final executable.
> > > > +#
> > > > +tryflag CFLAGS_AUTO -ffunction-sections
> > > > +tryflag CFLAGS_AUTO -fdata-sections
> > > > +
> > > > +#
> > > 
> > > This is not just an optimization but going to save us from a horrible
> > > class of compiler/assembler bugs that threatened to force dropping
> > > support for all non-bleeding-edge toolchains:
> > > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68178
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66609
> > > https://sourceware.org/bugzilla/show_bug.cgi?id=18561
> > > 
> > > By putting functions/objects in their own sections, the illegal but
> > > widespread assembler 'optimization' of resolving differences between
> > > symbols to a constant when one or both of the symbols has a weak
> > > definition is suppressed, simply because differences of this form are
> > > never constants when they cross sections.
> > > 
> > > As such I want to go ahead and apply this regardless of optimization
> > > issues, but I think we should update the comments and commit message
> > > to reflect that this is also working around serious toolchain issues.
> > > I hope to get to it soon now; working on some other things at the
> > > moment.
> > > 
> > > BTW thanks a lot for raising the idea of using these options. If it
> > > hadn't been for your pending patch I probably would never have thought
> > > of this as a solution to the toolchain problems above.
> > 
> > Unfortunately there's an issue blocking this patch: some archs'
> > crt_arch.h asm fragments have code that assumes a "short" branch can
> > reach _start_c/_dlstart_c. With -ffunction-sections that's not the
> > case; the entry point and C start code can be moved arbitrarily far
> > apart by the linker. To fix this we either need to use a fully-general
> > branch to reach the C code, or have file-specific suppression of
> > -ffunction-sections for crt1, dlstart, etc. I'd rather just fix the
> > asm not to make assumptions about shortness -- some of these
> > assumptions are dangerously close to being wrong at -O0 anyway -- but
> > to do that I need to audit all the crt_arch.h files, find the affected
> > ones, and fix them. I'll start taking a look and see how bad it looks.
> 
> The only affected arch was actually sh, so I just fixed it. So I think
> it's safe to use these options now.

All these are now committed, along with --gc-sections. I also fixed an
oversight that limited the benefits of --gc-sections, saving another
~300 bytes.

Thanks again!

Rich



Thread overview: 16+ messages
2015-10-23 12:30 [PATCH] configure: add gcc flags for better link-time optimization Denys Vlasenko
2015-10-23 13:12 ` Szabolcs Nagy
2015-10-23 14:41   ` Denys Vlasenko
2015-10-23 14:48     ` Rich Felker
2015-10-23 22:00       ` Denys Vlasenko
2015-10-24 12:43         ` Szabolcs Nagy
2015-10-24 19:20           ` Rich Felker
2015-10-27  1:21   ` Rich Felker
2015-10-27 19:09     ` Denys Vlasenko
2015-10-27 21:01       ` Rich Felker
2015-10-28  9:53         ` Denys Vlasenko
2015-10-28 10:05           ` Denys Vlasenko
2015-11-01 19:56 ` Rich Felker
2015-11-02 22:36   ` Rich Felker
2015-11-03  0:01     ` Rich Felker
2015-11-05  2:58       ` Rich Felker
