On Tue, 30 Jun 2015, Rich Felker wrote:

> Discussion on #musl with Timo Teräs has produced the following
> results:
> 
> - Moving bloom filter size to struct dso gives 5% improvement in clang
>   (built as 110 .so's) start time, simply because of a reduction of
>   number of instructions in the hot path. So I think we should apply
>   that patch.

I think most of the improvement here actually comes from fewer cache misses
rather than from fewer instructions. If so, we should take this idea further
and rearrange struct dso a little so that the fields accessed in the hot
find_sym loop are packed together, if possible.
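
As a rough illustration of the packing idea, here is a minimal sketch. The
struct below is hypothetical and is not musl's actual struct dso; the field
names, the split into "hot" and "cold" groups, and the 64-byte cache line are
all assumptions made for the example. It also assumes the struct itself starts
on a cache line boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only, not musl's real struct dso. */
struct dso_sketch {
	/* Hot: touched for every DSO visited by the lookup loop. */
	struct dso_sketch *next;
	uint32_t *ghashtab;
	uint32_t *hashtab;
	void *syms;
	char *strings;
	uint32_t bloom_words;	/* cached bloom filter size (assumed name) */
	unsigned char global;
	/* Cold: only needed at load time or for diagnostics. */
	char *name;
	unsigned char *base;
	size_t map_len;
};

/* Assumed cache line size; real hardware may differ. */
#define ASSUMED_LINE 64

/* Returns 1 if every hot field ends within the first assumed cache line. */
int hot_fields_share_line(void)
{
	size_t hot_end = offsetof(struct dso_sketch, global)
	               + sizeof(unsigned char);
	return hot_end <= ASSUMED_LINE;
}
```

A check like this could be turned into a compile-time assertion so that later
additions to the struct do not silently push hot fields onto a second line.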

> - The whole outer for loop in find_sym is the hot path for
>   performance. As such, eliminating the lazy calculation of gnu_hash
>   and simply doing it before the loop should be a measurable win, just
>   by removing the if (!ghm) branch.

On a related note, it's possible to avoid calculating the sysv hash when
gnu-hash is enabled system-wide, by not setting the 'global' flag on the vdso
item (as mentioned on IRC in your conversation with Timo).
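
To make the hoisting from the quoted item concrete, here is a toy sketch. The
gnu_hash function is the usual GNU hash (start at 5381, multiply by 33, add
each byte). Everything else is hypothetical: struct dso_node, its fields, and
first_global_match are invented for the example, and a real lookup would
consult the bloom filter and hash chains and compare names rather than scan a
flat array of hashes.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard GNU symbol hash: h = h*33 + c, starting from 5381. */
uint32_t gnu_hash(const char *s0)
{
	const unsigned char *s = (const unsigned char *)s0;
	uint32_t h = 5381;
	for (; *s; s++)
		h += h * 32 + *s;
	return h;
}

/* Hypothetical, heavily simplified stand-in for a loaded library. */
struct dso_node {
	struct dso_node *next;
	int global;
	const uint32_t *hashes;	/* hashes of the symbols it defines */
	size_t nhashes;
};

/* Returns the first global node whose table contains the name's hash. */
const struct dso_node *first_global_match(const struct dso_node *head,
                                          const char *name)
{
	/* Computed once, up front: no "is it computed yet?" branch
	 * remains inside the loop. */
	uint32_t gh = gnu_hash(name);

	for (const struct dso_node *p = head; p; p = p->next) {
		if (!p->global)
			continue;
		for (size_t i = 0; i < p->nhashes; i++)
			if (p->hashes[i] == gh)
				return p;
	}
	return NULL;
}
```

The trade-off is that the hash is computed even if no visited DSO ends up
needing it, which is presumably the case the lazy version was meant to cover.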

> - Even the check if (!dso->global) continue; has nontrivial cost.
>   Since I want to replace this representation with a separate
>   linked-list chain for global dsos anyway (for other reasons) I think
>   that's worth prioritizing for performance too.

I'm curious what the other reasons are? :)

> - The strength-reduction of remainder operations does not seem to
>   provide worthwhile benefits yet, simply because so little of the
>   overall time is spent on the division/remainder.

On IRC we noted that on AArch64 the strength-reduced version is slower than
native div/mod on our microbenchmark, and on ARM the speedup is smaller than
expected.  My testing on x86 indicates that it's not profitable in the dynamic
linker (not sure why).
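
For readers unfamiliar with the technique, here is a minimal sketch of one
common way to strength-reduce a remainder by a loop-invariant divisor:
multiply by a precomputed reciprocal. This is not necessarily the variant that
was benchmarked; the function names are made up for the example, and it relies
on the unsigned __int128 extension available in GCC and Clang on 64-bit
targets.

```c
#include <stdint.h>

/* Precompute ceil(2^64 / d) for a nonzero 32-bit divisor d.
 * For d == 1 the value wraps to 0, which still yields remainder 0. */
uint64_t mod_precompute(uint32_t d)
{
	return UINT64_MAX / d + 1;
}

/* Compute a % d using two multiplications and no division.
 * m must come from mod_precompute(d). */
uint32_t mod_reduce(uint32_t a, uint64_t m, uint32_t d)
{
	/* Low 64 bits of m*a hold the fractional part of a/d,
	 * scaled by 2^64. */
	uint64_t frac = m * a;
	/* Scaling that fraction by d and keeping the high 64 bits
	 * recovers the remainder. */
	return (uint32_t)(((unsigned __int128)frac * d) >> 64);
}
```

In a hash table lookup, d would be the bucket count, so the reciprocal could be
computed once per DSO at load time and reused for every symbol.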

Alexander