* [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
@ 2025-12-06 17:31 Fabian Rast
2025-12-07 18:45 ` Szabolcs Nagy
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Fabian Rast @ 2025-12-06 17:31 UTC (permalink / raw)
To: musl
calculating the hashes of symbol names can take a significant amount of
time while applying relocations. however, for lookups of symbols that are
themselves part of the gnu hashtable in the object being relocated, the
hash value is already present in that hashtable and can be efficiently
retrieved using the symbol index.
the gnu hash table only stores the upper 31 bits of the 32 bit wide
hashes. the lsb is used to indicate the end of a chain.
this missing bit can be inferred from the position of the symbol index
relative to the two possible chains.
---
ldso/dynlink.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)
diff --git a/ldso/dynlink.c b/ldso/dynlink.c
index 44d9b181..6dee7a97 100644
--- a/ldso/dynlink.c
+++ b/ldso/dynlink.c
@@ -320,10 +320,23 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
#if defined(__GNUC__)
__attribute__((always_inline))
#endif
-static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
+static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps, struct dso *dso2, int sym_index)
{
- uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
- size_t ghm = 1ul << gh % (8*sizeof(size_t));
+ uint32_t h = 0, gh, gho, *ght;
+ size_t ghm;
+
+ if (dso2 && (ght = dso2->ghashtab) && sym_index >= ght[1] && ght[0] > 1) {
+ uint32_t nbuckets = ght[0], *buckets = ght + 4 + ght[2] * 2;
+ uint32_t h0 = buckets[nbuckets + sym_index - ght[1]] & ~1u, h1 = h0|1;
+ uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
+ gh = i0 < i1 && sym_index < i1
+ || i0 > i1 && sym_index >= i0
+ || !i1 ? h0 : h1;
+ } else gh = gnu_hash(s);
+
+ gho = gh / (8*sizeof(size_t));
+ ghm = 1ul << gh % (8*sizeof(size_t));
+
struct symdef def = {0};
struct dso **deps = use_deps ? dso->deps : 0;
for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
@@ -353,7 +366,7 @@ static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_d
static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
{
- return find_sym2(dso, s, need_def, 0);
+ return find_sym2(dso, s, need_def, 0, NULL, STN_UNDEF);
}
static struct symdef get_lfs64(const char *name)
@@ -437,7 +450,7 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
ctx = type==REL_COPY ? head->syms_next : head;
def = (sym->st_info>>4) == STB_LOCAL
? (struct symdef){ .dso = dso, .sym = sym }
- : find_sym(ctx, name, type==REL_PLT);
+ : find_sym2(ctx, name, type==REL_PLT, 0, dso, sym_index);
if (!def.sym) def = get_lfs64(name);
if (!def.sym && (sym->st_shndx != SHN_UNDEF
|| sym->st_info>>4 != STB_WEAK)) {
@@ -2345,7 +2358,7 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
return 0;
} else
use_deps = 1;
- struct symdef def = find_sym2(p, s, 0, use_deps);
+ struct symdef def = find_sym2(p, s, 0, use_deps, NULL, STN_UNDEF);
if (!def.sym) {
error("Symbol not found: %s", s);
return 0;
--
2.52.0
Hello,
thank you for your feedback on the previous patch.
I have integrated the idea of using the position of the
symbol index relative to the start of the two possible chains
to infer the missing bit in a single calculation that is independent
of the search path length.
I decided to do this instead of using the bloom filter because it seems
more reliable, e.g. the bloom filter might accept both versions of the hash.
There are some edge cases that I considered:
The gnu hashtable might only have a single bucket, in which case
this approach can't work. (The bloom filter idea could still work here though.)
In this case the hash is just calculated normally.
The hashtable might contain empty chains (buckets[h % nbuckets] == 0).
It is impossible for both considered buckets to be empty. If one bucket is empty,
the correct hash must be the one indexing the other bucket.
The relative position of the two chains (which one is first) is undefined.
The chain indexed by h1 can be before the chain indexed by h0.
Even if the buckets array is sorted, it can be that
h0 % nbuckets == nbuckets - 1, and h1 % nbuckets == 0.
> uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
I used two modulo operations because it was slightly faster than
branching on (h0 % nbuckets)+1 == nbuckets. I can't confidently explain this...
= Benchmarks
My interpretation of the benchmarks:
Sadly, all benchmarks have a lot of variance. Clang seems to be an extreme
example, where this patch helps a lot more than in other scenarios. This
probably depends on the average cost of hashing a symbol name, which grows
with the name length.
For example:
avglen.sh
```
eu-readelf --dyn-syms $1 | awk '{ total_len += length($8); total += 1 } END { print total_len/total }'
```
sh avglen.sh /usr/lib/libLLVM.so.21.1
84.3108
sh avglen.sh /usr/lib/libclang-cpp.so.21.1
88.9683
sh avglen.sh /usr/lib/firefox/libxul.so
24.2836
sh avglen.sh /usr/lib/libwebkit2gtk-4.1.so.0.17.6
37.898
To me it looks like the very long symbol names in libclang and libLLVM are
the cause of the speedup in the clang example, while the speedups with other
libraries are barely measurable or absent.
/home/fr/src/poop/zig-out/bin/poop -d 10000 \
'env LD_LIBRARY_PATH=/home/fr/src/musl/master-install/lib/ /home/fr/src/musl/master-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22' \
'env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/precomp-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22'
Benchmark 1 (1437 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/master-install/lib/ /home/fr/src/musl/master-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
measurement mean ± σ min … max outliers delta
wall_time 6.94ms ± 1.12ms 3.93ms … 9.19ms 0 ( 0%) 0%
peak_rss 12.7MB ± 100KB 12.3MB … 12.9MB 6 ( 0%) 0%
cpu_cycles 10.8M ± 671K 9.34M … 14.4M 26 ( 2%) 0%
instructions 19.3M ± 880 19.3M … 19.3M 49 ( 3%) 0%
cache_references 466K ± 3.17K 457K … 485K 21 ( 1%) 0%
cache_misses 84.7K ± 1.33K 82.1K … 90.6K 37 ( 3%) 0%
branch_misses 79.5K ± 731 77.9K … 82.8K 4 ( 0%) 0%
Benchmark 2 (1599 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/precomp-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
measurement mean ± σ min … max outliers delta
wall_time 6.24ms ± 795us 3.59ms … 8.19ms 69 ( 4%) ⚡- 10.0% ± 1.0%
peak_rss 12.7MB ± 94.4KB 12.3MB … 12.8MB 9 ( 1%) + 0.1% ± 0.1%
cpu_cycles 8.79M ± 486K 7.49M … 11.6M 89 ( 6%) ⚡- 18.9% ± 0.4%
instructions 14.7M ± 859 14.7M … 14.7M 49 ( 3%) ⚡- 23.6% ± 0.0%
cache_references 465K ± 4.66K 453K … 487K 17 ( 1%) - 0.1% ± 0.1%
cache_misses 84.3K ± 1.38K 81.0K … 90.9K 108 ( 7%) - 0.4% ± 0.1%
branch_misses 77.0K ± 485 75.9K … 78.8K 12 ( 1%) ⚡- 3.2% ± 0.1%
The speedup is similar to the first version of the patch. My explanation
is that, as Szabolcs Nagy pointed out, running clang does not involve many
dsos, so the per-dso overhead of the second bucket search had only a tiny
impact.
The other benchmarks were executed in an alpine docker container:
sudo docker run --rm -it -v./master-install:/tmp/master-install -v./precomp-install:/tmp/precomp-install alpine:latest sh
apk add hyperfine clang ghostscript-gtk libreoffice-common webkit2gtk-4.1 firefox mpv ffmpeg
bm.sh:
```
hyperfine -N -w 500 -r 1000 "env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list $1" "env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list $1"
```
= clang
/ # sh bm.sh /usr/bin/clang
Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
Time (mean ± σ): 11.7 ms ± 2.8 ms [User: 8.8 ms, System: 2.8 ms]
Range (min … max): 6.7 ms … 19.6 ms 1000 runs
Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/clang
Time (mean ± σ): 9.5 ms ± 1.7 ms [User: 6.7 ms, System: 2.7 ms]
Range (min … max): 5.6 ms … 13.5 ms 1000 runs
Summary
env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/clang ran
1.24 ± 0.38 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
= gsx
/ # sh bm.sh /usr/bin/gsx
Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/gsx
Time (mean ± σ): 15.3 ms ± 3.4 ms [User: 10.6 ms, System: 4.6 ms]
Range (min … max): 10.3 ms … 23.4 ms 1000 runs
Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/gsx
Time (mean ± σ): 14.9 ms ± 3.2 ms [User: 10.2 ms, System: 4.6 ms]
Range (min … max): 9.9 ms … 21.5 ms 1000 runs
Summary
env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/gsx ran
1.03 ± 0.32 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/gsx
= ffmpeg
/ # sh bm.sh /usr/bin/ffmpeg
Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/ffmpeg
Time (mean ± σ): 20.5 ms ± 4.1 ms [User: 15.1 ms, System: 5.1 ms]
Range (min … max): 14.6 ms … 32.7 ms 1000 runs
Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/ffmpeg
Time (mean ± σ): 19.6 ms ± 3.7 ms [User: 14.3 ms, System: 5.1 ms]
Range (min … max): 14.0 ms … 31.4 ms 1000 runs
Summary
env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/ffmpeg ran
1.05 ± 0.29 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/ffmpeg
= mpv
/ # sh bm.sh /usr/bin/mpv
Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/mpv
Time (mean ± σ): 52.9 ms ± 4.4 ms [User: 44.7 ms, System: 7.9 ms]
Range (min … max): 46.3 ms … 91.3 ms 1000 runs
Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/mpv
Time (mean ± σ): 51.4 ms ± 4.5 ms [User: 43.2 ms, System: 7.9 ms]
Range (min … max): 44.7 ms … 91.7 ms 1000 runs
Summary
env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/mpv ran
1.03 ± 0.12 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/mpv
= libreoffice
/ # sh bm.sh /usr/lib/libreoffice/program/soffice.bin
Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/libreoffice/program/soffice.bin
Time (mean ± σ): 45.8 ms ± 5.2 ms [User: 38.4 ms, System: 7.2 ms]
Range (min … max): 38.0 ms … 79.1 ms 1000 runs
Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/lib/libreoffice/program/soffice.bin
Time (mean ± σ): 45.0 ms ± 4.9 ms [User: 37.6 ms, System: 7.2 ms]
Range (min … max): 36.6 ms … 69.6 ms 1000 runs
Summary
env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/lib/libreoffice/program/soffice.bin ran
1.02 ± 0.16 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/libreoffice/program/soffice.bin
= /usr/lib/libwebkit2gtk-*
/ # sh bm.sh /usr/lib/libwebkit2gtk-4.1.so.0
Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/libwebkit2gtk-4.1.so.0
Time (mean ± σ): 57.6 ms ± 5.1 ms [User: 49.2 ms, System: 8.1 ms]
Range (min … max): 49.1 ms … 93.5 ms 1000 runs
Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/lib/libwebkit2gtk-4.1.so.0
Time (mean ± σ): 54.9 ms ± 4.9 ms [User: 46.4 ms, System: 8.3 ms]
Range (min … max): 47.6 ms … 95.2 ms 1000 runs
Summary
env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/lib/libwebkit2gtk-4.1.so.0 ran
1.05 ± 0.13 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/libwebkit2gtk-4.1.so.0
= firefox
/ # sh bm.sh /usr/lib/firefox/libxul.so
Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/firefox/libxul.so
Time (mean ± σ): 18.9 ms ± 3.6 ms [User: 13.7 ms, System: 5.0 ms]
Range (min … max): 12.6 ms … 28.2 ms 1000 runs
Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/lib/firefox/libxul.so
Time (mean ± σ): 18.3 ms ± 3.7 ms [User: 13.3 ms, System: 4.9 ms]
Range (min … max): 12.8 ms … 28.4 ms 1000 runs
Summary
env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/lib/firefox/libxul.so ran
1.03 ± 0.29 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/firefox/libxul.so
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-06 17:31 [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed Fabian Rast
@ 2025-12-07 18:45 ` Szabolcs Nagy
2025-12-07 19:44 ` Alexander Monakov
2025-12-08 1:44 ` Rich Felker
2 siblings, 0 replies; 18+ messages in thread
From: Szabolcs Nagy @ 2025-12-07 18:45 UTC (permalink / raw)
To: Fabian Rast; +Cc: musl
* Fabian Rast <fabian.rast@tum.de> [2025-12-06 18:31:06 +0100]:
> calculating the hashes of symbol names can take a significant amount of
> time while applying relocations. however, symbol lookups for
> symbols that are themselves part of the gnu hashtable in the object
> being relocated, the hashvalue is already present in that hashtable
> and can be efficiently retrieved using the symbol index.
>
> the gnu hash table only stores the upper 31 bits of the 32 bit wide
> hashes. the lsb is used to indicate the end of a chain.
> this missing bit can be inferred from the position of the symbol index
> relative to the two possible chains.
> ---
> ldso/dynlink.c | 25 +++++++++++++++++++------
> 1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/ldso/dynlink.c b/ldso/dynlink.c
> index 44d9b181..6dee7a97 100644
> --- a/ldso/dynlink.c
> +++ b/ldso/dynlink.c
> @@ -320,10 +320,23 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
> #if defined(__GNUC__)
> __attribute__((always_inline))
> #endif
> -static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
> +static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps, struct dso *dso2, int sym_index)
> {
> - uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
> - size_t ghm = 1ul << gh % (8*sizeof(size_t));
> + uint32_t h = 0, gh, gho, *ght;
> + size_t ghm;
> +
> + if (dso2 && (ght = dso2->ghashtab) && sym_index >= ght[1] && ght[0] > 1) {
> + uint32_t nbuckets = ght[0], *buckets = ght + 4 + ght[2] * 2;
other part of the code uses *sizeof(size_t)/4 instead of *2.
> + uint32_t h0 = buckets[nbuckets + sym_index - ght[1]] & ~1u, h1 = h0|1;
> + uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
> + gh = i0 < i1 && sym_index < i1
> + || i0 > i1 && sym_index >= i0
> + || !i1 ? h0 : h1;
yeah, this became more expensive than i expected.
> + } else gh = gnu_hash(s);
> +
> + gho = gh / (8*sizeof(size_t));
> + ghm = 1ul << gh % (8*sizeof(size_t));
> +
> struct symdef def = {0};
> struct dso **deps = use_deps ? dso->deps : 0;
> for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
...
> There some edgecases that i considered:
>
> the gnu hashtable might only have a single bucket, in which case
> this approach cant work. (the bloom filter idea could still work here though)
>
> In this case the hash is just calculated normally.
>
>
> The hashtable might contain empty chains (buckets[h % nbuckets] == 0)
>
> It is impossible for both considered buckets to be empty. If one bucket is empty
> the correct hash must be the one indexing the other bucket.
>
>
> The relative position of the two chains (which one is first) is undefined.
> The chain indexed by h1 can be before the chain indexed by h0.
> Even if the buckets array is sorted, it can be that
> h0 % nbuckets == nbuckets - 1, and h1 % nbuckets == 0.
>
> > uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
> I used two modulo operations because it was slightly faster than
> branching on (h0 % nbuckets)+1 == nbuckets. I can't confidently explain this...
yeah this is a lot of code, big cpus have
enough resources to do two % out of order
while other things happen, smaller cpus
might struggle here..
> avglen.sh
> ```
> eu-readelf --dyn-syms $1 | awk '{ total_len += length($8); total += 1 } END { print total_len/total }'
> ```
>
> sh avglen.sh /usr/lib/libLLVM.so.21.1
> 84.3108
> sh avglen.sh /usr/lib/libclang-cpp.so.21.1
> 88.9683
> sh avglen.sh /usr/lib/firefox/libxul.so
> 24.2836
> sh avglen.sh /usr/lib/libwebkit2gtk-4.1.so.0.17.6
> 37.898
>
> To me it looks like the very long symbol names in libclang and libLLVM are
> the cause for the speedups in the clang example, while there are only barely
> measurable/no speedups with other libraries.
yeah, name mangling..
for better stats, check indirect deps too and
only symbols referenced by relocs and count
locally defined syms that we actually optimize:
echo '/usr/bin/clang
/usr/lib/firefox/libxul.so
/usr/lib/libwebkit2gtk-4.1.so.0
/usr/bin/mpv
/usr/bin/ffmpeg
/usr/bin/gsx
/usr/lib/libreoffice/program/soffice.bin' |while read y
do
echo $y
/lib/ld-musl-x86_64.so.1 --list $y|awk '$2~/=>/{print $3}'|while read x
do
readelf --dyn-sym -rW $x |awk '
$3~/^R_/&&$5{sub(/@.*/,"");a[$5]=length($5)}
$5~/GLOBAL|WEAK/&&$7!="UND"&&$8{sub(/@.*/,"");b[$8]="opt"}
END{for(i in a) print a[i],b[i]}'
done|sort -n|awk '
/opt/{a[n++]=$1;s+=$1}
END{
p=int(100*n/NR);n1=int(n*.25);n2=int(n*.5);n3=int(n*.75)
print "opt",p"%","avg",s/n,"q1",a[n1],"med",a[n2],"q3",a[n3]
}'
/usr/bin/clang
opt 87% avg 63.064 q1 33 med 50 q3 84
/usr/lib/firefox/libxul.so
opt 51% avg 30.4272 q1 19 med 25 q3 33
/usr/lib/libwebkit2gtk-4.1.so.0
opt 59% avg 46.349 q1 23 med 33 q3 54
/usr/bin/mpv
opt 66% avg 47.1374 q1 22 med 34 q3 58
/usr/bin/ffmpeg
opt 60% avg 38.9045 q1 19 med 28 q3 48
/usr/bin/gsx
opt 52% avg 29.1147 q1 18 med 24 q3 32
/usr/lib/libreoffice/program/soffice.bin
opt 37% avg 30.5672 q1 19 med 25 q3 35
i think you need high % of syms where
the optimization is applied (to be relevant)
and long syms as well as low cache miss
rate when accessing the local gnu hash tab
(to improve perf).
> = clang
> 1.24 ± 0.38 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
> = gsx
> 1.03 ± 0.32 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/gsx
> = ffmpeg
> 1.05 ± 0.29 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/ffmpeg
> = mpv
> 1.03 ± 0.12 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/mpv
> = libreoffice
> 1.02 ± 0.16 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/libreoffice/program/soffice.bin
> = /usr/lib/libwebkit2gtk-*
> 1.05 ± 0.13 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/libwebkit2gtk-4.1.so.0
> = firefox
> 1.03 ± 0.29 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/lib/firefox/libxul.so
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-06 17:31 [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed Fabian Rast
2025-12-07 18:45 ` Szabolcs Nagy
@ 2025-12-07 19:44 ` Alexander Monakov
2025-12-08 1:44 ` Rich Felker
2 siblings, 0 replies; 18+ messages in thread
From: Alexander Monakov @ 2025-12-07 19:44 UTC (permalink / raw)
To: Fabian Rast; +Cc: musl
Hello.
Impressive work!
> The other benchmarks were executed in an alpine docker container
If you're using hyperfine just because poop is not packaged, may I suggest
measuring with perf instead? That way, you get cycles/instructions counters
which are invariant w.r.t CPU frequency scaling, so more relevant for us.
'perf stat -r 99 <workload>' runs the workload 99 times and prints deviations.
Alexander
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-06 17:31 [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed Fabian Rast
2025-12-07 18:45 ` Szabolcs Nagy
2025-12-07 19:44 ` Alexander Monakov
@ 2025-12-08 1:44 ` Rich Felker
2025-12-08 2:25 ` Rich Felker
2 siblings, 1 reply; 18+ messages in thread
From: Rich Felker @ 2025-12-08 1:44 UTC (permalink / raw)
To: Fabian Rast; +Cc: musl
On Sat, Dec 06, 2025 at 06:31:06PM +0100, Fabian Rast wrote:
> calculating the hashes of symbol names can take a significant amount of
> time while applying relocations. however, symbol lookups for
> symbols that are themselves part of the gnu hashtable in the object
> being relocated, the hashvalue is already present in that hashtable
> and can be efficiently retrieved using the symbol index.
>
> the gnu hash table only stores the upper 31 bits of the 32 bit wide
> hashes. the lsb is used to indicate the end of a chain.
> this missing bit can be inferred from the position of the symbol index
> relative to the two possible chains.
> ---
> ldso/dynlink.c | 25 +++++++++++++++++++------
> 1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/ldso/dynlink.c b/ldso/dynlink.c
> index 44d9b181..6dee7a97 100644
> --- a/ldso/dynlink.c
> +++ b/ldso/dynlink.c
> @@ -320,10 +320,23 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
> #if defined(__GNUC__)
> __attribute__((always_inline))
> #endif
> -static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
> +static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps, struct dso *dso2, int sym_index)
> {
> - uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
> - size_t ghm = 1ul << gh % (8*sizeof(size_t));
> + uint32_t h = 0, gh, gho, *ght;
> + size_t ghm;
> +
> + if (dso2 && (ght = dso2->ghashtab) && sym_index >= ght[1] && ght[0] > 1) {
> + uint32_t nbuckets = ght[0], *buckets = ght + 4 + ght[2] * 2;
> + uint32_t h0 = buckets[nbuckets + sym_index - ght[1]] & ~1u, h1 = h0|1;
> + uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
> + gh = i0 < i1 && sym_index < i1
> + || i0 > i1 && sym_index >= i0
> + || !i1 ? h0 : h1;
> + } else gh = gnu_hash(s);
> +
> + gho = gh / (8*sizeof(size_t));
> + ghm = 1ul << gh % (8*sizeof(size_t));
> +
This looks a lot more expensive than the method I suggested using the
bloom filter, which has no modulo, much less two.
> struct symdef def = {0};
> struct dso **deps = use_deps ? dso->deps : 0;
> for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
> @@ -353,7 +366,7 @@ static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_d
>
> static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
> {
> - return find_sym2(dso, s, need_def, 0);
> + return find_sym2(dso, s, need_def, 0, NULL, STN_UNDEF);
> }
>
> static struct symdef get_lfs64(const char *name)
> @@ -437,7 +450,7 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
> ctx = type==REL_COPY ? head->syms_next : head;
> def = (sym->st_info>>4) == STB_LOCAL
> ? (struct symdef){ .dso = dso, .sym = sym }
> - : find_sym(ctx, name, type==REL_PLT);
> + : find_sym2(ctx, name, type==REL_PLT, 0, dso, sym_index);
> if (!def.sym) def = get_lfs64(name);
> if (!def.sym && (sym->st_shndx != SHN_UNDEF
> || sym->st_info>>4 != STB_WEAK)) {
> @@ -2345,7 +2358,7 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
> return 0;
> } else
> use_deps = 1;
> - struct symdef def = find_sym2(p, s, 0, use_deps);
> + struct symdef def = find_sym2(p, s, 0, use_deps, NULL, STN_UNDEF);
> if (!def.sym) {
> error("Symbol not found: %s", s);
> return 0;
> --
> 2.52.0
>
>
> Hello,
>
> thank you for your feedback on the previous patch.
> I have integrated the idea of using the position of the
> symbol index relative to the start of the two possible chains
> to infer the missing bit in a single calculation that is independent
> of the search path length.
> I decided to do this instead of using the bloom filter because it seems
> more reliable, e.g. the bloom filter might accept both versions of the hash.
>
> There some edgecases that i considered:
>
> the gnu hashtable might only have a single bucket, in which case
> this approach cant work. (the bloom filter idea could still work here though)
>
> In this case the hash is just calculated normally.
>
>
> The hashtable might contain empty chains (buckets[h % nbuckets] == 0)
>
> It is impossible for both considered buckets to be empty. If one bucket is empty
> the correct hash must be the one indexing the other bucket.
>
>
> The relative position of the two chains (which one is first) is undefined.
> The chain indexed by h1 can be before the chain indexed by h0.
> Even if the buckets array is sorted, it can be that
> h0 % nbuckets == nbuckets - 1, and h1 % nbuckets == 0.
>
> > uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
> I used two modulo operations because it was slightly faster than
> branching on (h0 % nbuckets)+1 == nbuckets. I can't confidently explain this...
That's almost surely not true on anything but "big" archs. I would
expect on archs without a hardware div, that even a single modulo
costs a lot more than the entire gnu hash calculation.
Rich
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-08 1:44 ` Rich Felker
@ 2025-12-08 2:25 ` Rich Felker
2025-12-08 5:20 ` Szabolcs Nagy
0 siblings, 1 reply; 18+ messages in thread
From: Rich Felker @ 2025-12-08 2:25 UTC (permalink / raw)
To: Fabian Rast; +Cc: musl
On Sun, Dec 07, 2025 at 08:44:54PM -0500, Rich Felker wrote:
> On Sat, Dec 06, 2025 at 06:31:06PM +0100, Fabian Rast wrote:
> > calculating the hashes of symbol names can take a significant amount of
> > time while applying relocations. however, symbol lookups for
> > symbols that are themselves part of the gnu hashtable in the object
> > being relocated, the hashvalue is already present in that hashtable
> > and can be efficiently retrieved using the symbol index.
> >
> > the gnu hash table only stores the upper 31 bits of the 32 bit wide
> > hashes. the lsb is used to indicate the end of a chain.
> > this missing bit can be inferred from the position of the symbol index
> > relative to the two possible chains.
> > ---
> > ldso/dynlink.c | 25 +++++++++++++++++++------
> > 1 file changed, 19 insertions(+), 6 deletions(-)
> >
> > diff --git a/ldso/dynlink.c b/ldso/dynlink.c
> > index 44d9b181..6dee7a97 100644
> > --- a/ldso/dynlink.c
> > +++ b/ldso/dynlink.c
> > @@ -320,10 +320,23 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
> > #if defined(__GNUC__)
> > __attribute__((always_inline))
> > #endif
> > -static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
> > +static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps, struct dso *dso2, int sym_index)
> > {
> > - uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
> > - size_t ghm = 1ul << gh % (8*sizeof(size_t));
> > + uint32_t h = 0, gh, gho, *ght;
> > + size_t ghm;
> > +
> > + if (dso2 && (ght = dso2->ghashtab) && sym_index >= ght[1] && ght[0] > 1) {
> > + uint32_t nbuckets = ght[0], *buckets = ght + 4 + ght[2] * 2;
> > + uint32_t h0 = buckets[nbuckets + sym_index - ght[1]] & ~1u, h1 = h0|1;
> > + uint32_t i0 = buckets[h0 % nbuckets], i1 = buckets[h1 % nbuckets];
> > + gh = i0 < i1 && sym_index < i1
> > + || i0 > i1 && sym_index >= i0
> > + || !i1 ? h0 : h1;
> > + } else gh = gnu_hash(s);
> > +
> > + gho = gh / (8*sizeof(size_t));
> > + ghm = 1ul << gh % (8*sizeof(size_t));
> > +
>
> This looks a lot more expensive than the method I suggested using the
> bloom filter, which has no modulo, much less two.
Actually, I'm not sure if the bloom filter approach works. Is a hash
included in the bloom filter as soon as it's present in the symbol
table, or only if it's present with a definition?
If it's absent for undefined, then this approach is a non-starter. It
can't tell us anything.
But if undefined symbols are present, then here's my concept for how
it would work:
It should consist of grabbing two consecutive bits aligned at a 2-bit
boundary (so not crossing slots) from the bloom filter, and
considering the possible values:
00 - impossible
01 - low bit is 0
10 - low bit is 1
11 - inconclusive; just compute hash. should be rare with good bloom.
Rich
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-08 2:25 ` Rich Felker
@ 2025-12-08 5:20 ` Szabolcs Nagy
2025-12-08 16:20 ` Rich Felker
0 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2025-12-08 5:20 UTC (permalink / raw)
To: Rich Felker; +Cc: Fabian Rast, musl
* Rich Felker <dalias@libc.org> [2025-12-07 21:25:42 -0500]:
> On Sun, Dec 07, 2025 at 08:44:54PM -0500, Rich Felker wrote:
> > This looks a lot more expensive than the method I suggested using the
> > bloom filter, which has no modulo, much less two.
>
> Actually, I'm not sure if the bloom filter approach works. Is a hash
> included in the bloom filter as soon as it's present in the symbol
> table, or only if it's present with a definition?
>
> If it's absent for undefined, then this approach is a non-starter. It
> can't tell us anything.
isn't it the other way around?
only defined symbols set bits in the bloom filter.
it is for checking if a dso might have a definition.
>
> But if undefined symbols are present, then here's my concept for how
> it would work:
>
> It should consist of grabbing two consecutive bits aligned at a 2-bit
> boundary (so not crossing slots) from the bloom filter, and
> considering the possible values:
>
> 00 - impossible
> 01 - low bit is 0
> 10 - low bit is 1
> 11 - inconclusive; just compute hash. should be rare with good bloom.
yeah, one of the bits is 'in the filter'; the other
is a random bit test: if a portion p of the bits are
1, then we fail with probability p.
in bfd ld the bloom mask seems to be sized such that
if nsym in [ 3<<k, 3<<(k+1) ) then maskbits = 32<<k
so 3/32 <= nsym/maskbits < 6/32
each symbol sets two bits so the probability that
a bit is 0 is (1-1/maskbits)^(2*nsym) so a bit is 1
with roughly p = 1 - exp(-2*nsym/maskbits) that is
0.17 < p < 0.32 probability. (normally two bits
are tested so bloom fails with p^2, but the other
bit is shared between h&-2 and h|1)
if this is right and the hash is good, then we have
about 20-30% chance to fail and 70-80% to succeed
just with that one bit.
it seems to me that with many dsos (>50), it's not
the gnu_hash but the walk over all dsos and checking
gnu_lookup_filtered will be the slow part. not
sure if it's possible to help with that.
i also noticed that some symbols are looked up a lot:
in clang >30% of lookups are (multiple relocs per dso)
__cxa_pure_virtual
_ZTVN10__cxxabiv120__si_class_type_infoE
_ZTVN10__cxxabiv117__class_type_infoE
_ZTVN10__cxxabiv121__vmi_class_type_infoE
and these generate fair bit of lookups too (undef weak):
_ITM_deregisterTMCloneTable
_ITM_registerTMCloneTable
__deregister_frame_info
__register_frame_info
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-08 5:20 ` Szabolcs Nagy
@ 2025-12-08 16:20 ` Rich Felker
2025-12-08 23:14 ` Szabolcs Nagy
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Rich Felker @ 2025-12-08 16:20 UTC (permalink / raw)
To: Fabian Rast, musl
On Mon, Dec 08, 2025 at 06:20:58AM +0100, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2025-12-07 21:25:42 -0500]:
> > On Sun, Dec 07, 2025 at 08:44:54PM -0500, Rich Felker wrote:
> > > This looks a lot more expensive than the method I suggested using the
> > > bloom filter, which has no modulo, much less two.
> >
> > Actually, I'm not sure if the bloom filter approach works. Is a hash
> > included in the bloom filter as soon as it's present in the symbol
> > table, or only if it's present with a definition?
> >
> > If it's absent for undefined, then this approach is a non-starter. It
> > can't tell us anything.
>
> isn't it the other way around?
>
> only defined symbols set bits in the bloom filter.
> it is for checking if a dso might have a definition.
I think you misunderstood what I was saying and looking for.
I was thinking the bloom filter would include everything in the symbol
table, including both symbols that provide definitions and symbol
table entries that are used to reference undefined symbols. In this
case, we could use the bloom filter of the DSO making the reference to
determine which of the two candidate hashes is correct, and do so very
quickly (nothing but a few bitshifts and masks).
However, it seems reasonable that the bloom filter would only include
symbols that actually have definitions. Otherwise at each step of the
DSO chain you'd get a false positive for every DSO that references the
symbol without defining it. So we probably can't do it the way I'd
hoped.
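Mechanically, the two-bit decoding quoted below would look roughly like this (hypothetical sketch, not musl code; it assumes the standard DT_GNU_HASH bloom layout, a power-of-two maskwords, and that h0 is the even candidate so the bit pair never crosses a word):

```c
#include <stdint.h>
#include <stddef.h>

#define FILTER_BITS (8 * sizeof(size_t))

/* Given the even candidate h0 (and implicitly h1 = h0|1), inspect the
 * two adjacent filter bits at positions h0%FILTER_BITS and +1 in the
 * bloom word selected by h0.  Returns the disambiguated hash, or 0 if
 * the check is inconclusive and gnu_hash must be computed after all. */
static uint32_t bloom_pick(const size_t *bloom, uint32_t maskwords,
                           uint32_t h0)
{
	size_t f = bloom[(h0 / FILTER_BITS) & (maskwords - 1)];
	unsigned two = (f >> (h0 % FILTER_BITS)) & 3;
	switch (two) {
	case 1: return h0;      /* 01: low hash bit must be 0 */
	case 2: return h0 | 1;  /* 10: low hash bit must be 1 */
	default: return 0;      /* 11: inconclusive (00: not in filter) */
	}
}
```

The second bloom bit (derived from h >> shift) is normally shared between h0 and h1 whenever shift >= 1, so only this adjacent pair can tell the candidates apart.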
> > But if undefined symbols are present, then here's my concept for how
> > it would work:
> >
> > It should consist of grabbing two consecutive bits aligned at a 2-bit
> > boundary (so not crossing slots) from the bloom filter, and
> > considering the possible values:
> >
> > 00 - impossible
> > 01 - low bit is 0
> > 10 - low bit is 1
> > 11 - inconclusive; just compute hash. should be rare with good bloom.
>
> yeah, one of the bits is 'in the filter' the other
> is a random bit test: if p portion of the bits are
> 1 then with p chance we fail.
>
> in bfd ld the bloom mask seems to be sized such that
> if nsym in [ 3<<k, 3<<(k+1) ) then maskbits = 32<<k
> so 3/32 <= nsym/maskbits < 6/32
>
> each symbol sets two bits so the probability that
> a bit is 0 is (1-1/maskbits)^(2*nsym) so a bit is 1
> with roughly p = 1 - exp(-2*nsym/maskbits) that is
> 0.17 < p < 0.32 probability. (normally two bits
> are tested so bloom fails with p^2, but the other
> bit is shared between h&-2 and h|1)
>
> if this is right and the hash is good, then we have
> about 20-30% chance to fail and 70-80% to succeed
> just with that one bit.
>
>
> it seems to me that with many dsos (>50), it's not
> the gnu_hash but the walk over all dsos and checking
> gnu_lookup_filtered will be the slow part. not
> sure if it's possible to help with that.
> i also noticed that some symbols are looked up a lot:
> in clang >30% of lookups are (multiple relocs per dso)
> __cxa_pure_virtual
> _ZTVN10__cxxabiv120__si_class_type_infoE
> _ZTVN10__cxxabiv117__class_type_infoE
> _ZTVN10__cxxabiv121__vmi_class_type_infoE
> and these generate fair bit of lookups too (undef weak):
> _ITM_deregisterTMCloneTable
> _ITM_registerTMCloneTable
> __deregister_frame_info
> __register_frame_info
I think modern linkers sort relocations by the symbol referenced,
which allows you to bypass the lookup if the reference is the same as
the previous relocation. We do not take advantage of this in the
do_relocs loop at all, but we could. That would probably give far more
performance boost than eliding the hashing and might even make eliding
the hashing pointless.
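A one-entry cache in the relocation loop would exploit that ordering; a rough standalone sketch (hypothetical names, with find_sym_stub standing in for musl's find_sym):

```c
#include <stddef.h>

struct symdef_s { int value; };  /* simplified stand-in for struct symdef */

static int slow_lookups;         /* counts full lookups, for illustration */

/* hypothetical stand-in for musl's find_sym */
static struct symdef_s find_sym_stub(size_t sym_index)
{
	slow_lookups++;
	return (struct symdef_s){ (int)sym_index };
}

/* Consecutive relocations against the same symbol index reuse the
 * previous result and skip the lookup entirely. */
static struct symdef_s lookup_cached(size_t sym_index)
{
	static size_t prev_index = (size_t)-1;
	static struct symdef_s prev_def;

	if (sym_index != prev_index) {
		prev_def = find_sym_stub(sym_index);
		prev_index = sym_index;
	}
	return prev_def;
}
```

With relocations sorted by symbol, runs of identical indices collapse to a single slow lookup each.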
Rich
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-08 16:20 ` Rich Felker
@ 2025-12-08 23:14 ` Szabolcs Nagy
2025-12-09 18:31 ` Szabolcs Nagy
2025-12-09 1:10 ` Fabian Rast
2026-02-03 17:13 ` Fangrui Song
2 siblings, 1 reply; 18+ messages in thread
From: Szabolcs Nagy @ 2025-12-08 23:14 UTC (permalink / raw)
To: Rich Felker; +Cc: Fabian Rast, musl
[-- Attachment #1: Type: text/plain, Size: 4712 bytes --]
* Rich Felker <dalias@libc.org> [2025-12-08 11:20:42 -0500]:
> On Mon, Dec 08, 2025 at 06:20:58AM +0100, Szabolcs Nagy wrote:
> > if this is right and the hash is good, then we have
> > about 20-30% chance to fail and 70-80% to succeed
> > just with that one bit.
empirically the two bit bloom filter has 4-8%
failure (0.2-0.3 squared) so this looks right.
> > it seems to me that with many dsos (>50), it's not
> > the gnu_hash but the walk over all dsos and checking
> > gnu_lookup_filtered will be the slow part. not
> > sure if it's possible to help with that.
> > i also noticed that some symbols are looked up a lot:
> > in clang >30% of lookups are (multiple relocs per dso)
> > __cxa_pure_virtual
> > _ZTVN10__cxxabiv120__si_class_type_infoE
> > _ZTVN10__cxxabiv117__class_type_infoE
> > _ZTVN10__cxxabiv121__vmi_class_type_infoE
> > and these generate fair bit of lookups too (undef weak):
> > _ITM_deregisterTMCloneTable
> > _ITM_registerTMCloneTable
> > __deregister_frame_info
> > __register_frame_info
>
> I think modern linkers sort relocations by the symbol referenced,
> which allows you to bypass the lookup if the reference is the same as
> the previous relocation. We do not take advantage of this in the
> do_relocs loop at all, but we could. That would probably give far more
> performance boost than eliding the hashing and might even make eliding
> the hashing pointless.
so there are two optimizations:
repeat: consecutive relocs referencing the same sym
local: relocs referencing a locally defined sym
in case of a "repeat" all of find_sym can be avoided.
in case of "local" the gnu_hash and the final lookup
in find_sym can be avoided (with some extra cost).
so i added instrumentation to count relocs etc:
it seems there are binaries that can benefit from a
"local" optimization like clang and others that can
benefit from "repeat" like libreoffice. the gnu hash
seems to work well: strcmp is only used once per reloc
(however in clang >60% is strcmp(p,p) because local sym)
clang:
!r !l: n 3667 7.8% slen 45.93 7.3% iter 3.70 ff 0.16 hc 3.00 cmp 0.99
r !l: n 14255 30.4% slen 35.92 22.1% iter 3.93 ff 0.58 hc 4.48 cmp 1.00
!r l: n 20932 44.6% slen 62.88 56.8% iter 3.25 ff 0.14 hc 3.06 cmp 1.00
r l: n 8087 17.2% slen 39.39 13.8% iter 2.93 ff 0.18 hc 3.16 cmp 1.00
other: n 11 0.0% slen 15.09 0.0% iter 1.00 ff 0.00 hc 0.27 cmp 11.82
total: n 46952 100.0% slen 49.31 100.0% iter 3.43 ff 0.28 hc 3.50 cmp 1.00
ffmpeg:
!r !l: n 11196 33.2% slen 15.67 17.2% iter 30.76 ff 1.76 hc 2.96 cmp 0.98
r !l: n 2029 6.0% slen 29.47 5.8% iter 53.47 ff 1.56 hc 4.92 cmp 1.00
!r l: n 15803 46.9% slen 38.81 60.0% iter 63.74 ff 3.65 hc 4.64 cmp 1.00
r l: n 4580 13.6% slen 37.49 16.8% iter 50.79 ff 2.76 hc 4.55 cmp 1.00
other: n 115 0.3% slen 16.82 0.2% iter 1.00 ff 0.03 hc 0.04 cmp 110.18
total: n 33723 100.0% slen 30.31 100.0% iter 50.20 ff 2.77 hc 4.07 cmp 1.37
libxul:
!r !l: n 13989 49.0% slen 18.18 35.9% iter 28.27 ff 1.31 hc 2.98 cmp 0.97
r !l: n 586 2.1% slen 32.81 2.7% iter 35.00 ff 2.56 hc 5.00 cmp 1.00
!r l: n 11582 40.6% slen 30.34 49.6% iter 44.23 ff 2.20 hc 4.21 cmp 1.00
r l: n 2299 8.1% slen 35.53 11.5% iter 41.33 ff 2.09 hc 4.51 cmp 1.00
other: n 81 0.3% slen 16.74 0.2% iter 1.00 ff 0.06 hc 0.09 cmp 99.19
total: n 28537 100.0% slen 24.81 100.0% iter 35.86 ff 1.76 hc 3.64 cmp 1.27
libreoffice
!r !l: n 37999 27.6% slen 36.07 25.1% iter 26.67 ff 1.38 hc 3.44 cmp 0.99
r !l: n 70778 51.4% slen 45.60 59.2% iter 18.03 ff 0.75 hc 3.57 cmp 1.00
!r l: n 21183 15.4% slen 29.79 11.6% iter 51.49 ff 2.82 hc 4.98 cmp 1.00
r l: n 7747 5.6% slen 28.33 4.0% iter 28.85 ff 1.47 hc 4.16 cmp 1.00
other: n 127 0.1% slen 16.83 0.0% iter 1.00 ff 0.07 hc 0.08 cmp 172.51
total: n 137834 100.0% slen 39.55 100.0% iter 26.15 ff 1.28 hc 3.78 cmp 1.16
nodejs
!r !l: n 3677 29.8% slen 24.00 19.2% iter 13.52 ff 0.81 hc 3.27 cmp 0.99
r !l: n 1974 16.0% slen 28.21 12.1% iter 14.70 ff 0.91 hc 6.07 cmp 1.00
!r l: n 4583 37.1% slen 50.91 50.7% iter 11.83 ff 0.66 hc 3.15 cmp 1.00
r l: n 2102 17.0% slen 39.39 18.0% iter 13.57 ff 0.87 hc 4.65 cmp 1.00
other: n 19 0.2% slen 15.89 0.1% iter 1.00 ff 0.05 hc 0.21 cmp 23.26
total: n 12355 100.0% slen 37.26 100.0% iter 13.07 ff 0.78 hc 3.90 cmp 1.03
legend:
r:repeat relocs, !r:non-repeat, l:local def, !l:non-local def
other: not the normal do_relocs find_sym call (lfs64 etc)
n: reloc count of the given type and total%
slen: avg strlen and total%
iter: avg find_sym iter (dso visits)
ff: avg bloom filter fail count (final success excluded)
hc: avg hashchain iter
cmp: avg strcmp count
avg is sum/n of the given kind of reloc
instrumentation hack is attached for reference.
[-- Attachment #2: 0001-reloc-sym-lookup-stats.patch --]
[-- Type: text/x-diff, Size: 5248 bytes --]
From 21faaf1a1d342f02d2528453470eb0c15aaa2cbe Mon Sep 17 00:00:00 2001
From: Szabolcs Nagy <nsz@port70.net>
Date: Mon, 8 Dec 2025 19:03:18 +0000
Subject: [PATCH] reloc sym lookup stats
---
ldso/dynlink.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 87 insertions(+), 4 deletions(-)
diff --git a/ldso/dynlink.c b/ldso/dynlink.c
index 715948f4..6bbaa850 100644
--- a/ldso/dynlink.c
+++ b/ldso/dynlink.c
@@ -157,6 +157,19 @@ static struct dso **main_ctor_queue;
static struct fdpic_loadmap *app_loadmap;
static struct fdpic_dummy_loadmap app_dummy_loadmap;
+struct rstat {
+ unsigned long n; // sum find_sym calls
+ unsigned long symlen; // sum strlen of refd syms
+ unsigned long lookup; // sum visited dsos in find_sym iter
+ unsigned long bloomfail; // sum bloom filter check fails
+ unsigned long hashchain; // sum visited hash chain
+ unsigned long nstrcmp; // sum strcmp calls
+};
+static struct rstat rsc; // current rs, until reloc is resolved
+static struct rstat rs[6]; // resolved counts under various cases
+static unsigned long sysvonly;
+static int prev_sym_index;
+
struct debug *_dl_debug_addr = &debug;
extern weak hidden char __ehdr_start[];
@@ -172,6 +185,7 @@ weak_alias(__fini_array_start, __fini_array_end);
static int dl_strcmp(const char *l, const char *r)
{
+ rsc.nstrcmp++;
for (; *l==*r && *l; l++, r++);
return *(unsigned char *)l - *(unsigned char *)r;
}
@@ -279,6 +293,7 @@ static Sym *gnu_lookup(uint32_t h1, uint32_t *hashtab, struct dso *dso, const ch
uint32_t *hashval = buckets + nbuckets + (i - hashtab[1]);
for (h1 |= 1; ; i++) {
+ rsc.hashchain++;
uint32_t h2 = *hashval++;
if ((h1 == (h2|1)) && (!dso->versym || dso->versym[i] >= 0)
&& !strcmp(s, dso->strings + dso->syms[i].st_name))
@@ -298,7 +313,9 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
if (!(f & 1)) return 0;
- return gnu_lookup(h1, hashtab, dso, s);
+ void *p = gnu_lookup(h1, hashtab, dso, s);
+ if (!p) rsc.bloomfail++;
+ return p;
}
#define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS)
@@ -313,15 +330,19 @@ __attribute__((always_inline))
#endif
static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
{
+ rsc.n++;
+ rsc.symlen += strlen(s);
uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
size_t ghm = 1ul << gh % (8*sizeof(size_t));
struct symdef def = {0};
struct dso **deps = use_deps ? dso->deps : 0;
for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
Sym *sym;
+ rsc.lookup++;
if ((ght = dso->ghashtab)) {
sym = gnu_lookup_filtered(gh, ght, dso, s, gho, ghm);
} else {
+ sysvonly++;
if (!h) h = sysv_hash(s);
sym = sysv_lookup(s, h, dso);
}
@@ -400,6 +421,7 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
skip_relative = 1;
}
+ prev_sym_index = 0;
for (; rel_size; rel+=stride, rel_size-=stride*sizeof(size_t)) {
if (skip_relative && IS_RELATIVE(rel[1], dso->syms)) continue;
type = R_TYPE(rel[1]);
@@ -426,9 +448,33 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
sym = syms + sym_index;
name = strings + sym->st_name;
ctx = type==REL_COPY ? head->syms_next : head;
- def = (sym->st_info>>4) == STB_LOCAL
- ? (struct symdef){ .dso = dso, .sym = sym }
- : find_sym(ctx, name, type==REL_PLT);
+ if ((sym->st_info>>4) == STB_LOCAL) {
+ def = (struct symdef){ .dso = dso, .sym = sym };
+ } else {
+#define F(x) r->x += rsc.x
+ struct rstat *r = rs + 4; // lfs64, etc counts
+ F(n);
+ F(symlen);
+ F(lookup);
+ F(bloomfail);
+ F(hashchain);
+ F(nstrcmp);
+ rsc = (struct rstat){0};
+
+ def = find_sym(ctx, name, type==REL_PLT);
+ int rep = prev_sym_index == sym_index;
+ int loc = def.dso == dso;
+ prev_sym_index = sym_index;
+ r = rs + rep + loc*2; // repeat and local counts
+ F(n);
+ F(symlen);
+ F(lookup);
+ F(bloomfail);
+ F(hashchain);
+ F(nstrcmp);
+ rsc = (struct rstat){0};
+#undef F
+ }
if (!def.sym) def = get_lfs64(name);
if (!def.sym && (sym->st_shndx != SHN_UNDEF
|| sym->st_info>>4 != STB_WEAK)) {
@@ -2043,6 +2089,43 @@ void __dls3(size_t *sp, size_t *auxv)
libc.tls_size = tmp_tls_size;
}
+ if (ldd_mode) {
+ const char *rcase[] = {
+ "!r !l:",
+ " r !l:",
+ "!r l:",
+ " r l:",
+ "other:",
+ "total:"};
+ for (int i=0; i<5; i++) {
+#define F(x) rs[5].x += rs[i].x
+ F(n);
+ F(symlen);
+ F(lookup);
+ F(bloomfail);
+ F(hashchain);
+ F(nstrcmp);
+#undef F
+ }
+ dprintf(1,"total sysvonly lookups: %lu\n", sysvonly);
+ for (int i=0; i<6; i++) {
+ dprintf(1,"%s n %5lu %4.1f%% slen %4.2f %4.1f%% iter %.2f"
+ " ff %.2f hc %.2f cmp %.2f\n",
+ rcase[i],
+#define F(x) ((double)rs[i].x/rs[i].n)
+#define G(x) ((double)100*rs[i].x/rs[5].x)
+ rs[i].n,
+ G(n),
+ F(symlen),
+ G(symlen),
+ F(lookup),
+ F(bloomfail),
+ F(hashchain),
+ F(nstrcmp));
+#undef F
+#undef G
+ }
+ }
if (ldso_fail) _exit(127);
if (ldd_mode) _exit(0);
--
2.47.2
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-08 16:20 ` Rich Felker
2025-12-08 23:14 ` Szabolcs Nagy
@ 2025-12-09 1:10 ` Fabian Rast
2025-12-09 1:27 ` Rich Felker
2026-02-03 17:13 ` Fangrui Song
2 siblings, 1 reply; 18+ messages in thread
From: Fabian Rast @ 2025-12-09 1:10 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
[-- Attachment #1: Type: text/plain, Size: 2248 bytes --]
On Mon Dec 8, 2025 at 5:20 PM CET, Rich Felker wrote:
> I was thinking the bloom filter would include everything in the symbol
> table, including both symbols that provide definitions and symbol
> table entries that are used to reference undefined symbols. In this
> case, we could use the bloom filter of the DSO making the reference to
> determine which of the two candidate hashes is correct, and do so very
> quickly (nothing but a few bitshifts and masks).
>
> However, it seems reasonable that the bloom filter would only include
> symbols that actually have definitions. Otherwise at each step of the
> DSO chain you'd get a false positive for every DSO that references the
> symbol without defining it. So we probably can't do it the way I'd
> hoped.
I think only defined symbols are included.
But UNDEF symbols are usually also not included in the hashtable - they
are usually sorted to the beginning, and the gnu hash table records a
constant symbol-index offset to exclude them. So for these undef symbols,
where the bloom filter would not work, we do not even have the other 31
bits, because the symbol is not in the hashtable. Of course the hash is
also a link-time constant for these symbols, but it's just not in the binary...
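To make that concrete, here is how retrieving the stored 31-bit hash by symbol index works, and why it fails for symbols below the table's symbol-index offset (illustrative sketch; the maskword width is passed in explicitly so the layout assumption stays visible):

```c
#include <stdint.h>
#include <stddef.h>

/* DT_GNU_HASH layout: header [nbuckets, symoffset, maskwords, shift],
 * then bloom[maskwords] (size_t words), buckets[nbuckets], hashval[].
 * hashval[i] holds the top 31 bits of the hash of dynsym entry
 * symoffset+i; the lsb marks the end of a hash chain. */
static uint32_t stored_hash(const uint32_t *ght, unsigned u32_per_maskword,
                            uint32_t sym_index)
{
	uint32_t nbuckets = ght[0], symoffset = ght[1], maskwords = ght[2];
	if (sym_index < symoffset)
		return 0;  /* undef-only symbol: no stored hash, must compute */
	const uint32_t *buckets = ght + 4 + maskwords * u32_per_maskword;
	const uint32_t *hashval = buckets + nbuckets;
	return hashval[sym_index - symoffset] & ~1u;  /* strip chain-end bit */
}
```

On 64-bit targets u32_per_maskword would be 2, i.e. sizeof(size_t)/4, matching the ght[2]*(sizeof(size_t)/4) bloom-skip expression in the attached patch.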
>> > But if undefined symbols are present, then here's my concept for how
>> > it would work:
>> >
>> > It should consist of grabbing two consecutive bits aligned at a 2-bit
>> > boundary (so not crossing slots) from the bloom filter, and
>> > considering the possible values:
>> >
>> > 00 - impossible
>> > 01 - low bit is 0
>> > 10 - low bit is 1
>> > 11 - inconclusive; just compute hash. should be rare with good bloom.
This sounds like it should work. I will try to implement this and see.
> I think modern linkers sort relocations by the symbol referenced,
> which allows you to bypass the lookup if the reference is the same as
> the previous relocation. We do not take advantage of this in the
> do_relocs loop at all, but we could. That would probably give far more
> performance boost than eliding the hashing and might even make eliding
> the hashing pointless.
I have already implemented this and will send a patch.
Best Regards,
Fabian Rast
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-09 1:10 ` Fabian Rast
@ 2025-12-09 1:27 ` Rich Felker
2025-12-22 16:37 ` Fabian Rast
0 siblings, 1 reply; 18+ messages in thread
From: Rich Felker @ 2025-12-09 1:27 UTC (permalink / raw)
To: Fabian Rast; +Cc: musl
On Tue, Dec 09, 2025 at 02:10:46AM +0100, Fabian Rast wrote:
> On Mon Dec 8, 2025 at 5:20 PM CET, Rich Felker wrote:
> > I was thinking the bloom filter would include everything in the symbol
> > table, including both symbols that provide definitions and symbol
> > table entries that are used to reference undefined symbols. In this
> > case, we could use the bloom filter of the DSO making the reference to
> > determine which of the two candidate hashes is correct, and do so very
> > quickly (nothing but a few bitshifts and masks).
> >
> > However, it seems reasonable that the bloom filter would only include
> > symbols that actually have definitions. Otherwise at each step of the
> > DSO chain you'd get a false positive for every DSO that references the
> > symbol without defining it. So we probably can't do it the way I'd
> > hoped.
>
> I think only defined symbols are included.
> But UNDEF symbols are usually also not included in the hashtable - they
> are usually sorted to the beginning, and the gnu hash table records a
> constant symbol-index offset to exclude them. So for these undef symbols,
> where the bloom filter would not work, we do not even have the other 31
> bits, because the symbol is not in the hashtable. Of course the hash is
> also a link-time constant for these symbols, but it's just not in the binary...
>
> >> > But if undefined symbols are present, then here's my concept for how
> >> > it would work:
> >> >
> >> > It should consist of grabbing two consecutive bits aligned at a 2-bit
> >> > boundary (so not crossing slots) from the bloom filter, and
> >> > considering the possible values:
> >> >
> >> > 00 - impossible
> >> > 01 - low bit is 0
> >> > 10 - low bit is 1
> >> > 11 - inconclusive; just compute hash. should be rare with good bloom.
>
> This sounds like it should work. I will try to implement this and see.
Since the reference-only symbols aren't in the hashtable, I don't
think there's anything useful you can do with this.
Rich
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-08 23:14 ` Szabolcs Nagy
@ 2025-12-09 18:31 ` Szabolcs Nagy
0 siblings, 0 replies; 18+ messages in thread
From: Szabolcs Nagy @ 2025-12-09 18:31 UTC (permalink / raw)
To: Rich Felker, Fabian Rast, musl
* Szabolcs Nagy <nsz@port70.net> [2025-12-09 00:14:04 +0100]:
> so there are two optimizations:
> repeat: consecutive relocs referencing the same sym
> local: relocs referencing a locally defined sym
...
> clang:
> !r !l: n 3667 7.8% slen 45.93 7.3% iter 3.70 ff 0.16 hc 3.00 cmp 0.99
> r !l: n 14255 30.4% slen 35.92 22.1% iter 3.93 ff 0.58 hc 4.48 cmp 1.00
> !r l: n 20932 44.6% slen 62.88 56.8% iter 3.25 ff 0.14 hc 3.06 cmp 1.00
> r l: n 8087 17.2% slen 39.39 13.8% iter 2.93 ff 0.18 hc 3.16 cmp 1.00
> other: n 11 0.0% slen 15.09 0.0% iter 1.00 ff 0.00 hc 0.27 cmp 11.82
> total: n 46952 100.0% slen 49.31 100.0% iter 3.43 ff 0.28 hc 3.50 cmp 1.00
fwiw clang is built with -Bsymbolic-functions[1] and all
the l relocs (locally defined) in libllvm and libclang are
STT_OBJECT, mainly for typeinfo names and vtables which i
guess makes sense to be interposable for templated types
and in the presence of copy relocs for normal types too
(assuming these objects must have a unique address, not
sure if that's the case as they are implementation details)
[1]: https://maskray.me/blog/2021-05-16-elf-interposition-and-bsymbolic
i found this comment:
"As a data point, when building the Linux kernel's x86_64 defconfig
with a clang -fPIC built clang, my build is 15% faster if I add
-Bsymbolic-functions to libLLVM.so and libclang-cpp.so. I cannot
tell the performance difference with a mostly statically linked PIE
clang. From llvm-project 13.0.0 onwards, the build system uses
-Bsymbolic-functions by default."
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-09 1:27 ` Rich Felker
@ 2025-12-22 16:37 ` Fabian Rast
2025-12-22 18:40 ` Alexander Monakov
0 siblings, 1 reply; 18+ messages in thread
From: Fabian Rast @ 2025-12-22 16:37 UTC (permalink / raw)
To: Rich Felker; +Cc: musl
[-- Attachment #1.1.1: Type: text/plain, Size: 6332 bytes --]
On Tue Dec 9, 2025 at 2:27 AM CET, Rich Felker wrote:
> Since the reference-only symbols aren't in the hashtable, I don't
> think there's anything useful you can do with this.
I am not sure I understand this correctly. I think it is possible to
use the bloom filter to infer the missing hash bit, just as you
proposed, in the case where a lookup is performed for a symbol that is
also defined by the current dso.
If "useful" refers to actual usefulness for musl in particular, I agree
that this optimization might not be worthwhile, because I consider it a
hack - the gnu hash table surely was never meant to be used in this
way.
Just for completeness I implemented pretty much your initial proposal:
> One way to do this would be to use the bloom filter from the dso
> making the symbol reference: if only one of the two candidates passes
> it, we know which candidate was correct. I would expect this to happen
> most of the time, if the bloom filter is actually doing anything
> useful to begin with.
As people previously pointed out in the discussion, one of the two bits
tested per hash is shared between the two candidate hashes h0 and h1, which
means in the best case only one bit has to be checked; in the worst case
two bits are checked and the hash has to be calculated anyway.
Szabolcs Nagy predicted a 70%-80% success rate for this method.
This holds up in some practical examples:
- clang
Total 48253 bloom method applicable 63.3% (30549), success rate: 70.0% (21381) total: 44.3%
- gsx
Total 30320 bloom method applicable 58.3% (17665), success rate: 74.0% (13068) total: 43.1%
- ffmpeg
Total 33351 bloom method applicable 60.3% (20111), success rate: 73.6% (14811) total: 44.4%
- mpv
Total 67330 bloom method applicable 58.6% (39427), success rate: 71.4% (28170) total: 41.8%
- libreoffice
Total 154576 bloom method applicable 27.3% (42206), success rate: 75.9% (32037) total: 20.7%
- webkit2gtk
Total 137514 bloom method applicable 34.6% (47560), success rate: 75.1% (35734) total: 26.0%
- libxul
Total 30188 bloom method applicable 44.6% (13455), success rate: 75.3% (10126) total: 33.5%
The remaining benchmarks show similar results to the version with the two divides.
This optimization is most effective for clang and less so for the other examples.
master -> precomp bloom
/usr/bin/clang
cycles: 32551695 (0.66) -> 26122400 (0.65) -19.75%
instructions: 58840422 (0.01) -> 47172812 (0.01) -19.83%
ref-cycles: 23629302 (1.22) -> 19315295 (1.23) -18.26%
duration_time: 12734736 (1.24) -> 10485433 (1.23) -17.66%
/usr/bin/gsx
cycles: 45676684 (0.31) -> 43229878 (0.33) -5.36%
instructions: 56504823 (0.01) -> 40754248 (0.02) -27.87%
ref-cycles: 33672982 (0.98) -> 31472548 (0.87) -6.53%
duration_time: 18184999 (1.0) -> 17036436 (0.88) -6.32%
/usr/bin/ffmpeg
cycles: 72039503 (0.37) -> 67931365 (0.25) -5.7%
instructions: 86761463 (0.01) -> 58112029 (0.01) -33.02%
ref-cycles: 49514189 (1.04) -> 46358979 (1.01) -6.37%
duration_time: 26200043 (1.04) -> 24564555 (1.02) -6.24%
/usr/bin/mpv
cycles: 218613993 (0.24) -> 211474537 (0.13) -3.27%
instructions: 265004308 (0.0) -> 161926673 (0.0) -38.9%
ref-cycles: 133443129 (0.62) -> 128415393 (0.45) -3.77%
duration_time: 68599776 (0.63) -> 66095647 (0.46) -3.65%
/usr/lib/libreoffice/program/soffice.bin
cycles: 175438767 (0.18) -> 170710689 (0.17) -2.7%
instructions: 266312008 (0.0) -> 199613021 (0.0) -25.05%
ref-cycles: 110087107 (0.5) -> 106251019 (0.5) -3.48%
duration_time: 56843553 (0.5) -> 54922004 (0.5) -3.38%
/usr/lib/libwebkit2gtk-4.1.so.0
cycles: 234214481 (0.16) -> 224880118 (0.11) -3.99%
instructions: 329852233 (0.0) -> 233217195 (0.0) -29.3%
ref-cycles: 120347198 (1.09) -> 135856350 (0.6) 12.89%
duration_time: 62053249 (1.08) -> 69948475 (0.59) 12.72%
/usr/lib/firefox/libxul.so
cycles: 59222837 (0.27) -> 56319892 (0.27) -4.9%
instructions: 72834785 (0.01) -> 53201233 (0.01) -26.96%
ref-cycles: 40414897 (1.02) -> 38522864 (0.92) -4.68%
duration_time: 21500525 (1.01) -> 20588222 (0.91) -4.24%
Interestingly, there is a big regression in performance loading libwebkit2gtk
that I cannot explain...
On my cpu (AMD Ryzen AI 9 365) the version with the two divides performs a
little bit better than this version:
Benchmark 1 (1471 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-bloom-install/lib/ /home/fr/src/musl/precomp-bloom-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
measurement mean ± σ min … max outliers delta
wall_time 6.76ms ± 584us 4.69ms … 8.84ms 16 ( 1%) 0%
peak_rss 12.7MB ± 101KB 12.3MB … 12.8MB 20 ( 1%) 0%
cpu_cycles 9.18M ± 747K 7.64M … 12.2M 70 ( 5%) 0%
instructions 14.9M ± 919 14.9M … 14.9M 48 ( 3%) 0%
cache_references 469K ± 4.10K 458K … 489K 13 ( 1%) 0%
cache_misses 82.9K ± 1.48K 79.8K … 90.7K 109 ( 7%) 0%
branch_misses 80.8K ± 781 79.3K … 92.1K 4 ( 0%) 0%
Benchmark 2 (1533 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/precomp-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
measurement mean ± σ min … max outliers delta
wall_time 6.49ms ± 656us 3.93ms … 8.82ms 58 ( 4%) ⚡- 3.9% ± 0.7%
peak_rss 12.7MB ± 101KB 12.3MB … 12.8MB 22 ( 1%) - 0.0% ± 0.1%
cpu_cycles 8.95M ± 1.11M 7.22M … 14.3M 129 ( 8%) ⚡- 2.6% ± 0.7%
instructions 14.7M ± 857 14.7M … 14.7M 46 ( 3%) ⚡- 1.2% ± 0.0%
cache_references 468K ± 5.24K 455K … 493K 8 ( 1%) - 0.2% ± 0.1%
cache_misses 83.0K ± 1.67K 79.4K … 89.3K 38 ( 2%) + 0.1% ± 0.1%
branch_misses 76.9K ± 580 75.5K … 78.8K 7 ( 0%) ⚡- 4.9% ± 0.1%
I have attached the patch for reference.
Best Regards,
Fabian Rast
[-- Attachment #1.2: v3-0001-ldso-skip-gnu-hash-calculation-if-precomputed.patch --]
[-- Type: text/x-patch, Size: 3484 bytes --]
From 96147998b34c85786d55bf2506cbd2fc60668e1a Mon Sep 17 00:00:00 2001
From: Fabian Rast <fabian.rast@tum.de>
Date: Tue, 2 Dec 2025 12:46:41 +0100
Subject: [PATCH v3] ldso: skip gnu hash calculation if precomputed
calculating the hashes of symbol names takes a significant amount of
time while applying relocations. however, in case of symbol lookups for
symbols that are themselves part of the gnu hashtable in the object
being relocated, the hashvalue is already present in that hashtable
and can be efficiently retrieved using the symbol index.
the gnu hash table only stores the upper 31 bits of the 32 bit wide
hashes. the lsb is used to indicate the end of a chain.
this missing bit can be inferred by guessing and checking the
bloom filter to exclude incorrect guesses. In case the bloom filter
check is positive for both possibilities, this optimization does
not apply.
---
ldso/dynlink.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/ldso/dynlink.c b/ldso/dynlink.c
index 44d9b181..189bace1 100644
--- a/ldso/dynlink.c
+++ b/ldso/dynlink.c
@@ -320,10 +320,24 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
#if defined(__GNUC__)
__attribute__((always_inline))
#endif
-static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
+static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps, struct dso *dso2, int sym_index)
{
- uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
- size_t ghm = 1ul << gh % (8*sizeof(size_t));
+ uint32_t h = 0, gh, gho, *ght;
+ size_t ghm;
+
+ if (dso2 && (ght = dso2->ghashtab) && sym_index >= ght[1]) {
+ uint32_t nbuckets = ght[0], *buckets = ght+4+ght[2]*(sizeof(size_t)/4);
+ uint32_t h0 = buckets[nbuckets+sym_index-ght[1]] & ~1u, h1 = h0|1,
+ r = h0 % (8*sizeof(size_t));
+ size_t f = ((size_t *)(ght+4))[(h0 / (8*sizeof(size_t))) & (ght[2]-1)],
+ i = ((f >> r) | (f << 64-r)) & 3;
+ if (!(gh = (uint32_t[3]){h0, h1, 0}[i-1]))
+ gh = gnu_hash(s);
+ } else gh = gnu_hash(s);
+
+ gho = gh / (8*sizeof(size_t));
+ ghm = 1ul << gh % (8*sizeof(size_t));
+
struct symdef def = {0};
struct dso **deps = use_deps ? dso->deps : 0;
for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
@@ -353,7 +367,7 @@ static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_d
static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
{
- return find_sym2(dso, s, need_def, 0);
+ return find_sym2(dso, s, need_def, 0, NULL, STN_UNDEF);
}
static struct symdef get_lfs64(const char *name)
@@ -437,7 +451,7 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
ctx = type==REL_COPY ? head->syms_next : head;
def = (sym->st_info>>4) == STB_LOCAL
? (struct symdef){ .dso = dso, .sym = sym }
- : find_sym(ctx, name, type==REL_PLT);
+ : find_sym2(ctx, name, type==REL_PLT, 0, dso, sym_index);
if (!def.sym) def = get_lfs64(name);
if (!def.sym && (sym->st_shndx != SHN_UNDEF
|| sym->st_info>>4 != STB_WEAK)) {
@@ -2345,7 +2359,7 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
return 0;
} else
use_deps = 1;
- struct symdef def = find_sym2(p, s, 0, use_deps);
+ struct symdef def = find_sym2(p, s, 0, use_deps, NULL, STN_UNDEF);
if (!def.sym) {
error("Symbol not found: %s", s);
return 0;
--
2.52.0
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-22 16:37 ` Fabian Rast
@ 2025-12-22 18:40 ` Alexander Monakov
2025-12-22 19:21 ` Fabian Rast
2026-02-03 7:17 ` Fangrui Song
0 siblings, 2 replies; 18+ messages in thread
From: Alexander Monakov @ 2025-12-22 18:40 UTC (permalink / raw)
To: Fabian Rast; +Cc: Rich Felker, musl
On Mon, 22 Dec 2025, Fabian Rast wrote:
> /usr/lib/libwebkit2gtk-4.1.so.0
> cycles: 234214481 (0.16) -> 224880118 (0.11) -3.99%
> instructions: 329852233 (0.0) -> 233217195 (0.0) -29.3%
> ref-cycles: 120347198 (1.09) -> 135856350 (0.6) 12.89%
> duration_time: 62053249 (1.08) -> 69948475 (0.59) 12.72%
>
[...]
>
> Interestingly, there is a big regression in performance loading libwebkit2gtk
> that I cannot explain...
You have a reduction in actual CPU cycles (-4%, similar to the others) and an
increase in reference cycles (with a matching increase in wall-clock time),
which means that your CPU was running at a lower frequency in this test. If it
is reproducible, perhaps it's an artifact of how the CPU's automatic frequency
scaling works: without the hashing loop, most of the remaining work is low-IPC
(lots of TLB and cache misses), so maybe the CPU doesn't raise clocks in this
case. You can test, e.g., by running at a fixed frequency.
(You may already know, but just in case: the cpupower tool can be used to
request a fixed frequency with the 'performance' governor; to disable turbo
clocks on AMD CPUs, write '0' to /sys/devices/system/cpu/cpufreq/boost.)
Thank you for your continued work on this!
Alexander
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-22 18:40 ` Alexander Monakov
@ 2025-12-22 19:21 ` Fabian Rast
2026-02-03 7:17 ` Fangrui Song
1 sibling, 0 replies; 18+ messages in thread
From: Fabian Rast @ 2025-12-22 19:21 UTC (permalink / raw)
To: Alexander Monakov; +Cc: Rich Felker, musl
On Mon Dec 22, 2025 at 7:40 PM CET, Alexander Monakov wrote:
> You have a reduction in actual CPU cycles (-4%, similar to others) and an
> increase of the reference counter cycles (and the matching increase in
> wall-clock time), which means that your CPU was running at a lower frequency
> in this test. If it is reproducible, perhaps it's an artifact of how CPU's
> automatic frequency scaling works: without the hashing loop, most of the
> remaining work is low-IPC (lots of TLB and cache misses), so maybe it doesn't
> raise clocks in this case. You can test e.g. by running at a fixed frequency.
Yes, I reproduced this a couple of times and the explanation makes sense.
> (you may already know, but just in case: cpupower tool can be used for
> requesting a fixed frequency with the 'performance' governor; to disable
> turbo clocks on AMD CPUs, write '0' to /sys/devices/system/cpu/cpufreq/boost )
I did not know, so thanks a lot for this info; I really appreciate it!
The wall-time regression is gone when benchmarking with turbo boost disabled:
master -> precomp bloom
/usr/bin/clang
cycles: 30595609 (0.52) -> 24689870 (0.62) -19.3%
instructions: 59173944 (0.01) -> 47476698 (0.01) -19.77%
ref-cycles: 30629075 (0.54) -> 24725304 (0.65) -19.28%
duration_time: 16338501 (0.58) -> 13351888 (0.71) -18.28%
/usr/bin/gsx
cycles: 41078342 (0.38) -> 36546838 (0.32) -11.03%
instructions: 56794652 (0.02) -> 41060367 (0.02) -27.7%
ref-cycles: 41047786 (0.4) -> 36539161 (0.33) -10.98%
duration_time: 22042062 (0.47) -> 19687133 (0.37) -10.68%
/usr/bin/ffmpeg
cycles: 61538791 (0.25) -> 54242735 (0.26) -11.86%
instructions: 87089975 (0.01) -> 58410309 (0.01) -32.93%
ref-cycles: 61433880 (0.27) -> 54181257 (0.28) -11.81%
duration_time: 32478566 (0.3) -> 28788301 (0.34) -11.36%
/usr/bin/mpv
cycles: 170858084 (0.16) -> 149992274 (0.12) -12.21%
instructions: 265719013 (0.0) -> 162564417 (0.01) -38.82%
ref-cycles: 169971792 (0.16) -> 149229465 (0.13) -12.2%
duration_time: 87288228 (0.18) -> 76845305 (0.15) -11.96%
/usr/lib/libreoffice/program/soffice.bin
cycles: 139990660 (0.21) -> 124937532 (0.22) -10.75%
instructions: 267200691 (0.0) -> 200466090 (0.0) -24.98%
ref-cycles: 139272339 (0.21) -> 124308124 (0.22) -10.74%
duration_time: 71883190 (0.23) -> 64386860 (0.24) -10.43%
/usr/lib/libwebkit2gtk-4.1.so.0
cycles: 188684348 (0.16) -> 168941948 (0.13) -10.46%
instructions: 330772629 (0.0) -> 233988146 (0.01) -29.26%
ref-cycles: 187666505 (0.16) -> 168063105 (0.13) -10.45%
duration_time: 96330834 (0.18) -> 86599671 (0.15) -10.1%
/usr/lib/firefox/libxul.so
cycles: 54223456 (0.24) -> 49060865 (0.23) -9.52%
instructions: 73184185 (0.01) -> 53532082 (0.01) -26.85%
ref-cycles: 54127522 (0.25) -> 49003711 (0.23) -9.47%
duration_time: 28769607 (0.28) -> 26139117 (0.28) -9.14%
Now the "cycles" stats are also much closer to the "ref-cycles".
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-22 18:40 ` Alexander Monakov
2025-12-22 19:21 ` Fabian Rast
@ 2026-02-03 7:17 ` Fangrui Song
2026-02-03 10:29 ` Fabian Rast
1 sibling, 1 reply; 18+ messages in thread
From: Fangrui Song @ 2026-02-03 7:17 UTC (permalink / raw)
To: Fabian Rast; +Cc: Rich Felker, musl
On Mon, Dec 22, 2025 at 10:40 AM Alexander Monakov <amonakov@ispras.ru> wrote:
>
>
> On Mon, 22 Dec 2025, Fabian Rast wrote:
>
> > /usr/lib/libwebkit2gtk-4.1.so.0
> > cycles: 234214481 (0.16) -> 224880118 (0.11) -3.99%
> > instructions: 329852233 (0.0) -> 233217195 (0.0) -29.3%
> > ref-cycles: 120347198 (1.09) -> 135856350 (0.6) 12.89%
> > duration_time: 62053249 (1.08) -> 69948475 (0.59) 12.72%
> >
> [...]
> >
> > Interestingly, there is a big regression in performance loading libwebkit2gtk
> > that I cannot explain...
>
> You have a reduction in actual CPU cycles (-4%, similar to others) and an
> increase of the reference counter cycles (and the matching increase in
> wall-clock time), which means that your CPU was running at a lower frequency
> in this test. If it is reproducible, perhaps it's an artifact of how CPU's
> automatic frequency scaling works: without the hashing loop, most of the
> remaining work is low-IPC (lots of TLB and cache misses), so maybe it doesn't
> raise clocks in this case. You can test e.g. by running at a fixed frequency.
>
> (you may already know, but just in case: cpupower tool can be used for
> requesting a fixed frequency with the 'performance' governor; to disable
> turbo clocks on AMD CPUs, write '0' to /sys/devices/system/cpu/cpufreq/boost )
>
> Thank you for your continued work on this!
>
> Alexander
Can the gh condition be optimized to
gh = !i1 || (i0 < i1 ? sym_index < i1 : sym_index >= i0) ? h0 : h1;
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2026-02-03 7:17 ` Fangrui Song
@ 2026-02-03 10:29 ` Fabian Rast
0 siblings, 0 replies; 18+ messages in thread
From: Fabian Rast @ 2026-02-03 10:29 UTC (permalink / raw)
To: Fangrui Song; +Cc: musl
On Tue Feb 3, 2026 at 8:17 AM CET, Fangrui Song wrote:
> Can the gh condition be optimized to
>
> gh = !i1 || (i0 < i1 ? sym_index < i1 : sym_index >= i0) ? h0 : h1;
Yes, the only difference is for `i0 == i1`, which should not
occur for any valid hash table.
gcc generates better code for your version: https://godbolt.org/z/zTsbqPn5a
As I understand it, the main problem was the two divides (or one divide plus
one branch) required to compute i0 and i1 in the first place.
Thanks,
Fabian Rast
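[Editorial sketch] The equivalence noted above can be brute-forced over small values. The check below is standalone code, not musl's; it encodes both forms of the condition (returning 1 when the true hash is the even candidate h0) and restricts the comparison to inputs that can actually arise, i.e. the symbol index lies in one of the two chains and i0 != i1.

```c
#include <stdint.h>

/* The patch's original condition: 1 means "true hash is h0 (even)". */
static int cond_patch(uint32_t i0, uint32_t i1, uint32_t s)
{
	return (i0 < i1 && s < i1) || (i0 > i1 && s >= i0);
}

/* Fangrui's proposed form, which gcc compiles to better code. */
static int cond_fangrui(uint32_t i0, uint32_t i1, uint32_t s)
{
	return !i1 || (i0 < i1 ? s < i1 : s >= i0);
}

/* Brute force over small values: returns 1 iff both forms agree on every
 * reachable input. A bucket value of 0 means "empty bucket"; chains are
 * contiguous runs of symbol indices, so symbol s is in bucket i0's chain
 * only when s >= i0 and s precedes the next chain start (if any). */
static int check_equivalence(void)
{
	for (uint32_t i0 = 0; i0 < 8; i0++)
	for (uint32_t i1 = 0; i1 < 8; i1++) {
		if (i0 == i1) continue; /* impossible for a valid table */
		for (uint32_t s = 1; s < 8; s++) {
			int in0 = i0 && s >= i0 && (!i1 || i1 < i0 || s < i1);
			int in1 = i1 && s >= i1 && (!i0 || i0 < i1 || s < i0);
			if (!in0 && !in1) continue; /* s in neither chain */
			if (cond_patch(i0, i1, s) != cond_fangrui(i0, i1, s))
				return 0;
		}
	}
	return 1;
}
```

Without the reachability restriction the two forms differ (e.g. i1 == 0 with s < i0), which is why the i0 == i1 caveat above matters only for valid tables.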
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2025-12-08 16:20 ` Rich Felker
2025-12-08 23:14 ` Szabolcs Nagy
2025-12-09 1:10 ` Fabian Rast
@ 2026-02-03 17:13 ` Fangrui Song
2026-02-05 17:37 ` Florian Weimer
2 siblings, 1 reply; 18+ messages in thread
From: Fangrui Song @ 2026-02-03 17:13 UTC (permalink / raw)
To: musl; +Cc: Fabian Rast
On Mon, Dec 8, 2025 at 8:20 AM Rich Felker <dalias@libc.org> wrote:
>
> On Mon, Dec 08, 2025 at 06:20:58AM +0100, Szabolcs Nagy wrote:
> > * Rich Felker <dalias@libc.org> [2025-12-07 21:25:42 -0500]:
> > > On Sun, Dec 07, 2025 at 08:44:54PM -0500, Rich Felker wrote:
> > > > This looks a lot more expensive than the method I suggested using the
> > > > bloom filter, which has no modulo, much less two.
> > >
> > > Actually, I'm not sure if the bloom filter approach works. Is a hash
> > > included in the bloom filter as soon as it's present in the symbol
> > > table, or only if it's present with a definition?
> > >
> > > If it's absent for undefined, then this approach is a non-starter. It
> > > can't tell us anything.
> >
> > isn't it the other way around?
> >
> > only defined symbols set bits in the bloom filter.
> > it is for checking if a dso might have a definition.
>
> I think you misunderstood what I was saying and looking for.
>
> I was thinking the bloom filter would include everything in the symbol
> table, including both symbols that provide definitions and symbol
> table entries that are used to reference undefined symbols. In this
> case, we could use the bloom filter of the DSO making the reference to
> determine which of the two candidate hashes is correct, and do so very
> quickly (nothing but a few bitshifts and masks).
>
> However, it seems reasonable that the bloom filter would only include
> symbols that actually have definitions. Otherwise at each step of the
> DSO chain you'd get a false positive for every DSO that references the
> symbol without defining it. So we probably can't do it the way I'd
> hoped.
Google's modified glibc (GRTE) has a "fastload" patch that dramatically
speeds up symbol resolution by building a position hash table at
startup that records, for each symbol, the earliest position in the
link map where that symbol can be found.
During symbol lookup, instead of scanning from position 0, the rtld
can skip directly to the relevant position.
https://sourceware.org/bugzilla/show_bug.cgi?id=16709 ("Dynamic
linking with large number of DSOs degrades into linear lookup")
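[Editorial sketch] The position-table idea can be modeled minimally as follows. This is purely illustrative: the struct and function names are invented here, and a real implementation would use a hash table rather than a linear scan.

```c
#include <stddef.h>
#include <string.h>

/* Maps a symbol name to the earliest link-map position defining it, so a
 * lookup can start at that position instead of scanning every DSO from 0. */
struct pos_entry { const char *name; int first_pos; };

static int fastload_pos(const struct pos_entry *tab, size_t n, const char *s)
{
	for (size_t i = 0; i < n; i++)
		if (!strcmp(tab[i].name, s))
			return tab[i].first_pos;
	return 0; /* unknown symbol: fall back to a full scan from 0 */
}
```

With 1000+ DSOs in the link map, skipping straight to the defining position turns a linear walk into a near-constant lookup for the common case.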
This is particularly impactful for Google's environment, where
executables often link against more than 1000 DSOs - common when using
Bazel’s cc_test with the default dynamic_mode=fully.
However, adopting such a mechanism would require a bit of code.
If we want to add a new section to speed up symbol lookups, we can
take inspiration from Solaris direct bindings and macOS two-level
namespace.
Feel free to bring well-formed proposals to
https://groups.google.com/g/generic-abi and gnu-gabi.
(DT_GNU_HASH is generally off-limits for the generic-abi list, as
it lacks support from Solaris.
https://groups.google.com/g/generic-abi/c/9L03yrxXPBc )
> > > But if undefined symbols are present, then here's my concept for how
> > > it would work:
> > >
> > > It should consist of grabbing two consecutive bits aligned at a 2-bit
> > > boundary (so not crossing slots) from the bloom filter, and
> > > considering the possible values:
> > >
> > > 00 - impossible
> > > 01 - low bit is 0
> > > 10 - low bit is 1
> > > 11 - inconclusive; just compute hash. should be rare with good bloom.
> >
> > yeah, one of the bits is 'in the filter' the other
> > is a random bit test: if p portion of the bits are
> > 1 then with p chance we fail.
> >
> > in bfd ld the bloom mask seems to be sized such that
> > if nsym in [ 3<<k, 3<<(k+1) ) then maskbits = 32<<k
> > so 3/32 <= nsym/maskbits < 6/32
> >
> > each symbol sets two bits so the probability that
> > a bit is 0 is (1-1/maskbits)^(2*nsym) so a bit is 1
> > with roughly p = 1 - exp(-2*nsym/maskbits) that is
> > 0.17 < p < 0.32 probability. (normally two bits
> > are tested so bloom fails with p^2, but the other
> > bit is shared between h&-2 and h|1)
> >
> > if this is right and the hash is good, then we have
> > about 20-30% chance to fail and 70-80% to succeed
> > just with that one bit.
> >
> >
> > it seems to me that with many dsos (>50), it's not
> > the gnu_hash but the walk over all dsos and checking
> > gnu_lookup_filtered will be the slow part. not
> > sure if it's possible to help with that.
> > i also noticed that some symbols are looked up a lot:
> > in clang >30% of lookups are (multiple relocs per dso)
> > __cxa_pure_virtual
> > _ZTVN10__cxxabiv120__si_class_type_infoE
> > _ZTVN10__cxxabiv117__class_type_infoE
> > _ZTVN10__cxxabiv121__vmi_class_type_infoE
> > and these generate fair bit of lookups too (undef weak):
> > _ITM_deregisterTMCloneTable
> > _ITM_registerTMCloneTable
> > __deregister_frame_info
> > __register_frame_info
>
> I think modern linkers sort relocations by the symbol referenced,
> which allows you to bypass the lookup if the reference is the same as
> the previous relocation. We do not take advantage of this in the
> do_relocs loop at all, but we could. That would probably give far more
> performance boost than eliding the hashing and might even make eliding
> the hashing pointless.
>
> Rich
Exactly. lld/ELF sorts dynamic relocations in .rela.dyn by
(!IsRelative, SymIndex, r_offset), so non-relative relocations are
grouped by SymIndex.
> https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#:~:text=dynamic%20relocations
> Each symbol is associated with very few dynamic relocations, typically 1 or 2 (R_*_JUMP_SLOT and R_*_GLOB_DAT). When a symbol is associated with more dynamic relocations, it is typically a base class function residing in multiple C++ virtual tables, e.g. __cxa_pure_virtual. -fexperimental-relative-c++-abi-vtables would eliminate such dynamic relocations.
If we look at a specific DSO, the "same symbol" optimization provides
little benefit outside of specific cases like `__cxa_pure_virtual`.
While symbols like _ZTVN10__cxxabiv120__si_class_type_infoE occur in
many DSOs, a one-entry cache would not optimize out their hash
computations.
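[Editorial sketch] Rich's suggested one-entry cache can be sketched with a stand-in for the expensive lookup. The names and structure below are invented for illustration; musl's do_relocs operates on real relocation records, and the lookup would be find_sym plus the hash computation.

```c
#include <stddef.h>

static int lookups; /* counts how often the expensive lookup runs */

static int slow_lookup(int sym)
{
	lookups++;
	return sym * 10; /* stand-in for hashing + walking the DSO chain */
}

/* Process relocations, reusing the previous result when consecutive
 * relocations reference the same symbol index. With relocations sorted
 * by symbol, each distinct symbol is looked up only once. */
static void process_relocs(const int *relsyms, size_t n)
{
	int last_sym = -1, last_def = 0;
	for (size_t i = 0; i < n; i++) {
		if (relsyms[i] != last_sym) {
			last_sym = relsyms[i];
			last_def = slow_lookup(last_sym);
		}
		(void)last_def; /* apply the relocation with last_def here */
	}
}
```

On a sorted input like {1, 1, 1, 2, 2, 3} this performs three lookups instead of six; as noted above, the win evaporates when each symbol has only one or two relocations per DSO.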
* Re: [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed
2026-02-03 17:13 ` Fangrui Song
@ 2026-02-05 17:37 ` Florian Weimer
0 siblings, 0 replies; 18+ messages in thread
From: Florian Weimer @ 2026-02-05 17:37 UTC (permalink / raw)
To: Fangrui Song; +Cc: musl, Fabian Rast
* Fangrui Song:
> On Mon, Dec 8, 2025 at 8:20 AM Rich Felker <dalias@libc.org> wrote:
>>
>> On Mon, Dec 08, 2025 at 06:20:58AM +0100, Szabolcs Nagy wrote:
>> > * Rich Felker <dalias@libc.org> [2025-12-07 21:25:42 -0500]:
>> > > On Sun, Dec 07, 2025 at 08:44:54PM -0500, Rich Felker wrote:
>> > > > This looks a lot more expensive than the method I suggested using the
>> > > > bloom filter, which has no modulo, much less two.
>> > >
>> > > Actually, I'm not sure if the bloom filter approach works. Is a hash
>> > > included in the bloom filter as soon as it's present in the symbol
>> > > table, or only if it's present with a definition?
>> > >
>> > > If it's absent for undefined, then this approach is a non-starter. It
>> > > can't tell us anything.
>> >
>> > isn't it the other way around?
>> >
>> > only defined symbols set bits in the bloom filter.
>> > it is for checking if a dso might have a definition.
>>
>> I think you misunderstood what I was saying and looking for.
>>
>> I was thinking the bloom filter would include everything in the symbol
>> table, including both symbols that provide definitions and symbol
>> table entries that are used to reference undefined symbols. In this
>> case, we could use the bloom filter of the DSO making the reference to
>> determine which of the two candidate hashes is correct, and do so very
>> quickly (nothing but a few bitshifts and masks).
>>
>> However, it seems reasonable that the bloom filter would only include
>> symbols that actually have definitions. Otherwise at each step of the
>> DSO chain you'd get a false positive for every DSO that references the
>> symbol without defining it. So we probably can't do it the way I'd
>> hoped.
>
> Google modified glibc (GRTE) has a "fastload" patch that dramatically
> speeds up symbol resolution by building a position hash table at
> startup that records, for each symbol, the earliest position in the
> link map where that symbol can be found.
> During symbol lookup, instead of scanning from position 0, the rtld
> can skip directly to the relevant position.
> https://sourceware.org/bugzilla/show_bug.cgi?id=16709 ("Dynamic
> linking with large number of DSOs degrades into linear lookup")
I think the implementation started with this commit on the
google/grte/v5-2.27/master branch.
commit 590786950c039260b75db235e81f2ae1633a055e
Author: Paul Pluzhnikov <ppluzhnikov@google.com>
Date: Tue Mar 4 19:07:05 2014 -0800
Add "fastload" support.
<https://sourceware.org/cgit/glibc/commit/?id=590786950c039260b75db235e81f2ae1633a055e>
There have been some changes since then.
I'm not sure if this has ever been submitted to libc-alpha? It's
certainly an interesting approach.
I haven't checked whether the latest implementation contains heuristics
to avoid duplicating the hash tables when there are only a few symbol
references relative to the total number of exported symbols. It might
not be a uniform improvement across all kinds of process images.
Thanks,
Florian
end of thread, other threads:[~2026-02-05 17:38 UTC | newest]
Thread overview: 18+ messages
2025-12-06 17:31 [musl] [PATCH v2] ldso: skip gnu hash calculation if precomputed Fabian Rast
2025-12-07 18:45 ` Szabolcs Nagy
2025-12-07 19:44 ` Alexander Monakov
2025-12-08 1:44 ` Rich Felker
2025-12-08 2:25 ` Rich Felker
2025-12-08 5:20 ` Szabolcs Nagy
2025-12-08 16:20 ` Rich Felker
2025-12-08 23:14 ` Szabolcs Nagy
2025-12-09 18:31 ` Szabolcs Nagy
2025-12-09 1:10 ` Fabian Rast
2025-12-09 1:27 ` Rich Felker
2025-12-22 16:37 ` Fabian Rast
2025-12-22 18:40 ` Alexander Monakov
2025-12-22 19:21 ` Fabian Rast
2026-02-03 7:17 ` Fangrui Song
2026-02-03 10:29 ` Fabian Rast
2026-02-03 17:13 ` Fangrui Song
2026-02-05 17:37 ` Florian Weimer
Code repositories for project(s) associated with this public inbox
https://git.vuxu.org/mirror/musl/