mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] [PATCH] ldso: skip gnu hash calculation if precomputed
@ 2025-12-05  0:21 Fabian Rast
  2025-12-05 15:43 ` Szabolcs Nagy
  0 siblings, 1 reply; 6+ messages in thread
From: Fabian Rast @ 2025-12-05  0:21 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 8470 bytes --]

calculating the hashes of symbol names takes a significant amount of
time while applying relocations. However, for symbol lookups for
names that are themselves part of the gnu hashtable in the object
being relocated, the hashvalue is already present in that hashtable
and can be efficiently retrieved using the symbol index.

the gnu hash table only stores the upper 31 bits of the 32 bit wide
hashes. the lsb is used to indicate the end of a chain.
this missing bit can be guessed for the cost of traversing two chains
in the worst case.
---
 ldso/dynlink.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/ldso/dynlink.c b/ldso/dynlink.c
index 715948f4..fe716b9e 100644
--- a/ldso/dynlink.c
+++ b/ldso/dynlink.c
@@ -311,16 +311,27 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
 #if defined(__GNUC__)
 __attribute__((always_inline))
 #endif
-static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
+static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps, struct dso *dso2, int sym_index)
 {
-	uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
-	size_t ghm = 1ul << gh % (8*sizeof(size_t));
+	uint32_t h = 0, gh, gho, *ght, gh_precomp;
+	size_t ghm;
+
+	gh_precomp = dso2 && (ght = dso2->ghashtab) && sym_index >= ght[1];
+	gh = gh_precomp
+		? ght[4 + ght[2] * 2 + ght[0] + sym_index - ght[1]] & ~1u
+		: gnu_hash(s);
+
+	gho = gh / (8*sizeof(size_t));
+	ghm = 1ul << gh % (8*sizeof(size_t));
+
 	struct symdef def = {0};
 	struct dso **deps = use_deps ? dso->deps : 0;
 	for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
 		Sym *sym;
 		if ((ght = dso->ghashtab)) {
 			sym = gnu_lookup_filtered(gh, ght, dso, s, gho, ghm);
+			if (!sym && gh_precomp)
+				sym = gnu_lookup_filtered(gh+1, ght, dso, s, gho, (ghm<<1)?(ghm<<1):1);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, dso);
@@ -344,7 +355,7 @@ static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_d
 
 static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
 {
-	return find_sym2(dso, s, need_def, 0);
+	return find_sym2(dso, s, need_def, 0, NULL, STN_UNDEF);
 }
 
 static struct symdef get_lfs64(const char *name)
@@ -428,7 +439,7 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
 			ctx = type==REL_COPY ? head->syms_next : head;
 			def = (sym->st_info>>4) == STB_LOCAL
 				? (struct symdef){ .dso = dso, .sym = sym }
-				: find_sym(ctx, name, type==REL_PLT);
+				: find_sym2(ctx, name, type==REL_PLT, 0, dso, sym_index);
 			if (!def.sym) def = get_lfs64(name);
 			if (!def.sym && (sym->st_shndx != SHN_UNDEF
 			    || sym->st_info>>4 != STB_WEAK)) {
@@ -2277,7 +2288,7 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
 		return 0;
 	} else
 		use_deps = 1;
-	struct symdef def = find_sym2(p, s, 0, use_deps);
+	struct symdef def = find_sym2(p, s, 0, use_deps, NULL, STN_UNDEF);
 	if (!def.sym) {
 		error("Symbol not found: %s", s);
 		return 0;
-- 
2.52.0

Hi,

i would like to propose an optimization to dynamic symbol lookup that
improves load times of executables using big libraries:

During relocation symbol lookups are commonly using the gnu hashtable.
I found that a significant portion of time is spent calculating hashes
of symbol names as part of the gnu hashtable lookup.

It is unfortunate that this is done at load-time, because
the hashes are already known at link time.

Because hashes are stored almost in full in the gnu hashtable
it is possible to use these precomputed values from the hashtable
in some cases.

If a symbol is beeing looked up to apply a relocation and that
symbol is also defined in the dso being relocated (and thus part
of the hashtable) the upper 31 bits of the hash can be efficiently
loaded from the hashtable using the symbol index.

The missing lsb can be guessed for the cost of possibly looking
in two chains for the two possible hash values.

Please let me know what you think, I am happy to hear your feedback.
Best Regards,

Fabian Rast


= Benchmarks
I tested this using clang 22 linking libclang-cpp and libLLVM.  I inserted
timestamps into the loader startup code to measure startuptime until the
programs entry point is called.
On my laptop (AMD Ryzen AI 9 365) averaged over 200 measurements:
Without optimization: 5441.205 us
With    optimization: 4714.1   us
This suggests a roughly 10% improvement in startuptime in this scenario.


I also measured running musl in ldd mode and without the timing measurement
modifications:
```
/home/fr/src/poop/zig-out/bin/poop -d 10000 \
    'env LD_LIBRARY_PATH=/home/fr/src/musl/master-install/lib/ /home/fr/src/musl/master-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22' \
    'env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/precomp-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22'
```

Benchmark 1 (1297 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/master-install/lib/ /home/fr/src/musl/master-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          7.66ms ±  863us    5.08ms … 11.3ms         38 ( 3%)        0%
  peak_rss           12.6MB ± 81.2KB    12.0MB … 12.7MB        185 (14%)        0%
  cpu_cycles         11.0M  ± 1.12M     9.08M  … 15.6M          65 ( 5%)        0%
  instructions       19.3M  ±  897      19.3M  … 19.3M          48 ( 4%)        0%
  cache_references    463K  ± 2.52K      453K  …  481K          11 ( 1%)        0%
  cache_misses       84.2K  ± 1.45K     81.0K  … 91.2K          26 ( 2%)        0%
  branch_misses      79.6K  ±  814      78.0K  … 83.3K           3 ( 0%)        0%
Benchmark 2 (1445 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/precomp-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.89ms ±  615us    4.70ms … 9.00ms         19 ( 1%)        ⚡- 10.1% ±  0.7%
  peak_rss           12.6MB ± 82.3KB    12.1MB … 12.7MB         89 ( 6%)          +  0.0% ±  0.0%
  cpu_cycles         9.35M  ±  807K     7.72M  … 12.5M          91 ( 6%)        ⚡- 15.3% ±  0.7%
  instructions       15.4M  ±  865      15.4M  … 15.4M          42 ( 3%)        ⚡- 19.9% ±  0.0%
  cache_references    466K  ± 4.28K      454K  …  504K          30 ( 2%)          +  0.8% ±  0.1%
  cache_misses       85.6K  ± 1.50K     82.7K  … 92.3K          54 ( 4%)        💩+  1.7% ±  0.1%
  branch_misses      83.4K  ±  686      81.8K  … 85.6K           1 ( 0%)        💩+  4.7% ±  0.1%

To reproduce without compiling clang with a musl cross toolchain an
alpine docker container can be used:
- Install the two different versions into `./master-install` and `./precomp-install`
- sudo docker run --rm -it -v./master-install:/tmp/master-install -v./precomp-install:/tmp/precomp-install alpine:latest sh
in the container:
- apk add clang hyperfine
- hyperfine -N -w 100 'env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang' 'env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/clang'

```
Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
  Time (mean ± σ):      12.6 ms ±   1.8 ms    [User: 9.7 ms, System: 2.8 ms]
  Range (min … max):     9.3 ms …  16.0 ms    191 runs

Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/clang
  Time (mean ± σ):      10.5 ms ±   1.4 ms    [User: 7.7 ms, System: 2.7 ms]
  Range (min … max):     7.5 ms …  13.5 ms    298 runs

Summary
  env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/clang ran
    1.20 ± 0.23 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
```


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH] ldso: skip gnu hash calculation if precomputed
  2025-12-05  0:21 [musl] [PATCH] ldso: skip gnu hash calculation if precomputed Fabian Rast
@ 2025-12-05 15:43 ` Szabolcs Nagy
  2025-12-05 17:37   ` Fabian Rast
  0 siblings, 1 reply; 6+ messages in thread
From: Szabolcs Nagy @ 2025-12-05 15:43 UTC (permalink / raw)
  To: Fabian Rast; +Cc: musl

* Fabian Rast <fabian.rast@tum.de> [2025-12-05 01:21:11 +0100]:
> calculating the hashes of symbol names takes a significant amount of
> time while applying relocations. However, for symbol lookups for
> names that are themselves part of the gnu hashtable in the object
> being relocated, the hashvalue is already present in that hashtable
> and can be efficiently retrieved using the symbol index.
> 
> the gnu hash table only stores the upper 31 bits of the 32 bit wide
> hashes. the lsb is used to indicate the end of a chain.
> this missing bit can be guessed for the cost of traversing two chains
> in the worst case.

you mean two chains per loaded dso?

then i'd expect the optimization to be only enabled
if dso-index-in-symbol-lookup-order < N.

> ---
>  ldso/dynlink.c | 23 +++++++++++++++++------
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/ldso/dynlink.c b/ldso/dynlink.c
> index 715948f4..fe716b9e 100644
> --- a/ldso/dynlink.c
> +++ b/ldso/dynlink.c
> @@ -311,16 +311,27 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
>  #if defined(__GNUC__)
>  __attribute__((always_inline))
>  #endif
> -static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
> +static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps, struct dso *dso2, int sym_index)
>  {
> -	uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
> -	size_t ghm = 1ul << gh % (8*sizeof(size_t));
> +	uint32_t h = 0, gh, gho, *ght, gh_precomp;
> +	size_t ghm;
> +
> +	gh_precomp = dso2 && (ght = dso2->ghashtab) && sym_index >= ght[1];
> +	gh = gh_precomp
> +		? ght[4 + ght[2] * 2 + ght[0] + sym_index - ght[1]] & ~1u
> +		: gnu_hash(s);
> +
> +	gho = gh / (8*sizeof(size_t));
> +	ghm = 1ul << gh % (8*sizeof(size_t));
> +
>  	struct symdef def = {0};
>  	struct dso **deps = use_deps ? dso->deps : 0;
>  	for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
>  		Sym *sym;
>  		if ((ght = dso->ghashtab)) {
>  			sym = gnu_lookup_filtered(gh, ght, dso, s, gho, ghm);
> +			if (!sym && gh_precomp)
> +				sym = gnu_lookup_filtered(gh+1, ght, dso, s, gho, (ghm<<1)?(ghm<<1):1);

how can ghm<<1 == 0 happen?

>  		} else {
>  			if (!h) h = sysv_hash(s);
>  			sym = sysv_lookup(s, h, dso);
> @@ -344,7 +355,7 @@ static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_d
>  
>  static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
>  {
> -	return find_sym2(dso, s, need_def, 0);
> +	return find_sym2(dso, s, need_def, 0, NULL, STN_UNDEF);
>  }
>  
>  static struct symdef get_lfs64(const char *name)
> @@ -428,7 +439,7 @@ static void do_relocs(struct dso *dso, size_t *rel, size_t rel_size, size_t stri
>  			ctx = type==REL_COPY ? head->syms_next : head;
>  			def = (sym->st_info>>4) == STB_LOCAL
>  				? (struct symdef){ .dso = dso, .sym = sym }
> -				: find_sym(ctx, name, type==REL_PLT);
> +				: find_sym2(ctx, name, type==REL_PLT, 0, dso, sym_index);
>  			if (!def.sym) def = get_lfs64(name);
>  			if (!def.sym && (sym->st_shndx != SHN_UNDEF
>  			    || sym->st_info>>4 != STB_WEAK)) {
> @@ -2277,7 +2288,7 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
>  		return 0;
>  	} else
>  		use_deps = 1;
> -	struct symdef def = find_sym2(p, s, 0, use_deps);
> +	struct symdef def = find_sym2(p, s, 0, use_deps, NULL, STN_UNDEF);
>  	if (!def.sym) {
>  		error("Symbol not found: %s", s);
>  		return 0;
> -- 
> 2.52.0
...
> I tested this using clang 22 linking libclang-cpp and libLLVM.  I inserted
> timestamps into the loader startup code to measure startuptime until the
> programs entry point is called.
> On my laptop (AMD Ryzen AI 9 365) averaged over 200 measurements:
> Without optimization: 5441.205 us
> With    optimization: 4714.1   us
> This suggests a roughly 10% improvement in startuptime in this scenario.

i don't quite understand the scenario (is this just clang startup time?
why different than the number below?) but would be good to know how many
dsos (and relocs etc) are involved.

> 
> 
> I also measured running musl in ldd mode and without the timing measurement
> modifications:
> ```
> /home/fr/src/poop/zig-out/bin/poop -d 10000 \
>     'env LD_LIBRARY_PATH=/home/fr/src/musl/master-install/lib/ /home/fr/src/musl/master-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22' \
>     'env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/precomp-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22'
> ```
> 
> Benchmark 1 (1297 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/master-install/lib/ /home/fr/src/musl/master-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
>   measurement          mean ± σ            min … max           outliers         delta
>   wall_time          7.66ms ±  863us    5.08ms … 11.3ms         38 ( 3%)        0%
>   peak_rss           12.6MB ± 81.2KB    12.0MB … 12.7MB        185 (14%)        0%
>   cpu_cycles         11.0M  ± 1.12M     9.08M  … 15.6M          65 ( 5%)        0%
>   instructions       19.3M  ±  897      19.3M  … 19.3M          48 ( 4%)        0%
>   cache_references    463K  ± 2.52K      453K  …  481K          11 ( 1%)        0%
>   cache_misses       84.2K  ± 1.45K     81.0K  … 91.2K          26 ( 2%)        0%
>   branch_misses      79.6K  ±  814      78.0K  … 83.3K           3 ( 0%)        0%
> Benchmark 2 (1445 runs): env LD_LIBRARY_PATH=/home/fr/src/musl/precomp-install/lib/ /home/fr/src/musl/precomp-install/lib/ld-musl-x86_64.so.1 --list /home/fr/src/llvm-project/build/bin/clang-22
>   measurement          mean ± σ            min … max           outliers         delta
>   wall_time          6.89ms ±  615us    4.70ms … 9.00ms         19 ( 1%)        ⚡- 10.1% ±  0.7%
>   peak_rss           12.6MB ± 82.3KB    12.1MB … 12.7MB         89 ( 6%)          +  0.0% ±  0.0%
>   cpu_cycles         9.35M  ±  807K     7.72M  … 12.5M          91 ( 6%)        ⚡- 15.3% ±  0.7%
>   instructions       15.4M  ±  865      15.4M  … 15.4M          42 ( 3%)        ⚡- 19.9% ±  0.0%
>   cache_references    466K  ± 4.28K      454K  …  504K          30 ( 2%)          +  0.8% ±  0.1%
>   cache_misses       85.6K  ± 1.50K     82.7K  … 92.3K          54 ( 4%)        💩+  1.7% ±  0.1%
>   branch_misses      83.4K  ±  686      81.8K  … 85.6K           1 ( 0%)        💩+  4.7% ±  0.1%

these results look good.
thanks.

> 
> To reproduce without compiling clang with a musl cross toolchain an
> alpine docker container can be used:
> - Install the two different versions into `./master-install` and `./precomp-install`
> - sudo docker run --rm -it -v./master-install:/tmp/master-install -v./precomp-install:/tmp/precomp-install alpine:latest sh
> in the container:
> - apk add clang hyperfine
> - hyperfine -N -w 100 'env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang' 'env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/clang'
> 
> ```
> Benchmark 1: env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
>   Time (mean ± σ):      12.6 ms ±   1.8 ms    [User: 9.7 ms, System: 2.8 ms]
>   Range (min … max):     9.3 ms …  16.0 ms    191 runs
> 
> Benchmark 2: env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/clang
>   Time (mean ± σ):      10.5 ms ±   1.4 ms    [User: 7.7 ms, System: 2.7 ms]
>   Range (min … max):     7.5 ms …  13.5 ms    298 runs
> 
> Summary
>   env LD_LIBRARY_PATH=/tmp/precomp-install/lib /tmp/precomp-install/lib/libc.so --list /usr/bin/clang ran
>     1.20 ± 0.23 times faster than env LD_LIBRARY_PATH=/tmp/master-install/lib /tmp/master-install/lib/libc.so --list /usr/bin/clang
> ```
> 



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH] ldso: skip gnu hash calculation if precomputed
  2025-12-05 15:43 ` Szabolcs Nagy
@ 2025-12-05 17:37   ` Fabian Rast
  2025-12-05 18:30     ` Szabolcs Nagy
  0 siblings, 1 reply; 6+ messages in thread
From: Fabian Rast @ 2025-12-05 17:37 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 4649 bytes --]

>> the gnu hash table only stores the upper 31 bits of the 32 bit wide
>> hashes. the lsb is used to indicate the end of a chain.
>> this missing bit can be guessed for the cost of traversing two chains
>> in the worst case.
>
> you mean two chains per loaded dso?
>
> then i'd expect the optimization to be only enabled
> if dso-index-in-symbol-lookup-order < N.

I don't quite understand what you mean by this.
To clarify - when I am talking about "chains" here i mean the chains
internal to the gnu hashtable and not any linked lists of dsos.
I should probably call these "buckets" and not "chains" to avoid confusion, sorry.
This patch modifies how symbols are looked up in a single dso, not
in the "global scope" of loaded dsos.

So, basically, to do a hashtable lookup, you compute the hash of the symbol name,
and use a modulo operation to select the correct chain/bucket in the hashtable.
Then you only check the symbols in that chain/bucket for equality.

Because the precomputed hashes that this patch is using are each missing one
bit, we can only know that the hashvalue for a given symbol name is one of
two possible values, leading to two possible buckets of the hashtable that
have to be searched.
However, if a match is already found in the first bucket we search, we
of course do not have to search the second bucket.
Searching a hashtable bucket is not that expensive, in my experience they
are rather small.
For example here are the number of buckets for each length in libLLVM.so:
length: 1: 924 buckets
length: 2: 1429 buckets
length: 3: 1918 buckets
length: 4: 1895 buckets
length: 5: 1517 buckets
length: 6: 1045 buckets
length: 7: 615 buckets
length: 8: 271 buckets
length: 9: 127 buckets
length: 10: 46 buckets
length: 11: 26 buckets
length: 12: 9 buckets
length: 13: 2 buckets

> how can ghm<<1 == 0 happen?

Suppose that `gh % (8*sizeof(size_t))` is 63, so ghm will be 0x8000000000000000.
When we shift ghm left by one, we end up with zero.
What we actually want is one, because zgh % (8*sizeof(size_t))` should have been zero.


>> I tested this using clang 22 linking libclang-cpp and libLLVM.  I inserted
>> timestamps into the loader startup code to measure startuptime until the
>> programs entry point is called.
>> On my laptop (AMD Ryzen AI 9 365) averaged over 200 measurements:
>> Without optimization: 5441.205 us
>> With    optimization: 4714.1   us
>> This suggests a roughly 10% improvement in startuptime in this scenario.
>
> i don't quite understand the scenario (is this just clang startup time?
> why different than the number below?)

Yes, this is just startuptime. It is different from the other numbers,
because timing is measured differently. In these numbers there is some
overhead due to the "rdtsc" instruction i used to get timestamps.
In the other numbers, there is overhead because i call the loader
through `env` to set the library path environment. At least this is my
explanaion. Note that I am running in ldd mode to focus on measuring load times.

> but would be good to know how many dsos (and relocs etc) are involved.

These are the dsos involved:
```
$ ~/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 --list ./bin/clang

/home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 (0x7f4d20ed9000)
libclang-cpp.so.22.0git => /home/fr/src/llvm-project/build/lib/libclang-cpp.so.22.0git (0x7f4d1d400000)
libLLVM.so.22.0git => /home/fr/src/llvm-project/build/lib/libLLVM.so.22.0git (0x7f4d18e00000)
libstdc++.so.6 => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/libstdc++.so.6 (0x7f4d18a00000)
libgcc_s.so.1 => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/libgcc_s.so.1 (0x7f4d20ea6000)
libc.so => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 (0x7f4d20ed9000)
```

Symbol lookups: used `gh_precomp` path / total calls to find_sym2
    10189/14955

Approximations for symbol etc count using readelf:
Undefined symbols:
readelf --dyn-syms libclang-cpp.so.22.0git | grep UND -c

defined symbols:
readelf --dyn-syms libclang-cpp.so.22.0git | grep UND -cv

i exclude relative relocations from the count because they never need a symbol
lookup and are irrelevant.
My libraries have relr compressed relative relocations - i counted using a text
editor and do not have a shell one liner.

libclang has 2204 undefined and 37831 defined symbols and 4723 non-relative relocations.
libLLVM has 340 undefined and 39300 defined symbols and 5479 non-relative relocations.

Best Regards,
Fabian Rast

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH] ldso: skip gnu hash calculation if precomputed
  2025-12-05 17:37   ` Fabian Rast
@ 2025-12-05 18:30     ` Szabolcs Nagy
  2025-12-05 18:48       ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Szabolcs Nagy @ 2025-12-05 18:30 UTC (permalink / raw)
  To: Fabian Rast; +Cc: musl

* Fabian Rast <fabian.rast@tum.de> [2025-12-05 18:37:57 +0100]:
> >> the gnu hash table only stores the upper 31 bits of the 32 bit wide
> >> hashes. the lsb is used to indicate the end of a chain.
> >> this missing bit can be guessed for the cost of traversing two chains
> >> in the worst case.
> >
> > you mean two chains per loaded dso?
> >
> > then i'd expect the optimization to be only enabled
> > if dso-index-in-symbol-lookup-order < N.
> 
> I don't quite understand what you mean by this.
...
> Searching a hashtable bucket is not that expensive, in my experience they
> are rather small.
> For example here are the number of buckets for each length in libLLVM.so:
> length: 1: 924 buckets
> length: 2: 1429 buckets
> length: 3: 1918 buckets
> length: 4: 1895 buckets
> length: 5: 1517 buckets
> length: 6: 1045 buckets
...

you have to do this for every loaded dso that's
before the dso being relocated in symbol lookup
order.

we know the relocated dso has a definition for
the symbol since it was in the hash table. so
we expect every single earlier dso to fail the
symbol lookup: symbol interposition is rare.
the point of the hash is to make such lookup
failure fast. but you made it twice as slow on
average.

eventually the gain from not computing the hash
is dominated by caculating 2*n bloom filters
(and potentially chasing twice as many hash chains)

> > how can ghm<<1 == 0 happen?
> 
> Suppose that `gh % (8*sizeof(size_t))` is 63, so ghm will be 0x8000000000000000.

we establised that gh = hash & ~1

> > but would be good to know how many dsos (and relocs etc) are involved.
> 
> These are the dsos involved:
> ```
> $ ~/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 --list ./bin/clang
> 
> /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 (0x7f4d20ed9000)
> libclang-cpp.so.22.0git => /home/fr/src/llvm-project/build/lib/libclang-cpp.so.22.0git (0x7f4d1d400000)
> libLLVM.so.22.0git => /home/fr/src/llvm-project/build/lib/libLLVM.so.22.0git (0x7f4d18e00000)
> libstdc++.so.6 => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/libstdc++.so.6 (0x7f4d18a00000)
> libgcc_s.so.1 => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/libgcc_s.so.1 (0x7f4d20ea6000)
> libc.so => /home/fr/src/musl-cross-make/output/x86_64-linux-musl/lib/ld-musl-x86_64.so.1 (0x7f4d20ed9000)
> ```

that's not a lot of dso, i'd try binaries like:

/usr/bin/ffmpeg
/usr/bin/mpv
/usr/bin/gsx
/usr/lib/libreoffice/program/soffice.bin
/usr/lib/libwebkit2gtk-*
/usr/lib/firefox/libxul.so

i.e. something with n indirect dso deps, where
n is large, so 2*n bloom filter checks may
outgrow the gnu_hash computation for a short
string.

> 
> Symbol lookups: used `gh_precomp` path / total calls to find_sym2
>     10189/14955

i guess what matters is how expensive the
hash lookup per dso vs gnu_hash eval is

> 
> Approximations for symbol etc count using readelf:
> Undefined symbols:
> readelf --dyn-syms libclang-cpp.so.22.0git | grep UND -c
> 
> defined symbols:
> readelf --dyn-syms libclang-cpp.so.22.0git | grep UND -cv
> 
> i exclude relative relocations from the count because they never need a symbol
> lookup and are irrelevant.
> My libraries have relr compressed relative relocations - i counted using a text
> editor and do not have a shell one liner.
> 
> libclang has 2204 undefined and 37831 defined symbols and 4723 non-relative relocations.
> libLLVM has 340 undefined and 39300 defined symbols and 5479 non-relative relocations.
> 
> Best Regards,
> Fabian Rast



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH] ldso: skip gnu hash calculation if precomputed
  2025-12-05 18:30     ` Szabolcs Nagy
@ 2025-12-05 18:48       ` Rich Felker
  2025-12-05 21:06         ` Szabolcs Nagy
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2025-12-05 18:48 UTC (permalink / raw)
  To: Fabian Rast, musl

On Fri, Dec 05, 2025 at 07:30:51PM +0100, Szabolcs Nagy wrote:
> * Fabian Rast <fabian.rast@tum.de> [2025-12-05 18:37:57 +0100]:
> > >> the gnu hash table only stores the upper 31 bits of the 32 bit wide
> > >> hashes. the lsb is used to indicate the end of a chain.
> > >> this missing bit can be guessed for the cost of traversing two chains
> > >> in the worst case.
> > >
> > > you mean two chains per loaded dso?
> > >
> > > then i'd expect the optimization to be only enabled
> > > if dso-index-in-symbol-lookup-order < N.
> > 
> > I don't quite understand what you mean by this.
> ...
> > Searching a hashtable bucket is not that expensive, in my experience they
> > are rather small.
> > For example here are the number of buckets for each length in libLLVM.so:
> > length: 1: 924 buckets
> > length: 2: 1429 buckets
> > length: 3: 1918 buckets
> > length: 4: 1895 buckets
> > length: 5: 1517 buckets
> > length: 6: 1045 buckets
> ...
> 
> you have to do this for every loaded dso that's
> before the dso being relocated in symbol lookup
> order.
> 
> we know the relocated dso has a definition for
> the symbol since it was in the hash table. so
> we expect every single earlier dso to fail the
> symbol lookup: symbol interposition is rare.
> the point of the hash is to make such lookup
> failure fast. but you made it twice as slow on
> average.
> 
> eventually the gain from not computing the hash
> is dominated by caculating 2*n bloom filters
> (and potentially chasing twice as many hash chains)

I think this analysis sounds correct, in which case the optimization
probably only makes sense if we can find a way to determine the
correct hash in tiny bounded time as a one-off operation, rather than
having to try two candidate hashes per dso.

One way to do this would be to use the bloom filter from the dso
making the symbol reference: if only one of the two candidates passes
it, we know which candidate was correct. I would expect this to happen
most of the time, if the bloom filter is actually doing anything
useful to begin with.

The other option would be seeing if there's a fast way to compute just
the low bit of the hash. I think it's just the parity of the low bits
of all the bytes in the symbol name, which may be moderately faster to
compute than the hash, but probably not enough to help.

My leaning would be that the winning move is just to use the bloom
filter when is succeeds, and otherwise treat it as not having a
precomputed gnuhash.

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [musl] [PATCH] ldso: skip gnu hash calculation if precomputed
  2025-12-05 18:48       ` Rich Felker
@ 2025-12-05 21:06         ` Szabolcs Nagy
  0 siblings, 0 replies; 6+ messages in thread
From: Szabolcs Nagy @ 2025-12-05 21:06 UTC (permalink / raw)
  To: Rich Felker; +Cc: Fabian Rast, musl

* Rich Felker <dalias@libc.org> [2025-12-05 13:48:34 -0500]:
> On Fri, Dec 05, 2025 at 07:30:51PM +0100, Szabolcs Nagy wrote:
> > eventually the gain from not computing the hash
> > is dominated by caculating 2*n bloom filters
> > (and potentially chasing twice as many hash chains)
> 
> I think this analysis sounds correct, in which case the optimization
> probably only makes sense if we can find a way to determine the
> correct hash in tiny bounded time as a one-off operation, rather than
> having to try two candidate hashes per dso.
> 
> One way to do this would be to use the bloom filter from the dso
> making the symbol reference: if only one of the two candidates passes
> it, we know which candidate was correct. I would expect this to happen
> most of the time, if the bloom filter is actually doing anything
> useful to begin with.

nice, this should work as the hash chain starts at:

uint32_t i = buckets[h % nbuckets];

so the chain depends on the lsb and i is a symidx

uint32_t *ght = dso->ghastab;
uint32_t nbuckets = ght[0];
uint32_t *buckets = ght + 4 + ght[2]*(sizeof(size_t)/4);
uint32_t h0 = buckets[nbuckets + sym_index - ght[1]] & -2u;
uint32_t h1 = h0|1;
uint32_t i1 = buckets[h1 % nbuckets];
uint32_t gh = sym_idx < i1 ? h0 : h1;

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-12-05 21:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-05  0:21 [musl] [PATCH] ldso: skip gnu hash calculation if precomputed Fabian Rast
2025-12-05 15:43 ` Szabolcs Nagy
2025-12-05 17:37   ` Fabian Rast
2025-12-05 18:30     ` Szabolcs Nagy
2025-12-05 18:48       ` Rich Felker
2025-12-05 21:06         ` Szabolcs Nagy

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).