mailing list of musl libc
* [PATCH 0/5] gnu-hash speedups
@ 2015-06-23 23:24 Alexander Monakov
  2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
  To: musl; +Cc: Alexander Monakov

Hello!

I'm happy to finally send the results of my tinkering with .gnu.hash in musl.
This patch series should improve the speed of initial loading with many
dynamic libraries and symbols, provided that .gnu.hash is populated by the
linker.  I was testing on 32-bit x86 with Clang/LLVM, and got a speedup from
~235-245ms to ~95-105ms.  This is about halfway between non-lazy binding and
lazy-binding with glibc.

Most of the speedup comes from the first patch (Bloom filters), and a bit from
the last (strength reduction for h*33).  Notably, I don't see an improvement
from the second patch (strength reduction for modulus), but it may be there
for other architectures.

I'd love to see test results on other platforms!  Although keep in mind that
the patches don't affect loading when .gnu.hash is not present.

Thanks to Rich for ideas (unsigned division by magic and 'h1 == (h2|1)' test
reordering), and to community members who encouraged me to finish this work :)

Yours,

Alexander Monakov (5):
  dynlink.c: use bloom filter in gnu hash lookup
  dynlink.c: compute modulus via magic multiplication
  dynlink.c: slim down gnu_lookup
  dynlink.c: pass gnu-hash table pointer to gnu_lookup
  dynlink.c: use a faster expression in gnu_hash

 src/ldso/dynlink.c | 124 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 101 insertions(+), 23 deletions(-)



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-24  5:39   ` Rich Felker
  2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
  To: musl; +Cc: Alexander Monakov

Introduce gnu_lookup_filtered and use it to speed up symbol lookups in
find_sym (do_dlsym is left as-is, based on the expectation that frequent
dlsym queries will use a dlopen handle rather than RTLD_NEXT or RTLD_DEFAULT,
and will not need to look at more than one DSO).  The size of the bloom filter
table minus 1, ghashmask, is stored in the dso struct together with the
hashtable pointer to reduce pointer chasing on the hot path.
---
 src/ldso/dynlink.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index b77c6f6..fa91b39 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -54,6 +54,7 @@ struct dso {
 	Sym *syms;
 	uint32_t *hashtab;
 	uint32_t *ghashtab;
+	uint32_t ghashmask;
 	int16_t *versym;
 	char *strings;
 	unsigned char *map;
@@ -200,6 +201,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 	return 0;
 }
 
+static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
+{
+	uint32_t *hashtab = dso->ghashtab;
+	size_t *bloomwords = hashtab+4;
+	size_t f = bloomwords[fofs & dso->ghashmask];
+	if (!(f & fmask)) return 0;
+
+	f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
+	if (!(f & 1)) return 0;
+
+	return gnu_lookup(s, h1, dso);
+}
+
 #define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS)
 #define OK_BINDS (1<<STB_GLOBAL | 1<<STB_WEAK | 1<<STB_GNU_UNIQUE)
 
@@ -209,14 +223,20 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 
 static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
 {
-	uint32_t h = 0, gh = 0;
+	uint32_t h = 0, gh, gho;
+	size_t ghm = 0;
 	struct symdef def = {0};
 	for (; dso; dso=dso->next) {
 		Sym *sym;
 		if (!dso->global) continue;
 		if (dso->ghashtab) {
-			if (!gh) gh = gnu_hash(s);
-			sym = gnu_lookup(s, gh, dso);
+			if (!ghm) {
+				gh = gnu_hash(s);
+				int maskbits = 8 * sizeof ghm;
+				gho = gh / maskbits;
+				ghm = 1ul << gh % maskbits;
+			}
+			sym = gnu_lookup_filtered(s, gh, dso, gho, ghm);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, dso);
@@ -682,8 +702,10 @@ static void decode_dyn(struct dso *p)
 		p->rpath_orig = (void *)(p->strings + dyn[DT_RPATH]);
 	if (dyn[0]&(1<<DT_RUNPATH))
 		p->rpath_orig = (void *)(p->strings + dyn[DT_RUNPATH]);
-	if (search_vec(p->dynv, dyn, DT_GNU_HASH))
+	if (search_vec(p->dynv, dyn, DT_GNU_HASH)) {
 		p->ghashtab = (void *)(p->base + *dyn);
+		p->ghashmask = p->ghashtab[2]-1;
+	}
 	if (search_vec(p->dynv, dyn, DT_VERSYM))
 		p->versym = (void *)(p->base + *dyn);
 }


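For reference, the filter test introduced in this patch can be sketched in
isolation.  In the .gnu.hash layout, the header is four 32-bit words
(nbuckets, symoffset, maskwords, shift2), followed by maskwords bloom words of
size_t width; a symbol with hash h can be present only if bits h % C and
(h >> shift2) % C are both set in bloom word (h / C) % maskwords, where C is
the bit width of size_t.  The following is an illustrative stand-alone model
under those assumptions, not musl code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MASKWORDS 4                   /* bloom word count; must be a power of two */
#define MASKBITS  (8 * sizeof(size_t))

/* Set the two filter bits the GNU hash scheme derives from one symbol hash
 * (this is what the linker does when it emits .gnu.hash). */
static void bloom_add(size_t *bloom, uint32_t h, uint32_t shift2)
{
	bloom[(h / MASKBITS) % MASKWORDS] |= ((size_t)1 << h % MASKBITS)
	                                   | ((size_t)1 << (h >> shift2) % MASKBITS);
}

/* Mirror of the test in gnu_lookup_filtered: both bits must be set for the
 * symbol to possibly be present; a clear bit means "definitely absent". */
static int bloom_maybe(const size_t *bloom, uint32_t h, uint32_t shift2)
{
	size_t f = bloom[(h / MASKBITS) % MASKWORDS];
	return (f >> h % MASKBITS & 1) && (f >> (h >> shift2) % MASKBITS & 1);
}
```

The patch performs the same two bit tests, with the first bit mask (fmask) and
word offset (fofs) precomputed once per lookup in find_sym.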

* [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
  2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-24  4:18   ` Alexander Monakov
  2015-06-24  4:24   ` Rich Felker
  2015-06-23 23:24 ` [PATCH 3/5] dynlink.c: slim down gnu_lookup Alexander Monakov
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
  To: musl; +Cc: Alexander Monakov

Based on http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html

Do a little hand-holding for the compiler and fold the magic post-shift into
the 32-bit high-word extraction shift on 64-bit platforms.
---
 src/ldso/dynlink.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index fa91b39..99dadd4 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -41,6 +41,11 @@ struct td_index {
 	struct td_index *next;
 };
 
+struct udiv {
+	uint32_t mul;
+	int s1, s2, inc;
+};
+
 struct dso {
 	unsigned char *base;
 	char *name;
@@ -55,6 +60,7 @@ struct dso {
 	uint32_t *hashtab;
 	uint32_t *ghashtab;
 	uint32_t ghashmask;
+	struct udiv gudiv;
 	int16_t *versym;
 	char *strings;
 	unsigned char *map;
@@ -141,6 +147,60 @@ static int search_vec(size_t *v, size_t *r, size_t key)
 	return 1;
 }
 
+static void precompute_udiv(uint32_t div, struct udiv *p)
+{
+	if (!(div&(div-1)))
+		return;
+	int bits = 0;
+again:
+	p->s1 = bits;
+	uint32_t tmp = 1u<<31, quo = tmp/div, rem = tmp%div;
+
+	int log=0;
+	tmp=div; do log++; while (tmp>>=1);
+
+	int exp, rdshf;
+	uint32_t pow, rdmul=0;
+	for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) {
+		int ovf = rem >= div - rem;
+		quo *= 2; rem *= 2;
+		if (ovf) quo++, rem -= div;
+
+		if (exp >= log-bits || div-rem <= pow)
+			break;
+
+		if (!rdmul && rem <= pow)
+			rdmul = quo, rdshf = exp;
+	}
+	if (exp < log) {
+		p->mul = quo+1;
+		p->s2  = exp;
+		p->inc = 0;
+	} else if (div & 1) {
+		p->mul = rdmul;
+		p->s2  = rdshf;
+		p->inc = 1;
+	} else {
+		do bits++; while (!((div >>= 1) & 1));
+		goto again;
+	}
+}
+
+static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
+{
+	if (!(div&(div-1)))
+		return x&(div-1);
+	uint32_t v = x;
+	if (p->s1) v >>= p->s1;
+	else if (v != ~0u) v += p->inc;
+	int s32=32, s2=p->s2;
+	if (sizeof(long) == 8)
+		s32+=s2, s2=0;
+	v = (1ull * v * p->mul) >> s32;
+	v >>= s2;
+	return x-v*div;
+}
+
 static uint32_t sysv_hash(const char *s0)
 {
 	const unsigned char *s = (void *)s0;
@@ -184,7 +244,7 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
 	uint32_t h2;
 	uint32_t *hashval;
-	uint32_t i = buckets[h1 % nbuckets];
+	uint32_t i = buckets[umod(h1, nbuckets, &dso->gudiv)];
 
 	if (!i) return 0;
 
@@ -705,6 +765,7 @@ static void decode_dyn(struct dso *p)
 	if (search_vec(p->dynv, dyn, DT_GNU_HASH)) {
 		p->ghashtab = (void *)(p->base + *dyn);
 		p->ghashmask = p->ghashtab[2]-1;
+		precompute_udiv(p->ghashtab[0], &p->gudiv);
 	}
 	if (search_vec(p->dynv, dyn, DT_VERSYM))
 		p->versym = (void *)(p->base + *dyn);


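The precomputation above implements the round-up/round-down magic-number
algorithm from the linked article.  To illustrate the underlying idea, that a
remainder by a runtime-constant divisor can be computed with multiplications
only, here is a different and simpler scheme (Lemire's 64-bit "fastmod", not
what the patch uses); it assumes the unsigned __int128 extension available in
GCC and Clang:

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit magic constant per divisor: ceil(2^64 / d).  The wraparound to
 * 0 for d == 1 is harmless (every remainder mod 1 is 0). */
static uint64_t fastmod_magic(uint32_t d)
{
	return UINT64_MAX / d + 1;
}

/* Remainder of 32-bit x mod d without a division: the low 64 bits of M*x
 * hold frac(x/d) scaled by 2^64; multiplying by d and taking the high word
 * scales that fraction back up to the remainder. */
static uint32_t fastmod(uint32_t x, uint64_t M, uint32_t d)
{
	uint64_t lowbits = M * x;
	return (uint64_t)((unsigned __int128)lowbits * d >> 64);
}
```

This needs a 64x32->128-bit multiply, which is why the patch instead sticks to
32-bit magic multiplies that stay cheap on 32-bit targets.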

* [PATCH 3/5] dynlink.c: slim down gnu_lookup
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
  2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
  2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-23 23:24 ` [PATCH 4/5] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
  To: musl; +Cc: Alexander Monakov

Do not reference dso->syms and dso->strings until the point of use.
Check 'h1 == (h2|1)', the simplest condition, before the others.
---
 src/ldso/dynlink.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index 99dadd4..28812ef 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -237,24 +237,20 @@ static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso)
 
 static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 {
-	Sym *syms = dso->syms;
-	char *strings = dso->strings;
 	uint32_t *hashtab = dso->ghashtab;
 	uint32_t nbuckets = hashtab[0];
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
-	uint32_t h2;
-	uint32_t *hashval;
 	uint32_t i = buckets[umod(h1, nbuckets, &dso->gudiv)];
 
 	if (!i) return 0;
 
-	hashval = buckets + nbuckets + (i - hashtab[1]);
+	uint32_t *hashval = buckets + nbuckets + (i - hashtab[1]);
 
 	for (h1 |= 1; ; i++) {
-		h2 = *hashval++;
-		if ((!dso->versym || dso->versym[i] >= 0)
-		    && (h1 == (h2|1)) && !strcmp(s, strings + syms[i].st_name))
-			return syms+i;
+		uint32_t h2 = *hashval++;
+		if ((h1 == (h2|1)) && (!dso->versym || dso->versym[i] >= 0)
+		    && !strcmp(s, dso->strings + dso->syms[i].st_name))
+			return dso->syms+i;
 		if (h2 & 1) break;
 	}
 


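The reordered condition relies on how the .gnu.hash chain array encodes
termination: each entry stores a hash with bit 0 cleared, except the last
entry of a bucket's chain, which has bit 0 set, so matches are compared on
h|1 values.  A stand-alone sketch of the walk, with invented data and without
the versym and strcmp checks:

```c
#include <assert.h>
#include <stdint.h>

/* Walk one .gnu.hash chain: hashval[] holds 32-bit hashes whose low bit is
 * repurposed as an end-of-chain marker.  Returns the chain index of the
 * first hash match, or -1 if the chain ends without one.  The data layout
 * mirrors the table; the indices here are illustrative, not musl's. */
static int chain_find(const uint32_t *hashval, uint32_t h1)
{
	h1 |= 1;
	for (int i = 0; ; i++) {
		uint32_t h2 = hashval[i];
		if (h1 == (h2 | 1)) return i;  /* hash match; caller would strcmp */
		if (h2 & 1) return -1;         /* low bit set: last entry in chain */
	}
}
```

On a match, the real code still verifies the symbol name with strcmp, since
distinct names can collide on the 32-bit hash.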

* [PATCH 4/5] dynlink.c: pass gnu-hash table pointer to gnu_lookup
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
                   ` (2 preceding siblings ...)
  2015-06-23 23:24 ` [PATCH 3/5] dynlink.c: slim down gnu_lookup Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-23 23:24 ` [PATCH 5/5] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
  5 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
  To: musl; +Cc: Alexander Monakov

The callers need to check the value of the pointer anyway, so make them pass
the pointer to gnu_lookup instead of reloading it there.

Reorder gnu_lookup arguments so that always-used ones are listed first.  GCC
can choose a calling convention with arguments in registers (e.g. up to 3
arguments in eax, ecx, edx on x86), but cannot reorder the arguments for
static functions.
---
 src/ldso/dynlink.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index 28812ef..99be7d0 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -235,9 +235,8 @@ static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso)
 	return 0;
 }
 
-static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
+static Sym *gnu_lookup(uint32_t h1, struct dso *dso, uint32_t *hashtab, const char *s)
 {
-	uint32_t *hashtab = dso->ghashtab;
 	uint32_t nbuckets = hashtab[0];
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
 	uint32_t i = buckets[umod(h1, nbuckets, &dso->gudiv)];
@@ -257,9 +256,9 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 	return 0;
 }
 
-static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
+static Sym *gnu_lookup_filtered(uint32_t h1, struct dso *dso, uint32_t *hashtab, const char *s,
+				uint32_t fofs, size_t fmask)
 {
-	uint32_t *hashtab = dso->ghashtab;
 	size_t *bloomwords = hashtab+4;
 	size_t f = bloomwords[fofs & dso->ghashmask];
 	if (!(f & fmask)) return 0;
@@ -267,7 +266,7 @@ static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uin
 	f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
 	if (!(f & 1)) return 0;
 
-	return gnu_lookup(s, h1, dso);
+	return gnu_lookup(h1, dso, hashtab, s);
 }
 
 #define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS)
@@ -279,20 +278,20 @@ static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uin
 
 static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
 {
-	uint32_t h = 0, gh, gho;
+	uint32_t h = 0, gh, gho, *ght;
 	size_t ghm = 0;
 	struct symdef def = {0};
 	for (; dso; dso=dso->next) {
 		Sym *sym;
 		if (!dso->global) continue;
-		if (dso->ghashtab) {
+		if ((ght = dso->ghashtab)) {
 			if (!ghm) {
 				gh = gnu_hash(s);
 				int maskbits = 8 * sizeof ghm;
 				gho = gh / maskbits;
 				ghm = 1ul << gh % maskbits;
 			}
-			sym = gnu_lookup_filtered(s, gh, dso, gho, ghm);
+			sym = gnu_lookup_filtered(gh, dso, ght, s, gho, ghm);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, dso);
@@ -1623,7 +1622,7 @@ void *__tls_get_addr(size_t *);
 static void *do_dlsym(struct dso *p, const char *s, void *ra)
 {
 	size_t i;
-	uint32_t h = 0, gh = 0;
+	uint32_t h = 0, gh = 0, *ght;
 	Sym *sym;
 	if (p == head || p == RTLD_DEFAULT || p == RTLD_NEXT) {
 		if (p == RTLD_DEFAULT) {
@@ -1641,9 +1640,9 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
 	}
 	if (invalid_dso_handle(p))
 		return 0;
-	if (p->ghashtab) {
+	if ((ght = p->ghashtab)) {
 		gh = gnu_hash(s);
-		sym = gnu_lookup(s, gh, p);
+		sym = gnu_lookup(gh, p, ght, s);
 	} else {
 		h = sysv_hash(s);
 		sym = sysv_lookup(s, h, p);
@@ -1653,9 +1652,9 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
 	if (sym && sym->st_value && (1<<(sym->st_info&0xf) & OK_TYPES))
 		return p->base + sym->st_value;
 	if (p->deps) for (i=0; p->deps[i]; i++) {
-		if (p->deps[i]->ghashtab) {
+		if ((ght = p->deps[i]->ghashtab)) {
 			if (!gh) gh = gnu_hash(s);
-			sym = gnu_lookup(s, gh, p->deps[i]);
+			sym = gnu_lookup(gh, p->deps[i], ght, s);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, p->deps[i]);



* [PATCH 5/5] dynlink.c: use a faster expression in gnu_hash
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
                   ` (3 preceding siblings ...)
  2015-06-23 23:24 ` [PATCH 4/5] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
  5 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
  To: musl; +Cc: Alexander Monakov

With -Os, GCC uses a multiply rather than a shift and addition for 'h*33'.
Use a more efficient expression explicitly.
---
 src/ldso/dynlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index 99be7d0..f7342cd 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -217,7 +217,7 @@ static uint32_t gnu_hash(const char *s0)
 	const unsigned char *s = (void *)s0;
 	uint_fast32_t h = 5381;
 	for (; *s; s++)
-		h = h*33 + *s;
+		h += h*32 + *s;
 	return h;
 }
 


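The rewrite is an arithmetic identity: h*33 equals h + h*32, i.e.
h + (h << 5), so only the spelling changes, not the hash.  A quick sanity
check that both formulations agree (function names invented for the
comparison):

```c
#include <assert.h>
#include <stdint.h>

/* The original expression, h = h*33 + c ... */
static uint32_t gnu_hash_mul(const char *s0)
{
	const unsigned char *s = (void *)s0;
	uint_fast32_t h = 5381;
	for (; *s; s++)
		h = h*33 + *s;
	return h;
}

/* ... and the shift-friendly rewrite from the patch, h += h*32 + c. */
static uint32_t gnu_hash_shift(const char *s0)
{
	const unsigned char *s = (void *)s0;
	uint_fast32_t h = 5381;
	for (; *s; s++)
		h += h*32 + *s;
	return h;
}
```

Both return the djb2 seed 5381 for the empty string.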

* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
@ 2015-06-24  4:18   ` Alexander Monakov
  2015-06-24  4:19     ` Rich Felker
  2015-06-24  4:24   ` Rich Felker
  1 sibling, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24  4:18 UTC (permalink / raw)
  To: musl

On Wed, 24 Jun 2015, Alexander Monakov wrote:
> diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> index fa91b39..99dadd4 100644
> --- a/src/ldso/dynlink.c
> +++ b/src/ldso/dynlink.c
> @@ -41,6 +41,11 @@ struct td_index {
>  	struct td_index *next;
>  };
>  
> +struct udiv {
> +	uint32_t mul;
> +	int s1, s2, inc;
> +};

Can use 'char' rather than 'int' here.

Alexander



* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  4:18   ` Alexander Monakov
@ 2015-06-24  4:19     ` Rich Felker
  0 siblings, 0 replies; 28+ messages in thread
From: Rich Felker @ 2015-06-24  4:19 UTC (permalink / raw)
  To: musl

On Wed, Jun 24, 2015 at 07:18:28AM +0300, Alexander Monakov wrote:
> On Wed, 24 Jun 2015, Alexander Monakov wrote:
> > diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> > index fa91b39..99dadd4 100644
> > --- a/src/ldso/dynlink.c
> > +++ b/src/ldso/dynlink.c
> > @@ -41,6 +41,11 @@ struct td_index {
> >  	struct td_index *next;
> >  };
> >  
> > +struct udiv {
> > +	uint32_t mul;
> > +	int s1, s2, inc;
> > +};
> 
> Can use 'char' rather than 'int' here.

Yes, that seems like a nice size improvement, and it might also allow
the compiler to do additional optimizations based on knowledge of the
range.

Rich



* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
  2015-06-24  4:18   ` Alexander Monakov
@ 2015-06-24  4:24   ` Rich Felker
  2015-06-24  4:32     ` Alexander Monakov
  1 sibling, 1 reply; 28+ messages in thread
From: Rich Felker @ 2015-06-24  4:24 UTC (permalink / raw)
  To: musl

On Wed, Jun 24, 2015 at 02:24:52AM +0300, Alexander Monakov wrote:
> +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
> +{
> +	if (!(div&(div-1)))
> +		return x&(div-1);
> +	uint32_t v = x;
> +	if (p->s1) v >>= p->s1;
> +	else if (v != ~0u) v += p->inc;
> +	int s32=32, s2=p->s2;
> +	if (sizeof(long) == 8)
> +		s32+=s2, s2=0;
> +	v = (1ull * v * p->mul) >> s32;
> +	v >>= s2;
> +	return x-v*div;
> +}

I think having the div argument here is a pessimization. Power-of-two
sizes are unlikely to be used, and it's possible for precompute_udiv
to setup the struct udiv to work perfectly fine for powers of two, so
that there is a single code path with no branches. The p->s1 branch
should not be needed either.

Rich



* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  4:24   ` Rich Felker
@ 2015-06-24  4:32     ` Alexander Monakov
  2015-06-24  5:13       ` Rich Felker
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24  4:32 UTC (permalink / raw)
  To: musl

On Wed, 24 Jun 2015, Rich Felker wrote:

> On Wed, Jun 24, 2015 at 02:24:52AM +0300, Alexander Monakov wrote:
> > +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
> > +{
> > +	if (!(div&(div-1)))
> > +		return x&(div-1);
> > +	uint32_t v = x;
> > +	if (p->s1) v >>= p->s1;
> > +	else if (v != ~0u) v += p->inc;
> > +	int s32=32, s2=p->s2;
> > +	if (sizeof(long) == 8)
> > +		s32+=s2, s2=0;
> > +	v = (1ull * v * p->mul) >> s32;
> > +	v >>= s2;
> > +	return x-v*div;
> > +}
> 
> I think having the div argument here is a pessimization.

How can I do the last step, 'x-v*div' without it?

> Power-of-two sizes are unlikely to be used,

probably so, but...

> and it's possible for precompute_udiv
> to setup the struct udiv to work perfectly fine for powers of two,

I gave it a thought, and I didn't find how to do that either.

> so that there is a single code path with no branches. The p->s1 branch
> should not be needed either.

The p->s1 check is there for this reason: shifts are relatively costly, and s1
is rarely non-zero, so try to skip the shift when possible; in the rare case
the pre-shift is non-zero, the check also allows skipping the saturating
increment operation.

Alexander



* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  4:32     ` Alexander Monakov
@ 2015-06-24  5:13       ` Rich Felker
  2015-06-24  6:08         ` Alexander Monakov
  0 siblings, 1 reply; 28+ messages in thread
From: Rich Felker @ 2015-06-24  5:13 UTC (permalink / raw)
  To: musl

On Wed, Jun 24, 2015 at 07:32:56AM +0300, Alexander Monakov wrote:
> On Wed, 24 Jun 2015, Rich Felker wrote:
> 
> > On Wed, Jun 24, 2015 at 02:24:52AM +0300, Alexander Monakov wrote:
> > > +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
> > > +{
> > > +	if (!(div&(div-1)))
> > > +		return x&(div-1);
> > > +	uint32_t v = x;
> > > +	if (p->s1) v >>= p->s1;
> > > +	else if (v != ~0u) v += p->inc;
> > > +	int s32=32, s2=p->s2;
> > > +	if (sizeof(long) == 8)
> > > +		s32+=s2, s2=0;
> > > +	v = (1ull * v * p->mul) >> s32;
> > > +	v >>= s2;
> > > +	return x-v*div;
> > > +}
> > 
> > I think having the div argument here is a pessimization.
> 
> How can I do the last step, 'x-v*div' without it?

Ah yes. Would it be preferable to have that in struct udiv then,
though? Then the caller never has to care about the divisor, and in
the case where umod doesn't get inlined, the call will be more
efficient (fewer args).

> > Power-of-two sizes are unlikely to be used,
> 
> probably so, but...
> 
> > and it's possible for precompute_udiv
> > to setup the struct udiv to work perfectly fine for powers of two,
> 
> I gave it a thought, and I didn't find how to do that either.

Originally I thought this would work:

p->s1 = ctz(div);
p->inc = 0;
p->mul = 1;
p->s2 = 0;

But the >>32 of course breaks it. What about:

p->s1 = 0
p->inc = 0;
p->mul = 0x80000000;
p->s2 = ctz(div)-1;

This should handle all values except 1, which probably isn't needed,
but maybe there's a special case that works for 1 too...? I think the
following works for all values except 0xffffffff:

p->s1 = 0
p->inc = 1;
p->mul = 0xffffffff;
p->s2 = 0;

Using the post-mul add rather than the saturated increment would make
this work for 0xffffffff too.

Another obvious solution is not using the +32 offset so that the right
shift can just be 31, but that pessimizes the code on 32-bit archs
quite a bit.

> > so that there is a single code path with no branches. The p->s1 branch
> > should not be needed either.
> 
> p->s1 check is for this reason: shift are relatively costly, and s1 is rarely
> non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> is non-zero, the check allows to skip the saturating increment operation.

Shifts are costly? Are we talking about P4-era junk? ;-)

Rich


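The second parameter set proposed above for power-of-two divisors can be
verified mechanically: with s1 = 0, inc = 0, mul = 0x80000000 and
s2 = ctz(div) - 1, the quotient computation collapses to
(x >> 1) >> (ctz(div) - 1) == x >> ctz(div), which is exact for any
div = 2^k with k >= 1.  A stand-alone check of just that path (the umod
arithmetic is restated here for testing; this is not a patch):

```c
#include <assert.h>
#include <stdint.h>

/* Exercise the proposed power-of-two parameters: mul = 0x80000000 halves x
 * via the high-word shift, and s2 = ctz(div) - 1 supplies the remaining
 * shift, giving the exact quotient for div >= 2.  div == 1 is the one value
 * this set cannot express (s2 would be -1). */
static uint32_t umod_pow2(uint32_t x, uint32_t div)
{
	uint32_t mul = 0x80000000u;
	int s2 = __builtin_ctz(div) - 1;            /* GCC/Clang builtin */
	uint32_t v = (1ull * x * mul) >> 32 >> s2;  /* quotient x / div */
	return x - v * div;
}
```

The remaining gap, div == 1, would be covered by the other proposed set
(mul = 0xffffffff with the increment), as discussed above.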

* Re: [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
@ 2015-06-24  5:39   ` Rich Felker
  2015-06-24  6:29     ` Alexander Monakov
  0 siblings, 1 reply; 28+ messages in thread
From: Rich Felker @ 2015-06-24  5:39 UTC (permalink / raw)
  To: musl

On Wed, Jun 24, 2015 at 02:24:51AM +0300, Alexander Monakov wrote:
> Introduce gnu_lookup_filtered and use it to speed up symbol lookups in
> find_sym (do_dlsym is left as is, based on an expectation that frequently
> dlsym queries will use a dlopen handle rather than RTLD_NEXT or RTLD_DEFAULT,
> and will not need to look at more than one DSO).  The size of the bloom filter
> table minus 1, ghashmask, is stored in the dso struct together with the
> hashtable pointer to reduce pointer chasing on the hot path.
> ---
>  src/ldso/dynlink.c | 30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> index b77c6f6..fa91b39 100644
> --- a/src/ldso/dynlink.c
> +++ b/src/ldso/dynlink.c
> @@ -54,6 +54,7 @@ struct dso {
>  	Sym *syms;
>  	uint32_t *hashtab;
>  	uint32_t *ghashtab;
> +	uint32_t ghashmask;
>  	int16_t *versym;
>  	char *strings;
>  	unsigned char *map;
> @@ -200,6 +201,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
>  	return 0;
>  }
>  
> +static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
> +{
> +	uint32_t *hashtab = dso->ghashtab;
> +	size_t *bloomwords = hashtab+4;
> +	size_t f = bloomwords[fofs & dso->ghashmask];

Is this measurably faster than fofs & hashtab[2]-1 ?

I suspect not, in which case it seems desirable not to increase the
size of struct dso. If it is worthwhile, at least don't sandwich a
uint32_t between two pointers, where it might incur another 32 bits of
padding.

Rich



* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  5:13       ` Rich Felker
@ 2015-06-24  6:08         ` Alexander Monakov
  2015-06-24  6:39           ` Rich Felker
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24  6:08 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1719 bytes --]

> > How can I do the last step, 'x-v*div' without it?
> 
> Ah yes. Would it be preferable to have that in struct udiv then,
> though? Then the caller never has to care about the divisor, and in
> the case where umod doesn't get inlined, the call will be more
> efficient (fewer args).

I doubt that; the caller needs that value ('nbuckets') anyway.

[...]
> Using the post-mul add rather than the saturated increment would make
> this work for 0xffffffff too.

post-mul add is problematic on 32-bit
 
> Another obvious solution is not using the +32 offset so that the right
> shift can just be 31, but that pessimizes the code on 32-bit archs
> quite a bit.

I agree that a special case for a power-of-two divisor would fire too rarely,
so figuring out a replacement would be nice.

If you (or anyone) want to play with ideas, I'm attaching my test driver for
magic division that tests 2^32-1 inputs for a divisor given on the command
line (the main loop looks odd, but my goal was to have gcc vectorize it).

> > p->s1 check is for this reason: shift are relatively costly, and s1 is rarely
> > non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> > is non-zero, the check allows to skip the saturating increment operation.
> 
> Shifts are costly? Are we talking about P4-era junk? ;-)

I primarily had modern Intel cores in mind where as far as I can see shifts by
%ecx have latency 2.  But even ignoring that, there's the second part of my
argument.

Anyway it's not trivial to measure an impact on the same "modern Intel cores",
so if anybody can say how that looks on a different platform, I'd appreciate
that.  Removing my if-else there is obviously beneficial for code size.

Alexander

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: TEXT/x-c; name=udiv.c, Size: 1923 bytes --]

#include <stdlib.h>
static void precalc(unsigned div, unsigned *mulp, int *prep, int *postp, int *incp)
{
  int bits = 0;
again:
  *prep = bits;
  unsigned pow = 1u<<31, quo = pow / div, rem = pow % div;

  int log=0, exp;
  unsigned tmp=div;
  do log++; while (tmp >>= 1);

  unsigned down_mul=0;
  for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) {
    int adj = rem >= div - rem;
    quo *= 2; rem *= 2;
    if (adj) {quo++; rem -= div; }

    if (exp >= log-bits || div-rem <= pow)
      break;

    if (!down_mul && rem <= pow) {
      down_mul = quo; *postp = exp;
    }
  }
  if (exp < log) {
    *mulp = quo+1;
    *postp = exp;
    *incp = 0;
  } else if (div & 1) {
    *mulp = down_mul;
    *incp = 1;
  } else {
    do bits++; while (!((div >>= 1) & 1));
    goto again;
  }
}
int main(int argc, char *argv[])
{
  unsigned div = atoi(argv[1]);
  if (!(div&(div-1)))
    return 0;

  unsigned mul;
  int pre, post, inc;
  precalc(div, &mul, &pre, &post, &inc);
  unsigned s32 = 32;
  if (sizeof(long) == 8) {
    s32 += post;
    post = 0;
  }
  unsigned rem, quo, val;
  char bad=0;
#define loop(pre, inc, post) ({    \
  for (val=~0u, quo=val/div, rem=val%div+1; quo+1; quo--, rem=div)      \
    for (; rem; rem--, val--) {  \
      unsigned v = val;            \
      v >>= pre;                   \
      if (v+inc) v+=inc;           \
      v = (1ull * v * mul) >> s32; \
      v >>= post;                  \
      bad |= (v != quo);      \
    } })
  if (post)
    if (inc)
      if (pre)
        return 2;
      else
        loop(0, inc, post);
    else
      if (pre)
        loop(pre, 0, post);
      else
        loop(0, 0, post);
  else
    if (inc)
      if (pre)
        return 2;
      else
        loop(0, inc, 0);
    else
      if (pre)
        loop(pre, 0, 0);
      else
        loop(0, 0, 0);
  return bad;
}


* Re: [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-24  5:39   ` Rich Felker
@ 2015-06-24  6:29     ` Alexander Monakov
  2015-06-24  6:32       ` Alexander Monakov
  2015-06-24  6:50       ` Rich Felker
  0 siblings, 2 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24  6:29 UTC (permalink / raw)
  To: musl

> > diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> > index b77c6f6..fa91b39 100644
> > --- a/src/ldso/dynlink.c
> > +++ b/src/ldso/dynlink.c
> > @@ -54,6 +54,7 @@ struct dso {
> >  	Sym *syms;
> >  	uint32_t *hashtab;
> >  	uint32_t *ghashtab;
> > +	uint32_t ghashmask;
> >  	int16_t *versym;
> >  	char *strings;
> >  	unsigned char *map;
> > @@ -200,6 +201,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
> >  	return 0;
> >  }
> >  
> > +static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
> > +{
> > +	uint32_t *hashtab = dso->ghashtab;
> > +	size_t *bloomwords = hashtab+4;
> > +	size_t f = bloomwords[fofs & dso->ghashmask];
> 
> Is this measurably faster than fofs & hashtab[2]-1 ?

It seems to give a small improvement for me, but with the other patches, too
small to measure reliably.  As I'm using a big testcase, the impact on typical
libraries should be really tiny.

> If suspect not, in which case it seems desirable not to increase the
> size of struct dso. If it is worthwhile, at least don't sandwich a
> uint32_t between two pointers where it might incur another 32 bits of
> padding.

I wanted it near the hashtab pointer, which needs to be loaded anyway.
If ghashmask needs to move so that loading it will access a different cache
line, it's likely to diminish an already small improvement, at which point
it's easier to drop this patch :)

Would be nice if someone could test on a platform with slower caches.

Alexander



* Re: [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-24  6:29     ` Alexander Monakov
@ 2015-06-24  6:32       ` Alexander Monakov
  2015-06-24  6:50       ` Rich Felker
  1 sibling, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24  6:32 UTC (permalink / raw)
  To: musl

> I wanted it to be nearby the hashtab pointer, which needs to be loaded anyway.
> If ghashmask needs to move so that loading it will access a different cache
> line, it's likely to diminish an already small improvement, at which point
> it's easier to drop this patch :)

Erm, I meant the part of the patch with ghashmask, not the whole patch.

Alexander



* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  6:08         ` Alexander Monakov
@ 2015-06-24  6:39           ` Rich Felker
  0 siblings, 0 replies; 28+ messages in thread
From: Rich Felker @ 2015-06-24  6:39 UTC (permalink / raw)
  To: musl

On Wed, Jun 24, 2015 at 09:08:15AM +0300, Alexander Monakov wrote:
> > > How can I do the last step, 'x-v*div' without it?
> > 
> > Ah yes. Would it be preferable to have that in struct udiv then,
> > though? Then the caller never has to care about the divisor, and in
> > the case where umod doesn't get inlined, the call will be more
> > efficient (fewer args).
> 
> I doubt that; the caller needs that value ('nbuckets') anyway.

OK.

> [...]
> > Using the post-mul add rather than the saturated increment would make
> > this work for 0xffffffff too.
> 
> post-mul add is problematic on 32-bit

Maybe. It's not clear to me which is cheaper, a saturating multiply
(well-predicted branch) or an adc. IMO the only place the latter is
clearly more costly is on archs with no proper carry mechanism.

> > Another obvious solution is not using the +32 offset so that the right
> > shift can just be 31, but that pessimizes the code on 32-bit archs
> > quite a bit.
> 
> I agree that a special-case for a power-of-two divisor will fire too rarely,
> so figuring a replacement out would be nice.
> 
> If you (or anyone) want to play with ideas, I'm attaching my test driver for
> magic division that tests 2^32-1 inputs for a divisor given on the command
> line (the main loop looks odd, but my goal was to have gcc vectorize it).

Nice.

> > > p->s1 check is for this reason: shifts are relatively costly, and s1 is rarely
> > > non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> > > is non-zero, the check allows skipping the saturating increment operation.
> > 
> > Shifts are costly? Are we talking about P4-era junk? ;-)
> 
> I primarily had modern Intel cores in mind where as far as I can see shifts by
> %ecx have latency 2.  But even ignoring that, there's the second part of my
> argument.

With a latency of 2, I can't imagine there being any benefit to
attempting to avoid it with a conditional branch.

Rich



* Re: [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-24  6:29     ` Alexander Monakov
  2015-06-24  6:32       ` Alexander Monakov
@ 2015-06-24  6:50       ` Rich Felker
  1 sibling, 0 replies; 28+ messages in thread
From: Rich Felker @ 2015-06-24  6:50 UTC (permalink / raw)
  To: musl

On Wed, Jun 24, 2015 at 09:29:25AM +0300, Alexander Monakov wrote:
> > > diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> > > index b77c6f6..fa91b39 100644
> > > --- a/src/ldso/dynlink.c
> > > +++ b/src/ldso/dynlink.c
> > > @@ -54,6 +54,7 @@ struct dso {
> > >  	Sym *syms;
> > >  	uint32_t *hashtab;
> > >  	uint32_t *ghashtab;
> > > +	uint32_t ghashmask;
> > >  	int16_t *versym;
> > >  	char *strings;
> > >  	unsigned char *map;
> > > @@ -200,6 +201,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
> > >  	return 0;
> > >  }
> > >  
> > > +static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
> > > +{
> > > +	uint32_t *hashtab = dso->ghashtab;
> > > +	size_t *bloomwords = hashtab+4;
> > > +	size_t f = bloomwords[fofs & dso->ghashmask];
> > 
> > Is this measurably faster than fofs & hashtab[2]-1 ?
> 
> It seems to give a small improvement for me, but with the other patches, too
> small to measure reliably.  As I'm using a big testcase, the impact on typical
> libraries should be really tiny.
> 
> > I suspect not, in which case it seems desirable not to increase the
> > size of struct dso. If it is worthwhile, at least don't sandwich a
> > uint32_t between two pointers where it might incur another 32 bits of
> > padding.
> 
> I wanted it to be nearby the hashtab pointer, which needs to be loaded anyway.
> If ghashmask needs to move so that loading it will access a different cache
> line, it's likely to diminish an already small improvement, at which point
> it's easier to drop this patch :)

Well we could just reorder some things to put them adjacent to one
another without wasting the padding, if it helps.

I don't want to reject this part outright, but I think changes like
this that obviously have some cost and where the benefit is
non-obvious should have some strong justification. (The umod stuff
obviously has some size cost too in struct dso, but its benefit is a
lot more clear, especially on archs without hardware div.)

Maybe it would make sense to omit the ghashmask part to begin with (it
makes the patch smaller anyway) and then do some testing to see
whether it's really beneficial once the other things are merged and we
have a stable basis for performance comparison.

Rich



* [PATCH v2 0/6] gnu-hash speedups
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
                   ` (4 preceding siblings ...)
  2015-06-23 23:24 ` [PATCH 5/5] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
@ 2015-06-27 23:48 ` Alexander Monakov
  2015-06-27 23:48   ` [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
                     ` (6 more replies)
  5 siblings, 7 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw)
  To: musl

v2:
reorder patches
[2] split out ghashmask micro-optimization into a separate patch
[2] use 'unsigned char' rather than 'int' for struct udiv fields s1,s2,inc
[2] use an explicit cast in bloomwords assignment
[4] reorder 'hashtab' before 'dso'
[5] remove branch on zero s1
[5] tweak implementation of saturating addition
[5] fold 32-bit shift into s2, avoiding run-time adjustments on 64-bit arch

Alexander Monakov (6):
  dynlink.c: use a faster expression in gnu_hash
  dynlink.c: use bloom filter in gnu hash lookup
  dynlink.c: slim down gnu_lookup
  dynlink.c: pass gnu-hash table pointer to gnu_lookup
  dynlink.c: compute modulus via magic multiplication
  dynlink.c: store bloom filter size in struct dso

 src/ldso/dynlink.c | 124 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 101 insertions(+), 23 deletions(-)




* [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
@ 2015-06-27 23:48   ` Alexander Monakov
  2015-06-27 23:48   ` [PATCH v2 2/6] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw)
  To: musl

With -Os, GCC uses a multiply rather than a shift and addition for 'h*33'.
Use a more efficient expression explicitly.
---
 src/ldso/dynlink.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index b77c6f6..0df81f2 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -156,7 +156,7 @@ static uint32_t gnu_hash(const char *s0)
 	const unsigned char *s = (void *)s0;
 	uint_fast32_t h = 5381;
 	for (; *s; s++)
-		h = h*33 + *s;
+		h += h*32 + *s;
 	return h;
 }
 



* [PATCH v2 2/6] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
  2015-06-27 23:48   ` [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
@ 2015-06-27 23:48   ` Alexander Monakov
  2015-06-27 23:48   ` [PATCH v2 3/6] dynlink.c: slim down gnu_lookup Alexander Monakov
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw)
  To: musl

Introduce gnu_lookup_filtered and use it to speed up symbol lookups in
find_sym (do_dlsym is left as is, based on an expectation that frequent
dlsym queries will use a dlopen handle rather than RTLD_NEXT or RTLD_DEFAULT,
and will not need to look at more than one DSO).
---
 src/ldso/dynlink.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index 0df81f2..b08596e 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -200,6 +200,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 	return 0;
 }
 
+static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
+{
+	uint32_t *hashtab = dso->ghashtab;
+	const size_t *bloomwords = (const void *)(hashtab+4);
+	size_t f = bloomwords[fofs & (hashtab[2]-1)];
+	if (!(f & fmask)) return 0;
+
+	f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
+	if (!(f & 1)) return 0;
+
+	return gnu_lookup(s, h1, dso);
+}
+
 #define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS)
 #define OK_BINDS (1<<STB_GLOBAL | 1<<STB_WEAK | 1<<STB_GNU_UNIQUE)
 
@@ -209,14 +222,20 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 
 static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
 {
-	uint32_t h = 0, gh = 0;
+	uint32_t h = 0, gh, gho;
+	size_t ghm = 0;
 	struct symdef def = {0};
 	for (; dso; dso=dso->next) {
 		Sym *sym;
 		if (!dso->global) continue;
 		if (dso->ghashtab) {
-			if (!gh) gh = gnu_hash(s);
-			sym = gnu_lookup(s, gh, dso);
+			if (!ghm) {
+				gh = gnu_hash(s);
+				int maskbits = 8 * sizeof ghm;
+				gho = gh / maskbits;
+				ghm = 1ul << gh % maskbits;
+			}
+			sym = gnu_lookup_filtered(s, gh, dso, gho, ghm);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, dso);



* [PATCH v2 3/6] dynlink.c: slim down gnu_lookup
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
  2015-06-27 23:48   ` [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
  2015-06-27 23:48   ` [PATCH v2 2/6] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
@ 2015-06-27 23:48   ` Alexander Monakov
  2015-06-27 23:48   ` [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw)
  To: musl

Do not reference dso->syms and dso->strings until point of use.
Check 'h1 == (h2|1)', the simplest condition, before the others.
---
 src/ldso/dynlink.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index b08596e..0ca7a16 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -176,24 +176,20 @@ static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso)
 
 static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 {
-	Sym *syms = dso->syms;
-	char *strings = dso->strings;
 	uint32_t *hashtab = dso->ghashtab;
 	uint32_t nbuckets = hashtab[0];
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
-	uint32_t h2;
-	uint32_t *hashval;
 	uint32_t i = buckets[h1 % nbuckets];
 
 	if (!i) return 0;
 
-	hashval = buckets + nbuckets + (i - hashtab[1]);
+	uint32_t *hashval = buckets + nbuckets + (i - hashtab[1]);
 
 	for (h1 |= 1; ; i++) {
-		h2 = *hashval++;
-		if ((!dso->versym || dso->versym[i] >= 0)
-		    && (h1 == (h2|1)) && !strcmp(s, strings + syms[i].st_name))
-			return syms+i;
+		uint32_t h2 = *hashval++;
+		if ((h1 == (h2|1)) && (!dso->versym || dso->versym[i] >= 0)
+		    && !strcmp(s, dso->strings + dso->syms[i].st_name))
+			return dso->syms+i;
 		if (h2 & 1) break;
 	}
 



* [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
                     ` (2 preceding siblings ...)
  2015-06-27 23:48   ` [PATCH v2 3/6] dynlink.c: slim down gnu_lookup Alexander Monakov
@ 2015-06-27 23:48   ` Alexander Monakov
  2015-06-28  0:05     ` Alexander Monakov
  2015-06-27 23:48   ` [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication Alexander Monakov
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw)
  To: musl

The callers need to check the value of the pointer anyway, so make them pass
the pointer to gnu_lookup instead of reloading it there.

Reorder gnu_lookup arguments so that always-used ones are listed first.  GCC
can choose a calling convention with arguments in registers (e.g. up to 3
arguments in eax, ecx, edx on x86), but cannot reorder the arguments for
static functions.
---
 src/ldso/dynlink.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index 0ca7a16..1c62efe 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -174,9 +174,8 @@ static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso)
 	return 0;
 }
 
-static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
+static Sym *gnu_lookup(uint32_t h1, uint32_t *hashtab, struct dso *dso, const char *s)
 {
-	uint32_t *hashtab = dso->ghashtab;
 	uint32_t nbuckets = hashtab[0];
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
 	uint32_t i = buckets[h1 % nbuckets];
@@ -196,9 +195,9 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 	return 0;
 }
 
-static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
+static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso, const char *s,
+				uint32_t fofs, size_t fmask)
 {
-	uint32_t *hashtab = dso->ghashtab;
 	const size_t *bloomwords = (const void *)(hashtab+4);
 	size_t f = bloomwords[fofs & (hashtab[2]-1)];
 	if (!(f & fmask)) return 0;
@@ -206,7 +205,7 @@ static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uin
 	f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
 	if (!(f & 1)) return 0;
 
-	return gnu_lookup(s, h1, dso);
+	return gnu_lookup(h1, hashtab, dso, s);
 }
 
 #define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS)
@@ -218,20 +217,20 @@ static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uin
 
 static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
 {
-	uint32_t h = 0, gh, gho;
+	uint32_t h = 0, gh, gho, *ght;
 	size_t ghm = 0;
 	struct symdef def = {0};
 	for (; dso; dso=dso->next) {
 		Sym *sym;
 		if (!dso->global) continue;
-		if (dso->ghashtab) {
+		if ((ght = dso->ghashtab)) {
 			if (!ghm) {
 				gh = gnu_hash(s);
 				int maskbits = 8 * sizeof ghm;
 				gho = gh / maskbits;
 				ghm = 1ul << gh % maskbits;
 			}
-			sym = gnu_lookup_filtered(s, gh, dso, gho, ghm);
+			sym = gnu_lookup_filtered(gh, ght, dso, s, gho, ghm);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, dso);
@@ -1559,7 +1558,7 @@ void *__tls_get_addr(size_t *);
 static void *do_dlsym(struct dso *p, const char *s, void *ra)
 {
 	size_t i;
-	uint32_t h = 0, gh = 0;
+	uint32_t h = 0, gh = 0, *ght;
 	Sym *sym;
 	if (p == head || p == RTLD_DEFAULT || p == RTLD_NEXT) {
 		if (p == RTLD_DEFAULT) {
@@ -1577,9 +1576,9 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
 	}
 	if (invalid_dso_handle(p))
 		return 0;
-	if (p->ghashtab) {
+	if ((ght = p->ghashtab)) {
 		gh = gnu_hash(s);
-		sym = gnu_lookup(s, gh, p);
+		sym = gnu_lookup(gh, ght, p, s);
 	} else {
 		h = sysv_hash(s);
 		sym = sysv_lookup(s, h, p);
@@ -1589,9 +1588,9 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
 	if (sym && sym->st_value && (1<<(sym->st_info&0xf) & OK_TYPES))
 		return p->base + sym->st_value;
 	if (p->deps) for (i=0; p->deps[i]; i++) {
-		if (p->deps[i]->ghashtab) {
+		if ((ght = p->deps[i]->ghashtab)) {
 			if (!gh) gh = gnu_hash(s);
-			sym = gnu_lookup(s, gh, p->deps[i]);
+			sym = gnu_lookup(gh, ght, p->deps[i], s);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, p->deps[i]);



* [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
                     ` (3 preceding siblings ...)
  2015-06-27 23:48   ` [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
@ 2015-06-27 23:48   ` Alexander Monakov
  2015-06-30 17:51     ` Rich Felker
  2015-06-27 23:48   ` [PATCH v2 6/6] dynlink.c: store bloom filter size in struct dso Alexander Monakov
  2015-06-28  2:45   ` [PATCH v2 0/6] gnu-hash speedups Rich Felker
  6 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw)
  To: musl

Based on http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html

Do a little hand-holding for the compiler and fold magic post-shift into
32-bit high word -> low word shift on 64-bit platforms.
---
 src/ldso/dynlink.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 64 insertions(+), 2 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index 1c62efe..f181209 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -41,6 +41,11 @@ struct td_index {
 	struct td_index *next;
 };
 
+struct udiv {
+	uint32_t mul;
+	unsigned char s1, s2, inc;
+};
+
 struct dso {
 	unsigned char *base;
 	char *name;
@@ -54,6 +59,7 @@ struct dso {
 	Sym *syms;
 	uint32_t *hashtab;
 	uint32_t *ghashtab;
+	struct udiv gudiv;
 	int16_t *versym;
 	char *strings;
 	unsigned char *map;
@@ -140,6 +146,60 @@ static int search_vec(size_t *v, size_t *r, size_t key)
 	return 1;
 }
 
+static void precompute_udiv(uint32_t div, struct udiv *p)
+{
+	if (!(div&(div-1)))
+		return;
+	int bits = 0, s2adj = (sizeof(long) == 8) ? 32 : 0;
+again:
+	p->s1 = bits;
+	uint32_t tmp = 1u<<31, quo = tmp/div, rem = tmp%div;
+
+	int log=0;
+	tmp=div; do log++; while (tmp>>=1);
+
+	int exp, rdshf;
+	uint32_t pow, rdmul=0;
+	for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) {
+		int ovf = rem >= div - rem;
+		quo *= 2; rem *= 2;
+		if (ovf) quo++, rem -= div;
+
+		if (exp >= log-bits || div-rem <= pow)
+			break;
+
+		if (!rdmul && rem <= pow)
+			rdmul = quo, rdshf = exp;
+	}
+	if (exp < log) {
+		p->mul = quo+1;
+		p->s2  = exp + s2adj;
+		p->inc = 0;
+	} else if (div & 1) {
+		p->mul = rdmul;
+		p->s2  = rdshf + s2adj;
+		p->inc = 1;
+	} else {
+		do bits++; while (!((div >>= 1) & 1));
+		goto again;
+	}
+}
+
+static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
+{
+	if (!(div&(div-1)))
+		return x&(div-1);
+	uint32_t v = x;
+	v >>= p->s1;
+	if (v + p->inc) v += p->inc;
+	int s32=32, s2=p->s2;
+	if (sizeof(long) == 8)
+		s32=s2, s2=0;
+	v = (1ull * v * p->mul) >> s32;
+	v >>= s2;
+	return x-v*div;
+}
+
 static uint32_t sysv_hash(const char *s0)
 {
 	const unsigned char *s = (void *)s0;
@@ -178,7 +238,7 @@ static Sym *gnu_lookup(uint32_t h1, uint32_t *hashtab, struct dso *dso, const ch
 {
 	uint32_t nbuckets = hashtab[0];
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
-	uint32_t i = buckets[h1 % nbuckets];
+	uint32_t i = buckets[umod(h1, nbuckets, &dso->gudiv)];
 
 	if (!i) return 0;
 
@@ -696,8 +756,10 @@ static void decode_dyn(struct dso *p)
 		p->rpath_orig = (void *)(p->strings + dyn[DT_RPATH]);
 	if (dyn[0]&(1<<DT_RUNPATH))
 		p->rpath_orig = (void *)(p->strings + dyn[DT_RUNPATH]);
-	if (search_vec(p->dynv, dyn, DT_GNU_HASH))
+	if (search_vec(p->dynv, dyn, DT_GNU_HASH)) {
 		p->ghashtab = (void *)(p->base + *dyn);
+		precompute_udiv(p->ghashtab[0], &p->gudiv);
+	}
 	if (search_vec(p->dynv, dyn, DT_VERSYM))
 		p->versym = (void *)(p->base + *dyn);
 }



* [PATCH v2 6/6] dynlink.c: store bloom filter size in struct dso
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
                     ` (4 preceding siblings ...)
  2015-06-27 23:48   ` [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication Alexander Monakov
@ 2015-06-27 23:48   ` Alexander Monakov
  2015-06-28  2:45   ` [PATCH v2 0/6] gnu-hash speedups Rich Felker
  6 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw)
  To: musl

This slightly speeds up gnu_lookup_filtered by avoiding pointer chasing.
---
 src/ldso/dynlink.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index f181209..12e8c25 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -59,6 +59,7 @@ struct dso {
 	Sym *syms;
 	uint32_t *hashtab;
 	uint32_t *ghashtab;
+	uint32_t ghashmask;
 	struct udiv gudiv;
 	int16_t *versym;
 	char *strings;
@@ -259,7 +260,7 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso,
 				uint32_t fofs, size_t fmask)
 {
 	const size_t *bloomwords = (const void *)(hashtab+4);
-	size_t f = bloomwords[fofs & (hashtab[2]-1)];
+	size_t f = bloomwords[fofs & dso->ghashmask];
 	if (!(f & fmask)) return 0;
 
 	f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
@@ -758,6 +759,7 @@ static void decode_dyn(struct dso *p)
 		p->rpath_orig = (void *)(p->strings + dyn[DT_RUNPATH]);
 	if (search_vec(p->dynv, dyn, DT_GNU_HASH)) {
 		p->ghashtab = (void *)(p->base + *dyn);
+		p->ghashmask = p->ghashtab[2]-1;
 		precompute_udiv(p->ghashtab[0], &p->gudiv);
 	}
 	if (search_vec(p->dynv, dyn, DT_VERSYM))



* Re: [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup
  2015-06-27 23:48   ` [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
@ 2015-06-28  0:05     ` Alexander Monakov
  2015-06-28  8:59       ` Alexander Monakov
  0 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-28  0:05 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: TEXT/PLAIN, Size: 421 bytes --]

Rich,

here's a version of your bench that should test latency rather than throughput
of the umod operation, with random divisors.

From my testing it looks like your version with 64-bit add is clearly better
on 64-bit, but loses on 32-bit unless 'add' is made a 64-bit field.

An advantage of 'saturating add' is that it doesn't need extra space ('inc' is
1 byte rather than 4, which are required for 'add').

Alexander

[-- Attachment #2: Type: TEXT/x-c; name=umod-bench.c, Size: 3087 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

struct udiv {
	uint64_t add;
	uint32_t mul;
	int s1, s2, inc;
};

static void precompute_udiv(uint32_t div, struct udiv *p)
{
	int s2_adj = (sizeof(long)==8) ? 32 : 0;
	if (!(div&(div-1))) {
		if (sizeof(long) == 8) {
			p->s1 = 0;
			p->mul = 1;
			p->add = 0;
			for (p->s2=0; div>1; div>>=1, p->s2++);
		} else if (div == 1) {
			p->s1 = 0;
			p->s2 = 0;
			p->mul = 0xffffffff;
			p->add = 0xffffffff;
		} else {
			p->s1 = 0;
			p->mul = 0x80000000;
			p->add = 0;
			for (p->s2=0; div>2; div>>=1, p->s2++);
		}
		return;
	}
	int bits = 0;
again:
	p->s1 = bits;
	uint32_t tmp = 1u<<31, quo = tmp/div, rem = tmp%div;

	int log=0;
	tmp=div; do log++; while (tmp>>=1);

	int exp, rdshf;
	uint32_t pow, rdmul=0;
	for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) {
		int ovf = rem >= div - rem;
		quo *= 2; rem *= 2;
		if (ovf) quo++, rem -= div;

		if (exp >= log-bits || div-rem <= pow)
			break;

		if (!rdmul && rem <= pow)
			rdmul = quo, rdshf = exp;
	}
	if (exp < log) {
		p->mul = quo+1;
		p->s2  = exp + s2_adj;
		p->inc = 0;
		p->add = 0;
	} else if (div & 1) {
		p->mul = rdmul;
		p->s2  = rdshf + s2_adj;
		p->inc = 1;
		p->add = rdmul;
	} else {
		do bits++; while (!((div >>= 1) & 1));
		goto again;
	}
}

static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
{
#if 0
	int s32 = (sizeof(long)==8) ? 0 : 32;
	return x - ((unsigned long)(1ull * (x>>p->s1) * p->mul + p->add >> s32) >> p->s2)*div;
#elif 0
	int s32 = (sizeof(long)==8) ? 0 : 32;
	uint32_t v = x;
	v >>= p->s1;
	if (v!=-1) v += p->inc;
	v = (unsigned long)((1ull * v * p->mul) >> s32) >> p->s2;
	return x-v*div;
#else
	if (!(div&(div-1)))
		return x&(div-1);
	uint32_t v = x;
	v >>= p->s1;
	if (v + p->inc) v += p->inc;
	int s32=32, s2=p->s2;
	if (sizeof(long) == 8)
		s32=s2, s2=0;
	v = (1ull * v * p->mul) >> s32;
	v >>= s2;
	return x-v*div;
#endif
}

#define N 1000000

uint32_t ext_umod(uint32_t x, uint32_t div, struct udiv *p)
{
	return umod(x, div, p);
}

int main(int argc, char **argv)
{
	size_t i;
	struct timespec t0, t;
	uint32_t *a = calloc(N, sizeof *a);
	for (i=0; i<N; i++) a[i] = 2+rand();
	struct udiv *ud = calloc(N, sizeof *ud);
	for (i=0; i<N; i++) precompute_udiv(a[i], ud+i);
	uint32_t accum, val;

	clock_gettime(CLOCK_REALTIME, &t0);
	for (accum=i=0, val=-1; i<N; i++) accum += (val %= a[i]);
	clock_gettime(CLOCK_REALTIME, &t);
	t.tv_sec -= t0.tv_sec;
	if ((t.tv_nsec -= t0.tv_nsec) < 0) {
		t.tv_nsec += 1000000000;
		t.tv_sec--;
	}
	printf("%lu %lld.%.9ld\n", (unsigned long)accum, (long long)t.tv_sec, t.tv_nsec);

	clock_gettime(CLOCK_REALTIME, &t0);
	for (accum=i=0, val=-1; i<N; i++) accum += (val = umod(val, a[i], ud+i));
	clock_gettime(CLOCK_REALTIME, &t);
	t.tv_sec -= t0.tv_sec;
	if ((t.tv_nsec -= t0.tv_nsec) < 0) {
		t.tv_nsec += 1000000000;
		t.tv_sec--;
	}
	printf("%lu %lld.%.9ld\n", (unsigned long)accum, (long long)t.tv_sec, t.tv_nsec);
	return 0;
}


* Re: [PATCH v2 0/6] gnu-hash speedups
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
                     ` (5 preceding siblings ...)
  2015-06-27 23:48   ` [PATCH v2 6/6] dynlink.c: store bloom filter size in struct dso Alexander Monakov
@ 2015-06-28  2:45   ` Rich Felker
  6 siblings, 0 replies; 28+ messages in thread
From: Rich Felker @ 2015-06-28  2:45 UTC (permalink / raw)
  To: musl

On Sun, Jun 28, 2015 at 02:48:29AM +0300, Alexander Monakov wrote:
> v2:
> reorder patches
> [2] split out ghashmask micro-optimization into a separate patch
> [2] use 'unsigned char' rather than 'int' for struct udiv fields s1,s2,inc
> [2] use an explicit cast in bloomwords assignment
> [4] reorder 'hashtab' before 'dso'
> [5] remove branch on zero s1
> [5] tweak implementation of saturating addition
> [5] fold 32-bit shift into s2, avoiding run-time adjustments on 64-bit arch
> 
> Alexander Monakov (6):
>   dynlink.c: use a faster expression in gnu_hash
>   dynlink.c: use bloom filter in gnu hash lookup
>   dynlink.c: slim down gnu_lookup
>   dynlink.c: pass gnu-hash table pointer to gnu_lookup

I've committed 1-4 with minimal changes: refolding commit messages so
"git log" fits in 80 columns, fixing a typo in one of the messages,
and removing the line-break in the function args for
gnu_lookup_filtered (there were already well over 80 chars before the
break, and if it's going to be broken, spaces rather than tabs need to
be used for alignment).

>   dynlink.c: compute modulus via magic multiplication
>   dynlink.c: store bloom filter size in struct dso

Let's resume testing these when you next have time. The timing for the
umod stuff is still so unstable under different -O and -mtune
combinations that I'm not sure what to think about which version is
best. But it looks like a clear win. I'll see if I can do some tests
on real arm hardware too, since I expect it to make a big difference
there.

Rich



* Re: [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup
  2015-06-28  0:05     ` Alexander Monakov
@ 2015-06-28  8:59       ` Alexander Monakov
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-28  8:59 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: TEXT/PLAIN, Size: 812 bytes --]

(this and the previous email are actually following up on patch 5, not 4)

How about handling power-of-two divisors like this.

In precompute_udiv:

1. Factor out powers of two from 'div' up front and get rid of 'goto again'
   (ok since doing the pre-shift unconditionally looks better)
2. If 'div' is now 1:
   - on 64-bit there's no problem: set 'mul' to 1 and 's2' to 0
   - on 32-bit: set 'mul' to 0

...and in umod:

3. On 32-bit arches, check 'mul' after pre-shift and skip to end if zero.

With this change it's easy to drop 'div' argument from 'umod', rename 'umod'
to 'udiv' and make the caller do the modulus computation.

Attaching a variant of the latency bench that implements the above; I also
tried to avoid small 'val' in the bench loop by using a recurrence with a
bitwise inversion.

Alexander

[-- Attachment #2: Type: TEXT/x-c; name=umod-bench.c, Size: 2338 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

struct udiv {
	uint32_t mul;
	unsigned char s1, s2, inc;
};

static void precompute_udiv(uint32_t div, struct udiv *p)
{
	int bits = 0;
	while (!(div & 1)) div >>= 1, bits++;
	p->s1 = bits;
	if (div == 1) {
		p->mul = (sizeof(long) == 8);
		p->s2  = 0;
		p->inc = 0;
		return;
	}
	uint32_t tmp = 1u<<31, quo = tmp/div, rem = tmp%div;

	int log=0;
	tmp=div; do log++; while (tmp>>=1);

	int exp, rdshf;
	uint32_t pow, rdmul=0;
	for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) {
		int ovf = rem >= div - rem;
		quo *= 2; rem *= 2;
		if (ovf) quo++, rem -= div;

		if (exp >= log-bits || div-rem <= pow)
			break;

		if (!rdmul && rem <= pow)
			rdmul = quo, rdshf = exp;
	}
	int s2_adj = (sizeof(long)==8) ? 32 : 0;
	if (exp < log) {
		p->mul = quo+1;
		p->s2  = exp + s2_adj;
		p->inc = 0;
	} else {
		p->mul = rdmul;
		p->s2  = rdshf + s2_adj;
		p->inc = 1;
	}
}

static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
{
	uint32_t v = x >> p->s1;
	if (sizeof(long) == 8 || p->mul) {
		if (v + p->inc) v += p->inc;
		int s32=32, s2=p->s2;
		if (sizeof(long) == 8)
			s32=s2, s2=0;
		v = (1ull * v * p->mul) >> s32;
		v >>= s2;
	}
	return x-v*div;
}

#define N 1000000

uint32_t ext_umod(uint32_t x, uint32_t div, struct udiv *p)
{
	return umod(x, div, p);
}

int main(int argc, char **argv)
{
	size_t i;
	struct timespec t0, t;
	uint32_t *a = calloc(N, sizeof *a);
	for (i=0; i<N; i++) a[i] = 1+rand();
	struct udiv *ud = calloc(N, sizeof *ud);
	for (i=0; i<N; i++) precompute_udiv(a[i], ud+i);
	uint32_t val;

	clock_gettime(CLOCK_REALTIME, &t0);
	for (i=0, val=-1; i<N; i++) val = ~(val % a[i]);
	clock_gettime(CLOCK_REALTIME, &t);
	t.tv_sec -= t0.tv_sec;
	if ((t.tv_nsec -= t0.tv_nsec) < 0) {
		t.tv_nsec += 1000000000;
		t.tv_sec--;
	}
	printf("%lu %lld.%.9ld\n", (unsigned long)val, (long long)t.tv_sec, t.tv_nsec);

	clock_gettime(CLOCK_REALTIME, &t0);
	for (i=0, val=-1; i<N; i++) val = ~umod(val, a[i], ud+i);
	clock_gettime(CLOCK_REALTIME, &t);
	t.tv_sec -= t0.tv_sec;
	if ((t.tv_nsec -= t0.tv_nsec) < 0) {
		t.tv_nsec += 1000000000;
		t.tv_sec--;
	}
	printf("%lu %lld.%.9ld\n", (unsigned long)val, (long long)t.tv_sec, t.tv_nsec);
	return 0;
}


* Re: [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication
  2015-06-27 23:48   ` [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication Alexander Monakov
@ 2015-06-30 17:51     ` Rich Felker
  0 siblings, 0 replies; 28+ messages in thread
From: Rich Felker @ 2015-06-30 17:51 UTC (permalink / raw)
  To: musl


On Sun, Jun 28, 2015 at 02:48:34AM +0300, Alexander Monakov wrote:
> Based on http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html
> 
> Do a little hand-holding for the compiler and fold the magic post-shift
> into the 32-bit high-word -> low-word shift on 64-bit platforms.
> ---

On 32-bit Atom, I was able to get an improvement from 116ms to 113ms
running an empty program linked with a .so produced using the attached
script. This is probably close to the best-case improvement, so the
umod change is not likely to be worthwhile unless/until we can improve
the areas that currently dominate the runtime. That's rather
disappointing, but maybe we can improve enough other things for the
umod optimization to make sense.

Rich

[-- Attachment #2: gen.sh --]
[-- Type: application/x-sh, Size: 112 bytes --]



Thread overview: 28+ messages
2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
2015-06-24  5:39   ` Rich Felker
2015-06-24  6:29     ` Alexander Monakov
2015-06-24  6:32       ` Alexander Monakov
2015-06-24  6:50       ` Rich Felker
2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
2015-06-24  4:18   ` Alexander Monakov
2015-06-24  4:19     ` Rich Felker
2015-06-24  4:24   ` Rich Felker
2015-06-24  4:32     ` Alexander Monakov
2015-06-24  5:13       ` Rich Felker
2015-06-24  6:08         ` Alexander Monakov
2015-06-24  6:39           ` Rich Felker
2015-06-23 23:24 ` [PATCH 3/5] dynlink.c: slim down gnu_lookup Alexander Monakov
2015-06-23 23:24 ` [PATCH 4/5] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
2015-06-23 23:24 ` [PATCH 5/5] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 2/6] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 3/6] dynlink.c: slim down gnu_lookup Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
2015-06-28  0:05     ` Alexander Monakov
2015-06-28  8:59       ` Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication Alexander Monakov
2015-06-30 17:51     ` Rich Felker
2015-06-27 23:48   ` [PATCH v2 6/6] dynlink.c: store bloom filter size in struct dso Alexander Monakov
2015-06-28  2:45   ` [PATCH v2 0/6] gnu-hash speedups Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
