* [PATCH 0/5] gnu-hash speedups
@ 2015-06-23 23:24 Alexander Monakov
  2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
  ` (5 more replies)

0 siblings, 6 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
To: musl; +Cc: Alexander Monakov

Hello!

I'm happy to finally send the results of my tinkering with .gnu.hash in
musl. This patch series should improve the speed of initial loading with
many dynamic libraries and symbols, provided that .gnu.hash is populated
by the linker.

I was testing on 32-bit x86 with Clang/LLVM, and got a speedup from
~235-245ms to ~95-105ms. This is about halfway between non-lazy binding
and lazy binding with glibc.

Most of the speedup comes from the first patch (Bloom filters), and a bit
from the last (strength reduction for h*33). Notably, I don't see an
improvement from the second patch (strength reduction for modulus), but
it may be there for other architectures.

I'd love to see test results on other platforms! Keep in mind, though,
that the patches don't affect loading when .gnu.hash is not present.

Thanks to Rich for ideas (unsigned division by magic and 'h1 == (h2|1)'
test reordering), and to community members who encouraged me to finish
this work :)

Yours,
Alexander

Alexander Monakov (5):
  dynlink.c: use bloom filter in gnu hash lookup
  dynlink.c: compute modulus via magic multiplication
  dynlink.c: slim down gnu_lookup
  dynlink.c: pass gnu-hash table pointer to gnu_lookup
  dynlink.c: use a faster expression in gnu_hash

 src/ldso/dynlink.c | 124 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 101 insertions(+), 23 deletions(-)

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-24  5:39   ` Rich Felker
  2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
  ` (4 subsequent siblings)

5 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
To: musl; +Cc: Alexander Monakov

Introduce gnu_lookup_filtered and use it to speed up symbol lookups in
find_sym (do_dlsym is left as is, based on an expectation that dlsym
queries will frequently use a dlopen handle rather than RTLD_NEXT or
RTLD_DEFAULT, and will not need to look at more than one DSO). The size
of the bloom filter table minus 1, ghashmask, is stored in the dso
struct together with the hashtable pointer to reduce pointer chasing on
the hot path.
---
 src/ldso/dynlink.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index b77c6f6..fa91b39 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -54,6 +54,7 @@ struct dso {
 	Sym *syms;
 	uint32_t *hashtab;
 	uint32_t *ghashtab;
+	uint32_t ghashmask;
 	int16_t *versym;
 	char *strings;
 	unsigned char *map;
@@ -200,6 +201,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 	return 0;
 }
 
+static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
+{
+	uint32_t *hashtab = dso->ghashtab;
+	size_t *bloomwords = hashtab+4;
+	size_t f = bloomwords[fofs & dso->ghashmask];
+	if (!(f & fmask)) return 0;
+
+	f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
+	if (!(f & 1)) return 0;
+
+	return gnu_lookup(s, h1, dso);
+}
+
 #define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS)
 #define OK_BINDS (1<<STB_GLOBAL | 1<<STB_WEAK | 1<<STB_GNU_UNIQUE)
 
@@ -209,14 +223,20 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
 {
-	uint32_t h = 0, gh = 0;
+	uint32_t h = 0, gh, gho;
+	size_t ghm = 0;
 	struct symdef def = {0};
 	for (; dso; dso=dso->next) {
 		Sym *sym;
 		if (!dso->global) continue;
 		if (dso->ghashtab) {
-			if (!gh) gh = gnu_hash(s);
-			sym = gnu_lookup(s, gh, dso);
+			if (!ghm) {
+				gh = gnu_hash(s);
+				int maskbits = 8 * sizeof ghm;
+				gho = gh / maskbits;
+				ghm = 1ul << gh % maskbits;
+			}
+			sym = gnu_lookup_filtered(s, gh, dso, gho, ghm);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, dso);
@@ -682,8 +702,10 @@ static void decode_dyn(struct dso *p)
 		p->rpath_orig = (void *)(p->strings + dyn[DT_RPATH]);
 	if (dyn[0]&(1<<DT_RUNPATH))
 		p->rpath_orig = (void *)(p->strings + dyn[DT_RUNPATH]);
-	if (search_vec(p->dynv, dyn, DT_GNU_HASH))
+	if (search_vec(p->dynv, dyn, DT_GNU_HASH)) {
 		p->ghashtab = (void *)(p->base + *dyn);
+		p->ghashmask = p->ghashtab[2]-1;
+	}
 	if (search_vec(p->dynv, dyn, DT_VERSYM))
 		p->versym = (void *)(p->base + *dyn);
 }

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
@ 2015-06-24  5:39 ` Rich Felker
  2015-06-24  6:29   ` Alexander Monakov

0 siblings, 1 reply; 28+ messages in thread
From: Rich Felker @ 2015-06-24 5:39 UTC (permalink / raw)
To: musl

On Wed, Jun 24, 2015 at 02:24:51AM +0300, Alexander Monakov wrote:
> Introduce gnu_lookup_filtered and use it to speed up symbol lookups in
> find_sym (do_dlsym is left as is, based on an expectation that dlsym
> queries will frequently use a dlopen handle rather than RTLD_NEXT or
> RTLD_DEFAULT, and will not need to look at more than one DSO). The size
> of the bloom filter table minus 1, ghashmask, is stored in the dso
> struct together with the hashtable pointer to reduce pointer chasing on
> the hot path.
> ---
>  src/ldso/dynlink.c | 30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> index b77c6f6..fa91b39 100644
> --- a/src/ldso/dynlink.c
> +++ b/src/ldso/dynlink.c
> @@ -54,6 +54,7 @@ struct dso {
>  	Sym *syms;
>  	uint32_t *hashtab;
>  	uint32_t *ghashtab;
> +	uint32_t ghashmask;
>  	int16_t *versym;
>  	char *strings;
>  	unsigned char *map;
> @@ -200,6 +201,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
>  	return 0;
>  }
> 
> +static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
> +{
> +	uint32_t *hashtab = dso->ghashtab;
> +	size_t *bloomwords = hashtab+4;
> +	size_t f = bloomwords[fofs & dso->ghashmask];

Is this measurably faster than fofs & hashtab[2]-1 ? I suspect not, in
which case it seems desirable not to increase the size of struct dso.
If it is worthwhile, at least don't sandwich a uint32_t between two
pointers where it might incur another 32 bits of padding.

Rich

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-24  5:39 ` Rich Felker
@ 2015-06-24  6:29 ` Alexander Monakov
  2015-06-24  6:32   ` Alexander Monakov
  2015-06-24  6:50   ` Rich Felker

0 siblings, 2 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24 6:29 UTC (permalink / raw)
To: musl

> > diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> > index b77c6f6..fa91b39 100644
> > --- a/src/ldso/dynlink.c
> > +++ b/src/ldso/dynlink.c
> > @@ -54,6 +54,7 @@ struct dso {
> >  	Sym *syms;
> >  	uint32_t *hashtab;
> >  	uint32_t *ghashtab;
> > +	uint32_t ghashmask;
> >  	int16_t *versym;
> >  	char *strings;
> >  	unsigned char *map;
> > @@ -200,6 +201,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
> >  	return 0;
> >  }
> > 
> > +static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
> > +{
> > +	uint32_t *hashtab = dso->ghashtab;
> > +	size_t *bloomwords = hashtab+4;
> > +	size_t f = bloomwords[fofs & dso->ghashmask];
> 
> Is this measurably faster than fofs & hashtab[2]-1 ?

It seems to give a small improvement for me, but with the other patches,
too small to measure reliably. As I'm using a big testcase, the impact
on typical libraries should be really tiny.

> I suspect not, in which case it seems desirable not to increase the
> size of struct dso. If it is worthwhile, at least don't sandwich a
> uint32_t between two pointers where it might incur another 32 bits of
> padding.

I wanted it to be nearby the hashtab pointer, which needs to be loaded
anyway. If ghashmask needs to move so that loading it will access a
different cache line, it's likely to diminish an already small
improvement, at which point it's easier to drop this patch :)

Would be nice if someone could test on a platform with slower caches.

Alexander

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-24  6:29 ` Alexander Monakov
@ 2015-06-24  6:32 ` Alexander Monakov
  2015-06-24  6:50 ` Rich Felker

1 sibling, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24 6:32 UTC (permalink / raw)
To: musl

> I wanted it to be nearby the hashtab pointer, which needs to be loaded anyway.
> If ghashmask needs to move so that loading it will access a different cache
> line, it's likely to diminish an already small improvement, at which point
> it's easier to drop this patch :)

Erm, I meant the part of the patch with ghashmask, not the whole patch.

Alexander

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup
  2015-06-24  6:29 ` Alexander Monakov
  2015-06-24  6:32 ` Alexander Monakov
@ 2015-06-24  6:50 ` Rich Felker

1 sibling, 0 replies; 28+ messages in thread
From: Rich Felker @ 2015-06-24 6:50 UTC (permalink / raw)
To: musl

On Wed, Jun 24, 2015 at 09:29:25AM +0300, Alexander Monakov wrote:
> > > diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> > > index b77c6f6..fa91b39 100644
> > > --- a/src/ldso/dynlink.c
> > > +++ b/src/ldso/dynlink.c
> > > @@ -54,6 +54,7 @@ struct dso {
> > >  	Sym *syms;
> > >  	uint32_t *hashtab;
> > >  	uint32_t *ghashtab;
> > > +	uint32_t ghashmask;
> > >  	int16_t *versym;
> > >  	char *strings;
> > >  	unsigned char *map;
> > > @@ -200,6 +201,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
> > >  	return 0;
> > >  }
> > > 
> > > +static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
> > > +{
> > > +	uint32_t *hashtab = dso->ghashtab;
> > > +	size_t *bloomwords = hashtab+4;
> > > +	size_t f = bloomwords[fofs & dso->ghashmask];
> > 
> > Is this measurably faster than fofs & hashtab[2]-1 ?
> 
> It seems to give a small improvement for me, but with the other patches, too
> small to measure reliably. As I'm using a big testcase, the impact on typical
> libraries should be really tiny.
> 
> > I suspect not, in which case it seems desirable not to increase the
> > size of struct dso. If it is worthwhile, at least don't sandwich a
> > uint32_t between two pointers where it might incur another 32 bits of
> > padding.
> 
> I wanted it to be nearby the hashtab pointer, which needs to be loaded anyway.
> If ghashmask needs to move so that loading it will access a different cache
> line, it's likely to diminish an already small improvement, at which point
> it's easier to drop this patch :)

Well we could just reorder some things to put them adjacent to one
another without wasting the padding, if it helps.

I don't want to reject this part outright, but I think changes like this
that obviously have some cost and where the benefit is non-obvious
should have some strong justification. (The umod stuff obviously has
some size cost too in struct dso, but its benefit is a lot more clear,
especially on archs without hardware div.)

Maybe it would make sense to omit the ghashmask part to begin with (it
makes the patch smaller anyway) and then do some testing to see whether
it's really beneficial once the other things are merged and we have a
stable basis for performance comparison.

Rich

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
  2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-24  4:18   ` Alexander Monakov
  2015-06-24  4:24   ` Rich Felker
  2015-06-23 23:24 ` [PATCH 3/5] dynlink.c: slim down gnu_lookup Alexander Monakov
  ` (3 subsequent siblings)

5 siblings, 2 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
To: musl; +Cc: Alexander Monakov

Based on
http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html

Do a little hand-holding for the compiler and fold magic post-shift into
32-bit high word -> low word shift on 64-bit platforms.
---
 src/ldso/dynlink.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index fa91b39..99dadd4 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -41,6 +41,11 @@ struct td_index {
 	struct td_index *next;
 };
 
+struct udiv {
+	uint32_t mul;
+	int s1, s2, inc;
+};
+
 struct dso {
 	unsigned char *base;
 	char *name;
@@ -55,6 +60,7 @@ struct dso {
 	uint32_t *hashtab;
 	uint32_t *ghashtab;
 	uint32_t ghashmask;
+	struct udiv gudiv;
 	int16_t *versym;
 	char *strings;
 	unsigned char *map;
@@ -141,6 +147,60 @@ static int search_vec(size_t *v, size_t *r, size_t key)
 	return 1;
 }
 
+static void precompute_udiv(uint32_t div, struct udiv *p)
+{
+	if (!(div&(div-1)))
+		return;
+	int bits = 0;
+again:
+	p->s1 = bits;
+	uint32_t tmp = 1u<<31, quo = tmp/div, rem = tmp%div;
+
+	int log=0;
+	tmp=div; do log++; while (tmp>>=1);
+
+	int exp, rdshf;
+	uint32_t pow, rdmul=0;
+	for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) {
+		int ovf = rem >= div - rem;
+		quo *= 2; rem *= 2;
+		if (ovf) quo++, rem -= div;
+
+		if (exp >= log-bits || div-rem <= pow)
+			break;
+
+		if (!rdmul && rem <= pow)
+			rdmul = quo, rdshf = exp;
+	}
+	if (exp < log) {
+		p->mul = quo+1;
+		p->s2 = exp;
+		p->inc = 0;
+	} else if (div & 1) {
+		p->mul = rdmul;
+		p->s2 = rdshf;
+		p->inc = 1;
+	} else {
+		do bits++; while (!((div >>= 1) & 1));
+		goto again;
+	}
+}
+
+static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
+{
+	if (!(div&(div-1)))
+		return x&(div-1);
+	uint32_t v = x;
+	if (p->s1) v >>= p->s1;
+	else if (v != ~0u) v += p->inc;
+	int s32=32, s2=p->s2;
+	if (sizeof(long) == 8)
+		s32+=s2, s2=0;
+	v = (1ull * v * p->mul) >> s32;
+	v >>= s2;
+	return x-v*div;
+}
+
 static uint32_t sysv_hash(const char *s0)
 {
 	const unsigned char *s = (void *)s0;
@@ -184,7 +244,7 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
 	uint32_t h2;
 	uint32_t *hashval;
-	uint32_t i = buckets[h1 % nbuckets];
+	uint32_t i = buckets[umod(h1, nbuckets, &dso->gudiv)];
 
 	if (!i) return 0;
@@ -705,6 +765,7 @@ static void decode_dyn(struct dso *p)
 	if (search_vec(p->dynv, dyn, DT_GNU_HASH)) {
 		p->ghashtab = (void *)(p->base + *dyn);
 		p->ghashmask = p->ghashtab[2]-1;
+		precompute_udiv(p->ghashtab[0], &p->gudiv);
 	}
 	if (search_vec(p->dynv, dyn, DT_VERSYM))
 		p->versym = (void *)(p->base + *dyn);

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
@ 2015-06-24  4:18 ` Alexander Monakov
  2015-06-24  4:19   ` Rich Felker

0 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24 4:18 UTC (permalink / raw)
To: musl

On Wed, 24 Jun 2015, Alexander Monakov wrote:
> diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> index fa91b39..99dadd4 100644
> --- a/src/ldso/dynlink.c
> +++ b/src/ldso/dynlink.c
> @@ -41,6 +41,11 @@ struct td_index {
>  	struct td_index *next;
>  };
> 
> +struct udiv {
> +	uint32_t mul;
> +	int s1, s2, inc;
> +};

Can use 'char' rather than 'int' here.

Alexander

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  4:18 ` Alexander Monakov
@ 2015-06-24  4:19 ` Rich Felker

0 siblings, 0 replies; 28+ messages in thread
From: Rich Felker @ 2015-06-24 4:19 UTC (permalink / raw)
To: musl

On Wed, Jun 24, 2015 at 07:18:28AM +0300, Alexander Monakov wrote:
> On Wed, 24 Jun 2015, Alexander Monakov wrote:
> > diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
> > index fa91b39..99dadd4 100644
> > --- a/src/ldso/dynlink.c
> > +++ b/src/ldso/dynlink.c
> > @@ -41,6 +41,11 @@ struct td_index {
> >  	struct td_index *next;
> >  };
> > 
> > +struct udiv {
> > +	uint32_t mul;
> > +	int s1, s2, inc;
> > +};
> 
> Can use 'char' rather than 'int' here.

Yes, that seems like a nice size improvement, and it might also allow
the compiler to do additional optimizations based on knowledge of the
range.

Rich

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
  2015-06-24  4:18   ` Alexander Monakov
@ 2015-06-24  4:24 ` Rich Felker
  2015-06-24  4:32   ` Alexander Monakov

1 sibling, 1 reply; 28+ messages in thread
From: Rich Felker @ 2015-06-24 4:24 UTC (permalink / raw)
To: musl

On Wed, Jun 24, 2015 at 02:24:52AM +0300, Alexander Monakov wrote:
> +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
> +{
> +	if (!(div&(div-1)))
> +		return x&(div-1);
> +	uint32_t v = x;
> +	if (p->s1) v >>= p->s1;
> +	else if (v != ~0u) v += p->inc;
> +	int s32=32, s2=p->s2;
> +	if (sizeof(long) == 8)
> +		s32+=s2, s2=0;
> +	v = (1ull * v * p->mul) >> s32;
> +	v >>= s2;
> +	return x-v*div;
> +}

I think having the div argument here is a pessimization. Power-of-two
sizes are unlikely to be used, and it's possible for precompute_udiv to
set up the struct udiv to work perfectly fine for powers of two, so
that there is a single code path with no branches. The p->s1 branch
should not be needed either.

Rich

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  4:24 ` Rich Felker
@ 2015-06-24  4:32 ` Alexander Monakov
  2015-06-24  5:13   ` Rich Felker

0 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24 4:32 UTC (permalink / raw)
To: musl

On Wed, 24 Jun 2015, Rich Felker wrote:
> On Wed, Jun 24, 2015 at 02:24:52AM +0300, Alexander Monakov wrote:
> > +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
> > +{
> > +	if (!(div&(div-1)))
> > +		return x&(div-1);
> > +	uint32_t v = x;
> > +	if (p->s1) v >>= p->s1;
> > +	else if (v != ~0u) v += p->inc;
> > +	int s32=32, s2=p->s2;
> > +	if (sizeof(long) == 8)
> > +		s32+=s2, s2=0;
> > +	v = (1ull * v * p->mul) >> s32;
> > +	v >>= s2;
> > +	return x-v*div;
> > +}
> 
> I think having the div argument here is a pessimization.

How can I do the last step, 'x-v*div', without it?

> Power-of-two sizes are unlikely to be used,

probably so, but...

> and it's possible for precompute_udiv
> to setup the struct udiv to work perfectly fine for powers of two,

I gave it some thought, and I didn't find how to do that either.

> so that there is a single code path with no branches. The p->s1 branch
> should not be needed either.

The p->s1 check is there for this reason: shifts are relatively costly,
and s1 is rarely non-zero, so try to skip the shift if possible; in the
rare case the pre-shift is non-zero, the check allows skipping the
saturating increment operation.

Alexander

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  4:32 ` Alexander Monakov
@ 2015-06-24  5:13 ` Rich Felker
  2015-06-24  6:08   ` Alexander Monakov

0 siblings, 1 reply; 28+ messages in thread
From: Rich Felker @ 2015-06-24 5:13 UTC (permalink / raw)
To: musl

On Wed, Jun 24, 2015 at 07:32:56AM +0300, Alexander Monakov wrote:
> On Wed, 24 Jun 2015, Rich Felker wrote:
> > On Wed, Jun 24, 2015 at 02:24:52AM +0300, Alexander Monakov wrote:
> > > +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p)
> > > +{
> > > +	if (!(div&(div-1)))
> > > +		return x&(div-1);
> > > +	uint32_t v = x;
> > > +	if (p->s1) v >>= p->s1;
> > > +	else if (v != ~0u) v += p->inc;
> > > +	int s32=32, s2=p->s2;
> > > +	if (sizeof(long) == 8)
> > > +		s32+=s2, s2=0;
> > > +	v = (1ull * v * p->mul) >> s32;
> > > +	v >>= s2;
> > > +	return x-v*div;
> > > +}
> > 
> > I think having the div argument here is a pessimization.
> 
> How can I do the last step, 'x-v*div' without it?

Ah yes. Would it be preferable to have that in struct udiv then,
though? Then the caller never has to care about the divisor, and in the
case where umod doesn't get inlined, the call will be more efficient
(fewer args).

> > Power-of-two sizes are unlikely to be used,
> 
> probably so, but...
> 
> > and it's possible for precompute_udiv
> > to setup the struct udiv to work perfectly fine for powers of two,
> 
> I gave it a thought, and I didn't find how to do that either.

Originally I thought this would work:

	p->s1 = ctz(div);
	p->inc = 0;
	p->mul = 1;
	p->s2 = 0;

But the >>32 of course breaks it. What about:

	p->s1 = 0;
	p->inc = 0;
	p->mul = 0x80000000;
	p->s2 = ctz(div)-1;

This should handle all values except 1, which probably isn't needed,
but maybe there's a special case that works for 1 too...?

I think the following works for all values except 0xffffffff:

	p->s1 = 0;
	p->inc = 1;
	p->mul = 0xffffffff;
	p->s2 = 0;

Using the post-mul add rather than the saturated increment would make
this work for 0xffffffff too.

Another obvious solution is not using the +32 offset so that the right
shift can just be 31, but that pessimizes the code on 32-bit archs
quite a bit.

> > so that there is a single code path with no branches. The p->s1 branch
> > should not be needed either.
> 
> p->s1 check is for this reason: shift are relatively costly, and s1 is rarely
> non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> is non-zero, the check allows to skip the saturating increment operation.

Shifts are costly? Are we talking about P4-era junk? ;-)

Rich

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  5:13 ` Rich Felker
@ 2015-06-24  6:08 ` Alexander Monakov
  2015-06-24  6:39   ` Rich Felker

0 siblings, 1 reply; 28+ messages in thread
From: Alexander Monakov @ 2015-06-24 6:08 UTC (permalink / raw)
To: musl

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1719 bytes --]

> > How can I do the last step, 'x-v*div' without it?
> 
> Ah yes. Would it be preferable to have that in struct udiv then,
> though? Then the caller never has to care about the divisor, and in
> the case where umod doesn't get inlined, the call will be more
> efficient (fewer args).

I doubt that; the caller needs that value ('nbuckets') anyway.

[...]

> Using the post-mul add rather than the saturated increment would make
> this work for 0xffffffff too.

post-mul add is problematic on 32-bit

> Another obvious solution is not using the +32 offset so that the right
> shift can just be 31, but that pessimizes the code on 32-bit archs
> quite a bit.

I agree that a special case for a power-of-two divisor will fire too
rarely, so figuring out a replacement would be nice.

If you (or anyone) want to play with ideas, I'm attaching my test driver
for magic division that tests 2^32-1 inputs for a divisor given on the
command line (the main loop looks odd, but my goal was to have gcc
vectorize it).

> > p->s1 check is for this reason: shift are relatively costly, and s1 is rarely
> > non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> > is non-zero, the check allows to skip the saturating increment operation.
> 
> Shifts are costly? Are we talking about P4-era junk? ;-)

I primarily had modern Intel cores in mind where as far as I can see
shifts by %ecx have latency 2. But even ignoring that, there's the
second part of my argument.

Anyway it's not trivial to measure an impact on the same "modern Intel
cores", so if anybody can say how that looks on a different platform,
I'd appreciate that. Removing my if-else there is obviously beneficial
for code size.

Alexander

[-- Attachment #2: Type: TEXT/x-c; name=udiv.c, Size: 1923 bytes --]

#include <stdlib.h>

static void precalc(unsigned div, unsigned *mulp, int *prep, int *postp, int *incp)
{
	int bits = 0;
again:
	*prep = bits;
	unsigned pow = 1u<<31, quo = pow / div, rem = pow % div;
	int log=0, exp;
	unsigned tmp=div;
	do log++; while (tmp >>= 1);
	unsigned down_mul=0;
	for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) {
		int adj = rem >= div - rem;
		quo *= 2; rem *= 2;
		if (adj) { quo++; rem -= div; }
		if (exp >= log-bits || div-rem <= pow)
			break;
		if (!down_mul && rem <= pow) {
			down_mul = quo;
			*postp = exp;
		}
	}
	if (exp < log) {
		*mulp = quo+1;
		*postp = exp;
		*incp = 0;
	} else if (div & 1) {
		*mulp = down_mul;
		*incp = 1;
	} else {
		do bits++; while (!((div >>= 1) & 1));
		goto again;
	}
}

int main(int argc, char *argv[])
{
	unsigned div = atoi(argv[1]);
	if (!(div&(div-1)))
		return 0;
	unsigned mul;
	int pre, post, inc;
	precalc(div, &mul, &pre, &post, &inc);
	unsigned s32 = 32;
	if (sizeof(long) == 8) {
		s32 += post;
		post = 0;
	}
	unsigned rem, quo, val;
	char bad=0;
#define loop(pre, inc, post) ({ \
	for (val=~0u, quo=val/div, rem=val%div+1; quo+1; quo--, rem=div) \
		for (; rem; rem--, val--) { \
			unsigned v = val; \
			v >>= pre; \
			if (v+inc) v+=inc; \
			v = (1ull * v * mul) >> s32; \
			v >>= post; \
			bad |= (v != quo); \
		} })
	if (post)
		if (inc)
			if (pre) return 2;
			else loop(0, inc, post);
		else
			if (pre) loop(pre, 0, post);
			else loop(0, 0, post);
	else
		if (inc)
			if (pre) return 2;
			else loop(0, inc, 0);
		else
			if (pre) loop(pre, 0, 0);
			else loop(0, 0, 0);
	return bad;
}

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
  2015-06-24  6:08 ` Alexander Monakov
@ 2015-06-24  6:39 ` Rich Felker

0 siblings, 0 replies; 28+ messages in thread
From: Rich Felker @ 2015-06-24 6:39 UTC (permalink / raw)
To: musl

On Wed, Jun 24, 2015 at 09:08:15AM +0300, Alexander Monakov wrote:
> > > How can I do the last step, 'x-v*div' without it?
> > 
> > Ah yes. Would it be preferable to have that in struct udiv then,
> > though? Then the caller never has to care about the divisor, and in
> > the case where umod doesn't get inlined, the call will be more
> > efficient (fewer args).
> 
> I doubt that; the caller needs that value ('nbuckets') anyway.

OK.

> [...]
> > Using the post-mul add rather than the saturated increment would make
> > this work for 0xffffffff too.
> 
> post-mul add is problematic on 32-bit

Maybe. It's not clear to me which is cheaper, a saturating increment
(well-predicted branch) or an adc. IMO the only place the latter is
clearly more costly is on archs with no proper carry mechanism.

> > Another obvious solution is not using the +32 offset so that the right
> > shift can just be 31, but that pessimizes the code on 32-bit archs
> > quite a bit.
> 
> I agree that a special-case for a power-of-two divisor will fire too rarely,
> so figuring a replacement out would be nice.
> 
> If you (or anyone) want to play with ideas, I'm attaching my test driver for
> magic division that tests 2^32-1 inputs for a divisor given on the command
> line (the main loop looks odd, but my goal was to have gcc vectorize it).

Nice.

> > > p->s1 check is for this reason: shift are relatively costly, and s1 is rarely
> > > non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> > > is non-zero, the check allows to skip the saturating increment operation.
> > 
> > Shifts are costly? Are we talking about P4-era junk? ;-)
> 
> I primarily had modern Intel cores in mind where as far as I can see shifts by
> %ecx have latency 2. But even ignoring that, there's the second part of my
> argument.

With a latency of 2, I can't imagine there being any benefit to
attempting to avoid it with a conditional branch.

Rich

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [PATCH 3/5] dynlink.c: slim down gnu_lookup
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
  2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
  2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-23 23:24 ` [PATCH 4/5] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
  ` (2 subsequent siblings)

5 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
To: musl; +Cc: Alexander Monakov

Do not reference dso->syms and dso->strings until point of use. Check
'h1 == (h2|1)', the simplest condition, before the others.
---
 src/ldso/dynlink.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index 99dadd4..28812ef 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -237,24 +237,20 @@ static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso)
 static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 {
-	Sym *syms = dso->syms;
-	char *strings = dso->strings;
 	uint32_t *hashtab = dso->ghashtab;
 	uint32_t nbuckets = hashtab[0];
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
-	uint32_t h2;
-	uint32_t *hashval;
 	uint32_t i = buckets[umod(h1, nbuckets, &dso->gudiv)];
 
 	if (!i) return 0;
 
-	hashval = buckets + nbuckets + (i - hashtab[1]);
+	uint32_t *hashval = buckets + nbuckets + (i - hashtab[1]);
 	for (h1 |= 1; ; i++) {
-		h2 = *hashval++;
-		if ((!dso->versym || dso->versym[i] >= 0)
-		    && (h1 == (h2|1)) && !strcmp(s, strings + syms[i].st_name))
-			return syms+i;
+		uint32_t h2 = *hashval++;
+		if ((h1 == (h2|1)) && (!dso->versym || dso->versym[i] >= 0)
+		    && !strcmp(s, dso->strings + dso->syms[i].st_name))
+			return dso->syms+i;
 		if (h2 & 1) break;
 	}

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [PATCH 4/5] dynlink.c: pass gnu-hash table pointer to gnu_lookup
  2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
  ` (2 preceding siblings ...)
  2015-06-23 23:24 ` [PATCH 3/5] dynlink.c: slim down gnu_lookup Alexander Monakov
@ 2015-06-23 23:24 ` Alexander Monakov
  2015-06-23 23:24 ` [PATCH 5/5] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
  2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov

5 siblings, 0 replies; 28+ messages in thread
From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw)
To: musl; +Cc: Alexander Monakov

The callers need to check the value of the pointer anyway, so make them
pass the pointer to gnu_lookup instead of reloading it there.

Reorder gnu_lookup arguments so that always-used ones are listed first.
GCC can choose a calling convention with arguments in registers (e.g.
up to 3 arguments in eax, ecx, edx on x86), but cannot reorder the
arguments for static functions.
---
 src/ldso/dynlink.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c
index 28812ef..99be7d0 100644
--- a/src/ldso/dynlink.c
+++ b/src/ldso/dynlink.c
@@ -235,9 +235,8 @@ static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso)
 	return 0;
 }
 
-static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
+static Sym *gnu_lookup(uint32_t h1, struct dso *dso, uint32_t *hashtab, const char *s)
 {
-	uint32_t *hashtab = dso->ghashtab;
 	uint32_t nbuckets = hashtab[0];
 	uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
 	uint32_t i = buckets[umod(h1, nbuckets, &dso->gudiv)];
@@ -257,9 +256,9 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso)
 	return 0;
 }
 
-static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask)
+static Sym *gnu_lookup_filtered(uint32_t h1, struct dso *dso, uint32_t *hashtab, const char *s,
+	uint32_t fofs, size_t fmask)
 {
-	uint32_t *hashtab = dso->ghashtab;
 	size_t *bloomwords = hashtab+4;
 	size_t f = bloomwords[fofs & dso->ghashmask];
 	if (!(f & fmask)) return 0;
@@ -267,7 +266,7 @@ static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uin
 	f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
 	if (!(f & 1)) return 0;
 
-	return gnu_lookup(s, h1, dso);
+	return gnu_lookup(h1, dso, hashtab, s);
 }
 
 #define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS)
@@ -279,20 +278,20 @@ static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uin
 static struct symdef find_sym(struct dso *dso, const char *s, int need_def)
 {
-	uint32_t h = 0, gh, gho;
+	uint32_t h = 0, gh, gho, *ght;
 	size_t ghm = 0;
 	struct symdef def = {0};
 	for (; dso; dso=dso->next) {
 		Sym *sym;
 		if (!dso->global) continue;
-		if (dso->ghashtab) {
+		if ((ght = dso->ghashtab)) {
 			if (!ghm) {
 				gh = gnu_hash(s);
 				int maskbits = 8 * sizeof ghm;
 				gho = gh / maskbits;
 				ghm = 1ul << gh % maskbits;
 			}
-			sym = gnu_lookup_filtered(s, gh, dso, gho, ghm);
+			sym = gnu_lookup_filtered(gh, dso, ght, s, gho, ghm);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, dso);
@@ -1623,7 +1622,7 @@ void *__tls_get_addr(size_t *);
 static void *do_dlsym(struct dso *p, const char *s, void *ra)
 {
 	size_t i;
-	uint32_t h = 0, gh = 0;
+	uint32_t h = 0, gh = 0, *ght;
 	Sym *sym;
 	if (p == head || p == RTLD_DEFAULT || p == RTLD_NEXT) {
 		if (p == RTLD_DEFAULT) {
@@ -1641,9 +1640,9 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
 	}
 	if (invalid_dso_handle(p)) return 0;
-	if (p->ghashtab) {
+	if ((ght = p->ghashtab)) {
 		gh = gnu_hash(s);
-		sym = gnu_lookup(s, gh, p);
+		sym = gnu_lookup(gh, p, ght, s);
 	} else {
 		h = sysv_hash(s);
 		sym = sysv_lookup(s, h, p);
@@ -1653,9 +1652,9 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra)
 	if (sym && sym->st_value && (1<<(sym->st_info&0xf) & OK_TYPES))
 		return p->base + sym->st_value;
 	if (p->deps) for (i=0; p->deps[i]; i++) {
-		if (p->deps[i]->ghashtab) {
+		if ((ght = p->deps[i]->ghashtab)) {
 			if (!gh) gh = gnu_hash(s);
-			sym = gnu_lookup(s, gh, p->deps[i]);
+			sym = gnu_lookup(gh, p->deps[i], ght, s);
 		} else {
 			if (!h) h = sysv_hash(s);
 			sym = sysv_lookup(s, h, p->deps[i]);

^ permalink raw reply	[flat|nested] 28+ messages in thread
* [PATCH 5/5] dynlink.c: use a faster expression in gnu_hash 2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov ` (3 preceding siblings ...) 2015-06-23 23:24 ` [PATCH 4/5] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov @ 2015-06-23 23:24 ` Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov 5 siblings, 0 replies; 28+ messages in thread From: Alexander Monakov @ 2015-06-23 23:24 UTC (permalink / raw) To: musl; +Cc: Alexander Monakov With -Os, GCC uses a multiply rather than a shift and addition for 'h*33'. Use a more efficient expression explicitly. --- src/ldso/dynlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c index 99be7d0..f7342cd 100644 --- a/src/ldso/dynlink.c +++ b/src/ldso/dynlink.c @@ -217,7 +217,7 @@ static uint32_t gnu_hash(const char *s0) const unsigned char *s = (void *)s0; uint_fast32_t h = 5381; for (; *s; s++) - h = h*33 + *s; + h += h*32 + *s; return h; } ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2 0/6] gnu-hash speedups 2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov ` (4 preceding siblings ...) 2015-06-23 23:24 ` [PATCH 5/5] dynlink.c: use a faster expression in gnu_hash Alexander Monakov @ 2015-06-27 23:48 ` Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash Alexander Monakov ` (6 more replies) 5 siblings, 7 replies; 28+ messages in thread From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw) To: musl v2: reorder patches [2] split out ghashmask micro-optimization into a separate patch [2] use 'unsigned char' rather than 'int' for struct udiv fields s1,s2,inc [2] use an explicit cast in bloomwords assignment [4] reorder 'hashtab' before 'dso' [5] remove branch on zero s1 [5] tweak implementation of saturating addition [5] fold 32-bit shift into s2, avoiding run-time adjustments on 64-bit arch Alexander Monakov (6): dynlink.c: use a faster expression in gnu_hash dynlink.c: use bloom filter in gnu hash lookup dynlink.c: slim down gnu_lookup dynlink.c: pass gnu-hash table pointer to gnu_lookup dynlink.c: compute modulus via magic multiplication dynlink.c: store bloom filter size in struct dso src/ldso/dynlink.c | 124 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 101 insertions(+), 23 deletions(-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash 2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov @ 2015-06-27 23:48 ` Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 2/6] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov ` (5 subsequent siblings) 6 siblings, 0 replies; 28+ messages in thread From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw) To: musl With -Os, GCC uses a multiply rather than a shift and addition for 'h*33'. Use a more efficient expression explicitly. --- src/ldso/dynlink.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c index b77c6f6..0df81f2 100644 --- a/src/ldso/dynlink.c +++ b/src/ldso/dynlink.c @@ -156,7 +156,7 @@ static uint32_t gnu_hash(const char *s0) const unsigned char *s = (void *)s0; uint_fast32_t h = 5381; for (; *s; s++) - h = h*33 + *s; + h += h*32 + *s; return h; } ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2 2/6] dynlink.c: use bloom filter in gnu hash lookup 2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash Alexander Monakov @ 2015-06-27 23:48 ` Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 3/6] dynlink.c: slim down gnu_lookup Alexander Monakov ` (4 subsequent siblings) 6 siblings, 0 replies; 28+ messages in thread From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw) To: musl Introduce gnu_lookup_filtered and use it to speed up symbol lookups in find_sym (do_dlsym is left as is, based on an expectation that frequent dlsym queries will use a dlopen handle rather than RTLD_NEXT or RTLD_DEFAULT, and will not need to look at more than one DSO). --- src/ldso/dynlink.c | 25 ++++++++++++++++++++++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c index 0df81f2..b08596e 100644 --- a/src/ldso/dynlink.c +++ b/src/ldso/dynlink.c @@ -200,6 +200,19 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso) return 0; } +static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask) +{ + uint32_t *hashtab = dso->ghashtab; + const size_t *bloomwords = (const void *)(hashtab+4); + size_t f = bloomwords[fofs & (hashtab[2]-1)]; + if (!(f & fmask)) return 0; + + f >>= (h1 >> hashtab[3]) % (8 * sizeof f); + if (!(f & 1)) return 0; + + return gnu_lookup(s, h1, dso); +} + #define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS) #define OK_BINDS (1<<STB_GLOBAL | 1<<STB_WEAK | 1<<STB_GNU_UNIQUE) @@ -209,14 +222,20 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso) static struct symdef find_sym(struct dso *dso, const char *s, int need_def) { - uint32_t h = 0, gh = 0; + uint32_t h = 0, gh, gho; + size_t ghm = 0; struct symdef def = {0}; for (; dso; dso=dso->next) { Sym *sym; if (!dso->global) continue;
if (dso->ghashtab) { - if (!gh) gh = gnu_hash(s); - sym = gnu_lookup(s, gh, dso); + if (!ghm) { + gh = gnu_hash(s); + int maskbits = 8 * sizeof ghm; + gho = gh / maskbits; + ghm = 1ul << gh % maskbits; + } + sym = gnu_lookup_filtered(s, gh, dso, gho, ghm); } else { if (!h) h = sysv_hash(s); sym = sysv_lookup(s, h, dso); ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2 3/6] dynlink.c: slim down gnu_lookup 2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 2/6] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov @ 2015-06-27 23:48 ` Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov ` (3 subsequent siblings) 6 siblings, 0 replies; 28+ messages in thread From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw) To: musl Do not reference dso->syms and dso->strings until point of use. Check 'h1 == (h2|1)', the simplest condition, before the others. --- src/ldso/dynlink.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c index b08596e..0ca7a16 100644 --- a/src/ldso/dynlink.c +++ b/src/ldso/dynlink.c @@ -176,24 +176,20 @@ static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso) static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso) { - Sym *syms = dso->syms; - char *strings = dso->strings; uint32_t *hashtab = dso->ghashtab; uint32_t nbuckets = hashtab[0]; uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4); - uint32_t h2; - uint32_t *hashval; uint32_t i = buckets[h1 % nbuckets]; if (!i) return 0; - hashval = buckets + nbuckets + (i - hashtab[1]); + uint32_t *hashval = buckets + nbuckets + (i - hashtab[1]); for (h1 |= 1; ; i++) { - h2 = *hashval++; - if ((!dso->versym || dso->versym[i] >= 0) - && (h1 == (h2|1)) && !strcmp(s, strings + syms[i].st_name)) - return syms+i; + uint32_t h2 = *hashval++; + if ((h1 == (h2|1)) && (!dso->versym || dso->versym[i] >= 0) + && !strcmp(s, dso->strings + dso->syms[i].st_name)) + return dso->syms+i; if (h2 & 1) break; } ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup 2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov ` (2 preceding siblings ...) 2015-06-27 23:48 ` [PATCH v2 3/6] dynlink.c: slim down gnu_lookup Alexander Monakov @ 2015-06-27 23:48 ` Alexander Monakov 2015-06-28 0:05 ` Alexander Monakov 2015-06-27 23:48 ` [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication Alexander Monakov ` (2 subsequent siblings) 6 siblings, 1 reply; 28+ messages in thread From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw) To: musl The callers need to check the value of the pointer anyway, so make them pass the pointer to gnu_lookup instead of reloading it there. Reorder gnu_lookup arguments so that always-used ones are listed first. GCC can choose a calling convention with arguments in registers (e.g. up to 3 arguments in eax, ecx, edx on x86), but cannot reorder the arguments for static functions. --- src/ldso/dynlink.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c index 0ca7a16..1c62efe 100644 --- a/src/ldso/dynlink.c +++ b/src/ldso/dynlink.c @@ -174,9 +174,8 @@ static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso) return 0; } -static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso) +static Sym *gnu_lookup(uint32_t h1, uint32_t *hashtab, struct dso *dso, const char *s) { - uint32_t *hashtab = dso->ghashtab; uint32_t nbuckets = hashtab[0]; uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4); uint32_t i = buckets[h1 % nbuckets]; @@ -196,9 +195,9 @@ static Sym *gnu_lookup(const char *s, uint32_t h1, struct dso *dso) return 0; } -static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uint32_t fofs, size_t fmask) +static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso, const char *s, + uint32_t fofs, size_t fmask) {
- uint32_t *hashtab = dso->ghashtab; const size_t *bloomwords = (const void *)(hashtab+4); size_t f = bloomwords[fofs & (hashtab[2]-1)]; if (!(f & fmask)) return 0; @@ -206,7 +205,7 @@ static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uin f >>= (h1 >> hashtab[3]) % (8 * sizeof f); if (!(f & 1)) return 0; - return gnu_lookup(s, h1, dso); + return gnu_lookup(h1, hashtab, dso, s); } #define OK_TYPES (1<<STT_NOTYPE | 1<<STT_OBJECT | 1<<STT_FUNC | 1<<STT_COMMON | 1<<STT_TLS) @@ -218,20 +217,20 @@ static Sym *gnu_lookup_filtered(const char *s, uint32_t h1, struct dso *dso, uin static struct symdef find_sym(struct dso *dso, const char *s, int need_def) { - uint32_t h = 0, gh, gho; + uint32_t h = 0, gh, gho, *ght; size_t ghm = 0; struct symdef def = {0}; for (; dso; dso=dso->next) { Sym *sym; if (!dso->global) continue; - if (dso->ghashtab) { + if ((ght = dso->ghashtab)) { if (!ghm) { gh = gnu_hash(s); int maskbits = 8 * sizeof ghm; gho = gh / maskbits; ghm = 1ul << gh % maskbits; } - sym = gnu_lookup_filtered(s, gh, dso, gho, ghm); + sym = gnu_lookup_filtered(gh, ght, dso, s, gho, ghm); } else { if (!h) h = sysv_hash(s); sym = sysv_lookup(s, h, dso); @@ -1559,7 +1558,7 @@ void *__tls_get_addr(size_t *); static void *do_dlsym(struct dso *p, const char *s, void *ra) { size_t i; - uint32_t h = 0, gh = 0; + uint32_t h = 0, gh = 0, *ght; Sym *sym; if (p == head || p == RTLD_DEFAULT || p == RTLD_NEXT) { if (p == RTLD_DEFAULT) { @@ -1577,9 +1576,9 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra) } if (invalid_dso_handle(p)) return 0; - if (p->ghashtab) { + if ((ght = p->ghashtab)) { gh = gnu_hash(s); - sym = gnu_lookup(s, gh, p); + sym = gnu_lookup(gh, ght, p, s); } else { h = sysv_hash(s); sym = sysv_lookup(s, h, p); @@ -1589,9 +1588,9 @@ static void *do_dlsym(struct dso *p, const char *s, void *ra) if (sym && sym->st_value && (1<<(sym->st_info&0xf) & OK_TYPES)) return p->base + sym->st_value; if (p->deps) for (i=0; p->deps[i]; i++) { - if (p->deps[i]->ghashtab) {
+ if ((ght = p->deps[i]->ghashtab)) { if (!gh) gh = gnu_hash(s); - sym = gnu_lookup(s, gh, p->deps[i]); + sym = gnu_lookup(gh, ght, p->deps[i], s); } else { if (!h) h = sysv_hash(s); sym = sysv_lookup(s, h, p->deps[i]); ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup 2015-06-27 23:48 ` [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov @ 2015-06-28 0:05 ` Alexander Monakov 2015-06-28 8:59 ` Alexander Monakov 0 siblings, 1 reply; 28+ messages in thread From: Alexander Monakov @ 2015-06-28 0:05 UTC (permalink / raw) To: musl [-- Attachment #1: Type: TEXT/PLAIN, Size: 421 bytes --] Rich, here's a version of your bench that should test latency rather than throughput of the umod operation, with random divisors. From my testing it looks like your version with 64-bit add is clearly better on 64-bit, but loses on 32-bit unless 'add' is made a 64-bit field. An advantage of 'saturating add' is that it doesn't need extra space ('inc' is 1 byte rather than 4, which are required for 'add'). Alexander [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: TEXT/x-c; name=umod-bench.c, Size: 3087 bytes --] #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <time.h> struct udiv { uint64_t add; uint32_t mul; int s1, s2, inc; }; static void precompute_udiv(uint32_t div, struct udiv *p) {
int s2_adj = (sizeof(long)==8) ? 32 : 0; if (!(div&(div-1))) { if (sizeof(long) == 8) { p->s1 = 0; p->mul = 1; p->add = 0; for (p->s2=0; div>1; div>>=1, p->s2++); } else if (div == 1) { p->s1 = 0; p->s2 = 0; p->mul = 0xffffffff; p->add = 0xffffffff; } else { p->s1 = 0; p->mul = 0x80000000; p->add = 0; for (p->s2=0; div>2; div>>=1, p->s2++); } return; } int bits = 0; again: p->s1 = bits; uint32_t tmp = 1u<<31, quo = tmp/div, rem = tmp%div; int log=0; tmp=div; do log++; while (tmp>>=1); int exp, rdshf; uint32_t pow, rdmul=0; for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) { int ovf = rem >= div - rem; quo *= 2; rem *= 2; if (ovf) quo++, rem -= div; if (exp >= log-bits || div-rem <= pow) break; if (!rdmul && rem <= pow) rdmul = quo, rdshf = exp; } if (exp < log) { p->mul = quo+1; p->s2 = exp + s2_adj; p->inc = 0; p->add = 0; } else if (div & 1) { p->mul = rdmul; p->s2 = rdshf + s2_adj; p->inc = 1; p->add = rdmul; } else { do bits++; while (!((div >>= 1) & 1)); goto again; } } static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p) { #if 0 int s32 = (sizeof(long)==8) ? 0 : 32; return x - ((unsigned long)(1ull * (x>>p->s1) * p->mul + p->add >> s32) >> p->s2)*div; #elif 0
int s32 = (sizeof(long)==8) ? 0 : 32; uint32_t v = x; v >>= p->s1; if (v!=-1) v += p->inc; v = (unsigned long)((1ull * v * p->mul) >> s32) >> p->s2; return x-v*div; #else if (!(div&(div-1))) return x&(div-1); uint32_t v = x; v >>= p->s1; if (v + p->inc) v += p->inc; int s32=32, s2=p->s2; if (sizeof(long) == 8) s32=s2, s2=0; v = (1ull * v * p->mul) >> s32; v >>= s2; return x-v*div; #endif } #define N 1000000 uint32_t ext_umod(uint32_t x, uint32_t div, struct udiv *p) { return umod(x, div, p); } int main(int argc, char **argv) { size_t i; struct timespec t0, t; uint32_t *a = calloc(N, sizeof *a); for (i=0; i<N; i++) a[i] = 2+rand(); struct udiv *ud = calloc(N, sizeof *ud); for (i=0; i<N; i++) precompute_udiv(a[i], ud+i); uint32_t accum, val; clock_gettime(CLOCK_REALTIME, &t0); for (accum=i=0, val=-1; i<N; i++) accum += (val %= a[i]); clock_gettime(CLOCK_REALTIME, &t); t.tv_sec -= t0.tv_sec; if ((t.tv_nsec -= t0.tv_nsec) < 0) { t.tv_nsec += 1000000000; t.tv_sec--; } printf("%lu %lld.%.9ld\n", (unsigned long)accum, (long long)t.tv_sec, t.tv_nsec); clock_gettime(CLOCK_REALTIME, &t0); for (accum=i=0, val=-1; i<N; i++) accum += (val = umod(val, a[i], ud+i)); clock_gettime(CLOCK_REALTIME, &t); t.tv_sec -= t0.tv_sec; if ((t.tv_nsec -= t0.tv_nsec) < 0) { t.tv_nsec += 1000000000; t.tv_sec--; } printf("%lu %lld.%.9ld\n", (unsigned long)accum, (long long)t.tv_sec, t.tv_nsec); return 0; } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup 2015-06-28 0:05 ` Alexander Monakov @ 2015-06-28 8:59 ` Alexander Monakov 0 siblings, 0 replies; 28+ messages in thread From: Alexander Monakov @ 2015-06-28 8:59 UTC (permalink / raw) To: musl [-- Attachment #1: Type: TEXT/PLAIN, Size: 812 bytes --] (this and the previous email are actually following up on patch 5, not 4) How about handling power-of-two divisors like this. In precompute_udiv: 1. Factor out powers of two from 'div' up front and get rid of 'goto again' (ok since doing the pre-shift unconditionally looks better) 2. If 'div' is now 1: - on 64-bit there's no problem: set 'mul' to 1 and 's2' to 0 - on 32-bit: set 'mul' to 0 ...and in umod: 3. On 32-bit arches, check 'mul' after pre-shift and skip to end if zero. With this change it's easy to drop 'div' argument from 'umod', rename 'umod' to 'udiv' and make the caller do the modulus computation. Attaching a variant of the latency bench that implements the above; I also tried to avoid small 'val' in the bench loop by using a recurrence with a bitwise inversion. Alexander [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: TEXT/x-c; name=umod-bench.c, Size: 2338 bytes --] #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <time.h> struct udiv { uint32_t mul; unsigned char s1, s2, inc; }; static void precompute_udiv(uint32_t div, struct udiv *p) { int bits = 0; while (!(div & 1)) div >>= 1, bits++; p->s1 = bits; if (div == 1) { p->mul = (sizeof(long) == 8); p->s2 = 0; p->inc = 0; return; } uint32_t tmp = 1u<<31, quo = tmp/div, rem = tmp%div; int log=0; tmp=div; do log++; while (tmp>>=1); int exp, rdshf; uint32_t pow, rdmul=0; for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) { int ovf = rem >= div - rem; quo *= 2; rem *= 2; if (ovf) quo++, rem -= div; if (exp >= log-bits || div-rem <= pow) break; if (!rdmul && rem <= pow) rdmul = quo, rdshf = exp; }
int s2_adj = (sizeof(long)==8) ? 32 : 0; if (exp < log) { p->mul = quo+1; p->s2 = exp + s2_adj; p->inc = 0; } else { p->mul = rdmul; p->s2 = rdshf + s2_adj; p->inc = 1; } } static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p) { uint32_t v = x >> p->s1; if (sizeof(long) == 8 || p->mul) { if (v + p->inc) v += p->inc; int s32=32, s2=p->s2; if (sizeof(long) == 8) s32=s2, s2=0; v = (1ull * v * p->mul) >> s32; v >>= s2; } return x-v*div; } #define N 1000000 uint32_t ext_umod(uint32_t x, uint32_t div, struct udiv *p) { return umod(x, div, p); } int main(int argc, char **argv) { size_t i; struct timespec t0, t; uint32_t *a = calloc(N, sizeof *a); for (i=0; i<N; i++) a[i] = 1+rand(); struct udiv *ud = calloc(N, sizeof *ud); for (i=0; i<N; i++) precompute_udiv(a[i], ud+i); uint32_t val; clock_gettime(CLOCK_REALTIME, &t0); for (i=0, val=-1; i<N; i++) val = ~(val % a[i]); clock_gettime(CLOCK_REALTIME, &t); t.tv_sec -= t0.tv_sec; if ((t.tv_nsec -= t0.tv_nsec) < 0) { t.tv_nsec += 1000000000; t.tv_sec--; } printf("%lu %lld.%.9ld\n", (unsigned long)val, (long long)t.tv_sec, t.tv_nsec); clock_gettime(CLOCK_REALTIME, &t0); for (i=0, val=-1; i<N; i++) val = ~umod(val, a[i], ud+i); clock_gettime(CLOCK_REALTIME, &t); t.tv_sec -= t0.tv_sec; if ((t.tv_nsec -= t0.tv_nsec) < 0) { t.tv_nsec += 1000000000; t.tv_sec--; } printf("%lu %lld.%.9ld\n", (unsigned long)val, (long long)t.tv_sec, t.tv_nsec); return 0; } ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication 2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov ` (3 preceding siblings ...) 2015-06-27 23:48 ` [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov @ 2015-06-27 23:48 ` Alexander Monakov 2015-06-30 17:51 ` Rich Felker 2015-06-27 23:48 ` [PATCH v2 6/6] dynlink.c: store bloom filter size in struct dso Alexander Monakov 2015-06-28 2:45 ` [PATCH v2 0/6] gnu-hash speedups Rich Felker 6 siblings, 1 reply; 28+ messages in thread From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw) To: musl Based on http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html Do a little hand-holding for the compiler and fold magic post-shift into 32-bit high word -> low word shift on 64-bit platforms. --- src/ldso/dynlink.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 64 insertions(+), 2 deletions(-) diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c index 1c62efe..f181209 100644 --- a/src/ldso/dynlink.c +++ b/src/ldso/dynlink.c @@ -41,6 +41,11 @@ struct td_index { struct td_index *next; }; +struct udiv { + uint32_t mul; + unsigned char s1, s2, inc; +}; + struct dso { unsigned char *base; char *name; @@ -54,6 +59,7 @@ struct dso { Sym *syms; uint32_t *hashtab; uint32_t *ghashtab; + struct udiv gudiv; int16_t *versym; char *strings; unsigned char *map; @@ -140,6 +146,60 @@ static int search_vec(size_t *v, size_t *r, size_t key) return 1; } +static void precompute_udiv(uint32_t div, struct udiv *p) +{ + if (!(div&(div-1))) + return;
+ int bits = 0, s2adj = (sizeof(long) == 8) ? 32 : 0; +again: + p->s1 = bits; + uint32_t tmp = 1u<<31, quo = tmp/div, rem = tmp%div; + + int log=0; + tmp=div; do log++; while (tmp>>=1); + + int exp, rdshf; + uint32_t pow, rdmul=0; + for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) { + int ovf = rem >= div - rem; + quo *= 2; rem *= 2; + if (ovf) quo++, rem -= div; + + if (exp >= log-bits || div-rem <= pow) + break; + + if (!rdmul && rem <= pow) + rdmul = quo, rdshf = exp; + } + if (exp < log) { + p->mul = quo+1; + p->s2 = exp + s2adj; + p->inc = 0; + } else if (div & 1) { + p->mul = rdmul; + p->s2 = rdshf + s2adj; + p->inc = 1; + } else { + do bits++; while (!((div >>= 1) & 1)); + goto again; + } +} + +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p) +{ + if (!(div&(div-1))) + return x&(div-1); + uint32_t v = x; + v >>= p->s1; + if (v + p->inc) v += p->inc; + int s32=32, s2=p->s2; + if (sizeof(long) == 8) + s32=s2, s2=0; + v = (1ull * v * p->mul) >> s32; + v >>= s2; + return x-v*div; +} + static uint32_t sysv_hash(const char *s0) { const unsigned char *s = (void *)s0; @@ -178,7 +238,7 @@ static Sym *gnu_lookup(uint32_t h1, uint32_t *hashtab, struct dso *dso, const ch { uint32_t nbuckets = hashtab[0]; uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4); - uint32_t i = buckets[h1 % nbuckets]; + uint32_t i = buckets[umod(h1, nbuckets, &dso->gudiv)]; if (!i) return 0; @@ -696,8 +756,10 @@ static void decode_dyn(struct dso *p) p->rpath_orig = (void *)(p->strings + dyn[DT_RPATH]); if (dyn[0]&(1<<DT_RUNPATH)) p->rpath_orig = (void *)(p->strings + dyn[DT_RUNPATH]); - if (search_vec(p->dynv, dyn, DT_GNU_HASH)) + if (search_vec(p->dynv, dyn, DT_GNU_HASH)) { p->ghashtab = (void *)(p->base + *dyn); + precompute_udiv(p->ghashtab[0], &p->gudiv); + } if (search_vec(p->dynv, dyn, DT_VERSYM)) p->versym = (void *)(p->base + *dyn); } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication 2015-06-27 23:48 ` [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication Alexander Monakov @ 2015-06-30 17:51 ` Rich Felker 0 siblings, 0 replies; 28+ messages in thread From: Rich Felker @ 2015-06-30 17:51 UTC (permalink / raw) To: musl [-- Attachment #1: Type: text/plain, Size: 757 bytes --] On Sun, Jun 28, 2015 at 02:48:34AM +0300, Alexander Monakov wrote: > Based on http://ridiculousfish.com/blog/posts/labor-of-division-episode-iii.html > > Do a little hand-holding for the compiler and fold magic post-shift into > 32-bit high word -> low word shift on 64-bit platforms. > --- On 32-bit Atom, I was able to get an improvement from 116ms to 113ms running an empty program linked with a .so produced using the attached script. This is probably roughly best-case improvement, so the umod stuff is probably not likely to be worthwhile unless/until we can improve the areas that are currently dominating the runtime. That's rather disappointing, but maybe we can improve enough other stuff for it to make sense to do the umod optimization. Rich [-- Attachment #2: gen.sh --] [-- Type: application/x-sh, Size: 112 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH v2 6/6] dynlink.c: store bloom filter size in struct dso 2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov ` (4 preceding siblings ...) 2015-06-27 23:48 ` [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication Alexander Monakov @ 2015-06-27 23:48 ` Alexander Monakov 2015-06-28 2:45 ` [PATCH v2 0/6] gnu-hash speedups Rich Felker 6 siblings, 0 replies; 28+ messages in thread From: Alexander Monakov @ 2015-06-27 23:48 UTC (permalink / raw) To: musl This slightly speeds up gnu_lookup_filtered by avoiding pointer chasing. --- src/ldso/dynlink.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/ldso/dynlink.c b/src/ldso/dynlink.c index f181209..12e8c25 100644 --- a/src/ldso/dynlink.c +++ b/src/ldso/dynlink.c @@ -59,6 +59,7 @@ struct dso { Sym *syms; uint32_t *hashtab; uint32_t *ghashtab; + uint32_t ghashmask; struct udiv gudiv; int16_t *versym; char *strings; @@ -259,7 +260,7 @@ static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso, uint32_t fofs, size_t fmask) { const size_t *bloomwords = (const void *)(hashtab+4); - size_t f = bloomwords[fofs & (hashtab[2]-1)]; + size_t f = bloomwords[fofs & dso->ghashmask]; if (!(f & fmask)) return 0; f >>= (h1 >> hashtab[3]) % (8 * sizeof f); @@ -758,6 +759,7 @@ static void decode_dyn(struct dso *p) p->rpath_orig = (void *)(p->strings + dyn[DT_RUNPATH]); if (search_vec(p->dynv, dyn, DT_GNU_HASH)) { p->ghashtab = (void *)(p->base + *dyn); + p->ghashmask = p->ghashtab[2]-1; precompute_udiv(p->ghashtab[0], &p->gudiv); } if (search_vec(p->dynv, dyn, DT_VERSYM)) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH v2 0/6] gnu-hash speedups 2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov ` (5 preceding siblings ...) 2015-06-27 23:48 ` [PATCH v2 6/6] dynlink.c: store bloom filter size in struct dso Alexander Monakov @ 2015-06-28 2:45 ` Rich Felker 6 siblings, 0 replies; 28+ messages in thread From: Rich Felker @ 2015-06-28 2:45 UTC (permalink / raw) To: musl On Sun, Jun 28, 2015 at 02:48:29AM +0300, Alexander Monakov wrote: > v2: > reorder patches > [2] split out ghashmask micro-optimization into a separate patch > [2] use 'unsigned char' rather than 'int' for struct udiv fields s1,s2,inc > [2] use an explicit cast in bloomwords assignment > [4] reorder 'hashtab' before 'dso' > [5] remove branch on zero s1 > [5] tweak implementation of saturating addition > [5] fold 32-bit shift into s2, avoiding run-time adjustments on 64-bit arch > > Alexander Monakov (6): > dynlink.c: use a faster expression in gnu_hash > dynlink.c: use bloom filter in gnu hash lookup > dynlink.c: slim down gnu_lookup > dynlink.c: pass gnu-hash table pointer to gnu_lookup I've committed 1-4 with minimal changes: refolding commit messages so "git log" fits in 80 columns, fixing a typo in one of the messages, and removing the line-break in the function args for gnu_lookup_filtered (there were already well over 80 chars before the break, and if it's going to be broken, spaces rather than tabs need to be used for alignment). > dynlink.c: compute modulus via magic multiplication > dynlink.c: store bloom filter size in struct dso Let's resume testing these when you next have time. The timing for the umod stuff is still so unstable under different -O and -mtune combinations that I'm not sure what to think about which version is best. But it looks like a clear win. I'll see if I can do some tests on real arm hardware too, since I expect it to make a big difference there. Rich ^ permalink raw reply [flat|nested] 28+ messages in thread