Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication

mailing list of musl libc
 help / color / mirror / code / Atom feed

From: Alexander Monakov <amonakov@ispras.ru>
To: musl@lists.openwall.com
Subject: Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication
Date: Wed, 24 Jun 2015 09:08:15 +0300 (MSK)	[thread overview]
Message-ID: <alpine.LNX.2.11.1506240828220.9758@monopod.intra.ispras.ru> (raw)
In-Reply-To: <20150624051312.GP1173@brightrain.aerifal.cx>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1719 bytes --]

> > How can I do the last step, 'x-v*div' without it?
> 
> Ah yes. Would it be preferable to have that in struct udiv then,
> though? Then the caller never has to care about the divisor, and in
> the case where umod doesn't get inlined, the call will be more
> efficient (fewer args).

I doubt that; the caller needs that value ('nbuckets') anyway.

[...]
> Using the post-mul add rather than the saturated increment would make
> this work for 0xffffffff too.

post-mul add is problematic on 32-bit
 
> Another obvious solution is not using the +32 offset so that the right
> shift can just be 31, but that pessimizes the code on 32-bit archs
> quite a bit.

I agree that a special-case for a power-of-two divisor will fire too rarely,
so figuring a replacement out would be nice.

If you (or anyone) want to play with ideas, I'm attaching my test driver for
magic division that tests 2^32-1 inputs for a divisor given on the command
line (the main loop looks odd, but my goal was to have gcc vectorize it).

> > p->s1 check is for this reason: shift are relatively costly, and s1 is rarely
> > non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> > is non-zero, the check allows to skip the saturating increment operation.
> 
> Shifts are costly? Are we talking about P4-era junk? ;-)

I primarily had modern Intel cores in mind where as far as I can see shifts by
%ecx have latency 2.  But even ignoring that, there's the second part of my
argument.

Anyway it's not trivial to measure an impact on the same "modern Intel cores",
so if anybody can say how that looks on a different platform, I'd appreciate
that.  Removing my if-else there is obviously beneficial for code size.

Alexander

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: Type: TEXT/x-c; name=udiv.c, Size: 1923 bytes --]

#include <stdlib.h>
static void precalc(unsigned div, unsigned *mulp, int *prep, int *postp, int *incp)
{
  int bits = 0;
again:
  *prep = bits;
  unsigned pow = 1u<<31, quo = pow / div, rem = pow % div;

  int log=0, exp;
  unsigned tmp=div;
  do log++; while (tmp >>= 1);

  unsigned down_mul=0;
  for (exp=0, pow=1u<<bits; ; exp++, pow<<=1) {
    int adj = rem >= div - rem;
    quo *= 2; rem *= 2;
    if (adj) {quo++; rem -= div; }

    if (exp >= log-bits || div-rem <= pow)
      break;

    if (!down_mul && rem <= pow) {
      down_mul = quo; *postp = exp;
    }
  }
  if (exp < log) {
    *mulp = quo+1;
    *postp = exp;
    *incp = 0;
  } else if (div & 1) {
    *mulp = down_mul;
    *incp = 1;
  } else {
    do bits++; while (!((div >>= 1) & 1));
    goto again;
  }
}
int main(int argc, char *argv[])
{
  unsigned div = atoi(argv[1]);
  if (!(div&(div-1)))
    return 0;

  unsigned mul;
  int pre, post, inc;
  precalc(div, &mul, &pre, &post, &inc);
  unsigned s32 = 32;
  if (sizeof(long) == 8) {
    s32 += post;
    post = 0;
  }
  unsigned rem, quo, val;
  char bad=0;
#define loop(pre, inc, post) ({    \
  for (val=~0u, quo=val/div, rem=val%div+1; quo+1; quo--, rem=div)      \
    for (; rem; rem--, val--) {  \
      unsigned v = val;            \
      v >>= pre;                   \
      if (v+inc) v+=inc;           \
      v = (1ull * v * mul) >> s32; \
      v >>= post;                  \
      bad |= (v != quo);      \
    } })
  if (post)
    if (inc)
      if (pre)
        return 2;
      else
        loop(0, inc, post);
    else
      if (pre)
        loop(pre, 0, post);
      else
        loop(0, 0, post);
  else
    if (inc)
      if (pre)
        return 2;
      else
        loop(0, inc, 0);
    else
      if (pre)
        loop(pre, 0, 0);
      else
        loop(0, 0, 0);
  return bad;
}

next prev parent reply	other threads:[~2015-06-24  6:08 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-23 23:24 [PATCH 0/5] gnu-hash speedups Alexander Monakov
2015-06-23 23:24 ` [PATCH 1/5] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
2015-06-24  5:39   ` Rich Felker
2015-06-24  6:29     ` Alexander Monakov
2015-06-24  6:32       ` Alexander Monakov
2015-06-24  6:50       ` Rich Felker
2015-06-23 23:24 ` [PATCH 2/5] dynlink.c: compute modulus via magic multiplication Alexander Monakov
2015-06-24  4:18   ` Alexander Monakov
2015-06-24  4:19     ` Rich Felker
2015-06-24  4:24   ` Rich Felker
2015-06-24  4:32     ` Alexander Monakov
2015-06-24  5:13       ` Rich Felker
2015-06-24  6:08         ` Alexander Monakov [this message]
2015-06-24  6:39           ` Rich Felker
2015-06-23 23:24 ` [PATCH 3/5] dynlink.c: slim down gnu_lookup Alexander Monakov
2015-06-23 23:24 ` [PATCH 4/5] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
2015-06-23 23:24 ` [PATCH 5/5] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
2015-06-27 23:48 ` [PATCH v2 0/6] gnu-hash speedups Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 1/6] dynlink.c: use a faster expression in gnu_hash Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 2/6] dynlink.c: use bloom filter in gnu hash lookup Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 3/6] dynlink.c: slim down gnu_lookup Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 4/6] dynlink.c: pass gnu-hash table pointer to gnu_lookup Alexander Monakov
2015-06-28  0:05     ` Alexander Monakov
2015-06-28  8:59       ` Alexander Monakov
2015-06-27 23:48   ` [PATCH v2 5/6] dynlink.c: compute modulus via magic multiplication Alexander Monakov
2015-06-30 17:51     ` Rich Felker
2015-06-27 23:48   ` [PATCH v2 6/6] dynlink.c: store bloom filter size in struct dso Alexander Monakov
2015-06-28  2:45   ` [PATCH v2 0/6] gnu-hash speedups Rich Felker

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LNX.2.11.1506240828220.9758@monopod.intra.ispras.ru \
    --to=amonakov@ispras.ru \
    --cc=musl@lists.openwall.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).