(this and the previous email are actually following up on patch 5, not 4)

How about handling power-of-two divisors like this.

In precompute_udiv:

1. Factor out powers of two from 'div' up front and get rid of 'goto again'
   (ok since doing the pre-shift unconditionally looks better)

2. If 'div' is now 1:

   - on 64-bit there's no problem: set 'mul' to 1 and 's2' to 0
   - on 32-bit: set 'mul' to 0

...and in umod:

3. On 32-bit arches, check 'mul' after pre-shift and skip to end if zero.

With this change it's easy to drop 'div' argument from 'umod', rename 'umod'
to 'udiv' and make the caller do the modulus computation.

Attaching a variant of the latency bench that implements the above; I also
tried to avoid small 'val' in the bench loop by using a recurrence with a
bitwise inversion.

Alexander
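
For illustration only, a rough sketch of steps 1-3 in the 32-bit flavour ('mul' == 0 as the "nothing left to divide by" marker). This is not the code from the patch or from the attached bench: the struct layout is hypothetical, only the names 'mul', 's1' and 's2' follow the discussion, and the multiply-high with add fixup is the textbook Granlund-Montgomery scheme swapped in so that the sketch is exact for any nonzero 32-bit 'div' and any 32-bit 'val'.

```c
#include <stdint.h>

/* Hypothetical layout, assumed for this sketch only. */
struct udiv {
	uint32_t mul; /* magic multiplier; 0 means 'div' was a power of two */
	int s1;       /* pre-shift: number of trailing zero bits in 'div' */
	int s2;       /* post-shift applied after the multiply */
};

/* 'div' must be nonzero. */
static void precompute_udiv(uint32_t div, struct udiv *p)
{
	int s1 = 0, l = 0;

	/* Step 1: factor out powers of two up front, unconditionally. */
	while (!(div & 1)) {
		div >>= 1;
		s1++;
	}
	p->s1 = s1;

	/* Step 2 (32-bit flavour): only the pre-shift is needed. */
	if (div == 1) {
		p->mul = 0;
		p->s2 = 0;
		return;
	}

	/* 'div' is now odd and >= 3: find l = ceil(log2(div)), 2 <= l <= 32. */
	while ((UINT64_C(1) << l) < div)
		l++;

	/* Low 32 bits of the 33-bit round-up multiplier; always < 2^32 here. */
	p->mul = (uint32_t)((((UINT64_C(1) << l) - div) << 32) / div + 1);
	p->s2 = l - 1;
}

/* Returns val / div for the 'div' that was passed to precompute_udiv. */
static uint32_t udiv(uint32_t val, const struct udiv *p)
{
	uint32_t v = val >> p->s1;  /* unconditional pre-shift */
	uint32_t t;

	/* Step 3: power-of-two divisor, skip to the end. */
	if (!p->mul)
		return v;

	t = (uint32_t)(((uint64_t)v * p->mul) >> 32); /* high half of product */
	return (t + ((v - t) >> 1)) >> p->s2;         /* add fixup, post-shift */
}
```

Since 'udiv' no longer takes 'div', the caller does the modulus itself, e.g. val - udiv(val, &p) * div.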