Rich,

Here's a version of your bench that should test the latency rather than the throughput of the umod operation, using random divisors.

From my testing, your version with the 64-bit add is clearly better on 64-bit, but loses on 32-bit unless 'add' is made a 64-bit field.

An advantage of the 'saturating add' variant is that it doesn't need extra space: 'inc' takes 1 byte, whereas 'add' requires 4.

Alexander
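
For readers following the thread, here is a minimal sketch of the latency versus throughput distinction. It is not the bench from this message. All names (umod_ref, xorshift32, fill_divisors, bench_throughput, bench_latency) are hypothetical, and plain '%' stands in for the umod implementation under test. The feedback step (bitwise NOT of the remainder) is only an assumption made to keep the operands large; it adds a small constant cost to each link of the chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the operation under test (hypothetical name).
 * Replace the body with the umod variant being measured. */
static inline uint32_t umod_ref(uint32_t n, uint32_t d)
{
    return n % d;
}

/* Small xorshift32 generator; with a nonzero state it never yields 0,
 * so every value it produces is usable as a divisor. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fill divs[0..n-1] with random nonzero divisors. */
static void fill_divisors(uint32_t *divs, size_t n, uint32_t seed)
{
    assert(seed != 0);
    for (size_t i = 0; i < n; i++)
        divs[i] = xorshift32(&seed);
}

/* Throughput-style loop: each input is independent of earlier results,
 * so the CPU may overlap many operations. */
static uint32_t bench_throughput(const uint32_t *divs, size_t n, uint32_t x)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= umod_ref(x + (uint32_t)i, divs[i]);
    return acc;
}

/* Latency-style loop: each result feeds the next input, forming a
 * dependent chain, so the operations cannot overlap. */
static uint32_t bench_latency(const uint32_t *divs, size_t n, uint32_t x)
{
    for (size_t i = 0; i < n; i++)
        x = ~umod_ref(x, divs[i]);
    return x;
}
```

To use it, fill an array with fill_divisors, time many calls of each loop (for example with clock_gettime), and consume the return value so the compiler cannot discard the work. The per-iteration time of bench_latency approximates the latency of the operation plus the feedback step, while bench_throughput shows how well independent operations pipeline.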