On Tue, Feb 17, 2015 at 5:51 PM, Denys Vlasenko wrote:
> On Tue, Feb 17, 2015 at 5:12 PM, Rich Felker wrote:
>> On Tue, Feb 17, 2015 at 02:08:52PM +0100, Denys Vlasenko wrote:
>>> >> Please see attached file.
>>> >
>>> > I tried it and it's ~1 cycle slower for at least sizes 16-30;
>>> > presumably we're seeing the cost of the extra compare/branch at these
>>> > sizes but not at others. What does your timing test show?
>>>
>>> See below.
>>> First column - result of my2.s
>>> Second column - result of vda1.s
>>>
>>> Basically, the "rep stosq" code path got a bit faster, while
>>> small memsets stayed the same.
>>
>> Can you post your test program for me to try out? Here's what I've
>> been using, attached.
>
> With your program I see similar results:

I changed your program to output floating-point results and to run
many more iterations, keeping the minimum. Without that, consecutive
runs on my machine differ by +-2 cycles for most measurements.
With one million iterations, the discrepancy between runs is often
zero, and when it is not, it is one cycle or less.

Please see attached files.
my2.OUT1 and my2.OUT2 are two runs of the my2.s code
(to judge how much noise is in the measurements).
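
For anyone reading without the attachments, the "many iterations, keep
the minimum" idea looks roughly like the sketch below. This is not the
attached test program. The names now_ns() and min_memset_ns() are
made up for illustration, it times with clock_gettime() in nanoseconds
rather than counting cycles, and the empty asm statement is a GCC/Clang
compiler barrier that I am assuming is available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/*
 * Hypothetical helper: time `reps` back-to-back memset() calls of `len`
 * bytes, repeat that measurement `iters` times, and return the smallest
 * per-call time seen, in nanoseconds. Taking the minimum discards runs
 * disturbed by interrupts, cache misses and similar noise.
 */
static double min_memset_ns(unsigned char *buf, size_t len, int iters, int reps)
{
    double best = -1.0;   /* negative means "no measurement yet" */

    assert(iters > 0 && reps > 0);
    for (int i = 0; i < iters; i++) {
        double t0 = now_ns();
        for (int r = 0; r < reps; r++) {
            memset(buf, r, len);
            /* Compiler barrier (GCC/Clang): keeps the stores from being
             * optimized away or merged across iterations. */
            __asm__ volatile("" : : "r"(buf) : "memory");
        }
        double per_call = (now_ns() - t0) / reps;
        if (best < 0.0 || per_call < best)
            best = per_call;
    }
    return best;
}
```

Calling min_memset_ns(buf, n, 1000000, 16) for each size n and printing
the result with "%.2f" gives one floating-point figure per size. Running
the whole thing twice and comparing the two outputs shows how much noise
is left.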