On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker wrote:
> On Sat, Feb 07, 2015 at 01:49:43PM +0100, Denys Vlasenko wrote:
>> On Sat, Feb 7, 2015 at 1:35 AM, Rich Felker wrote:
>> What speedups?
>> In particular:
>> - perform pre-alignment if dst is unaligned
>
> For the rep stosq path? Does it help? I don't recall the details but I
> seem to remember both docs and measurements showing no reliable
> benefit from alignment for this instruction, and we had people trying
> things on several different cpu models. I'm open to hearing evidence
> to the contrary though.

size:20k buf:0x7f38656e2100
stos:25978 ns (times 32), 25.227500 bytes/ns
stos+1:31395 ns (times 32), 20.874662 bytes/ns
stos+4:31396 ns (times 32), 20.873997 bytes/ns
stos+8:24446 ns (times 32), 26.808476 bytes/ns

size:50k buf:0x7fbca1dc9100
stos:68149 ns (times 32), 24.041439 bytes/ns
stos+1:85762 ns (times 32), 19.104032 bytes/ns
stos+4:85762 ns (times 32), 19.104032 bytes/ns
stos+8:68204 ns (times 32), 24.022051 bytes/ns

size:1024k buf:0x7fa3036a5100
stos:1632285 ns (times 32), 20.556724 bytes/ns
stos+1:1891092 ns (times 32), 17.743416 bytes/ns
stos+4:1891089 ns (times 32), 17.743444 bytes/ns
stos+8:1632181 ns (times 32), 20.558034 bytes/ns

size:5000k buf:0x7fdf5cd6b100
stos:15592138 ns (times 32), 10.558298 bytes/ns
stos+1:15501841 ns (times 32), 10.619799 bytes/ns
stos+4:15507773 ns (times 32), 10.615737 bytes/ns
stos+8:15589617 ns (times 32), 10.560005 bytes/ns

The source is attached.

This data shows that (on my CPU, Sandy Bridge with 4MB L2)
8-byte alignment helps when stores fit into L1 or L2.
If memset is larger than L2, memory throughput is too low
and there is no measurable difference.
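
To make "pre-alignment" concrete, here is a minimal portable C sketch of the idea. It is not the attached benchmark source and not musl's memset; the name memset_prealigned is hypothetical, and the 8-byte loop only stands in for what a rep stosq path would do in assembly. It stores single bytes until dst reaches an 8-byte boundary, does the bulk with aligned 8-byte stores, then finishes the 0..7 byte tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only: fill n bytes at dst with c,
 * advancing to an 8-byte boundary first so the bulk stores are aligned. */
static void *memset_prealigned(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    unsigned char b = (unsigned char)c;

    /* Head: byte stores until d is 8-byte aligned (at most 7 bytes). */
    while (n > 0 && ((uintptr_t)d & 7) != 0) {
        *d++ = b;
        n--;
    }

    /* Bulk: aligned 8-byte stores. An assembly version would hand this
     * part to "rep stosq" with the count in 8-byte units. */
    uint64_t pattern = 0x0101010101010101ULL * b;
    for (; n >= 8; n -= 8, d += 8) {
        assert(((uintptr_t)d & 7) == 0);
        /* memcpy avoids aliasing issues; compilers typically turn it
         * into a single 8-byte store. */
        memcpy(d, &pattern, sizeof pattern);
    }

    /* Tail: the remaining 0..7 bytes. */
    while (n > 0) {
        *d++ = b;
        n--;
    }
    return dst;
}
```

The head loop costs at most 7 byte stores, which is negligible for the 20k..1024k sizes measured above. Whether it pays off for small sizes is a separate question that these numbers do not answer.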