On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Sat, Feb 07, 2015 at 01:49:43PM +0100, Denys Vlasenko wrote:
>> On Sat, Feb 7, 2015 at 1:35 AM, Rich Felker <dalias@aerifal.cx> wrote:
>> What speedups?
>> In particular:
>> - perform pre-alignment if dst is unaligned
>
> For the rep stosq path? Does it help? I don't recall the details but I
> seem to remember both docs and measurements showing no reliable
> benefit from alignment for this instruction, and we had people trying
> things on several different cpu models. I'm open to hearing evidence
> to the contrary though.

size:20k buf:0x7f38656e2100
stos:25978 ns (times 32), 25.227500 bytes/ns
stos+1:31395 ns (times 32), 20.874662 bytes/ns
stos+4:31396 ns (times 32), 20.873997 bytes/ns
stos+8:24446 ns (times 32), 26.808476 bytes/ns

size:50k buf:0x7fbca1dc9100
stos:68149 ns (times 32), 24.041439 bytes/ns
stos+1:85762 ns (times 32), 19.104032 bytes/ns
stos+4:85762 ns (times 32), 19.104032 bytes/ns
stos+8:68204 ns (times 32), 24.022051 bytes/ns

size:1024k buf:0x7fa3036a5100
stos:1632285 ns (times 32), 20.556724 bytes/ns
stos+1:1891092 ns (times 32), 17.743416 bytes/ns
stos+4:1891089 ns (times 32), 17.743444 bytes/ns
stos+8:1632181 ns (times 32), 20.558034 bytes/ns

size:5000k buf:0x7fdf5cd6b100
stos:15592138 ns (times 32), 10.558298 bytes/ns
stos+1:15501841 ns (times 32), 10.619799 bytes/ns
stos+4:15507773 ns (times 32), 10.615737 bytes/ns
stos+8:15589617 ns (times 32), 10.560005 bytes/ns

The source is attached.
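
For anyone reading without the attachment: the ns figure on each line
is the total over all 32 passes, and bytes/ns is size * 32 / ns (for
the first line, 20480 * 32 / 25978 = 25.2275). Below is a minimal
sketch of a harness of that shape. It is not the attached source; the
names fill_fn, now_ns, bench_ns and bytes_per_ns are made up for
illustration, and the compiler barrier assumes GCC or Clang.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Any memset-compatible routine can be plugged in here. */
typedef void *(*fill_fn)(void *, int, size_t);

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Total time for `reps` fills of `size` bytes starting at `buf`. */
uint64_t bench_ns(fill_fn fill, unsigned char *buf, size_t size, int reps)
{
	uint64_t t0 = now_ns();
	for (int i = 0; i < reps; i++) {
		fill(buf, 0, size);
#ifdef __GNUC__
		/* Keep the compiler from dropping repeated fills. */
		__asm__ volatile ("" : : "r"(buf) : "memory");
#endif
	}
	return now_ns() - t0;
}

/* Throughput as reported in the bytes/ns column. */
double bytes_per_ns(size_t size, int reps, uint64_t ns)
{
	return (double)size * reps / (double)ns;
}
```

With this sketch, a "stos+1" line would correspond to something like
bench_ns(fill, buf + 1, size, 32), with buf itself aligned.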

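This data shows that (on my CPU, Sandy Bridge with 4MB L2)
8-byte alignment helps when the stores fit into L1 or L2:
at +1 and +4, throughput is roughly 14-21% lower for the 20k, 50k
and 1024k buffers, while +8 matches the aligned case, so 4-byte
alignment alone is not enough.
If the memset is larger than L2, memory throughput is the bottleneck
and there is no measurable difference (all four cases are within 1%).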
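
To make the proposal concrete, here is a rough sketch of what
"pre-alignment" means: store single bytes until dst is 8-byte aligned,
run rep stosq on the aligned middle, then finish the tail bytewise.
This is only an illustration in C with GCC-style inline asm, not the
musl code or a proposed patch; memset_prealigned and fill_words are
hypothetical names, and the plain-C fallback exists only so the sketch
builds on non-x86-64 targets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store nwords 8-byte words of `pattern` at the 8-byte-aligned d. */
static void fill_words(unsigned char *d, uint64_t pattern, size_t nwords)
{
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__ILP32__)
	/* rdi = destination, rcx = word count, rax = pattern.
	 * The ABI guarantees the direction flag is clear. */
	__asm__ volatile ("rep stosq"
	                  : "+D"(d), "+c"(nwords)
	                  : "a"(pattern)
	                  : "memory");
#else
	for (; nwords; nwords--, d += 8)
		memcpy(d, &pattern, 8);
#endif
}

void *memset_prealigned(void *dst, int c, size_t n)
{
	unsigned char *d = dst;
	unsigned char b = (unsigned char)c;
	/* Replicate the byte into all 8 lanes of a 64-bit word. */
	uint64_t pattern = 0x0101010101010101ULL * b;

	/* Head: 0..7 byte stores to reach an 8-byte boundary. */
	size_t head = (size_t)(-(uintptr_t)d & 7);
	if (head > n)
		head = n;
	for (size_t i = 0; i < head; i++)
		d[i] = b;
	d += head;
	n -= head;

	/* Middle: aligned 8-byte stores. */
	fill_words(d, pattern, n / 8);
	d += n & ~(size_t)7;

	/* Tail: the remaining 0..7 bytes. */
	for (size_t i = 0; i < (n & 7); i++)
		d[i] = b;
	return dst;
}
```

The head loop costs at most 7 byte stores, so the overhead is a fixed
small amount, while the numbers above suggest the aligned rep stosq
gains on the order of 15-20% on cache-resident buffers on this CPU.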