On Thu, Feb 12, 2015 at 8:26 PM, Denys Vlasenko wrote:
>> I'd actually like to extend the "short" range up to at least 32 bytes
>> using two 8-byte writes for the middle, unless the savings from using
>> 32-bit imul instead of 64-bit are sufficient to justify 4 4-byte
>> writes for the middle. On the cpu I tested on, the difference is 11
>> cycles vs 32 cycles for non-rep path versus rep path at size 32.
>
> The short path causes mixed feelings in me.
>
> On one hand, it's elegant in a contrived way.
>
> On the other hand, multiple
> overlaying stores must be causing hell in store unit.
> I'm thinking, maybe there's a faster way to do that.

For example, like in the attached implementation.

This one will not perform eight stores to memory
to fill 15 byte area... only two.
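
The attachment is not reproduced here, so the following is only a rough C sketch of the general idea of covering a short area with two wide stores, not the attached code itself. The helper name fill_8_to_16 and the 8..16 byte size range are assumptions made for illustration. For n = 15 the first store covers bytes 0..7 and the second covers bytes 7..14, so only one byte is written twice.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not taken from the attached implementation:
 * fill n bytes (8 <= n <= 16) with the byte c using two 8-byte stores. */
static void fill_8_to_16(unsigned char *dst, int c, size_t n)
{
    /* Replicate the fill byte into all eight bytes of a 64-bit word. */
    uint64_t pattern = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;

    /* A fixed-size memcpy is the portable way to express an unaligned
     * store; compilers typically lower each call to a single store. */
    memcpy(dst, &pattern, 8);          /* bytes 0 .. 7     */
    memcpy(dst + n - 8, &pattern, 8);  /* bytes n-8 .. n-1 */
}
```

The two stores overlap by 16 - n bytes, so the same two instructions handle every size in the range without a branch on n. Whether this actually beats the chain of smaller overlapping stores would have to be measured on the CPUs of interest.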