Here's a draft of an improved i386 memset.s based on the principles Denys Vlasenko and I discussed for his and my x86_64 versions. Compared to the current code, it reduces entry/exit overhead, increases the length supported in the non-rep-stosl path, and aligns the destination for the rep-stosl.

My tests don't measure the misalignment penalty, but even in the aligned case the rep-stosl path is slightly faster (~5 cycles per run, out of at least 64 cycles), and the non-rep-stosl path is significantly faster (e.g. 33 vs 51 cycles at size 16 and 40 vs 57 at size 32).

Empirically, the byte-register-access/left-shift method of extending the fill value to a 32-bit word performs better than imul for me, but the margin is very small (at most 1 cycle). Since we support much older CPUs (like an actual 486) where imul could be really slow, I think this is the right approach in principle too. I used imul in the rep-stosl path but haven't tested whether it's faster there.

The non-rep-stosl path only goes up to size 62. I think sizes up to 126 could benefit from it, but the string of stores was getting really long.

Correctness has not been tested, so there may be stupid bugs.

Rich
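
To illustrate the two fill-value extension methods compared above, here is a minimal C sketch. It is not the assembly from the draft: the function names splat_shift and splat_mul are hypothetical, and the shift/OR version only models the effect of the byte-register/left-shift sequence, assuming the goal is to replicate the low byte of the fill value into all four bytes of a 32-bit word.

```c
#include <stdint.h>

/* Shift/OR method: replicate the low byte of c into all four byte lanes.
   Hypothetical helper; models the result of the byte-register/left-shift
   sequence, not its exact instructions. */
static uint32_t splat_shift(int c)
{
    uint32_t w = (unsigned char)c;  /* 0x000000cc */
    w |= w << 8;                    /* 0x0000cccc */
    w |= w << 16;                   /* 0xcccccccc */
    return w;
}

/* Multiply method: the product with 0x01010101 places one copy of the
   byte in each lane. This is what a single imul would compute. */
static uint32_t splat_mul(int c)
{
    return (uint32_t)(unsigned char)c * 0x01010101u;
}
```

Both functions return the same word for every fill value, so the choice between them is purely a question of instruction cost on the target CPU.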