I would think the iterate-per-char-till-zero step would take the most time. Even if GCC vectorized the loop without SIMD, it would still need to iterate byte by byte to find the zero within the word that contains it. Current musl does this as well, though.
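
To make the point concrete, here is a rough sketch of a word-at-a-time strlen with the per-byte tail scan. It is only an illustration, not musl's actual source: the name word_strlen and the macros are made up for this example, and it assumes that reading a whole aligned word containing the terminator is safe on the target, which the C standard does not guarantee.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper macros for this sketch. */
#define ONES  ((size_t)-1 / UCHAR_MAX)      /* 0x0101...01 */
#define HIGHS (ONES * (UCHAR_MAX / 2 + 1))  /* 0x8080...80 */
/* Nonzero if and only if some byte of x is zero. */
#define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

size_t word_strlen(const char *s)
{
    const char *start = s;

    /* Step 1: byte steps until s is word-aligned. */
    while ((uintptr_t)s % sizeof(size_t)) {
        if (!*s)
            return (size_t)(s - start);
        s++;
    }

    /* Step 2: word steps until a word contains a zero byte.
     * Assumption: the aligned word holding the terminator can be
     * read in full, even though part of it lies past the string. */
    for (;;) {
        size_t w;
        memcpy(&w, s, sizeof w);  /* avoids strict-aliasing problems */
        if (HASZERO(w))
            break;
        s += sizeof w;
    }

    /* Step 3: the per-char scan inside that last word. */
    while (*s)
        s++;
    return (size_t)(s - start);
}
```

Step 3 runs for at most sizeof(size_t) - 1 iterations, whatever the string length.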

On Jul 10, 2013 1:34 PM, "Andre Renaud" <andre@bluewatersys.com> wrote:
>> What also might be worth testing is whether GCC can compete if you
>> just give it a naive loop (not the fancy pseudo-vectorized stuff
>> currently in musl) and good CFLAGS. I know on x86 I was able to beat
>> the fanciest asm strlen I could come up with simply by writing the
>> naive loop in C and unrolling it a lot.
>
>
> Duff's device!

That was exactly my first idea too, but interestingly it turns out not
to have added any real performance improvement. Looking at the
assembler output with -O3, gcc already does a pretty good job of
unrolling.
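
For anyone unfamiliar with it, a minimal sketch of Duff's device next to the naive loop it unrolls by hand. This is a generic byte-copy illustration under assumed names (naive_copy, duff_copy); it is not the code that was benchmarked here.

```c
#include <stddef.h>

/* Naive loop: left to the compiler to unroll (e.g. with -O3). */
void naive_copy(unsigned char *to, const unsigned char *from, size_t n)
{
    while (n--)
        *to++ = *from++;
}

/* Duff's device: an 8x hand-unrolled copy. The switch jumps into the
 * middle of the do/while body to handle the n % 8 leftover bytes
 * first, then every further pass copies a full 8 bytes. */
void duff_copy(unsigned char *to, const unsigned char *from, size_t n)
{
    if (n == 0)
        return;

    size_t rounds = (n + 7) / 8;  /* passes through the loop body */

    switch (n % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--rounds > 0);
    }
}
```

Both functions copy exactly n bytes, so they can be swapped in a benchmark and the generated assembler compared directly.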

Regards,
Andre