* Updated draft of improved memset.s for i386
@ 2015-02-25 20:37 Rich Felker
From: Rich Felker @ 2015-02-25 20:37 UTC
To: musl
[-- Attachment #1: Type: text/plain, Size: 1370 bytes --]
Here's a new version of the improved i386 memset.s. The main changes
are:
- Alignment to 16-byte boundary rather than 4-byte for rep stosl.
- Preserving existing over-alignment via rounding up instead of adding
16 then rounding down.
- Special-casing the already-aligned case (saves a few cycles when already
aligned, maybe 5-10% total run time at sizes just above the rep
stosl cutoff such as 64).
- Keeping the rep stosl run-length as long as possible rather than
trying to avoid duplicate stores. This helps a lot (>2x improvement)
at size 1024 on Atom and shouldn't hurt in general.
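To illustrate the second item in the list, here is a minimal C sketch of the two rounding strategies. It is not part of the attachment. The helper names are hypothetical, and the "add 16 then round down" form is an assumption about what the earlier draft did, based only on the wording above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes to skip so that p + skip is 16-byte aligned.
 * Mirrors the xor/sub/and $15 sequence in the attachment: skip = (-p) & 15. */
static unsigned skip_to_16(uintptr_t p)
{
	return (unsigned)(-p & 15);
}

/* Rounding up: an already-aligned pointer is returned unchanged, so any
 * larger alignment it happens to have (32, 64, ...) is preserved. */
static uintptr_t round_up_16(uintptr_t p)
{
	return p + skip_to_16(p);
}

/* Assumed form of the older approach: add 16, then round down. */
static uintptr_t add16_round_down(uintptr_t p)
{
	return (p + 16) & ~(uintptr_t)15;
}

/* Returns 1 if both approaches yield the same address for p. */
static int same_result(uintptr_t p)
{
	return round_up_16(p) == add16_round_down(p);
}

/* Quick self-check of the claims in the comments above. */
static void self_check(void)
{
	assert(round_up_16(0x1040) == 0x1040);      /* 64-byte alignment kept */
	assert(add16_round_down(0x1040) == 0x1050); /* reduced to 16-byte */
	assert(same_result(0x1041));                /* unaligned: identical */
}
```

The two forms differ only when the pointer is already 16-byte aligned, which is exactly the case where rounding up avoids moving off a more strongly aligned address.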
At this point I think it should be a net improvement on nearly any x86
system.
I've checked and it passes the current tests in libc-test. I'm not
entirely sure the tests cover all the cases we need though. For the
32-bit version, tests need to cover:
- All sizes 0-62; alignment doesn't matter.
- Sufficiently many sizes >=63 to get all alignments mod 16 for both
the length and the base pointer.
For the 64-bit versions (either Denys's latest or mine) we also need
coverage for all sizes 63-126 (alignment doesn't matter) and
sufficiently many past that to test all alignments mod 16 for both
length and base. For the sake of robustness and future-proofing, we
should probably be testing all base and length alignments mod 32 or
more up to size 256 or larger.
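The coverage described above could be sketched as an exhaustive checker along the following lines. This is only an illustration and not the libc-test code. `check_memset`, `broken_memset`, and the `GUARD` size are hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GUARD 64	/* sentinel bytes kept on each side of the target */

typedef void *(*memset_fn)(void *, int, size_t);

/* Hypothetical checker: runs fn for every base offset in [0, align) and
 * every length in [0, maxlen], and returns the number of (offset, length)
 * pairs that wrote the wrong bytes or returned the wrong pointer.
 * Any `align` consecutive offsets hit every residue mod `align`, so the
 * buffer's own alignment does not matter. */
static unsigned check_memset(memset_fn fn, size_t align, size_t maxlen)
{
	static unsigned char buf[GUARD + 64 + 1024 + GUARD];
	unsigned failures = 0;

	assert(align <= 64 && maxlen <= 1024);

	for (size_t off = 0; off < align; off++) {
		for (size_t len = 0; len <= maxlen; len++) {
			size_t start = GUARD + off;
			int bad = 0;

			for (size_t i = 0; i < sizeof buf; i++)
				buf[i] = 0xAA;
			if (fn(buf + start, 0x55, len) != buf + start)
				bad = 1;
			/* Bytes inside the range must be set; all others,
			 * including the guards, must be untouched. */
			for (size_t i = 0; i < sizeof buf; i++) {
				int inside = i >= start && i - start < len;
				if (buf[i] != (inside ? 0x55 : 0xAA))
					bad = 1;
			}
			failures += bad;
		}
	}
	return failures;
}

/* Deliberately broken example: drops the last byte on the large path. */
static void *broken_memset(void *d, int c, size_t n)
{
	memset(d, c, n > 62 ? n - 1 : n);
	return d;
}
```

Calling `check_memset(memset, 32, 256)` covers all base and length alignments mod 32 up to size 256 for whichever memset the program is linked against, and `check_memset(broken_memset, 32, 256)` should report a nonzero failure count.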
Rich
[-- Attachment #2: memset-draft3.s --]
[-- Type: text/plain, Size: 1170 bytes --]
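.global memset
.type memset,@function
memset:
	mov 12(%esp),%ecx	# ecx = n
	cmp $62,%ecx
	ja 2f			# n > 62: take the rep stosl path
	mov 8(%esp),%dl		# dl = fill byte
	mov 4(%esp),%eax	# eax = dest, also the return value
	test %ecx,%ecx
	jz 1f
	mov %dl,%dh		# dx = fill byte twice
	# Each stage stores from both ends; the head and tail stores
	# overlap whenever n is below the next cutoff.
	mov %dl,(%eax)
	mov %dl,-1(%eax,%ecx)
	cmp $2,%ecx
	jbe 1f
	mov %dx,1(%eax)
	mov %dx,(-1-2)(%eax,%ecx)
	cmp $6,%ecx
	jbe 1f
	shl $8,%edx		# replicate the fill byte into all of edx
	mov %dh,%dl
	shl $8,%edx
	mov %dh,%dl
	mov %edx,(1+2)(%eax)
	mov %edx,(-1-2-4)(%eax,%ecx)
	cmp $14,%ecx
	jbe 1f
	mov %edx,(1+2+4)(%eax)
	mov %edx,(1+2+4+4)(%eax)
	mov %edx,(-1-2-4-8)(%eax,%ecx)
	mov %edx,(-1-2-4-4)(%eax,%ecx)
	cmp $30,%ecx
	jbe 1f
	mov %edx,(1+2+4+8)(%eax)
	mov %edx,(1+2+4+8+4)(%eax)
	mov %edx,(1+2+4+8+8)(%eax)
	mov %edx,(1+2+4+8+12)(%eax)
	mov %edx,(-1-2-4-8-16)(%eax,%ecx)
	mov %edx,(-1-2-4-8-12)(%eax,%ecx)
	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
	mov %edx,(-1-2-4-8-4)(%eax,%ecx)
1:	ret

2:	movzbl 8(%esp),%eax
	mov %edi,12(%esp)	# save edi in the n argument slot
	imul $0x1010101,%eax	# eax = fill byte in all four bytes
	mov 4(%esp),%edi
	test $15,%edi
	mov %eax,-4(%edi,%ecx)	# last 4 bytes; mov leaves the flags alone
	jnz 2f			# dest is not 16-byte aligned
1:	shr $2, %ecx		# dword count, tail already stored
	rep
	stosl
	mov 4(%esp),%eax	# return dest
	mov 12(%esp),%edi	# restore edi
	ret

2:	xor %edx,%edx
	sub %edi,%edx
	and $15,%edx		# edx = (-dest) & 15, bytes to the boundary
	mov %eax,(%edi)		# fill the first 16 bytes unaligned
	mov %eax,4(%edi)
	mov %eax,8(%edi)
	mov %eax,12(%edi)
	sub %edx,%ecx
	add %edx,%edi
	jmp 1b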
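For readers who want to follow the control flow without reading AT&T syntax, the attachment can be modelled in C roughly as below. This is a reading aid under stated assumptions and not the author's code: `put` and `model_memset` are hypothetical names, `put` stands in for an unaligned `mov`, and the final loop stands in for `rep stosl`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the first w bytes of v at p. Every byte of v is the fill byte,
 * so byte order does not matter. */
static void put(unsigned char *p, uint32_t v, size_t w)
{
	assert(w <= sizeof v);
	memcpy(p, &v, w);
}

static void *model_memset(void *dest, int c, size_t n)
{
	unsigned char *d = dest;
	uint32_t v = (unsigned char)c * 0x01010101u;
	size_t i, skip;

	if (n <= 62) {
		/* Small sizes: paired head and tail stores that overlap. */
		if (n == 0) return dest;
		put(d, v, 1);
		put(d + n - 1, v, 1);
		if (n <= 2) return dest;
		put(d + 1, v, 2);
		put(d + n - 3, v, 2);
		if (n <= 6) return dest;
		put(d + 3, v, 4);
		put(d + n - 7, v, 4);
		if (n <= 14) return dest;
		for (i = 0; i < 8; i += 4) {
			put(d + 7 + i, v, 4);
			put(d + n - 15 + i, v, 4);
		}
		if (n <= 30) return dest;
		for (i = 0; i < 16; i += 4) {
			put(d + 15 + i, v, 4);
			put(d + n - 31 + i, v, 4);
		}
		return dest;
	}

	/* Large sizes: store the last 4 bytes first, so the dword loop
	 * may stop up to 3 bytes short of the end. */
	put(d + n - 4, v, 4);
	skip = (size_t)(-(uintptr_t)d & 15);
	if (skip) {
		/* Unaligned start: fill 16 bytes, then advance to the
		 * next 16-byte boundary. */
		for (i = 0; i < 16; i += 4)
			put(d + i, v, 4);
		d += skip;
		n -= skip;
	}
	for (i = 0; i < n / 4; i++)	/* stands in for rep stosl */
		put(d + 4 * i, v, 4);
	return dest;
}
```

Comparing `model_memset` against the C library's `memset` over a range of offsets and sizes is a cheap way to check that the stage cutoffs (2, 6, 14, 30, 62) and the store offsets line up.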