mailing list of musl libc
* Updated draft of improved memset.s for i386
@ 2015-02-25 20:37 Rich Felker
From: Rich Felker @ 2015-02-25 20:37 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1370 bytes --]

Here's a new version of the improved i386 memset.s. The main changes
are:

- Alignment to 16-byte boundary rather than 4-byte for rep stosl.

- Preserving existing over-alignment via rounding up instead of adding
  16 then rounding down.

- Special-casing the already-aligned case (saves a few cycles when the
  destination is already aligned, maybe 5-10% of total run time at
  sizes just above the rep stosl cutoff, such as 64).

- Keeping the rep stosl run-length as long as possible rather than
  trying to avoid duplicate stores. This helps a lot (>2x improvement)
  at size 1024 on Atom and shouldn't hurt in general.
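
To make the second point concrete, here is a small C sketch of the two
ways of computing how far to advance the destination pointer before
rep stosl. The function names are made up for illustration and do not
appear in memset.s; the round-up form corresponds to the
xor/sub/and $15 sequence in the attached draft.

```c
#include <assert.h>
#include <stdint.h>

/* Round up: bytes to skip to reach the next 16-byte boundary.
 * Yields 0 when p is already aligned, so a pointer that happens to be
 * 32- or 64-byte aligned keeps that alignment. */
static unsigned head_round_up(uintptr_t p)
{
	return (unsigned)(-p & 15);
}

/* Add 16, then round down: always advances by 1..16 bytes.
 * An already-aligned p moves to p+16 and loses any over-alignment. */
static unsigned head_add_then_round_down(uintptr_t p)
{
	return (unsigned)(((p + 16) & ~(uintptr_t)15) - p);
}

/* Illustrative self-check on a 64-byte-aligned address. */
static void head_self_check(void)
{
	assert(head_round_up(0x1000) == 0);
	assert(head_add_then_round_down(0x1000) == 16);
}
```

The two forms agree for every unaligned pointer and differ only when
the pointer is already on a 16-byte boundary.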

At this point I think it should be a net improvement on nearly any x86
system.

I've checked and it passes the current tests in libc-test. I'm not
entirely sure the tests cover all the cases we need though. For the
32-bit version, tests need to cover:

- All sizes 0-62; alignment doesn't matter.

- Sufficiently many sizes >=63 to get all alignments mod 16 for both
  the length and the base pointer.
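
The reason every size 0-62 needs its own test is that the short path
uses overlapping stores from both ends of the buffer, with a different
set of stores taking effect at each size. The following is a rough C
model of that path plus a check over all 63 sizes. It is only a sketch:
small_set, st2, st4 and small_set_failures are names invented here, and
the byte loops stand in for the 2- and 4-byte mov instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for a 2-byte and a 4-byte store of the fill byte. */
static void st2(unsigned char *p, unsigned char c) { p[0] = c; p[1] = c; }
static void st4(unsigned char *p, unsigned char c)
{
	size_t i;
	for (i = 0; i < 4; i++) p[i] = c;
}

/* Hypothetical model of the n <= 62 path, same store offsets as the draft. */
static void small_set(unsigned char *d, unsigned char c, size_t n)
{
	size_t i;
	if (n == 0) return;
	d[0] = c; d[n-1] = c;
	if (n <= 2) return;
	st2(d+1, c); st2(d+n-3, c);
	if (n <= 6) return;
	st4(d+3, c); st4(d+n-7, c);
	if (n <= 14) return;
	for (i = 0; i < 8; i += 4) { st4(d+7+i, c); st4(d+n-15+i, c); }
	if (n <= 30) return;
	for (i = 0; i < 16; i += 4) { st4(d+15+i, c); st4(d+n-31+i, c); }
}

/* Number of sizes in 0..62 where the model misses a byte or
 * touches one of the guard bytes on either side. */
static int small_set_failures(void)
{
	unsigned char buf[1 + 62 + 1];
	size_t n, i;
	int bad = 0;
	for (n = 0; n <= 62; n++) {
		int ok = 1;
		for (i = 0; i < sizeof buf; i++) buf[i] = 0xAA;
		small_set(buf + 1, 0x55, n);
		for (i = 0; i < sizeof buf; i++) {
			unsigned char want = (i >= 1 && i <= n) ? 0x55 : 0xAA;
			if (buf[i] != want) ok = 0;
		}
		bad += !ok;
	}
	return bad;
}

static void check_small_set(void)
{
	assert(small_set_failures() == 0);
}
```

Since none of these stores depends on the value of the pointer, only on
n, alignment is irrelevant for this range.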

For the 64-bit versions (either Denys's latest or mine) we also need
coverage for all sizes 63-126 (alignment doesn't matter) and
sufficiently many past that to test all alignments mod 16 for both
length and base. For the sake of robustness and future-proofing, we
should probably be testing all base and length alignments mod 32 or
more up to size 256 or larger.
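
As a sketch of what such a test could look like (this is not the
existing libc-test code; memset_failures and the constants are
assumptions made for illustration), the loop below tries every base
alignment and every length up to a limit, and checks the return value,
the filled bytes, and guard regions on both sides:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>	/* declares the memset under test */

typedef void *(*memset_fn)(void *, int, size_t);

enum { GUARD = 64, MAX_ALIGN = 64, MAX_LEN = 512 };

/* Counts (alignment, length) pairs for which fn misbehaves.
 * Returns (size_t)-1 if the requested ranges exceed the buffer. */
static size_t memset_failures(memset_fn fn, size_t aligns, size_t max_len)
{
	static unsigned char raw[MAX_ALIGN + GUARD + MAX_ALIGN + MAX_LEN + GUARD];
	/* Round up to a MAX_ALIGN boundary so offset a gives alignment a. */
	unsigned char *base = raw + (-(uintptr_t)raw & (MAX_ALIGN - 1));
	size_t total = GUARD + MAX_ALIGN + MAX_LEN + GUARD;
	size_t a, n, i, bad = 0;

	if (aligns > MAX_ALIGN || max_len > MAX_LEN) return (size_t)-1;

	for (a = 0; a < aligns; a++) for (n = 0; n <= max_len; n++) {
		unsigned char *p = base + GUARD + a;
		int ok;
		/* Fill with a byte loop so the function under test
		 * is not used to prepare its own test data. */
		for (i = 0; i < total; i++) base[i] = 0xAA;
		ok = fn(p, 0x55, n) == p;
		for (i = 0; i < total; i++) {
			int inside = base + i >= p && base + i < p + n;
			if (base[i] != (inside ? 0x55 : 0xAA)) ok = 0;
		}
		bad += !ok;
	}
	return bad;
}

static void check_libc_memset(void)
{
	/* All base alignments mod 32, all lengths 0..256. */
	assert(memset_failures(memset, 32, 256) == 0);
}
```

Covering every length up to the limit also covers every length
alignment mod 32, so no separate loop over length alignment is needed.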

Rich

[-- Attachment #2: memset-draft3.s --]
[-- Type: text/plain, Size: 1170 bytes --]

.global memset
.type memset,@function
memset:
	mov 12(%esp),%ecx
	cmp $62,%ecx
	ja 2f

	# Short path (n <= 62): overlapping stores from both ends.
	mov 8(%esp),%dl
	mov 4(%esp),%eax
	test %ecx,%ecx
	jz 1f

	# dx = fill byte repeated twice.
	mov %dl,%dh

	mov %dl,(%eax)
	mov %dl,-1(%eax,%ecx)
	cmp $2,%ecx
	jbe 1f

	mov %dx,1(%eax)
	mov %dx,(-1-2)(%eax,%ecx)
	cmp $6,%ecx
	jbe 1f

	# Replicate the fill byte into all four bytes of edx.
	shl $8,%edx
	mov %dh,%dl
	shl $8,%edx
	mov %dh,%dl

	mov %edx,(1+2)(%eax)
	mov %edx,(-1-2-4)(%eax,%ecx)
	cmp $14,%ecx
	jbe 1f

	mov %edx,(1+2+4)(%eax)
	mov %edx,(1+2+4+4)(%eax)
	mov %edx,(-1-2-4-8)(%eax,%ecx)
	mov %edx,(-1-2-4-4)(%eax,%ecx)
	cmp $30,%ecx
	jbe 1f

	# Sizes 31-62: 16 bytes from each end.
	mov %edx,(1+2+4+8)(%eax)
	mov %edx,(1+2+4+8+4)(%eax)
	mov %edx,(1+2+4+8+8)(%eax)
	mov %edx,(1+2+4+8+12)(%eax)
	mov %edx,(-1-2-4-8-16)(%eax,%ecx)
	mov %edx,(-1-2-4-8-12)(%eax,%ecx)
	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
	mov %edx,(-1-2-4-8-4)(%eax,%ecx)

1:	ret

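# Long path (n >= 63). edi is saved in the n argument slot, which is
# free because n is already in ecx.
2:	movzbl 8(%esp),%eax
	mov %edi,12(%esp)
	imul $0x1010101,%eax
	mov 4(%esp),%edi
	# The flags from test survive the mov and are consumed by jnz.
	test $15,%edi
	# Store the last 4 bytes so the tail is covered after shr $2.
	mov %eax,-4(%edi,%ecx)
	jnz 2f

1:	shr $2, %ecx
	rep
	stosl
	mov 4(%esp),%eax
	mov 12(%esp),%edi
	ret

# Unaligned dest: edx = (-dest) & 15. Store the first 16 bytes, then
# advance edi to the next 16-byte boundary and shorten the count.
2:	xor %edx,%edx
	sub %edi,%edx
	and $15,%edx
	mov %eax,(%edi)
	mov %eax,4(%edi)
	mov %eax,8(%edi)
	mov %eax,12(%edi)
	sub %edx,%ecx
	add %edx,%edi
	jmp 1b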
