mailing list of musl libc
* Draft of improved memset.s for i386
@ 2015-02-24  1:09 Rich Felker
  2015-02-24  3:02 ` Denys Vlasenko
From: Rich Felker @ 2015-02-24  1:09 UTC
  To: musl; +Cc: Denys Vlasenko

[-- Attachment #1: Type: text/plain, Size: 1156 bytes --]

Here's a draft of an improved i386 memset.s based on the principles
Denys Vlasenko and I discussed for his and my x86_64 versions. Compared
to the current code, it reduces entry/exit overhead, increases the
length supported in the non-rep-stosl path, and aligns the rep-stosl.

My tests don't measure the misalignment penalty, but even in the
aligned case the rep-stosl path is slightly faster (~5 cycles per run,
out of at least 64 cycles), and the non-rep-stosl path is significantly
faster (e.g. 33 vs 51 cycles at size 16 and 40 vs 57 at size 32).

Empirically the byte-register-access/left-shift method of extending
the fill value to a word performs better than imul for me, but the
margin is very small (at most 1 cycle). Since we support much older
cpus (like actual 486) where imul could be really slow, I think this
is the right approach in principle too. I used imul in the rep-stosl
path but haven't tested whether it's faster there.
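
For readers comparing the two ways of extending the fill byte to a
32-bit word, here is a minimal C sketch of both. The function names
are hypothetical and not part of the draft. The shift version only
mirrors the idea of the byte-register/left-shift sequence, not its
exact instruction order.

```c
#include <stdint.h>

/* Shift/or broadcast: copies the low byte into all four bytes using
   only shifts and ors (no multiply). Hypothetical helper name. */
static uint32_t broadcast_shift(unsigned char c)
{
	uint32_t x = c;
	x |= x << 8;   /* low 16 bits now hold the byte twice */
	x |= x << 16;  /* all 32 bits now hold the byte four times */
	return x;
}

/* Multiply broadcast: the value an imul by 0x01010101 produces.
   Cannot overflow, since 0xff * 0x01010101 == 0xffffffff. */
static uint32_t broadcast_mul(unsigned char c)
{
	return (uint32_t)c * 0x01010101u;
}
```

Both functions return the same word for every byte value. Which one is
faster depends on the cpu, as described above.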

The non-rep-stosl path only goes up to size 62. I think sizes up to
126 could benefit from it, but the string of stores was getting really
long.
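
The non-rep-stosl path fills from both ends with overlapping stores, so
each size class needs only a compare and a branch. A rough C model of
the same store pattern follows, assuming unaligned stores are done
through memcpy. The helper names store2/store4 and small_memset are
made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: unaligned 2- and 4-byte stores. */
static void store2(unsigned char *p, uint16_t v) { memcpy(p, &v, 2); }
static void store4(unsigned char *p, uint32_t v) { memcpy(p, &v, 4); }

/* Model of the small path; only valid for n <= 62. */
static void *small_memset(void *dest, int c, size_t n)
{
	unsigned char *s = dest, b = (unsigned char)c;
	uint16_t w = (uint16_t)(b * 0x0101u);
	uint32_t d = b * 0x01010101u;

	assert(n <= 62);
	if (n == 0) return dest;

	/* 1 byte at each end covers n <= 2 */
	s[0] = b; s[n-1] = b;
	if (n <= 2) return dest;

	/* plus 2 bytes at each end covers n <= 6 */
	store2(s+1, w); store2(s+n-3, w);
	if (n <= 6) return dest;

	/* plus 4 bytes at each end covers n <= 14 */
	store4(s+3, d); store4(s+n-7, d);
	if (n <= 14) return dest;

	/* plus 8 bytes at each end covers n <= 30 */
	store4(s+7, d); store4(s+11, d);
	store4(s+n-15, d); store4(s+n-11, d);
	if (n <= 30) return dest;

	/* plus 16 bytes at each end covers n <= 62 */
	for (int i = 0; i < 16; i += 4) {
		store4(s+15+i, d);
		store4(s+n-31+i, d);
	}
	return dest;
}
```

Each step doubles the span written at each end, which is why the next
step (32 more bytes per end) would reach size 126.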

Correctness has not been tested so there may be stupid bugs.

Rich

[-- Attachment #2: memset-draft.s --]
[-- Type: text/plain, Size: 1091 bytes --]

.global memset
.type memset,@function
memset:
	mov 12(%esp),%ecx
	cmp $62,%ecx
	ja 2f

	movzbl 8(%esp),%edx
	mov 4(%esp),%eax
	test %ecx,%ecx
	jz 1f

	mov %dl,%dh

	mov %dl,(%eax)
	mov %dl,-1(%eax,%ecx)
	cmp $2,%ecx
	jbe 1f

	mov %dx,1(%eax)
	mov %dx,(-1-2)(%eax,%ecx)
	cmp $6,%ecx
	jbe 1f

	shl $8,%edx
	mov %dh,%dl
	shl $8,%edx
	mov %dh,%dl

	mov %edx,(1+2)(%eax)
	mov %edx,(-1-2-4)(%eax,%ecx)
	cmp $14,%ecx
	jbe 1f

	mov %edx,(1+2+4)(%eax)
	mov %edx,(1+2+4+4)(%eax)
	mov %edx,(-1-2-4-8)(%eax,%ecx)
	mov %edx,(-1-2-4-4)(%eax,%ecx)
	cmp $30,%ecx
	jbe 1f

	mov %edx,(1+2+4+8)(%eax)
	mov %edx,(1+2+4+8+4)(%eax)
	mov %edx,(1+2+4+8+8)(%eax)
	mov %edx,(1+2+4+8+12)(%eax)
	mov %edx,(-1-2-4-8-16)(%eax,%ecx)
	mov %edx,(-1-2-4-8-12)(%eax,%ecx)
	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
	mov %edx,(-1-2-4-8-4)(%eax,%ecx)

1:	ret

2:	mov %edi,12(%esp)
	movzbl 8(%esp),%eax
	mov $0x01010101,%edx
	imul %edx,%eax

	mov %ecx,%edx
	lea -5(%ecx),%ecx
	mov 4(%esp),%edi
	shr $2, %ecx

	mov %eax,(%edi)
	mov %eax,-8(%edi,%edx)
	mov %eax,-4(%edi,%edx)
	add $4,%edi
	and $-4,%edi
	rep
	stosl
	mov 4(%esp),%eax
	mov 12(%esp),%edi	# restore callee-saved %edi, stashed in the n slot at 2:
	ret
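
The address arithmetic in the rep-stosl path can be checked with a
small C model: one unaligned dword at the start, two at the end, then
(n-5)>>2 dwords starting at the first 4-byte boundary after dest. This
is only a sketch of the store pattern, with a plain loop standing in
for rep stosl. The name large_memset is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of the large path. The asm only gets here for n > 62, but
   the arithmetic itself needs just n >= 8. */
static void *large_memset(void *dest, int c, size_t n)
{
	assert(n >= 8);

	unsigned char *s = dest;
	uint32_t d = (unsigned char)c * 0x01010101u;
	size_t count = (n - 5) >> 2;  /* dword count for the aligned loop */

	/* unaligned head dword and two unaligned tail dwords */
	memcpy(s, &d, 4);
	memcpy(s + n - 8, &d, 4);
	memcpy(s + n - 4, &d, 4);

	/* round up to the next 4-byte boundary strictly above dest */
	unsigned char *p = (unsigned char *)(((uintptr_t)s + 4) & ~(uintptr_t)3);

	/* stand-in for rep stosl */
	while (count--) {
		memcpy(p, &d, 4);
		p += 4;
	}
	return dest;
}
```

The aligned run starts at most 4 bytes past dest, so the head dword
covers the gap. It ends between n-7 and n-1 bytes past dest, so it
never runs past the buffer and the two tail dwords cover the rest.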


end of thread, other threads:[~2015-02-24  5:36 UTC | newest]

Thread overview: 5+ messages
2015-02-24  1:09 Draft of improved memset.s for i386 Rich Felker
2015-02-24  3:02 ` Denys Vlasenko
2015-02-24  3:06   ` Denys Vlasenko
2015-02-24  3:18     ` Rich Felker
2015-02-24  5:36       ` Rich Felker
