mailing list of musl libc
* Draft of improved memset.s for i386
@ 2015-02-24  1:09 Rich Felker
From: Rich Felker @ 2015-02-24  1:09 UTC
  To: musl; +Cc: Denys Vlasenko

[-- Attachment #1: Type: text/plain, Size: 1156 bytes --]

Here's a draft of an improved i386 memset.s based on the principles
Denys Vlasenko and I discussed on his and my x86_64 versions. Compared
to the current code, it reduces entry/exit overhead, increases the
length supported in the non-rep-stosl path, and aligns the rep-stosl.

My tests don't measure the misalignment penalty, but even in the
aligned case the rep-stosl path is slightly faster (~5 cycles per run,
out of at least 64 cycles), and the non-rep-stosl path is significantly
faster (e.g. 33 vs 51 cycles at size 16 and 40 vs 57 at size 32).

Empirically the byte-register-access/left-shift method of extending
the fill value to a word performs better than imul for me, but the
margin is very small (at most 1 cycle). Since we support much older
CPUs (like an actual 486) where imul could be really slow, I think this
is the right approach in principle too. I used imul in the rep-stosl
path but haven't tested whether it's faster there.
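
As a rough illustration of the two ways of extending the fill byte, here
is a minimal C sketch. It is a model, not the assembly itself: the
function names are hypothetical, and broadcast_shift only mirrors the
draft's mov %dl,%dh / shl $8,%edx / mov %dh,%dl sequence step by step.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply method: one multiply by 0x01010101 copies the byte
 * into all four byte lanes of the word. */
static uint32_t broadcast_mul(unsigned char c)
{
	return (uint32_t)c * UINT32_C(0x01010101);
}

/* Shift method, modeling the draft's byte-register sequence. */
static uint32_t broadcast_shift(unsigned char c)
{
	uint32_t x = c;        /* movzbl:               0x000000cc */
	x |= x << 8;           /* mov %dl,%dh:          0x0000cccc */
	x = (x << 8) | c;      /* shl $8; mov %dh,%dl:  0x00cccccc */
	x = (x << 8) | c;      /* shl $8; mov %dh,%dl:  0xcccccccc */
	return x;
}

/* Both methods should agree for every possible fill byte. */
static void check_broadcasts(void)
{
	for (unsigned c = 0; c < 256; c++)
		assert(broadcast_mul((unsigned char)c) ==
		       broadcast_shift((unsigned char)c));
}
```

The sketch only shows that both methods compute the same word. It says
nothing about their relative speed.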

The non-rep-stosl path only goes up to size 62. I think sizes up to
126 could benefit from it, but the string of stores was getting really
long.
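
The store pattern of the non-rep-stosl path can be modeled in C roughly
as follows. This is a sketch under assumptions: small_memset, store2 and
store4 are hypothetical names, and the unaligned stores are expressed
with memcpy so that the C stays well-defined. Each stage writes a head
block and a tail block that are allowed to overlap, which is why a
handful of size comparisons covers every length up to 62.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: unaligned 2- and 4-byte stores. */
static void store2(unsigned char *p, uint16_t v) { memcpy(p, &v, 2); }
static void store4(unsigned char *p, uint32_t v) { memcpy(p, &v, 4); }

/* Model of the draft's path for n <= 62 (larger n is not handled). */
static void *small_memset(void *dest, int c, size_t n)
{
	unsigned char *s = dest;
	unsigned char b = (unsigned char)c;
	uint16_t w2 = (uint16_t)(b * 0x0101u);
	uint32_t w4 = b * UINT32_C(0x01010101);

	if (n == 0) return dest;
	s[0] = b; s[n-1] = b;                       /* covers n = 1..2 */
	if (n <= 2) return dest;

	store2(s+1, w2); store2(s+n-3, w2);         /* covers n = 3..6 */
	if (n <= 6) return dest;

	store4(s+3, w4); store4(s+n-7, w4);         /* covers n = 7..14 */
	if (n <= 14) return dest;

	store4(s+7, w4);    store4(s+11, w4);       /* covers n = 15..30 */
	store4(s+n-15, w4); store4(s+n-11, w4);
	if (n <= 30) return dest;

	store4(s+15, w4);   store4(s+19, w4);       /* covers n = 31..62 */
	store4(s+23, w4);   store4(s+27, w4);
	store4(s+n-31, w4); store4(s+n-27, w4);
	store4(s+n-23, w4); store4(s+n-19, w4);
	return dest;
}

/* Fill every length 0..62 and verify the bytes and both guard bytes. */
static void check_small_memset(void)
{
	for (size_t n = 0; n <= 62; n++) {
		unsigned char buf[64];
		memset(buf, 0xEE, sizeof buf);
		small_memset(buf + 1, 0x5A, n);
		for (size_t i = 0; i < sizeof buf; i++) {
			unsigned char want = (i >= 1 && i < 1 + n) ? 0x5A : 0xEE;
			assert(buf[i] == want);
		}
	}
}
```

Extending the same scheme to 126 would add one more stage of eight head
and eight tail stores, which is the "string of stores" growth mentioned
above.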

Correctness has not been tested so there may be stupid bugs.

Rich

[-- Attachment #2: memset-draft.s --]
[-- Type: text/plain, Size: 1091 bytes --]

.global memset
.type memset,@function
memset:
	mov 12(%esp),%ecx
	cmp $62,%ecx
	ja 2f

	movzbl 8(%esp),%edx
	mov 4(%esp),%eax
	test %ecx,%ecx
	jz 1f

	mov %dl,%dh

	mov %dl,(%eax)
	mov %dl,-1(%eax,%ecx)
	cmp $2,%ecx
	jbe 1f

	mov %dx,1(%eax)
	mov %dx,(-1-2)(%eax,%ecx)
	cmp $6,%ecx
	jbe 1f

	shl $8,%edx
	mov %dh,%dl
	shl $8,%edx
	mov %dh,%dl

	mov %edx,(1+2)(%eax)
	mov %edx,(-1-2-4)(%eax,%ecx)
	cmp $14,%ecx
	jbe 1f

	mov %edx,(1+2+4)(%eax)
	mov %edx,(1+2+4+4)(%eax)
	mov %edx,(-1-2-4-8)(%eax,%ecx)
	mov %edx,(-1-2-4-4)(%eax,%ecx)
	cmp $30,%ecx
	jbe 1f

	mov %edx,(1+2+4+8)(%eax)
	mov %edx,(1+2+4+8+4)(%eax)
	mov %edx,(1+2+4+8+8)(%eax)
	mov %edx,(1+2+4+8+12)(%eax)
	mov %edx,(-1-2-4-8-16)(%eax,%ecx)
	mov %edx,(-1-2-4-8-12)(%eax,%ecx)
	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
	mov %edx,(-1-2-4-8-4)(%eax,%ecx)

1:	ret

2:	mov %edi,12(%esp)
	movzbl 8(%esp),%eax
	mov $0x01010101,%edx
	imul %edx,%eax

	mov %ecx,%edx
	lea -5(%ecx),%ecx
	mov 4(%esp),%edi
	shr $2,%ecx

	mov %eax,(%edi)
	mov %eax,-8(%edi,%edx)
	mov %eax,-4(%edi,%edx)
	add $4,%edi
	and $-4,%edi
	rep
	stosl
	mov 4(%esp),%eax
	ret
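
The address arithmetic of the rep-stosl path above can be checked with a
small C model. This is only a sketch: large_memset and store4 are
hypothetical names, the pointer-to-integer cast assumes a flat address
space, and register save/restore is not modeled at all. It shows why one
head store, two tail stores, and (n-5)>>2 aligned word stores cover the
whole range without writing past the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unaligned 4-byte store. */
static void store4(unsigned char *p, uint32_t v) { memcpy(p, &v, 4); }

/* Model of the draft's path for n > 62 (smaller n is not handled). */
static void *large_memset(void *dest, int c, size_t n)
{
	unsigned char *s = dest;
	uint32_t w = (unsigned char)c * UINT32_C(0x01010101);

	store4(s, w);            /* mov %eax,(%edi) */
	store4(s + n - 8, w);    /* mov %eax,-8(%edi,%edx) */
	store4(s + n - 4, w);    /* mov %eax,-4(%edi,%edx) */

	/* add $4,%edi; and $-4,%edi: the next 4-byte boundary strictly
	 * above s, so 1 to 4 head bytes are skipped. */
	unsigned char *p = (unsigned char *)(((uintptr_t)s + 4) & ~(uintptr_t)3);
	size_t words = (n - 5) >> 2;   /* lea -5(%ecx),%ecx; shr $2,%ecx */

	while (words--) {        /* rep stosl */
		store4(p, w);
		p += 4;
	}
	return dest;
}

/* Try every alignment residue and a range of sizes, with guard bytes. */
static void check_large_memset(void)
{
	for (size_t off = 0; off < 4; off++) {
		for (size_t n = 63; n <= 200; n++) {
			unsigned char buf[208];
			memset(buf, 0xEE, sizeof buf);
			large_memset(buf + 4 + off, 0x5A, n);
			for (size_t i = 0; i < sizeof buf; i++) {
				unsigned char want =
					(i >= 4 + off && i < 4 + off + n) ? 0x5A : 0xEE;
				assert(buf[i] == want);
			}
		}
	}
}
```

The aligned stores end no later than s+n-1 and no earlier than s+n-7, so
the two tail stores always close the remaining gap.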

* Re: Draft of improved memset.s for i386
@ 2015-02-24  3:02 ` Denys Vlasenko
From: Denys Vlasenko @ 2015-02-24  3:02 UTC
  To: Rich Felker; +Cc: musl

On Tue, Feb 24, 2015 at 2:09 AM, Rich Felker <dalias@libc.org> wrote:
>   mov %edi,12(%esp)

Shouldn't this be "mov 12(%esp),%edi"?
It's a load of dst pointer from stack, right?

>   mov $0x01010101,%edx
>   imul %edx,%eax

I think you can just use "imul $0x01010101,%eax" instead.

(We can't use this form of imul in 64-bit code
since its immediate operand can't be 64-bit wide).
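
To make the immediate-width point concrete, here is a small C sketch.
The function names are hypothetical, and fits_simm32 simply encodes the
fact that x86-64 imul takes at most a 32-bit immediate, sign-extended to
64 bits. The 32-bit multiplier fits that form, while the 64-bit
multiplier needed to fill a full 64-bit register does not.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit broadcast: the multiplier 0x01010101 fits in 32 bits. */
static uint32_t broadcast32(unsigned char c)
{
	return c * UINT32_C(0x01010101);
}

/* 64-bit broadcast: needs the 64-bit multiplier 0x0101010101010101. */
static uint64_t broadcast64(unsigned char c)
{
	return c * UINT64_C(0x0101010101010101);
}

/* Returns 1 if v is representable as a sign-extended 32-bit immediate. */
static int fits_simm32(uint64_t v)
{
	return v <= UINT64_C(0x7FFFFFFF) || v >= UINT64_C(0xFFFFFFFF80000000);
}

static void check_imm(void)
{
	assert(fits_simm32(UINT64_C(0x01010101)));
	assert(!fits_simm32(UINT64_C(0x0101010101010101)));
}
```

In 64-bit code the constant therefore has to be loaded into a register
first, as the draft already does with mov $0x01010101,%edx.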


* Re: Draft of improved memset.s for i386
@ 2015-02-24  3:06   ` Denys Vlasenko
From: Denys Vlasenko @ 2015-02-24  3:06 UTC
  To: Rich Felker; +Cc: musl

On Tue, Feb 24, 2015 at 4:02 AM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Tue, Feb 24, 2015 at 2:09 AM, Rich Felker <dalias@libc.org> wrote:
>>   mov %edi,12(%esp)
>
> Shouldn't this be "mov 12(%esp),%edi"?
> It's a load of dst pointer from stack, right?

Erm... no it is not, 12(%esp) is size param.

Looks like this insn serves no purpose?
This will simply trash size param on stack,
since %edi is not initialized.


* Re: Draft of improved memset.s for i386
@ 2015-02-24  3:18     ` Rich Felker
From: Rich Felker @ 2015-02-24  3:18 UTC
  To: Denys Vlasenko; +Cc: musl

On Tue, Feb 24, 2015 at 04:06:06AM +0100, Denys Vlasenko wrote:
> On Tue, Feb 24, 2015 at 4:02 AM, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
> > On Tue, Feb 24, 2015 at 2:09 AM, Rich Felker <dalias@libc.org> wrote:
> >>   mov %edi,12(%esp)
> >
> > Shouldn't this be "mov 12(%esp),%edi"?
> > It's a load of dst pointer from stack, right?
> 
> Erm... no it is not, 12(%esp) is size param.
> 
> Looks like this insn serves no purpose?
> This will simply trash size param on stack,
> since %edi is not initialized.

The purpose is saving %edi, since %edi is not call-clobbered on the
i386 ABI. I'm just storing it over top of an argument we already
loaded (argument space belongs to the callee per the ABI) instead of
adjusting the stack pointer.

Rich


* Re: Re: Draft of improved memset.s for i386
@ 2015-02-24  5:36       ` Rich Felker
From: Rich Felker @ 2015-02-24  5:36 UTC
  To: Denys Vlasenko; +Cc: musl

On Mon, Feb 23, 2015 at 10:18:11PM -0500, Rich Felker wrote:
> On Tue, Feb 24, 2015 at 04:06:06AM +0100, Denys Vlasenko wrote:
> > On Tue, Feb 24, 2015 at 4:02 AM, Denys Vlasenko
> > <vda.linux@googlemail.com> wrote:
> > > On Tue, Feb 24, 2015 at 2:09 AM, Rich Felker <dalias@libc.org> wrote:
> > >>   mov %edi,12(%esp)
> > >
> > > Shouldn't this be "mov 12(%esp),%edi"?
> > > It's a load of dst pointer from stack, right?
> > 
> > Erm... no it is not, 12(%esp) is size param.
> > 
> > Looks like this insn serves no purpose?
> > This will simply trash size param on stack,
> > since %edi is not initialized.
> 
> The purpose is saving %edi, since %edi is not call-clobbered on the
> i386 ABI. I'm just storing it over top of an argument we already
> loaded (argument space belongs to the callee per the ABI) instead of
> adjusting the stack pointer.

But I'm missing the instruction to restore it before return.

Based on your comment about an immediate in the imul, it might make
more sense just to load the dest addr into edx rather than edi, then
schedule the push of %edi at a reasonable place and the pop at the end.
The add $4,%edi can then be replaced with a lea 4(%edx),%edi.

Rich
