From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Draft of improved memset.s for i386
Date: Mon, 23 Feb 2015 20:09:52 -0500
Message-ID: <20150224010952.GA10683@brightrain.aerifal.cx>
To: musl@lists.openwall.com
Cc: Denys Vlasenko
Reply-To: musl@lists.openwall.com

Here's a draft of an improved i386 memset.s based on the principles Denys
Vlasenko and I discussed for his and my x86_64 versions. Compared to the
current code, it reduces entry/exit overhead, increases the length supported
by the non-rep-stosl path, and aligns the destination for rep stosl.

My tests don't measure the misalignment penalty, but even in the aligned case
the rep-stosl path is slightly faster (~5 cycles per run, out of at least 64
cycles), and the non-rep-stosl path is significantly faster (e.g. 33 vs 51
cycles at size 16 and 40 vs 57 at size 32).

Empirically, the byte-register-access/left-shift method of extending the fill
value to a word performs better than imul for me, but the margin is very
small (at most 1 cycle). Since we support much older cpus (like actual 486)
where imul could be really slow, I think this is the right approach in
principle too. I used imul in the rep-stosl path but haven't tested whether
it's faster there.

The non-rep-stosl path only goes up to size 62. I think sizes up to 126 could
benefit from it, but the string of stores was getting really long.

Correctness has not been tested so there may be stupid bugs.
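As a rough illustration of the two fill-extension approaches compared above,
here is a small C sketch (not part of the draft; the function names are mine)
contrasting the net effect of the shift-based broadcast with the imul-style
multiply by 0x01010101. It shows the resulting 32-bit fill word, not the exact
instruction sequence the asm uses:

#include <stdint.h>
#include <stdio.h>

/* Shift-based broadcast: the same net effect as the byte-register/left-shift
 * sequence in the asm, 0x000000cc -> 0xcccccccc. */
static uint32_t broadcast_shift(unsigned char c)
{
	uint32_t x = c;
	x |= x << 8;	/* 0x0000cccc */
	x |= x << 16;	/* 0xcccccccc */
	return x;
}

/* Multiply-based broadcast, as in the rep-stosl path's imul. */
static uint32_t broadcast_imul(unsigned char c)
{
	return c * 0x01010101u;
}

int main(void)
{
	for (int c = 0; c < 256; c++)
		if (broadcast_shift(c) != broadcast_imul(c))
			return 1;
	printf("both methods agree: 0x5a -> %#x\n", broadcast_shift(0x5a));
	return 0;
}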
Rich

[attachment: memset-draft.s]

.global memset
.type memset,@function
memset:
	mov 12(%esp),%ecx	# ecx = n
	cmp $62,%ecx
	ja 2f			# large sizes take the rep-stosl path

	movzbl 8(%esp),%edx	# edx = fill byte, zero-extended
	mov 4(%esp),%eax	# eax = dest (also the return value)
	test %ecx,%ecx
	jz 1f

	mov %dl,%dh		# replicate fill byte into low 16 bits

	mov %dl,(%eax)		# byte stores cover the first and last byte
	mov %dl,-1(%eax,%ecx)
	cmp $2,%ecx
	jbe 1f

	mov %dx,1(%eax)		# 16-bit stores, overlapping from both ends
	mov %dx,(-1-2)(%eax,%ecx)
	cmp $6,%ecx
	jbe 1f

	shl $8,%edx		# extend the 16-bit pattern to 32 bits
	mov %dh,%dl
	shl $8,%edx
	mov %dh,%dl

	mov %edx,(1+2)(%eax)	# 32-bit stores, working inward from both ends
	mov %edx,(-1-2-4)(%eax,%ecx)
	cmp $14,%ecx
	jbe 1f

	mov %edx,(1+2+4)(%eax)
	mov %edx,(1+2+4+4)(%eax)
	mov %edx,(-1-2-4-8)(%eax,%ecx)
	mov %edx,(-1-2-4-4)(%eax,%ecx)
	cmp $30,%ecx
	jbe 1f

	mov %edx,(1+2+4+8)(%eax)
	mov %edx,(1+2+4+8+4)(%eax)
	mov %edx,(1+2+4+8+8)(%eax)
	mov %edx,(1+2+4+8+12)(%eax)
	mov %edx,(-1-2-4-8-16)(%eax,%ecx)
	mov %edx,(-1-2-4-8-12)(%eax,%ecx)
	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
	mov %edx,(-1-2-4-8-4)(%eax,%ecx)
1:	ret

2:	mov %edi,12(%esp)	# save callee-saved edi in the dead n arg slot
	movzbl 8(%esp),%eax
	mov $0x01010101,%edx
	imul %edx,%eax		# broadcast fill byte to all four bytes

	mov %ecx,%edx
	lea -5(%ecx),%ecx
	mov 4(%esp),%edi
	shr $2, %ecx		# word count for rep stosl
	mov %eax,(%edi)		# unaligned head store
	mov %eax,-8(%edi,%edx)	# unaligned tail stores
	mov %eax,-4(%edi,%edx)
	add $4,%edi
	and $-4,%edi		# align the rep-stosl destination
	rep stosl
	mov 12(%esp),%edi	# restore callee-saved edi
	mov 4(%esp),%eax
	ret
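Since correctness hasn't been tested, a throwaway harness along these lines
could compare the draft against the libc memset across sizes and alignments.
The memset_draft symbol name (i.e. assembling the file with the symbol renamed
so it doesn't shadow libc's memset) is an assumption made for testing only,
not part of the posted draft:

/* Sketch of a correctness harness; assumes the draft was assembled with
 * its symbol renamed to memset_draft (hypothetical name for testing). */
#include <stdio.h>
#include <string.h>

void *memset_draft(void *, int, size_t);

int main(void)
{
	static unsigned char buf[256], ref[256];
	for (size_t align = 0; align < 4; align++)
	for (size_t n = 0; n <= 200; n++) {
		memset(buf, 0xa5, sizeof buf);
		memset(ref, 0xa5, sizeof ref);
		void *r = memset_draft(buf+align, 0xc3, n);
		memset(ref+align, 0xc3, n);
		if (r != buf+align || memcmp(buf, ref, sizeof buf)) {
			printf("FAIL align=%zu n=%zu\n", align, n);
			return 1;
		}
	}
	puts("OK");
	return 0;
}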