From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Draft of improved memset.s for i386
Date: Mon, 23 Feb 2015 20:09:52 -0500
Message-ID: <20150224010952.GA10683@brightrain.aerifal.cx>
To: musl@lists.openwall.com
Cc: Denys Vlasenko
Reply-To: musl@lists.openwall.com

Here's a draft of an improved i386 memset.s based on the principles Denys
Vlasenko and I discussed for his and my x86_64 versions. Compared to the
current code, it reduces entry/exit overhead, increases the length supported
by the non-rep-stosl path, and aligns the destination for rep stosl.

My tests don't measure the misalignment penalty, but even in the aligned case
the rep-stosl path is slightly faster (~5 cycles per run, out of at least 64
cycles), and the non-rep-stosl path is significantly faster (e.g. 33 vs 51
cycles at size 16 and 40 vs 57 at size 32).

Empirically, the byte-register-access/left-shift method of extending the fill
value to a word performs better than imul for me, but the margin is very
small (at most 1 cycle). Since we support much older cpus (like actual 486)
where imul could be really slow, I think this is the right approach in
principle too. I used imul in the rep-stosl path but haven't tested whether
it's faster there.

The non-rep-stosl path only goes up to size 62. I think sizes up to 126 could
benefit from it, but the string of stores was getting really long.

Correctness has not been tested so there may be stupid bugs.
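As a rough illustration of the two fill-extension approaches compared above,
here is a small C sketch (not part of the draft; the function names are mine)
contrasting the net effect of the shift-based broadcast with the imul-style
multiply by 0x01010101. It shows the resulting 32-bit fill word, not the exact
instruction sequence the asm uses:

#include <stdint.h>
#include <stdio.h>

/* Shift-based broadcast: the same net effect as the byte-register/left-shift
 * sequence in the asm, 0x000000cc -> 0xcccccccc. */
static uint32_t broadcast_shift(unsigned char c)
{
	uint32_t x = c;
	x |= x << 8;	/* 0x0000cccc */
	x |= x << 16;	/* 0xcccccccc */
	return x;
}

/* Multiply-based broadcast, as in the rep-stosl path's imul. */
static uint32_t broadcast_imul(unsigned char c)
{
	return c * 0x01010101u;
}

int main(void)
{
	for (int c = 0; c < 256; c++)
		if (broadcast_shift(c) != broadcast_imul(c))
			return 1;
	printf("both methods agree: 0x5a -> %#x\n", broadcast_shift(0x5a));
	return 0;
}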
Rich

[attachment: memset-draft.s]

.global memset
.type memset,@function
memset:
	mov 12(%esp),%ecx	# ecx = n
	cmp $62,%ecx
	ja 2f			# large sizes take the rep-stosl path

	movzbl 8(%esp),%edx	# edx = fill byte, zero-extended
	mov 4(%esp),%eax	# eax = dest (also the return value)
	test %ecx,%ecx
	jz 1f

	mov %dl,%dh		# replicate fill byte into low 16 bits

	mov %dl,(%eax)		# byte stores cover the first and last byte
	mov %dl,-1(%eax,%ecx)
	cmp $2,%ecx
	jbe 1f

	mov %dx,1(%eax)		# 16-bit stores, overlapping from both ends
	mov %dx,(-1-2)(%eax,%ecx)
	cmp $6,%ecx
	jbe 1f

	shl $8,%edx		# extend the 16-bit pattern to 32 bits
	mov %dh,%dl
	shl $8,%edx
	mov %dh,%dl

	mov %edx,(1+2)(%eax)	# 32-bit stores, working inward from both ends
	mov %edx,(-1-2-4)(%eax,%ecx)
	cmp $14,%ecx
	jbe 1f

	mov %edx,(1+2+4)(%eax)
	mov %edx,(1+2+4+4)(%eax)
	mov %edx,(-1-2-4-8)(%eax,%ecx)
	mov %edx,(-1-2-4-4)(%eax,%ecx)
	cmp $30,%ecx
	jbe 1f

	mov %edx,(1+2+4+8)(%eax)
	mov %edx,(1+2+4+8+4)(%eax)
	mov %edx,(1+2+4+8+8)(%eax)
	mov %edx,(1+2+4+8+12)(%eax)
	mov %edx,(-1-2-4-8-16)(%eax,%ecx)
	mov %edx,(-1-2-4-8-12)(%eax,%ecx)
	mov %edx,(-1-2-4-8-8)(%eax,%ecx)
	mov %edx,(-1-2-4-8-4)(%eax,%ecx)
1:	ret

2:	mov %edi,12(%esp)	# save callee-saved edi in the dead n arg slot
	movzbl 8(%esp),%eax
	mov $0x01010101,%edx
	imul %edx,%eax		# broadcast fill byte to all four bytes

	mov %ecx,%edx
	lea -5(%ecx),%ecx
	mov 4(%esp),%edi
	shr $2, %ecx		# word count for rep stosl
	mov %eax,(%edi)		# unaligned head store
	mov %eax,-8(%edi,%edx)	# unaligned tail stores
	mov %eax,-4(%edi,%edx)
	add $4,%edi
	and $-4,%edi		# align the rep-stosl destination
	rep stosl
	mov 12(%esp),%edi	# restore callee-saved edi
	mov 4(%esp),%eax
	ret
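Since correctness hasn't been tested, a throwaway harness along these lines
could compare the draft against the libc memset across sizes and alignments.
The memset_draft symbol name (i.e. assembling the file with the symbol renamed
so it doesn't shadow libc's memset) is an assumption made for testing only,
not part of the posted draft:

/* Sketch of a correctness harness; assumes the draft was assembled with
 * its symbol renamed to memset_draft (hypothetical name for testing). */
#include <stdio.h>
#include <string.h>

void *memset_draft(void *, int, size_t);

int main(void)
{
	static unsigned char buf[256], ref[256];
	for (size_t align = 0; align < 4; align++)
	for (size_t n = 0; n <= 200; n++) {
		memset(buf, 0xa5, sizeof buf);
		memset(ref, 0xa5, sizeof ref);
		void *r = memset_draft(buf+align, 0xc3, n);
		memset(ref+align, 0xc3, n);
		if (r != buf+align || memcmp(buf, ref, sizeof buf)) {
			printf("FAIL align=%zu n=%zu\n", align, n);
			return 1;
		}
	}
	puts("OK");
	return 0;
}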