From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Optimized C memcpy [updated]
Date: Sun, 11 Aug 2013 04:13:13 -0400
Message-ID: <20130811081312.GY221@brightrain.aerifal.cx>
References: <20130807182123.GA17670@brightrain.aerifal.cx> <20130811051135.GW221@brightrain.aerifal.cx> <20130811062009.GX221@brightrain.aerifal.cx>
In-Reply-To: <20130811062009.GX221@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

On Sun, Aug 11, 2013 at 02:20:10AM -0400, Rich Felker wrote:
> On Sun, Aug 11, 2013 at 01:11:35AM -0400, Rich Felker wrote:
> > struct block32 { uint32_t data[8]; };
> > struct block64 { uint64_t data[8]; };
> >
> > void *memcpy(void *restrict dest, const void *restrict src, size_t n)
> > {
> > 	unsigned char *d = dest;
> > 	const unsigned char *s = src;
> > 	uint32_t w, x;
> >
> > 	for (; (uintptr_t)s % 8 && n; n--) *d++ = *s++;
> > 	if (!n) return dest;
> >
> > 	if (n>=4) switch ((uintptr_t)d % 4) {
> > 	case 0:
> > 		if (!((uintptr_t)d%8)) for (; n>=64; s+=64, d+=64, n-=64)
> > 			*(struct block64 *)d = *(struct block64 *)s;
>
> Unfortunately this case seems to be compiling to a call to memcpy on
> powerpc (but nowhere else I found). So I may need to drop the special
> case for 64-bit alignment. I wish there was some source for knowledge
> of the cases that can trigger gcc's stupidity, though...

It turns out mips at certain optimization levels is also generating a
memcpy for the structure assignments. I think I just need to drop all
of the structure-assignment tricks and use a mildly unrolled loop with
uint32_t units for the aligned case. This gives much worse performance
on ARM, where gcc fails to generate the proper ldmia/stmia without the
struct, but we have asm we can use for ARM anyway.

On other archs, the struct copy code does not even seem to help. The
simple integer loop works just as well.

I'll do some more experimenting and probably commit the ARM asm soon,
followed by the C code once I get some better feedback on how it
performs on real machines.

Rich
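
For readers following along, here is a rough sketch of what a "mildly unrolled loop with uint32_t units for the aligned case" might look like. It is not the code that was later committed. The helper name copy_aligned32, the unroll factor of four words, and the u32 typedef are all assumptions made for illustration, and the may_alias attribute is a GCC/Clang extension used here to keep the word accesses clear of strict-aliasing problems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (an assumption of this sketch): allows u32
 * accesses to alias objects of any type. */
typedef uint32_t __attribute__((__may_alias__)) u32;

/* Hypothetical helper, not taken from the thread: copy n bytes when
 * both dest and src are 4-byte aligned. */
static void *copy_aligned32(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Precondition of this sketch: both pointers are word-aligned. */
	assert((uintptr_t)d % 4 == 0 && (uintptr_t)s % 4 == 0);

	/* Main loop: four 32-bit words (16 bytes) per iteration.
	 * All loads are issued before the stores. */
	for (; n >= 16; s += 16, d += 16, n -= 16) {
		u32 a = ((const u32 *)s)[0];
		u32 b = ((const u32 *)s)[1];
		u32 c = ((const u32 *)s)[2];
		u32 e = ((const u32 *)s)[3];
		((u32 *)d)[0] = a;
		((u32 *)d)[1] = b;
		((u32 *)d)[2] = c;
		((u32 *)d)[3] = e;
	}

	/* Remaining whole words, one at a time. */
	for (; n >= 4; s += 4, d += 4, n -= 4)
		*(u32 *)d = *(const u32 *)s;

	/* Trailing 0 to 3 bytes. */
	while (n--) *d++ = *s++;

	return dest;
}
```

Unlike the struct-assignment version quoted above, the copy here is spelled out as individual word loads and stores, so there is no aggregate assignment for the compiler to lower into a library call. Whether a given gcc version and optimization level leaves the loop alone would still need checking on each target.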