From: Rich Felker
To: Kim Walisch
Cc: musl@lists.openwall.com
Subject: Re: musl libc, memcpy
Date: Mon, 30 Jul 2012 16:41:00 -0400
Message-ID: <20120730204100.GY544@brightrain.aerifal.cx>

Hi,

I'm replying with the list CC'd so others can comment too. Sorry I
haven't gotten a chance to try this code or review it in detail yet.
What follows is some short initial commentary, but I'll give it more
attention soon.

On Sun, Jul 29, 2012 at 11:41:47AM +0200, Kim Walisch wrote:
> Hi,
>
> I have been reading through several libc implementations on the
> internet for the past days, and for fun I have written a fast yet
> portable memcpy implementation. It uses more code than your
> implementation, but I do not think it is bloated. Some quick
> benchmarks that I ran on my Intel Core i5-670 3.46GHz (Red Hat 6.2
> x86_64) indicate that my implementation runs about 50 percent faster
> than yours for aligned data and up to 10 times faster for unaligned
> data using gcc-4.7. The Intel C compiler even vectorizes the main
> copying loop using SSE instructions (if compiled with icc -O2
> -xHost), which gives performance better than glibc's memcpy on my
> system. I would be happy to hear your opinion about my memcpy
> implementation.

I'd like to know what block sizes you were looking at, because for
memcpy that makes all the difference in the world (a rough sketch of
the kind of measurement I mean follows below):

For very small blocks (down to 1 byte), performance will be dominated
by the conditional branches picking what to do. For very large blocks
(larger than cache), performance will be memory-bound and even
byte-at-a-time copying might be competitive. Theoretically, there's
only a fairly small range of sizes where the algorithm used matters
a lot.

> /* CPU architectures that support fast unaligned memory access */
> #if defined(__i386) || defined(__x86_64)
> # define UNALIGNED_MEMORY_ACCESS
> #endif

I don't think this is necessary or useful. If we want better
performance on these archs, a tiny asm file that does almost nothing
but "rep movsd" is known to be the fastest solution on 32-bit x86,
and is at least the second-fastest on 64-bit, with the faster
solutions not being available on all cpus. On pretty much all other
archs, unaligned access is illegal.
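Something like this is what I have in mind (untested sketch, written
as gcc inline asm rather than a separate .s file; memcpy_rep is just
a placeholder name):

#include <stddef.h>

/* 32-bit x86 sketch: bulk copy with "rep movsl" (Intel syntax:
 * rep movsd), then "rep movsb" for the 0-3 trailing bytes. */
static void *memcpy_rep(void *dest, const void *src, size_t n)
{
	void *ret = dest;
	size_t words = n >> 2;
	size_t tail = n & 3;
	__asm__ __volatile__(
		"rep movsl\n\t"     /* copy n/4 dwords */
		"mov %3, %%ecx\n\t" /* then the leftover byte count */
		"rep movsb"
		: "+D"(dest), "+S"(src), "+c"(words)
		: "r"(tail)
		: "memory");
	return ret;
}

The same idea extends to x86_64 with "rep movsq" for 8-byte units,
but as noted above, the faster SSE-based variants there are
cpu-dependent.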
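And to make the earlier block-size point concrete, the kind of
measurement I mean is something like this rough, untested harness
(the buffer size and block-size list are arbitrary; on a glibc as old
as RHEL 6's you'd compile with something like gcc -O2 bench.c -lrt
for clock_gettime):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF (1 << 24)	/* 16 MiB, bigger than typical caches */

static unsigned char srcbuf[BUF], dstbuf[BUF];

int main(void)
{
	size_t sizes[] = { 1, 8, 64, 512, 4096, 65536, 1 << 20 };
	size_t i, j;

	/* touch the buffers once so page faults don't pollute the timing */
	memset(srcbuf, 1, BUF);
	memset(dstbuf, 1, BUF);

	for (i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
		size_t n = sizes[i];
		struct timespec t0, t1;
		double dt;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		/* stride through the whole buffer in n-byte blocks */
		for (j = 0; j + n <= BUF; j += n)
			memcpy(dstbuf + j, srcbuf + j, n);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
		printf("%8zu-byte blocks: %.0f MB/s\n", n, BUF / dt / 1e6);
	}
	return 0;
}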
> static void *internal_memcpy_uintptr(void *dest, const void *src, size_t n)
> {
>     char *d = (char*) dest;
>     const char *s = (const char*) src;
>     size_t bytes_iteration = sizeof(uintptr_t) * 8;
>
>     while (n >= bytes_iteration)
>     {
>         ((uintptr_t*)d)[0] = ((const uintptr_t*)s)[0];
>         ((uintptr_t*)d)[1] = ((const uintptr_t*)s)[1];
>         ((uintptr_t*)d)[2] = ((const uintptr_t*)s)[2];
>         ((uintptr_t*)d)[3] = ((const uintptr_t*)s)[3];
>         ((uintptr_t*)d)[4] = ((const uintptr_t*)s)[4];
>         ((uintptr_t*)d)[5] = ((const uintptr_t*)s)[5];
>         ((uintptr_t*)d)[6] = ((const uintptr_t*)s)[6];
>         ((uintptr_t*)d)[7] = ((const uintptr_t*)s)[7];
>         d += bytes_iteration;
>         s += bytes_iteration;
>         n -= bytes_iteration;
>     }

This is just manual loop unrolling, no? GCC should do the equivalent
if you ask it to aggressively unroll loops, including the
vectorization; if not, that seems like a GCC bug.
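For comparison, the plain word-at-a-time loop I mean is just this
(untested sketch; copy_words is a placeholder name, and whether a
given gcc version actually unrolls/vectorizes it is something to
verify, not a promise):

#include <stdint.h>
#include <stddef.h>

/* Aligned word-at-a-time core only. With gcc -O3 -funroll-loops
 * (-O3 already enables -ftree-vectorize) this should come out much
 * like the hand-unrolled version quoted above. */
static void *copy_words(void *dest, const void *src, size_t n)
{
	uintptr_t *d = dest;
	const uintptr_t *s = src;
	size_t words = n / sizeof(uintptr_t);

	while (words--)
		*d++ = *s++;
	return dest;
}

Rich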