From: Rich Felker
To: Kim Walisch
Cc: musl@lists.openwall.com
Subject: Re: musl libc, memcpy
Date: Mon, 30 Jul 2012 16:41:00 -0400
Message-ID: <20120730204100.GY544@brightrain.aerifal.cx>

Hi,

I'm replying with the list CC'd so others can comment too. Sorry I
haven't gotten a chance to try this code or review it in detail yet.
What follows is some short initial commentary, but I'll give it more
attention soon.

On Sun, Jul 29, 2012 at 11:41:47AM +0200, Kim Walisch wrote:
> Hi,
>
> I have been reading through several libc implementations on the
> internet for the past days, and for fun I have written a fast yet
> portable memcpy implementation. It uses more code than your
> implementation, but I do not think it is bloated. Some quick
> benchmarks that I ran on my Intel Core i5-670 3.46GHz (Red Hat 6.2
> x86_64) indicate that my implementation runs about 50 percent faster
> than yours for aligned data and up to 10 times faster for unaligned
> data using gcc-4.7. The Intel C compiler even vectorizes the main
> copying loop using SSE instructions (if compiled with icc -O2
> -xHost), which gives performance better than glibc's memcpy on my
> system. I would be happy to hear your opinion about my memcpy
> implementation.

I'd like to know what block sizes you were looking at, because for
memcpy that makes all the difference in the world (a rough sketch of
the kind of measurement I mean follows below):

For very small blocks (down to 1 byte), performance will be dominated
by the conditional branches picking what to do. For very large blocks
(larger than cache), performance will be memory-bound and even
byte-at-a-time copying might be competitive. Theoretically, there's
only a fairly small range of sizes where the algorithm used matters
a lot.

> /* CPU architectures that support fast unaligned memory access */
> #if defined(__i386) || defined(__x86_64)
> # define UNALIGNED_MEMORY_ACCESS
> #endif

I don't think this is necessary or useful. If we want better
performance on these archs, a tiny asm file that does almost nothing
but "rep movsd" is known to be the fastest solution on 32-bit x86,
and is at least the second-fastest on 64-bit, with the faster
solutions not being available on all cpus. On pretty much all other
archs, unaligned access is illegal.
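Something like this is what I have in mind (untested sketch, written
as gcc inline asm rather than a separate .s file; memcpy_rep is just
a placeholder name):

#include <stddef.h>

/* 32-bit x86 sketch: bulk copy with "rep movsl" (Intel syntax:
 * rep movsd), then "rep movsb" for the 0-3 trailing bytes. */
static void *memcpy_rep(void *dest, const void *src, size_t n)
{
	void *ret = dest;
	size_t words = n >> 2;
	size_t tail = n & 3;
	__asm__ __volatile__(
		"rep movsl\n\t"     /* copy n/4 dwords */
		"mov %3, %%ecx\n\t" /* then the leftover byte count */
		"rep movsb"
		: "+D"(dest), "+S"(src), "+c"(words)
		: "r"(tail)
		: "memory");
	return ret;
}

The same idea extends to x86_64 with "rep movsq" for 8-byte units,
but as noted above, the faster SSE-based variants there are
cpu-dependent.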
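And to make the earlier block-size point concrete, the kind of
measurement I mean is something like this rough, untested harness
(the buffer size and block-size list are arbitrary; on a glibc as old
as RHEL 6's you'd compile with something like gcc -O2 bench.c -lrt
for clock_gettime):

#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF (1 << 24)	/* 16 MiB, bigger than typical caches */

static unsigned char srcbuf[BUF], dstbuf[BUF];

int main(void)
{
	size_t sizes[] = { 1, 8, 64, 512, 4096, 65536, 1 << 20 };
	size_t i, j;

	/* touch the buffers once so page faults don't pollute the timing */
	memset(srcbuf, 1, BUF);
	memset(dstbuf, 1, BUF);

	for (i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
		size_t n = sizes[i];
		struct timespec t0, t1;
		double dt;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		/* stride through the whole buffer in n-byte blocks */
		for (j = 0; j + n <= BUF; j += n)
			memcpy(dstbuf + j, srcbuf + j, n);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
		printf("%8zu-byte blocks: %.0f MB/s\n", n, BUF / dt / 1e6);
	}
	return 0;
}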
> static void *internal_memcpy_uintptr(void *dest, const void *src, size_t n)
> {
>     char *d = (char*) dest;
>     const char *s = (const char*) src;
>     size_t bytes_iteration = sizeof(uintptr_t) * 8;
>
>     while (n >= bytes_iteration)
>     {
>         ((uintptr_t*)d)[0] = ((const uintptr_t*)s)[0];
>         ((uintptr_t*)d)[1] = ((const uintptr_t*)s)[1];
>         ((uintptr_t*)d)[2] = ((const uintptr_t*)s)[2];
>         ((uintptr_t*)d)[3] = ((const uintptr_t*)s)[3];
>         ((uintptr_t*)d)[4] = ((const uintptr_t*)s)[4];
>         ((uintptr_t*)d)[5] = ((const uintptr_t*)s)[5];
>         ((uintptr_t*)d)[6] = ((const uintptr_t*)s)[6];
>         ((uintptr_t*)d)[7] = ((const uintptr_t*)s)[7];
>         d += bytes_iteration;
>         s += bytes_iteration;
>         n -= bytes_iteration;
>     }

This is just manual loop unrolling, no? GCC should do the equivalent
if you ask it to aggressively unroll loops, including the
vectorization; if not, that seems like a GCC bug.
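For comparison, the plain word-at-a-time loop I mean is just this
(untested sketch; copy_words is a placeholder name, and whether a
given gcc version actually unrolls/vectorizes it is something to
verify, not a promise):

#include <stdint.h>
#include <stddef.h>

/* Aligned word-at-a-time core only. With gcc -O3 -funroll-loops
 * (-O3 already enables -ftree-vectorize) this should come out much
 * like the hand-unrolled version quoted above. */
static void *copy_words(void *dest, const void *src, size_t n)
{
	uintptr_t *d = dest;
	const uintptr_t *s = src;
	size_t words = n / sizeof(uintptr_t);

	while (words--)
		*d++ = *s++;
	return dest;
}

Rich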