From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: musl libc, memcpy
Date: Wed, 1 Aug 2012 00:27:22 -0400
Message-ID: <20120801042722.GB544@brightrain.aerifal.cx>
References: <20120730204100.GY544@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: Kim Walisch
Cc: musl@lists.openwall.com

On Tue, Jul 31, 2012 at 12:19:13AM +0200, Kim Walisch wrote:
> > I'd like to know what block sizes you were looking at, because for
> > memcpy that makes all the difference in the world:
>
> I copied blocks of 16 kilobytes.

OK, that sounds (off-hand) like a good size for testing.

> > I don't think this is necessary or useful.
> > If we want better
> > performance on these archs, a tiny asm file that does almost nothing
> > but "rep movsd" is known to be the fastest solution on 32-bit x86, and
> > is at least the second-fastest on 64-bit, with the faster solutions
> > not being available on all cpus. On pretty much all other archs,
> > unaligned access is illegal.
>
> My point is that your code uses byte (char) copying for unaligned data
> but on x86 this is not necessary. Using a simple macro in your memcpy
> implementation that always uses the size_t copying path for x86 speeds
> up your memcpy implementation by about 500% for unaligned data on my
> PC (Intel i5-670 3.46GHz, gcc-4.7, SL Linux 6.2 x86_64). You can also
> use a separate asm file with "rep movsd" for x86, I guess it will run
> at the same speed as my macro solution.

I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
version. I'd be interested in hearing how it performs.

> Another interesting thing to mention is that gcc-4.5 vectorizes the 3
> copying loops of your memcpy implementation if it is compiled with the
> -ftree-vectorize flag (add -ftree-vectorizer-verbose=1 for
> vectorization report) but not if simply compiled with -O2 or -O3. With

Odd, the gcc manual claims -ftree-vectorize is included in -O3:

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

> $ gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=1 memcpy.c main.c -o memcpy
>
> memcpy.c:25: note: created 1 versioning for alias checks.
> memcpy.c:25: note: LOOP VECTORIZED.
> memcpy.c:21: note: created 1 versioning for alias checks.
> memcpy.c:21: note: LOOP VECTORIZED.
> memcpy.c:9: note: vectorized 2 loops in function.

From the sound of those notes, I suspect duplicate code (and wasteful
conditional branches) are getting generated to handle the possibility
that the source and destination pointers might alias.
I think this means it would be a good idea to add proper use of
"restrict" pointers (per C99 requirements) in musl sooner rather than
later; it might both reduce code size and improve performance.

Rich
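
To illustrate the "restrict" point, here is a minimal sketch of two
byte-copy loops that differ only in the qualifier. The names copy_plain
and copy_restrict are hypothetical and this is not musl's memcpy. In the
first, the compiler has to allow for dest and src overlapping, which is
the likely reason for the "versioning for alias checks" notes; in the
second, C99 "restrict" promises they do not overlap, so the compiler is
permitted to drop that runtime check. Whether a given gcc version
actually does so is something to verify from its vectorization report.

```c
#include <stddef.h>

/* Hypothetical helper: no restrict, so the compiler must assume that
   writes through dest may change bytes later read through src. */
void copy_plain(unsigned char *dest, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i];
}

/* Hypothetical helper: same loop, but restrict tells the compiler the
   two regions do not overlap (the same contract memcpy has in C99). */
void copy_restrict(unsigned char *restrict dest,
                   const unsigned char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i];
}
```

As with memcpy itself, calling copy_restrict on overlapping buffers is
undefined behavior; the qualifier is a promise made by the caller, not
something the function checks.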