mailing list of musl libc
* Re: musl libc, memcpy
       [not found] <CAE_DJ110cVdicu-wPe_Ndeg9ih+g3AXZ_hNoGgX+DftJm6q=mA@mail.gmail.com>
@ 2012-07-30 20:41 ` Rich Felker
  2012-07-31  1:25   ` Luca Barbato
       [not found]   ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
  0 siblings, 2 replies; 6+ messages in thread
From: Rich Felker @ 2012-07-30 20:41 UTC (permalink / raw)
  To: Kim Walisch; +Cc: musl

Hi,

I'm replying with the list CC'd so others can comment too. Sorry I
haven't gotten a chance to try this code or review it in detail yet.
What follows is a short initial commentary but I'll give it some more
attention soon.

On Sun, Jul 29, 2012 at 11:41:47AM +0200, Kim Walisch wrote:
> Hi,
> 
> I have been reading through several libc implementations on the
> internet for the past few days and for fun I have written a fast yet
> portable memcpy implementation. It uses more code than your
> implementation but I do not think it is bloated. Some quick benchmarks
> that I ran on my Intel Core i5-670 3.46GHz (Red Hat 6.2 x86_64)
> indicate that my implementation runs about 50 percent faster than yours
> for aligned data and up to 10 times faster for unaligned data using
> gcc-4.7. The Intel C compiler even vectorizes the main copying loop
> using SSE instructions (if compiled with icc -O2 -xHost) which gives a
> performance better than glibc's memcpy on my system. I would be happy
> to hear your opinion about my memcpy implementation.

I'd like to know what block sizes you were looking at, because for
memcpy that makes all the difference in the world:

For very small blocks (down to 1 byte), performance will be dominated
by conditional branches picking what to do.

For very large blocks (larger than cache), performance will be
memory-bound and even byte-at-a-time copying might be competitive.

Theoretically, there's only a fairly small range of sizes where the
algorithm used matters a lot.
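A throwaway harness for this (untested sketch; bench_copy and the
buffer size are made up, not from the thread) fixes a total byte
budget and varies only the block size, so the three regimes show up
as different tick counts:

```c
#include <string.h>
#include <time.h>

enum { BUF = 1 << 20 };
static char src_buf[BUF], dst_buf[BUF];

/* Copy the same total byte budget at several block sizes.  Small
 * blocks are dominated by per-call branching, blocks larger than
 * cache by memory bandwidth; only the middle range really exercises
 * the copy algorithm itself. */
static long bench_copy(size_t bs, size_t total)
{
	clock_t t0 = clock();
	for (size_t done = 0; done < total; done += bs)
		memcpy(dst_buf, src_buf, bs);
	return (long)(clock() - t0);
}
```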

> /* CPU architectures that support fast unaligned memory access */
> #if defined(__i386) || defined(__x86_64)
> # define UNALIGNED_MEMORY_ACCESS
> #endif

I don't think this is necessary or useful. If we want better
performance on these archs, a tiny asm file that does almost nothing
but "rep movsd" is known to be the fastest solution on 32-bit x86, and
is at least the second-fastest on 64-bit, with the faster solutions
not being available on all cpus. On pretty much all other archs,
unaligned access is illegal.

> static void *internal_memcpy_uintptr(void *dest, const void *src, size_t n)
> {
> 	char *d = (char*) dest;
> 	const char *s = (const char*) src;
> 	size_t bytes_iteration = sizeof(uintptr_t) * 8;
> 
> 	while (n >= bytes_iteration)
> 	{
> 		((uintptr_t*)d)[0] = ((const uintptr_t*)s)[0];
> 		((uintptr_t*)d)[1] = ((const uintptr_t*)s)[1];
> 		((uintptr_t*)d)[2] = ((const uintptr_t*)s)[2];
> 		((uintptr_t*)d)[3] = ((const uintptr_t*)s)[3];
> 		((uintptr_t*)d)[4] = ((const uintptr_t*)s)[4];
> 		((uintptr_t*)d)[5] = ((const uintptr_t*)s)[5];
> 		((uintptr_t*)d)[6] = ((const uintptr_t*)s)[6];
> 		((uintptr_t*)d)[7] = ((const uintptr_t*)s)[7];
> 		d += bytes_iteration;
> 		s += bytes_iteration;
> 		n -= bytes_iteration;
> 	}

This is just manual loop unrolling, no? GCC should do the equivalent
if you ask it to aggressively unroll loops, including the
vectorization; if not, that seems like a GCC bug.
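In other words, the plain loop below (sketch; word_copy is a stand-in
name, and like the quoted code it assumes word-aligned pointers and
copies only whole words) should get the same treatment from gcc -O3,
or -O2 with -funroll-loops and -ftree-vectorize:

```c
#include <stddef.h>
#include <stdint.h>

/* Same loop as the quoted code, minus the manual 8x unrolling;
 * the compiler's unroller/vectorizer should reproduce it. */
static void *word_copy(void *dest, const void *src, size_t n)
{
	uintptr_t *d = dest;
	const uintptr_t *s = src;
	while (n >= sizeof(uintptr_t)) {
		*d++ = *s++;
		n -= sizeof(uintptr_t);
	}
	return dest;
}
```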

Rich


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Re: musl libc, memcpy
  2012-07-30 20:41 ` musl libc, memcpy Rich Felker
@ 2012-07-31  1:25   ` Luca Barbato
       [not found]   ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
  1 sibling, 0 replies; 6+ messages in thread
From: Luca Barbato @ 2012-07-31  1:25 UTC (permalink / raw)
  To: musl

On 07/30/2012 10:41 PM, Rich Felker wrote:
> Hi,
> 
> I'm replying with the list CC'd so others can comment too. Sorry I
> haven't gotten a chance to try this code or review it in detail yet.
> What follows is a short initial commentary but I'll give it some more
> attention soon.

The mem* and string functions are implemented as plugins in recent
glibc, and they use platform-optimized versions; not sure if it is
worth considering that for musl.

iirc gcc and clang have builtins for some of those functions as well.

lu

-- 

Luca Barbato
Gentoo/linux
http://dev.gentoo.org/~lu_zero




* Re: musl libc, memcpy
       [not found]   ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
@ 2012-08-01  4:27     ` Rich Felker
  2012-08-01  5:40       ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2012-08-01  4:27 UTC (permalink / raw)
  To: Kim Walisch; +Cc: musl

On Tue, Jul 31, 2012 at 12:19:13AM +0200, Kim Walisch wrote:
> > I'd like to know what block sizes you were looking at, because for
> > memcpy that makes all the difference in the world:
> 
> I copied blocks of 16 kilobytes.

OK, that sounds (off-hand) like a good size for testing.

> > I don't think this is necessary or useful. If we want better
> > performance on these archs, a tiny asm file that does almost nothing
> > but "rep movsd" is known to be the fastest solution on 32-bit x86, and
> > is at least the second-fastest on 64-bit, with the faster solutions
> > not being available on all cpus. On pretty much all other archs,
> > unaligned access is illegal.
> 
> My point is that your code uses byte (char) copying for unaligned data
> but on x86 this is not necessary. Using a simple macro in your memcpy
> implementation that always uses the size_t copying path for x86 speeds
> up your memcpy implementation by about 500% for unaligned data on my
> PC (Intel i5-670 3.46GHz, gcc-4.7, SL Linux 6.2 x86_64). You can also
> use a separate asm file with "rep movsd" for x86, I guess it will run
> at the same speed as my macro solution.
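Something like this sketch of that idea (hypothetical; the actual
macro wasn't posted, and memcpy-per-word is used here to keep the
unaligned accesses well-defined in C while still compiling to single
load/store instructions):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On i386/x86_64 unaligned word access is cheap and legal, so take
 * the word-at-a-time path regardless of pointer alignment. */
#if defined(__i386__) || defined(__x86_64__)
#define FAST_UNALIGNED 1
#else
#define FAST_UNALIGNED 0
#endif

static void *copy_fwd(void *dest, const void *src, size_t n)
{
	char *d = dest;
	const char *s = src;
	if (FAST_UNALIGNED ||
	    (((uintptr_t)d | (uintptr_t)s) % sizeof(size_t)) == 0) {
		while (n >= sizeof(size_t)) {
			memcpy(d, s, sizeof(size_t)); /* one (possibly unaligned) word */
			d += sizeof(size_t);
			s += sizeof(size_t);
			n -= sizeof(size_t);
		}
	}
	while (n--)
		*d++ = *s++; /* tail, or whole copy on strict-alignment archs */
	return dest;
}
```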

I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
version. I'd be interested in hearing how it performs.

> Another interesting thing to mention is that gcc-4.5 vectorizes the 3
> copying loops of your memcpy implementation if it is compiled with the
> -ftree-vectorize flag (add -ftree-vectorizer-verbose=1 for
> vectorization report) but not if simply compiled with -O2 or -O3. With

Odd, the gcc manual claims -ftree-vectorize is included in -O3:

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

> $ gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=1 memcpy.c main.c -o memcpy
> 
> memcpy.c:25: note: created 1 versioning for alias checks.
> memcpy.c:25: note: LOOP VECTORIZED.
> memcpy.c:21: note: created 1 versioning for alias checks.
> memcpy.c:21: note: LOOP VECTORIZED.
> memcpy.c:9: note: vectorized 2 loops in function.

From the sound of those notes, I suspect duplicate code (and wasteful
conditional branches) are getting generated to handle the possibility
that the source and destination pointers might alias. I think this
means it would be a good idea to add proper use of "restrict" pointers
(per C99 requirements) in musl sooner rather than later; it might both
reduce code size and improve performance.
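C99 already declares memcpy with restrict-qualified parameters; a copy
loop written that way (copy_restrict is a placeholder name, not musl
code) lets the compiler assume the buffers don't overlap and drop the
runtime alias check:

```c
#include <stddef.h>

/* With restrict, the vectorizer need not emit the "versioning for
 * alias" branch seen in the notes above. */
static void *copy_restrict(void *restrict dest,
                           const void *restrict src, size_t n)
{
	char *restrict d = dest;
	const char *restrict s = src;
	for (size_t i = 0; i < n; i++)
		d[i] = s[i];
	return dest;
}
```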

Rich



* Re: Re: musl libc, memcpy
  2012-08-01  4:27     ` Rich Felker
@ 2012-08-01  5:40       ` Rich Felker
  2012-08-01  6:19         ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2012-08-01  5:40 UTC (permalink / raw)
  To: musl; +Cc: Kim Walisch

[-- Attachment #1: Type: text/plain, Size: 226 bytes --]

On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
> I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
> version. I'd be interested in hearing how it performs.

And here is the attachment...

Rich

[-- Attachment #2: memcpy.s --]
[-- Type: text/plain, Size: 302 bytes --]

.global xmemcpy
.type xmemcpy,@function
xmemcpy:
	push %esi
	push %edi
	mov 12(%esp),%edi
	mov 16(%esp),%esi
	mov 20(%esp),%ecx
	mov %edi,%eax
	mov %ecx,%edx
	shr $2,%ecx
	rep
	movsl
	and $3,%edx
	jz 1f
2:	mov (%esi),%cl
	mov %cl,(%edi)
	inc %esi
	inc %edi
	dec %edx
	jnz 2b
1:	pop %edi
	pop %esi
	ret


* Re: Re: musl libc, memcpy
  2012-08-01  5:40       ` Rich Felker
@ 2012-08-01  6:19         ` Rich Felker
  2012-08-03 23:22           ` John Spencer
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2012-08-01  6:19 UTC (permalink / raw)
  To: musl; +Cc: Kim Walisch

[-- Attachment #1: Type: text/plain, Size: 417 bytes --]

On Wed, Aug 01, 2012 at 01:40:11AM -0400, Rich Felker wrote:
> On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
> > I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
> > version. I'd be interested in hearing how it performs.
> 
> And here is the attachment...

And here's a version that might be faster; reportedly, rep movsd works
better when the destination address is aligned.

Rich

[-- Attachment #2: memcpy.s --]
[-- Type: text/plain, Size: 341 bytes --]

.global xmemcpy
.type xmemcpy,@function
xmemcpy:
	push %esi
	push %edi
	mov 12(%esp),%edi
	mov 16(%esp),%esi
	mov 20(%esp),%ecx
	mov %edi,%eax
	cmp $4,%ecx
	jc 1f
	test $3,%edi
	jz 1f
2:	movsb
	dec %ecx
	test $3,%edi
	jnz 2b
1:	mov %ecx,%edx
	shr $2,%ecx
	rep
	movsl
	and $3,%edx
	jz 1f
2:	movsb
	dec %edx
	jnz 2b
1:	pop %edi
	pop %esi
	ret


* Re: Re: musl libc, memcpy
  2012-08-01  6:19         ` Rich Felker
@ 2012-08-03 23:22           ` John Spencer
  0 siblings, 0 replies; 6+ messages in thread
From: John Spencer @ 2012-08-03 23:22 UTC (permalink / raw)
  To: musl

I've set up a performance test ( https://github.com/rofl0r/memcpy-test )

These are the average results for i386 (100 runs on big sizes, 10000 on
smaller ones):

                 asm version    current c-version
size: 3         172 ticks       199 ticks
size: 4         167 ticks       167 ticks
size: 5         197 ticks       186 ticks
size: 8         187 ticks       186 ticks
size: 15        195 ticks       196 ticks
size: 16        186 ticks       185 ticks
size: 23        202 ticks       199 ticks
size: 24        193 ticks       188 ticks
size: 25        205 ticks       212 ticks
size: 31        199 ticks       198 ticks
size: 32        195 ticks       192 ticks
size: 33        204 ticks       192 ticks
size: 63        213 ticks       255 ticks
size: 64        219 ticks       226 ticks
size: 65        208 ticks       238 ticks
size: 95        220 ticks       247 ticks
size: 96        214 ticks       239 ticks
size: 97        217 ticks       243 ticks
size: 127       233 ticks       261 ticks
size: 128       225 ticks       254 ticks
size: 129       229 ticks       266 ticks
size: 159       242 ticks       279 ticks
size: 160       235 ticks       268 ticks
size: 161       238 ticks       273 ticks
size: 191       255 ticks       288 ticks
size: 192       264 ticks       288 ticks
size: 193       248 ticks       287 ticks
size: 255       279 ticks       323 ticks
size: 256       266 ticks       313 ticks
size: 257       269 ticks       319 ticks
size: 383       332 ticks       391 ticks
size: 384       308 ticks       370 ticks
size: 385       307 ticks       384 ticks
size: 511       345 ticks       439 ticks
size: 512       315 ticks       434 ticks
size: 513       318 ticks       439 ticks
size: 767       370 ticks       571 ticks
size: 768       330 ticks       555 ticks
size: 769       334 ticks       566 ticks
size: 1023      382 ticks       740 ticks
size: 1024      349 ticks       727 ticks
size: 1025      358 ticks       694 ticks
size: 1535      423 ticks       936 ticks
size: 1536      393 ticks       930 ticks
size: 1537      400 ticks       929 ticks
size: 2048      448 ticks       1176 ticks
size: 4096      822 ticks       2404 ticks
size: 8192      3136 ticks      8310 ticks
size: 16384     6481 ticks      9780 ticks
size: 32768     11645 ticks     19060 ticks
size: 65536     29700 ticks     52051 ticks
size: 131072    307029 ticks    310875 ticks
size: 262144    608502 ticks    617698 ticks
size: 524288    1222116 ticks   1244987 ticks
size: 1048576   2500207 ticks   2712991 ticks
size: 2097152   5279016 ticks   5566665 ticks
size: 4194304   10586333 ticks  10849110 ticks
size: 8388608   21961730 ticks  22473953 ticks
size: 16777216  45966254 ticks  47159258 ticks
size: 33554432  92434464 ticks  95873868 ticks
size: 67108864  189858530 ticks 190456107 ticks

It looks as if the asm version is up to twice as fast, depending on
the size of the data copied.
Now waiting for the x86_64 version (if you could provide a working
64-bit rdtsc inline asm function, I'll gladly take that as well).

Someone on ##asm suggested that movaps with xmm regs was fastest in
his tests; it would be interesting to test such a version as well.
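For the record, a 64-bit rdtsc wrapper could look like this (untested
sketch, GCC-style inline asm, x86_64 only):

```c
#include <stdint.h>

/* rdtsc leaves the low half of the TSC in eax and the high half in
 * edx even in 64-bit mode, so the two 32-bit halves must be
 * recombined by hand rather than read as one 64-bit register. */
static inline uint64_t rdtsc64(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}
```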

On 08/01/2012 08:19 AM, Rich Felker wrote:
> On Wed, Aug 01, 2012 at 01:40:11AM -0400, Rich Felker wrote:
>> On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
>>> I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
>>> version. I'd be interested in hearing how it performs.
>> And here is the attachment...
> And here's a version that might be faster; reportedly, rep movsd works
> better when the destination address is aligned.
>
> Rich




end of thread, other threads:[~2012-08-03 23:22 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAE_DJ110cVdicu-wPe_Ndeg9ih+g3AXZ_hNoGgX+DftJm6q=mA@mail.gmail.com>
2012-07-30 20:41 ` musl libc, memcpy Rich Felker
2012-07-31  1:25   ` Luca Barbato
     [not found]   ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
2012-08-01  4:27     ` Rich Felker
2012-08-01  5:40       ` Rich Felker
2012-08-01  6:19         ` Rich Felker
2012-08-03 23:22           ` John Spencer

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
