* Re: musl libc, memcpy
[not found] <CAE_DJ110cVdicu-wPe_Ndeg9ih+g3AXZ_hNoGgX+DftJm6q=mA@mail.gmail.com>
@ 2012-07-30 20:41 ` Rich Felker
2012-07-31 1:25 ` Luca Barbato
[not found] ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
0 siblings, 2 replies; 6+ messages in thread
From: Rich Felker @ 2012-07-30 20:41 UTC (permalink / raw)
To: Kim Walisch; +Cc: musl
Hi,
I'm replying with the list CC'd so others can comment too. Sorry I
haven't gotten a chance to try this code or review it in detail yet.
What follows is a short initial commentary but I'll give it some more
attention soon.
On Sun, Jul 29, 2012 at 11:41:47AM +0200, Kim Walisch wrote:
> Hi,
>
> I have been reading through several libc implementations on the
> internet for the past few days and for fun I have written a fast yet
> portable memcpy implementation. It uses more code than your
> implementation but I do not think it is bloated. Some quick benchmarks
> that I ran on my Intel Core-i5 670 3.46GHz (Red Hat 6.2 x86_64)
> indicate that my implementation runs about 50 percent faster than yours
> for aligned data and up to 10 times faster for unaligned data using
> gcc-4.7. The Intel C compiler even vectorizes the main copying loop
> using SSE instructions (if compiled with icc -O2 -xHost) which gives a
> performance better than glibc's memcpy on my system. I would be happy
> to hear your opinion about my memcpy implementation.
I'd like to know what block sizes you were looking at, because for
memcpy that makes all the difference in the world:
For very small blocks (down to 1 byte), performance will be dominated
by conditional branches picking what to do.
For very large blocks (larger than cache), performance will be
memory-bound and even byte-at-a-time copying might be competitive.
Theoretically, there's only a fairly small range of sizes where the
algorithm used matters a lot.
> /* CPU architectures that support fast unaligned memory access */
> #if defined(__i386) || defined(__x86_64)
> # define UNALIGNED_MEMORY_ACCESS
> #endif
I don't think this is necessary or useful. If we want better
performance on these archs, a tiny asm file that does almost nothing
but "rep movsd" is known to be the fastest solution on 32-bit x86, and
is at least the second-fastest on 64-bit, with the faster solutions
not being available on all cpus. On pretty much all other archs,
unaligned access is illegal.
> static void *internal_memcpy_uintptr(void *dest, const void *src, size_t n)
> {
> char *d = (char*) dest;
> const char *s = (const char*) src;
> size_t bytes_iteration = sizeof(uintptr_t) * 8;
>
> while (n >= bytes_iteration)
> {
> ((uintptr_t*)d)[0] = ((const uintptr_t*)s)[0];
> ((uintptr_t*)d)[1] = ((const uintptr_t*)s)[1];
> ((uintptr_t*)d)[2] = ((const uintptr_t*)s)[2];
> ((uintptr_t*)d)[3] = ((const uintptr_t*)s)[3];
> ((uintptr_t*)d)[4] = ((const uintptr_t*)s)[4];
> ((uintptr_t*)d)[5] = ((const uintptr_t*)s)[5];
> ((uintptr_t*)d)[6] = ((const uintptr_t*)s)[6];
> ((uintptr_t*)d)[7] = ((const uintptr_t*)s)[7];
> d += bytes_iteration;
> s += bytes_iteration;
> n -= bytes_iteration;
> }
This is just manual loop unrolling, no? GCC should do the equivalent
if you ask it to aggressively unroll loops, including the
vectorization; if not, that seems like a GCC bug.
Rich
* Re: Re: musl libc, memcpy
2012-07-30 20:41 ` musl libc, memcpy Rich Felker
@ 2012-07-31 1:25 ` Luca Barbato
[not found] ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
1 sibling, 0 replies; 6+ messages in thread
From: Luca Barbato @ 2012-07-31 1:25 UTC (permalink / raw)
To: musl
On 07/30/2012 10:41 PM, Rich Felker wrote:
> Hi,
>
> I'm replying with the list CC'd so others can comment too. Sorry I
> haven't gotten a chance to try this code or review it in detail yet.
> What follows is a short initial commentary but I'll give it some more
> attention soon.
The mem* and string functions are implemented as plugins in recent
glibc, so they use platform-optimized versions; not sure whether that
is worth considering for musl.
IIRC GCC and clang have builtins for some of those functions as well.
lu
--
Luca Barbato
Gentoo/linux
http://dev.gentoo.org/~lu_zero
* Re: musl libc, memcpy
[not found] ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
@ 2012-08-01 4:27 ` Rich Felker
2012-08-01 5:40 ` Rich Felker
0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2012-08-01 4:27 UTC (permalink / raw)
To: Kim Walisch; +Cc: musl
On Tue, Jul 31, 2012 at 12:19:13AM +0200, Kim Walisch wrote:
> > I'd like to know what block sizes you were looking at, because for
> > memcpy that makes all the difference in the world:
>
> I copied blocks of 16 kilobytes.
OK, that sounds (off-hand) like a good size for testing.
> > I don't think this is necessary or useful. If we want better
> > performance on these archs, a tiny asm file that does almost nothing
> > but "rep movsd" is known to be the fastest solution on 32-bit x86, and
> > is at least the second-fastest on 64-bit, with the faster solutions
> > not being available on all cpus. On pretty much all other archs,
> > unaligned access is illegal.
>
> My point is that your code uses byte (char) copying for unaligned data
> but on x86 this is not necessary. Using a simple macro in your memcpy
> implementation that always uses the size_t copying path for x86 speeds
> up your memcpy implementation by about 500% for unaligned data on my
> PC (Intel i5-670 3.46GHz, gcc-4.7, SL Linux 6.2 x86_64). You can also
> use a separate asm file with "rep movsd" for x86, I guess it will run
> at the same speed as my macro solution.
I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
version. I'd be interested in hearing how it performs.
> Another interesting thing to mention is that gcc-4.5 vectorizes the 3
> copying loops of your memcpy implementation if it is compiled with the
> -ftree-vectorize flag (add -ftree-vectorizer-verbose=1 for
> vectorization report) but not if simply compiled with -O2 or -O3. With
Odd, the gcc manual claims -ftree-vectorize is included in -O3:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
> $ gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=1 memcpy.c main.c -o memcpy
>
> memcpy.c:25: note: created 1 versioning for alias checks.
> memcpy.c:25: note: LOOP VECTORIZED.
> memcpy.c:21: note: created 1 versioning for alias checks.
> memcpy.c:21: note: LOOP VECTORIZED.
> memcpy.c:9: note: vectorized 2 loops in function.
From the sound of those notes, I suspect duplicate code (and wasteful
conditional branches) are getting generated to handle the possibility
that the source and destination pointers might alias. I think this
means it would be a good idea to add proper use of "restrict" pointers
(per C99 requirements) in musl sooner rather than later; it might both
reduce code size and improve performance.
Rich
* Re: Re: musl libc, memcpy
2012-08-01 4:27 ` Rich Felker
@ 2012-08-01 5:40 ` Rich Felker
2012-08-01 6:19 ` Rich Felker
0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2012-08-01 5:40 UTC (permalink / raw)
To: musl; +Cc: Kim Walisch
[-- Attachment #1: Type: text/plain, Size: 226 bytes --]
On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
> I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
> version. I'd be interested in hearing how it performs.
And here is the attachment...
Rich
[-- Attachment #2: memcpy.s --]
[-- Type: text/plain, Size: 302 bytes --]
.global xmemcpy
.type xmemcpy,@function
xmemcpy:
	push %esi
	push %edi
	mov 12(%esp),%edi
	mov 16(%esp),%esi
	mov 20(%esp),%ecx
	mov %edi,%eax
	mov %ecx,%edx
	shr $2,%ecx
	rep
	movsl
	and $3,%edx
	jz 1f
2:	mov (%esi),%cl
	mov %cl,(%edi)
	inc %esi
	inc %edi
	dec %edx
	jnz 2b
1:	pop %edi
	pop %esi
	ret
* Re: Re: musl libc, memcpy
2012-08-01 5:40 ` Rich Felker
@ 2012-08-01 6:19 ` Rich Felker
2012-08-03 23:22 ` John Spencer
0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2012-08-01 6:19 UTC (permalink / raw)
To: musl; +Cc: Kim Walisch
[-- Attachment #1: Type: text/plain, Size: 417 bytes --]
On Wed, Aug 01, 2012 at 01:40:11AM -0400, Rich Felker wrote:
> On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
> > I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
> > version. I'd be interested in hearing how it performs.
>
> And here is the attachment...
And here's a version that might be faster; reportedly, rep movsd works
better when the destination address is aligned.
Rich
[-- Attachment #2: memcpy.s --]
[-- Type: text/plain, Size: 341 bytes --]
.global xmemcpy
.type xmemcpy,@function
xmemcpy:
	push %esi
	push %edi
	mov 12(%esp),%edi
	mov 16(%esp),%esi
	mov 20(%esp),%ecx
	mov %edi,%eax
	cmp $4,%ecx
	jc 1f
	test $3,%edi
	jz 1f
2:	movsb
	dec %ecx
	test $3,%edi
	jnz 2b
1:	mov %ecx,%edx
	shr $2,%ecx
	rep
	movsl
	and $3,%edx
	jz 1f
2:	movsb
	dec %edx
	jnz 2b
1:	pop %edi
	pop %esi
	ret
* Re: Re: musl libc, memcpy
2012-08-01 6:19 ` Rich Felker
@ 2012-08-03 23:22 ` John Spencer
0 siblings, 0 replies; 6+ messages in thread
From: John Spencer @ 2012-08-03 23:22 UTC (permalink / raw)
To: musl
I've set up a performance test ( https://github.com/rofl0r/memcpy-test ).
These are the average results for i386 (100 runs on big sizes, 10000 on
smaller ones):
size               asm version      current C version
size: 3 172 ticks 199 ticks
size: 4 167 ticks 167 ticks
size: 5 197 ticks 186 ticks
size: 8 187 ticks 186 ticks
size: 15 195 ticks 196 ticks
size: 16 186 ticks 185 ticks
size: 23 202 ticks 199 ticks
size: 24 193 ticks 188 ticks
size: 25 205 ticks 212 ticks
size: 31 199 ticks 198 ticks
size: 32 195 ticks 192 ticks
size: 33 204 ticks 192 ticks
size: 63 213 ticks 255 ticks
size: 64 219 ticks 226 ticks
size: 65 208 ticks 238 ticks
size: 95 220 ticks 247 ticks
size: 96 214 ticks 239 ticks
size: 97 217 ticks 243 ticks
size: 127 233 ticks 261 ticks
size: 128 225 ticks 254 ticks
size: 129 229 ticks 266 ticks
size: 159 242 ticks 279 ticks
size: 160 235 ticks 268 ticks
size: 161 238 ticks 273 ticks
size: 191 255 ticks 288 ticks
size: 192 264 ticks 288 ticks
size: 193 248 ticks 287 ticks
size: 255 279 ticks 323 ticks
size: 256 266 ticks 313 ticks
size: 257 269 ticks 319 ticks
size: 383 332 ticks 391 ticks
size: 384 308 ticks 370 ticks
size: 385 307 ticks 384 ticks
size: 511 345 ticks 439 ticks
size: 512 315 ticks 434 ticks
size: 513 318 ticks 439 ticks
size: 767 370 ticks 571 ticks
size: 768 330 ticks 555 ticks
size: 769 334 ticks 566 ticks
size: 1023 382 ticks 740 ticks
size: 1024 349 ticks 727 ticks
size: 1025 358 ticks 694 ticks
size: 1535 423 ticks 936 ticks
size: 1536 393 ticks 930 ticks
size: 1537 400 ticks 929 ticks
size: 2048 448 ticks 1176 ticks
size: 4096 822 ticks 2404 ticks
size: 8192 3136 ticks 8310 ticks
size: 16384 6481 ticks 9780 ticks
size: 32768 11645 ticks 19060 ticks
size: 65536 29700 ticks 52051 ticks
size: 131072 307029 ticks 310875 ticks
size: 262144 608502 ticks 617698 ticks
size: 524288 1222116 ticks 1244987 ticks
size: 1048576 2500207 ticks 2712991 ticks
size: 2097152 5279016 ticks 5566665 ticks
size: 4194304 10586333 ticks 10849110 ticks
size: 8388608 21961730 ticks 22473953 ticks
size: 16777216 45966254 ticks 47159258 ticks
size: 33554432 92434464 ticks 95873868 ticks
size: 67108864 189858530 ticks 190456107 ticks
It looks as if the asm version is up to twice as fast, depending on the
size of the data copied.
Now waiting for the x86_64 version (if you could provide a working
64-bit rdtsc inline asm function, I'll gladly take that as well).
Someone on ##asm suggested that movaps with xmm registers was fastest
in his tests; it would be interesting to test such a version as well.
On 08/01/2012 08:19 AM, Rich Felker wrote:
> On Wed, Aug 01, 2012 at 01:40:11AM -0400, Rich Felker wrote:
>> On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
>>> I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
>>> version. I'd be interested in hearing how it performs.
>> And here is the attachment...
> And here's a version that might be faster; reportedly, rep movsd works
> better when the destination address is aligned.
>
> Rich