* Re: musl libc, memcpy
       [not found] <CAE_DJ110cVdicu-wPe_Ndeg9ih+g3AXZ_hNoGgX+DftJm6q=mA@mail.gmail.com>
@ 2012-07-30 20:41 ` Rich Felker
  2012-07-31  1:25   ` Luca Barbato
       [not found]   ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
  0 siblings, 2 replies; 6+ messages in thread

From: Rich Felker @ 2012-07-30 20:41 UTC (permalink / raw)
To: Kim Walisch; +Cc: musl

Hi,

I'm replying with the list CC'd so others can comment too. Sorry I
haven't gotten a chance to try this code or review it in detail yet.
What follows is a short initial commentary, but I'll give it some more
attention soon.

On Sun, Jul 29, 2012 at 11:41:47AM +0200, Kim Walisch wrote:
> Hi,
>
> I have been reading through several libc implementations on the
> internet for the past few days, and for fun I have written a fast yet
> portable memcpy implementation. It uses more code than your
> implementation, but I do not think it is bloated. Some quick
> benchmarks that I ran on my Intel Core i5-670 3.46GHz (Red Hat 6.2
> x86_64) indicate that my implementation runs about 50 percent faster
> than yours for aligned data and up to 10 times faster for unaligned
> data using gcc-4.7. The Intel C compiler even vectorizes the main
> copying loop using SSE instructions (if compiled with icc -O2 -xHost),
> which gives performance better than glibc's memcpy on my system. I
> would be happy to hear your opinion about my memcpy implementation.

I'd like to know what block sizes you were looking at, because for
memcpy that makes all the difference in the world:

For very small blocks (down to 1 byte), performance will be dominated
by the conditional branches picking what to do.

For very large blocks (larger than cache), performance will be
memory-bound, and even byte-at-a-time copying might be competitive.

Theoretically, there's only a fairly small range of sizes where the
algorithm used matters a lot.
> /* CPU architectures that support fast unaligned memory access */
> #if defined(__i386) || defined(__x86_64)
> # define UNALIGNED_MEMORY_ACCESS
> #endif

I don't think this is necessary or useful. If we want better
performance on these archs, a tiny asm file that does almost nothing
but "rep movsd" is known to be the fastest solution on 32-bit x86, and
is at least the second-fastest on 64-bit, with the faster solutions
not being available on all cpus. On pretty much all other archs,
unaligned access is illegal.

> static void *internal_memcpy_uintptr(void *dest, const void *src, size_t n)
> {
>     char *d = (char*) dest;
>     const char *s = (const char*) src;
>     size_t bytes_iteration = sizeof(uintptr_t) * 8;
>
>     while (n >= bytes_iteration)
>     {
>         ((uintptr_t*)d)[0] = ((const uintptr_t*)s)[0];
>         ((uintptr_t*)d)[1] = ((const uintptr_t*)s)[1];
>         ((uintptr_t*)d)[2] = ((const uintptr_t*)s)[2];
>         ((uintptr_t*)d)[3] = ((const uintptr_t*)s)[3];
>         ((uintptr_t*)d)[4] = ((const uintptr_t*)s)[4];
>         ((uintptr_t*)d)[5] = ((const uintptr_t*)s)[5];
>         ((uintptr_t*)d)[6] = ((const uintptr_t*)s)[6];
>         ((uintptr_t*)d)[7] = ((const uintptr_t*)s)[7];
>         d += bytes_iteration;
>         s += bytes_iteration;
>         n -= bytes_iteration;
>     }

This is just manual loop unrolling, no? GCC should do the equivalent
if you ask it to aggressively unroll loops, including the
vectorization; if not, that seems like a GCC bug.

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Re: musl libc, memcpy
  2012-07-30 20:41 ` musl libc, memcpy Rich Felker
@ 2012-07-31  1:25   ` Luca Barbato
       [not found]   ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
  1 sibling, 0 replies; 6+ messages in thread

From: Luca Barbato @ 2012-07-31 1:25 UTC (permalink / raw)
To: musl

On 07/30/2012 10:41 PM, Rich Felker wrote:
> Hi,
>
> I'm replying with the list CC'd so others can comment too. Sorry I
> haven't gotten a chance to try this code or review it in detail yet.
> What follows is a short initial commentary but I'll give it some more
> attention soon.

The mem* and string functions are implemented as plugins in recent
glibc, using platform-optimized versions; not sure if it is worth
considering that for musl. IIRC gcc and clang have builtins for some of
those functions as well.

lu

--
Luca Barbato
Gentoo/linux
http://dev.gentoo.org/~lu_zero
[parent not found: <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>]
* Re: musl libc, memcpy
       [not found] ` <CAE_DJ1328EfKQp6t33bq0k+9Hbo0Fvu6dhvO2OOePcx2xa3QeQ@mail.gmail.com>
@ 2012-08-01  4:27   ` Rich Felker
  2012-08-01  5:40     ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread

From: Rich Felker @ 2012-08-01 4:27 UTC (permalink / raw)
To: Kim Walisch; +Cc: musl

On Tue, Jul 31, 2012 at 12:19:13AM +0200, Kim Walisch wrote:
> > I'd like to know what block sizes you were looking at, because for
> > memcpy that makes all the difference in the world:
>
> I copied blocks of 16 kilobytes.

OK, that sounds (off-hand) like a good size for testing.

> > I don't think this is necessary or useful. If we want better
> > performance on these archs, a tiny asm file that does almost nothing
> > but "rep movsd" is known to be the fastest solution on 32-bit x86,
> > and is at least the second-fastest on 64-bit, with the faster
> > solutions not being available on all cpus. On pretty much all other
> > archs, unaligned access is illegal.
>
> My point is that your code uses byte (char) copying for unaligned
> data, but on x86 this is not necessary. Using a simple macro in your
> memcpy implementation that always uses the size_t copying path for x86
> speeds up your memcpy implementation by about 500% for unaligned data
> on my PC (Intel i5-670 3.46GHz, gcc-4.7, SL Linux 6.2 x86_64). You can
> also use a separate asm file with "rep movsd" for x86; I guess it will
> run at the same speed as my macro solution.

I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
version. I'd be interested in hearing how it performs.

> Another interesting thing to mention is that gcc-4.5 vectorizes the 3
> copying loops of your memcpy implementation if it is compiled with the
> -ftree-vectorize flag (add -ftree-vectorizer-verbose=1 for a
> vectorization report) but not if simply compiled with -O2 or -O3.
Odd, the gcc manual claims -ftree-vectorize is included in -O3:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

> $ gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=1 memcpy.c main.c -o memcpy
>
> memcpy.c:25: note: created 1 versioning for alias checks.
> memcpy.c:25: note: LOOP VECTORIZED.
> memcpy.c:21: note: created 1 versioning for alias checks.
> memcpy.c:21: note: LOOP VECTORIZED.
> memcpy.c:9: note: vectorized 2 loops in function.

From the sound of those notes, I suspect duplicate code (and wasteful
conditional branches) are getting generated to handle the possibility
that the source and destination pointers might alias. I think this
means it would be a good idea to add proper use of "restrict" pointers
(per C99 requirements) in musl sooner rather than later; it might both
reduce code size and improve performance.

Rich
* Re: Re: musl libc, memcpy
  2012-08-01  4:27 ` Rich Felker
@ 2012-08-01  5:40   ` Rich Felker
  2012-08-01  6:19     ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread

From: Rich Felker @ 2012-08-01 5:40 UTC (permalink / raw)
To: musl; +Cc: Kim Walisch

[-- Attachment #1: Type: text/plain, Size: 226 bytes --]

On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
> I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
> version. I'd be interested in hearing how it performs.

And here is the attachment...

Rich

[-- Attachment #2: memcpy.s --]
[-- Type: text/plain, Size: 302 bytes --]

.global xmemcpy
.type xmemcpy,@function
xmemcpy:
	push %esi
	push %edi
	mov 12(%esp),%edi
	mov 16(%esp),%esi
	mov 20(%esp),%ecx
	mov %edi,%eax
	mov %ecx,%edx
	shr $2,%ecx
	rep movsl
	and $3,%edx
	jz 1f
2:	mov (%esi),%cl
	mov %cl,(%edi)
	inc %esi
	inc %edi
	dec %edx
	jnz 2b
1:	pop %edi
	pop %esi
	ret
* Re: Re: musl libc, memcpy
  2012-08-01  5:40 ` Rich Felker
@ 2012-08-01  6:19   ` Rich Felker
  2012-08-03 23:22     ` John Spencer
  0 siblings, 1 reply; 6+ messages in thread

From: Rich Felker @ 2012-08-01 6:19 UTC (permalink / raw)
To: musl; +Cc: Kim Walisch

[-- Attachment #1: Type: text/plain, Size: 417 bytes --]

On Wed, Aug 01, 2012 at 01:40:11AM -0400, Rich Felker wrote:
> On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
> > I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
> > version. I'd be interested in hearing how it performs.
>
> And here is the attachment...

And here's a version that might be faster; reportedly, rep movsd works
better when the destination address is aligned.

Rich

[-- Attachment #2: memcpy.s --]
[-- Type: text/plain, Size: 341 bytes --]

.global xmemcpy
.type xmemcpy,@function
xmemcpy:
	push %esi
	push %edi
	mov 12(%esp),%edi
	mov 16(%esp),%esi
	mov 20(%esp),%ecx
	mov %edi,%eax
	cmp $4,%ecx
	jc 1f
	test $3,%edi
	jz 1f
2:	movsb
	dec %ecx
	test $3,%edi
	jnz 2b
1:	mov %ecx,%edx
	shr $2,%ecx
	rep movsl
	and $3,%edx
	jz 1f
2:	movsb
	dec %edx
	jnz 2b
1:	pop %edi
	pop %esi
	ret
* Re: Re: musl libc, memcpy
  2012-08-01  6:19 ` Rich Felker
@ 2012-08-03 23:22   ` John Spencer
  0 siblings, 0 replies; 6+ messages in thread

From: John Spencer @ 2012-08-03 23:22 UTC (permalink / raw)
To: musl

i've set up a performance test ( https://github.com/rofl0r/memcpy-test )

these are the average results for i386 (100 runs on big sizes, 10000
on smaller ones):

                      asm version    current c-version
size:        3          172 ticks          199 ticks
size:        4          167 ticks          167 ticks
size:        5          197 ticks          186 ticks
size:        8          187 ticks          186 ticks
size:       15          195 ticks          196 ticks
size:       16          186 ticks          185 ticks
size:       23          202 ticks          199 ticks
size:       24          193 ticks          188 ticks
size:       25          205 ticks          212 ticks
size:       31          199 ticks          198 ticks
size:       32          195 ticks          192 ticks
size:       33          204 ticks          192 ticks
size:       63          213 ticks          255 ticks
size:       64          219 ticks          226 ticks
size:       65          208 ticks          238 ticks
size:       95          220 ticks          247 ticks
size:       96          214 ticks          239 ticks
size:       97          217 ticks          243 ticks
size:      127          233 ticks          261 ticks
size:      128          225 ticks          254 ticks
size:      129          229 ticks          266 ticks
size:      159          242 ticks          279 ticks
size:      160          235 ticks          268 ticks
size:      161          238 ticks          273 ticks
size:      191          255 ticks          288 ticks
size:      192          264 ticks          288 ticks
size:      193          248 ticks          287 ticks
size:      255          279 ticks          323 ticks
size:      256          266 ticks          313 ticks
size:      257          269 ticks          319 ticks
size:      383          332 ticks          391 ticks
size:      384          308 ticks          370 ticks
size:      385          307 ticks          384 ticks
size:      511          345 ticks          439 ticks
size:      512          315 ticks          434 ticks
size:      513          318 ticks          439 ticks
size:      767          370 ticks          571 ticks
size:      768          330 ticks          555 ticks
size:      769          334 ticks          566 ticks
size:     1023          382 ticks          740 ticks
size:     1024          349 ticks          727 ticks
size:     1025          358 ticks          694 ticks
size:     1535          423 ticks          936 ticks
size:     1536          393 ticks          930 ticks
size:     1537          400 ticks          929 ticks
size:     2048          448 ticks         1176 ticks
size:     4096          822 ticks         2404 ticks
size:     8192         3136 ticks         8310 ticks
size:    16384         6481 ticks         9780 ticks
size:    32768        11645 ticks        19060 ticks
size:    65536        29700 ticks        52051 ticks
size:   131072       307029 ticks       310875 ticks
size:   262144       608502 ticks       617698 ticks
size:   524288      1222116 ticks      1244987 ticks
size:  1048576      2500207 ticks      2712991 ticks
size:  2097152      5279016 ticks      5566665 ticks
size:  4194304     10586333 ticks     10849110 ticks
size:  8388608     21961730 ticks     22473953 ticks
size: 16777216     45966254 ticks     47159258 ticks
size: 33554432     92434464 ticks     95873868 ticks
size: 67108864    189858530 ticks    190456107 ticks

it looks as if the asm version is up to twice as fast, depending on
the size of data copied.

now waiting for the x86_64 version (if you could provide a working
64-bit rdtsc inline asm function, i'll gladly take that as well)

someone on ##asm suggested that movaps with xmm regs was fastest in
his tests. would be interesting to test such a version as well.

On 08/01/2012 08:19 AM, Rich Felker wrote:
> On Wed, Aug 01, 2012 at 01:40:11AM -0400, Rich Felker wrote:
>> On Wed, Aug 01, 2012 at 12:27:22AM -0400, Rich Felker wrote:
>>> I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
>>> version. I'd be interested in hearing how it performs.
>>
>> And here is the attachment...
>
> And here's a version that might be faster; reportedly, rep movsd works
> better when the destination address is aligned.
>
> Rich