mailing list of musl libc
* Optimized C memcpy
@ 2013-08-07 18:21 Rich Felker
  2013-08-08 12:59 ` Andrew Bradford
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Rich Felker @ 2013-08-07 18:21 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1275 bytes --]

Attached is the latest version of my "pure C" (modulo aliasing issues)
memcpy implementation. Compiled with -O3 on arm, it matches the
performance of the assembly language memcpy from Bionic for aligned
copies, and is only 25% slower than the asm for misaligned copies. And
it's only mildly larger. It uses the same principle as the Bionic
code: large block copies as aligned 32-bit units for aligned copies,
and aligned-load, bitshift-then-or, aligned-store for misaligned
copies. This should, in principle, work well on typical risc archs
that have plenty of registers but no misaligned load or store support.
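
For one output word, the misaligned case boils down to roughly this
(little-endian; the attached file's case 3 is this step unrolled eight
times, and the other cases differ only in the shift amounts):

	/* s is one byte past an aligned boundary, d is word-aligned, and w
	   already holds the aligned word whose top three bytes are the next
	   three source bytes */
	x = *(uint32_t *)(s+3);                /* next aligned load */
	*(uint32_t *)d = (w >> 8) | (x << 24); /* splice, then aligned store */
	w = x; s += 4; d += 4;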

Unfortunately it only works on little-endian (I haven't thought much
yet about how it could be adapted to big-endian), but testing it on
qemu-ppc with the endian check disabled (thus with wrong behavior)
suggested that this approach would work well there too if we could
adapt it. Of course tests under qemu are not worth much; the ARM tests
were on real hardware and I'd like to see real-hardware results for
other archs (mipsel?) too.

This is not a replacement for the ARM asm (which is still better), but
it's a step towards avoiding the need to have written-by-hand assembly
for every single new arch we add as a prerequisite for tolerable
performance.

Rich

[-- Attachment #2: memcpy_risc3.c --]
[-- Type: text/plain, Size: 2710 bytes --]

#include <string.h>
#include <stdlib.h>
#include <stdint.h>

struct block {
	uint32_t data[8];
};

void *memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	uint32_t w, x;

	/* copy leading bytes until the source pointer is word-aligned */
	for (; (uintptr_t)s % 4 && n; n--) *d++ = *s++;
	if (!n) return dest;

	/* dispatch on the destination's remaining misalignment mod 4 */
	if (n>=4) switch ((uintptr_t)d % 4) {
	case 0:
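		/* both pointers are now word-aligned: copy 32-byte blocks via struct assignment */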
		for (; n>=32; s+=32, d+=32, n-=32)
			*(struct block *)d = *(struct block *)s;
		break;
	case 1:
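		/* the shift-based splicing below assumes little-endian; on big-endian
		   this check bails out to the trailing byte loop */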
		if (!(union { int i; char c; }){1}.c) break;
		w = *(uint32_t *)s;
		*d++ = *s++;
		*d++ = *s++;
		*d++ = *s++;
		n -= 3;
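		/* n>=33 (not 32) keeps the lookahead load at s+29, which reads
		   through s+32, inside the source */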
		for (; n>=33; s+=32, d+=32, n-=32) {
			x = *(uint32_t *)(s+1);
			*(uint32_t *)(d+0) = (w>>24) | (x<<8);
			w = *(uint32_t *)(s+5);
			*(uint32_t *)(d+4) = (x>>24) | (w<<8);
			x = *(uint32_t *)(s+9);
			*(uint32_t *)(d+8) = (w>>24) | (x<<8);
			w = *(uint32_t *)(s+13);
			*(uint32_t *)(d+12) = (x>>24) | (w<<8);
			x = *(uint32_t *)(s+17);
			*(uint32_t *)(d+16) = (w>>24) | (x<<8);
			w = *(uint32_t *)(s+21);
			*(uint32_t *)(d+20) = (x>>24) | (w<<8);
			x = *(uint32_t *)(s+25);
			*(uint32_t *)(d+24) = (w>>24) | (x<<8);
			w = *(uint32_t *)(s+29);
			*(uint32_t *)(d+28) = (x>>24) | (w<<8);
		}
		break;
	case 2:
		if (!(union { int i; char c; }){1}.c) break;
		w = *(uint32_t *)s;
		*d++ = *s++;
		*d++ = *s++;
		n -= 2;
		for (; n>=34; s+=32, d+=32, n-=32) {
			x = *(uint32_t *)(s+2);
			*(uint32_t *)(d+0) = (w>>16) | (x<<16);
			w = *(uint32_t *)(s+6);
			*(uint32_t *)(d+4) = (x>>16) | (w<<16);
			x = *(uint32_t *)(s+10);
			*(uint32_t *)(d+8) = (w>>16) | (x<<16);
			w = *(uint32_t *)(s+14);
			*(uint32_t *)(d+12) = (x>>16) | (w<<16);
			x = *(uint32_t *)(s+18);
			*(uint32_t *)(d+16) = (w>>16) | (x<<16);
			w = *(uint32_t *)(s+22);
			*(uint32_t *)(d+20) = (x>>16) | (w<<16);
			x = *(uint32_t *)(s+26);
			*(uint32_t *)(d+24) = (w>>16) | (x<<16);
			w = *(uint32_t *)(s+30);
			*(uint32_t *)(d+28) = (x>>16) | (w<<16);
		}
		break;
	case 3:
		if (!(union { int i; char c; }){1}.c) break;
		w = *(uint32_t *)s;
		*d++ = *s++;
		n -= 1;
		for (; n>=35; s+=32, d+=32, n-=32) {
			x = *(uint32_t *)(s+3);
			*(uint32_t *)(d+0) = (w>>8) | (x<<24);
			w = *(uint32_t *)(s+7);
			*(uint32_t *)(d+4) = (x>>8) | (w<<24);
			x = *(uint32_t *)(s+11);
			*(uint32_t *)(d+8) = (w>>8) | (x<<24);
			w = *(uint32_t *)(s+15);
			*(uint32_t *)(d+12) = (x>>8) | (w<<24);
			x = *(uint32_t *)(s+19);
			*(uint32_t *)(d+16) = (w>>8) | (x<<24);
			w = *(uint32_t *)(s+23);
			*(uint32_t *)(d+20) = (x>>8) | (w<<24);
			x = *(uint32_t *)(s+27);
			*(uint32_t *)(d+24) = (w>>8) | (x<<24);
			w = *(uint32_t *)(s+31);
			*(uint32_t *)(d+28) = (x>>8) | (w<<24);
		}
		break;
	}

	for (; n; n--) *d++ = *s++;
	return dest;
}

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy
  2013-08-07 18:21 Optimized C memcpy Rich Felker
@ 2013-08-08 12:59 ` Andrew Bradford
  2013-08-08 13:03   ` Andrew Bradford
  2013-08-09  5:02 ` Rob Landley
  2013-08-11  5:11 ` Optimized C memcpy [updated] Rich Felker
  2 siblings, 1 reply; 13+ messages in thread
From: Andrew Bradford @ 2013-08-08 12:59 UTC (permalink / raw)
  To: musl

On Wed, Aug 7, 2013, at 02:21 PM, Rich Felker wrote:
> Attached is the latest version of my "pure C" (modulo aliasing issues)
> memcpy implementation. Compiled with -O3 on arm, it matches the
> performance of the assembly language memcpy from Bionic for aligned
> copies, and is only 25% slower than the asm for misaligned copies. And
> it's only mildly larger. It uses the same principle as the Bionic
> code: large block copies as aligned 32-bit units for aligned copies,
> and aligned-load, bitshift-then-or, aligned-store for misaligned
> copies. This should, in principle, work well on typical risc archs
> that have plenty of registers but no misaligned load or store support.
> 
> Unfortunately it only works on little-endian (I haven't thought much
> yet about how it could be adapted to big-endian), but testing it on
> qemu-ppc with the endian check disabled (thus with wrong behavior)
> suggested that this approach would work well there too if we could
> adapt it. Of course tests under qemu are not worth much; the ARM tests
> were on real hardware and I'd like to see real-hardware results for
> other archs (mipsel?) too.
> 
> This is not a replacement for the ARM asm (which is still better), but
> it's a step towards avoiding the need to have written-by-hand assembly
> for every single new arch we add as a prerequisite for tolerable
> performance.

Sorry if this has been discussed before but Google isn't much help.  Why
is 32 bytes chosen as the block size over other sizes?

It seems that the code would be fewer lines if blocks were 4 bytes,
hence easier to read, verify, and understand.  What's the performance
penalty -- I assume there has to be one -- that I'm not understanding
which drives the choice of 32 byte blocks?

Thanks,
Andrew


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy
  2013-08-08 12:59 ` Andrew Bradford
@ 2013-08-08 13:03   ` Andrew Bradford
  2013-08-08 13:17     ` Luca Barbato
  2013-08-08 15:15     ` Rich Felker
  0 siblings, 2 replies; 13+ messages in thread
From: Andrew Bradford @ 2013-08-08 13:03 UTC (permalink / raw)
  To: musl

On Thu, Aug 8, 2013, at 08:59 AM, Andrew Bradford wrote:
> On Wed, Aug 7, 2013, at 02:21 PM, Rich Felker wrote:
> > Attached is the latest version of my "pure C" (modulo aliasing issues)
> > memcpy implementation. Compiled with -O3 on arm, it matches the
> > performance of the assembly language memcpy from Bionic for aligned
> > copies, and is only 25% slower than the asm for misaligned copies. And
> > it's only mildly larger. It uses the same principle as the Bionic
> > code: large block copies as aligned 32-bit units for aligned copies,
> > and aligned-load, bitshift-then-or, aligned-store for misaligned
> > copies. This should, in principle, work well on typical risc archs
> > that have plenty of registers but no misaligned load or store support.
> > 
> > Unfortunately it only works on little-endian (I haven't thought much
> > yet about how it could be adapted to big-endian), but testing it on
> > qemu-ppc with the endian check disabled (thus with wrong behavior)
> > suggested that this approach would work well there too if we could
> > adapt it. Of course tests under qemu are not worth much; the ARM tests
> > were on real hardware and I'd like to see real-hardware results for
> > other archs (mipsel?) too.
> > 
> > This is not a replacement for the ARM asm (which is still better), but
> > it's a step towards avoiding the need to have written-by-hand assembly
> > for every single new arch we add as a prerequisite for tolerable
> > performance.
> 
> Sorry if this has been discussed before but Google isn't much help.  Why
> is 32 bytes chosen as the block size over other sizes?
> 
> It seems that the code would be fewer lines if blocks were 4 bytes,

Sorry, I now see why 4 byte blocks won't work due to the misalignment,
but 8 or 16 seem like they should be possible.
Is it just the evaluation of the for loop being expensive that's trying
to be avoided?

Thanks,
Andrew


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy
  2013-08-08 13:03   ` Andrew Bradford
@ 2013-08-08 13:17     ` Luca Barbato
  2013-08-08 15:15     ` Rich Felker
  1 sibling, 0 replies; 13+ messages in thread
From: Luca Barbato @ 2013-08-08 13:17 UTC (permalink / raw)
  To: musl

On 08/08/13 15:03, Andrew Bradford wrote:
> Sorry, I now see why 4 byte blocks won't work due to the misalignment,
> but 8 or 16 seem like they should be possible.
> Is it just the evaluation of the for loop being expensive that's trying
> to be avoided?

Probably 32 bytes uses cache lines better on average.

lu



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy
  2013-08-08 13:03   ` Andrew Bradford
  2013-08-08 13:17     ` Luca Barbato
@ 2013-08-08 15:15     ` Rich Felker
  2013-08-08 20:17       ` Andre Renaud
  1 sibling, 1 reply; 13+ messages in thread
From: Rich Felker @ 2013-08-08 15:15 UTC (permalink / raw)
  To: musl

On Thu, Aug 08, 2013 at 09:03:51AM -0400, Andrew Bradford wrote:
> > > This is not a replacement for the ARM asm (which is still better), but
> > > it's a step towards avoiding the need to have written-by-hand assembly
> > > for every single new arch we add as a prerequisite for tolerable
> > > performance.
> > 
> > Sorry if this has been discussed before but Google isn't much help.  Why
> > is 32 bytes chosen as the block size over other sizes?
> > 
> > It seems that the code would be fewer lines if blocks were 4 bytes,
> 
> Sorry, I now see why 4 byte blocks won't work due to the misalignment,
> but 8 or 16 seem like they should be possible.
> Is it just the evaluation of the for loop being expensive that's trying
> to be avoided?

It's purely empirical reasons. 8 is the smallest that would work
without extra logic to shuffle w/x. 16 runs 50% slower than the ARM
asm. 32 runs only 25% slower than the ARM asm.

Rich


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy
  2013-08-08 15:15     ` Rich Felker
@ 2013-08-08 20:17       ` Andre Renaud
  2013-08-08 20:26         ` Rich Felker
  0 siblings, 1 reply; 13+ messages in thread
From: Andre Renaud @ 2013-08-08 20:17 UTC (permalink / raw)
  To: musl

Hi Rich,
From the looks of the code, compared to the original bionic assembly,
I assume the remaining speed difference is caused by the C code doing
8 discrete store operations, whereas the bionic code batches them all
up into registers and does them as a single multiple-store. Would it
be worth having a structure with 8 32-bit ints in it, and doing a
single write to d of one of these (hoping that gcc will catch it and
turn it into a stm instruction)? It unfortunately runs the risk that
gcc will decide a 32-byte copy is worth using memcpy for, resulting in
the recursive issue you've seen previously.
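
Something like this, for the d%4==1 case of your posted file (a rough,
untested sketch reusing struct block and the existing variables, just
to show the shape):

	struct block tmp;
	for (; n>=33; s+=32, d+=32, n-=32) {
		x = *(uint32_t *)(s+1);  tmp.data[0] = (w>>24) | (x<<8);
		w = *(uint32_t *)(s+5);  tmp.data[1] = (x>>24) | (w<<8);
		x = *(uint32_t *)(s+9);  tmp.data[2] = (w>>24) | (x<<8);
		w = *(uint32_t *)(s+13); tmp.data[3] = (x>>24) | (w<<8);
		x = *(uint32_t *)(s+17); tmp.data[4] = (w>>24) | (x<<8);
		w = *(uint32_t *)(s+21); tmp.data[5] = (x>>24) | (w<<8);
		x = *(uint32_t *)(s+25); tmp.data[6] = (w>>24) | (x<<8);
		w = *(uint32_t *)(s+29); tmp.data[7] = (x>>24) | (w<<8);
		*(struct block *)d = tmp;  /* one struct store, hopefully a single stm */
	}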

Regards,
Andre

On 9 August 2013 03:15, Rich Felker <dalias@aerifal.cx> wrote:
> On Thu, Aug 08, 2013 at 09:03:51AM -0400, Andrew Bradford wrote:
>> > > This is not a replacement for the ARM asm (which is still better), but
>> > > it's a step towards avoiding the need to have written-by-hand assembly
>> > > for every single new arch we add as a prerequisite for tolerable
>> > > performance.
>> >
>> > Sorry if this has been discussed before but Google isn't much help.  Why
>> > is 32 bytes chosen as the block size over other sizes?
>> >
>> > It seems that the code would be fewer lines if blocks were 4 bytes,
>>
>> Sorry, I now see why 4 byte blocks won't work due to the misalignment,
>> but 8 or 16 seem like they should be possible.
>> Is it just the evaluation of the for loop being expensive that's trying
>> to be avoided?
>
> It's purely empirical reasons. 8 is the smallest that would work
> without extra logic to shuffle w/x. 16 runs 50% slower than the ARM
> asm. 32 runs only 25% slower than the ARM asm.
>
> Rich



-- 
Bluewater Systems - An Aiotec Company

Andre Renaud
andre@bluewatersys.com          5 Amuri Park, 404 Barbadoes St
www.bluewatersys.com            PO Box 13 889, Christchurch 8013
www.aiotec.co.nz                New Zealand
Phone: +64 3 3779127            Freecall: Australia 1800 148 751
Fax:   +64 3 3779135            USA 1800 261 2934


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy
  2013-08-08 20:17       ` Andre Renaud
@ 2013-08-08 20:26         ` Rich Felker
  0 siblings, 0 replies; 13+ messages in thread
From: Rich Felker @ 2013-08-08 20:26 UTC (permalink / raw)
  To: musl

On Fri, Aug 09, 2013 at 08:17:22AM +1200, Andre Renaud wrote:
> Hi Rich,
> From the looks of the code, compared to the original bionic assembly,
> I assume the remaining speed difference is caused by the C code doing
> 8 discrete store operations, whereas the bionic code batches them all
> up into registers and does them as a single multiple-store. Would it
> be worth having a structure with 8 32-bit ints in it, and doing a
> single write to d of one of these (hoping that gcc will catch it and
> turn it into a stm instruction)? It unfortunately runs the risk that
> gcc will decide a 32-byte copy is worth using memcpy for, resulting in
> the recursive issue you've seen previously.

In principle, the ARM codegen calls to memcpy (these are different
from the optimizer's calls to memcpy, which we already know how to
turn off) only happen when the compiler cannot determine that the
accesses are aligned. You can experiment with using a structure, but I
think there's also a risk that the compiler will not allocate enough
registers for it and will thereby end up using temp space on the
stack. If you want to experiment, just do one of the cases, not all
three, to make it easier.

Rich


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy
  2013-08-07 18:21 Optimized C memcpy Rich Felker
  2013-08-08 12:59 ` Andrew Bradford
@ 2013-08-09  5:02 ` Rob Landley
  2013-08-11  5:11 ` Optimized C memcpy [updated] Rich Felker
  2 siblings, 0 replies; 13+ messages in thread
From: Rob Landley @ 2013-08-09  5:02 UTC (permalink / raw)
  To: musl; +Cc: musl

On 08/07/2013 01:21:24 PM, Rich Felker wrote:
> This is not a replacement for the ARM asm (which is still better), but
> it's a step towards avoiding the need to have written-by-hand assembly
> for every single new arch we add as a prerequisite for tolerable
> performance.

Are there any other updates to the porting guide?

Rob

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy [updated]
  2013-08-07 18:21 Optimized C memcpy Rich Felker
  2013-08-08 12:59 ` Andrew Bradford
  2013-08-09  5:02 ` Rob Landley
@ 2013-08-11  5:11 ` Rich Felker
  2013-08-11  6:20   ` Rich Felker
  2 siblings, 1 reply; 13+ messages in thread
From: Rich Felker @ 2013-08-11  5:11 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 466 bytes --]

On Wed, Aug 07, 2013 at 02:21:24PM -0400, Rich Felker wrote:
> Unfortunately it only works on little-endian (I haven't thought much
> yet about how it could be adapted to big-endian), but testing it on

Making it work for big endian was simply a matter of reversing the
direction of the shifts. I've updated it with this change, and some
other minor improvements. See attached file. If this works on all
existing archs, I think it might be worth committing.

Rich

[-- Attachment #2: memcpy_risc4.c --]
[-- Type: text/plain, Size: 3016 bytes --]

#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <endian.h>

/* the shift directions flip with byte order so the word-splicing code
   below works for either endianness */
#if __BYTE_ORDER == __LITTLE_ENDIAN
#define LS >>
#define RS <<
#else
#define LS <<
#define RS >>
#endif

struct block32 { uint32_t data[8]; };
struct block64 { uint64_t data[8]; };

void *memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	uint32_t w, x;

	/* align the source to 8 bytes so the 64-bit block path can use aligned loads */
	for (; (uintptr_t)s % 8 && n; n--) *d++ = *s++;
	if (!n) return dest;

	if (n>=4) switch ((uintptr_t)d % 4) {
	case 0:
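		/* both pointers aligned: 64-byte blocks when the destination is also
		   8-byte aligned, else 32-byte blocks, then single words */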
		if (!((uintptr_t)d%8)) for (; n>=64; s+=64, d+=64, n-=64)
			*(struct block64 *)d = *(struct block64 *)s;
		else for (; n>=32; s+=32, d+=32, n-=32)
			*(struct block32 *)d = *(struct block32 *)s;
		for (; n>=4; s+=4, d+=4, n-=4)
			*(uint32_t *)d = *(uint32_t *)s;
		break;
	case 1:
		w = *(uint32_t *)s;
		*d++ = *s++;
		*d++ = *s++;
		*d++ = *s++;
		n -= 3;
		for (; n>=33; s+=32, d+=32, n-=32) {
			x = *(uint32_t *)(s+1);
			*(uint32_t *)(d+0) = (w LS 24) | (x RS 8);
			w = *(uint32_t *)(s+5);
			*(uint32_t *)(d+4) = (x LS 24) | (w RS 8);
			x = *(uint32_t *)(s+9);
			*(uint32_t *)(d+8) = (w LS 24) | (x RS 8);
			w = *(uint32_t *)(s+13);
			*(uint32_t *)(d+12) = (x LS 24) | (w RS 8);
			x = *(uint32_t *)(s+17);
			*(uint32_t *)(d+16) = (w LS 24) | (x RS 8);
			w = *(uint32_t *)(s+21);
			*(uint32_t *)(d+20) = (x LS 24) | (w RS 8);
			x = *(uint32_t *)(s+25);
			*(uint32_t *)(d+24) = (w LS 24) | (x RS 8);
			w = *(uint32_t *)(s+29);
			*(uint32_t *)(d+28) = (x LS 24) | (w RS 8);
		}
		break;
	case 2:
		w = *(uint32_t *)s;
		*d++ = *s++;
		*d++ = *s++;
		n -= 2;
		for (; n>=34; s+=32, d+=32, n-=32) {
			x = *(uint32_t *)(s+2);
			*(uint32_t *)(d+0) = (w LS 16) | (x RS 16);
			w = *(uint32_t *)(s+6);
			*(uint32_t *)(d+4) = (x LS 16) | (w RS 16);
			x = *(uint32_t *)(s+10);
			*(uint32_t *)(d+8) = (w LS 16) | (x RS 16);
			w = *(uint32_t *)(s+14);
			*(uint32_t *)(d+12) = (x LS 16) | (w RS 16);
			x = *(uint32_t *)(s+18);
			*(uint32_t *)(d+16) = (w LS 16) | (x RS 16);
			w = *(uint32_t *)(s+22);
			*(uint32_t *)(d+20) = (x LS 16) | (w RS 16);
			x = *(uint32_t *)(s+26);
			*(uint32_t *)(d+24) = (w LS 16) | (x RS 16);
			w = *(uint32_t *)(s+30);
			*(uint32_t *)(d+28) = (x LS 16) | (w RS 16);
		}
		break;
	case 3:
		w = *(uint32_t *)s;
		*d++ = *s++;
		n -= 1;
		for (; n>=35; s+=32, d+=32, n-=32) {
			x = *(uint32_t *)(s+3);
			*(uint32_t *)(d+0) = (w LS 8) | (x RS 24);
			w = *(uint32_t *)(s+7);
			*(uint32_t *)(d+4) = (x LS 8) | (w RS 24);
			x = *(uint32_t *)(s+11);
			*(uint32_t *)(d+8) = (w LS 8) | (x RS 24);
			w = *(uint32_t *)(s+15);
			*(uint32_t *)(d+12) = (x LS 8) | (w RS 24);
			x = *(uint32_t *)(s+19);
			*(uint32_t *)(d+16) = (w LS 8) | (x RS 24);
			w = *(uint32_t *)(s+23);
			*(uint32_t *)(d+20) = (x LS 8) | (w RS 24);
			x = *(uint32_t *)(s+27);
			*(uint32_t *)(d+24) = (w LS 8) | (x RS 24);
			w = *(uint32_t *)(s+31);
			*(uint32_t *)(d+28) = (x LS 8) | (w RS 24);
		}
		break;
	}

	for (; n; n--) *d++ = *s++;
	return dest;
}

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy [updated]
  2013-08-11  5:11 ` Optimized C memcpy [updated] Rich Felker
@ 2013-08-11  6:20   ` Rich Felker
  2013-08-11  8:13     ` Rich Felker
  0 siblings, 1 reply; 13+ messages in thread
From: Rich Felker @ 2013-08-11  6:20 UTC (permalink / raw)
  To: musl

On Sun, Aug 11, 2013 at 01:11:35AM -0400, Rich Felker wrote:
> struct block32 { uint32_t data[8]; };
> struct block64 { uint64_t data[8]; };
> 
> void *memcpy(void *restrict dest, const void *restrict src, size_t n)
> {
> 	unsigned char *d = dest;
> 	const unsigned char *s = src;
> 	uint32_t w, x;
> 
> 	for (; (uintptr_t)s % 8 && n; n--) *d++ = *s++;
> 	if (!n) return dest;
> 
> 	if (n>=4) switch ((uintptr_t)d % 4) {
> 	case 0:
> 		if (!((uintptr_t)d%8)) for (; n>=64; s+=64, d+=64, n-=64)
> 			*(struct block64 *)d = *(struct block64 *)s;

Unfortunately this case seems to be compiling to a call to memcpy on
powerpc (but nowhere else I found). So I may need to drop the special
case for 64-bit alignment. I wish there was some source for knowledge
of the cases that can trigger gcc's stupidity, though...

Rich


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy [updated]
  2013-08-11  6:20   ` Rich Felker
@ 2013-08-11  8:13     ` Rich Felker
  2013-08-11 11:14       ` Luca Barbato
  0 siblings, 1 reply; 13+ messages in thread
From: Rich Felker @ 2013-08-11  8:13 UTC (permalink / raw)
  To: musl

On Sun, Aug 11, 2013 at 02:20:10AM -0400, Rich Felker wrote:
> On Sun, Aug 11, 2013 at 01:11:35AM -0400, Rich Felker wrote:
> > struct block32 { uint32_t data[8]; };
> > struct block64 { uint64_t data[8]; };
> > 
> > void *memcpy(void *restrict dest, const void *restrict src, size_t n)
> > {
> > 	unsigned char *d = dest;
> > 	const unsigned char *s = src;
> > 	uint32_t w, x;
> > 
> > 	for (; (uintptr_t)s % 8 && n; n--) *d++ = *s++;
> > 	if (!n) return dest;
> > 
> > 	if (n>=4) switch ((uintptr_t)d % 4) {
> > 	case 0:
> > 		if (!((uintptr_t)d%8)) for (; n>=64; s+=64, d+=64, n-=64)
> > 			*(struct block64 *)d = *(struct block64 *)s;
> 
> Unfortunately this case seems to be compiling to a call to memcpy on
> powerpc (but nowhere else I found). So I may need to drop the special
> case for 64-bit alignment. I wish there was some source for knowledge
> of the cases that can trigger gcc's stupidity, though...

It turns out mips at certain optimization levels is also generating a
memcpy for the structure assignments. I think I just need to drop all
of the structure-assignment tricks and use a mildly unrolled loop with
uint32_t units for the aligned case. This gives much worse performance
on ARM, where gcc fails to generate the proper ldmia/stmia without the
struct, but we have asm we can use for ARM anyway. On other archs, the
struct copy code does not even seem to help. The simple integer loop
works just as well.
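
Roughly, the aligned case would then become something like this
(untested, and the unroll factor is just a first guess):

	for (; n>=16; s+=16, d+=16, n-=16) {
		*(uint32_t *)(d+0)  = *(uint32_t *)(s+0);
		*(uint32_t *)(d+4)  = *(uint32_t *)(s+4);
		*(uint32_t *)(d+8)  = *(uint32_t *)(s+8);
		*(uint32_t *)(d+12) = *(uint32_t *)(s+12);
	}
	for (; n>=4; s+=4, d+=4, n-=4)
		*(uint32_t *)d = *(uint32_t *)s;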

I'll do some more experimenting and probably commit the ARM asm soon,
followed by the C code once I get some better feedback on how it
performs on real machines.

Rich


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy [updated]
  2013-08-11  8:13     ` Rich Felker
@ 2013-08-11 11:14       ` Luca Barbato
  2013-08-11 11:27         ` Rich Felker
  0 siblings, 1 reply; 13+ messages in thread
From: Luca Barbato @ 2013-08-11 11:14 UTC (permalink / raw)
  To: musl

On 11/08/13 10:13, Rich Felker wrote:
>> Unfortunately this case seems to be compiling to a call to memcpy on
>> powerpc (but nowhere else I found). So I may need to drop the special
>> case for 64-bit alignment. I wish there was some source for knowledge
>> of the cases that can trigger gcc's stupidity, though...
> 
> It turns out mips at certain optimization levels is also generating a
> memcpy for the structure assignments. I think I just need to drop all
> of the structure-assignment tricks and use a mildly unrolled loop with
> uint32_t units for the aligned case. This gives much worse performance
> on ARM, where gcc fails to generate the proper ldmia/stmia without the
> struct, but we have asm we can use for ARM anyway. On other archs, the
> struct copy code does not even seem to help. The simple integer loop
> works just as well.
> 
> I'll do some more experimenting and probably commit the ARM asm soon,
> followed by the C code once I get some better feedback on how it
> performs on real machines.

What about sprinkling volatile here and there?

lu



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Optimized C memcpy [updated]
  2013-08-11 11:14       ` Luca Barbato
@ 2013-08-11 11:27         ` Rich Felker
  0 siblings, 0 replies; 13+ messages in thread
From: Rich Felker @ 2013-08-11 11:27 UTC (permalink / raw)
  To: musl

On Sun, Aug 11, 2013 at 01:14:31PM +0200, Luca Barbato wrote:
> On 11/08/13 10:13, Rich Felker wrote:
> >> Unfortunately this case seems to be compiling to a call to memcpy on
> >> powerpc (but nowhere else I found). So I may need to drop the special
> >> case for 64-bit alignment. I wish there was some source for knowledge
> >> of the cases that can trigger gcc's stupidity, though...
> > 
> > It turns out mips at certain optimization levels is also generating a
> > memcpy for the structure assignments. I think I just need to drop all
> > of the structure-assignment tricks and use a mildly unrolled loop with
> > uint32_t units for the aligned case. This gives much worse performance
> > on ARM, where gcc fails to generate the proper ldmia/stmia without the
> > struct, but we have asm we can use for ARM anyway. On other archs, the
> > struct copy code does not even seem to help. The simple integer loop
> > works just as well.
> > 
> > I'll do some more experimenting and probably commit the ARM asm soon,
> > followed by the C code once I get some better feedback on how it
> > performs on real machines.
> 
> What about sprinkling volatile here and there?

That might help with gcc 4.8.x issues, but these are already worked
around by turning off the offending optimization, and it seems like
major GCC folks are considering these a bug and hoping to fix them in
an upcoming version anyway.

The structure assignments generating memcpy, however, are a
longstanding bug and are hard-coded in the deep target-specific code
generation. There is no indication that volatile would make them go
away (in fact, I was unable to get it to go away using volatile)
because the memcpy that's being generated is not an abstract
optimization of a series of reads/writes, but is generated directly
from the single operation of struct assignment.

While I'd like to see this bug fixed, I don't see any hope of using
struct assignments to implement a C memcpy without some serious
configure tests to make sure it works. There are just too many
combinations of optimization flags, compilers, compiler versions, etc.
that would have to be checked to have any confidence in it not
breaking, and so far it seems ARM (for which we can just use the asm)
is the only arch that would actually benefit from it.

Rich


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2013-08-11 11:27 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-08-07 18:21 Optimized C memcpy Rich Felker
2013-08-08 12:59 ` Andrew Bradford
2013-08-08 13:03   ` Andrew Bradford
2013-08-08 13:17     ` Luca Barbato
2013-08-08 15:15     ` Rich Felker
2013-08-08 20:17       ` Andre Renaud
2013-08-08 20:26         ` Rich Felker
2013-08-09  5:02 ` Rob Landley
2013-08-11  5:11 ` Optimized C memcpy [updated] Rich Felker
2013-08-11  6:20   ` Rich Felker
2013-08-11  8:13     ` Rich Felker
2013-08-11 11:14       ` Luca Barbato
2013-08-11 11:27         ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
