mailing list of musl libc
* Re: [J-core] Aligned copies and cacheline conflicts?
       [not found] <0c256cb1-d0fa-9a5a-3976-b7ef545c1827@landley.net>
@ 2016-09-15  0:34 ` Rich Felker
  2016-09-15  0:58   ` Rob Landley
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2016-09-15  0:34 UTC (permalink / raw)
  To: Rob Landley; +Cc: j-core, musl

On Tue, Sep 13, 2016 at 07:21:58PM -0500, Rob Landley wrote:
> There was a discussion at one point about how reading from and writing
> to an aliased cache line (anything at an 8k offset) would cause horrible
> performance (the write cacheline evicting the read cacheline and vice
> versa), how this was a more common problem than 1 in 256 because things
> are often page aligned, and how a workaround was to have memcpy read
> data into 8 registers and then write it out again from those 8 registers
> to avoid the ping-pong.
> 
> Question: did that memcpy change actually go into musl and the kernel?
> (Seems like both would need it...) If so, what do I have to make sure
> I've pulled to get them?
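
For reference, the trick under discussion looks roughly like this -- an
illustrative sketch only, not the actual patch, and the function name is
made up:

#include <stdint.h>

/* Illustrative only: copy 32 bytes by loading a whole batch into
 * registers before storing any of it.  With a direct-mapped cache,
 * interleaving one load with one store lets each store to dst evict
 * the line just fetched from src whenever the two alias (e.g. an 8k
 * offset); batching the loads avoids that ping-pong. */
static void copy32_batched(uint32_t *dst32, const uint32_t *src32)
{
	uint32_t t0 = src32[0], t1 = src32[1], t2 = src32[2], t3 = src32[3];
	uint32_t t4 = src32[4], t5 = src32[5], t6 = src32[6], t7 = src32[7];
	dst32[0] = t0; dst32[1] = t1; dst32[2] = t2; dst32[3] = t3;
	dst32[4] = t4; dst32[5] = t5; dst32[6] = t6; dst32[7] = t7;
}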

This has not gone upstream yet mainly because:

1) I'm not sure if it's a good idea for other archs that use the
generic C memcpy.

2) Handling the misaligned-mod-4 cases this way as well would take a
lot of extra code, and is unlikely to help much anyway since that case
doesn't arise from page alignment, so it's not clear whether it's worth
doing.

I could put a fork of memcpy.c in sh/memcpy.c and work on it there and
only merge it back to the shared one if others test it on other archs
and find it beneficial (or at least not harmful).

Rich



* Re: [J-core] Aligned copies and cacheline conflicts?
  2016-09-15  0:34 ` [J-core] Aligned copies and cacheline conflicts? Rich Felker
@ 2016-09-15  0:58   ` Rob Landley
  2016-09-15  2:36     ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Rob Landley @ 2016-09-15  0:58 UTC (permalink / raw)
  To: Rich Felker; +Cc: j-core, musl

On 09/14/2016 07:34 PM, Rich Felker wrote:
> I could put a fork of memcpy.c in sh/memcpy.c and work on it there and
> only merge it back to the shared one if others test it on other archs
> and find it beneficial (or at least not harmful).

Both musl and the kernel need it. And yes at the moment it seems
architecture-specific, but it's a _big_ performance difference...

Rob



* Re: Re: [J-core] Aligned copies and cacheline conflicts?
  2016-09-15  0:58   ` Rob Landley
@ 2016-09-15  2:36     ` Rich Felker
  2016-09-16 22:16       ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2016-09-15  2:36 UTC (permalink / raw)
  To: Rob Landley; +Cc: j-core, musl

On Wed, Sep 14, 2016 at 07:58:52PM -0500, Rob Landley wrote:
> On 09/14/2016 07:34 PM, Rich Felker wrote:
> > I could put a fork of memcpy.c in sh/memcpy.c and work on it there and
> > only merge it back to the shared one if others test it on other archs
> > and find it beneficial (or at least not harmful).
> 
> Both musl and the kernel need it. And yes at the moment it seems
> architecture-specific, but it's a _big_ performance difference...

I actually think it's justifiable to have in the generic C memcpy,
from a standpoint that the generic C shouldn't assume an N-way (N>1,
i.e. not direct mapped) associative cache. Just need to make sure
changing it doesn't make gcc do something utterly idiotic for other
archs, I guess. I'll take a look at this.

Rich



* Re: Re: [J-core] Aligned copies and cacheline conflicts?
  2016-09-15  2:36     ` Rich Felker
@ 2016-09-16 22:16       ` Rich Felker
  2016-09-17  1:40         ` Rob Landley
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2016-09-16 22:16 UTC (permalink / raw)
  To: Rob Landley; +Cc: j-core, musl

[-- Attachment #1: Type: text/plain, Size: 1556 bytes --]

On Wed, Sep 14, 2016 at 10:36:45PM -0400, Rich Felker wrote:
> On Wed, Sep 14, 2016 at 07:58:52PM -0500, Rob Landley wrote:
> > On 09/14/2016 07:34 PM, Rich Felker wrote:
> > > I could put a fork of memcpy.c in sh/memcpy.c and work on it there and
> > > only merge it back to the shared one if others test it on other archs
> > > and find it beneficial (or at least not harmful).
> > 
> > Both musl and the kernel need it. And yes at the moment it seems
> > architecture-specific, but it's a _big_ performance difference...
> 
> I actually think it's justifiable to have in the generic C memcpy,
> from a standpoint that the generic C shouldn't assume an N-way (N>1,
> i.e. not direct mapped) associative cache. Just need to make sure
> changing it doesn't make gcc do something utterly idiotic for other
> archs, I guess. I'll take a look at this.

Attached is a draft memcpy I'm considering for musl. Compared to the
current one, it:

1. Works on 32 bytes per iteration, and adds barriers between the load
   phase and store phase to preclude cache line aliasing between src
   and dest with a direct-mapped cache.

2. Equally unrolls the misaligned src/dest cases.

3. Adjusts the offsets used in the misaligned src/dest loops to all be
   multiples of 4, with the adjustments to make that work outside the
   loops. This helps compilers generate indexed addressing modes (e.g.
   @(4,Rm)) rather than having to resort to arithmetic.

4. Factors the misaligned cases into a common inline function to
   reduce code duplication.

Comments welcome.

Rich

[-- Attachment #2: memcpy-draft.c --]
[-- Type: text/plain, Size: 2704 bytes --]

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <endian.h>

#ifdef __GNUC__

#if __BYTE_ORDER == __LITTLE_ENDIAN
#define LS >>
#define RS <<
#else
#define LS <<
#define RS >>
#endif

typedef uint32_t __attribute__((__may_alias__)) u32;
typedef uint16_t __attribute__((__may_alias__)) u16;

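/* Copy one 32-byte block when src and dest alignments differ mod 4.
 * w is the source word already loaded by the caller (or by the previous
 * block); all eight loads below are issued before any store, so on a
 * direct-mapped cache the stores cannot evict source lines that are
 * still needed.  Returns the last word loaded, to be passed back in as
 * w for the next block. */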
static inline uint32_t shifted_block_copy(unsigned char *d, const unsigned char *s, uint32_t w, int ls)
{
	int rs = 32-ls;
	uint32_t t1 = *(u32 *)(s+4);
	uint32_t t2 = *(u32 *)(s+8);
	uint32_t t3 = *(u32 *)(s+12);
	uint32_t t4 = *(u32 *)(s+16);
	uint32_t t5 = *(u32 *)(s+20);
	uint32_t t6 = *(u32 *)(s+24);
	uint32_t t7 = *(u32 *)(s+28);
	uint32_t t8 = *(u32 *)(s+32);
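	/* Empty asm with a "memory" clobber: a compiler barrier that keeps
	 * every load above ahead of every store below. */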
	__asm__ __volatile__ ( "" : : "r"(s), "r"(d) : "memory" );
	*(u32 *)(d) = (w LS ls) | (t1 RS rs);
	*(u32 *)(d+4) = (t1 LS ls) | (t2 RS rs);
	*(u32 *)(d+8) = (t2 LS ls) | (t3 RS rs);
	*(u32 *)(d+12) = (t3 LS ls) | (t4 RS rs);
	*(u32 *)(d+16) = (t4 LS ls) | (t5 RS rs);
	*(u32 *)(d+20) = (t5 LS ls) | (t6 RS rs);
	*(u32 *)(d+24) = (t6 LS ls) | (t7 RS rs);
	*(u32 *)(d+28) = (t7 LS ls) | (t8 RS rs);
	return t8;
}

#endif

void *memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

#ifdef __GNUC__

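	/* Copy leading bytes until the source pointer is word-aligned. */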
	for (; (uintptr_t)s % 4 && n; n--) *d++ = *s++;

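	/* If the destination is word-aligned too: copy 32-byte blocks with
	 * all loads issued before the stores (same rationale as the barrier
	 * in shifted_block_copy), then a word tail and a byte tail. */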
	if ((uintptr_t)d % 4 == 0) {
		size_t c32 = n>>5, c4 = (n&31)>>2, c1=n&3;
		for (; c32; c32--, s+=32, d+=32) {
			uint32_t t0 = *(u32 *)(s+0);
			uint32_t t1 = *(u32 *)(s+4);
			uint32_t t2 = *(u32 *)(s+8);
			uint32_t t3 = *(u32 *)(s+12);
			uint32_t t4 = *(u32 *)(s+16);
			uint32_t t5 = *(u32 *)(s+20);
			uint32_t t6 = *(u32 *)(s+24);
			uint32_t t7 = *(u32 *)(s+28);
			__asm__ __volatile__ ( "" : : "r"(s), "r"(d) : "memory" );
			*(u32 *)(d+0) = t0;
			*(u32 *)(d+4) = t1;
			*(u32 *)(d+8) = t2;
			*(u32 *)(d+12) = t3;
			*(u32 *)(d+16) = t4;
			*(u32 *)(d+20) = t5;
			*(u32 *)(d+24) = t6;
			*(u32 *)(d+28) = t7;
		}
		for (; c4; c4--, s+=4, d+=4) {
			*(u32 *)d = *(u32 *)s;
		}
		for (; c1; c1--, s++, d++) {
			*d = *s;
		}
		return dest;
	}

	if (!n) return dest;

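	/* Destination misaligned mod 4: copy 1-3 bytes to align it, then use
	 * shifted 32-byte blocks; w carries the pending source word from one
	 * block to the next. */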
	size_t c32 = n>=36 ? (n-4)>>5 : 0;
	uint32_t w = *(u32 *)s;

	n -= (c32<<5);

	if (c32) switch ((uintptr_t)d % 4) {
	case 1:
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d += 3;
		n -= 3;
		for (; c32; c32--, s+=32, d+=32)
			w = shifted_block_copy(d, s, w, 24);
		s += 3;
		break;
	case 2:
		*(u16 *)d = *(u16 *)s;
		d += 2;
		n -= 2;
		for (; c32; c32--, s+=32, d+=32)
			w = shifted_block_copy(d, s, w, 16);
		s += 2;
		break;
	case 3:
		d[0] = s[0];
		d += 1;
		n -= 1;
		for (; c32; c32--, s+=32, d+=32)
			w = shifted_block_copy(d, s, w, 8);
		s += 1;
		break;
	}
#endif

	for (; n; n--) *d++ = *s++;
	return dest;
}



* Re: Re: [J-core] Aligned copies and cacheline conflicts?
  2016-09-16 22:16       ` Rich Felker
@ 2016-09-17  1:40         ` Rob Landley
  2016-09-17  2:17           ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Rob Landley @ 2016-09-17  1:40 UTC (permalink / raw)
  To: Rich Felker; +Cc: j-core, musl

On 09/16/2016 05:16 PM, Rich Felker wrote:
> Attached is a draft memcpy I'm considering for musl. Compared to the
> current one, it:
> 
> 1. Works on 32 bytes per iteration, and adds barriers between the load
>    phase and store phase to preclude cache line aliasing between src
>    and dest with a direct-mapped cache.
> 
> 2. Equally unrolls the misaligned src/dest cases.
> 
> 3. Adjusts the offsets used in the misaligned src/dest loops to all be
>    multiples of 4, with the adjustments to make that work outside the
>    loops. This helps compilers generate indexed addressing modes (e.g.
>    @(4,Rm)) rather than having to resort to arithmetic.
> 
> 4. Factors the misaligned cases into a common inline function to
>    reduce code duplication.
> 
> Comments welcome.

Superficial comments first:

I know the compiler's probably smart enough to convert %4 into &3, but
given that the point is performance optimization I'd have thought you'd
be explicit about what the machine should be doing?

Both chunks of code have their own 8 register read and 8 register write
(one is 0-7, one is 1-8).

Design comments:

Instead of optimized per-target assembly, you have an #ifdef gnuc
wrapped around just under 70 lines of C code with an __asm__
__volatile__ blob in the middle, calling a 20 line C function. Because
presumably on sh this will produce roughly the same workaround for the
primitive cache architecture, only now you're doing it indirectly and
applying it to everybody.

The motivation for this is that j2 has a more primitive cache
architecture than normal these days, so it needs an optimization most
other chips don't. This is "generic" so it'll be built on
register-constrained 32 bit x86, and on 64 bit systems where it should
presumably be using u64 not u32.

And of course gcc inlines its own version unless you hit it with a brick
anyway, so solving it in musl is of questionable utility.

I'm not sure you're focusing on the right problem?

> Rich

Rob



* Re: Re: [J-core] Aligned copies and cacheline conflicts?
  2016-09-17  1:40         ` Rob Landley
@ 2016-09-17  2:17           ` Rich Felker
  0 siblings, 0 replies; 6+ messages in thread
From: Rich Felker @ 2016-09-17  2:17 UTC (permalink / raw)
  To: Rob Landley; +Cc: j-core, musl

On Fri, Sep 16, 2016 at 08:40:05PM -0500, Rob Landley wrote:
> On 09/16/2016 05:16 PM, Rich Felker wrote:
> > Attached is a draft memcpy I'm considering for musl. Compared to the
> > current one, it:
> > 
> > 1. Works on 32 bytes per iteration, and adds barriers between the load
> >    phase and store phase to preclude cache line aliasing between src
> >    and dest with a direct-mapped cache.
> > 
> > 2. Equally unrolls the misaligned src/dest cases.
> > 
> > 3. Adjusts the offsets used in the misaligned src/dest loops to all be
> >    multiples of 4, with the adjustments to make that work outside the
> >    loops. This helps compilers generate indexed addressing modes (e.g.
> >    @(4,Rm)) rather than having to resort to arithmetic.
> > 
> > 4. Factors the misaligned cases into a common inline function to
> >    reduce code duplication.
> > 
> > Comments welcome.
> 
> Superficial comments first:
> 
> I know the compiler's probably smart enough to convert %4 into &3, but
> given that the point is performance optimization I'd have thought you'd
> be explicit about what the machine should be doing?

Generally this is the style preferred in musl. If a compiler fails
that badly...

> Both chunks of code have their own 8 register read and 8 register write
> (one is 0-7, one is 1-8).

Yes, the shifted one originally had the 'w' variable named t0. It
works with 9 words (0-8) rather than 8 because the destination blocks
span partial source words. Maybe I should bring back the t0 name or
rename them though.

> Design comments:
> 
> Instead of optimized per-target assembly, you have an #ifdef gnuc

The code is inside a __GNUC__ conditional because it cannot be written
in plain C; it needs the __may_alias__ attribute to be able to access
arbitrary-type buffers as uint32_t arrays. This is no change from the
existing memcpy.c in musl.
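
For illustration, the difference is roughly this (a standalone sketch,
not part of the patch):

#include <stdint.h>

/* Accessing arbitrary storage through a plain uint32_t * violates the
 * strict-aliasing rules, so the optimizer may assume such accesses
 * don't overlap differently-typed objects and reorder or drop them.
 * The attribute tells gcc the type may alias anything, the same way a
 * character type does, which is what memcpy needs. */
typedef uint32_t __attribute__((__may_alias__)) u32;

static uint32_t first_word(const void *p)
{
	return *(const u32 *)p;	/* well-defined with the attribute */
}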

> wrapped around just under 70 lines of C code with an __asm__
> __volatile__ blob in the middle, calling a 20 line C function. Because
> presumably on sh this will produce roughly the same workaround for the
> primitive cache architecture, only now you're doing it indirectly and
> applying it to everybody.

I don't see it as a "workaround" but rather a memcpy with fewer
assumptions -- it removes the assumption that writes to the
destination don't interfere with caching of the source.

> The motivation for this is that j2 has a more primitive cache
> architecture than normal these days, so it needs an optimization most
> other chips don't. This is "generic" so it'll be built on
> register-constrained 32 bit x86, and on 64 bit systems where it should
> presumably be using u64 not u32.

musl has optimized asm for a few first-class archs where I've been
completely unable to get the compiler to reliably generate a suitable
memcpy -- presently that's x86[_64] and arm. Up until recently, x86_64
was the only 64-bit arch we had, and it had its own memcpy. Now that
there's aarch64, mips64, powerpc64, and soon s390x, I think it would
be a good idea to revisit making the generic memcpy.c 64-bit friendly.
But that's a separate task and does not conflict with any work being
done now.

So -- back to the point -- musl's generic C implementation of memcpy
is intended to have minimal assumptions on the cpu architecture while
still producing something very fast, and it generally does. Its
performance characteristics are mostly oriented towards lower-end risc
archs where you have a decent number of registers, no support for
misaligned loads/stores, and where shift-into-place with word-sized
loads/stores is the proper strategy.
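
Reduced to a single word, the shift-into-place strategy looks roughly
like this (little-endian shown; load_shifted is just an illustrative
name):

#include <stdint.h>

typedef uint32_t __attribute__((__may_alias__)) u32;

/* Produce the 4 bytes that start "off" bytes (1..3) into the aligned
 * word sa[0], using only aligned loads; sa[0] and sa[1] are both
 * assumed to be readable.  Big-endian would swap the shift directions. */
static inline uint32_t load_shifted(const u32 *sa, unsigned off)
{
	return (sa[0] >> (8*off)) | (sa[1] << (32 - 8*off));
}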

> And of course gcc inlines its own version unless you hit it with a brick
> anyway, so solving it in musl is of questionable utility.

Only for cases where the size is constant and various other conditions
are met. The fact that it does this even for large sizes is a bug and
it will be fixed. Almost all archs admit a faster bulk memcpy than what
you can reasonably inline, so gcc simply should not be doing this, and
fixing it is necessary anyway if we want to allow vdso-provided dma
memcpy.

> I'm not sure you're focusing on the right problem?

Yes, for sh/j2, but there are other open problems for improving musl's
generic memcpy now that we have a wider variety of archs.

Regarding the barrier that makes memcpy safe against direct-mapped
caches like the j2 one, it could be turned on/off with a simple macro
definition provided by the arch files if it harms performance on other
archs, but I don't think it does.

We could likewise add such a macro for "arch allows misaligned
loads/stores" to optimize out the shift-into-place code on archs that
don't need it and allow a faster direct load/store (the "mod4=0" case
in the current code) to be used for all alignments on such archs.
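
Concretely it could look something like the following -- both macro
names are purely illustrative, and the barrier itself is the same one
already in the draft, just gated:

/* Arch files would define these; the names here are made up. */
#ifdef MEMCPY_NEED_LOAD_STORE_BARRIER	/* e.g. direct-mapped caches (j2) */
#define LOAD_STORE_BARRIER(s,d) \
	__asm__ __volatile__ ( "" : : "r"(s), "r"(d) : "memory" )
#else
#define LOAD_STORE_BARRIER(s,d) ((void)0)
#endif

#ifdef MEMCPY_ALLOW_MISALIGNED	/* arch handles misaligned word access */
/* ...take the direct load/store path for all alignments and skip the
 * shift-into-place code entirely... */
#endif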

One of the changes made in this update, moving the shifted block copy
into its own (inline) function, makes it easier to move to using
native wordsize rather than hard-coding 32-bit. The main reason the
current code in musl hard-coded 32-bit was that I got tired of writing
the (written out in duplicate) cases for 8 different alignments mod 8.
But now it's just a few lines per case with little opportunity for
introducing errors. In terms of code size, "twice as many cases but
half the number of loads/stores per case" should be fairly neutral.

Rich

