On Wed, Sep 14, 2016 at 10:36:45PM -0400, Rich Felker wrote:
> On Wed, Sep 14, 2016 at 07:58:52PM -0500, Rob Landley wrote:
> > On 09/14/2016 07:34 PM, Rich Felker wrote:
> > > I could put a fork of memcpy.c in sh/memcpy.c and work on it there and
> > > only merge it back to the shared one if others test it on other archs
> > > and find it beneficial (or at least not harmful).
> > 
> > Both musl and the kernel need it. And yes at the moment it seems
> > architecture-specific, but it's a _big_ performance difference...
> 
> I actually think it's justifiable to have in the generic C memcpy,
> from a standpoint that the generic C shouldn't assume an N-way (N>1,
> i.e. not direct mapped) associative cache. Just need to make sure
> changing it doesn't make gcc do something utterly idiotic for other
> archs, I guess. I'll take a look at this.

Attached is a draft memcpy I'm considering for musl. Compared to the
current one, it:

1. Copies 32 bytes per iteration, and adds barriers between the load
   phase and the store phase to preclude cache line aliasing between
   src and dest with a direct-mapped cache.

2. Equally unrolls the misaligned src/dest cases.

3. Adjusts the offsets used in the misaligned src/dest loops to all be
   multiples of 4, with the adjustments needed to make that work moved
   outside the loops. This helps compilers generate indexed addressing
   modes (e.g. @(4,Rm)) rather than having to resort to arithmetic.

4. Factors the misaligned cases into a common inline function to
   reduce code duplication.
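
For readers without the attachment, here is a rough sketch of the idea in
point 1 only. It is not the attached draft. The name memcpy_sketch, the
BARRIER macro and the u32 typedef are made up for illustration, and the
barrier relies on GCC/Clang extensions (an empty asm statement with a
"memory" clobber). The second barrier at the end of each iteration is my
own assumption, added to keep one iteration's stores from mixing with the
next iteration's loads. The misaligned-dest case is simply left to a byte
loop here, unlike points 2-4.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed GCC/Clang extensions: a may_alias word type, and an empty asm
 * statement used as a compiler-level memory barrier. */
typedef uint32_t __attribute__((__may_alias__)) u32;
#define BARRIER() __asm__ __volatile__("" : : : "memory")

/* Hypothetical illustration, not the attached draft. */
void *memcpy_sketch(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Copy bytes until src is 4-byte aligned. */
	for (; n && (uintptr_t)s % 4; n--) *d++ = *s++;

	/* Take the word path only if dest ended up aligned as well. */
	if ((uintptr_t)d % 4 == 0) {
		for (; n >= 32; s += 32, d += 32, n -= 32) {
			const u32 *ws = (const void *)s;
			u32 *wd = (void *)d;

			/* Load phase: read all 32 bytes into locals. */
			u32 w0 = ws[0], w1 = ws[1], w2 = ws[2], w3 = ws[3];
			u32 w4 = ws[4], w5 = ws[5], w6 = ws[6], w7 = ws[7];

			/* Keep the compiler from interleaving loads and stores. */
			BARRIER();

			/* Store phase: write all 32 bytes. */
			wd[0] = w0; wd[1] = w1; wd[2] = w2; wd[3] = w3;
			wd[4] = w4; wd[5] = w5; wd[6] = w6; wd[7] = w7;

			/* Assumption: also fence off the next iteration's loads. */
			BARRIER();
		}
	}

	/* Tail bytes, or everything when dest is misaligned. */
	for (; n; n--) *d++ = *s++;
	return dest;
}
```

The barrier only constrains the compiler's ordering of the accesses. It
emits no instruction, so it has no effect on any reordering done by the
hardware.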

Comments welcome.

Rich