On Wed, Sep 14, 2016 at 10:36:45PM -0400, Rich Felker wrote:
> On Wed, Sep 14, 2016 at 07:58:52PM -0500, Rob Landley wrote:
> > On 09/14/2016 07:34 PM, Rich Felker wrote:
> > > I could put a fork of memcpy.c in sh/memcpy.c and work on it there and
> > > only merge it back to the shared one if others test it on other archs
> > > and find it beneficial (or at least not harmful).
> >
> > Both musl and the kernel need it. And yes at the moment it seems
> > architecture-specific, but it's a _big_ performance difference...
>
> I actually think it's justifiable to have in the generic C memcpy,
> from a standpoint that the generic C shouldn't assume an N-way (N>1,
> i.e. not direct mapped) associative cache. Just need to make sure
> changing it doesn't make gcc do something utterly idiotic for other
> archs, I guess. I'll take a look at this.

Attached is a draft memcpy I'm considering for musl. Compared to the
current one, it:

1. Works on 32 bytes per iteration, and adds barriers between the load
   phase and store phase to preclude cache line aliasing between src
   and dest with a direct-mapped cache.

2. Equally unrolls the misaligned src/dest cases.

3. Adjusts the offsets used in the misaligned src/dest loops to all be
   multiples of 4, with the adjustments to make that work outside the
   loops. This helps compilers generate indexed addressing modes (e.g.
   @(4,Rm)) rather than having to resort to arithmetic.

4. Factors the misaligned cases into a common inline function to
   reduce code duplication.

Comments welcome.

Rich
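
For readers without the attachment, here is a rough sketch of what item 1
describes: a 32-byte-per-iteration loop with a separate load phase and
store phase. It is not the attached draft. The function name
copy32_sketch, the u32 typedef, and the BARRIER macro are hypothetical,
the barrier is assumed to be a compiler-only barrier in GCC/Clang
extended-asm syntax, and the misaligned cases (items 2 to 4) are simply
handled bytewise here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical alias-safe 32-bit word type (GCC/Clang extension). */
typedef uint32_t __attribute__((__may_alias__)) u32;

/* Assumed compiler-only barrier: keeps the compiler from moving memory
 * accesses across this point. It emits no instruction. */
#define BARRIER() __asm__ __volatile__("" ::: "memory")

/* Sketch only: copies n bytes from src to dest (non-overlapping). */
void *copy32_sketch(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Use the word loop only when both pointers are 4-byte aligned. */
	if ((uintptr_t)d % 4 == 0 && (uintptr_t)s % 4 == 0) {
		for (; n >= 32; s += 32, d += 32, n -= 32) {
			const u32 *ws = (const void *)s;
			u32 *wd = (void *)d;

			/* Load phase: read all 32 bytes of src first. */
			u32 a = ws[0], b = ws[1], c = ws[2], e = ws[3];
			u32 f = ws[4], g = ws[5], h = ws[6], i = ws[7];

			BARRIER(); /* no store may be hoisted above the loads */

			/* Store phase: only now touch dest. */
			wd[0] = a; wd[1] = b; wd[2] = c; wd[3] = e;
			wd[4] = f; wd[5] = g; wd[6] = h; wd[7] = i;

			BARRIER(); /* next iteration's loads stay below the stores */
		}
	}

	/* Tail bytes, and the whole copy in the misaligned case. */
	while (n--) *d++ = *s++;
	return dest;
}
```

The point of grouping the accesses is that, with a direct-mapped cache,
src and dest lines that map to the same cache slot would otherwise evict
each other on every alternating load and store. Batching the loads and
then the stores limits that to roughly once per phase.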