Date: Fri, 26 Jun 2020 10:40:49 +0200
From: Szabolcs Nagy
To: Rich Felker
Cc: musl@lists.openwall.com
Subject: Re: [musl] Release prep for 1.2.1, and afterwards
Message-ID: <20200626084049.GG2048759@port70.net>

* Rich Felker [2020-06-25 21:20:06 -0400]:
> On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > but it would be nice if we could get the aarch64
> > > > > > memcpy patch in (the c implementation is really
> > > > > > slow and i've seen people compare aarch64 vs x86
> > > > > > server performance with some benchmark on alpine..)
> > > > >
> > > > > OK, I'll look again.
> > > >
> > > > thanks.
> > > >
> > > > (there are more aarch64 string functions in the
> > > > optimized-routines github repo but i think they
> > > > are not as important as memcpy/memmove/memset)
> > >
> > > I found the code. Can you comment on performance and whether memset
> > > is needed? (The C memset should be rather good already, more so
> > > than memcpy.)

the asm seems faster in all measurements, but there is a lot of
variance across the different size/alignment cases.

the avg improvement on a typical workload, and the possible
improvements i'd expect across various cases and cores:

memcpy typical:  1.6x-1.7x
memcpy possible: 1.2x-3.1x
memset typical:  1.1x-1.4x
memset possible: 1.0x-2.6x

> > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > valid for all presently supportable aarch64?

yes, unaligned access on normal memory in userspace is valid
(it is part of the base abi on linux).

iirc a core can be configured to trap unaligned access, and it is
not valid on device memory, so e.g. such a memcpy would not work
in the kernel.
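(for context, one reason unaligned support matters so much for the
asm: small and tail copies are done with overlapping, possibly
unaligned word accesses instead of byte loops. a rough sketch of
the idea in c, not the actual asm, with a hypothetical helper name:)

#include <stdint.h>
#include <string.h>

/* copy n bytes where 8 <= n <= 16: two possibly-unaligned,
   possibly-overlapping 8-byte accesses cover the whole range;
   each builtin memcpy compiles to a single ldr/str on aarch64 */
static void copy_8_16(void *d, const void *s, size_t n)
{
	uint64_t head, tail;
	memcpy(&head, s, 8);
	memcpy(&tail, (const char *)s + n - 8, 8);
	memcpy(d, &head, 8);
	memcpy((char *)d + n - 8, &tail, 8);
}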
but avoiding unaligned access in memcpy would not be enough to make
it kernel-safe anyway: the compiler will generate an unaligned load
for

int f(char *p)
{
	int i;
	__builtin_memcpy(&i, p, sizeof i);
	return i;
}

> > A couple comments for merging if we do, that aren't hard requirements
> > but preferences:
> >
> > - I'd like to expand out the macros from ../asmdefs.h since that won't
> >   be available and they just hide things (I guess they're attractive
> >   for Apple/macho users or something but not relevant to musl) and
> >   since the symbol name lines need to be changed anyway to the public
> >   name. "Local var name" macros are ok to leave; changing them would
> >   be too error-prone and they make the code more readable anyway.

the weird macros are there so the code stays similar to glibc asm
code (which adds cfi annotations and optionally profiling hooks to
the entry points etc).

> > - I'd prefer not to have memmove logic in memcpy since it makes it
> >   larger and implies that misuse of memcpy when you mean memmove is
> >   supported usage. I'd be happy with an approach like x86 though,
> >   defining an __memcpy_fwd alias and having memmove tail call to that
> >   unless len>128 and reverse is needed, or just leaving memmove.c.

in principle the code should be called memmove, not memcpy, since it
satisfies the memmove contract, which of course works for memcpy too.
so tail calling memmove from memcpy would make more sense, but memcpy
is more performance critical than memmove, so we probably should not
add extra branches there.. (a sketch of the x86-style split is at the
end of this mail.)

> > Something like the attached.

looks good to me.
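p.s. for readers following along, the x86-style __memcpy_fwd split
would look roughly like this in c (a sketch only, not musl's actual
code; the asm memcpy would export __memcpy_fwd as a second name for
its forward-copying entry point):

#include <string.h>
#include <stdint.h>

/* second name for the forward-only asm memcpy entry point */
void *__memcpy_fwd(void *, const void *, size_t);

void *memmove(void *dest, const void *src, size_t n)
{
	char *d = dest;
	const char *s = src;

	/* if dest is below src, or the regions do not overlap, a
	   forward copy is safe; the unsigned subtraction wraps when
	   d < s, so both checks collapse into one comparison */
	if ((uintptr_t)d - (uintptr_t)s >= n)
		return __memcpy_fwd(dest, src, n);

	/* overlap with dest above src: copy backwards (byte loop
	   as a placeholder for a real reverse copy) */
	while (n--) d[n] = s[n];
	return dest;
}

this keeps the overlap branch entirely out of memcpy, which is the
point of the split.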