Date: Fri, 26 Jun 2020 10:40:49 +0200
From: Szabolcs Nagy
To: Rich Felker
Cc: musl@lists.openwall.com
Subject: Re: [musl] Release prep for 1.2.1, and afterwards
Message-ID: <20200626084049.GG2048759@port70.net>

* Rich Felker [2020-06-25 21:20:06 -0400]:
> On Thu, Jun 25, 2020 at 05:15:42PM -0400, Rich Felker wrote:
> > On Thu, Jun 25, 2020 at 04:50:24PM -0400, Rich Felker wrote:
> > > > > > but it would be nice if we could get the aarch64
> > > > > > memcpy patch in (the c implementation is really
> > > > > > slow and i've seen people compare aarch64 vs x86
> > > > > > server performance with some benchmark on alpine..)
> > > > >
> > > > > OK, I'll look again.
> > > >
> > > > thanks.
> > > >
> > > > (there are more aarch64 string functions in the
> > > > optimized-routines github repo but i think they
> > > > are not as important as memcpy/memmove/memset)
> > >
> > > I found the code. Can you comment on performance and whether memset
> > > is needed? (The C memset should be rather good already, more so
> > > than memcpy.)

the asm seems faster in all measurements, but there is a lot of
variance across the different size/alignment cases.

the avg improvement on a typical workload, and the possible
improvements i'd expect across various cases and cores:

memcpy typical:  1.6x-1.7x
memcpy possible: 1.2x-3.1x
memset typical:  1.1x-1.4x
memset possible: 1.0x-2.6x

> > Are the assumptions (v8-a, unaligned access) documented in memcpy.S
> > valid for all presently supportable aarch64?

yes, unaligned access on normal memory in userspace is valid
(it is part of the base abi on linux).

iirc a core can be configured to trap unaligned access, and it is
not valid on device memory, so e.g. such a memcpy would not work
in the kernel.
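(for context, one reason unaligned support matters so much for the
asm: small and tail copies are done with overlapping, possibly
unaligned word accesses instead of byte loops. a rough sketch of
the idea in c, not the actual asm, with a hypothetical helper name:)

#include <stdint.h>
#include <string.h>

/* copy n bytes where 8 <= n <= 16: two possibly-unaligned,
   possibly-overlapping 8-byte accesses cover the whole range;
   each builtin memcpy compiles to a single ldr/str on aarch64 */
static void copy_8_16(void *d, const void *s, size_t n)
{
	uint64_t head, tail;
	memcpy(&head, s, 8);
	memcpy(&tail, (const char *)s + n - 8, 8);
	memcpy(d, &head, 8);
	memcpy((char *)d + n - 8, &tail, 8);
}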
but avoiding unaligned access in memcpy would not be enough to make
it kernel-safe anyway: the compiler will generate an unaligned load
for

int f(char *p)
{
	int i;
	__builtin_memcpy(&i, p, sizeof i);
	return i;
}

> > A couple comments for merging if we do, that aren't hard requirements
> > but preferences:
> >
> > - I'd like to expand out the macros from ../asmdefs.h since that won't
> >   be available and they just hide things (I guess they're attractive
> >   for Apple/macho users or something but not relevant to musl) and
> >   since the symbol name lines need to be changed anyway to the public
> >   name. "Local var name" macros are ok to leave; changing them would
> >   be too error-prone and they make the code more readable anyway.

the weird macros are there so the code stays similar to glibc asm
code (which adds cfi annotations and optionally profiling hooks to
the entry points etc).

> > - I'd prefer not to have memmove logic in memcpy since it makes it
> >   larger and implies that misuse of memcpy when you mean memmove is
> >   supported usage. I'd be happy with an approach like x86 though,
> >   defining an __memcpy_fwd alias and having memmove tail call to that
> >   unless len>128 and reverse is needed, or just leaving memmove.c.

in principle the code should be called memmove, not memcpy, since it
satisfies the memmove contract, which of course works for memcpy too.
so tail calling memmove from memcpy would make more sense, but memcpy
is more performance critical than memmove, so we probably should not
add extra branches there.. (a sketch of the x86-style split is at the
end of this mail.)

> > Something like the attached.

looks good to me.
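p.s. for readers following along, the x86-style __memcpy_fwd split
would look roughly like this in c (a sketch only, not musl's actual
code; the asm memcpy would export __memcpy_fwd as a second name for
its forward-copying entry point):

#include <string.h>
#include <stdint.h>

/* second name for the forward-only asm memcpy entry point */
void *__memcpy_fwd(void *, const void *, size_t);

void *memmove(void *dest, const void *src, size_t n)
{
	char *d = dest;
	const char *s = src;

	/* if dest is below src, or the regions do not overlap, a
	   forward copy is safe; the unsigned subtraction wraps when
	   d < s, so both checks collapse into one comparison */
	if ((uintptr_t)d - (uintptr_t)s >= n)
		return __memcpy_fwd(dest, src, n);

	/* overlap with dest above src: copy backwards (byte loop
	   as a placeholder for a real reverse copy) */
	while (n--) d[n] = s[n];
	return dest;
}

this keeps the overlap branch entirely out of memcpy, which is the
point of the split.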