From mboxrd@z Thu Jan 1 00:00:00 1970
From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: ARM memcpy post-0.9.12-release thread
Date: Fri, 2 Aug 2013 16:41:47 -0400
Message-ID: <20130802204146.GO221@brightrain.aerifal.cx>
References: <20130731022631.GA6655@brightrain.aerifal.cx>
In-Reply-To: <20130731022631.GA6655@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com
Cc: Andre Renaud

Andre, do you have any input on this? (Cc'ing)

Rich

On Tue, Jul 30, 2013 at 10:26:31PM -0400, Rich Felker wrote:
> Hi all (especially Andre),
>
> I've been doing some experimenting with ARM memcpy, and I have not
> found any way to beat the Bionic asm file for misaligned copies.
> The best I could do with simple inline asm (reading multi-words and
> writing byte-at-a-time or vice versa) improved the performance nearly
> 40% compared to musl's current code, but it was still worse than half
> the speed of the Bionic asm.
>
> For the aligned case, however, as I've said before, the Bionic code
> runs 10% slower for me than the C-with-inline-asm I posted to the
> list. Commenting out the prefetch code in the Bionic version brings
> the performance up to the same as my version.
>
> I also found that the Bionic code was mysteriously crashing on the
> real system I test on (it worked on my toolchain with qemu). On
> further investigation, the test system's toolchain had -mthumb (with
> thumb2) as the default; adding -marm made it work. Both ways the asm
> was being interpreted as arm; the problem was that the *calling* code
> being thumb broke it. The solution was adding .type memcpy,%function
> to the asm file. Without that, the linker cannot know that the symbol
> it's resolving is a function name and thus that it has to adjust the
> low bit of the relocated address as a flag for whether the code is arm
> or thumb. I've now got the code working reliably it seems.
>
> Sizes so far:
> Current C code: 260 bytes
> My best-attempt inline asm: 352 bytes
> Bionic (with prefetch removed): 764 bytes
>
> Obviously the Bionic code is a bit larger than the others and than I'd
> like it to be, but it looks really hard to trim it down without
> ruining performance for misaligned copies; roughly half of the asm
> covers the misaligned case, which is expensive because you have three
> different code paths for different ways it can be off mod 4.
>
> One other issue we have to consider if we go with the Bionic code is
> that we'd need to add sub-arch asm dirs to use it. As-is, the code is
> hard-coded for little endian. It will shuffle the byte order badly
> when copying on a big endian machine.
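To illustrate the endianness point above: a misaligned word copy that uses only aligned loads has to build each output word from two neighbouring source words with a pair of shifts, and the shift directions swap between little and big endian. What follows is a minimal C sketch of that idea, not the Bionic code. The names copy_shifted, is_little_endian and shifted_copy_matches are hypothetical, and the source is passed as an aligned word array plus a byte offset (an assumption made to keep the example free of pointer casts).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Runtime endianness probe: the first byte of the value 1 is 1 only on
 * little endian machines. */
static int is_little_endian(void)
{
    uint16_t probe = 1;
    return *(unsigned char *)&probe;
}

/* Hypothetical helper: write nwords words to dst, taking the bytes that
 * start `off` bytes (0..3) into src[0]. Only whole aligned words are read,
 * so src must hold nwords + 1 words whenever off is nonzero. */
void copy_shifted(uint32_t *dst, const uint32_t *src, unsigned off, size_t nwords)
{
    unsigned lo = 8 * off;   /* bits to discard from the first word */
    unsigned hi = 32 - lo;   /* bits to take from the following word */
    int le = is_little_endian();

    for (size_t i = 0; i < nwords; i++) {
        if (off == 0)
            dst[i] = src[i];                           /* aligned: plain copy */
        else if (le)
            dst[i] = (src[i] >> lo) | (src[i + 1] << hi);
        else
            dst[i] = (src[i] << lo) | (src[i + 1] >> hi);
    }
}

/* Self-check: compare the shifted copy against a plain byte-wise memcpy.
 * Returns 1 on a match. off must be in 0..3. */
int shifted_copy_matches(unsigned off)
{
    unsigned char bytes[20];
    uint32_t in[5], out[4];

    for (size_t i = 0; i < sizeof bytes; i++)
        bytes[i] = (unsigned char)(i + 1);
    memcpy(in, bytes, sizeof in);
    copy_shifted(out, in, off, 4);
    return memcmp(out, bytes + off, sizeof out) == 0;
}
```

Code that hard-codes only the little endian branch would produce scrambled words on a big endian machine. In asm the shift amounts are typically immediates, which is one plausible reason for a separate path per nonzero offset.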
>
> Some rough times (128k copy repeated 10000 times):
>
> Aligned case:
> Current C code: 1.2s
> My best-attempt C code: 0.75s
> My best-attempt inline asm: 0.57s
> Bionic asm: 0.63s
> Bionic asm without prefetch: 0.57s
>
> Misaligned case:
> Current C code: 4.7s
> My best-attempt inline asm: 2.9s
> Bionic asm: 1.1s
>
> Rich
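For anyone wanting to reproduce numbers of this kind, here is a rough sketch of a timing loop for a 128k copy repeated a given number of times. It is not the harness used for the figures above, which is not shown in the thread. The names time_copies and copy_fn are hypothetical, and the choices to misalign only the source pointer and to measure CPU time with clock() are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define COPY_SIZE ((size_t)128 * 1024)   /* 128k per copy */

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Time `reps` copies of COPY_SIZE bytes using `fn`, with the source pointer
 * shifted by src_off bytes (0 = aligned, 1..3 = misaligned mod 4).
 * Returns CPU seconds, or -1.0 on allocation failure, a bad offset, or a
 * copy that does not match the source. */
double time_copies(copy_fn fn, size_t src_off, long reps)
{
    /* Calling through a volatile pointer keeps the compiler from inlining
     * or dropping the copies. */
    copy_fn volatile call = fn;
    unsigned char *src = malloc(COPY_SIZE + 4);
    unsigned char *dst = calloc(COPY_SIZE, 1);
    double secs = -1.0;

    if (src && dst && src_off < 4) {
        for (size_t i = 0; i < COPY_SIZE + 4; i++)
            src[i] = (unsigned char)(i * 31u);   /* non-trivial fill pattern */

        clock_t t0 = clock();
        for (long r = 0; r < reps; r++)
            call(dst, src + src_off, COPY_SIZE);
        clock_t t1 = clock();

        /* Only report a time if the last copy is correct. */
        if (memcmp(dst, src + src_off, COPY_SIZE) == 0)
            secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    }
    free(src);
    free(dst);
    return secs;
}
```

Calling time_copies(memcpy, 0, 10000) and time_copies(memcpy, 1, 10000) would correspond to the aligned and misaligned cases. A candidate implementation can be passed in place of memcpy as long as it has the same signature.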