From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: ARM memcpy post-0.9.12-release thread
Date: Tue, 30 Jul 2013 22:26:31 -0400
Message-ID: <20130731022631.GA6655@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: musl@lists.openwall.com

Hi all (especially Andre),

I've been doing some experimenting with ARM memcpy, and I have not
found any way to beat the Bionic asm file for misaligned copies. The
best I could do with simple inline asm (reading multi-words and
writing byte-at-a-time or vice versa) improved performance by nearly
40% compared to musl's current code, but it was still less than half
the speed of the Bionic asm. For the aligned case, however, as I've
said before, the Bionic code runs 10% slower for me than the
C-with-inline-asm I posted to the list.
Commenting out the prefetch code in the Bionic version brings the
performance up to the same as my version.

I also found that the Bionic code was mysteriously crashing on the
real system I test on (it worked on my toolchain with qemu). On
further investigation, the test system's toolchain had -mthumb (with
Thumb-2) as the default; adding -marm made it work. Either way the asm
was being interpreted as ARM; the problem was that the *calling* code
being Thumb broke it. The solution was adding .type memcpy,%function
to the asm file. Without that, the linker cannot know that the symbol
it's resolving is a function name and thus that it has to adjust the
low bit of the relocated address as a flag for whether the code is ARM
or Thumb. I've now got the code working reliably, it seems.

Sizes so far:

Current C code:                   260 bytes
My best-attempt inline asm:       352 bytes
Bionic (with prefetch removed):   764 bytes

Obviously the Bionic code is a bit larger than the others and than I'd
like it to be, but it looks really hard to trim it down without
ruining performance for misaligned copies; roughly half of the asm
covers the misaligned case, which is expensive because you have three
different code paths for the different ways it can be off mod 4.

One other issue we have to consider if we go with the Bionic code is
that we'd need to add sub-arch asm dirs to use it. As-is, the code is
hard-coded for little endian. It will shuffle the byte order badly
when copying on a big endian machine.

Some rough times (128k copy repeated 10000 times):

Aligned case:

Current C code:                   1.2s
My best-attempt C code:           0.75s
My best-attempt inline asm:       0.57s
Bionic asm:                       0.63s
Bionic asm without prefetch:      0.57s

Misaligned case:

Current C code:                   4.7s
My best-attempt inline asm:       2.9s
Bionic asm:                       1.1s

Rich
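
To illustrate the endianness point above, here is a rough C sketch. It is not the Bionic or musl code. When the destination is word-aligned and the source is off by 1, 2 or 3 bytes mod 4, a word-at-a-time copy can load aligned source words and merge each adjacent pair with shifts. The shift directions depend on byte order, which is why code that hard-codes the little-endian form scrambles bytes on a big-endian machine. The names shift_copy and host_is_little_endian are hypothetical. This sketch also computes the shift counts at run time, where the three separate code paths mentioned above would presumably each use constant shifts.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/*
 * Hypothetical helper for illustration only.
 * Copies n_words 32-bit words into dst, taking them from byte offset
 * off (must be 1, 2 or 3) inside the word-aligned array src.
 * Reads src[0] through src[n_words] inclusive.
 */
static void shift_copy(uint32_t *dst, const uint32_t *src,
                       unsigned off, size_t n_words)
{
    const unsigned lo = 8 * off;   /* bits dropped from the current word */
    const unsigned hi = 32 - lo;   /* bits taken from the next word */
    const int le = host_is_little_endian();
    uint32_t cur = src[0];

    for (size_t i = 0; i < n_words; i++) {
        uint32_t next = src[i + 1];
        /* Little endian: the wanted bytes sit in the high end of cur
         * and the low end of next. Big endian is the mirror image. */
        dst[i] = le ? (cur >> lo) | (next << hi)
                    : (cur << lo) | (next >> hi);
        cur = next;
    }
}
```

Calling shift_copy(dst, src, 1, 8) should leave dst holding the same 32 bytes that a bytewise copy from (unsigned char *)src + 1 would produce. Swapping the two shift expressions gives the kind of byte shuffling described for a big endian machine.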