From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3782
Path: news.gmane.org!not-for-mail
From: Rich Felker <dalias@aerifal.cx>
Newsgroups: gmane.linux.lib.musl.general
Subject: ARM memcpy post-0.9.12-release thread
Date: Tue, 30 Jul 2013 22:26:31 -0400
Message-ID: <20130731022631.GA6655@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1375237607 1828 80.91.229.3 (31 Jul 2013 02:26:47 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Wed, 31 Jul 2013 02:26:47 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-3786-gllmg-musl=m.gmane.org@lists.openwall.com Wed Jul 31 04:26:49 2013
Return-path: <musl-return-3786-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3786-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1V4M7h-0001Zv-Pq
	for gllmg-musl@plane.gmane.org; Wed, 31 Jul 2013 04:26:45 +0200
Original-Received: (qmail 28243 invoked by uid 550); 31 Jul 2013 02:26:44 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 28230 invoked from network); 31 Jul 2013 02:26:44 -0000
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:3782
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3782>

Hi all (especially Andre),

I've been experimenting with ARM memcpy, and I have not found any
way to beat the Bionic asm file for misaligned copies. The best I
could do with simple inline asm (reading multi-words and writing
byte-at-a-time, or vice versa) improved performance by nearly 40%
over musl's current code, but it still ran at less than half the
speed of the Bionic asm.

For the aligned case, however, as I've said before, the Bionic code
runs about 10% slower for me than the C-with-inline-asm I posted to
the list. Commenting out the prefetch code in the Bionic version
brings its performance level with my version.

I also found that the Bionic code was mysteriously crashing on the
real system I test on (it worked on my toolchain with qemu). On
further investigation, the test system's toolchain had -mthumb (with
thumb2) as the default; adding -marm made it work. Either way the asm
was being interpreted as ARM; the problem was that the *calling* code
being Thumb broke it. The solution was adding .type memcpy,%function
to the asm file. Without that, the linker cannot know that the symbol
it's resolving is a function name and thus that it has to adjust the
low bit of the relocated address as a flag for whether the code is ARM
or Thumb. With that change, the code now seems to work reliably.

Sizes so far:
Current C code: 260 bytes
My best-attempt inline asm: 352 bytes
Bionic (with prefetch removed): 764 bytes

The Bionic code is clearly larger than the others, and larger than
I'd like, but it looks really hard to trim down without ruining
performance for misaligned copies. Roughly half of the asm covers the
misaligned case, which is expensive because it needs three separate
code paths, one for each nonzero way the copy can be off mod 4.

One other issue we have to consider if we go with the Bionic code is
that we'd need to add sub-arch asm dirs to use it. As-is, the code is
hard-coded for little endian. It will shuffle the byte order badly
when copying on a big endian machine.
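
To make the endianness dependence concrete, here is a minimal C
sketch of the general shift-and-merge technique for misaligned word
copies. It is not taken from the Bionic file; the helper names
(load_le32, load_be32, merge_le, merge_be) are hypothetical, and the
assumption is that a misaligned word is assembled from two aligned
32-bit loads. The shift directions that are right for little endian
give the wrong bytes when applied to big-endian words.

```c
#include <assert.h>
#include <stdint.h>

/* Read 4 bytes as a little-endian word, independent of host order. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Read 4 bytes as a big-endian word, independent of host order. */
static uint32_t load_be32(const unsigned char *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/* Build the word that starts `off` bytes (1..3) into `lo`, given the
 * next aligned word `hi`. Little-endian shift directions. */
static uint32_t merge_le(uint32_t lo, uint32_t hi, unsigned off)
{
    assert(off >= 1 && off <= 3);
    return lo >> (8 * off) | hi << (32 - 8 * off);
}

/* Same operation with big-endian shift directions. */
static uint32_t merge_be(uint32_t lo, uint32_t hi, unsigned off)
{
    assert(off >= 1 && off <= 3);
    return lo << (8 * off) | hi >> (32 - 8 * off);
}
```

Each of the three offsets needs its own pair of shift counts, which
is one reason a misaligned path tends to be written three times over
in hand-tuned asm.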

Some rough times (128k copy repeated 10000 times):

Aligned case:
Current C code: 1.2s
My best-attempt C code: 0.75s
My best-attempt inline asm: 0.57s
Bionic asm: 0.63s
Bionic asm without prefetch: 0.57s

Misaligned case:
Current C code: 4.7s
My best-attempt inline asm: 2.9s
Bionic asm: 1.1s
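
For anyone wanting to reproduce this kind of measurement, a rough
sketch of a timing harness follows. It is not the harness used for
the numbers above; time_copy and copy_fn are hypothetical names, and
it assumes "misaligned" means the source is offset by 1 to 3 bytes
relative to a malloc-aligned destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Time `reps` copies of `len` bytes using `copy`, with the source
 * shifted `src_off` bytes (0..3) from its allocation. Returns CPU
 * seconds, or -1.0 on allocation failure or a bad copy. */
static double time_copy(copy_fn copy, size_t len, int reps, size_t src_off)
{
    assert(src_off < 4 && reps > 0);
    unsigned char *src = malloc(len + 4);
    unsigned char *dst = malloc(len);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    for (size_t i = 0; i < len + 4; i++)
        src[i] = (unsigned char)i;

    /* Call through a volatile pointer so the compiler cannot drop
     * or merge the repeated copies. */
    copy_fn volatile fn = copy;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(dst, src + src_off, len);
    clock_t end = clock();

    /* Check the result so a fast but wrong copy is not counted. */
    int ok = memcmp(dst, src + src_off, len) == 0;
    free(src);
    free(dst);
    return ok ? (double)(end - start) / CLOCKS_PER_SEC : -1.0;
}
```

Calling time_copy(memcpy, 128 * 1024, 10000, 0) would correspond to
the aligned case and time_copy(memcpy, 128 * 1024, 10000, 1) to one
misaligned case; any function with memcpy's signature can be passed
in place of memcpy to compare implementations.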

Rich