From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3628
Path: news.gmane.org!not-for-mail
From: Rich Felker <dalias@aerifal.cx>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Thinking about release
Date: Thu, 11 Jul 2013 08:46:13 -0400
Message-ID: <20130711124613.GO29800@brightrain.aerifal.cx>
References: <CAPfzE3a0h=2NFqgnBqXj3J2q7VgYjqZ19Ab=0LAe5u5SvWXHaA@mail.gmail.com>
 <20130613014314.GC29800@brightrain.aerifal.cx>
 <CAPfzE3aerGrdmTkj15o0CTVtt8TZpTyAnSAj1Joau+Jb_cNGUA@mail.gmail.com>
 <20130709053711.GO29800@brightrain.aerifal.cx>
 <CAPfzE3ZTxynUeJjq7KWijZGhsV==NymW4vqLhnQbEYCXRxVf-g@mail.gmail.com>
 <CAPfzE3ZsMpC9d4VDZyHabhKOffOQW0dnG7Nwpm8EqVBLUXNZKg@mail.gmail.com>
 <CAPfzE3YDFjqHxRaZFeiy0CvbYWYGKzgDGEp-71xSz-03GhNTxw@mail.gmail.com>
 <20130711033754.GL29800@brightrain.aerifal.cx>
 <CAPfzE3ZMGwEvs2n_4LCKzMv0FROS55_1N+HdBw7HgNhexgM+eA@mail.gmail.com>
 <CAPfzE3aoD4mpO9RrV-enuXxkCvMPY_7rEE6e9w8NuX-ntEqtqA@mail.gmail.com>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1373546791 11178 80.91.229.3 (11 Jul 2013 12:46:31 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 11 Jul 2013 12:46:31 +0000 (UTC)
To: musl@lists.openwall.com
Original-X-From: musl-return-3632-gllmg-musl=m.gmane.org@lists.openwall.com Thu Jul 11 14:46:33 2013
Return-path: <musl-return-3632-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3632-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1UxGGU-0005lZ-Hg
	for gllmg-musl@plane.gmane.org; Thu, 11 Jul 2013 14:46:30 +0200
Original-Received: (qmail 31947 invoked by uid 550); 11 Jul 2013 12:46:26 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 31939 invoked from network); 11 Jul 2013 12:46:26 -0000
Content-Disposition: inline
In-Reply-To: <CAPfzE3aoD4mpO9RrV-enuXxkCvMPY_7rEE6e9w8NuX-ntEqtqA@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:3628
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3628>

On Thu, Jul 11, 2013 at 05:10:41PM +1200, Andre Renaud wrote:
> > I can't see any obvious reason why this shouldn't work, although the
> > assembler as it stands makes pretty heavy use of all the registers,
> > and I can't immediately see how to rework it to free up 2 more (I can
> > free up 1 by dropping the attempted preload). Given my (lack of)
> > skills with ARM assembler, I'm not sure I'll be able to look too
> > deeply into either of these options, but I'll have a go at the inline
> > ASM version to force 8*4byte loads to see if it improves things.
> 
> I've given it a bit of a go, and at first it appears to be working
> (although I don't exactly have a comprehensive test suite, so this is
> very preliminary). Anyone with some more ARM assembler experience is
> welcome to chip in with a comment.
> 
> I also managed to mess up my last set of benchmarking - I'd indicated
> that I got 65 vs 95 vs 105, however I'd stuffed up the fact that the
> first call would have poor cache performance. Once I corrected that
> the results have become more like 65(naive) vs 105(typedef) vs
> 113(asm).
> 
> Using the below code, it becomes 65(naive), 113(inline asm), 113(full
> asm). So the inline is able to perform as we'd expect. Assuming
> that it is technically correct (which is probably the biggest
> question).

It's not.

> #define SS (8 * 4)
> #define ALIGN (SS - 1)
> void * noinline my_asm_memcpy(void * restrict dest, const void *
> restrict src, size_t n)
> {
>     unsigned char *d = dest;
>     const unsigned char *s = src;
> 
>     if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
>         goto misaligned;
> 
>     for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;
>     if (n) {
>         for (; n>=SS; n-= SS) {
>                 __asm__("ldmia %0, {r4-r11}"
>                                 : "=r" (s)
>                                 : "0" (s)
>                                 : "r4", "r5", "r6", "r7", "r8", "r9",
> "r10", "r11");
>                 s+=SS;
>                 __asm__("stmia %0, {r4-r11}"
>                                 : "=r" (d)
>                                 :"0" (d));
>                 d+=SS;

You need both instructions in the same asm block, and proper
constraints. As it is, whether the registers keep their values between
the two separate asm blocks is up to the compiler's whims.

With the proper constraints ("+r" type), the s+=SS and d+=SS are
unnecessary, as a bonus. Also there's no reason to force alignment to
SS for this loop; that will simply prevent it from being used as much
for smaller copies. I would use SS==sizeof(size_t) and then write 8*SS
in the for loop.
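
For illustration, here is a minimal sketch of how those changes might
fit together; it is untested on real ARM hardware and is not the code
that ended up in musl. The name sketch_memcpy is made up, and the
non-ARM fallback branch is an addition so the file builds elsewhere.
It assumes the writeback forms ("ldmia %0!" / "stmia %1!") so that
the "+r" operands come out already advanced, and that r11 is not in
use as a frame pointer (e.g. built with -fomit-frame-pointer).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SS (sizeof(size_t))   /* align to one word, not to 8 words */
#define ALIGN (SS - 1)

void *sketch_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* The block loop is only used when both pointers share the same
     * offset within a word; otherwise fall through to the byte loop. */
    if (((uintptr_t)d & ALIGN) == ((uintptr_t)s & ALIGN)) {
        /* Copy bytes until d (and therefore s) is word-aligned. */
        for (; ((uintptr_t)d & ALIGN) && n; n--) *d++ = *s++;

        for (; n >= 8*SS; n -= 8*SS) {
#if defined(__arm__) && !defined(__thumb__)
            /* Load and store live in ONE asm block, so r4-r11 cannot
             * be reused by the compiler in between. "+r" marks s and d
             * as read and written; the "!" writeback advances each by
             * 32 bytes. "memory" covers the accesses the operands do
             * not describe. */
            __asm__ volatile(
                "ldmia %0!, {r4-r11}\n\t"
                "stmia %1!, {r4-r11}"
                : "+r"(s), "+r"(d)
                :
                : "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11",
                  "memory");
#else
            /* Portable stand-in (assumption, not from the thread). */
            memcpy(d, s, 8*SS);
            d += 8*SS;
            s += 8*SS;
#endif
        }
    }

    /* Tail bytes, and the whole copy in the misaligned case. */
    for (; n; n--) *d++ = *s++;
    return dest;
}
```

Whether this matches the 113 figure quoted above would need to be
measured; the sketch only shows the constraint and alignment changes.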

Last night I was in the process of writing something very similar, but
I put the for loop in asm too and didn't finish it. If it performs
just as well with the loop in C, I like your version better.

Rich