From mboxrd@z Thu Jan  1 00:00:00 1970
X-Msuck: nntp://news.gmane.org/gmane.linux.lib.musl.general/3622
Path: news.gmane.org!not-for-mail
From: Rich Felker <dalias@aerifal.cx>
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: Thinking about release
Date: Wed, 10 Jul 2013 23:37:55 -0400
Message-ID: <20130711033754.GL29800@brightrain.aerifal.cx>
References: <20130613012517.GA5859@brightrain.aerifal.cx>
 <CAPfzE3a0h=2NFqgnBqXj3J2q7VgYjqZ19Ab=0LAe5u5SvWXHaA@mail.gmail.com>
 <20130613014314.GC29800@brightrain.aerifal.cx>
 <CAPfzE3aerGrdmTkj15o0CTVtt8TZpTyAnSAj1Joau+Jb_cNGUA@mail.gmail.com>
 <20130709053711.GO29800@brightrain.aerifal.cx>
 <CAPfzE3ZTxynUeJjq7KWijZGhsV==NymW4vqLhnQbEYCXRxVf-g@mail.gmail.com>
 <CAPfzE3ZsMpC9d4VDZyHabhKOffOQW0dnG7Nwpm8EqVBLUXNZKg@mail.gmail.com>
 <CAPfzE3YDFjqHxRaZFeiy0CvbYWYGKzgDGEp-71xSz-03GhNTxw@mail.gmail.com>
Reply-To: musl@lists.openwall.com
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: ger.gmane.org 1373513891 1989 80.91.229.3 (11 Jul 2013 03:38:11 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Thu, 11 Jul 2013 03:38:11 +0000 (UTC)
Cc: Andre Renaud <andre@bluewatersys.com>
To: musl@lists.openwall.com
Original-X-From: musl-return-3626-gllmg-musl=m.gmane.org@lists.openwall.com Thu Jul 11 05:38:12 2013
Return-path: <musl-return-3626-gllmg-musl=m.gmane.org@lists.openwall.com>
Envelope-to: gllmg-musl@plane.gmane.org
Original-Received: from mother.openwall.net ([195.42.179.200])
	by plane.gmane.org with smtp (Exim 4.69)
	(envelope-from <musl-return-3626-gllmg-musl=m.gmane.org@lists.openwall.com>)
	id 1Ux7hr-0003XW-Mu
	for gllmg-musl@plane.gmane.org; Thu, 11 Jul 2013 05:38:11 +0200
Original-Received: (qmail 11681 invoked by uid 550); 11 Jul 2013 03:38:10 -0000
Mailing-List: contact musl-help@lists.openwall.com; run by ezmlm
Precedence: bulk
List-Post: <mailto:musl@lists.openwall.com>
List-Help: <mailto:musl-help@lists.openwall.com>
List-Unsubscribe: <mailto:musl-unsubscribe@lists.openwall.com>
List-Subscribe: <mailto:musl-subscribe@lists.openwall.com>
Original-Received: (qmail 11671 invoked from network); 11 Jul 2013 03:38:09 -0000
Content-Disposition: inline
In-Reply-To: <CAPfzE3YDFjqHxRaZFeiy0CvbYWYGKzgDGEp-71xSz-03GhNTxw@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Xref: news.gmane.org gmane.linux.lib.musl.general:3622
Archived-At: <http://permalink.gmane.org/gmane.linux.lib.musl.general/3622>

On Thu, Jul 11, 2013 at 10:44:16AM +1200, Andre Renaud wrote:
> > This results in 95MB/s on my platform (up from 65MB/s for the existing
> > memcpy.c, and down from 105MB/s with the asm optimised version). It is
> > essentially identically readable to the existing memcpy.c. I'm not
> > really familiar with any other cpu architectures, so I'm not sure if
> > this would improve, or hurt, performance on other platforms.
> 
> Reviewing the assembler that is produced, it appears that GCC will
> never generate an ldm/stm instruction (load/store multiple) that reads
> into more than 4 registers, whereas the optimised assembler uses ones
> that read 8 (i.e. 8 * 32-bit reads in a single instruction). I've tried

For the asm, could we make it more than 8? 10 seems easy, 12 seems
doubtful. I don't see a fundamental reason it needs to be a power of
two, unless the cache line alignment really helps and isn't just
cargo-culting. (This is something I'd still like to know about the
asm: whether it's doing unnecessary stuff that does not help
performance.)
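
To make the non-power-of-two point concrete, here is a minimal C sketch,
not the asm under discussion. The name copy_blocked and the value
WORDS_PER_ITER = 10 are illustrative assumptions, and both buffers are
assumed to be 4-byte aligned. ARM has 16 core registers; excluding sp
and pc and reserving two for the source and destination pointers leaves
12, which is presumably why 12 looks doubtful once a loop counter is
needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Words moved per loop iteration; 10 is an illustrative assumption. */
#define WORDS_PER_ITER 10

/* may_alias (a GCC/clang extension) lets us read arbitrary objects as words. */
typedef uint32_t __attribute__((__may_alias__)) u32;

/* Hypothetical helper: copy n bytes between 4-byte-aligned buffers. */
void *copy_blocked(void *restrict dest, const void *restrict src, size_t n)
{
	u32 *d = dest;
	const u32 *s = src;
	const size_t block = WORDS_PER_ITER * sizeof(u32);

	/* Compare-and-subtract needs no shift or mask, so the block
	 * size does not have to be a power of two. */
	while (n >= block) {
		for (int i = 0; i < WORDS_PER_ITER; i++)
			d[i] = s[i];
		d += WORDS_PER_ITER;
		s += WORDS_PER_ITER;
		n -= block;
	}

	/* Tail: whatever is left (fewer than `block` bytes), bytewise. */
	unsigned char *db = (unsigned char *)d;
	const unsigned char *sb = (const unsigned char *)s;
	while (n--)
		*db++ = *sb++;
	return dest;
}
```

One consequence worth noting: with a 40-byte stride, successive
iterations start at different offsets within a 32-byte cache line, so
this layout cannot keep every block line-aligned. Whether that costs
anything is exactly the open question.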

> various tricks/optimisations with the C code, and can't convince GCC
> to do more than 4. I assume that this is probably where the remaining
> 10MB/s is between these two variants.

Yes, I suspect so. One slightly crazy idea I had was to write the
function in C with just inline asm for the inner ldm/stm loop. The
build system does not yet have support for .c files in the arch dirs
instead of .s files, but it could be added.
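
A rough sketch of what that could look like, under several assumptions:
the names copy_blocks32 and sketch_memcpy are hypothetical, the asm path
is only used when compiling for ARM (not Thumb) mode, and the register
set r3-r10 is one plausible choice that may need changing if the
compiler reserves any of those registers (for example as a PIC
register). The ARM path is untested here; on other targets the plain C
fallback is what gets built.

```c
#include <stddef.h>
#include <stdint.h>

/* may_alias (a GCC/clang extension) lets us read arbitrary objects as words. */
typedef uint32_t __attribute__((__may_alias__)) u32;

/* Hypothetical helper: copy `blocks` 32-byte blocks between
 * word-aligned buffers. */
static void copy_blocks32(u32 *d, const u32 *s, size_t blocks)
{
#if defined(__arm__) && !defined(__thumb__)
	if (!blocks) return;
	__asm__ __volatile__(
		"1:\n\t"
		"ldmia %1!, {r3-r10}\n\t"   /* load 8 words, advance src */
		"subs %2, %2, #1\n\t"       /* count down, set flags */
		"stmia %0!, {r3-r10}\n\t"   /* store 8 words, advance dest */
		"bne 1b"
		: "+r"(d), "+r"(s), "+r"(blocks)
		:
		: "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10",
		  "cc", "memory");
#else
	/* Portable fallback so the same file builds everywhere. */
	while (blocks--) {
		for (int i = 0; i < 8; i++)
			d[i] = s[i];
		d += 8;
		s += 8;
	}
#endif
}

/* Hypothetical memcpy-like wrapper: everything except the inner loop
 * stays in C. */
void *sketch_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Use the block loop only when both pointers are word-aligned. */
	if (!(((uintptr_t)d | (uintptr_t)s) & 3)) {
		size_t blocks = n / 32;
		copy_blocks32((u32 *)d, (const u32 *)s, blocks);
		d += blocks * 32;
		s += blocks * 32;
		n -= blocks * 32;
	}

	/* Bytewise tail, and the whole copy in the misaligned case. */
	while (n--)
		*d++ = *s++;
	return dest;
}
```

The point of the split is that alignment handling and the tail stay
readable C, and only the few lines the compiler cannot be convinced to
generate are written by hand.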

> Rich - do you have any comments on whether either the C or assembler
> variants of memcpy might be suitable for inclusion in musl?

I would say either might be, but it looks like if we want competitive
performance, some asm will be needed (either inline or full). My
leaning would be to go for something simpler than the asm you've been
experimenting with, but with the same or better performance, if this is
possible. I realize the code is not that big as-is, in terms of binary
size, but it's big from an "understanding it" perspective and I don't
like big asm blobs that are hard for somebody to look at and say "oh
yeah, this is clearly right".

Anyway, the big question I'd still like to get answered before moving
forward is whether the cache line alignment has any benefit.
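
One way to get an answer is to time the same copy routine at different
offsets from a line boundary. The following is a minimal sketch of such
a harness, not a rigorous benchmark: time_copy and ALIGN_ASSUMED are
hypothetical names, the 64-byte figure is an assumption (the real line
size depends on the core), and offsets must be smaller than
ALIGN_ASSUMED.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumed upper bound on the cache line size; adjust per core. */
#define ALIGN_ASSUMED 64

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Round p up to the next ALIGN_ASSUMED boundary. */
static unsigned char *align_up(unsigned char *p)
{
	uintptr_t a = ((uintptr_t)p + ALIGN_ASSUMED - 1)
	              & ~(uintptr_t)(ALIGN_ASSUMED - 1);
	return p + (a - (uintptr_t)p);
}

/* Time `iters` copies of `len` bytes, with dest and src placed
 * dst_off and src_off bytes past a line boundary (each offset must be
 * below ALIGN_ASSUMED). Returns CPU seconds, or -1.0 if allocation
 * fails. */
double time_copy(copy_fn fn, size_t dst_off, size_t src_off,
                 size_t len, int iters)
{
	unsigned char *sraw = malloc(len + 2 * ALIGN_ASSUMED);
	unsigned char *draw = malloc(len + 2 * ALIGN_ASSUMED);
	if (!sraw || !draw) {
		free(sraw);
		free(draw);
		return -1.0;
	}
	unsigned char *s = align_up(sraw) + src_off;
	unsigned char *d = align_up(draw) + dst_off;
	memset(s, 0x5a, len);

	volatile unsigned char sink = 0;
	clock_t t0 = clock();
	for (int i = 0; i < iters; i++) {
		/* Change the input and consume the output so the
		 * copies are not optimised away. */
		if (len) s[0] = (unsigned char)i;
		fn(d, s, len);
		if (len) sink ^= d[len - 1];
	}
	clock_t t1 = clock();
	(void)sink;

	free(sraw);
	free(draw);
	return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Comparing, say, time_copy(fn, 0, 0, len, iters) against
time_copy(fn, 4, 4, len, iters) for a routine with and without its
alignment prologue would show whether the alignment work pays for
itself on a given board.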

Rich