From: Rich Felker
Subject: Re: Thinking about release
Date: Thu, 11 Jul 2013 23:16:15 -0400
To: musl@lists.openwall.com

On Fri, Jul 12, 2013 at 10:34:31AM +1200, Andre Renaud wrote:
> I've rejiggled it a bit, and it appears to be working. I wasn't
> entirely sure what you meant about the proper constraints. There is an
> additional reason why 8*4 was used for the align - to force the whole
> loop to work in cache-line blocks. I've now done this explicitly on
> the lead-in by doing the first few copies as 32-bit, then going to the
> full cache-line asm. This has the same performance as the fully native
> assembler. However, to get that I had to use the same trick that the
> native assembler uses - doing a load of the next block prior to
> storing this one. I'm a bit concerned that this would mean we'd be
> doing a read that was out of bounds, and I can't entirely see why this
> wouldn't be happening with the existing assembler (but I'm presuming
> it doesn't). Any comments on this side of it?

I was unable to measure any performance difference between your
version with the prefetch hack and simply using:

__asm__ __volatile__(
	"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
	"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
	: "+r"(d), "+r"(s)
	:
	: "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");

in the inner loop.

Rich
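
[Editorial note: for reference, below is a minimal sketch of the copy
structure discussed in this thread - a 32-bit lead-in to reach
cache-line alignment, then the 8-word ldmia/stmia inner loop quoted
above, then a byte tail. It assumes dest and src are both 4-byte
aligned and that the cache line is 32 bytes; the function name
block_copy and the scalar lead-in/tail code are illustrative only, not
musl's actual memcpy. Note that since this variant loads each block
only right before storing it, it never reads past the end of the
source buffer, unlike the prefetch trick that prompted the
out-of-bounds concern above.]

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch, not musl's memcpy. Assumes dest and src are
 * both 4-byte aligned; ARM, GCC-style inline asm. */
static void block_copy(void *dest, const void *src, size_t n)
{
	uint32_t *d = dest;
	const uint32_t *s = src;

	/* Lead-in: copy 32-bit words until d reaches a 32-byte
	 * (cache-line) boundary. */
	while (((uintptr_t)d & 31) && n >= 4) {
		*d++ = *s++;
		n -= 4;
	}

	/* Inner loop: load and store one 8-word block (one 32-byte
	 * cache line) per iteration, post-incrementing d and s. */
	while (n >= 32) {
		__asm__ __volatile__(
			"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
			"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
			: "+r"(d), "+r"(s)
			:
			: "a4", "v1", "v2", "v3", "v4",
			  "v5", "v6", "v7", "memory");
		n -= 32;
	}

	/* Tail: copy any remaining bytes one at a time. */
	unsigned char *dc = (unsigned char *)d;
	const unsigned char *sc = (const unsigned char *)s;
	while (n--) *dc++ = *sc++;
}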