From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [PATCH] x86_64/memset: use "small block" code for blocks up to 30 bytes long
Date: Mon, 16 Feb 2015 12:36:35 -0500
Message-ID: <20150216173634.GA23507@brightrain.aerifal.cx>
References: <1423845589-5920-1-git-send-email-vda.linux@googlemail.com>
	<20150214193533.GK23507@brightrain.aerifal.cx>
	<20150215040655.GM23507@brightrain.aerifal.cx>
	<20150215150313.GO23507@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: Denys Vlasenko
Cc: musl

On Sun, Feb 15, 2015 at 10:44:59PM +0100, Denys Vlasenko wrote:
> On Sun, Feb 15, 2015 at 4:03 PM, Rich Felker wrote:
> >> Just because we don't personally see a hit from 6-cycle imul of AMD CPUs,
> >> it does not mean people who do use those CPUs don't exist. Have heart...
> >
> > Did you test the version I attached? I think there should be at least
> > 4-5 cycles between when the imul is launched and when the result is
> > used, so I'm failing to see how the latency is a big deal.
>
> Okay, I won't insist.
> Your version works good. The "rep stosq" setup time is still noticeable
> even when we switch to it after 126:
>
> 129 byte block: 10.37 bytes/ns
> 128 byte block: 10.65 bytes/ns
> 127 byte block: 10.58 bytes/ns
> 126 byte block: 18.44 bytes/ns
> 125 byte block: 18.30 bytes/ns
> 124 byte block: 18.15 bytes/ns
>
> but I don't think we should do anything about this.

Agreed. The size of code is really going to blow up at the next step,
and hopefully future cpus will get less bad about pessimizing rep
stosq startup.

> "sub $8,%rcx" can be folded into lea.
>
> Please see attached file.

I tried it and it's ~1 cycle slower for at least sizes 16-30;
presumably we're seeing the cost of the extra compare/branch at these
sizes but not at others. What does your timing test show?

Rich
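
For anyone wanting to reproduce per-size figures like the bytes/ns table quoted above, here is a minimal sketch of a timing helper. It is not the test program used in this thread; the name memset_bytes_per_ns, the 4096-byte buffer, and the use of clock_gettime(CLOCK_MONOTONIC) are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Call memset through a volatile function pointer so the compiler
   cannot inline the call or drop the repeated stores being timed. */
static void *(*volatile memset_fn)(void *, int, size_t) = memset;

/* Hypothetical helper (not from the thread): memset throughput for one
   block size, in bytes per nanosecond. Returns -1.0 on bad arguments or
   if the clock did not advance. */
static double memset_bytes_per_ns(size_t len, long iters)
{
    static unsigned char buf[4096];   /* assumed upper bound on block size */
    struct timespec t0, t1;

    if (len > sizeof buf || iters <= 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        memset_fn(buf, (int)(i & 0xff), len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns > 0.0 ? (double)len * (double)iters / ns : -1.0;
}
```

Calling memset_bytes_per_ns(n, 10000000) for n from 124 to 129 should show whether the drop at the switch to rep stosq appears on a given CPU. A loop like this measures throughput with a perfectly predicted branch pattern, so a ~1 cycle difference at sizes 16-30 may need cycle counters and mixed sizes to show up.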