From: Rich Felker
Newsgroups: gmane.linux.lib.musl.general
Subject: Re: [PATCH] x86_64/memset: use "small block" code for blocks up to 30 bytes long
Date: Tue, 17 Feb 2015 11:12:22 -0500
Message-ID: <20150217161222.GF23507@brightrain.aerifal.cx>
References: <1423845589-5920-1-git-send-email-vda.linux@googlemail.com>
 <20150214193533.GK23507@brightrain.aerifal.cx>
 <20150215040655.GM23507@brightrain.aerifal.cx>
 <20150215150313.GO23507@brightrain.aerifal.cx>
 <20150216173634.GA23507@brightrain.aerifal.cx>
Reply-To: musl@lists.openwall.com
To: Denys Vlasenko
Cc: musl
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="llIrKcgUOe3dCx0c"

--llIrKcgUOe3dCx0c
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Feb 17, 2015 at 02:08:52PM +0100, Denys Vlasenko wrote:
> >> Please see attached file.
> >
> > I tried it and it's ~1 cycle slower for at least sizes 16-30;
> > presumably we're seeing the cost of the extra compare/branch at these
> > sizes but not at others. What does your timing test show?
>
> See below.
> First column - result of my2.s
> Second column - result of vda1.s
>
> Basically, the "rep stosq" code path got a bit faster, while
> small memsets stayed the same.

Can you post your test program for me to try out? Here's what I've
been using, attached.

Rich
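For context, here is a minimal C sketch of the kind of "small block"
path the patch subject describes: lengths from 8 to 30 bytes are
covered by a few possibly-overlapping 8-byte stores, so there is no
per-byte loop and no long branch chain. The function name memset_small
and the exact store schedule are illustrative assumptions, not the
code from the patch; the plain loop at the end merely stands in for
the "rep stosq" path used for larger blocks.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *memset_small(void *dest, int c, size_t n)
{
	unsigned char *d = dest;
	/* replicate the fill byte into all 8 bytes of a word */
	uint64_t fill = 0x0101010101010101ULL * (unsigned char)c;

	if (n >= 8 && n <= 30) {
		memcpy(d, &fill, 8);          /* head */
		memcpy(d + n - 8, &fill, 8);  /* tail (may overlap head) */
		if (n > 16) {
			memcpy(d + 8, &fill, 8);       /* second head */
			memcpy(d + n - 16, &fill, 8);  /* second tail */
		}
		return dest;
	}
	if (n < 8) {
		/* tiny blocks: plain byte stores */
		while (n--) d[n] = (unsigned char)c;
		return dest;
	}
	/* large blocks: simple loop here; real asm would use rep stosq */
	for (size_t i = 0; i < n; i++) d[i] = (unsigned char)c;
	return dest;
}

The point of the overlap trick is that every length in the 8-30 range
goes through the same straight-line code, which is why an extra
compare/branch shows up directly as a cycle in the measurements quoted
above.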
--llIrKcgUOe3dCx0c
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="memset-cycles.c"

#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <string.h>
#include <time.h>

static inline unsigned rdtsc()
{
#if defined __i386__ || defined __x86_64__
	unsigned x;
	__asm__ __volatile__ ( "rdtsc" : "=a"(x) : : "rdx" );
//	__asm__ __volatile__ ( "cpuid ; rdtsc" : "=a"(x)
//		: : "rbx", "rcx", "rdx" );
	return x;
#else
	/* fallback: CPU-time clock, nanoseconds */
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_nsec;
#endif
}

char buf[32768+100];

int main()
{
	unsigned a=0;
	unsigned i, j, t, tmin=-1;
	unsigned long long tmean=0;
	unsigned overhead = -1;
	size_t n;

	/* calibrate: measure the overhead of the rdtsc timing itself */
	for (i=0; i<0+1*4096; i++) {
		t = rdtsc();
		__asm__ __volatile__("nop");
		t = rdtsc()-t;
		if (t < overhead) overhead = t;
	}
	//overhead = 0;

	for (n=2; n<32768; n+=(n<64 ? 2 : n<512 ? 32 : n)) {
		tmin = -1; tmean = 0;
		/* 4096 trials of 64 back-to-back memsets; track min and mean */
		for (i=0; i<0+1*4096; i++) {
			__asm__ __volatile__ ("" : : : "memory");
			t = rdtsc();
			for (j=0; j<64; j++) {
				memset(buf, 0, n);
				__asm__ __volatile__ ("" : : : "memory");
			}
			t = rdtsc()-t;
			__asm__ __volatile__ ("" : : : "memory");
			if (t < tmin) tmin = t;
			tmean += t;
		}
		tmin -= overhead;
		tmean -= 4096*overhead;
		tmin /= 64;
		tmean /= 64;
		tmean /= 4096;
		printf("size %zu: min=%u, avg=%llu\n", n, tmin, tmean);
	}
}

--llIrKcgUOe3dCx0c--