mailing list of musl libc
From: Denys Vlasenko <vda.linux@googlemail.com>
To: Rich Felker <dalias@libc.org>
Cc: musl <musl@lists.openwall.com>
Subject: Re: [PATCH] x86_64/memset: use "small block" code for blocks up to 30 bytes long
Date: Tue, 17 Feb 2015 18:30:37 +0100
Message-ID: <CAK1hOcMkZhFfBJmDeDGraeHLq8GUZS_KBCC39AB5OxMBMfVihA@mail.gmail.com>
In-Reply-To: <CAK1hOcOaN4SnpO2jMGib3tFEf+c8=Tu8Nwi2YnOhzefpSSqTng@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1244 bytes --]

On Tue, Feb 17, 2015 at 5:51 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Tue, Feb 17, 2015 at 5:12 PM, Rich Felker <dalias@libc.org> wrote:
>> On Tue, Feb 17, 2015 at 02:08:52PM +0100, Denys Vlasenko wrote:
>>> >> Please see attached file.
>>> >
>>> > I tried it and it's ~1 cycle slower for at least sizes 16-30;
>>> > presumably we're seeing the cost of the extra compare/branch at these
>>> > sizes but not at others. What does your timing test show?
>>>
>>> See below.
>>> First column - result of my2.s
>>> Second column - result of vda1.s
>>>
>>> Basically, the "rep stosq" code path got a bit faster, while
>>> small memsets stayed the same.
>>
>> Can you post your test program for me to try out? Here's what I've
>> been using, attached.
>
> With your program I see similar results:

I changed your program to output floating-point results (cycles per
memset call) and to run many more iterations when finding the minimum,
because otherwise (on my machine) consecutive runs show a discrepancy
of +/-2 cycles for most measurements.
With a few million iterations (REP = 1024*4096 in the attached program),
the discrepancy between runs is often zero, and when it's not, it's one
cycle or less.
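A minimal sketch of the statistics described above, assuming the per-batch tick counts have already been collected. The names `summarize` and `struct summary` are hypothetical, not part of the attached program: each batch times `repeat` memset calls, so dividing by `repeat` gives fractional cycles per call, and the minimum over all batches is reported next to the mean.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: per-call cost of the best batch and the mean. */
struct summary {
	double min_per_call;
	double avg_per_call;
};

/* Hypothetical helper: ticks[i] is the time of batch i, and every batch
   ran the timed memset 'repeat' times. */
static struct summary summarize(const unsigned *ticks, size_t nbatches, unsigned repeat)
{
	unsigned tmin = (unsigned)-1;
	unsigned long long total = 0;
	struct summary s;

	assert(nbatches > 0 && repeat > 0);
	for (size_t i = 0; i < nbatches; i++) {
		if (ticks[i] < tmin)
			tmin = ticks[i];      /* best (least disturbed) batch */
		total += ticks[i];
	}
	s.min_per_call = (double)tmin / repeat;
	s.avg_per_call = (double)total / ((double)repeat * (double)nbatches);
	return s;
}
```

With batches {30, 24, 90, 24} and repeat = 3, the minimum is 8.0 cycles per call, while the single slow batch pulls the average up to 14.0; an outlier moves the mean but not the minimum, which is why the minimum is the steadier number between runs.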

Please see the attached files.
my2.OUT1 and my2.OUT2 are two runs of the my2.s code
(to judge how much noise is in the measurements);
vda1.OUT1 is a run of the vda1.s code.

[-- Attachment #2: my2.OUT1 --]
[-- Type: application/octet-stream, Size: 1533 bytes --]

size 2: min=7.96, avg=8.17
size 4: min=8.01, avg=8.19
size 6: min=8.03, avg=8.26
size 8: min=8.12, avg=8.31
size 10: min=8.13, avg=8.45
size 12: min=8.19, avg=8.41
size 14: min=8.21, avg=8.48
size 16: min=9.05, avg=9.35
size 18: min=9.06, avg=9.49
size 20: min=9.12, avg=9.45
size 22: min=9.20, avg=9.48
size 24: min=9.22, avg=9.54
size 26: min=9.24, avg=9.65
size 28: min=9.34, avg=10.21
size 30: min=9.27, avg=9.71
size 32: min=10.94, avg=11.17
size 34: min=12.41, avg=12.75
size 36: min=12.56, avg=12.87
size 38: min=12.58, avg=12.97
size 40: min=11.75, avg=12.39
size 42: min=11.87, avg=12.07
size 44: min=11.73, avg=12.74
size 46: min=11.71, avg=12.89
size 48: min=11.70, avg=12.69
size 50: min=11.70, avg=12.92
size 52: min=11.84, avg=13.03
size 54: min=11.67, avg=12.23
size 56: min=11.65, avg=12.37
size 58: min=11.65, avg=12.13
size 60: min=11.62, avg=12.05
size 62: min=11.62, avg=12.11
size 64: min=19.40, avg=19.98
size 96: min=18.00, avg=18.58
size 128: min=32.14, avg=34.43
size 160: min=35.50, avg=37.91
size 192: min=39.00, avg=41.62
size 224: min=42.00, avg=46.42
size 256: min=45.00, avg=50.31
size 288: min=48.00, avg=52.63
size 320: min=51.00, avg=55.73
size 352: min=57.00, avg=61.11
size 384: min=60.00, avg=64.06
size 416: min=63.00, avg=67.07
size 448: min=66.00, avg=70.23
size 480: min=69.00, avg=73.61
size 512: min=75.00, avg=83.13
size 1024: min=126.00, avg=134.42
size 2048: min=228.00, avg=237.24
size 4096: min=432.00, avg=453.01
size 8192: min=837.00, avg=861.20
size 16384: min=1650.00, avg=1695.39

[-- Attachment #3: vda1.OUT1 --]
[-- Type: application/octet-stream, Size: 1533 bytes --]

size 2: min=7.97, avg=8.15
size 4: min=8.01, avg=8.21
size 6: min=8.03, avg=8.27
size 8: min=8.12, avg=8.29
size 10: min=8.16, avg=8.40
size 12: min=8.19, avg=8.49
size 14: min=8.25, avg=8.46
size 16: min=9.10, avg=9.33
size 18: min=9.11, avg=9.48
size 20: min=9.19, avg=9.53
size 22: min=9.20, avg=9.49
size 24: min=9.22, avg=9.51
size 26: min=9.24, avg=9.60
size 28: min=9.34, avg=10.06
size 30: min=9.36, avg=9.95
size 32: min=10.94, avg=11.34
size 34: min=12.52, avg=12.88
size 36: min=12.56, avg=12.98
size 38: min=12.69, avg=13.00
size 40: min=11.75, avg=12.15
size 42: min=11.87, avg=12.43
size 44: min=11.73, avg=13.18
size 46: min=11.71, avg=13.14
size 48: min=11.85, avg=13.43
size 50: min=11.85, avg=13.26
size 52: min=11.84, avg=13.34
size 54: min=11.67, avg=12.38
size 56: min=11.65, avg=12.24
size 58: min=11.65, avg=12.72
size 60: min=11.62, avg=12.36
size 62: min=11.62, avg=12.25
size 64: min=19.40, avg=20.45
size 96: min=18.00, avg=18.55
size 128: min=31.29, avg=33.33
size 160: min=34.50, avg=36.89
size 192: min=37.80, avg=40.21
size 224: min=42.00, avg=43.81
size 256: min=45.00, avg=48.03
size 288: min=48.00, avg=52.69
size 320: min=51.00, avg=55.10
size 352: min=55.50, avg=59.32
size 384: min=58.50, avg=61.74
size 416: min=61.50, avg=65.17
size 448: min=64.50, avg=69.09
size 480: min=67.50, avg=71.62
size 512: min=75.00, avg=80.14
size 1024: min=126.00, avg=131.30
size 2048: min=228.00, avg=235.69
size 4096: min=432.00, avg=442.48
size 8192: min=837.00, avg=862.37
size 16384: min=1650.00, avg=1687.02
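The output files are meant to be compared line by line. A minimal C sketch of doing that, assuming the exact line format printed by the attached program; `struct result`, `parse_result` and `min_gain` are hypothetical helpers, not something posted in the thread:

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of benchmark output, e.g. "size 128: min=32.14, avg=34.43". */
struct result {
	unsigned size;
	double min, avg;
};

/* Returns 1 if the line matches the benchmark's
   printf("size %u: min=%.2f, avg=%.2f\n") format, 0 otherwise. */
static int parse_result(const char *line, struct result *r)
{
	return sscanf(line, "size %u: min=%lf, avg=%lf",
		      &r->size, &r->min, &r->avg) == 3;
}

/* Positive when the second run has the lower (better) minimum for the
   same size; both results must describe the same size. */
static double min_gain(const struct result *a, const struct result *b)
{
	assert(a->size == b->size);
	return a->min - b->min;
}
```

For size 128, my2.OUT1 gives min=32.14 and vda1.OUT1 gives min=31.29, so `min_gain` reports about 0.85 cycles in favor of vda1.s. Lines such as the attachment headers simply fail to parse and can be skipped.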

[-- Attachment #4: memset-cycles-vda.c --]
[-- Type: text/x-csrc, Size: 1222 bytes --]

#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>

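static inline unsigned rdtsc(void)
{
#if defined __i386__ || defined __x86_64__
	// Only the low 32 bits of the TSC (EAX) are kept; EDX receives the
	// high half and is declared clobbered. "edx" names the same register
	// as "rdx" and is accepted on both i386 and x86_64 targets.
	// Differences of the low halves stay correct modulo 2^32.
	unsigned x;
	__asm__ __volatile__ ( "rdtsc" : "=a"(x) : : "edx" );
//	__asm__ __volatile__ ( "cpuid ; rdtsc" : "=a"(x)
//		: : "ebx", "ecx", "edx" );
	return x;
#else
	// Fold the seconds in, so that a difference of two readings stays
	// correct (modulo 2^32) even when tv_nsec rolls over.
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return (unsigned)((unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec);
#endif
}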

char buf[32768+100];

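int main(void)
{
	unsigned i, t, tmin;
	unsigned long long tmean;
	unsigned n;

// Millions of iterations (REP = 1024*4096, about 4.2 million) are
// needed to get a stable "min" measurement
#define REP (1024*4096)

	// Sizes: 2..62 in steps of 2, 64..480 in steps of 32, then doubling up to 16384
	for (n=2; n<32768; n+=(n<64 ? 2 : n<512 ? 32 : n)) {
		// memset calls per timed batch: 1024/(n|1), but at least one
		int repeat = 1024 / (n|1);
		if (repeat == 0) repeat = 1;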

		memset(buf, 0, n);
		tmin = -1;
		tmean = 0;
		for (i=0; i < REP; i++) {
			int j = repeat;
			__asm__ __volatile__ ("" : : : "memory");
			t = rdtsc();
			do {
				memset(buf, 0, n);
				__asm__ __volatile__ ("" : : : "memory");
			} while (--j != 0);
			t = rdtsc() - t;
			__asm__ __volatile__ ("" : : : "memory");
			if (t < tmin) tmin = t;
			tmean += t;
		}
		// min: best batch, avg: mean over all batches; both in cycles per memset call
		printf("size %u: min=%.2f, avg=%.2f\n",
			n,
			(double)tmin / repeat,
			(double)tmean / ((double)repeat * REP)
		);
	}
	return 0;
}
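The `repeat` factor also explains the granularity of the printed minimums: `min` is `tmin / repeat`, so it can only take multiples of 1/repeat cycles. A small sketch restating the size schedule and the `repeat` computation from the program above; `next_size`, `repeat_for` and `count_sizes` are hypothetical helper names:

```c
#include <assert.h>

/* Next size visited by the benchmark loop (same expression as in main). */
static unsigned next_size(unsigned n)
{
	return n + (n < 64 ? 2 : n < 512 ? 32 : n);
}

/* memset calls timed together in one batch for size n: at least one. */
static int repeat_for(unsigned n)
{
	int repeat = 1024 / (n | 1);
	assert(n > 0);
	return repeat ? repeat : 1;
}

/* Number of sizes the loop visits, i.e. lines in each OUT file. */
static unsigned count_sizes(void)
{
	unsigned count = 0;
	for (unsigned n = 2; n < 32768; n = next_size(n))
		count++;
	return count;
}
```

For example, at size 64 `repeat` is 15, so min=19.40 corresponds to a best batch of 291 cycles; at size 480 it is 2 (minimums in steps of 0.5), and from 512 bytes upward it is 1, so every minimum there is a whole number of cycles, as seen in the OUT files. The loop visits 51 sizes, matching the 51 lines of each output file.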

[-- Attachment #5: my2.OUT2 --]
[-- Type: application/octet-stream, Size: 1533 bytes --]

size 2: min=7.96, avg=8.20
size 4: min=8.01, avg=8.24
size 6: min=8.03, avg=8.29
size 8: min=8.10, avg=8.32
size 10: min=8.16, avg=8.45
size 12: min=8.19, avg=8.50
size 14: min=8.25, avg=8.49
size 16: min=9.10, avg=9.35
size 18: min=9.17, avg=9.58
size 20: min=9.12, avg=9.48
size 22: min=9.20, avg=9.54
size 24: min=9.22, avg=9.61
size 26: min=9.24, avg=9.68
size 28: min=9.34, avg=10.25
size 30: min=9.36, avg=9.89
size 32: min=10.94, avg=11.33
size 34: min=12.41, avg=12.87
size 36: min=12.56, avg=12.98
size 38: min=12.58, avg=12.98
size 40: min=11.75, avg=12.22
size 42: min=11.87, avg=12.30
size 44: min=11.73, avg=12.71
size 46: min=11.71, avg=12.82
size 48: min=11.70, avg=12.65
size 50: min=11.70, avg=12.64
size 52: min=11.84, avg=15.61
size 54: min=11.67, avg=12.71
size 56: min=11.65, avg=12.33
size 58: min=11.65, avg=12.43
size 60: min=11.62, avg=12.10
size 62: min=11.62, avg=12.05
size 64: min=19.40, avg=20.12
size 96: min=18.00, avg=18.46
size 128: min=32.14, avg=34.66
size 160: min=36.00, avg=37.96
size 192: min=38.40, avg=41.83
size 224: min=42.75, avg=45.58
size 256: min=45.00, avg=50.88
size 288: min=49.00, avg=54.01
size 320: min=52.00, avg=56.12
size 352: min=57.00, avg=60.68
size 384: min=60.00, avg=63.90
size 416: min=63.00, avg=67.51
size 448: min=66.00, avg=70.17
size 480: min=69.00, avg=73.32
size 512: min=75.00, avg=82.78
size 1024: min=126.00, avg=134.00
size 2048: min=228.00, avg=237.86
size 4096: min=432.00, avg=448.33
size 8192: min=837.00, avg=862.52
size 16384: min=1650.00, avg=1698.77
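Comparing my2.OUT1 with my2.OUT2 gives an estimate of run-to-run noise. A minimal sketch of that comparison, assuming the `min` columns of both runs have already been read into arrays in the same size order; `max_abs_delta` is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute difference between two equally long lists of "min"
   values, e.g. two runs of the same memset: a rough noise floor for
   judging differences between implementations. */
static double max_abs_delta(const double *run1, const double *run2, size_t n)
{
	double worst = 0.0;

	assert(n == 0 || (run1 != NULL && run2 != NULL));
	for (size_t i = 0; i < n; i++) {
		double d = run1[i] - run2[i];
		if (d < 0)
			d = -d;
		if (d > worst)
			worst = d;
	}
	return worst;
}
```

With the minimums for sizes 16, 18 and 30 from the two my2 runs above (9.05/9.10, 9.06/9.17, 9.27/9.36), the largest difference is about 0.11 cycles, so a difference of that order between my2.s and vda1.s at those sizes is within run-to-run noise.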
