* Re: [PATCH] x86_64/memset: simple optimizations
From: Denys Vlasenko @ 2015-02-10 20:27 UTC (permalink / raw)
To: Rich Felker, musl
On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Sat, Feb 07, 2015 at 01:49:43PM +0100, Denys Vlasenko wrote:
>> On Sat, Feb 7, 2015 at 1:35 AM, Rich Felker <dalias@aerifal.cx> wrote:
>> What speedups?
>> In particular:
>> - perform pre-alignment if dst is unaligned
>
> For the rep stosq path? Does it help? I don't recall the details but I
> seem to remember both docs and measurements showing no reliable
> benefit from alignment for this instruction, and we had people trying
> things on several different cpu models. I'm open to hearing evidence
> to the contrary though.
size:20k buf:0x7f38656e2100
stos:25978 ns (times 32), 25.227500 bytes/ns
stos+1:31395 ns (times 32), 20.874662 bytes/ns
stos+4:31396 ns (times 32), 20.873997 bytes/ns
stos+8:24446 ns (times 32), 26.808476 bytes/ns

size:50k buf:0x7fbca1dc9100
stos:68149 ns (times 32), 24.041439 bytes/ns
stos+1:85762 ns (times 32), 19.104032 bytes/ns
stos+4:85762 ns (times 32), 19.104032 bytes/ns
stos+8:68204 ns (times 32), 24.022051 bytes/ns

size:1024k buf:0x7fa3036a5100
stos:1632285 ns (times 32), 20.556724 bytes/ns
stos+1:1891092 ns (times 32), 17.743416 bytes/ns
stos+4:1891089 ns (times 32), 17.743444 bytes/ns
stos+8:1632181 ns (times 32), 20.558034 bytes/ns

size:5000k buf:0x7fdf5cd6b100
stos:15592138 ns (times 32), 10.558298 bytes/ns
stos+1:15501841 ns (times 32), 10.619799 bytes/ns
stos+4:15507773 ns (times 32), 10.615737 bytes/ns
stos+8:15589617 ns (times 32), 10.560005 bytes/ns
The source is attached.
This data shows that (on my CPU, a Sandy Bridge with 4MB L2)
8-byte alignment helps when the stores fit into L1 or L2.
If the memset is larger than L2, memory throughput is too low
and there is no measurable difference.
[-- Attachment #2: t.c --]
[-- Type: text/x-csrc, Size: 3216 bytes --]
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

/* Old glibc (< 2.3.4) does not provide this constant. We use syscall
 * directly so this definition is safe. */
#ifndef CLOCK_MONOTONIC
#define CLOCK_MONOTONIC 1
#endif

/* libc has incredibly messy way of doing this,
 * typically requiring -lrt. We just skip all this mess */
static void get_mono(struct timespec *ts)
{
        syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts);
}

void memset_rep_stosq(void *ptr, unsigned long cnt)
{
        unsigned long ax, cx, di;

        asm volatile(
                "rep stosq"
                : "=D" (di), "=c" (cx), "=a" (ax)
                : "0" (ptr), "1" (cnt), "2" (0)
                : "memory"
        );
}

void memset_movnti(void *ptr, unsigned long cnt)
{
        unsigned long ax, cx, di;

        asm volatile(
                "1: movnti %%rax,(%%rdi)\n"
                "add $8,%%rdi\n"
                "dec %%rcx\n"
                "jnz 1b\n"
                "sfence\n"
                : "=D" (di), "=c" (cx), "=a" (ax)
                : "0" (ptr), "1" (cnt), "2" (0)
                : "memory"
        );
}

void memset_movnti_unroll(void *ptr, unsigned long cnt)
{
        unsigned long ax, cx, di;

        asm volatile(
                "1:\n"
                "movnti %%rax,(%%rdi)\n"
                "movnti %%rax,8(%%rdi)\n"
                "movnti %%rax,16(%%rdi)\n"
                "movnti %%rax,24(%%rdi)\n"
                "add $32,%%rdi\n"
                "dec %%rcx\n"
                "jnz 1b\n"
                "sfence\n"
                : "=D" (di), "=c" (cx), "=a" (ax)
                : "0" (ptr), "1" (cnt/4), "2" (0)
                : "memory"
        );
}

unsigned gett()
{
#if 0
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_usec;
#else
        struct timespec ts;
        get_mono(&ts);
        return ts.tv_nsec;
#endif
}

unsigned difft(unsigned t2, unsigned t1)
{
        t2 -= t1;
        if ((int)t2 < 0)
                t2 += 1000000000;
        return t2;
}

#define BUF (50*1024)
#define BUF8 (BUF/8)

void measure(void *buf, void (*m)(void *ptr, unsigned long cnt), const char *name)
{
        unsigned t1, t2, cnt;

        sleep(1);
        m(buf, BUF8);
        t2 = -1U;
        cnt = 1000;
        while (--cnt) {
                t1 = gett();
#define REPEAT 32
                m(buf, BUF8); m(buf, BUF8); m(buf, BUF8); m(buf, BUF8);
                m(buf, BUF8); m(buf, BUF8); m(buf, BUF8); m(buf, BUF8);
                m(buf, BUF8); m(buf, BUF8); m(buf, BUF8); m(buf, BUF8);
                m(buf, BUF8); m(buf, BUF8); m(buf, BUF8); m(buf, BUF8);
                m(buf, BUF8); m(buf, BUF8); m(buf, BUF8); m(buf, BUF8);
                m(buf, BUF8); m(buf, BUF8); m(buf, BUF8); m(buf, BUF8);
                m(buf, BUF8); m(buf, BUF8); m(buf, BUF8); m(buf, BUF8);
                m(buf, BUF8); m(buf, BUF8); m(buf, BUF8); m(buf, BUF8);
                t1 = difft(gett(), t1);
                if (t2 > t1)
                        t2 = t1;
                // printf("%s:%u ns %u\n", name, t1, t2);
        }
        printf("%s:%u ns (times %d), %.6f bytes/ns\n", name, t2, REPEAT, (double)(BUF) * REPEAT / t2);
}

int main()
{
        char *buf = malloc(8*BUF + 4096);

        buf += 0x100;
        buf = (char*)((long)buf & ~0xffL);
        printf("size:%uk buf:%p\n", BUF/1024, buf);
        // measure(buf, memset_movnti, "movnti");
        // measure(buf, memset_movnti_unroll, "movnti_unroll");
        measure(buf, memset_rep_stosq, "stos");
        // measure(buf+1, memset_movnti, "movnti+1");
        // measure(buf+1, memset_movnti_unroll, "movnti_unroll+1");
        measure(buf+1, memset_rep_stosq, "stos+1");
        // measure(buf+3, memset_movnti, "movnti+3");
        // measure(buf+3, memset_movnti_unroll, "movnti_unroll+3");
        measure(buf+4, memset_rep_stosq, "stos+4");
        measure(buf+8, memset_rep_stosq, "stos+8");
        return 0;
}
* Re: Re: [PATCH] x86_64/memset: simple optimizations
From: Rich Felker @ 2015-02-10 20:43 UTC (permalink / raw)
To: musl
On Tue, Feb 10, 2015 at 09:27:17PM +0100, Denys Vlasenko wrote:
> On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker <dalias@aerifal.cx> wrote:
> > On Sat, Feb 07, 2015 at 01:49:43PM +0100, Denys Vlasenko wrote:
> >> On Sat, Feb 7, 2015 at 1:35 AM, Rich Felker <dalias@aerifal.cx> wrote:
> >> What speedups?
> >> In particular:
> >> - perform pre-alignment if dst is unaligned
> >
> > For the rep stosq path? Does it help? I don't recall the details but I
> > seem to remember both docs and measurements showing no reliable
> > benefit from alignment for this instruction, and we had people trying
> > things on several different cpu models. I'm open to hearing evidence
> > to the contrary though.
>
> size:20k buf:0x7f38656e2100
> stos:25978 ns (times 32), 25.227500 bytes/ns
> stos+1:31395 ns (times 32), 20.874662 bytes/ns
> stos+4:31396 ns (times 32), 20.873997 bytes/ns
> stos+8:24446 ns (times 32), 26.808476 bytes/ns
>
> size:50k buf:0x7fbca1dc9100
> stos:68149 ns (times 32), 24.041439 bytes/ns
> stos+1:85762 ns (times 32), 19.104032 bytes/ns
> stos+4:85762 ns (times 32), 19.104032 bytes/ns
> stos+8:68204 ns (times 32), 24.022051 bytes/ns
>
> size:1024k buf:0x7fa3036a5100
> stos:1632285 ns (times 32), 20.556724 bytes/ns
> stos+1:1891092 ns (times 32), 17.743416 bytes/ns
> stos+4:1891089 ns (times 32), 17.743444 bytes/ns
> stos+8:1632181 ns (times 32), 20.558034 bytes/ns
>
> size:5000k buf:0x7fdf5cd6b100
> stos:15592138 ns (times 32), 10.558298 bytes/ns
> stos+1:15501841 ns (times 32), 10.619799 bytes/ns
> stos+4:15507773 ns (times 32), 10.615737 bytes/ns
> stos+8:15589617 ns (times 32), 10.560005 bytes/ns
>
> The source is attached.
OK. This looks sufficiently significant (despite unaligned memsets
being rare) that it would be nice to optimize it. Could we just write
an initial possibly-misaligned word then increment the start address
and round it up before using rep stos?
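
A minimal C sketch of that idea (the helper name and structure are illustrative, not musl's actual memset):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the head-alignment trick: store one possibly-misaligned
 * 8-byte word at the start, then round the pointer up to an 8-byte
 * boundary so the bulk loop (e.g. rep stosq) runs aligned. The
 * overlapping bytes are simply written twice. */
static void *memset_align_head(void *dest, int c, size_t n)
{
        unsigned char *d = dest;

        if (n >= 8) {
                uint64_t fill = 0x0101010101010101ULL * (unsigned char)c;
                size_t head = -(uintptr_t)d & 7; /* bytes to next boundary */

                memcpy(d, &fill, 8);             /* unaligned first word */
                d += head;
                n -= head;
                /* d is now 8-byte aligned; the aligned rep-stosq body
                 * would run from here */
        }
        memset(d, c, n); /* stands in for aligned body + tail */
        return dest;
}
```

The cost is one extra overlapping store, which the numbers above suggest is cheap next to the misalignment penalty on the stosq path.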
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/time.h>
> #include <sys/syscall.h>
> #include <time.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> /* Old glibc (< 2.3.4) does not provide this constant. We use syscall
> * directly so this definition is safe. */
> #ifndef CLOCK_MONOTONIC
> #define CLOCK_MONOTONIC 1
> #endif
>
> /* libc has incredibly messy way of doing this,
> * typically requiring -lrt. We just skip all this mess */
> static void get_mono(struct timespec *ts)
> {
>         syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts);
> }
FWIW, this is a bad idea; you get syscall overhead in your
measurements. If you just use clock_gettime (the function) you'll get
vdso results (no syscall).
Using the syscall directly is also sketchy in that x32 has an
incorrect kernel-side definition for struct timespec, but I think it
will only matter if aarch64-ILP32 copies this problem from x32 and
you're using a big-endian system.
Rich
* Re: Re: [PATCH] x86_64/memset: simple optimizations
From: Denys Vlasenko @ 2015-02-10 20:52 UTC (permalink / raw)
To: musl
On Tue, Feb 10, 2015 at 9:43 PM, Rich Felker <dalias@aerifal.cx> wrote:
> On Tue, Feb 10, 2015 at 09:27:17PM +0100, Denys Vlasenko wrote:
>> On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker <dalias@aerifal.cx> wrote:
>> /* libc has incredibly messy way of doing this,
>> * typically requiring -lrt. We just skip all this mess */
>> static void get_mono(struct timespec *ts)
>> {
>>         syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts);
>> }
>
> FWIW, this is a bad idea; you get syscall overhead in your
> measurements. If you just use clock_gettime (the function) you'll get
> vdso results (no syscall).
I repeat the memset 32 times between timestamp reads.
Thus, even in the "small" 20kB memset test
there are 640kB of writes to L1. That is big enough
to make the timing overhead insignificant.
* Re: Re: [PATCH] x86_64/memset: simple optimizations
From: Rich Felker @ 2015-02-10 20:54 UTC (permalink / raw)
To: musl
On Tue, Feb 10, 2015 at 09:52:54PM +0100, Denys Vlasenko wrote:
> On Tue, Feb 10, 2015 at 9:43 PM, Rich Felker <dalias@aerifal.cx> wrote:
> > On Tue, Feb 10, 2015 at 09:27:17PM +0100, Denys Vlasenko wrote:
> >> On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker <dalias@aerifal.cx> wrote:
> >> /* libc has incredibly messy way of doing this,
> >> * typically requiring -lrt. We just skip all this mess */
> >> static void get_mono(struct timespec *ts)
> >> {
> >>         syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts);
> >> }
> >
> > FWIW, this is a bad idea; you get syscall overhead in your
> > measurements. If you just use clock_gettime (the function) you'll get
> > vdso results (no syscall).
>
> I repeat the memset 32 times between timestamp reads.
> Thus, even in the "small" 20kB memset test
> there are 640kB of writes to L1. That is big enough
> to make the timing overhead insignificant.
Yes, I agree it's probably okay the way you've structured the test
here; that's why I mentioned it as a "FWIW" rather than an objection
to the results. It was more an aside remark about how this technique
could be problematic in the future. Sorry for not being clear.
Rich