mailing list of musl libc
* x86[_64] memset and rep stos
@ 2015-02-25  6:12 Rich Felker
  2015-02-25  7:54 ` 邓尧
  0 siblings, 1 reply; 4+ messages in thread
From: Rich Felker @ 2015-02-25  6:12 UTC (permalink / raw)
  To: musl

Doing some timings on the new proposed memset code, I found it was
pathologically slow on my Atom D510 (32-bit) when reaching sizes
around 2k - 16k. Like 4x slower than the old code. Apparently the
issue is that the work being done to align the destination mod 4
misaligns it mod higher powers of two, and "rep stos" performs
pathologically badly when it's not cache-line-aligned, or something like
that. On my faster 64-bit system alignment mod 16 also seems to make a
difference, but less - it's 1.5x slower misaligned mod 16.

I also found that on the 32-bit Atom, there seems to be a huge jump in
speed at size 1024 -- sizes just below 1024 are roughly 2x slower.
Since it otherwise doesn't make a measurable difference, it seems
preferable _not_ to try to reduce the length of the rep stos to avoid
writing the same bytes multiple times but simply use the max allowable
length.

Combined with the first issue, it seems we should "round up to a
multiple of 16" rather than "add 16 then round down to a multiple of
16". Not only does this avoid reducing the length of the rep stos; it
also preserves any higher-than-16 alignment that might be preexisting,
in case even higher alignments are faster.

Rich


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: x86[_64] memset and rep stos
  2015-02-25  6:12 x86[_64] memset and rep stos Rich Felker
@ 2015-02-25  7:54 ` 邓尧
  2015-02-25  9:20   ` Szabolcs Nagy
  0 siblings, 1 reply; 4+ messages in thread
From: 邓尧 @ 2015-02-25  7:54 UTC (permalink / raw)
  To: musl

I'm not an expert on micro-optimization, but why not use a dynamic
routine selection system that would select the optimal routine for a
given CPU during program initialization? The routine selection
algorithm could simply be a predefined static table lookup.
IMO, only a very small number of functions (like memset and memcpy)
would benefit from such a system, so there is no code size overhead
to worry about.

On Wed, Feb 25, 2015 at 2:12 PM, Rich Felker <dalias@libc.org> wrote:
> Doing some timings on the new proposed memset code, I found it was
> pathologically slow on my Atom D510 (32-bit) when reaching sizes
> around 2k - 16k. Like 4x slower than the old code. Apparently the
> issue is that the work being done to align the destination mod 4
> misaligns it mod higher powers of two, and "rep stos" performs
> pathologically bad when it's not cache-line-aligned, or something like
> that. On my faster 64-bit system alignment mod 16 also seems to make a
> difference, but less - it's 1.5x slower misaligned mod 16.
>
> I also found that on the 32-bit Atom, there seems to be a huge jump in
> speed at size 1024 -- sizes just below 1024 are roughly 2x slower.
> Since it otherwise doesn't make a measurable difference, it seems
> preferable _not_ to try to reduce the length of the rep stos to avoid
> writing the same bytes multiple times but simply use the max allowable
> length.
>
> Combined with the first issue, it seems we should "round up to a
> multiple of 16" rather than "add 16 then round down to a multiple of
> 16". Not only does this avoid reducing the length of the rep stos; it
> also preserves any higher-than-16 alignment that might be preexisting,
> in case even higher alignments are faster.
>
> Rich



* Re: x86[_64] memset and rep stos
  2015-02-25  7:54 ` 邓尧
@ 2015-02-25  9:20   ` Szabolcs Nagy
  2015-02-25 17:26     ` Rich Felker
  0 siblings, 1 reply; 4+ messages in thread
From: Szabolcs Nagy @ 2015-02-25  9:20 UTC (permalink / raw)
  To: musl

* 邓尧 <torshie@gmail.com> [2015-02-25 15:54:31 +0800]:
> I'm not an expert on micro-optimization, but why not use a dynamic
> routine selection system that would select the optimal routine for a
> given CPU during program initialization? The routine selection
> algorithm could simply be a predefined static table lookup.
> IMO, only a very small number of functions (like memset and memcpy)
> would benefit from such a system, so there is no code size overhead
> to worry about.

my guess is

- for static linking it adds at least an extra indirection
  and these functions are often used with small inputs

- code size overhead: now you have to include all possible
  versions in libc.so

- for dynamic linking there is a load time dispatch mechanism:
  STT_GNU_IFUNC but it's broken due to lack of specs

- maintenance burden, hard to test

- selecting the right algorithm at runtime is not easy

but i guess eventually when more people use musl it will make sense
to add more target-specific optimizations



* Re: x86[_64] memset and rep stos
  2015-02-25  9:20   ` Szabolcs Nagy
@ 2015-02-25 17:26     ` Rich Felker
  0 siblings, 0 replies; 4+ messages in thread
From: Rich Felker @ 2015-02-25 17:26 UTC (permalink / raw)
  To: musl

On Wed, Feb 25, 2015 at 10:20:20AM +0100, Szabolcs Nagy wrote:
> * 邓尧 <torshie@gmail.com> [2015-02-25 15:54:31 +0800]:
> > I'm not an expert on micro-optimization, but why not use a dynamic
> > routine selection system that would select the optimal routine for a
> > given CPU during program initialization? The routine selection
> > algorithm could simply be a predefined static table lookup.
> > IMO, only a very small number of functions (like memset and memcpy)
> > would benefit from such a system, so there is no code size overhead
> > to worry about.
> 
> my guess is
> 
> - for static linking it adds at least an extra indirection
>   and these functions are often used with small inputs

Yes. Actually the extra indirection seems not to be measurable on
modern cpus but it may matter on older cpus, and older cpus are the
only reason to want runtime selection since the optimal strategy is
not changing for modern ones.

> - code size overhead: now you have to include all possible
>   versions in libc.so

Or worse, in a static binary.

> - for dynamic linking there is a load time dispatch mechanism:
>   STT_GNU_IFUNC but it's broken due to lack of specs

It also wouldn't get resolved before libc.so/ld.so needed to call
these functions.

> - maintenance burden, hard to test

This is the main one -- unbounded growth of maintenance and debugging
burden. How would you test N different versions of the same code
without either having a test system for each cpu model, or extra hacks
to force a particular version of the code to get used even when it
mismatches the host cpu model? Who is responsible for ensuring that
the "optimized" variants for old cpus are actually still faster than
all the new variants? Etc. These issues are not just theoretical;
they're a real-world problem that has plagued glibc even though they
have a much bigger team maintaining it. And uclibc basically fell
apart (IMO) because of the inability to test and maintain all the
combinations of options (although those were other types of options,
not optimized asm variants).
      
> - selecting the right algorithm at runtime is not easy

Well, it involves the same tedious (but sometimes fun) work Denys and
I have been doing, except doing it with lots of different cpu models
and code variants (i.e. much more work).

> but i guess eventually when more people use musl it will make sense
> to add more target-specific optimizations

Target-specific, yes, at least for targets that are being used in
places where there are measurable bottlenecks. But I'd still rather
avoid code variants for cpu models within the same ISA. If it's
looking like that kind of thing is needed, perhaps we should try to
model the relevant functions with "GNU C" (with portable fallback of
course, as always) and have the compiler generate the appropriate
variants (i.e. outsourcing trust that the generated code is actually
correct to the compiler) rather than writing multiple asm versions.

Rich



end of thread, other threads:[~2015-02-25 17:26 UTC | newest]

Thread overview: 4+ messages
2015-02-25  6:12 x86[_64] memset and rep stos Rich Felker
2015-02-25  7:54 ` 邓尧
2015-02-25  9:20   ` Szabolcs Nagy
2015-02-25 17:26     ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/
