memchr() performance

mailing list of musl libc
 help / color / mirror / code / Atom feed

* memchr() performance
@ 2016-09-18 18:54 Georg Sauthoff
  2016-09-18 20:40 ` Rich Felker
  2016-09-18 20:40 ` Szabolcs Nagy
  0 siblings, 2 replies; 7+ messages in thread
From: Georg Sauthoff @ 2016-09-18 18:54 UTC (permalink / raw)
  To: musl

(please CC me as I am not subscribed to this ML)

Hello,

fyi, I've done some benchmarking of different memchr() and std::find()
versions.

I also included the memchr() version from musl.

In general, musl's memchr() implementation doesn't perform better than a
simple unrolled loop (as used in libstdc++ std::find()) - and that is
consistent over different CPU generations and architectures.

On recent Intel CPUs it is even slower than a naive implementation:

https://gms.tf/stdfind-and-memchr-optimizations.html#measurements
https://gms.tf/sparc-and-ppc-find-benchmark-results.html

Of course, on x86, other implementations that use SIMD instructions
perform even better.

Best regards
Georg

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: memchr() performance
  2016-09-18 18:54 memchr() performance Georg Sauthoff
@ 2016-09-18 20:40 ` Rich Felker
  2016-09-18 20:40 ` Szabolcs Nagy
  1 sibling, 0 replies; 7+ messages in thread
From: Rich Felker @ 2016-09-18 20:40 UTC (permalink / raw)
  To: musl

On Sun, Sep 18, 2016 at 08:54:22PM +0200, Georg Sauthoff wrote:
> (please CC me as I am not subscribed to this ML)
> 
> Hello,
> 
> fyi, I've done some benchmarking of different memchr() and std::find()
> versions.
> 
> I also included the memchr() version from musl.
> 
> In general, musl's memchr() implementation doesn't perform better than a
> simple unrolled loop (as used in libstdc++ std::find()) - and that is
> consistent over different CPU generations and architectures.
> 
> On recent Intel CPUs it is even slower than a naive implementation:

Are you assuming vectorization of the naive version by the compiler? I
think it's reasonable to assume that on x86_64 but not on 32-bit since
many users build for a baseline ISA that does not have vector ops
(i486 or i586).

> https://gms.tf/stdfind-and-memchr-optimizations.html#measurements
> https://gms.tf/sparc-and-ppc-find-benchmark-results.html
> 
> Of course, on x86, other implementations that use SIMD instructions
> perform even better.

I'm aware that musl's memchr (and more generally the related functions
like strchr, strlen, etc.) are not performing great, but it's not
clear to me what the right solution is, since the different approaches
vary A LOT in terms of how they compare with each other depending on
the exact cpu model and compiler. Improving this situation is probably
a big project.

Rich

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: memchr() performance
  2016-09-18 18:54 memchr() performance Georg Sauthoff
  2016-09-18 20:40 ` Rich Felker
@ 2016-09-18 20:40 ` Szabolcs Nagy
  2016-09-19 13:29   ` Georg Sauthoff
  1 sibling, 1 reply; 7+ messages in thread
From: Szabolcs Nagy @ 2016-09-18 20:40 UTC (permalink / raw)
  To: Georg Sauthoff; +Cc: musl

* Georg Sauthoff <mail@georg.so> [2016-09-18 20:54:22 +0200]:
> 
> In general, musl's memchr() implementation doesn't perform better than a
> simple unrolled loop (as used in libstdc++ std::find()) - and that is
> consistent over different CPU generations and architectures.
> 

memchr in musl was never updated (same for >5 years) so probably
should be and last time the position was

"In the particular case of strlen, the naive unrolled strlen with no
OOB access is actually optimal on most or all 32-bit archs, better
than what we have now. I suspect the same is true for strchr and other
related functions."
http://www.openwall.com/lists/musl/2016/01/05/5

but we did not have benchmark numbers at the time.. note that
this benchmark does not measure the effect of more branch
prediction slots used in the unrolled case.

> On recent Intel CPUs it is even slower than a naive implementation:
> 
> https://gms.tf/stdfind-and-memchr-optimizations.html#measurements
> https://gms.tf/sparc-and-ppc-find-benchmark-results.html
> 
> Of course, on x86, other implementations that use SIMD instructions
> perform even better.
> 

yes simd is expected to be faster.

but that needs asm which is expensive to maintain (there is no
portable simd language extension for c and there is the aliasing
issue: the reinterpret_cast in your code is formally ub).

> Best regards
> Georg

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: memchr() performance
  2016-09-18 20:40 ` Szabolcs Nagy
@ 2016-09-19 13:29   ` Georg Sauthoff
  2016-09-19 13:37     ` Rich Felker
  2016-09-19 17:58     ` Szabolcs Nagy
  0 siblings, 2 replies; 7+ messages in thread
From: Georg Sauthoff @ 2016-09-19 13:29 UTC (permalink / raw)
  To: musl

On Sun, Sep 18, 2016 at 10:40:36PM +0200, Szabolcs Nagy wrote:

Hello,

> * Georg Sauthoff <mail@georg.so> [2016-09-18 20:54:22 +0200]:

[..]

> > On recent Intel CPUs it is even slower than a naive implementation:
> > 
> > https://gms.tf/stdfind-and-memchr-optimizations.html#measurements
> > https://gms.tf/sparc-and-ppc-find-benchmark-results.html
> > 
> > Of course, on x86, other implementations that use SIMD instructions
> > perform even better.

> yes simd is expected to be faster.

> but that needs asm which is expensive to maintain (there is no
> portable simd language extension for c and there is the aliasing
> issue: the reinterpret_cast in your code is formally ub).

you mean because the vector-word pointer returned by reinterpret_cast is
used to access vector-words in the memory passed via a char pointer and
this is not covered by the ISO C++ strict aliasing rules?

Yes. Sure, ub means that anything can happen, but this case should be ok
with GCC - if the function is compiled in isolation in its own
translation unit.

I mean, there isn't much possibiltiy for reordering due to the
application of strict-aliasing-rules that would yield a different
result.

There are no aliased write accesses.

Btw, the current musl memchr() implementation has similar aliased
accesses - there, unsigned characters are aliased via a size_t pointer. 

Best regards
Georg

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: memchr() performance
  2016-09-19 13:29   ` Georg Sauthoff
@ 2016-09-19 13:37     ` Rich Felker
  2016-09-19 17:58     ` Szabolcs Nagy
  1 sibling, 0 replies; 7+ messages in thread
From: Rich Felker @ 2016-09-19 13:37 UTC (permalink / raw)
  To: musl

On Mon, Sep 19, 2016 at 03:29:53PM +0200, Georg Sauthoff wrote:
> On Sun, Sep 18, 2016 at 10:40:36PM +0200, Szabolcs Nagy wrote:
> 
> Hello,
> 
> > * Georg Sauthoff <mail@georg.so> [2016-09-18 20:54:22 +0200]:
> 
> [..]
> 
> > > On recent Intel CPUs it is even slower than a naive implementation:
> > > 
> > > https://gms.tf/stdfind-and-memchr-optimizations.html#measurements
> > > https://gms.tf/sparc-and-ppc-find-benchmark-results.html
> > > 
> > > Of course, on x86, other implementations that use SIMD instructions
> > > perform even better.
> 
> > yes simd is expected to be faster.
>  
> > but that needs asm which is expensive to maintain (there is no
> > portable simd language extension for c and there is the aliasing
> > issue: the reinterpret_cast in your code is formally ub).
> 
> you mean because the vector-word pointer returned by reinterpret_cast is
> used to access vector-words in the memory passed via a char pointer and
> this is not covered by the ISO C++ strict aliasing rules?
> 
> Yes. Sure, ub means that anything can happen, but this case should be ok
> with GCC - if the function is compiled in isolation in its own
> translation unit.
> 
> I mean, there isn't much possibiltiy for reordering due to the
> application of strict-aliasing-rules that would yield a different
> result.
> 
> There are no aliased write accesses.
> 
> Btw, the current musl memchr() implementation has similar aliased
> accesses - there, unsigned characters are aliased via a size_t pointer. 

Yes, that's a known bug with a pending patch - see:

http://www.openwall.com/lists/musl/2016/04/27/8

Rich


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: memchr() performance
  2016-09-19 13:29   ` Georg Sauthoff
  2016-09-19 13:37     ` Rich Felker
@ 2016-09-19 17:58     ` Szabolcs Nagy
  2016-09-20 17:57       ` Georg Sauthoff
  1 sibling, 1 reply; 7+ messages in thread
From: Szabolcs Nagy @ 2016-09-19 17:58 UTC (permalink / raw)
  To: musl; +Cc: Georg Sauthoff

* Georg Sauthoff <mail@georg.so> [2016-09-19 15:29:53 +0200]:
> 
> Yes. Sure, ub means that anything can happen, but this case should be ok
> with GCC - if the function is compiled in isolation in its own
> translation unit.
> 

unless you use lto

(which musl does not support now, but ppl are doing this)


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: memchr() performance
  2016-09-19 17:58     ` Szabolcs Nagy
@ 2016-09-20 17:57       ` Georg Sauthoff
  0 siblings, 0 replies; 7+ messages in thread
From: Georg Sauthoff @ 2016-09-20 17:57 UTC (permalink / raw)
  To: musl

On Mon, Sep 19, 2016 at 07:58:48PM +0200, Szabolcs Nagy wrote:
> * Georg Sauthoff <mail@georg.so> [2016-09-19 15:29:53 +0200]:

> > Yes. Sure, ub means that anything can happen, but this case should be ok
> > with GCC - if the function is compiled in isolation in its own
> > translation unit.
 
> unless you use lto
 
> (which musl does not support now, but ppl are doing this)

Yes, with lto there is no isolation.

For lto you can do one of the following:

- write a version in pure assembler, doesn't have to be harder to read
  than a C version

- use the may_alias attribute (supported at least by GCC/Clang)

- apply the memcpy() trick:
  https://github.com/gsauthof/find-memchr/blob/master/find_avx2_memcpy_impl.cc

Best regards
Georg


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-09-20 17:57 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-18 18:54 memchr() performance Georg Sauthoff
2016-09-18 20:40 ` Rich Felker
2016-09-18 20:40 ` Szabolcs Nagy
2016-09-19 13:29   ` Georg Sauthoff
2016-09-19 13:37     ` Rich Felker
2016-09-19 17:58     ` Szabolcs Nagy
2016-09-20 17:57       ` Georg Sauthoff

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).