towlower performance

mailing list of musl libc
 help / color / mirror / code / Atom feed

* towlower performance
@ 2017-05-25 18:01 maksis .
  2017-05-27  0:39 ` Andre McCurdy
  0 siblings, 1 reply; 6+ messages in thread
From: maksis . @ 2017-05-25 18:01 UTC (permalink / raw)
  To: musl@lists.openwall.com <musl@lists.openwall.com>

[-- Attachment #1: Type: text/plain, Size: 860 bytes --]

Hi,

After switching my C++ app from a glibc-based build to a static build using musl, I've noticed significant differences in performance with certain loading operations, which seems to be related to using std::map with a case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file with a flat listing of ~350k directory items and mapping them by name takes about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort algorithm by removing all calls to towlower, the loading time with musl drops to 3.5 seconds.

I put together a C++ benchmark with the same sorting algorithm, which seems to produce similar results (GCC 6.3, 64 bit, -O3):

https://gist.github.com/maksis/92ad04f525d69043283350675d04f160

glibc: ~2 seconds (Arch Linux)
musl: ~7 seconds (Alpine Linux 3.6.0)

What might be causing the difference?

[-- Attachment #2: Type: text/html, Size: 2597 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: towlower performance
  2017-05-25 18:01 towlower performance maksis .
@ 2017-05-27  0:39 ` Andre McCurdy
  2017-05-27  0:59   ` Rich Felker
  0 siblings, 1 reply; 6+ messages in thread
From: Andre McCurdy @ 2017-05-27  0:39 UTC (permalink / raw)
  To: musl

On Thu, May 25, 2017 at 11:01 AM, maksis .
<maksis@adrenaline-network.com> wrote:
> Hi,
>
> After switching my C++ app from a glibc-based build to a static build using
> musl, I've noticed significant differences in performance with certain
> loading operations, which seems to be related to using std::map with a
> case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file
> with a flat listing of ~350k directory items and mapping them by name takes
> about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort
> algorithm by removing all calls to towlower, the loading time with musl
> drops to 3.5 seconds.
>
> I put together a C++ benchmark with the same sorting algorithm, which seems
> to produce similar results (GCC 6.3, 64 bit, -O3):
>
> https://gist.github.com/maksis/92ad04f525d69043283350675d04f160
>
> glibc: ~2 seconds (Arch Linux)
>
> musl: ~7 seconds (Alpine Linux 3.6.0)
>
> What might be causing the difference?

I'm not sure if it maps directly to your results, but when building a
gcc based musl toolchain, libstdc++ gets configured to use the
"generic" OS specific include dir, which seems to contain some less
than optimal ctype code. e.g. ctype_inline.h comes with the following
warning:

  // The following definitions are portable, but insanely slow. If one
  // cares at all about performance, then specialized ctype
  // functionality should be added for the native os in question: see
  // the config/os/bits/ctype_*.h files.

  https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/config/os/generic/ctype_inline.h


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: towlower performance
  2017-05-27  0:39 ` Andre McCurdy
@ 2017-05-27  0:59   ` Rich Felker
  2017-05-30 12:23     ` [PATCH] towupper/towlower: fast path for ascii chars Natanael Copa
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Felker @ 2017-05-27  0:59 UTC (permalink / raw)
  To: musl

On Fri, May 26, 2017 at 05:39:31PM -0700, Andre McCurdy wrote:
> On Thu, May 25, 2017 at 11:01 AM, maksis .
> <maksis@adrenaline-network.com> wrote:
> > Hi,
> >
> > After switching my C++ app from a glibc-based build to a static build using
> > musl, I've noticed significant differences in performance with certain
> > loading operations, which seems to be related to using std::map with a
> > case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file
> > with a flat listing of ~350k directory items and mapping them by name takes
> > about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort
> > algorithm by removing all calls to towlower, the loading time with musl
> > drops to 3.5 seconds.
> >
> > I put together a C++ benchmark with the same sorting algorithm, which seems
> > to produce similar results (GCC 6.3, 64 bit, -O3):
> >
> > https://gist.github.com/maksis/92ad04f525d69043283350675d04f160
> >
> > glibc: ~2 seconds (Arch Linux)
> >
> > musl: ~7 seconds (Alpine Linux 3.6.0)
> >
> > What might be causing the difference?
> 
> I'm not sure if it maps directly to your results, but when building a
> gcc based musl toolchain, libstdc++ gets configured to use the
> "generic" OS specific include dir, which seems to contain some less
> than optimal ctype code. e.g. ctype_inline.h comes with the following
> warning:
> 
>   // The following definitions are portable, but insanely slow. If one
>   // cares at all about performance, then specialized ctype
>   // functionality should be added for the native os in question: see
>   // the config/os/bits/ctype_*.h files.
> 
>   https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/config/os/generic/ctype_inline.h

That's a bit of an overstatement. While it's possible for some
operations to be slightly to moderately faster with a
pokes-at-libc-internals implementation of c++ locale/ctype
functionality, the portable code is not "insanely" slow. For wchar
functionality I don't think it's even slower at all; as far as I know
glibc only provides function-bypass tables for the 8bit char is*
functions, and for the past couple decades they haven't even fully
bypassed function calls since a function call is used to obtain the
thread-local locale tables.

In any case the portable-interface-based implementation (generic) is
the correct thing to do. musl does not and won't provide backdoors to
bypass that.

Thankfully, though, I think the issue maksis reported is more a matter
of the towlower/towupper implementation being oriented towards size
(we have all the Unicode case mappings plus code to perform them in
about 1k!) rather than performance. A simple fix that would make this
case fast (maybe faster than glibc even) is short-circuiting the
common ascii case, but a more multilingual-friendly solution might be
using some sort of O(1) first-stage table to find the relevant parts
of the mapping tables instead of using linear searches and multiple
special-case branches up front. I suspect we can make the code very
fast while keeping the size under 3k and probably less.

Rich

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH] towupper/towlower: fast path for ascii chars
  2017-05-27  0:59   ` Rich Felker
@ 2017-05-30 12:23     ` Natanael Copa
  2017-05-31  3:04       ` A. Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Natanael Copa @ 2017-05-30 12:23 UTC (permalink / raw)
  To: musl; +Cc: Natanael Copa

Make a fast path for ascii chars which is assumed to be the most common
case. This has significant performance benefit on xml json and similar

---

This gives a performance boost for the given testcase:
https://gist.github.com/maksis/92ad04f525d69043283350675d04f160

Before:
Completed in 8.302969s, compare count 54136421

After:
Completed in 2.745886s, compare count 54136421

I don't consider this the final solution but it is atleast a significant
improvement.

 src/ctype/towctrans.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/ctype/towctrans.c b/src/ctype/towctrans.c
index 6af61875..cf13a862 100644
--- a/src/ctype/towctrans.c
+++ b/src/ctype/towctrans.c
@@ -1,3 +1,4 @@
+#include <ctype.h>
 #include <wctype.h>
 #include "libc.h"
 
@@ -9,7 +10,6 @@ static const struct {
 	signed char lower;
 	unsigned char len;
 } casemaps[] = {
-	CASEMAP('A','Z','a'),
 	CASEMAP(0xc0,0xde,0xe0),
 
 	CASELACE(0x0100,0x012e),
@@ -257,12 +257,12 @@ static wchar_t __towcase(wchar_t wc, int lower)
 
 wint_t towupper(wint_t wc)
 {
-	return __towcase(wc, 0);
+	return (unsigned)wc < 128 ? toupper(wc) : __towcase(wc, 0);
 }
 
 wint_t towlower(wint_t wc)
 {
-	return __towcase(wc, 1);
+	return (unsigned)wc < 128 ? tolower(wc) : __towcase(wc, 1);
 }
 
 wint_t __towupper_l(wint_t c, locale_t l)
-- 
2.13.0



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] towupper/towlower: fast path for ascii chars
  2017-05-30 12:23     ` [PATCH] towupper/towlower: fast path for ascii chars Natanael Copa
@ 2017-05-31  3:04       ` A. Wilcox
  0 siblings, 0 replies; 6+ messages in thread
From: A. Wilcox @ 2017-05-31  3:04 UTC (permalink / raw)
  To: musl

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 30/05/17 07:23, Natanael Copa wrote:
> Make a fast path for ascii chars which is assumed to be the most
> common case. This has significant performance benefit on xml json
> and similar


I'm not sure if that would be the common case when you are already
dealing with wide chars.  I would assume you're using wide chars for a
reason...


> { -	return __towcase(wc, 0); +	return (unsigned)wc < 128 ?
> toupper(wc) : __towcase(wc, 0); }
> 
> wint_t towlower(wint_t wc) { -	return __towcase(wc, 1); +	return
> (unsigned)wc < 128 ? tolower(wc) : __towcase(wc, 1); }


...so I would likely do something like:

return (unsigned)wc > 127 ? __towcase(wc, 1) : tolower(wc);

I think it comes down to silly style though.  Functionally it is
likely similar / the same.

Would be curious to see if there is any difference on other loads (and
on other architectures).


- --arw

- -- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJZLjKRAAoJEMspy1GSK50UfPMQAI9WHwH2TRHfRBmC9F9dRRqR
c0wTMC0IkaI9iD7xrdpG9lq6EhHelNbC94bIUFtHY5Eao4Vf6SJUv7b+o+OEHGs0
11PmDSvciHZx4vQb64rQJte8u6s3Z++POMpsLKULS4Gi63xLCtaCKB8yp4/eKezL
5cBqxCIXz+TlCpS/7TQxyYjwCA8C13Utd/2GmU4/9cCkdDOHJ5asg3sfFS4MtVPS
tSSunDgDgn0b3b0nLD0q4PiymRZ0cpGj2MKLWCR2WPdBkMitjegNm4wYcJE6naWH
ao4Fp26McviszvQvxlWzidpiDAq6kxU6jwHylV/VniseqcG5XBDDyG/MrzzMDlKr
lyNGQg6xXAcVvIujZBJYJYrgL1oDEpQ6c+JS4aD4gcM0WIn8YM9K5PQPKbcwrsqo
2sDKY51UTT3SDoghtj7ZkqgBMbQLmEE5lyom3B7EtcQazKj9pWtTmCy8kI+cMQcT
WPYurl67aEg54WW1oGChKza6P1WyuWTL27TL4NVOsVMyCqSa2ekF7rJTBP7rey9r
hNuFz14/tKyZTA1Mq3RWtN15ydsC+ToLbZikYqEmmBq36p2J1BbeWznfRRw6dpyt
G1sz70FtTBtODAK1xVngsUsYVPINgNhaQlQKs96NSiQzU0khY2or9kbYWPm0GyS8
fPerUQTvKecXmUYd2OVk
=pfgv
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: towlower performance
       [not found] <1871525485.1112020.1495912199619.ref@mail.yahoo.com>
@ 2017-05-27 19:09 ` Brad Conroy
  0 siblings, 0 replies; 6+ messages in thread
From: Brad Conroy @ 2017-05-27 19:09 UTC (permalink / raw)
  To: musl

> Thankfully, though, I think the issue maksis reported is more a matter
> of the towlower/towupper implementation being oriented towards
> size

The current macros in ctype.h evaluate twice and could all really be inlined.
For a test case try building strcasecmp with and without tolower inlined.
It is actually smaller inlined because it doesn't need to setup for a call.

I  try to avoid branching in inlined functions to avoid mis-predictions and to
keep constant time as well as better vectorization (although the unsigned
comparison trick may prevent automatic conversion to SIMD on x86 since
there is no unsigned compare), so here are some branchless static inlined
versions if you want to use them:

static __inline int isalnum(int c){
	return ((unsigned)c-'0' < 10)|(((unsigned)c|32)-'a' < 26);
}
static __inline int isalpha(int c){
	return (((unsigned)c|32)-'a' < 26);
}
static __inline int isascii(int c){
	return (unsigned)c<128;
}
static __inline int isblank(int c){
	return (c==' ')|(c=='\t');
}
static __inline int iscntrl(int c){
	return ((unsigned)c < 0x20) | (c == 0x7f);
}
static __inline int isdigit(int c){
	return (unsigned)c-'0' < 10;
}
static __inline int isgraph(int c){
	return (unsigned)c-0x21 < 0x5e;
}
static __inline int islower(int c){
	return (unsigned)c-'a' < 26;
}
static __inline int isprint(int c){
	return (unsigned)c-0x20 < 0x5f;
}
static __inline int ispunct(int c){
	unsigned y=(unsigned)c;
	return ( ( y-'!' < 95 ) & ( ((y|32)-'a' > 25) | (y-'0' > 9) ) );
}
static __inline int isspace(int c){
	return ((unsigned)c-'\t' < 5)|(c == ' ');
}
static __inline int isupper(int c){
	return (unsigned)c-'A' < 26;
}
static __inline int isxdigit(int c){
	return ((unsigned)c-'0' < 10) | (((unsigned)c|32)-'a' < 6);
}
static __inline int toascii(int c){
	return c & 0x7f;
}
static __inline int tolower(int c){
	return c | ( ((unsigned)c-'A' < 26)<<5 );
}
static __inline int toupper(int c){
	return c & ~(((unsigned)c-'a' < 26)<<5);
}


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-05-31  3:04 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-25 18:01 towlower performance maksis .
2017-05-27  0:39 ` Andre McCurdy
2017-05-27  0:59   ` Rich Felker
2017-05-30 12:23     ` [PATCH] towupper/towlower: fast path for ascii chars Natanael Copa
2017-05-31  3:04       ` A. Wilcox
     [not found] <1871525485.1112020.1495912199619.ref@mail.yahoo.com>
2017-05-27 19:09 ` towlower performance Brad Conroy

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).