towlower performance

mailing list of musl libc
 help / color / mirror / code / Atom feed

* towlower performance
@ 2017-05-25 18:01 maksis .
  2017-05-27  0:39 ` Andre McCurdy
  0 siblings, 1 reply; 5+ messages in thread
From: maksis . @ 2017-05-25 18:01 UTC (permalink / raw)
  To: musl@lists.openwall.com <musl@lists.openwall.com>

[-- Attachment #1: Type: text/plain, Size: 860 bytes --]

Hi,

After switching my C++ app from a glibc-based build to a static build using musl, I've noticed significant differences in performance with certain loading operations, which seems to be related to using std::map with a case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file with a flat listing of ~350k directory items and mapping them by name takes about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort algorithm by removing all calls to towlower, the loading time with musl drops to 3.5 seconds.

I put together a C++ benchmark with the same sorting algorithm, which seems to produce similar results (GCC 6.3, 64 bit, -O3):

https://gist.github.com/maksis/92ad04f525d69043283350675d04f160

glibc: ~2 seconds (Arch Linux)
musl: ~7 seconds (Alpine Linux 3.6.0)

What might be causing the difference?

[-- Attachment #2: Type: text/html, Size: 2597 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: towlower performance
  2017-05-25 18:01 towlower performance maksis .
@ 2017-05-27  0:39 ` Andre McCurdy
  2017-05-27  0:59   ` Rich Felker
  0 siblings, 1 reply; 5+ messages in thread
From: Andre McCurdy @ 2017-05-27  0:39 UTC (permalink / raw)
  To: musl

On Thu, May 25, 2017 at 11:01 AM, maksis .
<maksis@adrenaline-network.com> wrote:
> Hi,
>
> After switching my C++ app from a glibc-based build to a static build using
> musl, I've noticed significant differences in performance with certain
> loading operations, which seems to be related to using std::map with a
> case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file
> with a flat listing of ~350k directory items and mapping them by name takes
> about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort
> algorithm by removing all calls to towlower, the loading time with musl
> drops to 3.5 seconds.
>
> I put together a C++ benchmark with the same sorting algorithm, which seems
> to produce similar results (GCC 6.3, 64 bit, -O3):
>
> https://gist.github.com/maksis/92ad04f525d69043283350675d04f160
>
> glibc: ~2 seconds (Arch Linux)
>
> musl: ~7 seconds (Alpine Linux 3.6.0)
>
> What might be causing the difference?

I'm not sure if it maps directly to your results, but when building a
gcc based musl toolchain, libstdc++ gets configured to use the
"generic" OS specific include dir, which seems to contain some less
than optimal ctype code. e.g. ctype_inline.h comes with the following
warning:

  // The following definitions are portable, but insanely slow. If one
  // cares at all about performance, then specialized ctype
  // functionality should be added for the native os in question: see
  // the config/os/bits/ctype_*.h files.

  https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/config/os/generic/ctype_inline.h


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: towlower performance
  2017-05-27  0:39 ` Andre McCurdy
@ 2017-05-27  0:59   ` Rich Felker
  2017-05-30 12:23     ` [PATCH] towupper/towlower: fast path for ascii chars Natanael Copa
  0 siblings, 1 reply; 5+ messages in thread
From: Rich Felker @ 2017-05-27  0:59 UTC (permalink / raw)
  To: musl

On Fri, May 26, 2017 at 05:39:31PM -0700, Andre McCurdy wrote:
> On Thu, May 25, 2017 at 11:01 AM, maksis .
> <maksis@adrenaline-network.com> wrote:
> > Hi,
> >
> > After switching my C++ app from a glibc-based build to a static build using
> > musl, I've noticed significant differences in performance with certain
> > loading operations, which seems to be related to using std::map with a
> > case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file
> > with a flat listing of ~350k directory items and mapping them by name takes
> > about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort
> > algorithm by removing all calls to towlower, the loading time with musl
> > drops to 3.5 seconds.
> >
> > I put together a C++ benchmark with the same sorting algorithm, which seems
> > to produce similar results (GCC 6.3, 64 bit, -O3):
> >
> > https://gist.github.com/maksis/92ad04f525d69043283350675d04f160
> >
> > glibc: ~2 seconds (Arch Linux)
> >
> > musl: ~7 seconds (Alpine Linux 3.6.0)
> >
> > What might be causing the difference?
> 
> I'm not sure if it maps directly to your results, but when building a
> gcc based musl toolchain, libstdc++ gets configured to use the
> "generic" OS specific include dir, which seems to contain some less
> than optimal ctype code. e.g. ctype_inline.h comes with the following
> warning:
> 
>   // The following definitions are portable, but insanely slow. If one
>   // cares at all about performance, then specialized ctype
>   // functionality should be added for the native os in question: see
>   // the config/os/bits/ctype_*.h files.
> 
>   https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/config/os/generic/ctype_inline.h

That's a bit of an overstatement. While it's possible for some
operations to be slightly to moderately faster with a
pokes-at-libc-internals implementation of c++ locale/ctype
functionality, the portable code is not "insanely" slow. For wchar
functionality I don't think it's even slower at all; as far as I know
glibc only provides function-bypass tables for the 8bit char is*
functions, and for the past couple decades they haven't even fully
bypassed function calls since a function call is used to obtain the
thread-local locale tables.

In any case the portable-interface-based implementation (generic) is
the correct thing to do. musl does not and won't provide backdoors to
bypass that.

Thankfully, though, I think the issue maksis reported is more a matter
of the towlower/towupper implementation being oriented towards size
(we have all the Unicode case mappings plus code to perform them in
about 1k!) rather than performance. A simple fix that would make this
case fast (maybe faster than glibc even) is short-circuiting the
common ascii case, but a more multilingual-friendly solution might be
using some sort of O(1) first-stage table to find the relevant parts
of the mapping tables instead of using linear searches and multiple
special-case branches up front. I suspect we can make the code very
fast while keeping the size under 3k and probably less.

Rich

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH] towupper/towlower: fast path for ascii chars
  2017-05-27  0:59   ` Rich Felker
@ 2017-05-30 12:23     ` Natanael Copa
  2017-05-31  3:04       ` A. Wilcox
  0 siblings, 1 reply; 5+ messages in thread
From: Natanael Copa @ 2017-05-30 12:23 UTC (permalink / raw)
  To: musl; +Cc: Natanael Copa

Make a fast path for ascii chars which is assumed to be the most common
case. This has significant performance benefit on xml json and similar

---

This gives a performance boost for the given testcase:
https://gist.github.com/maksis/92ad04f525d69043283350675d04f160

Before:
Completed in 8.302969s, compare count 54136421

After:
Completed in 2.745886s, compare count 54136421

I don't consider this the final solution but it is atleast a significant
improvement.

 src/ctype/towctrans.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/ctype/towctrans.c b/src/ctype/towctrans.c
index 6af61875..cf13a862 100644
--- a/src/ctype/towctrans.c
+++ b/src/ctype/towctrans.c
@@ -1,3 +1,4 @@
+#include <ctype.h>
 #include <wctype.h>
 #include "libc.h"
 
@@ -9,7 +10,6 @@ static const struct {
 	signed char lower;
 	unsigned char len;
 } casemaps[] = {
-	CASEMAP('A','Z','a'),
 	CASEMAP(0xc0,0xde,0xe0),
 
 	CASELACE(0x0100,0x012e),
@@ -257,12 +257,12 @@ static wchar_t __towcase(wchar_t wc, int lower)
 
 wint_t towupper(wint_t wc)
 {
-	return __towcase(wc, 0);
+	return (unsigned)wc < 128 ? toupper(wc) : __towcase(wc, 0);
 }
 
 wint_t towlower(wint_t wc)
 {
-	return __towcase(wc, 1);
+	return (unsigned)wc < 128 ? tolower(wc) : __towcase(wc, 1);
 }
 
 wint_t __towupper_l(wint_t c, locale_t l)
-- 
2.13.0



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] towupper/towlower: fast path for ascii chars
  2017-05-30 12:23     ` [PATCH] towupper/towlower: fast path for ascii chars Natanael Copa
@ 2017-05-31  3:04       ` A. Wilcox
  0 siblings, 0 replies; 5+ messages in thread
From: A. Wilcox @ 2017-05-31  3:04 UTC (permalink / raw)
  To: musl

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 30/05/17 07:23, Natanael Copa wrote:
> Make a fast path for ascii chars which is assumed to be the most
> common case. This has significant performance benefit on xml json
> and similar


I'm not sure if that would be the common case when you are already
dealing with wide chars.  I would assume you're using wide chars for a
reason...


> { -	return __towcase(wc, 0); +	return (unsigned)wc < 128 ?
> toupper(wc) : __towcase(wc, 0); }
> 
> wint_t towlower(wint_t wc) { -	return __towcase(wc, 1); +	return
> (unsigned)wc < 128 ? tolower(wc) : __towcase(wc, 1); }


...so I would likely do something like:

return (unsigned)wc > 127 ? __towcase(wc, 1) : tolower(wc);

I think it comes down to silly style though.  Functionally it is
likely similar / the same.

Would be curious to see if there is any difference on other loads (and
on other architectures).


- --arw

- -- 
A. Wilcox (awilfox)
Project Lead, Adélie Linux
http://adelielinux.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJZLjKRAAoJEMspy1GSK50UfPMQAI9WHwH2TRHfRBmC9F9dRRqR
c0wTMC0IkaI9iD7xrdpG9lq6EhHelNbC94bIUFtHY5Eao4Vf6SJUv7b+o+OEHGs0
11PmDSvciHZx4vQb64rQJte8u6s3Z++POMpsLKULS4Gi63xLCtaCKB8yp4/eKezL
5cBqxCIXz+TlCpS/7TQxyYjwCA8C13Utd/2GmU4/9cCkdDOHJ5asg3sfFS4MtVPS
tSSunDgDgn0b3b0nLD0q4PiymRZ0cpGj2MKLWCR2WPdBkMitjegNm4wYcJE6naWH
ao4Fp26McviszvQvxlWzidpiDAq6kxU6jwHylV/VniseqcG5XBDDyG/MrzzMDlKr
lyNGQg6xXAcVvIujZBJYJYrgL1oDEpQ6c+JS4aD4gcM0WIn8YM9K5PQPKbcwrsqo
2sDKY51UTT3SDoghtj7ZkqgBMbQLmEE5lyom3B7EtcQazKj9pWtTmCy8kI+cMQcT
WPYurl67aEg54WW1oGChKza6P1WyuWTL27TL4NVOsVMyCqSa2ekF7rJTBP7rey9r
hNuFz14/tKyZTA1Mq3RWtN15ydsC+ToLbZikYqEmmBq36p2J1BbeWznfRRw6dpyt
G1sz70FtTBtODAK1xVngsUsYVPINgNhaQlQKs96NSiQzU0khY2or9kbYWPm0GyS8
fPerUQTvKecXmUYd2OVk
=pfgv
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-05-31  3:04 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-25 18:01 towlower performance maksis .
2017-05-27  0:39 ` Andre McCurdy
2017-05-27  0:59   ` Rich Felker
2017-05-30 12:23     ` [PATCH] towupper/towlower: fast path for ascii chars Natanael Copa
2017-05-31  3:04       ` A. Wilcox

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).