Re: towlower performance

mailing list of musl libc
 help / color / mirror / code / Atom feed

* Re: towlower performance
       [not found] <1871525485.1112020.1495912199619.ref@mail.yahoo.com>
@ 2017-05-27 19:09 ` Brad Conroy
  0 siblings, 0 replies; 4+ messages in thread
From: Brad Conroy @ 2017-05-27 19:09 UTC (permalink / raw)
  To: musl

> Thankfully, though, I think the issue maksis reported is more a matter
> of the towlower/towupper implementation being oriented towards
> size

The current macros in ctype.h evaluate twice and could all really be inlined.
For a test case try building strcasecmp with and without tolower inlined.
It is actually smaller inlined because it doesn't need to setup for a call.

I  try to avoid branching in inlined functions to avoid mis-predictions and to
keep constant time as well as better vectorization (although the unsigned
comparison trick may prevent automatic conversion to SIMD on x86 since
there is no unsigned compare), so here are some branchless static inlined
versions if you want to use them:

static __inline int isalnum(int c){
	return ((unsigned)c-'0' < 10)|(((unsigned)c|32)-'a' < 26);
}
static __inline int isalpha(int c){
	return (((unsigned)c|32)-'a' < 26);
}
static __inline int isascii(int c){
	return (unsigned)c<128;
}
static __inline int isblank(int c){
	return (c==' ')|(c=='\t');
}
static __inline int iscntrl(int c){
	return ((unsigned)c < 0x20) | (c == 0x7f);
}
static __inline int isdigit(int c){
	return (unsigned)c-'0' < 10;
}
static __inline int isgraph(int c){
	return (unsigned)c-0x21 < 0x5e;
}
static __inline int islower(int c){
	return (unsigned)c-'a' < 26;
}
static __inline int isprint(int c){
	return (unsigned)c-0x20 < 0x5f;
}
static __inline int ispunct(int c){
	unsigned y=(unsigned)c;
	return ( ( y-'!' < 95 ) & ( ((y|32)-'a' > 25) | (y-'0' > 9) ) );
}
static __inline int isspace(int c){
	return ((unsigned)c-'\t' < 5)|(c == ' ');
}
static __inline int isupper(int c){
	return (unsigned)c-'A' < 26;
}
static __inline int isxdigit(int c){
	return ((unsigned)c-'0' < 10) | (((unsigned)c|32)-'a' < 6);
}
static __inline int toascii(int c){
	return c & 0x7f;
}
static __inline int tolower(int c){
	return c | ( ((unsigned)c-'A' < 26)<<5 );
}
static __inline int toupper(int c){
	return c & ~(((unsigned)c-'a' < 26)<<5);
}


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: towlower performance
  2017-05-27  0:39 ` Andre McCurdy
@ 2017-05-27  0:59   ` Rich Felker
  0 siblings, 0 replies; 4+ messages in thread
From: Rich Felker @ 2017-05-27  0:59 UTC (permalink / raw)
  To: musl

On Fri, May 26, 2017 at 05:39:31PM -0700, Andre McCurdy wrote:
> On Thu, May 25, 2017 at 11:01 AM, maksis .
> <maksis@adrenaline-network.com> wrote:
> > Hi,
> >
> > After switching my C++ app from a glibc-based build to a static build using
> > musl, I've noticed significant differences in performance with certain
> > loading operations, which seems to be related to using std::map with a
> > case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file
> > with a flat listing of ~350k directory items and mapping them by name takes
> > about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort
> > algorithm by removing all calls to towlower, the loading time with musl
> > drops to 3.5 seconds.
> >
> > I put together a C++ benchmark with the same sorting algorithm, which seems
> > to produce similar results (GCC 6.3, 64 bit, -O3):
> >
> > https://gist.github.com/maksis/92ad04f525d69043283350675d04f160
> >
> > glibc: ~2 seconds (Arch Linux)
> >
> > musl: ~7 seconds (Alpine Linux 3.6.0)
> >
> > What might be causing the difference?
> 
> I'm not sure if it maps directly to your results, but when building a
> gcc based musl toolchain, libstdc++ gets configured to use the
> "generic" OS specific include dir, which seems to contain some less
> than optimal ctype code. e.g. ctype_inline.h comes with the following
> warning:
> 
>   // The following definitions are portable, but insanely slow. If one
>   // cares at all about performance, then specialized ctype
>   // functionality should be added for the native os in question: see
>   // the config/os/bits/ctype_*.h files.
> 
>   https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/config/os/generic/ctype_inline.h

That's a bit of an overstatement. While it's possible for some
operations to be slightly to moderately faster with a
pokes-at-libc-internals implementation of c++ locale/ctype
functionality, the portable code is not "insanely" slow. For wchar
functionality I don't think it's even slower at all; as far as I know
glibc only provides function-bypass tables for the 8bit char is*
functions, and for the past couple decades they haven't even fully
bypassed function calls since a function call is used to obtain the
thread-local locale tables.

In any case the portable-interface-based implementation (generic) is
the correct thing to do. musl does not and won't provide backdoors to
bypass that.

Thankfully, though, I think the issue maksis reported is more a matter
of the towlower/towupper implementation being oriented towards size
(we have all the Unicode case mappings plus code to perform them in
about 1k!) rather than performance. A simple fix that would make this
case fast (maybe faster than glibc even) is short-circuiting the
common ascii case, but a more multilingual-friendly solution might be
using some sort of O(1) first-stage table to find the relevant parts
of the mapping tables instead of using linear searches and multiple
special-case branches up front. I suspect we can make the code very
fast while keeping the size under 3k and probably less.

Rich

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: towlower performance
  2017-05-25 18:01 maksis .
@ 2017-05-27  0:39 ` Andre McCurdy
  2017-05-27  0:59   ` Rich Felker
  0 siblings, 1 reply; 4+ messages in thread
From: Andre McCurdy @ 2017-05-27  0:39 UTC (permalink / raw)
  To: musl

On Thu, May 25, 2017 at 11:01 AM, maksis .
<maksis@adrenaline-network.com> wrote:
> Hi,
>
> After switching my C++ app from a glibc-based build to a static build using
> musl, I've noticed significant differences in performance with certain
> loading operations, which seems to be related to using std::map with a
> case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file
> with a flat listing of ~350k directory items and mapping them by name takes
> about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort
> algorithm by removing all calls to towlower, the loading time with musl
> drops to 3.5 seconds.
>
> I put together a C++ benchmark with the same sorting algorithm, which seems
> to produce similar results (GCC 6.3, 64 bit, -O3):
>
> https://gist.github.com/maksis/92ad04f525d69043283350675d04f160
>
> glibc: ~2 seconds (Arch Linux)
>
> musl: ~7 seconds (Alpine Linux 3.6.0)
>
> What might be causing the difference?

I'm not sure if it maps directly to your results, but when building a
gcc based musl toolchain, libstdc++ gets configured to use the
"generic" OS specific include dir, which seems to contain some less
than optimal ctype code. e.g. ctype_inline.h comes with the following
warning:

  // The following definitions are portable, but insanely slow. If one
  // cares at all about performance, then specialized ctype
  // functionality should be added for the native os in question: see
  // the config/os/bits/ctype_*.h files.

  https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libstdc%2B%2B-v3/config/os/generic/ctype_inline.h


^ permalink raw reply	[flat|nested] 4+ messages in thread

* towlower performance
@ 2017-05-25 18:01 maksis .
  2017-05-27  0:39 ` Andre McCurdy
  0 siblings, 1 reply; 4+ messages in thread
From: maksis . @ 2017-05-25 18:01 UTC (permalink / raw)
  To: musl@lists.openwall.com <musl@lists.openwall.com>

[-- Attachment #1: Type: text/plain, Size: 860 bytes --]

Hi,

After switching my C++ app from a glibc-based build to a static build using musl, I've noticed significant differences in performance with certain loading operations, which seems to be related to using std::map with a case-insensitive UTF-8 sorting algorithm. For example, parsing a XML file with a flat listing of ~350k directory items and mapping them by name takes about 3 seconds with glibc vs 13 seconds with musl. After modifying the sort algorithm by removing all calls to towlower, the loading time with musl drops to 3.5 seconds.

I put together a C++ benchmark with the same sorting algorithm, which seems to produce similar results (GCC 6.3, 64 bit, -O3):

https://gist.github.com/maksis/92ad04f525d69043283350675d04f160

glibc: ~2 seconds (Arch Linux)
musl: ~7 seconds (Alpine Linux 3.6.0)

What might be causing the difference?

[-- Attachment #2: Type: text/html, Size: 2597 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-05-27 19:09 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1871525485.1112020.1495912199619.ref@mail.yahoo.com>
2017-05-27 19:09 ` towlower performance Brad Conroy
2017-05-25 18:01 maksis .
2017-05-27  0:39 ` Andre McCurdy
2017-05-27  0:59   ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).