From: "alice" <alice@ayaya.dev>
To: <musl@lists.openwall.com>, "Rich Felker" <dalias@libc.org>
Cc: "Markus Wichmann" <nullplan@gmx.net>
Subject: Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
Date: Sat, 11 Feb 2023 06:44:29 +0100
Message-ID: <CQFHTXNLAUOI.2OHPT1JEJF27G@sumire>
In-Reply-To: <10dbd851.a99.1863ee385b5.Coremail.00107082@163.com>
On Sat Feb 11, 2023 at 6:12 AM CET, David Wang wrote:
>
> At 2023-02-10 22:19:55, "Rich Felker" <dalias@libc.org> wrote:
> >On Fri, Feb 10, 2023 at 09:45:12PM +0800, David Wang wrote:
> >>
> >> About wrapper_cmp, in my last profiling, there are total 931387
> >> samples collected, 257403 samples contain callchain ->wrapper_cmp,
> >> among those 257403 samples, 167410 samples contain callchain
> >> ->wrapper_cmp->mycmp, that is why I think there is extra overhead
> >> about wrapper_cmp. Maybe compiler optimization would change the
> >> result, and I will make further checks.
> >
> >Yes. On i386 here, -O0 takes wrapper_cmp from 1 instruction to 10
> >instructions.
> >
> >Rich
>
> With optimized binary code, it is very hard to collect an intact callchain from the kernel via perf_event_open's PERF_SAMPLE_CALLCHAIN.
> But to profile qsort, a full callchain may not be necessary: sampling the IP register alone is enough to identify which parts take the most CPU cycles.
> So I changed strategy: instead of PERF_SAMPLE_CALLCHAIN, I now just use PERF_SAMPLE_IP.
>
> This is what I got:
> +-------------------+---------------+
> | func | count |
> +-------------------+---------------+
> | Total | 423488 |
> | memcpy | 48.76% 206496 |
> | sift | 16.29% 68989 |
> | mycmp | 14.57% 61714 |
> | trinkle | 8.90% 37690 |
> | cycle | 5.45% 23061 |
> | shr | 2.19% 9293 |
> | __qsort_r | 1.77% 7505 |
> | main | 1.04% 4391 |
> | shl | 0.55% 2325 |
> | wrapper_cmp | 0.42% 1779 |
> | rand | 0.05% 229 |
> | __set_thread_area | 0.00% 16 |
> +-------------------+---------------+
> (Note that this report counts only samples whose IP falls directly within a function's own body; samples inside callees do not contribute to their callers.)
>
> And you're right: with optimization, the impact of wrapper_cmp is very low, only 0.42%.
>
> The memcpy calls stand out. I used a uprobe (perf_event_open with PERF_SAMPLE_REGS_USER) to collect statistics on the size argument (the 3rd parameter, passed in the RDX register): every one of these memcpy calls copies just 4 bytes. According to the source, the memcpy size is the size of the items being sorted, which is int32 in my test case.
> Maybe something could be improved here.
>
>
> I also ran the same profiling against glibc:
> +-----------------------------+--------------+
> | func | count |
> +-----------------------------+--------------+
> | Total | 640880 |
> | msort_with_tmp.part.0 | 73.99 474176 | <--- merge sort?
> | mycmp | 11.76 75392 |
> | main | 6.45 41306 |
> | __memcpy_avx_unaligned_erms | 4.58 29339 |
> | random | 0.86 5525 |
> | __memcpy_avx_unaligned | 0.83 5293 |
> | random_r | 0.76 4882 |
> | rand | 0.45 2897 |
> | _init | 0.31 1975 |
> | _fini | 0.01 80 |
> | __free | 0.00 5 |
> | _int_malloc | 0.00 5 |
> | malloc | 0.00 2 |
> | __qsort_r | 0.00 1 |
> | _int_free | 0.00 1 |
> +-----------------------------+--------------+
>
> Test code:
> -------------------
> #include <stdio.h>
> #include <stdlib.h>
>
> int mycmp(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }
>
> #define MAXN (1<<20)
> int vs[MAXN];
>
> int main() {
> int i, j, k, n, t;
> for (k=0; k<1024; k++) {
> for (i=0; i<MAXN; i++) vs[i]=i;
> for (n=MAXN; n>1; n--) {
> i=n-1; j=rand()%n;
> if (i!=j) { t=vs[i]; vs[i]=vs[j]; vs[j]=t; }
> }
> qsort(vs, MAXN, sizeof(vs[0]), mycmp);
> }
> return 0;
> }
>
> -------------------
> gcc test.c -O2 -static
> With musl-libc:
> $ time ./a.out
>
> real 9m 5.10s
> user 9m 5.09s
> sys 0m 0.00s
>
> With glibc:
> $ time ./a.out
> real 1m56.287s
> user 1m56.270s
> sys 0m0.004s
>
>
>
> To sum up: optimizing those memcpy calls and cutting comparisons to the minimum could bring significant performance improvements, but I doubt that alone could achieve a 4x speedup.
based on the glibc profiling, glibc also has cpu-specific optimisations selected
natively at load time, the _avx_ functions in your case. musl doesn't implement
any SIMD optimisations, so this is a bit apples-to-oranges unless musl implements
the same kind of per-arch optimisation.
you should rerun these with GLIBC_TUNABLES, using something from:
https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
which should let you disable them all (if you just want to compare C code to C code).
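( something like the following, a sketch only -- the exact tunable names and
loader path are assumptions that vary by glibc version and distro, so list the
real ones on your system first: )

```shell
# list the tunables your glibc actually supports (loader path varies;
# needs glibc >= 2.33 for --list-tunables):
# /lib64/ld-linux-x86-64.so.2 --list-tunables

# then disable the cpu-specific routines and rerun the glibc binary:
export GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable,-ERMS
# time ./a.out    # the glibc-linked binary from the test above
echo "$GLIBC_TUNABLES"
```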
( unrelated, but has there been some historic discussion of implementing
something similar in musl? i feel like i might be forgetting something. )
>
> FYI
> David