mailing list of musl libc
From: "David Wang" <00107082@163.com>
To: "Rich Felker" <dalias@libc.org>
Cc: musl@lists.openwall.com, "Markus Wichmann" <nullplan@gmx.net>
Subject: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
Date: Sat, 11 Feb 2023 13:12:23 +0800 (CST)	[thread overview]
Message-ID: <10dbd851.a99.1863ee385b5.Coremail.00107082@163.com> (raw)
In-Reply-To: <20230210141955.GA4163@brightrain.aerifal.cx>





At 2023-02-10 22:19:55, "Rich Felker" <dalias@libc.org> wrote:
>On Fri, Feb 10, 2023 at 09:45:12PM +0800, David Wang wrote:
>> 
>> 
>> 
>
>> About wrapper_cmp, in my last profiling, there are total 931387
>> samples collected, 257403 samples contain callchain ->wrapper_cmp,
>> among those 257403 samples, 167410 samples contain callchain
>> ->wrapper_cmp->mycmp, that is why I think there is extra overhead
>> about wrapper_cmp. Maybe compiler optimization would change the
>> result, and I will make further checks.
>
>Yes. On i386 here, -O0 takes wrapper_cmp from 1 instruction to 10
>instructions.
>
>Rich

With an optimized binary, it is very hard to collect an intact callchain from the kernel via perf_event_open:PERF_SAMPLE_CALLCHAIN.
But to profile qsort, a callchain may not be necessary: sampling the IP register is enough to identify which part takes the most CPU cycles.
So I changed strategy: instead of PERF_SAMPLE_CALLCHAIN, I now just use PERF_SAMPLE_IP.
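Below is a minimal sketch of the kind of sampler setup I mean (not my exact harness; the event choice and sample period are just placeholders):
-------------------
/* Configure perf_event_open() to record only the instruction pointer
 * (PERF_SAMPLE_IP) whenever the CPU-cycle counter overflows, for the
 * calling process on any CPU. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_ip_sampler(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_type = PERF_SAMPLE_IP;   /* IP only, no callchain */
    attr.sample_period = 100000;         /* one sample per ~100k cycles */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0 (this process), cpu = -1 (any), group_fd = -1, flags = 0 */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}
/* Samples are then read back from the fd's mmap'd ring buffer. */
-------------------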

This is what I got:
+-------------------+--------+---------+
| func              | share  | samples |
+-------------------+--------+---------+
| Total             |        |  423488 |
| memcpy            | 48.76% |  206496 |
| sift              | 16.29% |   68989 |
| mycmp             | 14.57% |   61714 |
| trinkle           |  8.90% |   37690 |
| cycle             |  5.45% |   23061 |
| shr               |  2.19% |    9293 |
| __qsort_r         |  1.77% |    7505 |
| main              |  1.04% |    4391 |
| shl               |  0.55% |    2325 |
| wrapper_cmp       |  0.42% |    1779 |
| rand              |  0.05% |     229 |
| __set_thread_area |  0.00% |      16 |
+-------------------+--------+---------+
(Note that in this profile report I count only samples whose IP falls directly within a function's body; samples that land in a callee do not contribute to any of its callers. In other words, these are self/exclusive counts.)
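For reference, attribution works roughly like the hypothetical helper below: each sampled IP is matched against a function's address range, so it counts only for the function it landed in.
-------------------
#include <stddef.h>
#include <stdint.h>

/* Hypothetical symbol-table entry built from the binary's symbols. */
struct sym { uintptr_t start; size_t size; const char *name; };

/* Return the name of the function whose body contains the sampled IP. */
static const char *ip_to_func(uintptr_t ip, const struct sym *syms, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ip >= syms[i].start && ip - syms[i].start < syms[i].size)
            return syms[i].name;
    return "?";
}
-------------------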

And you're right: with optimization, the impact of wrapper_cmp is very low, only 0.42%.

The memcpy entry stands out above. I used a uprobe (perf_event_open with PERF_SAMPLE_REGS_USER) to collect statistics about memcpy's size argument (the 3rd parameter, passed in the RDX register), and all of those memcpy calls copy just 4 bytes. According to the source code, the memcpy size is the width of the items being sorted, which is a 4-byte int in my test case.
Maybe something could be improved here.
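One possible direction, purely as an illustration (this is not musl's code, and copy_elem is a made-up name): special-case small fixed widths so each element move does not pay the full call overhead of memcpy.
-------------------
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: for a 4-byte element width, the fixed-size memcpy
 * calls below are lowered by the compiler to a single load/store pair;
 * other widths fall back to the generic memcpy. */
static void copy_elem(void *dst, const void *src, size_t width)
{
    if (width == sizeof(uint32_t)) {
        uint32_t tmp;
        memcpy(&tmp, src, sizeof tmp);
        memcpy(dst, &tmp, sizeof tmp);
    } else {
        memcpy(dst, src, width);
    }
}
-------------------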


I also ran the same profiling against glibc:
+-----------------------------+--------+---------+
| func                        | share  | samples |
+-----------------------------+--------+---------+
| Total                       |        |  640880 |
| msort_with_tmp.part.0       | 73.99% |  474176 |  <--- merge sort?
| mycmp                       | 11.76% |   75392 |
| main                        |  6.45% |   41306 |
| __memcpy_avx_unaligned_erms |  4.58% |   29339 |
| random                      |  0.86% |    5525 |
| __memcpy_avx_unaligned      |  0.83% |    5293 |
| random_r                    |  0.76% |    4882 |
| rand                        |  0.45% |    2897 |
| _init                       |  0.31% |    1975 |
| _fini                       |  0.01% |      80 |
| __free                      |  0.00% |       5 |
| _int_malloc                 |  0.00% |       5 |
| malloc                      |  0.00% |       2 |
| __qsort_r                   |  0.00% |       1 |
| _int_free                   |  0.00% |       1 |
+-----------------------------+--------+---------+

Test code:
-------------------
#include <stdio.h>
#include <stdlib.h>

int mycmp(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

#define MAXN (1<<20)
int vs[MAXN];

int main() {
    int i, j, k, n, t;
    for (k=0; k<1024; k++) {
        /* fill with 0..MAXN-1, then Fisher-Yates shuffle */
        for (i=0; i<MAXN; i++) vs[i]=i;
        for (n=MAXN; n>1; n--) {
            i=n-1; j=rand()%n;
            if (i!=j) { t=vs[i]; vs[i]=vs[j]; vs[j]=t; }
        }
        qsort(vs, MAXN, sizeof(vs[0]), mycmp);
    }
    return 0;
}

-------------------
gcc test.c -O2 -static
With musl-libc:
$ time ./a.out

real	9m 5.10s
user	9m 5.09s
sys	0m 0.00s

With glibc:
$ time ./a.out
real	1m56.287s
user	1m56.270s
sys	0m0.004s



To sum up, optimizing those memcpy calls and reducing the number of comparisons to the minimum could bring significant performance improvements, but I doubt that alone could achieve a 4x improvement.

FYI
David


