Date: Sat, 11 Feb 2023 06:44:29 +0100
From: "alice"
To: musl@lists.openwall.com, "Rich Felker"
Cc: "Markus Wichmann"
Subject: Re: [musl] qsort

On Sat Feb 11, 2023 at 6:12 AM CET, David Wang wrote:
> At 2023-02-10 22:19:55, "Rich Felker" wrote:
> >On Fri, Feb 10, 2023 at 09:45:12PM +0800, David Wang wrote:
> >> About wrapper_cmp: in my last profiling, 931387 samples were collected
> >> in total, 257403 samples contain the callchain ->wrapper_cmp, and among
> >> those 257403 samples, 167410 contain the callchain ->wrapper_cmp->mycmp.
> >> That is why I think there is extra overhead in wrapper_cmp. Maybe
> >> compiler optimization would change the result, and I will make further
> >> checks.
> >
> >Yes. On i386 here, -O0 takes wrapper_cmp from 1 instruction to 10
> >instructions.
> >
> >Rich
>
> With optimized binary code it is very hard to collect an intact callchain
> from the kernel via perf_event_open:PERF_SAMPLE_CALLCHAIN.
> But to profile qsort, a callchain may not be necessary; sampling the IP
> register is enough to identify which parts take the most CPU cycles.
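(a minimal sketch of the kind of PERF_SAMPLE_IP setup being described here,
assuming Linux perf_event_open sampling CPU cycles for the current thread;
this is not the harness that produced the numbers below, and the ring-buffer
record parsing plus symbol resolution are left out:)

-------------------
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;        /* one sample every 100k cycles */
    attr.sample_type = PERF_SAMPLE_IP;  /* record only the IP, no callchain */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = sys_perf_event_open(&attr, 0, -1, -1, 0); /* this thread, any cpu */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* 1 metadata page + 8 data pages for the sample ring buffer */
    size_t len = 9 * sysconf(_SC_PAGESIZE);
    struct perf_event_mmap_page *meta =
        mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (meta == MAP_FAILED) { perror("mmap"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; i++) x += i; /* stand-in workload */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* each record in the buffer carries one sampled instruction pointer;
       a real profiler walks the records and maps each IP back to a symbol */
    printf("sample bytes produced: %llu\n", (unsigned long long)meta->data_head);
    return 0;
}
-------------------
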
> So I changed the strategy: instead of PERF_SAMPLE_CALLCHAIN, I now just use
> PERF_SAMPLE_IP.
>
> This is what I got:
> +-------------------+--------+---------+
> | func              | share  | samples |
> +-------------------+--------+---------+
> | Total             |        |  423488 |
> | memcpy            | 48.76% |  206496 |
> | sift              | 16.29% |   68989 |
> | mycmp             | 14.57% |   61714 |
> | trinkle           |  8.90% |   37690 |
> | cycle             |  5.45% |   23061 |
> | shr               |  2.19% |    9293 |
> | __qsort_r         |  1.77% |    7505 |
> | main              |  1.04% |    4391 |
> | shl               |  0.55% |    2325 |
> | wrapper_cmp       |  0.42% |    1779 |
> | rand              |  0.05% |     229 |
> | __set_thread_area |  0.00% |      16 |
> +-------------------+--------+---------+
> (Note that in this report I count only the samples that fall directly within
> a function's own body; samples in sub-functions do not contribute to any of
> their parent functions.)
>
> And you're right: with optimization, the impact of wrapper_cmp is very, very
> low, only 0.42%.
>
> memcpy stands out above. I used a uprobe (perf_event_open:PERF_SAMPLE_REGS_USER)
> to collect statistics about the size of the memcpy (the 3rd parameter, stored
> in the RDX register), and all of those memcpy calls copy just 4 bytes. According
> to the source code, the size passed to memcpy is the size of the items being
> sorted, which is int32 in my test case.
> Maybe something could be improved here.
>
> I also did the same profiling against glibc:
> +-----------------------------+--------+---------+
> | func                        | share  | samples |
> +-----------------------------+--------+---------+
> | Total                       |        |  640880 |
> | msort_with_tmp.part.0       | 73.99% |  474176 |  <--- merge sort?
> | mycmp                       | 11.76% |   75392 |
> | main                        |  6.45% |   41306 |
> | __memcpy_avx_unaligned_erms |  4.58% |   29339 |
> | random                      |  0.86% |    5525 |
> | __memcpy_avx_unaligned      |  0.83% |    5293 |
> | random_r                    |  0.76% |    4882 |
> | rand                        |  0.45% |    2897 |
> | _init                       |  0.31% |    1975 |
> | _fini                       |  0.01% |      80 |
> | __free                      |  0.00% |       5 |
> | _int_malloc                 |  0.00% |       5 |
> | malloc                      |  0.00% |       2 |
> | __qsort_r                   |  0.00% |       1 |
> | _int_free                   |  0.00% |       1 |
> +-----------------------------+--------+---------+
>
> Test code:
> -------------------
> #include <stdio.h>
> #include <stdlib.h>
>
> int mycmp(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }
>
> #define MAXN (1<<20)
> int vs[MAXN];
>
> int main() {
>     int i, j, k, n, t;
>     for (k=0; k<1024; k++) {
>         for (i=0; i<MAXN; i++) vs[i]=i;   /* fill, then shuffle */
>         for (n=MAXN; n>1; n--) {
>             i=n-1; j=rand()%n;
>             if (i!=j) { t=vs[i]; vs[i]=vs[j]; vs[j]=t; }
>         }
>         qsort(vs, MAXN, sizeof(vs[0]), mycmp);
>     }
>     return 0;
> }
> -------------------
> gcc test.c -O2 -static
>
> With musl libc:
> $ time ./a.out
>
> real    9m 5.10s
> user    9m 5.09s
> sys     0m 0.00s
>
> With glibc:
> $ time ./a.out
>
> real    1m56.287s
> user    1m56.270s
> sys     0m0.004s
>
> To sum up, optimizing those memcpy calls and reducing the number of comparisons
> to the minimum could bring a significant performance improvement, but I doubt it
> could achieve a factor-of-4 improvement.

based on the glibc profiling, glibc also has its natively-loaded, cpu-specific
optimisations, the _avx_ functions in your case. musl doesn't implement any SIMD
optimisations, so this is a bit apples-to-oranges unless musl implements the same
kind of native per-arch optimisation.

you should rerun these with GLIBC_TUNABLES set to something from:
https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
which should let you disable them all (if you just want to compare C code to C
code); there is an example invocation at the end of this mail.

( unrelated, but has there been some historic discussion of implementing
something similar in musl? i feel like i might be forgetting something. )

> FYI
> David
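
on the 4-byte memcpy calls measured above, here is a small standalone
illustration (not musl's actual qsort code, and the function names are made up)
of the difference between an element copy whose width only arrives at run time,
as in a generic qsort, and one whose width the compiler can see:

-------------------
#include <stdio.h>
#include <string.h>

/* generic element move: the width is a run-time argument, so every call
   goes through the library's general-purpose memcpy */
static void move_generic(void *dst, const void *src, size_t width)
{
    memcpy(dst, src, width);
}

/* hypothetical fixed-width variant: with a constant size the compiler
   typically expands the memcpy into a single load/store pair */
static void move_int(void *dst, const void *src)
{
    memcpy(dst, src, sizeof(int));
}

int main(void)
{
    int a = 42, b = 0, c = 0;
    move_generic(&b, &a, sizeof a);
    move_int(&c, &a);
    printf("%d %d\n", b, c);   /* prints: 42 42 */
    return 0;
}
-------------------

in the profile above, the sort is effectively paying the move_generic path for
what is a single 32-bit store, which is why a width-specialised path for small
element sizes looks attractive.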
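
and for the GLIBC_TUNABLES rerun suggested above, the invocation would look
roughly like this; the exact capability names differ between glibc versions, so
treat the ones here as placeholders and take the real ones from the page linked
above:

$ GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable,-AVX_Usable ./a.out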