Date: Sat, 11 Feb 2023 06:44:29 +0100
From: "alice"
To: musl@lists.openwall.com, "Rich Felker"
Cc: "Markus Wichmann"
Subject: Re: [musl] qsort

On Sat Feb 11, 2023 at 6:12 AM CET, David Wang wrote:
> At 2023-02-10 22:19:55, "Rich Felker" wrote:
> >On Fri, Feb 10, 2023 at 09:45:12PM +0800, David Wang wrote:
> >> About wrapper_cmp: in my last profiling, 931387 samples were collected
> >> in total, 257403 samples contain the callchain ->wrapper_cmp, and among
> >> those 257403 samples, 167410 contain the callchain ->wrapper_cmp->mycmp.
> >> That is why I think there is extra overhead in wrapper_cmp. Maybe
> >> compiler optimization would change the result, and I will make further
> >> checks.
> >
> >Yes. On i386 here, -O0 takes wrapper_cmp from 1 instruction to 10
> >instructions.
> >
> >Rich
>
> With optimized binary code it is very hard to collect an intact callchain
> from the kernel via perf_event_open:PERF_SAMPLE_CALLCHAIN.
> But to profile qsort, a callchain may not be necessary; sampling the IP
> register is enough to identify which parts take the most CPU cycles.
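(a minimal sketch of the kind of PERF_SAMPLE_IP setup being described here,
assuming Linux perf_event_open sampling CPU cycles for the current thread;
this is not the harness that produced the numbers below, and the ring-buffer
record parsing plus symbol resolution are left out:)

-------------------
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;        /* one sample every 100k cycles */
    attr.sample_type = PERF_SAMPLE_IP;  /* record only the IP, no callchain */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = sys_perf_event_open(&attr, 0, -1, -1, 0); /* this thread, any cpu */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* 1 metadata page + 8 data pages for the sample ring buffer */
    size_t len = 9 * sysconf(_SC_PAGESIZE);
    struct perf_event_mmap_page *meta =
        mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (meta == MAP_FAILED) { perror("mmap"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; i++) x += i; /* stand-in workload */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* each record in the buffer carries one sampled instruction pointer;
       a real profiler walks the records and maps each IP back to a symbol */
    printf("sample bytes produced: %llu\n", (unsigned long long)meta->data_head);
    return 0;
}
-------------------
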
> So I changed the strategy: instead of PERF_SAMPLE_CALLCHAIN, I now just use
> PERF_SAMPLE_IP.
>
> This is what I got:
> +-------------------+--------+---------+
> | func              | share  | samples |
> +-------------------+--------+---------+
> | Total             |        |  423488 |
> | memcpy            | 48.76% |  206496 |
> | sift              | 16.29% |   68989 |
> | mycmp             | 14.57% |   61714 |
> | trinkle           |  8.90% |   37690 |
> | cycle             |  5.45% |   23061 |
> | shr               |  2.19% |    9293 |
> | __qsort_r         |  1.77% |    7505 |
> | main              |  1.04% |    4391 |
> | shl               |  0.55% |    2325 |
> | wrapper_cmp       |  0.42% |    1779 |
> | rand              |  0.05% |     229 |
> | __set_thread_area |  0.00% |      16 |
> +-------------------+--------+---------+
> (Note that in this report I count only the samples that fall directly within
> a function's own body; samples in sub-functions do not contribute to any of
> their parent functions.)
>
> And you're right: with optimization, the impact of wrapper_cmp is very, very
> low, only 0.42%.
>
> memcpy stands out above. I used a uprobe (perf_event_open:PERF_SAMPLE_REGS_USER)
> to collect statistics about the size of the memcpy (the 3rd parameter, stored
> in the RDX register), and all of those memcpy calls copy just 4 bytes. According
> to the source code, the size passed to memcpy is the size of the items being
> sorted, which is int32 in my test case.
> Maybe something could be improved here.
>
> I also did the same profiling against glibc:
> +-----------------------------+--------+---------+
> | func                        | share  | samples |
> +-----------------------------+--------+---------+
> | Total                       |        |  640880 |
> | msort_with_tmp.part.0       | 73.99% |  474176 |  <--- merge sort?
> | mycmp                       | 11.76% |   75392 |
> | main                        |  6.45% |   41306 |
> | __memcpy_avx_unaligned_erms |  4.58% |   29339 |
> | random                      |  0.86% |    5525 |
> | __memcpy_avx_unaligned      |  0.83% |    5293 |
> | random_r                    |  0.76% |    4882 |
> | rand                        |  0.45% |    2897 |
> | _init                       |  0.31% |    1975 |
> | _fini                       |  0.01% |      80 |
> | __free                      |  0.00% |       5 |
> | _int_malloc                 |  0.00% |       5 |
> | malloc                      |  0.00% |       2 |
> | __qsort_r                   |  0.00% |       1 |
> | _int_free                   |  0.00% |       1 |
> +-----------------------------+--------+---------+
>
> Test code:
> -------------------
> #include <stdio.h>
> #include <stdlib.h>
>
> int mycmp(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }
>
> #define MAXN (1<<20)
> int vs[MAXN];
>
> int main() {
>     int i, j, k, n, t;
>     for (k=0; k<1024; k++) {
>         for (i=0; i<MAXN; i++) vs[i]=i;   /* fill, then shuffle */
>         for (n=MAXN; n>1; n--) {
>             i=n-1; j=rand()%n;
>             if (i!=j) { t=vs[i]; vs[i]=vs[j]; vs[j]=t; }
>         }
>         qsort(vs, MAXN, sizeof(vs[0]), mycmp);
>     }
>     return 0;
> }
> -------------------
> gcc test.c -O2 -static
>
> With musl libc:
> $ time ./a.out
>
> real    9m 5.10s
> user    9m 5.09s
> sys     0m 0.00s
>
> With glibc:
> $ time ./a.out
>
> real    1m56.287s
> user    1m56.270s
> sys     0m0.004s
>
> To sum up, optimizing those memcpy calls and reducing the number of comparisons
> to the minimum could bring a significant performance improvement, but I doubt it
> could achieve a factor-of-4 improvement.

based on the glibc profiling, glibc also has its natively-loaded, cpu-specific
optimisations, the _avx_ functions in your case. musl doesn't implement any SIMD
optimisations, so this is a bit apples-to-oranges unless musl implements the same
kind of native per-arch optimisation.

you should rerun these with GLIBC_TUNABLES set to something from:
https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
which should let you disable them all (if you just want to compare C code to C
code); there is an example invocation at the end of this mail.

( unrelated, but has there been some historic discussion of implementing
something similar in musl? i feel like i might be forgetting something. )

> FYI
> David
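
on the 4-byte memcpy calls measured above, here is a small standalone
illustration (not musl's actual qsort code, and the function names are made up)
of the difference between an element copy whose width only arrives at run time,
as in a generic qsort, and one whose width the compiler can see:

-------------------
#include <stdio.h>
#include <string.h>

/* generic element move: the width is a run-time argument, so every call
   goes through the library's general-purpose memcpy */
static void move_generic(void *dst, const void *src, size_t width)
{
    memcpy(dst, src, width);
}

/* hypothetical fixed-width variant: with a constant size the compiler
   typically expands the memcpy into a single load/store pair */
static void move_int(void *dst, const void *src)
{
    memcpy(dst, src, sizeof(int));
}

int main(void)
{
    int a = 42, b = 0, c = 0;
    move_generic(&b, &a, sizeof a);
    move_int(&c, &a);
    printf("%d %d\n", b, c);   /* prints: 42 42 */
    return 0;
}
-------------------

in the profile above, the sort is effectively paying the move_generic path for
what is a single 32-bit store, which is why a width-specialised path for small
element sizes looks attractive.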
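
and for the GLIBC_TUNABLES rerun suggested above, the invocation would look
roughly like this; the exact capability names differ between glibc versions, so
treat the ones here as placeholders and take the real ones from the page linked
above:

$ GLIBC_TUNABLES=glibc.cpu.hwcaps=-AVX2_Usable,-AVX_Usable ./a.out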