mailing list of musl libc
* [musl] qsort
@ 2023-01-20  1:49 Guy
  2023-01-20 12:55 ` alice
  2023-01-20 13:32 ` [musl] qsort Valery Ushakov
  0 siblings, 2 replies; 34+ messages in thread
From: Guy @ 2023-01-20  1:49 UTC (permalink / raw)
  To: musl

Hi,

I have a program whose bottleneck is qsort.
I noticed that the performance with musl is much slower than with glibc.
Why is quick sort not used in musl?

Thanks,
Guy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] qsort
  2023-01-20  1:49 [musl] qsort Guy
@ 2023-01-20 12:55 ` alice
  2023-01-30 10:04   ` [musl] " David Wang
  2023-01-20 13:32 ` [musl] qsort Valery Ushakov
  1 sibling, 1 reply; 34+ messages in thread
From: alice @ 2023-01-20 12:55 UTC (permalink / raw)
  To: musl

On Fri Jan 20, 2023 at 2:49 AM CET, Guy wrote:
> Hi,
>
> I have a program whose bottleneck is qsort.
> I noticed that the performance with musl is much slower than with glibc.

diagnosing why this is the case is somewhat difficult without either seeing
the program, or (better), a specific corpus of things that are being sorted in
both cases (to have a solid test case).

> Why is quick sort not used in musl?

presumably, because:

 /* Smoothsort, an adaptive variant of Heapsort.  Memory usage: O(1).
    Run time: Worst case O(n log n), close to O(n) in the mostly-sorted case. */

vanilla quicksort needs O(log(n)) additional memory, so the optimisation
is more likely to be about memory use. on top of that, the worst-case
performance of quicksort is O(n^2) (apparently), but i'm not an expert on
sorting algorithms :). so, your specific (problem) case needs a specific
example to diagnose.
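
for illustration, a minimal textbook quicksort sketch (not musl's or
glibc's actual code) showing where the O(log(n)) stack usage and the
O(n^2) worst case come from:

```c
/* textbook in-place quicksort on ints (illustrative only). each
 * recursion level costs a stack frame: O(log n) of them when the
 * partitions stay balanced, but a consistently bad pivot makes them
 * lopsided and degrades the running time toward O(n^2). */
static void quicksort(int *a, int n)
{
	while (n > 1) {
		int pivot = a[n/2], i = 0, j = n - 1;
		while (i <= j) { /* hoare-style partition */
			while (a[i] < pivot) i++;
			while (a[j] > pivot) j--;
			if (i <= j) {
				int t = a[i]; a[i] = a[j]; a[j] = t;
				i++; j--;
			}
		}
		quicksort(a, j + 1); /* recurse on the left part... */
		a += i; n -= i;      /* ...and iterate on the right part */
	}
}
```

(a common refinement recurses on the smaller part and iterates on the
larger, which caps the stack at O(log(n)) even in the bad cases; the
running time can still blow up.)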

commit 22263709eda9f7d692a0f484fd759f757418dbd7 is the one that replaced the
old heapsort with this custom smoothsort implementation, with
52cf5c18f4ad3a7a59fb7113cf115c6fc05c7494 being the one that added the above
comment.

>
> Thanks,
> Guy


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re: qsort
  2023-01-20  1:49 [musl] qsort Guy
  2023-01-20 12:55 ` alice
@ 2023-01-20 13:32 ` Valery Ushakov
  1 sibling, 0 replies; 34+ messages in thread
From: Valery Ushakov @ 2023-01-20 13:32 UTC (permalink / raw)
  To: musl

Guy <galfandary@gmail.com> wrote:

> I have a program whose bottleneck is qsort.
> I noticed that the performance with musl is much slower than with glibc.
> Why is quick sort not used in musl?

Sorry for a random drive-by remark that is only very tangentially
related to your question...  quicksort is not guaranteed to be quick
and can be quadratic on "bad" input.  Everyone probably knows that in
a kind of theoretical way, but you can run into these pathological
cases in quite mundane circumstances.  E.g. the NNTP overviews for the
Lua mailing list archives (gmane.comp.lang.lua.general) at gmane.io
used to trigger this worst-case scenario for me (~120s to enter the
newsgroup and sort/thread articles with qsort(3) vs. ~15s for heapsort).

-uwe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] qsort
  2023-01-20 12:55 ` alice
@ 2023-01-30 10:04   ` David Wang
  2023-02-01 18:01     ` Markus Wichmann
  2023-02-11  9:22     ` [musl] " Markus Wichmann
  0 siblings, 2 replies; 34+ messages in thread
From: David Wang @ 2023-01-30 10:04 UTC (permalink / raw)
  To: musl

I made a test comparing the qsort performance of musl and glibc. The code just sorts a random permutation of data and counts how many times the compare operation is invoked; I also collect the same counter for C++ sort, which is likewise designed to throttle the worst-case performance of quicksort.
```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <algorithm>

using namespace std;
typedef long long LL;
#define MAXN 200000
int as[MAXN];
int bs[MAXN];

LL c1=0, c2=0; /* comparison counters for qsort (c1) and std::sort (c2) */
bool rcmp(int a, int b) { c2++; return a>b; }
int myrcmp(const void *a, const void *b) {c1++; return *(const int*)b-*(const int*)a;}
 
int main() {
  int i, j, k, t;
  time_t tx = time(NULL);
  srand(tx);
  for (k=0; k<1024; k++) {
    for (i=0; i<MAXN; i++) as[i]=i;
    for (i=MAXN-1; i>0; i--) { /* Fisher-Yates shuffle */
      j=rand()%(i+1);
      if (j!=i) { t=as[i]; as[i]=as[j]; as[j]=t; }
    }
    for (i=0; i<MAXN; i++) bs[i]=as[i];
    qsort(as, MAXN, sizeof(as[0]), myrcmp);  /* descending sort via libc qsort */
    sort(bs, bs+MAXN, rcmp);                 /* same data via std::sort */
  }
  printf("qsort:%lld  vs   sort:%lld\n", c1, c2);
  return 0;
}
```

With glibc, the average report is "qsort:3351271634  vs   sort:4358412048",
but on Alpine it is "qsort:9585927992  vs   sort:4354856330".

On average, qsort with musl is significantly slower than c++ sort, while with glibc the average performance of qsort is better than c++ sort.

I think something is strange here: the O(n*n) worst-case performance of quicksort is a real issue, but why is the average performance that bad compared with c++ sort?
Is there any story about the implementation of qsort in musl? I feel it focused on performance improvements for some special kind of domain, but it's not clear to me what that is.

David Wang


At 2023-01-20 20:55:39, "alice" <alice@ayaya.dev> wrote:
>On Fri Jan 20, 2023 at 2:49 AM CET, Guy wrote:
>> Hi,
>>
>> I have a program whose bottleneck is qsort.
>> I noticed that the performance with musl is much slower than with glibc.
>
>diagnosing why this is the case is somewhat difficult without either seeing
>the program, or (better), a specific corpus of things that are being sorted in
>both cases (to have a solid test case).
>
>> Why is quick sort not used in musl?
>
>presumably, because:
>
> /* Smoothsort, an adaptive variant of Heapsort.  Memory usage: O(1).
>    Run time: Worst case O(n log n), close to O(n) in the mostly-sorted case. */
>
>vanilla quicksort is O(log(n)) additional memory use, and so the optimisation
>is more likely to be on memory use. on top of that, the worst-case performance
>of quicksort is O(n^2) (apparently), but i'm not an expert on sorting
>algorithms :). so, your specific (problem) case needs a specific example to
>diagnose.
>
>commit 22263709eda9f7d692a0f484fd759f757418dbd7 is the one that replaced the
>old heapsort with this custom smoothsort implementation, with
>52cf5c18f4ad3a7a59fb7113cf115c6fc05c7494 being the one that added the above
>comment.
>
>>
>> Thanks,
>> Guy

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] qsort
  2023-01-30 10:04   ` [musl] " David Wang
@ 2023-02-01 18:01     ` Markus Wichmann
  2023-02-02  2:12       ` [musl] " David Wang
                         ` (2 more replies)
  2023-02-11  9:22     ` [musl] " Markus Wichmann
  1 sibling, 3 replies; 34+ messages in thread
From: Markus Wichmann @ 2023-02-01 18:01 UTC (permalink / raw)
  To: musl

On Mon, Jan 30, 2023 at 06:04:39PM +0800, David Wang wrote:
> With glibc, the average report is "qsort:3351271634  vs   sort:4358412048",
> but on Alpine it is "qsort:9585927992  vs   sort:4354856330".
>
> On average, qsort with musl is significantly slower than c++ sort,
> while with glibc the average performance of qsort is better than c++
> sort.
>

It seems to me as though smoothsort does not optimize for the number of
comparisons.

> Is there any story about the implementation of qsort in musl? I feel
> it focused on performance improvements for some special kind of
> domain, but it's not clear to me what that is.
>

Smoothsort has the desirable property of being adaptive. Its runtime
gets closer to O(n) the more sorted the input already is. glibc uses
mergesort or quicksort (the latter as fallback) and neither of them has
that property. Plus, mergesort requires scratch storage and has a
significantly harder time sorting arrays with large elements, because
you end up constantly copying stuff. glibc tries to mitigate this by
indirectly sorting once the elements go above 32 bytes in size.
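
For illustration, the indirection trick looks roughly like this: sort
an array of pointers to the big elements, then permute once at the
end, so each element is copied O(1) times instead of O(log n) times.
A hypothetical sketch, not glibc's actual code:

```c
#include <stdlib.h>
#include <string.h>

struct big { char payload[64]; }; /* anything well above 32 bytes */

static int cmp_indirect(const void *a, const void *b)
{
	const struct big *x = *(const struct big *const *)a;
	const struct big *y = *(const struct big *const *)b;
	return memcmp(x->payload, y->payload, sizeof x->payload);
}

static int sort_big_indirect(struct big *a, size_t n)
{
	struct big **p = malloc(n * sizeof *p);
	struct big *tmp = malloc(n * sizeof *tmp);
	if (!p || !tmp) { free(p); free(tmp); return -1; }
	for (size_t i = 0; i < n; i++) p[i] = &a[i];
	qsort(p, n, sizeof *p, cmp_indirect);          /* moves pointers only */
	for (size_t i = 0; i < n; i++) tmp[i] = *p[i]; /* gather in sorted order */
	memcpy(a, tmp, n * sizeof *tmp);
	free(p); free(tmp);
	return 0;
}
```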

Basically, glibc is optimized for the comparisons, musl more for the
number of swaps. Although we really shouldn't lose sight of the
compares entirely, since those are indirect function calls, and current
processors seem to dislike those.

Anyway, there is an inefficiency in musl's smoothsort implementation,
namely in the sift() function:


| if(cmp(ar[0], lf, arg) >= 0 && cmp(ar[0], rt, arg) >= 0) {
| 	break;
| }
| if(cmp(lf, rt, arg) >= 0) {
| 	ar[i++] = lf;
| 	head = lf;
| 	pshift -= 1;
| } else {
| 	ar[i++] = rt;
| 	head = rt;
| 	pshift -= 2;
| }

As you can see, this does two or three comparisons, but actually needs
to do only two in any case: If we detect ar[0] >= lf && ar[0] < rt, then
by transitivity we have also detected rt > lf, so there is no need to
compare lf and rt directly anymore. I have not really found a good way
to formulate code that takes advantage of that, maybe something like
this?


		if(cmp(ar[0], lf, arg) < 0) {
			if(cmp(lf, rt, arg) >= 0) {
				ar[i++] = lf;
				head = lf;
				pshift -= 1;
			} else {
go_right:
				ar[i++] = rt;
				head = rt;
				pshift -= 2;
			}
		} else if (cmp(ar[0], rt, arg) < 0) {
			goto go_right;
		} else {
			break;
		}

Yeah, not great. But it does use only two compares ever. This ought to
lower the number of compares in your test somewhat.
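
An alternative shape for the same logic, at the cost of two extra
locals (an untested sketch against the same variables as musl's
sift()): pick the larger child first, then compare the root against
only that one.

```c
	unsigned char *max = lf;    /* assume the left child is larger... */
	int step = 1;
	if (cmp(lf, rt, arg) < 0) { /* ...and correct that if it is not */
		max = rt;
		step = 2;
	}
	if (cmp(ar[0], max, arg) >= 0) break; /* root already >= both children */
	ar[i++] = max;
	head = max;
	pshift -= step;
```

This always pays for the child comparison, even when the root already
dominates, but the total never exceeds two compares per round.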

HTH,
Markus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-01 18:01     ` Markus Wichmann
@ 2023-02-02  2:12       ` David Wang
  2023-02-03  5:22         ` [musl] " David Wang
  2023-02-09 19:03       ` Rich Felker
  2023-02-16 15:15       ` David Wang
  2 siblings, 1 reply; 34+ messages in thread
From: David Wang @ 2023-02-02  2:12 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 1992 bytes --]


At 2023-02-02 02:01:15, "Markus Wichmann" <nullplan@gmx.net> wrote:
>On Mon, Jan 30, 2023 at 06:04:39PM +0800, David Wang wrote:

>> Is there any story about the implementation of qsort in musl? I feel
>> it focused on performance improvements for some special kind of
>> domain, but it's not clear to me what that is.
>>
>
>Smoothsort has the desirable property of being adaptive. Its runtime
>gets closer to O(n) the more sorted the input already is. glibc uses
>mergesort or quicksort (the latter as fallback) and neither of them has
>that property. Plus, mergesort requires scratch storage and has a
>significantly harder time sorting arrays with large elements, because
>you end up constantly copying stuff. glibc tries to mitigate this by
>indirectly sorting once the elements go above 32 bytes in size.
>
>Basically, glibc is optimized for the comparisons, musl more for the
>number of swaps. Although we really shouldn't lose sight of the
>compares entirely, since those are indirect function calls, and current
>processors seem to dislike those.

Thanks for the information, but the state of things (the average performance, measured in total runtime, can be about 5~6x slower than c++ sort on alpine, and 8~9x slower than qsort on debian-glibc) is very disturbing. When I use qsort, I expect O(nlogn) average performance with a small constant factor (but care less about the O(n) cases, to be honest...), because I meant to use quick sort... and if I were concerned about worst-case performance, I would switch to C++ sort (which is said to fall back to heapsort when the recursion depth gets too deep for quicksort, but I have not confirmed that myself...).
I timed the sorting of 200000 randomly shuffled elements for 1024 rounds; the timing report for qsort on Alpine is:
 # time ./a.out 
real	3m 3.20s
user	3m 3.19s
sys	0m 0.00s
C++ sort on Alpine is:
real	0m 35.79s
user	0m 35.78s
sys	0m 0.00s
While qsort on Debian is much faster:
real	0m19.783s
user	0m19.783s
sys	0m0.000s

David


[-- Attachment #2: Type: text/html, Size: 2199 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:[musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-02  2:12       ` [musl] " David Wang
@ 2023-02-03  5:22         ` David Wang
  2023-02-03  8:03           ` Alexander Monakov
  0 siblings, 1 reply; 34+ messages in thread
From: David Wang @ 2023-02-03  5:22 UTC (permalink / raw)
  To: musl

I think I was not reading the mail carefully enough, and did not notice the O(1) "in place" space complexity.
Sorry about those boring test reports; I was just shocked to learn that qsort is not quicksort in musl.....
Please ignore my comments.

David

At 2023-02-02 10:12:53, "David Wang" <00107082@163.com> wrote:



At 2023-02-02 02:01:15, "Markus Wichmann" <nullplan@gmx.net> wrote:
>On Mon, Jan 30, 2023 at 06:04:39PM +0800, David Wang wrote:

>> Is there any story about the implementation of qsort in musl? I feel
>> it focused on performance improvements for some special kind of
>> domain, but it's not clear to me what that is.
>>
>
>Smoothsort has the desirable property of being adaptive. Its runtime
>gets closer to O(n) the more sorted the input already is. glibc uses
>mergesort or quicksort (the latter as fallback) and neither of them has
>that property. Plus, mergesort requires scratch storage and has a
>significantly harder time sorting arrays with large elements, because
>you end up constantly copying stuff. glibc tries to mitigate this by
>indirectly sorting once the elements go above 32 bytes in size.
>
>Basically, glibc is optimized for the comparisons, musl more for the
>number of swaps. Although we really shouldn't lose sight of the
>compares entirely, since those are indirect function calls, and current
>processors seem to dislike those.

Thanks for the information, but the state of things (the average performance, measured in total runtime, can be about 5~6x slower than c++ sort on alpine, and 8~9x slower than qsort on debian-glibc) is very disturbing. When I use qsort, I expect O(nlogn) average performance with a small constant factor (but care less about the O(n) cases, to be honest...), because I meant to use quick sort... and if I were concerned about worst-case performance, I would switch to C++ sort (which is said to fall back to heapsort when the recursion depth gets too deep for quicksort, but I have not confirmed that myself...).
I timed the sorting of 200000 randomly shuffled elements for 1024 rounds; the timing report for qsort on Alpine is:
 # time ./a.out 
real	3m 3.20s
user	3m 3.19s
sys	0m 0.00s
C++ sort on Alpine is:
real	0m 35.79s
user	0m 35.78s
sys	0m 0.00s
While qsort on Debian is much faster:
real	0m19.783s
user	0m19.783s
sys	0m0.000s

David


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:[musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-03  5:22         ` [musl] " David Wang
@ 2023-02-03  8:03           ` Alexander Monakov
  2023-02-03  9:01             ` [musl] " David Wang
  0 siblings, 1 reply; 34+ messages in thread
From: Alexander Monakov @ 2023-02-03  8:03 UTC (permalink / raw)
  To: musl



On Fri, 3 Feb 2023, David Wang wrote:

> I think I was not reading the mail carefully enough,  and did not notice the
> O(1) "in place" space complexity.

It's not correct to claim that musl smoothsort improves on qsort by
having O(1) space complexity rather than O(log n). The storage for
Leonardo numbers in musl smoothsort grows as O(log n) as well, musl
just allocates enough for the maximum possible 'n'.
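
Leonardo numbers grow roughly like Fibonacci numbers, so only O(log n)
of them are representable at all. A standalone sketch (not musl code)
that counts how many fit in a size_t:

```c
#include <stdio.h>

int main(void)
{
	size_t a = 1, b = 1; /* L(0) = L(1) = 1 */
	int k = 2;
	for (;;) {
		size_t c = a + b + 1; /* L(k) = L(k-2) + L(k-1) + 1 */
		if (c < b) break;     /* wrapped around: L(k) no longer fits */
		a = b; b = c; k++;
	}
	printf("%d Leonardo numbers fit in a size_t\n", k);
	return 0;
}
```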

Alexander

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] Re:[musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-03  8:03           ` Alexander Monakov
@ 2023-02-03  9:01             ` David Wang
  0 siblings, 0 replies; 34+ messages in thread
From: David Wang @ 2023-02-03  9:01 UTC (permalink / raw)
  To: musl



At 2023-02-03 16:03:03, "Alexander Monakov" <amonakov@ispras.ru> wrote:
>
>
>On Fri, 3 Feb 2023, David Wang wrote:
>
>> I think I was not reading the mail carefully enough,  and did not notice the
>> O(1) "in place" space complexity.
>
>It's not correct to claim that musl smoothsort improves on qsort by
>having O(1) space complexity rather than O(log n). The storage for
>Leonardo numbers in musl smoothsort grows as O(log n) as well, musl
>just allocates enough for the maximum possible 'n'.
>
>Alexander

Thanks for the clarification.

Checking the code history, it surprised me big time to learn that qsort in musl was never quicksort: from the beginning, qsort was heapsort, because of the O(n log n) worst-case performance and O(1) memory usage, and smoothsort is an improvement over heapsort....

David

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] qsort
  2023-02-01 18:01     ` Markus Wichmann
  2023-02-02  2:12       ` [musl] " David Wang
@ 2023-02-09 19:03       ` Rich Felker
  2023-02-09 19:20         ` Alexander Monakov
  2023-02-10 10:00         ` [musl] " David Wang
  2023-02-16 15:15       ` David Wang
  2 siblings, 2 replies; 34+ messages in thread
From: Rich Felker @ 2023-02-09 19:03 UTC (permalink / raw)
  To: Markus Wichmann; +Cc: musl

On Wed, Feb 01, 2023 at 07:01:15PM +0100, Markus Wichmann wrote:
> On Mon, Jan 30, 2023 at 06:04:39PM +0800, David Wang wrote:
> > With glibc, the average report is "qsort:3351271634  vs   sort:4358412048",
> > but on Alpine it is "qsort:9585927992  vs   sort:4354856330".
> >
> > On average, qsort with musl is significantly slower than c++ sort,
> > while with glibc the average performance of qsort is better than c++
> > sort.
> >
> 
> It seems to me as though smoothsort does not optimize for the number of
> comparisons.
> 
> > Is there any story about the implementation of qsort in musl? I feel
> > it focused on performance improvements for some special kind of
> > domain, but it's not clear to me what that is.
> >
> 
> Smoothsort has the desirable property of being adaptive. Its runtime
> gets closer to O(n) the more sorted the input already is. glibc uses
> mergesort or quicksort (the latter as fallback) and neither of them has
> that property. Plus, mergesort requires scratch storage and has a
> significantly harder time sorting arrays with large elements, because
> you end up constantly copying stuff. glibc tries to mitigate this by
> indirectly sorting once the elements go above 32 bytes in size.
> 
> Basically, glibc is optimized for the comparisons, musl more for the
> number of swaps. Although we really shouldn't lose sight of the
> compares entirely, since those are indirect function calls, and current
> processors seem to dislike those.
> 
> Anyway, there is an inefficiency in musl's smoothsort implementation,
> namely in the sift() function:
> 
> 
> | if(cmp(ar[0], lf, arg) >= 0 && cmp(ar[0], rt, arg) >= 0) {
> | 	break;
> | }
> | if(cmp(lf, rt, arg) >= 0) {
> | 	ar[i++] = lf;
> | 	head = lf;
> | 	pshift -= 1;
> | } else {
> | 	ar[i++] = rt;
> | 	head = rt;
> | 	pshift -= 2;
> | }
> 
> As you can see, this does two or three comparisons, but actually needs
> to do only two in any case: If we detect ar[0] >= lf && ar[0] < rt, then
> by transitivity we have also detected rt > lf, so there is no need to
> compare lf and rt directly anymore. I have not really found a good way
> to formulate code that takes advantage of that, maybe something like
> this?
> 
> 
> 		if(cmp(ar[0], lf, arg) < 0) {
> 			if(cmp(lf, rt, arg) >= 0) {
> 				ar[i++] = lf;
> 				head = lf;
> 				pshift -= 1;
> 			} else {
> go_right:
> 				ar[i++] = rt;
> 				head = rt;
> 				pshift -= 2;
> 			}
> 		} else if (cmp(ar[0], rt, arg) < 0) {
> 			goto go_right;
> 		} else {
> 			break;
> 		}
> 
> Yeah, not great. But it does use only two compares ever. This ought to
> lower the number of compares in your test somewhat.

I think we should pursue this. IIRC there was another fairly
significant missed optimization in qsort discussed a while back, where
discussion dropped off because the person involved turned out to be
impossible to work with.

On the overall topic of qsort and quicksort, quicksort is just not a
viable algorithm by itself because it's O(n²) time. glibc does not use
it by itself, but uses "introsort", a fancy way to say it introspects
the quicksort (rather just counts the depth) and switches to an O(n
log n) algorithm once it's descended too many levels. Alternatively,
there are ways to pick the pivot element (e.g. the median-of-medians
scheme) that guarantee O(n log n) overall performance, but the pivot
selection is so heavy as
to make it impractical.

It might make sense for us to adopt an approach like introsort, using
smoothsort as the fallback, especially if we could preserve the
property of being fast on mostly/entirely sorted inputs. But I'm not
sure if that's possible.
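
For concreteness, the dispatch would have roughly this shape (a
standalone sketch on int arrays, with a plain heapsort standing in for
the smoothsort fallback; illustrative only, not a proposed patch):

```c
static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void sift_down(int *a, int i, int n)
{
	for (;;) {
		int c = 2*i + 1;                     /* left child */
		if (c >= n) return;
		if (c + 1 < n && a[c+1] > a[c]) c++; /* pick the larger child */
		if (a[i] >= a[c]) return;            /* heap property holds */
		swap_int(&a[i], &a[c]);
		i = c;
	}
}

/* O(n log n) fallback; smoothsort would take this role in musl */
static void fallback_sort(int *a, int n)
{
	for (int i = n/2 - 1; i >= 0; i--) sift_down(a, i, n);
	for (int i = n - 1; i > 0; i--) {
		swap_int(&a[0], &a[i]); /* move the current max to the end */
		sift_down(a, 0, i);
	}
}

static int partition(int *a, int n) /* Lomuto partition, last element as pivot */
{
	int pivot = a[n-1], i = 0;
	for (int j = 0; j < n - 1; j++)
		if (a[j] < pivot) swap_int(&a[i++], &a[j]);
	swap_int(&a[i], &a[n-1]);
	return i;
}

static void introsort_rec(int *a, int n, int depth)
{
	if (n < 2) return;
	if (depth == 0) { fallback_sort(a, n); return; } /* too deep: bail out */
	int p = partition(a, n);
	introsort_rec(a, p, depth - 1);
	introsort_rec(a + p + 1, n - p - 1, depth - 1);
}

void introsort(int *a, int n)
{
	int depth = 0;
	for (int k = n; k > 0; k >>= 1) depth++; /* ~log2(n) */
	introsort_rec(a, n, 2 * depth);          /* budget ~2*log2(n) levels */
}
```

Whether the mostly-sorted fast path could survive such a scheme is
exactly the open question: the quicksort passes rearrange the input
before the fallback ever sees it.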

Before we consider any such change though we should probably see
profiling data to determine where the time is being spent in
smoothsort, in case it's stupid stuff that's fixable.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] qsort
  2023-02-09 19:03       ` Rich Felker
@ 2023-02-09 19:20         ` Alexander Monakov
  2023-02-09 19:52           ` Rich Felker
  2023-02-10 10:00         ` [musl] " David Wang
  1 sibling, 1 reply; 34+ messages in thread
From: Alexander Monakov @ 2023-02-09 19:20 UTC (permalink / raw)
  To: musl; +Cc: Markus Wichmann


On Thu, 9 Feb 2023, Rich Felker wrote:

>  glibc does not use
> it by itself, but uses "introsort", a fancy way to say it introspects
> the quicksort (rather just counts the depth) and switches to an O(n
> log n) algorithm once it's descended too many levels.

This is so completely untrue. Glibc uses mergesort, falling back on
quicksort with median-of-three pivot selection when allocating the
intermediate array for mergesort fails or seems too costly:

https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/msort.c;h=6dd1cc3aa152d0be99e578c8b4853976451057a5;hb=HEAD
https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/qsort.c;h=728a0ed370e00a76bdd22c3317a75b9be6736f48;hb=HEAD

Alexander

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] qsort
  2023-02-09 19:20         ` Alexander Monakov
@ 2023-02-09 19:52           ` Rich Felker
  2023-02-09 20:18             ` Rich Felker
  2023-02-10  4:10             ` Markus Wichmann
  0 siblings, 2 replies; 34+ messages in thread
From: Rich Felker @ 2023-02-09 19:52 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: musl, Markus Wichmann

On Thu, Feb 09, 2023 at 10:20:45PM +0300, Alexander Monakov wrote:
> 
> On Thu, 9 Feb 2023, Rich Felker wrote:
> 
> >  glibc does not use
> > it by itself, but uses "introsort", a fancy way to say it introspects
> > the quicksort (rather just counts the depth) and switches to an O(n
> > log n) algorithm once it's descended too many levels.
> 
> This is so completely untrue. Glibc uses mergesort, falling back on
> quicksort with median-of-three pivot selection when allocating the
> intermediate array for mergesort fails or seems too costly:

Did this change at some point or have I just always been under the
wrong impression on this?

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] qsort
  2023-02-09 19:52           ` Rich Felker
@ 2023-02-09 20:18             ` Rich Felker
  2023-02-09 20:27               ` Pierpaolo Bernardi
  2023-02-10  4:10             ` Markus Wichmann
  1 sibling, 1 reply; 34+ messages in thread
From: Rich Felker @ 2023-02-09 20:18 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: musl, Markus Wichmann

On Thu, Feb 09, 2023 at 02:52:45PM -0500, Rich Felker wrote:
> On Thu, Feb 09, 2023 at 10:20:45PM +0300, Alexander Monakov wrote:
> > 
> > On Thu, 9 Feb 2023, Rich Felker wrote:
> > 
> > >  glibc does not use
> > > it by itself, but uses "introsort", a fancy way to say it introspects
> > > the quicksort (rather just counts the depth) and switches to an O(n
> > > log n) algorithm once it's descended too many levels.
> > 
> > This is so completely untrue. Glibc uses mergesort, falling back on
> > quicksort with median-of-three pivot selection when allocating the
> > intermediate array for mergesort fails or seems too costly:
> 
> Did this change at some point or have I just always been under the
> wrong impression on this?

To self-answer, from git history it looks like I was just completely
wrong. Maybe it was GNU libstdc++ using introsort? Or..?

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] qsort
  2023-02-09 20:18             ` Rich Felker
@ 2023-02-09 20:27               ` Pierpaolo Bernardi
  0 siblings, 0 replies; 34+ messages in thread
From: Pierpaolo Bernardi @ 2023-02-09 20:27 UTC (permalink / raw)
  To: musl

On Thu, Feb 9, 2023 at 9:18 PM Rich Felker <dalias@libc.org> wrote:

> > Did this change at some point or have I just always been under the
> > wrong impression on this?

> To self-answer, from git history it looks like I was just completely
> wrong. Maybe it was GNU libstdc++ using introsort? Or..?

I too knew that glibc used to use introsort. I had examined the code.
BUT that was many years ago, and my memory must be failing.
Wikipedia says:

"Introsort or some variant is used in a number of standard library
sort functions, including some C++ sort implementations.

The June 2000 SGI C++ Standard Template Library stl_algo.h
implementation of unstable sort uses the Musser introsort approach
with the recursion depth to switch to heapsort passed as a parameter,
median-of-3 pivot selection and the Knuth final insertion sort pass
for partitions smaller than 16.

The GNU Standard C++ library is similar: uses introsort with a maximum
depth of 2×log2 n, followed by an insertion sort on partitions smaller
than 16.[3]

LLVM libc++ also uses introsort with a maximum depth of 2×log2 n,
however the size limit for insertion sort is different for different
data types (30 if swaps are trivial, 6 otherwise). Also, arrays with
sizes up to 5 are handled separately.[4] Kutenin (2022) provides an
overview for some changes made by LLVM, with a focus on the 2022 fix
for quadraticness.[5]

The Microsoft .NET Framework Class Library, starting from version 4.5
(2012), uses introsort instead of simple quicksort.[6]

Go uses introsort with small modification: for slices of 12 or less
elements it uses Shellsort instead of insertion sort, and more
advanced median of three medians of three pivot selection for
quicksort.

Java, starting from version 14 (2020), uses a hybrid sorting algorithm
that uses merge sort for highly structured arrays (arrays that are
composed of a small number of sorted subarrays) and introsort
otherwise to sort arrays of ints, longs, floats and doubles."

Cheers

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] qsort
  2023-02-09 19:52           ` Rich Felker
  2023-02-09 20:18             ` Rich Felker
@ 2023-02-10  4:10             ` Markus Wichmann
  1 sibling, 0 replies; 34+ messages in thread
From: Markus Wichmann @ 2023-02-10  4:10 UTC (permalink / raw)
  To: musl

On Thu, Feb 09, 2023 at 02:52:45PM -0500, Rich Felker wrote:
> On Thu, Feb 09, 2023 at 10:20:45PM +0300, Alexander Monakov wrote:
> >
> > On Thu, 9 Feb 2023, Rich Felker wrote:
> >
> > >  glibc does not use
> > > it by itself, but uses "introsort", a fancy way to say it introspects
> > > the quicksort (rather just counts the depth) and switches to an O(n
> > > log n) algorithm once it's descended too many levels.
> >
> > This is so completely untrue. Glibc uses mergesort, falling back on
> > quicksort with median-of-three pivot selection when allocating the
> > intermediate array for mergesort fails or seems too costly:
>
> Did this change at some point or have I just always been under the
> wrong impression on this?
>
> Rich

The latter. As I recall, I notified you about this because it was also
wrong on the etalabs libc comparison page. Lemme check... yep, still
wrong.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-09 19:03       ` Rich Felker
  2023-02-09 19:20         ` Alexander Monakov
@ 2023-02-10 10:00         ` David Wang
  2023-02-10 13:10           ` Rich Felker
  1 sibling, 1 reply; 34+ messages in thread
From: David Wang @ 2023-02-10 10:00 UTC (permalink / raw)
  To: musl; +Cc: Markus Wichmann

[-- Attachment #1: Type: text/plain, Size: 1143 bytes --]

>
>Before we consider any such change though we should probably see
>profiling data to determine where the time is being spent in
>smoothsort, in case it's stupid stuff that's fixable.
>
>Rich

I made a rough profile of the current implementation of qsort in musl (just copied the source code and recompiled it to expose the internal symbols).


The function call counters are as follows:

main: 931387
qsort2: 917148  <----this is qsort
__qsort_r: 911594
trinkle: 811314
sift: 537263
wrapper_cmp: 257403  <--- wrapper cmp function
mycmp: 167410  <---real cmp function
cycle: 105809
shr: 41585
pntz: 27833
_init: 14925
a_ctz_64: 11341
a_ctz_l: 9127
shl: 8333


1. wrapper_cmp took about 25%; the overhead (257403-167410) seems high when the real compare function is very simple (it just compares integer values), and it could be optimized out for qsort.
2. Most "qsort" time is spent in "trinkle", and most "trinkle" time is spent in "sift".


A tree-view profiling report is attached. (There may be several incorrect call chains collected, but I think the proportions among the function-call counters are correct, with high probability....)

FYI
David

[-- Attachment #2: report.html --]
[-- Type: text/html, Size: 7239 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-10 10:00         ` [musl] " David Wang
@ 2023-02-10 13:10           ` Rich Felker
  2023-02-10 13:45             ` [musl] " David Wang
  0 siblings, 1 reply; 34+ messages in thread
From: Rich Felker @ 2023-02-10 13:10 UTC (permalink / raw)
  To: David Wang; +Cc: musl, Markus Wichmann

On Fri, Feb 10, 2023 at 06:00:27PM +0800, David Wang wrote:
> >
> >Before we consider any such change though we should probably see
> >profiling data to determine where the time is being spent in
> >smoothsort, in case it's stupid stuff that's fixable.
> >
> >Rich
> 
> I made a rough profile of the current implementation of qsort in
> musl (just copied the source code and recompiled it to expose the
> internal symbols).
> 
> 
> The function call counters are as follows:
> 
> main: 931387
> qsort2: 917148  <----this is qsort
> __qsort_r: 911594
> trinkle: 811314
> sift: 537263
> wrapper_cmp: 257403  <--- wrapper cmp function
> mycmp: 167410  <---real cmp function
> cycle: 105809
> shr: 41585
> pntz: 27833
> _init: 14925
> a_ctz_64: 11341
> a_ctz_l: 9127
> shl: 8333
> 
> 
> 1. wrapper_cmp took about 25%; the overhead (257403-167410) seems
> high when the real compare function is very simple (it just compares
> integer values), and it could be optimized out for qsort.
> 2. Most "qsort" time is spent in "trinkle", and most "trinkle" time
> is spent in "sift".
> 
> 
> A tree-view profiling report is attached. (There may be several
> incorrect call chains collected, but I think the proportions among
> the function-call counters are correct, with high probability....)

What tool was used for this? gprof or anything else invasive is not
meaningful; for tiny functions, the entire time measured will be the
profiling overhead. perf(1) is the only way I know to get meaningful
numbers.

In particular, it makes no sense that significant time was spent in
wrapper_cmp, which looks like (i386):

   0:   ff 64 24 0c             jmp    *0xc(%esp)

or (x86_64):

   0:   ff e2                   jmpq   *%rdx

or (arm):

   0:   4710            bx      r2

but I can imagine it being (relatively) gigantic with a call out to
profiling code.
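
For reference, the C behind that is just a one-line trampoline from
the qsort() comparator type to the qsort_r()-style internal one
(paraphrased from memory of musl's qsort implementation, not an exact
copy):

```c
typedef int (*cmpfun)(const void *, const void *);

/* adapts the two-argument qsort() comparator to the three-argument
 * internal interface by smuggling the real comparator through the
 * context pointer */
static int wrapper_cmp(const void *v1, const void *v2, void *cmp)
{
	return ((cmpfun)cmp)(v1, v2);
}
```

With optimization on, the whole function collapses into the single
indirect jump shown above.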

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-10 13:10           ` Rich Felker
@ 2023-02-10 13:45             ` David Wang
  2023-02-10 14:19               ` Rich Felker
  0 siblings, 1 reply; 34+ messages in thread
From: David Wang @ 2023-02-10 13:45 UTC (permalink / raw)
  To: musl; +Cc: Markus Wichmann

At 2023-02-10 21:10:45, "Rich Felker" <dalias@libc.org> wrote:
>On Fri, Feb 10, 2023 at 06:00:27PM +0800, David Wang wrote:

>What tool was used for this? gprof or anything else invasive is not
>meaningful; for tiny functions, the entire time measured will be the
>profiling overhead. perf(1) is the only way I know to get meaningful
>numbers.
>
>In particular, it makes no sense that significant time was spent in
>wrapper_cmp, which looks like (i386):
>
>   0:   ff 64 24 0c             jmp    *0xc(%esp)
>
>or (x86_64):
>
>   0:   ff e2                   jmpq   *%rdx
>
>or (arm):
>
>   0:   4710            bx      r2
>
>but I can imagine it being (relatively) gigantic with a call out to
>profiling code.
>
>Rich

I have implemented a profiling tool myself, using perf_event_open to start profiling and mmap to collect callchains; the source code is here: https://github.com/zq-david-wang/linux-tools/blob/main/perf/profiler/profiler.cpp
(Still buggy; there are always strange callchains which I could not figure out... and I am still working on it...)

Also, I did not use any optimization when compiling the code, which could make a difference; I will take time to give it a try.

About wrapper_cmp: in my last profiling there were 931387 samples collected in total; 257403 samples contain the callchain ->wrapper_cmp, and among those, 167410 samples contain the callchain ->wrapper_cmp->mycmp. That is why I think there is extra overhead in wrapper_cmp. Maybe compiler optimization would change the result, and I will make further checks.

David

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-10 13:45             ` [musl] " David Wang
@ 2023-02-10 14:19               ` Rich Felker
  2023-02-11  5:12                 ` [musl] " David Wang
  0 siblings, 1 reply; 34+ messages in thread
From: Rich Felker @ 2023-02-10 14:19 UTC (permalink / raw)
  To: David Wang; +Cc: musl, Markus Wichmann

On Fri, Feb 10, 2023 at 09:45:12PM +0800, David Wang wrote:
> 
> 
> 
> At 2023-02-10 21:10:45, "Rich Felker" <dalias@libc.org> wrote:
> >On Fri, Feb 10, 2023 at 06:00:27PM +0800, David Wang wrote:
> 
> >What tool was used for this? gprof or anything else invasive is not
> >meaningful; for tiny functions, the entire time measured will be the
> >profiling overhead. perf(1) is the only way I know to get meaningful
> >numbers.
> >
> >In particular, it makes no sense that significant time was spent in
> >wrapper_cmp, which looks like (i386):
> >
> >   0:   ff 64 24 0c             jmp    *0xc(%esp)
> >
> >or (x86_64):
> >
> >   0:   ff e2                   jmpq   *%rdx
> >
> >or (arm):
> >
> >   0:   4710            bx      r2
> >
> >but I can imagine it being (relatively) gigantic with a call out to
> >profiling code.
> >
> >Rich
> 
> I have implemented a profiling tool myself, using perf_event_open to
> start profiling and mmap to collect callchains; the source code is
> here:
> https://github.com/zq-david-wang/linux-tools/blob/main/perf/profiler/profiler.cpp
> (Still buggy; there are always strange callchains which I could not
> figure out... and I am still working on it...)

Thanks for sharing. It's nice to see another tool like this.

> Also, I did not use any optimization when compiling the code, which
> could make a difference; I will take time to give it a try.

Yes, that would make a big difference. For this to be meaningful the
measurement needs to be with optimizations.

> About wrapper_cmp: in my last profiling there were 931387 samples
> collected in total; 257403 samples contain the callchain
> ->wrapper_cmp, and among those, 167410 samples contain the callchain
> ->wrapper_cmp->mycmp. That is why I think there is extra overhead in
> wrapper_cmp. Maybe compiler optimization would change the result, and
> I will make further checks.

Yes. On i386 here, -O0 takes wrapper_cmp from 1 instruction to 10
instructions.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-10 14:19               ` Rich Felker
@ 2023-02-11  5:12                 ` David Wang
  2023-02-11  5:44                   ` alice
  0 siblings, 1 reply; 34+ messages in thread
From: David Wang @ 2023-02-11  5:12 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl, Markus Wichmann

At 2023-02-10 22:19:55, "Rich Felker" <dalias@libc.org> wrote:
>On Fri, Feb 10, 2023 at 09:45:12PM +0800, David Wang wrote:
>> 
>> 
>> 
>
>> About wrapper_cmp: in my last profiling there were 931387 samples
>> collected in total; 257403 samples contain the callchain
>> ->wrapper_cmp, and among those, 167410 samples contain the callchain
>> ->wrapper_cmp->mycmp. That is why I think there is extra overhead in
>> wrapper_cmp. Maybe compiler optimization would change the result, and
>> I will make further checks.
>
>Yes. On i386 here, -O0 takes wrapper_cmp from 1 instruction to 10
>instructions.
>
>Rich

With optimized binary code, it is very hard to collect an intact callchain from the kernel via perf_event_open:PERF_SAMPLE_CALLCHAIN.
But to profile qsort, a callchain may not be necessary; IP register sampling is enough to identify which part takes most of the CPU cycles.
So I changed strategy: instead of PERF_SAMPLE_CALLCHAIN, I now just use PERF_SAMPLE_IP.
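
For reference, the event setup for this looks roughly like the
following (a minimal sketch assuming Linux perf_event_open; reading
the samples back from the mmap ring buffer is omitted):

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

/* open a CPU-cycles counter on a process that records only the
 * instruction pointer at each sample */
static int open_ip_sampler(pid_t pid)
{
	struct perf_event_attr attr;
	memset(&attr, 0, sizeof attr);
	attr.size = sizeof attr;
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;       /* one sample per 100k cycles */
	attr.sample_type = PERF_SAMPLE_IP; /* IP only, no callchain */
	attr.exclude_kernel = 1;
	return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}
```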

This is what I got:
+-------------------+---------------+
|        func       |     count     |
+-------------------+---------------+
|       Total       |     423488    |
|       memcpy      | 48.76% 206496 |
|        sift       |  16.29% 68989 |
|       mycmp       |  14.57% 61714 |
|      trinkle      |  8.90% 37690  |
|       cycle       |  5.45% 23061  |
|        shr        |   2.19% 9293  |
|     __qsort_r     |   1.77% 7505  |
|        main       |   1.04% 4391  |
|        shl        |   0.55% 2325  |
|    wrapper_cmp    |   0.42% 1779  |
|        rand       |   0.05% 229   |
| __set_thread_area |    0.00% 16   |
+-------------------+---------------+
(Note that in this profile report I count only the samples that fall directly within a function's body; samples within sub-functions do not contribute to any of their parent functions.)

And you're right, with optimization, the impact of wrapper_cmp is very very low, only 0.42%.

The memcpy stands out above. I used uprobes (perf_event_open:PERF_SAMPLE_REGS_USER) to collect statistics about the size argument (the 3rd parameter, stored in the RDX register) of memcpy, and all of those memcpy calls are copying just 4 bytes; according to the source code, the memcpy size is the size of the item being sorted, which is int32 in my test case.
Maybe something could be improved here.


I also made same profiling against glibc:
+-----------------------------+--------------+
|             func            |    count     |
+-----------------------------+--------------+
|            Total            |    640880    |
|    msort_with_tmp.part.0    | 73.99 474176 |  <--- merge sort?
|            mycmp            | 11.76 75392  |
|             main            |  6.45 41306  |
| __memcpy_avx_unaligned_erms |  4.58 29339  |
|            random           |  0.86 5525   |
|    __memcpy_avx_unaligned   |  0.83 5293   |
|           random_r          |  0.76 4882   |
|             rand            |  0.45 2897   |
|            _init            |  0.31 1975   |
|            _fini            |   0.01 80    |
|            __free           |    0.00 5    |
|         _int_malloc         |    0.00 5    |
|            malloc           |    0.00 2    |
|          __qsort_r          |    0.00 1    |
|          _int_free          |    0.00 1    |
+-----------------------------+--------------+

Test code:
-------------------
#include <stdio.h>
#include <stdlib.h>

int mycmp(const void *a, const void *b) { return *(const int *)a-*(const int*)b; }

#define MAXN (1<<20)
int vs[MAXN];

int main() {
    int i, j, k, n, t;
    for (k=0; k<1024; k++) {
        for (i=0; i<MAXN; i++) vs[i]=i;
        for (n=MAXN; n>1; n--) { /* Fisher-Yates shuffle */
            i=n-1; j=rand()%n;
            if (i!=j) { t=vs[i]; vs[i]=vs[j]; vs[j]=t; }
        }
        qsort(vs, MAXN, sizeof(vs[0]), mycmp);
    }
    return 0;
}

-------------------
gcc test.c -O2 -static
With musl-libc:
$ time ./a.out

real	9m 5.10s
user	9m 5.09s
sys	0m 0.00s

With glibc:
$ time ./a.out
real	1m56.287s
user	1m56.270s
sys	0m0.004s



To sum up, optimizing those memcpy calls and reducing comparisons to the minimum could bring significant performance improvements, but I doubt it could achieve a 4x improvement.

FYI
David


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-11  5:12                 ` [musl] " David Wang
@ 2023-02-11  5:44                   ` alice
  2023-02-11  8:39                     ` Joakim Sindholt
  0 siblings, 1 reply; 34+ messages in thread
From: alice @ 2023-02-11  5:44 UTC (permalink / raw)
  To: musl, Rich Felker; +Cc: Markus Wichmann

On Sat Feb 11, 2023 at 6:12 AM CET, David Wang wrote:
>
>
>
>
> At 2023-02-10 22:19:55, "Rich Felker" <dalias@libc.org> wrote:
> >On Fri, Feb 10, 2023 at 09:45:12PM +0800, David Wang wrote:
> >> 
> >> 
> >> 
> >
> >> About wrapper_cmp: in my last profiling there were 931387 samples
> >> collected in total; 257403 samples contain the callchain
> >> ->wrapper_cmp, and among those, 167410 samples contain the callchain
> >> ->wrapper_cmp->mycmp. That is why I think there is extra overhead in
> >> wrapper_cmp. Maybe compiler optimization would change the result, and
> >> I will make further checks.
> >
> >Yes. On i386 here, -O0 takes wrapper_cmp from 1 instruction to 10
> >instructions.
> >
> >Rich
>
> With optimized binary code, it is very hard to collect an intact callchain from the kernel via perf_event_open:PERF_SAMPLE_CALLCHAIN.
> But to profile qsort, a callchain may not be necessary; IP register sampling is enough to identify which part takes most of the CPU cycles.
> So I changed strategy: instead of PERF_SAMPLE_CALLCHAIN, I now just use PERF_SAMPLE_IP.
>
> This is what I got:
> +-------------------+---------------+
> |        func       |     count     |
> +-------------------+---------------+
> |       Total       |     423488    |
> |       memcpy      | 48.76% 206496 |
> |        sift       |  16.29% 68989 |
> |       mycmp       |  14.57% 61714 |
> |      trinkle      |  8.90% 37690  |
> |       cycle       |  5.45% 23061  |
> |        shr        |   2.19% 9293  |
> |     __qsort_r     |   1.77% 7505  |
> |        main       |   1.04% 4391  |
> |        shl        |   0.55% 2325  |
> |    wrapper_cmp    |   0.42% 1779  |
> |        rand       |   0.05% 229   |
> | __set_thread_area |    0.00% 16   |
> +-------------------+---------------+
> (Note that in this profile report I count only the samples that fall directly within a function's body; samples within sub-functions do not contribute to any of their parent functions.)
>
> And you're right, with optimization, the impact of wrapper_cmp is very very low, only 0.42%.
>
> The memcpy stands out above. I used uprobes (perf_event_open:PERF_SAMPLE_REGS_USER) to collect statistics about the size argument (the 3rd parameter, stored in the RDX register) of memcpy, and all of those memcpy calls are copying just 4 bytes; according to the source code, the memcpy size is the size of the item being sorted, which is int32 in my test case.
> Maybe something could be improved here.
>
>
> I also made same profiling against glibc:
> +-----------------------------+--------------+
> |             func            |    count     |
> +-----------------------------+--------------+
> |            Total            |    640880    |
> |    msort_with_tmp.part.0    | 73.99 474176 |  <--- merge sort?
> |            mycmp            | 11.76 75392  |
> |             main            |  6.45 41306  |
> | __memcpy_avx_unaligned_erms |  4.58 29339  |
> |            random           |  0.86 5525   |
> |    __memcpy_avx_unaligned   |  0.83 5293   |
> |           random_r          |  0.76 4882   |
> |             rand            |  0.45 2897   |
> |            _init            |  0.31 1975   |
> |            _fini            |   0.01 80    |
> |            __free           |    0.00 5    |
> |         _int_malloc         |    0.00 5    |
> |            malloc           |    0.00 2    |
> |          __qsort_r          |    0.00 1    |
> |          _int_free          |    0.00 1    |
> +-----------------------------+--------------+
>
> Test code:
> -------------------
> #include <stdio.h>
> #include <stdlib.h>
>
> int mycmp(const void *a, const void *b) { return *(const int *)a-*(const int*)b; }
>
> #define MAXN (1<<20)
> int vs[MAXN];
>
> int main() {
>     int i, j, k, n, t;
>     for (k=0; k<1024; k++) {
>         for (i=0; i<MAXN; i++) vs[i]=i;
>         for (n=MAXN; n>1; n--) { /* Fisher-Yates shuffle */
>             i=n-1; j=rand()%n;
>             if (i!=j) { t=vs[i]; vs[i]=vs[j]; vs[j]=t; }
>         }
>         qsort(vs, MAXN, sizeof(vs[0]), mycmp);
>     }
>     return 0;
> }
>
> -------------------
> gcc test.c -O2 -static
> With musl-libc:
> $ time ./a.out
>
> real	9m 5.10s
> user	9m 5.09s
> sys	0m 0.00s
>
> With glibc:
> $ time ./a.out
> real	1m56.287s
> user	1m56.270s
> sys	0m0.004s
>
>
>
> To sum up, optimizing those memcpy calls and reducing comparisons to the minimum could bring significant performance improvements, but I doubt it could achieve a 4x improvement.

based on the glibc profiling, glibc also has its natively-loaded-cpu-specific
optimisations, the _avx_ functions in your case. musl doesn't implement any
SIMD optimisations, so this is a bit apples-to-oranges unless musl implements
the same kind of native per-arch optimisation.

you should rerun these with GLIBC_TUNABLES, from something in:
https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
which should let you disable them all (if you just want to compare C to C code).

( unrelated, but has there been some historic discussion of implementing
  something similar in musl? i feel like i might be forgetting something. )

>
> FYI
> David


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-11  5:44                   ` alice
@ 2023-02-11  8:39                     ` Joakim Sindholt
  2023-02-11  9:06                       ` alice
  0 siblings, 1 reply; 34+ messages in thread
From: Joakim Sindholt @ 2023-02-11  8:39 UTC (permalink / raw)
  To: musl

On Sat, 11 Feb 2023 06:44:29 +0100, "alice" <alice@ayaya.dev> wrote:
> based on the glibc profiling, glibc also has their natively-loaded-cpu-specific
> optimisations, the _avx_ functions in your case. musl doesn't implement any
> SIMD optimisations, so this is a bit apples-to-oranges unless musl implements
> the same kind of native per-arch optimisation.
> 
> you should rerun these with GLIBC_TUNABLES, from something in:
> https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
> which should let you disable them all (if you just want to compare C to C code).
> 
> ( unrelated, but has there been some historic discussion of implementing
>   something similar in musl? i feel like i might be forgetting something. )

There already are arch-specific asm implementations of functions like
memcpy. As I see it there are 3 issues standing between musl and the
glibc approach of writing a new function every time Intel or AMD
releases a new core design:
1) ifunc resolvers don't work on statically linked binaries.
2) If they did it would mean shipping 12 different implementations of
   each optimized function, making the binary huge for, for the most
   part, no good reason.
3) The esoteric bug is no longer in memcpy but in either memcpy_c,
   memcpy_mmx, memcpy_3dnow, memcpy_sse2, memcpy_sse3, memcpy_ssse3,
   memcpy_sse41, memcpy_sse42, memcpy_avx, memcpy_avx2, memcpy_avx512,
   or memcpy_amx or whatever else is added in the future in a
   never-ending spiral of implementations piling up.

It is my opinion that musl should remain small and concise to allow it
to effectively serve both the "small" and "gotta go fast" markets. I say
both because you can always haul in libreallyreallyfastsort.a/so but you
can't take the 47 qsort/memcpy implementations out of libc.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-11  8:39                     ` Joakim Sindholt
@ 2023-02-11  9:06                       ` alice
  2023-02-11  9:31                         ` [musl] " David Wang
  2023-02-11 13:35                         ` Rich Felker
  0 siblings, 2 replies; 34+ messages in thread
From: alice @ 2023-02-11  9:06 UTC (permalink / raw)
  To: musl

On Sat Feb 11, 2023 at 9:39 AM CET, Joakim Sindholt wrote:
> On Sat, 11 Feb 2023 06:44:29 +0100, "alice" <alice@ayaya.dev> wrote:
> > based on the glibc profiling, glibc also has their natively-loaded-cpu-specific
> > optimisations, the _avx_ functions in your case. musl doesn't implement any
> > SIMD optimisations, so this is a bit apples-to-oranges unless musl implements
> > the same kind of native per-arch optimisation.
> > 
> > you should rerun these with GLIBC_TUNABLES, from something in:
> > https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
> > which should let you disable them all (if you just want to compare C to C code).
> > 
> > ( unrelated, but has there been some historic discussion of implementing
> >   something similar in musl? i feel like i might be forgetting something. )
>
> There already are arch-specific asm implementations of functions like
> memcpy.

apologies, i wasn't quite clear- the difference
between src/string/x86_64/memcpy.s and the glibc fiesta is that the latter
utilises subarch-specific SIMD (as you explain below), e.g. AVX like in the
above benchmarks. a baseline x86_64 asm is more fair-game if the difference is
as significant as it is for memcpy :)

i wonder if anyone has tried such baseline-asm for str*, or for non i386/
x86_64 by now. there seems to only be x86 and mips asm in the tree currently
(base platform support aside).
(purely out of interest of course- i don't have the ability to write such
things (yet), and maybe there are some gains more significant than "2.2%"
possible with just sse2 for instance.)

> As I see it there are 3 issues standing between musl and the
> glibc approach of writing a new function every time Intel or AMD
> releases a new core design:
> 1) ifunc resolvers don't work on statically linked binaries.
> 2) If they did it would mean shipping 12 different implementations of
>    each optimized function, making the binary huge for, for the most
>    part, no good reason.
> 3) The esoteric bug is no longer in memcpy but in either memcpy_c,
>    memcpy_mmx, memcpy_3dnow, memcpy_sse2, memcpy_sse3, memcpy_ssse3,
>    memcpy_sse41, memcpy_sse42, memcpy_avx, memcpy_avx2, memcpy_avx512,
>    or memcpy_amx or whatever else is added in the future in a
>    never-ending spiral of implementations piling up.

3) is admittedly the worst effect- niche esoteric debugging is worse than "disk
space", and having so many implementations is certainly hard to maintain.

> It is my opinion that musl should remain small and concise to allow it
> to effectively serve both the "small" and "gotta go fast" markets. I say
> both because you can always haul in libreallyreallyfastsort.a/so but you
> can't take the 47 qsort/memcpy implementations out of libc.

yes, i generally find myself having the same opinion :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] qsort
  2023-01-30 10:04   ` [musl] " David Wang
  2023-02-01 18:01     ` Markus Wichmann
@ 2023-02-11  9:22     ` Markus Wichmann
  2023-02-11  9:36       ` [musl] " David Wang
  2023-02-11  9:51       ` David Wang
  1 sibling, 2 replies; 34+ messages in thread
From: Markus Wichmann @ 2023-02-11  9:22 UTC (permalink / raw)
  To: musl

On Mon, Jan 30, 2023 at 06:04:39PM +0800, David Wang wrote:
> On average, qsort with musl is significantly slower than c++ sort,
> while with glibc the average performance of qsort is better than c++
> sort.

Getting back to this, with the benchmarks recently performed, I notice
one area where C++ is able to significantly outperform C, and that is in
the area of generic functions. C++ std::sort is a template function, and
the compiler can optimize the code for the situation. For example, in
your case, you only ever need to swap two elements by way of copying
four bytes, which the compiler can just inline.

In C, the same code can only call memcpy(), and a call to cycle() will
involve many calls to memcpy(), and they cannot be inlined since the
compiler will not know the copy size at the time it is compiling the
calls. We end up with tons of calls to memcpy() with a size of four,
when memcpy() is optimized for large sizes.

Merge sort, as glibc implements it, allows at least a few memcpy() calls
with larger sizes. Glibc also avoids calling memcpy by duplicating the
code for common sizes, only resorting to memcpy() for the default case.
That may be an avenue worth pursuing, at least for cycle().
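
A sketch of what that could look like for the element copies
(hypothetical helper, not musl code; the constant-size memcpy() calls
get inlined by the compiler):

```c
#include <string.h>

/* copy one element; special-case the common widths so the compiler
 * can inline the copy instead of emitting a real memcpy call */
static void copy_elem(void *dst, const void *src, size_t width)
{
	switch (width) {
	case 4: memcpy(dst, src, 4); break; /* becomes a 32-bit load/store */
	case 8: memcpy(dst, src, 8); break; /* becomes a 64-bit load/store */
	default: memcpy(dst, src, width);   /* general case */
	}
}
```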

I'm not quite certain why glibc chose the sizes they chose for
msort_with_tmp(), but they are branching out for 32-bit integers, 64-bit
integers, longs (which in a System V ABI are always one of the former
two) and void* (which in a System V ABI are always the same as a long).
I think special casing 32-bit and 64-bit integers ought to get us most
of the way there, since those are the most common sizes that can still
be copied simply inline.

Also note that glibc further reduces the need for memcpy() by sorting
indirectly if the size of a single element exceeds 32 bytes. That is of
course something they can do, since they already are allocating memory
and have a fall-back strategy in place. Not sure if this is worthwhile
for musl. At least in the static linking case, it would suddenly pull
malloc() and free() into executables that previously did not have those.

That last optimization would also not have any bearing on the current
batch of benchmarks, since in those, a size of four is set in stone.

Ciao,
Markus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-11  9:06                       ` alice
@ 2023-02-11  9:31                         ` David Wang
  2023-02-11 13:35                         ` Rich Felker
  1 sibling, 0 replies; 34+ messages in thread
From: David Wang @ 2023-02-11  9:31 UTC (permalink / raw)
  To: musl

At 2023-02-11 17:06:02, "alice" <alice@ayaya.dev> wrote:
>On Sat Feb 11, 2023 at 9:39 AM CET, Joakim Sindholt wrote:
>> On Sat, 11 Feb 2023 06:44:29 +0100, "alice" <alice@ayaya.dev> wrote:


Strangely, you jump to conclusions about the optimization of memcpy itself; wouldn't it be better to first check why qsort needs that many memcpy calls?
In my profiling, qsort in my test case makes lots of memcpy(dest, src, 4) calls; how many CPU cycles can be saved by putting effort into optimizing memcpy itself when the size is only 4??


David

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-11  9:22     ` [musl] " Markus Wichmann
@ 2023-02-11  9:36       ` David Wang
  2023-02-11  9:51       ` David Wang
  1 sibling, 0 replies; 34+ messages in thread
From: David Wang @ 2023-02-11  9:36 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2300 bytes --]





At 2023-02-11 17:22:49, "Markus Wichmann" <nullplan@gmx.net> wrote:

>
>Getting back to this, with the benchmarks recently performed, I notice
>one area where C++ is able to significantly outperform C, and that is in
>the area of generic functions. C++ std::sort is a template function, and
>the compiler can optimize the code for the situation. For example, in
>your case, you only ever need to swap two elements by way of copying
>four bytes, which the compiler can just inline.
>
Yes, I tried -O2 with g++ sort, and its performance is much better.

>In C, the same code can only call memcpy(), and a call to cycle() will
>involve many calls to memcpy(), and they cannot be inlined since the
>compiler will not know the copy size at the time it is compiling the
>calls. We end up with tons of calls to memcpy() with a size of four,
>when memcpy() is optimized for large sizes.
>
>Merge sort, as glibc implements it, allows at least a few memcpy() calls
>with larger sizes. Glibc also avoids calling memcpy by duplicating the
>code for common sizes, only resorting to memcpy() for the default case.
>That may be an avenue worth pursuing, at least for cycle().
>


Totally agree with this; lots of memcpy function call overhead could be avoided.

>I'm not quite certain why glibc chose the sizes they chose for
>msort_with_tmp(), but they are branching out for 32-bit integers, 64-bit
>integers, longs (which in a System V ABI are always one of the former
>two) and void* (which in a System V ABI are always the same as a long).
>I think special casing 32-bit and 64-bit integers ought to get us most
>of the way there, since those are the most common sizes that can still
>be copied simply inline.
>
>Also note that glibc further reduces the need for memcpy() by sorting
>indirectly if the size of a single element exceeds 32 bytes. That is of
>course something they can do, since they already are allocating memory
>and have a fall-back strategy in place. Not sure if this is worthwhile
>for musl. At least in the static linking case, it would suddenly pull
>malloc() and free() into executables that previously did not have those.
>
>That last optimization would also not have any bearing on the current
>batch of benchmarks, since in those, a size of four is set in stone.
>
>Ciao,
>Markus


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-11  9:22     ` [musl] " Markus Wichmann
  2023-02-11  9:36       ` [musl] " David Wang
@ 2023-02-11  9:51       ` David Wang
  1 sibling, 0 replies; 34+ messages in thread
From: David Wang @ 2023-02-11  9:51 UTC (permalink / raw)
  To: musl



At 2023-02-11 17:22:49, "Markus Wichmann" <nullplan@gmx.net> wrote:

>Also note that glibc further reduces the need for memcpy() by sorting
>indirectly if the size of a single element exceeds 32 bytes. That is of
>course something they can do, since they already are allocating memory
>and have a fall-back strategy in place. Not sure if this is worthwhile
>for musl. At least in the static linking case, it would suddenly pull
>malloc() and free() into executables that previously did not have those.
>
>That last optimization would also not have any bearing on the current
>batch of benchmarks, since in those, a size of four is set in stone.
>

My test data is just for demo; every item is an int32. It is definitely not meant to be a benchmark... 



>Ciao,
>Markus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] Re:Re: [musl] qsort
  2023-02-11  9:06                       ` alice
  2023-02-11  9:31                         ` [musl] " David Wang
@ 2023-02-11 13:35                         ` Rich Felker
  2023-02-11 17:18                           ` David Wang
  1 sibling, 1 reply; 34+ messages in thread
From: Rich Felker @ 2023-02-11 13:35 UTC (permalink / raw)
  To: musl

On Sat, Feb 11, 2023 at 10:06:02AM +0100, alice wrote:
> On Sat Feb 11, 2023 at 9:39 AM CET, Joakim Sindholt wrote:
> > On Sat, 11 Feb 2023 06:44:29 +0100, "alice" <alice@ayaya.dev> wrote:
> > > based on the glibc profiling, glibc also has their natively-loaded-cpu-specific
> > > optimisations, the _avx_ functions in your case. musl doesn't implement any
> > > SIMD optimisations, so this is a bit apples-to-oranges unless musl implements
> > > the same kind of native per-arch optimisation.
> > > 
> > > you should rerun these with GLIBC_TUNABLES, from something in:
> > > https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
> > > which should let you disable them all (if you just want to compare C to C code).
> > > 
> > > ( unrelated, but has there been some historic discussion of implementing
> > >   something similar in musl? i feel like i might be forgetting something. )
> >
> > There already are arch-specific asm implementations of functions like
> > memcpy.
> 
> apologies, i wasn't quite clear- the difference
> between src/string/x86_64/memcpy.s and the glibc fiesta is that the latter
> utilises subarch-specific SIMD (as you explain below), e.g. AVX like in the
> above benchmarks. a baseline x86_64 asm is more fair-game if the difference is
> as significant as it is for memcpy :)

Folks are missing the point here. It's not anything to do with AVX or
even glibc's memcpy making glibc faster here. Rather, it's that glibc
is *not calling memcpy* for 4-byte (and likely a bunch of other
specialized cases) element sizes. Either they manually special-case
them, or the compiler (due to lack of -ffreestanding and likely -O3 or
something) is inlining the memcpy.

Based on the profiling data, I would predict an instant 2x speed boost
special-casing small sizes to swap directly with no memcpy call.
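
For example (a hypothetical helper, not a proposed patch): when the
size is a compile-time constant, an optimizing compiler folds the
memcpy calls into plain loads and stores, so the swap never calls into
libc:

#include <string.h>

/* 4-byte swap; the constant-size memcpy calls are inlined to register
   moves under optimization, and using memcpy avoids the aliasing
   problems of casting to int*. */
static void swap4(void *a, void *b)
{
	unsigned char t[4];
	memcpy(t, a, 4);
	memcpy(a, b, 4);
	memcpy(b, t, 4);
}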

Incidentally, our memcpy is almost surely at least as fast as glibc's
for 4-byte copies. It's very large sizes where performance is likely
to diverge.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] qsort
  2023-02-11 13:35                         ` Rich Felker
@ 2023-02-11 17:18                           ` David Wang
  0 siblings, 0 replies; 34+ messages in thread
From: David Wang @ 2023-02-11 17:18 UTC (permalink / raw)
  To: musl




At 2023-02-11 21:35:33, "Rich Felker" <dalias@libc.org> wrote:

>Based on the profiling data, I would predict an instant 2x speed boost
>special-casing small sizes to swap directly with no memcpy call.
>

I made some experimental changes, using a different cycle function for widths 4, 8, and 16:
--- a/src/stdlib/qsort.c
+++ b/src/stdlib/qsort.c
...
+static void cyclex1(unsigned char* ar[], int n)
+{
+	unsigned char tmp[32];
+	int i;
+	int *p1, *p2;
+	if(n < 2) {
+		return;
+	}
+	ar[n] = tmp;
+	p1 = (int*)ar[n];
+	p2 = (int*)ar[0];
+	*p1 = *p2;
+	for(i = 0; i < n; i++) {
+		p1 = (int*)ar[i];
+		p2 = (int*)ar[i+1];
+		p1[0] = p2[0];
+	}
+}
+static void cyclex2(unsigned char* ar[], int n)
+{
+	unsigned char tmp[32];
+	int i;
+	long long *p1, *p2;
+	if(n < 2) {
+		return;
+	}
+	ar[n] = tmp;
+	p1 = (long long*)ar[n];
+	p2 = (long long*)ar[0];
+	*p1 = *p2;
+	for(i = 0; i < n; i++) {
+		p1 = (long long*)ar[i];
+		p2 = (long long*)ar[i+1];
+		p1[0] = p2[0];
+	}
+}
+static void cyclex4(unsigned char* ar[], int n)
+{
+	unsigned char tmp[32];
+	int i;
+	long long *p1, *p2;
+	if(n < 2) {
+		return;
+	}
+	ar[n] = tmp;
+	p1 = (long long*)ar[n];
+	p2 = (long long*)ar[0];
+	*p1++ = *p2++;
+	*p1++ = *p2++;
+	for(i = 0; i < n; i++) {
+		p1 = (long long*)ar[i];
+		p2 = (long long*)ar[i+1];
+		p1[0] = p2[0];
+		p1[1] = p2[1];
+	}
+}
+

-	cycle(width, ar, i);
+	if (width==4) cyclex1(ar, i);
+	else if (width==8) cyclex2(ar, i);
+	else if (width==16) cyclex4(ar, i);
+	else cycle(width, ar, i);
---
I am not skilled at writing high-performance code; the above is what I can think of for now.

A rough timing report follows:
+-------------------------+-----------+----------+-----------+
|        item size        |   glibc   |   musl   |  opt musl |
+-------------------------+-----------+----------+-----------+
|          4 int          | 0m15.794s | 1m 7.52s | 0m 37.27s |
|          8 long         | 0m16.351s | 1m 2.92s | 0m 45.12s |
| 16 struct{ long k, v; } | 0m23.262s | 1m 9.74s | 0m 55.07s |
+-------------------------+-----------+----------+-----------+
(128 rounds of qsort over 1<<20 random items)
The test code for the 16-byte case:

#include <stdio.h>
#include <stdlib.h>


typedef struct { long long k, v; } VNode;
int mycmp(const void *a, const void *b) {
    long long d = ((const VNode*)a)->v - ((const VNode*)b)->v;
    if (d>0) return 1;
    else if (d<0) return -1;
    return 0;
}

#define MAXN 1<<20
VNode vs[MAXN];

int main() {
    int i, j, k, n;
    long long t;
    for (k=0; k<128; k++) {
        for (i=0; i<MAXN; i++) vs[i].v=i;
        for (n=MAXN; n>1; n--) {
            i=n-1; j=rand()%n;
            if (i!=j) { t=vs[i].v; vs[i].v=vs[j].v; vs[j].v=t; }
        }
        qsort(vs, MAXN, sizeof(vs[0]), mycmp);
        for (i=0; i<MAXN; i++) if (vs[i].v!=i) { printf("error\n"); return 1; }
    }
    return 0;
}


The highest improvement happens when sorting int32, and as the data size increases, the impact of the memcpy call overhead decreases.




>Incidentally, our memcpy is almost surely at least as fast as glibc's
>for 4-byte copies. It's very large sizes where performance is likely
>to diverge.
>
>Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] qsort
  2023-02-01 18:01     ` Markus Wichmann
  2023-02-02  2:12       ` [musl] " David Wang
  2023-02-09 19:03       ` Rich Felker
@ 2023-02-16 15:15       ` David Wang
  2023-02-16 16:07         ` Rich Felker
  2 siblings, 1 reply; 34+ messages in thread
From: David Wang @ 2023-02-16 15:15 UTC (permalink / raw)
  To: musl





At 2023-02-02 02:01:15, "Markus Wichmann" <nullplan@gmx.net> wrote:

>Anyway, there is an inefficiency in musl's smoothsort implementation,
>namely in the sift() function:
>
>
>| if(cmp(ar[0], lf, arg) >= 0 && cmp(ar[0], rt, arg) >= 0) {
>| 	break;
>| }
>| if(cmp(lf, rt, arg) >= 0) {
>| 	ar[i++] = lf;
>| 	head = lf;
>| 	pshift -= 1;
>| } else {
>| 	ar[i++] = rt;
>| 	head = rt;
>| 	pshift -= 2;
>| }
>
>As you can see, this does two or three comparisions, but actually needs
>to do only two in any case: If we detect ar[0] >= lf && ar[0] < rt, then
>by transitivity we have also detected rt > lf, so there is no need to
>compare lf and rt directly anymore. I have not really found a good way
>to formulate code that takes advantage of that, maybe something like
>this?
>
>
>		if(cmp(ar[0], lf, arg) < 0) {
>			if(cmp(lf, rt, arg) >= 0) {
>				ar[i++] = lf;
>				head = lf;
>				pshift -= 1;
>			} else {
>go_right:
>				ar[i++] = rt;
>				head = rt;
>				pshift -= 2;
>			}
>		} else if (cmp(ar[0], rt, arg) < 0) {
>			goto go_right;
>		} else {
>			break;
>		}
>
>Yeah, not great. But it does use only two compares ever. This ought to
>lower the number of compares in your test somewhat.
>
>HTH,
>Markus


I tried this change, but the improvement is not much.
When sorting 1<<20 random items, my test shows the average number of comparisons decreases from 2.733*nlogn to 2.717*nlogn, an improvement of less than 1%.
In the original code, three comparisons only happen when the parent is larger than its left child and smaller than its right child. If (parent, left, right) were independent, the probability of the order right>parent>left would be 1/6, so the expected number of comparisons would be 3*(1/6)+2*(5/6) = 13/6 per sift step, and we could expect a 1/13 improvement. But the heapsort procedure may make that probability much smaller than 1/6, since every round a leaf node is put at the head, and so I got less than 1% improvement. 

The strange thing is, I made a textbook implementation of heapsort, as in the following patch, optimized for size 8, and both the runtime and the comparison count improved a lot.

The code patch is
diff --git a/src/stdlib/qsort_nr.c b/src/stdlib/qsort_nr.c
index 8ffe71d0..710b1aaa 100644
--- a/src/stdlib/qsort_nr.c
+++ b/src/stdlib/qsort_nr.c
@@ -1,4 +1,5 @@
 #define _BSD_SOURCE
+#include <stdint.h>
 #include <stdlib.h>
 
 typedef int (*cmpfun)(const void *, const void *);
@@ -8,7 +9,36 @@ static int wrapper_cmp(const void *v1, const void *v2, void *cmp)
 	return ((cmpfun)cmp)(v1, v2);
 }
 
+static void _heapifyx2(uint64_t *base, int s, int n, cmpfun cmp) {
+	int ms, ns;
+	uint64_t t;
+	while(1) {
+		ns=s*2+1;
+		if (ns>=n) break;
+		ms=s;
+		if (cmp(base+ms, base+ns)<0) ms=ns;
+		ns++;
+		if (ns<n&&cmp(base+ms, base+ns)<0) ms=ns;
+		if (s==ms) break;
+		t=base[ms]; base[ms]=base[s]; base[s]=t;
+		s=ms;
+	}
+}
+
+static void _hsortx2(uint64_t *base, size_t n, cmpfun cmp) {
+	int i;
+	uint64_t t;
+	for (i=n/2; i>=0; i--) _heapifyx2(base, i, n, cmp);
+	for (n--; n>0; n--) {
+		t=base[0]; base[0]=base[n]; base[n]=t;
+		_heapifyx2(base, 0, n, cmp);
+	}
+}
+
 void qsort(void *base, size_t nel, size_t width, cmpfun cmp)
 {
+	if (width==8) {
+		_hsortx2(base, nel, cmp); return;
+	}
 	__qsort_r(base, nel, width, wrapper_cmp, (void *)cmp);
 }



And the test code is:
-----------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXM 20
#define MAXN 1<<MAXM


long long count=0;
typedef struct { int k, v; } VNode;
VNode vs[MAXN];
int mycmp(const void *a, const void *b) {
    count++;
    return ((const VNode*)a)->v - ((const VNode*)b)->v;
}


int main() {
    int i, j, k, n;
    double f;
    VNode t;
    srand(time(NULL));
    for (k=0; k<256; k++) {
        for (i=0; i<MAXN; i++) vs[i].v=i; 
        for (n=MAXN; n>1; n--) {
            i=n-1; j=rand()%n;
            if (i!=j) { t=vs[i]; vs[i]=vs[j]; vs[j]=t; }
        }
        qsort(vs, MAXN, sizeof(vs[0]), mycmp);
        for (i=0; i<MAXN; i++) if (vs[i].v!=i) { printf("error\n"); return 1; }
    }
    f=count; f/=k; f/=MAXN; f/=MAXM; printf("count: %.3f*nlogn\n", f);
    return 0;
}
-----------------------

Here is what I collect
+------------+-----------+-------------+
|            |  runtime  |  comparison |
+------------+-----------+-------------+
|   glibc    | 0m32.093s | 0.937*nlogn |
|  heapsort  | 0m54.708s | 1.846*nlogn |
| smoothsort | 1m13.137s | 2.733*nlogn | <-- this version of smoothsort is optimized for 8-byte items
+------------+-----------+-------------+

Based on the number of comparisons made, I think smoothsort has worse average performance than heapsort.
But even heapsort's ~2*nlogn comparisons cannot beat the ~1*nlogn of the merge sort used by glibc; textbook heapsort spends about two comparisons per sift-down level (pick the larger child, then compare it against the parent), while merging costs about one comparison per element per level.

So back to this thread

> On Thu, 9 Feb 2023, Rich Felker wrote:
>
>Before we consider any such change though we should probably see
>profiling data to determine where the time is being spent in
>smoothsort, in case it's stupid stuff that's fixable.
>
>Rich


I think optimizing for the small sizes 4/8/16 is practical, since feeding large items to qsort is less efficient than feeding an index or address. In my opinion, a typical structure for sorting would be { long long k, v; } or something alike, where k is an index/address referring to the real data; see the sketch below. The performance improved a lot once those memcpy calls were optimized out.
And a further improvement could be made, for the average case, by switching back to heapsort.
(Hope someone else can help confirm those two points above.)
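
A small sketch of that indirection pattern (names made up for
illustration):

#include <stdlib.h>

/* sort small {index, key} records; the wide payloads never move */
typedef struct { long long k, v; } VNode;	/* k: index into the real data, v: sort key */

static int vcmp(const void *a, const void *b)
{
	const VNode *x = a, *y = b;
	return (x->v > y->v) - (x->v < y->v);	/* sign of the comparison, no overflow */
}

/* usage: qsort(vs, n, sizeof *vs, vcmp);
   afterwards data[vs[i].k] visits the real records in key order */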

But heapsort in theory is always significantly slower than mergesort; I think it would be about 2 times slower on a full set of benchmark tests, given that it makes roughly twice as many comparisons.





David





^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] qsort
  2023-02-16 15:15       ` David Wang
@ 2023-02-16 16:07         ` Rich Felker
  2023-02-17  1:35           ` [musl] " David Wang
  2023-02-17 13:17           ` Alexander Monakov
  0 siblings, 2 replies; 34+ messages in thread
From: Rich Felker @ 2023-02-16 16:07 UTC (permalink / raw)
  To: David Wang; +Cc: musl

On Thu, Feb 16, 2023 at 11:15:24PM +0800, David Wang wrote:
> 
> At 2023-02-02 02:01:15, "Markus Wichmann" <nullplan@gmx.net> wrote:
> 
> >Anyway, there is an inefficiency in musl's smoothsort implementation,
> >namely in the sift() function:
> >
> >
> >| if(cmp(ar[0], lf, arg) >= 0 && cmp(ar[0], rt, arg) >= 0) {
> >| 	break;
> >| }
> >| if(cmp(lf, rt, arg) >= 0) {
> >| 	ar[i++] = lf;
> >| 	head = lf;
> >| 	pshift -= 1;
> >| } else {
> >| 	ar[i++] = rt;
> >| 	head = rt;
> >| 	pshift -= 2;
> >| }
> >
> >As you can see, this does two or three comparisions, but actually needs
> >to do only two in any case: If we detect ar[0] >= lf && ar[0] < rt, then
> >by transitivity we have also detected rt > lf, so there is no need to
> >compare lf and rt directly anymore. I have not really found a good way
> >to formulate code that takes advantage of that, maybe something like
> >this?
> >
> >
> >		if(cmp(ar[0], lf, arg) < 0) {
> >			if(cmp(lf, rt, arg) >= 0) {
> >				ar[i++] = lf;
> >				head = lf;
> >				pshift -= 1;
> >			} else {
> >go_right:
> >				ar[i++] = rt;
> >				head = rt;
> >				pshift -= 2;
> >			}
> >		} else if (cmp(ar[0], rt, arg) < 0) {
> >			goto go_right;
> >		} else {
> >			break;
> >		}
> >
> >Yeah, not great. But it does use only two compares ever. This ought to
> >lower the number of compares in your test somewhat.
> >
> >HTH,
> >Markus
> 
> 
> I tried this change, but improvement is not much.
> When sorting 1<<20 random items, my test shows average count of
> comparisons decrease from 2.733*nlogn to 2.717*nlogn, improved less
> than 1%
> Three comparisons, in the original code, only happen when the parent is
> larger than its left child and smaller than its right child; if
> (parent, left, right) are independent, the probability for this
> order right>parent>left is 1/6, the expected comparison would be
> 3*1/6+2*5/6 = 13/6 for the original code, and we can get 1/13
> improvement, but heapsort procedure may make the probability much
> less than 1/6 since every round a leaf node is put to the head, and
> so I got less than 1% improvement.

OK, so it looks like this doesn't really matter. Kinda disappointing
but at least it means we don't need to do anything.

> Strange thing is, I made a textbook implementation of heapsort as in
> following patch, optimized for size 8, the runtime and comparison
> improved a lot.

Do you have a good theoretical reason why that might be the case?

> Here is what I collect
> +------------+-----------+-------------+
> |            |  runtime  |  comparison |
> +------------+-----------+-------------+
> |   glibc    | 0m32.093s | 0.937*nlogn |
> |  heapsort  | 0m54.708s | 1.846*nlogn |
> | smoothsort | 1m13.137s | 2.733*nlogn | <-- this version of smoothsort has optimized for 8byte data item.
> +------------+-----------+-------------+
> 
> I think smoothsort has worse average performance than heapsort based
> on the number of comparisons made.
> But the ~2*nlogn comparison of heapsort could not beat ~1*nlogn of
> merge sort, used by glibc.

Are those comparison counts just measured, or do they have some
theoretical basis? It'd be nice to know if we're possibly doing
something suboptimal still or if smoothsort is really more expensive
for a fundamental reason.

> So back to this thread
> 
> > On Thu, 9 Feb 2023, Rich Felker wrote:
> >
> >Before we consider any such change though we should probably see
> >profiling data to determine where the time is being spent in
> >smoothsort, in case it's stupid stuff that's fixable.
> 
> I think, optimize for small size 4/8/16 is practical, since feeding
> large item to qsort is less efficient than feeding an index or
> address to qsort, in my opinion, a typical structure for sorting
> would be { long long k, v; } or something alike, where k is
> index/address referring to the real data. The performance improved a
> lot if optimize out those memcpy call.

Yes, I think optimizing out the external memcpy calls for small sizes
is a good idea regardless of whatever else we do. The code you tested
is not valid (it has UB via aliasing violations) but we could fix it
either using __may_alias__ types conditioned on __GNUC__, or exposing
memcpy intrinsics where they don't result in circular definitions with
some magic in src/include/string.h. I'd like to do the latter at some
point anyway.

> And further improvement could be made, for the average performance,
> if switch back to heapsort. (Hope someone else can help confirm
> those two above).

This is kinda disappointing but might make sense if true.

If an introsort that switches from quicksort to heapsort is faster
still, that would also be a candidate.
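
For reference, a minimal sketch of the introsort idea on plain ints
(illustration only, not proposed code): quicksort until a depth budget
of about 2*log2(n) is spent, then finish the current range with
heapsort.

#include <stddef.h>

static void sift_down(int *a, size_t s, size_t n)
{
	for (;;) {
		size_t c = 2*s + 1;
		if (c >= n) break;
		if (c+1 < n && a[c+1] > a[c]) c++;	/* pick the larger child */
		if (a[s] >= a[c]) break;
		int t = a[s]; a[s] = a[c]; a[c] = t;
		s = c;
	}
}

static void heap_sort(int *a, size_t n)
{
	for (size_t i = n/2; i-- > 0; ) sift_down(a, i, n);
	while (n > 1) {
		int t = a[0]; a[0] = a[--n]; a[n] = t;
		sift_down(a, 0, n);
	}
}

static void intro_rec(int *a, size_t n, int depth)
{
	while (n > 1) {
		if (depth-- == 0) { heap_sort(a, n); return; }
		/* simple Lomuto partition, last element as pivot */
		int pivot = a[n-1];
		size_t i = 0;
		for (size_t j = 0; j+1 < n; j++)
			if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
		int t = a[i]; a[i] = a[n-1]; a[n-1] = t;
		intro_rec(a, i, depth);	/* recurse on the left part */
		a += i+1; n -= i+1;	/* iterate on the right part */
	}
}

static void intro_sort(int *a, size_t n)
{
	int depth = 0;
	for (size_t m = n; m > 1; m >>= 1) depth += 2;	/* ~2*log2(n) */
	intro_rec(a, n, depth);
}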

> But heapsort in theory is always significantly slower than
> mergesort, I think it should be about 2 time slower when using full
> set of benchmark tests

Mergesort is simply not a candidate unless there's a way to make
in-place merge practical, but as I understand it that's prohibitively
costly, and I've never really seen a comprehensible implementation
that was convincingly correct. If I'm wrong on this and it is doable,
we could consider that.

Just using malloc opportunistically is not appropriate behavior. Even
if it succeeds, it could gratuitously consume vast amounts of memory,
making *other operations* in the same process or other processes fail
when they should succeed. It's pretty basic to the principles of musl
that we don't do stuff like that (unnecessary allocation).

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [musl] Re:Re: [musl] qsort
  2023-02-16 16:07         ` Rich Felker
@ 2023-02-17  1:35           ` David Wang
  2023-02-17 13:17           ` Alexander Monakov
  1 sibling, 0 replies; 34+ messages in thread
From: David Wang @ 2023-02-17  1:35 UTC (permalink / raw)
  To: Rich Felker; +Cc: musl



At 2023-02-17 00:07:35, "Rich Felker" <dalias@libc.org> wrote:
>On Thu, Feb 16, 2023 at 11:15:24PM +0800, David Wang wrote:

>> Strange thing is, I made a textbook implementation of heapsort as in
>> following patch, optimized for size 8, the runtime and comparison
>> improved a lot.
>
>Do you have a good theoretical reason why that might be the case?

Smoothsort is way too complicated for me to make a clear analysis. According to my reading, smoothsort makes the binary tree quite skewed, which makes the constant factor in O(nlogn) bigger for average performance.
Its advantage is its best-case performance when the input is almost sorted.
The following quote is from the smoothsort paper "Smoothsort, an alternative for sorting in situ", written by Edsger W. Dijkstra:
"""
While heapsort prunes the tree leaf by leaf, smoothsort prunes the tree at the root, and
immediately one of heapsort's charms is lost: while the tree in heapsort remains beautifully
balanced, the tree in smoothsort can get very skew indeed. So why bother about smoothsort at all?
Well, I wanted to design a sorting algorithm of order N in the best case, of order N⋅log N in
the worst case, and with a smooth transition between the two (hence its name).
"""
I think it is safe to conclude:
1. heapsort is better than smoothsort for average performance
2. smoothsort is better than heapsort and many other sorting algorithms in the best case


>
>Yes, I think optimizing out the external memcpy calls for small sizes
>is a good idea regardless of whatever else we do. The code you tested
>is not valid (it has UB via aliasing violations) but we could fix it
>either using __may_alias__ types conditioned on __GNUC__, or exposing
>memcpy intrinsics where they don't result in circular definitions with
>some magic in src/include/string.h. I'd like to do the latter at some
>point anyway.
>

Thanks for pointing this out; I am not yet experienced with these engineering details.



David

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] qsort
  2023-02-16 16:07         ` Rich Felker
  2023-02-17  1:35           ` [musl] " David Wang
@ 2023-02-17 13:17           ` Alexander Monakov
  2023-02-17 15:07             ` Rich Felker
  1 sibling, 1 reply; 34+ messages in thread
From: Alexander Monakov @ 2023-02-17 13:17 UTC (permalink / raw)
  To: musl; +Cc: David Wang

On Thu, 16 Feb 2023, Rich Felker wrote:

> Mergesort is simply not a candidate unless there's a way to make
> in-place merge practical, but as I understand it that's prohibitively
> costly, and I've never really seen a comprehensible implementation
> that was convincingly correct. If I'm wrong on this and it is doable,
> we could consider that.

*** gestures at https://www.openwall.com/lists/musl/2014/09/01/2

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [musl] qsort
  2023-02-17 13:17           ` Alexander Monakov
@ 2023-02-17 15:07             ` Rich Felker
  0 siblings, 0 replies; 34+ messages in thread
From: Rich Felker @ 2023-02-17 15:07 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: musl, David Wang

On Fri, Feb 17, 2023 at 04:17:05PM +0300, Alexander Monakov wrote:
> On Thu, 16 Feb 2023, Rich Felker wrote:
> 
> > Mergesort is simply not a candidate unless there's a way to make
> > in-place merge practical, but as I understand it that's prohibitively
> > costly, and I've never really seen a comprehensible implementation
> > that was convincingly correct. If I'm wrong on this and it is doable,
> > we could consider that.
> 
> *** gestures at https://www.openwall.com/lists/musl/2014/09/01/2

Thank you for reminding me of this. I had completely forgotten it. I
think I had some confusion over the performance figures that, glancing
at it now, might have been the exact same sort of "memcpy vs inline
swap" issue we've been discussing in this thread. I'm also not sure if
the bugs mentioned later in the thread were ever resolved.

I really do like the size and "simplicity" of the code, and I'm
optimistic that this might be the path to try taking for getting qsort
performance more comparable to other implementations.

Rich

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-02-17 15:07 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-20  1:49 [musl] qsort Guy
2023-01-20 12:55 ` alice
2023-01-30 10:04   ` [musl] " David Wang
2023-02-01 18:01     ` Markus Wichmann
2023-02-02  2:12       ` [musl] " David Wang
2023-02-03  5:22         ` [musl] " David Wang
2023-02-03  8:03           ` Alexander Monakov
2023-02-03  9:01             ` [musl] " David Wang
2023-02-09 19:03       ` Rich Felker
2023-02-09 19:20         ` Alexander Monakov
2023-02-09 19:52           ` Rich Felker
2023-02-09 20:18             ` Rich Felker
2023-02-09 20:27               ` Pierpaolo Bernardi
2023-02-10  4:10             ` Markus Wichmann
2023-02-10 10:00         ` [musl] " David Wang
2023-02-10 13:10           ` Rich Felker
2023-02-10 13:45             ` [musl] " David Wang
2023-02-10 14:19               ` Rich Felker
2023-02-11  5:12                 ` [musl] " David Wang
2023-02-11  5:44                   ` alice
2023-02-11  8:39                     ` Joakim Sindholt
2023-02-11  9:06                       ` alice
2023-02-11  9:31                         ` [musl] " David Wang
2023-02-11 13:35                         ` Rich Felker
2023-02-11 17:18                           ` David Wang
2023-02-16 15:15       ` David Wang
2023-02-16 16:07         ` Rich Felker
2023-02-17  1:35           ` [musl] " David Wang
2023-02-17 13:17           ` Alexander Monakov
2023-02-17 15:07             ` Rich Felker
2023-02-11  9:22     ` [musl] " Markus Wichmann
2023-02-11  9:36       ` [musl] " David Wang
2023-02-11  9:51       ` David Wang
2023-01-20 13:32 ` [musl] qsort Valery Ushakov

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).