* [RFC] new qsort implementation
@ 2014-09-01 7:12 Bobby Bingham
2014-09-01 11:17 ` Szabolcs Nagy
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Bobby Bingham @ 2014-09-01 7:12 UTC (permalink / raw)
To: musl
[-- Attachment #1.1: Type: text/plain, Size: 4034 bytes --]
Hi all,
As I mentioned a while back on IRC, I've been looking into wikisort[1]
and grailsort[2] to see if either of them would be a good candidate for
use as musl's qsort.
The C-language reference implementations of both of these algorithms are
inappropriate, as they're both very large, not type-agnostic, and are
not written in idiomatic C. Some of the complexity of both comes from
the fact that they are stable sorts, which qsort is not required to be.
Attached is an implementation of qsort based on a predecessor of the
paper that grailsort is based on, which describes an unstable block
merge sort. The paper is available at [3]. My algorithm deviates a
little bit from what the paper describes, but it should be pretty easy
to follow.
You can find my test program with this algorithm and others at [4].
Some of the implementations included are quicksort variants, so the
"qsort-killer" testcase will trigger quadratic behavior in them. If you
want to run this you should consider reducing the maximum input size in
testcases.h, disabling the qsort-killer input at the bottom of
testcases.c, or disabling the affected sort algorithms ("freebsd",
"glibc quicksort", and depending on your libc, "system") in sorters.c.
Here are the numbers comparing musl's current smoothsort with the
attached grailsort code for various input patterns and sizes. The test
was run on x86_64, compiled with gcc 4.8.3 at -Os:
                    random            sorted            reverse           constant         sorted+noise     reverse+noise     qsort-killer    elements
                 compares     ms    compares     ms    compares     ms    compares     ms    compares    ms    compares    ms    compares    ms
musl smoothsort    327818     11       19976      0      268152      8       19976      0      142608     5      289637     8      139090     4     10000
                  4352267     97      199971      2     3327332     59      199971      2     2479200    45     3826143    68     1803634    37    100000
                 54441776    945     1999963     29    40048748    663     1999963     27    32506944   577    47405848   798    21830972   411   1000000
                652805234  12300    19999960    289   465600753   7505    19999960    293   402201458  6891   572755136  9484   259691645  4741  10000000

grailsort          161696      2       71024      0       41110      0       28004      0      143195     1      125943     1       89027     0     10000
                  1993908     27      753996      2      412840      5      270727      3     1818802    15     1640569    17     1064759    10    100000
                 23428984    330     7686249     27     4177007     74     2729965     41    21581351   192    19909192   211    12325132   119   1000000
                266520949   3884    75927601    277    42751315    901    28243939    436   248048604  2343   232357446  2575   139177031  1368  10000000
As far as code size goes, here are the before and after sizes reported
by size(1) for bsearch.o, qsort.o, and a minimal statically linked
program using qsort:
             before  after
bsearch.o       116    160
qsort.o        1550   1242
statictest     2950   2819
At -O2, the before and after sizes show the same basic pattern. At -O3,
gcc performs more aggressive inlining on the grailsort code, and it
balloons to more than twice the size of musl's current code.
For the sake of soft-float targets, I'd also like to look at replacing
the call to sqrt with an integer approximation. But before I go much
further, I'd like to post the code here and get some feedback.
1. https://github.com/BonzaiThePenguin/WikiSort
2. https://github.com/Mrrl/GrailSort
3. http://akira.ruc.dk/~keld/teaching/algoritmedesign_f04/Artikler/04/Huang88.pdf
4. http://git.koorogi.info/cgit/qsort_compare/
--
Bobby Bingham
[-- Attachment #1.2: bsearch.c --]
[-- Type: text/x-c, Size: 697 bytes --]
#include <stdlib.h>
#include <limits.h>

size_t __bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
{
	size_t baseidx = 0, tryidx;
	void *try;
	int sign;

	while (nel > 0) {
		tryidx = baseidx + nel/2;
		try = (char*)base + tryidx*width;
		sign = cmp(key, try);
		if (!sign) return tryidx;
		else if (sign < 0)
			nel /= 2;
		else {
			baseidx = tryidx + 1;
			nel -= nel/2 + 1;
		}
	}

	return ~baseidx;
}

void *bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
{
	size_t idx = __bsearch(key, base, nel, width, cmp);
	return idx > SSIZE_MAX ? NULL : (char*)base + idx*width;
}
[-- Attachment #1.3: qsort.c --]
[-- Type: text/x-c, Size: 4853 bytes --]
#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <limits.h>

typedef int (*cmpfun)(const void *, const void *);

size_t __bsearch(const void *, const void *, size_t, size_t, cmpfun);

/* position of a matching element, or where key would be inserted */
static inline size_t bsearch_pos(const void *key, const void *haystack, size_t nmel, size_t width, cmpfun cmp)
{
	size_t pos = __bsearch(key, haystack, nmel, width, cmp);
	return pos > SSIZE_MAX ? ~pos : pos;
}

static inline void *tail(char *base, size_t nmel, size_t width)
{
	return base + (nmel-1) * width;
}

static void swap(char *a, char *b, size_t width)
{
#ifdef __GNUC__
	typedef uint32_t __attribute__((__may_alias__)) u32;
	if ((uintptr_t)a % 4 == 0 && (uintptr_t)b % 4 == 0) {
		for (; width >= 4; width -= 4) {
			uint32_t tmp = *((u32*)a);
			*((u32*)a) = *((u32*)b);
			*((u32*)b) = tmp;
			a += 4;
			b += 4;
		}
	}
#endif
	while (width--) {
		char tmp = *a;
		*a++ = *b;
		*b++ = tmp;
	}
}

/* rotate the size-byte region at base left by shift bytes */
static void rotate(char *base, size_t size, size_t shift)
{
	int dir = 1;
	while (shift) {
		while (2*shift <= size) {
			swap(base, base + dir*shift, shift);
			size -= shift;
			base += shift*dir;
		}
		shift = size - shift;
		base = dir > 0 ? base + size - shift : base - shift;
		dir *= -1;
	}
}

/* insert the sorted buffer elements at base into the sortnmel sorted elements that follow */
static void distribute_buffer(char *base, size_t bufnmel, size_t sortnmel, size_t width, cmpfun cmp)
{
	while (bufnmel) {
		char *sorted = base + bufnmel * width;
		size_t insertpos = bsearch_pos(base, sorted, sortnmel, width, cmp);

		if (insertpos > 0)
			rotate(base, (bufnmel + insertpos) * width, bufnmel * width);

		base += (insertpos + 1) * width;
		bufnmel -= 1;
		sortnmel -= insertpos;
	}
}

#define MAX_SORTNET 8
static const uint8_t sortnet[][2] = {
	/* 0: index = 0 */
	/* 1: index = 0 */
	/* 2: index = 0 */
	{0,1},
	/* 3: index = 1 */
	{0,1}, {0,2}, {1,2},
	/* 4: index = 4 */
	{0,1}, {2,3}, {0,2}, {1,3}, {1,2},
	/* 5: index = 9 */
	{0,1}, {3,4}, {2,4}, {2,3}, {1,4}, {0,3}, {0,2}, {1,3}, {1,2},
	/* 6: index = 18 */
	{1,2}, {4,5}, {0,2}, {3,5}, {0,1}, {3,4}, {2,5}, {0,3}, {1,4}, {2,4},
	{1,3}, {2,3},
	/* 7: index = 30 */
	{1,2}, {3,4}, {5,6}, {0,2}, {3,5}, {4,6}, {0,1}, {4,5}, {2,6}, {0,4},
	{1,5}, {0,3}, {2,5}, {1,3}, {2,4}, {2,3},
	/* 8: index = 46 */
	{0,1}, {2,3}, {4,5}, {6,7}, {0,2}, {1,3}, {4,6}, {5,7}, {1,2}, {5,6},
	{0,4}, {3,7}, {1,5}, {2,6}, {1,4}, {3,6}, {2,4}, {3,5}, {3,4},
	/* 9: index = 65 */
};
static const uint8_t sortnet_index[] = { 0, 0, 0, 1, 4, 9, 18, 30, 46, 65 };

static void sorting_network(char *base, size_t nmel, size_t width, cmpfun cmp)
{
	for (int i = sortnet_index[nmel]; i < sortnet_index[nmel+1]; i++) {
		char *elem1 = base + sortnet[i][0] * width;
		char *elem2 = base + sortnet[i][1] * width;
		if (cmp(elem1, elem2) > 0) swap(elem1, elem2, width);
	}
}

/* get index of last block whose head is less than the previous block's tail */
static size_t last_overlap(char *base, size_t bcount, size_t bwidth, size_t width, cmpfun cmp)
{
	for (char *cur = tail(base, bcount, bwidth); --bcount; cur -= bwidth)
		if (cmp(cur - width, cur) > 0) break;
	return bcount;
}

/* merge the sorted run of anmel elements at base with the bnmel sorted
 * elements that follow it, using buf as swap space */
void merge(char *buf, char *base, size_t anmel, size_t bnmel, size_t width, cmpfun cmp)
{
	char *a = buf;
	char *b = base + anmel * width;

	/* skip leading elements of the first run that are already in place */
	size_t skip = bsearch_pos(b, base, anmel, width, cmp);
	anmel -= skip;
	base += skip * width;

	swap(base, a, anmel * width);

	while (anmel && bnmel) {
		if (cmp(a, b) <= 0) { swap(base, a, width); a += width; anmel--; }
		else { swap(base, b, width); b += width; bnmel--; }
		base += width;
	}

	swap(base, a, anmel * width);
}

void qsort(void *unsorted, size_t nmel, size_t width, cmpfun cmp)
{
	char *base = unsorted;

	if (nmel <= MAX_SORTNET) {
		sorting_network(base, nmel, width, cmp);
		return;
	}

	size_t blknmel = sqrt(nmel);               /* elements in a block */
	size_t bufnmel = blknmel + nmel % blknmel; /* elements in the buffer */
	size_t bwidth = blknmel * width;           /* size of a block in bytes */
	size_t blocks = nmel / blknmel - 1;        /* number of blocks in a + b */
	size_t acount = blocks / 2;
	size_t bcount = blocks - acount;

	char *a = base + bufnmel * width;
	char *b = a + acount * bwidth;

	qsort(a, acount * blknmel, width, cmp);
	qsort(b, bcount * blknmel, width, cmp);

	/* if already sorted, nothing to do */
	if (cmp(tail(a, acount * blknmel, width), b) <= 0)
		goto distribute;

	/* sort all the a and b blocks together by their head elements */
	qsort(a, blocks, bwidth, cmp);

	/* merge, starting from the end and working towards the beginning */
	while (blocks > 1) {
		size_t overlap = last_overlap(a, blocks, bwidth, width, cmp);
		if (overlap > 0)
			merge(base, tail(a, overlap, bwidth), blknmel, (blocks - overlap) * blknmel, width, cmp);
		blocks = overlap;
	}

distribute:
	qsort(base, bufnmel, width, cmp);
	distribute_buffer(base, bufnmel, nmel - bufnmel, width, cmp);
}
* Re: [RFC] new qsort implementation
2014-09-01 7:12 [RFC] new qsort implementation Bobby Bingham
@ 2014-09-01 11:17 ` Szabolcs Nagy
2014-09-01 18:20 ` Bobby Bingham
2014-09-01 11:25 ` Alexander Monakov
2023-02-17 15:51 ` [musl] " Rich Felker
2 siblings, 1 reply; 9+ messages in thread
From: Szabolcs Nagy @ 2014-09-01 11:17 UTC (permalink / raw)
To: musl
* Bobby Bingham <koorogi@koorogi.info> [2014-09-01 02:12:43 -0500]:
> You can find my test program with this algorithm and others at [4].
> Some of the implementations included are quicksort variants, so the
> "qsort-killer" testcase will trigger quadratic behavior in them. If you
> want to run this you should consider reducing the maximum input size in
> testcases.h, disabling the qsort-killer input at the bottom of
> testcases.c, or disabling the affected sort algorithms ("freebsd",
> "glibc quicksort", and depending on your libc, "system") in sorters.c.
(i had a few build errors: musl_heapsort and musl_smoothsort were not
declared in sorters.h and glibc needs -lrt for clock_gettime)
smooth sort is best for almost sorted lists when only a few elements
are out of order (swap some random elements in a sorted array), this
is common in practice so you should test this case too
the noise case should use much less noise imho (so you test when
only local rearrangements are needed: buffer[i] += random()%small)
another common case is sorting two concatenated sorted arrays
(merge sort should do better in this case)
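for illustration, rough sketches of such input generators (the function
names and the int-buffer interface here are made up, not taken from the
actual test harness):

#include <stdlib.h>

/* almost sorted: swap a few random pairs in a sorted array */
static void gen_nearly_sorted(int *buf, size_t n)
{
	for (size_t i = 0; i < n; i++) buf[i] = i;
	for (size_t k = 0; k < n/100 + 1; k++) {
		size_t i = rand() % n, j = rand() % n;
		int tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
	}
}

/* sorted with small local noise: only local rearrangements needed */
static void gen_local_noise(int *buf, size_t n)
{
	for (size_t i = 0; i < n; i++) buf[i] = i + rand() % 8;
}

/* two concatenated sorted runs */
static void gen_concat_sorted(int *buf, size_t n)
{
	for (size_t i = 0; i < n/2; i++) buf[i] = 2*i;
	for (size_t i = n/2; i < n; i++) buf[i] = 2*(i - n/2) + 1;
}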
it would be nice to have a benchmark that is based on common
qsort usage cases
the qsort-killer test is not very interesting for algorithms other
than quicksort (where it is the worst-case), but it would be nice
to analyze the worst cases for smoothsort and grailsort
(they are both O(n logn) so nothing spectacular is expected but it
would be interesting to see how they compare against the theoretical
optimum: ceil(lgamma(n)/log(2)) compares)
> Here are the numbers comparing musl's current smoothsort with the
> attached grailsort code for various input patterns and sizes. The test
> was run on x86_64, compiled with gcc 4.8.3 at -Os:
>
>                    sorted             reverse            constant
>                  compares    ms     compares     ms    compares    ms
> musl smoothsort     19976     0       268152      8       19976     0
>                    199971     2      3327332     59      199971     2
>                   1999963    29     40048748    663     1999963    27
>                  19999960   289    465600753   7505    19999960   293
>
> grailsort           71024     0        41110      0       28004     0
>                    753996     2       412840      5      270727     3
>                   7686249    27      4177007     74     2729965    41
>                  75927601   277     42751315    901    28243939   436
>
interesting that the sorted case is faster with much more compares
here on i386 smoothsort is faster
                   sorted             reverse            constant
                 compares    ms     compares     ms    compares    ms
musl smoothsort     19976     0       268152      7       19976     1
                   199971     8      3327332    103      199971    15
                  1999963   105     40048748   1151     1999963   103
                 19999960  1087    465600753  13339    19999960  1103

grailsort           71024     1        41110      3       28004     3
                   753996    20       412840     23      270727    23
                  7686249   151      4177007    370     2729965   224
                 75927601  1438     42751315   4507    28243939  2353
> #include <stdlib.h>
> #include <limits.h>
>
> size_t __bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> {
> 	size_t baseidx = 0, tryidx;
> 	void *try;
> 	int sign;
>
> 	while (nel > 0) {
> 		tryidx = baseidx + nel/2;
> 		try = (char*)base + tryidx*width;
> 		sign = cmp(key, try);
> 		if (!sign) return tryidx;
> 		else if (sign < 0)
> 			nel /= 2;
> 		else {
> 			baseidx = tryidx + 1;
> 			nel -= nel/2 + 1;
> 		}
> 	}
>
> 	return ~baseidx;
> }
>
> void *bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> {
> 	size_t idx = __bsearch(key, base, nel, width, cmp);
> 	return idx > SSIZE_MAX ? NULL : (char*)base + idx*width;
> }
musl does not malloc >=SSIZE_MAX memory, but mmap can so baseidx
may be >0x7fffffff on a 32bit system
i'm not sure if current qsort handles this case
* Re: [RFC] new qsort implementation
2014-09-01 7:12 [RFC] new qsort implementation Bobby Bingham
2014-09-01 11:17 ` Szabolcs Nagy
@ 2014-09-01 11:25 ` Alexander Monakov
2014-09-01 18:27 ` Bobby Bingham
2023-02-17 15:51 ` [musl] " Rich Felker
2 siblings, 1 reply; 9+ messages in thread
From: Alexander Monakov @ 2014-09-01 11:25 UTC (permalink / raw)
To: musl
Hi,
It seems you forgot to commit changes to sorter.h in your repo.
Comparing musl-heapsort to musl-smoothsort, the former appears significantly
better than the latter except on "sorted" input, and even then it's not 20x
faster like the musl commit adding smoothsort claims (about 6.6x for me). It
does reduce the number of comparisons by 20x there, as the commit says.
There is variation in how the divide-and-conquer algorithms in your test
handle sorting at the lowest level; for instance, grailsort_ref uses
insertion sort and your implementations use a sorting network (is that
correct?). Would your comparison be more apples-to-apples if all
compared approaches used the same sorter at the last level, where
appropriate (assuming sorting networks improve performance in some
cases)?
Why did you choose to use sorting networks in your implementations?
For wikisort and grailsort, the "ref" variants perform about 2x faster
on some tests for me. Is that due to the last-level sorter choice, or
are there other significant differences?
Thanks.
Alexander
* Re: [RFC] new qsort implementation
2014-09-01 11:17 ` Szabolcs Nagy
@ 2014-09-01 18:20 ` Bobby Bingham
2014-09-01 20:53 ` Rich Felker
0 siblings, 1 reply; 9+ messages in thread
From: Bobby Bingham @ 2014-09-01 18:20 UTC (permalink / raw)
To: musl
[-- Attachment #1: Type: text/plain, Size: 5112 bytes --]
On Mon, Sep 01, 2014 at 01:17:48PM +0200, Szabolcs Nagy wrote:
> (i had a few build errors: musl_heapsort and musl_smoothsort were not
> declared in sorters.h and glibc needs -lrt for clock_gettime)
Fixed.
>
> smooth sort is best for almost sorted lists when only a few elements
> are out of order (swap some random elements in a sorted array), this
> is common in practice so you should test this case too
>
> the noise case should use much less noise imho (so you test when
> only local rearrangements are needed: buffer[i] += random()%small)
I've reduced the amount of noise locally. I haven't pushed this change
yet, because when I tried it I found a bug in my grailsort
implementation that I'd like to fix first.
I had taken the merging code I'd written for my wikisort implementation
and tried to adapt it to grailsort, but it looks like I missed a case.
If some (but not all) blocks contain elements with a uniform value, it
is possible for the block merge step to produce the wrong results.
I'm looking at how best to fix this, but the best solution may be to
bring my merge step more in line with that described by the paper.
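If it helps for a regression test, something along these lines should
exercise that class of input (a hypothetical generator, not from the
test suite): an array where part of the values are one repeated
constant and the rest are distinct, so that after the recursive sorts
some, but not all, of the sqrt(n)-sized blocks are uniform.

#include <stdlib.h>

static void gen_partly_constant(int *buf, size_t n)
{
	for (size_t i = 0; i < n; i++)
		buf[i] = (i % 4 == 0) ? 42 : (int)(rand() % 1000);
}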
>
> another common case is sorting two concatenated sorted arrays
> (merge sort should do better in this case)
>
Added.
> it would be nice to have a benchmark that is based on common
> qsort usage cases
>
> the qsort-killer test is not very interesting for algorithms other
> than quicksort (where it is the worst-case), but it would be nice
> to analyze the worst cases for smoothsort and grailsort
> (they are both O(n logn) so nothing spectacular is expected but it
> would be interesting to see how they compare against the theoretical
> optimum: ceil(lgamma(n)/log(2)) compares)
Once I fix the code, I'll work on this analysis.
>
> > Here are the numbers comparing musl's current smoothsort with the
> > attached grailsort code for various input patterns and sizes. The test
> > was run on x86_64, compiled with gcc 4.8.3 at -Os:
> >
> >                    sorted             reverse            constant
> >                  compares    ms     compares     ms    compares    ms
> > musl smoothsort     19976     0       268152      8       19976     0
> >                    199971     2      3327332     59      199971     2
> >                   1999963    29     40048748    663     1999963    27
> >                  19999960   289    465600753   7505    19999960   293
> >
> > grailsort           71024     0        41110      0       28004     0
> >                    753996     2       412840      5      270727     3
> >                   7686249    27      4177007     74     2729965    41
> >                  75927601   277     42751315    901    28243939   436
> >
>
> interesting that the sorted case is faster with much more compares
> here on i386 smoothsort is faster
>
>                    sorted             reverse            constant
>                  compares    ms     compares     ms    compares    ms
> musl smoothsort     19976     0       268152      7       19976     1
>                    199971     8      3327332    103      199971    15
>                   1999963   105     40048748   1151     1999963   103
>                  19999960  1087    465600753  13339    19999960  1103
>
> grailsort           71024     1        41110      3       28004     3
>                    753996    20       412840     23      270727    23
>                   7686249   151      4177007    370     2729965   224
>                  75927601  1438     42751315   4507    28243939  2353
>
Interesting. When I saw that grailsort was faster even with more
comparisons on my machine, I had attributed it to my swap possibly being
faster. But I don't see why this wouldn't also be the case on i386, so
maybe something else is going on.
> > #include <stdlib.h>
> > #include <limits.h>
> >
> > size_t __bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> > {
> > 	size_t baseidx = 0, tryidx;
> > 	void *try;
> > 	int sign;
> >
> > 	while (nel > 0) {
> > 		tryidx = baseidx + nel/2;
> > 		try = (char*)base + tryidx*width;
> > 		sign = cmp(key, try);
> > 		if (!sign) return tryidx;
> > 		else if (sign < 0)
> > 			nel /= 2;
> > 		else {
> > 			baseidx = tryidx + 1;
> > 			nel -= nel/2 + 1;
> > 		}
> > 	}
> >
> > 	return ~baseidx;
> > }
> >
> > void *bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> > {
> > 	size_t idx = __bsearch(key, base, nel, width, cmp);
> > 	return idx > SSIZE_MAX ? NULL : (char*)base + idx*width;
> > }
>
> musl does not malloc >=SSIZE_MAX memory, but mmap can so baseidx
> may be >0x7fffffff on a 32bit system
>
> i'm not sure if current qsort handles this case
I thought I recalled hearing that SSIZE_MAX was the upper bound on all
object sizes in musl, but if we still allow larger mmaps than that, I
guess not. I'll find a different approach when I send the next version
of the code.
--
Bobby Bingham
* Re: [RFC] new qsort implementation
2014-09-01 11:25 ` Alexander Monakov
@ 2014-09-01 18:27 ` Bobby Bingham
0 siblings, 0 replies; 9+ messages in thread
From: Bobby Bingham @ 2014-09-01 18:27 UTC (permalink / raw)
To: musl
[-- Attachment #1: Type: text/plain, Size: 2228 bytes --]
On Mon, Sep 01, 2014 at 03:25:18PM +0400, Alexander Monakov wrote:
> Hi,
>
> It seems you forgot to commit changes to sorter.h in your repo.
Yes, that's fixed now.
>
> Comparing musl-heapsort to musl-smoothsort, the former appears significantly
> better than the latter except on "sorted" input, and even then it's not 20x
> faster like the musl commit adding smoothsort claims (about 6.6x for me). It
> does reduce the number of comparisons by 20x there, as the commit says.
There is one difference between the musl heapsort in my repo and what
was used in musl, and that's the swap function. The one in musl worked
by doing three memcpys in a loop through a 256-byte temporary buffer.
When I added it to my test program, I made it use the swap function I'd
already written for grailsort/wikisort, which essentially inlines the
same concept. That could explain the speed discrepancy.
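For reference, a rough sketch of that kind of memcpy-based swap (an
approximation for illustration, not musl's exact code):

#include <string.h>

static void swap_via_memcpy(char *a, char *b, size_t width)
{
	char tmp[256];
	while (width) {
		size_t chunk = width < sizeof tmp ? width : sizeof tmp;
		memcpy(tmp, a, chunk);  /* three memcpys per chunk, through */
		memcpy(a, b, chunk);    /* a fixed-size temporary buffer    */
		memcpy(b, tmp, chunk);
		a += chunk;
		b += chunk;
		width -= chunk;
	}
}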
>
> There is variation on how divide-and-conquer algorithms in your test handle
> sorting on the lowest level; for instance grailsort_ref uses insertion sort
> and your implementations use a sorting network (is that correct?). Would your
Correct.
> comparison be more apples-to-apples if all compared approaches used the same
> sorter on the last level, if appropriate (assuming sorting networks improve
> performance in some cases)?
>
> Why did you choose to use sorting networks in your implementations?
Primarily because I've heard Rich say on IRC a few times now that
sorting networks are a better choice than insertion sort for small
sizes, and I trust his opinion on this sort of thing.
It would be interesting to do a more apples to apples comparison.
>
> For wikisort and grailsort, their "ref" variants perform about 2x faster
> on some tests for me . Is that due to last-level sorter choice, or are there
> other significant differences?
TBH, I haven't spent as much time as I should deciphering everything
that's going on in the reference implementations. I suspect that a
large part of their complexity comes from optimizing for all sorts of
different cases, and even if it does account for a 2x speedup, I don't
think we'd want to introduce that much bloat in musl.
>
> Thanks.
>
> Alexander
>
--
Bobby Bingham
* Re: [RFC] new qsort implementation
2014-09-01 18:20 ` Bobby Bingham
@ 2014-09-01 20:53 ` Rich Felker
2014-09-01 21:46 ` Bobby Bingham
0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2014-09-01 20:53 UTC (permalink / raw)
To: musl
On Mon, Sep 01, 2014 at 01:20:05PM -0500, Bobby Bingham wrote:
> > > Here are the numbers comparing musl's current smoothsort with the
> > > attached grailsort code for various input patterns and sizes. The test
> > > was run on x86_64, compiled with gcc 4.8.3 at -Os:
> > >
> > >                    sorted             reverse            constant
> > >                  compares    ms     compares     ms    compares    ms
> > > musl smoothsort     19976     0       268152      8       19976     0
> > >                    199971     2      3327332     59      199971     2
> > >                   1999963    29     40048748    663     1999963    27
> > >                  19999960   289    465600753   7505    19999960   293
> > >
> > > grailsort           71024     0        41110      0       28004     0
> > >                    753996     2       412840      5      270727     3
> > >                   7686249    27      4177007     74     2729965    41
> > >                  75927601   277     42751315    901    28243939   436
> > >
> >
> > interesting that the sorted case is faster with much more compares
> > here on i386 smoothsort is faster
> >
> >                    sorted             reverse            constant
> >                  compares    ms     compares     ms    compares    ms
> > musl smoothsort     19976     0       268152      7       19976     1
> >                    199971     8      3327332    103      199971    15
> >                   1999963   105     40048748   1151     1999963   103
> >                  19999960  1087    465600753  13339    19999960  1103
> >
> > grailsort           71024     1        41110      3       28004     3
> >                    753996    20       412840     23      270727    23
> >                   7686249   151      4177007    370     2729965   224
> >                  75927601  1438     42751315   4507    28243939  2353
> >
>
> Interesting. When I saw that grailsort was faster even with more
> comparisons on my machine, I had attributed it to my swap possibly being
> faster. But I don't see why this wouldn't also be the case on i386, so
> maybe something else is going on.
I think it makes sense to test with two different types of cases:
expensive comparisons (costly compare function) and expensive swaps
(large array elements).
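For example, hypothetical test definitions for those two cases (names
and sizes are made up, just to illustrate):

#include <string.h>

/* expensive swaps: a large array element with a small key */
struct bigelem {
	int key;
	char payload[252];
};

/* expensive comparison: elements are pointers to strings */
static int cmp_strings(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}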
> > > #include <stdlib.h>
> > > #include <limits.h>
> > >
> > > size_t __bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> > > {
> > > 	size_t baseidx = 0, tryidx;
> > > 	void *try;
> > > 	int sign;
> > >
> > > 	while (nel > 0) {
> > > 		tryidx = baseidx + nel/2;
> > > 		try = (char*)base + tryidx*width;
> > > 		sign = cmp(key, try);
> > > 		if (!sign) return tryidx;
> > > 		else if (sign < 0)
> > > 			nel /= 2;
> > > 		else {
> > > 			baseidx = tryidx + 1;
> > > 			nel -= nel/2 + 1;
> > > 		}
> > > 	}
> > >
> > > 	return ~baseidx;
> > > }
> > >
> > > void *bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> > > {
> > > 	size_t idx = __bsearch(key, base, nel, width, cmp);
> > > 	return idx > SSIZE_MAX ? NULL : (char*)base + idx*width;
> > > }
> >
> > musl does not malloc >=SSIZE_MAX memory, but mmap can so baseidx
> > may be >0x7fffffff on a 32bit system
> >
> > i'm not sure if current qsort handles this case
>
> I thought I recalled hearing that SSIZE_MAX was the upper bound on all
> object sizes in musl, but if we still allow larger mmaps than that, I
> guess not. I'll find a different approach when I send the next version
> of the code.
You are correct and nsz is mistaken on this. musl does not permit any
object size larger than SSIZE_MAX. mmap and malloc both enforce this.
But I'm not sure why you've written bsearch to need this assumption.
The bsearch in musl gets by fine without it.
Rich
* Re: [RFC] new qsort implementation
2014-09-01 20:53 ` Rich Felker
@ 2014-09-01 21:46 ` Bobby Bingham
0 siblings, 0 replies; 9+ messages in thread
From: Bobby Bingham @ 2014-09-01 21:46 UTC (permalink / raw)
To: musl
[-- Attachment #1: Type: text/plain, Size: 912 bytes --]
On Mon, Sep 01, 2014 at 04:53:42PM -0400, Rich Felker wrote:
> > I thought I recalled hearing that SSIZE_MAX was the upper bound on all
> > object sizes in musl, but if we still allow larger mmaps than that, I
> > guess not. I'll find a different approach when I send the next version
> > of the code.
>
> You are correct and nsz is mistaken on this. musl does not permit any
> object size larger than SSIZE_MAX. mmap and malloc both enforce this.
> But I'm not sure why you've written bsearch to need this assumption.
> The bsearch in musl gets by fine without it.
The point of this was that bsearch returns NULL if the element is not
found, but I need to know where the element would have been had it been
present. The bitwise complement is used so that bsearch, written as a
wrapper around __bsearch, can tell whether the element was present or
not without requiring an extra comparison.
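For example, a hypothetical caller that wants the position either way
could decode the return value like this (illustrative only, not part of
the patch):

#include <limits.h>
#include <stddef.h>

typedef int (*cmpfun)(const void *, const void *);
size_t __bsearch(const void *, const void *, size_t, size_t, cmpfun);

/* returns 1 and the match index if key is present,
   else 0 and the insertion position */
static int search_pos(const void *key, const void *base, size_t nel,
                      size_t width, cmpfun cmp, size_t *pos)
{
	size_t r = __bsearch(key, base, nel, width, cmp);
	if (r <= SSIZE_MAX) { *pos = r; return 1; }  /* found */
	*pos = ~r;                      /* not found: insertion point */
	return 0;
}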
--
Bobby Bingham
* Re: [musl] [RFC] new qsort implementation
2014-09-01 7:12 [RFC] new qsort implementation Bobby Bingham
2014-09-01 11:17 ` Szabolcs Nagy
2014-09-01 11:25 ` Alexander Monakov
@ 2023-02-17 15:51 ` Rich Felker
2023-02-17 22:53 ` Rich Felker
2 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2023-02-17 15:51 UTC (permalink / raw)
To: Bobby Bingham; +Cc: musl
On Mon, Sep 01, 2014 at 02:12:43AM -0500, Bobby Bingham wrote:
> Hi all,
>
> As I mentioned a while back on IRC, I've been looking into wikisort[1]
> and grailsort[2] to see if either of them would be a good candidate for
> use as musl's qsort.
I'd forgotten about this until Alexander Monakov mentioned it in the
context of the present qsort performance thread (introduced with
Message-ID: <CAAEi2GextYuWRK-JKtpCLxewyJ2u380m5+s=M_0P=ZBDxyX-xA@mail.gmail.com>)
and I think it's probably worth pursuing.
Apparently there was a bug with constant blocks that would need to be
addressed.
Since then, we've also added qsort_r. Between this changing the
interface needs of the bsearch_pos function and the general avoidance
in musl of special cross-component internal interfaces, I suspect it
would make sense to just duplicate the bsearch code with the suitable
interface as a static function in qsort.c.
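A rough sketch of what such a static helper might look like with a
qsort_r-style three-argument comparison function (the name and exact
signature here are only illustrative):

#include <stddef.h>

typedef int (*cmpfun_r)(const void *, const void *, void *);

/* hypothetical static copy of the bsearch logic for qsort.c: returns
   the index of a matching element, or the insertion position if none
   matches (the ~pos encoding isn't needed when only a position is
   wanted) */
static size_t bsearch_pos_r(const void *key, const void *base, size_t nel,
                            size_t width, cmpfun_r cmp, void *arg)
{
	size_t baseidx = 0, tryidx;
	const char *try;
	int sign;

	while (nel > 0) {
		tryidx = baseidx + nel/2;
		try = (const char *)base + tryidx*width;
		sign = cmp(key, try, arg);
		if (!sign) return tryidx;
		if (sign < 0)
			nel /= 2;
		else {
			baseidx = tryidx + 1;
			nel -= nel/2 + 1;
		}
	}

	return baseidx;
}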
On performance, it looks like this already has something of an inlined
element swap, which may be a big part of the difference. So to
evaluate the degree to which it might be better, we really need to do
the same with the existing smoothsort code and compare them apples to
apples.
Regarding the sqrt, nowadays musl's sqrt is basically all integer code
except on targets with a native sqrt instruction, so it's probably not
catastrophic to performance on softfloat archs. However, it is a
little bit sketchy with respect to determinism since it's affected by
(and in turn alters) the floating point environment (rounding mode,
exception flags). I assume only an approximation within some
reasonable bounds (to ensure the desired big-O) is needed, so it
probably makes sense to do a fast integer approximation. Maybe even
something that's essentially just "wordsize minus clz, >>1, 1<<that,
multiply by 1.4 if shifted-out bit was 1" would suffice.
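One way to read that recipe, as a sketch (GCC builtins assumed; the 1.4
factor is approximated as 3/2 here, and this isn't tested against the
precision qsort actually needs):

#include <limits.h>

/* crude approximation of sqrt(n), good to within roughly a factor of 1.5 */
static unsigned long sqrt_approx(unsigned long n)
{
	if (n < 2) return n;
	int bits = sizeof n * CHAR_BIT - __builtin_clzl(n); /* wordsize minus clz */
	unsigned long r = 1UL << (bits >> 1);               /* >>1, 1<<that */
	if (bits & 1)
		r += r >> 1;  /* shifted-out bit was 1: scale by ~1.4 (3/2 here) */
	return r;
}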
Rich
* Re: [musl] [RFC] new qsort implementation
2023-02-17 15:51 ` [musl] " Rich Felker
@ 2023-02-17 22:53 ` Rich Felker
0 siblings, 0 replies; 9+ messages in thread
From: Rich Felker @ 2023-02-17 22:53 UTC (permalink / raw)
To: Bobby Bingham; +Cc: musl
On Fri, Feb 17, 2023 at 10:51:14AM -0500, Rich Felker wrote:
> Regarding the sqrt, nowadays musl's sqrt is basically all integer code
> except on targets with a native sqrt instruction, so it's probably not
> catastrophic to performance on softfloat archs. However, it is a
> little bit sketchy with respect to determinism since it's affected by
> (and in turn alters) the floating point environment (rounding mode,
> exception flags). I assume only an approximation within some
> reasonable bounds (to ensure the desired big-O) is needed, so it
> probably makes sense to do a fast integer approximation. Maybe even
> something that's essentially just "wordsize minus clz, >>1, 1<<that,
> multiply by 1.4 if shifted-out bit was 1" would suffice.
Indeed, an approach like this (which is just taking the first-order
approximation of sqrt centered at the next-lower power of two) seems
to have a max error of around 6.7%, and the cost is essentially just
one clz and one divide. If you're happy with up to 25% error, using
the next-lower *even* power of two has a cost that's essentially just
one clz. I think either could be made significantly better by using
the nearest (resp. nearest-even) power of two rather than always going
down, without any significant addition of costly operations.
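As a sketch, the cheaper variant (next-lower even power of two, up to
about 25% overestimate) might look roughly like this; GCC builtins
assumed, illustrative only:

#include <limits.h>

/* first-order expansion of sqrt around the next-lower even power of two:
   sqrt(n) ~= 2^k + (n - 4^k) / 2^(k+1), where 4^k <= n < 4^(k+1);
   uses only clz and shifts */
static unsigned long sqrt_approx_even(unsigned long n)
{
	if (n < 2) return n;
	int bits = sizeof n * CHAR_BIT - __builtin_clzl(n);
	int k = (bits - 1) / 2;
	unsigned long a = 1UL << (2*k);
	return (1UL << k) + ((n - a) >> (k + 1));
}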
Rich