[RFC] new qsort implementation

mailing list of musl libc
 help / color / mirror / code / Atom feed

* [RFC] new qsort implementation
@ 2014-09-01  7:12 Bobby Bingham
  2014-09-01 11:17 ` Szabolcs Nagy
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Bobby Bingham @ 2014-09-01  7:12 UTC (permalink / raw)
  To: musl


[-- Attachment #1.1: Type: text/plain, Size: 4034 bytes --]

Hi all,

As I mentioned a while back on IRC, I've been looking into wikisort[1]
and grailsort[2] to see if either of them would be a good candidate for
use as musl's qsort.

The C-language reference implementations of both of these algorithms are
inappropriate, as they're both very large, not type-agnostic, and are
not written in idiomatic C.  Some of the complexity of both comes from
the fact that they are stable sorts, which qsort is not required to be.

Attached is an implementation of qsort based on a predecessor of the
paper that grailsort is based on, which describes an unstable block
merge sort.  The paper is available at [3].  My algorithm deviates a
little bit from what the paper describes, but it should be pretty easy
to follow.

You can find my test program with this algorithm and others at [4].
Some of the implementations included are quicksort variants, so the
"qsort-killer" testcase will trigger quadratic behavior in them.  If you
want to run this you should consider reducing the maximum input size in
testcases.h, disabling the qsort-killer input at the bottom of
testcases.c, or disabling the affected sort algorithms ("freebsd",
"glibc quicksort", and depending on your libc, "system") in sorters.c.

Here are the numbers comparing musl's current smoothsort with the
attached grailsort code for various input patterns and sizes.  The test
was run on x86_64, compiled with gcc 4.8.3 at -Os:

                              random              sorted             reverse            constant        sorted+noise       reverse+noise        qsort-killer    elements
                     compares     ms     compares     ms     compares     ms     compares     ms     compares     ms     compares     ms     compares     ms
musl smoothsort        327818     11        19976      0       268152      8        19976      0       142608      5       289637      8       139090      4       10000
                      4352267     97       199971      2      3327332     59       199971      2      2479200     45      3826143     68      1803634     37      100000
                     54441776    945      1999963     29     40048748    663      1999963     27     32506944    577     47405848    798     21830972    411     1000000
                    652805234  12300     19999960    289    465600753   7505     19999960    293    402201458   6891    572755136   9484    259691645   4741    10000000

grailsort              161696      2        71024      0        41110      0        28004      0       143195      1       125943      1        89027      0       10000
                      1993908     27       753996      2       412840      5       270727      3      1818802     15      1640569     17      1064759     10      100000
                     23428984    330      7686249     27      4177007     74      2729965     41     21581351    192     19909192    211     12325132    119     1000000
                    266520949   3884     75927601    277     42751315    901     28243939    436    248048604   2343    232357446   2575    139177031   1368    10000000

As far as code size, here are before and after sizes as reported by
size(1) for bsearch.o, qsort.o, and a minimal statically linked program
using qsort:

                before      after
    bsearch.o      116        160
    qsort.o       1550       1242
    statictest    2950       2819

At -O2, the before and after sizes show the same basic pattern.  At -O3,
gcc performs more aggressive inlining on the grailsort code, and it
balloons to more than twice the size of musl's current code.

For the sake of soft-float targets, I'd also like to look at replacing
the call to sqrt with an integer approximation.  But before I go much
further, I'd like to post the code here and get some feedback.

1. https://github.com/BonzaiThePenguin/WikiSort
2. https://github.com/Mrrl/GrailSort
3. http://akira.ruc.dk/~keld/teaching/algoritmedesign_f04/Artikler/04/Huang88.pdf
4. http://git.koorogi.info/cgit/qsort_compare/

--
Bobby Bingham

[-- Attachment #1.2: bsearch.c --]
[-- Type: text/x-c, Size: 697 bytes --]

#include <stdlib.h>
#include <limits.h>

size_t __bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
{
	size_t baseidx = 0, tryidx;
	void *try;
	int sign;

	while (nel > 0) {
		tryidx = baseidx + nel/2;
		try = (char*)base + tryidx*width;
		sign = cmp(key, try);
		if (!sign) return tryidx;
		else if (sign < 0)
			nel /= 2;
		else {
			baseidx = tryidx + 1;
			nel -= nel/2 + 1;
		}
	}

	return ~baseidx;
}

void *bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
{
	size_t idx = __bsearch(key, base, nel, width, cmp);
	return idx > SSIZE_MAX ? NULL : (char*)base + idx*width;
}

[-- Attachment #1.3: qsort.c --]
[-- Type: text/x-c, Size: 4853 bytes --]

#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <limits.h>

typedef int (*cmpfun)(const void *, const void *);
size_t __bsearch(const void *, const void *, size_t, size_t, cmpfun);

static inline size_t bsearch_pos(const void *key, const void *haystack, size_t nmel, size_t width, cmpfun cmp)
{
	size_t pos = __bsearch(key, haystack, nmel, width, cmp);
	return pos > SSIZE_MAX ? ~pos : pos;
}

static inline void *tail(char *base, size_t nmel, size_t width)
{
	return base + (nmel-1) * width;
}

static void swap(char *a, char *b, size_t width)
{
#ifdef __GNUC__
	typedef uint32_t __attribute__((__may_alias__)) u32;

	if ((uintptr_t)a % 4 == 0 && (uintptr_t)b % 4 == 0) {
		for (; width >= 4; width -= 4) {
			uint32_t tmp = *((u32*)a);
			*((u32*)a) = *((u32*)b);
			*((u32*)b) = tmp;
			a += 4;
			b += 4;
		}
	}
#endif

	while (width--) {
		char tmp = *a;
		*a++ = *b;
		*b++ = tmp;
	}
}

static void rotate(char *base, size_t size, size_t shift)
{
	int dir = 1;
	while (shift) {
		while (2*shift <= size) {
			swap(base, base + dir*shift, shift);
			size -= shift;
			base += shift*dir;
		}
		shift = size - shift;
		base  = dir > 0 ? base + size - shift : base - shift;
		dir  *= -1;
	}
}

static void distribute_buffer(char *base, size_t bufnmel, size_t sortnmel, size_t width, cmpfun cmp)
{
	while (bufnmel) {
		char  *sorted    = base + bufnmel * width;
		size_t insertpos = bsearch_pos(base, sorted, sortnmel, width, cmp);

		if (insertpos > 0)
			rotate(base, (bufnmel + insertpos) * width, bufnmel * width);

		base     += (insertpos + 1) * width;
		bufnmel  -= 1;
		sortnmel -= insertpos;
	}
}

#define MAX_SORTNET 8

static const uint8_t sortnet[][2] = {
	/* 0: index =  0 */
	/* 1: index =  0 */
	/* 2: index =  0 */
	{0,1},
	/* 3: index =  1 */
	{0,1}, {0,2}, {1,2},
	/* 4: index =  4 */
	{0,1}, {2,3}, {0,2}, {1,3}, {1,2},
	/* 5: index =  9 */
	{0,1}, {3,4}, {2,4}, {2,3}, {1,4}, {0,3}, {0,2}, {1,3}, {1,2},
	/* 6: index = 18 */
	{1,2}, {4,5}, {0,2}, {3,5}, {0,1}, {3,4}, {2,5}, {0,3}, {1,4}, {2,4},
	{1,3}, {2,3},
	/* 7: index = 30 */
	{1,2}, {3,4}, {5,6}, {0,2}, {3,5}, {4,6}, {0,1}, {4,5}, {2,6}, {0,4},
	{1,5}, {0,3}, {2,5}, {1,3}, {2,4}, {2,3},
	/* 8: index = 46 */
	{0,1}, {2,3}, {4,5}, {6,7}, {0,2}, {1,3}, {4,6}, {5,7}, {1,2}, {5,6},
	{0,4}, {3,7}, {1,5}, {2,6}, {1,4}, {3,6}, {2,4}, {3,5}, {3,4},
	/* 9: index = 65 */
};

static const uint8_t sortnet_index[] = { 0, 0, 0, 1, 4, 9, 18, 30, 46, 65 };

static void sorting_network(char *base, size_t nmel, size_t width, cmpfun cmp)
{
	for (int i = sortnet_index[nmel]; i < sortnet_index[nmel+1]; i++) {
		char *elem1 = base + sortnet[i][0] * width;
		char *elem2 = base + sortnet[i][1] * width;
		if (cmp(elem1, elem2) > 0) swap(elem1, elem2, width);
	}
}

/* get index of last block whose head is less than the previous block's tail */
static size_t last_overlap(char *base, size_t bcount, size_t bwidth, size_t width, cmpfun cmp)
{
	for (char *cur = tail(base, bcount, bwidth); --bcount; cur -= bwidth)
		if (cmp(cur - width, cur) > 0) break;
	return bcount;
}

void merge(char *buf, char *base, size_t anmel, size_t bnmel, size_t width, cmpfun cmp)
{
	char *a = buf;
	char *b = base + anmel * width;

	size_t skip = bsearch_pos(b, base, anmel, width, cmp);
	anmel -= skip;
	base  += skip * width;

	swap(base, a, anmel * width);

	while (anmel && bnmel) {
		if (cmp(a, b) <= 0) { swap(base, a, width); a += width; anmel--; }
		else                { swap(base, b, width); b += width; bnmel--; }
		base += width;
	}

	swap(base, a, anmel * width);
}

void qsort(void *unsorted, size_t nmel, size_t width, cmpfun cmp)
{
	char *base = unsorted;

	if (nmel <= MAX_SORTNET) {
		sorting_network(base, nmel, width, cmp);
		return;
	}

	size_t blknmel = sqrt(nmel);               /* elements in a block */
	size_t bufnmel = blknmel + nmel % blknmel; /* elements in the buffer */
	size_t bwidth  = blknmel * width;          /* size of a block in bytes */
	size_t blocks  = nmel / blknmel - 1;       /* number of blocks in a + b */
	size_t acount  = blocks / 2;
	size_t bcount  = blocks - acount;

	char *a = base + bufnmel * width;
	char *b = a    + acount  * bwidth;

	qsort(a, acount * blknmel, width, cmp);
	qsort(b, bcount * blknmel, width, cmp);

	/* if already sorted, nothing to do */
	if (cmp(tail(a, acount * blknmel, width), b) <= 0)
		goto distribute;

	/* sort all the a and b blocks together by their head elements */
	qsort(a, blocks, bwidth, cmp);

	/* merge, starting from the end and working towards the beginning */
	while (blocks > 1) {
		size_t overlap = last_overlap(a, blocks, bwidth, width, cmp);
		if (overlap > 0)
			merge(base, tail(a, overlap, bwidth), blknmel, (blocks - overlap) * blknmel, width, cmp);
		blocks = overlap;
	}

distribute:
	qsort(base, bufnmel, width, cmp);
	distribute_buffer(base, bufnmel, nmel - bufnmel, width, cmp);
}

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] new qsort implementation
  2014-09-01  7:12 [RFC] new qsort implementation Bobby Bingham
@ 2014-09-01 11:17 ` Szabolcs Nagy
  2014-09-01 18:20   ` Bobby Bingham
  2014-09-01 11:25 ` Alexander Monakov
  2023-02-17 15:51 ` [musl] " Rich Felker
  2 siblings, 1 reply; 9+ messages in thread
From: Szabolcs Nagy @ 2014-09-01 11:17 UTC (permalink / raw)
  To: musl

* Bobby Bingham <koorogi@koorogi.info> [2014-09-01 02:12:43 -0500]:
> You can find my test program with this algorithm and others at [4].
> Some of the implementations included are quicksort variants, so the
> "qsort-killer" testcase will trigger quadratic behavior in them.  If you
> want to run this you should consider reducing the maximum input size in
> testcases.h, disabling the qsort-killer input at the bottom of
> testcases.c, or disabling the affected sort algorithms ("freebsd",
> "glibc quicksort", and depending on your libc, "system") in sorters.c.

(i had a few build errors: musl_heapsort and musl_smoothsort were not
declared in sorters.h and glibc needs -lrt for clock_gettime)

smooth sort is best for almost sorted lists when only a few elements
are out of order (swap some random elements in a sorted array), this
is common in practice so you should test this case too

the noise case should use much less noise imho (so you test when
only local rearrangements are needed: buffer[i] += random()%small)

another common case is sorting two concatenated sorted arrays
(merge sort should do better in this case)

it would be nice to have a benchmark that is based on common
qsort usage cases

the qsort-killer test is not very interesting for algorithms other
than quicksort (where it is the worst-case), but it would be nice
to analyze the worst cases for smoothsort and grailsort
(they are both O(n logn) so nothing spectacular is expected but it
would be interesting to see how they compare against the theoretical
optimum: ceil(lgamma(n)/log(2)) compares)

> Here are the numbers comparing musl's current smoothsort with the
> attached grailsort code for various input patterns and sizes.  The test
> was run on x86_64, compiled with gcc 4.8.3 at -Os:
> 
>                           sorted             reverse            constant
>                  compares     ms     compares     ms     compares     ms
> musl smoothsort     19976      0       268152      8        19976      0
>                    199971      2      3327332     59       199971      2
>                   1999963     29     40048748    663      1999963     27
>                  19999960    289    465600753   7505     19999960    293
> 
> grailsort           71024      0        41110      0        28004      0
>                    753996      2       412840      5       270727      3
>                   7686249     27      4177007     74      2729965     41
>                  75927601    277     42751315    901     28243939    436
> 

interesting that the sorted case is faster with much more compares
here on i386 smoothsort is faster

                            sorted             reverse            constant
                   compares     ms     compares     ms     compares     ms
musl smoothsort       19976      0       268152      7        19976      1
                     199971      8      3327332    103       199971     15
                    1999963    105     40048748   1151      1999963    103
                   19999960   1087    465600753  13339     19999960   1103

grailsort             71024      1        41110      3        28004      3
                     753996     20       412840     23       270727     23
                    7686249    151      4177007    370      2729965    224
                   75927601   1438     42751315   4507     28243939   2353

> #include <stdlib.h>
> #include <limits.h>
> 
> size_t __bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> {
> 	size_t baseidx = 0, tryidx;
> 	void *try;
> 	int sign;
> 
> 	while (nel > 0) {
> 		tryidx = baseidx + nel/2;
> 		try = (char*)base + tryidx*width;
> 		sign = cmp(key, try);
> 		if (!sign) return tryidx;
> 		else if (sign < 0)
> 			nel /= 2;
> 		else {
> 			baseidx = tryidx + 1;
> 			nel -= nel/2 + 1;
> 		}
> 	}
> 
> 	return ~baseidx;
> }
> 
> void *bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> {
> 	size_t idx = __bsearch(key, base, nel, width, cmp);
> 	return idx > SSIZE_MAX ? NULL : (char*)base + idx*width;
> }

musl does not malloc >=SSIZE_MAX memory, but mmap can so baseidx
may be >0x7fffffff on a 32bit system

i'm not sure if current qsort handles this case


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] new qsort implementation
  2014-09-01  7:12 [RFC] new qsort implementation Bobby Bingham
  2014-09-01 11:17 ` Szabolcs Nagy
@ 2014-09-01 11:25 ` Alexander Monakov
  2014-09-01 18:27   ` Bobby Bingham
  2023-02-17 15:51 ` [musl] " Rich Felker
  2 siblings, 1 reply; 9+ messages in thread
From: Alexander Monakov @ 2014-09-01 11:25 UTC (permalink / raw)
  To: musl

Hi,

It seems you forgot to commit changes to sorter.h in your repo.

Comparing musl-heapsort to musl-smoothsort, the former appears significantly
better than the latter except on "sorted" input, and even then it's not 20x
faster like the musl commit adding smoothsort claims (about 6.6x for me).  It
does reduce the number of comparisons by 20x there, as the commit says.

There is variation on how divide-and-conquer algorithms in your test handle
sorting on the lowest level; for instance grailsort_ref uses insertion sort
and your implementations use a sorting network (is that correct?).  Would your
comparison be more apples-to-apples if all compared approaches used the same
sorter on the last level, if appropriate (assuming sorting networks improve
performance in some cases)?  

Why did you choose to use sorting networks in your implementations?

For wikisort and grailsort, their "ref" variants perform about 2x faster
on some tests for me .  Is that due to last-level sorter choice, or are there
other significant differences?

Thanks.

Alexander

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] new qsort implementation
  2014-09-01 11:17 ` Szabolcs Nagy
@ 2014-09-01 18:20   ` Bobby Bingham
  2014-09-01 20:53     ` Rich Felker
  0 siblings, 1 reply; 9+ messages in thread
From: Bobby Bingham @ 2014-09-01 18:20 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 5112 bytes --]

On Mon, Sep 01, 2014 at 01:17:48PM +0200, Szabolcs Nagy wrote:
> (i had a few build errors: musl_heapsort and musl_smoothsort were not
> declared in sorters.h and glibc needs -lrt for clock_gettime)

Fixed.

>
> smooth sort is best for almost sorted lists when only a few elements
> are out of order (swap some random elements in a sorted array), this
> is common in practice so you should test this case too
>
> the noise case should use much less noise imho (so you test when
> only local rearrangements are needed: buffer[i] += random()%small)

I've reduced the amount of noise locally.  I haven't pushed this change
yet, because when I tried it I found a bug in my grailsort
implementation that I'd like to fix first.

I had taken the merging code I'd written for my wikisort implementation
and tried to adapt it to grailsort, but it looks like I missed a case.
If some (but not all) blocks contain elements with a uniform value, it
is possible for the block merge step to produce the wrong results.

I'm looking at how best to fix this, but the best solution may be to
bring my merge step more in line with that described by the paper.

>
> another common case is sorting two concatenated sorted arrays
> (merge sort should do better in this case)
>

Added.

> it would be nice to have a benchmark that is based on common
> qsort usage cases
>
> the qsort-killer test is not very interesting for algorithms other
> than quicksort (where it is the worst-case), but it would be nice
> to analyze the worst cases for smoothsort and grailsort
> (they are both O(n logn) so nothing spectacular is expected but it
> would be interesting to see how they compare against the theoretical
> optimum: ceil(lgamma(n)/log(2)) compares)

Once I fix the code, I'll work on this analysis.

>
> > Here are the numbers comparing musl's current smoothsort with the
> > attached grailsort code for various input patterns and sizes.  The test
> > was run on x86_64, compiled with gcc 4.8.3 at -Os:
> >
> >                           sorted             reverse            constant
> >                  compares     ms     compares     ms     compares     ms
> > musl smoothsort     19976      0       268152      8        19976      0
> >                    199971      2      3327332     59       199971      2
> >                   1999963     29     40048748    663      1999963     27
> >                  19999960    289    465600753   7505     19999960    293
> >
> > grailsort           71024      0        41110      0        28004      0
> >                    753996      2       412840      5       270727      3
> >                   7686249     27      4177007     74      2729965     41
> >                  75927601    277     42751315    901     28243939    436
> >
>
> interesting that the sorted case is faster with much more compares
> here on i386 smoothsort is faster
>
>                             sorted             reverse            constant
>                    compares     ms     compares     ms     compares     ms
> musl smoothsort       19976      0       268152      7        19976      1
>                      199971      8      3327332    103       199971     15
>                     1999963    105     40048748   1151      1999963    103
>                    19999960   1087    465600753  13339     19999960   1103
>
> grailsort             71024      1        41110      3        28004      3
>                      753996     20       412840     23       270727     23
>                     7686249    151      4177007    370      2729965    224
>                    75927601   1438     42751315   4507     28243939   2353
>

Interesting.  When I saw that grailsort was faster even with more
comparisons on my machine, I had attributed it to my swap possibly being
faster.  But I don't see why this wouldn't also be the case on i386, so
maybe something else is going on.

> > #include <stdlib.h>
> > #include <limits.h>
> >
> > size_t __bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> > {
> > 	size_t baseidx = 0, tryidx;
> > 	void *try;
> > 	int sign;
> >
> > 	while (nel > 0) {
> > 		tryidx = baseidx + nel/2;
> > 		try = (char*)base + tryidx*width;
> > 		sign = cmp(key, try);
> > 		if (!sign) return tryidx;
> > 		else if (sign < 0)
> > 			nel /= 2;
> > 		else {
> > 			baseidx = tryidx + 1;
> > 			nel -= nel/2 + 1;
> > 		}
> > 	}
> >
> > 	return ~baseidx;
> > }
> >
> > void *bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> > {
> > 	size_t idx = __bsearch(key, base, nel, width, cmp);
> > 	return idx > SSIZE_MAX ? NULL : (char*)base + idx*width;
> > }
>
> musl does not malloc >=SSIZE_MAX memory, but mmap can so baseidx
> may be >0x7fffffff on a 32bit system
>
> i'm not sure if current qsort handles this case

I thought I recalled hearing that SSIZE_MAX was the upper bound on all
object sizes in musl, but if we still allow larger mmaps than that, I
guess not.  I'll find a different approach when I send the next version
of the code.

--
Bobby Bingham

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] new qsort implementation
  2014-09-01 11:25 ` Alexander Monakov
@ 2014-09-01 18:27   ` Bobby Bingham
  0 siblings, 0 replies; 9+ messages in thread
From: Bobby Bingham @ 2014-09-01 18:27 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 2228 bytes --]

On Mon, Sep 01, 2014 at 03:25:18PM +0400, Alexander Monakov wrote:
> Hi,
>
> It seems you forgot to commit changes to sorter.h in your repo.

Yes, that's fixed now.

>
> Comparing musl-heapsort to musl-smoothsort, the former appears significantly
> better than the latter except on "sorted" input, and even then it's not 20x
> faster like the musl commit adding smoothsort claims (about 6.6x for me).  It
> does reduce the number of comparisons by 20x there, as the commit says.

There is one difference between the musl heapsort in my repo compared to
what was used in musl, and that's the swap function.  The one in musl
worked with 3 memcpys in a loop with a 256 byte temporary buffer.  When
I added it to my test program, I made it use the swap function I'd
already written for grailsort/wikisort, which essentially inlines the
same concept.  That could explain the speed discrepancy.

>
> There is variation on how divide-and-conquer algorithms in your test handle
> sorting on the lowest level; for instance grailsort_ref uses insertion sort
> and your implementations use a sorting network (is that correct?).  Would your

Correct.

> comparison be more apples-to-apples if all compared approaches used the same
> sorter on the last level, if appropriate (assuming sorting networks improve
> performance in some cases)?
>
> Why did you choose to use sorting networks in your implementations?

Primarily because of I've heard Rich say on IRC a few times now that
sorting networks are a better choice for small sizes than insertion
sort, and I trust his opinion on this sort of thing.

It would be interesting to do a more apples to apples comparison.

>
> For wikisort and grailsort, their "ref" variants perform about 2x faster
> on some tests for me .  Is that due to last-level sorter choice, or are there
> other significant differences?

TBH, I haven't spent as much time as I should deciphering everything
that's going on in the reference implementations.  I suspect that a
large part of their complexity comes from optimizing for all sorts of
different cases, and even if it does account for a 2x speedup, I don't
think we'd want to introduce that much bloat in musl.

>
> Thanks.
>
> Alexander
>

--
Bobby Bingham

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] new qsort implementation
  2014-09-01 18:20   ` Bobby Bingham
@ 2014-09-01 20:53     ` Rich Felker
  2014-09-01 21:46       ` Bobby Bingham
  0 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2014-09-01 20:53 UTC (permalink / raw)
  To: musl

On Mon, Sep 01, 2014 at 01:20:05PM -0500, Bobby Bingham wrote:
> > > Here are the numbers comparing musl's current smoothsort with the
> > > attached grailsort code for various input patterns and sizes.  The test
> > > was run on x86_64, compiled with gcc 4.8.3 at -Os:
> > >
> > >                           sorted             reverse            constant
> > >                  compares     ms     compares     ms     compares     ms
> > > musl smoothsort     19976      0       268152      8        19976      0
> > >                    199971      2      3327332     59       199971      2
> > >                   1999963     29     40048748    663      1999963     27
> > >                  19999960    289    465600753   7505     19999960    293
> > >
> > > grailsort           71024      0        41110      0        28004      0
> > >                    753996      2       412840      5       270727      3
> > >                   7686249     27      4177007     74      2729965     41
> > >                  75927601    277     42751315    901     28243939    436
> > >
> >
> > interesting that the sorted case is faster with much more compares
> > here on i386 smoothsort is faster
> >
> >                             sorted             reverse            constant
> >                    compares     ms     compares     ms     compares     ms
> > musl smoothsort       19976      0       268152      7        19976      1
> >                      199971      8      3327332    103       199971     15
> >                     1999963    105     40048748   1151      1999963    103
> >                    19999960   1087    465600753  13339     19999960   1103
> >
> > grailsort             71024      1        41110      3        28004      3
> >                      753996     20       412840     23       270727     23
> >                     7686249    151      4177007    370      2729965    224
> >                    75927601   1438     42751315   4507     28243939   2353
> >
> 
> Interesting.  When I saw that grailsort was faster even with more
> comparisons on my machine, I had attributed it to my swap possibly being
> faster.  But I don't see why this wouldn't also be the case on i386, so
> maybe something else is going on.

I think it makes sense to test with two different types of cases:
expensive comparisons (costly compare function) and expensive swaps
(large array elements).

> > > #include <stdlib.h>
> > > #include <limits.h>
> > >
> > > size_t __bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> > > {
> > > 	size_t baseidx = 0, tryidx;
> > > 	void *try;
> > > 	int sign;
> > >
> > > 	while (nel > 0) {
> > > 		tryidx = baseidx + nel/2;
> > > 		try = (char*)base + tryidx*width;
> > > 		sign = cmp(key, try);
> > > 		if (!sign) return tryidx;
> > > 		else if (sign < 0)
> > > 			nel /= 2;
> > > 		else {
> > > 			baseidx = tryidx + 1;
> > > 			nel -= nel/2 + 1;
> > > 		}
> > > 	}
> > >
> > > 	return ~baseidx;
> > > }
> > >
> > > void *bsearch(const void *key, const void *base, size_t nel, size_t width, int (*cmp)(const void *, const void *))
> > > {
> > > 	size_t idx = __bsearch(key, base, nel, width, cmp);
> > > 	return idx > SSIZE_MAX ? NULL : (char*)base + idx*width;
> > > }
> >
> > musl does not malloc >=SSIZE_MAX memory, but mmap can so baseidx
> > may be >0x7fffffff on a 32bit system
> >
> > i'm not sure if current qsort handles this case
> 
> I thought I recalled hearing that SSIZE_MAX was the upper bound on all
> object sizes in musl, but if we still allow larger mmaps than that, I
> guess not.  I'll find a different approach when I send the next version
> of the code.

You are correct and nsz is mistaken on this. musl does not permit any
object size larger than SSIZE_MAX. mmap and malloc both enforce this.
But I'm not sure why you've written bsearch to need this assumption.
The bsearch in musl gets by fine without it.

Rich


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] new qsort implementation
  2014-09-01 20:53     ` Rich Felker
@ 2014-09-01 21:46       ` Bobby Bingham
  0 siblings, 0 replies; 9+ messages in thread
From: Bobby Bingham @ 2014-09-01 21:46 UTC (permalink / raw)
  To: musl

[-- Attachment #1: Type: text/plain, Size: 912 bytes --]

On Mon, Sep 01, 2014 at 04:53:42PM -0400, Rich Felker wrote:
> > I thought I recalled hearing that SSIZE_MAX was the upper bound on all
> > object sizes in musl, but if we still allow larger mmaps than that, I
> > guess not.  I'll find a different approach when I send the next version
> > of the code.
>
> You are correct and nsz is mistaken on this. musl does not permit any
> object size larger than SSIZE_MAX. mmap and malloc both enforce this.
> But I'm not sure why you've written bsearch to need this assumption.
> The bsearch in musl gets by fine without it.

The point of this was that bsearch returns NULL if the element is not
found, but I need to know where the element would have been had it been
present.  The bitwise complement is used so that bsearch, written as a
wrapper around __bsearch, can tell whether the element was present or
not without requiring an extra comparison.

--
Bobby Bingham

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [musl] [RFC] new qsort implementation
  2014-09-01  7:12 [RFC] new qsort implementation Bobby Bingham
  2014-09-01 11:17 ` Szabolcs Nagy
  2014-09-01 11:25 ` Alexander Monakov
@ 2023-02-17 15:51 ` Rich Felker
  2023-02-17 22:53   ` Rich Felker
  2 siblings, 1 reply; 9+ messages in thread
From: Rich Felker @ 2023-02-17 15:51 UTC (permalink / raw)
  To: Bobby Bingham; +Cc: musl

On Mon, Sep 01, 2014 at 02:12:43AM -0500, Bobby Bingham wrote:
> Hi all,
> 
> As I mentioned a while back on IRC, I've been looking into wikisort[1]
> and grailsort[2] to see if either of them would be a good candidate for
> use as musl's qsort.

I'd forgotten about this until Alexander Monakov mentioned it in the
context of the present qsort performance thread (introduced with
Message-ID: <CAAEi2GextYuWRK-JKtpCLxewyJ2u380m5+s=M_0P=ZBDxyX-xA@mail.gmail.com>)
and think it's probably worth pursuing.

Apparently there was a bug with constant blocks that would need to be
addressed.

Since then, we've also added qsort_r. Between this changing the
interface needs of the bsearch_pos function and the general avoidance
in musl of special cross-component internal interfaces, I suspect it
would make sense to just duplicate the bsearch code with the suitable
interface as a static function in qsort.c.

On performance, it looks like this already has something of an inlined
element swap, which may be a big part of the difference. So to
evaluate the degree to which it might be better, we really need to do
the same with the existing smoothsort code and compare them apples to
apples.

Regarding the sqrt, nowadays musl's sqrt is basically all integer code
except on targets with a native sqrt instruction, so it's probably not
catastrophic to performance on softfloat archs. However, it is a
little bit sketchy with respect to determinism since it's affected by
(and in turn alters) the floating point environment (rounding mode,
exception flags). I assume only an approximation within some
reasonable bounds (to ensure the desired big-O) is needed, so it
probably makes sense to do a fast integer approximation. Maybe even
something that's essentially just "wordsize minus clz, >>1, 1<<that,
multiply by 1.4 if shifted-out bit was 1" would suffice.

Rich

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [musl] [RFC] new qsort implementation
  2023-02-17 15:51 ` [musl] " Rich Felker
@ 2023-02-17 22:53   ` Rich Felker
  0 siblings, 0 replies; 9+ messages in thread
From: Rich Felker @ 2023-02-17 22:53 UTC (permalink / raw)
  To: Bobby Bingham; +Cc: musl

On Fri, Feb 17, 2023 at 10:51:14AM -0500, Rich Felker wrote:
> Regarding the sqrt, nowadays musl's sqrt is basically all integer code
> except on targets with a native sqrt instruction, so it's probably not
> catastrophic to performance on softfloat archs. However, it is a
> little bit sketchy with respect to determinism since it's affected by
> (and in turn alters) the floating point environment (rounding mode,
> exception flags). I assume only an approximation within some
> reasonable bounds (to ensure the desired big-O) is needed, so it
> probably makes sense to do a fast integer approximation. Maybe even
> something that's essentially just "wordsize minus clz, >>1, 1<<that,
> multiply by 1.4 if shifted-out bit was 1" would suffice.

Indeed, an approach like this (which is just taking the first-order
approximation of sqrt centered at the next-lower power of two) seems
to have a max error of around 6.7%, and the cost is essentially just
one clz and one divide. If you're happy with up to 25% error, using
the next-lower *even* power of two has a cost that's essentially just
one clz. I think either could be made significantly better by using
the nearest (resp. nearest-even) power of two rather than always going
down, without any significant addition of costly operations.

Rich

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-02-17 22:53 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-01  7:12 [RFC] new qsort implementation Bobby Bingham
2014-09-01 11:17 ` Szabolcs Nagy
2014-09-01 18:20   ` Bobby Bingham
2014-09-01 20:53     ` Rich Felker
2014-09-01 21:46       ` Bobby Bingham
2014-09-01 11:25 ` Alexander Monakov
2014-09-01 18:27   ` Bobby Bingham
2023-02-17 15:51 ` [musl] " Rich Felker
2023-02-17 22:53   ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).