mailing list of musl libc
 help / color / mirror / code / Atom feed
* [musl] Improvements to qsort
@ 2021-07-09 11:26 jason
  2021-07-10  0:20 ` Rich Felker
  0 siblings, 1 reply; 2+ messages in thread
From: jason @ 2021-07-09 11:26 UTC (permalink / raw)
  To: musl; +Cc: jason

I've been trying to understand ngmalloc, and it's a struggle, so for
relaxation I had a look at qsort, and have a few suggestions.

To avoid burying the lead, the following reduces comparisons by
about 4% with no downside:

--- a/qsort.c
+++ b/qsort.c
@@ -100,18 +100,16 @@ static void sift(unsigned char *head, size_t width, cmpfun cmp, int pshift, size
 		rt = head - width;
 		lf = head - width - lp[pshift - 2];
 
-		if((*cmp)(ar[0], lf) >= 0 && (*cmp)(ar[0], rt) >= 0) {
-			break;
-		}
 		if((*cmp)(lf, rt) >= 0) {
-			ar[i++] = lf;
 			head = lf;
 			pshift -= 1;
 		} else {
-			ar[i++] = rt;
 			head = rt;
 			pshift -= 2;
 		}
+		if((*cmp)(ar[0], head) >= 0)
+			break;
+		ar[i++] = head;
 	}
 	cycle(width, ar, i);
 }

Basically, instead of doing two comparisons to see if ar[0] is greater
than both the left and right children and then comparing the children
with each other to find the greater one, find the greater child and
compare ar[0] with that one.

I have several other less-significant changes that I wonder if musl is
interested in:

* I notice than pntz() is always used in the following context:
	trail = pntz(p);
	shr(p, trail);
	pshift += trail;
  both pntz() and shr() have branches conditional on whether the shift
  is more or less than 8*sizeof(size_t) bits.  I have replaced this
  by a single function, which allows the conditional branch to be
  shared:

/*
 * Re-normalize p[].  Ignore the low-order bit (which must be set!),
 * find the position of the next-least-significant bit, and shift p[]
 * down by that much.  Return the number of bits shifted.
 *
 * There must be a penultimate trailing bit; i.e. p[] must not be {1, 0}.
 */
static inline int normalize(size_t p[2])
{
	int s;
	size_t t = p[0] - 1;

	if (t == 0) {
		t = p[1];
		s = ntz(t);
		p[0] = t >> s;
		p[1] = 0;
		s += 8*sizeof(size_t);
	} else {
		s = ntz(t);
		p[0] = t >> s | p[1] << (8*sizeof(size_t) - s);
		p[1] >>= s;
	}
	return s;
}

  which is used simply as:
	pshift += normalize(p);

* As part of my learning about the code, I have added a lot of comments,
  particularly large block comments about the smoothsort algorithm
  in general, the meaning of p[] and pindex, what a "stepson" is,
  when it's okay to use sift() instead of trinkle(), what the "trusty"
  flag means, how cycle() is used, etc.  Would these be worth cleaning
  up and submitting?

* In my private copy, I have changed the type of "trusty" to bool,
  a data type I'm quite fond of as it concisely documents the limited
  range of values.  I notice, however, that musl is completely devoid of
  any use of <stdbool.h>, although it uses other C99 features.  Is this
  a conscious decision?

* Likewise, I notice that most musl code has a space in "keyword (condition)"
  like "if (foo != 0)", "while (bar < N)" and "for (i=0; i < N; i++)".
  But the qsort() code, with one single exception, omits the space.

  Likewise, most msul code is quite happy to "if (condition) break;"
  on a single line, with no braces, but the original sift() function
  above puts the break on a second line and a closing brace on a third.

  The musl code base is inconsistent enough that it's hard to infer code
  formatting conventions.  Are these discrepancies in qsort a problem
  worth fixing?

* The way the heap-building loop maintains its loop invariants is a trifle
  peculiar.  It begins with the element at *head included in p[] and pindex,
  but excluded from the heap ordering invariants.  Then it establishes
  the ordering invariants (with sift() or trinkle()) and immediately adds
  another (un-ordered) element to p[] and pindex.

  After the main loop exits, a standalone call to trinkle() places the
  final element in heap order.

  I have a more conventional (add, then sift) ordering working which
  seems to simplify the code:
 	while(head < high) {
+		bool notrinkle = false;
+
+		/* Append an element to the heap */
+		head += width;
 		if((p[0] & 3) == 3) {
-			sift(head, width, cmp, pshift, lp);
+			/* Two trees of adjacent sizes: merge with head */
 			shr(p, 2);
 			pshift += 2;
-		} else {
-			if(lp[pshift - 1] >= high - head) {
-				trinkle(head, width, cmp, p, pshift, 0, lp);
-			} else {
+			if((size_t)(high - head) > lp[pshift - 1]) {
 				sift(head, width, cmp, pshift, lp);
+				notrinkle = true;
 			}
-			
-			if(pshift == 1) {
-				shl(p, 1);
-				pshift = 0;
-			} else {
-				shl(p, pshift - 1);
-				pshift = 1;
-			}
+		} else if(pshift == 1) {
+			/* order-1 exists; new tree is order-0 */
+			shl(p, 1);
+			pshift = 0;
+			notrinkle = (high != head);
+		} else {
+			/* new tree is order-1 */
+			shl(p, pshift - 1);
+			pshift = 1;
+			notrinkle = ((size_t)(high - head) > lp[0]);
 		}
-		
 		p[0] |= 1;
-		head += width;
+
+		/* Re-establish the heap using trinkle */
+		if(!notrinkle)
+			trinkle(head, width, cmp, p, pshift, false, lp);
 	}
 
-	trinkle(head, width, cmp, p, pshift, 0, lp);

  If the original ordering is wanted, then the loop initialization
  can be changed to start with two elements (p[0] = 3, pindex = 0,
  head = base + width), since two elements automatically fulfil the
  required loop invariants.

  Does anyone have a preference?

* Since the first test in qsort() excludes width == 0, the loop over
  width in cycle() can be changed to "do { ... } while (width);".

* The ar[] array in sift() only has to be as large as the lp[]
  array in qsort(), not 1/6 larger.

* The ar[] array in trinkle() only has to be half as large, plus one.

* The pindex > 1 case in the heap-consuming loop contains some
  redundant left-then-right shifting:
			shl(p, 2);
			pshift -= 2;
			p[0] ^= 7;
			shr(p, 1);
  This can be optimized.

* There are several cases where variables can be moved into
  smaller local scopes, particularly the "stepson", "rt" and "lf"
  variables in trinkle().  This is another change I've made privately
  as I think it makes the code more reasonable.  Worth submitting?

* GCC emits a couple of signed/unsigned comparison warnings with
  -W -Wall.  Worth fixing?

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [musl] Improvements to qsort
  2021-07-09 11:26 [musl] Improvements to qsort jason
@ 2021-07-10  0:20 ` Rich Felker
  0 siblings, 0 replies; 2+ messages in thread
From: Rich Felker @ 2021-07-10  0:20 UTC (permalink / raw)
  To: jason; +Cc: musl

On Fri, Jul 09, 2021 at 01:26:28PM +0200, jason wrote:
> I've been trying to understand ngmalloc, and it's a struggle, so for
> relaxation I had a look at qsort, and have a few suggestions.
> 
> To avoid burying the lead, the following reduces comparisons by
> about 4% with no downside:
> 
> --- a/qsort.c
> +++ b/qsort.c
> @@ -100,18 +100,16 @@ static void sift(unsigned char *head, size_t width, cmpfun cmp, int pshift, size
>  		rt = head - width;
>  		lf = head - width - lp[pshift - 2];
>  
> -		if((*cmp)(ar[0], lf) >= 0 && (*cmp)(ar[0], rt) >= 0) {
> -			break;
> -		}
>  		if((*cmp)(lf, rt) >= 0) {
> -			ar[i++] = lf;
>  			head = lf;
>  			pshift -= 1;
>  		} else {
> -			ar[i++] = rt;
>  			head = rt;
>  			pshift -= 2;
>  		}
> +		if((*cmp)(ar[0], head) >= 0)
> +			break;
> +		ar[i++] = head;
>  	}
>  	cycle(width, ar, i);
>  }
> 
> Basically, instead of doing two comparisons to see if ar[0] is greater
> than both the left and right children and then comparing the children
> with each other to find the greater one, find the greater child and
> compare ar[0] with that one.

I think this (and addition of comments) are the most worthwhile to
look at first. In particular, cmp is a caller-provided function that
can be expensive, so reducing the number of calls is very helpful.

As an aside, it may be nice to get qsort_r in before making changes so
that qsort_r is available for backport without also backporting
bleeding edge improvements.

> I have several other less-significant changes that I wonder if musl is
> interested in:
> 
> * I notice than pntz() is always used in the following context:
> 	trail = pntz(p);
> 	shr(p, trail);
> 	pshift += trail;
>   both pntz() and shr() have branches conditional on whether the shift
>   is more or less than 8*sizeof(size_t) bits.  I have replaced this
>   by a single function, which allows the conditional branch to be
>   shared:
> 
> /*
>  * Re-normalize p[].  Ignore the low-order bit (which must be set!),
>  * find the position of the next-least-significant bit, and shift p[]
>  * down by that much.  Return the number of bits shifted.
>  *
>  * There must be a penultimate trailing bit; i.e. p[] must not be {1, 0}.
>  */
> static inline int normalize(size_t p[2])
> {
> 	int s;
> 	size_t t = p[0] - 1;
> 
> 	if (t == 0) {
> 		t = p[1];
> 		s = ntz(t);
> 		p[0] = t >> s;
> 		p[1] = 0;
> 		s += 8*sizeof(size_t);
> 	} else {
> 		s = ntz(t);
> 		p[0] = t >> s | p[1] << (8*sizeof(size_t) - s);
> 		p[1] >>= s;
> 	}
> 	return s;
> }
> 
>   which is used simply as:
> 	pshift += normalize(p);

I'm unclear whether it's an improvement open-coding shr rather than
just calling it here. Seems like it introduced new duplication of that
logic while eliminating other duplication. If you think a change is
really better here, it might make sense to factor it as first moving
the repeated pntz/shr/pshift+=... to a function without refactoring,
then expanding pntz out into it (since pntz is not called anywhere
else) in a second change, but without expanding out shr.

> * As part of my learning about the code, I have added a lot of comments,
>   particularly large block comments about the smoothsort algorithm
>   in general, the meaning of p[] and pindex, what a "stepson" is,
>   when it's okay to use sift() instead of trinkle(), what the "trusty"
>   flag means, how cycle() is used, etc.  Would these be worth cleaning
>   up and submitting?

This sounds good. If you can submit them, I'll see if myself and the
original author can review that they make sense.

> * In my private copy, I have changed the type of "trusty" to bool,
>   a data type I'm quite fond of as it concisely documents the limited
>   range of values.  I notice, however, that musl is completely devoid of
>   any use of <stdbool.h>, although it uses other C99 features.  Is this
>   a conscious decision?

Yes, the _Bool type isn't used. That's just a style choice (mainly
because it has some mildly non-obvious behaviors and doesn't really
help anything) but one I'd like to remain consistent on.

> * Likewise, I notice that most musl code has a space in "keyword (condition)"
>   like "if (foo != 0)", "while (bar < N)" and "for (i=0; i < N; i++)".
>   But the qsort() code, with one single exception, omits the space.

The qsort code was written with mildly different spacing style here
than what I usually use and ask others to use, but I don't think it
makes any difference. I'd rather be consistent, but I'd also rather
not have "let's gratuitously change spaces" commits.

>   Likewise, most msul code is quite happy to "if (condition) break;"
>   on a single line, with no braces, but the original sift() function
>   above puts the break on a second line and a closing brace on a third.

Really there is no strong preference here, except possibly to put it
on its own line if it's parallel to an else clause that needs to be on
its own line (e.g. for line length), or if there's a special risk of
it being overlooked if it's on the same line. Mostly I'd aim for
*local consistency* here. There is no compelling reason to need global
consistency across the codebase on how this is done.

>   The musl code base is inconsistent enough that it's hard to infer code
>   formatting conventions.  Are these discrepancies in qsort a problem
>   worth fixing?

No. If someone wrote code for musl with mildly different style but
that wasn't deemed offensively bad when it was contributed, there's no
need to change it now/later.

> * The way the heap-building loop maintains its loop invariants is a trifle
>   peculiar.  It begins with the element at *head included in p[] and pindex,
>   but excluded from the heap ordering invariants.  Then it establishes
>   the ordering invariants (with sift() or trinkle()) and immediately adds
>   another (un-ordered) element to p[] and pindex.
> 
>   After the main loop exits, a standalone call to trinkle() places the
>   final element in heap order.
> 
>   I have a more conventional (add, then sift) ordering working which
>   seems to simplify the code:
>  	while(head < high) {
> +		bool notrinkle = false;
> +
> +		/* Append an element to the heap */
> +		head += width;
>  		if((p[0] & 3) == 3) {
> -			sift(head, width, cmp, pshift, lp);
> +			/* Two trees of adjacent sizes: merge with head */
>  			shr(p, 2);
>  			pshift += 2;
> -		} else {
> -			if(lp[pshift - 1] >= high - head) {
> -				trinkle(head, width, cmp, p, pshift, 0, lp);
> -			} else {
> +			if((size_t)(high - head) > lp[pshift - 1]) {
>  				sift(head, width, cmp, pshift, lp);
> +				notrinkle = true;
>  			}
> -			
> -			if(pshift == 1) {
> -				shl(p, 1);
> -				pshift = 0;
> -			} else {
> -				shl(p, pshift - 1);
> -				pshift = 1;
> -			}
> +		} else if(pshift == 1) {
> +			/* order-1 exists; new tree is order-0 */
> +			shl(p, 1);
> +			pshift = 0;
> +			notrinkle = (high != head);
> +		} else {
> +			/* new tree is order-1 */
> +			shl(p, pshift - 1);
> +			pshift = 1;
> +			notrinkle = ((size_t)(high - head) > lp[0]);
>  		}
> -		
>  		p[0] |= 1;
> -		head += width;
> +
> +		/* Re-establish the heap using trinkle */
> +		if(!notrinkle)
> +			trinkle(head, width, cmp, p, pshift, false, lp);
>  	}
>  
> -	trinkle(head, width, cmp, p, pshift, 0, lp);
> 
>   If the original ordering is wanted, then the loop initialization
>   can be changed to start with two elements (p[0] = 3, pindex = 0,
>   head = base + width), since two elements automatically fulfil the
>   required loop invariants.
> 
>   Does anyone have a preference?

I'll need to read this in a lot more detail to comment on it.

> * Since the first test in qsort() excludes width == 0, the loop over
>   width in cycle() can be changed to "do { ... } while (width);".

I really doubt this makes any difference.

> * The ar[] array in sift() only has to be as large as the lp[]
>   array in qsort(), not 1/6 larger.
> 
> * The ar[] array in trinkle() only has to be half as large, plus one.

These observations may be true, but are very conspicuous high-risk
changes from the perspective of someone not familiar with the code.
Unless there is a compelling reason I would not touch them. Someone
reviewing musl commits for new security bugs (aka code suspicious of
being a backdoor) would sink a lot of time into analyzing that sort of
change, and I do not want to waste their time.

> * The pindex > 1 case in the heap-consuming loop contains some
>   redundant left-then-right shifting:
> 			shl(p, 2);
> 			pshift -= 2;
> 			p[0] ^= 7;
> 			shr(p, 1);
>   This can be optimized.

I think this has been mentioned before by others, and at the time we
were not sure why it was written this way either. It can/should
probably be changed.

> * There are several cases where variables can be moved into
>   smaller local scopes, particularly the "stepson", "rt" and "lf"
>   variables in trinkle().  This is another change I've made privately
>   as I think it makes the code more reasonable.  Worth submitting?

Probably not. There is no consistent rule about how this should be
done, and as mentioned above i don't like "style conformance" commits
unless there's something egregiously wrong.

> * GCC emits a couple of signed/unsigned comparison warnings with
>   -W -Wall.  Worth fixing?

No, the set of warnings musl targets is in configure. Intentional use
of signed/unsigned promotion semantics, including comparison, is used
all over musl.


^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2021-07-10  0:20 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-09 11:26 [musl] Improvements to qsort jason
2021-07-10  0:20 ` Rich Felker

Code repositories for project(s) associated with this public inbox

	https://git.vuxu.org/mirror/musl/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).