Sort.array easily degenerates

caml-list - the Caml user's mailing list

* Sort.array easily degenerates
@ 1999-03-06  0:27 Markus Mottl
  1999-03-09 10:44 ` Xavier Leroy
  0 siblings, 1 reply; 5+ messages in thread
From: Markus Mottl @ 1999-03-06  0:27 UTC
  To: OCAML

Hello,

I have played around with the new functions and modules in the new
OCAML-release. Besides other things I have tested the new function for
sorting arrays (Sort.array).

I am not sure where the problem in the implementation is, but the
"qsort"-function, which is applied in "Sort.array" degenerates easily
on pre-sorted and/or non-unique data. E.g.:

-----  SNIP -----
let _ =
  let size = 5000 in
  let ar = Array.create size 0 in
  for i = 0 to (size-1) do ar.(i) <- i done;
  Sort.array (>=) ar
-----  SNIP -----

Each array element is initialized with its own index, so the input is
already in ascending order. It is then sorted in descending order.

Running the same test with larger arrays clearly shows that we encounter
the worst-case behaviour, O(n^2), of quicksort.

Even worse:

If we fill the array with a single repeated value, the running time
stays at its worst case, but with an even higher constant factor.

Initializing the array with random integers (of a large range) shows
that in such cases "Sort.array" does not perform this badly (actually
quite well).
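
The three cases above can be reproduced with a small sketch. It uses a
deliberately naive quicksort (last element as pivot), which is an assumption
made for illustration and not the Sort.array code from the release; the
name count_comparisons is hypothetical. It counts calls to the order
function instead of measuring time.

```ocaml
(* Naive in-place quicksort (Lomuto partition, last element as pivot).
   NOT the Sort.array implementation; it only illustrates the degeneration.
   Returns the number of calls made to the order function [le]. *)
let count_comparisons (le : 'a -> 'a -> bool) (a : 'a array) : int =
  let count = ref 0 in
  let le x y = incr count; le x y in
  let swap i j = let t = a.(i) in a.(i) <- a.(j); a.(j) <- t in
  let rec sort lo hi =
    if lo < hi then begin
      let pivot = a.(hi) in
      let store = ref lo in
      for i = lo to hi - 1 do
        if le a.(i) pivot then begin swap i !store; incr store end
      done;
      swap !store hi;            (* move the pivot to its final position *)
      sort lo (!store - 1);
      sort (!store + 1) hi
    end
  in
  sort 0 (Array.length a - 1);
  !count

let () =
  let n = 1000 in
  let sorted = Array.init n (fun i -> i) in          (* pre-sorted input *)
  let constant = Array.make n 0 in                   (* non-unique input *)
  let random = Array.init n (fun _ -> Random.int 1_000_000) in
  Printf.printf "sorted: %d  constant: %d  random: %d\n"
    (count_comparisons (>=) sorted)
    (count_comparisons (>=) constant)
    (count_comparisons (>=) random)
```

With this pivot rule, the pre-sorted and the constant input both need
n(n-1)/2 comparisons (499500 for n = 1000), because every partition step
splits off only the pivot. The random input needs far fewer.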

I have compared this to "qsort" in "stdlib.h", the standard library of
C. It is faster than the OCAML-version, as was to be expected. But, in
contrast to the OCAML-version, it also behaves very nicely on low-entropy
data.

E.g.: (C-Code):

-----  SNIP -----
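#include <stdlib.h>

/* Compare two ints without the overflow risk of plain subtraction. */
int int_comp (const void * a, const void * b) {
  int x = *(const int *) a, y = *(const int *) b;
  return (x > y) - (x < y);
}

enum { size = 5000 };

int main (void) {
  int ar[size];
  for (int i = 0; i < size; ++i) ar[i] = i;  /* already sorted input */
  qsort (ar, size, sizeof(int), int_comp);
  return 0;
}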
-----  SNIP -----

It would probably be a good idea to change the implementation of
"Sort.array" so as to make it less likely to encounter such worst-case
behaviour. Especially on data in which elements may occur more than once,
the current implementation performs really badly.

So here is a question: has anyone already written an in-place quicksort
function for arrays which demonstrates nicer behaviour?

Best regards,
Markus

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl


* Re: Sort.array easily degenerates
  1999-03-06  0:27 Sort.array easily degenerates Markus Mottl
@ 1999-03-09 10:44 ` Xavier Leroy
  1999-03-09 23:03   ` doligez
  1999-03-10  0:28   ` Markus Mottl
  0 siblings, 2 replies; 5+ messages in thread
From: Xavier Leroy @ 1999-03-09 10:44 UTC
  To: Markus Mottl, OCAML

> I have played around with the new functions and modules in the new
> OCAML-release. Besides other things I have tested the new function for
> sorting arrays (Sort.array).
> I am not sure where the problem in the implementation is, but the
> "qsort"-function, which is applied in "Sort.array" degenerates easily
> on pre-sorted and/or non-unique data.

The Sort.array implementation is Quicksort with insertion sort for
small partitions, as suggested in Sedgewick.  I should know better
than take some code out of an algorithms textbook and expect that it
will work well... 

At any rate, any one is welcome to send me a better implementation.

- Xavier Leroy





* Re: Sort.array easily degenerates
  1999-03-09 10:44 ` Xavier Leroy
@ 1999-03-09 23:03   ` doligez
  1999-03-10 13:58     ` Xavier Leroy
  1999-03-10  0:28   ` Markus Mottl
  1 sibling, 1 reply; 5+ messages in thread
From: doligez @ 1999-03-09 23:03 UTC
  To: OCAML

>From: Xavier Leroy <Xavier.Leroy@inria.fr>

>The Sort.array implementation is Quicksort with insertion sort for
>small partitions, as suggested in Sedgewick.  I should know better
>than take some code out of an algorithms textbook and expect that it
>will work well... 

There's no way to implement Sedgewick's quicksort with the interface
given in sort.mli.  You'd need two comparison functions, one for ">="
and one for "<=".  That explains the degenerate case when all the
elements are equal.

And it degenerates on already-sorted data because you swap the pivot
with the right-most element.  As a consequence, one of the subarrays
has its two highest elements in first and last positions, which makes
the median-of-three degenerate.  I think this bug is also in
Sedgewick's pseudo-code.

Also, you should recurse on the smallest subarray first, not the
largest.
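
A minimal sketch of that last point, under stated assumptions: the function
name quicksort_small_first is hypothetical, the pivot is simply the middle
element (chosen for brevity, not median-of-three), and the code is not the
one in sort.ml. Recursing into the smaller side and handling the larger side
with a tail call keeps the stack depth at O(log n), even when the running
time itself degenerates.

```ocaml
(* In-place quicksort that recurses into the SMALLER partition first and
   treats the larger one with a tail call. Illustrative only. *)
let quicksort_small_first (cmp : 'a -> 'a -> int) (a : 'a array) : unit =
  let swap i j = let t = a.(i) in a.(i) <- a.(j); a.(j) <- t in
  (* Lomuto partition around the middle element; returns the pivot's
     final index. *)
  let partition lo hi =
    swap (lo + (hi - lo) / 2) hi;
    let pivot = a.(hi) in
    let store = ref lo in
    for i = lo to hi - 1 do
      if cmp a.(i) pivot < 0 then begin swap i !store; incr store end
    done;
    swap !store hi;
    !store
  in
  let rec sort lo hi =
    if lo < hi then begin
      let p = partition lo hi in
      if p - lo < hi - p then begin
        sort lo (p - 1);     (* smaller side: a real recursive call *)
        sort (p + 1) hi      (* larger side: tail call, no stack growth *)
      end else begin
        sort (p + 1) hi;     (* smaller (or equal) side first *)
        sort lo (p - 1)      (* larger side: tail call *)
      end
    end
  in
  sort 0 (Array.length a - 1)
```

The non-tail call always receives at most half of the current range, so at
most about log2(n) frames are live at once.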

I vote for Shellsort.

-- Damien


* Re: Sort.array easily degenerates
  1999-03-09 10:44 ` Xavier Leroy
  1999-03-09 23:03   ` doligez
@ 1999-03-10  0:28   ` Markus Mottl
  1 sibling, 0 replies; 5+ messages in thread
From: Markus Mottl @ 1999-03-10  0:28 UTC
  To: Xavier Leroy; +Cc: OCAML

> The Sort.array implementation is Quicksort with insertion sort for
> small partitions, as suggested in Sedgewick.  I should know better
> than take some code out of an algorithms textbook and expect that it
> will work well... 
> 
> At any rate, any one is welcome to send me a better implementation.

I have also compared it to the Sedgewick version and wondered what was
wrong with the implementation. It seems that the version in the book
doesn't deliver what it promises...

Someone suggested to me by mail that "sort", as found in the STL, is
very efficient. I took a look at it and it does indeed make a very good
impression. There is an excellent paper about it on the following page:

  http://www.cs.rpi.edu/~musser/gp/timing.html
  Name of paper: Introspective Sorting and Searching Algorithms

  download paper from:
  http://www.cs.rpi.edu/~musser/gp/introsort.ps

It's a hybrid of several sorting algorithms. It not only guarantees a
worst-case bound of N*log(N), but is also as fast as quicksort in the
average case. Its constant factor is only slightly larger than that of
quicksort, so it seems to be a true alternative.
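
The hybrid idea can be sketched roughly as follows. This is a simplified
illustration and not the algorithm as given in the paper or the STL: it
keeps only the core mechanism (quicksort with a depth budget of about
2*log2(n), falling back to heapsort when the budget runs out) and leaves
out refinements such as insertion sort for small ranges. The name introsort
and the middle-element pivot are assumptions of this sketch.

```ocaml
(* Simplified introsort: quicksort with a recursion-depth budget and an
   in-place heapsort fallback. Illustrative only. *)
let introsort (cmp : 'a -> 'a -> int) (a : 'a array) : unit =
  let swap i j = let t = a.(i) in a.(i) <- a.(j); a.(j) <- t in
  (* Heapsort on the slice a.(lo) .. a.(hi), max-heap rooted at lo. *)
  let heapsort lo hi =
    let rec sift_down root last =
      let child = lo + 2 * (root - lo) + 1 in
      if child <= last then begin
        let child =
          if child < last && cmp a.(child) a.(child + 1) < 0
          then child + 1 else child in
        if cmp a.(root) a.(child) < 0 then begin
          swap root child;
          sift_down child last
        end
      end
    in
    let n = hi - lo + 1 in
    for i = lo + n / 2 - 1 downto lo do sift_down i hi done;
    for last = hi downto lo + 1 do
      swap lo last;                (* move current maximum to its place *)
      sift_down lo (last - 1)
    done
  in
  (* Lomuto partition around the middle element. *)
  let partition lo hi =
    swap (lo + (hi - lo) / 2) hi;
    let pivot = a.(hi) in
    let store = ref lo in
    for i = lo to hi - 1 do
      if cmp a.(i) pivot < 0 then begin swap i !store; incr store end
    done;
    swap !store hi;
    !store
  in
  let rec sort lo hi depth =
    if lo < hi then begin
      if depth = 0 then heapsort lo hi   (* budget exhausted: O(n log n) *)
      else begin
        let p = partition lo hi in
        sort lo (p - 1) (depth - 1);
        sort (p + 1) hi (depth - 1)
      end
    end
  in
  let rec log2 k = if k <= 1 then 0 else 1 + log2 (k / 2) in
  let n = Array.length a in
  sort 0 (n - 1) (2 * log2 n)
```

On an array of identical elements this quicksort makes no progress, so the
depth budget is used up quickly and heapsort finishes the work, which is
what keeps the worst case at N*log(N).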

The implementation requires heap algorithms. If someone has time, they could
try to implement the sort algorithm with a suitable heap implementation
from Okasaki's "Purely Functional Data Structures"; some of them are very
efficient. Take a look at the paper and at the page

  http://miss.wu-wien.ac.at/~mottl/ocaml_sources/intro.html

and download "pure_fun.tar.gz". In chapter 3 you will find "LeftistHeap"
and in chapter 5 "SplayHeap". Both are quite efficient. SplayHeap seems
to be faster (garbage collection parameters can change the behaviour
significantly), but it is a bit more complicated. With some minor
changes/additions it should be possible to use them for heap-sorting.
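
To give an idea of what heap-sorting with a leftist heap could look like,
here is a sketch written from the general description of the data
structure. It is not the code from "pure_fun.tar.gz"; the type and function
names are hypothetical, and the sort works on lists rather than in place.

```ocaml
(* Leftist heap: E is the empty heap, T (rank, element, left, right).
   The rank is the length of the right spine. *)
type 'a heap = E | T of int * 'a * 'a heap * 'a heap

let rank = function E -> 0 | T (r, _, _, _) -> r

(* Build a node, keeping the higher-rank child on the left. *)
let make x a b =
  if rank a >= rank b then T (rank b + 1, x, a, b)
  else T (rank a + 1, x, b, a)

(* Merge two heaps ordered by [le]; only the right spines are walked. *)
let rec merge le h1 h2 =
  match h1, h2 with
  | E, h | h, E -> h
  | T (_, x, a1, b1), T (_, y, a2, b2) ->
      if le x y then make x a1 (merge le b1 h2)
      else make y a2 (merge le h1 b2)

let insert le x h = merge le (T (1, x, E, E)) h

(* Heap sort: insert every element, then repeatedly remove the root. *)
let heap_sort le xs =
  let heap = List.fold_left (fun h x -> insert le x h) E xs in
  let rec drain acc = function
    | E -> List.rev acc
    | T (_, x, a, b) -> drain (x :: acc) (merge le a b)
  in
  drain [] heap

let sorted = heap_sort (<=) [3; 1; 2; 1]
(* sorted = [1; 1; 2; 3] *)
```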

It seems that a collection of such algorithms and data structures would
really come in handy in the OCAML standard library...

Another question is whether to also support "stable_sort" as in the
STL. It guarantees that elements which compare as equal will keep their
original relative order. This is important with "order" functions that
consider only a part of the data being sorted.
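
A small illustration of the stability property, using List.stable_sort from
the present-day OCaml standard library (it is not part of the release
discussed here); the names by_key and input are made up for the example.

```ocaml
(* Records are (key, label) pairs; the order function looks at the key
   only, so several records compare as equal. *)
let by_key (k1, _) (k2, _) = compare (k1 : int) k2

let input = [ (2, "a"); (1, "b"); (2, "c"); (1, "d") ]

(* A stable sort keeps records with equal keys in their input order. *)
let sorted = List.stable_sort by_key input
(* sorted = [ (1, "b"); (1, "d"); (2, "a"); (2, "c") ] *)
```

An unstable sort would be free to return (1, "d") before (1, "b"), which
matters when the labels were already arranged by an earlier sort.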

Best regards,
Markus Mottl

-- 
Markus Mottl, mottl@miss.wu-wien.ac.at, http://miss.wu-wien.ac.at/~mottl


* Re: Sort.array easily degenerates
  1999-03-09 23:03   ` doligez
@ 1999-03-10 13:58     ` Xavier Leroy
  0 siblings, 0 replies; 5+ messages in thread
From: Xavier Leroy @ 1999-03-10 13:58 UTC
  To: doligez, OCAML

I have revised my Quicksort implementation based on that found in
glibc 2, which contains some clever optimizations.  Interested parties
can see the code at

  http://camlcvs.inria.fr/cgi-bin/cvsweb.out/ocaml/stdlib/sort.ml

The behavior on extreme situations (input already sorted) is now much
better.

> There's no way to implement Sedgewick's quicksort with the interface
> given in sort.mli.  You'd need two comparison functions, one for ">="
> and one for "<=".  That explains the degenerate case when all the
> elements are equal.

It's true that ">=" vs. ">" can make a big difference, but >= is
easily definable from <= : a >= b is b <= a, a > b is not(a <= b),
a < b is not(b <= a).
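
Written out as code, these identities look as follows. The helper names ge,
gt and lt are hypothetical, and the identities assume that the supplied
"<=" is a total order.

```ocaml
(* Derive the other comparisons from a single "less or equal" function.
   Assumes [le] is a total order. *)
let ge le a b = le b a            (* a >= b  is  b <= a      *)
let gt le a b = not (le a b)      (* a >  b  is  not (a <= b) *)
let lt le a b = not (le b a)      (* a <  b  is  not (b <= a) *)
```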

> And it degenerates on already-sorted data because you swap the pivot
> with the right-most element.  As a consequence, one of the subarrays
> has its two highest elements in first and last positions, which makes
> the median-of-three degenerate.  I think this bug is also in
> Sedgewick's pseudo-code.

Sedgewick doesn't even give pseudo-code for the "median of three"
heuristic, but the glibc implementation avoids this swap altogether.

> Also, you should recurse on the smallest subarray first, not the
> largest.

Good point.

> I vote for Shellsort.

Well, I tried it, and it's noticeably slower than quicksort by almost
a factor of two on random input.  Perhaps because polymorphic array
access is quite expensive in OCaml.

- Xavier Leroy
