
ethods for sorting have been intensively
studied and optimized. Nevertheless, many poor algorithms (or poor versions
of good algorithms) are still in circulation. Recently, I have translated a
variety of sorting routines into Visual Basic and compared their performance.
The collection includes some of the faster algorithms known (e.g., QuickSort,
RadixSort and Merge-Sort) and incorporates many optimizations known to increase
speed and/or consistency. On an 800 mhz PowerBook running the VBA routines
in Excel 2001, I observe sorting speeds as high as a million random strings
or doubles per minute (Table 1). I hope you will find the code for these sorts
useful and interesting.
What makes a good sorting algorithm?
Speed is probably the top consideration, but other factors of interest include
versatility in handling various data types, consistency of perform-ance, memory
requirements, length and complexity of code, and the property of stability (preserving
the original order of records that have equal keys). As you may guess, no single
sort is the winner in all categories simultaneously (Table 2).
Let's start with speed, which breaks
down into "order" and "overhead". When we talk about the
order of a sort, we mean the relationship between the number of keys to be sorted
and the time required. The best case is O(N); time is linearly proportional
to the number of items. We can't do this with any sort that works by comparing
keys; the best such sorts can do is O(N log N), but we can do it with a RadixSort,
which doesn't use comparisons. Many simple sorts (Bubble, Insertion, Selection)
have O(N^2) behavior, and should never be used for sorting long lists. But
what about short lists? The other part of the speed equation is overhead resulting
from complex code, and the sorts that are good for long lists tend to have more
of it. For short lists of 5 to 50 keys or for long lists that are almost sorted,
Insertion-Sort is extremely efficient and can be faster than finishing the same
job with QuickSort or a RadixSort. Many of the routines in my collection are
"hybrids", with a version of InsertionSort finishing up after a fast
algorithm has done most of the job.
The third aspect of speed is consistency.
Some sorts always take the same amount of time , but many have "best case"
and "worst case" performance for particular input orders of keys.
A famous example is QuickSort, generally the fastest of the O(N log N) sorts,
but it always has an O(N^2) worst case. It can be tweaked to make this worst
case very unlikely to occur, but other O(N log N) sorts like HeapSort and MergeSort
remain O(N log N) in their worst cases. QuickSort will almost always beat them,
but occasionally they will leave it in the dust.
Along with worst cases, the nature
of keys can also dramatically affect the speed (and the relative speed) of sorts.
This is especially true for string keys, which can vary significantly in length
and relatedness. Longer keys take longer to copy or compare, and highly related
keys take longer to compare because we have to examine more digits before finding
a difference. Two sorts may run neck and neck on short keys, but if one is
more efficient in doing fewer comparisons or moves, the difference on longer
keys may be dramatic. Likewise, a sort that manipulates pointers to long keys
can be far faster than one that moves the keys themselves.
Speed is usually the largest issue,
but certainly not the only one. If you're sorting a very large number of keys,
having enough memory may still be an issue, and methods that sort "in place"
(i.e., without extra room) will be more desirable than those that use a duplicate
array (e.g. MergeSort) or those that use a significant amount of stack space
for recursive calls. The length of the sorting algorithms themselves is not
much of an issue today, but there is considerable variation, and short, simple
algorithms are appealing and easy to maintain. Similarly, algorithms that can
easily be applied to all data types (strings, integers, longs, doubles, etc.)
are convenient, although they may not be as fast as more specialized algorithms
focused on a single data type.
A final issue is "stability"
or the ability to preserve the original order of records that have equal keys.
Some sorts have it (Insertion, Selection, Bubble, Merge, Radix, Ternary Quick)
and others don't (Shell, Comb, Heap, Quick). If you want to sequentially sort
records on multiple fields, you'll need it.
This article comes with nine sorting
algorithms in VB: Selection, Insertion, Shell, Comb, Heap, Merge, Quick, Multi-Key
Quick and MSD Radix. If I could choose only one of these, it would be MergeSort,
because it is quite fast, stable, readily adapts to all data types and never
behaves worse than O(N log N). QuickSort is often faster and just as adaptable,
but is not stable; the well-tuned version described by Robert Sedgewick and
included here is highly unlikely to actually show O(N^2) worst case behavior.
Next on my list would be MSD RadixSort; this one is specialized for strings,
but it's O(N) linear behavior will make it faster than anything else if you
have millions of strings to sort (and it's stable). Ternary QuickSort can also
be very fast, but may shine better in languages like C that offer faster byte-level
access to strings. Finally, HeapSort is compact in both code length and memory
demands, adaptable to all data types and never worse than O(N log N). Although
I include Comb, Shell and SelectionSort, I think they are outclassed by the
sorts mentioned above, and I include InsertionSort as an auxiliary to Merge,
Quick and Radix, and not for independent usage.
Except for MSD RadixSort and Ternary
QuickSort (which are specialized for strings), I provide two versions of each
sorta "pointerized" version set up for strings and a "direct"
version set up for longs. For example, pMergeSortS is a pointerized sort for
strings and MergeSortL is its direct counterpart for longs. The direct versions
rearrange the keys themselves, while the pointerized versions leave the keys
in their original positions and only rearrange pointers. The pointer approach
saves time for strings and doublesdata types that are longer than 4 byte
pointers. The direct approach is just as fast or faster for short data types
like integers, longs or singles. To adapt a string version to handle doubles
or a long version to handle integers or singles, just change the declaration
of the array that holds the keys.
Along with the sorts,
I include a SortBase file containing a number of support routines for generating,
listing or loading sets of random strings, longs and doubles. SortBase also
contains global constants and arrays used by the sorts, and examples of timing
routines for testing them. More details, references, usage hints and commented
VB code are found in the files for each sort. Enjoy.
SortBase - Support sorting routines
Code and details for CombSort
Code and details for HeapSort
Code and details for InsertionSort
Code and details for MergeSort
Code and details for MSD RadixSort
Code and details for QuickSort
Code and details for SelectionSort
Code and details for ShellSort
Code and details for ShuttleMergeSort
Code and details for Ternary QuickSort