MSD RadixSort - An algorithm that can achieve O(N) behavior

' MSD RadixSort. No sort based on comparisons can be faster than O(N log N).
' RadixSort makes no comparisons and can therefore achieve O(N) behavior. To
' do this, it examines keys one byte at a time, counting the number of keys
' that have each possible byte value. The counts are then used to build an
' offset table specifying the sorted order of the keys. It's easiest to work
' backwards from the least significant digit of the keys (LSD Radix),
' since order based on more significant digits is not disturbed by less
' significant digits. Unfortunately, the LSD approach requires padding short
' keys if key length is variable, and guarantees that all digits will be
' examined even if the first 3-4 digits contain all the information needed to
' achieve sorted order. Most significant digit (MSD) RadixSort takes a lot
' more bookkeeping -- the list must repeatedly be split into sublists for each
' value of the last digit processed -- but the pay-off is that only as many
' digits will be examined as are needed. As in QuickSort and MergeSort,
' it's worthwhile to hand off the sublists to InsertionSort when they get
' short enough.
'
' MSD RadixSort is stable and runs in linear O(N) time. It is fairly memory
' intensive, needing space for some extra counting arrays and stack space for
' recursive calls, but still uses less memory than MergeSort. This version is
' set up for strings and would take some work to adapt to other data types.
' Integers and longs could be handled by converting them to strings (and
' limiting the arrays CNT and IND to the ten numerical digit values plus the
' minus sign); this is probably worthwhile only if you have hundreds of
' thousands of integers or longs to sort. Doubles are more of a challenge,
' requiring conversion to an array of bytes by type casting. I have not
' pursued this, since VBA on the Mac does not include the CopyMemory function
' that enables type casting on Windows systems.
'
' NOTE: This version is set up to count byte values from 1 to 127. If your
' keys use (for instance) only lowercase or only uppercase alphabetic
' characters or only numerical digits, you can trim the CNT and IND arrays and
' your sorts will run correspondingly faster.
'
' Reference: P. M. McIlroy, K. Bostic and M. D. McIlroy,
' "Engineering Radix Sort", Computing Systems 6(1):5-27 (1993).
'
' Speed: MSDRadixSortS sorts 500,000 random strings in 28.3 sec; sorts 100186
' library call numbers in 21.3 sec; sorts 25479 dictionary words in 1.8 sec
' (random order), 1.5 sec (presorted) or 1.9 sec (reverse sorted). Timed in
' Excel 2001 on an 800 mhz PowerBook.
'
' Bottom line: complex and best suited to strings, but there's nothing faster
' for really long lists.
' Usage:
Dim S1(L To R) As Strings
Dim B1(1 To nChars) As Byte
Dim P1(L To R) As Long
For I = L To R
S1(I) = GetRandomString()
Next I
StrsToBytes S1, B1, P1 'a routine that stores the strings in 0 terminated
' byte arrays with
'P1() holding pointers to the start of each byte series.
MSDRadixSortS L, R, B1, P1
' CODE:
Sub MSDRadixSortS(N As Long)
Dim CH() As Integer
ReDim CH(1 To N)
'See below for type PILE (stack records)
ReDim STACK(1 To 1000)
StackPtr = 1
With STACK(StackPtr)
.L = 1
.R = N
.D = 0
End With
StackPtr = StackPtr + 1
RadixS CH
End Sub
Sub RadixS(CH() As Integer)
Dim L As Long
Dim R As Long
Dim D As Integer
Dim I As Long
Dim J As Long
Dim TMP As Long
Dim C As Integer
Dim NextC As Integer
Dim CNT(-1 To 127) As Long
Dim IND(-1 To 127) As Long
Dim NuCnt As Long
Dim BigCnt As Long
Dim OldSp As Long
Dim BigSp As Long
While StackPtr > 1
'Pop a PILE off the stack.
StackPtr = StackPtr - 1
'Get left and right limits of sublist and depth of byte to examine
With STACK(StackPtr)
L = .L
R = .R
D = .D
End With
'Sublists of <= 24 keys will be finished by InsertionSort
If R - L > 24 Then
'Clear the count array.
For I = -1 To 127
CNT(I) = 0
Next I
'Get the byte at depth D in each key and count the number of each byte
' value.
For I = L To R
C = CInt(B(P(I) + D))
CH(I) = C
CNT(C) = CNT(C) + 1
Next I
'We will add the counts to create sorted addresses in array IND().
IND(-1) = L
'At the same time we'll push the sublists for each byte value onto the
' stack.
OldSp = StackPtr
'We'll track the biggest sublist and make sure it comes off the stack
' last;
'this ensures the stack will not get more than logarithmically deep.
BigCnt = 0
'Now we build the addresses out of the counts.
For I = 0 To 127
J = I - 1
NuCnt = CNT(J)
IND(I) = IND(J) + NuCnt
'If there is a sublist of keys starting with the current byte value,
' stack it.
If NuCnt > 1 Then
With STACK(StackPtr)
.L = IND(J)
.R = IND(I) - 1
.D = D + 1
End With
StackPtr = StackPtr + 1
'Keep track of the largest count / sublist.
If NuCnt > BigCnt Then
BigCnt = NuCnt
BigSp = StackPtr
End If
End If
Next I
'Swap the biggest sublist down into the stack so it comes off last.
TMP = BigSp
BigSp = OldSp
OldSp = TMP
'Now use the counts to move the pointers to their sorted positions
' based on the
'bytes examined so far; doing this in place this gets a bit ugly.
For I = L To R
'We will use the byte at I to find what address P(I) should be mapped
' to.
C = CH(I)
'We use -1 to flag pointers already moved.
CH(I) = -1
'If C = -1 we skip the loop and increment I until we find a pointer
' not already moved.
Do While C >= 0
'We go to IND(C) to get the destination address for P(I).
J = IND(C)
'We swap the current pointer for the one at that address.
TMP = P(I)
P(I) = P(J)
P(J) = TMP
'Now we determine where to map the pointer we just displaced.
'We get the byte value from its key.
NextC = CH(J)
'We flag it as moved.
CH(J) = -1
'Once we've used each address we increment it unless we've hit R.
If J < R Then IND(C) = J + 1 Else Exit Do
'We set C to the byte of the key of the displaced pointer; ready
' to loop!
C = NextC
Loop
'If a series of displacements circles back to a pointer already moved,
' we
'increment I until we find a pointer not yet moved or until we end at
' R.
Next I
Else
DeepInsertS B, P, L, 1 + R - L, D
End If
Wend
End Sub
Type PILE
L As Long
R As Long
D As Integer
End Type
Dim STACK() As PILE
Dim StackPtr As Integer
Sub DeepInsertS(B() As Byte, P() As Long, L As Long, N As Long, D As Integer)
Dim LP As Long
Dim RP As Long
Dim TMP As Long
Dim I As Long
Dim J As Long
For RP = L + 1 To L + N - 1
TMP = P(RP)
For LP = RP To L + 1 Step -1
I = TMP + D
J = P(LP - 1) + D
Do While B(I) = B(J)
If B(I) = 0 Or B(J) = 0 Then Exit Do
I = I + 1
J = J + 1
Loop
If CInt(B(I)) - CInt(B(J)) < 0 Then P(LP) = P(LP - 1) Else Exit For
Next LP
P(LP) = TMP
Next RP
End Sub