f you do as much writing as I do, then you?re probably familiar with Microsoft Word?s tracking features. They let you easily see what?s changed in different versions of a Word file.
But what if you want to see what?s changed in a plain text file? What if you want to compare different versions of data files? What if your project no longer passes its unit tests and you want to see what changed in the source code files during in the last week?
If you have these files under change control, then you?re probably done because a decent change control system will highlight changes between different versions. If these files aren?t under change control, or you just like figuring out how these things work, you can build your own tool to see what?s changed.
This article explains how you can see what?s changed between two documents or two strings. It describes an algorithm that you can use to find differences and includes C# and Visual Basic examples in the source code download.
The eventual goal of this article is to see how to documents differ but the algorithm I?m going to describe is easier to understand if you consider two strings instead, so I?ll start there. Once you know how to find the difference between two strings, you can generalize it to find the difference between two documents, or two of anything that are made up of things like letters or paragraphs.
When you ask for the difference between two strings, you really want the smallest difference. Obviously you could delete every letter from the first string and then insert every letter from the second to give the new string. That gives you the new string but doesn?t really help you understand how the two are related. If the two strings share many letters, then this solution doesn?t show you what has ?changed? to get from the first string to the second.
For example, to convert ?cat? into ?cart,? you could delete the c, a, and t, and then insert c, a, r, and t, which would require seven changes. It?s easy to see in this case that a much simpler solution is to simply insert the ?r? in ?cat? to get ?cart? in a single change. That more accurately tells you what changes between the two strings.
An edit distance is a measure of how different two strings are. There are several ways to define edit distance but for this article assume that it?s simply the smallest number of deletions and additions needed to convert one string into another. For example, the edit distance between ?cat? and ?cart? is 1.
For a simple case like the cat/cart conversion it?s easy to guess the edit distance. When the strings are less similar, it?s a bit harder to find the best solution. For example, one way to transform ?parsnip? into ?turnip? is to:
- 1. Delete ?p? arsnip
2. Delete ?a? rsnip
3. Insert ?t? trsnip
4. Insert ?u? tursnip
5. Delete ?s? turnip
This gives an edit distance of 5, but is that the best solution possible? Looking at the letters, it?s not always obvious which changes give the best result.
One way to make finding the edit distance easier is to look at an edit graph that shows the possible transformations from one string to another. Figure 1 shows an edit graph for the parsnip/turnip transformation.
|Figure 1. Turnip Transformation: The blue path through this edit graph shows the shortest way to transform ?parsnip? into ?turnip.?|
To build the graph, make an array of nodes as shown in Figure 1. Write the letters of the starting string across the top and the letters in the finishing string down the left side. Draw links connecting each dot to those below and to the right.
Any point in the graph that corresponds to the same letter in both strings is called a match point. For example, ?parsnip? and ?turnip? both contain an ?r? so the node below the ?r? in ?parsnip? and to the right of the ?r? in ?turnip? is a match point. In Figure 1, the match points are shaded pink.
To finish the edit graph, add a link leading to each match point from the node that is above and to the left, as shown in Figure 1.
The graph looks confusing at first but it?s actually fairly simple. The goal is to follow a path from the upper left to the lower right corner. Each move to the right corresponds to removing a letter from the original string. In Figure 1, the first two moves to the right along the blue path correspond to removing the letters ?p? and ?a? from ?parsnip.?
Each move down corresponds to inserting a letter in the new string. In Figure 1, the next two moves along the blue path correspond to inserting the letters ?t? and ?u? to the string.
Diagonal moves correspond to leaving a letter unchanged. The next move along the blue path corresponds to leaving the ?r? alone.
With these rules, finding the edit distance and the smallest series of changes to convert the string is easy. Simply find the shortest path through the edit graph with right and downward links costing one and diagonal links costing nothing. To think of this in another way, you must find the path through the graph that uses the most diagonals.
If you think in those terms, then it?s easy to see that the blue path represents the best solution.
(Note that there may be more than one path with the same shortest distance through the graph. In that case, there are multiple ways to convert the first string into the second with the same cost.)
Programming Edit Distance
For any two strings, you could build a program that constructs the edit graph and then performs a shortest path calculation on it to find the best solution and the edit distance. For information on finding shortest paths in general, see my article Network Know-How: Finding Shortest Paths. The edit graph, however, has some special structure that we can use to our advantage.
The basic idea is to calculate the shortest paths between the start node in the upper left corner to every node in the graph. Because all of the links lead to the right and down (or both in the case of diagonals), we can calculate the distances starting in the upper left corner and sweeping to the right and down.
Start by labeling the upper left corner with the distance 0, because the path from that node to itself is zero.
Next label the nodes in the leftmost column with the values 1, 2, 3, and so forth. The paths to these nodes correspond to adding successive letters from the final string. In Figure 1, the ?t? node has cost 1, the ?u? node has cost 2, and so forth.
Similarly label the nodes across the top row of the graph with the values 1, 2, 3, and so forth. These paths correspond to removing letters from the initial string. Now you can begin sweeping through the graph to fill in the other values.
Consider each node in the second column from top to bottom. There are three possible paths that you could take to get to this node.
First, you could get to the node from the node above. That move corresponds to adding a new letter to the string so the cost to get to this node is 1 greater than the cost needed to get to the node above.
Second, you could get to the node from the node to the left. That move corresponds to removing a letter from the original string so the cost to get to this node is 1 greater than the cost needed to get to the node to the left.
Third, you could get to the node along a diagonal link if one ends at this node. Diagonal links have zero cost so the cost is the same as the cost to get to the node to the upper left.
|Figure 2. Going the Distance: Program StringEditDistance transforms ?parsnip? into ?turnip? in 5 steps.|
To find the best cost to get to this node, the algorithm considers all three of these possible paths and then picks the best one. It records the node?s new distance and, since you?ll probably want to know what the best path is, it records from which node the path came.
The StringEditDistance program shown in Figure 2 uses this algorithm. Notice that the result follows the blue path in Figure 1: it deletes ?pa,? adds ?tu,? and deletes ?s.?
Listing 1 shows the MakeEditGraph function that program StringEditDistance uses to build the edit graph and calculate node distances. It builds the graph?s array and initializes the leftmost column and top row. It then moves through the array from left-to-right filling in other node distances.
Listing 2 shows the DisplayResults routine that the program uses to display the results. It follows the best path through the edit graph backwards from the finish node to the start node. At each node it visits, it checks the type of change that was made to get to this node (deletion, insertion, or leaving a letter alone) and it adds an appropriate letter to the beginning of the output. As you can see in Figure 2, the program draws deleted letters in crossed-out red, inserted letters in underlined blue, and letters that are the same in both strings in black.
Comparing Files Line-by-Line
Example program FileStringEditDistance shown in Figure 3 uses the same algorithm to compare two files character-by-character. (If you look very closely at Figure 3, you?ll see that the program removed the first ?i? in the line that declares the variable distance1. See if you can figure out why. I?ll give you the answer near the end of this article.)
|Figure 3. File Guile: Program FileStringEditDistance finds the difference between two files treated as strings.|
Program FileStringEditDistance does a reasonable job of comparing files. By looking at Figure 3, you can easily tell which characters were added or deleted.
Unfortunately program FileStringEditDistance has a big drawback: it can potentially use a lot of memory. The files used in Figure 3 were relatively small but they hold more than 2,500 characters so the edit graph array contains more than 6.5 million entries. A typical chapter in one of my books could easily contain more than 50,000 characters. The edit graph for two versions of such a file would contain more than 2.5 billion entries! Not only would that use up a huge chunk of memory, but it would take the program quite a while to build the graph.
Fortunately it?s easy to modify the algorithm to compare the files one line at a time. Simply replace the two strings with two arrays of strings. Instead of comparing characters within two strings, you now compare strings within two arrays of strings.The FileLineEditDistance example program shown in Figure 4 uses this approach to compare files line-by-line.
Download the examples and take a look at their text to see the differences. In fact, you can use the FileLineEditDistance program to compare the files as shown in Figure 4.
Download the examples and give them a try. Then if you?re feeling adventurous, make some modifications. There are several variations that you might like to try.
|Figure 4. A Fine Line: Program FileLineEditDistance compares files line-by-line instead of character-by-character.|
For example, these examples find a single shortest path through the edit graph but not necessarily the best. For example, suppose you wanted to find the difference between the strings BBBBC and BC. The StringEditDistance program drops the middle Bs to give BBBBC but you might prefer that it drop the first Bs so it can keep the two remaining characters together as in BBBBC.
A variation of the FileLineEditDistance program might compare lines and, if the lines are different, then compare the text within the lines. The result would be almost the same as the result given by the FileStringEditDistance program but without the enormous edit graph.
Another popular variation on the FileLineEditDistance program displays the two files side-by-side with corresponding lines highlighted. (That?s what Visual Source Safe?s different tool does.)
If you keep all of your files in a source control system or you like Microsoft Word?s revision tracking system, then you may never need these techniques. But for other stray files, these methods can be useful. You can even add them to your applications so users can easily find the changes in different versions of scripts, memos, and other text they have saved.