Turn a Parsnip into a Turnip with Edit Distance Algorithms

Edit distance algorithms tell you how different two strings are from each other. That lets you see the differences between different versions of a string or file, or add differencing tools to your applications.

If you do as much writing as I do, then you’re probably familiar with Microsoft Word’s tracking features. They let you easily see what’s changed in different versions of a Word file.

But what if you want to see what’s changed in a plain text file? What if you want to compare different versions of data files? What if your project no longer passes its unit tests and you want to see what changed in the source code files during the last week?

If you have these files under change control, then you’re probably done because a decent change control system will highlight changes between different versions. If these files aren’t under change control, or you just like figuring out how these things work, you can build your own tool to see what’s changed.

This article explains how you can see what’s changed between two documents or two strings. It describes an algorithm that you can use to find differences and includes C# and Visual Basic examples in the source code download.

Edit Distance

The eventual goal of this article is to see how two documents differ, but the algorithm I’m going to describe is easier to understand if you consider two strings instead, so I’ll start there. Once you know how to find the difference between two strings, you can generalize it to find the difference between two documents, or between any two things made up of smaller units such as letters or paragraphs.

When you ask for the difference between two strings, you really want the smallest difference. Obviously you could delete every letter from the first string and then insert every letter from the second to give the new string. That gives you the new string but doesn’t really help you understand how the two are related. If the two strings share many letters, then this solution doesn’t show you what has “changed” to get from the first string to the second.

For example, to convert “cat” into “cart,” you could delete the c, a, and t, and then insert c, a, r, and t, which would require seven changes. It’s easy to see in this case that a much simpler solution is to simply insert the “r” in “cat” to get “cart” in a single change. That more accurately tells you what changes between the two strings.

An edit distance is a measure of how different two strings are. There are several ways to define edit distance but for this article assume that it’s simply the smallest number of deletions and additions needed to convert one string into another. For example, the edit distance between “cat” and “cart” is 1.
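To make that definition concrete, here is a minimal Python sketch that computes the insert-and-delete-only distance with a standard dynamic programming table. It is not the C# or Visual Basic code from the download, and the function name edit_distance is my own choice for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Smallest number of single-letter deletions and insertions
    needed to turn source into target (substitutions are not allowed)."""
    # prev[j] holds the distance from the letters of source seen so far
    # to the first j letters of target. With no source letters, the only
    # option is to insert all j target letters.
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        # With no target letters, delete all i source letters.
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            if s_ch == t_ch:
                # Shared letter: keep it at no cost.
                curr.append(prev[j - 1])
            else:
                # Either delete s_ch (prev[j]) or insert t_ch (curr[j - 1]).
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(edit_distance("cat", "cart"))  # 1
```

The sketch keeps only two rows of the table at a time, which is enough when you need the distance but not the list of changes.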

For a simple case like the cat/cart conversion it’s easy to guess the edit distance. When the strings are less similar, it’s a bit harder to find the best solution. For example, one way to transform “parsnip” into “turnip” is to delete the “p” and “a,” insert “t” and “u,” and then delete the “s.”

This gives an edit distance of 5, but is that the best solution possible? Looking at the letters, it’s not always obvious which changes give the best result.

One way to make finding the edit distance easier is to look at an edit graph that shows the possible transformations from one string to another. Figure 1 shows an edit graph for the parsnip/turnip transformation.

Figure 1. Turnip Transformation: The blue path through this edit graph shows the shortest way to transform “parsnip” into “turnip.”

To build the graph, make an array of nodes as shown in Figure 1. Write the letters of the starting string across the top and the letters of the finishing string down the left side. Draw links connecting each node to the nodes directly below it and directly to its right.

Any point in the graph that corresponds to the same letter in both strings is called a match point. For example, “parsnip” and “turnip” both contain an “r” so the node below the “r” in “parsnip” and to the right of the “r” in “turnip” is a match point. In Figure 1, the match points are shaded pink.

To finish the edit graph, add a link leading to each match point from the node that is above and to the left, as shown in Figure 1.
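The match points and their diagonal links can be listed mechanically. The following Python sketch assumes a numbering scheme of my own: nodes are (row, column) pairs, with row 0 and column 0 forming the top and left edges of the graph, so the node below the third letter of the starting string and to the right of the third letter of the finishing string is (3, 3). The helper names are hypothetical.

```python
def match_points(source: str, target: str) -> list[tuple[int, int]]:
    """Nodes (row, col) where letter number `col` of source equals
    letter number `row` of target (both counted from 1)."""
    return [
        (row, col)
        for row, t_ch in enumerate(target, start=1)
        for col, s_ch in enumerate(source, start=1)
        if s_ch == t_ch
    ]


def diagonal_links(source: str, target: str):
    """One link into each match point from the node above and to its left."""
    return [((row - 1, col - 1), (row, col)) for row, col in match_points(source, target)]


print(match_points("parsnip", "turnip"))
# [(3, 3), (4, 5), (5, 6), (6, 1), (6, 7)]
```

Under this numbering, “parsnip” and “turnip” produce five match points: one each for “r,” “n,” and “i,” and two for “p,” because “parsnip” contains that letter twice.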

The graph looks confusing at first but it’s actually fairly simple. The goal is to follow a path from the upper left to the lower right corner. Each move to the right corresponds to removing a letter from the original string. In Figure 1, the first two moves to the right along the blue path correspond to removing the letters “p” and “a” from “parsnip.”

Each move down corresponds to inserting a letter into the new string. In Figure 1, the next two moves along the blue path correspond to inserting the letters “t” and “u” into the string.

Diagonal moves correspond to leaving a letter unchanged. The next move along the blue path corresponds to leaving the “r” alone.

With these rules, finding the edit distance and the smallest series of changes to convert the string is easy. Simply find the shortest path through the edit graph with right and downward links costing one and diagonal links costing nothing. To think of this in another way, you must find the path through the graph that uses the most diagonals.
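As a rough illustration of that shortest-path search, the Python sketch below works backward from the lower right corner to find the cheapest cost from every node, then walks forward from the upper left corner along a cheapest path. The function name, the (row, column) node numbering, and the tie-breaking order (diagonal first, then right, then down) are all assumptions of this sketch, not details taken from the downloadable code.

```python
def edit_script(source: str, target: str):
    """Return (distance, steps) for one cheapest path through the edit graph.
    Steps are ("delete", ch) for a move right, ("insert", ch) for a move
    down, and ("keep", ch) for a diagonal move."""
    rows, cols = len(target), len(source)
    # cost[r][c] = cheapest cost from node (r, c) to the lower right corner.
    cost = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows, -1, -1):
        for c in range(cols, -1, -1):
            if r == rows and c == cols:
                continue  # the goal node costs nothing
            options = []
            if c < cols:
                options.append(1 + cost[r][c + 1])   # right link costs 1
            if r < rows:
                options.append(1 + cost[r + 1][c])   # down link costs 1
            if r < rows and c < cols and source[c] == target[r]:
                options.append(cost[r + 1][c + 1])   # diagonal link is free
            cost[r][c] = min(options)

    # Walk forward, always taking a move that stays on a cheapest path.
    steps = []
    r = c = 0
    while r < rows or c < cols:
        on_match = r < rows and c < cols and source[c] == target[r]
        if on_match and cost[r + 1][c + 1] == cost[r][c]:
            steps.append(("keep", source[c]))
            r, c = r + 1, c + 1
        elif c < cols and 1 + cost[r][c + 1] == cost[r][c]:
            steps.append(("delete", source[c]))
            c += 1
        else:
            steps.append(("insert", target[r]))
            r += 1
    return cost[0][0], steps


distance, steps = edit_script("parsnip", "turnip")
print(distance)  # 5
print(steps)
```

For “parsnip” and “turnip” this sketch deletes “p” and “a,” inserts “t” and “u,” keeps “r,” deletes “s,” and keeps “n,” “i,” and “p,” for a total cost of 5. A different tie-breaking order would produce a different but equally short list of steps.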

If you think in those terms, then it’s easy to see that the blue path represents the best solution.

(Note that there may be more than one path with the same shortest distance through the graph. In that case, there are multiple ways to convert the first string into the second with the same cost.)
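To see how many equally short paths a given pair of strings has, you can count them while computing the costs. The following Python sketch is a hypothetical helper that tracks, for each node, both the cheapest cost to the lower right corner and the number of distinct paths that achieve it. As before, the names and the (row, column) numbering are my own assumptions.

```python
def count_shortest_paths(source: str, target: str):
    """Return (distance, number of distinct cheapest paths) for the edit
    graph of source and target."""
    rows, cols = len(target), len(source)
    cost = [[0] * (cols + 1) for _ in range(rows + 1)]
    ways = [[0] * (cols + 1) for _ in range(rows + 1)]
    ways[rows][cols] = 1  # one way to stand still at the goal
    for r in range(rows, -1, -1):
        for c in range(cols, -1, -1):
            if r == rows and c == cols:
                continue
            moves = []  # (total cost via this link, paths via this link)
            if c < cols:
                moves.append((1 + cost[r][c + 1], ways[r][c + 1]))
            if r < rows:
                moves.append((1 + cost[r + 1][c], ways[r + 1][c]))
            if r < rows and c < cols and source[c] == target[r]:
                moves.append((cost[r + 1][c + 1], ways[r + 1][c + 1]))
            best = min(move_cost for move_cost, _ in moves)
            cost[r][c] = best
            # Add up the paths through every link that achieves the best cost.
            ways[r][c] = sum(n for move_cost, n in moves if move_cost == best)
    return cost[0][0], ways[0][0]


print(count_shortest_paths("parsnip", "turnip"))  # (5, 6)
```

The six paths for “parsnip” and “turnip” differ only in the order in which they delete “p” and “a” and insert “t” and “u” before reaching the “r.”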