Turn a Parsnip into a Turnip with Edit Distance Algorithms

Turn a Parsnip into a Turnip with Edit Distance Algorithms

f you do as much writing as I do, then you?re probably familiar with Microsoft Word?s tracking features. They let you easily see what?s changed in different versions of a Word file.

But what if you want to see what?s changed in a plain text file? What if you want to compare different versions of data files? What if your project no longer passes its unit tests and you want to see what changed in the source code files during in the last week?

If you have these files under change control, then you?re probably done because a decent change control system will highlight changes between different versions. If these files aren?t under change control, or you just like figuring out how these things work, you can build your own tool to see what?s changed.

This article explains how you can see what?s changed between two documents or two strings. It describes an algorithm that you can use to find differences and includes C# and Visual Basic examples in the source code download.

Edit Distance

The eventual goal of this article is to see how to documents differ but the algorithm I?m going to describe is easier to understand if you consider two strings instead, so I?ll start there. Once you know how to find the difference between two strings, you can generalize it to find the difference between two documents, or two of anything that are made up of things like letters or paragraphs.

When you ask for the difference between two strings, you really want the smallest difference. Obviously you could delete every letter from the first string and then insert every letter from the second to give the new string. That gives you the new string but doesn?t really help you understand how the two are related. If the two strings share many letters, then this solution doesn?t show you what has ?changed? to get from the first string to the second.

For example, to convert ?cat? into ?cart,? you could delete the c, a, and t, and then insert c, a, r, and t, which would require seven changes. It?s easy to see in this case that a much simpler solution is to simply insert the ?r? in ?cat? to get ?cart? in a single change. That more accurately tells you what changes between the two strings.

An edit distance is a measure of how different two strings are. There are several ways to define edit distance but for this article assume that it?s simply the smallest number of deletions and additions needed to convert one string into another. For example, the edit distance between ?cat? and ?cart? is 1.

For a simple case like the cat/cart conversion it?s easy to guess the edit distance. When the strings are less similar, it?s a bit harder to find the best solution. For example, one way to transform ?parsnip? into ?turnip? is to:

    1. Delete ?p? arsnip
    2. Delete ?a? rsnip
    3. Insert ?t? trsnip
    4. Insert ?u? tursnip
    5. Delete ?s? turnip

This gives an edit distance of 5, but is that the best solution possible? Looking at the letters, it?s not always obvious which changes give the best result.

One way to make finding the edit distance easier is to look at an edit graph that shows the possible transformations from one string to another. Figure 1 shows an edit graph for the parsnip/turnip transformation.

Figure 1. Turnip Transformation: The blue path through this edit graph shows the shortest way to transform ?parsnip? into ?turnip.?

To build the graph, make an array of nodes as shown in Figure 1. Write the letters of the starting string across the top and the letters in the finishing string down the left side. Draw links connecting each dot to those below and to the right.

Any point in the graph that corresponds to the same letter in both strings is called a match point. For example, ?parsnip? and ?turnip? both contain an ?r? so the node below the ?r? in ?parsnip? and to the right of the ?r? in ?turnip? is a match point. In Figure 1, the match points are shaded pink.

To finish the edit graph, add a link leading to each match point from the node that is above and to the left, as shown in Figure 1.

The graph looks confusing at first but it?s actually fairly simple. The goal is to follow a path from the upper left to the lower right corner. Each move to the right corresponds to removing a letter from the original string. In Figure 1, the first two moves to the right along the blue path correspond to removing the letters ?p? and ?a? from ?parsnip.?

Each move down corresponds to inserting a letter in the new string. In Figure 1, the next two moves along the blue path correspond to inserting the letters ?t? and ?u? to the string.

Diagonal moves correspond to leaving a letter unchanged. The next move along the blue path corresponds to leaving the ?r? alone.

With these rules, finding the edit distance and the smallest series of changes to convert the string is easy. Simply find the shortest path through the edit graph with right and downward links costing one and diagonal links costing nothing. To think of this in another way, you must find the path through the graph that uses the most diagonals.

If you think in those terms, then it?s easy to see that the blue path represents the best solution.

(Note that there may be more than one path with the same shortest distance through the graph. In that case, there are multiple ways to convert the first string into the second with the same cost.)

Programming Edit Distance

For any two strings, you could build a program that constructs the edit graph and then performs a shortest path calculation on it to find the best solution and the edit distance. For information on finding shortest paths in general, see my article Network Know-How: Finding Shortest Paths. The edit graph, however, has some special structure that we can use to our advantage.

The basic idea is to calculate the shortest paths between the start node in the upper left corner to every node in the graph. Because all of the links lead to the right and down (or both in the case of diagonals), we can calculate the distances starting in the upper left corner and sweeping to the right and down.

Start by labeling the upper left corner with the distance 0, because the path from that node to itself is zero.

Next label the nodes in the leftmost column with the values 1, 2, 3, and so forth. The paths to these nodes correspond to adding successive letters from the final string. In Figure 1, the ?t? node has cost 1, the ?u? node has cost 2, and so forth.

Similarly label the nodes across the top row of the graph with the values 1, 2, 3, and so forth. These paths correspond to removing letters from the initial string. Now you can begin sweeping through the graph to fill in the other values.

Consider each node in the second column from top to bottom. There are three possible paths that you could take to get to this node.

First, you could get to the node from the node above. That move corresponds to adding a new letter to the string so the cost to get to this node is 1 greater than the cost needed to get to the node above.

Second, you could get to the node from the node to the left. That move corresponds to removing a letter from the original string so the cost to get to this node is 1 greater than the cost needed to get to the node to the left.

Third, you could get to the node along a diagonal link if one ends at this node. Diagonal links have zero cost so the cost is the same as the cost to get to the node to the upper left.

Figure 2. Going the Distance: Program StringEditDistance transforms ?parsnip? into ?turnip? in 5 steps.

To find the best cost to get to this node, the algorithm considers all three of these possible paths and then picks the best one. It records the node?s new distance and, since you?ll probably want to know what the best path is, it records from which node the path came.

The StringEditDistance program shown in Figure 2 uses this algorithm. Notice that the result follows the blue path in Figure 1: it deletes ?pa,? adds ?tu,? and deletes ?s.?

Listing 1 shows the MakeEditGraph function that program StringEditDistance uses to build the edit graph and calculate node distances. It builds the graph?s array and initializes the leftmost column and top row. It then moves through the array from left-to-right filling in other node distances.

Listing 2 shows the DisplayResults routine that the program uses to display the results. It follows the best path through the edit graph backwards from the finish node to the start node. At each node it visits, it checks the type of change that was made to get to this node (deletion, insertion, or leaving a letter alone) and it adds an appropriate letter to the beginning of the output. As you can see in Figure 2, the program draws deleted letters in crossed-out red, inserted letters in underlined blue, and letters that are the same in both strings in black.

Comparing Files Line-by-Line

Example program FileStringEditDistance shown in Figure 3 uses the same algorithm to compare two files character-by-character. (If you look very closely at Figure 3, you?ll see that the program removed the first ?i? in the line that declares the variable distance1. See if you can figure out why. I?ll give you the answer near the end of this article.)

Figure 3. File Guile: Program FileStringEditDistance finds the difference between two files treated as strings.

Program FileStringEditDistance does a reasonable job of comparing files. By looking at Figure 3, you can easily tell which characters were added or deleted.

Unfortunately program FileStringEditDistance has a big drawback: it can potentially use a lot of memory. The files used in Figure 3 were relatively small but they hold more than 2,500 characters so the edit graph array contains more than 6.5 million entries. A typical chapter in one of my books could easily contain more than 50,000 characters. The edit graph for two versions of such a file would contain more than 2.5 billion entries! Not only would that use up a huge chunk of memory, but it would take the program quite a while to build the graph.

Fortunately it?s easy to modify the algorithm to compare the files one line at a time. Simply replace the two strings with two arrays of strings. Instead of comparing characters within two strings, you now compare strings within two arrays of strings.The FileLineEditDistance example program shown in Figure 4 uses this approach to compare files line-by-line.

Download the examples and take a look at their text to see the differences. In fact, you can use the FileLineEditDistance program to compare the files as shown in Figure 4.

Different Differences

Download the examples and give them a try. Then if you?re feeling adventurous, make some modifications. There are several variations that you might like to try.

Figure 4. A Fine Line: Program FileLineEditDistance compares files line-by-line instead of character-by-character.

For example, these examples find a single shortest path through the edit graph but not necessarily the best. For example, suppose you wanted to find the difference between the strings BBBBC and BC. The StringEditDistance program drops the middle Bs to give BBBBC but you might prefer that it drop the first Bs so it can keep the two remaining characters together as in BBBBC.

A variation of the FileLineEditDistance program might compare lines and, if the lines are different, then compare the text within the lines. The result would be almost the same as the result given by the FileStringEditDistance program but without the enormous edit graph.

Another popular variation on the FileLineEditDistance program displays the two files side-by-side with corresponding lines highlighted. (That?s what Visual Source Safe?s different tool does.)

If you keep all of your files in a source control system or you like Microsoft Word?s revision tracking system, then you may never need these techniques. But for other stray files, these methods can be useful. You can even add them to your applications so users can easily find the changes in different versions of scripts, memos, and other text they have saved.



Share the Post:
5G Innovations

GPU-Accelerated 5G in Japan

NTT DOCOMO, a global telecommunications giant, is set to break new ground in the industry as it prepares to launch a GPU-accelerated 5G network in

AI Ethics

AI Journalism: Balancing Integrity and Innovation

An op-ed, produced using Microsoft’s Bing Chat AI software, recently appeared in the St. Louis Post-Dispatch, discussing the potential concerns surrounding the employment of artificial

Savings Extravaganza

Big Deal Days Extravaganza

The highly awaited Big Deal Days event for October 2023 is nearly here, scheduled for the 10th and 11th. Similar to the previous year, this

5G Innovations

GPU-Accelerated 5G in Japan

NTT DOCOMO, a global telecommunications giant, is set to break new ground in the industry as it prepares to launch a GPU-accelerated 5G network in Japan. This innovative approach will

AI Ethics

AI Journalism: Balancing Integrity and Innovation

An op-ed, produced using Microsoft’s Bing Chat AI software, recently appeared in the St. Louis Post-Dispatch, discussing the potential concerns surrounding the employment of artificial intelligence (AI) in journalism. These

Savings Extravaganza

Big Deal Days Extravaganza

The highly awaited Big Deal Days event for October 2023 is nearly here, scheduled for the 10th and 11th. Similar to the previous year, this autumn sale has already created

Cisco Splunk Deal

Cisco Splunk Deal Sparks Tech Acquisition Frenzy

Cisco’s recent massive purchase of Splunk, an AI-powered cybersecurity firm, for $28 billion signals a potential boost in tech deals after a year of subdued mergers and acquisitions in the

Iran Drone Expansion

Iran’s Jet-Propelled Drone Reshapes Power Balance

Iran has recently unveiled a jet-propelled variant of its Shahed series drone, marking a significant advancement in the nation’s drone technology. The new drone is poised to reshape the regional

Solar Geoengineering

Did the Overshoot Commission Shoot Down Geoengineering?

The Overshoot Commission has recently released a comprehensive report that discusses the controversial topic of Solar Geoengineering, also known as Solar Radiation Modification (SRM). The Commission’s primary objective is to

Remote Learning

Revolutionizing Remote Learning for Success

School districts are preparing to reveal a substantial technological upgrade designed to significantly improve remote learning experiences for both educators and students amid the ongoing pandemic. This major investment, which

Revolutionary SABERS Transforming

SABERS Batteries Transforming Industries

Scientists John Connell and Yi Lin from NASA’s Solid-state Architecture Batteries for Enhanced Rechargeability and Safety (SABERS) project are working on experimental solid-state battery packs that could dramatically change the

Build a Website

How Much Does It Cost to Build a Website?

Are you wondering how much it costs to build a website? The approximated cost is based on several factors, including which add-ons and platforms you choose. For example, a self-hosted

Battery Investments

Battery Startups Attract Billion-Dollar Investments

In recent times, battery startups have experienced a significant boost in investments, with three businesses obtaining over $1 billion in funding within the last month. French company Verkor amassed $2.1

Copilot Revolution

Microsoft Copilot: A Suit of AI Features

Microsoft’s latest offering, Microsoft Copilot, aims to revolutionize the way we interact with technology. By integrating various AI capabilities, this all-in-one tool provides users with an improved experience that not

AI Girlfriend Craze

AI Girlfriend Craze Threatens Relationships

The surge in virtual AI girlfriends’ popularity is playing a role in the escalating issue of loneliness among young males, and this could have serious repercussions for America’s future. A

AIOps Innovations

Senser is Changing AIOps

Senser, an AIOps platform based in Tel Aviv, has introduced its groundbreaking AI-powered observability solution to support developers and operations teams in promptly pinpointing the root causes of service disruptions

Bebop Charging Stations

Check Out The New Bebob Battery Charging Stations

Bebob has introduced new 4- and 8-channel battery charging stations primarily aimed at rental companies, providing a convenient solution for clients with a large quantity of batteries. These wall-mountable and

Malyasian Networks

Malaysia’s Dual 5G Network Growth

On Wednesday, Malaysia’s Prime Minister Anwar Ibrahim announced the country’s plan to implement a dual 5G network strategy. This move is designed to achieve a more equitable incorporation of both

Advanced Drones Race

Pentagon’s Bold Race for Advanced Drones

The Pentagon has recently unveiled its ambitious strategy to acquire thousands of sophisticated drones within the next two years. This decision comes in response to Russia’s rapid utilization of airborne

Important Updates

You Need to See the New Microsoft Updates

Microsoft has recently announced a series of new features and updates across their applications, including Outlook, Microsoft Teams, and SharePoint. These new developments are centered around improving user experience, streamlining

Price Wars

Inside Hyundai and Kia’s Price Wars

South Korean automakers Hyundai and Kia are cutting the prices on a number of their electric vehicles (EVs) in response to growing price competition within the South Korean market. Many

Solar Frenzy Surprises

Solar Subsidy in Germany Causes Frenzy

In a shocking turn of events, the German national KfW bank was forced to discontinue its home solar power subsidy program for charging electric vehicles (EVs) after just one day,

Electric Spare

Electric Cars Ditch Spare Tires for Efficiency

Ira Newlander from West Los Angeles is thinking about trading in his old Ford Explorer for a contemporary hybrid or electric vehicle. However, he has observed that the majority of

Solar Geoengineering Impacts

Unraveling Solar Geoengineering’s Hidden Impacts

As we continue to face the repercussions of climate change, scientists and experts seek innovative ways to mitigate its impacts. Solar geoengineering (SG), a technique involving the distribution of aerosols

Razer Discount

Unbelievable Razer Blade 17 Discount

On September 24, 2023, it was reported that Razer, a popular brand in the premium gaming laptop industry, is offering an exceptional deal on their Razer Blade 17 model. Typically

Innovation Ignition

New Fintech Innovation Ignites Change

The fintech sector continues to attract substantial interest, as demonstrated by a dedicated fintech stage at a recent event featuring panel discussions and informal conversations with industry professionals. The gathering,

Import Easing

Easing Import Rules for Big Tech

India has chosen to ease its proposed restrictions on imports of laptops, tablets, and other IT hardware, allowing manufacturers like Apple Inc., HP Inc., and Dell Technologies Inc. more time