Finding and Eliminating Duplicate Data

During my years as a database developer I have often faced the task of finding and deleting duplicate data in a database. Duplicates impact data accuracy and credibility, and they can also affect database performance if they’re too numerous compared to the total number of rows in the table. The solution (described later) that I was able to find on Oracle Support Services’ MetaLink Web site is inefficient and works extremely slowly with big data sets.



How do you delete duplicate data in a database?



Use a PL/SQL solution (a custom stored procedure) or a SQL solution (using the new analytic function RANK() with nested subqueries) to eliminate duplicates in a single SQL statement and control which rows to keep in the database.
What Is Duplicate Data?
By “duplicate data” I mean rows in a table that contain identical information in a field, or combination of fields, that’s supposed to be unique. For example, it could be the Social Security Number field, or Last Name and First Name fields. We call this a duplicate key. Most tables in a database nowadays have a primary key constraint, which maintains a unique value for each row. From a database point of view each row is unique; however, from a user point of view these rows are duplicates because they contain identical duplicate key (First Name + Last Name) values, even though their IDs differ:

ID   Last Name  First Name  City       Phone
---- ---------  ----------  ---------  ----------
1005 Krieger    Jeff        San Ramon  9252997100
1012 Krieger    Jeff        San Ramon  9252997100
1017 Krieger    Jeff        San Ramon  9252997100

How is duplicate data commonly created? Usually there are two processes that may lead to this situation:
  • Loading and merging data from different sources.

  • Entering data in the system via a graphical user interface, where the system generates a unique number for each row and assigns it as a primary key.
In both cases, a unique constraint on the duplicate key is missing, which opens the gate for duplicate data.
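
To make the second scenario concrete, here is a minimal sketch of a table whose only constraint is a system-generated primary key. The column names follow the sample rows above; the table name Customers_Demo, the sequence Customers_Demo_Seq, and the column types are assumptions made for illustration and are not taken from Listing 1.

```sql
-- Hypothetical demo table: the only constraint is the surrogate primary key.
CREATE TABLE Customers_Demo (
   ID         NUMBER        PRIMARY KEY,
   LastName   VARCHAR2(30),
   FirstName  VARCHAR2(30),
   City       VARCHAR2(30),
   Phone      VARCHAR2(10)
);

-- Hypothetical sequence that plays the role of the ID generator.
CREATE SEQUENCE Customers_Demo_Seq START WITH 1001;

-- The same person is entered twice. Each row receives a fresh ID,
-- so the primary key is satisfied and both inserts succeed.
INSERT INTO Customers_Demo (ID, LastName, FirstName, City, Phone)
   VALUES (Customers_Demo_Seq.NEXTVAL, 'Krieger', 'Jeff', 'San Ramon', '9252997100');
INSERT INTO Customers_Demo (ID, LastName, FirstName, City, Phone)
   VALUES (Customers_Demo_Seq.NEXTVAL, 'Krieger', 'Jeff', 'San Ramon', '9252997100');
```

Nothing in this definition compares LastName and FirstName across rows, which is exactly the gap that lets duplicates in.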

How to Find Duplicate Data
Let’s create a table called Customers and populate it with data that intentionally contains some duplicates (see Listing 1). As you can see, Listing 1 does not contain any code that prevents entering duplicates. The code below creates a unique constraint on the LastName and FirstName fields, which you could use during the initial database design before data gets loaded into the table, in order to prevent entering duplicates into the database:

ALTER TABLE Customers
   ADD CONSTRAINT Customers_LastFirst
   UNIQUE (LastName, FirstName);

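As a rough illustration of what the constraint does once it is in place, the sketch below tries to insert a second Jeff Krieger. It assumes that Customers has the ID, LastName, and FirstName columns used in this article's queries, that the remaining columns accept NULL, and that ID 2001 is unused. All three are assumptions, since Listing 1 is not reproduced here.

```sql
-- Assumes the Customers_LastFirst constraint is enabled and that a row
-- for 'Krieger', 'Jeff' already exists in the table.
INSERT INTO Customers (ID, LastName, FirstName)
   VALUES (2001, 'Krieger', 'Jeff');
-- Expected to be rejected with ORA-00001 (unique constraint violated).
```

Keep in mind that the ALTER TABLE statement itself is rejected (ORA-02299, duplicate keys found) if the table already contains duplicates, so on an existing table the cleanup has to come first.
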
The duplicate key in the Customers table is the combination of LastName and FirstName. Let’s group data by duplicate key and count rows within each group:

SELECT LastName, FirstName, COUNT(*)
   FROM Customers
   GROUP BY LastName, FirstName
   ORDER BY LastName, FirstName;

Listing 2 shows the output of the above code. Three rows in the output have a count greater than 1, which means there are three groups of duplicates.
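
If a single number is more convenient than a listing, the same grouping can be summarized. The following is a small sketch, not part of the original listings; the aliases GroupSize and SurplusRows are made up for the example.

```sql
-- Each group of n identical keys contributes n - 1 surplus rows;
-- groups of size 1 contribute 0, so no extra filter is needed.
SELECT SUM(GroupSize - 1) AS SurplusRows
   FROM (SELECT COUNT(*) AS GroupSize
            FROM Customers
            GROUP BY LastName, FirstName);
```

The result is the number of rows that would have to be removed to leave exactly one row per LastName and FirstName combination.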

Let’s select only the groups of duplicates by using the HAVING clause, thus filtering out all “non-duplicate” data:

SELECT LastName, FirstName, COUNT(*)
   FROM Customers
   GROUP BY LastName, FirstName
   HAVING COUNT(*) > 1;

Listing 3 shows the output of the above code. However, these query results do not show the IDs that identify each row. Using the last query as a subquery inside an IN clause will do just that:

SELECT ID, LastName, FirstName
   FROM Customers
   WHERE (LastName, FirstName) IN
   (SELECT LastName, FirstName
       FROM Customers
       GROUP BY LastName, FirstName
       HAVING COUNT(*) > 1)
   ORDER BY LastName, FirstName;

Listing 4 shows the output of the above code. This query shows you that there are three groups of duplicates with ten rows total. We want to keep the first row in each group (IDs 1005, 1009, and 1001) and delete the other seven rows, with IDs 1012, 1017, 1010, 1011, 1016, 1019, and 1014.

A Solution from Oracle Support
Does Oracle have a solution to our duplicates problem? I found an article called “Common SQL*Plus Questions and Answers” (Doc ID: 2194.1) on Oracle’s Support Services MetaLink Web site. It uses the Oracle aggregate function MIN() (or the MAX() function) to solve the problem.

MIN() allows you to select one row per group, for duplicates and non-duplicates alike, so that you get a list of all the rows you want to keep:

SELECT MIN(ID) AS ID, LastName, FirstName
   FROM Customers
   GROUP BY LastName, FirstName;

Listing 5 shows the output of the above code.

Now you just need to delete rows that are not in this list, using the last query as a subquery inside an antijoin (the NOT IN clause):

DELETE FROM Customers
   WHERE ID NOT IN
   (SELECT MIN(ID)
       FROM Customers
       GROUP BY LastName, FirstName);

However, an antijoin query with the NOT IN clause is an inefficient way to do this. In our case two (!) full table scans need to be performed to resolve this SQL statement, which leads to a substantial performance loss for big data sets. For performance testing I created a Customers data set with 500,000 rows and 45,000 duplicates (9 percent of the total). The above command ran for more than one hour with no results (except that it exhausted my patience), so I killed the process.

Another disadvantage of this syntax is that you can’t control which row in each group of duplicates is kept in the database.
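
To check the claim about full table scans on your own system, you can ask the optimizer for its execution plan without running the DELETE. This is a minimal sketch that assumes a plan table already exists in your schema; the plan you see may differ from the one described above, depending on release, indexes, and statistics.

```sql
-- Record the execution plan without executing the statement.
EXPLAIN PLAN FOR
DELETE FROM Customers
   WHERE ID NOT IN
   (SELECT MIN(ID)
       FROM Customers
       GROUP BY LastName, FirstName);

-- DBMS_XPLAN is available in Oracle9i Release 2 and later.
-- On older releases, query PLAN_TABLE directly instead.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

Look for full-scan operations on CUSTOMERS in the output to see how many times the table is read.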

A PL/SQL Solution: Deleting Duplicate Data with a Stored Procedure
Let me give you an example of a PL/SQL stored procedure, called DeleteDuplicates (see Listing 6), that cleans up duplicates. The algorithm for this procedure is pretty straightforward:

  1. It selects the duplicate data in the cursor, sorted by duplicate key (LastName, FirstName in our case), as shown in Listing 4.

  2. It opens the cursor and fetches each row, one by one, in a loop.

  3. It compares the duplicate key value with the previously fetched one.

  4. If this is a first fetch, or the value is different, then that’s the first row in a new group so it skips it and fetches the next row. Otherwise, it’s a duplicate row within the same group, so it deletes it.
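
Listing 6 is not reproduced here, so the following is only a minimal sketch of the four steps above and not the original code. The procedure name DeleteDuplicatesSketch and the variable names are made up for the example. Adding ID to the sort order is an assumption, made so that the lowest ID in each group is the row that survives.

```sql
CREATE OR REPLACE PROCEDURE DeleteDuplicatesSketch
IS
   -- Step 1: rows that belong to a duplicate group, sorted by duplicate key.
   -- ID is added to the sort (an assumption) so the kept row is predictable.
   CURSOR csr_Duplicates IS
      SELECT ID, LastName, FirstName
         FROM Customers
         WHERE (LastName, FirstName) IN
         (SELECT LastName, FirstName
             FROM Customers
             GROUP BY LastName, FirstName
             HAVING COUNT(*) > 1)
         ORDER BY LastName, FirstName, ID;
   v_PrevLastName   Customers.LastName%TYPE;
   v_PrevFirstName  Customers.FirstName%TYPE;
   v_FirstFetch     BOOLEAN := TRUE;
BEGIN
   -- Step 2: the cursor FOR loop opens the cursor and fetches row by row.
   FOR rec IN csr_Duplicates LOOP
      -- Step 3: compare the current key with the previously fetched one.
      IF v_FirstFetch
         OR rec.LastName <> v_PrevLastName
         OR rec.FirstName <> v_PrevFirstName
      THEN
         -- Step 4a: first row of a new group, so keep it and remember its key.
         v_FirstFetch    := FALSE;
         v_PrevLastName  := rec.LastName;
         v_PrevFirstName := rec.FirstName;
      ELSE
         -- Step 4b: same key as the previous row, so this row is a duplicate.
         DELETE FROM Customers WHERE ID = rec.ID;
      END IF;
   END LOOP;
   -- No COMMIT here: the caller decides whether to commit or roll back.
END DeleteDuplicatesSketch;
/
```

Because the cursor only returns rows whose key occurs more than once, the loop never touches unique rows, and the comparisons never see NULL names.
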
Let’s run the stored procedure and check it against the Customers data:

BEGIN
   DeleteDuplicates;
END;
/

SELECT LastName, FirstName, COUNT(*)
   FROM Customers
   GROUP BY LastName, FirstName
   HAVING COUNT(*) > 1;

The last SELECT statement returns no rows, so the duplicates are gone.

The main job of extracting duplicates in this procedure is done by a SQL statement, which is defined in the csr_Duplicates cursor. The PL/SQL procedural code is used only to implement the logic of deleting all rows in the group except the first one. Could it all be done by one SQL statement?

A SQL Solution: Deleting Duplicate Data with a Single SQL Statement Using RANK()
The Oracle 8i analytic function RANK() allows you to rank each item in a group. (For more information about RANK(), see my 10-Minute Solution, “Performing Top-N Queries in Oracle.”) In our case, we are using this function to dynamically assign sequential numbers to the rows in each group of duplicates, sorted by the primary key. With RANK(), grouping is specified in the PARTITION BY clause and the sort order for ranking is specified in the ORDER BY clause:

SELECT ID, LastName, FirstName,
   RANK() OVER (PARTITION BY LastName,
       FirstName ORDER BY ID) AS SeqNumber
   FROM Customers
   ORDER BY LastName, FirstName;

Listing 7 shows the output of the above query.

Bingo! Now, values in the SeqNumber column, assigned by RANK(), allow you to separate all duplicate rows (SeqNumber > 1) from non-duplicates (SeqNumber = 1) and retrieve only those rows you want to delete:

SELECT ID, LastName, FirstName
   FROM
   (SELECT ID, LastName, FirstName,
      RANK() OVER (PARTITION BY LastName,
          FirstName ORDER BY ID) AS SeqNumber
      FROM Customers)
   WHERE SeqNumber > 1;

Listing 8 shows the output of the above code. It contains seven duplicate rows that have to be deleted. I tested this code on the Customers data set with 500,000 rows total and 45,000 duplicates, and it took only 77 seconds to count the duplicates.
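
The article does not say how the timings were taken. One simple way to get comparable numbers in SQL*Plus is its TIMING setting, sketched below. The alias DuplicateRows is made up for the example, and your elapsed times will depend on hardware, indexes, and data.

```sql
-- SQL*Plus setting: print the elapsed time after each statement.
SET TIMING ON

-- Count the rows that RANK() marks as duplicates.
SELECT COUNT(*) AS DuplicateRows
   FROM
   (SELECT ID,
      RANK() OVER (PARTITION BY LastName,
          FirstName ORDER BY ID) AS SeqNumber
      FROM Customers)
   WHERE SeqNumber > 1;

SET TIMING OFF
```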

Now you are ready to delete the duplicates by issuing the SQL DELETE command. Here is the first version of it, which executed (for me) in 135 seconds:

DELETE
   FROM Customers
   WHERE ID IN
   (SELECT ID
       FROM
       (SELECT ID, LastName, FirstName,
          RANK() OVER (PARTITION BY LastName,
              FirstName ORDER BY ID) AS SeqNumber
          FROM Customers)
       WHERE SeqNumber > 1);

You may notice that the last two statements rank all the rows in the table, which is inefficient. Let’s improve the last SQL SELECT statement by applying RANK() only to the groups of duplicates instead of all rows.

The following syntax is much more efficient, even though it’s not as concise as the last SELECT ID above:

SELECT ID, LastName, FirstName
   FROM
   (SELECT ID, LastName, FirstName,
      RANK() OVER (PARTITION BY LastName,
          FirstName ORDER BY ID) AS SeqNumber
      FROM
      (SELECT ID, LastName, FirstName
          FROM Customers
          WHERE (LastName, FirstName) IN
          (SELECT LastName, FirstName
              FROM Customers
              GROUP BY LastName, FirstName
              HAVING COUNT(*) > 1)))
   WHERE SeqNumber > 1;

Counting the duplicates now took only 26 seconds instead of 77, a performance gain of about 66 percent.

Here is the improved SQL DELETE statement, which uses the improved SELECT statement as a subquery:

DELETE
   FROM Customers
   WHERE ID IN
   (SELECT ID
       FROM
       (SELECT ID, LastName, FirstName,
          RANK() OVER (PARTITION BY LastName,
              FirstName ORDER BY ID) AS SeqNumber
          FROM
          (SELECT ID, LastName, FirstName
              FROM Customers
              WHERE (LastName, FirstName) IN
              (SELECT LastName, FirstName
                  FROM Customers
                  GROUP BY LastName, FirstName
                  HAVING COUNT(*) > 1)))
       WHERE SeqNumber > 1);

Now it took only 47 seconds to find and delete 45,000 duplicates from 500,000 rows, compared to the 135 seconds it took in my first version of DELETE. That’s a significant performance gain (65 percent).

By comparison, the DeleteDuplicates stored procedure clocked in at 56 seconds, which is a little slower (19 percent) than just the SQL statement.
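
Since a mass DELETE is hard to undo once committed, it can be worth running it as an explicit transaction and verifying the result first. The sketch below is one way to do that, assuming SQL*Plus with its default setting of AUTOCOMMIT OFF. For brevity it uses the simpler first version of the DELETE.

```sql
-- Step 1: delete the duplicates, but do not commit yet.
DELETE
   FROM Customers
   WHERE ID IN
   (SELECT ID
       FROM
       (SELECT ID,
          RANK() OVER (PARTITION BY LastName,
              FirstName ORDER BY ID) AS SeqNumber
          FROM Customers)
       WHERE SeqNumber > 1);

-- Step 2: verify. If no rows come back, no duplicate keys remain.
SELECT LastName, FirstName, COUNT(*)
   FROM Customers
   GROUP BY LastName, FirstName
   HAVING COUNT(*) > 1;

-- Step 3: make the change permanent.
-- If anything looks wrong, issue ROLLBACK instead of COMMIT.
COMMIT;
```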

Deleting Duplicate Rows When There’s No Primary Key

 
Although it’s a sign of bad database design, you may have a table with no primary key. In that case, you can use this technique to delete duplicate rows.

Replacing the PL/SQL stored procedure with my single SQL statement will get you much more concise code and may improve your performance because there is no overhead caused by the PL/SQL-to-SQL context switch in the stored procedure. However, the performance comparison results between the SQL statement and the PL/SQL procedure may vary, depending on the data set size and percentage of duplicates. I would expect the PL/SQL procedure to get faster, or even to outperform the SQL statement, if the number of duplicates is relatively small (for example, 1 to 3 percent of all rows in the table).
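
To illustrate what the context switch means in practice: a row-by-row loop hands one DELETE at a time from the PL/SQL engine to the SQL engine, whereas a bulk bind passes the whole list of IDs in one step. The following anonymous block is a sketch of that idea using BULK COLLECT and FORALL. It is not part of the original article or its listings, the type and variable names are made up, and it assumes Oracle9i or later, since earlier releases may not accept analytic functions in static SQL inside PL/SQL.

```sql
DECLARE
   -- Hypothetical collection type holding the IDs to delete.
   TYPE t_IdList IS TABLE OF Customers.ID%TYPE;
   v_Ids  t_IdList;
BEGIN
   -- One switch to the SQL engine: fetch all duplicate IDs at once.
   SELECT ID BULK COLLECT INTO v_Ids
      FROM
      (SELECT ID,
         RANK() OVER (PARTITION BY LastName,
             FirstName ORDER BY ID) AS SeqNumber
         FROM Customers)
      WHERE SeqNumber > 1;
   -- One more switch: send every DELETE to the SQL engine as a single batch.
   -- With an empty list the range 1 .. 0 is empty and nothing is executed.
   FORALL i IN 1 .. v_Ids.COUNT
      DELETE FROM Customers WHERE ID = v_Ids(i);
   -- No COMMIT here: the caller decides whether to commit or roll back.
END;
/
```

For very large duplicate sets the collection is held in session memory, so a single SQL statement remains the simpler choice.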

What if your table doesn’t have a primary key? You can use another technique as well (see sidebar).
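
The sidebar is not reproduced here. One commonly used approach, which may or may not be the one the sidebar describes, relies on Oracle's ROWID pseudocolumn as a stand-in for the missing primary key. The sketch below applies the same RANK() pattern to ROWID; the alias RID is made up for the example.

```sql
-- ROWID identifies each physical row, so it can replace ID
-- when the table has no primary key.
DELETE
   FROM Customers
   WHERE ROWID IN
   (SELECT RID
       FROM
       (SELECT ROWID AS RID,
          RANK() OVER (PARTITION BY LastName,
              FirstName ORDER BY ROWID) AS SeqNumber
          FROM Customers)
       WHERE SeqNumber > 1);
```

Which row survives in each group is then decided by physical row address rather than by any business rule, so add a meaningful column to the ORDER BY if it matters which copy is kept.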

RANK()’s Additional Capabilities
The RANK() function allows you to select the row per group of duplicates you want to keep. Let’s say you need to keep the latest (or earliest) record determined by the value in the RecDate field. In this case you just need to include RecDate in the ORDER BY clause of RANK() in order to sort duplicates within each group by RecDate in DESCending (or ASCending) order, and then by ID.

Here is the syntax for keeping the latest record per group:

DELETE
   FROM Customers
   WHERE ID IN
   (SELECT ID
       FROM
       (SELECT ID, LastName, FirstName,
          RANK() OVER (PARTITION BY LastName,
              FirstName ORDER BY RecDate DESC, ID) AS SeqNumber
          FROM
          (SELECT ID, LastName, FirstName, RecDate
              FROM Customers
              WHERE (LastName, FirstName) IN
              (SELECT LastName, FirstName
                  FROM Customers
                  GROUP BY LastName, FirstName
                  HAVING COUNT(*) > 1)))
       WHERE SeqNumber > 1);

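The ID at the end of the ORDER BY clause matters more than it may seem. RANK() gives equal rank to rows that tie on the sort columns, so if two rows in a group share the same latest RecDate and you sort by RecDate alone, both get rank 1 and both survive the DELETE. The query below is a sketch for comparing the numbering schemes side by side. It assumes the RecDate column mentioned above exists in Customers, and the three column aliases are made up for the example.

```sql
-- Compare three numbering schemes within each duplicate group.
SELECT ID, LastName, FirstName, RecDate,
   RANK() OVER (PARTITION BY LastName, FirstName
       ORDER BY RecDate DESC) AS RankByDateOnly,
   RANK() OVER (PARTITION BY LastName, FirstName
       ORDER BY RecDate DESC, ID) AS RankByDateAndID,
   ROW_NUMBER() OVER (PARTITION BY LastName, FirstName
       ORDER BY RecDate DESC, ID) AS RowNumByDateAndID
   FROM Customers
   ORDER BY LastName, FirstName, RecDate DESC, ID;
```

Because ID is unique, the last two columns always agree. ROW_NUMBER() is an alternative analytic function that never produces ties, whatever the sort columns are.
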
The flexibility of this technique also allows you to control how many rows per group you want to keep in the table. Let’s say you have a database with promotional or rebate list information and you have common business conditions to enforce, such as “limit five entries per household” or “limit three rebates per person.” By specifying the number of rows to keep (3) in both the HAVING clause and the outer WHERE clause, you can make the DELETE statement remove all excess rebate entries (those beyond the first three) per person:

DELETE
   FROM Customers
   WHERE ID IN
   (SELECT ID
       FROM
       (SELECT ID, LastName, FirstName,
          RANK() OVER (PARTITION BY LastName,
              FirstName ORDER BY ID) AS SeqNumber
          FROM
          (SELECT ID, LastName, FirstName
              FROM Customers
              WHERE (LastName, FirstName) IN
              (SELECT LastName, FirstName
                  FROM Customers
                  GROUP BY LastName, FirstName
                  HAVING COUNT(*) > 3)))
       WHERE SeqNumber > 3);

As you can see, using the RANK() function allows you to eliminate duplicates in a single SQL statement and gives you more capabilities by extending the power of your queries.
