Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Finding and Eliminating Duplicate Data-4 : Page 4


advertisement
A SQL Solution: Deleting Duplicate Data with a Single SQL Statement Using RANK()
The Oracle 8i analytic function RANK() allows you to rank each item in a group. (For more information about RANK(), see my 10-Minute Solution, "Performing Top-N Queries in Oracle.") In our case, we are using this function to assign dynamically sequential numbers in the group of duplicates sorted by the primary key. With RANK(), grouping is specified in the PARTITION BY clause and sort order for ranking is specified in the ORDER BY clause:

SELECT ID, LastName, FirstName, RANK() OVER (PARTITION BY LastName, FirstName ORDER BY ID) SeqNumber FROM Customers ORDER BY LastName, FirstName;

Listing 7 shows the output of the above query.

Bingo! Now, values in the SeqNumber column, assigned by RANK(), allow you to separate all duplicate rows (SeqNumber > 1) from non-duplicates (SeqNumber = 1) and retrieve only those rows you want to delete:



SELECT ID, LastName, FirstName FROM (SELECT ID, LastName, FirstName, RANK() OVER (PARTITION BY LastName, FirstName ORDER BY ID) AS SeqNumber FROM Customers) WHERE SeqNumber > 1;

Listing 8 shows the output of the above code. It contains seven duplicate rows that have to be deleted. I tested this code on the Customers data set with 500,000 rows total and 45,000 duplicates, and it took only 77 seconds to count the duplicates.

Now you are ready to delete the duplicates by issuing the SQL DELETE command. Here is the first version of it, which executed (for me) in 135 seconds:

DELETE FROM CUSTOMERS WHERE ID IN (SELECT ID FROM (SELECT ID, LastName, FirstName, RANK() OVER (PARTITION BY LastName, FirstName ORDER BY ID) AS SeqNumber FROM Customers) WHERE SeqNumber > 1);

You may notice that the last two statements rank all the rows in the table, which is inefficient. Let's improve the last SQL SELECT statement by applying RANK() only to the groups of duplicates instead of all rows.

The following syntax is much more efficient, even though it's not as concise as the last SELECT ID above:

SELECT ID, LastName, FirstName FROM (SELECT ID, LastName, FirstName, RANK() OVER (PARTITION BY LastName, FirstName ORDER BY ID) AS SeqNumber FROM (SELECT ID, LastName, FirstName FROM Customers WHERE (LastName, FirstName) IN (SELECT LastName, FirstName FROM Customers GROUP BY LastName, FirstName HAVING COUNT(*) > 1))) WHERE SeqNumber > 1;

Counting the duplicates now took only 26 seconds, which amounted to a 67 percent performance gain.

Here is the improved SQL DELETE statement, which uses the improved SELECT statement as a subquery:

DELETE FROM Customers WHERE ID IN (SELECT ID FROM (SELECT ID, LastName, FirstName, RANK() OVER (PARTITION BY LastName, FirstName ORDER BY ID) AS SeqNumber FROM (SELECT ID, LastName, FirstName FROM Customers WHERE (LastName, FirstName) IN (SELECT LastName, FirstName FROM Customers GROUP BY LastName, FirstName HAVING COUNT(*) > 1))) WHERE SeqNumber > 1);

Now it took only 47 seconds to find and delete 45,000 duplicates from 500,000 rows, compared to the 135 seconds it took in my first version of DELETE. That's a significant performance gain (65 percent).

By comparison, the DeleteDuplicates stored procedure clocked in at 56 seconds, which is a little slower (19 percent) than just the SQL statement.

Deleting Duplicate Rows When There's No Primary Key

 
Although it's a sign of bad database design, you may have a table with no primary key. In that case, you can use this technique to delete duplicate rows. Read on...

Replacing the PL/SQL stored procedure with my single SQL statement will get you much more concise code and may improve your performance because there is no overhead caused by the PL/SQL-to-SQL context switch in the stored procedure. However, the performance comparison results between the SQL statement and the PL/SQL procedure may vary, depending on the data set size and percentage of duplicates. I would expect the PL/SQL procedure to get faster or even outperform the SQL statement if the number of duplicates is relatively small—i.e., 1 to 3 percent of all rows in the table.

What if your table doesn't have a primary key? You can use another technique as well (see sidebar).

RANK()'s Additional Capabilities
The RANK() function allows you to select the row per group of duplicates you want to keep. Let's say you need to keep the latest (or earliest) record determined by the value in the RecDate field. In this case you just need to include RecDate in the ORDER BY clause of RANK() in order to sort duplicates within each group by RecDate in DESCending (or ASCending) order, and then by ID.

Here is the syntax for keeping the latest record per group:

DELETE FROM Customers WHERE ID IN (SELECT ID FROM (SELECT ID, LastName, FirstName, RANK() OVER (PARTITION BY LastName, FirstName ORDER BY RecDate DESC, ID) AS SeqNumber FROM (SELECT ID, LastName, FirstName, RecDate FROM Customers WHERE (LastName, FirstName) IN (SELECT LastName, FirstName FROM Customers GROUP BY LastName, FirstName HAVING COUNT(*) > 1))) WHERE SeqNumber > 1);

The flexibility of this technique also allows you to control how many rows per group you want to keep in the table. Let's say you have a database with promotional or rebate list information and you have common business conditions to enforce, such as "limit five entries per household" or "limit three rebates per person." By specifying the number of rows to keep (3) in the WHERE and HAVING clauses, your SELECT statement will do the job again and delete all excessive (more than 3) rebate entries per person:

DELETE FROM Customers WHERE ID IN (SELECT ID FROM (SELECT ID, LastName, FirstName, RANK() OVER (PARTITION BY LastName, FirstName ORDER BY ID) AS SeqNumber FROM (SELECT ID, LastName, FirstName FROM Customers WHERE (LastName, FirstName) IN (SELECT LastName, FirstName FROM Customers GROUP BY LastName, FirstName HAVING COUNT(*) > 3))) WHERE SeqNumber > 3);

As you can see, using the RANK() function allows you to eliminate duplicates in a single SQL statement and gives you more capabilities by extending the power of your queries.


Boris Milrud has 10 years of experience as a software developer. He is a Senior Database Engineer with Callidus Software, Inc. located in San Jose CA. He is specializing in all aspects of Oracle database software development including database design, programming, optimization and tuning.
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap
Thanks for your registration, follow us on our social networks to keep up-to-date