Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Finding and Eliminating Duplicate Data-3 : Page 3


advertisement
A Solution from Oracle Support
Does Oracle have a solution to our duplicates problem? I found an article called "Common SQL*Plus Questions and Answers" (Doc ID: 2194.1) on Oracle's Support Services MetaLink Web site. It uses the Oracle aggregate function MIN()—or the MAX() function—to solve the problem.

MIN() allows you to select one row per group—duplicates and non-duplicates—so that you get a list of all the rows you want to keep:

SELECT MIN(ID) AS ID, LastName, FirstName FROM Customers GROUP BY LastName, FirstName;

Listing 5 shows the output of the above code.



Now you just need to delete rows that are not in this list, using the last query as a subquery inside an antijoin (the NOT IN clause):

DELETE FROM Customers WHERE ID NOT IN (SELECT MIN(ID) FROM Customers GROUP BY LastName, FirstName);

However, an antijoin query with the NOT IN clause is inefficient to make this work. In our case two (!) full table scans need to be performed to resolve this SQL statement. That leads to substantial performance loss for big data sets. For performance testing I created the Customers data set with 500,000 rows and 45,000 duplicates (9 percent of the total). The above command ran for more than one hour with no results—except that it exhausted my patience—so I killed the process.

Another disadvantage of this syntax is that you can't control which row per group of duplicates you can keep in the database.

A PL/SQL Solution: Deleting Duplicate Data with a Stored Procedure
Let me give you an example of a PL/SQL stored procedure, called DeleteDuplicate (see Listing 6), that cleans up duplicates. The algorithm for this procedure is pretty straightforward:

  1. It selects the duplicate data in the cursor, sorted by duplicate key (LastName, FirstName in our case), as shown in Listing 4.

  2. It opens the cursor and fetches each row, one by one, in a loop.

  3. It compares the duplicate key value with the previously fetched one.

  4. If this is a first fetch, or the value is different, then that's the first row in a new group so it skips it and fetches the next row. Otherwise, it's a duplicate row within the same group, so it deletes it.
Let's run the stored procedure and check it against the Customers data:

BEGIN DeleteDuplicates; END; / SELECT LastName, FirstName, COUNT(*) FROM Customers GROUP BY LastName, FirstName HAVING COUNT(*) > 1;

The last SELECT statement returns no rows, so the duplicates are gone.

The main job of extracting duplicates in this procedure is done by a SQL statement, which is defined in the csr_Duplicates cursor. The PL/SQL procedural code is used only to implement the logic of deleting all rows in the group except the first one. Could it all be done by one SQL statement?



Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap
Thanks for your registration, follow us on our social networks to keep up-to-date