A Solution from Oracle Support
Does Oracle have a solution to our duplicates problem? I found an article called "Common SQL*Plus Questions and Answers" (Doc ID: 2194.1) on Oracle's Support Services MetaLink Web site
. It uses the Oracle aggregate function MIN()or the MAX() functionto solve the problem.
MIN() allows you to select one row per groupduplicates and non-duplicatesso that you get a list of all the rows you want to keep:
SELECT MIN(ID) AS ID, LastName, FirstName
GROUP BY LastName, FirstName;
shows the output of the above code.
Now you just need to delete rows that are not in this list, using the last query as a subquery inside an antijoin (the NOT IN clause):
DELETE FROM Customers
WHERE ID NOT IN
GROUP BY LastName, FirstName);
However, an antijoin query with the NOT IN clause is inefficient to make this work. In our case two (!) full table scans need to be performed to resolve this SQL statement. That leads to substantial performance loss for big data sets. For performance testing I created the Customers data set with 500,000 rows and 45,000 duplicates (9 percent of the total). The above command ran for more than one hour with no resultsexcept that it exhausted my patienceso I killed the process.
Another disadvantage of this syntax is that you can't control which row per group of duplicates you can keep in the database.
A PL/SQL Solution: Deleting Duplicate Data with a Stored Procedure
Let me give you an example of a PL/SQL stored procedure, called DeleteDuplicate (see Listing 6), that cleans up duplicates. The algorithm for this procedure is pretty straightforward:
- It selects the duplicate data in the cursor, sorted by duplicate key (LastName, FirstName in our case), as shown in Listing 4.
- It opens the cursor and fetches each row, one by one, in a loop.
- It compares the duplicate key value with the previously fetched one.
- If this is a first fetch, or the value is different, then that's the first row in a new group so it skips it and fetches the next row. Otherwise, it's a duplicate row within the same group, so it deletes it.
Let's run the stored procedure and check it against the Customers data:
SELECT LastName, FirstName, COUNT(*)
GROUP BY LastName, FirstName
HAVING COUNT(*) > 1;
The last SELECT statement returns no rows, so the duplicates are gone.
The main job of extracting duplicates in this procedure is done by a SQL statement, which is defined in the csr_Duplicates cursor. The PL/SQL procedural code is used only to implement the logic of deleting all rows in the group except the first one. Could it all be done by one SQL statement?