RSS Feed
Download our iPhone app
Browse DevX
Sign up for e-mail newsletters from DevX


Finding and Eliminating Duplicate Data-3 : Page 3

A Solution from Oracle Support
Does Oracle have a solution to our duplicates problem? I found an article called "Common SQL*Plus Questions and Answers" (Doc ID: 2194.1) on Oracle's Support Services MetaLink Web site. It uses the Oracle aggregate function MIN()—or the MAX() function—to solve the problem.

MIN() allows you to select one row per group—duplicates and non-duplicates—so that you get a list of all the rows you want to keep:

SELECT MIN(ID) AS ID, LastName, FirstName
   FROM Customers
   GROUP BY LastName, FirstName;
Listing 5 shows the output of the above code.

Now you just need to delete rows that are not in this list, using the last query as a subquery inside an antijoin (the NOT IN clause):

       FROM Customers
	   GROUP BY LastName, FirstName);
However, an antijoin query with the NOT IN clause is inefficient to make this work. In our case two (!) full table scans need to be performed to resolve this SQL statement. That leads to substantial performance loss for big data sets. For performance testing I created the Customers data set with 500,000 rows and 45,000 duplicates (9 percent of the total). The above command ran for more than one hour with no results—except that it exhausted my patience—so I killed the process.

Another disadvantage of this syntax is that you can't control which row per group of duplicates you can keep in the database.

A PL/SQL Solution: Deleting Duplicate Data with a Stored Procedure
Let me give you an example of a PL/SQL stored procedure, called DeleteDuplicate (see Listing 6), that cleans up duplicates. The algorithm for this procedure is pretty straightforward:

  1. It selects the duplicate data in the cursor, sorted by duplicate key (LastName, FirstName in our case), as shown in Listing 4.

  2. It opens the cursor and fetches each row, one by one, in a loop.

  3. It compares the duplicate key value with the previously fetched one.

  4. If this is a first fetch, or the value is different, then that's the first row in a new group so it skips it and fetches the next row. Otherwise, it's a duplicate row within the same group, so it deletes it.
Let's run the stored procedure and check it against the Customers data:

SELECT LastName, FirstName, COUNT(*)
   FROM Customers
   GROUP BY LastName, FirstName
   HAVING COUNT(*) > 1;
The last SELECT statement returns no rows, so the duplicates are gone.

The main job of extracting duplicates in this procedure is done by a SQL statement, which is defined in the csr_Duplicates cursor. The PL/SQL procedural code is used only to implement the logic of deleting all rows in the group except the first one. Could it all be done by one SQL statement?

Close Icon
Thanks for your registration, follow us on our social networks to keep up-to-date