dcsimg
LinkedIn
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Finding and Eliminating Duplicate Data-2 : Page 2


advertisement
What Is Duplicate Data?
By "duplicate data" I mean rows in a table that contain identical information in a field, or combination of fields, that's supposed to be unique. For example, it could be the Social Security Number field, or Last Name and First Name fields. We call this a duplicate key. Most tables in a database nowadays have a primary key constraint, which maintains a unique value for each row. From a database point of view each row is unique; however, from a user point of view these rows are duplicates because they contain identical duplicate key (First Name + Last Name) values, even though their IDs differ:
ID   Last Name       First Name City            Phone
---- --------------- ---------- --------------- ----------
1005 Krieger         Jeff       San Ramon       9252997100
1012 Krieger         Jeff       San Ramon       9252997100
1017 Krieger         Jeff       San Ramon       9252997100
How is duplicate data commonly created? Usually there are two processes that may lead to this situation:
  • Loading and merging data from different sources.

  • Entering data in the system via a graphical user interface, where the system generates a unique number for each row and assigns it as a primary key.
In both cases, unique constraint is missing, which opens the gate for duplicate data.

How to Find Duplicate Data
Let's create a table called Customers and populate it with data that intentionally contains some duplicates (see Listing 1). As you can see, Listing 1 does not contain any code that prevents entering duplicates. The code below creates a unique constraint on the LastName and FirstName fields, which you could use during the initial database design before data gets loaded into the table, in order to prevent entering duplicates into the database:

ALTER TABLE Customers
   ADD CONSTRAINT Customers_LastFirst
   UNIQUE (LastName, FirstName);
The duplicate key in the Customers table is the combination of LastName and FirstName. Let's group data by duplicate key and count rows within each group:
SELECT LastName, FirstName, COUNT(*)
   FROM Customers
   GROUP BY LastName, FirstName
   ORDER BY LastName, FirstName;
Listing 2 shows the output of the above code. Three rows in the output have a count greater than 1, which means there are three groups of duplicates.

Let's select them using the HAVING() clause, thus filtering out all "non-duplicate" data:

SELECT LastName, FirstName, COUNT(*)
   FROM Customers
   GROUP BY LastName, FirstName
   HAVING COUNT(*) > 1;
Listing 3 shows the output of the above code. However, these query results do not show the IDs that identify each row. Using the last query as a subquery inside an IN clause will do just that:
SELECT ID, LastName, FirstName
   FROM Customers
   WHERE (LastName, FirstName) IN
   (SELECT LastName, FirstName
       FROM Customers
       GROUP BY LastName, FirstName
       HAVING COUNT(*) > 1)
   ORDER BY LastName, FirstName;
Listing 4 shows the output of the above code. This query shows you that there are three groups of duplicates with ten rows total. We want to keep the first rows in each group with IDs 1005, 1009, and 1001; and delete seven rows with IDs 1012, 1017, 1010, 1011, 1016, 1019, and 1014.


Close Icon
Thanks for your registration, follow us on our social networks to keep up-to-date