Login | Register   
LinkedIn
Google+
Twitter
RSS Feed
Download our iPhone app
TODAY'S HEADLINES  |   ARTICLE ARCHIVE  |   FORUMS  |   TIP BANK
Browse DevX
Sign up for e-mail newsletters from DevX


advertisement
 

Finding and Eliminating Duplicate Data-2 : Page 2


advertisement
What Is Duplicate Data?
By "duplicate data" I mean rows in a table that contain identical information in a field, or combination of fields, that's supposed to be unique. For example, it could be the Social Security Number field, or Last Name and First Name fields. We call this a duplicate key. Most tables in a database nowadays have a primary key constraint, which maintains a unique value for each row. From a database point of view each row is unique; however, from a user point of view these rows are duplicates because they contain identical duplicate key (First Name + Last Name) values, even though their IDs differ:

ID Last Name First Name City Phone ---- --------------- ---------- --------------- ---------- 1005 Krieger Jeff San Ramon 9252997100 1012 Krieger Jeff San Ramon 9252997100 1017 Krieger Jeff San Ramon 9252997100

How is duplicate data commonly created? Usually there are two processes that may lead to this situation:
  • Loading and merging data from different sources.

  • Entering data in the system via a graphical user interface, where the system generates a unique number for each row and assigns it as a primary key.
In both cases, unique constraint is missing, which opens the gate for duplicate data.

How to Find Duplicate Data
Let's create a table called Customers and populate it with data that intentionally contains some duplicates (see Listing 1). As you can see, Listing 1 does not contain any code that prevents entering duplicates. The code below creates a unique constraint on the LastName and FirstName fields, which you could use during the initial database design before data gets loaded into the table, in order to prevent entering duplicates into the database:



ALTER TABLE Customers ADD CONSTRAINT Customers_LastFirst UNIQUE (LastName, FirstName);

The duplicate key in the Customers table is the combination of LastName and FirstName. Let's group data by duplicate key and count rows within each group:

SELECT LastName, FirstName, COUNT(*) FROM Customers GROUP BY LastName, FirstName ORDER BY LastName, FirstName;

Listing 2 shows the output of the above code. Three rows in the output have a count greater than 1, which means there are three groups of duplicates.

Let's select them using the HAVING() clause, thus filtering out all "non-duplicate" data:

SELECT LastName, FirstName, COUNT(*) FROM Customers GROUP BY LastName, FirstName HAVING COUNT(*) > 1;

Listing 3 shows the output of the above code. However, these query results do not show the IDs that identify each row. Using the last query as a subquery inside an IN clause will do just that:

SELECT ID, LastName, FirstName FROM Customers WHERE (LastName, FirstName) IN (SELECT LastName, FirstName FROM Customers GROUP BY LastName, FirstName HAVING COUNT(*) > 1) ORDER BY LastName, FirstName;

Listing 4 shows the output of the above code. This query shows you that there are three groups of duplicates with ten rows total. We want to keep the first rows in each group with IDs 1005, 1009, and 1001; and delete seven rows with IDs 1012, 1017, 1010, 1011, 1016, 1019, and 1014.


Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

Sitemap
Thanks for your registration, follow us on our social networks to keep up-to-date