Validate Your Refactored SQL Queries with a New SQL Server 2005 Operator

s a developer, how often have you been asked to refactor a SQL statement to increase performance, accommodate changes to database design, support a new database, or for a whole host of other reasons? Often you need to do the refactoring without affecting the results. When you refactor a query that returns the correct result set, adhering to the Hippocratic principle of “First, do no harm” will ensure you maintain quality. In this case, the principle as applied is “Make sure you return the same results.”

The question then arises as to how you prove that the fabulous new SQL you’ve written returns an identical resultset when your query may return thousands of records. You can count the records, but after that you are into sampling or having QA test the data, which takes time and may have insufficient coverage.

This article shows how you can use the new EXCEPT operator in SQL Server 2005 to prove that your rewritten SQL is returning the same results. It takes you through a simple example, but you can apply the same principle to a query of any complexity.

When a Simple Row Count Is Not Enough
First, let’s create a simple scenario that demonstrates the technique: Six rows are inserted into TableA, and the key field on TableA (id) may or may not appear in the refid column on TableB. Here’s the SQL:

CREATE TABLE TableA (id INT IDENTITY NOT NULL PRIMARY KEY, textfield char(20))goCREATE TABLE TableB (refid int NOT NULL FOREIGN KEY REFERENCES TableA(id), moretext char(40))goINSERT INTO TableA VALUES ('Row one')INSERT INTO TableA VALUES ('Row two')INSERT INTO TableA VALUES ('Row three')INSERT INTO TableA VALUES ('Row four')INSERT INTO TableA VALUES ('Row five')INSERT INTO TableA VALUES ('Row six')INSERT INTO TableB SELECT id, RTRIM(textfield) + ' - Reference' FROM TableA WHERE id in (1, 2, 4, 6)

Four of the rows in TableA are referenced in TableB. The SQL for a query that returns only rows from TableA that are referenced in TableB is as follows:

SELECT * FROM TableAWHERE EXISTS (SELECT * FROM TableB WHERE TableA.id = TableB.refid)

Here are the results:

id          textfield----------- --------------------1           Row one             2           Row two             4           Row four            6           Row six             (4 row(s) affected)

Now suppose a new development comes along that requires you to use joins as a discipline instead of WHERE EXISTS. You could change the query to use a left outer join like this:

SELECT TableA.* FROM TableALEFT OUTER JOIN TableB ON TableA.id = TableB.refid

Here are the results:

id          textfield----------- --------------------1           Row one             2           Row two             3           Row three           4           Row four            5           Row five            6           Row six             (6 row(s) affected)

This SQL returns six rows, which instantly indicates a problem: The simple row count mechanism does not tell you which extra rows have been returned. In this small sample, you can easily see which data is extra, but if the original and refactored queries returned thousands of rows, it would be much more difficult.

Identifying the extra rows is an important step in determining what is wrong with your refactored query, but a simple row count alone is not enough. The following section demonstrates how to use the new EXCEPT operator in SQL Server 2005 in addition to a row count to check the output. You will still need to use the number of rows returned as a validation mechanism because EXCEPT eliminates all duplicated rows from the results.

Meet the EXCEPT Operator
The EXCEPT operator is a SQL language enhancement introduced in SQL Server 2005. Like the UNION operator, EXCEPT combines two SELECT statements into a single results set. The same rules about number of columns and compatible data types apply equally to UNION and EXCEPT. EXCEPT takes the output from the first (or left-hand side?LHS) SELECT and compares it with the second (or right-hand side?RHS) SELECT and returns distinct rows from the LHS SELECT that are not present in the RHS SELECT. Think of this processing as similar to the way that a UNION ignores duplicate rows.

When I first read about the EXCEPT operator, I thought it was cool but I could not actually find a valid reason to use it in production code. In most cases, you can obtain the same results using joins and WHERE clauses. However, EXCEPT becomes very useful when validating a refactored query.

Validate Your Refactored Query
In this example, you use EXCEPT to verify that your refactored query returns the same results, by checking whether the query using EXCEPT returns no rows. If it does, you will know that the queries return the same results:

SELECT * FROM TableAWHERE EXISTS (SELECT * FROM TableB WHERE TableA.id = TableB.refid)EXCEPTSELECT TableA.* FROM TableALEFT OUTER JOIN TableB ON TableA.id = TableB.refid

Here are the results:

id          textfield----------- --------------------(0 row(s) affected)

Great, this returns no rows. That must mean the SQL queries are returning the same results, right? Wrong! EXCEPT works uni-directionally, so it will display only rows from the LHS query that are not in the RHS query. If the RHS query is returning rows that are not in the LHS query, EXCEPT won’t return these rows. Therefore, you can test fully only by running the query reversed as well:

SELECT TableA.* FROM TableALEFT OUTER JOIN TableB ON TableA.id = TableB.refidEXCEPTSELECT * FROM TableAWHERE EXISTS (SELECT * FROM TableB WHERE TableA.id = TableB.refid)

Here are the results:

id          textfield----------- --------------------3           Row three           5           Row five            (2 row(s) affected)

This query produces a list of results indicating that when you refactored the query, you changed the results. It also tells you which extra rows are being returned.

Refactor and Validate Again
Looking again at how you have rewritten the original query, you will notice that the introduction of the left outer join changed the query behavior. In order to retain the same query behavior as the original, you should have used an inner join:

SELECT TableA.* FROM TableAINNER JOIN TableB ON TableA.id = TableB.refid

Here are the results:

id          textfield----------- --------------------1           Row one             2           Row two             4           Row four            6           Row six             (4 row(s) affected)

You can now use the EXCEPT clause once again to validate your results:

SELECT * FROM TableAWHERE EXISTS (SELECT * FROM TableB WHERE TableA.id = TableB.refid)EXCEPTSELECT TableA.* FROM TableAINNER JOIN TableB ON TableA.id = TableB.refid

Here are the results:

id          textfield----------- --------------------(0 row(s) affected)

Remember, you have to reverse the query to confirm that your SQL results are the same:

SELECT TableA.* FROM TableAINNER JOIN TableB ON TableA.id = TableB.refidEXCEPTSELECT * FROM TableAWHERE EXISTS (SELECT * FROM TableB WHERE TableA.id = TableB.refid)

Here are the results:

id          textfield----------- --------------------(0 row(s) affected)

You have now proved that the refactored query does return the same results.

Refactoring with No Harm Done
The row-by-row comparison provided by EXCEPT, combined with the number of rows being returned remaining consistent, provides compelling evidence that your refactored query is correct. In DB2, the EXCEPT operator is extended using the EXCEPT ALL syntax, which returns rows where the row is different and if more rows are on the LHS query than on the RHS query.

The technique shown in this article can save time and effort and provide you with certainty that your refactoring exercise has done no harm. If Hippocrates had been a software professional, he would most certainly have approved.

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

Recent Articles: