Tip 6: The Baker's Dozen SpotlightSubqueries and Derived Tables
Requirement: A construction company has a table of construction jobs, and two tables for hours worked per job and materials purchased per job. You need to produce a basic result that lists each job, the total hours worked per job, and the total material costs per job. Some jobs have labor hours but no costs, and a few jobs have material costs but no labor hours.
On the surface, this may seem like a fairly easy requirement. However, the situation poses an interesting problem, and also opens a discussion on query methodologies.
Some developers may approach the situation by using a pair of OUTER JOIN
statements, like so:
SELECT J.JobID, SUM(H.Hours) AS TotHours,
SUM(C.Costs) AS TotCosts
FROM JobMast J
LEFT JOIN JobHours H ON H.JobID = J.JobID
LEFT JOIN JobCosts C ON C.JobID = J.JobID
GROUP BY J.JobID
Unfortunately, this query doesn't
produce correct results. Because of the aggregation involved (a single job could have two labor rows and three cost rows), the results for hours and costs will be very high (i.e., double-counted, triple-counted, etc).
Fortunately, the SQL-92 standard lets you solve this problem by implementing subqueries that return a scalar value in the SELECT
statement. The following query produces reliable results.
(SELECT SUM(H.Hours) FROM JobHours H
WHERE H.JobID = H.JobID) AS TotHours,
(SELECT SUM(C.Costs) FROM JobCosts C
WHERE C.JobID = H.JobID) AS TotCosts
FROM JobMast J
GROUP BY J.JobID
The above code represents a correlated subquery. A correlated subquery depends on the results of the outer query. As such, it cannot be executed independently.
The outer query executes the subquery for every record in the outer query. Some view this practice as inefficient, and elect to use derived tables
SELECT J.JobID, H.TotHours, C.TotCost
FROM JobMast J
LEFT JOIN (SELECT JobID, SUM(Hours) AS TotHours
GROUP BY JobID) H
ON H.JobID = J.JobID
LEFT JOIN (SELECT JobID, SUM(Costs) AS TotCosts
GROUP BY JobID) C
ON C.JobID = J.JobID
In the above query, the two subqueries in parenthesis produce derived tables. Derived tables are essentially temporary, ad hoc views that only last for the duration of the query. The contents of the derived table can be referenced by the outer query, just like any other table. (Note that you must always reference a derived table with an alias.) You can also create a derived table from a subquery in a FROM
clause. You should primarily use derived tables when you need a temporary result set for just one query.
Everyone knows the old joke about getting ten economists in a room and getting ten different views of economics. Although that's somewhat of an exaggeration, you could ask ten different developers which approach is preferable, and you'd likely see some votes for the correlated subquery approach because of readability, some votes for the derived table approach on the basis of better potential performance, and no doubt a few votes for still another alternative.
Developers of similar strength and experience can and will disagree on the emphasis of performance (or the promise of performance) versus maintenance and readability. Sometimes there is no one single correct answer. The key is to apply consistency in approaching these types of situations.
Before I move on, let me add a final piece to the requirement. The result set I constructed contained a summary of hours for each job. Instead of a summary of hours, you may want to sum hours multiplied by an hourly employee rate to produce total labor dollars.
As an additional complication, the rates may change throughout a job (an employee may work half a job at one rate, and a second job at a higher rate). Listing 7
expands the query above to demonstrate how to match up hours for a work date against the valid rate for that date. The additional logic finds the employee rate with the highest (MAX) effective date that is less than or equal to the specific date worked. Either query above can use the additional subquery.
Tip 7: Table Variables vs. Temporary Tables
The listings for Tips 1 and 3 introduced table variables as a means of storing temporary result sets for repeated uses in the stored procedure. In those examples, the results of the table-valued UDF (that worked with the list of account keys) were directly joined against the Orders Header table. If you wanted to use the table containing the account keys for another join, you could use the following:
DECLARE @tAcctKeys TABLE (IntKey int)
INSERT INTO @tAcctKeys
SELECT IntKey = IntKey FROM
-- now you can use @tAcctKeys in any subsequent
-- joins in the stored procedure
According to Microsoft documentation, table variables provide some of the same benefits of temporary tables, with the additional benefit of slightly improved performance. Table variables have the same general scope as any other variable: they are automatically cleared at the end of a procedure or function. SQL Server recompiles stored procedures less often when you use table variables as opposed to temporary tables. Additionally, transactions that involve table variables last only for the duration of an update on the table variable. As a result, table variables require less locking and logging resources. Because table variables have limited scope and are not part of the persistent database, transaction rollbacks do not affect them.
Although table variables may perform slightly better, there is a slight misconception that they perform better than temporary tables because SQL Server stores the contents in memory instead of the tempdb system database. According to Microsoft, a table variable is not a memory-only structure. A table variable could hold more data than can fit in memory; therefore, it has to have a place on disk to store data. Table variables are created in the tempdb
database similar to temporary tables. If memory is available, both table variables and temporary tables are created and processed while in memory (data cache).
Even with this fact, developers generally prefer table variables over temporary tables for the reasons described above. However, table variables carry some notable limitations. You cannot TRUNCATE a table variable, nor can you modify the structure of a table variable after you've created it. Additionally, you cannot use either of the following statements with table variables:
-- neither line will work with table variables
INSERT @tTable EXEC <sp_mystoredproc>
SELECT * INTO @tTable FROM <mytable>