Set-based solutions are always faster than loops in SQL, right? Usually that's true, but not always. A good example is producing a running total of sales values.
Let's say we have a table called Sale that has a SaleDate (DateTime) field that records the date and time a sale was made and a SaleValue field that records the total value of each sale. We want to produce a list of all sales, in order, along with a running total of their values.
A set-based solution would look something like this (for simplicity I'm assuming that no two sales can occur at exactly the same time):
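A sketch of that set-based solution, assuming the Sale table described above with a clustered primary key on SaleDate (the exact query may have differed in detail):

```sql
-- Running total via a self-join: each row is joined to itself and
-- every earlier row, and the earlier SaleValues are summed.
-- Assumes Sale(SaleDate DATETIME PRIMARY KEY CLUSTERED, SaleValue MONEY).
SELECT
    S1.SaleDate,
    S1.SaleValue,
    SUM(S2.SaleValue) AS RunningTotal
FROM Sale AS S1
INNER JOIN Sale AS S2
    ON S2.SaleDate <= S1.SaleDate
GROUP BY S1.SaleDate, S1.SaleValue
ORDER BY S1.SaleDate;
```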
This will perform pretty well over a reasonably small dataset but the performance will rapidly degrade as the number of records in the Sale table grows. The culprit is this join:
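In isolation, and assuming S1 and S2 are both aliases for the Sale table, the join in question looks like this:

```sql
-- S2 matches the current S1 row plus every row that precedes it.
INNER JOIN Sale AS S2
    ON S2.SaleDate <= S1.SaleDate
```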
This type of join is commonly referred to as a Triangular Join and you should notice two things about it:
- The table is joined back to itself
- The join is based on an inequality
For each row in S1, S2 will return the current row plus all the previous rows. It needs to do this because it is recalculating the running total from scratch for each row in the table. In other words, as the number of rows in the table increases, the number of rows in the resultant dataset grows quadratically.
The formula to calculate the number of rows this join will return is (n² + n)/2. For a table containing 10 rows that's 55 rows in the dataset. For a table containing 100 rows it's a dataset of 5,050. For a table containing 1,000 rows it's 500,500. That's a pretty scary rate of growth and it's not surprising that the DBMS starts to struggle.
So what's the solution to all this? There are two broad approaches that I'm aware of. One of which is formally supported, the other isn't.
The first, formally supported, solution is simply to use some form of loop. Here I've used a cursor but you could also use a while loop. Be aware that, when looping in this way, you risk either blocking issues or concurrency issues, depending on whether or not you lock the records in the set. For that reason, I recommend that you copy the data sideways into a temporary repository first. I've used a table variable but a temp table would also make sense.
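A sketch of the cursor approach, assuming the same Sale table; the data is copied into a table variable first, as described, so no locks are held on Sale while the loop runs:

```sql
-- Copy the data sideways into a table variable, then walk it with a
-- cursor, accumulating the running total as we go.
DECLARE @Sales TABLE
(
    SaleDate     DATETIME PRIMARY KEY,
    SaleValue    MONEY,
    RunningTotal MONEY
);

INSERT INTO @Sales (SaleDate, SaleValue)
SELECT SaleDate, SaleValue
FROM Sale;

DECLARE @SaleDate     DATETIME,
        @SaleValue    MONEY,
        @RunningTotal MONEY = 0;

DECLARE SaleCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT SaleDate, SaleValue
    FROM @Sales
    ORDER BY SaleDate;

OPEN SaleCursor;
FETCH NEXT FROM SaleCursor INTO @SaleDate, @SaleValue;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Maintain the running total rather than recalculating it each time.
    SET @RunningTotal = @RunningTotal + @SaleValue;

    UPDATE @Sales
    SET RunningTotal = @RunningTotal
    WHERE SaleDate = @SaleDate;

    FETCH NEXT FROM SaleCursor INTO @SaleDate, @SaleValue;
END;

CLOSE SaleCursor;
DEALLOCATE SaleCursor;

SELECT SaleDate, SaleValue, RunningTotal
FROM @Sales
ORDER BY SaleDate;
```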
You will find that, over a substantial number of rows, this will outperform the set-based approach by orders of magnitude. The reason is simple: it maintains a running total as it works through the rows in the table.
The second, not formally supported, solution is to use a quirky update. (I've no idea who first coined that phrase but it's got a great ring to it). I say that it's not formally supported because it relies on some undocumented behaviour in SQL Server. That said, the behaviour has worked consistently across every version of SQL Server I've ever worked with and continues to do so. You could argue that it appears to be informally supported. Debate rages over whether you should use it or not and I've no wish to take sides in that debate. I suspect it's an argument for which there is no truly correct answer. If you're tempted to use the technique, then I urge you to do some research beyond this article and make up your own mind.
Here's a quirky update example:
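A sketch of the pattern, again assuming the same Sale table; the data is copied into a temp table whose clustered primary key on SaleDate is what the technique relies on for ordering:

```sql
-- "Quirky update": relies on undocumented behaviour (rows updated in
-- clustered index order, and double assignment in a single SET).
CREATE TABLE #Sales
(
    SaleDate     DATETIME PRIMARY KEY CLUSTERED,
    SaleValue    MONEY,
    RunningTotal MONEY
);

INSERT INTO #Sales (SaleDate, SaleValue)
SELECT SaleDate, SaleValue
FROM Sale;

DECLARE @RunningTotal MONEY = 0;

-- For each row: RunningTotal = @RunningTotal + SaleValue is resolved
-- first, setting the column; @RunningTotal = RunningTotal is resolved
-- second, carrying the total forward to the next row.
UPDATE #Sales
SET @RunningTotal = RunningTotal = @RunningTotal + SaleValue;

SELECT SaleDate, SaleValue, RunningTotal
FROM #Sales
ORDER BY SaleDate;

DROP TABLE #Sales;
```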
Note the update works across the rows in the order in which they are returned, hence the clustered primary key on SaleDate to force the rows to be returned in the expected order. That's the first piece of undocumented behaviour because there's nothing to say that rows will be returned in that order – they just always have been so far.
Also note the unusual configuration of the SET clause. The latter assignment (RunningTotal = @RunningTotal + SaleValue) is resolved first, setting the appropriate value on the row in the table. The former assignment (@RunningTotal = RunningTotal) is resolved second, effectively updating our running total. This is the second piece of undocumented behaviour because (to my knowledge at least) there is nothing in SQL Server's documentation that allows this form of double assignment – but the engine sure allows it.
Of the three approaches, I have found the quirky update to be significantly faster than either a loop or a triangular join for tables of any significant size (and by significant I mean as few as 100 rows or more). The loop follows in a respectable second place and the join rapidly falls into a pretty dismal last place. I personally tend to favour a loop over a quirky update because the performance it gives me has always been "good enough" and I don't want to get kicked by some piece of behaviour I don't fully understand, but much smarter and more qualified minds than mine have advocated its use.
One closing thought: please do not take this article as a justification to litter your production code with a bunch of cursors and loops. Your senior DBA will not thank you for doing that. Situations where a looping approach outperforms a set-based approach are very rare and very specific. The only one that crops up consistently for me is producing running values, though I've no doubt there are others. You should generally take a set-based solution first, performance test it (at scale!) and start investigating loops only if you find your set-based approach, for some reason, fails to scale up.
About the Author
Declan Hillier has been developing business systems since 2001 and formed his own company, TopOut Software Ltd, in 2011. He doesn't have a blog but probably should and promises to start one just as soon as he thinks of enough interesting things to write about.