It is a natural tendency, and normally a very healthy attitude, to look for the best in any situation and to be optimistic about opportunities and life in general. When it comes to contingency planning for mission-critical systems, however, such an attitude can be deadly. I was reminded of this fact during the last few weeks by what occurred at several companies I know. I’d like to share these stories with you in the hope that you will learn from their mistakes rather than repeat them. Following that, I’ll discuss some basic steps for mitigating risk.
Disasters Waiting to Happen
There’s a telecommunications company I know of that uses SQL Server to process calls. All programmers log in to production as “sa” (the system administrator login, which has full rights to the server and to every database and object), even though I have repeatedly warned them against doing this. They trust their programmers and insist that they are not worried about fraud. They also feel that they cannot afford the overhead and time required to implement a more secure environment.
I think you know what’s coming. Recently, a programmer logged in to make a change to a database that authenticates phone-card security codes for a large region of this company’s business. While making his change, he accidentally dropped the main account table. From that moment, no customer in that region could complete a call. The DBAs were notified immediately and started restoring from their backups. Even so, it took them almost two days to bring the system back up.
Clearly, multiple levels of optimism were in effect here. The company believed that the programmers would never make a mistake, and the DBAs believed that they would never have to use the backups and so did not practice and brush up on the necessary steps to recover quickly from a failure.
In another instance, a small but growing Internet company suffered a disaster of a different type. This company had custom software written for them to run their order-fulfillment process. At some point, a disagreement arose among the senior executives who were running the company. Over the weekend, one executive entered the building and erased the order-fulfillment program from the system, keeping just one copy as a form of blackmail. Of course, had the company followed standard backup procedures by keeping offsite copies, this would not have been an issue. However, optimism, indifference, and plain ignorance reared their ugly heads once again. (The executive subsequently realized that he could be facing criminal penalties and returned the program.)
Here’s my point: mistakes happen, employees do go berserk, and, yes, hardware and software do fail. Rather than being optimistic that nothing will happen, I prefer to put the necessary effort into planning my contingencies so that I can be confident that if a disaster does occur, I can deal with it.
How do you go about planning for disasters? How do you know when you’ve crossed the border from proper preparation to true paranoia? Of course, risk management and mitigation is a large field, and certainly cannot be covered in one article. However, I’d like to suggest a simple tool that can help you start down the path. (Special thanks to David Blumenthal for his lessons in risk management.)
Whenever I begin to assess a situation, I like to use a simple matrix with four columns: Event, Effect, Probability, and Mitigation.
I start by first listing all the possible events that can occur without regard to the effect or the probability. Once I’ve done so, I go down the list and fill in the rest of the matrix:
- Effect: Will the event cause a minor, major, or catastrophic failure to the system? For how long will the impact be felt?
- Probability: Do these types of events happen often? What is the probability that such an event will occur?
- Mitigation: What could be done to eliminate or at least mitigate the effects of such an event?
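The matrix above lends itself naturally to a small data structure. Here is a minimal sketch in Python; the sample events, the 1–3 effect scale, the probability estimates, and the effect-times-probability ranking are my own illustrative assumptions, not part of any formal methodology:

```python
# A minimal sketch of the risk matrix as a list of rows.
# The events, scores, and the effect * probability ranking below
# are illustrative assumptions, not a formal methodology.

risk_matrix = [
    # (event, effect: 1=minor 2=major 3=catastrophic, probability, mitigation)
    ("Programmer drops a production table", 3, 0.10,
     "Restrict sa access; rehearse restores"),
    ("Disk failure on the database server", 2, 0.05,
     "RAID; verified nightly backups"),
    ("Building destroyed (fire, bombing)", 3, 0.001,
     "Offsite backups; alternate site"),
    ("Disgruntled employee deletes software", 3, 0.02,
     "Offsite copies; access controls"),
]

def exposure(effect, probability):
    """Crude ranking score: weight the effect by its likelihood."""
    return effect * probability

# Rank every event before deciding what to mitigate, so the whole
# list informs the decision rather than just the likeliest risks.
for event, eff, prob, fix in sorted(
        risk_matrix, key=lambda row: exposure(row[1], row[2]), reverse=True):
    print(f"{event}: exposure={exposure(eff, prob):.3f} -> {fix}")
```

Note that ranking by exposure alone would bury the low-probability, catastrophic events; that is exactly why the matrix keeps the Effect column visible on every row.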
Don’t decide which risks you will attempt to mitigate until you have answered all of these questions, because every event in the matrix needs to inform your decision. For example, I recall a report stating that a very large number of the businesses located in the World Trade Center were out of business a year after the building was bombed. Apparently, most of the affected companies had no plan for dealing with a catastrophe of that magnitude. They may have judged the risk of a terrorist bombing or some other catastrophe to be extremely low and so were unprepared when the unlikely event indeed occurred. Had they also weighed the effect of such a catastrophe on their business, they might have prepared an alternate site despite the low probability of ever using it.
In my next installment, I’ll delve into risk management in some more detail by looking at some of the common, and not so common, risks that SQL Server installations face.