In my first article on this subject, I discussed the need for a healthy dose of paranoia when planning how to protect your systems and information. I also demonstrated some basic steps for risk mitigation in general. In this article I apply some professional paranoia to one of the most important tasks a DBA faces: creating a proper backup and contingency plan. I also share some of the feedback I received from readers of the first article in the series.
As a DBA, you know that one of your primary responsibilities is planning backups. But creating a correct backup solution is not as simple as it sounds. There are many issues to be taken into account.
There are several steps to follow when setting up a comprehensive backup plan:
- Develop a healthy sense of paranoia when planning how to protect your systems and information (covered in Part I).
- Formulate tough questions about how much downtime and data loss are acceptable to your business.
- Ask the proper business people with authority to weigh the financial considerations of the answers they give.
As always, whenever you create a backup plan, there is a tradeoff between cost and effectiveness. A backup plan that utilizes drives from EMC mirrored over a fiber optic network will be more effective than a nightly backup to tape, but it will also cost more. Before you drill down to actual solutions, it is important to establish the parameters for your decisions. Here are the questions that you will need to answer.
- What, if any, downtime is acceptable? What does downtime cost? The answers to these questions vary widely, based on your business and your system. For example, if you run a human resources system that is used for reporting purposes, the acceptable downtime will be a lot greater than that for an e-commerce site taking orders 24 hours a day.
Depending on the nature of the system, there may also be portions of the day or business cycle where uptime is more important. For example, it may be OK to take the payroll system down for several hours during the week. But downtime will have much more impact if it falls on the day that payroll is supposed to be distributed. Analyze the workload of the system to determine whether your uptime needs vary. This will also allow you to better schedule any preventive maintenance and downtime, if it does need to occur.
Make sure to get answers to these questions in hard numbers. Answers such as “uptime is very important,” while perhaps true, will not help you plan. The actual numeric answers can come in different flavors. It may simply be a percentage such as “we need to be up 99.9999 percent of the time” (“six nines,” as it’s called in the industry). You could also obtain a measure of the number of incidents that can occur. In some cases, a one-time incident where the system is down for six hours is more manageable than several instances where you’re down for five minutes at a stretch, even if the total downtime is less than six hours.
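To make an uptime percentage concrete, it helps to convert it into a downtime budget. The following is a minimal sketch in Python, assuming a 365-day year and a system that is expected to be available around the clock; the function name `allowed_downtime_seconds` is made up for this illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, 24x7 service


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    # The unavailable fraction of the year is (100 - availability) / 100.
    return SECONDS_PER_YEAR * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
        budget = allowed_downtime_seconds(target)
        print(f"{target}% uptime allows about {budget:,.1f} seconds of downtime per year")
```

Under these assumptions, a 99.9999 percent target leaves roughly half a minute of downtime per year, while 99.9 percent leaves a little under nine hours. Showing the business a figure in minutes or hours is often more persuasive than a string of nines.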
You also need to determine the cost of any downtime in actual dollars. This will allow you to evaluate the cost of your backup solution in terms of the dollars saved by preventing that downtime. In some cases, the answers you receive may mean that you are not justified in obtaining that fancy hardware and backup solution you’ve had your eyes on. In other instances, it will give you the necessary ammunition to justify your case.
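The comparison between the cost of downtime and the cost of a backup solution is simple arithmetic. Here is a rough sketch in Python; the function names and every dollar figure and hour count in the example are hypothetical placeholders, not numbers from any real system.

```python
def annual_downtime_cost(hours_down_per_year: float, cost_per_hour: float) -> float:
    """Dollar cost of the expected downtime in one year."""
    return hours_down_per_year * cost_per_hour


def net_annual_benefit(
    hours_down_without: float,
    hours_down_with: float,
    cost_per_hour: float,
    solution_cost_per_year: float,
) -> float:
    """Downtime dollars avoided by the solution, minus what the solution costs."""
    avoided = annual_downtime_cost(hours_down_without - hours_down_with, cost_per_hour)
    return avoided - solution_cost_per_year


if __name__ == "__main__":
    # Hypothetical inputs: 20 hours of downtime a year today, 2 hours with the
    # new solution, $5,000 lost per hour of downtime, $60,000 a year to run it.
    benefit = net_annual_benefit(20, 2, 5_000, 60_000)
    verdict = "justified" if benefit > 0 else "not justified"
    print(f"Net annual benefit: ${benefit:,.0f} ({verdict} on cost alone)")
```

A positive result supports the purchase; a negative one suggests the fancy hardware costs more than the downtime it prevents. The estimate is only as good as the hourly cost figure the business gives you.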
- How fast does the recovery process need to be in case a crash does occur? This question is closely related to the amount of uptime you need to provide, but should still be explored separately. The answer may not be a one-size-fits-all solution but may depend on the type of crash you experience. For example, in case of an earthquake or explosion, both the business and your customers may understand and be willing to wait several days until they can access the system. However, for something as “simple” as a disk crash, you may be expected to be back up within ten minutes. Of course, in some cases, such as a company that needs 99.9999 percent uptime, you may receive the mandate to ensure that recovery is almost instantaneous, regardless of the situation.
- What, if any, data loss is acceptable? Just as uptime is important, you also need to examine the implications of losing any of the information in your system to a system crash or human error. The less you can afford to lose, the more rigorous (and potentially more costly) your plan will be. Depending on the use of your system, your answers will vary widely. If your system supports huge fund transfers to and from the Federal Reserve Bank, you probably can’t afford to lose even one transaction. However, in other cases, there may be a paper trail that the business can use to recapture activity for some period of time. Or the business may make a conscious decision to accept some data loss willingly. For example, one telecommunications company I used to work for made a strategic decision that if a system crash occurred, it was more important to keep the calls going through than to worry about billing for the calls. Therefore, they built into their system a way to bypass authentication and recording of the calls (done through SQL Server) if necessary. They willingly lost data in order to keep their customers happy.
- Whom should you ask these questions? Make no mistake: if a severe problem occurs that affects your business, your every action will be scrutinized. Therefore, ensure that the people who have the actual authority to make decisions answer the questions above. It is best if people from the business side are involved in the decision process, rather than leaving the decisions to someone in the IT department alone. Of course, once decisions have been made, make sure to document them in your plan. This is not simply to cover you, but also to provide a basis for understanding the plan and reevaluating it when changes occur to your business.
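One way to document the answers to these questions is to record them per failure scenario, together with who signed off on them. The sketch below is only an illustration in Python: the class and field names are invented, the recovery times echo the examples above, and the data-loss limits and approver names are placeholder assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryRequirement:
    scenario: str                 # type of failure the answer applies to
    max_recovery_time: timedelta  # how quickly the system must be back up
    max_data_loss: timedelta      # how much recent activity may be lost
    approved_by: str              # business owner who made the decision


# Placeholder values for illustration only.
requirements = [
    RecoveryRequirement("disk crash", timedelta(minutes=10), timedelta(0), "business owner (placeholder)"),
    RecoveryRequirement("earthquake or explosion", timedelta(days=3), timedelta(hours=24), "business owner (placeholder)"),
]


def violations(req: RecoveryRequirement, actual_recovery: timedelta, actual_loss: timedelta) -> list:
    """Compare a recovery drill's results against the documented requirement."""
    problems = []
    if actual_recovery > req.max_recovery_time:
        problems.append(f"{req.scenario}: recovery took {actual_recovery}, limit is {req.max_recovery_time}")
    if actual_loss > req.max_data_loss:
        problems.append(f"{req.scenario}: lost {actual_loss} of data, limit is {req.max_data_loss}")
    return problems


if __name__ == "__main__":
    # Hypothetical drill result: 30 minutes to recover, one minute of data lost.
    for problem in violations(requirements[0], timedelta(minutes=30), timedelta(minutes=1)):
        print(problem)
```

Keeping the approver next to each number makes it easier to revisit the decision with the right person when the business changes.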
Questions such as acceptable downtime, loss of data, etc. can be disturbing to business users. You don’t want your questions to imply that you are building an unreliable and shaky system. You should therefore consider carefully how you broach these questions.
Much depends on the sophistication of the user. With one user, a simple comment that I had found a bug in SQL Server 4.2 led to over an hour’s discussion on bugs in software. Apparently this user had never realized that there could be any bugs in commercial software products!
You should also be prepared to explain the implications of the answers the user gives. That way, you know how to handle someone who gives you an unreasonable answer, such as “Absolutely no downtime at all.” Rather than getting into an argument over the reasonableness of such a demand, simply explain the additional cost of building a system that meets it.
In a future 10-Minute Solution, I’ll show you how to utilize the answers to the questions discussed here to choose among the various options for backups provided by SQL Server.
The Quiet Computer Room
Finally, I’d like to share a story provided to me by Mike Frey in response to my previous article on this topic.
The Chicago branch of this bank had most of its trading computers in one room. This room was accessible to a limited number of employees and was also separated from the main office by a door that only employees could open. Of course, the person who delivered the printer paper needed access to the computer room to stock the paper. This person, while not technical, was trained on what should and should not be done in the computer room. This person also worked primarily at another site, stocking their computer room, and had not spent much time in the trading computer room.
The other site had a very modern computer room. If your hands were full, you could press a large red button next to the door to open it ‘hands free.’ The trading computer room also had a large red button by the door. It was the emergency power button. We always wondered why no one had taken the time to put a sign on it indicating its purpose. But, of course, everyone knew what it was for, right?
Needless to say, the paper stocker tried to open the trading computer room’s door by hitting the red button. It was a very strange experience to walk by that computer room when all the air conditioning blowers were off.
Bad enough. It also seemed that during all the disaster recovery planning, no one had actually tried to shut down all the trading room computers and restart them. All the effort was focused on a severe disaster that prevented access to the building. It took almost seven hours for the administrators to get all the trading systems up and running again. Naturally, this happened early in the morning.