n episode 217, Paul Randal from the Microsoft SQL Server team talked to us about CHECKDB, a built-in tool to check the integrity of SQL Server databases.
Hey, Paul on your blog recently you have a post about "How long does *your* CHECKDB take?" In which you asked people to send in their results to you. Did you get a lot of responses?
Paul Randal: I have had 10 to 20 responses so far.
Carl Franklin: Have there been, any surprises?
Paul Randal: Not that I have seen so far. No, I mean there's a lot variety in the size of databases, and lots of good-sized databases [of] half-a-terabyte or more, and in fact I did [something at] Boot Camp in fact with Kimberly [Tripp] a couple of weeks ago, where we had a bunch of internals from Microsoft partners and some hardware partners, and one of the guys from HP that was there offered to run a CHECKDB on their three terabytes data warehouse host they were using for TPC-H, benchmarking runs, then they got it to run in eight hours on three terabytes.
Richard Campbell: And it's funny to say, to offer to run that, that's really a gesture. It is a major thing to run a CHECKDB on a database that big.
Paul Randal: Right. I mean, if your hardware can't handle it, it's going to take a long, long time. So, I mean there's a ton of different factors. Do you want to go through the factors that say how long it takes?
Carl Franklin: Sure yeah.
Paul Randal: I mean these are some questions I get asked: How long this CHECKDB takes? How long is it going to take to run on my database? The answer I'd like to give to lots of questions is, "Hey it depends."
Richard Campbell: That's a correct answer to everything.
Paul Randal: Right, absolutely. I mean it's a good way of getting yourself off the hook, especially if you're busy. But anyway, the things that are going to affect how long CHECKDB takes to run-obviously the size of the database, that's a no-brainer, right? And then you have got to think if you've got any other load on the server; if you have any concurrent I/O load; if you have any concurrent CPU activity. CHECKDB is going to read every single page and allocated page in the database, so it's going to put a huge I/O load on the server.
Richard Campbell: And you are not just talking about I/O load from SQL Server, but if there is anything else running on that server.
Paul Randal: Oh absolutely, yeah. It's going to put a load on your I/O subsystem, so if there is anything else using that then you don't want to be doing a big CHECKDB at the same time. CPU activities, some of the algorithms that are in CHECKDB, are very CPU-intensive because they're reconciling facts from all different parts of the database. There's one algorithm, for instance, that checks for every row in a base table. Let's say you brought a clustered index, there's exactly one matching row in each non-clustered index and vice versa, and if you've used the option that'll allow that to run, it's going to run-that takes up 30% of the CPU that CHECKDB uses.
Carl Franklin: Wow!
Paul Randal: So, I mean one of the things I say to VLDB customers to get their runtimes down is to try running with the PHYSICAL_ONLY option, because that turns off all of these deeper logical checks, and makes the run time a lot faster. It turns it from a CPU band process into an I/O band process.
Richard Campbell: Just physically go and touch all my allocations.
Paul Randal: Yeah, look at every single allocated page, audit the page, if there are any page checksums on the page then check the page checksums, and that's a really good way of being able to check for any gross hardware errors.
Richard Campbell: Right, as opposed to the actual CRC work of making sure everything is consistent.
Paul Randal: Right. So, that's another couple of factors, then [it looks at] any concurrent update activity on the database, okay? So, for SQL Server 2000, well, the first thing that we have to do when we do a CHECKDB is we have to get a consistent user database. So, for SQL Server 2000 we did, it's called Log Analysis. So, at the end of reading through all the pages in the databases, we have to look and see what transactions happened during the time that we were reading all the pages, possible resource and inconsistencies during that time, and so what we have to do is, we have to internally run recovery on all that transaction log. So, if you put a whole bunch of update activity then you are going to generate tons of transaction logs, and so the internal recovery of all that transaction log is going to take a long time.
Richard Campbell: That's because you're doing transactions while CHECKDB is going on.
Paul Randal: Now, CHECKDB is online, so you can do transactions, but if you've got a really heavy transaction workload, then it is going to take a lot longer. So, then there's the throughput (ph) capabilities of the I/O subsystem-that one is obvious. We are going to read all these pages; we are going to generate tons of I/O. So, if your I/O subsystem can only do, I don't know, say, 10 megabits a second, it's just going to take a long, long time to do a 10 terabyte database.
Richard Campbell: Yeah, and if it's a problem, it's going to show up in disk queuing. It's very, very busy.
Paul Randal: Yeah, your disk-queue lengths are going to be really long. Number of CPUs on the box, and if you are running Enterprise Edition-an Enterprise Edition on 2000 and 2005 CHECKDB can run in parallel. So, the way we do that is we actually use the query processor internally to drive the actions that CHECKDB takes, so the query processor can decide how far to parallelize CHECKDB. So, if you have-and this is on the Enterprise Edition-so if you have say eight CPUs, and you got a moderate load, then the CPU may decide to split CHECKDB over four CPUs. So, if you have a multi-CPU box and it's not very heavily loaded, then you can get a much fast run time, because CHECKDB is going to parallelize, and it's parallelizing in terms of threads. It's also going to parallelize the I/O, so you'll get better I/O throughput too.
Richard Campbell: So, it's like you are working multiple tables simultaneously while you are doing this?
Carl Franklin: Can that be a problem though, running with multiple CPUs?
Paul Randal: Oh, absolutely it can. So we find that some application vendors, for instance SAP, they like to encourage their customers to turn off parallelism while they are running CHECKDB, so they get a more predictable runtime and more predictable load on the server, so there is a documentary trace like 2528-it's documented in the SQL Server Books Online for CHECKDB, that actually prevents CHECKDB ever going parallel.
Carl Franklin: Cool.
Richard Campbell: And this is only an issue with the Enterprise Edition anyway; any other edition is single-threaded, no matter what?
Paul Randal: Absolutely, yup.
Richard Campbell: Okay.
Paul Randal: So the next factor, let's say, the speed of the disk, where a TempDB is placed, so this one is unobvious. The way that CHECKDB works is, think about the way that you could check the consistency of a very large B-tree. The obvious brute force method is, you read a page, you see that there is a bunch of links to other pages so you follow the links, make sure they are correct, so on and so on and so on, so you are doing a kind of brute force that first check. Now think of the IO characteristics of doing that, you are going to take a whole bunch of random page reads, all over the place, so it's not very efficient.
Richard Campbell: You can go all over the drive?
|Editor's Note: This article was first published in the May/June 2007 issue of CoDe Magazine, and is reprinted here by permission.